Sunday, April 02, 2006
XML, RDF, Ontologies and Bioinformatic Data Mining
This entry relates to an excellent article written by Dr. Eric Neumann and recently published in Bio-IT World. I recommend you read this very brief article before jumping into my response.
RDF — The Web’s Missing Link (By Dr. Eric K. Neumann - March 14, 2006)
So as not to launch into these arguments without any context, I'll give a brief summary of both Eric Neumann's thesis and a follow-up comment from Neil Custer.
Eric claims RDF provides a flexible, light-weight means for confederating semantically-rich information about resources critical to biomedical informatics data processing - e.g., genes, proteins, biochemical pathways and their relationships to physiology and disease. He also asserts RDF-related formalisms provide very significant advantages for implementing this task over XML alone (and certainly over HTML/XHTML) in their ability to:
- Capture the semantic richness of scientific information, mainly by providing a much more flexible means of representing links or relations between elements;
- Support rapid evolution of these RDF repositories and the tools created to mine them as both the nature of our scientific knowledge and our needs for processing that knowledge evolve;
- Do so in a way that gets around the inherent limitations and constraints imposed when one tries to start with top-down data standards or ontology construction efforts.
My comments follow.
RDF is an XML-based standard
Let's all just admit we agree on this point. XML revolutionized our ability to create precise, formal representations of data qua data - whether the data be describing a complex means for inter-relating data structures or Joe Smith's bank statement at Fidelity Mutual Bank in Albany, NY for March of 2006. XML can do this for documents residing on the web or elsewhere.
RDF is an XML-based specification. See the footnote below§ for my view of how and why XML & RDF evolved.
Why do we need another W3C standard beyond XML?
I realize this is an over-simplification of the question posed by Neil Custer, but I believe it captures the sentiment, if not the letter, of his polemical comment.
I think a portion of the answer derives from the many XML-based standards Neil himself cited - XML-Schema, XPath, XLink, XPointer, etc.
Each of these standards meets a specific goal beyond the basic XML guarantees and provides critical functionality required to support the larger goal of manipulating data in a network setting via largely algorithmic means. XML itself and these other low-level standards built on top of it primarily provide a means to represent document element structures and interconnect those structures amongst a collection of documents possibly dispersed across the network. They can even provide relatively sophisticated features such as querying the contents of a document for the presence of specific data within the structure (XQL) and mapping one structure to another (XSL).
RDF itself provides additional functionality not supported explicitly by these foundational XML standards. RDF also uses several of these standards in order to hide some of the complexity of meeting its goal of providing a means to link resource elements together. In other words, yes, as Neil points out, we have all these XML foundational standards, and one can build from them a means to link semantically inter-related information resources. The point of the W3C having developed RDF as a standard is to provide the required base functionality so others who need to construct a semantically-linked web of information need not wrestle with the complexity of the foundational XML standards to create the functionality encapsulated in RDF.
Another important aspect of an XML structural description is the XML standard requires the structure be organized in strictly hierarchical tree graphs. This constraint helps to simplify creating software to build and parse XML structures but can also be unduly restrictive when it comes to describing real-world entities. Hierarchical graphs can very effectively represent certain types of human classification schemes - e.g., Cladistic organismal taxonomy or Java Class hierarchies - but these abstractions often run into difficulty when they are used to map real-world objects.
The operative word here is structure. There is no sense of meaning except as might be implied in the structure. Anyone who's written the simplest program has learned the machines we use to manipulate our data know nothing of our implied intentions or meaning. They only can act on those things made explicit in the data or the code written to process the data. Most data processing relates back to meaning in some way, so if we intend for the processing of this data to be largely automated, we must have a way of explicitly representing meaning. This is particularly true in the bioinformatics realm, where the manipulations performed on large-scale data sets of > 1x10^6 elements can only be evaluated based on those aspects of both the data and the processing that are made explicit. For instance, when trying to compare the activity patterns in Functional MRI (fMRI) brain scans, one must know a lot of detailed information - detail about the subject (both intrinsic and extrinsic or environmental information), about the MRI device, about the location of the brain relative to a canonical coordinate space, etc. - before an effective comparison can be made. I challenge anyone to create an algorithmic means of performing that task without capturing the meaning of the data structures and their relations either as a part of the data representation scheme or embedded in the algorithms.
You also quickly learn that representing as much as you can about the data and its meaningful context right there in the resource containing that data greatly relieves the burden on the programmer charged with processing the data in that context. If the data structure itself contains explicit information about meaning and links to rules for enforcing the meaning, then that meaning need not be reconstructed by the algorithms. Not only does this reduce the amount of effort required to process the data in a meaningful way, it avoids the Tower of Babel problem arising when the programmers charged with building such semantic reconstitution algorithms choose disparate means to do so and reconstruct slightly different meanings for the same data sets. XML can be used to create self-describing document structures. RDF can be used to add self-describing meaning to the inter-related data contained in a document, a task all the other XML base standards - XML-Schema, XPointer, XSLT, etc. - cannot do alone.
RDF is all about representing meaning. The core RDF structure - the triplet - provides a formal means of linking objects in a meaningful way. It represents those relations in the form of triplet statements declaring one object as a property of another object. The Resource in the name refers to the subject of the triplet, and each triplet provides a little tidbit of the meaningful description for that subject. The reference to a resource object is made via a URI (Uniform Resource Identifier), making it possible to deterministically locate the resource wherever it may reside. Here is a simple example using the more human readable N3-based syntax for RDF, as opposed to the RDF XML formalism (they are functionally equivalent and programs exist to deterministically translate between the two formalisms). The triplets are given as subject-->verb-->object relations. N3 provides a convenient means for listing multiple triplets about a single subject. You simply list that subject followed by a semicolon-separated list of all the verb-->object pairs:
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
# Placeholder namespace for the local names in this example.
@prefix : <http://example.org/demo#> .
:Person a rdfs:Class.
:Writer a rdfs:Class; rdfs:subClassOf :Person.
:eric a :Writer.
:Document a rdfs:Class.
:BioItArticle a rdfs:Class; rdfs:subClassOf :Document.
:rdf_missingLink a :BioItArticle.
# The subject of :author is a document and its object is a writer.
:author a rdf:Property; rdfs:domain :Document; rdfs:range :Writer.
:rdf_missingLink :author :eric.
The several compound object names you see above - e.g., rdf:Property - merely include a short-hand for specifying that the description of an object resides in another document (or namespace). For example, rdf:Property explicitly indicates the full description of the class being invoked here is given in another namespace - in this case the RDF specification. This short-hand is filled out using namespace definitions given at the top of the document using the @prefix keyword - e.g., @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> . specifies a URI for accessing the detailed object descriptions contained in the RDF specification and declares the short-hand rdf: as the tag to indicate the object is defined in that URI. The bare : prefix is bound here to a placeholder namespace chosen only for this example.
This seems like a lot to go through just to make the simple statement, "The Bio-IT article 'RDF--The Web's missing link' was authored by Eric." The primary advantages of proceeding this way, extending beyond what XML (and its associated base standards) can provide, are:
- Meaning - not simply structure - is explicitly defined and nothing outside a given document or the additional namespace URIs given in that document need be taken into account to interpret that meaning within the scope intended by the original author.
- One might suggest XML alone has a means to indicate the required relations, but this is not the case. In XML, all relations are given as strictly hierarchical part-of relations - e.g., "A title is a part of a document". This restriction limits the meaning one can encode in an XML document to part-of structural relations. In RDF, there is no limit to the number and nature of the relations used to define inter-relatedness. The only restriction is the relations need be defined within the document or one of the URIs pointed to in a namespace definition included in the document. Ultimately very complex graphs of related information can be assembled from RDF triplets without requiring they fit to a strictly hierarchical structure. This is much more in keeping with the variety and complexity of object inter-relatedness found in the real world in general and in biomedical research in particular.
- RDF at its core is about linking minimal data elements to other minimal data elements. It is a bottom-up approach for defining a semantically-specified data universe. XML from the outset - as was the case for SGML before it (see footnote) - was about defining document structure and is inherently a top down view of a collection of data. It is true one can create a document type containing just a single element to specify an atomized view, but this misses the point. XML is document centric. RDF is resource centric, where a resource can be as simple or as complex as your data processing needs require it to be.
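To make the resource-centric view a little more concrete, here is a minimal Python sketch that holds the example statements as plain subject/predicate/object tuples and follows rdfs:subClassOf links to recover class membership. The tuple store and the `classes_of` helper are illustrative assumptions, not part of any RDF toolkit; the only fact borrowed from N3 is that `a` is short-hand for rdf:type.

```python
# Triplets stored as plain (subject, predicate, object) tuples.
# The names mirror the N3 example; the store itself is only an illustration.
TRIPLES = {
    (":Writer", "rdfs:subClassOf", ":Person"),
    (":BioItArticle", "rdfs:subClassOf", ":Document"),
    (":eric", "rdf:type", ":Writer"),
    (":rdf_missingLink", "rdf:type", ":BioItArticle"),
    (":rdf_missingLink", ":author", ":eric"),
}


def classes_of(resource, triples):
    """Return every class the resource belongs to, following rdfs:subClassOf."""
    found = set()
    # Start from the classes asserted directly via rdf:type.
    frontier = [o for s, p, o in triples if s == resource and p == "rdf:type"]
    while frontier:
        cls = frontier.pop()
        if cls in found:
            continue
        found.add(cls)
        # Climb to any declared superclasses.
        frontier.extend(
            o for s, p, o in triples if s == cls and p == "rdfs:subClassOf"
        )
    return found


if __name__ == "__main__":
    print(sorted(classes_of(":eric", TRIPLES)))
```

Nothing in the store is nested inside anything else: the :author statement sits beside the class statements rather than under a document root, which is the sense in which the graph is not limited to part-of relations.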
Eric makes another point specifically relevant to performing this task in bioinformatics (or any field where complex, evolving definitions are at the core of what you are trying to manipulate): "...the evolutionary nature of science results in changing views and representations...Must the parsers be updated each time there is a new innovation in the science? How should we “link in” new data, annotations, and external references?"
It is critical to recognize XML alone, by its nature, is a closed and somewhat fragile (Eric uses the word "brittle") system. XML-Schema can be used to define very complex, rich document layouts, so long as they hold to a strict, hierarchical structure. It can use XLink and XPath to derive some of the details about the document structure from distal locations either within or outside the document itself. In the end, however, you are left with a hard-coded document description. No element within that document can be interpreted without recognizing its structural location within the document. This means code written to process or inter-relate elements in XML documents must be altered if the XML document descriptions are altered. RDF, on the other hand, is focused on elemental resources. The trail of related elements defined via RDF and its associated specifications can be as varied as it needs to be in its structure and semantic content. Most importantly, if this connected graph of resources changes, old code written to parse elements in the graph will still function without modification.
If your logic needs to process the specific meaning uniquely captured in the new relations, you will obviously need to alter it to do so, but the semantic relatedness of prior elements you are manipulating will still be valid and should still be parsable. Even more importantly where biomedical informatics is concerned, the elements being described in an RDF document can be inter-related and re-used in arbitrary ways by other researchers, without those researchers needing to cope with the complexity and idiosyncrasies of the overall document containing the resource(s) they want to reference. As Eric states, "...what if I want to add another observation (i.e., property+object) such as <is a target for treating> <colon cancer> in a way that does not disturb the data for others? If it’s clinically relevant now, but I need to wait two years for a committee to recommend it, then a lot of lives are taking a back seat to brittle technologies." The importance of this fact cannot be over-stressed both in terms of the upfront cost of building new semantically-driven data processing applications, as well as the long term maintenance cost of constructing semantic mining software around RDF, as opposed to around XML-Schema alone.
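A small sketch of that claim, under the assumption that the graph is held as a set of subject/predicate/object tuples: a query written before the new observation exists returns the same answer after it is added. All resource and relation names here (`:geneX`, `:proteinX`, `:isTargetForTreating`, and so on) are hypothetical stand-ins for Eric's example.

```python
def authors_of(document, triples):
    """'Old' code: it only knows about the :author relation."""
    return sorted(o for s, p, o in triples if s == document and p == ":author")


# The graph as first published (hypothetical resources).
graph = {
    (":rdf_missingLink", ":author", ":eric"),
    (":geneX", ":encodes", ":proteinX"),
}
before = authors_of(":rdf_missingLink", graph)

# Another researcher later adds observations using relations
# the old code has never heard of.
graph.add((":proteinX", ":isTargetForTreating", ":colonCancer"))
graph.add((":rdf_missingLink", ":discusses", ":proteinX"))
after = authors_of(":rdf_missingLink", graph)

if __name__ == "__main__":
    # The old query is unaffected by the additions.
    print(before == after)
```

The old function simply ignores statements it does not understand. Code that wants to use the new `:isTargetForTreating` relation still has to be written, which matches the caveat above.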
If we have RDF, why do we need ontologies?
Here, too, I'm slightly misrepresenting Eric's actual comment. He doesn't claim there is no need for ontologies, only that constructing ontologies from the bottom up via RDF triples may be a better way to proceed than trying to do so in a canonical, top-down manner.
On this issue, I believe I would part ways a bit with Eric. I would agree the bottom up construction of a semantic web can be one way - a very far reaching and effective way - in which new universals are found which one will want to then incorporate into ontologies. I do not believe, however, this should be the primary method driving ontology construction.
I would argue without both the bottom up and the top down approach to semantic mapping of biomedical data (raw, reduced, analyzed and the research literature), we will not be able to capture the labyrinthine interactions complemented by subtle, real-world detail critical to understanding complex phenotype and disease. Lacking that degree of granularity and sophistication, the knowledge maps and resulting knowledge mining tools will not be able to support the needs of research scientists working at the bleeding edge of their sub-discipline and practicing clinicians dealing with their own form of bleeding. There will be much low-hanging fruit available to those who employ any one of the approaches listed below. MEDLINE/PubMed and the other scientific literature databases in use for the last 30 years, however, have demonstrated how effective a mixture of these techniques can be in mining the scientific literature-base, however imperfect this solution can sometimes be. The meteoric rise of the use of The Gene Ontology in the last 8 years to manage and mine bio-molecular data has shown how a similar combined approach can be effective for managing massive repositories of primary scientific data. Curators of the Gene Ontology have also shown that, by gradually evolving to make better use of all the techniques listed below, the practical limits on the knowledge mining a system supports can be greatly extended.
As I see it, there are essentially five ways in which knowledge gets mapped in the electronic domain, though most of these techniques greatly pre-date the first working Turing Machines.
- The Scientific Method + Statistical Inference
- This is far and away the major technique used by most practicing scientists to distill knowledge from the observations they make of the real-world. Starting in the early 17th century, the work of Sir Francis Bacon (1561 - 1626), René Descartes (1596 - 1650), Galileo Galilei (1564 - 1642), and others systematized the use of inductive reasoning methods, creating what eventually came to be referred to as The Scientific Method, documented in Bacon's seminal work, Novum Organum (1620). The general idea was that the deductive method first systematized by Aristotle, by which new theorems were constructed purely by analyzing the structure of axiomatic statements, had limited power to describe the real world. Deduction worked well for mathematicians and the emerging group of mathematical logicians, but the expanding group of Enlightenment-driven observers of the real-world (those we'd now-a-days call scientists) found making statements about real-world observations was difficult to impossible to do in the form of formal axioms. They needed to either wait until sufficient information about the real-world accumulated to make this possible, or find a new method. Luckily for us scientists, they chose the latter path. The goal was still to accumulate solid foundations on which to reason, but they now chose to use the bottom up inductive method of starting with real-world observations, as opposed to the top down deductive approach.
Emerging approximately a century later and extending into the mid-19th century, the work of Jacques Bernoulli (1654 - 1705), Siméon Denis Poisson (1781 - 1840), Johann Carl Friedrich Gauss (1777 - 1855), and others helped to add an additional layer of mathematical probability theory and technique to this process, establishing statistical inferencing as the quantitative foundation for The Scientific Method. To this day, the overwhelming majority of knowledge representation, whether it be in basic biomedical research, economics, or financial planning, is at its roots applying this general approach.
- The Scientific Literature, Information Retrieval, & NLP
- In scientific research, the results of this inferential reasoning are mostly captured via the expert-based scientific writing process and represented in free text in the body of Scientific, Technical, and Medical (STM) publications. For over a century now, this literature-base has been accreting knowledge. The bulk of represented knowledge has been both generated by and digested by human scientific experts.
Information retrieval (IR) systems arose in the late 1960s as the literature databases (e.g., Index Medicus [precursor to MEDLINE], the Biological Abstracts, etc.) began their conversion to electronic form. IR systems extract all the relevant words from a corpus of documents and build a matrix of word occurrence. In the case of the early biomedical literature IR systems - except for the Biological Abstracts which, obviously, included abstracts - the only text available was the title, author names, journal names, and sometimes a set of keywords. A matrix was then constructed where one dimension contained the words in alphabetical order and the other dimension contained the document locations (in this case, simply the location in the title or abstract). As one might expect given the highly tuned, expert lexicon that develops in various fields of science, the resulting matrix is quite sparse - i.e., except for a few very common terms such as cell, membrane, organ, human, etc., the majority of unique words occur very infrequently. The real triumphs of IR researchers were: they developed a means to search this sparsely populated space very efficiently; they were able to use various techniques (Boolean-driven query engines being one) that helped to search these matrices effectively and deliver results back to the human user with relatively high precision (low false positive ratio) and recall (low false negative ratio). These systems are still very far from perfect - the precision & recall can still vary widely depending both on the IR system and the corpus indexed - but they have developed very sophisticated means of reconstituting some of the meaning and context that is lost when words are ripped from documents and placed in meaningless, alphanumerical order.
These semantic extraction technologies (i.e., Natural Language Processing [NLP] techniques) developed over the last 25 years by computer science researchers in computational linguistics and expert systems AI research have gotten very good at extracting knowledge from the mostly unstructured literature-base. As this electronic literature-base increasingly includes the entire text from articles, the effectiveness of these automated semantic extraction tools will greatly improve. In the last 10 years or so, the goals for application of NLP to the scientific literature-base have begun to get more ambitious. They do not intend to replace the human expert in her role as interpreter of the knowledge in her field, but they have moved to automatically extract sufficient amounts of knowledge so that more large scale, effective knowledge integration can be achieved.
The integration would not only target bringing together research from disparate fields in ways a typical expert would not be able to do with the limited scope of journals they have time to read. It would also extend the scope of knowledge extraction back in time. For instance, a typical neuroscientist will learn early on that Otto Loewi and Henry Dale were the first to demonstrate neurochemical transmission by identifying the primary neurotransmitter in parasympathetic nerve endings (Vagusstoff, now known as acetylcholine) and the primary neurotransmitter at sympathetic nerve endings (adrenaline). The typical neuroscientist (possibly even the typical neurochemist) will likely know little more about the other work done by Loewi, nor will they likely know anything about the collective of researchers all working on this issue back in the first quarter of the 20th century. Some may argue The Scientific Method itself enforces a Darwinian-like survival of the fittest on theories and demonstrated facts. Though this is true, I would argue it's primarily the limits of human intellect and memory that lead to the constant culling of older publications. Bibliometric studies have shown over and over that, except for the highly cited seminal articles (mostly related to novel experimental methods), the typical biomedical research article is read very infrequently 5 years after it's published. Given the ever-increasing pace of scientific research over the past 20 years, the half-life of a typical article continues to fall. It's certainly not the case that all of the scientific findings lost in the historical record were disproved. Most were just incomplete and of limited value due to the limited state of knowledge at the time they were published. Publishers and archivists have begun digitizing the historical literature-base.
Hopefully, we'll see disputes over intellectual property settled so that the bulk of this can ultimately provide long-term value to society and end up in The ScienceCommons, where everyone - NLP researchers seeking to mine it, as well as the body of researchers who just want to read it - will have free and open access. The journal Science, for instance, is now available in electronic form back to its first issue published in 1880 - though it's mostly unparsable TIFF files at the moment.
There is an equally hidden map of knowledge embedded in review articles such as those from The Annual Review series. When analyzed in the context of all other literature, reviews would likely be a significant aid in defining the core clusters in the knowledge space, as has been demonstrated in a limited way during the NSF-funded Digital Library Initiative research in the late 1990s, as well as by the clustering techniques used by Page & Brin to build the Google knowledge maps.
- Spilling outside of pure library and information science, there has been a great deal of research in bioinformatic information/text retrieval. A focus for this activity is the Genomics track of the Text REtrieval Conference sponsored by the Information Technology Lab of the U.S. National Institute of Standards and Technology (NIST) and Advanced Research and Development Activity in Information Technology (ARDA), a joint effort of the U.S. Intelligence Community and Department of Defense. There are several prominent systems in use - e.g., ArrowSmith, Textpresso, etc. Specifically within neuroscience, there are a handful of very significant literature informatics research projects - e.g., NeuroScholar, CoCoMac, BrainMap.org, Brain Architecture Management System (BAMS). All of these projects have as their goal to provide neuroscientists a means to distill knowledge from a large segment of the neuroscience literature. Of those listed above, most focus on the structural and functional organization of the brain, but use a variety of text analysis techniques, not necessarily including a significant NLP component.
The evolving efficacy of Knowledge Extraction (KE) NLP technology will likely soon (I believe in less than 10 years) make it possible for the literature-base to function as NCBI's GenBank does for genomic sequence. New elements added to the repository will immediately be scanned for lexical (and semantic) correlations against all of the pre-existing records. This not only greatly extends the effective semantic correlations with the current literature a given new article can expect to have, it also brings all the older literature back to life, the same way a BLAST query against a new nucleotide sequence can bring up records on sequences submitted to the database 20+ years ago. For this to happen, in addition to the legal and societal issues that need be resolved, there must be a complete overhaul of the literature-base backend. It will need to be much more uniform and structured and make better use of the 4 techniques below in addition to semantic web standards and tools - but that's the stuff of another blog entry. :-)
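The word-occurrence matrix and Boolean querying described for the early IR systems can be sketched in a few lines of Python. This is a toy under stated assumptions: the three titles are invented, the matrix is stored sparsely as a word-to-documents map (an inverted index), and the relevance judgments used for precision and recall would in practice come from human experts.

```python
import re
from collections import defaultdict

# Toy corpus: document id -> title text (invented for illustration).
DOCS = {
    1: "Acetylcholine release at parasympathetic nerve endings",
    2: "Membrane potential of the squid giant axon",
    3: "Acetylcholine receptors in the cell membrane",
}


def build_index(docs):
    """Map each lower-cased word to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in re.findall(r"[a-z]+", text.lower()):
            index[word].add(doc_id)
    return index


def boolean_and(index, *terms):
    """Return the documents that contain every query term."""
    sets = [index.get(term.lower(), set()) for term in terms]
    return set.intersection(*sets) if sets else set()


def precision_recall(retrieved, relevant):
    """Precision = hits / retrieved; recall = hits / relevant."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


if __name__ == "__main__":
    index = build_index(DOCS)
    print(boolean_and(index, "acetylcholine", "membrane"))
    # Suppose an expert judged only document 3 relevant to the query.
    print(precision_recall(boolean_and(index, "acetylcholine"), {3}))
```

Storing only the words that actually occur is what keeps the sparse matrix cheap to search; the meaning lost by reducing each title to a bag of words is exactly what the later NLP techniques try to recover.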
- Citation Maps
These consist of authors linking their information to other related information. The first systematic compilation of citation maps in modern information science goes back to the 19th century with the creation by Frank Shepard of Shepard's Citations, an index of legal precedents still in active use today. In 1961, Dr. Eugene Garfield applied it to the life science literature as a part of an NIH-funded research project creating the Genetics Citation Index. He later extended this to the broader domain of all the scientific literature, founding in 1964 the Institute for Scientific Information with its flagship product, the Science Citation Index, or SCI. SCI indexed a large enough body of scientific publications to create a rich knowledge map useful to everyday scientists. Remarkably, a minimum of computing technology was used to initially compile the Genetics Citation Index - an excellent example of how the "right" informatics algorithm can require minimal technology. Key to this effort, the knowledge map was pre-built by domain experts (article authors) and required only the proper techniques to be harvested.
- Google, of course, managed to edge ahead of the standard, lexically-based info retrieval search engines such as Lycos by mining a citation index - the implicit index in the href links embedded by web page authors. Page & Brin developed a means to cull these indexes, perform cluster analysis (very much like the network and graph theory analysis performed on the SCI 30 years prior), and use proprietary scoring algorithms to discern what the semantic modes were across the collection of all public web pages. Lexical indexes were collated with these clusters to allow users to continue using the Boolean text queries they were accustomed to, except the top hits in the results were now the primary citations in these Google semantic clusters.
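As a rough illustration of how a citation map can be harvested for ranking, the sketch below scores items by their incoming links using a basic PageRank-style power iteration. It is a stand-in built on the published PageRank idea and is not the proprietary scoring mentioned above; the link data, the damping factor of 0.85, and the iteration count are all assumptions made for the example.

```python
def rank_by_links(links, damping=0.85, iterations=50):
    """Score pages by incoming links with a basic PageRank-style iteration."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives a small baseline score.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its score among the pages it cites.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page citing nothing spreads its score evenly.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical citation map: three articles cite the same review.
LINKS = {
    "a": ["review"],
    "b": ["review"],
    "c": ["review", "a"],
    "review": [],
}

if __name__ == "__main__":
    scores = rank_by_links(LINKS)
    for page in sorted(scores, key=scores.get, reverse=True):
        print(page, round(scores[page], 3))
```

As with the Genetics Citation Index, the expensive part (deciding what is related to what) was already done by the authors who wrote the citations; the code only harvests it.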
- Usage Trail Indexes
There is another form of knowledge map popular in the commercial web publishing arena. Several years back, Brewster Kahle, the digital library innovator and inventor of the first Internet-wide search engine WAIS (the killer app for Thinking Machines' Connection Machine architecture), founded the not-for-profit Internet Archive. Its subsidiary Alexa Internet, in addition to providing access to the archive, constructed a usage map of the web based on the click traces from hundreds of thousands of users of their Alexa browser plug-in. From this map of user behavior, Alexa built a semantic directory of the web capable of effectively identifying related groups of web sites. Soon after this, usage maps got wider exposure, as Amazon and other web-based booksellers introduced related books listings based on user browsing and purchasing behavior. Amazon also purchased Alexa and began marketing it to other web portals, most notably the then still independent Netscape portal. This led to significant controversy over the privacy concerns raised by tracking every web site a user clicked to. Oddly enough, Netscape disappeared as an independent company, time passed, and now all the major sites - including Google - are tracking click-trails on users.
- Though current privacy and intellectual property concerns make it impractical, it would be immensely valuable to chart the usage map for biomedical scientists reading articles in the STM literature-base. Averaged across all readers in particular scientific domains, this map would likely reveal semantic correlations amongst articles not easily revealed through other techniques.
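A usage map of this kind can be approximated by counting how often two items turn up in the same reading session. The sketch below is only an illustration of that idea: the session data and article ids are invented, and real systems would weight and normalize the counts rather than use them raw.

```python
from collections import Counter
from itertools import combinations

# Hypothetical reading sessions: each list is one reader's click trail.
SESSIONS = [
    ["art1", "art2", "art3"],
    ["art1", "art2"],
    ["art2", "art4"],
]


def co_usage(sessions):
    """Count how many sessions contain each unordered pair of items."""
    counts = Counter()
    for session in sessions:
        # De-duplicate and sort so each pair has one canonical key.
        for a, b in combinations(sorted(set(session)), 2):
            counts[(a, b)] += 1
    return counts


def related_to(item, counts, top=3):
    """Return the items most often used alongside the given item."""
    scores = Counter()
    for (a, b), n in counts.items():
        if a == item:
            scores[b] += n
        elif b == item:
            scores[a] += n
    return [other for other, _ in scores.most_common(top)]


if __name__ == "__main__":
    counts = co_usage(SESSIONS)
    print(related_to("art1", counts))
```

No one has to declare that two articles are related; the relation falls out of reader behavior, which is why the approach raises the privacy concerns described above.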
- Directories/Lexical Indexes
The original Yahoo Directory descends from the age-old sub-field of cataloging/serials management in library science. It is directly related to the lexical taxonomies or terminologies developed for organizing/searching the academic literature. These efforts focused on carving the lexicon into hierarchically organized domains - e.g., in the biomedical domain resources such as the National Library of Medicine (NLM) Medical Subject Headings (MeSH), The PsychInfo Thesaurus of Terms from the APA, BIOSIS Controlled vocabulary for the Biosciences, SNOMED, NeuroNames, NLM NCBI's Organism Taxonomy, etc. The NLM's Unified Medical Language System (UMLS) is the apotheosis of this effort and is built on top of a minimal ontology to help systematize semantic categorization (The UMLS Semantic Network). For us neuroscientists involved in large-scale informatics projects focussed on data integration, systems interoperability, and field-wide knowledge mining, NeuroNames, created by Dr. Doug Bowden and his colleagues within the Washington National Primate Research Center located at the University of Washington's School of Medicine, has gradually evolved as the Lingua Franca of neuroanatomical terminology, inasmuch as such a beast can be wrangled from the somewhat less than compatible brains found across the range of mammalian species under study.
Ontological descriptions of the semantic universe are as old as Aristotle. Modern ontological practice derives from the convergence of major threads of research in mathematics, logic, philosophy, and linguistics. Fundamental work in mathematical logic by Gottfried Wilhelm von Leibniz (1646 - 1716), George Boole (1815 - 1854), and Friedrich Ludwig Gottlob Frege (1848 - 1925) laid the groundwork for creating precise and computable logical assertions about objects in the real world. In philosophy, Aristotelian practice was revived in the wake of Immanuel Kant's (1724 - 1804) subjective move into our heads, work performed by the ontological/phenomenological philosophers dating back to Baron Christian von Wolff (1679 - 1754) in the early 18th century and moving through the 19th century with Edmund Husserl (1859 - 1938). In the 20th century, this work dovetailed with efforts in linguistics, artificial intelligence, and robotics - all fields that required, to one extent or another, formal techniques for representing information about the real world. Ontological frameworks for representing and computing on information about the real-world have infiltrated the biomedical informatics domain in a big way over the past 10 years. Most prominent among them has been The Gene Ontology (GO), but there are many others, some really enhanced lexical indices, while others follow formal ontological best practices, such as the Foundational Model of Anatomy.
- Ontologies
There are several important distinctions to remember about ontologies:
- Ontologies encode universals: Ontologies are designed to capture universal, characteristic elements of real world objects. These are described via acute and thorough observations of objects and their behavior in the real world, but the universals themselves don't truly exist. Ontologies provide a highly efficient and computable framework around which to organize information about the real world, but they do not themselves contain data about real world objects. There is no dog in the real world, only Johnny's cocker spaniel with the right ear that always droops. Data describing the specifics about Johnny's cocker spaniel would not live in an ontology, because it does not represent a universal. However, in order to determine how accurate and complete that data is, one would profit from relating that data to the ontological description of the universal dog.
- Ontological components exist in specific partitioning schemes: endurants vs. occurrents; foundational, generic, domain and application ontologies. Though the elements in these separate partitions must inter-relate, one must ensure the partitions themselves remain distinct; one must avoid pre-coordination within domain ontologies.
- Ontological interoperability relies on shared relations: a critical generic ontology for promoting ontology and knowledge map integration and interoperability is one defining a rich collection of specific relation classes to be re-used by all.
I would strongly recommend examining the following references for more detail on how these principles can be effectively applied in the biomedical domain. This work is by the formal ontological philosopher Dr. Barry Smith and his colleagues at The Institute for Formal Ontology and Medical Information Science (IFOMIS), the University at Buffalo, The Gene Ontology Consortium, The National Center for Biomedical Ontology (cBiO), and The Foundational Model of Anatomy. I had begun to add my summary of the highlights of this work to this blog entry, but my browser crashed, which is best for all of us, since the descriptions provided by Dr. Smith and his colleagues are more complete and correct than I could ever hope to be here:
- The Basic Formal Ontology (BFO)
- The Biodynamic Ontology
- The Ontology of Biomedical Reality
- The OBO Relations Ontology
Above I mention supporting real substantive work. This point cannot be over-emphasized. Biomedical scientists - and informaticists, in particular - must get real work done NOW (really yesterday). There is a constant push-pull between those developing standards and promoting interoperability and those out there in the field who need to produce substantive research and tools, publish, and keep grants coming into their labs. The hopeful note I'd chime now is we are at a point where the standards and tools supporting them (such as Protégé, OBO-Edit, Semantic bioMOBY, etc.) - however imperfect - are sufficiently powerful to get real work done now. Given most of the researchers and developers on these projects are associated with the W3C Semantic Web Health Care and Life Science Interest Group, these resources are likely to improve and evolve with the standards. The recent creation of an NIH NCBC focused on bioontology development and use - The National Center For Biomedical Ontology mentioned above - will also be a huge help to us all, especially since a major goal for all NCBCs as given in the BISTI Recommendations is for them to pursue education and dissemination of biomedical computational technology.
I'd make one final point regarding these various techniques for knowledge representation. It has been my experience - regardless of the technology you use to implement such systems (e.g., RDBMS, XML, RDF, etc.) - it is very important to keep a clear partition between the instance data (bio-molecular sequence records, neuroimaging repositories, scientific literature, etc.), the ontologies, the lexicons, and the knowledge maps (the linkage maps between instance data and ontologies). Obviously, the overall knowledge representation system needs to link across these various domains. The knowledge map component links instance data to ontological elements. The ontology links to the lexicon. Often NLP tools can help fill out the links between instance data and the ontology by passing through the lexicon. The formal structures and algorithmic tools required to optimally construct, maintain, expand, and mine these major components, however, tend to be quite different. The task of building such systems is easier, I believe, if you maintain this distinction in both the design and implementation of the overall knowledge representation system.
Clearly, the growing acceptance and development of semantic formalisms for the web and tools to manipulate repositories built using these formalisms will be a good thing for all biomedical researchers. I fully expect the daily practice of biomedical research will have been transformed via the use of these techniques within 5 - 10 years, and clinical medicine within 15 - 20 years.
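To make the partitioning concrete, here is a minimal sketch in Python of the four components kept as separate structures, with links running only through identifiers. Every name, identifier, and record in it is hypothetical and chosen purely for illustration, and the substring match stands in for a real NLP step; it is not drawn from any actual ontology or tool.

```python
from dataclasses import dataclass
from typing import Optional

# All identifiers and records below are hypothetical, for illustration only.

@dataclass(frozen=True)
class OntologyClass:
    """A universal such as 'dog'; it holds no data about any particular dog."""
    class_id: str
    label: str
    parent_id: Optional[str] = None

@dataclass(frozen=True)
class InstanceRecord:
    """A particular: one record in an instance data store."""
    record_id: str
    description: str

# Partition 1: the ontology (universals only).
ontology = {
    "ONT:0001": OntologyClass("ONT:0001", "mammal"),
    "ONT:0002": OntologyClass("ONT:0002", "dog", parent_id="ONT:0001"),
}

# Partition 2: the lexicon (terms and synonyms that point at ontology classes).
lexicon = {"mammal": "ONT:0001", "dog": "ONT:0002", "canine": "ONT:0002"}

# Partition 3: the instance data.
instances = {
    "REC:42": InstanceRecord(
        "REC:42", "Johnny's cocker spaniel, a canine whose right ear droops"
    ),
}

# Partition 4: the knowledge map (record id -> set of ontology class ids).
knowledge_map = {}

def link_via_lexicon(record):
    """Naive stand-in for NLP: find lexicon terms in the record's text."""
    text = record.description.lower()
    return {class_id for term, class_id in lexicon.items() if term in text}

for record in instances.values():
    knowledge_map[record.record_id] = link_via_lexicon(record)

def ancestors(class_id):
    """Follow parent links upward and return the ancestor class ids."""
    chain = []
    current = ontology[class_id].parent_id
    while current is not None:
        chain.append(current)
        current = ontology[current].parent_id
    return chain
```

Because the record never stores its own classification, the lexicon or the ontology can be revised and the knowledge map rebuilt without touching the instance data.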
Thanks again to Eric Neumann for inaugurating this regular column in Bio-IT.
§The Evolution of Structured Document Data Standards
My sense is from the outset Tim Berners-Lee, original developer of the HTTP protocol and founding & current director of the World Wide Web Consortium (W3C) hosted at the MIT Comp. Sci./AI Labs (CSAIL), knew it was far from optimal to convolve together into a single specification - HTML - the representation of information/data with a description of how that information was to be presented. At the time, SGML was by far the most advanced standard for representing publishable documents. Given the complex relation between SGML and the DTD mechanism for defining a document type (or schema), I think he was right in deciding to create the HTML data+presentation specification as an SGML DTD, as opposed to home-brewing his own format and throwing the baby out with the bath water because of the complexity of SGML. This is true even though SGML itself made some attempt to encourage people to keep data separate from presentation. We mustn't forget, even by the mid-90's SGML (whose name reminds us it was intended to be both standard AND general - two seemingly contradictory technical specifications) was so complex few tools for editing and publishing SGML-based content were available. Those that were could be very complex to use and were all very expensive due to the inordinate complexity of building such applications. The general aspect of the standard required authoring/publishing systems to possess a great deal of programmatic flexibility. This was despite - or maybe because of - the fact SGML's evolution as a standard dated back all the way to the original work on formal document structure performed by Charles Goldfarb and colleagues at IBM starting back in the late 1960s.
It's my opinion consolidating data+presentation into a single specification made the process of creating both HTML document servers (HTTP servers) and HTML document viewers (web browsers) considerably easier, though a bit more error prone. Somewhat ironically, in the end, the wide-spread adoption of HTML and the eventual meteoric rise of HTTP/HTML as the primary means of moving information out to human users on the Internet was to a large extent made possible by having a single spec - HTML - which both encapsulated data and the presentation of the data and DID NOT require SGML/DTD-based validation to function.
Starting in the mid-90's, as more and more programmers developed algorithmic tools to generate, parse, sort, and analyze web content, it became clear data representation needed to be separated from the data presentation, hence the emergence of Cascading Style Sheets (CSS). The CSS style specification formalism provided a means to put the presentation details in a separate document from the data and make specification of style more re-usable and consistent across a web site. Unfortunately, given the wide-spread web server and client tool set already in existence at the time, it wasn't practical to require all style info be moved to CSS, so in the end CSS just made the process of creating web content a little more complicated without achieving the intended goal of providing a clean view of the data for algorithms to manipulate.
CSS does help very significantly with the issue of re-using customized, complex style information. The human readability and human or machine editability of an HTML document, however, was compromised. For a vivid example of this fact - one that is in widespread use throughout the web because of the ubiquity of MS Word - try this simple experiment. Create one of the simplest Word documents conceivable - a blank page except for the words "Hello, World!" - and use the Word command Save As Web Page, both as whole document and as display only. Create the same simple document in Mozilla/SeaMonkey Composer. You will find essentially no difference when viewing the documents in a standard web browser such as Firefox. Open the three documents in a text editor. You will find both Word documents filled with CSS formal style specifications and various XML-based metadata, none of which is of much use except for possibly the LastAuthor and Revision metadata fields. Now count the number of text characters in each page. Here's what I found:
|Document|Text characters|
|Word - whole document|2230|
|Word - display only|1026|
Which do you think is easiest to write HTML parsing code for?
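As a rough illustration of what that parsing code has to cope with, the sketch below pulls the visible text out of two pages using Python's standard `html.parser` module. The two sample strings are hypothetical stand-ins written for this example, not the actual output of Word or Composer, and the `VisibleText` class is just one simple way to skip embedded style information.

```python
from html.parser import HTMLParser

class VisibleText(HTMLParser):
    """Collect text that is not inside <style> or <script> elements."""

    SKIP = {"style", "script"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text found outside the skipped elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())

def visible_text(html):
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)

# Hypothetical samples: a lean page and a page padded with embedded styling.
plain = "<html><body><p>Hello, World!</p></body></html>"
styled = (
    "<html><head><style>p.Normal { margin: 0in; font-size: 12pt; }</style>"
    "</head><body><p class='Normal'>"
    "<span style='font-size:12pt'>Hello, World!</span></p></body></html>"
)

print(visible_text(plain), len(plain))
print(visible_text(styled), len(styled))
```

Both pages yield the same text, but the parser needs explicit logic to step around the style block, and that is exactly the kind of overhead the character counts above hint at.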
On the page generation side, many mechanisms arose to dynamically generate HTML content from alternative data stores, as opposed to requiring all HTML content be statically created as fixed documents. The early Common Gateway Interface (CGI) made it possible to functionally extend an HTTP server and tie it to other data processing systems written in Perl, C/C++, Python, etc. So-called "active" page generation systems built on top of existing database interfaces (e.g., ODBC, JDBC, DBI scripting interfaces) were devised in order to turn out HTML content on-the-fly. These mechanisms are still very much alive today in the form of PHP, Java Servlets, .NET, and most recently the Ruby on Rails ActiveRecord model framework.
There are a few very important points to keep in mind when looking at CSS & active page generation mechanisms:
- One of the most important aspects of CSS is that, in helping to separate HTML data content from HTML styling, it makes web content more directly parsable by computer programs. A programmer no longer needed to spend so much time creating error-prone logic to separate data from presentation information. CSS was not a complete solution to this problem - more elaborate XML-based mechanisms have evolved since - but it at least started the process.
- Active page technologies provide a means to combine data streams from an external store (e.g., a database) with styling to create a valid HTML stream - the only object a web browser is designed to accommodate.
- These two mechanisms essentially represent inverse processes enabling HTML content to be an element in a larger web of information - "active" page technology gets data into the HTML information space, and CSS (along with the XML-based mechanisms used to separate data from presentation) helps in moving data out of the HTML information space.
- Though these mechanisms are functional inverses of each other, they are not themselves reversible. Given only a CSS and a formally specified collection of data, you cannot create an HTML page. You need to map elements in the data set to CSS style elements, specifying, for instance, which elements from the data are the title of the HTML document, what elements are in the body, etc. The same is true for active page technology. It is not possible to reverse the PHP statements used to run SQL SELECT queries on a database to derive the content for an HTML page. You can create form elements that can be used to generate SQL INSERT or SQL UPDATE statements, but that's not the same thing. In other words, you could not automatically extract the data from a dynamically generated page and create an index of the content of that page which you would place in a data store, as a typical search engine needs to do. This is why dynamically generated pages do not get into Google's indexes.
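The non-reversibility described in the last point can be seen in a few lines of Python. This is only a sketch under stated assumptions: the record, its field names, and the `render` function are hypothetical, and a plain dictionary stands in for a database row.

```python
from html import escape

# Hypothetical record standing in for one row returned by a database query.
record = {
    "gene_symbol": "BDNF",
    "summary": "Brain-derived neurotrophic factor",
}

# The mapping is the extra ingredient: neither the data nor a style sheet
# says which field becomes the page title and which becomes the body.
mapping = {"title": "gene_symbol", "body": "summary"}

def render(record, mapping):
    """Combine data and an explicit field-to-slot mapping into HTML."""
    title = escape(record[mapping["title"]])
    body = escape(record[mapping["body"]])
    return (
        f"<html><head><title>{title}</title></head>"
        f"<body><h1>{title}</h1><p>{body}</p></body></html>"
    )

page = render(record, mapping)
print(page)

# The generated page carries the values but not the field names or the
# mapping, so the original record cannot be rebuilt from the page alone.
print("gene_symbol" in page)
```

Going the other way means guessing that the title element once held a field called `gene_symbol`, which is precisely the information the page generation step discards.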
This last point has very serious practical implications for the amount and complexity of code one must use to automatically generate content for and parse content from the HTML information space. It is this very point that has led to the need for a more transparent way, not only to separate data content from presentation information, but also to create a "reversible" formalism that is a better "impedance match", if you will, between the HTML information space and the other digital information spaces in common use. Since the inception of the W3C, however, TBL has been working to right these wrongs. XML was the first big move in that direction and helped to improve both the precision and the extensibility of formal document representation, ridding itself of the nearly 30 years of baggage accumulated around SGML that had made SGML prohibitively expensive to rely on for this purpose.
The W3C Semantic Web activity developed RDF on top of XML with the hope of providing not only a formal mechanism to represent information content uncontaminated by presentation styles, but also an efficient formalism capable of capturing the semantic context for that data and moving this information automatically into and out of the HTML information space. This semantic component really derives from the Artificial Intelligence concept of "frames" - the idea that in order to create machine parsable representations of knowledge (not just data), data must be presented in a context that helps to define the scope and specific, intended meaning of the data. This goes way beyond what XML itself is intended to provide.
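For readers who have not worked with RDF, its core idea of expressing knowledge as subject-predicate-object statements can be sketched with nothing more than Python tuples. The sketch deliberately avoids any real RDF library or serialization; the `http://example.org/` namespace, the predicate names, and the statements themselves are hypothetical, made up only to show the shape of the data.

```python
# Hypothetical namespace; none of these URIs refer to a real vocabulary.
EX = "http://example.org/"

# Each statement is a (subject, predicate, object) triple.
triples = {
    (EX + "gene/BDNF", EX + "encodes", EX + "protein/BDNF"),
    (EX + "protein/BDNF", EX + "participatesIn",
     EX + "pathway/neurotrophin-signaling"),
    (EX + "gene/BDNF", EX + "label", "brain-derived neurotrophic factor"),
}

def match(s=None, p=None, o=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return sorted(
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )

def pathways_for_gene(gene):
    """Follow gene -> protein -> pathway links across separate statements."""
    found = []
    for _, _, protein in match(s=gene, p=EX + "encodes"):
        for _, _, pathway in match(s=protein, p=EX + "participatesIn"):
            found.append(pathway)
    return found

print(pathways_for_gene(EX + "gene/BDNF"))
```

A new kind of relation is just one more triple in the set, with no document schema to revise first, which is the flexibility in representing links that the article claims for RDF over XML alone.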