Main issues, notes and speculations.
Proposal:
One idea for how to approach it:
XML editors, check standards/versions.
XML search tools, plugins for browsers?
Ontology editors that can handle UMLS: Protégé ?
Automatic Tools for Mapping Between Ontologies, DARPA Agent Markup Language (DAML)
The Defense Advanced Research Projects Agency (DARPA)
WebKB
Medical Concept Mapper, Customizable and Ontology-Enhanced Medical Information Retrieval Interface
NLP support:
Cyc-NLP for general language understanding.
A-Z phraser to extract noun phrases - whole phrases may be annotated.
MedLEE - A Medical Language Extraction and Encoding System - what will it do?
Semantic annotation: concept spaces? How to combine it with parsing?
Search engines: MedTextUs? MedSpace?
UMLS as the basis, umlsks.nlm.nih.gov
Combining XML with UMLS and other ontologies?
EcoCyc by Karp.
Medical systems: less structured; models of diseases are not as well known/structured as knowledge about pathways.
MetaMap
Semantic Knowledge Representation Research
MetaMap has no support for XML, but "It should be fairly easy to modify the output routines to include XML output".
Protégé-2000 is an integrated tool for ontology and knowledge-base editing.
Protégé-2000 is also an open-source, Java-based, extensible architecture for the creation of
customized knowledge-based tools. It has UMLS Tabs! Will it help in automatic annotation?
See Protege Plugins/Tabs |
http://protege.stanford.edu/plugins/umlstab/umls_tab.html
Extract noun phrases, for example with the A-Z phraser.
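A minimal noun-phrase extraction sketch (not the A-Z phraser itself), using NLTK's POS tagger and a simple chunk grammar; assumes the standard NLTK tokenizer and tagger data have been downloaded.

# Simple NP chunker: optional determiner, adjectives, one or more nouns.
# Assumes nltk data 'punkt' and 'averaged_perceptron_tagger' are installed.
import nltk

GRAMMAR = "NP: {<DT>?<JJ.*>*<NN.*>+}"
chunker = nltk.RegexpParser(GRAMMAR)

def noun_phrases(text):
    phrases = []
    for sent in nltk.sent_tokenize(text):
        tagged = nltk.pos_tag(nltk.word_tokenize(sent))
        tree = chunker.parse(tagged)
        for subtree in tree.subtrees(filter=lambda t: t.label() == "NP"):
            phrases.append(" ".join(word for word, tag in subtree.leaves()))
    return phrases

print(noun_phrases("Ocular complications of diabetes include diabetic retinopathy."))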
Interesting technology: summarization of text.
WordNet:
online lexical reference system whose design is inspired by current psycholinguistic theories of human lexical memory. English nouns, verbs, adjectives and adverbs are organized into synonym sets, each representing one underlying lexical concept. Different relations link the synonym sets.
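A minimal lookup sketch showing the synonym sets and relations described above, using NLTK's WordNet interface; assumes the WordNet corpus has been downloaded via nltk.download('wordnet').

# Look up synsets for a word and list a few linked relations.
from nltk.corpus import wordnet as wn

for synset in wn.synsets("disease"):
    print(synset.name(), "-", synset.definition())
    print("  hypernyms:", [h.name() for h in synset.hypernyms()])
    print("  hyponyms: ", [h.name() for h in synset.hyponyms()][:3])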
Inquirus is a prototype metasearch engine that uses query term context and page analysis for more efficient and comprehensive web search.
Concept spaces, in particular
MedSpace project
Semantic Indexing for a Complete Subject Discipline: MedLine 10M abstracts index, 45 M noun phrases
Combining semantic indexing with XML annotations?
LVM, LSA - Latent Semantic Analysis: a small number of latent parameters for recovering high-dimensional distribution data. Continuous or Bernoulli mixtures?
Intrinsic dimensionality of the data - geodesic distances? Use Tenenbaum's approach?
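A minimal sketch of Tenenbaum's geodesic-distance (Isomap) approach using scikit-learn; the data matrix here is random and purely illustrative.

# Embed high-dimensional points into 2D using geodesic distances (Isomap).
import numpy as np
from sklearn.manifold import Isomap

X = np.random.rand(200, 50)          # 200 points in 50 dimensions (toy data)
embedding = Isomap(n_neighbors=10, n_components=2).fit_transform(X)
print(embedding.shape)               # (200, 2) low-dimensional coordinates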
WebGrid III, WWW implementation of George Kelly's repertory grid technique for building conceptual models.
+ Free, supports some reasoning and knowledge visualization
Has not been used in biomedical applications yet
Walrus
Tamara Munzner
Nice paper on info visualization
XML Topic maps
K42 Topic maps: interesting knowledge visualization using topic maps and a hyperbolic tree
Visualization demos: Infoviz,
OmniViz.com - check
Treemaps
List
Magic-Eye-View
Info visualization survey
Clustering: used by IOP.
LSI Web page, with papers/software.
Hofmann probabilistic LSI papers
NIST TREC conferences
TREC data
More notes:
IR organization addresses: | ACM SIGIR | trec.nist.gov | www.muc.saic.com |
Ontology Projects,
also ontology visualizations.
Automatic Tools for Mapping Between Ontologies, a project in the DARPA Agent Markup Language (DAML)
XML
Semantic Web
Resource Description Framework (RDF)
D-Lib Test Suite Project
(1999-2001) Final Report, 3/2002
XML presentation
XML FAQ: very good!
Microsoft sees XML as key technology
Netscape Mozilla has XML support, as do Opera and other browsers
MSXML XSLT FAQ and other XML technologies
XML Cover Pages is a comprehensive online reference
well organized, all standards described, great software list
Chemical examples
DocZilla is a Mozilla-based SGML/XML/HTML browser
CYC: Qualified parties can obtain a free license to a substantially larger subset of the Cyc Knowledge Base known as ResearchCyc, which is for R&D use only.
OpenCyc Version 1.0 released in July 2002.
The XML community is providing a powerful syntax for adding structure to the web. Cyc enhances XML by providing a powerful universal semantics for modeling objects described via XML.
Semantic Web Technology: resources
AI Arizona Medical Informatics group includes MedTextUs and HelpfulMed
Ontolingua
Semantic lexicon with concept vector representation, or a better procedure for disambiguation?
Example: "ocular" has many meanings, but "ocular" near a disease name, or "ocular" near "telescope", etc., takes on a specific meaning (a toy sketch of this idea follows after these notes). Cyc should use something like this to avoid asking context questions?
Index of sentence consistency should be maximized?
Longest noun phrase should be found first?
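A toy sketch of the disambiguation idea above: pick the sense of "ocular" whose concept vector is closest (cosine similarity) to the averaged vectors of the surrounding words; all vectors below are invented for illustration.

# Context-based word-sense selection with concept vectors (toy data).
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

sense_vectors = {                      # hypothetical concept vectors
    "ocular (anatomy)":  np.array([0.9, 0.1, 0.0]),
    "ocular (eyepiece)": np.array([0.1, 0.0, 0.9]),
}
context_vectors = {                    # hypothetical word vectors of neighbors
    "retinopathy": np.array([0.8, 0.2, 0.1]),
    "disease":     np.array([0.7, 0.3, 0.0]),
}

context = np.mean(list(context_vectors.values()), axis=0)
best = max(sense_vectors, key=lambda s: cosine(sense_vectors[s], context))
print(best)                            # -> "ocular (anatomy)"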
DAML - provides semantics for its tags and thus supports inference!
The DAML language is being developed as an extension to XML and the Resource Description Framework (RDF). The latest release of the language (DAML+OIL) provides a rich set of constructs with which to create ontologies and to markup information so that it is machine readable and understandable.
Cyc ontologies are already partially in DAML!
Semantic Types of the UMLS, Unified Medical Language System Semantic Network, Yale, Peishen Qi
OilEd
is an ontology editor allowing the user to build ontologies using DAML+OIL.
Ontology Inference Layer OIL
WebScripter, a tool that enables ordinary users to easily and quickly assemble reports extracting and fusing information from multiple, heterogeneous DAMLized Web sources.
Example of search engine for sophisticated searches using DAML and HTML.
Problems: no decent viewer? The applet for viewing DAML files does not load; it dates from 11/2000.
LSA and Educational Applications Latent Semantic Analysis (LSA) captures the essential relationships between text documents and word meaning, or semantics, the knowledge base which must be accessed to evaluate the quality of content. Several educational applications that employ LSA have been developed: (1) selecting the most appropriate text for learners with variable levels of background knowledge, (2) automatically scoring the content of an essay, and (3) helping students effectively summarize material.
An LSA Primer
Latent Semantic Analysis (LSA) is a mathematical/statistical technique for extracting and representing the similarity of meaning of words and passages by analysis of large bodies of text. It uses singular value decomposition, a general form of factor analysis, to condense a very large matrix of word-by-context data into a much smaller, but still large (typically 100-500 dimensional), representation (Deerwester, Dumais, Furnas, Landauer & Harshman, 1990). The right number of dimensions appears to be crucial; the best values yield up to four times as accurate simulation of human judgments as ordinary co-occurrence measures.
The underlying idea is that the aggregate of all the word contexts in which a given word does and does not appear provides a set of mutual constraints that largely determines the similarity of meaning of words and sets of words to each other. The adequacy of LSA’s reflection of human knowledge has been established in a variety of ways.
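A minimal LSA sketch along the lines described above: build a word-by-context count matrix and reduce it with a truncated SVD; the documents and the 2-dimensional target are toy values (in practice 100-500 dimensions).

# LSA: count matrix + truncated SVD, using scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "aspirin relieves headache pain",
    "aspirin may cause stomach bleeding",
    "ocular disease affects the retina",
]
X = CountVectorizer().fit_transform(docs)      # word-by-context matrix
lsa = TruncatedSVD(n_components=2)             # toy value; typically 100-500
doc_vectors = lsa.fit_transform(X)
print(doc_vectors.shape)                       # (3, 2)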
LSA Papers
Landauer, T. K., Foltz, P. W., & Laham, D. (1998). Introduction to Latent Semantic Analysis. Discourse Processes, 25, 259-284.
Landauer, T. K. & Dumais, S. T. (1997). A solution to Plato's problem: The Latent Semantic Analysis theory of the acquisition, induction, and representation of knowledge. Psychological Review, 104, 211-240.
Van der Vet and Mars revive the attempts to incorporate predicate relationships between assigned index terms during the searching process, to control for the difference between, for example, "aspirin as a cause of, and aspirin as a cure for, headache." Such relationships, and syntaxes for their use, are defined for each indexing language and are applied to terms from specified hierarchies of indexing concepts to create coordinated concepts. It is also possible to specify the chosen concept together with all its narrower terms using an ANY operator. Thus for a query one may choose a single or ANY predicate relationship, a single or ANY term as the first element of the syntax, and a single or ANY term as the second. Such indexing concepts can then be combined with Boolean operators. The system will take considerable time or space resources for concept expansion. The problem of syntax assignment at the time of indexing is not addressed.
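One possible reading of this query scheme as a small sketch: index entries are (concept, predicate, concept) triples, and ANY expands a concept to itself plus all its narrower terms; the hierarchy and index below are invented for illustration.

# Coordinated-concept matching with an ANY (hierarchy expansion) operator.
narrower = {"analgesic": {"aspirin", "ibuprofen"}}     # toy concept hierarchy

def ANY(term):
    """Expand a concept to itself plus all narrower terms."""
    return {term} | narrower.get(term, set())

index = {
    ("aspirin", "cures", "headache"),
    ("aspirin", "causes", "headache"),
}

def matches(query, entry):
    # A query slot is either a plain term or a set produced by ANY().
    return all(e in (q if isinstance(q, set) else {q})
               for q, e in zip(query, entry))

query = (ANY("analgesic"), "cures", "headache")
print([e for e in index if matches(query, e)])   # [('aspirin', 'cures', 'headache')]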
Hierarchical Concept Indexing of Full-Text Documents in the Unified Medical Language System® Information Sources Map, Lawrence W. Wright, Holly K. Grossetta Nardini, Alan R. Aronson, and Thomas C. Rindflesch
Using Health Services/Technology Assessment Text (HSTAT) as a database, Wright et al. extracted four HSTAT files with material on breast cancer, containing 66 distinct documents. Using the available SGML tags, chapter and section headings were located and used to divide the documents into parts while retaining their hierarchical structure. MetaMap, which translates medical text to UMLS Metathesaurus terms and ranks them by occurrence, specificity, and position, was then applied to the document fragments; the terms it chooses are less accurate than human indexing but superior to purely extracted terms. Since both the whole document and its sections are represented, the resulting index is hierarchical in nature. Of the MetaMap-generated MeSH terms, 60% were not in the current indexing of HSTAT, and MMI produced results similar to those of the HSTAT search facility, except that MMI could bring in larger sections or whole documents rather than fine sections alone.
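A toy sketch of the hierarchical indexing idea: each section is indexed separately and the whole document inherits the union of its sections' terms, so a hit can point at a fine section or at the enclosing document; the keyword lookup here merely stands in for MetaMap's Metathesaurus mapping.

# Section-level and document-level indexes from SGML-style section splits.
TOY_THESAURUS = {"breast cancer", "mammography", "tamoxifen"}   # stand-in vocabulary

def index_terms(text):
    return {t for t in TOY_THESAURUS if t in text.lower()}

document = {
    "Chapter 1 / Screening": "Mammography is the standard screening test.",
    "Chapter 2 / Treatment": "Tamoxifen is used to treat breast cancer.",
}

section_index = {title: index_terms(text) for title, text in document.items()}
document_index = set().union(*section_index.values())

print(section_index)
print(document_index)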
Another advanced retrieval system was presented by Larry Mongin, Javed Mostafa and John Fieber, who combine clustering algorithms with dynamic 2D visualization techniques in their Sifter Project. In their poster, user medical queries were mapped to Unified Medical Language System (UMLS) categories, and users interactively selected terms in those categories for cluster analysis and visualizations that show relationships through motion as well as spatial position.
Information Filtering Resources
NIST Retrieval Group Published Papers
Negation in medical texts: NegEx algorithm
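A greatly simplified sketch of the NegEx idea: a finding counts as negated if a negation trigger occurs within a few tokens before it; the trigger list and window are toy values, not the published NegEx sets.

# Simplified NegEx-style negation check on a single sentence.
import re

TRIGGERS = ["no", "denies", "without", "no evidence of"]
WINDOW = 5   # number of tokens to look back before the finding

def is_negated(sentence, finding):
    tokens = re.findall(r"[a-z]+", sentence.lower())
    if finding.lower() not in tokens:
        return False
    pos = tokens.index(finding.lower())
    context = " ".join(tokens[max(0, pos - WINDOW):pos])
    return any(re.search(r"\b" + re.escape(t) + r"\b", context) for t in TRIGGERS)

print(is_negated("The patient denies chest pain or fever.", "fever"))  # True
print(is_negated("The patient reports severe fever.", "fever"))        # False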
Working log (local access only)