Main issues, notes and speculations.
Proposal:
One idea for how to approach it:
XML editors, check standards/versions.
XML search tools, plugins for browsers?
Ontology editors that can handle UMLS: Protégé ?
Automatic Tools for Mapping Between Ontologies, DARPA Agent Markup Language (DAML)
The Defense Advanced Research Projects Agency (DARPA)
WebKB
Medical Concept Mapper, Customizable and Ontology-Enhanced Medical Information Retrieval Interface
NLP support:
Cyc-NLP for general language understanding.
A-Z phraser to extract noun phrases - whole phrases may be annotated.
MedLEE - A Medical Language Extraction and Encoding System - what will it do?
Semantic annotation: concept spaces? How to combine it with parsing?
Search engines: MedTextUs? MedSpace?
UMLS as the basis, umlsks.nlm.nih.gov
Combining XML with UMLS and other ontologies?
EcoCyc by Karp.
Medical systems: less structured; models of diseases are not as well known/structured as knowledge about pathways.
MetaMap
Semantic Knowledge Representation Research
MetaMap has no support for XML, but "It should be fairly easy to modify the output routines to include XML output".
Protégé-2000 is an integrated tool for ontology and knowledge-base editing.
Protégé-2000 is also an open-source, Java-based, extensible architecture for the creation of
customized knowledge-based tools. It has UMLS Tabs! Will it help in automatic annotation?
See Protege Plugins/Tabs |
http://protege.stanford.edu/plugins/umlstab/umls_tab.html
Extract noun phrases, for example with the A-Z phraser.
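A minimal noun-phrase extraction sketch (not the A-Z phraser itself), using NLTK's POS tagger and a simple chunk grammar; assumes the standard NLTK tokenizer and tagger data have been downloaded.

# Simple NP chunker: optional determiner, adjectives, one or more nouns.
# Assumes nltk data 'punkt' and 'averaged_perceptron_tagger' are installed.
import nltk

GRAMMAR = "NP: {<DT>?<JJ.*>*<NN.*>+}"
chunker = nltk.RegexpParser(GRAMMAR)

def noun_phrases(text):
    phrases = []
    for sent in nltk.sent_tokenize(text):
        tagged = nltk.pos_tag(nltk.word_tokenize(sent))
        tree = chunker.parse(tagged)
        for subtree in tree.subtrees(filter=lambda t: t.label() == "NP"):
            phrases.append(" ".join(word for word, tag in subtree.leaves()))
    return phrases

print(noun_phrases("Ocular complications of diabetes include diabetic retinopathy."))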
Interesting technology: summarization of text.
WordNet:
online lexical reference system whose design is inspired by current psycholinguistic theories of human lexical memory. English nouns, verbs, adjectives and adverbs are organized into synonym sets, each representing one underlying lexical concept. Different relations link the synonym sets.
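A minimal lookup sketch showing the synonym sets and relations described above, using NLTK's WordNet interface; assumes the WordNet corpus has been downloaded via nltk.download('wordnet').

# Look up synsets for a word and list a few linked relations.
from nltk.corpus import wordnet as wn

for synset in wn.synsets("disease"):
    print(synset.name(), "-", synset.definition())
    print("  hypernyms:", [h.name() for h in synset.hypernyms()])
    print("  hyponyms: ", [h.name() for h in synset.hyponyms()][:3])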
Inquirus is a prototype metasearch engine that uses query term context and page analysis for more efficient and comprehensive web search.
Concept spaces, in particular
MedSpace project
Semantic Indexing for a Complete Subject Discipline: MedLine 10M abstracts index, 45 M noun phrases
Combining semantic indexing with XML annotations?
LVM, LSA - Latent Semantic Analysis: a small number of latent parameters for recovering high-dimensional distribution data. Continuous or Bernoulli mixtures?
Intrinsic dimensionality of the data - geodesic distances? Use Tenenbaum's approach?
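A minimal sketch of Tenenbaum's geodesic-distance (Isomap) approach using scikit-learn; the data matrix here is random and purely illustrative.

# Embed high-dimensional points into 2D using geodesic distances (Isomap).
import numpy as np
from sklearn.manifold import Isomap

X = np.random.rand(200, 50)          # 200 points in 50 dimensions (toy data)
embedding = Isomap(n_neighbors=10, n_components=2).fit_transform(X)
print(embedding.shape)               # (200, 2) low-dimensional coordinates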
WebGrid III, WWW implementation of George Kelly's repertory grid technique for building conceptual models.
+ Free, supports some reasoning and knowledge visualization
Has not been used in biomedical applications yet
Walrus
Tamara Munzner
Nice paper on info visualization
XML Topic maps
K42 Topic maps: interesting knowledge visualization using topic maps and a hyperbolic tree
Visualization demos: Infoviz,
OmniViz.com - check
Treemaps
List
Magic-Eye-View
Info visualization survey
Clustering: used by IOP.
LSI Web page, with papers/software.
Hofmann probabilistic LSI papers
NIST TREC conferences
TREC data
More notes:
IR organization addresses: | ACM SIGIR | trec.nist.gov | www.muc.saic.com |
Ontology Projects,
also ontology visualizations.
Automatic Tools for Mapping Between Ontologies, a project in the DARPA Agent Markup Language (DAML)
XML
Semantic Web
Resource Description Framework (RDF)
D-Lib Test Suite Project
(1999-2001) Final Report, 3/2002
XML presentation
XML FAQ: very good!
Microsoft sees XML as key technology
Netscape Mozilla has XML support, as do Opera and other browsers
MSXML XSLT FAQ and other XML technologies
XML Cover Pages is a comprehensive online reference
well organized, all standards described, great software list
Chemical examples
DocZilla is a Mozilla-based SGML/XML/HTML browser
CYC: Qualified parties can obtain a free license to a substantially larger subset of the Cyc Knowledge Base known as ResearchCyc, which is for R&D use only.
OpenCyc Version 1.0 released in July 2002.
The XML community is providing a powerful syntax for adding structure to the web. Cyc enhances XML by providing a powerful universal semantics for modeling objects described via XML.
Semantic Web Technology: resources
AI Arizona Medical Informatics group includes MedTextUs and HelpfulMed
Ontolingua
Semantic lexicon with concept vector representation, or a better procedure for disambiguation?
Example: "ocular" has many meanings, but "ocular" near a disease name, or "ocular" near "telescope", etc., takes on a specific meaning (a toy sketch of this idea follows after these notes). Cyc should use something like this to avoid asking context questions?
Index of sentence consistency should be maximized?
Longest noun phrase should be found first?
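A toy sketch of the disambiguation idea above: pick the sense of "ocular" whose concept vector is closest (cosine similarity) to the averaged vectors of the surrounding words; all vectors below are invented for illustration.

# Context-based word-sense selection with concept vectors (toy data).
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

sense_vectors = {                      # hypothetical concept vectors
    "ocular (anatomy)":  np.array([0.9, 0.1, 0.0]),
    "ocular (eyepiece)": np.array([0.1, 0.0, 0.9]),
}
context_vectors = {                    # hypothetical word vectors of neighbors
    "retinopathy": np.array([0.8, 0.2, 0.1]),
    "disease":     np.array([0.7, 0.3, 0.0]),
}

context = np.mean(list(context_vectors.values()), axis=0)
best = max(sense_vectors, key=lambda s: cosine(sense_vectors[s], context))
print(best)                            # -> "ocular (anatomy)"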
DAML - provides semantics for its tags and thus supports inference!
The DAML language is being developed as an extension to XML and the Resource Description Framework (RDF). The latest release of the language (DAML+OIL) provides a rich set of constructs with which to create ontologies and to markup information so that it is machine readable and understandable.
Cyc ontologies are already partially in DAML!
Semantic Types of the UMLS, Unified Medical Language System Semantic Network, Yale, Peishen Qi
OilEd
is an ontology editor allowing the user to build ontologies using DAML+OIL.
Ontology Inference Layer OIL
WebScripter, a tool that enables ordinary users to easily and quickly assemble reports extracting and fusing information from multiple, heterogeneous DAMLized Web sources.
Example of search engine for sophisticated searches using DAML and HTML.
Problems: no decent viewer? The applet for viewing DAML files does not load; it dates from 11/2000.
LSA and Educational Applications Latent Semantic Analysis (LSA) captures the essential relationships between text documents and word meaning, or semantics, the knowledge base which must be accessed to evaluate the quality of content. Several educational applications that employ LSA have been developed: (1) selecting the most appropriate text for learners with variable levels of background knowledge, (2) automatically scoring the content of an essay, and (3) helping students effectively summarize material.
An LSA Primer
Latent Semantic Analysis (LSA) is a mathematical/statistical technique for extracting and representing the similarity of meaning of words and passages by analysis of large bodies of text. It uses singular value decomposition, a general form of factor analysis, to condense a very large matrix of word-by-context data into a much smaller, but still large (typically 100-500 dimensional), representation (Deerwester, Dumais, Furnas, Landauer & Harshman, 1990). The right number of dimensions appears to be crucial; the best values yield up to four times as accurate simulation of human judgments as ordinary co-occurrence measures.
The underlying idea is that the aggregate of all the word contexts in which a given word does and does not appear provides a set of mutual constraints that largely determines the similarity of meaning of words and sets of words to each other. The adequacy of LSA’s reflection of human knowledge has been established in a variety of ways.
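A minimal LSA sketch along the lines described above: build a word-by-context count matrix and reduce it with a truncated SVD; the documents and the 2-dimensional target are toy values (in practice 100-500 dimensions).

# LSA: count matrix + truncated SVD, using scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "aspirin relieves headache pain",
    "aspirin may cause stomach bleeding",
    "ocular disease affects the retina",
]
X = CountVectorizer().fit_transform(docs)      # word-by-context matrix
lsa = TruncatedSVD(n_components=2)             # toy value; typically 100-500
doc_vectors = lsa.fit_transform(X)
print(doc_vectors.shape)                       # (3, 2)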
LSA Papers
Landauer, T. K., Foltz, P. W., & Laham, D. (1998). Introduction to Latent Semantic Analysis. Discourse Processes, 25, 259-284.
Landauer, T. K. & Dumais, S. T. (1997). A solution to Plato's problem: The Latent Semantic Analysis theory of the acquisition, induction, and representation of knowledge. Psychological Review, 104, 211-240.
Van der Vet and Mars revive the attempts to incorporate predicate relationships between assigned index terms during the searching process, to control for the difference between, for example, "aspirin as a cause of, and aspirin as a cure for, headache." Such relationships, and syntaxes for their use, are defined for each indexing language and are applied to terms from specified hierarchies of indexing concepts to create coordinated concepts. It is also possible to specify the chosen concept together with all its narrower terms using an ANY operator. Thus for a query one may choose a single or ANY predicate relationship, a single or ANY term as the first element of the syntax, and a single or ANY term as the second. Such indexing concepts can then be combined with Boolean operators. The system will take considerable time or space resources for concept expansion. The problem of syntax assignment at the time of indexing is not addressed.
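One possible reading of this query scheme as a small sketch: index entries are (concept, predicate, concept) triples, and ANY expands a concept to itself plus all its narrower terms; the hierarchy and index below are invented for illustration.

# Coordinated-concept matching with an ANY (hierarchy expansion) operator.
narrower = {"analgesic": {"aspirin", "ibuprofen"}}     # toy concept hierarchy

def ANY(term):
    """Expand a concept to itself plus all narrower terms."""
    return {term} | narrower.get(term, set())

index = {
    ("aspirin", "cures", "headache"),
    ("aspirin", "causes", "headache"),
}

def matches(query, entry):
    # A query slot is either a plain term or a set produced by ANY().
    return all(e in (q if isinstance(q, set) else {q})
               for q, e in zip(query, entry))

query = (ANY("analgesic"), "cures", "headache")
print([e for e in index if matches(query, e)])   # [('aspirin', 'cures', 'headache')]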
Hierarchical Concept Indexing of Full-Text Documents in the Unified Medical Language System® Information Sources Map, Lawrence W. Wright, Holly K. Grossetta Nardini, Alan R. Aronson, and Thomas C. Rindflesch
Using Health Services/Technology Assessment Text (HSTAT) as a database, Wright et al. extracted four HSTAT files with material on breast cancer, containing 66 distinct documents. Using the available SGML tags, chapter and section headings were located and used to divide the documents into parts while retaining their hierarchical structure. MetaMap, which translates medical text to UMLS Metathesaurus terms and ranks them by occurrence, specificity, and position, was then applied to the document fragments; the terms it chooses are less accurate than human indexing but superior to purely extracted terms. Since both the whole document and its sections are represented, the resulting index is hierarchical in nature. Of the MetaMap-generated MeSH terms, 60% were not in the current indexing of HSTAT, and MMI produced results similar to those of the HSTAT search facility, except that MMI could bring in larger sections or whole documents rather than fine sections alone.
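A toy sketch of the hierarchical indexing idea: each section is indexed separately and the whole document inherits the union of its sections' terms, so a hit can point at a fine section or at the enclosing document; the keyword lookup here merely stands in for MetaMap's Metathesaurus mapping.

# Section-level and document-level indexes from SGML-style section splits.
TOY_THESAURUS = {"breast cancer", "mammography", "tamoxifen"}   # stand-in vocabulary

def index_terms(text):
    return {t for t in TOY_THESAURUS if t in text.lower()}

document = {
    "Chapter 1 / Screening": "Mammography is the standard screening test.",
    "Chapter 2 / Treatment": "Tamoxifen is used to treat breast cancer.",
}

section_index = {title: index_terms(text) for title, text in document.items()}
document_index = set().union(*section_index.values())

print(section_index)
print(document_index)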
Another advanced retrieval system was presented by Larry Mongin, Javed Mostafa and John Fieber, who combine clustering algorithms with dynamic 2D visualization techniques in their Sifter Project. In their poster, user medical queries were mapped to Unified Medical Language System (UMLS) categories, and users interactively selected terms in those categories for cluster analysis and visualizations that show relationships through motion as well as spatial position.
Information Filtering Resources
NIST Retrieval Group Published Papers
Negation in medical texts: NegEx algorithm
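A greatly simplified sketch of the NegEx idea: a finding counts as negated if a negation trigger occurs within a few tokens before it; the trigger list and window are toy values, not the published NegEx sets.

# Simplified NegEx-style negation check on a single sentence.
import re

TRIGGERS = ["no", "denies", "without", "no evidence of"]
WINDOW = 5   # number of tokens to look back before the finding

def is_negated(sentence, finding):
    tokens = re.findall(r"[a-z]+", sentence.lower())
    if finding.lower() not in tokens:
        return False
    pos = tokens.index(finding.lower())
    context = " ".join(tokens[max(0, pos - WINDOW):pos])
    return any(re.search(r"\b" + re.escape(t) + r"\b", context) for t in TRIGGERS)

print(is_negated("The patient denies chest pain or fever.", "fever"))  # True
print(is_negated("The patient reports severe fever.", "fever"))        # False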
Working log (local access only)