A Shared Task Involving Multi-label Classification of Clinical Free Text

John Pestian1, Christopher Brew2, Paweł Matykiewicz1,4, DJ Hovermale2, Neil Johnson1, K. Bretonnel Cohen3, and Wlodzislaw Duch4
1Department of Biomedical Informatics, Children's Hospital Research Foundation, Cincinnati, Ohio, USA
2Ohio State University, Department of Linguistics,
3University of Colorado School of Medicine
4Department of Informatics, Nicolaus Copernicus University, Grudziadzka 5, 87-100 Torun, Poland.


This paper reports on a shared task involving the assignment of ICD-9-CM codes to radiology reports. Two features distinguished this task from previous shared tasks in the biomedical domain. One is that it resulted in the first freely distributable corpus of fully anonymized clinical text. This resource is permanently available and will (we hope) facilitate future research. The other key feature of the task is that it required categorization with respect to a large and commercially significant set of labels. The number of participants was larger than in any previous biomedical challenge task. We describe the data production process and the evaluation measures, and give a preliminary analysis of the results. Many systems performed at levels approaching the inter-coder agreement, suggesting that human-like performance on this task is within the reach of currently available technologies.

Reference: Pestian J, Brew C, Matykiewicz P, Hovermale DJ, Johnson N, Cohen KB, Duch W. A shared task involving multi-label classification of clinical free text. BioNLP 2007: Biological, translational, and clinical language processing, pp. 97–104, ACL 2007.

