Data Mining in Radiology Reports
Saeed Mehrabi
Spring 2010, INFO-I535
Dr. Patrick W. Jamieson
Dr. Josette Jones
Outline
• Introduction to data and text mining
• Our data set
• Structuring free text
• Results
• Similar works
• Discussion
What is Data Mining?
• Data mining is the extraction of useful patterns from data sources such as databases, texts and the web.
• There is a big gap between stored data and knowledge, and the transition won’t occur automatically.
• Many interesting questions cannot be answered with plain database queries:
“Find me people likely to buy my products”
“Who is likely to respond to my promotion?”
Why data mining now?
• The data is abundant.
• The data is being warehoused.
• The computing power is affordable.
• The competitive pressure is strong.
• Data mining tools have become available.
Text Mining
Text mining applies and adapts data mining techniques to the text domain.
Structured vs. Free Text
• Structured text can be stored in a relational database.
• Representing the information found in free text in a structured format makes information exchange, data mining and information retrieval more feasible.
Data Set
• Our corpus consists of: 594,000 de-identified radiology reports
36 million words
4.3 million sentences
• The reports were dictated by the Indiana University Radiology faculty, a group of 40 radiologists, from 1993 to 1998.
Structuring Free text
• Regular expressions were used to detect sentence boundaries in the reports.
• A regular expression is a concise and flexible way of matching strings of text, such as particular characters or words.
• Sentences are annotated to propositions. A proposition here is a single sentence that stands for all sentences expressing the same concept for similar findings within reports.
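The slides do not show the regular expression that was used, so the following is only a minimal sketch of regex-based sentence detection. The boundary rule (sentence-ending punctuation, then whitespace, then an uppercase letter or digit) and the name split_sentences are assumptions made for illustration. Real dictated reports would need extra handling for abbreviations and decimal numbers.

```python
import re
from typing import List

# Assumed boundary rule: ., ! or ? followed by whitespace and then an
# uppercase letter or a digit. This is NOT the pattern used in the project.
SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z0-9])")


def split_sentences(report: str) -> List[str]:
    """Split one report into sentences at the assumed boundaries."""
    parts = SENTENCE_BOUNDARY.split(report.strip())
    return [part.strip() for part in parts if part.strip()]


# Made-up example report text.
report = "The lungs are clear. No pleural effusion is seen. Heart size is normal."
sentences = split_sentences(report)
for sentence in sentences:
    print(sentence)
```

Because many reports repeat the same wording, counting identical sentences after this step would give the unique-sentence list that the annotators work from.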
Structuring Free text (Cont.)
• A proposition is a declarative sentence that is either true or false, but not both.
Today is a beautiful sunny day. (A proposition)
x + 2 = 4 (Not a proposition)
• Users can select propositions and map sentences to propositions
Corpus Annotation
• To annotate each new sentence from the radiology reports, the software first proposes candidate propositions.
• The propositions suggested by the software are reviewed by experts and corrected as needed before validation.
• If no suitable proposition exists in the ontology, the expert can create a new one.
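The slides do not say how the software chooses which propositions to propose. As a rough sketch of the workflow only, the example below ranks existing propositions by word overlap (Jaccard similarity) with the new sentence. The similarity measure, the function names and the three sample propositions are all assumptions and are not the project's method. An empty result corresponds to the case where the expert creates a new proposition.

```python
import re
from typing import Dict, List, Set, Tuple


def tokens(text: str) -> Set[str]:
    """Lowercased alphabetic words of a sentence."""
    return set(re.findall(r"[a-z]+", text.lower()))


def suggest(sentence: str, propositions: Dict[str, str],
            top_k: int = 2) -> List[Tuple[str, float]]:
    """Rank propositions by Jaccard word overlap (an assumed measure)."""
    words = tokens(sentence)
    scored = []
    for prop_id, prop_text in propositions.items():
        prop_words = tokens(prop_text)
        union = words | prop_words
        score = len(words & prop_words) / len(union) if union else 0.0
        scored.append((prop_id, score))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    # Keep only candidates that share at least one word with the sentence.
    return [pair for pair in scored[:top_k] if pair[1] > 0]


# Hypothetical mini-ontology, for illustration only.
propositions = {
    "P1": "The lungs are clear",
    "P2": "The heart size is normal",
    "P3": "There is a pleural effusion",
}

candidates = suggest("Lungs are clear bilaterally", propositions)
no_match = suggest("Gallbladder surgically absent", propositions)
print(candidates)  # candidates for the expert to review
print(no_match)    # empty: the expert would create a new proposition
```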
Results
• The ontology of propositions is built in parallel with the experts' annotation of sentences to existing propositions.
• So far, 427,433 unique sentences from the corpus have been annotated, representing a total of 2,561,330 sentences, or about 60% of all sentences in the corpus.
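The coverage figure can be checked from the numbers on the slides. The short calculation below assumes the corpus total is exactly the rounded "4.3 million sentences" given for the data set, so the result is approximate.

```python
unique_annotated = 427_433     # unique sentences annotated so far
total_covered = 2_561_330      # sentence occurrences they represent
corpus_sentences = 4_300_000   # rounded corpus total ("4.3 million sentences")

coverage = total_covered / corpus_sentences
mean_repeats = total_covered / unique_annotated

print(f"coverage of corpus sentences: {coverage:.1%}")
print(f"mean occurrences per unique sentence: {mean_repeats:.2f}")
```

The coverage comes out just under 60%, and each annotated unique sentence occurs roughly six times on average, which is why annotating unique sentences is much cheaper than annotating every occurrence.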
Results (Cont.)
• The propositions are categorized by main finding area, such as brain and skull, general radiology, ...
• Each proposition is stored in a relational database, together with whether it is a normal or abnormal finding and the number of sentences mapped to it.
• We can find the most frequent (highest ranked) propositions by sorting them by the number of sentences mapped to them. We can also count how many propositions and sentences are normal or abnormal, overall and within each category.
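The ranking and counting described above map directly onto SQL queries. The sketch below uses an in-memory SQLite database. The table name, column names and the three rows are hypothetical, since the slides do not show the actual schema or data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema; the real database layout is not given in the slides.
conn.execute(
    """CREATE TABLE proposition (
           id INTEGER PRIMARY KEY,
           text TEXT,
           category TEXT,
           is_normal INTEGER,      -- 1 = normal finding, 0 = abnormal
           sentence_count INTEGER  -- sentences mapped to this proposition
       )"""
)
# Made-up rows for illustration only.
rows = [
    (1, "The lungs are clear", "Lung, Mediastinum, and Pleura", 1, 120),
    (2, "There is a pleural effusion", "Lung, Mediastinum, and Pleura", 0, 45),
    (3, "The heart size is normal", "Heart and Great Vessel", 1, 90),
]
conn.executemany("INSERT INTO proposition VALUES (?, ?, ?, ?, ?)", rows)

# Highest ranked propositions: most mapped sentences first.
ranked = conn.execute(
    "SELECT text, sentence_count FROM proposition "
    "ORDER BY sentence_count DESC"
).fetchall()

# Normal/abnormal proposition and sentence counts per category.
by_category = conn.execute(
    """SELECT category, is_normal,
              COUNT(*) AS n_propositions,
              SUM(sentence_count) AS n_sentences
       FROM proposition
       GROUP BY category, is_normal
       ORDER BY category, is_normal"""
).fetchall()

print(ranked)
print(by_category)
```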
[Figure: Number of normal and abnormal propositions within each 500-rank interval of the highest ranked propositions. X-axis: Rank of Propositions (1-500 through 13501-13581). Y-axis: Number of Propositions (0 to 350). Series: Normal, Abnormal.]
[Figure: Number of normal and abnormal sentences mapped to the propositions. X-axis: Rank of Propositions (1-500 through 2001-2500). Y-axis: Number of Sentences (0 to 1,800,000). Series: Normal, Abnormal.]
[Figure: Number of normal and abnormal sentences mapped to the propositions. X-axis: Rank of Propositions (2501-3000 through 13501-13581). Y-axis: Number of Sentences (0 to 18,000). Series: Normal, Abnormal.]
[Figure: Number of normal and abnormal propositions based on report categories. X-axis: Categories of Findings (Brain and Skull; Breast; Face, Mastoids, and Neck; Gastrointestinal; General Radiology; Genitourinary; Heart and Great Vessel; Lung, Mediastinum, and Pleura; Miscellaneous Observation; Skeletal and Soft Tissue; Spine and Contents; Vascular and Lymphatic). Y-axis: Number of Propositions (0 to 2,000). Series: Normal, Abnormal.]
[Figure: Number of normal and abnormal sentences based on report categories. X-axis: Categories of Findings (Brain and Skull; Breast; Face, Mastoids, and Neck; Gastrointestinal; General Radiology; Genitourinary; Heart and Great Vessel; Lung, Mediastinum, and Pleura; Miscellaneous Observation; Skeletal and Soft Tissue; Spine and Contents; Vascular and Lymphatic). Y-axis: Number of Sentences (0 to 450,000). Series: Normal, Abnormal.]
Similar works
CLEF (Clinical E-Science Framework)
• The CLEF corpus consists of both structured records and free text documents (clinical narratives, radiology reports and histopathology reports)
• Clinical text is semantically annotated to assist in the development and evaluation of an information extraction system
LEXIMER (LEXIcon Mediated Entropy Reduction)
• Phrase Isolation includes scanning the report text and separating the content into phrases
• Noise Reduction decreases the amount of non-clinically relevant information contained within the report
• Signal Extraction pulls out the positive statements and recommendations from the clinically relevant phrases
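To make the three stages concrete, here is a toy pipeline with the same stage names. It is not LEXIMER's algorithm, which the slides do not describe. The word lists, the splitting rule and the sample report are all hypothetical stand-ins chosen for illustration.

```python
import re
from typing import List

# Hypothetical word lists; LEXIMER's real lexicon is not given in the slides.
CLINICAL_TERMS = {"effusion", "nodule", "fracture", "ct", "follow-up"}
NEGATION_CUES = {"no", "without", "negative"}


def isolate_phrases(report: str) -> List[str]:
    """Phrase isolation: split the report text at ., ; and , characters."""
    return [p.strip() for p in re.split(r"[.;,]", report) if p.strip()]


def reduce_noise(phrases: List[str]) -> List[str]:
    """Noise reduction: keep phrases containing at least one clinical term."""
    return [p for p in phrases if CLINICAL_TERMS & set(p.lower().split())]


def extract_signal(phrases: List[str]) -> List[str]:
    """Signal extraction: keep phrases with no negation cue, so positive
    statements and recommendations remain."""
    return [p for p in phrases if not (NEGATION_CUES & set(p.lower().split()))]


# Made-up report text.
report = ("Dictated by resident. No pleural effusion. "
          "Small nodule in the right upper lobe; recommend follow-up CT.")

phrases = isolate_phrases(report)
relevant = reduce_noise(phrases)
signal = extract_signal(relevant)
print(signal)
```

In this toy version the administrative phrase is dropped as noise and the negated finding is dropped at the signal stage, leaving the positive finding and the recommendation.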
NLP using OLAP for assessing Recommendations in radiology reports
• Database: 4,279,179 radiology reports from a single tertiary health care center
10-year period (1995-2004)
Consists of reports from the most common imaging modalities, with patient demographics
• LEXIMER, in conjunction with online analytic processing (OLAP), was used to classify reports into those with imaging recommendations (IREC) and those without
• IREC rates were determined for different patient age groups, gender, imaging modalities, indications, diseases, subspecialties, and referring physicians
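Computing a rate per group is the kind of roll-up an OLAP tool performs along one dimension. The sketch below stands in for that with a plain Python group-by. The records are invented solely for illustration and do not reflect the study's data, and irec_rate is a hypothetical helper name.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Invented records: (imaging modality, patient gender, has recommendation).
Record = Tuple[str, str, bool]
records: List[Record] = [
    ("CT", "F", True), ("CT", "M", False), ("CT", "F", False), ("CT", "M", True),
    ("MR", "F", False), ("MR", "M", False),
    ("XR", "F", False), ("XR", "M", True),
]


def irec_rate(rows: List[Record], dimension: int) -> Dict[str, float]:
    """Share of reports with a recommendation, grouped by one dimension
    (0 = modality, 1 = gender)."""
    totals: Dict[str, int] = defaultdict(int)
    with_rec: Dict[str, int] = defaultdict(int)
    for row in rows:
        key = row[dimension]
        totals[key] += 1
        with_rec[key] += int(row[2])
    return {key: with_rec[key] / totals[key] for key in totals}


by_modality = irec_rate(records, 0)
by_gender = irec_rate(records, 1)
print(by_modality)
print(by_gender)
```

The same function applied to other dimensions (age group, indication, referring physician) would give the other breakdowns listed above.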
Discussion
• The CLEF work covers a very limited number of reports
• In LEXIMER, the classification method has not been validated, and phrases cannot convey the meaning of a full sentence.
• What distinguishes our work from others is the large amount of data mined and the consistent expert validation.
References
• Friedlin, J., Mahoui, M., Jones, J., Kashyap, V., & Jamieson, P. (2010). Knowledge Discovery and Data Mining of Free Text Radiology. Submitted to the Journal of Biomedical Informatics.
• Roberts, A., Gaizauskas, R., Hepple, M., Demetriou, G., Guo, Y., Setzer, A., et al. (2008). Semantic Annotation of Clinical Text: The CLEF Corpus. Retrieved April 20, 2010, from ftp://ftp.dcs.shef.ac.uk/home/robertg/papers/lrec08-clefcorpus.pdf
• Dang, P. A., Kalra, M. K., Blake, M. A., Schultz, T. J., Stout, M., Lemay, P. R., Freshman, D. J., Halpern, E. F., & Dreyer, K. J. (2008). Natural language processing using online analytic processing for assessing recommendations in radiology reports. J Am Coll Radiol, 5(3), 197-204.
• http://www.nuance.com/healthcare/products/radcube-for-radiology.asp