BioCreative V, Sevilla, Spain 10th September 2015
Use of Wikipedia as a resource for disease identification and concept mapping
Daniel Lowe, Noel O’Boyle and Roger Sayle
NextMove Software
But first…
DNER response times
[Bar chart: response time (milliseconds) for each submitted run, with run labels of the form 304#2; axis from 0 to 25,000. Highlighted value: 45 ms]
CID response times
[Bar chart: response time (milliseconds) for each submitted run, with run labels of the form 304#1; axis from 0 to 30,000. Highlighted value: 97 ms]
4 minutes of CPU time: 0.17¢
NextMove Software’s LeadMine
• Dictionaries and grammars (used in chemical recognition) matched as state machines
• No tokenisation step – what is recognized determines whether a character is delimiting
• No machine learning – every term found has an explanation for why it was detected, i.e. which dictionary found it
• Spelling correction – fuzzy matching where the corrected term is used for normalization (Damerau–Levenshtein distance with parametrised costs)
• Rule-based abbreviation detection – Schwartz and Hearst algorithm with more patterns
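The spelling-correction bullet can be illustrated with a small sketch. This is not LeadMine's implementation: it assumes the restricted (optimal string alignment) variant of Damerau–Levenshtein distance, and the function names, cost parameters and `max_cost` threshold are hypothetical choices made for the example.

```python
def weighted_osa_distance(a, b, ins=1.0, dele=1.0, sub=1.0, swap=1.0):
    """Restricted Damerau-Levenshtein distance with a separate cost per edit type."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        d[i][0] = i * dele          # delete every character of a
    for j in range(1, cols):
        d[0][j] = j * ins           # insert every character of b
    for i in range(1, rows):
        for j in range(1, cols):
            change = 0.0 if a[i - 1] == b[j - 1] else sub
            best = min(
                d[i - 1][j] + dele,        # deletion
                d[i][j - 1] + ins,         # insertion
                d[i - 1][j - 1] + change,  # match or substitution
            )
            # transposition of two adjacent characters
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                best = min(best, d[i - 2][j - 2] + swap)
            d[i][j] = best
    return d[-1][-1]


def correct(term, dictionary, max_cost=1.0, **costs):
    """Return (dictionary term, concept ID) for the closest entry within max_cost, else None."""
    best = None
    for known, concept in dictionary.items():
        dist = weighted_osa_distance(term.lower(), known.lower(), **costs)
        if dist <= max_cost and (best is None or dist < best[0]):
            best = (dist, known, concept)
    return None if best is None else (best[1], best[2])


# The corrected term, not the misspelling, is what gets normalized to a concept ID.
print(correct("haert disease", {"heart disease": "D006331"}))
```

Scanning the whole dictionary for every term is only workable for a toy example; matching against a state machine, as the slide describes, avoids that linear scan.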
Dictionary Preparation
• Manually curated dictionary
• MeSH
• Disease Ontology
• Terms from training/development sets
• Wikipedia
• Variants of terms added
– tumor ↔ tumour (synonyms)
– cancer → carcinoma (sub type of synonym)
→ 119,737 unique terms
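A minimal sketch of how such variants might be generated, assuming lower-cased terms and simple whole-word rewrite rules. The rule table and the `expand_variants` name are illustrative only; the slide does not say how the variants were actually produced. The tumor/tumour rule runs in both directions, while cancer → carcinoma runs in one direction only.

```python
import re

# (source word, replacement word); hypothetical rule table built from the slide's two examples
RULES = [
    ("tumor", "tumour"),
    ("tumour", "tumor"),
    ("cancer", "carcinoma"),  # one-way: carcinoma is not rewritten to cancer
]


def expand_variants(term):
    """Return the term plus every variant reachable by applying the rules repeatedly."""
    seen = {term}
    queue = [term]
    while queue:
        current = queue.pop()
        for src, dst in RULES:
            candidate = re.sub(rf"\b{re.escape(src)}\b", dst, current)
            if candidate not in seen:
                seen.add(candidate)
                queue.append(candidate)
    return seen


print(sorted(expand_variants("brain tumor")))
print(sorted(expand_variants("lung cancer")))
```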
Composite entity recognition
Example: “Heart and lung disease”
• A recognized entity is found and the preceding “and” or “or” is identified
• A potential entity is constructed: “Heart disease”
• Is it recognized?
• If so, report MeSH terms for all underlying entities: Heart disease (D006331), lung disease (D008171)
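The flow on this slide can be sketched roughly as follows. It assumes a plain dictionary lookup and a single "X and/or Y head" shape; `expand_composite` is a hypothetical helper, and the real system presumably handles more cases than this.

```python
import re

DICTIONARY = {"heart disease": "D006331", "lung disease": "D008171"}


def expand_composite(phrase, dictionary):
    """For 'X and/or Y head', build 'X head' and report both entities if both are known."""
    match = re.fullmatch(
        r"(?P<first>.+?)\s+(?:and|or)\s+(?P<second>.+)", phrase.strip(), flags=re.I
    )
    if not match:
        return []
    first = match.group("first").lower()
    second = match.group("second").lower()
    if second not in dictionary:      # the entity after "and"/"or" must be recognized
        return []
    words = second.split()
    # Try each trailing part of the recognized entity as the shared head.
    for k in range(1, len(words)):
        candidate = f"{first} {' '.join(words[k:])}"
        if candidate in dictionary:   # is the constructed entity recognized?
            return [(candidate, dictionary[candidate]), (second, dictionary[second])]
    return []


print(expand_composite("Heart and lung disease", DICTIONARY))
```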
Wikipedia for diseases
Linking Wikipedia to MeSH
Extraction of synonyms
• Methodology 1
– MeSH IDs corresponding to diseases determined (MeSH trees C and F03)
– Find Wikipedia pages with disease/symptom boxes that contain one of these MeSH IDs.
– Associate the page title and all redirects with that MeSH ID
• Methodology 2
– Find pages whose name matches a MeSH synonym
– Associate all redirects to that page with the MeSH ID of the aforementioned synonym
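The two methodologies can be sketched over a toy in-memory representation of Wikipedia pages. The page structure (`title`, `infobox_mesh`, `redirects`), the function names and the IDs `ID_A` and `ID_B` are placeholders invented for this example, not real MeSH identifiers or a real Wikipedia API.

```python
def methodology_1(pages, disease_mesh_ids):
    """Pages whose infobox carries a disease MeSH ID: map title and redirects to that ID."""
    pairs = set()
    for page in pages:
        mesh_id = page.get("infobox_mesh")
        if mesh_id in disease_mesh_ids:
            for name in [page["title"], *page["redirects"]]:
                pairs.add((name.lower(), mesh_id))
    return pairs


def methodology_2(pages, mesh_synonyms):
    """Pages whose title is a known MeSH synonym: map the page's redirects to its ID."""
    pairs = set()
    for page in pages:
        mesh_id = mesh_synonyms.get(page["title"].lower())
        if mesh_id is not None:
            for name in page["redirects"]:
                pairs.add((name.lower(), mesh_id))
    return pairs


# Toy data with placeholder IDs
pages = [
    {"title": "Hyperhidrosis", "infobox_mesh": "ID_A", "redirects": ["Excessive sweating"]},
    {"title": "Toothache", "infobox_mesh": None, "redirects": ["Dental pain"]},
]
extracted = methodology_1(pages, {"ID_A"}) | methodology_2(pages, {"toothache": "ID_B"})
print(sorted(extracted))
```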
Resultant dictionary
• 31,699 disease name/MeSH ID relationships extracted
• 20,611 not present in our pre-existing MeSH/Disease Ontology derived dictionary
• Some Disease Ontology terms that did not refer to MeSH could be assigned MeSH IDs
Issues
• Redirects are not semantic: rather than a synonym, a redirect can point to a related concept, e.g.
– Treatment of the disease
– Detection of the disease
– Particular outbreak of the disease
• Redirecting to a section of a page can be a related concept, but can also be a sub-type of the disease
• Difference in classification granularity – heart disease redirects to the page on Cardiovascular disease
Garbage in, Garbage out
• Original dictionary had “gambling” (MeSH has the same concept ID for pathological gambling (mental disorder) and gambling (specific instance of risk-taking behaviour))
• Hence Wikipedia allowed all terms related to gambling to be retrieved, e.g. gambler, gamble, gambling den…
Results (concepts) (BioCreative V CDR training + development set)
Type                                   Precision   Recall   F1-score
Wikipedia                              79.3%       61.3%    69.1%
MeSH + Disease Ontology                91.6%       67.1%    77.4%
MeSH + Disease Ontology + Wikipedia    85.1%       73.1%    78.6%
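The F1-score column is the harmonic mean of precision and recall. The short check below recomputes it from the rounded percentages in the table; because the inputs are already rounded, the recomputed value can differ from the reported one in the last digit. The `f1_score` helper and the `results` layout are just for this illustration.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both on the same scale, e.g. percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) as given on the slide
results = {
    "Wikipedia": (79.3, 61.3, 69.1),
    "MeSH + Disease Ontology": (91.6, 67.1, 77.4),
    "MeSH + Disease Ontology + Wikipedia": (85.1, 73.1, 78.6),
}

for name, (p, r, reported) in results.items():
    print(f"{name}: recomputed {f1_score(p, r):.2f}, reported {reported}")
```

Adding Wikipedia to the MeSH + Disease Ontology dictionary trades precision (91.6% to 85.1%) for recall (67.1% to 73.1%), with a small net gain in F1-score.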
Examples of Wikipedia false positives
Wikipedia term        Mapped to              Annotated in corpus?
metastatic            Neoplasm Metastasis    No, but the name of the cancer following the term is
albino                Albinism               No
excessive sweating    Hyperhidrosis          No
ulceration            Ulcer                  No
dental pain           Toothache              Only “pain”
Anicteric             Jaundice               Correctly is not (term means “without jaundice”)
Concept-level DNER Results
Precision Recall F1-score
86.08% 86.17% 86.12%
Chemical-induced disease relationships
• Patterns where the chemical preceded the disease:
– Chemical <caused>
– Chemical Disease
– Chemical <related to>
– <negative effects caused by> chemical
– <relationship between> chemical <and>
• Patterns where the chemical followed the disease:
– Disease <caused by>
– Disease <after or during>
– Disease <after or while taking>
– Disease <in person taking>
– Disease <effect of>
– Disease <related to>
– Disease <complications of>
– <induction of> Disease <by or with>
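A rough sketch of how a few of these patterns could be applied once chemicals and diseases have been recognized. The slide names only the pattern categories in angle brackets, so the trigger phrases in the regular expressions below are assumptions made for this example, as are the `find_cid` function and the example sentences.

```python
import re

# Pattern category -> assumed trigger phrases (the slide lists only the categories)
CHEMICAL_FIRST = {
    "caused": r"(?:caused|induced|produced)",
    "related to": r"(?:is |was )?(?:related to|associated with)",
}
DISEASE_FIRST = {
    "caused by": r"(?:caused|induced) by",
    "after or during": r"(?:after|during)",
    "effect of": r"effect of",
}


def find_cid(sentence, chemicals, diseases):
    """Return (chemical, disease, pattern) for each adjacent chemical/trigger/disease match."""
    relations = []
    for chem in chemicals:
        for dis in diseases:
            c, d = re.escape(chem), re.escape(dis)
            for name, trigger in CHEMICAL_FIRST.items():
                if re.search(rf"\b{c}\s+{trigger}\s+{d}\b", sentence, flags=re.I):
                    relations.append((chem, dis, name))
            for name, trigger in DISEASE_FIRST.items():
                if re.search(rf"\b{d}\s+{trigger}\s+{c}\b", sentence, flags=re.I):
                    relations.append((chem, dis, name))
    return relations


print(find_cid("Lidocaine induced cardiac asystole in two patients.",
               ["lidocaine"], ["cardiac asystole"]))
print(find_cid("Hypotension after propofol was observed.",
               ["propofol"], ["hypotension"]))
```

Requiring the trigger to sit directly between the two entities keeps the sketch simple and favours precision over recall.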
CID results
Type                                         Precision   Recall   F-measure
Pattern-based                                61.0%       35.9%    45.2%
Pattern-based + recall boosting heuristic    52.6%       51.8%    52.2%
Conclusions
• Dictionary/grammar based recognition can match machine learning where relating entities to concepts is required
• Simple pattern-based CID identification seems to work surprisingly well
• Speed is often orders of magnitude better
• Wikipedia is an excellent source of adjectival and common names for diseases
Thank you for your time!
For the duration of the conference our web service can be tried out at:
http://nmsoftware.ddns.net:8080/leadminecdr/cdr.html
http://nextmovesoftware.com
http://nextmovesoftware.com/blog