Use of Wikipedia as a resource for disease identification and concept mapping

BioCreative V, Sevilla, Spain 10th September 2015 Use of Wikipedia as a resource for disease identification and concept mapping Daniel Lowe, Noel O’Boyle and Roger Sayle NextMove Software

Page 1: Use of Wikipedia as a resource for disease identification and concept mapping

BioCreative V, Sevilla, Spain 10th September 2015

Use of Wikipedia as a resource for disease identification and concept mapping

Daniel Lowe, Noel O’Boyle and Roger Sayle

NextMove Software

Page 2: Use of Wikipedia as a resource for disease identification and concept mapping

But first…

Page 3: Use of Wikipedia as a resource for disease identification and concept mapping

DNER response times

[Bar chart: response time (milliseconds) for each submitted run, bars labelled by run identifier (e.g. 304#2), axis from 0 to 25,000. Highlighted value: 45 ms]

Page 4: Use of Wikipedia as a resource for disease identification and concept mapping

CID response times

[Bar chart: response time (milliseconds) for each submitted run, bars labelled by run identifier (e.g. 304#1), axis from 0 to 30,000. Highlighted value: 97 ms]

Page 5: Use of Wikipedia as a resource for disease identification and concept mapping

4 minutes of CPU time: 0.17¢

Page 6: Use of Wikipedia as a resource for disease identification and concept mapping

NextMove Software’s LeadMine

• Dictionaries and grammars (used in chemical recognition) matched as state machines
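A minimal sketch of dictionary matching as a state machine, assuming a plain character trie scanned for the longest match at each position. The names `build_trie` and `find_terms` and the toy dictionary are hypothetical; the slides do not describe how LeadMine builds its automata.

```python
END = ""  # accepting-state marker; a single text character can never equal ""


def build_trie(terms):
    """Build a character trie from {term: concept_id}; each node is a dict."""
    root = {}
    for term, concept in terms.items():
        node = root
        for ch in term.lower():
            node = node.setdefault(ch, {})
        node[END] = concept
    return root


def find_terms(text, trie):
    """Scan character by character and report the longest match at each start.

    Assumes lowercasing preserves string length (true for ASCII text).
    """
    lowered = text.lower()
    matches = []
    i = 0
    while i < len(lowered):
        node, j, best = trie, i, None
        while j < len(lowered) and lowered[j] in node:
            node = node[lowered[j]]
            j += 1
            if END in node:
                best = (i, j, node[END])
        if best:
            matches.append((text[best[0]:best[1]], best[2]))
            i = best[1]  # continue after the matched term
        else:
            i += 1
    return matches


# Toy dictionary using the MeSH IDs quoted later in the slides.
TRIE = build_trie({"heart disease": "D006331", "lung disease": "D008171"})
```

Calling `find_terms("Heart disease and lung disease", TRIE)` returns both terms with their IDs. There is no separate tokenisation pass: the automaton itself decides where a term starts and ends.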

Page 7: Use of Wikipedia as a resource for disease identification and concept mapping

• No tokenisation step – What is recognized determines whether a character is delimiting

• No machine learning – Every term found has an explanation for why it was detected, i.e. which dictionary found it

• Spelling correction – Fuzzy matching where corrected term is used for normalization (Damerau–Levenshtein distance with parametrised costs)

• Rule based abbreviation detection – Schwartz and Hearst algorithm with more patterns
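The spelling-correction bullet can be illustrated with a weighted edit distance. This is a sketch of the restricted Damerau–Levenshtein variant (optimal string alignment), with one cost parameter per operation. The function name, parameter names and default costs are assumptions, not LeadMine's actual settings.

```python
def weighted_osa_distance(a, b, ins=1.0, dele=1.0, sub=1.0, swap=1.0):
    """Cost of turning string a into string b.

    ins/dele/sub/swap are the costs of insertion, deletion, substitution
    and transposition of two adjacent characters.
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        d[i][0] = i * dele
    for j in range(1, cols):
        d[0][j] = j * ins
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0.0 if a[i - 1] == b[j - 1] else sub
            d[i][j] = min(
                d[i - 1][j] + dele,      # delete a[i-1]
                d[i][j - 1] + ins,       # insert b[j-1]
                d[i - 1][j - 1] + cost,  # match or substitute
            )
            # Adjacent transposition, e.g. "ci" <-> "ic".
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + swap)
    return d[rows - 1][cols - 1]
```

With unit costs, `weighted_osa_distance("jaundcie", "jaundice")` is a single transposition. Lowering `swap` makes such typos cheaper to correct than substitutions, which is one way costs could be parametrised.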

Page 8: Use of Wikipedia as a resource for disease identification and concept mapping

Dictionary Preparation

• Manually curated dictionary

• MeSH

• Disease Ontology

• Terms from training/development sets

• Wikipedia

• Variants of terms added

– tumor ↔ tumour (synonyms)

– cancer → carcinoma (subtype of synonym)

→ 119,737 unique terms
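A sketch of how such variants might be generated, assuming the two-way arrow means both spellings are produced and the one-way arrow means only terms containing "cancer" gain a "carcinoma" variant. The rule tables and `expand_variants` are hypothetical names for illustration.

```python
import re

# Interchangeable spellings: rewritten in both directions.
SYNONYM_SPELLINGS = [("tumor", "tumour")]
# One-directional rewrites: left-hand side gains the right-hand variant only.
SUBTYPE_REWRITES = [("cancer", "carcinoma")]


def expand_variants(term):
    """Return the term plus the variants produced by a single rule application."""
    variants = {term}
    for left, right in SYNONYM_SPELLINGS:
        variants.add(re.sub(r"\b" + re.escape(left), right, term))
        variants.add(re.sub(r"\b" + re.escape(right), left, term))
    for general, specific in SUBTYPE_REWRITES:
        variants.add(re.sub(r"\b" + re.escape(general), specific, term))
    return variants
```

For example, `expand_variants("lung tumor")` yields both spellings, while `expand_variants("breast carcinoma")` returns only the original term because the rewrite is one-directional.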

Page 9: Use of Wikipedia as a resource for disease identification and concept mapping

Composite entity recognition

Heart and lung disease

→ Recognized entity

→ Identify preceding “and” or “or”

→ Potential entity constructed: Heart disease

→ Is it recognized?

→ Report MeSH terms for all underlying entities: heart disease (D006331), lung disease (D008171)
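The flow above can be sketched as follows. The sketch assumes the recognized entity is "lung disease", that the shared head is its last word, and that the modifier is the single word before "and"/"or". `resolve_composite` and the toy dictionary are hypothetical.

```python
import re

# Toy dictionary using the MeSH IDs shown on the slide.
DICTIONARY = {"heart disease": "D006331", "lung disease": "D008171"}


def resolve_composite(text, start, end, dictionary):
    """Expand a recognized entity at text[start:end] preceded by 'X and/or'.

    Returns (name, MeSH ID) pairs for every underlying entity found.
    """
    entity = text[start:end]
    results = []
    # Look for a single word followed by "and"/"or" just before the entity.
    match = re.search(r"(\w+)\s+(?:and|or)\s+$", text[:start])
    if match:
        head = entity.split()[-1]               # e.g. "disease"
        candidate = f"{match.group(1)} {head}"  # e.g. "Heart disease"
        if candidate.lower() in dictionary:     # "Is it recognized?"
            results.append((candidate, dictionary[candidate.lower()]))
    results.append((entity, dictionary[entity.lower()]))
    return results


text = "Heart and lung disease"
pairs = resolve_composite(text, text.index("lung"), len(text), DICTIONARY)
```

Here `pairs` holds both heart disease and lung disease with their MeSH IDs. If the constructed candidate is not in the dictionary, only the originally recognized entity is reported.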

Page 10: Use of Wikipedia as a resource for disease identification and concept mapping

Wikipedia for diseases

Page 11: Use of Wikipedia as a resource for disease identification and concept mapping

Linking Wikipedia to MeSH

Page 12: Use of Wikipedia as a resource for disease identification and concept mapping

Extraction of synonyms

• Methodology 1

– MeSH IDs corresponding to diseases determined (MeSH trees C and F03)

– Find Wikipedia pages with disease/symptom boxes that contain one of these MeSH IDs.

– Associate the page title and all redirects with that MeSH ID

• Methodology 2

– Find pages whose name matches a MeSH synonym

– Associate all redirects to that page with the MeSH ID of the aforementioned synonym
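Both methodologies can be sketched over an in-memory list of pages. The page records below (fields `title`, `redirects`, `infobox_mesh`) are an invented toy representation, not a Wikipedia dump format or API, and the redirect names are made up for illustration.

```python
# Toy page records; structure and redirect names are hypothetical.
PAGES = [
    {"title": "Lung disease", "redirects": ["Pulmonary disease"], "infobox_mesh": "D008171"},
    {"title": "Heart disease", "redirects": ["Cardiac disease"], "infobox_mesh": None},
]


def methodology_1(pages, disease_mesh_ids):
    """Pages whose infobox carries a disease MeSH ID: title and redirects get that ID."""
    pairs = set()
    for page in pages:
        mesh_id = page.get("infobox_mesh")
        if mesh_id in disease_mesh_ids:
            for name in [page["title"], *page["redirects"]]:
                pairs.add((name, mesh_id))
    return pairs


def methodology_2(pages, mesh_synonyms):
    """Pages whose title matches a MeSH synonym: redirects get that synonym's ID."""
    pairs = set()
    for page in pages:
        mesh_id = mesh_synonyms.get(page["title"].lower())
        if mesh_id is not None:
            for name in page["redirects"]:
                pairs.add((name, mesh_id))
    return pairs
```

Methodology 1 relies on the infobox and so misses pages without one. Methodology 2 catches those pages as long as the title itself is already a known MeSH synonym.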

Page 13: Use of Wikipedia as a resource for disease identification and concept mapping

Resultant dictionary

• 31,699 disease name/MeSH ID relationships extracted

• 20,611 not present in our pre-existing MeSH/Disease Ontology derived dictionary

• Some Disease Ontology terms that did not refer to MeSH could be assigned MeSH IDs

Page 14: Use of Wikipedia as a resource for disease identification and concept mapping

Issues

• Redirects are not semantic: rather than being a synonym, a redirect can be a related concept, e.g.

– Treatment of the disease

– Detection of the disease

– Particular outbreak of the disease

• Redirecting to a section of a page can be a related concept, but can also be a sub-type of the disease

• Difference in classification granularity – heart disease redirects to the page on Cardiovascular disease

Page 15: Use of Wikipedia as a resource for disease identification and concept mapping

Garbage in, Garbage out

• Original dictionary had “gambling” (MeSH has the same concept ID for pathological gambling (mental disorder) and gambling (specific instance of risk-taking behaviour))

• Hence Wikipedia allowed all terms related to gambling to be retrieved, e.g. gambler, gamble, gambling den…

Page 16: Use of Wikipedia as a resource for disease identification and concept mapping

Results (concepts) (BioCreative V CDR training + development set)

Type                                   Precision   Recall   F1-score
Wikipedia                              79.3%       61.3%    69.1%
MeSH + Disease Ontology                91.6%       67.1%    77.4%
MeSH + Disease Ontology + Wikipedia    85.1%       73.1%    78.6%
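For reference, the F1-score is the harmonic mean of precision and recall. A minimal sketch follows; `f1_score` is an illustrative helper, not the official evaluation script.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units in and out)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Recomputing from the rounded percentages in the table above.
wikipedia_f1 = f1_score(79.3, 61.3)
combined_f1 = f1_score(85.1, 73.1)
```

Values recomputed from the rounded table entries can differ from the reported scores in the last decimal place.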

Page 17: Use of Wikipedia as a resource for disease identification and concept mapping

Examples of Wikipedia false positives

Wikipedia term        Mapped to              Annotated in corpus?
metastatic            Neoplasm Metastasis    No, but the name of the cancer following the term is
albino                Albinism               No
excessive sweating    Hyperhidrosis          No
ulceration            Ulcer                  No
dental pain           Toothache              Only “pain”
anicteric             Jaundice               Correctly not annotated (term means “without jaundice”)

Page 18: Use of Wikipedia as a resource for disease identification and concept mapping

Concept-level DNER Results

Precision Recall F1-score

86.08% 86.17% 86.12%

Page 19: Use of Wikipedia as a resource for disease identification and concept mapping

Chemical-induced disease relationships

Page 20: Use of Wikipedia as a resource for disease identification and concept mapping

Chemical-induced disease relationships

Page 21: Use of Wikipedia as a resource for disease identification and concept mapping

• Patterns where the chemical preceded the disease:

– Chemical <caused>

– Chemical Disease

– Chemical <related to>

– <negative effects caused by> chemical

– <relationship between> chemical <and>

• Patterns where the chemical followed the disease:

– Disease <caused by>

– Disease <after or during>

– Disease <after or while taking>

– Disease <in person taking>

– Disease <effect of>

– Disease <related to>

– Disease <complications of>

– <induction of> Disease <by or with>
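A rough sketch of how a few of these patterns could be applied once chemical and disease mentions are located. The trigger phrases in the two regular expressions are invented stand-ins for classes such as <caused> and <caused by>, and the entity IDs are placeholders. The real pattern set is not given in the slides.

```python
import re

# Hypothetical trigger phrases for "Chemical <caused> Disease" and "Chemical Disease".
CHEM_FIRST = re.compile(r"^[\s-]*(?:(?:induced|caused|causes|related to)\s+)?$", re.I)
# Hypothetical trigger phrases for "Disease <caused by> Chemical" and similar.
DISEASE_FIRST = re.compile(r"^\s+(?:caused by|induced by|after|during|related to)\s+$", re.I)


def find_relations(text, chemicals, diseases):
    """Return (chemical_id, disease_id) pairs whose connecting text matches a pattern.

    chemicals and diseases are lists of (start, end, id) spans within one sentence.
    """
    relations = set()
    for c_start, c_end, c_id in chemicals:
        for d_start, d_end, d_id in diseases:
            if c_end <= d_start and CHEM_FIRST.match(text[c_end:d_start]):
                relations.add((c_id, d_id))
            elif d_end <= c_start and DISEASE_FIRST.match(text[d_end:c_start]):
                relations.add((c_id, d_id))
    return relations


def span(text, word, entity_id):
    """Helper: locate the first occurrence of word and return its (start, end, id)."""
    start = text.index(word)
    return (start, start + len(word), entity_id)
```

For "Lithium-induced tremor" the connecting text is "-induced ", which the chemical-first pattern accepts. For "Tremor caused by lithium" the disease-first pattern applies instead.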

Page 22: Use of Wikipedia as a resource for disease identification and concept mapping

CID results

Type                                         Precision   Recall   F-measure
Pattern-based                                61.0%       35.9%    45.2%
Pattern-based + recall boosting heuristic    52.6%       51.8%    52.2%

Page 23: Use of Wikipedia as a resource for disease identification and concept mapping

Conclusions

• Dictionary/grammar based recognition can match machine learning where relating entities to concepts is required

• Simple pattern-based CID identification seems to work surprisingly well

• Speed is often orders of magnitude better

• Wikipedia is an excellent source of adjectival and common names for diseases

Page 24: Use of Wikipedia as a resource for disease identification and concept mapping

Thank you for your time!

For the duration of the conference our web service can be tried out at:

http://nmsoftware.ddns.net:8080/leadminecdr/cdr.html

http://nextmovesoftware.com

http://nextmovesoftware.com/blog
