disambiguation of biomedical text


Post on 25-Feb-2016

Page 1: Disambiguation  of Biomedical Text

Disambiguation of Biomedical Text

Mark Stevenson

Natural Language Processing Group

University of Sheffield, UK

http://www.dcs.shef.ac.uk/~marks

Joint work with: Yikun Guo and Robert Gaizauskas (University of Sheffield)

and David Martinez (University of Melbourne)

Page 2: Disambiguation  of Biomedical Text

Outline

• Ambiguity in biomedical documents
• Disambiguation
  – Knowledge sources
• Evaluation
• Semi-supervised acquisition of additional training data

Page 3: Disambiguation  of Biomedical Text

Text in Biomedical Domain

• The literature on biomedicine and the life sciences is vast and growing rapidly
• Promising domain for text processing
• Search engines necessary
• Opportunities for knowledge discovery

Page 4: Disambiguation  of Biomedical Text

Ambiguity

• Lexical ambiguity makes text processing more difficult
• Generally believed that ambiguities do not occur within domains
  – One Sense per Discourse (Gale, Church and Yarowsky, 1992)
  – “there is a very strong tendency (98%) for multiple uses of a word to share the same sense in a well-written discourse”

Page 5: Disambiguation  of Biomedical Text

cell

Page 6: Disambiguation  of Biomedical Text

culture

“In peripheral blood mononuclear cell culture streptococcal erythrogenic toxins are able to stimulate tryptophan degradation in humans.”

International Allergy Immunology

“The aim of this paper is to describe the origins, initial steps and strategy, current progress and main accomplishments of introducing a quality management culture within the healthcare system in Poland.”

International Journal of Qualitative Health Care

Page 7: Disambiguation  of Biomedical Text

Extent of Ambiguity Problem

• Weeber et al. (2001)
  – Estimated that 11.7% of the phrases in abstracts added to MEDLINE in 1998 were ambiguous
• Ambiguity is the biggest challenge in automating the indexing of MEDLINE and a hindrance to automated knowledge discovery (Weeber et al. 2001; Nadkarni et al. 2001; Aronson 2001)

Page 8: Disambiguation  of Biomedical Text

WSD System

• Supervised learning approach
• Extension of Basque Country University’s Senseval-3 system (Agirre and Martinez, 2004)
• Combines range of knowledge sources
• Previous work has shown that combining knowledge sources is an effective approach to WSD

Page 9: Disambiguation  of Biomedical Text

Features

1. General
  – Wide range of features which are commonly used by WSD systems
2. Domain specific
  – Two knowledge sources specific to biomedical domain

Page 10: Disambiguation  of Biomedical Text

Example

• “Body surface area adjustments of initial heparin dosing …”

1. Individual Adjustment
“By the fast (2.5mph) ambulation trial, both groups were performing equally, suggesting a rapid rate of adjustment to the device.”
2. Adjustment Action
“Clinically, these four patients had mild symptoms which improved with dietary adjustment.”
3. Psychological adjustment
“Predictors of patients' mental adjustment to cancer: patient characteristics and social support.”

Page 11: Disambiguation  of Biomedical Text

General Features (1)

• Local collocations
  – Bigrams and trigrams containing ambiguous word constructed from lemmas, word forms and PoS tags
    • left-content-word-lemma “area adjustment”
    • right-function-word-lemma “adjustment of”
    • left-POS “NN NNS”
    • right-POS “NNS IN”
    • left-content-word-form “area adjustments”
    • right-function-word-form “adjustment of”
  – First noun, verb, adjective and adverb preceding and following ambiguous word (lemma and word form)
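The collocation features can be illustrated with a small sketch that collects every bigram and trigram containing the ambiguous word. It is a simplification under stated assumptions: the function name `local_collocations` is hypothetical, the input is already lemmatised, and the content-word/function-word and PoS variants listed on the slide are not distinguished.

```python
def local_collocations(tokens, target_index):
    """Return all bigrams and trigrams that include tokens[target_index]."""
    features = []
    for n in (2, 3):
        # Slide an n-token window over every position that covers the target.
        for start in range(target_index - n + 1, target_index + 1):
            end = start + n
            if start >= 0 and end <= len(tokens):
                features.append(" ".join(tokens[start:end]))
    return features


# Lemmas of the example phrase; "adjustment" (index 3) is the ambiguous word.
lemmas = ["body", "surface", "area", "adjustment", "of", "initial", "heparin", "dosing"]
collocations = local_collocations(lemmas, 3)
print(collocations)
```

The same function can be run over word forms or PoS tags instead of lemmas to obtain the other variants.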

Page 12: Disambiguation  of Biomedical Text

General Features (2)

• Syntactic dependencies
  – Five relations: subject, object, noun-modifier, preposition and sibling
• Salient bigrams
  – Salient bigrams in abstract
• Unigrams
  – Lemmas of all content words in the abstract and 8 word window around target word
  – Lemmas of unigrams which appear frequently in entire corpus

Page 13: Disambiguation  of Biomedical Text

Concept Unique Identifiers (CUIs)

• CUIs refer to UMLS concepts
• MetaMap segments text and identifies possible CUIs for each phrase

"Body surface area adjustments"

C0005902:Body Surface Area [Diagnostic Procedure]

C1261466:Body surface area [Organism Attribute]

C0456081:Adjustments (Adjustment Action) [Health Care Activity]

C0376209:Adjustments (Individual Adjustment) [Individual Behavior]

"of initial heparin dosing"

C0205265:Initial (Initially) [Temporal Concept]

C1555582:initial [Idea or Concept]

C0019134:Heparin [Biologically Active Substance,Carbohydrate]

Page 14: Disambiguation  of Biomedical Text

Medical Subject Headings (MeSH)

• Controlled vocabulary for indexing life science publications
• Contains over 24,000 headings organised into an 11 level hierarchy
• Use MeSH terms assigned to abstract containing ambiguous term

M01.060.116.100: “Aged”
M01.060.116.100.080: “Aged, 80 and over”
D27.505.954.502.119: “Anticoagulants”
G09.188.261.560.150: “Blood Coagulation”

Page 15: Disambiguation  of Biomedical Text

Learning Algorithms

1. Vector Space Model
  – Simple memory-based learning algorithm
2. Naïve Bayes
3. Support Vector Machine

• Weka implementations
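The slides do not spell out the Vector Space Model learner, so the following is only a sketch of one common memory-based formulation, not the system's actual implementation. It assumes each sense is stored as a vector of summed feature counts and a test instance receives the sense whose vector has the highest cosine similarity. The names `train`, `classify` and `cosine` and the toy features are hypothetical.

```python
import math
from collections import Counter, defaultdict


def cosine(a, b):
    """Cosine similarity between two sparse feature-count vectors."""
    dot = sum(a[f] * b[f] for f in a if f in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def train(examples):
    """Sum the feature counts of all training examples for each sense."""
    vectors = defaultdict(Counter)
    for features, sense in examples:
        vectors[sense].update(features)
    return vectors


def classify(vectors, features):
    """Return the sense whose stored vector is most similar to the instance."""
    instance = Counter(features)
    return max(vectors, key=lambda sense: cosine(instance, vectors[sense]))


# Toy training data (illustrative only, not drawn from NLM-WSD).
training = [
    (["cell", "protein", "gene"], "laboratory culture"),
    (["ethnic", "practice", "cultural"], "anthropological culture"),
]
model = train(training)
predicted = classify(model, ["protein", "cell"])
print(predicted)
```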

Page 16: Disambiguation  of Biomedical Text

NLM-WSD data set

• Standard evaluation corpus for WSD in biomedical domain (“Biomedical SemEval”)
• Contains 50 highly ambiguous terms frequently found in Medline
• 100 instances of each term manually disambiguated with UMLS concepts by a team of annotators
• Most frequent sense (MFS) baseline accuracy of 78%
• Average of 2.64 possible meanings per term
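The most frequent sense baseline assigns every instance of a term its most common sense, so per-term accuracy is simply the share of instances carrying that sense. A minimal sketch, assuming the corpus figure is the unweighted mean over terms (each term has the same number of instances); the function names and the toy labels are hypothetical and are not NLM-WSD data.

```python
from collections import Counter


def mfs_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)


def mean_mfs_accuracy(labels_by_term):
    """Unweighted mean of the per-term MFS accuracies."""
    per_term = [mfs_accuracy(labels) for labels in labels_by_term.values()]
    return sum(per_term) / len(per_term)


# Toy gold-standard labels for two hypothetical terms.
gold = {
    "term_a": ["s1", "s1", "s1", "s2"],
    "term_b": ["s1", "s1", "s2", "s2"],
}
baseline = mean_mfs_accuracy(gold)
print(baseline)
```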

Page 17: Disambiguation  of Biomedical Text

Results

      General  CUI   MeSH  CUI+MeSH  Ling+MeSH  Ling+CUI  All
VSM   87.0     85.8  81.9  86.9      87.9       87.3      87.5
NB    86.4     81.2  85.7  81.1      86.4       81.7      81.8
SVM   85.9     83.5  85.3  84.5      86.2       85.3      86.0

• Combination of linguistic features with MeSH terms significantly better than any features used alone

• VSM significantly better than other learning algorithms

Page 18: Disambiguation  of Biomedical Text

[Figure: diagram showing which NLM-WSD terms belong to the evaluation subsets used by Liu et al. (2004), Leroy and Rindflesch (2005) and Joshi et al. (2005), and which are common to all three. Subset annotations: “Dominant sense < 90%”, “Removed low IAA”, “Dominant sense < 65%”.]

Terms shown: adjustment, association, blood pressure, cold, condition, culture, degree, depression, determination, discharge, energy, evaluation, extraction, failure, fat, fit, fluid, frequency, ganglion, glucose, growth, immunosuppression, implantation, inhibition, japanese, lead, man, mole, mosaic, nutrition, pathology, pressure, radiation, reduction, repair, resistance, scale, secretion, sensitivity, sex, single, strains, support, surgery, transient, transport, ultrasound, variation, weight, white

Page 19: Disambiguation  of Biomedical Text

Approach (columns): MFS, Liu et al. (2004), Leroy & Rindflesch (2005), Joshi et al. (2005), McInnes et al. (2007), Reported (General + MeSH)

Term subset (rows):

All words  78.0  85.3  87.9
Joshi      66.9  82.5  80.0  83.3
Leroy      55.3  65.5  77.4  74.5  79.7
Liu        69.9  78.0  84.9  82.0  84.8
Common     54.9  68.8  79.8  75.7  81.1

Page 20: Disambiguation  of Biomedical Text

Automatic Example Generation

• Various approaches to generating sense tagged examples without the need for manual annotation
  – Monosemous relatives (Leacock et al. 1998)
  – Translations as sense definitions (Ng et al. 2003)
• All unsupervised but require external knowledge sources (e.g. WordNet or parallel text)
• Alternative semi-supervised approach

Page 21: Disambiguation  of Biomedical Text

Relevance Feedback

• Method for improving search results based on analysis of retrieved documents

[Figure: relevance feedback loop linking Query, Retrieved documents, Relevance judgements and Modified query]

Page 22: Disambiguation  of Biomedical Text

• Common approach to relevance feedback for vector space model (Rocchio, 1971)

$$q_m = \alpha q + \frac{\beta}{|D^+_q|} \sum_{d \in D^+_q} d - \frac{\gamma}{|D^-_q|} \sum_{d \in D^-_q} d$$

qm = modified query vector
q = original query vector
D+q = set of vectors representing known relevant documents
D-q = set of vectors representing known irrelevant documents
α,β,γ = weights
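In its usual textbook form, Rocchio's update scales the original query by α, adds β times the mean of the relevant document vectors and subtracts γ times the mean of the irrelevant ones. The sketch below applies that form to sparse term-weight dictionaries. The function name `rocchio` and the default weights are illustrative assumptions and are not taken from the slides.

```python
from collections import defaultdict


def rocchio(query, relevant, irrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Return the modified query vector as a term -> weight dictionary."""
    modified = defaultdict(float)
    for term, weight in query.items():
        modified[term] += alpha * weight
    # Add the mean relevant vector and subtract the mean irrelevant vector.
    for docs, factor in ((relevant, beta), (irrelevant, -gamma)):
        for doc in docs:
            for term, weight in doc.items():
                modified[term] += factor * weight / len(docs)
    return dict(modified)


# Toy vectors with arbitrary weights.
q = {"culture": 1.0}
relevant_docs = [{"culture": 1.0, "protein": 1.0}]
irrelevant_docs = [{"culture": 1.0, "ethnic": 1.0}]
q_m = rocchio(q, relevant_docs, irrelevant_docs, alpha=1.0, beta=0.5, gamma=0.25)
print(q_m)
```

Terms that occur mainly in relevant documents gain weight in the modified query, while terms typical of irrelevant documents end up with negative weight.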

Page 23: Disambiguation  of Biomedical Text

Acquiring Sense Tagged Examples

• Treat set of sense tagged examples as retrieved documents
  – Examples tagged with sense considered relevant, all other examples considered irrelevant
• For each sense, identify additional query terms which tend to discriminate examples tagged with that sense from those tagged with other senses
• Search for documents matching this extended query

Page 24: Disambiguation  of Biomedical Text

Identifying Query Terms

• Compute score for each term in the sense-tagged documents against each sense

count(t,d) = frequency of term t in document d
idf(t) = inverse document frequency of t
D+s = set of examples of target sense
D-s = set of examples of other senses
α,β = weights
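The scoring formula itself did not survive in this transcript. By analogy with Rocchio's formula, one plausible form, assumed here and not confirmed by the slides, is score(t,s) = idf(t) × (α · mean count of t over D+s − β · mean count of t over D-s). The sketch below implements that assumed form with idf(t) = log(N / df(t)) computed over the sense-tagged examples. The function name `term_scores` and the toy documents are hypothetical.

```python
import math
from collections import Counter


def term_scores(target_docs, other_docs, alpha=1.0, beta=1.0):
    """Score each term for the target sense (assumed Rocchio-style form).

    Both arguments are non-empty lists of tokenised documents.
    """
    all_docs = target_docs + other_docs
    # Document frequency: number of examples containing the term.
    df = Counter(term for doc in all_docs for term in set(doc))
    scores = {}
    for term, doc_freq in df.items():
        idf = math.log(len(all_docs) / doc_freq)
        mean_pos = sum(doc.count(term) for doc in target_docs) / len(target_docs)
        mean_neg = sum(doc.count(term) for doc in other_docs) / len(other_docs)
        scores[term] = idf * (alpha * mean_pos - beta * mean_neg)
    return scores


# Toy examples for the 'laboratory culture' sense versus the other sense.
laboratory = [["culture", "protein", "gene"], ["culture", "protein"]]
other = [["culture", "ethnic"]]
scores = term_scores(laboratory, other)
print(sorted(scores, key=scores.get, reverse=True))
```

Under this form a term appearing in every example, such as the target word itself, has an idf of zero and therefore never becomes a query term.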

Page 25: Disambiguation  of Biomedical Text

Terms for two senses of “culture”

‘anthropological culture’       ‘laboratory culture’
cultural         26.17          suggest       6.32
recommendation   14.82          protein       6.13
force            14.80          presence      5.86
ethnic           14.79          demonstrate   5.86
practice         14.76          analysis      5.78
man              14.76          gene          5.58

Page 26: Disambiguation  of Biomedical Text

Example Collection

• Identify examples by querying Medline via online interface
• Preserve bias in original sense distribution
  – For example, if 75% of usages are ‘laboratory culture’ and 25% ‘anthropological culture’ then ensure same 75:25 split in retrieved examples
• Use eight highest scoring terms (score(t,s)) for each sense
• Relax queries until enough examples can be retrieved:
  culture AND (suggest AND protein AND presence)
  culture AND ((suggest AND protein) OR (suggest AND presence) OR (protein AND presence))
  culture AND (suggest OR protein OR presence)
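The relaxation sequence can be generated mechanically: require all query terms, then any k of them for decreasing k, and finally any one. The sketch below reproduces the three queries shown for three terms. Generalising this "k of n" pattern to the eight terms used per sense is an assumption, and the function name `relaxed_queries` is hypothetical.

```python
from itertools import combinations


def relaxed_queries(target, terms):
    """Yield Boolean queries for `target`, from strictest to loosest."""
    for k in range(len(terms), 0, -1):
        # Each group requires k of the terms to co-occur.
        groups = [" AND ".join(combo) for combo in combinations(terms, k)]
        if k == 1:
            clause = " OR ".join(groups)
        elif len(groups) == 1:
            clause = groups[0]
        else:
            clause = " OR ".join(f"({group})" for group in groups)
        yield f"{target} AND ({clause})"


queries = list(relaxed_queries("culture", ["suggest", "protein", "presence"]))
for query in queries:
    print(query)
```

With eight terms the middle levels produce many groups (70 four-term combinations, for instance), so a practical version would stop at the first level that returns enough documents.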

Page 27: Disambiguation  of Biomedical Text

Experiments

• 10-fold cross validation
  – Training portion (90 examples) analysed to generate additional examples
  – Generated four sets for each term: 90, 180, 270 and 360 examples

• Combine automatically generated examples with training portion (+90, +180, +270, +360)

• Automatically generated examples alone (90, 180, 270, 360)

Page 28: Disambiguation  of Biomedical Text

Performance

Basic: 87.9

Combined          +90    +180   +270   +360
                  89.6   88.6   88.0   88.0

Additional only   90     180    270    360
                  88.4   87.9   87.5   87.3

Page 29: Disambiguation  of Biomedical Text

Individual Terms

term             basic   90   difference
blood pressure   53      66   13
reduction        88      96   8
repair           86      92   6
mole             88      94   6
ultrasound       88      94   6

white            81      72   -9
weight           82      71   -11
degree           93      81   -12
evaluation       81      69   -12

Page 30: Disambiguation  of Biomedical Text

Conclusion

• Ambiguity real problem in biomedical domain

• Domain specific knowledge improves WSD performance

• Relevance feedback can be used to acquire additional training examples and further improve performance

Page 31: Disambiguation  of Biomedical Text

More Information

• This work has been funded by EPSRC grants BioWSD and CASTLE

http://nlp.shef.ac.uk/BioWSD/
