Special Topics in Computer Science: NLP in a Nutshell - nlpcl.kaist.ac.kr/~cs492/lecture16.pdf

Special Topics in Computer Science: NLP in a Nutshell. CS492B, Spring Semester 2009. Speaker: Hee-Jin Lee. Professor: Jong C. Park. Computer Science Department, Korea Advanced Institute of Science and Technology.



Special Topics in Computer Science

NLP in a Nutshell
CS492B Spring Semester 2009

Speaker: Hee-Jin Lee
Professor: Jong C. Park
Computer Science Department
Korea Advanced Institute of Science and Technology


TEXT MINING APPLICATIONS: INFORMATION EXTRACTION


Contents

Information Extraction: What? and Why?
Approaches to Information Extraction
Information Extraction Challenges
Application: Literature-based Discovery
Conclusion

CS492 Special topics in computer science ‐NLP in a Nutshell 3


Information Extraction (IE)

What is done by IE?
  Take a natural language text from a document source, and extract essential facts about one or more predefined fact types.
  Represent each fact with a template whose slots are filled on the basis of what is found in the text.

Example
  We have previously shown that ETS1 can activate GM-CSF in Jurkat T cells.
  Activate(ETS1, GM-CSF)
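The template idea can be sketched in a few lines of Python. This is only an illustration: the `ActivateFact` class, the regular expression, and the function name are assumptions made for this example, and a real IE system would rely on named-entity recognition and parsing rather than a single surface pattern.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class ActivateFact:
    """Template for the predefined fact type Activate(agent, target)."""
    agent: str
    target: str


# Hypothetical surface pattern: "<agent> can activate <target>" or "<agent> activates <target>".
# Entity names are approximated as runs of letters, digits, and hyphens.
ACTIVATE_PATTERN = re.compile(
    r"(?P<agent>[A-Za-z0-9-]+)\s+(?:can\s+)?activates?\s+(?P<target>[A-Za-z0-9-]+)"
)


def extract_activate(sentence: str) -> Optional[ActivateFact]:
    """Fill the Activate template from one sentence, or return None if nothing matches."""
    match = ACTIVATE_PATTERN.search(sentence)
    if match is None:
        return None
    return ActivateFact(agent=match.group("agent"), target=match.group("target"))


sentence = "We have previously shown that ETS1 can activate GM-CSF in Jurkat T cells."
fact = extract_activate(sentence)
print(fact)  # ActivateFact(agent='ETS1', target='GM-CSF')
```

The two slots of the template (`agent`, `target`) are filled only from what the pattern finds in the text; a sentence with no match yields no fact.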


Information Extraction (IE)

IE vs. IR

Information Retrieval (IR) | Information Extraction (IE)
Returns documents. | Returns facts.
Is a classification task (each document is relevant/not relevant to a query). | Is an application of natural language processing, involving the analysis of text and synthesis of a structured representation.
Can be done without reference to syntax (treating the query and indeed the documents as merely a “bag of words”). | Is based on syntactic analysis and semantic analysis.
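A small sketch of why the “bag of words” view is enough for retrieval but not for extraction. The two sentences and the `naive_fact` helper are made up for this illustration; the helper assumes a strict three-token "agent verb target" sentence.

```python
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lowercase, split on whitespace, and count tokens; word order is discarded."""
    return Counter(text.lower().split())


def naive_fact(text: str) -> tuple:
    """Read a three-token sentence as (verb, agent, target), keeping argument order."""
    agent, verb, target = text.split()
    return (verb, agent, target)


a = "ETS1 activates GM-CSF"
b = "GM-CSF activates ETS1"

# Same bag: a system that only sees bags of words cannot tell the two sentences apart.
print(bag_of_words(a) == bag_of_words(b))  # True

# Different facts: extraction has to keep track of who does what to whom.
print(naive_fact(a) == naive_fact(b))  # False
```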


IE in Biology and Biomedicine

A large number of papers are published in the domain of biology and biomedicine.

[Figure: total citations in MEDLINE (y-axis: total citations, 0 to 18,000,000)]

Experts cannot check all the relevant papers.
We can help them with automated tools.


Approaches to IE

Pattern-matching approaches
Basic context-free grammar approaches
Full parsing approaches
Probability-based parsing
Mixed syntax-semantics approaches
Sublanguage-driven information extraction
Ontology-driven information extraction

IE methods have evolved from simpler methods like pattern matching, to higher-level NLP techniques such as full parsing.


Pattern Matching Approaches

Martin et al. (2004)
  Extract protein-protein interactions
  Use a number of dictionaries
    Protein names and their synonyms
    Protein interaction verbs and their synonyms
    Common strings to identify unknown proteins (e.g., protein, kinase)
  Sample pattern
    ($VarGene $Verb (the)? $VarGene)
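The sample pattern can be approximated with a regular expression built from small dictionaries. This is a sketch, not the implementation of Martin et al.: the dictionary contents, the helper names, and the reading of `$VarGene` as "a known protein name, or any word followed by a common string such as kinase" are all assumptions made for illustration.

```python
import re

# Hypothetical mini-dictionaries; a real system would use much larger curated lists.
PROTEIN_NAMES = ["ETS1", "GM-CSF", "sigmaB", "yvyD"]
INTERACTION_VERBS = ["activates", "binds", "inhibits", "phosphorylates"]
# Common strings that suggest an unknown protein name, e.g. "Raf kinase".
PROTEIN_SUFFIXES = ["protein", "kinase"]


def alternation(words):
    """Build a regex alternation, longest entries first, with literals escaped."""
    return "|".join(re.escape(w) for w in sorted(words, key=len, reverse=True))


known = alternation(PROTEIN_NAMES)
unknown = r"[A-Za-z0-9-]+ (?:%s)" % alternation(PROTEIN_SUFFIXES)
var_gene = r"(?:%s|%s)" % (unknown, known)
verb = r"(?:%s)" % alternation(INTERACTION_VERBS)

# Regex counterpart of ($VarGene $Verb (the)? $VarGene)
PATTERN = re.compile(
    r"(?P<left>%s) (?P<verb>%s) (?:the )?(?P<right>%s)" % (var_gene, verb, var_gene)
)


def extract_interactions(sentence):
    """Return (left, verb, right) triples for every pattern match in the sentence."""
    return [(m.group("left"), m.group("verb"), m.group("right"))
            for m in PATTERN.finditer(sentence)]


print(extract_interactions("ETS1 activates the GM-CSF"))
print(extract_interactions("Raf kinase phosphorylates MEK protein"))
```

The second call shows the role of the "common strings" dictionary: neither Raf nor MEK is in the protein list, yet both are picked up because they are followed by kinase or protein.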


Full Parsing Approaches: BioIE

Kim and Park (2004)
  Extract general biological interactions
  Start with identifying ‘keyword verbs’ and their arguments using pattern matching
  Full parsing is used to validate the pattern matching result

Performance on corpora of 1,505 abstracts


Full Parsing Approaches: BioIE

System flow

NP matching is done in a bidirectional way using heuristic rules.


Full Parsing Approaches: BioIE

Example


Full Parsing Approaches: RelEx

Fundel et al. (2007)
  Extract gene/protein interactions
  Start with identifying gene/protein names
  Does not identify the kind of interaction
  Relation extraction rather than information extraction

Performance (Recall/Precision/F-measure)
  85/79/82 on the LLL challenge data set
  78/79/78 on a 50-abstract subset of the Human Protein Reference Database
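The slide does not state how the F-measure is computed; assuming the usual balanced F-measure (the harmonic mean of precision and recall), the reported figures can be reproduced as follows. The function name and the `beta` parameter are choices made for this sketch.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (balanced F1 when beta == 1)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Figures from the slide, listed there as Recall/Precision (in percent).
print(round(f_measure(precision=79, recall=85)))  # 82
print(round(f_measure(precision=79, recall=78)))  # 78
```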


Full Parsing Approaches: RelEx

System overview
  Stanford Lexicalized Parser
  ProMiner NER system
  fnTBL NP-chunker

Extract paths connecting pairs of proteins from dependency parse trees
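Path extraction between two protein mentions can be sketched as a graph search over dependency edges. Everything below is an assumption for illustration: the toy sentence and its hand-written edges are invented, no parser is called, and breadth-first search is simply one straightforward way to find the path; it is not claimed to be how RelEx implements this step.

```python
from collections import deque

# Hand-written, hypothetical dependency edges (head, dependent) for the toy sentence
# "sigmaB regulates the expression of yvyD"; a real system would take these from a parser.
EDGES = [
    ("regulates", "sigmaB"),
    ("regulates", "expression"),
    ("expression", "the"),
    ("expression", "of"),
    ("of", "yvyD"),
]


def build_graph(edges):
    """Treat the dependency tree as an undirected graph for path finding."""
    graph = {}
    for head, dep in edges:
        graph.setdefault(head, []).append(dep)
        graph.setdefault(dep, []).append(head)
    return graph


def path_between(graph, start, goal):
    """Breadth-first search; in a tree the path found is the unique simple path."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbour in graph.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(path + [neighbour])
    return None


graph = build_graph(EDGES)
print(path_between(graph, "sigmaB", "yvyD"))
# ['sigmaB', 'regulates', 'expression', 'of', 'yvyD']
```

The words on the path (here the verb "regulates" and the noun "expression") are what a rule-based extractor would inspect to decide whether the two proteins are reported as interacting.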


Full Parsing Approaches: RelEx

Example

Interacting protein pairs
  (sigmaB, yvyD)
  (Sigma H, yvyD)


IE Challenges

To compare the performance of different approaches, common standards or shared evaluation criteria are needed.

IE challenges
  Propose tasks
  Develop and distribute large enough training and test datasets
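With a shared test set, systems can be scored by comparing their output against the gold-standard annotations. The sketch below assumes facts are represented as plain tuples and uses made-up gold and predicted sets; actual challenges define their own formats and scoring scripts.

```python
def precision_recall(predicted: set, gold: set) -> tuple:
    """Compare extracted facts with a gold-standard set of facts."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical gold annotations and system output for one test document.
gold = {("sigmaB", "yvyD"), ("sigmaH", "yvyD"), ("ETS1", "GM-CSF")}
predicted = {("sigmaB", "yvyD"), ("ETS1", "GM-CSF"), ("ETS1", "yvyD"), ("sigmaB", "ETS1")}

p, r = precision_recall(predicted, gold)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.50 recall=0.67
```

Because every participant is scored against the same gold set with the same function, the resulting numbers are directly comparable across approaches.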


BioCreAtIvE Challenge

Critical Assessment of Information Extraction systems in Biology
http://biocreative.sourceforge.net

IE task in BioCreative 2 (2006)

Task | Description | Highest F-score
Protein interaction article sub-task (IAS) | Detection of protein interaction-relevant articles | 0.78 (P: 0.70, R: 0.88)
Protein interaction pairs sub-task (IPS) | Extraction and normalization of protein interaction pairs | 0.30 (P: 0.37, R: 0.33)
Protein interaction sentence sub-task (ISS) | Retrieval of actual text passages that provide evidence for protein interactions | P: 0.19
Protein interaction method sub-task (IMS) | Retrieval of the interaction detection method | 0.65 (P: 0.59, R: 0.85)


Literature-based Discovery (LBD)

Literature-based discovery
  A method for automatically generating hypotheses for scientific research by finding overlooked implicit connections in the research literature


LBD: a Simple Scenario

Primary concepts
  Diseases
  Drugs
  Symptoms

Relations
  Cause(Disease, symptom)
  Decrease(Drug, symptom)

Discoveries
  Treat(Drug, Disease)
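The scenario amounts to a single inference rule: if a disease causes a symptom and a drug decreases that symptom, hypothesize that the drug treats the disease. A minimal sketch follows, with placeholder concept names that are not real medical facts; the `Relation` type and function name are assumptions for this example.

```python
from typing import NamedTuple


class Relation(NamedTuple):
    predicate: str
    subject: str
    obj: str


def discover_treatments(relations):
    """Apply the rule: Cause(disease, s) and Decrease(drug, s) => Treat(drug, disease)."""
    causes = [r for r in relations if r.predicate == "Cause"]
    decreases = [r for r in relations if r.predicate == "Decrease"]
    known = set(relations)
    hypotheses = set()
    for cause in causes:
        for decrease in decreases:
            if cause.obj == decrease.obj:
                candidate = Relation("Treat", decrease.subject, cause.subject)
                if candidate not in known:  # only report relations not already extracted
                    hypotheses.add(candidate)
    return hypotheses


# Placeholder names standing in for relations an IE system would extract.
extracted = [
    Relation("Cause", "DiseaseA", "SymptomS"),
    Relation("Decrease", "DrugD", "SymptomS"),
    Relation("Decrease", "DrugE", "SymptomT"),
]
print(discover_treatments(extracted))
# {Relation(predicate='Treat', subject='DrugD', obj='DiseaseA')}
```

DrugE produces no hypothesis because the symptom it decreases is not linked to any disease in the extracted relations.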


LBD: a Simple Scenario

Use an IE system to extract relations from the literature
  Cause(Raynaud’s disease, blood viscosity reduction)
  Cause(Raynaud’s disease, platelet aggregation reduction)
  Increase(Fish oil, blood viscosity)
  Increase(Fish oil, platelet aggregation)

Hypothesize a new relation: a discovery!
  Treat(Fish oil, Raynaud’s disease)

Confirm with laboratory methods


LBD: a Real Example

Hristovski et al. (2006)
  Their discovery pattern


LBD: a Real Example

Their method
  Start with a disease X in mind
  Find physiological concepts Y that frequently co-occur with the disease X
  Extract relations between X and the Ys
  Find concepts Z that co-occur with the Ys
  Extract relations between the Zs and the Ys
  Make hypotheses using the ‘discovery pattern’

BITOLA, BioMedLee, SemRep are used.
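The co-occurrence steps of the method (X to Ys, then Ys to Zs) can be sketched on toy data. This is not the BITOLA implementation: the documents, concept names, and thresholds are hypothetical, the relation-extraction and discovery-pattern steps are left out, and dropping Zs that already co-occur with X is an extra assumption made here so that only new links remain.

```python
from collections import Counter

# Toy "abstracts", each reduced to the set of concepts it mentions (hypothetical data).
DOCS = [
    {"DiseaseX", "ConceptY1"},
    {"DiseaseX", "ConceptY1", "ConceptY2"},
    {"DiseaseX", "ConceptY1"},
    {"ConceptY1", "DrugZ1"},
    {"ConceptY1", "DrugZ1", "DrugZ2"},
    {"ConceptY2", "DrugZ2"},
]


def cooccurring(concept, docs, min_count=1, exclude=()):
    """Count concepts that share at least `min_count` documents with `concept`."""
    counts = Counter()
    for doc in docs:
        if concept in doc:
            counts.update(doc - {concept} - set(exclude))
    return {c: n for c, n in counts.items() if n >= min_count}


def candidate_links(disease, docs, min_count=2):
    """Map each candidate Z to the intermediate Ys that connect it to the disease X."""
    ys = cooccurring(disease, docs, min_count=min_count)  # frequent Ys only
    direct = set(cooccurring(disease, docs))  # everything already seen with X
    links = {}
    for y in ys:
        zs = cooccurring(y, docs, exclude=direct | {disease})
        for z in zs:
            links.setdefault(z, set()).add(y)
    return links


print(candidate_links("DiseaseX", DOCS))
```

On this toy data both DrugZ1 and DrugZ2 are linked to DiseaseX through ConceptY1; ConceptY2 is ignored because it co-occurs with the disease only once, below the chosen threshold.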


LBD: a Real Example

What they found
  Treat(eicosapentaenoic acid, Raynaud’s)
  Treat(Treatment for diabetes, Raynaud’s)


Conclusion

Information extraction turns unstructured text into structured information.
IE methods have evolved from simpler methods to higher-level NLP techniques.
Challenges provide gold standard datasets for evaluation.
IE systems can be used for literature-based discovery.


References

John McNaught, William J. Black, “Information Extraction”, Text Mining for Biology and Biomedicine, 2006.
Martin, E. P., et al., “Analysis of Protein/Protein Interactions Through Biomedical Literature: Text Mining of Abstracts vs. Text Mining of Full Articles”, Knowledge Exploration in Life Science Informatics, 2004.
Kim, J., J. Park, “BioIE: Retargetable information extraction and ontological annotation of biological interactions from the literature”, Journal of Bioinformatics and Computational Biology, vol. 2, no. 3, 551-568, 2004.
Katrin Fundel, Robert Kuffner, Ralf Zimmer, “RelEx-Relation extraction using dependency parse tree”, Bioinformatics, vol. 23, no. 3, 2007.
Pierre Zweigenbaum, Dina Demner-Fushman, Hong Yu, Kevin B. Cohen, “Frontiers of biomedical text mining: current progress”, Briefings in Bioinformatics, vol. 8, no. 5, 358-375, 2007.
Dimitar Hristovski, Carol Friedman, Thomas C. Rindflesch, Borut Peterlin, “Exploiting Semantic Relations for Literature-Based Discovery”, AMIA, 2006.


Thank you


Raynaud’s Disease

Raynaud’s disease (RAY-noz) is a vascular disorder that affects blood flow to the extremities (the fingers, toes, nose and ears) when exposed to cold temperatures or in response to psychological stress. It is named for Maurice Raynaud (1834-1881), a French physician who first described it in 1862.


Huntington Disease

An autosomal-dominant inherited neurodegenerative disorder that is characterized by the insidious progressive development of mood disturbances, behavioral changes, involuntary choreiform movements and cognitive impairments. Onset is most commonly in adulthood, with a typical duration of 15-20 years before premature death.
