Special Topics in Computer Science
NLP in a Nutshell
CS492B Spring Semester 2009
Speaker: Hee‐Jin Lee
Professor: Jong C. Park
Computer Science Department
Korea Advanced Institute of Science and Technology
TEXT MINING APPLICATIONS: INFORMATION EXTRACTION
Contents
  Information Extraction: What? and Why?
  Approaches to Information Extraction
  Information Extraction Challenges
  Application: Literature‐based Discovery
  Conclusion
CS492 Special topics in computer science ‐NLP in a Nutshell 3
Information Extraction (IE)
What is done by IE?
  Take a natural language text from a document source, and extract essential facts about one or more predefined fact types
  Represent each fact with a template whose slots are filled on the basis of what is found from the text
We have previously shown that ETS1 can activate GM‐CSF in Jurkat T cells.
Activate(ETS1, GM‐CSF)
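As a rough illustration of template filling, the following Python sketch fills an Activate template from the example sentence. The regular expression and the `Fact` class are assumptions made for this example; a real IE system would rely on named entity recognition and parsing rather than a single hand-written pattern.

```python
import re
from dataclasses import dataclass

@dataclass(frozen=True)
class Fact:
    """A filled template: a fact type plus its two slots."""
    fact_type: str
    agent: str
    target: str

    def __str__(self) -> str:
        return f"{self.fact_type}({self.agent}, {self.target})"

# Hypothetical pattern: "<agent> can activate <target>" or "<agent> activates <target>".
ACTIVATE = re.compile(r"(?P<agent>[\w-]+) (?:can activate|activates) (?P<target>[\w-]+)")

def extract_activations(text: str) -> list[Fact]:
    """Return one Activate fact per pattern match in the text."""
    return [Fact("Activate", m["agent"], m["target"]) for m in ACTIVATE.finditer(text)]

sentence = "We have previously shown that ETS1 can activate GM-CSF in Jurkat T cells."
facts = extract_activations(sentence)
print(facts[0])  # Activate(ETS1, GM-CSF)
```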
Information Extraction (IE)
IE vs. IR
Information Retrieval (IR) | Information Extraction (IE)
Returns documents. | Returns facts.
Is a classification task (each document is relevant/not relevant to a query). | Is an application of natural language processing, involving the analysis of text and synthesis of a structured representation.
Can be done without reference to syntax (treating query and indeed the documents as merely a “bag of words”). | Is based on syntactic analysis and semantic analysis.
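One way to see why IE needs more than a bag of words: two sentences with opposite meanings can have identical word counts. The sketch below uses Python's `collections.Counter`; the two sentences are made-up examples.

```python
from collections import Counter

def bag_of_words(text: str) -> Counter:
    """Lowercase, split on whitespace, and count tokens (word order is lost)."""
    return Counter(text.lower().split())

a = "ETS1 activates GM-CSF"
b = "GM-CSF activates ETS1"

# Same bag of words, so a bag-of-words model cannot tell the sentences apart ...
same_bag = bag_of_words(a) == bag_of_words(b)

# ... but the facts an IE system should return differ in their argument order.
fact_a = ("Activate", "ETS1", "GM-CSF")
fact_b = ("Activate", "GM-CSF", "ETS1")

print(same_bag, fact_a == fact_b)  # True False
```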
IE in Biology and Biomedicine
A large number of papers are published in the domain of biology and biomedicine
[Figure: Total citations in MEDLINE; y-axis: total citations, 0 to 18,000,000]
Experts cannot check all the relevant papers.
We can help them with automated tools.
Approaches to IE
  Pattern‐matching approaches
  Basic context free grammar approaches
  Full parsing approaches
  Probability based parsing
  Mixed syntax‐semantics approaches
  Sublanguage‐driven information extraction
  Ontology‐driven information extraction
IE methods have evolved from simpler methods like pattern matching, to higher‐level NLP techniques such as full parsing.
Pattern Matching Approaches
Martin et al. (2004)
  Extract protein‐protein interaction
  Use a number of dictionaries
    Protein names and their synonyms
    Protein interaction verbs and their synonyms
    Common strings to identify unknown proteins (e.g., protein, kinase)
  Sample pattern
    ($VarGene $Verb (the)? $VarGene)
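A minimal sketch of how the sample pattern could be compiled into a regular expression. The dictionary contents (`GENES`, `VERBS`) and the function names are made-up placeholders, not the dictionaries or code used by Martin et al.

```python
import re

# Hypothetical dictionaries; a real system would use much larger ones with synonyms.
GENES = ["ETS1", "GM-CSF", "RAF1", "MEK1"]
VERBS = ["activates", "binds", "phosphorylates", "inhibits"]

def alternation(words: list[str]) -> str:
    """Build an escaped regex alternation, longest entries first."""
    return "|".join(re.escape(w) for w in sorted(words, key=len, reverse=True))

# ($VarGene $Verb (the)? $VarGene)
PATTERN = re.compile(
    rf"\b(?P<a>{alternation(GENES)}) (?P<verb>{alternation(VERBS)}) "
    rf"(?:the )?(?P<b>{alternation(GENES)})\b"
)

def find_interactions(text: str) -> list[tuple[str, str, str]]:
    """Return (gene, verb, gene) triples for every match of the pattern."""
    return [(m["a"], m["verb"], m["b"]) for m in PATTERN.finditer(text)]

text = "RAF1 phosphorylates the MEK1 protein, and ETS1 activates GM-CSF."
print(find_interactions(text))
```

Sorting the dictionary entries longest first is a small design choice: it keeps a short name from matching as a prefix of a longer one.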
Full Parsing Approaches: BioIE
Kim and Park (2004)
  Extract general biological interactions
  Start with identifying ‘keyword verbs’ and their arguments using pattern matching
  Full parsing is used to validate the pattern matching result
Performance on corpora of 1,505 abstracts
Full Parsing Approaches: BioIE
System flow
NP matching is done in a bidirectional way using heuristic rules.
Full Parsing Approaches: BioIE
Example
Full Parsing Approaches: RelEx
Fundel et al. (2007)
  Extract gene/protein interactions
  Start with identifying gene/protein names
  Does not identify the kind of interaction
  Relation extraction rather than information extraction
Performance (Recall/Precision/F‐measure)
  85/79/82 on the LLL challenge data set
  78/79/78 on a 50‐abstract subset of the Human Protein Reference Database
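The reported values are consistent with the balanced F‐measure, the harmonic mean of precision and recall: F = 2PR / (P + R). The short Python check below recomputes it from the recall and precision figures quoted above; `f_measure` is a helper name introduced for this example.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Recall and precision as listed above, in percent.
lll = f_measure(precision=79, recall=85)
hprd = f_measure(precision=79, recall=78)

print(round(lll), round(hprd))  # 82 78
```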
Full Parsing Approaches: RelEx
System overview
  Stanford Lexicalized Parser
  ProMiner NER system
  fnTBL NP‐chunker
Extract paths connecting pairs of proteins from dependency parse trees
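The path extraction step can be pictured as a shortest-path search over the dependency tree. The sketch below hand-codes a toy dependency parse as an edge list instead of calling a parser, so the edges and the function name `dependency_path` are assumptions for illustration; it is not RelEx's implementation.

```python
from collections import deque

# Toy dependency edges (head, dependent) for
# "ETS1 can activate GM-CSF in Jurkat T cells".
# Hand-written for illustration; a real system would take these from a parser.
EDGES = [
    ("activate", "ETS1"),
    ("activate", "can"),
    ("activate", "GM-CSF"),
    ("activate", "cells"),
    ("cells", "in"),
    ("cells", "Jurkat"),
    ("cells", "T"),
]

def dependency_path(edges, start, goal):
    """Breadth-first search over the tree, treating edges as undirected."""
    neighbours = {}
    for head, dep in edges:
        neighbours.setdefault(head, []).append(dep)
        neighbours.setdefault(dep, []).append(head)
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in neighbours.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # no path: the two tokens are not connected

print(dependency_path(EDGES, "ETS1", "GM-CSF"))  # ['ETS1', 'activate', 'GM-CSF']
```

A candidate relation can then be judged from the words on the path, here the verb "activate" linking the two protein names.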
Full Parsing Approaches: RelEx
Example
Interacting protein pairs
(sigmaB, yvyD)
(Sigma H, yvyD)
IE Challenges
To compare the performance of different approaches, common standards or shared evaluation criteria are needed
IE challenges
  Propose tasks
  Develop and distribute large enough training and test datasets
BioCreAtIvE Challenge
Critical Assessment of Information Extraction systems in Biology
http://biocreative.sourceforge.net
IE task in BioCreative 2 (2006)
Task | Description | Highest F‐score
Protein interaction article sub‐task (IAS) | Detection of protein interaction‐relevant articles | 0.78 (P:0.70, R:0.88)
Protein interaction pairs sub‐task (IPS) | Extraction and normalization of protein interaction pairs | 0.30 (P:0.37, R:0.33)
Protein interaction sentence sub‐task (ISS) | Retrieval of actual text passages that provide evidence for protein interactions | P:0.19
Protein interaction method sub‐task (IMS) | Retrieval of the interaction detection method | 0.65 (P:0.59, R:0.85)
Literature‐based Discovery (LBD)
Literature‐based discovery
  A method for automatically generating hypotheses for scientific research by finding overlooked implicit connections in the research literature
LBD: a Simple Scenario
Primary concepts
  Diseases
  Drugs
  Symptoms
Relations
  Cause(Disease, symptom)
  Decrease(Drug, symptom)
Discoveries
  Treat(Drug, Disease)
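The scenario amounts to a join on the shared symptom: if a disease causes a symptom and a drug decreases that symptom, hypothesize that the drug treats the disease. A minimal Python sketch, with placeholder relation instances that are not taken from the lecture:

```python
def hypothesize_treatments(cause, decrease):
    """Treat(drug, disease) whenever Cause(disease, s) and Decrease(drug, s) share s."""
    treats = set()
    for disease, symptom in cause:
        for drug, decreased in decrease:
            if symptom == decreased:
                treats.add((drug, disease))
    return treats

# Placeholder facts, in the form an IE system might return them.
cause = {("DiseaseA", "symptom1"), ("DiseaseB", "symptom2")}
decrease = {("DrugX", "symptom1"), ("DrugY", "symptom3")}

print(hypothesize_treatments(cause, decrease))  # {('DrugX', 'DiseaseA')}
```

Each hypothesis produced this way is only a candidate; it still has to be confirmed experimentally.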
LBD: a Simple Scenario
Use an IE system to extract relations from the literature
  Cause(Raynaud’s disease, blood viscosity reduction)
  Cause(Raynaud’s disease, platelet aggregation reduction)
  Increase(Fish oil, blood viscosity)
  Increase(Fish oil, platelet aggregation)
Hypothesize a new relation – a discovery!
  Treat(Fish oil, Raynaud’s disease)
Confirm with laboratory methods
LBD: a Real Example
Hristovski et al. (2006)
  Their discovery pattern
LBD: a Real Example
Their method
  Start with a disease X in mind
  Find physiological concepts Y’s that frequently co‐occur with the disease X
  Extract relations between X and Y’s
  Find concepts Z’s that co‐occur with Y’s
  Extract relations between Z’s and Y’s
  Make hypotheses using ‘discovery pattern’
BITOLA, BioMedLee, SemRep are used.
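The co-occurrence steps can be sketched on toy data as follows. This only illustrates the X, Y, Z chaining idea: the documents, concept names and frequency threshold are invented, and the code does not reproduce BITOLA, BioMedLee or SemRep, which also supply the semantic relations that the discovery pattern needs.

```python
from collections import Counter

# Each toy "document" is the set of concepts mentioned in it (invented data).
DOCS = [
    {"X", "Y1"}, {"X", "Y1"}, {"X", "Y2"},
    {"Y1", "Z1"}, {"Y1", "Z1"}, {"Y2", "Z2"}, {"X", "Z3", "Y1"},
]

def cooccurring(concept, docs, min_count=1):
    """Concepts appearing with `concept` in at least `min_count` documents."""
    counts = Counter(c for d in docs if concept in d for c in d if c != concept)
    return {c for c, n in counts.items() if n >= min_count}

def candidate_links(x, docs, min_count=2):
    """Map each candidate Z to the intermediate Y's that link it to X."""
    ys = cooccurring(x, docs, min_count)   # Y's frequently seen with X
    known = cooccurring(x, docs)           # anything already seen together with X
    links = {}
    for y in ys:
        # Keep Z's frequently seen with Y but never seen with X.
        for z in cooccurring(y, docs, min_count) - known - {x}:
            links.setdefault(z, set()).add(y)
    return links

print(candidate_links("X", DOCS))  # {'Z1': {'Y1'}}
```

Excluding concepts already seen with X is what makes the remaining Z's candidates for an overlooked connection rather than known associations.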
LBD: a Real Example
What they found
  Treat(eicosapentaenoic acid, Raynaud’s)
  Treat(Treatment for diabetes, Raynaud’s)
Conclusion
  Information extraction extracts structured information from unstructured text.
  IE methods have evolved from simpler methods to higher‐level NLP techniques.
  Challenges provide gold standard datasets for evaluation.
  IE systems can be used for literature‐based discovery.
References
John McNaught, William J. Black, “Information Extraction”, Text Mining for Biology and Biomedicine, 2006.
Martin, E. P., et al., “Analysis of Protein/Protein Interactions Through Biomedical Literature: Text Mining of Abstracts vs. Text Mining of Full Articles”, Knowledge Exploration in Life Science Informatics, 2004.
Kim, J., J. Park, “BioIE: Retargetable information extraction and ontological annotation of biological interactions from the literature”, Journal of Bioinformatics and Computational Biology, vol. 2, no. 3, 551‐568, 2004.
Katrin Fundel, Robert Kuffner, Ralf Zimmer, “RelEx‐Relation extraction using dependency parse tree”, Bioinformatics, vol. 23, no. 3, 2007.
Pierre Zweigenbaum, Dina Demner‐Fushman, Hong Yu, Kevin B. Cohen, “Frontiers of biomedical text mining: current progress”, Briefings in Bioinformatics, vol. 8, no. 5, 358‐375, 2007.
Dimitar Hristovski, Carol Friedman, Thomas C. Rindflesch, Borut Peterlin, “Exploiting Semantic Relations for Literature‐Based Discovery”, AMIA, 2006.
Thank you
Raynaud’s Disease
Raynaud's disease (RAY‐noz) is a vascular disorder[1] that affects blood flow to the extremities (the fingers, toes, nose and ears) when exposed to cold temperatures or in response to psychological stress. It is named for Maurice Raynaud (1834 ‐ 1881),[2] a French physician who first described it in 1862.[3]
Huntington Disease
An autosomal‐dominant inherited neurodegenerative disorder that is characterized by the insidious progressive development of mood disturbances, behavioral changes, involuntary choreiform movements and cognitive impairments. Onset is most commonly in adulthood, with a typical duration of 15‐20 years before premature death.