NLP Tech Talk


Upload: thomas-person

Post on 08-Jul-2015

DESCRIPTION

NLP Tech Talk for QBRC at UTSW

TRANSCRIPT

Page 1: Nlp tech talk

QBRC Tech Talk

Thomas Nate Person
Clinical Sciences

PROSPR Center and QBRC

May 6, 2013

NLP: Natural Language Processing

Page 2: Nlp tech talk

Outline

Basics of NLP

NLP Toolkits

Basic Implementation Example

Questions?

Page 3: Nlp tech talk

What is NLP?

Not:

Natural Language Programming (NLP)

Neuro-Linguistic Programming (NLP)

“Natural Language Processing (NLP) is a field of computer science, artificial intelligence, and linguistics concerned with the interactions between computers and human (natural) languages.”

-Wikipedia

Page 4: Nlp tech talk

What can’t it do?

Extract information not understandable or discernible by “you”.

Extract deeper meaning.

Substitute for regular expression pattern matching.

Page 5: Nlp tech talk

Basics of NLP

Large research field
  From speech recognition to optical character recognition

Examples:
  Watson (Jeopardy)
  Cleverbot
  Siri/Dragon Speak
  Captcha

I am only concerned about Information Extraction (IE):
  Sentence detection
  Part of Speech (POS) tagging (nouns, verbs, adverbs)
  Named-entity recognition (NER) (names, organizations, locations)
  Lemmatisation (walk, walked, walks, walking)
  Relationship extraction: all possible word relationships
  Parsing: determining the most probable word relationships
  Coreference: linking of references between multiple sentences

Page 6: Nlp tech talk

What’s the point of all that?

Help categorize unstructured text into a more structured format so that discrete information can more easily be extracted.

Page 7: Nlp tech talk

NLP Information Extraction Example

“Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.”

Page 8: Nlp tech talk

NLP Information Extraction Example: POS (Part of Speech) Tagging

“Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.”

Pierre/NNP Vinken/NNP ,/, 61/CD years/NNS old/JJ ,/, will/MD join/VB the/DT board/NN as/IN a/DT nonexecutive/JJ director/NN Nov./NNP 29/CD ./.
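The tagger output pairs each token with its Penn Treebank tag, joined by a slash. A minimal Perl sketch of reading that format back into (word, tag) pairs follows. It assumes the tokens are separated by whitespace and that the tag follows the last slash; the helper name parse_tagged is hypothetical and not part of any toolkit.

```perl
use strict;
use warnings;

# Hypothetical helper: split "word/TAG word/TAG ..." into [word, tag] pairs.
# The tag is taken after the LAST slash so a token containing "/" survives.
sub parse_tagged {
    my ($tagged) = @_;
    my @pairs;
    for my $token (split ' ', $tagged) {
        my ($word, $tag) = $token =~ m{^(.+)/([^/]+)$} or next;
        push @pairs, [$word, $tag];
    }
    return @pairs;
}

my $tagged = 'Pierre/NNP Vinken/NNP ,/, 61/CD years/NNS old/JJ ,/, will/MD '
           . 'join/VB the/DT board/NN as/IN a/DT nonexecutive/JJ director/NN '
           . 'Nov./NNP 29/CD ./.';

# Keep only the proper nouns (NNP) as a simple use of the tags.
my @proper = map { $_->[0] } grep { $_->[1] eq 'NNP' } parse_tagged($tagged);
print join(' ', @proper), "\n";   # prints "Pierre Vinken Nov."
```

Filtering on the tag is the kind of discrete lookup that is awkward on raw text but trivial once the sentence has been tagged.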

Page 9: Nlp tech talk

Penn Treebank Tagset

Tag    Description
CC     Coordinating conjunction (e.g. and, but, or...)
CD     Cardinal number
DT     Determiner
EX     Existential there
FW     Foreign word
IN     Preposition or subordinating conjunction
JJ     Adjective
JJR    Adjective, comparative
JJS    Adjective, superlative
LS     List item marker
MD     Modal (e.g. can, could, might, may...)
NN     Noun, singular or mass
NNP    Proper noun, singular
NNPS   Proper noun, plural
NNS    Noun, plural
PDT    Predeterminer (e.g. all, both ... when they precede an article)
POS    Possessive ending (e.g. nouns ending in 's)
PRP    Personal pronoun (e.g. I, me, you, he...)
PRP$   Possessive pronoun (e.g. my, your, mine, yours...)
RB     Adverb (most words that end in -ly, as well as degree words like quite, too and very)
RBR    Adverb, comparative (adverbs with the comparative ending -er, with a strictly comparative meaning)
RBS    Adverb, superlative
RP     Particle
SYM    Symbol (should be used for mathematical, scientific or technical symbols)
TO     to
UH     Interjection (e.g. uh, well, yes, my...)
VB     Verb, base form (subsumes imperatives, infinitives and subjunctives)
VBD    Verb, past tense (includes the conditional form of the verb to be)
VBG    Verb, gerund or present participle
VBN    Verb, past participle
VBP    Verb, non-3rd person singular present
VBZ    Verb, 3rd person singular present
WDT    Wh-determiner (e.g. which, and that when it is used as a relative pronoun)
WP     Wh-pronoun (e.g. what, who, whom...)
WP$    Possessive wh-pronoun
WRB    Wh-adverb (e.g. how, where, why)

Page 10: Nlp tech talk

POS Parse Tree

Page 11: Nlp tech talk

POS Parse Tree

“Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.”

( (S
    (NP-SBJ
      (NP (NNP Pierre) (NNP Vinken))
      (, ,)
      (ADJP
        (NML (CD 61) (NNS years))
        (JJ old))
      (, ,))
    (VP (MD will)
      (VP (VB join)
        (NP (DT the) (NN board))
        (PP-CLR (IN as)
          (NP (DT a) (JJ nonexecutive) (NN director)))
        (NP-TMP (NNP Nov.) (CD 29))))
    (. .)))
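The bracketed tree carries the same part-of-speech tags at its leaves, wrapped in phrase-level nodes such as NP and VP. As a rough sketch, assuming a leaf is any parenthesized pair with no nested parentheses, the leaves can be pulled out with a single regular expression. The helper name tree_leaves is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: collect the leaves of a Penn-style bracketed tree.
# A leaf is a pair such as "(NNP Pierre)" that contains no nested parentheses.
sub tree_leaves {
    my ($tree) = @_;
    my @leaves;
    while ($tree =~ /\(([^\s()]+)\s+([^\s()]+)\)/g) {
        push @leaves, { tag => $1, word => $2 };
    }
    return @leaves;
}

my $tree = '( (S (NP-SBJ (NP (NNP Pierre) (NNP Vinken)) (, ,) '
         . '(ADJP (NML (CD 61) (NNS years)) (JJ old)) (, ,)) '
         . '(VP (MD will) (VP (VB join) (NP (DT the) (NN board)) '
         . '(PP-CLR (IN as) (NP (DT a) (JJ nonexecutive) (NN director))) '
         . '(NP-TMP (NNP Nov.) (CD 29)))) (. .)))';

# Reading the leaves left to right recovers the original token sequence.
my @leaves = tree_leaves($tree);
print join(' ', map { $_->{word} } @leaves), "\n";
```

Phrase nodes are skipped because their first child starts with another opening parenthesis, so only the (tag word) pairs match.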

Page 12: Nlp tech talk

NLP Information Extraction Example: POS (Part of Speech) Tagging

“Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.”

Pierre/NNP Vinken/NNP ,/, 61/CD years/NNS old/JJ ,/, will/MD join/VB the/DT board/NN as/IN a/DT nonexecutive/JJ director/NN Nov./NNP 29/CD ./.

Page 13: Nlp tech talk

NLP Information Extraction Example: NER (Named Entity Recognition) Tagging

“Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.”

Pierre/NNP/PERSON Vinken/NNP/PERSON ,/,/O 61/CD/DURATION years/NNS/NUMBER old/JJ/DURATION ,/,/O will/MD/O join/VB/O the/DT/O board/NN/O as/IN/O a/DT/O nonexecutive/JJ/O director/NN/O Nov./NNP/DATE 29/CD/DATE ././O

Page 14: Nlp tech talk

NLP Information Extraction Example: Lemmatisation

“Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.”

Pierre/NNP/PERSON [Pierre]
Vinken/NNP/PERSON [Vinken]
,/,/O [,]
61/CD/DURATION [61]
years/NNS/NUMBER [year]
old/JJ/DURATION [old]
,/,/O [,]
will/MD/O [will]
join/VB/O [join]
the/DT/O [the]
board/NN/O [board]
as/IN/O [as]
a/DT/O [a]
nonexecutive/JJ/O [nonexecutive]
director/NN/O [director]
Nov./NNP/DATE [Nov.]
29/CD/DATE [29]
././O [.]

Page 15: Nlp tech talk

Relationship Parsing

“Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.”

nn(Vinken-1, Pierre-0) [nn modifier]
nsubj(join-8, Vinken-1) [nominal subject]
num(years-4, 61-3) [numeric modifier]
npadvmod(old-5, years-4) [noun phrase adverbial modifier]
amod(Vinken-1, old-5) [adjectival modifier]
aux(join-8, will-7) [auxiliary]
det(board-10, the-9) [determiner]
dobj(join-8, board-10) [direct object]
det(director-14, a-12) [determiner]
amod(director-14, nonexecutive-13) [adjectival modifier]
prep_as(join-8, director-14) [prep_collapsed]
tmod(join-8, Nov.-15) [temporal modifier]
num(Nov.-15, 29-16) [numeric modifier]
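Each dependency has the shape relation(governor-index, dependent-index). The sketch below shows one way to turn such a printed line into a small record, assuming the index always follows the last hyphen of each word. The helper name parse_dependency and the hash keys are hypothetical, chosen only for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: parse "rel(governor-i, dependent-j)" into a hash ref.
# Backtracking places each index after the LAST hyphen, so "Nov.-15" works.
sub parse_dependency {
    my ($line) = @_;
    my ($rel, $gov, $gi, $dep, $di) =
        $line =~ /^(\w+)\((.+)-(\d+),\s*(.+)-(\d+)\)/
        or return;
    return { relation => $rel, governor => $gov, gov_index => $gi,
             dependent => $dep, dep_index => $di };
}

my @lines = (
    'nn(Vinken-1, Pierre-0)',
    'nsubj(join-8, Vinken-1)',
    'aux(join-8, will-7)',
    'dobj(join-8, board-10)',
    'prep_as(join-8, director-14)',
    'tmod(join-8, Nov.-15)',
);

# List everything attached directly to the verb "join".
for my $d (map { parse_dependency($_) } @lines) {
    next unless $d->{governor} eq 'join';
    printf "%-8s %s\n", $d->{relation}, $d->{dependent};
}
```

Grouping by governor like this answers questions such as who joins (nsubj), what is joined (dobj), and when (tmod).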

Page 16: Nlp tech talk

Relationship Extraction

root - root
dep - dependent
  aux - auxiliary
    auxpass - passive auxiliary
    cop - copula
  arg - argument
    agent - agent
    comp - complement
      acomp - adjectival complement
      attr - attributive
      ccomp - clausal complement with internal subject
      xcomp - clausal complement with external subject
      complm - complementizer
      obj - object
        dobj - direct object
        iobj - indirect object
        pobj - object of preposition
      mark - marker (word introducing an advcl)
      rel - relative (word introducing a rcmod)
    subj - subject
      nsubj - nominal subject
        nsubjpass - passive nominal subject
      csubj - clausal subject
        csubjpass - passive clausal subject
  cc - coordination
  conj - conjunct
  expl - expletive (expletive “there”)
  mod - modifier
    abbrev - abbreviation modifier
    amod - adjectival modifier
    appos - appositional modifier
    advcl - adverbial clause modifier
    purpcl - purpose clause modifier
    det - determiner
    predet - predeterminer
    preconj - preconjunct
    infmod - infinitival modifier
    mwe - multi-word expression modifier
    partmod - participial modifier
    advmod - adverbial modifier
    neg - negation modifier
    rcmod - relative clause modifier
    quantmod - quantifier modifier
    nn - noun compound modifier
    npadvmod - noun phrase adverbial modifier
    tmod - temporal modifier
    num - numeric modifier
    number - element of compound number
    prep - prepositional modifier
    poss - possession modifier
    possessive - possessive modifier (’s)
    prt - phrasal verb particle
  parataxis - parataxis
  punct - punctuation
  ref - referent
  sdep - semantic dependent
    xsubj - controlling subject

Page 17: Nlp tech talk

NLP Toolkits

41 different toolkits listed in Wikipedia

Four of the more popular free, open-source (FOSS) IE toolkits

Name                                               Language   License              Creators
OpenNLP                                            Java       Apache License 2.0   Online community
General Architecture for Text Engineering (GATE)   Java       LGPL                 GATE open source community
Natural Language Toolkit (NLTK)                    Python     Apache 2.0           Team NLTK
Stanford NLP                                       Java       GPL                  The Stanford Natural Language Processing Group

Page 18: Nlp tech talk

NLP Toolkits

OpenNLP
  Extensive publications
  Corporate sponsorship
  Java

Page 19: Nlp tech talk

NLP Toolkits

General Architecture for Text Engineering (GATE)
  Extensive publications
  Integrated Development Environment (IDE) to assist in development
  Java
  Java Annotation Patterns Engine (JAPE)

Page 20: Nlp tech talk

NLP Toolkits

Natural Language Toolkit (NLTK)
  Extensive publications
  Two published documentation books, from O’Reilly and Packt

Page 21: Nlp tech talk

NLP Toolkits

Stanford CoreNLP
  Extensive publications
  Wrappers for Perl, Python, Ruby, and Scala
  Plugins for GATE and NLTK

Page 22: Nlp tech talk

Questions from PROSPR to answer

From the hand-typed colonoscopy report:

How many Polyps

Location of Polyps

Size of Polyps

Page 23: Nlp tech talk

Sample Workflow

Report Definition

Report Sectionization

Formatting the Text

Process the Section

Further analysis
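The sectionization step of this workflow can be sketched in a few lines of Perl. This is only an illustration under stated assumptions: the heading list is made up for the example, headings are assumed to end in a colon, and the helper name sectionize is hypothetical. A real report format would need its own heading list.

```perl
use strict;
use warnings;

# Assumed list of section headings; a real report format defines its own.
my @headings = ('Indications', 'Complications', 'Findings',
                'Estimated Blood Loss', 'Recommendation');

# Hypothetical helper: return a hash of heading => section text.
sub sectionize {
    my ($report, @names) = @_;
    my $alt = join '|', map { quotemeta($_) } @names;
    # A capture group in the split pattern keeps the headings in the result.
    my ($preamble, @parts) = split /\b($alt):\s*/, $report;
    my %section;
    while (my ($name, $body) = splice @parts, 0, 2) {
        $body = '' unless defined $body;
        $body =~ s/^\s+|\s+$//g;    # formatting step: trim whitespace
        $section{$name} = $body;
    }
    return %section;
}

my $report = 'Indications: Screening for colorectal malignant neoplasm '
           . 'Complications: No immediate complications. '
           . 'Findings: Three pedunculated polyps were found in the mid sigmoid colon. '
           . 'Recommendation: High fiber diet indefinitely.';

my %section = sectionize($report, @headings);
print "$section{Findings}\n";
```

Only the section of interest, here Findings, then needs to be passed on to the slower NLP pipeline.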

Page 24: Nlp tech talk

Report Example

Gastroenterology Laboratory

Patient Name: Susan Storm Richards
Procedure Date: 5/06/2013 15:00:15 PM
MRN: 123456789
Age: 60
Accession #: 123456
Gender: Female
Order #: 123456789
Ethnicity:
Attending MD: Victor Von Doom MD
Note Status: Finalized
Room: 666
Procedure: Colonoscopy

Referring MD: Reed Richards
Providers: Victor von Doom, MD (Doctor)
Attending Participation: I personally performed the entire procedure.
Medicines: SomeDrug 3 mg IV, OtherDrug 75 micrograms IV
Indications: Screening for colorectal malignant neoplasm
Complications: No immediate complications.
Patient Profile: Refer to note in patient chart for documentation of history and physical.
Procedure: Pre-Anesthesia Assessment:

- Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.

ASA Grade Assessment: II - Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.

Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.

Findings: Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Three pedunculated polyps were found in the mid sigmoid colon and in the proximal ascending colonThe polyps were 30 mm in size. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.

Estimated Blood Loss: Estimated blood loss: none.
Recommendation: - Discharge patient to home (ambulatory).
- High fiber diet indefinitely.
CPT(c) Code(s): --- Technical ---
G0121, Colorectal cancer screening; colonoscopy on individual not meeting criteria for high risk

CPT Copyright 2010 American Medical Association. All Rights Reserved.
The codes documented in this report are preliminary and upon coder review may be revised to meet current compliance requirements.
Victor von Doom
Victor von Doom, MD
5/6/2013 15:10
This report has been signed electronically.
Number of Addenda: 0

Page 25: Nlp tech talk

Sectioned

Findings: Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Three pedunculated polyps were found in the mid sigmoid colon and in the proximal ascending colonThe polyps were 30 mm in size. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.

Page 26: Nlp tech talk

Sample

“ Three pedunculated polyps were found in the mid sigmoid colon and in the proximal ascending colonThe polyps were 30 mm in size. ”

Page 27: Nlp tech talk

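Regex Formatting Text: Removing Spaces

# Join a measurement and its unit: "30 mm" becomes "30mm"
$text =~ s/(\b\d+\b)(\s)(\bmm\b)/$1$3/g;

# Restore a missing sentence break: "colonThe" becomes "colon. The"
# (the replacement needs a literal space; \s only has meaning inside the pattern)
$text =~ s/(\b[a-z]+)([A-Z])([a-z]+\b)/$1. $2$3/g;

# Trim leading and trailing whitespace
$text =~ s/^\s+//;

$text =~ s/\s+$//;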
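The four substitutions can be bundled into one clean-up routine and tried on the sample text. The wrapper name format_text is hypothetical. One detail is an adjustment made in this sketch: the replacement side of a Perl substitution is an interpolated string, where \s is not a whitespace escape, so a literal space is written after the inserted period.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the clean-up substitutions.
sub format_text {
    my ($text) = @_;
    $text =~ s/(\b\d+\b)(\s)(\bmm\b)/$1$3/g;             # "30 mm" becomes "30mm"
    $text =~ s/(\b[a-z]+)([A-Z])([a-z]+\b)/$1. $2$3/g;   # "colonThe" becomes "colon. The"
    $text =~ s/^\s+//;                                   # trim leading whitespace
    $text =~ s/\s+$//;                                   # trim trailing whitespace
    return $text;
}

my $sample = ' Three pedunculated polyps were found in the mid sigmoid colon '
           . 'and in the proximal ascending colonThe polyps were 30 mm in size. ';

print format_text($sample), "\n";
```

Joining "30" and "mm" into one token is a deliberate choice: it lets the later dependency step treat the whole measurement as a single word.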

Page 28: Nlp tech talk

Formatted Sample

“Three pedunculated polyps were found in the mid sigmoid colon and in the proximal ascending colon. The polyps were 30mm in size.”

Page 29: Nlp tech talk

NLP Information Extraction Example: Relationship Dependencies

Original sentence:

Three pedunculated polyps were found in the mid sigmoid colon and in the proximal ascending colon.

Dependencies:

num(polyps-2, Three-0) [numeric modifier]
amod(polyps-2, pedunculated-1) [adjectival modifier]
nsubjpass(found-4, polyps-2) [nominal passive subject]
auxpass(found-4, were-3) [passive auxiliary]
det(colon-9, the-6) [determiner]
amod(colon-9, mid-7) [adjectival modifier]
nn(colon-9, sigmoid-8) [nn modifier]
prep_in(found-4, colon-9) [prep_collapsed]
det(proximal-13, the-12) [determiner]
prep_in(found-4, proximal-13) [prep_collapsed]
conj_and(colon-9, proximal-13) [conj_collapsed]
partmod(proximal-13, ascending-14) [participial modifier]
dobj(ascending-14, colon-15) [direct object]

Original sentence:

The polyps were 30mm in size.

Dependencies:

det(polyps-1, The-0) [determiner]
nsubj(30mm-3, polyps-1) [nominal subject]
cop(30mm-3, were-2) [copula]
prep_in(30mm-3, size-5) [prep_collapsed]

Page 30: Nlp tech talk

Output

“Three pedunculated polyps were found in the mid sigmoid colon and in the proximal ascending colon. The polyps were 30 mm in size.”

Output

Number of Polyps: 3

Size of Polyps: 30,

Location of Polyps: 1,4,

Page 31: Nlp tech talk

Perl Example

use strict;
use warnings;
use Lingua::StanfordCoreNLP;
use Lingua::EN::Words2Nums;

my $pipeline = new Lingua::StanfordCoreNLP::Pipeline(1);

my $text = "Three pedunculated polyps were found in the mid sigmoid colon and in the proximal ascending colonThe polyps were 30 mm in size.";

# Format the text before parsing.
$text =~ s/(\b\d+\b)(\s)(\bmm\b)/$1$3/g;             # "30 mm" becomes "30mm"
$text =~ s/(\b[a-z]+)([A-Z])([a-z]+\b)/$1. $2$3/g;   # "colonThe" becomes "colon. The"
$text =~ s/^\s+//;                                   # trim leading whitespace
$text =~ s/\s+$//;                                   # trim trailing whitespace

my $result = $pipeline->process($text);

# Empty defaults keep the final prints free of "uninitialized" warnings.
my $polypCount    = '';
my $polypSize     = '';
my $polypLocation = '';

for my $sentence (@{$result->toArray}) {
    for my $dep (@{$sentence->getDependencies->toArray}) {
        my $relation = $dep->getRelation;
        my $govern   = $dep->getGovernor->getWord;
        my $depend   = $dep->getDependent->getWord;
        my $num      = words2nums($depend);   # number word to digits, e.g. "Three" to 3

        # num(polyps, Three): the numeric modifier of "polyp(s)" is the count.
        if (($relation eq "num") && ($govern =~ /^polyps?$/i) && defined $num) {
            $polypCount = $num;
        }
        # nsubj(30mm, polyps): the measurement governing "polyp(s)" is the size.
        if (($relation eq "nsubj") && ($govern =~ /^\d+mm$/) && ($depend =~ /^polyps?$/i)) {
            $govern =~ s/mm$//;
            $polypSize = "$govern,";
        }
        # nn(colon, sigmoid): sigmoid colon, recorded as location code 1.
        if (($relation eq "nn") && ($govern =~ /^colon$/i) && ($depend =~ /sigmoid/i)) {
            $polypLocation = "1,";
        }
        # dobj(ascending, colon): ascending colon, appended as location code 4.
        if (($relation eq "dobj") && ($govern =~ /^ascending$/i) && ($depend =~ /^colon$/i)) {
            $polypLocation .= "4,";
        }
    }
}

print "Number of Polyps:\t$polypCount\n";
print "Size of Polyps:\t\t$polypSize\n";
print "Location of Polyps:\t$polypLocation\n";

Page 32: Nlp tech talk

F - Score

6/26/2013

Comparison against a manually curated “Gold Standard”

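Precision = proportion of extracted items that are true positives: TP / (TP + FP)

Recall = proportion of actual (gold standard) positives that were extracted: TP / (TP + FN)

F-score = harmonic mean of precision and recall: 2 × Precision × Recall / (Precision + Recall)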
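As a worked example, the scores can be computed from counts of true positives (TP), false positives (FP) and false negatives (FN) obtained by comparing the extracted values against the gold standard. The sketch below uses the standard balanced F-score (F1), the harmonic mean of precision and recall. The counts are made up for illustration and the helper name f_score is hypothetical.

```perl
use strict;
use warnings;

# Precision, recall and F1 from counts of true positives (tp),
# false positives (fp) and false negatives (fn).
sub f_score {
    my ($tp, $fp, $fn) = @_;
    my $precision = ($tp + $fp) ? $tp / ($tp + $fp) : 0;
    my $recall    = ($tp + $fn) ? $tp / ($tp + $fn) : 0;
    my $f1 = ($precision + $recall)
           ? 2 * $precision * $recall / ($precision + $recall)
           : 0;
    return ($precision, $recall, $f1);
}

# Made-up counts: 8 correct extractions, 2 spurious ones,
# and 2 gold-standard items that were missed.
my ($p, $r, $f1) = f_score(8, 2, 2);
printf "precision=%.2f recall=%.2f F1=%.2f\n", $p, $r, $f1;
# prints: precision=0.80 recall=0.80 F1=0.80
```

The zero guards only avoid division by zero when nothing was extracted or the gold standard is empty.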

Page 33: Nlp tech talk

Questions?!