TRANSCRIPT
Natural Language Processing for the Health Sciences
Dragomir Radev MIDAS/SI/CSE
MIDAS Town Hall March 30, 2016
CLAIR
• Today’s person is subjected to more information in a day than a person in the middle ages in a lifetime
• The search engine market is $94B a year – Feb 2016, New York Times
• Siri gets 1 billion requests a week – Jan 2016, USA Today, citing Apple
• PubMed adds 1.29 journal articles per minute! – 2010, Arif Jinha in Learned Publishing
• Users send out 168 million emails every minute – 2015, go-globe.com
• Google indexes at least 48 billion web pages – 2016, WorldWideWebSize.com
• The global healthcare NLP market is expected to grow from $1.10 billion in 2015 to $2.67 billion by 2020 – 2015, marketsandmarkets
Natural Language Text is Everywhere
2
Examples of NLP Systems
3
[Figure: Natural Language Processing and the areas it connects to: Machine Learning, Database Systems, Linguistics, Graph Theory, Artificial Intelligence, Human Computer Interaction, Stat/Data Science, Information Retrieval, Humanities, Social Sciences, Health Sciences, Education, Finance, Media]
Connections Between NLP and Other Areas
4
NLP for the Health Sciences
• Goals – improving clinical workflow – delivery of on-time information – reducing readmissions – patient satisfaction
• Information Extraction – medical papers – building structured databases from unstructured text
• Coding and documentation – clinical notes – electronic patient records
• Personal information management – information retrieval
• Telemedicine – dialogue systems
• Clinical decision support – summarization and survey generation
• Interoperability
5
NLP methods
• Sequence-based – part of speech tagging – information extraction
• Tree-based – parsing – sentence simplification
• Graph-based – semi-supervised learning – clustering
• Semantic – reasoning – ontologies – similarity
6
Dependency Parsing
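[Figure: dependency parse of "The girl ate the omelet with a fork"]
root(ROOT, ate)
nsubj(ate, girl)
dobj(ate, omelet)
nmod(ate, fork)
det(girl, The)
det(omelet, the)
det(fork, a)
case(fork, with)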
7
Language Understanding
• Semantic Analysis Girl (g1) Omelet (o1) Fork (f1) Eating (e1) ^ Eater (e1,g1) ^ Eaten (e1,o1) ^ Instrument (e1,f1)
• World Knowledge Omelet (X) => Food (X)
• Inference Hungry (Z,t0) ^ Eater (e1,Z) ^ Eaten (e1,Y) ^ Time (e1,t1) ^ Food (Y) ^ Precedes (t0,t1) => ¬Hungry (Z,t1)
• Conclusion ¬Hungry (g1,t1)
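The chain from semantic analysis to conclusion can be sketched as a tiny forward-chaining engine. This is only an illustration, not the system behind the slide: the facts `Hungry(g1,t0)`, `Time(e1,t1)` and `Precedes(t0,t1)` are assumed here because the inference rule needs them, the event `e1` is generalized to a variable `?e`, and `NotHungry` stands in for `¬Hungry`. All function names are hypothetical.

```python
def is_var(term):
    """Variables are written with a leading '?'."""
    return term.startswith("?")

def match(pattern, fact, bindings):
    """Extend bindings so that pattern equals fact, or return None."""
    if len(pattern) != len(fact) or pattern[0] != fact[0]:
        return None
    new = dict(bindings)
    for p, f in zip(pattern[1:], fact[1:]):
        if is_var(p):
            if new.setdefault(p, f) != f:
                return None
        elif p != f:
            return None
    return new

def satisfy(premises, facts, bindings=None):
    """Yield every variable binding that makes all premises true."""
    bindings = bindings or {}
    if not premises:
        yield bindings
        return
    for fact in facts:
        extended = match(premises[0], fact, bindings)
        if extended is not None:
            yield from satisfy(premises[1:], facts, extended)

def forward_chain(facts, rules):
    """Apply rules until no new fact can be derived."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            for b in list(satisfy(premises, facts)):
                derived = tuple(b.get(t, t) for t in conclusion)
                if derived not in facts:
                    facts.add(derived)
                    changed = True
    return facts

facts = {
    ("Girl", "g1"), ("Omelet", "o1"), ("Fork", "f1"), ("Eating", "e1"),
    ("Eater", "e1", "g1"), ("Eaten", "e1", "o1"), ("Instrument", "e1", "f1"),
    # Assumed for illustration; not listed on the slide:
    ("Hungry", "g1", "t0"), ("Time", "e1", "t1"), ("Precedes", "t0", "t1"),
}
rules = [
    # World knowledge: Omelet(X) => Food(X)
    ([("Omelet", "?x")], ("Food", "?x")),
    # Inference: a hungry eater of food is no longer hungry afterwards
    ([("Hungry", "?z", "?t0"), ("Eater", "?e", "?z"), ("Eaten", "?e", "?y"),
      ("Time", "?e", "?t1"), ("Food", "?y"), ("Precedes", "?t0", "?t1")],
     ("NotHungry", "?z", "?t1")),
]
derived = forward_chain(facts, rules)
print(("NotHungry", "g1", "t1") in derived)
```

The second rule only fires after the first one has added `Food(o1)`, which is why the loop repeats until nothing changes.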
8
Vector Semantics
http://www.tensorflow.org/tutorials/word2vec/index.md
Recursive NN for sentiment (Socher et al.)
NLP to logical form interface (Dong and Lapata, 2016)
12
Query-based Summarization: Biased LexRank
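$$\mathrm{rel}(s \mid q) = \sum_{w \in q} \log(\mathrm{tf}_{w,s} + 1) \cdot \log(\mathrm{tf}_{w,q} + 1) \cdot \mathrm{idf}_w$$

$$\mathrm{idf}_w = \log\left(\frac{N + 1}{0.5 + \mathrm{sf}_w}\right)$$

$$p(s \mid q) = d \, \frac{\mathrm{rel}(s \mid q)}{\sum_{z \in C} \mathrm{rel}(z \mid q)} + (1 - d) \sum_{v \in C} \frac{\mathrm{sim}(s, v)}{\sum_{z \in C} \mathrm{sim}(z, v)} \, p(v \mid q)$$

In matrix form, with $\mathbf{v}$ the vector of scores $p(s \mid q)$:

$$\mathbf{v} = [d\mathbf{A} + (1 - d)\mathbf{B}]^{T} \mathbf{v}$$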
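A minimal Python sketch of Biased LexRank scoring, under stated assumptions: whitespace tokenization, bag-of-words cosine as the `sim` function, a mixing weight `d = 0.7`, and a fixed number of iterations. These choices and all function names are illustrative; the slide does not specify them.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def idf(word, sentences):
    # idf_w = log((N + 1) / (0.5 + sf_w)); sf_w = sentences containing w
    sf = sum(1 for s in sentences if word in s)
    return math.log((len(sentences) + 1) / (0.5 + sf))

def relevance(sentence, query, sentences):
    # rel(s|q) = sum over query words of log(tf_ws + 1) * log(tf_wq + 1) * idf_w
    tf_s, tf_q = Counter(sentence), Counter(query)
    return sum(math.log(tf_s[w] + 1) * math.log(tf_q[w] + 1) * idf(w, sentences)
               for w in tf_q)

def cosine(a, b):
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = (math.sqrt(sum(v * v for v in ca.values()))
            * math.sqrt(sum(v * v for v in cb.values())))
    return dot / norm if norm else 0.0

def biased_lexrank(sentences, query, d=0.7, iterations=50):
    sents = [tokenize(s) for s in sentences]
    q = tokenize(query)
    n = len(sents)
    rel = [relevance(s, q, sents) for s in sents]
    rel_total = sum(rel) or 1.0
    sim = [[cosine(a, b) for b in sents] for a in sents]
    # Column sums: sum over z of sim(z, v)
    col = [sum(sim[z][v] for z in range(n)) for v in range(n)]
    p = [1.0 / n] * n
    for _ in range(iterations):
        p = [d * rel[s] / rel_total
             + (1 - d) * sum(sim[s][v] / col[v] * p[v]
                             for v in range(n) if col[v])
             for s in range(n)]
    return p

# Hypothetical example sentences and query
sentences = [
    "aspirin reduces fever in adults",
    "the trial enrolled adults with fever",
    "funding was provided by the institute",
]
scores = biased_lexrank(sentences, "aspirin fever")
best = scores.index(max(scores))
print(sentences[best])
```

With `d` close to 1 the ranking follows query relevance almost exclusively; lowering `d` lets sentences that are similar to relevant sentences gain score through the graph.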
13
CLAIR Summarization of Scientific Articles
• Goals: – Generating short summaries of research articles – Generating short surveys of research areas
• Approach: – Citing sentences
[Qazvinian et al., JAIR 2013]
Mohammed et al. (2009) suggested using citation information to generate surveys of scientific paradigms. Qazvinian and Radev (2008) proposed a method for summarizing scientific articles by building a similarity network of the citation sentences that cite the target paper, and then applying network analysis techniques to find a set of sentences that covers as much of the summarized paper facts as possible. We use this method as one of the baselines when we evaluate our approach. Qazvinian et al. (2010) proposed a citation-based summarization method that
14
Graph-based Summary Generation
15
Bipartite Graph Summarization
16
Survey Generation (in progress)
• Input: Word Sense Disambiguation
• Output:
Word-sense disambiguation, a problem that once seemed out of reach for systems without a great deal of handcrafted linguistic and world knowledge, can now in some cases be done with high accuracy when all information is derived automatically from corpora (Brown, Lai, and Mercer 1991; Yarowsky 1992; Gale, Church, and Yarowsky 1992; Bruce and Wiebe 1994). WSD approaches can be classified as (a) knowledge-based approaches, which make use of linguistic knowledge, manually coded or extracted from lexical resources (Agirre and Rigau, 1996; Lesk 1986); (b) corpus-based approaches, which make use of shallow knowledge automatically acquired from corpus and statistical or machine learning algorithms to induce disambiguation models (Yarowsky, 1995; Schütze 1998); and (c) hybrid approaches, which mix characteristics from the two other approaches to automatically acquire disambiguation models from corpus supported by linguistic knowledge (Ng and Lee 1996; Stevenson and Wilks, 2001). Many corpus based methods have been proposed to deal with the sense disambiguation problem when given definition for each possible sense of a target word or a tagged corpus with the instances of each possible sense, e.g., supervised sense disambiguation (Leacock et al., 1998), and semi-supervised sense disambiguation (Yarowsky, 1995).
[Jha, King, Coke, Radev, AAAI 2015] 17
CLAIR Protein Interaction Extraction
• Goal: – Extracting protein interaction networks from the literature – Measuring the associations between proteins and diseases
18
19
Prostate Cancer Seed Genes
20
Output
21
More about NLP
• talk at MIDAS – April 8 at 4 pm • midas.umich.edu
23
Extra slides
24
The NLP Pipeline
The girl ate the omelet with a fork.
DET N VBD DET N PRP DET N
25
Constituent Parsing
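[Figure: constituent parse tree of "The girl ate the omelet with a fork"]
(S
  (NP (DET The) (N girl))
  (VP (VBD ate)
      (NP (DET the) (N omelet))
      (PP (PRP with)
          (NP (DET a) (N fork)))))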
26
• Part of speech ambiguity – Ali won the first round – The cantaloupe is a round fruit
• Polysemy and word sense disambiguation – Melissa celebrated in a bar – Melissa passed the bar exam
• Noun phrase parsing – (Ann Arbor) News – Ann (Arbor News)
• Idiomatic expressions – This car cost an arm and a leg – I heard the news straight from the horse’s mouth
• World Knowledge – Every American has a mother – Every American has a president
• Similarity and semantic relatedness – A large airplane is parked on the runway – A jet is sitting on the tarmac
What Makes NLP Hard
27
• Morphological similarity – respect-respectful
• Synonymy – talkative-chatty
• Ontological similarity – cat-tabby
• Distributional similarity – doctor-nurse
• Relational similarity – Aspirin-cold
• Sentence similarity – (paraphrases)
28
Similarity and Relatedness
Applications of Sentence Similarity
• Paraphrasing
– We were in class when the head-master came in, followed by a "new fellow," not wearing the school uniform, and a school servant carrying a large desk.
– We were in the study-hall when the headmaster entered, followed by a new boy not yet in school uniform and by the handy man carrying a large desk.
• Sentence simplification and retargeting
– The best known feature of Jupiter is the Great Red Spot, a persistent anticyclonic storm that is larger than Earth, located 22° south of the equator
– One of the biggest features in Jupiter's atmosphere is the Great Red Spot
• Machine translation
– No word yet on when the airline may start the service
– Aucun mot encore sur le moment où la compagnie aérienne pourrait démarrer le service
• Entailment
– Neiman Marcus Group, Inc. files for its long awaited IPO
– Neiman Marcus goes public
• Presupposition, inference, and background knowledge
– Stockbridge mayor Tim Thompson resigned.
– There is a city called Stockbridge. Cities have mayors. Stockbridge has (had) a mayor. Tim Thompson is no longer mayor of Stockbridge
29
Human Judgments of Similarity
[Finkelstein et al. 2002]
http://wordvectors.org/suite.php
Word 1      Word 2          Score
tiger       cat             7.35
tiger       tiger           10.00
book        paper           7.46
computer    keyboard        7.62
computer    internet        7.58
plane       car             5.77
train       car             6.31
telephone   communication   7.50
television  radio           6.77
media       radio           7.42
drug        abuse           6.85
bread       butter          6.19
cucumber    potato          5.92
bar - Sense 1
barroom, bar, saloon, ginmill, taproom
  => room
    => area
      => structure, construction
        => artifact, artefact
          => object, physical object
            => entity, something
bar - Sense 9
legal profession, bar, legal community
  => profession, community
    => occupation, vocation, occupational group
      => body
        => gathering, assemblage
          => social group
            => group, grouping
https://wordnet.princeton.edu/
[Miller and Fellbaum]
30
Vector Semantics
• vector(‘king’) - vector(‘man’) + vector(‘woman’) ≈ vector(‘queen’) – [Word2Vec, Mikolov 2013]
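The analogy arithmetic can be tried on toy vectors. The three-dimensional vectors below are hand-made for illustration (roughly: royalty, masculinity, femininity) and are not Word2Vec output; real embeddings have hundreds of dimensions and are learned from corpora. The function names are hypothetical.

```python
import math

# Hand-made toy vectors, NOT trained embeddings
vectors = {
    "king":   [0.9, 0.9, 0.1],
    "queen":  [0.9, 0.1, 0.9],
    "man":    [0.1, 0.9, 0.1],
    "woman":  [0.1, 0.1, 0.9],
    "prince": [0.8, 0.9, 0.1],
    "apple":  [0.05, 0.1, 0.1],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def analogy(a, b, c):
    """Return the word closest to vector(a) - vector(b) + vector(c)."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

print(analogy("king", "man", "woman"))
```

Excluding the three input words from the candidates is a common convention in analogy evaluation; without it the nearest neighbor is often one of the inputs.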
31 Image courtesy of Daniel Jurafsky
Acquiring Word Similarity from Collective Discourse
• A person rides a bicycle on concrete.
• A woman riding a bicycle on a large concrete area.
• A woman riding a mountain bike on a large concrete area.
• On a large, flat concrete area a woman in shorts rides a two-wheeled bike.
• The woman is riding a bicycle.
[Everingham & al. 2008] 32
Graph-based Learning for NLP
• Representation
– Lexical networks – vertices=text – edges=similarities
• Properties – Centrality – Diversity
• Learning tasks – Supervised (Classification) – Unsupervised (Clustering) – Semi-supervised
• Methods – Random Walks – Harmonic Functions
• Solid Mathematical Foundation – Scalability
• Efficient Algorithms – Spectral Methods – Diffusion – Diversity – Evolution over time
33
Similarity Functions
• Cosine similarity
• KL divergence
• Language model generation probability
• Syntactic kernels
$$\cos(d, q) = \frac{\sum_{i=1}^{n} d_i q_i}{\sqrt{\sum_{i=1}^{n} d_i^2} \cdot \sqrt{\sum_{i=1}^{n} q_i^2}}$$
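A short sketch of cosine similarity over bag-of-words vectors, applied to the runway/tarmac sentence pair from the earlier slide. Whitespace tokenization and raw term counts are assumptions made for brevity; the function names are illustrative.

```python
import math
from collections import Counter

def cosine_similarity(d, q):
    """cos(d, q) for two equal-length numeric vectors."""
    dot = sum(di * qi for di, qi in zip(d, q))
    norm = math.sqrt(sum(di * di for di in d)) * math.sqrt(sum(qi * qi for qi in q))
    return dot / norm if norm else 0.0

def bag_of_words(text_a, text_b):
    """Build aligned term-count vectors over the joint vocabulary."""
    ca, cb = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    vocab = sorted(set(ca) | set(cb))
    return [ca[w] for w in vocab], [cb[w] for w in vocab]

d, q = bag_of_words("A large airplane is parked on the runway",
                    "A jet is sitting on the tarmac")
score = cosine_similarity(d, q)
print(round(score, 3))
```

The only overlap here comes from function words ("a", "is", "on", "the"), so a purely lexical cosine misses that "airplane"/"jet" and "runway"/"tarmac" are related; that gap is what the other similarity notions on these slides address.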
34
Syntactic Kernels
[Özateş, Özgür, and Radev, LREC 2016] 35
Modularity-based Clustering
36
CLAIR Text Summarization
• Goal: – Reduce a long document … or a set of related documents – for shorter/easier reading … or for a mobile device (while preserving the most important information) (can be generic or query-based)
• Approach: – Lexical networks – Baseline: Centroid method, Radev et al. 2001
37
Iraqi Vice President Taha Yassin Ramadan announced today, Sunday, that Iraq refuses to back down from its decision to stop cooperating with disarmament inspectors before its demands are met. Iraqi Vice President Taha Yassin Ramadan announced today, Thursday, that Iraq rejects cooperating with the United Nations except on the issue of lifting the blockade imposed upon it since the year 1990. Ramadan told reporters in Baghdad that "Iraq cannot deal positively with whoever represents the Security Council unless there was a clear stance on the issue of lifting the blockade off of it." Baghdad had decided late last October to completely cease cooperating with the inspectors of the United Nations Special Commission (UNSCOM), in charge of disarming Iraq's weapons, and whose work became very limited since the fifth of August, and announced it will not resume its cooperation with the Commission even if it were subjected to a military operation. The Russian Foreign Minister, Igor Ivanov, warned today, Wednesday against using force against Iraq, which will destroy, according to him, seven years of difficult diplomatic work and will complicate the regional situation in the area. Ivanov contended that carrying out air strikes against Iraq, who refuses to cooperate with the United Nations inspectors, "will end the tremendous work achieved by the international group during the past seven years and will complicate the situation in the region." Nevertheless, Ivanov stressed that Baghdad must resume working with the Special Commission in charge of disarming the Iraqi weapons of mass destruction (UNSCOM). The Special Representative of the United Nations Secretary-General in Baghdad, Prakash Shah, announced today, Wednesday, after meeting with the Iraqi Deputy Prime Minister Tariq Aziz, that Iraq refuses to back down from its decision to cut off cooperation with the disarmament inspectors.
British Prime Minister Tony Blair said today, Sunday, that the crisis between the international community and Iraq "did not end" and that Britain is still "ready, prepared, and able to strike Iraq." In a gathering with the press held at the Prime Minister's office, Blair contended that the crisis with Iraq "will not end until Iraq has absolutely and unconditionally respected its commitments" towards the United Nations. A spokesman for Tony Blair had indicated that the British Prime Minister gave permission to British Air Force Tornado planes stationed in Kuwait to join the aerial bombardment against Iraq.
38
39
[Figure: sentence similarity graph at threshold t=0.3 over sentences d1s1, d2s1, d2s2, d2s3, d3s1, d3s2, d3s3, d4s1, d5s1, d5s2, d5s3]
Lexical Centrality (t=0.3)
40
[Figure: sentence similarity graph at threshold t=0.2 over sentences d1s1, d2s1, d2s2, d2s3, d3s1, d3s2, d3s3, d4s1, d5s1, d5s2, d5s3]
Lexical Centrality (t=0.2)
41
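[Figure: sentence similarity graph at threshold t=0.1 over sentences d1s1, d2s1, d2s2, d2s3, d3s1, d3s2, d3s3, d4s1, d5s1, d5s2, d5s3; d4s1, d3s2, and d2s1 are labeled separately]
Lexical Centrality (t=0.1)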
42
Recurrence Relation
Can guarantee solution by allowing “jump” probability d/N.
[Figure: example graph around sentence s with edge weights 0.5, 0.3, 0.8, 0.2, 0.1, 0.3, 0.9, 0.2, 0.4]
LexRank – Centrality in Text Graphs
43 [Erkan and Radev, JAIR 2006]
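A rough sketch of threshold-based lexical centrality with a "jump" probability, in the spirit of LexRank. The similarity matrix is hypothetical, and several details are assumptions for illustration: edges are unweighted, self-links are kept so that every sentence has degree of at least one, `d = 0.15`, and the iteration count is fixed.

```python
def threshold_graph(sim, threshold):
    """Link sentences i and j when their similarity exceeds the threshold."""
    n = len(sim)
    return [[1 if sim[i][j] > threshold else 0 for j in range(n)]
            for i in range(n)]

def lexrank(sim, threshold=0.1, d=0.15, iterations=100):
    """Iterate p(u) = d/N + (1 - d) * sum over neighbors v of p(v) / deg(v)."""
    n = len(sim)
    adj = threshold_graph(sim, threshold)
    deg = [sum(row) for row in adj]  # >= 1 because sim[i][i] = 1
    p = [1.0 / n] * n
    for _ in range(iterations):
        p = [d / n + (1 - d) * sum(p[v] / deg[v] for v in range(n) if adj[v][u])
             for u in range(n)]
    return p

# Hypothetical cosine similarities between four sentences
sim = [
    [1.00, 0.45, 0.25, 0.00],
    [0.45, 1.00, 0.15, 0.12],
    [0.25, 0.15, 1.00, 0.00],
    [0.00, 0.12, 0.00, 1.00],
]

# Degree centrality (self-links excluded) at the thresholds shown on the slides
for t in (0.3, 0.2, 0.1):
    degrees = [sum(row) - 1 for row in threshold_graph(sim, t)]
    print(t, degrees)

scores = lexrank(sim, threshold=0.1)
most_central = scores.index(max(scores))
print(most_central)
```

Lowering the threshold adds edges, so degree counts become less selective; the random-walk score also takes into account how central each neighbor is, not only how many neighbors a sentence has.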
DivRank: Diversity Reranking
[Mei, Guo and Radev, SIGKDD 2010] 44
Objective:
By taking the partial derivative of fv, we get:
DivRank
45
NIST DUC Evaluation
46
Abstracts vs. citation summaries
47
49
(Ming-Wei 2006)
In the context of DPs, this edge based factorization method was proposed by Eisner (1996).
(McDonald 2005)
Eisner (1996) gave a generative model with a cubic parsing algorithm based on an edge factorization of trees.
(Lee 1997)
Eisner (1996) proposed an O(n^3) parsing algorithm for PDG.
(Buchholz 2006)
If the parse has to be projective, Eisner’s bottom-up-span algorithm (Eisner, 1996) can be used for the search.
Example
Eisner, J. (1996) “Three New Probabilistic Models for Dependency Parsing.”
50
51
Generated Summary
52
CLAIR 9. Protein-Vaccine Networks
• Goal: – Identify gene interaction networks associated with vaccines