Natural Language Processing for the Health Sciences

Dragomir Radev MIDAS/SI/CSE

MIDAS Town Hall March 30, 2016

radev@umich.edu

CLAIR

• A person today is exposed to more information in a day than a person in the Middle Ages was in a lifetime

• The search engine market is $94B a year – Feb 2016, New York Times

• Siri gets 1 billion requests a week – Jan 2016, USA Today, citing Apple

• PubMed adds 1.29 journal articles per minute! – 2010, Arif Jinha in Learned Publishing

• Users send out 168 million emails every minute – 2015, go-globe.com

• Google indexes at least 48 billion web pages – 2016, WorldWideWebSize.com

• The global healthcare NLP market is expected to grow from $1.10 billion in 2015 to $2.67 billion by 2020 – 2015, marketsandmarkets

Natural Language Text is Everywhere

2

Presenter
Presentation Notes
Jinha 2010 – one paper per minute
http://www.telegraph.co.uk/news/science/science-news/8316534/Welcome-to-the-information-age-174-newspapers-a-day.html
Say something about multilingual texts

Examples of NLP Systems

3

Presenter
Presentation Notes
What is NLP? "The antagonist of Stevenson's Treasure Island." (Who is Long John Silver?)

Natural Language Processing

• Machine Learning
• Database Systems
• Linguistics
• Graph Theory
• Artificial Intelligence
• Human Computer Interaction
• Stat/Data Science
• Information Retrieval
• Humanities
• Social Sciences
• Health Sciences
• Education
• Finance
• Media

Connections Between NLP and Other Areas

4

Presenter
Presentation Notes
Machine Learning – sequence analysis
Database Systems – structured information extraction and storage
Language Technologies
Graph Theory
Artificial Intelligence – knowledge representation and reasoning
Human Computer Interaction – crowdsourcing
Data Science – large-scale information extraction and statistical analysis
Information Retrieval – building personalized search tools
Humanities
Social Sciences
Medicine
Education
Finance
Media (also Legal)

NLP for the Health Sciences

• Goals
  – improving clinical workflow
  – delivery of on-time information
  – reducing readmissions
  – patient satisfaction
• Information Extraction
  – medical papers
  – building structured databases from unstructured text
• Coding and documentation
  – clinical notes
  – electronic patient records
• Personal information management
  – information retrieval
• Telemedicine
  – dialogue systems
• Clinical decision support
  – summarization and survey generation
• Interoperability

5

NLP methods

• Sequence-based
  – part of speech tagging
  – information extraction
• Tree-based
  – parsing
  – sentence simplification
• Graph-based
  – semi-supervised learning
  – clustering
• Semantic
  – reasoning
  – ontologies
  – similarity

6

Dependency Parsing

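[Figure: dependency tree for "The girl ate the omelet with a fork"]

root(ROOT, ate)
nsubj(ate, girl)
dobj(ate, omelet)
nmod(ate, fork)
det(girl, the)
det(omelet, the)
det(fork, a)
case(fork, with)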
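A dependency parse like the one on this slide can be stored as a list of (head, relation, dependent) arcs. The sketch below is one possible encoding, not the format of any particular parser: the token indices, the `ARCS` list, and the helper names are assumptions made for illustration, and the attachments follow the labels shown on the slide.

```python
from collections import defaultdict

# Index 0 is the artificial ROOT node; the rest are the words of the sentence.
TOKENS = ["ROOT", "The", "girl", "ate", "the", "omelet", "with", "a", "fork"]

# (head index, relation, dependent index), following the slide's labels.
ARCS = [
    (0, "root", 3), (3, "nsubj", 2), (3, "dobj", 5), (3, "nmod", 8),
    (2, "det", 1), (5, "det", 4), (8, "det", 7), (8, "case", 6),
]


def children_by_head(arcs):
    """Group (relation, dependent) pairs under their head index."""
    children = defaultdict(list)
    for head, rel, dep in arcs:
        children[head].append((rel, dep))
    return children


def is_well_formed(tokens, arcs):
    """True if every word has exactly one head and is reachable from ROOT."""
    heads = defaultdict(list)
    for head, _, dep in arcs:
        heads[dep].append(head)
    if any(len(heads[i]) != 1 for i in range(1, len(tokens))):
        return False
    children = children_by_head(arcs)
    seen, stack = set(), [0]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(dep for _, dep in children[node])
    return len(seen) == len(tokens)


def describe(tokens, arcs):
    """Render each arc as relation(head, dependent)."""
    return [f"{rel}({tokens[head]}, {tokens[dep]})" for head, rel, dep in arcs]


print(is_well_formed(TOKENS, ARCS))
print("\n".join(describe(TOKENS, ARCS)))
```

Using indices rather than word strings keeps the two occurrences of "the" distinct.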

7

Presenter
Presentation Notes
Remove?

Language Understanding

• Semantic Analysis
  Girl (g1) Omelet (o1) Fork (f1)
  Eating (e1) ^ Eater (e1,g1) ^ Eaten (e1,o1) ^ Instrument (e1,f1)
• World Knowledge
  Omelet (X) => Food (X)
• Inference
  Hungry (Z,t0) ^ Eater (e1,Z) ^ Eaten (e1,Y) ^ Time (e1,t1) ^ Food (Y) ^ Precedes (t0,t1) => ¬Hungry (Z,t1)
• Conclusion
  ¬Hungry (g1,t1)
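The chain from semantic analysis to the conclusion can be reproduced with a few lines of forward chaining. This is a minimal sketch, not a general theorem prover: facts are plain tuples, the two rules are hard-coded, the event in the inference rule is treated as a variable, and the facts Hungry (g1,t0), Time (e1,t1), and Precedes (t0,t1) are assumed here because the conclusion needs them.

```python
FACTS = {
    ("Girl", "g1"), ("Omelet", "o1"), ("Fork", "f1"), ("Eating", "e1"),
    ("Eater", "e1", "g1"), ("Eaten", "e1", "o1"), ("Instrument", "e1", "f1"),
    # Assumed background facts, needed for the rule to fire.
    ("Hungry", "g1", "t0"), ("Time", "e1", "t1"), ("Precedes", "t0", "t1"),
}


def by_predicate(facts, name):
    """Return the argument tuples of all facts with the given predicate name."""
    return [fact[1:] for fact in facts if fact[0] == name]


def infer(facts):
    """Apply the two rules repeatedly until no new fact is derived."""
    facts = set(facts)
    while True:
        derived = set()
        # World knowledge: Omelet(X) => Food(X)
        for (x,) in by_predicate(facts, "Omelet"):
            derived.add(("Food", x))
        # Inference: a hungry eater of food is not hungry afterwards.
        for e, z in by_predicate(facts, "Eater"):
            for e2, y in by_predicate(facts, "Eaten"):
                for e3, t1 in by_predicate(facts, "Time"):
                    for z2, t0 in by_predicate(facts, "Hungry"):
                        if (e == e2 == e3 and z == z2
                                and ("Food", y) in facts
                                and ("Precedes", t0, t1) in facts):
                            derived.add(("NotHungry", z, t1))
        if derived <= facts:
            return facts
        facts |= derived


print(("NotHungry", "g1", "t1") in infer(FACTS))
```

The first pass derives Food (o1); only the second pass can then derive the negated Hungry fact, which is why the loop runs to a fixed point.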

8

Presenter
Presentation Notes
importance of relations importance of similarity

Vector Semantics

http://www.tensorflow.org/tutorials/word2vec/index.md

Recursive NN for sentiment (Socher et al.)

NLP to logical form interface (Dong and Lapata, 2016)

12

Presenter
Presentation Notes
Use colors

Query-based Summarization: Biased LexRank

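$$\mathrm{rel}(s \mid q) = \sum_{w \in q} \log(\mathrm{tf}_{w,s} + 1) \cdot \log(\mathrm{tf}_{w,q} + 1) \cdot \mathrm{idf}_w$$

$$\mathrm{idf}_w = \log\left(\frac{N + 1}{0.5 + \mathrm{sf}_w}\right)$$

$$p(s \mid q) = d \, \frac{\mathrm{rel}(s \mid q)}{\sum_{z \in C} \mathrm{rel}(z \mid q)} + (1 - d) \sum_{v \in C} \frac{\mathrm{sim}(s, v)}{\sum_{z \in C} \mathrm{sim}(z, v)} \, p(v \mid q)$$

$$\mathbf{v} = [\, d\mathbf{A} + (1 - d)\mathbf{B} \,]^{T} \mathbf{v}$$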
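The query-biased ranking on this slide can be sketched in a few functions: a relevance score that multiplies log term frequencies in the sentence and the query by an idf weight, and a fixed-point iteration that mixes normalized relevance (weight d) with similarity-weighted scores of the other sentences (weight 1 − d). Several details below are assumptions for illustration: whitespace tokenization, N and sf counted over the sentences passed in, bag-of-words cosine as the similarity, `d=0.7`, and 50 iterations.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def idf(word, docs):
    """log((N + 1) / (0.5 + sf_w)), with sf_w = number of sentences containing w."""
    sf = sum(1 for doc in docs if word in doc)
    return math.log((len(docs) + 1) / (0.5 + sf))


def relevance(sentence, query, docs):
    """Sum over query words of log(tf_ws + 1) * log(tf_wq + 1) * idf_w."""
    tf_s, tf_q = Counter(sentence), Counter(query)
    return sum(
        math.log(tf_s[w] + 1) * math.log(tf_q[w] + 1) * idf(w, docs)
        for w in tf_q
    )


def cosine(a, b):
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = (math.sqrt(sum(c * c for c in ca.values()))
            * math.sqrt(sum(c * c for c in cb.values())))
    return dot / norm if norm else 0.0


def biased_lexrank(sentences, query, d=0.7, iterations=50):
    docs = [tokenize(s) for s in sentences]
    q = tokenize(query)
    n = len(docs)
    rel = [relevance(s, q, docs) for s in docs]
    rel_total = sum(rel) or 1.0  # avoid division by zero when nothing matches
    sim = [[cosine(s, v) for v in docs] for s in docs]
    col = [sum(sim[z][v] for z in range(n)) or 1.0 for v in range(n)]
    p = [1.0 / n] * n
    for _ in range(iterations):
        p = [
            d * rel[s] / rel_total
            + (1 - d) * sum(sim[s][v] / col[v] * p[v] for v in range(n))
            for s in range(n)
        ]
    return p


SENTENCES = [
    "the protein binds the receptor",
    "protein interaction networks link genes",
    "the weather was mild today",
]
scores = biased_lexrank(SENTENCES, "protein interaction")
print(max(range(len(scores)), key=scores.__getitem__))
```

With a larger d the ranking follows the query more closely; with a smaller d it approaches a ranking driven only by the similarity between sentences.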

13

CLAIR Summarization of Scientific Articles

• Goals:
  – Generating short summaries of research articles
  – Generating short surveys of research areas
• Approach:
  – Citing sentences

[Qazvinian et al., JAIR 2013]

Mohammed et al. (2009) suggested using citation information to generate surveys of scientific paradigms. Qazvinian and Radev (2008) proposed a method for summarizing scientific articles by building a similarity network of the citation sentences that cite the target paper, and then applying network analysis techniques to find a set of sentences that covers as much of the summarized paper facts as possible. We use this method as one of the baselines when we evaluate our approach. Qazvinian et al. (2010) proposed a citation-based summarization method that

14

Graph-based Summary Generation

15

Presenter
Presentation Notes
Delete this slide?

Bipartite Graph Summarization

16

Presenter
Presentation Notes
HITS Remove this slide?

Survey Generation (in progress)

• Input: Word Sense Disambiguation
• Output:

Word-sense disambiguation, a problem that once seemed out of reach for systems without a great deal of handcrafted linguistic and world knowledge, can now in some cases be done with high accuracy when all information is derived automatically from corpora (Brown, Lai, and Mercer 1991; Yarowsky 1992; Gale, Church, and Yarowsky 1992; Bruce and Wiebe 1994). WSD approaches can be classified as (a) knowledge-based approaches, which make use of linguistic knowledge, manually coded or extracted from lexical resources (Agirre and Rigau, 1996; Lesk 1986); (b) corpus-based approaches, which make use of shallow knowledge automatically acquired from corpus and statistical or machine learning algorithms to induce disambiguation models (Yarowsky, 1995; Schütze 1998); and (c) hybrid approaches, which mix characteristics from the two other approaches to automatically acquire disambiguation models from corpus supported by linguistic knowledge (Ng and Lee 1996; Stevenson and Wilks, 2001). Many corpus based methods have been proposed to deal with the sense disambiguation problem when given definition for each possible sense of a target word or a tagged corpus with the instances of each possible sense, e.g., supervised sense disambiguation (Leacock et al., 1998), and semi-supervised sense disambiguation (Yarowsky, 1995).

[Jha, King, Coke, Radev, AAAI 2015] 17

Presenter
Presentation Notes
Connection with query-based summarization

CLAIR Protein Interaction Extraction

• Goal:
  – Extracting protein interaction networks from the literature
  – Measuring the associations between proteins and diseases

18

19

Prostate Cancer Seed Genes

20

Output

21

More about NLP

• talk at MIDAS – April 8 at 4 pm
• midas.umich.edu

23

Extra slides

24

The NLP Pipeline

The girl ate the omelet with a fork.

DET N VBD DET N PRP DET N

25

Constituent Parsing

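[Figure: constituent parse tree of "The girl ate the omelet with a fork"]

(S (NP (DET The) (N girl))
   (VP (VBD ate)
       (NP (DET the) (N omelet))
       (PP (PRP with)
           (NP (DET a) (N fork)))))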

26

Presenter
Presentation Notes
JM?

• Part of speech ambiguity
  – Ali won the first round
  – The cantaloupe is a round fruit
• Polysemy and word sense disambiguation
  – Melissa celebrated in a bar
  – Melissa passed the bar exam
• Noun phrase parsing
  – (Ann Arbor) News
  – Ann (Arbor News)
• Idiomatic expressions
  – This car cost an arm and a leg
  – I heard the news straight from the horse’s mouth
• World Knowledge
  – Every American has a mother
  – Every American has a president
• Similarity and semantic relatedness
  – A large airplane is parked on the runway
  – A jet is sitting on the tarmac

What Makes NLP Hard

27

Presenter
Presentation Notes
I will focus more on semantic similarity as a hard problem that hasn’t been solved.

• Morphological similarity
  – respect-respectful
• Synonymy
  – talkative-chatty
• Ontological similarity
  – cat-tabby
• Distributional similarity
  – doctor-nurse
• Relational similarity
  – Aspirin-cold
• Sentence similarity
  – (paraphrases)

28

Similarity and Relatedness

Presenter
Presentation Notes
Remove?

Applications of Sentence Similarity

• Paraphrasing
  – We were in class when the head-master came in, followed by a "new fellow," not wearing the school uniform, and a school servant carrying a large desk.
  – We were in the study-hall when the headmaster entered, followed by a new boy not yet in school uniform and by the handy man carrying a large desk.
• Sentence simplification and retargeting
  – The best known feature of Jupiter is the Great Red Spot, a persistent anticyclonic storm that is larger than Earth, located 22° south of the equator
  – One of the biggest features in Jupiter's atmosphere is the Great Red Spot
• Machine translation
  – No word yet on when the airline may start the service
  – Aucun mot encore sur le moment où la compagnie aérienne pourrait démarrer le service
• Entailment
  – Neiman Marcus Group, Inc. files for its long awaited IPO
  – Neiman Marcus goes public
• Presupposition, inference, and background knowledge
  – Stockbridge mayor Tim Thompson resigned.
  – There is a city called Stockbridge. Cities have mayors. Stockbridge has (had) a mayor. Tim Thompson is no longer mayor of Stockbridge

29

Presenter
Presentation Notes
The opening sentence of Madame Bovary by Gustave Flaubert (1857)

Original: Nous étions à l’Étude, quand le Proviseur entra, suivi d’un nouveau habillé en bourgeois et d’un garçon de classe qui portait un grand pupitre. Ceux qui dormaient se réveillèrent, et chacun se leva comme surpris dans son travail.

Translation 1 (Eleanor Marx-Aveling): We were in class when the head-master came in, followed by a "new fellow," not wearing the school uniform, and a school servant carrying a large desk. Those who had been asleep woke up, and every one rose as if just surprised at his work.

Translation 2: We were in the study-hall when the headmaster entered, followed by a new boy not yet in school uniform and by the handy man carrying a large desk. Their arrival disturbed the slumbers of some of us, but we all stood up in our places as though rising from our work.

Translation 3: We were in the prep.-room when the Head came in, followed by a new boy in 'mufti' and a beadle carrying a big desk. The sleepers aroused themselves, and we all stood up, putting on a startled look, as if we had been buried in our work.

Example application – summarizing medical papers for lay people.

Human Judgments of Similarity

[Finkelstein et al. 2002]

http://wordvectors.org/suite.php

Word 1       Word 2          Rating
tiger        cat             7.35
tiger        tiger           10.00
book         paper           7.46
computer     keyboard        7.62
computer     internet        7.58
plane        car             5.77
train        car             6.31
telephone    communication   7.50
television   radio           6.77
media        radio           7.42
drug         abuse           6.85
bread        butter          6.19
cucumber     potato          5.92
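Human ratings like these are commonly used to evaluate an automatic similarity measure by rank correlation: the better the measure, the more closely its ordering of word pairs matches the human ordering. The sketch below computes Spearman's rho for a handful of the pairs above. The `MODEL` scores are invented purely for illustration and do not come from any real system; the rank formula used assumes there are no tied values.

```python
# Human ratings copied from the slide.
HUMAN = {
    ("tiger", "cat"): 7.35,
    ("book", "paper"): 7.46,
    ("computer", "keyboard"): 7.62,
    ("plane", "car"): 5.77,
    ("train", "car"): 6.31,
    ("cucumber", "potato"): 5.92,
}

# Hypothetical similarity scores from some model (made up for this example).
MODEL = {
    ("tiger", "cat"): 0.61,
    ("book", "paper"): 0.48,
    ("computer", "keyboard"): 0.66,
    ("plane", "car"): 0.35,
    ("train", "car"): 0.52,
    ("cucumber", "potato"): 0.41,
}


def ranks(values):
    """Rank values from 1 (smallest) to n; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(xs, ys):
    """Spearman's rho as 1 - 6 * sum(d^2) / (n * (n^2 - 1)); no ties allowed."""
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d2 / (n * (n * n - 1))


pairs = list(HUMAN)
rho = spearman([HUMAN[p] for p in pairs], [MODEL[p] for p in pairs])
print(round(rho, 3))
```

Rank correlation is used rather than a direct comparison of values because human ratings and model scores live on different scales.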

bar - Sense 1
barroom, bar, saloon, ginmill, taproom
  => room
  => area
  => structure, construction
  => artifact, artefact
  => object, physical object
  => entity, something

bar - Sense 9
legal profession, bar, legal community
  => profession, community
  => occupation, vocation, occupational group
  => body
  => gathering, assemblage
  => social group
  => group, grouping

https://wordnet.princeton.edu/

[Miller and Fellbaum]

30

Presenter
Presentation Notes
automatic

Vector Semantics

• vector(‘king’) - vector(‘man’) + vector(‘woman’) ≈ vector(‘queen’) – [Word2Vec, Mikolov 2013]
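The analogy arithmetic can be tried on a toy example. The sketch below uses hand-made three-dimensional vectors, invented for illustration only; real word2vec embeddings are learned from text and are much larger. The function name `analogy` and the choice to exclude the three query words from the candidates are likewise assumptions of this sketch.

```python
import math

# Toy vectors; the dimensions loosely stand for (royalty, male, female).
VECTORS = {
    "king":   (0.9, 0.9, 0.1),
    "queen":  (0.9, 0.1, 0.9),
    "man":    (0.1, 0.9, 0.1),
    "woman":  (0.1, 0.1, 0.9),
    "prince": (0.7, 0.9, 0.1),
    "apple":  (0.0, 0.1, 0.1),
}


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def analogy(a, b, c, vectors):
    """Return the word closest (by cosine) to vector(a) - vector(b) + vector(c)."""
    target = tuple(x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c]))
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(target, vectors[w]))


print(analogy("king", "man", "woman", VECTORS))
```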

31 Image courtesy of Daniel Jurafsky

Acquiring Word Similarity from Collective Discourse

• A person rides a bicycle on concrete.

• A woman riding a bicycle on a large concrete area.

• A woman riding a mountain bike on a large concrete area.

• On a large, flat concrete area a woman in shorts rides a two-wheeled bike.

• The woman is riding a bicycle.

[Everingham & al. 2008] 32

Presenter
Presentation Notes
Parsing the five sentences simultaneously What if we had 5,000 sentences?

Graph-based Learning for NLP

• Representation
  – Lexical networks
  – vertices=text
  – edges=similarities
• Properties
  – Centrality
  – Diversity
• Learning tasks
  – Supervised (Classification)
  – Unsupervised (Clustering)
  – Semi-supervised
• Methods
  – Random Walks
  – Harmonic Functions
• Solid Mathematical Foundation
  – Scalability
• Efficient Algorithms
  – Spectral Methods
  – Diffusion
  – Diversity
  – Evolution over time

33

Presenter
Presentation Notes
Include image

Similarity Functions

• Cosine similarity
• KL divergence
• Language model generation probability
• Syntactic kernels

$$\cos(d, q) = \frac{\sum_{i=1}^{n} d_i q_i}{\sqrt{\sum_{i=1}^{n} d_i^2} \, \sqrt{\sum_{i=1}^{n} q_i^2}}$$
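Cosine similarity, the first function listed on this slide, is the dot product of two term vectors divided by the product of their lengths. A minimal sketch over bag-of-words counts follows; lowercasing and whitespace tokenization are simplifying assumptions, and the two example sentences are the airplane and jet pair from the "What Makes NLP Hard" slide.

```python
import math
from collections import Counter


def cosine_counts(a, b):
    """Cosine between two term-count mappings; 0.0 if either is empty."""
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(c * c for c in a.values()))
            * math.sqrt(sum(c * c for c in b.values())))
    return dot / norm if norm else 0.0


def text_cosine(s1, s2):
    """Cosine between the bag-of-words vectors of two strings."""
    return cosine_counts(Counter(s1.lower().split()), Counter(s2.lower().split()))


score = text_cosine(
    "A large airplane is parked on the runway",
    "A jet is sitting on the tarmac",
)
print(round(score, 3))
```

The only overlap between the two sentences is in function words ("a", "is", "on", "the"), so the score reflects none of the shared meaning between "airplane" and "jet" or "runway" and "tarmac". That gap is what the semantic similarity measures in this deck aim to close.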

34

Syntactic Kernels

[Özateş, Özgür, and Radev, LREC 2016] 35

Modularity-based Clustering

36

CLAIR Text Summarization

• Goal:
  – Reduce a long document … or a set of related documents
  – for shorter/easier reading … or for a mobile device (while preserving the most important information) (can be generic or query-based)
• Approach:
  – Lexical networks
  – Baseline: Centroid method, Radev et al. 2001

37

Iraqi Vice President Taha Yassin Ramadan announced today, Sunday, that Iraq refuses to back down from its decision to stop cooperating with disarmament inspectors before its demands are met. Iraqi Vice president Taha Yassin Ramadan announced today, Thursday, that Iraq rejects cooperating with the United Nations except on the issue of lifting the blockade imposed upon it since the year 1990. Ramadan told reporters in Baghdad that "Iraq cannot deal positively with whoever represents the Security Council unless there was a clear stance on the issue of lifting the blockade off of it. Baghdad had decided late last October to completely cease cooperating with the inspectors of the United Nations Special Commission (UNSCOM), in charge of disarming Iraq's weapons, and whose work became very limited since the fifth of August, and announced it will not resume its cooperation with the Commission even if it were subjected to a military operation. The Russian Foreign Minister, Igor Ivanov, warned today, Wednesday against using force against Iraq, which will destroy, according to him, seven years of difficult diplomatic work and will complicate the regional situation in the area. Ivanov contended that carrying out air strikes against Iraq, who refuses to cooperate with the United Nations inspectors, ``will end the tremendous work achieved by the international group during the past seven years and will complicate the situation in the region.'' Nevertheless, Ivanov stressed that Baghdad must resume working with the Special Commission in charge of disarming the Iraqi weapons of mass destruction (UNSCOM). The Special Representative of the United Nations Secretary-General in Baghdad, Prakash Shah, announced today, Wednesday, after meeting with the Iraqi Deputy Prime Minister Tariq Aziz, that Iraq refuses to back down from its decision to cut off cooperation with the disarmament inspectors. 
British Prime Minister Tony Blair said today, Sunday, that the crisis between the international community and Iraq ``did not end'' and that Britain is still ``ready, prepared, and able to strike Iraq.'' In a gathering with the press held at the Prime Minister's office, Blair contended that the crisis with Iraq ``will not end until Iraq has absolutely and unconditionally respected its commitments'' towards the United Nations. A spokesman for Tony Blair had indicated that the British Prime Minister gave permission to British Air Force Tornado planes stationed in Kuwait to join the aerial bombardment against Iraq.

38

39

Lexical Centrality (t=0.3)

[Figure: sentence graph over nodes d1s1, d2s1, d2s2, d2s3, d3s1, d3s2, d3s3, d4s1, d5s1, d5s2, d5s3]

40

Lexical Centrality (t=0.2)

[Figure: sentence graph over nodes d1s1, d2s1, d2s2, d2s3, d3s1, d3s2, d3s3, d4s1, d5s1, d5s2, d5s3]

41

Lexical Centrality (t=0.1)

[Figure: sentence graph over nodes d1s1, d2s1, d2s2, d2s3, d3s1, d3s2, d3s3, d4s1, d5s1, d5s2, d5s3; nodes d4s1, d3s2, and d2s1 are labeled separately]
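The three Lexical Centrality slides show the same sentences at thresholds t=0.3, 0.2, and 0.1. One simple way to read such graphs is degree centrality: connect two sentences when their similarity exceeds t, then count each sentence's neighbors. In the sketch below the node names are borrowed from the slides, but the similarity values are hypothetical, and self-links are left out of the count.

```python
NAMES = ["d1s1", "d2s1", "d3s2", "d4s1"]

# Hypothetical symmetric similarity matrix (not taken from the slides).
SIM = [
    [1.00, 0.45, 0.25, 0.15],
    [0.45, 1.00, 0.35, 0.12],
    [0.25, 0.35, 1.00, 0.05],
    [0.15, 0.12, 0.05, 1.00],
]


def degree_centrality(sim, threshold):
    """For each sentence, count the other sentences with similarity above the threshold."""
    n = len(sim)
    return [
        sum(1 for j in range(n) if j != i and sim[i][j] > threshold)
        for i in range(n)
    ]


for t in (0.3, 0.2, 0.1):
    print(t, dict(zip(NAMES, degree_centrality(SIM, t))))
```

Lowering the threshold adds edges, so the graph becomes denser and degree alone separates the sentences less sharply.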

42

Presenter
Presentation Notes
THEME

Recurrence Relation

Can guarantee a solution by allowing a “jump” probability d/N.

[Figure: graph around node s with weights 0.5, 0.3, 0.8, 0.2, 0.1, 0.3, 0.9, 0.2, 0.4]

LexRank – Centrality in Text Graphs

43 [Erkan and Radev, JAIR 2006]
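The recurrence with a jump probability can be sketched as a power iteration: each sentence receives d/N from the random jump plus a share (1 − d) of the scores of its neighbors, each divided by that neighbor's degree. This is a minimal sketch on a thresholded similarity graph. The similarity matrix is hypothetical, and `threshold=0.3`, `d=0.15`, the fixed iteration count, and the self-loop on every node (which keeps every degree above zero) are assumptions of the example.

```python
# Hypothetical symmetric similarity matrix for four sentences.
SIM = [
    [1.00, 0.45, 0.25, 0.15],
    [0.45, 1.00, 0.35, 0.12],
    [0.25, 0.35, 1.00, 0.05],
    [0.15, 0.12, 0.05, 1.00],
]


def lexrank(sim, threshold=0.3, d=0.15, iterations=100):
    """Iterate p(u) = d/N + (1 - d) * sum over neighbors v of p(v) / deg(v)."""
    n = len(sim)
    # Neighbors above the threshold; every node also links to itself.
    adj = [[j for j in range(n) if j == i or sim[i][j] > threshold]
           for i in range(n)]
    deg = [len(neighbors) for neighbors in adj]
    p = [1.0 / n] * n
    for _ in range(iterations):
        p = [d / n + (1 - d) * sum(p[v] / deg[v] for v in adj[u])
             for u in range(n)]
    return p


scores = lexrank(SIM)
print([round(s, 3) for s in scores])
```

Without the jump term, an isolated sentence or a disconnected cluster could leave the iteration without a unique solution; with it, the scores always sum to one and converge from any starting point.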

DivRank: Diversity Reranking

[Mei, Guo and Radev, SIGKDD 2010] 44

Presenter
Presentation Notes
VARIATIONS

Objective:

By taking the partial derivative of fv, we get:

DivRank

45

Presenter
Presentation Notes
Hide?

NIST DUC Evaluation

46

Presenter
Presentation Notes
Explain the results

Abstracts vs. citation summaries

47

http://clair.eecs.umich.edu/aan

The ACL Anthology Network

48

49

(Ming-Wei 2006)

In the context of DPs, this edge based factorization method was proposed by Eisner (1996).

(McDonald 2005)

Eisner (1996) gave a generative model with a cubic parsing algorithm based on an edge factorization of trees.

(Lee 1997)

Eisner (1996) proposed an O(n³) parsing algorithm for PDG.

(Buchholz 2006)

If the parse has to be projective, Eisner’s bottom-up-span algorithm (Eisner, 1996) can be used for the search.

Example

Eisner, J. (1996) “Three New Probabilistic Models for Dependency Parsing.”

50

Presenter
Presentation Notes
Connect to Collective Discourse

51

Generated Summary

52

CLAIR 9. Protein-Vaccine Networks

• Goal:
  – Identify gene interaction networks associated with vaccines
