natural language processing
![Page 1: Natural Language Processing](https://reader030.vdocuments.net/reader030/viewer/2022032422/55a8de071a28ab1d0d8b46d3/html5/thumbnails/1.jpg)
Natural Language Processing
![Page 2: Natural Language Processing](https://reader030.vdocuments.net/reader030/viewer/2022032422/55a8de071a28ab1d0d8b46d3/html5/thumbnails/2.jpg)
2
Why “natural language”?
Natural vs. artificial
Language vs. English
![Page 3: Natural Language Processing](https://reader030.vdocuments.net/reader030/viewer/2022032422/55a8de071a28ab1d0d8b46d3/html5/thumbnails/3.jpg)
3
Why “natural language”?
Natural vs. artificial: not precise, ambiguous, wide range of expression
Language vs. English: English, French, Japanese, Spanish
![Page 4: Natural Language Processing](https://reader030.vdocuments.net/reader030/viewer/2022032422/55a8de071a28ab1d0d8b46d3/html5/thumbnails/4.jpg)
4
Why “natural language”?
Natural vs. artificial: not precise, ambiguous, wide range of expression
Language vs. English: English, French, Japanese, Spanish
Natural language processing = programs and theories aimed at understanding a problem or question posed in natural language and answering it
![Page 5: Natural Language Processing](https://reader030.vdocuments.net/reader030/viewer/2022032422/55a8de071a28ab1d0d8b46d3/html5/thumbnails/5.jpg)
5
Approaches
System building: interactive, understanding only, generation only
Theoretical: draws on linguistics, psychology, philosophy
![Page 6: Natural Language Processing](https://reader030.vdocuments.net/reader030/viewer/2022032422/55a8de071a28ab1d0d8b46d3/html5/thumbnails/6.jpg)
6
Building an NL system is hard
Unlikely to be possible without solid theoretical underpinnings
![Page 7: Natural Language Processing](https://reader030.vdocuments.net/reader030/viewer/2022032422/55a8de071a28ab1d0d8b46d3/html5/thumbnails/7.jpg)
7
Natural language is useful
Question-answering systems http://tangra.si.umich.edu/clair/NSIR/NSIR.cgi
Mixed initiative systems http://www.cs.columbia.edu/~noemie/match.mpg
Information extraction http://nlp.cs.nyu.edu/info-extr/biomedical-snapshot.jpg
Systems that write/speak http://www-2.cs.cmu.edu/~awb/synthesizers.html MAGIC
Machine translation http://world.altavista.com/babelfish
![Page 8: Natural Language Processing](https://reader030.vdocuments.net/reader030/viewer/2022032422/55a8de071a28ab1d0d8b46d3/html5/thumbnails/8.jpg)
8
Topics
Syntax
Semantics
Pragmatics
Statistical NLP: combining learning and NL processing
![Page 9: Natural Language Processing](https://reader030.vdocuments.net/reader030/viewer/2022032422/55a8de071a28ab1d0d8b46d3/html5/thumbnails/9.jpg)
9
Goal of Interpretation
Identify sentence meaning
Do something with meaning
Need some representation of action/meaning
![Page 10: Natural Language Processing](https://reader030.vdocuments.net/reader030/viewer/2022032422/55a8de071a28ab1d0d8b46d3/html5/thumbnails/10.jpg)
10
Analysis of form: Syntax
Which parts were damaged by larger machines?
Which parts damaged larger machines?
Which larger machines damaged parts?
Approaches:
Statistical part-of-speech tagging
Parsing using a grammar
Shallow parsing: identify meaningful chunks
![Page 11: Natural Language Processing](https://reader030.vdocuments.net/reader030/viewer/2022032422/55a8de071a28ab1d0d8b46d3/html5/thumbnails/11.jpg)
11
Which parts were damaged by larger machines?
[S(Q) [NP [ADJ larger] [N machines]] [VP [V(past) damage] [NP(Q) [Det(Q) which] [N parts]]]]
![Page 12: Natural Language Processing](https://reader030.vdocuments.net/reader030/viewer/2022032422/55a8de071a28ab1d0d8b46d3/html5/thumbnails/12.jpg)
12
Which parts were damaged by machines? – with functional roles
[S(Q) [NP(SUBJ) [ADJ larger] [N machines]] [VP [V(past) damage] [NP(Q)(OBJ) [Det(Q) which] [N parts]]]]
![Page 13: Natural Language Processing](https://reader030.vdocuments.net/reader030/viewer/2022032422/55a8de071a28ab1d0d8b46d3/html5/thumbnails/13.jpg)
13
Which parts damaged machines? – with functional roles
[S(Q) [NP(Q)(SUBJ) [Det(Q) which] [N parts]] [VP [V(past) damage] [NP(OBJ) [ADJ larger] [N machines]]]]
![Page 14: Natural Language Processing](https://reader030.vdocuments.net/reader030/viewer/2022032422/55a8de071a28ab1d0d8b46d3/html5/thumbnails/14.jpg)
14
Parsers
Grammar
S -> NP VP
NP -> DET {ADJ*} N
Different types of grammars
Context Free vs. Context Sensitive
Lexical Functional Grammar vs. Tree Adjoining Grammars
Different ways of acquiring grammars
Hand-encoded vs. machine learned
Domain independent (TreeBank, Wall Street Journal)
Domain dependent (Medical texts)
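The two grammar rules on this slide can be tried out with a tiny recognizer. The following is a minimal sketch, assuming the input has already been tagged with parts of speech. The VP rule (a verb followed by an optional NP) and the function names are assumptions made for illustration; the slides give only the S and NP rules.

```python
def parse_np(tags, i):
    """NP -> DET {ADJ*} N. Return the index just after the NP, or None."""
    if i < len(tags) and tags[i] == "DET":
        i += 1
        while i < len(tags) and tags[i] == "ADJ":
            i += 1
        if i < len(tags) and tags[i] == "N":
            return i + 1
    return None


def parse_vp(tags, i):
    """VP -> V [NP]. Assumed rule, not given on the slide."""
    if i < len(tags) and tags[i] == "V":
        j = parse_np(tags, i + 1)
        return j if j is not None else i + 1
    return None


def is_sentence(tags):
    """S -> NP VP, required to consume the whole tag sequence."""
    i = parse_np(tags, 0)
    if i is None:
        return False
    return parse_vp(tags, i) == len(tags)
```

A hand-encoded grammar like this one is easy to inspect but covers very little; that trade-off is what motivates the machine-learned grammars listed on the slide.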
![Page 15: Natural Language Processing](https://reader030.vdocuments.net/reader030/viewer/2022032422/55a8de071a28ab1d0d8b46d3/html5/thumbnails/15.jpg)
15
Semantics: analysis of meaning
Word meaning
John picked up a bad cold.
John picked up a large rock.
John picked up Radio Netherlands on his radio.
John picked up a hitchhiker on Highway 66.
Phrasal meaning
Baby bonuses -> allocations
Senior citizens -> personnes âgées
Causing havoc -> sème le désarroi
Approaches
Representing meaning
Statistical word disambiguation
Symbolic rule-based vs. shallow statistical semantics
![Page 16: Natural Language Processing](https://reader030.vdocuments.net/reader030/viewer/2022032422/55a8de071a28ab1d0d8b46d3/html5/thumbnails/16.jpg)
16
Representing Meaning - WordNet
![Page 17: Natural Language Processing](https://reader030.vdocuments.net/reader030/viewer/2022032422/55a8de071a28ab1d0d8b46d3/html5/thumbnails/17.jpg)
17
![Page 18: Natural Language Processing](https://reader030.vdocuments.net/reader030/viewer/2022032422/55a8de071a28ab1d0d8b46d3/html5/thumbnails/18.jpg)
18
OMEGA
http://omega.isi.edu:8007/index
http://omega.is.edu/doc/browsers.html
![Page 19: Natural Language Processing](https://reader030.vdocuments.net/reader030/viewer/2022032422/55a8de071a28ab1d0d8b46d3/html5/thumbnails/19.jpg)
19
![Page 20: Natural Language Processing](https://reader030.vdocuments.net/reader030/viewer/2022032422/55a8de071a28ab1d0d8b46d3/html5/thumbnails/20.jpg)
20
Statistical Word Sense Disambiguation
Context within the sentence determines which sense is correct
The candidate picked up [sense6] thousands of additional votes.
He picked up [sense2] the book and started to read.
Her performance in school picked up [sense13].
The swimmers got out of the river and climbed the bank [sloping land] to retrieve their towels.
The investors took their money out of the bank [financial institution] and moved it into stocks and bonds.
![Page 21: Natural Language Processing](https://reader030.vdocuments.net/reader030/viewer/2022032422/55a8de071a28ab1d0d8b46d3/html5/thumbnails/21.jpg)
21
Goal
A program which can predict which sense is the correct sense given a new sentence containing “pick up” or “bank”
Avoid manually itemizing all words which can occur in sentences with different meanings
Can we use machine learning?
![Page 22: Natural Language Processing](https://reader030.vdocuments.net/reader030/viewer/2022032422/55a8de071a28ab1d0d8b46d3/html5/thumbnails/22.jpg)
22
What do we need?
Data
Features
Machine Learning algorithm
Decision tree vs. SVM/Naïve Bayes
Inspecting the output
Accuracy of these methods
![Page 23: Natural Language Processing](https://reader030.vdocuments.net/reader030/viewer/2022032422/55a8de071a28ab1d0d8b46d3/html5/thumbnails/23.jpg)
23
Using Categories from Roget’s Thesaurus (e.g., machine vs. animal) for training
![Page 24: Natural Language Processing](https://reader030.vdocuments.net/reader030/viewer/2022032422/55a8de071a28ab1d0d8b46d3/html5/thumbnails/24.jpg)
24
Training data for “machines”
![Page 25: Natural Language Processing](https://reader030.vdocuments.net/reader030/viewer/2022032422/55a8de071a28ab1d0d8b46d3/html5/thumbnails/25.jpg)
25
![Page 26: Natural Language Processing](https://reader030.vdocuments.net/reader030/viewer/2022032422/55a8de071a28ab1d0d8b46d3/html5/thumbnails/26.jpg)
26
Predicting the correct sense in unseen text
Use presence of the salient words in context
50 word window
Use Bayes’ rule to compute probabilities for different categories
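The three steps on this slide can be sketched as a small Naive Bayes classifier. This is an illustrative sketch, not the system behind the slides: the function names, the add-one smoothing, and reading "50 word window" as 50 words on each side of the target are all assumptions.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """examples: list of (category, context_words) pairs."""
    word_counts = defaultdict(Counter)
    cat_counts = Counter()
    vocab = set()
    for cat, words in examples:
        cat_counts[cat] += 1
        word_counts[cat].update(words)
        vocab.update(words)
    return word_counts, cat_counts, vocab


def window(tokens, index, size=50):
    """Words within `size` positions on either side of tokens[index]."""
    lo = max(0, index - size)
    return tokens[lo:index] + tokens[index + 1:index + 1 + size]


def classify(context, model):
    """Pick the category maximizing log P(cat) + sum of log P(word | cat)."""
    word_counts, cat_counts, vocab = model
    total = sum(cat_counts.values())
    best, best_score = None, -math.inf
    for cat in cat_counts:
        # Add-one smoothing (an assumption) keeps unseen pairs non-zero.
        denom = sum(word_counts[cat].values()) + len(vocab)
        score = math.log(cat_counts[cat] / total)
        for w in context:
            if w in vocab:  # ignore words never seen in training
                score += math.log((word_counts[cat][w] + 1) / denom)
        if score > best_score:
            best, best_score = cat, score
    return best
```

To disambiguate an occurrence of "crane", one would call `classify(window(tokens, i), model)` with a model trained on contexts labeled "animal" or "machine".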
![Page 27: Natural Language Processing](https://reader030.vdocuments.net/reader030/viewer/2022032422/55a8de071a28ab1d0d8b46d3/html5/thumbnails/27.jpg)
27
“Crane”
Occurred 74 times in Grolier’s, 36 as animal, 38 as machine
Predictions in new sentences were 99% correct
Example: lift water and to grind grain .PP Treadmills attached to cranes were used to lift heavy objects from Roman times.
![Page 28: Natural Language Processing](https://reader030.vdocuments.net/reader030/viewer/2022032422/55a8de071a28ab1d0d8b46d3/html5/thumbnails/28.jpg)
28
![Page 29: Natural Language Processing](https://reader030.vdocuments.net/reader030/viewer/2022032422/55a8de071a28ab1d0d8b46d3/html5/thumbnails/29.jpg)
29
![Page 30: Natural Language Processing](https://reader030.vdocuments.net/reader030/viewer/2022032422/55a8de071a28ab1d0d8b46d3/html5/thumbnails/30.jpg)
30
Going Home – A play in one act
Scene 1: Pennsylvania Station, NYC
Bonnie: Long Beach?
Passerby: Downstairs, LIRR Station
Scene 2: ticket counter: LIRR
Bonnie: Long Beach?
Clerk: $4.50
Scene 3: Information Booth, LIRR
Bonnie: Long Beach?
Clerk: 4:19, Track 17
Scene 4: On the train, vicinity of Forest Hills
Bonnie: Long Beach?
Conductor: Change at Jamaica
Scene 5: On the next train, vicinity of Lynbrook
Bonnie: Long Beach?
Conductor: Right after Island Park.
![Page 31: Natural Language Processing](https://reader030.vdocuments.net/reader030/viewer/2022032422/55a8de071a28ab1d0d8b46d3/html5/thumbnails/31.jpg)
31
Question Answering on the web
Input: English question
Data: documents retrieved by a search engine from the web
Output: The phrase(s) within the documents that answer the question
![Page 32: Natural Language Processing](https://reader030.vdocuments.net/reader030/viewer/2022032422/55a8de071a28ab1d0d8b46d3/html5/thumbnails/32.jpg)
32
Examples
When was X born?
When was Mozart born? Mozart was born in 1756.
When was Gandhi born? Gandhi (1869-1948)
Where are the Rocky Mountains located?
What is nepotism?
![Page 33: Natural Language Processing](https://reader030.vdocuments.net/reader030/viewer/2022032422/55a8de071a28ab1d0d8b46d3/html5/thumbnails/33.jpg)
33
Common Approach
Create a query from the question
When was Mozart born -> Mozart born
Use WordNet to expand terms and increase recall:
Which high school was ranked highest in the US in 1998?
“high school” -> (high&school)|(senior&high&school)|(senior&high)|high|highschool
Use search engine to find relevant documents
Pinpoint passage within document that has answer using patterns
From IR to NLP
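The query-creation and answer-pinpointing steps can be illustrated with a short sketch. Everything named here is an assumption for illustration: the stop-word list, the two hand-written birth-year patterns (modeled on the Mozart and Gandhi examples), and the function names. WordNet expansion and the search-engine call are left out.

```python
import re

# Assumed stop list for illustration; a real system would use a fuller one.
STOP_WORDS = {"when", "where", "what", "which", "who", "was", "is", "are",
              "the", "in", "a", "an", "of"}


def make_query(question):
    """Drop question words and other stop words; the rest is the query."""
    tokens = re.findall(r"\w+", question)
    return [t for t in tokens if t.lower() not in STOP_WORDS]


def birth_year_patterns(name):
    """Two surface patterns for answering 'When was X born?'."""
    n = re.escape(name)
    return [
        re.compile(n + r" was born in (\d{4})"),
        re.compile(n + r" \((\d{4})\s*[-–]\s*\d{4}\)"),
    ]


def find_birth_year(name, passages):
    """Return the first year matched by any pattern, or None."""
    for passage in passages:
        for pattern in birth_year_patterns(name):
            match = pattern.search(passage)
            if match:
                return match.group(1)
    return None
```

Here `make_query("When was Mozart born?")` yields the terms `Mozart` and `born`, and `find_birth_year` scans the retrieved passages for either surface form.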
![Page 34: Natural Language Processing](https://reader030.vdocuments.net/reader030/viewer/2022032422/55a8de071a28ab1d0d8b46d3/html5/thumbnails/34.jpg)
34
PRODUCE A BIOGRAPHY OF [PERON].
Only these fields are relevant:
1. Name(s), aliases:
2. *Date of Birth or Current Age:
3. *Date of Death:
4. *Place of Birth:
5. *Place of Death:
6. Cause of Death:
7. Religion (Affiliations):
8. Known locations and dates:
9. Last known address:
10. Previous domiciles:
11. Ethnic or tribal affiliations:
12. Immediate family members
13. Native Language spoken:
14. Secondary Languages spoken:
15. Physical Characteristics
16. Passport number and country of issue:
17. Professional positions:
18. Education
19. Party or other organization affiliations:
20. Publications (titles and dates):
![Page 35: Natural Language Processing](https://reader030.vdocuments.net/reader030/viewer/2022032422/55a8de071a28ab1d0d8b46d3/html5/thumbnails/35.jpg)
35
Biography of Han Ming
Han Ming, born 1944 March in Pyongyan, South Korean Lei Fa Women’s University in French law, literature, a former female South Korean people, chairman of South Korea women’s groups,…Han, 62, has championed women’s rights and liberal political ideas. Han was imprisoned from 1979 to 1981 on charges of teaching pro-Communist ideas to workers, farmers and low-income women. She became the first minister of gender equality in 2001 and later served as an environment minister.
![Page 36: Natural Language Processing](https://reader030.vdocuments.net/reader030/viewer/2022032422/55a8de071a28ab1d0d8b46d3/html5/thumbnails/36.jpg)
36
Biography – two approaches
To obtain high precision, we handle each slot independently using bootstrapping to learn IE patterns.
To improve the recall, we utilize a biography Language Model.
![Page 37: Natural Language Processing](https://reader030.vdocuments.net/reader030/viewer/2022032422/55a8de071a28ab1d0d8b46d3/html5/thumbnails/37.jpg)
37
Approach
Characteristics of the IE approach
Training resource: Wikipedia and its manual annotations
Bootstrapping interleaves two corpora to improve precision
Wikipedia: reliable but small
Web: noisy but many relevant documents
No manual annotation or automatic tagging of corpus
Use seed tuples (person, date-of-birth) to find patterns
This approach is scalable for any corpus
Irrespective of size
Irrespective of whether it is static or dynamic
The IE system is augmented with language models to increase recall
![Page 38: Natural Language Processing](https://reader030.vdocuments.net/reader030/viewer/2022032422/55a8de071a28ab1d0d8b46d3/html5/thumbnails/38.jpg)
38
Biography as an IE task
We need patterns to extract information from a sentence
Creating patterns manually is a time-consuming task, and not scalable
We want to find these patterns automatically
![Page 39: Natural Language Processing](https://reader030.vdocuments.net/reader030/viewer/2022032422/55a8de071a28ab1d0d8b46d3/html5/thumbnails/39.jpg)
39
Biography patterns from Wikipedia
![Page 40: Natural Language Processing](https://reader030.vdocuments.net/reader030/viewer/2022032422/55a8de071a28ab1d0d8b46d3/html5/thumbnails/40.jpg)
40
• Martin Luther King, Jr., (January 15, 1929 – April 4, 1968) was the most …
• Martin Luther King, Jr., was born on January 15, 1929, in Atlanta, Georgia.
Biography patterns from Wikipedia
![Page 41: Natural Language Processing](https://reader030.vdocuments.net/reader030/viewer/2022032422/55a8de071a28ab1d0d8b46d3/html5/thumbnails/41.jpg)
41
Run IdFinder on these sentences
<Person> Martin Luther King, Jr. </Person>, (<Date>January 15, 1929</Date> – <Date> April 4, 1968</Date>) was the most…
<Person> Martin Luther King, Jr. </Person>, was born on <Date> January 15, 1929 </Date>, in <GPE> Atlanta, Georgia </GPE>.
Take the token sequence that includes the tags of interest + some context (2 tokens before and 2 tokens after)
![Page 42: Natural Language Processing](https://reader030.vdocuments.net/reader030/viewer/2022032422/55a8de071a28ab1d0d8b46d3/html5/thumbnails/42.jpg)
42
Convert to Patterns:
<My_Person> (<My_Date> – <Date>) was the
<My_Person> , was born on <My_Date>, in
Remove more specific patterns – if one pattern contains another, keep the smallest one that is longer than k tokens.
<MY_Person> , was born on <My_Date>
<My_Person> (<My_Date> – <Date>)
Finally, verify the patterns manually to remove irrelevant patterns.
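The conversion from tagged sentences to patterns can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: it assumes the tagger output is inline markup like that shown on the previous slide, and the names `to_pattern` and `prune`, the seed dictionary format, and the default `k` are hypothetical. The pruning rule follows one reading of the slide: a pattern is dropped when it contains a shorter pattern of more than k tokens.

```python
import re

ENTITY = re.compile(r"<(\w+)>\s*(.*?)\s*</\1>")
TOKEN = re.compile(r"<\w+>|\w+|[^\w\s]")


def to_pattern(tagged, seed, context=2):
    """tagged: sentence with <Type>...</Type> entity markup.
    seed: maps an entity type to the target string, e.g.
          {"Person": "Martin Luther King, Jr.", "Date": "January 15, 1929"}.
    """
    def replace(match):
        etype, text = match.group(1), match.group(2)
        if seed.get(etype) == text:
            return "<My_%s>" % etype   # slot we want to extract
        return "<%s>" % etype          # any other entity of that type

    tokens = TOKEN.findall(ENTITY.sub(replace, tagged))
    targets = [i for i, t in enumerate(tokens) if t.startswith("<My_")]
    if len(targets) < 2:
        return None  # need both the person and the target field
    start = max(0, targets[0] - context)
    end = min(len(tokens), targets[-1] + 1 + context)
    return " ".join(tokens[start:end])


def contains(long_tokens, short_tokens):
    """True if short_tokens occurs as a contiguous run in long_tokens."""
    n = len(short_tokens)
    return any(long_tokens[i:i + n] == short_tokens
               for i in range(len(long_tokens) - n + 1))


def prune(patterns, k=3):
    """Drop patterns that contain a shorter kept pattern of more than k tokens."""
    kept = []
    for p in sorted((p.split() for p in set(patterns)), key=len):
        if not any(len(q) > k and contains(p, q) for q in kept):
            kept.append(p)
    return [" ".join(p) for p in kept]
```

The tokenizer separates punctuation from words, so the output spacing differs slightly from the patterns printed on the slide. The final manual verification step is not modeled.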
![Page 43: Natural Language Processing](https://reader030.vdocuments.net/reader030/viewer/2022032422/55a8de071a28ab1d0d8b46d3/html5/thumbnails/43.jpg)
43
Examples of Patterns:
502 distinct place-of-birth patterns:
600 <MY_Person> was born in <MY_GPE>
169 <MY_Person> ( born <Date> in <MY_GPE> )
44 Born in <MY_GPE> <MY_Person>
10 <MY_Person> was a native <MY_GPE>
10 <MY_Person> 's hometown of <MY_GPE>
1 <MY_Person> was baptized in <MY_GPE>
…
291 distinct date-of-death patterns:
770 <MY_Person> ( <Date> - <MY_Date> )
92 <MY_Person> died on <MY_Date>
19 <MY_Person> <Date> - <MY_Date>
16 <MY_Person> died in <GPE> on <MY_Date>
3 <MY_Person> passed away on <MY_Date>
1 <MY_Person> committed suicide on <MY_Date>
…
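One way to picture how such patterns are used at extraction time is to compile each one into a regular expression for a given person. The sketch below is an assumption-laden illustration: a run of capitalized words stands in for the named-entity tagger that would recognize a GPE, and weighting each match by the pattern's frequency count is an assumed scoring scheme, not something stated on the slides.

```python
import re
from collections import Counter

# Stand-in for a GPE tagger: capitalized words, optionally comma-separated
# (e.g. "Atlanta, Georgia"). A real system would use tagger output instead.
CAPITALIZED = r"([A-Z]\w*(?:,? [A-Z]\w*)*)"


def compile_pattern(pattern, person):
    """Turn '<MY_Person> was born in <MY_GPE>' into a regex for one person."""
    parts = []
    for token in pattern.split():
        if token == "<MY_Person>":
            parts.append(re.escape(person))
        elif token == "<MY_GPE>":
            parts.append(CAPITALIZED)
        else:
            parts.append(re.escape(token))
    # \s* tolerates the tokenized spacing around punctuation in patterns.
    return re.compile(r"\s*".join(parts))


def extract_place(weighted_patterns, person, sentences):
    """weighted_patterns: list of (count, pattern). Returns best place or None."""
    votes = Counter()
    for count, pattern in weighted_patterns:
        regex = compile_pattern(pattern, person)
        for sentence in sentences:
            match = regex.search(sentence)
            if match:
                votes[match.group(1)] += count  # assumed weighting scheme
    return votes.most_common(1)[0][0] if votes else None
```

Frequent patterns such as `<MY_Person> was born in <MY_GPE>` then dominate the vote, while rare ones contribute little.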
![Page 44: Natural Language Processing](https://reader030.vdocuments.net/reader030/viewer/2022032422/55a8de071a28ab1d0d8b46d3/html5/thumbnails/44.jpg)
44
Biography as an IE task
This approach is good for the consistently annotated fields in Wikipedia: place of birth, date of birth, place of death, date of death
Not all fields of interest are annotated; a different approach is needed to cover the rest of the slots
![Page 45: Natural Language Processing](https://reader030.vdocuments.net/reader030/viewer/2022032422/55a8de071a28ab1d0d8b46d3/html5/thumbnails/45.jpg)
45
Bouncing between Wikipedia and Google
Use one seed only: <my person> and <target field>
Google: “Arafat” “civil engineering”, we get:
![Page 46: Natural Language Processing](https://reader030.vdocuments.net/reader030/viewer/2022032422/55a8de071a28ab1d0d8b46d3/html5/thumbnails/46.jpg)
46
![Page 47: Natural Language Processing](https://reader030.vdocuments.net/reader030/viewer/2022032422/55a8de071a28ab1d0d8b46d3/html5/thumbnails/47.jpg)
47
Use one seed only: <my person> and <target field>
Google: “Arafat” “civil engineering”, we get:
⇒ Arafat graduated with a bachelor’s degree in civil engineering
⇒ Arafat studied civil engineering
⇒ Arafat, a civil engineering student
⇒ …
Using these snippets, corresponding patterns are created, then filtered manually.
Bouncing between Wikipedia and Google
![Page 48: Natural Language Processing](https://reader030.vdocuments.net/reader030/viewer/2022032422/55a8de071a28ab1d0d8b46d3/html5/thumbnails/48.jpg)
48
Use one seed tuple only: <my person> and <target field>
Google: “Arafat” “civil engineering”, we get:
⇒ Arafat graduated with a bachelor’s degree in civil engineering
⇒ Arafat studied civil engineering
⇒ Arafat, a civil engineering student
⇒ …
Using these snippets, corresponding patterns are created, then filtered manually
To get more seed pairs, go to Wikipedia biography pages only and search for:
“graduated with a bachelor’s degree in”
We get:
Bouncing between Wikipedia and Google
![Page 49: Natural Language Processing](https://reader030.vdocuments.net/reader030/viewer/2022032422/55a8de071a28ab1d0d8b46d3/html5/thumbnails/49.jpg)
49
![Page 50: Natural Language Processing](https://reader030.vdocuments.net/reader030/viewer/2022032422/55a8de071a28ab1d0d8b46d3/html5/thumbnails/50.jpg)
50
New seed tuples:
“Burnie Thompson” “political science”
“Henrey Luke” “Environment Studies”
“Erin Crocker” “industrial and management engineering”
“Denise Bode” “political science”
…
Go back to Google and repeat the process to get more seed patterns!
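The loop described across these slides can be sketched over in-memory text. This is a simplified illustration under assumptions: two lists of strings stand in for Google snippets and Wikipedia biography sentences, the `accept` callback stands in for the manual filtering step, and the regular expressions for person names and field values are crude placeholders for real taggers (the value pattern simply takes the lower-case words that follow, so it can overshoot). All function names are hypothetical.

```python
import re


def patterns_from_seeds(seeds, snippets):
    """Collect the text found between a seed person and a seed field value."""
    patterns = set()
    for person, value in seeds:
        regex = re.compile(re.escape(person) + r"(.{1,60}?)" + re.escape(value))
        for snippet in snippets:
            match = regex.search(snippet)
            if match and match.group(1).strip():
                patterns.add(match.group(1).strip())
    return patterns


def seeds_from_patterns(patterns, sentences):
    """Find new (person, value) pairs in sentences that match a pattern."""
    name = r"([A-Z]\w+(?: [A-Z]\w+)+)"   # crude stand-in for a person tagger
    value = r"([a-z]+(?: [a-z]+)*)"      # lower-case phrase after the pattern
    seeds = set()
    for p in patterns:
        regex = re.compile(name + r"\s+" + re.escape(p) + r"\s+" + value)
        for sentence in sentences:
            seeds.update(regex.findall(sentence))
    return seeds


def bootstrap(seeds, snippets, wiki_sentences, accept, rounds=2):
    """Alternate between learning patterns and harvesting new seed tuples."""
    seeds, patterns = set(seeds), set()
    for _ in range(rounds):
        found = patterns_from_seeds(seeds, snippets)
        patterns |= {p for p in found if accept(p)}  # manual filtering step
        seeds |= seeds_from_patterns(patterns, wiki_sentences)
    return seeds, patterns
```

Starting from the single seed `("Arafat", "civil engineering")`, each round can add both patterns and tuples, which is why the filtering step matters: one noisy pattern can pull in many bad seeds.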
Bouncing between Wikipedia and Google
![Page 51: Natural Language Processing](https://reader030.vdocuments.net/reader030/viewer/2022032422/55a8de071a28ab1d0d8b46d3/html5/thumbnails/51.jpg)
51
Bouncing between Wikipedia and Google
This approach worked well for a few fields such as: education, publication, Immediate family members, and Party or other organization affiliations
It did not provide good patterns for every field: for fields such as Religion, Ethnic or tribal affiliations, and Previous domiciles, we got a lot of noise
For some slots, we created some patterns manually
![Page 52: Natural Language Processing](https://reader030.vdocuments.net/reader030/viewer/2022032422/55a8de071a28ab1d0d8b46d3/html5/thumbnails/52.jpg)
52
Biography as Sentence Selection and Ranking
To obtain high recall, we also want to include sentences that IE may miss, perhaps due to ill-formed sentences (ASR and MT)
Get the top 100 documents from Indri
Extract all sentences that contain the person or reference to him/her
Use a variety of features to rank these sentences…