December 2005 CSA3180: Information Extraction II 1
CSA3180: Natural Language Processing
Information Extraction 2
• Named Entities
• Question Answering
• Anaphora Resolution
• Co-Reference
Introduction
• Slides partially based on talk by Lucian Vlad Lita
• Sheffield GATE Multilingual Extraction slides based on Diana Maynard’s talks
• Anaphora resolution slides based on Dan Cristea slides, with additional input from Gabriela-Eugenia Dima, Oana Postolache and Georgiana Puşcaşu
References
• Fastus System Documentation
• Robert Gaizauskas “IE Perspective on Text Mining”
• Daniel Bikel’s “Nymble: A High Performance Learning Name Finder”
• Helena Ahonen-Myka’s notes on FSTs
• Javelin system documentation
• MUC 7 Overview & Results
Named Entities
• Named Entities
  – Person Name: Colin Powell, Frodo
  – Location Name: Middle East, Aiur
  – Organization: UN, DARPA
• Domain Specific vs. Open Domain
Anaphora Resolution
[Figure: AR development cycle, with components: unprocessed text, annotation tool, AR gold standard, AR engine, AR annotated text, comparison & evaluation, fine-tuning.]
Anaphora Resolution
• Text:
  – Nature of discourse
  – Anaphoric phenomena
• Anaphora Resolution Engines:
  – Models
  – General AR Frameworks
  – Knowledge Sources
Anaphora Resolution
Anaphora represents the relation between a “proform” (called an “anaphor”) and another term (called an "antecedent"), when the interpretation of the anaphor is in a certain way determined by the interpretation of the antecedent.
Barbara Lust, Introduction to Studies in the Acquisition of Anaphora, D. Reidel, 1986
Anaphora Example
It was a bright cold day in April, and the clocks were striking
thirteen. Winston Smith, his chin nuzzled into his breast in an
effort to escape the vile wind, slipped quickly through the glass
doors of Victory Mansions, though not quickly enough to
prevent a swirl of gritty dust from entering along with him.
Orwell, 1984
[Figure: antecedent/anaphor pairs are marked in the passage above.]
Anaphora
• pronouns (personal, demonstrative, ...)
  – full pronouns
  – clitics (RO: dă-mi-l, IT: dammelo)
• nouns
  – definite
  – indefinite
• adjectives, numerals (generally associated with an ellipsis)
• In this the play is expressionist₁ in its approach to theme.
• But it is also so₁ in its use of unfamiliar devices...
Referential Expressions
• mark the noun phrases
• for each NP ask a question about it
• keep as REs those NPs that can be naturally referenced in the question
The policeman got in the car in a hurry in order to catch the run-away thief.
Referential Expressions
a. John was going down the street looking for Bill's house.
b. He found it at the first corner.
Referential Expressions
a. John was going down the street looking for Bill's house.
b. He met him at the first corner.
Referential Expressions
The empty anaphor
Gianni diede una mela a Michele. Più tardi, gli diede un’arancia.
[Not & Zancanaro, 1996]
John gave an apple to Michelle. Later on, gave her an orange.
Textual Ellipsis
The functional (bridge) anaphora
The state of the accumulator is indicated to the user. 30 minutes before the complete uncharge, the computer signals for 5 seconds.
[Strube&Hahn, 1996]
Events, States, Descriptions
He left without eating₁. Because of this₁, he was starving in the evening.
But, he adds, Priestley is more interested in Johnson living than in Johnson dead₁. In this₁ the play is expressionist in its approach to theme.
[Halliday & Hasan, 1976]
Definite/Indefinite NPs
Once upon a time, there was a king and a queen. And the king one day went hunting.
Apollo took out his bow...
Take the elevator to the 4th floor.
Anaphora Resolution
• State of the art in Anaphora Resolution:
  – Identity: 65-80%
  – Other: much less…
What is so difficult?
Nothing – everything is so simple!
John₁ has just arrived. He₁ seems tired.
The girl₁ leaves the trash on the table and wants to go away. The boy₂ tries to hold her₁ by the arm₃₁; she₁ escapes and runs; he₂ calls her₁ back.
Caragiale, At the Mansion
What is so difficult?
Nothing indeed, but imagine letting the machine go wrong...
There's a pile of inflammable trash next to your car. You'll have to get rid of it.
If the baby does not thrive on the raw milk, boil it.
[Hobbs, 1997]
What is so difficult?
Semantic restrictions
Jeff₁ helped Dick₂ wash the car. He₁ washed the windows as Dick₂ waxed the car. He₁ soaped a pane.
Jeff₁ helped Dick₂ wash the car. He₁ washed the windows as Dick₂ waxed the car. He₂ buffed the hood.
[Walker, Joshi & Prince, 1997]
What is so difficult?
Semantic correlates
An elephant₁ hit the car with the trunk. The animal₁ had to be taken away so as not to cause further damage.
* An animal₁ hit the car with the trunk. The elephant₁ had to be taken away so as not to cause further damage.
What is so difficult?
Long distance recovery (pronominalization)
1. His re-entry into Hollywood came with the movie “Brainstorm”,
2. but its completion and release has been delayed by the death of co-star Natalie Wood.
3. He plays Hugh Hefner of Playboy magazine in Bob Fosse’s “Star 80.”
4. It’s about Dorothy Stratton, the Playboy Playmate who was killed by her husband.
5. He also stars in the movie “Class.”
Los Angeles Times, July 18, 1983, cited in [Fox, 1986]
What is so difficult?
Gender mismatches
Mr. Chairman..., what is her position upon this issue? (political correctness!!)
Number mismatches
The government discussed ... They ...
What is so difficult?
Distributed antecedents
John₁ invited Mary₂ to the cinema. After the movie ended they₃ went to a restaurant. (3 = {1, 2})
What is so difficult?
Empty/non-empty anaphors
John gave an apple to Michelle.
Later on, gave her an orange.
John gave an apple to Michelle.
Later on, he gave her an orange.
John gave an apple to Michelle.
Later on, this one asks him for an orange.
Semantics are Essential
Police ... They
Teacher... She/He
A car... The automobile
A Mercedes... The car
A lamp... The bulb
Semantics are not all
• Pronouns - poor semantic features
  he [+animate, +male, +singular]
  she [+animate, +female, +singular]
  it [+inanimate, +singular]
  they [+plural]
• Gender in Romance languages
  Ro. maşină = ea (feminine)
  Ro. automobil = el (masculine)
• Anaphora resolution by concord rules
  Un camion a heurté une voiture. Celle-ci a été complètement détruite.
  (A truck hit a car. It was completely destroyed.)
  Celle-ci (feminine) with un camion (masculine): gender mismatch!
  Celle-ci (feminine) with une voiture (feminine): gender match!
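The concord rule can be illustrated with a minimal sketch. The feature names and the tiny lexicon are assumptions made for this example, not part of any particular system: a candidate antecedent is kept only if its gender and number agree with those of the anaphor.

```python
# Hypothetical mini-lexicon: grammatical gender and number of each expression.
FEATURES = {
    "un camion": {"gender": "masc", "number": "sg"},
    "une voiture": {"gender": "fem", "number": "sg"},
    "celle-ci": {"gender": "fem", "number": "sg"},
}


def agrees(anaphor, candidate):
    """Return True if the two expressions carry the same gender and number."""
    a, c = FEATURES[anaphor], FEATURES[candidate]
    return a["gender"] == c["gender"] and a["number"] == c["number"]


def concord_filter(anaphor, candidates):
    """Keep only the candidates that agree with the anaphor."""
    return [c for c in candidates if agrees(anaphor, c)]


print(concord_filter("celle-ci", ["un camion", "une voiture"]))
# prints ['une voiture']
```

In English the same pair of candidates (a truck, a car) would both survive the filter, since "it" carries no grammatical gender.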
Anaphora Resolution
[Charniak, 1972]
In order to do AR, one has to be able to do everything else. Once everything else is done, AR comes for free.
Anaphora Resolution
Most current anaphora resolution systems implement a pipeline architecture with three modules:
• Collect: determines the List of Potential Antecedents (LPA).
• Filter: eliminates from the LPA the candidates that are incompatible with the referential expression under scrutiny.
• Preference: determines the most likely antecedent on the basis of an ordering policy.
[Figure: referential expressions pass through Collect, Filter and Preference in turn; a candidate list a1, a2, a3, … an is handed from one module to the next.]
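The three modules can be sketched as three small functions chained together. This is only an illustration: the `Mention` type, the agreement test used as the filter and the recency ordering used as the preference policy are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mention:
    text: str
    position: int  # token offset in the text
    gender: str    # "masc", "fem" or "neut"
    number: str    # "sg" or "pl"


def collect(anaphor, mentions):
    """Collect: every mention preceding the anaphor is a potential antecedent."""
    return [m for m in mentions if m.position < anaphor.position]


def filter_candidates(anaphor, candidates):
    """Filter: drop candidates whose gender or number clashes with the anaphor."""
    return [c for c in candidates
            if c.gender == anaphor.gender and c.number == anaphor.number]


def prefer(candidates):
    """Preference: assumed ordering policy, the most recent candidate wins."""
    if not candidates:
        return None
    return max(candidates, key=lambda c: c.position)


def resolve(anaphor, mentions):
    return prefer(filter_candidates(anaphor, collect(anaphor, mentions)))


# "John has just arrived. He seems tired."
john = Mention("John", 0, "masc", "sg")
he = Mention("He", 5, "masc", "sg")
print(resolve(he, [john, he]).text)  # prints John
```

Real systems differ mainly in what the filter knows and in how the preference step orders the survivors.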
Anaphora Resolution Models
• [Hobbs, 1976] (pronominal anaphora)
Naïve algorithm:
  - implies a surface parse tree
  - navigation on the syntactic tree of the anaphor's sentence and the preceding ones in the order of recency, each tree in a left-to-right, breadth-first manner
A semantic approach:
  - implies a semantic representation of the sentences (logical expression)
  - a collection of semantic operations (inferences)
  - type of pronoun is important
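The traversal order of the naïve algorithm can be sketched as follows. This is a deliberate simplification, not the full algorithm: it covers only the search through preceding sentences (most recent first, each tree left-to-right and breadth-first) and ignores the steps inside the anaphor's own sentence. The tree encoding, the `acceptable` test and the toy set of inanimate nouns are assumptions for the example.

```python
from collections import deque

# A node is (label, children) with a list of child nodes, or (label, word) for a leaf.


def np_nodes_breadth_first(tree):
    """Yield the NP nodes of one parse tree in left-to-right, breadth-first order."""
    queue = deque([tree])
    while queue:
        label, content = queue.popleft()
        if label == "NP":
            yield (label, content)
        if isinstance(content, list):
            queue.extend(content)


def words(node):
    """Return the words covered by a node, in text order."""
    _, content = node
    if isinstance(content, str):
        return [content]
    return [w for child in content for w in words(child)]


def propose_antecedent(previous_trees, acceptable):
    """Search preceding sentences, most recent first; return the first acceptable NP."""
    for tree in reversed(previous_trees):
        for np in np_nodes_breadth_first(tree):
            if acceptable(np):
                return " ".join(words(np))
    return None


# "The girl leaves the trash on the table."
tree = ("S", [
    ("NP", [("DT", "The"), ("NN", "girl")]),
    ("VP", [
        ("VBZ", "leaves"),
        ("NP", [("DT", "the"), ("NN", "trash")]),
        ("PP", [("IN", "on"), ("NP", [("DT", "the"), ("NN", "table")])]),
    ]),
])

INANIMATE = {"trash", "table"}  # toy knowledge for resolving "it"
print(propose_antecedent([tree], lambda np: words(np)[-1] in INANIMATE))
# prints the trash
```

The breadth-first order is what makes shallow NPs (subjects, then objects) win over NPs buried inside prepositional phrases.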
Anaphora Resolution Models
• [Lappin & Leass, 1994] (pronominal anaphora)
  - syntactic structures
  - an intrasentential syntactic filtering
  - morphological filter (person, number, gender)
  - detection of pleonastic pronouns
  - salience parameters (grammatical role, parallelism of grammatical roles, frequency of mention, proximity, sentence recency)
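A salience-based preference of this kind can be sketched as a weighted sum that decays with distance. The weights, the halving per sentence and the candidate records below are illustrative assumptions for the example, not values taken from the slides, and the sketch assumes the syntactic and morphological filters have already been applied.

```python
# Illustrative weights (assumptions for this sketch).
WEIGHTS = {
    "sentence_recency": 100,
    "subject": 80,
    "direct_object": 50,
    "indirect_object": 40,
}


def salience(candidate, current_sentence):
    """Sum the weights of the candidate's factors, halved per sentence of distance."""
    score = WEIGHTS["sentence_recency"] + WEIGHTS[candidate["role"]]
    distance = current_sentence - candidate["sentence"]
    return score / (2 ** distance)


def best_antecedent(candidates, current_sentence):
    """Return the candidate with the highest salience."""
    return max(candidates, key=lambda c: salience(c, current_sentence))


# "Jeff helped Dick wash the car. He washed the windows ..."
candidates = [
    {"name": "Jeff", "role": "subject", "sentence": 0},
    {"name": "Dick", "role": "direct_object", "sentence": 0},
]
print(best_antecedent(candidates, current_sentence=1)["name"])  # prints Jeff
```

A scheme like this cannot by itself choose Dick for "He buffed the hood"; that reading needs the semantic restrictions discussed earlier.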
Anaphora Resolution Models
• [Sidner, 1981], [Grosz & Sidner, 1986]
  - focus/attentional based
  - give more salience to those semantic entities that are in focus
  - define where to look for an antecedent in the semantic structure of the preceding text (a stack in G&S's model)
AR Models: Centering
• [Grosz, Joshi, Weinstein, 1983, 1995]
• [Brennan, Friedman and Pollard, 1987]
• Cf(u) = <e1, e2, ... ek> - an ordered list
• Cb(u) = ei
• Cp(u) = e1
• CON > RET > SSH > ASH

                   Cb(u) = Cb(u-1)    Cb(u) ≠ Cb(u-1)
Cb(u) = Cp(u)      CONTINUING         SMOOTH SHIFT
Cb(u) ≠ Cp(u)      RETAINING          ABRUPT SHIFT
AR Models: Centering
a. I haven’t seen Jeff for several days.
   Cf = (I=[I], [Jeff])
   Cb = [I]
b. Carl thinks he’s studying for his exams.
   Cf = ([Carl], he=[Jeff], [Jeff's exams])
   Cb = [Jeff]
c. I think he? went to the Cape with Linda.
[Grosz, Joshi & Weinstein, 1983]
AR Models: Centering
b. Carl thinks he’s studying for his exams.
   Cf = ([Carl], he=[Jeff], [Jeff's exams])
   Cb = [Jeff]
c. I think he? went to the Cape with Linda.
   he = Jeff: Cf = (I=[I], he=[Jeff], [the Cape], [Linda]), Cb = [Jeff]: RETAINING
   he = Carl: Cf = (I=[I], he=[Carl], [the Cape], [Linda]), Cb = [Carl]: ABRUPT SHIFT
   Preferred antecedent: Jeff
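The choice between the two readings of utterance (c) follows mechanically from the transition definitions and the ranking CON > RET > SSH > ASH. A minimal sketch, in which the function names and the dictionary encoding of a reading are assumptions for the example:

```python
RANK = {"CONTINUING": 0, "RETAINING": 1, "SMOOTH SHIFT": 2, "ABRUPT SHIFT": 3}


def transition(cb_prev, cb, cp):
    """Classify the transition from the previous utterance to the current one."""
    if cb == cb_prev:
        return "CONTINUING" if cb == cp else "RETAINING"
    return "SMOOTH SHIFT" if cb == cp else "ABRUPT SHIFT"


def preferred_reading(cb_prev, readings):
    """Pick the reading whose transition ranks best (CON > RET > SSH > ASH)."""
    return min(readings, key=lambda r: RANK[transition(cb_prev, r["cb"], r["cp"])])


# Utterance (c): Cb of utterance (b) is Jeff; Cp of (c) is "I" in both readings.
readings = [
    {"he": "Jeff", "cb": "Jeff", "cp": "I"},
    {"he": "Carl", "cb": "Carl", "cp": "I"},
]
for r in readings:
    print(r["he"], transition("Jeff", r["cb"], r["cp"]))
# prints "Jeff RETAINING" then "Carl ABRUPT SHIFT"
print(preferred_reading("Jeff", readings)["he"])  # prints Jeff
```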
Anaphora Resolution Models
• [Mitkov, 1998]
  - knowledge-poor approach
  - POS tagger, noun phrase rules
  - 2 previous sentences
  - definiteness, givenness, lexical reiteration, section heading preference, distance, terms of the field, etc.
General Framework
Build a framework capable of easily accommodating any of the existing AR models, fine-tune them, practice with them to enhance performance (learning), eventually obtaining a better model
General Framework
[Figure: a text is passed to a single AR-engine, which can be configured with one of several models (AR-model1, AR-model2, AR-model3).]
Co-References
• Halliday and Hasan: a semantic relation, not a textual one
[Figure: co-referential anaphoric relation. On the text layer, the expressions a and b; on the semantic layer, centerₐ. a evokes centerₐ and b evokes centerₐ.]
Time and Discourse
• Discourse has a dynamic nature
[Figure: three time axes: real time, discourse time and story time.]
Resolution Moment
Police officer David Cheshire went to Dillard's home. Putting his ear next to Dillard's head, Cheshire heard the music also.
[Tanaka, 1999]
[Figure: the referential expressions of the passage in text order: Cheshire, Dillard, his, Dillard, Cheshire.]
Resolution Delay
• Sanford and Garrod (1989)
  – initiation point
  – completion point
• Information is kept in a temporary location of memory
Cataphora – What is there?
• The element referred to is anticipated by the referring element
• Theories
  – scepticism
  – syntactic reality
From the corner of the divan of Persian saddle-bags on which he was lying, smoking, as was his custom, innumerable cigarettes, Lord Henry Wotton could just catch the gleam of the honey-sweet and honey-coloured blossoms of a laburnum…
Oscar Wilde, The Picture of Dorian Gray
No right reference needed in discourse processing
• Introduction of an empty discourse entity
• Addition of new features as discourse unfolds
• Pronoun anticipation in Romanian
I taught Gabriel to read. = Ro. L-am învăţat pe Gabriel să citească.
Unique directionality in interpretation
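[Figure: anaphora vs. cataphora. Anaphora (John ... he): "John" introduces an entity {gender = masc, number = sg, sem = person, name = John}, which "he" then refers to. Cataphora (he ... John): "he" first introduces an underspecified entity {gender = masc, number = sg}, marked "?", which "John" later completes to {gender = masc, number = sg, sem = person, name = John}. In both cases interpretation runs in the same direction.]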
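The idea of an initially empty discourse entity that accumulates features can be sketched with a small unification function. The dictionary representation and the `unify` helper are assumptions for the example.

```python
def unify(entity, features):
    """Merge features into entity; return None if a shared attribute clashes."""
    for key, value in features.items():
        if key in entity and entity[key] != value:
            return None
    return {**entity, **features}


# Cataphora: "he" is read first and opens an underspecified entity ...
entity = unify({}, {"gender": "masc", "number": "sg"})
# ... which "John" later completes with the missing features.
entity = unify(entity, {"gender": "masc", "number": "sg",
                        "sem": "person", "name": "John"})
print(entity)
# prints {'gender': 'masc', 'number': 'sg', 'sem': 'person', 'name': 'John'}

# A clashing expression cannot refer to the same entity.
print(unify(entity, {"gender": "fem"}))  # prints None
```

The same function serves anaphora and cataphora: only the order in which the rich and the poor feature sets arrive differs.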
Automatic Interpretation
• necessity for an intermediate level
[Figure: three layers. On the text layer, the REs a and b; on the restriction layer, the feature structure fsₐ; on the semantic layer, centerₐ. RE a projects fsₐ, and fsₐ evokes centerₐ.]
Three Layer Approach to AR
1. John sold his bicycle
2. although Bill would have wanted it.
[Figure: three layers (text, restrictions, semantic). "his bicycle" projects the feature structure {no = sg, sem = bicycle, det = yes}, which evokes a discourse entity {no = sg, sem = bicycle, det = yes}. "it" projects {no = sg, sem = ¬human}, which evokes the same entity.]
Delayed Interpretation
Police officer David Cheshire went to Dillard's home. Putting his ear next to Dillard's head, Cheshire heard the music also.
[Figure: timeline t0 to t3 across the text, restriction and semantic layers. The REs Cheshire, Dillard and his project the feature structures fsCheshire, fsDillard and fshis; the semantic layer holds the entities Cheshire and Dillard. When "his" is read, fshis is left with candidates = { , } (two candidates, shown graphically in the slide).]
Delayed Interpretation
From the corner of the divan of Persian saddle-bags on which he was lying, smoking, as was his custom, innumerable cigarettes, Lord Henry Wotton could just catch the gleam of the honey-sweet and honey-coloured blossoms of a laburnum…
[Figure: timeline t0 to t2 across the text, restriction and semantic layers. At t0, "he" projects {gender = masc, number = sing, sem = person} and evoking initiates on an as yet unknown entity (?). At t1, "his" is read. At t2, "Lord Henry Wotton" projects {gender = masc, number = sing, sem = person, name = Lord Henry Wotton} and evoking completes, giving the entity {gender = masc, number = sing, sem = person, name = Lord Henry Wotton}.]
The case of Cataphora
1. Although Bill would have wanted it,
2. John sold his bicycle to somebody else.
[Figure: three layers (text, restrictions, semantic). "it" projects {no = sg, sem = ¬human}, which evokes an entity {no = sg, sem = ¬human}. "his bicycle" then projects {no = sg, sem = bicycle, det = yes}, which evokes the same entity, now {no = sg, sem = bicycle, det = yes}.]
AR Models
• a set of primary attributes
• a set of knowledge sources
• a set of evocation heuristics or rules
• a set of rules that configure the domain of referential accessibility
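The four components can be pictured as the fields of one configurable object. The sketch below is a hypothetical encoding: the `ARModel` class, its method names, the toy lexicon and the two-sentence accessibility window are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

FeatureStructure = Dict[str, str]


@dataclass
class ARModel:
    """Hypothetical container for the four components of an AR model."""
    primary_attributes: List[str]
    knowledge_sources: List[Callable[[str], FeatureStructure]]
    heuristics: List[Callable[[FeatureStructure, FeatureStructure], float]]
    accessible: Callable[[int, int], bool]  # domain of referential accessibility

    def project(self, re_text):
        """Let every knowledge source contribute values, keep primary attributes."""
        fs = {}
        for source in self.knowledge_sources:
            fs.update(source(re_text))
        return {k: v for k, v in fs.items() if k in self.primary_attributes}

    def score(self, fs, entity):
        """Combine the evocation heuristics into a single score."""
        return sum(h(fs, entity) for h in self.heuristics)


def toy_morphology(re_text):
    lexicon = {"he": {"number": "sg", "gender": "masc"},
               "John": {"number": "sg", "gender": "masc"},
               "Mary": {"number": "sg", "gender": "fem"}}
    return lexicon.get(re_text, {})


def agreement(fs, entity):
    return float(all(entity.get(k) == v for k, v in fs.items()))


model = ARModel(
    primary_attributes=["number", "gender"],
    knowledge_sources=[toy_morphology],
    heuristics=[agreement],
    # Assumed window: look back at most two sentences.
    accessible=lambda re_sent, de_sent: 0 <= re_sent - de_sent <= 2,
)

fs_he = model.project("he")
print(model.score(fs_he, model.project("John")))  # prints 1.0
print(model.score(fs_he, model.project("Mary")))  # prints 0.0
```

Swapping the lists and the `accessible` function is then enough to move from one model to another without touching the engine.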
AR Models
[Figure: the components of an AR model across three layers. Text layer: the REs REa, REb, REc, REd, ... REx. Projection layer: feature structures whose attributes (attrx) are the primary attributes, filled in by the knowledge sources. Semantic layer: the discourse entities DE1, ... DEj, ... DEm. Heuristics/rules link a projected structure to a DE within the domain of referential accessibility.]
Set of Primary Attributes
a. morphological
- number
- lexical gender
- person
Set of Primary Attributes
b. syntactical
- full syntactic description of REs as constituents of a syntactic tree
  [Lappin and Leass, 1994]; CT based approaches [Grosz, Joshi and Weinstein, 1995], [Brennan, Friedman and Pollard, 1987]; syntactic domain based approaches [Chomsky, 1981], [Reinhart, 1981], [Gordon and Hendricks, 1998], [Kennedy and Boguraev, 1996]
- quality of being adjunct, embedded or complement of a preposition [Kennedy and Boguraev, 1996]
- inclusion or not in an existential construction [Kennedy and Boguraev, 1996]
- syntactic patterns in which the RE is involved; syntactic parallelism [Kennedy and Boguraev, 1996], [Mitkov, 1997]
Set of Primary Attributes
c. semantic
- position of the head of the RE in a conceptual hierarchy (animacy, sex (or natural gender), concreteness)
  WordNet based models [Poesio, Vieira and Teufel, 1997]
- inclusion in a synonymy class
- semantic roles, out of which selectional restrictions, inferential links, pragmatic limitations, semantic parallelism and object preference can be verified
Set of Primary Attributes
d. positional
- offset of the first token of the RE in the text
  [Kennedy and Boguraev, 1996]
- inclusion in an utterance, sentence or clause, considered as a discourse unit
  [Hobbs, 1987], [Azzam, Humphreys and Gaizauskas, 1998], [Cristea et al., 2000]
Set of Primary Attributes
e. surface realisation (type)
the domain of this feature contains: zero-pronoun, clitic pronoun, full pronoun, reflexive pronoun, possessive pronoun, demonstrative pronoun, reciprocal pronoun, expletive “it”, bare noun (undetermined NP), indefinite determined NP, definite determined NP, proper noun (name)
[Gordon and Hendricks, 1998], [Cristea et al., 2000]
Set of Primary Attributes
f. other
- inclusion or not of the RE in a specific lexical field (“domain concept”)
[Mitkov, 1997]
- frequency of the term in the text
[Mitkov, 1997]
- occurrence of the term in a heading
[Mitkov, 1997]
Knowledge Sources
• Type of process: incremental
• A knowledge source: a (virtual) processor able to fetch values for attributes on the restriction layer
• Minimum set: POS-tagger + shallow parser
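Incremental knowledge sources can be sketched as functions that each add the attribute values they are able to determine to a shared feature structure. The two toy processors below merely stand in for a real POS-tagger and shallow parser; their names, the attributes they fill and the word lists are assumptions for the example.

```python
PRONOUNS = {"he", "she", "it", "they"}


def pos_tagger(re_tokens, fs):
    """Toy POS-tagger: records whether the head of the RE is a pronoun."""
    fs["pos"] = "pronoun" if re_tokens[-1].lower() in PRONOUNS else "noun"


def shallow_parser(re_tokens, fs):
    """Toy shallow parser: records definiteness from the determiner, if any."""
    first = re_tokens[0].lower()
    if first == "the":
        fs["det"] = "definite"
    elif first in {"a", "an"}:
        fs["det"] = "indefinite"


def project(re_tokens, knowledge_sources):
    """Run the knowledge sources in turn; each one fills in what it can."""
    fs = {}
    for source in knowledge_sources:
        source(re_tokens, fs)
    return fs


print(project(["the", "king"], [pos_tagger, shallow_parser]))
# prints {'pos': 'noun', 'det': 'definite'}
print(project(["he"], [pos_tagger, shallow_parser]))
# prints {'pos': 'pronoun'}
```

Adding a richer source, for instance a lexical database lookup, amounts to appending one more function to the list.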