csa3180: natural language processing


Page 1: CSA3180: Natural Language Processing

December 2005 CSA3180: Information Extraction II 1

CSA3180: Natural Language Processing

Information Extraction 2
• Named Entities
• Question Answering
• Anaphora Resolution
• Co-Reference

Page 2: CSA3180: Natural Language Processing


Introduction

• Slides partially based on talk by Lucian Vlad Lita

• Sheffield GATE Multilingual Extraction slides based on Diana Maynard’s talks

• Anaphora resolution slides based on Dan Cristea slides, with additional input from Gabriela-Eugenia Dima, Oana Postolache and Georgiana Puşcaşu

Page 3: CSA3180: Natural Language Processing


References

• Fastus System Documentation

• Robert Gaizauskas “IE Perspective on Text Mining”

• Daniel Bikel’s “Nymble: A High Performance Learning Name Finder”

• Helena Ahonen-Myka’s notes on FSTs

• Javelin system documentation

• MUC 7 Overview & Results

Page 4: CSA3180: Natural Language Processing


Named Entities

• Named Entities
– Person Name: Colin Powell, Frodo
– Location Name: Middle East, Aiur
– Organization: UN, DARPA

• Domain Specific vs. Open Domain

Page 5: CSA3180: Natural Language Processing


Anaphora Resolution

[Figure: AR development workflow. Labels: unprocessed text; annotation tool; AR gold standard; AR engine; AR annotated text; comparison & evaluation; fine-tuning]

Page 6: CSA3180: Natural Language Processing


Anaphora Resolution

• Text:
– Nature of discourse
– Anaphoric phenomena

• Anaphora Resolution Engines:
– Models
– General AR Frameworks
– Knowledge Sources

Page 7: CSA3180: Natural Language Processing


Anaphora Resolution

Anaphora represents the relation between a “proform” (called an “anaphor”) and another term (called an "antecedent"), when the interpretation of the anaphor is in a certain way determined by the interpretation of the antecedent.

Barbara Lust, Introduction to Studies in the Acquisition of Anaphora, D. Reidel, 1986

Page 8: CSA3180: Natural Language Processing


Anaphora Example

It was a bright cold day in April, and the clocks were striking thirteen. Winston Smith, his chin nuzzled into his breast in an effort to escape the vile wind, slipped quickly through the glass doors of Victory Mansions, though not quickly enough to prevent a swirl of gritty dust from entering along with him.

Orwell, 1984

[Slide annotation: "antecedent" and "anaphor" labels mark pairs of expressions in the passage]

Page 9: CSA3180: Natural Language Processing


Anaphora

• pronouns (personal, demonstrative, ...)
– full pronouns
– clitics (RO: dă-mi-l, IT: dammelo)

• nouns
– definite
– indefinite

• adjectives, numerals (generally associated with an ellipsis)
– In this the play is expressionist1 in its approach to theme.
– But it is also so1 in its use of unfamiliar devices...

Page 10: CSA3180: Natural Language Processing


Referential Expressions

• mark the noun phrases

• for each NP ask a question about it

• keep as REs those NPs that can be naturally referenced in the question

The policeman got in the car in a hurry in order to catch the run-away thief.

Page 11: CSA3180: Natural Language Processing


Referential Expressions

a. John was going down the street looking for Bill’s house.

b. He found it at the first corner.

Page 12: CSA3180: Natural Language Processing


Referential Expressions

a. John was going down the street looking for Bill’s house.

b. He met him at the first corner.

Page 13: CSA3180: Natural Language Processing


Referential Expressions

The empty anaphor

Gianni diede una mela a Michele. Più tardi, gli diede un’arancia.

[Not & Zancanaro, 1996]

John gave an apple to Michelle. Later on, gave her an orange.

Page 14: CSA3180: Natural Language Processing


Textual Ellipsis

The functional (bridge) anaphora

The state of the accumulator is indicated to the user. 30 minutes before the complete uncharge, the computer signals for 5 seconds.

[Strube&Hahn, 1996]

Page 15: CSA3180: Natural Language Processing


Events, States, Descriptions

He left without eating1. Because of this1, he was starving in the evening.

But, he adds, Priestley is more interested in Johnson living than in Johnson dead1. In this1 the play is expressionist in its approach to theme.

[Halliday & Hasan, 1976]

Page 16: CSA3180: Natural Language Processing


Definite/Indefinite NPs

Once upon a time, there was a king and a queen. And the king one day went hunting.

Apollo took out his bow...

Take the elevator to the 4th floor.

Page 17: CSA3180: Natural Language Processing


Anaphora Resolution

• State of the art in Anaphora Resolution:
– Identity: 65-80%
– Other: much less…

Page 18: CSA3180: Natural Language Processing


What is so difficult?

Nothing – everything is so simple!

John1 has just arrived. He1 seems tired.

The girl1 leaves the trash on the table and wants to go away. The boy2 tries to hold her1 by the arm31; she1 escapes and runs; he2 calls her1 back.

Caragiale, At the Mansion

Page 19: CSA3180: Natural Language Processing


What is so difficult?

Nothing indeed, but imagine letting the machine go wrong...

There’s a pile of inflammable trash next to your car. You’ll have to get rid of it.

If the baby does not thrive on the raw milk, boil it.

[Hobbs, 1997]

Page 20: CSA3180: Natural Language Processing


What is so difficult?

Semantic restrictions

Jeff1 helped Dick2 wash the car. He1 washed the windows as Dick2 waxed the car. He1 soaped a pane.

Jeff1 helped Dick2 wash the car. He1 washed the windows as Dick2 waxed the car. He2 buffed the hood.

[Walker, Joshi & Prince, 1997]

Page 21: CSA3180: Natural Language Processing


What is so difficult?

Semantic correlates

An elephant1 hit the car with the trunk. The animal1 had to be taken away not to produce other damages.

* An animal1 hit the car with the trunk. The elephant1 had to be taken away not to produce other damages.

Page 22: CSA3180: Natural Language Processing


What is so difficult?

Long distance recovery (pronominalization)

1. His re-entry into Hollywood came with the movie “Brainstorm”,
2. but its completion and release has been delayed by the death of co-star Natalie Wood.
3. He plays Hugh Hefner of Playboy magazine in Bob Fosse’s “Star 80.”
4. It’s about Dorothy Stratton, the Playboy Playmate who was killed by her husband.
5. He also stars in the movie “Class.”

Los Angeles Times, July 18, 1983, cited in [Fox, 1986]

Page 23: CSA3180: Natural Language Processing


What is so difficult?

Gender mismatches

Mr. Chairman..., what is her position upon this issue? (political correctness!!)

Number mismatches

The government discussed ... They ...

Page 24: CSA3180: Natural Language Processing


What is so difficult?

Distributed antecedents

John1 invited Mary2 to the cinema. After the movie ended they3={1,2} went to a restaurant.

Page 25: CSA3180: Natural Language Processing


What is so difficult?

Empty/non-empty anaphors

John gave an apple to Michelle.

Later on, gave her an orange.

John gave an apple to Michelle.

Later on, he gave her an orange.

John gave an apple to Michelle.

Later on, this one asks him for an orange.

Page 26: CSA3180: Natural Language Processing


Semantics are Essential

Police ... They

Teacher... She/He

A car... The automobile

A Mercedes... The car

A lamp... The bulb

Page 27: CSA3180: Natural Language Processing


Semantics are not all

• Pronouns - poor semantic features
– he [+animate, +male, +singular]
– she [+animate, +female, +singular]
– it [+inanimate, +singular]
– they [+plural]

• Gender in Romance languages
– Ro. maşină = ea (feminine)
– Ro. automobil = el (masculine)

• Anaphora resolution by concord rules
Un camion a heurté une voiture. Celle-ci a été complètement détruite.
(A truck hit a car. It was completely destroyed.)
– camion (masculine) vs. celle-ci (feminine): Gender mismatch!
– voiture (feminine) vs. celle-ci (feminine): Gender match!
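The concord rule in the French example can be sketched in a few lines of Python. This is only an illustration: the three-word `LEXICON` is a hypothetical stand-in for a real morphological analyser, and `agrees` and `concord_filter` are names invented for this sketch.

```python
# Toy lexicon (an assumption for this sketch, not part of the slides).
LEXICON = {
    "camion":   {"gender": "masc", "number": "sg"},
    "voiture":  {"gender": "fem",  "number": "sg"},
    "celle-ci": {"gender": "fem",  "number": "sg"},
}

def agrees(anaphor: str, candidate: str) -> bool:
    """True when every feature of the anaphor matches the candidate's."""
    a, c = LEXICON[anaphor], LEXICON[candidate]
    return all(c.get(feature) == value for feature, value in a.items())

def concord_filter(anaphor: str, candidates: list[str]) -> list[str]:
    """Keep only the candidates that agree with the anaphor."""
    return [c for c in candidates if agrees(anaphor, c)]

print(concord_filter("celle-ci", ["camion", "voiture"]))
```

In English the same filter would keep both "truck" and "car" for "it", which is why the slide treats grammatical gender as a useful extra cue.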

Page 28: CSA3180: Natural Language Processing


Anaphora Resolution

[Charniak, 1972]

In order to do AR, one has to be able to do everything else. Once everything else is done, AR comes for free.

Page 29: CSA3180: Natural Language Processing


Anaphora Resolution

Most current anaphora resolution systems implement a pipeline architecture with three modules:

• Collect: determines the List of Potential Antecedents (LPA).

• Filter: eliminates from the LPA the referents that are incompatible with the referential expression under scrutiny.

• Preference: determines the most likely antecedent on the basis of an ordering policy.

[Figure: pipeline Referential expressions → Collect → Filter → Preference, with candidate lists a1, a2, a3, … an passed between the stages]
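The three modules can be sketched as three small functions. This is a minimal illustration under stated assumptions: the `Mention` record, the 50-token window, the gender/number filter and the recency-based ordering policy are all choices made for this sketch, not taken from any particular system.

```python
from dataclasses import dataclass

@dataclass
class Mention:
    text: str
    position: int   # token offset in the text
    gender: str     # "masc", "fem" or "neut"
    number: str     # "sg" or "pl"

def collect(mentions, anaphor, window=50):
    """Collect: mentions preceding the anaphor within a window form the LPA."""
    return [m for m in mentions if 0 < anaphor.position - m.position <= window]

def filter_candidates(lpa, anaphor):
    """Filter: drop candidates that disagree in gender or number."""
    return [m for m in lpa
            if m.gender == anaphor.gender and m.number == anaphor.number]

def prefer(candidates, anaphor):
    """Preference: order by recency; the closest candidate wins."""
    ranked = sorted(candidates, key=lambda m: anaphor.position - m.position)
    return ranked[0] if ranked else None

def resolve(mentions, anaphor):
    return prefer(filter_candidates(collect(mentions, anaphor), anaphor), anaphor)

mentions = [Mention("John", 0, "masc", "sg"),
            Mention("Mary", 2, "fem", "sg"),
            Mention("the cinema", 4, "neut", "sg")]
she = Mention("she", 9, "fem", "sg")
print(resolve(mentions, she).text)
```

Swapping in a different `prefer` function changes the ordering policy without touching the other two stages, which is the point of the pipeline design.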

Page 30: CSA3180: Natural Language Processing


Anaphora Resolution Models

• [Hobbs, 1976] (pronominal anaphora)

Naïve algorithm:
- implies a surface parse tree
- navigation on the syntactic tree of the anaphor’s sentence and the preceding ones in the order of recency, each tree in a left-to-right, breadth-first manner

A semantic approach:
- implies a semantic representation of the sentences (logical expression)
- a collection of semantic operations (inferences)
- type of pronoun is important
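The traversal order of the naïve algorithm can be illustrated with a deliberately simplified sketch. It covers only the search through preceding sentences (most recent first, each tree left to right and breadth first) and omits the walk inside the anaphor's own sentence and its syntactic constraints. The tree encoding `(label, children)` and the caller-supplied `agrees` check are assumptions of this sketch.

```python
from collections import deque

# A node is (label, children); for a leaf, children is the word itself.
def np_nodes_breadth_first(tree):
    """Yield the NP nodes of one tree, left to right, breadth first."""
    queue = deque([tree])
    while queue:
        label, children = queue.popleft()
        if label == "NP":
            yield (label, children)
        if not isinstance(children, str):
            queue.extend(children)

def words(node):
    """Return the words covered by a node, in order."""
    _, children = node
    if isinstance(children, str):
        return [children]
    return [w for child in children for w in words(child)]

def naive_search(previous_trees, agrees):
    """Visit preceding sentences, most recent first, and return the first
    NP accepted by `agrees` (e.g. a gender/number check)."""
    for tree in reversed(previous_trees):
        for np in np_nodes_breadth_first(tree):
            phrase = " ".join(words(np))
            if agrees(phrase):
                return phrase
    return None

# "John met Mary." followed by a sentence containing "she".
s1 = ("S", [("NP", [("NNP", "John")]),
            ("VP", [("VBD", "met"), ("NP", [("NNP", "Mary")])])])
print(naive_search([s1], lambda phrase: phrase == "Mary"))
```

Because subjects sit higher and further left in the tree, a breadth-first, left-to-right walk reaches them before objects, which gives the algorithm its built-in preference for subject antecedents.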

Page 31: CSA3180: Natural Language Processing


Anaphora Resolution Models

• [Lappin & Leass, 1994] (pronominal anaphora)

- syntactic structures
- an intrasentential syntactic filtering
- morphological filter (person, number, gender)
- detection of pleonastic pronouns
- salience parameters (grammatical role, parallelism of grammatical roles, frequency of mention, proximity, sentence recency)
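A salience-based preference step can be sketched as a weighted sum. The weights below and the halving of salience per sentence of distance are assumptions chosen for illustration; they should not be read as the exact values or the full factor set of the original system.

```python
# Illustrative salience weights (assumed values for this sketch).
WEIGHTS = {
    "sentence_recency": 100,
    "subject": 80,
    "direct_object": 50,
    "indirect_object": 40,
}

def salience(factors, sentences_back):
    """Sum the weights of a candidate's factors, halved for each
    sentence separating the candidate from the pronoun."""
    return sum(WEIGHTS[f] for f in factors) / (2 ** sentences_back)

def best_antecedent(candidates):
    """candidates: (text, factors, sentences_back) tuples that have
    already passed the syntactic and morphological filters."""
    return max(candidates, key=lambda c: salience(c[1], c[2]))[0]

candidates = [
    ("Jeff", ["sentence_recency", "subject"], 1),        # (100 + 80) / 2
    ("Dick", ["sentence_recency", "direct_object"], 0),  # 100 + 50
]
print(best_antecedent(candidates))
```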

Page 32: CSA3180: Natural Language Processing


Anaphora Resolution Models

• [Sidner, 1981], [Grosz & Sidner, 1986]
- focus/attentional based
- give more salience to those semantic entities that are in focus
- define where to look for an antecedent in the semantic structure of the preceding text (a stack in G&S’s model)

Page 33: CSA3180: Natural Language Processing


AR Models: Centering

• [Grosz, Joshi, Weinstein, 1983, 1995]
• [Brennan, Friedman and Pollard, 1987]

• Cf(u) = <e1, e2, ... ek> - an ordered list
• Cb(u) = ei
• Cp(u) = e1

• CON > RET > SSH > ASH

                  Cb(u) = Cb(u-1)    Cb(u) ≠ Cb(u-1)
Cb(u) = Cp(u)     CONTINUING         SMOOTH SHIFT
Cb(u) ≠ Cp(u)     RETAINING          ABRUPT SHIFT
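The transition table and the ordering CON > RET > SSH > ASH translate directly into code. The sketch below is a minimal illustration; the function names and the `readings` dictionary are invented here, and the case of an undefined Cb(u-1) is not handled.

```python
def transition(cb_prev, cb, cp):
    """Classify the transition into utterance u from Cb(u-1), Cb(u), Cp(u)."""
    same_cb = cb == cb_prev
    if cb == cp:
        return "CONTINUING" if same_cb else "SMOOTH SHIFT"
    return "RETAINING" if same_cb else "ABRUPT SHIFT"

# Preference order: CON > RET > SSH > ASH
RANK = ["CONTINUING", "RETAINING", "SMOOTH SHIFT", "ABRUPT SHIFT"]

def preferred_reading(cb_prev, readings):
    """readings maps a label to (Cb(u), Cp(u)); pick the best-ranked one."""
    return min(readings,
               key=lambda r: RANK.index(transition(cb_prev, *readings[r])))

# "I think he went to the Cape with Linda." after Cb(u-1) = Jeff
readings = {"he=Jeff": ("Jeff", "I"), "he=Carl": ("Carl", "I")}
print(preferred_reading("Jeff", readings))
```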

Page 34: CSA3180: Natural Language Processing


AR Models: Centering

a. I haven’t seen Jeff for several days.
   Cf = (I=[I], [Jeff])
   Cb = [I]

b. Carl thinks he’s studying for his exams.
   Cf = ([Carl], he=[Jeff], [Jeff’s exams])
   Cb = [Jeff]

c. I think he? went to the Cape with Linda.

[Grosz, Joshi & Weinstein, 1983]

Page 35: CSA3180: Natural Language Processing


AR Models: Centering

b. Carl thinks he’s studying for his exams.
   Cf = ([Carl], he=[Jeff], [Jeff’s exams])
   Cb = [Jeff]

c. I think he? went to the Cape with Linda.
   Reading 1: Cf = (I=[I], he=[Jeff], [the Cape], [Linda]), Cb = [Jeff]: RETAINING
   Reading 2: Cf = (I=[I], he=[Carl], [the Cape], [Linda]), Cb = [Carl]: ABRUPT SHIFT

Preferred antecedent: Jeff

Page 36: CSA3180: Natural Language Processing


Anaphora Resolution Models

• [Mitkov, 1998]
- knowledge-poor approach
- POS tagger, noun phrase rules
- 2 previous sentences
- definiteness, givenness, lexical reiteration, section heading preference, distance, terms of the field, etc.
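A knowledge-poor scorer of this kind can be sketched as a sum of indicator scores over candidates found in the current and the two previous sentences. All numeric values and indicator names below are hypothetical and chosen only to show the mechanism.

```python
# Hypothetical indicator scores, for illustration only.
INDICATOR_SCORES = {
    "definite": 0, "indefinite": -1,
    "given": 1, "reiterated": 2, "in_heading": 1, "domain_term": 1,
}
# Distance bonus by number of sentences back (search window: 2 sentences).
DISTANCE_SCORES = {0: 2, 1: 1, 2: 0}

def score(candidate):
    """Aggregate indicator scores plus the distance bonus."""
    total = sum(INDICATOR_SCORES[i] for i in candidate["indicators"])
    return total + DISTANCE_SCORES[candidate["sentences_back"]]

def resolve(candidates):
    """Return the text of the highest-scoring candidate within the window."""
    in_window = [c for c in candidates if c["sentences_back"] in DISTANCE_SCORES]
    return max(in_window, key=score)["text"] if in_window else None

candidates = [
    {"text": "the printer", "sentences_back": 1,
     "indicators": ["definite", "given", "reiterated"]},
    {"text": "a cable", "sentences_back": 0, "indicators": ["indefinite"]},
]
print(resolve(candidates))
```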

Page 37: CSA3180: Natural Language Processing


General Framework

Build a framework capable of easily accommodating any of the existing AR models, fine-tune them, practice with them to enhance performance (learning), eventually obtaining a better model

Page 38: CSA3180: Natural Language Processing


General Framework

[Figure: a single AR-engine takes the text as input and can be configured with AR-model1, AR-model2 or AR-model3]

Page 39: CSA3180: Natural Language Processing


Co-References

• Halliday and Hasan: a semantic relation, not a textual one

[Figure: co-referential anaphoric relation. On the text layer, the expressions a and b; on the semantic layer, centera. a evokes centera; b evokes centera.]

Page 40: CSA3180: Natural Language Processing


Time and Discourse

• Discourse has a dynamic nature

[Figure: three time axes: real time, discourse time, story time]

Page 41: CSA3180: Natural Language Processing


Resolution Moment

Police officer David Cheshire went to Dillard's home. Putting his ear next to Dillard's head, Cheshire heard the music also.

[Tanaka, 1999]

[Figure: the mentions in the passage: Cheshire, Dillard, his, Dillard, Cheshire]

Page 42: CSA3180: Natural Language Processing


Resolution Delay

• Sanford and Garrod (1989)
– initiation point
– completion point

• Information is kept in a temporary location of memory

Page 43: CSA3180: Natural Language Processing


Cataphora – What is there?

• The element referred to is anticipated by the referring element

• Theories
– scepticism
– syntactic reality

From the corner of the divan of Persian saddle-bags on which he was lying, smoking, as was his custom, innumerable cigarettes, Lord Henry Wotton could just catch the gleam of the honey-sweet and honey-coloured blossoms of a laburnum…

Oscar Wilde, The Picture of Dorian Gray

Page 44: CSA3180: Natural Language Processing


No right reference needed in discourse processing

• Introduction of an empty discourse entity

• Addition of new features as discourse unfolds

• Pronoun anticipation in Romanian

I taught Gabriel to read. = Ro. L-am învățat pe Gabriel să citească.
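The idea of opening an empty discourse entity and enriching it as the discourse unfolds can be sketched as follows. The `DiscourseEntity` class is a hypothetical illustration; the feature names follow the slides, and the example reuses the "he ... Lord Henry Wotton" passage.

```python
class DiscourseEntity:
    """An entity that starts empty and gains features over time."""

    def __init__(self):
        self.features = {}

    def add(self, **features):
        """Add features; refuse any value that clashes with a known one."""
        for name, value in features.items():
            if self.features.get(name, value) != value:
                raise ValueError(f"clash on {name!r}: "
                                 f"{self.features[name]!r} vs {value!r}")
        self.features.update(features)

entity = DiscourseEntity()                              # introduced empty
entity.add(gender="masc", number="sg")                  # from "he"
entity.add(sem="person", name="Lord Henry Wotton")      # from the later name
print(entity.features)
```

With this view the pronoun never has to point forward in the text: it only constrains an entity whose description is completed later.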

Page 45: CSA3180: Natural Language Processing


Unique directionality in interpretation

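[Figure: anaphora vs. cataphora.
Anaphora (John … he): the entity {gender = masc, number = sg, sem = person, name = John} is introduced by "John" and later evoked by "he".
Cataphora (he … John): "he" first introduces {gender = masc, number = sg} with an unknown referent (?); "John" later completes it to {gender = masc, number = sg, sem = person, name = John}.]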

Page 46: CSA3180: Natural Language Processing


Automatic Interpretation

• necessity for an intermediate level

[Figure: three layers. Text layer: the expressions a and b. Restriction layer: fsa. Semantic layer: centera. RE a projects fsa; fsa evokes centera.]

Page 47: CSA3180: Natural Language Processing


Three Layer Approach to AR

1. John sold his bicycle
2. although Bill would have wanted it.

[Figure: three layers.
Text layer: "his bicycle", "it".
Restrictions layer: "his bicycle" projects {no = sg, sem = bicycle, det = yes}; "it" projects {no = sg, sem = ¬human}.
Semantic layer: both feature structures evoke the same entity {no = sg, sem = bicycle, det = yes}.]
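The evoking step in the bicycle example amounts to checking that a projected feature structure does not conflict with an entity on the semantic layer. The sketch below is an assumption-laden illustration: negated values are written with a leading "¬" as on the slide, and semantic types are compared as plain strings rather than through a concept hierarchy.

```python
def compatible(projected, entity):
    """True if no attribute of the projected structure conflicts with the entity."""
    for attr, value in projected.items():
        known = entity.get(attr)
        if known is None:
            continue                      # nothing known yet, so no conflict
        if value.startswith("¬"):
            if known == value[1:]:        # e.g. sem=¬human against sem=human
                return False
        elif known != value:
            return False
    return True

bicycle = {"no": "sg", "sem": "bicycle", "det": "yes"}   # from "his bicycle"
john = {"no": "sg", "sem": "human"}                      # from "John"
it = {"no": "sg", "sem": "¬human"}                       # projected by "it"

print(compatible(it, bicycle), compatible(it, john))
```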

Page 48: CSA3180: Natural Language Processing


Delayed Interpretation

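Police officer David Cheshire went to Dillard's home. Putting his ear next to Dillard's head, Cheshire heard the music also.

[Figure: timeline t0 to t3 over the text layer, the restriction layer and the semantic layer. "Cheshire" and "Dillard" project fsCheshire and fsDillard, which evoke the entities Cheshire and Dillard. "his" projects fshis, which keeps an open candidate set (candidates={ , }) until later material allows the choice to be made.]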

Page 49: CSA3180: Natural Language Processing


Delayed Interpretation

From the corner of the divan of Persian saddle-bags on which he was lying, smoking, as was his custom, innumerable cigarettes, Lord Henry Wotton could just catch the gleam of the honey-sweet and honey-coloured blossoms of a laburnum…

[Figure: timeline t0 to t2 over the text layer, the restriction layer and the semantic layer. At t0, "he" projects {gender = masc, number = sing, sem = person} and evoking initiates with an unknown entity (?). "his" follows at t1. At t2, "Lord Henry Wotton" projects {gender = masc, number = sing, sem = person, name = Lord Henry Wotton} and evoking completes.]

Page 50: CSA3180: Natural Language Processing


The case of Cataphora

1. Although Bill would have wanted it,
2. John sold his bicycle to somebody else.

[Figure: three layers.
Text layer: "it", "his bicycle".
Restrictions layer: "it" projects {no = sg, sem = ¬human}; "his bicycle" projects {no = sg, sem = bicycle, det = yes}.
Semantic layer: "it" evokes an entity described only as {no = sg, sem = ¬human}; "his bicycle" evokes the same entity, which becomes {no = sg, sem = bicycle, det = yes}.]

Page 51: CSA3180: Natural Language Processing


AR Models

• a set of primary attributes

• a set of knowledge sources

• a set of evocation heuristics or rules

• a set of rules that configure the domain of referential accessibility

Page 52: CSA3180: Natural Language Processing


AR Models

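[Figure: components of an AR model across three layers.
Text layer: referential expressions REa, REb, REc, REd, … REx.
Projection layer: attributes (attrx), the primary attributes, filled in by knowledge sources.
Semantic layer: discourse entities DE1, … DEj, … DEm.
Heuristics/rules link an RE to a DE within the domain of referential accessibility.]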

Page 53: CSA3180: Natural Language Processing


Set of Primary Attributes

a. morphological
- number
- lexical gender
- person

Page 54: CSA3180: Natural Language Processing


Set of Primary Attributes

b. syntactical
- full syntactic description of REs as constituents of a syntactic tree: [Lappin and Leass, 1994]; CT based approaches [Grosz, Joshi and Weinstein, 1995], [Brennan, Friedman and Pollard, 1987]; syntactic domain based approaches [Chomsky, 1981], [Reinhart, 1981], [Gordon and Hendricks, 1998], [Kennedy and Boguraev, 1996]
- quality of being adjunct, embedded or complement of a preposition [Kennedy and Boguraev, 1996]
- inclusion or not in an existential construction [Kennedy and Boguraev, 1996]
- syntactic patterns in which the RE is involved: syntactic parallelism [Kennedy and Boguraev, 1996], [Mitkov, 1997]

Page 55: CSA3180: Natural Language Processing


Set of Primary Attributes

c. semantic
- position of the head of the RE in a conceptual hierarchy (animacy, sex (or natural gender), concreteness): WordNet based models [Poesio, Vieira and Teufel, 1997]
- inclusion in a synonymy class
- semantic roles, out of which selectional restrictions, inferential links, pragmatic limitations, semantic parallelism and object preference can be verified

Page 56: CSA3180: Natural Language Processing


Set of Primary Attributes

d. positional

-offset of the first token of the RE in the text

[Kennedy and Boguraev, 1996]

-inclusion in an utterance, sentence or clause, considered as a discourse unit

[Hobbs, 1987], [Azzam, Humphreys and Gaizauskas, 1998], [Cristea et al., 2000]

Page 57: CSA3180: Natural Language Processing


Set of Primary Attributes

e. surface realisation (type)

the domain of this feature contains: zero-pronoun, clitic pronoun, full pronoun, reflexive pronoun, possessive pronoun, demonstrative pronoun, reciprocal pronoun, expletive “it”, bare noun (undetermined NP), indefinite determined NP, definite determined NP, proper noun (name)

[Gordon and Hendricks, 1998], [Cristea et al., 2000]

Page 58: CSA3180: Natural Language Processing


Set of Primary Attributes

f. other

inclusion or not of the RE in a specific lexical field (“domain concept”)

[Mitkov, 1997]

- frequency of the term in the text

[Mitkov, 1997]

- occurrence of the term in a heading

[Mitkov, 1997]

Page 59: CSA3180: Natural Language Processing


Knowledge Sources

• Type of process: incremental

• A knowledge source: a (virtual) processor able to fetch values to attributes on the restriction layer

• Minimum set: POS-tagger + shallow parser
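One way to picture knowledge sources is as functions that each fill in some attributes of a feature structure, applied one after another. In the sketch below, `toy_pos_tagger` and `toy_shallow_parser` are hypothetical stand-ins for a real POS-tagger and shallow parser, and the attribute names are assumptions.

```python
PRONOUNS = {"he", "she", "it", "they", "his", "her", "him", "them"}

def toy_pos_tagger(expression, fs):
    """Knowledge source 1: set a coarse part-of-speech attribute."""
    fs["pos"] = "PRON" if expression.lower() in PRONOUNS else "NOUN"

def toy_shallow_parser(expression, fs):
    """Knowledge source 2: take the last token as the head of the RE."""
    fs["head"] = expression.split()[-1]

def project(expression, knowledge_sources):
    """Build the feature structure incrementally, one source at a time."""
    fs = {}
    for source in knowledge_sources:
        source(expression, fs)
    return fs

minimum_set = [toy_pos_tagger, toy_shallow_parser]
print(project("his bicycle", minimum_set))
```

Richer models would simply append further sources (for example a gender lexicon) to the list, without changing `project`.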