
Statistical NLP: From linguistic strip mining to deep linguistic processing

Christopher Manning, Stanford University

Two themes

1. Erasing the gap between “Statistical NLP” and “Deep linguistic/knowledge-based processing”
• Statistical NLP methods are now dealing with problems that only deep grammars used to
• Deep processing models are incorporating machine learning methods for disambiguation
2. Extending “Statistical NLP” towards providing robust notions of inference
• Most of the early Statistical NLP work was on surface phenomena
• Most of it concentrated on issues of linguistic form
• Can we extend the paradigm to inference?

1. Statistical and deep processing: How do they differ?

• Penn Treebank tree [really the stripped form usually returned by statistical parsers]
• LFG/HPSG analysis


Traditionally…

Statistical parsers didn’t model:
• Grammatical relations
• Phenomena like raising and control
• Long-distance dependencies like Wh-movement
• Models of argument structure/subcategorization

Deep linguistic processing didn’t model:
• Probabilities of different analyses
• Robust processing of incomplete/ungrammatical/long utterances

Extracting grammatical relations from statistical parses

[de Marneffe et al. 2006]
• Exploit the high-quality syntactic analysis done by statistical parsers to get the dependencies
• Dependencies are generated by pattern-matching rules (a toy sketch follows the example below)

Bills on ports and immigration were submitted by Senator Brownback

[Figure: phrase-structure parse of the sentence and the extracted dependency graph: nsubjpass(submitted, Bills), auxpass(submitted, were), agent(submitted, Brownback), nn(Brownback, Senator), prep_on(Bills, ports), cc_and(ports, immigration)]
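To make “pattern-matching rules” concrete, here is a minimal, hypothetical sketch of one such rule (recognizing passive subjects in a Penn Treebank-style tree) using nltk’s Tree class. The rule, the crude head finding, and the simplified parse are illustrative assumptions, not the actual Stanford Dependencies converter.

```python
# Sketch: one pattern-matching rule for passive subjects (nsubjpass/auxpass).
# Assumes a Penn Treebank-style tree; head finding here is deliberately crude.
from nltk import Tree

def extract_passive_deps(tree):
    deps = []
    for s in tree.subtrees(lambda t: t.label() == "S"):
        nps = [c for c in s if isinstance(c, Tree) and c.label() == "NP"]
        vps = [c for c in s if isinstance(c, Tree) and c.label() == "VP"]
        if not (nps and vps):
            continue
        vp = vps[0]
        aux = [c for c in vp if isinstance(c, Tree) and c.label() == "VBD"]
        inner = [c for c in vp if isinstance(c, Tree) and c.label() == "VP"]
        if aux and inner:
            part = [c for c in inner[0] if isinstance(c, Tree) and c.label() == "VBN"]
            if part:
                verb = part[0][0]                 # the participle's word
                subj = nps[0].leaves()[0]         # crude head finding: first word
                deps.append(("nsubjpass", verb, subj))
                deps.append(("auxpass", verb, aux[0][0]))
    return deps

parse = Tree.fromstring(
    "(S (NP (NNS Bills) (PP (IN on) (NP (NNS ports) (CC and) (NN immigration))))"
    " (VP (VBD were) (VP (VBN submitted)"
    " (PP (IN by) (NP (NNP Senator) (NNP Brownback))))))")
print(extract_passive_deps(parse))
# [('nsubjpass', 'submitted', 'Bills'), ('auxpass', 'submitted', 'were')]
```

The real converter composes many such tree patterns, together with proper head-finding rules, to produce the full dependency graph shown above.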

Recovering control/raising/long-distance movement dependencies
[Levy and Manning 2004; cf. Johnson 2002]

• Null complementizers
• Dislocated dependencies
• Shared dependencies

Full example: start with a context-free parse
1a: null insertion sites
1b: location of null insertions
2a: identify dislocations
2b: identify origin sites
2c: insert dislocations in origins
3a: identify sites of non-local shared dependency
3b: insert non-local shared dependency sites
3c: find controllers of shared dependencies
End.

The model’s feature templates

Is a node an origin site (“origin?”)? Features:
• Syntactic category, parent, grandparent (subj vs. obj extraction; VP finiteness)
• Presence of daughters (NP under S)
• Head words (wanted vs. to vs. eat)
• Syntactic path (Gildea & Jurafsky 2002): <↑SBAR,↓S,↓VP,↓S,↓VP> (see the sketch below)
• Plus: feature conjunctions, specialized features for expletive subject dislocations, passivizations, passing featural information properly through coordinations, etc.

cf. Campbell (2004, ACL) – it’s really linguistic rules
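A toy rendering of the syntactic-path feature, assuming nltk.Tree and its tuple-of-child-indices node addressing. The arrow convention (record the node arrived at after each up/down move) follows the slide’s <↑SBAR,↓S,…> example; the sentence is borrowed from the SRL slides below.

```python
# Sketch: the Gildea & Jurafsky (2002) syntactic-path feature between two nodes.
from nltk import Tree

def syntactic_path(tree, pos_from, pos_to):
    # Longest common prefix of the two tree positions = lowest common ancestor.
    i = 0
    while i < min(len(pos_from), len(pos_to)) and pos_from[i] == pos_to[i]:
        i += 1
    lca = pos_from[:i]
    # Label each move with the node it arrives at: ↑ going up, ↓ going down.
    parts = ["↑" + tree[pos_from[:j]].label()
             for j in range(len(pos_from) - 1, len(lca) - 1, -1)]
    parts += ["↓" + tree[pos_to[:j]].label()
              for j in range(len(lca) + 1, len(pos_to) + 1)]
    return "<" + ",".join(parts) + ">"

t = Tree.fromstring(
    "(S (NP (DT The) (NN ogre)) (VP (VBD cooked) (NP (DT the) (NNS children))))")
print(syntactic_path(t, (1, 0), (1, 1)))  # 'cooked' to its object NP: <↑VP,↓NP>
```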

Evaluation on dependency metric: gold-standard input trees

[Figure: bar chart of dependency accuracy (F1, 70–100%) for Overall, NP, S, VP, ADJP, SBAR, and ADVP dependencies, comparing Johnson 2002 and Levy & Manning, both with gold-standard input trees]

Semantic role labeling
[Toutanova, Haghighi & Manning 2005]

• Who offered Mr. Smith a reimbursement?
• Whom did Shaw Publishing offer a reimbursement?
• What did Shaw Publishing offer Mr. Smith?
• When did Shaw Publishing offer Mr. Smith a reimbursement?

Shaw Publishing [WHO] offered Mr. Smith [WHOM] a reimbursement [WHAT] last March [WHEN]

Most Previous Work: Local Models

• Extract features for each node and the predicate
• Classify nodes independently

[Figure: parse tree for “The ogre cooked the children”, with the feature vector Φ(n) for the object NP: Phrase Type: NP; Path: NP-up-VP-down-V; Head Word: children; Predicate: cook; Passive: false; Position: after]

A Drawback of Local Models

[Figure: parse tree for “The ogre cooked the children a meal”; classifying each NP independently labels “the ogre” AGENT but labels both “the children” and “a meal” PATIENT, since nothing stops a local model from assigning the same core role twice (“the children” should be BENEFICIARY); see the sketch below]
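A minimal sketch of why joint decoding fixes this (not the actual Toutanova et al. model): score the role assignment jointly and forbid duplicate core roles. The local scores below are invented for illustration.

```python
# Sketch: local vs. joint decoding of semantic roles with a no-duplicates constraint.
from itertools import product

ROLES = ["AGENT", "PATIENT", "BENEFICIARY"]

# Hypothetical local log-scores per candidate node and role.
local_scores = {
    "the ogre":     {"AGENT": -0.1, "PATIENT": -3.0, "BENEFICIARY": -3.5},
    "the children": {"AGENT": -2.5, "PATIENT": -0.9, "BENEFICIARY": -1.0},
    "a meal":       {"AGENT": -4.0, "PATIENT": -0.8, "BENEFICIARY": -2.0},
}

# Local decoding: each node labeled independently -> duplicate PATIENTs.
local = {n: max(s, key=s.get) for n, s in local_scores.items()}
print(local)   # {'the ogre': 'AGENT', 'the children': 'PATIENT', 'a meal': 'PATIENT'}

# Joint decoding: best total score among assignments with no repeated role.
nodes = list(local_scores)
best = max(
    (a for a in product(ROLES, repeat=len(nodes)) if len(set(a)) == len(a)),
    key=lambda a: sum(local_scores[n][r] for n, r in zip(nodes, a)),
)
print(dict(zip(nodes, best)))
# {'the ogre': 'AGENT', 'the children': 'BENEFICIARY', 'a meal': 'PATIENT'}
```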

Semantic roles: joint models boost results

Accuracies of local and joint models on core arguments

Error reduction from the best published result: 44.6% on Integrated, 52% on Classification

[Figure: bar chart of F-measure (85–99) on the Id, Class, and Integrated tasks for X&P (best previous result, Xue and Palmer 04), the local model, and the joint model; plotted values include 90.6, 91.8, 94.8, 95.0, 95.1, 95.7, 96.1, and 97.6]

2nd place in CoNLL 2005; see ACL 2005 and CoNLL papers

Application: Semantically precise search for relations/events

Query: afghans destroying opium poppies

Disambiguation and robustness in deep linguistic processing

• LFG: Johnson, Geman, Riezler, Kaplan, …
• HPSG: Toutanova, Manning, Oepen, Baldridge, Miyao, Tsujii, …
• Categorial Grammar (CCG): Hockenmaier, Clark, Steedman, etc.
• The work builds machine learning models (commonly log-linear models) to either rank whole parses or to determine the optimal analysis within a random field

We see considerable convergence in methods and coverage.

2. Textual Inference. Motivation: the external perspective on NLP

• NLP has many successful tools with all sorts of uses
  • Part-of-speech tagging, named entity recognition, syntactic parsing, semantic role parsing, coreference determination
  • …but they concentrate on structure, not meaning
• By and large, non-NLP people want systems for more holistic semantic tasks
  • Text categorization
  • Information retrieval/web search
• The state of the art in these areas is (slightly extended) bag-of-words models
• How can we extend our NLP technologies to satisfy people’s holistic semantic tasks – while still working on any text?

The textual inference task

• Does text T justify an inference to hypothesis H?
  • An informal, intuitive notion of inference: not strict logic
  • Focus on local inference steps, not long chains of deduction
  • Emphasis on variability of linguistic expression
• Robust, accurate textual inference would enable:
  • Semantic search:
    H: lobbyists attempting to bribe U.S. legislators
    T: The A.P. named two more senators who received contributions engineered by lobbyist Jack Abramoff in return for political favors.
  • Question answering:
    H: Who bought J.D. Edwards?
    T: Thanks to its recent acquisition of J.D. Edwards, Oracle will soon be able…
  • Customer email response
  • Relation extraction (database building)
  • Document summarization

Textual inference as graph alignment
[Haghighi et al. 05, de Salvo Braz et al. 05]

• Find the least-cost alignment of H to part of T, using a locally decomposable cost model (lexical and structural costs)
• Assumption: good alignment ⇒ valid inference

T: CNN reported that thirteen soldiers lost their lives in today’s ambush.
H: Several troops were killed in the ambush.

[Figure: dependency graphs. T: nsubj(reported, CNN), ccomp(reported, lost); nsubj(lost, soldiers), dobj(lost, lives), in(lost, ambush); nn(soldiers, thirteen), dep(lives, their), poss(ambush, today’s). H: nsubjpass(killed, troops), aux(killed, were), in(killed, ambush); amod(troops, several), det(ambush, the)]
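A minimal sketch of the least-cost alignment step under a locally decomposable cost model, cast as minimum-cost bipartite matching. The word-similarity costs are invented; the real systems add structural (edge) costs and richer lexical resources.

```python
# Sketch: align hypothesis words to text words by minimum-cost matching.
import numpy as np
from scipy.optimize import linear_sum_assignment

t_words = ["reported", "CNN", "lost", "soldiers", "lives", "ambush"]
h_words = ["killed", "troops", "ambush"]

def lex_cost(h, t):
    """Hypothetical lexical match costs (0 = perfect match)."""
    near_synonyms = {("killed", "lost"): 0.3, ("troops", "soldiers"): 0.2}
    if h == t:
        return 0.0
    return near_synonyms.get((h, t), 1.0)

cost = np.array([[lex_cost(h, t) for t in t_words] for h in h_words])
rows, cols = linear_sum_assignment(cost)   # minimum-cost bipartite matching
for r, c in zip(rows, cols):
    print(f"{h_words[r]} -> {t_words[c]} (cost {cost[r, c]})")
# killed -> lost (cost 0.3)
# troops -> soldiers (cost 0.2)
# ambush -> ambush (cost 0.0)
```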

Why we need sloppy matching

• Passage: Today’s best estimate of giant panda numbers in the wild is about 1,100 individuals living in up to 32 separate populations mostly in China’s Sichuan Province, but also in Shaanxi and Gansu provinces.
• Hypothesis 1: There are 32 pandas in the wild in China. (FALSE)
• Hypothesis 2: There are about 1,100 pandas in the wild in China. (TRUE)
• We’d like to get this right, but we just don’t have the technology to fully infer from best estimate of giant panda numbers in the wild is about 1,100 to there are about 1,100 pandas in the wild

Weighted abduction models
[Hobbs et al. 93, Moldovan et al. 03, Raina et al. 05]

• Translate to FOL and try to prove H from T
• Allow assumptions at various “costs”
• Superficially, like using formal semantics & logic
• Actually, analogous to the graph-matching approach:
  • FOL ⇔ dependency graphs
  • abduction costs ⇔ lexical match costs
• Modulo use of additional axioms [Tatu et al. 06]

T: Kidnappers released a Filipino hostage.
H: A Filipino hostage was freed.

T: ∃e, a, b. kidnappers(a) ∧ release(e, a, b) ∧ Filipino(b) ∧ hostage(b)
H: ∃f, x. Filipino(x) ∧ hostage(x) ∧ freed(f, x)

Does T ⊨ H? The axiom released(p, q, r) → ∃s. freed(s, r) enables the proof; it costs $2.00.
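A toy, propositionalized sketch of the idea: each hypothesis literal is either proved directly (free), derived through an axiom (at the axiom’s cost), or assumed outright (expensive); a low total cost is taken as entailment. Only the $2.00 axiom cost comes from the slide; everything else is invented.

```python
# Sketch: cost-weighted abduction over propositionalized literals.
TEXT = {("kidnappers", "a"), ("release", "e", "a", "b"),
        ("Filipino", "b"), ("hostage", "b")}

def literal_cost(lit):
    if lit in TEXT:
        return 0.0                    # directly entailed by T: free
    if lit[0] == "freed":             # axiom: release(e,q,r) -> freed(s,r); costs $2.00
        if any(f[0] == "release" and f[-1] == lit[-1] for f in TEXT):
            return 2.0
    return 10.0                       # bare assumption: expensive

# H's variables unified with T's (x -> b); f remains a new event variable.
hypothesis = [("Filipino", "b"), ("hostage", "b"), ("freed", "f", "b")]
print(sum(literal_cost(l) for l in hypothesis))   # 2.0: a cheap proof, so H is accepted
```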

Problems with alignment models

• Alignments are important, but good alignment ⇏ valid inference:
  1. Assumption of upward monotonicity
  2. Assumption of locality
  3. Confounding of alignment and entailment

Problem 1: non-monotonicity

• In normal “upward monotone” contexts, generalizing a concept preserves truth:
  T: Some Korean historians believe the murals are of Korean origin.
  H: Some historians believe the murals are of Korean origin.
• But not in “downward monotone” contexts:
  T: Few Korean historians doubt that Koguryo belonged to Korea.
  H: Few historians doubt that Koguryo belonged to Korea.
• Lots of constructs invert monotonicity!
  • explicit negation: not
  • restrictive quantifiers: no, few, at most n
  • negative or restrictive verbs: lack, fail, deny
  • preps & adverbs: without, except, only
  • comparatives and superlatives
  • antecedent of a conditional: if

Problem 2: non-locality

• To be tractable, alignment scoring must be local
• But valid inference can hinge on non-local factors:

T1: The army confirmed that interrogators desecrated the Koran.

H: Interrogators desecrated the Koran.

T2: Newsweek retracted its report that the army had acknowledged that interrogators desecrated the Koran.

H: Interrogators desecrated the Koran.

Problem 3: confounding alignment & inference

• If alignment ⇒ entailment, the lexical cost model must penalize e.g. antonyms and inverses:
  T: Stocks fell on worries that oil prices would rise this winter.
  H: Stock prices climbed.
  [aligning climbed ~ fell: must prevent this alignment]
• But the aligner will seek the best alignment:
  T: Stocks fell on worries that oil prices would rise this winter.
  H: Stock prices climbed.
  [aligning climbed ~ rise: maybe entailed?]
• Actually, we want the first alignment, and then a separate assessment of entailment! [cf. Marsi & Krahmer 05]

Solution: three-stage architecture
[MacCartney et al., HLT-NAACL 2006]

1. linguistic analysis
2. graph alignment
3. features & classification → score ≥ tuned threshold? → yes/no

T: India buys missiles.
H: India acquires arms.

[Figure: walk-through of the India example. Stage 1 produces dependency graphs for T (nsubj(buys, India), dobj(buys, missiles)) and H (nsubj(acquires, India), dobj(acquires, arms)), with word annotations such as India: POS NNP, NER LOCATION, IDF 0.027 and buys: POS VBZ. Stage 2 aligns India ↦ India (0.00), acquires ↦ buys (–0.53), arms ↦ missiles (–0.75), for an alignment score of –1.28. Stage 3 fires features fi with weights wi, such as “Structure match” (w = +0.10) and “Alignment: good” (w = +0.30), yielding a final score of –0.88]

Features of valid inferences

• After alignment, extract features of the inference
• Look for global characteristics of valid and invalid inferences
• Features embody crude semantic theories
• Feature categories: adjuncts, modals, quantifiers, implicatives, antonymy, tenses, structure, explicit numbers & dates
• Alignment score is also an important feature
• Extracted features ⇒ statistical model ⇒ score
  • Can learn feature weights using logistic regression (a toy sketch follows below)
  • Or can use hand-tuned weights
• (Score ≥ threshold)? ⇒ prediction: yes/no
  • Threshold can be tuned
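A minimal sketch of this stage, assuming invented feature names and toy training data: weights are learned by logistic regression, and the decision threshold is tuned rather than fixed at 0.5.

```python
# Sketch: learn feature weights with logistic regression, then apply a tuned threshold.
import numpy as np
from sklearn.linear_model import LogisticRegression

FEATURES = ["alignment_score", "structure_match", "antonym_match", "added_adjunct"]

# Toy training pairs: feature vector -> entailed (1) or not (0).
X = np.array([
    [-0.9, 1, 0, 0],   # good alignment, structures match        -> yes
    [-2.5, 0, 0, 1],   # poor alignment, H adds an adjunct       -> no
    [-1.0, 1, 1, 0],   # good alignment but an antonym match     -> no
    [-0.7, 1, 0, 0],   # good alignment, structures match        -> yes
])
y = np.array([1, 0, 0, 1])

model = LogisticRegression().fit(X, y)

threshold = 0.4   # tuned on held-out data, not fixed at 0.5
probs = model.predict_proba(X)[:, 1]
print(["yes" if p >= threshold else "no" for p in probs])
```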

Features: restrictive adjuncts

• Does the hypothesis add or drop a restrictive adjunct?
  • Adjunct is dropped: usually truth-preserving
  • Adjunct is added: suggests no entailment
  • But in a downward monotone context, this is reversed

T: In all, Zerich bought $422 million worth of oil from Iraq, according to the Volcker committee.
H: Zerich bought oil from Iraq during the embargo.

T: Zerich didn’t buy any oil from Iraq, according to the Volcker committee.
H: Zerich didn’t buy oil from Iraq during the embargo.

• Generate features for add/drop and monotonicity

Features: factives & implicatives

T: Libya has tried, with limited success, to develop its own indigenous missile, and to extend the range of its aging SCUD force for many years under the Al Fatah and other missile programs.
H: Libya has developed its own domestic missile program.

• Evaluate governing verbs for implicativity class:
  • Unknown: say, tell, suspect, try, …
  • Fact: know, acknowledge, ignore, …
  • True: manage to, …
  • False: fail to, forget to, …
• Need to check for ↓-monotone context here too:
  • not try to win ⊭ not win, but not manage to win ⊨ not win
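A minimal lookup-table sketch of this feature: map the governing verb to an entailment sign for its complement, in positive and negated contexts. The class assignments follow the slide; the +1/–1/0 signature encoding is a common simplification, not necessarily the system’s representation.

```python
# Sketch: implicativity classes as complement-entailment signatures.
# +1: complement entailed; -1: its negation entailed; 0: unknown.
IMPLICATIVES = {
    "try":    {"pos": 0,  "neg": 0},    # unknown either way
    "manage": {"pos": +1, "neg": -1},   # two-way implicative
    "fail":   {"pos": -1, "neg": +1},
    "know":   {"pos": +1, "neg": +1},   # factive: complement holds regardless
}

def complement_sign(verb, negated):
    sig = IMPLICATIVES.get(verb, {"pos": 0, "neg": 0})
    return sig["neg"] if negated else sig["pos"]

print(complement_sign("manage", negated=True))  # -1: "not manage to win" entails "not win"
print(complement_sign("try", negated=True))     #  0: "not try to win" entails nothing about winning
```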

Increasing semantic fidelity with natural logic

• Natural logic is a logic of natural language inference
  • No symbols: → ¬ ∧ ∨ ∀ ∃
  • Just words: All men are mortal…
• Focuses on a widespread, familiar class of inference: those involving monotonicity
  • Many other inferences can be reinterpreted in this way
  • We’ve been working on extensions to exclusion
• Avoids the difficulty in applying logic in the wild

Broadening edits

Many kinds of edits serve to broaden:
• generalize: ⇒ animal
• drop qualifier
• temporal widening: ⇒ recently
• locative widening: ⇒ in Canada
• drop conjunct (or add disjunct)
• weaken quantifier: every ⇒ many
• relax modal: must ⇒ might

Example (the edits apply to spans of this sentence): Public health and animal health officials from the Departments of Health and Community Services and Forest Resources and Agrifoods today confirmed a rabies case involving a red fox in western Newfoundland. [RTE2_dev #467]

Heuristic: broadening edits yield entailments

Narrowing edits

Many kinds of edits serve to narrow:
• specialize: ⇒ Dr. Laura Gray
• add qualifier: juvenile
• temporal narrowing: ⇒ this morning
• locative narrowing: ⇒ in Dundee
• add conjunct (or drop disjunct)
• strengthen quantifier: some ⇒ most
• tighten modal: could ⇒ did

Heuristic: narrowing edits yield reverse entailments

Negative polarity

Example: No case of indigenously acquired rabies infection has been confirmed in man or any animal species during the past 2 years. [RTE2_dev #601]

Negative contexts

In negative contexts, the heuristic is reversed:
• narrowing edits yield entailments (e.g., specialize: ⇒ mammal; add qualifier: positively; temporal narrowing: ⇒ year; drop disjunct)
• broadening edits yield reverse entailments

Why the inversion? And it’s not just explicit negation…
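The direction-flipping rule is compact enough to state in a few lines: an edit contributes its basic entailment direction, inverted once for each downward-monotone context enclosing the edit site. A minimal sketch:

```python
# Sketch: entailment direction of a broadening/narrowing edit under negation.
BROADEN, NARROW = +1, -1   # +1: the edited sentence is entailed (T => H)

def edit_direction(edit, n_downward_contexts):
    """+1 = forward entailment, -1 = reverse entailment."""
    flip = -1 if n_downward_contexts % 2 == 1 else 1
    return edit * flip

# "red fox" => "animal" (broadening) in a positive context: entailment.
print(edit_direction(BROADEN, 0))   # +1
# "any animal species" => "mammal" (narrowing) under "No case of...": entailment.
print(edit_direction(NARROW, 1))    # +1
```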

Monotonicity as a lexical property

• Monotonicity: a property of semantic functors
• Upward-monotone (↑): broader inputs ⇒ broader outputs
  • This is the default for all semantic functors!
  • Adjectives: famous sprinter ⊆ famous runner
  • Verbs: begin sprinting ⊆ begin running
  • Quantifiers: several sprinters ⊆ several runners
• Downward-monotone (↓): broader inputs ⇒ narrower outputs
  • Quantifiers, verbs, others: no runners ⊆ no sprinters
• Non-monotone: neither upward- nor downward-monotone
  • Quantifiers, others: most sprinters # most runners
• Monotonicity of binary functors can vary by argument: no ∈ ↓↓, every ∈ ↓↑

Monotonicity as a structural property

• What is the scope of monotonicity inversions? What about multiple inversions?
• Can be computed on a tree using Sánchez Valencia’s monotonicity calculus
• While he did this in categorial grammar, you can – and we do – do this on Penn Treebank trees
  • With only a little ugliness due to their flatness

No soldier left without rations.

[Figure: parse tree for the sentence with monotonicity markings: No ∈ ↓↓ and without ∈ ↓ compose, so soldier and left are in downward-monotone positions, while rations, under both downward operators, ends up upward-monotone; see the sketch below]
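A minimal sketch of that computation on a toy tree encoding: start upward at the root and flip polarity each time we descend into an argument position the functor marks ↓. The mini-lexicon and tree representation are invented; the ↓↓ marking for No and ↓ for without follow the slide.

```python
# Sketch: propagating polarity down a tree, Sanchez Valencia style.
DOWNWARD = {"No": ("down", "down"), "without": ("down",), "every": ("down", "up")}

def polarities(tree, polarity=+1, out=None):
    """tree = (functor, arg, ...) or a word; returns {word: +1 (upward) | -1 (downward)}."""
    out = {} if out is None else out
    if isinstance(tree, str):
        out[tree] = polarity
        return out
    functor, *args = tree
    out[functor] = polarity
    marks = DOWNWARD.get(functor, ("up",) * len(args))
    for arg, mark in zip(args, marks):
        polarities(arg, polarity * (-1 if mark == "down" else 1), out)
    return out

# "No soldier left without rations": No scopes over soldier and the VP;
# without scopes over rations inside the VP.
print(polarities(("No", "soldier", ("left", ("without", "rations")))))
# {'No': 1, 'soldier': -1, 'left': -1, 'without': -1, 'rations': 1}
```

So rations sits in an upward-monotone position: broadening it (rations ⇒ food) preserves truth, exactly as the two composed ↓ markings predict.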

Results

• Pascal RTE1 test set results (accuracy):
  • Just alignment: 56.3%
  • With inferers: 64.1%
• Pascal RTE3 development set results (accuracy):
  • Without natural logic: 67.3%
  • Adding natural logic: 69.6%
• Natural logic has coverage of 87/800 examples with 77% precision – cf. Bos & Markert (2005), who get 77% accuracy from a coverage of only 30/800 examples.

Envoi

The top-level messages:
• Statistical NLP is now extending into the deeper forms of analysis traditionally the province of deep linguistic processing, while the latter has taken on statistics for disambiguation and robustness. What differences remain?
• For both, the challenge is to move beyond structural analysis to satisfying people’s more holistic information needs
• Textual inference is one framework for addressing some of those needs, and it preserves the IR idea of semantic tools that work over arbitrary text
• It has potential for applications like evidence extraction
  • Find passages suggesting price fixing by Enron
• But there is still much to do in building the lexical and knowledge resources needed to solve this problem.