page 1 november 2007 beckman institute with thanks to: collaborators: ming-wei chang, vasin...
TRANSCRIPT
Page 1
November 2007
Beckman Institute
With thanks to:
Collaborators: Ming-Wei Chang, Vasin Punyakanok, Lev Ratinov,
Mark Sammons, Scott Yih, Dav ZimakFunding: ARDA, under the AQUAINT program
NSF: ITR IIS-0085836, ITR IIS-0428472, ITR IIS- 0085980 A DOI grant under the Reflex program, DASH Optimization (Xpress-MP)
Natural Language ProcessingviaLearning and Global Inference with Constraints
Dan RothDepartment of Computer ScienceUniversity of Illinois at Urbana-Champaign
Page 2
Nice to Meet You
Page 3
Learning and Inference
Global decisions in which several local decisions play a role but there are mutual dependencies on their outcome.
(Learned) models/classifiers for different sub-problems
Incorporate classifiers’ information, along with constraints, in making coherent decisions – decisions that respect the local models as well as domain & context specific constraints.
Global inference for the best assignment to all variables of interest.
Page 4
Inference
Page 5
Comprehension
1. Christopher Robin was born in England. 2. Winnie the Pooh is a title of a book. 3. Christopher Robin’s dad was a magician. 4. Christopher Robin must be at least 65 now.
A process that maintains and updates a collection of propositions about the state of affairs.
(ENGLAND, June, 1989) - Christopher Robin is alive and well. He lives in England. He is the same person that you read about in the book, Winnie the Pooh. As a boy, Chris lived in a pretty home called Cotchfield Farm. When Chris was three years old, his father wrote a poem about him. The poem was printed in a magazine for others to read. Mr. Robin then wrote a book. He made up a fairy tale land where Chris lived. His friends were animals. There was a bear called Winnie the Pooh. There was also an owl and a young pig, called a piglet. All the animals were stuffed toys that Chris owned. Mr. Robin made them come to life with his words. The places in the story were all near Cotchfield Farm. Winnie the Pooh was written in 1925. Children still love to read about Christopher Robin and his animal friends. Most people don't know he is a real person who is grown now. He has written two books of his own. They tell what it is like to be famous.
Page 6
Illinois’ bored of education board
...Nissan Car and truck plant is ……divide life into plant and animal kingdom
(This Art) (can N) (will MD) (rust V) V,N,N
The dog bit the kid. He was taken to a veterinarian a hospital
What we Know: Stand Alone Ambiguity Resolution
Learn a function f: X Y that maps observations in a domain to one of several categories or <
Page 7
Theoretically: generalization bounds How many example does one need to see in order to
guarantee good behavior on previously unobserved examples.
Algorithmically: good learning algorithms for linear representations.
Can deal with very high dimensionality (106 features) Very efficient in terms of computation and # of examples.
On-line. Key issues remaining:
Learning protocols: how to minimize interaction (supervision); how to map domain/task information to supervision; semi-supervised learning; active learning; ranking; sequences; adaptation
What are the features? No good theoretical understanding here.
Programming systems that have multiple classifiers
Classification is Well Understood
Page 8
Comprehension
1. Christopher Robin was born in England. 2. Winnie the Pooh is a title of a book. 3. Christopher Robin’s dad was a magician. 4. Christopher Robin must be at least 65 now.
A process that maintains and updates a collection of propositions about the state of affairs.
(ENGLAND, June, 1989) - Christopher Robin is alive and well. He lives in England. He is the same person that you read about in the book, Winnie the Pooh. As a boy, Chris lived in a pretty home called Cotchfield Farm. When Chris was three years old, his father wrote a poem about him. The poem was printed in a magazine for others to read. Mr. Robin then wrote a book. He made up a fairy tale land where Chris lived. His friends were animals. There was a bear called Winnie the Pooh. There was also an owl and a young pig, called a piglet. All the animals were stuffed toys that Chris owned. Mr. Robin made them come to life with his words. The places in the story were all near Cotchfield Farm. Winnie the Pooh was written in 1925. Children still love to read about Christopher Robin and his animal friends. Most people don't know he is a real person who is grown now. He has written two books of his own. They tell what it is like to be famous.
This is an Inference Problem
Page 9
This Talk
Global Inference over Local Models/Classifiers + Expressive Constraints
Model Generality of the framework
Training Paradigms
Global vs. Local training Semi-Supervised Learning
Examples Semantic Parsing Information Extraction Pipeline processes
Page 10
Sequential Constrains Structure
Three models for sequential inference with classifiers[Punyakanok & Roth NIPS’01]
HMM; HMM with Classifiers Sufficient for easy problems
Conditional Models (PMM) Allows direct modeling of states as a function of input Classifiers may vary; SNoW (Winnow;Perceptron); MEMM: MaxEnt;
SVM based
Constraint Satisfaction Models (CSCL: more general constrains)
The inference problem is modeled as weighted 2-SAT With sequential constraints: shown to have efficient solution
Recent work – viewed as multi-class classification; emphasis on global training [Collins’02, CRFs,SVMs]; efficiency and performance issues
s1
o1
s2
o2
s3
o3
s4
o4
s5
o5
s6
o6
s1
o1
s2
o2
s3
o3
s4
o4
s5
o5
s6
o6By far, the most popular in applications
Allows for Dynamic Programming based
Inference
What if the structure of the problem/constraints is not sequential?
Page 11
Pipeline
Pipelining is a crude approximation; interactions occur across levels and down stream decisions often interact with previous decisions.
Leads to propagation of errors Occasionally, later stage problems are easier but
upstream mistakes will not be corrected.
POS Tagging
Phrases
Semantic Entities
Relations
Most problems are not single classification problems
Parsing
WSD Semantic Role Labeling
Raw Data
Looking for: Global inference over the outcomes of different (learned) predictors as a way to break away from this paradigm [between pipeline & fully global] Allows a flexible way to incorporate linguistic and structural constraints.
Page 12
Inference with General Constraint Structure
Dole ’s wife, Elizabeth , is a native of N.C.
E1 E2 E3
R12 R2
3
other 0.05per 0.85loc 0.10
other 0.05per 0.50loc 0.45
other 0.10per 0.60loc 0.30
irrelevant 0.10spouse_of 0.05born_in 0.85
irrelevant 0.05spouse_of 0.45born_in 0.50
irrelevant 0.05spouse_of 0.45born_in 0.50
other 0.05per 0.85loc 0.10
other 0.10per 0.60loc 0.30
other 0.05per 0.50loc 0.45
irrelevant 0.05spouse_of 0.45born_in 0.50
irrelevant 0.10spouse_of 0.05born_in 0.85
other 0.05per 0.50loc 0.45
Improvement over no inference: 2-5%
Page 13
Random Variables Y:
Conditional Distributions P (learned by models/classifiers) Constraints C– any Boolean function defined on partial assignments (possibly: + weights W )
Goal: Find the “best” assignment The assignment that achieves the highest global
performance. This is an Integer Programming Problem
Problem Setting
y7y4 y5 y6 y8
y1 y2 y3C(y1,y4)C(y2,y3,y6,y7,y8)
Y*=argmaxY PY subject to constraints C(+ WC)Other, more general ways to incorporate
soft constraints here [ACL’07]
observations
Page 14
A General Inference Setting
Essentially all complex models studied today can be viewed as optimizing a linear objective function: HMMs/CRFs [Roth’99; Collins’02;Laffarty et. al 02]
Linear objective functions can be derived from probabilistic perspective: Markov Random Field [standard; Kleinberg&Tardos] Optimization Problem (Metric
Labeling) [Chekuri et. al’01] Linear Programming Problems
Inference as Constrained Optimization [Yih&Roth CoNLL’04] [Punyakanok et. al
COLING’04]… The probabilistic perspective supports finding the most likely assignment
Not necessarily what we want Our Integer linear programming (ILP) formulation
Allows the incorporation of more general cost functions General (non-sequential) constraint structure Better exploitation (computationally) of hard constraints Can find the optimal solution if desired
Page 15
Formal Model
How to solve?
This is an Integer Linear Program
Solve using ILP packages gives an exact solution.
Search techniques are also possible
(Soft) constraints component
Weight Vector for “local” models
Penalty for violatingthe constraint.
How far away is y from a “legal” assignment
Subject to constraints
A collection of Classifiers; Log-linear models (HMM, CRF) or a combination
How to train?
Incorporating constraints in the learning process?
Page 16
Example: Semantic Role Labeling
I left my pearls to my daughter in my will .
[I]A0 left [my pearls]A1 [to my daughter]A2 [in my will]AM-LOC .
A0 Leaver A1 Things left A2 Benefactor AM-LOC Location I left my pearls to my daughter in my will .
Special Case (structure output problem): here, all the data is available at one time; in general, classifiers might be learned from different sources, at different times, at different contexts.
Implications on training paradigms
Overlapping arguments
If A2 is present, A1 must also be
present.
Who did what to whom, when, where, why,…
Page 17
PropBank [Palmer et. al. 05] provides a large human-annotated corpus of semantic verb-argument relations. It adds a layer of generic semantic labels to Penn Tree
Bank II. (Almost) all the labels are on the constituents of the
parse trees.
Core arguments: A0-A5 and AA different semantics for each verb specified in the PropBank Frame files
13 types of adjuncts labeled as AM-arg where arg specifies the adjunct type
Semantic Role Labeling (2/2)
Page 18
Algorithmic Approach
Identify argument candidates Pruning [Xue&Palmer, EMNLP’04] Argument Identifier
Binary classification (SNoW)
Classify argument candidates Argument Classifier
Multi-class classification (SNoW)
Inference Use the estimated probability
distribution given by the argument classifier
Use structural and linguistic constraints
Infer the optimal global output
I left my nice pearls to her
I left my nice pearls to her[ [ [ [ [ ] ] ] ] ]
I left my nice pearls to her[ [ [ [ [ ] ] ] ] ]
I left my nice pearls to her
I left my nice pearls to her
Identify Vocabulary
Inference over (old and new)
Vocabulary
candidate arguments
EASY
Page 19
InferenceI left my nice pearls to her
The output of the argument classifier often violates some constraints, especially when the sentence is long.
Finding the best legitimate output is formalized as an optimization problem and solved via Integer Linear Programming. [Punyakanok et. al 04, Roth & Yih 04;05]
Input: The probability estimation (by the argument classifier)
Structural and linguistic constraints
Allows incorporating expressive (non-sequential) constraints on the variables (the arguments types).
Page 20
Integer Linear Programming Inference
For each argument ai
Set up a Boolean variable: ai,t indicating whether ai is classified as t
Goal is to maximize i score(ai = t ) ai,t
Subject to the (linear) constraints Any Boolean constraints can be encoded as linear
constraint(s).
If score(ai = t ) = P(ai = t ), the objective is to find the assignment that maximizes the expected number of arguments that are correct and satisfies the constraints.
Page 21
No duplicate argument classes
a POTARG x{a = A0} 1
R-ARG
a2 POTARG , a POTARG x{a = A0} x{a2 = R-A0}
C-ARG a2 POTARG ,
(a POTARG) (a is before a2 ) x{a = A0} x{a2 = C-A0}
Many other possible constraints: Unique labels No overlapping or embedding Relations between number of arguments If verb is of type A, no argument of type B
Any Boolean rule can be encoded as a linear constraint.
If there is an R-ARG phrase, there is an ARG Phrase
If there is an C-ARG phrase, there is an ARG before it
Constraints
Joint inference can be used also to combine different SRL Systems.
Universally quantified rules
Page 22
ILP as a Unified Algorithmic Scheme
Consider a common model for sequential inference: HMM/CRF Inference in this model is done via the Viterbi Algorithm.
Viterbi is a special case of the Linear Programming based Inference. Viterbi is a shortest path problem, which is a LP, with a
canonical matrix that is totally unimodular. Therefore, you can get integrality constraints for free.
One can now incorporate non-sequential/expressive/declarative constraints by modifying this canonical matrix
The extension reduces to a polynomial scheme under some conditions (e.g., when constraints are sequential, when the solution space does not change, etc.)
Not necessarily increases complexity and very efficient in practice
[Roth&Yih, ICML’05]
y1 y2 y3 y4 y5y
x x1 x2 x3 x4 x5s
ABC
ABC
ABC
ABC
ABC
t
Learn a rather simple model; make decisions with a more expressive model
So far, shown the use of only (deterministic) constraints. Can be used with statistical constraints.
Page 23
Extracting Relations via Semantic AnalysisScreen shot from a CCG demo http://L2R.cs.uiuc.edu/~cogcomp
Semantic parsing reveals several relations in the sentence along with their arguments.
Top ranked system in CoNLL’05 shared task
Key difference is the Inference
This approach produces a very good semantic parser. F1~90%
Easy and fast: ~7 Sent/Sec (using Xpress-MP)
Page 24
This Talk
Global Inference over Local Models/Classifiers + Expressive Constraints
Model Generality of the framework
Training Paradigms
Global vs. Local training Semi-Supervised Learning
Examples Semantic Parsing Information Extraction Pipeline processes
Page 25
Given: Q: Who acquired Overture? Determine: A: Eyeing the huge market potential, currently led by Google, Yahoo took over search company Overture Services Inc last year.
Textual Entailment
Eyeing the huge market potential, currently led by Google, Yahoo took over search company Overture Services Inc. last year
Yahoo acquired Overture
Is it true that?(Textual Entailment)
Overture is a search company Google is a search company
……….
Google owns Overture
Phrasal verb paraphrasing [Connor&Roth’07]
Entity matching [Li et. al, AAAI’04, NAACL’04]
Semantic Role Labeling
Page 26
Training Paradigms that Support Global Inference
Incorporating general constraints (Algorithmic Approach)
Allow both statistical and expressive declarative constraints
Allow non-sequential constraints (generally difficult)
Coupling vs. Decoupling Training and Inference. Incorporating global constraints is important but Should it be done only at evaluation time or also at
training time? Issues related to:
modularity, efficiency and performance, availability of training data
May not be relevant in some problems.
Page 27
Input: o1 o2 o3 o4 o5 o6 o7 o8 o9 o10
Classifier 1:Classifier 2:
Infer:
Phrase Identification Problem
Use classifiers’ outcomes to identify phrases Final outcome determined by optimizing classifiers outcome
and constrains
[ [ [ []] ] ] ] ]
[ ] ][
Did this classifier make a
mistake? How to train it?
Page 28
Training in the presence of Constraints
General Training Paradigm: First Term: Learning from data Second Term: Guiding the model by constraints Can choose if constraints are included in training
or only in evaluation
Page 29
L+I: Learning plus Inference
IBT: Inference-based Training
Training w/o ConstraintsTesting: Inference with Constraints
x1
x6
x2
x5
x4
x3
x7
y1
y2
y5
y4
y3
f1(x)
f2(x)
f3(x)f4(x)
f5(x)
X
Y
Learning the components together!
Cartoon: each model can be
more complex and may have a view on a set of output
variables.
Page 30
-1 1 111Y’ Local Predictions
Perceptron-based Global Learning
x1
x6
x2
x5
x4
x3
x7
f1(x)
f2(x)
f3(x)f4(x)
f5(x)
X
Y
-1 1 1-1-1YTrue Global Labeling
-1 1 11-1Y’ Apply Constraints:
Which one is better? When and Why?
Page 31
Claims
When the local classification problems are “easy”, L+I outperforms IBT. In many applications, the components are identifiable
and easy to learn (e.g., argument, open-close, PER). Only when the local problems become difficult to solve in
isolation, IBT outperforms L+I, but needs a larger number of training examples. When data is scarce, problems are not easy and
constraints can be used, along with a “weak” model, to label unlabeled data and improve model.
Experimental results and theoretical intuition to support our claims.L+I: cheaper computationally; modular
IBT is better in the limit, and other extreme cases.
Combinations: L+I, and then IBT are possible
Page 32
opt=0.2opt=0.1opt=0
Bound Prediction
Local ≤ opt + ( ( d log m + log 1/ ) / m )1/2
Global ≤ 0 + ( ( cd log m + c2d + log 1/ ) / m )1/2
Bounds Simulated Data
L+I vs. IBT: the more identifiable individual problems are the better overall performance is with L+I
Indication for hardness of
problem
Page 33
Relative Merits: SRL
Difficulty of the learning problem
(# features)
L+I is better.
When the problem is artificially made harder, the tradeoff is clearer.
easyhard
In some cases problems are hard due to lack of training data.
Semi-supervised learning
Page 34
Prediction result of a trained HMM
Lars Ole Andersen . Program analysis and
specialization for the C Programming language
. PhD thesis .DIKU , University of Copenhagen , May1994 .
Information extraction with Background Knowledge (Constraints)
Lars Ole Andersen . Program analysis and specialization for the C Programming language. PhD thesis. DIKU , University of Copenhagen, May 1994 .
[AUTHOR] [TITLE] [EDITOR] [BOOKTITLE] [TECH-REPORT] [INSTITUTION] [DATE]
Violates lots of constraints!
Page 35
Examples of Constraints
Each field must be a consecutive list of words, and can appear at most once in a citation.
State transitions must occur on punctuation marks.
The citation can only start with AUTHOR or EDITOR.
The words pp., pages correspond to PAGE. Four digits starting with 20xx and 19xx are DATE. Quotations can appear only in TITLE …….
Page 36
Information Extraction with Constraints
Adding constraints, we get correct results!
[AUTHOR] Lars Ole Andersen . [TITLE] Program analysis and specialization
for the C Programming language .
[TECH-REPORT] PhD thesis .[INSTITUTION] DIKU , University of Copenhagen , [DATE] May, 1994 .
If incorporated into the training, better results mean Better Feedback!
Page 37
Semi-Supervised Learning with Constraints
Model
Decision Time Constraints
Un-labeled Data
Constraints
In traditional Semi-Supervised learning the model can drift away from the correct one.
Constraints can be used At decision time, to bias the objective function
towards favoring constraint satisfaction. At training to improve labeling of un-labled data (and
thus improve the model)
Page 38
=learn(Tr)
For N iterations do
T=
For each x in unlabeled dataset
y Inference(x, )
T=T {(x, y)}
Any supervised learning algorithm parametrized by
Augmenting the training set (feedback). Any inference algorithm (with constraints).
Inference(x,C, )
Constraint - Driven Learning (CODL) [Chang, Ratinov, Roth, ACL’07]
Page 39
Constraint - Driven Learning (CODL) [Chang, Ratinov, Roth, ACL’07]
=learn(Tr)
For N iterations do
T=
For each x in unlabeled dataset
{y1,…,yK} Top-K-Inference(x,C, )
T=T {(x, yi)}i=1…k
= +(1- )learn(T)Learn from new training data.Weight supervised and unsupervised model(Nigam2000*).
Augmenting the training set (feedback). Any inference algorithm (with constraints).
Any supervised learning algorithm parametrized by
Page 40
Token-based accuracy (inference with constraints)
60
65
70
75
80
85
90
95
100
5 10 15 20 25 300
Labeled set size
Ac
cu
rac
y
supervised Weighted EM CODL
Page 41
Objective function:
Semi-Supervised Learning with Constraints
# of available labeled examples
Learning w ConstraintsConstraints are used to Bootstrap a semi-supervised learner
Poor model + constraints are used to annotate unlabeled data, which in turn is used to keep training the model.
Learning w/o Constraints: 300 examples.
Page 42
Discussed a general paradigm for learning and inference in the context of natural language understanding tasks
A general Constraint Optimization approach for integration of learned models with additional (declarative or statistical) expressivity.
A paradigm for making Machine Learning practical – allow domain/task specific constraints.
How to train? Learn Locally and make use globally (via global
inference) [Punyakanok et. al IJCAI’05]
Ability to make use of domain & constraints to drive supervision
[Klementiev & Roth, ACL’06; Chang, Ratinov, Roth, ACL’07]
Conclusions
LBJ (Learning Based Java): http://L2R.cs.uiuc.edu/~cogcomp
A modeling language that supports programming along with building learned models and allows incorporating and
inference with constraints
Page 43
Random
Variables Y:
Conditional Distributions P (learned by models/classifiers) Constraints C– any Boolean function defined on partial assignments (possibly: + weights W ) Goal: Find the “best” assignment
The assignment that achieves the highest global accuracy.
This is an Integer Programming Problem
Problem Setting
y4 y5 y6 y7 y8
y1 y2 y3C(y1,y4)C(y2,y3,y6,y7,y8)
Y*=argmaxY PY subject to constraints C(+ WC)Other, more general ways to incorporate
soft constraints here [ACL’07]