An Introduction to Conditional Random Field
Ching-Chun Hsiao
Outline
- Problem description
- Why conditional random fields (CRF)
- Introduction to CRF
  - CRF model
  - Inference of CRF
  - Learning of CRF
- Applications
- References
References
- Charles Elkan, "Log-linear Models and Conditional Random Fields," notes for a tutorial at CIKM, 2008.
- Charles Sutton and Andrew McCallum, "An Introduction to Conditional Random Fields for Relational Learning," MIT Press, 2006.
- Andrew Y. Ng and Michael I. Jordan, "On Discriminative vs. Generative Classifiers: A Comparison of Logistic Regression and Naive Bayes," in Advances in Neural Information Processing Systems (NIPS), 2002.
- John Lafferty, Andrew McCallum and Fernando Pereira, "Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data," in Proceedings of the Eighteenth International Conference on Machine Learning, pages 282-289, 2001.
Problem Description
- Given observed data X, we wish to predict labels Y
- Example:
  - X = {Temperature, Humidity, ...}; Xn = observation on day n
  - Y = {Sunny, Rainy, Cloudy}; Yn = weather on day n

[Figure: given today's observations (30°C, 20% humidity, light breeze), is the weather Sunny, Rainy or Cloudy? The observations may depend on one another, and today's weather may depend on the weather of yesterday.]
Generative Model vs. Discriminative Model
- Generative model
  - A model that generates the observed data randomly
  - Models the joint probability p(x,y)
- Discriminative model
  - Directly estimates the posterior probability p(y|x)
  - Aims at modeling the "discrimination" between different outputs

                Single variable              Sequence                       General
  Generative    Naïve Bayes, ...             HMM, ...                       Bayesian network, MRF, ...
  Conditional   Logistic regression, ...     Linear-chain CRF, MEMM, ...    General CRF, ...
Why Conditional Random Fields –1
- Generative models
  - A generative model targets the joint probability p(x,y) and makes the prediction via Bayes' rule to calculate p(y|x)
  - Ex: naive Bayes (single output) and HMM (Hidden Markov Model, sequence output)
- Naive Bayes, where x is a vector of features and we assume that, given y, the features are independent:

  $p(x, y) = p(y) \prod_{k=1}^{K} p(x_k \mid y)$

- HMM (sequence output), with two assumptions: (1) each state depends only on its immediate predecessor, and (2) each observation is conditionally independent given its state:

  $p(x, y) = \prod_{t=1}^{T} p(y_t \mid y_{t-1})\, p(x_t \mid y_t)$
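For concreteness, a minimal sketch of the HMM factorization above on the weather example (the probabilities are invented toy values, not from the slides):

```python
# Joint probability under an HMM:
# p(x, y) = p(y_1) * p(x_1|y_1) * prod_{t>=2} p(y_t|y_{t-1}) * p(x_t|y_t)
init = {"Sunny": 0.6, "Rainy": 0.4}                         # p(y_1)
trans = {("Sunny", "Sunny"): 0.8, ("Sunny", "Rainy"): 0.2,
         ("Rainy", "Sunny"): 0.4, ("Rainy", "Rainy"): 0.6}  # p(y_t | y_{t-1})
emit = {("Sunny", "hot"): 0.7, ("Sunny", "cold"): 0.3,
        ("Rainy", "hot"): 0.2, ("Rainy", "cold"): 0.8}      # p(x_t | y_t)

def hmm_joint(xs, ys):
    p = init[ys[0]] * emit[(ys[0], xs[0])]
    for t in range(1, len(xs)):
        p *= trans[(ys[t - 1], ys[t])] * emit[(ys[t], xs[t])]
    return p

print(hmm_joint(["hot", "hot", "cold"], ["Sunny", "Sunny", "Rainy"]))
# 0.6 * 0.7 * 0.8 * 0.7 * 0.2 * 0.8 = 0.037632
```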
Why Conditional Random Fields –2

[Figure: the weather example as a generative model over four days (Mon. {30°C, 20%, light breeze}, Tue. {28°C, 30%, light breeze}, Wed. {25°C, 40%, moderate breeze}, Thu. {22°C, 60%, moderate breeze}). An arrow A → B means A causes B; humidity, temperature and the wind scale are assumed independent given the weather.]
Why Conditional Random Fields –3
- Difficulties for generative models
  - Not practical to represent multiple interacting features (hard to model p(x)) or long-range dependencies among the observations
  - Very strict independence assumptions on the observations

[Figure: the same Mon.-Thu. weather sequence as before, repeated to illustrate these difficulties.]
Why Conditional Random Fields –4
- Discriminative models
  - Directly model the posterior p(y|x)
  - Aim at modeling the "discrimination" between different outputs
  - Ex: logistic regression (maximum entropy) and CRF
Why Conditional Random Fields –5
- Advantages of discriminative models
  - The training process aims at finding optimal coefficients for the features, regardless of whether the features are correlated
  - Not sensitive to unbalanced training data: especially for classification problems, we do not have to care about p(x)
Why Conditional Random Fields –6
- Logistic regression (maximum entropy)
  - Suppose we have a bin of candies, each with an associated label (A, B, C, or D)
  - Each candy has multiple colors in its wrapper
  - Each candy is assigned a label randomly based on some distribution over wrapper colors
- Observation: the color of the wrapper
- Label: 4 kinds of flavors
  - A: chocolate
  - B: strawberry
  - C: lemon
  - D: milk
Why Conditional Random Fields –7
- For any candy with a red wrapper pulled from the bin: P(A|red) + P(B|red) + P(C|red) + P(D|red) = 1
- An infinite number of distributions fit this constraint
- The distribution that fits the idea of maximum entropy (the most uniform) is:
  P(A|red) = 0.25, P(B|red) = 0.25, P(C|red) = 0.25, P(D|red) = 0.25
Why Conditional Random Fields –8
- Now suppose we add some evidence to our model
  - We note that 80% of all candies with red wrappers are labeled either A or B: P(A|red) + P(B|red) = 0.8
- The updated model that reflects this would be:
  P(A|red) = 0.4, P(B|red) = 0.4, P(C|red) = 0.1, P(D|red) = 0.1
- As we make more observations and find more constraints, the model gets more complex
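A small numeric check of this maximum-entropy reasoning, assuming NumPy and SciPy are available (the snippet is illustrative, not part of the original tutorial):

```python
# Find the maximum-entropy distribution over labels {A, B, C, D}
# subject to P(A) + P(B) = 0.8 and probabilities summing to 1.
import numpy as np
from scipy.optimize import minimize

def neg_entropy(p):
    # negative Shannon entropy; the epsilon guards against log(0)
    return np.sum(p * np.log(p + 1e-12))

constraints = [
    {"type": "eq", "fun": lambda p: p.sum() - 1.0},      # sum to 1
    {"type": "eq", "fun": lambda p: p[0] + p[1] - 0.8},  # P(A) + P(B) = 0.8
]
result = minimize(neg_entropy, np.full(4, 0.25),
                  bounds=[(0.0, 1.0)] * 4, constraints=constraints)
print(result.x)  # approx [0.4, 0.4, 0.1, 0.1], matching the slide
```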
Why Conditional Random Fields –9
- Given a collection of facts, choose a model which is consistent with all the facts, but otherwise as uniform as possible
- This yields a log-linear model over defined feature functions (the evidence), with the weights w obtained by learning:

  $p(y \mid x; w) = \frac{e^{\sum_j w_j F_j(x, y)}}{Z(x, w)}$, where $Z(x, w) = \sum_{y'} e^{\sum_j w_j F_j(x, y')}$ is a normalization term

[Figure: factor graph of logistic regression — a single label node y connected through a factor node to the observation nodes x1, x2, ..., xd. In a factor graph, a factor node between nodes A and B represents f(A, B).]
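A minimal sketch of this log-linear model on the candy example (toy feature functions and hand-picked weights; illustrative only):

```python
import math

def p_y_given_x(y, x, w, feats, labels):
    # p(y|x; w) = exp(sum_j w_j * F_j(x, y)) / Z(x, w)
    def score(yy):
        return math.exp(sum(wj * f(x, yy) for wj, f in zip(w, feats)))
    return score(y) / sum(score(yy) for yy in labels)

# Two toy feature functions over the candy model.
feats = [lambda x, y: 1.0 if (x == "red" and y in "AB") else 0.0,
         lambda x, y: 1.0 if (x == "red" and y in "CD") else 0.0]
w = [math.log(0.4), math.log(0.1)]  # weights chosen to reproduce the slide
for y in "ABCD":
    print(y, p_y_given_x(y, "red", w, feats, labels="ABCD"))
# -> A 0.4, B 0.4, C 0.1, D 0.1
```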
Linear-Chain CRF –1
- If we extend logistic regression to a sequence problem:

  $p(y \mid x; w) = \frac{e^{\sum_j w_j F_j(x, y)}}{Z(x, w)}$, where $Z(x, w) = \sum_{y'} e^{\sum_j w_j F_j(x, y')}$ is a normalization term

  and $F_j(x, y) = \sum_t f_j(y_{t-1}, y_t, x, t)$, a sum along the entire sentence

[Figure: factor graph of the linear-chain CRF — each label node y_{t-1}, y_t, y_{t+1} is connected to the entire observation x = (x1, x2, ..., xd).]
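To make the global feature function concrete, here is a minimal sketch (toy local features and weights; all names are illustrative, not from the slides):

```python
# F_j(x, y) = sum_t f_j(y_{t-1}, y_t, x, t): a local feature summed along the sequence.
START = "<s>"  # dummy tag playing the role of y_0

def F(f_j, x, y):
    tags = [START] + list(y)
    return sum(f_j(tags[t - 1], tags[t], x, t) for t in range(1, len(tags)))

def score(w, feats, x, y):
    # unnormalized log-score: sum_j w_j * F_j(x, y)
    return sum(wj * F(f_j, x, y) for wj, f_j in zip(w, feats))

# Toy local features for POS-like tagging.
feats = [lambda prev, cur, x, t: 1.0 if cur == "N" and x[t - 1].istitle() else 0.0,
         lambda prev, cur, x, t: 1.0 if (prev, cur) == ("ART", "N") else 0.0]
w = [1.5, 0.8]
print(score(w, feats, ["Students", "need", "another", "break"], ["N", "V", "ART", "N"]))
# -> 2.3
```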
Linear-Chain CRF –2

[Figure: two linear-chain structures — labels y1, y2, y3 each connected only to per-position observations x1, x2, x3, versus labels y1, y2, y3 all connected to the entire observation x.]
General CRF
- Divide graph G into many templates ψ_A; the parameters inside each template are tied
- K(A) is the number of feature functions for the template:

  $p(y \mid x) = \frac{1}{Z(x)} \prod_{\psi_A \in G} \exp\!\left( \sum_{k=1}^{K(A)} \lambda_{ak}\, f_{ak}(y_a, x_a) \right)$
Inference of CRF
- Problem description: given the observations {x_i} and the probability model (parameters such as the weights w_j mentioned above), we want to find the best state sequence
- For general graphs, the problem of exact inference in CRFs is intractable
  - Chain- or tree-structured CRFs admit exact inference
  - Otherwise, approximate solutions are needed
Inference of Linear-Chain CRF –1
- The inference of a linear-chain CRF is very similar to that of an HMM
- Example: POS (part-of-speech) tagging — the identification of words as nouns, verbs, adjectives, adverbs, etc.

  Students / need / another / break
  noun / verb / article / noun
Inference of Linear-Chain CRF –2
- We first illustrate the inference of an HMM

[Figure: Viterbi trellis for "students need another break". Each word has candidate tags V, N, P and ART, and each node carries the probability of the best partial tag sequence ending there (e.g. students/N = 0.00725, need/V = 0.00031, another/ART = 7.2×10⁻⁵, break/N = 4.3×10⁻⁶); the best path is noun-verb-article-noun.]
Inference of Linear-Chain CRF –3
- Then back to CRF:

$$
\begin{aligned}
y^* &= \arg\max_y p(y \mid x; w) \\
    &= \arg\max_y \frac{e^{\sum_j w_j F_j(x, y)}}{Z(x, w)} \\
    &= \arg\max_y \sum_j w_j F_j(x, y) \\
    &= \arg\max_y \sum_j w_j \sum_i f_j(y_{i-1}, y_i, x, i) \\
    &= \arg\max_y \sum_i \sum_j w_j f_j(y_{i-1}, y_i, x, i) \\
    &= \arg\max_y \sum_i g_i(y_{i-1}, y_i)
\end{aligned}
$$
Inference of Linear-Chain CRF –4
- Define $g_i(y_{i-1}, y_i) = \sum_j w_j f_j(y_{i-1}, y_i, x, i)$
- Each g_i can be represented as an m×m matrix, where m is the cardinality of the set of tags

[Figure: an m×m matrix g_i with rows indexed by y_{i-1} ∈ {N, V, ART, ...} and columns indexed by y_i ∈ {N, V, ART, ...}.]
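A small sketch of how the g_i matrices could be assembled from the weighted local features (names are illustrative):

```python
# g_i(y_prev, y_cur) = sum_j w_j * f_j(y_prev, y_cur, x, i), stored as a
# dict-of-dicts (an m x m matrix with rows y_prev and columns y_cur).
def build_g(w, feats, x, i, tags, start="<s>"):
    rows = tags if i > 1 else [start]  # position 1 can only follow the start tag
    return {prev: {cur: sum(wj * f(prev, cur, x, i) for wj, f in zip(w, feats))
                   for cur in tags}
            for prev in rows}
```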
Inference of Linear-Chain CRF –5
- The inference of a linear-chain CRF is similar to that of an HMM: both use the Viterbi algorithm
- Let v range over the tags, and define U(k, v) to be the score of the best sequence of tags from position 1 to k, where tag k is required to be v:

$$
\begin{aligned}
U(k, v) &= \max_{y_1, \ldots, y_{k-1}} \left[ \sum_{i=1}^{k-1} g_i(y_{i-1}, y_i) + g_k(y_{k-1}, v) \right] \\
        &= \max_{y_{k-1}} \left[ U(k-1, y_{k-1}) + g_k(y_{k-1}, v) \right]
\end{aligned}
$$
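A minimal sketch of this recurrence, assuming score matrices like those built in the previous sketch (names are illustrative):

```python
# Viterbi decoding over precomputed matrices gs[k][y_prev][y_cur] (k = 0..T-1),
# where gs[0] has a single row for the dummy start tag "<s>".
def viterbi(gs, tags, start="<s>"):
    U = {v: gs[0][start][v] for v in tags}  # U(1, v)
    back = []                               # backpointers per position
    for g in gs[1:]:
        U_new, ptr = {}, {}
        for v in tags:
            # U(k, v) = max over y_{k-1} of [U(k-1, y_{k-1}) + g_k(y_{k-1}, v)]
            best_prev = max(tags, key=lambda u: U[u] + g[u][v])
            U_new[v] = U[best_prev] + g[best_prev][v]
            ptr[v] = best_prev
        back.append(ptr)
        U = U_new
    last = max(tags, key=lambda v: U[v])    # best final tag
    path = [last]
    for ptr in reversed(back):              # follow backpointers to recover the path
        path.append(ptr[path[-1]])
    return path[::-1], U[last]
```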
Learning of CRF
- Problem description: given training pairs {x_i, y_i}, we wish to estimate the parameters of the model ({w_j})
- Method
  - Chain- or tree-structured CRFs can be trained by maximum likelihood; we will focus on the learning of the linear-chain CRF
  - General CRFs are intractable, hence approximate solutions are necessary
Learning of Linear-chain CRF –1
- Conditional maximum likelihood (CML), with x the observations and y the labels:

  $\max_w L(w; y \mid x) = \max_w p(y \mid x; w)$

- Apply CML to the learning of CRF
  - It can be shown that the conditional log-likelihood $\log p(y \mid x; w)$ of the linear-chain CRF is a convex function, so we can apply gradient ascent to the CML problem
  - Setting the partial derivatives to zero:

  $\frac{\partial}{\partial w_j} \log p(y \mid x; w) = F_j(x, y) - \frac{\partial}{\partial w_j} \log Z(x, w) = 0$
Learning of Linear-chain CRF –2

$$
\begin{aligned}
\frac{\partial}{\partial w_j} \log p(y \mid x; w)
  &= F_j(x, y) - \frac{\partial}{\partial w_j} \log Z(x, w) \\
  &= F_j(x, y) - \sum_{y'} F_j(x, y')\, p(y' \mid x; w) \\
  &= F_j(x, y) - E_{y' \sim p(y' \mid x; w)}[F_j(x, y')]
\end{aligned}
$$

where E_p[·] denotes expectation with respect to distribution p. For the entire training set T:

  $\sum_{(x, y) \in T} \left( F_j(x, y) - E_{y' \sim p(y' \mid x; w)}[F_j(x, y')] \right)$

- $F_j(x, y)$: the expectation of the feature $F_j$ with respect to the empirical distribution
- $E_{y' \sim p(y' \mid x; w)}[F_j(x, y')]$: the expectation of the feature $F_j$ with respect to the model distribution
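A brute-force sketch of this gradient for a single training example, enumerating every label sequence y' (feasible only for tiny tag sets; F is the global-feature helper from the earlier sketch, passed in so the block stays self-contained):

```python
import itertools, math

def grad_log_p(w, feats, F, x, y, tags):
    # d/dw_j log p(y|x;w) = F_j(x,y) - E_{y' ~ p(y'|x;w)}[F_j(x,y')]
    seqs = [list(s) for s in itertools.product(tags, repeat=len(x))]
    scores = [math.exp(sum(wj * F(f, x, s) for wj, f in zip(w, feats)))
              for s in seqs]
    Z = sum(scores)  # normalization Z(x, w)
    grad = []
    for f_j in feats:
        model_exp = sum((sc / Z) * F(f_j, x, s) for sc, s in zip(scores, seqs))
        grad.append(F(f_j, x, y) - model_exp)  # empirical minus model expectation
    return grad
```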
Learning of Linear-chain CRF –3
- To yield the best model: the expectation of each feature with respect to the model distribution must equal its expected value under the empirical distribution of the training data
- This is the same as the "maximum entropy model": logistic regression (maximum entropy), extended to sequences, gives the linear-chain CRF
Learning of Linear-chain CRF –4
- Apply stochastic gradient ascent
  - Change the parameter values one example at a time
  - Stochastic: the derivative based on a randomly chosen single example is a random approximation to the true derivative based on all the training data

  $w_j := w_j + \lambda \left( F_j(x, y) - E_{y' \sim p(y' \mid x; w)}[F_j(x, y')] \right)$, where $\lambda$ is a learning rate
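A minimal sketch of that update loop (grad_fn stands for a per-example gradient routine such as the one sketched above; names are illustrative):

```python
import random

def sga(examples, w, grad_fn, lr=0.1, epochs=20):
    # Stochastic gradient ascent: update w using one random example at a time.
    for _ in range(epochs):
        random.shuffle(examples)
        for x, y in examples:
            g = grad_fn(w, x, y)  # approximate gradient from a single example
            w = [wj + lr * gj for wj, gj in zip(w, g)]  # w_j := w_j + lr * g_j
    return w
```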
Application – Stereo Matching (1)
- Ref: "Learning Conditional Random Fields for Stereo," CVPR, 2007

[Figure: a rectified left image (imL) and a rectified right image (imR) of the same object.]

Application – Stereo Matching (2)
- Model the stereo matching problem using a CRF
  - p: pixels in the reference image
  - d_p: the disparity at pixel p
  - c_p: the matching cost at pixel p
  - g_pq: the color gradient between neighboring pixels p and q, (p, q) ∈ N
Application – Image Labeling (1)
- Ref: "Multiscale Conditional Random Fields for Image Labeling," CVPR, 2004
- Image labeling: assign each pixel to one of a finite set of labels

Application – Image Labeling (2)
- Model the image labeling problem using a CRF, where X is the input image, L is the output label field and S is the entire image:

  $P(L \mid X) = \frac{1}{Z}\, P_C(L \mid X, f)\, P_R(L, g)\, P_G(L, g)$

  - $P_C$: a local classifier applied to the input X
  - $P_R$: regional features extracted from L
  - $P_G$: global features extracted from L

Application – Image Labeling (3)

[Figure: image labeling results.]
Application – Gesture Recognition (1)
- Ref: S. Wang, A. Quattoni, L. Morency, D. Demirdjian, and T. Darrell, "Hidden Conditional Random Fields for Gesture Recognition," CVPR, 2006

Application – Gesture Recognition (2)
- s = {s1, s2, ..., sm}, where each si ∈ S captures certain underlying structure of each class, and S is the set of hidden states in the model

Application – Gesture Recognition (3)
- The graph E is a chain where each node corresponds to a hidden state variable at time t
- ω defines the amount of past and future history to be used when predicting the state at time t
- Assume θ = [θe θy θs], a vector that can include any feature of the observation sequence for a specific window size ω
  - θs[sj]: the parameters in θs that correspond to state sj ∈ S
  - θy[y, sj]: parameters that correspond to class y and state sj
  - θe[y, sj, sk]: parameters that correspond to class y and the pair of states sj and sk
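A rough sketch of how these parameter blocks combine into the hidden-CRF potential, loosely following the paper's formulation (phi and all other names are illustrative assumptions):

```python
# Potential(y, s, x; theta) = sum_j phi(x, j) . theta_s[s_j]
#                           + sum_j theta_y[y, s_j]
#                           + sum over chain edges (j-1, j) of theta_e[y, s_{j-1}, s_j]
def potential(y, s, x, phi, theta_s, theta_y, theta_e):
    score = 0.0
    for j in range(len(s)):
        score += sum(a * b for a, b in zip(phi(x, j), theta_s[s[j]]))  # state term
        score += theta_y[(y, s[j])]                                    # class-state term
    for j in range(1, len(s)):                                         # chain edges E
        score += theta_e[(y, s[j - 1], s[j])]                          # class-edge term
    return score
```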
Application – Gesture Recognition (4)
- Thirteen users were asked to perform these six gestures; an average of 90 gestures per class was collected
Summary
- Discriminative models have the advantages of being
  - less sensitive to unbalanced training data
  - able to deal with correlated features
- CRF is a discriminative model and meets the maximum entropy model
Factor Graph
- Represent naïve Bayes using a factor graph:

[Figure: a single label node y connected through factor nodes to the observations x1, x2, x3.]

- Represent an HMM using a factor graph:

[Figure: state nodes y1, y2, y3 with transition factors f(y1), f(y2|y1), f(y3|y2), and each state connected to its observation x1, x2, x3 through an emission factor f(x1|y1), f(x2|y2), f(x3|y3).]