Machine Learning Srihari
Conditional Random Fields
Sargur N. Srihari
Machine Learning Course: http://www.cedar.buffalo.edu/~srihari/CSE574/index.html
Outline
1. Generative and Discriminative Models
2. Classifiers: Naïve Bayes and Logistic Regression
3. Sequential Models: HMM and CRF, Markov Random Field
4. CRF vs HMM performance comparison (NLP: table extraction, POS tagging, shallow parsing, document analysis)
5. CRFs in Computer Vision
6. Summary
7. References
Methods for Classification
• Generative Models (Two-step)
– Infer class-conditional densities p(x|Ck) and priors p(Ck)
– Then use Bayes' theorem to determine posterior probabilities:
$$p(C_k \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid C_k)\, p(C_k)}{p(\mathbf{x})}$$
• Discriminative Models (One-step)
– Directly infer posterior probabilities p(Ck|x)
• Decision Theory
– In both cases use decision theory to assign each new x to a class
Generative Classifier
• Given M variables, x = (x1,..,xM), class variable y and joint distribution p(x,y) we can
– Marginalize: $p(\mathbf{x}) = \sum_{y} p(\mathbf{x}, y)$
– Condition: $p(y \mid \mathbf{x}) = \frac{p(\mathbf{x}, y)}{p(\mathbf{x})}$
• By conditioning the joint pdf we form a classifier
• Huge need for samples
– If the xi are binary, need 2^M values per class to specify p(x,y)
– If M=10 and there are two classes, 2 × 2^10 = 2048 probabilities must be estimated; if 100 samples are needed to estimate each probability, that is on the order of 2048 × 100 samples
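The marginalize-then-condition recipe can be sketched on a small joint table. This is only an illustration: the table is random, and the names `joint`, `p_x` and `p_y_given_x` and the sizes M=3, K=2 are assumptions chosen to keep the table tiny.

```python
import numpy as np

rng = np.random.default_rng(0)
M, K = 3, 2  # assumed sizes: 3 binary features, 2 classes

# joint[x1, x2, x3, y] holds p(x, y); a random table stands in for an estimated one
joint = rng.random((2,) * M + (K,))
joint /= joint.sum()

# Marginalize: p(x) = sum_y p(x, y)
p_x = joint.sum(axis=-1)

# Condition: p(y | x) = p(x, y) / p(x)
p_y_given_x = joint / p_x[..., None]

# Classify one input by picking the most probable class
x = (1, 0, 1)
posterior = p_y_given_x[x]
predicted = int(np.argmax(posterior))

# Number of table entries grows as K * 2**M
n_entries = joint.size
print(posterior, predicted, n_entries)
```

The table has K · 2^M entries, which is the sample-hungry part of the generative approach noted on the slide.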
Classification of ML Methods
• Generative Methods
– "Generative" since sampling can generate synthetic data points
– Popular models
  • Naïve Bayes, Mixtures of multinomials
  • Mixtures of Gaussians, Hidden Markov Models
  • Bayesian networks, Markov random fields
• Discriminative Methods
– Focus on given task, giving better performance
– Popular models
  • Logistic regression, SVMs
  • Traditional neural networks, Nearest neighbor
  • Conditional Random Fields (CRF)
Generative-Discriminative Pairs
• Naïve Bayes and Logistic Regression form a generative-discriminative pair
• Their relationship mirrors that between HMMs and linear-chain CRFs
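Graphical Model Relationship
[Figure: 2×2 diagram. Top row (GENERATIVE): Naïve Bayes Classifier, p(y,x), and Hidden Markov Model, p(Y,X). Bottom row (DISCRIMINATIVE): Logistic Regression, p(y|x), and Conditional Random Field, p(Y|X). Arrows labeled SEQUENCE lead from each single-label model to its sequence counterpart; arrows labeled CONDITION lead from each generative model to its discriminative counterpart.]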
Naïve Bayes Classifier
• Goal is to predict single class variable y given a vector of features x=(x1,..,xM)
• Assume that once class labels are known the features are independent
• Joint probability model has the form
$$p(y, \mathbf{x}) = p(y) \prod_{m=1}^{M} p(x_m \mid y)$$
– Need to estimate only M probabilities per class (for binary features)
• Factor graph obtained by defining factors ψ(y)=p(y), ψm(y,xm)=p(xm|y)
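A minimal sketch of the factorized joint for binary features follows. The toy data, the Laplace smoothing constant `alpha` and the function names are assumptions made for illustration, not part of the lecture.

```python
import numpy as np

def fit_naive_bayes(X, y, n_classes, alpha=1.0):
    """Estimate p(y) and theta[c, m] = p(x_m = 1 | y = c) with Laplace smoothing."""
    X, y = np.asarray(X), np.asarray(y)
    prior = np.array([(y == c).mean() for c in range(n_classes)])
    theta = np.array([
        (X[y == c].sum(axis=0) + alpha) / ((y == c).sum() + 2 * alpha)
        for c in range(n_classes)
    ])
    return prior, theta

def joint(x, prior, theta):
    """p(y, x) = p(y) * prod_m p(x_m | y), returned for every class y."""
    x = np.asarray(x)
    likelihood = np.prod(theta ** x * (1 - theta) ** (1 - x), axis=1)
    return prior * likelihood

# Toy data (assumed): 4 samples, 3 binary features, 2 classes
X = [[1, 1, 0], [1, 0, 0], [0, 1, 1], [0, 0, 1]]
y = [0, 0, 1, 1]
prior, theta = fit_naive_bayes(X, y, n_classes=2)

p_joint = joint([1, 1, 0], prior, theta)
posterior = p_joint / p_joint.sum()  # condition on x to classify
print(posterior)
```

Only M numbers per class are stored in `theta`, in contrast to the 2^M entries of the full joint table.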
Logistic Regression Classifier
• Feature vector x
• Two-class classification: class variable y has values C1 and C2
• A posteriori probability p(C1|x) written as p(C1|x) = f(x) = σ(w^T x) where
$$\sigma(a) = \frac{1}{1 + \exp(-a)}$$
• Known as logistic regression in statistics
– Although a model for classification rather than for regression
Logistic Sigmoid
[Figure: plot of σ(a) against a]
Properties:
A. Symmetry: σ(-a) = 1 - σ(a)
B. Inverse: a = ln(σ/(1-σ)), known as the logit. Also known as log odds since it is the log of the ratio, ln[p(C1|x)/p(C2|x)]
C. Derivative: dσ/da = σ(1-σ)
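The three sigmoid properties can be checked numerically. The sketch below uses an arbitrary grid of activations and a central finite difference; both are illustrative choices.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def logit(s):
    # Inverse of the sigmoid: the log odds
    return np.log(s / (1.0 - s))

a = np.linspace(-5, 5, 11)
s = sigmoid(a)

# A. Symmetry: sigma(-a) = 1 - sigma(a)
symmetry_ok = np.allclose(sigmoid(-a), 1 - s)

# B. Inverse: logit(sigma(a)) = a
inverse_ok = np.allclose(logit(s), a)

# C. Derivative: d sigma / da = sigma (1 - sigma), compared with a central difference
eps = 1e-6
numeric = (sigmoid(a + eps) - sigmoid(a - eps)) / (2 * eps)
derivative_ok = np.allclose(numeric, s * (1 - s), atol=1e-8)

print(symmetry_ok, inverse_ok, derivative_ok)
```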
Relationship between Logistic Regression and Generative Classifier
• Posterior probability of class variable y is
$$p(C_1 \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid C_1)\, p(C_1)}{p(\mathbf{x} \mid C_1)\, p(C_1) + p(\mathbf{x} \mid C_2)\, p(C_2)} = \frac{1}{1 + \exp(-a)} = \sigma(a)$$
$$\text{where} \quad a = \ln \frac{p(\mathbf{x} \mid C_1)\, p(C_1)}{p(\mathbf{x} \mid C_2)\, p(C_2)}$$
• In a generative model we estimate the class-conditionals (which are used to determine a)
• In the discriminative approach we directly estimate a as a linear function of x, i.e., a = w^T x
Logistic Regression Parameters
• With M variables logistic regression has M parameters w=(w1,..,wM)
• By contrast, the generative approach
– by fitting Gaussian class-conditional densities will result in 2M parameters for means, M(M+1)/2 parameters for shared covariance matrix, and one for class prior p(C1)
– which can be reduced to O(M) parameters by assuming independence via Naïve Bayes
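The parameter counts on this slide can be tabulated directly. In this sketch the Naïve Bayes count assumes a shared diagonal covariance (one variance per feature); that is one possible reading of "O(M)", not something the slide specifies.

```python
def logistic_params(M):
    # One weight per input variable
    return M

def gaussian_generative_params(M):
    # 2M means + M(M+1)/2 shared covariance entries + 1 class prior
    return 2 * M + M * (M + 1) // 2 + 1

def naive_bayes_gaussian_params(M):
    # Assumption: shared diagonal covariance, so M variances instead of M(M+1)/2
    return 2 * M + M + 1

for M in (10, 100):
    print(M, logistic_params(M), gaussian_generative_params(M), naive_bayes_gaussian_params(M))
```

The quadratic growth of the full-covariance count is what the independence assumption removes.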
Learning Logistic Parameters
• For data set {xn,tn}, tn ∈ {0,1}, the likelihood is
$$p(\mathbf{t} \mid \mathbf{w}) = \prod_{n=1}^{N} y_n^{t_n} (1 - y_n)^{1 - t_n}$$
where t=(t1,..,tN)^T and yn=p(C1|xn)
• Parameter w specifies the yn as follows: yn=σ(an) and an=w^T xn
• Defining cross-entropy error function
$$E(\mathbf{w}) = -\ln p(\mathbf{t} \mid \mathbf{w}) = -\sum_{n=1}^{N} \{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \}$$
• Taking gradient wrt w
$$\nabla E(\mathbf{w}) = \sum_{n=1}^{N} (y_n - t_n)\, \mathbf{x}_n$$
• Same form as the gradient of the sum-of-squares error for linear regression
– Can use sequential algorithm where samples are presented one at a time using
$$\mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} - \eta \nabla E_n$$
where τ indexes the update step and η is the learning rate
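The cross-entropy error, its gradient and the sequential update can be sketched as follows. The toy data, the learning rate `eta`, the number of epochs and the random visiting order are assumptions for illustration.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cross_entropy(w, X, t):
    # E(w) = -sum_n { t_n ln y_n + (1 - t_n) ln(1 - y_n) }
    y = sigmoid(X @ w)
    return -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))

def gradient(w, X, t):
    # grad E(w) = sum_n (y_n - t_n) x_n
    return X.T @ (sigmoid(X @ w) - t)

def sgd(X, t, eta=0.1, epochs=200, seed=0):
    # Sequential update: w <- w - eta * (y_n - t_n) x_n, one sample at a time
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for n in rng.permutation(len(t)):
            y_n = sigmoid(X[n] @ w)
            w -= eta * (y_n - t[n]) * X[n]
    return w

# Toy data (assumed): first column is a constant bias feature; classes overlap,
# so the maximum-likelihood weights stay finite
X = np.array([[1., 0.], [1., 1.], [1., 2.], [1., 3.], [1., 4.], [1., 5.]])
t = np.array([0., 0., 1., 0., 1., 1.])

w = sgd(X, t)
print(w, cross_entropy(w, X, t))
```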
Multi-class Logistic Regression
• Case of K>2 classes
$$p(C_k \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid C_k)\, p(C_k)}{\sum_{j} p(\mathbf{x} \mid C_j)\, p(C_j)} = \frac{\exp(a_k)}{\sum_{j} \exp(a_j)}$$
• Known as normalized exponential, where ak = ln p(x|Ck)p(Ck)
• Normalized exponential also known as softmax, since if ak >> aj for all j ≠ k then p(Ck|x) ≈ 1 and p(Cj|x) ≈ 0
• In logistic regression we assume activations given by ak = wk^T x
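A short sketch of the normalized exponential follows. Subtracting the maximum activation before exponentiating is a standard numerical-stability step that the slide does not mention; the example activations are arbitrary.

```python
import numpy as np

def softmax(a):
    # exp(a_k) / sum_j exp(a_j), shifted by max(a) to avoid overflow
    a = np.asarray(a, dtype=float)
    z = a - a.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

p = softmax([1.0, 2.0, 3.0])

# When one activation dominates, the output approaches a hard maximum
p_peaked = softmax([0.0, 0.0, 50.0])

print(p, p_peaked)
```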
Learning Parameters for Multiclass
• Determining parameters {wk} by m.l.e. requires derivatives of yk wrt all activations aj
$$\frac{\partial y_k}{\partial a_j} = y_k (I_{kj} - y_j)$$
where Ikj are elements of the identity matrix
• Likelihood function written using 1-of-K coding
– Target vector tn for xn belonging to Ck is a binary vector with all elements zero except for element k which is one
$$p(\mathbf{T} \mid \mathbf{w}_1, .., \mathbf{w}_K) = \prod_{n=1}^{N} \prod_{k=1}^{K} p(C_k \mid \mathbf{x}_n)^{t_{nk}} = \prod_{n=1}^{N} \prod_{k=1}^{K} y_{nk}^{t_{nk}}$$
where T is an N x K matrix of target variables with elements tnk
• Cross entropy error function
$$E(\mathbf{w}_1, .., \mathbf{w}_K) = -\ln p(\mathbf{T} \mid \mathbf{w}_1, .., \mathbf{w}_K) = -\sum_{n=1}^{N} \sum_{k=1}^{K} t_{nk} \ln y_{nk}$$
• Gradient of error function is
$$\nabla_{\mathbf{w}_j} E(\mathbf{w}_1, .., \mathbf{w}_K) = \sum_{n=1}^{N} (y_{nj} - t_{nj})\, \mathbf{x}_n$$
– which allows a sequential weight vector update algorithm
Graphical Model for Logistic Regression
• Multiclass logistic regression can be written as
$$p(y \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \exp\Big\{ \lambda_y + \sum_{j=1}^{K} \lambda_{yj} x_j \Big\} \quad \text{where} \quad Z(\mathbf{x}) = \sum_{y} \exp\Big\{ \lambda_y + \sum_{j=1}^{K} \lambda_{yj} x_j \Big\}$$
• Rather than using one weight per class we can define feature functions that are nonzero only for a single class
$$p(y \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \exp\Big\{ \sum_{k=1}^{K} \lambda_k f_k(y, \mathbf{x}) \Big\}$$
• This notation mirrors the usual notation for CRFs
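The equivalence between per-class weights and class-specific feature functions can be sketched as below. The random weights, the sizes and the helper names (`make_features`, `feature_posterior`) are hypothetical and chosen only for the check.

```python
import numpy as np

def per_class_posterior(x, bias, W):
    # p(y|x) proportional to exp(lambda_y + sum_j lambda_yj x_j)
    a = bias + W @ x
    e = np.exp(a - a.max())
    return e / e.sum()

def make_features(n_classes, n_inputs):
    # One bias feature per class and one feature per (class, input) pair;
    # each feature is nonzero only for a single class.
    feats = []
    for c in range(n_classes):
        feats.append(lambda y, x, c=c: 1.0 if y == c else 0.0)
        for j in range(n_inputs):
            feats.append(lambda y, x, c=c, j=j: x[j] if y == c else 0.0)
    return feats

def feature_posterior(x, lam, feats, n_classes):
    # p(y|x) proportional to exp(sum_k lambda_k f_k(y, x))
    scores = np.array([
        sum(l * f(y, x) for l, f in zip(lam, feats)) for y in range(n_classes)
    ])
    e = np.exp(scores - scores.max())
    return e / e.sum()

rng = np.random.default_rng(1)
n_classes, n_inputs = 3, 4
bias = rng.normal(size=n_classes)
W = rng.normal(size=(n_classes, n_inputs))
x = rng.normal(size=n_inputs)

# Flatten the per-class weights in the same order as the feature list
lam = np.concatenate([np.concatenate(([bias[c]], W[c])) for c in range(n_classes)])
feats = make_features(n_classes, n_inputs)

p1 = per_class_posterior(x, bias, W)
p2 = feature_posterior(x, lam, feats, n_classes)
print(np.allclose(p1, p2))
```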
3. Sequence Models
• Classifiers predict only a single class variable
• Graphical models are best suited to modeling many interdependent variables
• Given sequence of observations $X = \{x_n\}_{n=1}^{N}$
• Underlying sequence of states $Y = \{y_n\}_{n=1}^{N}$
Sequence Models
• Hidden Markov Model (HMM)
[Figure: HMM with state sequence Y above observation sequence X]
• Independence assumptions:
– each state depends only on its immediate predecessor
– each observation variable depends only on the current state
• Limitations:
– Strong independence assumptions among the observations X
– Introduction of a large number of parameters by modeling the joint probability p(y,x), which requires modeling the distribution p(x)
Sequence Models
[Figure: graphical models of the Hidden Markov Model (HMM) and of Conditional Random Fields (CRF), each relating the label sequence Y to the observations X]
A key advantage of CRFs is their great flexibility to include a wide variety of arbitrary, non-independent features of the observations.
Generative Model: HMM
• X is observed data sequence to be labeled, Y is the random variable over the label sequences
• HMM is a distribution that models p(Y, X)
• Joint distribution is
$$p(\mathbf{Y}, \mathbf{X}) = \prod_{n=1}^{N} p(y_n \mid y_{n-1})\, p(\mathbf{x}_n \mid y_n)$$
• Highly structured network indicates conditional independences,
– past states are independent of future states given the current state
– each observation is conditionally independent of everything else given its state
[Figure: chain y1, y2, .., yn, .., yN, with each state yn emitting observation xn]
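The HMM joint can be evaluated directly from its factors. In this sketch the two-state, three-symbol tables are made-up numbers, and p(y1 | y0) is treated as an initial-state distribution `init`, which is an assumption about how the first factor is handled.

```python
import numpy as np

def hmm_log_joint(states, obs, init, trans, emit):
    """ln p(Y, X) = sum_n [ ln p(y_n | y_{n-1}) + ln p(x_n | y_n) ]."""
    logp = np.log(init[states[0]]) + np.log(emit[states[0], obs[0]])
    for n in range(1, len(states)):
        logp += np.log(trans[states[n - 1], states[n]])  # transition factor
        logp += np.log(emit[states[n], obs[n]])          # emission factor
    return logp

# Hypothetical parameters: 2 states, 3 observation symbols
init = np.array([0.6, 0.4])                 # p(y_1)
trans = np.array([[0.7, 0.3],
                  [0.4, 0.6]])              # trans[i, j] = p(y_n = j | y_{n-1} = i)
emit = np.array([[0.5, 0.4, 0.1],
                 [0.1, 0.3, 0.6]])          # emit[i, o] = p(x_n = o | y_n = i)

states, obs = [0, 0, 1], [0, 1, 2]
p = np.exp(hmm_log_joint(states, obs, init, trans, emit))
print(p)
```

Working in log space keeps long sequences from underflowing.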
Discriminative Model for Sequential Data
• CRF models the conditional distribution p(Y|X)
• CRF is a random field globally conditioned on the observation X
• The conditional distribution p(Y|X) that follows from the joint distribution p(Y,X) can be rewritten as a Markov Random Field
[Figure: chain y1, y2, .., yn, .., yN, with every label connected to the whole observation X]
Markov Random Field (MRF)
• Also called undirected graphical model
• Joint distribution of set of variables x is defined by an undirected graph as
$$p(\mathbf{x}) = \frac{1}{Z} \prod_{C} \psi_C(\mathbf{x}_C)$$
where C is a maximal clique (each node connected to every other node), xC is the set of variables in that clique, ψC is a potential function (or local or compatibility function) such that ψC(xC) > 0, typically ψC(xC) = exp{-E(xC)}, and
$$Z = \sum_{\mathbf{x}} \prod_{C} \psi_C(\mathbf{x}_C)$$
is the partition function for normalization
• Model refers to a family of distributions and Field refers to a specific one
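For a very small graph the partition function can be computed by brute force. The sketch below assumes a three-node binary chain with cliques (0,1) and (1,2) and an illustrative energy that penalizes disagreeing neighbors.

```python
import itertools
import numpy as np

def energy(a, b):
    # Assumed energy: 0 if neighboring labels agree, 1 if they differ
    return 0.0 if a == b else 1.0

def psi(a, b):
    # Potential psi_C(x_C) = exp{-E(x_C)}, strictly positive
    return np.exp(-energy(a, b))

cliques = [(0, 1), (1, 2)]  # maximal cliques of a 3-node chain

def unnormalized(x):
    return np.prod([psi(x[i], x[j]) for i, j in cliques])

configs = list(itertools.product([0, 1], repeat=3))

# Partition function: sum over all joint configurations
Z = sum(unnormalized(x) for x in configs)
p = {x: unnormalized(x) / Z for x in configs}

print(Z, p[(0, 0, 0)], p[(0, 1, 0)])
```

Enumeration costs 2^N terms, which is why structured inference is needed for anything larger.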
MRF with Input-Output Variables
• X is a set of input variables that are observed
– Element of X is denoted x
• Y is a set of output variables that we predict
– Element of Y is denoted y
• A are subsets of X ∪ Y
– Elements of A that are in A ∩ X are denoted xA
– Elements of A that are in A ∩ Y are denoted yA
• Then undirected graphical model has the form
$$p(\mathbf{x}, \mathbf{y}) = \frac{1}{Z} \prod_{A} \Psi_A(\mathbf{x}_A, \mathbf{y}_A) \quad \text{where} \quad Z = \sum_{\mathbf{x}, \mathbf{y}} \prod_{A} \Psi_A(\mathbf{x}_A, \mathbf{y}_A)$$
MRF Local Function
• Assume each local function has the form
$$\Psi_A(\mathbf{x}_A, \mathbf{y}_A) = \exp\Big\{ \sum_{m} \theta_{Am} f_{Am}(\mathbf{x}_A, \mathbf{y}_A) \Big\}$$
where θA is a parameter vector, fA are feature functions and m=1,..,M are feature subscripts
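From HMM to CRF
• In an HMM
$$p(\mathbf{Y}, \mathbf{X}) = \prod_{n=1}^{N} p(y_n \mid y_{n-1})\, p(\mathbf{x}_n \mid y_n)$$
• Can be rewritten as
$$p(\mathbf{Y}, \mathbf{X}) = \frac{1}{Z} \exp\Big\{ \sum_{n} \sum_{i,j \in S} \lambda_{ij} \mathbf{1}_{\{y_n = i\}} \mathbf{1}_{\{y_{n-1} = j\}} + \sum_{n} \sum_{i \in S} \sum_{o \in O} \mu_{oi} \mathbf{1}_{\{y_n = i\}} \mathbf{1}_{\{x_n = o\}} \Big\}$$
– Indicator function: 1{x = x'} takes value 1 when x = x' and 0 otherwise
– Parameters of the distribution: θ = {λij, µoi}
• Further rewritten as
$$p(\mathbf{Y}, \mathbf{X}) = \frac{1}{Z} \exp\Big\{ \sum_{m=1}^{M} \lambda_m f_m(y_n, y_{n-1}, \mathbf{x}_n) \Big\}$$
(the sum over sequence positions n is left implicit in this compact form)
– Feature functions have the form fm(yn,yn-1,xn): need one feature for each state transition (i,j), fij(y,y',x) = 1{y=i}1{y'=j}, and one for each state-observation pair, fio(y,y',x) = 1{y=i}1{x=o}
• Which gives us
$$p(\mathbf{Y} \mid \mathbf{X}) = \frac{p(\mathbf{Y}, \mathbf{X})}{\sum_{\mathbf{y}'} p(\mathbf{y}', \mathbf{X})} = \frac{\exp\Big\{ \sum_{m=1}^{M} \lambda_m f_m(y_n, y_{n-1}, \mathbf{x}_n) \Big\}}{\sum_{\mathbf{y}'} \exp\Big\{ \sum_{m=1}^{M} \lambda_m f_m(y'_n, y'_{n-1}, \mathbf{x}_n) \Big\}}$$
• Note that Z cancels out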
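The rewrite of the HMM as an exponential of weighted indicator features can be checked numerically. This sketch assumes the natural choice of weights, log transition and log emission probabilities (which gives Z = 1); the probability tables are made up, and the first factor is again treated as an initial-state distribution.

```python
import numpy as np

# Hypothetical HMM: 2 states, 3 observation symbols
init = np.array([0.6, 0.4])
trans = np.array([[0.7, 0.3],
                  [0.4, 0.6]])       # trans[prev, cur] = p(y_n = cur | y_{n-1} = prev)
emit = np.array([[0.5, 0.4, 0.1],
                 [0.1, 0.3, 0.6]])   # emit[state, symbol] = p(x_n = symbol | y_n = state)

# Log-linear weights: the slide's lambda_ij (y_n = i, y_{n-1} = j) is lam[j, i] here
lam0 = np.log(init)
lam = np.log(trans)
mu = np.log(emit)

def hmm_joint(states, obs):
    p = init[states[0]] * emit[states[0], obs[0]]
    for n in range(1, len(states)):
        p *= trans[states[n - 1], states[n]] * emit[states[n], obs[n]]
    return p

def loglinear_joint(states, obs):
    # Each indicator feature fires once per matching position, so the exponent
    # is simply the sum of the weights selected along the sequence.
    score = lam0[states[0]] + mu[states[0], obs[0]]
    for n in range(1, len(states)):
        score += lam[states[n - 1], states[n]] + mu[states[n], obs[n]]
    return np.exp(score)  # Z = 1 for this choice of weights

print(hmm_joint([0, 1, 1], [0, 2, 2]), loglinear_joint([0, 1, 1], [0, 2, 2]))
```

Letting the weights move away from log probabilities is what requires the normalizer Z and leads to the CRF.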
CRF definition
• A linear chain CRF is a distribution p(Y|X) that takes the form
$$p(\mathbf{Y} \mid \mathbf{X}) = \frac{1}{Z(\mathbf{X})} \exp\Big\{ \sum_{m=1}^{M} \lambda_m f_m(y_n, y_{n-1}, \mathbf{x}_n) \Big\}$$
• Where Z(X) is an instance specific normalization function
$$Z(\mathbf{X}) = \sum_{\mathbf{y}} \exp\Big\{ \sum_{m=1}^{M} \lambda_m f_m(y_n, y_{n-1}, \mathbf{x}_n) \Big\}$$
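A minimal sketch of a linear-chain CRF with indicator features follows. It assumes the score is summed over sequence positions, stores the weights in two tables (`trans_w` for label pairs, `emit_w` for label-observation pairs), and computes ln Z(X) with the forward recursion, checked against brute-force enumeration. The weights are random and the names are hypothetical.

```python
import itertools
import numpy as np

def score(y, x, trans_w, emit_w):
    # Sum of weighted features over all positions of the sequence
    s = emit_w[y[0], x[0]]
    for n in range(1, len(x)):
        s += trans_w[y[n - 1], y[n]] + emit_w[y[n], x[n]]
    return s

def log_partition(x, trans_w, emit_w):
    # Forward recursion in log space: alpha[j] = ln sum over label prefixes ending in j
    alpha = emit_w[:, x[0]]
    for n in range(1, len(x)):
        alpha = np.logaddexp.reduce(alpha[:, None] + trans_w, axis=0) + emit_w[:, x[n]]
    return np.logaddexp.reduce(alpha)

def log_prob(y, x, trans_w, emit_w):
    # ln p(Y | X) = score(Y, X) - ln Z(X)
    return score(y, x, trans_w, emit_w) - log_partition(x, trans_w, emit_w)

rng = np.random.default_rng(0)
S, O = 3, 4                       # assumed numbers of labels and observation symbols
trans_w = rng.normal(size=(S, S))
emit_w = rng.normal(size=(S, O))
x = [0, 3, 1, 2]

# Brute-force ln Z(X): enumerate every label sequence (feasible only for tiny problems)
all_y = list(itertools.product(range(S), repeat=len(x)))
brute = np.logaddexp.reduce(np.array([score(y, x, trans_w, emit_w) for y in all_y]))

print(log_partition(x, trans_w, emit_w), brute)
```

The forward recursion costs O(N·S²) instead of the S^N terms of the explicit sum, which is what makes the instance-specific normalizer tractable.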
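Functional Models
[Figure: 2×2 diagram of the four models, generative in the top row and discriminative in the bottom row, each annotated with its functional form]
• Naïve Bayes Classifier (generative):
$$p(y, \mathbf{x}) = p(y) \prod_{m=1}^{M} p(x_m \mid y)$$
• Logistic Regression (discriminative):
$$p(y \mid \mathbf{x}) = \frac{\exp\Big\{ \sum_{m=1}^{M} \lambda_m f_m(y, \mathbf{x}) \Big\}}{\sum_{y'} \exp\Big\{ \sum_{m=1}^{M} \lambda_m f_m(y', \mathbf{x}) \Big\}}$$
• Hidden Markov Model (generative):
$$p(\mathbf{Y}, \mathbf{X}) = \prod_{n=1}^{N} p(y_n \mid y_{n-1})\, p(\mathbf{x}_n \mid y_n)$$
• Conditional Random Field (discriminative):
$$p(\mathbf{Y} \mid \mathbf{X}) = \frac{\exp\Big\{ \sum_{m=1}^{M} \lambda_m f_m(y_n, y_{n-1}, \mathbf{x}_n) \Big\}}{\sum_{\mathbf{y}'} \exp\Big\{ \sum_{m=1}^{M} \lambda_m f_m(y'_n, y'_{n-1}, \mathbf{x}_n) \Big\}}$$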
4. CRF vs Other Models
• Generative Models
– CRFs relax the assumption of conditional independence of observed data given the labels
– CRFs can contain arbitrary feature functions
  • Each feature function can use the entire input data sequence. The probability of a label at an observed data segment may depend on any past or future data segments.
• Other Discriminative Models
– CRFs avoid the limitation of other discriminative Markov models, which are biased towards states with few successor states (the label bias problem).
– Single exponential model for joint probability of entire sequence of labels given observed sequence.
– Each factor depends only on previous label, and not future labels. P(y | x) = product of factors, one for each label.
NLP: Part Of Speech Tagging
For a sequence of words w = {w1,w2,..wn} find syntactic labels s for each word:
w = The quick brown fox jumped over the lazy dog
s = DET VERB ADJ NOUN-S VERB-P PREP DET ADJ NOUN-S
Baseline is already 90%
• Tag every word with its most frequent tag
• Tag unknown words as nouns
Per-word error rates for POS tagging on the Penn treebank:
Model   Error
HMM     5.69%
CRF     5.55%
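The two baseline rules can be sketched in a few lines. The tiny training set, the tag names and the function names below are made up for illustration; they are not the Penn treebank setup behind the reported numbers.

```python
from collections import Counter, defaultdict

def train_baseline(tagged_sentences):
    """Map each word to the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}

def tag_words(words, model, unknown_tag="NOUN"):
    """Tag known words with their most frequent tag, unknown words as nouns."""
    return [model.get(w.lower(), unknown_tag) for w in words]

# Hypothetical training data
train = [
    [("the", "DET"), ("dog", "NOUN"), ("runs", "VERB")],
    [("the", "DET"), ("run", "NOUN")],
    [("dogs", "NOUN"), ("run", "VERB")],
    [("a", "DET"), ("run", "NOUN")],
]
model = train_baseline(train)

print(tag_words(["The", "fox", "runs"], model))
```

Because each word is tagged independently of its neighbors, this baseline ignores exactly the sequential context that HMMs and CRFs model.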
Table Extraction
To label lines of a text document: whether each line is part of a table and its role in the table.
Finding tables and extracting information is a necessary component of data mining, question-answering and IR tasks.
HMM     CRF
89.7%   99.9%
Shallow Parsing
• Precursor to full parsing or information extraction
– Identifies non-recursive cores of various phrase types in text
• Input: words in a sentence annotated automatically with POS tags
• Task: label each word with a label indicating
– word is outside a chunk (O), starts a chunk (B), continues a chunk (I)
[Figure: example sentence with NP chunks marked]
CRFs beat all reported single-model NP chunking results on standard evaluation dataset
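Given per-word O/B/I labels, the chunks can be read off with one pass over the sentence. The sentence and labels below are an illustrative example, and treating a stray I with no open chunk as outside is an assumption of this sketch.

```python
def extract_chunks(words, tags):
    """Group words into chunks from B (start), I (continue) and O (outside) labels."""
    chunks, current = [], []
    for word, tag in zip(words, tags):
        if tag == "B":
            if current:
                chunks.append(" ".join(current))
            current = [word]
        elif tag == "I" and current:
            current.append(word)
        else:
            # "O", or an "I" with no open chunk (treated as outside here)
            if current:
                chunks.append(" ".join(current))
            current = []
    if current:
        chunks.append(" ".join(current))
    return chunks

words = "He reckons the current account deficit will narrow".split()
tags = ["B", "O", "B", "I", "I", "I", "O", "O"]
print(extract_chunks(words, tags))
```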
Handwritten Word Recognition
Given word image and lexicon, find most probable lexical entry
Algorithm Outline
• Oversegment image; segment combinations are potential characters
• Given y = a word in lexicon, s = grouping of segments, x = input word image features
• Find word in lexicon and segment grouping that maximizes P(y,s | x)
CRF Model
$$P(\mathbf{y} \mid \mathbf{x}, \theta) = \frac{e^{\psi(\mathbf{y}, \mathbf{x}; \theta)}}{\sum_{\mathbf{y}'} e^{\psi(\mathbf{y}', \mathbf{x}; \theta)}}$$
$$\psi(\mathbf{y}, \mathbf{x}; \theta) = \sum_{j=1}^{m} \Big( A(j, y_j, \mathbf{x}; \theta^s) + \sum_{(j,k) \in E} I(j, k, y_j, y_k, \mathbf{x}; \theta^t) \Big)$$
where yi ∈ {a-z, A-Z, 0-9}, θ: model parameters
Association Potential (state term):
$$A(j, y_j, \mathbf{x}; \theta^s) = \sum_{i} \big( f_i^s(j, y_j, \mathbf{x}) \cdot \theta_{ij}^s \big)$$
Interaction Potential:
$$I(j, k, y_j, y_k, \mathbf{x}; \theta^t) = \sum_{i} \big( f_i^t(j, k, y_j, y_k, \mathbf{x}) \cdot \theta_{ijk}^t \big)$$
[Figure: precision (0.8 to 1) against word recognition rank (0 to 120) for the CRF and Segment-DP (SDP) methods]
Document Zoning (labeling regions) error rates
                       CRF      Neural Network   Naive Bayes
Machine Printed Text   1.64%    2.35%            11.54%
Handwritten Text       5.19%    20.90%           25.04%
Noise                  10.20%   15.00%           12.23%
Total                  4.25%    7.04%            12.58%
5. CRFs in Computer Vision
Applications in Computer Vision
• Image Modeling
– Labeling regions in image (image segmentation)
– Object detection and recognition (important areas)
– Sign detection in natural images
– Natural scene categorization
– Gesture Recognition
• Image retrieval
• Object/motion segmentation in video scene
[Figure: a small patch is water in one context and sky in another]
(a) Segmentation and labeling of input image into meaningful regions.
(b) Detection of structured textures such as buildings.
(c) Restoration of images corrupted by noise.
Various tasks in computer vision require explicit consideration of spatial dependencies.
Image Modeling
Image Labeling: classifying every image pixel/patch into a finite set of classes
Object Recognition: requires semantic image content understanding based on labeling
CRF models:
• Directly predict the segmentation/labeling given the observed image
• Incorporate arbitrary functions of the observed features into the training process
CRFs for Image Modeling
• Discriminative Random Fields (DRF)
– S. Kumar and M. Hebert. Discriminative fields for modeling spatial dependencies in natural images. 2003.
• Multiscale Conditional Random Fields (MCRF)
– X. He, R. S. Zemel, and M. Á. Carreira-Perpiñán. Multiscale conditional random fields for image labeling. 2004.
• Hierarchical Conditional Random Field (HCRF)
– S. Kumar and M. Hebert. A hierarchical field framework for unified context-based classification. 2005.
– Jordan Reynolds and Kevin Murphy. Figure-ground segmentation using a hierarchical conditional random field. 2007.
• Tree Structured Conditional Random Fields (TCRF)
– P. Awasthi, A. Gagrani, and B. Ravindran. Image Modeling using Tree Structured Conditional Random Fields. 2007.
Discriminative Random Fields (DRF)
CRF: normalized product of potential functions
• Association potential (state feature function)
• Interaction potential (transition feature function)
In the DRF framework
• the association potential takes into account the neighborhood interaction of the data
• the interaction potential enforces adaptive data-dependent smoothing over the label field
The proposed DRF model was applied to the task of detecting man-made structures in natural scenes.
Structure detection by DRF (label each image site as structured or nonstructured)
For similar detection rates, DRF reduces the false positives considerably.
Though the model outperforms traditional MRFs, it is not strong enough to capture long-range correlations among the labels, because its rigid lattice-based structure allows only pairwise interactions.
Multiscale Conditional Random Fields (MCRF)
Combination of three different classifiers operating at local, regional and global contexts respectively.
Two main drawbacks:
• Including additional classifiers operating at different scales in the MCRF framework introduces a large number of model parameters.
• The model assumes conditional independence of hidden variables given the label field.
Hierarchical CRF (HCRF)
Scene context is important in many domains for achieving good classification even when the local appearance is impoverished. Types of context:
• region-region interaction
• object-region interaction
• object-object interaction
HCRF: incorporates the local as well as the global context of any of the three types in a single model
Hierarchical CRF (HCRF)
• Layer 1, short-range: pixelwise label
• Layer 2, long-range: contextual objects or regions
Exploits different levels of contextual information in images for robust classification.
Each layer is modeled as a conditional field that allows one to capture arbitrary observation-dependent label interactions.
Hierarchical CRF (HCRF)
[Figures: examples of region-region interaction, object-region interaction and object-object interaction]
Tree-Structured CRF (TCRF)
Combines advantages of hierarchical models and discriminative approaches in a single framework.
Two ways to model the association between the labels of a 2×2 neighborhood of pixels:
a. introduce a weight vector for every edge, which represents the compatibility between the labels of the pair of nodes
b. introduce a hidden variable H which is connected to all 4 nodes. For every value that H takes, it induces a probability distribution over the labels of the nodes connected to it
Tree-Structured CRF (TCRF)
• Dividing the whole image into regions of size m × m and introducing a hidden variable for each of them gives a layer of hidden variables over the given label field
• Each label node is associated with a hidden variable in the layer above
[Figure: tree-structured CRF with m=2]
Similar to CRFs, the conditional probability p(y,H|x) of a TCRF factors into a product of potential functions
Tree-Structured CRF (TCRF)
[Figures: object detection; image labeling]
• Long-range correlations among non-neighboring pixels can be easily modeled as associations among the hidden variables in the layer above.
• The tree structure allows inference to be carried out in time linear in the number of pixels.
Sign Detection in Natural Images
J. Weinman, A. Hanson, and A. McCallum. Sign detection in natural images with conditional random fields. 2004.
Calculates a joint labeling of image patches, rather than labeling patches independently, and uses a CRF to learn the characteristics of regions that contain text.
[Figure panels: MaxEnt; adding CRF]
Human Gesture Recognition
S. Wang, A. Quattoni, L.-P. Morency, D. Demirdjian, and T. Darrell. Hidden Conditional Random Fields for Gesture Recognition. Computer Science and Artificial Intelligence Laboratory, MIT. 2006.
The gesture classes are: FB - Flip Back, SV - Shrink Vertically, EV - Expand Vertically, DB - Double Back, PB - Point and Back, EH - Expand Horizontally
Human Gesture Recognition: Hidden Conditional Random Fields
• s = {s1, s2, ..., sm} is the set of hidden states in the model; it captures certain underlying structure of each class.
• The potential function measures the compatibility between a label, a set of observations and a configuration of the hidden states.
Hidden states distribution
Natural Scene Categorization
Y. Wang and S. Gong. Conditional Random Field for Natural Scene Categorization. 2007.
Classification Oriented Conditional Random Field (COCRF)
Natural Scene Categorization
COCRF discovers the spatial layout distribution of local patches and their pairwise interaction for a category.
[Figure: pairwise interaction potential between topics for four categories]
6. Summary
• Generative and discriminative methods are two broad approaches: the former model the joint distribution, the latter directly solve classification
• For classification
– Naïve Bayes and Logistic Regression form a generative-discriminative pair
• For sequential data
– HMM and CRF are the corresponding pair
• CRF performs better in language-related tasks
• Generative models are more elegant and have explanatory power
7. References
1. C. Bishop, Pattern Recognition and Machine Learning, Springer, 2006
2. C. Sutton and A. McCallum, An Introduction to Conditional Random Fields for Relational Learning
3. S. Shetty, H. Srinivasan and S. N. Srihari, Handwritten Word Recognition using CRFs, ICDAR 2007
4. S. Shetty, H. Srinivasan and S. N. Srihari, Segmentation and Labeling of Documents using CRFs, SPIE-DRR 2007
5. X. He, R. Zemel and M. A. Carreira-Perpinan, Multiscale Conditional Random Fields for Image Labeling
Other Topics in Sequential Data
• Sequential Data and Markov Models: http://www.cedar.buffalo.edu/~srihari/CSE574/Chap11/Ch11.1-MarkovModels.pdf
• Hidden Markov Models: http://www.cedar.buffalo.edu/~srihari/CSE574/Chap11/Ch11.2-HiddenMarkovModels.pdf
• Extensions of HMMs: http://www.cedar.buffalo.edu/~srihari/CSE574/Chap11/Ch11.3-HMMExtensions.pdf
• Linear Dynamical Systems: http://www.cedar.buffalo.edu/~srihari/CSE574/Chap11/Ch11.4-LinearDynamicalSystems.pdf