An Introduction to Conditional Random Field

Page 1: An Introduction to Conditional Random Field


An Introduction to Conditional Random Field

Ching-Chun Hsiao

Page 2: An Introduction to Conditional Random Field

Outline
- Problem description
- Why conditional random fields (CRF)
- Introduction to CRF
  - CRF model
  - Inference of CRF
  - Learning of CRF
- Applications
- References

Page 3: An Introduction to Conditional Random Field

References

- Charles Elkan, "Log-linear Models and Conditional Random Fields," notes for a tutorial at CIKM, 2008.
- Charles Sutton and Andrew McCallum, "An Introduction to Conditional Random Fields for Relational Learning," MIT Press, 2006.
- Andrew Y. Ng and Michael I. Jordan, "On Discriminative vs. Generative Classifiers: A Comparison of Logistic Regression and Naive Bayes," in Advances in Neural Information Processing Systems (NIPS), 2002.
- John Lafferty, Andrew McCallum, and Fernando Pereira, "Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data," in Proceedings of the Eighteenth International Conference on Machine Learning (ICML), pages 282–289, 2001.

Page 4: An Introduction to Conditional Random Field

Outline
- Problem description
- Why conditional random fields (CRF)
- Introduction to CRF
  - CRF model
  - Inference of CRF
  - Learning of CRF
- Applications
- References

Page 5: An Introduction to Conditional Random Field

Problem Description

- Given observed data X, we wish to predict the labels Y
- Example:
  - X = {Temperature, Humidity, ...}; Xn = the observation on day n
  - Y = {Sunny, Rainy, Cloudy}; Yn = the weather on day n

[Figure: a day observed as {30°C, 20% humidity, light breeze}: Sunny? Rainy? Cloudy? The observed features may depend on one another, and the weather may depend on the weather of yesterday.]

Page 6: An Introduction to Conditional Random Field

Outline
- Problem description
- Why conditional random fields (CRF)
- Introduction to CRF
  - CRF model
  - Inference of CRF
  - Learning of CRF
- Applications
- References

Page 7: An Introduction to Conditional Random Field

Generative Model vs. Discriminative Model

- Generative model
  - A model that can generate the observed data randomly
  - Models the joint probability p(x, y)
- Discriminative model
  - Directly estimates the posterior probability p(y|x)
  - Aims at modeling the "discrimination" between different outputs

              Single variable            Sequence                       General
Generative    Naive Bayes, ...           HMM, ...                       Bayesian network, MRF, ...
Conditional   Logistic regression, ...   Linear-chain CRF, MEMM, ...    General CRF, ...

Page 8: An Introduction to Conditional Random Field

Why Conditional Random Fields –1

Generative models target the joint probability p(x, y) and make the prediction by applying Bayes' rule to calculate p(y|x); examples are naive Bayes (single output) and the HMM (hidden Markov model; sequence output).

Naive Bayes, where x is a vector of K features assumed independent given y:

p(x, y) = p(y) \prod_{k=1}^{K} p(x_k | y)

HMM (sequence output), under the assumptions that (1) each state depends only on its immediate predecessor and (2) each observation is conditionally independent of the others given its state:

p(x, y) = \prod_{t=1}^{T} p(y_t | y_{t-1}) p(x_t | y_t)
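As a concrete illustration (a minimal Python sketch added here, not in the original slides; the probability tables are hypothetical toy parameters), both factorizations can be evaluated directly:

```python
# Sketch of the two generative factorizations above.

def naive_bayes_joint(prior, cond, y, x):
    """p(x, y) = p(y) * prod_k p(x_k | y), with prior[y] = p(y)
    and cond[k][(y, x_k)] = p(x_k | y)."""
    p = prior[y]
    for k, x_k in enumerate(x):
        p *= cond[k][(y, x_k)]
    return p

def hmm_joint(start, trans, emit, ys, xs):
    """p(x, y) = prod_t p(y_t | y_{t-1}) * p(x_t | y_t), with
    start[y] = p(y_1), trans[(u, v)] = p(y_t = v | y_{t-1} = u),
    and emit[(y, x)] = p(x_t = x | y_t = y)."""
    p = start[ys[0]] * emit[(ys[0], xs[0])]
    for t in range(1, len(ys)):
        p *= trans[(ys[t - 1], ys[t])] * emit[(ys[t], xs[t])]
    return p
```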

Page 9: An Introduction to Conditional Random Field

Why Conditional Random Fields –2

[Figure: the weather example drawn as a generative model, where an arrow A → B means "A causes B". Each day's weather (Mon. through Thu.) causes that day's observations: {30°C, 20%, light breeze} on Mon., {28°C, 30%, light breeze} on Tue., {25°C, 40%, moderate breeze} on Wed., and {22°C, 60%, moderate breeze} on Thu.; humidity, temperature, and the wind scale are assumed independent given the weather.]

Page 10: An Introduction to Conditional Random Field

Why Conditional Random Fields –3

Difficulties for generative models:
- It is not practical to represent multiple interacting features (p(x) is hard to model) or long-range dependencies of the observations
- They place very strict independence assumptions on the observations

[Figure: the Mon. through Thu. weather sequence with its observations, repeated from the previous slide.]

Page 11: An Introduction to Conditional Random Field

Why Conditional Random Fields –4

Discriminative models:
- Directly model the posterior p(y|x)
- Aim at modeling the "discrimination" between different outputs
- e.g., logistic regression (maximum entropy) and CRF

Page 12: An Introduction to Conditional Random Field

Why Conditional Random Fields –5

Advantages of discriminative models:
- The training process aims at finding the optimal coefficients for the features, regardless of whether the features are correlated
- Not sensitive to unbalanced training data
- Especially for classification problems, we do not have to care about p(x)

Page 13: An Introduction to Conditional Random Field

Why Conditional Random Fields –6

Logistic regression (maximum entropy):
- Suppose we have a bin of candies, each with an associated label (A, B, C, or D)
- Each candy has multiple colors in its wrapper
- Each candy is assigned a label randomly based on some distribution over wrapper colors

Observation: the color of the wrapper
Label: 4 kinds of flavors (A: chocolate, B: strawberry, C: lemon, D: milk)

Page 14: An Introduction to Conditional Random Field

Why Conditional Random Fields –7

For any candy with a red wrapper pulled from the bin:

P(A|red) + P(B|red) + P(C|red) + P(D|red) = 1

An infinite number of distributions fit this constraint. The distribution that fits the idea of maximum entropy (the most uniform) is:

P(A|red) = P(B|red) = P(C|red) = P(D|red) = 0.25

Page 15: An Introduction to Conditional Random Field

Why Conditional Random Fields –8

Now suppose we add some evidence to our model. We note that 80% of all candies with red wrappers are labeled either A or B:

P(A|red) + P(B|red) = 0.8

The updated model that reflects this is:

P(A|red) = 0.4, P(B|red) = 0.4, P(C|red) = 0.1, P(D|red) = 0.1

As we make more observations and find more constraints, the model gets more complex.
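To check that these are the maximum-entropy values (a worked step added here, not in the original slides), maximize the entropy subject to the two constraints:

\max_p H(p) = -\sum_{y \in \{A,B,C,D\}} p(y | red) \log_2 p(y | red)
\quad \text{s.t.} \quad p(A | red) + p(B | red) = 0.8, \qquad \sum_y p(y | red) = 1

The stationarity conditions of the Lagrangian force the probabilities inside each constrained group to be equal, giving P(A|red) = P(B|red) = 0.8/2 = 0.4 and P(C|red) = P(D|red) = 0.2/2 = 0.1, with entropy H ≈ 1.72 bits.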

Page 16: An Introduction to Conditional Random Field

Why Conditional Random Fields –9

Given a collection of facts, choose a model which is consistent with all the facts, but otherwise as uniform as possible. This is the log-linear model:

p(y | x; w) = \exp( \sum_j w_j F_j(x, y) ) / Z(x, w),

where Z(x, w) = \sum_y \exp( \sum_j w_j F_j(x, y) ) is a normalization term, the weights w_j are obtained by learning, and the F_j are defined feature functions over the evidence.

Factor graph: a factor node f(A, B) sits between the nodes A and B it connects.

[Figure: the factor graph of this model, with the output node y connected by a factor to each input node x1, x2, ..., xd.]
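As a concrete illustration (a minimal Python sketch added here, not in the slides; the indicator features and hand-picked weights are hypothetical), the model can be evaluated by enumerating the labels:

```python
import math

def loglinear_prob(w, features, x, y, labels):
    """p(y | x; w) = exp(sum_j w_j F_j(x, y)) / Z(x, w)."""
    def score(y_):
        return sum(w_j * F_j(x, y_) for w_j, F_j in zip(w, features))
    z = sum(math.exp(score(y_)) for y_ in labels)   # Z(x, w)
    return math.exp(score(y)) / z

# Two indicator features for the candy example of the previous slides.
features = [lambda x, y: 1.0 if x == "red" and y in ("A", "B") else 0.0,
            lambda x, y: 1.0 if x == "red" else 0.0]
print(loglinear_prob([1.4, 0.0], features, "red", "A", "ABCD"))  # ~0.4
```

With the weight 1.4 on the first feature, p(A | red) comes out near the maximum-entropy value 0.4 found above.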

Page 17: An Introduction to Conditional Random Field

Outline
- Problem description
- Why conditional random fields (CRF)
- Introduction to CRF
  - CRF model
  - Inference of CRF
  - Learning of CRF
- Applications
- References

Page 18: An Introduction to Conditional Random Field

Linear-Chain CRF –1

If we extend logistic regression to a sequence problem (y = y_1, ..., y_T):

p(y | x; w) = \exp( \sum_j w_j F_j(x, y) ) / Z(x, w),

where Z(x, w) = \sum_y \exp( \sum_j w_j F_j(x, y) ) is a normalization term, and

F_j(x, y) = \sum_t f_j(y_{t-1}, y_t, x, t),

a sum along the entire sentence x.

[Figure: logistic regression has a single output y over the inputs x1, x2, ..., xd; the linear-chain CRF repeats this structure at each position, with linked outputs y_{t-1}, y_t, y_{t+1} each over x1, x2, ..., xd.]
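A minimal sketch (added here; padding the left boundary with a START symbol for y_0 is an assumed convention) of how the global feature F_j sums a local feature f_j along the sequence:

```python
def global_feature(f_j, x, y, start="START"):
    """F_j(x, y) = sum_t f_j(y_{t-1}, y_t, x, t), taking y_0 = START."""
    total, y_prev = 0.0, start
    for t, y_t in enumerate(y):
        total += f_j(y_prev, y_t, x, t)
        y_prev = y_t
    return total
```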

Page 19: An Introduction to Conditional Random Field

Linear-Chain CRF –2

[Figure: two common drawings of the linear-chain CRF: one where each label y1, y2, y3 connects to its own observation x1, x2, x3, and one where the labels y1, y2, y3 all connect to the entire observation x.]

Page 20: An Introduction to Conditional Random Field

General CRF

Divide the graph G into many templates \psi_A. The parameters inside each template are tied, and K(A) is the number of feature functions for the template:

p(y | x) = (1 / Z(x)) \prod_{\psi_A \in G} \exp( \sum_{k=1}^{K(A)} \lambda_{Ak} f_{Ak}(x_A, y_A) )
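A sketch of the corresponding unnormalized score (added here with a hypothetical interface, not from the slides): each template contributes one exponential factor per clique it matches, with its weights shared across all of those cliques:

```python
import math

def general_crf_unnormalized(templates, x, y):
    """templates: list of (lams, fs, cliques) triples, where the tied
    weights lams and the K(A) feature functions fs are applied to every
    clique a in cliques; divide by Z(x) to get p(y | x)."""
    total = 0.0
    for lams, fs, cliques in templates:
        for a in cliques:                    # each instance of template psi_A
            for lam, f in zip(lams, fs):
                total += lam * f(x, y, a)
    return math.exp(total)
```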

Page 21: An Introduction to Conditional Random Field

Inference of CRF

Problem description: given the observations {xi} and the probability model (parameters such as the w_j mentioned above), we want to find the best state sequence.

- For general graphs, the problem of exact inference in CRFs is intractable
- Chain- or tree-like CRFs admit exact inference
- Otherwise, approximate solutions are used

Page 22: An Introduction to Conditional Random Field

Inference of Linear-Chain CRF –1

The inference of a linear-chain CRF is very similar to that of an HMM.

Example: POS (part-of-speech) tagging, the identification of words as nouns, verbs, adjectives, adverbs, etc.:

  Students  need  another  break
  noun      verb  article  noun

Page 23: An Introduction to Conditional Random Field

Inference of Linear-Chain CRF –2

We first illustrate the inference of the HMM.

[Figure: a Viterbi trellis for "students need another break", starting from a start state o/s. Each column holds the candidate tags {V, N, P, ART} for one word, and each cell carries the probability of the best path ending there; the winning cells are students/N = 0.00725, need/V = 0.00031, another/ART = 7.2x10^-5, and break/N = 4.3x10^-6, while competing cells such as students/V = 7.6x10^-6 are much smaller or exactly 0.]

Page 24: An Introduction to Conditional Random Field

Inference of Linear-Chain CRF –3

Then back to the CRF:

y* = \argmax_y p(y | x; w)
   = \argmax_y \exp( \sum_j w_j F_j(x, y) ) / Z(x, w)
   = \argmax_y \sum_j w_j F_j(x, y)                      (Z(x, w) does not depend on y)
   = \argmax_y \sum_j w_j \sum_i f_j(y_{i-1}, y_i, x, i)
   = \argmax_y \sum_i \sum_j w_j f_j(y_{i-1}, y_i, x, i)
   = \argmax_y \sum_i g_i(y_{i-1}, y_i),

where g_i(y_{i-1}, y_i) = \sum_j w_j f_j(y_{i-1}, y_i, x, i).

Page 25: An Introduction to Conditional Random Field

Inference of Linear-Chain CRF –4

g_i can be represented as an m x m matrix, where m is the cardinality of the set of tags:

g_i(y_{i-1}, y_i) = \sum_j w_j f_j(y_{i-1}, y_i, x, i)

[Figure: an example 3 x 3 matrix over the tag set {V, ART, N}, with rows indexed by y_{i-1} and columns by y_i.]
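A small sketch (added here, with an assumed interface) of building this matrix from the weights and the local feature functions:

```python
def build_g(w, local_features, x, i, tags):
    """g_i[u][v] = sum_j w_j * f_j(u, v, x, i) for tags u (= y_{i-1}) and v (= y_i)."""
    return [[sum(w_j * f_j(u, v, x, i)
                 for w_j, f_j in zip(w, local_features))
             for v in tags] for u in tags]
```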

Page 26: An Introduction to Conditional Random Field

Inference of Linear-Chain CRF –5

The inference of the linear-chain CRF is similar to that of the HMM: it uses the Viterbi algorithm. Let v range over the tags, and let U(k, v) be the score of the best sequence of tags from position 1 to k, where tag k is required to be v:

U(k, v) = \max_{y_1, ..., y_{k-1}} [ \sum_{i=1}^{k-1} g_i(y_{i-1}, y_i) + g_k(y_{k-1}, v) ]
        = \max_{y_{k-1}} [ U(k-1, y_{k-1}) + g_k(y_{k-1}, v) ]
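A runnable sketch of this recurrence (added here; treating the scores of the first position as a separate start_scores vector is an assumed convention, since the slides fold the boundary into g_1):

```python
def viterbi(g, start_scores, m):
    """Find argmax_y [start_scores[y_1] + sum_k g[k-1][y_{k-1}][y_k]].
    g: list of n-1 matrices of size m x m; start_scores: m scores for
    position 1. Returns (best score, best tag sequence as tag indices)."""
    n = len(g) + 1                               # number of positions
    U = [[0.0] * m for _ in range(n)]            # U[k][v] as defined above
    back = [[0] * m for _ in range(n)]           # backpointers
    U[0] = list(start_scores)
    for k in range(1, n):
        for v in range(m):
            scores = [U[k - 1][u] + g[k - 1][u][v] for u in range(m)]
            back[k][v] = max(range(m), key=scores.__getitem__)
            U[k][v] = scores[back[k][v]]
    best = max(range(m), key=U[n - 1].__getitem__)
    tags = [best]
    for k in range(n - 1, 0, -1):                # follow the backpointers
        tags.append(back[k][tags[-1]])
    tags.reverse()
    return U[n - 1][best], tags
```

Because each step only maximizes over the previous tag, the run time is O(n m^2) instead of the O(m^n) of brute-force enumeration.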

Page 27: An Introduction to Conditional Random Field

Learning of CRF

Problem description: given training pairs {xi, yi}, we wish to estimate the parameters {w_j} of the model.

Method:
- Chain- or tree-structured CRFs can be trained by maximum likelihood; we will focus on the learning of the linear-chain CRF
- General CRFs are intractable, hence approximate solutions are necessary

Page 28: An Introduction to Conditional Random Field

Learning of Linear-chain CRF –1

Conditional maximum likelihood (CML), where x are the observations and y the labels: choose the parameters to maximize the conditional likelihood,

\max_w L(w; y | x) = \max_w p(y | x; w),

or equivalently its logarithm log p(y | x; w).

Applying CML to the learning of the CRF: it can be shown that the conditional log-likelihood of the linear-chain CRF is a convex function, so we can apply gradient ascent to the CML problem, setting each partial derivative to zero:

\partial log p(y | x; w) / \partial w_j = F_j(x, y) - \partial log Z(x, w) / \partial w_j = 0

Page 29: An Introduction to Conditional Random Field

Learning of Linear-chain CRF –2

\partial log p(y | x; w) / \partial w_j
  = F_j(x, y) - \partial log Z(x, w) / \partial w_j
  = F_j(x, y) - \sum_{y'} F_j(x, y') p(y' | x; w)
  = F_j(x, y) - E_{y' ~ p(y' | x; w)}[ F_j(x, y') ]

Here E_p[.] denotes the expectation with respect to the distribution p. For the entire training set T:

\sum_{(x, y) \in T} ( F_j(x, y) - E_{y' ~ p(y' | x; w)}[ F_j(x, y') ] )

The first term is the expectation of the feature F_j with respect to the empirical distribution; the second is its expectation with respect to the model distribution.
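A brute-force sketch of this gradient (added here, workable only for tiny tag sets and short sequences; in practice the model expectation is computed with the forward-backward algorithm rather than by enumeration):

```python
import itertools, math

def gradient(w, features, x, y, tags):
    """dL/dw_j = F_j(x, y) - E_{y' ~ p(y'|x;w)}[F_j(x, y')], one training pair."""
    def score(y_):
        return sum(w_j * F_j(x, y_) for w_j, F_j in zip(w, features))
    space = [list(y_) for y_ in itertools.product(tags, repeat=len(y))]
    z = sum(math.exp(score(y_)) for y_ in space)           # Z(x, w)
    return [F_j(x, y) - sum(math.exp(score(y_)) / z * F_j(x, y_)
                            for y_ in space)               # model expectation
            for F_j in features]
```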

Page 30: An Introduction to Conditional Random Field

Learning of Linear-chain CRF –3

To yield the best model, the expectation of each feature with respect to the model distribution must equal its expected value under the empirical distribution of the training data. This is the same condition as in the maximum entropy model: logistic regression (maximum entropy), extended to sequences, gives the linear-chain CRF.

Page 31: An Introduction to Conditional Random Field

Learning of Linear-chain CRF –4

Apply stochastic gradient ascent:
- Change the parameter values one example at a time
- "Stochastic" because the derivative based on a randomly chosen single example is a random approximation to the true derivative based on all the training data

w_j := w_j + \alpha ( F_j(x, y) - E_{y' ~ p(y' | x; w)}[ F_j(x, y') ] ),

where \alpha is a learning rate.
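A minimal sketch of this update rule (added here; it reuses the gradient() function from the previous sketch, and the shuffle supplies the random example order):

```python
import random

def sga_epoch(w, features, data, tags, alpha=0.1):
    """One stochastic pass over data = [(x, y), ...], updating w in place."""
    random.shuffle(data)
    for x, y in data:
        grad = gradient(w, features, x, y, tags)
        for j in range(len(w)):
            w[j] += alpha * grad[j]   # w_j := w_j + alpha * (F_j - E[F_j])
    return w
```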

Page 32: An Introduction to Conditional Random Field

Outline
- Problem description
- Why conditional random fields (CRF)
- Introduction to CRF
  - CRF model
  - Inference of CRF
  - Learning of CRF
- Comparisons
- Applications
- References

Page 33: An Introduction to Conditional Random Field

Outline
- Problem description
- Why conditional random fields (CRF)
- Introduction to CRF
  - CRF model
  - Inference of CRF
  - Learning of CRF
- Applications
- References

Page 34: An Introduction to Conditional Random Field

Application – Stereo Matching (1)

Ref: "Learning Conditional Random Fields for Stereo," CVPR, 2007.

[Figure: a rectified left image (imL) and a rectified right image (imR) of the same object.]

Page 35: An Introduction to Conditional Random Field

Application – Stereo Matching (2)

Model the stereo matching problem using a CRF:
- p: a pixel in the reference image
- d_p: the disparity at pixel p
- c_p: the matching cost at pixel p
- g_pq: the color gradient between neighboring pixels p and q, (p, q) ∈ N

Page 36: An Introduction to Conditional Random Field

Application – Image Labeling (1)

Ref: "Multiscale Conditional Random Fields for Image Labeling," CVPR, 2004.

Image labeling: assign each pixel to one of a finite set of labels.

Page 37: An Introduction to Conditional Random Field

Application – Image Labeling (2)

Model the image labeling problem using a CRF, where X is the input image, L is the output label field, and S is the entire image:

P(L | X) = (1 / Z) P_C(L | X, f) P_R(L, g^R) P_G(L, g^G),

where P_C is a local classifier applied to the image X, and P_R and P_G use regional and global features extracted from the label field L.

Page 38: An Introduction to Conditional Random Field

Application – Image Labeling (3)

Page 39: An Introduction to Conditional Random Field

Application – Gesture Recognition (1)

Ref: S. Wang, A. Quattoni, L. Morency, D. Demirdjian, and T. Darrell, "Hidden Conditional Random Fields for Gesture Recognition," CVPR, 2006.

Page 40: An Introduction to Conditional Random Field

Application – Gesture Recognition (2)

s = {s1, s2, ..., sm}, where each si ∈ S captures certain underlying structure of its class and S is the set of hidden states in the model.

Page 41: An Introduction to Conditional Random Field

Application – Gesture Recognition (3)

- The graph E is a chain where each node corresponds to a hidden state variable at time t
- ω defines the amount of past and future history to be used when predicting the state at time t
- θ = [θe, θy, θs] is assumed to be a parameter vector that can include any feature of the observation sequence for a specific window size ω
- θs[sj] refers to the parameters θs that correspond to state sj ∈ S
- θy[y, sj] stands for the parameters that correspond to class y and state sj
- θe[y, sj, sk] refers to the parameters that correspond to class y and the pair of states sj and sk

Page 42: An Introduction to Conditional Random Field

Application – Gesture Recognition (4)

Thirteen users were asked to perform six gestures; an average of 90 gestures per class was collected.

Page 43: An Introduction to Conditional Random Field

Summary

- Discriminative models have the advantages of being less sensitive to unbalanced training data and of handling correlated features
- The CRF is a discriminative model and coincides with the maximum entropy model

Page 44: An Introduction to Conditional Random Field

Factor Graph

Representing the HMM with a factor graph:

[Figure: a chain of label nodes y1, y2, y3 over observations x1, x2, x3, with transition factors f(y1), f(y2|y1), f(y3|y2) and emission factors f(y1|x1), f(y2|x2), f(y3|x3).]

Representing naive Bayes with a factor graph:

[Figure: a single label node y connected by factors to the observations x1, x2, x3.]