
Page 1: Conditional Random Fields

Conditional Random Fields

William W. Cohen

CALD

Page 2: Conditional Random Fields

Announcements

• Upcoming assignments:
– Today: Sha & Pereira, Lafferty et al
– Mon 2/23: Klein & Manning, Toutanova et al
– Wed 2/25: no writeup due
– Mon 3/1: no writeup due
– Wed 3/3: project proposal due: personnel + 1-2 page
– Spring break week, no class

Page 3: Conditional Random Fields

Review: motivation for CMMs

Ideally we would like to use many, arbitrary, overlapping features of words.

[Figure: a chain of states S_{t-1}, S_t, S_{t+1} with corresponding observations O_{t-1}, O_t, O_{t+1}.]

Candidate features of a word: identity of the word; ends in "-ski"; is capitalized; is part of a noun phrase; is in a list of city names; is under node X in WordNet; is in bold font; is indented; is in hyperlink anchor; … (e.g., "Wisniewski" ends in "-ski" and is part of a noun phrase).

Page 4: Conditional Random Fields

Motivation for CMMs

[Same figure as the previous slide: the state chain S_{t-1}, S_t, S_{t+1} with observations O_{t-1}, O_t, O_{t+1} and the word-feature list.]

Idea: replace generative model in HMM with a maxent model, where state depends on observations and previous state

$$\Pr(s_t \mid s_{t-1}, x_t, \ldots)$$
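A minimal Python sketch of that idea (the feature names and weights below are made up for illustration): a single maxent classifier spreads probability over the next state, given features of the previous state and the current observation.

```python
import math

def next_state_dist(weights, feats, states):
    """Pr(s_t | s_{t-1}, x_t): one maxent (softmax) model over next states.

    weights: dict mapping (feature, state) -> weight lambda
    feats:   active features of (s_{t-1}, x_t), e.g. {"prev=Other", "ends-ski"}
    states:  candidate values for s_t
    """
    score = {s: sum(weights.get((f, s), 0.0) for f in feats) for s in states}
    z = sum(math.exp(v) for v in score.values())  # local (per-state) normalizer
    return {s: math.exp(v) / z for s, v in score.items()}

# Hypothetical weights: a "-ski" suffix pushes toward a Name state.
w = {("ends-ski", "Name"): 2.0, ("prev=Other", "Other"): 0.5}
print(next_state_dist(w, {"ends-ski", "prev=Other"}, ["Name", "Other"]))
```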

Page 5: Conditional Random Fields

Implications of the model

• Does this do what we want?

• Q: does Y[i-1] depend on X[i+1]?
– "a node is conditionally independent of its non-descendants given its parents"

Page 6: Conditional Random Fields

Label Bias Problem

• Consider this MEMM:

• P(1 and 2 | ro) = P(2 | 1 and ro) P(1 | ro) = P(2 | 1 and o) P(1 | r)
P(1 and 2 | ri) = P(2 | 1 and ri) P(1 | ri) = P(2 | 1 and i) P(1 | r)

• In the training data, label value 2 is the only label value observed after label value 1, so P(2 | 1) = 1, and hence P(2 | 1 and x) = 1 for all x. Since P(2 | 1 and x) = 1 for all x, P(1 and 2 | ro) = P(1 and 2 | ri).

• However, we expect P(1 and 2 | ri) to be greater than P(1 and 2 | ro).

• Per-state normalization does not allow the required expectation.

Page 7: Conditional Random Fields

Label Bias Problem

• Consider this MEMM, and enough training data to perfectly model it:

Pr(0123 | rob) = Pr(1 | 0, r)/Z1 · Pr(2 | 1, o)/Z2 · Pr(3 | 2, b)/Z3 = 0.5 · 1 · 1

Pr(0453 | rib) = Pr(4 | 0, r)/Z1′ · Pr(5 | 4, i)/Z2′ · Pr(3 | 5, b)/Z3′ = 0.5 · 1 · 1

Likewise Pr(0123 | rib) = 0.5 and Pr(0453 | rob) = 0.5: the model assigns the same probabilities whether the observation is "rob" or "rib".
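A quick numeric check of this example, assuming the transition tables the figure implies (each row already sums to one because of per-state normalization):

```python
# From state 0, 'r' leads to state 1 or 4 with probability 0.5 each; every
# later state has exactly one successor in the training data, so its
# probability is 1 regardless of the observed symbol.
trans = {
    (0, 'r'): {1: 0.5, 4: 0.5},
    (1, 'o'): {2: 1.0}, (1, 'i'): {2: 1.0},   # state 1 ignores o vs. i
    (4, 'i'): {5: 1.0}, (4, 'o'): {5: 1.0},   # state 4 ignores i vs. o
    (2, 'b'): {3: 1.0}, (5, 'b'): {3: 1.0},
}

def path_prob(states, obs):
    p = 1.0
    for (s, s_next), x in zip(zip(states, states[1:]), obs):
        p *= trans[(s, x)].get(s_next, 0.0)
    return p

print(path_prob([0, 1, 2, 3], "rob"))  # 0.5
print(path_prob([0, 1, 2, 3], "rib"))  # 0.5 -- identical: the model cannot
print(path_prob([0, 4, 5, 3], "rib"))  # 0.5    prefer the correct path
```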

Page 8: Conditional Random Fields

How important is label bias?

• Could be avoided in this case by changing structure:

• Our models are always wrong – is this “wrongness” a problem?

• See Klein & Manning’s paper for next week….

Page 9: Conditional Random Fields

Another view of label bias [Sha & Pereira]

So what’s the alternative?

Page 10: Conditional Random Fields

Review of maxent

$$\Pr(x) = \frac{\exp\big(\sum_i \lambda_i f_i(x)\big)}{Z}$$

$$\Pr(x, y) = \frac{\exp\big(\sum_i \lambda_i f_i(x, y)\big)}{Z}$$

$$\Pr(y \mid x) = \frac{\exp\big(\sum_i \lambda_i f_i(x, y)\big)}{\sum_{y'} \exp\big(\sum_i \lambda_i f_i(x, y')\big)}$$
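The conditional model Pr(y|x), as a minimal Python sketch (the feature function f and weights lam are whatever the application supplies):

```python
import math

def pr_y_given_x(lam, f, x, labels):
    """Pr(y|x) = exp(sum_i lam_i f_i(x,y)) / sum_{y'} exp(sum_i lam_i f_i(x,y'))."""
    unnorm = {y: math.exp(sum(l * v for l, v in zip(lam, f(x, y)))) for y in labels}
    z = sum(unnorm.values())
    return {y: u / z for y, u in unnorm.items()}

# Toy example: one indicator feature per label.
f = lambda x, y: [1.0 if y == "A" else 0.0, 1.0 if y == "B" else 0.0]
print(pr_y_given_x([1.0, 0.0], f, "any x", ["A", "B"]))  # {'A': ~0.73, 'B': ~0.27}
```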

Page 11: Conditional Random Fields

Review of maxent/MEMM/CMMs

$$\Pr(y \mid x) = \frac{\exp\big(\sum_i \lambda_i f_i(x, y)\big)}{\sum_{y'} \exp\big(\sum_i \lambda_i f_i(x, y')\big)} = \frac{\exp\big(\sum_i \lambda_i f_i(x, y)\big)}{Z(x)}$$

For an MEMM:

$$\Pr(y_1 \ldots y_n \mid x_1 \ldots x_n) = \prod_j \Pr(y_j \mid y_{j-1}, x_j) = \prod_j \frac{\exp\big(\sum_i \lambda_i f_i(x_j, y_j, y_{j-1})\big)}{Z(x_j, y_{j-1})}$$

Page 12: Conditional Random Fields

Details on CMMs

$$\Pr(y_1 \ldots y_n \mid x_1 \ldots x_n) = \prod_j \Pr(y_j \mid y_{j-1}, x_j) = \prod_j \frac{\exp\big(\sum_i \lambda_i f_i(x_j, y_j, y_{j-1})\big)}{Z(x_j, y_{j-1})}$$

$$= \frac{\exp\big(\sum_i \lambda_i F_i(x, y)\big)}{\prod_j Z(x_j, y_{j-1})}, \quad \text{where } F_i(x, y) = \sum_j f_i(x_j, y_j, y_{j-1})$$
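Chaining the locally normalized models gives the sequence probability; a sketch in the same spirit (the word/previous-label feature templates here are hypothetical):

```python
import math

def local_dist(weights, x_j, y_prev, states):
    """Pr(y_j | y_{j-1}, x_j): a locally normalized maxent model."""
    feats = {f"word={x_j}", f"prev={y_prev}"}          # hypothetical features
    score = {y: sum(weights.get((f, y), 0.0) for f in feats) for y in states}
    z = sum(math.exp(v) for v in score.values())       # Z(x_j, y_{j-1})
    return {y: math.exp(v) / z for y, v in score.items()}

def cmm_sequence_prob(weights, xs, ys, states, start="START"):
    """Pr(y_1..y_n | x_1..x_n) = prod_j Pr(y_j | y_{j-1}, x_j)."""
    p, y_prev = 1.0, start
    for x_j, y_j in zip(xs, ys):
        p *= local_dist(weights, x_j, y_prev, states)[y_j]
        y_prev = y_j
    return p
```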

Page 13: Conditional Random Fields

From CMMs to CRFs

Old model:

$$\Pr(y \mid x) = \prod_j \frac{\exp\big(\sum_i \lambda_i f_i(x_j, y_j, y_{j-1})\big)}{Z(x_j, y_{j-1})} = \frac{\exp\big(\sum_i \lambda_i F_i(x, y)\big)}{\prod_j Z(x_j, y_{j-1})}, \quad \text{where } F_i(x, y) = \sum_j f_i(x_j, y_j, y_{j-1})$$

Recall why we're unhappy: we don't want local normalization.

New model:

$$\Pr(y \mid x) = \frac{\exp\big(\sum_i \lambda_i F_i(x, y)\big)}{Z(x)}$$
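To make the new model concrete, a brute-force sketch in which Z(x) literally sums over every label sequence (exponential in sequence length, of course; the forward-backward computation later does this efficiently):

```python
import math
from itertools import product

def score(lam, F, x, ys):
    """The global unnormalized log-score: sum_i lam_i * F_i(x, y)."""
    return sum(l * v for l, v in zip(lam, F(x, ys)))

def crf_prob(lam, F, x, ys, states):
    """Pr(y|x) = exp(score(x,y)) / Z(x); F(x, y) returns [F_1, ..., F_m]."""
    z = sum(math.exp(score(lam, F, x, list(cand)))
            for cand in product(states, repeat=len(ys)))  # one global Z(x)
    return math.exp(score(lam, F, x, ys)) / z
```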

Page 14: Conditional Random Fields

What’s the new model look like?

$$\Pr(y \mid x) = \frac{\exp\big(\sum_i \lambda_i F_i(x, y)\big)}{Z(x)} = \frac{\exp\big(\sum_i \sum_j \lambda_i f_i(x_j, y_j, y_{j-1})\big)}{Z(x)}$$

[Figure: observations x1, x2, x3 with labels y1, y2, y3 in a chain.]

What's independent?

Page 15: Conditional Random Fields

What’s the new model look like?

$$\Pr(y \mid x) = \frac{\exp\big(\sum_i \lambda_i F_i(x, y)\big)}{Z(x)} = \frac{\exp\big(\sum_i \sum_j \lambda_i f_i(x, y_j, y_{j-1})\big)}{Z(x)}$$

[Figure: a single observation node x connected to all of the labels y1, y2, y3; the features now see the whole input x.]

What's independent now??

Page 16: Conditional Random Fields

Hammersley-Clifford

• For positive distributions P(x1, …, xn):
– Pr(xi | x1, …, xi-1, xi+1, …, xn) = Pr(xi | Neighbors(xi))

– Pr(A | B, S) = Pr(A | S), where A and B are sets of nodes and S is a set that separates A and B

– P can be written as a normalized product of "clique potentials":

$$\Pr(x) = \frac{1}{Z} \prod_{\text{cliques } C} \phi_C(x_C)$$

So this is very general: any Markov distribution can be written in this form (modulo nits like “positive distribution”)
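A toy illustration with made-up potentials on a three-node chain (cliques {x1,x2} and {x2,x3}), showing that the normalized product of clique potentials is a proper distribution:

```python
from itertools import product

# Hypothetical clique potentials on the chain x1 - x2 - x3, binary variables;
# each potential simply favors agreement between its two neighbors.
phi12 = {(a, b): 2.0 if a == b else 1.0 for a in (0, 1) for b in (0, 1)}
phi23 = {(b, c): 3.0 if b == c else 1.0 for b in (0, 1) for c in (0, 1)}

Z = sum(phi12[(a, b)] * phi23[(b, c)] for a, b, c in product((0, 1), repeat=3))

def pr(a, b, c):
    """Pr(x) = (1/Z) * product over cliques C of phi_C(x_C)."""
    return phi12[(a, b)] * phi23[(b, c)] / Z

print(pr(0, 0, 0))                                     # most likely: all agree
print(sum(pr(*x) for x in product((0, 1), repeat=3)))  # 1.0
```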

Page 17: Conditional Random Fields

Definition of CRFs

X is a random variable over data sequences to be labeled

Y is a random variable over corresponding label sequences

Page 18: Conditional Random Fields

Example of CRFs

Page 19: Conditional Random Fields

Graphical comparison among HMMs, MEMMs and CRFs

[Figure: the graphical structures of an HMM, an MEMM, and a CRF, side by side.]

Page 20: Conditional Random Fields

Lafferty et al notation

x is a data sequence
y is a label sequence
v is a vertex from vertex set V = set of label random variables
e is an edge from edge set E over V
f_k and g_k are given and fixed: g_k is a Boolean vertex feature, f_k is a Boolean edge feature
k is the number of features
$(\lambda_1, \lambda_2, \ldots, \lambda_n; \mu_1, \mu_2, \ldots, \mu_n)$ are parameters to be estimated
y|e is the set of components of y defined by edge e
y|v is the set of components of y defined by vertex v

If the graph G = (V, E) of Y is a tree, the conditional distribution over the label sequence Y = y, given X = x, is, by the fundamental theorem of random fields:

$$p_\theta(y \mid x) \propto \exp\Big(\sum_{e \in E,\, k} \lambda_k f_k(e, y|_e, x) + \sum_{v \in V,\, k} \mu_k g_k(v, y|_v, x)\Big)$$

Page 21: Conditional Random Fields

Conditional Distribution (cont’d)

• CRFs use the observation-dependent normalization Z(x) for the conditional distributions:

Z(x) is a normalization over the data sequence x

$$p_\theta(y \mid x) = \frac{1}{Z(x)} \exp\Big(\sum_{e \in E,\, k} \lambda_k f_k(e, y|_e, x) + \sum_{v \in V,\, k} \mu_k g_k(v, y|_v, x)\Big)$$

• Learning:
– Lafferty et al's IIS-based method is rather inefficient.

– Gradient-based methods are faster (the gradient is sketched below).

– The trickiest bit is computing the normalization, which sums over exponentially many y vectors.
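For the gradient-based methods, the derivative of the conditional log-likelihood has the familiar maxent form, observed minus expected feature counts (as in Sha & Pereira); the expectation under the model is exactly what forward-backward computes:

$$\frac{\partial L}{\partial \lambda_i} = \sum_{(x, y) \in D} \Big( F_i(x, y) - \mathbb{E}_{y' \sim \Pr_\lambda(\cdot \mid x)}\big[F_i(x, y')\big] \Big)$$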

Page 22: Conditional Random Fields

CRF learning – from Sha & Pereira

Page 23: Conditional Random Fields

CRF learning – from Sha & Pereira

Page 24: Conditional Random Fields

CRF learning – from Sha & Pereira

Something like forward-backward

Idea:

• Define a matrix of y,y′ "affinities" at stage i

• Mi[y,y′] = "unnormalized probability" of a transition from y to y′ at stage i

• Mi * Mi+1 = "unnormalized probability" of any path through stages i and i+1
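A minimal sketch of the matrix idea with tiny hand-set 2x2 matrices (in a real CRF each entry would be an exponentiated feature score, exp(sum_i lambda_i f_i(x_j, y′, y))):

```python
def matmul(A, B):
    """Plain list-of-lists matrix product."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

# M_i[y][y'] = unnormalized "affinity" of a y -> y' transition at stage i.
M1 = [[1.0, 2.0], [3.0, 4.0]]
M2 = [[5.0, 6.0], [7.0, 8.0]]

paths = matmul(M1, M2)  # entry [y][y'] sums all two-edge paths y -> ? -> y'
Z = sum(sum(row) for row in paths)  # total weight of all paths (uniform ends)
print(paths, Z)
```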

Page 25: Conditional Random Fields

[Figure: the chain CRF from earlier, a single observation x with labels y1, y2, y3, unrolled into a trellis over y1, y2, y3.]

Page 26: Conditional Random Fields

Forward backward ideas

[Figure: a three-position trellis with states "name" and "nonName" at each position; the edges between positions 1 and 2 have weights a, b, c, d, and those between positions 2 and 3 have weights e, f, g, h.]

$$M_1 = \begin{pmatrix} a & b \\ c & d \end{pmatrix}, \quad M_2 = \begin{pmatrix} e & f \\ g & h \end{pmatrix}, \quad M_1 M_2 = \begin{pmatrix} ae + bg & af + bh \\ ce + dg & cf + dh \end{pmatrix}$$

Each entry of M1 M2 sums the weights of every two-edge path between the corresponding pair of states.

Page 27: Conditional Random Fields

CRF learning – from Sha & Pereira

Page 28: Conditional Random Fields

CRF learning – from Sha & Pereira

Page 29: Conditional Random Fields

Sha & Pereira results

CRF beats MEMM (McNemar’s test); MEMM probably beats voted perceptron

Page 30: Conditional Random Fields

Sha & Pereira results

[Table of training times, in minutes, on 375k examples.]

Page 31: Conditional Random Fields

POS tagging Experiments in Lafferty et al

• Compared HMMs, MEMMs, and CRFs on Penn Treebank POS tagging.

• Each word in a given input sentence must be labeled with one of 45 syntactic tags.

• Add a small set of orthographic features: whether a spelling begins with a number or upper-case letter, whether it contains a hyphen, and whether it contains one of the following suffixes: -ing, -ogy, -ed, -s, -ly, -ion, -tion, -ity, -ies (a sketch of such a feature extractor follows this list).

• oov = out-of-vocabulary (not observed in the training set)
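A sketch of that orthographic feature set in Python (the exact feature templates in the paper may differ):

```python
SUFFIXES = ("-ing", "-ogy", "-ed", "-s", "-ly", "-ion", "-tion", "-ity", "-ies")

def orthographic_features(word):
    """Spelling features in the spirit of the Lafferty et al experiments."""
    feats = []
    if word[0].isdigit():
        feats.append("starts-with-number")
    if word[0].isupper():
        feats.append("starts-with-uppercase")
    if "-" in word:
        feats.append("contains-hyphen")
    feats += [f"suffix={s}" for s in SUFFIXES if word.endswith(s.lstrip("-"))]
    return feats

print(orthographic_features("Well-known"))  # uppercase + hyphen
print(orthographic_features("running"))     # ['suffix=-ing']
```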

Page 32: Conditional Random Fields

POS tagging vs MXPost