lectures tagging


TRANSCRIPT

  • Slide 1/38

    Tagging Problems, and Hidden Markov Models

    Michael Collins, Columbia University

  • Slide 2/38

    Overview

    The Tagging Problem

    Generative models, and the noisy-channel model, for supervised learning

    Hidden Markov Model (HMM) taggers: basic definitions, parameter estimation, the Viterbi algorithm

  • Slide 3/38

  • Slide 4/38

    Named Entity Recognition

    INPUT: Profits soared at Boeing Co., easily topping forecasts on Wall Street, as their CEO Alan Mulally announced first quarter results.

    OUTPUT: Profits soared at [Company Boeing Co.], easily topping forecasts on [Location Wall Street], as their CEO [Person Alan Mulally] announced first quarter results.

  • Slide 5/38

    Named Entity Extraction as Tagging

    INPUT: Profits soared at Boeing Co., easily topping forecasts on Wall Street, as their CEO Alan Mulally announced first quarter results.

    OUTPUT: Profits/NA soared/NA at/NA Boeing/SC Co./CC ,/NA easily/NA topping/NA forecasts/NA on/NA Wall/SL Street/CL ,/NA as/NA their/NA CEO/NA Alan/SP Mulally/CP announced/NA first/NA quarter/NA results/NA ./NA

    NA = No entity, SC = Start Company, CC = Continue Company, SL = Start Location, CL = Continue Location, . . .
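
    The start/continue tag scheme above maps straightforwardly back to entity spans. Here is a minimal Python sketch of that decoding step (the function name, dictionaries, and overall structure are illustrative assumptions, not code from the lecture):

    # Sketch: decode start/continue entity tags (SC, CC, SL, CL, SP, CP, NA)
    # back into (entity type, phrase) spans.
    ENTITY_OF_START = {"SC": "Company", "SL": "Location", "SP": "Person"}
    CONTINUE_OF = {"CC": "Company", "CL": "Location", "CP": "Person"}

    def decode_entities(tagged_words):
        """tagged_words: list of (word, tag) pairs, e.g. [("Boeing", "SC"), ("Co.", "CC")]."""
        spans, current_type, current_words = [], None, []
        for word, tag in tagged_words:
            if tag in ENTITY_OF_START:                    # a new entity starts here
                if current_type is not None:
                    spans.append((current_type, " ".join(current_words)))
                current_type, current_words = ENTITY_OF_START[tag], [word]
            elif tag in CONTINUE_OF and CONTINUE_OF[tag] == current_type:
                current_words.append(word)                # extend the open entity
            else:                                         # NA closes any open entity
                if current_type is not None:
                    spans.append((current_type, " ".join(current_words)))
                current_type, current_words = None, []
        if current_type is not None:
            spans.append((current_type, " ".join(current_words)))
        return spans

    print(decode_entities([("Boeing", "SC"), ("Co.", "CC"), (",", "NA"),
                           ("Wall", "SL"), ("Street", "CL")]))
    # [('Company', 'Boeing Co.'), ('Location', 'Wall Street')]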

  • Slide 6/38

  • Slide 7/38

    Two Types of Constraints

    Influential/JJ members/NNS of/IN the/DT House/NNP Ways/NNP and/CC Means/NNP Committee/NNP introduced/VBD legislation/NN that/WDT would/MD restrict/VB how/WRB the/DT new/JJ savings-and-loan/NN bailout/NN agency/NN can/MD raise/VB capital/NN ./.

    Local: e.g., "can" is more likely to be a modal verb MD rather than a noun NN

    Contextual: e.g., a noun is much more likely than a verb to follow a determiner

    Sometimes these preferences are in conflict:

    The trash can is in the garage

  • Slide 8/38

    Overview

    The Tagging Problem

    Generative models, and the noisy-channel model, for supervised learning

    Hidden Markov Model (HMM) taggers: basic definitions, parameter estimation, the Viterbi algorithm

  • Slide 9/38

    Supervised Learning Problems

    We have training examples x^(i), y^(i) for i = 1 . . . m. Each x^(i) is an input, each y^(i) is a label.

    Task is to learn a function f mapping inputs x to labels f(x)

  • Slide 10/38

    Supervised Learning Problems

    We have training examples x^(i), y^(i) for i = 1 . . . m. Each x^(i) is an input, each y^(i) is a label.

    Task is to learn a function f mapping inputs x to labels f(x)

    Conditional models: learn a distribution p(y|x) from training examples. For any test input x, define f(x) = arg max_y p(y|x)

  • Slide 11/38

    Generative Models

    We have training examples x^(i), y^(i) for i = 1 . . . m. Task is to learn a function f mapping inputs x to labels f(x).

  • Slide 12/38

    Generative Models

    We have training examples x^(i), y^(i) for i = 1 . . . m. Task is to learn a function f mapping inputs x to labels f(x).

    Generative models:

    Learn a distribution p(x, y) from training examples. Often we have p(x, y) = p(y) p(x|y)

  • Slide 13/38

    Generative Models

    We have training examples x^(i), y^(i) for i = 1 . . . m. Task is to learn a function f mapping inputs x to labels f(x).

    Generative models:

    Learn a distribution p(x, y) from training examples. Often we have p(x, y) = p(y) p(x|y)

    Note: we then have

    p(y|x) = p(y) p(x|y) / p(x)

    where p(x) = Σ_y p(y) p(x|y)
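
    As a quick numeric check of this identity, the following Python sketch recovers p(y|x) from p(y) and p(x|y) by marginalizing over y (the distributions are invented toy numbers, purely for illustration):

    # Toy check of p(y|x) = p(y) p(x|y) / p(x), with p(x) = sum over y of p(y) p(x|y).
    p_y = {"N": 0.6, "V": 0.4}                 # prior over labels, p(y)
    p_x_given_y = {"N": 0.002, "V": 0.010}     # p(x|y) for one fixed input x

    p_x = sum(p_y[y] * p_x_given_y[y] for y in p_y)                 # p(x) by marginalization
    posterior = {y: p_y[y] * p_x_given_y[y] / p_x for y in p_y}     # p(y|x)

    print(posterior)                  # {'N': 0.2307..., 'V': 0.7692...}
    print(sum(posterior.values()))    # 1.0 (up to floating point)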

  • Slide 14/38

    Decoding with Generative Models

    We have training examples x^(i), y^(i) for i = 1 . . . m. Task is to learn a function f mapping inputs x to labels f(x).

  • Slide 15/38

    Decoding with Generative Models

    We have training examples x^(i), y^(i) for i = 1 . . . m. Task is to learn a function f mapping inputs x to labels f(x).

    Generative models:

    Learn a distribution p(x, y) from training examples. Often we have p(x, y) = p(y) p(x|y)

  • Slide 16/38

    Decoding with Generative Models

    We have training examples x^(i), y^(i) for i = 1 . . . m. Task is to learn a function f mapping inputs x to labels f(x).

    Generative models:

    Learn a distribution p(x, y) from training examples. Often we have p(x, y) = p(y) p(x|y)

    Output from the model:

    f(x) = arg max_y p(y|x)
         = arg max_y p(y) p(x|y) / p(x)
         = arg max_y p(y) p(x|y)

    (the denominator p(x) does not depend on y, so it can be dropped from the arg max)
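
    In code, decoding with a generative model is a single arg max over labels. A minimal Python sketch, reusing the same kind of toy distributions as above (all values illustrative):

    # f(x) = arg max over y of p(y) p(x|y); p(x) is constant in y, so it is never computed.
    p_y = {"N": 0.6, "V": 0.4}
    p_x_given_y = {"N": 0.002, "V": 0.010}

    def decode(p_y, p_x_given_y):
        return max(p_y, key=lambda y: p_y[y] * p_x_given_y[y])

    print(decode(p_y, p_x_given_y))   # 'V'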

  • Slide 17/38

    Overview

    The Tagging Problem

    Generative models, and the noisy-channel model, for supervised learning

    Hidden Markov Model (HMM) taggers: basic definitions, parameter estimation, the Viterbi algorithm

  • Slide 18/38

    Hidden Markov Models

    We have an input sentence x = x_1, x_2, . . . , x_n (x_i is the i'th word in the sentence)

    We have a tag sequence y = y_1, y_2, . . . , y_n (y_i is the i'th tag in the sentence)

    We'll use an HMM to define

    p(x_1, x_2, . . . , x_n, y_1, y_2, . . . , y_n)

    for any sentence x_1 . . . x_n and tag sequence y_1 . . . y_n of the same length.

    Then the most likely tag sequence for x is

    arg max_{y_1 . . . y_n} p(x_1 . . . x_n, y_1, y_2, . . . , y_n)

  • Slide 19/38

    Trigram Hidden Markov Models (Trigram HMMs)

    For any sentence x_1 . . . x_n where x_i ∈ V for i = 1 . . . n, and any tag sequence y_1 . . . y_{n+1} where y_i ∈ S for i = 1 . . . n, and y_{n+1} = STOP, the joint probability of the sentence and tag sequence is

    p(x_1 . . . x_n, y_1 . . . y_{n+1}) = ∏_{i=1}^{n+1} q(y_i | y_{i-2}, y_{i-1}) × ∏_{i=1}^{n} e(x_i | y_i)

    where we have assumed that y_0 = y_{-1} = *.

    Parameters of the model:

    q(s | u, v) for any s ∈ S ∪ {STOP}, u, v ∈ S ∪ {*}

    e(x | s) for any s ∈ S, x ∈ V

  • Slide 20/38

    An Example

    If we have n = 3, x_1 . . . x_3 equal to the sentence the dog laughs, and y_1 . . . y_4 equal to the tag sequence D N V STOP, then

    p(x_1 . . . x_n, y_1 . . . y_{n+1})
    = q(D | *, *) × q(N | *, D) × q(V | D, N) × q(STOP | N, V)
      × e(the | D) × e(dog | N) × e(laughs | V)

    STOP is a special tag that terminates the sequence

    We take y_0 = y_{-1} = *, where * is a special padding symbol

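    The product above is easy to evaluate once the q and e parameters are given. Below is a minimal Python sketch of the trigram HMM joint probability; the parameter values are invented toy numbers, not estimates from any corpus:

    # Joint probability under a trigram HMM:
    # p(x_1..x_n, y_1..y_{n+1}) = prod_i q(y_i | y_{i-2}, y_{i-1}) * prod_i e(x_i | y_i),
    # with y_0 = y_{-1} = * and y_{n+1} = STOP. Parameter values below are toy numbers.
    q = {  # q(s | u, v), keyed by (u, v, s)
        ("*", "*", "D"): 0.8, ("*", "D", "N"): 0.7,
        ("D", "N", "V"): 0.3, ("N", "V", "STOP"): 0.6,
    }
    e = {  # e(x | s), keyed by (s, x)
        ("D", "the"): 0.6, ("N", "dog"): 0.01, ("V", "laughs"): 0.002,
    }

    def joint_prob(words, tags, q, e):
        """words: x_1..x_n; tags: y_1..y_{n+1}, ending in 'STOP'."""
        padded = ["*", "*"] + tags                 # prepend y_{-1} = y_0 = *
        p = 1.0
        for i, tag in enumerate(tags):             # transition terms, i = 1..n+1
            p *= q[(padded[i], padded[i + 1], tag)]
        for word, tag in zip(words, tags):         # emission terms, i = 1..n
            p *= e[(tag, word)]
        return p

    print(joint_prob(["the", "dog", "laughs"], ["D", "N", "V", "STOP"], q, e))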

  • Slide 21/38

    Why the Name?

    p(x_1 . . . x_n, y_1 . . . y_{n+1})

    = q(STOP | y_{n-1}, y_n) ∏_{j=1}^{n} q(y_j | y_{j-2}, y_{j-1})    (a Markov chain over the tags)

      × ∏_{j=1}^{n} e(x_j | y_j)    (the x_j's are observed)


  • Slide 22/38

    Overview

    The Tagging Problem

    Generative models, and the noisy-channel model, for supervised learning

    Hidden Markov Model (HMM) taggers: basic definitions, parameter estimation, the Viterbi algorithm


  • Slide 23/38

    Smoothed Estimation

    q(Vt | DT, JJ) = λ_1 × Count(DT, JJ, Vt) / Count(DT, JJ)
                   + λ_2 × Count(JJ, Vt) / Count(JJ)
                   + λ_3 × Count(Vt) / Count()

    λ_1 + λ_2 + λ_3 = 1, and for all i, λ_i ≥ 0

    e(base | Vt) = Count(Vt, base) / Count(Vt)
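
    A minimal Python sketch of these two estimators (the count dictionaries, λ values, and function names are toy placeholders for illustration):

    # Linearly interpolated trigram estimate:
    #   q(s | u, v) = λ1·Count(u, v, s)/Count(u, v) + λ2·Count(v, s)/Count(v) + λ3·Count(s)/Count()
    # plus the maximum-likelihood emission estimate e(x | s) = Count(s, x)/Count(s).
    lam1, lam2, lam3 = 0.5, 0.3, 0.2          # λ1 + λ2 + λ3 = 1, each λi ≥ 0

    def q_smoothed(s, u, v, c3, c2, c1, total):
        """c3/c2/c1: trigram/bigram/unigram tag counts; total: number of tag tokens."""
        return (lam1 * c3.get((u, v, s), 0) / max(c2.get((u, v), 0), 1)
                + lam2 * c2.get((v, s), 0) / max(c1.get(v, 0), 1)
                + lam3 * c1.get(s, 0) / total)

    def e_mle(x, s, emit_counts, tag_counts):
        """e(x | s) = Count(s, x) / Count(s)."""
        return emit_counts.get((s, x), 0) / tag_counts[s]

    # Toy counts, purely illustrative:
    c1 = {"DT": 1000, "JJ": 600, "Vt": 300}
    c2 = {("DT", "JJ"): 200, ("JJ", "Vt"): 50}
    c3 = {("DT", "JJ", "Vt"): 30}
    print(q_smoothed("Vt", "DT", "JJ", c3, c2, c1, total=10000))   # ≈ 0.106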


  • Slide 24/38

    Dealing with Low-Frequency Words: An Example

    Profits soared at Boeing Co. , easily topping forecasts on Wall Street , as their CEO Alan Mulally announced first quarter results .


  • Slide 25/38

    Dealing with Low-Frequency Words

    A common method is as follows:

    Step 1: Split vocabulary into two sets
        Frequent words = words occurring ≥ 5 times in training
        Low frequency words = all other words

    Step 2: Map low frequency words into a small, finite set, depending on prefixes, suffixes etc.


  • Slide 26/38

    Dealing with Low-Frequency Words: An Example

    [Bikel et al. 1999] (named-entity recognition)

    Word class              Example                  Intuition
    twoDigitNum             90                       Two digit year
    fourDigitNum            1990                     Four digit year
    containsDigitAndAlpha   A8956-67                 Product code
    containsDigitAndDash    09-96                    Date
    containsDigitAndSlash   11/9/89                  Date
    containsDigitAndComma   23,000.00                Monetary amount
    containsDigitAndPeriod  1.00                     Monetary amount, percent
    othernum                456789                   Other number
    allCaps                 BBN                      Organization
    capPeriod               M.                       Person name initial
    firstWord               first word of sentence   No useful capitalization info
    initCap                 Sally                    Capitalized word
    lowercase               can                      Uncapitalized word
    other                   ,                        Punctuation marks, all other words
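
    A sketch of how such a mapping might be implemented in Python (the class names follow the table above, but the regular expressions and the function itself are illustrative approximations, not Bikel et al.'s exact rules):

    import re

    def word_class(word, is_first_word=False):
        # Map a low-frequency word to one of the word classes in the table above.
        if re.fullmatch(r"\d{2}", word):                               return "twoDigitNum"
        if re.fullmatch(r"\d{4}", word):                               return "fourDigitNum"
        if re.search(r"\d", word) and re.search(r"[A-Za-z]", word):    return "containsDigitAndAlpha"
        if re.search(r"\d", word) and "-" in word:                     return "containsDigitAndDash"
        if re.search(r"\d", word) and "/" in word:                     return "containsDigitAndSlash"
        if re.search(r"\d", word) and "," in word:                     return "containsDigitAndComma"
        if re.search(r"\d", word) and "." in word:                     return "containsDigitAndPeriod"
        if re.fullmatch(r"\d+", word):                                 return "othernum"
        if re.fullmatch(r"[A-Z]+", word):                              return "allCaps"
        if re.fullmatch(r"[A-Z]\.", word):                             return "capPeriod"
        if is_first_word:                                              return "firstWord"
        if re.fullmatch(r"[A-Z][a-z]+", word):                         return "initCap"
        if word.islower():                                             return "lowercase"
        return "other"

    print([word_class(w) for w in ["90", "1990", "A8956-67", "BBN", "M.", "Sally", "can", ","]])
    # ['twoDigitNum', 'fourDigitNum', 'containsDigitAndAlpha', 'allCaps', 'capPeriod', 'initCap', 'lowercase', 'other']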


  • Slide 27/38

    Dealing with Low-Frequency Words: An Example

    Profits/NA soared/NA at/NA Boeing/SC Co./CC ,/NA easily/NA topping/NA forecasts/NA on/NA Wall/SL Street/CL ,/NA as/NA their/NA CEO/NA Alan/SP Mulally/CP announced/NA first/NA quarter/NA results/NA ./NA

    firstword/NA soared/NA at/NA initCap/SC Co./CC ,/NA easily/NA lowercase/NA forecasts/NA on/NA initCap/SL Street/CL ,/NA as/NA their/NA CEO/NA Alan/SP initCap/CP announced/NA first/NA quarter/NA results/NA ./NA

    NA = No entity, SC = Start Company, CC = Continue Company, SL = Start Location, CL = Continue Location, . . .


  • Slide 28/38

    Overview

    The Tagging Problem

    Generative models, and the noisy-channel model, for supervised learning

    Hidden Markov Model (HMM) taggers: basic definitions, parameter estimation, the Viterbi algorithm


  • Slide 29/38

    The Viterbi Algorithm

    Problem: for an input x_1 . . . x_n, find

    arg max_{y_1 . . . y_{n+1}} p(x_1 . . . x_n, y_1 . . . y_{n+1})

    where the arg max is taken over all sequences y_1 . . . y_{n+1} such that y_i ∈ S for i = 1 . . . n, and y_{n+1} = STOP.

    We assume that p again takes the form

    p(x_1 . . . x_n, y_1 . . . y_{n+1}) = ∏_{i=1}^{n+1} q(y_i | y_{i-2}, y_{i-1}) × ∏_{i=1}^{n} e(x_i | y_i)

    Recall that we have assumed in this definition that y_0 = y_{-1} = *, and y_{n+1} = STOP.

  • Slide 30/38

    The Viterbi Algorithm

  • Slide 31/38

    The Viterbi Algorithm

    Define n to be the length of the sentence

    Define S_k for k = −1 . . . n to be the set of possible tags at position k:

        S_{-1} = S_0 = {*}
        S_k = S for k ∈ {1 . . . n}

    Define

    r(y_{-1}, y_0, y_1, . . . , y_k) = ∏_{i=1}^{k} q(y_i | y_{i-2}, y_{i-1}) × ∏_{i=1}^{k} e(x_i | y_i)

    Define a dynamic programming table

    π(k, u, v) = maximum probability of a tag sequence ending in tags u, v at position k

    that is,

    π(k, u, v) = max_{⟨y_{-1}, y_0, y_1, . . . , y_k⟩ : y_{k-1} = u, y_k = v} r(y_{-1}, y_0, y_1, . . . , y_k)


  • Slide 32/38

    An Example

    π(k, u, v) = maximum probability of a tag sequence ending in tags u, v at position k

    The man saw the dog with the telescope


  • Slide 33/38

    A Recursive Definition

    Base case: π(0, *, *) = 1

    Recursive definition: for any k ∈ {1 . . . n}, for any u ∈ S_{k-1} and v ∈ S_k:

    π(k, u, v) = max_{w ∈ S_{k-2}} ( π(k−1, w, u) × q(v | w, u) × e(x_k | v) )


  • Slide 34/38

    Justification for the Recursive Definition

    For any k ∈ {1 . . . n}, for any u ∈ S_{k-1} and v ∈ S_k:

    π(k, u, v) = max_{w ∈ S_{k-2}} ( π(k−1, w, u) × q(v | w, u) × e(x_k | v) )

    The man saw the dog with the telescope


  • Slide 35/38

    The Viterbi Algorithm

    Input: a sentence x_1 . . . x_n, parameters q(s | u, v) and e(x | s).

    Initialization: Set π(0, *, *) = 1

    Definition: S_{-1} = S_0 = {*}, S_k = S for k ∈ {1 . . . n}

    Algorithm:

    For k = 1 . . . n,
        For u ∈ S_{k-1}, v ∈ S_k,
            π(k, u, v) = max_{w ∈ S_{k-2}} ( π(k−1, w, u) × q(v | w, u) × e(x_k | v) )

    Return max_{u ∈ S_{n-1}, v ∈ S_n} ( π(n, u, v) × q(STOP | u, v) )


  • Slide 36/38

    The Viterbi Algorithm with Backpointers

    Input: a sentence x_1 . . . x_n, parameters q(s | u, v) and e(x | s).

    Initialization: Set π(0, *, *) = 1

    Definition: S_{-1} = S_0 = {*}, S_k = S for k ∈ {1 . . . n}

    Algorithm:

    For k = 1 . . . n,
        For u ∈ S_{k-1}, v ∈ S_k,
            π(k, u, v) = max_{w ∈ S_{k-2}} ( π(k−1, w, u) × q(v | w, u) × e(x_k | v) )
            bp(k, u, v) = arg max_{w ∈ S_{k-2}} ( π(k−1, w, u) × q(v | w, u) × e(x_k | v) )

    Set (y_{n-1}, y_n) = arg max_{(u, v)} ( π(n, u, v) × q(STOP | u, v) )

    For k = (n − 2) . . . 1, y_k = bp(k + 2, y_{k+1}, y_{k+2})

    Return the tag sequence y_1 . . . y_n
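
    A compact Python sketch of this procedure (the q and e dictionaries are keyed as in the earlier joint-probability sketch, i.e. q[(u, v, s)] = q(s | u, v) and e[(s, x)] = e(x | s); all names are illustrative):

    def viterbi(words, S, q, e):
        # Trigram-HMM Viterbi decoding with backpointers, following the pseudocode above.
        n = len(words)
        def tags_at(k):                          # S_k = {*} for k <= 0, otherwise S
            return {"*"} if k <= 0 else S

        pi = {(0, "*", "*"): 1.0}                # base case: pi(0, *, *) = 1
        bp = {}
        for k in range(1, n + 1):
            x_k = words[k - 1]                   # x_k (the math above is 1-indexed)
            for u in tags_at(k - 1):
                for v in tags_at(k):
                    best_w, best_p = None, -1.0
                    for w in tags_at(k - 2):
                        p = pi[(k - 1, w, u)] * q.get((w, u, v), 0.0) * e.get((v, x_k), 0.0)
                        if p > best_p:
                            best_w, best_p = w, p
                    pi[(k, u, v)] = best_p
                    bp[(k, u, v)] = best_w

        # Best final tag pair (y_{n-1}, y_n), including the transition to STOP.
        u_best, v_best = max(
            ((u, v) for u in tags_at(n - 1) for v in tags_at(n)),
            key=lambda uv: pi[(n, uv[0], uv[1])] * q.get((uv[0], uv[1], "STOP"), 0.0),
        )
        y = [None] * (n + 1)                     # y[1..n]; index 0 is unused
        y[n - 1], y[n] = u_best, v_best
        for k in range(n - 2, 0, -1):            # y_k = bp(k + 2, y_{k+1}, y_{k+2})
            y[k] = bp[(k + 2, y[k + 1], y[k + 2])]
        return y[1:]

    With the toy q and e dictionaries from the joint-probability sketch above, viterbi(["the", "dog", "laughs"], {"D", "N", "V"}, q, e) returns ["D", "N", "V"]; missing q or e entries are treated as probability 0, so only the D N V path has non-zero probability under those toy parameters.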


  • Slide 37/38

    The Viterbi Algorithm: Running Time

    O(n|S|^3) time to calculate q(s | u, v) × e(x_k | s) for all k, s, u, v.

    n|S|^2 entries in π to be filled in.

    O(|S|) time to fill in one entry

    O(n|S|^3) time in total


  • Slide 38/38

    Pros and Cons

    Hidden Markov Model taggers are very simple to train (just need to compile counts from the training corpus)

    They perform relatively well (over 90% performance on named entity recognition)

    The main difficulty is modeling e(word | tag), which can be very difficult if words are complex