lectures tagging


TRANSCRIPT

  • Slide 1/38

    Tagging Problems, and Hidden Markov Models

    Michael Collins, Columbia University

  • Slide 2/38

    Overview

    The Tagging Problem

    Generative models, and the noisy-channel model, for supervised learning

    Hidden Markov Model (HMM) taggers: basic definitions, parameter estimation, the Viterbi algorithm

  • Slide 3/38

  • Slide 4/38

    Named Entity Recognition

    INPUT: Profits soared at Boeing Co., easily topping forecasts on Wall Street, as their CEO Alan Mulally announced first quarter results.

    OUTPUT: Profits soared at [Company Boeing Co.], easily topping forecasts on [Location Wall Street], as their CEO [Person Alan Mulally] announced first quarter results.

  • Slide 5/38

    Named Entity Extraction as Tagging

    INPUT: Profits soared at Boeing Co., easily topping forecasts on Wall Street, as their CEO Alan Mulally announced first quarter results.

    OUTPUT: Profits/NA soared/NA at/NA Boeing/SC Co./CC ,/NA easily/NA topping/NA forecasts/NA on/NA Wall/SL Street/CL ,/NA as/NA their/NA CEO/NA Alan/SP Mulally/CP announced/NA first/NA quarter/NA results/NA ./NA

    NA = No entity, SC = Start Company, CC = Continue Company, SL = Start Location, CL = Continue Location, . . .
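
    The start/continue tag scheme above maps straightforwardly back to entity spans. Here is a minimal Python sketch of that decoding step (the function name, dictionaries, and overall structure are illustrative assumptions, not code from the lecture):

    # Sketch: decode start/continue entity tags (SC, CC, SL, CL, SP, CP, NA)
    # back into (entity type, phrase) spans.
    ENTITY_OF_START = {"SC": "Company", "SL": "Location", "SP": "Person"}
    CONTINUE_OF = {"CC": "Company", "CL": "Location", "CP": "Person"}

    def decode_entities(tagged_words):
        """tagged_words: list of (word, tag) pairs, e.g. [("Boeing", "SC"), ("Co.", "CC")]."""
        spans, current_type, current_words = [], None, []
        for word, tag in tagged_words:
            if tag in ENTITY_OF_START:                    # a new entity starts here
                if current_type is not None:
                    spans.append((current_type, " ".join(current_words)))
                current_type, current_words = ENTITY_OF_START[tag], [word]
            elif tag in CONTINUE_OF and CONTINUE_OF[tag] == current_type:
                current_words.append(word)                # extend the open entity
            else:                                         # NA closes any open entity
                if current_type is not None:
                    spans.append((current_type, " ".join(current_words)))
                current_type, current_words = None, []
        if current_type is not None:
            spans.append((current_type, " ".join(current_words)))
        return spans

    print(decode_entities([("Boeing", "SC"), ("Co.", "CC"), (",", "NA"),
                           ("Wall", "SL"), ("Street", "CL")]))
    # [('Company', 'Boeing Co.'), ('Location', 'Wall Street')]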

  • Slide 6/38

  • Slide 7/38

    Two Types of Constraints

    Influential/JJ members/NNS of/IN the/DT House/NNP Ways/NNP and/CC Means/NNP Committee/NNP introduced/VBD legislation/NN that/WDT would/MD restrict/VB how/WRB the/DT new/JJ savings-and-loan/NN bailout/NN agency/NN can/MD raise/VB capital/NN ./.

    Local: e.g., "can" is more likely to be a modal verb MD rather than a noun NN

    Contextual: e.g., a noun is much more likely than a verb to follow a determiner

    Sometimes these preferences are in conflict:

    The trash can is in the garage

  • Slide 8/38

    Overview

    The Tagging Problem

    Generative models, and the noisy-channel model, for supervised learning

    Hidden Markov Model (HMM) taggers: basic definitions, parameter estimation, the Viterbi algorithm

  • Slide 9/38

    Supervised Learning Problems

    We have training examples x^(i), y^(i) for i = 1 . . . m. Each x^(i) is an input, each y^(i) is a label.

    Task is to learn a function f mapping inputs x to labels f(x)

  • Slide 10/38

    Supervised Learning Problems

    We have training examples x^(i), y^(i) for i = 1 . . . m. Each x^(i) is an input, each y^(i) is a label.

    Task is to learn a function f mapping inputs x to labels f(x)

    Conditional models: learn a distribution p(y|x) from training examples. For any test input x, define f(x) = arg max_y p(y|x)

  • Slide 11/38

    Generative Models

    We have training examples x^(i), y^(i) for i = 1 . . . m. Task is to learn a function f mapping inputs x to labels f(x).

  • Slide 12/38

    Generative Models

    We have training examples x^(i), y^(i) for i = 1 . . . m. Task is to learn a function f mapping inputs x to labels f(x).

    Generative models:

    Learn a distribution p(x, y) from training examples. Often we have p(x, y) = p(y) p(x|y)

  • Slide 13/38

    Generative Models

    We have training examples x^(i), y^(i) for i = 1 . . . m. Task is to learn a function f mapping inputs x to labels f(x).

    Generative models:

    Learn a distribution p(x, y) from training examples. Often we have p(x, y) = p(y) p(x|y)

    Note: we then have

    p(y|x) = p(y) p(x|y) / p(x)

    where p(x) = Σ_y p(y) p(x|y)
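
    As a quick numeric check of this identity, the following Python sketch recovers p(y|x) from p(y) and p(x|y) by marginalizing over y (the distributions are invented toy numbers, purely for illustration):

    # Toy check of p(y|x) = p(y) p(x|y) / p(x), with p(x) = sum over y of p(y) p(x|y).
    p_y = {"N": 0.6, "V": 0.4}                 # prior over labels, p(y)
    p_x_given_y = {"N": 0.002, "V": 0.010}     # p(x|y) for one fixed input x

    p_x = sum(p_y[y] * p_x_given_y[y] for y in p_y)                 # p(x) by marginalization
    posterior = {y: p_y[y] * p_x_given_y[y] / p_x for y in p_y}     # p(y|x)

    print(posterior)                  # {'N': 0.2307..., 'V': 0.7692...}
    print(sum(posterior.values()))    # 1.0 (up to floating point)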

  • Slide 14/38

    Decoding with Generative Models

    We have training examples x^(i), y^(i) for i = 1 . . . m. Task is to learn a function f mapping inputs x to labels f(x).

  • Slide 15/38

    Decoding with Generative Models

    We have training examples x^(i), y^(i) for i = 1 . . . m. Task is to learn a function f mapping inputs x to labels f(x).

    Generative models:

    Learn a distribution p(x, y) from training examples. Often we have p(x, y) = p(y) p(x|y)

  • Slide 16/38

    Decoding with Generative Models

    We have training examples x^(i), y^(i) for i = 1 . . . m. Task is to learn a function f mapping inputs x to labels f(x).

    Generative models:

    Learn a distribution p(x, y) from training examples. Often we have p(x, y) = p(y) p(x|y)

    Output from the model:

    f(x) = arg max_y p(y|x)
         = arg max_y p(y) p(x|y) / p(x)
         = arg max_y p(y) p(x|y)

    (the denominator p(x) does not depend on y, so it can be dropped from the arg max)
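
    In code, decoding with a generative model is a single arg max over labels. A minimal Python sketch, reusing the same kind of toy distributions as above (all values illustrative):

    # f(x) = arg max over y of p(y) p(x|y); p(x) is constant in y, so it is never computed.
    p_y = {"N": 0.6, "V": 0.4}
    p_x_given_y = {"N": 0.002, "V": 0.010}

    def decode(p_y, p_x_given_y):
        return max(p_y, key=lambda y: p_y[y] * p_x_given_y[y])

    print(decode(p_y, p_x_given_y))   # 'V'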

  • Slide 17/38

    Overview

    The Tagging Problem

    Generative models, and the noisy-channel model, for supervised learning

    Hidden Markov Model (HMM) taggers: basic definitions, parameter estimation, the Viterbi algorithm

  • Slide 18/38

    Hidden Markov Models

    We have an input sentence x = x_1, x_2, . . . , x_n (x_i is the i'th word in the sentence)

    We have a tag sequence y = y_1, y_2, . . . , y_n (y_i is the i'th tag in the sentence)

    We'll use an HMM to define

    p(x_1, x_2, . . . , x_n, y_1, y_2, . . . , y_n)

    for any sentence x_1 . . . x_n and tag sequence y_1 . . . y_n of the same length.

    Then the most likely tag sequence for x is

    arg max_{y_1 . . . y_n} p(x_1 . . . x_n, y_1, y_2, . . . , y_n)

  • Slide 19/38

    Trigram Hidden Markov Models (Trigram HMMs)

    For any sentence x_1 . . . x_n where x_i ∈ V for i = 1 . . . n, and any tag sequence y_1 . . . y_{n+1} where y_i ∈ S for i = 1 . . . n, and y_{n+1} = STOP, the joint probability of the sentence and tag sequence is

    p(x_1 . . . x_n, y_1 . . . y_{n+1}) = ∏_{i=1}^{n+1} q(y_i | y_{i-2}, y_{i-1}) × ∏_{i=1}^{n} e(x_i | y_i)

    where we have assumed that y_0 = y_{-1} = *.

    Parameters of the model:

    q(s | u, v) for any s ∈ S ∪ {STOP}, u, v ∈ S ∪ {*}

    e(x | s) for any s ∈ S, x ∈ V

  • Slide 20/38

    An Example

    If we have n = 3, x_1 . . . x_3 equal to the sentence the dog laughs, and y_1 . . . y_4 equal to the tag sequence D N V STOP, then

    p(x_1 . . . x_n, y_1 . . . y_{n+1})
    = q(D | *, *) × q(N | *, D) × q(V | D, N) × q(STOP | N, V)
      × e(the | D) × e(dog | N) × e(laughs | V)

    STOP is a special tag that terminates the sequence

    We take y_0 = y_{-1} = *, where * is a special padding symbol

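    The product above is easy to evaluate once the q and e parameters are given. Below is a minimal Python sketch of the trigram HMM joint probability; the parameter values are invented toy numbers, not estimates from any corpus:

    # Joint probability under a trigram HMM:
    # p(x_1..x_n, y_1..y_{n+1}) = prod_i q(y_i | y_{i-2}, y_{i-1}) * prod_i e(x_i | y_i),
    # with y_0 = y_{-1} = * and y_{n+1} = STOP. Parameter values below are toy numbers.
    q = {  # q(s | u, v), keyed by (u, v, s)
        ("*", "*", "D"): 0.8, ("*", "D", "N"): 0.7,
        ("D", "N", "V"): 0.3, ("N", "V", "STOP"): 0.6,
    }
    e = {  # e(x | s), keyed by (s, x)
        ("D", "the"): 0.6, ("N", "dog"): 0.01, ("V", "laughs"): 0.002,
    }

    def joint_prob(words, tags, q, e):
        """words: x_1..x_n; tags: y_1..y_{n+1}, ending in 'STOP'."""
        padded = ["*", "*"] + tags                 # prepend y_{-1} = y_0 = *
        p = 1.0
        for i, tag in enumerate(tags):             # transition terms, i = 1..n+1
            p *= q[(padded[i], padded[i + 1], tag)]
        for word, tag in zip(words, tags):         # emission terms, i = 1..n
            p *= e[(tag, word)]
        return p

    print(joint_prob(["the", "dog", "laughs"], ["D", "N", "V", "STOP"], q, e))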

  • Slide 21/38

    Why the Name?

    p(x_1 . . . x_n, y_1 . . . y_{n+1})

    = q(STOP | y_{n-1}, y_n) ∏_{j=1}^{n} q(y_j | y_{j-2}, y_{j-1})    (a Markov chain over the tags)

      × ∏_{j=1}^{n} e(x_j | y_j)    (the x_j's are observed)


  • Slide 22/38

    Overview

    The Tagging Problem

    Generative models, and the noisy-channel model, for supervised learning

    Hidden Markov Model (HMM) taggers: basic definitions, parameter estimation, the Viterbi algorithm


  • Slide 23/38

    Smoothed Estimation

    q(Vt | DT, JJ) = λ_1 × Count(DT, JJ, Vt) / Count(DT, JJ)
                   + λ_2 × Count(JJ, Vt) / Count(JJ)
                   + λ_3 × Count(Vt) / Count()

    λ_1 + λ_2 + λ_3 = 1, and for all i, λ_i ≥ 0

    e(base | Vt) = Count(Vt, base) / Count(Vt)
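
    A minimal Python sketch of these two estimators (the count dictionaries, λ values, and function names are toy placeholders for illustration):

    # Linearly interpolated trigram estimate:
    #   q(s | u, v) = λ1·Count(u, v, s)/Count(u, v) + λ2·Count(v, s)/Count(v) + λ3·Count(s)/Count()
    # plus the maximum-likelihood emission estimate e(x | s) = Count(s, x)/Count(s).
    lam1, lam2, lam3 = 0.5, 0.3, 0.2          # λ1 + λ2 + λ3 = 1, each λi ≥ 0

    def q_smoothed(s, u, v, c3, c2, c1, total):
        """c3/c2/c1: trigram/bigram/unigram tag counts; total: number of tag tokens."""
        return (lam1 * c3.get((u, v, s), 0) / max(c2.get((u, v), 0), 1)
                + lam2 * c2.get((v, s), 0) / max(c1.get(v, 0), 1)
                + lam3 * c1.get(s, 0) / total)

    def e_mle(x, s, emit_counts, tag_counts):
        """e(x | s) = Count(s, x) / Count(s)."""
        return emit_counts.get((s, x), 0) / tag_counts[s]

    # Toy counts, purely illustrative:
    c1 = {"DT": 1000, "JJ": 600, "Vt": 300}
    c2 = {("DT", "JJ"): 200, ("JJ", "Vt"): 50}
    c3 = {("DT", "JJ", "Vt"): 30}
    print(q_smoothed("Vt", "DT", "JJ", c3, c2, c1, total=10000))   # ≈ 0.106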


  • Slide 24/38

    Dealing with Low-Frequency Words: An Example

    Profits soared at Boeing Co. , easily topping forecasts on Wall Street , as their CEO Alan Mulally announced first quarter results .


  • Slide 25/38

    Dealing with Low-Frequency Words

    A common method is as follows:

    Step 1: Split vocabulary into two sets
        Frequent words = words occurring ≥ 5 times in training
        Low frequency words = all other words

    Step 2: Map low frequency words into a small, finite set, depending on prefixes, suffixes etc.


  • Slide 26/38

    Dealing with Low-Frequency Words: An Example

    [Bikel et al. 1999] (named-entity recognition)

    Word class              Example                  Intuition
    twoDigitNum             90                       Two digit year
    fourDigitNum            1990                     Four digit year
    containsDigitAndAlpha   A8956-67                 Product code
    containsDigitAndDash    09-96                    Date
    containsDigitAndSlash   11/9/89                  Date
    containsDigitAndComma   23,000.00                Monetary amount
    containsDigitAndPeriod  1.00                     Monetary amount, percent
    othernum                456789                   Other number
    allCaps                 BBN                      Organization
    capPeriod               M.                       Person name initial
    firstWord               first word of sentence   No useful capitalization info
    initCap                 Sally                    Capitalized word
    lowercase               can                      Uncapitalized word
    other                   ,                        Punctuation marks, all other words
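
    A sketch of how such a mapping might be implemented in Python (the class names follow the table above, but the regular expressions and the function itself are illustrative approximations, not Bikel et al.'s exact rules):

    import re

    def word_class(word, is_first_word=False):
        # Map a low-frequency word to one of the word classes in the table above.
        if re.fullmatch(r"\d{2}", word):                               return "twoDigitNum"
        if re.fullmatch(r"\d{4}", word):                               return "fourDigitNum"
        if re.search(r"\d", word) and re.search(r"[A-Za-z]", word):    return "containsDigitAndAlpha"
        if re.search(r"\d", word) and "-" in word:                     return "containsDigitAndDash"
        if re.search(r"\d", word) and "/" in word:                     return "containsDigitAndSlash"
        if re.search(r"\d", word) and "," in word:                     return "containsDigitAndComma"
        if re.search(r"\d", word) and "." in word:                     return "containsDigitAndPeriod"
        if re.fullmatch(r"\d+", word):                                 return "othernum"
        if re.fullmatch(r"[A-Z]+", word):                              return "allCaps"
        if re.fullmatch(r"[A-Z]\.", word):                             return "capPeriod"
        if is_first_word:                                              return "firstWord"
        if re.fullmatch(r"[A-Z][a-z]+", word):                         return "initCap"
        if word.islower():                                             return "lowercase"
        return "other"

    print([word_class(w) for w in ["90", "1990", "A8956-67", "BBN", "M.", "Sally", "can", ","]])
    # ['twoDigitNum', 'fourDigitNum', 'containsDigitAndAlpha', 'allCaps', 'capPeriod', 'initCap', 'lowercase', 'other']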


  • Slide 27/38

    Dealing with Low-Frequency Words: An Example

    Profits/NA soared/NA at/NA Boeing/SC Co./CC ,/NA easily/NA topping/NA forecasts/NA on/NA Wall/SL Street/CL ,/NA as/NA their/NA CEO/NA Alan/SP Mulally/CP announced/NA first/NA quarter/NA results/NA ./NA

    firstword/NA soared/NA at/NA initCap/SC Co./CC ,/NA easily/NA lowercase/NA forecasts/NA on/NA initCap/SL Street/CL ,/NA as/NA their/NA CEO/NA Alan/SP initCap/CP announced/NA first/NA quarter/NA results/NA ./NA

    NA = No entity, SC = Start Company, CC = Continue Company, SL = Start Location, CL = Continue Location, . . .


  • Slide 28/38

    Overview

    The Tagging Problem

    Generative models, and the noisy-channel model, for supervised learning

    Hidden Markov Model (HMM) taggers: basic definitions, parameter estimation, the Viterbi algorithm


  • Slide 29/38

    The Viterbi Algorithm

    Problem: for an input x_1 . . . x_n, find

    arg max_{y_1 . . . y_{n+1}} p(x_1 . . . x_n, y_1 . . . y_{n+1})

    where the arg max is taken over all sequences y_1 . . . y_{n+1} such that y_i ∈ S for i = 1 . . . n, and y_{n+1} = STOP.

    We assume that p again takes the form

    p(x_1 . . . x_n, y_1 . . . y_{n+1}) = ∏_{i=1}^{n+1} q(y_i | y_{i-2}, y_{i-1}) × ∏_{i=1}^{n} e(x_i | y_i)

    Recall that we have assumed in this definition that y_0 = y_{-1} = *, and y_{n+1} = STOP.

  • Slide 30/38

    The Viterbi Algorithm

  • Slide 31/38

    The Viterbi Algorithm

    Define n to be the length of the sentence

    Define S_k for k = −1 . . . n to be the set of possible tags at position k:

        S_{-1} = S_0 = {*}
        S_k = S for k ∈ {1 . . . n}

    Define

    r(y_{-1}, y_0, y_1, . . . , y_k) = ∏_{i=1}^{k} q(y_i | y_{i-2}, y_{i-1}) × ∏_{i=1}^{k} e(x_i | y_i)

    Define a dynamic programming table

    π(k, u, v) = maximum probability of a tag sequence ending in tags u, v at position k

    that is,

    π(k, u, v) = max_{⟨y_{-1}, y_0, y_1, . . . , y_k⟩ : y_{k-1} = u, y_k = v} r(y_{-1}, y_0, y_1, . . . , y_k)


  • Slide 32/38

    An Example

    π(k, u, v) = maximum probability of a tag sequence ending in tags u, v at position k

    The man saw the dog with the telescope


  • Slide 33/38

    A Recursive Definition

    Base case: π(0, *, *) = 1

    Recursive definition: for any k ∈ {1 . . . n}, for any u ∈ S_{k-1} and v ∈ S_k:

    π(k, u, v) = max_{w ∈ S_{k-2}} ( π(k−1, w, u) × q(v | w, u) × e(x_k | v) )


  • Slide 34/38

    Justification for the Recursive Definition

    For any k ∈ {1 . . . n}, for any u ∈ S_{k-1} and v ∈ S_k:

    π(k, u, v) = max_{w ∈ S_{k-2}} ( π(k−1, w, u) × q(v | w, u) × e(x_k | v) )

    The man saw the dog with the telescope


  • Slide 35/38

    The Viterbi Algorithm

    Input: a sentence x_1 . . . x_n, parameters q(s | u, v) and e(x | s).

    Initialization: Set π(0, *, *) = 1

    Definition: S_{-1} = S_0 = {*}, S_k = S for k ∈ {1 . . . n}

    Algorithm:

    For k = 1 . . . n,
        For u ∈ S_{k-1}, v ∈ S_k,
            π(k, u, v) = max_{w ∈ S_{k-2}} ( π(k−1, w, u) × q(v | w, u) × e(x_k | v) )

    Return max_{u ∈ S_{n-1}, v ∈ S_n} ( π(n, u, v) × q(STOP | u, v) )


  • Slide 36/38

    The Viterbi Algorithm with Backpointers

    Input: a sentence x_1 . . . x_n, parameters q(s | u, v) and e(x | s).

    Initialization: Set π(0, *, *) = 1

    Definition: S_{-1} = S_0 = {*}, S_k = S for k ∈ {1 . . . n}

    Algorithm:

    For k = 1 . . . n,
        For u ∈ S_{k-1}, v ∈ S_k,
            π(k, u, v) = max_{w ∈ S_{k-2}} ( π(k−1, w, u) × q(v | w, u) × e(x_k | v) )
            bp(k, u, v) = arg max_{w ∈ S_{k-2}} ( π(k−1, w, u) × q(v | w, u) × e(x_k | v) )

    Set (y_{n-1}, y_n) = arg max_{(u, v)} ( π(n, u, v) × q(STOP | u, v) )

    For k = (n − 2) . . . 1, y_k = bp(k + 2, y_{k+1}, y_{k+2})

    Return the tag sequence y_1 . . . y_n
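
    A compact Python sketch of this procedure (the q and e dictionaries are keyed as in the earlier joint-probability sketch, i.e. q[(u, v, s)] = q(s | u, v) and e[(s, x)] = e(x | s); all names are illustrative):

    def viterbi(words, S, q, e):
        # Trigram-HMM Viterbi decoding with backpointers, following the pseudocode above.
        n = len(words)
        def tags_at(k):                          # S_k = {*} for k <= 0, otherwise S
            return {"*"} if k <= 0 else S

        pi = {(0, "*", "*"): 1.0}                # base case: pi(0, *, *) = 1
        bp = {}
        for k in range(1, n + 1):
            x_k = words[k - 1]                   # x_k (the math above is 1-indexed)
            for u in tags_at(k - 1):
                for v in tags_at(k):
                    best_w, best_p = None, -1.0
                    for w in tags_at(k - 2):
                        p = pi[(k - 1, w, u)] * q.get((w, u, v), 0.0) * e.get((v, x_k), 0.0)
                        if p > best_p:
                            best_w, best_p = w, p
                    pi[(k, u, v)] = best_p
                    bp[(k, u, v)] = best_w

        # Best final tag pair (y_{n-1}, y_n), including the transition to STOP.
        u_best, v_best = max(
            ((u, v) for u in tags_at(n - 1) for v in tags_at(n)),
            key=lambda uv: pi[(n, uv[0], uv[1])] * q.get((uv[0], uv[1], "STOP"), 0.0),
        )
        y = [None] * (n + 1)                     # y[1..n]; index 0 is unused
        y[n - 1], y[n] = u_best, v_best
        for k in range(n - 2, 0, -1):            # y_k = bp(k + 2, y_{k+1}, y_{k+2})
            y[k] = bp[(k + 2, y[k + 1], y[k + 2])]
        return y[1:]

    With the toy q and e dictionaries from the joint-probability sketch above, viterbi(["the", "dog", "laughs"], {"D", "N", "V"}, q, e) returns ["D", "N", "V"]; missing q or e entries are treated as probability 0, so only the D N V path has non-zero probability under those toy parameters.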


  • Slide 37/38

    The Viterbi Algorithm: Running Time

    O(n|S|^3) time to calculate q(s | u, v) × e(x_k | s) for all k, s, u, v.

    n|S|^2 entries in π to be filled in.

    O(|S|) time to fill in one entry

    O(n|S|^3) time in total


  • Slide 38/38

    Pros and Cons

    Hidden Markov Model taggers are very simple to train (just need to compile counts from the training corpus)

    They perform relatively well (over 90% performance on named entity recognition)

    The main difficulty is modeling e(word | tag), which can be very difficult if words are complex