CSA3202 Human Language Technology
HMMs for POS Tagging
Acknowledgement
Slides adapted from lecture by Dan Jurafsky
Hidden Markov Model Tagging
Using an HMM to do POS tagging is a special case of Bayesian inference
Foundational work in computational linguistics:
- Bledsoe 1959: OCR
- Mosteller and Wallace 1964: authorship identification
It is also related to the "noisy channel" model that's the basis for ASR, OCR and MT
Noisy Channel Model for Machine Translation
(Brown et al. 1990)
[Diagram: source sentence → noisy channel → target sentence]
POS Tagging as Sequence Classification
Probabilistic view: given an observation sequence O = w1…wn, e.g. "Secretariat is expected to race tomorrow", what is the best sequence of tags T that corresponds to this sequence of observations?
Consider all possible sequences of tags
Out of this universe of sequences, choose the tag sequence which maximises P(T|O)
Getting to HMMs
i.e. out of all sequences of n tags t1…tn, find the single tag sequence such that P(t1…tn|w1…wn) is highest.
The hat ^ means "our estimate of the best one"
argmax_x f(x) means "the x such that f(x) is maximized"
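The equation this slide refers to (shown as an image in the original) can be reconstructed as:

$$\hat{t}_1^n = \operatorname*{argmax}_{t_1 \dots t_n} P(t_1 \dots t_n \mid w_1 \dots w_n)$$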
Getting to HMMs
But how to make it operational? How to compute this value?
Intuition of Bayesian classification: use Bayes' Rule to transform this equation into a set of other probabilities that are easier to compute
Bayes Rule
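The rule itself, which appears as an image in the original slide:

$$P(x \mid y) = \frac{P(y \mid x)\, P(x)}{P(y)}$$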
recall this picture ....
[Venn diagram: overlapping events A and B within the sample space]
Using Bayes Rule
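Applying Bayes' Rule to the tagging problem (the derivation shown as an image on this slide): since P(O) is the same for every candidate tag sequence, it drops out of the argmax:

$$\hat{T} = \operatorname*{argmax}_{T} \frac{P(O \mid T)\, P(T)}{P(O)} = \operatorname*{argmax}_{T} P(O \mid T)\, P(T)$$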
Two Probabilities Involved:
- Probability of a tag sequence
- Probability of a word sequence given a tag sequence
Probability of tag sequence
We are seeking, by the chain rule of probabilities (where 0 marks the beginning of the string):

$$P(t_1, \dots, t_i) = P(t_1 \mid 0)\; P(t_2 \mid t_1)\; P(t_3 \mid t_1, t_2) \cdots P(t_i \mid t_{i-1}, \dots, t_1)$$
Next we approximate this product …
Probability of tag sequence by N-gram Approximation
By employing the following assumption

$$P(t_i \mid t_{i-1}, \dots, t_1) \approx P(t_i \mid t_{i-1})$$

i.e. we approximate the entire left context by the previous tag. So the product is simplified to

$$P(t_1 \mid 0)\; P(t_2 \mid t_1)\; P(t_3 \mid t_2) \cdots P(t_i \mid t_{i-1})$$
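A minimal Python sketch of this bigram product, using invented toy probabilities (a real tagger would estimate these from a tagged corpus):

```python
# Bigram approximation of a tag-sequence probability.
# All probabilities below are made up for illustration only.
from functools import reduce

# Hypothetical transition probabilities P(tag | previous tag);
# "<s>" plays the role of 0, the beginning of the string.
trans = {
    ("<s>", "DT"): 0.5,
    ("DT", "JJ"): 0.2,
    ("JJ", "NN"): 0.6,
}

def tag_sequence_prob(tags):
    """P(t1..tn) ~= P(t1|<s>) * P(t2|t1) * ... * P(tn|tn-1)."""
    pairs = zip(["<s>"] + tags, tags)
    return reduce(lambda p, pair: p * trans[pair], pairs, 1.0)

print(tag_sequence_prob(["DT", "JJ", "NN"]))  # 0.5 * 0.2 * 0.6 = 0.06
```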
Generalizing the Approach
The same trick works for approximating $P(w_1^n \mid t_1^n)$:

$$P(w_1^n \mid t_1^n) \approx P(w_1 \mid t_1)\; P(w_2 \mid t_2) \cdots P(w_{n-1} \mid t_{n-1})\; P(w_n \mid t_n)$$
as seen on the next slide
Likelihood and Prior
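The formula on this slide (an image in the original) presumably combines the two approximations just derived:

$$\hat{t}_1^n = \operatorname*{argmax}_{t_1 \dots t_n} \prod_{i=1}^{n} \overbrace{P(w_i \mid t_i)}^{\text{likelihood}}\; \overbrace{P(t_i \mid t_{i-1})}^{\text{prior}}$$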
Two Kinds of Probabilities
1. Tag transition probabilities p(ti|ti-1)
2. Word likelihood probabilities p(wi|ti)
Estimating Probabilities
Tag transition probabilities p(ti|ti-1)
Determiners are likely to precede adjectives and nouns:
- That/DT flight/NN
- The/DT yellow/JJ hat/NN
So we expect P(NN|DT) and P(JJ|DT) to be high
Compute p(ti|ti-1) by counting in a labeled corpus:
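The counting formula (an image in the original slide) is the standard maximum-likelihood estimate:

$$P(t_i \mid t_{i-1}) = \frac{C(t_{i-1}, t_i)}{C(t_{i-1})}$$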
Example
Two Kinds of Probabilities
Word likelihood probabilities p(wi|ti)
VBZ (3sg pres verb) is likely to be "is"
Compute P(is|VBZ) by counting in a labeled corpus:
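Again the formula itself was an image; the standard counting estimate is:

$$P(w_i \mid t_i) = \frac{C(t_i, w_i)}{C(t_i)}$$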
Example: The Verb “race”
Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NR
People/NNS continue/VB to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN
How do we pick the right tag?
Disambiguating “race”
Example
P(NN|TO) = .00047
P(VB|TO) = .83
P(race|NN) = .00057
P(race|VB) = .00012
P(NR|VB) = .0027
P(NR|NN) = .0012
P(VB|TO) P(NR|VB) P(race|VB) = .00000027
P(NN|TO) P(NR|NN) P(race|NN) = .00000000032
So we (correctly) choose the verb reading
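A quick Python check of these two products (the probabilities themselves are the corpus-derived values quoted above, not computed here):

```python
# Multiply out the two competing readings for "race" after "to".
p_verb = 0.83 * 0.0027 * 0.00012     # P(VB|TO) * P(NR|VB) * P(race|VB)
p_noun = 0.00047 * 0.0012 * 0.00057  # P(NN|TO) * P(NR|NN) * P(race|NN)

print(f"verb reading: {p_verb:.2e}")  # ~2.7e-07
print(f"noun reading: {p_noun:.2e}")  # ~3.2e-10
print("choose:", "VB" if p_verb > p_noun else "NN")
```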
Hidden Markov Models
What we’ve described with these two kinds of probabilities is a Hidden Markov Model (HMM)
We approach the concept of an HMM via the simpler notions:
- Weighted Finite-State Automaton
- Markov Chain
WFSA Definition
A Weighted Finite-State Automaton (WFSA) adds probabilities to the arcs
The probabilities on the arcs leaving any state must sum to one
A Markov chain is a special case of a WFSA in which the input sequence uniquely determines which states the automaton will go through
Examples follow
Markov Chain for Pronunciation Variants
Markov Chain for Weather
Probability of a Sequence
A Markov chain assigns probabilities to unambiguous sequences
What is the probability of 3 consecutive rainy days?
The sequence is rainy-rainy-rainy, i.e. the state sequence is 0-3-3-3

$$P(0,3,3,3) = a_{03}\, a_{33}\, a_{33} = 0.2 \times (0.6)^2 = 0.072$$
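A one-line check of this arithmetic, with the transition values read off the weather chain:

```python
# a03 = P(rainy | start) = 0.2, a33 = P(rainy | rainy) = 0.6
a03, a33 = 0.2, 0.6
print(a03 * a33 * a33)  # ≈ 0.072
```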
Markov Chain: “First-order observable Markov Model”
A set of states Q = q1, q2…qN; the state at time t is qt
Transition probabilities: a set of probabilities A = a01, a02, …, an1, …, ann
Each aij represents the probability of transitioning from state i to state j
The set of these is the transition probability matrix A
Current state only depends on previous state
But ....
Markov chains can’t represent inherently ambiguous problems
For that we need a Hidden Markov Model
Hidden Markov Model
For Markov chains, the output symbols are the same as the states: see hot weather, and we're in state hot
But in part-of-speech tagging (and other things):
- The output symbols are words
- The hidden states are part-of-speech tags
So we need an extension! A Hidden Markov Model is an extension of a Markov chain in which the output symbols are not the same as the states.
This means we don't know which state we are in.
Hidden Markov Models
- States Q = q1, q2…qN
- Observations O = o1, o2…oN; each observation is a symbol from a vocabulary V = {v1, v2, …, vV}
- Transition probabilities: transition probability matrix A = {aij}
- Observation likelihoods: output probability matrix B = {bi(k)}
- Special initial probability vector π
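As a concrete sketch, these components for a toy two-tag POS model might look as follows in Python (all numbers invented for illustration):

```python
# Toy HMM for POS tagging: states are tags, observations are words.
# All probabilities below are made up for illustration only.
states = ["NN", "VB"]                # Q: hidden states (tags)
vocab = ["race", "flight"]           # V: observation symbols (words)

pi = {"NN": 0.7, "VB": 0.3}          # special initial probability vector
A = {                                # A[prev][next] = P(next | prev)
    "NN": {"NN": 0.4, "VB": 0.6},
    "VB": {"NN": 0.8, "VB": 0.2},
}
B = {                                # B[state][word] = P(word | state)
    "NN": {"race": 0.3, "flight": 0.7},
    "VB": {"race": 0.9, "flight": 0.1},
}
```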
Underlying Markov Chain is as before, but we need to add ...
Observation Likelihoods
Observation Likelihoods
Observation likelihoods P(O|S) link actual observations O with underlying states S
They are necessary whenever there is not a 1:1 correspondence between the two
Examples come from many different domains where there is uncertainty of interpretation:
- OCR
- Speech Recognition
- Spelling Correction
- POS tagging
Decoding
Recall that we need to compute
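From the earlier derivation, the quantity to compute (the equation is an image in the original slide) is presumably:

$$\hat{t}_1^n = \operatorname*{argmax}_{t_1 \dots t_n} \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1})$$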
We could just enumerate all paths given the input and use the model to assign probabilities to each
- Many paths!
- Only need to store the most probable path
Dynamic programming: the Viterbi Algorithm
Viterbi Algorithm
Inputs:
- HMM: Q (states), A (transition probabilities), B (observation likelihoods)
- O: observed words
Output:
- Most probable state/tag sequence, together with its probability
POS Transition Probabilities
POS Observation Likelihoods
Viterbi Summary
Create an array with columns corresponding to inputs (words) and rows corresponding to states (tags)
Sweep through the array in one pass, filling the columns left to right, using our transition probs and observation probs
Dynamic programming key: we need only store the MAX prob path to each cell (not all paths), as in the sketch below
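A minimal, self-contained Python sketch of this sweep (the toy model and its numbers are invented; a real tagger would estimate A and B by counting in a labeled corpus):

```python
def viterbi(obs, states, pi, A, B):
    """Return (best_prob, best_path) for the observation sequence."""
    # trellis[t][s] = max prob of any path ending in state s at time t
    trellis = [{s: pi[s] * B[s].get(obs[0], 0.0) for s in states}]
    backptr = [{}]

    for t in range(1, len(obs)):
        trellis.append({})
        backptr.append({})
        for s in states:
            # Best predecessor for state s at time t
            prev, prob = max(
                ((p, trellis[t - 1][p] * A[p][s]) for p in states),
                key=lambda x: x[1],
            )
            trellis[t][s] = prob * B[s].get(obs[t], 0.0)
            backptr[t][s] = prev

    # Pick the best final state and follow backpointers
    last = max(states, key=lambda s: trellis[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(backptr[t][path[-1]])
    path.reverse()
    return trellis[-1][last], path

# Toy model (made-up numbers)
states = ["NN", "VB", "TO"]
pi = {"NN": 0.4, "VB": 0.4, "TO": 0.2}
A = {
    "NN": {"NN": 0.3, "VB": 0.3, "TO": 0.4},
    "VB": {"NN": 0.4, "VB": 0.1, "TO": 0.5},
    "TO": {"NN": 0.1, "VB": 0.8, "TO": 0.1},
}
B = {
    "NN": {"race": 0.01},
    "VB": {"race": 0.02, "expected": 0.1},
    "TO": {"to": 0.99},
}

prob, path = viterbi(["to", "race"], states, pi, A, B)
print(path, prob)  # expect ['TO', 'VB'] to win, since P(VB|TO) is high
```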
Viterbi Example
[Figures: worked Viterbi trellis example, not reproduced in this transcript]
POS-Tagger Evaluation
Different metrics can be used:
- Overall error rate with respect to a gold-standard test set
- Error rates on particular tags: c(gold(tok)=t and tagger(tok)=t) / c(gold(tok)=t) (strictly, this ratio is the per-tag accuracy; the error rate is one minus it)
- Error rates on particular words
- Tag confusions...
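A small Python sketch of the per-tag ratio above, over hypothetical gold and predicted tags:

```python
# Per-tag accuracy (and error rate) from parallel gold/predicted lists.
from collections import Counter

gold = ["DT", "NN", "VB", "NN", "TO", "VB"]   # hypothetical gold standard
pred = ["DT", "NN", "NN", "NN", "TO", "VB"]   # hypothetical tagger output

gold_counts = Counter(gold)
correct = Counter(g for g, p in zip(gold, pred) if g == p)

for tag in gold_counts:
    acc = correct[tag] / gold_counts[tag]
    print(f"{tag}: accuracy {acc:.2f}, error rate {1 - acc:.2f}")
```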
Evaluation
The result is compared with a manually coded "Gold Standard"
Typically accuracy reaches 96-97%
This may be compared with the result for a baseline tagger (one that uses no context)
Important: 100% is impossible even for human annotators (the inter-annotator agreement problem)
Error Analysis
Look at a confusion matrix
See what errors are causing problems:
- Noun (NN) vs Proper Noun (NNP) vs Adjective (JJ)
- Preterite (VBD) vs Participle (VBN) vs Adjective (JJ)
POS Tagging Summary
- Parts of speech
- Tagsets
- Part-of-speech tagging: rule-based, HMM tagging
- Markov Chains
- Hidden Markov Models
Next: transformation-based tagging