Text Models Continued
HMM and PCFGs
Recap
• So far we have discussed two different models for text:
– Bag of Words (BOW), where we introduced TF-IDF
  • Location of words is not captured
– Collocation (n-grams)
  • Location of words is captured, along with relations between words in a “sliding window” of size n
  • Problem: small n is not realistic enough; for large n there are no statistics
Today
• Two models that allow us to capture language syntax, with probabilities for words in different parts of speech
• The first is based on a Markov Chain (a Probabilistic Finite State Machine), the second on a Probabilistic Context-Free Grammar
Example Application: Part-of-Speech Tagging
• Goal: for every word in a sentence, infer its part of speech
– Verb, noun, adjective…
• Perhaps we are interested (more) in documents where a given word W is often the sentence subject?
• Part-of-speech tagging is useful for:
– Ranking
– Machine translation
– Word-sense disambiguation
– …
Part-of-Speech Tagging
• Tag this word. This word is a tag.
• He dogs like a flea
• The can is in the fridge
• The sailor dogs me every day
A Learning Problem
• Training set: a tagged corpus
– Most famous is the Brown Corpus, with about 1M words
• The goal is to learn a model from the training set, and then perform tagging of untagged text
• Performance is tested on a test set
Different Learning Approaches
• Supervised: Training corpus is tagged by humans
• Unsupervised: Training corpus isn’t tagged
• Partly supervised: Training corpus isn’t tagged, but we have a dictionary giving possible tags for each word
Training set and test set
• In a learning problem, the common practice is to split the annotated examples into a training set and a test set
• The training set is used to “train” the model, e.g. to learn its parameters
• The test set is used to test how well the model is doing
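A minimal sketch of such a split (the 90/10 ratio is an arbitrary choice), using NLTK's copy of the Brown corpus mentioned above:

```python
# Illustrative train/test split of a tagged corpus.
import random
from nltk.corpus import brown

sents = list(brown.tagged_sents())
random.seed(0)                      # make the split reproducible
random.shuffle(sents)
cut = int(0.9 * len(sents))
train_set, test_set = sents[:cut], sents[cut:]
```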
Simple Algorithm
• Assign to each word its most popular tag in the training set (a code sketch follows below)
• Bag-of-words approach
• Problem: Ignores context
• “Dogs” and “tag” will then always be tagged as nouns…
• “Can” will always be tagged as a verb
• Still, this achieves around 80% correctness on real-life test sets
• Can we do better?
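A minimal sketch of this baseline (function names are illustrative):

```python
# Baseline tagger: each word gets its most frequent tag from the training set.
from collections import Counter, defaultdict

def train_baseline(tagged_sents):
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def tag_baseline(words, most_common_tag, default="NN"):
    # Unknown words fall back to a default tag (noun is a common choice).
    return [(w, most_common_tag.get(w, default)) for w in words]
```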
Hidden Markov Model (HMM)
• Model: sentences are generated by a probabilistic process
• In particular, a Markov Chain whose states correspond to Parts-of-Speech
• Transitions are probabilistic
• In each state a word is output
– The output word is again chosen probabilistically, based on the state
HMM
• An HMM is:
– A set of N states
– A set of M symbols (words)
– An N×N matrix of transition probabilities Ptrans
– A vector of size N of initial state probabilities Pstart
– An N×M matrix of emission probabilities Pout (a sketch as arrays follows below)
• “Hidden” because we see only the outputs, not the sequence of states traversed
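A sketch of the five components as arrays (numpy; the uniform values are placeholders, not learned probabilities):

```python
import numpy as np

N, M = 3, 5                          # e.g. 3 tags, a 5-word vocabulary
Pstart = np.full(N, 1.0 / N)         # initial state probabilities (size N)
Ptrans = np.full((N, N), 1.0 / N)    # transition probabilities (N x N)
Pout   = np.full((N, M), 1.0 / M)    # emission probabilities (N x M)

assert np.allclose(Ptrans.sum(axis=1), 1.0)   # each row is a distribution
assert np.allclose(Pout.sum(axis=1), 1.0)
```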
Example
3 Fundamental Problems
1) Given an observation sequence, find the most likely hidden state sequence: this is tagging
2) Compute the probability of a given observation sequence (= sentence)
3) Given a training set, find the model that would make the observations most likely
Tagging
• Find the most likely sequence of states that led to an observed output sequence
– We know what the output (= sentence) is, and want to trace back its generation, i.e. the sequence of states taken in the hidden model
– Each state is a tag
• How many possible sequences are there?
Viterbi Algorithm
• Dynamic Programming
• Vt,k is the probability of the most probable state sequence:
– generating the first t + 1 observations (X0,…,Xt)
– and terminating at state k

V0,k = Pstart(k) · Pout(k, X0)
Vt,k = Pout(k, Xt) · maxk’ {Vt−1,k’ · Ptrans(k’, k)}
Finding the path
• Note that we are interested in the most likely path, not only in its probability
• So we need to keep track at each point of the argmax– Combine them to form a sequence
• What about top-k?
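Combining the recursion with the backpointer bookkeeping just described, a minimal sketch (assuming the Pstart/Ptrans/Pout arrays from the earlier sketch, and a sentence given as a list of symbol indices):

```python
import numpy as np

def viterbi(obs, Pstart, Ptrans, Pout):
    N, T = Ptrans.shape[0], len(obs)
    V = np.zeros((T, N))              # V[t, k]: best prob ending in state k at t
    back = np.zeros((T, N), dtype=int)

    V[0] = Pstart * Pout[:, obs[0]]
    for t in range(1, T):
        for k in range(N):
            scores = V[t - 1] * Ptrans[:, k]       # V[t-1, k'] * Ptrans(k', k)
            back[t, k] = np.argmax(scores)
            V[t, k] = Pout[k, obs[t]] * scores[back[t, k]]

    # Follow the backpointers from the best final state.
    path = [int(np.argmax(V[T - 1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return list(reversed(path)), V[T - 1].max()
```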
Complexity
• O(T*|S|^2)
• Where T is the sequence (=sentence) length, |S| is the number of states (= number of possible tags)
Forward Algorithm
• αt(k) is the probability of seeing the sequence X0…Xt and terminating at state k
Computing the probabilities
α0(k) = Pstart(k) · Pout(k, X0)
αt(k) = Pout(k, Xt) · Σk’ {αt−1(k’) · Ptrans(k’, k)}
P(X0,…,Xn) = Σk αn(k)
Learning the HMM probabilities
• Expectation-Maximization Algorithm (Baum-Welch)
1. Choose initial probabilities
2. Compute Eij, the expected number of transitions from i to j while generating the training sequences, for each i, j (see next)
3. Set the probability of transition from i to j to be Eij / (Σk Eik)
4. Similarly for the emission probabilities
5. Repeat 2-4 using the new model, until convergence
Forward-backward
• Forward probabilities: αt(k) is the probability of seeing the sequence X0…Xt and terminating at state k
• Backward probabilities: βt(k) is the probability of seeing the sequence Xt+1…Xn given that the Markov process is at state k at time t
Computing the probabilities

Forward algorithm:
α0(k) = Pstart(k) · Pout(k, X0)
αt(k) = Pout(k, Xt) · Σk’ {αt−1(k’) · Ptrans(k’, k)}
P(X0,…,Xn) = Σk αn(k)

Backward algorithm:
βt(k) = P(Xt+1…Xn | state at time t is k)
βt(k) = Σk’ {Ptrans(k, k’) · Pout(k’, Xt+1) · βt+1(k’)}
βn(k) = 1 for all k
P(X0,…,Xn) = Σk β0(k) · Pstart(k) · Pout(k, X0)
Also: P(X0,…,Xn) = Σk αt(k) · βt(k), for any t
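Both passes translate directly into code; a sketch in the same array notation (names as in the earlier sketches):

```python
import numpy as np

def forward(obs, Pstart, Ptrans, Pout):
    N, T = Ptrans.shape[0], len(obs)
    alpha = np.zeros((T, N))
    alpha[0] = Pstart * Pout[:, obs[0]]
    for t in range(1, T):
        # alpha_t(k) = Pout(k, X_t) * sum_k' alpha_{t-1}(k') Ptrans(k', k)
        alpha[t] = Pout[:, obs[t]] * (alpha[t - 1] @ Ptrans)
    return alpha                      # P(X0..Xn) = alpha[-1].sum()

def backward(obs, Ptrans, Pout):
    N, T = Ptrans.shape[0], len(obs)
    beta = np.zeros((T, N))
    beta[T - 1] = 1.0                 # base case for the last position
    for t in range(T - 2, -1, -1):
        # beta_t(k) = sum_k' Ptrans(k, k') Pout(k', X_{t+1}) beta_{t+1}(k')
        beta[t] = Ptrans @ (Pout[:, obs[t + 1]] * beta[t + 1])
    return beta
```

The identity P(X0,…,Xn) = Σk αt(k) · βt(k) gives a handy sanity check: (alpha * beta).sum(axis=1) should come out the same for every t.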
Estimating Parameters
• The probability Ψij(t) of traversing the edge from i to j at time t, given the observation sequence X:

Ψij(t) = αt(i) · Ptrans(i, j) · Pout(j, Xt+1) · βt+1(j) / P(X0,…,Xn)

Expected number of transitions:

Eij = Σt Ψij(t)

And Eij / (Σk Eik) is the newly used transition probability on the edge from i to j!
Similarly for output probabilities.
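Putting the pieces together, a sketch of one transition re-estimation step (single training sequence; reuses the forward/backward sketches above, and assumes every state is visited):

```python
import numpy as np

def update_transitions(obs, Ptrans, Pout, alpha, beta):
    T, N = alpha.shape
    total = alpha[-1].sum()           # P(X0..Xn)
    E = np.zeros((N, N))              # E[i, j]: expected i -> j transitions
    for t in range(T - 1):
        # psi[i, j] = alpha_t(i) Ptrans(i,j) Pout(j, X_{t+1}) beta_{t+1}(j) / P
        psi = (alpha[t][:, None] * Ptrans *
               (Pout[:, obs[t + 1]] * beta[t + 1])[None, :]) / total
        E += psi
    return E / E.sum(axis=1, keepdims=True)   # new Ptrans: Eij / sum_k Eik
```

Iterating this update (together with the analogous emission update) until convergence is exactly the Baum-Welch loop described above.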
Accuracy
• Tested experimentally
• Reaches 96-98% for the Brown corpus
– Trained on one half and tested on the other half
• Compare with the 80% of the trivial algorithm
• The hard cases are few, but they are very hard…
NLTK
• http://www.nltk.org/
• Natural Language Toolkit
• Open-source Python modules for NLP tasks
– Including stemming, POS tagging and much more
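A short usage sketch (the tagger models must be downloaded first, and the exact tags shown are illustrative; NLTK also ships an HMM tagger in nltk.tag.hmm):

```python
import nltk
# One-time setup, e.g.:
#   nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')

words = nltk.word_tokenize("The sailor dogs me every day")
print(nltk.pos_tag(words))
# e.g. [('The', 'DT'), ('sailor', 'NN'), ('dogs', 'VBZ'), ('me', 'PRP'), ...]
```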
PCFG
• A PCFG is a tuple (N, Σ, N1, R, P):
– N is a set of non-terminals {Ni}
– Σ is a set of terminals {w^k}
– N1 is the start symbol
– R is a set of rules
– P is the set of probabilities on rules
• We assume the PCFG is in Chomsky Normal Form
• Parsing algorithms:
– Earley (top-down)
– CYK (bottom-up)
– …
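As a toy illustration (the grammar and its probabilities are invented), NLTK can represent a PCFG directly, and its ViterbiParser computes the most probable parse:

```python
import nltk

grammar = nltk.PCFG.fromstring("""
    S  -> NP VP    [1.0]
    NP -> DT NN    [0.7]
    NP -> 'he'     [0.3]
    VP -> VB NP    [1.0]
    DT -> 'the'    [1.0]
    NN -> 'can'    [0.6]
    NN -> 'fridge' [0.4]
    VB -> 'sees'   [1.0]
""")
parser = nltk.ViterbiParser(grammar)
for tree in parser.parse("he sees the can".split()):
    print(tree)   # the most probable parse, annotated with its probability
```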
PFSA vs. PCFG
• A PFSA can be seen as a special case of PCFG:
– State ↔ non-terminal
– Output symbol ↔ terminal
– Arc ↔ context-free rule
– Path ↔ parse tree (only right-branching binary trees)

[Figure: the PFSA path S1 →a S2 →b S3 and the equivalent right-branching parse tree, with rules S1 → a S2, S2 → b S3, S3 → ε]
PFSA and HMM
Add a “Start” state and a transition from “Start” to every state in the HMM.
Add a “Finish” state and a transition from every state in the HMM to “Finish”.

[Figure: an HMM with the added Start and Finish states]
The connection between two algorithms
• An HMM can (almost) be converted to a PFSA.
• A PFSA is a special case of PCFG.
• Inside-outside is an algorithm for PCFGs.
⇒ The inside-outside algorithm will work for HMMs.
• Forward-backward is an algorithm for HMMs.
In fact, the inside-outside algorithm is the same as forward-backward when the PCFG is a PFSA.
Forward and backward probabilities
[Figure: an HMM trellis X1 … Xt … Xn Xn+1 with outputs O1 … Ot−1 Ot … On. The forward probability αt(i) covers the outputs up to time t with Xt = i; the backward probability βt(i) covers the remaining outputs given Xt = i.]
Backward/forward prob vs. Inside/outside prob
[Figure: for a PFSA viewed as a PCFG, the forward probability αt(i) corresponds to the outside probability αj(p,q), and the backward probability βt(i) corresponds to the inside probability βj(p,q).]

PFSA: Forward / Backward
PCFG: Outside / Inside

Notation
[Figure: N1 generates the sentence w1…wm; the outside probability αj(p,q) covers everything outside wp…wq, and the inside probability βj(p,q) covers Nj generating wp…wq.]
Inside and outside probabilities
Definitions
• Inside probability: the total probability of generating the words wp…wq from the non-terminal Nj:

βj(p,q) = P(wpq | Nj_pq)

• Outside probability: the total probability of beginning with the start symbol N1 and generating Nj_pq and all the words outside wp…wq:

αj(p,q) = P(w1(p−1), Nj_pq, w(q+1)m)

• When p > q: αj(p,q) = βj(p,q) = 0

(Nj_pq denotes the non-terminal Nj spanning the words wp…wq.)
Calculating inside probability (CYK algorithm)
• Base case: βj(k,k) = P(Nj → wk)
• Recursive case:

βj(p,q) = Σr,s Σd=p..q−1 P(Nj → Nr Ns) · βr(p,d) · βs(d+1,q)

[Figure: Nj dominates wp…wq, split at d into Nr over wp…wd and Ns over wd+1…wq]
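A sketch of this bottom-up pass in Python (0-indexed spans; the rule-table representation, with binary_rules[(j, r, s)] = P(Nj → Nr Ns) and lexical[(j, word)] = P(Nj → word), is an illustrative choice):

```python
from collections import defaultdict

def inside(words, binary_rules, lexical, nonterminals):
    """Inside pass for a CNF PCFG; beta[(j, p, q)] = beta_j(p, q)."""
    m = len(words)
    beta = defaultdict(float)
    # Base case: beta_j(k, k) = P(N_j -> w_k)
    for k in range(m):
        for j in nonterminals:
            beta[(j, k, k)] = lexical.get((j, words[k]), 0.0)
    # Recursive case, shorter spans first.
    for span in range(1, m):
        for p in range(m - span):
            q = p + span
            for (j, r, s), prob in binary_rules.items():
                total = sum(beta[(r, p, d)] * beta[(s, d + 1, q)]
                            for d in range(p, q))
                beta[(j, p, q)] += prob * total
    return beta   # sentence probability: beta[(start, 0, m - 1)]
```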
Calculating outside probability (case 1)
αj(p,q) = Σf,g Σe=q+1..m αf(p,e) · P(Nf → Nj Ng) · βg(q+1,e)

[Figure: N1 generates w1…wm; Nf spans wp…we and rewrites as Nj Ng, with Nj over wp…wq and Ng over wq+1…we]
Calculating outside probability (case 2)
αj(p,q) = Σf,g Σe=1..p−1 αf(e,q) · P(Nf → Ng Nj) · βg(e,p−1)

[Figure: N1 generates w1…wm; Nf spans we…wq and rewrites as Ng Nj, with Ng over we…wp−1 and Nj over wp…wq]
Outside probability
Combining the two cases:

αj(p,q) = Σf,g Σe=q+1..m αf(p,e) · P(Nf → Nj Ng) · βg(q+1,e)
        + Σf,g Σe=1..p−1 αf(e,q) · P(Nf → Ng Nj) · βg(e,p−1)

• Base case: α1(1,m) = 1, and αj(1,m) = 0 for j ≠ 1
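A matching sketch of the top-down outside pass, reusing the inside chart; it pushes each parent's outside mass down to its two children (same illustrative representation as above):

```python
from collections import defaultdict

def outside(m, binary_rules, beta, start):
    """Outside pass; alpha[(j, p, q)] = alpha_j(p, q), using the inside chart beta."""
    alpha = defaultdict(float)
    alpha[(start, 0, m - 1)] = 1.0        # base case: alpha_1(1, m) = 1
    # Process parent spans from longest to shortest, pushing probability down.
    for span in range(m - 1, 0, -1):
        for p in range(m - span):
            q = p + span
            for (f, j, g), prob in binary_rules.items():
                a = alpha[(f, p, q)]
                if a == 0.0:
                    continue
                for d in range(p, q):
                    # Left child N_j spans (p, d); right child N_g spans (d+1, q).
                    alpha[(j, p, d)] += a * prob * beta[(g, d + 1, q)]
                    alpha[(g, d + 1, q)] += a * prob * beta[(j, p, d)]
    return alpha
```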
Probability of a sentence
P(w1m) = β1(1,m)
P(w1m) = Σj αj(k,k) · P(Nj → wk), for any k
P(w1m, Nj_pq) = αj(p,q) · βj(p,q)
Recap so far
• Inside probability: bottom-up
• Outside probability: top-down using the same chart.
• Probability of a sentence can be calculated in many ways.
Expected counts and update formulae
The probability that a binary rule is used:

P(Nj → Nr Ns, used for span (p,q) | w1m)
  = αj(p,q) · Σd=p..q−1 P(Nj → Nr Ns) · βr(p,d) · βs(d+1,q) / P(w1m)

Summing over all spans:

P(Nj → Nr Ns | w1m)
  = Σp=1..m Σq=p+1..m αj(p,q) · Σd=p..q−1 P(Nj → Nr Ns) · βr(p,d) · βs(d+1,q) / P(w1m)    (1)
The probability that Nj is used:

P(w1m, Nj_pq) = αj(p,q) · βj(p,q)

P(Nj used for span (p,q) | w1m) = αj(p,q) · βj(p,q) / P(w1m)

P(Nj | w1m) = Σp=1..m Σq=p..m αj(p,q) · βj(p,q) / P(w1m)    (2)
Dividing (1) by (2) gives the re-estimated rule probability:

P(Nj → Nr Ns | Nj used, w1m) = (1) / (2)
  = Σp Σq αj(p,q) · Σd=p..q−1 P(Nj → Nr Ns) · βr(p,d) · βs(d+1,q)
    / Σp Σq αj(p,q) · βj(p,q)
The probability that a unary rule is used:

P(Nj → w^k, Nj used | w1m) = Σh: wh = w^k αj(h,h) · βj(h,h) / P(w1m)    (3)

Dividing (3) by (2):

P(Nj → w^k | Nj used, w1m) = (3) / (2)
  = Σh: wh = w^k αj(h,h) · βj(h,h) / Σp Σq αj(p,q) · βj(p,q)
Multiple training sentences
For each training sentence Wi, define:

hi(j) = P(Nj | Wi) = Σp Σq αj(p,q) · βj(p,q) / P(Wi)

fi(p,q,j,r,s) = P(Nj → Nr Ns, used for span (p,q) | Wi)
  = αj(p,q) · Σd=p..q−1 P(Nj → Nr Ns) · βr(p,d) · βs(d+1,q) / P(Wi)

The re-estimated rule probability pools the expected counts over all sentences:

P(Nj → Nr Ns) = Σi Σp Σq fi(p,q,j,r,s) / Σi hi(j)
Inner loop of the Inside-outside algorithm
Given an input sequence and the current parameters:

1. Calculate the inside probabilities:
• Base case: βj(k,k) = P(Nj → wk)
• Recursive case:
βj(p,q) = Σr,s Σd=p..q−1 P(Nj → Nr Ns) · βr(p,d) · βs(d+1,q)

2. Calculate the outside probabilities:
• Base case: α1(1,m) = 1, and αj(1,m) = 0 for j ≠ 1
• Recursive case:
αj(p,q) = Σf,g Σe=q+1..m αf(p,e) · P(Nf → Nj Ng) · βg(q+1,e)
        + Σf,g Σe=1..p−1 αf(e,q) · P(Nf → Ng Nj) · βg(e,p−1)
Inside-outside algorithm (cont)
3. Collect the counts, using (1) and (3):

Cnt(Nj → Nr Ns) = P(Nj → Nr Ns | w1m)
  = Σp Σq αj(p,q) · Σd=p..q−1 P(Nj → Nr Ns) · βr(p,d) · βs(d+1,q) / P(w1m)

Cnt(Nj → w^k) = P(Nj → w^k, Nj used | w1m)
  = Σh: wh = w^k αj(h,h) · βj(h,h) / P(w1m)

4. Normalize and update the parameters:
P(Nj → Nr Ns) = Cnt(Nj → Nr Ns) / Σr’,s’ Cnt(Nj → Nr’ Ns’)

P(Nj → w^k) = Cnt(Nj → w^k) / Σk’ Cnt(Nj → w^k’)
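The normalization in step 4 is just a per-left-hand-side renormalization of the expected counts; a minimal sketch (the rule table mapping (lhs, rhs) to an expected count is an illustrative representation):

```python
from collections import defaultdict

def normalize(counts):
    """Turn expected rule counts into probabilities, per left-hand side."""
    totals = defaultdict(float)
    for (lhs, rhs), c in counts.items():
        totals[lhs] += c
    return {(lhs, rhs): c / totals[lhs] for (lhs, rhs), c in counts.items()}
```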
Relation to EM
• PCFG is a PM (Product of Multinomials) model
• The inside-outside algorithm is a special case of the EM algorithm for PM models
• X (observed data): each data point is a sentence w1m
• Y (hidden data): the parse tree Tr
• Θ (parameters): the rule probabilities P(Nj → Nr Ns) and P(Nj → w^k)
Relation to EM (cont)
The expected counts of the E-step are exactly the quantities computed by inside-outside:

count(Nj → Nr Ns) = ΣY P(Y | X, Θ) · count(X, Y, Nj → Nr Ns)
  = ΣTr P(Tr | w1m) · count(Tr, w1m, Nj → Nr Ns)
  = Σp Σq P(Nj → Nr Ns, used for span (p,q) | w1m, Θ)

count(Nj → w^k) = ΣY P(Y | X, Θ) · count(X, Y, Nj → w^k)
  = ΣTr P(Tr | w1m) · count(Tr, w1m, Nj → w^k)
  = Σh P(Nj → w^k, Nj used at position h | w1m, Θ)
Summary
[Figure: the HMM trellis (left) and a PCFG subtree (right) side by side]

HMM: aij = P(Xt+1 = j | Xt = i), bijk = P(Ot = w^k | Xt = i, Xt+1 = j)
PCFG: outside probability αj(p,q), inside probability βj(p,q), rule probabilities P(Nj → Nr Ns) and P(Nj → w^k)
Summary (cont)
• Topology is known:– (states, arcs, output symbols) in HMM– (non-terminals, rules, terminals) in PCFG
• Probabilities of arcs/rules are unknown.
• Estimating probs using EM (introducing hidden data Y)
Converting HMM to PCFG
• Given an HMM = (S, Σ, π, A, B), create a PCFG = (S1, Σ1, S0, R, P) as follows:
– S1 = {Ni} ∪ {Dij | i,j ∈ [1,N]} ∪ {Start, D0}
– Σ1 = Σ ∪ {BOS, EOS}
– S0 = Start
– R = {Start → D0 Ni} ∪ {Ni → Dij Nj | i,j ∈ [1,N]} ∪ {D0 → BOS} ∪ {Dij → w^k} ∪ {Ni → EOS}
– P:
P(Start → D0 Ni) = πi
P(Ni → Dij Nj) = aij
P(Dij → w^k) = bijk
P(D0 → BOS) = 1
P(Ni → EOS) = 1
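A sketch of this construction in code (pi, A, B are the HMM parameters π, aij, bijk; the dictionary rule table is an illustrative representation):

```python
def hmm_to_pcfg(pi, A, B, vocab):
    """Build the PCFG rule table from HMM parameters, mirroring the slide."""
    N = len(pi)
    rules = {}                                   # (lhs, rhs-tuple) -> probability
    rules[("D0", ("BOS",))] = 1.0                # P(D0 -> BOS) = 1
    for i in range(N):
        rules[("Start", ("D0", f"N{i}"))] = pi[i]            # P(Start -> D0 Ni) = pi_i
        rules[(f"N{i}", ("EOS",))] = 1.0                     # P(Ni -> EOS) = 1
        for j in range(N):
            rules[(f"N{i}", (f"D{i}{j}", f"N{j}"))] = A[i][j]    # a_ij
            for k, w in enumerate(vocab):
                rules[(f"D{i}{j}", (w,))] = B[i][j][k]           # b_ijk
    return rules
```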
Path ↔ Parse tree

[Figure: the HMM path X1 X2 … XT XT+1 emitting o1 o2 … oT corresponds to the right-branching parse tree Start → D0 X1, D0 → BOS, X1 → D12 X2, D12 → o1, …, XT → DT,T+1 XT+1, DT,T+1 → oT, XT+1 → EOS]
Outside probability
[Figure: outside probabilities in the converted PCFG. The outside probability of Ni is non-zero only for spans ending at T, where αi(t,T) corresponds to the forward probability αt(i) (renaming (j,i), (p,t)). The outside probability of Dji spans a single position (q = p) and is built from the forward probability αt−1(j) and the arc probability aji.]
Inside probability
[Figure: inside probabilities in the converted PCFG. The inside probability of Ni over a span (t,T) corresponds to the backward probability βt(i), and the inside probability of Dij over a single position t is the arc emission probability b_ij,ot.]
Estimating P(Ni → Dij Nj, used at time t | o1T)

Start from formula (1) and rename (j,i), (s,j), (p,t), (m,T). In the converted grammar only the split d = t and the span end q = T survive, and βDij(t,t) = b_ij,ot, so:

P(Ni → Dij Nj, used at time t | o1T) = αt(i) · aij · b_ij,ot · βt+1(j) / P(o1T)

This is exactly the edge probability Ψij(t) of the forward-backward algorithm.
Estimating P(Ni → Dij Nj | o1T)

With the same renaming (j,i), (s,j), (p,t), (m,T), the summed form of (1) becomes:

P(Ni → Dij Nj | o1T) = Σt P(Ni → Dij Nj, used at time t | o1T) = Σt Ψij(t)
Estimating P(Ni | O1T)

Renaming (j,i), (s,j), (p,t), (m,T) in formula (2):

P(Ni | O1T) = Σt αt(i) · βt(i) / P(O1T)
Calculating P(Ni → Dij Nj | Ni, O1T)

P(Nj → Nr Ns | Nj, w1m) = P(Nj → Nr Ns | w1m) / P(Nj | w1m)

Renaming (j,i), (s,j), (w,o), (m,T):

P(Ni → Dij Nj | Ni, O1T) = Σt Ψij(t) / (Σt αt(i) · βt(i) / P(O1T)) = âij

This is exactly the Baum-Welch re-estimate of the transition probability aij.
Calculating P(Dij → w^k | Dij, O1T)

Start from the unary-rule re-estimate (3)/(2):

P(Nj → w^k | Nj, w1m) = Σh: wh = w^k αj(h,h) · βj(h,h) / Σp Σq αj(p,q) · βj(p,q)

Renaming (j, i_j), (s,j), (p,t), (h,t), (m,T), (w,O), (N,D), with αDij(t,t) = αt(i) · aij · βt+1(j) and βDij(t,t) = b_ij,Ot:

P(Dij → w^k | Dij, O1T)
  = Σt: Ot = w^k αt(i) · aij · b_ij,Ot · βt+1(j) / Σt αt(i) · aij · b_ij,Ot · βt+1(j)
  = b̂ijk

So the inside-outside re-estimates coincide exactly with the forward-backward (Baum-Welch) re-estimates of aij and bijk.