Natural Language Processing - Lecture 5: POS Tagging Algorithms
Natural Language Processing - Lecture 5
POS Tagging Algorithms
Ido Dagan
Department of Computer Science, Bar-Ilan University
Supervised Learning Scheme
[Diagram] Training: “Labeled” Examples → Training Algorithm → Classification Model.
Classification: New Examples + Classification Model → Classification Algorithm → Classifications.
Transformation-Based Learning (TBL) for Tagging
• Introduced by Brill (1995)
• Can exploit a wider range of lexical and syntactic regularities via transformation rules – triggering environment and rewrite rule
• Tagger (see the sketch below):
– Construct an initial tag sequence for the input – the most frequent tag for each word
– Iteratively refine the tag sequence by applying “transformation rules” in rank order
• Learner:
– Construct an initial tag sequence for the training corpus
– Loop until done:
• Try all possible rules and compare to the known tags; apply the best rule r* to the sequence and add it to the rule ranking
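A minimal sketch of this tagging scheme, assuming rules are (old_tag, new_tag, trigger) triples where `trigger` is a predicate over the current tag sequence and a position (the names are illustrative, not Brill's implementation):

```python
def tbl_tag(words, most_frequent_tag, ranked_rules):
    """Tag `words` with the TBL scheme sketched above."""
    # Initial tag sequence: the most frequent tag for each word
    tags = [most_frequent_tag(w) for w in words]
    # Refine the sequence by applying each transformation rule in rank order
    for old_tag, new_tag, trigger in ranked_rules:
        for i in range(len(tags)):
            if tags[i] == old_tag and trigger(tags, words, i):
                tags[i] = new_tag
    return tags
```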
Some examples
1. Change NN to VB if the previous tag is TO
– to/TO conflict/NN with → VB
2. Change VBP to VB if MD is in the previous three
– might/MD vanish/VBP → VB
3. Change NN to VB if MD is in the previous two
– might/MD reply/NN → VB
4. Change VB to NN if DT is in the previous two
– the/DT reply/VB → NN
Transformation Templates
Specify which transformations are possible. For example, change tag A to tag B when:
1. The preceding (following) tag is Z
2. The tag two before (after) is Z
3. One of the two previous (following) tags is Z
4. One of the three previous (following) tags is Z
5. The preceding tag is Z and the following is W
6. The preceding (following) tag is Z and the tag two before (after) is W
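For illustration, a couple of these templates can be instantiated as triggering predicates compatible with the tagger sketch above (a hypothetical representation):

```python
def preceding_tag_is(z):
    # Template 1: the preceding tag is Z
    return lambda tags, words, i: i > 0 and tags[i - 1] == z

def one_of_two_previous_tags_is(z):
    # Template 3: one of the two previous tags is Z
    return lambda tags, words, i: z in tags[max(0, i - 2):i]

# Example rule instantiation: "Change NN to VB if the previous tag is TO"
rule = ("NN", "VB", preceding_tag_is("TO"))
```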
Lexicalization
New templates to include dependency on surrounding words (not just tags). Change tag A to tag B when:
1. The preceding (following) word is w
2. The word two before (after) is w
3. One of the two preceding (following) words is w
4. The current word is w
5. The current word is w and the preceding (following) word is v
6. The current word is w and the preceding (following) tag is X (notice: word-tag combination)
7. etc.
Initializing Unseen Words
• How to choose the most likely tag for unseen words?
Transformation-based approach:
– Start with NP for capitalized words, NN for others (see the sketch after this list)
– Learn “morphological” transformations. Change tag from X to Y if:
1. Deleting prefix (suffix) x results in a known word
2. The first (last) characters of the word are x
3. Adding x as a prefix (suffix) results in a known word
4. Word W ever appears immediately before (after) the word
5. Character Z appears in the word
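A minimal sketch of the initialization step described above, assuming a `lexicon` mapping known words to their most frequent training tag:

```python
def initial_tag(word, lexicon):
    """Initial tag: lexicon lookup for known words, otherwise NP for
    capitalized words and NN for the rest, as described above."""
    if word in lexicon:
        return lexicon[word]
    return "NP" if word[:1].isupper() else "NN"
```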
TBL Learning Scheme
[Diagram] Unannotated Input Text → Setting Initial State → Annotated Text; the Learning Algorithm compares the Annotated Text against the Ground Truth for the Input Text and outputs Rules.
Greedy Learning Algorithm
• Initial tagging of the training corpus – most frequent tag per word
• At each iteration:
– Identify rules that fix errors and compute the “error reduction” for each transformation rule:
• #errors fixed - #errors introduced
– Find the best rule; if its error reduction is greater than a threshold (to avoid overfitting):
• Apply the best rule to the training corpus
• Append the best rule to the ordered list of transformations
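A sketch of this greedy loop, reusing the (old_tag, new_tag, trigger) rule format assumed above; the helper `apply_rule` and the other names are illustrative:

```python
def apply_rule(rule, tags, words):
    """Return a copy of `tags` with one transformation rule applied."""
    old_tag, new_tag, trigger = rule
    new_tags = list(tags)
    for i in range(len(new_tags)):
        if new_tags[i] == old_tag and trigger(new_tags, words, i):
            new_tags[i] = new_tag
    return new_tags

def tbl_learn(words, gold_tags, most_frequent_tag, candidate_rules, threshold=0):
    """Greedy TBL learning: repeatedly pick the rule with the largest error
    reduction (#errors fixed - #errors introduced) until no rule beats the
    threshold."""
    def errors(tags):
        return sum(t != g for t, g in zip(tags, gold_tags))

    tags = [most_frequent_tag(w) for w in words]    # initial tagging
    learned = []
    while True:
        best_rule, best_gain = None, threshold
        for rule in candidate_rules:
            gain = errors(tags) - errors(apply_rule(rule, tags, words))
            if gain > best_gain:
                best_rule, best_gain = rule, gain
        if best_rule is None:            # no rule above the threshold: stop
            break
        tags = apply_rule(best_rule, tags, words)
        learned.append(best_rule)        # rules are kept in rank order
    return learned
```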
Stochastic POS Tagging
• POS tagging: for a given sentence W = w1…wn, find the matching POS tags T = t1…tn
• In a statistical framework:
T' = argmax_T P(T | W)
T' = argmax_T P(t1…tn | w1…wn)
= argmax_T P(w1…wn | t1…tn) P(t1…tn) / P(w1…wn)    [Bayes' rule]
= argmax_T P(w1…wn | t1…tn) P(t1…tn)    [the denominator doesn't depend on the tags]
= argmax_T P(w1…wn | t1…tn) · P(t1) P(t2|t1) P(t3|t1t2) … P(tn|t1…tn-1)    [chain rule]
= argmax_T P(w1…wn | t1…tn) · P(t1) P(t2|t1) … P(tn|tn-1)    [Markovian assumptions]
= argmax_T Π i=1…n P(wi|ti) P(ti|ti-1)    [words are independent of each other; a word's identity depends only on its own tag]

Notation: P(t1) = P(t1 | t0)
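For illustration, a small sketch that evaluates the final product form above for one candidate tag sequence; the tuple-keyed probability tables and the "<s>" start symbol standing in for t0 are assumptions:

```python
def sequence_score(words, tags, p_word_given_tag, p_tag_given_prev):
    """Score P(W|T) * P(T) = product over i of P(wi|ti) * P(ti|ti-1)."""
    score, prev = 1.0, "<s>"              # "<s>" plays the role of t0
    for w, t in zip(words, tags):
        score *= (p_word_given_tag.get((w, t), 0.0)
                  * p_tag_given_prev.get((prev, t), 0.0))
        prev = t
    return score
```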
The Markovian assumptions
• Limited horizon:
– P(Xi+1 = tk | X1,…,Xi) = P(Xi+1 = tk | Xi)
• Time invariant:
– P(Xi+1 = tk | Xi) = P(Xj+1 = tk | Xj)
Maximum Likelihood Estimation
• In order to estimate P(wi|ti) and P(ti|ti-1), we can use the maximum likelihood estimates:
– P(wi|ti) = c(wi,ti) / c(ti)
– P(ti|ti-1) = c(ti-1 ti) / c(ti-1)
• Notice the estimation for i=1
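A sketch of these estimates computed from a tagged corpus; each sentence is padded with a start tag "<s>" so that the i=1 case is handled as P(t1|t0) (the data layout is an assumption):

```python
from collections import defaultdict

def mle_estimates(tagged_sentences):
    """Relative-frequency estimates: P(wi|ti) = c(wi,ti)/c(ti) and
    P(ti|ti-1) = c(ti-1 ti)/c(ti-1), from sentences of (word, tag) pairs."""
    c_tag = defaultdict(int)
    c_word_tag = defaultdict(int)
    c_tag_bigram = defaultdict(int)
    for sentence in tagged_sentences:
        prev = "<s>"                      # start tag, so P(t1) = P(t1|t0)
        c_tag[prev] += 1
        for word, tag in sentence:
            c_tag[tag] += 1
            c_word_tag[(word, tag)] += 1
            c_tag_bigram[(prev, tag)] += 1
            prev = tag
    p_word_given_tag = {(w, t): n / c_tag[t] for (w, t), n in c_word_tag.items()}
    p_tag_given_prev = {(p, t): n / c_tag[p] for (p, t), n in c_tag_bigram.items()}
    return p_word_given_tag, p_tag_given_prev
```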
Unknown Words
• Many words will not appear in the training corpus.
• Unknown words are a major problem for taggers (!)
• Solutions:
– Incorporate morphological analysis
– Consider words appearing only once in the training data as UNKNOWNs
“Add-1/Add-Constant” Smoothing
p_MLE(x) = c(x) / N
– c(x): the count of event x (e.g. a word occurrence)
– N: the total count of all events (e.g. the corpus length)
– p_MLE(x) = 0 for many low-probability events (sparseness)
Smoothing – discounting and redistribution:
p_S(x) = (c(x) + λ) / (N + λ|X|)
– λ = 1: Laplace, assuming a uniform prior
– In natural language events, usually λ < 1
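A sketch of the add-constant estimate (Laplace when lam = 1.0), under the event-count notation above:

```python
def add_constant(count, total, vocab_size, lam=1.0):
    """p(x) = (c(x) + lam) / (N + lam * |X|); unseen events (c(x) = 0)
    get a small non-zero probability instead of the MLE's zero."""
    return (count + lam) / (total + lam * vocab_size)

# Example: an unseen word in a 10,000-token corpus with a 5,000-word vocabulary
p_unseen = add_constant(0, 10_000, 5_000, lam=1.0)   # = 1 / 15,000
```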
Smoothing for Tagging
• For P(ti|ti-1)
• Optionally, also for P(wi|ti)
Viterbi
• Finding the most probable tag sequence can be done with the Viterbi algorithm.
• No need to calculate every single possible tag sequence (!)
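A compact sketch of Viterbi decoding for the bigram model above, using the tuple-keyed probability tables from the estimation sketch; the "<s>" start symbol and the zero-probability default for unseen pairs are assumptions:

```python
def viterbi(words, tagset, p_word_given_tag, p_tag_given_prev):
    """Most probable tag sequence under argmax_T prod P(wi|ti) * P(ti|ti-1)."""
    def emit(w, t):
        return p_word_given_tag.get((w, t), 0.0)

    def trans(prev, t):
        return p_tag_given_prev.get((prev, t), 0.0)

    # delta[i][t]: best probability of a tag sequence for words[:i+1] ending in t
    # back[i][t]:  the previous tag achieving that probability
    delta = [{t: trans("<s>", t) * emit(words[0], t) for t in tagset}]
    back = [{}]
    for i in range(1, len(words)):
        delta.append({})
        back.append({})
        for t in tagset:
            prev = max(tagset, key=lambda p: delta[i - 1][p] * trans(p, t))
            delta[i][t] = delta[i - 1][prev] * trans(prev, t) * emit(words[i], t)
            back[i][t] = prev
    # Recover the best path by following back-pointers from the best final tag
    best = max(tagset, key=lambda t: delta[-1][t])
    path = [best]
    for i in range(len(words) - 1, 0, -1):
        best = back[i][best]
        path.append(best)
    return list(reversed(path))
```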
HMMs
• Assume a state machine with:
– Nodes that correspond to tags
– A start state and an end state
– Arcs corresponding to transition probabilities – P(ti|ti-1)
– A set of observation likelihoods for each state – P(wi|ti)
[Diagram: an example HMM with states NN, VBZ, NNS, AT, VB and RB. States carry emission probabilities (e.g. P(like)=0.2, P(fly)=0.3, …, P(eat)=0.36; P(likes)=0.3, P(flies)=0.1, …, P(eats)=0.5; P(the)=0.4, P(a)=0.3, P(an)=0.2) and arcs carry transition probabilities (e.g. 0.6, 0.4).]
HMMs
• An HMM is similar to an automaton augmented with probabilities.
• Note that the states in an HMM do not correspond to the input symbols.
• The input symbols don't uniquely determine the next state.
HMM definition
• HMM = (S, K, A, B)
– Set of states S = {s1,…,sn}
– Output alphabet K = {k1,…,kn}
– State transition probabilities A = {aij}, i,j ∈ S
– Symbol emission probabilities B = b(i,k), i ∈ S, k ∈ K
– Start and end states (non-emitting)
• Alternatively: initial state probabilities
• Note: for a given i, Σj aij = 1 and Σk b(i,k) = 1
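One possible in-code representation of this definition, a minimal sketch with a sanity check for the two sum-to-one constraints (the field names are assumptions):

```python
from dataclasses import dataclass

@dataclass
class HMM:
    states: list        # S = {s1, ..., sn}
    alphabet: list      # K = output symbols
    A: dict             # transition probabilities: A[(si, sj)] = P(sj | si)
    B: dict             # emission probabilities:  B[(si, k)] = P(k | si)

    def check(self):
        """Verify that, for every state i, sum_j a_ij = 1 and sum_k b(i,k) = 1."""
        for s in self.states:
            assert abs(sum(self.A.get((s, t), 0.0) for t in self.states) - 1.0) < 1e-9
            assert abs(sum(self.B.get((s, k), 0.0) for k in self.alphabet) - 1.0) < 1e-9
```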
Why Hidden?
• Because we only observe the input – the underlying states are hidden.
• Decoding: the problem of part-of-speech tagging can be viewed as a decoding problem: given an observation sequence W = w1,…,wn, find a state sequence T = t1,…,tn that best explains the observation.
Homework