Natural Language Processing - Lecture 5: POS Tagging Algorithms
Natural Language Processing - Lecture 5
POS Tagging Algorithms
Ido Dagan
Department of Computer Science, Bar-Ilan University
Supervised Learning Scheme
[Diagram] Training: “Labeled” Examples → Training Algorithm → Classification Model.
Classification: New Examples + Classification Model → Classification Algorithm → Classifications.
Transformation-Based Learning (TBL) for Tagging
• Introduced by Brill (1995)
• Can exploit a wider range of lexical and syntactic regularities via transformation rules – triggering environment and rewrite rule
• Tagger (see the sketch below):
– Construct an initial tag sequence for the input – the most frequent tag for each word
– Iteratively refine the tag sequence by applying “transformation rules” in rank order
• Learner:
– Construct an initial tag sequence for the training corpus
– Loop until done:
• Try all possible rules and compare to the known tags; apply the best rule r* to the sequence and add it to the rule ranking
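A minimal sketch of this tagging scheme, assuming rules are (old_tag, new_tag, trigger) triples where `trigger` is a predicate over the current tag sequence and a position (the names are illustrative, not Brill's implementation):

```python
def tbl_tag(words, most_frequent_tag, ranked_rules):
    """Tag `words` with the TBL scheme sketched above."""
    # Initial tag sequence: the most frequent tag for each word
    tags = [most_frequent_tag(w) for w in words]
    # Refine the sequence by applying each transformation rule in rank order
    for old_tag, new_tag, trigger in ranked_rules:
        for i in range(len(tags)):
            if tags[i] == old_tag and trigger(tags, words, i):
                tags[i] = new_tag
    return tags
```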
Some examples
1. Change NN to VB if the previous tag is TO
– to/TO conflict/NN with → VB
2. Change VBP to VB if MD is in the previous three
– might/MD vanish/VBP → VB
3. Change NN to VB if MD is in the previous two
– might/MD reply/NN → VB
4. Change VB to NN if DT is in the previous two
– the/DT reply/VB → NN
Transformation Templates
Specify which transformations are possible. For example, change tag A to tag B when:
1. The preceding (following) tag is Z
2. The tag two before (after) is Z
3. One of the two previous (following) tags is Z
4. One of the three previous (following) tags is Z
5. The preceding tag is Z and the following is W
6. The preceding (following) tag is Z and the tag two before (after) is W
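For illustration, a couple of these templates can be instantiated as triggering predicates compatible with the tagger sketch above (a hypothetical representation):

```python
def preceding_tag_is(z):
    # Template 1: the preceding tag is Z
    return lambda tags, words, i: i > 0 and tags[i - 1] == z

def one_of_two_previous_tags_is(z):
    # Template 3: one of the two previous tags is Z
    return lambda tags, words, i: z in tags[max(0, i - 2):i]

# Example rule instantiation: "Change NN to VB if the previous tag is TO"
rule = ("NN", "VB", preceding_tag_is("TO"))
```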
Lexicalization
New templates to include dependency on surrounding words (not just tags). Change tag A to tag B when:
1. The preceding (following) word is w
2. The word two before (after) is w
3. One of the two preceding (following) words is w
4. The current word is w
5. The current word is w and the preceding (following) word is v
6. The current word is w and the preceding (following) tag is X (notice: word-tag combination)
7. etc.
Initializing Unseen Words
• How to choose the most likely tag for unseen words?
Transformation-based approach:
– Start with NP for capitalized words, NN for others (see the sketch after this list)
– Learn “morphological” transformations. Change tag from X to Y if:
1. Deleting prefix (suffix) x results in a known word
2. The first (last) characters of the word are x
3. Adding x as a prefix (suffix) results in a known word
4. Word W ever appears immediately before (after) the word
5. Character Z appears in the word
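A minimal sketch of the initialization step described above, assuming a `lexicon` mapping known words to their most frequent training tag:

```python
def initial_tag(word, lexicon):
    """Initial tag: lexicon lookup for known words, otherwise NP for
    capitalized words and NN for the rest, as described above."""
    if word in lexicon:
        return lexicon[word]
    return "NP" if word[:1].isupper() else "NN"
```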
TBL Learning Scheme
[Diagram] Unannotated Input Text → Setting Initial State → Annotated Text; the Learning Algorithm compares the Annotated Text against the Ground Truth for the Input Text and outputs Rules.
Greedy Learning Algorithm
• Initial tagging of the training corpus – most frequent tag per word
• At each iteration:
– Identify rules that fix errors and compute the “error reduction” for each transformation rule:
• #errors fixed - #errors introduced
– Find the best rule; if its error reduction is greater than a threshold (to avoid overfitting):
• Apply the best rule to the training corpus
• Append the best rule to the ordered list of transformations
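A sketch of this greedy loop, reusing the (old_tag, new_tag, trigger) rule format assumed above; the helper `apply_rule` and the other names are illustrative:

```python
def apply_rule(rule, tags, words):
    """Return a copy of `tags` with one transformation rule applied."""
    old_tag, new_tag, trigger = rule
    new_tags = list(tags)
    for i in range(len(new_tags)):
        if new_tags[i] == old_tag and trigger(new_tags, words, i):
            new_tags[i] = new_tag
    return new_tags

def tbl_learn(words, gold_tags, most_frequent_tag, candidate_rules, threshold=0):
    """Greedy TBL learning: repeatedly pick the rule with the largest error
    reduction (#errors fixed - #errors introduced) until no rule beats the
    threshold."""
    def errors(tags):
        return sum(t != g for t, g in zip(tags, gold_tags))

    tags = [most_frequent_tag(w) for w in words]    # initial tagging
    learned = []
    while True:
        best_rule, best_gain = None, threshold
        for rule in candidate_rules:
            gain = errors(tags) - errors(apply_rule(rule, tags, words))
            if gain > best_gain:
                best_rule, best_gain = rule, gain
        if best_rule is None:            # no rule above the threshold: stop
            break
        tags = apply_rule(best_rule, tags, words)
        learned.append(best_rule)        # rules are kept in rank order
    return learned
```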
Stochastic POS Tagging
• POS tagging: for a given sentence W = w1…wn, find the matching POS tags T = t1…tn
• In a statistical framework:
T' = argmax_T P(T | W)
T' = argmax_T P(t1…tn | w1…wn)
= argmax_T P(w1…wn | t1…tn) P(t1…tn) / P(w1…wn)    [Bayes' rule]
= argmax_T P(w1…wn | t1…tn) P(t1…tn)    [the denominator doesn't depend on the tags]
= argmax_T P(w1…wn | t1…tn) · P(t1) P(t2|t1) P(t3|t1t2) … P(tn|t1…tn-1)    [chain rule]
= argmax_T P(w1…wn | t1…tn) · P(t1) P(t2|t1) … P(tn|tn-1)    [Markovian assumptions]
= argmax_T Π i=1…n P(wi|ti) P(ti|ti-1)    [words are independent of each other; a word's identity depends only on its own tag]

Notation: P(t1) = P(t1 | t0)
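For illustration, a small sketch that evaluates the final product form above for one candidate tag sequence; the tuple-keyed probability tables and the "<s>" start symbol standing in for t0 are assumptions:

```python
def sequence_score(words, tags, p_word_given_tag, p_tag_given_prev):
    """Score P(W|T) * P(T) = product over i of P(wi|ti) * P(ti|ti-1)."""
    score, prev = 1.0, "<s>"              # "<s>" plays the role of t0
    for w, t in zip(words, tags):
        score *= (p_word_given_tag.get((w, t), 0.0)
                  * p_tag_given_prev.get((prev, t), 0.0))
        prev = t
    return score
```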
The Markovian assumptions
• Limited horizon:
– P(Xi+1 = tk | X1,…,Xi) = P(Xi+1 = tk | Xi)
• Time invariant:
– P(Xi+1 = tk | Xi) = P(Xj+1 = tk | Xj)
Maximum Likelihood Estimation
• In order to estimate P(wi|ti) and P(ti|ti-1), we can use the maximum likelihood estimates:
– P(wi|ti) = c(wi,ti) / c(ti)
– P(ti|ti-1) = c(ti-1 ti) / c(ti-1)
• Notice the estimation for i=1
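A sketch of these estimates computed from a tagged corpus; each sentence is padded with a start tag "<s>" so that the i=1 case is handled as P(t1|t0) (the data layout is an assumption):

```python
from collections import defaultdict

def mle_estimates(tagged_sentences):
    """Relative-frequency estimates: P(wi|ti) = c(wi,ti)/c(ti) and
    P(ti|ti-1) = c(ti-1 ti)/c(ti-1), from sentences of (word, tag) pairs."""
    c_tag = defaultdict(int)
    c_word_tag = defaultdict(int)
    c_tag_bigram = defaultdict(int)
    for sentence in tagged_sentences:
        prev = "<s>"                      # start tag, so P(t1) = P(t1|t0)
        c_tag[prev] += 1
        for word, tag in sentence:
            c_tag[tag] += 1
            c_word_tag[(word, tag)] += 1
            c_tag_bigram[(prev, tag)] += 1
            prev = tag
    p_word_given_tag = {(w, t): n / c_tag[t] for (w, t), n in c_word_tag.items()}
    p_tag_given_prev = {(p, t): n / c_tag[p] for (p, t), n in c_tag_bigram.items()}
    return p_word_given_tag, p_tag_given_prev
```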
Unknown Words
• Many words will not appear in the training corpus.
• Unknown words are a major problem for taggers (!)
• Solutions:
– Incorporate morphological analysis
– Consider words appearing only once in the training data as UNKNOWNs
“Add-1/Add-Constant” Smoothing
p_MLE(x) = c(x) / N
– c(x): the count of event x (e.g. a word occurrence)
– N: the total count of all events (e.g. the corpus length)
– p_MLE(x) = 0 for many low-probability events (sparseness)
Smoothing – discounting and redistribution:
p_S(x) = (c(x) + λ) / (N + λ|X|)
– λ = 1: Laplace, assuming a uniform prior
– In natural language events, usually λ < 1
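A sketch of the add-constant estimate (Laplace when lam = 1.0), under the event-count notation above:

```python
def add_constant(count, total, vocab_size, lam=1.0):
    """p(x) = (c(x) + lam) / (N + lam * |X|); unseen events (c(x) = 0)
    get a small non-zero probability instead of the MLE's zero."""
    return (count + lam) / (total + lam * vocab_size)

# Example: an unseen word in a 10,000-token corpus with a 5,000-word vocabulary
p_unseen = add_constant(0, 10_000, 5_000, lam=1.0)   # = 1 / 15,000
```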
Smoothing for Tagging
• For P(ti|ti-1)
• Optionally, also for P(wi|ti)
Viterbi
• Finding the most probable tag sequence can be done with the Viterbi algorithm.
• No need to calculate every single possible tag sequence (!)
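A compact sketch of Viterbi decoding for the bigram model above, using the tuple-keyed probability tables from the estimation sketch; the "<s>" start symbol and the zero-probability default for unseen pairs are assumptions:

```python
def viterbi(words, tagset, p_word_given_tag, p_tag_given_prev):
    """Most probable tag sequence under argmax_T prod P(wi|ti) * P(ti|ti-1)."""
    def emit(w, t):
        return p_word_given_tag.get((w, t), 0.0)

    def trans(prev, t):
        return p_tag_given_prev.get((prev, t), 0.0)

    # delta[i][t]: best probability of a tag sequence for words[:i+1] ending in t
    # back[i][t]:  the previous tag achieving that probability
    delta = [{t: trans("<s>", t) * emit(words[0], t) for t in tagset}]
    back = [{}]
    for i in range(1, len(words)):
        delta.append({})
        back.append({})
        for t in tagset:
            prev = max(tagset, key=lambda p: delta[i - 1][p] * trans(p, t))
            delta[i][t] = delta[i - 1][prev] * trans(prev, t) * emit(words[i], t)
            back[i][t] = prev
    # Recover the best path by following back-pointers from the best final tag
    best = max(tagset, key=lambda t: delta[-1][t])
    path = [best]
    for i in range(len(words) - 1, 0, -1):
        best = back[i][best]
        path.append(best)
    return list(reversed(path))
```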
HMMs
• Assume a state machine with:
– Nodes that correspond to tags
– A start state and an end state
– Arcs corresponding to transition probabilities – P(ti|ti-1)
– A set of observation likelihoods for each state – P(wi|ti)
[Diagram: an example HMM with states NN, VBZ, NNS, AT, VB and RB. States carry emission probabilities (e.g. P(like)=0.2, P(fly)=0.3, …, P(eat)=0.36; P(likes)=0.3, P(flies)=0.1, …, P(eats)=0.5; P(the)=0.4, P(a)=0.3, P(an)=0.2) and arcs carry transition probabilities (e.g. 0.6, 0.4).]
HMMs
• An HMM is similar to an automaton augmented with probabilities.
• Note that the states in an HMM do not correspond to the input symbols.
• The input symbols don't uniquely determine the next state.
HMM definition
• HMM = (S, K, A, B)
– Set of states S = {s1,…,sn}
– Output alphabet K = {k1,…,kn}
– State transition probabilities A = {aij}, i,j ∈ S
– Symbol emission probabilities B = b(i,k), i ∈ S, k ∈ K
– Start and end states (non-emitting)
• Alternatively: initial state probabilities
• Note: for a given i, Σj aij = 1 and Σk b(i,k) = 1
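One possible in-code representation of this definition, a minimal sketch with a sanity check for the two sum-to-one constraints (the field names are assumptions):

```python
from dataclasses import dataclass

@dataclass
class HMM:
    states: list        # S = {s1, ..., sn}
    alphabet: list      # K = output symbols
    A: dict             # transition probabilities: A[(si, sj)] = P(sj | si)
    B: dict             # emission probabilities:  B[(si, k)] = P(k | si)

    def check(self):
        """Verify that, for every state i, sum_j a_ij = 1 and sum_k b(i,k) = 1."""
        for s in self.states:
            assert abs(sum(self.A.get((s, t), 0.0) for t in self.states) - 1.0) < 1e-9
            assert abs(sum(self.B.get((s, k), 0.0) for k in self.alphabet) - 1.0) < 1e-9
```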
Why Hidden?
• Because we only observe the input – the underlying states are hidden.
• Decoding: the problem of part-of-speech tagging can be viewed as a decoding problem: given an observation sequence W = w1,…,wn, find a state sequence T = t1,…,tn that best explains the observation.
Homework