TRANSCRIPT
DIRECTED GRAPHICAL MODEL MEMM AS SEQUENCE
CLASSIFIER
NIKET TANDON
Tutor: Martin Theobald, Date: 09 June, 2011
Why talk about them?
An HMM is a sequence classifier. Hidden Markov models (HMMs) have been successfully applied to:
Part-of-speech tagging:
<PRP>He</PRP> <VB>attends</VB> <NNS>seminars</NNS>
Named entity recognition:
<ORG>MPI</ORG> director <PRS>Gerhard Weikum</PRS>
Agenda
HMM Recap
MaxEnt Model
HMM + MaxEnt ~= MEMM
Training & Decoding
Experimental results and discussion
Markov chains and HMM
HMMs and Markov chains extend finite automata (FA): states q_i, transitions between states a_ij.
Markov assumption: the probability of a state depends only on the previous state.
The probabilities of the arcs leaving any node sum to one.
(Figure: hierarchy of models: Finite Automaton → Weighted FA → Markov Chain → HMM)
HMM (One additional layer of uncertainty. States are hidden!)
Think of the states as causal factors in the probabilistic model:
States q_i
Transitions between states a_ij
Observation likelihood (emission probability) B = b_i(o_t): the probability of observation o_t being generated from state q_i
Markov assumption
Output independence assumption
Task: given the observations (# of ice creams eaten each day), predict the hidden weather states.
An HMM is characterized by three fundamental problems:
Computing likelihood: given an HMM λ = (A, B) and an observation sequence O, find P(O | λ)
Decoding: given an HMM λ = (A, B) and O, find the best hidden state sequence Q
Learning: given Q and O, learn the HMM parameters A and B
Computing P(O | λ): observation likelihood
With N hidden states and T observations there are N × N × … × N = N^T possible state sequences (HHC, HCH, CHH, …). Exponentially large! The efficient algorithm is the Forward algorithm, which uses a table for intermediate values.
Forward Algorithm
α_t(j) = P(the t-th state is state j after seeing the first t observations) = P(o_1 … o_t, q_t = j | λ)
Recurrence: α_t(j) = Σ_{i=1..N} α_{t−1}(i) · a_{ij} · b_j(o_t)
(previous forward probability × P(previous state → current state) × P(observation o_t given current state j))
For many HMM applications, many a_ij are zero, which reduces the work.
Finally, sum over the probabilities of every path: P(O | λ) = Σ_{j=1..N} α_T(j)
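A minimal sketch of the forward algorithm in Python (the toy weather numbers below are illustrative, not taken from the slides):

```python
# Forward algorithm for an HMM: transition matrix A, emission matrix B,
# initial distribution pi. Returns P(O | lambda). Illustrative sketch.

def forward(A, B, pi, obs):
    """A[i][j] = P(next state j | state i); B[j][o] = P(observation o | state j)."""
    N = len(pi)
    # Initialization: alpha_1(j) = pi_j * b_j(o_1)
    alpha = [pi[j] * B[j][obs[0]] for j in range(N)]
    # Recursion: alpha_t(j) = sum_i alpha_{t-1}(i) * a_ij * b_j(o_t)
    for o in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(N)) * B[j][o]
                 for j in range(N)]
    # Termination: P(O | lambda) = sum_j alpha_T(j)
    return sum(alpha)

# Toy weather HMM: states 0 = Hot, 1 = Cold; observations 0,1,2 = 1,2,3 ice creams
A  = [[0.7, 0.3], [0.4, 0.6]]
B  = [[0.2, 0.4, 0.4], [0.5, 0.4, 0.1]]
pi = [0.8, 0.2]
print(forward(A, B, pi, [2, 0, 2]))   # likelihood of eating 3, 1, 3 ice creams
```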
Decoding: Viterbi Algorithm
Given an HMM λ = (A, B) and O, find the most probable hidden state sequence Q.
Naive approach: run the forward algorithm for every possible hidden state sequence and choose the one with the maximum observation likelihood. Exponential!
An efficient alternative is Viterbi:
v_t(j) = max_{i=1..N} v_{t−1}(i) · a_{ij} · b_j(o_t)
(previous Viterbi path probability × P(previous state → current state) × P(observation o_t given current state j))
Compare with the forward algorithm (sum vs. max). Keep a pointer to the best path that brought us here (backtrace).
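The same dynamic program with max in place of sum, plus backpointers: a minimal Viterbi sketch in Python (same toy conventions as the forward sketch above):

```python
# Viterbi decoding: most probable hidden state sequence for an HMM.
# Same A, B, pi conventions as the forward sketch; illustrative only.

def viterbi(A, B, pi, obs):
    N = len(pi)
    # Initialization: v_1(j) = pi_j * b_j(o_1)
    v = [pi[j] * B[j][obs[0]] for j in range(N)]
    backptr = []
    for o in obs[1:]:
        prev, step, nv = v, [], []
        for j in range(N):
            # v_t(j) = max_i v_{t-1}(i) * a_ij * b_j(o_t)
            best_i = max(range(N), key=lambda i: prev[i] * A[i][j])
            step.append(best_i)
            nv.append(prev[best_i] * A[best_i][j] * B[j][o])
        v = nv
        backptr.append(step)
    # Termination: best final state, then follow backpointers (backtrace)
    best = max(range(N), key=lambda j: v[j])
    path = [best]
    for step in reversed(backptr):
        path.append(step[path[-1]])
    path.reverse()
    return path, v[best]

# e.g. viterbi(A, B, pi, [2, 0, 2]) with the toy HMM above -> best weather sequence
```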
Learning: Forward-Backward Algorithm
Train the transition probabilities A and the emission probabilities B.
First consider a plain Markov chain: there B = 1.0, so we only compute A. Maximum-likelihood estimate: â_ij = count(transitions from i to j) / count(all transitions from i).
But an HMM has hidden states! This needs additional work, computed EM-style by the Forward-Backward (Baum-Welch) algorithm.
The backward probability sums over all successor values β_{t+1}(j), weighted by a_ij and the observation probability:
β_t(i) = Σ_{j=1..N} a_{ij} · b_j(o_{t+1}) · β_{t+1}(j)
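For completeness, the standard Baum-Welch re-estimation step that combines α and β (textbook form, not spelled out on the slide):

```latex
\xi_t(i,j) = \frac{\alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}{P(O \mid \lambda)}
\qquad\Rightarrow\qquad
\hat{a}_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{t=1}^{T-1} \sum_{k=1}^{N} \xi_t(i,k)}
```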
Agenda
HMM Recap
MaxEnt Model
HMM + MaxEnt ~= MEMM
Training & Decoding
Experimental results and discussion
Background - Maximum Entropy models
A second probabilistic machine learning framework
Maximum Entropy is itself a non-sequential classifier, but it can be used to build sequence classifiers.
The most common MaxEnt sequence classifier is the Maximum Entropy Markov Model (MEMM).
First, we discuss the non-sequential MaxEnt model.
MaxEnt works by extracting a set of features from the input, combining them linearly, and then using that sum as an exponent.
You must know: linear regression.
The linear combination is not bounded, so we need to normalize it.
Instead of the probability, compute the odds: recall that if P(A) = 0.9 and P(A′) = 0.1,
then the odds of A = 0.9 / 0.1 = 9.
Logistic Regression
A linear model to predict the odds of y = true:
odds = p(y = true | x) / (1 − p(y = true | x))
The odds lie between 0 and ∞, but a linear function ranges from −∞ to +∞, so take the log. The resulting left-hand function is called the logit:
logit(p(x)) = ln( p(x) / (1 − p(x)) ) = w · f
A regression model used to estimate not the probability but the logit of the probability is called logistic regression.
So, if the linear function estimates the logit, what is P(y = true)? Solving for p gives
P(y = true | x) = e^{w·f} / (1 + e^{w·f}) = 1 / (1 + e^{−w·f})
This function is called the logistic function, and it gives logistic regression its name.
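A quick numeric illustration in Python (the weights and feature values are made up for illustration):

```python
import math

def logistic(z):
    """Logistic (sigmoid) function: maps (-inf, +inf) to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights w and binary features f for one example
w = [0.8, -1.3, 0.1]
f = [1.0, 0.0, 1.0]
z = sum(wi * fi for wi, fi in zip(w, f))    # linear combination w . f = logit
p = logistic(z)                             # P(y = true | x)
print(f"logit = {z:.2f}, P(y=true) = {p:.3f}, odds = {p / (1 - p):.2f}")
```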
Learning Weights (w)
Unlike linear regression, which minimizes squared loss on the training set, logistic regression uses conditional maximum likelihood estimation:
choose the w that makes the probability of the observed y values in the training data highest, given the observations x.
For the entire training set: ŵ = argmax_w Π_i P(y_i | x_i)
Taking the log: ŵ = argmax_w Σ_i log P(y_i | x_i)
Substituting the two cases y = 1 and y = 0 gives an unwieldy expression; because each y_i ∈ {0, 1}, it condenses to
ŵ = argmax_w Σ_i [ y_i log p(x_i) + (1 − y_i) log(1 − p(x_i)) ]
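The slides solve this with GIS later; as a minimal illustration of the same objective, here is a plain gradient-ascent sketch (dataset and learning rate are made up):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(X, y, lr=0.1, epochs=200):
    """Maximize sum_i log P(y_i | x_i) by gradient ascent.
    Gradient of the conditional log-likelihood w.r.t. w is (y_i - p_i) * x_i."""
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)))
            w = [wj + lr * (yi - p) * xj for wj, xj in zip(w, xi)]
    return w

# Tiny made-up dataset: one feature plus a bias term (the trailing 1.0)
X = [[1.0, 1.0], [2.0, 1.0], [3.0, 1.0], [4.0, 1.0]]
y = [0, 0, 1, 1]
print(train_logistic(X, y))   # weights separating the two classes
```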
How to solve the convex optimization problem?
There are several methods for solving convex optimization problems. Later we explain one algorithm: the Generalized Iterative Scaling method (GIS).
MaxEnt
Until now we had two classes; with more classes, logistic regression is called multinomial logistic regression (or MaxEnt).
In MaxEnt, the probability that y = c, normalized to make it a probability, is:
P(y = c | x) = exp( Σ_i w_ci · f_i(c, x) ) / Σ_{c′} exp( Σ_i w_c′i · f_i(c′, x) )
f_i(c, x) means feature i of observation x for class c.
The f_i are not real-valued but binary (more common in text processing).
Learning w is similar to logistic regression, with one change: MaxEnt tends to learn very high weights, so a regularization term penalizing large weights is used.
Let us see a classification problem…
Niket/NNP is/BEZ expected/VBN to/TO talk/?? today/

         f1     f2     f3     f4     f5     f6
VB(f)     0      1      0      1      1      0
VB(w)           .8            .01     .1
NN(f)     1      0      0      0      0      1
NN(w)    .8                                 -1.3

VB/NN?
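Plugging the active features into the MaxEnt formula (assuming, as the table suggests, that weights are listed only for the active features):

P(VB | x) = e^(.8 + .01 + .1) / (e^(.8 + .01 + .1) + e^(.8 − 1.3)) = e^.91 / (e^.91 + e^−.5) ≈ 2.48 / (2.48 + 0.61) ≈ 0.80
P(NN | x) ≈ 0.20
So the model tags talk as VB.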
More complex features..
A word starting with a capital letter (Day) is more likely to be an NNP than a common noun (e.g., Independence Day).
But a capitalized word occurring at the beginning of a sentence is not more likely to be an NNP (e.g., Day after day). In MaxEnt, such interactions must be built by hand as explicit combination features.
The key to successful use of MaxEnt is the design of appropriate features and feature combinations.
Why the name Maximum Entropy?
Suppose we tag a new word, Prabhu, with a model that makes the fewest assumptions, imposing no constraints.
We get an equiprobable distribution:

NN    JJ    NNS   VB    NNP   VBG   IN    CD
1/8   1/8   1/8   1/8   1/8   1/8   1/8   1/8

Now suppose we had some training data from which we learnt that the set of possible tags for Prabhu is {NN, JJ, NNS, VB}.
Since one of these tags must be correct: P(NN) + P(JJ) + P(NNS) + P(VB) = 1

NN    JJ    NNS   VB    NNP   VBG   IN    CD
1/4   1/4   1/4   1/4   0     0     0     0
Maximum Entropy
… of all possible distributions satisfying the constraints, choose the one with the maximum entropy:
p* = argmax_{p ∈ C} H(p), where H(p) = −Σ_x p(x) log p(x)
The solution to this constrained optimization turns out to be exactly the distribution of a MaxEnt model whose weights W maximize the likelihood of the training data!
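A quick numeric check in Python: among distributions over the four allowed tags, the uniform one indeed has the highest entropy (the skewed distribution is made up for contrast):

```python
import math

def entropy(p):
    """Shannon entropy H(p) = -sum_i p_i * log2(p_i), with 0 log 0 = 0."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

print(entropy([0.25, 0.25, 0.25, 0.25]))   # 2.0 bits: maximal for 4 outcomes
print(entropy([0.70, 0.10, 0.10, 0.10]))   # about 1.36 bits: lower
```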
Agenda
HMM Recap
MaxEnt Model
HMM + MaxEnt ~= MEMM
Training & Decoding
Experimental results and discussion
HMM vs. MEMM
(Figure: graphical structures of HMM and MEMM)
Why HMM is not sufficient
An HMM is based on the probabilities P(tag|tag) and P(word|tag). For tagging unknown words, useful features include capitalization, the presence of hyphens, and word endings. An HMM cannot use such features, and it is unable to use information from later words to inform its decision early on. The MEMM (Maximum Entropy Markov Model) mates the Viterbi algorithm with MaxEnt to overcome this problem!
HMM (Generative) Vs MEMM (Discriminative)
HMM: two probabilities, the observation likelihood and the prior.
MEMM: a single probability, conditioned on the previous state and the observation.
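Written out (following the standard textbook formulation of this contrast):

```latex
\text{HMM:}\quad \hat{T} = \operatorname*{argmax}_{T} \prod_i P(word_i \mid tag_i)\, P(tag_i \mid tag_{i-1})
\qquad
\text{MEMM:}\quad \hat{T} = \operatorname*{argmax}_{T} \prod_i P(tag_i \mid tag_{i-1}, word_i)
```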
MEMM
An MEMM can condition on many features of the input: capitalization, morphology (ending in -s or -ed), as well as earlier words or tags.
An HMM can't, since it is likelihood-based and would need to compute the likelihood of each feature.
(Figure: dependency diagrams of HMM vs. MEMM)
Agenda
HMM Recap
MaxEnt Model
HMM + MaxEnt ~= MEMM
Training & Decoding
Experimental results and discussion
Decoding (Inference) and Learning in MEMM
Viterbi in HMM: v_t(j) = max_i v_{t−1}(i) · a_{ij} · b_j(o_t)
Viterbi in MEMM: v_t(j) = max_i v_{t−1}(i) · P(s_j | s_i, o_t)
Learning in MEMM: train the weights so as to maximize the log-likelihood of the training corpus. GIS!
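A sketch of Viterbi decoding in an MEMM, assuming a hypothetical interface p_state(prev, obs) that returns the MaxEnt distribution over next states (this interface is mine, not from the slides):

```python
# Viterbi for an MEMM: identical DP shape to HMM Viterbi, but the HMM's
# transition * emission product is replaced by one conditional model
# P(state | prev_state, observation). Illustrative sketch.

def memm_viterbi(states, obs_seq, p_start, p_state):
    # Initialization: v_1(s) = P(s | <start>, o_1)
    v = {s: p_start(obs_seq[0])[s] for s in states}
    backptr = []
    for o in obs_seq[1:]:
        step, nv = {}, {}
        for s in states:
            # v_t(s) = max_prev v_{t-1}(prev) * P(s | prev, o)
            best = max(states, key=lambda prev: v[prev] * p_state(prev, o)[s])
            step[s] = best
            nv[s] = v[best] * p_state(best, o)[s]
        backptr.append(step)
        v = nv
    # Backtrace from the best final state
    cur = max(states, key=lambda s: v[s])
    path = [cur]
    for step in reversed(backptr):
        cur = step[cur]
        path.append(cur)
    return list(reversed(path))
```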
Learning: GIS basics
Estimate a probability distribution p(a, b), where a ranges over the classes (y1, y2) and b is a binary feature indicator (0, 1).
Constraint on the model's expectation: p(y1, 0) + p(y2, 0) = 0.6, i.e. E_p[f] = 0.6 (the expectation of the feature)
Normalization: p(y1, 0) + p(y1, 1) + p(y2, 0) + p(y2, 1) = 1
Find the probability distribution p with maximum entropy:

       b=0   b=1
 y1     ?     ?
 y2     ?     ?

MaxEnt solution:

       b=0   b=1
 y1    .3    .2
 y2    .3    .2
GIS algorithm sketch
GIS finds the weight parameters of the unique distribution p* belonging to both P (the distributions satisfying the constraints) and Q (the distributions of exponential form).
Each iteration of GIS requires the model expectation and the training-set expectation of the features.
The training-set expectation is just a count of the features, but the model expectation requires approximation.
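A toy GIS run on the 2×2 example above (illustrative; the slack feature that standard GIS adds so feature counts sum to a constant C is omitted, since the single feature here already has maximum total count C = 1):

```python
import math

# Events (y, b) with one binary feature f(y, b) = 1 iff b == 0,
# constrained to the training-set expectation E_p[f] = 0.6.
events = [("y1", 0), ("y1", 1), ("y2", 0), ("y2", 1)]

def f(y, b):
    return 1.0 if b == 0 else 0.0

target = 0.6   # training-set expectation of f (given as a constraint)
C = 1.0        # maximum total feature count per event
w = 0.0        # weight of the single feature

for _ in range(50):
    # Model distribution: p(y, b) proportional to exp(w * f(y, b))
    scores = [math.exp(w * f(y, b)) for y, b in events]
    Z = sum(scores)
    p = [s / Z for s in scores]
    # Model expectation of the feature under the current p
    model_exp = sum(pi * f(y, b) for pi, (y, b) in zip(p, events))
    # GIS update: w_i += (1/C) * log(E_emp[f_i] / E_model[f_i])
    w += (1.0 / C) * math.log(target / model_exp)

print([round(pi, 3) for pi in p])   # -> [0.3, 0.2, 0.3, 0.2]
```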
Agenda
HMM Recap
MaxEnt Model
HMM + MaxEnt ~= MEMM
Training & Decoding
Experimental results and discussion
Application: segmentation of FAQs
38 files belonging to 7 Usenet multi-part FAQs (each FAQ is a set of files). Basic file structure:
header: text in Usenet header format
[preamble or table of contents]
series of one or more question/answer pairs
tail:
[copyright]
[acknowledgements]
[origin of document]
Formatting regularities: indentation, numbered questions, types of paragraph breaks. Formatting is consistent within a single FAQ.
Train data
Lines in each file are hand-labeled into 4 categories: head, question, answer, tail. Example:
<head>Archive-name: acorn/faq/part2
<head>Frequency: monthly
<head>
<question> 2.6) What configuration of serial cable should I use
<answer>
<answer> Here follows a diagram of the necessary connections
<answer> programs to work properly. They are as far as I know t
<answer>
<answer> Pins 1, 4, and 8 must be connected together inside
<answer> is to avoid the well known serial port chip bugs. The
Prediction: Given a sequence of lines, a learner must return a sequence of labels.
Boolean features of lines
The 24 line-based features used in the experiments are:
begins-with-number contains-question-mark
begins-with-ordinal contains-question-word
begins-with-punctuation ends-with-question-mark
begins-with-question-word first-alpha-is-capitalized
begins-with-subject indented
blank indented-1-to-4
contains-alphanum indented-5-to-10
contains-bracketed-number more-than-one-third-space
contains-http only-punctuation
contains-non-space prev-is-blank
contains-number prev-begins-with-ordinal
contains-pipe shorter-than-30
Experiment setup
“Leave one out” testing: for each file in a group (FAQ), train a learner on that file and test it on each of the remaining files in the group.
Scores are averaged over the n(n−1) resulting train/test pairs.
Evaluation metrics
Segment: consecutive lines belonging to the same category.
Co-occurrence agreement probability (COAP): the empirical probability that the actual and the predicted segmentation agree on the placement of two lines, according to some distance distribution D between lines.
Measures whether segment boundaries are properly aligned by the learner.
Segmentation precision (SP): the fraction of predicted segments that exactly match an actual segment.
Segmentation recall (SR): the fraction of actual segments that are exactly matched by a predicted segment.
Comparison of learners
ME-Stateless: a maximum entropy classifier; the document is treated as an unordered set of lines; lines are classified in isolation using the binary features, without the label of the previous line.
TokenHMM: a fully connected HMM with hidden states for each of the four labels; no binary features; transitions between states only at line boundaries.
FeatureHMM: same as TokenHMM, but lines are converted to sequences of features.
MEMM: the model presented in this talk.
Results
Learner        COAP   SegPrec  SegRecall
ME-Stateless   0.52   0.038    0.362
TokenHMM       0.865  0.276    0.14
FeatureHMM     0.941  0.413    0.529
MEMM           0.965  0.867    0.681