HMM, MEMM and CRF
40-957 Special Topics in Artificial Intelligence:
Probabilistic Graphical Models
Sharif University of Technology
Soleymani
Spring 2014
Sequence labeling
2
Labeling a set of interrelated instances 𝒙1, … , 𝒙𝑇 collectively and jointly
We get as input a sequence of observations 𝑿 = 𝒙1:𝑇 and need to label it with a joint label 𝒀 = 𝑦1:𝑇
Generalization of mixture models for
sequential data
3
[Jordan]
[Figure: a mixture model with latent variable Z generating observation X, generalized over time to an HMM chain Y_1 → Y_2 → ⋯ → Y_{T−1} → Y_T with emissions X_1, X_2, … , X_{T−1}, X_T]
Y: states (latent variables)
X: observations
HMM examples
4
Part-of-speech-tagging
Speech recognition
Example: tagging the sentence “Students are expected to study” with the tag sequence NNP VBZ VBN TO VB
HMM: probabilistic model
5
Transition probabilities: probabilities of transitions between states
𝐴𝑖𝑗 ≡ 𝑃(𝑌𝑡 = 𝑗|𝑌𝑡−1 = 𝑖)
Initial state distribution: start probabilities in different
states
𝜋𝑖 ≡ 𝑃(𝑌1 = 𝑖)
Observation model: Emission probabilities associated with
each state
𝑃(𝑋𝑡|𝑌𝑡 , 𝚽)
HMM: probabilistic model
6
Transition probabilities: probabilities of transitions between states
𝑃(𝑌𝑡|𝑌𝑡−1 = 𝑖)~𝑀𝑢𝑙𝑡𝑖(𝐴𝑖1, … , 𝐴𝑖𝑀) ∀𝑖 ∈ 𝑠𝑡𝑎𝑡𝑒𝑠
Initial state distribution: start probabilities in different
states
𝑃(𝑌1)~𝑀𝑢𝑙𝑡𝑖(𝜋1, … , 𝜋𝑀)
Observation model: Emission probabilities associated with
each state
Discrete observations: 𝑃(𝑋𝑡|𝑌𝑡 = 𝑖)~𝑀𝑢𝑙𝑡𝑖(𝐵𝑖,1, … , 𝐵𝑖,𝐾) ∀𝑖 ∈ 𝑠𝑡𝑎𝑡𝑒𝑠
General: 𝑃(𝑋𝑡|𝑌𝑡 = 𝑖) = 𝑓(⋅ |𝜽𝑖)
𝑌: states (latent variables)
𝑋: observations
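To make the parameterization concrete, here is a minimal NumPy sketch (not from the slides; the toy numbers and the names pi, A, B are only illustrative) of the three parameter groups of a discrete-observation HMM:

```python
import numpy as np

# A minimal sketch of the HMM parameterization for M = 2 states and K = 3
# discrete observation symbols (toy numbers chosen only for illustration).
M, K = 2, 3
pi = np.array([0.6, 0.4])            # pi[i]   = P(Y_1 = i)
A = np.array([[0.7, 0.3],            # A[i, j] = P(Y_t = j | Y_{t-1} = i)
              [0.2, 0.8]])
B = np.array([[0.5, 0.4, 0.1],       # B[i, k] = P(X_t = k | Y_t = i)
              [0.1, 0.3, 0.6]])

# Each row is a probability distribution, so rows must sum to 1.
assert np.allclose(pi.sum(), 1) and np.allclose(A.sum(1), 1) and np.allclose(B.sum(1), 1)
```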
Inference problems in sequential data
7
Decoding: argmax_{y_1,…,y_T} P(y_1, … , y_T | x_1, … , x_T)
Evaluation: P(x_1, … , x_T)
Filtering: P(y_t | x_1, … , x_t)
Smoothing: P(y_{t′} | x_1, … , x_t), for t′ < t
Prediction: P(y_{t′} | x_1, … , x_t), for t′ > t
Some questions
8
P(y_t | x_1, … , x_t) = ?
Forward algorithm
P(x_1, … , x_T) = ?
Forward algorithm
P(y_t | x_1, … , x_T) = ?
Forward-Backward algorithm
How do we adjust the HMM parameters?
Complete data: each training data includes a state sequence and
the corresponding observation sequence
Incomplete data: each training data includes only an observation
sequence
Forward algorithm
9
α_t(i) = P(x_1, … , x_t, Y_t = i)

α_t(i) = [ Σ_j α_{t−1}(j) P(Y_t = i | Y_{t−1} = j) ] P(x_t | Y_t = i)

Initialization:
α_1(i) = P(x_1, Y_1 = i) = P(x_1 | Y_1 = i) P(Y_1 = i)
Iterations: for t = 2 to T, with i, j = 1, … , M:
α_t(i) = [ Σ_j α_{t−1}(j) P(Y_t = i | Y_{t−1} = j) ] P(x_t | Y_t = i)

[Figure: HMM chain Y_1 → Y_2 → ⋯ → Y_{T−1} → Y_T with emissions X_1, … , X_T; forward messages α_1(⋅), α_2(⋅), … , α_{T−1}(⋅), α_T(⋅) are passed left to right, with α_t(i) = m_{t−1→t}(i)]
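As an illustration, a minimal NumPy sketch of the forward recursion above (the function name `forward` and the toy parameters are mine, not the lecture's):

```python
import numpy as np

def forward(pi, A, B, x):
    """alpha[t, i] = P(x_1, ..., x_{t+1}, Y_{t+1} = i) for a discrete-observation HMM (0-based t)."""
    T, M = len(x), len(pi)
    alpha = np.zeros((T, M))
    alpha[0] = pi * B[:, x[0]]                    # alpha_1(i) = P(x_1 | Y_1 = i) P(Y_1 = i)
    for t in range(1, T):
        # alpha_t(i) = [sum_j alpha_{t-1}(j) A_{ji}] * P(x_t | Y_t = i)
        alpha[t] = (alpha[t - 1] @ A) * B[:, x[t]]
    return alpha

# Toy example: 2 states, 3 observation symbols.
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.2, 0.8]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
x = [0, 2, 1]
alpha = forward(pi, A, B, x)
print(alpha[-1].sum())   # P(x_1, ..., x_T) = sum_i alpha_T(i)
```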
Backward algorithm
10
β_t(i) = m_{t→t−1}(i) = P(x_{t+1}, … , x_T | Y_t = i)

β_{t−1}(i) = Σ_j β_t(j) P(Y_t = j | Y_{t−1} = i) P(x_t | Y_t = j)

Initialization:
β_T(i) = 1
Iterations: for t = T down to 2, with i, j ∈ states:
β_{t−1}(i) = Σ_j β_t(j) P(Y_t = j | Y_{t−1} = i) P(x_t | Y_t = j)

[Figure: HMM chain with backward messages β_T(⋅) = 1, β_{T−1}(⋅), … , β_2(⋅), β_1(⋅) passed right to left, with β_t(i) = m_{t→t−1}(i)]
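Similarly, a sketch of the backward recursion (again with an assumed function name and toy parameters):

```python
import numpy as np

def backward(A, B, x):
    """beta[t, i] = P(x_{t+2}, ..., x_T | Y_{t+1} = i) (0-based t); beta_T(i) = 1."""
    T, M = len(x), A.shape[0]
    beta = np.ones((T, M))
    for t in range(T - 1, 0, -1):
        # beta_{t-1}(i) = sum_j P(Y_t = j | Y_{t-1} = i) P(x_t | Y_t = j) beta_t(j)
        beta[t - 1] = A @ (B[:, x[t]] * beta[t])
    return beta

# Toy example; the evidence P(x_1, ..., x_T) can also be read off the backward pass:
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.2, 0.8]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
x = [0, 2, 1]
beta = backward(A, B, x)
print(np.sum(pi * B[:, x[0]] * beta[0]))   # equals sum_i alpha_T(i) from the forward pass
```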
Forward-backward algorithm
11
α_t(i) = [ Σ_j α_{t−1}(j) P(Y_t = i | Y_{t−1} = j) ] P(x_t | Y_t = i),  α_1(i) = P(x_1, Y_1 = i) = P(x_1 | Y_1 = i) P(Y_1 = i)

β_{t−1}(i) = Σ_j β_t(j) P(Y_t = j | Y_{t−1} = i) P(x_t | Y_t = j),  β_T(i) = 1

P(x_1, x_2, … , x_T) = Σ_i α_T(i) β_T(i) = Σ_i α_T(i)

P(Y_t = i | x_1, x_2, … , x_T) = α_t(i) β_t(i) / Σ_j α_T(j)

where α_t(i) ≡ P(x_1, x_2, … , x_t, Y_t = i) and β_t(i) ≡ P(x_{t+1}, x_{t+2}, … , x_T | Y_t = i)
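Given α and β arrays produced by forward and backward passes such as the sketches above, the smoothed posterior follows directly from this formula; the helper name below is hypothetical:

```python
import numpy as np

def smoothed_marginals(alpha, beta):
    """gamma[t, i] = P(Y_t = i | x_1, ..., x_T) = alpha_t(i) beta_t(i) / sum_j alpha_T(j)."""
    evidence = alpha[-1].sum()          # P(x_1, ..., x_T)
    gamma = alpha * beta / evidence
    return gamma                        # each row is a distribution over states and sums to 1
```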
Forward-backward algorithm
12
This will be used in expectation maximization to train an HMM
P(Y_t = i | x_1, … , x_T) = P(x_1, … , x_T, Y_t = i) / P(x_1, … , x_T) = α_t(i) β_t(i) / Σ_{j=1}^{N} α_T(j)

[Figure: HMM chain with forward messages α_1(⋅), … , α_T(⋅) and backward messages β_T(⋅) = 1, … , β_1(⋅)]
Decoding Problem
13
Choose the state sequence that is most probable given the observations:
argmax_{y_1,…,y_t} P(y_1, … , y_t | x_1, … , x_t)
Viterbi algorithm:
Define the auxiliary variable δ:
δ_t(i) = max_{y_1,…,y_{t−1}} P(y_1, y_2, … , y_{t−1}, Y_t = i, x_1, x_2, … , x_t)
δ_t(i): probability of the most probable path ending in state Y_t = i
Recursive relation:
δ_t(i) = max_{j=1,…,N} δ_{t−1}(j) P(Y_t = i | Y_{t−1} = j) P(x_t | Y_t = i)
Decoding Problem: Viterbi algorithm
14
Initialization: for i = 1, … , M
δ_1(i) = P(x_1 | Y_1 = i) P(Y_1 = i)
ψ_1(i) = 0
Iterations: for t = 2, … , T and i = 1, … , M
δ_t(i) = max_j δ_{t−1}(j) P(Y_t = i | Y_{t−1} = j) P(x_t | Y_t = i)
ψ_t(i) = argmax_j δ_{t−1}(j) P(Y_t = i | Y_{t−1} = j)
Final computation:
P* = max_{j=1,…,N} δ_T(j)
y_T* = argmax_{j=1,…,N} δ_T(j)
Traceback of the state sequence: for t = T − 1 down to 1
y_t* = ψ_{t+1}(y_{t+1}*)
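A compact NumPy sketch of the Viterbi recursion and traceback above (function name and toy parameters assumed for illustration):

```python
import numpy as np

def viterbi(pi, A, B, x):
    """Most probable state sequence argmax_y P(y, x) for a discrete-observation HMM."""
    T, M = len(x), len(pi)
    delta = np.zeros((T, M))      # delta_t(i): probability of the best path ending in state i at t
    psi = np.zeros((T, M), int)   # psi_t(i): best predecessor state
    delta[0] = pi * B[:, x[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A          # scores[j, i] = delta_{t-1}(j) A_{ji}
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * B[:, x[t]]
    # Traceback of the best state sequence.
    y = np.zeros(T, int)
    y[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):
        y[t] = psi[t + 1][y[t + 1]]
    return y, delta[-1].max()

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.2, 0.8]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
print(viterbi(pi, A, B, [0, 2, 1]))
```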
Max-product algorithm
15
m^max_{j→i}(x_i) = max_{x_j} φ(x_j) φ(x_i, x_j) Π_{k ∈ 𝒩(j)∖i} m^max_{k→j}(x_j)

δ_t(i) = m^max_{t−1→t}(i) × φ(x_i)  (in the HMM, the node potential plays the role of the emission term P(x_t | Y_t = i))
HMM Learning
16
Supervised learning: each training sample contains a pair of sequences (the observation sequence and the corresponding state sequence)
Unsupervised learning: each training sample contains only an observation sequence
HMM supervised learning by MLE
17
Initial state probability:
π_i = P(Y_1 = i), 1 ≤ i ≤ M
State transition probability:
A_ji = P(Y_{t+1} = i | Y_t = j), 1 ≤ i, j ≤ M
Emission probability:
B_ik = P(X_t = k | Y_t = i), 1 ≤ i ≤ M, 1 ≤ k ≤ K
[Figure: HMM chain Y_1 → Y_2 → ⋯ → Y_{T−1} → Y_T with emissions X_1, … , X_T, annotated with parameters π (initial state), A (transitions), and B (emissions); discrete observations]
HMM: supervised parameter learning by MLE
18
P(𝒟 | θ) = Π_{n=1}^{N} [ P(y_1^{(n)} | π) Π_{t=2}^{T} P(y_t^{(n)} | y_{t−1}^{(n)}, A) Π_{t=1}^{T} P(x_t^{(n)} | y_t^{(n)}, B) ]

ML estimates (discrete observations):

π_i = Σ_{n=1}^{N} I(y_1^{(n)} = i) / N

A_ij = Σ_{n=1}^{N} Σ_{t=2}^{T} I(y_{t−1}^{(n)} = i, y_t^{(n)} = j) / Σ_{n=1}^{N} Σ_{t=2}^{T} I(y_{t−1}^{(n)} = i)

B_ik = Σ_{n=1}^{N} Σ_{t=1}^{T} I(y_t^{(n)} = i, x_t^{(n)} = k) / Σ_{n=1}^{N} Σ_{t=1}^{T} I(y_t^{(n)} = i)
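These ML estimates are simply normalized counts; a sketch on tiny toy labeled sequences (the function name hmm_mle is mine, not the lecture's):

```python
import numpy as np

def hmm_mle(state_seqs, obs_seqs, M, K):
    """Count-based ML estimates of (pi, A, B) from fully labeled sequences."""
    pi = np.zeros(M)
    A = np.zeros((M, M))
    B = np.zeros((M, K))
    for y, x in zip(state_seqs, obs_seqs):
        pi[y[0]] += 1                               # initial-state counts
        for t in range(1, len(y)):
            A[y[t - 1], y[t]] += 1                  # transition counts
        for t in range(len(y)):
            B[y[t], x[t]] += 1                      # emission counts
    # Normalize the counts into probabilities.
    return pi / pi.sum(), A / A.sum(1, keepdims=True), B / B.sum(1, keepdims=True)

# Toy labeled data: two (state sequence, observation sequence) pairs.
states = [[0, 0, 1], [1, 1, 0]]
obs = [[0, 2, 1], [1, 2, 0]]
print(hmm_mle(states, obs, M=2, K=3))
```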
Learning
19
Problem: how to construct an HMM given only observations?
Find 𝜽 = (𝑨, 𝑩, 𝝅), maximizing 𝑃(𝒙1, … , 𝒙𝑇|𝜽)
Incomplete data
EM algorithm
HMM learning by EM (Baum-Welch)
20
θ^old = (π^old, A^old, Φ^old)
E-Step:
γ_{n,t}(i) = P(Y_t^{(n)} = i | 𝒙^{(n)}; θ^old)
ξ_{n,t}(i, j) = P(Y_{t−1}^{(n)} = i, Y_t^{(n)} = j | 𝒙^{(n)}; θ^old)
M-Step: for i, j = 1, … , M
π_i^new = Σ_{n=1}^{N} γ_{n,1}(i) / N
A_ij^new = Σ_{n=1}^{N} Σ_{t=2}^{T} ξ_{n,t}(i, j) / Σ_{n=1}^{N} Σ_{t=1}^{T−1} γ_{n,t}(i)
B_ik^new = Σ_{n=1}^{N} Σ_{t=1}^{T} γ_{n,t}(i) I(x_t^{(n)} = k) / Σ_{n=1}^{N} Σ_{t=1}^{T} γ_{n,t}(i)
Baum-Welch algorithm (Baum, 1972)
Discrete observations
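A single Baum-Welch iteration for one observation sequence can be sketched as follows (a single-sequence simplification of the updates above; the function name, toy parameters, and structure are my assumptions, not the lecture's code):

```python
import numpy as np

def baum_welch_step(pi, A, B, x):
    """One EM (Baum-Welch) update from a single discrete observation sequence x."""
    T, M = len(x), len(pi)
    # E-step: forward and backward passes (as on the earlier slides).
    alpha, beta = np.zeros((T, M)), np.ones((T, M))
    alpha[0] = pi * B[:, x[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, x[t]]
    for t in range(T - 1, 0, -1):
        beta[t - 1] = A @ (B[:, x[t]] * beta[t])
    evidence = alpha[-1].sum()
    gamma = alpha * beta / evidence                 # gamma[t, i] = P(Y_t = i | x)
    xi = np.zeros((T, M, M))                        # xi[t, i, j] = P(Y_{t-1} = i, Y_t = j | x)
    for t in range(1, T):
        xi[t] = alpha[t - 1][:, None] * A * (B[:, x[t]] * beta[t])[None, :] / evidence
    # M-step: single-sequence version of the slide's update equations.
    pi_new = gamma[0]
    A_new = xi[1:].sum(0) / gamma[:-1].sum(0)[:, None]
    B_new = np.zeros_like(B)
    for k in range(B.shape[1]):
        B_new[:, k] = gamma[np.array(x) == k].sum(0)
    B_new /= gamma.sum(0)[:, None]
    return pi_new, A_new, B_new

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.2, 0.8]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
print(baum_welch_step(pi, A, B, [0, 2, 1, 1]))
```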
HMM learning by EM (Baum-Welch)
21
The E-step and the updates for π^new and A^new are the same as on the previous slide. With Gaussian emission probabilities P(x_t | Y_t = i) = 𝒩(x_t | μ_i, Σ_i), the M-step emission updates become:

μ_i^new = Σ_{n=1}^{N} Σ_{t=1}^{T} γ_{n,t}(i) x_t^{(n)} / Σ_{n=1}^{N} Σ_{t=1}^{T} γ_{n,t}(i)

Σ_i^new = Σ_{n=1}^{N} Σ_{t=1}^{T} γ_{n,t}(i) (x_t^{(n)} − μ_i^new)(x_t^{(n)} − μ_i^new)^T / Σ_{n=1}^{N} Σ_{t=1}^{T} γ_{n,t}(i)

Assumption: Gaussian emission probabilities
Baum-Welch algorithm (Baum, 1972)
Forward-backward algorithm for E-step
22
P(y_{t−1}, y_t | x_1, … , x_T)
= P(x_1, … , x_T | y_{t−1}, y_t) P(y_{t−1}, y_t) / P(x_1, … , x_T)
= P(x_1, … , x_{t−1} | y_{t−1}) P(x_t | y_t) P(x_{t+1}, … , x_T | y_t) P(y_t | y_{t−1}) P(y_{t−1}) / P(x_1, … , x_T)
= α_{t−1}(y_{t−1}) P(x_t | y_t) P(y_t | y_{t−1}) β_t(y_t) / Σ_{i=1}^{M} α_T(i)
HMM shortcomings
23
In modeling the joint distribution P(Y, X), the HMM ignores many dependencies between the observations X_1, … , X_T (like most generative models, it has to simplify the structure)
In the sequence labeling task, we classify an observation sequence using the conditional probability P(Y | X)
However, the HMM learns the joint distribution P(Y, X) while only P(Y | X) is used for labeling
[Figure: HMM chain Y_1 → Y_2 → ⋯ → Y_{T−1} → Y_T with emissions X_1, … , X_T]
Maximum Entropy Markov Model (MEMM)
24
P(y_{1:T} | x_{1:T}) = P(y_1 | x_{1:T}) Π_{t=2}^{T} P(y_t | y_{t−1}, x_{1:T})

P(y_t | y_{t−1}, x_{1:T}) = exp( w^T f(y_t, y_{t−1}, x_{1:T}) ) / Z(y_{t−1}, x_{1:T}, w)

Z(y_{t−1}, x_{1:T}, w) = Σ_{y_t} exp( w^T f(y_t, y_{t−1}, x_{1:T}) )

Discriminative model:
Only models P(Y | X) and completely ignores modeling P(X)
Maximizes the conditional likelihood P(𝒟_Y | 𝒟_X, θ)

[Figures: HMM chain (each Y_t emits X_t) vs. MEMM chain (each Y_t depends on Y_{t−1} and the whole observation sequence x_{1:T})]
Feature function
25
The feature function f(y_t, y_{t−1}, x_{1:T}) can take into account relations between the data and the label space
However, the features are often indicator functions showing the presence or absence of a property
Each weight w_i captures how strongly the feature f_i(y_t, y_{t−1}, x_{1:T}) is associated with the label
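To make this concrete, here is a hypothetical pair of indicator feature templates and the locally normalized MEMM conditional P(y_t | y_{t−1}, x); the feature templates, function names, and toy sizes are illustrative assumptions, not the lecture's:

```python
import numpy as np

# Hypothetical indicator features f(y_t, y_{t-1}, x, t) for a toy POS-style tagger.
def features(y_t, y_prev, x, t, num_tags, vocab_size):
    f = np.zeros(num_tags * num_tags + num_tags * vocab_size)
    f[y_prev * num_tags + y_t] = 1.0                         # transition indicator (y_{t-1}, y_t)
    f[num_tags * num_tags + y_t * vocab_size + x[t]] = 1.0   # emission-style indicator (y_t, x_t)
    return f

def memm_local_prob(w, y_prev, x, t, num_tags, vocab_size):
    """P(y_t | y_{t-1}, x) = exp(w^T f) / sum_{y'} exp(w^T f), normalized locally over y_t."""
    scores = np.array([w @ features(y, y_prev, x, t, num_tags, vocab_size)
                       for y in range(num_tags)])
    scores -= scores.max()                                   # for numerical stability
    p = np.exp(scores)
    return p / p.sum()

num_tags, vocab_size = 3, 5
w = np.zeros(num_tags * num_tags + num_tags * vocab_size)    # untrained weights -> uniform
print(memm_local_prob(w, y_prev=0, x=[1, 4, 2], t=1, num_tags=num_tags, vocab_size=vocab_size))
```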
MEMM disadvantages
26
Later observations in the sequence have no effect on the posterior probability of the current state
the model does not allow for any smoothing
the model is incapable of going back and revising its predictions about earlier time steps
The label bias problem
there are cases where a given observation is not useful for predicting the next state of the model
if a state has a unique outgoing transition, the given observation is useless
Label bias problem in MEMM
27
Label bias problem: states with fewer outgoing arcs are preferred
In decoding, states whose transition distributions have lower entropy are preferred over others
MEMMs should probably be avoided in cases where many transitions are close to deterministic
Extreme case: when a state has only one outgoing arc, it does not matter what the observation is
The source of this problem: probabilities of outgoing arcs are normalized separately for each state
the transition probabilities out of any state must sum to 1
Solution: do not normalize probabilities locally
From MEMM to CRF
28
From local probabilities to local potentials
P(Y | X) = (1 / Z(X)) Π_{t=1}^{T} φ(y_{t−1}, y_t, X)
= (1 / Z(X, w)) Π_{t=1}^{T} exp( w^T f(y_t, y_{t−1}, x) )

CRF is a discriminative model (like MEMM) that
can model dependence between each state and the entire observation sequence
uses a global normalizer Z(X, w), which overcomes the label bias problem of MEMM
MEMMs use an exponential model for each state, while CRFs have a single exponential model for the joint probability of the entire label sequence

[Figure: linear-chain CRF over Y_1, Y_2, … , Y_{T−1}, Y_T, each connected to the whole observation sequence x_{1:T}]
X ≡ x_{1:T}
Y ≡ y_{1:T}
CRF: conditional distribution
29
P(y | x) = (1 / Z(x, λ)) exp( Σ_{i=1}^{T} λ^T f(y_i, y_{i−1}, x) )
= (1 / Z(x, λ)) exp( Σ_{i=1}^{T} Σ_k λ_k f_k(y_i, y_{i−1}, x) )

Z(x, λ) = Σ_y exp( Σ_{i=1}^{T} λ^T f(y_i, y_{i−1}, x) )
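A sketch of how the unnormalized sequence score and the global normalizer Z(x, λ) can be computed for a linear chain, assuming the per-position log-potentials λ^T f(y_i, y_{i−1}, x) have already been collected into arrays (the names and array layout below are my assumptions):

```python
import numpy as np

def crf_log_partition(start, pair):
    """log Z(x) for a linear-chain CRF.
    start[b]      : lambda^T f(y_1 = b, start, x) for the first position
    pair[i, a, b] : lambda^T f(y_{i+1} = b, y_i = a, x) for the remaining positions
    """
    log_alpha = start.copy()
    for i in range(pair.shape[0]):
        # log_alpha_new[b] = logsumexp_a( log_alpha[a] + pair[i, a, b] )
        scores = log_alpha[:, None] + pair[i]
        m = scores.max(axis=0)
        log_alpha = m + np.log(np.exp(scores - m).sum(axis=0))
    m = log_alpha.max()
    return m + np.log(np.exp(log_alpha - m).sum())

def crf_sequence_score(start, pair, y):
    """Unnormalized log-score lambda^T sum_i f(y_i, y_{i-1}, x) of a given label sequence y."""
    s = start[y[0]]
    for i in range(1, len(y)):
        s += pair[i - 1, y[i - 1], y[i]]
    return s

# Toy check: log P(y | x) = score(y) - log Z(x).
M, T = 3, 4
rng = np.random.default_rng(0)
start, pair = rng.normal(size=M), rng.normal(size=(T - 1, M, M))
y = [0, 2, 1, 1]
print(crf_sequence_score(start, pair, y) - crf_log_partition(start, pair))
```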
CRF: MAP inference
30
Given CRF parameters 𝝀, find the 𝒚∗ that maximizes 𝑃 𝒚 𝒙 :
y* = argmax_y exp( Σ_{i=1}^{T} λ^T f(y_i, y_{i−1}, x) )
Z(x) does not depend on y and so can be ignored
Max-product algorithm can be used for this MAP inference
problem
Same as Viterbi decoding used in HMMs
CRF: inference
31
Exact inference for 1-D chain CRFs
𝜙 𝑦𝑖 , 𝑦𝑖−1 = exp 𝝀𝑇𝒇 𝑦𝑖 , 𝑦𝑖−1, 𝒙
[Figure: linear-chain CRF over Y_1, … , Y_T connected to x_{1:T}, and its clique tree with cliques (Y_1, Y_2), (Y_2, Y_3), … , (Y_{T−1}, Y_T) and separators Y_2, … , Y_{T−1}]
CRF: learning
32
λ* = argmax_λ Π_{n=1}^{N} P(y^{(n)} | x^{(n)}, λ)

Π_{n=1}^{N} P(y^{(n)} | x^{(n)}, λ) = Π_{n=1}^{N} (1 / Z(x^{(n)}, λ)) exp( Σ_{i=1}^{T} λ^T f(y_i^{(n)}, y_{i−1}^{(n)}, x^{(n)}) )

L(λ) = Σ_{n=1}^{N} ln P(y^{(n)} | x^{(n)}, λ) = Σ_{n=1}^{N} [ Σ_{i=1}^{T} λ^T f(y_i^{(n)}, y_{i−1}^{(n)}, x^{(n)}) − ln Z(x^{(n)}, λ) ]

∇_λ L(λ) = Σ_{n=1}^{N} [ Σ_{i=1}^{T} f(y_i^{(n)}, y_{i−1}^{(n)}, x^{(n)}) − ∇_λ ln Z(x^{(n)}, λ) ]

∇_λ ln Z(x^{(n)}, λ) = Σ_y P(y | x^{(n)}, λ) Σ_{i=1}^{T} f(y_i, y_{i−1}, x^{(n)})

Maximum conditional likelihood: θ = argmax_θ P(𝒟_Y | 𝒟_X, θ)
Gradient of the log-partition function in an exponential family is
the expectation of the sufficient statistics.
CRF: learning
33
∇_λ ln Z(x^{(n)}, λ) = Σ_y P(y | x^{(n)}, λ) Σ_{i=1}^{T} f(y_i, y_{i−1}, x^{(n)})
= Σ_{i=1}^{T} Σ_y P(y | x^{(n)}, λ) f(y_i, y_{i−1}, x^{(n)})
= Σ_{i=1}^{T} Σ_{y_i, y_{i−1}} P(y_i, y_{i−1} | x^{(n)}, λ) f(y_i, y_{i−1}, x^{(n)})

How do we find the above expectations?
P(y_i, y_{i−1} | x^{(n)}, λ) must be computed for all i = 2, … , T
CRF: learning
Inference to find 𝑃(𝑦𝑖 , 𝑦𝑖−1|𝒙 𝑛 , 𝝀)
34
Junction tree algorithm
Initialization of clique potentials:
ψ(y_i, y_{i−1}) = exp( λ^T f(y_i, y_{i−1}, x^{(n)}) )
φ(y_i) = 1
After calibration, the clique marginals ψ*(y_i, y_{i−1}) give:
P(y_i, y_{i−1} | x^{(n)}, λ) = ψ*(y_i, y_{i−1}) / Σ_{y_i, y_{i−1}} ψ*(y_i, y_{i−1})

[Figure: clique tree with cliques (Y_1, Y_2), (Y_2, Y_3), … , (Y_{T−1}, Y_T) and separators Y_2, … , Y_{T−1}]
CRF learning: gradient descent
35
λ^{(t+1)} = λ^{(t)} + η ∇_λ L(λ^{(t)})

∇_λ L(λ) = Σ_{n=1}^{N} [ Σ_{i=1}^{T} f(y_i^{(n)}, y_{i−1}^{(n)}, x^{(n)}) − ∇_λ ln Z(x^{(n)}, λ) ]

In each gradient step, for each data sample, we use inference to find P(y_i, y_{i−1} | x^{(n)}, λ), which is required for computing the feature expectation:

∇_λ ln Z(x^{(n)}, λ) = E_{P(y | x^{(n)}, λ^{(t)})} [ Σ_{i=1}^{T} f(y_i, y_{i−1}, x^{(n)}) ] = Σ_{i=1}^{T} Σ_{y_i, y_{i−1}} P(y_i, y_{i−1} | x^{(n)}, λ) f(y_i, y_{i−1}, x^{(n)})
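A schematic gradient-ascent loop for this procedure; the helpers feat_counts and expected_counts (which would run the inference described above for each sequence) are assumed, not given in the lecture:

```python
import numpy as np

def crf_train(N, dim, feat_counts, expected_counts, n_iters=100, eta=0.1):
    """Gradient ascent on the conditional log-likelihood L(lambda) of a linear-chain CRF.

    feat_counts(n)          -> empirical feature vector: sum_i f(y_i^(n), y_{i-1}^(n), x^(n))
    expected_counts(n, lam) -> expected feature vector: sum_i E_{P(y_i, y_{i-1} | x^(n), lam)}[f],
                               obtained by running inference on sequence n (hypothetical helper).
    """
    lam = np.zeros(dim)
    for _ in range(n_iters):
        grad = np.zeros(dim)
        for n in range(N):
            # Gradient of L: empirical features minus expected features (see the slide above).
            grad += feat_counts(n) - expected_counts(n, lam)
        lam = lam + eta * grad          # lambda^(t+1) = lambda^(t) + eta * grad_lambda L
    return lam
```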
Summary
36
Discriminative vs. generative
In cases where we have many correlated features, discriminative models (MEMM and CRF) are often better:
they avoid the challenge of explicitly modeling the distribution over the features.
But if only limited training data are available, the stronger bias of the generative model may dominate, and generative models may be preferred.
Learning
HMMs and MEMMs are much more easily learned.
CRF training requires an iterative gradient-based approach, which is considerably more expensive:
inference must be run separately for every training sequence.
MEMM vs. CRF (Label bias problem of MEMM)
In many cases, CRFs are likely to be a safer choice (particularly in cases where many transitions are close to deterministic), but the computational cost may be prohibitive for large data sets.