CS344: Introduction to Artificial Intelligence
(associated lab: CS386)
Pushpak Bhattacharyya
CSE Dept., IIT Bombay
Lecture 22, 25: EM, Baum-Welch; 28th March and 1st April, 2014
(Lecture 21 was by Girish on Markov Logic Networks)
Expectation Maximization
One of the key ideas of Statistical AI, ML, NLP, CV
Iterative procedure:
Find parameters
Find hidden variables
Maximize data likelihood
The coin tossing problem
Case of 1 coin: Suppose there are N tosses of a coin.
N_H = the number of heads. What is the probability of a head, i.e., P_H = ?
Observed variable X: $x_1, x_2, x_3, \ldots, x_N$
where $x_i = 1$ when the $i$-th toss produces a head, $0$ otherwise
#Observations = N
Therefore
$$P_H = \frac{\sum_{i=1}^{N} x_i}{N}$$
Each observation is a Bernoulli trial, where
$P_H$ is the probability of success, i.e., getting a head
$1 - P_H$ is the probability of failure, i.e., getting a tail
Likelihood of X
• Likelihood of X, i.e., probability of the observation sequence X, is:
$$L(X, P_H) = \prod_{i=1}^{N} P_H^{x_i} (1 - P_H)^{1 - x_i}$$
Each trial is identical and independent. Maximum likelihood of the data requires us to set
$$\frac{dL}{dP_H} = 0$$
and thus get the expression for $P_H$.
Mathematical Convenience
Take the log of the likelihood:
$$LL(X; P_H) = \sum_{i=1}^{N} \left[ x_i \log P_H + (1 - x_i) \log(1 - P_H) \right]$$
Differentiating w.r.t. $P_H$:
$$\frac{dLL}{dP_H} = \sum_{i=1}^{N} \frac{x_i}{P_H} - \sum_{i=1}^{N} \frac{1 - x_i}{1 - P_H}$$
To get the expression for $P_H$, set $\frac{dLL}{dP_H} = 0$.
Equating to 0, the expression for $P_H$:
$$\sum_{i=1}^{N} \frac{x_i}{P_H} = \frac{N - \sum_{i=1}^{N} x_i}{1 - P_H}
\quad\Rightarrow\quad
P_H = \frac{\sum_{i=1}^{N} x_i}{N}$$
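A minimal sketch of this closed form: the MLE $P_H = \sum x_i / N$ should coincide with a brute-force maximization of the log likelihood (the data here is simulated, with an assumed true head probability of 0.7).

```python
import math
import random

def log_likelihood(tosses, p):
    """Bernoulli log likelihood: sum of x*log p + (1-x)*log(1-p)."""
    return sum(x * math.log(p) + (1 - x) * math.log(1 - p) for x in tosses)

random.seed(0)
tosses = [1 if random.random() < 0.7 else 0 for _ in range(1000)]

# Closed-form MLE from the derivation: P_H = (sum x_i) / N
p_mle = sum(tosses) / len(tosses)

# A grid search over candidate values confirms the closed form is the maximizer
grid = [i / 1000 for i in range(1, 1000)]
p_grid = max(grid, key=lambda p: log_likelihood(tosses, p))

print(p_mle, p_grid)  # both near the true value 0.7
```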
Maximum Entropy
Suppose we do not know how to get the MLE, or the likelihood expression is intractable; then we use Maximum Entropy. Example: problems like co-reference resolution.
Entropy = $-(P_H \log P_H + (1 - P_H)\log(1 - P_H))$ (to be elaborated later).
Case for Expectation Maximization
Instead of one coin we toss two coins. Parameters: <P, P1, P2>
P = probability of choosing the first coin
P1 = probability of a head from the first coin
P2 = probability of a head from the second coin
We do not know which coin each observation came from.
$$X : x_1, x_2, x_3, \ldots, x_N$$
EM continued..
Z_1, Z_2, Z_3, …, Z_N is the hidden sequence running alongside X_1, X_2, X_3, …, X_N,
where Z_i = 1 if the i-th observation came from coin 1, and 0 otherwise.
$$Z : z_1, z_2, z_3, \ldots, z_N \qquad \theta = \langle P, P_1, P_2 \rangle$$
$$\Pr(X; \theta) = \sum_{Z} \Pr(X, Z; \theta)$$
$$Y : x_1 z_1,\; x_2 z_2,\; x_3 z_3,\; \ldots,\; x_N z_N$$
Cntd.
We want to work with $\Pr(X; \theta) = \sum_Z P(X, Z; \theta)$, where
$$P(X, Z; \theta) = \prod_{i=1}^{N} \left( P \cdot P_1^{x_i} (1 - P_1)^{1 - x_i} \right)^{z_i} \left( (1 - P) \cdot P_2^{x_i} (1 - P_2)^{1 - x_i} \right)^{1 - z_i}$$
Invoke convexity/concavity and the expectation of $Z_i$, and work with $\log(\Pr(Y; \theta))$:
$$LL(X; \theta) = \log \Big( \sum_Z P(X, Z; \theta) \Big)$$
$$LL(X; \theta) = \sum_{i=1}^{N} \Big[ E(z_i)\big(\log P + x_i \log P_1 + (1 - x_i)\log(1 - P_1)\big) + \big(1 - E(z_i)\big)\big(\log(1 - P) + x_i \log P_2 + (1 - x_i)\log(1 - P_2)\big) \Big]$$
Log Likelihood of the Data
IMPORTANT POINTS TO NOTE
Log moves inside the product term. Σ disappears, giving rise to E(Z_i) in place of Z_i.
Differentiate w.r.t. P, P1, P2, equate to 0, and get the results:
$$P_1 = \frac{\sum_{i=1}^{N} E(z_i)\, x_i}{\sum_{i=1}^{N} E(z_i)}$$
$$P_2 = \frac{M - \sum_{i=1}^{N} E(z_i)\, x_i}{N - \sum_{i=1}^{N} E(z_i)}, \qquad M = \text{observed no. of heads}$$
$$P = \frac{\sum_{i=1}^{N} E(z_i)}{N}$$
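A runnable sketch of this two-coin EM loop (simulated data; update formulas as derived above). Note that with a single toss per hidden draw only the mixture head probability $P \cdot P_1 + (1-P) P_2$ is identifiable, and EM pins it to the empirical head frequency.

```python
import random

def em_two_coins(xs, p, p1, p2, iters=50):
    """EM for the two-coin mixture <P, P1, P2>.

    E-step: E(z_i) = posterior probability that toss i came from coin 1.
    M-step: the closed-form re-estimates P, P1, P2.
    """
    n, m = len(xs), sum(xs)               # m = observed number of heads (M)
    for _ in range(iters):
        # E-step: expected value of each hidden indicator z_i
        ez = []
        for x in xs:
            a = p * (p1 if x else 1 - p1)         # coin 1 explains toss i
            b = (1 - p) * (p2 if x else 1 - p2)   # coin 2 explains toss i
            ez.append(a / (a + b))
        # M-step: closed-form updates
        sez = sum(ez)
        s1 = sum(e * x for e, x in zip(ez, xs))
        p, p1, p2 = sez / n, s1 / sez, (m - s1) / (n - sez)
    return p, p1, p2

random.seed(1)
xs = [1 if random.random() < 0.55 else 0 for _ in range(2000)]
p, p1, p2 = em_two_coins(xs, p=0.4, p1=0.8, p2=0.3)
# The recovered mixture head probability matches the empirical frequency
print(p * p1 + (1 - p) * p2, sum(xs) / len(xs))
```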
Application of EM: HMM Training
Baum Welch or Forward Backward Algorithm
Key Intuition
Given: training sequence
Initialization: probability values
Compute: Pr(state seq | training seq); get expected counts of transitions; compute rule probabilities
Approach: initialize the probabilities and recompute them, an EM-like approach
[Figure: a two-state HMM with states q and r; each arc between them can emit symbol a or b]
Baum-Welch algorithm: counts
String = abb aaa bbb aaa
Sequence of states with respect to input symbols:
o/p seq:    a b b a a a b b b a a a
State seq:  q r q q r q r q q q r q r
Calculating probabilities from the table of counts (T = #states, A = #alphabet symbols):

Src  Dest  O/P  Count
q    r     a    5
q    q     b    3
r    q     a    3
r    q     b    2

$$P_a(q \rightarrow r) = 5/8 \qquad P_b(q \rightarrow q) = 3/8$$
Now if we have non-deterministic transitions, then multiple state sequences are possible for the given o/p sequence (ref. the previous slide's figure). Our aim is to find expected counts in that case.
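A minimal sketch of the count-normalization step above: transition counts (taken from the table) are divided by the total count leaving each source state.

```python
from collections import defaultdict

# Transition counts read off the deterministic state sequence, as in the
# table above: (src, output symbol, dest) -> count
counts = {('q', 'a', 'r'): 5, ('q', 'b', 'q'): 3,
          ('r', 'a', 'q'): 3, ('r', 'b', 'q'): 2}

# Normalize over everything leaving each source state
totals = defaultdict(int)
for (src, sym, dst), c in counts.items():
    totals[src] += c

probs = {k: c / totals[k[0]] for k, c in counts.items()}
print(probs[('q', 'a', 'r')])  # 5/8 = 0.625
print(probs[('q', 'b', 'q')])  # 3/8 = 0.375
```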
$$P(s_i \xrightarrow{w_m} s_j) = \frac{c(s_i \xrightarrow{w_m} s_j)}{\sum_{l=1}^{T} \sum_{m=1}^{A} c(s_i \xrightarrow{w_m} s_l)}$$
Interplay Between Two Equations
$$P(s_i \xrightarrow{w_k} s_j) = \frac{C(s_i \xrightarrow{w_k} s_j)}{\sum_{l=1}^{T} \sum_{m=1}^{A} C(s_i \xrightarrow{w_m} s_l)}$$
$$C(s_i \xrightarrow{w_k} s_j) = \sum_{S_{0,n}} P(S_{0,n} \mid W_{0,n}) \; n(s_i \xrightarrow{w_k} s_j, S_{0,n}, W_{0,n})$$
where $n(s_i \xrightarrow{w_k} s_j, S_{0,n}, W_{0,n})$ is the number of times the transition $s_i \rightarrow s_j$ (emitting $w_k$) occurs in the string.
Illustration
[Figure: the actual (desired) HMM and the initial-guess HMM over states q and r, with arc labels such as a:0.67, b:0.17, a:0.16, b:1.0 (desired) and a:0.48, b:0.48, a:0.04, b:1.0 (initial guess)]
One run of Baum-Welch algorithm: string ababb
Each transition column holds the probability-weighted count: (number of occurrences of that transition in the path) × P(path).

State seq      P(path)   q→a→r    r→b→q    q→a→q    q→b→q
q r q r q q    0.00077   0.00154  0.00154  0        0.00077
q r q q q q    0.00442   0.00442  0.00442  0.00442  0.00884
q q q r q q    0.00442   0.00442  0.00442  0.00442  0.00884
q q q q q q    0.02548   0.0      0.000    0.05096  0.07644
Rounded Total  0.035     0.01     0.01     0.06     0.095

New Probabilities (P):
P(q→a→r) = 0.06 = 0.01/(0.01 + 0.06 + 0.095)
P(r→b→q) = 1.0,  P(q→a→q) = 0.36,  P(q→b→q) = 0.581
* ε is considered as the starting and ending symbol of the input sequence string.
Through multiple iterations the probability values will converge.
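The table's procedure can be sketched by brute-force path enumeration: weight each state sequence by its probability, accumulate weighted transition counts, and renormalize per source state. The initial probabilities below are hypothetical placeholders (the figure's numbers did not survive extraction).

```python
from itertools import product
from collections import defaultdict

# Hypothetical initial guess: prob[(src, symbol, dst)] = P(src --symbol--> dst)
prob = {('q', 'a', 'q'): 0.48, ('q', 'a', 'r'): 0.48,
        ('q', 'b', 'q'): 0.02, ('q', 'b', 'r'): 0.02,
        ('r', 'a', 'q'): 0.50, ('r', 'b', 'q'): 0.50}

def path_prob(states, symbols):
    """Joint probability of one state sequence emitting the string."""
    p = 1.0
    for src, sym, dst in zip(states, symbols, states[1:]):
        p *= prob.get((src, sym, dst), 0.0)
    return p

string = 'ababb'
# All state sequences start in q and have len(string)+1 states
paths = [('q',) + rest for rest in product('qr', repeat=len(string))]
weights = {s: path_prob(s, string) for s in paths}
total = sum(weights.values())

# Expected counts: each path contributes its posterior weight per transition
expected = defaultdict(float)
for states, w in weights.items():
    for src, sym, dst in zip(states, string, states[1:]):
        expected[(src, sym, dst)] += w / total

# Re-estimate: normalize expected counts over each source state
out = defaultdict(float)
for (src, sym, dst), c in expected.items():
    out[src] += c
new_prob = {k: c / out[k[0]] for k, c in expected.items()}
print(new_prob)
```

Repeating this E-step/M-step pair gives the iterations whose convergence the slide mentions.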
Computational part (1/2)
$$C(s_i \xrightarrow{w_k} s_j) = \sum_{S_{0,n}} P(S_{0,n} \mid W_{0,n}) \; n(s_i \xrightarrow{w_k} s_j, S_{0,n}, W_{0,n})$$
$$= \frac{1}{P(W_{0,n})} \sum_{S_{0,n}} P(S_{0,n}, W_{0,n}) \; n(s_i \xrightarrow{w_k} s_j, S_{0,n}, W_{0,n})$$
$$= \frac{1}{P(W_{0,n})} \sum_{t=0}^{n} \sum_{S_{0,n}} P(S_t = s_i, W_t = w_k, S_{t+1} = s_j, S_{0,n}, W_{0,n})$$
$$= \frac{1}{P(W_{0,n})} \sum_{t=0}^{n} P(S_t = s_i, W_t = w_k, S_{t+1} = s_j, W_{0,n})$$

w0  w1  w2  …  wk  …  wn-1  wn
S0  S1  S2  …  Si  Sj  …  Sn  Sn+1
Computational part (2/2)
$$P(S_t = s_i, S_{t+1} = s_j, W_t = w_k, W_{0,n})$$
$$= P(S_t = s_i, W_{0,t-1}) \cdot P(S_{t+1} = s_j, W_t = w_k \mid S_t = s_i) \cdot P(W_{t+1,n} \mid S_{t+1} = s_j)$$
$$= F(t-1, i) \cdot P(s_i \xrightarrow{w_k} s_j) \cdot B(t+1, j)$$
where F is the forward probability and B the backward probability:
$$F(t, i) = P(W_{0,t}, S_{t+1} = s_i) \qquad B(t, j) = P(W_{t,n} \mid S_t = s_j)$$
Therefore
$$C(s_i \xrightarrow{w_k} s_j) = \frac{1}{P(W_{0,n})} \sum_{t:\, w_t = w_k} F(t-1, i) \cdot P(s_i \xrightarrow{w_k} s_j) \cdot B(t+1, j)$$

w0  w1  w2  …  wk  …  wn-1  wn
S0  S1  S2  …  Si  Sj  …  Sn  Sn+1
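A runnable sketch of these forward-backward expected counts, using the same hypothetical transition-emission probabilities as before (not the figure's numbers). The indexing convention here is F[t][s] = P(w_0..w_{t-1}, S_t = s), so the transition at time t consumes W[t].

```python
from itertools import product

states = ['q', 'r']
# Combined transition-emission probs: P(src --sym--> dst), hypothetical numbers
prob = {('q', 'a', 'q'): 0.48, ('q', 'a', 'r'): 0.48,
        ('q', 'b', 'q'): 0.02, ('q', 'b', 'r'): 0.02,
        ('r', 'a', 'q'): 0.50, ('r', 'b', 'q'): 0.50}
P = lambda s, w, d: prob.get((s, w, d), 0.0)

W = 'ababb'
n = len(W)

# Forward pass: F[t][s] = P(w_0..w_{t-1}, S_t = s), start state fixed to q
F = [{s: (1.0 if s == 'q' else 0.0) for s in states}]
for t in range(n):
    F.append({d: sum(F[t][s] * P(s, W[t], d) for s in states) for d in states})

# Backward pass: B[t][s] = P(w_t..w_{n-1} | S_t = s)
B = [None] * (n + 1)
B[n] = {s: 1.0 for s in states}
for t in range(n - 1, -1, -1):
    B[t] = {s: sum(P(s, W[t], d) * B[t + 1][d] for d in states) for s in states}

total = sum(F[n][s] for s in states)  # = P(W)

def expected_count(i, w, j):
    """C(s_i --w--> s_j): sum over positions of F * transition prob * B / P(W)."""
    return sum(F[t][i] * P(i, w, j) * B[t + 1][j]
               for t in range(n) if W[t] == w) / total

print(expected_count('q', 'a', 'r'))
```

Unlike the path-enumeration illustration, this runs in time linear in the string length rather than exponential.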
Discussions
1. Symmetry breaking:
Example: a symmetric initialization leads to no change in the initial values
2. Stuck in local maxima
3. Label bias problem:
Probabilities have to sum to 1. Values can rise only at the cost of a fall in values for others.
[Figure: desired HMM vs. initialized HMM over three states, with arc labels such as a:1.0, b:1.0, a:0.5, b:0.5 (desired) and a:0.5, b:0.5, a:0.25, b:0.25 (initialized)]
Another application of EM
WSD
Mitesh Khapra, Salil Joshi and Pushpak Bhattacharyya, It takes two to Tango: A Bilingual Unsupervised Approach for estimating Sense Distributions using Expectation Maximization, 5th International Joint Conference on Natural Language Processing (IJCNLP 2011), Chiang Mai, Thailand, November 2011.
Definition: WSD
Given a context, get the "meaning" of:
a set of words (targeted WSD) or all words (all-words WSD)
The "meaning" is usually given by the id of senses in a sense repository, usually the wordnet.
Example: “operation” (from Princeton Wordnet) Operation, surgery, surgical operation, surgical procedure, surgical
process -- (a medical procedure involving an incision with instruments; performed to repair damage or arrest disease in a living body; "they will schedule the operation as soon as an operating room is available"; "he died while undergoing surgery") TOPIC->(noun) surgery#1
Operation, military operation -- (activity by a military or naval force (as a maneuver or campaign); "it was a joint operation of the navy and air force") TOPIC->(noun) military#1, armed forces#1, armed services#1, military machine#1, war machine#1
mathematical process, mathematical operation, operation --((mathematics) calculation by mathematical methods; "the problems at the end of the chapter demonstrated the mathematical processes involved in the derivation"; "they were learning the basic operations of arithmetic") TOPIC->(noun) mathematics#1, math#1, maths#1
WSD for ALL Indian languages. Critical resource: INDOWORDNET
Hindi Wordnet, Marathi Wordnet, Sanskrit Wordnet, English Wordnet, Bengali Wordnet, Punjabi Wordnet, Konkani Wordnet, Urdu Wordnet, Gujarati Wordnet, Oriya Wordnet, Kashmiri Wordnet, Dravidian Language Wordnets, North East Language Wordnets
Synset Based Multilingual Dictionary
Expansion approach for creating wordnets [Mohanty et al., 2008]:
Instead of creating from scratch, link to the synsets of an existing wordnet.
Relations get borrowed from the existing wordnet.
[Figure: three linked synset graphs (nodes S1 to S7) illustrating relation borrowing, and a sample entry from the MultiDict linking Hindi and Marathi]
Hypothesis
Sense distributions across languages are invariant!
The proportion of times a sense appears in a language is uniform across languages!
E.g., the proportion of times the sense of "sun" appears in any language through "sun" and its synonyms remains the same!
ESTIMATING SENSE DISTRIBUTIONS
If a sense-tagged Marathi corpus were available, we could have estimated it directly.
But such a corpus is not available.
EM for estimating sense distributions
Problem: 'galaa' itself is ambiguous, so its raw count cannot be used as is.
Solution: its count should be weighted by the sense probability P(S | galaa).
Word correspondences:

Sense in English | S_mar (Marathi sense number) | words_mar (partial list) | S_hin = π(S_mar) (projected Hindi sense number) | words_hin (partial list of words in projected Hindi sense)
Neck    | 1 | maan, greeva           | 1 | gardan, galaa
Respect | 2 | maan, satkaar, sanmaan | 3 | izzat, aadar
Voice   | 3 | awaaz, swar            | 2 | galaa
EM for estimating sense distributions
E-step:
$$P(S_1^{hin} \mid galaa) = \frac{P(S_1^{mar} \mid maan)\,\#(maan) + P(S_1^{mar} \mid greeva)\,\#(greeva)}{P(S_1^{mar} \mid maan)\,\#(maan) + P(S_1^{mar} \mid greeva)\,\#(greeva) + P(S_3^{mar} \mid awaaz)\,\#(awaaz) + P(S_3^{mar} \mid swar)\,\#(swar)}$$
M-step:
$$P(S_1^{mar} \mid maan) = \frac{P(S_1^{hin} \mid gardan)\,\#(gardan) + P(S_1^{hin} \mid galaa)\,\#(galaa)}{P(S_1^{hin} \mid gardan)\,\#(gardan) + P(S_1^{hin} \mid galaa)\,\#(galaa) + P(S_3^{hin} \mid izzat)\,\#(izzat) + P(S_3^{hin} \mid aadar)\,\#(aadar)}$$
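A sketch of one such E-step over the word-correspondence table above. The corpus counts here are made-up illustrative numbers, and the uniform initialization is an assumption; the M-step is the symmetric computation with the two languages swapped.

```python
# Marathi sense s has words words_mar[s]; pi maps Marathi -> Hindi senses,
# following the word-correspondence table.
words_mar = {1: ['maan', 'greeva'], 2: ['maan', 'satkaar', 'sanmaan'],
             3: ['awaaz', 'swar']}
words_hin = {1: ['gardan', 'galaa'], 2: ['galaa'], 3: ['izzat', 'aadar']}
pi = {1: 1, 2: 3, 3: 2}                  # Marathi sense -> Hindi sense
pi_inv = {h: m for m, h in pi.items()}
count_mar = {'maan': 10, 'greeva': 4, 'satkaar': 3, 'sanmaan': 2,
             'awaaz': 6, 'swar': 5}      # hypothetical corpus counts
count_hin = {'gardan': 8, 'galaa': 9, 'izzat': 7, 'aadar': 4}

def senses(word, words):
    return [s for s, ws in words.items() if word in ws]

def e_step(p_mar):
    """Hindi sense distributions from the current Marathi ones."""
    p_hin = {}
    for u in count_hin:
        # Score each Hindi sense of u via its cross-linked Marathi synset
        scores = {}
        for s_h in senses(u, words_hin):
            s_m = pi_inv[s_h]
            scores[s_h] = sum(p_mar[(s_m, v)] * count_mar[v]
                              for v in words_mar[s_m])
        z = sum(scores.values())
        for s_h, sc in scores.items():
            p_hin[(s_h, u)] = sc / z
    return p_hin

# Initialize Marathi sense distributions uniformly per word
p_mar = {}
for v in count_mar:
    ss = senses(v, words_mar)
    for s in ss:
        p_mar[(s, v)] = 1 / len(ss)

p_hin = e_step(p_mar)
print(p_hin[(1, 'galaa')])  # P(Neck sense | galaa)
```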
General Algo
E-step:
$$P(S_{L_1}^i \mid u) = \frac{\sum_{v} P\big(\pi(S_{L_1}^i) \mid v\big)\,\#(v)}{\sum_{S_{L_1}^j \in \text{senses}(u)} \sum_{v} P\big(\pi(S_{L_1}^j) \mid v\big)\,\#(v)} \qquad (2)$$
M-step:
$$P(S_{L_2}^k \mid y) = \frac{\sum_{v} P\big(\pi(S_{L_2}^k) \mid v\big)\,\#(v)}{\sum_{S_{L_2}^m \in \text{senses}(y)} \sum_{v} P\big(\pi(S_{L_2}^m) \mid v\big)\,\#(v)} \qquad (3)$$
where u is a word in language L1, y is a word in language L2, the inner sums run over the cross-linked words v of the other language, and π maps a sense of one language to the cross-linked sense of the other.
Results (Marathi):

Algorithm                                                    P %    R %    F %
IWSD (training on self corpora; no parameter projection)     81.29  80.42  80.85
IWSD (training on Hindi, projecting parameters for Marathi)  73.45  70.33  71.86
EM (no sense-tagged corpora in either Hindi or Marathi)      68.57  67.93  68.25
Wordnet Baseline                                             58.07  58.07  58.07
Results & Discussions
Performance of projection using manual cross linkages is within 7% of Self-Training
Performance of projection using probabilistic cross linkages is within 10-12% of Self-Training – remarkable since no additional cost incurred in target language
Both MCL and PCL give 10-14% improvement over Wordnet First Sense Baseline
Not prudent to stick to knowledge-based and unsupervised approaches: they come nowhere close to MCL or PCL
[Chart legend: Manual Cross Linkages; Probabilistic Cross Linkages; Skyline (self training data available); Wordnet first sense baseline; S-O-T-A knowledge-based approach; S-O-T-A unsupervised approach; our values]
Delving deeper into EM
Some Useful mathematical concepts
Convex/ concave functions Jensen’s inequality Kullback–Leibler distance/divergence
[Figure: a convex function; at $z = \lambda x_1 + (1-\lambda)x_2$, the chord value $\lambda f(x_1) + (1-\lambda)f(x_2)$ lies above $f(\lambda x_1 + (1-\lambda)x_2)$]
Criteria for convexity
A function f(x) is said to be convex in the interval [a,b] iff
$$f(\lambda x_1 + (1-\lambda)x_2) \le \lambda f(x_1) + (1-\lambda) f(x_2)$$
$$\forall\, x_1, x_2 \in [a,b],\; \lambda \in [0,1]$$
Jensen’s inequality
For any convex function f(x)
$$f\Big(\sum_{i=1}^{n} \lambda_i x_i\Big) \le \sum_{i=1}^{n} \lambda_i f(x_i)$$
where $\sum_{i=1}^{n} \lambda_i = 1$ and $0 \le \lambda_i \le 1$
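A quick numeric spot-check of the inequality on a convex function (here $f(x) = x^2$), with random points and random normalized weights:

```python
import random

random.seed(0)
f = lambda x: x * x                      # a convex function

for _ in range(100):
    n = random.randint(2, 6)
    xs = [random.uniform(-5, 5) for _ in range(n)]
    ws = [random.random() for _ in range(n)]
    lam = [w / sum(ws) for w in ws]      # lambda_i >= 0, summing to 1
    lhs = f(sum(l * x for l, x in zip(lam, xs)))   # f of the mixture
    rhs = sum(l * f(x) for l, x in zip(lam, xs))   # mixture of f values
    assert lhs <= rhs + 1e-12
print('Jensen holds on all random trials')
```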
Proof of Jensen´s inequality
Method: by induction on N.
Base case N = 1: $\lambda_1 = 1$, so $f(\lambda_1 x) = \lambda_1 f(x) = f(x)$, trivially true.
Another base case
N = 2:
$$f(\lambda_1 x_1 + \lambda_2 x_2) = f\big(\lambda_1 x_1 + (1 - \lambda_1) x_2\big) \quad \text{since } \lambda_1 + \lambda_2 = 1$$
$$\le \lambda_1 f(x_1) + (1 - \lambda_1) f(x_2) \quad \text{since } f \text{ is convex}$$
Hypothesis
Suppose the claim is true for N = k, i.e.
$$f\Big(\sum_{i=1}^{k} \lambda_i x_i\Big) \le \sum_{i=1}^{k} \lambda_i f(x_i)$$
Induction Step
Show that
$$f\Big(\sum_{i=1}^{k+1} \lambda_i x_i\Big) \le \sum_{i=1}^{k+1} \lambda_i f(x_i)$$
given
$$f\Big(\sum_{i=1}^{k} \lambda_i x_i\Big) \le \sum_{i=1}^{k} \lambda_i f(x_i)$$
Proof
$$f\Big(\sum_{i=1}^{k+1} \lambda_i x_i\Big) = f\big(\lambda_1 x_1 + \lambda_2 x_2 + \lambda_3 x_3 + \cdots + \lambda_{k+1} x_{k+1}\big)$$
$$= f\Big(\lambda_{k+1} x_{k+1} + (1 - \lambda_{k+1}) \sum_{i=1}^{k} \frac{\lambda_i}{1 - \lambda_{k+1}}\, x_i\Big)$$
$$\le \lambda_{k+1} f(x_{k+1}) + (1 - \lambda_{k+1})\, f\Big(\sum_{i=1}^{k} \mu_i x_i\Big) \quad \text{by convexity}$$
where $\mu_i = \dfrac{\lambda_i}{1 - \lambda_{k+1}}$.
Continued...
Examine each $\mu_i$:
$$\sum_{i=1}^{k} \mu_i = \frac{\lambda_1 + \lambda_2 + \lambda_3 + \cdots + \lambda_k}{1 - \lambda_{k+1}} = \frac{1 - \lambda_{k+1}}{1 - \lambda_{k+1}} = 1$$
Continued...
Therefore, by the induction hypothesis,
$$(1 - \lambda_{k+1})\, f\Big(\sum_{i=1}^{k} \mu_i x_i\Big) \le (1 - \lambda_{k+1}) \sum_{i=1}^{k} \mu_i f(x_i) = \sum_{i=1}^{k} \lambda_i f(x_i)$$
Finally, at the induction step,
$$f\Big(\sum_{i=1}^{k+1} \lambda_i x_i\Big) \le \lambda_{k+1} f(x_{k+1}) + \sum_{i=1}^{k} \lambda_i f(x_i) = \sum_{i=1}^{k+1} \lambda_i f(x_i)$$
Thus Jensen's inequality is proved.
KL-divergence
We will use the discrete form of the probability distributions.
Given two probability distribution P,Q on the random variable
X : x1,x2,x3...xN
P:p1=p(x1 ), p2=p(x2), ... pn=p(xn) Q:q1=q(x1 ), q2=q(x2), ... qn=q(xn)
KLD definition
$$D_{KL}(P, Q) = \sum_{i=1}^{N} p_i \log\frac{p_i}{q_i} = \sum_i p_i \log p_i - \sum_i p_i \log q_i$$
$D_{KL}$ is asymmetric ($D_{KL}(P,Q) \ne D_{KL}(Q,P)$ in general) and $D_{KL}(P,Q) \ge 0$.
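A minimal sketch of the definition, checking both properties on small hand-picked distributions: the divergence is nonnegative, zero when the distributions coincide, and asymmetric.

```python
import math

def kld(p, q):
    """D_KL(P, Q) = sum_i p_i log(p_i / q_i), natural log, discrete case.

    Terms with p_i = 0 contribute 0 by the usual convention."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.9, 0.1]
q = [0.5, 0.5]
print(kld(p, q), kld(q, p))  # both nonnegative, and unequal
```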
Proof: KLD>=0
Proof:
$$D_{KL}(P, Q) = \sum_{i=1}^{N} p_i \log\frac{p_i}{q_i} = -\sum_{i=1}^{N} p_i \log\frac{q_i}{p_i}$$
$-\log x$ is convex in $x \in (0, \infty)$, so Jensen's inequality applies with the weights $p_i$.
Proof cntd.
Apply Jensen’s inequality
$$-\sum_{i=1}^{N} p_i \log\frac{q_i}{p_i} \ge -\log\Big(\sum_{i=1}^{N} p_i \frac{q_i}{p_i}\Big) = -\log\Big(\sum_{i=1}^{N} q_i\Big) = -\log 1 = 0$$
So $D_{KL}(P, Q) \ge 0$.
Convexity of –log x
To show: $-\log x$ is convex, i.e.
$$-\log\big(\lambda x_1 + (1-\lambda) x_2\big) \le -\lambda \log x_1 - (1-\lambda) \log x_2$$
i.e.
$$\log\big(\lambda x_1 + (1-\lambda) x_2\big) \ge \lambda \log x_1 + (1-\lambda) \log x_2 = \log\big(x_1^{\lambda} x_2^{1-\lambda}\big)$$
i.e.
$$\lambda x_1 + (1-\lambda) x_2 \ge x_1^{\lambda}\, x_2^{1-\lambda}$$
which is the weighted AM-GM inequality.
Interesting problem
Try to prove:
$$\frac{w_1 x_1 + w_2 x_2}{w_1 + w_2} \ge \left( x_1^{w_1}\, x_2^{w_2} \right)^{\frac{1}{w_1 + w_2}}$$
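Before attempting the proof, a numeric spot-check of the stated weighted AM-GM inequality on random positive inputs:

```python
import random

random.seed(0)
# Check: (w1*x1 + w2*x2)/(w1 + w2) >= (x1**w1 * x2**w2)**(1/(w1 + w2))
for _ in range(1000):
    x1, x2 = random.uniform(0.01, 10), random.uniform(0.01, 10)
    w1, w2 = random.uniform(0.01, 5), random.uniform(0.01, 5)
    am = (w1 * x1 + w2 * x2) / (w1 + w2)          # weighted arithmetic mean
    gm = (x1 ** w1 * x2 ** w2) ** (1 / (w1 + w2)) # weighted geometric mean
    assert am >= gm - 1e-9
print('weighted AM-GM holds on all trials')
```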
2nd definition of convexity
Theorem: If $f(x)$ is twice differentiable in $[a,b]$ and $f''(x) \ge 0$ for $x \in [a,b]$, then $f(x)$ is convex in $[a,b]$. So $-\log x$ is convex.
Lemma 1
If $f''(x) \ge 0$ in $[a,b]$, then $f'(t) \ge f'(s)$ for $t \ge s$, with $s, t \in [a,b]$.

a   s   z   t   b
Mean Value Theorem
For any differentiable function $f(x)$:
$$f(n) - f(m) = f'(p)(n - m), \qquad m \le p \le n$$
For example,
$$f(z) - f(a) = f'(s)(z - a), \qquad s \in (a, z)$$
Alternative form of z
$$z = \lambda x_1 + (1 - \lambda) x_2$$
Add $-\lambda z$ to both sides:
$$z - \lambda z = \lambda x_1 - \lambda z + (1 - \lambda) x_2$$
$$\lambda (z - x_1) = (1 - \lambda)(x_2 - z)$$
Alternative form of convexity
Convexity states $f(z) \le \lambda f(x_1) + (1-\lambda) f(x_2)$, with $z = \lambda x_1 + (1-\lambda)x_2$.
Add $-\lambda f(z)$ to both sides:
$$f(z) - \lambda f(z) \le \lambda f(x_1) - \lambda f(z) + (1-\lambda) f(x_2)$$
$$\lambda\big(f(z) - f(x_1)\big) \le (1-\lambda)\big(f(x_2) - f(z)\big)$$
Proof: second derivative >= 0 implies convexity (1/2)
We have that
$$z = \lambda x_1 + (1 - \lambda) x_2$$
and we must show
$$(1 - \lambda)\big[f(x_2) - f(z)\big] \ge \lambda\big[f(z) - f(x_1)\big] \qquad (1)$$
using
$$\lambda (z - x_1) = (1 - \lambda)(x_2 - z) \qquad (2)$$
Second derivative >=0 implies convexity (2/2)
By the Mean Value Theorem,
$$f(z) - f(x_1) = f'(t)(z - x_1), \qquad f(x_2) - f(z) = f'(s)(x_2 - z)$$
for some $s$ and $t$, where $x_1 \le t \le z \le s \le x_2$.
Now since $f''(x) \ge 0$, we have $f'(s) \ge f'(t)$ (Lemma 1). Multiplying the two sides of (2) by $f'(s)$ and $f'(t)$ respectively:
$$(1 - \lambda)\, f'(s)\, (x_2 - z) \ge \lambda\, f'(t)\, (z - x_1)$$
which is exactly (1), so the result is proved.
Why all this?
In EM, we maximize the expectation of the log likelihood of the data.
Log is a concave function.
We have to take iterative steps to get to the maximum.
There are two unknowns: Z (unobserved data) and θ (parameters).
From θ, get a new value of Z (E-step); from Z, get a new value of θ (M-step).
How to choose the next θ? Take
$$\theta_{n+1} = \arg\max_{\theta} \big( LL(X, Z; \theta) - LL(X, Z; \theta_n) \big)$$
where X: observed data; Z: unobserved data; θ: parameters; LL(X, Z; θ_n): log likelihood of the complete data with parameter value θ_n.
This is in lieu of, for example, gradient ascent. At every step LL(.) will increase, ultimately reaching a local/global maximum.
Why expectation of log likelihood? (1/2)
P(X; θ) may not be a convenient mathematical expression. Deal with P(X, Z; θ), marginalized over Z.
$\log(\sum_Z P(X, Z; \theta))$ is processed by multiplying and dividing by $P(Z \mid X; \theta_n)$, which for each Z is between 0 and 1 and sums to 1:
$$\log\Big(\sum_Z P(X, Z; \theta)\Big) = \log\Big(\sum_Z P(Z \mid X; \theta_n)\, \frac{P(X, Z; \theta)}{P(Z \mid X; \theta_n)}\Big)$$
Then Jensen's inequality (log is concave) gives
$$\log\Big(\sum_Z P(X, Z; \theta)\Big) \ge \sum_Z P(Z \mid X; \theta_n) \log\frac{P(X, Z; \theta)}{P(Z \mid X; \theta_n)}$$
Why expectation of log likelihood? (2/2)
$$LL(X; \theta) - LL(X; \theta_n) = \log P(X; \theta) - \log P(X; \theta_n)$$
$$= \log\Big(\sum_Z P(Z \mid X; \theta_n)\, \frac{P(X, Z; \theta)}{P(Z \mid X; \theta_n)}\Big) - \log P(X; \theta_n)$$
$$\ge \sum_Z P(Z \mid X; \theta_n) \log\frac{P(X, Z; \theta)}{P(Z \mid X; \theta_n)} - \log P(X; \theta_n) \quad \text{(Jensen)}$$
since $\sum_Z P(Z \mid X; \theta_n) = 1$. So
$$\arg\max_{\theta} \big(LL(X;\theta) - LL(X;\theta_n)\big) = \arg\max_{\theta} \sum_Z P(Z \mid X; \theta_n) \log P(X, Z; \theta) = \arg\max_{\theta} E_Z\big[\log P(X, Z; \theta)\big]$$
where $E_Z(.)$ is the expectation of the log likelihood of the complete data w.r.t. Z.
Why expectation of Z?
If the log likelihood is a linear function of Z, then the expectation can be carried inside the log likelihood, and only E(Z) needs to be computed.
The above is true when the hidden variables form a mixture of distributions (e.g., in tosses of two coins), and
each distribution is an exponential-family distribution like the multinomial/normal/Poisson.