A PLSA-based Language Model for Conversational Telephone Speech
David Mrva and Philip C.Woodland
2004/12/08 邱炫盛
Outline
• Language Model
• PLSA Model
• Experimental Results
• Conclusion
Language Model
• The task of a language model is to calculate the probability $P(w_i|h_i)$ of a word given its history
• n-gram model
– The range of dependencies is limited to n words: $P(w_i|h_i) = P(w_i|w_{i-1},\ldots,w_{i-n+1})$
– Longer-range information is ignored
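As an illustration (mine, not from the slides), a minimal maximum-likelihood n-gram estimator; the toy corpus and function name are hypothetical:

```python
from collections import Counter

def mle_ngram_probs(tokens, n=3):
    """Maximum-likelihood estimates P(w_i | w_{i-n+1} ... w_{i-1}) from counts."""
    grams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    hists = Counter(g[:-1] for g in grams.elements())   # histories that precede a word
    return {g: c / hists[g[:-1]] for g, c in grams.items()}

tokens = "the model predicts the next word given the previous words".split()
probs = mle_ngram_probs(tokens, n=2)   # bigram: only the previous word matters
print(probs[("the", "model")])         # P(model | the) = 1/3
```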
Language Model (cont.)
• Topic-based language models
– Latent Semantic Analysis
– Topic-based language model
– PLSA-based language model
PLSA Model
• PLSA is a general machine-learning technique for modeling the co-occurrence of events.
• Here the co-occurring events are words and documents.
• Hidden variable = aspect (topic).
• The PLSA model in this paper is a mixture of unigram distributions.
PLSA Model (cont.)
Graphical model representation: without topics, a document $d$ (probability $P(d)$) generates a word $w$ directly with $P(w|d)$; in PLSA, $d$ generates a latent topic $t$ with $P(t|d)$, which generates $w$ with $P(w|t)$.
PLSA Model (cont.)
Mixture view: the probability of word $w_j$ in document $d_i$ sums the topic-conditional unigrams $P(w_j|z_1),\ldots,P(w_j|z_K)$ weighted by the document's topic mixture $P(z_1|d_i),\ldots,P(z_K|d_i)$.
PLSA Model (cont.)
$$P(w_j|d_i) = \sum_{k=1}^{K} P(z_k|d_i)\,P(w_j|z_k)$$

$$\log L = \log \prod_{i=1}^{N}\prod_{j=1}^{M} P(w_j|d_i)^{n(d_i,w_j)} = \sum_{i=1}^{N}\sum_{j=1}^{M} n(d_i,w_j)\,\log \sum_{k=1}^{K} P(z_k|d_i)\,P(w_j|z_k)$$

M: number of words in the vocabulary
N: number of documents in the training collection
K: number of aspects (topics)
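A minimal runnable sketch (mine, not the authors') of the mixture-of-unigrams probability and this log-likelihood, with random numbers standing in for trained parameters and counts:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, K = 4, 10, 3                       # documents, vocabulary size, aspects

# Row-normalized parameters: P(z|d) is N x K, P(w|z) is K x M.
p_z_d = rng.random((N, K)); p_z_d /= p_z_d.sum(axis=1, keepdims=True)
p_w_z = rng.random((K, M)); p_w_z /= p_w_z.sum(axis=1, keepdims=True)

n_dw = rng.integers(1, 5, size=(N, M))   # word counts n(d_i, w_j)

# P(w_j|d_i) = sum_k P(z_k|d_i) P(w_j|z_k), then the training log-likelihood.
p_w_d = p_z_d @ p_w_z                    # N x M
log_L = np.sum(n_dw * np.log(p_w_d))
```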
PLSA Model (cont.)
E-step: compute the posterior probability of each aspect for every (document, word) pair from the current parameter estimates. Since $d_i$ and $w_j$ are assumed conditionally independent given the aspect,

$$\hat p(z_k|d_i,w_j) = \frac{P(z_k|d_i)\,P(w_j|z_k)}{\sum_{k'=1}^{K} P(z_{k'}|d_i)\,P(w_j|z_{k'})}$$
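Continuing the same sketch, the E-step posterior for all (document, word) pairs at once:

```python
# E-step: posterior of each aspect for every (document, word) pair.
# joint[i, j, k] = P(z_k|d_i) * P(w_j|z_k); normalizing over k gives p(z_k|d_i,w_j).
joint = p_z_d[:, None, :] * p_w_z.T[None, :, :]   # shape N x M x K
p_z_dw = joint / joint.sum(axis=2, keepdims=True)
```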
PLSA Model (cont.)
Why the EM steps increase the likelihood: with the posteriors $\hat p(z_k|d_i,w_j)$ fixed from the E-step, the change in log-likelihood between the updated parameters and the current ones equals the change in the expected complete-data log-likelihood $E[\log L_c]$ plus a remainder term. Using $\log(1+x) \le x$, that remainder is non-negative (it is a divergence between the old and new aspect posteriors), so parameters that increase $E[\log L_c]$ cannot decrease $\log L$. The M-step therefore maximizes $E[\log L_c]$ instead of $\log L$ directly.
PLSA Model (cont.)
M-step: maximize the expected complete-data log-likelihood

$$E[\log L_c] = \sum_{i=1}^{N}\sum_{j=1}^{M} n(d_i,w_j) \sum_{k=1}^{K} \hat p(z_k|d_i,w_j)\,\big(\log P(w_j|z_k) + \log P(z_k|d_i)\big)$$

subject to $\sum_{j=1}^{M} P(w_j|z_k) = 1$ and $\sum_{k=1}^{K} P(z_k|d_i) = 1$, via Lagrange multipliers $\tau_k$ and $\rho_i$:

$$C = E[\log L_c] + \sum_{k=1}^{K} \tau_k \Big(1 - \sum_{j=1}^{M} P(w_j|z_k)\Big) + \sum_{i=1}^{N} \rho_i \Big(1 - \sum_{k=1}^{K} P(z_k|d_i)\Big)$$
PLSA Model (cont.)
Taking derivatives of $C$ and setting them to zero yields the re-estimation formulas:

$$P(w_j|z_k) = \frac{\sum_{i=1}^{N} n(d_i,w_j)\,\hat p(z_k|d_i,w_j)}{\sum_{j'=1}^{M}\sum_{i=1}^{N} n(d_i,w_{j'})\,\hat p(z_k|d_i,w_{j'})}$$

$$P(z_k|d_i) = \frac{\sum_{j=1}^{M} n(d_i,w_j)\,\hat p(z_k|d_i,w_j)}{n(d_i)}$$

where $n(d_i) = \sum_{j=1}^{M} n(d_i,w_j)$ is the length of document $d_i$.
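And the matching M-step of the sketch, implementing the two re-estimation formulas above; iterating the E- and M-steps monotonically increases the log-likelihood:

```python
# M-step: re-estimate P(w|z) and P(z|d) from the E-step posteriors.
weighted = n_dw[:, :, None] * p_z_dw              # n(d_i,w_j) * p(z_k|d_i,w_j)

p_w_z = weighted.sum(axis=0).T                    # K x M numerators
p_w_z /= p_w_z.sum(axis=1, keepdims=True)         # normalize over the vocabulary

p_z_d = weighted.sum(axis=1)                      # N x K numerators
p_z_d /= n_dw.sum(axis=1, keepdims=True)          # divide by document length n(d_i)
```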
PLSA Model (cont.)
Use PLSA in a language model: the $P(z_k|d_i)$ are used as mixture weights when calculating the word probability. On the test set, the history $h_i$ is used instead of $d_i$ to re-estimate these weights, keeping $P(w|z_k)$ fixed:

$$p(z_k|h_i,w) = \frac{p(w|z_k)\,p(z_k|h_i)}{\sum_{q=1}^{K} p(w|z_q)\,p(z_q|h_i)}, \qquad p(z_k|h_i) = \frac{1}{n(h_i)} \sum_{w \in h_i} n(w,h_i)\,p(z_k|h_i,w)$$

with $\sum_{k=1}^{K} p(z_k|h_i) = 1$.
PLSA Model (cont.)
Because of recognition errors in the history, and because not enough information is available to the model about the topic of the document, the weights are smoothed with a prior topic distribution and the words weighted by confidence scores:

$$P(w_i|h_i) = \sum_{k=1}^{K} p(w_i|z_k)\,p(z_k|h_i)$$

$$p(z_k|h_i) = \frac{1}{b + \sum_i cs(i)} \Big( b\,p(z_k) + \sum_i cs(i)\,p(z_k|h_i,w_i) \Big)$$

cs(i): the confidence score of the i-th word
b: the weight of the prior topic distribution
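A sketch of this test-time re-estimation, assuming a fixed number of folding-in iterations and using the mean training topic weight as a stand-in for the prior $p(z_k)$; all names here are mine:

```python
def fold_in(word_ids, cs, p_w_z, p_z_prior, b=10.0, iters=20):
    """Re-estimate p(z|h) for a recognized history h, keeping P(w|z) fixed."""
    p_z_h = p_z_prior.copy()                              # start from the prior
    for _ in range(iters):
        # E-step over the history words: p(z_k | h, w_i).
        joint = p_z_h[None, :] * p_w_z[:, word_ids].T     # len(h) x K
        post = joint / joint.sum(axis=1, keepdims=True)
        # Smoothed M-step: the prior topic distribution gets weight b,
        # each history word is weighted by its confidence score cs(i).
        p_z_h = (b * p_z_prior + (cs[:, None] * post).sum(axis=0)) / (b + cs.sum())
    return p_z_h

word_ids = np.array([1, 4, 4, 7])            # word indices of the recognized history
cs = np.array([0.9, 0.6, 0.8, 0.5])          # per-word confidence scores
p_z_prior = p_z_d.mean(axis=0)               # stand-in for the prior p(z_k)
p_z_h = fold_in(word_ids, cs, p_w_z, p_z_prior)
p_plsa = p_z_h @ p_w_z                       # PLSA probability of every word given h
```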
PLSA Model (cont.)
• The PLSA model accounts for the whole document history of a word, irrespective of the document length.
• It has no means of representing word order, because it is a mixture of unigram distributions.
Combine n-gram with PLSA:
• When PLSA is used in decoding, a Viterbi-based decoder is not suitable. A two-pass decoder is used:
– First pass: n-gram, output confidence scores
– Second pass: PLSA, rescoring the lattices
$$P(w_i|h_i) = \frac{P_{n\text{-}gram}(w_i|h_i)\,P_{PLSA}(w_i|h_i)}{P_{unigram}(w_i)}$$
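A sketch of this combination over the toy vocabulary; the n-gram probabilities are hypothetical, and the renormalization is my addition (the slides leave normalization implicit):

```python
def combine(p_ngram, p_plsa, p_unigram):
    """P(w|h) proportional to P_ngram(w|h) * P_PLSA(w|h) / P_unigram(w)."""
    p = p_ngram * p_plsa / p_unigram
    return p / p.sum()                       # renormalize over the vocabulary

p_unigram = n_dw.sum(axis=0) / n_dw.sum()    # unigram estimates from the toy counts
p_ngram = rng.random(M); p_ngram /= p_ngram.sum()   # hypothetical n-gram predictions
p_combined = combine(p_ngram, p_plsa, p_unigram)
```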
PLSA Model (cont.)
• During the re-scoring, the PLSA history comprises all segments in a document except the current segment.
• The PLSA history is fixed for all words in a given segment.
• “History” is therefore referred to as “context” (ctx): it contains both past and future words.
Experimental Results
Two test sets:
• NIST Hub5 speech-to-text evaluation 2002 (eval02)
– Switchboard I and II
– 62k words, 19k from Switchboard I
• NIST Rich Transcription Spring 2003 CTS speech-to-text evaluation (eval03)
– Switchboard II phase 5 and Fisher
– 74k words, 36k from Fisher
Experimental Results (cont.)
Experimental Results (cont.)
• The perplexity reduction is greater if PLSA's training text is related to the test set.
• PP with (ref.ctx, b=10) is lower than PP with (rec.ctx, b=10).
• b=10 is the best value.
• Use of confidence scores makes the PLSA model less sensitive to b.
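For reference, the perplexity (PP) used in these comparisons is the exponentiated average negative log-probability; a minimal helper, continuing the numpy sketch:

```python
def perplexity(word_probs):
    """PP = exp(-(1/N) * sum_i log P(w_i|h_i)) over N test words."""
    return float(np.exp(-np.mean(np.log(word_probs))))
```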
Experimental Results (cont.)
Experimental Results (cont.)
• Baseline: n-gram trained on 20M words of Fisher transcripts; the class-based model was increased to 500 classes.
• PLSA: 750 aspects, 100 EM iterations.
• Data split into eval03dev and eval03tst.
– Interpolation weights of the word and class-based n-grams were set to minimize perplexity.
– A slight improvement when side-based documents were used.
Experimental Results (cont.)
• b=100 is the best value.
– The PLSA model needs much more data to estimate the topic of Fisher than of SwbI.
• Having a long context is very important.
Experimental Results (cont.)
Conclusion
• PLSA with the suggested modifications reduces the perplexity of a language model.
• Future work:
– Re-score lattices to calculate WERs
– Combine the semantics-oriented model with a syntax-based language model