The Use of Virtual Hypothesis Copies in Decoding of Large-Vocabulary Continuous Speech
Frank Seide
IEEE Transactions on Speech and Audio Processing 2005
Presented by shih-hung, 2005/09/29
Outline
• Introduction
• Review of (M+1)-gram Viterbi Decoding with reentrant tree
• Virtual Hypothesis Copies on word level
• Virtual Hypothesis Copies on sub-word level
• Virtual Hypothesis Copies for Long-Range Acoustic Lookahead (optional)
• Experimental Results
• Conclusion
Introduction
• For decoding of LVCSR, the most widely used algorithm is a time-synchronous Viterbi decoder that uses a tree-organized pronunciation lexicon with word-conditioned tree copies.
• The search space is organized as a reentrant network, which is a composition of the state-level network (lexical tree) and the linguistic (M+1)-gram network.
– i.e. a distinct instance ("copy") of each HMM state in the lexical tree is needed for every linguistic state (M-word history).
• Practically, this copying is done on demand in conjunction with beam pruning.
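A minimal sketch of such on-demand copy management in Python (our illustration, not the paper's code; `copies` maps an M-word history to a dict of active state scores, in the log domain):

```python
# Sketch: tree copies keyed by linguistic history, instantiated on
# demand and discarded again by beam pruning. A "copy" is a dict
# mapping tree states to log-probability scores.
def get_copy(copies, history):
    # create the tree copy for this M-word history on first use
    return copies.setdefault(history, {})

def beam_prune(copies, beam_width):
    best = max((q for copy in copies.values() for q in copy.values()),
               default=float("-inf"))
    for history in list(copies):
        copy = copies[history]
        for state in list(copy):
            if copy[state] < best - beam_width:
                del copy[state]
        if not copy:  # drop copies with no surviving states
            del copies[history]
```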
• One observes that hypotheses for the same word generated from different tree copies are often identical.
– i.e. there is redundant computation.
• Can we exploit this redundancy and modify the algorithm such that word hypotheses are shared across multiple linguistic states?
[Figure: tree-copy example with the words "frank", "funny", "seide"]
• A successful approach to this is the two-pass algorithm by Ney and Aubert. It first generates a word lattice using the "word-pair approximation", and then searches for the best path through this lattice using the full-range language model.
– computation is reduced by sharing word hypotheses between two-word histories that end with the same word.
• An alternative approach is start-time conditioned search, which uses non-reentrant tree copies conditioned on the start time of the tree. Here, word hypotheses are shared across all possible linguistic states during word-level recombination.
• In this paper, we propose a single-pass reentrant-network (M+1)-gram decoder that uses three novel approaches aimed at eliminating redundant copies of the search space.
• 1. State copies are conditioned on the phonetic history rather than the linguistic history.
– the phone-history approximation (PHA), analogous to the word-pair approximation (WPA).
• 2. Path hypotheses at word boundaries are saved at every frame in a data structure similar to a word lattice. To apply the (M+1)-gram at a word end, the needed linguistic path-hypothesis copies are recovered on the fly, similarly to lattice rescoring. We call the recovered copies virtual hypothesis copies (VHC).
• 3. For further reduction of redundancy, multiple instances of the same context-dependent phone occurring in the same phonetic history are also dynamically replaced by a single instance. Incomplete path hypotheses at phoneme boundaries are temporarily saved in the lattice-like structure as well. To apply the tree lexicon, CD-phone instances associated with tree nodes are recovered on the fly (phone-level VHC).
Review of (M+1)-gram Viterbi decoding with a reentrant tree
$Q_{W_M}(t, s)$ := probability of the best path up to time $t$ that ends in state $s$ of the lexical tree for history $W_M$.
$B_{W_M}(t, s)$ := time of the latest transition into the tree root on the best path up to time $t$ that ends in state $s$ of the lexical tree for history $W_M$ ("back-pointer").
$H(W_M; t)$ := probability that the acoustic observation vectors $o(1) \ldots o(t)$ are generated by a word/state sequence that ends with the $M$ words $W_M$ at time $t$.
• The dynamic-programming equations for the word-history conditioned (M+1)-gram search are as follows:
Within-word recombination (s>0)
$$Q_{W_M}(t, s) = q_s(o(t)) \cdot \max_{\sigma} \{\, P(s \mid \sigma) \, Q_{W_M}(t-1, \sigma) \,\}$$
$$B_{W_M}(t, s) = B_{W_M}(t-1, \sigma_{\max}(t, s))$$
where $s$ and $\sigma$ denote tree states with transition probability $P(s \mid \sigma)$ and emission likelihood $q_s(o(t))$, and $\sigma_{\max}(t, s)$ is the maximizing predecessor state.
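A minimal Python sketch of this recursion for one tree copy (our illustration, assuming log-domain scores; `emission(s, o_t)` for $\log q_s(o(t))$ and `preds[s]`, a list of $(\sigma, \log P(s \mid \sigma))$ pairs, are hypothetical helpers):

```python
# Sketch: one time step of within-word recombination (s > 0) for a
# single tree copy. Q_prev/B_prev hold scores and tree-entry times
# at t-1; root re-entry (s = 0) is handled by the word-boundary step.
def within_word_step(Q_prev, B_prev, o_t, states, preds, emission):
    Q, B = {}, {}
    for s in states:
        sigma_max, best = None, float("-inf")
        for sigma, log_p in preds[s]:          # candidate predecessors
            score = log_p + Q_prev.get(sigma, float("-inf"))
            if score > best:
                best, sigma_max = score, sigma
        if sigma_max is not None:
            Q[s] = emission(s, o_t) + best     # q_s(o(t)) * max{...}
            B[s] = B_prev.get(sigma_max)       # propagate back-pointer
    return Q, B
```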
Word-boundary equations:
$$\hat{H}(W_M, w; t) = P(w \mid W_M) \cdot Q_{W_M}(t, S_w)$$
$$H(h(W_M, w); t) = \max_{w_1' \in V} \{\, \hat{H}((w_1', w_2, \ldots, w_M), w; t) \,\}$$
$$Q_{W_M}(t-1, 0) = H(W_M; t-1)$$
$$B_{W_M}(t-1, 0) = t-1$$
where $P(w \mid W_M)$ is the (M+1)-gram language-model probability, $S_w$ denotes a terminal state of the lexical tree for word $w$, and $h(W_M, w)$ denotes the history $W_M$ with the oldest word replaced by word $w$.
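A hedged sketch of this word-boundary step (our names and data layout; `lm_logprob(w, W)` stands for $\log P(w \mid W_M)$, and `Q` maps each history to its tree-copy scores):

```python
# Sketch: apply the (M+1)-gram at word ends and recombine into the
# new histories h(W_M, w); scores are log-probabilities.
def word_boundary_step(Q, terminal_states, lm_logprob):
    H = {}  # H[h(W_M, w)] = best path score ending at this frame
    for W, scores in Q.items():                # W = (w_1, ..., w_M)
        for w, S_w in terminal_states.items():
            if S_w not in scores:
                continue
            h_hat = lm_logprob(w, W) + scores[S_w]
            new_hist = W[1:] + (w,)            # drop w_1, append w
            H[new_hist] = max(H.get(new_hist, float("-inf")), h_hat)
    return H  # re-injected as Q[W'][root] at the next frame
```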
Virtual hypothesis copies on word level
A. How it works
B. Word hypotheses
C. Word-Boundary assumption and Phonetic-History approximation
D. Virtual hypothesis copies: redundancy of $Q_{W_M}$
E. Choosing $\tilde{W}_M$
F. Collapsed hypothesis copies
G. Word-boundary equations
H. Collapsed (M+1)-gram search: Summary
I. Beam pruning
J. Language model lookahead
How it works
• The optimal start time $B_{W_M}(t_{w_e}, S_w)$ of a word depends on its history $W_M$. The same word in different histories may have different optimal start times - this is the reason for copying.
• However, we observed that start times are often identical, in particular if the histories are acoustically similar.
• If for two linguistic histories $W_M$ and $W_M'$ we obtain the same optimal start time $t_s$,
$$B_{W_M}(t_{w_e}, S_w) = B_{W_M'}(t_{w_e}, S_w) = t_s,$$
then we have computed too much.
• It would only have been necessary to perform the state-level Viterbi recursion for one of the two histories. This is because:
$$\frac{Q_{W_M}(t_{w_e}, S_w)}{Q_{W_M}(t_s, 0)} = \frac{Q_{W_M'}(t_{w_e}, S_w)}{Q_{W_M'}(t_s, 0)}$$
in other words, $Q_{W_M'}(t_{w_e}, S_w)$ can be computed (or recovered) from $Q_{W_M}(t_{w_e}, S_w)$:
$$Q_{W_M'}(t_{w_e}, S_w) = Q_{W_M}(t_{w_e}, S_w) \cdot \frac{Q_{W_M'}(t_s, 0)}{Q_{W_M}(t_s, 0)}$$
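In the log domain this recovery is a single addition and subtraction; a minimal sketch (variable names are ours, not the paper's):

```python
# Sketch: recover Q_{W'}(t_we, S_w) from Q_{W}(t_we, S_w) when both
# histories share the optimal start time t_s (log-domain scores).
def recover_score(q_W_end, q_W_root_at_ts, q_Wp_root_at_ts):
    # Q_{W'}(t_we, S_w) = Q_W(t_we, S_w) * Q_{W'}(t_s, 0) / Q_W(t_s, 0)
    return q_W_end + q_Wp_root_at_ts - q_W_root_at_ts
```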
• We are now ready to introduce our method of virtual hypothesis copying (word-level). The method consists of
– 1. predicting the sets of histories for which the optimal start times are going to be identical - this information is needed already when a path enters a new word;
– 2. performing state-level Viterbi processing for only one copy per set;
– 3. for all other copies, recovering their accumulated path probabilities. Thus, on the state level, all but one copy per set are neither stored nor computed - we call them "virtual".
• The art is to reliably predict these sets of histories that will lead to identical optimal start times. An exact prediction is impossible.
• We propose a heuristic, the phone-history approximation (PHA).
• The PHA assumes that a word’s optimal boundary depends only on the last N phones of the history.
• We will step-wise eliminate part of the history-dependent state copies $(Q_{W_M}(t, s), B_{W_M}(t, s))$ by replacing them with "collapsed copies" $(Q_{c(W_M)}(t, s), B_{c(W_M)}(t, s))$ conditioned on entire classes of histories $c(W_M)$.
[Figure: regular bigram search vs. virtual hypothesis copies]
Word hypotheses
• The quadruple $(w, t_s, t_e, h(w, t_s, t_e))$ is called a "word hypothesis", with the word-emission likelihood $h(w, t_s, t_e) = p(O \mid w)$: the probability that the word $w$ produces the acoustic vectors $o(t_s + 1) \ldots o(t_e)$.
• For every tree copy conditioned on history $W_M$, a set of word hypotheses $(w, t_s, t_e, h(w, t_s, t_e))$ can be derived as
$$h(w, t_s, t_e) = \frac{Q_{W_M}(t_e, S_w)}{Q_{W_M}(t_s, 0)} \quad \text{with} \quad t_s = B_{W_M}(t_e, S_w).$$
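A minimal sketch of deriving word hypotheses from one tree copy (our illustration; `root_score[t]` is assumed to record $Q_{W_M}(t, 0)$ at every frame, log domain):

```python
# Sketch: read off word hypotheses (w, t_s, t_e, h) at frame t_e.
def word_hypotheses(t_e, Q, B, terminal_states, root_score):
    hyps = []
    for w, S_w in terminal_states.items():
        if S_w in Q:
            t_s = B[S_w]                  # optimal start time
            h = Q[S_w] - root_score[t_s]  # word-emission likelihood
            hyps.append((w, t_s, t_e, h))
    return hyps
```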
Word-Boundary assumption and Phonetic-History approximation
• We assume the equation below is always true, for every word $w$ and time $t$:
$$B_{W_M}(t, S_w) = B_{\tilde{W}_M}(t, S_w) \quad \text{for all } \tilde{W}_M \text{ with } c(\tilde{W}_M) = c(W_M).$$
• This common word boundary shall be denoted by the symbol $B_{c(W_M)}(t, S_w)$:
$$B_{c(W_M)}(t, S_w) \stackrel{\mathrm{def}}{=} B_{W_M}(t, S_w).$$
• Intuitively, the optimal word boundaries should not depend on the linguistic state, but rather on the phonetic context at the boundary.
• And words ending similarly should lead to the same boundary.
• Thus, we propose a phonetically motivated history-class definition, the phone-history approximation (PHA):
– A word's optimal start time depends only on the word and its N-phone history.
For example, $c(W_M) = \{\text{the last two phones of } W_M\}$. $N$ may also be chosen to be variable, e.g. depending on phonotactic constraints such as syllable structure.
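A sketch of such a class function (our illustration; `lexicon` maps words to phone sequences, and the class is simply the last N phones of the history):

```python
# Sketch: phone-history class under the PHA (fixed N shown).
def history_class(history, lexicon, n_phones=2):
    phones = [p for w in history for p in lexicon[w]]
    return tuple(phones[-n_phones:])

# e.g. history_class(("speech", "lab"),
#                    {"speech": ["s", "p", "iy", "ch"],
#                     "lab": ["l", "ae", "b"]})  ->  ("ae", "b")
```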
Virtual hypothesis copies: redundancy of $Q_{W_M}$
At a word end, $Q_{W_M}$ factors into the word-emission likelihood and the path probability at the word boundary:
$$Q_{W_M}(t, S_w) = h(w, B_{W_M}(t, S_w), t) \cdot Q_{W_M}(B_{W_M}(t, S_w), 0) = h(w, B_{W_M}(t, S_w), t) \cdot H(W_M; B_{W_M}(t, S_w))$$
Under the phone-history approximation, $B_{W_M}(t, S_w) = B_{c(W_M)}(t, S_w)$, so for any $\tilde{W}_M$ with $c(\tilde{W}_M) = c(W_M)$:
$$Q_{W_M}(t, S_w) = h(w, B_{c(W_M)}(t, S_w), t) \cdot H(W_M; B_{c(W_M)}(t, S_w))$$
$$Q_{\tilde{W}_M}(t, S_w) = h(w, B_{c(W_M)}(t, S_w), t) \cdot H(\tilde{W}_M; B_{c(W_M)}(t, S_w))$$
and hence
$$Q_{W_M}(t, S_w) = Q_{\tilde{W}_M}(t, S_w) \cdot \frac{H(W_M; B_{c(W_M)}(t, S_w))}{H(\tilde{W}_M; B_{c(W_M)}(t, S_w))}$$
• Every $Q_{W_M}(t, S_w)$ can be recovered from any other $Q_{\tilde{W}_M}(t, S_w)$ and $H(W_M; \cdot)$ for $c(\tilde{W}_M) = c(W_M)$.
• This gives a way to reduce the search space by keeping only one tree copy per history class $c(W_M)$ and sharing generated word hypotheses across the linguistic states belonging to a class.
• Since the recovered $Q_{W_M}(t, S_w)$ need not be directly computed nor stored, we want to call them virtual hypothesis copies.
Choosing $\tilde{W}_M$
$$\tilde{W}_M = \operatorname*{argmax}_{W_M' : \, c(W_M') = c(W_M)} \{\, Q_{W_M'}(t, S_w) \,\}$$
$$Q_{\tilde{W}_M}(t, S_w) = \max_{W_M' : \, c(W_M') = c(W_M)} \{\, Q_{W_M'}(t, S_w) \,\}$$
Collapsed hypothesis copies
• The most probable hypothesis $\tilde{W}_M$ is only known when the end of the word is reached - too late to reduce computation.
• However, we found that $\tilde{W}_M$ and $Q_{\tilde{W}_M}(t, S_w)$ can be determined by a dynamic-programming recursion without having to compute all potential $Q_{W_M'}(t, S_w)$.
• We define the "collapsed" tree copy $Q_{c(W_M)}(t, s)$ as the state-wise maximum over the $c(W_M)$-dependent copies $Q_{W_M'}(t, s)$:
$$Q_{c(W_M)}(t, s) = \max_{W_M' : \, c(W_M') = c(W_M)} \{\, Q_{W_M'}(t, s) \,\}$$
• We rewrite the above equation by inserting the state-level dynamic-programming recursion:
$$Q_{c(W_M)}(t, s) = \max_{W_M' : \, c(W_M') = c(W_M)} \{\, q_s(o(t)) \cdot \max_{\sigma} \{ P(s \mid \sigma) \, Q_{W_M'}(t-1, \sigma) \} \,\}$$
$$= q_s(o(t)) \cdot \max_{\sigma} \{\, P(s \mid \sigma) \cdot \max_{W_M' : \, c(W_M') = c(W_M)} \{ Q_{W_M'}(t-1, \sigma) \} \,\}$$
$$= q_s(o(t)) \cdot \max_{\sigma} \{\, P(s \mid \sigma) \, Q_{c(W_M)}(t-1, \sigma) \,\}$$
i.e. the collapsed copy obeys the same within-word recursion as an ordinary tree copy.
Word-boundary equations
$$\hat{H}(W_M, w; t) = P(w \mid W_M) \cdot Q_{W_M}(t, S_w) = P(w \mid W_M) \cdot Q_{c(W_M)}(t, S_w) \cdot \frac{H(W_M; B_{c(W_M)}(t, S_w))}{H(\tilde{W}_M; B_{c(W_M)}(t, S_w))}$$
with
$$H(\tilde{W}_M; B_{c(W_M)}(t, S_w)) = \max_{W_M' : \, c(W_M') = c(W_M)} \{\, H(W_M'; B_{c(W_M)}(t, S_w)) \,\} = Q_{c(W_M)}(B_{c(W_M)}(t, S_w), 0)$$
so that
$$\hat{H}(W_M, w; t) = P(w \mid W_M) \cdot Q_{c(W_M)}(t, S_w) \cdot \frac{H(W_M; B_{c(W_M)}(t, S_w))}{Q_{c(W_M)}(B_{c(W_M)}(t, S_w), 0)}$$
where the collapsed root injection is
$$Q_{c(W_M)}(t-1, 0) = \max_{W_M' : \, c(W_M') = c(W_M)} \{\, H(W_M'; t-1) \,\}$$
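A hedged sketch of this word-end step (our names and data layout, not the paper's code; `H_store[t][W]` holds the saved boundary hypotheses $H(W; t)$, log domain):

```python
# Sketch: at a word end of w in the collapsed copy for class c(W_M),
# recover the virtual copies and apply the (M+1)-gram per history.
def collapsed_word_end(w, class_histories, qc_end, bc_end,
                       H_store, lm_logprob):
    t_s = bc_end                       # common boundary B_c(t, S_w)
    root = max(H_store[t_s][W] for W in class_histories)  # Q_c(t_s,0)
    H_new = {}
    for W in class_histories:
        # virtual copy: Q_W(t,S_w) = Q_c(t,S_w) * H(W;t_s) / Q_c(t_s,0)
        q_virtual = qc_end + H_store[t_s][W] - root
        key = W[1:] + (w,)             # new history h(W, w)
        score = lm_logprob(w, W) + q_virtual
        H_new[key] = max(H_new.get(key, float("-inf")), score)
    return H_new
```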
Collapsed (M+1)-gram search: Summary
Within-word recombination:
$$Q_{c(W_M)}(t, s) = q_s(o(t)) \cdot \max_{\sigma} \{\, P(s \mid \sigma) \, Q_{c(W_M)}(t-1, \sigma) \,\}$$
$$B_{c(W_M)}(t, s) = B_{c(W_M)}(t-1, \sigma_{\max}(t, s))$$
Word-boundary equations:
$$\hat{H}(W_M, w; t) = P(w \mid W_M) \cdot Q_{c(W_M)}(t, S_w) \cdot \frac{H(W_M; B_{c(W_M)}(t, S_w))}{Q_{c(W_M)}(B_{c(W_M)}(t, S_w), 0)}$$
$$H(h(W_M, w); t) = \max_{w_1' \in V} \{\, \hat{H}((w_1', w_2, \ldots, w_M), w; t) \,\}$$
$$Q_{c(W_M)}(t-1, 0) = \max_{W_M' : \, c(W_M') = c(W_M)} \{\, H(W_M'; t-1) \,\}$$
$$B_{c(W_M)}(t-1, 0) = t-1.$$
Language model lookahead
• M-gram lookahead aims at using language knowledge as early as possible in the lexical tree by pushing partial M-gram scores toward the tree root.
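A minimal sketch of the unigram case used in the experiments below (our illustration; the tree layout via hypothetical `children` / `words_at` dicts, log domain):

```python
# Sketch: for every tree node, precompute the best LM score over all
# words reachable below it; pi[node] is applied on entering the node.
def compute_lookahead(node, children, words_at, lm_unigram, pi):
    best = max((lm_unigram[w] for w in words_at.get(node, [])),
               default=float("-inf"))
    for child in children.get(node, []):
        best = max(best,
                   compute_lookahead(child, children, words_at,
                                     lm_unigram, pi))
    pi[node] = best
    return best
```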
Virtual hypothesis copies on the sub-word level
• In the word-level method, the state-level search can be interpreted as a “word-lattice generator” with (M+1)-gram “lattice rescoring” applied on the fly; and search-space reduction was achieved by sharing tree copies amongst multiple histories.
• We now want to apply the same idea to the subword level: the state-level search now becomes a sort of "subword generator"; subword hypotheses are incrementally matched against the lexical tree (frame-synchronously) and (M+1)-gram lattice rescoring is applied as before.
Experimental setup
• The Philips LVCSR system is based on continuous-mixture HMMs.
• MFCC features.
• Unigram lookahead.
• Corpora for Mandarin:
– MAT-2000, PCD, National Hi-Tech Project 863
• Corpora for English:
– trained on WSJ0+1
– tested on 1994 ARPA NAB
Experimental result

[Figures: experimental results for Mandarin and English]
Conclusion
• We have presented a time-synchronous LVCSR Viterbi decoder for Mandarin based on the novel concept of virtual hypothesis copies (VHC).
• At no loss of accuracy, a reduction of active states of 60-80% has been achieved for Chinese, and of 40-50% for American English.