1
International Computer Science Institute
Data Sampling for Acoustic Model Training
Özgür Çetin, International Computer Science Institute
Andreas Stolcke, SRI / International Computer Science Institute
Barbara Peskin, International Computer Science Institute
2
Overview
• Introduction
• Sampling Criteria
• Experiments
• Summary
3
Data Sampling
• Select a subset of data for acoustic model training
• A variety of scenarios where sampling can be useful:
– May reduce transcription costs if data are untranscribed, e.g. Broadcast News
– May filter out bad data w/ transcription/alignment errors
– May reduce training/decoding costs for a target performance
– Could train multiple systems on different subsets of data, e.g. for cross-system adaptation
– May improve accuracy in cross-domain tasks, e.g. CTS acoustic models for meetings recognition
4
Data Sampling (contd.)
• Key assumptions:
– Maximum likelihood training
– Transcribed data
– Utterance-by-utterance data selection
• Investigate the utility of various sampling criteria for CTS acoustic models (trained on Fisher) at different amounts of training data
• Comparison metric: word error rate (WER); a minimal sketch follows below
• Ultimate goals are tasks w/ unsupervised learning and discriminative training, where data quality is arguably much more important
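As a reference point for the comparison metric, here is a minimal sketch of the standard WER computation via Levenshtein alignment over words. This is purely illustrative and not part of the SRI Decipher pipeline:

```python
def word_error_rate(ref: str, hyp: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length,
    from a standard Levenshtein alignment over words."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = minimum edit distance between r[:i] and h[:j]
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i  # delete all of r[:i]
    for j in range(len(h) + 1):
        d[0][j] = j  # insert all of h[:j]
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)

print(word_error_rate("the cat sat", "the hat sat down"))  # 1 sub + 1 ins over 3 words: ~0.67
```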
5
Experimental Paradigm
• Train: data sampled from male Fisher data (778 hrs, whatever was available in Spring ‘04)
• Test: 2004 NIST development set
• BBN + LDC segmentations
• Decision-tree tied triphones: an automatic mechanism to control model complexity
• SRI Decipher recognition system
– Not the standard system; runs fast and involves only one acoustic model
6
Experimental Paradigm (contd.)
• Training
– Viterbi-style maximum likelihood training
– Cross-word models; 128 mixtures per tied state
• Decoding
– Phone-loop MLLR
– Decoding and lattice generation
– Lattice rescoring w/ a 4-gram LM
– Expansion of lattices w/ a 3-gram LM
– N-best decoding from expanded lattices
– N-best rescoring w/ a 4-gram LM + duration models
– Confusion network decoding of the final hypothesis
7
Sampling Criteria
• Random sampling
• Likelihood-derived criteria
• Accuracy-based criteria
• Context coverage
8
Random Sampling
• Select an arbitrary subset of available data (a selection sketch follows below)
• Very simple; doesn’t introduce any systematic variations
• Ideal for experimentation w/ small amounts of training data
• Data statistics:
– Average utterance length: 3.77 secs
– Average silence percentage per utterance: 20%
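A minimal sketch of utterance-level random sampling to a target hour budget, assuming hypothetical utterance records with a duration_secs attribute (not the authors' tooling):

```python
import random

def random_sample(utterances, target_hours, seed=0):
    """Select an arbitrary subset whose total duration reaches target_hours."""
    rng = random.Random(seed)  # fixed seed: a single random sample
    pool = list(utterances)
    rng.shuffle(pool)
    selected, total_secs = [], 0.0
    for utt in pool:
        if total_secs >= target_hours * 3600:
            break
        selected.append(utt)
        total_secs += utt.duration_secs  # assumed field on the utterance record
    return selected
```

Hierarchical subsets (each larger set containing the smaller ones, as in the results that follow) fall out of the same shuffled list truncated at increasing budgets.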
9
Results: Random Sampling
• WER for random, hierarchical subsets of training data
• Based on a single random sample
• Incremental gains under our ML training paradigm

[Figure: WER vs. amount of training data (16 to 256 hours)]
10
Likelihood-based Criteria
• Select utterances according to an utterance-level acoustic likelihood score:
score = utterance likelihood / number of frames
• Pros
– Very simple; readily computed
– Utterances w/ low and high scores tend to indicate transcription errors/long utterances, and long silences, respectively
• Cons
– Likelihood has no direct relevance to accuracy
– May need additional normalization to deal w/ silence
• Can argue for selecting utterances w/ low, high, and average likelihood scores (a selection sketch follows below)
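A sketch of the score and the three selection regimes, assuming per-utterance acoustic log-likelihoods and frame counts are already available (e.g. from forced alignment); the utterance fields are illustrative:

```python
def per_frame_score(utt):
    # score = utterance likelihood / number of frames (log domain)
    return utt.log_likelihood / utt.num_frames

def select_by_likelihood(utterances, target_hours, regime="average"):
    """Keep the lowest-, highest-, or most-average-scoring utterances
    until the hour budget is met."""
    if regime == "low":
        order = sorted(utterances, key=per_frame_score)
    elif regime == "high":
        order = sorted(utterances, key=per_frame_score, reverse=True)
    else:  # "average": utterances closest to the median score first
        scores = sorted(per_frame_score(u) for u in utterances)
        median = scores[len(scores) // 2]
        order = sorted(utterances, key=lambda u: abs(per_frame_score(u) - median))
    selected, secs = [], 0.0
    for utt in order:
        if secs >= target_hours * 3600:
            break
        selected.append(utt)
        secs += utt.duration_secs
    return selected
```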
11
Normalized Likelihood (Speech + Non-Speech)
• Per-frame utterance likelihoods on male Fisher data
• Unimodal distribution, simplifying selection regimes
• Select utterances w/ low, high, and average likelihoods
[Figures: likelihood score PDF, and average utterance length (seconds) vs. amount of selected data (16 to 256 hours) for high-, average-, and low-likelihood selection]
• High-likelihood utterances tend to have a lot of silence
• Use likelihood only from speech frames
12
Normalized Likelihood (Speech)
• Use likelihood only from speech frames (a sketch follows after the figure)
• Score distribution is more concentrated, shifted towards lower likelihoods
[Figures: speech-only score PDF, and average utterance length (seconds) vs. amount of selected data (hours) for high-, average-, and low-likelihood selection]
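A sketch of the speech-only normalization, assuming frame-level log-likelihoods plus a boolean speech/non-speech labeling from the alignments (array names are illustrative):

```python
import numpy as np

def speech_only_score(frame_loglik: np.ndarray, is_speech: np.ndarray) -> float:
    """Per-frame likelihood over speech frames only, so long silences
    no longer inflate an utterance's score."""
    speech = frame_loglik[is_speech]
    if speech.size == 0:
        return float("-inf")  # all-silence utterance; ranks last
    return float(speech.mean())
```

The selection regimes stay as before; only the scoring function is swapped.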
13
Results: Likelihood-based Sampling
[Figures: WER vs. amount of training data (16 to 256 hours) for high, average, low, and random selection; left panel uses speech + non-speech likelihoods, right panel uses speech-only likelihoods]
• Selecting utterances w/ average likelihood scores performs the best
• No benefit over random sampling if likelihoods from non-speech frames contribute
• 0.5% absolute improvement over random sampling for 256 hours of data, if non-speech frames are excluded
14
Accuracy-based Criteria
• Select utterances based on their recognition difficulty
• Word and phone error rates, or lattice entropy
• Pros
– Directed towards the final objective (WER)
– Straightforward to calculate w/ additional cost
• Cons
– Accuracy seems to be highly concentrated (across utterances)
• Focus on average phone accuracy per utterance
15
Phone Accuracy
• Average phone accuracy per utterance, after a monotonic transformation to spread the distribution: f(x) = log(1 - x) (a sketch follows after the figure)
[Figures: transformed accuracy score PDF, and average utterance length (seconds) vs. amount of selected data (16 to 256 hours) for high-, average-, and low-accuracy selection]
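A sketch of the spreading transform, assuming an average phone accuracy per utterance in [0, 1) has already been computed (e.g. by aligning phone recognition output against the reference transcript):

```python
import math

def spread_accuracy(phone_accuracy: float, eps: float = 1e-6) -> float:
    """Monotonic transform f(x) = log(1 - x), spreading out the
    concentrated high-accuracy end of the distribution."""
    x = min(phone_accuracy, 1.0 - eps)  # guard against log(0) at x = 1
    return math.log(1.0 - x)

# e.g. accuracies 0.90 and 0.99 map to about -2.3 and -4.6, so
# near-perfect utterances become distinguishable from one another
```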
16
Results: Accuracy-based Sampling
[Figure: WER vs. amount of training data (16 to 256 hours) for high-, average-, and low-accuracy selection and random sampling]
• For small amounts of training data (< 128 hours), utterances w/ low phone recognition accuracy perform better
• At larger amounts of data, training on more difficult utterances seems to be more advantageous (promising to perform better than random sampling)
17
Triphone Coverage
• Under a generative modeling paradigm (e.g. HMMs) and ML estimation, one might argue that it suffices to estimate each distribution accurately once enough prototypes are found
• Frequent triphones will be selected anyway, so tailor the sampling towards utterances w/ infrequent counts
• Greedy utterance selection to maximize the entropy of the triphone count distribution (a sketch follows after the figure)
[Figure: average utterance length (seconds) vs. amount of training data (16 to 128 hours) for high-coverage, random, and low-coverage selection]
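A sketch of one way to implement the greedy selection, assuming each utterance carries a hypothetical triphone_counts mapping from triphone labels to counts; entropy is recomputed per candidate for clarity, which would need incremental updates at Fisher scale:

```python
import math
from collections import Counter

def entropy(counts):
    """Entropy of the normalized triphone count distribution."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total)
                for c in counts.values() if c > 0)

def greedy_coverage(utterances, target_hours):
    """Repeatedly add the utterance that most increases the entropy of
    the pooled triphone counts, favoring infrequent triphones."""
    pool = list(utterances)
    selected, pooled, secs = [], Counter(), 0.0
    while pool and secs < target_hours * 3600:
        best = max(pool, key=lambda u: entropy(pooled + Counter(u.triphone_counts)))
        pool.remove(best)
        selected.append(best)
        pooled.update(best.triphone_counts)
        secs += best.duration_secs
    return selected
```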
18
Results: Triphone Coverage-based Sampling
[Figure: WER vs. amount of selected data (16 to 128 hours) for triphone-coverage-based and random selection]
• For small amounts of training data, selecting utterances to maximize triphone coverage performs similarly to the likelihood-based sampling criteria
• No advantage (even some degradation) as compared to random sampling
• May need many examples of frequent triphones to get a better coverage of non-contextual variations, e.g. speakers
19
Summary
• Compared a variety of acoustic data selection criteria for labeled data and ML training (random sampling, and criteria based on likelihood, accuracy, and triphone coverage)
• Found that likelihood-based selection after removing silences performs the best and slightly improves over random sampling (0.5% abs.)
• No significant performance improvement overall
• Caveat:
– Our accuracy-based and triphone-coverage-based selection criteria are rather simplistic
20
Future Work
• Tasks where data quality is more important
– Untranscribed data
– Discriminative training
• More sophisticated accuracy and context-coverage criteria, e.g. lattice entropy/confidence
• Data selection for cross-domain tasks, e.g. CTS data for Meetings recognition
• Speaker-level data selection
– Could be useful for cross-adaptation methods