1
International Computer Science Institute
Data Sampling for Acoustic Model Training
Özgür Çetin, International Computer Science Institute
Andreas Stolcke, SRI / International Computer Science Institute
Barbara Peskin, International Computer Science Institute
2
Overview
• Introduction
• Sampling Criteria
• Experiments
• Summary
3
Data Sampling
• Select a subset of data for acoustic model training
• A variety of scenarios where sampling can be useful:
– May reduce transcription costs if data are untranscribed, e.g. Broadcast News
– May filter out bad data w/ transcription/alignment errors
– May reduce training/decoding costs for a target performance
– Could train multiple systems on different subsets of data, e.g. for cross-system adaptation
– May improve accuracy in cross-domain tasks, e.g. CTS acoustic models for meetings recognition
4
Data Sampling (contd.)
• Key assumptions:
– Maximum likelihood training
– Transcribed data
– Utterance-by-utterance data selection
• Investigate the utility of various sampling criteria for CTS acoustic models (trained on Fisher) at different amounts of training data
• Comparison metric: word error rate (WER); a minimal sketch follows below
• Ultimate goals are tasks w/ unsupervised learning and discriminative training, where data quality is arguably much more important
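As a reference point for the comparison metric, here is a minimal sketch of the standard WER computation via Levenshtein alignment over words. This is purely illustrative and not part of the SRI Decipher pipeline:

```python
def word_error_rate(ref: str, hyp: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length,
    from a standard Levenshtein alignment over words."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = minimum edit distance between r[:i] and h[:j]
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i  # delete all of r[:i]
    for j in range(len(h) + 1):
        d[0][j] = j  # insert all of h[:j]
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)

print(word_error_rate("the cat sat", "the hat sat down"))  # 1 sub + 1 ins over 3 words: ~0.67
```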
5
Experimental Paradigm
• Train: data sampled from male Fisher data (778 hrs, whatever was available in Spring ‘04)
• Test: 2004 NIST development set
• BBN + LDC segmentations
• Decision-tree tied triphones: an automatic mechanism to control model complexity
• SRI Decipher recognition system
– Not the standard system; runs fast and involves only one acoustic model
6
Experimental Paradigm (contd.)
• Training
– Viterbi-style maximum likelihood training
– Cross-word models; 128 mixtures per tied state
• Decoding
– Phone-loop MLLR
– Decoding and lattice generation
– Lattice rescoring w/ a 4-gram LM
– Expansion of lattices w/ a 3-gram LM
– N-best decoding from expanded lattices
– N-best rescoring w/ a 4-gram LM + duration models
– Confusion network decoding of the final hypothesis
7
Sampling Criteria
• Random sampling
• Likelihood-derived criteria
• Accuracy-based criteria
• Context coverage
8
Random Sampling
• Select an arbitrary subset of available data (a selection sketch follows below)
• Very simple; doesn’t introduce any systematic variations
• Ideal for experimentation w/ small amounts of training data
• Data statistics:
– Average utterance length: 3.77 secs
– Average silence percentage per utterance: 20%
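A minimal sketch of utterance-level random sampling to a target hour budget, assuming hypothetical utterance records with a duration_secs attribute (not the authors' tooling):

```python
import random

def random_sample(utterances, target_hours, seed=0):
    """Select an arbitrary subset whose total duration reaches target_hours."""
    rng = random.Random(seed)  # fixed seed: a single random sample
    pool = list(utterances)
    rng.shuffle(pool)
    selected, total_secs = [], 0.0
    for utt in pool:
        if total_secs >= target_hours * 3600:
            break
        selected.append(utt)
        total_secs += utt.duration_secs  # assumed field on the utterance record
    return selected
```

Hierarchical subsets (each larger set containing the smaller ones, as in the results that follow) fall out of the same shuffled list truncated at increasing budgets.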
9
Results: Random Sampling
• WER for random, hierarchical subsets of training data
• Based on a single random sample
• Incremental gains under our ML training paradigm

[Figure: WER vs. amount of training data (16 to 256 hours)]
10
Likelihood-based Criteria
• Select utterances according to an utterance-level acoustic likelihood score:
score = utterance likelihood / number of frames
• Pros
– Very simple; readily computed
– Utterances w/ low and high scores tend to indicate transcription errors/long utterances, and long silences, respectively
• Cons
– Likelihood has no direct relevance to accuracy
– May need additional normalization to deal w/ silence
• Can argue for selecting utterances w/ low, high, and average likelihood scores (a selection sketch follows below)
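A sketch of the score and the three selection regimes, assuming per-utterance acoustic log-likelihoods and frame counts are already available (e.g. from forced alignment); the utterance fields are illustrative:

```python
def per_frame_score(utt):
    # score = utterance likelihood / number of frames (log domain)
    return utt.log_likelihood / utt.num_frames

def select_by_likelihood(utterances, target_hours, regime="average"):
    """Keep the lowest-, highest-, or most-average-scoring utterances
    until the hour budget is met."""
    if regime == "low":
        order = sorted(utterances, key=per_frame_score)
    elif regime == "high":
        order = sorted(utterances, key=per_frame_score, reverse=True)
    else:  # "average": utterances closest to the median score first
        scores = sorted(per_frame_score(u) for u in utterances)
        median = scores[len(scores) // 2]
        order = sorted(utterances, key=lambda u: abs(per_frame_score(u) - median))
    selected, secs = [], 0.0
    for utt in order:
        if secs >= target_hours * 3600:
            break
        selected.append(utt)
        secs += utt.duration_secs
    return selected
```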
11
Normalized Likelihood (Speech + Non-Speech)
• Per-frame utterance likelihoods on male Fisher data
• Unimodal distribution, simplifying selection regimes
• Select utterances w/ low, high, and average likelihoods
[Figures: likelihood score PDF, and average utterance length (seconds) vs. amount of selected data (16 to 256 hours) for high-, average-, and low-likelihood selection]
• High-likelihood utterances tend to have a lot of silence
• Use likelihood only from speech frames
12
Normalized Likelihood (Speech)
• Use likelihood only from speech frames (a sketch follows after the figure)
• Score distribution is more concentrated, shifted towards lower likelihoods
[Figures: speech-only score PDF, and average utterance length (seconds) vs. amount of selected data (hours) for high-, average-, and low-likelihood selection]
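A sketch of the speech-only normalization, assuming frame-level log-likelihoods plus a boolean speech/non-speech labeling from the alignments (array names are illustrative):

```python
import numpy as np

def speech_only_score(frame_loglik: np.ndarray, is_speech: np.ndarray) -> float:
    """Per-frame likelihood over speech frames only, so long silences
    no longer inflate an utterance's score."""
    speech = frame_loglik[is_speech]
    if speech.size == 0:
        return float("-inf")  # all-silence utterance; ranks last
    return float(speech.mean())
```

The selection regimes stay as before; only the scoring function is swapped.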
13
Results: Likelihood-based Sampling
[Figures: WER vs. amount of training data (16 to 256 hours) for high, average, low, and random selection; left panel uses speech + non-speech likelihoods, right panel uses speech-only likelihoods]
• Selecting utterances w/ average likelihood scores performs the best
• No benefit over random sampling if likelihoods from non-speech frames contribute
• 0.5% absolute improvement over random sampling for 256 hours of data, if non-speech frames are excluded
14
Accuracy-based Criteria
• Select utterances based on their recognition difficulty
• Word and phone error rates, or lattice entropy
• Pros
– Directed towards the final objective (WER)
– Straightforward to calculate w/ additional cost
• Cons
– Accuracy seems to be highly concentrated (across utterances)
• Focus on average phone accuracy per utterance
15
Phone Accuracy
• Average phone accuracy per utterance, after a monotonic transformation to spread the distribution: f(x) = log(1 - x) (a sketch follows after the figure)
[Figures: transformed accuracy score PDF, and average utterance length (seconds) vs. amount of selected data (16 to 256 hours) for high-, average-, and low-accuracy selection]
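A sketch of the spreading transform, assuming an average phone accuracy per utterance in [0, 1) has already been computed (e.g. by aligning phone recognition output against the reference transcript):

```python
import math

def spread_accuracy(phone_accuracy: float, eps: float = 1e-6) -> float:
    """Monotonic transform f(x) = log(1 - x), spreading out the
    concentrated high-accuracy end of the distribution."""
    x = min(phone_accuracy, 1.0 - eps)  # guard against log(0) at x = 1
    return math.log(1.0 - x)

# e.g. accuracies 0.90 and 0.99 map to about -2.3 and -4.6, so
# near-perfect utterances become distinguishable from one another
```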
16
Results: Accuracy-based Sampling
[Figure: WER vs. amount of training data (16 to 256 hours) for high-, average-, and low-accuracy selection and random sampling]
• For small amounts of training data (< 128 hours), utterances w/ low phone recognition accuracy perform better
• At larger amounts of data, training on more difficult utterances seems to be more advantageous (promising to perform better than random sampling)
17
Triphone Coverage
• Under a generative modeling paradigm (e.g. HMMs) and ML estimation, one might argue that it suffices to estimate each distribution accurately once enough prototypes are found
• Frequent triphones will be selected anyway, so tailor the sampling towards utterances w/ infrequent counts
• Greedy utterance selection to maximize the entropy of the triphone count distribution (a sketch follows after the figure)
[Figure: average utterance length (seconds) vs. amount of training data (16 to 128 hours) for high-coverage, random, and low-coverage selection]
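A sketch of one way to implement the greedy selection, assuming each utterance carries a hypothetical triphone_counts mapping from triphone labels to counts; entropy is recomputed per candidate for clarity, which would need incremental updates at Fisher scale:

```python
import math
from collections import Counter

def entropy(counts):
    """Entropy of the normalized triphone count distribution."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total)
                for c in counts.values() if c > 0)

def greedy_coverage(utterances, target_hours):
    """Repeatedly add the utterance that most increases the entropy of
    the pooled triphone counts, favoring infrequent triphones."""
    pool = list(utterances)
    selected, pooled, secs = [], Counter(), 0.0
    while pool and secs < target_hours * 3600:
        best = max(pool, key=lambda u: entropy(pooled + Counter(u.triphone_counts)))
        pool.remove(best)
        selected.append(best)
        pooled.update(best.triphone_counts)
        secs += best.duration_secs
    return selected
```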
18
Results: Triphone Coverage-based Sampling
[Figure: WER vs. amount of selected data (16 to 128 hours) for triphone-coverage-based and random selection]
• For small amounts of training data, selecting utterances to maximize triphone coverage performs similarly to the likelihood-based sampling criteria
• No advantage (even some degradation) as compared to random sampling
• May need many examples of frequent triphones to get a better coverage of non-contextual variations, e.g. speakers
19
Summary
• Compared a variety of acoustic data selection criteria for labeled data and ML training (random sampling, and criteria based on likelihood, accuracy, and triphone coverage)
• Found that likelihood-based selection after removing silences performs the best and slightly improves over random sampling (0.5% abs.)
• No significant performance improvement overall
• Caveat:
– Our accuracy-based and triphone-coverage-based selection criteria are rather simplistic
20
Future Work
• Tasks where data quality is more important
– Untranscribed data
– Discriminative training
• More sophisticated accuracy and context-coverage criteria, e.g. lattice entropy/confidence
• Data selection for cross-domain tasks, e.g. CTS data for Meetings recognition
• Speaker-level data selection
– Could be useful for cross-adaptation methods