
Page 1:

International Computer Science Institute

Data Sampling for Acoustic Model Training

Özgür Çetin, International Computer Science Institute

Andreas Stolcke, SRI / International Computer Science Institute

Barbara Peskin, International Computer Science Institute

Page 2:

Overview

• Introduction

• Sampling Criteria

• Experiments

• Summary

Page 3:

Data Sampling

Select a subset of data for acoustic model training. A variety of scenarios where sampling can be useful:

– May reduce transcription costs if data are untranscribed, e.g. Broadcast News
– May filter out bad data w/ transcription/alignment errors
– May reduce training/decoding costs for a target performance
– Could train multiple systems on different subsets of data, e.g. for cross-system adaptation
– May improve accuracy in cross-domain tasks, e.g. CTS acoustic models for meeting recognition

Page 4:

Data Sampling (contd.)

Key assumptions
– Maximum likelihood training
– Transcribed data
– Utterance-by-utterance data selection

Investigate the utility of various sampling criteria for CTS acoustic models (trained on Fisher) at different amounts of training data

Comparison metric: word error rate (WER)

Ultimate goals are tasks w/ unsupervised learning and discriminative training, where data quality is arguably much more important

Page 5:

Experimental Paradigm

Train: data sampled from male Fisher data (778 hrs, whatever was available in Spring '04)

Test: 2004 NIST development set, BBN + LDC segmentations

Decision-tree tied triphones: an automatic mechanism to control model complexity

SRI Decipher recognition system
– Not the standard system; runs fast and involves only one acoustic model

Page 6:

Experimental Paradigm (contd.)

Training
– Viterbi-style maximum likelihood training
– Cross-word models; 128 mixtures per tied state

Decoding
– Phone-loop MLLR
– Decoding and lattice generation
– Lattice rescoring w/ a 4-gram LM
– Expansion of lattices w/ a 3-gram LM
– N-best decoding from expanded lattices
– N-best rescoring w/ a 4-gram LM + duration models
– Confusion network decoding of final hypothesis

Page 7:

Sampling Criteria

– Random sampling
– Likelihood-derived criteria
– Accuracy-based criteria
– Context coverage

Page 8:

Random Sampling

Select an arbitrary subset of available data

Very simple; doesn't introduce any systematic variations

Ideal for experimentation w/ small amounts of training data

Data statistics
– Average utterance length: 3.77 secs
– Average silence% per utterance: 20%

Page 9:

Results: Random Sampling

WER for random, hierarchical subsets of training data

Based on a single random sample

Incremental gains under our ML training paradigm

[Figure: WER (0–30%) vs. amount of training data (16, 32, 64, 128, 256 hours)]
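The hierarchical subsets above can be drawn with a single shuffle, so that each smaller subset is nested inside every larger one and WER differences reflect the amount of data alone. A minimal sketch (hypothetical names; assumes utterance durations in seconds are already known):

```python
import random

def nested_subsets(utterances, durations, targets_hours=(16, 32, 64, 128, 256), seed=0):
    """Draw hierarchical random subsets: one shuffle induces all the
    nested subsets at once, each truncated at a target data amount."""
    order = list(range(len(utterances)))
    random.Random(seed).shuffle(order)
    subsets, total, cut = {}, 0.0, 0
    targets = iter(sorted(targets_hours))
    target = next(targets)
    for i in order:
        total += durations[i] / 3600.0  # seconds -> hours
        cut += 1
        while total >= target:
            subsets[target] = [utterances[j] for j in order[:cut]]
            try:
                target = next(targets)
            except StopIteration:
                return subsets
    return subsets
```

By construction, `subsets[16]` is contained in `subsets[32]`, and so on up the hierarchy.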

Page 10:

Likelihood-based Criteria

Select utterances according to an utterance-level acoustic likelihood score: score = utterance likelihood / number of frames

Pros
– Very simple; readily computed
– Utterances w/ low and high scores tend to indicate transcription errors/long utterances, and long silences

Cons
– Likelihood has no direct relevance to accuracy
– May need additional normalization to deal w/ silence

Can argue for selecting utterances w/ low, high, and average likelihood scores
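The scoring and the "average likelihood" selection regime can be sketched as follows (hypothetical helper names; assumes per-utterance acoustic log-likelihoods and frame counts are already available, e.g. from forced alignment):

```python
def per_frame_scores(logliks, frame_counts):
    """Per-frame normalized acoustic score for each utterance:
    score = total log-likelihood / number of frames."""
    return [ll / n for ll, n in zip(logliks, frame_counts)]

def select_average_band(utt_ids, scores, fraction):
    """Keep the requested fraction of utterances whose scores lie
    closest to the median score -- the 'average likelihood' regime,
    which performed best in these experiments."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i])
    median = scores[ranked[len(ranked) // 2]]
    by_distance = sorted(range(len(scores)), key=lambda i: abs(scores[i] - median))
    k = max(1, int(round(fraction * len(scores))))
    return [utt_ids[i] for i in by_distance[:k]]
```

The low- and high-score regimes are the analogous head and tail of `ranked`.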

Page 11:

Normalized Likelihood (Speech + Non-Speech)

Per-frame utterance likelihoods on male Fisher data

Unimodal distribution, simplifying selection regimes

Select utterances w/ low, high, and average likelihoods

[Figures: score PDF; average utterance length (0–5 s) vs. amount of selected data (16–256 hours) for high, average, and low likelihood selection]

High-likelihood utterances tend to have a lot of silence

Use likelihood only from speech frames

Page 12:

Normalized Likelihood (Speech)

Use likelihood only from speech frames

More concentrated, shifted towards lower likelihoods

[Figures: score PDF; average utterance length (0–5 s) vs. amount of selected data (19.9–335 hours) for high, average, and low likelihood selection]
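Excluding non-speech frames only changes the normalization: sum frame log-likelihoods over speech frames and divide by the speech-frame count. A minimal sketch (hypothetical names; assumes a per-frame speech/non-speech mask, e.g. from a forced alignment):

```python
def speech_only_scores(frame_logliks, speech_masks):
    """Per-frame likelihood computed over speech frames only, so that
    utterances full of easy-to-model silence no longer score high."""
    scores = []
    for lls, mask in zip(frame_logliks, speech_masks):
        speech = [ll for ll, m in zip(lls, mask) if m]
        # An all-silence utterance gets -inf so it is never selected.
        scores.append(sum(speech) / len(speech) if speech else float("-inf"))
    return scores
```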

Page 13:

Results: Likelihood-based Sampling

[Figure: WER vs. amount of training data (16–256 hours) for high, average, low, and random likelihood selection; left panel: speech + non-speech frames, right panel: speech frames only]

• Selecting utterances w/ average likelihood scores performs the best

• No benefit over random sampling if likelihoods from non-speech frames contribute

• 0.5% absolute improvement over random sampling for 256 hours of data, if non-speech frames are excluded

Page 14:

Accuracy-based Criteria

Select utterances based on their recognition difficulty

Word and phone error rates, or lattice entropy

Pros
– Directed towards the final objective (WER)
– Straightforward to calculate w/ additional cost

Cons
– Accuracy seems to be highly concentrated (across utterances)

Focus on average phone accuracy per utterance

Page 15:

Phone Accuracy

Average phone accuracy per utterance, after a monotonic transformation (to spread the distribution): f(x) = log(1 - x)

[Figures: score PDF; average utterance length (0–5 s) vs. amount of selected data (16–256 hours) for high, average, and low accuracy selection]
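The transform is easy to sketch directly: since phone accuracies cluster near 1.0, log(1 - x) stretches that region out. The eps clamp for perfectly recognized utterances (x = 1) is my addition, not from the slides:

```python
import math

def spread(acc, eps=1e-6):
    """Monotonic transform f(x) = log(1 - x): maps accuracies near 1.0
    to well-separated negative values; eps guards against log(0)."""
    return math.log(max(1.0 - acc, eps))
```

For example, accuracies 0.90 and 0.99 differ by only 0.09, but their transformed scores differ by log(10), making the upper tail of the distribution far easier to partition into selection bands.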

Page 16:

Results: Accuracy-based Sampling

[Figure: WER vs. amount of training data (16–256 hours) for high, average, low, and random accuracy selection]

• For small amounts of training data (< 128 hours), utterances w/ low phone recognition accuracy perform better

• At larger amounts of data, training on more difficult utterances seems to be more advantageous (promising to perform better than random sampling)

Page 17:

Triphone Coverage

Under a generative modeling paradigm (e.g. HMMs) and ML estimation, one might argue that it is sufficient to accurately estimate a distribution once enough prototypes are found

Frequent triphones will be selected anyway, so tailor the sampling towards utterances w/ infrequent triphones

Greedy utterance selection to maximize the entropy of the triphone count distribution

[Figure: average utterance length (0–4 s) vs. amount of training data (16–128 hours) for high, random, and low coverage selection]
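The greedy selection described above can be sketched as follows (hypothetical names; a quadratic-time toy, whereas a real implementation would update the entropy incrementally rather than rescore every candidate at every step):

```python
import math
from collections import Counter

def entropy(counts):
    """Shannon entropy (nats) of a triphone count distribution."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum(c / total * math.log(c / total) for c in counts.values())

def greedy_coverage(utt_triphones, budget):
    """At each step, add the utterance whose triphones most increase
    the entropy of the accumulated triphone count distribution,
    i.e. favour utterances containing infrequent triphones."""
    selected, counts = [], Counter()
    remaining = dict(utt_triphones)  # utt_id -> list of triphone labels
    while remaining and len(selected) < budget:
        best_id, best_h = None, float("-inf")
        for uid, tris in remaining.items():
            h = entropy(counts + Counter(tris))
            if h > best_h:
                best_id, best_h = uid, h
        selected.append(best_id)
        counts += Counter(remaining.pop(best_id))
    return selected
```

An utterance repeating one frequent triphone adds little entropy, while one introducing unseen triphones flattens the count distribution, so the latter is picked first.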

Page 18:

Results: Triphone Coverage-based Sampling

[Figure: WER (20–28%) vs. amount of selected data (16–128 hours) for coverage-based and random selection]

• For small amounts of data, selecting utterances to maximize triphone coverage performs similarly to the likelihood-based sampling criteria

• No advantage (even some degradation) as compared to random sampling

• May need many examples of frequent triphones to get a better coverage of non-contextual variations, e.g. speakers

Page 19:

Summary

Compared a variety of acoustic data selection criteria for labeled data and ML training (random sampling, and criteria based on likelihood, accuracy, and triphone coverage)

Found that likelihood-based selection after removing silences performs the best and slightly improves over random sampling (0.5% abs.)

No significant performance improvement overall

Caveat:
– Our accuracy-based and triphone coverage-based selection criteria are rather simplistic

Page 20:

Future Work

Tasks where data quality is more important
– Untranscribed data
– Discriminative training

More sophisticated accuracy and context-coverage criteria, e.g. lattice entropy/confidence

Data selection for cross-domain tasks, e.g. CTS data for Meetings recognition

Speaker-level data selection
– Could be useful for cross-adaptation methods