
Javier Macías-Guarasa
International Computer Science Institute

Berkeley, CA - USA

Acoustic Adaptation and Accent Identification in the ICSI MR and FAE Corpora

Overview

• Introduction
• Acoustic adaptation
  – MR SI task
  – MR SD task
• Accent identification
  – MR SI task
  – FAE task
• Conclusions
• Future work

Introduction (I)

• Work on improving WER for non-native speakers in the ICSI MR corpus
• General details on the Meeting Recording (MR) corpus:
  – Number of speakers: 61
  – Speech segmented: 85:08:21
  – Number of accents: 15
  – ‘Workable’ accents:
    • American 53:12:35 15m+8f
    • German 11:37:01 10m+2f
    • Spanish 04:38:24 4m+1f
    • British 01:03:45 2m+0f (just for reference)

Introduction (II)

• Initial idea:
  – Pronunciation modeling for non-native speakers
• Acoustic adaptation techniques to be tested first:
  – SRI Decipher system capabilities:
    • MAP/MLLR/PhoneLoop
  – Analyze different strategies
    • Speaker dependent and independent tasks

Introduction (III)

• Accent identification:
  – Needed to effectively use accent-dependent models in a real-world system
  – Emphasis on ‘practical’ approaches using, again, SRI Decipher capabilities
• MR task is a difficult acoustic environment:
  – Low number of speakers/speech material
  – Dominance of certain speakers (more details?)
• FAE task also approached

Introduction (VI) Baseline WERs

• Using SRI 2003 system, WER:

                  All    American  German  Spanish  British
New SI partition  40.3%  34.1%     52.3%   104.2%   95.6%
New SD partition  41.4%  33.0%     51.6%   88.2%    65.0%
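A WER above 100% (as for Spanish in the SI partition) is possible because insertions count as errors while the denominator is the reference length only. The sketch below uses the usual edit-distance definition of WER; the two sentences are made-up examples, not corpus data.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dist[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        dist[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # match or substitution
    return dist[len(ref)][len(hyp)] / len(ref)


# Hypothetical example: 4 reference words, 6 wrong hypothesis words.
ref = "the meeting starts now"
hyp = "a media thing start is not"
print(f"WER = {100 * word_error_rate(ref, hyp):.1f}%")  # WER = 150.0%
```

Four substitutions plus two insertions against four reference words give 6/4, so heavily accented speech that triggers many insertions can push WER past 100%.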

Acoustic adaptation (I)

• Initial studies with the old partitioning show that global task adaptation through MAP is the best approach:
  – Accent-dependent MAP adaptation also promising
• Initial attempt to do full retraining using 16 kHz speech (also 8 kHz speech as reference):
  – Very bad results (more details?)
    • Worse than baseline!!
  – Too few speakers in the training set given the task partition (speaker independent)

Acoustic Adaptation (II) Previous work

• Interest in language learning tools (CALL)
• Standard acoustic adaptation techniques
  – MAP/MLLR using L1 or L2 speech data
  – Model interpolation
  – Clustering
  – Sufficient for high proficiency speakers
• Pronunciation modeling:
  – Little (if any) success reported

Acoustic Adaptation (IV) Objectives

• Strategies for SI task (combined improvement?):
  – Task MAP adaptation (TaskMAP)
  – Accent dependent MAP (AccMAP)
  – TaskMAP followed by AccMAP/MLLR
• Strategies for SD task:
  – Task MAP adaptation (TaskMAP) (includes speaker adaptation)
  – Per speaker MAP adaptation (SpkMAP)

Acoustic adaptation (V)

• Strategies for acoustic adaptation:
  – Adaptation weights tuned per accent (heldout)
  – Final phoneloop stage

[Diagram: adaptation paths starting from the SWB models.
 TaskMAP: SWB models → MAP (task adaptation) on the full DB → Global MAP models.
 AccMAP: SWB models → MAP/MLLR on the Am / Ge / Sp DB → Am / Ge / Sp MAP models.
 Task+AccMAP: Global MAP models → MAP/MLLR on the Am / Ge / Sp DB → Am / Ge / Sp MAP models.]
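For reference, the textbook MAP update of a Gaussian mean is sketched below. It is the standard formulation, not a description of SRI Decipher internals; in particular, reading the tuned "adaptation weight" as the prior weight τ is an assumption.

```latex
% Textbook MAP re-estimation of a Gaussian mean (assumed notation):
%   \mu_0      prior mean (e.g. from the SWB models)
%   x_t        adaptation frame at time t
%   \gamma_t   occupation probability of this Gaussian at time t
%   \tau       prior weight (assumed to correspond to the adaptation weight)
\hat{\mu} = \frac{\tau\,\mu_0 + \sum_{t} \gamma_t\, x_t}{\tau + \sum_{t} \gamma_t}
```

With little adaptation data the sum of the γ_t is small and the estimate stays near the prior mean; with more data it moves toward the data mean, which is why the weight is worth tuning per accent.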

Acoustic Adaptation (VII) MR Speaker Independent Task

• SI adaptation summary, WER:

                     American  German  Spanish  British
Baseline SI          34.1%     52.3%   104.2%   95.6%
Best adapted system  30.4%     42.3%   93.2%    87.9%

[Bar chart, WER per accent. Series: Baseline SI, TaskMAP-optimal, AccMAP-optimal, TaskMAP + AccMAP-optimal, + phoneloop pass2.]

Acoustic Adaptation (VIII) MR Speaker Independent Task

• SRI 5xRT system:
  – Using new dictionary and interpolated LMs
  – Using best MAP-adapted models for mel features
  – Still some bugs in the process (more details?)

                         American   German    Spanish    British
Best single system       30.4%      42.3%     93.2%      87.9%
SRI 5xRT system adapted  33.6%      44.9%     86.6%      79.7%
SRI 5xRT system STD      31.0%      44.1%     93.4%      78.7%
err (%rel)               [ 10.5% ]  [ 6.1% ]  [ -7.1% ]  [ -9.3% ]
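The err (%rel) row is consistent with the relative WER change of the adapted 5xRT system with respect to the best single system. A minimal check, using only the figures from the table (the function and variable names are illustrative):

```python
def relative_change(new, reference):
    """Relative change in percent; negative means the new WER is lower."""
    return 100.0 * (new - reference) / reference


best_single = {"American": 30.4, "German": 42.3, "Spanish": 93.2, "British": 87.9}
adapted_5xrt = {"American": 33.6, "German": 44.9, "Spanish": 86.6, "British": 79.7}

rel = {accent: round(relative_change(adapted_5xrt[accent], best_single[accent]), 1)
       for accent in best_single}
print(rel)
# {'American': 10.5, 'German': 6.1, 'Spanish': -7.1, 'British': -9.3}
```

The sign convention matters: the 5xRT system is worse than the best single system for American and German, and better for Spanish and British.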

Acoustic Adaptation (X) MR Speaker Dependent Task

• SD adaptation summary, WER:

                     American  German  Spanish  British
Baseline SD          33.0%     51.6%   88.2%    65.0%
Best adapted system  29.5%     37.1%   54.3%    60.4%

[Bar chart, WER per accent. Series: Baseline SD, TaskMAP-optimal, SpkMAP-optimal, + phoneloop pass2.]

Accent Identification (I)

• Background:
  – Techniques similar to Language Identification
  – GMM based:
    • Broad collection of features
    • GMM tokenizers
  – Broad phonetic classes + HMMs
  – LM/AM score comparison
  – Based on phonotactic characteristics:
    • PPRLM, PRLM
  – More complex than LID
  – Hard to compare rates: No previous work in MR/FAE

Accent Identification (II) Objectives

• Strategy: Use SRI Decipher characteristics
  – Practical approach: Reasonable run times
  – GMM classification module (for gender detection)
    • Evaluate standard features and normalizations
  – Hypothesis driven, phone recognition:
    • CD/CI models
    • Recognition using flat Phone LM or flat LM
    • View as a text classification problem
  – Phone LM driven:
    • PRLM/PPRLM
  – Combination using NNs
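The GMM classification idea can be sketched as follows: one diagonal-covariance GMM per accent, with the utterance assigned to the accent whose model gives the highest average per-frame log-likelihood. This is a generic illustration, not the Decipher module; the two-dimensional features and all model parameters are invented.

```python
import math


def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at frame x."""
    return -0.5 * sum(math.log(2 * math.pi * v) + (xi - m) ** 2 / v
                      for xi, m, v in zip(x, mean, var))


def gmm_frame_loglik(x, gmm):
    """Log-likelihood of one frame under a GMM given as [(weight, mean, var), ...]."""
    comps = [math.log(w) + log_gauss_diag(x, m, v) for w, m, v in gmm]
    top = max(comps)  # log-sum-exp for numerical stability
    return top + math.log(sum(math.exp(c - top) for c in comps))


def classify(frames, models):
    """Return the label with the highest average per-frame log-likelihood."""
    scores = {label: sum(gmm_frame_loglik(x, gmm) for x in frames) / len(frames)
              for label, gmm in models.items()}
    return max(scores, key=scores.get), scores


# Hypothetical 2-component models over 2-dimensional features.
models = {
    "american": [(0.5, [0.0, 0.0], [1.0, 1.0]), (0.5, [2.0, 0.0], [1.0, 1.0])],
    "german":   [(0.5, [5.0, 5.0], [1.0, 1.0]), (0.5, [7.0, 5.0], [1.0, 1.0])],
}
frames = [[4.8, 5.1], [6.9, 4.7]]
label, scores = classify(frames, models)
print(label)  # german
```

Averaging over frames keeps scores comparable across utterances of different length; longer utterances simply give a more reliable average.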

Accent Identification (III) MR data: GMM classification approach

• GMM results for MR corpus:
  – Unbalanced data: tested over- and undersampling
  – Use different features & normalization:
    • No significant differences (except when using voicing features):
      – Lack of data
      – ~Uniform channels

                     American  German  Spanish  British  ID rate  Naive rate  err (%rel)
Count                23855     5814    2335     288
fc downsampling 256  96.1%     65.9%   0.0%     0.0%     82.9%    73.9%       -34.5%
fc 2048              96.1%     71.6%   0.0%     0.0%     83.9%    73.9%       -38.2%
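The summary columns of the table can be reproduced from the per-accent counts and rates, assuming that the naive rate is the accuracy of always answering the majority class and that err (%rel) is the relative change in error with respect to that naive classifier. The sketch below checks the "fc downsampling 256" row; function names are illustrative.

```python
def id_rate(counts, per_class_rates):
    """Overall accuracy (%) from class counts and per-class accuracies."""
    correct = sum(n * r for n, r in zip(counts, per_class_rates))
    return 100.0 * correct / sum(counts)


def naive_rate(counts):
    """Accuracy (%) of always predicting the most frequent class."""
    return 100.0 * max(counts) / sum(counts)


def relative_error_change(rate, reference_rate):
    """Relative change (%) of the error (100 - rate) versus a reference rate."""
    err, ref_err = 100.0 - rate, 100.0 - reference_rate
    return 100.0 * (err - ref_err) / ref_err


counts = [23855, 5814, 2335, 288]    # American, German, Spanish, British
rates_256 = [0.961, 0.659, 0.0, 0.0]

overall = round(id_rate(counts, rates_256), 1)
naive = round(naive_rate(counts), 1)
rel = round(relative_error_change(overall, naive), 1)
print(overall, naive, rel)  # 82.9 73.9 -34.5
```

The 0.0% rates for Spanish and British show why the naive rate is the right yardstick here: with such unbalanced data a high overall ID rate can hide classes that are never recognized.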

Accent Identification (IV) MR data: GMM classification approach

• GMM results for MR corpus:
  – As a function of utterance length, task AM-GE-SP-BR

[Plot: AI rate (60% to 100%) versus utterance length in seconds (log scale, 1 to 100). Curves: Chance, 256.]

Accent Identification (V) MR data: Hypothesis driven approach

• Text classification view using MR data:
  – Input from phone recognition:
    • From free phone recognition (CD/CI models, full/flat PLM)
  – Rainbow: CMU tool for text classification
    • Naive Bayes classification technique
    • N-grams (1..6)
    • No further restrictions (feature selection, stop list, etc.)
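The text classification view can be illustrated with a small multinomial Naive Bayes classifier over phone n-grams. This is a generic sketch, not Rainbow itself; the phone strings, the add-one smoothing and the choice of bigrams plus trigrams are assumptions made for the example.

```python
import math
from collections import Counter


def ngrams(phones, orders=(2, 3)):
    """All n-grams of the given orders, as tuples of phone labels."""
    return [tuple(phones[i:i + n])
            for n in orders for i in range(len(phones) - n + 1)]


def train(labelled):
    """labelled: list of (accent, phone list). Returns n-gram counts, priors, vocab."""
    counts, docs = {}, Counter()
    for label, phones in labelled:
        counts.setdefault(label, Counter()).update(ngrams(phones))
        docs[label] += 1
    vocab = {g for c in counts.values() for g in c}
    return counts, docs, vocab


def classify(phones, model):
    """Pick the accent with the highest Naive Bayes log-posterior."""
    counts, docs, vocab = model
    total_docs = sum(docs.values())
    scores = {}
    for label, c in counts.items():
        denom = sum(c.values()) + len(vocab)  # add-one smoothing
        score = math.log(docs[label] / total_docs)
        for g in ngrams(phones):
            if g in vocab:  # n-grams never seen in training are ignored
                score += math.log((c[g] + 1) / denom)
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical phone recognizer output per accent.
training = [
    ("american", "dh ah k ae t".split()),
    ("american", "dh ah d ao g".split()),
    ("german", "z ax k eh t".split()),
    ("german", "z ax d ao k".split()),
]
model = train(training)
print(classify("z ax b eh t".split(), model))  # german
```

Short inputs contain few informative n-grams, which matches the observation that chunk-level decisions are much more reliable than utterance-level ones.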

Accent Identification (VI) MR data: Hypothesis driven approach

• Text classification view using MR data:
  – Best results using CI models + flat PLM (bigrams & trigrams)
  – Chunk based classification rates (simulation):

Chunks            ID rate  Naive rate  err (%rel)
Full task         84.6%    65.8%       -55.0%
american nonnat   83.8%    68.0%       -49.3%
american german   95.9%    81.8%       -77.4%
american spanish  92.8%    90.2%       -26.6%
american british  -        98.1%
am ger bri spa    89.3%    75.0%       -57.2%
am ger spa        90.8%    76.1%       -61.3%

Accent Identification (VII) MR data: Hypothesis driven approach

• Text classification view using MR data:
  – Utterance based classification rates (simulation):
  – Need longer sequences!!

Utterances        ID rate  Chunk rate  Naive rate  err (%rel)
Full task         64.88%   84.6%       66.8%       5.7%
american nonnat   65.21%   83.8%       69.1%       12.6%
american german   80.16%   95.9%       92.1%       152.4%
american spanish  91.79%   92.8%       92.1%       4.5%
american british  80.42%   -           97.0%       544.1%
am ger bri spa    73.19%   89.3%       74.8%       6.3%
am ger spa        74.94%   90.8%       76.6%       7.0%

Accent Identification (VIII) MR data: Hypothesis driven approach

• Text classification view using MR data:
  – Real partition classification rates:
  – Worse-than-chance rates if utterance based (length-dependent AI task still pending)

Chunks, real partition  ID rate  ID simul  Naive rate  err (%rel)
Full task
american nonnat         75.46%   83.8%     71.2%       -14.9%
american german         75.16%   95.9%     73.9%       -5.0%
american spanish        95.24%   92.8%     92.1%       -40.0%
american british        91.45%   -         92.2%       10.2%
am ger bri spa          67.68%   89.3%     64.0%       -10.2%
am ger spa              73.01%   90.8%     69.3%       -12.0%

Accent Identification (IX) Phone LM approach

• PRLM: Phone recognition & LM

[Diagram: Speech → phone recognizer (AM) → LM scoring against PLM accent 1 ... PLM accent N → score comparison → decision.]
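A minimal sketch of the PRLM decision stage: the phone string produced by a single recognizer is scored against one phone LM per accent, and the accent with the best length-normalized log-probability wins. The bigram order, add-one smoothing and toy phone strings are assumptions for illustration; the slides report trigrams as the best order in the actual experiments.

```python
import math
from collections import Counter


class BigramPhoneLM:
    """Add-one smoothed bigram LM over phone labels (illustrative only)."""

    def __init__(self, sequences, phone_set):
        self.vocab_size = len(phone_set) + 1  # +1 for the end symbol </s>
        self.bigrams, self.context = Counter(), Counter()
        for seq in sequences:
            padded = ["<s>"] + seq + ["</s>"]
            for prev, cur in zip(padded, padded[1:]):
                self.bigrams[(prev, cur)] += 1
                self.context[prev] += 1

    def logprob(self, seq):
        """Average log-probability per predicted symbol."""
        padded = ["<s>"] + seq + ["</s>"]
        total = sum(math.log((self.bigrams[(p, c)] + 1) /
                             (self.context[p] + self.vocab_size))
                    for p, c in zip(padded, padded[1:]))
        return total / (len(padded) - 1)


def identify(seq, lms):
    """Score the phone string with every accent LM and pick the best."""
    scores = {accent: lm.logprob(seq) for accent, lm in lms.items()}
    return max(scores, key=scores.get), scores


# Hypothetical training phone strings per accent.
data = {
    "american": ["dh ah k ae t".split(), "dh ah d ao g".split()],
    "german": ["z ax k eh t".split(), "z ax d ao k".split()],
}
phone_set = {p for seqs in data.values() for s in seqs for p in s}
lms = {accent: BigramPhoneLM(seqs, phone_set) for accent, seqs in data.items()}
accent, scores = identify("z ax k ao k".split(), lms)
print(accent)  # german
```

Raw LM scores from different accent models are not necessarily on a comparable scale, so some form of score normalization may be needed before the comparison step.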

Accent Identification (X) Phone LM approach

• PRLM: Phone recognition & LM
  – Tested different AMs for phonetic string generation:
    • Std forced
    • Std SWB
    • MAP adapted per accent
    • Best is Std SWB
  – Tested 1-6gram:
    • Best is trigram
  – But very poor results

Accent Identification (XI) MR data: Phone LM approach

• PRLM: Phone recognition & LM:
  – As a function of utterance length, task AM-GE-SP-BR: Very bad results

[Plot: AI rate (40% to 110%) versus utterance length in seconds (0 to 100). Curves: Chance, StdAM-trigram.]

Accent Identification (XII) Phone LM approach

• PPRLM: Parallel Phone recognition & LM

[Diagram: Speech → parallel phone recognizers (Models A ... Models Z) → output of each recognizer scored with the LMs for accent a ... accent z → scores averaged per accent (Avg accent a ... Avg accent z) → score comparison → decision.]
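The PPRLM combination step in the diagram can be sketched as an average of the per-accent LM scores over the parallel recognizers, followed by a comparison. The plain arithmetic mean and the score values below are assumptions; PPRLM itself is listed as still to be tested.

```python
def pprlm_decide(scores):
    """scores[recognizer][accent] = LM log-score of that recognizer's phone string.

    Returns the accent with the highest score averaged over recognizers,
    together with the averaged scores.
    """
    accents = list(next(iter(scores.values())))
    averaged = {a: sum(per_rec[a] for per_rec in scores.values()) / len(scores)
                for a in accents}
    return max(averaged, key=averaged.get), averaged


# Hypothetical log-scores from two parallel recognizers.
scores = {
    "models_A": {"american": -2.1, "german": -2.6},
    "models_Z": {"american": -3.0, "german": -2.0},
}
decision, averaged = pprlm_decide(scores)
print(decision)  # german
```

In this example the two recognizers disagree when taken alone, and the averaged score settles the decision.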

Accent Identification (XIII) FAE database

• Experiments with the FAE database:
  – 4500 speakers: More acoustic context
  – 20 seconds per speaker
  – Proficiency is labeled
• Strategy:
  – Apply standard techniques
  – Possibly:
    • Use FAE-generated models in MR data

Accent Identification (XIV) FAE database: GMM classification

• GMM:
  – Gender independent classification (16-2048)
  – FAE results in GE-SP task:
  – Norm better than CMN. CMN better than plain features
  – GD models still to be tested

          No norm        CMN            Norm
GMM size  fc     fasf    fc     fasf    fc     fasf   fasf+ffvf  Naive rate
128       54.4%  59.2%   59.2%  63.2%   53.6%  64.8%  61.6%      51.6%
256       59.2%  59.2%   59.2%  65.6%   58.4%  72.0%  63.2%      51.6%
512       51.2%  60.0%   60.8%  60.0%   60.0%  68.0%  65.6%      51.6%
1024      52.8%  64.8%   54.4%  63.2%   56.0%  64.4%  61.6%      51.6%
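For readers unfamiliar with the normalizations being compared, the sketch below shows per-utterance cepstral mean normalization (CMN) and a mean-and-variance normalization. The slides do not define "Norm"; treating it as mean-and-variance normalization is an assumption, and the frames are made-up values.

```python
def cmn(frames):
    """Cepstral mean normalization: subtract the per-utterance mean of each dimension."""
    dims = len(frames[0])
    means = [sum(f[d] for f in frames) / len(frames) for d in range(dims)]
    return [[f[d] - means[d] for d in range(dims)] for f in frames]


def mean_variance_norm(frames, floor=1e-8):
    """CMN followed by scaling each dimension to unit variance.

    Assumption: this is one plausible reading of 'Norm'; the slides do not say.
    """
    centered = cmn(frames)
    dims, n = len(frames[0]), len(frames)
    stds = [max((sum(f[d] ** 2 for f in centered) / n) ** 0.5, floor)
            for d in range(dims)]
    return [[f[d] / stds[d] for d in range(dims)] for f in centered]


# Hypothetical 2-dimensional cepstral frames from one utterance.
frames = [[1.0, 10.0], [3.0, 14.0]]
print(cmn(frames))                 # [[-1.0, -2.0], [1.0, 2.0]]
print(mean_variance_norm(frames))  # [[-1.0, -1.0], [1.0, 1.0]]
```

Both operations remove per-utterance offsets (for example channel effects), which is relevant when models trained on FAE telephone speech are applied to MR data.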

Accent Identification (XV) FAE database: GMM classification

• GMM:
  – Combining FAE models with MR data:
    • Using frame_cepstrum + CMN (GMM 256)
  – Combination is possible, but more experiments are needed!!

GE-SP task        German  Spanish  ID rate  Naive rate  err (%rel)
MR models         81.1%   40.0%    72.3%    66.0%       -18.6%
FAE models (cmn)  100.0%  21.9%    73.4%    66.0%       -21.8%

Accent Identification (XVI) FAE database: hypothesis driven

• Text classification view:
  – FAE results:
  – Better than chance but, still, far from useful
  – FAE models still to be tested on MR data

             ID rate  Naive rate  err (%rel)
GE-SP        58.9%    51.6%       -15.0%
FR-GE-IT-SP  36.2%    28.9%       -10.3%
ALL accents  13.2%    9.4%        -4.3%
13 accents   17.7%    12.2%       -6.2%

Accent Identification (XVII) FAE database: Phone LM approach

• PRLM/PPRLM:
  – Pending
• GMM better than text based classification. GE-SP task, for example:
  – GMM: 72.0%
  – Text-based: 58.9%
• Results as a function of speech length to be evaluated

Conclusions

• Acoustic adaptation is important for handling non-native accents:
  – MAP adaptation provided best results:
    • Task adaptation + accent adaptation
  – Work on tuning adaptation weights for SD & SI task (magnitude differences)
  – Low proficiency speakers need additional improvements
    • Non-native speech recognition may not be solvable!

Conclusions

• Accent identification:
  – Proved to be more difficult than LID
  – Different techniques applied:
    • GMM techniques and text classification techniques showed promising results
    • Standard PRLM strategy didn’t work as expected (score normalization needed?)
  – PPRLM to be tested
  – Integration to be tested

Future work

• Finish current experimentation:
  – Accent identification:
    • Test features and normalizations in GMM and phone LM based approaches
    • Test acoustic score ratios
    • Test LM scores
    • Test NN based combination
  – NonNat speech characterization:
    • Errors phone/word
    • Model ‘usage’ distributions

Future work

• Pronunciation modeling:
  – Evaluation of pronunciation variants found in the SRI SWB dictionary for NonNat speech
  – Rule based:
    • Rules in German (from Silke Goronzy’s work)
    • Rules in Spanish
    • ‘Speaking mode’ probability estimation (accent + …)
• Use of new databases (FAE, TED, Fisher)

Future work

• A note on work on pronunciation modeling in the MR task:
  – The MR corpus is not suitable for data-driven pronunciation modeling:
    • High error rates for non-native speakers & limited number of them
    • Rule based methods are to be tested first
  – Initial work on evaluating current pronunciation alternatives is needed
  – Relevant rules for initial testing in German and Spanish are already available

Thank you!!

• To ICSI and the ICSI Speech Group, with special thanks to:
  – Morgan
  – Andreas
  – Qifeng, Barry, Adam, Yang, Yan, Dave, Jeremy, …
  – Sven & all international visitors
  – The FrameNet people (Miriam, Michael & Co.)
  – Staff, especially Lila, María Eugenia and Diane

MR Partitioning

• Speaker independent (SI subtask)

Accent (total speech)          Male      Female    Train     Test
American (51:05:47)  speech    36:07:02  14:58:45  33:12:06  17:53:41
                     speakers  15        8         9m + 5f   6m + 3f
German (11:13:39)    speech    11:06:43  0:06:56   7:12:09   4:01:30
                     speakers  10        2         6m + 1f   4m + 1f
Spanish (4:18:27)    speech    3:05:47   1:12:39   2:46:57   1:31:29
                     speakers  4         1         2m + 1f   2m + 0f
British (1:03:45)    speech    1:03:45   0:00:00   0:54:53   0:08:51
                     speakers  2         0         1m + 0f   1m + 0f

Full retraining

• Initial attempt to do full retraining using 16 kHz speech:
  – With old partitioning
  – Too few speakers in the training set given the task partition (speaker independent)

                                All    American  NonNat
SWB models                      44.1%  34.5%     82.6%
Retrain16K+SWBwordnets          53.0%  46.2%     80.1%
Retrain16K+SWBwordnets+newHLDA  51.7%  44.6%     79.8%
Retrain8K+SWBwordnets           46.8%  40.5%     72.1%
Retrain8K+SWBwordnets+newHLDA

Speaker dominance

• A few speakers account for most of the speech material:

spkID  #length
me013  13:53
fe008  6:32
me011  5:37
mn015  4:55
me018  4:19
mn007  4:16
fe016  3:55
me010  3:47
mn017  3:01
total  50:15
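The total can be checked by summing the per-speaker lengths, assuming they are given as hours:minutes (the unit is not stated on the slide):

```python
def to_minutes(hhmm):
    """Convert 'h:mm' to minutes (assumes the slide uses hours:minutes)."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


def to_hhmm(minutes):
    """Convert minutes back to 'h:mm'."""
    return f"{minutes // 60}:{minutes % 60:02d}"


lengths = {
    "me013": "13:53", "fe008": "6:32", "me011": "5:37",
    "mn015": "4:55", "me018": "4:19", "mn007": "4:16",
    "fe016": "3:55", "me010": "3:47", "mn017": "3:01",
}
total = sum(to_minutes(v) for v in lengths.values())
print(to_hhmm(total))  # 50:15
```

Under the same assumption, speaker me013 alone contributes more than a quarter of the material in this list.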

Acoustic adaptation (IV)

• SI TaskMAP adaptation, WER:
  – Optimal MAP weight ~proportional to size of accented speech subset
  – Bigger improvements in non-native accents
  – Bigger improvements for bigger data size

                 All        NonNat      American   German      Spanish    British
Baseline SI      40.3%      64.5%       34.1%      52.3%       104.2%     95.6%
TaskMAP-optimal  37.8%      57.9%       32.5%      46.4%       95.2%      88.5%
err (%rel)       [ -6.2% ]  [ -10.2% ]  [ -4.7% ]  [ -11.3% ]  [ -8.6% ]  [ -7.4% ]

• SI AccMAP adaptation, WER:
  – Trends similar to TaskMAP, but no further improvements, except for German, which benefits from task data!

                 American   German      Spanish    British
Baseline SI      34.1%      52.3%       104.2%     95.6%
TaskMAP-optimal  32.5%      46.4%       95.2%      88.5%
err (%rel)       [ -4.7% ]  [ -11.3% ]  [ -8.6% ]  [ -7.4% ]
AccMAP-optimal   32.5%      46.0%       96.8%      91.9%
err (%rel)       [ -4.9% ]  [ -13.6% ]  [ -7.8% ]  [ -4.2% ]
Optimal Weight   5          40          40         50

• SI TaskMAP+AccMAP adaptation, WER:
  – Small improvements over TaskMAP
  – Also tested MLLR instead (TaskMAP+AccMLLR), but no improvements

                  American   German      Spanish    British
Baseline SI       34.1%      52.3%       104.2%     95.6%
TaskMAP-optimal   32.5%      46.4%       95.2%      88.5%
err (%rel)        [ -4.7% ]  [ -11.3% ]  [ -8.6% ]  [ -7.4% ]
+ AccMAP-optimal  32.5%      45.8%       95.0%      88.6%
err (%rel)        [ 0.0% ]   [ -1.3% ]   [ -0.2% ]  [ 0.1% ]
Optimal Weight    5          20          30         40

Gender ID issues

• Gender identification:
  – Per chunk gender ID:
  – Per utterance gender ID:

             #chunks      Male    Female  ID rate
True male    350          100.0%  0.0%
True female  97           8.3%    91.8%   98.2%

             #utterances  Male    Female  ID rate
True male    68304        100.0%  0.0%
True female  19563        2.9%    97.1%   99.4%

             #utterances  Male    Female  ID rate
True male    68304        88.5%   11.5%
True female  19563        22.6%   77.4%   86.0%
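The overall ID rates follow from the per-gender counts and the per-gender correct rates. A small check with the figures from the three tables (the function name is illustrative):

```python
def overall_id_rate(rows):
    """rows: list of (count, fraction correctly identified). Returns accuracy in %."""
    total = sum(count for count, _ in rows)
    correct = sum(count * rate for count, rate in rows)
    return 100.0 * correct / total


tables = [
    [(350, 1.000), (97, 0.918)],       # chunks
    [(68304, 1.000), (19563, 0.971)],  # utterances, first table
    [(68304, 0.885), (19563, 0.774)],  # utterances, second table
]
rates = [round(overall_id_rate(t), 1) for t in tables]
print(rates)  # [98.2, 99.4, 86.0]
```

Because male material dominates, the overall rate stays high even when the female rate drops, so the per-gender rows are the more informative figures.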

Acoustic Adaptation SRI 5xRT System results

Accent    System      002-mel  002-mel-expanded  005-plp  005-plp-rescored  rover
British   STD         91.5     89.3              78.2     80.6              78.7
          Adapted     87.9     84.3              78.8     79.7              79.7
          BestSimple  87.9
Spanish   STD         97.5     94.2              95.9     95.1              93.4
          Adapted     88.0     85.1              88.4     88.2              86.6
          BestSimple  93.2
German    STD         50.7     47.6              44.7     44.5              44.1
          Adapted     45.8     45.8              44.4     44.5              44.9
          BestSimple  42.3
American  STD         36.3     33.2              31.2     30.8              31.0
          Adapted     33.6     34.3              33.3     33.1              33.6
          BestSimple  30.4
