discriminave phonotaccs for dialect recognion using

Discrimina)vePhonotac)csforDialectRecogni)onUsingContext‐Dependent

PhoneClassifiers

FadiBiadsy*,HagenSoltau+,LidiaMangu+,JiriNavra6l+,JuliaHirschberg*

*ColumbiaUniversity,NY,USA+IBMT.J.WatsonResearchCenter,NY,USA

July1st,2010

1

DialectRecogni)on

 Similartolanguagerecogni)on,butusedialects/accentsofthesamelanguage

 Dialectsmaydifferinanydimensionofthelinguis)cspectrum Differencesarelikelytobemoresubtleacrossdialectsthanthoseacrosslanguages

 Thus,morechallengingproblemthanlanguagerecogni)on

2

Mo)va)on:WhyStudyDialectRecogni)on?

  Discoverdifferencesbetweendialects

  ToimproveAutoma)cSpeechRecogni)on(ASR)  Modeladapta)on:Pronuncia)on,Acous)c,Morphological,Languagemodels

  Toinferspeaker’sregionaloriginfor  Forensicspeakerprofiling

  Speechtospeechtransla)on

  Annota)onsforBroadcastNewsMonitoring

  Spokendialoguesystems–adaptTTSsystems

  Charisma)cspeechiden)fica)on

3

Mul)plecuesthatmaydis)nguishdialects:

  Phone)ccues:  Differencesinphonemicinventory

  Phonemicdifferences

  Allophonicdifferences(context‐dependentphones)

  Phonotac)cs

4

(Al‐Tamimi&Ferragne,2005)

Example:/r/ApproximantinAmericanEnglish[ɹ]–modifiesprecedingvowelsTrilledinSco8shEnglishin[Consonant]–/r/–[Vowel] andinsomeothercontexts

MSA: /s/ /a/ /t/ /u/ /q/ /A/ /b/ /i/ /l/ /u/ /h/ /u/

Egy: /H/ /a/ /t/ /?/ /a/ /b/ /l/ /u/

Lev: /r/ /a/ /H/ /t/ /g/ /A/ /b/ /l/ /u/

DifferencesinMorphology

Differencesinphone)cinventoryandvowelusage

“Shewillmeethim”

Outline

 DialectsandCorpora

 CD‐PhoneRecognizer

 Baselines

 TwoIdeas:  GMM‐UBMwithfMLLR

 Discrimina)vePhonotac)cs

 Results

 ConclusionsandFutureWork

5

CaseStudy:ArabicDialects

(by Arab Atlas)

Corpora

  For testing:   (25% female – mobile, 25% female – landline, 25% male – mobile, 25 % male – landline)

  Egyptian: Training: CallHomeEgyp)an,Tes)ng:CallFriendEgyp)an

Dialect #Speakers Test20%–30s*testcuts

Corpus

Gulf 976 801 (AppenPtyLtd,2006a)

Iraqi 478 477 (AppenPtyLtd,2006b)

Levan)ne 985 818 (AppenPtyLtd,2007)

Dialect #TrainingSpeakers #120speakers30s*cuts

Corpora

Egyp)an 280 1912 (CanavanandZipperlen,1996)(Canavanetal.,1997)

7 *Exactly 30s

Outline

 Mo)va)on

 Corpora


 Baselines



 Results


8

Context‐Dependent(CD)PhoneRecognizer

 HMM‐triphone‐basedphonerecognizerusingIBM’sAjlasystem  Trainedon50hoursofGALEbroadcastnewsandconversa)ons

 230CD‐acousHcmodelsand20,000Gaussians

 Front‐End:  13DPLPfeaturesperframe

 Eachframeissplicedtogetherwithfourprecedingandfour

succeedingframesfollowedbyLDA40D

  CMVN

 SpeakerAdapta)on:  fMLLRfollowedbyMLLR

 UnigramphonelanguagemodeltrainedonMSA9

Outline

 Mo)va)on

 Corpora


 Baselines



 Results


10

Baselines

 StandardPRLM:atrigramphonotac)cmodelperdialect

 StandardGMM‐UBM: Front‐End:Sameasthefrontendofthephonerecognizer

 2048Gaussians–MLtrainedonequalnumberofframesfromeachdialect

 DialectModelsareMAPadaptedwith5itera)ons‐‐similarsejngsofthebaselinein(Torres‐Carrasquilloetal.,2008)

11

Results(DETcurvesofPRLMandGMM‐UBM)–30sCuts

Approach EER(%)

PRLM 17.7

GMM‐UBM 15.3*

12 *ComparabletoGMM‐UBMof(Torres‐Carrasquilloetal.,2008)on3dialects

Outline

 Mo)va)on

 Corpora


 Baselines



 Results


13

OurGMM‐UBMImprovedwithfMLLR

  Mo)va)on:Featurenormaliza)on(CMVNandVTLN)improveGMM‐UBMforlanguageanddialectrecogni)on  (e.g.,WongandSridharan,2002;Torres‐Carrasquilloetal.,2008)

  Ourapproach:FeaturespaceMaximumLikelihoodLinearRegression(fMLLR)adapta)on

  UseaCD‐phonerecognizertoobtainCD‐phonesequence:transformthefeatures“towards”thecorrespondingacous)cmodelGMMs(amatrixforeachspeaker)

  SameasGMM‐UBMapproach,butusetransformedacous)cvectorsinstead

14

fMLLR

[Vowel]‐/r/‐[Consonant]

Results–GMM‐UBM‐fMLLR–30sCuts

Approach EER(%)

PRLM 17.7

GMM‐UBM 15.3

GMM‐UBM‐fMLLR 11.0%

15

Outline

 Mo)va)on

 Corpora


 Baselines



 Results


16

Discrimina)vePhonotac)cs

  Hypothesis:Dialectsdifferintheirallophones(context‐dependentphones)andtheirphonotac)cs

  Idea:Discriminatedialectsfirstatthelevelofcontext‐dependent(CD)phonesandthenphonotac)cs

I.  Obtain CD-phones II.  Extract acoustic features for each CD-phone III.  Discriminate CD-phones across dialects IV.  Augment the CD-phone sequences and extract phonotactic features V.  Train a discriminative classifier to distinguish dialects

17

/r/ isApproximantinAmericanEnglish[ɹ]andtrilledinScojshin[Consonant] – /r/ – [Vowel]

...

[Back vowel]-r-[Central Vowel]

[Plosive]-A-[Voiced Consonant]

[Central Vowel]-b-[High Vowel]

...

...

Run our CD-phone recognizer

ObtainingCD‐Phones

CD-phone sequence

* not just /r/ /A/ /b/

Do the above for all training data of all dialects

18

CD‐PhoneUniversalBackgroundAcous)cModel

e.g., [Back vowel]-r-[Central Vowel]

EachCDphonetypehasanacous)cmodel:

19

ObtainingCD‐Phones+FrameAlignment

20

Acoustic frames: Front-End

Acoustic frames for second state

CD-Acoustic Models:

CD-Phones: (e.g.) [vowel]-b-[glide] [front-vowel]-r-[sonorant]

CD-Phone Recognizer

[Back Vowel]-r-[Central Vowel]

MAP MAP

MAP

MAPadapttheCD‐phoneacous)cmodelGMMstothecorrespondingframes(r=0.1)

MAPAdapta)onofeachCD‐PhoneInstance

21

MAPadapttheCD‐phoneacous)cmodelGMMstothecorrespondingframes*

OneSuperVectorforeachCDphoneinstance:

StackalltheGaussianmeansandphoneduraHonV k =[µ1, µ2, …., µN, duration]

i.e., summarize the acoustic-phonetic features of each CD-phone in one vector


MAPAdapta)onofeachCD‐PhoneInstance

22 *Similarto(Campbelletal.,2006)butatthelevelofCD‐phone

Super vectors of CD-phone instances of all training speakers in dialect 1

Super vectors of CD phone instances of all training speakers in dialect 2


dialect1

dialect2

SVMClassifierforeachCD‐PhoneTypeforeachPairofDialects

23

Discrimina)vePhonotac)cs–CD‐PhoneClassifica)on

24

MAP Adapted Acoustic Models:

MAP Adapt GMMs

Super Vectors



CD-Acoustic Models:


CD-Phone Recognizer

Super Vectors: Super Vector 1 Super Vector N

Dialects: (e.g.) SVM Classifiers Egy Egy

CD‐PhoneClassifierResults

  Splitthetrainingdataintotwohalves

  Train227(oneforeachCD‐phonetype)binaryclassifiersforeachpairofdialectson1sthalfandteston2nd

25 *performedsignificantlybeqerthanchance(50%)

Extrac)onofLinguis)cKnowledge

  Usetheresultsoftheseclassifierstoshowwhichphonesinwhatcontextsdis)nguishdialectsthemost(chanceis50%)

26

Levan)ne/IraqiDialects

LabelingPhoneSequenceswithDialectHypotheses

...

[Back vowel]-r-[Central Vowel]

[Plosive]-A-[Voiced Consonant]

[Central Vowel]-b-[High Vowel]

...

...

Run corresponding SVM classifier to get the dialect of each CD phone

27

...

[Back vowel]-r-[Central Vowel] Egyptian

[Plosive]-A-[Voiced Consonant] Egyptian

[Central Vowel]-b-[High Vowel] Levantine

...

...

CD‐phonerecognizer

TextualFeatureExtrac)onforDiscrimina)vePhonotac)cs

  Extractthefollowingtextualfeaturesfromeachpairofdialects

  Normalizevectorbyitsnorm

  Trainalogis)cregressionwithL2regularizer

28

Experiments–TrainingTwoModels

  Splittrainingdataintotwohalves

  TrainSVMCD‐phoneclassifiersusingthefirsthalf

  RuntheseSVMclassifierstoannotatetheCDphonesofthe2ndhalf

  Trainthelogis)cclassifierontheannotatedsequences

29

Discrimina)vePhonotac)cs–DialectRecogni)on

30

MAP Adapted Acoustic Models:

MAP Adapt GMMs

Super Vectors



CD-Acoustic Models:


CD-Phone Recognizer

Logistic classifier

Egyptian

Super Vectors: Super Vector 1 Super Vector N

Dialects: (e.g.) SVM Classifiers Egy [vowel]-b-[glide] [front-vowel]-r-[sonorant]

Egy

Baselines

 StandardPRLM:atrigramphonotac)cmodelperdialect

 StandardGMM‐UBM: Front‐End:

 13DPLPfeaturesfrom9framesfollowedbyLDA40D

 CMVN

 2048Gaussians–MLtrainedonequalnumberofframesfromeachdialect

 DialectModelsareMAPadaptedwith5itera)ons(similartoTorres‐Carrasquilloetal.,2008)

31

Results–Discrimina)vePhonotac)cs

32

Approach EER(%)

PRLM 17.7

GMM‐UBM 15.3

GMM‐UBM‐fMLLR 11.0%

Disc.PhonotacHcs 6.0%

ResultsperDialect

33

Dialect GMMfMLLR

Disc.Pho.

Egyp)an 4.4% 1.3%

Iraqi 11.1% 6.6%

Levan)ne 12.8% 6.9%

Gulf 15.6% 7.8%

Conclusions

  fMLLRtotransformtheacous)cfeaturessignificantlyimproveresultsforGMM‐UBMapproach  Wes)llneedtodomoreanalyses

  Theproposedmethodhelpsinunderstandingthelinguis)cdifferencesbetweendialects

  Discrimina)vephonotac)csoutperformsGMM‐UBM‐fMLLRin5%absoluteEER.

34

FutureWork

 NewSVMKerneltocomputethesimilarityofallphonesuper‐vectorsacrosstwouqerancesonlyoneSVMclassifierforeachpairofdialects(IS2010;submiqed)

 Testthisapproachonshorteruqerances(3sand10s)

 Trythisapproachondialects/accentsofotherlanguages:

 Englishaccents(AmericanEnglishandIndianEnglish) AmericanEnglishDialects

 ApplyVTLN

 Tes)ngwithNAP(needtomodifytoaccommodateforshortcontextSupervectors)

35

ThankYou!

  Acknowledgments:  JasonPelecanosforusefuldiscussions

36

CaseStudy:ArabicDialects–OurData

  IraqiArabic:Baghdadi,Northern,andSouthern

  GulfArabic:Omani,UAE,andSaudiArabic

  Levan)neArabic:Jordanian,Lebanese,Pales)nian,andSyrianArabic

  Egyp)anArabic:primarilyCaireneArabic

37

discriminave phonotaccs for dialect recognion using

Documents