discriminave phonotaccs for dialect recognion using
TRANSCRIPT
Discrimina)vePhonotac)csforDialectRecogni)onUsingContext‐Dependent
PhoneClassifiers
FadiBiadsy*,HagenSoltau+,LidiaMangu+,JiriNavra6l+,JuliaHirschberg*
*ColumbiaUniversity,NY,USA+IBMT.J.WatsonResearchCenter,NY,USA
July1st,2010
1
DialectRecogni)on
Similartolanguagerecogni)on,butusedialects/accentsofthesamelanguage
Dialectsmaydifferinanydimensionofthelinguis)cspectrum Differencesarelikelytobemoresubtleacrossdialectsthanthoseacrosslanguages
Thus,morechallengingproblemthanlanguagerecogni)on
2
Mo)va)on:WhyStudyDialectRecogni)on?
Discoverdifferencesbetweendialects
ToimproveAutoma)cSpeechRecogni)on(ASR) Modeladapta)on:Pronuncia)on,Acous)c,Morphological,Languagemodels
Toinferspeaker’sregionaloriginfor Forensicspeakerprofiling
Speechtospeechtransla)on
Annota)onsforBroadcastNewsMonitoring
Spokendialoguesystems–adaptTTSsystems
Charisma)cspeechiden)fica)on
3
Mul)plecuesthatmaydis)nguishdialects:
Phone)ccues: Differencesinphonemicinventory
Phonemicdifferences
Allophonicdifferences(context‐dependentphones)
Phonotac)cs
4
(Al‐Tamimi&Ferragne,2005)
Example:/r/ApproximantinAmericanEnglish[ɹ]–modifiesprecedingvowelsTrilledinSco8shEnglishin[Consonant]–/r/–[Vowel] andinsomeothercontexts
MSA: /s/ /a/ /t/ /u/ /q/ /A/ /b/ /i/ /l/ /u/ /h/ /u/
Egy: /H/ /a/ /t/ /?/ /a/ /b/ /l/ /u/
Lev: /r/ /a/ /H/ /t/ /g/ /A/ /b/ /l/ /u/
DifferencesinMorphology
Differencesinphone)cinventoryandvowelusage
“Shewillmeethim”
Outline
DialectsandCorpora
CD‐PhoneRecognizer
Baselines
TwoIdeas: GMM‐UBMwithfMLLR
Discrimina)vePhonotac)cs
Results
ConclusionsandFutureWork
5
CaseStudy:ArabicDialects
(by Arab Atlas)
Corpora
For testing: (25% female – mobile, 25% female – landline, 25% male – mobile, 25 % male – landline)
Egyptian: Training: CallHomeEgyp)an,Tes)ng:CallFriendEgyp)an
Dialect #Speakers Test20%–30s*testcuts
Corpus
Gulf 976 801 (AppenPtyLtd,2006a)
Iraqi 478 477 (AppenPtyLtd,2006b)
Levan)ne 985 818 (AppenPtyLtd,2007)
Dialect #TrainingSpeakers #120speakers30s*cuts
Corpora
Egyp)an 280 1912 (CanavanandZipperlen,1996)(Canavanetal.,1997)
7 *Exactly 30s
Outline
Mo)va)on
Corpora
CD‐PhoneRecognizer
Baselines
TwoIdeas: GMM‐UBMwithfMLLR
Discrimina)vePhonotac)cs
Results
ConclusionsandFutureWork
8
Context‐Dependent(CD)PhoneRecognizer
HMM‐triphone‐basedphonerecognizerusingIBM’sAjlasystem Trainedon50hoursofGALEbroadcastnewsandconversa)ons
230CD‐acousHcmodelsand20,000Gaussians
Front‐End: 13DPLPfeaturesperframe
Eachframeissplicedtogetherwithfourprecedingandfour
succeedingframesfollowedbyLDA40D
CMVN
SpeakerAdapta)on: fMLLRfollowedbyMLLR
UnigramphonelanguagemodeltrainedonMSA9
Outline
Mo)va)on
Corpora
CD‐PhoneRecognizer
Baselines
TwoIdeas: GMM‐UBMwithfMLLR
Discrimina)vePhonotac)cs
Results
ConclusionsandFutureWork
10
Baselines
StandardPRLM:atrigramphonotac)cmodelperdialect
StandardGMM‐UBM: Front‐End:Sameasthefrontendofthephonerecognizer
2048Gaussians–MLtrainedonequalnumberofframesfromeachdialect
DialectModelsareMAPadaptedwith5itera)ons‐‐similarsejngsofthebaselinein(Torres‐Carrasquilloetal.,2008)
11
Results(DETcurvesofPRLMandGMM‐UBM)–30sCuts
Approach EER(%)
PRLM 17.7
GMM‐UBM 15.3*
12 *ComparabletoGMM‐UBMof(Torres‐Carrasquilloetal.,2008)on3dialects
Outline
Mo)va)on
Corpora
CD‐PhoneRecognizer
Baselines
TwoIdeas: GMM‐UBMwithfMLLR
Discrimina)vePhonotac)cs
Results
ConclusionsandFutureWork
13
OurGMM‐UBMImprovedwithfMLLR
Mo)va)on:Featurenormaliza)on(CMVNandVTLN)improveGMM‐UBMforlanguageanddialectrecogni)on (e.g.,WongandSridharan,2002;Torres‐Carrasquilloetal.,2008)
Ourapproach:FeaturespaceMaximumLikelihoodLinearRegression(fMLLR)adapta)on
UseaCD‐phonerecognizertoobtainCD‐phonesequence:transformthefeatures“towards”thecorrespondingacous)cmodelGMMs(amatrixforeachspeaker)
SameasGMM‐UBMapproach,butusetransformedacous)cvectorsinstead
14
fMLLR
[Vowel]‐/r/‐[Consonant]
Results–GMM‐UBM‐fMLLR–30sCuts
Approach EER(%)
PRLM 17.7
GMM‐UBM 15.3
GMM‐UBM‐fMLLR 11.0%
15
Outline
Mo)va)on
Corpora
CD‐PhoneRecognizer
Baselines
TwoIdeas: GMM‐UBMwithfMLLR
Discrimina)vePhonotac)cs
Results
ConclusionsandFutureWork
16
Discrimina)vePhonotac)cs
Hypothesis:Dialectsdifferintheirallophones(context‐dependentphones)andtheirphonotac)cs
Idea:Discriminatedialectsfirstatthelevelofcontext‐dependent(CD)phonesandthenphonotac)cs
I. Obtain CD-phones II. Extract acoustic features for each CD-phone III. Discriminate CD-phones across dialects IV. Augment the CD-phone sequences and extract phonotactic features V. Train a discriminative classifier to distinguish dialects
17
/r/ isApproximantinAmericanEnglish[ɹ]andtrilledinScojshin[Consonant] – /r/ – [Vowel]
...
[Back vowel]-r-[Central Vowel]
[Plosive]-A-[Voiced Consonant]
[Central Vowel]-b-[High Vowel]
...
...
Run our CD-phone recognizer
ObtainingCD‐Phones
CD-phone sequence
* not just /r/ /A/ /b/
Do the above for all training data of all dialects
18
CD‐PhoneUniversalBackgroundAcous)cModel
e.g., [Back vowel]-r-[Central Vowel]
EachCDphonetypehasanacous)cmodel:
19
ObtainingCD‐Phones+FrameAlignment
20
Acoustic frames: Front-End
Acoustic frames for second state
CD-Acoustic Models:
CD-Phones: (e.g.) [vowel]-b-[glide] [front-vowel]-r-[sonorant]
CD-Phone Recognizer
[Back Vowel]-r-[Central Vowel]
MAP MAP
MAP
MAPadapttheCD‐phoneacous)cmodelGMMstothecorrespondingframes(r=0.1)
MAPAdapta)onofeachCD‐PhoneInstance
21
MAPadapttheCD‐phoneacous)cmodelGMMstothecorrespondingframes*
OneSuperVectorforeachCDphoneinstance:
StackalltheGaussianmeansandphoneduraHonV k =[µ1, µ2, …., µN, duration]
i.e., summarize the acoustic-phonetic features of each CD-phone in one vector
[Back Vowel]-r-[Central Vowel]
MAPAdapta)onofeachCD‐PhoneInstance
22 *Similarto(Campbelletal.,2006)butatthelevelofCD‐phone
Super vectors of CD-phone instances of all training speakers in dialect 1
Super vectors of CD phone instances of all training speakers in dialect 2
[Back Vowel]-r-[Central Vowel]
dialect1
dialect2
SVMClassifierforeachCD‐PhoneTypeforeachPairofDialects
23
Discrimina)vePhonotac)cs–CD‐PhoneClassifica)on
24
MAP Adapted Acoustic Models:
MAP Adapt GMMs
Super Vectors
Acoustic frames: Front-End
Acoustic frames for second state
CD-Acoustic Models:
CD-Phones: (e.g.) [vowel]-b-[glide] [front-vowel]-r-[sonorant]
CD-Phone Recognizer
Super Vectors: Super Vector 1 Super Vector N
Dialects: (e.g.) SVM Classifiers Egy Egy
CD‐PhoneClassifierResults
Splitthetrainingdataintotwohalves
Train227(oneforeachCD‐phonetype)binaryclassifiersforeachpairofdialectson1sthalfandteston2nd
25 *performedsignificantlybeqerthanchance(50%)
Extrac)onofLinguis)cKnowledge
Usetheresultsoftheseclassifierstoshowwhichphonesinwhatcontextsdis)nguishdialectsthemost(chanceis50%)
26
Levan)ne/IraqiDialects
LabelingPhoneSequenceswithDialectHypotheses
...
[Back vowel]-r-[Central Vowel]
[Plosive]-A-[Voiced Consonant]
[Central Vowel]-b-[High Vowel]
...
...
Run corresponding SVM classifier to get the dialect of each CD phone
27
...
[Back vowel]-r-[Central Vowel] Egyptian
[Plosive]-A-[Voiced Consonant] Egyptian
[Central Vowel]-b-[High Vowel] Levantine
...
...
CD‐phonerecognizer
TextualFeatureExtrac)onforDiscrimina)vePhonotac)cs
Extractthefollowingtextualfeaturesfromeachpairofdialects
Normalizevectorbyitsnorm
Trainalogis)cregressionwithL2regularizer
28
Experiments–TrainingTwoModels
Splittrainingdataintotwohalves
TrainSVMCD‐phoneclassifiersusingthefirsthalf
RuntheseSVMclassifierstoannotatetheCDphonesofthe2ndhalf
Trainthelogis)cclassifierontheannotatedsequences
29
Discrimina)vePhonotac)cs–DialectRecogni)on
30
MAP Adapted Acoustic Models:
MAP Adapt GMMs
Super Vectors
Acoustic frames: Front-End
Acoustic frames for second state
CD-Acoustic Models:
CD-Phones: (e.g.) [vowel]-b-[glide] [front-vowel]-r-[sonorant]
CD-Phone Recognizer
Logistic classifier
Egyptian
Super Vectors: Super Vector 1 Super Vector N
Dialects: (e.g.) SVM Classifiers Egy [vowel]-b-[glide] [front-vowel]-r-[sonorant]
Egy
Baselines
StandardPRLM:atrigramphonotac)cmodelperdialect
StandardGMM‐UBM: Front‐End:
13DPLPfeaturesfrom9framesfollowedbyLDA40D
CMVN
2048Gaussians–MLtrainedonequalnumberofframesfromeachdialect
DialectModelsareMAPadaptedwith5itera)ons(similartoTorres‐Carrasquilloetal.,2008)
31
Results–Discrimina)vePhonotac)cs
32
Approach EER(%)
PRLM 17.7
GMM‐UBM 15.3
GMM‐UBM‐fMLLR 11.0%
Disc.PhonotacHcs 6.0%
ResultsperDialect
33
Dialect GMMfMLLR
Disc.Pho.
Egyp)an 4.4% 1.3%
Iraqi 11.1% 6.6%
Levan)ne 12.8% 6.9%
Gulf 15.6% 7.8%
Conclusions
fMLLRtotransformtheacous)cfeaturessignificantlyimproveresultsforGMM‐UBMapproach Wes)llneedtodomoreanalyses
Theproposedmethodhelpsinunderstandingthelinguis)cdifferencesbetweendialects
Discrimina)vephonotac)csoutperformsGMM‐UBM‐fMLLRin5%absoluteEER.
34
FutureWork
NewSVMKerneltocomputethesimilarityofallphonesuper‐vectorsacrosstwouqerancesonlyoneSVMclassifierforeachpairofdialects(IS2010;submiqed)
Testthisapproachonshorteruqerances(3sand10s)
Trythisapproachondialects/accentsofotherlanguages:
Englishaccents(AmericanEnglishandIndianEnglish) AmericanEnglishDialects
ApplyVTLN
Tes)ngwithNAP(needtomodifytoaccommodateforshortcontextSupervectors)
35
ThankYou!
Acknowledgments: JasonPelecanosforusefuldiscussions
36
CaseStudy:ArabicDialects–OurData
IraqiArabic:Baghdadi,Northern,andSouthern
GulfArabic:Omani,UAE,andSaudiArabic
Levan)neArabic:Jordanian,Lebanese,Pales)nian,andSyrianArabic
Egyp)anArabic:primarilyCaireneArabic
37