

Efficient Algorithms for Speech Recognition

Mosur K. Ravishankar

May 5, 1996
CMU-CS-96-143


School of Computer Science
Computer Science Division
Carnegie Mellon University
Pittsburgh, PA 15213

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy.

Thesis Committee:
Roberto Bisiani, co-chair (University of Milan)
Raj Reddy, co-chair
Alexander Rudnicky
Richard Stern
Wayne Ward

© 1996 Mosur K. Ravishankar

This research was supported by the Department of the Navy, Naval Research Laboratory under Grant No. N00014-93-1-2005. The views and conclusions contained in this document are those of the author and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. government.



Keywords: speech recognition, search algorithms, real-time recognition, lexical tree search, lattice search, fast match algorithms, memory size reduction.


Abstract

Advances in speech technology and computing power have created a surge of interest in the practical application of speech recognition. However, the most accurate speech recognition systems in the research world are still far too slow and expensive to be used in practical, large vocabulary continuous speech applications. Their main goal has been recognition accuracy, with emphasis on acoustic and language modelling. But practical speech recognition also requires the computation to be carried out in real time within the limited resources (CPU power and memory size) of commonly available computers. There has been relatively little work in this direction while preserving the accuracy of research systems.

In this thesis, we focus on efficient and accurate speech recognition. It is easy to improve recognition speed and reduce memory requirements by trading away accuracy, for example by greater pruning, and using simpler acoustic and language models. It is much harder to improve both the recognition speed and reduce main memory size while preserving the accuracy.

This thesis presents several techniques for improving the overall performance of the CMU Sphinx-II system. Sphinx-II employs semi-continuous hidden Markov models for acoustics and trigram language models, and is one of the premier research systems of its kind. The techniques in this thesis are validated on several widely used benchmark test sets using two vocabulary sizes of about 20K and 58K words.

The main contributions of this thesis are an 8-fold speedup and 4-fold memory size reduction over the baseline Sphinx-II system. The improvement in speed is obtained from the following techniques: lexical tree search, phonetic fast match heuristic, and global best path search of the word lattice. The gain in speed from the tree search is about a factor of 5. The phonetic fast match heuristic speeds up the tree search by another factor of 2 by finding the most likely candidate phones active at any time. Though the tree search incurs some loss of accuracy, it also produces compact word lattices with low error rate which can be rescored for accuracy. Such a rescoring is combined with the best path algorithm to find a globally optimum path through a word lattice. This recovers the original accuracy of the baseline system. The total recognition time is about 3 times real time for the 20K task on a 175 MHz DEC Alpha workstation.

The memory requirements of Sphinx-II are minimized by reducing the sizes of the acoustic and language models. The language model is maintained on disk and bigrams and trigrams are read in on demand. Explicit software caching mechanisms effectively overcome the disk access latencies. The acoustic model size is reduced by simply truncating the precision of probability values to 8 bits. Several other engineering solutions, not explored in this thesis, can be applied to reduce memory requirements further. The memory size for the 20K task is reduced to about 30-40 MB.


Acknowledgements

I cannot overstate the debt I owe to Roberto Bisiani and Raj Reddy. They have not only helped me and given me every opportunity to extend my professional career, but also helped me through personal difficulties as well. It is quite remarkable that I have landed not one but two advisors that combine integrity towards research with a human touch that transcends the proverbial hard-headedness of science. One cannot hope for better mentors than them. Alex Rudnicky, Rich Stern, and Wayne Ward all have a clarity of thinking and self-expression that simply amazes me without end. They have given me the most insightful advice, comments, and questions that I could have asked for. Thank you, all.

The CMU speech group has been a pleasure to work with. First of all, I would like to thank some former and current members, Mei-Yuh Hwang, Fil Alleva, Lin Chase, Eric Thayer, Sunil Issar, Bob Weide, and Roni Rosenfeld. They have helped me through the early stages of my induction into the group, and later given invaluable support in my work. I'm fortunate to have inherited the work of Mei-Yuh and Fil. Lin Chase has been a great friend and sounding board for ideas through these years. Eric has been all of that and a great officemate. I have learnt a lot from discussions with Paul Placeway. The rest of the speech group and the robust gang has made it a most lively environment to work in. I hope the charge continues through Sphinx-III and beyond.

I have spent a good fraction of my life in the CMU-CS community so far. It has been, and still is, the greatest intellectual environment. The spirit of cooperation, and informality of interactions, is simply unique. I would like to acknowledge the support of everyone I have ever come to know here, too many to name, from the Warp and Nectar days until now. The administrative folks have always succeeded in blunting the edge of a difficult day. You never know what nickname Catherine Copetas will christen you with next. And Sharon Burks has always put up with all my antics.

It goes without saying that I owe everything to my parents. I have had tremendous support from my brothers, and some very special uncles and aunts. In particular, I must mention the fun I've had with my brother Kuts. I would also like to acknowledge K. Gopinath's help during my stay in Bangalore. Finally, BB, who has suffered through my tantrums on bad days, kept me in touch with the rest of the world, has a most creative outlook on the commonplace, can drive me nuts some days, but when all is said and done, is a most relaxed and comfortable person to have around. Last but not least, I would like to thank Andreas Nowatzyk, Monica Lam, Duane Northcutt and Ray Clark. It has been my good fortune to witness and participate in some of Andreas's creative work. This thesis owes a lot to his unending support and encouragement.


Contents

Abstract

Acknowledgements

1 Introduction
  1.1 The Modelling Problem
  1.2 The Search Problem
  1.3 Thesis Contributions
    1.3.1 Improving Speed
    1.3.2 Reducing Memory Size
  1.4 Summary and Dissertation Outline

2 Background
  2.1 Acoustic Modelling
    2.1.1 Phones and Triphones
    2.1.2 HMM modelling of Phones and Triphones
  2.2 Language Modelling
  2.3 Search Algorithms
    2.3.1 Viterbi Beam Search
  2.4 Related Work
    2.4.1 Tree Structured Lexicons
    2.4.2 Memory Size and Speed Improvements in Whisper
    2.4.3 Search Pruning Using Posterior Phone Probabilities
    2.4.4 Lower Complexity Viterbi Algorithm
  2.5 Summary

3 The Sphinx-II Baseline System
  3.1 Knowledge Sources
    3.1.1 Acoustic Model
    3.1.2 Pronunciation Lexicon
  3.2 Forward Beam Search
    3.2.1 Flat Lexical Structure
    3.2.2 Incorporating the Language Model
    3.2.3 Cross-Word Triphone Modeling
    3.2.4 The Forward Search
  3.3 Backward and A* Search
    3.3.1 Backward Viterbi Search
    3.3.2 A* Search
  3.4 Baseline Sphinx-II System Performance
    3.4.1 Experimentation Methodology
    3.4.2 Recognition Accuracy
    3.4.3 Search Speed
    3.4.4 Memory Usage
  3.5 Baseline System Summary

4 Search Speed Optimization
  4.1 Motivation
  4.2 Lexical Tree Search
    4.2.1 Lexical Tree Construction
    4.2.2 Incorporating Language Model Probabilities
    4.2.3 Outline of Tree Search Algorithm
    4.2.4 Performance of Lexical Tree Search
    4.2.5 Lexical Tree Search Summary
  4.3 Global Best Path Search
    4.3.1 Best Path Search Algorithm
    4.3.2 Performance
    4.3.3 Best Path Search Summary
  4.4 Rescoring Tree-Search Word Lattice
    4.4.1 Motivation
    4.4.2 Performance
    4.4.3 Summary
  4.5 Phonetic Fast Match
    4.5.1 Motivation
    4.5.2 Details of Phonetic Fast Match
    4.5.3 Performance of Fast Match Using All Senones
    4.5.4 Performance of Fast Match Using CI Senones
    4.5.5 Phonetic Fast Match Summary
  4.6 Exploiting Concurrency
    4.6.1 Multiple Levels of Concurrency
    4.6.2 Parallelization Summary
  4.7 Summary of Search Speed Optimization

5 Memory Size Reduction
  5.1 Senone Mixture Weights Compression
  5.2 Disk-Based Language Models
  5.3 Summary of Experiments on Memory Size

6 Small Vocabulary Systems
  6.1 General Issues
  6.2 Performance on ATIS
    6.2.1 Baseline System Performance
    6.2.2 Performance of Lexical Tree Based System
  6.3 Small Vocabulary Systems Summary


List of Figures

2.1 Viterbi Search as Dynamic Programming
3.1 Sphinx-II Signal Processing Front End
3.2 Sphinx-II HMM Topology: 5-State Bakis Model
3.3 Cross-word Triphone Modelling at Word Ends in Sphinx-II
3.4 Word Initial Triphone HMM Modelling in Sphinx-II
3.5 One Frame of Forward Viterbi Beam Search in the Baseline System
3.6 Word Transitions in Sphinx-II Baseline System
3.7 Outline of A* Algorithm in Baseline System
3.8 Language Model Structure in Baseline Sphinx-II System
4.1 Basephone Lexical Tree Example
4.2 Triphone Lexical Tree Example
4.3 Cross-Word Transitions With Flat and Tree Lexicons
4.4 Auxiliary Flat Lexical Structure for Bigram Transitions
4.5 Path Score Adjustment Factor f for Word W_j Upon Its Exit
4.6 One Frame of Forward Viterbi Beam Search in Tree Search Algorithm
4.7 Word Lattice for Utterance: Take Fidelity's case as an example
4.8 Word Lattice Example Represented as a DAG
4.9 Word Lattice DAG Example Using a Trigram Grammar
4.10 Suboptimal Usage of Trigrams in Sphinx-II Viterbi Search
4.11 Base Phones Predicted by Top Scoring Senones in Each Frame; Speech Fragment for Phrase THIS TREND, Pronounced DH-IX-S-T-R-EH-N-DD


4.12 Position of Correct Phone in Ranking Created by Phonetic Fast Match
4.13 Lookahead Window for Smoothing the Active Phone List
4.14 Phonetic Fast Match Performance Using All Senones (20K Task)
4.15 Word Error Rate vs Recognition Speed of Various Systems
4.16 Configuration of a Practical Speech Recognition System


List of Tables

3.1 No. of Words and Sentences in Each Test Set
3.2 Percentage Word Error Rate of Baseline Sphinx-II System
3.3 Overall Execution Times of Baseline Sphinx-II System (x Real Time)
3.4 Baseline Sphinx-II System Forward Viterbi Search Execution Times (x Real Time)
3.5 HMMs Evaluated Per Frame in Baseline Sphinx-II System
3.6 N-gram Transitions Per Frame in Baseline Sphinx-II System
4.1 No. of Nodes at Each Level in Tree and Flat Lexicons
4.2 Execution Times for Lexical Tree Viterbi Search
4.3 Breakdown of Tree Viterbi Search Execution Times (x Real Time)
4.4 No. of HMMs Evaluated Per Frame in Lexical Tree Search
4.5 No. of Language Model Operations/Frame in Lexical Tree Search
4.6 Word Error Rates for Lexical Tree Viterbi Search
4.7 Word Error Rates from Global Best Path Search of Word Lattice Produced by Lexical Tree Search
4.8 Execution Times for Global Best Path DAG Search (x Real Time)
4.9 Word Error Rates From Lexical Tree + Rescoring + Best Path Search
4.10 Execution Times With Rescoring Pass
4.11 Fast Match Using All Senones; Lookahead Window = 3 (20K Task)
4.12 Fast Match Using All Senones; Lookahead Window = 3 (58K Task)
4.13 Fast Match Using CI Senones; Lookahead Window = 3
6.1 Baseline System Performance on ATIS


6.2 Ratio of Number of Root HMMs in Lexical Tree and Words in Lexicon (approximate)
6.3 Execution Times on ATIS
6.4 Breakdown of Tree Search Execution Times on ATIS (Without Phonetic Fast Match)
6.5 Recognition Accuracy on ATIS
A.1 The Sphinx-II Phone Set


Chapter 1

Introduction

Recent advances in speech technology and computing power have created a surge of interest in the practical application of speech recognition. Speech is the primary mode of communication among humans. Our ability to communicate with machines and computers, through keyboards, mice and other devices, is an order of magnitude slower and more cumbersome. In order to make this communication more user-friendly, speech input is an essential component.

There are broadly three classes of speech recognition applications, as described in [53]. In isolated word recognition systems each word is spoken with pauses before and after it, so that end-pointing techniques can be used to identify word boundaries reliably. Second, highly constrained command-and-control applications use small vocabularies, limited to specific phrases, but use connected word or continuous speech. Finally, large vocabulary continuous speech systems have vocabularies of several tens of thousands of words, and sentences can be arbitrarily long, spoken in a natural fashion. The last is the most user-friendly but also the most challenging to implement. However, the most accurate speech recognition systems in the research world are still far too slow and expensive to be used in practical, large vocabulary continuous speech applications on a wide scale.

Speech research has been concentrated heavily on acoustic and language modelling issues. Since the late 1980s, the complexity of tasks undertaken by speech researchers has grown from the 1000-word Resource Management (RM) task [51] to essentially unlimited vocabulary tasks such as transcription of radio news broadcast in 1995 [48]. While the word recognition accuracy has remained impressive, considering the increase in task complexity, the resource requirements have grown as well. The RM task ran about an order of magnitude slower than real time on processors of that day. The unlimited vocabulary tasks run about two orders of magnitude slower than real time on modern workstations whose power has grown by an order of magnitude again, in the meantime.

The task of large vocabulary continuous speech recognition is inherently hard for


the following reasons. First, word boundaries are not known in advance. One must be constantly prepared to encounter such a boundary at every time instant. We can draw a rough analogy to reading a paragraph of text without any punctuation marks or spaces between words:

    myspiritwillsleepinpeaceorifthinksitwillsurelythinkthusfarewellhesprangfromthecabinwindowashesaidthisupontheiceraftwhichlayclosetothevesselhewassoonborneawaybythewavesandlostindarknessanddistance...

Furthermore, many incorrect word hypotheses will be produced from incorrect segmentation of speech. Sophisticated language models that provide word context or semantic information are needed to disambiguate between the available hypotheses. The second problem is that co-articulatory effects are very strong in natural or

conversational speech, so that the sound produced at one instant is influenced by the preceding and following ones. Distinguishing between these requires the use of detailed acoustic models that take such contextual conditions into account. The increasing sophistication of language models and acoustic models, as well as the growth in the complexity of tasks, has far exceeded the computational and memory capacities of commonly available workstations.

Efficient speech recognition for practical applications also requires that the processing be carried out in real time within the limited resources (CPU power and memory size) of commonly available computers. There certainly are various such commercial and demonstration systems in existence, but their performance has never been formally evaluated with respect to the research systems or with respect to one another, in the way that the accuracy of research systems has been. This thesis is primarily concerned with these issues in improving the computational and memory efficiency of current speech recognition technology without compromising the achievements in recognition accuracy.

The three aspects of performance, recognition speed, memory resource requirements, and recognition accuracy, are in mutual conflict. It is relatively easy to improve recognition speed and reduce memory requirements while trading away some accuracy, for example by pruning the search space more drastically, and by using simpler acoustic and language models. Alternatively, one can reduce memory requirements through efficient encoding schemes at the expense of computation time needed to decode such representations, and vice versa. But it is much harder to improve both the recognition speed and reduce main memory requirements while preserving or improving recognition accuracy. In this thesis, we demonstrate algorithmic and heuristic techniques to tackle the problem.

This work has been carried out in the context of the CMU Sphinx-II speech recognition system as a baseline. There are two main schools of speech recognition technology today, based on statistical hidden Markov modelling (HMM), and neural


net technology, respectively. Sphinx-II uses HMM-based statistical modelling techniques and is one of the premier recognizers of its kind. Using several commonly used benchmark test sets and two different vocabulary sizes of about 20,000 and 58,000 words, we demonstrate that the recognition accuracy of the baseline Sphinx-II system can be attained while its execution time is reduced by about an order of magnitude and memory requirements reduced by a factor of about 4.

1.1 The Modelling Problem

As the complexity of tasks tackled by speech research has grown, so has that of the modelling techniques. In systems that use statistical modelling techniques, such as the Sphinx system, this translates into several tens to hundreds of megabytes of memory needed to store information regarding the statistical distributions underlying the models.

Acoustic Models

One of the key issues in acoustic modelling has been the choice of a good unit of speech [32, 27]. In small vocabulary systems of a few tens of words, it is possible to build separate models for entire words, but this approach quickly becomes infeasible as the vocabulary size grows. For one thing, it is hard to obtain sufficient training data to build all individual word models. It is necessary to represent words in terms of sub-word units, and train acoustic models for the latter, in such a way that the pronunciation of new words can be defined in terms of the already trained sub-word units.

The phoneme (or phone) has been the most commonly accepted sub-word unit. There are approximately 50 phones in the spoken English language; words are defined as sequences of such phones¹ (see Appendix A for the Sphinx-II phone set and examples). Each phone is, in turn, modelled by an HMM (described in greater detail in Section 2.1.2).

As mentioned earlier, natural continuous speech has strong co-articulatory effects. Informally, a phone models the position of various articulators in the mouth and nasal passages (such as the tongue and the lips) in the making of a particular sound. Since these articulators have to move smoothly between different sounds in producing speech, each phone is influenced by the neighbouring ones, especially during the transition from one phone to the next. This is not a major concern in small vocabulary systems in which words are not easily confusable, but becomes an issue as the vocabulary size and the degree of confusability increase.

¹Some systems define word pronunciations as networks of phones instead of simple linear sequences [36].


Most systems employ triphones as one form of context-dependent HMM models [4, 3] to deal with this problem. Triphones are basically phones observed in the context of given preceding and succeeding phones. There are approximately 50 phones in the spoken English language. Thus, there can be a total of about 50³ triphones, although only a fraction of them are actually observed in the language. Limiting the vocabulary can further reduce this number. For example, in Sphinx-II, a 20,000 word vocabulary has about 75,000 distinct triphones, each of which is modelled by a 5-state HMM, for a total of about 375,000 states. Since there isn't sufficient training data to build models for each state, they are clustered into equivalence classes called senones [27].

The introduction of context-dependent acoustic models, even after clustering into equivalence classes, creates an explosion in the memory requirements to store such models. For example, the Sphinx-II system with 10,000 senones occupies tens of megabytes of memory.

Language Models

Large vocabulary continuous speech recognition requires the use of a language model or grammar to select the most likely word sequence from the relatively large number of alternative word hypotheses produced during the search process. As mentioned earlier, the absence of explicit word boundary markers in continuous speech causes several additional word hypotheses to be produced, in addition to the intended or correct ones. For example, the phrase It's a nice day can be equally well recognized as It sun iced A. or It son ice day. They are all acoustically indistinguishable, but the word boundaries have been drawn at a different set of locations in each case. Clearly, many more alternatives can be produced with varying degrees of likelihood, given the input speech. The language model is necessary to pick the most likely sequence of words from the available alternatives.

Simple tasks, in which one is only required to recognize a constrained set of phrases, can use rule-based regular or context-free grammars which can be represented compactly. However, that is impossible with large vocabulary tasks. Instead, bigram and trigram grammars, consisting of word pairs and triples with given probabilities of occurrence, are most commonly used. One can also build such language models based on word classes, such as city names, months of the year, etc. However, creating such grammars is tedious as they require a fair amount of hand compilation of the classes. Ordinary word n-gram language models, on the other hand, can be created almost entirely automatically from a corpus of training text.

Clearly, it is infeasible to create a complete set of word bigrams for even medium vocabulary tasks. Thus, the set of bigram and trigram probabilities actually present in a given grammar is usually a small subset of the possible number. Even then, they usually number in the millions for large vocabulary tasks. The memory requirements


for such language models range from several tens to hundreds of megabytes.

1.2 The Search Problem

There are two components to the computational cost of speech recognition: acoustic probability computation, and search. In the case of HMM-based systems, the former refers to the computation of the probability of a given HMM state emitting the observed speech at a given time. The latter refers to the search for the best word sequence given the complete speech input. The search cost is largely unaffected by the complexity of the acoustic models. It is much more heavily influenced by the size of the task. As we shall see later, the search cost is significant for medium and large vocabulary recognition; it is the main focus of this thesis.

Speech recognition, searching for the most likely sequence of words given the input speech, gives rise to an exponential search space if all possible sequences of words are considered. The problem has generally been tackled in two ways: Viterbi decoding [62, 2] using beam search [37], or stack decoding [9, 0], which is a variant of the A* algorithm [42]. Some hybrid versions that combine Viterbi decoding with the A* algorithm also exist [21].

Viterbi Decoding

Viterbi decoding is a dynamic programming algorithm that searches the state space for the most likely state sequence that accounts for the input speech. The state space is constructed by creating each word HMM model from its constituent phone or triphone HMM models, and all word HMM models are searched in parallel. Since the state space is huge for even medium vocabulary applications, the beam search heuristic is usually applied to limit the search by pruning out the less likely states. The combination is often simply referred to as Viterbi beam search. Viterbi decoding is a time-synchronous search that processes the input speech one frame at a time, updating all the states for that frame before moving on to the next frame. Most systems employ a frame input rate of 100 frames/sec. Viterbi decoding is described in greater detail in Section 2.3.1.

Stack Decoding

Stack decoding maintains a stack of partial hypotheses² sorted in descending order of posterior likelihood. At each step it pops the best one off the stack. If it is a complete hypothesis, it is output. Otherwise the algorithm expands it by one word, trying all possible word extensions, evaluates the resulting (partial) hypotheses with respect to the input speech, and re-inserts them in the sorted stack. Any number of N-best hypotheses [59] can be generated in this manner. To avoid an exponential growth in the set of possible word sequences in medium and large vocabulary systems, partial hypotheses are expanded only by a limited set of candidate words at each step. These candidates are identified by a fast match step [6, 7, 8, 20]. Since our experiments have been mostly confined to Viterbi decoding, we do not explore stack decoding in any greater detail.

²A partial hypothesis accounts for an initial portion of the input speech. A complete hypothesis, or simply hypothesis, accounts for the entire input speech.
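As an illustration of this loop, here is a minimal sketch in Python. The helpers passed in are hypothetical, not from the thesis: score(hyp) estimates the likelihood of a (partial) hypothesis against the input speech, is_complete(hyp) tests whether it accounts for the entire input, and candidates(hyp) returns the limited word list a fast match step would supply.

    import heapq

    def stack_decode(score, is_complete, candidates, n_best=10):
        # Stack of partial hypotheses; negated scores make heapq a max-heap.
        stack = [(-score(()), ())]
        results = []
        while stack and len(results) < n_best:
            _, hyp = heapq.heappop(stack)      # pop the best hypothesis
            if is_complete(hyp):
                results.append(hyp)            # emit as the next N-best entry
                continue
            for w in candidates(hyp):          # expand by one word ...
                ext = hyp + (w,)
                heapq.heappush(stack, (-score(ext), ext))  # ... and re-insert
        return results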

Tree Structured Lexicons

Even with the beam search heuristic, straightforward Viterbi decoding is expensive. The network of states to be searched is formed by a linear sequence of HMM models for each word in the vocabulary. The number of models actively searched in this organization is still one to two orders of magnitude beyond the capabilities of modern workstations.

Lexical trees can be used to reduce the size of the search space. Since many words share common pronunciation prefixes, they can also share models and avoid duplication. Trees were initially used in fast match algorithms for producing candidate word lists for further search. Recently, they have been introduced in the main search component of several systems [44, 39, 43]. The main problem faced by them is in using a language model. Normally, transitions between words are accompanied by a prior language model probability. But with trees, the destination nodes of such transitions are not individual words but entire groups of them, related phonetically but quite unrelated grammatically. An efficient solution to this problem is one of the important contributions of this thesis.

Multipass Search Techniques

Viterbi search algorithms usually also create a word lattice in addition to the best recognition hypothesis. The lattice includes several alternative words that were recognized at any given time during the search. It also typically contains other information such as the time segmentations for these words, and their posterior acoustic scores (i.e., the probability of observing a word given that time segment of input speech). The lattice error rate measures the number of correct words missing from the lattice around the expected time. It is typically much lower than the word error rate³ of the single best hypotheses produced for each sentence.

³Word error rates are measured by counting the number of word substitutions, deletions, and insertions in the hypothesis, compared to the correct reference sentence.

Word lattices can be kept very compact, with low lattice error rate, if they are produced using sufficiently detailed acoustic models (as opposed to primitive models


as in, for example, fast match algorithms). In our work, a 10 sec long sentence typically produces a word lattice containing about 1000 word instances.

Given such compact lattices with low error rates, one can search them using sophisticated models and search algorithms very efficiently and obtain results with a lower word error rate, as described in [38, 65, 1]. Most systems use such multipass techniques.

However, there has been relatively little work reported on actually creating such lattices efficiently. This is important for the practical applicability of such techniques. Lattices can be created with low computational overhead if we use simple models, but their size must be large to guarantee a sufficiently low lattice error rate. On the other hand, compact, low-error lattices can be created using more sophisticated models, at the expense of more computation time. The efficient creation of compact, low-error lattices for efficient postprocessing is another byproduct of this work.

1.3 Thesis Contributions

This thesis explores ways of improving the performance of speech recognition systems along the dimensions of recognition speed and efficiency of memory usage, while preserving the recognition accuracy of research systems. As mentioned earlier, this is a much harder problem than if we are allowed to trade recognition accuracy for improvement in speed and memory usage.

In order to make meaningful comparisons, the baseline performance of an established research system is first measured. We use the CMU Sphinx-II system as the baseline system since it has been extensively used in the yearly ARPA evaluations. It has known recognition accuracy on various test sets, and similarities to many other research systems. The parameters measured include, in addition to recognition accuracy, the CPU usage of various steps during execution, frequency counts of the most time-consuming operations, and memory usage. All tests are carried out using two vocabulary sizes of about 20,000 (20K) and 58,000 (58K) words, respectively. The test sentences are taken from the ARPA evaluations in 1993 and 1994 [45, 46].

The results from this analysis show that the search component is several tens of times slower than real time on the reported tasks. (The acoustic output probability computation is relatively smaller since these tests have been conducted using semi-continuous acoustic models [28, 27].) Furthermore, the search time itself can be further decomposed into two main components: the evaluation of HMM models, and carrying out cross-word transitions at word boundaries. The former is simply a measure of the task complexity. The latter is a significant problem since there are cross-word transitions to every word in the vocabulary, and language model probabilities must be computed for every one of them.


1.3.1 Improving Speed

The work presented in this thesis shows that a new adaptation of lexical tree search can be used to reduce both the number of HMMs evaluated and the cost of cross-word transitions. In this method, language model probabilities for a word are computed not when entering that word but upon its exit, if it is one of the recognized candidates. The number of such candidates at a given instant is on average about two orders of magnitude smaller than the vocabulary size. Furthermore, the proportion appears to decrease with increasing vocabulary size.

Using this method, the execution time for recognition is decreased by a factor of about 4.8 for both the 20K and 58K word tasks. If we exclude the acoustic output probability computation, the speedup of the search component alone is about 6.3 for the 20K word task and over 7 for the 58K task. It also demonstrates that the lexical tree search efficiently produces compact word lattices with low error rates that can again be efficiently searched using more complex models and search algorithms. Even though there is a relative loss of accuracy of about 20% using this method, we show that it can be recovered efficiently by postprocessing the word lattice produced by the lexical tree search. The loss is attributed to suboptimal word segmentations produced by the tree search. However, a new shortest-path graph search formulation for searching the word lattice can reduce the loss in accuracy to under 10% relative to the baseline system with a negligible increase in computation.

If the lattice is first rescored to obtain better word segmentations, all the loss in accuracy is recovered. The rescoring step adds less than 20% execution time overhead, giving an effective overall speedup of about 4 over the baseline system.

We have applied a new phonetic fast match step to the lexical tree search that performs an initial pruning of the context independent phones to be searched. This technique reduces the overall execution time by about 40-45%, with a less than 2% relative loss in accuracy. This brings the overall speed of the system to about 8 times that of the baseline system, with almost no loss of accuracy. The structure of the final decoder is a pipeline of several stages which can be operated in an overlapped fashion. Parallelism among stages, especially the lexical tree search and rescoring passes, is possible for additional improvement in speed.

1.3.2 Reducing Memory Size

The two main candidates for memory usage in the baseline Sphinx-II system, and most of the common research systems, are the acoustic and language models.

The key observation for reducing the size of the language models is that in decoding any given utterance, only a small portion of it is actually used. Hence, we can


consider maintaining the language model entirely on disk, and retrieving only the necessary information on demand. Caching schemes can overcome the large disk-access latencies. One might expect the virtual memory systems to perform this function automatically. However, they don't appear to be efficient at managing the language model working set since the granularity of access to the related data structures is much smaller than a page size.

We have implemented simple caching rules and replacement policies for bigrams and trigrams, which show that the memory resident portion of large bigram and trigram language models can be reduced significantly. In our benchmarks, the number of bigrams in memory is reduced to about 5-25% of the total, and that of trigrams to about 2-5% of the total. The impact of disk accesses on elapsed time performance is minimal, showing that the caching policies are effective. We believe that further reductions in size can be easily obtained by various compression techniques, such as a reduction in the precision of representation.
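As a rough illustration of the mechanism (the actual Sphinx-II caching rules and replacement policies are described in Chapter 5; the record layout and index below are assumptions), a disk-resident bigram table with an in-memory cache might look like this:

    RECORD_SIZE = 8  # assumed size in bytes of one packed bigram record

    class DiskBigrams:
        # Bigram table kept on disk; a word's successor list is read in on
        # demand and cached in memory (no replacement policy shown here).
        def __init__(self, lm_path, seek_index):
            self.f = open(lm_path, "rb")
            self.index = seek_index   # word id -> (file offset, record count)
            self.cache = {}           # word id -> raw successor records

        def successors(self, w):
            if w not in self.cache:   # cache miss: one seek plus one read
                offset, count = self.index[w]
                self.f.seek(offset)
                self.cache[w] = self.f.read(count * RECORD_SIZE)
            return self.cache[w]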

The size of the acoustic models is trivially reduced by a factor of 4, simply by reducing the precision of their representation from 32 bits to 8 bits, with no difference in accuracy. This has, in fact, been done in many other systems as in [25]. The new observation is that in addition to memory size reduction, the smaller precision also allows us to speed up the computation of acoustic output probabilities of senones every frame. The computation involves the summation of probabilities in log-domain, which is cumbersome. The 8-bit representation of such operands allows us to achieve this with a simple table lookup operation, improving the speed of this step by about a factor of 2.
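The trick rests on the identity log(p1 + p2) = log p1 + log(1 + e^(log p2 − log p1)) for p1 ≥ p2: the correction term depends only on the difference of the two log values, so with quantized 8-bit log probabilities it can be precomputed for every possible difference. A minimal sketch, with an assumed quantization scale:

    import numpy as np

    SCALE = 32.0      # assumed quantization: stored value = round(SCALE * ln p)
    DIFF_RANGE = 256  # corrections beyond this difference are effectively zero
    ADD_TABLE = np.round(
        SCALE * np.log1p(np.exp(np.arange(DIFF_RANGE) / -SCALE))).astype(int)

    def log_add(a, b):
        # log(p1 + p2) on quantized log values, via a single table lookup.
        hi, lo = (a, b) if a >= b else (b, a)
        d = hi - lo
        return hi + (ADD_TABLE[d] if d < DIFF_RANGE else 0)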

1.4 Summary and Dissertation Outline

In summary, this thesis presents a number of techniques for improving the speed of the baseline Sphinx-II system by about an order of magnitude, and reducing its memory requirements by a factor of 4, without significant loss of accuracy. In doing so, it demonstrates several facts:

• It is possible to build efficient speech recognition systems comparable to research systems in accuracy.

• It is possible to separate concerns of search complexity from that of modelling complexity. By using semi-continuous acoustic models and efficient search strategies to produce compact word lattices with low error rates, and restricting the more detailed models to search such lattices, the overall performance of the system is optimized.

• It is necessary and possible to make decisions for pruning large portions of the search space away with low cost and high reliability. The beam search heuristic


is a well known example of this principle. The phonetic fast match method and the reduction in precision of probability values also fall under this category.

The organization of this thesis is as follows. Chapter 2 contains background

material and brief descriptions of related work done in this area. Since recognition speed and memory efficiency has not been an explicit consideration in the research community so far, in the way that recognition accuracy has been, there is relatively little material in this regard.

Chapter 3 is mainly concerned with establishing baseline performance figures for the Sphinx-II research system. It includes a comprehensive description of the baseline system, specifications of the benchmark tests and experimental conditions used throughout this thesis, and detailed performance figures, including accuracy, speed and memory requirements.

Chapter 4 is one of the main chapters in this thesis; it describes all of the new techniques to speed up recognition and their results on the benchmark tests. Both the baseline and the improved system use the same set of acoustic and language models.

Techniques for memory size reduction and corresponding results are presented in Chapter 5. It should be noted that most experiments reported in this thesis were conducted with these optimizations in place. Though this thesis is primarily concerned with large vocabulary recognition, it is interesting to consider the applicability of the techniques developed here to smaller vocabulary situations. Chapter 6 addresses the concerns relating to small and extremely small vocabulary tasks. The issues of efficiency are quite different in their

case, and the problems are also different. The performance of both the baseline Sphinx-II system and the proposed experimental system are evaluated and compared on the ATIS (Airline Travel Information Service) task, which has a vocabulary of about 3,000 words.

Finally, Chapter 7 concludes with a summary of the results, contributions of this thesis and some thoughts on future directions for search algorithms.


Chapter 2

Background

This chapter contains a brief review of the necessary background material to understand the commonly used modelling and search techniques in speech recognition. Sections 2.1 and 2.2 cover basic features of statistical acoustic and language modelling, respectively. Viterbi decoding using beam search is described in Section 2.3, while related research on efficient search techniques is covered in Section 2.4.

2.1 Acoustic Modelling

2.1.1 Phones and Triphones

The objective of speech recognition is the transcription of speech into text, i.e., word strings. To accomplish this, one might wish to create word models from training data. However, in the case of large vocabulary speech recognition, there are simply too many words to be trained in this way. It is necessary to obtain several samples of every word from several different speakers, in order to create reasonable speaker-independent models for each word. Furthermore, the process must be repeated for each new word that is added to the vocabulary.

The problem is solved by creating acoustic models for sub-word units. All words are composed of basically a small set of sounds or sub-word units, such as syllables or phonemes, which can be modelled and shared across different words.

Phonetic models are the most frequently used sub-word models. There are only about 50 phones in spoken English (see Appendix A for the set of phones used in Sphinx-II). New words can simply be added to the vocabulary by defining their pronunciation in terms of such phones.

The production of sound corresponding to a phone is influenced by neighbouring phones. For example, the AE phone in the word man sounds different from that in


lack; the former is more nasal. IBM [4] proposed the use of triphone or context-dependent phone models to deal with such variations. With 50 phones, there can be up to 50³ triphones, but only a fraction of them are actually observed in practice. Virtually all speech recognition systems now use such context dependent models.

2.1.2 HMM modelling of Phones and Triphones

Most systems use hidden Markov models (HMMs) to represent the basic units of speech. The usage and training of HMMs has been covered widely in the literature. Initially described by Baum in [11], it was first used in speech recognition systems by CMU [10] and IBM [29]. The use of HMMs in speech has been described, for example, by Rabiner [52]. Currently, almost all systems use HMMs for modelling triphones and context-independent phones (also referred to as monophones or basephones). These include BBN [41], CMU [35, 27], the Cambridge HTK system [65], IBM [5], and LIMSI [18], among others. We will give a brief description of HMMs as used in speech.

First of all, the sampled speech input is usually preprocessed, through various signal-processing steps, into a cepstrum or other feature stream that contains one feature vector every frame. Frames are typically spaced at 10 msec intervals. Some systems produce multiple, parallel feature streams. For example, Sphinx has 4 feature streams (cepstra, Δcepstra, ΔΔcepstra, and power) representing the speech signal (see Section 3.1.1).

An HMM is a set of states connected by transitions (see Figure 3.2 for an example). Transitions model the emission of one frame of speech. Each HMM transition has an associated output probability function that defines the probability of emitting the input feature observed in any given frame while taking that transition. In practice, most systems associate the output probability function with the source or destination state of the transition, rather than the transition itself. Henceforth, we shall assume that the output probability is associated with the source state. The output probability for state i at time t is usually denoted by b_i(t). (Actually, b_i is not a function of t, but rather a function of the input speech, which is a function of t. However, we shall often use the notation b_i(t) with this implicit understanding.)

Each HMM transition from any state i to state j also has a static transition probability, usually denoted by a_ij, which is independent of the speech input.

Thus, each HMM state occupies (represents) a small subspace of the overall feature space. The shape of this subspace is sufficiently complex that it cannot be accurately characterized by a simple mathematical distribution. For mathematical tractability, the most common general approach has been to model the state output probability by a mixture Gaussian codebook. For any HMM state s and feature stream f, the i-th component of such a codebook is a normal distribution with mean vector μ_{s,f,i} and covariance matrix U_{s,f,i}. In order to simplify the computation and also


because there is often insufficient data to estimate all the parameters of the covariance matrix, most systems assume independence of dimensions and therefore the covariance matrix becomes diagonal. Thus, we can simply use standard deviation vectors σ_{s,f,i} instead of U_{s,f,i}. Finally, each such mixture component also has a scalar mixture coefficient or mixture weight w_{s,f,i}. With that, the probability of observing a given speech input x in HMM state s is given by:

b_s(x) = ∏_f ( Σ_i w_{s,f,i} · f(x_f; μ_{s,f,i}, σ_{s,f,i}) )        (2.1)

where the speech input x is the parallel set of feature vectors, and x_f its f-th feature component; i ranges over the number of Gaussian densities in the mixture and f over the number of features. The expression f(·) is the value of the chosen component Gaussian density function at x_f.

In the general case of fully continuous HMMs, each HMM state s in the acoustic model has its own separate weighted mixture Gaussian codebook. However, this is computationally expensive, and many schemes are used to reduce this cost. It also results in too many free parameters. Most systems group HMM states into clusters that share the same set of model parameters. The sharing can be of different degrees. In semi-continuous systems, all states share a single mixture Gaussian codebook, but the mixture coefficients are distinct for individual states. In Sphinx-II, states are grouped into clusters called senones [27], with a single codebook (per feature stream) shared among all senones, but distinct mixture weights for each. Thus, Sphinx-II uses semi-continuous modelling with state clustering.
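The computational benefit of this sharing can be seen in a short sketch: the shared codebook densities are evaluated once per frame, and each senone then merely mixes those precomputed values with its own weights. The array shapes and names are illustrative, not the Sphinx-II implementation:

    import numpy as np

    def codebook_log_densities(x, means, variances):
        # Log density of frame vector x under each shared diagonal-covariance
        # Gaussian; means and variances have shape (n_densities, dim).
        diff = x - means
        return -0.5 * np.sum(diff * diff / variances
                             + np.log(2 * np.pi * variances), axis=1)

    def senone_scores(x, means, variances, mixture_weights):
        # Score all senones for one frame. mixture_weights has shape
        # (n_senones, n_densities): one weight vector per senone.
        log_dens = codebook_log_densities(x, means, variances)
        peak = log_dens.max()               # rescale for numerical stability
        dens = np.exp(log_dens - peak)
        return np.log(mixture_weights @ dens) + peak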

Even simpler discrete HMM models can be derived by replacing the mean and variance vectors representing Gaussian densities with a single centroid. In every frame, the single closest centroid to the input feature vector is computed (using the Euclidean distance measure), and individual states weight the codeword so chosen. Discrete models are typically only used in making approximate searches such as in fast match algorithms.

For simplicity of modelling, HMMs can have NULL transitions that do not con-

sume any time and hence do not model the emission of speech. Word HMMs can be built by simply stringing together phonetic HMM models using NULL transitions as appropriate.

2.2 Language Modelling

As mentioned in Chapter 1, a language model (LM) is required in large vocabulary speech recognition for disambiguating between the large set of alternative, confusable words that might be hypothesized during the search.


The LM defines the a priori probability of a sequence of words. The LM probability of a sentence (i.e., a sequence of words w_1, w_2, ..., w_n) is given by:

P(w_1) P(w_2|w_1) P(w_3|w_1,w_2) P(w_4|w_1,w_2,w_3) ··· P(w_n|w_1,...,w_{n-1}) = ∏_{i=1}^{n} P(w_i|w_1,...,w_{i-1}).

In an expression such as P(w_i|w_1,...,w_{i-1}), the sequence w_1,...,w_{i-1} is the word history, or simply history, for w_i. In practice, one cannot obtain reliable probability estimates given arbitrarily long histories since that would require enormous amounts of training data. Instead, one usually approximates them in the following ways:

• Context free grammars or regular grammars. Such LMs are used to define the form of well structured sentences or phrases. Deviations from the prescribed structure are not permitted. Such formal grammars are never used in large vocabulary systems since they are too restrictive.

• Word unigram, bigram, trigram grammars. These are defined respectively as follows (higher-order n-grams can be defined similarly):

  P(w) = probability of word w
  P(w_j|w_i) = probability of w_j given a one word history w_i
  P(w_k|w_i,w_j) = probability of w_k given a two word history w_i, w_j

A bigram grammar need not contain probabilities for all possible word pairs. In fact, that would be prohibitive for all but the smallest vocabularies. Instead, it typically lists only the most frequently occurring bigrams, and uses a backoff mechanism to fall back on unigram probability when the desired bigram is not found. In other words, if P(w_j|w_i) is sought and is not found, one falls back on P(w_j). But a backoff weight is applied to account for the fact that w_j is known to be not one of the bigram successors of w_i [30]. Other higher-order backoff n-gram grammars can be defined similarly. (A small sketch of this lookup is given after this list.)

• Class n-gram grammars. These are similar to word n-gram grammars, except that the tokens are entire word classes, such as digit, number, month, proper name, etc. The creation and use of class grammars is tricky since words can belong to multiple classes. There is also a fair amount of handcrafting involved.

• Long distance grammars. Unlike n-gram LMs, these are capable of relating words separated by some distance (i.e., with some intervening words). For example, the trigger-pair mechanism discussed in [57] is of this variety. Long distance grammars are primarily used to rescore N-best hypothesis lists from previous decodings.
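A minimal sketch of the backoff lookup mentioned above, assuming the grammar is held in plain dictionaries of log probabilities and backoff weights (illustrative structures, not any particular LM file format):

    def bigram_logprob(w1, w2, bigrams, unigrams, backoff):
        # log P(w2 | w1) with backoff to the unigram.
        if (w1, w2) in bigrams:
            return bigrams[(w1, w2)]   # explicit bigram entry exists
        # Not listed: back off, penalized by w1's backoff weight.
        return backoff[w1] + unigrams[w2]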


Figure 2.1: Viterbi Search as Dynamic Programming (states on one axis, time on the other; paths run from the start state to a final state).

Of the above, word bigram and trigram grammars are the most commonly used since they are easy to train from large volumes of data, requiring minimal manual intervention. They have also provided high degrees of recognition accuracy. The Sphinx-II system uses word trigram LMs.

2.3 Search Algorithms

The two main forms of decoding most commonly used today are Viterbi decoding using the beam search heuristic, and stack decoding. Since the work reported in this thesis is based on the former, we briefly review its basic principles here.

2.3.1 Viterbi Beam Search

Viterbi search [62] is essentially a dynamic programming algorithm, consisting of traversing a network of HMM states and maintaining the best possible path score at each state in each frame. It is a time-synchronous search algorithm in that it processes all states completely at time t before moving on to time t+1.

The abstract algorithm can be understood with the help of Figure 2.1. One dimension represents the states in the network, and the other is the time axis. There is typically one start state and one or more final states in the network. The arrows depict possible state transitions through the network. In particular, NULL transitions go vertically since they do not consume any input, and non-NULL transitions always go one time step forward. Each point in this 2-D space represents the best path probability for the corresponding state at that time. That is, given a time t and state s, the value at (t,s) represents the probability corresponding to the best state sequence leading from the initial state at time 0 to state s at time t.

The time-synchronous nature of the Viterbi search implies that the 2-D space is traversed from left to right, starting at time 0. The search is initialized at time


t = 0 with the path probability at the start state set to 1, and at all other states to 0. In each frame, the computation consists of evaluating all transitions between the previous frame and the current frame, and then evaluating all NULL transitions within the current frame. For non-NULL transitions, the algorithm is summarized by the following expression:

P_j(t) = max_i ( P_i(t−1) · a_ij · b_i(t) ),  i ∈ set of predecessor states of j        (2.2)

where P_j(t) is the path probability of state j at time t, a_ij is the static probability associated with the transition from state i to j, and b_i(t) is the output probability associated with state i while consuming the input speech at t (see Section 2.1.2 and equation 2.1). It is straightforward to extend this formulation to include NULL transitions that do not consume any input.

Thus, every state has a single best predecessor at each time instant. With some simple bookkeeping to maintain this information, one can easily determine the best state sequence for the entire search by starting at the final state at the end and following the best predecessor at each step all the way back to the start state. Such an example is shown by the bold arrows in Figure 2.1.

The complexity of Viterbi decoding is O(N²T) (assuming each state can transition to every state at each time step), where N is the total number of states and T is the total duration.

The application of Viterbi decoding to continuous speech recognition is straight-

forward. Word HMMs are built by stringing together phonetic HMM models using NULL transitions between the final state of one and the start state of the next. In addition, NULL transitions are added from the final state of each word to the initial state of all words in the vocabulary, thus modelling continuous speech. Language model (bigram) probabilities are associated with every one of these cross-word transitions. Note that a system with a vocabulary of V words has V² possible cross-word transitions. All word HMMs are searched in parallel according to equation 2.2.

Since even a small to medium vocabulary system consists of hundreds or thousands of HMM states, the state-time matrix of Figure 2.1 quickly becomes too large and costly to compute in its entirety. To keep the computation within manageable limits, only the most likely states are evaluated in each frame, according to the beam search heuristic [37]. At the end of time t, the state with the highest path probability p_max(t) is found. If any other state i has P_i(t) less than p_max(t) times an empirically chosen beamwidth threshold, that state is pruned and not expanded in the next frame.
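One frame of this computation, restricted to non-NULL transitions and followed by beam pruning, might be sketched as follows; scores are kept in the log domain, the data structures are illustrative, and the NULL-transition pass is omitted:

    def viterbi_frame(active, transitions, log_b, log_beam=-100.0):
        # active:      state i -> (log path score, best predecessor)
        # transitions: state i -> list of (successor j, log a_ij)
        # log_b:       state i -> log output probability b_i(t) for this frame
        new = {}
        for i, (score, _) in active.items():
            emit = score + log_b[i]                   # consume the frame in i
            for j, log_a in transitions[i]:
                cand = emit + log_a                   # equation 2.2, log domain
                if j not in new or cand > new[j][0]:  # keep best predecessor
                    new[j] = (cand, i)
        # Beam pruning: drop states far below this frame's best path score.
        best = max(score for score, _ in new.values())
        return {j: v for j, v in new.items() if v[0] >= best + log_beam}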


2.4 Related Work

Some of the standard techniques in reducing the computational load of Viterbi search for large vocabulary continuous speech recognition have been the following:

• Narrowing the beamwidth for greater pruning. However, this is usually associated with an increase in error rate because of an increase in the number of search errors: the correct word sometimes gets pruned from the search path in the bargain.

• Reducing the complexity of acoustic and language models. This approach works to some extent, especially if it is followed by a more detailed search in later passes. There is a tradeoff here, between the computational load of the first pass and subsequent ones. The use of detailed models in the first pass produces compact word lattices with low error rate that can be postprocessed efficiently, but the first pass itself is computationally expensive. Its cost can be reduced if simpler models are employed, at the cost of an increase in lattice size needed to guarantee low lattice error rates.

Both the above techniques involve some tradeoff between recognition accuracy and speed.

2.4.1 Tree Structured Lexicons

Organizing the HMMs to be searched as a phonetic tree instead of the flat structure of independent linear HMM sequences for each word is probably the most often cited improvement in search techniques in use currently. This structure is referred to as a tree-structured lexicon or lexical tree. If the pronunciations of two or more words contain the same n initial phonemes, they share a single sequence of n HMM models representing that initial portion of their pronunciation. (In practice, most systems use triphones instead of just basephones, so we should really consider triphone pronunciation sequences. But the basic argument is the same.) Since the word-initial models in a non-tree structured Viterbi search are typically the majority of the total number of active models, the reduction in computation is significant.
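The sharing is essentially a trie over phone sequences. A minimal construction sketch, with an illustrative toy lexicon (phone symbols in the style of Appendix A):

    def build_lexical_tree(lexicon):
        # lexicon: word -> list of phones. Words with identical initial phones
        # share trie nodes, so their word-initial HMMs are evaluated only once;
        # the word identity is known only where a pronunciation ends.
        root = {}
        for word, phones in lexicon.items():
            node = root
            for phone in phones:
                node = node.setdefault(phone, {})  # shared prefix, shared node
            node["#word"] = word
        return root

    tree = build_lexical_tree({
        "start":   ["S", "T", "AA", "R", "T"],
        "started": ["S", "T", "AA", "R", "T", "IX", "DD"],
        "stop":    ["S", "T", "AA", "P"],
    })
    # All three words share the S -> T -> AA nodes of the tree.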

The problem with a lexical tree occurs at word boundary transitions where bigram language model probabilities are usually computed and applied. In the flat (non-tree) Viterbi algorithm there is a transition from each word ending state (within the beam) to the beginning of every word in the vocabulary. Thus, there is a fan-in at the initial state of every word, with different bigram probabilities attached to every such transition. The Viterbi algorithm chooses the best incoming transition in each case.

However, with a lexical tree structure, several words may share the same root node of the tree. There can be a conflict between the best incoming cross-word transition


for different words that share the same root node. This problem has been usually solved by making copies of the lexical tree to resolve such conflicts.

Approximate Bigram Trees

SRI [39] and CRIM [43] augment their lexical tree structure with a flat copy of the lexicon that is activated for bigram transitions. All bigram transitions enter the flat lexicon copy, while the backed off unigram transitions enter the roots of the lexical tree. SRI notes that relying on just unigrams more than doubles the word error rate. They show that using this scheme, the recognition speed is improved by a factor of 2-3 for approximately the same accuracy. To gain further improvements in speed, they reduce the size of the bigram section by pruning the bigram language model in various ways, which adds significantly to the error rate. However, it should be noted that the experimental setup is based on using discrete HMM acoustic models, with a baseline system word error rate (21.5%) which is significantly worse than their best research system (10.3%) using bigrams, and also worse than most other research systems to begin with.

As we shall see in Chapter 3, bigram transitions constitute a significant portion of cross word transitions, which in turn are a dominant part of the search cost. Hence, the use of a flat lexical structure for bigram transitions must continue to incur this cost.

Replicated Bigram Trees

Ney and others [40] have suggested creating copies of the lexical tree to handle bigram transitions. The leaf nodes of the first level (unigram) lexical tree have secondary (bigram) trees hanging off them for bigram transitions. The total size of the secondary trees depends on the number of bigrams present in the grammar. Secondary trees that represent the bigram followers of the most common function words, such as A, THE, IN, OF, etc. are usually large.

This scheme creates additional copies of words that did not exist in the original flat structure. For example, in the conventional flat lexicon (or in the auxiliary flat lexicon copy of [39]), there is only one instance of each word. However, in this proposed scheme the same word can appear in multiple secondary trees. Since the short function words are recognized often (though spuriously), their bigram copies are frequently active. They are also among the larger ones, as noted above. It is unclear how much overhead this adds to the system.


Dynamic Network Decoding

Cambridge University [44] designed a one-pass decoder that uses the lexical tree structure, with copies for cross-word transitions, but instantiates new copies at every transition, as necessary. Basically, the traditional re-entrant lexical structure is replaced with a non-re-entrant structure. To prevent an explosion in memory space requirements, they reclaim HMM nodes as soon as they become inactive by falling outside the pruning beamwidth. Furthermore, the endpoints of multiple instances of the same word can be merged under the proper conditions, allowing just one instance of the lexical tree to be propagated from the merged word ends, instead of separately and multiply from each. This system attained the highest recognition accuracy in the Nov 1993 evaluations.

They report the performance under standard conditions (the standard 1993 20K Wall Street Journal development test set decoded using the corresponding standard bigram/trigram language model) using wide beamwidths as in the actual evaluations. The number of active HMM models per frame in this scheme is actually higher than the number in the baseline Sphinx-II system under similar test conditions (except that Sphinx-II uses a different lexicon and acoustic models). There are other factors at work, but the dynamic instantiation of lexical trees certainly plays a part in this increase. The overhead for dynamically constructing the HMM network is reported to be less than 20% of the total computational load. This is actually fairly high since the time to decode a sentence on an HP 735 platform is reported to be about 5 minutes on average.

2.4.2 Memory Size and Speed Improvements in Whisper

The CMU Sphinx-II system has been improved in many ways by Microsoft in producing the Whisper system [26]. They report that memory size has been reduced by a factor of 20 and speed improved by a factor of 5, compared to Sphinx-II under the same accuracy constraints.

One of the schemes for memory reduction is the use of a context free grammar (CFG) in place of bigram or trigram grammars. CFGs are highly compact, can be searched efficiently, and can be relatively easily created for small tasks such as command and control applications involving a few hundred words. However, large vocabulary applications cannot be so rigidly constrained.

They also obtain an improvement of about 5% in the memory size of acoustic models by using run length encoding for senone weighting coefficients (Section 2.1.2).

They have also improved the speed performance of Whisper through a Rich Get Richer (RGR) heuristic for deciding which phones should be evaluated in detail, using triphone states, and which should fall back on context independent phone states.


RGR works as follows: let P_p(t) be the best path probability of any state belonging to basephone p at time t, p_max(t) the best path probability over all states at t, and b_p(t+1) the output probability of the context-independent model for p at time t+1. Then, the context-dependent states for phone p are evaluated at frame t+1 iff:

α · P_p(t) + b_p(t+1) > p_max(t) − K

where α and K are empirically determined constants. Otherwise, context-independent output probabilities are used for those states. (All probabilities are computed in log-space. Hence the addition operations really represent multiplications in normal probability space.)
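Read as code, the gating test is a one-line predicate; alpha and K are the empirically tuned constants described above, and all quantities are log-domain scores:

    def evaluate_in_detail(log_P_p, log_b_p_next, log_pmax, alpha, K):
        # True: evaluate phone p's context-dependent (triphone) states at t+1.
        # False: fall back on its context-independent states instead.
        return alpha * log_P_p + log_b_p_next > log_pmax - K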

Using this heuristic, they report an 80% reduction in the number of context dependent states for which output probabilities are computed, with no loss of accuracy. If the parameters α and K are tightened to reduce the number of context-dependent states evaluated by 95%, there is a 5% relative loss of accuracy. (The baseline test conditions have not been specified for these experiments.)

2.4.3 Search Pruning Using Posterior Phone Probabilities

In [56], Renals and Hochberg describe a method of deactivating certain phones during search to achieve higher recognition speed. The method is incorporated into a fast match pass that produces words and posterior probabilities for their NOWAY stack decoder. The fast match step uses HMM basephone models, the states of which are modelled by neural networks that directly estimate phone posterior probabilities instead of the usual likelihoods; i.e., they estimate P(phone|data) instead of P(data|phone). Using the posterior phone probability information, one can identify the less likely active phones at any given time and prune the search accordingly.

This is a potentially powerful and easy pruning technique when the posterior phone probabilities are available. Stack decoders can particularly gain if the fast match step can be made to limit the number of candidate words emitted while extending a partial hypothesis. In their NOWAY implementation, a speedup of about an order of magnitude is observed on a 20K vocabulary task (from about 50x real time to about 15x real time) on an HP 735 workstation. They do not report the reduction in the number of active HMMs as a result of this pruning.

2.4.4 Lower Complexity Viterbi Algorithm

A new approach to the Viterbi algorithm, specifically applicable to speech recognition, is described by Patel in [49]. It is aimed at reducing the cost of the large number of cross-word transitions and has an expected complexity of O(N√N·T), instead of O(N²T) (Section 2.3.1). The algorithm depends on ordering the exit path probabilities and


transition (bigram) probabilities, and finding a threshold such that most transitions can be eliminated from consideration.

The authors indicate that the algorithm offers better performance if every word has bigram transitions to the entire vocabulary. However, this is not the case with large vocabulary systems. Nevertheless, it is worth exploring this technique further for its practical applicability.

2.5 Summary

In this chapter we have covered the basic modelling principles and search techniques commonly used in speech recognition today. We have also briefly reviewed a number of systems and techniques used to improve their speed and memory requirements. One of the main themes running through this work is that virtually none of the practical implementations have been formally evaluated with respect to the research systems on well established test sets under widely used test conditions, or with respect to one another.

In the rest of this thesis, we evaluate the baseline Sphinx-II system under normal evaluation conditions and use the results for comparison with our other experiments.


Chapter 3

The Sphinx-II Baseline System

As mentioned in the previous chapters, there is relatively little published work on the performance of speech recognition systems, measured along the dimensions of recognition accuracy, speed and resource utilization. The purpose of this chapter is to establish a comprehensive account of the performance of a baseline system that has been considered a premier representative of its kind, with which we can make meaningful comparisons of the research reported in this thesis. For this purpose, we have chosen the Sphinx-II speech recognition system¹ at Carnegie Mellon that has been used extensively in speech research and the yearly ARPA evaluations. Various aspects of this baseline system and its precursors have been reported in the literature, notably in [32, 33, 35, 28]. Most of these concentrate on the modelling aspects of the system (acoustic, grammatical or lexical) and their effect on recognition accuracy. In this chapter we focus on obtaining a comprehensive set of performance characteristics for this system.

¹The Sphinx-II decoder reported in this section is known internally as FBS6.

The baseline Sphinx-II recognition system uses semi-continuous (or tied-mixture) hidden Markov models (HMMs) for the acoustic models [52, 27, 2] and word bigram or trigram backoff language models (see Sections 2.1 and 2.2). It is a 3-pass decoder structured as follows:

1. Time synchronous Viterbi beam search [52, 62, 37] in the forward direction. It is a complete search of the full vocabulary, using semi-continuous acoustic models, a bigram or trigram language model, and cross-word triphone modelling during the search. The result of this search is a single recognition hypothesis, as well as a word lattice that contains all the words that were recognized during the search. The lattice includes word segmentation and scores information. One of the key features of this lattice is that for each word occurrence, several successive end times are identified along with their scores, whereas very often only the single most likely begin time is identified. Scores for alternative begin times are usually

1 The Sphinx-II decoder reported in this section is known internally as FBS6.



not available.

2. Time synchronous Viterbi beam search in the backward direction. This search

is restricted to the words identified in the forward pass and is very fast. Like the first pass, it produces a word lattice with word segmentations and scores. However, this time several alternative begin times are identified while typically only one end time is available. In addition, the Viterbi search also produces the best path score from any point in the utterance to the end of the utterance, which is used in the third pass.

3. An A* or stack search using the word segmentations and scores produced by the forward and backward Viterbi passes above. It produces an N-best list [59] of alternative hypotheses as its output, as described briefly in Section 3.3.2. There is no acoustic rescoring in this pass. However, any arbitrary language model can be applied in creating the N-best list. In this thesis, we will restrict our discussion to word trigram language models.

The reason for the existence of the backward and A* passes, even though the first pass produces a usable recognition result, is the following. One limitation of the forward Viterbi search in the first pass is that it is hard to employ anything more sophisticated than a simple bigram or similar grammar. Although a trigram grammar is used in the forward pass, it is not a complete trigram search (see Section 3.2.2). Stack decoding, a variant of the A* search algorithm2 [42], is more appropriate for use with such grammars, which lead to greater recognition accuracy. This algorithm maintains a stack of several possible partial decodings (i.e., word sequence hypotheses) which are expanded in a best-first manner [9, 2, 0]. Since each partial hypothesis is a linear word sequence, any arbitrary language model can be applied to it. Stack decoding also allows the decoder to output the several most likely N-best hypotheses rather than just the single best one. These multiple hypotheses can be post-processed with even more detailed models. The need for the backward pass in the baseline system has been mentioned above.

In this chapter we review the details of the baseline system needed for understanding the performance characteristics. In order to keep this discussion fairly self-contained, we first review the various knowledge source models in Section 3.1. Some of the background material in Sections 2.1, 2.2, and 2.3 is also relevant. This is followed by a discussion of the forward pass Viterbi beam search in Section 3.2, and the backward and A* searches in Section 3.3. The performance of this system on several widely used test sets from the ARPA evaluations is described in Section 3.4. It includes recognition accuracy, various statistics related to search speed, and memory usage. We conclude with some final remarks in Section 3.5.

2 We will often use the terms stack decoding and A* search interchangeably.


[Figure 3.1 block diagram: 16KHz, 16-bit linear samples -> pre-emphasis filter H(z) = 1 - 0.97z^-1 -> 25.6 msec Hamming windows at 10 msec intervals -> 12 mel frequency coefficients + power coefficient, 100 cepstral frames/sec -> sentence-based power and cepstral normalization (power -= max(power) over sentence; cepstrum -= mean(cepstrum) over sentence) -> cepstrum vector, Δcepstrum, ΔΔcepstrum, and power/Δpower/ΔΔpower: 4 feature streams at 100 frames/sec.]

Figure 3.1: Sphinx-II Signal Processing Front End.

3.1 Knowledge Sources

This section briefly describes the various knowledge sources or models and the speech signal processing front-end used in Sphinx-II. In addition to the acoustic models and pronunciation lexicon described below, Sphinx-II uses word bigram and trigram grammars. These have been discussed in Section 2.2.

3.1.1 Acoustic Model

Signal Processing

A detailed description of the signal processing front end in Sphinx-II is contained in Section 4.2.1 (Signal Processing) of [27]. The block diagram in Figure 3.1 depicts the overall processing. Briefly, the stream of 16-bit samples of speech data, sampled at 16KHz, is converted into 12-element mel scale frequency cepstrum vectors and a power coefficient in each 10msec frame. We represent the cepstrum vector at time t by x(t); individual elements are denoted by x_k(t), 1 <= k <= 12. The power coefficient at time t


[Figure 3.2 diagram: 5-state left-to-right HMM; the final state is non-emitting.]

Figure 3.2: Sphinx-II HMM Topology: 5-State Bakis Model.

is simply x_0(t). The cepstrum vector and power streams are first normalized, and four feature vectors are derived in each frame by computing the first and second order differences in time:

    x(t)    = normalized cepstrum vector
    Δx(t)   = x(t+2) - x(t-2),   Δ_l x(t) = x(t+4) - x(t-4)
    ΔΔx(t)  = Δx(t+1) - Δx(t-1)
    x_0(t),  Δx_0(t) = x_0(t+2) - x_0(t-2),  ΔΔx_0(t) = Δx_0(t+1) - Δx_0(t-1)

where the commas denote concatenation. Thus, in every frame we obtain four feature vectors of 12, 24, 12, and 3 elements, respectively. These, ultimately, are the input to the speech recognition system.

Phonetic HMM Models

Acoustic modelling in Sphinx-II is based on hidden Markov models (HMMs) for base phones and triphones. All HMMs in Sphinx-II have the same 5-state Bakis topology shown in Figure 3.2. (The background on HMMs has been covered briefly in Section 2.1.2.)

As mentioned in Section 2.1.2, Sphinx-II uses semi-continuous acoustic modelling with 256 component densities in each feature codebook. States are clustered into senones [27], where each senone has its own set of 256 mixture coefficients weighting the codebook for each feature stream.

In order to further reduce the computational cost, only the top few component densities from each feature codebook (typically 4) are fully evaluated in each frame in computing the output probability of a state or senone (equation 2.1). The rationale behind this approximation is that the remaining components match the input very poorly anyway and can be ignored altogether. The approximation primarily reduces the cost of applying the mixture weights in computing senone output probabilities in each frame. For each senone and feature, only 4 mixing weights have to be applied to the 4 best components, instead of all 256.
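As an illustration of this shortlisting, the following sketch selects the top 4 densities and mixes only those; the data layout and names are assumptions for illustration, not the actual Sphinx-II code:

    #include <math.h>

    #define N_DENSITIES 256
    #define N_TOP       4

    /* density[]: log scores of all codebook densities against the current
     * frame (from the VQ step); mixw[]: one senone's log mixture weights. */
    double senone_output_logprob(const double *density, const double *mixw)
    {
        int top[N_TOP];
        for (int n = 0; n < N_TOP; n++) {       /* select N_TOP best densities */
            int best = -1;
            for (int d = 0; d < N_DENSITIES; d++) {
                int used = 0;
                for (int m = 0; m < n; m++)
                    if (top[m] == d)
                        used = 1;
                if (!used && (best < 0 || density[d] > density[best]))
                    best = d;
            }
            top[n] = best;
        }
        double sum = 0.0;                       /* mix only the 4 best, not all 256 */
        for (int n = 0; n < N_TOP; n++)
            sum += exp(density[top[n]] + mixw[top[n]]);
        return log(sum);
    }

In the actual system the top-4 shortlist would be computed once per frame and shared by all senones; it is computed inline here only to keep the sketch self-contained.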


3.1.2 Pronunciation Lexicon

The lexicon in Sphinx-II defines the linear sequence of phonemes representing the pronunciation for each word in the vocabulary. There are about 50 phonemes that make up the English language. The phone set used in Sphinx-II is given in Appendix A. The following is a small example of the lexicon for digits:

    OH       OW
    ZERO     Z IH R OW
    ZERO(2)  Z IY R OW
    ONE      W AH N
    TWO      T UW
    THREE    TH R IY
    FOUR     F AO R
    FIVE     F AY V
    SIX      S IH K S
    SEVEN    S EH V AX N
    EIGHT    EY TD
    NINE     N AY N

There can be multiple pronunciations for a word, as shown for the word ZERO above. Each alternative pronunciation is assumed to have the same a priori language model probability.

3.2 Forward Beam Search

As mentioned earlier, the baseline Sphinx-II recognition system consists of three passes, of which the first is a time-synchronous Viterbi beam search in the forward direction. In this section we describe the structure of this forward pass. We shall first examine the data structures involved in the search algorithm, before moving on to the dynamics of the algorithm.

3.2.1 Flat Lexical Structure

The lexicon defines the linear sequence of context-independent or base phones that make up the pronunciation of each word in the vocabulary. Since Sphinx-II uses triphone acoustic models [34], these base phone sequences are converted into triphone sequences by simply taking each base phone together with its left and right context base phones. Note that the phonetic left context at the beginning of a word is the last base phone from the previous word. Similarly, the phonetic right context at the end of the word is the first base phone of the next word. (Since the decoder does


not know these neighbouring words a priori, it must try all possible cases and finally choose the best. This is discussed in detail below.) Given the sequence of triphones for a word, one can construct an equivalent word-HMM by simply concatenating the HMMs for the individual triphones, i.e., by adding a NULL transition from the final state of one HMM to the initial state of the next. The initial state of the first HMM, and the final state of the last HMM in this sequence become the initial and final states, respectively, of the complete word-HMM. Finally, in order to model continuous speech (i.e., transition from one word into the next), additional NULL transitions are created from the final state of every word to the initial state of all words in the vocabulary. Thus, with a V-word vocabulary, there are V² possible cross-word transitions.

Since the result is a structure consisting of a separate linear sequence of HMMs for each word, we call this a flat lexical structure.
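The construction just described can be sketched as follows; the types and the triphone_lookup() helper are illustrative assumptions, not the actual Sphinx-II data structures:

    #include <stdlib.h>

    extern int triphone_lookup(int left, int base, int right);  /* assumed */

    /* A word-HMM as a chain of triphone HMM instances; the `next` link
     * plays the role of the NULL transition from one final state to the
     * next initial state. */
    typedef struct hmm {
        int triphone_id;
        struct hmm *next;
    } hmm_t;

    /* phones[0..n-1]: the word's base phone sequence.  left_ctx/right_ctx:
     * cross-word contexts (last phone of the previous word, first phone of
     * the next word). */
    hmm_t *build_word_hmm(const int *phones, int n, int left_ctx, int right_ctx)
    {
        hmm_t *head = NULL, *tail = NULL;
        for (int i = 0; i < n; i++) {
            int l = (i == 0)     ? left_ctx  : phones[i - 1];
            int r = (i == n - 1) ? right_ctx : phones[i + 1];
            hmm_t *h = malloc(sizeof(hmm_t));
            h->triphone_id = triphone_lookup(l, phones[i], r);
            h->next = NULL;
            if (tail) tail->next = h; else head = h;
            tail = h;
        }
        return head;  /* head's initial and tail's final state delimit the word */
    }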

3.2.2 Incorporating the Language Model

While the cross-word NULL transitions do not consume any speech input, each of them does have a language model probability associated with it. For a transition from some word wi to any word wj, this probability is simply P(wj|wi) if a bigram language model is used. A bigram language model fits in neatly with the Markov assumption that, given any current state s at time t, the probability of transitions out of s does not depend on how one arrived at s. Thus, the language model probability P(wj|wi) can be associated with the transition from the final state of wi to the initial state of wj, and thereafter we need not care about how we arrived at wj.

The above argument does not hold for a trigram or some other longer distance grammar, since the language model probability of transition to wj depends not only on the immediate predecessor but also on some earlier ones. If a trigram language model is used, the lexical structure has to be modified such that for each word w there are several parallel instances of its word HMM, one for each possible predecessor word. Although the copies may score identically acoustically, the inclusion of language model scores would make their total path probabilities distinct. In general, with non-bigram grammars, we need a separate word HMM model for each grammar state rather than just one per word in the vocabulary.

Clearly, replicating the word HMM models for incorporating a trigram grammar or some other non-bigram grammar in the search algorithm is much costlier computationally. However, more sophisticated grammars offer greater recognition accuracy and possibly even a reduction in the search space. Therefore, in Sphinx-II, trigram grammars are used in an approximate manner with the following compromise. Whenever there is a transition from word wi to wj, we can find the best predecessor of wi at that point, say wi', as determined by the Viterbi search. We then associate the trigram probability P(wj|wi', wi) with the transition from wi to wj. Note, however, that unlike with bigram grammars, trigram probabilities applied to cross-word tran-


sitions in this approximate fashion have to be determined dynamically, depending on the best predecessor for each transition at the time in question. Using a trigram grammar in an approximate manner as described above has the following advantages (a sketch of the resulting transition step follows the list):

    followingadvantages:tvoidnyeplicationofheexicalword-HMMtructuresndssociated

    increaseincomputationalload.ntermsofaccuracy,itismuchbetterthanusingabigrammodelandisclosetothatofacompletetrigramsearch.Weinferthisfromthefactthattheaccuracy

    oftheresultsfromthefinalA *pass,whichusesthetrigramgrammarcorrectly,andalsohashebenefitofadditionalwordsegmentationsochooserom,srelativelyonlyabout%betterseeSection3.4.2).trigramgrammarappliedinthisapproximatemannerisempiricallyobserved

    tosearchfewerword-HMMscomparedtoabigramgrammar,thusleadingtoaslightimprovementintherecognitionspeed.hereductioninsearchisaresultofsharperpruningofferedbythetrigramgrammar.
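The cross-word transition under this scheme might be sketched as follows; the structures and the lm_trigram() lookup are assumptions for illustration:

    #include <math.h>

    extern double lm_trigram(int w_prev2, int w_prev1, int w);  /* assumed LM lookup */

    typedef struct {
        int    word;        /* word wi exiting in the current frame */
        int    best_pred;   /* wi's own Viterbi-best predecessor wi' */
        double path_score;  /* log path score at wi's final state */
    } word_exit_t;

    /* Best entry score into word wj, using P(wj | wi', wi) where wi' is
     * fixed to the Viterbi-best predecessor of each exiting word wi. */
    double best_entry_score(const word_exit_t *exits, int n_exits, int wj)
    {
        double best = -HUGE_VAL;
        for (int i = 0; i < n_exits; i++) {
            double s = exits[i].path_score
                     + lm_trigram(exits[i].best_pred, exits[i].word, wj);
            if (s > best)
                best = s;
        }
        return best;   /* seeds wj's initial state in the next frame */
    }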

3.2.3 Cross-Word Triphone Modeling

It is advantageous to use cross-word triphone models (as opposed to ignoring cross-word phonetic contexts) for continuous speech recognition, where word boundaries are unclear to begin with and there are very strong co-articulation effects. Using cross-word triphone models we not only obtain better accuracy, but also greater computational efficiency, at the cost of an increase in the total size of acoustic models. The sharper models provided by triphones, compared to diphones and monophones, lead to greater pruning efficiency and reduction in computation. However, using cross-word triphone models in the Viterbi search algorithm is not without its complications.

Right Context

The phonetic right context for the last triphone position in a word is the first base phone of the next word. In time-synchronous Viterbi search, there is no way to know the next word in advance. In any case, whatever decoding algorithm is used, there can be several potential successor words to any given word wi at any given time. Therefore, the last triphone position for each word has to be modelled by a parallel set of triphone models, one for each possible phonetic right context. In other words, if there are k base phones p1, p2, ..., pk in the system, we have k parallel triphone HMM models h_p1, h_p2, ..., h_pk representing the final triphone position for wi. A cross-word transition from wi to another word wj whose first base phone is p is represented by


[Figure 3.3 diagram: the HMM network for word wi ends in a parallel set of HMMs in the last phone position, one per phonetic right context base phone; a cross-word NULL transition connects the copy for right context p to word wj, whose first base phone is p.]

Figure 3.3: Cross-word Triphone Modelling at Word Ends in Sphinx-II.

a NULL arc from h_p to the initial state of wj. Figure 3.3 illustrates this concept of right context fanout at the end of each word wi in Sphinx-II.

This solution, at first glance, appears to force a large increase in the total number of triphone HMMs that may be searched. In the place of the single last-position triphone for each word, we now have one triphone model for each possible phonetic right context, which is typically around 50 in number. In practice, we almost never encounter this apparent explosion in computational load, for the following reasons (a small sketch related to the second point follows the list):

• The dynamic number of rightmost triphones actually evaluated in practice is much smaller than the static number because the beam search heuristic prunes most of the words away by the time their last phone has been reached. This is by far the largest source of efficiency, even with the right context fanout.

• The set of phonetic right contexts actually modelled can be restricted to just those found in the input vocabulary; i.e., to the set of first base phones of all the words in the vocabulary. Moreover, Sphinx-II uses state clustering into senones, where several states share the same output distribution modelled by a senone. Therefore, the parallel set of models at the end of any given word are not all unique. By removing duplicates, the fanout can be further reduced. In Sphinx-II, these two factors together reduce the right context fanout by about 50% on average.

• The increase in number of rightmost triphones is partly offset by the reduction in computation afforded by the sharper triphone models.
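The vocabulary-restriction point can be made concrete with a one-pass scan over the lexicon (an illustrative sketch with assumed arrays):

    #define N_PHONES 50   /* assumed base phone set size */

    /* first_phone[w]: first base phone of word w.  Only phones that begin
     * some vocabulary word ever occur as cross-word right contexts. */
    int collect_right_contexts(const int *first_phone, int vocab_size,
                               int *is_right_ctx /* out: flag per base phone */)
    {
        int n = 0;
        for (int p = 0; p < N_PHONES; p++)
            is_right_ctx[p] = 0;
        for (int w = 0; w < vocab_size; w++)
            if (!is_right_ctx[first_phone[w]]) {
                is_right_ctx[first_phone[w]] = 1;
                n++;
            }
        return n;   /* number of distinct right contexts needing models */
    }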


Left Context

The phonetic left context for the first phone position in a word is the last base phone from the previous word. During decoding, there is no unique such predecessor word. In any given frame, there may be transitions to a word wj from a number of candidates w_i1, w_i2, ... The Viterbi algorithm chooses the best possible transition into wj. Let us say the winning predecessor is w_ik. Thus, the last base phone of w_ik becomes the phonetic left context for wj. However, this is in frame t. In the next frame, there may be an entirely different winner that results in a different left context base phone. Since the real best predecessor is not determined until the end of the Viterbi decoding, all such possible paths have to be pursued in parallel.

As with right context cross-word triphone modelling, this problem also can be solved by using a parallel set of triphone models for the first phone position of each word: a separate triphone for each possible phonetic left context. However, unlike the word-ending phone position, which is heavily pruned by the beam search heuristic, the word-initial position is extensively searched. Most of the word-initial triphone models are alive every frame. In fact, as we shall see later in Section 3.4, they account for more than 60% of all triphone models evaluated in the case of large-vocabulary recognition. A left context fanout of even a small factor of 2 or 3 would substantially slow down the system.

The solution used in the Sphinx-II baseline system is to collapse the left context fanout into a single 5-state HMM with dynamic triphone mapping, as follows. As described above, at any given frame there may be several possible transitions from words w_i1, w_i2, ... into wj. According to the Viterbi algorithm, the transition with the best incoming score wins. Let the winning predecessor be w_ik. Then the initial state of wj also dynamically inherits the last base phone of w_ik as its left context. When the output probability of the initial state of wj has to be evaluated in the next frame, its parent triphone identity is first determined dynamically from the inherited left context base phone. Furthermore, this dynamically determined triphone identity is also propagated by the Viterbi algorithm, as the path probability is propagated from state to state. This ensures that any complete path through the initial triphone position of wj is scored consistently using a single triphone HMM model.

Figure 3.4 illustrates this process with an example, going through a sequence of 4 frames. It contains a snapshot of a word-initial HMM model at the end of each frame. Arcs in bold indicate the winning transitions to each state of the HMM in this example. HMM states are annotated with the left context base phone inherited dynamically through time. As we can see in the example, different states can have different phonetic left contexts associated with them, but a single Viterbi path through the HMM is evaluated with the same context. This can be verified by backtracking from the final state backward in time.
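The per-state bookkeeping for this dynamic mapping can be sketched roughly as follows; the structure and function names are assumptions for illustration, not the actual implementation:

    /* Word-initial HMM state with dynamically inherited left context. */
    typedef struct {
        double score;      /* Viterbi path score */
        int    left_ctx;   /* inherited left-context base phone */
        int    hist;       /* back-pointer into the word lattice */
    } state_t;

    /* Candidate transition src -> dst with transition log probability tp.
     * The left context travels with whichever path wins. */
    void viterbi_transition(state_t *dst, const state_t *src, double tp)
    {
        double s = src->score + tp;
        if (s > dst->score) {
            dst->score    = s;
            dst->hist     = src->hist;
            dst->left_ctx = src->left_ctx;
        }
    }

When dst's output probability is needed in the next frame, its senone is chosen by dynamically looking up the triphone (dst->left_ctx, base phone, right context), as described above.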


[Figure 3.4 diagram: snapshots at times 1 through 4 of the initial (leftmost) HMM model for a word, with incoming cross-word transitions carrying left context phones from previous words; each state is annotated with its dynamically inherited left context base phone.]

Figure 3.4: Word-Initial Triphone HMM Modelling in Sphinx-II.

Single Phone Words

In the case of single-phone words, both the left and right phonetic contexts are derived dynamically from neighbouring words. Thus, they have to be handled by a combination of the above techniques. With reference to Figures 3.3 and 3.4, separate copies of the single phone have to be created for each right phonetic context, and each copy is modelled using the dynamic triphone mapping technique for handling its left phonetic context.

3.2.4 The Forward Search

The decoding algorithm is, in principle, straightforward. The problem is to find the most probable sequence of words that accounts for the observed speech. This is tackled as follows.

The abstract Viterbi decoding algorithm and the beam search heuristic, and its


application to speech decoding have been explained in Section 2.3.1. In Sphinx-II, there are two distinguished words, <s> and </s>, depicting the beginning and ending silence in any utterance. The input speech is expected to begin at the initial state of <s> and end in the final state of </s>.

We can now describe the forward Viterbi beam search implementation in Sphinx-II. It is explained with the help of fragments of pseudo-code. It is necessary to understand the forward pass at this level in order to follow the subsequent discussion on performance analysis and the breakdown of computation among different modules.

Search Outline

Before we go into the details of the search algorithm, we introduce some terminology. A state j of an HMM model m in the flat lexical search space has the following attributes:

• A path score at time t, P_j^m(t), that indicates the probability corresponding to the best state sequence leading from the initial state of <s> at time 0 to this state at time t, while consuming the input speech until t.

• History information at time t, H_j^m(t), that allows us to trace back the best preceding word history leading to this state at t. (As we shall see later, this is a pointer to the word lattice entry containing the best predecessor word.)

• The senone output probability, b_j^m(t), for this state at time t (see Section 2.1.2). If m belongs to the first position in a word, the senone identity for state j is determined dynamically from the inherited phonetic left context (Section 3.2.3).

At the beginning of the decoding of an utterance, the search process is initialized by setting the path probability of the start state of the distinguished word <s> to 1.

All other states are initialized with a path score of 0. Also, an active HMM list that identifies the set of active HMMs in the current frame is initialized with this first HMM for <s>. From then on, the processing of each frame of speech, given the input feature vector for that frame, is outlined by the pseudo-code in Figure 3.5.

We consider some of the functions defined in Figure 3.5 in a little more detail below. Certain aspects, such as pruning out HMMs that fall below the beam threshold, have been omitted for the sake of simplicity.

VQ: VQ stands for vector quantization. In this function, the Gaussian densities that make up each feature codebook are evaluated at the input feature vectors. In other words, we compute the Mahalanobis distance of the input feature vector from the mean of each Gaussian density function. (This corresponds to evaluating M_n


    forward_frame(input feature vector for current frame)
    {
        VQ(input feature);    /* Find top 4 densities closest to input feature */
        senone_evaluate();    /* Find senone output probabilities using VQ results */
        hmm_evaluate();       /* Within-HMM and cross-HMM transitions */
        word_transition();    /* Cross-word transitions */
        /* HMM pruning using a beam omitted for simplicity */
        update active HMM list for next frame;
    }

    hmm_evaluate()
    {
        /* Within-HMM transitions */
        for (each active HMM h)
            for (each state s in h)
                update path probability of s using senone output probabilities;

        /* Within-word cross-HMM transitions and word exits */
        for (each active HMM h with final state score within beam) {
            if (h is a final HMM for a word w) {
                create word lattice entry for w;    /* word exit */
            } else {
                let h' = next HMM in word after h;
                NULL transition (final-state(h) -> initial-state(h'));
                /* Remember right context fanout if h' is final HMM in word */
            }
        }
    }

    word_transition()
    {
        let {w} = set of words entered into word lattice in this frame;
        for (each word w' in vocabulary)
            find the best transition ({w} -> w'), including LM probability;
    }

Figure 3.5: One Frame of Forward Viterbi Beam Search in the Baseline System.
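The pruning step omitted from Figure 3.5 would, under the usual beam formulation, look roughly like this sketch (illustrative names, log-domain scores):

    #include <math.h>

    /* Keep only HMMs whose best state score lies within `beam` of the
     * best score anywhere in the current frame. */
    int prune_active_hmms(const double *hmm_best, /* best state score per HMM */
                          int *active, int n_active,
                          double beam)
    {
        double best = -HUGE_VAL;
        for (int i = 0; i < n_active; i++)
            if (hmm_best[active[i]] > best)
                best = hmm_best[active[i]];

        double thresh = best - beam;
        int n = 0;
        for (int i = 0; i < n_active; i++)
            if (hmm_best[active[i]] >= thresh)
                active[n++] = active[i];   /* compact the surviving list */
        return n;                          /* new active list length */
    }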



the utterance and backtracking to the beginning, by following the history pointers in the word lattice.

3.3 Backward and A* Search

As mentioned earlier, the A* or stack search is capable of exactly using more sophisticated language models than bigram grammars, thus offering higher recognition accuracy. It maintains a sorted stack of partial hypotheses which are expanded in a best-first manner, one word length at a time. There are two main issues with this algorithm:

• To prevent an exponential explosion in the search space, the stack decoding algorithm must expand each partial hypothesis only by a limited set of the most likely candidate words that may follow that partial hypothesis.

• The A* algorithm is not time synchronous. Specifically, each partial hypothesis in the sorted stack can account for a different initial segment of the input speech. This makes it hard to compare the path probabilities of the entries in the stack.

It has been shown in [42] that the second issue can be solved by attaching a heuris-

tic score with every partial hypothesis H that accounts for the remaining portion of the speech not included in H. By filling out every partial hypothesis to the full utterance length in this way, the entries in the stack can be compared to one another, and expanded in a best-first manner. As long as the heuristic score attached to any partial hypothesis H is an upper bound on the score of the best possible complete recognition achievable from H, the A* algorithm is guaranteed to produce the correct results.
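In the usual A* notation (our restatement, not the thesis's own equations), each stack entry H is ranked by

    f(H) = g(H) + h(H),    with    h(H) >= h*(H),

where g(H) is the log path score of H over the speech it consumes, h*(H) is the log score of the best possible completion of H to the end of the utterance, and h(H) is the attached heuristic; the guarantee above is exactly the admissibility condition h(H) >= h*(H).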

The backward pass in the Sphinx-II baseline system provides an approximation to the heuristic score needed by the A* algorithm. Since it is a time-synchronous Viterbi search, run in the backward direction from the end of the utterance, the path score at any state corresponds to the best state sequence between it and the utterance end. Hence it serves as the desired upper bound. It is an approximation since the path score uses bigram probabilities and not the exact grammar that the A* search uses.

The backward pass also produces a word lattice, similar to the forward Viterbi search. The A* search is constrained to search only the words in the two lattices, and is relatively fast.

The word lattice produced by the backward pass has another desirable property. We noted at the beginning of this chapter that for each word occurrence in the forward pass word lattice, several successive end times are identified along with their scores, whereas very often only the single most likely begin time is identified. The backward pass word lattice produces the complementary result: several beginning times are


identified for a given word occurrence, while usually only the single most likely end time is available. The two lattices can be combined to obtain acoustic probabilities for a wider range of word beginning and ending times, which improves the recognition accuracy.

In the following subsections, we briefly describe the backward Viterbi pass and the A* algorithm used in the Sphinx-II baseline system.

3.3.1 Backward Viterbi Search

The backward Viterbi search is essentially identical to the forward search, except that it is completely reversed in time. The main differences are listed below:

• The input speech is processed in reverse.

• It is constrained to search only the words in the word lattice from the forward pass. Specifically, at any time t, cross-word transitions are restricted to words that exited at t in the forward pass, as determined by the latter's word lattice.

• All HMM transitions, as well as cross-HMM and cross-word NULL transitions, are reversed with respect to the forward pass.

• Cross-word triphone modelling is performed using left-context fanout and dynamic triphone mapping for right contexts.

• Only the bigram probabilities are used. Therefore, the Viterbi path score from any point in the utterance up to the end is only an approximation to the upper bounds desired by the A* search.

The result of the backward Viterbi search is also a word lattice like that from the forward pass. It is rooted at </s>, which ends in the final frame of the utterance, and grows backward in time. The backward pass identifies several beginning times for a word, but typically only one ending time. Acoustic scores for each word segmentation are available in the backward pass word lattice.

3.3.2 A* Search

The A* search algorithm is described in [42]. It works by maintaining an ordered stack or list of partial hypotheses, sorted in descending order of likelihood. Hypotheses are word sequences and may be of different lengths, accounting for different lengths of input speech. Figure 3.7 outlines the basic stack decoding algorithm for finding N-best hypotheses.
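A generic stack decoder loop of this shape, written in the pseudo-code style of Figure 3.5, is sketched below; it is a minimal illustration under assumed helpers, not a reproduction of Figure 3.7:

    /* N-best stack decoding sketch.  Each hypothesis carries its path
     * score g(H) plus the heuristic completion score h(H); pop_best()
     * returns the entry with the highest g(H) + h(H). */
    nbest_search(N)
    {
        n_found = 0;
        push(initial empty hypothesis);
        while (n_found < N and stack not empty) {
            H = pop_best();
            if (H accounts for the entire utterance) {
                output H;                  /* next-best complete hypothesis */
                n_found = n_found + 1;
            } else {
                for (each candidate word w following H, from the lattices)
                    push(extend(H, w));    /* g from lattice scores and the LM,
                                              h from backward-pass path scores */
            }
        }
    }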


data sets have been extensively used by several sites in the past few years, including the speech group at Carnegie Mellon University. But the principal goal of these experiments has been improving the recognition accuracy. The work reported in this thesis is focussed on obtaining other performance measures for the same data sets, namely execution time and memory requirements. We first describe the experimentation methodology in the following section, followed by other sections containing a detailed performance analysis.

3.4.1 Experimentation Methodology

Parameters Measured and Measurement Techniques

The performance analysis in this section provides a detailed look at all aspects of computational efficiency, including a breakdown by the various algorithmic steps in each case. Two different vocabulary sizes, approximately 20,000 and 58,000 words, referred to as the 20K and 58K tasks, respectively, are considered for all experiments. The major parameters measured include the following:

• Recognition accuracy from the first Viterbi pass result and the final A* result. This is covered in detail in Section 3.4.2.

• Overall execution time and its breakdown among the major computational steps. We also provide frequency counts of the most common operations that account for most of the execution time. Section 3.4.3 deals with these measurements. Timing measurements are performed over entire test sets, averaged to per-frame values, and presented in multiples of real time. For example, any computation that takes 23 msec to execute per frame, on average, is said to run in 2.3 times real time, since a frame is 10 msec long. This makes it convenient to estimate the execution cost and usability of individual techniques. Frequency counts are also normalized to per-frame values.

• The breakdown of memory usage among various data structures. This is covered in Section 3.4.4.

Clearly, the execution times reported here are machine-dependent. Even with a single architecture, differences in implementation (such as cache size, memory and bus speeds relative to CPU speed, etc.) can affect the speed performance. Furthermore, for short events, the act of measuring them itself would perturb the results. It is important to keep these caveats in mind in interpreting the timing results. Having said that, we note that all experiments were carried out on one particular model of Digital Equipment Corporation's Alpha workstations. The Alpha architecture [61] includes a special RPCC instruction that allows an application to time very short


events of as little as a few hundred machine cycles with negligible overhead. All timing measurements are normalized to an Alpha processor running at 175MHz. It should also be emphasized that the main computational loops in the Sphinx-II

system have been tuned carefully for optimum speed performance. The measurements reported in this work have been limited almost exclusively to such loops.
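A timing harness of the kind implied here can be sketched as follows; read_cycle_counter() stands in for the machine-specific instruction (RPCC on the Alpha) and is an assumed wrapper:

    extern unsigned long read_cycle_counter(void);  /* e.g., wraps RPCC */

    #define CPU_HZ 175000000UL   /* normalize to a 175MHz processor */

    /* Time one invocation of a short code section, in milliseconds. */
    double time_section_msec(void (*section)(void))
    {
        unsigned long t0 = read_cycle_counter();
        section();                        /* the inner loop being measured */
        unsigned long t1 = read_cycle_counter();
        return (double)(t1 - t0) * 1000.0 / (double)CPU_HZ;
    }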

Test Sets and Experimental Conditions

The test sets used in the experiments have been taken from the various data sets involved in the 1993 and 1994 ARPA hub evaluations. All the test sets consist of clean speech recorded using high quality microphones. Specifically, they consist of the following:

• Dev93: the 1993 development set (commonly referred to as si.dt.20).

• Dev94: the 1994 development set (h1.dt.94).

• Eval94: the 1994 evaluation set (h1.et.94).

The test sets are evaluated individually on the 20K and the 58K tasks. This is important to demonstrate the variation in performance, especially recognition accuracy, with different test sets and vocabulary sizes. The individual performance results allow

an opportunity for comparisons with experiments performed elsewhere that might be restricted to just some of the test sets. Table 3.1 summarizes the number of sentences and words in each test set.

                 Dev93   Dev94   Eval94   Total
    Sentences      503     310      316    1129
    Words         8227    7387     8186   23800

Table 3.1: No. of Words and Sentences in Each Test Set.

The knowledge bases used in each experiment are the following:

• Both the 20K and the 58K tasks use semi-continuous acoustic models of the kind discussed in Section 3.1.1. There are 10,000 senones or tied states in this system.

• The pronunciation lexicons in the 20K tasks are identical to those used by CMU

in the actual evaluations. The lexicon for the 58K task is derived partly from the 20K task and partly from the 100K-word dictionary exported by CMU.


• The Dev93 language model for the 20K task is the standard one used by all sites in 1993. It consists of about 3.5M bigrams and 3.2M trigrams. The 20K grammar for the Dev94 and Eval94 test sets is also the standard one used by all sites, and it consists of about 5.0M bigrams and 6.7M trigrams. The grammar for the 58K tasks is derived from the approximately 230M words of language model training data that became available during the 1994 ARPA evaluations, and it consists of 6.1M bigrams and 18.0M trigrams. The same grammar is used with all test sets.

The following sections contain the detailed performance measurements conducted on the baseline Sphinx-II recognition system.

3.4.2 Recognition Accuracy

Recognition results from the first pass (Viterbi beam search) as well as the final A* pass are presented for both the 20K and 58K tasks. Table 3.2 lists the word error rates on each of the test sets, individually and overall.5,6 Errors include substitutions, insertions and deletions.
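For reference, the word error rate used throughout is the standard ARPA measure (our restatement, consistent with the error types listed above): with S substitutions, D deletions, and I insertions against a reference transcript of N words,

    WER = 100 * (S + D + I) / N  percent.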

                  Dev93   Dev94   Eval94   Mean
    20K (Vit.)     17.6    15.8     15.9    16.4
    20K (A*)       16.5    15.2     15.3    15.7
    58K (Vit.)     15.1    14.3     14.5    14.6
    58K (A*)       13.8    13.8     13.8    13.8

Table 3.2: Percentage Word Error Rate of Baseline Sphinx-II System.

It is clear that the largest single factor that determines the word error rate is the test set itself. In fact, if the input speech were broken down by individual speakers, a

much greater variation would be observed [45, 46]. Part of this might be attributable to different out-of-vocabulary (OOV) rates for the sets of sentences uttered by individual speakers. However, a detailed examination of speaker-by-speaker OOV rate and error rate does not show any strong correlation between the two. The main conclusion is that word error rate comparisons between different systems must be restricted to the same test sets.

5 The accuracy results reported in the actual evaluations are somewhat better than those shown here. The main reason is that the acoustic models used in the evaluations are more complex, consisting of separate codebooks for individual phone classes. We used a single codebook in our experiments instead, since the goal of our study is the cost of the search algorithm, which is about the same in both cases.

6 Note that in all such tables, the overall mean is computed over all the different sets put together. Hence, it is not necessarily just the mean of the means for the individual test sets.


3.4.3 Search Speed

In this section we present a summary of the computational load imposed by the Sphinx-II baseline search architecture. There are three main passes in the system: forward Viterbi beam search, backward Viterbi search, and A* search. The first presents the greatest load of all, and hence we also study the breakdown of that load among its main components: Gaussian density computation, senone score computation, HMM evaluation, and cross-word transitions. These are the four main functions in the forward pass that were introduced in Section 3.2.4. Although we present performance statistics for all components, the following functions in the forward Viterbi search will be the main focus of our discussion:

• HMM evaluation. We present statistics on both execution times as well as the number of HMMs evaluated per frame.

• Cross-word transitions. Again, we focus on execution times and the number of cross-word transitions carried out per frame.

The execution time for each step is presented in terms of multiples of real time taken to process that step. As mentioned earlier, the machine platform for all experiments is the DEC Alpha workstation. All timing measurements are carried out using the RPCC instruction, so that the measurement overhead is minimized. It should

again be emphasized that execution times are heavily influenced by the overall processor, bus, and memory architecture. For this reason, all experiments are carried out on a single machine model. The performance figures presented in this section are normalized to an Alpha processor running at 175MHz.

Overall Execution Times

Table 3.3 summarizes the execution times for both the 20K and 58K tasks. As we can see, the forward Viterbi search accounts for well over 9