

Efficient Algorithms for Speech Recognition

Mosur K. Ravishankar

May 5, 1996
CMU-CS-96-143


School of Computer Science
Computer Science Division
Carnegie Mellon University
Pittsburgh, PA 15213

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy.

Thesis Committee:
Roberto Bisiani, co-chair (University of Milan)
Raj Reddy, co-chair
Alexander Rudnicky
Richard Stern
Wayne Ward

© 1996 Mosur K. Ravishankar

This research was supported by the Department of the Navy, Naval Research Laboratory under Grant No. N00014-93-1-2005. The views and conclusions contained in this document are those of the author and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. government.



Keywords: speech recognition, search algorithms, real-time recognition, lexical tree search, lattice search, fast match algorithms, memory size reduction.


Abstract

Advances in speech technology and computing power have created a surge of interest in the practical application of speech recognition. However, the most accurate speech recognition systems in the research world are still far too slow and expensive to be used in practical, large vocabulary continuous speech applications. Their main goal has been recognition accuracy, with emphasis on acoustic and language modelling. But practical speech recognition also requires the computation to be carried out in real time within the limited resources (CPU power and memory size) of commonly available computers. There has been relatively little work in this direction while preserving the accuracy of research systems.

In this thesis, we focus on efficient and accurate speech recognition. It is easy to improve recognition speed and reduce memory requirements by trading away accuracy, for example by greater pruning, and using simpler acoustic and language models. It is much harder to improve both the recognition speed and reduce main memory size while preserving the accuracy.

This thesis presents several techniques for improving the overall performance of the CMU Sphinx-II system. Sphinx-II employs semi-continuous hidden Markov models for acoustics and trigram language models, and is one of the premier research systems of its kind. The techniques in this thesis are validated on several widely used benchmark test sets using two vocabulary sizes of about 20K and 58K words.

The main contributions of this thesis are an 8-fold speedup and 4-fold memory size reduction over the baseline Sphinx-II system. The improvement in speed is obtained from the following techniques: lexical tree search, phonetic fast match heuristic, and global best path search of the word lattice. The gain in speed from the tree search is about a factor of 5. The phonetic fast match heuristic speeds up the tree search by another factor of 2 by finding the most likely candidate phones active at any time. Though the tree search incurs some loss of accuracy, it also produces compact word lattices with low error rate which can be rescored for accuracy. Such a rescoring is combined with the best path algorithm to find a globally optimum path through a word lattice. This recovers the original accuracy of the baseline system. The total recognition time is about 3 times real time for the 20K task on a 175 MHz DEC Alpha workstation.

The memory requirements of Sphinx-II are minimized by reducing the sizes of the acoustic and language models. The language model is maintained on disk and bigrams and trigrams are read in on demand. Explicit software caching mechanisms effectively overcome the disk access latencies. The acoustic model size is reduced by simply truncating the precision of probability values to 8 bits. Several other engineering solutions, not explored in this thesis, can be applied to reduce memory requirements further. The memory size for the 20K task is reduced to about 30-40 MB.


Acknowledgements

I cannot overstate the debt I owe to Roberto Bisiani and Raj Reddy. They have not only helped me and given me every opportunity to extend my professional career, but also helped me through personal difficulties as well. It is quite remarkable that I have landed not one but two advisors that combine integrity towards research with a human touch that transcends the proverbial hard-headedness of science. One cannot hope for better mentors than them. Alex Rudnicky, Rich Stern, and Wayne Ward all have a clarity of thinking and self-expression that simply amazes me without end. They have given me the most insightful advice, comments, and questions that I could have asked for. Thank you, all.

The CMU speech group has been a pleasure to work with. First of all, I would like to thank some former and current members, Mei-Yuh Hwang, Fil Alleva, Lin Chase, Eric Thayer, Sunil Issar, Bob Weide, and Roni Rosenfeld. They have helped me through the early stages of my induction into the group, and later given invaluable support in my work. I'm fortunate to have inherited the work of Mei-Yuh and Fil. Lin Chase has been a great friend and sounding board for ideas through these years. Eric has been all of that and a great officemate. I have learnt a lot from discussions with Paul Placeway. The rest of the speech group and the robust gang has made it a most lively environment to work in. I hope the charge continues through Sphinx-III and beyond.

I have spent a good fraction of my life in the CMU-CS community so far. It has been, and still is, the greatest intellectual environment. The spirit of cooperation, and informality of interactions, is simply unique. I would like to acknowledge the support of everyone I have ever come to know here, too many to name, from the Warp and Nectar days until now. The administrative folks have always succeeded in blunting the edge of a difficult day. You never know what nickname Catherine Copetas will christen you with next. And Sharon Burks has always put up with all my antics.

It goes without saying that I owe everything to my parents. I have had tremendous support from my brothers, and some very special uncles and aunts. In particular, I must mention the fun I've had with my brother Kuts. I would also like to acknowledge K. Gopinath's help during my stay in Bangalore. Finally, BB, who has suffered through my tantrums on bad days, kept me in touch with the rest of the world, has a most creative outlook on the commonplace, can drive me nuts some days, but when all is said and done, is a most relaxed and comfortable person to have around. Last but not least, I would like to thank Andreas Nowatzyk, Monica Lam, Duane Northcutt and Ray Clark. It has been my good fortune to witness and participate in some of Andreas's creative work. This thesis owes a lot to his unending support and encouragement.


Contents

Abstract

Acknowledgements

1 Introduction
  1.1 The Modelling Problem
  1.2 The Search Problem
  1.3 Thesis Contributions
    1.3.1 Improving Speed
    1.3.2 Reducing Memory Size
  1.4 Summary and Dissertation Outline

2 Background
  2.1 Acoustic Modelling
    2.1.1 Phones and Triphones
    2.1.2 HMM modelling of Phones and Triphones
  2.2 Language Modelling
  2.3 Search Algorithms
    2.3.1 Viterbi Beam Search
  2.4 Related Work
    2.4.1 Tree Structured Lexicons
    2.4.2 Memory Size and Speed Improvements in Whisper
    2.4.3 Search Pruning Using Posterior Phone Probabilities
    2.4.4 Lower Complexity Viterbi Algorithm
  2.5 Summary

3 The Sphinx-II Baseline System
  3.1 Knowledge Sources
    3.1.1 Acoustic Model
    3.1.2 Pronunciation Lexicon
  3.2 Forward Beam Search
    3.2.1 Flat Lexical Structure
    3.2.2 Incorporating the Language Model
    3.2.3 Cross-Word Triphone Modeling
    3.2.4 The Forward Search
  3.3 Backward and A* Search
    3.3.1 Backward Viterbi Search
    3.3.2 A* Search
  3.4 Baseline Sphinx-II System Performance
    3.4.1 Experimentation Methodology
    3.4.2 Recognition Accuracy
    3.4.3 Search Speed
    3.4.4 Memory Usage
  3.5 Baseline System Summary

4 Search Speed Optimization
  4.1 Motivation
  4.2 Lexical Tree Search
    4.2.1 Lexical Tree Construction
    4.2.2 Incorporating Language Model Probabilities
    4.2.3 Outline of Tree Search Algorithm
    4.2.4 Performance of Lexical Tree Search
    4.2.5 Lexical Tree Search Summary
  4.3 Global Best Path Search
    4.3.1 Best Path Search Algorithm
    4.3.2 Performance
    4.3.3 Best Path Search Summary
  4.4 Rescoring Tree-Search Word Lattice
    4.4.1 Motivation
    4.4.2 Performance
    4.4.3 Summary
  4.5 Phonetic Fast Match
    4.5.1 Motivation
    4.5.2 Details of Phonetic Fast Match
    4.5.3 Performance of Fast Match Using All Senones
    4.5.4 Performance of Fast Match Using CI Senones
    4.5.5 Phonetic Fast Match Summary
  4.6 Exploiting Concurrency
    4.6.1 Multiple Levels of Concurrency
    4.6.2 Parallelization Summary
  4.7 Summary of Search Speed Optimization

5 Memory Size Reduction
  5.1 Senone Mixture Weights Compression
  5.2 Disk-Based Language Models
  5.3 Summary of Experiments on Memory Size

6 Small Vocabulary Systems
  6.1 General Issues
  6.2 Performance on ATIS
    6.2.1 Baseline System Performance
    6.2.2 Performance of Lexical Tree Based System
  6.3 Small Vocabulary Systems Summary


List of Figures

2.1 Viterbi Search as Dynamic Programming
3.1 Sphinx-II Signal Processing Front End
3.2 Sphinx-II HMM Topology: 5-State Bakis Model
3.3 Cross-word Triphone Modelling at Word Ends in Sphinx-II
3.4 Word Initial Triphone HMM Modelling in Sphinx-II
3.5 One Frame of Forward Viterbi Beam Search in the Baseline System
3.6 Word Transitions in Sphinx-II Baseline System
3.7 Outline of A* Algorithm in Baseline System
3.8 Language Model Structure in Baseline Sphinx-II System
4.1 Basephone Lexical Tree Example
4.2 Triphone Lexical Tree Example
4.3 Cross-Word Transitions With Flat and Tree Lexicons
4.4 Auxiliary Flat Lexical Structure for Bigram Transitions
4.5 Path Score Adjustment Factor f for Word W_j Upon Its Exit
4.6 One Frame of Forward Viterbi Beam Search in Tree Search Algorithm
4.7 Word Lattice for Utterance: Take Fidelity's case as an example
4.8 Word Lattice Example Represented as a DAG
4.9 Word Lattice DAG Example Using a Trigram Grammar
4.10 Suboptimal Usage of Trigrams in Sphinx-II Viterbi Search
4.11 Base Phones Predicted by Top Scoring Senones in Each Frame; Speech Fragment for Phrase THIS TREND, Pronounced DH-IX-S-T-R-EH-N-DD


4.12 Position of Correct Phone in Ranking Created by Phonetic Fast Match
4.13 Lookahead Window for Smoothing the Active Phone List
4.14 Phonetic Fast Match Performance Using All Senones (20K Task)
4.15 Word Error Rate vs Recognition Speed of Various Systems
4.16 Configuration of a Practical Speech Recognition System


List of Tables

3.1 No. of Words and Sentences in Each Test Set
3.2 Percentage Word Error Rate of Baseline Sphinx-II System
3.3 Overall Execution Times of Baseline Sphinx-II System (x Real Time)
3.4 Baseline Sphinx-II System Forward Viterbi Search Execution Times (x Real Time)
3.5 HMMs Evaluated Per Frame in Baseline Sphinx-II System
3.6 N-gram Transitions Per Frame in Baseline Sphinx-II System
4.1 No. of Nodes at Each Level in Tree and Flat Lexicons
4.2 Execution Times for Lexical Tree Viterbi Search
4.3 Breakdown of Tree Viterbi Search Execution Times (x Real Time)
4.4 No. of HMMs Evaluated Per Frame in Lexical Tree Search
4.5 No. of Language Model Operations/Frame in Lexical Tree Search
4.6 Word Error Rates for Lexical Tree Viterbi Search
4.7 Word Error Rates from Global Best Path Search of Word Lattice Produced by Lexical Tree Search
4.8 Execution Times for Global Best Path DAG Search (x Real Time)
4.9 Word Error Rates From Lexical Tree + Rescoring + Best Path Search
4.10 Execution Times With Rescoring Pass
4.11 Fast Match Using All Senones; Lookahead Window = 3 (20K Task)
4.12 Fast Match Using All Senones; Lookahead Window = 3 (58K Task)
4.13 Fast Match Using CI Senones; Lookahead Window = 3
6.1 Baseline System Performance on ATIS


6.2 Ratio of Number of Root HMMs in Lexical Tree and Words in Lexicon (approximate)
6.3 Execution Times on ATIS
6.4 Breakdown of Tree Search Execution Times on ATIS (Without Phonetic Fast Match)
6.5 Recognition Accuracy on ATIS
A.1 The Sphinx-II Phone Set


Chapter 1

Introduction

Recent advances in speech technology and computing power have created a surge of interest in the practical application of speech recognition. Speech is the primary mode of communication among humans. Our ability to communicate with machines and computers, through keyboards, mice and other devices, is an order of magnitude slower and more cumbersome. In order to make this communication more user-friendly, speech input is an essential component.

There are broadly three classes of speech recognition applications, as described in [53]. In isolated word recognition systems each word is spoken with pauses before and after it, so that end-pointing techniques can be used to identify word boundaries reliably. Second, highly constrained command-and-control applications use small vocabularies, limited to specific phrases, but use connected word or continuous speech. Finally, large vocabulary continuous speech systems have vocabularies of several tens of thousands of words, and sentences can be arbitrarily long, spoken in a natural fashion. The last is the most user-friendly but also the most challenging to implement. However, the most accurate speech recognition systems in the research world are still far too slow and expensive to be used in practical, large vocabulary continuous speech applications on a wide scale.

Speech research has been concentrated heavily on acoustic and language modelling issues. Since the late 1980s, the complexity of tasks undertaken by speech researchers has grown from the 1000-word Resource Management (RM) task [51] to essentially unlimited vocabulary tasks such as transcription of radio news broadcast in 1995 [48]. While the word recognition accuracy has remained impressive, considering the increase in task complexity, the resource requirements have grown as well. The RM task ran about an order of magnitude slower than real time on processors of that day. The unlimited vocabulary tasks run about two orders of magnitude slower than real time on modern workstations whose power has grown by an order of magnitude again, in the meantime.

The task of large vocabulary continuous speech recognition is inherently hard for


the following reasons. First, word boundaries are not known in advance. One must be constantly prepared to encounter such a boundary at every time instant. We can draw a rough analogy to reading a paragraph of text without any punctuation marks or spaces between words:

    myspiritwillsleepinpeaceorifthinksitwillsurelythinkthusfarewellhesprangfromthecabinwindowashesaidthisupontheiceraftwhichlayclosetothevesselhewassoonborneawaybythewavesandlostindarknessanddistance...

Furthermore, many incorrect word hypotheses will be produced from incorrect segmentation of speech. Sophisticated language models that provide word context or semantic information are needed to disambiguate between the available hypotheses. The second problem is that co-articulatory effects are very strong in natural or

conversational speech, so that the sound produced at one instant is influenced by the preceding and following ones. Distinguishing between these requires the use of detailed acoustic models that take such contextual conditions into account. The increasing sophistication of language models and acoustic models, as well as the growth in the complexity of tasks, has far exceeded the computational and memory capacities of commonly available workstations.

Efficient speech recognition for practical applications also requires that the processing be carried out in real time within the limited resources (CPU power and memory size) of commonly available computers. There certainly are various such commercial and demonstration systems in existence, but their performance has never been formally evaluated with respect to the research systems or with respect to one another, in the way that the accuracy of research systems has been. This thesis is primarily concerned with these issues in improving the computational and memory efficiency of current speech recognition technology without compromising the achievements in recognition accuracy.

The three aspects of performance, recognition speed, memory resource requirements, and recognition accuracy, are in mutual conflict. It is relatively easy to improve recognition speed and reduce memory requirements while trading away some accuracy, for example by pruning the search space more drastically, and by using simpler acoustic and language models. Alternatively, one can reduce memory requirements through efficient encoding schemes at the expense of computation time needed to decode such representations, and vice versa. But it is much harder to improve both the recognition speed and reduce main memory requirements while preserving or improving recognition accuracy. In this thesis, we demonstrate algorithmic and heuristic techniques to tackle the problem.

This work has been carried out in the context of the CMU Sphinx-II speech recognition system as a baseline. There are two main schools of speech recognition technology today, based on statistical hidden Markov modelling (HMM), and neural


net technology, respectively. Sphinx-II uses HMM-based statistical modelling techniques and is one of the premier recognizers of its kind. Using several commonly used benchmark test sets and two different vocabulary sizes of about 20,000 and 58,000 words, we demonstrate that the recognition accuracy of the baseline Sphinx-II system can be attained while its execution time is reduced by about an order of magnitude and memory requirements reduced by a factor of about 4.

1.1 The Modelling Problem

As the complexity of tasks tackled by speech research has grown, so has that of the modelling techniques. In systems that use statistical modelling techniques, such as the Sphinx system, this translates into several tens to hundreds of megabytes of memory needed to store information regarding the statistical distributions underlying the models.

Acoustic Models

One of the key issues in acoustic modelling has been the choice of a good unit of speech [32, 27]. In small vocabulary systems of a few tens of words, it is possible to build separate models for entire words, but this approach quickly becomes infeasible as the vocabulary size grows. For one thing, it is hard to obtain sufficient training data to build all individual word models. It is necessary to represent words in terms of sub-word units, and train acoustic models for the latter, in such a way that the pronunciation of new words can be defined in terms of the already trained sub-word units.

The phoneme (or phone) has been the most commonly accepted sub-word unit. There are approximately 50 phones in the spoken English language; words are defined as sequences of such phones¹ (see Appendix A for the Sphinx-II phone set and examples). Each phone is, in turn, modelled by an HMM (described in greater detail in Section 2.1.2).

As mentioned earlier, natural continuous speech has strong co-articulatory effects. Informally, a phone models the position of various articulators in the mouth and nasal passages (such as the tongue and the lips) in the making of a particular sound. Since these articulators have to move smoothly between different sounds in producing speech, each phone is influenced by the neighbouring ones, especially during the transition from one phone to the next. This is not a major concern in small vocabulary systems in which words are not easily confusable, but becomes an issue as the vocabulary size and the degree of confusability increase.

¹Some systems define word pronunciations as networks of phones instead of simple linear sequences [36].


Most systems employ triphones as one form of context-dependent HMM models [4, 3] to deal with this problem. Triphones are basically phones observed in the context of given preceding and succeeding phones. There are approximately 50 phones in the spoken English language. Thus, there can be a total of about 50³ triphones, although only a fraction of them are actually observed in the language. Limiting the vocabulary can further reduce this number. For example, in Sphinx-II, a 20,000 word vocabulary has about 75,000 distinct triphones, each of which is modelled by a 5-state HMM, for a total of about 375,000 states. Since there isn't sufficient training data to build models for each state, they are clustered into equivalence classes called senones [27].

The introduction of context-dependent acoustic models, even after clustering into equivalence classes, creates an explosion in the memory requirements to store such models. For example, the Sphinx-II system with 10,000 senones occupies tens of megabytes of memory.

Language Models

Large vocabulary continuous speech recognition requires the use of a language model or grammar to select the most likely word sequence from the relatively large number of alternative word hypotheses produced during the search process. As mentioned earlier, the absence of explicit word boundary markers in continuous speech causes several additional word hypotheses to be produced, in addition to the intended or correct ones. For example, the phrase It's a nice day can be equally well recognized as It sun iced A. or It son ice day. They are all acoustically indistinguishable, but the word boundaries have been drawn at a different set of locations in each case. Clearly, many more alternatives can be produced with varying degrees of likelihood, given the input speech. The language model is necessary to pick the most likely sequence of words from the available alternatives.

Simple tasks, in which one is only required to recognize a constrained set of phrases, can use rule-based regular or context-free grammars which can be represented compactly. However, that is impossible with large vocabulary tasks. Instead, bigram and trigram grammars, consisting of word pairs and triples with given probabilities of occurrence, are most commonly used. One can also build such language models based on word classes, such as city names, months of the year, etc. However, creating such grammars is tedious as they require a fair amount of hand compilation of the classes. Ordinary word n-gram language models, on the other hand, can be created almost entirely automatically from a corpus of training text.

Clearly, it is infeasible to create a complete set of word bigrams for even medium vocabulary tasks. Thus, the set of bigram and trigram probabilities actually present in a given grammar is usually a small subset of the possible number. Even then, they usually number in the millions for large vocabulary tasks. The memory requirements


for such language models range from several tens to hundreds of megabytes.

1.2 The Search Problem

There are two components to the computational cost of speech recognition: acoustic probability computation, and search. In the case of HMM-based systems, the former refers to the computation of the probability of a given HMM state emitting the observed speech at a given time. The latter refers to the search for the best word sequence given the complete speech input. The search cost is largely unaffected by the complexity of the acoustic models. It is much more heavily influenced by the size of the task. As we shall see later, the search cost is significant for medium and large vocabulary recognition; it is the main focus of this thesis.

Speech recognition, searching for the most likely sequence of words given the input speech, gives rise to an exponential search space if all possible sequences of words are considered. The problem has generally been tackled in two ways: Viterbi decoding [62, 2] using beam search [37], or stack decoding [9, 0], which is a variant of the A* algorithm [42]. Some hybrid versions that combine Viterbi decoding with the A* algorithm also exist [21].

Viterbi Decoding

Viterbi decoding is a dynamic programming algorithm that searches the state space for the most likely state sequence that accounts for the input speech. The state space is constructed by creating each word HMM model from its constituent phone or triphone HMM models, and all word HMM models are searched in parallel. Since the state space is huge for even medium vocabulary applications, the beam search heuristic is usually applied to limit the search by pruning out the less likely states. The combination is often simply referred to as Viterbi beam search. Viterbi decoding is a time-synchronous search that processes the input speech one frame at a time, updating all the states for that frame before moving on to the next frame. Most systems employ a frame input rate of 100 frames/sec. Viterbi decoding is described in greater detail in Section 2.3.1.

Stack Decoding

Stack decoding maintains a stack of partial hypotheses² sorted in descending order of posterior likelihood. At each step it pops the best one off the stack. If it is a complete hypothesis, it is output. Otherwise the algorithm expands it by one word, trying all possible word extensions, evaluates the resulting (partial) hypotheses with respect to the input speech, and re-inserts them in the sorted stack. Any number of N-best hypotheses [59] can be generated in this manner. To avoid an exponential growth in the set of possible word sequences in medium and large vocabulary systems, partial hypotheses are expanded only by a limited set of candidate words at each step. These candidates are identified by a fast match step [6, 7, 8, 20]. Since our experiments have been mostly confined to Viterbi decoding, we do not explore stack decoding in any greater detail.

²A partial hypothesis accounts for an initial portion of the input speech. A complete hypothesis, or simply hypothesis, accounts for the entire input speech.
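As an illustration of this loop, here is a minimal sketch in Python. The helpers passed in are hypothetical, not from the thesis: score(hyp) estimates the likelihood of a (partial) hypothesis against the input speech, is_complete(hyp) tests whether it accounts for the entire input, and candidates(hyp) returns the limited word list a fast match step would supply.

    import heapq

    def stack_decode(score, is_complete, candidates, n_best=10):
        # Stack of partial hypotheses; negated scores make heapq a max-heap.
        stack = [(-score(()), ())]
        results = []
        while stack and len(results) < n_best:
            _, hyp = heapq.heappop(stack)      # pop the best hypothesis
            if is_complete(hyp):
                results.append(hyp)            # emit as the next N-best entry
                continue
            for w in candidates(hyp):          # expand by one word ...
                ext = hyp + (w,)
                heapq.heappush(stack, (-score(ext), ext))  # ... and re-insert
        return results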

Tree Structured Lexicons

Even with the beam search heuristic, straightforward Viterbi decoding is expensive. The network of states to be searched is formed by a linear sequence of HMM models for each word in the vocabulary. The number of models actively searched in this organization is still one to two orders of magnitude beyond the capabilities of modern workstations.

Lexical trees can be used to reduce the size of the search space. Since many words share common pronunciation prefixes, they can also share models and avoid duplication. Trees were initially used in fast match algorithms for producing candidate word lists for further search. Recently, they have been introduced in the main search component of several systems [44, 39, 43]. The main problem faced by them is in using a language model. Normally, transitions between words are accompanied by a prior language model probability. But with trees, the destination nodes of such transitions are not individual words but entire groups of them, related phonetically but quite unrelated grammatically. An efficient solution to this problem is one of the important contributions of this thesis.

Multipass Search Techniques

Viterbi search algorithms usually also create a word lattice in addition to the best recognition hypothesis. The lattice includes several alternative words that were recognized at any given time during the search. It also typically contains other information such as the time segmentations for these words, and their posterior acoustic scores (i.e., the probability of observing a word given that time segment of input speech). The lattice error rate measures the number of correct words missing from the lattice around the expected time. It is typically much lower than the word error rate³ of the single best hypotheses produced for each sentence.

³Word error rates are measured by counting the number of word substitutions, deletions, and insertions in the hypothesis, compared to the correct reference sentence.

Word lattices can be kept very compact, with low lattice error rate, if they are produced using sufficiently detailed acoustic models (as opposed to primitive models


as in, for example, fast match algorithms). In our work, a 10 sec long sentence typically produces a word lattice containing about 1000 word instances.

Given such compact lattices with low error rates, one can search them using sophisticated models and search algorithms very efficiently and obtain results with a lower word error rate, as described in [38, 65, 1]. Most systems use such multipass techniques.

However, there has been relatively little work reported on actually creating such lattices efficiently. This is important for the practical applicability of such techniques. Lattices can be created with low computational overhead if we use simple models, but their size must be large to guarantee a sufficiently low lattice error rate. On the other hand, compact, low-error lattices can be created using more sophisticated models, at the expense of more computation time. The efficient creation of compact, low-error lattices for efficient postprocessing is another byproduct of this work.

1.3 Thesis Contributions

This thesis explores ways of improving the performance of speech recognition systems along the dimensions of recognition speed and efficiency of memory usage, while preserving the recognition accuracy of research systems. As mentioned earlier, this is a much harder problem than if we are allowed to trade recognition accuracy for improvement in speed and memory usage.

In order to make meaningful comparisons, the baseline performance of an established research system is first measured. We use the CMU Sphinx-II system as the baseline system since it has been extensively used in the yearly ARPA evaluations. It has known recognition accuracy on various test sets, and similarities to many other research systems. The parameters measured include, in addition to recognition accuracy, the CPU usage of various steps during execution, frequency counts of the most time-consuming operations, and memory usage. All tests are carried out using two vocabulary sizes of about 20,000 (20K) and 58,000 (58K) words, respectively. The test sentences are taken from the ARPA evaluations in 1993 and 1994 [45, 46].

The results from this analysis show that the search component is several tens of times slower than real time on the reported tasks. (The acoustic output probability computation is relatively smaller since these tests have been conducted using semi-continuous acoustic models [28, 27].) Furthermore, the search time itself can be further decomposed into two main components: the evaluation of HMM models, and carrying out cross-word transitions at word boundaries. The former is simply a measure of the task complexity. The latter is a significant problem since there are cross-word transitions to every word in the vocabulary, and language model probabilities must be computed for every one of them.


1.3.1 Improving Speed

The work presented in this thesis shows that a new adaptation of lexical tree search can be used to reduce both the number of HMMs evaluated and the cost of cross-word transitions. In this method, language model probabilities for a word are computed not when entering that word but upon its exit, if it is one of the recognized candidates. The number of such candidates at a given instant is on average about two orders of magnitude smaller than the vocabulary size. Furthermore, the proportion appears to decrease with increasing vocabulary size.

Using this method, the execution time for recognition is decreased by a factor of about 4.8 for both the 20K and 58K word tasks. If we exclude the acoustic output probability computation, the speedup of the search component alone is about 6.3 for the 20K word task and over 7 for the 58K task. It also demonstrates that the lexical tree search efficiently produces compact word lattices with low error rates that can again be efficiently searched using more complex models and search algorithms. Even though there is a relative loss of accuracy of about 20% using this method, we show that it can be recovered efficiently by postprocessing the word lattice produced by the lexical tree search. The loss is attributed to suboptimal word segmentations produced by the tree search. However, a new shortest-path graph search formulation for searching the word lattice can reduce the loss in accuracy to under 10% relative to the baseline system with a negligible increase in computation.

If the lattice is first rescored to obtain better word segmentations, all the loss in accuracy is recovered. The rescoring step adds less than 20% execution time overhead, giving an effective overall speedup of about 4 over the baseline system.

We have applied a new phonetic fast match step to the lexical tree search that performs an initial pruning of the context independent phones to be searched. This technique reduces the overall execution time by about 40-45%, with a less than 2% relative loss in accuracy. This brings the overall speed of the system to about 8 times that of the baseline system, with almost no loss of accuracy. The structure of the final decoder is a pipeline of several stages which can be operated in an overlapped fashion. Parallelism among stages, especially the lexical tree search and rescoring passes, is possible for additional improvement in speed.

1.3.2 Reducing Memory Size

The two main candidates for memory usage in the baseline Sphinx-II system, and most of the common research systems, are the acoustic and language models.

The key observation for reducing the size of the language models is that in decoding any given utterance, only a small portion of it is actually used. Hence, we can


consider maintaining the language model entirely on disk, and retrieving only the necessary information on demand. Caching schemes can overcome the large disk-access latencies. One might expect the virtual memory systems to perform this function automatically. However, they don't appear to be efficient at managing the language model working set since the granularity of access to the related data structures is much smaller than a page size.

We have implemented simple caching rules and replacement policies for bigrams and trigrams, which show that the memory resident portion of large bigram and trigram language models can be reduced significantly. In our benchmarks, the number of bigrams in memory is reduced to about 5-25% of the total, and that of trigrams to about 2-5% of the total. The impact of disk accesses on elapsed time performance is minimal, showing that the caching policies are effective. We believe that further reductions in size can be easily obtained by various compression techniques, such as a reduction in the precision of representation.
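As a rough illustration of the mechanism (the actual Sphinx-II caching rules and replacement policies are described in Chapter 5; the record layout and index below are assumptions), a disk-resident bigram table with an in-memory cache might look like this:

    RECORD_SIZE = 8  # assumed size in bytes of one packed bigram record

    class DiskBigrams:
        # Bigram table kept on disk; a word's successor list is read in on
        # demand and cached in memory (no replacement policy shown here).
        def __init__(self, lm_path, seek_index):
            self.f = open(lm_path, "rb")
            self.index = seek_index   # word id -> (file offset, record count)
            self.cache = {}           # word id -> raw successor records

        def successors(self, w):
            if w not in self.cache:   # cache miss: one seek plus one read
                offset, count = self.index[w]
                self.f.seek(offset)
                self.cache[w] = self.f.read(count * RECORD_SIZE)
            return self.cache[w]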

The size of the acoustic models is trivially reduced by a factor of 4, simply by reducing the precision of their representation from 32 bits to 8 bits, with no difference in accuracy. This has, in fact, been done in many other systems as in [25]. The new observation is that in addition to memory size reduction, the smaller precision also allows us to speed up the computation of acoustic output probabilities of senones every frame. The computation involves the summation of probabilities in log-domain, which is cumbersome. The 8-bit representation of such operands allows us to achieve this with a simple table lookup operation, improving the speed of this step by about a factor of 2.
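The trick rests on the identity log(p1 + p2) = log p1 + log(1 + e^(log p2 − log p1)) for p1 ≥ p2: the correction term depends only on the difference of the two log values, so with quantized 8-bit log probabilities it can be precomputed for every possible difference. A minimal sketch, with an assumed quantization scale:

    import numpy as np

    SCALE = 32.0      # assumed quantization: stored value = round(SCALE * ln p)
    DIFF_RANGE = 256  # corrections beyond this difference are effectively zero
    ADD_TABLE = np.round(
        SCALE * np.log1p(np.exp(np.arange(DIFF_RANGE) / -SCALE))).astype(int)

    def log_add(a, b):
        # log(p1 + p2) on quantized log values, via a single table lookup.
        hi, lo = (a, b) if a >= b else (b, a)
        d = hi - lo
        return hi + (ADD_TABLE[d] if d < DIFF_RANGE else 0)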

1.4 Summary and Dissertation Outline

In summary, this thesis presents a number of techniques for improving the speed of the baseline Sphinx-II system by about an order of magnitude, and reducing its memory requirements by a factor of 4, without significant loss of accuracy. In doing so, it demonstrates several facts:

• It is possible to build efficient speech recognition systems comparable to research systems in accuracy.

• It is possible to separate concerns of search complexity from that of modelling complexity. By using semi-continuous acoustic models and efficient search strategies to produce compact word lattices with low error rates, and restricting the more detailed models to search such lattices, the overall performance of the system is optimized.

• It is necessary and possible to make decisions for pruning large portions of the search space away with low cost and high reliability. The beam search heuristic


is a well known example of this principle. The phonetic fast match method and the reduction in precision of probability values also fall under this category.

The organization of this thesis is as follows. Chapter 2 contains background

material and brief descriptions of related work done in this area. Since recognition speed and memory efficiency has not been an explicit consideration in the research community so far, in the way that recognition accuracy has been, there is relatively little material in this regard.

Chapter 3 is mainly concerned with establishing baseline performance figures for the Sphinx-II research system. It includes a comprehensive description of the baseline system, specifications of the benchmark tests and experimental conditions used throughout this thesis, and detailed performance figures, including accuracy, speed and memory requirements.

Chapter 4 is one of the main chapters in this thesis; it describes all of the new techniques to speed up recognition and their results on the benchmark tests. Both the baseline and the improved system use the same set of acoustic and language models.

Techniques for memory size reduction and corresponding results are presented in Chapter 5. It should be noted that most experiments reported in this thesis were conducted with these optimizations in place. Though this thesis is primarily concerned with large vocabulary recognition, it is interesting to consider the applicability of the techniques developed here to smaller vocabulary situations. Chapter 6 addresses the concerns relating to small and extremely small vocabulary tasks. The issues of efficiency are quite different in their

case, and the problems are also different. The performance of both the baseline Sphinx-II system and the proposed experimental system are evaluated and compared on the ATIS (Airline Travel Information Service) task, which has a vocabulary of about 3,000 words.

Finally, Chapter 7 concludes with a summary of the results, contributions of this thesis and some thoughts on future directions for search algorithms.


Chapter 2

Background

This chapter contains a brief review of the necessary background material to understand the commonly used modelling and search techniques in speech recognition. Sections 2.1 and 2.2 cover basic features of statistical acoustic and language modelling, respectively. Viterbi decoding using beam search is described in Section 2.3, while related research on efficient search techniques is covered in Section 2.4.

2.1 Acoustic Modelling

2.1.1 Phones and Triphones

The objective of speech recognition is the transcription of speech into text, i.e., word strings. To accomplish this, one might wish to create word models from training data. However, in the case of large vocabulary speech recognition, there are simply too many words to be trained in this way. It is necessary to obtain several samples of every word from several different speakers, in order to create reasonable speaker-independent models for each word. Furthermore, the process must be repeated for each new word that is added to the vocabulary.

The problem is solved by creating acoustic models for sub-word units. All words are composed of basically a small set of sounds or sub-word units, such as syllables or phonemes, which can be modelled and shared across different words.

Phonetic models are the most frequently used sub-word models. There are only about 50 phones in spoken English (see Appendix A for the set of phones used in Sphinx-II). New words can simply be added to the vocabulary by defining their pronunciation in terms of such phones.

The production of sound corresponding to a phone is influenced by neighbouring phones. For example, the AE phone in the word man sounds different from that in


lack; the former is more nasal. IBM [4] proposed the use of triphone or context-dependent phone models to deal with such variations. With 50 phones, there can be up to 50³ triphones, but only a fraction of them are actually observed in practice. Virtually all speech recognition systems now use such context dependent models.

2.1.2 HMM modelling of Phones and Triphones

Most systems use hidden Markov models (HMMs) to represent the basic units of speech. The usage and training of HMMs has been covered widely in the literature. Initially described by Baum in [11], it was first used in speech recognition systems by CMU [10] and IBM [29]. The use of HMMs in speech has been described, for example, by Rabiner [52]. Currently, almost all systems use HMMs for modelling triphones and context-independent phones (also referred to as monophones or basephones). These include BBN [41], CMU [35, 27], the Cambridge HTK system [65], IBM [5], and LIMSI [18], among others. We will give a brief description of HMMs as used in speech.

First of all, the sampled speech input is usually preprocessed, through various signal-processing steps, into a cepstrum or other feature stream that contains one feature vector every frame. Frames are typically spaced at 10 msec intervals. Some systems produce multiple, parallel feature streams. For example, Sphinx has 4 feature streams (cepstra, Δcepstra, ΔΔcepstra, and power) representing the speech signal (see Section 3.1.1).

An HMM is a set of states connected by transitions (see Figure 3.2 for an example). Transitions model the emission of one frame of speech. Each HMM transition has an associated output probability function that defines the probability of emitting the input feature observed in any given frame while taking that transition. In practice, most systems associate the output probability function with the source or destination state of the transition, rather than the transition itself. Henceforth, we shall assume that the output probability is associated with the source state. The output probability for state i at time t is usually denoted by b_i(t). (Actually, b_i is not a function of t, but rather a function of the input speech, which is a function of t. However, we shall often use the notation b_i(t) with this implicit understanding.)

Each HMM transition from any state i to state j also has a static transition probability, usually denoted by a_ij, which is independent of the speech input.

Thus, each HMM state occupies (represents) a small subspace of the overall feature space. The shape of this subspace is sufficiently complex that it cannot be accurately characterized by a simple mathematical distribution. For mathematical tractability, the most common general approach has been to model the state output probability by a mixture Gaussian codebook. For any HMM state s and feature stream f, the i-th component of such a codebook is a normal distribution with mean vector μ_{s,f,i} and covariance matrix U_{s,f,i}. In order to simplify the computation and also


because there is often insufficient data to estimate all the parameters of the covariance matrix, most systems assume independence of dimensions and therefore the covariance matrix becomes diagonal. Thus, we can simply use standard deviation vectors σ_{s,f,i} instead of U_{s,f,i}. Finally, each such mixture component also has a scalar mixture coefficient or mixture weight w_{s,f,i}. With that, the probability of observing a given speech input x in HMM state s is given by:

b_s(x) = ∏_f ( Σ_i w_{s,f,i} · f(x_f; μ_{s,f,i}, σ_{s,f,i}) )        (2.1)

where the speech input x is the parallel set of feature vectors, and x_f its f-th feature component; i ranges over the number of Gaussian densities in the mixture and f over the number of features. The expression f(·) is the value of the chosen component Gaussian density function at x_f.

In the general case of fully continuous HMMs, each HMM state s in the acoustic model has its own separate weighted mixture Gaussian codebook. However, this is computationally expensive, and many schemes are used to reduce this cost. It also results in too many free parameters. Most systems group HMM states into clusters that share the same set of model parameters. The sharing can be of different degrees. In semi-continuous systems, all states share a single mixture Gaussian codebook, but the mixture coefficients are distinct for individual states. In Sphinx-II, states are grouped into clusters called senones [27], with a single codebook (per feature stream) shared among all senones, but distinct mixture weights for each. Thus, Sphinx-II uses semi-continuous modelling with state clustering.
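The computational benefit of this sharing can be seen in a short sketch: the shared codebook densities are evaluated once per frame, and each senone then merely mixes those precomputed values with its own weights. The array shapes and names are illustrative, not the Sphinx-II implementation:

    import numpy as np

    def codebook_log_densities(x, means, variances):
        # Log density of frame vector x under each shared diagonal-covariance
        # Gaussian; means and variances have shape (n_densities, dim).
        diff = x - means
        return -0.5 * np.sum(diff * diff / variances
                             + np.log(2 * np.pi * variances), axis=1)

    def senone_scores(x, means, variances, mixture_weights):
        # Score all senones for one frame. mixture_weights has shape
        # (n_senones, n_densities): one weight vector per senone.
        log_dens = codebook_log_densities(x, means, variances)
        peak = log_dens.max()               # rescale for numerical stability
        dens = np.exp(log_dens - peak)
        return np.log(mixture_weights @ dens) + peak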

Even simpler discrete HMM models can be derived by replacing the mean and variance vectors representing Gaussian densities with a single centroid. In every frame, the single closest centroid to the input feature vector is computed (using the Euclidean distance measure), and individual states weight the codeword so chosen. Discrete models are typically only used in making approximate searches such as in fast match algorithms.

For simplicity of modelling, HMMs can have NULL transitions that do not con-

sume any time and hence do not model the emission of speech. Word HMMs can be built by simply stringing together phonetic HMM models using NULL transitions as appropriate.

2.2 Language Modelling

As mentioned in Chapter 1, a language model (LM) is required in large vocabulary speech recognition for disambiguating between the large set of alternative, confusable words that might be hypothesized during the search.


The LM defines the a priori probability of a sequence of words. The LM probability of a sentence (i.e., a sequence of words w_1, w_2, ..., w_n) is given by:

P(w_1) P(w_2|w_1) P(w_3|w_1,w_2) P(w_4|w_1,w_2,w_3) ··· P(w_n|w_1,...,w_{n-1}) = ∏_{i=1}^{n} P(w_i|w_1,...,w_{i-1}).

In an expression such as P(w_i|w_1,...,w_{i-1}), the sequence w_1,...,w_{i-1} is the word history, or simply history, for w_i. In practice, one cannot obtain reliable probability estimates given arbitrarily long histories since that would require enormous amounts of training data. Instead, one usually approximates them in the following ways:

• Context free grammars or regular grammars. Such LMs are used to define the form of well structured sentences or phrases. Deviations from the prescribed structure are not permitted. Such formal grammars are never used in large vocabulary systems since they are too restrictive.

• Word unigram, bigram, trigram grammars. These are defined respectively as follows (higher-order n-grams can be defined similarly):

  P(w) = probability of word w
  P(w_j|w_i) = probability of w_j given a one word history w_i
  P(w_k|w_i,w_j) = probability of w_k given a two word history w_i, w_j

A bigram grammar need not contain probabilities for all possible word pairs. In fact, that would be prohibitive for all but the smallest vocabularies. Instead, it typically lists only the most frequently occurring bigrams, and uses a backoff mechanism to fall back on unigram probability when the desired bigram is not found. In other words, if P(w_j|w_i) is sought and is not found, one falls back on P(w_j). But a backoff weight is applied to account for the fact that w_j is known to be not one of the bigram successors of w_i [30]. Other higher-order backoff n-gram grammars can be defined similarly. (A small sketch of this lookup is given after this list.)

• Class n-gram grammars. These are similar to word n-gram grammars, except that the tokens are entire word classes, such as digit, number, month, proper name, etc. The creation and use of class grammars is tricky since words can belong to multiple classes. There is also a fair amount of handcrafting involved.

• Long distance grammars. Unlike n-gram LMs, these are capable of relating words separated by some distance (i.e., with some intervening words). For example, the trigger-pair mechanism discussed in [57] is of this variety. Long distance grammars are primarily used to rescore N-best hypothesis lists from previous decodings.
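A minimal sketch of the backoff lookup mentioned above, assuming the grammar is held in plain dictionaries of log probabilities and backoff weights (illustrative structures, not any particular LM file format):

    def bigram_logprob(w1, w2, bigrams, unigrams, backoff):
        # log P(w2 | w1) with backoff to the unigram.
        if (w1, w2) in bigrams:
            return bigrams[(w1, w2)]   # explicit bigram entry exists
        # Not listed: back off, penalized by w1's backoff weight.
        return backoff[w1] + unigrams[w2]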


Figure 2.1: Viterbi Search as Dynamic Programming (states on one axis, time on the other; paths run from the start state to a final state).

Of the above, word bigram and trigram grammars are the most commonly used since they are easy to train from large volumes of data, requiring minimal manual intervention. They have also provided high degrees of recognition accuracy. The Sphinx-II system uses word trigram LMs.

2.3 Search Algorithms

The two main forms of decoding most commonly used today are Viterbi decoding using the beam search heuristic, and stack decoding. Since the work reported in this thesis is based on the former, we briefly review its basic principles here.

2.3.1 Viterbi Beam Search

Viterbi search [62] is essentially a dynamic programming algorithm, consisting of traversing a network of HMM states and maintaining the best possible path score at each state in each frame. It is a time-synchronous search algorithm in that it processes all states completely at time t before moving on to time t+1.

The abstract algorithm can be understood with the help of Figure 2.1. One dimension represents the states in the network, and the other is the time axis. There is typically one start state and one or more final states in the network. The arrows depict possible state transitions through the network. In particular, NULL transitions go vertically since they do not consume any input, and non-NULL transitions always go one time step forward. Each point in this 2-D space represents the best path probability for the corresponding state at that time. That is, given a time t and state s, the value at (t,s) represents the probability corresponding to the best state sequence leading from the initial state at time 0 to state s at time t.

The time-synchronous nature of the Viterbi search implies that the 2-D space is traversed from left to right, starting at time 0. The search is initialized at time


t = 0 with the path probability at the start state set to 1, and at all other states to 0. In each frame, the computation consists of evaluating all transitions between the previous frame and the current frame, and then evaluating all NULL transitions within the current frame. For non-NULL transitions, the algorithm is summarized by the following expression:

P_j(t) = max_i ( P_i(t−1) · a_ij · b_i(t) ),  i ∈ set of predecessor states of j        (2.2)

where P_j(t) is the path probability of state j at time t, a_ij is the static probability associated with the transition from state i to j, and b_i(t) is the output probability associated with state i while consuming the input speech at t (see Section 2.1.2 and equation 2.1). It is straightforward to extend this formulation to include NULL transitions that do not consume any input.

Thus, every state has a single best predecessor at each time instant. With some simple bookkeeping to maintain this information, one can easily determine the best state sequence for the entire search by starting at the final state at the end and following the best predecessor at each step all the way back to the start state. Such an example is shown by the bold arrows in Figure 2.1.

The complexity of Viterbi decoding is O(N²T) (assuming each state can transition to every state at each time step), where N is the total number of states and T is the total duration.

The application of Viterbi decoding to continuous speech recognition is straight-

forward. Word HMMs are built by stringing together phonetic HMM models using NULL transitions between the final state of one and the start state of the next. In addition, NULL transitions are added from the final state of each word to the initial state of all words in the vocabulary, thus modelling continuous speech. Language model (bigram) probabilities are associated with every one of these cross-word transitions. Note that a system with a vocabulary of V words has V² possible cross-word transitions. All word HMMs are searched in parallel according to equation 2.2.

Since even a small to medium vocabulary system consists of hundreds or thousands of HMM states, the state-time matrix of Figure 2.1 quickly becomes too large and costly to compute in its entirety. To keep the computation within manageable limits, only the most likely states are evaluated in each frame, according to the beam search heuristic [37]. At the end of time t, the state with the highest path probability p_max(t) is found. If any other state i has P_i(t) less than p_max(t) times an empirically chosen beamwidth threshold, that state is pruned and not expanded in the next frame.
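One frame of this computation, restricted to non-NULL transitions and followed by beam pruning, might be sketched as follows; scores are kept in the log domain, the data structures are illustrative, and the NULL-transition pass is omitted:

    def viterbi_frame(active, transitions, log_b, log_beam=-100.0):
        # active:      state i -> (log path score, best predecessor)
        # transitions: state i -> list of (successor j, log a_ij)
        # log_b:       state i -> log output probability b_i(t) for this frame
        new = {}
        for i, (score, _) in active.items():
            emit = score + log_b[i]                   # consume the frame in i
            for j, log_a in transitions[i]:
                cand = emit + log_a                   # equation 2.2, log domain
                if j not in new or cand > new[j][0]:  # keep best predecessor
                    new[j] = (cand, i)
        # Beam pruning: drop states far below this frame's best path score.
        best = max(score for score, _ in new.values())
        return {j: v for j, v in new.items() if v[0] >= best + log_beam}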


2.4 Related Work

Some of the standard techniques in reducing the computational load of Viterbi search for large vocabulary continuous speech recognition have been the following:

• Narrowing the beamwidth for greater pruning. However, this is usually associated with an increase in error rate because of an increase in the number of search errors: the correct word sometimes gets pruned from the search path in the bargain.

• Reducing the complexity of acoustic and language models. This approach works to some extent, especially if it is followed by a more detailed search in later passes. There is a tradeoff here, between the computational load of the first pass and subsequent ones. The use of detailed models in the first pass produces compact word lattices with low error rate that can be postprocessed efficiently, but the first pass itself is computationally expensive. Its cost can be reduced if simpler models are employed, at the cost of an increase in lattice size needed to guarantee low lattice error rates.

Both the above techniques involve some tradeoff between recognition accuracy and speed.

2.4.1 Tree Structured Lexicons

Organizing the HMMs to be searched as a phonetic tree instead of the flat structure of independent linear HMM sequences for each word is probably the most often cited improvement in search techniques in use currently. This structure is referred to as a tree-structured lexicon or lexical tree. If the pronunciations of two or more words contain the same n initial phonemes, they share a single sequence of n HMM models representing that initial portion of their pronunciation. (In practice, most systems use triphones instead of just basephones, so we should really consider triphone pronunciation sequences. But the basic argument is the same.) Since the word-initial models in a non-tree structured Viterbi search are typically the majority of the total number of active models, the reduction in computation is significant.
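The sharing is essentially a trie over phone sequences. A minimal construction sketch, with an illustrative toy lexicon (phone symbols in the style of Appendix A):

    def build_lexical_tree(lexicon):
        # lexicon: word -> list of phones. Words with identical initial phones
        # share trie nodes, so their word-initial HMMs are evaluated only once;
        # the word identity is known only where a pronunciation ends.
        root = {}
        for word, phones in lexicon.items():
            node = root
            for phone in phones:
                node = node.setdefault(phone, {})  # shared prefix, shared node
            node["#word"] = word
        return root

    tree = build_lexical_tree({
        "start":   ["S", "T", "AA", "R", "T"],
        "started": ["S", "T", "AA", "R", "T", "IX", "DD"],
        "stop":    ["S", "T", "AA", "P"],
    })
    # All three words share the S -> T -> AA nodes of the tree.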

The problem with a lexical tree occurs at word boundary transitions where bigram language model probabilities are usually computed and applied. In the flat (non-tree) Viterbi algorithm there is a transition from each word ending state (within the beam) to the beginning of every word in the vocabulary. Thus, there is a fan-in at the initial state of every word, with different bigram probabilities attached to every such transition. The Viterbi algorithm chooses the best incoming transition in each case.

However, with a lexical tree structure, several words may share the same root node of the tree. There can be a conflict between the best incoming cross-word transition


for different words that share the same root node. This problem has been usually solved by making copies of the lexical tree to resolve such conflicts.

Approximate Bigram Trees

SRI [39] and CRIM [43] augment their lexical tree structure with a flat copy of the lexicon that is activated for bigram transitions. All bigram transitions enter the flat lexicon copy, while the backed off unigram transitions enter the roots of the lexical tree. SRI notes that relying on just unigrams more than doubles the word error rate. They show that using this scheme, the recognition speed is improved by a factor of 2-3 for approximately the same accuracy. To gain further improvements in speed, they reduce the size of the bigram section by pruning the bigram language model in various ways, which adds significantly to the error rate. However, it should be noted that the experimental setup is based on using discrete HMM acoustic models, with a baseline system word error rate (21.5%) which is significantly worse than their best research system (10.3%) using bigrams, and also worse than most other research systems to begin with.

As we shall see in Chapter 3, bigram transitions constitute a significant portion of cross word transitions, which in turn are a dominant part of the search cost. Hence, the use of a flat lexical structure for bigram transitions must continue to incur this cost.

Replicated Bigram Trees

Ney and others [40] have suggested creating copies of the lexical tree to handle bigram transitions. The leaf nodes of the first level (unigram) lexical tree have secondary (bigram) trees hanging off them for bigram transitions. The total size of the secondary trees depends on the number of bigrams present in the grammar. Secondary trees that represent the bigram followers of the most common function words, such as A, THE, IN, OF, etc. are usually large.

This scheme creates additional copies of words that did not exist in the original flat structure. For example, in the conventional flat lexicon (or in the auxiliary flat lexicon copy of [39]), there is only one instance of each word. However, in this proposed scheme the same word can appear in multiple secondary trees. Since the short function words are recognized often (though spuriously), their bigram copies are frequently active. They are also among the larger ones, as noted above. It is unclear how much overhead this adds to the system.


Dynamic Network Decoding

Cambridge University [44] designed a one-pass decoder that uses the lexical tree structure, with copies for cross-word transitions, but instantiates new copies at every transition, as necessary. Basically, the traditional re-entrant lexical structure is replaced with a non-re-entrant structure. To prevent an explosion in memory space requirements, they reclaim HMM nodes as soon as they become inactive by falling outside the pruning beamwidth. Furthermore, the endpoints of multiple instances of the same word can be merged under the proper conditions, allowing just one instance of the lexical tree to be propagated from the merged word ends, instead of separately and multiply from each. This system attained the highest recognition accuracy in the Nov 1993 evaluations.

They report the performance under standard conditions (the standard 1993 20K Wall Street Journal development test set decoded using the corresponding standard bigram/trigram language model) using wide beamwidths as in the actual evaluations. The number of active HMM models per frame in this scheme is actually higher than the number in the baseline Sphinx-II system under similar test conditions (except that Sphinx-II uses a different lexicon and acoustic models). There are other factors at work, but the dynamic instantiation of lexical trees certainly plays a part in this increase. The overhead for dynamically constructing the HMM network is reported to be less than 20% of the total computational load. This is actually fairly high since the time to decode a sentence on an HP 735 platform is reported to be about 5 minutes on average.

2.4.2 Memory Size and Speed Improvements in Whisper

The CMU Sphinx-II system has been improved in many ways by Microsoft in producing the Whisper system [26]. They report that memory size has been reduced by a factor of 20 and speed improved by a factor of 5, compared to Sphinx-II under the same accuracy constraints.

One of the schemes for memory reduction is the use of a context free grammar (CFG) in place of bigram or trigram grammars. CFGs are highly compact, can be searched efficiently, and can be relatively easily created for small tasks such as command and control applications involving a few hundred words. However, large vocabulary applications cannot be so rigidly constrained.

They also obtain an improvement of about 5% in the memory size of acoustic models by using run length encoding for senone weighting coefficients (Section 2.1.2).

They have also improved the speed performance of Whisper through a Rich Get Richer (RGR) heuristic for deciding which phones should be evaluated in detail, using triphone states, and which should fall back on context independent phone states.


RGR works as follows: let P_p(t) be the best path probability of any state belonging to basephone p at time t, p_max(t) the best path probability over all states at t, and b_p(t+1) the output probability of the context-independent model for p at time t+1. Then, the context-dependent states for phone p are evaluated at frame t+1 iff:

α · P_p(t) + b_p(t+1) > p_max(t) − K

where α and K are empirically determined constants. Otherwise, context-independent output probabilities are used for those states. (All probabilities are computed in log-space. Hence the addition operations really represent multiplications in normal probability space.)
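Read as code, the gating test is a one-line predicate; alpha and K are the empirically tuned constants described above, and all quantities are log-domain scores:

    def evaluate_in_detail(log_P_p, log_b_p_next, log_pmax, alpha, K):
        # True: evaluate phone p's context-dependent (triphone) states at t+1.
        # False: fall back on its context-independent states instead.
        return alpha * log_P_p + log_b_p_next > log_pmax - K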

Using this heuristic, they report an 80% reduction in the number of context dependent states for which output probabilities are computed, with no loss of accuracy. If the parameters α and K are tightened to reduce the number of context-dependent states evaluated by 95%, there is a 5% relative loss of accuracy. (The baseline test conditions have not been specified for these experiments.)

2.4.3 Search Pruning Using Posterior Phone Probabilities

In [56], Renals and Hochberg describe a method of deactivating certain phones during search to achieve higher recognition speed. The method is incorporated into a fast match pass that produces words and posterior probabilities for their NOWAY stack decoder. The fast match step uses HMM basephone models, the states of which are modelled by neural networks that directly estimate phone posterior probabilities instead of the usual likelihoods; i.e., they estimate P(phone|data) instead of P(data|phone). Using the posterior phone probability information, one can identify the less likely active phones at any given time and prune the search accordingly.

This is a potentially powerful and easy pruning technique when the posterior phone probabilities are available. Stack decoders can particularly gain if the fast match step can be made to limit the number of candidate words emitted while extending a partial hypothesis. In their NOWAY implementation, a speedup of about an order of magnitude is observed on a 20K vocabulary task (from about 50x real time to about 15x real time) on an HP 735 workstation. They do not report the reduction in the number of active HMMs as a result of this pruning.

2.4.4 Lower Complexity Viterbi Algorithm

A new approach to the Viterbi algorithm, specifically applicable to speech recognition, is described by Patel in [49]. It is aimed at reducing the cost of the large number of cross-word transitions and has an expected complexity of O(N√N·T), instead of O(N²T) (Section 2.3.1). The algorithm depends on ordering the exit path probabilities and


transition (bigram) probabilities, and finding a threshold such that most transitions can be eliminated from consideration.

The authors indicate that the algorithm offers better performance if every word has bigram transitions to the entire vocabulary. However, this is not the case with large vocabulary systems. Nevertheless, it is worth exploring this technique further for its practical applicability.

2.5 Summary

In this chapter we have covered the basic modelling principles and search techniques commonly used in speech recognition today. We have also briefly reviewed a number of systems and techniques used to improve their speed and memory requirements. One of the main themes running through this work is that virtually none of the practical implementations have been formally evaluated with respect to the research systems on well established test sets under widely used test conditions, or with respect to one another.

In the rest of this thesis, we evaluate the baseline Sphinx-II system under normal evaluation conditions and use the results for comparison with our other experiments.


Chapter 3

The Sphinx-II Baseline System

As mentioned in the previous chapters, there is relatively little published work on the performance of speech recognition systems, measured along the dimensions of recognition accuracy, speed and resource utilization. The purpose of this chapter is to establish a comprehensive account of the performance of a baseline system that has been considered a premier representative of its kind, with which we can make meaningful comparisons of the research reported in this thesis. For this purpose, we have chosen the Sphinx-II speech recognition system¹ at Carnegie Mellon that has been used extensively in speech research and the yearly ARPA evaluations. Various aspects of this baseline system and its precursors have been reported in the literature, notably in [32, 33, 35, 28]. Most of these concentrate on the modelling aspects of the system (acoustic, grammatical or lexical) and their effect on recognition accuracy. In this chapter we focus on obtaining a comprehensive set of performance characteristics for this system.

¹The Sphinx-II decoder reported in this section is known internally as FBS6.

The baseline Sphinx-II recognition system uses semi-continuous (or tied-mixture) hidden Markov models (HMMs) for the acoustic models [52, 27, 2] and word bigram or trigram backoff language models (see Sections 2.1 and 2.2). It is a 3-pass decoder structured as follows:

1. Time synchronous Viterbi beam search [52, 62, 37] in the forward direction. It is a complete search of the full vocabulary, using semi-continuous acoustic models, a bigram or trigram language model, and cross-word triphone modelling during the search. The result of this search is a single recognition hypothesis, as well as a word lattice that contains all the words that were recognized during the search. The lattice includes word segmentation and scores information. One of the key features of this lattice is that for each word occurrence, several successive end times are identified along with their scores, whereas very often only the single most likely begin time is identified. Scores for alternative begin times are usually

1 The Sphinx-II decoder reported in this section is known internally as FBS6.



not available.

2. Time synchronous Viterbi beam search in the backward direction. This search

is restricted to the words identified in the forward pass and is very fast. Like the first pass, it produces a word lattice with word segmentations and scores. However, this time several alternative begin times are identified while typically only one end time is available. In addition, the Viterbi search also produces the best path score from any point in the utterance to the end of the utterance, which is used in the third pass.

3. An A* or stack search using the word segmentations and scores produced by the forward and backward Viterbi passes above. It produces an N-best list [59] of alternative hypotheses as its output, as described briefly in Section 3.3.2. There is no acoustic rescoring in this pass. However, any arbitrary language model can be applied in creating the N-best list. In this thesis, we will restrict our discussion to word trigram language models.

The reason for the existence of the backward and A* passes, even though the first pass produces a usable recognition result, is the following. One limitation of the forward Viterbi search in the first pass is that it is hard to employ anything more sophisticated than a simple bigram or similar grammar. Although a trigram grammar is used in the forward pass, it is not a complete trigram search (see Section 3.2.2). Stack decoding, a variant of the A* search algorithm2 [42], is more appropriate for use with such grammars, which lead to greater recognition accuracy. This algorithm maintains a stack of several possible partial decodings (i.e., word sequence hypotheses) which are expanded in a best-first manner [9, 2, 0]. Since each partial hypothesis is a linear word sequence, any arbitrary language model can be applied to it. Stack decoding also allows the decoder to output the several most likely N-best hypotheses rather than just the single best one. These multiple hypotheses can be post-processed with even more detailed models. The need for the backward pass in the baseline system has been mentioned above.

In this chapter we review the details of the baseline system needed for understanding the performance characteristics. In order to keep this discussion fairly self-contained, we first review the various knowledge source models in Section 3.1. Some of the background material in Sections 2.1, 2.2, and 2.3 is also relevant. This is followed by a discussion of the forward pass Viterbi beam search in Section 3.2, and the backward and A* searches in Section 3.3. The performance of this system on several widely used test sets from the ARPA evaluations is described in Section 3.4. It includes recognition accuracy, various statistics related to search speed, and memory usage. We conclude with some final remarks in Section 3.5.

2 We will often use the terms stack decoding and A* search interchangeably.


[Figure 3.1 block diagram: 16KHz, 16-bit linear samples -> pre-emphasis filter H(z) = 1 - 0.97z^-1 -> 25.6 msec Hamming windows at 10 msec intervals -> 12 mel frequency coefficients + power coefficient, 100 cepstral frames/sec -> sentence-based power and cepstral normalization (power -= max(power) over sentence; cepstrum -= mean(cepstrum) over sentence) -> cepstrum vector, Δcepstrum, ΔΔcepstrum, and power/Δpower/ΔΔpower: 4 feature streams at 100 frames/sec.]

Figure 3.1: Sphinx-II Signal Processing Front End.

3.1 Knowledge Sources

This section briefly describes the various knowledge sources or models and the speech signal processing front-end used in Sphinx-II. In addition to the acoustic models and pronunciation lexicon described below, Sphinx-II uses word bigram and trigram grammars. These have been discussed in Section 2.2.

3.1.1 Acoustic Model

Signal Processing

A detailed description of the signal processing front end in Sphinx-II is contained in Section 4.2.1 (Signal Processing) of [27]. The block diagram in Figure 3.1 depicts the overall processing. Briefly, the stream of 16-bit samples of speech data, sampled at 16KHz, is converted into 12-element mel scale frequency cepstrum vectors and a power coefficient in each 10msec frame. We represent the cepstrum vector at time t by x(t); individual elements are denoted by x_k(t), 1 <= k <= 12. The power coefficient at time t


[Figure 3.2 diagram: 5-state left-to-right HMM; the final state is non-emitting.]

Figure 3.2: Sphinx-II HMM Topology: 5-State Bakis Model.

is simply x_0(t). The cepstrum vector and power streams are first normalized, and four feature vectors are derived in each frame by computing the first and second order differences in time:

    x(t)    = normalized cepstrum vector
    Δx(t)   = x(t+2) - x(t-2),   Δ_l x(t) = x(t+4) - x(t-4)
    ΔΔx(t)  = Δx(t+1) - Δx(t-1)
    x_0(t),  Δx_0(t) = x_0(t+2) - x_0(t-2),  ΔΔx_0(t) = Δx_0(t+1) - Δx_0(t-1)

where the commas denote concatenation. Thus, in every frame we obtain four feature vectors of 12, 24, 12, and 3 elements, respectively. These, ultimately, are the input to the speech recognition system.

Phonetic HMM Models

Acoustic modelling in Sphinx-II is based on hidden Markov models (HMMs) for base phones and triphones. All HMMs in Sphinx-II have the same 5-state Bakis topology shown in Figure 3.2. (The background on HMMs has been covered briefly in Section 2.1.2.)

As mentioned in Section 2.1.2, Sphinx-II uses semi-continuous acoustic modelling with 256 component densities in each feature codebook. States are clustered into senones [27], where each senone has its own set of 256 mixture coefficients weighting the codebook for each feature stream.

In order to further reduce the computational cost, only the top few component densities from each feature codebook (typically 4) are fully evaluated in each frame in computing the output probability of a state or senone (equation 2.1). The rationale behind this approximation is that the remaining components match the input very poorly anyway and can be ignored altogether. The approximation primarily reduces the cost of applying the mixture weights in computing senone output probabilities in each frame. For each senone and feature, only 4 mixing weights have to be applied to the 4 best components, instead of all 256.
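As an illustration of this shortlisting, the following sketch selects the top 4 densities and mixes only those; the data layout and names are assumptions for illustration, not the actual Sphinx-II code:

    #include <math.h>

    #define N_DENSITIES 256
    #define N_TOP       4

    /* density[]: log scores of all codebook densities against the current
     * frame (from the VQ step); mixw[]: one senone's log mixture weights. */
    double senone_output_logprob(const double *density, const double *mixw)
    {
        int top[N_TOP];
        for (int n = 0; n < N_TOP; n++) {       /* select N_TOP best densities */
            int best = -1;
            for (int d = 0; d < N_DENSITIES; d++) {
                int used = 0;
                for (int m = 0; m < n; m++)
                    if (top[m] == d)
                        used = 1;
                if (!used && (best < 0 || density[d] > density[best]))
                    best = d;
            }
            top[n] = best;
        }
        double sum = 0.0;                       /* mix only the 4 best, not all 256 */
        for (int n = 0; n < N_TOP; n++)
            sum += exp(density[top[n]] + mixw[top[n]]);
        return log(sum);
    }

In the actual system the top-4 shortlist would be computed once per frame and shared by all senones; it is computed inline here only to keep the sketch self-contained.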


3.1.2 Pronunciation Lexicon

The lexicon in Sphinx-II defines the linear sequence of phonemes representing the pronunciation for each word in the vocabulary. There are about 50 phonemes that make up the English language. The phone set used in Sphinx-II is given in Appendix A. The following is a small example of the lexicon for digits:

    OH       OW
    ZERO     Z IH R OW
    ZERO(2)  Z IY R OW
    ONE      W AH N
    TWO      T UW
    THREE    TH R IY
    FOUR     F AO R
    FIVE     F AY V
    SIX      S IH K S
    SEVEN    S EH V AX N
    EIGHT    EY TD
    NINE     N AY N

There can be multiple pronunciations for a word, as shown for the word ZERO above. Each alternative pronunciation is assumed to have the same a priori language model probability.

3.2 Forward Beam Search

As mentioned earlier, the baseline Sphinx-II recognition system consists of three passes, of which the first is a time-synchronous Viterbi beam search in the forward direction. In this section we describe the structure of this forward pass. We shall first examine the data structures involved in the search algorithm, before moving on to the dynamics of the algorithm.

3.2.1 Flat Lexical Structure

The lexicon defines the linear sequence of context-independent or base phones that make up the pronunciation of each word in the vocabulary. Since Sphinx-II uses triphone acoustic models [34], these base phone sequences are converted into triphone sequences by simply taking each base phone together with its left and right context base phones. Note that the phonetic left context at the beginning of a word is the last base phone from the previous word. Similarly, the phonetic right context at the end of the word is the first base phone of the next word. (Since the decoder does


not know these neighbouring words a priori, it must try all possible cases and finally choose the best. This is discussed in detail below.) Given the sequence of triphones for a word, one can construct an equivalent word-HMM by simply concatenating the HMMs for the individual triphones, i.e., by adding a NULL transition from the final state of one HMM to the initial state of the next. The initial state of the first HMM, and the final state of the last HMM in this sequence become the initial and final states, respectively, of the complete word-HMM. Finally, in order to model continuous speech (i.e., transition from one word into the next), additional NULL transitions are created from the final state of every word to the initial state of all words in the vocabulary. Thus, with a V-word vocabulary, there are V² possible cross-word transitions.

Since the result is a structure consisting of a separate linear sequence of HMMs for each word, we call this a flat lexical structure.
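The construction just described can be sketched as follows; the types and the triphone_lookup() helper are illustrative assumptions, not the actual Sphinx-II data structures:

    #include <stdlib.h>

    extern int triphone_lookup(int left, int base, int right);  /* assumed */

    /* A word-HMM as a chain of triphone HMM instances; the `next` link
     * plays the role of the NULL transition from one final state to the
     * next initial state. */
    typedef struct hmm {
        int triphone_id;
        struct hmm *next;
    } hmm_t;

    /* phones[0..n-1]: the word's base phone sequence.  left_ctx/right_ctx:
     * cross-word contexts (last phone of the previous word, first phone of
     * the next word). */
    hmm_t *build_word_hmm(const int *phones, int n, int left_ctx, int right_ctx)
    {
        hmm_t *head = NULL, *tail = NULL;
        for (int i = 0; i < n; i++) {
            int l = (i == 0)     ? left_ctx  : phones[i - 1];
            int r = (i == n - 1) ? right_ctx : phones[i + 1];
            hmm_t *h = malloc(sizeof(hmm_t));
            h->triphone_id = triphone_lookup(l, phones[i], r);
            h->next = NULL;
            if (tail) tail->next = h; else head = h;
            tail = h;
        }
        return head;  /* head's initial and tail's final state delimit the word */
    }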

3.2.2 Incorporating the Language Model

While the cross-word NULL transitions do not consume any speech input, each of them does have a language model probability associated with it. For a transition from some word wi to any word wj, this probability is simply P(wj|wi) if a bigram language model is used. A bigram language model fits in neatly with the Markov assumption that, given any current state s at time t, the probability of transitions out of s does not depend on how one arrived at s. Thus, the language model probability P(wj|wi) can be associated with the transition from the final state of wi to the initial state of wj, and thereafter we need not care about how we arrived at wj.

The above argument does not hold for a trigram or some other longer distance grammar, since the language model probability of transition to wj depends not only on the immediate predecessor but also on some earlier ones. If a trigram language model is used, the lexical structure has to be modified such that for each word w there are several parallel instances of its word HMM, one for each possible predecessor word. Although the copies may score identically acoustically, the inclusion of language model scores would make their total path probabilities distinct. In general, with non-bigram grammars, we need a separate word HMM model for each grammar state rather than just one per word in the vocabulary.

Clearly, replicating the word HMM models for incorporating a trigram grammar or some other non-bigram grammar in the search algorithm is much costlier computationally. However, more sophisticated grammars offer greater recognition accuracy and possibly even a reduction in the search space. Therefore, in Sphinx-II, trigram grammars are used in an approximate manner with the following compromise. Whenever there is a transition from word wi to wj, we can find the best predecessor of wi at that point, say wi', as determined by the Viterbi search. We then associate the trigram probability P(wj|wi', wi) with the transition from wi to wj. Note, however, that unlike with bigram grammars, trigram probabilities applied to cross-word tran-


sitions in this approximate fashion have to be determined dynamically, depending on the best predecessor for each transition at the time in question. Using a trigram grammar in an approximate manner as described above has the following advantages (a sketch of the resulting transition step follows the list):

    followingadvantages:tvoidnyeplicationofheexicalword-HMMtructuresndssociated

    increaseincomputationalload.ntermsofaccuracy,itismuchbetterthanusingabigrammodelandisclosetothatofacompletetrigramsearch.Weinferthisfromthefactthattheaccuracy

    oftheresultsfromthefinalA *pass,whichusesthetrigramgrammarcorrectly,andalsohashebenefitofadditionalwordsegmentationsochooserom,srelativelyonlyabout%betterseeSection3.4.2).trigramgrammarappliedinthisapproximatemannerisempiricallyobserved

    tosearchfewerword-HMMscomparedtoabigramgrammar,thusleadingtoaslightimprovementintherecognitionspeed.hereductioninsearchisaresultofsharperpruningofferedbythetrigramgrammar.
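The cross-word transition under this scheme might be sketched as follows; the structures and the lm_trigram() lookup are assumptions for illustration:

    #include <math.h>

    extern double lm_trigram(int w_prev2, int w_prev1, int w);  /* assumed LM lookup */

    typedef struct {
        int    word;        /* word wi exiting in the current frame */
        int    best_pred;   /* wi's own Viterbi-best predecessor wi' */
        double path_score;  /* log path score at wi's final state */
    } word_exit_t;

    /* Best entry score into word wj, using P(wj | wi', wi) where wi' is
     * fixed to the Viterbi-best predecessor of each exiting word wi. */
    double best_entry_score(const word_exit_t *exits, int n_exits, int wj)
    {
        double best = -HUGE_VAL;
        for (int i = 0; i < n_exits; i++) {
            double s = exits[i].path_score
                     + lm_trigram(exits[i].best_pred, exits[i].word, wj);
            if (s > best)
                best = s;
        }
        return best;   /* seeds wj's initial state in the next frame */
    }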

3.2.3 Cross-Word Triphone Modeling

It is advantageous to use cross-word triphone models (as opposed to ignoring cross-word phonetic contexts) for continuous speech recognition, where word boundaries are unclear to begin with and there are very strong co-articulation effects. Using cross-word triphone models we not only obtain better accuracy, but also greater computational efficiency, at the cost of an increase in the total size of acoustic models. The sharper models provided by triphones, compared to diphones and monophones, lead to greater pruning efficiency and reduction in computation. However, using cross-word triphone models in the Viterbi search algorithm is not without its complications.

Right Context

The phonetic right context for the last triphone position in a word is the first base phone of the next word. In time-synchronous Viterbi search, there is no way to know the next word in advance. In any case, whatever decoding algorithm is used, there can be several potential successor words to any given word wi at any given time. Therefore, the last triphone position for each word has to be modelled by a parallel set of triphone models, one for each possible phonetic right context. In other words, if there are k base phones p1, p2, ..., pk in the system, we have k parallel triphone HMM models h_p1, h_p2, ..., h_pk representing the final triphone position for wi. A cross-word transition from wi to another word wj whose first base phone is p is represented by


[Figure 3.3 diagram: the HMM network for word wi ends in a parallel set of HMMs in the last phone position, one per phonetic right context base phone; a cross-word NULL transition connects the copy for right context p to word wj, whose first base phone is p.]

Figure 3.3: Cross-word Triphone Modelling at Word Ends in Sphinx-II.

a NULL arc from h_p to the initial state of wj. Figure 3.3 illustrates this concept of right context fanout at the end of each word wi in Sphinx-II.

This solution, at first glance, appears to force a large increase in the total number of triphone HMMs that may be searched. In the place of the single last-position triphone for each word, we now have one triphone model for each possible phonetic right context, which is typically around 50 in number. In practice, we almost never encounter this apparent explosion in computational load, for the following reasons (a small sketch related to the second point follows the list):

• The dynamic number of rightmost triphones actually evaluated in practice is much smaller than the static number because the beam search heuristic prunes most of the words away by the time their last phone has been reached. This is by far the largest source of efficiency, even with the right context fanout.

• The set of phonetic right contexts actually modelled can be restricted to just those found in the input vocabulary; i.e., to the set of first base phones of all the words in the vocabulary. Moreover, Sphinx-II uses state clustering into senones, where several states share the same output distribution modelled by a senone. Therefore, the parallel set of models at the end of any given word are not all unique. By removing duplicates, the fanout can be further reduced. In Sphinx-II, these two factors together reduce the right context fanout by about 50% on average.

• The increase in number of rightmost triphones is partly offset by the reduction in computation afforded by the sharper triphone models.
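The vocabulary-restriction point can be made concrete with a one-pass scan over the lexicon (an illustrative sketch with assumed arrays):

    #define N_PHONES 50   /* assumed base phone set size */

    /* first_phone[w]: first base phone of word w.  Only phones that begin
     * some vocabulary word ever occur as cross-word right contexts. */
    int collect_right_contexts(const int *first_phone, int vocab_size,
                               int *is_right_ctx /* out: flag per base phone */)
    {
        int n = 0;
        for (int p = 0; p < N_PHONES; p++)
            is_right_ctx[p] = 0;
        for (int w = 0; w < vocab_size; w++)
            if (!is_right_ctx[first_phone[w]]) {
                is_right_ctx[first_phone[w]] = 1;
                n++;
            }
        return n;   /* number of distinct right contexts needing models */
    }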


Left Context

The phonetic left context for the first phone position in a word is the last base phone from the previous word. During decoding, there is no unique such predecessor word. In any given frame, there may be transitions to a word wj from a number of candidates w_i1, w_i2, ... The Viterbi algorithm chooses the best possible transition into wj. Let us say the winning predecessor is w_ik. Thus, the last base phone of w_ik becomes the phonetic left context for wj. However, this is in frame t. In the next frame, there may be an entirely different winner that results in a different left context base phone. Since the real best predecessor is not determined until the end of the Viterbi decoding, all such possible paths have to be pursued in parallel.

As with right context cross-word triphone modelling, this problem also can be solved by using a parallel set of triphone models for the first phone position of each word: a separate triphone for each possible phonetic left context. However, unlike the word-ending phone position, which is heavily pruned by the beam search heuristic, the word-initial position is extensively searched. Most of the word-initial triphone models are alive every frame. In fact, as we shall see later in Section 3.4, they account for more than 60% of all triphone models evaluated in the case of large-vocabulary recognition. A left context fanout of even a small factor of 2 or 3 would substantially slow down the system.

The solution used in the Sphinx-II baseline system is to collapse the left context fanout into a single 5-state HMM with dynamic triphone mapping, as follows. As described above, at any given frame there may be several possible transitions from words w_i1, w_i2, ... into wj. According to the Viterbi algorithm, the transition with the best incoming score wins. Let the winning predecessor be w_ik. Then the initial state of wj also dynamically inherits the last base phone of w_ik as its left context. When the output probability of the initial state of wj has to be evaluated in the next frame, its parent triphone identity is first determined dynamically from the inherited left context base phone. Furthermore, this dynamically determined triphone identity is also propagated by the Viterbi algorithm, as the path probability is propagated from state to state. This ensures that any complete path through the initial triphone position of wj is scored consistently using a single triphone HMM model.

Figure 3.4 illustrates this process with an example, going through a sequence of 4 frames. It contains a snapshot of a word-initial HMM model at the end of each frame. Arcs in bold indicate the winning transitions to each state of the HMM in this example. HMM states are annotated with the left context base phone inherited dynamically through time. As we can see in the example, different states can have different phonetic left contexts associated with them, but a single Viterbi path through the HMM is evaluated with the same context. This can be verified by backtracking from the final state backward in time.
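The per-state bookkeeping for this dynamic mapping can be sketched roughly as follows; the structure and function names are assumptions for illustration, not the actual implementation:

    /* Word-initial HMM state with dynamically inherited left context. */
    typedef struct {
        double score;      /* Viterbi path score */
        int    left_ctx;   /* inherited left-context base phone */
        int    hist;       /* back-pointer into the word lattice */
    } state_t;

    /* Candidate transition src -> dst with transition log probability tp.
     * The left context travels with whichever path wins. */
    void viterbi_transition(state_t *dst, const state_t *src, double tp)
    {
        double s = src->score + tp;
        if (s > dst->score) {
            dst->score    = s;
            dst->hist     = src->hist;
            dst->left_ctx = src->left_ctx;
        }
    }

When dst's output probability is needed in the next frame, its senone is chosen by dynamically looking up the triphone (dst->left_ctx, base phone, right context), as described above.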


[Figure 3.4 diagram: snapshots at times 1 through 4 of the initial (leftmost) HMM model for a word, with incoming cross-word transitions carrying left context phones from previous words; each state is annotated with its dynamically inherited left context base phone.]

Figure 3.4: Word-Initial Triphone HMM Modelling in Sphinx-II.

Single Phone Words

In the case of single-phone words, both the left and right phonetic contexts are derived dynamically from neighbouring words. Thus, they have to be handled by a combination of the above techniques. With reference to Figures 3.3 and 3.4, separate copies of the single phone have to be created for each right phonetic context, and each copy is modelled using the dynamic triphone mapping technique for handling its left phonetic context.

3.2.4 The Forward Search

The decoding algorithm is, in principle, straightforward. The problem is to find the most probable sequence of words that accounts for the observed speech. This is tackled as follows.

The abstract Viterbi decoding algorithm and the beam search heuristic, and its


application to speech decoding have been explained in Section 2.3.1. In Sphinx-II, there are two distinguished words, <s> and </s>, depicting the beginning and ending silence in any utterance. The input speech is expected to begin at the initial state of <s> and end in the final state of </s>.

We can now describe the forward Viterbi beam search implementation in Sphinx-II. It is explained with the help of fragments of pseudo-code. It is necessary to understand the forward pass at this level in order to follow the subsequent discussion on performance analysis and the breakdown of computation among different modules.

Search Outline

Before we go into the details of the search algorithm, we introduce some terminology. A state j of an HMM model m in the flat lexical search space has the following attributes:

• A path score at time t, P_j^m(t), that indicates the probability corresponding to the best state sequence leading from the initial state of <s> at time 0 to this state at time t, while consuming the input speech until t.

• History information at time t, H_j^m(t), that allows us to trace back the best preceding word history leading to this state at t. (As we shall see later, this is a pointer to the word lattice entry containing the best predecessor word.)

• The senone output probability, b_j^m(t), for this state at time t (see Section 2.1.2). If m belongs to the first position in a word, the senone identity for state j is determined dynamically from the inherited phonetic left context (Section 3.2.3).

At the beginning of the decoding of an utterance, the search process is initialized by setting the path probability of the start state of the distinguished word <s> to 1.

All other states are initialized with a path score of 0. Also, an active HMM list that identifies the set of active HMMs in the current frame is initialized with this first HMM for <s>. From then on, the processing of each frame of speech, given the input feature vector for that frame, is outlined by the pseudo-code in Figure 3.5.

We consider some of the functions defined in Figure 3.5 in a little more detail below. Certain aspects, such as pruning out HMMs that fall below the beam threshold, have been omitted for the sake of simplicity.

VQ: VQ stands for vector quantization. In this function, the Gaussian densities that make up each feature codebook are evaluated at the input feature vectors. In other words, we compute the Mahalanobis distance of the input feature vector from the mean of each Gaussian density function. (This corresponds to evaluating M_n


    forward_frame(input feature vector for current frame)
    {
        VQ(input feature);    /* Find top 4 densities closest to input feature */
        senone_evaluate();    /* Find senone output probabilities using VQ results */
        hmm_evaluate();       /* Within-HMM and cross-HMM transitions */
        word_transition();    /* Cross-word transitions */
        /* HMM pruning using a beam omitted for simplicity */
        update active HMM list for next frame;
    }

    hmm_evaluate()
    {
        /* Within-HMM transitions */
        for (each active HMM h)
            for (each state s in h)
                update path probability of s using senone output probabilities;

        /* Within-word cross-HMM transitions and word exits */
        for (each active HMM h with final state score within beam) {
            if (h is a final HMM for a word w) {
                create word lattice entry for w;    /* word exit */
            } else {
                let h' = next HMM in word after h;
                NULL transition (final-state(h) -> initial-state(h'));
                /* Remember right context fanout if h' is final HMM in word */
            }
        }
    }

    word_transition()
    {
        let {w} = set of words entered into word lattice in this frame;
        for (each word w' in vocabulary)
            find the best transition ({w} -> w'), including LM probability;
    }

Figure 3.5: One Frame of Forward Viterbi Beam Search in the Baseline System.
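The pruning step omitted from Figure 3.5 would, under the usual beam formulation, look roughly like this sketch (illustrative names, log-domain scores):

    #include <math.h>

    /* Keep only HMMs whose best state score lies within `beam` of the
     * best score anywhere in the current frame. */
    int prune_active_hmms(const double *hmm_best, /* best state score per HMM */
                          int *active, int n_active,
                          double beam)
    {
        double best = -HUGE_VAL;
        for (int i = 0; i < n_active; i++)
            if (hmm_best[active[i]] > best)
                best = hmm_best[active[i]];

        double thresh = best - beam;
        int n = 0;
        for (int i = 0; i < n_active; i++)
            if (hmm_best[active[i]] >= thresh)
                active[n++] = active[i];   /* compact the surviving list */
        return n;                          /* new active list length */
    }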



the utterance and backtracking to the beginning, by following the history pointers in the word lattice.

3.3 Backward and A* Search

As mentioned earlier, the A* or stack search is capable of exactly using more sophisticated language models than bigram grammars, thus offering higher recognition accuracy. It maintains a sorted stack of partial hypotheses which are expanded in a best-first manner, one word length at a time. There are two main issues with this algorithm:

• To prevent an exponential explosion in the search space, the stack decoding algorithm must expand each partial hypothesis only by a limited set of the most likely candidate words that may follow that partial hypothesis.

• The A* algorithm is not time synchronous. Specifically, each partial hypothesis in the sorted stack can account for a different initial segment of the input speech. This makes it hard to compare the path probabilities of the entries in the stack.

It has been shown in [42] that the second issue can be solved by attaching a heuris-

tic score with every partial hypothesis H that accounts for the remaining portion of the speech not included in H. By filling out every partial hypothesis to the full utterance length in this way, the entries in the stack can be compared to one another, and expanded in a best-first manner. As long as the heuristic score attached to any partial hypothesis H is an upper bound on the score of the best possible complete recognition achievable from H, the A* algorithm is guaranteed to produce the correct results.
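In the usual A* notation (our restatement, not the thesis's own equations), each stack entry H is ranked by

    f(H) = g(H) + h(H),    with    h(H) >= h*(H),

where g(H) is the log path score of H over the speech it consumes, h*(H) is the log score of the best possible completion of H to the end of the utterance, and h(H) is the attached heuristic; the guarantee above is exactly the admissibility condition h(H) >= h*(H).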

The backward pass in the Sphinx-II baseline system provides an approximation to the heuristic score needed by the A* algorithm. Since it is a time-synchronous Viterbi search, run in the backward direction from the end of the utterance, the path score at any state corresponds to the best state sequence between it and the utterance end. Hence it serves as the desired upper bound. It is an approximation since the path score uses bigram probabilities and not the exact grammar that the A* search uses.

The backward pass also produces a word lattice, similar to the forward Viterbi search. The A* search is constrained to search only the words in the two lattices, and is relatively fast.

The word lattice produced by the backward pass has another desirable property. We noted at the beginning of this chapter that for each word occurrence in the forward pass word lattice, several successive end times are identified along with their scores, whereas very often only the single most likely begin time is identified. The backward pass word lattice produces the complementary result: several beginning times are


identified for a given word occurrence, while usually only the single most likely end time is available. The two lattices can be combined to obtain acoustic probabilities for a wider range of word beginning and ending times, which improves the recognition accuracy.

In the following subsections, we briefly describe the backward Viterbi pass and the A* algorithm used in the Sphinx-II baseline system.

3.3.1 Backward Viterbi Search

The backward Viterbi search is essentially identical to the forward search, except that it is completely reversed in time. The main differences are listed below:

• The input speech is processed in reverse.

• It is constrained to search only the words in the word lattice from the forward pass. Specifically, at any time t, cross-word transitions are restricted to words that exited at t in the forward pass, as determined by the latter's word lattice.

• All HMM transitions, as well as cross-HMM and cross-word NULL transitions, are reversed with respect to the forward pass.

• Cross-word triphone modelling is performed using left-context fanout and dynamic triphone mapping for right contexts.

• Only the bigram probabilities are used. Therefore, the Viterbi path score from any point in the utterance up to the end is only an approximation to the upper bounds desired by the A* search.

The result of the backward Viterbi search is also a word lattice like that from the forward pass. It is rooted at </s>, which ends in the final frame of the utterance, and grows backward in time. The backward pass identifies several beginning times for a word, but typically only one ending time. Acoustic scores for each word segmentation are available in the backward pass word lattice.

3.3.2 A* Search

The A* search algorithm is described in [42]. It works by maintaining an ordered stack or list of partial hypotheses, sorted in descending order of likelihood. Hypotheses are word sequences and may be of different lengths, accounting for different lengths of input speech. Figure 3.7 outlines the basic stack decoding algorithm for finding N-best hypotheses.
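A generic stack decoder loop of this shape, written in the pseudo-code style of Figure 3.5, is sketched below; it is a minimal illustration under assumed helpers, not a reproduction of Figure 3.7:

    /* N-best stack decoding sketch.  Each hypothesis carries its path
     * score g(H) plus the heuristic completion score h(H); pop_best()
     * returns the entry with the highest g(H) + h(H). */
    nbest_search(N)
    {
        n_found = 0;
        push(initial empty hypothesis);
        while (n_found < N and stack not empty) {
            H = pop_best();
            if (H accounts for the entire utterance) {
                output H;                  /* next-best complete hypothesis */
                n_found = n_found + 1;
            } else {
                for (each candidate word w following H, from the lattices)
                    push(extend(H, w));    /* g from lattice scores and the LM,
                                              h from backward-pass path scores */
            }
        }
    }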


data sets have been extensively used by several sites in the past few years, including the speech group at Carnegie Mellon University. But the principal goal of these experiments has been improving the recognition accuracy. The work reported in this thesis is focussed on obtaining other performance measures for the same data sets, namely execution time and memory requirements. We first describe the experimentation methodology in the following section, followed by other sections containing a detailed performance analysis.

3.4.1 Experimentation Methodology

Parameters Measured and Measurement Techniques

The performance analysis in this section provides a detailed look at all aspects of computational efficiency, including a breakdown by the various algorithmic steps in each case. Two different vocabulary sizes, approximately 20,000 and 58,000 words, referred to as the 20K and 58K tasks, respectively, are considered for all experiments. The major parameters measured include the following:

• Recognition accuracy from the first Viterbi pass result and the final A* result. This is covered in detail in Section 3.4.2.

• Overall execution time and its breakdown among the major computational steps. We also provide frequency counts of the most common operations that account for most of the execution time. Section 3.4.3 deals with these measurements. Timing measurements are performed over entire test sets, averaged to per-frame values, and presented in multiples of real time. For example, any computation that takes 23 msec to execute per frame, on average, is said to run in 2.3 times real time, since a frame is 10 msec long. This makes it convenient to estimate the execution cost and usability of individual techniques. Frequency counts are also normalized to per-frame values.

• The breakdown of memory usage among various data structures. This is covered in Section 3.4.4.

Clearly, the execution times reported here are machine-dependent. Even with a single architecture, differences in implementation (such as cache size, memory and bus speeds relative to CPU speed, etc.) can affect the speed performance. Furthermore, for short events, the act of measuring them itself would perturb the results. It is important to keep these caveats in mind in interpreting the timing results. Having said that, we note that all experiments were carried out on one particular model of Digital Equipment Corporation's Alpha workstations. The Alpha architecture [61] includes a special RPCC instruction that allows an application to time very short


events of as little as a few hundred machine cycles with negligible overhead. All timing measurements are normalized to an Alpha processor running at 175MHz. It should also be emphasized that the main computational loops in the Sphinx-II

system have been tuned carefully for optimum speed performance. The measurements reported in this work have been limited almost exclusively to such loops.
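A timing harness of the kind implied here can be sketched as follows; read_cycle_counter() stands in for the machine-specific instruction (RPCC on the Alpha) and is an assumed wrapper:

    extern unsigned long read_cycle_counter(void);  /* e.g., wraps RPCC */

    #define CPU_HZ 175000000UL   /* normalize to a 175MHz processor */

    /* Time one invocation of a short code section, in milliseconds. */
    double time_section_msec(void (*section)(void))
    {
        unsigned long t0 = read_cycle_counter();
        section();                        /* the inner loop being measured */
        unsigned long t1 = read_cycle_counter();
        return (double)(t1 - t0) * 1000.0 / (double)CPU_HZ;
    }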

Test Sets and Experimental Conditions

The test sets used in the experiments have been taken from the various data sets involved in the 1993 and 1994 ARPA hub evaluations. All the test sets consist of clean speech recorded using high quality microphones. Specifically, they consist of the following:

• Dev93: the 1993 development set (commonly referred to as si.dt.20).

• Dev94: the 1994 development set (h1.dt.94).

• Eval94: the 1994 evaluation set (h1.et.94).

The test sets are evaluated individually on the 20K and the 58K tasks. This is important to demonstrate the variation in performance, especially recognition accuracy, with different test sets and vocabulary sizes. The individual performance results allow

an opportunity for comparisons with experiments performed elsewhere that might be restricted to just some of the test sets. Table 3.1 summarizes the number of sentences and words in each test set.

                 Dev93   Dev94   Eval94   Total
    Sentences      503     310      316    1129
    Words         8227    7387     8186   23800

Table 3.1: No. of Words and Sentences in Each Test Set.

The knowledge bases used in each experiment are the following:

• Both the 20K and the 58K tasks use semi-continuous acoustic models of the kind discussed in Section 3.1.1. There are 10,000 senones or tied states in this system.

• The pronunciation lexicons in the 20K tasks are identical to those used by CMU

in the actual evaluations. The lexicon for the 58K task is derived partly from the 20K task and partly from the 100K-word dictionary exported by CMU.


• The Dev93 language model for the 20K task is the standard one used by all sites in 1993. It consists of about 3.5M bigrams and 3.2M trigrams. The 20K grammar for the Dev94 and Eval94 test sets is also the standard one used by all sites, and it consists of about 5.0M bigrams and 6.7M trigrams. The grammar for the 58K tasks is derived from the approximately 230M words of language model training data that became available during the 1994 ARPA evaluations, and it consists of 6.1M bigrams and 18.0M trigrams. The same grammar is used with all test sets.

The following sections contain the detailed performance measurements conducted on the baseline Sphinx-II recognition system.

3.4.2 Recognition Accuracy

Recognition results from the first pass (Viterbi beam search) as well as the final A* pass are presented for both the 20K and 58K tasks. Table 3.2 lists the word error rates on each of the test sets, individually and overall.5,6 Errors include substitutions, insertions and deletions.
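For reference, the word error rate used throughout is the standard ARPA measure (our restatement, consistent with the error types listed above): with S substitutions, D deletions, and I insertions against a reference transcript of N words,

    WER = 100 * (S + D + I) / N  percent.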

                  Dev93   Dev94   Eval94   Mean
    20K (Vit.)     17.6    15.8     15.9    16.4
    20K (A*)       16.5    15.2     15.3    15.7
    58K (Vit.)     15.1    14.3     14.5    14.6
    58K (A*)       13.8    13.8     13.8    13.8

Table 3.2: Percentage Word Error Rate of Baseline Sphinx-II System.

It is clear that the largest single factor that determines the word error rate is the test set itself. In fact, if the input speech were broken down by individual speakers, a

much greater variation would be observed [45, 46]. Part of this might be attributable to different out-of-vocabulary (OOV) rates for the sets of sentences uttered by individual speakers. However, a detailed examination of speaker-by-speaker OOV rate and error rate does not show any strong correlation between the two. The main conclusion is that word error rate comparisons between different systems must be restricted to the same test sets.

5 The accuracy results reported in the actual evaluations are somewhat better than those shown here. The main reason is that the acoustic models used in the evaluations are more complex, consisting of separate codebooks for individual phone classes. We used a single codebook in our experiments instead, since the goal of our study is the cost of the search algorithm, which is about the same in both cases.

6 Note that in all such tables, the overall mean is computed over all the different sets put together. Hence, it is not necessarily just the mean of the means for the individual test sets.


3.4.3 Search Speed

In this section we present a summary of the computational load imposed by the Sphinx-II baseline search architecture. There are three main passes in the system: forward Viterbi beam search, backward Viterbi search, and A* search. The first presents the greatest load of all, and hence we also study the breakdown of that load among its main components: Gaussian density computation, senone score computation, HMM evaluation, and cross-word transitions. These are the four main functions in the forward pass that were introduced in Section 3.2.4. Although we present performance statistics for all components, the following functions in the forward Viterbi search will be the main focus of our discussion:

• HMM evaluation. We present statistics on both execution times as well as the number of HMMs evaluated per frame.

• Cross-word transitions. Again, we focus on execution times and the number of cross-word transitions carried out per frame.

The execution time for each step is presented in terms of multiples of real time taken to process that step. As mentioned earlier, the machine platform for all experiments is the DEC Alpha workstation. All timing measurements are carried out using the RPCC instruction, so that the measurement overhead is minimized. It should

again be emphasized that execution times are heavily influenced by the overall processor, bus, and memory architecture. For this reason, all experiments are carried out on a single machine model. The performance figures presented in this section are normalized to an Alpha processor running at 175MHz.

Overall Execution Times

Table 3.3 summarizes the execution times for both the 20K and 58K tasks. As we can see, the forward Viterbi search accounts for well over 9