speech recognition template matching - speech at cmu · speech recognition by templates a little...
Speech Processing 15-492/18-492
Speech Recognition
Template matching
Speech Recognition by Templates
• A little history …
• Matching Templates
• DTW (Dynamic Time Warping)
• Beyond template matching
Radio Rex (1922)
• Toys always lead technology …
• Call “Rex” and he comes out of his kennel
• (Crystalradio.com and Rhys Jones)
Toy ASR “Tricks”
• Radio Rex
– Recognizes vowel formants in “EH”
• Voice activated toy train
– Multilingual stop/go: hashire/tomate
• Toy “pets” don’t need perfect ASR
Template Matching
• Record templates from user
– Store in library
• Record ASR example
– Compare against each library template
• Select closest example
• For example …
– On a voice dialing system
Voice Dialing System
• Library
– Mom
– Dad
– Bob
– Mario’s Pizza
– Let’s Go Bus Information System
Matching in Time Domain
• Duration
– Will discriminate some examples
– But Mom, Bob and Dad will be confused
• What about spectral properties?
Matching in Frequency Domain
Mom
Bob
Different deliveries
• We change durations
– Two utterances are never the same
• When it fails we change our delivery
– Become more articulate
– “clearer”
Dynamic Time Warping
Template
Sample Speech
DTW algorithm
• For each square (i, j), compute the accumulated score
– Score[i][j] = Dist(template[i], sample[j]) +
    smallest_of(Score[i-1][j],
                Score[i][j-1],
                Score[i-1][j-1])
– Remember which choice you took (count path)
[Figure: DTW grid with axes labeled Template (index i) and Sample (index j), showing the cells at i-1, i and j-1, j]
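The per-square rule above can be sketched in Python as follows. This is a minimal illustration rather than the course's implementation: it assumes each frame is a short list of floats, uses Euclidean distance between frames, and assumes both sequences are non-empty. The names `frame_dist` and `dtw` are chosen here for illustration.

```python
import math

def frame_dist(a, b):
    """Euclidean distance between two feature frames of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dtw(template, sample, dist=frame_dist):
    """Return (total_cost, path) for the best alignment.

    path is a list of (i, j) pairs: template frame i is matched
    to sample frame j. Assumes both inputs are non-empty.
    """
    n, m = len(template), len(sample)
    cost = [[float("inf")] * m for _ in range(n)]
    back = [[None] * m for _ in range(n)]  # which choice we took
    for i in range(n):
        for j in range(m):
            d = dist(template[i], sample[j])
            if i == 0 and j == 0:
                cost[i][j] = d
                continue
            # Allowed predecessors: up, left, diagonal.
            candidates = []
            if i > 0:
                candidates.append((cost[i - 1][j], (i - 1, j)))
            if j > 0:
                candidates.append((cost[i][j - 1], (i, j - 1)))
            if i > 0 and j > 0:
                candidates.append((cost[i - 1][j - 1], (i - 1, j - 1)))
            best_cost, best_prev = min(candidates)
            cost[i][j] = d + best_cost
            back[i][j] = best_prev
    # Follow the remembered choices back from the final square.
    path = []
    cell = (n - 1, m - 1)
    while cell is not None:
        path.append(cell)
        cell = back[cell[0]][cell[1]]
    path.reverse()
    return cost[n - 1][m - 1], path

# Toy 1-D "features": the sample is a slower version of the template.
template = [[0.0], [1.0], [2.0]]
sample = [[0.0], [0.0], [1.0], [2.0], [2.0]]
total, path = dtw(template, sample)
print(total, len(path))
```

The length of `path` is the "count" that later slides use to normalize scores across templates of different durations.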
Multiple Templates
• Compare against each
• Find closest
• Need to normalize scores
– (divide by length of matches)
Matching Templates
Sample
Template Library
Word0, Word1, Word2, …
BestScore = infinity
For Word in Templates
    Score = dtw(Template[Word], Sample);
    if (Score < BestScore)
        BestScore = Score;
        BestWord = Word;
DoAction(Action[BestWord])
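A runnable sketch of this matching loop might look like the following. It is an illustration under assumptions, not the original system: frames are lists of floats, the frame distance is Euclidean, scores are normalized by dividing by the path length, and the names `dtw_normalized` and `recognize` as well as the toy feature values are made up for the example.

```python
import math

def frame_dist(a, b):
    """Euclidean distance between two feature frames."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dtw_normalized(template, sample):
    """DTW cost divided by the number of squares on the best path.

    Assumes both sequences are non-empty.
    """
    n, m = len(template), len(sample)
    inf = float("inf")
    # Each cell holds (accumulated cost, path length so far).
    cell = [[(inf, 0)] * (m + 1) for _ in range(n + 1)]
    cell[0][0] = (0.0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            prev_cost, prev_len = min(
                cell[i - 1][j], cell[i][j - 1], cell[i - 1][j - 1]
            )
            d = frame_dist(template[i - 1], sample[j - 1])
            cell[i][j] = (prev_cost + d, prev_len + 1)
    total, length = cell[n][m]
    return total / length

def recognize(library, sample):
    """Return (best_word, best_score) over all templates in the library."""
    best_word, best_score = None, float("inf")
    for word, template in library.items():
        score = dtw_normalized(template, sample)
        if score < best_score:
            best_word, best_score = word, score
    return best_word, best_score

# Hypothetical 1-D features for two library entries.
library = {
    "mom": [[1.0], [5.0], [1.0]],
    "bob": [[3.0], [5.0], [3.0]],
}
sample = [[1.0], [1.0], [5.0], [5.0], [1.0]]
print(recognize(library, sample))
```

Tracking the best score alongside the best word is what lets a caller later decide whether the winner is close enough to act on.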
DTW issues
• What happens with no-matches
– Need to deal with none of the above
• What happens with more templates
– Harder to choose between
– Once variance greater than differences
• Choose templates that are very different
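One common way to handle "none of the above" is to reject the best match when its score is too high, or when it is not clearly better than the runner-up. The sketch below assumes normalized DTW scores where lower is better. The function name `decide` and both thresholds are hypothetical, and in practice the threshold values would have to be tuned on real recordings.

```python
def decide(scores, max_score, min_margin):
    """Pick the best word or return None for 'none of the above'.

    scores:     dict mapping word -> normalized DTW score (lower is better)
    max_score:  reject if even the best score is worse than this (assumed)
    min_margin: reject if best and second best are closer than this (assumed)
    """
    if not scores:
        return None
    ranked = sorted(scores.items(), key=lambda item: item[1])
    best_word, best = ranked[0]
    if best > max_score:
        return None  # nothing in the library is close enough
    if len(ranked) > 1 and ranked[1][1] - best < min_margin:
        return None  # too hard to choose between the top two
    return best_word

print(decide({"mom": 0.2, "bob": 0.9}, max_score=0.5, min_margin=0.1))
print(decide({"mom": 0.8, "bob": 0.9}, max_score=0.5, min_margin=0.1))
```

The margin test reflects the point that adding templates makes entries harder to tell apart once the variation between utterances exceeds the differences between words.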
DTW/Template Applications
• Voice dialer
• Simple command and control
• Speaker ID
Speaker ID
Sample
Template Library
Speaker0, Speaker1, Speaker2, …
BestScore = infinity
For Speaker in Templates
    Score = dtw(Template[Speaker], Sample);
    if (Score < BestScore)
        BestScore = Score;
        BestSpeaker = Speaker;
DTW
• Advantages
– Works well for small number of templates (<20)
– Language independent
– Speaker specific
– Easy to train (end user controls it)
• Disadvantages
– Limited number of templates
– Speaker specific
– Need actual training examples
More reliable matching
• Distance metric
– Euclidean
• But some distances are bigger than others
– Silence is pretty similar
– Fricatives are quite a bit larger
  • A longer fricative might give a large score
  • A longer vowel might give a smaller score
More reliable matching
• Having multiple template examples
– Individual matches or
– Average them together
• DTW align all of the examples
• Collect statistics as a Gaussian
– Mean and standard deviation for each coeff
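The averaging step can be sketched as follows: align every example to a reference with DTW, then compute a mean and standard deviation per coefficient for each reference frame. This is one possible reading of the slide, with several assumptions: the first example serves as the reference, frames are lists of floats, the frame distance is Euclidean, and the population standard deviation is used. The names `dtw_path` and `build_model` are illustrative.

```python
import math

def dtw_path(a, b):
    """Return the best DTW alignment as a list of (i, j) index pairs."""
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(a[i - 1], b[j - 1])
            cost[i][j] = d + min(
                cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1]
            )
    # Walk back from the end, always moving to the cheapest predecessor.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        ]
        _, i, j = min(steps)
    path.reverse()
    return path

def build_model(examples):
    """Return one (mean, std) pair of coefficient lists per reference frame."""
    reference = examples[0]  # assumption: first example is the reference
    buckets = [[] for _ in reference]
    for example in examples:
        for i, j in dtw_path(reference, example):
            buckets[i].append(example[j])
    model = []
    for frames in buckets:
        dims = range(len(frames[0]))
        mean = [sum(f[d] for f in frames) / len(frames) for d in dims]
        std = [
            math.sqrt(sum((f[d] - mean[d]) ** 2 for f in frames) / len(frames))
            for d in dims
        ]
        model.append((mean, std))
    return model

examples = [[[0.0], [2.0]], [[0.0], [0.0], [4.0]]]
for mean, std in build_model(examples):
    print(mean, std)
```

With only a few examples the standard deviations are rough estimates, and a coefficient that never varied gets a standard deviation of zero, which any distance that divides by it needs to guard against.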
More reliable distances
• Instead of Euclidean distance
– Doesn’t care about the standard deviation
• Use Mahalanobis distance
– Cares about means and standard deviation
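To make the difference concrete, here is a small sketch comparing the two distances. It assumes the coefficients are treated as independent, so the Mahalanobis distance reduces to scaling each difference by that coefficient's standard deviation (the general form uses a full covariance matrix). The `floor` parameter is a hypothetical guard against zero standard deviations, and the numbers are made up.

```python
import math

def euclidean(x, mean):
    """Plain Euclidean distance: every coefficient counts equally."""
    return math.sqrt(sum((xi - mi) ** 2 for xi, mi in zip(x, mean)))

def mahalanobis_diag(x, mean, std, floor=1e-3):
    """Mahalanobis distance with a diagonal covariance (assumption).

    Each difference is divided by that coefficient's standard deviation,
    so coefficients that naturally vary a lot are penalized less.
    """
    return math.sqrt(
        sum(
            ((xi - mi) / max(si, floor)) ** 2
            for xi, mi, si in zip(x, mean, std)
        )
    )

mean = [0.0, 0.0]
std = [1.0, 10.0]  # the second coefficient varies much more
a = [2.0, 0.0]     # off by 2 in the stable coefficient
b = [0.0, 2.0]     # off by 2 in the noisy coefficient

print(euclidean(a, mean), euclidean(b, mean))
print(mahalanobis_diag(a, mean, std), mahalanobis_diag(b, mean, std))
```

Euclidean distance rates `a` and `b` as equally far from the mean, while the Mahalanobis version rates `b` as much closer because a deviation of 2 is small for a coefficient whose standard deviation is 10.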
Extending Template matching
• String word templates together
– Need to find word segmentation
• But there are many words …
Word0, Word1, Word2, …
Extending template model
• String phoneme templates together
– A template model for each phoneme
k ae t
Sample
Phone0, Phone1, Phone2, …
Phoneme Templates
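As a rough illustration of stringing phoneme templates together, a word template can be built by concatenating the frames of its phonemes, for example "k ae t" for "cat". Everything below is hypothetical: the lexicon, the per-phoneme frame values, and the helper name `word_template` are invented for the sketch, and it ignores how phonemes change in context.

```python
def word_template(word, lexicon, phone_templates):
    """Concatenate phoneme templates into one template for the word."""
    frames = []
    for phone in lexicon[word]:
        frames.extend(phone_templates[phone])
    return frames

# Hypothetical pronunciation lexicon and toy 1-D phoneme templates.
lexicon = {"cat": ["k", "ae", "t"]}
phone_templates = {
    "k": [[0.5]],
    "ae": [[3.0], [3.2]],
    "t": [[0.8]],
}

print(word_template("cat", lexicon, phone_templates))
```

The resulting frame sequence can then be compared against a sample with DTW like any recorded word template, so new words need only a pronunciation rather than a new recording.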
Summary
• Speech Recognition by Templates
– Good for simple small vocabulary tasks
• Dynamic Time Warping (DTW)
– Can match different durational examples
• Averaging over multiple models
• Distance metrics
– Euclidean vs Mahalanobis