![Page 1: Microsoft Research Academic Services GTM Draft€¦ · Microsoft Cortana 2014 Amazon Echo 2014 Apple Siri 2011 Samsung S Voice 2014 Google Home 2016 Microsoft Invoke ... Microsoft](https://reader036.vdocuments.net/reader036/viewer/2022081517/5fc0256517e4e042f30a66b3/html5/thumbnails/1.jpg)
![Page 2: Microsoft Research Academic Services GTM Draft€¦ · Microsoft Cortana 2014 Amazon Echo 2014 Apple Siri 2011 Samsung S Voice 2014 Google Home 2016 Microsoft Invoke ... Microsoft](https://reader036.vdocuments.net/reader036/viewer/2022081517/5fc0256517e4e042f30a66b3/html5/thumbnails/2.jpg)
Yanmin Qian
Shanghai Jiao Tong University
New Challenges and Recent Progresses in Speech Recognition
![Page 3: Microsoft Research Academic Services GTM Draft€¦ · Microsoft Cortana 2014 Amazon Echo 2014 Apple Siri 2011 Samsung S Voice 2014 Google Home 2016 Microsoft Invoke ... Microsoft](https://reader036.vdocuments.net/reader036/viewer/2022081517/5fc0256517e4e042f30a66b3/html5/thumbnails/3.jpg)
Statistical Speech Recognition
Speech Waveforms
Front End Processing
Acoustic Model
Recognition (Inference)
Language Model
Lexicon
Recognized Hypothesis
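The pipeline on this slide is usually summarised by the standard Bayes decision rule. The symbols below are assumptions chosen for illustration and do not appear on the slide: X is the feature sequence produced by front-end processing and W is a candidate word sequence.

```latex
\hat{W} = \arg\max_{W} P(W \mid X)
        = \arg\max_{W} \frac{p(X \mid W)\, P(W)}{p(X)}
        = \arg\max_{W} \; p(X \mid W)\, P(W)
```

Under this reading, the acoustic model (together with the lexicon) supplies p(X | W), the language model supplies P(W), and recognition (inference) is the search for the maximising hypothesis. The term p(X) is dropped because it does not depend on W.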
![Page 4: Microsoft Research Academic Services GTM Draft€¦ · Microsoft Cortana 2014 Amazon Echo 2014 Apple Siri 2011 Samsung S Voice 2014 Google Home 2016 Microsoft Invoke ... Microsoft](https://reader036.vdocuments.net/reader036/viewer/2022081517/5fc0256517e4e042f30a66b3/html5/thumbnails/4.jpg)
Speech Recognition in Products
• Microsoft Cortana (2014)
• Amazon Echo (2014)
• Apple Siri (2011)
• Samsung S Voice (2014)
• Google Home (2016)
• Microsoft Invoke (2017)
• Apple HomePod (2017)
• Google Now (2012)
![Page 5: Microsoft Research Academic Services GTM Draft€¦ · Microsoft Cortana 2014 Amazon Echo 2014 Apple Siri 2011 Samsung S Voice 2014 Google Home 2016 Microsoft Invoke ... Microsoft](https://reader036.vdocuments.net/reader036/viewer/2022081517/5fc0256517e4e042f30a66b3/html5/thumbnails/5.jpg)
Progress in Speech Recognition
SWB WER (%)
Year  Group  System                      WER (%)
2011  IBM    GMM-HMM                     14.5
2012  IBM    DNN-HMM                     12.2
2013  IBM    CNN-HMM                     11.8
2014  IBM    Joint CNN/DNN               10.4
2015  IBM    Joint CNN/DNN + RNN + NNLM  8
2016  MSR    ResNet+LACE+BLSTM + RNNLM   5.9
2017  IBM    ResNet+WaveNet + LSTMLM     5.5
Source: Microsoft speech & dialogue group
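For readers unfamiliar with the metric on this chart, the following is a minimal sketch of how word error rate (WER) is commonly computed: the word-level edit distance between reference and hypothesis, divided by the number of reference words. The function name and the whitespace tokenisation are assumptions for illustration; evaluation toolkits apply additional text normalisation that is not shown here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution (sat -> sit) and one deletion (the) over 6 reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because insertions count as errors, WER computed this way can exceed 100% when the hypothesis is much longer than the reference.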
![Page 6: Microsoft Research Academic Services GTM Draft€¦ · Microsoft Cortana 2014 Amazon Echo 2014 Apple Siri 2011 Samsung S Voice 2014 Google Home 2016 Microsoft Invoke ... Microsoft](https://reader036.vdocuments.net/reader036/viewer/2022081517/5fc0256517e4e042f30a66b3/html5/thumbnails/6.jpg)
Good Enough to Deploy ASR Everywhere?
![Page 7: Microsoft Research Academic Services GTM Draft€¦ · Microsoft Cortana 2014 Amazon Echo 2014 Apple Siri 2011 Samsung S Voice 2014 Google Home 2016 Microsoft Invoke ... Microsoft](https://reader036.vdocuments.net/reader036/viewer/2022081517/5fc0256517e4e042f30a66b3/html5/thumbnails/7.jpg)
Still Challenging on Many Aspects
Noise Robust
Multi Genre
Low Resource
Multi/Mix Lingual
Low Computation
Rich Transcription
![Page 8: Microsoft Research Academic Services GTM Draft€¦ · Microsoft Cortana 2014 Amazon Echo 2014 Apple Siri 2011 Samsung S Voice 2014 Google Home 2016 Microsoft Invoke ... Microsoft](https://reader036.vdocuments.net/reader036/viewer/2022081517/5fc0256517e4e042f30a66b3/html5/thumbnails/8.jpg)
Challenge 1: Noise Robust
Large degradation exists in noisy scenarios
• Additive noise, reverberation, channel distortion…
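As an illustration of the additive-noise case only, the sketch below mixes a noise signal into clean speech at a chosen signal-to-noise ratio (SNR). The function name and the plain-list representation of samples are assumptions for illustration; reverberation and channel distortion are convolutional effects and are not modelled here.

```python
import math
from typing import List, Sequence


def mix_at_snr(speech: Sequence[float], noise: Sequence[float], snr_db: float) -> List[float]:
    """Scale `noise` so that speech power / noise power equals the target SNR, then add."""
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    if not speech:
        raise ValueError("signals must not be empty")

    speech_power = sum(s * s for s in speech) / len(speech)
    noise_power = sum(n * n for n in noise) / len(noise)
    if noise_power == 0.0:
        raise ValueError("noise signal has zero power")

    # SNR(dB) = 10 * log10(speech_power / (scale^2 * noise_power))
    scale = math.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return [s + scale * n for s, n in zip(speech, noise)]


if __name__ == "__main__":
    clean = [1.0, -1.0, 1.0, -1.0]
    noise = [0.5, 0.5, -0.5, -0.5]
    print(mix_at_snr(clean, noise, snr_db=0.0))
```

Lower SNR values produce a larger noise scale, which is one simple way to simulate progressively harder test conditions.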
![Page 9: Microsoft Research Academic Services GTM Draft€¦ · Microsoft Cortana 2014 Amazon Echo 2014 Apple Siri 2011 Samsung S Voice 2014 Google Home 2016 Microsoft Invoke ... Microsoft](https://reader036.vdocuments.net/reader036/viewer/2022081517/5fc0256517e4e042f30a66b3/html5/thumbnails/9.jpg)
Mismatch and Adaptation
The system is fragile in reality due to the mismatch between training and test conditions
• Background noise
• Channel
• Speaker
• Accent
Adaptation is one effective method to reduce the mismatch
![Page 10: Microsoft Research Academic Services GTM Draft€¦ · Microsoft Cortana 2014 Amazon Echo 2014 Apple Siri 2011 Samsung S Voice 2014 Google Home 2016 Microsoft Invoke ... Microsoft](https://reader036.vdocuments.net/reader036/viewer/2022081517/5fc0256517e4e042f30a66b3/html5/thumbnails/10.jpg)
Cluster Adaptive Training (T. Tan, et al. T-ASLP2016)
DNN: full layer matrix as bases
CNN: feature maps or filters as bases
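One common way to read "full layer matrix as bases" is that the adapted weight matrix of a layer is a weighted sum of several basis matrices, with the interpolation weights estimated per speaker or cluster. The sketch below shows only that combination step. The function name, the nested-list matrices and the example weights are assumptions for illustration, not the authors' implementation.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def interpolate_bases(bases: Sequence[Matrix], weights: Sequence[float]) -> Matrix:
    """Return sum_k weights[k] * bases[k] for equally shaped basis matrices."""
    if len(bases) != len(weights) or not bases:
        raise ValueError("need one interpolation weight per basis matrix")
    rows, cols = len(bases[0]), len(bases[0][0])
    for basis in bases:
        if len(basis) != rows or any(len(row) != cols for row in basis):
            raise ValueError("all basis matrices must have the same shape")

    combined = [[0.0] * cols for _ in range(rows)]
    for basis, weight in zip(bases, weights):
        for r in range(rows):
            for c in range(cols):
                combined[r][c] += weight * basis[r][c]
    return combined


if __name__ == "__main__":
    basis_a = [[1.0, 0.0], [0.0, 1.0]]
    basis_b = [[0.0, 2.0], [2.0, 0.0]]
    # Hypothetical interpolation weights for one speaker.
    print(interpolate_bases([basis_a, basis_b], [0.5, 0.25]))
```

In this formulation only the small interpolation vector has to be estimated at adaptation time, while the bases are shared across all speakers.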
![Page 11: Microsoft Research Academic Services GTM Draft€¦ · Microsoft Cortana 2014 Amazon Echo 2014 Apple Siri 2011 Samsung S Voice 2014 Google Home 2016 Microsoft Invoke ... Microsoft](https://reader036.vdocuments.net/reader036/viewer/2022081517/5fc0256517e4e042f30a66b3/html5/thumbnails/11.jpg)
Environment-Aware Training (Y. Qian, et al. T-ASLP2016)
DNNs are used for all factor representations
• Speaker, phone, environment
• With the specific target and criterion
Factor integration + Cross connection
• Information exchange mechanism
• All modules can benefit from each other
Traditional factors are not ruled out
![Page 12: Microsoft Research Academic Services GTM Draft€¦ · Microsoft Cortana 2014 Amazon Echo 2014 Apple Siri 2011 Samsung S Voice 2014 Google Home 2016 Microsoft Invoke ... Microsoft](https://reader036.vdocuments.net/reader036/viewer/2022081517/5fc0256517e4e042f30a66b3/html5/thumbnails/12.jpg)
Auto-noise-reduction Acoustic Model (Y. Qian, et al. T-ASLP2016)
Very deep CNNs
• Local correlation
• Translational invariance
Very deep CNN achieves promising results in noisy scenarios
Much better than RNN in noisy conditions
Can reduce the noise embedding and de-noise gradually across the stacked convolutional layers
Appropriate pooling, padding and input channel usages are important in speech
![Page 13: Microsoft Research Academic Services GTM Draft€¦ · Microsoft Cortana 2014 Amazon Echo 2014 Apple Siri 2011 Samsung S Voice 2014 Google Home 2016 Microsoft Invoke ... Microsoft](https://reader036.vdocuments.net/reader036/viewer/2022081517/5fc0256517e4e042f30a66b3/html5/thumbnails/13.jpg)
More Advanced with All Techs (T. Tan, et al. SpeechCom2017)
The system can be boosted significantly further by combining all these technologies
VDCNN + Factor-aware Training
ResNet + Factor-aware Training + Cluster Adaptive Training
VDCNN + RNN Joint Decoding
![Page 14: Microsoft Research Academic Services GTM Draft€¦ · Microsoft Cortana 2014 Amazon Echo 2014 Apple Siri 2011 Samsung S Voice 2014 Google Home 2016 Microsoft Invoke ... Microsoft](https://reader036.vdocuments.net/reader036/viewer/2022081517/5fc0256517e4e042f30a66b3/html5/thumbnails/14.jpg)
New Milestone on Aurora4 Task (T. Tan, et al. SpeechCom2017)
Aurora4 WER (%)
System      WER (%)
2012 CUED…  13.4
2013 MSR…   12.4
2014 IBM…   10
2015 USTC…  10.3
2015 SJTU…  9.7
2016 EU…    8.7
2016 SJTU…  7.1
2017 SJTU…  5.7
![Page 15: Microsoft Research Academic Services GTM Draft€¦ · Microsoft Cortana 2014 Amazon Echo 2014 Apple Siri 2011 Samsung S Voice 2014 Google Home 2016 Microsoft Invoke ... Microsoft](https://reader036.vdocuments.net/reader036/viewer/2022081517/5fc0256517e4e042f30a66b3/html5/thumbnails/15.jpg)
Challenge 2: Multi Genre
Multi-Genre: Comedy, Drama, Children, Advice, News…
• YouTube, BBC, etc.
Very high WER in the provided transcripts (30.0%~40.0%), and no accurate timing information
![Page 16: Microsoft Research Academic Services GTM Draft€¦ · Microsoft Cortana 2014 Amazon Echo 2014 Apple Siri 2011 Samsung S Voice 2014 Google Home 2016 Microsoft Invoke ... Microsoft](https://reader036.vdocuments.net/reader036/viewer/2022081517/5fc0256517e4e042f30a66b3/html5/thumbnails/16.jpg)
Alignment: Lightly Supervised (P. Lanchantin, et al. InterSpeech2016)
Diverse audio
• Multi Genres
• Different length
• Diverse conditions
Diverse transcription
• Words existing were not spoken
• Words missing were spoken
• High WER 30.0%~40.0%
• No accurate time
Lightly supervised alignment
• Lightly supervised decoding
• Split point detection
• Segments merging
• Non-speech filter
• Data selection by confidence
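The "data selection by confidence" step on this slide can be pictured as a simple filter over aligned segments. The sketch below is a minimal illustration under stated assumptions: the `Segment` fields, the function name and the 0.8 threshold are hypothetical and are not taken from the cited system.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    start: float       # segment start time in seconds
    end: float         # segment end time in seconds
    text: str          # transcript aligned to this segment
    confidence: float  # alignment/decoding confidence in [0, 1]


def select_segments(segments: List[Segment], min_confidence: float = 0.8) -> List[Segment]:
    """Keep only segments whose confidence reaches the threshold."""
    return [seg for seg in segments if seg.confidence >= min_confidence]


def total_duration(segments: List[Segment]) -> float:
    """Amount of audio (in seconds) covered by the given segments."""
    return sum(seg.end - seg.start for seg in segments)


if __name__ == "__main__":
    candidates = [
        Segment(0.0, 2.5, "good evening and welcome", 0.95),
        Segment(2.5, 4.0, "applause", 0.30),
        Segment(4.0, 9.0, "tonight we look at the weather", 0.85),
    ]
    kept = select_segments(candidates)
    print([seg.text for seg in kept], total_duration(kept))
```

Raising the threshold trades training-data quantity for transcript quality, which is the usual tension when training on lightly supervised data.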
![Page 17: Microsoft Research Academic Services GTM Draft€¦ · Microsoft Cortana 2014 Amazon Echo 2014 Apple Siri 2011 Samsung S Voice 2014 Google Home 2016 Microsoft Invoke ... Microsoft](https://reader036.vdocuments.net/reader036/viewer/2022081517/5fc0256517e4e042f30a66b3/html5/thumbnails/17.jpg)
Demo: Alignment
![Page 18: Microsoft Research Academic Services GTM Draft€¦ · Microsoft Cortana 2014 Amazon Echo 2014 Apple Siri 2011 Samsung S Voice 2014 Google Home 2016 Microsoft Invoke ... Microsoft](https://reader036.vdocuments.net/reader036/viewer/2022081517/5fc0256517e4e042f30a66b3/html5/thumbnails/18.jpg)
Transcription: Acoustic Model + Adaptation (P.C. Woodland, et al. ASRU2015)
Based on parameterised p-sigmoid / p-relu activation
Scales slope of activation functions
Adaptation at utterance level and layer-by-layer
DNN Hybrid system with MPE training
• More advanced using stacked hybrid system
DNN Tandem system with MPE training
• More advanced using adaptation
System combination using joint decoding
• More advanced using structured log-linear model
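To make the parameterised activations on this slide concrete, here is a minimal sketch of one possible form of p-sigmoid and p-relu, where learnable scalars scale the output or slope of the activation. The exact parameterisation, the parameter names (`eta`, `gamma`, `zeta`, `alpha`, `beta`) and the default values are assumptions for illustration and may differ from the cited system.

```python
import math


def p_sigmoid(x: float, eta: float = 1.0, gamma: float = 1.0, zeta: float = 0.0) -> float:
    """Sigmoid with learnable output scale (eta), slope (gamma) and shift (zeta)."""
    return eta / (1.0 + math.exp(-gamma * x + zeta))


def p_relu(x: float, alpha: float = 1.0, beta: float = 0.0) -> float:
    """ReLU with separate learnable slopes for positive (alpha) and negative (beta) inputs."""
    return alpha * x if x > 0.0 else beta * x


if __name__ == "__main__":
    # With default parameters these reduce to the ordinary sigmoid and ReLU.
    print(p_sigmoid(0.0), p_relu(2.0), p_relu(-2.0))
    # Hypothetical adapted parameters for one utterance.
    print(p_sigmoid(0.0, eta=0.8, gamma=2.0), p_relu(2.0, alpha=1.5, beta=0.1))
```

Because only a few scalars per unit or per layer change, adaptation of this kind can be estimated from a small amount of data, such as a single utterance.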
![Page 19: Microsoft Research Academic Services GTM Draft€¦ · Microsoft Cortana 2014 Amazon Echo 2014 Apple Siri 2011 Samsung S Voice 2014 Google Home 2016 Microsoft Invoke ... Microsoft](https://reader036.vdocuments.net/reader036/viewer/2022081517/5fc0256517e4e042f30a66b3/html5/thumbnails/19.jpg)
Transcription: Language Model + Adaptation (P.C. Woodland, et al. ASRU2015)
Efficient RNNLM training
• Non-class based, full vocab output
• Training with bunch mode
Efficient RNNLM lattice rescoring
• Better than N-best list rescoring
• Better using CN decoding
Topic adaptation via Latent Dirichlet allocation
• Fed into both the input & output layers
• Used in RNNLM training & after 1st-pass decoding
Figure panels: RNNLM; RNNLM + Adaptation
![Page 20: Microsoft Research Academic Services GTM Draft€¦ · Microsoft Cortana 2014 Amazon Echo 2014 Apple Siri 2011 Samsung S Voice 2014 Google Home 2016 Microsoft Invoke ... Microsoft](https://reader036.vdocuments.net/reader036/viewer/2022081517/5fc0256517e4e042f30a66b3/html5/thumbnails/20.jpg)
Demo: Transcription
![Page 21: Microsoft Research Academic Services GTM Draft€¦ · Microsoft Cortana 2014 Amazon Echo 2014 Apple Siri 2011 Samsung S Voice 2014 Google Home 2016 Microsoft Invoke ... Microsoft](https://reader036.vdocuments.net/reader036/viewer/2022081517/5fc0256517e4e042f30a66b3/html5/thumbnails/21.jpg)
Challenge 3: Cocktail Party Problem
“One of our most important faculties is our ability to listen to, and follow, one speaker in the presence of others. This is such a common experience that we may take it for granted; we may call it ‘the cocktail party problem’…” (Cherry ’57)
![Page 22: Microsoft Research Academic Services GTM Draft€¦ · Microsoft Cortana 2014 Amazon Echo 2014 Apple Siri 2011 Samsung S Voice 2014 Google Home 2016 Microsoft Invoke ... Microsoft](https://reader036.vdocuments.net/reader036/viewer/2022081517/5fc0256517e4e042f30a66b3/html5/thumbnails/22.jpg)
Cocktail Party Problem
This is a key and very difficult problem for speech processing in reality
• Requires more research work
Human performance is superior to machine performance in this scenario
• “For ‘cocktail party’-like situations…when all voices are equally loud, speech remains intelligible for normal-hearing listeners even when there are as many as six interfering talkers” (Bronkhorst & Plomp ’92)
Multi-talker
• Speech separation: Separate and trace the streams of the mixed speech
• Speech recognition: Recognize the streams of the mixed speech
• Speaker identification: Identify the speakers of the mixed speech
![Page 23: Microsoft Research Academic Services GTM Draft€¦ · Microsoft Cortana 2014 Amazon Echo 2014 Apple Siri 2011 Samsung S Voice 2014 Google Home 2016 Microsoft Invoke ... Microsoft](https://reader036.vdocuments.net/reader036/viewer/2022081517/5fc0256517e4e042f30a66b3/html5/thumbnails/23.jpg)
Multi-Talker Speech Separation (D. Yu, et al. ICASSP2017)
Traditional approaches: CASA, NMF, factorial GMM-HMM, microphone array…
Deep learning: converts the problem from an unsupervised learning problem to a supervised one
Simple supervised training does not work due to the label permutation problem
• Only works well for seen speakers or specific interferences
Label Permutation/Ambiguity Problem: Speaker 1 -> output 1? / -> output 2?
![Page 24: Microsoft Research Academic Services GTM Draft€¦ · Microsoft Cortana 2014 Amazon Echo 2014 Apple Siri 2011 Samsung S Voice 2014 Google Home 2016 Microsoft Invoke ... Microsoft](https://reader036.vdocuments.net/reader036/viewer/2022081517/5fc0256517e4e042f30a66b3/html5/thumbnails/24.jpg)
Permutation Invariant Training (D. Yu, et al. ICASSP2017)
Automatically determines the best label assignment based on the current model
Can be used to separate multiple speech streams
Only affects training; no extra processing during separation
Can be easily extended to 3 speakers
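The core idea of permutation invariant training can be sketched in a few lines: compute the loss for every assignment of output streams to reference speakers and train on the smallest one. The version below is a minimal illustration, assuming a mean-squared-error criterion and plain Python lists for the streams; the function names are hypothetical and the exact training criterion in the cited work may differ.

```python
from itertools import permutations
from typing import List, Sequence, Tuple


def mse(estimate: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between one output stream and one reference stream."""
    if len(estimate) != len(target):
        raise ValueError("streams must have the same length")
    return sum((e - t) ** 2 for e, t in zip(estimate, target)) / len(target)


def pit_loss(outputs: List[Sequence[float]],
             targets: List[Sequence[float]]) -> Tuple[float, Tuple[int, ...]]:
    """Return the minimum average loss over all output-to-speaker assignments.

    The returned permutation maps output stream i to targets[perm[i]].
    """
    if len(outputs) != len(targets) or not outputs:
        raise ValueError("need the same non-zero number of outputs and targets")

    best_loss = float("inf")
    best_perm: Tuple[int, ...] = ()
    for perm in permutations(range(len(targets))):
        loss = sum(mse(outputs[i], targets[p]) for i, p in enumerate(perm)) / len(perm)
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm


if __name__ == "__main__":
    # Output 0 resembles speaker 1 and output 1 resembles speaker 0.
    outputs = [[0.0, 0.1], [1.0, 0.9]]
    targets = [[1.0, 1.0], [0.0, 0.0]]
    print(pit_loss(outputs, targets))
```

The exhaustive search covers S! assignments for S speakers, which stays cheap for the two- and three-speaker cases mentioned on the slide. The search is needed only when computing the training loss, consistent with the point that separation itself needs no extra processing.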
![Page 25: Microsoft Research Academic Services GTM Draft€¦ · Microsoft Cortana 2014 Amazon Echo 2014 Apple Siri 2011 Samsung S Voice 2014 Google Home 2016 Microsoft Invoke ... Microsoft](https://reader036.vdocuments.net/reader036/viewer/2022081517/5fc0256517e4e042f30a66b3/html5/thumbnails/25.jpg)
Multi-Talker Speech Recognition (D. Yu, et al. InterSpeech2017)
Traditional approaches
• Factorial GMM-HMM (IBM 2006): outperforms humans, but only on an easy isolated-word task with seen speakers
• Deep models: explicit speech separation + normal speech recognition
PIT is extended to ASR
• Automatically determines the best label assignment
• Does the recognition without separation
• Separation/tracing/recognition in one shot
• Can recognize multiple speech streams
![Page 26: Microsoft Research Academic Services GTM Draft€¦ · Microsoft Cortana 2014 Amazon Echo 2014 Apple Siri 2011 Samsung S Voice 2014 Google Home 2016 Microsoft Invoke ... Microsoft](https://reader036.vdocuments.net/reader036/viewer/2022081517/5fc0256517e4e042f30a66b3/html5/thumbnails/26.jpg)
Multi-Talker Speech Recognition (D. Yu, et al. InterSpeech2017)
![Page 27: Microsoft Research Academic Services GTM Draft€¦ · Microsoft Cortana 2014 Amazon Echo 2014 Apple Siri 2011 Samsung S Voice 2014 Google Home 2016 Microsoft Invoke ... Microsoft](https://reader036.vdocuments.net/reader036/viewer/2022081517/5fc0256517e4e042f30a66b3/html5/thumbnails/27.jpg)
Multi-Talker Speaker Identification (X. Zhao, et al. T-ASLP2015)
Traditional approaches
• Speech Separation + Speaker Identification
• GMM-based Approach
Deep Learning based Approach
• Input can be raw wave or cepstral features
• Easy to extend to multi-speaker (>2) conditions
• Soft alignment of frames to speakers
![Page 28: Microsoft Research Academic Services GTM Draft€¦ · Microsoft Cortana 2014 Amazon Echo 2014 Apple Siri 2011 Samsung S Voice 2014 Google Home 2016 Microsoft Invoke ... Microsoft](https://reader036.vdocuments.net/reader036/viewer/2022081517/5fc0256517e4e042f30a66b3/html5/thumbnails/28.jpg)
Multi-Talker Speaker Identification
SSC (Speech Separation Contest)
• English short commands (1 second)
• 34 speakers in total
Chorus
• Chinese song chorus
• 10 kids (8-12 years old)
• Audio samples (2, 3 and 4 speakers): Chorus, SSC
Performance
Corpus  2 speakers  3 speakers  4 speakers
SSC     100%        97.8%       80.0%
Chorus  97.0%       85.5%       66.2%
• For Chorus, the accuracy for 3 speakers out of 4 is 98.0%.
![Page 29: Microsoft Research Academic Services GTM Draft€¦ · Microsoft Cortana 2014 Amazon Echo 2014 Apple Siri 2011 Samsung S Voice 2014 Google Home 2016 Microsoft Invoke ... Microsoft](https://reader036.vdocuments.net/reader036/viewer/2022081517/5fc0256517e4e042f30a66b3/html5/thumbnails/29.jpg)
Microsoft Cognitive Toolkit (CNTK)
From 2014: Computational Network Toolkit
To now: Cognitive Toolkit
![Page 30: Microsoft Research Academic Services GTM Draft€¦ · Microsoft Cortana 2014 Amazon Echo 2014 Apple Siri 2011 Samsung S Voice 2014 Google Home 2016 Microsoft Invoke ... Microsoft](https://reader036.vdocuments.net/reader036/viewer/2022081517/5fc0256517e4e042f30a66b3/html5/thumbnails/30.jpg)
References
• [1] Y. Qian, M. Bi, T. Tan and K. Yu. Very Deep Convolutional Neural Networks for Noise Robust Speech Recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 12, 2263-2276, 2016.
• [2] Y. Qian, T. Tan and D. Yu. Neural Network Based Multi-Factor Aware Joint Training for Robust Speech Recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 12, 2231-2240, 2016.
• [3] T. Tan, Y. Qian and K. Yu. Cluster Adaptive Training for Deep Neural Network Based Acoustic Model. IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 3, 459-468, 2016.
• [4] T. Tan, Y. Qian, H. Hu, Y. Zhou and K. Yu. Adaptive Very Deep Convolutional Residual Network for Noise Robust Speech Recognition. Speech Communication, under review, 2017.
• [5] P. Lanchantin, M.J.F. Gales, P. Karanasou, X. Liu, Y. Qian, L. Wang, P.C. Woodland and C. Zhang. Selection of Multi-genre Broadcast Data for the Training of Automatic Speech Recognition System. Interspeech, 2016.
• [6] P.C. Woodland, X. Liu, Y. Qian, C. Zhang, M.J.F. Gales, P. Karanasou, P. Lanchantin and L. Wang. Cambridge University Transcription Systems for the Multi-Genre Broadcast Challenge. ASRU, 2015.
• [7] D. Yu, M. Kolbæk, Z.-H. Tan and J. Jensen. Permutation Invariant Training of Deep Models for Speaker-independent Multi-talker Speech Separation. ICASSP, 2017.
• [8] D. Yu, X. Chang and Y. Qian. Recognizing Multi-talker Speech with Permutation Invariant Training. Interspeech, 2017.
• [9] X. Zhao, Y. Wang and D. Wang. Cochannel Speaker Identification in Anechoic and Reverberant Conditions. IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 11, 1727-1736, 2015.
![Page 31: Microsoft Research Academic Services GTM Draft€¦ · Microsoft Cortana 2014 Amazon Echo 2014 Apple Siri 2011 Samsung S Voice 2014 Google Home 2016 Microsoft Invoke ... Microsoft](https://reader036.vdocuments.net/reader036/viewer/2022081517/5fc0256517e4e042f30a66b3/html5/thumbnails/31.jpg)
Thanks to my colleagues and students at SJTU!
Thanks to Microsoft for the collaboration!
Thanks to the CNTK team!
![Page 32: Microsoft Research Academic Services GTM Draft€¦ · Microsoft Cortana 2014 Amazon Echo 2014 Apple Siri 2011 Samsung S Voice 2014 Google Home 2016 Microsoft Invoke ... Microsoft](https://reader036.vdocuments.net/reader036/viewer/2022081517/5fc0256517e4e042f30a66b3/html5/thumbnails/32.jpg)