TRANSCRIPT
Listen, Attend and Spell
A Neural Network for Large Vocabulary Conversational Speech Recognition
William Chan, Navdeep Jaitly, Quoc Le, Oriol Vinyals
[email protected], {ndjaitly,qvl,vinyals}@google.com
*work done at Google Brain.
September 13, 2016
Outline
1. Introduction and Motivation
2. Model: Listen, Attend and Spell
3. Experiments and Results
4. Conclusion
Carnegie Mellon University 2
Introduction and Motivation
Automatic Speech Recognition
Input
- Acoustic signal

Output
- Word transcription
State-of-the-Art ASR is Complicated
- Signal Processing
- Pronunciation Dictionary
- GMM-HMM
- Context-Dependent Phonemes
- DNN Acoustic Model
- Sequence Training
- Language Model

- Many proxy problems, (mostly) independently optimized
- Disconnect between proxy problems (i.e., frame accuracy) and ASR performance
- Sequence Training solves some of the problems
HMM Assumptions
- Conditional independence between frames/symbols
- Markovian
- Phonemes

- We make untrue assumptions to simplify our problem
- Almost everything falls back to the HMM (and phonemes)
Goal: Model Characters directly from Acoustics
Input
- Acoustic signal (e.g., filterbank spectra)

Output
- English characters

- Don't make assumptions about our distributions
End-to-End Model
- Signal Processing
- Listen, Attend and Spell (LAS)
- Language Model?

- One model, optimized end-to-end
- Learn pronunciation, acoustics, and dictionary all in one end-to-end model
Model: Listen, Attend and Spell
Sequence-to-Sequence (and Attention)
Machine Translation:
- Sutskever et al., "Sequence to Sequence Learning with Neural Networks," in NIPS 2014.
- Cho et al., "Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation," in EMNLP 2014.
- Bahdanau et al., "Neural Machine Translation by Jointly Learning to Align and Translate," in ICLR 2015.

TIMIT:
- Chorowski et al., "Attention-Based Models for Speech Recognition," in NIPS 2015.
Listen, Attend and Spell
Let x be our acoustic features, and let y be the sequence we are trying to model (i.e., the character sequence):

h = Listen(x)  (1)
P(y_i | x, y_<i) = AttendAndSpell(y_<i, h)  (2)
Implicit Language Model
- HMM/CTC have a conditional independence assumption
- seq2seq models have a conditional dependence on the previously emitted symbols:

P(y_i | x, y_<i)  (3)
Listen, Attend and Spell
- Listen(x) can be an RNN (e.g., an LSTM)
- Transforms our input features x into some higher-level feature h
- AttendAndSpell is an attention-based RNN decoder.
Listen, Attend and Spell
AttendAndSpell is an attention-based RNN decoder:

c_i = AttentionContext(s_i, h)  (4)
s_i = RNN(s_{i-1}, y_{i-1}, c_{i-1})  (5)
P(y_i | x, y_<i) = CharacterDistribution(s_i, c_i)  (6)

AttentionContext creates an alignment and a context for each timestep:

e_{i,u} = ⟨φ(s_i), ψ(h_u)⟩  (7)
α_{i,u} = exp(e_{i,u}) / Σ_{u'} exp(e_{i,u'})  (8)
c_i = Σ_u α_{i,u} h_u  (9)
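In code, eqs. (4)-(9) reduce to a dot-product score over listener timesteps followed by a softmax and a weighted sum. A minimal NumPy sketch, where φ and ψ stand in for the learned projections (identity functions here, purely for illustration):

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax (eq. 8)."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention_context(s_i, h, phi, psi):
    """Score each listener timestep u against decoder state s_i,
    normalize, and take the weighted sum (eqs. 7-9)."""
    e = np.array([np.dot(phi(s_i), psi(h_u)) for h_u in h])  # eq. (7)
    alpha = softmax(e)                                        # eq. (8)
    c_i = (alpha[:, None] * h).sum(axis=0)                    # eq. (9)
    return c_i, alpha

# Toy check: 5 listener frames of dim 4, identity projections.
rng = np.random.default_rng(0)
h = rng.standard_normal((5, 4))
s = rng.standard_normal(4)
identity = lambda v: v
c, alpha = attention_context(s, h, identity, identity)
assert np.isclose(alpha.sum(), 1.0)  # attention weights sum to 1
assert c.shape == (4,)               # context keeps the feature dim
```

In the model itself, φ and ψ are small learned MLPs and s_i, h come from the decoder and listener RNNs; this sketch only shows the alignment arithmetic.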
Listen, Attend and Spell
Listen, Attend and Spell
- The attention mechanism creates a short circuit between each decoder output and the acoustics
- More efficient information/gradient flow!
- Creates an explicit alignment between each character and the acoustic features
- CTC's alignment is latent
Listen, Attend and Spell
- Model works, but...
- Takes "forever" to train; after >1 month the model still had not converged :(
- WERs in the >20s (CLDNN-HMM is ~8)
- The attention mechanism must focus on a long range of frames
Pyramid
Pyramid
- Build higher-level features with each layer
- Reduce the number of timesteps for attention to attend to
- Computational efficiency
- 8 filterbank frames → 1 pyramid frame feature

h_i^j = pBLSTM(h_{i-1}^j, [h_{2i}^{j-1}, h_{2i+1}^{j-1}])  (10)
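Each pyramid layer concatenates consecutive pairs of lower-layer outputs before its BLSTM (eq. 10), halving the number of timesteps; three stacked layers yield the 8x reduction above. A NumPy sketch of just the time reduction, with a fixed random projection standing in for the learned BLSTM:

```python
import numpy as np

def pyramid_reduce(h, layer_fn):
    """One pBLSTM layer's time reduction: concatenate each pair of
    consecutive frames [h_{2i}; h_{2i+1}], halving the timesteps."""
    T, d = h.shape
    T = T - (T % 2)                        # drop a trailing odd frame
    pairs = h[:T].reshape(T // 2, 2 * d)   # row i = [h_{2i}; h_{2i+1}]
    return layer_fn(pairs)

rng = np.random.default_rng(0)
T, d = 32, 8
h0 = rng.standard_normal((T, d))           # "filterbank" features
W = rng.standard_normal((2 * d, d))
blstm = lambda x: np.tanh(x @ W)           # stand-in for the BLSTM

h1 = pyramid_reduce(h0, blstm)             # 32 -> 16 timesteps
h2 = pyramid_reduce(h1, blstm)             # 16 -> 8
h3 = pyramid_reduce(h2, blstm)             # 8  -> 4
assert h3.shape == (T // 8, d)             # 8 input frames per output step
```

The attention of eqs. (7)-(9) then runs over the 4 pyramid frames instead of the 32 raw frames, which is what makes training tractable.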
Pyramid
1. Sequence-to-Sequence
2. Attention
3. Pyramid

- 16 to 20-ish WERs (w/o LM)
- Takes around 2-3 weeks to train; overfitting is a HUGE problem
- Mismatch between training and inference conditions
Sampling Trick
Machine Translation, Image Captioning and TIMIT:
- Bengio et al., "Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks," in NIPS 2015.
Sampling Trick
- Training is conditioned on the ground truth
- We don't have access to the ground truth during inference!

max_θ Σ_i log P(y_i | x, y*_<i; θ)  (11)
Sampling Trick
- Sample from our model
- Condition on the sample for the next step's prediction

y_i ~ CharacterDistribution(s_i, c_i)  (12)

max_θ Σ_i log P(y_i | x, y_<i; θ)  (13)
Listen, Attend and Spell
P(y|x) = ∏_i P(y_i | x, y_<i)  (14)

h = Listen(x)  (15)
P(y|x) = AttendAndSpell(h, y)  (16)
Language Model Rescoring
- Leverage vast quantities of text!
- Normalize our LAS model by the number of characters in the utterance – LAS has a bias for short utterances.

s(y, x) = log P(y|x) / |y|_c + λ log P_LM(y)  (17)
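Applying eq. (17) to an n-best list is a one-liner per hypothesis. A sketch, where the λ value and the `dummy_lm` scorer are illustrative stand-ins, not the values used in the talk:

```python
def rescore(hypotheses, lm_logprob, lam):
    """Eq. (17): length-normalized LAS score plus a weighted LM
    score; return the best hypothesis from the n-best list."""
    def score(hyp):
        text, las_logprob = hyp
        n_chars = max(len(text), 1)  # |y|_c, guard the empty string
        return las_logprob / n_chars + lam * lm_logprob(text)
    return max(hypotheses, key=score)

# Toy n-best list of (text, log P(y|x)) pairs; a real system would
# use the beam search outputs and log P_LM(y) from a trained LM.
nbest = [("call aaa roadside assistance", -0.57),
         ("call triple a roadside assistance", -1.54)]
dummy_lm = lambda text: -0.1 * len(text.split())
best = rescore(nbest, dummy_lm, lam=0.5)
assert best[0] == "call aaa roadside assistance"
```

Without the 1/|y|_c normalization, the raw log P(y|x) term would systematically favor shorter strings, which is exactly the bias the slide calls out.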
Experiments and Results
Dataset
- Google voice search
- 2,000 hrs, 3M training utterances
- 16 hrs, 22K test utterances
- Mixed Room Simulator: artificially increases the acoustic data by 20x (i.e., YouTube and environmental noise)
- Clean and noisy test sets
Training
- Stochastic Gradient Descent
- DistBelief, 32 replicas, minibatch size of 32
- 2-3 weeks of training time
Results
| Model | Clean WER | Noisy WER |
|---|---|---|
| CLDNN-HMM (Sainath et al., 2015) | 8.0 | 8.9 |
| LAS | 16.2 | 19.0 |
| LAS + LM | 12.6 | 14.7 |
| LAS + Sampling | 14.1 | 16.5 |
| LAS + Sampling + LM | 10.3 | 12.0 |
Decoding
- We didn't decode with a dictionary!
- LAS implicitly learnt the dictionary during training
- Rare spelling mistakes!
- We didn't decode with an LM! (only rescored)
- n-best list decoding where n = 32
- CLDNN-HMM is convolutional and unidirectional; LAS is non-convolutional and bidirectional
Decoding
[Figure: N-best List Decoding, N vs. WER – curves for WER, WER + LM, and oracle WER, with N ∈ {1, 2, 4, 8, 16, 32} on the x-axis and WER (0-20) on the y-axis.]
Decoding
- 16% WER without any search (and LM) – just take the greedy path!
- LAS does "reasonably" well even with n = 4 in n-best list decoding
- Not much to gain after n > 16
- LM rescoring recovers less than 1/2 of the oracle – need to improve the LM?
Results: Triple A
| N | Text | log P | WER |
|---|---|---|---|
| Truth | call aaa roadside assistance | - | - |
| 1 | call aaa roadside assistance | -0.57 | 0.0 |
| 2 | call triple a roadside assistance | -1.54 | 50.0 |
| 3 | call trip way roadside assistance | -3.50 | 50.0 |
| 4 | call xxx roadside assistance | -4.44 | 25.0 |
Conclusion
![Page 37: Listen, Attend and Spellklivescu/MLSLP2016/chan_MLSLP2016.pdfListen, Attend and Spell - A Neural Network for Large Vocabulary Conversational Speech Recognition Author William Chan,](https://reader030.vdocuments.net/reader030/viewer/2022040823/5e6d98573b94c46bda628a49/html5/thumbnails/37.jpg)
Conclusion
- Listen, Attend and Spell (LAS)
- End-to-end speech recognition model
- No conditional independence, Markovian assumptions, or proxy problems
- Sequence-to-Sequence + Attention + Pyramid
- One model: integrates all traditional components of an ASR system (acoustic, pronunciation, language, etc.)
- Competitive with the state-of-the-art CLDNN-HMM system: 10.3 vs. 8.0 WER
- Time to throw away HMMs and phonemes!
- Independently proposed by Bahdanau et al., 2016 on WSJ (next next talk, check out their paper too!)
Acknowledgements
Nothing is possible without help from many friends...
- Google Speech team: Tara Sainath and Babak Damavandi
- Google Brain team: Andrew Dai, Ashish Agarwal, Samy Bengio, Eugene Brevdo, Greg Corrado, Jeff Dean, Rajat Monga, Christopher Olah, Mike Schuster, Noam Shazeer, Ilya Sutskever, Vincent Vanhoucke and more!