achieving human parity in conversational speech recognition · pdf filehuman parity in...
TRANSCRIPT
![Page 1: Achieving Human Parity in Conversational Speech Recognition · PDF fileHuman Parity in Conversational Speech Recognition •What is Human Parity? •Humans make mistakes, too. Can](https://reader033.vdocuments.net/reader033/viewer/2022051523/5a78789f7f8b9a8c428bf78f/html5/thumbnails/1.jpg)
Achieving Human Parity in Conversational Speech
Recognition
Speech & Dialog Research Group
AI & Research
1
![Page 2: Achieving Human Parity in Conversational Speech Recognition · PDF fileHuman Parity in Conversational Speech Recognition •What is Human Parity? •Humans make mistakes, too. Can](https://reader033.vdocuments.net/reader033/viewer/2022051523/5a78789f7f8b9a8c428bf78f/html5/thumbnails/2.jpg)
Acknowledgements
2
Joint work withgreat collaborators!
![Page 3: Achieving Human Parity in Conversational Speech Recognition · PDF fileHuman Parity in Conversational Speech Recognition •What is Human Parity? •Humans make mistakes, too. Can](https://reader033.vdocuments.net/reader033/viewer/2022051523/5a78789f7f8b9a8c428bf78f/html5/thumbnails/3.jpg)
Human Parity in Conversational Speech Recognition• What is Human Parity?
• Humans make mistakes, too. Can ASR make fewer?
• Conversational Speech Recognition• Humans talking in unplanned way
• Focus on each other, not on a computer
• The result of thirty years of progress• DARPA / US Government programs
• Conversational Speech Recognition is the latest in a series of increasingly difficult tasks.
3
![Page 4: Achieving Human Parity in Conversational Speech Recognition · PDF fileHuman Parity in Conversational Speech Recognition •What is Human Parity? •Humans make mistakes, too. Can](https://reader033.vdocuments.net/reader033/viewer/2022051523/5a78789f7f8b9a8c428bf78f/html5/thumbnails/4.jpg)
Significance: History
RM
ATIS
WSJ
SWB
CH
For many years, DARPA drove the field 4
![Page 5: Achieving Human Parity in Conversational Speech Recognition · PDF fileHuman Parity in Conversational Speech Recognition •What is Human Parity? •Humans make mistakes, too. Can](https://reader033.vdocuments.net/reader033/viewer/2022051523/5a78789f7f8b9a8c428bf78f/html5/thumbnails/5.jpg)
Significance: Community
Building on accumulated knowledge of many institutions! 5
![Page 6: Achieving Human Parity in Conversational Speech Recognition · PDF fileHuman Parity in Conversational Speech Recognition •What is Human Parity? •Humans make mistakes, too. Can](https://reader033.vdocuments.net/reader033/viewer/2022051523/5a78789f7f8b9a8c428bf78f/html5/thumbnails/6.jpg)
Significance: Technical
• The right tool for the right job
• CNNs, LSTMs!
• Building on lots of past innovations:• HMM modeling
• Distributed Representations [Hinton ‘84]
• Early CNNs, RNNs, TDNNs [Lang & Hinton ‘88, Waibel et al. ‘89, Robinson ’91, Pineda ‘87]
• Hybrid training [Renals et al. ‘91, Bourlard & Morgan ‘94]
• Discriminative modeling
• Speaker adaptation
• System combination
RNNs / CNNs
GMMs
6
![Page 7: Achieving Human Parity in Conversational Speech Recognition · PDF fileHuman Parity in Conversational Speech Recognition •What is Human Parity? •Humans make mistakes, too. Can](https://reader033.vdocuments.net/reader033/viewer/2022051523/5a78789f7f8b9a8c428bf78f/html5/thumbnails/7.jpg)
Outline
• Acoustic Modeling
• Language Modeling
• Decoding, Rescoring & System Combination
• Measuring Human Performance
• Results
• Counterpoint – Letter based CTC
• Conclusions
7
![Page 8: Achieving Human Parity in Conversational Speech Recognition · PDF fileHuman Parity in Conversational Speech Recognition •What is Human Parity? •Humans make mistakes, too. Can](https://reader033.vdocuments.net/reader033/viewer/2022051523/5a78789f7f8b9a8c428bf78f/html5/thumbnails/8.jpg)
Outline
• Acoustic Modeling• Model Structures
• Language Modeling
• Decoding, Rescoring & System Combination
• Measuring Human Performance
• Results
• Counterpoint – Letter based CTC
• Conclusions
8
![Page 9: Achieving Human Parity in Conversational Speech Recognition · PDF fileHuman Parity in Conversational Speech Recognition •What is Human Parity? •Humans make mistakes, too. Can](https://reader033.vdocuments.net/reader033/viewer/2022051523/5a78789f7f8b9a8c428bf78f/html5/thumbnails/9.jpg)
Acoustic Modeling: Hybrid HMM/DNN
[Yu et al., 2010; Dahl et al., 2011]
Record performance in 2011 [Seide et al.]
Hybrid HMM/NN approach standardBut DNN model now obsolete (!)• Poor spatial/temporal invariance
9
1st pass decoding
![Page 10: Achieving Human Parity in Conversational Speech Recognition · PDF fileHuman Parity in Conversational Speech Recognition •What is Human Parity? •Humans make mistakes, too. Can](https://reader033.vdocuments.net/reader033/viewer/2022051523/5a78789f7f8b9a8c428bf78f/html5/thumbnails/10.jpg)
Acoustic Modeling: VGG CNN
[Simonyan & Zisserman, 2014; Frossard 2016, Saon et al., 2016, Krizhevsky et al., 2012]
Adapted from image processingRobust to temporal and frequency shifts
10
![Page 11: Achieving Human Parity in Conversational Speech Recognition · PDF fileHuman Parity in Conversational Speech Recognition •What is Human Parity? •Humans make mistakes, too. Can](https://reader033.vdocuments.net/reader033/viewer/2022051523/5a78789f7f8b9a8c428bf78f/html5/thumbnails/11.jpg)
Acoustic Modeling: ResNet
[He et al., 2015]
Add a non-linear offset to linear transformation of featuresSimilar to fMPE in Povey et al., 2005See also Ghahremani & Droppo, 2016Our best single model after rescoring
11
1st pass decoding
![Page 12: Achieving Human Parity in Conversational Speech Recognition · PDF fileHuman Parity in Conversational Speech Recognition •What is Human Parity? •Humans make mistakes, too. Can](https://reader033.vdocuments.net/reader033/viewer/2022051523/5a78789f7f8b9a8c428bf78f/html5/thumbnails/12.jpg)
Acoustic Modeling: LACE CNN
[Yu et al., 2016]
Combines batch normalization, Resnetjumps, and attention masks in CNNTied for 2nd best single model after rescoring
12
1st pass decoding
![Page 13: Achieving Human Parity in Conversational Speech Recognition · PDF fileHuman Parity in Conversational Speech Recognition •What is Human Parity? •Humans make mistakes, too. Can](https://reader033.vdocuments.net/reader033/viewer/2022051523/5a78789f7f8b9a8c428bf78f/html5/thumbnails/13.jpg)
CNN Comparison
13
Very deepMany parametersSmall convolution patternsProcessing ~ ½ second per window
![Page 14: Achieving Human Parity in Conversational Speech Recognition · PDF fileHuman Parity in Conversational Speech Recognition •What is Human Parity? •Humans make mistakes, too. Can](https://reader033.vdocuments.net/reader033/viewer/2022051523/5a78789f7f8b9a8c428bf78f/html5/thumbnails/14.jpg)
Acoustic Modeling: Bidirectional LSTMs
Stable form of recurrent neural netRobust to temporal shiftsTied for 2nd best single model
[Hochreiter & Schmidhuber, 1997, Graves & Schmidhuber, 2005; Sak et al., 2014]
[Graves & Jaitly ‘14]14
![Page 15: Achieving Human Parity in Conversational Speech Recognition · PDF fileHuman Parity in Conversational Speech Recognition •What is Human Parity? •Humans make mistakes, too. Can](https://reader033.vdocuments.net/reader033/viewer/2022051523/5a78789f7f8b9a8c428bf78f/html5/thumbnails/15.jpg)
Outline
• Acoustic Modeling• Training Techniques
• Language Modeling
• Decoding, Rescoring & System Combination
• Measuring Human Performance
• Results
• Counterpoint – Letter based CTC
• Conclusions
15
![Page 16: Achieving Human Parity in Conversational Speech Recognition · PDF fileHuman Parity in Conversational Speech Recognition •What is Human Parity? •Humans make mistakes, too. Can](https://reader033.vdocuments.net/reader033/viewer/2022051523/5a78789f7f8b9a8c428bf78f/html5/thumbnails/16.jpg)
I-vector Adaptation
𝑌(𝑡)
H1
H2
H3
H4
LABELS
Ƹ𝑠(𝑡)
[Dehak et al. 2011; Saon et al., 2013]
5-10% relative improvement for Switchboard
16
I-vectors provide a fixed-length representation of a speaker’s voice characteristics.
![Page 17: Achieving Human Parity in Conversational Speech Recognition · PDF fileHuman Parity in Conversational Speech Recognition •What is Human Parity? •Humans make mistakes, too. Can](https://reader033.vdocuments.net/reader033/viewer/2022051523/5a78789f7f8b9a8c428bf78f/html5/thumbnails/17.jpg)
Spatial Regularization
- =
2-D Unrolling Smoothed 2D Hi-Freq
Regularize with L2 norm of Hi-frequency residual
5-10% relative improvement for BLSTM
[Droppo et al. in progress]
17
![Page 18: Achieving Human Parity in Conversational Speech Recognition · PDF fileHuman Parity in Conversational Speech Recognition •What is Human Parity? •Humans make mistakes, too. Can](https://reader033.vdocuments.net/reader033/viewer/2022051523/5a78789f7f8b9a8c428bf78f/html5/thumbnails/18.jpg)
Lattice Free MMI
• Simple brute force MMI
• Avoids need to generate lattices
• Alignments always current
Data,
'
Data,
Data,
);'|()'(
);|(logmaxarg
);(
);|(logmaxarg
);()(
);,(logmaxarg
aw
w
aw
aw
waPwP
waP
aP
waP
aPwP
awP
Traditionally approximated by word sequences in lattice (DAG)
Instead LFMMI uses all possible word sequences in cyclic FSA
[Chen et al., 2006, McDermott et al., 2914, Povey et al., 2016]
18
![Page 19: Achieving Human Parity in Conversational Speech Recognition · PDF fileHuman Parity in Conversational Speech Recognition •What is Human Parity? •Humans make mistakes, too. Can](https://reader033.vdocuments.net/reader033/viewer/2022051523/5a78789f7f8b9a8c428bf78f/html5/thumbnails/19.jpg)
Denominator GPU computation
• Represent FSA of all possible state sequences as a sparse transition matrix A
• Implement exact alpha beta computations
• Execute in straight “for” loops on GPU with cusparseDcsrmv and cublasDdgmm
• Beautifully simple
11
1
tt
T
t
ttt
o
o
A
A
19
![Page 20: Achieving Human Parity in Conversational Speech Recognition · PDF fileHuman Parity in Conversational Speech Recognition •What is Human Parity? •Humans make mistakes, too. Can](https://reader033.vdocuments.net/reader033/viewer/2022051523/5a78789f7f8b9a8c428bf78f/html5/thumbnails/20.jpg)
LFMMI Improvements
• Denominator LM graph has 52k states and 215k transitions• GPU-side alpha-beta computation is 0.18xRT exclusive of NN evaluation
20
8-14% relative improvement on SWB
![Page 21: Achieving Human Parity in Conversational Speech Recognition · PDF fileHuman Parity in Conversational Speech Recognition •What is Human Parity? •Humans make mistakes, too. Can](https://reader033.vdocuments.net/reader033/viewer/2022051523/5a78789f7f8b9a8c428bf78f/html5/thumbnails/21.jpg)
Cognitive Toolkit (CNTK) Training
• Flexible
• Multi-GPU
• Multi-Server
• 1-bit SGD
• All AM training
• Best LM training
21
![Page 22: Achieving Human Parity in Conversational Speech Recognition · PDF fileHuman Parity in Conversational Speech Recognition •What is Human Parity? •Humans make mistakes, too. Can](https://reader033.vdocuments.net/reader033/viewer/2022051523/5a78789f7f8b9a8c428bf78f/html5/thumbnails/22.jpg)
Outline
• Acoustic Modeling
• Language Modeling
• Decoding, Rescoring & System Combination
• Measuring Human Performance
• Results
• Counterpoint – Letter based CTC
• Conclusions
22
![Page 23: Achieving Human Parity in Conversational Speech Recognition · PDF fileHuman Parity in Conversational Speech Recognition •What is Human Parity? •Humans make mistakes, too. Can](https://reader033.vdocuments.net/reader033/viewer/2022051523/5a78789f7f8b9a8c428bf78f/html5/thumbnails/23.jpg)
Language Models
• 1st Pass n-gram:• SRI-LM, 30k vocab, 16M n-grams
• Rescoring n-gram:• SRI-LM, 145M n-grams
• RNN LM• CUED Toolkit, two 1000 unit layers• Relu activations, NCE training
• LSTM LM• Cognitive Toolkit (CNTK), three 1000 unit layers• Letter trigram input, no NCE
23
![Page 24: Achieving Human Parity in Conversational Speech Recognition · PDF fileHuman Parity in Conversational Speech Recognition •What is Human Parity? •Humans make mistakes, too. Can](https://reader033.vdocuments.net/reader033/viewer/2022051523/5a78789f7f8b9a8c428bf78f/html5/thumbnails/24.jpg)
LM Training Trick: Self-stabilization
• Learn an overall scaling function for each layer
xWy
Wxy
)(
:becomes
[Ghahremani & Droppo, 2016]
Applied to the LSTM networks, between layers.
24
![Page 25: Achieving Human Parity in Conversational Speech Recognition · PDF fileHuman Parity in Conversational Speech Recognition •What is Human Parity? •Humans make mistakes, too. Can](https://reader033.vdocuments.net/reader033/viewer/2022051523/5a78789f7f8b9a8c428bf78f/html5/thumbnails/25.jpg)
Language Model Perplexities
Perplexities on the 1997 eval set
LSTM beats RNN
Letter trigram input slightlybetter than word input
25
Note both forward and backward running models
![Page 26: Achieving Human Parity in Conversational Speech Recognition · PDF fileHuman Parity in Conversational Speech Recognition •What is Human Parity? •Humans make mistakes, too. Can](https://reader033.vdocuments.net/reader033/viewer/2022051523/5a78789f7f8b9a8c428bf78f/html5/thumbnails/26.jpg)
Outline
• Acoustic Modeling
• Language Modeling
• Decoding, Rescoring & System Combination
• Measuring Human Performance
• Results
• Counterpoint – Letter based CTC
• Conclusions
26
![Page 27: Achieving Human Parity in Conversational Speech Recognition · PDF fileHuman Parity in Conversational Speech Recognition •What is Human Parity? •Humans make mistakes, too. Can](https://reader033.vdocuments.net/reader033/viewer/2022051523/5a78789f7f8b9a8c428bf78f/html5/thumbnails/27.jpg)
Overall Process
Lattice Generation:ResNet
Rescoring:RNNs, LSTMs,
Pron Probs
Lattice Generation:LACE
Rescoring:RNNs, LSTMs,
Pron Probs
Lattice Generation:BLSTM
Rescoring:RNNs, LSTMs,
Pron Probs
Confusion Network Combination &
Select Best
…
27
N-gram Rescoring500-best Generation
N-gram Rescoring500-best Generation
N-gram Rescoring500-best Generation
![Page 28: Achieving Human Parity in Conversational Speech Recognition · PDF fileHuman Parity in Conversational Speech Recognition •What is Human Parity? •Humans make mistakes, too. Can](https://reader033.vdocuments.net/reader033/viewer/2022051523/5a78789f7f8b9a8c428bf78f/html5/thumbnails/28.jpg)
Greedy System Combination
• Make confusion network from best single system
• Repeat:• Compute error rate on development data for each possible system addition
• Add the system
28
![Page 29: Achieving Human Parity in Conversational Speech Recognition · PDF fileHuman Parity in Conversational Speech Recognition •What is Human Parity? •Humans make mistakes, too. Can](https://reader033.vdocuments.net/reader033/viewer/2022051523/5a78789f7f8b9a8c428bf78f/html5/thumbnails/29.jpg)
Rescoring Performance
ResNet CNN Acoustic Model (no combination)
One LSTM ~ 0.5% better than one RNN.
Multiple LSTMs provide further 0.3%
29
![Page 30: Achieving Human Parity in Conversational Speech Recognition · PDF fileHuman Parity in Conversational Speech Recognition •What is Human Parity? •Humans make mistakes, too. Can](https://reader033.vdocuments.net/reader033/viewer/2022051523/5a78789f7f8b9a8c428bf78f/html5/thumbnails/30.jpg)
Outline
• Acoustic Modeling
• Language Modeling
• Decoding, Rescoring & System Combination
• Measuring Human Performance
• Results
• Counterpoint – Letter based CTC
• Conclusions
30
![Page 31: Achieving Human Parity in Conversational Speech Recognition · PDF fileHuman Parity in Conversational Speech Recognition •What is Human Parity? •Humans make mistakes, too. Can](https://reader033.vdocuments.net/reader033/viewer/2022051523/5a78789f7f8b9a8c428bf78f/html5/thumbnails/31.jpg)
A First Try
• The 4% rumor
[Lippman, 1997]
31
![Page 32: Achieving Human Parity in Conversational Speech Recognition · PDF fileHuman Parity in Conversational Speech Recognition •What is Human Parity? •Humans make mistakes, too. Can](https://reader033.vdocuments.net/reader033/viewer/2022051523/5a78789f7f8b9a8c428bf78f/html5/thumbnails/32.jpg)
Another Attempt
[Glenn et al., 2010]
Significant variability.
Note the bulk of the training data was “quick transcribed.”
32
![Page 33: Achieving Human Parity in Conversational Speech Recognition · PDF fileHuman Parity in Conversational Speech Recognition •What is Human Parity? •Humans make mistakes, too. Can](https://reader033.vdocuments.net/reader033/viewer/2022051523/5a78789f7f8b9a8c428bf78f/html5/thumbnails/33.jpg)
Getting a Positive ID on Actual Test Data
• Skype Translator has a weekly transcription contract• Quality control, training, etc.
• Transcription followed by a second checking pass
• One week, we added eval 2000 to the pile…
33
![Page 34: Achieving Human Parity in Conversational Speech Recognition · PDF fileHuman Parity in Conversational Speech Recognition •What is Human Parity? •Humans make mistakes, too. Can](https://reader033.vdocuments.net/reader033/viewer/2022051523/5a78789f7f8b9a8c428bf78f/html5/thumbnails/34.jpg)
The Results
• Switchoard: 5.9% error rate
• CallHome: 11.3% error rate
• SWB in the 4.1% - 9.6% range expected
• CH is difficult for both people and machines• High ASR error not just because
of mismatched conditions
34
![Page 35: Achieving Human Parity in Conversational Speech Recognition · PDF fileHuman Parity in Conversational Speech Recognition •What is Human Parity? •Humans make mistakes, too. Can](https://reader033.vdocuments.net/reader033/viewer/2022051523/5a78789f7f8b9a8c428bf78f/html5/thumbnails/35.jpg)
Outline
• Acoustic Modeling
• Language Modeling
• Decoding, Rescoring & System Combination
• Measuring Human Performance
• Results
• Counterpoint – Letter based CTC
• Conclusions
35
![Page 36: Achieving Human Parity in Conversational Speech Recognition · PDF fileHuman Parity in Conversational Speech Recognition •What is Human Parity? •Humans make mistakes, too. Can](https://reader033.vdocuments.net/reader033/viewer/2022051523/5a78789f7f8b9a8c428bf78f/html5/thumbnails/36.jpg)
The Bottom Line & Comparisons
Single CNN doesremarkably well
Parity with professional transcribers
Best previous number
36
![Page 37: Achieving Human Parity in Conversational Speech Recognition · PDF fileHuman Parity in Conversational Speech Recognition •What is Human Parity? •Humans make mistakes, too. Can](https://reader033.vdocuments.net/reader033/viewer/2022051523/5a78789f7f8b9a8c428bf78f/html5/thumbnails/37.jpg)
Runtimes
AM Training: Forward, Backward + Update computationsAM eval: Forward probability computation onlyDecoding: Mixed GPU/CPU, complete decoding time with open beams
Titan X GPU & Intel Xeon E5-2620 v3 @2.4GhZ, 12 coresAll times are xRT (fraction of real-time required) on Titan X GPU
GPU 10 to 100x faster than CPU
(Multiples of real-time, smaller is better)
37
![Page 38: Achieving Human Parity in Conversational Speech Recognition · PDF fileHuman Parity in Conversational Speech Recognition •What is Human Parity? •Humans make mistakes, too. Can](https://reader033.vdocuments.net/reader033/viewer/2022051523/5a78789f7f8b9a8c428bf78f/html5/thumbnails/38.jpg)
Error AnalysisSubstitutions (~21k words in each test set)
“ums” and “uh-hums” most frequent mistakes – but most errors are in the long tail
38
![Page 39: Achieving Human Parity in Conversational Speech Recognition · PDF fileHuman Parity in Conversational Speech Recognition •What is Human Parity? •Humans make mistakes, too. Can](https://reader033.vdocuments.net/reader033/viewer/2022051523/5a78789f7f8b9a8c428bf78f/html5/thumbnails/39.jpg)
Error Analysis
Deletions Insertions
Both people and machines insert “I” and “and” a lot.39
![Page 40: Achieving Human Parity in Conversational Speech Recognition · PDF fileHuman Parity in Conversational Speech Recognition •What is Human Parity? •Humans make mistakes, too. Can](https://reader033.vdocuments.net/reader033/viewer/2022051523/5a78789f7f8b9a8c428bf78f/html5/thumbnails/40.jpg)
Outline
• Acoustic Modeling
• Language Modeling
• Decoding, Rescoring & System Combination
• Measuring Human Performance
• Results
• Counterpoint – Letter based CTC
• Conclusions
40
![Page 41: Achieving Human Parity in Conversational Speech Recognition · PDF fileHuman Parity in Conversational Speech Recognition •What is Human Parity? •Humans make mistakes, too. Can](https://reader033.vdocuments.net/reader033/viewer/2022051523/5a78789f7f8b9a8c428bf78f/html5/thumbnails/41.jpg)
Two Ways to move up field
The Dive The Pass
41
![Page 42: Achieving Human Parity in Conversational Speech Recognition · PDF fileHuman Parity in Conversational Speech Recognition •What is Human Parity? •Humans make mistakes, too. Can](https://reader033.vdocuments.net/reader033/viewer/2022051523/5a78789f7f8b9a8c428bf78f/html5/thumbnails/42.jpg)
The CTC Alternative
• Wouldn’t it be nice if we could just look at the frame-level labels,
de-dup,
and read-off the transcription?
• For example, with a character model,
S – U U – P P P E E - - R G G - - O O – D D
Super Good
• CTC [Graves et al. 2006] can train a model so you can do this!42
![Page 43: Achieving Human Parity in Conversational Speech Recognition · PDF fileHuman Parity in Conversational Speech Recognition •What is Human Parity? •Humans make mistakes, too. Can](https://reader033.vdocuments.net/reader033/viewer/2022051523/5a78789f7f8b9a8c428bf78f/html5/thumbnails/43.jpg)
Objective Function and Gradient:
t
q
t
qt
q
t
t
tq
pa
tqt
pqP
:softmax beforeGradient
))(( symbolfor at timeoutput net neural :P
)|( :function Obj.
t
(t))q(
alignments
))((
alignments
Make the neural-net outputs look like the transcript-constrained posteriors
43
![Page 44: Achieving Human Parity in Conversational Speech Recognition · PDF fileHuman Parity in Conversational Speech Recognition •What is Human Parity? •Humans make mistakes, too. Can](https://reader033.vdocuments.net/reader033/viewer/2022051523/5a78789f7f8b9a8c428bf78f/html5/thumbnails/44.jpg)
Defining the Symbols
• Characters:• Generalize to new words
• No problem with infrequent words
• Couple of issues:• Double-letters (e.g. “hello”) don’t work with de-duping
• Where to insert spaces to form words (e.g. “darkroom” vs “dark room”)
• Solution: • introduce double-letter units (ll, oo, etc.)
• Introduce word-initial letters (capital letters)Explicit space characteraligns acoustics to nothing.
44
![Page 45: Achieving Human Parity in Conversational Speech Recognition · PDF fileHuman Parity in Conversational Speech Recognition •What is Human Parity? •Humans make mistakes, too. Can](https://reader033.vdocuments.net/reader033/viewer/2022051523/5a78789f7f8b9a8c428bf78f/html5/thumbnails/45.jpg)
CUDNN RNNImplementation
• Process full minibatch per CUDNN call
• 32 utterances per minibatch
• computation on CPU• 8-way parallelization / OMP
• Best Configuration:• 9 Relu-RNN layers• Bidirectional• 1024 wide• 0.0058 xRT (!)
45
![Page 46: Achieving Human Parity in Conversational Speech Recognition · PDF fileHuman Parity in Conversational Speech Recognition •What is Human Parity? •Humans make mistakes, too. Can](https://reader033.vdocuments.net/reader033/viewer/2022051523/5a78789f7f8b9a8c428bf78f/html5/thumbnails/46.jpg)
Results: 2000 Hour Training
New record for CTC on Switchboard
46
Best previous result – ensemble from Hannun et al. 2014
Conclusion: Much simpler systems can produce good performance[Zweig et al., 2016]
![Page 47: Achieving Human Parity in Conversational Speech Recognition · PDF fileHuman Parity in Conversational Speech Recognition •What is Human Parity? •Humans make mistakes, too. Can](https://reader033.vdocuments.net/reader033/viewer/2022051523/5a78789f7f8b9a8c428bf78f/html5/thumbnails/47.jpg)
Outline
• Acoustic Modeling
• Language Modeling
• Decoding, Rescoring & System Combination
• Measuring Human Performance
• Results
• Counterpoint – Letter based CTC
• Conclusions
47
![Page 48: Achieving Human Parity in Conversational Speech Recognition · PDF fileHuman Parity in Conversational Speech Recognition •What is Human Parity? •Humans make mistakes, too. Can](https://reader033.vdocuments.net/reader033/viewer/2022051523/5a78789f7f8b9a8c428bf78f/html5/thumbnails/48.jpg)
Summary: Human Parity after Twenty Years
5.9% Human performance:Microsoft,October, 2016
48
![Page 49: Achieving Human Parity in Conversational Speech Recognition · PDF fileHuman Parity in Conversational Speech Recognition •What is Human Parity? •Humans make mistakes, too. Can](https://reader033.vdocuments.net/reader033/viewer/2022051523/5a78789f7f8b9a8c428bf78f/html5/thumbnails/49.jpg)
Concluding Remarks
• Parity – Are we done?• Cocktail party problem
• Farfield
• Robustness
Cocktail Party Problem
49
![Page 50: Achieving Human Parity in Conversational Speech Recognition · PDF fileHuman Parity in Conversational Speech Recognition •What is Human Parity? •Humans make mistakes, too. Can](https://reader033.vdocuments.net/reader033/viewer/2022051523/5a78789f7f8b9a8c428bf78f/html5/thumbnails/50.jpg)
Concluding Remarks
• Parity – Are we done?• Cocktail party problem
• Farfield
• Robustness
• What is interesting?• New network structures
• Process Simplification e.g. CTC
50
![Page 51: Achieving Human Parity in Conversational Speech Recognition · PDF fileHuman Parity in Conversational Speech Recognition •What is Human Parity? •Humans make mistakes, too. Can](https://reader033.vdocuments.net/reader033/viewer/2022051523/5a78789f7f8b9a8c428bf78f/html5/thumbnails/51.jpg)
Concluding Remarks
• Parity – Are we done?• Cocktail party problem
• Farfield
• Robustness
• What is interesting?• New network structures
• Process Simplification e.g. CTC
• Are we stuck?• CNNs! LSTMs! Attention & More.
• The future is bright
51
![Page 52: Achieving Human Parity in Conversational Speech Recognition · PDF fileHuman Parity in Conversational Speech Recognition •What is Human Parity? •Humans make mistakes, too. Can](https://reader033.vdocuments.net/reader033/viewer/2022051523/5a78789f7f8b9a8c428bf78f/html5/thumbnails/52.jpg)
Thank You!
52