Deep Speech 2: End-to-End Speech Recognition in English and Mandarin
Silicon Valley AI Lab (SVAIL)*
Abstract

We demonstrate a generic speech engine that handles a broad range of scenarios without needing to resort to domain-specific optimizations. We perform a focused search through model architectures, finding that deep recurrent nets with multiple layers of 2D convolution and layer-to-layer batch normalization perform best. In many cases, our system is competitive with the transcription performance of human workers when benchmarked on standard datasets. The approach also adapts easily to multiple languages, which we illustrate by applying essentially the same system to both Mandarin and English speech.

To train on 12,000 hours of speech, we perform many systems optimizations. Placing all computation on distributed GPUs, we can perform synchronous SGD at 3 TFLOP/s per GPU for up to 128 GPUs (weak scaling). For deployment, we develop a batching scheduler that improves computational efficiency while minimizing latency, and specialized matrix multiplication kernels that perform well at small batch sizes. Combined with forward-only variants of our research models, we achieve a low-latency production system with little loss in recognition accuracy.

*Presenters: Jesse Engel, Shubho Sengupta
Architecture Search / Data Collection
SortaGrad
Batch Normalization
Data Scaling
Deep Recurrence
Bigrams
Frequency Convolutions
Model Size
Single Model Approaching Human Performance on Many Test Sets
Architectural Improvements Generalize to Mandarin ASR
More layers of 2D convolution generalize better in noisy environments, while more layers of 1D convolution lead to worse generalization.
Temporal Stride: 2, Frequency Stride: 2/layer, 7 Recurrent Layers (38M Params)
Batch normalization (between layers, not between timesteps) improves both training and generalization in deep recurrent networks.
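The layer-to-layer (sequence-wise) variant can be sketched in numpy: statistics are pooled over the batch and time dimensions for each feature of a layer's pre-activations, so normalization never crosses the recurrent timestep-to-timestep connections. Function name and shapes here are illustrative, not the training code.

```python
import numpy as np

def sequence_batch_norm(x, gamma, beta, eps=1e-5):
    """Sequence-wise batch normalization (a sketch).

    x: (batch, time, features) pre-activations of one layer.
    Statistics are pooled over batch AND time for each feature,
    so the normalization sits between layers, not between timesteps.
    """
    mean = x.mean(axis=(0, 1), keepdims=True)
    var = x.var(axis=(0, 1), keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

# usage: 4 utterances, 50 frames each, 8 features
x = np.random.randn(4, 50, 8)
y = sequence_batch_norm(x, gamma=np.ones(8), beta=np.zeros(8))
```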
More layers of recurrence help for both Gated Recurrent Units (GRUs) and simple recurrent units. The uptick at 7xGRU is likely due to too few parameters (the layers are too thin).
1x1D Convolution Layer (38M Params)
WER decreases following an approximate power law with the amount of training audio (~40% relative per decade). There is a near-constant relative offset between noisy and clean generalization (~60%).
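As a sanity check on that slope: a fixed ~40% relative improvement per decade of data corresponds to a power-law exponent of log10(0.6) ≈ -0.22. A hypothetical helper (name and numbers are illustrative, not from the poster) for extrapolating along such a curve:

```python
import math

def projected_wer(wer_now, hours_now, hours_future, rel_drop_per_decade=0.40):
    """Extrapolate WER under an observed power law (a sketch).

    A fixed relative drop per 10x of data implies
    WER(h) = WER(h0) * (h / h0)**alpha with alpha = log10(1 - drop).
    """
    alpha = math.log10(1.0 - rel_drop_per_decade)  # ~ -0.222 for a 40% drop
    return wer_now * (hours_future / hours_now) ** alpha

# e.g. 10x the audio cuts WER by the stated 40% relative (10.0 -> 6.0)
print(projected_wer(10.0, 1200, 12000))
```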
2x2D Convolution, Temporal Stride: 2, Frequency Stride: 2/layer, 7 Recurrent Layers (70M Params)
GRU: 1x2D Convolution, Temporal Stride: 3, Frequency Stride: 2, 3 Recurrent Layers, Bigrams
Simple Recurrent: 3x2D Convolution, Temporal Stride: 3, Frequency Stride: 2/layer, 7 Recurrent Layers, Bigrams
Connectionist Temporal Classification (CTC)
Spectrogram
Fully Connected
1x1D Convolution, 7 Recurrent Layers (38M Params)
3x2D Convolution (Bigrams / BN), Temporal Stride: 3, Frequency Stride: 2/layer, 7 Recurrent Layers (100M Params)
             Not Sorted   Sorted
Baseline       11.96      10.83
Batch Norm      9.78       9.52
[Figure: WER vs. # convolution layers (1-3), 1D vs. 2D convolutions, on Clean Dev and Noisy Dev]
[Figure: Clean Dev WER vs. # recurrent layers (1-7), GRU vs. Simple Recurrent]
Layers (Recurrent/Total)   Width   Train (No BN / BN)   Dev (No BN / BN)
1/5                        2400    10.55 / 11.99        13.55 / 14.40
3/5                        1880     9.55 /  8.29        11.61 / 10.56
5/7                        1510     8.59 /  7.61        10.77 /  9.78
7/9                        1280     8.76 /  7.68        10.83 /  9.52
[Figure: WER vs. hours of training audio (100-10,000, log-log), Clean Dev and Noisy Dev]
[Figure: WER vs. model size (20-100 million parameters), GRU vs. Simple Recurrent, Clean Dev and Noisy Dev]
[Figure: CTC cost vs. training iterations (x10^5), 1-recurrent-layer vs. 7-recurrent-layer networks, with and without BN]
Diverse Datasets

Dataset       Speech Type      Hours
WSJ           read                80
Switchboard   conversational     300
Fisher        conversational    2000
LibriSpeech   read               960
Baidu         read              5000
Baidu         mixed             3600
Total                          11940
Dataset               Type          Hours
Unaccented Mandarin   spontaneous    5000
Accented Mandarin     spontaneous    3000
Assorted Mandarin     mixed          1400
Total                                9400
Read Speech

Test set                 DS1     DS2     Human
WSJ eval'92              4.94    3.60    5.03
WSJ eval'93              6.94    4.98    8.08
LibriSpeech test-clean   7.89    5.33    5.83
LibriSpeech test-other  21.74   13.25   12.69
[Figure: WER vs. temporal stride, Unigrams vs. Bigrams, with and without language model (No LM / LM)]
Accented Speech

Test set                      DS1     DS2     Human
VoxForge American-Canadian   15.01    7.55    4.85
VoxForge Commonwealth        28.46   13.56    8.15
VoxForge European            31.20   17.55   12.76
VoxForge Indian              45.35   22.44   22.15
Noisy Speech

Test set           DS1     DS2     Human
CHiME eval clean    6.30    3.34    3.46
CHiME eval real    67.94   21.79   11.84
CHiME eval sim     80.27   45.05   31.33
Language   Architecture            Dev (No LM)   Dev (LM)
English    5 layers, 1 recurrent   27.79         14.39
English    9 layers, 7 recurrent   14.93          9.52
Mandarin   5 layers, 1 recurrent    9.80          7.13
Mandarin   9 layers, 7 recurrent    7.55          5.81
Architecture                           Dev    Test
5 layers, 1 recurrent                  7.13   15.41
5 layers, 3 recurrent                  6.49   11.85
5 layers, 3 recurrent + BN             6.22    9.39
9 layers, 7 recurrent + BN + 2D conv   5.81    7.93
3x2D Convolution, Temporal Stride: 4, Frequency Stride: 2/layer, 70M Params
Inspired by Mandarin, we can maintain performance while reducing the number of timesteps (and thus dramatically reducing computation) by creating a larger alphabet (bigrams) with fewer characters per second. This decreases the entropy density of the language, allowing for more temporal compression. (ex. [the_cats_sat] -> [th, e, _, ca, ts, _, sa, t])
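The segmentation in that example can be sketched as a small helper (a hypothetical function, not the production tokenizer): split each word into non-overlapping character pairs left to right, keep an odd trailing character as a unigram, and leave spaces as their own '_' symbol.

```python
def to_bigrams(text):
    """Segment text into non-overlapping bigrams, word by word (a sketch).

    Each word is split into character pairs left to right; an odd trailing
    character stays a unigram, and spaces become their own '_' symbol.
    'the cats sat' -> ['th', 'e', '_', 'ca', 'ts', '_', 'sa', 't']
    """
    out = []
    for i, word in enumerate(text.split(' ')):
        if i > 0:
            out.append('_')                 # word boundary symbol
        out.extend(word[j:j + 2] for j in range(0, len(word), 2))
    return out
```

This roughly halves the output sequence length, which is what permits the larger temporal stride.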
           Chars / Sec   Bits / Char   Bits / Sec
Unigram       14.1           4.1          58.1
Bigram         9.6           2.4          23.3
Mandarin       3.2          12.6          40.7
"A Sort of Stochastic Gradient Descent." Curriculum learning with utterance length as a proxy for difficulty. Sorting the first epoch by length improves convergence stability and final values.
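A minimal sketch of the schedule, assuming utterances are (id, num_frames) pairs: sort only the first epoch shortest-to-longest, then revert to ordinary shuffling.

```python
import random

def sortagrad_order(utterances, epoch, seed=0):
    """Return the training order for one epoch (a sketch).

    Epoch 0: shortest-to-longest, using length as a proxy for difficulty.
    Later epochs: deterministic shuffle, as in normal SGD.
    `utterances` is a list of (audio_id, num_frames) pairs.
    """
    if epoch == 0:
        return sorted(utterances, key=lambda u: u[1])
    order = list(utterances)
    random.Random(seed + epoch).shuffle(order)
    return order
```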
Mix of conversational, read, and spontaneous. Average utterance length ~6 seconds.
WER English / CER Mandarin. Mix of conversational, read, and spontaneous. Average utterance length ~3 seconds.
Hannun et al., Deep Speech: Scaling up end-to-end speech recognition, 2014. http://arxiv.org/abs/1412.5567
Amazon Mechanical Turk transcription. Each utterance was transcribed by 2 turkers, averaging 30 seconds of work time per utterance.
DS 1: 1x1D Convolution, Temporal Stride: 2, Unigrams / No BN, 1 Recurrent Layer (38M Params)
DS 2: This Work
Human
Systems Optimizations / Deployment
[Figure: CTC forward-backward trellis showing the alpha and beta recursions over the blank-extended label sequence across timesteps 1 ... T]
Optimized, compute-dense node with 53 TFLOP/s of peak compute and a large PCI root complex for efficient communication.
An optimized CTC implementation on the GPU is significantly faster than the corresponding CPU implementation. It also eliminates all round-trip copies between CPU and GPU. All times are in seconds for training one epoch of a 5-layer network with 3 bidirectional recurrent layers.
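For reference, the CTC forward (alpha) recursion that such a kernel parallelizes can be written out in a few lines of numpy. This is an unoptimized sketch of the standard algorithm for a nonempty label sequence, not the GPU implementation.

```python
import numpy as np

def ctc_neg_log_likelihood(log_probs, labels, blank=0):
    """CTC loss via the forward (alpha) recursion (a numpy sketch).

    log_probs: (T, K) per-frame log-probabilities over K symbols.
    labels: nonempty target symbol ids, without blanks.
    Returns -log p(labels | log_probs), summed over all alignments.
    """
    T = log_probs.shape[0]
    ext = [blank]                      # extended labels: blanks interleaved
    for s in labels:
        ext.extend([s, blank])
    S = len(ext)
    alpha = np.full((T, S), -np.inf)
    alpha[0, 0] = log_probs[0, ext[0]]
    alpha[0, 1] = log_probs[0, ext[1]]
    for t in range(1, T):
        for s in range(S):
            terms = [alpha[t - 1, s]]
            if s > 0:
                terms.append(alpha[t - 1, s - 1])
            # may skip the blank between two different labels
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                terms.append(alpha[t - 1, s - 2])
            alpha[t, s] = np.logaddexp.reduce(terms) + log_probs[t, ext[s]]
    # valid paths end on the last label or the final blank
    return -np.logaddexp(alpha[T - 1, S - 1], alpha[T - 1, S - 2])
```

The inner loop over `s` is what a GPU kernel evaluates in parallel, one thread per trellis state.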
We built a system to train deep recurrent neural networks that scales linearly from 1 to 128 GPUs while sustaining 3 TFLOP/s of computational throughput per GPU throughout an entire training run (weak scaling). We did this by carefully designing our node, building an efficient communication layer, and writing a very fast implementation of the CTC cost function, so that our entire network runs on the GPU.
Node Architecture
Scaling Neural Network Training
Computing CTC on GPU
Fast MPI Allreduce
We use synchronous SGD for deterministic results from run to run. The weights are synchronized through MPI Allreduce. We wrote our own Allreduce implementation that is 20x faster than OpenMPI's Allreduce for up to 16 GPUs. It scales up to 128 GPUs while maintaining its superior performance.
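The standard bandwidth-optimal structure for such an Allreduce is a ring: a reduce-scatter phase followed by an allgather, with each rank sending 2(N-1)/N of the gradient regardless of N. A pure-Python simulation of the data movement (illustrative only; the real implementation moves GPU buffers over MPI):

```python
def ring_allreduce(grads):
    """Simulate a ring allreduce across N ranks (a pure-Python sketch).

    grads[r] is rank r's gradient, pre-split into N equal chunks (lists).
    Reduce-scatter: N-1 steps, after which rank r owns the fully summed
    chunk (r+1) % N. Allgather: N-1 more steps circulate those results.
    """
    n = len(grads)
    data = [[list(chunk) for chunk in rank] for rank in grads]
    for step in range(n - 1):                      # reduce-scatter
        for r in range(n):
            c = (r - step - 1) % n                 # chunk received this step
            for i, v in enumerate(data[(r - 1) % n][c]):
                data[r][c][i] += v
    for step in range(n - 1):                      # allgather
        for r in range(n):
            c = (r - step) % n
            data[r][c] = list(data[(r - 1) % n][c])
    return data
```

Because per-rank traffic is nearly independent of N, synchronous SGD can keep scaling as GPUs are added.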
[Figure: a row convolution layer on top of a forward recurrent layer, combining hidden state h_t with a fixed window of future timesteps]
Row Convolutions
A special convolution layer that captures a forward context of a fixed number of timesteps into the future. This allows us to train and deploy a model with only forward recurrent layers. Models with only forward recurrent layers satisfy the latency constraints for deployment, since we do not need to wait for the end of the utterance.
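The layer can be sketched in numpy: each output frame is a weighted sum of the current hidden state and a fixed window of C-1 future frames, with per-feature weights. Names and shapes are illustrative.

```python
import numpy as np

def row_convolution(h, w):
    """Row convolution over a fixed future context (a numpy sketch).

    h: (T, d) forward-recurrent activations; w: (C, d) per-feature weights
    for the current frame and the next C-1 frames.
    out[t] = sum over tau < C of w[tau] * h[t + tau], zero-padded at the
    end, so only C-1 future frames are needed instead of the whole utterance.
    """
    T, d = h.shape
    C = w.shape[0]
    padded = np.vstack([h, np.zeros((C - 1, d))])
    return np.stack([np.sum(w * padded[t:t + C], axis=0) for t in range(T)])
```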
As we see on the right, the production system with only forward recurrent layers is very close in performance to the research system with bidirectional layers on two different test sets.
We wrote a special matrix-matrix multiply kernel that is efficient for half precision and small batch sizes. The deployment system uses small batches to minimize latency and half precision to minimize memory footprint. Our kernel is 2x faster than the publicly available kernels from Nervana Systems.
Deployment Optimized Matrix-Matrix Multiply
Batch Dispatch
We built a batching scheduler that assembles streams of data from user requests into batches before doing forward propagation. A large batch size results in more computational efficiency but also higher latency. As expected, batching works best when the server is heavily loaded: as load increases, the distribution shifts to favor processing requests in larger batches. Even for a light load of only 10 concurrent user requests, our system performs more than half the work in batches with at least 2 samples.
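The scheduler's core behavior can be sketched as a queue that is drained into one batch whenever the compute resource frees up. This is a simplified single-threaded sketch; the class and parameter names are assumptions.

```python
from collections import deque

class BatchDispatcher:
    """Batching scheduler sketch: group pending requests per forward pass.

    Requests queue up while the GPU is busy; when it frees up, everything
    waiting (capped at max_batch) goes out as one batch. Under light load,
    batches stay small (low latency); under heavy load, they grow
    (high throughput), matching the behavior described above.
    """
    def __init__(self, max_batch=8):
        self.max_batch = max_batch
        self.pending = deque()

    def submit(self, request):
        self.pending.append(request)

    def next_batch(self):
        batch = []
        while self.pending and len(batch) < self.max_batch:
            batch.append(self.pending.popleft())
        return batch
```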
[Figure: node architecture with 8 GPUs connected through PLX PCIe switches to 2 CPUs]
Language   CPU CTC Time   GPU CTC Time   Speedup
English    5888.12        203.56         28.9
Mandarin   1688.01        135.05         12.5
System       Clean   Noisy
Research     6.05     9.75
Production   6.10    10.00
[Figure: time per epoch (s) vs. # GPUs (1-128) for a narrow deep network and a wide shallow network]
[Figure: Allreduce time (s) vs. # GPUs (4-128), our Allreduce vs. OpenMPI Allreduce]
[Figure: TeraFLOP/s vs. batch size (1-10), Baidu HGEMM vs. Nervana HGEMM]
[Figure: probability of each batch size (1-10) under 10, 20, and 30 concurrent streams]
CER on an internal Mandarin voice search test set.