
SPEECH RECOGNITION WITH DEEP RECURRENT NEURAL NETWORKS

Alex Graves, Abdel-rahman Mohamed and Geoffrey Hinton

Department of Computer Science, University of Toronto

ABSTRACT

Recurrent neural networks (RNNs) are a powerful model for sequential data. End-to-end training methods such as Connectionist Temporal Classification make it possible to train RNNs for sequence labelling problems where the input-output alignment is unknown. The combination of these methods with the Long Short-term Memory RNN architecture has proved particularly fruitful, delivering state-of-the-art results in cursive handwriting recognition. However RNN performance in speech recognition has so far been disappointing, with better results returned by deep feedforward networks. This paper investigates deep recurrent neural networks, which combine the multiple levels of representation that have proved so effective in deep networks with the flexible use of long range context that empowers RNNs. When trained end-to-end with suitable regularisation, we find that deep Long Short-term Memory RNNs achieve a test set error of 17.7% on the TIMIT phoneme recognition benchmark, which to our knowledge is the best recorded score.

Index Terms— recurrent neural networks, deep neural networks, speech recognition

1. INTRODUCTION

Neural networks have a long history in speech recognition, usually in combination with hidden Markov models [1, 2]. They have gained attention in recent years with the dramatic improvements in acoustic modelling yielded by deep feedforward networks [3, 4]. Given that speech is an inherently dynamic process, it seems natural to consider recurrent neural networks (RNNs) as an alternative model. HMM-RNN systems [5] have also seen a recent revival [6, 7], but do not currently perform as well as deep networks.

Instead of combining RNNs with HMMs, it is possible to train RNNs ‘end-to-end’ for speech recognition [8, 9, 10]. This approach exploits the larger state-space and richer dynamics of RNNs compared to HMMs, and avoids the problem of using potentially incorrect alignments as training targets. The combination of Long Short-term Memory [11], an RNN architecture with an improved memory, with end-to-end training has proved especially effective for cursive handwriting recognition [12, 13]. However it has so far made little impact on speech recognition.

RNNs are inherently deep in time, since their hidden state is a function of all previous hidden states. The question that inspired this paper was whether RNNs could also benefit from depth in space; that is, from stacking multiple recurrent hidden layers on top of each other, just as feedforward layers are stacked in conventional deep networks. To answer this question we introduce deep Long Short-term Memory RNNs and assess their potential for speech recognition. We also present an enhancement to a recently introduced end-to-end learning method that jointly trains two separate RNNs as acoustic and linguistic models [10]. Sections 2 and 3 describe the network architectures and training methods, Section 4 provides experimental results and concluding remarks are given in Section 5.

2. RECURRENT NEURAL NETWORKS

Given an input sequence x = (x_1, ..., x_T), a standard recurrent neural network (RNN) computes the hidden vector sequence h = (h_1, ..., h_T) and output vector sequence y = (y_1, ..., y_T) by iterating the following equations from t = 1 to T:

h_t = H(W_{xh} x_t + W_{hh} h_{t-1} + b_h)    (1)
y_t = W_{hy} h_t + b_y    (2)

where the W terms denote weight matrices (e.g. W_{xh} is the input-hidden weight matrix), the b terms denote bias vectors (e.g. b_h is the hidden bias vector) and H is the hidden layer function, usually an elementwise application of a sigmoid function.
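To make the recurrence concrete, here is a minimal NumPy sketch of equations (1) and (2); the function and argument names are illustrative, and a logistic sigmoid is assumed for H:

import numpy as np

def rnn_forward(x, W_xh, W_hh, W_hy, b_h, b_y):
    # Vanilla RNN forward pass following eqs. (1)-(2), with a logistic
    # sigmoid as the hidden layer function H.
    h = np.zeros(W_hh.shape[0])
    hidden, outputs = [], []
    for x_t in x:                                                # iterate t = 1..T
        h = 1.0 / (1.0 + np.exp(-(W_xh @ x_t + W_hh @ h + b_h)))   # eq. (1)
        hidden.append(h)
        outputs.append(W_hy @ h + b_y)                           # eq. (2)
    return np.stack(hidden), np.stack(outputs)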

However we have found that the Long Short-Term Memory (LSTM) architecture [11], which uses purpose-built memory cells to store information, is better at finding and exploiting long range context. Fig. 1 illustrates a single LSTM memory cell. For the version of LSTM used in this paper [14], H is implemented by the following composite function:

i_t = σ(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i)    (3)
f_t = σ(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_f)    (4)
c_t = f_t c_{t-1} + i_t tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)    (5)
o_t = σ(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_t + b_o)    (6)
h_t = o_t tanh(c_t)    (7)

where σ is the logistic sigmoid function, and i, f, o and c are respectively the input gate, forget gate, output gate and cell activation vectors, all of which are the same size as the hidden vector h. The weight matrices from the cell to gate vectors (e.g. W_{ci}) are diagonal, so element m in each gate vector only receives input from element m of the cell vector.

Fig. 1. Long Short-term Memory Cell

Fig. 2. Bidirectional RNN
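As a concrete rendering of the composite function in equations (3)-(7), the following NumPy sketch performs one LSTM step; the parameter dictionary is a convenience of this illustration, and the diagonal cell-to-gate matrices are stored as vectors (w_ci, w_cf, w_co):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    # One LSTM step, eqs. (3)-(7); p holds the weight matrices and biases.
    i = sigmoid(p['W_xi'] @ x_t + p['W_hi'] @ h_prev + p['w_ci'] * c_prev + p['b_i'])  # (3)
    f = sigmoid(p['W_xf'] @ x_t + p['W_hf'] @ h_prev + p['w_cf'] * c_prev + p['b_f'])  # (4)
    c = f * c_prev + i * np.tanh(p['W_xc'] @ x_t + p['W_hc'] @ h_prev + p['b_c'])      # (5)
    o = sigmoid(p['W_xo'] @ x_t + p['W_ho'] @ h_prev + p['w_co'] * c + p['b_o'])       # (6)
    h = o * np.tanh(c)                                                                 # (7)
    return h, c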

One shortcoming of conventional RNNs is that they are only able to make use of previous context. In speech recognition, where whole utterances are transcribed at once, there is no reason not to exploit future context as well. Bidirectional RNNs (BRNNs) [15] do this by processing the data in both directions with two separate hidden layers, which are then fed forwards to the same output layer. As illustrated in Fig. 2, a BRNN computes the forward hidden sequence \overrightarrow{h}, the backward hidden sequence \overleftarrow{h} and the output sequence y by iterating the backward layer from t = T to 1, the forward layer from t = 1 to T and then updating the output layer:

\overrightarrow{h}_t = H(W_{x\overrightarrow{h}} x_t + W_{\overrightarrow{h}\overrightarrow{h}} \overrightarrow{h}_{t-1} + b_{\overrightarrow{h}})    (8)
\overleftarrow{h}_t = H(W_{x\overleftarrow{h}} x_t + W_{\overleftarrow{h}\overleftarrow{h}} \overleftarrow{h}_{t+1} + b_{\overleftarrow{h}})    (9)
y_t = W_{\overrightarrow{h}y} \overrightarrow{h}_t + W_{\overleftarrow{h}y} \overleftarrow{h}_t + b_y    (10)

Combining BRNNs with LSTM gives bidirectional LSTM [16], which can access long-range context in both input directions.
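A rough sketch of the bidirectional computation in equations (8)-(10); step_fwd and step_bwd stand in for the forward and backward hidden layer functions (LSTM or otherwise), and the parameter names are placeholders:

import numpy as np

def brnn_forward(x, step_fwd, step_bwd, h0_fwd, h0_bwd, W_fy, W_by, b_y):
    # Run one hidden layer forwards and one backwards over the whole
    # utterance, then combine both at the output layer (eq. 10).
    T = len(x)
    h_fwd, h_bwd = [None] * T, [None] * T
    h = h0_bwd
    for t in reversed(range(T)):      # backward layer: t = T..1 (eq. 9)
        h = step_bwd(x[t], h)
        h_bwd[t] = h
    h = h0_fwd
    for t in range(T):                # forward layer: t = 1..T (eq. 8)
        h = step_fwd(x[t], h)
        h_fwd[t] = h
    return [W_fy @ h_fwd[t] + W_by @ h_bwd[t] + b_y for t in range(T)]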

A crucial element of the recent success of hybrid HMM-neural network systems is the use of deep architectures, which are able to build up progressively higher level representations of acoustic data. Deep RNNs can be created by stacking multiple RNN hidden layers on top of each other, with the output sequence of one layer forming the input sequence for the next. Assuming the same hidden layer function is used for all N layers in the stack, the hidden vector sequences h^n are iteratively computed from n = 1 to N and t = 1 to T:

h^n_t = H(W_{h^{n-1}h^n} h^{n-1}_t + W_{h^n h^n} h^n_{t-1} + b^n_h)    (11)

where we define h^0 = x. The network outputs y_t are

y_t = W_{h^N y} h^N_t + b_y    (12)

Deep bidirectional RNNs can be implemented by replacing each hidden sequence h^n with the forward and backward sequences \overrightarrow{h}^n and \overleftarrow{h}^n, and ensuring that every hidden layer receives input from both the forward and backward layers at the level below. If LSTM is used for the hidden layers we get deep bidirectional LSTM, the main architecture used in this paper. As far as we are aware this is the first time deep LSTM has been applied to speech recognition, and we find that it yields a dramatic improvement over single-layer LSTM.
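The stacking in equations (11) and (12) simply feeds each layer's hidden sequence to the layer above. A simplified (unidirectional) sketch, where each entry of `layers` is a step function together with its hidden size, both placeholders of this illustration:

import numpy as np

def deep_rnn_forward(x, layers, W_hNy, b_y):
    # Stacked RNN, eqs. (11)-(12): layer n reads the output sequence of
    # layer n-1, with h^0 defined as the input sequence x.
    seq = list(x)
    for step, hidden_size in layers:
        h = np.zeros(hidden_size)
        out = []
        for v in seq:
            h = step(v, h)            # eq. (11)
            out.append(h)
        seq = out
    return [W_hNy @ h_t + b_y for h_t in seq]   # eq. (12)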

3. NETWORK TRAINING

We focus on end-to-end training, where RNNs learn to map directly from acoustic to phonetic sequences. One advantage of this approach is that it removes the need for a predefined (and error-prone) alignment to create the training targets. The first step is to use the network outputs to parameterise a differentiable distribution Pr(y|x) over all possible phonetic output sequences y given an acoustic input sequence x. The log-probability log Pr(z|x) of the target output sequence z can then be differentiated with respect to the network weights using backpropagation through time [17], and the whole system can be optimised with gradient descent. We now describe two ways to define the output distribution and hence train the network. We refer throughout to the length of x as T, the length of z as U, and the number of possible phonemes as K.

3.1. Connectionist Temporal Classification

The first method, known as Connectionist Temporal Classification (CTC) [8, 9], uses a softmax layer to define a separate output distribution Pr(k|t) at every step t along the input sequence. This distribution covers the K phonemes plus an extra blank symbol ∅ which represents a non-output (the softmax layer is therefore size K + 1). Intuitively the network decides whether to emit any label, or no label, at every timestep. Taken together these decisions define a distribution over alignments between the input and target sequences. CTC then uses a forward-backward algorithm to sum over all possible alignments and determine the normalised probability Pr(z|x) of the target sequence given the input sequence [8]. Similar procedures have been used elsewhere in speech and handwriting recognition to integrate out over possible segmentations [18, 19]; however CTC differs in that it ignores segmentation altogether and sums over single-timestep label decisions instead.
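To make the forward-backward summation concrete, the sketch below computes log Pr(z|x) with the standard CTC forward recursion over the blank-extended label sequence, given per-timestep log-probabilities; it follows the published CTC recursion [8] rather than the authors' code, and the blank index is an assumption:

import numpy as np

def logsumexp(xs):
    xs = [x for x in xs if x != -np.inf]
    if not xs:
        return -np.inf
    m = max(xs)
    return m + np.log(sum(np.exp(x - m) for x in xs))

def ctc_log_prob(log_probs, target, blank=0):
    # log Pr(z|x) by summing over all CTC alignments.
    # log_probs: (T, K+1) array of log Pr(k|t); target: non-empty list of labels.
    T = log_probs.shape[0]
    ext = [blank]                       # blank-extended labels: ∅ z1 ∅ z2 ... zU ∅
    for z in target:
        ext += [z, blank]
    S = len(ext)
    alpha = np.full((T, S), -np.inf)
    alpha[0, 0] = log_probs[0, blank]
    alpha[0, 1] = log_probs[0, ext[1]]
    for t in range(1, T):
        for s in range(S):
            terms = [alpha[t - 1, s]]
            if s > 0:
                terms.append(alpha[t - 1, s - 1])
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                terms.append(alpha[t - 1, s - 2])    # skip over a blank
            alpha[t, s] = logsumexp(terms) + log_probs[t, ext[s]]
    return logsumexp([alpha[T - 1, S - 1], alpha[T - 1, S - 2]])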

RNNs trained with CTC are generally bidirectional, to ensure that every Pr(k|t) depends on the entire input sequence, and not just the inputs up to t. In this work we focus on deep bidirectional networks, with Pr(k|t) defined as follows:

y_t = W_{\overrightarrow{h}^N y} \overrightarrow{h}^N_t + W_{\overleftarrow{h}^N y} \overleftarrow{h}^N_t + b_y    (13)
Pr(k|t) = exp(y_t[k]) / \sum_{k'=1}^{K} exp(y_t[k'])    (14)

where y_t[k] is the kth element of the length K + 1 unnormalised output vector y_t, and N is the number of bidirectional levels.

3.2. RNN Transducer

CTC defines a distribution over phoneme sequences that depends only on the acoustic input sequence x. It is therefore an acoustic-only model. A recent augmentation, known as an RNN transducer [10], combines a CTC-like network with a separate RNN that predicts each phoneme given the previous ones, thereby yielding a jointly trained acoustic and language model. Joint LM-acoustic training has proved beneficial in the past for speech recognition [20, 21].

Whereas CTC determines an output distribution at every input timestep, an RNN transducer determines a separate distribution Pr(k|t, u) for every combination of input timestep t and output timestep u. As with CTC, each distribution covers the K phonemes plus ∅. Intuitively the network ‘decides’ what to output depending both on where it is in the input sequence and the outputs it has already emitted. For a length U target sequence z, the complete set of TU decisions jointly determines a distribution over all possible alignments between x and z, which can then be integrated out with a forward-backward algorithm to determine log Pr(z|x) [10].
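For concreteness, the forward half of that integration can be sketched as follows: alpha[t, u] is the log-probability of having emitted the first u labels of z after consuming t+1 input frames, and the array of log Pr(k|t, u) is assumed to be precomputed by the network. This follows the usual transducer lattice recursion, not the authors' implementation.

import numpy as np

def transducer_log_prob(log_probs, target, blank=0):
    # log Pr(z|x) for an RNN transducer: log_probs has shape (T, U+1, K+1)
    # holding log Pr(k|t, u); target is the list of U phoneme indices.
    T, U_plus_1, _ = log_probs.shape
    U = U_plus_1 - 1
    alpha = np.full((T, U + 1), -np.inf)
    alpha[0, 0] = 0.0
    for t in range(T):
        for u in range(U + 1):
            terms = []
            if t > 0:   # emit blank at (t-1, u): advance one input frame
                terms.append(alpha[t - 1, u] + log_probs[t - 1, u, blank])
            if u > 0:   # emit label z_u at (t, u-1): advance one output step
                terms.append(alpha[t, u - 1] + log_probs[t, u - 1, target[u - 1]])
            if terms:
                m = max(terms)
                alpha[t, u] = m + np.log(sum(np.exp(v - m) for v in terms))
    # terminate by emitting blank after the final label at the last frame
    return alpha[T - 1, U] + log_probs[T - 1, U, blank]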

In the original formulation Pr(k|t, u) was defined by taking an ‘acoustic’ distribution Pr(k|t) from the CTC network, a ‘linguistic’ distribution Pr(k|u) from the prediction network, then multiplying the two together and renormalising. An improvement introduced in this paper is to instead feed the hidden activations of both networks into a separate feedforward output network, whose outputs are then normalised with a softmax function to yield Pr(k|t, u). This allows a richer set of possibilities for combining linguistic and acoustic information, and appears to lead to better generalisation. In particular we have found that the number of deletion errors encountered during decoding is reduced.

Denote by \overrightarrow{h}^N and \overleftarrow{h}^N the uppermost forward and backward hidden sequences of the CTC network, and by p the hidden sequence of the prediction network. At each t, u the output network is implemented by feeding \overrightarrow{h}^N and \overleftarrow{h}^N to a linear layer to generate the vector l_t, then feeding l_t and p_u to a tanh hidden layer to yield h_{t,u}, and finally feeding h_{t,u} to a size K + 1 softmax layer to determine Pr(k|t, u):

l_t = W_{\overrightarrow{h}^N l} \overrightarrow{h}^N_t + W_{\overleftarrow{h}^N l} \overleftarrow{h}^N_t + b_l    (15)
h_{t,u} = tanh(W_{lh} l_{t,u} + W_{pb} p_u + b_h)    (16)
y_{t,u} = W_{hy} h_{t,u} + b_y    (17)
Pr(k|t, u) = exp(y_{t,u}[k]) / \sum_{k'=1}^{K} exp(y_{t,u}[k'])    (18)

where y_{t,u}[k] is the kth element of the length K + 1 unnormalised output vector. For simplicity we constrained all non-output layers to be the same size (|\overrightarrow{h}^n_t| = |\overleftarrow{h}^n_t| = |p_u| = |l_t| = |h_{t,u}|); however they could be varied independently.
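A minimal sketch of the output network in equations (15)-(18), computing the full (T, U) grid of log Pr(k|t, u); the parameter dictionary and its keys are placeholders of this illustration:

import numpy as np

def transducer_outputs(h_fwd_N, h_bwd_N, p_seq, params):
    # h_fwd_N, h_bwd_N: top-level CTC hidden sequences (length T);
    # p_seq: prediction-network hidden sequence (length U).
    T, U = len(h_fwd_N), len(p_seq)
    K_plus_1 = params['W_hy'].shape[0]
    log_pr = np.empty((T, U, K_plus_1))
    for t in range(T):
        l_t = params['W_fl'] @ h_fwd_N[t] + params['W_bl'] @ h_bwd_N[t] + params['b_l']       # (15)
        for u in range(U):
            h_tu = np.tanh(params['W_lh'] @ l_t + params['W_pb'] @ p_seq[u] + params['b_h'])  # (16)
            y_tu = params['W_hy'] @ h_tu + params['b_y']                                      # (17)
            m = np.max(y_tu)
            log_pr[t, u] = y_tu - (m + np.log(np.sum(np.exp(y_tu - m))))                      # (18), log-softmax
    return log_pr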

RNN transducers can be trained from random initial weights. However they appear to work better when initialised with the weights of a pretrained CTC network and a pretrained next-step prediction network (so that only the output network starts from random weights). The output layers (and all associated weights) used by the networks during pretraining are removed during retraining. In this work we pretrain the prediction network on the phonetic transcriptions of the audio training data; however for large-scale applications it would make more sense to pretrain on a separate text corpus.

3.3. Decoding

RNN transducers can be decoded with beam search [10] to yield an n-best list of candidate transcriptions. In the past CTC networks have been decoded using either a form of best-first decoding known as prefix search, or by simply taking the most active output at every timestep [8]. In this work however we exploit the same beam search as the transducer, with the modification that the output label probabilities Pr(k|t, u) do not depend on the previous outputs (so Pr(k|t, u) = Pr(k|t)). We find beam search both faster and more effective than prefix search for CTC. Note the n-best list from the transducer was originally sorted by the length normalised log-probability log Pr(y)/|y|; in the current work we dispense with the normalisation (which only helps when there are many more deletions than insertions) and sort by Pr(y).
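For contrast with beam search, the simpler alternative mentioned above (take the most active output at every timestep, then collapse repeats and strip blanks) is easy to sketch; the blank index is an assumption:

import numpy as np

def ctc_best_path_decode(probs, blank=0):
    # Greedy CTC decoding: argmax at each timestep, collapse repeated
    # labels, then remove the blank symbol. probs has shape (T, K+1).
    best = np.argmax(probs, axis=1)
    decoded, prev = [], None
    for k in best:
        if k != blank and k != prev:
            decoded.append(int(k))
        prev = k
    return decoded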

3.4. Regularisation

Regularisation is vital for good performance with RNNs, as their flexibility makes them prone to overfitting. Two regularisers were used in this paper: early stopping and weight noise (the addition of Gaussian noise to the network weights during training [22]). Weight noise was added once per training sequence, rather than at every timestep. Weight noise tends to ‘simplify’ neural networks, in the sense of reducing the amount of information required to transmit the parameters [23, 24], which improves generalisation.
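A minimal sketch of the weight-noise regulariser as described: a fresh Gaussian perturbation of the weights is drawn once per training sequence and used for that sequence's gradient computation (σ = 0.075 in the experiments of Section 4); the flat weight vector is a simplification:

import numpy as np

def perturb_weights(weights, sigma=0.075, rng=None):
    # Draw zero-mean Gaussian noise once per training sequence and add it
    # to every weight; gradients are then computed with the noisy copy.
    rng = rng or np.random.default_rng()
    return weights + rng.normal(0.0, sigma, size=weights.shape)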

4. EXPERIMENTS

Phoneme recognition experiments were performed on the TIMIT corpus [25]. The standard 462-speaker set with all SA records removed was used for training, and a separate development set of 50 speakers was used for early stopping. Results are reported for the 24-speaker core test set. The audio data was encoded using a Fourier-transform-based filter-bank with 40 coefficients (plus energy) distributed on a mel-scale, together with their first and second temporal derivatives. Each input vector was therefore size 123. The data were normalised so that every element of the input vectors had zero mean and unit variance over the training set. All 61 phoneme labels were used during training and decoding (so K = 61), then mapped to 39 classes for scoring [26]. Note that all experiments were run only once, so the variance due to random weight initialisation and weight noise is unknown.
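The derivative features and normalisation can be sketched roughly as follows; a simple numerical gradient stands in for whatever regression window was actually used, so treat this as illustrative only:

import numpy as np

def prepare_inputs(fbank, mean, std):
    # fbank: (T, 41) array of 40 mel filterbank coefficients plus energy.
    # Append first and second temporal derivatives to give (T, 123), then
    # normalise each dimension with training-set mean and variance.
    d1 = np.gradient(fbank, axis=0)
    d2 = np.gradient(d1, axis=0)
    feats = np.concatenate([fbank, d1, d2], axis=1)
    return (feats - mean) / std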

As shown in Table 1, nine RNNs were evaluated, varying along three main dimensions: the training method used (CTC, Transducer or pretrained Transducer), the number of hidden levels (1–5), and the number of LSTM cells in each hidden layer. Bidirectional LSTM was used for all networks except CTC-3l-500h-tanh, which had tanh units instead of LSTM cells, and CTC-3l-421h-uni, where the LSTM layers were unidirectional. All networks were trained using stochastic gradient descent, with learning rate 10^{-4}, momentum 0.9 and random initial weights drawn uniformly from [−0.1, 0.1]. All networks except CTC-3l-500h-tanh and PreTrans-3l-250h were first trained with no noise and then, starting from the point of highest log-probability on the development set, retrained with Gaussian weight noise (σ = 0.075) until the point of lowest phoneme error rate on the development set. PreTrans-3l-250h was initialised with the weights of CTC-3l-250h, along with the weights of a phoneme prediction network (which also had a hidden layer of 250 LSTM cells), both of which were trained without noise, retrained with noise, and stopped at the point of highest log-probability. PreTrans-3l-250h was trained from this point with noise added. CTC-3l-500h-tanh was entirely trained without weight noise because it failed to learn with noise added. Beam search decoding was used for all networks, with a beam width of 100.
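The optimiser described here is plain stochastic gradient descent with classical momentum and uniform weight initialisation; a sketch over a flat parameter vector:

import numpy as np

def init_weights(n, rng=None):
    # Random initial weights drawn uniformly from [-0.1, 0.1].
    rng = rng or np.random.default_rng()
    return rng.uniform(-0.1, 0.1, size=n)

def sgd_momentum_step(w, velocity, grad, lr=1e-4, momentum=0.9):
    # One SGD update with momentum 0.9 and learning rate 1e-4.
    velocity = momentum * velocity - lr * grad
    return w + velocity, velocity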

The advantage of deep networks is immediately obvious, with the error rate for CTC dropping from 23.9% to 18.4% as the number of hidden levels increases from one to five. The four networks CTC-3l-500h-tanh, CTC-1l-622h, CTC-3l-421h-uni and CTC-3l-250h all had approximately the same number of weights, but give radically different results. The three main conclusions we can draw from this are (a) LSTM works much better than tanh for this task, (b) bidirectional LSTM has a slight advantage over unidirectional LSTM, and (c) depth is more important than layer size (which supports previous findings for deep networks [3]). Although the advantage of the transducer is slight when the weights are randomly initialised, it becomes more substantial when pretraining is used.

Table 1. TIMIT Phoneme Recognition Results. ‘Epochs’ is the number of passes through the training set before convergence. ‘PER’ is the phoneme error rate on the core test set.

NETWORK              WEIGHTS   EPOCHS   PER
CTC-3L-500H-TANH     3.7M      107      37.6%
CTC-1L-250H          0.8M      82       23.9%
CTC-1L-622H          3.8M      87       23.0%
CTC-2L-250H          2.3M      55       21.0%
CTC-3L-421H-UNI      3.8M      115      19.6%
CTC-3L-250H          3.8M      124      18.6%
CTC-5L-250H          6.8M      150      18.4%
TRANS-3L-250H        4.3M      112      18.3%
PRETRANS-3L-250H     4.3M      144      17.7%

Fig. 3. Input Sensitivity of a deep CTC RNN. The heatmap (top) shows the derivatives of the ‘ah’ and ‘p’ outputs printed in red with respect to the filterbank inputs (bottom). The TIMIT ground truth segmentation is shown below. Note that the sensitivity extends to surrounding segments; this may be because CTC (which lacks an explicit language model) attempts to learn linguistic dependencies from the acoustic data.

5. CONCLUSIONS AND FUTURE WORK

We have shown that the combination of deep, bidirectional Long Short-term Memory RNNs with end-to-end training and weight noise gives state-of-the-art results in phoneme recognition on the TIMIT database. An obvious next step is to extend the system to large vocabulary speech recognition. Another interesting direction would be to combine frequency-domain convolutional neural networks [27] with deep LSTM.


6. REFERENCES

[1] H. A. Bourlard and N. Morgan, Connectionist Speech Recognition: A Hybrid Approach, Kluwer Academic Publishers, 1994.

[2] Qifeng Zhu, Barry Chen, Nelson Morgan, and Andreas Stolcke, “Tandem connectionist feature extraction for conversational speech recognition,” in International Conference on Machine Learning for Multimodal Interaction (MLMI’04), Berlin, Heidelberg, 2005, pp. 223–231, Springer-Verlag.

[3] A. Mohamed, G. E. Dahl, and G. Hinton, “Acoustic modeling using deep belief networks,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 1, pp. 14–22, Jan. 2012.

[4] G. Hinton, Li Deng, Dong Yu, G. E. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, and B. Kingsbury, “Deep neural networks for acoustic modeling in speech recognition,” IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82–97, Nov. 2012.

[5] A. J. Robinson, “An Application of Recurrent Nets to Phone Probability Estimation,” IEEE Transactions on Neural Networks, vol. 5, no. 2, pp. 298–305, 1994.

[6] Oriol Vinyals, Suman Ravuri, and Daniel Povey, “Revisiting Recurrent Neural Networks for Robust ASR,” in ICASSP, 2012.

[7] A. Maas, Q. Le, T. O’Neil, O. Vinyals, P. Nguyen, and A. Ng, “Recurrent neural networks for noise reduction in robust ASR,” in INTERSPEECH, 2012.

[8] A. Graves, S. Fernandez, F. Gomez, and J. Schmidhuber, “Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks,” in ICML, Pittsburgh, USA, 2006.

[9] A. Graves, Supervised Sequence Labelling with Recurrent Neural Networks, vol. 385, Springer, 2012.

[10] A. Graves, “Sequence transduction with recurrent neural networks,” in ICML Representation Learning Workshop, 2012.

[11] S. Hochreiter and J. Schmidhuber, “Long Short-Term Memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.

[12] A. Graves, S. Fernandez, M. Liwicki, H. Bunke, and J. Schmidhuber, “Unconstrained Online Handwriting Recognition with Recurrent Neural Networks,” in NIPS, 2008.

[13] Alex Graves and Juergen Schmidhuber, “Offline Handwriting Recognition with Multidimensional Recurrent Neural Networks,” in NIPS, 2009.

[14] F. Gers, N. Schraudolph, and J. Schmidhuber, “Learning Precise Timing with LSTM Recurrent Networks,” Journal of Machine Learning Research, vol. 3, pp. 115–143, 2002.

[15] M. Schuster and K. K. Paliwal, “Bidirectional Recurrent Neural Networks,” IEEE Transactions on Signal Processing, vol. 45, pp. 2673–2681, 1997.

[16] A. Graves and J. Schmidhuber, “Framewise Phoneme Classification with Bidirectional LSTM and Other Neural Network Architectures,” Neural Networks, vol. 18, no. 5–6, pp. 602–610, June/July 2005.

[17] David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams, Learning Representations by Back-propagating Errors, pp. 696–699, MIT Press, 1988.

[18] Geoffrey Zweig and Patrick Nguyen, “SCARF: A segmental CRF speech recognition system,” Tech. Rep., Microsoft Research, 2009.

[19] Andrew W. Senior and Anthony J. Robinson, “Forward-backward retraining of recurrent neural networks,” in NIPS, 1995, pp. 743–749.

[20] Abdel-rahman Mohamed, Dong Yu, and Li Deng, “Investigation of full-sequence training of deep belief networks for speech recognition,” in Interspeech, 2010.

[21] M. Lehr and I. Shafran, “Discriminatively estimated joint acoustic, duration, and language model for speech recognition,” in ICASSP, 2010, pp. 5542–5545.

[22] Kam-Chuen Jim, C. L. Giles, and B. G. Horne, “An analysis of noise in recurrent neural networks: convergence and generalization,” IEEE Transactions on Neural Networks, vol. 7, no. 6, pp. 1424–1438, Nov. 1996.

[23] Geoffrey E. Hinton and Drew van Camp, “Keeping the neural networks simple by minimizing the description length of the weights,” in COLT, 1993, pp. 5–13.

[24] Alex Graves, “Practical variational inference for neural networks,” in NIPS, 2011, pp. 2348–2356.

[25] DARPA-ISTO, The DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus (TIMIT), speech disc CD1-1.1 edition, 1990.

[26] Kai-Fu Lee and Hsiao-Wuen Hon, “Speaker-independent phone recognition using hidden Markov models,” IEEE Transactions on Acoustics, Speech, and Signal Processing, 1989.

[27] O. Abdel-Hamid, A. Mohamed, Hui Jiang, and G. Penn, “Applying convolutional neural networks concepts to hybrid NN-HMM model for speech recognition,” in ICASSP, Mar. 2012, pp. 4277–4280.
