
Page 1: Deep Neural Networks for Acoustic Modelling

Deep Neural Networks for Acoustic Modelling

Bajibabu Bollepalli

Hieu Nguyen

Rakshith Shetty

Pieter Smit (Mentor)

Page 2: Deep Neural Networks for Acoustic Modelling

Introduction

• Automatic speech recognition

[Figure: ASR pipeline — Speech signal → Feature Extraction → Acoustic Modelling → Decoder (with Language Modelling) → Recognized text]

Page 3: Deep Neural Networks for Acoustic Modelling

Introduction

• Acoustic modelling using deep neural networks

[Figure: the same ASR pipeline — Speech signal → Feature Extraction → Acoustic Modelling → Decoder (with Language Modelling) → Recognized text — with the Acoustic Modelling block highlighted]

Page 4: Deep Neural Networks for Acoustic Modelling

Background

• HMM-GMMs have prevailed in ASR for the last four decades
• Difficult for any new method to outperform them for acoustic modelling

• Can GMMs capture all the information in acoustic features?
• No. They are inefficient at modelling data that lie on or near a nonlinear manifold in the data space

• Need for better models
• Artificial neural networks (ANNs) are known to capture the nonlinearities in the data

• Natural to think of ANNs as an alternative to GMMs

Page 5: Deep Neural Networks for Acoustic Modelling

Background

• ANNs are not new to speech recognition
• Two decades ago, researchers employed ANNs for ASR
• They were unable to outperform GMMs
• Hardware and learning algorithms restricted the capacity of ANNs

• Advances in hardware as well as in machine learning algorithms now allow us to train large multilayer (deep) ANNs, called Deep Neural Networks (DNNs)
• DNNs outperform GMMs (finally ;) )

Page 6: Deep Neural Networks for Acoustic Modelling

Deep Neural Networks (DNNs)

• Feed-forward ANNs with more than one hidden layer

Page 7: Deep Neural Networks for Acoustic Modelling

Our task

• Frame-based phoneme recognition using simple DNNs

• Experiments with various input features

• Compare the results with GMMs

• Try complex DNNs (if time permits):
  • Deep belief networks (DBNs)
  • Recurrent neural networks (RNNs)


Page 9: Deep Neural Networks for Acoustic Modelling

Database

• Training data: 151 Finnish speech sentences (~15 min)

• Development data: 135 sentences (~11 min)

• Evaluation data: 100 sentences (~8 min)

Page 10: Deep Neural Networks for Acoustic Modelling

Simple DNN

• Similar to a multi-layer perceptron (MLP)

• Hidden layers: [300, 300]

• Activations: Sigmoid

• Optimization: Stochastic Gradient Descent (SGD)

• Error criterion: Categorical cross-entropy

• Software tool: Keras

• Input: MFCC features with 39 dimensions

• Output: 24 Finnish phonemes

• Normalization: Mean-variance
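A minimal sketch of this model in Keras (the 39-dim input, [300, 300] sigmoid layers, 24-way softmax, SGD, and cross-entropy loss follow the slide; the learning rate, data arrays, and training call are illustrative):

```python
# Sketch of the MLP described above, using the Keras API.
from tensorflow import keras

model = keras.Sequential([
    keras.Input(shape=(39,)),                      # 39-dim MFCC frame
    keras.layers.Dense(300, activation="sigmoid"),
    keras.layers.Dense(300, activation="sigmoid"),
    keras.layers.Dense(24, activation="softmax"),  # 24 Finnish phonemes
])
model.compile(optimizer=keras.optimizers.SGD(learning_rate=0.01),  # rate is a placeholder
              loss="categorical_crossentropy",
              metrics=["accuracy"])

# x_*: mean-variance normalized MFCC frames, y_*: one-hot phoneme labels
# model.fit(x_train, y_train, validation_data=(x_dev, y_dev), epochs=20)
```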

Page 11: Deep Neural Networks for Acoustic Modelling

Performance of simple DNN (MLP)

Input feature                          Frame-wise accuracy (%)
Single frame [t]                       63.81
Three frames [t-1, t, t+1]             67.59
Five frames [t-2, t-1, t, t+1, t+2]    67.22

Page 12: Deep Neural Networks for Acoustic Modelling

DBN

Page 13: Deep Neural Networks for Acoustic Modelling

Deep Belief Network (DBN)

• This neural network is similar to an MLP, but the weights are pre-trained using a stack of Restricted Boltzmann Machines (RBMs) instead of being randomly initialized.

• After the model is pre-trained, the weights are fine-tuned. This process is the same as training a plain MLP.

• The pre-training step is unsupervised (it does not use the true target label of a data point): we try to regenerate the input x from the hidden representation induced by x. The knowledge learned is encoded in the values of the weights.

• Fine-tuning is the supervised training step, where we maximize the prediction accuracy on the labelled data points.


Page 14: Deep Neural Networks for Acoustic Modelling

Restricted Boltzmann Machine (RBM)

• This is a type of generative neural network.

• The idea is to produce an 'energy surface' or 'heat map' in the form of a probability density.

• Energy: $E(v, h) = -a^\top v - b^\top h - v^\top W h$

• Probability density: $p(v, h) = \frac{e^{-E(v, h)}}{Z}$, where $Z = \sum_{v, h} e^{-E(v, h)}$

• Optimize the log-likelihood; its gradient is $\frac{\partial \log p(v)}{\partial w_{ij}} = \langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{model}}$

Use Gibbs sampling for $\langle \cdot \rangle_{\text{model}}$
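A minimal numpy sketch of one contrastive-divergence (CD-1) update under the definitions above (all function and variable names are ours, not from the project code):

```python
# One CD-1 update for a binary RBM with weights W (visible x hidden)
# and biases a (visible), b (hidden), given a batch v0 of visible vectors.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(W, a, b, v0, lr=1e-5, rng=np.random.default_rng(0)):
    # Positive phase: hidden probabilities given the data.
    h0 = sigmoid(v0 @ W + b)
    # One Gibbs step: sample h, reconstruct v, recompute h.
    h_sample = (rng.random(h0.shape) < h0).astype(v0.dtype)
    v1 = sigmoid(h_sample @ W.T + a)
    h1 = sigmoid(v1 @ W + b)
    # <v h>_data - <v h>_model, with the model term approximated
    # by the one-step reconstruction.
    W += lr * (v0.T @ h0 - v1.T @ h1) / len(v0)
    a += lr * (v0 - v1).mean(axis=0)
    b += lr * (h0 - h1).mean(axis=0)
    return W, a, b
```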


Page 15: Deep Neural Networks for Acoustic Modelling

DBN-pretraining

• Stack of RBMs:

• Two consecutive layers are trained as an RBM, with the lower layer acting as the visible units and the upper layer as the hidden units.

• The process is done bottom-up (see the sketch below)

• Iterate for multiple epochs
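A sketch of this bottom-up loop, reusing the hypothetical cd1_step and sigmoid from the previous sketch (layer sizes, epoch count, and initialization scale are placeholders):

```python
# Greedy layer-wise pretraining sketch: each trained layer's activations
# become the "visible" data for the next RBM in the stack.
def pretrain_dbn(x, layer_sizes, epochs=10, lr=1e-5):
    rng = np.random.default_rng(0)
    weights, v = [], x
    for n_hidden in layer_sizes:            # e.g. [300, 300]
        W = 0.01 * rng.standard_normal((v.shape[1], n_hidden))
        a, b = np.zeros(v.shape[1]), np.zeros(n_hidden)
        for _ in range(epochs):
            W, a, b = cd1_step(W, a, b, v, lr=lr, rng=rng)
        weights.append((W, b))
        v = sigmoid(v @ W + b)               # propagate data up a level
    return weights                           # used to initialize the MLP
```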


Page 16: Deep Neural Networks for Acoustic Modelling

Setups

• Using Theano-based tutorial code from deeplearning.net

• Hidden layers use the sigmoid activation function.

• The prediction layer (top layer) is a softmax layer.

• The loss function is categorical cross-entropy.

• The output is either the predicted label (one of 24 phonemes) or the probabilities of the 24 phonemes (the predicted label is the argmax of the probabilities)

• Each input is MFCCs in a context of 3 frames (triphone-style context)
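The project used the Theano tutorial's fine-tuning code; purely as an illustration of the same idea, here is a hedged Keras sketch that initializes the supervised classifier from the hypothetical pretrain_dbn output above (all names are ours):

```python
# Hypothetical fine-tuning sketch (not the project's Theano code):
# build a sigmoid MLP, copy in the pretrained RBM weights, then
# train with labels as usual.
from tensorflow import keras

def build_finetune_model(pretrained, n_inputs=39 * 3, n_classes=24):
    model = keras.Sequential([keras.Input(shape=(n_inputs,))])
    for W, b in pretrained:              # output of pretrain_dbn above
        layer = keras.layers.Dense(W.shape[1], activation="sigmoid")
        model.add(layer)                 # builds the layer (input shape known)
        layer.set_weights([W, b])        # copy in the pretrained weights
    model.add(keras.layers.Dense(n_classes, activation="softmax"))
    model.compile(optimizer="sgd", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```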


Page 17: Deep Neural Networks for Acoustic Modelling

Experiments

• Pre-training is tricky; after some rough estimates, a pre-training learning rate of 1e-5 was chosen

• Trained 'with' and 'without' pre-training to compare

• The number of hidden layers varies from 1 to 3

• The size of each hidden layer varies from 100 to 600 (some pre-trained models with sizes 500 and 600 were not trained)

• Experiments with some 3-hidden-layer 'hourglass' models did not show real improvement.


Page 18: Deep Neural Networks for Acoustic Modelling

DBN Results

• The best model is the non-pre-trained 500_500 network. Its accuracy on the validation set is 66.82%.

• The table below shows the prediction accuracy of the trained models on the validation set.


Model size     Pre-trained acc (%)  Iterations  Non-pre-trained acc (%)  Iterations
100            60.188               48934       60.344                   39830
200            61.235               44382       62.792                   48934
300            61.387               39830       62.721                   39830
400            61.284               42106       63.561                   37554
100_100        61.641               48934       62.638                   44382
200_200        63.106               47796       64.266                   39830
300_300        63.808               46658       64.716                   37554
400_400        63.741               51210       64.634                   33002
500_500        -                    -           66.820                   33002
600_600        -                    -           65.327                   30726
100_100_100    62.237               55762       62.926                   46658
200_200_200    63.589               53486       64.19                    40968
300_300_300    63.572               44382       63.73                    33002
400_400_400    63.106               44382       64.941                   35278

Page 19: Deep Neural Networks for Acoustic Modelling

Recurrent Networks

Page 20: Deep Neural Networks for Acoustic Modelling

Recurrent Neural Networks

• The output of a recurrent network at time t depends on the input at time t as well as the state of the network at time t-1.

• RNNs are thus ideal for modelling sequences, as time dependencies can be learnt in the recurrent weights

• For phoneme classification it is now easy to include an arbitrary amount of context, i.e. previous frames within a window.

• Infinitely deep in a sense (the unrolled network has one layer per time step)
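In symbols (the standard simple-RNN recurrence, added here for reference; $W$, $U$, $V$ and the biases are learned):

$h_t = \sigma(W x_t + U h_{t-1} + b)$, $\quad y_t = \mathrm{softmax}(V h_t + c)$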

Page 21: Deep Neural Networks for Acoustic Modelling

Our Model

• We use a fixed context size, with the frames from t - context up to t fed into the RNN.

• The hidden state of the RNN at time t is then used to predict the class of the frame at time t.

Page 22: Deep Neural Networks for Acoustic Modelling

Learning in recurrent nets

• We can compute the error at time t (cross-entropy error) and backpropagate the gradients through time, similar to backpropagation in an MLP.

• The problem is that these gradients can die out or blow up if the sequence is very long

• One solution for exploding gradients is to truncate the depth in time to which you backpropagate (see the sketch below)

• Another solution is to use more complex recurrent units such as LSTMs
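For concreteness, here is how two standard mitigations might look in Keras (our illustration, not the project's exact setup; the values are placeholders):

```python
# Illustrative mitigations for exploding gradients:
from tensorflow import keras

# 1) Truncated backpropagation through time: cap the length of the
#    input window fed to the recurrent layer, so gradients flow back
#    through at most `window` steps.
window = 10                                   # frames per training sequence

# 2) Clip the gradient norm so a rare exploding gradient cannot
#    produce a huge parameter update.
opt = keras.optimizers.RMSprop(clipnorm=1.0)  # rescale gradients with norm > 1
```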

Page 23: Deep Neural Networks for Acoustic Modelling

LSTM Cell

• Consists of a memory unit and 3 gates

• Each gate is affected by the current input and the previous output state of the cell.

• The 3 gates control data flow into the memory, retention of the memory, and activation of the output from the cell.
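In equations (the standard LSTM formulation, added here for reference; the $W$'s and $b$'s are learned and $\odot$ is the elementwise product):

$i_t = \sigma(W_i [x_t, h_{t-1}] + b_i)$ (input gate: data flow into the memory)
$f_t = \sigma(W_f [x_t, h_{t-1}] + b_f)$ (forget gate: retention of the memory)
$o_t = \sigma(W_o [x_t, h_{t-1}] + b_o)$ (output gate: activation of the output)
$c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_c [x_t, h_{t-1}] + b_c)$ (memory update)
$h_t = o_t \odot \tanh(c_t)$ (cell output)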

Page 24: Deep Neural Networks for Acoustic Modelling

Learning Details and Regularization

• We use the RMSprop learning algorithm, a form of gradient descent where the learning rate is automatically scaled by the RMS value of the most recent gradients

• Regularize using dropout: for each training sample, some units are randomly switched off. This forces each unit to learn something useful and not co-depend too much on other units

• Dropout is applied only in the embedding and output layers; it is a bad idea to apply it to recurrent connections.
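Putting the last few slides together, a minimal Keras sketch of the recurrent classifier (the 200 units, context 10, dropout 0.3, and RMSprop follow the slides; note that the LSTM layer's dropout argument in Keras applies to the input connections only, matching the "no dropout on recurrent connections" point; everything else is illustrative):

```python
# Sketch of the recurrent phoneme classifier: a window of `context`
# MFCC frames is fed to an LSTM, and its final hidden state predicts
# the phoneme of the current frame.
from tensorflow import keras

context, n_units = 10, 200
model = keras.Sequential([
    keras.Input(shape=(context, 39)),             # window of 39-dim MFCC frames
    keras.layers.LSTM(n_units, dropout=0.3),      # input dropout, not recurrent
    keras.layers.Dense(24, activation="softmax"), # 24 Finnish phonemes
])
model.compile(optimizer="rmsprop", loss="categorical_crossentropy",
              metrics=["accuracy"])
```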

Page 25: Deep Neural Networks for Acoustic Modelling

Results with RNNs - Accuracies

• LSTM, Context 10, Dropout 0.3 (varying network size):

Size of n/w            50      100     200
Accuracy on Eval (%)   67.79   68.11   67.76

• LSTM, 200 Units, Dropout 0.3 (varying context window):

Context window         5       10      20
Accuracy on Eval (%)   68.11   67.76   68.76

• LSTM, Context 10, 200 Units (varying dropout):

Dropout prob           0.0     0.3     0.5     0.7
Accuracy on Eval (%)   66.47   67.76   68.21   68.19

• Context 10, 200 units, Dropout 0.3 (varying unit type):

Type of unit           simple   lstm
Accuracy on Eval (%)   66.43    67.76

Page 26: Deep Neural Networks for Acoustic Modelling

Summary Results: All Models

Model                  MLP     DBN     RNN
Accuracy on Eval (%)   67.59   66.82   68.76

Page 27: Deep Neural Networks for Acoustic Modelling

Source code is available on GitHub :

https://github.com/rakshithShetty/dnn-speech


Page 28: Deep Neural Networks for Acoustic Modelling

References

• Abdel-rahman Mohamed, George E. Dahl, and Geoffrey Hinton. Acoustic modeling using deep belief networks. IEEE Transactions on Audio, Speech, and Language Processing, 20(1), 2012.

• Alex Graves, Abdel-rahman Mohamed, and Geoffrey E. Hinton. Speech recognition with deep recurrent neural networks. ArXiv, abs/1303.5778, 2013.

• Some figures are taken from Prof. Juha Karhunen's slides for the course Machine Learning and Neural Networks.

• The DBN implementation code is taken and modified from the tutorial on deeplearning.net.


Page 29: Deep Neural Networks for Acoustic Modelling

Questions?
