
Page 1: Deep Neural Networks for Acoustic Modelling

Deep Neural Networks for Acoustic Modelling

Bajibabu Bollepalli

Hieu Nguyen

Rakshith Shetty

Pieter Smit (Mentor)

Page 2: Deep Neural Networks for Acoustic Modelling

Introduction

• Automatic speech recognition

[Figure: ASR pipeline — Speech signal → Feature Extraction → Acoustic Modelling → Decoder (with Language Modelling) → Recognized text]

Page 3: Deep Neural Networks for Acoustic Modelling

Introduction

• Acoustic modelling using deep neural networks

[Figure: the same ASR pipeline — Speech signal → Feature Extraction → Acoustic Modelling → Decoder (with Language Modelling) → Recognized text — with the Acoustic Modelling block highlighted]

Page 4: Deep Neural Networks for Acoustic Modelling

Background

• HMM-GMMs have prevailed in ASR for the last four decades
• Difficult for any new method to outperform them for acoustic modelling

• Can GMMs capture all the information in acoustic features?
• No. They are inefficient at modelling data that lie on or near a nonlinear manifold in the data space

• Need for better models
• Artificial neural networks (ANNs) are known to capture the nonlinearities in the data

• Natural to think of ANNs as an alternative to GMMs

Page 5: Deep Neural Networks for Acoustic Modelling

Background

• ANNs are not new to speech recognition
• Two decades ago, researchers employed ANNs for ASR
• They were unable to outperform GMMs
• Hardware and learning algorithms restricted the capacity of ANNs

• Advances in hardware as well as in machine learning algorithms now allow us to train large multilayer (deep) ANNs, called Deep Neural Networks (DNNs)
• DNNs outperform GMMs (finally ;) )

Page 6: Deep Neural Networks for Acoustic Modelling

Deep Neural Networks (DNNs)

• Feed-forward ANNs with more than one hidden layer

Page 7: Deep Neural Networks for Acoustic Modelling

Our task

• Frame-based phoneme recognition using simple DNNs

• Experiments with various input features

• Compare the results with GMMs

• Try complex DNNs (if time permits):
  • Deep belief networks (DBNs)
  • Recurrent neural networks (RNNs)


Page 9: Deep Neural Networks for Acoustic Modelling

Database

• Training data: 151 Finnish speech sentences (~15 min)

• Development data: 135 sentences (~11 min)

• Evaluation data: 100 sentences (~8 min)

Page 10: Deep Neural Networks for Acoustic Modelling

Simple DNN

• Similar to a multi-layer perceptron (MLP)

• Hidden layers: [300, 300]

• Activations: Sigmoid

• Optimization: Stochastic Gradient Descent (SGD)

• Error criterion: Categorical cross-entropy

• Software tool: Keras

• Input: MFCC features with 39 dimensions

• Output: 24 Finnish phonemes

• Normalization: Mean-variance
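A minimal sketch of this model in Keras (the 39-dim input, [300, 300] sigmoid layers, 24-way softmax, SGD, and cross-entropy loss follow the slide; the learning rate, data arrays, and training call are illustrative):

```python
# Sketch of the MLP described above, using the Keras API.
from tensorflow import keras

model = keras.Sequential([
    keras.Input(shape=(39,)),                      # 39-dim MFCC frame
    keras.layers.Dense(300, activation="sigmoid"),
    keras.layers.Dense(300, activation="sigmoid"),
    keras.layers.Dense(24, activation="softmax"),  # 24 Finnish phonemes
])
model.compile(optimizer=keras.optimizers.SGD(learning_rate=0.01),  # rate is a placeholder
              loss="categorical_crossentropy",
              metrics=["accuracy"])

# x_*: mean-variance normalized MFCC frames, y_*: one-hot phoneme labels
# model.fit(x_train, y_train, validation_data=(x_dev, y_dev), epochs=20)
```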

Page 11: Deep Neural Networks for Acoustic Modelling

Performance of simple DNN (MLP)

Input feature                          Frame-wise accuracy (%)
Single frame [t]                       63.81
Three frames [t-1, t, t+1]             67.59
Five frames [t-2, t-1, t, t+1, t+2]    67.22

Page 12: Deep Neural Networks for Acoustic Modelling

DBN

Page 13: Deep Neural Networks for Acoustic Modelling

Deep Belief Network (DBN)

• This neural network is similar to an MLP, but the weights are pre-trained using a stack of Restricted Boltzmann Machines (RBMs) instead of being randomly initialized.

• After the model is pre-trained, the weights are fine-tuned. This process is the same as training a plain MLP.

• The pre-training step is unsupervised (it does not use the true target label of a data point): we try to regenerate the input x from the hidden representation induced by x. The knowledge learned is encoded in the values of the weights.

• Fine-tuning is the supervised training step, where we maximize the prediction accuracy on the labelled data points.


Page 14: Deep Neural Networks for Acoustic Modelling

Restricted Boltzmann Machine (RBM)

• This is a type of generative neural network.

• The idea is to produce an 'energy surface' or 'heat map' in the form of a probability density.

• Energy: $E(v, h) = -a^\top v - b^\top h - v^\top W h$

• Probability density: $p(v, h) = \frac{e^{-E(v, h)}}{Z}$, where $Z = \sum_{v, h} e^{-E(v, h)}$

• Optimize the log-likelihood; its gradient is $\frac{\partial \log p(v)}{\partial w_{ij}} = \langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{model}}$

Use Gibbs sampling for $\langle \cdot \rangle_{\text{model}}$
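A minimal numpy sketch of one contrastive-divergence (CD-1) update under the definitions above (all function and variable names are ours, not from the project code):

```python
# One CD-1 update for a binary RBM with weights W (visible x hidden)
# and biases a (visible), b (hidden), given a batch v0 of visible vectors.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(W, a, b, v0, lr=1e-5, rng=np.random.default_rng(0)):
    # Positive phase: hidden probabilities given the data.
    h0 = sigmoid(v0 @ W + b)
    # One Gibbs step: sample h, reconstruct v, recompute h.
    h_sample = (rng.random(h0.shape) < h0).astype(v0.dtype)
    v1 = sigmoid(h_sample @ W.T + a)
    h1 = sigmoid(v1 @ W + b)
    # <v h>_data - <v h>_model, with the model term approximated
    # by the one-step reconstruction.
    W += lr * (v0.T @ h0 - v1.T @ h1) / len(v0)
    a += lr * (v0 - v1).mean(axis=0)
    b += lr * (h0 - h1).mean(axis=0)
    return W, a, b
```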


Page 15: Deep Neural Networks for Acoustic Modelling

DBN-pretraining

• Stack of RBMs:

• Two consecutive layers are trained as an RBM, with the lower layer acting as the visible units and the upper layer as the hidden units.

• The process is done bottom-up (see the sketch below)

• Iterate for multiple epochs
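A sketch of this bottom-up loop, reusing the hypothetical cd1_step and sigmoid from the previous sketch (layer sizes, epoch count, and initialization scale are placeholders):

```python
# Greedy layer-wise pretraining sketch: each trained layer's activations
# become the "visible" data for the next RBM in the stack.
def pretrain_dbn(x, layer_sizes, epochs=10, lr=1e-5):
    rng = np.random.default_rng(0)
    weights, v = [], x
    for n_hidden in layer_sizes:            # e.g. [300, 300]
        W = 0.01 * rng.standard_normal((v.shape[1], n_hidden))
        a, b = np.zeros(v.shape[1]), np.zeros(n_hidden)
        for _ in range(epochs):
            W, a, b = cd1_step(W, a, b, v, lr=lr, rng=rng)
        weights.append((W, b))
        v = sigmoid(v @ W + b)               # propagate data up a level
    return weights                           # used to initialize the MLP
```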


Page 16: Deep Neural Networks for Acoustic Modelling

Setups

• Using Theano-based tutorial code from deeplearning.net

• Hidden layers use the sigmoid activation function.

• The prediction layer (top layer) is a softmax layer.

• The loss function is categorical cross-entropy.

• The output is either the predicted label (one of 24 phonemes) or the probabilities of the 24 phonemes (the predicted label is the argmax of the probabilities)

• Each input is MFCCs in a context of 3 frames (triphone-style context)
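The project used the Theano tutorial's fine-tuning code; purely as an illustration of the same idea, here is a hedged Keras sketch that initializes the supervised classifier from the hypothetical pretrain_dbn output above (all names are ours):

```python
# Hypothetical fine-tuning sketch (not the project's Theano code):
# build a sigmoid MLP, copy in the pretrained RBM weights, then
# train with labels as usual.
from tensorflow import keras

def build_finetune_model(pretrained, n_inputs=39 * 3, n_classes=24):
    model = keras.Sequential([keras.Input(shape=(n_inputs,))])
    for W, b in pretrained:              # output of pretrain_dbn above
        layer = keras.layers.Dense(W.shape[1], activation="sigmoid")
        model.add(layer)                 # builds the layer (input shape known)
        layer.set_weights([W, b])        # copy in the pretrained weights
    model.add(keras.layers.Dense(n_classes, activation="softmax"))
    model.compile(optimizer="sgd", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```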


Page 17: Deep Neural Networks for Acoustic Modelling

Experiments

• Pre-training is tricky; after some rough estimates, a pre-training learning rate of 1e-5 was chosen

• Trained 'with' and 'without' pre-training to compare

• The number of hidden layers varies from 1 to 3

• The size of each hidden layer varies from 100 to 600 (some pre-trained models with sizes 500 and 600 were not trained)

• Experiments with some 3-hidden-layer 'hourglass' models did not show real improvement.


Page 18: Deep Neural Networks for Acoustic Modelling

DBN Results

• The best model is the non-pre-trained 500_500 network. Its accuracy on the validation set is 66.82%.

• The table below shows the prediction accuracy of the trained models on the validation set.


Model size     Pre-trained acc (%)  Iterations  Non-pre-trained acc (%)  Iterations
100            60.188               48934       60.344                   39830
200            61.235               44382       62.792                   48934
300            61.387               39830       62.721                   39830
400            61.284               42106       63.561                   37554
100_100        61.641               48934       62.638                   44382
200_200        63.106               47796       64.266                   39830
300_300        63.808               46658       64.716                   37554
400_400        63.741               51210       64.634                   33002
500_500        -                    -           66.820                   33002
600_600        -                    -           65.327                   30726
100_100_100    62.237               55762       62.926                   46658
200_200_200    63.589               53486       64.19                    40968
300_300_300    63.572               44382       63.73                    33002
400_400_400    63.106               44382       64.941                   35278

Page 19: Deep Neural Networks for Acoustic Modelling

Recurrent Networks

Page 20: Deep Neural Networks for Acoustic Modelling

Recurrent Neural Networks

• The output of a recurrent network at time t depends on the input at time t as well as the state of the network at time t-1.

• RNNs are thus ideal for modelling sequences, as time dependencies can be learnt in the recurrent weights

• For phoneme classification it is now easy to include an arbitrary amount of context, i.e. previous frames within a window.

• Infinitely deep in a sense (the unrolled network has one layer per time step)
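In symbols (the standard simple-RNN recurrence, added here for reference; $W$, $U$, $V$ and the biases are learned):

$h_t = \sigma(W x_t + U h_{t-1} + b)$, $\quad y_t = \mathrm{softmax}(V h_t + c)$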

Page 21: Deep Neural Networks for Acoustic Modelling

Our Model

• We use a fixed context size, with the frames from t - context up to t fed into the RNN.

• The hidden state of the RNN at time t is then used to predict the class of the frame at time t.

Page 22: Deep Neural Networks for Acoustic Modelling

Learning in recurrent nets

• We can compute the error at time t (cross-entropy error) and backpropagate the gradients through time, similar to backpropagation in an MLP.

• The problem is that these gradients can die out or blow up if the sequence is very long

• One solution for exploding gradients is to truncate the depth in time to which you backpropagate (see the sketch below)

• Another solution is to use more complex recurrent units such as LSTMs
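For concreteness, here is how two standard mitigations might look in Keras (our illustration, not the project's exact setup; the values are placeholders):

```python
# Illustrative mitigations for exploding gradients:
from tensorflow import keras

# 1) Truncated backpropagation through time: cap the length of the
#    input window fed to the recurrent layer, so gradients flow back
#    through at most `window` steps.
window = 10                                   # frames per training sequence

# 2) Clip the gradient norm so a rare exploding gradient cannot
#    produce a huge parameter update.
opt = keras.optimizers.RMSprop(clipnorm=1.0)  # rescale gradients with norm > 1
```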

Page 23: Deep Neural Networks for Acoustic Modelling

LSTM Cell

• Consists of a memory unit and 3 gates

• Each gate is affected by the current input and the previous output state of the cell.

• The 3 gates control data flow into the memory, retention of the memory, and activation of the output from the cell.
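In equations (the standard LSTM formulation, added here for reference; the $W$'s and $b$'s are learned and $\odot$ is the elementwise product):

$i_t = \sigma(W_i [x_t, h_{t-1}] + b_i)$ (input gate: data flow into the memory)
$f_t = \sigma(W_f [x_t, h_{t-1}] + b_f)$ (forget gate: retention of the memory)
$o_t = \sigma(W_o [x_t, h_{t-1}] + b_o)$ (output gate: activation of the output)
$c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_c [x_t, h_{t-1}] + b_c)$ (memory update)
$h_t = o_t \odot \tanh(c_t)$ (cell output)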

Page 24: Deep Neural Networks for Acoustic Modelling

Learning Details and Regularization

• We use the RMSprop learning algorithm, a form of gradient descent where the learning rate is automatically scaled by the RMS value of the most recent gradients

• Regularize using dropout: for each training sample, some units are randomly switched off. This forces each unit to learn something useful and not co-depend too much on other units

• Dropout is applied only in the embedding and output layers; it is a bad idea to apply it to recurrent connections.
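Putting the last few slides together, a minimal Keras sketch of the recurrent classifier (the 200 units, context 10, dropout 0.3, and RMSprop follow the slides; note that the LSTM layer's dropout argument in Keras applies to the input connections only, matching the "no dropout on recurrent connections" point; everything else is illustrative):

```python
# Sketch of the recurrent phoneme classifier: a window of `context`
# MFCC frames is fed to an LSTM, and its final hidden state predicts
# the phoneme of the current frame.
from tensorflow import keras

context, n_units = 10, 200
model = keras.Sequential([
    keras.Input(shape=(context, 39)),             # window of 39-dim MFCC frames
    keras.layers.LSTM(n_units, dropout=0.3),      # input dropout, not recurrent
    keras.layers.Dense(24, activation="softmax"), # 24 Finnish phonemes
])
model.compile(optimizer="rmsprop", loss="categorical_crossentropy",
              metrics=["accuracy"])
```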

Page 25: Deep Neural Networks for Acoustic Modelling

Results with RNNs - Accuracies

• LSTM, Context 10, Dropout 0.3 (varying network size):

Size of n/w            50      100     200
Accuracy on Eval (%)   67.79   68.11   67.76

• LSTM, 200 Units, Dropout 0.3 (varying context window):

Context window         5       10      20
Accuracy on Eval (%)   68.11   67.76   68.76

• LSTM, Context 10, 200 Units (varying dropout):

Dropout prob           0.0     0.3     0.5     0.7
Accuracy on Eval (%)   66.47   67.76   68.21   68.19

• Context 10, 200 units, Dropout 0.3 (varying unit type):

Type of unit           simple   lstm
Accuracy on Eval (%)   66.43    67.76

Page 26: Deep Neural Networks for Acoustic Modelling

Summary Results: All Models

Model                  MLP     DBN     RNN
Accuracy on Eval (%)   67.59   66.82   68.76

Page 27: Deep Neural Networks for Acoustic Modelling

Source code is available on GitHub :

https://github.com/rakshithShetty/dnn-speech


Page 28: Deep Neural Networks for Acoustic Modelling

References

• Abdel-rahman Mohamed, George E. Dahl, and Geoffrey Hinton. Acoustic modeling using deep belief networks. IEEE Transactions on Audio, Speech, and Language Processing, 20(1), 2012.

• Alex Graves, Abdel-rahman Mohamed, and Geoffrey E. Hinton. Speech recognition with deep recurrent neural networks. ArXiv, abs/1303.5778, 2013.

• Some figures are taken from Prof. Juha Karhunen's slides for the course Machine Learning and Neural Networks.

• The DBN implementation code is taken and modified from the tutorial on deeplearning.net.


Page 29: Deep Neural Networks for Acoustic Modelling

Questions?
