automatic speech recognition system using deep learning


Upload: ankan-dutta

Post on 09-Jan-2017


TRANSCRIPT

Page 1: Automatic speech recognition system using deep learning

Automatic Speech Recognition System using Deep Learning

Ankan Dutta (14MCEI03)

Guided by Dr. Sharada Valiveti

Institute of Technology, Nirma University

May 16, 2016

Ankan Dutta (Institute of Technology, Nirma University) | Audio Visual Speech Recognition System using Deep Learning | May 16, 2016 | 1 / 39

Page 2: Automatic speech recognition system using deep learning

Introduction

Definition

Development of an Automatic Speech Recognition System using Deep Learning Techniques

Scope

Automatic Speech Recognition allows computers to interpret human speech.

It lowers the barrier for computer interactions.

Speech recognition allows converting speech to text.

Page 3: Automatic speech recognition system using deep learning

Introduction

Objectives

The audio signals should be converted into MFCCs (Mel-Frequency Cepstral Coefficients) [1].

Implementing a Convolutional Neural Network [2, 3] for audio feature extraction.

Then using a Gaussian Mixture Model - Hidden Markov Model (GMM-HMM) [4] for recognition of the audio signal.
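The first objective, turning an audio signal into MFCCs, can be sketched as follows. This is a minimal textbook-style illustration in NumPy, not the feature extractor used in the project (the slides rely on KALDI and MATLAB for that). The frame length (25 ms), frame step (10 ms), FFT size, number of mel filters (26) and number of coefficients (13) are common defaults assumed here, and pre-emphasis and liftering are left out for brevity.

```python
import numpy as np

def hz_to_mel(hz):
    # Common mel-scale formula (assumed variant).
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters spaced evenly on the mel scale."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0),
                             n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):       # rising edge of the triangle
            fbank[i - 1, k] = (k - left) / (center - left)
        for k in range(center, right):      # falling edge of the triangle
            fbank[i - 1, k] = (right - k) / (right - center)
    return fbank

def mfcc(signal, sample_rate=16000, frame_len=0.025, frame_step=0.010,
         n_fft=512, n_filters=26, n_ceps=13):
    """Return an array of shape (n_frames, n_ceps)."""
    signal = np.asarray(signal, dtype=float)
    frame_size = int(round(frame_len * sample_rate))
    step = int(round(frame_step * sample_rate))
    if len(signal) < frame_size:
        raise ValueError("signal is shorter than one frame")
    # 1. Split the signal into overlapping frames and apply a Hamming window.
    n_frames = 1 + (len(signal) - frame_size) // step
    idx = np.arange(frame_size)[None, :] + step * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_size)
    # 2. Power spectrum of each frame (zero-padded to n_fft).
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # 3. Mel filterbank energies, then log (small constant avoids log(0)).
    energies = power @ mel_filterbank(n_filters, n_fft, sample_rate).T
    log_energies = np.log(energies + 1e-10)
    # 4. DCT-II (unnormalized) of the log energies; keep the first n_ceps.
    n = np.arange(n_filters)
    basis = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1)
                   / (2 * n_filters))
    return log_energies @ basis.T
```

For a one-second signal sampled at 16 kHz, `mfcc(signal)` returns one 13-dimensional vector per 10 ms frame.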

Page 4: Automatic speech recognition system using deep learning

1st Review

Applications of Speech Recognition System

Dimensions of Speech Recognition System.

Generic Block Diagram of the system (fig. 1).

Performance Evaluation of the system.

Conventional Speech Recognition Systems.

Required Machine Learning Techniques.

MFCCs (fig. 2)

Hidden Markov Models (fig. 3)

Gaussian Mixture Models (fig. 4)

Deep De-noising Auto-Encoders (fig. 5)

Convolutional Neural Networks (fig. 6)

Page 5: Automatic speech recognition system using deep learning

1st Review

Literature survey on various proposed:

Audio Features Extraction Mechanisms.

Visual Features Extraction Mechanisms.

Integration of Audio and Visual Systems.

Architectures for the System.

Page 6: Automatic speech recognition system using deep learning

Mechanisms for Audio Features Extraction [5, 6, 7]

Approaches:

Using DNN-HMM in recognition phase.

Using DNN in feature-extraction phase and GMM-HMM in recognition phase.

Page 7: Automatic speech recognition system using deep learning

Generic Block Diagram

Figure: System architecture

Page 8: Automatic speech recognition system using deep learning

MFCC[1]

Figure: MFCC

Page 9: Automatic speech recognition system using deep learning

Hidden Markov Model

Figure: Hidden Markov Model

Page 10: Automatic speech recognition system using deep learning

Gaussian Mixture Model

Figure: Gaussian Mixture Model

Page 11: Automatic speech recognition system using deep learning

Deep De-noising Auto-Encoder

Figure: Deep De-noising Auto-Encoder

Page 12: Automatic speech recognition system using deep learning

Convolutional Neural Network

Figure: Convolutional Neural Network

Page 13: Automatic speech recognition system using deep learning

2nd Review

Audio Visual Speech Recognition.

Architecture of the model which we will use in our implementation (fig. 7).

Required Tools and Datasets.

A Basic implementation Using KALDI.

Page 14: Automatic speech recognition system using deep learning

Architecture of the Model

Figure: System architecture

Page 15: Automatic speech recognition system using deep learning

Requirements

For Automatic Speech Recognition,

Nvidia CUDA 7.5 (the system should have an Nvidia GPU) [8]

We are using the KALDI Speech Recognition Toolkit for our implementation.

For Automatic Speaker Recognition,

Python

Page 16: Automatic speech recognition system using deep learning

KALDI’s Dependencies [9]

To use Kaldi, the following libraries and tools have to be installed:

OpenFst : Most of the compilation is done with it, and it is very heavily used.

IRSTLM : A language modeling toolkit.

sph2pipe : Converts .sph files to .wav files; it is required for using LDC datasets.

sclite : Not that important, but it may still arise as one of the dependencies, so it is better to install it.

ATLAS : A linear algebra library. It will only work if CPU throttling is not enabled.

CLAPACK : Also a linear algebra library. If ATLAS is not available, CLAPACK can be used as an alternative.

SRILM : A toolkit for building and applying statistical language models (LMs), primarily for use in speech recognition, statistical tagging and segmentation, and machine translation.

Page 17: Automatic speech recognition system using deep learning

Dataset

We have made our own dataset according to the requirements of the project. The vocabulary of our dataset is very small, i.e. the digits from 0 to 9. The same dataset is used in both of our implementations, automatic speech recognition and speaker identification. Sentences that contain only digits are well suited to this case.

10 different speakers (ASR systems must be trained and tested on different speakers; the more speakers you have, the better),

each speaker says 10 sentences,

100 sentences/utterances (in 100 *.wav files placed in 10 folders related to particular speakers, 10 *.wav files in each folder),

300 words (digits from zero to nine),

each sentence/utterance consists of 3 words.

Page 18: Automatic speech recognition system using deep learning

Implementation

Automatic Speech Recognition

Automatic Speaker Recognition

Page 19: Automatic speech recognition system using deep learning

Procedure for making an Automatic Speech Recognition System Using KALDI

Data Preparation

Audio Data

Acoustic Modelling

Language Modelling

Project Finalization

Tools Attachment

Scoring Script

Configuration Files

Running Scripts Creation

Getting Results

Page 20: Automatic speech recognition system using deep learning

Data Preparation

Audio Data : We are using our own Digits dataset for the implementation of our ASR system.

TASK : Go to the kaldi-trunk/egs/digits directory and create a ’digitsaudio’ folder. In kaldi-trunk/egs/digits/digitsaudio create two folders: ’train’ and ’test’.
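The folder setup in this task can also be scripted. The sketch below is only an illustration: the function name `create_audio_dirs` is hypothetical, and the path to the `egs` directory is passed in by the caller rather than assumed.

```python
from pathlib import Path

def create_audio_dirs(egs_root):
    """Create digits/digitsaudio/{train,test} under the given egs directory.

    egs_root is wherever kaldi-trunk/egs lives on your machine (assumption:
    the caller supplies it). Returns the two created directories.
    """
    base = Path(egs_root) / "digits" / "digitsaudio"
    created = []
    for subset in ("train", "test"):
        subset_dir = base / subset
        # parents=True also creates 'digits' and 'digitsaudio' if missing.
        subset_dir.mkdir(parents=True, exist_ok=True)
        created.append(subset_dir)
    return created
```

Calling `create_audio_dirs("kaldi-trunk/egs")` from the directory that contains `kaldi-trunk` yields the layout described in the task; running it twice is harmless because existing folders are left in place.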

Page 21: Automatic speech recognition system using deep learning

Data Preparation

Acoustic Modelling

TASK: In the kaldi-trunk/egs/digits directory, create a folder ’data’. Then create ’test’ and ’train’ subfolders inside.

spk2gender : This file gives each speaker's gender.
PATTERN: <speakerID> <gender>

wav.scp : This file connects every utterance (a sentence said by one person during a particular recording session) with the audio file related to this utterance.
PATTERN: <utteranceID> <full-path-to-audio-file>

Page 22: Automatic speech recognition system using deep learning

Data Preparation

Acoustic Modelling

text : This file contains every utterance matched with its text transcription.
PATTERN: <utteranceID> <text-transcription>

utt2spk : This file tells the ASR system which utterance belongs to which speaker.
PATTERN: <utteranceID> <speakerID>

corpus.txt : In kaldi-trunk/egs/digits/data/local create a file corpus.txt, which should contain every single utterance transcription that can occur in our ASR system (in our case it will be 100 lines from 100 audio files).
PATTERN: <text-transcription>
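The five files described on these slides (spk2gender, wav.scp, text, utt2spk and corpus.txt) all follow simple one-entry-per-line patterns, so they can be generated from a single list of utterances. The sketch below is an illustration under assumptions: the function names, the utterance IDs such as `spk01_1_5_2`, and the audio paths are hypothetical. Entries are sorted by ID because Kaldi's data validation generally expects sorted files.

```python
from pathlib import Path

def build_data_files(utterances, genders):
    """Build the contents of the Kaldi data files.

    utterances: iterable of (utt_id, speaker_id, wav_path, transcription).
    genders:    dict mapping speaker_id -> "m" or "f".
    Returns a dict mapping file name -> file content.
    """
    utts = sorted(utterances)  # sorted by utterance ID
    return {
        "wav.scp": "".join(f"{u} {wav}\n" for u, _, wav, _ in utts),
        "text": "".join(f"{u} {txt}\n" for u, _, _, txt in utts),
        "utt2spk": "".join(f"{u} {spk}\n" for u, spk, _, _ in utts),
        "spk2gender": "".join(f"{s} {g}\n" for s, g in sorted(genders.items())),
        "corpus.txt": "".join(f"{txt}\n" for _, _, _, txt in utts),
    }

def write_data_files(files, subset_dir, local_dir):
    """Write corpus.txt to local_dir (data/local) and the rest to subset_dir
    (data/train or data/test)."""
    for name, content in files.items():
        target = Path(local_dir if name == "corpus.txt" else subset_dir)
        target.mkdir(parents=True, exist_ok=True)
        (target / name).write_text(content)

# Hypothetical example with two utterances from two speakers.
example = build_data_files(
    [
        ("spk02_3_4_5", "spk02", "/audio/spk02/3_4_5.wav", "three four five"),
        ("spk01_1_5_2", "spk01", "/audio/spk01/1_5_2.wav", "one five two"),
    ],
    {"spk01": "m", "spk02": "f"},
)
```

Prefixing each utterance ID with its speaker ID, as in the example, keeps the utterance and speaker orderings consistent with each other.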

Page 23: Automatic speech recognition system using deep learning

Data Preparation

Language Modelling

TASK: In the kaldi-trunk/egs/digits/data/local directory, create a folder ’dict’. Then create ’test’ and ’train’ subfolders inside.

lexicon.txt : This file contains every word from our dictionary with its ’phone transcriptions’ (taken from /egs/voxforge).
PATTERN: <word> <phone1> <phone2> ...

nonsilence_phones.txt : This file lists the nonsilence phones that are present in our project.
PATTERN: <phone>

silence_phones.txt : This file lists the silence phones.
PATTERN: <phone>

optional_silence.txt : This file lists the optional silence phones.
PATTERN: <phone>
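The dictionary files are closely related: the nonsilence phone list is just the set of phones that appear in the lexicon. The sketch below derives all four files from one lexicon. It is illustrative only: the ARPAbet-style phone transcriptions for the digits, the `!SIL` and `<UNK>` entries, and the `sil`/`spn` silence phones are assumptions, not values taken from the slides. The file names use underscores because that is what Kaldi's `utils/prepare_lang.sh` reads.

```python
# Assumed ARPAbet-style transcriptions for the ten digit words.
LEXICON = {
    "zero": ["Z", "IH", "R", "OW"],
    "one": ["W", "AH", "N"],
    "two": ["T", "UW"],
    "three": ["TH", "R", "IY"],
    "four": ["F", "AO", "R"],
    "five": ["F", "AY", "V"],
    "six": ["S", "IH", "K", "S"],
    "seven": ["S", "EH", "V", "AH", "N"],
    "eight": ["EY", "T"],
    "nine": ["N", "AY", "N"],
}
SILENCE_PHONES = ["sil", "spn"]   # assumption: silence and spoken-noise phones
OPTIONAL_SILENCE = "sil"          # assumption

def build_dict_files(lexicon):
    """Return a dict mapping dictionary file name -> file content."""
    # Silence and unknown-word entries sit alongside the real words.
    entries = {"!SIL": ["sil"], "<UNK>": ["spn"], **lexicon}
    # Every phone used by a real word is a nonsilence phone.
    nonsilence = sorted({p for phones in lexicon.values() for p in phones})
    return {
        "lexicon.txt": "".join(
            f"{word} {' '.join(phones)}\n"
            for word, phones in sorted(entries.items())
        ),
        "nonsilence_phones.txt": "".join(f"{p}\n" for p in nonsilence),
        "silence_phones.txt": "".join(f"{p}\n" for p in SILENCE_PHONES),
        "optional_silence.txt": f"{OPTIONAL_SILENCE}\n",
    }

dict_files = build_dict_files(LEXICON)
```

Deriving the phone list from the lexicon avoids the two files drifting apart when a word or pronunciation is changed.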

Page 24: Automatic speech recognition system using deep learning

Project Finalization

Tools Attachment

From kaldi-trunk/egs/wsj/s5 copy two folders (with their whole content), ’utils’ and ’steps’, and put them in our kaldi-trunk/egs/digits directory.

Scoring Script

This script will help you to get decoding results.

TASK: From kaldi-trunk/egs/voxforge/local copy the script score.sh into exactly the same location in our project (kaldi-trunk/egs/digits/local).
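Kaldi's scoring scripts typically report decoding quality as a word error rate (WER). To show what that number measures, here is a small standalone sketch that computes WER with a word-level edit distance. It is not the score.sh script and does not call Kaldi; the function names are hypothetical.

```python
def word_errors(reference, hypothesis):
    """Return (edit_distance, reference_length) at the word level.

    The edit distance counts substitutions, deletions and insertions
    needed to turn the reference word sequence into the hypothesis.
    """
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))        # distances for an empty reference
    for i in range(1, len(ref) + 1):
        cur = [i] + [0] * len(hyp)
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            cur[j] = min(prev[j] + 1,          # deletion
                         cur[j - 1] + 1,       # insertion
                         prev[j - 1] + cost)   # match or substitution
        prev = cur
    return prev[len(hyp)], len(ref)

def wer(pairs):
    """WER in percent over (reference, hypothesis) pairs."""
    errors = words = 0
    for reference, hypothesis in pairs:
        e, n = word_errors(reference, hypothesis)
        errors += e
        words += n
    return 100.0 * errors / words
```

For example, recognizing "one two three" as "one five three" is one substitution out of three reference words, so that single utterance has a WER of about 33.3%.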

Page 25: Automatic speech recognition system using deep learning

Project Finalization

Configuration Files

TASK: In kaldi-trunk/egs/digits create a folder ’conf’. Inside kaldi-trunk/egs/digits/conf create two files (for some configuration modifications in the decoding and MFCC feature extraction processes, taken from /egs/voxforge):

decode.config

mfcc.conf

Page 26: Automatic speech recognition system using deep learning

Running Script Creation

Our last job is to prepare the running scripts that create the ASR system of our choice.

path.sh

cmd.sh

run.sh

Page 27: Automatic speech recognition system using deep learning

Results Automatic Speech Recognition System

When we execute the run.sh file, training is performed, and the result logs generated by the decoding process can be found in the ’log’ folder. Figs. 8, 9 and 10 show the training process, and fig. 11 shows the prediction results on the test data together with the accuracy.

Page 28: Automatic speech recognition system using deep learning

Results Automatic Speech Recognition System(Training)

Figure: Training Screenshot 1

Page 29: Automatic speech recognition system using deep learning

Results Automatic Speech Recognition System(Training)

Figure: Training Screenshot 2

Page 30: Automatic speech recognition system using deep learning

Results Automatic Speech Recognition System(Training)

Figure: Training Screenshot 3

Page 31: Automatic speech recognition system using deep learning

Results Automatic Speech Recognition System(Prediction)

Figure: Prediction of our ASR System

Page 32: Automatic speech recognition system using deep learning

Procedure for making an Automatic Speaker Recognition System Using K-NN in Python

We have used our own dataset in our K-NN implementation.

First, we extracted the MFCC features using MATLAB code.

Then we arranged these MFCC features in .csv form.

This .csv feature file is then given as input to our K-NN implementation in Python.

Page 33: Automatic speech recognition system using deep learning

Result of the Automatic Speaker Recognition System

When we execute the program, the dataset is split into two parts, training and testing. After training is completed, speaker prediction is done on the testing data, and the accuracy is then measured on the predictions. Figure 13 shows an accuracy of 80%, obtained when we took nine speakers and all of the MFCC features generated for all ten samples. Figure 12 shows the MFCC features.
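The split, predict and measure steps described above can be sketched with a small K-NN classifier. This is an illustration under assumptions, not the project's code: the CSV layout (one row per sample, MFCC features in the leading columns, an integer speaker label in the last column), the Euclidean distance, k = 3 and the 70/30 split are all choices made for the example.

```python
from collections import Counter
import numpy as np

def load_features(csv_path):
    """Load a feature CSV. Assumed layout: features..., integer speaker label."""
    data = np.loadtxt(csv_path, delimiter=",", ndmin=2)
    return data[:, :-1], data[:, -1].astype(int)

def train_test_split(x, y, test_fraction=0.3, seed=0):
    """Shuffle the samples and split them into training and testing parts."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(x))
    n_test = int(round(test_fraction * len(x)))
    test, train = order[:n_test], order[n_test:]
    return x[train], y[train], x[test], y[test]

def knn_predict(train_x, train_y, query, k=3):
    """Predict the speaker of one feature vector by majority vote
    among its k nearest training samples (Euclidean distance)."""
    distances = np.linalg.norm(train_x - query, axis=1)
    nearest = np.argsort(distances)[:k]
    votes = Counter(int(train_y[i]) for i in nearest)
    return votes.most_common(1)[0][0]

def accuracy(train_x, train_y, test_x, test_y, k=3):
    """Fraction of test samples whose predicted speaker is correct."""
    predictions = [knn_predict(train_x, train_y, q, k) for q in test_x]
    correct = sum(int(p == t) for p, t in zip(predictions, test_y))
    return correct / len(test_y)
```

K-NN has no separate fitting step: "training" amounts to storing the training features, so the quality of the result depends mostly on the features and on the choice of k.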

Page 34: Automatic speech recognition system using deep learning

Result of the Automatic Speaker Recognition System

Figure: MFCC features generated from Digits dataset

Page 35: Automatic speech recognition system using deep learning

Result of the Automatic Speaker Recognition System

Figure: Prediction and Accuracy of Speaker Identification on Digits dataset

Page 36: Automatic speech recognition system using deep learning

Conclusion

For Automatic Speech Recognition,

For implementing our own ASR system we used the KALDI speech recognition framework.

We trained our system on our own dataset, Digits.

Digits is our own dataset containing 10 speakers, each speaking 10 sentences; every sentence contains 3 words. The vocabulary of our dataset is the digits 0 to 9.

As a result, we have achieved an accuracy rate of 72% for our Speech Recognition System.

Our ASR system is a text-dependent system, as it has a limited vocabulary of 0 to 9.

A higher recognition rate can be achieved with a larger dataset.

For Automatic Speaker Recognition,

We have also implemented Speaker Identification using K-NN classification in Python, using the same dataset.

After the training of our K-NN classifier, we achieved an accuracy of 80% for our Speaker Identification system.

Page 37: Automatic speech recognition system using deep learning

Bibliography I

[1] J. Luettin, N. Thacker, S. W. Beet, et al., “Visual speech recognition using active shape models and hidden markov models,” in Acoustics, Speech, and Signal Processing, 1996. ICASSP-96. Conference Proceedings., 1996 IEEE International Conference on, vol. 2, pp. 817–820, IEEE, 1996.

[2] O. Abdel-Hamid, A.-r. Mohamed, H. Jiang, L. Deng, G. Penn, and D. Yu, “Convolutional neural networks for speech recognition,” Audio, Speech, and Language Processing, IEEE/ACM Transactions on, vol. 22, no. 10, pp. 1533–1545, 2014.

[3] Y. Lan, B.-J. Theobald, R. Harvey, E.-J. Ong, and R. Bowden, “Improving visual features for lip-reading,” in AVSP, pp. 7–3, 2010.

Page 38: Automatic speech recognition system using deep learning

Bibliography II

[4] S. Renals, N. Morgan, H. Bourlard, M. Cohen, and H. Franco, “Connectionist probability estimators in hmm speech recognition,” Speech and Audio Processing, IEEE Transactions on, vol. 2, no. 1, pp. 161–174, 1994.

[5] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol, “Extracting and composing robust features with denoising autoencoders,” in Proceedings of the 25th international conference on Machine learning, pp. 1096–1103, ACM, 2008.

[6] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol, “Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion,” The Journal of Machine Learning Research, vol. 11, pp. 3371–3408, 2010.

Page 39: Automatic speech recognition system using deep learning

Bibliography III

[7] K. Noda, Y. Yamaguchi, K. Nakadai, H. G. Okuno, and T. Ogata, “Audio-visual speech recognition using deep learning,” Applied Intelligence, vol. 42, no. 4, pp. 722–737, 2015.

[8] ETSI/SAGE, “Specification of the 3GPP Confidentiality and Integrity Algorithms 128-EEA3 & 128-EIA3. Document 1: 128-EEA3 and 128-EIA3 Specification.”

[9] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlíček, Y. Qian, P. Schwarz, et al., “The kaldi speech recognition toolkit,” 2011.
