deep learning for speech recognition - vikrant singh tomar
Deep Learning for Speech Recognition
Vikrant Tomar
Founder, Fluent.ai
We are hiring!
Outline
- Introduction
- General overview of speech recognition framework
- Conventional GMM-HMM based systems
- Deep neural networks in speech
- ConvNets
- RNNs/LSTMs and End-to-end learning
- New interesting stuff
Intro 1: What is speech recognition?
- Dream: A machine should be able to develop a functional equivalent of the
speaker’s intended message as effortlessly as humans can
- In other words: The goal is to find the most likely sequence of symbols such as
words or sub-word speech units from a stream of acoustic data.
Intro 2: How is deep learning for speech different from deep learning for images?
- Speech is a temporal signal: the order of the sequence carries information
- One-dimensional signal carrying many kinds of information:
  - Speaker
  - Accent and language
  - Age and health
  - Environment
- Issues:
  - Noise and background conditions
  - Accents
  - Recording devices
Overview: Statistical Framework for speech recognition
- Formally, an ASR system maps the sequence of observation vectors, X, to the
optimum sequence of words, Ŵ.
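The slide's equation was an image and is not in this transcript. Under the usual statistical formulation (an assumption here, not recovered from the slide), the mapping is the Bayes decision rule, where p(X | W) is the acoustic model and P(W) is the language model:

```latex
\hat{W} = \arg\max_{W} P(W \mid X)
        = \arg\max_{W} \frac{p(X \mid W)\, P(W)}{p(X)}
        = \arg\max_{W} p(X \mid W)\, P(W)
```

The denominator p(X) does not depend on W, so it drops out of the maximization.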
Overview 2: System Architecture
System Architecture : Feature extraction & spectrogram
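The figure for this slide is not in the transcript. As a rough sketch of what a spectrogram front end does, the NumPy code below frames a signal, applies a window, and takes the log power of each frame's FFT. The function name `log_spectrogram`, the 25 ms frame length, the 10 ms hop, and the Hamming window are illustrative assumptions (common choices), not values taken from the talk.

```python
import numpy as np


def log_spectrogram(signal, sample_rate, frame_ms=25.0, hop_ms=10.0):
    """Return a (num_frames, num_bins) log-power spectrogram.

    frame_ms and hop_ms are assumed, commonly used values.
    """
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    hop_len = int(round(sample_rate * hop_ms / 1000.0))
    if len(signal) < frame_len:
        raise ValueError("signal is shorter than one frame")

    # Slice the signal into overlapping frames and taper each with a window.
    num_frames = 1 + (len(signal) - frame_len) // hop_len
    window = np.hamming(frame_len)
    frames = np.stack([
        signal[i * hop_len : i * hop_len + frame_len] * window
        for i in range(num_frames)
    ])

    # Power spectrum per frame; the small constant avoids log(0).
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    return np.log(power + 1e-10)


# Usage: one second of a 1 kHz tone sampled at 16 kHz.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 1000.0 * t)
spec = log_spectrogram(tone, sample_rate)
peak_bin = int(np.argmax(spec.mean(axis=0)))
```

Real systems usually go further (for example mel filterbanks or MFCCs on top of this), but the framing and FFT steps are the same.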
GMM-HMM based systems
Deep neural networks in speech
- A few different approaches
- Tandem
- Hybrid
- End-to-end
- Old but new
Tandem DNN: DNN -- GMM -- HMM
Hybrid DNN - HMM
- Good source:
Hinton et al., Deep neural networks
for acoustic modeling in speech recognition, 2012.
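In the hybrid setup, the network outputs posteriors over HMM states, while the HMM decoder expects likelihoods. A common fix is to divide each posterior by the state's prior, which gives the likelihood up to a constant. The sketch below shows that step in the log domain; the function name, the flooring constant, and the toy numbers are illustrative assumptions.

```python
import math


def scaled_log_likelihoods(posteriors, priors, floor=1e-10):
    """Convert state posteriors p(s|x) to scaled log-likelihoods.

    By Bayes' rule p(x|s) = p(s|x) p(x) / p(s); p(x) is the same for
    every state, so log p(s|x) - log p(s) is enough for decoding.
    """
    if len(posteriors) != len(priors):
        raise ValueError("posteriors and priors must have the same length")
    return [
        math.log(max(p, floor)) - math.log(max(q, floor))
        for p, q in zip(posteriors, priors)
    ]


# Toy frame with three HMM states (made-up numbers).
posteriors = [0.7, 0.2, 0.1]   # network output for one frame
priors = [0.8, 0.1, 0.1]       # state frequencies in the training alignment
scores = scaled_log_likelihoods(posteriors, priors)
best_state = max(range(len(scores)), key=scores.__getitem__)
```

State 0 has the highest posterior here, but it is also very frequent, so after dividing by the prior state 1 scores best.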
Hybrid CNN - HMM
- Good source: Abdel-Hamid et al., Convolutional neural networks for speech recognition,
2014
Hybrid CNN - HMM -- Partial weight sharing
Some benchmarks
RNNs and End-to-end models
- RNN:
  - Good fit because they are sequential models
  - However, plain RNNs struggle to capture long-term dependencies
  - Vanishing gradients
  - Solutions: LSTMs and GRUs
- End-to-end models have a simpler overall architecture
- CTC: Connectionist temporal classification
A. Graves et al., “Towards End-to-End Speech
Recognition with Recurrent Neural Networks”, 2014
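To make the CTC idea concrete: the network emits one label distribution per frame, including a special blank label, and a label sequence is read off by collapsing repeated labels and then dropping blanks. The sketch below shows only the simplest (greedy, best-path) decoding, not CTC training or beam search; the function name, the toy alphabet, and putting the blank at index 0 are assumptions for illustration.

```python
def ctc_greedy_decode(frame_scores, alphabet, blank=0):
    """Best-path CTC decoding.

    frame_scores: one list of per-label scores for each frame.
    alphabet: label strings; alphabet[blank] is the blank placeholder.
    """
    # Pick the highest-scoring label in every frame.
    best_path = [
        max(range(len(row)), key=row.__getitem__) for row in frame_scores
    ]

    decoded = []
    previous = None
    for index in best_path:
        # Collapse repeats first, then drop blanks.
        if index != previous and index != blank:
            decoded.append(alphabet[index])
        previous = index
    return "".join(decoded)


def one_hot_frames(path, num_labels):
    """Helper for the demo: turn a label path into per-frame scores."""
    return [[1.0 if k == i else 0.0 for k in range(num_labels)] for i in path]


alphabet = ["<blank>", "a", "c", "t"]
cat = ctc_greedy_decode(one_hot_frames([2, 2, 0, 1, 1, 3, 0], 4), alphabet)
double_a = ctc_greedy_decode(one_hot_frames([1, 0, 1], 4), alphabet)
single_a = ctc_greedy_decode(one_hot_frames([1, 1], 4), alphabet)
```

The blank is what lets the model output a genuinely doubled letter: "a, blank, a" decodes to "aa", while "a, a" collapses to "a".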
New interesting stuff
- Baidu Deep Speech: uses bi-directional RNNs to map directly to characters
- IBM 2015/2016 and Microsoft 2016: deep CNNs with 3 x 3 kernels, similar to VGG net, etc.
- CLDNN: Conv + LSTMs + Fully Connected
Baidu Lab: Deep Speech, 2014 and Deep Speech 2, 2015
Sainath et al., Convolutional, Long Short-Term Memory, Fully Connected Deep Neural Networks, 2015
Xiong et al., The Microsoft 2016 Conversational Speech Recognition System, 2016
Saon et al., The IBM 2015/16 English Conversational Telephone Speech Recognition System, 2015/16
Conclusion and resources
- Lots of exciting stuff; most concepts are shared with other deep learning
communities
- Good starting point: http://www.recognize-speech.com
- You can use any toolbox you like to start:
  - TensorFlow, Torch, Theano, etc.
  - Kaldi, CURRENNT
  - Older stuff: CMU-Sphinx, RWTH-ASR, HTK
- Free(-ish) datasets: http://www.openslr.org/resources.php
- Contact: [email protected] (Hiring Scientists)