deep learning for speech recognition - vikrant singh tomar

Deep Learning for Speech Recognition

Vikrant Tomar

Founder, Fluent.ai

[email protected]

We are hiring!



Outline

- Introduction

- General overview of speech recognition framework

- Conventional GMM-HMM based systems

- Deep neural networks in speech

- ConvNets

- RNNs/LSTMs and End-to-end learning

- New interesting stuff


Intro 1: What is speech recognition?

- Dream: A machine should be able to develop a functional equivalent of the speaker’s intended message as effortlessly as humans can

- In other words: The goal is to find the most likely sequence of symbols, such as words or sub-word speech units, from a stream of acoustic data.


Intro 2: How is deep learning for speech different from deep learning for images?

- Speech is a temporal signal: the order of the sequence itself carries information

- One-dimensional signal that carries many kinds of information at once:

- Speaker

- Accent and language

- Age and health

- Environment

- Issues:

- Noise and background conditions

- Accents

- Recording devices


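Overview: Statistical Framework for speech recognition

- Formally, an ASR system maps the sequence of observation vectors, X, to the optimum sequence of words, Ŵ:

- Ŵ = argmax_W P(W | X) = argmax_W p(X | W) P(W), where p(X | W) is the acoustic model and P(W) is the language model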
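As a rough illustration of this mapping, the usual Bayes decision rule scores each candidate word sequence W by an acoustic log-likelihood plus a language-model log-prior and keeps the best one. The sketch below applies that rule to a tiny hand-written hypothesis list. The candidate sentences, all scores, the `lm_weight` parameter, and the `decode` helper are assumptions made up for illustration; a real decoder searches a far larger hypothesis space.

```python
import math

def decode(hypotheses, lm_weight=1.0):
    """Return the hypothesis maximizing log p(X|W) + lm_weight * log P(W).

    hypotheses: list of (words, acoustic_logp, lm_logp) tuples.
    """
    best, best_score = None, -math.inf
    for words, acoustic_logp, lm_logp in hypotheses:
        score = acoustic_logp + lm_weight * lm_logp
        if score > best_score:
            best, best_score = words, score
    return best, best_score

# Made-up scores for three candidate transcriptions of one utterance.
hypotheses = [
    ("recognize speech", -120.0, -4.0),
    ("wreck a nice beach", -118.5, -9.0),
    ("recognise peach", -125.0, -7.5),
]

best_words, best_score = decode(hypotheses)
print(best_words, best_score)
```

With the language model switched off (`lm_weight=0.0`), the acoustically best but less plausible candidate wins, which is why both terms are needed.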


Overview 2: System Architecture


System Architecture : Feature extraction & spectrogram
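One possible way to compute a spectrogram for this feature-extraction stage is a short-time Fourier transform over overlapping windowed frames. The sketch below uses NumPy only. The 25 ms frame length, 10 ms hop, Hamming window, and the `log_spectrogram` name are assumptions chosen for illustration, not settings taken from the slides.

```python
import numpy as np

def log_spectrogram(signal, sample_rate, frame_ms=25.0, hop_ms=10.0, eps=1e-10):
    """Log power spectrogram with shape (n_frames, n_frequency_bins)."""
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    hop_len = int(round(sample_rate * hop_ms / 1000.0))
    if len(signal) < frame_len:
        raise ValueError("signal is shorter than one frame")
    n_frames = 1 + (len(signal) - frame_len) // hop_len
    window = np.hamming(frame_len)
    # Slice the waveform into overlapping frames and taper each one.
    frames = np.stack([
        signal[i * hop_len : i * hop_len + frame_len] * window
        for i in range(n_frames)
    ])
    # Power spectrum of each frame; rfft keeps the non-negative frequencies.
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    return np.log(power + eps)

# Synthetic test input: one second of a 1 kHz tone sampled at 16 kHz.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
signal = np.sin(2 * np.pi * 1000.0 * t)
spec = log_spectrogram(signal, sample_rate)
print(spec.shape)
```

Each row is one time frame and each column one frequency bin; practical systems often apply a mel filterbank on top of this, which is omitted here.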


GMM-HMM based systems


Deep neural networks in speech

- A few different approaches

- Tandem

- Hybrid

- End-to-end

- Old but new


Tandem DNN: DNN -- GMM -- HMM


Hybrid DNN - HMM


- Good source: Hinton et al., Deep Neural Networks for Acoustic Modeling in Speech Recognition, 2012.
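In the hybrid setup described by Hinton et al., the DNN outputs posterior probabilities over HMM states, and these are divided by the state priors to give scaled likelihoods that the HMM decoder can use in place of GMM likelihoods. A minimal sketch of that conversion follows. The logits, the priors, and the function names are hypothetical values and helpers for illustration; in practice the priors are estimated from state frequencies in the training alignment.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def scaled_log_likelihoods(logits, state_priors):
    """log p(x|s) up to a constant: log P(s|x) - log P(s), per frame and state."""
    log_posteriors = np.log(softmax(logits))
    return log_posteriors - np.log(state_priors)

# Hypothetical DNN outputs for 2 frames and 3 HMM states.
logits = np.array([[2.0, 0.5, 0.1],
                   [0.2, 1.5, 0.3]])
# Hypothetical state priors; they must sum to 1.
state_priors = np.array([0.7, 0.2, 0.1])

scores = scaled_log_likelihoods(logits, state_priors)
print(scores)
```

Dividing by the prior stops frequent states from dominating the decode simply because the network saw them more often during training.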

Hybrid CNN - HMM


- Good source: Abdel-Hamid et al., Convolutional Neural Networks for Speech Recognition, 2014
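To give a feel for the CNN front end, the sketch below convolves a small filter along the frequency axis of a single feature frame and then max-pools the result, which is the basic operation in CNN acoustic models of this kind. The frame values, the filter, the pool size, and both helper names are made up for illustration; real models learn many filters and feed the pooled activations to further layers.

```python
import numpy as np

def conv_over_frequency(frame, kernel, bias=0.0):
    """Valid 1-D cross-correlation along frequency, followed by ReLU."""
    k = len(kernel)
    out = np.array([np.dot(frame[i:i + k], kernel)
                    for i in range(len(frame) - k + 1)])
    return np.maximum(out + bias, 0.0)

def max_pool(activations, pool_size):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    n = len(activations) // pool_size
    return activations[:n * pool_size].reshape(n, pool_size).max(axis=1)

# One frame of 10 frequency bands with a single spectral peak (made-up values).
frame = np.array([0.0, 0.0, 1.0, 3.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0])
# The same peak moved up by one band, e.g. a slightly different speaker.
shifted = np.roll(frame, 1)
kernel = np.array([1.0, 2.0, 1.0])

pooled = max_pool(conv_over_frequency(frame, kernel), pool_size=4)
pooled_shifted = max_pool(conv_over_frequency(shifted, kernel), pool_size=4)
print(pooled, pooled_shifted)
```

In this toy case the first pooled unit reports the same peak response before and after the one-band shift, which hints at why pooling over frequency helps with small spectral variations between speakers.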

Hybrid CNN - HMM -- Partial weight sharing


Some benchmarks
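Benchmarks for speech recognizers are commonly reported as word error rate (WER): the word-level edit distance between the reference and the recognizer output, divided by the number of reference words. That this slide's benchmarks use WER is an assumption. The sketch below computes it with a standard dynamic-programming edit distance; the example sentences and the `word_error_rate` name are made up for illustration.

```python
def word_error_rate(reference, hypothesis):
    """(substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the ref prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)

wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(round(wer, 3))
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.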


RNNs and End-to-end models

- RNN:

- Well suited to speech because they are sequence models

- However, plain RNNs struggle to capture long-term dependencies

- Vanishing gradients

- Solutions: LSTMs and GRUs

- End-to-end models have a simpler overall architecture

- CTC : Connectionist temporal classification

A. Graves and N. Jaitly, “Towards End-to-End Speech Recognition with Recurrent Neural Networks”, 2014
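With CTC, the network emits one label per frame, including a special blank, and the frame-level path is turned into a transcript by merging repeated labels and then dropping blanks. The sketch below shows the simplest (greedy, best-path) version of that decoding step. The alphabet, the frame probabilities, and the `ctc_greedy_decode` name are made up for illustration; real systems typically use beam search, often with a language model.

```python
import numpy as np

BLANK = 0  # index of the CTC blank label (an assumption of this sketch)

def ctc_greedy_decode(log_probs, alphabet):
    """Best-path CTC decoding: argmax per frame, merge repeats, drop blanks.

    log_probs: array of shape (n_frames, n_labels).
    """
    best_path = log_probs.argmax(axis=1)
    decoded = []
    previous = BLANK
    for label in best_path:
        # Emit a label only when it differs from the previous frame's label
        # and is not blank; a blank between two equal labels keeps both.
        if label != previous and label != BLANK:
            decoded.append(alphabet[label])
        previous = label
    return "".join(decoded)

alphabet = ["<blank>", "c", "a", "t"]
# Made-up frame-level path: c c _ a a _ _ t
path = [1, 1, 0, 2, 2, 0, 0, 3]
probs = np.full((len(path), len(alphabet)), 0.1)
probs[np.arange(len(path)), path] = 0.7
text = ctc_greedy_decode(np.log(probs), alphabet)
print(text)
```

The blank label is what lets CTC output genuine double letters: the path `a _ a` decodes to "aa", while `a a` collapses to a single "a".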


New interesting stuff

- Baidu Deep Speech: uses bidirectional RNNs to map audio directly to characters

- IBM 2015/2016 and Microsoft 2016: deep CNNs with 3 x 3 kernels, similar to VGG net

- CLDNN : Conv + LSTMs + Fully Connected

Baidu Lab: Deep Speech, 2014, and Deep Speech 2, 2015

Sainath et al., Convolutional, Long Short-Term Memory, Fully Connected Deep Neural Networks, 2015

Xiong et al., The Microsoft 2016 Conversational Speech Recognition System, 2016

Saon et al., The IBM 2015/16 English Conversational Telephone Speech Recognition System, 2015/16


Conclusion and resources

- Lots of exciting work; most concepts carry over from other deep learning communities

- Good starting point: http://www.recognize-speech.com

- You can use any toolbox you like to start:

- TensorFlow, Torch, Theano, etc.

- Kaldi, Currennt

- Older stuff: CMU-Sphinx, RWTH-ASR, HTK

- Free(-ish) datasets: http://www.openslr.org/resources.php

- Contact: [email protected] (Hiring Scientists)
