Speech Recognition
Outline
• Conventional Speech Recognition
• How to use Deep Learning in acoustic modeling?
• Why Deep Learning?
• Speaker Adaptation
• Multi-task Deep Learning
• New acoustic features
• Convolutional Neural Network (CNN)
• Applications in Acoustic Signal Processing
Conventional Speech Recognition
Machine Learning helps
Speech recognition maps an audio input X to a word sequence output, e.g., "The weather is really nice" (天氣真好).
This is a structured learning problem.
Inference:
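In conventional ASR, inference searches for the most likely word sequence given the acoustics; the standard decoding rule is:

$$W^{*} = \arg\max_{W} P(W \mid X) = \arg\max_{W} P(X \mid W)\,P(W)$$

where $P(X \mid W)$ is the acoustic model and $P(W)$ is the language model.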
Input Representation
• Audio is represented by a vector sequence, e.g., 39-dimensional MFCC features per frame:
X: x1 x2 x3 ……
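A minimal sketch of extracting 39-dimensional MFCC features (13 coefficients plus delta and delta-delta), assuming librosa; the file name, sampling rate, and frame settings are illustrative:

```python
import librosa
import numpy as np

# Load audio (file name is hypothetical); 16 kHz is a common ASR sampling rate.
y, sr = librosa.load("utterance.wav", sr=16000)

# 13 static MFCCs per frame (25 ms window, 10 ms hop are typical choices).
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=400, hop_length=160)

# Append delta and delta-delta to get 39 dimensions per frame.
delta = librosa.feature.delta(mfcc)
delta2 = librosa.feature.delta(mfcc, order=2)
X = np.vstack([mfcc, delta, delta2]).T   # shape: (num_frames, 39)
```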
Input Representation - Splice
• To take some temporal context into account, neighboring MFCC frames are spliced (concatenated) into one input vector, as in the sketch below.
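A minimal sketch of frame splicing with a context window of +/- 4 frames (the window size is an illustrative choice):

```python
import numpy as np

def splice(features, context=4):
    """Concatenate each frame with its +/- `context` neighbors.

    features: (num_frames, dim) array, e.g., 39-dim MFCC per frame.
    Returns an array of shape (num_frames, dim * (2*context + 1)).
    """
    num_frames, dim = features.shape
    # Pad by repeating the first/last frame at the edges.
    padded = np.concatenate([np.repeat(features[:1], context, axis=0),
                             features,
                             np.repeat(features[-1:], context, axis=0)], axis=0)
    spliced = [padded[t:t + 2 * context + 1].reshape(-1)
               for t in range(num_frames)]
    return np.stack(spliced)
```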
Phoneme
• Phoneme: the basic unit of pronunciation.
Each word corresponds to a sequence of phonemes, as given by a lexicon.
what do you think → hh w aa t d uw y uw th ih ng k
Different words can correspond to the same phonemes.
State
• Each phoneme corresponds to a sequence of states.
what do you think
Phone: hh w aa t d uw y uw th ih ng k
Tri-phone: …… t-d+uw d-uw+y uw-y+uw y-uw+th ……
State: …… t-d+uw1 t-d+uw2 t-d+uw3 d-uw+y1 d-uw+y2 d-uw+y3 ……
State
• Each state has a stationary distribution for acoustic features
P(x|"t-d+uw1") and P(x|"d-uw+y3") are each modeled by a Gaussian Mixture Model (GMM).
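A minimal sketch of fitting a per-state GMM and evaluating log P(x|state), assuming scikit-learn; the training frames here are random placeholders:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# frames_for_state: acoustic features aligned to one state (placeholder data);
# a small number of diagonal-covariance components is a typical choice.
frames_for_state = np.random.randn(500, 39)

gmm = GaussianMixture(n_components=8, covariance_type="diag")
gmm.fit(frames_for_state)

x = np.random.randn(1, 39)                       # one test frame (placeholder)
log_p_x_given_state = gmm.score_samples(x)[0]    # log P(x | state)
```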
State
• Each state has a stationary distribution for acoustic features
• Tied states: states such as "t-d+uw1" and "d-uw+y3" can share the same distribution; the pointers of both states refer to the same address.
Acoustic Model
W: what do you think?
S: a b c d e ……
X: x1 x2 x3 x4 x5 ……
Each acoustic feature is aligned to a state: s1 s2 s3 s4 s5 ……
P(X|W) = P(X|S), computed from the HMM transition and emission probabilities.
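Given a frame-level alignment s1 s2 … sT of the states to the features, the HMM factorizes this probability into transition and emission terms; the standard form is:

$$P(X \mid S) = \prod_{t=1}^{T} \underbrace{P(s_t \mid s_{t-1})}_{\text{transition}}\;\underbrace{P(x_t \mid s_t)}_{\text{emission}}$$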
Acoustic Model
W: what do you think?
S: a b c d e ……
X: x1 x2 x3 x4 x5 ……
P(X|W) = P(X|S)
Actually, we don't know the alignment s1 s2 s3 s4 s5 between features and states; the best alignment is found with the Viterbi algorithm.
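A minimal sketch of the Viterbi algorithm over log-probabilities; the transition and emission inputs are hypothetical placeholders, not values taken from these slides:

```python
import numpy as np

def viterbi(log_trans, log_emit):
    """Find the most likely state sequence.

    log_trans: (S, S) log transition probabilities log P(s_t | s_{t-1}).
    log_emit:  (T, S) log emission probabilities  log P(x_t | s_t).
    Returns the best state index for each of the T frames.
    """
    T, S = log_emit.shape
    score = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)
    score[0] = log_emit[0]                        # assume a uniform initial state
    for t in range(1, T):
        cand = score[t - 1][:, None] + log_trans  # cand[prev, cur]
        back[t] = cand.argmax(axis=0)
        score[t] = cand.max(axis=0) + log_emit[t]
    path = [int(score[-1].argmax())]
    for t in range(T - 1, 0, -1):                 # backtrack
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```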
How to use Deep Learning?
People imagine ……
The DNN takes the acoustic features of the whole utterance as input and directly outputs the text, e.g., "Hello".
This cannot be true! A DNN can only take fixed-length vectors as input.
What DNN can do is ……
DNN input: one acoustic feature xi
DNN output: the probability of each state, P(a|xi), P(b|xi), P(c|xi), ……
Size of the output layer = number of states
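A minimal sketch of such a frame-level state classifier, assuming PyTorch; the layer sizes, input dimension, and number of states are illustrative assumptions:

```python
import torch
import torch.nn as nn

num_states = 3000          # illustrative number of tied tri-phone states
input_dim = 39 * 9         # e.g., 39-dim MFCC spliced over +/- 4 frames

# Feed-forward DNN: one spliced acoustic feature in, state posteriors out.
dnn = nn.Sequential(
    nn.Linear(input_dim, 1024), nn.ReLU(),
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, num_states),   # logits; softmax gives P(state | x_i)
)

x_i = torch.randn(1, input_dim)              # one spliced frame (placeholder)
state_posteriors = dnn(x_i).softmax(dim=-1)  # P(a|x_i), P(b|x_i), ...
```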
Low rank approximation
The weight matrix W between the last hidden layer and the output layer has size M × N, where N is the size of the last hidden layer and M is the size of the output layer, i.e., the number of states. M can be large if the outputs are the states of tri-phones.
Low rank approximation
W (M × N) can be approximated as the product of two smaller matrices: W ≈ U V, where U is M × K, V is K × N, and K < M, N.
In the network, the output layer is replaced by a linear layer V followed by a layer with weights U. This uses fewer parameters: K × (M + N) instead of M × N.
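A minimal sketch of obtaining U and V from a trained W with a truncated SVD; the sizes and the rank K are illustrative:

```python
import numpy as np

M, N, K = 3000, 1024, 256          # illustrative sizes; K < M, N
W = np.random.randn(M, N)          # trained output-layer weights (placeholder)

# Truncated SVD: keep the K largest singular values.
Uf, s, Vt = np.linalg.svd(W, full_matrices=False)
U = Uf[:, :K] * s[:K]              # M x K
V = Vt[:K, :]                      # K x N

W_approx = U @ V                   # M x N, but only K*(M+N) parameters
print(np.linalg.norm(W - W_approx) / np.linalg.norm(W))
```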
How we use deep learning
• There are three ways to use a DNN for acoustic modeling, in order of increasing effort in exploiting deep learning:
– Way 1. Tandem
– Way 2. DNN-HMM hybrid
– Way 3. End-to-end
How to use Deep Learning?
Way 1: Tandem
Way 1: Tandem system
The DNN takes an acoustic feature xi as input and outputs the state posteriors P(a|xi), P(b|xi), P(c|xi), …… (size of the output layer = number of states).
These outputs are used as the input of your original speech recognition system.
Using the last hidden layer or a bottleneck layer instead of the output layer is also possible.
How to use Deep Learning?
Way 2: DNN-HMM hybrid
Way 2: DNN-HMM Hybrid
The state posterior P(state|x) comes from the DNN; the state prior P(state) is counted from the training data.
Way 2: DNN-HMM Hybrid
The state prior is counted from the training data, the transition probabilities come from the original HMM, and the state posterior comes from the DNN.
This assembled vehicle works …….
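By Bayes' rule, the DNN posterior can be converted into the emission term the HMM needs; the standard derivation is:

$$P(x \mid s) = \frac{P(s \mid x)\,P(x)}{P(s)} \propto \frac{P(s \mid x)}{P(s)}$$

where $P(s \mid x)$ comes from the DNN, $P(s)$ is counted from the training data, and $P(x)$ is the same for all states, so it can be dropped during decoding.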
Way 2: DNN-HMM Hybrid
• Sequential training: adjust the DNN so that the score of the correct (reference) state sequence increases while the scores of competing sequences decrease.
How to use Deep Learning?
Way 3: End-to-end
Way 3: End-to-end - Character
A. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates, A. Ng, "Deep Speech: Scaling up end-to-end speech recognition," arXiv:1412.5567v2, 2014.
Input: acoustic features (spectrograms)
Output: characters (and space), plus a null symbol (~)
No phonemes and no lexicon (no OOV problem)
Way 3: End-to-end - Character
Graves, Alex, and Navdeep Jaitly. "Towards end-to-end speech recognition with recurrent neural networks." Proceedings of the 31st International Conference on Machine Learning (ICML-14). 2014.
[Figure: the network's output over time for an utterance containing "HIS FRIEND'S"; most frames emit the null symbol (~), with the characters emitted in between]
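These character-level end-to-end models are trained with the CTC objective, which sums over all alignments of the characters (interleaved with nulls) to the frames. A minimal sketch using PyTorch's built-in CTC loss; all sizes and the placeholder target are illustrative assumptions:

```python
import torch
import torch.nn as nn

T, B, C = 100, 1, 29      # frames, batch size, output symbols (28 chars + null)
log_probs = torch.randn(T, B, C, requires_grad=True).log_softmax(dim=-1)

targets = torch.randint(1, C, (B, 12))   # character indices (placeholder target)
input_lengths = torch.full((B,), T, dtype=torch.long)
target_lengths = torch.full((B,), 12, dtype=torch.long)

ctc = nn.CTCLoss(blank=0)                # index 0 is the null (~) symbol
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()                          # gradients flow back into the network
```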
Way 3: End-to-end - Word?
Ref: Bengio, Samy, and Georg Heigold. "Word embeddings for speech recognition," Interspeech. 2014.
DNN output layer: the ~50k words in the lexicon (apple, bog, cat, ……)
DNN input layer: size = 200 times the size of one acoustic feature vector; shorter segments are padded with zeros.
Other systems are used to get the word boundaries.
Why Deep Learning?
Deeper is better?
• Word error rate (WER)
Seide, Frank, Gang Li, and Dong Yu. "Conversational Speech Transcription Using Context-Dependent Deep Neural Networks." Interspeech. 2011.
1 hidden layer vs. multiple layers: deeper is better.
Deeper is better?
• Word error rate (WER)
Seide, Frank, Gang Li, and Dong Yu. "Conversational Speech Transcription Using Context-Dependent Deep Neural Networks." Interspeech. 2011.
1 hidden layer vs. multiple layers:
For a fixed number of parameters, a deep model is clearly better than the shallow one.
Deeper is Better
What does DNN do?
• Speaker normalization is automatically done in the DNN.
A. Mohamed, G. Hinton, and G. Penn, "Understanding how Deep Belief Networks Perform Acoustic Modelling," in ICASSP, 2012.
[Figure: input acoustic features (MFCC) compared with the output of the 1st hidden layer]
What does DNN do?
• Speaker normalization is automatically done in the DNN.
[Figure: input acoustic features (MFCC) compared with the output of the 8th hidden layer]
A. Mohamed, G. Hinton, and G. Penn, “Understanding how Deep Belief Networks Perform Acoustic Modelling,” in ICASSP, 2012.
What does DNN do?
• In ordinary acoustic models, all the states are modeled independently - not an effective way to model the human voice.
The sound of a vowel is controlled by only a few factors, e.g., tongue height (high/low) and position (front/back); see the vowel chart at http://www.ipachart.com/.
What does DNN do?
Vu, Ngoc Thang, Jochen Weiner, and Tanja Schultz. "Investigating the Learning Effect of Multilingual Bottle-Neck Features for ASR." Interspeech. 2014.
[Figure: hidden-layer output reduced to two dimensions; the vowels /a/, /i/, /u/, /e/, /o/ are arranged by tongue height (high/low) and position (front/back)]
The lower layers detect the manner of articulation, and all the states share the results from the same set of detectors. This uses the parameters effectively.
Speaker Adaptation
Speaker Adaptation
• Speaker adaptation: use different models to recognize the speech of different speakers
– Collect the audio data of each speaker
• A DNN model for each speaker
– Challenge: limited data for training
• Not enough data to directly train a DNN model
• Not enough data even to fine-tune a speaker-independent DNN model
Categories of Methods
• Conservative training: re-train the whole DNN with some constraints
• Transformation methods: only train the parameters of one layer
• Speaker-aware training: do not really change the DNN parameters
Going down the list, the methods need less and less training data.
Conservative Training
A speaker-independent DNN is trained on audio data from many speakers. It serves as the initialization for the speaker-dependent DNN, which is then trained on a little data from the target speaker, while keeping its outputs and/or its parameters close to those of the initial network.
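One way to realize the "parameter close" constraint is an L2 penalty pulling the adapted weights toward the speaker-independent ones; a minimal sketch assuming PyTorch, with an illustrative penalty weight and hypothetical names:

```python
import copy
import torch

def conservative_loss(model, si_model, ce_loss, reg_weight=0.01):
    """Cross-entropy on the target speaker's data plus an L2 penalty that
    keeps the adapted parameters close to the speaker-independent ones."""
    penalty = sum(((p - p_si) ** 2).sum()
                  for p, p_si in zip(model.parameters(), si_model.parameters()))
    return ce_loss + reg_weight * penalty

# Usage sketch: si_model is the trained speaker-independent DNN (kept frozen).
# model = copy.deepcopy(si_model)   # initialize the adapted model from it
# loss = conservative_loss(model, si_model, ce_loss=frame_ce, reg_weight=0.01)
```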
Transformation methods
Add an extra layer: a layer with weights Wa is inserted between layer i and layer i+1 of the trained DNN. Only Wa is trained on a little data from the target speaker; all the other parameters are fixed.
Transformation methods
• Add the extra layer between the input and the first layer.
• With splicing, either one large Wa is applied to the whole spliced input, or a smaller Wa is shared and applied to each frame of the splice. A larger Wa needs more adaptation data; a smaller Wa needs less.
Transformation methods
• SVD bottleneck adaptation
With the low-rank output layer W ≈ U V (U is M × K, V is K × N, connected through a linear bottleneck of size K), the adaptation matrix Wa is inserted at the bottleneck. K is usually small, so Wa has few parameters to train.
Speaker-aware Training
From lots of mismatched data (data of speaker 1, speaker 2, speaker 3, ……), a fixed-length, low-dimensional vector of speaker information is extracted for each speaker. Text transcriptions are not needed to extract the vectors. The same idea also applies to noise-aware or device-aware training, ……
Speaker-aware Training
In both training and testing, the acoustic features are appended with speaker-information features before being fed to the network. All the speakers use the same DNN model; different speakers are simply augmented with different features.
Multi-task Learning
Multitask Learning
• The multi-layer structure makes the DNN suitable for multitask learning.
One network can share its lower layers and branch into separate outputs for Task A and Task B, or Task A and Task B can take separate input features and share only some middle layers. A sketch of the first structure follows.
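A minimal sketch of the shared-trunk, two-head structure, assuming PyTorch; the layer sizes and the two tasks chosen here (states and phonemes) are illustrative:

```python
import torch
import torch.nn as nn

class MultiTaskDNN(nn.Module):
    """Shared lower layers; one output head per task."""
    def __init__(self, input_dim=351, num_states=3000, num_phonemes=48):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(input_dim, 1024), nn.ReLU(),
            nn.Linear(1024, 1024), nn.ReLU(),
        )
        self.head_a = nn.Linear(1024, num_states)     # Task A: states
        self.head_b = nn.Linear(1024, num_phonemes)   # Task B: phonemes

    def forward(self, x):
        h = self.shared(x)
        return self.head_a(h), self.head_b(h)

model = MultiTaskDNN()
x = torch.randn(8, 351)                    # a batch of spliced acoustic features
logits_a, logits_b = model(x)
# Training would sum the two task losses, e.g., CE over states + CE over phonemes.
```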
Multitask Learning - Multilingual
The acoustic features are fed to shared lower layers, with separate output layers for the states of French, German, Spanish, Italian, Mandarin, ……
Human languages share some common characteristics.
Multitask Learning -Multilingual
Huang, Jui-Ting, et al. "Cross-language knowledge transfer using multilingual deep neural network with shared hidden layers."Acoustics, Speech and Signal Processing (ICASSP), 2013
[Figure: character error rate (25-50%) versus hours of training data for Mandarin (1 to 1000, log scale), comparing "Mandarin only" with "With European Language"]
Multitask Learning - Different units
The network takes acoustic features as input and predicts two kinds of units A and B at the same time, e.g.:
A = state, B = phoneme
A = state, B = gender
A = state, B = grapheme (character)
Deep Learning for Acoustic Modeling
New acoustic features
MFCC
Waveform → DFT → spectrogram → filter bank → log → DCT → MFCC → input of the DNN
Filter-bank Output
Waveform → DFT → spectrogram → filter bank → log → input of the DNN
Using the log filter-bank outputs (dropping the DCT) is kind of standard now.
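A minimal sketch of computing log mel filter-bank features, assuming librosa; the file name, number of mel bands, and frame settings are illustrative:

```python
import librosa
import numpy as np

y, sr = librosa.load("utterance.wav", sr=16000)   # hypothetical file name

# Mel filter bank applied to the power spectrogram (DFT -> filter bank).
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=400,
                                     hop_length=160, n_mels=40)
log_fbank = np.log(mel + 1e-10).T    # (num_frames, 40), input of the DNN
```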
Spectrogram
Waveform → DFT → spectrogram → input of the DNN
About 5% relative improvement over the filter-bank output, and common today.
Ref: Sainath, T. N., Kingsbury, B., Mohamed, A. R., & Ramabhadran, B., "Learning filter banks within a deep neural network framework," In Automatic Speech Recognition and Understanding (ASRU), 2013.
Spectrogram
Waveform?
Waveform → input of the DNN directly
People have tried, but it is not better than the spectrogram yet.
Ref: Tüske, Z., Golik, P., Schlüter, R., & Ney, H., "Acoustic modeling with deep neural networks using raw time signal for LVCSR," In INTERSPEECH 2014.
If this succeeds, there is no need for Signal & Systems; for now, you still need to take Signal & Systems ……
Waveform?
Convolutional Neural Network (CNN)
CNN
• Speech can be treated as images
The spectrogram is an image, with time on one axis and frequency on the other.
CNN
Replace the DNN with a CNN: the spectrogram image is the input of the CNN, and the probabilities of the states are the output.
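A minimal sketch of a CNN acoustic model over a spectrogram patch, assuming PyTorch; the patch size, filter sizes, pooling, and number of states are illustrative assumptions:

```python
import torch
import torch.nn as nn

num_states = 3000   # illustrative number of states

# Input: 1-channel spectrogram patch, 40 frequency bins x 11 frames.
cnn = nn.Sequential(
    nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(kernel_size=(2, 1)),            # pool along frequency only
    nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(kernel_size=(2, 1)),
    nn.Flatten(),
    nn.Linear(64 * 10 * 11, 1024), nn.ReLU(),
    nn.Linear(1024, num_states),                  # state logits
)

patch = torch.randn(8, 1, 40, 11)                # batch of spectrogram patches
state_posteriors = cnn(patch).softmax(dim=-1)
```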
CNN
Max pooling / Maxout: pairs of outputs are reduced by taking the maximum, e.g., max(a1, a2) and max(b1, b2).
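A minimal sketch of a maxout unit (each output is the max over a group of linear units), assuming PyTorch; the sizes are illustrative:

```python
import torch
import torch.nn as nn

class Maxout(nn.Module):
    """Linear layer whose outputs are grouped; each group is reduced by max."""
    def __init__(self, in_dim, out_dim, pieces=2):
        super().__init__()
        self.out_dim, self.pieces = out_dim, pieces
        self.linear = nn.Linear(in_dim, out_dim * pieces)

    def forward(self, x):
        z = self.linear(x)                              # (batch, out_dim * pieces)
        z = z.view(-1, self.out_dim, self.pieces)       # group the outputs
        return z.max(dim=-1).values                     # max over each group

layer = Maxout(in_dim=351, out_dim=512, pieces=2)
y = layer(torch.randn(8, 351))                          # shape (8, 512)
```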
CNN
Tóth, László. "Convolutional Deep Maxout Networks for Phone Recognition," Interspeech, 2014.
Applications in Acoustic Signal Processing
DNN for Speech Enhancement
Demo for speech enhancement: http://home.ustc.edu.cn/~xuyong62/demo/SE_DNN.html
A DNN maps noisy speech to clean speech, for mobile communication or speech recognition.
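A minimal sketch of DNN-based enhancement as frame-wise regression from noisy to clean spectra, assuming PyTorch; the feature dimension and layer sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn

dim = 257   # e.g., log-magnitude spectrum bins per frame (illustrative)

# DNN regression: noisy spectrum frame in, enhanced (clean) frame out.
enhancer = nn.Sequential(
    nn.Linear(dim, 1024), nn.ReLU(),
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, dim),
)

noisy = torch.randn(32, dim)     # placeholder noisy frames
clean = torch.randn(32, dim)     # placeholder clean targets
loss = nn.functional.mse_loss(enhancer(noisy), clean)
loss.backward()
```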
DNN for Voice Conversion
Demo for Voice Conversion: http://research.microsoft.com/en-us/projects/vcnn/default.aspx
A DNN converts a male voice into a female voice.
Concluding Remarks
Concluding Remarks
• Conventional Speech Recognition
• How to use Deep Learning in acoustic modeling?
• Why Deep Learning?
• Speaker Adaptation
• Multi-task Deep Learning
• New acoustic features
• Convolutional Neural Network (CNN)
• Applications in Acoustic Signal Processing
Thank you for your attention!
More Research Related to Speech
Speech recognition (e.g., turning the audio "你好" ("Hello") into text) is one of the core techniques. Research built on it includes:
• Speech Summarization: lecture recordings → summary
• Spoken Content Retrieval: "Find the lectures related to 'deep learning'"
• Dialogue: e.g., "Hi" / "Hello"
• Information Extraction: e.g., "I would like to leave Taipei on November 2nd"
• Computer Assisted Language Learning