Speech Recognition - SNS Courseware
CS8691-AI-UNIT-V-Speech Recognition
(Lecture transcript, 68 slides; posted 23-Apr-2022)

TRANSCRIPT

Page 1: Speech Recognition

Page 2: Outline

• Conventional Speech Recognition
• How to use Deep Learning in acoustic modeling?
• Why Deep Learning?
• Speaker Adaptation
• Multi-task Deep Learning
• New acoustic features
• Convolutional Neural Network (CNN)
• Applications in Acoustic Signal Processing

Page 3: Conventional Speech Recognition

Page 4: Machine Learning Helps

Speech recognition maps an audio signal X to a word sequence W, e.g., X → 天氣真好 ("The weather is really nice").

This is a structured learning problem.

Inference: W* = arg max_W P(W|X) = arg max_W P(X|W) P(W)

Page 5: Speech Recognition - SNS Courseware

5of 68

X:

Input Representation

• Audio is represented by a vector sequence

39 dim MFCC

……

……x1 x2 x3

X:

CS8691-AI-UNIT-V-Speech Recognition

Page 6: Input Representation - Splice

• To capture some temporal information, each MFCC frame is spliced with its neighboring frames: several consecutive frames are concatenated into one longer input vector.
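The splicing step above can be sketched as a small numpy routine. This is a minimal illustration, not the deck's actual code; the `splice` function name, the context width of 4, and the edge-padding choice are my assumptions.

```python
import numpy as np

def splice(frames, context=4):
    """Concatenate each frame with its +/- `context` neighbors (edges padded by repetition)."""
    T, D = frames.shape
    padded = np.pad(frames, ((context, context), (0, 0)), mode="edge")
    return np.stack([padded[t:t + 2 * context + 1].reshape(-1) for t in range(T)])

feats = np.random.randn(100, 39)      # 100 frames of 39-dim MFCC
spliced = splice(feats, context=4)    # each row now covers 9 consecutive frames
print(spliced.shape)                  # (100, 351) = 39 * (2*4 + 1)
```

Each spliced vector, rather than a single frame, then becomes the DNN input.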

Page 7: Phoneme

• Phoneme: the basic unit of pronunciation.
• Each word corresponds to a sequence of phonemes, defined by a lexicon:

  what do you think → hh w aa t d uw y uw th ih ng k

• Different words can correspond to the same phonemes (homophones), so the lexicon is needed to map between words and phoneme sequences.

Page 8: State

• Each phoneme corresponds to a sequence of states.

  Phone:     hh w aa t d uw y uw th ih ng k   (what do you think)
  Tri-phone: …… t-d+uw  d-uw+y  uw-y+uw  y-uw+th ……
  State:     t-d+uw1 t-d+uw2 t-d+uw3,  d-uw+y1 d-uw+y2 d-uw+y3, ……

A tri-phone is a phone together with its left and right context; each tri-phone is further divided into a short sequence of states.

Page 9: State

• Each state has a stationary distribution over acoustic features, e.g., P(x|"t-d+uw1") and P(x|"d-uw+y3"), conventionally modeled by a Gaussian Mixture Model (GMM).
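A state's GMM emission probability can be evaluated as below. This is a minimal sketch assuming diagonal covariances; the `gmm_loglik` function and the toy mixture parameters are mine, not from the slides.

```python
import numpy as np

def gmm_loglik(x, weights, means, variances):
    """log P(x | state) under a diagonal-covariance Gaussian mixture."""
    d = x.shape[0]
    # per-component log N(x; mean_k, diag(var_k))
    log_norm = -0.5 * (d * np.log(2 * np.pi) + np.sum(np.log(variances), axis=1))
    log_exp = -0.5 * np.sum((x - means) ** 2 / variances, axis=1)
    # log sum_k w_k N_k(x), computed stably in the log domain
    return np.logaddexp.reduce(np.log(weights) + log_norm + log_exp)

# toy 2-component mixture over 39-dim features
x = np.zeros(39)
w = np.array([0.6, 0.4])
mu = np.stack([np.zeros(39), np.ones(39)])
var = np.ones((2, 39))
ll = gmm_loglik(x, w, mu, var)
```

In the conventional system one such mixture is trained per (tied) state.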

Page 10: State

• There are too many tri-phone states to train each one reliably, so some states are tied: tied states share the same emission distribution P(x|state). In implementation, their pointers refer to the same address.

Page 11: Acoustic Model

  W: what do you think?
  S: a b c d e ……            (state sequence)
  X: x1 x2 x3 x4 x5 ……       (acoustic features, aligned with states s1 s2 s3 s4 s5)

P(X|W) = P(X|S), computed from the HMM transition probabilities between states and the emission probabilities P(x|s).

Page 12: Acoustic Model

Actually, we don't know the alignment between the acoustic features X and the states S in advance; the best alignment is found with the Viterbi algorithm.
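The Viterbi search mentioned above can be sketched in a few lines of numpy. This is an illustrative implementation under my own toy parameterization (log-domain emission, transition, and initial scores), not the system's actual decoder.

```python
import numpy as np

def viterbi(log_emit, log_trans, log_init):
    """Most likely state sequence for an HMM.
    log_emit: (T, S) log P(x_t | s); log_trans: (S, S); log_init: (S,)."""
    T, S = log_emit.shape
    delta = log_init + log_emit[0]          # best log-score ending in each state
    back = np.zeros((T, S), dtype=int)      # backpointers
    for t in range(1, T):
        scores = delta[:, None] + log_trans        # scores[prev, cur]
        back[t] = np.argmax(scores, axis=0)
        delta = scores[back[t], np.arange(S)] + log_emit[t]
    # trace the best path backwards
    path = [int(np.argmax(delta))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# toy 2-state example: emissions favor state 0, 0, then 1
le = np.log(np.array([[0.9, 0.1], [0.9, 0.1], [0.1, 0.9]]))
lt = np.log(np.full((2, 2), 0.5))
li = np.log(np.array([0.5, 0.5]))
best_path = viterbi(le, lt, li)     # [0, 0, 1]
```

The same dynamic program aligns frames to states during training and scores P(X|W) at recognition time.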

Page 13: How to use Deep Learning?

Page 14: People imagine ……

People imagine feeding the whole sequence of acoustic features into a DNN that directly outputs "Hello". This cannot be true: a DNN can only take fixed-length vectors as input, while utterances vary in length.

Page 15: What DNN can do is ……

• DNN input: one acoustic feature vector xi.
• DNN output: the probability of each state, P(a|xi), P(b|xi), P(c|xi), ……
• Size of the output layer = number of states.
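The per-frame state posteriors above come from a softmax output layer; a minimal sketch (the logit values are made up for illustration):

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the output-layer logits."""
    e = np.exp(logits - logits.max())
    return e / e.sum()

# output-layer logits for one frame over 5 (toy) states
posteriors = softmax(np.array([2.0, 1.0, 0.1, -1.0, 0.5]))
# posteriors[k] plays the role of P(state k | x_i)
```

One such distribution is produced for every frame xi.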

Page 16: Low Rank Approximation

The output-layer weight matrix W is M × N, where N is the size of the last hidden layer and M is the size of the output layer (the number of states). M can be very large when the outputs are tri-phone states.

Page 17: Low Rank Approximation

W (M × N) is approximated by the product of two smaller matrices, W ≈ U V, where U is M × K and V is K × N with K < M, N. The output layer is then replaced by a linear layer (V) followed by a smaller output layer (U), giving fewer parameters: K(M + N) instead of MN.
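The factorization above can be computed with a truncated SVD, which gives the best rank-K approximation. The sizes below (4000 states, 1024 hidden units, K = 128) are my illustrative choices, not figures from the slides.

```python
import numpy as np

# toy output layer: M tri-phone states, N hidden units, bottleneck K
M, N, K = 4000, 1024, 128
W = np.random.randn(M, N)

# truncated SVD: W ≈ U @ V is the best rank-K approximation
u, s, vt = np.linalg.svd(W, full_matrices=False)
U = u[:, :K] * s[:K]      # (M, K): new, smaller output layer
V = vt[:K]                # (K, N): inserted linear layer

print(K * (M + N) / (M * N))   # 0.157: roughly 6x fewer parameters
```

In practice the factors U and V are then fine-tuned with further training.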

Page 18: How we use deep learning

• There are three ways to use a DNN for acoustic modeling:
  – Way 1: Tandem
  – Way 2: DNN-HMM hybrid
  – Way 3: End-to-end

Page 19: How to use Deep Learning? Way 1: Tandem

Page 20: Way 1: Tandem System

The DNN is trained to output state posteriors P(a|xi), P(b|xi), P(c|xi), …… (output-layer size = number of states). These DNN outputs, instead of the raw acoustic features, become the input of your original speech recognition system. Using the last hidden layer or a bottleneck layer as the new features is also possible.

Page 21: How to use Deep Learning? Way 2: DNN-HMM hybrid

Page 22: Way 2: DNN-HMM Hybrid

In the hybrid system the state posteriors come from the DNN, while the state priors P(state) are counted from the training data.

Page 23: Way 2: DNN-HMM Hybrid

The HMM structure comes from the original system, the state priors are counted from the training data, and the DNN replaces the GMM as the emission model. This assembled vehicle works …….
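The conversion the hybrid system needs can be written in one line of Bayes' rule: the DNN gives P(s|x), but the HMM needs P(x|s), so the posterior is divided by the counted prior. A minimal sketch with made-up toy numbers:

```python
import numpy as np

# DNN posteriors for one frame, and state priors counted from the
# training-data alignments (toy numbers over 3 states)
posteriors = np.array([0.7, 0.2, 0.1])   # P(s | x) from the DNN softmax
priors = np.array([0.5, 0.3, 0.2])       # P(s) from training-data counts

# HMM emission score: P(x | s) ∝ P(s | x) / P(s)  (Bayes' rule, P(x) dropped)
scaled_loglik = np.log(posteriors) - np.log(priors)
```

This "scaled likelihood" then plays the role the GMM score played in the original HMM.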

Page 24: Way 2: DNN-HMM Hybrid

• Sequential training: instead of training frame by frame, the score of the correct word sequence over the whole utterance is increased while the scores of competing sequences are decreased.

Page 25: How to use Deep Learning? Way 3: End-to-end

Page 26: Way 3: End-to-end - Character

A. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates, A. Ng, "Deep Speech: Scaling up end-to-end speech recognition," arXiv:1412.5567v2, 2014.

• Input: acoustic features (spectrograms)
• Output: characters (and space), plus a null symbol (~)
• No phonemes and no lexicon are needed, so there is no OOV (out-of-vocabulary) problem.

Page 27: Way 3: End-to-end - Character

Graves, Alex, and Navdeep Jaitly. "Towards end-to-end speech recognition with recurrent neural networks." Proceedings of the 31st International Conference on Machine Learning (ICML-14), 2014.

Frame-level output example: most frames emit the null symbol (~), and the remaining frames spell out the text, e.g., HIS FRIEND'S.
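Turning the frame-level outputs above into text uses the standard CTC collapsing rule: merge repeated symbols, then remove the nulls. A minimal sketch (the example string is mine):

```python
def ctc_collapse(frame_labels, blank="~"):
    """Map frame-level CTC outputs to text: merge repeats, then drop blanks."""
    out, prev = [], None
    for c in frame_labels:
        if c != prev and c != blank:
            out.append(c)
        prev = c
    return "".join(out)

print(ctc_collapse("~~HH~II~SS~"))   # HIS
```

Note that a blank between two identical characters keeps them distinct, so genuinely doubled letters (as in FRIEND'S with its apostrophe neighbors) survive the collapse.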

Page 28: Way 3: End-to-end - Word?

Ref: Bengio, Samy, and Georg Heigold. "Word embeddings for speech recognition." Interspeech, 2014.

• The DNN output layer covers the ~50k words in the lexicon.
• The input layer spans a window about 200 times the size of one acoustic feature, padded with zeros when the word is shorter.
• Other systems are used to get the word boundaries.
• Output examples: apple, bog, cat, ……

Page 29: Why Deep Learning?

Page 30: Deeper is better?

• Word error rate (WER) drops as hidden layers are added: a network with multiple layers beats one with a single hidden layer. Deeper is better.

Seide, Frank, Gang Li, and Dong Yu. "Conversational Speech Transcription Using Context-Dependent Deep Neural Networks." Interspeech, 2011.

Page 31: Deeper is better?

• For a fixed number of parameters, a deep model is clearly better than a shallow one, so the gain is not just from having more parameters (same reference: Seide, Li, and Yu, Interspeech 2011).

Page 32: What does DNN do?

• Speaker normalization is automatically done in the DNN. (Figure: input acoustic features (MFCC) vs. activations of the 1st hidden layer.)

A. Mohamed, G. Hinton, and G. Penn, "Understanding how Deep Belief Networks Perform Acoustic Modelling," ICASSP, 2012.

Page 33: What does DNN do?

• Speaker normalization is automatically done in the DNN. (Figure: input acoustic features (MFCC) vs. activations of the 8th hidden layer, where speaker differences have largely disappeared; same reference as the previous slide.)

Page 34: What does DNN do?

• In ordinary acoustic models, all the states are modeled independently, which is not an effective way to model the human voice: the sound of a vowel is controlled by only a few factors, such as tongue height (high/low) and tongue position (front/back).

http://www.ipachart.com/

Page 35: Speech Recognition - SNS Courseware

35of 68

high

low

front back

/a/

/i/

/e/ /o/

/u/

Output of hidden layer reduce to two dimensions

The lower layers detect the manner of articulation

All the states share the results from the same set of detectors.

Use parameters effectively

What does DNN do?Vu, Ngoc Thang, JochenWeiner, and TanjaSchultz. "Investigating the Learning Effect of Multilingual Bottle-Neck Features for ASR." Interspeech. 2014.

CS8691-AI-UNIT-V-Speech Recognition

Page 36: Speaker Adaptation

Page 37: Speaker Adaptation

• Speaker adaptation: use different models to recognize the speech of different speakers.
  – Collect the audio data of each speaker.
• A DNN model for each speaker. Challenge: limited data for training.
  – Not enough data to directly train a DNN model.
  – Not enough data even to fine-tune a speaker-independent DNN model.

Page 38: Categories of Methods

• Conservative training: re-train the whole DNN with some constraints.
• Transformation methods: only train the parameters of one layer.
• Speaker-aware training: do not change the DNN parameters at all.

From top to bottom, these methods need less and less adaptation data.

Page 39: Conservative Training

A speaker-independent DNN is first trained on audio data from many speakers and then used as initialization; it is re-trained with a little data from the target speaker. Constraints keep the adapted model close to the original, either by keeping the outputs close or by keeping the parameters close.

Page 40: Transformation Methods

Add an extra layer Wa between layer i and layer i+1 of the network. When adapting to a new speaker, fix all the other parameters and train only the extra layer, using a little data from the target speaker.

Page 41: Transformation Methods

• The extra layer can be added between the input and the first layer.
• With splicing, the same extra layer Wa can be applied to each frame of the spliced window separately.
• A larger Wa needs more adaptation data; a smaller Wa works with less data.

Page 42: Transformation Methods

• SVD bottleneck adaptation: with the low-rank factorization of the output layer (U and V around a K-dimensional linear bottleneck), the adaptation layer Wa is inserted at the bottleneck. Because K is usually small, Wa has few parameters to train.

Page 43: Speaker-aware Training

From lots of (possibly mismatched) data of each speaker, extract a fixed-length, low-dimensional vector carrying the speaker information. Text transcription is not needed for extracting these vectors. The same idea can also be used for noise-aware or device-aware training, ……

Page 44: Speaker-aware Training

• Each acoustic feature is appended with the speaker-information feature of its speaker, both in training and in testing.
• All the speakers use the same DNN model; different speakers are distinguished only by the augmented features.
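The augmentation step above is just a per-frame concatenation. A minimal sketch; the function name, the 100-dimensional speaker vector (e.g., an i-vector), and the frame counts are my illustrative assumptions.

```python
import numpy as np

def speaker_aware_input(acoustic, speaker_vec):
    """Append the same fixed-length speaker vector to every acoustic frame."""
    tiled = np.tile(speaker_vec, (acoustic.shape[0], 1))
    return np.concatenate([acoustic, tiled], axis=1)

frames = np.random.randn(200, 39)    # 200 frames of 39-dim features
i_vector = np.random.randn(100)      # hypothetical 100-dim speaker vector
augmented = speaker_aware_input(frames, i_vector)
print(augmented.shape)               # (200, 139)
```

At test time the same DNN is used; only the appended vector changes from speaker to speaker.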

Page 45: Multi-task Learning

Page 46: Multitask Learning

• The multi-layer structure makes the DNN suitable for multitask learning: task A and task B can share lower layers from a common input feature, or two different input features (one for each task) can feed into shared upper layers.

Page 47: Multitask Learning - Multilingual

Shared hidden layers over the acoustic features feed separate output layers for the states of French, German, Spanish, Italian, and Mandarin, because human languages share some common characteristics.

Page 48: Multitask Learning - Multilingual

Huang, Jui-Ting, et al. "Cross-language knowledge transfer using multilingual deep neural network with shared hidden layers." ICASSP, 2013.

(Figure: character error rate vs. hours of Mandarin training data, from 1 to 1000 hours. Training with European languages in addition gives a lower error rate than Mandarin only, especially when the Mandarin data is scarce.)

Page 49: Multitask Learning - Different Units

With acoustic features as the shared input, tasks A and B can use different output units:
• A = state, B = phoneme
• A = state, B = gender
• A = state, B = grapheme (character)

Page 50: Deep Learning for Acoustic Modeling - New Acoustic Features

Page 51: MFCC

Waveform → DFT → spectrogram → filter bank → log → DCT → MFCC, which is the input of the DNN.

Page 52: Filter-bank Output

Waveform → DFT → spectrogram → filter bank → log, used directly as the input of the DNN (skipping the DCT). This is kind of standard now.
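The filter-bank stage in the pipeline above can be sketched with numpy: triangular filters spaced on the mel scale are applied to the power spectrum, then the log is taken. This is a common textbook construction, not the deck's code; the filter count (40), FFT size (512), and sample rate (16 kHz) are my assumptions.

```python
import numpy as np

def mel_filterbank(n_filters=40, n_fft=512, sr=16000):
    """Triangular mel filters to apply to a power spectrum (common sketch)."""
    mel = lambda f: 2595 * np.log10(1 + f / 700)          # Hz -> mel
    inv = lambda m: 700 * (10 ** (m / 2595) - 1)          # mel -> Hz
    pts = inv(np.linspace(mel(0), mel(sr / 2), n_filters + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)   # FFT bin of each point
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):                             # rising edge
            fb[i - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):                             # falling edge
            fb[i - 1, k] = (r - k) / max(r - c, 1)
    return fb

# one frame: waveform -> power spectrum -> log mel filter-bank features
spectrum = np.abs(np.fft.rfft(np.random.randn(512))) ** 2   # (257,)
log_fbank = np.log(mel_filterbank() @ spectrum + 1e-10)     # 40-dim DNN input
```

Each frame then contributes one 40-dimensional log filter-bank vector to the DNN input.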

Page 53: Spectrogram

Waveform → DFT → spectrogram as the input of the DNN (common today), reported to give about 5% relative improvement over the filter-bank output.

Ref: Sainath, T. N., Kingsbury, B., Mohamed, A. R., & Ramabhadran, B., "Learning filter banks within a deep neural network framework," ASRU, 2013.

Page 54: Spectrogram

Page 55: Waveform?

Using the raw waveform directly as the input of the DNN has been tried, but it is not better than the spectrogram yet. If it succeeds, hand-crafted signal processing will no longer be needed; for now, you still need to take Signals & Systems ……

Ref: Tüske, Z., Golik, P., Schlüter, R., & Ney, H., "Acoustic modeling with deep neural networks using raw time signal for LVCSR," INTERSPEECH, 2014.

Page 56: Waveform?

Page 57: Convolutional Neural Network (CNN)

Page 58: CNN

• Speech can be treated as images: the spectrogram has time on one axis and frequency on the other.

Page 59: CNN

Replace the DNN with a CNN: the spectrogram is treated as an image, and the CNN outputs the probabilities of the states.

Page 60: CNN

Max pooling in the CNN acts as a maxout operation: from the unit pairs (a1, a2) and (b1, b2), only the maximum of each pair is kept.
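The maxout pooling above can be sketched in a couple of lines; the group size of 2 and the unit values are my illustrative choices.

```python
import numpy as np

def maxout(z, group_size=2):
    """Maxout: keep only the max of each group of `group_size` units."""
    return z.reshape(-1, group_size).max(axis=1)

z = np.array([1.0, -2.0, 0.5, 3.0])   # units a1, a2, b1, b2
pooled = maxout(z)                    # [max(a1, a2), max(b1, b2)] = [1.0, 3.0]
```

Pooling over frequency in this way gives some invariance to small spectral shifts between speakers.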

Page 61: Speech Recognition - SNS Courseware

61of 68

CNNTóth, László. "Convolutional Deep MaxoutNetworks for Phone Recognition”, Interspeech, 2014.

CS8691-AI-UNIT-V-Speech Recognition

Page 62: Applications in Acoustic Signal Processing

Page 63: DNN for Speech Enhancement

A DNN maps noisy speech to clean speech, for mobile communication or as a front end for speech recognition.

Demo: http://home.ustc.edu.cn/~xuyong62/demo/SE_DNN.html

Page 64: DNN for Voice Conversion

A DNN converts a male voice into a female voice.

Demo: http://research.microsoft.com/en-us/projects/vcnn/default.aspx

Page 65: Concluding Remarks

Page 66: Concluding Remarks

• Conventional Speech Recognition
• How to use Deep Learning in acoustic modeling?
• Why Deep Learning?
• Speaker Adaptation
• Multi-task Deep Learning
• New acoustic features
• Convolutional Neural Network (CNN)
• Applications in Acoustic Signal Processing

Page 67: Thank you for your attention!

Page 68: More Research Related to Speech

Speech recognition is the core technique behind many other applications:
• Speech recognition: 你好 ("Hello") → text
• Speech summarization: lecture recordings → summary
• Computer-assisted language learning: compare the learner's "Hi" with the reference "Hello"
• Dialogue and information extraction: "I would like to leave Taipei on November 2nd"
• Spoken content retrieval: find the lectures related to "deep learning"