Speech Recognition
Outline
• Conventional Speech Recognition
• How to use Deep Learning in acoustic modeling?
• Why Deep Learning?
• Speaker Adaptation
• Multi-task Deep Learning
• New acoustic features
• Convolutional Neural Network (CNN)
• Applications in Acoustic Signal Processing
Conventional Speech Recognition
Machine Learning helps
Speech recognition maps an audio input X to a word sequence output, e.g., "The weather is really nice" (天氣真好).
This is a structured learning problem.
Inference:
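In conventional ASR, inference searches for the most likely word sequence given the acoustics; the standard decoding rule is:

$$W^{*} = \arg\max_{W} P(W \mid X) = \arg\max_{W} P(X \mid W)\,P(W)$$

where $P(X \mid W)$ is the acoustic model and $P(W)$ is the language model.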
Input Representation
• Audio is represented by a vector sequence, e.g., 39-dimensional MFCC features per frame:
X: x1 x2 x3 ……
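A minimal sketch of extracting 39-dimensional MFCC features (13 coefficients plus delta and delta-delta), assuming librosa; the file name, sampling rate, and frame settings are illustrative:

```python
import librosa
import numpy as np

# Load audio (file name is hypothetical); 16 kHz is a common ASR sampling rate.
y, sr = librosa.load("utterance.wav", sr=16000)

# 13 static MFCCs per frame (25 ms window, 10 ms hop are typical choices).
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=400, hop_length=160)

# Append delta and delta-delta to get 39 dimensions per frame.
delta = librosa.feature.delta(mfcc)
delta2 = librosa.feature.delta(mfcc, order=2)
X = np.vstack([mfcc, delta, delta2]).T   # shape: (num_frames, 39)
```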
Input Representation - Splice
• To take some temporal context into account, neighboring MFCC frames are spliced (concatenated) into one input vector, as in the sketch below.
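A minimal sketch of frame splicing with a context window of +/- 4 frames (the window size is an illustrative choice):

```python
import numpy as np

def splice(features, context=4):
    """Concatenate each frame with its +/- `context` neighbors.

    features: (num_frames, dim) array, e.g., 39-dim MFCC per frame.
    Returns an array of shape (num_frames, dim * (2*context + 1)).
    """
    num_frames, dim = features.shape
    # Pad by repeating the first/last frame at the edges.
    padded = np.concatenate([np.repeat(features[:1], context, axis=0),
                             features,
                             np.repeat(features[-1:], context, axis=0)], axis=0)
    spliced = [padded[t:t + 2 * context + 1].reshape(-1)
               for t in range(num_frames)]
    return np.stack(spliced)
```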
Phoneme
• Phoneme: the basic unit of pronunciation.
Each word corresponds to a sequence of phonemes, as given by a lexicon.
what do you think → hh w aa t d uw y uw th ih ng k
Different words can correspond to the same phonemes.
State
• Each phoneme corresponds to a sequence of states.
what do you think
Phone: hh w aa t d uw y uw th ih ng k
Tri-phone: …… t-d+uw d-uw+y uw-y+uw y-uw+th ……
State: …… t-d+uw1 t-d+uw2 t-d+uw3 d-uw+y1 d-uw+y2 d-uw+y3 ……
State
• Each state has a stationary distribution for acoustic features
P(x|"t-d+uw1") and P(x|"d-uw+y3") are each modeled by a Gaussian Mixture Model (GMM).
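A minimal sketch of fitting a per-state GMM and evaluating log P(x|state), assuming scikit-learn; the training frames here are random placeholders:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# frames_for_state: acoustic features aligned to one state (placeholder data);
# a small number of diagonal-covariance components is a typical choice.
frames_for_state = np.random.randn(500, 39)

gmm = GaussianMixture(n_components=8, covariance_type="diag")
gmm.fit(frames_for_state)

x = np.random.randn(1, 39)                       # one test frame (placeholder)
log_p_x_given_state = gmm.score_samples(x)[0]    # log P(x | state)
```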
State
• Each state has a stationary distribution for acoustic features
• Tied states: states such as "t-d+uw1" and "d-uw+y3" can share the same distribution; the pointers of both states refer to the same address.
Acoustic Model
W: what do you think?
S: a b c d e ……
X: x1 x2 x3 x4 x5 ……
Each acoustic feature is aligned to a state: s1 s2 s3 s4 s5 ……
P(X|W) = P(X|S), computed from the HMM transition and emission probabilities.
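Given a frame-level alignment s1 s2 … sT of the states to the features, the HMM factorizes this probability into transition and emission terms; the standard form is:

$$P(X \mid S) = \prod_{t=1}^{T} \underbrace{P(s_t \mid s_{t-1})}_{\text{transition}}\;\underbrace{P(x_t \mid s_t)}_{\text{emission}}$$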
Acoustic Model
W: what do you think?
S: a b c d e ……
X: x1 x2 x3 x4 x5 ……
P(X|W) = P(X|S)
Actually, we don't know the alignment s1 s2 s3 s4 s5 between features and states; the best alignment is found with the Viterbi algorithm.
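A minimal sketch of the Viterbi algorithm over log-probabilities; the transition and emission inputs are hypothetical placeholders, not values taken from these slides:

```python
import numpy as np

def viterbi(log_trans, log_emit):
    """Find the most likely state sequence.

    log_trans: (S, S) log transition probabilities log P(s_t | s_{t-1}).
    log_emit:  (T, S) log emission probabilities  log P(x_t | s_t).
    Returns the best state index for each of the T frames.
    """
    T, S = log_emit.shape
    score = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)
    score[0] = log_emit[0]                        # assume a uniform initial state
    for t in range(1, T):
        cand = score[t - 1][:, None] + log_trans  # cand[prev, cur]
        back[t] = cand.argmax(axis=0)
        score[t] = cand.max(axis=0) + log_emit[t]
    path = [int(score[-1].argmax())]
    for t in range(T - 1, 0, -1):                 # backtrack
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```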
How to use Deep Learning?
People imagine ……
The DNN takes the acoustic features of the whole utterance as input and directly outputs the text, e.g., "Hello".
This cannot be true! A DNN can only take fixed-length vectors as input.
What DNN can do is ……
DNN input: one acoustic feature xi
DNN output: the probability of each state, P(a|xi), P(b|xi), P(c|xi), ……
Size of the output layer = number of states
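A minimal sketch of such a frame-level state classifier, assuming PyTorch; the layer sizes, input dimension, and number of states are illustrative assumptions:

```python
import torch
import torch.nn as nn

num_states = 3000          # illustrative number of tied tri-phone states
input_dim = 39 * 9         # e.g., 39-dim MFCC spliced over +/- 4 frames

# Feed-forward DNN: one spliced acoustic feature in, state posteriors out.
dnn = nn.Sequential(
    nn.Linear(input_dim, 1024), nn.ReLU(),
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, num_states),   # logits; softmax gives P(state | x_i)
)

x_i = torch.randn(1, input_dim)              # one spliced frame (placeholder)
state_posteriors = dnn(x_i).softmax(dim=-1)  # P(a|x_i), P(b|x_i), ...
```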
Low rank approximation
The weight matrix W between the last hidden layer and the output layer has size M × N, where N is the size of the last hidden layer and M is the size of the output layer, i.e., the number of states. M can be large if the outputs are the states of tri-phones.
Low rank approximation
W (M × N) can be approximated as the product of two smaller matrices: W ≈ U V, where U is M × K, V is K × N, and K < M, N.
In the network, the output layer is replaced by a linear layer V followed by a layer with weights U. This uses fewer parameters: K × (M + N) instead of M × N.
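A minimal sketch of obtaining U and V from a trained W with a truncated SVD; the sizes and the rank K are illustrative:

```python
import numpy as np

M, N, K = 3000, 1024, 256          # illustrative sizes; K < M, N
W = np.random.randn(M, N)          # trained output-layer weights (placeholder)

# Truncated SVD: keep the K largest singular values.
Uf, s, Vt = np.linalg.svd(W, full_matrices=False)
U = Uf[:, :K] * s[:K]              # M x K
V = Vt[:K, :]                      # K x N

W_approx = U @ V                   # M x N, but only K*(M+N) parameters
print(np.linalg.norm(W - W_approx) / np.linalg.norm(W))
```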
How we use deep learning
• There are three ways to use a DNN for acoustic modeling, in order of increasing effort in exploiting deep learning:
– Way 1. Tandem
– Way 2. DNN-HMM hybrid
– Way 3. End-to-end
How to use Deep Learning?
Way 1: Tandem
Way 1: Tandem system
The DNN takes an acoustic feature xi as input and outputs the state posteriors P(a|xi), P(b|xi), P(c|xi), …… (size of the output layer = number of states).
These outputs are used as the input of your original speech recognition system.
Using the last hidden layer or a bottleneck layer instead of the output layer is also possible.
How to use Deep Learning?
Way 2: DNN-HMM hybrid
Way 2: DNN-HMM Hybrid
The state posterior P(state|x) comes from the DNN; the state prior P(state) is counted from the training data.
Way 2: DNN-HMM Hybrid
The state prior is counted from the training data, the transition probabilities come from the original HMM, and the state posterior comes from the DNN.
This assembled vehicle works …….
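By Bayes' rule, the DNN posterior can be converted into the emission term the HMM needs; the standard derivation is:

$$P(x \mid s) = \frac{P(s \mid x)\,P(x)}{P(s)} \propto \frac{P(s \mid x)}{P(s)}$$

where $P(s \mid x)$ comes from the DNN, $P(s)$ is counted from the training data, and $P(x)$ is the same for all states, so it can be dropped during decoding.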
Way 2: DNN-HMM Hybrid
• Sequential training: adjust the DNN so that the score of the correct (reference) state sequence increases while the scores of competing sequences decrease.
How to use Deep Learning?
Way 3: End-to-end
Way 3: End-to-end - Character
A. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates, A. Ng, "Deep Speech: Scaling up end-to-end speech recognition," arXiv:1412.5567v2, 2014.
Input: acoustic features (spectrograms)
Output: characters (and space), plus a null symbol (~)
No phonemes and no lexicon (no OOV problem)
Way 3: End-to-end - Character
Graves, Alex, and Navdeep Jaitly. "Towards end-to-end speech recognition with recurrent neural networks." Proceedings of the 31st International Conference on Machine Learning (ICML-14). 2014.
[Figure: the network's output over time for an utterance containing "HIS FRIEND'S"; most frames emit the null symbol (~), with the characters emitted in between]
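These character-level end-to-end models are trained with the CTC objective, which sums over all alignments of the characters (interleaved with nulls) to the frames. A minimal sketch using PyTorch's built-in CTC loss; all sizes and the placeholder target are illustrative assumptions:

```python
import torch
import torch.nn as nn

T, B, C = 100, 1, 29      # frames, batch size, output symbols (28 chars + null)
log_probs = torch.randn(T, B, C, requires_grad=True).log_softmax(dim=-1)

targets = torch.randint(1, C, (B, 12))   # character indices (placeholder target)
input_lengths = torch.full((B,), T, dtype=torch.long)
target_lengths = torch.full((B,), 12, dtype=torch.long)

ctc = nn.CTCLoss(blank=0)                # index 0 is the null (~) symbol
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()                          # gradients flow back into the network
```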
Way 3: End-to-end - Word?
Ref: Bengio, Samy, and Georg Heigold. "Word embeddings for speech recognition," Interspeech. 2014.
DNN output layer: the ~50k words in the lexicon (apple, bog, cat, ……)
DNN input layer: size = 200 times the size of one acoustic feature vector; shorter segments are padded with zeros.
Other systems are used to get the word boundaries.
Why Deep Learning?
Deeper is better?
• Word error rate (WER)
Seide, Frank, Gang Li, and Dong Yu. "Conversational Speech Transcription Using Context-Dependent Deep Neural Networks." Interspeech. 2011.
1 hidden layer vs. multiple layers: deeper is better.
Deeper is better?
• Word error rate (WER)
Seide, Frank, Gang Li, and Dong Yu. "Conversational Speech Transcription Using Context-Dependent Deep Neural Networks." Interspeech. 2011.
1 hidden layer vs. multiple layers:
For a fixed number of parameters, a deep model is clearly better than the shallow one.
Deeper is Better
What does DNN do?
• Speaker normalization is automatically done in the DNN.
A. Mohamed, G. Hinton, and G. Penn, "Understanding how Deep Belief Networks Perform Acoustic Modelling," in ICASSP, 2012.
[Figure: input acoustic features (MFCC) compared with the output of the 1st hidden layer]
What does DNN do?
• Speaker normalization is automatically done in the DNN.
[Figure: input acoustic features (MFCC) compared with the output of the 8th hidden layer]
A. Mohamed, G. Hinton, and G. Penn, “Understanding how Deep Belief Networks Perform Acoustic Modelling,” in ICASSP, 2012.
What does DNN do?
• In ordinary acoustic models, all the states are modeled independently - not an effective way to model the human voice.
The sound of a vowel is controlled by only a few factors, e.g., tongue height (high/low) and position (front/back); see the vowel chart at http://www.ipachart.com/.
What does DNN do?
Vu, Ngoc Thang, Jochen Weiner, and Tanja Schultz. "Investigating the Learning Effect of Multilingual Bottle-Neck Features for ASR." Interspeech. 2014.
[Figure: hidden-layer output reduced to two dimensions; the vowels /a/, /i/, /u/, /e/, /o/ are arranged by tongue height (high/low) and position (front/back)]
The lower layers detect the manner of articulation, and all the states share the results from the same set of detectors. This uses the parameters effectively.
Speaker Adaptation
Speaker Adaptation
• Speaker adaptation: use different models to recognize the speech of different speakers
– Collect the audio data of each speaker
• A DNN model for each speaker
– Challenge: limited data for training
• Not enough data to directly train a DNN model
• Not enough data even to fine-tune a speaker-independent DNN model
Categories of Methods
• Conservative training: re-train the whole DNN with some constraints
• Transformation methods: only train the parameters of one layer
• Speaker-aware training: do not really change the DNN parameters
Going down the list, the methods need less and less training data.
Conservative Training
A speaker-independent DNN is trained on audio data from many speakers. It serves as the initialization for the speaker-dependent DNN, which is then trained on a little data from the target speaker, while keeping its outputs and/or its parameters close to those of the initial network.
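One way to realize the "parameter close" constraint is an L2 penalty pulling the adapted weights toward the speaker-independent ones; a minimal sketch assuming PyTorch, with an illustrative penalty weight and hypothetical names:

```python
import copy
import torch

def conservative_loss(model, si_model, ce_loss, reg_weight=0.01):
    """Cross-entropy on the target speaker's data plus an L2 penalty that
    keeps the adapted parameters close to the speaker-independent ones."""
    penalty = sum(((p - p_si) ** 2).sum()
                  for p, p_si in zip(model.parameters(), si_model.parameters()))
    return ce_loss + reg_weight * penalty

# Usage sketch: si_model is the trained speaker-independent DNN (kept frozen).
# model = copy.deepcopy(si_model)   # initialize the adapted model from it
# loss = conservative_loss(model, si_model, ce_loss=frame_ce, reg_weight=0.01)
```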
Transformation methods
Add an extra layer: a layer with weights Wa is inserted between layer i and layer i+1 of the trained DNN. Only Wa is trained on a little data from the target speaker; all the other parameters are fixed.
Transformation methods
• Add the extra layer between the input and the first layer.
• With splicing, either one large Wa is applied to the whole spliced input, or a smaller Wa is shared and applied to each frame of the splice. A larger Wa needs more adaptation data; a smaller Wa needs less.
Transformation methods
• SVD bottleneck adaptation
With the low-rank output layer W ≈ U V (U is M × K, V is K × N, connected through a linear bottleneck of size K), the adaptation matrix Wa is inserted at the bottleneck. K is usually small, so Wa has few parameters to train.
Speaker-aware Training
From lots of mismatched data (data of speaker 1, speaker 2, speaker 3, ……), a fixed-length, low-dimensional vector of speaker information is extracted for each speaker. Text transcriptions are not needed to extract the vectors. The same idea also applies to noise-aware or device-aware training, ……
Speaker-aware Training
In both training and testing, the acoustic features are appended with speaker-information features before being fed to the network. All the speakers use the same DNN model; different speakers are simply augmented with different features.
Multi-task Learning
Multitask Learning
• The multi-layer structure makes the DNN suitable for multitask learning.
One network can share its lower layers and branch into separate outputs for Task A and Task B, or Task A and Task B can take separate input features and share only some middle layers. A sketch of the first structure follows.
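A minimal sketch of the shared-trunk, two-head structure, assuming PyTorch; the layer sizes and the two tasks chosen here (states and phonemes) are illustrative:

```python
import torch
import torch.nn as nn

class MultiTaskDNN(nn.Module):
    """Shared lower layers; one output head per task."""
    def __init__(self, input_dim=351, num_states=3000, num_phonemes=48):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(input_dim, 1024), nn.ReLU(),
            nn.Linear(1024, 1024), nn.ReLU(),
        )
        self.head_a = nn.Linear(1024, num_states)     # Task A: states
        self.head_b = nn.Linear(1024, num_phonemes)   # Task B: phonemes

    def forward(self, x):
        h = self.shared(x)
        return self.head_a(h), self.head_b(h)

model = MultiTaskDNN()
x = torch.randn(8, 351)                    # a batch of spliced acoustic features
logits_a, logits_b = model(x)
# Training would sum the two task losses, e.g., CE over states + CE over phonemes.
```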
Multitask Learning - Multilingual
The acoustic features are fed to shared lower layers, with separate output layers for the states of French, German, Spanish, Italian, Mandarin, ……
Human languages share some common characteristics.
Multitask Learning -Multilingual
Huang, Jui-Ting, et al. "Cross-language knowledge transfer using multilingual deep neural network with shared hidden layers."Acoustics, Speech and Signal Processing (ICASSP), 2013
[Figure: character error rate (25-50%) versus hours of training data for Mandarin (1 to 1000, log scale), comparing "Mandarin only" with "With European Language"]
Multitask Learning - Different units
The network takes acoustic features as input and predicts two kinds of units A and B at the same time, e.g.:
A = state, B = phoneme
A = state, B = gender
A = state, B = grapheme (character)
Deep Learning for Acoustic Modeling
New acoustic features
MFCC
Waveform → DFT → spectrogram → filter bank → log → DCT → MFCC → input of the DNN
Filter-bank Output
Waveform → DFT → spectrogram → filter bank → log → input of the DNN
Using the log filter-bank outputs (dropping the DCT) is kind of standard now.
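A minimal sketch of computing log mel filter-bank features, assuming librosa; the file name, number of mel bands, and frame settings are illustrative:

```python
import librosa
import numpy as np

y, sr = librosa.load("utterance.wav", sr=16000)   # hypothetical file name

# Mel filter bank applied to the power spectrogram (DFT -> filter bank).
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=400,
                                     hop_length=160, n_mels=40)
log_fbank = np.log(mel + 1e-10).T    # (num_frames, 40), input of the DNN
```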
Spectrogram
Waveform → DFT → spectrogram → input of the DNN
About 5% relative improvement over the filter-bank output, and common today.
Ref: Sainath, T. N., Kingsbury, B., Mohamed, A. R., & Ramabhadran, B., "Learning filter banks within a deep neural network framework," In Automatic Speech Recognition and Understanding (ASRU), 2013.
Spectrogram
Waveform?
Waveform → input of the DNN directly
People have tried, but it is not better than the spectrogram yet.
Ref: Tüske, Z., Golik, P., Schlüter, R., & Ney, H., "Acoustic modeling with deep neural networks using raw time signal for LVCSR," In INTERSPEECH 2014.
If this succeeds, there is no need for Signal & Systems; for now, you still need to take Signal & Systems ……
Waveform?
Convolutional Neural Network (CNN)
CNN
• Speech can be treated as images
The spectrogram is an image, with time on one axis and frequency on the other.
CNN
Replace the DNN with a CNN: the spectrogram image is the input of the CNN, and the probabilities of the states are the output.
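A minimal sketch of a CNN acoustic model over a spectrogram patch, assuming PyTorch; the patch size, filter sizes, pooling, and number of states are illustrative assumptions:

```python
import torch
import torch.nn as nn

num_states = 3000   # illustrative number of states

# Input: 1-channel spectrogram patch, 40 frequency bins x 11 frames.
cnn = nn.Sequential(
    nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(kernel_size=(2, 1)),            # pool along frequency only
    nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(kernel_size=(2, 1)),
    nn.Flatten(),
    nn.Linear(64 * 10 * 11, 1024), nn.ReLU(),
    nn.Linear(1024, num_states),                  # state logits
)

patch = torch.randn(8, 1, 40, 11)                # batch of spectrogram patches
state_posteriors = cnn(patch).softmax(dim=-1)
```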
CNN
Max pooling / Maxout: pairs of outputs are reduced by taking the maximum, e.g., max(a1, a2) and max(b1, b2).
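A minimal sketch of a maxout unit (each output is the max over a group of linear units), assuming PyTorch; the sizes are illustrative:

```python
import torch
import torch.nn as nn

class Maxout(nn.Module):
    """Linear layer whose outputs are grouped; each group is reduced by max."""
    def __init__(self, in_dim, out_dim, pieces=2):
        super().__init__()
        self.out_dim, self.pieces = out_dim, pieces
        self.linear = nn.Linear(in_dim, out_dim * pieces)

    def forward(self, x):
        z = self.linear(x)                              # (batch, out_dim * pieces)
        z = z.view(-1, self.out_dim, self.pieces)       # group the outputs
        return z.max(dim=-1).values                     # max over each group

layer = Maxout(in_dim=351, out_dim=512, pieces=2)
y = layer(torch.randn(8, 351))                          # shape (8, 512)
```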
CNN
Tóth, László. "Convolutional Deep Maxout Networks for Phone Recognition," Interspeech, 2014.
Applications in Acoustic Signal Processing
DNN for Speech Enhancement
Demo for speech enhancement: http://home.ustc.edu.cn/~xuyong62/demo/SE_DNN.html
A DNN maps noisy speech to clean speech, for mobile communication or speech recognition.
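A minimal sketch of DNN-based enhancement as frame-wise regression from noisy to clean spectra, assuming PyTorch; the feature dimension and layer sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn

dim = 257   # e.g., log-magnitude spectrum bins per frame (illustrative)

# DNN regression: noisy spectrum frame in, enhanced (clean) frame out.
enhancer = nn.Sequential(
    nn.Linear(dim, 1024), nn.ReLU(),
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, dim),
)

noisy = torch.randn(32, dim)     # placeholder noisy frames
clean = torch.randn(32, dim)     # placeholder clean targets
loss = nn.functional.mse_loss(enhancer(noisy), clean)
loss.backward()
```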
DNN for Voice Conversion
Demo for Voice Conversion: http://research.microsoft.com/en-us/projects/vcnn/default.aspx
A DNN converts a male voice into a female voice.
Concluding Remarks
Concluding Remarks
• Conventional Speech Recognition
• How to use Deep Learning in acoustic modeling?
• Why Deep Learning?
• Speaker Adaptation
• Multi-task Deep Learning
• New acoustic features
• Convolutional Neural Network (CNN)
• Applications in Acoustic Signal Processing
Thank you for your attention!
More Research Related to Speech
Speech recognition (e.g., turning the audio "你好" ("Hello") into text) is one of the core techniques. Research built on it includes:
• Speech Summarization: lecture recordings → summary
• Spoken Content Retrieval: "Find the lectures related to 'deep learning'"
• Dialogue: e.g., "Hi" / "Hello"
• Information Extraction: e.g., "I would like to leave Taipei on November 2nd"
• Computer Assisted Language Learning