![Page 1: An Analysis-by-Synthesis Approach to Vocal Tract Modeling](https://reader031.vdocuments.net/reader031/viewer/2022012005/61d95f2ef8d9ab6ff53ed8e2/html5/thumbnails/1.jpg)
An Analysis-by-Synthesis Approach to Vocal
Tract Modeling for Robust Speech Recognition
Ziad Al Bawab
Electrical and Computer Engineering
Carnegie Mellon University
Work in collaboration with: Bhiksha Raj, Lorenzo Turicchia (MIT), and Richard M. Stern
IBM Research
October 9, 2009
![Page 2: An Analysis-by-Synthesis Approach to Vocal Tract Modeling](https://reader031.vdocuments.net/reader031/viewer/2022012005/61d95f2ef8d9ab6ff53ed8e2/html5/thumbnails/2.jpg)
Talk Outline
I. Introduction
II. Deriving vocal tract shapes from EMA data using a physical model
III. Analysis-by-synthesis framework
IV. Dynamic articulatory model
V. Conclusion
![Page 3: An Analysis-by-Synthesis Approach to Vocal Tract Modeling](https://reader031.vdocuments.net/reader031/viewer/2022012005/61d95f2ef8d9ab6ff53ed8e2/html5/thumbnails/3.jpg)
Conventional Generative Model
[Figure: waveform (amplitude vs. time) and spectrogram (frequency vs. time, 0 to 8000) of the word SPEECH, segmented into the phones /S/ /P/ /IY/ /CH/. Each phone is modeled by states S1, S2, …, Sn, and each state is associated with acoustic feature vectors (F1, F2, …, F13). Label: Maximum Likelihood. Image credit: Wikipedia.]
![Page 4: An Analysis-by-Synthesis Approach to Vocal Tract Modeling](https://reader031.vdocuments.net/reader031/viewer/2022012005/61d95f2ef8d9ab6ff53ed8e2/html5/thumbnails/4.jpg)
The Ultimate Generative Model
[Figure: the same waveform and spectrogram of SPEECH (/S/ /P/ /IY/ /CH/). The states are now articulatory targets (e.g., lips separation, tongue tip), shown as parallel state streams S11 … S1n and S21 … S2n. Articulatory modeling feeds a physical model of sound generation, which produces the acoustic features (F1, F2, …, F13).]
Speech is actually generated by the vocal tract!
Physical Generative Model
![Page 5: An Analysis-by-Synthesis Approach to Vocal Tract Modeling](https://reader031.vdocuments.net/reader031/viewer/2022012005/61d95f2ef8d9ab6ff53ed8e2/html5/thumbnails/5.jpg)
The Missing Science
• Need a framework that explicitly models the articulatory space (configurations and dynamics) and can help alleviate problems such as coarticulation, articulatory target undershoot, asynchrony of articulators, and pronunciation variation
• Current approaches in articulatory modeling (Livescu, Deng, Erler, and more) attempt to learn and apply constraints based on inferences from surface-level acoustic observations or from linguistic sources
• Need to learn from real articulatory data
• Need a mapping from the articulatory space to the acoustic domain that is based on the physical generative process, which is more natural (i.e., accurate) and can generalize better than a mapping learned statistically (i.e., from parallel articulatory and acoustic data)
![Page 6: An Analysis-by-Synthesis Approach to Vocal Tract Modeling](https://reader031.vdocuments.net/reader031/viewer/2022012005/61d95f2ef8d9ab6ff53ed8e2/html5/thumbnails/6.jpg)
MOCHA Database
[Figures: MOCHA apparatus; raw articulatory measurements.]
![Page 7: An Analysis-by-Synthesis Approach to Vocal Tract Modeling](https://reader031.vdocuments.net/reader031/viewer/2022012005/61d95f2ef8d9ab6ff53ed8e2/html5/thumbnails/7.jpg)
MOCHA EMA Data
[Figure: EMA coil positions in the midsagittal plane (x cm vs. y cm): UL, LL, UI, LI, TT, TB, TD, VL.]
![Page 8: An Analysis-by-Synthesis Approach to Vocal Tract Modeling](https://reader031.vdocuments.net/reader031/viewer/2022012005/61d95f2ef8d9ab6ff53ed8e2/html5/thumbnails/8.jpg)
Maeda Parameters
[Figure: Maeda's model maps 7 Maeda parameters (P1 … P7) to a vocal tract outline (upper palate, lips, glottis) and from there to area functions (acoustic tubes), given as area and length pairs (A1, L1) … (A36, L36).]
![Page 9: An Analysis-by-Synthesis Approach to Vocal Tract Modeling](https://reader031.vdocuments.net/reader031/viewer/2022012005/61d95f2ef8d9ab6ff53ed8e2/html5/thumbnails/9.jpg)
Articulatory Speech Synthesis
[Figure: area functions (acoustic tubes), given as area and length pairs (A1, L1) … (A36, L36), are converted to the transfer function of each section and then combined into the vocal tract (VT) transfer function.]
Sondhi and Schroeter Model
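The step from area functions to a vocal tract transfer function can be illustrated with a much-simplified sketch. It is not the Sondhi and Schroeter model itself, which accounts for losses and other effects. The code below assumes lossless uniform tube sections, an ideal open termination at the lips (zero pressure), no nasal branch, and typical values for air density and sound speed. The function name and constants are illustrative.

```python
import numpy as np

RHO = 1.14e-3  # air density in g/cm^3 (assumed typical value)
C = 3.5e4      # speed of sound in cm/s (assumed typical value)


def tube_transfer_function(areas, lengths, freqs):
    """Volume-velocity transfer function U_lips / U_glottis of a tube chain.

    areas, lengths: cross-sectional area (cm^2) and length (cm) of each
        section, ordered from glottis to lips.
    freqs: frequencies in Hz at which to evaluate the transfer function.
    """
    freqs = np.asarray(freqs, dtype=float)
    k = 2.0 * np.pi * freqs / C  # wavenumber at each frequency
    # Running product of 2x2 chain matrices, one matrix per frequency.
    chain = np.tile(np.eye(2, dtype=complex), (len(freqs), 1, 1))
    for area, length in zip(areas, lengths):
        z0 = RHO * C / area  # characteristic impedance of this section
        section = np.empty((len(freqs), 2, 2), dtype=complex)
        section[:, 0, 0] = np.cos(k * length)
        section[:, 0, 1] = 1j * z0 * np.sin(k * length)
        section[:, 1, 0] = 1j * np.sin(k * length) / z0
        section[:, 1, 1] = np.cos(k * length)
        chain = chain @ section
    # [P_g, U_g]^T = chain @ [P_l, U_l]^T; with P_l = 0, U_l / U_g = 1 / D.
    return 1.0 / chain[:, 1, 1]
```

For a uniform tube of total length 17.5 cm, this sketch has resonances near 500, 1500, and 2500 Hz, the familiar quarter-wavelength pattern for a tube closed at one end and open at the other.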
![Page 10: An Analysis-by-Synthesis Approach to Vocal Tract Modeling](https://reader031.vdocuments.net/reader031/viewer/2022012005/61d95f2ef8d9ab6ff53ed8e2/html5/thumbnails/10.jpg)
Deriving Realistic Vocal Tract Shapes from ElectroMagnetic
Articulograph Data via Geometric Adaptation and Profile Fitting
• Problem Overview:
– Speech synthesis solely from EMA data using:
• Knowledge of the geometry of the vocal tract
• Knowledge of the physics of the speech generation process
• Approach Followed:
– Compute realistic vocal tract shapes from EMA data
1. Adapting Maeda’s geometric vocal tract model to EMA data
2. Search for best fit of the tongue and lips profile contours to EMA data
– Synthesize speech from vocal tract shapes
3. Articulatory synthesis using the Sondhi and Schroeter model
![Page 11: An Analysis-by-Synthesis Approach to Vocal Tract Modeling](https://reader031.vdocuments.net/reader031/viewer/2022012005/61d95f2ef8d9ab6ff53ed8e2/html5/thumbnails/11.jpg)
1. Vocal Tract Adaptation Parameters
[Figure: Maeda's vocal tract model in the midsagittal plane (x cm vs. y cm), labeling the lips, origin, upper wall, inner wall, upper incisor, tongue, larynx edges, distance d, and grid lines 10, 12, 21, and 29.]
• Origin
• Upper Wall Shift
• θ
• d
• Lips Separation
![Page 12: An Analysis-by-Synthesis Approach to Vocal Tract Modeling](https://reader031.vdocuments.net/reader031/viewer/2022012005/61d95f2ef8d9ab6ff53ed8e2/html5/thumbnails/12.jpg)
Adaptation Result [1]
[Figure: EMA points (UL, LL, UI, LI, TT, TB, TD, VL) overlaid on the adapted model (x cm vs. y cm), showing the Maeda upper wall, the estimated EMA upper wall, the inner wall, the larynx, distance d, and numbered grid lines.]
[1] Z. Al Bawab, L. Turicchia, R. M. Stern, and B. Raj, “Deriving Vocal Tract Shapes From ElectroMagnetic
Articulograph Data Via Geometric Adaptation and Matching,” in Interspeech, Brighton, UK, September
2009.
![Page 13: An Analysis-by-Synthesis Approach to Vocal Tract Modeling](https://reader031.vdocuments.net/reader031/viewer/2022012005/61d95f2ef8d9ab6ff53ed8e2/html5/thumbnails/13.jpg)
2. Search Results
[Figure: EMA points in purple for phone ‘II’ as in “Seesaw = /S-II-S-OO/”.]
[Figure: EMA points in purple for phone ‘@@’ as in “Working = /W-@@-K-I-NG/”.]
![Page 14: An Analysis-by-Synthesis Approach to Vocal Tract Modeling](https://reader031.vdocuments.net/reader031/viewer/2022012005/61d95f2ef8d9ab6ff53ed8e2/html5/thumbnails/14.jpg)
3. Synthesis Results
[Figure: acoustic tubes model for phone ‘II’ as in “Seesaw = /S-II-S-OO/”.]
[Figure: acoustic tubes model for phone ‘@@’ as in “Working = /W-@@-K-I-NG/”.]
![Page 15: An Analysis-by-Synthesis Approach to Vocal Tract Modeling](https://reader031.vdocuments.net/reader031/viewer/2022012005/61d95f2ef8d9ab6ff53ed8e2/html5/thumbnails/15.jpg)
Creating a Realistic Codebook and
Adapted Articulatory Transfer Functions
Codeword: p1 p2 p3 p4 p5 p6 p7 VA (VA = velum area)
![Page 16: An Analysis-by-Synthesis Approach to Vocal Tract Modeling](https://reader031.vdocuments.net/reader031/viewer/2022012005/61d95f2ef8d9ab6ff53ed8e2/html5/thumbnails/16.jpg)
Projecting the Codeword Means of the 44 Phones Using Multi-Dimensional Scaling (MDS)
[Figure: 2-D MDS projection (x vs. y) of the mean codeword for each phone: @, @@, A, AA, AI, AU, B, CH, D, DH, E, E@, EI, F, G, H, I, I@, II, JH, K, L, M, N, NG, O, OI, OO, OU, P, R, S, SH, T, TH, U, U@, UH, UU, V, W, Y, Z, ZH.]
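A 2-D projection of this kind can be sketched with classical (Torgerson) MDS. The slides do not say which MDS variant or distance was used, so the choice of classical MDS on Euclidean distances is an assumption, and the function name is illustrative.

```python
import numpy as np


def classical_mds(points, n_components=2):
    """Embed points in n_components dimensions, preserving Euclidean
    distances as well as a linear projection can.

    points: array of shape (n_points, n_dims), e.g. one mean codeword
        (seven Maeda parameters plus velum area) per phone.
    """
    points = np.asarray(points, dtype=float)
    sq_norms = np.sum(points ** 2, axis=1)
    # Squared Euclidean distance between every pair of points.
    sq_dist = sq_norms[:, None] + sq_norms[None, :] - 2.0 * points @ points.T
    n = len(points)
    centering = np.eye(n) - np.ones((n, n)) / n
    # Double centering turns squared distances into a Gram matrix.
    gram = -0.5 * centering @ sq_dist @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)
    # Keep the directions with the largest eigenvalues.
    order = np.argsort(eigvals)[::-1][:n_components]
    return eigvecs[:, order] * np.sqrt(np.maximum(eigvals[order], 0.0))
```

With 44 rows, one per phone, the returned array has shape (44, 2) and can be plotted directly as x and y coordinates.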
![Page 17: An Analysis-by-Synthesis Approach to Vocal Tract Modeling](https://reader031.vdocuments.net/reader031/viewer/2022012005/61d95f2ef8d9ab6ff53ed8e2/html5/thumbnails/17.jpg)
Deriving Analysis-by-Synthesis Features [2]
Compare signals generated from a codebook of valid vocal tract configurations to the incoming signal to produce a “distortion” feature vector.
[Figure: each articulatory configuration in the codebook (codeword 1 … codeword N, each a set of Maeda parameters P1 … P7 in the articulatory space) is passed through synthesis, with energy and pitch as additional synthesis inputs, and converted to MFCCs. The Mel-cepstral distortion between the MFCCs of the incoming speech and those of each synthesized signal gives the distortion feature vector (d1, …, dN).]
[2] Z. Al Bawab, B. Raj, and R. M. Stern, “Analysis-by-synthesis features for speech recognition,” IEEE International Conference on Acoustics, Speech, and Signal Processing, Las Vegas, Nevada, April 2008.
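The distortion feature vector can be sketched as follows, assuming the MFCCs of each synthesized codeword are precomputed. The Mel-cepstral distortion is taken in its standard form over coefficients 1 to 12, so that the zeroth coefficient of a 13-dimensional MFCC vector is skipped. That indexing and the function names are assumptions made for illustration.

```python
import numpy as np


def mel_cepstral_distortion(c_incoming, c_synth):
    """Mel-cepstral distortion over cepstral coefficients 1..12.

    Both inputs have 13 coefficients along the last axis; the arrays
    broadcast against each other on the leading axes.
    """
    diff = np.asarray(c_incoming)[..., 1:13] - np.asarray(c_synth)[..., 1:13]
    return (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=-1))


def distortion_features(frame_mfcc, codebook_mfcc):
    """Distortion between one incoming frame and every codeword.

    frame_mfcc: shape (13,), MFCCs of the incoming frame.
    codebook_mfcc: shape (N, 13), MFCCs of the signal synthesized
        from each of the N codewords.
    Returns the distortion vector (d1, ..., dN) with shape (N,).
    """
    frame_mfcc = np.asarray(frame_mfcc, dtype=float)
    codebook_mfcc = np.asarray(codebook_mfcc, dtype=float)
    return mel_cepstral_distortion(frame_mfcc[None, :], codebook_mfcc)
```

A codeword whose synthesized spectrum matches the incoming frame gives a distortion of zero, so small entries in the vector point to plausible vocal tract configurations.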
![Page 18: An Analysis-by-Synthesis Approach to Vocal Tract Modeling](https://reader031.vdocuments.net/reader031/viewer/2022012005/61d95f2ef8d9ab6ff53ed8e2/html5/thumbnails/18.jpg)
Mixture Probability Density Function
• For a given frame, the output probability of each
state in the HMM is a mixture density over a set
of M codewords:
[Equation image not recovered. Its two annotated terms are the weight of each codeword and the likelihood of the input given the codeword and state.]
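The two annotated terms (a weight per codeword and a likelihood of the input given the codeword and state) suggest the usual weighted-sum form of a mixture density. A sketch follows, in which the symbols $x_t$, $s$, $w_{s,j}$, and $cd_j$ are notation assumed here rather than taken from the slide:

```latex
P(x_t \mid s) = \sum_{j=1}^{M} w_{s,j}\, P(x_t \mid cd_j, s),
\qquad \sum_{j=1}^{M} w_{s,j} = 1, \quad w_{s,j} \ge 0
```

Here $w_{s,j}$ is the weight of codeword $j$ in state $s$, and $P(x_t \mid cd_j, s)$ is the likelihood of the input frame given that codeword and state.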
![Page 19: An Analysis-by-Synthesis Approach to Vocal Tract Modeling](https://reader031.vdocuments.net/reader031/viewer/2022012005/61d95f2ef8d9ab6ff53ed8e2/html5/thumbnails/19.jpg)
HMM Framework
![Page 20: An Analysis-by-Synthesis Approach to Vocal Tract Modeling](https://reader031.vdocuments.net/reader031/viewer/2022012005/61d95f2ef8d9ab6ff53ed8e2/html5/thumbnails/20.jpg)
Priors From EMA
[Figure: EMA measurements (TT, TB, TD trajectories over time), with each frame mapped to a codeword, giving a sequence such as cd1 cd2 cd1 cd3 cd2.]
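The slide does not state how the priors are computed. One plausible reading, shown here only as an assumption, is that the prior of each codeword is its relative frequency in the codeword sequence obtained from the EMA frames. The function name is illustrative.

```python
from collections import Counter


def codeword_priors(codeword_sequence, codebook_size):
    """Relative frequency of each codeword index in a frame-level sequence.

    codeword_sequence: codeword indices (0-based), one per EMA frame.
    codebook_size: number of codewords in the codebook.
    """
    if not codeword_sequence:
        raise ValueError("codeword_sequence must not be empty")
    counts = Counter(codeword_sequence)
    total = len(codeword_sequence)
    return [counts.get(j, 0) / total for j in range(codebook_size)]
```

For the sequence cd1 cd2 cd1 cd3 cd2 (indices 0, 1, 0, 2, 1), codewords 1 and 2 each receive 0.4 and codeword 3 receives 0.2. Codewords that never occur receive zero, so priors built this way are sparse.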
![Page 21: An Analysis-by-Synthesis Approach to Vocal Tract Modeling](https://reader031.vdocuments.net/reader031/viewer/2022012005/61d95f2ef8d9ab6ff53ed8e2/html5/thumbnails/21.jpg)
Update Equations
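• For each phone, we estimate the model parameters (symbols lost in extraction) for each state as:
[Equations not recoverable from the extracted text. The surviving fragment shows an exponential form for P(x_u | cd_j) involving a distortion term d indexed by u and j.]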
![Page 22: An Analysis-by-Synthesis Approach to Vocal Tract Modeling](https://reader031.vdocuments.net/reader031/viewer/2022012005/61d95f2ef8d9ab6ff53ed8e2/html5/thumbnails/22.jpg)
Weights for Phone ‘OU’ Projected on Codewords-MDS Space
[Figure: four panels in the codeword MDS space (x vs. y): Priors from EMA; Weights Flat Init; Weights Init from EMA; Weights Init from EMA + Adaptation.]
![Page 23: An Analysis-by-Synthesis Approach to Vocal Tract Modeling](https://reader031.vdocuments.net/reader031/viewer/2022012005/61d95f2ef8d9ab6ff53ed8e2/html5/thumbnails/23.jpg)
Experimental Setup
• Segmented phone recognition on the MOCHA Database (9 speakers, 460 TIMIT British English utterances per speaker, 44 phones)
• Articulatory codebook composed of 1024 different Maeda configurations derived from MOCHA EMA data
• LDA dimensionality reduction of the distortion vector to 20 features per frame, phones being the classes of transformation
![Page 24: An Analysis-by-Synthesis Approach to Vocal Tract Modeling](https://reader031.vdocuments.net/reader031/viewer/2022012005/61d95f2ef8d9ab6ff53ed8e2/html5/thumbnails/24.jpg)
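Experimental Setup Cont’d
• Distortion measure used is the Mel-Cepstral distortion:
$$\mathrm{MCD}(C_{\text{incoming}}, C_{\text{synth}}) = \frac{10}{\ln 10}\sqrt{2\sum_{k=1}^{12}\left(C_{\text{incoming}}(k) - C_{\text{synth}}(k)\right)^{2}}$$
• Classify each phone c according to:
$$\hat{c} = \arg\max_{c}\; P(c)\, P(\mathrm{MFCC} \mid c)^{(1-\alpha)}\, P(\mathrm{DF} \mid c)^{\alpha}$$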
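The classification rule combines an MFCC-based likelihood and a distortion-feature (DF) likelihood with a weight α. A minimal sketch in the log domain, assuming that α is the exponent on the distortion-feature stream and (1 − α) the exponent on the MFCC stream; the function and argument names are illustrative.

```python
import numpy as np


def classify_phone(log_prior, log_p_mfcc, log_p_df, alpha):
    """Index of the phone class with the highest combined score.

    Each argument before alpha is an array with one entry per phone class:
    log P(c), log P(MFCC | c), and log P(DF | c).
    alpha in [0, 1] weights the distortion-feature stream.
    """
    score = (np.asarray(log_prior, dtype=float)
             + (1.0 - alpha) * np.asarray(log_p_mfcc, dtype=float)
             + alpha * np.asarray(log_p_df, dtype=float))
    return int(np.argmax(score))
```

Working with log probabilities turns the product of powers into a weighted sum and avoids numerical underflow when frame likelihoods are multiplied over a whole segment.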
![Page 25: An Analysis-by-Synthesis Approach to Vocal Tract Modeling](https://reader031.vdocuments.net/reader031/viewer/2022012005/61d95f2ef8d9ab6ff53ed8e2/html5/thumbnails/25.jpg)
Summary of Phone Error Rates Results [3]
| Features (dimension) | Topology, Observation Prob. / Init | fsew0 (14,352) | msak0 (14,302) | Both (28,654) | Improvement |
|---|---|---|---|---|---|
| MFCC + CMN (13) | 3S-128M-HMM, Gaussian/VQ | 61.6% | 55.9% | 58.8% | |
| Dist Feat (1024) (Prob. Combination α = 0.2) | 3S-1024M-HMM, Exponential/Flat, Sparsity = 21% | 57.6% | 53.7% | 55.7% | 5.3% |
| Dist Feat (1024) (Prob. Combination α = 0.2) | 3S-1024M-HMM, Exponential/EMA, Sparsity = 51% | 58.3% | 53.9% | 56.1% | 4.6% |
| Adapted Dist Feat (1024) (Prob. Combination α = 0.25) | 3S-1024M-HMM, Exponential/EMA, Sparsity = 51% | 58.4% | 53.1% | 55.7% | 5.3% |
| Dist Feat + LDA + CMN (20) (Prob. Combination α = 0.6) | 3S-128M-HMM, Gaussian/VQ, Sparsity = 0% | 54.9% | 49.8% | 52.4% | 10.9% |

[3] Z. Al Bawab, B. Raj, and R. M. Stern, “A Hybrid Physical and Statistical Dynamic Articulatory Framework Incorporating Analysis-by-Synthesis for Improved Phone Classification,” submitted to ICASSP 2010, Dallas, Texas.
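The Improvement figures are consistent with a relative reduction in the combined ("Both") phone error rate with respect to the MFCC + CMN baseline of 58.8%. That reading is inferred from the numbers rather than stated on the slide. A short check, with an illustrative function name:

```python
def relative_improvement(baseline_error, new_error):
    """Relative error reduction, in percent, with respect to a baseline."""
    return 100.0 * (baseline_error - new_error) / baseline_error


baseline = 58.8  # MFCC + CMN, both speakers
for error in (55.7, 56.1, 55.7, 52.4):
    print(f"{error:.1f}% -> {relative_improvement(baseline, error):.1f}% relative")
```

Rounded to one decimal place, the four values come out as 5.3, 4.6, 5.3, and 10.9, matching the last column of the table.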
![Page 26: An Analysis-by-Synthesis Approach to Vocal Tract Modeling](https://reader031.vdocuments.net/reader031/viewer/2022012005/61d95f2ef8d9ab6ff53ed8e2/html5/thumbnails/26.jpg)
Summary of our Contribution
| | Conventional HMM | Production-Based HMM |
|---|---|---|
| States | Abstract, no physical meaning | Real articulatory configurations |
| Output Observation Probability | Gaussian probability using acoustic features | Exponential probability based on the analysis-by-synthesis distortion features |
| Adaptation | VTLN, MLLR, MAP | Vocal tract geometric model adaptation |
| Transition Probability | Based on acoustic observation | Can be learned from articulatory dynamics |
![Page 27: An Analysis-by-Synthesis Approach to Vocal Tract Modeling](https://reader031.vdocuments.net/reader031/viewer/2022012005/61d95f2ef8d9ab6ff53ed8e2/html5/thumbnails/27.jpg)
Conclusion
• A model that mimics the actual physics of the vocal tract results in better classification performance
• Developed a hybrid physical and statistical dynamic articulatory framework that incorporates analysis-by-synthesis for improved phone classification
• Recent databases open new horizons to better understand the articulatory phenomena
• Current advances in computation and machine learning algorithms facilitate the integration of physical models into large-scale systems
![Page 28: An Analysis-by-Synthesis Approach to Vocal Tract Modeling](https://reader031.vdocuments.net/reader031/viewer/2022012005/61d95f2ef8d9ab6ff53ed8e2/html5/thumbnails/28.jpg)
• THANK YOU