

“Draw every outer line first then fill in the interior”

validating rt-MRI based articulatory representations via articulatory recognition

A. Katsamanis, E. Bresch, V. Ramanarayanan, S. Narayanan
Signal Analysis and Interpretation Laboratory, Electrical Engineering, University of Southern California, Los Angeles, CA 90089

http://sail.usc.edu

references

[1] E. Bresch and S. Narayanan, “Region segmentation in the frequency domain applied to upper airway real-time magnetic resonance images,” IEEE Trans. Medical Imaging, vol. 28, no. 3, pp. 323–338, Mar. 2009.
[2] S. Narayanan, E. Bresch, P. Ghosh, L. Goldstein, A. Katsamanis, Y. Kim, A. Lammert, M. Proctor, V. Ramanarayanan, and Y. Zhou, “A multimodal real-time MRI articulatory corpus for speech research,” in Proc. of Interspeech, 2011.
[3] T. Cootes, C. Taylor, D. Cooper, and J. Graham, “Active shape models – their training and application,” Computer Vision and Image Understanding, vol. 61, pp. 38–59, 1995.
[4] P. Besl and N. McKay, “A method for registration of 3-D shapes,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 14, pp. 239–256, 1992.

Research supported by NIH Grant R01 DC007124-01

statistical deformable model using principal component analysis

[Figure: vocal tract outlines extracted for /ae/ and /g/]

[Figure: 3-state left-right phoneme hidden Markov model with states s1, s2, s3]

articulatory recognition results

[Figure: state-time alignment of the phoneme models for the utterance “This was easy for us”]

articulatory representation extraction

feature extraction:
- automatic vocal tract segmentation [1]
- contour resampling
- head motion compensation [4], with respect to the fixed hard palate and pharyngeal wall

the compensated contours are then projected onto the statistical deformable shape model (PCA) or the linear discriminant analysis basis described below; a minimal sketch of the head motion compensation step follows this list
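to make the head motion compensation step concrete, here is a minimal iterative closest point (ICP) sketch in the spirit of [4], assuming each frame provides 2-D points on the fixed structures (hard palate and pharyngeal wall); the function and array names are illustrative, not the authors' implementation

```python
import numpy as np
from scipy.spatial import cKDTree

def rigid_fit(src, dst):
    """Least-squares rotation R and translation t mapping src onto dst (Kabsch)."""
    src_mean, dst_mean = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_mean).T @ (dst - dst_mean)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:        # guard against a reflection solution
        Vt[-1] *= -1
        R = Vt.T @ U.T
    return R, dst_mean - R @ src_mean

def compensate_head_motion(frame_pts, ref_pts, n_iter=30):
    """Rigidly align a frame's hard-palate/pharyngeal-wall points (N x 2)
    to the same structures in a reference frame (M x 2) via ICP."""
    tree = cKDTree(ref_pts)
    cur = frame_pts.copy()
    for _ in range(n_iter):
        _, idx = tree.query(cur)          # nearest reference point per source point
        R, t = rigid_fit(cur, ref_pts[idx])
        cur = cur @ R.T + t
    # in practice the accumulated rigid transform is applied to the frame's
    # full vocal tract contour, not only to the fixed-structure points
    return cur
```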

in a nutshell

- our goal is to devise an articulatory representation that allows us to computationally analyze real-time speech production data
- we investigate deformable shape model and linear discriminant analysis based representations
- we validate the alternative representations by HMM-based articulatory recognition

- we want to model the (co-)variation of the points on the vocal tract contour
- each vocal tract outline corresponds to a point in an M-dimensional space, and all such points form a cloud in the “Allowable Shape Domain” [3]
- this cloud is assumed to be approximately ellipsoidal; its center and major axes are estimated using principal component analysis, where the eigenvectors of the covariance matrix with the largest eigenvalues describe the longest axes of the ellipsoid
- 26 components describe more than 95% of the variance
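a minimal sketch of fitting such a shape model with scikit-learn, assuming a hypothetical file vt_contours.npy that stores one outline per frame as a concatenated (x, y) coordinate vector:

```python
import numpy as np
from sklearn.decomposition import PCA

# contours: (n_frames, M) matrix; each row is one resampled vocal tract
# outline with its (x, y) point coordinates stacked into an M-dim vector
# (hypothetical array; the real data come from the segmentation pipeline above)
contours = np.load("vt_contours.npy")

pca = PCA(n_components=0.95)        # keep enough axes for >95% of the variance
b = pca.fit_transform(contours)     # per-frame shape parameters (here ~26)
print(pca.n_components_)            # number of retained components

# any outline is then approximated as the mean shape plus a weighted
# sum of eigenvectors: x ~ x_mean + P @ b
recon = pca.inverse_transform(b[:1])   # reconstruct the first frame's outline
```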

our goal is to obtain a vocal tract representation that preserves as much as possible of the discriminative information among classes of shapes (one class per phoneme)

assuming a unimodal Gaussian distribution of the points for each class, the optimal projection matrix can be found via linear discriminant analysis, by maximizing the ratio of between-class to within-class scatter
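a corresponding sketch with scikit-learn, assuming the same hypothetical contour matrix plus a per-frame phoneme label array; note that LDA yields at most (n_classes − 1) discriminant axes, so 40 components presuppose at least 41 phoneme classes:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# contours as above; labels hold one phoneme class per frame, taken from
# the timed phonetic transcriptions (hypothetical arrays)
contours = np.load("vt_contours.npy")
labels = np.load("frame_phonemes.npy")

# LDA maximizes between-class over within-class scatter under the
# per-class Gaussian assumption
lda = LinearDiscriminantAnalysis(n_components=40)
features = lda.fit_transform(contours, labels)   # 40-dim discriminative features
```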

recognition uses 40 LDA-based or 25 PCA-based shape features; results using 100 image-intensity-based discrete cosine transform features are also shown for comparison

[Bar chart: phoneme recognition accuracy (%), on a 0–60% scale, for intensity DCT (Int. DCT), shape PCA (Sh. PCA), and shape LDA (Sh. LDA) features]

training

- the models were initialized using the timed phonetic transcriptions of the training sequences
- 4-component Gaussian mixture models serve as observation probability distributions
- phoneme pairs whose articulation presumably differs only in voicing are merged into a single class, e.g., /z/ and /s/ or /t/ and /d/
- training was performed using the Hidden Markov Model Toolkit (HTK)
- finally, the articulatory models are allowed to align asynchronously with the corresponding acoustics

(an illustrative model set-up sketch follows this list)
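the study itself used HTK; purely as an illustration, a 3-state left-right GMM-HMM of the kind described could be set up with the hmmlearn package as follows (array names hypothetical):

```python
import numpy as np
from hmmlearn.hmm import GMMHMM

# one 3-state left-right phoneme model with 4-component GMM observations;
# start and transition probabilities are set by hand, so init_params/params
# exclude 's' to keep the forced entry at s1
model = GMMHMM(n_components=3, n_mix=4, covariance_type="diag",
               init_params="mcw", params="mcwt")

model.startprob_ = np.array([1.0, 0.0, 0.0])     # always enter at s1
model.transmat_ = np.array([[0.5, 0.5, 0.0],     # left-right topology:
                            [0.0, 0.5, 0.5],     # zero entries stay zero
                            [0.0, 0.0, 1.0]])    # under EM re-estimation

# X: (n_frames, d) stacked articulatory feature sequences for this phoneme,
# lengths: frame counts per training segment (hypothetical arrays)
# model.fit(X, lengths)
```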

[Figure: LDA parameter trajectories (values 0.1–0.9) and 1st/2nd PCA parameter trajectories (values −2 to 3) over frame numbers 20–120]