
University of California

Los Angeles

Rapid Speaker Normalization and Adaptation

with Applications to Automatic Evaluation

of Children’s Language Learning Skills

A dissertation submitted in partial satisfaction

of the requirements for the degree

Doctor of Philosophy in Electrical Engineering

by

Shizhen Wang

2010


© Copyright by

Shizhen Wang

2010


To my family


Table of Contents

1 Introduction
  1.1 Automatic Speech Recognition
    1.1.1 Hidden Markov Models
    1.1.2 Feature Extraction
    1.1.3 Acoustic Modeling
    1.1.4 Robustness Issues
  1.2 Speaker Normalization and Speaker Adaptation
    1.2.1 Linear Frequency Warping
    1.2.2 Maximum Likelihood Linear Regression
  1.3 Children’s Speech Recognition
  1.4 Organization of the Dissertation

2 Regression-Tree based Spectral Peak Alignment
  2.1 Frequency Warping as Linear Transformation
    2.1.1 Frequency Warping for MFCC
    2.1.2 Approximated Linearization of Frequency Warping
    2.1.3 Definition of the Frequency Warping Matrix
    2.1.4 Linear Frequency Warping Functions
  2.2 Alignment of Spectral Peaks
    2.2.1 Choice of the Reference Speaker
    2.2.2 Levels of Mismatch in Formant Structure
  2.3 Speaker Adaptation Using Spectral Peak Alignment
  2.4 Regression-tree based Speaker Adaptation
    2.4.1 Global vs. Regression-tree based Peak Alignment
    2.4.2 Phoneme based Regression Tree
    2.4.3 Gaussian Mixture based Regression Tree
  2.5 Integration of Peak Alignment with MLLR
  2.6 Experimental Results
    2.6.1 Experimental Setup
    2.6.2 Comparison of Global and Regression-tree based PAA versus MLLR and VTLN
    2.6.3 Discussion on Comparison of RM1 and TIDIGITS
    2.6.4 Performance of the Linearization Approximation
    2.6.5 Comparison of PAA, PSAT and MLLR-SAT
    2.6.6 Comparison of PSAT and MLLR-SAT
    2.6.7 Comparison of Supervised and Unsupervised Adaptation
    2.6.8 Significance Analysis
  2.7 Summary and Conclusion

3 Speaker Normalization based on Subglottal Resonances
  3.1 Subglottal Acoustic System and Its Coupling to the Vocal Tract
    3.1.1 Subglottal Acoustic System
    3.1.2 Coupling between Subglottal and Supraglottal Systems
    3.1.3 Effects of Coupling to the Subglottal System
    3.1.4 Subglottal Resonances and Phonological Distinctive Features
  3.2 Estimating the Second Subglottal Resonance
    3.2.1 Estimation based on Frequency Discontinuity
    3.2.2 Estimation based on Joint Frequency and Energy Measurement
  3.3 Calibration of the Sg2 Estimation Algorithm
  3.4 Variability of Subglottal Resonance Sg2
    3.4.1 The Bilingual Database
    3.4.2 Cross-content and Cross-language Variability
    3.4.3 Implications of Sg2 Invariability
  3.5 Experiments with Linear Frequency Warping
    3.5.1 Comparison of VTLN and Sg2 Frequency Warping
    3.5.2 Effectiveness of Sg2 Normalization
    3.5.3 Comparison of Vowel Content Dependency
    3.5.4 Performance on RM1 Database
    3.5.5 Cross-language Speaker Normalization
  3.6 Nonlinear Frequency Warping
    3.6.1 Mel-shift based Frequency Warping
    3.6.2 Bark-shift based Frequency Warping
  3.7 Experiments with Nonlinear Frequency Warping
    3.7.1 Sg2 based Nonlinear Frequency Warping
    3.7.2 Experimental Setup
    3.7.3 Experimental Results
  3.8 Summary and Discussion

4 Automatic Evaluation of Children’s Language Learning Skills
  4.1 Technology based Assessment of Language and Literacy
  4.2 Blending Tasks and Database Collections
    4.2.1 Blending Tasks for Phonemic Awareness
    4.2.2 Database Collections
  4.3 Human Evaluations and Discussions
    4.3.1 Web-based Teacher’s Assessment
    4.3.2 Inter-correlation of the Assessment
    4.3.3 Discussions on the Blending Target Words
  4.4 Automatic Evaluation System
    4.4.1 Overall System Flowchart
    4.4.2 Disfluency Detection
    4.4.3 Accent Detection
    4.4.4 Pronunciation Dictionary
    4.4.5 Accuracy and Smoothness Measurements
    4.4.6 Overall Quality Measurement
  4.5 Experimental Results
  4.6 Summary and Discussion

5 Summary and Future Work
  5.1 Summary and Discussion
  5.2 Future Work

References

List of Figures

1.1 Diagram of MFCC feature extraction.
1.2 Spectrograms of a clean speech utterance from a male speaker (top) saying two digits “eight two” and the same utterance corrupted with additive white noise at 5 dB (bottom).
1.3 Spectrograms of an utterance from an adult speaker (top) saying one digit “zero”, and the same sentence from a child speaker (bottom).
2.1 Diagram of MFCC feature extraction without (X_c) and with (Y_c) frequency warping.
2.2 Mel-frequency filter banks and the approximation made in the linearization of frequency warping in [30]: each triangular filter is represented only by its central peak value (the circled point).
2.3 Illustration of three levels of formant estimation (global, phoneme, and state). Boundaries are obtained through forced alignment: dashed lines mark the boundaries of phonemes and dotted lines mark the boundaries of states.
2.4 F3 warping factors for /IY/, /AA/, /UW/, and the global average for 10 test speakers (6 male and 4 female adults) from RM1.
2.5 F3 warping factors for /IY/, /AH/, /UW/, and the global average for 10 test speakers (5 boys and 5 girls) from TIDIGITS.
2.6 An example of a regression tree using combined phonetic knowledge and data-driven techniques for the phoneme-based approach. Phonemes are first categorized based on phonetic knowledge, and then further clustered according to their estimated F3 values.
2.7 The speaker adaptation algorithm using regression-tree based spectral peak alignment for both supervised and unsupervised adaptation.
2.8 Performance of VTLN, MLLR, G-PA, GPAA, PPAA and MPAA using RM1 for supervised adaptation.
2.9 Performance of VTLN, MLLR, G-PA, GPAA, PPAA and MPAA using TIDIGITS for supervised adaptation.
2.10 Performance of MPAA, MLLR, PSAT and MLLR-SAT using RM1 for supervised adaptation.
2.11 Performance of MPAA, MLLR, PSAT and MLLR-SAT using TIDIGITS for supervised adaptation.
3.1 Schematic model of the vocal tract with acoustic coupling to the trachea through the glottis (a) and the equivalent circuit model (b). Adapted from [62].
3.2 Spectrogram of the word “boy” from an eight-year-old girl. The second subglottal resonance Sg2 for this speaker is at 1920 Hz.
3.3 Illustration of the F2 discontinuity caused by Sg2. The bold solid line corresponds to the most prominent spectral peak (F2), which has a jump in frequency and a decrease in amplitude when F2 crosses the subglottal resonance Sg2. The dotted line represents the Sg2 pole, which varies somewhat in frequency and amplitude when F2 is nearby. The horizontal thin solid line represents the Sg2 zero, which is roughly constant. Adapted from [62].
3.4 Illustration of the relative positions of vowel formants F1 (·), F2 (+) and F3 (x) and the subglottal resonances (Sg1, Sg2 and Sg3) for an adult male speaker. For the vowels /i, ɪ, ɛ, æ/, F2 > Sg2, and they are therefore [-back]. For the vowels /ɑ, ʌ, ɔ, ʊ, u/, F2 < Sg2, and they are therefore [+back]. Adapted from [67].
3.5 An example of the detection algorithm.
3.6 Example of the joint estimation method where the F2 discontinuity and E2 attenuation correspond to the same location (frame 38). Eq. (3.3) is used to estimate Sg2.
3.7 Example where there is a discrepancy between the locations of the F2 discontinuity (not detectable) and the E2 attenuation (at frame 51). The average F2 value within the dotted box is then used as the Sg2 estimate.
3.8 Comparison of Sg2 estimates for the two speakers in Table 3.2, top panel for speaker 1 and bottom panel for speaker 2.
3.9 Average within-speaker Sg2 standard deviations and COVs against contents and repetitions.
3.10 Cross-language within-speaker COV of Sg2 for 10 boys and 10 girls.
3.11 Vowel formants F1 (·), F2 (+) and F3 (x) before and after VTLN (circles) and Sg2-based (squares) warping for a nine-year-old girl’s vowels. The lines ‘Sg1’, ‘Sg2’ and ‘Sg3’ are the reference subglottal resonances from the same speaker as in Fig. 3.4.
3.12 Vowel formants F1 (·), F2 (+) and F3 (x) from the reference speaker (Fig. 3.4) versus those from the test speaker (Fig. 3.11) before and after warping (VTLN in circles, Sg2 in squares). The dotted line is y = x, which indicates a perfect match between reference and test speakers.
3.13 Speaker normalization performance on TIDIGITS with various amounts of adaptation data.
3.14 Performance comparison of VTLN, F3 and Sg2D2 using one adaptation digit with various vowel content.
3.15 Piecewise Bark-shift warping function, where α > 0 shifts the Bark scale upward, α < 0 shifts it downward, and α = 0 means no warping.
4.1 Flowchart of the automatic evaluation system for the blending tasks.
4.2 An example of the disfluency detection network for the syllable blending task word ‘peptic’, where START and END are the network entry and exit points, respectively.

List of Tables

2.1 Word recognition accuracy using RM1 for supervised adaptation.
2.2 Word recognition accuracy using RM1 for unsupervised adaptation.
2.3 Word recognition accuracy using TIDIGITS for supervised adaptation.
2.4 Word recognition accuracy using TIDIGITS for unsupervised adaptation.
2.5 Significance analysis of performance improvements of MPAA over MLLR using RM1 for supervised adaptation.
2.6 Significance analysis of performance improvements of MPAA over MLLR using TIDIGITS for supervised adaptation.
3.1 Comparison of Sg2 estimates for two algorithms over various vowel contents, where Sg2M is the manual measurement from the speech spectrum, and Sg2Acc is the ‘ground truth’ measurement from the accelerometer signal. For each algorithm the average Sg2 estimates (Hz) are shown (with standard deviations in parentheses). The two speakers marked with a * are those used for calibration.
3.2 Detailed comparison of Sg2 estimates for the two algorithms on two speakers. For vowels above the double line, there are no discontinuities in the F2 trajectory, and Sg2D1 uses the mean F2 as Sg2 while Sg2D2 uses Eq. (3.2) (the estimate S̃g2); for vowels below the double line, the F2 discontinuity is detectable, and Sg2D1 uses Eq. (3.1) while Sg2D2 uses Eq. (3.3). The row ‘Avg. (std)’ shows the mean (and standard deviation) for each algorithm.
3.3 Performance comparison (word recognition accuracy) on RM1 with one adaptation utterance.
3.4 Performance comparison (word recognition accuracy) of VTLN and Sg2 normalization using English (four words) and Spanish (five words) adaptation data. The acoustic models were trained and tested using English data.
3.5 WER on TIDIGITS using MFCC features with normalization data varying from 1 to 15 digits.
3.6 WER on TIDIGITS using PLPCC features with normalization data varying from 1 to 15 digits.
3.7 WER on TBall children’s data using MFCC and PLPCC features with 3 normalization words.
4.1 An example of the TBall blending tasks: audio prompts are presented and a child is asked to orally blend them into a whole word. A one-second silence (SIL) is used within the prompts to separate each sound.
4.2 An example of the TBall segmentation tasks: audio prompts are presented and a child is asked to orally segment them into parts.
4.3 Target words for the blending tasks.
4.4 Speaker distribution by native language and gender.
4.5 Average inter-evaluator correlation on pronunciation accuracy, smoothness and overall evaluations for the three blending tasks.
4.6 Pronunciation variant analysis for consonants and vowels on a Spanish-accented English database, with the percentage of occurrence in the analysis database shown in parentheses. Entries with a trailing asterisk are change patterns not predicted by theory.
4.7 Average correlation between ASR and teacher evaluations on pronunciation accuracy, smoothness and overall quality for the three blending tasks.


Acknowledgments

I am deeply grateful to my advisor, Dr. Abeer Alwan, for her intellectual guidance, gracious support and encouragement throughout my study at UCLA. My sincere gratitude also goes to Professors Richard D. Wesel, Paulo Tabuada, and Mark H. Hansen for serving on my doctoral committee and for their interest in my research.

I would also like to extend my gratitude to Dr. Xiaodong Cui and Dr. Steven M. Lulich for their insightful suggestions and valuable comments. I have greatly benefited from collaborations with Dr. Cui and Dr. Lulich, and long discussions with each of them directly led to some of the ideas presented in this dissertation.

I am very thankful to my family - father, mother, my wife and daughter. They have always been there by my side and encouraged me to do my best. Without their love, support and encouragement, this dissertation would not have been possible.

Thanks go to my SPAPL labmates, Dr. Markus Iseli, Dr. Sankaran Panchapagesan, Jonas, Yen, Wei, Gang, Harish, and many others, for their important help and for their friendship. I would also like to thank all my friends for being there for me over the years.

This work was supported in part by NSF Grant No. 0326214, and parts of the dissertation have been published in the papers listed under Publications.


Vita

1998–2002 B.S. (Electrical Engineering), Shandong University, China

2002–2005 M.S. (Electrical Engineering), Tsinghua University, China

2005–2010 Ph.D. (Electrical Engineering), UCLA, Los Angeles, California

Publications

S. Wang, A. Alwan and S. M. Lulich, “Automatic detection of the second subglottal resonance and its application to speaker normalization,” J. Acoust. Soc. Am., 126(6): 3268-3277, 2009.

S. Wang, P. Price, Y.-H. Lee and A. Alwan, “Measuring children’s phonemic awareness through blending tasks,” in Proc. of SLaTE Workshop, 2009.

S. Wang, Y.-H. Lee and A. Alwan, “Bark-shift based nonlinear speaker normalization using the second subglottal resonance,” in Proc. of Interspeech 2009, pp. 1619-1622.

S. Wang, A. Alwan and S. M. Lulich, “A reliable technique for detecting the second subglottal resonance and its use in cross-language speaker adaptation,” in Proc. of Interspeech 2008, pp. 1717-1720.

S. Wang, A. Alwan and S. M. Lulich, “Speaker normalization based on subglottal resonances,” in Proc. of ICASSP 2008, pp. 4277-4280.

S. Wang, X. Cui and A. Alwan, “Speaker Adaptation with Limited Data using Regression-Tree based Spectral Peak Alignment,” IEEE Trans. on Audio, Speech and Language Processing, Vol. 15, pp. 2454-2464, 2007.

S. Wang, P. Price, M. Heritage and A. Alwan, “Automatic Evaluation of Children’s Performance on an English Syllable Blending Task,” in Proc. of SLaTE Workshop, 2007.

S. Wang, X. Cui and A. Alwan, “Rapid Speaker Adaptation using Regression-Tree based Spectral Peak Alignment,” in Proc. of ICSLP, pp. 1479-1482, 2006.


Abstract of the Dissertation

Rapid Speaker Normalization and Adaptation

with Applications to Automatic Evaluation

of Children’s Language Learning Skills

by

Shizhen Wang

Doctor of Philosophy in Electrical Engineering

University of California, Los Angeles, 2010

Professor Abeer Alwan, Chair

This dissertation investigates speaker variation issues in automatic speech recognition (ASR), with a focus on rapid speaker normalization and adaptation methods that use limited enrollment data from the speaker. The investigations concentrate on reducing spectral variation through frequency warping.

Two methods are developed: one based on the supraglottal (vocal tract) resonances (formants), and the other on resonances of the subglottal airways. The first method reshapes (warps) the spectrum by aligning corresponding formant peaks. Since there are various levels of variation in formant structures, regression-tree based phoneme- and state-level spectral peak alignment is studied for rapid speaker adaptation using a linearization of the vocal tract length normalization (VTLN) technique. This method is investigated in a maximum likelihood linear regression (MLLR)-like framework, taking advantage of both the efficiency of frequency warping (VTLN) and the reliability of statistical estimation (MLLR). Two different regression classes are investigated: one based on phonetic classes (using combined knowledge and data-driven techniques) and the other based on Gaussian mixture classes.

The second approach utilizes subglottal resonances, which have been shown to affect the spectral properties of speech sounds. A reliable algorithm is developed to automatically estimate the second subglottal resonance (Sg2) from speech signals. The algorithm is calibrated on children’s speech data with simultaneous accelerometer recordings from which Sg2 frequencies can be directly measured. A cross-language study with bilingual Spanish-English children is performed to investigate whether Sg2 frequencies are independent of speech content and language. The study verifies that Sg2 is approximately constant for a given speaker and is thus a good candidate for limited-data speaker normalization and cross-language adaptation. A speaker normalization method using Sg2 is then presented.

As an application, ASR techniques are applied to automatically evaluate children’s phonemic awareness through three blending tasks (phoneme blending, onset-rhyme blending and syllable blending). The system incorporates speaker normalization, disfluency detection and Spanish accent detection, together with speech recognition, to assess the overall quality of children’s speech productions.


CHAPTER 1

Introduction

1.1 Automatic Speech Recognition

Automatic Speech Recognition (ASR) aims to decode an acoustic signal X = X_1 X_2 \cdots X_T into a word sequence \hat{W} = w_1 w_2 \cdots w_m, with the goal of making \hat{W} close to the original word sequence W. The most common criterion is to maximize the posterior probability P(W|X) for the given observation X, that is:

\hat{W} = \arg\max_{W} P(W|X)    (1.1)

Using Bayes' theorem, the above equation can be written as:

\hat{W} = \arg\max_{W} P(W|X) = \arg\max_{W} \frac{P(W)\,P(X|W)}{P(X)} = \arg\max_{W} P(W)\,P(X|W)    (1.2)

where P(X) is omitted because its value is fixed for all word sequences. P(W) is the language model probability of the word sequence W, while P(X|W) is the acoustic model probability of generating the observation sequence X given the word sequence W.

1.1.1 Hidden Markov Models

The most widely used ASR acoustic models are hidden Markov models (HMMs). An HMM [1] is a very powerful statistical method for characterizing observed data samples, under the assumption that the data samples can be well characterized as a parametric first-order Markov random process. An HMM is basically a Markov chain, except that the output observation is generated probabilistically, rather than deterministically, according to an output probability function associated with each state; thus there is no one-to-one correspondence between the observation sequence and the state sequence.

Formally speaking, an HMM λ is defined by

\lambda = (A, B, \pi)    (1.3)

where:

• A = \{a_{ij}\} is a transition probability matrix, where a_{ij} is the probability of transiting from state i at time t−1 to state j at time t, i.e.,

a_{ij} = P(s_t = j \mid s_{t-1} = i)    (1.4)

• B = \{b_j(X_t)\} is an output probability matrix, where b_j(X_t) is the probability of emitting observation X_t given that the state is j, i.e.,

b_j(X_t) = P(X_t \mid s_t = j)    (1.5)

The probability function b_j(X_t) can be either a discrete probability mass function (PMF) or a continuous probability density function (PDF).

• \pi = \{\pi_j\} is the initial (at time t = 0) state distribution, where

\pi_j = P(s_0 = j)    (1.6)

There are three fundamental problems related to HMMs [2]:

• Probability evaluation problem: given a model λ and an observation sequence X, what is the probability P(X|λ)?

• State sequence decoding problem: given a model λ and an observation sequence X, what is the most likely state sequence S that generates the observations?

• Parameter estimation problem: given a model λ and a set of observations, how can one modify the parameters to maximize the joint probability \prod_X P(X \mid \lambda)?

The first problem, the probability evaluation problem, can be efficiently solved using the forward algorithm [3], which recursively calculates the forward probability defined as:

\alpha_t(i) = P(X_1^t, s_t = i \mid \lambda)    (1.7)

which is the probability of being in state i at time t and generating the partial observation X_1^t = X_1 X_2 \cdots X_t. α_t(i) can be calculated recursively as:

\alpha_t(j) = \sum_i \alpha_{t-1}(i)\, a_{ij}\, b_j(X_t)    (1.8)

with initialization

\alpha_1(i) = \pi_i\, b_i(X_1)    (1.9)

and termination

P(X \mid \lambda) = \sum_i \alpha_T(i)    (1.10)
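For concreteness, the forward recursion of Eqs. (1.7)–(1.10) can be sketched in a few lines of code. The sketch below assumes a discrete-output HMM (transition matrix A, emission matrix B, initial distribution π, and an observation sequence of symbol indices); the array layout and the helper name are illustrative assumptions, not part of the systems described in this dissertation.

    # Forward algorithm sketch for a discrete-output HMM (illustrative only).
    import numpy as np

    def forward(A, B, pi, obs):
        # A: (N, N) transitions; B: (N, K) emissions; pi: (N,); obs: symbol ids
        N, T = A.shape[0], len(obs)
        alpha = np.zeros((T, N))
        alpha[0] = pi * B[:, obs[0]]                        # Eq. (1.9)
        for t in range(1, T):
            alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]    # Eq. (1.8)
        return alpha, alpha[-1].sum()                       # Eq. (1.10): P(X | lambda)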

The second problem, the state sequence decoding problem, can be solved with the Viterbi algorithm [4]. The Viterbi algorithm can be viewed as a modified forward algorithm in which, instead of summing up probabilities from different paths, only the path with the highest probability is selected. Define φ_t(i) as the probability of the most likely state sequence at time t which generates the partial observation X_1^t and ends in state i, that is:

\phi_t(i) = \max_{S_1^{t-1}} P(X_1^t, S_1^{t-1}, s_t = i \mid \lambda)    (1.11)

which can be recursively calculated as:

\phi_t(j) = \max_i \left[ \phi_{t-1}(i)\, a_{ij} \right] b_j(X_t)    (1.12)

with initialization

\phi_1(i) = \pi_i\, b_i(X_1)    (1.13)

and termination

V = \max_i \phi_T(i)    (1.14)

where V is the likelihood score of the best state sequence, which can then be obtained through backtracking.
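A minimal Viterbi sketch matching Eqs. (1.11)–(1.14), under the same discrete-output assumptions as the forward sketch above, is given below; backpointers recover the best state sequence.

    # Viterbi decoding sketch for a discrete-output HMM (illustrative only).
    import numpy as np

    def viterbi(A, B, pi, obs):
        N, T = A.shape[0], len(obs)
        phi = np.zeros((T, N))
        back = np.zeros((T, N), dtype=int)
        phi[0] = pi * B[:, obs[0]]                          # Eq. (1.13)
        for t in range(1, T):
            scores = phi[t - 1][:, None] * A                # phi_{t-1}(i) * a_ij
            back[t] = scores.argmax(axis=0)
            phi[t] = scores.max(axis=0) * B[:, obs[t]]      # Eq. (1.12)
        path = [int(phi[-1].argmax())]                      # best final state, Eq. (1.14)
        for t in range(T - 1, 0, -1):                       # backtracking
            path.append(int(back[t][path[-1]]))
        return phi[-1].max(), path[::-1]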

The third problem, the parameter estimation problem, can be solved using the Baum-Welch algorithm [3], also known as the forward-backward algorithm. Similarly to the forward probability, the backward probability is defined as:

\beta_t(i) = P(X_{t+1}^T \mid s_t = i, \lambda)    (1.15)

That is, β_t(i) is the probability of generating the partial observation from time t+1 to the end (X_{t+1}^T), given that the HMM is in state i at time t. The recursion for β_t(i) is:

\beta_t(i) = \sum_j \beta_{t+1}(j)\, a_{ij}\, b_j(X_{t+1})    (1.16)

with initialization

\beta_T(i) = 1    (1.17)

The HMM parameters can be iteratively refined by maximizing the likelihood P(X|λ). According to the expectation maximization (EM) algorithm [5], this maximization is equivalent to maximizing the following auxiliary function:

Q(\lambda, \hat{\lambda}) = E_{S|X,\lambda} \log P(S, X \mid \hat{\lambda})    (1.18)
                          = \sum_S P(S \mid X, \lambda) \log P(S, X \mid \hat{\lambda})    (1.19)

The maximization of Q(\lambda, \hat{\lambda}) can be done by setting its derivative with respect to \hat{\lambda} to zero, which has closed-form solutions both for discrete output probability functions and for continuous output probability functions in which each PDF is represented as a Gaussian mixture. For example, in the Gaussian mixture case, we have:

b_j(X_t) = \sum_{k=1}^{M} w_{jk}\, \mathcal{N}(X_t, \mu_{jk}, \Sigma_{jk})    (1.20)

where \mathcal{N}(X_t, \mu_{jk}, \Sigma_{jk}) denotes a single Gaussian density function with mean μ_jk and covariance matrix Σ_jk for state j, M is the number of Gaussian mixtures, and the weights w_jk satisfy \sum_k w_{jk} = 1. The parameter re-estimation formulae are as follows:

\hat{\pi}_i = \frac{\sum_j \gamma_1(i,j)}{\sum_i \sum_j \gamma_1(i,j)}    (1.21)

\hat{a}_{ij} = \frac{\sum_t \gamma_t(i,j)}{\sum_t \sum_j \gamma_t(i,j)}    (1.22)

\hat{\mu}_{jk} = \frac{\sum_t \zeta_t(j,k)\, X_t}{\sum_t \zeta_t(j,k)}    (1.23)

\hat{\Sigma}_{jk} = \frac{\sum_t \zeta_t(j,k)\,(X_t - \hat{\mu}_{jk})(X_t - \hat{\mu}_{jk})^T}{\sum_t \zeta_t(j,k)}    (1.24)

\hat{w}_{jk} = \frac{\sum_t \zeta_t(j,k)}{\sum_t \sum_k \zeta_t(j,k)}    (1.25)

where γ_t(i, j) is the probability of taking the transition from state i to state j at time t, given the observation X:

\gamma_t(i,j) = P(s_{t-1} = i, s_t = j \mid X, \lambda)    (1.26)
             = \frac{P(s_{t-1} = i, s_t = j, X \mid \lambda)}{P(X \mid \lambda)}    (1.27)
             = \frac{\alpha_{t-1}(i)\, a_{ij}\, b_j(X_t)\, \beta_t(j)}{\sum_i \alpha_T(i)}    (1.28)

and ζ_t(j, k) is the probability of being in state j and Gaussian mixture k at time t, given the observation X:

\zeta_t(j,k) = P(s_t = j, \xi_t = k \mid X, \lambda)    (1.29)
            = \sum_i \alpha_{t-1}(i)\, a_{ij}\, w_{jk}\, b_{jk}(X_t)\, \beta_t(j)    (1.30)
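As a compact illustration (not the implementation used in this work), the backward recursion of Eqs. (1.15)–(1.17) and one Baum-Welch update of the transition matrix via Eqs. (1.22) and (1.26)–(1.28) can be sketched as follows for a discrete-output HMM; alpha is assumed to come from the forward sketch given earlier.

    # Backward pass and one transition re-estimation step (illustrative only).
    import numpy as np

    def backward(A, B, obs):
        N, T = A.shape[0], len(obs)
        beta = np.ones((T, N))                              # Eq. (1.17): beta_T(i) = 1
        for t in range(T - 2, -1, -1):
            beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])  # Eq. (1.16)
        return beta

    def reestimate_transitions(A, B, obs, alpha, beta):
        T = len(obs)
        evidence = alpha[-1].sum()                          # P(X | lambda)
        gamma = np.zeros((T,) + A.shape)                    # gamma_t(i, j)
        for t in range(1, T):                               # Eq. (1.28)
            gamma[t] = (alpha[t - 1][:, None] * A *
                        B[:, obs[t]][None, :] * beta[t][None, :]) / evidence
        num = gamma.sum(axis=0)                             # sum over t of gamma_t(i, j)
        return num / num.sum(axis=1, keepdims=True)         # Eq. (1.22)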

1.1.2 Feature Extraction

According to the linear speech production theory, a speech signal can be viewed as a source signal passing through a linear filter. The source represents the air flow at the vocal cords, which is a periodic signal for voiced sounds (with the inverse of the period known as the fundamental frequency F0) and aperiodic noise for unvoiced sounds. The filter represents the resonances (poles, also known as formants) and anti-resonances (zeros) of the vocal tract. The characteristics of the filter provide more discriminative information about phonemes, and are therefore heavily relied on for sound classification, whether by humans or by machines.

Feature extraction decomposes the source and filter functions and parameterizes the raw speech signal into a sequence of feature vectors. Commonly used features include linear predictive cepstral coefficients (LPCC) [6], perceptual linear prediction (PLP) [7] and Mel-frequency cepstral coefficients (MFCC) [8]. MFCCs are the most widely used because of their robust performance across various conditions.

Some common pre-processing operations are typically applied before feature extraction. First, pre-emphasis is applied to boost high frequencies through the first-order difference equation

s'_n = s_n - k\, s_{n-1}    (1.31)

where s_n denotes the speech samples and k is the pre-emphasis coefficient, in the range 0 ≤ k < 1. Since speech signals are non-stationary, the signals are then segmented into short frames (about 20 ms long) and processed frame by frame. To attenuate discontinuities at the window boundaries, a Hamming window is usually applied to taper the samples in each window.
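An illustrative sketch of these pre-processing steps follows; the parameter values (20 ms frames with 10 ms shift at a 16 kHz sampling rate, k = 0.97) are assumptions made only for this example, not the exact settings used in this work.

    # Pre-emphasis (Eq. (1.31)), framing, and Hamming windowing (illustrative only).
    import numpy as np

    def preprocess(signal, frame_len=320, frame_shift=160, k=0.97):
        # signal: 1-D numpy array of speech samples
        emphasized = np.append(signal[0], signal[1:] - k * signal[:-1])  # s'_n = s_n - k*s_{n-1}
        n_frames = 1 + (len(emphasized) - frame_len) // frame_shift
        frames = np.stack([emphasized[i * frame_shift: i * frame_shift + frame_len]
                           for i in range(n_frames)])
        return frames * np.hamming(frame_len)   # taper each frame at its boundaries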

MFCC is defined as the real cepstrum of a windowed signal, derived from its Fourier transform through a Mel-frequency filter bank that approximates the frequency resolution of the auditory system. Fig. 1.1 shows a diagram for computing MFCC features. A discrete Fourier transform (DFT) is applied to each frame of the windowed signal to transform it from the time domain to the frequency domain. A triangular Mel-frequency filter bank is then applied to the DFT magnitude, followed by a logarithm to compress the dynamic range. A discrete cosine transform (DCT) is then performed to decorrelate the log filter-bank outputs. First- and second-order derivatives are usually appended to the raw MFCC features to account for spectral dynamics.
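To make the pipeline of Fig. 1.1 concrete, the sketch below computes static MFCCs for a single pre-processed frame. The simplified triangular filter-bank construction and the constants (26 filters, 13 cepstra, 512-point FFT, 16 kHz) are illustrative assumptions, not the exact front end used in this work.

    # Minimal MFCC sketch for one windowed frame (illustrative only).
    import numpy as np

    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)

    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    def mel_filter_bank(n_filters=26, n_fft=512, sample_rate=16000):
        # Triangular filters with centers equally spaced on the Mel scale
        edges = mel_to_hz(np.linspace(0, hz_to_mel(sample_rate / 2), n_filters + 2))
        bins = np.floor((n_fft // 2 + 1) * edges / (sample_rate / 2)).astype(int)
        fb = np.zeros((n_filters, n_fft // 2 + 1))
        for i in range(n_filters):
            l, c, r = bins[i], bins[i + 1], bins[i + 2]
            fb[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising slope
            fb[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling slope
        return fb

    def mfcc(frame, n_ceps=13, n_fft=512, sample_rate=16000):
        spectrum = np.abs(np.fft.rfft(frame, n_fft))             # DFT magnitude
        log_mel = np.log(mel_filter_bank(n_fft=n_fft, sample_rate=sample_rate)
                         @ spectrum + 1e-10)                     # filter bank + log
        n = log_mel.size
        dct = np.cos(np.pi * np.outer(np.arange(n_ceps), np.arange(n) + 0.5) / n)
        return dct @ log_mel                                     # DCT -> static cepstra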

1.1.3 Acoustic Modeling

Acoustic modeling is critical to ASR performance and is arguably the central part of an ASR system. According to Eq. (1.2), the challenge of acoustic modeling is to build an accurate P(X|W), which should take into account all possible variabilities such as speaker variations, pronunciation variations, environmental variations, etc.

After converting the raw speech signal into feature vectors, a training process is performed to learn the acoustic characteristics of the training data, resulting in a set of acoustic models. The most successful models are HMMs, as described in Section 1.1.1. Depending on the amount of training data, an ASR system can use discrete, continuous, or semi-continuous HMMs. When the training data is sufficient, continuous models offer the best recognition accuracy, while discrete models are more effective for small amounts of data.

Figure 1.1: Diagram of MFCC feature extraction (speech signal → pre-emphasis → Hamming windowing → DFT magnitude → Mel-frequency filter bank → logarithm → DCT → MFCC features).

The model unit can be either a whole-word model or a sub-word model (e.g., a phonetic model), depending on the size of the recognition task. For small-vocabulary and isolated-word tasks, whole-word models provide better performance, while sub-word models are more flexible and robust for large-vocabulary continuous tasks.

Depending on the source of the training data, an ASR system can be either speaker-dependent (SD) or speaker-independent (SI) [9]. An SD system can achieve high recognition accuracy, but requires a large amount of training data from the target speaker, and is therefore difficult to generalize to new speakers. On the other hand, an SI system is more flexible and easier to adapt to a new speaker, though its performance is usually not as good as that of an SD system trained for a specific speaker.

This dissertation focuses on speaker-independent continuous speech recognition using continuous HMMs.

1.1.4 Robustness Issues

High accuracy and robustness are the ultimate goals of ASR systems. Due to the great variability in speech signals, today's state-of-the-art ASR systems are still far from matching human performance. It is still a challenge to build an ASR system that can accurately recognize anyone's speech, on any topic, and in any speaking environment. Variations in environmental conditions, especially noise and microphone variations, and speaker variabilities such as gender, age, speaking rate, and accent can greatly degrade ASR performance. For example, an ASR system trained on clean speech degrades significantly on real-world noisy speech; an ASR system trained on adult male speech performs about 10% (relative) worse on female speech; and performance on children's speech is even worse.

The performance degradation is caused by the mismatch between training and test data due to environmental and/or speaker variations. Such variations can dramatically change the characteristics of speech signals, as illustrated in Figs. 1.2 and 1.3, which show spectrograms of speech under clean versus noisy conditions, and speech from an adult male versus a child, respectively. These mismatches cause ASR performance to fluctuate across noise conditions and from speaker to speaker. This dissertation addresses speaker variation issues.

Figure 1.2: Spectrograms of a clean speech utterance from a male speaker (top) saying two digits “eight two” and the same utterance corrupted with additive white noise at 5 dB (bottom). (Axes: time in seconds vs. frequency in kHz.)

1.2 Speaker Normalization and Speaker Adaptation

Inter-speaker acoustic variations are mostly caused by differences in the vocal tract and vocal fold apparati. Typically, adult females have shorter vocal tract lengths (VTL) and smaller vocal cords than adult males, while children have shorter VTLs and smaller vocal cords than adults [10]. This implies, according to the linear speech production theory [11], that children have higher formant and fundamental (F0) frequencies than adults, and that female adults have higher formants and F0 than male adult speakers. Consequently, the performance of speech recognition systems may be significantly different from speaker to speaker.

Figure 1.3: Spectrograms of an utterance from an adult speaker (top) saying one digit “zero”, and the same sentence from a child speaker (bottom). (Axes: time in seconds vs. frequency in kHz.)

To maintain robust recognition accuracy, speaker adaptation and speaker normalization techniques are usually applied to reduce spectral mismatch between training and testing utterances [12–20]. Speaker adaptation attempts to compensate for spectral mismatch in the back-end acoustic model domain by statistically tuning the acoustic models to a specific speaker [12–15]. Speaker normalization, or vocal tract length normalization (VTLN), on the other hand, aims at reducing the effects of vocal tract variability in the front-end feature domain via linear, piece-wise linear or bilinear frequency warping [16, 17]. Other frequency warping functions have also been studied [18–20]. A class of transforms, known as all-pass transforms (APTs), was proposed to perform VTLN in [19] and studied in detail in [20] for two classes of conformal maps, namely rational all-pass transforms (RAPTs) and sine-log all-pass transforms (SLAPTs). It was demonstrated that using multiple-parameter warping functions is more effective than single-parameter ones [20]. In speaker adaptation, the parameters are speaker-specific transformation matrices and biases estimated using the maximum likelihood (ML) or maximum a posteriori (MAP) criterion [14, 21]. In VTLN, the parameters to be estimated are the frequency warping factors. Hence, to make reliable statistical estimates of the adaptation parameters, speaker adaptation methods generally require more adaptation data than VTLN.

Considerable research effort has been devoted to the relationship between frequency warping in the feature domain and the corresponding transformations in the model domain [22–30]. For computational efficiency, several studies have proposed the possibility of directly performing VTLN in the back-end model domain. In [22], vocal tract length normalization was implemented in an MLLR framework. Claes et al. [23] proposed a linear approximation of VTLN for reasonably small warping factors using a Taylor expansion. In [24] and [25], McDonough et al. derived the linearity of VTLN in cepstral space for two all-pass transforms (rational all-pass transforms and sine-log all-pass transforms) and conducted in [26] a detailed performance comparison with MLLR on a large vocabulary database. In [27], Pitz and Ney showed that, in the continuous frequency space, VTLN is equivalent to a linear transformation in the cepstral domain for MFCCs with Mel-frequency warping (instead of Mel-frequency filter banks). Umesh et al. [28] showed that this VTLN linearization also holds in the discrete frequency space under the assumption of a strictly limited quefrency range in the cepstral domain. Cui and Alwan [29, 30] discussed in detail the linearization of frequency warping for several different feature extraction schemes. Under certain approximations, they showed that frequency warping of MFCC features with Mel-frequency filter banks equals a linear transformation in the model domain.

1.2.1 Linear Frequency Warping

VTLN is one of the most popular methods for reducing the effects of speaker-dependent vocal tract variability through a speaker-specific frequency warping function (linear, bilinear, or piece-wise linear) [16, 17, 20, 25, 27, 28]. Warping factors are typically estimated under the maximum likelihood (ML) criterion over the adaptation data, through an exhaustive grid search or warping-factor-specific models [16, 17]. Linear frequency warping can be implemented directly in the power spectrum domain or in the cepstral domain through the linearization of VTLN [25, 27, 28]. Along with the linearization of VTLN, the warping factor can be estimated using the Expectation Maximization (EM) algorithm with an auxiliary function [25]. Other frequency warping functions have also been studied in [20].
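The ML grid search mentioned above is conceptually simple; the sketch below shows the selection of a warping factor over a typical grid. The names warp_features and score_fn are hypothetical stand-ins for a front end that re-extracts features with a given warping factor and for an acoustic-model log-likelihood; neither name, nor the grid range, comes from this dissertation.

    # ML grid search over warping factors (illustrative only).
    import numpy as np

    def select_warping_factor(utterance, warp_features, score_fn,
                              grid=np.arange(0.88, 1.13, 0.02)):
        best_alpha, best_score = 1.0, -np.inf
        for alpha in grid:
            score = score_fn(warp_features(utterance, alpha))  # log P(X_alpha | model)
            if score > best_score:
                best_alpha, best_score = alpha, score
        return best_alpha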

Another way to reduce spectral variability is to explicitly align spectral formant positions or formant-like spectral peaks, especially the third formant (F3), and to define the warping factors as formant frequency ratios [18, 23, 30–34]. In formant-based frequency warping methods, the formant positions of different speakers are transformed into a normalized frequency space. The authors in [18] proposed a nonlinear warping function based on a parameter estimated using the F3 frequency, while [32] extended this formant-based algorithm and compared its performance with ML-based methods. In [31], the performance of frequency warping using the first three formant frequencies was explored. A linear approximation of VTLN was proposed in [23] for reasonably small warping factors estimated from average F3 values. Cui and Alwan proposed a novel spectral formant-like peak alignment method, with a focus on F3, to reduce spectral mismatch between adults' and children's speech [30, 33]. Based on the idea of frequency transformation for digital filters, the authors in [34] treated formant structures as filters and developed a bilinear transform with parameters estimated using average F3 frequency and bandwidth values.

However, due to coarticulation, clarity, speed and other factors, formant frequencies vary considerably within an utterance, which makes the performance of formant normalization content dependent.

1.2.2 Maximum Likelihood Linear Regression

Maximum likelihood linear regression (MLLR) [14, 15] estimates a set of linear transformations for the mean and variance parameters of a Gaussian mixture to reduce the mismatch between the initial model set and the adaptation data. The linear transformation of the mean is defined as

\hat{\mu} = W \xi    (1.32)

where W is the transformation matrix and ξ is the extended mean vector,

\xi = [\,w\ \mu_1\ \mu_2\ \cdots\ \mu_n\,]^T    (1.33)

where w = 1 represents a bias offset. W can be decomposed into

W = [\,b\ A\,]    (1.34)

and hence

\hat{\mu} = A \mu + b    (1.35)

where A is the transformation matrix and b is a bias vector.

The variance adaptation is given in the form

\hat{\Sigma} = B^T H B    (1.36)

where H is the transformation matrix to be estimated, and B is the inverse of the Cholesky factor of Σ^{-1}:

\Sigma^{-1} = C C^T    (1.37)

and

B = C^{-1}    (1.38)

The transformation matrices can be obtained by solving an auxiliary function using the Expectation Maximization (EM) technique:

Q_N(\lambda, \hat{\lambda}) = \sum_{j,k} \sum_t \zeta_t(j,k) \log \mathcal{N}(o(t); \hat{\mu}_{jk}, \hat{\Sigma}_{jk})    (1.39)

where ζ_t(j, k) is defined in Eq. (1.30), and \mathcal{N}(o(t); \mu_{jk}, \Sigma_{jk}) is the kth Gaussian mixture component of state j.
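As a small illustration (of the transform's application, not of the estimation procedure), applying an already-estimated MLLR mean transform via Eqs. (1.32)–(1.35) to a set of Gaussian means reduces to a single matrix product; the transform W is assumed to be given here.

    # Applying an MLLR mean transform to Gaussian means (illustrative only).
    import numpy as np

    def apply_mllr_mean_transform(means, W):
        # means: (G, n) Gaussian mean vectors;  W: (n, n+1) transform with W = [b  A]
        xi = np.hstack([np.ones((means.shape[0], 1)), means])  # extended means [1, mu], Eq. (1.33)
        return xi @ W.T                                        # each row: A @ mu + b, Eq. (1.35)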

1.3 Children’s Speech Recognition

Children’s speech analysis and recognition have drawn increasing attention for educational purposes, and more effort has been devoted to ASR applications using children’s speech [35–40]. Speech technology has been applied in automated language and literacy tutoring to measure children’s language learning skills and to assess their reading and listening comprehension. Children’s speech recognition, however, still remains challenging. Due to developmental changes in the vocal tract and vocal fold apparati, children’s speech exhibits high acoustic variability, which makes children’s ASR more challenging than adults’ ASR [41, 42]. The performance of an ASR system trained using adult speech degrades drastically when employed to recognize children’s speech. Furthermore, recognition performance for children is usually lower than that achieved for adults even when using a recognition system trained on children’s speech [43].

Disfluency is also an important issue in children’s speech recognition. As part of the learning process, disfluencies such as repetitions, false starts, and self-repairs often occur in young children’s speech. It was found that mispronunciations and partial-word repetitions account for more than 30% of the word errors made by the speech recognizer [44]. Many approaches have been proposed to detect disfluency in spontaneous speech. For example, a decision model was applied in [45] using prosodic features, while [46] studied the combination of multiple knowledge sources including acoustic-prosodic features, language models and rule-based knowledge. A disfluency-specialized grammar structure was applied in [47] to detect disfluent reading miscues, while [48] proposed an efficient hybrid word/subword-unit recognition system which was shown to work well on children’s speech.

In addition, accents present another challenge for ASR. If a child speaks more than one language, his/her speech can exhibit various levels of pronunciation variation, e.g., different pronunciations of a phoneme, or phoneme insertion, deletion or substitution. Such accented speech can also degrade ASR performance. Accent detection/classification provides a way to improve performance on accented speakers, since, with knowledge of a speaker’s accent, specific modeling strategies can be applied to better target his/her individual acoustic characteristics. Studies on accent detection employ either feature-based or model-based methods, or a combination of the two. In [49], a GMM with MFCC features was used to classify four Chinese accents (dialects) of Mandarin speech. Decision trees were built in [50] to detect accent levels of Japanese-accented English using prosodic features such as duration and pitch (F0). The authors in [51] proposed using parallel phoneme recognizers followed by phoneme language models (phoneme transition probabilities). It has been shown that foreign accents can be successfully detected using GMM classifiers, neural networks or phone recognizers with acoustic and/or prosodic features.

1.4 Organization of the Dissertation

The dissertation is organized as follows. Chapter 2 presents a rapid speaker adaptation method using regression-tree based spectral peak alignment. Chapter 3 analyzes the variability of subglottal resonances and proposes an efficient speaker normalization and cross-language adaptation algorithm based on the second subglottal resonance. Chapter 4 applies ASR to children’s speech to evaluate their language learning skills, and addresses pronunciation, accent and disfluency issues. Chapter 5 summarizes the dissertation and discusses future work.

CHAPTER 2

Regression-Tree based Spectral Peak Alignment

Spectral mismatch between training and testing utterances can cause significant performance degradation in ASR systems. One way to reduce spectral mismatch is to reshape the spectrum by aligning corresponding formant peaks. In this chapter, regression-tree based phoneme- and state-level spectral peak alignment is proposed for rapid speaker adaptation, using a linearization of the VTLN technique in an MLLR-like framework that takes advantage of both the efficiency of frequency warping (VTLN) and the reliability of statistical estimation (MLLR).

2.1 Frequency Warping as Linear Transformation

Frequency warping is usually applied in the linear spectral domain, which is computationally expensive. To make it more efficient, this section derives an approximate linear transformation in the cepstral domain for frequency warping applied to MFCC features.

2.1.1 Frequency Warping for MFCC

Most state-of-the-art ASR systems utilize MFCC features. Fig. 2.1 shows the extraction of MFCC features with and without frequency warping. To derive the relationship between warped and unwarped MFCC features, matrix operations will be used to represent the feature extraction.

Figure 2.1: Diagram of MFCC feature extraction without (X_c) and with (Y_c) frequency warping (power spectrum S_l → optional frequency warping W → Mel-scaled filter bank → log → DCT).

Let S_l, an l × 1 column vector, denote the magnitude spectrum of length l (usually l = 256 for 8 kHz speech signals and l = 512 for 16 kHz signals), calculated from the Fourier transform of a frame of the input speech signal; let F_B, an n × l matrix, be a Mel-frequency filter bank, where n is the number of filter-bank channels (usually 26) and each row of the F_B matrix represents one Mel-frequency filter; let log(·) be an element-wise logarithm operation; let C, an m × n matrix, be the DCT matrix, where m is the number of cepstral coefficients (the number of static MFCC features, usually 13); let W, an l × l matrix, denote the discretized frequency warping matrix; and let X_c, an m × 1 column vector, denote the cepstral coefficients (MFCC features) of the original unwarped speech signal, with Y_c the cepstral coefficients after frequency warping. We have:

X_c = C \cdot \log(F_B \cdot S_l)    (2.1)

Y_c = C \cdot \log(F_B \cdot W \cdot S_l)    (2.2)

To relate the warped cepstrum Y_c to the unwarped one X_c, we need to express S_l explicitly in terms of X_c, i.e., to recover the magnitude spectrum from the cepstrum. Unfortunately, exact recovery of S_l from X_c is not possible, because the Mel-frequency filter bank F_B, which is a wide matrix with n ≪ l, is not invertible. Therefore, strictly speaking, there is no simple linear relationship between Y_c and X_c, due to the non-invertibility imposed by the Mel-frequency filter bank; however, some reasonable approximations exist [23, 28, 29]. These approximations produce acceptable ASR performance at a lower computational cost than performing the warping directly in the spectral domain as in Eq. (2.2).

2.1.2 Approximated Linearization of Frequency Warping

The approximation proposed in [30] is applied to derive the linearization of fre-

quency warping for MFCC features. This approach is based on the concept of

an index mapping matrix. An index mapping matrix M has only one non-zero

element (with value equal to 1) in each row. A multiplication of an index map-

ping matrix M with any matrix X results in a row-permuted (index re-mapped)

version of the original matrix X; hence the name “index mapping”. Obviously, the product of two index mapping matrices is still an index mapping matrix. Another property of index mapping matrices, which will be

used later in the derivation of the linearization of frequency warping, is that the

multiplication operation of an index mapping matrix is interchangeable with any

element-wise operations, e.g.,

log(M · X) = M · log(X) (2.3)

As will be shown in Section 2.1.3, the warping matrix for a monotonic frequency

warping function is an index mapping matrix.

The approximation in [30] is to simplify the Mel-frequency filter bank by using

only the central peak value to represent each triangular Mel-frequency filter, as

shown in Fig. 2.2. That is, to retain only the nonzero central value for each row


in the Mel-frequency filter-bank matrix, and to set all other entries of the row to

zero. Under such an approximation, the original Mel-frequency filter bank matrix

FB becomes an index mapping matrix, referred to as F̄B. To recover the linear spectrum Sl from the Mel spectrum, interpolation is applied to build an index mapping matrix F∗B, where unseen samples are generated by repeating neighboring samples, such that F∗B · F̄B = I. Thus, from Eq. (2.1), we have

Sl ≈ F∗B · exp(C−1 · Xc)    (2.4)

where exp(·) is an element-wise exponential operation, and C−1 is the inverse DCT matrix. Together with Eq. (2.2), we have:

Yc = C · log(FB · W · Sl)    (2.5)
   ≈ C · log(F̄B · W · F∗B · exp(C−1 · Xc))    (2.6)
   = C · F̄B · W · F∗B · log(exp(C−1 · Xc))    (from Eq. (2.3))    (2.7)
   = C · F̄B · W · F∗B · C−1 · Xc    (2.8)

Therefore, under such approximations, frequency warping for MFCC features

can be implemented as a linear transformation in the cepstral domain, i.e.,

Yc ≈ A · Xc (2.9)

where

A = C · F̄B · W · F∗B · C−1    (2.10)

A more detailed derivation can be found in [30].

In ASR systems, both static and dynamic features are used. From Eq. (2.9),

it is straightforward to show that dynamic features also hold this linearity, i.e.,

ΔYc ≈ A · ΔXc (2.11)

Δ2Yc ≈ A · Δ2Xc (2.12)

where Δ and Δ2 represent the first and second order derivatives, respectively.
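To make the construction of Eq. (2.10) concrete, the following numpy sketch (not the author's code) composes A from the simplified filter bank F̄B, its repeat-neighbour pseudo-inverse F∗B, the warping matrix W of Eq. (2.13) (anticipated from Section 2.1.3) with a linear warp gα(f) = α·f, and a truncated orthonormal DCT. The filter-bank peak bins, dimensions and the warping factor below are illustrative assumptions, not values from this work.

```python
import numpy as np

def dct_matrix(n):
    """n x n orthonormal DCT-II matrix."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    D = np.sqrt(2.0 / n) * np.cos(np.pi * k * (i + 0.5) / n)
    D[0, :] /= np.sqrt(2.0)
    return D

def cepstral_warp_matrix(alpha, peak_bins, l=256, m=13):
    """Compose A = C . FB_bar . W . FB_star . C^{-1} (Eq. (2.10))."""
    peak_bins = np.asarray(peak_bins, dtype=int)
    n = len(peak_bins)
    # FB_bar: each triangular filter reduced to its central peak bin (Fig. 2.2)
    FB_bar = np.zeros((n, l))
    FB_bar[np.arange(n), peak_bins] = 1.0
    # FB_star: each spectral bin copies its nearest channel (repeat-neighbour)
    FB_star = np.zeros((l, n))
    for i in range(l):
        FB_star[i, np.argmin(np.abs(peak_bins - i))] = 1.0
    # W: index-mapping warping matrix for g_alpha(f) = alpha * f (Eq. (2.13))
    W = np.zeros((l, l))
    for j in range(l):
        i = int(round(alpha * j))
        if i < l:
            W[i, j] = 1.0
    D = dct_matrix(n)
    C, C_inv = D[:m, :], D[:m, :].T     # truncated DCT and its pseudo-inverse
    return C @ FB_bar @ W @ FB_star @ C_inv

# toy usage: warp static and dynamic cepstra with the same A (Eqs. (2.9)-(2.12))
peak_bins = np.linspace(2, 250, 26).astype(int)   # hypothetical filter peaks
A = cepstral_warp_matrix(alpha=1.1, peak_bins=peak_bins)
x, dx, ddx = (np.random.randn(13) for _ in range(3))
y, dy, ddy = A @ x, A @ dx, A @ ddx
```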


Figure 2.2: Mel-frequency filter banks and the approximation made in the linearization of frequency warping in [30]: each triangular filter is represented only with its central peak value (the circled point). The horizontal axis is frequency (Hz), 0-4000 Hz.

2.1.3 Definition of the Frequency Warping Matrix

For a given frequency warping function gα(f), the discretized frequency warping

matrix W in Eq. (2.10) is defined as

wij = 1 if i = round(gα(j)), and wij = 0 otherwise    (2.13)

where i and j are the frequency sample indices. In real applications, the warping

function gα(f) is a monotonic function, and thus the warping matrix W is an

index mapping matrix.

Note that for mathematical tractability, simple warping functions such as

linear, piece-wise linear, bilinear or quadratic functions are generally applied to


perform frequency scaling in speaker normalization. In the derivation of the

transformation matrix A (Eq. (2.10)), however, no assumptions are made on the

warping function, i.e., gα(·) can be any reasonable monotonic mapping function.
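As a small illustration (not taken from this work), the sketch below builds W from an arbitrary monotonic gα following Eq. (2.13), and checks the commutation property of Eq. (2.3) for a strict index-mapping matrix with exactly one 1 per row; the warping function and sizes are arbitrary examples.

```python
import numpy as np

def warping_matrix(g, l):
    """Eq. (2.13): w_ij = 1 if i = round(g(j)), else 0."""
    W = np.zeros((l, l))
    for j in range(l):
        i = int(round(g(j)))
        if 0 <= i < l:
            W[i, j] = 1.0
    return W

W = warping_matrix(lambda f: 1.1 * f, l=64)       # linear g_alpha(f) = 1.1 f

# Eq. (2.3): a matrix with exactly one 1 per row only re-indexes rows, so it
# commutes with element-wise operations such as log.
rng = np.random.default_rng(0)
M = np.zeros((10, 64))
M[np.arange(10), rng.integers(0, 64, size=10)] = 1.0
X = rng.random((64, 5)) + 1.0                     # strictly positive entries
assert np.allclose(np.log(M @ X), M @ np.log(X))
```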

2.1.4 Linear Frequency Warping Functions

A popular and effective warping function used in VTLN is a simple linear warping

function:

gα(f) = α · f (2.14)

where α is the warping factor estimated from enrollment data.

With a linear warping function as in Eq. (2.14), the frequency range of the

resulting spectrum differs from the original one. To preserve the entire band-

width after warping, piece-wise linear or even nonlinear warping functions can be

applied, such that the boundary frequencies are always mapped into themselves.

An example of a piece-wise linear warping function is

gα(f) = α · f,    0 ≤ f ≤ fu

gα(f) = ((fmax − α·fu) / (fmax − fu)) · (f − fu) + α·fu,    fu < f ≤ fmax        (2.15)

where fmax is the maximum frequency in the spectrum, and fu is an empirically

chosen upper ‘cutoff’ frequency where the warping function deviates from α.

Preliminary experimental results did not show significant improvement using piece-wise linear over purely linear functions, so only the linear warping function is used in this work.
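For completeness, a minimal implementation of the bandwidth-preserving piece-wise linear warp of Eq. (2.15) is sketched below; the cutoff frequency fu and fmax are illustrative values only, not settings used in this work.

```python
import numpy as np

def piecewise_linear_warp(f, alpha, f_u=3400.0, f_max=4000.0):
    """Eq. (2.15): linear slope alpha below f_u, boundary-preserving above it."""
    f = np.asarray(f, dtype=float)
    upper = (f_max - alpha * f_u) / (f_max - f_u) * (f - f_u) + alpha * f_u
    return np.where(f <= f_u, alpha * f, upper)

freqs = np.array([0.0, 1000.0, 3400.0, 4000.0])
print(piecewise_linear_warp(freqs, alpha=1.1))
# the boundary frequencies 0 and f_max map to themselves: [0, 1100, 3740, 4000]
```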


2.2 Alignment of Spectral Peaks

The warping factor α is usually estimated using a maximum likelihood criterion. Another approach is to explicitly align spectral formant positions or formant-like spectral peaks, and to define the warping factor as a formant frequency ratio. It was shown in [29] and [30] that aligning only the third formant (F3) offers the best ASR performance. In other words, gα(f) is a linear warping function that aligns formant-like peaks in the spectral space:

gα(f) = α · f (2.16)

where

α = F3, new speaker / F3, standard speaker    (2.17)

2.2.1 Choice of the Reference Speaker

The reference standard speaker, chosen to represent the acoustic characteristics

of the entire training set, is one of the training speakers who yields the highest

likelihood in the training stage. Since formant frequencies change gradually from frame to frame, the median value of F3 over all voiced segments is used for each speaker in Eq. (2.17). Since F3 has been shown to correlate highly with a speaker’s vocal tract length [11], this F3 peak alignment is related to vocal tract

length normalization.

Another choice for the reference standard speaker is to choose the speaker

who has a neutral warping factor (α closest to 1). That is, to define the reference

standard speaker as the one with F3 closest to the mean F3 value over the training

set, or the one with the median F3 value. Experiments with such a choice for

the standard speaker resulted in slightly worse performance, partially due to the

fact that by using the speaker with the highest training likelihood, we explicitly


transform the acoustic parameters of each speaker toward a higher likelihood

space.

2.2.2 Levels of Mismatch in Formant Structure

As mentioned before, spectral mismatch is a major reason for performance degra-

dation. There are various levels of mismatch in the formant structures, e.g.

global average, phoneme level and state level. Fig. 2.3 illustrates formant esti-

mation at these three levels. Global average formants are estimated using all the

voiced segments (including vowels and voiced consonants) from the speech data;

phoneme-level formants are estimated using speech segments for each phoneme;

and state-level formants are estimated using segments in that state.

To illustrate different mismatch levels, we calculated the global average and

phoneme-level F3 warping factors for each test speaker from the RM1 and TIDIG-

ITS databases (see Section 2.6.1 for detailed experimental settings). F3 values

were estimated using 10 adaptation utterances (digits) for each speaker. For the

phoneme-level F3 warping factors, we compared the three vowels in the classic

vowel triangle: front vowel /IY/, mid vowel /AA/ (/AH/ for TIDIGITS, since

there was no /AA/ in the data) and back vowel /UW/. The reference standard

speaker for RM1 was a male adult with an average F3 of 2524 Hz, and the F3 values for /IY/, /AA/ and /UW/ were 2951 Hz, 2354 Hz and 2143 Hz, respectively. For TIDIGITS, the reference speaker was a male with an average F3 of 2537 Hz and phoneme-level F3 values of 2968 Hz, 2457 Hz and 2268 Hz for /IY/, /AH/ and /UW/,

respectively. The global average F3 warping factor was calculated according to

Eq. (2.17), and the phoneme-level F3 warping factor was defined in a similar

way:

α = F3, /ph/ from new speaker / F3, /ph/ from standard speaker    (2.18)


where F3, /ph/ is the phoneme-level F3 value for phoneme /ph/.
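A rough sketch of how the global and phoneme-level warping factors of Eqs. (2.17) and (2.18) could be computed from per-frame F3 estimates is given below; the reference-speaker values, the use of the median for phoneme-level estimates, and the toy inputs are assumptions made purely for illustration.

```python
import numpy as np

REF_GLOBAL_F3 = 2524.0                        # placeholder reference value (Hz)
REF_PHONE_F3 = {"IY": 2951.0, "AA": 2354.0, "UW": 2143.0}

def global_alpha(f3_frames):
    """Eq. (2.17): median F3 over all voiced frames vs. the reference speaker."""
    return float(np.median(f3_frames)) / REF_GLOBAL_F3

def phoneme_alphas(f3_frames, phone_labels):
    """Eq. (2.18): one warping factor per phoneme with voiced frames available."""
    alphas = {}
    for ph, ref in REF_PHONE_F3.items():
        vals = [f for f, p in zip(f3_frames, phone_labels) if p == ph]
        if vals:
            alphas[ph] = float(np.median(vals)) / ref
    return alphas

# toy usage with made-up per-frame F3 estimates (Hz) and phone labels
f3 = [3050, 3080, 2480, 2460, 2300, 2310]
ph = ["IY", "IY", "AA", "AA", "UW", "UW"]
print(global_alpha(f3), phoneme_alphas(f3, ph))
```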

Figure 2.3: Illustration of the three levels of formant estimation (global, phoneme and state), from sentence to frames, phonemes, HMM states and Gaussian mixtures. Boundaries are obtained through forced alignment: dashed lines mark phoneme boundaries and dotted lines mark state boundaries.

Figures 2.4 and 2.5 show the global average and phoneme-level F3 warping

factors. From these figures, it can be seen that the adult-to-adult warping factors

(as in RM1, Fig. 2.4) are in the range of [0.96, 1.08], while the child-to-adult

warping factors (as in TIDIGITS, Fig. 2.5) are in the range of [1.12, 1.26].

This is consistent with the fact that children’s formant frequencies are higher

than adults’ [11]. Compared to Fig. 2.4, the warping factors in Fig. 2.5 show

more dramatic changes from speaker to speaker. This agrees with the observation

in [41] that children’s speech demonstrates larger inter- and intra-speaker spectral variations than adults’ speech.

More importantly, these figures illustrate that phoneme-level warping factors

may be very different from the global average and different phonemes may have

different warping factors. For example, warping factors for /UW/ are around


Figure 2.4: F3 warping factors for /IY/, /AA/, /UW/ and the global average for 10 test speakers (6 male and 4 female adults) from RM1.

1.0 for the adult-to-adult case and around 1.15 for the child-to-adult case; while

the warping factors for /IY/ have a larger dynamic range. Thus, if phoneme-

level or even lower state-level (instead of global average) warping factors are

used to reduce spectral mismatch, we can expect better performance. This is

the motivation for our proposed regression-tree based phoneme- and state-level

spectral peak alignment methods in Section 2.4.¹

¹The amount of available adaptation data is another issue and is addressed in Sections 2.4 and 2.5.


Figure 2.5: F3 warping factors for /IY/, /AH/, /UW/ and the global average for 10 test speakers (5 boys and 5 girls) from TIDIGITS.

2.3 Speaker Adaptation Using Spectral Peak Alignment

The linearity in Eqs. (2.9), (2.11) and (2.12) bridges the gap between the front-

end feature domain and the back-end model domain techniques and thus provides

an efficient way of frequency warping. It can be used to perform rapid speaker

adaptations on HMM Gaussian mixtures with mean μ and diagonal covariance

Σ in an MLLR-like manner [14]:

μ̄ = Aμ + b    (2.19)

Σ̄ = B H Bᵀ    (2.20)

where μ̄ and Σ̄ are the transformed mean vector and covariance matrix, and B is the inverse of the Cholesky factor of Σ−1, the inverse of the original covariance Σ. The bias vector


b in Eq. (2.19) and the covariance transformation matrix H in Eq. (2.20) are

statistically estimated from the adaptation data under the maximum likelihood

criterion,

b = \left\{ \sum_{j,k} \sum_{t=1}^{T} \zeta_t(j,k)\, \Sigma_{jk}^{-1} \right\}^{-1} \left\{ \sum_{j,k} \sum_{t=1}^{T} \zeta_t(j,k)\, \Sigma_{jk}^{-1} \big(o(t) - A\mu_{jk}\big) \right\}    (2.21)

H = \frac{ \sum_{j,k} \left\{ (B_{jk}^{-1})^{T} \left[ \sum_{t=1}^{T} \zeta_t(j,k)\, \big(o(t) - \bar{\mu}_{jk}\big)\big(o(t) - \bar{\mu}_{jk}\big)^{T} \right] B_{jk}^{-1} \right\} }{ \sum_{j,k} \sum_{t=1}^{T} \zeta_t(j,k) }    (2.22)

where T is the number of frames of the adaptation data, and j and k are the

indices of state and mixture sets, respectively. ζt(j, k) is the posterior probability

of being in state j, mixture k, at time t given the observation o(t). By setting the off-diagonal terms of H to zero, the adapted covariance Σ̄ remains diagonal.

Unlike the statistical estimation in MLLR, the transformation matrix A here

is generated deterministically based on Eq. (2.10), which depends only on the

warping factors; while in MLLR a full or block-diagonal A needs to be statistically

estimated. This would result in many more parameters than the deterministically

generated A in Eq. (2.10). Though more parameters make it possible to capture slight differences among speakers, they may also lead to unreliable estimates

(and thus unsatisfactory performance) with limited adaptation data. The A

matrix generated using Eq. (2.10), however, can be more reliable than in MLLR

when the amount of adaptation data is small; while the statistically estimated

bias b and covariance transformation matrix H can benefit from increasing the

amount of adaptation data. Hence, this peak alignment adaptation method can

perform well for varying amounts of adaptation data.
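The sketch below illustrates, under simplifying assumptions (diagonal covariances, a dictionary of per-component posteriors, toy dimensions), how the mean adaptation of Eq. (2.19) and the closed-form bias estimate of Eq. (2.21) could be computed; it is not the thesis implementation, and the covariance transform H of Eq. (2.22) is omitted for brevity.

```python
import numpy as np

def estimate_bias(obs, zeta, A, means, variances):
    """obs: (T, d); zeta[g]: (T,) posteriors; means/variances: per component g."""
    d = obs.shape[1]
    lhs = np.zeros(d)                       # sum_{g,t} zeta * Sigma^-1 (diagonal)
    rhs = np.zeros(d)                       # sum_{g,t} zeta * Sigma^-1 (o - A mu)
    for g, mu in enumerate(means):
        inv_var = 1.0 / variances[g]
        resid = obs - (A @ mu)              # (T, d)
        w = zeta[g][:, None]                # (T, 1) posteriors for component g
        lhs += (w * inv_var).sum(axis=0)
        rhs += (w * inv_var * resid).sum(axis=0)
    return rhs / lhs                        # Eq. (2.21), diagonal-covariance case

def adapt_means(A, b, means):
    return [A @ mu + b for mu in means]     # Eq. (2.19)

# toy usage
T, d, G = 50, 13, 4
rng = np.random.default_rng(1)
obs = rng.normal(size=(T, d))
means = [rng.normal(size=d) for _ in range(G)]
variances = [np.ones(d) for _ in range(G)]
zeta = {g: np.full(T, 1.0 / G) for g in range(G)}
A = np.eye(d)
b = estimate_bias(obs, zeta, A, means, variances)
adapted = adapt_means(A, b, means)
```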


2.4 Regression-tree based Speaker Adaptation

2.4.1 Global vs. Regression-tree based Peak Alignment

Several different approaches can be applied to perform formant-like peak align-

ment adaptation. In [30], speaker adaptation was employed as a global peak

alignment, i.e. to estimate the average F3 over all the adaptation data and gen-

erate the transformation matrix A according to Eq. (2.10) with the same warping

factor (Eq. (2.17)) for all model units. When performing adaptation, all means

of the HMM parameters share the same transformation A.

On the one hand, since there is only one parameter (α) to be estimated, this global method has the potential for good performance with limited adaptation data. On the other hand, a simple global warping factor cannot take advantage of increasing adaptation data, for which fine and detailed modeling abilities are required.

Another argument is that, as shown in Section 2.2.2, there are various lev-

els of mismatch in formant structures. Using only one global average warping

factor may not reduce the spectral mismatch uniformly for all phonemes. Since

different phonemes may have different warping factors, we can use phoneme- or

state-level warping factors to perform adaptation. This is the basic idea for the

regression-tree based spectral peak alignment adaptation, i.e. to align similar

(close in acoustic space) components in a similar way. This extension from global

to regression-tree based peak alignment is similar to the expansion of MLLR

from a global transform to multiple transforms, especially when the adaptation data

increase.

Two methods are investigated to define regression classes: phoneme-based

(using phoneme-level formants) and Gaussian mixture-based (using state-level


formants). The following sections discuss these two methods in detail.

2.4.2 Phoneme based Regression Tree

In this phoneme-based method, regression classes (units) are classified based on

phonetic knowledge and/or data-driven methods. For example, according to pho-

netic knowledge, phonemes can first be categorized into vowels and consonants,

and then consonants can be further classified as voiced or unvoiced; vowels can

further be clustered according to their phoneme-level F3 values using data-driven

methods. All model parameters for phoneme units with similar acoustic char-

acteristics (phoneme-level formants) are placed together in the same regression

class. Preliminary experiments showed that phonetic knowledge offers better per-

formance when adaptation data are limited to less than 5 utterances, while the

data-driven approach is superior when more data are available. Therefore, we

chose to combine the two techniques.

Figure 2.6 shows an example of a regression tree based on combined phonetic knowledge and data-driven methods, with eight base classes (terminal nodes) denoted as {2, 3, 4, 6, 7, 8, 9, 10}. Each phoneme belongs to one specific base class. During adaptation, the number of base classes is determined dynamically depending on the

amount of adaptation data. Since unvoiced consonants have no clear formant

structure in their spectra, the transformation matrix A for unvoiced consonants

is determined by the average F3 over all voiced consonants in the adaptation

data.

2.4.3 Gaussian Mixture based Regression Tree

Since formant frequencies change gradually from frame to frame, it may be helpful to use even lower-level formants in adaptation. In HMM models,


each phoneme unit has several states, and states are represented with Gaussian

mixtures. Hence, we can consider state-level formants, and define the regression

tree based on Gaussian mixtures of the states. In this method, Gaussian mixture

components (means and covariances) are clustered based on a measure of simi-

larity. In each class, the state-level F3 is estimated and averaged, and spectral

peaks are then aligned with the same warping factor. Similar to global average

and phoneme-level F3 warping factors (Eqs. (2.17) and (2.18)), the state-level warping factor is defined as

α = F3, state m of /ph/ from new speaker / F3, state m of /ph/ from standard speaker    (2.23)

where F3, state m of /ph/ is the state-level F3 value of state m in phoneme /ph/.

For both phoneme-based and Gaussian mixture-based methods, regression

trees are constructed from the speaker-independent training data and are independent of new speakers. The tree is constructed with a centroid splitting algorithm using a Euclidean distance measure. Each terminal node (base class) of the tree specifies a particular component grouping: phonemes for the phoneme-

based regression tree and states for the Gaussian mixture-based regression tree.

The following sections will evaluate and compare the performance of these differ-

ent approaches of peak alignment adaptation (PAA).
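As an illustration of the centroid-splitting construction mentioned above, the following sketch clusters Gaussian mean vectors into base classes with a Euclidean distance measure; the binary-splitting details (perturbation size, number of refinement passes, and doubling the number of centroids until at least N are available) are assumptions, not the exact procedure of this work.

```python
import numpy as np

def centroid_split(means, n_classes, n_iter=10, eps=0.01, seed=0):
    """Binary centroid splitting with Lloyd-style refinement (Euclidean)."""
    means = np.asarray(means)
    rng = np.random.default_rng(seed)
    centroids = [means.mean(axis=0)]
    while len(centroids) < n_classes:
        # split every current centroid into a +/- perturbed pair (doubling)
        centroids = [c + s * eps * rng.standard_normal(c.shape)
                     for c in centroids for s in (+1.0, -1.0)]
        for _ in range(n_iter):                     # refine cluster centroids
            d = np.linalg.norm(means[:, None, :] - np.array(centroids), axis=2)
            assign = d.argmin(axis=1)
            centroids = [means[assign == k].mean(axis=0)
                         if np.any(assign == k) else centroids[k]
                         for k in range(len(centroids))]
    d = np.linalg.norm(means[:, None, :] - np.array(centroids), axis=2)
    return d.argmin(axis=1)                          # base class of each Gaussian

# toy usage: 100 thirteen-dimensional Gaussian means grouped into 8 base classes
classes = centroid_split(np.random.randn(100, 13), n_classes=8)
```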

2.5 Integration of Peak Alignment with MLLR

As will be shown in the next section, when adaptation data are limited, both ap-

proaches of PAA, namely phoneme-class and Gaussian mixture-class based, work

well. With few parameters to estimate, PAA can handle one of the limitations

of MLLR: unreliable parameter estimation for limited data. The performance of

PAA, however, tends to saturate when more adaptation data become available,


Figure 2.6: An example of a regression tree using combined phonetic knowledge and data-driven techniques for the phoneme-based approach. Phonemes are first categorized by phonetic knowledge (vowels into monophthongs and diphthongs, consonants into voiced and unvoiced), and then further clustered according to their estimated F3 values.

which is most obvious for global PAA. To some extent, this problem can be al-

leviated by increasing the number of regression classes. Since MLLR is able to

offer better performance when more data are available, we attempt to integrate

peak alignment with MLLR, i.e. to perform peak alignment first, followed by

standard MLLR.

Given the peak alignment matrix A and the additive bias vector b, the Gaus-

sian mixture components of speaker specific models are re-estimated using the

EM algorithm [5]. The auxiliary function is defined as

Q_N(\lambda, \bar{\lambda}) = \sum_{j,k} \sum_{t=1}^{T} \zeta_t(j,k)\, \log \mathcal{N}\big(o(t);\, A\mu_{jk} + b,\, \Sigma_{jk}\big)    (2.24)

where N (o(t);Aμjk + b; Σjk) is the kth Gaussian mixture of state j. The max-


imum likelihood estimation of μjk and Σjk can be derived from

\partial Q_N(\lambda, \bar{\lambda}) / \partial \mu_{jk} = 0    (2.25)

\partial Q_N(\lambda, \bar{\lambda}) / \partial \Sigma_{jk} = 0    (2.26)

respectively, which give

\mu_{jk} = \left\{ \sum_{t=1}^{T} \zeta_t(j,k)\, A^{T}\Sigma_{jk}^{-1}A \right\}^{-1} \left\{ \sum_{t=1}^{T} \zeta_t(j,k)\, A^{T}\Sigma_{jk}^{-1}\big(o(t) - b\big) \right\}    (2.27)

\Sigma_{jk} = \frac{ \sum_{t=1}^{T} \zeta_t(j,k)\, \big(o(t) - \bar{\mu}_{jk}\big)\big(o(t) - \bar{\mu}_{jk}\big)^{T} }{ \sum_{t=1}^{T} \zeta_t(j,k) }    (2.28)

where

\bar{\mu}_{jk} = A\mu_{jk} + b    (2.29)

and μ̄jk represents the adapted speaker-specific Gaussian means.
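A compact sketch of the mean re-estimation of Eq. (2.27) for a single Gaussian with a diagonal covariance is shown below (since A and Σjk do not depend on t, the sums factor); the inputs are toy values, not data from the experiments.

```python
import numpy as np

def reestimate_mean(obs, zeta, A, b, variance):
    """obs: (T, d), zeta: (T,), variance: diagonal of Sigma_jk as a (d,) array."""
    Sigma_inv = np.diag(1.0 / variance)
    G = zeta.sum() * (A.T @ Sigma_inv @ A)            # sum_t zeta A^T Sigma^-1 A
    k = A.T @ Sigma_inv @ ((zeta[:, None] * (obs - b)).sum(axis=0))
    return np.linalg.solve(G, k)                      # Eq. (2.27)

# toy usage
T, d = 40, 13
rng = np.random.default_rng(2)
obs = rng.normal(size=(T, d))
zeta = rng.random(T)
A, b = np.eye(d), np.zeros(d)
variance = np.ones(d)
mu_new = reestimate_mean(obs, zeta, A, b, variance)
```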

This now can be viewed as a special case of standard speaker adaptive training

(SAT) with only one speaker-dependent model [14, 52]. However, unlike the

statistical estimation of the transforms as in SAT, which requires more adaptation

data, the transformation matrix A is generated deterministically. Therefore,

peak alignment has the potential for better performance than SAT with limited

adaptation data. The integration with MLLR, denoted as PSAT in the following

experiments, can be applied to global or regression-tree based peak alignment.

2.6 Experimental Results

2.6.1 Experimental Setup

Two different recognition tasks were carried out to evaluate the performance of

the proposed algorithm. One was a medium vocabulary recognition task using the


DARPA Resource Management RM1 continuous speech database [53], and an-

other was a connected digits recognition task using the TIDIGITS database [54].

For the two databases, speech signals were first downsampled to 8 kHz and then segmented into 25 ms frames with a 10 ms shift. Each frame was parame-

terized with a 39-dimensional feature vector consisting of 12 static MFCCs plus

log energy, and their first-order and second-order derivatives.

For the RM1 database, triphone acoustic models were trained on the speaker

independent (SI) portion of the database (72 speakers, 40 utterances from each

speaker). Each triphone model had 3 states with 6 Gaussian mixtures per state.

This set of SI models produced a baseline performance of 89.2% word recognition

accuracy on the test set (10 speakers, 300 utterances from each speaker). Since

the focus here is on rapid adaptation, for each speaker the adaptation data were

limited to no more than 30 utterances for RM1 (or 35 digits for TIDIGITS),

which corresponds to less than 2 minutes for RM1 (or 30 seconds for TIDIGITS).

Adaptation data consisted of 1, 4, 7, 10, 15, 20, 25 or 30 utterances for each

speaker, and they were randomly chosen from the speaker dependent portion of

the database.

For the TIDIGITS task, acoustic models were trained on 55 adult male speak-

ers and then tested on 10 children (5 boys and 5 girls) with 77 utterances consist-

ing of 1, 2, 3, 4, 5, or 7 digits for each speaker. Acoustic HMMs were monophone-

based with 4 states for vowels and 2 states for consonants, and 6 Gaussian mix-

tures per state. The baseline word recognition accuracy was 38.9%. For each

child, the adaptation data, which consisted of 1, 5, 10, 15, 20, 25, 30 or 35 digits,

were randomly chosen from the test set and not used in the test.

In all adaptation experiments, a forward-backward alignment of the adapta-

tion data was first implemented to assign each frame to a regression class (global


adaptation can be considered as a special case of regression classes, with only one

class). For each class, formant-like peaks were then estimated. Depending on

the amount of the adaptation data, different numbers of regression classes were

experimentally tested, and the best performances were selected for comparison.

Fig. 2.7 describes the steps for both supervised (steps 2-4) and unsupervised

(steps 1-4) peak alignment adaptation.

Gaussian mixture models were used to estimate formant-like peaks [55]. In

the 4 kHz frequency range, adult speakers were observed to typically have four

formants, while children had only three. Therefore, in the peak alignment proce-

dure, four Gaussian mixtures were used for adults and three for children.

For comparison, speaker-specific VTLN was implemented based on a grid

search over [0.8, 1.2] with a stepsize of 0.02. The scaling factor producing max-

imal average likelihood was used to warp the frequency axis [16]. Since VTLN

is usually applied through warping the power spectrum, the Jacobian determi-

nant is difficult to compute due to non-invertible Mel filter-bank operations. The

Jacobian compensation is approximated by using the determinant of the trans-

formation matrix A (|det A|).
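A schematic version of this grid search is sketched below; the SI-model scoring function loglik_fn and the warp-matrix builder warp_matrix_fn (for example, one like the Section 2.1 sketch) are hypothetical inputs assumed to be supplied by the caller, and the Jacobian term follows the |det A| approximation described above.

```python
import numpy as np

def grid_search_alpha(feats, loglik_fn, warp_matrix_fn):
    """feats: (T, d) static cepstra; returns the warping factor with best score."""
    best_alpha, best_score = 1.0, -np.inf
    for alpha in np.arange(0.80, 1.2001, 0.02):       # grid [0.8, 1.2], step 0.02
        A = warp_matrix_fn(alpha)                     # d x d linearized warp
        warped = feats @ A.T                          # y_c ~ A x_c per frame
        sign, logdet = np.linalg.slogdet(A)
        if sign <= 0:                                 # skip degenerate transforms
            continue
        score = loglik_fn(warped) + feats.shape[0] * logdet   # + T log|det A|
        if score > best_score:
            best_alpha, best_score = alpha, score
    return best_alpha
```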

2.6.2 Comparison of Global and Regression-tree based PAA versus

MLLR and VTLN

Experiments were first conducted to compare the performance of global (GPAA),

phoneme-class (PPAA) and Gaussian mixture-class (MPAA) based PAA with

different numbers of adaptation utterances (or digits). In all experiments, unless otherwise specified, bias and diagonal covariance adaptation were performed for

PAA. The block-diagonal MLLR adaptation with the optimal number of trans-

forms was also performed for comparison. Figs. 2.8 and 2.9 illustrate the per-


For unsupervised adaptation, perform steps 1-4; for supervised adaptation, perform steps 2-4.

1. For unsupervised adaptation only: generate transcriptions

• Locate voiced segments using cepstral peak analysis

• Estimate formant-like peaks in the spectrum

• Calculate scaling factor α (Eq. (2.17))

• Generate transformation matrix A (Eq. (2.10))

• Perform spectral peak alignment for each Gaussian mixture mean vec-

tor (without adaptation of bias and covariance) (Eq. (2.30))

• Generate recognition hypotheses (with the partially adapted means)

as transcriptions of the adaptation data

2. Dynamically determine the number of regression classes N based on the

amount of adaptation data and cluster model parameters into N classes

C1, C2, ..., CN

3. Align with transcriptions to assign speech frames to regression classes

4. For each regression class Ci, i ∈ {1, 2, ..., N}

• Estimate formant-like peaks in the spectrum

• Calculate scaling factor αi (Eq. (2.18) or (2.23))

• Generate transformation matrix Ai (Eq. (2.10))

• Estimate the bias vector bi and covariance transformation matrix Hi

(Eq. (2.21), (2.22))

• Adapt mean and covariance (Eq. (2.19), (2.20))

Figure 2.7: The speaker adaptation algorithm using regression-tree based spectral peak alignment, for both supervised and unsupervised adaptation.

formance of GPAA, PPAA, MPAA, VTLN and MLLR. Not shown in the figures

are recognition accuracies of MLLR with one adaptation utterance using RM1

(88.2%), and with one and five adaptation digits using TIDIGITS (40.5% and

57.0%, respectively).

Figure 2.8: Word recognition accuracy (%) of VTLN, MLLR, G-PA, GPAA, PPAA and MPAA versus the number of adaptation utterances, using RM1 for supervised adaptation.

Fig. 2.8 shows that all three PAA methods can greatly improve the perfor-

mance over the baseline (with no adaptation) in all cases; VTLN and GPAA

provide the best performance with only one adaptation utterance, while PAA

methods outperform VTLN in all other cases. MLLR, however, may produce

worse performance than the baseline when only a small amount of adaptation

data is available. For example, with one adaptation utterance, MLLR produces

recognition accuracy of 88.2%, about one percent lower than the baseline. Com-

pared to MLLR, PAA performs significantly better for limited adaptation data,


Figure 2.9: Word recognition accuracy (%) of VTLN, MLLR, G-PA, GPAA, PPAA and MPAA versus the number of adaptation digits, using TIDIGITS for supervised adaptation.

with on average about 13.0% reduction of word error rate (WER) over MLLR

for one and four adaptation utterances. With increasing adaptation data, MLLR

offers better results than GPAA when the adaptation data are more than 15

utterances, while MPAA can outperform MLLR for 1-25 adaptation utterances.

Since the covariance adaptation of PAA and MLLR is the same, the main dif-

ferences between speakers seem to be characterized by the means of Gaussian

components.

Among the three PAA methods, MPAA performs the best, and significant im-

provements can be achieved by using regression-tree based PAA over global PAA.

On average, more than 11% WER reduction is obtained with MPAA over GPAA.

The advantage of MPAA over GPAA becomes greater with increasing adaptation

data. For the two regression-tree based PAAs, MPAA performs slightly better


than PPAA in all cases. This is because these three PAA methods work at dif-

ferent levels to reduce spectral mismatch: MPAA at the state level, PPAA at

the phoneme level and GPAA at the global level. As discussed in Section 2.2.2

and 2.4, lower-level (phoneme- or state-level) alignment is expected to be more

powerful than the global average to capture subtle differences between phonemes

or even states, provided that the parameters are reliably estimated. Compared

to global average formants used in GPAA, phoneme-level (PPAA) and state-

level (MPAA) formants need to be estimated with more parameters and thus

require more adaptation data. This explains the performance curves of GPAA,

PPAA and MPAA in Fig. 2.8. Experimental results for TIDIGITS (Fig. 2.9)

demonstrate similar trends to Fig. 2.8. This similarity shows that performance

improvements achieved by PAA are consistent across different tasks.

2.6.3 Discussion on Comparison of RM1 and TIDIGITS

Comparing Figs. 2.8 and 2.9, it can be noticed that improvements for TIDIGITS

are more significant than those for the RM1 database: with only one adaptation digit

(or utterance), more than 80.0% WER reduction over the baseline was obtained

for TIDIGITS, while for RM1 the WER reduction over the baseline was about

10.5%.

The more significant improvements with the TIDIGITS database can be ex-

plained as follows. The basic idea for PAA is to reduce spectral mismatch by

aligning formant-like peaks using estimated F3 values. The performance im-

provement will be more obvious if the F3 difference between the new speaker and

the standard speaker is significant, which is the case for TIDIGITS: for adult

males the typical F3 is about 2500 Hz, and for children it is about 3100 Hz. On the

other hand, if the F3 of the new speaker is very close to that of the standard


speaker as with the RM1 database which has only adult speakers, the effect of

peak alignment will be less pronounced. An extreme case is when the new speaker

has exactly the same global average F3 value as the standard speaker. In this

case, the global average warping factor α will be 1 (Eq. (2.17)), and the warping

matrix W will be an identity matrix (Eq. (2.13)), which will result in an identity

transformation matrix A (Eq. (2.10)) for global peak alignment (GPAA).² Thus,

theoretically, in this case global peak alignment will have little effect on reducing

spectral mismatch, resulting in marginal, if any, performance improvement. This

is also supported by experimental results with the RM1 database using global

peak alignment with only A: the speaker with α closest to 1 shows only 1.5%

average improvement, while the speaker with the largest α achieves over 10%

improvement.

Regression tree based peak alignment may still perform well even in the case

where global peak alignment fails to provide satisfactory improvement, since re-

gression tree based peak alignment utilizes phoneme or state level formant infor-

mation (instead of a global average as in global peak alignment), and it is unlikely that all phoneme- or state-level formant values of two different speakers are identical.

This is another advantage of regression tree based peak alignment over global

peak alignment.

2.6.4 Performance of the Linearization Approximation

Since PAA is based on an approximate linearization of VTLN, it is also of interest

to study how good this approximation is. We compared the performance of

VTLN and GPAA using only A (without bias and covariance adaptation),³

²Strictly speaking, A will not be the identity matrix, due to the approximation made in the linearization of VTLN. However, A will be very close to identity, with diagonal entries very close to 1 and off-diagonal entries close to 0.

³This configuration of GPAA can be viewed as a direct linear approximation of VTLN.


denoted as G-PA in Figs. 2.8 and 2.9. G-PA performs a little worse than VTLN;

the differences, however, are small. This means that the linearization is a good

approximation to VTLN, and the adaptation of bias and covariance contributes

to the better performance of GPAA.

The peak alignment technique was also compared in [56] with VTLN based

on parameters estimated directly using the maximum likelihood criterion, i.e., α was statistically estimated under the ML criterion instead of being defined as a formant frequency ratio (Eq. (2.17), (2.18) or (2.23)) or being determined using a grid

search. Experimental results showed that GPAA achieves similar performance

to the ML-based VTLN. Peak alignment is, however, more efficient from the

computational point of view. In addition, MPAA outperforms ML-based VTLN

when the adaptation data are more than 5 utterances.

2.6.5 Comparison of PAA, PSAT and MLLR-SAT

In this section, PAA is compared with PSAT, which combines peak alignment with a subsequent MLLR pass. Gaussian mixture-class based peak alignment (MPAA), which performs the best among the three PAA methods, is taken as the reference. PSAT is applied in two ways: based on GPAA (PSAT-GPAA) and on MPAA

(PSAT-MPAA).

The performance of MLLR, MPAA and PSAT is shown in Figs. 2.10 and

2.11. Compared to MPAA, PSAT (both PSAT-GPAA and PSAT-MPAA) shows

better performance with improvements, on average, of about 6% with RM1 and

20% with TIDIGITS; compared to GPAA (Figs. 2.8 and 2.9), the improvements

are even more significant (16% with RM1 and 24% with TIDIGITS). Improve-

ment trends are consistent in all cases especially with more adaptation data. As

to the two PSAT methods, PSAT-GPAA is a little better with a small amount of


adaptation data, while PSAT-MPAA outperforms PSAT-GPAA when the adap-

tation data are more than 10 utterances.

Figure 2.10: Word recognition accuracy (%) of MPAA, MLLR, PSAT and MLLR-SAT versus the number of adaptation utterances, using RM1 for supervised adaptation.

Compared to MLLR, the performance of PSAT-MPAA is superior in all ex-

periments with on average 14% improvement for RM1 and 23% for TIDIGITS,

though the difference becomes small as adaptation data increase. Significance

analysis shows that, at significance levels below 0.05, the improvement of PSAT-

MPAA over MLLR is statistically significant. This indicates that PSAT can take

advantage of PAA for reliable parameter estimations with limited adaptation

data, and of MLLR for statistical parameter estimations with sufficient adapta-

tion data. Another advantage of PSAT is that it can still perform well even when

there is no difference in global average F3 values between the new speaker and

the standard speaker, in which case PSAT-GPAA becomes equivalent to MLLR.⁴

⁴PSAT can be considered as the combination of PAA and MLLR. As discussed in Section 2.6.2, in this case GPAA has little effect on reducing spectral mismatch, and MLLR is credited with the performance improvements.

Figure 2.11: Word recognition accuracy (%) of MPAA, MLLR, PSAT and MLLR-SAT versus the number of adaptation digits, using TIDIGITS for supervised adaptation.

2.6.6 Comparison of PSAT and MLLR-SAT

Since PSAT can be viewed as a special case of MLLR-SAT, which is an alter-

native implementation of SAT through constrained MLLR transformations [14],

it is interesting to compare their performance. The experiments follow the steps

described in [14] and use block diagonal transforms in MLLR-SAT.

The performance of MLLR-SAT is shown in Figs. 2.10 and 2.11. MLLR-SAT

provides better performance than MLLR, decreasing WER by about 10% on av-

erage. However, it performs similarly to MLLR, i.e. they both require a certain

amount of adaptation data (more than 20 utterances) for robust and satisfac-

tory performance. In contrast, PSAT is more robust for limited data, especially


PSAT-GPAA, which achieves more than 17% WER reduction over MLLR-SAT

for one adaptation utterance. As the amount of adaptation data increases, PSAT-MPAA performs better than PSAT-GPAA and provides performance comparable to

MLLR-SAT. From the computational point of view, PSAT is more efficient than

MLLR-SAT, with only several warping factors instead of a full or block diago-

nal matrix A to be estimated. Thus, PSAT is more suitable for rapid adaptation, where the available enrollment data for a new speaker are limited to only a few

utterances.

Another rapid adaptation method is Maximum A Posteriori Linear Regression (MAPLR) [57–59]. MAPLR incorporates prior knowledge into the linear regression adaptation of means and covariances by using the MAP criterion. The hyper-

parameters (parameters needed to describe the prior distribution) are estimated

based on an empirical Bayes (EB) approach [57] and/or the structural informa-

tion of the models [58]. Provided that appropriate priors are chosen, MAPLR

may significantly outperform MLLR. The performance of MAPLR, however, is

highly dependent on the choice of prior distributions [59]. Like MAPLR, prior

knowledge can also be integrated into PAA through the MAP estimation of the

bias b and the covariance transforms H. This work, however, focuses on PAA

in the MLLR framework and leaves the exploration of the PAA in the MAPLR

framework for future work.

2.6.7 Comparison of Supervised and Unsupervised Adaptation

The previous adaptation experiments were implemented in a supervised way, where

the true transcription is known. Unsupervised adaptation can be performed by

first generating the transcription through an initial recognition pass. Before this

initial recognition, global peak alignment (without adaptation of bias and covari-


ance) is conducted to reduce spectral mismatch. According to Eqs. (2.10), (2.13),

(2.14) and (2.17), the generation of matrix A depends only on the warping factor α, which can be estimated from voiced segments and thus requires no tran-

scription knowledge. For each test speaker, formant-like peaks are estimated

from the voiced segments of the adaptation utterance; voicing is detected using

the cepstral analysis technique [60]. Spectral peaks are then aligned with the

average F3, i.e. Gaussian mixture means are adapted according to the following

equation:

μ̄ = Aμ    (2.30)
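The first pass can be sketched as follows: a simple cepstral-peak voicing test (a stand-in for the cepstral analysis of [60], with illustrative frame length, pitch range and threshold) selects voiced frames, after which only the Gaussian means are adapted as in Eq. (2.30).

```python
import numpy as np

def is_voiced(frame, fs=8000, fmin=60.0, fmax=400.0, threshold=0.15):
    """Flag a frame as voiced if the real cepstrum has a peak in the pitch range."""
    spec = np.abs(np.fft.rfft(frame * np.hamming(len(frame)))) + 1e-10
    ceps = np.fft.irfft(np.log(spec))
    lo, hi = int(fs / fmax), int(fs / fmin)           # pitch quefrency range
    return ceps[lo:hi].max() > threshold

def adapt_means_only(A, means):
    return [A @ mu for mu in means]                   # Eq. (2.30), means only

# toy usage: a synthetic harmonic-rich 25 ms frame (200 Hz fundamental, 8 kHz)
t = np.arange(200) / 8000.0
frame = sum(np.sin(2 * np.pi * 200.0 * k * t) for k in range(1, 16))
print(is_voiced(frame))
```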

The performance of supervised and unsupervised adaptation is shown in Ta-

bles 2.1, 2.2, 2.3 and 2.4. It should be noted that the performance listed here

for supervised and unsupervised adaptation was based on different numbers of

regression classes: in all cases, the number of classes for unsupervised adapta-

tion was smaller than that of the corresponding supervised case. For example,

for the RM1 database, when the adaptation data consist of 20 utterances, 105

Gaussian mixture classes were found to give the best results for unsupervised

adaptation, while 150 classes were optimal for supervised adaptation. The num-

ber of regression-tree base classes used in MPAA and PSAT for each testing case

is given in the tables in the row labeled “# of classes”. The optimal number of

base classes for MLLR can be different.

These tables show that, compared to supervised adaptation, unsupervised peak alignment adaptation performs slightly worse in all experimental cases, but the

difference is not large: 0.5% and 0.8% absolute WER increase for PSAT-MPAA

using RM1 with 10 and 30 adaptation utterances, respectively; 0.2% and 0.7%

absolute WER increase for PSAT-MPAA using TIDIGITS with 10 and 35 adap-

tation digits. There are two possible reasons for this small difference. One is that


Table 2.1: Word recognition accuracy (%) using RM1 for supervised adaptation.

                    Number of adaptation utterances
                   1     4     7    10    15    20    25    30
MLLR            88.2  90.8  92.0  93.0  94.0  94.8  95.3  95.9
GPAA            90.8  91.5  92.4  93.3  93.7  94.0  94.3  94.4
MPAA            90.3  91.9  92.8  94.0  94.7  95.1  95.4  95.6
PSAT-MPAA       90.5  92.0  92.9  94.2  95.0  95.6  95.9  96.4
# of classes      10    40    50    75   100   150   175   225

Table 2.2: Word recognition accuracy (%) using RM1 for unsupervised adaptation.

                    Number of adaptation utterances
                   1     4     7    10    15    20    25    30
MLLR            86.5  89.3  90.5  91.5  92.3  93.3  94.4  94.6
GPAA            90.7  91.3  92.2  93.1  93.6  93.9  94.0  94.2
MPAA            88.7  90.2  91.3  93.4  94.0  94.2  94.6  94.9
PSAT-MPAA       89.0  90.7  91.8  93.7  94.6  94.9  95.2  95.6
# of classes       5    20    35    60    80   105   135   195

after the global peak alignment, the partially adapted models produce a high

recognition accuracy and thus an acceptable labeling of the adaptation data.

The other is that with a smaller number of classes, it is more likely for unsuper-

vised adaptation to reduce the effect of misclassified frames (due to the initial

recognition errors) and thus to generate robust estimation for the adaptation pa-

rameters. This explains why the unsupervised GPAA performs almost the same

as the supervised case, especially for the highly mismatched TIDIGITS database

with the differences being less than 0.2% in all cases. Compared to GPAA, unsu-

pervised PSAT-MPAA achieves on average 6.8% and 12.7% WER reduction for

RM1 and TIDIGITS, respectively.


Table 2.3: Word recognition accuracy (%) using TIDIGITS for supervised adaptation.

                       Number of adaptation digits
                   1     5    10    15    20    25    30    35
MLLR            40.5  57.0  88.9  92.8  94.6  95.7  96.6  96.9
GPAA            87.9  93.3  93.5  93.9  94.2  94.4  94.2  94.4
MPAA            88.5  93.9  94.1  94.7  94.9  95.8  95.9  96.3
PSAT-MPAA       88.5  94.0  94.3  94.7  95.0  96.0  96.8  97.4
# of classes       5    25    30    40    55    80   100   125

Table 2.4: Word recognition accuracy (%) using TIDIGITS for unsupervised adaptation.

                       Number of adaptation digits
                   1     5    10    15    20    25    30    35
MLLR            38.9  55.3  88.2  92.3  94.5  95.1  95.9  96.1
GPAA            87.7  93.2  93.4  93.8  94.1  94.3  94.2  94.4
MPAA            86.4  92.3  94.0  94.1  94.5  95.3  95.1  95.2
PSAT-MPAA       86.4  92.3  94.1  94.2  94.7  95.6  96.2  96.7
# of classes       3    20    25    25    30    50    75    95

2.6.8 Significance Analysis

We use the matched-pair test proposed in [61] to analyze whether the performance

differences between MLLR and regression-tree based peak alignment (MPAA) are

statistically significant for both the supervised and unsupervised adaptations.

Tables 2.5 and 2.6 show the significance levels (p-value) of MPAA compared to

MLLR for supervised speaker adaptation with various amounts of adaptation

data.

These tables show that, for a given significance level β = 0.05, the average


Table 2.5: Significance analysis of performance improvements of MPAA over MLLR using RM1 for supervised adaptation.

# of utterances      1      4      7     10     15     20     25     30
p-value          0.007  0.009  0.013  0.018  0.026  0.031  0.039  0.043

Table 2.6: Significance analysis of performance improvements of MPAA over MLLR using TIDIGITS for supervised adaptation.

# of digits          1      5     10     15     20     25     30     35
p-value          0.001  0.003  0.008  0.012  0.027  0.043  0.025  0.034

performance differences between MPAA and MLLR are statistically significant

using both RM1 and TIDIGITS for supervised adaptation. Examining the sig-

nificance levels for different amounts of adaptation data, we can see that the

performance improvements of MPAA over MLLR are more significant for limited

adaptation data (less than 20 utterances). This is due to the deterministically

generated transform A in MPAA versus the unreliable statistically estimated A in MLLR when there is not enough adaptation data. Similar conclusions also

hold for the unsupervised adaptations using both the RM1 database and the

TIDIGITS database.
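For readers who want to reproduce this kind of analysis, a simplified stand-in for the matched-pair test of [61] is sketched below as a paired sign-flip permutation test on per-speaker error counts; it is not the exact segment-level matched-pairs test used here, and the toy counts are made up purely for illustration.

```python
import numpy as np

def paired_permutation_test(errors_a, errors_b, n_perm=100000, seed=0):
    """Two-sided p-value for the mean paired difference under random sign flips."""
    diffs = np.asarray(errors_a, float) - np.asarray(errors_b, float)
    observed = abs(diffs.mean())
    rng = np.random.default_rng(seed)
    signs = rng.choice([-1.0, 1.0], size=(n_perm, diffs.size))
    null = np.abs((signs * diffs).mean(axis=1))
    return float((null >= observed).mean())

# toy per-speaker word error counts for two systems (10 test speakers)
system_a = [34, 40, 29, 37, 42, 31, 36, 39, 30, 35]
system_b = [30, 35, 27, 33, 38, 29, 33, 36, 28, 31]
print(paired_permutation_test(system_a, system_b))
```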

Analysis of PSAT-MPAA versus MPAA does not show significant differences be-

tween these two algorithms. The performance improvements of PSAT-MPAA

over MLLR, however, are statistically significant in all the testing cases, at sig-

nificance levels less than 0.05.


2.7 Summary and Conclusion

Various levels of spectral mismatch in formant structures cause ASR systems to

perform unsatisfactorily. Regression-tree based spectral peak alignment is proposed as a rapid speaker adaptation method to reduce phoneme- and state-level spectral

mismatch. This method is investigated in an MLLR-like framework based on

the linearization of VTLN. In the proposed approach, the transformation matrix

for Gaussian mixture means is generated deterministically by aligning phoneme-

and state-level formant-like peaks in the spectrum; adaptation of the bias and

covariance is estimated using the EM algorithm. This method can be viewed as a

combination of VTLN and MLLR. On the one hand, like VTLN, the transforma-

tion matrix for the means has fewer parameters to be estimated than in MLLR, which is

advantageous for limited adaptation data. On the other hand, like MLLR, biases

and covariances are adapted using the EM algorithm. Statistical estimation has

an advantage when large amounts of adaptation data are available. Hence, the

proposed approach has the potential of good performance for both limited and

large amounts of adaptation data.

The performance of this peak alignment approach is evaluated on both medium

vocabulary (the RM1 database) and connected digits recognition (the TIDIGITS

database) tasks. In both tasks, experimental results show that through peak

alignment adaptation significant performance improvements can be achieved even

for very limited adaptation data, with state-level peak alignment (MPAA) per-

forming the best. When sufficient adaptation data are available, peak alignment

adaptation offers results similar to or slightly worse than MLLR. The PSAT

method which integrates peak alignment with MLLR, however, shows better

performance than MLLR and comparable performance with MLLR-SAT in all

cases. Another merit of this regression-tree based spectral peak alignment is that


when implementing adaptation in an unsupervised way, only a slight performance

degradation is observed compared to supervised adaptation.


CHAPTER 3

Speaker Normalization based on Subglottal

Resonances

Speaker normalization typically focuses on variabilities of the supra-glottal (vocal

tract) resonances, which constitute a major cause of spectral mismatch. Recent

studies show that the subglottal airways also affect spectral properties of speech

sounds. This chapter presents a speaker normalization method based on estimat-

ing the second subglottal resonance. Since the subglottal airways do not change

for a specific speaker, the subglottal resonances are independent of the sound type

(i.e., vowel, consonant, etc.) and the speaking language, and remain constant for

a given speaker. This context-free property makes the proposed method suitable

for limited data speaker normalization and cross-language adaptation.

3.1 Subglottal Acoustic System and Its Coupling to Vocal

Tract

3.1.1 Subglottal Acoustic System

The subglottal acoustic system refers to the acoustic system below the glottis,

which consists of the trachea, bronchi and lungs. Similar to the vocal tract, the

acoustic input impedance of the subglottal system is characterized by a series

of poles (or resonances) and zeros. Unlike the supraglottal system, however,


the configuration of the subglottal system is essentially fixed and thus the poles

and zeros are expected to remain constant for a given speaker. Like formant

frequencies, subglottal resonances are generally higher for female speakers than

for male speakers, and there are substantial individual differences from speaker to

speaker. It has been shown that the lowest three subglottal resonances, namely

Sg1, Sg2 and Sg3, are around 600, 1450 and 2200 Hz, respectively, for adult males, and 700, 1600 and 2350 Hz for adult females [62].

3.1.2 Coupling between Subglottal and Supraglottal Systems

When the glottis is open, the subglottal system is coupled to the vocal tract and

can influence the speech sound output. Fig. 3.1 shows a schematic model of vocal

tract coupling to the trachea through the glottis and its equivalent circuit model,

where Zl is the impedance of the subglottal system, Zg is the glottal impedance,

Zv is the impedance looking into the vocal tract from the glottis, Ug is the volume

velocity through the glottis, and Uv is the airflow into the vocal tract. Coupling

between the subglottal and supraglottal airways is thought to occur primarily

when the glottis is open, such as during a voiceless consonant or the open phase

of glottal vibration in a voiced sound, although [63] and [64] suggest that coupling

may also occur when the vocal folds are closed, either by means of a posterior

glottal opening or the vocal fold tissue itself.

During coupling, each subglottal resonance contributes a pole-zero pair to the

speech spectrum, in addition to the vocal tract pole-zero pairs. The frequency of

the zero is the same as that of the subglottal resonance, while the pole is shifted

upward in frequency away from the resonance and depends somewhat on the

vocal tract configuration. This is because the zero is a function only of the part

of the entire system behind the source (that is, the subglottal airways), while the


Figure 3.1: Schematic model of the vocal tract with acoustic coupling to the trachea through the glottis (a), and the equivalent circuit model (b). Adapted from [62].

pole is a function of the entire system, including the subglottal and supraglottal

airways [63, 65].

3.1.3 Effects of Coupling to Subglottal System

The pole-zero pair introduced in the speech spectrum around Sg2 generally falls

within the range of 1300 to 1500 Hz for adult males, and between 1400 and 1700

Hz for adult females [62]. It is somewhat higher in frequency for children [66].

When F2 crosses the Sg2 pole-zero pair, F2 can jump in frequency or diminish

in amplitude, or both, resulting in a discontinuity in the F2 trajectory [65]. This

is illustrated in Fig. 3.2 for an eight-year-old girl speaking the word boy, and

it is schematically represented in Fig. 3.3. In both figures F2 rises from a low

frequency to a high frequency, crossing the Sg2 pole-zero pair along the way. The


F2 discontinuity in Fig. 3.2 is marked by a diminished amplitude in the vicinity

of the zero. The Sg2 pole has a very low amplitude except during the time when

F2 is nearby. In Fig. 3.2 the diffuse energy between F2 and the zero at 250 ms

is likely due to the Sg2 pole, its amplitude decreasing as F2 continues to rise.


Figure 3.2: Spectrogram for the word boy from an eight-year-old girl. The second

subglottal resonance Sg2 for this speaker is at 1920 Hz.

3.1.4 Subglottal Resonances and Phonological Distinctive Features

Recent studies [67–69] have shown that the acoustic contrasts for some phonologi-

cal distinctive features are dependent on the subglottal resonances. As illustrated

in Fig. 3.4, for example, the vowel feature [back]¹ is dependent on the frequency of Sg2, such that a vowel with F2 > Sg2 is [-back] and a vowel with F2 < Sg2 is [+back].

¹The place of articulation feature [+/-back] specifies the tongue positions during speech production: [+back] segments are produced with the tongue dorsum bunched and retracted slightly to the back of the mouth, while [-back] segments are bunched and extended slightly forward.

Figure 3.3: Illustration of the F2 discontinuity caused by Sg2. The bold solid line corresponds to the most prominent spectral peak (F2), which has a jump in frequency and a decrease in amplitude when F2 is crossing the subglottal resonance Sg2. The dotted line represents the Sg2 pole, which varies somewhat in frequency and amplitude when F2 is nearby. The horizontal thin solid line represents the Sg2 zero, which is roughly constant. Adapted from [62].

The ability of Sg2 to underlie the distinctive feature [back] is likely

derived from the fact that Sg2 is roughly constant for a given speaker. Subglot-

tal resonances could potentially be affected by lung volume, larynx height, and

glottal configuration. Lung volume has been shown not to significantly affect

the subglottal resonances in one study [70], and reported accelerometer measure-

ments of subglottal resonances across utterances (in which phonetic content was

varied and voice quality was uncontrolled - both of which may affect laryngeal


height and glottal configuration) have had standard deviations on the order of 30

Hz or less [65]. Thus, although the influence of lung volume, larynx height, and

glottal configuration on subglottal resonances invites further research, the avail-

able evidence appears to indicate that subglottal resonances are roughly constant

under normal speaking conditions.

For this reason, Sg2 might be useful in speaker normalization, since it is

context independent but speaker dependent. Sg1 and Sg3 have also been claimed

to play a role in distinguishing different classes of speech sounds, but Sg2 has been

more thoroughly studied. In this paper, therefore, we focus on Sg2 estimation

and its application to speaker normalization.

3.2 Estimating the Second Subglottal Resonance

3.2.1 Estimation based on Frequency Discontinuity

3.2.1.1 A simple detection algorithm Sg2D1

As noted above, when F2 crosses Sg2, there is a discontinuity in the F2 trajectory.

An automatic Sg2 detector (Sg2D1) is developed based on the frequency discon-

tinuity. The Snack sound toolkit [71] is used to generate the F2 trajectory. The

tracking parameters are specifically tuned to provide reliable F2 tracking results

on children’s speech. Manual verification and/or correction are applied through

visually checking the tracking contours against the spectrogram. (Note that this

method is limited by the accuracy of the formant tracker, which is known to

encounter difficulties in high-pitched speech such as that produced by children.)

The F2 discontinuity is detected based on the smoothed first order difference of

the F2 trajectory, as shown in Fig. 3.5. If the F2 values on the high and low

frequency side of the discontinuity are F2high and F2low, respectively, then the algorithm estimates Sg2 as:

\hat{Sg2} = \frac{F2_{high} + F2_{low}}{2}        (3.1)

If no such discontinuity is detected, the algorithm uses the mean F2 over the utterance. In many such cases, such as during a monophthong, F2 is consistently above or below Sg2, and the mean F2 value is either too high or too low. Thus, the estimated Sg2 values are dependent on the speech sound analyzed.

Figure 3.4: Illustration of the relative positions of vowel formants F1 (·), F2 (+) and F3 (x) and the subglottal resonances (Sg1, Sg2 and Sg3) for an adult male speaker. For the vowels /i, I, E, æ/ F2 > Sg2, and they are therefore [-back]. For the vowels /a, 2, O, U, u/ F2 < Sg2, and they are therefore [+back]. Adapted from [67].
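As a rough illustration of the Sg2D1 rule (a minimal sketch, not the implementation used here), assuming an F2 track with one value per frame (in Hz) is already available from a formant tracker; the smoothing length and jump threshold below are illustrative placeholders:

import numpy as np

def sg2d1(f2_track_hz, threshold_hz=80.0, smooth_len=5):
    """Sketch of Sg2D1: estimate Sg2 from an F2 discontinuity (Eq. 3.1)."""
    f2 = np.asarray(f2_track_hz, dtype=float)
    # Smoothed first-order difference of the F2 trajectory.
    diff = np.abs(np.diff(f2))
    smoothed = np.convolve(diff, np.ones(smooth_len) / smooth_len, mode="same")
    jumps = np.where(smoothed > threshold_hz)[0]
    if jumps.size == 0:
        # No discontinuity detected: fall back to the mean F2 (content dependent).
        return float(np.mean(f2))
    k = jumps[np.argmax(smoothed[jumps])]            # most prominent discontinuity
    f2_low, f2_high = sorted((f2[k], f2[k + 1]))
    return 0.5 * (f2_high + f2_low)                  # Eq. (3.1)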

Furthermore, discontinuities in F2 may arise from other factors besides the subglottal resonances, including pole-zero pairs from the interdental spaces [72]. These discontinuities occur a few hundred Hz higher than Sg2 discontinuities, but are sometimes more prominent than Sg2 discontinuities and can therefore be mistakenly detected as Sg2.

Figure 3.5: An example of the detection algorithm: the frame-by-frame F2 track (with F2low, F2high, the detected Sg2, and the manually measured Sg2), and the smoothed first-order difference of the F2 track with the detection threshold.

3.2.1.2 An improved detection algorithm Sg2D2

To address both issues, an improved Sg2 estimation algorithm (Sg2D2) is then

developed. There are two main differences between Sg2D2 and Sg2D1, namely:

a. Sg2D2 applies an empirical formula to guide the discontinuity search, which

serves as a starting point for the search, and also as a back-off point in cases

where no discontinuities are detected.

b. Sg2D2 uses a statistical method to estimate Sg2 from F2high and F2low, instead

of simply using the average as in Sg2D1.


The Sg2D2 algorithm works as follows. It first detects F3 and obtains an

estimate of Sg2 using a formula derived in [63]:

\tilde{Sg2} = 0.636 \times F3 - 103        (3.2)

Note that the derivation of this formula is based on a linear regression on chil-

dren's speech data for which simultaneous accelerometer recordings were available, and its extension to adult speech may still need further refinement.

The algorithm then searches for a discontinuity within ±100 Hz of this esti-

mate using the original algorithm. The range of ±100 Hz is chosen based on the calculated standard deviations of Sg2 on the calibration data. If no discontinuity in this range is found, the estimate \tilde{Sg2} from Eq. (3.2) is used. If a discontinuity is found, Sg2 is

estimated using the following equation:

\hat{Sg2} = \beta \times F2_{high} + (1 - \beta) \times F2_{low}        (3.3)

where β is a weight in the range (0, 1) that controls the closeness of the detected

Sg2 value to F2high. The optimal value of β is calibrated over the data described

below using the minimum mean square error criterion:

\beta = \arg\min E\left\{ (\hat{Sg2} - Sg2)^2 \right\}        (3.4)

and is found to be 0.65 in our experiments.
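A comparable sketch of the Sg2D2 logic, under the assumption that an F2 track and an average F3 value are available; the discontinuity threshold is again an illustrative placeholder, while the constants 0.636, 103, ±100 Hz, and β = 0.65 follow Eqs. (3.2)-(3.4):

import numpy as np

def sg2d2(f2_track_hz, f3_mean_hz, beta=0.65, search_range_hz=100.0, threshold_hz=80.0):
    """Sketch of Sg2D2: F3-guided discontinuity search (Eqs. 3.2 and 3.3)."""
    sg2_tilde = 0.636 * f3_mean_hz - 103.0           # Eq. (3.2): starting and back-off point
    f2 = np.asarray(f2_track_hz, dtype=float)
    diff = np.abs(np.diff(f2))
    best, best_jump = None, 0.0
    for k in np.where(diff > threshold_hz)[0]:
        f2_low, f2_high = sorted((f2[k], f2[k + 1]))
        candidate = beta * f2_high + (1.0 - beta) * f2_low    # Eq. (3.3)
        # Keep only discontinuities within +/- 100 Hz of the F3-based estimate.
        if abs(candidate - sg2_tilde) <= search_range_hz and diff[k] > best_jump:
            best, best_jump = candidate, diff[k]
    return sg2_tilde if best is None else best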

3.2.2 Estimation based on Joint Frequency and Energy Measurement

The estimation method in Section 3.2.1 is based solely on F2 frequency dis-

continuities. Though simple and efficient, this method may produce unreliable

estimates in cases where F2 discontinuities are not detectable. Speech analysis

studies have shown that discontinuities and attenuations of formant prominence


typically occur near resonances of the subglottal system [65]. Take Sg2 for exam-

ple, which has been more thoroughly studied than other subglottal resonances.

When F2 approaches Sg2, an attenuation of 5-12 dB in F2 energy prominence (E2) is always observed, while an F2 frequency discontinuity in the range of 50-300 Hz often occurs.

Since E2 attenuation always occurs when F2 crosses Sg2, a joint F2 and E2

measurement (Sg2DJ) is developed to improve the reliability of Sg2 estimation.

The detection algorithm works as follows:

1. Track F2 and E2 frame by frame using LPC analysis and dynamic pro-

gramming. The F2 tracking algorithm is similar to that used in Snack [71],

with parameters specifically tuned to provide reliable F2 tracking results on

children’s speech. Manual verification and/or correction is applied through

visually checking the tracking contours against the spectrogram.

2. Search within ±100 Hz around ˜Sg2 (Eq. (3.2)) for F2 discontinuities (F2d)

and E2 attenuation (E2a).

3. Check if F2d and E2a correspond to the same location. Apply decision

rules for Sg2 estimation.

The decision rules are biased toward E2 attenuations, since E2 attenuations are

more correlated with Sg2. If the time information of F2 discontinuity matches

that of E2 attenuation, as shown in Fig. 3.6, Eq. (3.3) is used for Sg2 estimation.

Otherwise, if F2 discontinuities are not detectable or F2 discontinuities and E2

attenuations disagree, as shown in Fig. 3.7, the estimation relies only on the E2 attenuation and uses the average F2 value around E2a as Sg2. If in some extreme

cases E2 attenuation is not detectable, which rarely occurs in our experiments,

then Eq. (3.2) would be used for Sg2 estimation. In other words, in cases where


F2 discontinuities are detectable and agree with energy attenuations, Sg2DJ gives

exactly the same estimates as Sg2D2; in other cases, Sg2DJ and Sg2D2

may provide different estimates.
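The decision rules can be summarized in the following sketch; it assumes frame-synchronous F2 and E2 tracks and helper detectors that return the frame index of the F2 discontinuity and of the E2 attenuation (or None), and the frame tolerance and averaging context are illustrative assumptions rather than values specified in the text:

def sg2dj_decide(f2_disc_frame, e2_att_frame, f2_track_hz, sg2_tilde, beta=0.65,
                 frame_tol=2, context=3):
    """Sketch of the Sg2DJ decision rules, biased toward the E2 attenuation."""
    if e2_att_frame is None:
        return sg2_tilde                              # rare case: fall back to Eq. (3.2)
    if f2_disc_frame is not None and abs(f2_disc_frame - e2_att_frame) <= frame_tol:
        # F2 discontinuity and E2 attenuation agree: use Eq. (3.3), as in Sg2D2.
        lo = min(f2_track_hz[f2_disc_frame], f2_track_hz[f2_disc_frame + 1])
        hi = max(f2_track_hz[f2_disc_frame], f2_track_hz[f2_disc_frame + 1])
        return beta * hi + (1.0 - beta) * lo
    # Otherwise rely on the E2 attenuation: average F2 around its location.
    a = max(0, e2_att_frame - context)
    b = min(len(f2_track_hz), e2_att_frame + context + 1)
    return sum(f2_track_hz[a:b]) / (b - a)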


Figure 3.6: Example of the joint estimation method where F2 discontinuity and

E2 attenuation correspond to the same location (frame 38). Eq. (3.3) is used to

estimate Sg2.

3.3 Calibration of the Sg2 Estimation Algorithm

To verify and calibrate our Sg2 estimation algorithm, acoustic data were collected

from six female children aged 2 to 17 years old (speakers G1-G6 in [63]). The

children were native speakers of American English and all of them except the

youngest were recorded repeating the phrase ‘hVd, say hVd again’ three times for

each of the vowels [i], [I], [E], [æ], [a], [2], [o], [U], and [u]. The subjects also recited the alphabet, counted to 10, and recited a few short sentences.

Figure 3.7: Example when there is a discrepancy between the locations of the F2 discontinuity (not detectable) and the E2 attenuation (at frame 51). The average F2 value within the dotted box is then used as the Sg2 estimate.

The recording

list was presented in random order and verbally prompted by the experimenter.

The youngest speaker (G1) was recorded counting to 10, reciting the alphabet,

and answering questions of the sort ‘What is this?’, in which the experimenter

pointed to his hand or head, for instance. All utterances were recorded in a sound-

isolated chamber using a SHURE BG4.1 uni-directional condenser microphone,

and an accelerometer. Both the speech and accelerometer signals were digitized

at 16kHz. Microphone signals of each speaker were used to measure average F3

and the discontinuity in the F2 track. An independent direct measure of the

average Sg2 for each speaker was obtained from an accelerometer signal. The

accelerometer was attached to the skin of the neck below the larynx so that the


measured vibration of the neck skin is directly related to the acoustic pressure

variations in the air column at the top of the trachea [65,70]. The accelerometer

signal can therefore act as a stand-in for the subglottal input impedance, in which

the subglottal resonances appear as formants in the accelerometer spectrum.

The detection algorithms Sg2D1 and Sg2D2 were calibrated (to estimate dis-

continuity thresholds for both Sg2D1 and Sg2D2, and β for Sg2D2) on data from

two of the recorded children and tested on the remaining four. The values mea-

sured from the accelerometer data were used as the ‘ground truth’ Sg2 frequencies

(henceforth denoted by ‘Sg2Acc’). The average Sg2 estimates (with standard de-

viations) over various vowel contents are shown in Table 3.1. Compared to Sg2D1,

the algorithm Sg2D2 estimates Sg2 much better with less variance across vowels.

The observed standard deviation values of Sg2D2 are similar to those from man-

ually measured Sg2's (Sg2M)² in this study and those found in other studies [66].

²The manual Sg2's were estimated by visually examining the speech spectrogram and then applying Eq. (3.2) or Eq. (3.3), depending on the existence of F2 discontinuities.

Table 3.1: Comparison of Sg2 estimates for two algorithms over various vowel contents, where Sg2M is the manual measurement from the speech spectrum, and Sg2Acc is the ‘ground truth’ measurement from the accelerometer signal. For each algorithm the average Sg2 estimates (Hz) are shown (with standard deviations in parentheses). The two speakers with a * are those used for calibration.

Speaker   Sg2D1        Sg2D2        Sg2M         Sg2Acc
1         2135 (531)   2194 (95)    2193 (97)    2176
2         2115 (334)   1766 (137)   1719 (112)   1646
3         2586 (467)   2718 (143)   2634 (135)   2679
4         2098 (358)   1823 (151)   1781 (129)   1614
5*        2065 (267)   2021 (79)    2013 (76)    1970
6*        1612 (251)   1689 (72)    1681 (65)    1648

The performance of these two algorithms was investigated in more detail for each vowel for two speakers, and the results are shown in Table 3.2 and Fig. 3.8. As stated earlier, if no discontinuity in the F2 track is detected (as for the vowels above the double line in Table 3.2), Sg2D1 uses the mean F2 as Sg2 and thus is highly dependent on vowel content. Sg2D2, on the other hand, uses a formula to estimate Sg2 from F3, which is less content-dependent than F2. In such cases, it can be seen that the formula in Sg2D2 gives much closer estimates to the ground truth, especially for mid and back vowels. For the case when there is a discontinuity in the F2 trajectory (as for the diphthongs below the double line), both algorithms work well when the F2 discontinuity is from Sg2, as for speaker 1. In this case, Sg2D1 gave an estimate within about 70 Hz of the true

Sg2 value, while the Sg2D2 estimate was within less than 10Hz. For speaker 2,

where the most prominent F2 discontinuity was probably from the interdental

space, Sg2D1 gave an estimate hundreds of Hz above the Sg2 value, while Sg2D2

roughly located the correct Sg2 value using Eq. (3.2). Thus, Sg2D2 is less

prone to mistakenly detecting discontinuities not caused by Sg2. In addition to

diphthongs, discontinuities in F2 should also be detectable in certain consonant-

vowel transitions [63]. Since Sg2D2 performs consistently better than Sg2D1, we will focus only on Sg2D2 in the following experiments. As shown in Tables 3.1 and 3.2 and Fig. 3.8, the proposed detector produces Sg2 estimates close to the ground truth. Also, as will be shown in the experimental section (Section 3.5), the estimated Sg2 helps to improve ASR performance on children's speech,

which is of primary interest to us.

The algorithm Sg2DJ was also evaluated and compared to the F2 discontinuity-

based detection algorithm Sg2D2. Improved accuracy was achieved in cases where

F2 discontinuities and E2 attenuations disagree. These F2 discontinuities may be caused by factors other than the subglottal resonances, e.g., probably from the interdental spaces.

Table 3.2: Detailed comparison of Sg2 estimates for the two algorithms on two speakers. For vowels above the double line, there are no discontinuities in the F2 trajectory, and Sg2D1 uses the mean F2 as Sg2 while Sg2D2 uses Eq. (3.2) (\tilde{Sg2}) to make an estimate; for vowels below the double line, the F2 discontinuity is detectable, and Sg2D1 uses Eq. (3.1) while Sg2D2 uses Eq. (3.3). The row ‘Avg.(std)’ shows the mean (and standard deviation) for each algorithm.

            Speaker 1 (age 6)         Speaker 2 (age 13)
            Sg2Acc: 2176 Hz           Sg2Acc: 1646 Hz
Vowel       Sg2D1       Sg2D2         Sg2D1       Sg2D2
[i]         2987        2312          2563        1971
[I]         2515        2306          2439        1909
[e]         2894        2115          2629        1998
[E]         2799        2291          2378        1867
[æ]         2382        2289          2350        1863
[a]         1599        2020          1796        1700
[2]         1687        2243          1948        1704
[o]         1512        2185          1497        1613
[U]         1578        2228          1964        1717
[u]         1739        2071          1825        1631
===========================================================
[au]        1841        2114          1974        1617
[aI]        2103        2170          2072        1709
[OI]        2115        2183          2063        1659
Avg.(std)   2135 (531)  2194 (95)     2115 (334)  1766 (137)

Since for most of the speakers there are no significant performance differences between the two algorithms Sg2DJ and Sg2D2, and since Sg2D2 is more efficient than Sg2DJ, Sg2D2 is used for estimation in the following experiments unless otherwise stated.

Figure 3.8: Comparison of Sg2 estimates for the two speakers in Table 3.2; top panel for speaker 1 and bottom panel for speaker 2. Each panel shows the per-vowel Sg2D1 and Sg2D2 estimates, their averages, and the ground truth.

3.4 Variability of Subglottal Resonance Sg2

3.4.1 The Bilingual Database

The acoustic characteristics of children’s speech have been shown to be highly

different from those of adult speech, in terms of pitch and formant frequencies,


segmental durations, and temporal and spectral intra- and inter-speaker variabil-

ities [41, 42]. Studies of subglottal resonances [65, 67–69, 73, 74], however, have

mainly focused on adult speech in English with little effort devoted to children’s

speech or to other languages [66,75,76]. This section analyzes children’s speech in

English and Spanish, investigating the variability of Sg2 across different content and across languages.

To examine the cross-language variability of Sg2 frequencies, we recorded a

database (ChildSE) of 20 bilingual Spanish-English children (10 boys and 10

girls) in the 1st or 2nd grade (around 6 and 7 years old, respectively) from

a bilingual elementary school in Los Angeles. The recorded speech consisted of

words containing front, mid, back, and diphthong vowels. There were four English

words: beat, bet, boot, and bite, and five Spanish words (with English meanings

in parentheses): calle (street), casa (house), quitar (to take out), taquito (taco)

and cuchillo (knife). All the words were familiar to the children.

Prior to the recording, children were instructed to practice as many times as

they wanted. Both text and audio samples for each target word were available as prompts, and children decided which prompts they needed during recording and which language they wanted to record first. There were three repetitions for each word: children spoke all the words in one language in a row, with a 3-second pause between words, and then repeated them. After they finished

the recordings in one language, there was about a one-minute pause before they

began the recordings in the other language. Recordings were made with 16 kHz

sampling rate and 16-bit resolution.

Like the English word bite [baIt] with a diphthong [aI], the Spanish words

calle [kaje] and cuchillo [kutSijo] had obvious F2 discontinuities. We used these

words with diphthongs to estimate Sg2 frequencies. Therefore, for each speaker,


there were three English tokens and six Spanish tokens for the Sg2 estimation.

3.4.2 Cross-content and Cross-language Variability

The within-speaker standard deviations were calculated on the Sg2 values esti-

mated from the six Spanish tokens for each speaker. The within-speaker coefficient of variation (COV), which can be viewed as a measure of the dispersion of a probability distribution, was also calculated. The COV was computed as the ratio of

the standard deviation to the mean Sg2 value for each speaker. As shown in Fig.

3.9, the within-speaker Sg2 standard deviations are around 20Hz and the COV is

less than 0.01. No significant difference in the COVs is observed between genders.

A similar trend is observed for the within-speaker Sg2 standard deviations calcu-

lated from the English tokens. Compared to the COV of formant frequencies [41],

which are usually around 0.10, the COV of Sg2 is about one order of magnitude

smaller. Therefore, the within-speaker Sg2 variability is negligible, since it is small compared to formant variability. This means that for a given speaker, Sg2 is essentially constant across content and repetition.
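As a small illustration of the statistics used here (not the analysis script itself), the per-speaker standard deviation and COV can be computed as:

import numpy as np

def sg2_dispersion(sg2_estimates_hz):
    """Within-speaker standard deviation (Hz) and coefficient of variation of Sg2."""
    x = np.asarray(sg2_estimates_hz, dtype=float)    # e.g., the six Spanish tokens
    std = x.std(ddof=1)                              # sample standard deviation
    return std, std / x.mean()                       # COV: std divided by the mean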

Since Sg2 frequency for a given speaker does not depend on context, it makes

sense to calculate the Sg2 COV for each speaker over the three English tokens

and six Spanish tokens and view that as the Sg2 cross-language variability, which

is plotted in Fig. 3.10. The cross-language Sg2 COVs are less than 0.01, and

there is no significant difference between genders. The cross-language COVs are

similar to the within-speaker COVs, indicating that the cross-language effects are

not significant for Sg2 frequencies and the Sg2 frequency for a given speaker is

independent of language.


Figure 3.9: Average within-speaker Sg2 standard deviations and the COVs

against contents and repetitions.

3.4.3 Implications of Sg2 Invariability

Because of its invariability across speech content and language, Sg2 is judged to

be applicable to speaker normalization. Since Sg2 is content-independent, it is

hypothesized that the performance of speaker normalization using Sg2 should be

robust and independent of the amount of adaptation data available. This would

make the Sg2 normalization method well suited to adaptation with limited data, which is a common situation in ASR applications.

On the other hand, the language-independent property of Sg2 makes cross-

language adaptation possible based on Sg2 normalization. Theoretically, with

Sg2 normalization acoustic models trained in one language could be adapted

with data in any other language, which may be useful in ASR applications for

second-language learning.

Figure 3.10: Cross-language within-speaker COV of Sg2 for 10 boys and 10 girls.

3.5 Experiments with Linear Frequency Warping

Similar to formant normalization, the warping ratio for Sg2 normalization is

defined as:

\alpha = Sg2_r / Sg2_t        (3.5)

where Sg2_r is the reference Sg2 and Sg2_t is the Sg2 of the test speaker. The

reference Sg2 is defined as the mean value of all the training speakers’ Sg2’s. The

Sg2 values are detected using the Sg2D2 algorithm. In this section, we evaluate

the content dependency of Sg2 normalization and also its use for cross-language

normalization. The simple linear frequency warping is applied in this section,

and nonlinear frequency warping will be addressed in the next section.
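A minimal sketch of this warping-factor computation, assuming the training speakers' Sg2 values (e.g., from Sg2D2) are available as a list:

import numpy as np

def linear_sg2_factor(sg2_test_hz, training_sg2_hz):
    """Warping ratio of Eq. (3.5): reference Sg2 over the test speaker's Sg2."""
    sg2_ref = float(np.mean(training_sg2_hz))        # reference: mean over training speakers
    return sg2_ref / float(sg2_test_hz)              # warped frequency: f' = alpha * f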


3.5.1 Comparison of VTLN and Sg2 Frequency Warping

Fig. 3.11 shows F1, F2 and F3 values from a nine-year-old girl before and after

warping using VTLN [16] and the Sg2 ratio. The line ‘Sg2’ is the reference second

subglottal resonance for an adult male speaker (as in Fig. 3.4). Compared to

Fig. 3.4, the unwarped data show a clearly different pattern in the

relative positions of the formants with respect to the reference Sg2. For instance,

the back vowels [U] and [u] have higher F2 values than the reference Sg2, while

in Fig. 3.4 F2’s of all the back vowels lie below the Sg2 line. It is necessary

to apply frequency warping to achieve the reference formant position pattern.

Both VTLN (in circles) and Sg2 (in squares) warp the formants close to the

reference pattern, although Sg2 warping yields a formant pattern more similar to

the reference speaker’s.

To examine the effects of warping in more detail, Fig. 3.12 plots the reference

F1, F2 and F3 values versus the normalized values. It can be seen that Sg2

warping aligns the test speaker’s formants more closely to the reference speaker’s

formants (Fig. 3.4), as indicated by the proximity of the data points to the

diagonal line (with slope 1). In ASR such warping results in greatly reduced

spectral mismatch between test and reference speakers, and thus can lead to

better ASR performance.

3.5.2 Effectiveness of Sg2 Normalization

Since VTLN has been shown to provide significant performance improvement

on children’s speech recognition, the subglottal normalization method is first

evaluated on a connected digits recognition task of children’s speech using the

TIDIGITS database. Speech signals were segmented into 25ms frames, with a

10ms shift. Each frame was parameterized by a 39-dimensional feature vector

consisting of 12 static MFCCs plus log energy, and their first- and second-order derivatives.

Figure 3.11: Vowel formants F1 (·), F2 (+) and F3 (x) before and after VTLN (in circles) and Sg2-based (in squares) warping for a nine-year-old girl's vowels. The lines ‘Sg1’, ‘Sg2’ and ‘Sg3’ are the reference subglottal resonances from the same speaker as in Fig. 3.4.

Acoustic HMMs were monophone-based with 3 states and 6 Gaussian

mixtures in each state. VTLN was implemented based on a grid search over

[0.7, 1.2] with a step size of 0.01. The scaling factor producing the maximal average

likelihood was used to warp the frequency axis [16].

In this setup, acoustic models were trained on 55 adult male speakers and

tested on 50 children. The baseline word accuracy is 55.76%. For each child, the

adaptation data, which consisted of 1, 4, 7, 10, 13 or 16 digits, were randomly

chosen from the test subset to estimate the Sg2 and VTLN warping factors. For

comparison, the performance of manually measured Sg2 is also tested, which in

some sense can be viewed as the upper bound of this Sg2 normalization method.

Figure 3.12: Vowel formants F1 (·), F2 (+) and F3 (x) from the reference speaker (Fig. 3.4) versus those from the test speaker (Fig. 3.11) before and after warping (VTLN in circles, Sg2 in squares). The dotted line is y = x, which indicates a perfect match between the reference and test speakers.

For each speaker, the manual Sg2 was measured from only diphthong words

containing obvious F2 discontinuities in the spectrum, and, independent of adap-

tation data, the same Sg2 value was applied for normalization. Fig. 3.13 shows

the recognition accuracy for VTLN, F3 and Sg2 warping with various amounts

of adaptation data, where Sg2M represents results using the manually measured

subglottal resonance.

When the amount of adaptation data is small, Sg2 normalization offers better

performance than VTLN. For instance, with only one digit for normalization, Sg2

normalization outperforms VTLN by more than 2%. VTLN outperforms Sg2D2

when more data is available, while Sg2M provides slightly better performance than VTLN even with 16 adaptation digits. The improvements of Sg2 normalization

over VTLN for up to 10 adaptation digits are statistically significant (p < 0.05).

Figure 3.13: Speaker normalization performance on TIDIGITS with various amounts of adaptation data.

Although automatic detection of Sg2 was fairly accurate, it was not exact and

there is thus a gap between the performance of the automatic detection method

and that of Sg2M. With more accurate Sg2 detection algorithms, we may expect

performance closer to that of the manual Sg2.

3.5.3 Comparison of Vowel Content Dependency

As discussed in Section 3.3, Sg2 is not always detectable from acoustic signals, and its detectability in the adaptation data is important to the normalization process.

To investigate the content dependency of the detection algorithm Sg2D2, its nor-

malization performance is evaluated on the TIDIGITS database with one adaptation

digit. For each child, the adaptation data were limited to only one digit but with


varying vowels from front vowel (e.g., [I] in six), central vowel (e.g., [2] in one),

back vowel (e.g., [u] in two) to diphthong (e.g., [aI] in five). The adaptation digits

were chosen such that F2 discontinuities, if any, come only from vowel contents

without any possible interferences from consonant-vowel transitions [63].

The performance comparison for VTLN, F3 and Sg2 normalization is shown

in Fig. 3.14. It can be seen that the choice of adaptation data can potentially

have an effect on the normalization performance for all three methods. Among

them, VTLN is least affected by the choice of adaptation data (the performance

standard deviation is 0.55), while F3 normalization is highly data dependent. The

performance of Sg2 normalization is less content-sensitive than F3 normalization, but more content-dependent than VTLN. We expect that the content

dependency of Sg2 normalization will decrease with improved Sg2 detection algo-

rithms. In spite of its greater content dependency, on average Sg2 normalization

provides better performance than VTLN.

3.5.4 Performance on RM1 Database

The TIDIGITS setup is a highly mismatched case, and the experiments above demonstrate the effectiveness of subglottal resonance-based speaker normalization under such mismatch. To further verify the method, its performance is also tested on a medium-vocabulary recognition task using the DARPA Resource Management (RM1) continuous speech database, for both a medium-mismatched case and a matched case. Tri-

phone acoustic models were applied with 3 states and 4 Gaussian mixtures per

state using the same features as in the TIDIGITS experiments. For the mis-

matched case, HMM models were trained on 49 adult male speakers from the

speaker independent (SI) portion of the database, and tested on 23 adult female

speakers in the SI portion. The baseline word recognition accuracy was 59.10%.

Figure 3.14: Performance comparison of VTLN, F3 and Sg2D2 using one adaptation digit with various vowel content.

For the regular test on RM1, the HMM models were trained on the SI training

portion of the database with 72 adult speakers, and tested on the SI testing set.

The baseline performance was 92.47% word recognition accuracy. In both cases,

the same utterance was used to estimate the Sg2 and VTLN warping factor for

all speakers. Table 3.3 shows the results.

For the mismatched case, Sg2 normalization provides better performance than

VTLN with about 1.5% absolute improvement. This improvement is statistically

significant for p < 0.01. For the matched case, Sg2 normalization provides com-

parable performance to that of VTLN. From the computational point of view,

Sg2 normalization is more efficient than VTLN, since VTLN relies on an ex-

haustive grid search over the warping factors to maximize the likelihood of the

adaptation data, while for Sg2 normalization the main computational cost comes from formant tracking, which can be done efficiently.

Table 3.3: Performance comparison (word recognition accuracy, %) on RM1 with one adaptation utterance.

Accuracy   mismatched   matched
Baseline   59.10        92.47
F3         79.01        92.58
VTLN       86.65        93.91
Sg2        88.37        94.05

3.5.5 Cross-language Speaker Normalization

The language-independent property of Sg2 makes cross-language adaptation pos-

sible based on Sg2 normalization. In this experiment, training and test data

were in English, while the adaptation data were in either English or Spanish.

The warping factors were estimated from the adaptation data using Sg2D2 and

applied to the test data to warp the spectrum. English adaptation data were

collected for comparison.

The performance was evaluated on the Technology Based Assessment of Lan-

guage and Literacy (TBall) project database [77], and the English high frequency

words for 1st and 2nd grade students were used in the test. Monophone acoustic

models were trained on speech data from native English speakers. The test data

were from the same 20 speakers as in the ChildSE. The ChildSE utterances (only

one repetition) were used as adaptation data, and for each speaker there were

four English words and five Spanish words for adaptation.

The typical text-dependent VTLN method using HMM recognizers for warp-


ing factor searching is not quite suitable in this scenario, because decoding Span-

ish speech with English phoneme models could itself introduce a systematic error

due to different phonetic characteristics between these two languages. Instead,

for a fair and reasonable comparison, text-independent VTLN is applied, which

uses Gaussian mixture models (GMM) for warping factor searching. A GMM

with 512 mixtures was trained on the English training set, and then applied to cal-

culate the likelihood for each warping factor in the range [0.8,1.2] with a step

size of 0.01. The warping factor with the highest likelihood was chosen as the

VTLN warping factor. Compared to the text-dependent VTLN used in [78], this

text-independent method provides similar performance with English adaptation data, but performs much better with Spanish adaptation data. The subglottal resonance was

estimated using Sg2D2 for each word, and the average was used as the speaker’s

Sg2 frequency. The Sg2 warping factor was calculated using Eq. (3.5).
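A hedged sketch of this text-independent search is given below; loglik_fn (the total GMM log-likelihood of a feature matrix) and extract_features (feature extraction with the frequency axis warped by alpha) are assumed helper functions, not the project's actual code:

import numpy as np

def vtln_factor_gmm(audio, loglik_fn, extract_features, grid=None):
    """Pick the warping factor whose warped features score highest under a GMM
    trained on the (English) training set."""
    if grid is None:
        grid = np.arange(0.80, 1.20 + 1e-9, 0.01)    # search range [0.8, 1.2], step 0.01
    scores = [loglik_fn(extract_features(audio, alpha)) for alpha in grid]
    return float(grid[int(np.argmax(scores))])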

The normalization performance is shown in Table 3.4 for VTLN and Sg2 us-

ing English and Spanish adaptation data. When adaptation data are in English,

which is the same language as for the acoustic models, Sg2 normalization and

VTLN give comparably good results. For Spanish adaptation data, however,

the performance of VTLN degrades, while the performance of Sg2 normalization

remains similar as for English adaptation data. Sg2 normalization, therefore,

produces more robust results than VTLN when performing cross-language adap-

tation. The performance difference between using Sg2D2 and using VTLN is

statistically significant with Spanish adaptation data for p < 0.01.


Table 3.4: Performance comparison (word recognition accuracy) of VTLN and

Sg2 normalization using English (four words) and Spanish (five words) adaptation

data. The acoustic models were trained and tested using English data.

Method    Language of adaptation data
          English    Spanish
VTLN      86.61      82.35
Sg2       86.59      85.97

3.6 Nonlinear Frequency Warping

3.6.1 Mel-shift based Frequency Warping

Given a warping function W(f), the spectrum S(f) is transformed into

S'(f) = S(W(f))        (3.6)

where f is the frequency scale in Hz. For computational efficiency, W(f) usually involves only one parameter, the warping factor α. A simple yet effective warping function is a linear scaling function:

W(f) = W_\alpha(f) = \alpha \cdot f        (3.7)

In conventional VTLN, the optimal warping factor is usually estimated using a

grid search to maximize the likelihood of warped observations given an acoustic

model λ:

\alpha = \arg\max_{\alpha \in G} \sum_{r=1}^{R} \log p(O_r(W_\alpha(f)) \mid \lambda, s_r)        (3.8)

where s_r is the transcription of the r-th speech file O_r, and G is the search grid.

Though widely used, the linear scaling model in Eq. (3.7) is known to be

a crude approximation of the way vocal tract variations affect spectrum. The

warping factor between speakers is also observed to be frequency dependent [79].


Motivated by speech analysis, [79,80] proposed a shift-based nonlinear frequency

warping, i.e., shifting the Mel scale upward or downward, which results in nonlinear warping in Hz. As opposed to a linear W_\alpha(f), the warping function is defined as:

W_\alpha(z) = z + \alpha        (3.9)

where z is in Mel scale³:

z = Mel(f) = 1127 \log\left(1 + \frac{f}{700}\right)        (3.10)

The Mel-shift function corresponds to a nonlinear relationship in Hz:

f' = e^{\alpha/1127} \cdot f + 700 \left( e^{\alpha/1127} - 1 \right)        (3.11)

Similar to the linear warping method, the optimal warping factor α for shift-based

methods can be estimated using the ML criterion.
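As a small sketch (not the front-end code used in the experiments), the Mel-shift warp of Eqs. (3.9)-(3.11) can be written either by converting to the Mel scale and back or directly in Hz; the two forms agree by construction:

import numpy as np

def mel(f_hz):
    """Standard Mel scale, Eq. (3.10)."""
    return 1127.0 * np.log(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

def mel_inv(z_mel):
    """Inverse of Eq. (3.10)."""
    return 700.0 * (np.exp(np.asarray(z_mel, dtype=float) / 1127.0) - 1.0)

def mel_shift_warp(f_hz, alpha_mel):
    """Shift by alpha on the Mel scale (Eq. 3.9); equals Eq. (3.11) in Hz."""
    via_mel = mel_inv(mel(f_hz) + alpha_mel)
    direct = (np.exp(alpha_mel / 1127.0) * np.asarray(f_hz, dtype=float)
              + 700.0 * (np.exp(alpha_mel / 1127.0) - 1.0))
    assert np.allclose(via_mel, direct)              # Eq. (3.11) is exactly Eq. (3.9) in Hz
    return direct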

3.6.2 Bark-shift based Frequency Warping

In this section, a Bark-scale shift based warping function is investigated as defined

in Eq. (3.9), but where z is now in Bark scale:

z = Bark(f) = 6 \log\left( \frac{f}{600} + \sqrt{\left(\frac{f}{600}\right)^2 + 1} \right)        (3.12)

Inserting Eq. (3.12) into Eq. (3.9), the frequency (Hz) domain relationship

corresponding to a Bark shift can be derived:

6 \log\left( \frac{f'}{600} + \sqrt{\left(\frac{f'}{600}\right)^2 + 1} \right) = 6 \log\left( \frac{f}{600} + \sqrt{\left(\frac{f}{600}\right)^2 + 1} \right) + \alpha        (3.13)

f' = 300 e^{\alpha/6} \left[ \frac{f}{600} + \sqrt{\left(\frac{f}{600}\right)^2 + 1} \right] - \frac{300 e^{-\alpha/6}}{\frac{f}{600} + \sqrt{\left(\frac{f}{600}\right)^2 + 1}}        (3.14)

³In [79], the coefficient 1127 is changed to 1. Throughout this paper, the standard Mel scale in Eq. (3.10) is used.

In general the relationship in Eq. (3.14) is nonlinear and complicated. How-

ever, we can approximate Eq. (3.13) as:

f' = e^{\alpha/6} \cdot f,                            for f \gg 600 Hz
f' = e^{\alpha/6} \cdot f + 600 (e^{\alpha/6} - 1),   for f \ll 600 Hz        (3.15)

For high frequency f ≫ 600 Hz, the Bark shift corresponds to a linear scaling in Hz as Eq. (3.7); while for low frequency f ≪ 600 Hz, the Bark shift results in

an affine relationship in Hz as the Mel shift (Eq. (3.11)). In general, the Bark

shift warping function stretches or compresses lower frequencies more than higher

frequencies.

To preserve the frequency bandwidth after warping, a piecewise nonlinear

warping function, shown in Fig. 3.15, is applied such that the lower boundary

frequency fmin (or zmin) and the upper boundary frequency fmax (or zmax) are

always mapped to themselves, i.e.,

W_\alpha(z) = \frac{z_l + \alpha - z_{min}}{z_l - z_{min}} (z - z_{min}) + z_{min},      if z \le z_l
W_\alpha(z) = z + \alpha,                                                                if z_l < z < z_u        (3.16)
W_\alpha(z) = \frac{z_{max} - z_u - \alpha}{z_{max} - z_u} (z - z_u) + z_u + \alpha,     if z \ge z_u
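A direct transcription of Eq. (3.16) into code might look as follows (a sketch; the boundary parameters z_l and z_u are left as arguments, since their values are configuration choices):

import numpy as np

def piecewise_shift_warp(z, alpha, z_min, z_max, z_l, z_u):
    """Bandwidth-preserving piecewise shift on the Mel/Bark scale, Eq. (3.16)."""
    z = np.atleast_1d(np.asarray(z, dtype=float))
    out = z + alpha                                   # middle region: plain shift
    lo, hi = z <= z_l, z >= z_u
    out[lo] = (z_l + alpha - z_min) / (z_l - z_min) * (z[lo] - z_min) + z_min
    out[hi] = (z_max - z_u - alpha) / (z_max - z_u) * (z[hi] - z_u) + z_u + alpha
    return out                                        # z_min and z_max map to themselves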

This Bark-shift based piecewise nonlinear warping function differs from pre-

vious Bark-scale based approaches [81–83] in two aspects. First, the previous

methods apply modifications to the Hz-Bark conversion formula directly, which

make it difficult to implement in an uniform filter bank analysis framework. In

contrast, the proposed method can be easily implemented by modifying filter

bank analysis for computational efficiency. Second and the most important, the

piecewise function in Eq. (3.16) compensates for bandwidth mismatch, while

the warping functions in [81–83] change frequency bandwidth, which result in

information loss at the boundaries. Preserving bandwidth is more important for nonlinear frequency warping than for linear frequency warping, because a one-unit shift on the Mel or Bark scale can correspond to a deviation of hundreds of Hz on the linear scale, due to the Mel and Bark nonlinearity.

Figure 3.15: Piecewise Bark-shift warping function, where α > 0 shifts the Bark scale upward, α < 0 shifts downward, and α = 0 means no warping.

3.7 Experiments with Nonlinear Frequency Warping

3.7.1 Sg2 based Nonlinear Frequency Warping

The automatically estimated Sg2 has been applied to linear frequency warping

and shown to be promising. Here, that work is extended to nonlinear speaker

normalization. Given the Sg2 value for a test speaker, Sg2tst, and a reference Sg2

value Sg2ref , which is the average Sg2 value over training speakers, the warping

factor α is calculated as:

\alpha = Sg2_{ref} / Sg2_{tst},                 for linear scaling
\alpha = Mel(Sg2_{ref}) - Mel(Sg2_{tst}),       for Mel shift        (3.17)
\alpha = Bark(Sg2_{ref}) - Bark(Sg2_{tst}),     for Bark shift
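A compact sketch of Eq. (3.17); the mel() and bark() helpers below simply restate Eqs. (3.10) and (3.12):

import numpy as np

def mel(f_hz):                                        # Eq. (3.10)
    return 1127.0 * np.log(1.0 + f_hz / 700.0)

def bark(f_hz):                                       # Eq. (3.12)
    return 6.0 * np.log(f_hz / 600.0 + np.sqrt((f_hz / 600.0) ** 2 + 1.0))

def sg2_warp_factors(sg2_ref_hz, sg2_tst_hz):
    """Warping factors of Eq. (3.17) for the three warping functions."""
    return {
        "linear": sg2_ref_hz / sg2_tst_hz,
        "mel_shift": mel(sg2_ref_hz) - mel(sg2_tst_hz),
        "bark_shift": bark(sg2_ref_hz) - bark(sg2_tst_hz),
    }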

The ML-based speaker normalization method (Eq. (3.8)) involves an exhaus-

tive grid search to find an optimal warping factor in the ML sense, which is time

consuming and requires a certain amount of data to be effective. In contrast,

the main computational cost for Sg2-based normalization methods comes from

F2 and E2 tracking based on LPC analysis, which can be done efficiently. Since

Sg2 has been shown to be content independent and remains constant for a given

speaker [65, 78], Sg2 estimation does not require large amounts of data; theoretically, a few words, or even one word if carefully chosen⁴, would be sufficient.

Therefore, compared to ML-based normalization methods, Sg2-based normaliza-

tion methods are computationally more efficient and require less data, which is

desirable for rapid speaker normalization with limited enrollment data.

3.7.2 Experimental Setup

For computational efficiency, all normalization methods are implemented by mod-

ifying the Mel or Bark filter bank analysis, instead of warping the power spectrum.

MFCC features are used for Mel shift, and PLPCC features are used for Bark

shift. PLP features can also be computed from a Mel filter bank front end. Preliminary experiments showed that for the baseline system, Mel-PLP performs slightly better than Bark-scale PLP and standard MFCC. However, the improvement is

not significant, and since feature comparison is not the primary interest of this work, standard MFCC and Bark-scale PLP were used in all experiments.

⁴For the most reliable estimation, the Sg2 detector requires an F2 transition crossing Sg2, e.g., as in a diphthong /aI/.

Here

the focus is on the comparison of linear vs. nonlinear warping functions, and ML

vs. Sg2 based normalization. For fair comparisons, all experiments (both linear

and nonlinear) use piecewise warping functions with the same cut-off frequencies.

It is also important to use a consistent framework when conducting the com-

parison of ML-based linear vs. nonlinear normalization, i.e., the search grids

should be equivalent. This means the grid size should be the same and within an

appropriate range to ensure that the linear and nonlinear warped spectra cover

roughly the same frequency range. For the linear warping function, a grid of 21

searching points is used with a step size of 0.01. According to Eq. (3.15) and Eq.

(3.11), a step size of 0.01 in linear scaling roughly corresponds to a shift of 0.07

(bark) in Bark scale, or a shift of 10 (Mel) in Mel scale.

The performance of different normalization methods is evaluated on children’s

ASR, where speaker normalization has been shown to provide significant perfor-

mance improvement. Two databases are used: one is the TIDIGITS database on

connected digits, and the other is the TBall database on high frequency words

(HFW) and basic phonic skills test (BPST) words [77]. For the two databases,

speech signals were segmented into 25ms frames, with a 10ms shift. Each frame

was parameterized by a 39-dimensional feature vector consisting of 12 static

MFCC (or PLPCC) plus log energy, and their first- and second-order deriva-

tives. Cepstral mean subtraction (CMS) is applied in all cases. Throughout this

section word error rate (WER) is used for performance evaluation.

Monophone-based acoustic models were used with 3 states and 6 Gaussian

mixtures in each state. In the TIDIGITS experiments, acoustic models were

trained on 55 adult male speakers and tested on 50 children. The baseline WER is

37.63% using MFCC features and 37.47% using PLPCC features. For each child,


the normalization data, which consisted of 1, 4, 7, 10 or 15 digits, were randomly

chosen from the test subset to estimate Sg2 and the ML-based warping factors.

The ML search grid is [0.8, 1.0] for linear scaling, [-1.4, 0.0] for Bark shifting, and

[-200, 0.0] for Mel shifting. In the TBall database, 55 HFW words and 55 BPST

words were collected from 189 children in grades 1 or 2. Around two-thirds of

the data (120) were used for training, and the remaining third for testing. The

baseline WER is 7.75% using MFCC features and 8.35% using PLPCC features.

Three randomly chosen words (including at least one diphthong word) were used

for normalization. The ML search grid is [0.9, 1.1] for linear scaling, [-0.7, 0.7]

for Bark shifting, and [-100, 100] for Mel shifting. For comparison, the Bark

offset method in [82] was also evaluated using PLPCC features. All experiments

were performed in an unsupervised way, and the recognition output from the

baseline models (without normalization) was used as transcription during ML

grid searching.

3.7.3 Experimental Results

Tables 3.5 and 3.6 show results on TIDIGITS with various amounts of normal-

ization data for MFCC and PLPCC features, respectively. LS-ML means linear

scaling with ML-based warping factor estimation; LS-Sg2 means linear scaling

with Sg2-based warping factor estimation; MS represents Mel-shift based non-

linear warping; BS is Bark-shift based nonlinear warping; BO-ML is the method

in [82] using ML grid search.

For ML-based warping methods, comparing LS vs. MS for MFCC (rows 1 and

2 in Table 3.5) and LS vs. BS for PLPCC features (rows 1 and 2 in Table 3.6), it

can be seen that nonlinear frequency warping provides better performance than

linear warping in all conditions, which is in agreement with the literature. Due to


the bandwidth compensation, the proposed piecewise Bark shift method (BS-ML)

outperforms BO-ML except for the case of one normalization digit.

Compared to ML-based methods, Sg2 normalization performs significantly

better for up to seven normalization digits with all three warping functions (LS,

MS, and BS). With more data, ML-based methods tend to produce close or

superior performance, though for the case of Bark shift (BS-ML vs. BS-Sg2,

rows 3 and 5 in Table 3.6), Sg2 outperforms ML in all testing conditions for up

to 15 digits. Similar performance trends are observed on TBall data in Table 3.7.

Table 3.5: WER on TIDIGITS using MFCC features with varying normalization

data from 1 to 15 digits.

Warping 1 4 7 10 15

LS-ML 7.48 6.34 5.42 4.99 4.91

MS-ML 6.33 5.47 4.48 4.11 4.08

LS-Sg2 6.11 5.57 5.05 5.07 5.03

MS-Sg2 5.29 4.81 4.05 4.13 3.99

Table 3.6: WER on TIDIGITS using PLPCC features with varying normalization

data from 1 to 15 digits.

Warping 1 4 7 10 15

LS-ML 7.62 6.90 5.78 5.64 5.25

BS-ML 6.21 5.63 4.56 4.30 4.13

BO-ML 6.00 5.94 5.33 4.96 4.65

LS-Sg2 6.15 5.71 5.51 5.47 5.39

BS-Sg2 5.17 4.76 4.09 4.11 4.05


Table 3.7: WER on TBall children’s data using MFCC and PLPCC features with

3 normalization words.

MFCC PLPCC

Warping WER Warping WER

LS-ML 6.86 LS-ML 6.99

MS-ML 5.91 BS-ML 5.82

- - BO-ML 6.08

LS-Sg2 6.10 LS-Sg2 6.33

MS-Sg2 4.89 BS-Sg2 4.71

3.8 Summary and Discussion

This chapter presents a reliable algorithm for estimating the second subglottal

resonance (Sg2) from acoustic signals. The algorithm provides Sg2 estimates

which are very close to actual Sg2 values as determined from direct measurements

using accelerometer data. With the proposed algorithm, Sg2 standard deviation

across content and language was investigated with children’s data for English

and Spanish. Analysis shows that for a given speaker the second subglottal

resonance does not appear to vary with speech sounds or repetitions, or even

across languages. Based on such observations, a speaker normalization method

is proposed using the second subglottal resonance. This normalization method

defines the warping factor as the ratio of the reference subglottal resonance over

that of the test speaker.

A variety of evaluations show that the second subglottal resonance normaliza-

tion performs better than or comparably to VTLN, especially for limited adapta-

tion data. An obvious advantage of this method is that the subglottal resonances

remain roughly constant for a specific speaker. This method is potentially inde-


pendent of the amount of available adaptation data, which makes it suitable for

limited data adaptation.

Cross-language experimental results show that Sg2 normalization is more ro-

bust across languages than VTLN, and no significant performance variations are

observed for Sg2 when the adaptation data are changed from English to Span-

ish. The fact that Sg2 is independent of language should make it possible to

adapt acoustic models with available data from any language. The method is

also computationally more efficient than VTLN.

The Sg2 variations found in this work are similar to what has been reported

elsewhere. However, given the small number of subglottal resonance studies, more

data may need to be collected and analyzed in order to refine the characterization

of subglottal resonance variability. Future work is required to further improve

the accuracy of the Sg2 detector, evaluate the effectiveness of this method on a

large vocabulary database, and test the performance in noisy conditions.


CHAPTER 4

Automatic Evaluation of Children’s

Language Learning Skills

Increasing attention has been devoted to applying automatic speech recognition

techniques to children’s speech for educational purposes. Many automatic assess-

ment, tutoring, and computer aided language learning (CALL) systems have been

developed. This chapter describes an automated evaluation system developed in

response to the growing need for reliable and objective reading assessments in

schools. The system applies disfluency detection and Spanish accent detection

together with speech recognition to evaluate children's language learning skills.

4.1 Technology-based Assessment of Language and Literacy

The technology-based assessment of language and literacy (TBall) project [39]

was designed to automatically evaluate English language learning and literacy

skills of predominantly Mexican-American children in grades K-2 (ages 5-7 years).

The goal is to use classroom-based assessments to inform reading instruction, en-

abling teachers to gather data about a large number of discrete skills including

phonological awareness, alphabet knowledge, word decoding, fluency, vocabulary,

and comprehension skills. A system designed to robustly meet these broad de-

mands must make use of multiple information sources when eliciting responses


from the children, automatically processing these responses, and reporting as-

sessment scores to the teachers.

The TBall system provides teachers of grades K-2 with a tool that allows

them to efficiently gather data about their students’ language skills from reliable

classroom-based assessments in order to plan individualized instruction tailored

for each child’s needs. The developed system consists of three main parts:

1. A multimedia student interface to present stimuli in audio, text, and graph-

ics, and to collect data over various sources and modes.

2. An assessment module using ASR to analyze and score the students’ re-

sponses in a reliable, fair and efficient manner.

3. A teacher interface to monitor students' progress and to create a query-based database for students, groups, and classes.

4.2 Blending Tasks and Database Collections

4.2.1 Blending Tasks for Phonemic Awareness

A critical component of the TBall project is assessment of phonemic awareness

because of its key role in reading and writing, especially for the targeted age

group. Phonemic awareness is related to developing reading and writing skills,

and is important for children to master to become proficient readers. It can be

assessed through oral segmenting and blending tasks at various linguistic lev-

els. Examples of blending and segmentation tasks are shown in Tables 4.1 and

4.2, respectively. Here the primary focus is on phoneme blending, onset-rhyme

blending and syllable blending. The blending tasks assess both pronunciation ac-

curacy and smoothness of the target words. In the tasks, audio prompts present


phonemes, onset-rhymes, or syllables separately, and a child is asked to orally

blend them into a whole word. A child is considered proficient in the tasks provided that:

• The child reproduces all the sounds of the original prompts (phonemes,

onset-rhymes, or syllables) in the final word.

• The child can smoothly blend the prompts together to make one word.

Table 4.1: An example of the TBall blending tasks: audio prompts are presented

and a child is asked to orally blend them into a whole word. A one-second silence

(SIL) is used within the prompts to separate each sound.

Blending task   Audio prompt                        Target
Phonemes        /hh/ SIL /ae/ SIL /ch/              hatch
Onset-rhyme     /r/ SIL /ae m p/  (r+amp)           ramp
Syllables       /p eh p/ SIL /t ih k/  (pep+tic)    peptic

Table 4.2: An example of the TBall segmentation tasks: audio prompts are

presented and a child is asked to orally segment them into parts.

Segmentation task   Audio prompt   Target
Phonemes            chime          ch + i + me
Onset-rhyme         shake          sh + ake
Syllables           station        sta + tion

4.2.2 Database Collections

The speech corpus was collected in five Kindergarten classrooms in Los Angeles.

The schools were carefully chosen to provide balanced data from children whose


native language was either English or Mexican Spanish. Each blending task has

eight words, most of which are unfamiliar to young children. Table 4.3 lists the target words for each blending task. Such unfamiliar words were chosen to reduce the likelihood that a child could guess the target answer without actually blending the components.

Before the actual recording started, children first practiced on examples to

become familiar with the task, and they decided when they were ready to start

the recordings. During data collection, a timer with a three-second expiration was used to limit the pause between the prompt and the answer: if a child did not respond within 3 s of the prompt, the prompt for the next word was presented. A total of 193 children were recorded, and Table 4.4 shows

the distribution of children by native language and gender. The database was

roughly gender-balanced and also language-balanced.

Table 4.3: Target words for the blending tasks.

Blending task   Target words
Phonemes        pick, fan, ship, cash, lack, fad, shin, hatch
Onset-rhyme     pot, mat, gum, shine, ramp, nit, chad, shape
Syllables       bamboo, napkin, nova, peptic, stable, table, wafer, window

Table 4.4: Speaker distribution by native language and gender.

Native language   English   Spanish   Unknown
Boy               38        43        11
Girl              41        47        13
Total             79        90        24


4.3 Human Evaluations and Discussions

4.3.1 Web-based Teacher’s Assessment

In our previous work [40], it was found that evaluations based on several words

from a speaker are more reliable than those based on a single word, since the more

speech from a child the rater hears, the more familiar the rater becomes with the

system of contrasts used by the child. For example, hearing a child say wick

for rick may indicate an articulation issue and not a phonemic awareness issue.

Therefore, in the web-based teacher’s assessment, audio samples were grouped by

speaker to allow teachers to apply speaker-specific information (dialect or accent,

speaking-rate, etc.) for judgment adaptation. Such speaker-specific information,

however, may lead to biased evaluations, since the perception of dialect or accent is highly subjective and different raters may judge it differently.

Teachers assessed both pronunciation accuracy and smoothness by responding

to the following questions:

• Are the sounds correctly pronounced? (accuracy)

• Are the sounds smoothly blended? (smoothness)

• Is the final word acceptable? (overall)

For each question, two choices were presented to classify the quality: acceptable

or unacceptable. Teachers also provided comments for their decisions.

4.3.2 Inter-correlation of the Assessment

Assessments from nine teachers were collected to calculate the inter-correlation

between evaluators. As shown in Table 4.5, teacher evaluations are reasonably


consistent for the three tasks. The inter-correlations in evaluating the overall

quality are similar for all the tasks: about 85%. The inter-correlations on ac-

curacy evaluations are significantly higher than those on smoothness. This is

because, compared to pronunciation accuracy, smoothness evaluation is more subjective, especially for short utterances. However, smoothness may be more important than accuracy in the blending task, because smooth blending is the goal of a blending assessment. In any case, smoothness is an orthogonal judgment: a word can be any combination of smooth or not smooth and accurate or inaccurate. Of the three tasks, phoneme blending is the most difficult for children and draws the most disagreement among teachers, while syllable blending is relatively easy.

Table 4.5: Average inter-evaluator correlation on pronunciation accuracy,

smoothness and overall evaluations on three blending tasks.

Blending task   Accuracy (%)   Smoothness (%)   Overall (%)
Phonemes        87.6           80.8             83.3
Onset-rhyme     91.3           82.4             84.1
Syllables       97.5           85.3             86.7

4.3.3 Discussions on the Blending Target Words

From teachers’ comments, it is found that children’s background knowledge of

the task words greatly affects their performance. For unfamiliar target words, it

usually takes longer for a child to give the answer. For example, many children

are unfamiliar with peptic and with the unusual occurrence of /p/ and /t/

sounds together. In this case, there will typically be long pauses between the

end of a prompt and a child’s answer, and also between the two syllables to be

blended.


Another issue arises with confusable target words: children tend to pronounce them incorrectly but blend them smoothly, and thus show "strong blending" skills. For the word stable, many children pronounced it as staple because the two words are very confusable, especially when spoken in isolation without any context.

The confusion is particularly strong for Hispanic children learning English, since

Spanish /p/ can be acoustically similar to English /b/.

There are also some “language-driven” errors. That is, substitution or dele-

tion/insertion errors can occur when the syllables to be blended do not exist in

the child’s native language. For example, children from Spanish linguistic back-

grounds tended to pronounce the word stable as estable or estaple because no

words begin with the sound sp in Spanish and they always have a vowel preceding

the consonant cluster, such as the Spanish words Espana or esperanza.

To be consistent with the goals of this syllable blending task, the final decision

is based on both the pronunciation correctness and the blending smoothness, i.e.,

a word can be acceptable only when the pronunciation accuracy and the blending

smoothness are both acceptable.

4.4 Automatic Evaluation System

4.4.1 Overall System Flowchart

The automatic evaluation system to measure children’s performance on the blend-

ing tasks consists of four core components: disfluency detection, accent detection,

accuracy assessment and smoothness assessment. The system flowchart is shown

in Fig. 4.1, with detailed descriptions in subsequent sections. Disfluency detec-

tion uses a partial-word recognizer to filter out disfluent phenomena such as false starts, sounding out, self-repairs and repetitions, and to localize the target answer. Accent detection is then applied to the target word to detect possible non-native English pronunciations. The accent information is used to update the pronunciation dictionaries and duration ratio models. Normalized log-likelihood and duration ratio scores are used to measure accuracy and smoothness, respectively. These two scores are combined to give the final result.

Since the tasks are designed to evaluate a child’s language learning skills

based on responses to audio prompts, prior information about the expected answer is

available for use in ASR. Hence, the automatic system can work in a supervised

mode and exploit knowledge-based information derived by linguistic experts for

better and more reliable performance.

Figure 4.1: Flowchart of the automatic evaluation system for the blending tasks.

4.4.2 Disfluency Detection

Generally speaking, disfluencies include everything spoken by the child that dis-

rupts the natural flow of the target word pronunciation. Typical disfluencies

found in our data are: fillers such as uhhh or ummm; partial- and/or full-word repetitions, where syllables or phonemes within a word, or a whole word, are repeated; self-corrections; long pauses within a word; and elongations, where syllables or phonemes (usually the first one) are lengthened. The last two disfluencies


(pauses and elongations) are related to the smoothness measure, and will be

addressed using duration ratio models.

Disfluency detection as the first stage of the system is used mainly to filter out

fillers and repetitions, and to get the approximate beginning and ending times

for the target answer. If the target word is repeated several times, only the

last one is used for further evaluations in order to be consistent with teachers’

decision-making protocols, where only the last answer is accounted for.

A partial-word recognizer (PWR) [48] is used to detect disfluency with sub-

word units derived from the dictionary based on the task; sub-word units are

phonemes, onsets or rhymes, or syllables depending on the blending task. An

example of the detection network is shown in Fig. 4.2 for a syllable blending

word peptic. A background/garbage model is used to consume background

noises, fillers and out-of-vocabulary words. Long pauses are allowed between

sub-word units. The PWR path can be bypassed in favor of whole-word recognition (WWR) for disfluency-free speech. The WWR is a regular phoneme-based recognizer, except that it allows repetitions. The WWR path can in turn be bypassed when the child makes no attempt to say the whole target word.

For computational efficiency, only one canonical dictionary pronunciation is

used to generate sub-word units, and no accented alternatives are taken into account at this stage. This is reasonable because the disfluency detector is mainly used to localize the target answer of interest, not to score it. Evaluated on a subset of the blending task data, the disfluency detector is able to filter

out around 85% of the disfluent miscues. The subsequent process detects accent

and uses that information to choose the pronunciation dictionary and duration

models.


Figure 4.2: An example of the disfluency detection network for a syllable blending

task word ‘peptic’, where START and END are the network entry and exit points, respectively.
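As a rough sketch of how the disfluency-detection output can be used to localize the target answer (keeping only the last complete attempt, as described above), the following fragment post-processes a hypothetical PWR output; the token names GARBAGE and SIL and the (token, start, end) output format are illustrative assumptions, not the actual recognizer interface:

```python
def localize_last_attempt(decoded, target_units):
    """decoded: list of (token, start_time, end_time) tuples from a PWR pass.
    target_units: sub-word units of the target word, e.g. ["pep", "tic"].
    Returns (start, end) of the last complete run of target units, skipping
    garbage/fillers and pauses, or None if no complete attempt is found."""
    attempts = []
    idx, start = 0, None          # progress through target_units for one attempt
    for token, t0, t1 in decoded:
        if token in ("GARBAGE", "SIL"):
            continue              # fillers, noise and pauses do not break an attempt
        if token == target_units[idx]:
            if idx == 0:
                start = t0
            idx += 1
            if idx == len(target_units):
                attempts.append((start, t1))   # one complete attempt finished
                idx, start = 0, None
        else:
            # restart; the unexpected unit may itself begin a new attempt
            idx = 1 if token == target_units[0] else 0
            start = t0 if idx == 1 else None
    return attempts[-1] if attempts else None
```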

4.4.3 Accent Detection

The TBall data used here were collected from children with multilingual backgrounds, and thus contain foreign-accented (mainly Spanish-accented) English.

An example of pronunciation variation for Spanish-accented English is the re-

placement of /dh/ (there) with /d/ (dare), since /dh/ does not exist in Spanish.

The pronunciation patterns of Spanish-accented English can be predicted from

theories of second language learning. The phonological rules are divided into three

categories: consonant changes, vowel changes, and syllable structure changes

(insertions or deletions of consonants or vowels).

Possible consonant changes include final obstruent devoicing, interdental fricative change, palatalization, retroflexing, alveolar approximant change, dentalization, labialization, etc. In particular, the following list (similar to that in [84] and in the online Speech Accent Archive [85]) enumerates pronunciation patterns that are within the phonemic coverage of our analysis database.

• /v/ (vat) → /f/ (fat), because /v/ doesn’t exist in Spanish, and /f/ is

acoustically similar to /v/.

• /z/ (zoo) → /s/ (sat), because /z/ doesn’t exist in Spanish, and /s/ is

acoustically similar to /z/.


• /dh/ (that) → /d/ (debt), because /dh/ doesn’t exist in Spanish, and /dh/

is acoustically close to the Spanish /d/.

• /th/ (thing) → /t/ (ten), because /th/ doesn’t exist in Spanish, and /th/

is acoustically close to the Spanish /t/.

• /r/ (rent) → trill: /r/ is pronounced with a rolling (trilled) sound.

• Unaspirated /p/ (pet), /t/ (ten), /k/ (kit): the stops are pronounced without aspiration.

Possible vowel changes include vowel shortening, vowel raising, and vowel

lowering. Potential pronunciation patterns are summarized as follows:

• /ih/ (bit) → /iy/ (beat), because /ih/ doesn’t exist in Spanish, and /iy/ is

acoustically close to /ih/.

• /ae/ (bat) → /eh/ (bet), because /ae/ doesn’t exist in Spanish.

• /uh/ (book) → /uw/ (boot), because /uh/ doesn’t exist in Spanish.

• /ah/ (but) → /eh/ (bet), because /ah/ doesn’t exist in Spanish.

Possible syllable structure changes include vowel insertion, consonant deletion

(/r/ deletion), and consonant insertion. Two such patterns are observed in our

analysis database: /er/ (bird) → /aa r/ (are), and /ao r/ (four) → /aa/ (Bob).

Since the focus in this work is on Spanish-accented English from young children aged 5-7 years, while the language learning theories are mainly based on adult subjects, the pronunciation patterns for children may differ from what the theories predict. Therefore, a pronunciation variation study was carried out on the TBall database [84], with the main results summarized here


Table 4.6: Pronunciation variant analysis for consonants and vowels on a Spanish-accented English database, with the percentage of occurrence in the analysis database shown in parentheses. Entries with a trailing asterisk are change patterns not predicted by the theories.

Consonant changes       Vowel changes
/z/ → /s/ (73.6)        /ih/ → /iy/ (33.4)
/th/ → /d/ (34.6)*      /uh/ → /uw/ (32.8)
/dh/ → /d/ (29.7)       /ae/ → /eh/ (11.7)
/d/ → /t/ (22.4)*       /ah/ → /eh/ (10.1)
/ch/ → /sh/ (22.2)*
/v/ → /f/ (21.6)
/ng/ → /n/ (17.0)*

in Table 4.6. These analysis results confirm most of the predicted hypotheses,

but also show some new observations. Listed here are only those patterns with

occurrence probability larger than 10% in the database. Some predicted pronun-

ciation variation hypotheses are not observed in the analysis database, including

the consonant changes trill /r/ and unaspirated /p/, /t/, /k/. The consonant change /th/ → /t/ does occur in the database, but with a rather low probability (9.2%), while another variant of /th/, /th/ → /d/, occurs with a much higher probability (34.6%). The hypothesized structure changes (phoneme insertion and deletion) occur with very low probability (<0.1%) and thus may not generalize well.

See [84] for a complete list of patterns and more detailed explanations.

Based on the analysis, an algorithm is developed to automatically detect

Spanish accent. Given the pronunciation variation patterns, a simple but effec-

tive measure for accent detection is the occurrence ratio of such patterns in an


utterance, defined as:

$$ R_{ph_1|ph_2} = \frac{C(ph_2 \rightarrow ph_1)}{C(ph_2)} \qquad (4.1) $$

where $R_{ph_1|ph_2}$ is the occurrence ratio of the pronunciation change pattern from one phoneme ($ph_2$) to another phoneme ($ph_1$), denoted $\{ph_2 \rightarrow ph_1\}$; $C(ph_2)$ is the occurrence count (OC) of $ph_2$, and $C(ph_2 \rightarrow ph_1)$ is the OC of pattern $\{ph_2 \rightarrow ph_1\}$. Since the system runs in a supervised mode

with available transcriptions, the OCs can be easily calculated through forced

alignment using a canonical pronunciation dictionary first and then an accented

pronunciation dictionary. The two alignment outputs are analyzed and compared

to calculate the OC of each pattern.

The average value of all occurring pattern ratios is a measure of the overall

accent of a speaker, i.e.,

$$ R = \frac{1}{M} \sum_{\{ph_2 \rightarrow ph_1\} \in P} R_{ph_1|ph_2} \qquad (4.2) $$

where P represents all valid pronunciation change patterns, and M is the to-

tal number of patterns occurring in the utterance. To make the estimates reliable, patterns whose occurrence count $C(ph_2)$ is below a threshold of 3 are not included in the calculation.

The speaker-level accent measure in Eq. (4.2) treats all pronunciation change patterns equally. A statistical analysis of the data, however, shows that patterns do not occur with the same probability, and some patterns occur much more frequently than others; e.g., the pattern {/z/→/s/} occurs with a probability of 73.6%, while the pattern {/v/→/f/} occurs only 21.5% of the time. The occurrence

probabilities can be viewed as the correlation between each pattern and the over-

all accent. The higher the probability, the more related the pattern is to accent.


To take this into account, Eq. (4.2) is changed into the following equation:

$$ R = \sum_{\{ph_2 \rightarrow ph_1\} \in P} p(ph_2 \rightarrow ph_1) \cdot R_{ph_1|ph_2} \qquad (4.3) $$

where $p(ph_2 \rightarrow ph_1)$ is the probability of pattern $\{ph_2 \rightarrow ph_1\}$, normalized so that the probabilities of all patterns sum to 1.

The accent score from Eq. (4.3) is used to classify a speaker’s accent level.

The higher the score, the more accented the utterance is. Since our database

does not have accent-level labels, a binary detection is performed to decide whether a speaker is Spanish-accented or not. Given a threshold $T_a$ (0.6 in the following experiments), if the score $R$ is greater than $T_a$, the speaker is classified as Spanish-accented. The accent detector achieved 83% correctness on an evaluation dataset that was labeled for accent.
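A compact sketch of this accent score is given below. The occurrence counts are assumed to have been obtained by comparing the two forced-alignment passes; the pattern probabilities are a subset of the values in Table 4.6, and normalizing them over only the observed patterns is an assumption about Eq. (4.3):

```python
# Illustrative subset of pattern probabilities from Table 4.6 (as fractions).
PATTERN_PROB = {("z", "s"): 0.736, ("dh", "d"): 0.297, ("v", "f"): 0.216,
                ("ih", "iy"): 0.334, ("uh", "uw"): 0.328}

def accent_score(pattern_counts, phoneme_counts, min_count=3):
    """pattern_counts[(ph2, ph1)]: count of ph2 realized as ph1, C(ph2 -> ph1).
    phoneme_counts[ph2]: occurrence count C(ph2) from the canonical alignment."""
    valid = {p: c for p, c in pattern_counts.items()
             if p in PATTERN_PROB and phoneme_counts[p[0]] >= min_count}
    if not valid:
        return 0.0
    total = sum(PATTERN_PROB[p] for p in valid)   # renormalize weights to sum to 1
    score = 0.0
    for (ph2, ph1), count in valid.items():
        ratio = count / phoneme_counts[ph2]                   # Eq. (4.1)
        score += (PATTERN_PROB[(ph2, ph1)] / total) * ratio   # Eq. (4.3)
    return score

def is_spanish_accented(score, threshold=0.6):
    # Binary decision with the threshold T_a = 0.6 used in the experiments.
    return score > threshold
```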

4.4.4 Pronunciation Dictionary

The dictionary used in accuracy assessment needs to consider possible pronuncia-

tion variations. Besides the canonical pronunciation for each word, the dictionary

also contains entries for non-canonical but correct (and common among children) pronun-

ciations from different dialects common in the Los Angeles area. For example,

many speakers do not distinguish cot and caught, pronouncing both as /k aa t/.

Therefore, /k aa t/ and /k ao t/ are both considered as correct pronunciations.

The dictionary also includes iy/ih alternations since Spanish learners of English

often do not separate them well. Hispanic letter-to-sound (LTS) rules are not applied in the dictionary, since LTS rules are for reading evaluations, while in our task the prompts are auditory. Although it is possible that these rules have some effect (since children hear speech from literate adults whose English is influenced by Hispanic LTS rules), such instances appeared to be rare


relative to the increase in size of the dictionary that would be needed to cover

them comprehensively.

The pronunciations in the dictionary have tags for these various pronunci-

ations (dialected pronunciation, canonical pronunciation, phonological develop-

ment issue, etc.). In this way, “dialect” or “idiolect” can be attributed in a simple

way: the likelihood for each pronunciation is calculated and the pronunciation

with the highest likelihood, if non-canonical, is declared as the “idiolect” for the

speaker for that word. A pattern of many words through the dialected path

would confirm a speaker as having dialected speech. A constraint for detecting

dialect is that the speaker must produce a consistent dialect, that is, the dialect,

if detectable, must be the same in most of the task words. In this way, we can

model the dialect as a system of distinctions, which is linguistically much more

appropriate.
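A possible data layout for such a tagged dictionary is sketched below; the entries, tag names, and the likelihood callback are illustrative assumptions rather than the actual TBall dictionary format:

```python
# word -> list of (pronunciation, tag) entries
PRON_DICT = {
    "caught": [(("k", "ao", "t"), "canonical"),
               (("k", "aa", "t"), "dialect")],   # cot/caught merger
    "ship":   [(("sh", "ih", "p"), "canonical"),
               (("sh", "iy", "p"), "dialect")],  # iy/ih alternation
}

def idiolect_tag(word, loglike_fn):
    """Pick the pronunciation with the highest forced-alignment log-likelihood
    (loglike_fn is assumed to score one pronunciation). A non-canonical winner
    is reported as the speaker's 'idiolect' tag for that word."""
    pron, tag = max(PRON_DICT[word], key=lambda entry: loglike_fn(entry[0]))
    return tag if tag != "canonical" else None
```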

4.4.5 Accuracy and Smoothness Measurements

Normalized HMM log likelihoods through forced alignments are calculated to

evaluate the pronunciation qualities. Accent information from the accent de-

tection component is used to choose appropriate entries from the pronunciation

dictionary. Local normalization is applied to compensate for utterance length

(time duration):

$$ S_l = \frac{1}{N} \sum_{i=1}^{N} \frac{s_i}{d_i} \qquad (4.4) $$

where $s_i$ is the log-likelihood of the $i$th segment (phoneme, onset, rhyme, syllable, or the pause between segments), $d_i$ is the corresponding duration in frames, and the summation is over all $N$ segments. The pronunciation is acceptable if the log-likelihood score satisfies $S_l > T_l$, where the threshold $T_l$ can be a speaker-independent empirical value or a speaker-specific value that accounts for an individual speaker's acoustic characteristics.
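A direct transcription of Eq. (4.4) might look like the following; the segment log-likelihoods and frame counts are assumed to come from a forced alignment computed elsewhere:

```python
def pronunciation_score(segment_loglikes, segment_frames):
    # S_l = (1/N) * sum_i (s_i / d_i) over the N aligned segments (Eq. 4.4).
    n = len(segment_loglikes)
    return sum(s / d for s, d in zip(segment_loglikes, segment_frames)) / n

def pronunciation_acceptable(score, threshold):
    # Accept when S_l exceeds the (speaker-independent or speaker-specific) T_l.
    return score > threshold
```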

Segment durations are used to measure the blending smoothness. The dura-

tions are obtained from forced alignments with the most likely pronunciations.

To compensate for the effects of rate of speech, the durations are normalized as:

$$ d_i = d_i \Big/ \sum_{j=1}^{N} d_j \qquad (4.5) $$

Gaussian mixture models (GMM) are used to approximate the distribution of

syllable duration ratios for each task word. Two GMMs are constructed from the training data, one for native English and the other for Spanish-accented English. Information from the accent detection component is used to select the appropriate model. The log-likelihood of the observed duration ratios under the GMM is used as the smoothness score $S_d$:

$$ S_d = \sum_{i} \log \sum_{m} \mathcal{N}(d_i;\, \mu_{im}, \sigma_{im}) \qquad (4.6) $$

where $\mathcal{N}(\cdot;\, \mu_{im}, \sigma_{im})$ is the $m$th Gaussian mixture component with mean $\mu_{im}$ and variance $\sigma_{im}$ for the $i$th segment. If $S_d$ is greater than the smoothness threshold $T_d$, the blending smoothness is acceptable.
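The smoothness computation of Eqs. (4.5) and (4.6) can be sketched as follows; the GMM parameters are assumed to be trained offline, and equal mixture weights are an assumption since Eq. (4.6) does not show them explicitly:

```python
import math

def gaussian(x, mu, var):
    # Univariate Gaussian density with mean mu and variance var.
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def duration_ratios(segment_frames):
    # Eq. (4.5): normalize each segment duration by the total duration.
    total = float(sum(segment_frames))
    return [d / total for d in segment_frames]

def smoothness_score(ratios, gmms):
    """Eq. (4.6): S_d = sum_i log sum_m N(d_i; mu_im, sigma_im),
    where gmms[i] is a list of (mu, var) components for the i-th segment."""
    score = 0.0
    for d, components in zip(ratios, gmms):
        mix = sum(gaussian(d, mu, var) for mu, var in components) / len(components)
        score += math.log(max(mix, 1e-300))   # floor to avoid log(0)
    return score
```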

4.4.6 Overall Quality Measurement

The overall quality is unacceptable if either pronunciation or smoothness is un-

acceptable. If the pronunciation and smoothness are both acceptable, the overall

quality is evaluated based on the weighted summation of pronunciation scores

and smoothness scores:

$$ S = w \cdot S_l + (1 - w) \cdot S_d \qquad (4.7) $$

A threshold T is used to decide the acceptability of the overall quality. Similar

to pronunciation evaluation, T can be speaker-independent or speaker-specific.
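The resulting decision logic can be summarized in a few lines; the value w = 0.35 is taken from the experimental results reported in the next section, and the thresholds are assumed to be tuned empirically:

```python
def overall_decision(s_l, s_d, t_l, t_d, t_overall, w=0.35):
    # Reject if either pronunciation or smoothness fails its own threshold;
    # otherwise combine S = w*S_l + (1-w)*S_d (Eq. 4.7) and compare with T.
    if s_l <= t_l or s_d <= t_d:
        return False
    return (w * s_l + (1.0 - w) * s_d) > t_overall
```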


4.5 Experimental Results

To test system performance, evaluations from teachers were used as references.

Acoustic monophone models were trained on the TBall database (excluding the blending tasks) with approximately seven hours of annotated recordings from both native and nonnative speakers. For each blending task, performance was tested on 1350 utterances. Speaker-independent decision thresholds were used in all

experiments. Table 4.7 shows the correlation between automatic and average

teacher evaluations for the three blending tasks.

For pronunciation quality evaluation, normalized likelihoods correlate well

with teacher assessments. For the smoothness measurement, duration ratio scores

achieved comparable performance to the average inter-correlation between teach-

ers. The overall evaluation using a weighted summation of pronunciation and

smoothness scores obtained an average correlation around 88% over the three

tasks, slightly better than the average inter-teacher correlation. The weight $w$ that gives the best performance is 0.35, which indicates that smoothness contributes more than pronunciation accuracy in the blending task. Note that on the syllable blending task,

overall performance is improved from 87.5% (in [40]) to 91.8% due to disfluency

and accent detection.

Table 4.7: Average correlation between ASR and teacher evaluations on pronun-

ciation accuracy, smoothness and overall qualities for three blending tasks.

Blending task   Accuracy (%)   Smoothness (%)   Overall (%)
Phonemes        90.5           79.8             85.6
Onset-rhyme     93.2           83.1             87.9
Syllables       95.4           90.7             91.8


4.6 Summary and Discussion

An automatic evaluation system is developed to assess children’s performance on

three blending tasks. The system applies disfluency detection and accent detec-

tion for pre-processing and uses a pronunciation dictionary for forced alignment to

generate sound segmentations and produce HMM likelihood scores. The weighted

summation of normalized likelihoods and duration scores is used to evaluate the

overall quality of children’s responses. Speaker-specific accent information is used

to update the dictionary and duration ratio models. Compared to teachers’ as-

sessments, the system achieves a correlation better than the average inter-teacher

correlation.


CHAPTER 5

Summary and Future Work

5.1 Summary and Discussion

This dissertation investigates rapid speaker normalization and adaptation meth-

ods to reduce speaker variations in automatic speech recognition (ASR) systems.

Two methods are developed based on the supraglottal (vocal tract) resonances

(formants), and the subglottal acoustic system resonances, respectively. As an

application, an automatic system is developed using ASR to evaluate children’s

language learning skills.

Chapter 2 presents a rapid speaker adaptation method using regression-tree

based spectral peak alignment. Based on the analysis of various levels of spectral

mismatch in formant structures, this method is proposed to reduce phoneme-

and state-level spectral mismatch. With the linearization of frequency warping

in the cepstral domain, the method is investigated in a maximum likelihood lin-

ear regression (MLLR)-like framework, where the transformation matrix is gen-

erated deterministically by aligning phoneme- or state-level formant-like peaks

in the spectrum, while bias and covariance are statistically estimated using the

Expectation Maximization (EM) algorithm. This method can be viewed as a

combination of vocal tract length normalization (VTLN) and MLLR, taking ad-

vantage of both the efficiency of VTLN and the reliability of MLLR, with the

potential of good performance for both limited and large amounts of adaptation


data. Experimental results show that this method can achieve significant per-

formance improvements over MLLR, especially for cases where only a very small

amount of data is available.

Chapter 3 develops a speaker normalization method using the subglottal res-

onances. Based on the acoustic coupling between the subglottal system and the vocal tract, a reliable algorithm is proposed to automatically estimate the second

subglottal resonance (Sg2) from speech signals using joint frequency discontinu-

ity and energy attenuation measurements. The algorithm provides Sg2 estimates

which are very close to the ground truth as determined from direct measurements

using accelerometer data. With the proposed algorithm, analysis is conducted on speech data from Spanish-English bilingual child speakers to investigate the content and language variability of the subglottal resonances. It is shown that, for a given speaker, the second subglottal resonance does not appear to vary across speech sounds, repetitions, or even languages. Based on such observa-

tions, a speaker normalization method is proposed using the second subglottal

resonance. This normalization method defines the warping factor as the ratio

of the reference subglottal resonance over that of the test speaker. A variety

of evaluations show that the second subglottal resonance normalization method

performs better than, or comparably to, VTLN, especially for limited adaptation data and cross-language adaptation. An obvious advantage of this method is

that the subglottal resonances remain roughly constant for a specific speaker and

thus this method is potentially independent of the amount of available adaptation

data and language.

Chapter 4 introduces ASR techniques to evaluate children’s language learning skills through blending and segmentation tasks. The challenges stem from the children’s young age, their multilingual backgrounds, and the frequent occurrence of disfluencies. Accordingly, the system applies speaker normalization, disfluency de-

tection and accent detection for pre-processing to localize possible valid responses.

The target response is then passed to an ASR system, which uses a pronuncia-

tion dictionary for forced alignment to generate sound segment durations and to

produce HMM likelihood scores. Normalized likelihood scores and duration

scores are used to measure pronunciation accuracy and smoothness, respectively.

These two scores are then combined to assess the overall quality. Speaker-specific

accent information is used to update the dictionary and duration ratio models.

Compared to teachers’ assessments, the system achieves a correlation better than

the average inter-teacher correlation.

5.2 Future Work

Rapid speaker normalization and adaptation is an important issue for real-world

ASR systems to provide robust and satisfactory performance over a large distri-

bution of speakers. Because of the problem of data sparsity, knowledge-based methods usually tend to outperform purely data-driven statistical methods, as shown in this dissertation for both spectral peak alignment and subglottal resonance normalization. Such prior information can greatly reduce both the amount of data required and the dependency on the available data.

For future work, it is worth exploring the incorporation of acoustic and percep-

tual knowledge into the currently data-driven ASR systems. The prior knowledge

can guide statistical ASR systems toward smarter and more efficient training and decoding.

Besides information about formant structures and subglottal resonances as uti-

lized in this dissertation, acoustic phonetic knowledge such as classic phonetic

features of place, manner and voicing, and distinctive features may also be help-

ful for acoustic modeling to improve the model’s discriminative ability. Since


these features are developed to specify a phoneme and to describe speech sounds

in a particular language or dialect, incorporating them into ASR systems seems

promising as reported in [86].

The spectral peak alignment method is investigated in an MLLR-like framework. It is also possible to use Maximum A Posteriori Linear Regression (MAPLR) [57-59] to incorporate prior knowledge based on an empirical Bayes (EB) ap-

proach [57] and/or the structural information of the models [58]. Provided that

appropriate priors are chosen, MAPLR may significantly outperform MLLR.

The Sg2 variations are studied on a small data set in this dissertation. More

data may need to be collected and analyzed in order to refine the characterization

of subglottal resonance variability. Future work is also required to further improve

the accuracy of the Sg2 detector, evaluate the method on a large vocabulary

database and in noisy conditions.

For the automatic evaluation system, future work will aim to improve perfor-

mance using additional features and speaker-specific modeling.


References

[1] L.E. Baum and J.A. Eagon, "An inequality with applications to statistical estimation for probabilistic functions of Markov processes and to a model for ecology," Bulletin of the American Mathematical Society, vol. 73, pp. 360-363, 1967.
[2] L. Rabiner and B.-H. Juang, Fundamentals of Speech Recognition, Prentice Hall, 1993.
[3] L. Baum, "An inequality and associated maximization technique in statistical estimation for probabilistic functions of Markov processes," Inequalities, vol. 3, pp. 1-8, 1972.
[4] A.J. Viterbi, "Error bounds for convolutional codes and an asymptotically optimal decoding algorithm," IEEE Trans. Inform. Theory, vol. IT-13, pp. 260-269, 1967.
[5] A.P. Dempster, N.M. Laird and D.B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," J. Roy. Stat. Soc., vol. 39(1), pp. 1-38, 1977.
[6] J. Markel and A. Gray, Linear Prediction of Speech, Springer-Verlag, 1976.
[7] H. Hermansky, "Perceptual Linear Prediction (PLP) analysis of speech," Journal of the Acoustical Society of America, vol. 87(4), pp. 1738-1752, 1990.
[8] S. Davis and P. Mermelstein, "Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences," IEEE Trans. on Acoustics, Speech, Signal Proc., vol. 28(4), pp. 357-366, 1980.
[9] X. Huang and K.F. Lee, "On speaker-independent, speaker-dependent, and speaker-adaptive speech recognition," IEEE Trans. Speech and Audio Processing, vol. 1(2), pp. 150-157, 1993.
[10] H. Wakita, "Normalization of vowels by vocal tract length and its application to vowel identification," IEEE Trans. Acoust., Speech, Signal Processing, vol. 25, pp. 183-192, 1977.
[11] G. Fant, Acoustic Theory of Speech Production, Mouton, The Hague, The Netherlands, 1960.
[12] V. Digalakis, D. Rtischev and L.G. Neumeyer, "Speaker adaptation using constrained estimation of Gaussian mixtures," IEEE Trans. Speech Audio Processing, vol. 3(5), pp. 357-366, 1995.


[13] V. Digalakis, S. Berkowitz, E. Bocchieri, C. Boulis and W. Byrne, "Rapid speech recognizer adaptation to new speakers," in Proc. ICASSP, pp. 765-768, 1999.
[14] M.J.F. Gales, "Maximum likelihood linear transformations for HMM-based speech recognition," Computer Speech and Language, vol. 12(2), pp. 75-98, 1998.
[15] C.J. Leggetter and P.C. Woodland, "Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models," Computer Speech and Language, vol. 9, pp. 171-185, 1995.
[16] L. Lee and R. Rose, "A frequency warping approach to speaker normalization," IEEE Trans. Speech Audio Processing, vol. 6(1), pp. 49-60, 1998.
[17] S. Wegmann, D. McAllaster, J. Orloff and B. Peskin, "Speaker normalisation on conversational telephone speech," in Proc. ICASSP, vol. I, pp. 339-341, 1996.
[18] E. Eide and H. Gish, "A parametric approach to vocal tract length normalization," in Proc. ICASSP, pp. 346-349, 1996.
[19] J. McDonough, W. Byrne and X. Luo, "Speaker normalization with all-pass transforms," in Proc. ICSLP, vol. VI, pp. 2307-2310, 1998.
[20] J. McDonough, "Speaker compensation with all-pass transforms," Ph.D. dissertation, Johns Hopkins University, 2000.
[21] J.L. Gauvain and C.H. Lee, "Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains," IEEE Trans. Speech Audio Processing, vol. 2(2), pp. 291-298, 1994.
[22] G. Ding, Y. Zhu, C. Li and B. Xu, "Implementing vocal tract length normalization in the MLLR framework," in Proc. ICSLP, pp. 1389-1392, 2002.
[23] T. Claes, I. Dologlou, L. Bosch and D.V. Compernolle, "A novel feature transformation for vocal tract length normalization in automatic speech recognition," IEEE Trans. Speech Audio Processing, vol. 11(6), pp. 549-557, 1998.
[24] J. McDonough and W. Byrne, "Speaker adaptation with all-pass transforms," in Proc. ICASSP, pp. 757-760, 1999.
[25] J. McDonough, T. Shaaf and A. Waibel, "Speaker adaptation with all-pass transforms," Speech Communication, vol. 42, pp. 75-91, 2004.


[26] J. McDonough and A. Waibel, "Performance comparisons of all-pass transform adaptation with maximum likelihood linear regression," in Proc. ICASSP, pp. I313-I316, 2004.
[27] M. Pitz and H. Ney, "Vocal tract normalization as linear transformation of MFCC," in Proc. Eur. Conf. Speech Communication and Technology, pp. 1445-1448, 2003.
[28] S. Umesh, A. Zolnay and H. Ney, "Implementing frequency-warping and VTLN through linear transformation of conventional MFCC," in Proc. Interspeech-2005, pp. 269-272, 2005.
[29] X. Cui and A. Alwan, "MLLR-like speaker adaptation based on linearization of VTLN with MFCC features," in Proc. Interspeech-2005, pp. 273-276, 2005.
[30] X. Cui and A. Alwan, "Adaptation of children's speech with limited data based on formant-like peak alignment," Computer Speech and Language, vol. 20(4), pp. 400-419, 2006.
[31] E. Gouvea and R. Stern, "Speaker normalization through formant-based warping of the frequency scale," in Proc. Eurospeech, pp. 1139-1142, 1997.
[32] P. Zhan and M. Westphal, "Speaker normalization based on frequency warping," in Proc. ICASSP, pp. 1039-1041, 1997.
[33] S. Wang, X. Cui and A. Alwan, "Speaker adaptation with limited data using regression-tree based spectral peak alignment," IEEE Trans. Audio, Speech, and Language Processing, vol. 15, pp. 2454-2464, 2007.
[34] X. Wang, B. Wang and D. Qi, "A bilinear transform approach for vocal tract length normalization," in Proc. ICARCV, pp. 547-551, 2004.
[35] V. Zue, S. Seneff, J. Polifroni, H. Meng and J. Glass, "Multilingual human-computer interactions: from information access to language learning," in Proc. ICSLP, 1996.
[36] J. Wilpon and C. Jacobsen, "A study of speech recognition for children and the elderly," in Proc. ICASSP, vol. I, pp. 349-352, 1996.
[37] J. Mostow, G. Aist, P. Burkhead, A. Corbett, A. Cuneo, S. Eitelman, C. Huang, B. Junker, M.B. Sklar and B. Tobin, "Evaluation of an automated reading tutor that listens: comparison to human tutoring and classroom instruction," Journal of Educational Computing Research, vol. 29(1), pp. 61-117, 2003.


[38] A. Hagen, B. Pellom and R. Cole, "Children's speech recognition with applications to interactive books and tutors," in Proc. ASRU, 2003.
[39] A. Alwan, et al., "A system for technology based assessment of language and literacy in young children: the role of multiple information sources," in Proc. IEEE MMSP, Greece, October 2007.
[40] S. Wang, et al., "Automatic evaluation of children's performance on an English syllable blending task," in Proc. SLaTE Workshop, 2007.
[41] S. Lee, A. Potamianos and S. Narayanan, "Acoustics of children's speech: developmental changes of temporal and spectral parameters," J. Acoust. Soc. Am., vol. 105(3), pp. 1455-1468, 1999.
[42] J.E. Huber, E.T. Stathopoulos, G.M. Curione, T.A. Ash and K. Johnson, "Formants of children, women, and men: the effect of vocal intensity variation," J. Acoust. Soc. Am., vol. 106(3), pp. 1532-1542, 1999.
[43] D. Elenius and M. Blomberg, "Comparing speech recognition for adults and children," in Proc. FONETIK, 2004.
[44] K. Lee, A. Hagen, N. Romanyshyn, S. Martin and B. Pellom, "Analysis and detection of reading miscues for interactive literacy tutors," in Proc. COLING, 2004.
[45] E. Shriberg, R. Bates and A. Stolcke, "A prosody-only decision-tree model for disfluency detection," in Proc. Eurospeech, pp. 2383-2386, 1997.
[46] Y. Liu, E. Shriberg and A. Stolcke, "Automatic disfluency identification in conversational speech using multiple knowledge sources," in Proc. Eurospeech, pp. 957-960, 2003.
[47] M. Black, et al., "Automatic detection and classification of disfluent reading miscues in young children's speech for the purpose of assessment," in Proc. Interspeech, 2007.
[48] A. Hagen, B. Pellom and R. Cole, "Highly accurate children's speech recognition for interactive reading tutors using subword units," Speech Communication, vol. 49, pp. 861-873, 2007.
[49] T. Chen, C. Huang, E. Chang and J. Wang, "Automatic accent identification using Gaussian mixture models," in Proc. IEEE Workshop on ASRU, 2001.
[50] C. Teixeira, H. Franco, E. Shriberg, E. Precoda and K. Sonmez, "Evaluation of speaker's degree of nativeness using text-independent prosodic features," in Proc. Multilingual Speech and Language Processing, 2001.


[51] T. Schultz, Q. Jin, K. Laskowski, A. Tribble and A. Waibel, "Speaker, accent and language identification using multilingual phone strings," in Proc. HLT, 2002.
[52] T. Anastasakos, J. McDonough, R. Schwartz and J. Makhoul, "A compact model for speaker-adaptive training," in Proc. ICSLP, pp. 1137-1140, 1996.
[53] P. Price, W.M. Fisher, J. Bernstein and D.S. Pallett, "The DARPA 1000-word resource management database for continuous speech recognition," in Proc. ICASSP, vol. 1, pp. 651-654, 1988.
[54] R. Leonard, "A database for speaker-independent digit recognition," in Proc. ICASSP, vol. 9, pp. 328-331, 1984.
[55] P. Zolfaghari and T. Robinson, "Formant analysis using mixtures of Gaussians," in Proc. ICSLP, pp. 1229-1232, 1996.
[56] S. Panchapagesan, "Frequency warping by linear transformation of standard MFCC," in Proc. Interspeech, pp. 397-400, 2006.
[57] C. Chesta, O. Siohan and C. Lee, "Maximum a posteriori linear regression for hidden Markov model adaptation," in Proc. Eurospeech, pp. 211-214, 1999.
[58] W. Chou and X. He, "Maximum a posteriori linear regression (MAPLR) variance adaptation for continuous density HMMs," in Proc. Eurospeech, pp. 1513-1516, 2003.
[59] W. Chou, "Maximum a posteriori linear regression with elliptically symmetric matrix variate priors," in Proc. Eurospeech, pp. 1-4, 1999.
[60] L.R. Rabiner and R.W. Schafer, Digital Processing of Speech Signals, Prentice Hall, 1978.
[61] L. Gillick and S. Cox, "Some statistical issues in the comparison of speech recognition algorithms," in Proc. ICASSP, pp. 532-535, 1989.
[62] K.N. Stevens, Acoustic Phonetics, MIT Press, Cambridge, MA, 1998.
[63] S.M. Lulich, "Subglottal resonances and distinctive features," J. Phon., doi:10.1016/j.wocn.2008.10.006, 2008.
[64] S.M. Lulich, M. Zanartu, D.D. Mehta and R.E. Hillman, "Source-filter interaction in the opposite direction: subglottal coupling and the influence of vocal fold mechanics on vowel spectra during the closed phase," J. Acoust. Soc. Am., vol. 125, p. 2638, 2009.


[65] X. Chi and M. Sonderegger, "Subglottal coupling and its influence on vowel formants," J. Acoust. Soc. Am., vol. 122(3), pp. 1735-1745, 2007.
[66] Y. Jung, S.M. Lulich and K.N. Stevens, "Development of subglottal quantal effects in young children," J. Acoust. Soc. Am., vol. 124, p. 2519, 2008.
[67] S.M. Lulich, "The role of lower airway resonances in defining vowel feature contrasts," Ph.D. dissertation, MIT, Cambridge, MA, 2006.
[68] S.M. Lulich, A. Bachrach and N. Malyska, "A role for the second subglottal resonance in lexical access," J. Acoust. Soc. Am., vol. 122(4), pp. 2320-2327, 2007.
[69] M. Sonderegger, "Subglottal coupling and vowel space: an investigation in quantal theory," Undergraduate thesis, MIT, Cambridge, MA, 2004.
[70] H.A. Cheyne, "Estimating glottal voicing source characteristics by measuring and modeling the acceleration of the skin on the neck," Ph.D. dissertation, MIT, Cambridge, MA, 2001.
[71] The Snack Sound Toolkit, Royal Inst. Technol., Oct. 2005 [Online]. Available: http://www.speech.kth.se/snack/ (date last viewed 08/10/08).
[72] K. Honda, S. Takano and H. Takemoto, "Effects of side cavities and tongue stabilization: possible extensions of quantal theory," J. Phon., doi:10.1016/j.wocn.2008.11.002, 2008.
[73] H. Hanson and K.N. Stevens, "Subglottal resonances in female speakers and their effect on vowel spectra," in Proc. XIIIth International Congress of Phonetic Sciences, Stockholm, vol. 3, pp. 182-185, 1995.
[74] X. Chi and M. Sonderegger, "Subglottal coupling and vowel space," J. Acoust. Soc. Am., vol. 115(5), pp. 2540-2540, 2004.
[75] Y. Jung, "Acoustic articulatory evidence for quantal vowel categories across languages," poster presented at the Harvard-MIT HST Forum, 2008.
[76] A. Madsack, S.M. Lulich, W. Wokurek and G. Dogil, "Subglottal resonances and vowel formant variability: a case study of high German monophthongs and Swabian diphthongs," Lab. Phon. 11, 2008.
[77] A. Kazemzadeh, H. You, M. Iseli, B. Jones, X. Cui, M. Heritage, P. Price, E. Anderson, S. Narayanan and A. Alwan, "TBall data collection: the making of a young children's speech corpus," in Proc. Eurospeech, pp. 1581-1584, 2005.


[78] S. Wang, S.M. Lulich and A. Alwan, "A reliable technique for detecting the second subglottal resonance and its use in cross-language speaker adaptation," in Proc. Interspeech, pp. 1717-1720, 2008.
[79] S. Umesh, L. Cohen and D. Nelson, "Frequency warping and the Mel scale," IEEE Signal Processing Letters, vol. 9(3), pp. 104-107, 2002.
[80] R. Sinha and S. Umesh, "A shift-based approach to speaker normalization using non-linear frequency-scaling model," Speech Communication, vol. 50, pp. 191-202, 2008.
[81] Y. Ono, H. Wakita and Y. Zhao, "Speaker normalization using constrained spectra shifts in auditory filter domain," in Proc. Eurospeech, pp. 21-23, 1993.
[82] D. Burnett and M. Fanty, "Rapid unsupervised adaptation to children's speech on a connected-digit task," in Proc. ICASSP, pp. 1145-1148, 1996.
[83] P. Zhan and A. Waibel, "Vocal tract length normalization for large vocabulary continuous speech recognition," Tech. Rep. CMU-CS-97-148, Carnegie Mellon University, May 1997.
[84] H. You, et al., "Pronunciation variation of Spanish-accented English spoken by young children," in Proc. Eurospeech, pp. 273-276, 2005.
[85] The Speech Accent Archive [Online]. Available: http://accent.gmu.edu/ (date last viewed 02/10/10).
[86] M. Hasegawa-Johnson, J. Baker, S. Borys, K. Chen, E. Coogan, S. Greenberg, A. Juneja, K. Kirchhoff, K. Livescu, S. Mohan, J. Muller, K. Sonmez and T. Wang, "Landmark-based speech recognition: report of the 2004 Johns Hopkins Summer Workshop," in Proc. ICASSP, pp. 214-216, 2005.
