
Experiment with Adaptation and Vocal Tract Length Normalization at Automatic Speech Recognition of Children’s Speech

SARA ÖHGREN

Master of Science Thesis Stockholm, Sweden 2007


Master’s Thesis in Speech Communication (30 ECTS credits)
School of Electrical Engineering
Royal Institute of Technology, 2007
Supervisors at CSC: Mats Blomberg and Daniel Elenius
Examiner: Rolf Carlson
TRITA-CSC-E 2007:127
ISRN-KTH/CSC/E--07/127--SE
ISSN-1653-5715
Royal Institute of Technology
School of Computer Science and Communication
KTH CSC
SE-100 44 Stockholm, Sweden
URL: www.csc.kth.se


Abstract

When a system for automatic speech recognition developed for adults’ speech handles children’s speech, the recognition performance deteriorates. Children have brighter voices and might not yet have learned the proper pronunciations. Children also speak more spontaneously.

Children’s brighter voices are to a large extent a result of their shorter vocal tracts, i.e. from the larynx to the lips. To compensate for this, a method called vocal tract length normalization (VTLN) is helpful. The basic idea is to compress or stretch the frequency axis so that the power concentrations shift to roughly the same positions as in adults’ speech. Another method to improve the recognition performance on children’s speech is to adapt the recognizer to children’s speech. The adaptation method used in this master thesis is called Maximum Likelihood Linear Regression (MLLR).

In this thesis these two methods were combined to see whether the recognition performance would improve compared to using just one of them. There was an improvement, at least within an interval around the warp factors that generally work for children’s speech. The other part of this master thesis was to examine whether the recognition performance of VTLN would improve if the warp factor for the frequency axis of the children’s speech was estimated from the speech part of the utterances only, instead of from the entire utterance. The word error rate decreased by approximately 4-6 percentage points absolute.


Experiments with adaptation and vocal tract length normalization in automatic recognition of children’s speech

Summary

Systems for automatic speech recognition are normally trained only on adult speech and therefore perform considerably worse on children’s speech. Some reasons are that the fundamental frequency and the formant frequencies are higher for children than for adults because of their smaller physical proportions. Children also speak more spontaneously, and they may not yet have learned the proper pronunciations.

Training the system on children’s speech is often not an option, because sufficiently large recordings of children’s speech are lacking. Instead, various methods are used that compensate for certain differences between adult and child speech. To compensate for children’s shorter vocal tracts, i.e. from the larynx to the lips, a method called vocal tract length normalization (VTLN) can be used. The basic idea is to compress or stretch the frequency axis of the children’s speech with a suitable scale factor, so that the energy concentrations move to approximately the same positions as in the trained adult models. Another method is to adapt the recognizer’s acoustic reference models to children’s speech. The adaptation method used in this master thesis is called Maximum Likelihood Linear Regression (MLLR).

The task was to combine these two methods and see whether the recognition results could be improved further compared to each method alone. This assumption could be verified within an interval around the scale factors that have previously proved suitable for children’s speech in general. In addition, it was examined whether the recognition results of VTLN could be improved by estimating the scale factor for the frequency axis of the children’s speech from the speech part of the utterance only, rather than from the entire utterance. Here too there was an improvement: the word error rate decreased by about 4-6 percentage points absolute.


Contents

1 Introduction
    1.1 Background
    1.2 Problem formulation
    1.3 Method
    1.4 Tools
    1.5 Outline of the report

2 Theory
    2.1 Automatic Speech Recognition (ASR)
    2.2 Speech production model
        2.2.1 The Source
        2.2.2 The Filter
    2.3 Signal Processing
    2.4 Feature Extraction
    2.5 Hidden Markov Models (HMM)
    2.6 Adaptation
    2.7 MLLR
    2.8 Vocal Tract Length Normalization
    2.9 ASR with children’s speech

3 Frequency scaling
    3.1 HTK’s standard frequency warp function
    3.2 Linear frequency warp function

4 Databases
    4.1 The SpeeCon database
    4.2 The PF-Star database

5 Recognizer

6 Experiments
    6.1 Baseline Experiments
    6.2 MLLR
    6.3 VTLN
    6.4 VTLN + MLLR
    6.5 Estimation of warp factor based only on the speech part
        6.5.1 Direct method
        6.5.2 Subtraction method

7 Results
    7.1 VTLN and MLLR
        7.1.1 VTLN and MLLR, group 1632
        7.1.2 VTLN and MLLR, group 1616
        7.1.3 VTLN and MLLR, group 816
    7.2 Estimation of warp factor based on the speech part
        7.2.1 Estimation of warp factor based on the speech part, group 1632
        7.2.2 Estimation of warp factor based on the speech part, group 816
    7.3 Significance

8 Discussion
    8.1 General results from the VTLN and MLLR experiments
        8.1.1 Comparing the results between the different bandwidths
    8.2 Estimation of warp factor based on the speech part

9 Conclusion
    9.1 Outlook


Acknowledgments

I would like to thank my supervisors Mats Blomberg and Daniel Elenius and my examiner Rolf Carlson for help and support during this work. I would also like to thank my opponent Asa Wallers for her critique of my report and seminar.


List of Abbreviations

ASR   Automatic Speech Recognition
HMM   Hidden Markov Model
HTK   HMM ToolKit
GMM   Gaussian Mixture Model
VTLN  Vocal Tract Length Normalization
FFT   Fast Fourier Transform
WER   Word Error Rate
MLLR  Maximum Likelihood Linear Regression
MFCC  Mel Frequency Cepstral Coefficients


1 Introduction

1.1 Background

This master thesis contains experiments on Automatic Speech Recognition (ASR) for children’s speech. ASR is a technology that transforms human speech into a symbolic representation. The transformation is done by a recognizer, and the long-term goal is a recognizer able to handle spontaneous speech from any speaker in any environment.

ASR performs well with adults’ speech, but when children’s speech is recognized, the performance decreases considerably. Children differ in both physical conditions and pronunciation. Children have shorter vocal tracts and shorter vocal folds, which makes their fundamental and formant frequencies higher. Children also lose their milk teeth when they are about 6 years old, and their pronunciation deteriorates for some time. Most of the differences can be compensated for. That children talk more spontaneously, have smaller vocabularies and talk less grammatically than adults (Blomberg M., Elenius D., 2003) is not theoretically impossible to compensate for, but it is much harder and more complicated, and it is not discussed in this thesis.

The purpose of this project was to see whether combining two methods, each examined in a separate earlier Master’s thesis, could give an improvement compared to using just one of them. Both methods separately improve ASR of children’s speech. One thesis concerned adaptation methods (Mahl, 2003) and examined how children’s speech recognition results improved with different adaptation techniques. Adaptation means that the existing models of adults’ speech in the recognizer are slightly modified to better resemble children’s speech. The adaptation method used in that thesis was Maximum Likelihood Linear Regression (MLLR). The other thesis concerned Vocal Tract Length Normalization (VTLN) for child speech (Schiold, 2004) and examined how VTLN may improve ASR of children’s speech. Automatic speech recognition, adaptation and vocal tract length normalization are explained later, in the theory section (Section 2).

The second part of this master thesis was about finding a way to possibly improve the performance of VTLN. This is explained in more detail in Section 1.2.

1.2 Problem formulation

Both VTLN and MLLR separately increase the performance of ASR of children’s speech (Schiold, 2004), (Mahl, 2003), but the recognition performance is still not as high as for adults’ speech. A combination of the two methods might improve the performance further.

The second part of this thesis studies ways to possibly improve the performance of VTLN. VTLN applies a spectral adjustment to the speech, but unfortunately also to the non-speech segments, the background noise. This reduces the power of VTLN, since non-speech segments should in general not be exposed to VTLN.

1.3 Method

To examine the effects of combining the two methods, MLLR and VTLN, for improving ASR on children’s speech (Sec 1.2), vocal tract length normalization (VTLN) was applied to the children’s speech, i.e. the frequency axis of the spectrum was compressed to make it resemble adults’ speech better. Then the speech recognizer was adapted to the compressed children’s speech. Finally, the newly adapted recognizer was evaluated with the compressed children’s speech, to see how the performance had developed. The adaptation was carried out with Maximum Likelihood Linear Regression (MLLR).

To minimize the effect of the differences in the background noise, the non-speech segments, when using VTLN, it is possible to estimate the frequency scaling factor (Sec 2.8) for a child’s utterance from the speech part of the utterance only, ignoring the non-speech part. After the warp factor has been estimated, the entire utterance, including the non-speech parts, is warped, and finally the warped utterance is recognized (Sec 2.1) for evaluation. Two methods for estimating the warp factor from the speech part alone were evaluated.

1.4 Tools

The recognizer used in the experiments in this thesis was based on RefRec (Giampero, 2002) and an adaptation of it to word models, Lena Mahl’s digit recognizer (Ohlin, 2004). The script language Perl was used to call functions from HTK, the Hidden Markov Model ToolKit (Young, Evermann, Kershaw, Moore, Odell, Ollason, Povey, Valtchev, Woodland, 2002). HTK was first developed at Cambridge University but is now owned by Microsoft and consists of free C source code. It contains tools for building, training, adapting and evaluating the Hidden Markov Models (HMMs) used for automatic speech recognition (ASR).

1.5 Outline of the report

The rest of this report is outlined as follows. Chapter 2 contains an overview of the theory of automatic speech recognition (ASR): signal processing, feature extraction and HMMs. It also contains a section on adaptation and one on MLLR, covers the speech production model and explains VTLN. Chapter 3 describes the two warp functions used in this thesis. Chapter 4 accounts for the databases used. The recognizer used for the experiments is described in Chapter 5. Chapter 6 presents the experiments conducted. The results can be found in Chapter 7, and the two last chapters contain a discussion of the results and the conclusions drawn from them.



2 Theory

2.1 Automatic Speech Recognition (ASR)

Automatic Speech Recognition (ASR) is a technology that transforms human speech into a symbolic representation.

The first research on ASR was conducted in the late 1940s, and the development in the field has closely followed the increasing capacity and speed of computer processing. The research has come a long way, and today voice control is found in, for instance, telephone-operated booking systems, aids for the disabled and dictation systems (Blomberg M., Elenius K., 2003).

However, there are still restraints that prevent ASR from functioning perfectly in all application areas. Factors limiting recognition performance are, for example, large vocabularies, continuous spontaneous speech instead of isolated words, and the fact that recognizers must be able to interpret speech from multiple users. Speakers differ in age, gender, and geographical and social dialects. The environment and the transmission channels also vary (Holmes, Holmes, 2001). Automatic speech recognition is described in more detail in the following sections. ASR can be divided into three main parts: signal processing, feature extraction and classification.

2.2 Speech production model

The speech production process can be seen as composed of a source, the lungs forcing air through the glottis, and a filter, the vocal tract (Sundberg, 1986).

Figure 1: The speech production organs (picture from (Sixtus, Molau, Kanthak, Schluter, Ney, 2000)).



2.2.1 The Source

The vocal folds, or vocal cords, are two folds of tissue stretched across the opening in the larynx (Fig 1). The distance between the folds can be varied by muscular control. When they are brought close together but remain slightly apart, the slit-like opening between them is called the glottis.

Three different types of sound sources can be produced by the air stream from the lungs. One of them is the voiced sound that produces the vowels and some of the consonants, such as [m, n, l, w]. The voiced sound is produced when the vocal cords are brought tightly together, but still not closed, and an airflow from the lungs forces them to vibrate, i.e. to alternate between the open and the closed state. The variation of airflow is illustrated in Fig 2.

Figure 2: Glottal cycle: air flow through the glottis during voiced sounds (picture from (Holmes, Holmes, 2001)).

The inverse of the duration of one glottal cycle corresponds to the fundamental frequency, which is determined by the mass and tension of the cords, but is also affected by the air pressure from the lungs (Holmes, Holmes, 2001). The fundamental frequency is 80-160 Hz for men and 150-400 Hz for women and children.

The second type of sound source is the air turbulence that is caused when air from the lungs is forced through a constriction in the vocal tract while the cords are open. An unvoiced fricative is then produced, and its identity is defined by the position of the constriction.

The third and last type of sound source is the release phase of stop consonants. During an initial interval where the vocal tract is closed, pressure is built up, which is then released as the consonant is spoken. Examples are [p, k, t]. These consonants are called plosives.

This division of the sound sources into three types follows (Holmes, Holmes, 2001).

2.2.2 The Filter

Some of the acoustic differences between speech sounds stem from the vocal cords and how we use them. The other characteristics stem from the vocal tract, where the glottal pulses are shaped into speech. The vocal tract includes everything from the vocal cords up to the lips. The main parts are the pharynx, the nasal cavity and the different parts of the mouth. When we change the shape of the vocal tract, different sounds are obtained, since different shapes of the vocal tract result in different resonant frequencies. In voiced speech, such as vowels and voiced consonants, the resonant frequencies are called formants (Fig 3).

Figure 3: Typical response of a 10-section acoustic tube, such as is illustrated in Fig 4. The peaks correspond to resonant frequencies, formants. (Picture from (Holmes, Holmes, 2001).)

To calculate the resonant frequencies we need a simplified model of the vocal tract. One simple model is the lossless tube model. To obtain it, we straighten the curved vocal tract and make a cylinder of it. Since one cylinder is too rough a model for the purpose, several cylinders with different radii are concatenated (Fig 4). If we assume that there are no losses from friction between the air and the tube, and that the tube does not vibrate, we have the lossless tube model, or the lossless tube concatenation (Huang, Acero, Hon, 2001). According to (Holmes, Holmes, 2001), the formants (resonant frequencies) are equally spaced at c/4L, 3c/4L, 5c/4L, etc., where c is the speed of sound and L is the tube length (simulating the vocal tract length). This explains why children’s formants are of higher frequency. A further complexity is that the length proportions of the different parts of the vocal tract vary with the age of the child. Usually, the throat grows more than the mouth (Schiold, 2004), which makes the formant frequency shift depend on the formant index and the phonetic identity. Fully compensating for this effect involves a very complex acoustic transformation, and only linear scaling has been modeled in practical speech recognition systems. C.f. Sec. 2.8.
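As a concrete illustration of the quarter-wave formula, the small Python sketch below evaluates Fn = (2n − 1)c/4L for two tube lengths. The lengths are illustrative assumptions, not measurements from the thesis.

    # Formants of a uniform lossless tube: F_n = (2n - 1) * c / (4 * L).
    # Tube lengths are illustrative assumptions (roughly adult vs. child).
    c = 343.0  # speed of sound in air, m/s
    for label, L in [("adult, L = 0.17 m", 0.17), ("child, L = 0.12 m", 0.12)]:
        formants = [(2 * n - 1) * c / (4 * L) for n in (1, 2, 3)]
        print(label, ["%.0f Hz" % f for f in formants])
    # adult: ~504, 1513, 2522 Hz; child: ~715, 2144, 3573 Hz. The shorter
    # tube shifts every resonance upward by the length ratio 0.17/0.12 = 1.42.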

2.3 Signal Processing

First the sound is captured via a microphone. This is where the signal processing starts. Since all processing is done in the digital domain, the signal is sampled. The sample rate must exceed twice the highest frequency component, otherwise aliasing distortion occurs, i.e. the signal appears to be of lower frequency (Oppenheim, Willsky, Nawab, 1997). Human speech is wide-band, but enough information is captured even if the signal is low-pass filtered at a cut-off frequency of about 4 kHz, as in telephony. A bandwidth of 4 kHz thus corresponds to a sampling frequency of at least 8 kHz. The most common sampling frequencies range from 8 kHz (as in telephony) to 32 kHz (Blomberg M., Elenius K., 2003).

Figure 4: Graph of the cross-section of a 10-section acoustic tube modeling a typical vowel. The mouth termination is shown coupling into an infinite tube of cross-section 40 cm2. (Picture from (Holmes, Holmes, 2001).)

The amplitude of the sampled signal is digitally represented with a number of bits per sample. The more bits per sample, the better the resolution, and the better the resolution, the smaller the distortion. Most common are 16 bits and above (Holmes, Holmes, 2001).

2.4 Feature Extraction

After sampling, features of the signal need to be derived. These are normally extracted from a spectral representation of the signal.

The transformation from the time domain to the frequency domain is done with a Fourier transform (FFT). The problem is that Fourier transforms are intended for time-invariant signals (Oppenheim, Willsky, Nawab, 1997), and speech is time-variant. This can be solved by cutting the signal into small time slots, frames, since speech is approximately time-invariant over short periods of time, about 10-20 ms.

To avoid discontinuities at the edges of the frames, which could lead to high-frequency components in the spectrum, each frame is multiplied by a tapered window, typically a Hanning or a Hamming window (Webster, 1978), (Huang, et al., 2001), before the Fourier transform is applied.

The frames should be short enough for the desired time resolution and long enough for the desired frequency resolution. They should also be long enough to be insensitive to their position relative to the glottal cycle (Fig 2), i.e. there must be at least one full glottal cycle in each frame. These requirements lead to a common compromise of 20-25 ms long frames every 10 ms. This gives a 50-60% overlap and a frame rate of 100 frames/second.

The output of the Fourier analysis will usually have a finer frequency resolution than is needed, especially at high frequencies. The Fourier magnitudes are therefore summed into a number of channels, whose bandwidths are smaller at lower frequencies and larger at higher frequencies, which is how the human ear perceives sound. This scale is called a mel scale (Fig 5). Typically about 20 channels are used for speech with a 4 kHz bandwidth, and when a larger bandwidth is used, some extra channels are added (Holmes, Holmes, 2001).

Figure 5: Triangular filters for transforming the output of a Fourier transform onto a mel scale in both bandwidth and spacing (picture from (Holmes, Holmes, 2001)).

MFCC is cepstral analysis performed on a mel-scale transformed spectrum. Cepstral analysis is an inverse Fourier transform applied to the logarithm of the spectrum of the Fourier-transformed speech signal just described. The advantages of MFCCs are, for instance, that the coarse structure of the spectrum is described by a small number of parameters and that the statistical dependence between the coefficients is reduced. The number of coefficients is normally 12, and it is also common to use their first- and second-order time derivatives, as well as the energy and its first- and second-order time derivatives. Altogether that makes 39 coefficients, features, for each frame.
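The whole chain just described (framing, windowing, FFT, mel filterbank, logarithm, inverse transform) can be sketched compactly. The sketch below is a minimal illustration under the parameter values given in the text (25 ms frames every 10 ms, about 20 mel channels, 12 cepstral coefficients); the function name and the 16 kHz sampling rate are assumptions, and this is not the HTK implementation used in the thesis.

    # Minimal MFCC sketch: framing, Hamming window, FFT, mel filterbank,
    # log, DCT (the DCT plays the role of the inverse transform here).
    import numpy as np
    from scipy.fftpack import dct

    def mfcc(signal, fs=16000, frame_ms=25, step_ms=10, n_mels=20, n_ceps=12):
        frame = int(fs * frame_ms / 1000)
        step = int(fs * step_ms / 1000)
        window = np.hamming(frame)
        # Overlapping frames, about 100 frames/second
        frames = np.array([signal[i:i + frame] * window
                           for i in range(0, len(signal) - frame, step)])
        power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
        # Triangular filters equally spaced on the mel scale (Fig 5)
        mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
        inv_mel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
        edges = inv_mel(np.linspace(0.0, mel(fs / 2.0), n_mels + 2))
        bins = np.floor((frame + 1) * edges / fs).astype(int)
        fbank = np.zeros((n_mels, power.shape[1]))
        for j in range(n_mels):
            lo, mid, hi = bins[j], bins[j + 1], bins[j + 2]
            fbank[j, lo:mid] = (np.arange(lo, mid) - lo) / max(mid - lo, 1)
            fbank[j, mid:hi] = (hi - np.arange(mid, hi)) / max(hi - mid, 1)
        log_mel = np.log(power @ fbank.T + 1e-10)
        # Keep cepstral coefficients 1..12; c0 (energy) is handled separately
        return dct(log_mel, type=2, axis=1, norm='ortho')[:, 1:n_ceps + 1]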

2.5 Hidden Markov Models (HMM)

After the speech signal has been transformed into a sequence of feature vectors, a matching procedure follows to decide what was spoken. There are different matching methods; the one used in this thesis project is based on hidden Markov modeling. The basic theory of hidden Markov modeling is described below.

The advantages of HMMs are several. First, they have a simple mathematical structure; second, the structure is rich enough to describe very complicated pattern sequences; and third, the models can be automatically adapted to conform with given training data. Hidden Markov models are used to characterize signals in widely different areas: Bach music, chaotic sea-water waves, bird song style, speech, handwriting and gestures (Leijon, 2003).

In speech recognition, a hidden Markov model normally represents a phoneme or a whole word. The model consists of a finite number of states, where each state represents a part of the phoneme or word. At every frame the model is able to change state, and will do so randomly in a way determined by a set of transition probabilities associated with the current state. The probabilities of all transitions from a state at any frame must sum to 1, including the probability of remaining in the current state. There are also an initial and a final state; they exist for practical purposes and have no output function. See Fig 6.



Figure 6: A simple three-state HMM (picture from (Ohlin, 2004)).

For a first-order Markov model, the transition probabilities depend only on the present state and not on the states during previous frames. This leads to an exponential distribution of the time the model stays in each state. Markov models of higher order also take the previous state history into account.

When a state is active, it emits a sequence of feature vectors, one feature vector for each time frame, i.e. one feature vector for each transition. These feature vectors are of the same form as those observed when a spoken word is recognized. But it is not possible to know exactly which states have been passed from the observations, since each state, apart from its transition probabilities, is also associated with a probability density function (p.d.f.) for the feature vectors. The p.d.f. can be used to calculate the probability that a feature vector was emitted by a certain state. An observed feature vector is thus an output of the probability density function of the associated state, and the states themselves are hidden from the observer. For this reason this type of model is called a hidden Markov model (HMM). The p.d.f.s are statistical models of the acoustics of the speech units. P.d.f.s can be either continuous or discrete; in this report they are continuous. Continuous p.d.f.s need more speech material for training the models than discrete p.d.f.s, but if the speech material is extensive enough, the result is better (Blomberg M., Elenius K., 2003). The continuous p.d.f. is normally a mixture of Gaussian distributions, each with individual means, variances and mixture weights (GMM) (Fig 7).

Figure 7: Gaussian distributions and their weights, describing the statistics of a state (picture after (Blomberg M., Elenius K., 2003)).
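For concreteness, a state’s output probability under such a mixture can be evaluated as sketched below. This is a generic diagonal-covariance GMM sketch, assuming NumPy; the shapes and names are placeholders, not those of the thesis models.

    # Log-likelihood of one feature vector under a diagonal-covariance GMM.
    import numpy as np

    def gmm_log_likelihood(x, weights, means, variances):
        """x: (D,) feature vector. weights: (M,). means, variances: (M, D)."""
        log_comp = (np.log(weights)
                    - 0.5 * np.sum(np.log(2.0 * np.pi * variances), axis=1)
                    - 0.5 * np.sum((x - means) ** 2 / variances, axis=1))
        return np.logaddexp.reduce(log_comp)  # log sum_m w_m N(x; mu_m, var_m)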

The parameters of hidden Markov models are estimated in a training session. The parameters are the transition probabilities and the probability density function parameters: the individual means, variances and mixture weights. These are optimized in accordance with the training data by the Baum-Welch algorithm (Kleijn, 2000). For the training, a sufficiently large speech corpus is needed. The size needed depends on the task: the larger the dictionary or the speaker variability, the more training material is needed.

Recognition is done by first finding the sequence of states that most likely generated the sequence of acoustic feature vectors. This is done by the Viterbi algorithm (Kleijn, 2000). From the state sequence we can directly determine the corresponding model sequence. If word models are used, this represents the recognized word sequence. In this thesis, the only words the recognizer has to cope with are the digits 0-9. Each digit is described by an HMM, and so is the silence.
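A minimal Viterbi sketch in the log domain is shown below. The array shapes and the uniform initial-state assumption are illustrative; the thesis uses HTK’s decoder rather than code like this.

    # Viterbi decoding: most likely state sequence given log transition
    # probabilities and per-frame log observation likelihoods.
    import numpy as np

    def viterbi(log_trans, log_obs):
        """log_trans: (S, S) log transition matrix, rows = from-state.
        log_obs: (T, S) log observation likelihoods, e.g. from each state's GMM.
        Returns the most likely state path as a list of T state indices."""
        T, S = log_obs.shape
        delta = np.empty((T, S))
        psi = np.zeros((T, S), dtype=int)
        delta[0] = log_obs[0]  # uniform initial distribution assumed
        for t in range(1, T):
            scores = delta[t - 1][:, None] + log_trans  # (from, to)
            psi[t] = scores.argmax(axis=0)
            delta[t] = scores.max(axis=0) + log_obs[t]
        path = [int(delta[-1].argmax())]  # backtrack from the best final state
        for t in range(T - 1, 0, -1):
            path.append(int(psi[t, path[-1]]))
        return path[::-1]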

2.6 Adaptation

If the attributes of the training and the test data differ, the recognition performance is considerably reduced. The mismatch can be due to differences between environments, transmission channels or speakers, and even the same speaker on different occasions. Given some speech data representative of the new conditions, it is not necessary to retrain the models, which would be time consuming. One can instead adapt the models to the new speech material.

There are several adaptation techniques, but the purpose of them all is to adapt the parameters of the models, using a small amount of speech data representative of the new conditions, so that they correspond better to the new speech data (Holmes, Holmes, 2001). If the transcription of the adaptation data is known, it is called supervised adaptation. This is possible, for example, when a user of an ASR system confirms the recognition results. Otherwise, when the text is not known, unsupervised adaptation is used. Unsupervised adaptation is more difficult because recognition and adaptation must proceed simultaneously.

2.7 MLLR

One adaptation technique, the one used in this thesis, is called Maximum Likelihood Linear Regression (MLLR). Since a large amount of adaptation data would be required to adapt each HMM individually, MLLR clusters acoustically similar models into groups, regression classes. Each group of HMMs is treated the same way. The less training material available, the fewer the groups and the coarser the clustering. As a result of the clustering, not only the models that are represented in the adaptation data can be adapted, but all models belonging to the same cluster, provided that at least one model from that cluster is represented in the adaptation data. This is why MLLR is a good adaptation technique even for a small amount of adaptation data (Mahl, 2003).

The operation performed in MLLR is a linear transformation of the model parameters. The transformation function is estimated so as to maximize the likelihood of the adaptation data. For example, given a model mean vector µ, a new mean µ̂ can be derived as follows:

µ̂ = Aµ + b    (1)

where A is a transformation matrix and b is a bias vector. Both are estimated from the adaptation data.
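Applying an estimated transform to all Gaussian means of one regression class then amounts to a single matrix operation, as the sketch below illustrates. Estimating A and b from the adaptation data is the substantial part of MLLR and is omitted here; the shapes and placeholder values are assumptions.

    # Apply one shared MLLR mean transform (Eq. 1) to a regression class.
    import numpy as np

    def adapt_means(means, A, b):
        """means: (N, D) stacked Gaussian mean vectors of one regression class.
        Returns mu_hat = A @ mu + b for every component."""
        return means @ A.T + b

    D = 39                         # feature dimension, c.f. Sec 2.4
    A = np.eye(D)                  # placeholder transform
    b = np.zeros(D)                # placeholder bias
    adapted = adapt_means(np.random.randn(128, D), A, b)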

2.8 Vocal Tract Length Normalization

Vocal tract length normalization (VTLN) (Holmes, Holmes, 2001) is a method to reduce the effects of length differences between different speakers’ vocal tracts. If identical vowels produced by two vocal tracts of different length are compared, the formants of the shorter vocal tract are shifted to higher frequencies (Fig. 3) compared to the formants of the longer vocal tract. This can be compensated for by stretching or compressing the frequency axis of the spectrum prior to recognition. When VTLN is applied, the stretching/compressing is done during the feature extraction process, after the windowing and the Fourier transform.
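One simple way to realize such a warp is to resample the magnitude spectrum of each frame along a scaled frequency axis, as in the sketch below. This only illustrates the principle; it is not how HTK implements VTLN internally.

    # Warp one frame's magnitude spectrum by resampling the frequency axis.
    import numpy as np

    def warp_spectrum(spectrum, alpha):
        """spectrum: (K,) magnitude spectrum of one frame.
        alpha < 1 compresses the frequency axis, so a child formant at bin k
        ends up near bin alpha * k, closer to the adult position."""
        K = len(spectrum)
        source_bins = np.arange(K) / alpha  # output bin k reads input bin k/alpha
        return np.interp(source_bins, np.arange(K), spectrum, right=0.0)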

2.9 ASR with children’s speech

Current ASR techniques function sufficiently well with adults’ speech to allow practical applications, for example in booking systems. But when children’s speech is recognized in these applications, the performance decreases considerably. Why is that so?

Children differ in both physical conditions and pronunciation. The differences in physical conditions stem from the fact that they have shorter vocal tracts and shorter vocal folds. This makes their fundamental and formant frequencies higher. For 5-7 year old children, the first three formants are shifted upwards by about 65% compared to adults (Li, Russel, 2002).

When children lose their milk teeth, which happens when they are about 6 years old, their pronunciation deteriorates for some time until they get their permanent teeth. Small children may also not have learned the proper pronunciation yet.

Some of these differences can be compensated for, but the performance is still reduced. This is because children have smaller vocabularies and talk more spontaneously and less grammatically than adults (Blomberg M., Elenius D., 2003).

It has also been found that the recognition performance on children’s speech varies from very poor to almost as good as for adults (Li, Russel, 2002). This suggests that the differences cannot be fully compensated for just by adapting the models to children’s speech or by scaling the children’s speech (VTLN) to make it resemble adults’ speech better. A performance difference will still remain.



3 Frequency scaling

In the VTLN algorithm, the frequency axis is stretched or compressed; this is also called scaling or warping. In this thesis, two different warp functions were used. They are described in the following sections.

3.1 HTK’s standard frequency warp function

The software package HTK (Hidden Markov Toolkit, see Sec 5) supports a simple piecewise linear frequency warping (Young, et al., 2002). The warp factor is called α, and values of α < 1.0 mean a compression of the frequency axis. As the warping might leave the upper filters with little or no energy, the simple linear warping function is modified at the upper and lower boundaries. The result is that the lower and upper boundary frequencies of the analysis are always mapped to themselves. The regions where the warp function deviates from linear warping with factor α are controlled by the two configuration variables fL and fU. The resulting piecewise linear warp function is shown in Figure 8.

Figure 8: HTK’s frequency warp function (picture from (Young, et al., 2002)).

This warp function can make the children’s speech spectrum resemble the adult spectrum better.
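A function of the general shape in Figure 8 can be sketched as follows: slope α between fL and fU, with linear boundary segments keeping the analysis boundary frequencies fixed. HTK’s exact knee placement may differ; this is only an illustration of the shape, and all parameter values are assumptions.

    # Piecewise linear warp: g(f) = alpha * f on [fL, fU], with boundary
    # segments so that g(f_lo) = f_lo and g(f_hi) = f_hi (endpoints fixed).
    def piecewise_warp(f, alpha, fL, fU, f_lo=0.0, f_hi=8000.0):
        if f <= fL:   # lower segment: (f_lo, f_lo) -> (fL, alpha * fL)
            return f_lo + (f - f_lo) * (alpha * fL - f_lo) / (fL - f_lo)
        if f >= fU:   # upper segment: (fU, alpha * fU) -> (f_hi, f_hi)
            return alpha * fU + (f - fU) * (f_hi - alpha * fU) / (f_hi - fU)
        return alpha * f  # central linear region

    # Example: with alpha = 0.8, a child formant near 1000 Hz maps to 800 Hz.
    assert abs(piecewise_warp(1000.0, 0.8, 300.0, 6400.0) - 800.0) < 1e-9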

3.2 Linear frequency warp function

The second frequency warp function used in this thesis is linear throughout the spectrum, which removes the end-point effects of the standard HTK warping function. The function is controlled by a single variable, the frequency scaling factor α (Fig 9). During recognition, all frequencies above the analysis bandwidth are cut off, because the recognizer only accepts data below the Nyquist frequency of the training data. Two different databases were used, one for the children’s speech and one for the adults’ speech (c.f. Sec. 4). The two databases were recorded with different bandwidths; the children’s speech was recorded with twice the bandwidth of the adults’ speech. So when children’s speech is recognized with a model trained on adults’ speech, the upper half of the bandwidth is available for linear compression into the bandwidth of the adult models with this warping technique: high-frequency components that are lost without warping, and with piecewise linear warping, are shifted into the adult bandwidth and can be used for recognition.

Figure 9: The linear frequency warp function (picture from [2]).



4 Databases

Two databases were used: the SpeeCon database and the PF-Star database. All adult speech data was extracted from the SpeeCon database and all child speech data from the PF-Star database. The two databases have similar recording settings, so the results of the combined VTLN and MLLR experiments were comparable. For the second part of this thesis project, however, it was suspected that a small difference between the two databases might have a negative impact on recognition performance: it had earlier been observed that the non-speech segments differed spectrally (Schiold, 2004).

4.1 The SpeeCon database

The SpeeCon database was recorded within the SpeeCon project (Speech-Driven Interfaces for Consumer Devices) (Iskra, Großkopf, Marasek, van der Heuvel, Diehl, Kissling, 2002). It was created for the development of voice-driven interfaces for consumer applications. The database includes 20 languages with 600 speakers per language. The speech was recorded in different environments where voice-driven interfaces could be expected to be used. Spontaneous speech, read speech, digits and isolated words were all recorded. In this thesis, the Swedish digit strings recorded in an office environment were used. The office environment has little noise, and when noise is present it is mostly stationary. This environment is also very similar to the environment in which the children’s speech was recorded. The SpeeCon database consists of 50% men and 50% women, with a deviation of at most 5%, but since the speech data for this thesis was chosen randomly, this gender distribution cannot be guaranteed.

The speech was recorded in four channels through four microphones simultaneously, but the only recording used here is the one from the headset microphone, a Sennheiser ME 104. The input speech signal was high-pass filtered at 80 Hz and sampled at 16 kHz. It was then quantized using 16-bit linear coding (Iskra, et al., 2002).

The selection from the SpeeCon database is partitioned into different sets: one for training, one for development and one for testing, see Table 1. The training set is used for training the HMMs, the test set is used for evaluation and the development set is used for adjusting the recognizer.

Set               N:o speakers   N:o utterances
Training (A60)    60             300
Test (A)          60             300
Development (A)   81             405

Table 1: SpeeCon partitioning.

There are more speakers available in the SpeeCon database than in the PF-Star database, but only 60 speakers, the same number as in PF-Star, were used in each group. In order to have a fair comparison between the adult and the child models, the possibility of increasing the adult training set size was not utilized.



4.2 The PF-Star database

A Swedish children’s speech database was recorded at KTH for the PF-Star project (Blomberg M., Elenius D., 2003). It was designed to be a generic corpus, i.e. it was not created for any specific purpose. One of the requirements on a Swedish generic corpus is that it should cover all the Swedish phonemes in large enough numbers that speaker-independent phoneme models can be trained from it. Similar recordings were also made in the other countries that participated in the PF-Star project, all in their native language, but for this thesis only the Swedish digit strings were used. Another important feature of this corpus is that it was designed to be compatible with the SpeeCon database. To achieve this, the children’s database was, as far as possible, recorded with the same equipment and a similar setup.

The corpus was recorded with two channels simultaneously: one at close distance with the same headset microphone (Sennheiser ME 104) as used in the SpeeCon database, and one at the table with an omnidirectional microphone, also the same as for the adult speech database. The recording from the headset microphone is used here. Instead of high-pass filtering the input signal as in SpeeCon, a higher A/D resolution was used (24 bits instead of 16). This allows the low-frequency noise to be up to 48 dB stronger than the speech signal while a theoretical 16-bit resolution of the signal still remains. So instead of using a hardware filter, the low-frequency part of the signal could simply be discarded during analysis, with the same result. The sound level varied between the children. This led to different amplifier settings, which made a calibration procedure necessary. The calibration of the table microphone was performed at the beginning of every session of 5-10 children, and the headset microphone was calibrated for each speaker.

The recordings were held in day-care and after-school centers, and the children were sitting one at a time in a quiet room. The children repeated utterances that an adult recording leader read. The utterances were a mix of answers to questions, texts from children’s literature, girls’ and boys’ names and sequences of digits. The recorded children were 4 to 8 years old and are divided into partitions according to Table 2.

Set                      N:o speakers   N:o utterances
Training, 8 years        12             120
Training, 7 years        12             120
Training, 6 years        12             120
Training, 5 years        12             120
Training, 4 years        12             120
Training, total          60             600
Test, 8 years            12             120
Test, 7 years            12             120
Test, 6 years            12             120
Test, 5 years            12             120
Test, 4 years            12             120
Test, total              60             600
Development, 4-8 years   48             480

Table 2: PF-Star partitioning.



There are approximately equal numbers of child speakers of each gender, but they were divided so that the training sets contain 50% of each gender while the test sets contain 60% girls. This should not make any difference, since gender has not been shown to matter for children’s speech.


5 Recognizer

This thesis is based on a recognizer developed by Lena Mahl for her Master’s thesis (Mahl, 2003). Her work is in turn based on the RefRec recognizer (Giampero, 2002). The recognizer was developed in the HTK (Hidden Markov Model Toolkit) environment (Young, et al., 2002). HTK is a portable toolkit primarily used for speech recognition research, but also for other related research topics. It includes tools for training, testing and result analysis for HMMs. These tools are distributed under a free license and are built in open-source C code, which means that it is possible to change and recompile the tools to your own needs. The HTK toolkit was originally developed at Cambridge University but is now owned by Microsoft. The recognizer was built in the script language Perl by calling different modules from HTK. All the digital signal processing is done by the HTK modules, and the Perl scripts tie the modules together.

The recognizer is a digit recognizer with one HMM per digit and one HMM for the silence model, sil. The number of states per digit was set to twice the number of phonemes in each word, to account for differences between the beginning and the end of the phonemes, see Table 3.

HMM    N:o phonemes   N:o states
Ett    2              4
Två    3              6
Tre    3              6
Fyra   4              8
Fem    3              6
Sex    4              8
Sju    2              4
Åtta   3              6
Nio    3              6
Noll   3              6
Sil    -              3

Table 3: Number of states in the word HMM’s.

HTK needs a grammar that defines in what sequences the words in the dictionary can be spoken. For this recognizer the grammar allows an unspecified number of digits in succession, with optional silence between the digits. The grammar also expects silence at the beginning and the end of each utterance.

One essential parameter for the recognizer is the word insertion penalty, which balances insertions against deletions during recognition. It was set to -94.0, the value that optimized recognition on the adult development speech set. The recognizer was thus optimized for adults’ speech, so that the experiments with children’s speech were run on a recognizer developed strictly for adult speech. In this way, the results of the experiments are due only to the effect of the different methods and not to the recognizer having been adjusted to better suit children’s speech.

The experiments were all performed with 16 Gaussian mixture components (Sec 2.5) per state. This was the number of mixtures Mahl used in her work (Mahl, 2003). In this thesis it was important to use the same number in all experiments, while the particular number itself was less important; the relative changes in performance could then be observed.

The recognizer was configured according to the tutorial configuration in The HTK Book (Young, et al., 2002), except for some additional configuration parameters necessary to control the VTLN and parameters to control the filter-bank boundaries. C.f. Appendix A.



6 Experiments

Since the task was to measure whether there was an improvement when VTLN and MLLR were combined, the two methods were first evaluated separately, to have something to compare with, and afterwards their combination was evaluated. The children’s speech with no method at all applied to it was also evaluated; this is the baseline experiment. In addition, methods for possibly improving VTLN were evaluated; these experiments are completely separate from the first part. It had earlier been noticed that the non-speech parts of the two databases differed, and it was believed that this difference might reduce recognition performance. VTLN can be divided into two steps: one where the warp factor for the child’s utterance is estimated, and one where the utterance is warped with the estimated warp factor. One way to reduce the effect of the differences in the non-speech parts is to estimate the warp factor from the speech part of the utterance only, instead of from the entire utterance. This idea was tested in one experiment. The recognizer was designed as a digit recognition system for all experiments.

6.1 Baseline Experiments

The recognizer was trained with the adult speech from the SpeeCon database with a bandwidth of 8 kHz and a sampling frequency of 16 kHz. Recognition was performed with the test part of the children’s speech in the PF-Star database, first with twice the bandwidth and sampling frequency of the adult speech, then with the same bandwidth and sampling frequency as the adult speech. The analysis bandwidth was 8 kHz. After that, the recognizer was retrained with the same adult speech, but now with a bandwidth of 4 kHz and a sampling frequency of 8 kHz. Recognition was then performed with the test part of the children’s speech in the PF-Star database with twice the bandwidth and sampling frequency of the adult speech. The analysis bandwidth was 4 kHz. The children’s speech was a mixture of the ages 4 to 8 years, c.f. Sec. 4.2. The results from this evaluation are a baseline for comparison with the other results.

6.2 MLLR

The recognizer was trained with the adult speech from the SpeeCon database. The recognizer was then adapted with an adaptation set consisting of children’s speech from the PF-Star database. The adaptation was performed with MLLR, and the adapted recognizer was evaluated with children’s speech. This experiment was performed with the same three combinations of bandwidth and sampling frequency as in the baseline experiments. The number of nodes, groups, in the clustering was chosen empirically to maximize the recognition performance on the development set of the SpeeCon database. Five nodes were used when the recognizer was trained with the adult speech with a bandwidth of 8 kHz and a sampling frequency of 16 kHz and recognition was performed with the test part of the children’s speech in the PF-Star database, first with twice the bandwidth and sampling frequency of the adult speech, then with the same bandwidth and sampling frequency as the adult speech. Seven nodes were used when the recognizer was trained with the same adult speech but with a bandwidth of 4 kHz and a sampling frequency of 8 kHz and recognition was performed with the test part of the children’s speech with twice the bandwidth and sampling frequency of the adult speech.

6.3 VTLN

The recognizer was trained with the adults’ speech from the SpeeCon database. Then the frequency axis of the spectrum of the children’s speech from the test set was warped, i.e. scaled with a warp factor α: compressed or stretched. All the children’s speech was warped with the same α. The warping is performed during the feature extraction process. The recognizer was then evaluated with this warped children’s speech. This was repeated for a wide range of warp factors, from 0.5 to 1.22 in steps of 0.02. 0.5 is the lowest possible α in HTK, and 1.22 is commonly used as an upper limit when performing VTLN (Blomberg M., Elenius D., 2003), (Welling, Ney, Kanthak, 2002), (Welling, Kanthak, Ney, 1999). This experiment was performed with the same three combinations of bandwidths and sampling frequencies as in the baseline experiments. The warping was performed with both HTK’s standard frequency warp function and the linear frequency warp function.

6.4 VTLN + MLLR

The recognizer was trained with the adults’ speech from the SpeeCon database. Then the frequency axis of the spectrum of the children’s speech in the adaptation set was warped with α; all children’s speech was warped with the same warp factor. After that, the recognizer was adapted to the scaled children’s speech, and finally the recognizer was evaluated with scaled children’s speech from the test set; the same scale factor was used in both adaptation and evaluation. This was repeated for several warp factors, from 0.5 to 1.22 in steps of 0.02. This experiment was performed with the same three combinations of bandwidths and sampling frequencies as in the baseline experiments. The warping was performed with both HTK’s standard frequency warp function and the linear frequency warp function.

6.5 Estimation of warp factor based only on the speech part

It had been noticed that the frequency spectra of the non-speech parts of the two databases, PF-Star and SpeeCon, differed (Schiold, 2004). This might influence recognition performance negatively, because it sometimes leads to incorrectly estimated warp factors. A reasonable warp factor for children’s speech is usually about 0.8, and when the estimated warp factors diverge a lot from that, they are probably not appropriate. For example, a warp factor of 0.5 is difficult to motivate, because 5-7 year old children have their first three formants shifted upwards by about 65% compared to adults (Sec 2.9). To avoid estimating incorrect warp factors, the estimation of the warp factor for a particular utterance can be based only on the speech part of the utterance and not on the non-speech part as well. Two methods for achieving this were evaluated. The two methods are structurally similar, but since they are not identical, both were evaluated so that their results could be compared. In both methods, one utterance at a time was handled and a separate warp factor was chosen for each utterance. Both the recognition performance and the distribution of warp factors over the children’s utterances were evaluated. Ordinary VTLN, with the warp factors estimated from the entire utterance including the non-speech parts, was also evaluated as a starting point.

All utterances consist of three parts: silence, speech and silence again. One way to avoid the influence of the spectral differences of the non-speech parts is to remove the silent parts at the beginning and the end of the utterance and then choose the warp factor that makes the speech part resemble the adults’ speech the most. Since the utterances no longer contain these non-speech parts, the grammar needs to be changed: the silence HMM is removed from the grammar, leaving the digit HMMs. An unspecified number of digits is expected, but no silence at the beginning or the end. After the estimation, the entire utterance, including the non-speech parts, is warped with the estimated warp factor and the utterance is recognized. During the recognition, the grammar again includes the silence HMM.

First, the speech segment boundaries, the points in time where the speech part begins and ends, were determined. This was done by recognizing the utterance without any frequency warping. No scaling is the best choice, since we will not know what the non-speech spectra are like in future applications. The speech segment boundaries also differ depending on what warp factor the utterance has been warped with. If, for example, the utterances are warped with 0.8 when determining the boundaries and it happens to be an adult speaking, the boundaries will be misplaced, since the correct warp factor for an adult is mostly 1.0, i.e. no scaling. Assuming an adult speaker is reasonable, since the experiments are performed on a recognizer developed for adults’ speech.

When the boundaries of the speech segment of the utterance have been determined, the second step is to estimate the warp factor based on the speech segment. This is where the two methods differ, and both are explained in the following subsections. The main difference between the methods is that the non-speech intervals are either removed from the signal (direct method) or compensated for in the score domain (subtraction method).

6.5.1 Direct method

There are tools in HTK for estimating the warp factor. One tool, HCopy, converts the utterance to a sequence of feature vectors. It is possible to tell HCopy at what point in time of the utterance to start converting and at what point to stop. HCopy can also warp the utterance with a specified warp factor. This was done for a wide range of warp factors, from 0.5 to 1.22. For each warping, the speech part of the utterance is scored. Since the speech part of the utterance consists of one or more digits, the HMM that is allotted the highest score is chosen for each digit, c.f. Section 2.5. The scores of the chosen HMMs for the digits are summed to a total score for the whole utterance. These total scores are put in a vector, where each element corresponds to a separate warp factor. The warp factor that generated the highest total score for the speech part of the utterance is chosen.
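As a rough illustration, the grid search could be scripted as below. This is a sketch under several assumptions: the grid step of 0.02, the per-warp config file, and the helper score_speech_part, which stands in for the recognition pass that scores the digit HMMs, are all hypothetical; the trimming of the input to the speech segment boundaries is elided; only HCopy itself and the WARPFREQ configuration variable (c.f. Appendix A) come from the actual setup.

    import subprocess

    # candidate warp factors; the grid step of 0.02 is an assumption
    WARP_GRID = [round(0.50 + 0.02 * i, 2) for i in range(37)]  # 0.50 .. 1.22

    def best_warp_factor(wav_file, score_speech_part):
        """Direct method, sketched: convert the speech part of one
        utterance with each candidate warp factor and keep the factor
        that gives the highest total score."""
        scores = []
        for alpha in WARP_GRID:
            # per-warp HCopy configuration; WARPFREQ sets the warp factor
            # (the base feature settings of Appendix A are elided here)
            with open("warp.cfg", "w") as cfg:
                cfg.write("WARPFREQ = %.2f\n" % alpha)
            subprocess.run(["HCopy", "-C", "warp.cfg", wav_file, "tmp.mfc"],
                           check=True)
            # hypothetical helper: recognizes tmp.mfc with the silence-free
            # grammar and returns the sum of the best digit-HMM scores
            scores.append(score_speech_part("tmp.mfc"))
        return WARP_GRID[scores.index(max(scores))]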

After the estimation, the whole utterance was warped with the estimated warp factor. Finally, the recognition performance was evaluated to see if there had been any improvement compared to ordinary VTLN, i.e. VTLN with the warp factor estimated from the entire utterance. The distribution of warp factors for the children's utterances was also evaluated.


6.5.2 Subtraction method

This time the complete utterance, including the non-speech parts, was converted to feature vectors with HCopy and scored. This can be contrasted with the direct method, where only the speech part was converted to feature vectors. This was done for a wide range of warp factors, from 0.5 to 1.22. To obtain the score for the speech part, the scores for the non-speech parts at the beginning and end were subtracted from the total score of the utterance. The scores for the speech parts for the different warp factors cannot be compared directly; first a compensation must be made.

The lengths of the speech segments differ depending on which warp factor the utterance has been warped with, and the length of the speech segment influences the score. Therefore the scores for speech segments warped with different warp factors are only comparable if the lengths of the segments are equal. The lengths the speech segments had when the utterances were warped with 1.0 were chosen as the standard. When the utterances are scaled with other warp factors, the speech segments need to be shortened or lengthened to fit the standard lengths. But since it is not possible to shorten or lengthen the speech segments themselves, the scores that belong to these segments are increased or decreased instead.

Imagine three digits were spoken and two or more digits have been recognized, see Fig 10. Here, for example, the score for the first recognized digit must be increased, to compensate for the different start times of the speech segments, and the score for the last recognized digit must be decreased, to compensate for the different end times. To increase or decrease a score, a compensation factor is multiplied with the score that needs to be modified. The new total score will be:

Total = score(digit_1) * comp_1 + score(digit_2) + ... + score(digit_last) * comp_2    (2)

where

comp_1 = (t_2 − t_start) / (t_2 − t_1),    comp_2 = (t_end − t_(i−1)) / (t_i − t_(i−1))    (3)

Here t_start and t_end are the standard speech segment boundaries, t_1 and t_2 the boundaries of the first recognized digit, and t_(i−1) and t_i the boundaries of the last recognized digit (see Fig 10).

In the special case where only one digit has been recognized, see Fig 11, the total score will be:

Total = score(digit) * comp    (4)

where

comp = (t_end − t_start) / (t_2 − t_1)    (5)
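A compact sketch of the compensation, directly transcribing equations (2)-(5). The data representation, a list of (start, end, score) tuples for the recognized digits, is an assumption made for illustration.

    def compensated_total(digits, t_start, t_end):
        """digits: list of (t_begin, t_end, score) for the recognized
        digits, in time order. t_start, t_end: the standard speech
        segment boundaries, i.e. those obtained with warp factor 1.0."""
        if len(digits) == 1:                       # special case, eq. (4)-(5)
            b, e, score = digits[0]
            return score * (t_end - t_start) / (e - b)
        # general case, eq. (2)-(3)
        b1, e1, s1 = digits[0]                     # first recognized digit
        bl, el, sl = digits[-1]                    # last recognized digit
        comp1 = (e1 - t_start) / (e1 - b1)
        comp2 = (t_end - bl) / (el - bl)
        total = s1 * comp1 + sl * comp2
        total += sum(s for (_, _, s) in digits[1:-1])  # middle digits, unscaled
        return total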


Figure 10: An illustrative example where two or more digits have been recognized. See text for details.


Figure 11: An illustrative example where only one digit has been recognized. See text for details.


The utterances were warped with different warp factors, yielding different scores for the speech segments. After compensation for the different lengths, the scores were compared and the warp factor that produced the highest score for the respective speech segment was chosen.

After the estimation, the whole utterances are warped with the estimated warp factors. Finally, the recognition performance was evaluated to see if there had been any improvement compared to VTLN with the warp factor estimated from the entire utterance. The distribution of warp factors for the children's utterances was also evaluated.


7 Results

To present the results from the experiments, an evaluation measure called word accuracy is used. Word accuracy is a representative figure of recognizer performance (Young, et al., 2002). It is defined as:

Accuracy = (N − D − S − I) / N * 100    (6)

where

N = total number of words
D = number of deletions
S = number of substitutions
I = number of insertions
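As a hypothetical worked example: if a test set contains N = 200 words and the recognizer makes D = 6 deletions, S = 10 substitutions and I = 4 insertions, then Accuracy = (200 − 6 − 10 − 4) / 200 * 100 = 90%. Note that the insertions make the accuracy lower than the plain percentage of correctly recognized words.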

To show how the number of errors decreases or increases when comparing two methods, a figure called relative error reduction (RER) is useful. It is defined as the difference in the number of errors between the two methods, divided by the number of errors of the reference method.
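Written out, with E_ref and E_new denoting the error counts (or error rates on the same test set) of the reference method and the new method, a notation introduced here only for clarity:

    RER = (E_ref − E_new) / E_ref * 100

For example, if the reference method makes 200 errors and the new method makes 150, the relative error reduction is (200 − 150) / 200 * 100 = 25%.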

The results from the experiments are shown in groups depending on which speech the recognizer was trained with and which speech it was evaluated with. This gives three groups:

1. The recognizer was trained with adults' speech with 16 kHz sampling frequency and was evaluated with children's speech with 32 kHz sampling frequency. The analysis bandwidth was 8 kHz. This group is called 1632.

2. The recognizer was trained with adults' speech with 16 kHz sampling frequency and was evaluated with children's speech with 16 kHz sampling frequency. The analysis bandwidth was 8 kHz. This group is called 1616.

3. The recognizer was trained with adults' speech with 8 kHz sampling frequency and was evaluated with children's speech with 16 kHz sampling frequency. The analysis bandwidth was 4 kHz. This group is called 816.

7.1 VTLN and MLLR

The results can be viewed in Fig 12, Fig 13 and Fig 14. The graphs show the results from the baseline, the MLLR, the VTLN and the combined VTLN and MLLR experiments. The uppermost graph shows the VTLN and the combined VTLN and MLLR experiments performed with HTK's standard frequency warp function, and the lower graph shows the corresponding experiments performed with the linear frequency warp function. In group 1616 only HTK's standard frequency warp function was used. The linear frequency warp function could not be used there, since it compresses the frequency axis of the spectrum linearly and therefore requires the children's speech to have a larger bandwidth than the recognizer; otherwise it would try to compress a nonexistent bandwidth into the range of the recognizer.

The overall results from the experiments, measured in accuracy, and the corresponding warp factors are shown in Tables 4, 5 and 6. The values correspond to the maximum accuracy in Figures 12-14.


Experiment            (Max) Acc [%]   W.f.
--------------------------------------------
Baseline                  75.0          -
MLLR                      86.0          -
VTLN (htk)                81.3         0.88
VTLN (linear)             82.1         0.86
VTLN+MLLR (htk)           87.5         0.80
VTLN+MLLR (linear)        88.9         0.82

Table 4: Results from the VTLN and MLLR experiments, group 1632.

Experiment            (Max) Acc [%]   W.f.
--------------------------------------------
Baseline                  78.4          -
MLLR                      86.6          -
VTLN (htk)                82.9         0.86
VTLN+MLLR (htk)           88.2         0.84

Table 5: Results from the VTLN and MLLR experiments, group 1616.

Experiment            (Max) Acc [%]   W.f.
--------------------------------------------
Baseline                  62.4          -
MLLR                      81.2          -
VTLN (htk)                78.5         0.82
VTLN (linear)             79.3         0.84
VTLN+MLLR (htk)           85.2         0.76
VTLN+MLLR (linear)        87.2         0.76

Table 6: Results from the VTLN and MLLR experiments, group 816.


[Figure: two graphs of accuracy [%] versus warp factor for the Baseline, VTLN, MLLR and VTLN + MLLR experiments; upper graph "Adult: fs=16kHz, Child: fs=32 kHz, htk/3.2", lower graph the same with the linear warp function.]

Figure 12: Results from VTLN and MLLR, group 1632.


[Figure: accuracy [%] versus warp factor for the Baseline, VTLN, MLLR and VTLN + MLLR experiments; "Adult: fs=16kHz, Child: fs=16 kHz, htk/3.2".]

Figure 13: Results from VTLN and MLLR, group 1616.


[Figure: two graphs of accuracy [%] versus warp factor for the Baseline, VTLN, MLLR and VTLN + MLLR experiments; upper graph "Adult: fs=8kHz, Child: fs=16 kHz, htk/3.2", lower graph the same with the linear warp function.]

Figure 14: Results from VTLN and MLLR, group 816.


7.1.1 VTLN and MLLR, group 1632

The results can be viewed in Fig 12. The uppermost graph shows the VTLN and the combined VTLN and MLLR experiments performed with HTK's standard frequency warp function, and the lower graph shows the corresponding experiments performed with the linear frequency warp function.

In this group, the region where the combination of VTLN and MLLR proved better than MLLR alone covers the warp factors from 0.72 to 1.00 with HTK's standard frequency warp function and from 0.68 to 0.98 with the linear frequency warp function. These two intervals are fairly similar.

The difference between the maximum of the combination of VTLN and MLLR and MLLR alone was bigger with the linear frequency warp function than with HTK's standard frequency warp function. The difference was 2.9 percentage points for the linear frequency warp function and 1.5 percentage points for HTK's standard frequency warp function, i.e. the difference with the linear frequency warp function was 93.3% higher than the corresponding difference with HTK's standard frequency warp function.

The difference between the maximum of VTLN and the baseline experiment was 7.1 percentage points for the linear frequency warp function and 6.3 percentage points for HTK's standard frequency warp function, i.e. the difference with the linear frequency warp function was 12.7% higher than the corresponding difference with HTK's standard frequency warp function. The difference between the two warp functions is thus smaller when comparing the maximum of VTLN to the baseline experiment than when comparing the maximum of the combination of VTLN and MLLR to MLLR alone.

7.1.2 VTLN and MLLR, group 1616

The results can be viewed in Fig 13. The VTLN and the combined VTLN and MLLR experiments were performed with HTK's standard frequency warp function.

In this group, the region where the combination of VTLN and MLLR proved better than MLLR alone covers the warp factors from 0.72 to 1.00 with HTK's standard frequency warp function.

The difference between the maximum of the combination of VTLN and MLLR and MLLR alone was 1.6 percentage points with HTK's standard frequency warp function.

The difference between the maximum of VTLN and the baseline experiment was 4.5 percentage points with HTK's standard frequency warp function.

7.1.3 VTLN and MLLR, group 816

The results can be viewed in Fig 14. The uppermost graph shows the VTLN and the combined VTLN and MLLR experiments performed with HTK's standard frequency warp function, and the lower graph shows the corresponding experiments performed with the linear frequency warp function.

In this group, the region where the combination of VTLN and MLLR proved better than MLLR alone covers the warp factors from 0.62 to 1.00 with HTK's standard frequency warp function and from 0.56 to 1.02 with the linear frequency warp function. The latter interval is slightly larger.

The difference between the maximum of the combination of VTLN and MLLR and MLLR alone was bigger with the linear frequency warp function than with HTK's standard frequency warp function. The difference was 6.0 percentage points for the linear frequency warp function and 4.0 percentage points for HTK's standard frequency warp function, i.e. the difference with the linear frequency warp function was 50% higher than the corresponding difference with HTK's standard frequency warp function.

The difference between the maximum of VTLN and the baseline experiment was 16.9 percentage points for the linear frequency warp function and 16.1 percentage points for HTK's standard frequency warp function, i.e. the difference with the linear frequency warp function was 5.0% higher than the corresponding difference with HTK's standard frequency warp function. The difference between the two warp functions is thus smaller when comparing the maximum of VTLN to the baseline experiment than when comparing the maximum of the combination of VTLN and MLLR to MLLR alone.

7.2 Estimation of warp factor based on the speech part

The experiments were conducted with the linear frequency warp function, since this was expected to give larger improvements both in the recognition performance and in the distribution of warp factors for the children's utterances.

The results can be viewed in Fig 15 and Fig 16. The uppermost graph shows the result for ordinary VTLN, where the estimation of the warp factor was based on the entire utterance. The graphs in the middle show the result for the direct method, and the graphs at the bottom show the result for the subtraction method; in both of these, the estimation of the warp factor was based only on the speech part of the utterance. The overall results from the experiments, measured in accuracy and in the proportion of utterances allotted a warp factor less than 0.6, are shown in Table 7 and Table 8.

Experiment            Acc [%]   Utt. with α < 0.6 [%]
------------------------------------------------------
VTLN                    73.9            13.8
Direct method           80.0             4.2
Subtraction method      79.7             3.6

Table 7: Results from the estimation based on the speech part, group 1632.

Experiment            Acc [%]   Utt. with α < 0.6 [%]
------------------------------------------------------
VTLN                    73.5            11.1
Direct method           78.2             3.1
Subtraction method      77.5             3.6

Table 8: Results from the estimation based on the speech part, group 816.

7.2.1 Estimation of warp factor based on the speech part, group 1632

With both methods, the number of utterances that had been allotted 0.5 as warp factor was reduced, c.f. Table 9. With the direct method, the number of utterances with this warp factor was reduced by 71%, compared to 82% with the subtraction method.


[Figure: three histograms of the number of utterances per warp factor: "Ordinary VTLN, group 1632" (top), "Direct method, group 1632" (middle) and "Subtraction method, group 1632" (bottom).]

Figure 15: Distribution of warp factors for the children's utterances, group 1632.


[Figure: three histograms of the number of utterances per warp factor: "Ordinary VTLN, group 816" (top), "Direct method, group 816" (middle) and "Subtraction method, group 816" (bottom).]

Figure 16: Distribution of warp factors for the children's utterances, group 816.


Both methods allotted a fairly similar number of utterances warp factors in the interval 0.5-0.6; here the reduction was 70% and 74%, respectively.

With both methods, the number of utterances allotted warp factors in the "good" range of 0.7-0.88 increased slightly: by 4.5% with the direct method and by 2.4% with the subtraction method.

The number of utterances allotted warp factors between 0.90 and 0.98 increased with both methods. With the direct method the increase was 43.3% and with the subtraction method it was 53.7%.

Intuitively, children's speech should not be allotted warp factors greater than one, since that would mean the children had longer vocal tracts than the adults, but sometimes their utterances are allotted such warp factors anyway. With both methods the number of such utterances increased: by 11 utterances with the direct method and by 16 with the subtraction method.

The overall results measured in accuracy, c.f. Table 7, showed that the methods did not differ much: 80.00% for the direct method compared to 79.70% for the subtraction method. But both methods improved the recognition performance, since ordinary VTLN had an accuracy of 73.88%.

Warp factor   VTLN [#]   Direct method [#]   Subtraction method [#]
--------------------------------------------------------------------
0.50             49             14                     9
0.50-0.60        76             23                    20
0.70-0.88       380            397                   389
0.90-0.98        67             96                   103
1.00-1.22         3             14                    19

Table 9: Number of utterances allotted each warp factor, group 1632.

7.2.2 Estimation of warp factor based on the speech part, group 816

Here too, the number of utterances that had been allotted 0.5 as warp factor was reduced with both methods, c.f. Table 10. With the direct method the number of utterances with this warp factor was reduced by 78%, compared to 83% with the subtraction method. In the interval 0.5-0.6, the reduction was 72% for the direct method and 67% for the subtraction method.

With both methods, the number of utterances allotted warp factors in the "good" range of 0.7-0.88 increased: 7.2% more utterances were placed in this region with the direct method and 6.3% more with the subtraction method.

The number of utterances allotted warp factors between 0.90 and 0.98 increased with both methods: by 193.8% with the direct method and by 156.3% with the subtraction method.

The number of utterances allotted warp factors greater than one increased with the subtraction method, from 1 to 4, and was unchanged with the direct method.

The overall results measured in accuracy, c.f. Table 8, showed that the methods did not differ much: 78.24% for the direct method compared to 77.52% for the subtraction method. Both methods improved the recognition performance, since ordinary VTLN had an accuracy of 73.52%.


Warp factor   VTLN [#]   Direct method [#]   Subtraction method [#]
--------------------------------------------------------------------
0.50             46             10                     8
0.50-0.60        61             17                    20
0.70-0.88       429            460                   456
0.90-0.98        16             47                    41
1.00-1.22         1              1                     4

Table 10: Number of utterances allotted each warp factor, group 816.

7.3 Significance

Confidence intervals and confidence levels are often used to measure how statistically significant the results are. In this work, a simple rule of thumb was used for a coarse estimation of the confidence interval. The "rule of 30" (Doddington, Schalk, 1981) is often used for this purpose in speech recognition experiments. The rule assumes binomially distributed, independent errors and states the following: if an error rate of E percent is measured, then the true error rate is within 30 percent of this value with a probability of 90 percent, provided that at least 30 errors have been made. More errors give a tighter confidence interval or a higher confidence level. This implies that the better the performance, the more data must be collected to prove it.
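As a minimal sketch of how such an interval can be computed (the function name and the conversion from accuracy to error rate are this sketch's own; only the relative ±30% rule itself comes from the reference):

    def rule_of_30_interval(accuracy, num_errors):
        """Coarse 90% confidence interval for an accuracy figure in
        percent, using the rule of 30: the true error rate lies within
        +/- 30% (relative) of the measured one if at least 30 errors
        were observed."""
        if num_errors < 30:
            raise ValueError("rule of 30 requires at least 30 errors")
        error = 100.0 - accuracy            # measured error rate in percent
        return (100.0 - 1.3 * error,        # lower accuracy bound
                100.0 - 0.7 * error)        # upper accuracy bound

For example, the baseline of group 1632 with accuracy 75.0 (and well over 30 errors) gives the interval (67.5, 82.5), matching Table 11.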

In this thesis, the requirement of 30 errors is always fulfilled when measuring the accuracy. The number of errors is actually much larger than that, about 180 or more. This might indicate that a tighter confidence interval or a higher confidence level could be used, but since the spoken words may not be completely statistically independent, this approximation is used. For example, since each child speaks several sequences of words, these words come from the same child and might therefore have similar acoustic properties and be dependent on each other. The confidence intervals are shown in Tables 11, 12, 13, 14 and 15. The confidence intervals should not be used for comparisons between the methods; there may be statistically significant differences that do not show up in the often very broad intervals. There are better methods for this, but they have not been studied in this work.

Experiment            (Max) Acc [%]   Conf. int. [%]
-----------------------------------------------------
Baseline                  75.0         67.5 - 82.5
MLLR                      86.0         81.8 - 90.2
VTLN (htk)                81.3         75.7 - 86.9
VTLN (linear)             82.1         76.7 - 87.4
VTLN+MLLR (htk)           87.5         83.8 - 91.3
VTLN+MLLR (linear)        88.9         85.5 - 92.2

Table 11: Confidence intervals of the VTLN and MLLR experiments with HTK's standard warp function and the linear warp function, group 1632.


Experiment            (Max) Acc [%]   Conf. int. [%]
-----------------------------------------------------
Baseline                  78.4         72.0 - 84.9
MLLR                      86.6         82.6 - 90.6
VTLN (htk)                82.9         77.8 - 88.0
VTLN+MLLR (htk)           88.2         84.7 - 91.8

Table 12: Confidence intervals of the VTLN and MLLR experiments with HTK's standard warp function, group 1616.

Experiment            (Max) Acc [%]   Conf. int. [%]
-----------------------------------------------------
Baseline                  62.4         51.2 - 73.7
MLLR                      81.2         75.5 - 86.8
VTLN (htk)                78.5         72.0 - 84.9
VTLN (linear)             79.3         73.1 - 85.5
VTLN+MLLR (htk)           85.2         80.7 - 89.6
VTLN+MLLR (linear)        87.2         83.3 - 91.0

Table 13: Confidence intervals of the VTLN and MLLR experiments with HTK's standard warp function and the linear warp function, group 816.

Experiment            Acc [%]   Conf. int. [%]
------------------------------------------------
VTLN                    73.9     66.0 - 81.7
Direct method           80.0     74.0 - 86.0
Subtraction method      79.7     73.6 - 85.8

Table 14: Confidence intervals of the estimation based on the speech part, group 1632.

Experiment            Acc [%]   Conf. int. [%]
------------------------------------------------
VTLN                    73.5     65.6 - 81.5
Direct method           78.2     71.7 - 84.8
Subtraction method      77.5     70.8 - 84.3

Table 15: Confidence intervals of the estimation based on the speech part, group 816.


8 Discussion

8.1 General results from the VTLN and MLLR experiments

Several common results were found in the three different groups; the results refer to Fig 12, 13 and 14. In all speech combinations, MLLR improved the recognition performance more than VTLN, even when VTLN was given the best possible warp factor. Both MLLR and VTLN separately improved the accuracy, though; VTLN improved the recognition results approximately in the range of warp factors from 0.7 or 0.8 to 1.0.

Was there any improvement when using the combination of VTLN and MLLR compared to using just one of them? The combination of VTLN and MLLR was always better than VTLN alone, for all warp factors. The combination was also better than MLLR alone, at least for a range of warp factors around 0.8; the size of this range differed between the groups, which is explained in more detail later.

One could also see that the difference between the maximum of the combination of VTLN and MLLR and MLLR alone was bigger with the linear frequency warp function than with HTK's standard frequency warp function, see Fig 12 and 14. This might be because the linear frequency warp function warps new information from the children's speech into the range of the recognizer, i.e. it takes energy of the children's spectrum from above the bandwidth of the recognizer and compresses it into that range. A corresponding conclusion could be drawn when comparing the maximum of VTLN to the baseline experiment, but there the difference was smaller.

Another similarity between the groups was that the accuracy for both VTLN and the combination of VTLN and MLLR was higher at the lower warp factors with the linear frequency warp function than with HTK's standard frequency warp function. The reason is that HTK's standard frequency warp function heavily penalizes the lowest warp factors, since it has a built-in protection against them.

For all combinations of speech and for both warp functions, the accuracy maximum of the combination of VTLN and MLLR was shifted towards lower warp factors compared to the maximum of VTLN.

Within all groups and with both warp functions, the slope of the VTLN curve is steeper at the low warp factors than at the high ones. This could indicate that choosing a too low warp factor for children's speech causes more harm than choosing a too high one. The same effect could be seen in all graphs for the combination of VTLN and MLLR. The conclusion from this is more diffuse, though; it might indicate that adapting the recognizer to warped children's speech and scaling the children's speech to be recognized with the same too low warp factor is worse than doing so with a too high warp factor. The effect also follows from VTLN being one of the combined methods, so the behaviour of VTLN is reflected here as well.

8.1.1 Comparing the results between the different bandwidths

The results from group 816 deviated in several ways from those of the other groups. In this group the recognizer was trained with adults' speech with 4 kHz bandwidth and 8 kHz sampling frequency and was evaluated with children's speech with 8 kHz bandwidth and 16 kHz sampling frequency. The combination of bandwidths in this group is of special interest, since it is possibly of commercial relevance: a bandwidth of 4 kHz is common in today's telephony. This narrow bandwidth covers most of the information in adults' speech, but some of the information in children's speech will be lost, since some of the higher formants of children's speech will be discarded.

In group 816, the interval where the combination of VTLN and MLLR was better than MLLR alone is larger than the intervals of the two other groups.

For group 816, the increase in accuracy when combining VTLN and MLLR compared to MLLR alone was larger than for the other groups, for both warp functions. In this group the accuracy increased by 4.0 percentage points with HTK's standard frequency warp function and by 6.0 percentage points with the linear frequency warp function. This should be compared to group 1632, where the accuracy increased by 1.5 percentage points with HTK's standard frequency warp function and by 2.9 percentage points with the linear frequency warp function, and to group 1616, where the accuracy increased by 1.6 percentage points with HTK's standard frequency warp function. C.f. Table 16.

Group HTK’s standard freq. w. fcn. The linear freq. w. fcn.

1632 1.5 2.91616 1.6 -816 4.0 6.0

Table 16: Accuracy increase in percentage points, when comparing the combination ofVTLN and MLLR and MLLR alone.

Since the accuracy increased more for group 816 than for the other groups when comparing the combination of VTLN and MLLR to MLLR alone, it might be of more interest to use the combination with this lower speech bandwidth. This is also supported by the widths of the intervals where the combination of the two methods proved better than MLLR alone, since that interval was larger for this group.

Also when comparing VTLN to the baseline experiment, the increase in accuracy was larger for group 816 than for the other groups, for both warp functions. In this group the accuracy increased by 16.1 percentage points with HTK's standard frequency warp function and by 16.9 percentage points with the linear frequency warp function. This should be compared to group 1632, where the accuracy increased by 6.3 percentage points with HTK's standard frequency warp function and by 7.1 percentage points with the linear frequency warp function, and to group 1616, where the accuracy increased by 4.5 percentage points with HTK's standard frequency warp function. C.f. Table 17.

Group HTK’s standard freq. w. fcn. The linear freq. w. fcn.

1632 6.3 7.11616 4.5 -816 16.1 16.9

Table 17: Accuracy increase in percentage points, when comparing VTLN and the baselineexperiment.

Another distinctive feature of group 816 was that the relative error reduction (RER) between the two warp functions was larger than for group 1632. When comparing the combination of VTLN and MLLR to MLLR alone for group 816, the accuracy increased by 4.0 percentage points with HTK's standard frequency warp function and by 6.0 percentage points with the linear frequency warp function, which gives a relative error reduction of 13.4%. The corresponding figure for group 1632 is 11.2%. This might confirm the assumption that there is more information to gain by warping the 4-8 kHz band of the children's speech into the bandwidth of the recognizer than by warping the 8-16 kHz band into it. The 4-8 kHz band contains lower formants than the 8-16 kHz band, and these lower formants are more important for the recognition than the higher ones.

On the other hand, when comparing VTLN to the baseline experiment, the relative error reduction was smaller for group 816 than for group 1632. For group 816 the accuracy increased by 16.1 percentage points with HTK's standard frequency warp function and by 16.9 percentage points with the linear frequency warp function, which gives a relative error reduction of 3.7%. The corresponding figure for group 1632 is 4.3%.

8.2 Estimation of warp factor based on the speech part

Several common results could be found in both groups; the results refer to Fig 15 and 16. In both groups and with both methods, the number of utterances that had been allotted 0.5 as warp factor was sharply reduced, as was the number of utterances allotted warp factors in the interval 0.5-0.6. This was a welcome result, c.f. Sec. 6.5. Also, in both groups and with both methods, the number of utterances allotted warp factors in the interval 0.7-0.88 increased. This range of warp factors has previously been shown to be appropriate for children's speech in general, c.f. Sec. 6.5.

Another common result was that the number of utterances allotted warp factors between 0.90 and 0.98 increased for both methods and both bandwidth groups. If the increase had consisted only of 8 year old children's utterances, it could have been explained by the fact that they have the longest vocal tracts, but unfortunately the increase was seen across all ages. The reason for this is not clear and requires further studies.


9 Conclusion

It was shown in this thesis that combining the two methods VTLN and MLLR improved the recognition performance compared to using either of them separately. The combination is always better than VTLN alone. Compared to MLLR, this is only true within an interval of warp factors. The intervals differ depending on the bandwidth used, but in general the interval 0.7-1.0 covers the region where the combination wins over MLLR.

When comparing the combination of VTLN and MLLR to MLLR alone, the improvement was greater when the experiment was performed on a recognizer trained with adults' speech of 4 kHz bandwidth and evaluated with children's speech of 8 kHz bandwidth than when the recognizer was trained and evaluated with double those bandwidths. This might make the combination of the two methods especially interesting for this lower speech bandwidth.

It was also shown that VTLN could be improved if the warp factors for the children's utterances were estimated based only on the speech part of the utterances instead of the entire utterances. The recognition accuracy increased and the number of utterances allotted warp factors in the interval 0.5-0.6 decreased.

9.1 Outlook

A further experiment of interest would be to examine why there were still utterances allotted warp factors in the range 0.5-0.6, and warp factors greater than 1, when the warp factor was estimated based only on the speech part of the utterances (c.f. Sec 6.5). Is this caused by non-speech intervals between the digits, by incorrect utterance boundary detection, or is there another explanation?


References

[1] Lena Mahl. Speech recognition and adaptation experiments on children's speech. KTH, Stockholm, Sweden, 2003.

[2] Anders Schiold. Normalization of vocal tract length for automatic recognition of children's speech. KTH, Stockholm, Sweden, 2004.

[3] Mats Blomberg and Daniel Elenius. Collection and recognition of children's speech in the PF-Star project. Technical report, Department of Speech, Music and Hearing, KTH, http://www.ling.umu.se/fonetik2003/, 2003.

[4] Giampiero Salvi. RefRec: a reference recognizer in the COST249 project. KTH-TMH guide, February 2002.

[5] Steve Young, Gunnar Evermann, Dan Kershaw, Gareth Moore, Julian Odell, Dave Ollason, Dan Povey, Valtcho Valtchev and Phil Woodland. The HTK Book. Microsoft Corporation, December 2002.

[6] Mats Blomberg and Kjell Elenius. Automatisk igenkänning av tal. Institutionen för tal, musik och hörsel, KTH, 2003.

[7] John Holmes and Wendy Holmes. Speech Synthesis and Recognition. Taylor & Francis, 2nd edition, 2001.

[8] Alan Oppenheim, Alan Willsky and Hamid Nawab. Signals & Systems. Prentice-Hall, Upper Saddle River, New Jersey, 2nd edition, 1997. ISBN 0-13-651175-9.

[9] R. Webster. A generalized Hamming window. IEEE Transactions on Acoustics, Speech and Signal Processing, 26(2):176-177, April 1978.

[10] Xuedong Huang, Alex Acero and Hsiao-Wuen Hon. Spoken Language Processing. Prentice-Hall, Upper Saddle River, New Jersey, 2001. ISBN 0-13-022616-5.

[11] Arne Leijon. Introduction to Pattern Recognition. KTH-TMH, 2003.

[12] W. B. Kleijn. Hidden Markov Models for Speech Recognition. Department of Speech, Music and Hearing, Royal Institute of Technology, Stockholm, 2000.

[13] Johan Sundberg. Röstlära: fakta om rösten i tal och sång. Proprius förlag, Stockholm, 1986. ISBN 91-7118-558-5.

[14] Q. Li and M. Russell. An analysis of the causes of increased error rates in children's speech recognition. Proceedings of ICSLP 2002, Denver, pages 2337-2340, 2002.

[15] Dorota Iskra, Beate Großkopf, Krzysztof Marasek, Henk van den Heuvel, Frank Diehl and Andreas Kiessling. SPEECON - speech databases for consumer devices: database specification and validation. http://www.speecon.com/public_docs/LREC2002_specification_and_validation.pdf, 2002.

[16] Lutz Welling, Hermann Ney and Stephan Kanthak. Speaker adaptive modeling by vocal tract normalization. IEEE Transactions on Speech and Audio Processing, volume 10, September 2002.

[17] Lutz Welling, Stephan Kanthak and Hermann Ney. Improved methods for vocal tract normalization. Proceedings of ICASSP, pages 761-764, 1999. RWTH Aachen University of Technology.

[18] Achim Sixtus, Sirko Molau, Stephan Kanthak, Ralf Schlüter and Hermann Ney. Recent improvements of the RWTH large vocabulary speech recognition system on spontaneous speech. Proceedings of ICASSP, pages 1671-1674, 2000. RWTH Aachen University of Technology.

[19] David Ohlin. Formant extraction for data-driven formant synthesis. Master's thesis, TMH, KTH, Stockholm, 2004.

[20] Linda Oppelstrup. Speech Recognition used for Scoring of Children's Pronunciation of a Foreign Language. Master's thesis, TMH, KTH, Stockholm, 2004.

[21] George R. Doddington and Thomas B. Schalk. Speech recognition: turning theory to practice. IEEE Spectrum, September 1981.


Appendix A

Parameter settings for the recognizer

TARGETKIND = MFCC_0
TARGETRATE = 100000.0
SAVECOMPRESSED = F
SAVEWITHCRC = F
WINDOWSIZE = 250000.0
USEHAMMING = T
PREEMCOEF = 0.0
CEPLIFTER = 0
NUMCEPS = 12
ENORMALISE = F
LOFREQ = 0
HIFREQ = 4000/8000

# input file format
SOURCEKIND = WAVEFORM
SOURCEFORMAT = WAV

# Additional spec
NUMCHANS = 26
WARPLCUTOFF = 300/300
WARPUCUTOFF = 2750/5500
WARPFREQ = 1.22

