
[IEEE 2013 National Conference on Communications (NCC) - New Delhi, India, 15-17 February 2013]

Subspace Modeling Technique Using Monophones for Speech Recognition

Bhargav Srinivas Ch, Neethu Mariam Joy, Raghavendra R. Bilgi and S. Umesh

Department of Electrical Engineering, Indian Institute of Technology Madras, Chennai 600036, India

Email: {ee11s003, ee11d009, ee10s009, umeshs}@ee.iitm.ac.in

Abstract—In this paper we propose an adaptive training method for parameter estimation of acoustic models in a speech recognition system. Our technique is inspired by the Cluster Adaptive Training (CAT) method, which is used for rapid speaker adaptation. Instead of adapting the model to a speaker as in CAT, we adapt the parameters of the context-dependent triphone states (tied states) from the context-independent states (monophones). This is achieved by finding a global mapping of the tied-state parameters from the parametric subspace of the monophone models. This technique is similar to the Subspace Gaussian Mixture Model (SGMM), but differs in the initialization of parameters and in the update of the weights of the Gaussian mixture components. We show that the proposed method matches the performance of the conventional HMM system for large amounts of training data and outperforms it when the number of training examples is small.

Index Terms—Speech recognition, adaptive training, subspace modeling.

I. INTRODUCTION

The Gaussian Mixture Model (GMM) characterizing the distribution of each Continuous Density Hidden Markov Model (CD-HMM) state is the standard acoustic model in speech recognition. The requirement of a sufficiently large amount of training data for the robust estimation of these GMM parameters (i.e., context-dependent phone models) is an important issue in building these models. Research in acoustic modeling is gaining momentum in scenarios where robust models need to be built from small amounts of training data. Efficient and robust estimation of acoustic model parameters from limited training data is steering the research focus toward techniques where the number of parameters to be estimated is small.

Subspace Gaussian Mixture Models (SGMM) [6] and Canonical State Models (CSM) [2] are two acoustic modeling techniques which exploit the relationships between context-dependent phone models (triphones). Both methods have similarities to Cluster Adaptive Training (CAT) [3], a speaker adaptation technique. While SGMM exploits the correlations among the GMM parameters of tied states, CSM strives to transform a canonical model to a context-dependent state. Both methods achieve considerable parameter reduction and hence can be used when the amount of available training data is limited.

Following similar lines of argument as the aforementioned techniques, we propose a new acoustic modeling technique, called the Subspace Model, which takes a more intuitive approach to the estimation of triphone model parameters. Here, we assume that triphone models are formed by linear combinations of monophone models. We attribute the inspiration for the proposed technique to CAT.

In CAT, the means of a new speaker's HMM model are formed by linear interpolation of several cluster model means. A cluster is formed by grouping together a set of speakers, and during the CAT training phase an HMM model is built for each of these clusters. Similarly, in the proposed Subspace Model, the means of a triphone model are formed by a linear combination of the monophone model means.

In the Subspace Model, the monophone models serve as basis functions which are linearly combined to give the parameters of any triphone model. Recently, Eigentriphones [4] was proposed, which develops an eigenbasis for triphones and models each triphone as a point in the space spanned by this eigenbasis. In the Subspace Model, by contrast, each triphone is identified as a point in the space of monophones. The way the basis is developed in Eigentriphones is very different from the way it is constructed in the Subspace Model.

At the implementation level, our technique is closer to SGMM, where the means and mixture weights of the tied states vary in the full GMM parameter space and the mapping for a particular tied state is done via a state projection vector. SGMM also constrains the covariances of all the HMM states to be the same as those of the canonical model. In our proposed method, the covariances of the monophone models are tied together, thereby constraining the covariances of each tied-state model to be the same as those of the monophone models.

The rest of the paper is organized as follows: Section II gives a brief introduction to the Subspace Model using an analogy with CAT, Section III goes into the details of the Subspace Model, Section IV describes the overall training scheme, Section V gives the experimental details and results, followed by a discussion of the results in Section VI and, finally, conclusions and possible future research directions in Section VII.

978-1-4673-5952-8/13/$31.00 © 2013 IEEE


II. SUBSPACE MODEL & CAT

A. Cluster Adaptive Training (CAT)

CAT is a rapid adaptation technique which compactly represents a speaker with a small number of parameters. It gains an upper hand over a Speaker Independent CD-HMM model and gives performance comparable to that of a Speaker Dependent model, but without the added hassle of requiring a large amount of training data from a particular speaker. CAT is a generalization of Speaker Adaptive Training (SAT) [1].

In CAT, all the training speakers are initially divided into P groups, or clusters, and an HMM known as a cluster model is built for each cluster using the data of the speakers belonging to that cluster. The parameters of each speaker model are obtained by linear interpolation of the parameters of all P cluster models, so the cluster to which the speaker is closest receives more weight. This gives a form of soft clustering rather than the hard clustering advocated by Speaker Clustering [5]. The mean of the jth state, mth Gaussian for speaker s in the CAT model is as follows:

\mu_{jm}^{s} = \left[\mu_{jm}^{1}\ \mu_{jm}^{2}\ \cdots\ \mu_{jm}^{P}\right] \left[\lambda_{s}^{1}\ \lambda_{s}^{2}\ \cdots\ \lambda_{s}^{P}\right]^{T} = M_{jm}\,\lambda_{s} \qquad (1)

where M_jm is the matrix of means from the P clusters and \lambda_s is referred to as the speaker weight vector.

During the training phase, the cluster model parameters and the cluster weight vectors of the training speakers are iteratively estimated so that the clusters become compact. During the test phase, the adapted model for a test speaker is obtained by estimating the speaker weight vector \lambda_s. As the number of parameters to be estimated is small (only P parameters), CAT supports model adaptation even when little adaptation data is available (i.e., rapid adaptation).

The driving force behind the Subspace Model is the need to build a robust acoustic model with a limited amount of training data. The proposed technique accomplishes this by sharing a large number of parameters among the tied states, thereby reducing the total number of parameters to be estimated for the model.

B. Introduction to Subspace Model

The Subspace Model technique is closely related to CAT, as the former adopts the parameter estimation framework of the latter. However, CAT is a model adaptation technique, whereas the Subspace Model operates at the acoustic modeling level. In the Subspace Model, the parameters of each tied state are obtained by linear interpolation of the parameters of the monophone models. The mean \mu_{jm} of the jth state, mth Gaussian is as follows:

\mu_{jm} = M_m v_j \qquad (2)

where M_m is the matrix formed by stacking the mth means of all the monophone models and v_j is the tied-state weight vector for the jth tied state, as shown in Eqs. 3 and 4:

M_m = \left[\mu_{m}^{1}\ \mu_{m}^{2}\ \cdots\ \mu_{m}^{K}\right] \qquad (3)

v_j = \left[v_{j}^{1}\ v_{j}^{2}\ \cdots\ v_{j}^{K}\right]^{T} \qquad (4)

where \mu_{m}^{k} is the mth mean of the kth monophone and K is the total number of monophones.

Hence, to estimate the means of the jth triphone model, only v_j needs to be estimated. The covariances are tied across all the triphone models. The weights of each triphone model are estimated using the standard Expectation Maximization (EM) algorithm, as in the case of CD-HMM parameter estimation.

C. Analogy between Subspace Model & CAT

In this section, we draw an analogy between the proposed Subspace Model and CAT. As mentioned in the previous section, CAT is a speaker adaptation technique and the Subspace Model is an acoustic modeling technique. Consequently, a "cluster" in CAT refers to an HMM model for a group or cluster of speakers, while its counterpart in the Subspace Model is a monophone model. M_jm is formed from the cluster means in CAT, and M_m from the monophone GMM means in the Subspace Model. In CAT, each speaker is associated with a weight vector \lambda_s which weights the cluster means. Similarly, in the Subspace Model the tied-state-specific weight vector v_j, which weights the monophone means, identifies each of the triphone models. CAT locates every speaker in the subspace of clusters (the speaker subspace), and the Subspace Model identifies every triphone model in the monophone subspace.

D. Subspace Model vs SGMM

The proposed Subspace Model and SGMM are both acoustic modeling techniques. Though the Subspace Model in a theoretical sense is closer to CAT, implementation-wise it borrows a few concepts from SGMM. Both methods are parameter-reduction modeling techniques, but they arise from two different streams of thought. In a conventional CD-HMM model, phonetic decision-tree based state tying and subsequent tied-state modeling are employed to model the correlations among the various context-dependent phonetic states. SGMM goes one step further and models the correlations among the GMM parameters of the various tied states. Our proposed method, on the other hand, gives a plausible explanation for the formation of triphones by treating them as nothing but linear combinations of monophones. So, while SGMM addresses the problem of parameter reduction from the ground level of similarity among the GMMs of tied states, the Subspace Model tries to explain what might bring about these similarities. In this section we compare the differences between the two techniques.

1. Initialization of M_m: In SGMM, M_m is initialized to the identity matrix, whereas in our model M_m is initialized as a matrix of monophone means stacked together.

2. Intuitiveness: The proposed method gives a plausible reason why the means of a triphone model (\mu_{jm}) can be thought of as a linear combination of monophone means. No such intuitive explanation can be given for the SGMM model.


Figure 1. Subspace Model: the GMM of a tied state (GMM-TS) is obtained as a linear combination of the monophone GMMs (GMM 1, GMM 2, ..., GMM K), weighted by the components v_j^1, v_j^2, ..., v_j^K of the tied-state weight vector.

3. Component Priors: The component priors w_{jm} are updated in CD-HMM fashion in the proposed method, whereas SGMM applies an exponential (log-linear) transform to v_j to compute the component priors. This enables the SGMM model to use fewer parameters than the proposed technique.
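The contrast between the two prior updates can be sketched as follows; this is a schematic with made-up occupancies and a hypothetical global projection matrix W, not the exact SGMM recipe:

```python
import numpy as np

rng = np.random.default_rng(2)
R = 8   # mixtures in tied state j (hypothetical)
K = 40  # dimension of v_j

# Proposed method: CD-HMM-style update from accumulated occupancies gamma_jm.
gamma_j = rng.uniform(1.0, 10.0, size=R)   # summed posteriors per mixture
w_j_hmm = gamma_j / gamma_j.sum()          # w_jm = gamma_jm / sum_m gamma_jm

# SGMM-style priors: softmax of w_m^T v_j, so the per-state weight
# parameters are absorbed into v_j and a shared projection W.
W = rng.normal(size=(R, K))                # global weight-projection vectors
v_j = rng.normal(size=K)
logits = W @ v_j
w_j_sgmm = np.exp(logits - logits.max())   # subtract max for stability
w_j_sgmm /= w_j_sgmm.sum()

assert np.isclose(w_j_hmm.sum(), 1.0) and np.isclose(w_j_sgmm.sum(), 1.0)
```

The second form is what saves SGMM the R extra weights per tied state mentioned above.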

III. DETAILS OF THE SUBSPACE MODEL

The Subspace Model is shown diagrammatically in Fig. 1, where the GMM of a tied state is obtained as a linear combination of the GMMs of all the monophones (GMM 1, GMM 2, ..., GMM K). GMM-TS refers to the GMM of a Tied State (TS).

The Subspace Model for a tied state is defined as follows:

P(o_t \mid \lambda) = \sum_{m=1}^{R} w_{jm}\, \mathcal{N}(o_t;\ \mu_{jm}, \Sigma_{jm}) \qquad (5)

where R is the total number of mixtures in the jth tied state.
• The mean \mu_{jm} of a tied state is given by Eq. 2.
• The covariances \Sigma_{jm} are globally shared across the tied states and are made equal to the covariances of the monophones. We drop the index j (tied-state index) from \Sigma_{jm} (hereafter the covariance is referred to as \Sigma_m), as the covariances are shared among all the tied states.
• The component priors of the Gaussians are w_{jm}.

The means, covariances and component priors mentioned above constitute the parameters of the Subspace Model. Here, each tied state is associated with a vector v_j (of dimension equal to the number of monophones, approximately 40) which linearly combines the means of the monophones to give the GMM means of that particular tied state. Note that the v_j are tied-state-specific vectors, while the M_m are globally shared parameters.

In the Subspace Model, each tied state is thus characterized by its tied-state weight vector v_j (typically 40-dimensional). The means of the tied state are obtained as linear combinations of the monophone means weighted by v_j, and the covariances are shared among all the tied states. Consequently, the number of parameters to be estimated is drastically reduced, which allows us to build a complex model with 100-150 mixtures per tied state. Note that even as the number of mixtures per state increases, the dimension of v_j remains fixed.

To give the reader a feel for the number of parameters involved, Table I compares the parameter counts of the CD-HMM and Subspace Models for the Aurora-4 task, which has 3000 tied states.

Table I
PARAMETER COUNT FOR CD-HMM & SUBSPACE MODEL

Parameters      CD-HMM (6 mix, diagonal Σ_jm)          Subspace Model (128 mix, diagonal Σ_m)
State-specific  μ_jm: 6×39; Σ_jm: 6×39; w_jm: 6        v_j: 42; w_jm: 128
Global          -                                      M_m: 128×39×42; Σ_m: 128×39
Total           1,422,000                              708,672

From Table I, it can be seen that the Subspace Model has only about half the number of parameters of the CD-HMM model. Hence, the Subspace Model is better suited to robust model estimation on smaller training data sets.
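The totals in Table I can be checked with a short script. Note this sketch assumes K = 40 monophones, which reproduces the printed Subspace Model total exactly; with the table's 42-dimensional v_j the count would come out slightly higher:

```python
# Rough parameter bookkeeping for Table I (Aurora-4: 3000 tied states,
# 39-dimensional features). K = 40 monophones is an assumption.
J, D = 3000, 39

# CD-HMM: 6 diagonal Gaussians per tied state (means, variances, weights).
mix_hmm = 6
cd_hmm = J * (mix_hmm * D + mix_hmm * D + mix_hmm)

# Subspace Model: 128 mixtures; per-state v_j and w_jm are state-specific,
# M_m and the diagonal Sigma_m are global.
mix_sub, K = 128, 40
subspace = J * (K + mix_sub) + mix_sub * D * K + mix_sub * D

print(cd_hmm, subspace)  # -> 1422000 708672
```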

A. Why We Call This a Subspace Model

The Gaussian means \mu_{jm} of all the tied states lie in the space spanned by the columns of M_m. The supervector obtained by stacking all the means of the jth tied state can therefore be written as follows:

\underbrace{\left[\mu_{j1}^{T}\ \cdots\ \mu_{jR}^{T}\right]^{T}}_{39R \times 1} = \underbrace{\left[M_{1}^{T}\ \cdots\ M_{R}^{T}\right]^{T}}_{39R \times K} v_j = M v_j \qquad (6)

where R is the total number of mixtures (100-130) in a tied state of the Subspace Model and K (the number of monophones, approximately 40) is the dimension of v_j.

Here, the 39R-dimensional supervector of any tied state lies in the column space of M, which is K-dimensional. We are thus identifying all the 39R-dimensional tied-state supervectors in a K (≪ 39R) dimensional subspace.
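A small NumPy check illustrates that applying the stacked matrix M of Eq. 6 is equivalent to applying each M_m separately; all sizes and values below are arbitrary placeholders:

```python
import numpy as np

# Hypothetical sizes: R mixtures per tied state, K monophones, D = 39 features.
R, K, D = 16, 40, 39
rng = np.random.default_rng(3)

M_per_mix = rng.normal(size=(R, D, K))   # one M_m matrix per mixture m
v_j = rng.normal(size=K)

# Per-mixture means (Eq. 2), stacked into a 39R-dimensional supervector.
super_from_mix = np.concatenate([M_per_mix[m] @ v_j for m in range(R)])

# Equivalently (Eq. 6): stack the M_m into M (39R x K) and apply v_j once.
M = M_per_mix.reshape(R * D, K)
super_from_M = M @ v_j

assert np.allclose(super_from_mix, super_from_M)
```

Every tied-state supervector is therefore confined to the K-dimensional column space of M.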

IV. TRAINING OF SUBSPACE MODEL

This section gives details of the model updates and the overall training scheme of the Subspace Model.

A. Expectation Maximization (EM) Algorithm

The Maximum Likelihood training of the Subspace Model is the same as for other adaptive training techniques such as SAT and CAT. The relevant part of the auxiliary function, which is to be minimized with respect to the parameters, can be written as follows:

Q = \sum_{j,m,t} \gamma_{jm}(t)\,(o_t - M_m v_j)^{T}\, \Sigma_m^{-1}\, (o_t - M_m v_j) \qquad (7)

where \gamma_{jm}(t) is the posterior probability of the jth state, mth Gaussian component for observation o_t.

The optimization of the parameters is performed via the standard Expectation Maximization approach. The state-specific and globally shared parameters are optimized separately, i.e.:

1) Update the weight vectors v_j of all the tied states given the current estimates of the monophone GMM parameters M_m, \Sigma_m.


2) Update all the monophone GMM parameters M_m, \Sigma_m given the current estimates of the tied-state weight vectors v_j.

These two steps together constitute a single iteration of the EM algorithm. They are repeated until the parameters converge. In this framework, we are adaptively training the monophone GMMs so that these reference monophone GMMs can be transformed (by linear combination) into any triphone GMM. The update equations for the parameters in steps 1 and 2 are given in the next section.

B. Model Updates

The update equations for the model parameters are obtained by differentiating the auxiliary function in Eq. 7. The model updates are as follows:

v_j = \left[\sum_{m,t} \gamma_{jm}(t)\, M_m^{T} \Sigma_m^{-1} M_m\right]^{-1} \left[\sum_{m,t} \gamma_{jm}(t)\, M_m^{T} \Sigma_m^{-1} o_t\right] \qquad (8)

M_m = \left[\sum_{j,t} \gamma_{jm}(t)\, o_t v_j^{T}\right] \left[\sum_{j,t} \gamma_{jm}(t)\, v_j v_j^{T}\right]^{-1} \qquad (9)

\Sigma_m = \frac{\sum_{j,t} \gamma_{jm}(t)\,(o_t - M_m v_j)(o_t - M_m v_j)^{T}}{\sum_{j,t} \gamma_{jm}(t)} \qquad (10)
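As a toy illustration of one M-step under Eqs. 8, 9 and 10, the updates can be sketched in NumPy with synthetic posteriors and full (rather than diagonal) covariances; all sizes and values are placeholders:

```python
import numpy as np

# Toy single M-step for the Subspace Model updates (Eqs. 8, 9, 10).
J, R, T, K, D = 5, 3, 50, 4, 6   # tied states, mixtures, frames, monophones, dims
rng = np.random.default_rng(4)

obs = rng.normal(size=(T, D))                  # observations o_t
gamma = rng.uniform(0.1, 1.0, size=(J, R, T))  # posteriors gamma_jm(t)
M = rng.normal(size=(R, D, K))                 # one M_m matrix per mixture m
Sigma_inv = np.stack([np.eye(D)] * R)          # inverse covariances (identity start)
v = np.zeros((J, K))

# Eq. 8: update each tied-state weight vector v_j given M_m, Sigma_m.
for j in range(J):
    A, b = np.zeros((K, K)), np.zeros(K)
    for m in range(R):
        g = gamma[j, m].sum()                  # sum_t gamma_jm(t)
        A += g * (M[m].T @ Sigma_inv[m] @ M[m])
        b += M[m].T @ Sigma_inv[m] @ (gamma[j, m] @ obs)
    v[j] = np.linalg.solve(A, b)

# Eq. 9: update each M_m given the new v_j.
for m in range(R):
    num, den = np.zeros((D, K)), np.zeros((K, K))
    for j in range(J):
        num += np.outer(gamma[j, m] @ obs, v[j])
        den += gamma[j, m].sum() * np.outer(v[j], v[j])
    M[m] = num @ np.linalg.inv(den)

# Eq. 10: update the shared covariances from the weighted residuals.
for m in range(R):
    acc, occ = np.zeros((D, D)), 0.0
    for j in range(J):
        resid = obs - M[m] @ v[j]              # (T, D) residuals o_t - M_m v_j
        acc += (gamma[j, m][:, None] * resid).T @ resid
        occ += gamma[j, m].sum()
    Sigma_inv[m] = np.linalg.inv(acc / occ)
```

In a real system the posteriors would come from forward-backward alignment, and the covariances would be kept diagonal as in the paper.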

C. Overall Training Procedure

A. Initialization

1) Build a standard CD-HMM baseline model (6 mixtures). Both the initial alignments used to bootstrap the Subspace Model and the tied-state information are obtained from this model.

2) Build monophone GMMs of the required size (96 or 128 mixtures) from the training data. Form M_m by stacking together the mth means of all the monophones, as shown in Eq. 3.

3) Copy the monophone GMMs built in step 2 to all the tied states in the CD-HMM model (e.g., the model of /aa/ is copied to all tied states having /aa/ as the center phone). Each tied state now contains a GMM of the required size. This serves as the initialization of our Subspace Model.

B. Training the Subspace Model consists of two phases:

• Phase 1: Update the model parameters (Eqs. 8, 9, 10) using the alignments \gamma_{jm}(t) from the baseline CD-HMM model. Alignments from the baseline model are needed, as it is difficult to train the Subspace Model from scratch.

• Phase 2: Update the model parameters using self-alignment, i.e., alignments from the Subspace Model itself.

Both phases undergo at least 7-8 iterations of EM, as described in Section IV-A.

Table II
RECOGNITION ACCURACIES (%) OF SUBSPACE AND CD-HMM MODELS TRAINED USING 4000 UTTERANCES

                                  LM factor
4000 utterances           18      15      14      13      Max
Subspace Model, 96 mix    79.75   81.73   84.3    84.4    84.4
Subspace Model, 128 mix   82.2    84.6    85.07   85.07   85.07
CD-HMM, 6 mix             81.8    83.2    82.5    82.57   83.2

Table III
RECOGNITION ACCURACIES (%) OF SUBSPACE AND CD-HMM MODELS TRAINED USING 7138 UTTERANCES

                                  LM factor
7138 utterances           18      14      13      11      Max
Subspace Model, 128 mix   82.37   85.4    85.7    85.8    85.8
Subspace Model, 160 mix   83.75   86.44   87.15   87.13   87.15
CD-HMM, 6 mix             87.65   88.9    88.72   87.9    88.9

V. EXPERIMENTS & RESULTS

A. Experimental Setup

The performance of the Subspace Model has been evaluated on the Aurora-4 database, which contains 7138 continuous utterances for training and 330 utterances for testing. The performance has also been tested on a smaller training set obtained by taking 4000 utterances from Aurora-4. 13-dimensional Mel Frequency Cepstral Coefficients (MFCC) are used as the base features for parameterizing the speech data. 13 delta and 13 acceleration coefficients are appended to form a 39-dimensional MFCC feature vector, and Cepstral Mean Subtraction (CMS) is performed on these feature vectors. The HMM Toolkit (HTK-3.4) [7] is used for building and testing the acoustic models.

A cross-word triphone HMM is built with 3 states and 6 mixtures per tied state. A three-state silence model with 6 Gaussian mixtures per state and a one-state short-pause model tied to the middle state of silence are used. This is the baseline CD-HMM model. Keeping the number of tied states constant, subspace models with 96 and 128 mixtures per tied state are built on the 4000-utterance training set, and with 128 and 160 mixtures per tied state on the full training set (7138 utterances). Diagonal covariances are used for all mixtures in all the experiments. The language model is the standard bigram model used for the Wall Street Journal (WSJ) task. The Language Model (LM) factor, which weights the language model against the acoustic model, plays a big role in evaluating these models: the lower the LM factor, the more weight is given to the acoustic model.
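The 39-dimensional feature construction described above can be sketched as follows; the delta computation uses the standard regression formula over a window of two frames either side, and the base cepstra here are random placeholders standing in for real MFCCs:

```python
import numpy as np

def deltas(c, theta=2):
    """Delta regression over a window of +/- theta frames (HTK-style formula)."""
    T, _ = c.shape
    padded = np.pad(c, ((theta, theta), (0, 0)), mode="edge")
    denom = 2 * sum(k * k for k in range(1, theta + 1))
    return sum(k * (padded[theta + k:theta + k + T]
                    - padded[theta - k:theta - k + T])
               for k in range(1, theta + 1)) / denom

# Placeholder 13-dimensional base cepstra for 100 frames (computed elsewhere).
rng = np.random.default_rng(5)
c = rng.normal(size=(100, 13))

d = deltas(c)                  # 13 delta coefficients
a = deltas(d)                  # 13 acceleration coefficients (deltas of deltas)
feat = np.hstack([c, d, a])    # 39-dimensional feature vectors

# Cepstral Mean Subtraction, per utterance.
feat -= feat.mean(axis=0)
assert feat.shape == (100, 39)
```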

B. Results

Tables II and III show the recognition accuracies of the subspace and CD-HMM models trained on the reduced and full data sets, respectively. Both the CD-HMM and subspace models are tested with several Language Model (LM) factors.

As shown in Table II, on the reduced data set of 4000 utterances, the subspace model gives an absolute improvement of 1.2% with 96 mixtures and 1.87% with 128 mixtures over the baseline CD-HMM model. On the complete data set, we observe from Table III that the CD-HMM model performs better than the subspace model, by 1.75% for 160 mixtures and 3.1% for 128 mixtures.

VI. DISCUSSION

We can see from Tables II and III that the performance of the subspace model depends heavily on the LM factor. Since we build a good acoustic model containing 100-150 mixtures per tied state, more weight should be given to the acoustic model than to the language model. Hence, a smaller LM factor should be used for the subspace model than for the baseline model.

We also observe from Tables II and III that, for the proposed technique, recognition accuracy improves as the number of mixtures increases. On the reduced data set, the Subspace Model performs better than the baseline model, as it has far fewer parameters than the CD-HMM model.

On the contrary, the subspace model performs worse than the baseline model on the larger training set, as it is constrained by the number of parameters used to model the training data. As the amount of available training data increases, the performance of the subspace model degrades relative to that of the CD-HMM model. Also, in the subspace model the number of mixtures cannot be increased beyond a certain limit, as the data required to build the initial monophone GMMs becomes a constraint. We cannot use full covariances for the tied states, as computing likelihoods with such a complex model at test time is too expensive.

VII. CONCLUSIONS & FUTURE WORK

We have presented a technique in which a complex acoustic model is built with a small number of parameters. The reduction in the number of parameters is achieved using a transformation that linearly combines the parameters of the monophones to give the parameters of the tied states. Furthermore, we have shown that this technique matches the performance of the baseline model with far fewer parameters.

In future work, we would like to further investigate the tied-state weight vectors and their influence on the performance of the subspace model. We would also like to test the performance of the subspace modeling technique on low-resource languages.

VIII. ACKNOWLEDGEMENTS

This work was supported in part by the project "Speech-based Access for Agricultural Commodity Prices in Six Indian Languages" funded by the Department of Information Technology (DIT), Govt. of India.

REFERENCES

[1] T. Anastasakos, J. McDonough, R. Schwartz, and J. Makhoul. A compact model for speaker-adaptive training. In Proc. Fourth International Conference on Spoken Language Processing (ICSLP), volume 2, pages 1137-1140, Oct. 1996.

[2] M. J. F. Gales and K. Yu. Canonical state models for automatic speech recognition. In Proc. INTERSPEECH, pages 58-61, 2010.

[3] M. J. F. Gales. Cluster adaptive training of hidden Markov models. IEEE Transactions on Speech and Audio Processing, 8(4):417-428, July 2000.

[4] T. Ko and B. Mak. Eigentriphones: A basis for context-dependent acoustic modeling. In Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4892-4895, May 2011.

[5] M. Padmanabhan, L. R. Bahl, D. Nahamoo, and M. A. Picheny. Speaker clustering and transformation for speaker adaptation in large-vocabulary speech recognition systems. In IEEE Trans. Speech and Signal Processing, pages 701-704, 1995.

[6] D. Povey, L. Burget, M. Agarwal, P. Akyazi, F. Kai, A. Ghoshal, O. Glembek, N. Goel, M. Karafiát, A. Rastrow, R. C. Rose, P. Schwarz, and S. Thomas. The subspace Gaussian mixture model: a structured model for speech recognition. Computer Speech & Language, 25(2):404-439, April 2011.

[7] S. Young, G. Evermann, M. Gales, T. Hain, D. Kershaw, X. A. Liu, G. Moore, J. Odell, D. Ollason, D. Povey, V. Valtchev, and P. Woodland. The HTK Book (for HTK Version 3.4). Cambridge University Engineering Department, 2006.