COMBINATION OF ACOUSTIC MODELS IN CONTINUOUS SPEECH RECOGNITION HYBRID SYSTEMS

Hugo Meinedo, João P. Neto
INESC - IST

Rua Alves Redol, 9, 1000-029 Lisboa - Portugal
[email protected], [email protected]

http://neural.inesc.pt/NN/RFC

ABSTRACT

The combination of multiple sources of information has been an attractive approach in different areas. That is the case of the speech recognition area, where several combination methods have been presented. Our hybrid MLP/HMM systems use acoustic models based on different sets of features and different MLP classifier structures. In this work we developed a method for combining the phoneme probabilities generated by different acoustic models trained on distinct feature extraction processes. Two different algorithms were implemented for combining the acoustic model probabilities: the first covers the combination in the probability domain and the second in the log-probability domain. We made combinations of two and three alternative baseline systems, where it was possible to obtain relative improvements in word error rate larger than 20% for a large vocabulary speaker independent continuous speech recognition task.

1. INTRODUCTION

For some time now researchers have been working on the combination of different classifiers, following the "divide and conquer" thesis, as a way to improve overall classifier performance. In fact, the combination of multiple sources of information is an attractive approach to speech recognition, where frequently there is a high number of classes and the input is affected by noise or other sources of variability, resulting in hard-to-classify data. In speech recognition, there are at least three potential levels where combination may be introduced: i) we can combine the probability estimates of several acoustic models before they are fed into the decoder [1, 2]; ii) it is possible to combine several language models estimated on different corpora by interpolating their probability estimates; and iii) it is also possible to run several independent recognisers in parallel and combine their output word hypotheses based on voting [3].
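For level (ii), the usual realisation is a linear interpolation of the component language model estimates; a generic form (not spelled out in this paper) is

\[ P(w \mid h) = \sum_{k} \lambda_k \, P_k(w \mid h), \qquad \sum_{k} \lambda_k = 1, \;\; \lambda_k \ge 0, \]

where each P_k is a language model trained on a different corpus and the weights are typically tuned on held-out data.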

In our work we have been developing hybrid systems that combine the temporal modelling capabilities of hidden Markov models (HMMs) with the pattern classification capabilities of multilayer perceptrons (MLPs). In this hybrid HMM/MLP system [4], a Markov process is used to model the basic temporal nature of the speech signal. The MLP is used as the acoustic model within the HMM framework: it estimates context-independent posterior phone probabilities to be used in the Markov process. We developed hybrid MLP/HMM context-independent systems for continuous speech recognition using a different set of databases and applied them to distinct tasks [5, 6].
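For readers less familiar with the hybrid framework, the usual recipe (described in [4]) is to divide each MLP posterior by the corresponding phone prior, giving a scaled likelihood that the HMM decoder can use in place of an emission likelihood. The sketch below is only illustrative; the function and variable names are our own and the exact handling in this system is not shown in the paper.

    import numpy as np

    def posteriors_to_scaled_likelihoods(posteriors, priors, floor=1e-8):
        # posteriors: (frames, phones) matrix of MLP outputs P(phone | frame)
        # priors:     (phones,) vector of phone prior probabilities
        # P(phone | frame) / P(phone) is proportional to P(frame | phone),
        # which is what the Viterbi decoder expects; the log domain avoids underflow.
        scaled = posteriors / np.maximum(priors, floor)
        return np.log(np.maximum(scaled, floor))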

We noted that some of our baseline recognition systems, which use acoustic models with different properties, gave emphasis to distinct portions of the speech signal and responded differently to difficult situations. Through observation of the output produced by these recognition systems on a common task, we concluded that they tended to make errors with a low degree of correlation. This evidence suggested that an appropriate method for combining the baseline recognisers might produce a final system more accurate than either of the constituents alone.

In this work we developed a combination method for different context-independent acoustic models through an average, in the probability and in the log-probability domain, of the phonetic probabilities estimated at the output of the classifiers. We combined two and three baseline recognition systems simultaneously and compared them based on the total number of parameters and word error rate (% WER).
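As an illustration only, a minimal sketch of the two combination rules as we read them from this description; the function names and the renormalisation step in the log-domain rule are our assumptions, since the exact formulation is the subject of Section 2.

    import numpy as np

    def combine_probability_domain(posteriors):
        # posteriors: list of (frames, phones) matrices, one per acoustic model.
        # Probability-domain rule: frame-wise arithmetic mean of the posteriors.
        return np.mean(np.stack(posteriors, axis=0), axis=0)

    def combine_log_probability_domain(posteriors, floor=1e-10):
        # Log-probability-domain rule: average the log posteriors (a geometric
        # mean), then renormalise so each frame sums to one (our assumption).
        logs = [np.log(np.maximum(p, floor)) for p in posteriors]
        combined = np.exp(np.mean(np.stack(logs, axis=0), axis=0))
        return combined / combined.sum(axis=1, keepdims=True)

Each element of posteriors would be the output of one MLP for the same utterance; the probability-domain rule favours agreement in absolute terms, while the log-domain rule penalises models that assign near-zero probability to a phone.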

In Section 2 we present the combination algorithms used for the experiments. The baseline speech recognisers are described in Section 3. In Section 4 the evaluation of the combined recognition systems is summarised. Finally, some conclusions and future work are presented in Sections 5 and 6.

2. COMBINATION OF ACOUSTIC MODELS

In this work we developed a method that combines the phoneme probabilities generated by different acoustic models.

were used. By combining three acoustic models we were able to reduce the WER by 22% relative to the best constituent. In the second case we added the ModSpec acoustic model to one of the combinations of two systems that had obtained the best results. This case represents a 16% relative reduction when compared with the best individual system. Finally, in an attempt to obtain the lowest possible word error rate, we used the PLP 1000 instead of the PLP+Sil because it has better performance due to an MLP with more parameters. We see only a small reduction when comparing with the second case. Compared with the best constituent, the PLP 1000, we got a 15% relative improvement.

We are currently developing recognition systems for more difficult tasks. One such case is broadcast news transcription. Some preliminary experiments were performed using these baseline recognition systems, whose acoustic models were trained only with clean read speech. Obviously this is a very demanding task, further complicated by frequent OOV words even for a 27k vocabulary. In these experiments we have observed that in situations involving pivot speech the relative improvement obtained in the word error rate by the combined recognition system is slightly above 20%. For situations involving spontaneous speech the improvement approaches 30% with respect to our best baseline system.

5. CONCLUSIONS

In this paper we presented the development of a new large vocabulary continuous speech recognition system for the Portuguese language based on the combination of different acoustic models. These models were merged through two different algorithms that combine the phoneme a posteriori probabilities, estimated by the MLPs of the several acoustic models, in the probability and in the log-probability domain.

Starting with absolute word error rates around 14% from the isolated baseline systems, we were able to achieve values around 10%, which represents a relative improvement larger than 20%. Furthermore, the combined systems proved to be even more robust when used in adverse situations that significantly differ from the training ones.
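Using the rounded figures quoted above (the exact values appear in the results tables on the pages not reproduced here), the relative improvement works out as

\[ \frac{\mathrm{WER}_{\mathrm{baseline}} - \mathrm{WER}_{\mathrm{combined}}}{\mathrm{WER}_{\mathrm{baseline}}} \approx \frac{14 - 10}{14} \approx 29\%, \]

consistent with the claim of a relative improvement larger than 20%.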

6. FUTURE WORK

We will continue to investigate and develop new ways of improving our recognition systems. One technique that will deserve our attention is combination at the word level using voting algorithms. Another area of interest is the development of acoustic models more robust to noise and spontaneous speech, in order to cope with more difficult situations. The combined systems described in this paper were obtained using acoustic models trained only with clean read speech for use in dictation tasks, and consequently they are not the most suited for the situations imposed by broadcast news transcription tasks.

7. ACKNOWLEDGEMENTS

This work was partially funded by the PRAXIS XXI project 1654. An acknowledgement is also given to the PÚBLICO newspaper for making its texts available for this work.

8. REFERENCES

1. Hochberg, M., Cook, G., Renals, S., and Robinson, T., Connectionist model combination for large vocabulary speech recognition, in Proceedings IEEE Workshop on Neural Networks for Signal Processing, 1994.

2. Cook, G. and Robinson, A., Boosting the performance of connectionist large vocabulary speech recognition, in Proceedings ICSLP 96, Philadelphia, USA, 1996.

3. Fiscus, J., A post-processing system to yield reduced word error rates: Recogniser output voting error reduction (ROVER), in Proceedings IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU 97), Santa Barbara, USA, 1997.

4. Bourlard, H. and Morgan, N., Connectionist Speech Recognition - A Hybrid Approach, Kluwer Academic Publishers, Massachusetts, USA, 1994.

5. Neto, J., Martins, C., and Almeida, L., The Development of a Speaker Independent Continuous Speech Recogniser for Portuguese, in Proceedings EUROSPEECH 97, Rhodes, Greece, 1997.

6. Neto, J., Martins, C., and Almeida, L., A Large Vocabulary Continuous Speech Recognition Hybrid System for the Portuguese Language, in Proceedings ICSLP 98, Sydney, Australia, 1998.

7. Neto, J., Martins, C., Meinedo, H., and Almeida, L., The Design of a Large Vocabulary Speech Corpus for Portuguese, in Proceedings EUROSPEECH 97, Rhodes, Greece, 1997.

8. Kingsbury, B. E., Morgan, N., and Greenberg, S., Robust speech recognition using the modulation spectrogram, Speech Communication, 25:117-132, 1998.

9. Meinedo, H. and Neto, J., The use of syllable segmentation information in continuous speech recognition hybrid systems applied to the Portuguese language, in Proceedings ICSLP 2000, Beijing, China, 2000.