COMBINATION OF ACOUSTIC MODELS IN CONTINUOUS SPEECH RECOGNITION HYBRID SYSTEMS
Hugo Meinedo, João P. Neto
INESC - IST
Rua Alves Redol, 9, 1000-029 Lisboa, Portugal
[email protected], [email protected]
http://neural.inesc.pt/NN/RFC
ABSTRACT
The combination of multiple sources of information has been an attractive approach in different areas. That is the case in speech recognition, where several combination methods have been presented. Our hybrid MLP/HMM systems use acoustic models based on different sets of features and different MLP classifier structures. In this work we developed a method for combining the phoneme probabilities generated by different acoustic models trained on distinct feature extraction processes. Two algorithms were implemented for combining the acoustic model probabilities: the first operates in the probability domain and the second in the log-probability domain. We combined two and three alternative baseline systems and obtained relative improvements in word error rate larger than 20% on a large vocabulary, speaker independent, continuous speech recognition task.
1. INTRODUCTION
For some time now researchers have been working on the combination of different classifiers, following a "divide and conquer" strategy, as a way to improve overall classifier performance. Indeed, the combination of multiple sources of information is an attractive approach for speech recognition, where there is frequently a large number of classes and the input is affected by noise and other sources of variability, resulting in data that is hard to classify. In speech recognition there are at least three levels at which combination may be introduced: i) the probability estimates of several acoustic models can be combined before they are fed into the decoder [1, 2]; ii) several language models estimated on different corpora can be combined by interpolating their probability estimates; and iii) several independent recognisers can be run in parallel and their output word hypotheses combined by voting [3].
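The voting combination in iii) can be illustrated with a minimal sketch. This is a hypothetical helper, not the implementation of [3]: a real ROVER system first aligns the hypotheses into a word transition network by dynamic programming, whereas here the word sequences are assumed to be already aligned and of equal length.

```python
from collections import Counter

def majority_vote(hypotheses):
    """Combine word hypotheses from several recognisers by voting.

    `hypotheses` is a list of word sequences, one per recogniser,
    assumed to be aligned position by position.
    """
    combined = []
    for words in zip(*hypotheses):
        # keep the word proposed by the largest number of recognisers
        word, _ = Counter(words).most_common(1)[0]
        combined.append(word)
    return combined

print(majority_vote([
    ["the", "cat", "sat"],
    ["the", "bat", "sat"],
    ["the", "cat", "mat"],
]))  # -> ['the', 'cat', 'sat']
```

Each recogniser makes a different error, yet the voted output is error-free, which is the same low-error-correlation effect the paper exploits at the acoustic model level.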
In our work we have been developing hybrid systems that combine the temporal modelling capabilities of hidden Markov models (HMMs) with the pattern classification capabilities of multilayer perceptrons (MLPs). In this hybrid HMM/MLP system [4], a Markov process models the basic temporal nature of the speech signal, while the MLP serves as the acoustic model within the HMM framework. The MLP estimates the context-independent posterior phone probabilities used in the Markov process. We developed hybrid MLP/HMM context-independent systems for continuous speech recognition using different databases and applied them to distinct tasks [5, 6].
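How such posteriors enter the Markov process can be sketched with the standard scaled-likelihood conversion of the hybrid framework [4]; the numbers below are illustrative, not values from the paper.

```python
import numpy as np

def scaled_likelihoods(posteriors, priors):
    """Convert MLP phone posteriors P(q|x) into scaled likelihoods.

    The HMM decoder needs likelihoods p(x|q). By Bayes' rule,
    p(x|q) is proportional to P(q|x) / P(q), so dividing each
    posterior by the class prior (typically estimated from the
    phone frequencies of the training set) yields a quantity the
    decoder can use directly.
    """
    return posteriors / priors

# one frame, three hypothetical phone classes
post = np.array([0.7, 0.2, 0.1])
prior = np.array([0.5, 0.3, 0.2])
print(scaled_likelihoods(post, prior))  # roughly [1.4, 0.67, 0.5]
```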
We noted that some of our baseline recognition systems, which use acoustic models with different properties, give emphasis to distinct portions of the speech signal and respond differently to difficult situations. By observing the output produced by these recognition systems on a common task, we concluded that they tend to make errors with a low degree of correlation. This evidence suggested that an appropriate method for combining the baseline recognisers might produce a final system more accurate than any of its constituents alone.
In this work we developed a method for combining different context-independent acoustic models by averaging, in the probability and in the log-probability domain, the phonetic probabilities estimated at the output of the classifiers. We combined two and three baseline recognition systems simultaneously and compared them in terms of total number of parameters and word error rate (% WER).
In Section 2 we present the combination algorithms used for the experiments. The baseline speech recognisers are described in Section 3. In Section 4 the evaluation of the combined recognition systems is summarised. Finally, conclusions and future work are presented in Sections 5 and 6.
2. COMBINATION OF ACOUSTIC MODELS
In this work we developed a method that combines the phoneme probabilities generated by different acoustic models.
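The two averaging rules named in the abstract can be sketched as follows; this is a hedged illustration under our own assumptions (the function names, the flooring constant, and the renormalisation after the log-domain average are ours, not necessarily the paper's exact formulation).

```python
import numpy as np

def combine_linear(posterior_list):
    """Average phone posteriors in the probability domain."""
    return np.mean(np.asarray(posterior_list), axis=0)

def combine_log(posterior_list, eps=1e-12):
    """Average in the log-probability domain.

    Averaging log-probabilities amounts to a geometric mean of the
    posteriors; `eps` floors zero probabilities before the log, and
    the result is renormalised to sum to one again.
    """
    log_avg = np.mean(np.log(np.asarray(posterior_list) + eps), axis=0)
    p = np.exp(log_avg)
    return p / p.sum(axis=-1, keepdims=True)

# posteriors from two hypothetical acoustic models for three phones
a = np.array([0.8, 0.1, 0.1])
b = np.array([0.6, 0.3, 0.1])
print(combine_linear([a, b]))  # -> [0.7, 0.2, 0.1]
print(combine_log([a, b]))
```

The log-domain average penalises classes on which the models disagree more strongly than the linear average does, which is one common motivation for trying both rules.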
were used. By combining three acoustic models we were able to reduce the WER by 22% relative to the best constituent. In the second case we added the ModSpec acoustic model to one of the combinations of two systems that had obtained the best results. This case represents a 16% relative reduction when compared with the best individual system. Finally, in an attempt to obtain the lowest possible word error rate, we used the PLP 1000 instead of the PLP+Sil because it has better performance, due to an MLP with more parameters. We see only a small reduction when comparing with the second case. Compared with the best constituent, the PLP 1000, we obtained a 15% relative improvement.
We are currently developing recognition systems for more difficult tasks. One such case is broadcast news transcription. Some preliminary experiments were performed using these baseline recognition systems, whose acoustic models were trained only with clean read speech. This is obviously a very demanding task, further complicated by frequent OOV words even with a 27k vocabulary. In these experiments we observed that in situations involving pivot speech, the relative improvement in word error rate obtained by the combined recognition system is slightly above 20%. For situations involving spontaneous speech, the improvement approaches 30% with respect to our best baseline system.
5. CONCLUSIONS
In this paper we presented the development of a new large vocabulary continuous speech recognition system for the Portuguese language based on the combination of different acoustic models. These models were merged through two different algorithms that combine the a posteriori phoneme probabilities, estimated by the MLPs of the several acoustic models, in the probability and in the log-probability domain.
Starting with absolute word error rates around 14% for the isolated baseline systems, we were able to achieve values around 10%, which represents a relative improvement larger than 20%. Furthermore, the combined systems proved to be even more robust when used in adverse situations that differ significantly from the training conditions.
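The relative figures quoted throughout the paper follow directly from the absolute rates; as a quick check:

```python
def relative_improvement(baseline_wer, combined_wer):
    # relative WER reduction with respect to the baseline system
    return (baseline_wer - combined_wer) / baseline_wer

# roughly the figures reported above: ~14% absolute down to ~10% absolute
print(f"{relative_improvement(14.0, 10.0):.1%}")  # prints 28.6%
```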
6. FUTURE WORK
We will continue to investigate and develop new ways of improving our recognition systems. One technique that will deserve our attention is combination at the word level using voting algorithms. Another area of interest is the development of acoustic models more robust to noise and spontaneous speech, in order to cope with more difficult situations. The combined systems described in this paper were obtained using acoustic models trained only with clean read speech for dictation tasks, and are consequently not the best suited for the situations imposed by broadcast news transcription tasks.
7. ACKNOWLEDGEMENTS
This work was partially funded by the PRAXIS XXI project 1654. An acknowledgement is also given to the PÚBLICO newspaper for making its texts available for this work.
8. REFERENCES
1. Hochberg, M., Cook, G., Renals, S., and Robinson, T., Connectionist model combination for large vocabulary speech recognition, in Proceedings IEEE Workshop on Neural Networks for Signal Processing, 1994.
2. Cook, G. and Robinson, A., Boosting the performance of connectionist large vocabulary speech recognition, in Proceedings ICSLP 96, Philadelphia, USA, 1996.
3. Fiscus, J., A post-processing system to yield reduced word error rates: Recogniser output voting error reduction (ROVER), in Proceedings IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU 97), Santa Barbara, USA, 1997.
4. Bourlard, H. and Morgan, N., Connectionist Speech Recognition - A Hybrid Approach, Kluwer Academic Publishers, Massachusetts, USA, 1994.
5. Neto, J., Martins, C., and Almeida, L., The Development of a Speaker Independent Continuous Speech Recogniser for Portuguese, in Proceedings EUROSPEECH 97, Rhodes, Greece, 1997.
6. Neto, J., Martins, C., and Almeida, L., A Large Vocabulary Continuous Speech Recognition Hybrid System for the Portuguese Language, in Proceedings ICSLP 98, Sydney, Australia, 1998.
7. Neto, J., Martins, C., Meinedo, H., and Almeida, L., The Design of a Large Vocabulary Speech Corpus for Portuguese, in Proceedings EUROSPEECH 97, Rhodes, Greece, 1997.
8. Kingsbury, B. E., Morgan, N., and Greenberg, S., Robust speech recognition using the modulation spectrogram, Speech Communication, 25:117-132, 1998.
9. Meinedo, H. and Neto, J., The use of syllable segmentation information in continuous speech recognition hybrid systems applied to the Portuguese language, in Proceedings ICSLP 2000, Beijing, China, 2000.