
7/27/2019 NeannMathaiLiteratureSurvey.pdf


A Literature Survey of Speech Recognition and Hidden Markov Models

Neann Mathai

University of Cape Town

Computer Science Department

[email protected]

Abstract

This article highlights Hidden Markov Models (HMMs) and their use in Automatic Speech Recognition (ASR) systems. Hidden Markov Models are the most widely used speech recognition technique. Speech recognition is a field that draws on many different disciplines. As more research is done in this field, the range of possible applications of speech recognition increases; these include military as well as health care uses.

This paper presents a high-level description of the basic speech recognition categories. It then goes on to discuss Hidden Markov Models and their use in ASR. Finally, the paper looks briefly at some ways in which speech recognition technology may be applied.

Keywords: Automatic Speech Recognition (ASR), Hidden Markov Models

1. Introduction

Automatic speech recognition (ASR) has been researched for more than four decades [5]. In spite of the extensive amount of time spent in this research area, the ultimate goal of having speech recognition done by machine is still far from reality [1, 5]. One of the reasons that this research area has proven to be so difficult is that it is interdisciplinary in nature; that is, the following disciplines have to be applied [5]:

- Signal processing
- Physics: acoustics
- Pattern recognition
- Communication and information theory
- Linguistics
- Physiology
- Computer science
- Psychology

Real-world processes generate observable signals. In the case of speech these signals are continuous and non-stationary (their properties vary with time) [6,7].

When a person speaks, the articulatory apparatus is used. It consists of the lips, jaw, tongue, and velum, as can be seen in figure 1 [3]. The articulatory apparatus modulates the air pressure and flow, which produces a sequence of sounds [3].

Figure 1: The Articulatory Apparatus

In order to perform digital processing on an analogue speech signal, discrete-time sampling and quantization of the waveform must be done [3]. Normally sampling is done at a rate of 8-20 kilohertz (kHz). The amplitude of each sample is then represented by one of 2^16 values, which is a 16-bit quantization of the time signal [3].
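As a rough illustration of the sampling and quantization step described above, the following Python sketch samples a synthetic sine tone (a stand-in for a speech waveform) at 16 kHz and maps each sample to one of 2^16 signed integer levels. The tone frequency, the duration, and the function names are assumptions made for this example only.

```python
import math

SAMPLE_RATE_HZ = 16_000   # assumed value inside the 8-20 kHz range quoted above
BITS = 16                 # 16-bit quantization gives 2**16 levels


def sample_tone(freq_hz, duration_s, rate_hz=SAMPLE_RATE_HZ):
    """Discrete-time sampling of a sine wave with amplitude in [-1, 1]."""
    n_samples = int(round(duration_s * rate_hz))
    return [math.sin(2.0 * math.pi * freq_hz * n / rate_hz)
            for n in range(n_samples)]


def quantize(samples, bits=BITS):
    """Map each amplitude in [-1, 1] to a signed integer with 2**bits levels."""
    max_level = 2 ** (bits - 1) - 1      # 32767 for 16 bits
    min_level = -(2 ** (bits - 1))       # -32768 for 16 bits
    quantized = []
    for x in samples:
        level = int(round(x * max_level))
        quantized.append(max(min_level, min(max_level, level)))
    return quantized


# 10 ms of a 440 Hz tone (hypothetical test signal, not real speech).
samples = sample_tone(freq_hz=440.0, duration_s=0.01)
pcm = quantize(samples)
```

A higher sampling rate captures higher frequencies, while more bits per sample reduce the quantization error; both increase the amount of data to be processed.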

This paper will focus on Hidden Markov Models for the following reasons:

- The HMM has a mathematical structure that can be studied and analyzed.
- An HMM is relatively easy to train from a given set of training data (this is also known as an observation sequence) [3].
- HMMs for speech signals are relatively easy to implement [3].

Because of these three points, HMMs have become the norm in speech recognition research.

2. ASR Approaches

2.1. Overview

There are three approaches to ASR [5]:

- Acoustic-phonetic approach
- Statistical pattern recognition
- Artificial intelligence approach

In the basic acoustic-phonetic approach, the machine decodes the speech signal in a systematic manner by mapping the observed acoustic features of the signal to known acoustic features and phonetic symbols. This approach is based on the theory of acoustic phonetics, which suggests that in spoken language there are finite and distinctive phonetic units. These phonetic units are each characterized by a set of properties that can be seen in the speech signal.

Pattern recognition techniques, such as the Hidden Markov Model technique, are the most popular of the speech recognition techniques [1,12]. The reasons for this popularity of pattern recognition are as follows [5]:

- Simplicity of use
- Robustness and invariance to different speech vocabularies, users, feature sets, pattern comparison algorithms and decision rules
- Proven high performance

The artificial intelligence approach is a combination of the acoustic-phonetic approach and the pattern recognition approach. This approach automates the recognition procedure according to the way in which people apply their intelligence in analysis, and then makes a decision based on the different measured features.

Artificial Neural Networks (NNs) may also be used as a means of speech recognition. This may be thought of as a separate approach, or it may fit into any of the above three approaches. An NN is a model that attempts to mimic the human brain [4]. NNs are very versatile and have solved problems in various fields such as computer vision, process control, and medical diagnostics. Unfortunately, NNs have a long training time. A single NN consists of many neurons. These neurons are also called perceptrons or nodes. Each neuron has a set of inputs; it then performs a computation and finally produces a single output. The neurons are connected to each other by means of weighted paths.
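The neuron described above (a set of inputs, a computation, a single output) can be sketched in a few lines of Python. The weighted sum followed by a sigmoid is one common choice and is an assumption here; the source does not say which computation is used, and a classical perceptron would use a hard threshold instead. The function name and the numbers are illustrative only.

```python
import math


def neuron_output(inputs, weights, bias):
    """Weighted sum of the inputs followed by a sigmoid activation."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    activation = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-activation))


# Three inputs arriving over three weighted paths (hypothetical values).
y = neuron_output(inputs=[0.5, -1.0, 2.0], weights=[0.4, 0.3, 0.1], bias=0.0)
```

Training a network amounts to adjusting the weights (and biases) of many such neurons, which is where the long training time comes from.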

The rest of this literature survey will discuss Hidden Markov Models and their use in speech recognition.

2.2. Hidden Markov Models

2.2.1 What is a Hidden Markov Model and how is it Structured?

Hidden Markov Models are used because speech sounds differ in character. Some sounds are sustained, such as vowels, while other sounds are ephemeral, such as consonants. HMMs are therefore ideal for this situation, as they are probabilistic and can thus represent both the stationary and the transient properties of the sound signal.

The definition of an HMM is [7]: an HMM is a stochastic process that cannot be observed itself (it is hidden), but the process is monitored by another set of processes that produce a sequence of observed symbols.

Generally, an HMM is made up of states, transitions (from one state to the next), and observations. It is a form of finite state machine.
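To make the idea of a hidden process observed only through emitted symbols concrete, the sketch below generates a sequence from a small discrete HMM. The names pi, A and B (initial, transition and emission probabilities) follow common HMM notation but are assumptions of this example, as are the two-state model and all of its numbers.

```python
import random


def draw(probabilities, rng):
    """Draw an index according to a discrete probability distribution."""
    r = rng.random()
    cumulative = 0.0
    for index, p in enumerate(probabilities):
        cumulative += p
        if r < cumulative:
            return index
    return len(probabilities) - 1   # guard against rounding error


def generate(pi, A, B, length, rng):
    """Return (hidden_states, observations) sampled from the HMM (pi, A, B)."""
    states, observations = [], []
    state = draw(pi, rng)
    for _ in range(length):
        states.append(state)
        observations.append(draw(B[state], rng))   # emit a symbol
        state = draw(A[state], rng)                # move to the next state
    return states, observations


# Hypothetical two-state model over three observation symbols.
pi = [0.6, 0.4]                           # initial state probabilities
A = [[0.7, 0.3], [0.4, 0.6]]              # A[i][j] = P(next state j | state i)
B = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]    # B[i][k] = P(symbol k | state i)

hidden, observed = generate(pi, A, B, length=10, rng=random.Random(0))
```

A recognizer only ever sees `observed`; the list `hidden` is what the algorithms discussed in this section have to reason about.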

Once a model has been formulated, three problems have to be solved [1,3,5,6,7,16]:

1. Evaluation:
   There is a model λ that can be used.
   There is a testing observation sequence Ot.
   Find P(Ot | λ); this is the probability that Ot was produced by this particular model.
   Solving this problem allows one to choose among competing models.
   The forward-backward algorithm is used to solve this [7].

2. Decoding:
   There is a model λ that can be used.
   There is a testing observation sequence Ot.
   Find the most likely state path given λ and Ot. This is denoted as Q = q1 ... qt.
   This problem is solved with respect to an optimality criterion.
   The Viterbi algorithm is used to solve this [7].

3. Training:
   There is a model λ that can be used.
   There is a training observation sequence Ot.
   How can the model parameters be adjusted so as to maximize P(Ot | λ)?
   The Baum-Welch algorithm is used to solve this [7].
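The first two problems can be sketched compactly for a discrete HMM. The code below is a minimal illustration, not the formulation of [7]: it assumes the model is given as lists pi, A and B (initial, transition and emission probabilities), uses only the forward half of the forward-backward procedure (which is sufficient for evaluation), and omits the scaling needed for long sequences. Training with Baum-Welch is not shown. All names and numbers are assumptions of this example.

```python
def forward_probability(pi, A, B, obs):
    """Evaluation: P(obs | model), computed with the forward pass."""
    n = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    for symbol in obs[1:]:
        alpha = [
            sum(alpha[i] * A[i][j] for i in range(n)) * B[j][symbol]
            for j in range(n)
        ]
    return sum(alpha)


def viterbi(pi, A, B, obs):
    """Decoding: most likely state path and its joint probability."""
    n = len(pi)
    delta = [pi[i] * B[i][obs[0]] for i in range(n)]
    backpointers = []
    for symbol in obs[1:]:
        step, new_delta = [], []
        for j in range(n):
            # Best predecessor of state j at this time step.
            best_i = max(range(n), key=lambda i: delta[i] * A[i][j])
            step.append(best_i)
            new_delta.append(delta[best_i] * A[best_i][j] * B[j][symbol])
        backpointers.append(step)
        delta = new_delta
    last = max(range(n), key=lambda i: delta[i])
    path = [last]
    for step in reversed(backpointers):   # trace the best path backwards
        path.append(step[path[-1]])
    path.reverse()
    return path, delta[last]


# Hypothetical two-state model over three observation symbols.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]
obs = [0, 1, 2]

likelihood = forward_probability(pi, A, B, obs)
best_path, best_path_probability = viterbi(pi, A, B, obs)
```

The forward pass sums over all state paths, whereas Viterbi keeps only the best one, so the Viterbi path probability can never exceed the forward likelihood.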

An illustration of the general HMM discussed above can be seen in figure 2 [7]. In this type of model, any state can be revisited, and the revisits do not need to take place at specific time intervals.

Figure 2: A General Hidden Markov Model

Alternatively, a non-ergodic model is shown in figure 3 [7]. This model is known as a left-to-right or Bakis [6] model; it imposes a temporal order on the HMM, as transitions can only move from left to right.

Figure 3: A non-ergodic HMM
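In terms of the transition probabilities, the left-to-right constraint means that every entry below the diagonal of the transition matrix is zero, whereas a general model in which any state can be revisited has no such restriction. The sketch below builds and checks such a matrix; the helper names, the self-transition probability and the maximum jump of two states are assumptions chosen for illustration.

```python
def is_left_to_right(A, tol=1e-12):
    """True if no transition moves to a lower-numbered state."""
    return all(A[i][j] <= tol for i in range(len(A)) for j in range(i))


def bakis_transitions(n_states, stay=0.5, max_jump=2):
    """Build a left-to-right matrix: stay with probability `stay`, otherwise
    spread the remaining mass evenly over up to `max_jump` states to the right."""
    A = [[0.0] * n_states for _ in range(n_states)]
    for i in range(n_states):
        targets = list(range(i + 1, min(i + max_jump, n_states - 1) + 1))
        if not targets:            # final state: it can only be repeated
            A[i][i] = 1.0
            continue
        A[i][i] = stay
        for j in targets:
            A[i][j] = (1.0 - stay) / len(targets)
    return A


A_bakis = bakis_transitions(4)
A_general = [[0.25] * 4 for _ in range(4)]   # every state reachable from every state
```

Because time in speech only moves forward, the zero entries of the left-to-right form encode the temporal order of the sounds within a word.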


3/6

3

2.2.2 HMMs as applied to Single Word Recognition

An example of how isolated word recognition can be performed is as follows [7]:

- A vocabulary of V words needs to be recognized.
- Each word has a training set of L tokens.
- Using the L tokens for each of the V words, estimate the optimum parameters for each word and build an HMM for each word.
- For a single test word there is now an observation sequence O. The probability Pv = P(O | λv) is calculated for each of the V word models λv.
- The word whose model probability is the highest is chosen, i.e. v* = argmax[Pv], with the maximum taken over v = 1, ..., V.
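The final two steps of the procedure above (score the observation sequence against every word model, then pick the best) might look as follows. This is a sketch under stated assumptions: the word models are taken as already trained, each is a tuple (pi, A, B) for a discrete HMM, the likelihood is computed in the log domain with a scaled forward pass, and the two-word vocabulary and its parameters are invented for the example.

```python
import math


def log_likelihood(model, obs):
    """log P(obs | model), computed with a scaled forward pass."""
    pi, A, B = model
    n = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    log_prob = 0.0
    for t in range(len(obs)):
        if t > 0:
            alpha = [
                sum(alpha[i] * A[i][j] for i in range(n)) * B[j][obs[t]]
                for j in range(n)
            ]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")             # the model cannot produce obs
        log_prob += math.log(scale)
        alpha = [a / scale for a in alpha]   # rescale to avoid underflow
    return log_prob


def recognize(word_models, obs):
    """Return the word v* whose model gives obs the highest likelihood."""
    return max(word_models,
               key=lambda word: log_likelihood(word_models[word], obs))


# Hypothetical two-word vocabulary over two observation symbols (0 and 1).
word_models = {
    "yes": ([1.0, 0.0], [[0.6, 0.4], [0.0, 1.0]], [[0.9, 0.1], [0.2, 0.8]]),
    "no":  ([1.0, 0.0], [[0.6, 0.4], [0.0, 1.0]], [[0.1, 0.9], [0.8, 0.2]]),
}

best_word = recognize(word_models, [0, 0, 1, 1])
```

With these toy parameters the sequence [0, 0, 1, 1] is assigned to "yes" and the mirrored sequence [1, 1, 0, 0] to "no". Working with log probabilities does not change the argmax but avoids numerical underflow on long utterances.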




2.2.3 Examples of Speech Recognition Systems that use HMM

The SPHINX-II Speech Recognition System

SPHINX was the first accurate large-vocabulary, continuous, speaker-independent speech recognition system [15].

For feature extraction SPHINX-II uses four codebooks [15]. Each codebook contains 256 entries, and the four codebooks are as follows [15]:

1. 12 LPC cepstrum coefficients.
2. 12 40-msec and 12 80-msec differenced LPC cepstrum coefficients.
3. 12 second-order differenced cepstrum coefficients.
4. Power, 40-msec differenced power, and second-order differenced power.

The SPHINX-II system also normalizes the data from different speakers to a golden speaker cluster [15]. The golden speaker cluster is the cluster that has the maximum number of speakers. This was done using neural networks [15]. Two golden clusters were created, one male and one female. Other, smaller clusters were mapped to the golden speaker clusters using the created codeword-dependent neural network.

SPHINX-II does a lot of parameter sharing when building models and uses semi-continuous hidden Markov models (SCHMMs) and senones. SCHMMs incorporate quantization accuracy into the HMM. They estimate the discrete output probabilities by considering different codeword candidates in the vector quantization. The use of SCHMMs also requires less training data than discrete HMMs, and thus for a given data set more models can be produced. A senone is a state in a phonetic HMM. Senones are constructed by clustering the state-dependent output distributions from the different phonetic models. Senones allow for better parameter sharing and improved pronunciation optimization. Figure 5 shows a summary of the SPHINX-II system.

Figure 5: A summary of the SPHINX-II System
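The role of a codebook in vector quantization can be illustrated with a toy example. A discrete HMM replaces each feature vector by the index of its single nearest codebook entry, whereas the description above indicates that an SCHMM considers several codeword candidates. The sketch below only shows how such candidates could be selected by distance; how SPHINX-II actually weights them is not covered here, and the function name, the squared Euclidean distance and the four-entry codebook are assumptions.

```python
def nearest_codewords(vector, codebook, top_n=1):
    """Indices of the `top_n` codebook entries closest to `vector`
    (squared Euclidean distance), nearest first."""
    def sq_dist(entry):
        return sum((a - b) ** 2 for a, b in zip(vector, entry))
    ranked = sorted(range(len(codebook)), key=lambda k: sq_dist(codebook[k]))
    return ranked[:top_n]


# Toy codebook with 4 two-dimensional entries
# (each SPHINX-II codebook has 256 entries).
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
feature = [0.9, 0.2]

discrete_symbol = nearest_codewords(feature, codebook)[0]   # single best codeword
candidates = nearest_codewords(feature, codebook, top_n=2)  # several candidates
```

Keeping more than one candidate retains some information about how close the feature vector was to neighbouring codewords, which a hard assignment to a single index discards.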

The HTK Tied-State Continuous Speech Recognizer

Like the SPHINX-II system, HTK has a parameter generalization mechanism [14]. This allows a balance to be struck between model complexity and parameter estimation accuracy. Tied state refers to the fact that the state distributions for corresponding states in allophones of the same phone are tied together. This ensures that there is enough training data across all distributions. This system was trained in a number of stages, and optimized at each stage. Again, like the SPHINX-II system, HTK also used gender modeling and created different HMMs for male and female voices.

3. Applications for Speech Recognition

There are numerous applications for ASR, and a few military and health applications will be discussed here briefly.

1. Military Uses [13]

Command and Control on the Move (C2OTM) is an American army project that aims to keep command and control entities mobile along with mobile troops in a war zone. Figure 6 [13] shows some of the mobile force elements that require C2OTM.

Figure 6: C2OTM force elements and examples of human-machine communication by voice

One example of how speech recognition will be used in this project is that a foot soldier's spoken account of what is being observed can be used to assess the battlefield situation and to aid in weapons system selection. Another instance of voice recognition in this application is that in-field repair and maintenance can be aided by voice access to information, with a helmet-mounted display to show that information.

The American Navy Personnel Research and Development Center has proposed creating a combat team tactical training application. This is illustrated in figure 7 [13]. The aim of this project is to have personnel respond to ongoing combat simulations using voice, typing, trackballs, and other modes so as to communicate both with the machine and with each other.

Figure 7: Combat team tactical training system concept and applications of speech-based technology

An air force application being investigated by the United Kingdom's Defence Research Agency is one that will recognize pilots' voices and allow them to enter reconnaissance reports. A simpler application being researched for the air force is voice control of radio frequencies, displays and gauges in order to increase mission effectiveness and the safety of the pilots.

Figure 8 [13] shows a matrix of the classes of different voice applications against the interest of various military and government end users.

Figure 8: Classes of different voice applications

2. Health Care Uses

There are many more applications of speech recognition in society outside of the military. One such application, in the health care industry, is the use of speech recognition in automatic medical transcription [8,11]. This avenue is being seriously considered as it may prove to be more cost effective, with projected savings of $230 a week in 1998.

Conclusions

This paper has looked at speech recognition and how it can be approached using Hidden Markov Models. HMMs are widely used for speech recognition because they are relatively easy to use and because they have a mathematical structure that can be analyzed.

The overall speech recognition process involves extracting features from the signal, performing a vector quantization on these features, and then using the HMMs to recognize a particular utterance based on probabilities.

Finally, the paper touched on a few application areas that are opening up to speech recognition as systems keep getting better.

References

[1] Abdulla Waleed, and Kasabov Nikola. The Concepts of Hidden Markov Models in Speech Recognition. Technical Report, Information Science Department, University of Otago, 1999.

[2] Bouchaffra D, and Tan L J. Structural Hidden Markov Models Using a Relation of Equivalence: Application to Automotive Designs. In Data Mining and Knowledge Discovery 12 79-96. 2006.

[3] Juang B H, and Rabiner L R. Hidden Markov Models for Speech Recognition. In Technometrics Volume 33 No. 3 251-272. August 1991.

[4] Klevans Richard L. and Rodman Robert D. Voice Recognition. Artech House Publishers, Norwood, Massachusetts, 1997.

[5] Rabiner Lawrence and Juang Biing-Hwang. Fundamentals of Speech Recognition. Prentice Hall, New Jersey, 1993.

[6] Rabiner Lawrence R. A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. In Proceedings of the IEEE Volume 77 No. 2 257-286. February 1989.

[7] Rabiner L R, Juang B H. An Introduction to Hidden Markov Models. In IEEE ASSP Magazine 4-16. January 1986.

[8] Ragupathi W. Health Care Information Systems. In Communications of the ACM Volume 40 Number 8 81-82. August 1997.

[9] Rao Y N. Speech Recognition Using Hidden Markov Models: A Project Description. http://www.cnel.ufl.edu/~yadu/report1.html - accessed on the 29th April at 15:44

[10] Richardson M, Bilmes J, and Diorio C. Hidden-Articulator Markov Models for Speech Recognition. In Speech Communication Volume 41 No. 3 511-532. 2003.

[11] Rosenthal D. et al. Computers in Radiology. In American Journal of Roentgenology 170 (1) 23-25. January 1998.

[12] Weber K, Ikbal A, Bengio S, and Bourlard H. Robust speech recognition and feature extraction using HMM2. In Computer Speech and Language Volume 17 195-211. 17 February 2003.

[13] Weinstein C.J. Military and government applications of human-machine communication by voice. In Proceedings of the Natl. Acad. Sci. USA. Volume 92 10011-10016. October 1995.

[14] Woodland P.C., and Young S.J. The HTK Tied-State Continuous Speech Recognizer. In Proceedings of Eurospeech. 1993.

[15] Huang Xuedong et al. The SPHINX-II Speech Recognition System: An Overview. In Computer Speech and Language Volume 7 137-148. 1992.

[16] A Tutorial on Hidden Markov Models. http://www.cs.brown.edu/research/ai/dynamics/tutorial/Documents/HiddenMarkovModels.html - accessed on the 29th April 2009 at 15:23