Network Training for Continuous Speech Recognition

Issac John Alphonso
Institute for Signal and Information Processing
Department of Electrical and Computer Engineering
Mississippi State University


Page 1:

Network Training for Continuous Speech Recognition

• Author: Issac John Alphonso, Inst. for Signal and Info. Processing, Dept. of Electrical and Computer Eng., Mississippi State University

• Contact Information: Box 0452, Mississippi State University, Mississippi State, Mississippi 39762, Tel: 662-325-8335, Fax: 662-325-2298

• URL: isip.msstate.edu/publications/books/msstate_theses/2003/network_training/

Email: [email protected]

Page 2:

INTRODUCTION: ABSTRACT

A traditional trainer uses an expectation-maximization (EM) based supervised training framework to estimate the parameters of a speech recognition system. EM-based parameter estimation for speech recognition is performed in several complicated stages of iterative re-estimation, and these stages are prone to human error. This thesis describes a new network training paradigm that reduces the complexity of the training process while retaining the robustness of the EM-based supervised training framework. The network trainer achieves recognition performance comparable to a traditional trainer while alleviating the need for complicated systems and training recipes.

Page 3:

INTRODUCTION: ORGANIZATION

• Motivation: Why do we need a new training paradigm?

• Theoretical: Review the EM-based supervised training framework.

• Network Training: How network training differs from traditional training.

• Experiments: Verification of the approach on industry-standard databases (TIDigits, Alphadigits, and Resource Management).

[Flow diagram: Motivation → Theoretical Background → Network Training → Experiments → Conclusion & Future Work]

Page 4:

INTRODUCTION: MOTIVATION

• A traditional trainer uses an EM-based framework to estimate the parameters of a speech recognition system.

• EM-based parameter estimation is performed in several complicated stages which are prone to human error.

• A network trainer reduces the complexity of the training process by employing a soft decision criterion.

• A network trainer achieves comparable performance and retains the robustness of the EM-based framework.

Page 5:

THEORETICAL BACKGROUND: COMMUNICATION THEORETIC APPROACH

[Diagram: Message Source → Linguistic Channel → Articulatory Channel → Acoustic Channel; observables at each stage: message, words, sounds, features]

Maximum likelihood formulation for speech recognition (a toy scoring sketch follows the list below):

• P(W|A) = P(A|W) P(W) / P(A)

Objective: minimize the word error rate

Approach: maximize P(W|A) during training

Components:

• P(A|W): acoustic model (HMMs/GMMs)

• P(W): language model (statistical, FSNs, etc.)
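Since P(A) is constant across competing hypotheses, recognition only needs the numerator, and in practice the two components are combined in the log domain. A minimal sketch (not from the thesis; the hypothesis strings and all scores are invented for illustration):

```python
import math

# Hypothetical log-domain scores. In a real recognizer these come from
# the acoustic model (HMMs/GMMs) and the language model respectively.
log_p_acoustic = {"oh one": -1520.3, "oh nine": -1523.9}                 # log P(A|W)
log_p_language = {"oh one": math.log(0.02), "oh nine": math.log(0.01)}   # log P(W)

# argmax_W P(W|A) = argmax_W [ log P(A|W) + log P(W) ]; P(A) drops out.
best = max(log_p_acoustic, key=lambda w: log_p_acoustic[w] + log_p_language[w])
print(best)  # "oh one"
```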

Page 6:

THEORETICAL BACKGROUND: MAXIMUM LIKELIHOOD

• The approach treats the parameters of the model as fixed quantities whose values need to be estimated.

• The model parameters are estimated by maximizing the log likelihood of observing the training data.

• The estimation of the parameters is computationally tractable due to the availability of efficient algorithms.

log P(O | λ) = Σ_{t=1..T} log P(o_t | λ)
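As a concrete illustration of this sum, here is a small sketch that evaluates the log likelihood of a frame sequence under a single diagonal-covariance Gaussian (a simplification of the GMM emission densities used in practice; the data and dimensions are arbitrary):

```python
import numpy as np

def log_likelihood(frames, mean, var):
    """Sum over frames of log N(o_t; mean, var) for a diagonal Gaussian."""
    diff = frames - mean
    per_frame = -0.5 * (np.log(2 * np.pi * var) + diff ** 2 / var).sum(axis=1)
    return per_frame.sum()

rng = np.random.default_rng(0)
frames = rng.normal(size=(100, 13))   # 100 frames of 13-dim features (toy data)
print(log_likelihood(frames, np.zeros(13), np.ones(13)))
```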

Page 7:

THEORETICAL BACKGROUND: EXPECTATION MAXIMIZATION

• A general framework that can be used to determine the maximum likelihood estimates of the model parameters.

• The algorithm iteratively estimates the likelihood of the model by maximizing Baum’s auxiliary function.

• The expectation maximization algorithm is guaranteed to converge to the maximum likelihood estimate.

Q(λ, λ′) = Σ_q P(O, q | λ) log P(O, q | λ′)
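A minimal sketch of one EM iteration for a one-dimensional Gaussian mixture, standing in for the GMM emission densities (all data and initial values are made up). The E-step computes component posteriors; the M-step re-estimates parameters, which maximizes the auxiliary function Q above:

```python
import numpy as np

def em_step(x, weights, means, variances):
    # E-step: responsibilities gamma[n, k] = P(component k | x_n, lambda)
    log_pdf = -0.5 * (np.log(2 * np.pi * variances)
                      + (x[:, None] - means) ** 2 / variances)
    gamma = weights * np.exp(log_pdf)
    gamma /= gamma.sum(axis=1, keepdims=True)
    # M-step: re-estimate parameters from expected counts (maximizes Q)
    n_k = gamma.sum(axis=0)
    means = (gamma * x[:, None]).sum(axis=0) / n_k
    variances = (gamma * (x[:, None] - means) ** 2).sum(axis=0) / n_k
    return n_k / len(x), means, variances

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1, 200)])
params = (np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0]))
for _ in range(20):
    params = em_step(x, *params)
print(params[1])  # means converge toward the true values (-2, 3)
```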

Page 8:

THEORETICAL BACKGROUND: HIDDEN MARKOV MODELS

• A random process that consists of a set of states and their corresponding transition probabilities:

• The prior probabilities: π_j = P(q_0 = j)

• The state transition probabilities: a_ij = P(q_{t+1} = j | q_t = i)

• The state emission probabilities: b_j(O_t) = P(O_t | q_t = j)
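To make the notation concrete, here is a small sketch of the forward algorithm, which computes P(O | λ) from exactly these three parameter sets (toy discrete emissions; all numbers are invented):

```python
import numpy as np

def forward(pi, A, B, obs):
    """Forward algorithm: P(O | lambda) for a discrete-output HMM.
    pi[j]   = P(q_0 = j)                 (prior probabilities)
    A[i, j] = P(q_{t+1} = j | q_t = i)   (transition probabilities)
    B[j, o] = P(O_t = o | q_t = j)       (emission probabilities)"""
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
    return alpha.sum()

pi = np.array([1.0, 0.0])
A = np.array([[0.7, 0.3],
              [0.0, 1.0]])              # left-to-right topology
B = np.array([[0.9, 0.1],
              [0.2, 0.8]])
print(forward(pi, A, B, obs=[0, 0, 1, 1]))  # likelihood of the sequence
```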

Page 9:

NETWORK TRAINER: TRAINING RECIPE

• The flat start stage segments the acoustic signal and seeds the speech and non-speech models.

• The context-independent stage inserts an optional silence model between words.

• The state-tying stage clusters the model parameters via linguistic rules to compensate for sparse training data.

• The context-dependent stage is similar to the context-independent stage, except that words are modeled using phonetic context. (A stub driver sketch follows the diagram below.)

[Diagram: Flat Start → CI Training → State Tying → CD Training; the first two stages are context-independent, the last two context-dependent]
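The recipe reduces to a short driver loop. The sketch below uses purely illustrative stubs, not ISIP's actual tools; every function name and return value here is invented:

```python
def flat_start(corpus):
    # Segment the acoustic signal uniformly and seed speech/non-speech models.
    return {"stage": "flat start"}

def reestimate(models, corpus, passes=4):
    # Each pass is one EM re-estimation (Baum-Welch) over the corpus.
    for _ in range(passes):
        pass  # accumulate sufficient statistics, then update parameters
    return models

def tie_states(models):
    # Cluster model parameters via linguistic rules (sparse-data compensation).
    return models

def train(corpus):
    models = flat_start(corpus)           # stage 1: flat start
    models = reestimate(models, corpus)   # stage 2: CI training
    models = tie_states(models)           # stage 3: state tying
    models = reestimate(models, corpus)   # stage 4: CD training
    return models

train(corpus=["utt-001", "utt-002"])
```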

Page 10:

NETWORK TRAINER: TRANSCRIPTIONS

[Diagram, for the word “have”: Traditional Trainer (phone level): sil hh ae v sil; Network Trainer (word level): SILENCE HAVE SILENCE]

• The network trainer uses word-level transcriptions, which do not impose restrictions on the word pronunciation.

• The traditional trainer uses phone-level transcriptions, which commit to the canonical pronunciation of each word.

• Using orthographic transcriptions removes the need to deal directly with phonetic contexts during training (a toy expansion sketch follows).
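A sketch of the expansion a network trainer can perform internally: a word-level transcription goes in, a phone-level training network comes out. The lexicon entry and silence markers are invented for illustration; "(sil)" marks the optional inter-word silence discussed on the following slides:

```python
# Hypothetical one-entry lexicon; real systems read a pronunciation dictionary.
lexicon = {"have": ["hh", "ae", "v"]}

def expand(words):
    net = ["sil"]                          # fixed silence at utterance start
    for i, word in enumerate(words):
        net.extend(lexicon[word])          # pronunciation phones for the word
        # optional silence between words, fixed silence at the utterance end
        net.append("(sil)" if i < len(words) - 1 else "sil")
    return net

print(expand(["have"]))  # ['sil', 'hh', 'ae', 'v', 'sil']
```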

Page 11:

NETWORK TRAINER: SILENCE MODELS

[Diagram: multi-path silence topology vs. single-path silence topology]

• The multi-path silence model is used between words.

• The single-path silence model is used at utterance ends (both topologies are sketched below).
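As one plausible reading of the two topologies, the transition matrices for two 3-state silence models might look like the following. All probabilities here are invented; only the zero/non-zero structure matters: the single-path model forces every path through all states (a minimum duration), while the skip transition in the multi-path model admits several paths, so very short inter-word silences remain representable.

```python
import numpy as np

# Single-path: strict left-to-right, every path visits all three states.
single_path = np.array([[0.6, 0.4, 0.0],
                        [0.0, 0.6, 0.4],
                        [0.0, 0.0, 1.0]])

# Multi-path: the state-0 -> state-2 skip creates a second, shorter path.
multi_path = np.array([[0.5, 0.3, 0.2],
                       [0.0, 0.6, 0.4],
                       [0.0, 0.0, 1.0]])

# Each row is a probability distribution over next states.
assert np.allclose(single_path.sum(axis=1), 1.0)
assert np.allclose(multi_path.sum(axis=1), 1.0)
```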

Page 12:

NETWORK TRAINER: DURATION MODELING

• The network trainer uses a silence word which precludes the need for inserting it into the phonetic pronunciation.

• The traditional trainer deals with silence between words by explicitly specifying it in the phonetic pronunciation.

[Diagram: silence handling in the network trainer vs. the traditional trainer]

Page 13:

NETWORK TRAINER: PRONUNCIATION MODELING

• A pronunciation network precludes the need to use a single canonical pronunciation for each word.

• The pronunciation network has the added advantage of being able to generalize to unseen pronunciations.

[Diagram: pronunciation network (network trainer) vs. single canonical pronunciation (traditional trainer)]

Page 14:

NETWORK TRAINER: OPTIONAL SILENCE MODELING

• The network trainer uses a fixed silence at utterance bounds and an optional silence between words.

• We use a fixed silence at utterance bounds to avoid an underestimated silence model.

Page 15:

NETWORK TRAINER: SILENCE DURATION MODELING

• Network training uses a single-path silence at utterance bounds and a multi-path silence between words.

• We use a single-path silence at utterance bounds to avoid uncertainty in modeling silence.

Page 16:

EXPERIMENTS: SPEECH DATABASES

[Chart: word error rate (0%–40%) vs. level of difficulty, spanning digits, continuous digits, command and control, letters and numbers, read speech, broadcast news, and conversational speech]

Page 17:

EXPERIMENTS: TIDIGITS DATABASE

• Collected by Texas Instruments in 1983 to establish a common baseline for connected digit recognition tasks.

• Includes digits from ‘zero’ through ‘nine’ and ‘oh’ (an alternative pronunciation for ‘zero’).

• The corpus consists of 326 speakers (111 men, 114 women, and 101 children).

Page 18:

EXPERIMENTS: TIDIGITS WER COMPARISON

Stage                 WER    Insertion Rate   Deletion Rate   Substitution Rate
Traditional Trainer   7.7%   0.1%             2.5%            5.0%
Network Trainer       7.6%   0.1%             2.4%            5.0%

• The network trainer achieves comparable performance to the traditional trainer.

• The network trainer's word error rate converges to that of the traditional trainer. (A minimal WER computation sketch follows.)
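For reference, the WER in these tables combines the insertion, deletion, and substitution rates obtained from a Levenshtein alignment of each hypothesis against the reference transcription (up to rounding). A minimal sketch, with invented example strings:

```python
def wer(ref, hyp):
    """Word error rate = (substitutions + insertions + deletions) / len(ref),
    computed with the standard Levenshtein (edit-distance) alignment."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                       # all deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                       # all insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("oh one two three".split(), "oh one to three".split()))  # 0.25
```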

Page 19:

EXPERIMENTS: TIDIGITS LIKELIHOOD COMPARISON

[Plot: average log likelihood vs. iterations; dashed line: Network Trainer, solid line: Traditional Trainer]

Page 20:

EXPERIMENTS: ALPHADIGITS (AD) DATABASE

• Collected by the Oregon Graduate Institute (OGI) using the CSLU T1 data collection system.

• Includes letters (‘a’ through ‘z’) and numbers (‘zero’ through ‘nine’ and ‘oh’).

• The database consists of 2,983 speakers (1,419 men, 1,533 women, and 30 children).

Page 21:

EXPERIMENTS: AD WER COMPARISON

• The network trainer achieves comparable performance to the traditional trainer.

• The network trainer's word error rate converges to that of the traditional trainer.

Stage                 WER     Insertion Rate   Deletion Rate   Substitution Rate
Traditional Trainer   38.0%   0.8%             3.0%            34.2%
Network Trainer       35.3%   0.8%             2.2%            34.2%

Page 22:

EXPERIMENTS: AD LIKELIHOOD COMPARISON

[Plot: average log likelihood vs. iterations; dashed line: Network Trainer, solid line: Traditional Trainer]

Page 23:

EXPERIMENTS: RESOURCE MANAGEMENT (RM) DATABASE

• Collected by the Defense Advanced Research Projects Agency (DARPA).

• Includes a collection of spoken sentences pertaining to a naval resource management task.

• The database consists of 80 speakers, each reading two ‘dialect’ sentences plus 40 sentences from the RM text corpus.

Page 24:

EXPERIMENTS: RM WER COMPARISON

• The network trainer achieves comparable performance to the traditional trainer.

• It is important to note that the 1.8% degradation in performance is not significant (MAPSSWE test).

Stage                 WER     Insertion Rate   Deletion Rate   Substitution Rate
Traditional Trainer   25.7%   1.9%             6.7%            17.1%
Network Trainer       27.5%   2.6%             7.1%            17.9%

Page 25:

EXPERIMENTS: RM LIKELIHOOD COMPARISON

[Plot: average log likelihood vs. iterations; dashed line: Network Trainer, solid line: Traditional Trainer]

Page 26:

CONCLUSIONS: SUMMARY

• Explored the effectiveness of a novel training recipe in the re-estimation process for speech recognition.

• Analyzed performance on three databases.

• For TIDigits, at 7.6% WER, the network trainer outperformed the traditional trainer by about 0.1%.

• For OGI Alphadigits, at 35.3% WER, the network trainer outperformed the traditional trainer by about 2.7%.

• For Resource Management, at 27.5% WER, the network trainer's performance degraded by about 1.8% (not statistically significant).

Page 27:

CONCLUSIONS: FUTURE WORK

• The results presented use single-mixture context-dependent models for training and recognition.

• An efficient tree-based decoder is currently under development, and context-dependent results are planned.

• The databases presented all use a single pronunciation for each word in the lexicon.

• The ability to run large databases like Switchboard, which has multiple pronunciations, requires a tree-based decoder.

Page 28:

APPENDIX: PROGRAM OF STUDY

Course No.   Title                                 Semester
CS 8990      Probabilistic Expert Systems          Spring 2000
ST 8253      Linear Regression                     Fall 2000
ECE 8990     Pattern Recognition                   Spring 2001
ECE 8990     Information Theory                    Spring 2001
CS 8990      Reinforcement Learning                Fall 2001
CS 8663      Neural Computing                      Fall 2001
ECE 8990     Random Signals and Systems            Fall 2001
ECE 8990     Fundamentals of Speech Recognition    Spring 2002
ECE 8000     Research/Thesis

Page 29:

APPENDIX: ACKNOWLEDGEMENTS

• I would like to thank Dr. Joe Picone for his mentoring and guidance throughout my graduate program.

• I would also like to thank Jon Hamaker for his valuable suggestions throughout my thesis.

• Finally, I would like to thank my co-workers at the Institute for Signal and Information Processing (ISIP) for all their help.
