
Page 1

A Sparse Modeling Approach to Speech Recognition Using Kernel Machines

Jon Hamaker
hamaker@isip.msstate.edu

Institute for Signal and Information Processing
Mississippi State University

Page 2

Abstract

Statistical techniques based on Hidden Markov models (HMMs) with Gaussian emission densities have dominated the signal processing and pattern recognition literature for the past 20 years. However, HMMs suffer from an inability to learn discriminative information and are prone to over-fitting and over-parameterization. Recent work in machine learning has focused on models, such as the support vector machine (SVM), that automatically control generalization and parameterization as part of the overall optimization process. SVMs have been shown to provide significant improvements in performance on small pattern recognition tasks compared to a number of conventional approaches. SVMs, however, require ad hoc (and unreliable) methods to couple them to probabilistic learning machines. Probabilistic Bayesian learning machines, such as the relevance vector machine (RVM), are fairly new approaches that attempt to overcome the deficiencies of SVMs by explicitly accounting for sparsity and statistics in their formulation.

In this presentation, we describe both of these modeling approaches in brief. We then describe our work to integrate them as acoustic models in large vocabulary speech recognition systems. Particular attention is given to algorithms for training these learning machines on large corpora. In each case, we find that both SVM- and RVM-based systems perform better than Gaussian mixture-based HMMs in open-loop recognition. We further show that the RVM-based solution performs on par with the SVM system using an order of magnitude fewer parameters. We conclude with a discussion of the remaining hurdles for providing this technology in a form amenable to current state-of-the-art recognizers.

Page 3

Bio

Jon Hamaker is a Ph.D. candidate in the Department of Electrical and Computer Engineering at Mississippi State University under the supervision of Dr. Joe Picone. He has been a senior member of the Institute for Signal and Information Processing (ISIP) at MSU since 1996. Mr. Hamaker's research work has revolved around automatic structural analysis and optimization methods for acoustic modeling in speech recognition systems. His most recent work has been in the application of kernel machines as replacements for the underlying Gaussian distribution in hidden Markov acoustic models. His dissertation work compares the popular support vector machine with the relatively new relevance vector machine in the context of a speech recognition system. Mr. Hamaker has co-authored 4 journal papers (2 under review), 22 conference papers, and 3 invited presentations during his graduate studies at MS State (http://www.isip.msstate.edu/publications). He also spent two summers as an intern at Microsoft in the recognition engine group.

Page 4

Outline

• The acoustic modeling problem for speech
• Current state-of-the-art
• Discriminative approaches
• Structural optimization and Occam's Razor
• Support vector classifiers
• Relevance vector classifiers
• Coupling vector machines to ASR systems
• Scaling relevance vector methods to "real" problems
• Extensions of this work

Page 5

ASR Problem

• Front-end maintains information important for modeling in a reduced parameter set
• Language model typically predicts a small set of next words based on knowledge of a finite number of previous words (N-grams)
• Search engine uses knowledge sources and models to choose amongst competing hypotheses

[Block diagram: Input Speech → Acoustic Front-End → Search → Recognized Utterance, with Statistical Acoustic Models p(A|W) and Language Model p(W) as knowledge sources. The acoustic models are the focus of this work.]

Page 6

Acoustic Confusability

Requires reasoning under uncertainty!

• Regions of overlap represent classification error
• Reduce overlap by introducing acoustic and linguistic context

[Figure: comparison of "aa" in "lOck" and "iy" in "bEAt" for SWB]

Page 7

Probabilistic Formulation

To deal with the uncertainty, we typically formulate speech as a probabilistic problem:

$P(W\,|\,A) = \frac{P(A\,|\,W)\,P(W)}{P(A)}$

• Objective: Minimize the word error rate by maximizing P(W|A)
• Approach: Maximize P(A|W) during training
• Components:
  P(A|W): Acoustic Model
  P(W): Language Model
  P(A): Acoustic probability (ignored during maximization)

Page 8

Acoustic Modeling - HMMs

• HMMs model temporal variation in the transition probabilities of the state machine
• GMM emission densities are used to account for variations in speaker, accent, and pronunciation
• Sharing model parameters is a common strategy to reduce complexity

[Figure: left-to-right HMM with states s0–s4 and word-level models for the digit string "THREE TWO FIVE EIGHT".]

Page 9

Maximum Likelihood Training

• Data-driven modeling supervised only from a word-level transcription
• Approach: maximum likelihood estimation. The EM algorithm is used to improve our estimates:

  $P(\mathrm{Data}\,|\,\hat{\theta}) \ge P(\mathrm{Data}\,|\,\theta)$ if $Q(\hat{\theta},\theta) \ge Q(\theta,\theta)$

• Guaranteed convergence to a local maximum
• No guard against overfitting!
• Computationally efficient training algorithms (Forward-Backward) have been crucial
• Decision trees are used to optimize parameter sharing, minimize system complexity, and integrate additional linguistic knowledge
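As an illustrative aside (not part of the original slides), the forward pass referenced above can be sketched in a few lines. This is a minimal sketch assuming discrete emissions rather than the GMM densities used in the deck; the toy transition matrix, emission table, and initial distribution are hypothetical placeholders.

import numpy as np

def forward_likelihood(obs, pi, A, B):
    """Minimal HMM forward pass: returns P(obs | model).

    obs : sequence of T observation indices (discrete emissions for simplicity)
    pi  : (S,) initial state probabilities
    A   : (S, S) transition probabilities, A[i, j] = P(s_j | s_i)
    B   : (S, K) emission probabilities, B[i, k] = P(obs = k | s_i)
    """
    alpha = pi * B[:, obs[0]]               # alpha_1(i) = pi_i * b_i(o_1)
    for t in range(1, len(obs)):
        alpha = (alpha @ A) * B[:, obs[t]]  # alpha_t(j) = [sum_i alpha_{t-1}(i) a_ij] * b_j(o_t)
    return alpha.sum()                      # P(O | model) = sum_i alpha_T(i)

# Hypothetical toy model: 2 states, 3 discrete symbols
pi = np.array([0.6, 0.4])
A  = np.array([[0.7, 0.3],
               [0.4, 0.6]])
B  = np.array([[0.5, 0.4, 0.1],
               [0.1, 0.3, 0.6]])
print(forward_likelihood([0, 1, 2], pi, A, B))

The same dynamic-programming structure (with a backward pass added) is what makes Baum-Welch reestimation tractable on large corpora.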

Page 10

Drawbacks of Current Approach

• ML convergence does not translate to optimal classification
• Error from incorrect modeling assumptions
• Finding the optimal decision boundary requires only one parameter!

Page 11

Drawbacks of Current Approach

• Data not separable by a hyperplane – a nonlinear classifier is needed
• Gaussian MLE models tend toward the center of mass – overtraining leads to poor generalization

Page 12

Acoustic Modeling

Acoustic models must:
• Model the temporal progression of the speech
• Model the characteristics of the sub-word units

We would also like our models to:
• Optimally trade off discrimination and representation
• Incorporate Bayesian statistics (priors)
• Make efficient use of parameters (sparsity)
• Produce confidence measures of their predictions for higher-level decision processes

Page 13

Paradigm Shift - Discriminative Modeling

• Discriminative Training (Maximum Mutual Information Estimation)
• Essential idea: Maximize

  $\frac{P(A\,|\,W_{in})}{P(A\,|\,W_{out})}$

• Maximize numerator (ML term), minimize denominator (discriminative term)
• Discriminative Modeling (e.g. ANN Hybrids – Bourlard and Morgan)

Page 14

Research Focus

Our research: replace the Gaussian likelihood computation with a machine that incorporates notions of:
• Discrimination
• Bayesian statistics (prior information)
• Confidence
• Sparsity

All while maintaining computational efficiency

Page 15

ANN Hybrids

Architecture:
• ANN provides flexible, discriminative classifiers for emission probabilities that avoid HMM independence assumptions (can use wider acoustic context)
• Trained using Viterbi iterative training (hard decision rule) or can be trained to learn Baum-Welch targets (soft decision rule)

Shortcomings:
• Prone to overfitting: require cross-validation to determine when to stop training. Need methods to automatically penalize overfitting
• No substantial recognition improvements over HMM/GMM

[Figure: feed-forward ANN mapping an input feature vector to class posteriors P(c1|o) … P(cn|o).]

Page 16

Structural Optimization

• Structural optimization often guided by an Occam's Razor approach
• Trading goodness of fit and model complexity
• Examples: MDL, BIC, AIC, Structural Risk Minimization, Automatic Relevance Determination

[Figure: training-set error and open-loop error versus model complexity; the open-loop error reaches an optimum at intermediate complexity while the training-set error keeps decreasing.]

Page 17

Structural Risk Minimization

• The VC dimension is a measure of the complexity of the learning machine
• Higher VC dimension gives a looser bound on the actual risk – thus penalizing a more complex model (Vapnik)

Expected risk:

  $R(\alpha) = \int \tfrac{1}{2}\,|y - f(x,\alpha)|\,dP(x,y)$

  Not possible to estimate P(x,y)

Empirical risk:

  $R_{emp}(\alpha) = \frac{1}{2l}\sum_{i=1}^{l} |y_i - f(x_i,\alpha)|$

Related by the VC dimension, h:

  $R(\alpha) \le R_{emp}(\alpha) + \phi(h)$

• Approach: choose the machine that gives the least upper bound on the actual risk

[Figure: the bound on the expected risk is the sum of the empirical risk and the VC confidence term; the optimum lies at an intermediate VC dimension.]

Page 18

Support Vector Machines

• Hyperplanes C0–C2 achieve zero empirical risk; C0 generalizes optimally
• The data points that define the boundary are called support vectors

Optimization: Separable Data
• Hyperplane: $w \cdot x + b = 0$
• Constraints: $y_i(w \cdot x_i + b) - 1 \ge 0$
• Quadratic optimization of a Lagrange functional minimizes the risk criterion (maximizes the margin). Only a small portion of the training points become support vectors
• Final classifier: $f(x) = \sum_{i \in \mathrm{SVs}} \alpha_i y_i (x_i \cdot x) + b$

[Figure: two separable classes with margin hyperplanes H1 and H2 around the optimal classifier C0; candidate hyperplanes C1 and C2 also separate the data, and w is the normal to the hyperplane.]

Page 19

SVMs as Nonlinear Classifiers

• Data for practical applications typically not separable using a hyperplane in the original input feature space
• Transform data to a higher dimension where a hyperplane classifier is sufficient to model the decision surface: $\Phi: \mathbb{R}^n \rightarrow \mathbb{R}^N$
• Kernels used for this transformation: $K(x_i, x_j) = \Phi(x_i) \cdot \Phi(x_j)$
• Final classifier: $f(x) = \sum_{i \in \mathrm{SVs}} \alpha_i y_i K(x_i, x) + b$
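To make the final classifier concrete, here is a small sketch (an illustration, not the system used in this work) that trains an RBF-kernel SVM with scikit-learn and then evaluates $f(x) = \sum_i \alpha_i y_i K(x_i, x) + b$ directly from the fitted support vectors. The toy data, gamma, and C values are assumptions for the example.

import numpy as np
from sklearn.svm import SVC

# Toy two-class data (assumed for illustration only)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

clf = SVC(kernel="rbf", gamma=0.5, C=10.0).fit(X, y)

def rbf_kernel(a, b, gamma=0.5):
    return np.exp(-gamma * np.sum((a - b) ** 2, axis=-1))

# f(x) = sum_i alpha_i * y_i * K(x_i, x) + b, summed over support vectors only
x_new = np.array([1.5, 1.5])
k = rbf_kernel(clf.support_vectors_, x_new)           # K(x_i, x) for each support vector
f = np.dot(clf.dual_coef_[0], k) + clf.intercept_[0]  # dual_coef_ holds alpha_i * y_i
print(f, clf.decision_function([x_new])[0])           # the two values should agree
print("support vectors:", len(clf.support_vectors_))

The point of expanding the sum by hand is to show that only the support vectors (nonzero alpha_i) contribute to the decision; the rest of the training set drops out.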

Page 20

SVMs for Non-Separable Data

• No hyperplane could achieve zero empirical risk (in any dimension space!)
• Recall the SRM principle: trade off empirical risk and model complexity
• Relax our optimization constraint to allow for errors on the training set: $y_i(w \cdot x_i + b) \ge 1 - \xi_i$
• A new parameter, C, must be estimated to optimally control the trade-off between training set errors and model complexity

Page 21

SVM Drawbacks

• Uses a binary (yes/no) decision rule
• Generates a distance from the hyperplane, but this distance is often not a good measure of our "confidence" in the classification
• Can produce a "probability" as a function of the distance (e.g. using sigmoid fits), but they are inadequate
• Number of support vectors grows linearly with the size of the data set
• Requires the estimation of the trade-off parameter, C, via held-out sets

Page 22

Evidence Maximization

• Build a fully specified probabilistic model – incorporate prior information/beliefs as well as a notion of confidence in predictions
• MacKay posed a special form for regularization in neural networks – sparsity
• Evidence maximization: evaluate candidate models based on their "evidence", P(D|H_i)
• Structural optimization by maximizing the evidence across all candidate models!
• Steeped in Gaussian approximations

Page 23

Evidence Framework

• Evidence approximation: $P(D\,|\,H_i) \approx P(D\,|\,\hat{w}, H_i)\,P(\hat{w}\,|\,H_i)\,\Delta w$
• Likelihood of data given the best-fit parameter set: $P(D\,|\,\hat{w}, H_i)$
• Penalty that measures how well our posterior model fits our prior assumptions: $P(\hat{w}\,|\,H_i)\,\Delta w$
• We can set the prior in favor of sparse, smooth models!

[Figure: the posterior P(w|D,Hi) is a narrow peak within the broader prior P(w|Hi); the ratio of their widths gives the Occam factor.]

Page 24

Relevance Vector Machines

• A kernel-based learning machine:

  $y(x; w) = w_0 + \sum_{i=1}^{N} w_i K(x, x_i)$

  $P(t=1\,|\,x, w) = \frac{1}{1 + e^{-y(x;w)}}$

• Incorporates an automatic relevance determination (ARD) prior over each weight (MacKay):

  $P(w\,|\,\alpha) = \prod_{i=0}^{N} N(w_i\,|\,0, \alpha_i^{-1})$

• A flat (non-informative) prior over $\alpha$ completes the Bayesian specification
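As a small numerical illustration of the model above (a sketch only: the relevance vectors and weights below are made up, not trained values), the RVM output and posterior class probability can be computed as:

import numpy as np

def rbf_kernel(x, xi, gamma=0.5):
    return np.exp(-gamma * np.sum((x - xi) ** 2, axis=-1))

def rvm_posterior(x, w0, weights, relevance_vectors, gamma=0.5):
    """P(t=1 | x, w) = sigmoid(y(x; w)) with y(x; w) = w0 + sum_i w_i K(x, x_i)."""
    y = w0 + np.dot(weights, rbf_kernel(relevance_vectors, x, gamma))
    return 1.0 / (1.0 + np.exp(-y))

# Hypothetical relevance vectors and weights (illustrative only)
rvs = np.array([[0.0, 0.0], [2.0, 2.0], [4.0, 0.0]])
w   = np.array([1.2, -0.7, 0.9])
print(rvm_posterior(np.array([1.0, 1.0]), w0=-0.1, weights=w, relevance_vectors=rvs))

Unlike the SVM distance, the output is already a posterior probability, which is what makes the decoder coupling discussed later more natural.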

Page 25

Relevance Vector Machines

• The goal in training becomes finding:

  $\hat{w}, \hat{\alpha} = \arg\max_{w,\alpha}\, p(w, \alpha\,|\,t, X)$, where
  $p(w, \alpha\,|\,t, X) = \frac{p(t\,|\,w, \alpha, X)\,p(w, \alpha\,|\,X)}{p(t\,|\,X)}$

• Estimation of the "sparsity" parameters $\alpha$ is inherent in the optimization – no need for a held-out set!
• A closed-form solution to this maximization problem is not available. Rather, we iteratively re-estimate $\hat{w}$ and $\hat{\alpha}$

Page 26

Laplace's Method

• Fix $\alpha$ and estimate $w$ (e.g. gradient descent):

  $\hat{w} = \arg\max_w\, p(t\,|\,w)\,p(w\,|\,\alpha)$

• Use the Hessian to approximate the covariance of a Gaussian posterior of the weights centered at $\hat{w}$:

  $\Sigma = \left(-\nabla_w \nabla_w \log p(t\,|\,w)\,p(w\,|\,\alpha)\,\big|_{\hat{w}}\right)^{-1}$

• With $\hat{w}$ and $\Sigma$ as the mean and covariance, respectively, of the Gaussian approximation, we find $\alpha$ by finding:

  $\alpha_i^{new} = \frac{\gamma_i}{\hat{w}_i^2}$, where $\gamma_i = 1 - \alpha_i \Sigma_{ii}$

• Method is O(N²) in memory and O(N³) in time
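A minimal sketch of one hyperparameter update for the classification RVM, assuming a design matrix Phi (kernel columns plus a bias column), targets t in {0,1}, and a MAP weight estimate w_hat already found for the current alpha. The variable names here are ours, chosen for the sketch, not the original implementation.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def update_alpha(Phi, t, w_hat, alpha):
    """One hyperparameter update: alpha_i <- gamma_i / w_i^2, with gamma_i = 1 - alpha_i * Sigma_ii."""
    y = sigmoid(Phi @ w_hat)                 # current posterior outputs
    B = np.diag(y * (1.0 - y))               # logistic weighting in the Hessian
    H = Phi.T @ B @ Phi + np.diag(alpha)     # negative Hessian of the log posterior
    Sigma = np.linalg.inv(H)                 # covariance of the Gaussian approximation
    gamma = 1.0 - alpha * np.diag(Sigma)     # how well-determined each weight is by the data
    return gamma / (w_hat ** 2 + 1e-12)      # new alpha (small constant avoids divide-by-zero)

In practice, weights whose alpha grows very large are pruned from the model, which is where the sparsity comes from; the O(N³) matrix inverse in this loop is exactly the cost that motivates the scaling work later in the deck.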

Page 27

RVMs Compared to SVMs

RVM:
• Data: class labels (0,1)
• Goal: learn the posterior, P(t=1|x)
• Structural optimization: hyperprior distribution encourages sparsity
• Training: iterative – O(N³)

SVM:
• Data: class labels (-1,1)
• Goal: find the optimal decision surface under constraints: $y_i(w \cdot x_i + b) \ge 1 - \xi_i$
• Structural optimization: trade-off parameter that must be estimated
• Training: quadratic – O(N²)

Page 28

Simple Example

Page 29

ML Comparison

Page 30

SVM Comparison

Page 31

SVM With Sigmoid Posterior Comparison

Page 32

RVM Comparison

Page 33

Experimental Progression

• Proof of concept on speech classification data
• Coupling classifiers to ASR system
• Reduced-set tests on Alphadigits task
• Algorithms for scaling up RVM classifiers
• Further tests on Alphadigits task (still not the full training set though!)
• New work aiming at larger data sets and HMM decoupling

Page 34

Vowel Classification

Deterding Vowel Data: 11 vowels spoken in "h*d" context; 10 log area parameters; 528 train, 462 SI test

Approach                   % Error   # Parameters
SVM: Polynomial Kernels    49%       –
K-Nearest Neighbor         44%       –
Gaussian Node Network      44%       –
SVM: RBF Kernels           35%       83 SVs
Separable Mixture Models   30%       –
RVM: RBF Kernels           30%       13 RVs

Page 35

Coupling to ASR

• Data size: 30 million frames of data in training set
  Solution: segmental phone models
• Source for segmental data:
  Solution: use the HMM system in a bootstrap procedure
  Could also build a segment-based decoder
• Probabilistic decoder coupling:
  SVMs: sigmoid-fit posterior
  RVMs: naturally probabilistic

[Figure: a phone segment ("hh aw aa r y uw") spanning k frames is divided into three regions of 0.3k, 0.4k, and 0.3k frames; the mean of each region forms the segmental feature vector.]
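The 3-region segmental features in the figure can be illustrated with a short sketch. The 0.3/0.4/0.3 region proportions follow the figure; everything else (the frame matrix, its dimensionality) is an assumption for the example.

import numpy as np

def segmental_features(frames, proportions=(0.3, 0.4, 0.3)):
    """Average frame-level features over three regions of a phone segment
    and concatenate the region means into one segmental feature vector.

    frames : (k, d) array of k frame-level feature vectors (e.g. mel-cepstra)
    """
    k = len(frames)
    bounds = np.cumsum([0.0] + list(proportions))    # 0.0, 0.3, 0.7, 1.0
    edges = np.round(bounds * k).astype(int)
    means = [frames[edges[r]:edges[r + 1]].mean(axis=0) for r in range(3)]
    return np.concatenate(means)

# Hypothetical segment: 20 frames of 12-dimensional mel-cepstra
segment = np.random.randn(20, 12)
print(segmental_features(segment).shape)   # (36,)

Collapsing each segment to a fixed-length vector is what lets a single kernel machine score a whole phone instance instead of 30 million individual frames.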

Page 36

Coupling to ASR System

[Block diagram: mel-cepstral features feed HMM recognition, which produces an N-best list and segment information; a segmental converter turns these into segmental features, and a hybrid decoder uses them to produce the final hypothesis.]

Page 37

Alphadigit Recognition

• OGI Alphadigits: continuous, telephone bandwidth letters and numbers ("A19B4E")
• Reduced training set size for RVM comparison: 2000 training segments per phone model
  Could not, at this point, run larger sets efficiently
• 3329 utterances using 10-best lists generated by the HMM decoder
• SVM and RVM system architectures are nearly identical: RBF kernels with gamma = 0.5
• SVM requires the sigmoid posterior estimate to produce likelihoods – sigmoid parameters estimated from a large held-out set
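The sigmoid posterior fit mentioned above maps the SVM distance f to a probability p(t=1|f) = 1/(1 + exp(A·f + B)). Below is a hedged sketch of fitting A and B on held-out (distance, label) pairs by minimizing the cross-entropy with scipy's general-purpose optimizer; this is an illustration of the idea, not the fitting procedure used in the original system, and the held-out values are made up.

import numpy as np
from scipy.optimize import minimize

def fit_sigmoid(f, t):
    """Fit p(t=1 | f) = 1 / (1 + exp(A*f + B)) to held-out SVM distances f and labels t in {0,1}."""
    def neg_log_likelihood(params):
        A, B = params
        p = 1.0 / (1.0 + np.exp(A * f + B))
        p = np.clip(p, 1e-12, 1 - 1e-12)        # numerical safety
        return -np.sum(t * np.log(p) + (1 - t) * np.log(1 - p))
    res = minimize(neg_log_likelihood, x0=[-1.0, 0.0], method="Nelder-Mead")
    return res.x                                 # fitted (A, B)

# Hypothetical held-out distances and labels
f = np.array([-2.1, -1.0, -0.3, 0.2, 0.9, 2.4])
t = np.array([0, 0, 0, 1, 1, 1])
print(fit_sigmoid(f, t))

The need for this extra held-out fit is one of the practical costs of the SVM that the RVM, being directly probabilistic, avoids.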

Page 38

SVM Alphadigit Recognition

Transcription   Segmentation   SVM     HMM
N-best          Hypothesis     11.0%   11.9%
N-best+Ref      Reference      3.3%    6.3%

• HMM system is cross-word state-tied triphones with 16 mixtures of Gaussian models
• SVM system has monophone models with segmental features
• System combination experiment yields another 1% reduction in error

Page 39

SVM/RVM Alphadigit Comparison

• RVMs yield a large reduction in the parameter count while attaining superior performance
• Computational cost for RVMs is mainly in training, but it is still prohibitive for larger sets

Approach   Error Rate   Avg. # Parameters   Training Time   Testing Time
SVM        16.4%        257                 0.5 hours       30 mins
RVM        16.2%        12                  30 days         1 min

Page 40

Scaling Up

• Central to RVM training is the inversion of an MxM Hessian matrix: an O(N³) operation initially
• Solutions:
  Constructive Approach: start with an empty model and iteratively add candidate parameters. M is typically much smaller than N
  Divide and Conquer Approach: divide the complete problem into a set of sub-problems. Iteratively refine the candidate parameter set according to the sub-problem solutions. M is user-defined

Page 41

Constructive Approach

• Tipping and Faul (MSR-Cambridge)
• Define

  $L(\alpha) = L(\alpha_{-i}) + l(\alpha_i)$

  $l(\alpha_i)$ has a unique solution with respect to $\alpha_i$

• The results give a set of rules for adding vectors to the model, removing vectors from the model, or updating parameters in the model

Page 42

Constructive Approach Algorithm

Prune all parameters;
While not converged
    For each parameter:
        If parameter is pruned:
            checkAddRule
        Else:
            checkPruneRule
            checkUpdateRule
    End
    Update Model
End

• Begin with all weights set to zero and iteratively construct an optimal model without evaluating the full NxN inverse
• Formulated for RVM regression – can have oscillatory behavior for classification
• Rule subroutines require storage of the full (NxN) design matrix
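In Tipping and Faul's formulation, the add/prune/update decision for basis function i reduces to comparing its "quality" factor q_i against its "sparsity" factor s_i. The sketch below shows only that decision logic, under the assumption that s_i and q_i have already been computed from the current model (that computation is omitted), so treat it as a reading aid rather than the full algorithm.

def decide(alpha_i, s_i, q_i, pruned):
    """Per-basis decision in the constructive (fast marginal likelihood) update.

    s_i : sparsity factor, q_i : quality factor (per Tipping & Faul);
    theta_i = q_i**2 - s_i determines whether basis i is worth keeping.
    """
    theta_i = q_i ** 2 - s_i
    if theta_i > 0 and pruned:
        return "add", s_i ** 2 / theta_i        # alpha_i for the newly added basis
    if theta_i > 0 and not pruned:
        return "update", s_i ** 2 / theta_i     # keep the basis, re-estimate its alpha_i
    if theta_i <= 0 and not pruned:
        return "prune", float("inf")            # alpha_i -> infinity removes the basis
    return "skip", alpha_i                      # pruned basis that stays pruned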

Page 43

Iterative Reduction Algorithm

• O(M³) in run-time and O(MxN) in memory. M is a user-defined parameter
• Assumes that if P(w_k = 0 | w_{I,J}, D) is 1 then P(w_k = 0 | w, D) is also 1! Optimality?

[Figure: the full candidate pool is divided into subsets (Subset 0 … Subset J); each subset is trained together with the current relevance vectors, and the surviving RVs carry over from Iteration I to Iteration I+1.]
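A high-level sketch of the reduction strategy in the figure, assuming a hypothetical train_rvm(X, y) routine that returns the positions of the surviving relevance vectors within the working set; the subset size M and the fixed iteration count are placeholders, not values from this work.

import numpy as np

def iterative_reduction(X, y, train_rvm, M=500, n_iters=5):
    """Divide-and-conquer RVM training sketch: each working set is one subset of the
    candidate pool plus the relevance vectors kept so far; survivors seed the next round."""
    candidates = np.arange(len(X))        # indices still in the candidate pool
    rv_idx = np.array([], dtype=int)      # current relevance vector indices
    for _ in range(n_iters):
        np.random.shuffle(candidates)
        for start in range(0, len(candidates), M):
            subset = candidates[start:start + M]
            working = np.union1d(subset, rv_idx)
            keep = train_rvm(X[working], y[working])   # hypothetical trainer: positions kept
            rv_idx = working[keep]                     # map back to global indices
        candidates = np.setdiff1d(candidates, rv_idx)  # points pruned so far stay in the pool
    return rv_idx

Because each sub-problem never exceeds roughly M + |RVs| examples, the O(M³) inversions stay tractable even when N is far too large for the constructive method.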

Page 44

Alphadigit Recognition

• Data increased to 10000 training vectors
• Reduction method has been trained up to 100k vectors (on a toy task). Not possible for the Constructive method

Approach           Error Rate   Avg. # Parameters   Training Time   Testing Time
SVM                15.5%        994                 3 hours         1.5 hours
RVM Constructive   14.8%        72                  5 days          5 mins
RVM Reduction      14.8%        74                  6 days          5 mins

Page 45

Summary

• First to apply kernel machines as acoustic models
• Comparison of two machines that apply structural optimization to learning: SVM and RVM
• Performance exceeds that of the HMM, but with quite a bit of HMM interaction
• Algorithms for increased data sizes are key

Page 46

Decoupling the HMM

• Still want to use segmental data (data size)
• Want the kernel machine acoustic model to determine an optimal segmentation, though
• Need a new decoder
  Hypothesize each phone for each possible segment
  Pruning is a huge issue
  Stack decoder is beneficial
• Status: in development

Page 47

Improved Iterative Algorithm

• Same principle of operation
• One pass over the data – much faster!
• Status: equivalent performance on all benchmarks – running on Alphadigits now

[Figure: as before, candidate subsets (Subset 0, Subset 1, …) are trained together with the current RVs, but now in a single pass over the data.]

Page 48

Active Learning for RVMs

• Idea: given the current model, iteratively choose a subset of points from the full training set that will improve the system performance
• Problem #1: "performance" is typically defined as classifier error rate (e.g. boosting). What about the posterior estimate accuracy?
• Problem #2: for kernel machines, an added training point can:
  Assist in bettering the model performance
  Become part of the model itself! How do we determine which points should be added?
• Look to work in Gaussian Processes (Lawrence, Seeger, Herbrich, 2003)

Page 49

Extensions

• Not ready for prime time as an acoustic model
• How else might we use the same techniques for speech?
• Online speech/noise classification?
  Requires adaptation methods
• Application of automatic relevance determination to model selection for HMMs?

Page 50

Acknowledgments

• Collaborators: Aravind Ganapathiraju and Joe Picone at Mississippi State
• Consultants: Michael Tipping (MSR-Cambridge) and Thorsten Joachims (now at Cornell)