Landmark-Based Speech Recognition:
Spectrogram Reading, Support Vector Machines,
Dynamic Bayesian Networks, and Phonology
Mark Hasegawa-Johnson, [email protected]
University of Illinois at Urbana-Champaign, USA
Lecture 9. Learning in Bayesian Networks
• Learning via Global Optimization of a Criterion
• Maximum-likelihood learning
– The Expectation-Maximization algorithm
– Solution for discrete variables using Lagrange multipliers
– General solution for continuous variables
– Example: Gaussian PDF
– Example: Mixture Gaussian
– Example: Bourlard-Morgan NN-DBN Hybrid
– Example: BDFK NN-DBN Hybrid
• Discriminative learning criteria
– Maximum Mutual Information
– Minimum Classification Error
What is Learning?
Imagine that you are a student who needs to learn how to propagate belief in a junction tree.
• Level 1 Learning (Rule-Based): I tell you the algorithm. You memorize it.
• Level 2 Learning (Category Formation): You observe examples (FHMM). You memorize them. From the examples, you build a cognitive model of each of the steps (moralization, triangulation, cliques, sum-product).
• Level 3 Learning (Performance): You try a few problems. When you fail, you optimize your understanding of all components of the cognitive model in order to minimize the probability of future failures.
What is Machine Learning?
• Level 1 Learning (Rule-Based): Programmer tells the computer how to behave. This is not usually called “machine learning.”
• Level 2 Learning (Category Formation): The program is given a numerical model of each category (e.g., a PDF, or a geometric model). Parameters of the numerical model are adjusted in order to represent the category.
• Level 3 Learning (Performance): All parameters in a complex system are simultaneously adjusted in order to optimize a global performance metric.
Learning Criteria
Optimization Methods
Maximum Likelihood Learning in a Dynamic Bayesian Network
• Given: a particular model structure
• Given: a set of training examples for that model, (bm, om), 1 ≤ m ≤ M
• Estimate all model parameters (p(b|a), p(c|a), …) in order to maximize ∑m log p(bm, om | Θ)
• Recognition is Nested within Training: at each step of the training algorithm, we need to compute p(bm, om, am, …, qm) for every training token, using the sum-product algorithm.
[Figure: example Bayesian network over variables a, b, c, d, e, f, n, o, q]
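The “recognition nested within training” point can be made concrete with the forward pass of the sum-product algorithm on the simplest dynamic Bayesian network, an HMM. This is an illustrative sketch, not code from the lecture; the model sizes and probability values are made up:

```python
import numpy as np

# Toy HMM (made-up numbers): 2 hidden states, 3 observation symbols.
pi = np.array([0.6, 0.4])            # initial state probabilities
A  = np.array([[0.7, 0.3],           # A[i, j] = p(q_t = j | q_{t-1} = i)
               [0.4, 0.6]])
B  = np.array([[0.5, 0.4, 0.1],      # B[i, k] = p(o_t = k | q_t = i)
               [0.1, 0.3, 0.6]])

def log_likelihood(obs):
    """log p(o_1..o_T | model) via the forward (sum-product) recursion."""
    alpha = pi * B[:, obs[0]]               # alpha_1(i) = pi_i b_i(o_1)
    for o_t in obs[1:]:
        alpha = (alpha @ A) * B[:, o_t]     # alpha_t(j) = sum_i alpha_{t-1}(i) a_ij b_j(o_t)
    return np.log(alpha.sum())

print(log_likelihood([0, 1, 2]))
```

Training would call this routine (and its backward counterpart) once per training token, per EM iteration, which is why inference cost dominates learning cost.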
Baum’s Theorem (Baum and Eagon, Bull. Am. Math. Soc., 1967)
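The body of this slide is not reproduced in the transcript. In EM notation (the symbols Θ, o, q are my choice), the result that justifies the algorithm can be stated as follows: define the auxiliary function

```latex
Q(\Theta',\Theta) \;=\; \sum_{q} p(q \mid o, \Theta)\,\log p(o, q \mid \Theta'),
\qquad\text{then}\qquad
Q(\Theta',\Theta) \;\ge\; Q(\Theta,\Theta)
\;\Longrightarrow\;
p(o \mid \Theta') \;\ge\; p(o \mid \Theta).
```

That is, any new parameter set Θ′ that increases Q over its value at Θ cannot decrease the likelihood, so alternating E-steps (compute Q) and M-steps (maximize Q) climbs the likelihood monotonically.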
Expectation Maximization (EM)
EM for a Discrete-Variable Bayesian Network
[Figure: Bayesian network over variables a, b, c, d, e, f, n, o, q]
Solution: Lagrangian Method
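Setting the derivative of the Lagrangian to zero yields the familiar closed form: each conditional probability is a ratio of expected counts accumulated during the E-step. A minimal sketch, with my own variable names and toy expected counts:

```python
import numpy as np

def m_step_cpt(gamma_ab):
    """M-step for p(b|a): normalize expected joint counts over b.
    gamma_ab[i, j] = sum over training tokens of p(a=i, b=j | o_m, old model)."""
    return gamma_ab / gamma_ab.sum(axis=1, keepdims=True)

# Expected counts accumulated in the E-step (toy numbers):
counts = np.array([[3.0, 1.0],
                   [2.0, 2.0]])
p_b_given_a = m_step_cpt(counts)
```

The Lagrange multiplier for each row enforces the sum-to-one constraint, which is why the solution is exactly a row normalization.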
The EM Algorithm for a Large Training Corpus
EM for Continuous Observations (Liporace, IEEE Trans. Inf. Th., 1982)
Solution: Lagrangian Method
Example: Gaussian (Liporace, IEEE Trans. Inf. Th., 1982)
Example: Mixture Gaussian (Juang, Levinson, and Sondhi, IEEE Trans. Inf. Th., 1986)
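For i.i.d. scalar data, the mixture-Gaussian re-estimation formulas reduce to the standard EM updates sketched below. The data, initializations, and variable names are mine, chosen only to illustrate one E-step/M-step pair:

```python
import numpy as np

def em_step(x, w, mu, var):
    """One EM iteration for a scalar Gaussian mixture model."""
    x = np.asarray(x, dtype=float)[:, None]             # shape (N, 1)
    # E-step: posterior responsibility of each component for each point
    ll = -0.5 * ((x - mu) ** 2 / var + np.log(2 * np.pi * var))
    resp = w * np.exp(ll)
    resp /= resp.sum(axis=1, keepdims=True)
    # M-step: responsibility-weighted maximum-likelihood updates
    nk = resp.sum(axis=0)
    w_new = nk / len(x)
    mu_new = (resp * x).sum(axis=0) / nk
    var_new = (resp * (x - mu_new) ** 2).sum(axis=0) / nk
    return w_new, mu_new, var_new

x = [0.0, 0.1, -0.1, 5.0, 5.2, 4.8]
w, mu, var = em_step(x, np.array([0.5, 0.5]),
                     np.array([0.0, 4.0]), np.array([1.0, 1.0]))
```

In an HMM, the same weighted-average form appears, except the responsibilities also include the state-occupancy posteriors from the forward-backward pass.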
Example: Bourlard-Morgan Hybrid (Morgan and Bourlard, IEEE Sign. Proc. Magazine 1995)
Pseudo-Priors and Training Priors
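The role of the priors in the hybrid can be shown in two lines: a network trained to output posteriors p(q|o) is converted into an HMM observation score by dividing out the class priors, since by Bayes’ rule p(o|q)/p(o) = p(q|o)/p(q). A sketch with made-up numbers:

```python
import numpy as np

# NN outputs for one frame: posteriors over 3 phone states (made-up values)
posteriors = np.array([0.7, 0.2, 0.1])
# Priors estimated from training-set state occupancy counts
priors = np.array([0.5, 0.3, 0.2])

# Scaled likelihood: p(o|q) / p(o) = p(q|o) / p(q)
scaled_likelihood = posteriors / priors
```

The constant factor p(o) is the same for every state, so it does not change the best path through the HMM.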
Training the Hybrid Model Using the EM Algorithm
The Solution: Q Back-Propagation
Merging the EM and Gradient Ascent Loops
Example: BDFK Hybrid (Bengio, De Mori, Flammia, and Kompe, Speech Comm. 1992)
The Q Function for a BDFK Hybrid
The EM Algorithm for a BDFK Hybrid
Discriminative Learning Criteria
Maximum Mutual Information
An EM-Like Algorithm for MMI
MMI for Databases with Different Kinds of Transcription
• If every word’s start and end times are labeled, then WT is the true word label, and W* is the label of the false word (or words!) with maximum modeled probability.
• If the start and end times of individual words are not known, then WT is the true word sequence. W* may be computed as the best path (or paths) through a word lattice or N-best list. (Schlüter, Macherey, Müller, and Ney, Speech Comm. 2001)
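With whole-utterance transcriptions, the competitor W* can be approximated from an N-best list. A sketch; the word hypotheses and scores are hypothetical joint log-probabilities log p(O|W) + log P(W):

```python
import math

def mmi_terms(scores, true_word):
    """MMI criterion for one token: log p(O, WT) - log sum_W p(O, W),
    plus the strongest competitor W*.
    scores: dict mapping word hypothesis -> joint log-probability."""
    denom = math.log(sum(math.exp(s) for s in scores.values()))
    w_star = max((w for w in scores if w != true_word), key=scores.get)
    return scores[true_word] - denom, w_star

# Hypothetical N-best scores for one utterance
scores = {"cat": -10.0, "cap": -11.0, "cut": -13.0}
mmi, w_star = mmi_terms(scores, "cat")
```

The criterion is zero only when the true hypothesis carries all the probability mass, so maximizing it pushes probability away from the competitors; in practice the log-sum-exp should be computed in a numerically stable form.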
Minimum Classification Error (McDermott and Katagiri, Comput. Speech Lang. 1994)
• Define empirical risk as “the number of word tokens for which the wrong HMM has higher log-likelihood than the right HMM”
• This risk definition has two nonlinearities:
– Zero-one loss function, u(x). Replace with a differentiable loss function ℓ(x), e.g. a sigmoid.
– Max. Replace with a “softmax” function, log(exp(a)+exp(b)+exp(c)).
• Differentiate the result; train all HMM parameters using error backpropagation.
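The two smoothing steps can be sketched directly: the max over wrong-model scores becomes a log-sum-exp, and the 0-1 step becomes a sigmoid, so the loss is differentiable in every HMM log-likelihood. Function and variable names are my own:

```python
import math

def mce_loss(g_correct, g_wrong, alpha=1.0):
    """Smoothed classification error for one token.
    g_correct: log-likelihood of the right HMM;
    g_wrong: log-likelihoods of the competing HMMs."""
    # "softmax" replacement for max: log(exp(a) + exp(b) + ...)
    softmax = math.log(sum(math.exp(g) for g in g_wrong))
    d = softmax - g_correct                      # > 0: (softly) misclassified
    return 1.0 / (1.0 + math.exp(-alpha * d))    # sigmoid replaces u(d)

loss = mce_loss(g_correct=-100.0, g_wrong=[-105.0, -110.0])
```

Summing this loss over the training set approximates the empirical error count, and its gradient with respect to the HMM parameters is what error back-propagation descends.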
Summary
• What is Machine Learning?
– choose an optimality criterion
– find an algorithm that will adjust model parameters to optimize the criterion
• Maximum Likelihood
– Baum’s theorem: increasing E[log(p)] increases p
– Apply directly to discrete, Gaussian, MG
– Nest within EBP for BM and BDFK hybrids
• Discriminative Criteria
– Maximum Mutual Information (MMI)
– Minimum Classification Error (MCE)