1
Regularized Adaptation: Theory, Algorithms and Applications
Xiao Li
Electrical Engineering Department
University of Washington
2
Goal
When a target distribution $p^{ad}(x, y)$ is close to, but different from, the training distribution $p^{tr}(x, y)$, a classifier optimized for $p^{tr}(x, y)$ needs to be adapted to $p^{ad}(x, y)$ using a small amount of labeled adaptation data.
3
Roadmap
- Introduction
- Theoretical results: a Bayesian fidelity prior for adaptation; generalization error bounds
- Regularized adaptation algorithms: SVM and MLP adaptation; experiments on vowel and object classification
- Application to the Vocal Joystick
- Conclusions and future work
4
Inductive Learning

Given
- a set of m samples $(x_i, y_i) \sim p(x, y)$
- a decision function space $F: X \rightarrow \{\pm 1\}$

Goal: learn a decision function $f \in F$ that minimizes the expected error

$$R_{p(x,y)}(f) = \mathrm{E}_{p(x,y)}[\mathrm{I}(f(x) \neq y)]$$

In practice, minimize the empirical error (see the sketch below)

$$R_{\mathrm{emp}}(f) = \frac{1}{m} \sum_{i=1}^{m} \mathrm{I}(f(x_i) \neq y_i)$$

while applying a regularization strategy to achieve good generalization performance.
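As a concrete illustration, here is a minimal Python sketch of the empirical error above; the decision rule `f` and the data are hypothetical stand-ins.

```python
import numpy as np

def empirical_risk(f, X, y):
    """R_emp(f): the fraction of samples that the decision function misclassifies."""
    return np.mean(f(X) != y)

# A hypothetical linear decision rule with labels in {-1, +1}.
w, b = np.array([1.0, -2.0]), 0.5
f = lambda X: np.sign(X @ w + b)
X = np.array([[0.0, 1.0], [3.0, 0.0], [-1.0, 2.0]])
y = np.array([-1, 1, -1])
print(empirical_risk(f, X, y))  # 0.0: all three samples are classified correctly
```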
5
Why Is Regularization Helpful?
Learning theory says

$$\Pr\left\{ \forall f \in F:\; R(f) \le R_{\mathrm{emp}}(f) + \Phi(F, f, m, \delta) \right\} \ge 1 - \delta$$

- Vapnik's VC bound (a frequentist view): $\Phi$ grows with the VC dimension of $F$
- Occam's Razor bound (a Bayesian view): $\Phi$ grows with $-\ln \pi(f)$ for a countable function space

"Accuracy-regularization": we want to minimize the empirical error as well as the capacity term
- Support vector machines (a frequentist view)
- Bayesian model selection (a Bayesian view)
6
Adaptive Learning (Adaptation)
Two related yet different distributions
- Training: $p^{tr}(x, y)$
- Target (test-time): $p^{ad}(x, y)$

Given
- An unadapted model $f^{tr} = \arg\min_{f \in F} R_{p^{tr}(x,y)}(f)$
- Labeled adaptation data $D_m = \{(x_i, y_i)\}_{i=1}^{m} \sim p^{ad}(x, y)$

Goal: learn an adapted model that is as close as possible to our desired model

$$f^{ad} = \arg\min_{f \in F} R_{p^{ad}(x,y)}(f)$$

Notes
- Adaptation data $D_m$ may be very limited
- The original data used to train $f^{tr}$ is not preserved for adaptation
7
Scenarios
Customization
- Speech recognition: speaker adaptation
- Handwriting recognition: writer adaptation
- Language processing: domain adaptation

Evolutionary environments
- Spam filtering
8
Practical Work on Adaptation
- Gaussian mixture models (GMMs): MAP (Gauvain 94); MLLR (Leggetter 95)
- Support vector machines (SVMs): boosting-like approach (Matic 93); weighted combination of old support vectors and adaptation data (Wu 04)
- Multi-layer perceptrons (MLPs): shared "internal representation" in transfer learning (Baxter 95, Caruana 97, Stadermann 05); linear input network (Neto 95)
- Conditional maximum entropy models: Gaussian prior (Chelba 04)
9
This Work Seeks Answers to …
- A unified and principled approach to adaptation, applicable to a variety of classifiers and amenable to variations in the amount of adaptation data
- Quantitative relationships between the generalization error bound (or sample complexity bound) and the divergence between the training and target distributions
10
Roadmap
- Introduction
- Theoretical results: a Bayesian fidelity prior for adaptation; generalization error bounds
- Regularized adaptation algorithms: SVM and MLP adaptation; experiments on vowel and object classification
- Application to the Vocal Joystick
- Conclusions and future work
11
Bayesian Fidelity Prior
Adaptation objective

$$\min_{f \in F}\; R_{\mathrm{emp}}(f) - \ln p_{\mathrm{fid}}(f)$$

- $R_{\mathrm{emp}}(f)$: empirical error on the adaptation data
- $p_{\mathrm{fid}}(f)$: Bayesian "fidelity prior"

Fidelity prior

$$\ln p_{\mathrm{fid}}(f) = \mathrm{E}_{p^{tr}(x,y)}[\ln p(f \mid x, y)]$$

- How likely a classifier is given a training distribution, rather than a training set (the key difference from Baxter 97)
- Reflects the "fidelity" to the unadapted model
- Related to the KL-divergence
12
Generative Models
Generative models
- $F$: a space of generative models $p(x, y \mid f)$
- Classification: $\hat{y} = \arg\max_y \log p(x, y \mid f)$

Posterior

$$p(f \mid x, y) = \frac{p(x, y \mid f)\, \pi(f)}{\sum_{f'} p(x, y \mid f')\, \pi(f')}$$

where $\pi(f)$ is the standard prior, chosen before training.

Assume $f^{tr}$ and $f^{ad}$ are the true models generating the training and target distributions respectively, i.e.

$$p(x, y \mid f^{tr}) = p^{tr}(x, y), \qquad p(x, y \mid f^{ad}) = p^{ad}(x, y)$$

Note that this assumption is justified when the function space contains the true models and we use the log-likelihood loss.
13
Fidelity Prior for Generative Models
Key result

$$\ln p_{\mathrm{fid}}(f) = -\beta\, D\big(p(x, y \mid f^{tr}) \,\|\, p(x, y \mid f)\big) + \ln \pi(f)$$

where $\beta > 0$.

Implication: at the desired model,

$$\ln p_{\mathrm{fid}}(f^{ad}) = -\beta\, D(p^{tr} \,\|\, p^{ad}) + \ln \pi(f^{ad})$$

so as $D(p^{tr} \,\|\, p^{ad}) \rightarrow 0$, $p_{\mathrm{fid}}(f^{ad}) \rightarrow \pi(f^{ad})$. When the training and target distributions are close, we are more likely to learn our desired model using the fidelity prior than using the standard prior.
14
Instantiations
To compute the fidelity prior
- Assuming a "uniform" standard prior, the fidelity prior is determined by the KL-divergence
- We can use an upper bound on the KL-divergence (hence a lower bound on the prior)

Gaussian models
- The fidelity prior is a normal-Wishart distribution

Mixture models
- An upper bound on the KL-divergence (using the log sum inequality; see the sketch below)

Hidden Markov models
- An upper bound on the KL-divergence (Silva 06)
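For the mixture-model case, a standard bound of this kind (a hedged sketch; the thesis may use a different constant or form) follows from the log sum inequality: for mixtures with paired components,

$$D\Big( \sum_i \pi_i\, p_i \,\Big\|\, \sum_i \rho_i\, q_i \Big) \;\le\; D(\pi \,\|\, \rho) + \sum_i \pi_i\, D(p_i \,\|\, q_i)$$

so the KL-divergence between the component weights plus a weighted sum of component-wise KL-divergences upper-bounds the mixture KL-divergence.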
15
Discriminative Models
A unified view of SVMs, MLPs, CRFs, etc.
- Affine classifiers in a transformed space: $f = (w, b)$
- Classification: $\hat{y} = \mathrm{sgn}(w^T \Phi(x) + b)$

Model | $\Phi(x)$ | Loss function
SVMs | determined by the kernel | hinge loss
MLPs | hidden neurons | log loss
CRFs | any feature function | log loss

Conditional distribution (for the binary case; a code sketch follows):

$$p(y \mid x, f) = \frac{1}{1 + e^{-y (w^T \Phi(x) + b)}}$$
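This unified view is easy to state in code; below is a minimal Python sketch (hypothetical `w`, `b`, and feature vector) of the shared decision rule and the binary conditional model.

```python
import numpy as np

def decide(w, b, phi_x):
    """Shared classification rule: the sign of an affine function of Phi(x)."""
    return np.sign(phi_x @ w + b)

def p_y_given_x(w, b, phi_x, y):
    """Binary conditional model p(y | x, f) with y in {-1, +1}."""
    return 1.0 / (1.0 + np.exp(-y * (phi_x @ w + b)))

w, b = np.array([0.5, -1.0]), 0.2
phi_x = np.array([1.0, 0.3])          # a stand-in for Phi(x)
print(decide(w, b, phi_x), p_y_given_x(w, b, phi_x, y=+1))
```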
16
Discriminative Models (cont.)
Conditional models
- $F$: a space of conditional models $p(y \mid x, f)$
- Classification: $\hat{y} = \arg\max_y \log p(y \mid x, f)$

Posterior

$$p(f \mid x, y) = \frac{p(y \mid x, f)\, \pi(f)}{\sum_{f'} p(y \mid x, f')\, \pi(f')}$$

Assume $f^{tr}$ and $f^{ad}$ are the true models describing the training and target conditional distributions respectively, i.e.

$$p(y \mid x, f^{tr}) = p^{tr}(y \mid x), \qquad p(y \mid x, f^{ad}) = p^{ad}(y \mid x)$$
17
Fidelity Prior for Conditional Models
Again a divergence

$$\ln p_{\mathrm{fid}}(f) = -\beta\, D\big(p(y \mid x, f^{tr}) \,\|\, p(y \mid x, f)\big) + \ln \pi(f)$$

where $\beta > 0$.

What if we do not know $p^{tr}(x, y)$? We seek an upper bound on the KL-divergence, and hence a lower bound on the prior.

Key result (numerically checked in the sketch below)

$$D\big(p(y \mid x, f^{tr}) \,\|\, p(y \mid x, f)\big) \;\le\; R\, \|w - w^{tr}\| + |b - b^{tr}|$$

where $R = \mathrm{E}[\|\Phi(x)\|]$.
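This bound is cheap to sanity-check numerically; the following Python sketch (with arbitrary synthetic parameters) estimates the left side by Monte Carlo for the binary logistic model and compares it to the right side.

```python
import numpy as np

rng = np.random.default_rng(0)
phi = rng.normal(size=(10000, 5))                 # samples of Phi(x)
w_tr, b_tr = rng.normal(size=5), 0.3
w, b = w_tr + 0.1 * rng.normal(size=5), 0.25      # a slightly perturbed model

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
p = sigmoid(phi @ w_tr + b_tr)                    # p(y=+1 | x, f_tr)
q = sigmoid(phi @ w + b)                          # p(y=+1 | x, f)
kl = (p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))).mean()

R = np.linalg.norm(phi, axis=1).mean()            # R = E[||Phi(x)||]
bound = R * np.linalg.norm(w - w_tr) + abs(b - b_tr)
print(kl, bound)                                  # the estimated KL stays below the bound
```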
18
Roadmap
Introduction Theoretical results
A Bayesian fidelity prior for adaptation Generalization error bounds
Regularized adaptation algorithms SVM and MLP adaptation Experiments on vowel and object classification
Learning and adaptation in the Vocal Joystick Conclusions and future work
19
Occam’s Razor Bound for Adaptation
(McAllester 98) For a countable function space and any prior $\pi(f)$:

$$\Pr\left\{ \forall f \in F:\; R(f) \le R_{\mathrm{emp}}(f) + \sqrt{\frac{-\ln \pi(f) + \ln \frac{1}{\delta}}{2m}} \right\} \ge 1 - \delta$$

Applying the fidelity prior for adaptation, the complexity term becomes

$$\Phi(f, m, \delta) = \sqrt{\frac{\beta\, D\big(p(x, y \mid f^{tr}) \,\|\, p(x, y \mid f)\big) - \ln \pi(f) + \ln \frac{1}{\delta}}{2m}}$$
20
[Figure: the complexity term $\Phi(f, m, \delta)$ plotted against sample size m, comparing the bound using the standard prior with bounds using the fidelity prior; the fidelity-prior bounds degrade toward the standard-prior bound as the KL-divergence between training and target distributions increases.]
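To make the trend concrete, here is a small Python sketch evaluating the complexity term above for hypothetical values of $\beta$, $\pi(f)$ and $\delta$; it shows the gap shrinking with m and growing with the KL-divergence.

```python
import numpy as np

def phi_gap(kl, m, beta=1.0, log_prior=-10.0, delta=0.05):
    """Complexity term: sqrt((beta*KL - ln pi(f) + ln(1/delta)) / (2m))."""
    return np.sqrt((beta * kl - log_prior + np.log(1.0 / delta)) / (2.0 * m))

for m in (100, 1000, 10000):
    # Columns: small vs. large divergence between training and target models.
    print(m, round(phi_gap(0.1, m), 3), round(phi_gap(5.0, m), 3))
```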
21
PAC-Bayesian Bounds for Adaptation
(McAllester 98) For both countable and uncountable function spaces, for any prior $p(f)$ and posterior $q(f)$:

$$\Pr\left\{ \forall q:\; \mathrm{E}_{f \sim q(f)}[R(f)] \le \mathrm{E}_{f \sim q(f)}[R_{\mathrm{emp}}(f)] + \sqrt{\frac{D(q(f) \,\|\, p(f)) + \ln m + \ln \frac{1}{\delta} + 2}{2m - 1}} \right\} \ge 1 - \delta$$

Choice of prior $p(f)$ and posterior $q(f)$
- Use $p_{\mathrm{fid}}(f)$ or its related forms as the prior
- Choose $q(f)$ to have the same parametric form

Examples
- Gaussian models: $D(q(f) \,\|\, p(f)) = (\mu' - \mu^{tr})^T \Sigma^{-1} (\mu' - \mu^{tr}) / 2$
- Linear classifiers: $D(q(f) \,\|\, p(f)) = \|w' - w^{tr}\|^2 / 2$
22
Roadmap
- Introduction
- Theoretical results: a Bayesian fidelity prior for adaptation; generalization error bounds
- Regularized adaptation algorithms: SVM and MLP adaptation; experiments on vowel and object classification
- Application to the Vocal Joystick
- Conclusions and future work
23
Algorithms Derived from the Fidelity Prior
Generative models (relates to MAP adaptation):

$$\min_{f \in F}\; R_{\mathrm{emp}}(f) + \lambda\, D\big(p(x, y \mid f^{tr}) \,\|\, p(x, y \mid f)\big)$$

Conditional models:

$$\min_{f \in F}\; R_{\mathrm{emp}}(f) + \lambda\, D\big(p(y \mid x, f^{tr}) \,\|\, p(y \mid x, f)\big)$$

In particular, for log-linear models (a code sketch follows below):

$$\min_{f \in F}\; R_{\mathrm{emp}}(f) + \lambda_1 \|w - w^{tr}\|^2 + \lambda_2 |b - b^{tr}|^2$$

We focus on SVMs and MLPs.
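To ground the log-linear case, here is a minimal Python sketch of regularized adaptation for a binary logistic-regression model: gradient descent on the log loss plus quadratic penalties pulling the parameters toward the unadapted model. Function and variable names are hypothetical, not from the thesis.

```python
import numpy as np

def adapt_logistic(X, y, w_tr, b_tr, lam1=1.0, lam2=1.0, lr=0.1, steps=500):
    """Minimize R_emp(f) + lam1*||w - w_tr||^2 + lam2*|b - b_tr|^2,
    where R_emp is the mean log loss and labels y are in {-1, +1}."""
    w, b = w_tr.copy(), float(b_tr)
    m = len(y)
    for _ in range(steps):
        margins = y * (X @ w + b)
        g = -y / (1.0 + np.exp(margins))          # d(log loss)/d(score), per sample
        w -= lr * ((X.T @ g) / m + 2 * lam1 * (w - w_tr))
        b -= lr * (g.mean() + 2 * lam2 * (b - b_tr))
    return w, b
```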
24
Regularized SVM Adaptation
Optimization objective (with the usual slack-penalty coefficient C):

$$\min_{w, b, \xi}\; \frac{1}{2} \|w - w^{tr}\|^2 + \frac{C}{m} \sum_{i=1}^{m} \xi_i$$
$$\text{s.t.}\quad y_i \big(w^T \Phi(x_i) + b\big) \ge 1 - \xi_i, \qquad \xi_i \ge 0$$

Globally optimal solution (sketched in code below)

$$f(x) = \sum_{i=1}^{|SV^{tr}|} \alpha_i^{tr}\, y_i^{tr}\, k(x_i^{tr}, x) + \sum_{i=1}^{m} \alpha_i\, y_i\, k(x_i, x)$$

- Regularized: fix the old support vectors and their coefficients
- Extended regularized: update the coefficients of the old support vectors as well, with two separate constraints in the dual space
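The adapted decision function combines kernel terms from the old support vectors with terms learned on the adaptation data; a minimal Python sketch (hypothetical names; the coefficients $\alpha_i$ would come from solving the regularized dual) is:

```python
import numpy as np

def rbf(A, B, sigma=10.0):
    """RBF kernel matrix between the rows of A and the rows of B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def adapted_decision(X, sv_tr, alpha_tr, y_tr, sv_ad, alpha_ad, y_ad, b):
    """f(x) = sum_i alpha_tr_i y_tr_i k(sv_tr_i, x)
            + sum_i alpha_ad_i y_ad_i k(sv_ad_i, x) + b.
    In the regularized variant the old coefficients alpha_tr stay fixed."""
    return rbf(X, sv_tr) @ (alpha_tr * y_tr) + rbf(X, sv_ad) @ (alpha_ad * y_ad) + b
```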
25
Algorithms in comparison
- Unadapted
- Retrained: use the adaptation data only
- Boosted (Matic 93): select adaptation data misclassified by the unadapted model; combine with the old support vectors
- Bootstrapped (proposed in this thesis): train a seed classifier using the adaptation data only; select old support vectors correctly classified by the seed classifier; combine with the adaptation data
- Regularized and extended regularized
26
Regularized MLP Adaptation
Optimization objective for a multi-class, two-layer MLP

$$\min_{W_{h2o},\, W_{i2h}}\; \frac{\lambda_1}{2} \|W_{h2o} - W_{h2o}^{tr}\|^2 + \frac{\lambda_2}{2} \|W_{i2h} - W_{i2h}^{tr}\|^2 + R_{\mathrm{emp}}(f)$$

- $W_{h2o}$ and $W_{i2h}$: the hidden-to-output and input-to-hidden weight matrices respectively
- $\|W\|^2 = \mathrm{tr}(W^T W)$
- $R_{\mathrm{emp}}(f)$: cross-entropy, corresponding to the logistic loss

A locally optimal solution is found using back-propagation (a code sketch follows below).
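A minimal PyTorch sketch of this objective (PyTorch and all names here are my assumptions for illustration; the thesis predates it) fine-tunes a copy of the unadapted network while penalizing movement of each weight matrix:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def adapt_mlp(mlp, mlp_tr, X, y, lam1=1e-2, lam2=1e-2, lr=0.1, steps=200):
    """Back-propagation on cross-entropy plus quadratic penalties pulling each
    layer toward the unadapted weights. Both networks are
    nn.Sequential(nn.Linear(d, h), nn.Sigmoid(), nn.Linear(h, c)),
    and `mlp` starts as a copy of the unadapted `mlp_tr`."""
    opt = torch.optim.SGD(mlp.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = F.cross_entropy(mlp(X), y)
        loss = loss + 0.5 * lam2 * (mlp[0].weight - mlp_tr[0].weight.detach()).pow(2).sum()
        loss = loss + 0.5 * lam1 * (mlp[2].weight - mlp_tr[2].weight.detach()).pow(2).sum()
        loss.backward()
        opt.step()
    return mlp
```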
27
Algorithms in comparison
- Unadapted
- Retrained: start from randomly initialized weights and train with weight decay
- Linear input network (Neto 95): add a linear transformation in the input space
- Retrained speaker-independent (Neto 95): start from the unadapted model; train both layers
- Retrained last layer (Baxter 95, Caruana 97, Stadermann 05): start from the unadapted model; only train the last layer
- Retrained first layer (proposed in this thesis): start from the unadapted model; only train the first layer
- Regularized

Many of the algorithms above can be considered special cases of the regularized algorithm.
28
Roadmap
- Introduction
- Theoretical results: a Bayesian fidelity prior for adaptation; generalization error bounds
- Regularized adaptation algorithms: SVM and MLP adaptation; experiments on vowel and object classification
- Application to the Vocal Joystick
- Conclusions and future work
29
Experimental Paradigm
Procedure
- Train an unadapted model on the training set
- Adapt (with supervision) and evaluate via n-fold cross-validation on the test set
- Select regularization coefficients on the dev set

Corpora
- VJ vowel dataset (Kilanski 06)
- NORB image dataset (LeCun 04)
30
VJ Vowel Dataset

Task
- 8 vowel classes
- Frame-level classification error rate
- Speaker adaptation

Data allocation
- Training set: 21 speakers, 420K samples (for SVMs, we randomly selected 80K samples for training)
- Test set: 10 speakers, 200K samples
- Dev set: 4 speakers, 80K samples

Features
- 182 dimensions: 7 frames of MFCC + delta features
31
SVM Adaptation
- RBF kernel (std = 10) optimized for training and fixed for adaptation
- Mean and std. dev. of frame-level error rates (%) over 10 speakers

# adapt. samples per speaker:   1K            2K            3K
Unadapted                       38.21 ± 4.68  38.21 ± 4.68  38.21 ± 4.68
Retrained                       24.70 ± 2.47  18.94 ± 1.52  14.00 ± 1.40
Boosted                         29.66 ± 4.60  26.54 ± 2.49  28.85 ± 2.03
Bootstrapped                    26.16 ± 5.07  19.24 ± 1.32  14.41 ± 1.26
Regularized                     23.28 ± 4.21  19.01 ± 1.30  15.00 ± 1.41
Ext. regularized                28.55 ± 4.99  25.38 ± 2.49  20.36 ± 2.08

Entries in red (in the original slides) are the best results and those not significantly different from the best at the p < 0.001 level.
32
MLP Adaptation (I)
- 50 hidden nodes
- Mean and std. dev. of frame-level error rates (%) over 10 speakers

# adapt. samples per speaker:   1K            2K            3K
Unadapted                       32.03 ± 3.76  32.03 ± 3.76  32.03 ± 3.76
Retrained (reg.)                14.21 ± 2.50  10.06 ± 3.15  9.09 ± 3.92
Linear input                    13.52 ± 2.22  11.81 ± 2.33  11.96 ± 1.77
Retrained SI                    12.15 ± 2.70  9.64 ± 2.74   7.88 ± 2.49
Retrained last                  15.45 ± 2.75  13.32 ± 2.46  11.40 ± 2.37
Retrained first                 11.56 ± 2.09  9.12 ± 3.08   7.35 ± 2.26
Regularized                     11.56 ± 2.09  8.16 ± 2.60   7.30 ± 2.40
33
MLP Adaptation (II)
- Varying the number of vowel classes available in the adaptation data

[Figure: error rate (%) on all vowel classes vs. number of vowel classes (1 to 8) used for adaptation, comparing linear input net, retrained SI, retrained last, retrained first, and regularized.]
34
NORB Image Dataset
Task
- 5 object classes
- Classify each image
- Lighting-condition adaptation

Data allocation
- Training set: 2700 samples
- Test set: 2700 samples

Features
- 32x32 raw images
35
SVM Adaptation
- RBF kernel (std = 500) optimized for training and fixed for adaptation
- Mean and std. dev. of classification error rates (%) over 6 lighting conditions

# adapt. samples per lighting condition:   90          180
Unadapted                                  12.5 ± 2.7  12.5 ± 2.7
Retrained                                  30.1 ± 3.5  18.9 ± 2.5
Boosted                                    11.5 ± 2.0  10.7 ± 1.5
Bootstrapped                               21.9 ± 4.8  14.9 ± 2.5
Regularized                                14.8 ± 1.9  13.4 ± 1.9
Ext. regularized                           11.0 ± 1.7  10.4 ± 1.9

Entries in red (in the original slides) are the best results and those not significantly different from the best at the p < 0.001 level.
36
MLP Adaptation
- 30 hidden nodes
- Mean and std. dev. of classification error rates (%) over 6 lighting conditions

# adapt. samples per lighting condition:   90           180
Unadapted                                  21.4 ± 6.6   21.4 ± 6.6
Retrained (reg.)                           41.9 ± 15.8  32.5 ± 12.8
Retrained SI                               19.6 ± 4.3   18.6 ± 4.3
Regularized                                19.2 ± 3.4   18.0 ± 3.7
37
Roadmap
- Introduction
- Theoretical results: a Bayesian fidelity prior for adaptation; generalization error bounds
- Regularized adaptation algorithms: SVM and MLP adaptation; experiments on vowel and object classification
- Application to the Vocal Joystick
- Conclusions and future work
38
Why the Vocal Joystick
Hands-free computer interfaces
- Eye-gaze tracking: high cost
- Head-mouse or chin-joystick: low bandwidth
- Speech interfaces: no support for continuous control

The Vocal Joystick (Bilmes 05)
- Controls both continuous and discrete tasks
- Utilizes intensity, vowel quality, pitch and discrete sound identity
- The VJ-mouse control scheme

Joint work with graduate students J. Malkin, S. Harada and K. Kilanski, and faculty members J. Bilmes, R. Wright, K. Kirchhoff, J. Landay, P. Dowden and H. Chizeck.
39
40
The Pattern Recognition Module
[Block diagram: input features (intensity, zero-crossing, autocovariance, MFCCs) pass through an adaptive filter and feed three components: pitch tracking (a graphical model) outputting pitch; vowel detection and vowel classification (a two-layer MLP with regularized MLP adaptation) outputting vowel posteriors; and discrete sound detection and recognition (phoneme HMMs with regularized GMM adaptation) outputting the discrete sound ID.]
41
Vowel Classifier Adaptation
Classifier
- Two-layer MLP with 4 or 8 outputs
- Trained on the VJ vowel dataset

Adaptation
- Motivation: overlap between vowel classes as articulated by different speakers
- Regularized adaptation for MLPs
- Improvements over the unadapted classifier (with 2 seconds of adaptation data per vowel): 4-class: 7.6% → 0.2%; 8-class: 32.0% → 8.2%
42
Discrete Sound Recognizer Adaptation
Recognizer
- Phoneme HMMs
- Trained on TIMIT

Rejection
- "In-vocabulary" sounds: unvoiced consonants
- Garbage utterances: vowels, breathing, extraneous speech
- Heuristics for rejection: duration, zero-crossing, pitch, posteriors

Adaptation
- Motivation: consonant articulation mismatch between the TIMIT data and the VJ application
- Regularized adaptation for GMMs (see the sketch below)
- Obtains better rejection thresholds
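For the GMM case, regularized adaptation relates to MAP adaptation (slide 23); a minimal Python sketch of a MAP-style mean update (hypothetical function and parameter names, cf. Gauvain 94) is:

```python
import numpy as np

def map_adapt_means(mu_tr, X, resp, tau=10.0):
    """Interpolate each unadapted GMM mean toward the posterior-weighted
    sample mean of the adaptation data. resp[i, k] = p(component k | x_i);
    tau controls how strongly the unadapted means are trusted."""
    n_k = resp.sum(axis=0)                                   # soft counts per component
    x_bar = (resp.T @ X) / np.maximum(n_k, 1e-8)[:, None]    # data means per component
    w = (n_k / (n_k + tau))[:, None]                         # per-component data weight
    return w * x_bar + (1 - w) * mu_tr
```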
43
Conclusions and Future Work
Key contributions
- A Bayesian fidelity prior and generalization error bounds
- Regularized adaptation algorithms derived from the fidelity prior
- Application to the Vocal Joystick

Future work
- Theory: adaptation error bounds for classifiers in an uncountable function space
- Algorithms: kernel adaptation
- The Vocal Joystick: discrete sound data collection and training a better baseline
44
Thank You