An Overview on Semi-Supervised Learning Methods
Matthias Seeger, MPI for Biological Cybernetics
Tuebingen, Germany
Overview
The SSL Problem
Paradigms for SSL: Examples
The Importance of Input-Dependent Regularization
Note: Citations omitted here (given in my literature review)
Semi-Supervised Learning
SSL is Supervised Learning...
Goal: Estimate P(y|x) from labeled data Dl = {(xi, yi)}
But: An additional source tells us about P(x) (e.g., unlabeled data Du = {xj})
[Figure: graphical model of x and y]
The interesting case: far more unlabeled than labeled data
Obvious Baseline Methods
Do not use the info about P(x): plain supervised learning
Fit a mixture model using unsupervised learning, then "label up" the components using {yi} (see the sketch below)
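A minimal sketch of this baseline (assuming scikit-learn's GaussianMixture; the helper name is mine, not from the talk):

```python
# Sketch of the "fit a mixture, then label up components" baseline.
# Illustrative helper; assumes scikit-learn is available.
import numpy as np
from sklearn.mixture import GaussianMixture

def mixture_baseline(X_l, y_l, X_u, n_components):
    # Unsupervised fit on ALL inputs (labeled + unlabeled)
    gmm = GaussianMixture(n_components=n_components).fit(np.vstack([X_l, X_u]))
    # "Label up" each component with the majority label of the
    # labeled points it claims (-1 if it claims none)
    comp_of = gmm.predict(X_l)
    comp_label = np.array([
        np.bincount(y_l[comp_of == k]).argmax() if np.any(comp_of == k) else -1
        for k in range(n_components)])
    return lambda X: comp_label[gmm.predict(X)]

# Usage: predict = mixture_baseline(X_l, y_l, X_u, n_components=4)
```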
The Goal of SSL is To Do Better
Not uniformly and always (No Free Lunch; and yes, of course: unlabeled data can hurt)
But (as always): if our modelling and algorithmic efforts reflect true problem characteristics
The Generative Paradigm
Model class distributions P(x|y) and class priors P(y)
Implies a model for P(y|x)
and for P(x)
[Figure: graphical model for the generative paradigm]
The Joint Likelihood
Natural criterion in this context: the joint log-likelihood sum_i log P(xi, yi | θ) + λ sum_j log P(xj | θ)
Maximize using EM (idea as old as EM)
Early and recent theoretical work on asymptotic variance
Advantage: Easy to implement for standard mixture model setups
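A minimal sketch of this setup, assuming one Gaussian component per class; lam is the source weighting discussed on the next slide, and all names are illustrative:

```python
# Sketch of semi-supervised EM for a Gaussian mixture (one component
# per class). Maximizes sum_i log P(xi,yi) + lam * sum_j log P(xj).
import numpy as np
from scipy.stats import multivariate_normal

def ssl_em(X_l, y_l, X_u, n_classes, lam=1.0, n_iter=50):
    d = X_l.shape[1]
    # Initialize from the labeled data only
    pi = np.array([np.mean(y_l == c) for c in range(n_classes)])
    mu = np.array([X_l[y_l == c].mean(axis=0) for c in range(n_classes)])
    cov = np.array([np.cov(X_l[y_l == c].T) + 1e-6 * np.eye(d)
                    for c in range(n_classes)])
    for _ in range(n_iter):
        # E-step: labeled points get hard responsibilities,
        # unlabeled points get P(y|x) under the current model
        R_u = np.stack([pi[c] * multivariate_normal.pdf(X_u, mu[c], cov[c])
                        for c in range(n_classes)], axis=1)
        R_u /= R_u.sum(axis=1, keepdims=True)
        R_l = np.eye(n_classes)[y_l]
        # M-step: weighted MLE; unlabeled data down-weighted by lam
        X = np.vstack([X_l, X_u])
        R = np.vstack([R_l, lam * R_u])
        Nk = R.sum(axis=0)
        pi = Nk / Nk.sum()
        for c in range(n_classes):
            mu[c] = (R[:, c:c + 1] * X).sum(axis=0) / Nk[c]
            diff = X - mu[c]
            cov[c] = (R[:, c] * diff.T) @ diff / Nk[c] + 1e-6 * np.eye(d)
    return pi, mu, cov

def predict_proba(X, pi, mu, cov):
    """P(y|x) implied by the fitted generative model."""
    p = np.stack([pi[c] * multivariate_normal.pdf(X, mu[c], cov[c])
                  for c in range(len(pi))], axis=1)
    return p / p.sum(axis=1, keepdims=True)
```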
Drawbacks of Generative SSL
Choice of source weighting is crucial: cross-validation fails for small n; homotopy continuation (Corduneanu et al.)
Just like in supervised learning: the model for P(y|x) is specified indirectly; fitting is not primarily concerned with P(y|x)
Also: have to represent P(x) generally well, not just the aspects which help with P(y|x)
The Diagnostic Paradigm
Model P(y|x, θ) and P(x|μ) directly
But: Since θ, μ are independent a priori, θ does not depend on μ, given the data
Knowledge of μ does not influence the P(y|x) prediction in a probabilistic setup!
[Figure: graphical model for the diagnostic paradigm]
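In symbols (my reconstruction of the argument, with θ parameterizing P(y|x) and μ parameterizing P(x)):

```latex
% Prior independence P(theta, mu) = P(theta) P(mu) makes the posterior
% factorize, so the unlabeled data drops out of P(theta | data):
\begin{align*}
P(\theta, \mu \mid D_l, D_u)
  &\propto P(\theta)\, P(\mu)
    \prod_i P(y_i \mid x_i, \theta)
    \prod_{x \in D_l \cup D_u} P(x \mid \mu) \\
P(\theta \mid D_l, D_u)
  &\propto P(\theta) \prod_i P(y_i \mid x_i, \theta)
  \;\propto\; P(\theta \mid D_l)
\end{align*}
```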
What To Do About It
Non-probabilistic diagnostic techniques: replace the expected loss by a criterion based on an estimate of P(x) (Tong, Koller; Chapelle et al.)
Very limited effect if n is small; some old work (e.g., Anderson)
Drop the prior independence of θ, μ: Input-Dependent Regularization
Input-Dependent Regularization
Conditional priors P(θ|μ) make P(y|x) estimation dependent on P(x)
Now, unlabeled data can really help...
And can hurt for the same reason!
[Figure: graphical model with a conditional prior linking θ and μ]
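In symbols (again my sketch, same notation as before): with a conditional prior P(θ|μ), the θ posterior no longer separates from the unlabeled inputs:

```latex
% Unlabeled data now enters P(theta | data) through mu:
\[
P(\theta \mid D_l, D_u) \;\propto\;
  \Bigl[ \prod_i P(y_i \mid x_i, \theta) \Bigr]
  \int P(\theta \mid \mu)\,
       P\bigl(\mu \mid \{x_i\} \cup D_u\bigr)\, d\mu
\]
```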
The Cluster Assumption (CA)
Empirical Observation: Clustering of data {xj} w.r.t. “sensible” distance / features often fairly compatible with class regions
Weaker: Class regions do not tend to cut high-volume regions of P(x)
Why? Ask philosophers! My guess: selection bias for features/distance
No matter why:
Many SSL methods implement the CA and work fine in practice
Examples for IDR Using the CA
Label propagation, Gaussian random fields: regularization depends on a graph structure built from all {xj}; more smoothness in regions of high connectivity / affinity flow (see the sketch below)
Cluster kernels for SVM (Chapelle et al.)
Information regularization (Corduneanu, Jaakkola)
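A minimal sketch of label propagation in the spirit of Gaussian random fields; the RBF affinity and the clamp-and-iterate scheme are illustrative choices, not a specification from the talk:

```python
# Sketch of graph-based label propagation over an RBF affinity graph.
import numpy as np

def label_propagation(X_l, y_l, X_u, sigma=1.0, n_iter=200):
    X = np.vstack([X_l, X_u])
    n_l = len(X_l)
    # Affinity matrix W_ij = exp(-||xi - xj||^2 / (2 sigma^2)),
    # built from ALL inputs, labeled and unlabeled
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-sq / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    P = W / W.sum(axis=1, keepdims=True)   # row-stochastic transitions
    n_classes = int(y_l.max()) + 1
    F = np.zeros((len(X), n_classes))
    F[:n_l] = np.eye(n_classes)[y_l]       # clamp the labeled points
    for _ in range(n_iter):
        F = P @ F                          # diffuse labels over the graph
        F[:n_l] = np.eye(n_classes)[y_l]   # re-clamp after each step
    return F[n_l:].argmax(axis=1)          # predicted labels for X_u
```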
More Examples for IDR
Some methods do IDR, but implement the CA only in special cases:
Fisher kernels (Jaakkola et al.): kernel from Fisher features; automatic feature induction from a P(x) model (see the sketch below)
Co-Training (Blum, Mitchell): consistency across different views (features)
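To make the Fisher-kernel idea concrete, a toy sketch for a one-dimensional Gaussian P(x|θ) with θ = (mean, variance); using the identity in place of the Fisher information matrix is a common simplification, and the names are mine:

```python
# Toy Fisher-kernel sketch: features are gradients of log P(x|theta).
import numpy as np

def fisher_score(x, mean, var):
    """Gradient of log N(x; mean, var) w.r.t. (mean, var)."""
    d_mean = (x - mean) / var
    d_var = ((x - mean) ** 2 - var) / (2 * var ** 2)
    return np.array([d_mean, d_var])

def fisher_kernel(x1, x2, mean, var):
    """K(x, x') = U_x^T U_x' (identity in place of Fisher information)."""
    return fisher_score(x1, mean, var) @ fisher_score(x2, mean, var)

# Fit theta on unlabeled data, then plug the kernel into any SVM / GP:
X_u = np.random.randn(1000) * 2.0 + 1.0
mean, var = X_u.mean(), X_u.var()
print(fisher_kernel(0.5, -0.3, mean, var))
```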
Is SSL Always Generative?
Wait: We have to model P(x) somehow. Is this not always generative then? ... No!
Generative: Model P(x|y) fairly directly; the P(y|x) model and the effect of P(x) are implicit
Diagnostic IDR: Direct model for P(y|x), more flexibility
Influence of P(x) knowledge on the P(y|x) prediction is directly controlled, e.g. through the CA
Model for P(x) can be much less elaborate
Conclusions
Gave a taxonomy for probabilistic approaches to SSL
Illustrated the paradigms with examples from the literature
Tried to clarify some points which have led to confusion in the past