An Overview on Semi-Supervised Learning Methods
Matthias Seeger, MPI for Biological Cybernetics
Tuebingen, Germany
Overview
The SSL Problem
Paradigms for SSL: Examples
The Importance of Input-Dependent Regularization
Note: Citations omitted here (given in my literature review)
Semi-Supervised Learning
SSL is Supervised Learning...
Goal: Estimate P(y|x) from labeled data Dl = {(xi, yi)}
But: An additional source tells us about P(x) (e.g., unlabeled data Du = {xj})
[Figure: graphical model of x and y]
The interesting case: far more unlabeled than labeled data
Obvious Baseline Methods
Do not use the info about P(x): plain supervised learning
Fit a mixture model using unsupervised learning, then "label up" the components using {yi} (see the sketch below)
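A minimal sketch of this baseline (assuming scikit-learn's GaussianMixture; the helper name is mine, not from the talk):

```python
# Sketch of the "fit a mixture, then label up components" baseline.
# Illustrative helper; assumes scikit-learn is available.
import numpy as np
from sklearn.mixture import GaussianMixture

def mixture_baseline(X_l, y_l, X_u, n_components):
    # Unsupervised fit on ALL inputs (labeled + unlabeled)
    gmm = GaussianMixture(n_components=n_components).fit(np.vstack([X_l, X_u]))
    # "Label up" each component with the majority label of the
    # labeled points it claims (-1 if it claims none)
    comp_of = gmm.predict(X_l)
    comp_label = np.array([
        np.bincount(y_l[comp_of == k]).argmax() if np.any(comp_of == k) else -1
        for k in range(n_components)])
    return lambda X: comp_label[gmm.predict(X)]

# Usage: predict = mixture_baseline(X_l, y_l, X_u, n_components=4)
```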
The Goal of SSL is To Do Better
Not uniformly and always (No Free Lunch; and yes, of course: unlabeled data can hurt)
But (as always): if our modelling and algorithmic efforts reflect true problem characteristics
The Generative Paradigm
Model class distributions P(x|y) and class priors P(y)
Implies a model for P(y|x)
and for P(x)
[Figure: graphical model for the generative paradigm]
The Joint Likelihood
Natural criterion in this context: the joint log-likelihood sum_i log P(xi, yi | θ) + λ sum_j log P(xj | θ)
Maximize using EM (idea as old as EM)
Early and recent theoretical work on asymptotic variance
Advantage: Easy to implement for standard mixture model setups
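A minimal sketch of this setup, assuming one Gaussian component per class; lam is the source weighting discussed on the next slide, and all names are illustrative:

```python
# Sketch of semi-supervised EM for a Gaussian mixture (one component
# per class). Maximizes sum_i log P(xi,yi) + lam * sum_j log P(xj).
import numpy as np
from scipy.stats import multivariate_normal

def ssl_em(X_l, y_l, X_u, n_classes, lam=1.0, n_iter=50):
    d = X_l.shape[1]
    # Initialize from the labeled data only
    pi = np.array([np.mean(y_l == c) for c in range(n_classes)])
    mu = np.array([X_l[y_l == c].mean(axis=0) for c in range(n_classes)])
    cov = np.array([np.cov(X_l[y_l == c].T) + 1e-6 * np.eye(d)
                    for c in range(n_classes)])
    for _ in range(n_iter):
        # E-step: labeled points get hard responsibilities,
        # unlabeled points get P(y|x) under the current model
        R_u = np.stack([pi[c] * multivariate_normal.pdf(X_u, mu[c], cov[c])
                        for c in range(n_classes)], axis=1)
        R_u /= R_u.sum(axis=1, keepdims=True)
        R_l = np.eye(n_classes)[y_l]
        # M-step: weighted MLE; unlabeled data down-weighted by lam
        X = np.vstack([X_l, X_u])
        R = np.vstack([R_l, lam * R_u])
        Nk = R.sum(axis=0)
        pi = Nk / Nk.sum()
        for c in range(n_classes):
            mu[c] = (R[:, c:c + 1] * X).sum(axis=0) / Nk[c]
            diff = X - mu[c]
            cov[c] = (R[:, c] * diff.T) @ diff / Nk[c] + 1e-6 * np.eye(d)
    return pi, mu, cov

def predict_proba(X, pi, mu, cov):
    """P(y|x) implied by the fitted generative model."""
    p = np.stack([pi[c] * multivariate_normal.pdf(X, mu[c], cov[c])
                  for c in range(len(pi))], axis=1)
    return p / p.sum(axis=1, keepdims=True)
```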
Drawbacks of Generative SSL
Choice of source weighting is crucial: cross-validation fails for small n; homotopy continuation (Corduneanu et al.)
Just like in supervised learning: the model for P(y|x) is specified indirectly; fitting is not primarily concerned with P(y|x)
Also: have to represent P(x) generally well, not just the aspects which help with P(y|x)
The Diagnostic Paradigm
Model P(y|x, θ) and P(x|μ) directly
But: Since θ, μ are independent a priori, θ does not depend on μ, given the data
Knowledge of μ does not influence the P(y|x) prediction in a probabilistic setup!
[Figure: graphical model for the diagnostic paradigm]
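In symbols (my reconstruction of the argument, with θ parameterizing P(y|x) and μ parameterizing P(x)):

```latex
% Prior independence P(theta, mu) = P(theta) P(mu) makes the posterior
% factorize, so the unlabeled data drops out of P(theta | data):
\begin{align*}
P(\theta, \mu \mid D_l, D_u)
  &\propto P(\theta)\, P(\mu)
    \prod_i P(y_i \mid x_i, \theta)
    \prod_{x \in D_l \cup D_u} P(x \mid \mu) \\
P(\theta \mid D_l, D_u)
  &\propto P(\theta) \prod_i P(y_i \mid x_i, \theta)
  \;\propto\; P(\theta \mid D_l)
\end{align*}
```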
What To Do About It
Non-probabilistic diagnostic techniques: replace the expected loss by a criterion based on an estimate of P(x) (Tong, Koller; Chapelle et al.)
Very limited effect if n is small; some old work (e.g., Anderson)
Drop the prior independence of θ, μ: Input-Dependent Regularization
Input-Dependent Regularization
Conditional priors P(θ|μ) make P(y|x) estimation dependent on P(x)
Now, unlabeled data can really help...
And can hurt for the same reason!
[Figure: graphical model with a conditional prior linking θ and μ]
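In symbols (again my sketch, same notation as before): with a conditional prior P(θ|μ), the θ posterior no longer separates from the unlabeled inputs:

```latex
% Unlabeled data now enters P(theta | data) through mu:
\[
P(\theta \mid D_l, D_u) \;\propto\;
  \Bigl[ \prod_i P(y_i \mid x_i, \theta) \Bigr]
  \int P(\theta \mid \mu)\,
       P\bigl(\mu \mid \{x_i\} \cup D_u\bigr)\, d\mu
\]
```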
The Cluster Assumption (CA)
Empirical Observation: Clustering of data {xj} w.r.t. “sensible” distance / features often fairly compatible with class regions
Weaker: Class regions do not tend to cut high-volume regions of P(x)
Why? Ask philosophers! My guess: selection bias for features/distance
No matter why:
Many SSL methods implement the CA and work fine in practice
Examples for IDR Using the CA
Label propagation, Gaussian random fields: regularization depends on a graph structure built from all {xj}; more smoothness in regions of high connectivity / affinity flow (see the sketch below)
Cluster kernels for SVM (Chapelle et al.)
Information regularization (Corduneanu, Jaakkola)
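A minimal sketch of label propagation in the spirit of Gaussian random fields; the RBF affinity and the clamp-and-iterate scheme are illustrative choices, not a specification from the talk:

```python
# Sketch of graph-based label propagation over an RBF affinity graph.
import numpy as np

def label_propagation(X_l, y_l, X_u, sigma=1.0, n_iter=200):
    X = np.vstack([X_l, X_u])
    n_l = len(X_l)
    # Affinity matrix W_ij = exp(-||xi - xj||^2 / (2 sigma^2)),
    # built from ALL inputs, labeled and unlabeled
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-sq / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    P = W / W.sum(axis=1, keepdims=True)   # row-stochastic transitions
    n_classes = int(y_l.max()) + 1
    F = np.zeros((len(X), n_classes))
    F[:n_l] = np.eye(n_classes)[y_l]       # clamp the labeled points
    for _ in range(n_iter):
        F = P @ F                          # diffuse labels over the graph
        F[:n_l] = np.eye(n_classes)[y_l]   # re-clamp after each step
    return F[n_l:].argmax(axis=1)          # predicted labels for X_u
```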
More Examples for IDR
Some methods do IDR, but implement the CA only in special cases:
Fisher kernels (Jaakkola et al.): kernel from Fisher features; automatic feature induction from a P(x) model (see the sketch below)
Co-Training (Blum, Mitchell): consistency across different views (features)
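To make the Fisher-kernel idea concrete, a toy sketch for a one-dimensional Gaussian P(x|θ) with θ = (mean, variance); using the identity in place of the Fisher information matrix is a common simplification, and the names are mine:

```python
# Toy Fisher-kernel sketch: features are gradients of log P(x|theta).
import numpy as np

def fisher_score(x, mean, var):
    """Gradient of log N(x; mean, var) w.r.t. (mean, var)."""
    d_mean = (x - mean) / var
    d_var = ((x - mean) ** 2 - var) / (2 * var ** 2)
    return np.array([d_mean, d_var])

def fisher_kernel(x1, x2, mean, var):
    """K(x, x') = U_x^T U_x' (identity in place of Fisher information)."""
    return fisher_score(x1, mean, var) @ fisher_score(x2, mean, var)

# Fit theta on unlabeled data, then plug the kernel into any SVM / GP:
X_u = np.random.randn(1000) * 2.0 + 1.0
mean, var = X_u.mean(), X_u.var()
print(fisher_kernel(0.5, -0.3, mean, var))
```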
Is SSL Always Generative?
Wait: We have to model P(x) somehow. Is this not always generative then? ... No!
Generative: Model P(x|y) fairly directly; the P(y|x) model and the effect of P(x) are implicit
Diagnostic IDR: Direct model for P(y|x), more flexibility
Influence of P(x) knowledge on the P(y|x) prediction is directly controlled, e.g. through the CA
Model for P(x) can be much less elaborate
Conclusions
Gave a taxonomy for probabilistic approaches to SSL
Illustrated the paradigms with examples from the literature
Tried to clarify some points which have led to confusion in the past