An Overview on Semi-Supervised Learning Methods Matthias Seeger MPI for Biological Cybernetics Tuebingen, Germany


TRANSCRIPT

Page 1:

An Overview on Semi-Supervised Learning Methods

Matthias Seeger
MPI for Biological Cybernetics
Tuebingen, Germany

Page 2:

Overview

- The SSL Problem
- Paradigms for SSL, with examples
- The Importance of Input-Dependent Regularization

Note: Citations omitted here (given in my literature review)

Page 3:

Semi-Supervised Learning

SSL is Supervised Learning...

- Goal: Estimate P(y|x) from labeled data D_l = {(x_i, y_i)}
- But: An additional source tells us about P(x) (e.g., unlabeled data D_u = {x_j})

The Interesting Case:

[Figure: graphical model over x and y]

Page 4:

Obvious Baseline Methods

- Do not use info about P(x): Supervised Learning
- Fit a Mixture Model using Unsupervised Learning, then "label up" components using {y_i} (see the sketch below)

The Goal of SSL is To Do Better

- Not: Uniformly and always (No Free Lunch; and yes, of course: unlabeled data can hurt)
- But (as always): If our modelling and algorithmic efforts reflect true problem characteristics
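
To make the mixture baseline concrete, here is a minimal sketch in Python; scikit-learn and the toy blob data are my own assumptions, not from the talk. The mixture is fit on all inputs by unsupervised learning, then each component is "labeled up" by majority vote over the few labeled points it claims.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical toy data: two Gaussian blobs, only 5 labels per class.
rng = np.random.default_rng(0)
X0 = rng.normal(loc=[-2.0, 0.0], size=(250, 2))  # class 0 inputs
X1 = rng.normal(loc=[+2.0, 0.0], size=(250, 2))  # class 1 inputs
X_labeled = np.vstack([X0[:5], X1[:5]])
y_labeled = np.array([0] * 5 + [1] * 5)
X_unlabeled = np.vstack([X0[5:], X1[5:]])

# Unsupervised step: fit the mixture on ALL inputs (uses P(x) only).
X_all = np.vstack([X_labeled, X_unlabeled])
gmm = GaussianMixture(n_components=2, random_state=0).fit(X_all)

# "Label up" each component: majority vote of the labeled points in it.
comp = gmm.predict(X_labeled)
comp_to_class = {
    k: np.bincount(y_labeled[comp == k], minlength=2).argmax()
    for k in range(gmm.n_components)
}

def predict(X):
    """Class prediction via the most likely mixture component."""
    return np.array([comp_to_class[k] for k in gmm.predict(X)])

print(predict(X_unlabeled[:5]))
```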

Page 5:

The Generative Paradigm

- Model class distributions P(x|y) and class priors P(y)
- This implies a model for P(y|x), and for P(x)

[Figure: graphical model over x and y]

Page 6:

The Joint Likelihood

Natural criterion in this context: the joint log-likelihood of labeled and unlabeled data,

  sum_i log P(x_i, y_i | θ) + sum_j log P(x_j | θ)

- Maximize using EM (idea as old as EM)
- Early and recent theoretical work on asymptotic variance
- Advantage: Easy to implement for standard mixture model setups
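
A minimal sketch of this EM in Python/NumPy, under assumptions of my own (one spherical Gaussian per class, fixed variance, hypothetical function names): labeled points enter with clamped one-hot responsibilities, unlabeled points with posterior responsibilities, so both terms of the joint likelihood are increased.

```python
import numpy as np

def spherical_gauss(X, mean, var):
    """Density of an isotropic Gaussian, evaluated row-wise."""
    d = X.shape[1]
    norm = (2.0 * np.pi * var) ** (-d / 2.0)
    return norm * np.exp(-np.sum((X - mean) ** 2, axis=1) / (2.0 * var))

def ssl_em(X_l, y_l, X_u, n_classes=2, n_iter=50, var=1.0):
    """Maximize sum_i log P(x_i, y_i) + sum_j log P(x_j) by EM."""
    X = np.vstack([X_l, X_u])
    n_l = len(X_l)
    # Responsibilities: clamped one-hot rows for labeled points.
    R = np.full((len(X), n_classes), 1.0 / n_classes)
    R[:n_l] = np.eye(n_classes)[y_l]
    for _ in range(n_iter):
        # M-step: class priors and means from current responsibilities.
        pi = R.mean(axis=0)
        means = (R.T @ X) / R.sum(axis=0)[:, None]
        # E-step: posterior responsibilities for UNLABELED points only;
        # the labeled block of R stays clamped at the observed labels.
        lik = np.stack([pi[k] * spherical_gauss(X_u, means[k], var)
                        for k in range(n_classes)], axis=1)
        R[n_l:] = lik / lik.sum(axis=1, keepdims=True)
    return pi, means
```

The source weighting whose choice the next slide calls crucial would enter here by down-weighting the unlabeled rows of R.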

Page 7:

Drawbacks of Generative SSL

- Choice of source weighting is crucial
  - Cross-validation fails for small n
  - Homotopy continuation (Corduneanu et al.)
- Just like in supervised learning: the model for P(y|x) is specified indirectly; fitting is not primarily concerned with P(y|x)
- Also: Have to represent P(x) generally well, not just the aspects which help with P(y|x)

Page 8:

The Diagnostic Paradigm

- Model P(y|x, θ) and P(x|μ) directly
- But: Since θ, μ are independent a priori, θ does not depend on μ, given the data
- Knowledge of μ does not influence the P(y|x) prediction in a probabilistic setup!

[Figure: graphical model over x and y, with parameters θ and μ]
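
To spell the argument out (a one-line derivation of my own, not on the slide): under the prior P(θ)P(μ), the likelihood and prior both factorize over θ and μ, so the θ-posterior never sees D_u.

```latex
P(\theta,\mu \mid D_l, D_u)
  \;\propto\;
  \underbrace{\Big[\textstyle\prod_i P(y_i \mid x_i,\theta)\Big] P(\theta)}_{\text{depends on labels only}}
  \cdot
  \underbrace{\Big[\textstyle\prod_i P(x_i \mid \mu) \prod_j P(x_j \mid \mu)\Big] P(\mu)}_{\text{depends on inputs only}}
\quad\Longrightarrow\quad
P(\theta \mid D_l, D_u) = P(\theta \mid D_l).
```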

Page 9:

What To Do About It

- Non-probabilistic diagnostic techniques
- Replace the expected loss by one taken under a density estimate of P(x) (Tong, Koller; Chapelle et al.)
  - Very limited effect if n is small
  - Some old work (e.g., Anderson)
- Drop the prior independence of θ, μ: Input-Dependent Regularization

Page 10:

Input-Dependent Regularization

- Conditional priors P(θ|μ) make P(y|x) estimation dependent on P(x)
- Now, unlabeled data can really help...
- And can hurt for the same reason!

[Figure: graphical model over x and y, with θ dependent on μ]

Page 11:

The Cluster Assumption (CA)

- Empirical Observation: Clustering of data {x_j} w.r.t. a “sensible” distance / features is often fairly compatible with class regions
- Weaker: Class regions do not tend to cut high-volume regions of P(x)
- Why? Ask philosophers! My guess: Selection bias for features/distance

No Matter Why:

Many SSL methods implement the CA and work fine in practice

Page 12:

Examples for IDR Using the CA

- Label Propagation, Gaussian Random Fields: Regularization depends on a graph structure built from all {x_j}; more smoothness in regions of high connectivity / affinity flows (see the sketch below)
- Cluster kernels for SVM (Chapelle et al.)
- Information Regularization (Corduneanu, Jaakkola)
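
Here is a minimal label-propagation sketch in Python/NumPy; the dense RBF affinity graph and the bandwidth sigma are my own illustrative choices, not from the talk. Beliefs are averaged over graph neighbours each iteration while labeled nodes stay clamped, so smoothing is strongest where the inputs {x_j} make the graph dense.

```python
import numpy as np

def label_propagation(X_l, y_l, X_u, sigma=1.0, n_iter=200, n_classes=2):
    """Propagate label beliefs over an affinity graph built from ALL inputs."""
    X = np.vstack([X_l, X_u])
    n_l = len(X_l)
    # RBF affinities: the regularizer lives in this graph, and the graph
    # is determined by the inputs alone -- this is where P(x) enters.
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-sq / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    P = W / W.sum(axis=1, keepdims=True)   # row-stochastic transitions
    Y = np.zeros((len(X), n_classes))
    Y[:n_l] = np.eye(n_classes)[y_l]
    for _ in range(n_iter):
        Y = P @ Y                          # smooth beliefs along edges
        Y[:n_l] = np.eye(n_classes)[y_l]   # clamp the labeled nodes
    return Y[n_l:].argmax(axis=1)          # hard labels for D_u
```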

Page 13:

More Examples for IDR

Some methods do IDR, but implement the CA only in special cases:

- Fisher Kernels (Jaakkola et al.): Kernel from Fisher features; automatic feature induction from a P(x) model
- Co-Training (Blum, Mitchell): Consistency across different views (features), as sketched below
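
And a co-training sketch in Python; the scikit-learn classifiers and all parameter names are my own assumptions. Each view's classifier pseudo-labels the unlabeled points it is most confident about, and those points join the shared training set, which is what couples the two views.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def co_training(X1_l, X2_l, y_l, X1_u, X2_u, rounds=10, k=5):
    """Blum-Mitchell style co-training over two feature views.

    X1_*/X2_* hold view 1 / view 2 features of the same points."""
    pool = list(range(len(X1_u)))  # indices of still-unlabeled points
    clf1, clf2 = LogisticRegression(), LogisticRegression()
    for _ in range(rounds):
        clf1.fit(X1_l, y_l)
        clf2.fit(X2_l, y_l)
        for clf, X_view in ((clf1, X1_u), (clf2, X2_u)):
            if not pool:
                break
            idx = np.array(pool)
            conf = clf.predict_proba(X_view[idx]).max(axis=1)
            top = idx[np.argsort(-conf)[:k]]   # most confident picks
            y_new = clf.predict(X_view[top])   # this view's pseudo-labels
            # The pseudo-labeled points extend BOTH views' training sets.
            X1_l = np.vstack([X1_l, X1_u[top]])
            X2_l = np.vstack([X2_l, X2_u[top]])
            y_l = np.concatenate([y_l, y_new])
            pool = [i for i in pool if i not in set(top.tolist())]
    return clf1, clf2
```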

Page 14:

Is SSL Always Generative?

Wait: We have to model P(x) somehow. Is this not always generative then? ... No!

- Generative: Model P(x|y) fairly directly; the P(y|x) model and the effect of P(x) are implicit
- Diagnostic IDR: Direct model for P(y|x), more flexibility
  - Influence of P(x) knowledge on the P(y|x) prediction is directly controlled, e.g. through the CA
  - Model for P(x) can be much less elaborate

Page 15:

Conclusions

- Gave a taxonomy for probabilistic approaches to SSL
- Illustrated the paradigms with examples from the literature
- Tried to clarify some points which have led to confusion in the past