Semi-Supervised Learning in Gigantic Image Collections
Rob Fergus (NYU), Yair Weiss (Hebrew U.), Antonio Torralba (MIT)


Page 1:

Semi-Supervised Learning in Gigantic Image Collections

Rob Fergus (NYU), Yair Weiss (Hebrew U.), Antonio Torralba (MIT)

Page 2:

Gigantic Image Collections

• What does the world look like? High-level image statistics

• Object recognition for large-scale search

Page 3:

Spectrum of Label Information

Human annotations → Noisy labels → Unlabeled

Page 4:

Semi-Supervised Learning using Graph Laplacian

• Graph $G = (V, E)$: V = data points, E = n × n affinity matrix W

$W_{ij} = \exp(-\|x_i - x_j\|^2 / 2\epsilon^2) \qquad D_{ii} = \sum_j W_{ij}$

• Graph Laplacian: $\mathcal{L} = D^{-1/2}(D - W)D^{-1/2} = I - D^{-1/2} W D^{-1/2}$

[Zhu03, Zhou04]
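As an illustration, a minimal NumPy sketch of these definitions (function and variable names are hypothetical; the dense all-pairs construction is only feasible for small n, which is exactly the bottleneck discussed later):

```python
import numpy as np

def normalized_laplacian(X, eps):
    # W_ij = exp(-||x_i - x_j||^2 / 2 eps^2): dense affinity over all pairs
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    W = np.exp(-sq_dists / (2 * eps ** 2))
    d = W.sum(axis=1)                 # D_ii = sum_j W_ij
    d_inv_sqrt = 1.0 / np.sqrt(d)
    # L = I - D^{-1/2} W D^{-1/2}
    return np.eye(len(X)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
```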

Page 5:

SSL using Graph Laplacian

• Want to find the label function f that minimizes:

$J(f) = \underbrace{f^T \mathcal{L} f}_{\text{smoothness}} + \underbrace{\textstyle\sum_{i=1}^{l} \lambda\,(f(i) - y_i)^2}_{\text{agreement with labels}}$

• y = labels, λ = weights

• Rewrite as: $J(f) = f^T \mathcal{L} f + (f - y)^T \Lambda (f - y)$, where $\Lambda_{ii} = \lambda$ if labeled and $\Lambda_{ii} = 0$ if unlabeled

• Straightforward solution

Page 6:

Eigenvectors of Laplacian

• Smooth vectors will be linear combinations of eigenvectors U with small eigenvalues:

$f = U\alpha, \qquad U = [\phi_1, \ldots, \phi_k]$

[Belkin & Niyogi 06, Schoelkopf & Smola 02, Zhu et al 03, 08]

Page 7:

Rewrite System

• Let $f = U\alpha$, where U = smallest k eigenvectors of $\mathcal{L}$ and α = coefficients:

$J(\alpha) = \alpha^T \Sigma \alpha + (U\alpha - y)^T \Lambda (U\alpha - y)$

($\Sigma$ is the diagonal matrix of the corresponding eigenvalues)

• Optimal $\alpha$ is now the solution to a k × k system:

$(\Sigma + U^T \Lambda U)\,\alpha = U^T \Lambda y$
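A minimal sketch of this reduced solve, assuming U and its eigenvalues are already available (names hypothetical):

```python
import numpy as np

def solve_reduced(U, Sigma, y, labeled, lam):
    # U: (n, k) smallest-k eigenvectors; Sigma: (k,) their eigenvalues
    # labeled: boolean mask; y: labels (0 where unlabeled); lam: weight
    Lam = np.where(labeled, lam, 0.0)               # diagonal of Lambda
    A = np.diag(Sigma) + (U * Lam[:, None]).T @ U   # Sigma + U^T Lambda U
    b = U.T @ (Lam * y)                             # U^T Lambda y
    alpha = np.linalg.solve(A, b)                   # only a k x k system
    return U @ alpha                                # label function f
```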

Page 8:

Computational Bottleneck

• Consider a dataset of 80 million images

• Inverting $\mathcal{L}$: inverting an 80 million × 80 million matrix

• Finding eigenvectors of $\mathcal{L}$: diagonalizing an 80 million × 80 million matrix

Page 9:

Large Scale SSL – Related Work

• Nystrom method: pick a small set of landmark points
  – Compute exact solution on these
  – Interpolate solution to the rest

• Others iteratively use classifiers to label data
  – E.g. boosting-based method of Loeff et al., ICML'08

[see Zhu '08 survey]

[Figure: data points and landmarks]

Page 10:

Our Approach

Page 11:

Overview of Our Approach

[Diagram: Nystrom reduces n (Data → Landmarks); ours takes the limit as n → ∞ (Data → Density)]

Page 12:

Consider Limit as n → ∞

• Consider x to be drawn from a 2D distribution p(x)

• Let $L_p(F)$ be a smoothness operator on p(x), for a function F(x):

$L_p(F) = \tfrac{1}{2} \iint \big(F(x_1) - F(x_2)\big)^2\, W(x_1, x_2)\, p(x_1)\, p(x_2)\, dx_1\, dx_2$

where $W(x_1, x_2) = \exp(-\|x_1 - x_2\|^2 / 2\epsilon^2)$

• Analyze eigenfunctions of $L_p(F)$

Page 13:

Eigenvectors & Eigenfunctions

Page 14:

Key Assumption: Separability of Input Data

• Claim: if p is separable, then eigenfunctions of the marginals are also eigenfunctions of the joint density, with the same eigenvalue

[Figure: joint density p(x1, x2) and its marginals p(x1), p(x2)]

[Nadler et al. 06, Weiss et al. 08]
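A sketch of why the claim is plausible for the Gaussian affinity used here, which itself factorizes across dimensions:

```latex
W(\mathbf{x},\mathbf{y}) = e^{-\|\mathbf{x}-\mathbf{y}\|^{2}/2\epsilon^{2}}
  = \prod_{i} e^{-(x_{i}-y_{i})^{2}/2\epsilon^{2}},
\qquad
p(\mathbf{x}) = \prod_{i} p(x_{i}).
```

With both the affinity and the density factoring, the smoothness operator decomposes dimension by dimension, so an eigenfunction of a marginal, viewed as a function of the joint that ignores the other coordinates, satisfies the joint eigenproblem with the same eigenvalue.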

Page 15:

Numerical Approximations to Eigenfunctions in 1D

• 300k points drawn from distribution p(x)

• Consider p(x1)

[Figure: data drawn from p(x); marginal p(x1) and its histogram h(x1)]

Page 16:

Numerical Approximations to Eigenfunctions in 1D

• Solve for the values g of the eigenfunction at a set of discrete locations (histogram bin centers)
  – and associated eigenvalues σ
  – a B × B system (# histogram bins B = 50)

$P(\tilde{D} - \tilde{W})P\,g = \sigma\, P \hat{D}\, g$

where $\tilde{W}$ = affinity between the discrete locations, $\tilde{D} = \operatorname{diag}\big(\sum_j \tilde{W}_{ij}\big)$, $\hat{D} = \operatorname{diag}\big(\sum_j (P\tilde{W})_{ij}\big)$, and P = diag(h(x1))
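A sketch of this B × B generalized eigenproblem in NumPy/SciPy, following the equation above (function and variable names are hypothetical; the small ridge guarding against empty bins is an assumption, not part of the slides):

```python
import numpy as np
from scipy.linalg import eigh

def eigenfunctions_1d(h, centers, eps, k=4):
    # h: histogram density over B bins; centers: the B bin centers.
    # Solves  P (D~ - W~) P g = sigma P D^ g  from the slide above.
    p = h / h.sum()
    P = np.diag(p)
    # W~: affinity between the discrete locations (bin centers)
    W = np.exp(-(centers[:, None] - centers[None, :]) ** 2 / (2 * eps ** 2))
    Dt = np.diag(W.sum(axis=1))             # D~ = diag(sum_j W~_ij)
    Dh = np.diag((P @ W).sum(axis=1))       # D^ = diag(sum_j (P W~)_ij)
    A = P @ (Dt - W) @ P
    B = P @ Dh + 1e-10 * np.eye(len(h))     # ridge in case of empty bins
    sigma, g = eigh(A, B)                   # generalized symmetric problem
    return sigma[:k], g[:, :k]              # smallest eigenvalues first
```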

Page 17:

1D Approximate Eigenfunctions

• Solve $P(\tilde{D} - \tilde{W})P\,g = \sigma\, P \hat{D}\, g$

[Figure: 1st, 2nd, and 3rd eigenfunctions of h(x1)]

Page 18:

Separability over Dimension

• Build histogram over dimension 2: h(x2)

• Now solve for eigenfunctions of h(x2)

[Figure: data; 1st, 2nd, and 3rd eigenfunctions of h(x2)]

Page 19:

From Eigenfunctions to Approximate Eigenvectors

• Take each data point

• Do 1-D interpolation in each eigenfunction → a k-dimensional vector (for k eigenfunctions)

• Very fast operation (has to be done nk times)

[Figure: eigenfunction value vs. histogram bin (1–50)]
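A sketch of this interpolation step, assuming each eigenfunction is stored with the input dimension and bin centers it came from (names hypothetical):

```python
import numpy as np

def approx_eigenvectors(X_rot, eigfuncs):
    # eigfuncs: list of (dim, centers, g) triples -- the input dimension an
    # eigenfunction came from, its histogram bin centers, and its values g.
    # Each point is linearly interpolated into each 1-D eigenfunction,
    # giving a k-dimensional vector per point (n*k interpolations total).
    cols = [np.interp(X_rot[:, dim], centers, g)
            for dim, centers, g in eigfuncs]
    return np.column_stack(cols)        # (n, k) approximate eigenvectors
```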

Page 20:

Preprocessing

• Need to make data separable

• Rotate using PCA

[Figure: not separable → (rotate) → separable]

Page 21:

Overall Algorithm

1. Rotate data to maximize separability (currently use PCA)

2. For each dimension:
   – Construct 1D histogram
   – Solve numerically for eigenfunctions/values

3. Order eigenfunctions from all dimensions by increasing eigenvalue and take the first k

4. Interpolate data into the k eigenfunctions
   – Yields approximate eigenvectors of the normalized Laplacian

5. Solve k × k least-squares system to give the label function
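A hypothetical end-to-end sketch of steps 1–5, reusing eigenfunctions_1d from the earlier sketch; B = 50 bins follows the slides, while the kernel width, λ, and the dropping of each dimension's constant eigenfunction are assumptions:

```python
import numpy as np

def ssl_label_function(X, y, labeled_idx, k=48, B=50, eps=0.2, lam=100.0):
    # 1. Rotate data (PCA via SVD) to improve separability.
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    Xr = Xc @ Vt.T
    n, d = Xr.shape
    # 2. Per dimension: 1-D histogram, then numerical eigenfunctions/values.
    cands = []
    for dim in range(d):
        h, edges = np.histogram(Xr[:, dim], bins=B, density=True)
        c = 0.5 * (edges[:-1] + edges[1:])
        sig, g = eigenfunctions_1d(h, c, eps, k=5)
        for s, gv in zip(sig[1:], g[:, 1:].T):   # drop the constant one
            cands.append((s, dim, c, gv))
    # 3. Order by increasing eigenvalue; keep the first k.
    cands.sort(key=lambda t: t[0])
    cands = cands[:k]
    Sigma = np.diag([s for s, _, _, _ in cands])
    # 4. Interpolate all n points into the k eigenfunctions.
    U = np.column_stack([np.interp(Xr[:, dim], c, gv)
                         for _, dim, c, gv in cands])
    # 5. Solve the k x k system (Sigma + U^T Lam U) alpha = U^T Lam y.
    Lam = np.zeros(n)
    Lam[labeled_idx] = lam
    A = Sigma + (U * Lam[:, None]).T @ U
    alpha = np.linalg.solve(A, U.T @ (Lam * y))
    return U @ alpha                             # label function f
```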

Page 22:

Experiments on Toy Data

Page 23:

Comparison of Approaches

[Figure columns: Data | Exact | Eigenvector | Eigenfunction]

Page 24:

Exact vs. Approximate Eigenvectors

Eigenvalues (exact — approximate):

  0.0531 — 0.0535
  0.1920 — 0.1928
  0.2049 — 0.2068
  0.2480 — 0.5512
  0.3580 — 0.7979

[Figure columns: Data | Exact eigenvectors | Approximate eigenvectors]

Page 25:

Nystrom Comparison

• Too few landmark points results in highly unstable eigenvectors

Page 26:

Nystrom Comparison

• Eigenfunctions fail when data has significant dependencies between dimensions

Page 27:

Experiments on Real Data

Page 28:

Experiments

• Images from 126 classes downloaded from Internet search engines, 63,000 images in total (e.g. "Dump truck", "Emu")

• Labels (correct/incorrect) provided by Geoff Hinton, Alex Krizhevsky, Vinod Nair (U. Toronto and CIFAR)

Page 29:

Input Image Representation

• Pixels are not a convenient representation

• Use Gist descriptor (Oliva & Torralba, 2001)

• PCA down to 64 dimensions

• L2 distance between Gist vectors is a rough substitute for human perceptual distance

Page 30:

Are Dimensions Independent?

[Figure: joint histograms for pairs of dimensions from the raw 384-dimensional Gist, and after PCA. MI is the mutual information score; 0 = independent.]

Page 31:

Real 1-D Eigenfunctions of PCA'd Gist Descriptors

[Figure: eigenfunctions 1–256 ordered by eigenvalue; y-axis: eigenfunction value; x-axis: histogram bin (1–50), xmin to xmax; color = input dimension (1–64)]

Page 32:

Protocol

• Task is to re-rank images of each class

• Measure precision @ 15% recall

• Vary # of labeled examples
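A small sketch of the ranking metric, assuming per-image scores and binary relevance labels (function and variable names are hypothetical):

```python
import numpy as np

def precision_at_recall(scores, labels, level=0.15):
    # Rank by decreasing score; report precision at the first rank
    # where recall reaches `level`.
    order = np.argsort(-scores)
    hits = labels[order].astype(float)
    tp = np.cumsum(hits)
    recall = tp / hits.sum()
    precision = tp / np.arange(1, len(hits) + 1)
    idx = min(np.searchsorted(recall, level), len(hits) - 1)
    return precision[idx]
```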

Page 33:

[Plot: mean precision at 15% recall, averaged over 16 classes (0.25–0.7), vs. log2 number of positive training examples per class (−Inf to 7); curves: Least-squares, SVM, Chance]

Page 34:

[Plot: same axes as above; curves: Nystrom, Least-squares, SVM, Chance]

Page 35:

[Plot: same axes as above; curves: Eigenfunction, Nystrom, Least-squares, SVM, Chance]

Page 36:

[Plot: same axes as above; curves: Eigenfunction, Nystrom, Least-squares, Eigenvector, SVM, NN, Chance]

Page 37:

80 Million Images

Page 38:

Running on 80 million images

• PCA to 32 dims, k = 48 eigenfunctions

• Precompute approximate eigenvectors (~20 GB)

• For each class, labels propagate through 80 million images

Page 39: [image-only slide]

Page 40:

Summary

• Semi-supervised scheme that can scale to really large problems

• Rather than sub-sampling the data, we take the limit of infinite unlabeled data

• Assumes input data distribution is separable

• Can propagate labels in a graph with 80 million nodes in fractions of a second

Page 41: [image-only slide]

Page 42:

Future Work

• Can potentially use 2D or 3D histograms instead of 1D
  – Requires more data

• Consider diagonal eigenfunctions

• Sharing of labels between classes

Page 43:

Are Dimensions Independent?

[Figure: joint histograms for pairs of dimensions from the raw 384-dimensional Gist, and after PCA. MI is the mutual information score; 0 = independent.]

Page 44:

Are Dimensions Independent?

[Figure: joint histograms for pairs of dimensions from the raw 384-dimensional Gist, and after ICA. MI is the mutual information score; 0 = independent.]

Page 45: [image-only slide]

Page 46:

Overview of Our Approach

• Existing large-scale SSL methods try to reduce # points

• We consider what happens as n → ∞

• Eigenvectors → eigenfunctions

• Assume input distribution is separable

• Make crude numerical approximation to the eigenfunctions

• Interpolate data in these approximate eigenfunctions to give approximate eigenvectors

Page 47:

Eigenfunctions

• Eigenfunctions are the limit of eigenvectors as n → ∞: $\tfrac{1}{n^2}\, f^T L f \to L_p(F)$ [Nadler et al. 06, Weiss et al. 08]

• Analytical forms of eigenfunctions exist only in a few cases: uniform, Gaussian [Coifman et al. 05, Nadler et al. 06, Belkin & Niyogi 07]

• Instead, we calculate a numerical approximation to the eigenfunctions

Page 48:

Complexity Comparison

Nystrom (polynomial in # landmarks):
1. Select m landmark points
2. Get smallest k eigenvectors of m × m system
3. Interpolate n points into k eigenvectors
4. Solve k × k linear system

Eigenfunction (linear in # data points):
1. Rotate n points
2. Form d 1-D histograms
3. Solve d linear systems, each b × b
4. k 1-D interpolations of n points
5. Solve k × k linear system

Key: n = # data points (big, > 10^6), l = # labeled points (small, < 100), m = # landmark points, d = # input dims (~100), k = # eigenvectors (~100), b = # histogram bins (~50)

Page 49:

Key Assumption: Separability of Input Data

• Can't build accurate high-dimensional histograms
  – Need too many points

• Currently just use 1-D histograms
  – 2D or 3D ones possible with enough data

• This assumes the distribution is separable
  – Assume p(x) = p(x1) p(x2) … p(xd)

• For separable distributions, eigenfunctions are also separable

[Nadler et al. 06, Weiss et al. 08]

Page 50:

Varying # Training Examples

[Plot: mean precision at 15% recall, averaged over 16 classes (0.25–0.7), vs. log2 number of positive training examples per class (−Inf to 7); curves: Eigenfunction, Nystrom, Least-squares, Eigenvector, SVM, NN, Chance]