TRANSCRIPT
Rajat Raina
Honglak Lee, Roger Grosse, Alexis Battle, Chaitanya Ekanadham, Helen Kwong,
Benjamin Packer, Narut Sereewattanawoot
Andrew Y. Ng
Stanford University
Self-taught Learning: Transfer Learning from Unlabeled Data
The “one learning algorithm” hypothesis
There is some evidence that the human brain uses essentially the same algorithm to understand many different input modalities.
– Example: Ferret experiments, in which the “input” for vision was plugged into the auditory part of the brain, and the auditory cortex learns to “see.” [Roe et al., 1992]
If we could find this one learning algorithm, we would be done. (Finally!)
(Roe et al., 1992. Hawkins & Blakeslee, 2004)
This talk
If the brain really is one learning algorithm, it would suffice to just:
1. Find a learning algorithm for a single layer, and
2. Show that it can build a small number of layers.
We evaluate our algorithms: against biology, and on applications.
Finding a deep learning algorithm
e.g., Sparse RBMs for V2: Poster yesterday (Lee et al.)
Supervised learning
[Figure: labeled train and test images of cars vs. motorcycles]
Supervised learning algorithms may not work well with limited labeled data.
Learning in humans
Your brain has 10^14 synapses (connections). You will live for 10^9 seconds. If each synapse requires 1 bit to parameterize, you need to “learn” 10^14 bits in 10^9 seconds.
Or, 10^5 bits per second.
Human learning is largely unsupervised, and uses readily available unlabeled data.
(Geoffrey Hinton, personal communication)
“Brain-like” Learning
[Figure: labeled train and test images of cars vs. motorcycles, plus unlabeled images randomly downloaded from the Internet]
“Self-taught Learning”
Labeled digits + unlabeled English characters = ?
Labeled webpages + unlabeled newspaper articles = ?
Labeled Russian speech + unlabeled English speech = ?
Recent history of machine learning
• 20 years ago: Supervised learning.
• 10 years ago: Semi-supervised learning.
• 10 years ago: Transfer learning.
• Next: Self-taught learning?
[Figure: example data for each setting: supervised learning (labeled cars, motorcycles); transfer learning (labeled bus, cars, motorcycles, tractor, aircraft, helicopter); self-taught learning (unlabeled natural scenes); semi-supervised learning (unlabeled cars, motorcycles)]
Labeled examples: $\{(x_l^{(i)}, y^{(i)})\}_{i=1}^m$, with $x_l^{(i)} \in \mathbb{R}^n$ and $y^{(i)} \in \{1, \ldots, T\}$.
Unlabeled examples: $\{x_u^{(i)}\}_{i=1}^k$, with $x_u^{(i)} \in \mathbb{R}^n$ and typically $k \gg m$.
The unlabeled and labeled data:
• Need not share labels y.
• Need not share a generative distribution.
Advantage: Such unlabeled data is often easy to obtain.
A self-taught learning algorithm
Overview: Represent each labeled or unlabeled input $x$ as a sparse linear combination of “basis vectors” $b_j$:
$$x = \sum_j a_j b_j, \qquad b_j \in \mathbb{R}^n, \; a_j \in \mathbb{R},$$
using a set of bases $\{b_j\}_{j=1}^s$.
Example: $x = 0.8\, b_{87} + 0.3\, b_{376} + 0.5\, b_{411}$.
A self-taught learning algorithm
Key steps (a code sketch follows below):
1. Learn good bases $b_j$ using unlabeled data $x_u^{(i)}$.
2. Use these learnt bases to construct “higher-level” features for the labeled data.
3. Apply a standard supervised learning algorithm on these features.
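To make the three steps concrete, here is a minimal end-to-end sketch in Python. It uses scikit-learn's DictionaryLearning as an off-the-shelf stand-in for the sparse-coding algorithm on the following slides, and random arrays as placeholder data; the shapes, hyperparameters, and choice of classifier are illustrative assumptions, not the authors' exact setup.

import numpy as np
from sklearn.decomposition import DictionaryLearning
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X_unlabeled = rng.normal(size=(1000, 64))   # placeholder: 1000 unlabeled 8x8 patches
X_labeled = rng.normal(size=(100, 64))      # placeholder: 100 labeled examples
y_labeled = rng.integers(0, 2, size=100)    # placeholder binary labels

# Step 1: learn bases {b_j} from unlabeled data via L1-sparse reconstruction.
coder = DictionaryLearning(n_components=128, alpha=1.0, max_iter=20, random_state=0)
coder.fit(X_unlabeled)

# Step 2: features = the sparse coefficients a for each labeled example.
A = coder.transform(X_labeled)

# Step 3: train any standard supervised classifier on the new features.
clf = LinearSVC().fit(A, y_labeled)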
Learning the bases: Sparse coding
Given only unlabeled data $x_u^{(i)}$, we find good bases $b$ using sparse coding:
$$\min_{b,\,a} \; \sum_i \Big\| x_u^{(i)} - \sum_j a_j^{(i)} b_j \Big\|_2^2 \;+\; \beta \sum_i \big\| a^{(i)} \big\|_1$$
(The first term is the reconstruction error; the second is the sparsity penalty.)
[Details: An extra normalization constraint on $b_j$, namely $\|b_j\|_2 \le 1$, is required.]
(Efficient algorithms: Lee et al., NIPS 2006)
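The objective above can be minimized by alternating between the codes a (bases fixed) and the bases b (codes fixed). Below is a toy Python sketch of that alternation, not the efficient algorithm of Lee et al. (NIPS 2006): an ISTA (soft-thresholding) step for the codes, a gradient step for the bases, and a projection enforcing the normalization constraint. The step sizes and iteration counts are illustrative assumptions.

import numpy as np

def soft_threshold(z, t):
    # Proximal operator of t * ||.||_1
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def sparse_coding(X, s=64, beta=0.1, iters=100, lr=1e-3):
    n, d = X.shape                      # n unlabeled examples x_u^(i), dimension d
    rng = np.random.default_rng(0)
    B = rng.normal(size=(d, s))
    B /= np.linalg.norm(B, axis=0)      # start with unit-norm bases
    A = np.zeros((n, s))                # codes a^(i), one row per example
    for _ in range(iters):
        # Codes: one ISTA step on sum_i ||x^(i) - B a^(i)||^2 + beta ||a^(i)||_1
        step = 1.0 / (2.0 * np.linalg.norm(B.T @ B, 2))
        grad_A = 2.0 * (A @ B.T - X) @ B
        A = soft_threshold(A - step * grad_A, step * beta)
        # Bases: gradient step on the reconstruction error, codes held fixed
        B -= lr * 2.0 * (A @ B.T - X).T @ A
        # Projection for the constraint ||b_j||_2 <= 1
        B /= np.maximum(np.linalg.norm(B, axis=0), 1.0)
    return B, A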
Example bases
Natural images. Learnt bases: “Edges”
Handwritten characters. Learnt bases: “Strokes”
Constructing features
Using the learnt bases $b$, compute features for the examples $x_l$ from the classification task by solving:
$$\text{Features of } x_l \;=\; \arg\min_a \Big\| x_l - \sum_j a_j b_j \Big\|_2^2 + \beta \|a\|_1$$
(Again: reconstruction error plus sparsity penalty.)
Example: $x_l = 0.8\, b_{87} + 0.3\, b_{376} + 0.5\, b_{411}$.
Finally, learn a classifier using a standard supervised learning algorithm (e.g., SVM) over these features.
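Since the bases are now fixed, each feature vector is just the solution of a Lasso problem with the bases as the design matrix. A minimal sketch, using scikit-learn's Lasso as a stand-in solver (its objective rescales the squared error by 1/(2n), so its alpha corresponds only roughly to the beta above):

import numpy as np
from sklearn.linear_model import Lasso

def features(x_l, B, alpha=0.05):
    # Solve: a = argmin_a ||x_l - B a||^2 + (penalty) ||a||_1,
    # with the learnt bases B (columns b_j) held fixed.
    lasso = Lasso(alpha=alpha, fit_intercept=False, max_iter=10000)
    lasso.fit(B, x_l)       # design matrix = bases, "target" = the input
    return lasso.coef_      # sparse activations a: the new feature vector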
Image classification
[Figure: feature visualizations for a large platypus image from the Caltech101 dataset]
Image classification
(15 labeled images per class)
Baseline: 16%
PCA: 37%
Sparse coding: 47% (36.0% error reduction)
Other reported results: Fei-Fei et al., 2004: 16%; Berg et al., 2005: 17%; Holub et al., 2005: 40%; Serre et al., 2005: 35%; Berg et al., 2005: 48%; Zhang et al., 2006: 59%; Lazebnik et al., 2006: 56%.
Character recognition
Domains: Digits, handwritten English, English font.
Handwritten English classification (20 labeled images per handwritten character), bases learnt on digits:
Raw: 54.8%
PCA: 54.8%
Sparse coding: 58.5% (8.2% error reduction)
English font classification (20 labeled images per font character), bases learnt on handwritten English:
Raw: 17.9%
PCA: 14.5%
Sparse coding: 16.6%
Sparse coding + Raw: 20.2% (2.8% error reduction)
Text classification
Domains: Reuters newswire, webpages, UseNet articles.
Webpage classification (2 labeled documents per class), bases learnt on Reuters newswire:
Raw words: 62.8%
PCA: 63.3%
Sparse coding: 64.3% (4.0% error reduction)
UseNet classification (2 labeled documents per class), bases learnt on Reuters newswire:
Raw words: 61.3%
PCA: 60.7%
Sparse coding: 63.8% (6.5% error reduction)
Shift-invariant sparse coding
[Figure: a signal reconstructed as sparse features convolved with basis functions]
(Algorithms: Grosse et al., UAI 2007)
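In the shift-invariant model, the reconstruction is a sum of convolutions: each short basis function b_j can appear at any temporal offset, with a sparse activation signal a_j saying where and how strongly. A minimal numpy sketch of the reconstruction step (equal filter and signal lengths are an illustrative assumption):

import numpy as np

def reconstruct(bases, activations):
    # bases: list of short 1-D filters b_j (equal lengths assumed here)
    # activations: list of sparse 1-D signals a_j, one per basis
    length = len(activations[0]) + len(bases[0]) - 1
    x_hat = np.zeros(length)
    for b_j, a_j in zip(bases, activations):
        x_hat += np.convolve(a_j, b_j)   # basis placed at every nonzero of a_j
    return x_hat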
Audio classification
Speaker identification (5 labels, TIMIT corpus, 1 sentence per speaker), bases learnt on different dialects:
Spectrogram: 38.5%
MFCCs: 43.8%
Sparse coding: 48.7% (8.7% error reduction)
Musical genre classification (5 labels, 18 seconds per genre), bases learnt on different genres and songs:
Spectrogram: 48.4%
MFCCs: 54.0%
Music-specific model: 49.3%
Sparse coding: 56.6% (5.7% error reduction)
(Details: Grosse et al., UAI 2007)
Sparse deep belief networks
New: Sparse RBM
[Figure: a restricted Boltzmann machine with visible layer v, hidden layer h, and parameters W, b, c]
(Details: Lee et al., NIPS 2007. Poster yesterday.)
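As a rough illustration of the sparse RBM idea, here is a toy contrastive-divergence (CD-1) update with an added term nudging each hidden unit's mean activation toward a small target p, in the spirit of Lee et al. (NIPS 2007). The exact penalty form, learning rates, and sparsity target used here are assumptions, not the paper's precise formulation.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sparse_rbm_step(V, W, b, c, lr=0.01, p=0.02, sparsity_cost=0.1):
    rng = np.random.default_rng(0)
    # Positive phase: hidden activations given the data v.
    ph = sigmoid(V @ W + c)
    h = (rng.random(ph.shape) < ph).astype(float)
    # Negative phase: one step of Gibbs sampling (CD-1).
    pv = sigmoid(h @ W.T + b)
    ph2 = sigmoid(pv @ W + c)
    n = V.shape[0]
    # CD-1 gradient, plus a term pushing mean hidden activation toward p.
    W += lr * (V.T @ ph - pv.T @ ph2) / n
    b += lr * (V - pv).mean(axis=0)
    c += lr * ((ph - ph2).mean(axis=0) + sparsity_cost * (p - ph.mean(axis=0)))
    return W, b, c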
Sparse deep belief networks
Image classification (Caltech101 dataset):
1-layer sparse DBN: 44.5%
2-layer sparse DBN: 46.6% (3.2% error reduction)
(Details: Lee et al., NIPS 2007. Poster yesterday.)
Summary
Self-taught learning: Unlabeled data does not share the labels of the classification task.
Use unlabeled data to discover features. Use sparse coding to construct an easy-to-classify, “higher-level” representation.
[Figure: cars vs. motorcycles task, unlabeled images, and a sparse basis combination x = 0.8 b_87 + 0.3 b_376 + 0.5 b_411]
THE END
Related Work
• Weston et al., ICML 2006: Make stronger assumptions on the unlabeled data.
• Ando & Zhang, JMLR 2005: For natural language tasks and character recognition, use heuristics to construct a transfer learning task using unlabeled data.