Self-taught Learning: Transfer Learning from Unlabeled Data
Rajat Raina, Honglak Lee, Roger Grosse, Alexis Battle, Chaitanya Ekanadham, Helen Kwong, Benjamin Packer, Narut Sereewattanawoot, Andrew Y. Ng
Stanford University


TRANSCRIPT

Page 1:

Rajat Raina
Honglak Lee, Roger Grosse, Alexis Battle, Chaitanya Ekanadham, Helen Kwong, Benjamin Packer, Narut Sereewattanawoot
Andrew Y. Ng
Stanford University

Self-taught Learning: Transfer Learning from Unlabeled Data

Page 2:

The “one learning algorithm” hypothesis

There is some evidence that the human brain uses essentially the same algorithm to understand many different input modalities.
- Example: Ferret experiments, in which the “input” for vision was plugged into the auditory part of the brain, and the auditory cortex learned to “see.” [Roe et al., 1992]

Self-taught Learning

(Roe et al., 1992; Hawkins & Blakeslee, 2004)

Page 3:

If we could find this one learning algorithm, we would be done. (Finally!)

(Roe et al., 1992; Hawkins & Blakeslee, 2004)

Page 4:

This talk: Finding a deep learning algorithm

If the brain really is one learning algorithm, it would suffice to just:
- Find a learning algorithm for a single layer, and
- Show that it can build a small number of layers.

We evaluate our algorithms:
- Against biology (e.g., sparse RBMs for V2: poster yesterday, Lee et al.)
- On applications.

Page 5:

Supervised learning

[Figure: labeled train and test sets of car and motorcycle images]

Supervised learning algorithms may not work well with limited labeled data.

Page 6:

Learning in humans

Your brain has 10^14 synapses (connections). You will live for 10^9 seconds. If each synapse requires 1 bit to parameterize, you need to “learn” 10^14 bits in 10^9 seconds, or 10^5 bits per second.

Human learning is largely unsupervised, and uses readily available unlabeled data.

(Geoffrey Hinton, personal communication)

Page 7:

Supervised learning

[Figure: labeled train and test sets of car and motorcycle images]

Page 8:

“Brain-like” Learning

[Figure: labeled car/motorcycle train and test sets, plus unlabeled images randomly downloaded from the Internet]

Page 9:

“Brain-like” Learning

- Labeled digits + unlabeled English characters = ?
- Labeled webpages + unlabeled newspaper articles = ?
- Labeled Russian speech + unlabeled English speech = ?

Page 10:

“Self-taught Learning”

- Labeled digits + unlabeled English characters = ?
- Labeled webpages + unlabeled newspaper articles = ?
- Labeled Russian speech + unlabeled English speech = ?

Page 11:

Recent history of machine learning

- 20 years ago: Supervised learning
- 10 years ago: Semi-supervised learning
- 10 years ago: Transfer learning
- Next: Self-taught learning?

[Figure: example classes for each setting: cars, motorcycles, bus, tractor, aircraft, helicopter, natural scenes]

Page 12:

Self-taught Learning

Labeled examples: {(x_l^(i), y^(i))}_{i=1}^{m}, with x_l^(i) ∈ R^n and y^(i) ∈ {1, …, T}.

Unlabeled examples: {x_u^(i)}_{i=1}^{k}, with x_u^(i) ∈ R^n (typically k ≫ m).

The unlabeled and labeled data:
- Need not share labels y.
- Need not share a generative distribution.

Advantage: Such unlabeled data is often easy to obtain.

Page 13:

A self-taught learning algorithm

Overview: Represent each labeled or unlabeled input x as a sparse linear combination of “basis vectors” {b_j}_{j=1}^{s}:

x = Σ_j a_j b_j,  with b_j ∈ R^n and a_j ∈ R.

Example: x = 0.8 * b_87 + 0.3 * b_376 + 0.5 * b_411

Page 14:

A self-taught learning algorithm

Key steps:
1. Learn good bases b_j using the unlabeled data x_u^(i).
2. Use these learnt bases to construct “higher-level” features for the labeled data.
3. Apply a standard supervised learning algorithm on these features.

Example: x = 0.8 * b_87 + 0.3 * b_376 + 0.5 * b_411

Page 15:

Learning the bases: Sparse coding

Given only unlabeled data x_u^(i), we find good bases b using sparse coding:

min_{b,a} Σ_i || x_u^(i) − Σ_j a_j^(i) b_j ||² + β Σ_i || a^(i) ||₁

The first term is the reconstruction error; the second is the sparsity penalty.

[Details: An extra normalization constraint on ||b_j||² is required.]

(Efficient algorithms: Lee et al., NIPS 2006)
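The basis-learning step above can be sketched with scikit-learn's `DictionaryLearning`, which minimizes essentially the same objective (reconstruction error plus an L1 sparsity penalty, with norm-constrained basis vectors). This is only an illustrative stand-in for the authors' algorithm; the random data, dictionary size, and penalty weight are assumptions, not values from the talk.

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning

rng = np.random.RandomState(0)
X_unlabeled = rng.randn(200, 64)   # e.g., 200 unlabeled 8x8 image patches (toy data)

learner = DictionaryLearning(
    n_components=32,   # s: number of bases b_j (illustrative)
    alpha=1.0,         # beta: sparsity penalty weight (illustrative)
    max_iter=10,
    random_state=0,
)
A = learner.fit_transform(X_unlabeled)   # sparse activations a^(i), one row per example
B = learner.components_                  # learnt bases, one b_j per row
```

Lee et al. (NIPS 2006) give much faster specialized solvers for this objective; `DictionaryLearning` just makes the optimization problem concrete.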

Page 16:

Example bases

Natural images. Learnt bases: “edges.”
Handwritten characters. Learnt bases: “strokes.”

Page 17:

Constructing features

Using the learnt bases b, compute features for the examples x_l from the classification task by solving:

Features of x_l = arg min_a || x_l − Σ_j a_j b_j ||² + β || a ||₁

(reconstruction error plus sparsity penalty, as before)

Example: x_l = 0.8 * b_87 + 0.3 * b_376 + 0.5 * b_411

Finally, learn a classifier using a standard supervised learning algorithm (e.g., SVM) over these features.
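Steps 2 and 3 can be sketched as follows: solve the L1-penalized reconstruction for each labeled example to get its activations, then train a linear SVM on them. This is a minimal illustration, assuming the bases `B` have already been learnt; the data, sizes, and penalty weight are made up for the example.

```python
import numpy as np
from sklearn.decomposition import SparseCoder
from sklearn.svm import LinearSVC

rng = np.random.RandomState(0)
B = rng.randn(32, 64)                            # stand-in for learnt bases
B /= np.linalg.norm(B, axis=1, keepdims=True)    # unit-norm basis vectors

X_labeled = rng.randn(40, 64)                    # labeled inputs x_l (toy data)
y = rng.randint(0, 2, size=40)                   # e.g., car vs. motorcycle labels

# Solve arg min_a ||x_l - sum_j a_j b_j||^2 + beta ||a||_1 for each example
coder = SparseCoder(dictionary=B,
                    transform_algorithm='lasso_lars',
                    transform_alpha=0.1)          # beta (illustrative)
features = coder.transform(X_labeled)             # sparse activations a

clf = LinearSVC(max_iter=5000).fit(features, y)   # standard supervised learner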

Page 18:

Image classification

Feature visualization: large image (Platypus from the Caltech101 dataset)

Page 19:

Image classification

Feature visualization: Platypus image (Caltech101 dataset)


Page 22:

Image classification (15 labeled images per class)

Baseline: 16%
PCA: 37%
Sparse coding: 47% (36.0% error reduction)

Other reported results:
Fei-Fei et al., 2004: 16%
Berg et al., 2005: 17%
Holub et al., 2005: 40%
Serre et al., 2005: 35%
Berg et al., 2005: 48%
Zhang et al., 2006: 59%
Lazebnik et al., 2006: 56%

Page 23:

Character recognition (digits, handwritten English, English font)

Handwritten English classification (20 labeled images per handwritten character), bases learnt on digits:
Raw: 54.8%
PCA: 54.8%
Sparse coding: 58.5% (8.2% error reduction)

English font classification (20 labeled images per font character), bases learnt on handwritten English:
Raw: 17.9%
PCA: 14.5%
Sparse coding: 16.6%
Sparse coding + Raw: 20.2% (2.8% error reduction)

Page 24:

Text classification (Reuters newswire, webpages, UseNet articles)

Webpage classification (2 labeled documents per class), bases learnt on Reuters newswire:
Raw words: 62.8%
PCA: 63.3%
Sparse coding: 64.3% (4.0% error reduction)

UseNet classification (2 labeled documents per class), bases learnt on Reuters newswire:
Raw words: 61.3%
PCA: 60.7%
Sparse coding: 63.8% (6.5% error reduction)

Page 25:

Shift-invariant sparse coding

[Figure: reconstruction of a signal from sparse features and basis functions]

(Algorithms: Grosse et al., UAI 2007)
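The reconstruction idea behind shift-invariant sparse coding can be illustrated in 1-D: the signal is modeled as a sum of short basis functions convolved with sparse activation maps, so each basis can appear at any time shift. This toy sketch only shows the reconstruction; the signal length, basis count, and activations are made-up values, and the actual learning algorithm is in Grosse et al., UAI 2007.

```python
import numpy as np

rng = np.random.RandomState(0)
sig_len, basis_len, n_bases = 100, 9, 3

bases = rng.randn(n_bases, basis_len)                 # short basis functions b_j
acts = np.zeros((n_bases, sig_len - basis_len + 1))   # activation map per basis
for j in range(n_bases):                              # sparse: a few spikes each
    idx = rng.choice(acts.shape[1], size=2, replace=False)
    acts[j, idx] = rng.randn(2)

# Reconstruction: sum over bases of (activations convolved with basis)
recon = sum(np.convolve(acts[j], bases[j]) for j in range(n_bases))
```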

Page 26:

Audio classification

Speaker identification (5 labels, TIMIT corpus, 1 sentence per speaker), bases learnt on different dialects:
Spectrogram: 38.5%
MFCCs: 43.8%
Sparse coding: 48.7% (8.7% error reduction)

Musical genre classification (5 labels, 18 seconds per genre), bases learnt on different genres and songs:
Spectrogram: 48.4%
MFCCs: 54.0%
Music-specific model: 49.3%
Sparse coding: 56.6% (5.7% error reduction)

(Details: Grosse et al., UAI 2007)

Page 27:

Sparse deep belief networks

New: Sparse RBM
v: visible layer
h: hidden layer
W, b, c: parameters

(Details: Lee et al., NIPS 2007. Poster yesterday.)
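A sparse RBM layer of the kind referenced above can be sketched as a standard contrastive-divergence (CD-1) update plus a penalty that pushes the average hidden-unit activation toward a small target. This is a rough numpy sketch, not the algorithm of Lee et al.: the bias convention (b hidden, c visible), learning rate, sparsity target, and penalty form are all illustrative assumptions.

```python
import numpy as np

rng = np.random.RandomState(0)
n_visible, n_hidden = 64, 16
W = 0.01 * rng.randn(n_visible, n_hidden)   # weights
b = np.zeros(n_hidden)                      # hidden biases (one convention)
c = np.zeros(n_visible)                     # visible biases

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_sparse_update(v0, W, b, c, lr=0.05, target=0.05, sparsity_weight=0.1):
    """One CD-1 step with a sparsity penalty on the hidden units (illustrative)."""
    n = len(v0)
    ph0 = sigmoid(v0 @ W + b)                         # P(h=1 | v0), positive phase
    h0 = (rng.rand(*ph0.shape) < ph0).astype(float)   # sample hidden states
    pv1 = sigmoid(h0 @ W.T + c)                       # one Gibbs step down
    ph1 = sigmoid(pv1 @ W + b)                        # and back up (negative phase)
    dW = v0.T @ ph0 - pv1.T @ ph1                     # CD-1 gradient estimates
    db = (ph0 - ph1).sum(axis=0)
    dc = (v0 - pv1).sum(axis=0)
    # Sparsity penalty: nudge mean hidden activation toward the target
    db += sparsity_weight * n * (target - ph0.mean(axis=0))
    W += lr * dW / n
    b += lr * db / n
    c += lr * dc / n
    return W, b, c

v_batch = (rng.rand(20, n_visible) < 0.3).astype(float)   # toy binary inputs
for _ in range(5):
    W, b, c = cd1_sparse_update(v_batch, W, b, c)
```

Stacking such layers (training each on the activations of the one below) gives the sparse DBN evaluated on the next slide.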

Page 28:

Sparse deep belief networks

Image classification (Caltech101 dataset):
1-layer sparse DBN: 44.5%
2-layer sparse DBN: 46.6% (3.2% error reduction)

(Details: Lee et al., NIPS 2007. Poster yesterday.)

Page 29:

Summary

- Self-taught learning: Unlabeled data does not share the labels of the classification task.
- Use unlabeled data to discover features.
- Use sparse coding to construct an easy-to-classify, “higher-level” representation.

[Figure: unlabeled images; cars vs. motorcycles; a sparse linear combination of bases]

Page 30:

THE END

Page 31:

Related Work

- Weston et al., ICML 2006: Make stronger assumptions on the unlabeled data.
- Ando & Zhang, JMLR 2005: For natural language tasks and character recognition, use heuristics to construct a transfer learning task using unlabeled data.