TRANSCRIPT
Rajat Raina
Honglak Lee, Roger Grosse, Alexis Battle, Chaitanya Ekanadham, Helen Kwong,
Benjamin Packer, Narut Sereewattanawoot
Andrew Y. Ng
Stanford University
Self-taught Learning: Transfer Learning from Unlabeled Data
The “one learning algorithm” hypothesis
There is some evidence that the human brain uses essentially the same algorithm to understand many different input modalities.
– Example: Ferret experiments, in which the “input” for vision was plugged into the auditory part of the brain, and the auditory cortex learns to “see.” [Roe et al., 1992]
If we could find this one learning algorithm, we would be done. (Finally!)
(Roe et al., 1992. Hawkins & Blakeslee, 2004)
This talk
If the brain really is one learning algorithm, it would suffice to just:
1. Find a learning algorithm for a single layer, and
2. Show that it can build a small number of layers.
We evaluate our algorithms: against biology, and on applications.
Finding a deep learning algorithm
e.g., Sparse RBMs for V2: Poster yesterday (Lee et al.)
Supervised learning
[Figure: labeled train and test images of cars vs. motorcycles]
Supervised learning algorithms may not work well with limited labeled data.
Learning in humans
Your brain has 10^14 synapses (connections). You will live for 10^9 seconds. If each synapse requires 1 bit to parameterize, you need to “learn” 10^14 bits in 10^9 seconds.
Or, 10^5 bits per second.
Human learning is largely unsupervised, and uses readily available unlabeled data.
(Geoffrey Hinton, personal communication)
“Brain-like” Learning
[Figure: labeled train and test images of cars vs. motorcycles, plus unlabeled images randomly downloaded from the Internet]
“Self-taught Learning”
Labeled digits + unlabeled English characters = ?
Labeled webpages + unlabeled newspaper articles = ?
Labeled Russian speech + unlabeled English speech = ?
Recent history of machine learning
• 20 years ago: Supervised learning.
• 10 years ago: Semi-supervised learning.
• 10 years ago: Transfer learning.
• Next: Self-taught learning?
[Figure: example data for each setting: supervised learning (labeled cars, motorcycles); transfer learning (labeled bus, cars, motorcycles, tractor, aircraft, helicopter); self-taught learning (unlabeled natural scenes); semi-supervised learning (unlabeled cars, motorcycles)]
Labeled examples: $\{(x_l^{(i)}, y^{(i)})\}_{i=1}^m$, with $x_l^{(i)} \in \mathbb{R}^n$ and $y^{(i)} \in \{1, \ldots, T\}$.
Unlabeled examples: $\{x_u^{(i)}\}_{i=1}^k$, with $x_u^{(i)} \in \mathbb{R}^n$ and typically $k \gg m$.
The unlabeled and labeled data:
• Need not share labels y.
• Need not share a generative distribution.
Advantage: Such unlabeled data is often easy to obtain.
A self-taught learning algorithm
Overview: Represent each labeled or unlabeled input $x$ as a sparse linear combination of “basis vectors” $b_j$:
$$x = \sum_j a_j b_j, \qquad b_j \in \mathbb{R}^n, \; a_j \in \mathbb{R},$$
using a set of bases $\{b_j\}_{j=1}^s$.
Example: $x = 0.8\, b_{87} + 0.3\, b_{376} + 0.5\, b_{411}$.
A self-taught learning algorithm
Key steps (a code sketch follows below):
1. Learn good bases $b_j$ using unlabeled data $x_u^{(i)}$.
2. Use these learnt bases to construct “higher-level” features for the labeled data.
3. Apply a standard supervised learning algorithm on these features.
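To make the three steps concrete, here is a minimal end-to-end sketch in Python. It uses scikit-learn's DictionaryLearning as an off-the-shelf stand-in for the sparse-coding algorithm on the following slides, and random arrays as placeholder data; the shapes, hyperparameters, and choice of classifier are illustrative assumptions, not the authors' exact setup.

import numpy as np
from sklearn.decomposition import DictionaryLearning
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X_unlabeled = rng.normal(size=(1000, 64))   # placeholder: 1000 unlabeled 8x8 patches
X_labeled = rng.normal(size=(100, 64))      # placeholder: 100 labeled examples
y_labeled = rng.integers(0, 2, size=100)    # placeholder binary labels

# Step 1: learn bases {b_j} from unlabeled data via L1-sparse reconstruction.
coder = DictionaryLearning(n_components=128, alpha=1.0, max_iter=20, random_state=0)
coder.fit(X_unlabeled)

# Step 2: features = the sparse coefficients a for each labeled example.
A = coder.transform(X_labeled)

# Step 3: train any standard supervised classifier on the new features.
clf = LinearSVC().fit(A, y_labeled)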
Learning the bases: Sparse coding
Given only unlabeled data $x_u^{(i)}$, we find good bases $b$ using sparse coding:
$$\min_{b,\,a} \; \sum_i \Big\| x_u^{(i)} - \sum_j a_j^{(i)} b_j \Big\|_2^2 \;+\; \beta \sum_i \big\| a^{(i)} \big\|_1$$
(The first term is the reconstruction error; the second is the sparsity penalty.)
[Details: An extra normalization constraint on $b_j$, namely $\|b_j\|_2 \le 1$, is required.]
(Efficient algorithms: Lee et al., NIPS 2006)
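The objective above can be minimized by alternating between the codes a (bases fixed) and the bases b (codes fixed). Below is a toy Python sketch of that alternation, not the efficient algorithm of Lee et al. (NIPS 2006): an ISTA (soft-thresholding) step for the codes, a gradient step for the bases, and a projection enforcing the normalization constraint. The step sizes and iteration counts are illustrative assumptions.

import numpy as np

def soft_threshold(z, t):
    # Proximal operator of t * ||.||_1
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def sparse_coding(X, s=64, beta=0.1, iters=100, lr=1e-3):
    n, d = X.shape                      # n unlabeled examples x_u^(i), dimension d
    rng = np.random.default_rng(0)
    B = rng.normal(size=(d, s))
    B /= np.linalg.norm(B, axis=0)      # start with unit-norm bases
    A = np.zeros((n, s))                # codes a^(i), one row per example
    for _ in range(iters):
        # Codes: one ISTA step on sum_i ||x^(i) - B a^(i)||^2 + beta ||a^(i)||_1
        step = 1.0 / (2.0 * np.linalg.norm(B.T @ B, 2))
        grad_A = 2.0 * (A @ B.T - X) @ B
        A = soft_threshold(A - step * grad_A, step * beta)
        # Bases: gradient step on the reconstruction error, codes held fixed
        B -= lr * 2.0 * (A @ B.T - X).T @ A
        # Projection for the constraint ||b_j||_2 <= 1
        B /= np.maximum(np.linalg.norm(B, axis=0), 1.0)
    return B, A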
Example bases
Natural images. Learnt bases: “Edges”
Handwritten characters. Learnt bases: “Strokes”
Constructing features
Using the learnt bases $b$, compute features for the examples $x_l$ from the classification task by solving:
$$\text{Features of } x_l \;=\; \arg\min_a \Big\| x_l - \sum_j a_j b_j \Big\|_2^2 + \beta \|a\|_1$$
(Again: reconstruction error plus sparsity penalty.)
Example: $x_l = 0.8\, b_{87} + 0.3\, b_{376} + 0.5\, b_{411}$.
Finally, learn a classifier using a standard supervised learning algorithm (e.g., SVM) over these features.
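Since the bases are now fixed, each feature vector is just the solution of a Lasso problem with the bases as the design matrix. A minimal sketch, using scikit-learn's Lasso as a stand-in solver (its objective rescales the squared error by 1/(2n), so its alpha corresponds only roughly to the beta above):

import numpy as np
from sklearn.linear_model import Lasso

def features(x_l, B, alpha=0.05):
    # Solve: a = argmin_a ||x_l - B a||^2 + (penalty) ||a||_1,
    # with the learnt bases B (columns b_j) held fixed.
    lasso = Lasso(alpha=alpha, fit_intercept=False, max_iter=10000)
    lasso.fit(B, x_l)       # design matrix = bases, "target" = the input
    return lasso.coef_      # sparse activations a: the new feature vector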
Image classification
[Figure: feature visualizations for a large platypus image from the Caltech101 dataset]
Image classification
(15 labeled images per class)
Baseline: 16%
PCA: 37%
Sparse coding: 47% (36.0% error reduction)
Other reported results: Fei-Fei et al., 2004: 16%; Berg et al., 2005: 17%; Holub et al., 2005: 40%; Serre et al., 2005: 35%; Berg et al., 2005: 48%; Zhang et al., 2006: 59%; Lazebnik et al., 2006: 56%.
Character recognition
Domains: Digits, handwritten English, English font.
Handwritten English classification (20 labeled images per handwritten character), bases learnt on digits:
Raw: 54.8%
PCA: 54.8%
Sparse coding: 58.5% (8.2% error reduction)
English font classification (20 labeled images per font character), bases learnt on handwritten English:
Raw: 17.9%
PCA: 14.5%
Sparse coding: 16.6%
Sparse coding + Raw: 20.2% (2.8% error reduction)
Text classification
Domains: Reuters newswire, webpages, UseNet articles.
Webpage classification (2 labeled documents per class), bases learnt on Reuters newswire:
Raw words: 62.8%
PCA: 63.3%
Sparse coding: 64.3% (4.0% error reduction)
UseNet classification (2 labeled documents per class), bases learnt on Reuters newswire:
Raw words: 61.3%
PCA: 60.7%
Sparse coding: 63.8% (6.5% error reduction)
Shift-invariant sparse coding
[Figure: a signal reconstructed as sparse features convolved with basis functions]
(Algorithms: Grosse et al., UAI 2007)
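In the shift-invariant model, the reconstruction is a sum of convolutions: each short basis function b_j can appear at any temporal offset, with a sparse activation signal a_j saying where and how strongly. A minimal numpy sketch of the reconstruction step (equal filter and signal lengths are an illustrative assumption):

import numpy as np

def reconstruct(bases, activations):
    # bases: list of short 1-D filters b_j (equal lengths assumed here)
    # activations: list of sparse 1-D signals a_j, one per basis
    length = len(activations[0]) + len(bases[0]) - 1
    x_hat = np.zeros(length)
    for b_j, a_j in zip(bases, activations):
        x_hat += np.convolve(a_j, b_j)   # basis placed at every nonzero of a_j
    return x_hat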
Audio classification
Speaker identification (5 labels, TIMIT corpus, 1 sentence per speaker), bases learnt on different dialects:
Spectrogram: 38.5%
MFCCs: 43.8%
Sparse coding: 48.7% (8.7% error reduction)
Musical genre classification (5 labels, 18 seconds per genre), bases learnt on different genres and songs:
Spectrogram: 48.4%
MFCCs: 54.0%
Music-specific model: 49.3%
Sparse coding: 56.6% (5.7% error reduction)
(Details: Grosse et al., UAI 2007)
Sparse deep belief networks
New: Sparse RBM
[Figure: a restricted Boltzmann machine with visible layer v, hidden layer h, and parameters W, b, c]
(Details: Lee et al., NIPS 2007. Poster yesterday.)
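As a rough illustration of the sparse RBM idea, here is a toy contrastive-divergence (CD-1) update with an added term nudging each hidden unit's mean activation toward a small target p, in the spirit of Lee et al. (NIPS 2007). The exact penalty form, learning rates, and sparsity target used here are assumptions, not the paper's precise formulation.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sparse_rbm_step(V, W, b, c, lr=0.01, p=0.02, sparsity_cost=0.1):
    rng = np.random.default_rng(0)
    # Positive phase: hidden activations given the data v.
    ph = sigmoid(V @ W + c)
    h = (rng.random(ph.shape) < ph).astype(float)
    # Negative phase: one step of Gibbs sampling (CD-1).
    pv = sigmoid(h @ W.T + b)
    ph2 = sigmoid(pv @ W + c)
    n = V.shape[0]
    # CD-1 gradient, plus a term pushing mean hidden activation toward p.
    W += lr * (V.T @ ph - pv.T @ ph2) / n
    b += lr * (V - pv).mean(axis=0)
    c += lr * ((ph - ph2).mean(axis=0) + sparsity_cost * (p - ph.mean(axis=0)))
    return W, b, c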
Sparse deep belief networks
Image classification (Caltech101 dataset):
1-layer sparse DBN: 44.5%
2-layer sparse DBN: 46.6% (3.2% error reduction)
(Details: Lee et al., NIPS 2007. Poster yesterday.)
Summary
Self-taught learning: Unlabeled data does not share the labels of the classification task.
Use unlabeled data to discover features. Use sparse coding to construct an easy-to-classify, “higher-level” representation.
[Figure: cars vs. motorcycles task, unlabeled images, and a sparse basis combination x = 0.8 b_87 + 0.3 b_376 + 0.5 b_411]
THE END
Related Work
• Weston et al., ICML 2006: Make stronger assumptions on the unlabeled data.
• Ando & Zhang, JMLR 2005: For natural language tasks and character recognition, use heuristics to construct a transfer learning task using unlabeled data.