[Source: pages.cs.wisc.edu/~jerryzhu/pub/sslchicago09.pdf]
# Tutorial on Semi-Supervised Learning

Xiaojin Zhu

Department of Computer Sciences, University of Wisconsin, Madison, USA

Theory and Practice of Computational Learning, Chicago, 2009
Xiaojin Zhu (Univ. Wisconsin, Madison) Tutorial on Semi-Supervised Learning Chicago 2009 1 / 99
## New book

Xiaojin Zhu and Andrew B. Goldberg. *Introduction to Semi-Supervised Learning*. Morgan & Claypool, 2009.
## Outline

1. Part I
   - What is SSL?
   - Mixture Models
   - Co-training and Multiview Algorithms
   - Manifold Regularization and Graph-Based Algorithms
   - S3VMs and Entropy Regularization
2. Part II
   - Theory of SSL
   - Online SSL
   - Multimanifold SSL
   - Human SSL
## Part I: What is SSL?
### What is Semi-Supervised Learning?
Learning from both labeled and unlabeled data. Examples:

- Semi-supervised classification: training data consists of $l$ labeled instances $\{(x_i, y_i)\}_{i=1}^{l}$ and $u$ unlabeled instances $\{x_j\}_{j=l+1}^{l+u}$, often with $u \gg l$. Goal: a better classifier $f$ than from the labeled data alone.
- Constrained clustering: unlabeled instances $\{x_i\}_{i=1}^{n}$, plus "supervised information", e.g., must-links and cannot-links. Goal: a better clustering than from the unlabeled data alone.

We will mainly discuss semi-supervised classification.
### Motivations
Machine learning:

- Promise: better performance for free...
- labeled data can be hard to get
  - labels may require human experts
  - labels may require special devices
- unlabeled data is often cheap in large quantity

Cognitive science:

- a computational model of how humans learn from labeled and unlabeled data
- concept learning in children: $x$ = animal, $y$ = concept (e.g., dog)
- Daddy points to a brown animal and says "dog!"
- Children also observe animals by themselves
### Example of hard-to-get labels
Task: speech analysis

- Switchboard dataset
- telephone conversation transcription
- 400 hours of annotation time for each hour of speech

Example phone-level transcriptions:

- film ⇒ f ih n uh gl n m
- be all ⇒ bcl b iy iy tr ao tr ao l dl
### Another example of hard-to-get labels
Task: natural language parsing

- Penn Chinese Treebank
- 2 years for 4,000 sentences
- Example: "The National Track and Field Championship has finished."
### Notations
- instance $x$, label $y$
- learner $f : \mathcal{X} \mapsto \mathcal{Y}$
- labeled data $(X_l, Y_l) = (x_{1:l}, y_{1:l})$
- unlabeled data $X_u = x_{l+1:l+u}$, available during training. Usually $l \ll u$. Let $n = l + u$.
- test data $(x_{n+1}, y_{n+1}), \ldots$, not available during training
### Semi-supervised vs. transductive learning
- Inductive semi-supervised learning: given $\{(x_i, y_i)\}_{i=1}^{l}$ and $\{x_j\}_{j=l+1}^{l+u}$, learn $f : \mathcal{X} \mapsto \mathcal{Y}$ so that $f$ is expected to be a good predictor on future data, beyond $\{x_j\}_{j=l+1}^{l+u}$.
- Transductive learning: given $\{(x_i, y_i)\}_{i=1}^{l}$ and $\{x_j\}_{j=l+1}^{l+u}$, learn $f : \mathcal{X}^{l+u} \mapsto \mathcal{Y}^{l+u}$ so that $f$ is expected to be a good predictor on the unlabeled data $\{x_j\}_{j=l+1}^{l+u}$. Note $f$ is defined only on the given training sample, and is not required to make predictions outside it.
### How can unlabeled data ever help?
[Figure: 1-D data on $x \in [-1.5, 2]$ comparing the supervised decision boundary with the semi-supervised decision boundary. Legend: positive labeled data, negative labeled data, unlabeled data.]
- assuming each class is a coherent group (e.g., Gaussian)
- with and without unlabeled data: the decision boundary shifts
- This is only one of many ways to use unlabeled data.
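To make the boundary shift concrete, here is a minimal 1-D sketch (toy numbers of our own, not from the slides): under a two-class assumption with one coherent group per class, the decision boundary sits at the midpoint of the class means, and re-estimating the means with unlabeled points moves it.

```python
# Toy illustration: one labeled point per class.
labeled_neg, labeled_pos = -1.0, 1.5

# Supervised estimate: each class mean is just its single labeled point,
# so the midpoint boundary lands at 0.25.
mean_neg, mean_pos = labeled_neg, labeled_pos
boundary_supervised = (mean_neg + mean_pos) / 2

# Unlabeled points cluster around the true class centers -2 and +2.
unlabeled = [-2.2, -1.9, -2.1, -1.8, 2.1, 1.9, 2.2, 1.8]

# One hard-assignment update: give each unlabeled point to the nearer mean,
# then recompute both means including those points.
neg = [labeled_neg] + [x for x in unlabeled if abs(x - mean_neg) < abs(x - mean_pos)]
pos = [labeled_pos] + [x for x in unlabeled if abs(x - mean_neg) >= abs(x - mean_pos)]
mean_neg, mean_pos = sum(neg) / len(neg), sum(pos) / len(pos)
boundary_semi = (mean_neg + mean_pos) / 2

print(boundary_supervised, round(boundary_semi, 3))  # boundary shifts toward 0
```

The shift happens because the unlabeled clusters pull the estimated class means toward the true centers, exactly the "coherent group" intuition above.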
### Self-training algorithm
Our first SSL algorithm:

Input: labeled data $\{(x_i, y_i)\}_{i=1}^{l}$, unlabeled data $\{x_j\}_{j=l+1}^{l+u}$.

1. Initially, let $L = \{(x_i, y_i)\}_{i=1}^{l}$ and $U = \{x_j\}_{j=l+1}^{l+u}$.
2. Repeat:
3. Train $f$ from $L$ using supervised learning.
4. Apply $f$ to the unlabeled instances in $U$.
5. Remove a subset $S$ from $U$; add $\{(x, f(x)) \mid x \in S\}$ to $L$.
Self-training is a wrapper method:

- the choice of learner for $f$ in step 3 is left completely open
- good for many real-world tasks like natural language processing
- but a mistake by $f$ can reinforce itself
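The loop is short enough to write down directly. A minimal sketch (the function names and the toy nearest-mean base learner are ours): the base learner is pluggable, which is exactly the "wrapper" property, and here the subset $S$ is simply the next batch of $U$ rather than the most confident predictions.

```python
def train_nearest_mean(L):
    """Toy supervised learner: fit 1-D class means from labeled pairs (x, y)."""
    sums, counts = {}, {}
    for x, y in L:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    means = {y: sums[y] / counts[y] for y in sums}
    return lambda x: min(means, key=lambda y: abs(x - means[y]))

def self_train(labeled, unlabeled, train, batch=2):
    """Steps 1-5 above; S is just the next `batch` points of U for simplicity
    (practical systems move the most confidently labeled instances instead)."""
    L, U = list(labeled), list(unlabeled)
    while U:
        f = train(L)                      # 3. train f from L
        S, U = U[:batch], U[batch:]       # 5. remove a subset S from U...
        L += [(x, f(x)) for x in S]       # ...add (x, f(x)) to L
    return train(L)

# Two classes near -2 and +2; the unlabeled points refine the class means.
f = self_train([(-2.0, 0), (2.0, 1)], [-1.5, 1.5, -2.5, 2.5], train_nearest_mean)
print(f(-1.0), f(1.0))
```

Swapping `train_nearest_mean` for any supervised learner with the same interface changes nothing in `self_train`; that modularity is also why a confidently wrong `f` can keep feeding itself bad labels.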
### Self-training example: Propagating 1-Nearest-Neighbor
An instance of self-training.
Input: labeled data $\{(x_i, y_i)\}_{i=1}^{l}$, unlabeled data $\{x_j\}_{j=l+1}^{l+u}$, distance function $d(\cdot)$.

1. Initially, let $L = \{(x_i, y_i)\}_{i=1}^{l}$ and $U = \{x_j\}_{j=l+1}^{l+u}$.
2. Repeat until $U$ is empty:
3. Select $x = \arg\min_{x \in U} \min_{x' \in L} d(x, x')$.
4. Set $f(x)$ to the label of $x$'s nearest instance in $L$. Break ties randomly.
5. Remove $x$ from $U$; add $(x, f(x))$ to $L$.
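The procedure translates directly into code. A sketch (1-D data with absolute distance for brevity, and ties broken deterministically by list order rather than randomly):

```python
def propagate_1nn(labeled, unlabeled, d=lambda a, b: abs(a - b)):
    L, U = list(labeled), list(unlabeled)
    while U:                                          # 2. repeat until U is empty
        # 3. select the unlabeled instance closest to the current labeled set
        x = min(U, key=lambda u: min(d(u, xl) for xl, _ in L))
        # 4. label it by its nearest instance in L (ties broken by order here)
        _, y = min(L, key=lambda pair: d(x, pair[0]))
        U.remove(x)                                   # 5. move it from U to L
        L.append((x, y))
    return L

# Labels spread outward from the two labeled seeds at 0 ('a') and 10 ('b').
labels = dict(propagate_1nn([(0.0, 'a'), (10.0, 'b')], [1.0, 9.0, 4.0, 6.0]))
print(labels)
```

Note how newly labeled points immediately join $L$ and influence later labels; this is precisely the propagation that works on clean clusters but lets a single outlier mislead everything downstream.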
### Propagating 1-Nearest-Neighbor: now it works
[Figure: propagating 1-NN on 2-D data, weight (lbs.) vs. height (in.); panels show (a) iteration 1, (b) iteration 25, (c) iteration 74, and (d) the final labeling of all instances.]
### Propagating 1-Nearest-Neighbor: now it doesn't

But with a single outlier...
[Figure: the same weight (lbs.) vs. height (in.) data with a single outlier marked; panels (a)-(d) show the propagated labeling going wrong across iterations.]
## Part I: Mixture Models
### A simple example of generative models
Labeled data $(X_l, Y_l)$:

[Figure: labeled points from two classes in 2-D; both axes range from −5 to 5.]

Assuming each class has a Gaussian distribution, what is the decision boundary?
Model parameters: $\theta = \{w_1, w_2, \mu_1, \mu_2, \Sigma_1, \Sigma_2\}$. The GMM:

$$p(x, y \mid \theta) = p(y \mid \theta)\, p(x \mid y, \theta) = w_y\, \mathcal{N}(x; \mu_y, \Sigma_y)$$

Classification: $p(y \mid x, \theta) = \frac{p(x, y \mid \theta)}{\sum_{y'} p(x, y' \mid \theta)}$
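In code, classification is just Bayes' rule over the mixture components. A 1-D sketch (made-up parameters; each entry of `theta` is $(w, \mu, \sigma^2)$):

```python
import math

def gauss(x, mu, var):
    """Density of a 1-D Gaussian N(x; mu, var)."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def posterior(x, theta):
    """p(y | x, theta) = p(x, y | theta) / sum over y' of p(x, y' | theta)."""
    joint = [w * gauss(x, mu, var) for w, mu, var in theta]
    z = sum(joint)
    return [j / z for j in joint]

theta = [(0.5, -2.0, 1.0), (0.5, 2.0, 1.0)]  # equal weights, symmetric means
print(posterior(0.0, theta))  # halfway between the means: [0.5, 0.5]
print(posterior(2.0, theta))  # at the second mean: class 2 dominates
```

With equal weights and variances the posterior is 0.5/0.5 exactly at the midpoint of the means, which is why the decision boundary there is the perpendicular bisector.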
The most likely model, and its decision boundary:

[Figure: the maximum likelihood Gaussians fit to the labeled data and the resulting decision boundary; axes range from −5 to 5.]
Adding unlabeled data:

[Figure: the same labeled points plus many unlabeled points; axes range from −5 to 5.]
With unlabeled data, the most likely model and its decision boundary:

[Figure: the maximum likelihood fit using both labeled and unlabeled data; the decision boundary has shifted. Axes range from −5 to 5.]
They are different because they maximize different quantities:

$$p(X_l, Y_l \mid \theta) \qquad \text{vs.} \qquad p(X_l, Y_l, X_u \mid \theta)$$

[Figure: the two fitted models side by side, one maximizing the labeled-data likelihood and one the joint likelihood.]
### Generative model for semi-supervised learning
Assumption: knowledge of the model form $p(X, Y \mid \theta)$.

- joint and marginal likelihood:

$$p(X_l, Y_l, X_u \mid \theta) = \sum_{Y_u} p(X_l, Y_l, X_u, Y_u \mid \theta)$$

- find the maximum likelihood estimate (MLE) of $\theta$, the maximum a posteriori (MAP) estimate, or be Bayesian
- common mixture models used in semi-supervised learning:
  - Mixture of Gaussian distributions (GMM) – image classification
  - Mixture of multinomial distributions (Naïve Bayes) – text categorization
  - Hidden Markov Models (HMM) – speech recognition
- Learning via the Expectation-Maximization (EM) algorithm (Baum-Welch)
### Case study: GMM
Binary classification with GMM using MLE.

- with only labeled data:

$$\log p(X_l, Y_l \mid \theta) = \sum_{i=1}^{l} \log p(y_i \mid \theta)\, p(x_i \mid y_i, \theta)$$

  The MLE for $\theta$ is trivial (sample mean and covariance).

- with both labeled and unlabeled data:

$$\log p(X_l, Y_l, X_u \mid \theta) = \sum_{i=1}^{l} \log p(y_i \mid \theta)\, p(x_i \mid y_i, \theta) + \sum_{i=l+1}^{l+u} \log \left( \sum_{y=1}^{2} p(y \mid \theta)\, p(x_i \mid y, \theta) \right)$$

  The MLE is harder (hidden variables): EM.
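Both objectives are easy to evaluate for a fixed $\theta$; what is hard is maximizing the second one. A 1-D sketch with invented data (each entry of `theta` is $(w, \mu, \sigma^2)$); note the unlabeled term sums over the hidden label *inside* the log:

```python
import math

def log_gauss(x, mu, var):
    return -((x - mu) ** 2) / (2 * var) - 0.5 * math.log(2 * math.pi * var)

def loglik_labeled(Xl, Yl, theta):
    # sum_i log p(y_i | theta) p(x_i | y_i, theta)
    return sum(math.log(theta[y][0]) + log_gauss(x, theta[y][1], theta[y][2])
               for x, y in zip(Xl, Yl))

def loglik_unlabeled(Xu, theta):
    # sum_i log sum_y p(y | theta) p(x_i | y, theta)  -- y is marginalized out
    return sum(math.log(sum(w * math.exp(log_gauss(x, mu, var))
                            for w, mu, var in theta))
               for x in Xu)

theta = [(0.5, -2.0, 1.0), (0.5, 2.0, 1.0)]
Xl, Yl = [-2.1, 1.8], [0, 1]
Xu = [-1.9, 2.2, 0.1]
full = loglik_labeled(Xl, Yl, theta) + loglik_unlabeled(Xu, theta)
print(full)
```

The log-of-a-sum in `loglik_unlabeled` is what makes the gradient and the MLE non-trivial, and it is exactly the term EM handles via its lower bound.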
### The EM algorithm for GMM
1. Start from the MLE $\theta = \{w_c, \mu_c, \Sigma_c\}_{c=1:2}$ on $(X_l, Y_l)$:
   - $w_c$ = proportion of class $c$
   - $\mu_c$ = sample mean of class $c$
   - $\Sigma_c$ = sample covariance of class $c$

   Then repeat:

2. The E-step: compute the expected label $p(y \mid x, \theta) = \frac{p(x, y \mid \theta)}{\sum_{y'} p(x, y' \mid \theta)}$ for all $x \in X_u$:
   - label a $p(y = 1 \mid x, \theta)$ fraction of $x$ with class 1
   - label a $p(y = 2 \mid x, \theta)$ fraction of $x$ with class 2
3. The M-step: update the MLE $\theta$ with the (now fractionally labeled) $X_u$.
Can be viewed as a special form of self-training.
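A 1-D sketch of this loop (two classes; the data, names, and the variance floor are ours). The "fractional labels" appear as responsibilities `gamma`, and labeled points keep weight 1 in the M-step:

```python
import math

def gauss(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm(Xl, Yl, Xu, iters=20):
    n = len(Xl) + len(Xu)
    # 1. start from the supervised MLE on (Xl, Yl)
    theta = []
    for c in (0, 1):
        pts = [x for x, y in zip(Xl, Yl) if y == c]
        mu = sum(pts) / len(pts)
        var = sum((x - mu) ** 2 for x in pts) / len(pts) or 1.0  # guard var = 0
        theta.append((len(pts) / len(Xl), mu, var))
    for _ in range(iters):
        # 2. E-step: responsibilities p(y | x, theta) for each unlabeled x
        gamma = []
        for x in Xu:
            joint = [w * gauss(x, mu, var) for w, mu, var in theta]
            z = sum(joint)
            gamma.append([j / z for j in joint])
        # 3. M-step: weighted MLE; labeled points carry weight 1
        new_theta = []
        for c in (0, 1):
            pts = [(x, 1.0) for x, y in zip(Xl, Yl) if y == c]
            pts += [(x, g[c]) for x, g in zip(Xu, gamma)]
            total = sum(w for _, w in pts)
            mu = sum(w * x for x, w in pts) / total
            var = sum(w * (x - mu) ** 2 for x, w in pts) / total
            new_theta.append((total / n, mu, max(var, 1e-6)))
        theta = new_theta
    return theta

theta = em_gmm([-2.0, 2.0], [0, 1], [-2.5, -1.5, -1.8, 1.6, 2.4, 1.9])
print(theta[0][1], theta[1][1])  # class means settle near the two clusters
```

Replacing the soft responsibilities with hard argmax labels turns this into exactly the self-training view mentioned above.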
### The assumption of mixture models

Assumption: the data actually comes from the mixture model, where the number of components, the prior $p(y)$, and the conditional $p(x \mid y)$ are all correct.
When the assumption is wrong:

[Figure: 2-D data ($x_1$ vs. $x_2$, axes roughly −6 to 6) with classes $y = 1$ and $y = -1$ arranged so that a single Gaussian per class is a poor fit.]
For example, classifying text by topic vs. by genre.
[Figure: two fitted models on the same data; the wrong model attains a higher log likelihood (−847.9309) than the correct model (−921.143).]
Heuristics to lessen the danger
Carefully construct the generative model, e.g., multiple Gaussiandistributions per classDown-weight the unlabeled data (λ < 1)
log p(Xl, Yl, Xu|θ) =∑l
i=1 log p(yi|θ)p(xi|yi, θ)
+ λ∑l+u
i=l+1 log(∑2
y=1 p(y|θ)p(xi|y, θ))
Other
dangers: identifiability, EM local optima
Xiaojin Zhu (Univ. Wisconsin, Madison) Tutorial on Semi-Supervised Learning Chicago 2009 28 / 99
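The down-weighted objective above can be evaluated directly. A minimal sketch, assuming a two-component 1-D Gaussian mixture; the parameterization (`theta` mapping each class to a prior, mean, and variance) and all data are illustrative, not from the tutorial:

```python
import math

def log_gauss(x, mu, var):
    # log density of N(mu, var) at x
    return -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)

def downweighted_loglik(theta, labeled, unlabeled, lam=0.5):
    # theta maps class y -> (prior p(y), mean, variance)
    # labeled part: sum of log p(y|theta) p(x|y, theta)
    ll = sum(math.log(theta[y][0]) + log_gauss(x, theta[y][1], theta[y][2])
             for x, y in labeled)
    # unlabeled part: lambda-weighted log of the marginal sum over classes
    for x in unlabeled:
        ll += lam * math.log(sum(p * math.exp(log_gauss(x, mu, var))
                                 for p, mu, var in theta.values()))
    return ll

theta = {1: (0.5, -2.0, 1.0), 2: (0.5, 2.0, 1.0)}
labeled = [(-2.1, 1), (1.9, 2)]
unlabeled = [-1.5, 0.0, 2.5]
# lam < 1 shrinks the (negative) unlabeled contribution toward zero
print(downweighted_loglik(theta, labeled, unlabeled, lam=0.5) >
      downweighted_loglik(theta, labeled, unlabeled, lam=1.0))  # True
```

With λ = 0 this reduces to the purely supervised likelihood; λ = 1 trusts the unlabeled data as much as the labeled data.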
Related: cluster-and-label

Input: (x1, y1), ..., (xl, yl), xl+1, ..., xl+u,
a clustering algorithm A, and a supervised learning algorithm L.

1. Cluster x1, ..., xl+u using A.
2. For each cluster, let S be the labeled instances in it.
3. Learn a supervised predictor from S: fS = L(S).
4. Apply fS to all unlabeled instances in this cluster.

Output: labels on the unlabeled data yl+1, ..., yl+u.

But again, SSL is sensitive to assumptions, in this case that the clusters coincide with the decision boundary. If this assumption is wrong, the results can be poor.
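The four steps above can be sketched end to end. A minimal runnable version, assuming A is a two-center k-means (deterministically initialized for reproducibility) and L is majority vote within each cluster; the data and all function names are illustrative:

```python
def dist2(p, q):
    # squared Euclidean distance between two points
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, iters=20):
    # toy k-means with k = 2, initialized from the first and last point
    centers = [points[0], points[-1]]
    assign = [0] * len(points)
    for _ in range(iters):
        # step: assign each point to its nearest center
        assign = [min(range(2), key=lambda c: dist2(p, centers[c])) for p in points]
        # step: recompute each center as the mean of its members
        for c in range(2):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = tuple(sum(vals) / len(members) for vals in zip(*members))
    return assign

def cluster_and_label(labeled, unlabeled):
    # 1. cluster all instances, labeled and unlabeled together
    xs = [x for x, _ in labeled] + list(unlabeled)
    assign = kmeans(xs)
    l = len(labeled)
    preds = []
    for j in range(len(unlabeled)):
        c = assign[l + j]
        # 2.-4. majority vote among labeled instances in the same cluster
        votes = [y for (x, y), a in zip(labeled, assign[:l]) if a == c]
        preds.append(max(set(votes), key=votes.count) if votes else None)
    return preds

labeled = [((0.0, 0.0), -1), ((0.2, 0.1), -1), ((5.0, 5.0), 1)]
unlabeled = [(0.1, 0.2), (5.1, 4.9)]
print(cluster_and_label(labeled, unlabeled))  # [-1, 1]
```

Note that if a cluster contains no labeled instance, majority vote is undefined (here it returns None), which is one concrete way the clustering assumption can fail.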
Cluster-and-label: now it works, now it doesn't

Example: A = hierarchical clustering, L = majority vote.

[Figure: two rows of three panels plotting height (in.) vs. weight (lbs.): "Partially labeled data", the clustering, and "Predicted labeling". Top row: single linkage clustering. Bottom row: complete linkage clustering. The two linkage choices produce different predicted labelings; one matches the true classes and the other does not.]
Part I Co-training and Multiview Algorithms

Outline

1 Part I: What is SSL? · Mixture Models · Co-training and Multiview Algorithms · Manifold Regularization and Graph-Based Algorithms · S3VMs and Entropy Regularization

2 Part II: Theory of SSL · Online SSL · Multimanifold SSL · Human SSL
Two Views of an Instance

Example: named entity classification as Person (Mr. Washington) or Location (Washington State).

instance 1: ... headquartered in (Washington State) ...
instance 2: ... (Mr. Washington), the vice president of ...

A named entity has two views (subsets of the features), x = [x(1), x(2)]:
the words of the entity are x(1); the context is x(2).
Quiz

instance 1: ... headquartered in (Washington State)L ...
instance 2: ... (Mr. Washington)P, the vice president of ...
test: ... (Robert Jordan), a partner at ...
test: ... flew to (China) ...
Quiz

With more unlabeled data:

instance 1: ... headquartered in (Washington State)L ...
instance 2: ... (Mr. Washington)P, the vice president of ...
instance 3: ... headquartered in (Kazakhstan) ...
instance 4: ... flew to (Kazakhstan) ...
instance 5: ... (Mr. Smith), a partner at Steptoe & Johnson ...

test: ... (Robert Jordan), a partner at ...
test: ... flew to (China) ...
Co-training algorithm

Input: labeled data (xi, yi), i = 1, ..., l; unlabeled data xj, j = l+1, ..., l+u,
where each instance has two views xi = [xi(1), xi(2)]; and a learning speed k.

1. Let L1 = L2 = {(x1, y1), ..., (xl, yl)}.
2. Repeat until the unlabeled data is used up:
3. Train a view-1 classifier f(1) from L1, and a view-2 classifier f(2) from L2.
4. Classify the unlabeled data with f(1) and f(2) separately.
5. Add f(1)'s top k most-confident predictions (x, f(1)(x)) to L2.
   Add f(2)'s top k most-confident predictions (x, f(2)(x)) to L1.
   Remove these from the unlabeled data.

Like self-training, but with two classifiers teaching each other.
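The loop above can be sketched concretely. A minimal runnable version, assuming each view is a single number, each view's classifier is a nearest-class-mean rule, and confidence is the margin between the two class-mean distances; all names and the toy data are illustrative, not from the tutorial:

```python
def train_view(examples, view):
    # class means for one view of the current labeled set
    means = {}
    for y in (-1, 1):
        vals = [x[view] for x, lab in examples if lab == y]
        means[y] = sum(vals) / len(vals)
    return means

def predict(means, v):
    # returns (label, confidence); confidence = margin between class distances
    d = {y: abs(v - m) for y, m in means.items()}
    label = min(d, key=d.get)
    return label, d[max(d, key=d.get)] - d[label]

def co_train(labeled, unlabeled, k=1):
    L1, L2 = list(labeled), list(labeled)
    U = list(unlabeled)
    while U:
        f1, f2 = train_view(L1, 0), train_view(L2, 1)
        # each view labels its k most confident points for the *other* view
        p1 = sorted(U, key=lambda x: -predict(f1, x[0])[1])[:k]
        p2 = sorted(U, key=lambda x: -predict(f2, x[1])[1])[:k]
        for x in p1:
            L2.append((x, predict(f1, x[0])[0]))
        for x in p2:
            L1.append((x, predict(f2, x[1])[0]))
        for x in set(p1 + p2):
            U.remove(x)
    return train_view(L1, 0), train_view(L2, 1)

labeled = [((0.0, 0.0), -1), ((1.0, 1.0), 1)]
unlabeled = [(0.1, 0.1), (0.9, 0.9)]
f1, f2 = co_train(labeled, unlabeled, k=1)
```

The key structural point survives even in this toy version: f(1)'s confident predictions grow f(2)'s training set and vice versa, so each view's labels reach the other view only through the pseudo-labeled instances.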
Co-training assumptions

Assumptions:

a feature split x = [x(1); x(2)] exists
x(1) or x(2) alone is sufficient to train a good classifier
x(1) and x(2) are conditionally independent given the class

[Figure: scatter plots of the X1 view and the X2 view, each showing the + and − classes separated within that view alone.]
Multiview learning

Extends co-training.

Loss function: c(x, y, f(x)) ∈ [0, ∞). For example:
squared loss c(x, y, f(x)) = (y − f(x))^2
0/1 loss c(x, y, f(x)) = 1 if y ≠ f(x), and 0 otherwise.

Empirical risk: R(f) = (1/l) ∑_{i=1}^{l} c(xi, yi, f(xi))

Regularizer: Ω(f), e.g., ‖f‖^2

Regularized risk minimization: f* = argmin_{f∈F} R(f) + λ Ω(f)
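Regularized risk minimization as defined above can be made concrete in a few lines. A minimal sketch, assuming a 1-D linear model f(x) = w·x with squared loss, minimized by a coarse grid search over w (the grid search and all names are illustrative simplifications, not the tutorial's method):

```python
def empirical_risk(w, data):
    # R(f) = (1/l) * sum of squared losses over the labeled data
    return sum((y - w * x) ** 2 for x, y in data) / len(data)

def regularized_fit(data, lam=0.1):
    # f* = argmin_w R(f) + lam * w^2, searched over a coarse grid
    grid = [i / 100 for i in range(-300, 301)]
    return min(grid, key=lambda w: empirical_risk(w, data) + lam * w * w)

data = [(1.0, 2.0), (2.0, 4.1), (-1.0, -2.0)]
w = regularized_fit(data, lam=0.1)
print(round(w, 2))  # 1.94
```

Increasing λ shrinks w toward 0, trading training fit for a smaller-norm (simpler) function, which is exactly the role Ω(f) plays in the objective.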
Multiview learning

A special regularizer Ω(f), defined on the unlabeled data, encourages agreement among multiple learners:

argmin_{f1,...,fk} ∑_{v=1}^{k} ( ∑_{i=1}^{l} c(xi, yi, fv(xi)) + λ1 ΩSL(fv) ) + λ2 ∑_{u,v=1}^{k} ∑_{i=l+1}^{l+u} c(xi, fu(xi), fv(xi))
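The objective above can be evaluated for candidate learners. A minimal sketch, assuming k linear scorers fv(x) = wv·x on 1-D inputs with squared loss for both the supervised and the agreement terms; the function and parameter names (`multiview_objective`, `lam1`, `lam2`) are illustrative:

```python
def multiview_objective(ws, labeled, unlabeled, lam1=0.1, lam2=0.1):
    total = 0.0
    for w in ws:
        # supervised part: per-learner empirical loss plus L2 regularizer
        total += sum((y - w * x) ** 2 for x, y in labeled) + lam1 * w * w
    # agreement part: pairwise disagreement on the unlabeled data
    for wu in ws:
        for wv in ws:
            total += lam2 * sum((wu * x - wv * x) ** 2 for x in unlabeled)
    return total

labeled = [(1.0, 1.0), (-1.0, -1.0)]
unlabeled = [0.5, -0.5, 2.0]
# agreeing learners pay no agreement penalty; disagreeing learners do
print(multiview_objective([1.0, 1.0], labeled, unlabeled) <
      multiview_objective([1.0, -1.0], labeled, unlabeled))  # True
```

The agreement term is what couples the learners: it is driven entirely by the unlabeled instances i = l+1, ..., l+u, which is how unlabeled data enters the objective.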
Part I Manifold Regularization and Graph-Based Algorithms

Outline

1 Part I: What is SSL? · Mixture Models · Co-training and Multiview Algorithms · Manifold Regularization and Graph-Based Algorithms · S3VMs and Entropy Regularization

2 Part II: Theory of SSL · Online SSL · Multimanifold SSL · Human SSL
Example: text classification

Classify astronomy vs. travel articles. Similarity is measured by content-word overlap:

             d1  d3  d4  d2
asteroid      •   •
bright        •   •
comet         •
year
zodiac
...
airport
bike
camp                  •
yellowstone           •   •
zion                      •
When labeled data alone fails

No overlapping words!

             d1  d3  d4  d2
asteroid      •
bright        •
comet
year
zodiac            •
...
airport                   •
bike                      •
camp
yellowstone           •
zion                      •
Unlabeled data as stepping stones

Labels "propagate" via similar unlabeled articles.

[Table: word-document incidence over d1, d5, d6, d7, d3, d4, d8, d9, d2. With the unlabeled documents d5-d9 added, the astronomy words (asteroid, bright, comet, year, zodiac) form an overlapping chain of documents from d1 to d3, and the travel words (airport, bike, camp, yellowstone, zion) form one from d4 to d2.]
Another example

Handwritten digit recognition with pixel-wise Euclidean distance.

[Figure: two digit images that are not similar directly, but are 'indirectly' similar through a chain of stepping-stone images.]
![Page 79: Tutorial on Semi-Supervised Learningpages.cs.wisc.edu/~jerryzhu/pub/sslchicago09.pdfXiaojin Zhu (Univ. Wisconsin, Madison) Tutorial on Semi-Supervised Learning Chicago 2009 14 / 99](https://reader031.vdocuments.net/reader031/viewer/2022022518/5b0b3a4e7f8b9a61448c513e/html5/thumbnails/79.jpg)
Part I Manifold Regularization and Graph-Based Algorithms
Graph-based semi-supervised learning
Nodes: Xl ∪Xu
Edges: similarity weights computed from features, e.g.,

k-nearest-neighbor graph, unweighted (0/1 weights)

fully connected graph, weight decays with distance: w = exp(−‖xi − xj‖²/σ²)

ε-radius graph

Assumption: instances connected by a heavy edge tend to have the same label.

(figure: a weighted graph over points x1, x2, x3)
Xiaojin Zhu (Univ. Wisconsin, Madison) Tutorial on Semi-Supervised Learning Chicago 2009 44 / 99
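The two graph constructions above can be sketched in a few lines. This is a minimal pure-Python illustration on hypothetical toy points; the function names and the tiny dataset are mine, not from the tutorial.

```python
import math

def gaussian_weights(X, sigma):
    """Fully connected graph: w_ij = exp(-||x_i - x_j||^2 / sigma^2)."""
    n = len(X)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                d2 = sum((a - b) ** 2 for a, b in zip(X[i], X[j]))
                W[i][j] = math.exp(-d2 / sigma ** 2)
    return W

def knn_graph(X, k):
    """k-nearest-neighbor graph, unweighted (0/1 weights), symmetrized."""
    n = len(X)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        order = sorted(range(n),
                       key=lambda j: sum((a - b) ** 2 for a, b in zip(X[i], X[j])))
        for j in order[1:k + 1]:      # order[0] is i itself (distance 0)
            W[i][j] = W[j][i] = 1.0   # edge if either endpoint is a neighbor
    return W

X = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0)]
W = gaussian_weights(X, sigma=1.0)
# nearby points get weight near 1, distant points a much smaller weight
assert W[0][1] > W[0][2]
```

Note how σ controls the scale: with a small σ the fully connected graph effectively becomes sparse, since distant pairs get negligible weight.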
Part I Manifold Regularization and Graph-Based Algorithms
The mincut algorithm
Fix Yl, find Yu ∈ {0, 1}^(n−l) to minimize ∑_ij wij |yi − yj|.

Equivalently, solve the optimization problem

min_{Y ∈ {0,1}^n}  ∞ ∑_{i=1}^{l} (yi − Yl,i)² + ∑_ij wij (yi − yj)²

A combinatorial problem, but it has a polynomial-time solution.

Mincut computes the modes of a discrete Markov random field, but there might be multiple modes.

(figure: a graph where several minimum cuts separate the + and − labeled nodes)
Xiaojin Zhu (Univ. Wisconsin, Madison) Tutorial on Semi-Supervised Learning Chicago 2009 45 / 99
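To make the objective concrete, here is a brute-force sketch that enumerates all 2^(n−l) unlabeled labelings on a toy graph. This is only to illustrate what is being minimized — the polynomial-time solution mentioned above comes from a max-flow algorithm, which this deliberately does not implement. The function name and example graph are mine.

```python
from itertools import product

def mincut_labels(W, Yl):
    """Exhaustively minimize sum_ij w_ij (y_i - y_j)^2 with labeled nodes clamped.

    W: symmetric weight matrix; Yl: dict {node index: 0 or 1}.
    For 0/1 labels, (y_i - y_j)^2 == |y_i - y_j|, matching the slide.
    Brute force is exponential -- toy graphs only.
    """
    n = len(W)
    unlabeled = [i for i in range(n) if i not in Yl]
    best, best_y = None, None
    for bits in product((0, 1), repeat=len(unlabeled)):
        y = dict(Yl)
        y.update(zip(unlabeled, bits))
        cut = sum(W[i][j] * (y[i] - y[j]) ** 2
                  for i in range(n) for j in range(i + 1, n))
        if best is None or cut < best:
            best, best_y = cut, [y[i] for i in range(n)]
    return best_y

# chain 0 - 1 - 2 - 3 with a weak middle edge: the cut goes through it
W = [[0, 2, 0, 0],
     [2, 0, 1, 0],
     [0, 1, 0, 2],
     [0, 0, 2, 0]]
print(mincut_labels(W, {0: 0, 3: 1}))  # -> [0, 0, 1, 1]
```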
Part I Manifold Regularization and Graph-Based Algorithms
The harmonic function
Relaxing the discrete labels to continuous values in ℝ, the harmonic function f satisfies

f(xi) = yi for i = 1 . . . l

f minimizes the energy ∑_{i∼j} wij (f(xi) − f(xj))²

f is the mean of a Gaussian random field

f is the average of its neighbors: f(xi) = ∑_{j∼i} wij f(xj) / ∑_{j∼i} wij, ∀xi ∈ Xu
Xiaojin Zhu (Univ. Wisconsin, Madison) Tutorial on Semi-Supervised Learning Chicago 2009 46 / 99
Part I Manifold Regularization and Graph-Based Algorithms
An electric network interpretation
Edges are resistors with conductance wij (resistance Rij = 1/wij)

A 1-volt battery connects to the labeled points y = 0 and y = 1

The voltage at the nodes is the harmonic function f

Implied similarity: similar voltage if many paths exist

(figure: circuit with labeled terminals held at +1 volt and 0)
Xiaojin Zhu (Univ. Wisconsin, Madison) Tutorial on Semi-Supervised Learning Chicago 2009 47 / 99
Part I Manifold Regularization and Graph-Based Algorithms
A random walk interpretation
Randomly walk from node i to node j with probability wij / ∑_k wik

Stop if we hit a labeled node

The harmonic function: f(i) = Pr(hit label 1 | start from i)

(figure: a walk from node i reaching the labeled nodes 1 and 0)
Xiaojin Zhu (Univ. Wisconsin, Madison) Tutorial on Semi-Supervised Learning Chicago 2009 48 / 99
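The random-walk interpretation can be checked by Monte Carlo on a toy chain: simulate walks until absorption at a labeled node and estimate the hitting probability. A minimal sketch, assuming a small dense weight matrix; the function name, seed, and example are mine.

```python
import random

def hit_probability(W, labels, start, n_walks=20000, seed=0):
    """Estimate Pr(hit a node labeled 1 before a node labeled 0 | start).

    labels: dict {node: 0 or 1} of absorbing labeled nodes.
    Transition probability from i to j is w_ij / sum_k w_ik.
    """
    rng = random.Random(seed)
    nodes = list(range(len(W)))
    hits = 0
    for _ in range(n_walks):
        i = start
        while i not in labels:                     # walk until absorbed
            i = rng.choices(nodes, weights=W[i])[0]
        hits += labels[i]
    return hits / n_walks

# path 0 - 1 - 2 - 3 with labels f(0)=0, f(3)=1: the harmonic value at node 1 is 1/3
W = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]
print(hit_probability(W, {0: 0, 3: 1}, start=1))  # close to 0.333
```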
Part I Manifold Regularization and Graph-Based Algorithms
An algorithm to compute harmonic function
One iterative way to compute the harmonic function:
1 Initially, set f(xi) = yi for i = 1 . . . l, and f(xj) arbitrarily (e.g., 0) for xj ∈ Xu.

2 Repeat until convergence: set f(xi) = ∑_{j∼i} wij f(xj) / ∑_{j∼i} wij, ∀xi ∈ Xu, i.e., the average of the neighbors. Note f(Xl) is fixed.
Xiaojin Zhu (Univ. Wisconsin, Madison) Tutorial on Semi-Supervised Learning Chicago 2009 49 / 99
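The two steps above translate directly into code. A pure-Python sketch with a fixed number of sweeps standing in for a convergence test; the function name and toy graph are mine.

```python
def harmonic(W, Yl, iters=1000):
    """Iterative averaging: clamp labeled values, repeatedly set each
    unlabeled f(x_i) to the weighted average of its neighbors."""
    n = len(W)
    f = [Yl.get(i, 0.0) for i in range(n)]   # step 1: unlabeled start at 0
    for _ in range(iters):                   # step 2: repeat until convergence
        for i in range(n):
            if i in Yl:
                continue                     # f(X_l) stays fixed
            total = sum(W[i][j] for j in range(n))
            f[i] = sum(W[i][j] * f[j] for j in range(n)) / total
    return f

# path graph 0 - 1 - 2 - 3, labels f(0)=0, f(3)=1
W = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]
print(harmonic(W, {0: 0.0, 3: 1.0}))  # converges to roughly [0, 1/3, 2/3, 1]
```

On this chain the solution interpolates linearly between the labeled endpoints, matching the random-walk hitting probabilities.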
Part I Manifold Regularization and Graph-Based Algorithms
The graph Laplacian
We can also compute f in closed form using the graph Laplacian.
n × n weight matrix W on Xl ∪ Xu

symmetric, non-negative

Diagonal degree matrix D: Dii = ∑_{j=1}^{n} Wij

Graph Laplacian matrix: ∆ = D − W

The energy can be rewritten as ∑_{i∼j} wij (f(xi) − f(xj))² = f⊤∆f
Xiaojin Zhu (Univ. Wisconsin, Madison) Tutorial on Semi-Supervised Learning Chicago 2009 50 / 99
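The identity ∑ wij (f(xi) − f(xj))² = f⊤∆f is easy to verify numerically. A pure-Python sketch on a hypothetical 3-node graph; helper names are mine.

```python
def laplacian(W):
    """Graph Laplacian: Delta = D - W, with D_ii = sum_j W_ij."""
    n = len(W)
    D = [sum(row) for row in W]
    return [[(D[i] if i == j else 0.0) - W[i][j] for j in range(n)]
            for i in range(n)]

def energy(W, f):
    """sum over edges (each pair once) of w_ij (f_i - f_j)^2."""
    n = len(W)
    return sum(W[i][j] * (f[i] - f[j]) ** 2
               for i in range(n) for j in range(i + 1, n))

def quad_form(L, f):
    """f^T L f."""
    n = len(f)
    return sum(f[i] * L[i][j] * f[j] for i in range(n) for j in range(n))

W = [[0, 1, 0], [1, 0, 2], [0, 2, 0]]
L = laplacian(W)
f = [0.0, 0.5, 1.0]
assert abs(energy(W, f) - quad_form(L, f)) < 1e-12  # both equal 0.75 here
```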
Part I Manifold Regularization and Graph-Based Algorithms
Harmonic solution with Laplacian
The harmonic solution minimizes the energy subject to the given labels:

min_f  ∞ ∑_{i=1}^{l} (f(xi) − yi)² + f⊤∆f

Partition the Laplacian matrix: ∆ = [ ∆ll ∆lu ; ∆ul ∆uu ]

Harmonic solution: fu = −∆uu⁻¹ ∆ul Yl

The normalized Laplacian L = D^(−1/2) ∆ D^(−1/2) = I − D^(−1/2) W D^(−1/2), or ∆^p, L^p (p > 0), are often used too.
Xiaojin Zhu (Univ. Wisconsin, Madison) Tutorial on Semi-Supervised Learning Chicago 2009 51 / 99
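The closed form fu = −∆uu⁻¹ ∆ul Yl can be sketched with a tiny Gaussian-elimination solver (pure Python, small dense matrices only — the solver and function names are mine, added so the block stays self-contained):

```python
def solve(A, b):
    """Tiny Gauss-Jordan elimination with partial pivoting (illustration only)."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c and M[r][c]:
                k = M[r][c] / M[c][c]
                M[r] = [a - k * v for a, v in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def harmonic_closed_form(W, labeled, Yl):
    """f_u = -Delta_uu^{-1} Delta_ul Y_l, with Delta = D - W."""
    n = len(W)
    D = [sum(row) for row in W]
    Delta = [[(D[i] if i == j else 0.0) - W[i][j] for j in range(n)]
             for i in range(n)]
    u = [i for i in range(n) if i not in labeled]
    Duu = [[Delta[i][j] for j in u] for i in u]
    DulY = [sum(Delta[i][j] * Yl[j] for j in labeled) for i in u]
    fu = solve(Duu, [-v for v in DulY])  # solves Delta_uu f_u = -Delta_ul Y_l
    return dict(zip(u, fu))

# same path graph as before: the closed form matches the iterative solution
W = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
print(harmonic_closed_form(W, [0, 3], {0: 0.0, 3: 1.0}))  # roughly {1: 1/3, 2: 2/3}
```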
Part I Manifold Regularization and Graph-Based Algorithms
Local and Global consistency
Allow f(Xl) to be different from Yl, but penalize the difference:

min_f  ∑_{i=1}^{l} (f(xi) − yi)² + λ f⊤∆f
Xiaojin Zhu (Univ. Wisconsin, Madison) Tutorial on Semi-Supervised Learning Chicago 2009 52 / 99
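This relaxed objective can be minimized by coordinate descent on its stationarity conditions: setting the derivative in f(xi) to zero gives f(xi) = (si yi + λ ∑j wij f(xj)) / (si + λ Dii), where si = 1 on labeled points and 0 otherwise. A pure-Python sketch under that derivation; the function name, toy graph, and fixed sweep count are mine.

```python
def local_global(W, Yl, lam, iters=2000):
    """Minimize sum over labeled i of (f_i - y_i)^2 + lam * f^T Delta f
    by coordinate descent on the stationarity conditions."""
    n = len(W)
    D = [sum(row) for row in W]
    f = [Yl.get(i, 0.0) for i in range(n)]
    for _ in range(iters):
        for i in range(n):
            s = 1.0 if i in Yl else 0.0
            y = Yl.get(i, 0.0)
            f[i] = (s * y + lam * sum(W[i][j] * f[j] for j in range(n))) \
                   / (s + lam * D[i])
    return f

W = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
f = local_global(W, {0: 0.0, 3: 1.0}, lam=0.1)
# labeled values are pulled toward, but no longer exactly equal to, y
print(f)
```

Unlike the hard-constrained harmonic solution, here f(x0) is slightly above 0 and f(x3) slightly below 1: the squared-error term trades label fit against smoothness.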
Part I Manifold Regularization and Graph-Based Algorithms
Manifold regularization
The graph-based algorithms so far are transductive. Manifold regularization is inductive.

defines a function in an RKHS: f(x) = h(x) + b, h(x) ∈ HK

views the graph as a random sample of an underlying manifold

uses a regularizer that prefers low energy f1:n⊤∆f1:n

min_f  ∑_{i=1}^{l} (1 − yi f(xi))+ + λ1‖h‖²HK + λ2 f1:n⊤∆f1:n
Xiaojin Zhu (Univ. Wisconsin, Madison) Tutorial on Semi-Supervised Learning Chicago 2009 53 / 99
Part I Manifold Regularization and Graph-Based Algorithms
Graph spectrum and SSL
Assumption: labels are “smooth” on the graph, as characterized by the graph spectrum (the eigenvalues/eigenvectors (λi, φi), i = 1 . . . l+u, of the Laplacian L):

L = ∑_{i=1}^{l+u} λi φi φi⊤

a graph has k connected components if and only if λ1 = . . . = λk = 0

the corresponding eigenvectors are constant on individual connected components, and zero elsewhere

any f on the graph can be represented as f = ∑_{i=1}^{l+u} ai φi

graph regularizer: f⊤Lf = ∑_{i=1}^{l+u} ai² λi

a smooth function f uses a smooth basis (those φi with small λi)
Xiaojin Zhu (Univ. Wisconsin, Madison) Tutorial on Semi-Supervised Learning Chicago 2009 54 / 99
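Two of the spectral facts above can be checked directly without an eigensolver: component indicator vectors are eigenvectors with eigenvalue 0, and a vector that flips sign inside a component has strictly positive energy. A pure-Python sketch on a hypothetical graph with two components; helper names are mine.

```python
def laplacian_times(W, f):
    """(Delta f)_i = D_ii f_i - sum_j W_ij f_j, with Delta = D - W."""
    n = len(W)
    return [sum(W[i]) * f[i] - sum(W[i][j] * f[j] for j in range(n))
            for i in range(n)]

# two connected components: {0, 1} and {2, 3}
W = [[0, 1, 0, 0],
     [1, 0, 0, 0],
     [0, 0, 0, 2],
     [0, 0, 2, 0]]

# indicator vectors of the components are eigenvectors with eigenvalue 0
phi1 = [1, 1, 0, 0]
phi2 = [0, 0, 1, 1]
assert laplacian_times(W, phi1) == [0, 0, 0, 0]
assert laplacian_times(W, phi2) == [0, 0, 0, 0]

# a non-smooth vector (sign flip inside a component) has positive energy
f = [1, -1, 0, 0]
e = sum(f[i] * v for i, v in enumerate(laplacian_times(W, f)))  # f^T Delta f
assert e > 0
```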
Part I Manifold Regularization and Graph-Based Algorithms
Example graph spectrum
The graph (figure: a graph with two connected components)

Eigenvalues of its graph Laplacian (eigenvector plots omitted):
λ1 = 0.00, λ2 = 0.00, λ3 = 0.04, λ4 = 0.17, λ5 = 0.38,
λ6 = 0.38, λ7 = 0.66, λ8 = 1.00, λ9 = 1.38, λ10 = 1.38,
λ11 = 1.79, λ12 = 2.21, λ13 = 2.62, λ14 = 2.62, λ15 = 3.00,
λ16 = 3.34, λ17 = 3.62, λ18 = 3.62, λ19 = 3.83, λ20 = 3.96
Xiaojin Zhu (Univ. Wisconsin, Madison) Tutorial on Semi-Supervised Learning Chicago 2009 55 / 99
Part I Manifold Regularization and Graph-Based Algorithms
When the graph assumption is wrong
“colliding two moons”
Xiaojin Zhu (Univ. Wisconsin, Madison) Tutorial on Semi-Supervised Learning Chicago 2009 56 / 99
Part I S3VMs and Entropy Regularization
Outline
1 Part I
What is SSL?
Mixture Models
Co-training and Multiview Algorithms
Manifold Regularization and Graph-Based Algorithms
S3VMs and Entropy Regularization

2 Part II
Theory of SSL
Online SSL
Multimanifold SSL
Human SSL
Xiaojin Zhu (Univ. Wisconsin, Madison) Tutorial on Semi-Supervised Learning Chicago 2009 57 / 99
Part I S3VMs and Entropy Regularization
Semi-supervised Support Vector Machines
SVMs (figure: max-margin boundary through the labeled + and − points)

Semi-supervised SVMs (S3VMs) = Transductive SVMs (TSVMs) (figure: the boundary is placed in a low-density region of the unlabeled points)

Assumption: Unlabeled data from different classes are separated with a large margin.
Xiaojin Zhu (Univ. Wisconsin, Madison) Tutorial on Semi-Supervised Learning Chicago 2009 58 / 99
Part I S3VMs and Entropy Regularization
Standard soft margin SVMs
Try to keep labeled points outside the margin, while maximizing the margin:

min_{h,b,ξ}  ∑_{i=1}^{l} ξi + λ‖h‖²HK
subject to  yi(h(xi) + b) ≥ 1 − ξi, ∀i = 1 . . . l
            ξi ≥ 0

Equivalent to

min_f  ∑_{i=1}^{l} (1 − yi f(xi))+ + λ‖h‖²HK

yi f(xi) is known as the margin, and (1 − yi f(xi))+ is the hinge loss.
Xiaojin Zhu (Univ. Wisconsin, Madison) Tutorial on Semi-Supervised Learning Chicago 2009 59 / 99
Part I S3VMs and Entropy Regularization
The S3VM objective function
To incorporate unlabeled points,

assign putative labels sign(f(x)) to x ∈ Xu

the hinge loss on unlabeled points then becomes (1 − sign(f(xi)) f(xi))+ = (1 − |f(xi)|)+

S3VM objective:

min_f  ∑_{i=1}^{l} (1 − yi f(xi))+ + λ1‖h‖²HK + λ2 ∑_{i=l+1}^{n} (1 − |f(xi)|)+
Xiaojin Zhu (Univ. Wisconsin, Madison) Tutorial on Semi-Supervised Learning Chicago 2009 60 / 99
Part I S3VMs and Entropy Regularization
The hat loss on unlabeled data
hinge loss (1 − yi f(xi))+ vs. hat loss (1 − |f(xi)|)+

(figure: the hinge loss plotted against the margin yi f(xi); the hat loss plotted against f(xi), peaking at f(xi) = 0)
Prefers f(x) ≥ 1 or f(x) ≤ −1, i.e., unlabeled instances away from the decision boundary f(x) = 0.

The class balancing constraint

without it, the labeling is often unbalanced – most points are classified into one class.

Heuristic class balance: (1/(n−l)) ∑_{i=l+1}^{n} yi = (1/l) ∑_{i=1}^{l} yi.

Relaxed: (1/(n−l)) ∑_{i=l+1}^{n} f(xi) = (1/l) ∑_{i=1}^{l} yi.
Xiaojin Zhu (Univ. Wisconsin, Madison) Tutorial on Semi-Supervised Learning Chicago 2009 61 / 99
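The two losses and the relaxed balance check are one-liners. A minimal sketch with ±1 labels; the function names and the numeric examples are mine.

```python
def hinge(y, fx):
    """Labeled points: (1 - y f(x))_+ , as a function of the margin y f(x)."""
    return max(0.0, 1.0 - y * fx)

def hat(fx):
    """Unlabeled points: (1 - |f(x)|)_+ -- the hinge loss under the putative
    label sign(f(x))."""
    return max(0.0, 1.0 - abs(fx))

# the hat loss vanishes when |f(x)| >= 1, i.e. away from the boundary f(x) = 0
assert hat(1.5) == 0.0 and hat(-1.5) == 0.0
assert hat(0.0) == 1.0  # worst case: right on the decision boundary

def balanced(f_unlabeled, y_labeled, tol=1e-6):
    """Relaxed class-balance check: the mean of f over unlabeled points
    should match the mean label over labeled points (up to tol)."""
    mu = sum(f_unlabeled) / len(f_unlabeled)
    my = sum(y_labeled) / len(y_labeled)
    return abs(mu - my) < tol

print(balanced([1.0, -1.0, 0.5, -0.5], [1, -1]))  # -> True
```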
often unbalanced – most points classified into one class.
Heuristic class balance: 1n−l
∑ni=l+1 yi = 1
l
∑li=1 yi.
Relaxed: 1n−l
∑ni=l+1 f(xi) = 1
l
∑li=1 yi.
Xiaojin Zhu (Univ. Wisconsin, Madison) Tutorial on Semi-Supervised Learning Chicago 2009 61 / 99
![Page 124: Tutorial on Semi-Supervised Learningpages.cs.wisc.edu/~jerryzhu/pub/sslchicago09.pdfXiaojin Zhu (Univ. Wisconsin, Madison) Tutorial on Semi-Supervised Learning Chicago 2009 14 / 99](https://reader031.vdocuments.net/reader031/viewer/2022022518/5b0b3a4e7f8b9a61448c513e/html5/thumbnails/124.jpg)
Part I S3VMs and Entropy Regularization
The S3VM algorithm
$\min_f \sum_{i=1}^{l} (1 - y_i f(x_i))_+ + \lambda_1 \|h\|^2_{\mathcal{H}_K} + \lambda_2 \sum_{i=l+1}^{n} (1 - |f(x_i)|)_+$
s.t. $\frac{1}{n-l}\sum_{i=l+1}^{n} f(x_i) = \frac{1}{l}\sum_{i=1}^{l} y_i$
Computational difficulty
SVM objective is convex
Semi-supervised SVM objective is non-convex
Optimization approaches: SVMlight, ∇S3VM, continuation S3VM, deterministic annealing, CCCP, Branch and Bound, SDP convex relaxation, etc.
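Gradient-based approaches such as ∇S3VM sidestep the non-differentiable hat loss by substituting a smooth surrogate, commonly $\exp(-s f(x)^2)$. Below is a minimal one-step sketch under that assumption; the hinge term uses a subgradient, and lam1, lam2, s, lr are illustrative values.

```python
import numpy as np

def smoothed_s3vm_step(w, b, X_l, y_l, X_u, lam1=1.0, lam2=0.1, s=5.0, lr=0.01):
    """One (sub)gradient step on a smoothed S3VM objective:
    hinge loss on labeled data + lam1*||w||^2 + lam2*sum exp(-s f(x)^2)."""
    f_l = X_l @ w + b
    f_u = X_u @ w + b
    # subgradient of the hinge loss over margin-violating labeled points
    active = (y_l * f_l) < 1.0
    gw = -(y_l[active, None] * X_l[active]).sum(axis=0) + 2.0 * lam1 * w
    gb = -y_l[active].sum()
    # gradient of the smooth unlabeled surrogate exp(-s f^2)
    coef = lam2 * np.exp(-s * f_u ** 2) * (-2.0 * s * f_u)
    gw = gw + (coef[:, None] * X_u).sum(axis=0)
    gb = gb + coef.sum()
    return w - lr * gw, b - lr * gb
```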
Logistic regression
The probabilistic counterpart of SVMs.
$p(y|x) = 1/\left(1 + \exp(-y f(x))\right)$ where $f(x) = w^\top x + b$
(conditional) log likelihood $\sum_{i=1}^{l} \log p(y_i|x_i, w, b)$
prior $w \sim N(0, I/(2\lambda))$
MAP training $\max_{w,b} \sum_{i=1}^{l} \log\left(1/\left(1 + \exp(-y_i f(x_i))\right)\right) - \lambda\|w\|^2$
logistic loss $c(x, y, f(x)) = \log\left(1 + \exp(-y f(x))\right)$
Logistic regression does not use unlabeled data.
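MAP training as written above can be done with plain gradient ascent on the penalized log likelihood; a minimal sketch with labels in {−1, +1} (lam, lr, and the step count are illustrative):

```python
import numpy as np

def fit_logreg_map(X, y, lam=0.1, lr=0.1, steps=500):
    """Gradient ascent on sum_i log(1/(1+exp(-y_i f(x_i)))) - lam*||w||^2,
    with f(x) = w.x + b and y_i in {-1, +1}."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        margins = y * (X @ w + b)
        # derivative of log sigma(margin) w.r.t. the margin is sigma(-margin)
        p = 1.0 / (1.0 + np.exp(margins))
        w += lr * ((p * y) @ X - 2.0 * lam * w)
        b += lr * (p * y).sum()
    return w, b
```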
Entropy regularization
Assumption: if the two classes are well-separated, then $p(y|x)$ on any unlabeled instance should be close to 0 or 1.
Entropy H(p) = −p log p− (1− p) log(1− p) should be small
entropy regularizer $\Omega(f) = \sum_{j=l+1}^{l+u} H\left(p(y=1|x_j, w, b)\right)$
semi-supervised logistic regression:
$\min_{w,b} \sum_{i=1}^{l} \log\left(1 + \exp(-y_i f(x_i))\right) + \lambda_1\|w\|^2 + \lambda_2 \sum_{j=l+1}^{l+u} H\left(1/\left(1 + \exp(-f(x_j))\right)\right)$
The probabilistic counterpart of S3VMs.
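The regularizer is straightforward to compute from the unlabeled outputs $f(x_j)$; a minimal sketch (the clipping is for numerical safety and is not part of the formula):

```python
import numpy as np

def entropy_regularizer(f_u):
    """Omega(f) = sum_j H(p(y=1|x_j)) with p = 1/(1+exp(-f(x_j)))."""
    p = 1.0 / (1.0 + np.exp(-np.asarray(f_u, dtype=float)))
    p = np.clip(p, 1e-12, 1.0 - 1e-12)  # avoid log(0)
    return float(-(p * np.log(p) + (1.0 - p) * np.log(1.0 - p)).sum())
```

A point on the decision boundary (f = 0, so p = 1/2) contributes the maximum entropy log 2; points far from the boundary contribute almost nothing.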
[Figure: (a) the logistic loss as a function of $yf(x)$; (b) the entropy regularizer as a function of $f(x)$.]
When the large margin assumption is wrong
[Figure: two panels showing unlabeled, positive, and negative data with the true boundary, the SVM boundary, and the S3VM boundary. Left: S3VM stuck in a local minimum. Right: S3VM in the wrong gap.]
SVM error: 0.26 ± 0.13. S3VM error: 0.34 ± 0.19.
Part II
Outline
1 Part I
  What is SSL?
  Mixture Models
  Co-training and Multiview Algorithms
  Manifold Regularization and Graph-Based Algorithms
  S3VMs and Entropy Regularization
2 Part II
  Theory of SSL
  Online SSL
  Multimanifold SSL
  Human SSL
Part II Theory of SSL
SSL does not always help
[Figure: five training sets drawn from the true distribution (positive and negative distributions, labeled and unlabeled instances), comparing the optimal, supervised, generative-model, S3VM, and graph-based boundaries.]
A wrong SSL assumption can make SSL worse than SL!
A computational theory for SSL
(Theoretical guarantee of Balcan & Blum.) Recall supervised learning:
labeled data $D = \{(x_i, y_i)\}_{i=1}^{l} \stackrel{i.i.d.}{\sim} P(x, y)$, where $P$ is unknown
function family $\mathcal{F}$
assume zero training sample error $\hat{e}(f) = \frac{1}{l}\sum_{i=1}^{l} \mathbb{1}[f(x_i) \neq y_i]$
can we say anything about the true error $e(f_D) = E_{(x,y)\sim P}[f_D(x) \neq y]$?
it turns out we can bound $e(f_D)$ without knowledge of $P$.
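The training sample error is just a misclassification rate over the labeled set; a minimal sketch:

```python
import numpy as np

def training_error(f, X, y):
    """e_hat(f) = (1/l) * number of i with f(x_i) != y_i."""
    return float(np.mean(f(X) != y))
```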
PAC bound for SL
training error minimizer $f_D$ is a random variable (a function of $D$)
$e(f_D) > \epsilon$ is a random Boolean event
the probability of this event is $\Pr_{D\sim P}(e(f_D) > \epsilon)$. Goal: show that this probability is small.
$\Pr_{D\sim P}(e(f_D) > \epsilon) \le \Pr_{D\sim P}\left(\cup_{f\in\mathcal{F}:\hat{e}(f)=0}\; e(f) > \epsilon\right)$
$= \Pr_{D\sim P}\left(\cup_{f\in\mathcal{F}}\; \hat{e}(f) = 0,\, e(f) > \epsilon\right)$
$= \Pr_{D\sim P}\left(\cup_{f\in\mathcal{F}:e(f)>\epsilon}\; \hat{e}(f) = 0\right)$
$\le \sum_{f\in\mathcal{F}:e(f)>\epsilon} \Pr_{D\sim P}(\hat{e}(f) = 0)$
last step is union bound Pr(A ∪B) ≤ Pr(A) + Pr(B)
A biased coin with $P(\text{heads}) = \epsilon$ producing $l$ tails:
$\sum_{f\in\mathcal{F}:e(f)>\epsilon} \Pr_{D\sim P}(\hat{e}(f) = 0) \le \sum_{f\in\mathcal{F}:e(f)>\epsilon} (1-\epsilon)^l$
if $\mathcal{F}$ is finite, $\sum_{f\in\mathcal{F}:e(f)>\epsilon} (1-\epsilon)^l \le |\mathcal{F}|(1-\epsilon)^l$
by $1 - x \le e^{-x}$, $|\mathcal{F}|(1-\epsilon)^l \le |\mathcal{F}|e^{-\epsilon l}$
putting things together, $\Pr_{D\sim P}(e(f_D) \le \epsilon) \ge 1 - |\mathcal{F}|e^{-\epsilon l}$
Probably (i.e., on at least a $1 - |\mathcal{F}|e^{-\epsilon l}$ fraction of random draws of the training sample), the function $f_D$, picked because $\hat{e}(f_D) = 0$, is approximately correct (i.e., has true error $e(f_D) \le \epsilon$).
Simple sample complexity for SL
Theorem. Assume $\mathcal{F}$ is finite. Given any $\epsilon > 0$, $\delta > 0$, if we see $l$ training instances where
$l = \frac{1}{\epsilon}\left(\log|\mathcal{F}| + \log\frac{1}{\delta}\right)$
then with probability at least $1 - \delta$, all $f \in \mathcal{F}$ with zero training error $\hat{e}(f) = 0$ have $e(f) \le \epsilon$.
ε controls the error of the learned function
δ controls the confidence of the bound
proof: set $\delta = |\mathcal{F}|e^{-\epsilon l}$ and solve for $l$
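Plugging numbers into the theorem shows how $l$ scales; the sketch below rounds up to an integer sample size:

```python
import math

def sample_complexity(F_size, eps, delta):
    """l = (1/eps) * (log|F| + log(1/delta)), rounded up to an integer."""
    return math.ceil((math.log(F_size) + math.log(1.0 / delta)) / eps)
```

For example, |F| = 2^20, eps = 0.1, delta = 0.05 needs 169 labeled instances; note that l grows only logarithmically with |F|.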
A Finite, Doubly Realizable PAC bound for SSL
Plan: make |F| smaller
incompatibility $\Xi(f, x): \mathcal{F} \times \mathcal{X} \mapsto [0, 1]$ between a function $f$ and an unlabeled instance $x$
example: S3VM wants $|f(x)| \ge \gamma$. Define
$\Xi_{\mathrm{S3VM}}(f, x) = \begin{cases} 1, & \text{if } |f(x)| < \gamma \\ 0, & \text{otherwise.} \end{cases}$
true unlabeled data error $e_U(f) = E_{x\sim P_X}[\Xi(f, x)]$
sample unlabeled data error $\hat{e}_U(f) = \frac{1}{u}\sum_{i=l+1}^{l+u} \Xi(f, x_i)$
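Both unlabeled-error quantities are easy to estimate given the classifier's outputs; a minimal sketch of the S3VM incompatibility and the sample unlabeled error:

```python
import numpy as np

def xi_s3vm(f_vals, gamma=1.0):
    """Incompatibility: 1 if |f(x)| < gamma (inside the margin), else 0."""
    return (np.abs(np.asarray(f_vals, dtype=float)) < gamma).astype(float)

def sample_unlabeled_error(f_vals, gamma=1.0):
    """Sample unlabeled error: (1/u) * sum_i Xi(f, x_i) over unlabeled outputs."""
    return float(xi_s3vm(f_vals, gamma).mean())
```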
Xiaojin Zhu (Univ. Wisconsin, Madison) Tutorial on Semi-Supervised Learning Chicago 2009 74 / 99
Part II Theory of SSL
A Finite, Doubly Realizable PAC bound for SSL
by a similar argument, after u = (1/ε)(log |F| + log(2/δ)) unlabeled data, with probability at least 1 − δ/2, all f ∈ F with ê_U(f) = 0 have e_U(f) ≤ ε.

i.e., if ê_U(f) = 0, then f ∈ F(ε) ≡ {f ∈ F : e_U(f) ≤ ε}

apply the SL PAC bound on the (much smaller) F(ε)

Theorem (finite, doubly realizable). Assume F is finite. Given any ε > 0, δ > 0, if we see l labeled and u unlabeled training instances where

    l = (1/ε)(log |F(ε)| + log(2/δ))   and   u = (1/ε)(log |F| + log(2/δ)),

then with probability at least 1 − δ, all f ∈ F with ê(f) = 0 and ê_U(f) = 0 have e(f) ≤ ε.
Xiaojin Zhu (Univ. Wisconsin, Madison) Tutorial on Semi-Supervised Learning Chicago 2009 75 / 99
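To see why the theorem can pay off, here is an arithmetic sketch of the labeled and unlabeled sample sizes it prescribes (the helper `sample_sizes` and the class sizes are our hypothetical choices):

```python
import math

def sample_sizes(F_size, F_eps_size, eps, delta):
    """l = (1/eps)(log|F(eps)| + log(2/delta)), u = (1/eps)(log|F| + log(2/delta))."""
    l = (math.log(F_eps_size) + math.log(2.0 / delta)) / eps
    u = (math.log(F_size) + math.log(2.0 / delta)) / eps
    return math.ceil(l), math.ceil(u)

# If unlabeled data shrinks the effective class from 10^6 functions to 100,
# far fewer labeled examples suffice for the same (eps, delta):
l, u = sample_sizes(F_size=10**6, F_eps_size=100, eps=0.1, delta=0.05)
print(l, u)  # l = 83, u = 176
```

The labeled requirement scales with log |F(ε)| rather than log |F|, which is exactly where the unlabeled data helps.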
Part II Theory of SSL
Discussions on the PAC bound for SSL
Good news: SSL can require less labeled data than SL

This particular theorem requires a finite F, and a doubly realizable f with ê(f) = 0 and ê_U(f) = 0

More general theorems in (Balcan & Blum 2008):
    infinite F is OK: extensions of the VC dimension
    agnostic: does not require either realizability; both e(f) and e_U(f) may be non-zero and unknown
    also tighter ε-cover based bounds

Most SSL algorithms (e.g. S3VMs) empirically minimize ê(f) + ê_U(f): not necessarily justified in theory

Incompatibility functions are arbitrary; they serve as regularization. There are good and bad incompatibility functions. Example: the "inverse S3VM" prefers to cut through dense unlabeled data:

    Ξ_inv(f, x) = 1 − Ξ_S3VM(f, x)
Xiaojin Zhu (Univ. Wisconsin, Madison) Tutorial on Semi-Supervised Learning Chicago 2009 76 / 99
Part II Online SSL
Outline
1 Part I
    What is SSL?
    Mixture Models
    Co-training and Multiview Algorithms
    Manifold Regularization and Graph-Based Algorithms
    S3VMs and Entropy Regularization

2 Part II
    Theory of SSL
    Online SSL
    Multimanifold SSL
    Human SSL
Xiaojin Zhu (Univ. Wisconsin, Madison) Tutorial on Semi-Supervised Learning Chicago 2009 77 / 99
Part II Online SSL
Life-long learning
x:  x_1       x_2   ...   x_1000       ...   x_1000000       ...
y:  y_1 = 0    -    ...   y_1000 = 1   ...   y_1000000 = 0   ...

n → ∞ examples arrive sequentially; we cannot store them all

most examples are unlabeled

no iid assumption; p(x, y) can change over time
Xiaojin Zhu (Univ. Wisconsin, Madison) Tutorial on Semi-Supervised Learning Chicago 2009 78 / 99
Part II Online SSL
This is how children learn, too
x:  x_1       x_2   ...   x_1000       ...   x_1000000       ...
y:  y_1 = 0    -    ...   y_1000 = 1   ...   y_1000000 = 0   ...
Xiaojin Zhu (Univ. Wisconsin, Madison) Tutorial on Semi-Supervised Learning Chicago 2009 79 / 99
Part II Online SSL
New paradigm: online semi-supervised learning
1 At time t, the adversary picks x_t ∈ X, y_t ∈ Y (not necessarily iid), and shows x_t

2 The learner has classifier f_t : X → R, and predicts f_t(x_t)

3 With small probability, the adversary reveals y_t; otherwise it abstains (x_t remains unlabeled)

4 The learner updates to f_{t+1} based on x_t and y_t (if given). Repeat.
Xiaojin Zhu (Univ. Wisconsin, Madison) Tutorial on Semi-Supervised Learning Chicago 2009 80 / 99
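The protocol above can be sketched as a loop; the "learner" here is a deliberately trivial placeholder (predict the majority of labels revealed so far), our stand-in rather than an algorithm from the tutorial:

```python
import random

def online_ssl(stream, label_prob=0.1, seed=0):
    """stream: iterable of (x_t, y_t) pairs chosen by the adversary.
    y_t is revealed only with probability label_prob."""
    rng = random.Random(seed)
    pos, neg, predictions = 0, 0, []
    for x, y in stream:
        predictions.append(1 if pos >= neg else 0)  # f_t's prediction on x_t
        if rng.random() < label_prob:               # adversary reveals y_t (rarely)
            if y == 1:
                pos += 1
            else:
                neg += 1
        # a real SSL learner would also use x when y is withheld (unlabeled update)
    return predictions

stream = [(i, i % 2) for i in range(100)]
preds = online_ssl(stream, label_prob=0.2)
print(len(preds))  # one prediction per instance -> 100
```

The key structural point is that the learner commits to a prediction before learning whether the label will be revealed.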
Part II Online SSL
Online manifold regularization
Recall the (batch) manifold regularization risk:

    J(f) = (1/l) Σ_{t=1}^T δ(y_t) c(f(x_t), y_t) + (λ1/2) ‖f‖²_K + (λ2/(2T)) Σ_{s,t=1}^T (f(x_s) − f(x_t))² w_st

c(f(x), y) is a convex loss function, e.g., the hinge loss; δ(y_t) indicates whether x_t is labeled.

Instantaneous risk:

    J_t(f) = (T/l) δ(y_t) c(f(x_t), y_t) + (λ1/2) ‖f‖²_K + λ2 Σ_{i=1}^t (f(x_i) − f(x_t))² w_it

(involves graph edges between x_t and all previous examples)

batch risk = average of the instantaneous risks: J(f) = (1/T) Σ_{t=1}^T J_t(f)
Xiaojin Zhu (Univ. Wisconsin, Madison) Tutorial on Semi-Supervised Learning Chicago 2009 81 / 99
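A pure-Python sketch of the batch risk J(f) above, for the toy case of a linear f(x) = w·x on scalar inputs, with the hinge loss as c; taking ‖f‖²_K = w² is our linear-kernel assumption, and `None` marks unlabeled points:

```python
def hinge(fx, y):
    """Convex loss c(f(x), y) with y in {-1, +1}."""
    return max(0.0, 1.0 - y * fx)

def batch_risk(w, xs, ys, wgt, lam1, lam2):
    """Batch manifold regularization risk for f(x) = w*x; ys[t] is None if unlabeled."""
    T = len(xs)
    l = sum(1 for y in ys if y is not None)
    f = [w * x for x in xs]
    labeled = sum(hinge(f[t], ys[t]) for t in range(T) if ys[t] is not None) / l
    reg = 0.5 * lam1 * w * w  # ||f||_K^2 reduces to w^2 for a linear kernel (assumption)
    smooth = lam2 / (2 * T) * sum(wgt[s][t] * (f[s] - f[t]) ** 2
                                  for s in range(T) for t in range(T))
    return labeled + reg + smooth

xs, ys = [1.0, -1.0, 1.1], [1, -1, None]   # two labeled points, one unlabeled
wgt = [[0, 0, 1], [0, 0, 0], [1, 0, 0]]    # a single graph edge between x_0 and x_2
print(batch_risk(1.0, xs, ys, wgt, lam1=0.1, lam2=1.0))
```

The smoothness term charges the classifier whenever graph-connected points get different outputs, whether or not they are labeled.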
Part II Online SSL
Online convex programming
Instead of minimizing the convex J(f), reduce the convex J_t(f) at each step t:

    f_{t+1} = f_t − η_t (∂J_t(f)/∂f)|_{f_t}

Step size η_t decays, e.g., η_t = 1/√t

Accuracy can be arbitrarily bad if the adversary flips the target often. If so, no batch learner in hindsight can do well either.

    regret ≡ (1/T) Σ_{t=1}^T J_t(f_t) − J(f*)

no-regret guarantee against the adversary [Zinkevich ICML03]: lim sup_{T→∞} (1/T) Σ_{t=1}^T J_t(f_t) − J(f*) ≤ 0

If there is no adversary (iid data), the average classifier f̄ = (1/T) Σ_{t=1}^T f_t is good: J(f̄) → J(f*).
Xiaojin Zhu (Univ. Wisconsin, Madison) Tutorial on Semi-Supervised Learning Chicago 2009 82 / 99
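The update rule above is ordinary online gradient descent; a minimal sketch with the decaying step size η_t = 1/√t, using the squared loss J_t(w) = (w − z_t)²/2 as our toy stand-in for the instantaneous risk:

```python
import math

def online_gd(zs):
    """One gradient step per instantaneous risk J_t(w) = (w - z_t)^2 / 2."""
    w, iterates = 0.0, []
    for t, z in enumerate(zs, start=1):
        grad = w - z                 # dJ_t/dw evaluated at the current w (= f_t)
        w = w - grad / math.sqrt(t)  # w_{t+1} = w_t - eta_t * grad, eta_t = 1/sqrt(t)
        iterates.append(w)
    return iterates

# In the iid (no-adversary) case the average iterate approaches the batch minimizer:
ws = online_gd([1.0] * 50)
print(sum(ws) / len(ws))  # the minimizer here is w* = 1.0
```

With an adversarial sequence of z_t the iterates can be poor at every step, but so is any fixed w in hindsight, which is what the no-regret guarantee captures.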
Part II Online SSL
Sparse approximation by buffering
The algorithm is impractical as T → ∞:

space O(T): stores all previous examples

time O(T²): each new instance connects to all previous ones

Keep a size-τ buffer:

approximate representers: f_t = Σ_{i=t−τ}^{t−1} α_i^(t) K(x_i, ·)

approximate instantaneous risk:

    J_t(f) = (T/l) δ(y_t) c(f(x_t), y_t) + (λ1/2) ‖f‖²_K + (λ2 t/τ) Σ_{i=t−τ}^t (f(x_i) − f(x_t))² w_it

dynamic graph on the instances in the buffer
Xiaojin Zhu (Univ. Wisconsin, Madison) Tutorial on Semi-Supervised Learning Chicago 2009 83 / 99
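The buffering idea can be sketched with a bounded deque: only the τ most recent instances are retained, so each new point connects to at most τ graph neighbors (the names `stream_with_buffer` and `weight` are ours):

```python
from collections import deque

def stream_with_buffer(xs, tau, weight):
    """Process a stream keeping only the tau most recent instances."""
    buf = deque(maxlen=tau)   # the oldest instance is evicted automatically
    edge_counts = []
    for x in xs:
        edges = [(x_old, weight(x_old, x)) for x_old in buf]  # graph over the buffer
        edge_counts.append(len(edges))
        buf.append(x)
    return edge_counts

counts = stream_with_buffer(range(10), tau=3, weight=lambda a, b: 1.0)
print(counts)  # [0, 1, 2, 3, 3, 3, 3, 3, 3, 3]
```

Per-step work is now O(τ) instead of O(t), and total space is O(τ) instead of O(T).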
Part II Multimanifold SSL
Outline
1 Part I
    What is SSL?
    Mixture Models
    Co-training and Multiview Algorithms
    Manifold Regularization and Graph-Based Algorithms
    S3VMs and Entropy Regularization

2 Part II
    Theory of SSL
    Online SSL
    Multimanifold SSL
    Human SSL
Xiaojin Zhu (Univ. Wisconsin, Madison) Tutorial on Semi-Supervised Learning Chicago 2009 84 / 99
Part II Multimanifold SSL
Multiple, intersecting manifolds
Xiaojin Zhu (Univ. Wisconsin, Madison) Tutorial on Semi-Supervised Learning Chicago 2009 85 / 99
Part II Multimanifold SSL
Building Blocks: Local Covariance Matrix

For a sparse subset of points x, the local covariance matrix of the m neighbors

    Σ_x = (1/(m−1)) Σ_j (x_j − μ_x)(x_j − μ_x)^T

captures local geometry.
Xiaojin Zhu (Univ. Wisconsin, Madison) Tutorial on Semi-Supervised Learning Chicago 2009 86 / 99
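A pure-Python sketch of Σ_x for 2-D points given as (x, y) tuples (the helper `local_covariance` is our name; finding the neighbor set itself, e.g. by kNN, is assumed done):

```python
def local_covariance(neighbors):
    """Sample covariance of a point's m neighbors (2-D, unbiased 1/(m-1) scaling)."""
    m = len(neighbors)
    mu = [sum(p[k] for p in neighbors) / m for k in (0, 1)]   # neighbor mean mu_x
    cov = [[0.0, 0.0], [0.0, 0.0]]
    for p in neighbors:
        d = [p[0] - mu[0], p[1] - mu[1]]
        for i in (0, 1):
            for j in (0, 1):
                cov[i][j] += d[i] * d[j] / (m - 1)
    return cov

# Neighbors lying along the x-axis: variance concentrates in the first coordinate,
# so the local geometry looks 1-dimensional.
print(local_covariance([(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]))  # [[1.0, 0.0], [0.0, 0.0]]
```

The eigenvectors of Σ_x give the local manifold orientation and its eigenvalue spectrum the local dimension.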
Part II Multimanifold SSL
A Distance on Covariance Matrices
Hellinger distance:

    H²(p, q) = (1/2) ∫ (√p(x) − √q(x))² dx

H(p, q) is symmetric and lies in [0, 1].

Let p = N(0, Σ1), q = N(0, Σ2). We define

    H(Σ1, Σ2) ≡ H(p, q) = √( 1 − 2^(d/2) |Σ1|^(1/4) |Σ2|^(1/4) / |Σ1 + Σ2|^(1/2) )

(computed in a common subspace)
Xiaojin Zhu (Univ. Wisconsin, Madison) Tutorial on Semi-Supervised Learning Chicago 2009 87 / 99
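A sketch of the closed form above for 2×2 covariance matrices, d = 2 (the helpers `det2` and `hellinger` are our names; the common-subspace projection is assumed already done):

```python
import math

def det2(S):
    """Determinant of a 2x2 matrix given as nested lists."""
    return S[0][0] * S[1][1] - S[0][1] * S[1][0]

def hellinger(S1, S2, d=2):
    """H(S1, S2) = sqrt(1 - 2^(d/2) |S1|^(1/4) |S2|^(1/4) / |S1 + S2|^(1/2))."""
    Ssum = [[S1[i][j] + S2[i][j] for j in (0, 1)] for i in (0, 1)]
    ratio = (2 ** (d / 2)) * det2(S1) ** 0.25 * det2(S2) ** 0.25 / det2(Ssum) ** 0.5
    return math.sqrt(1.0 - ratio)

I = [[1.0, 0.0], [0.0, 1.0]]
print(hellinger(I, I))  # identical covariances -> distance 0.0
stretched = [[100.0, 0.0], [0.0, 1.0]]
print(hellinger(I, stretched))  # very different local shapes -> large distance
```

Identical local geometries give distance 0, while strongly anisotropic versus isotropic neighborhoods push the distance toward 1, matching the table on the next slide.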
Part II Multimanifold SSL
Hellinger Distance
Comment H(Σ1,Σ2)
similar 0.02
density 0.28
dimension 1
orientation∗ 1
* smoothed version: Σ + εI
Xiaojin Zhu (Univ. Wisconsin, Madison) Tutorial on Semi-Supervised Learning Chicago 2009 88 / 99
Part II Multimanifold SSL
A Sparse Graph

- kNN graph uses the Mahalanobis distance to trace the manifold: d²(x, y) = (x − y)ᵀ Σx⁻¹ (x − y)
- Gaussian edge weight on edges: wij = exp( −H²(Σxi, Σxj) / (2σ²) )
- Combines locality and shape. Red = large w, yellow = small w
- Manifold Regularization on the graph
Xiaojin Zhu (Univ. Wisconsin, Madison) Tutorial on Semi-Supervised Learning Chicago 2009 89 / 99
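The graph construction above can be prototyped as follows. This is an illustrative sketch, not the tutorial's implementation: the local-covariance estimator and the defaults k, eps, and sigma are my own assumptions.

```python
import numpy as np

def local_covariances(X, k=10, eps=1e-3):
    """Estimate a local covariance around each point from its k nearest
    neighbors, smoothed as Sigma + eps*I so degenerate fits stay invertible."""
    n, d = X.shape
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    covs = []
    for i in range(n):
        nbrs = np.argsort(D[i])[1:k + 1]      # skip the point itself
        Z = X[nbrs] - X[nbrs].mean(axis=0)
        covs.append(Z.T @ Z / k + eps * np.eye(d))
    return covs

def edge_weight(S1, S2, sigma=0.2):
    """Gaussian edge weight w_ij = exp(-H^2(S1, S2) / (2 sigma^2)),
    with H the Hellinger distance between N(0, S1) and N(0, S2)."""
    d = S1.shape[0]
    h2 = 1.0 - 2.0 ** (d / 2) * np.linalg.det(S1) ** 0.25 \
         * np.linalg.det(S2) ** 0.25 / np.linalg.det(S1 + S2) ** 0.5
    return np.exp(-max(0.0, h2) / (2 * sigma ** 2))
```

A kNN graph would then connect each point to its Mahalanobis-nearest neighbors and attach these weights, so edges along a shared low-dimensional sheet stay heavy while edges across differently oriented sheets decay.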
![Page 205: Tutorial on Semi-Supervised Learningpages.cs.wisc.edu/~jerryzhu/pub/sslchicago09.pdfXiaojin Zhu (Univ. Wisconsin, Madison) Tutorial on Semi-Supervised Learning Chicago 2009 14 / 99](https://reader031.vdocuments.net/reader031/viewer/2022022518/5b0b3a4e7f8b9a61448c513e/html5/thumbnails/205.jpg)
Part II Human SSL
Outline
1 Part I
   What is SSL?
   Mixture Models
   Co-training and Multiview Algorithms
   Manifold Regularization and Graph-Based Algorithms
   S3VMs and Entropy Regularization

2 Part II
   Theory of SSL
   Online SSL
   Multimanifold SSL
   Human SSL
Xiaojin Zhu (Univ. Wisconsin, Madison) Tutorial on Semi-Supervised Learning Chicago 2009 90 / 99
![Page 206: Tutorial on Semi-Supervised Learningpages.cs.wisc.edu/~jerryzhu/pub/sslchicago09.pdfXiaojin Zhu (Univ. Wisconsin, Madison) Tutorial on Semi-Supervised Learning Chicago 2009 14 / 99](https://reader031.vdocuments.net/reader031/viewer/2022022518/5b0b3a4e7f8b9a61448c513e/html5/thumbnails/206.jpg)
Part II Human SSL
Do we learn from both labeled and unlabeled data?
Learning existed long before machine learning. Do humans perform semi-supervised learning?

We discuss two human experiments:
1 One-class classification [Zaki & Nosofsky 2007]
2 Binary classification [Zhu et al. 2007]
Xiaojin Zhu (Univ. Wisconsin, Madison) Tutorial on Semi-Supervised Learning Chicago 2009 91 / 99
![Page 208: Tutorial on Semi-Supervised Learningpages.cs.wisc.edu/~jerryzhu/pub/sslchicago09.pdfXiaojin Zhu (Univ. Wisconsin, Madison) Tutorial on Semi-Supervised Learning Chicago 2009 14 / 99](https://reader031.vdocuments.net/reader031/viewer/2022022518/5b0b3a4e7f8b9a61448c513e/html5/thumbnails/208.jpg)
Part II Human SSL
Zaki & Nosofsky 2007: self-training?

- participants were shown a training sample {(xi, yi = 1)}, i = 1, …, l, all from one class
- then shown u unlabeled instances {xi}, i = l + 1, …, l + u, and asked to decide whether yi = 1
- density level-set problem: learn X1 = {x ∈ X | p(x | y = 1) ≥ ε}, and classify y = 1 iff x ∈ X1
- if X1 were fixed after training, the test data would not affect classification
- Zaki & Nosofsky showed this is not the case
Xiaojin Zhu (Univ. Wisconsin, Madison) Tutorial on Semi-Supervised Learning Chicago 2009 92 / 99
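The contrast the slide draws, a fixed level set versus one that drifts with the test data, can be illustrated with a toy 1-D kernel density estimate. Everything here is a hypothetical choice for illustration: the KDE bandwidth, the threshold eps, and the "fold accepted items back into the density" update rule.

```python
import numpy as np

def kde(x, data, h=0.3):
    """1-D Gaussian kernel density estimate at x."""
    return np.mean(np.exp(-(x - data) ** 2 / (2 * h ** 2))) / (h * np.sqrt(2 * np.pi))

def classify_fixed(x, train, eps=0.05):
    """Fixed level set X1 = {x : p(x|y=1) >= eps}: test items cannot change it."""
    return kde(x, train) >= eps

def classify_self_train(tests, train, eps=0.05):
    """Self-training variant: accepted test items are folded back into the
    density estimate, so the level set can drift toward the test distribution."""
    pool = train.copy()
    labels = []
    for x in tests:
        accept = kde(x, pool) >= eps
        labels.append(bool(accept))
        if accept:
            pool = np.append(pool, x)  # hypothesized update rule
    return labels
```

With training data concentrated at 0, the fixed rule rejects x = 0.8, but the self-training rule accepts it after first accepting x = 0.4; that is the kind of test-dependent behavior the experiment probes.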
![Page 213: Tutorial on Semi-Supervised Learningpages.cs.wisc.edu/~jerryzhu/pub/sslchicago09.pdfXiaojin Zhu (Univ. Wisconsin, Madison) Tutorial on Semi-Supervised Learning Chicago 2009 14 / 99](https://reader031.vdocuments.net/reader031/viewer/2022022518/5b0b3a4e7f8b9a61448c513e/html5/thumbnails/213.jpg)
Part II Human SSL
The Zaki & Nosofsky 2007 experiment
[Figure: (a) a stimulus; (b) training distribution; (c) condition 1 test distribution; (d) condition 2 test distribution]

Condition 1: p(y = 1 | µ) > p(y = 1 | low) > p(y = 1 | high) ≈ p(y = 1 | random)

Condition 2: p(y = 1 | µnew) > p(y = 1 | lownew) > p(y = 1 | µ) ≈ p(y = 1 | low) ≈ p(y = 1 | high) ≈ p(y = 1 | random)
Xiaojin Zhu (Univ. Wisconsin, Madison) Tutorial on Semi-Supervised Learning Chicago 2009 93 / 99
![Page 216: Tutorial on Semi-Supervised Learningpages.cs.wisc.edu/~jerryzhu/pub/sslchicago09.pdfXiaojin Zhu (Univ. Wisconsin, Madison) Tutorial on Semi-Supervised Learning Chicago 2009 14 / 99](https://reader031.vdocuments.net/reader031/viewer/2022022518/5b0b3a4e7f8b9a61448c513e/html5/thumbnails/216.jpg)
Part II Human SSL
Zhu et al. 2007: mixture model?
[Figure: left-shifted Gaussian mixture, with range examples and test examples marked along x]

Blocks:
1 20 labeled points at x = −1, 1
2 test 1: 21 test examples on a grid in [−1, 1]
3 690 examples ∼ bimodal distribution, plus 63 range examples in [−2.5, 2.5]
4 test 2: same as test 1

12 participants saw the left-offset condition, 10 the right-offset. Their decisions and response times were recorded.
Xiaojin Zhu (Univ. Wisconsin, Madison) Tutorial on Semi-Supervised Learning Chicago 2009 94 / 99
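A minimal EM sketch of the kind of model that can explain this design: a two-component 1-D Gaussian mixture fit to labeled plus unlabeled data, with labeled responsibilities clamped to the given labels. The initialization and iteration count are arbitrary choices, not the paper's procedure.

```python
import numpy as np

def fit_gmm_ssl(xl, yl, xu, iters=50):
    """EM for a two-component 1-D GMM on labeled (xl, yl in {0,1}) plus
    unlabeled xu. Labeled points keep their true component assignment."""
    mu = np.array([xl[yl == 0].mean(), xl[yl == 1].mean()])
    var = np.array([1.0, 1.0])
    pi = np.array([0.5, 0.5])
    x = np.concatenate([xl, xu])
    r = np.zeros((len(x), 2))
    for _ in range(iters):
        # E-step: posterior responsibilities (the 1/sqrt(2*pi) cancels)
        for k in range(2):
            r[:, k] = pi[k] * np.exp(-(x - mu[k]) ** 2 / (2 * var[k])) / np.sqrt(var[k])
        r /= r.sum(axis=1, keepdims=True)
        r[:len(xl)] = np.eye(2)[yl]      # clamp labeled responsibilities
        # M-step: weighted means, variances, mixing weights
        nk = r.sum(axis=0)
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk
        pi = nk / nk.sum()
    return mu, var, pi
```

With labeled points at ±1 and a left-shifted bimodal unlabeled sample, the fitted means are pulled left of the labeled points, which is the boundary shift the L-subjects exhibit.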
![Page 222: Tutorial on Semi-Supervised Learningpages.cs.wisc.edu/~jerryzhu/pub/sslchicago09.pdfXiaojin Zhu (Univ. Wisconsin, Madison) Tutorial on Semi-Supervised Learning Chicago 2009 14 / 99](https://reader031.vdocuments.net/reader031/viewer/2022022518/5b0b3a4e7f8b9a61448c513e/html5/thumbnails/222.jpg)
Part II Human SSL
Visual stimuli
Stimuli are parametrized by a continuous scalar x; example stimuli are shown at x = −2.5, −2, −1.5, −1, −0.5, 0, 0.5, 1, 1.5, 2, 2.5.
Xiaojin Zhu (Univ. Wisconsin, Madison) Tutorial on Semi-Supervised Learning Chicago 2009 95 / 99
![Page 223: Tutorial on Semi-Supervised Learningpages.cs.wisc.edu/~jerryzhu/pub/sslchicago09.pdfXiaojin Zhu (Univ. Wisconsin, Madison) Tutorial on Semi-Supervised Learning Chicago 2009 14 / 99](https://reader031.vdocuments.net/reader031/viewer/2022022518/5b0b3a4e7f8b9a61448c513e/html5/thumbnails/223.jpg)
Part II Human SSL
Observation 1: unlabeled data affects decision boundary
[Figure: percent class-2 responses vs. x, for test-1 (all), test-2 (L-subjects), and test-2 (R-subjects)]

Average decision boundary:
- after seeing labeled data: x = 0.11
- after seeing labeled and unlabeled data: L-subjects x = −0.10, R-subjects x = 0.48
Xiaojin Zhu (Univ. Wisconsin, Madison) Tutorial on Semi-Supervised Learning Chicago 2009 96 / 99
![Page 225: Tutorial on Semi-Supervised Learningpages.cs.wisc.edu/~jerryzhu/pub/sslchicago09.pdfXiaojin Zhu (Univ. Wisconsin, Madison) Tutorial on Semi-Supervised Learning Chicago 2009 14 / 99](https://reader031.vdocuments.net/reader031/viewer/2022022518/5b0b3a4e7f8b9a61448c513e/html5/thumbnails/225.jpg)
Part II Human SSL
Observation 2: unlabeled data affects reaction time
[Figure: reaction time (ms) vs. x, for test-1 (all), test-2 (L-subjects), and test-2 (R-subjects)]

Longer reaction time → harder example → closer to the decision boundary. The reaction times, too, suggest a decision-boundary shift.
Xiaojin Zhu (Univ. Wisconsin, Madison) Tutorial on Semi-Supervised Learning Chicago 2009 97 / 99
![Page 226: Tutorial on Semi-Supervised Learningpages.cs.wisc.edu/~jerryzhu/pub/sslchicago09.pdfXiaojin Zhu (Univ. Wisconsin, Madison) Tutorial on Semi-Supervised Learning Chicago 2009 14 / 99](https://reader031.vdocuments.net/reader031/viewer/2022022518/5b0b3a4e7f8b9a61448c513e/html5/thumbnails/226.jpg)
Part II Human SSL
Model fitting
We can fit human behavior with a GMM.
[Figure, left: fitted p(y = 2 | x) vs. x, showing the boundary shift; right: fitted reaction time (ms) vs. x; each for test-1, test-2 L-data, and test-2 R-data]

Left: boundary shift. Right: reaction time fit t = aH(x) + b.
Humans and machines both perform semi-supervised learning.

Understanding natural learning may lead to new machine learning algorithms.
Xiaojin Zhu (Univ. Wisconsin, Madison) Tutorial on Semi-Supervised Learning Chicago 2009 98 / 99
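Assuming H(x) in t = aH(x) + b denotes the entropy of the fitted model's posterior at x (an interpretation; the slide does not define H here), the reaction-time fit can be sketched as below. The values of a and b are illustrative, chosen only so the output lands in the 400–900 ms range of the plots.

```python
import numpy as np

def posterior_class2(x, mu, var, pi):
    """p(y = 2 | x) under a fitted two-component 1-D GMM."""
    like = pi * np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    return like[1] / like.sum()

def predicted_reaction_time(x, mu, var, pi, a=500.0, b=400.0):
    """t = a*H(x) + b with H(x) the binary entropy (in bits) of the posterior.
    Interpreting H as entropy, and the values of a and b, are assumptions."""
    p = np.clip(posterior_class2(x, mu, var, pi), 1e-12, 1 - 1e-12)
    H = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))
    return a * H + b
```

Under this reading, predicted reaction time peaks at the decision boundary (where the posterior is 0.5 and H = 1 bit) and falls toward b far from it, reproducing the humped curves on the previous slide.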
![Page 227: Tutorial on Semi-Supervised Learningpages.cs.wisc.edu/~jerryzhu/pub/sslchicago09.pdfXiaojin Zhu (Univ. Wisconsin, Madison) Tutorial on Semi-Supervised Learning Chicago 2009 14 / 99](https://reader031.vdocuments.net/reader031/viewer/2022022518/5b0b3a4e7f8b9a61448c513e/html5/thumbnails/227.jpg)
Part II Human SSL
References
See the references in
Xiaojin Zhu and Andrew B. Goldberg. Introduction to Semi-SupervisedLearning. Morgan & Claypool, 2009.
Xiaojin Zhu (Univ. Wisconsin, Madison) Tutorial on Semi-Supervised Learning Chicago 2009 99 / 99