Semi-Supervised Learning



Adaptive Computation and Machine Learning

Thomas Dietterich, Editor

Christopher Bishop, David Heckerman, Michael Jordan, and Michael Kearns, Associate Editors

Bioinformatics: The Machine Learning Approach, Pierre Baldi and Søren Brunak

Reinforcement Learning: An Introduction, Richard S. Sutton and Andrew G. Barto

Graphical Models for Machine Learning and Digital Communication, Brendan J. Frey

Learning in Graphical Models, Michael I. Jordan

Causation, Prediction, and Search, second edition, Peter Spirtes, Clark Glymour, and Richard Scheines

Principles of Data Mining, David Hand, Heikki Mannila, and Padhraic Smyth

Bioinformatics: The Machine Learning Approach, second edition, Pierre Baldi and Søren Brunak

Learning Kernel Classifiers: Theory and Algorithms, Ralf Herbrich

Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond, Bernhard Schölkopf and Alexander J. Smola


Semi-Supervised Learning

Olivier Chapelle

Bernhard Schölkopf

Alexander Zien

The MIT Press

Cambridge, Massachusetts

London, England


© 2005 Massachusetts Institute of Technology

All rights reserved. No part of this book may be reproduced in any form by any electronic

or mechanical means (including photocopying, recording, or information storage and retrieval)

without permission in writing from the publisher.

Typeset by the authors using LaTeX 2ε

Library of Congress Control No. xxx

Printed and bound in the United States of America

Library of Congress Cataloging-in-Publication Data


Contents

1 A Tutorial Introduction
1.1 Supervised, Unsupervised, and Semi-Supervised Learning
1.2 When Can Semi-Supervised Learning Work?
1.3 Classes of Algorithms and Organization of This Book

I Generative Models

2 Probabilistic Graphical Models for Semi-Supervised Learning
2.1 The Semi-Supervised Learning Problem
2.2 The Sampling Paradigm (Generative Models)
2.3 The Diagnostic Paradigm. Input-Dependent Regularization
2.4 Examples: Generative Models
2.5 Examples: Input-Dependent Regularization
2.6 A General Scheme for Input-Dependent Regularization. Relations to Co-Training

3 Probabilistic Semi-Supervised Clustering with Constraints
3.1 Introduction
3.2 Related Work
3.3 Model
3.4 Algorithm
3.5 Constraint Acquisition
3.6 Experiments

4 Semi-Supervised Text Classification Using EM
4.1 Introduction
4.2 The Basic Setting
4.3 Modeling Sub-topic Structure
4.4 Modeling Super-topic Structure
4.5 Finding More Probable Models
4.6 Related Work
4.7 Conclusions

5 Risks of Semi-Supervised Learning: How Unlabeled Data Can Degrade Performance of Generative Classifiers
5.1 Introduction
5.2 Do Unlabeled Data Improve or Degrade Classification Performance?
5.3 The Boy-Girl Classifier
5.4 Insights from Asymptotic Analysis
5.5 The Value of Labeled and Unlabeled Data
5.6 Understanding the Risks of Semi-Supervised Learning
5.7 Model Search (and Some Philosophy...) in Semi-Supervised Learning of Generative Classifiers

II Low Density Separation

6 Transductive Support Vector Machines
6.1 Introduction
6.2 TSVM Optimization Problems
6.3 Why Use Margin on the Test Set?
6.4 Algorithms
6.5 Experiments
6.6 Problems with the TSVM Objective
6.7 Connection to Related Approaches

7 Semi-Supervised Learning using Semi-Definite Programming
7.1 Introduction
7.2 Relaxing SVM-Transduction
7.3 An Efficient Approximation
7.4 General Semi-Supervised Learning Settings
7.5 Empirical Results
7.6 Conclusions

8 Gaussian Processes and the Null-Category Noise Model
8.1 Introduction
8.2 The Noise Model
8.3 The Process Model
8.4 Results
8.5 Discussion

9 Entropy-Based Methods
9.1 Introduction
9.2 Derivation of the Criterion
9.3 Optimization Algorithms
9.4 Related Work
9.5 Experiments
9.6 Discussion

III Graph-Based Methods

10 Label Propagation and Quadratic Criterion
10.1 Label Propagation on a Similarity Graph
10.2 Quadratic Cost Criterion
10.3 From Transduction to Induction
10.4 Discussion

11 The Manifold Basis of Semi-supervised Learning
11.1 Introduction
11.2 Theoretical Framework: RKHS/Manifolds, Manifold Representer Theorems
11.3 Algorithmic Framework: Laplacian SVM / Laplacian RLS
11.4 Data-Dependent Kernels and Linear Methods
11.5 Future Directions

12 Regularization on Directed Graphs
12.1 Preliminaries on Directed Graphs and Markov Chains
12.2 Regularization Framework A: Hubs and Authorities
12.3 Regularization Framework B: Stationary Distribution
12.4 Regularization Framework C: Water-Pipe Model
12.5 Discrete Differential Geometry on Directed Graphs
12.6 Experiments on Web Classification and Protein Prediction

13 Boosting on Manifolds
13.1 Introduction
13.2 Marginal AdaBoost
13.3 RegBoost
13.4 The Graph Laplacian Regularizer
13.5 Other Approaches

IV Change of Representation

14 Graph Kernels by Spectral Transforms
14.1 Graph Laplacian and Spectral Clustering
14.2 Kernels by Spectral Transforms
14.3 Kernel Alignment Using QCQP
14.4 Semi-Supervised Kernels with Order Constraints
14.5 Experimental Results

15 Spectral Methods for Dimensionality Reduction
15.1 Introduction
15.2 Algorithms
15.3 Discussion
15.4 Appendix (if necessary)

16 Modifying Distances
16.1 A Simple Geometric Intuition
16.2 Theoretical Basis for (Re)defining Distances
16.3 Defining Density-Sensitive Distances
16.4 Using Modified Distances for Semi-Supervised Learning

17 Data-Dependent Regularization

V Semi-Supervised Learning in Practice

18 Large-Scale Algorithms
18.1 Introduction
18.2 Cost Approximation
18.3 Subset Selection
18.4 Discussion

19 Semi-Supervised Protein Classification using Cluster Kernels
19.1 Introduction
19.2 Representations and Kernels for Protein Sequences
19.3 Cluster Kernels for Protein Sequences
19.4 The Neighborhood Mismatch Kernel
19.5 The Bagged Mismatch Kernel
19.6 The Profile Kernel
19.7 Experiments
19.8 Discussion

20 Prediction of Protein Function from Networks
20.1 Introduction
20.2 Graph-Based Semi-Supervised Learning
20.3 Combining Multiple Graphs
20.4 Experiments on Function Prediction of Proteins
20.5 Conclusion and Outlook

VI Perspectives

21 An Augmented PAC Model for Semi-Supervised Learning
21.1 Introduction
21.2 A Formal Framework
21.3 Sample Complexity Results
21.4 Algorithmic Results
21.5 Related Models and Discussion

22 Metric Based Approaches for Semi-Supervised Regression
22.1 Metrics for Supervised Learning
22.2 Model Selection
22.3 Regularization
22.4 Connections to Active Learning
22.5 Classification

References
Index
Notation and Symbols


1 A Tutorial Introduction

1.1 Supervised, Unsupervised, and Semi-Supervised Learning

In order to understand the nature of semi-supervised learning, it will be useful to

first take a look at supervised and unsupervised learning.

1.1.1 Supervised and Unsupervised Learning

Traditionally, there have been two fundamentally different types of tasks in machine

learning.

The first one is unsupervised learning. Let X = (x1, . . . , xn) be a set of n examples (or points), where xi ∈ X for all i ∈ [n] := {1, . . . , n}. It is often convenient to define the (n × d)-matrix $X = (x_i^\top)_{i \in [n]}^\top$ that contains the data points as its rows. The

goal of unsupervised learning is to find interesting structure in the data X. It

has been argued that the problem of unsupervised learning is fundamentally that

of estimating a density which is likely to have generated X. However, there are

also weaker forms of unsupervised learning, such as quantile estimation, clustering,

outlier detection and dimensionality reduction.

The second task is supervised learning. The goal is to learn a mapping from x to y, given a training set made of pairs (xi, yi). Here, the yi ∈ Y are called the labels or targets of the examples xi. If the labels are real (regression) or integer (classification) numbers, $y = (y_i)_{i \in [n]}^\top$ denotes the column vector of labels. The task

is well defined, since a mapping can be evaluated through its predictive performance

on test examples. When Y = R or Y = Rd, the task is called regression. Most of this

book will focus on classification, i.e. when y takes values in a finite set. There are two

families of algorithms for supervised learning: generative algorithms try to model the class-conditional density p(x|y) by some unsupervised learning procedure.1 A

1. For simplicity, we are assuming that all distributions have densities, and thus we restrict ourselves to dealing with densities.


predictive density can then be inferred by applying Bayes’ rule:

$$p(y|x) = \frac{p(x|y)\,p(y)}{\int_{\mathcal{Y}} p(x|y)\,p(y)\,dy} \qquad (1.1)$$

In fact, p(x|y)p(y) = p(x, y) is the joint density of the data, from which pairs

(xi, yi) could be generated. Discriminative algorithms do not try to estimate how the xi have been generated, but instead concentrate on estimating p(y|x). Some

discriminative methods even only try to model whether p(y|x) is greater than or

less than 0.5; an example of this is the Support Vector Machine. It has been argued

that discriminative models are more directly aligned with the goal of supervised

learning and therefore tend to be more efficient in practice. The two frameworks

are discussed in more detail in Sections 2.2 and 2.3.
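
To make the generative route concrete, the following minimal sketch assumes Gaussian class-conditional densities: it fits one Gaussian per class to labeled data and converts the densities into p(y|x) via Bayes' rule (1.1). The function and variable names are illustrative choices, not taken from any particular library.

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_gaussian_generative(X, y):
    """Fit one Gaussian p(x|y) per class plus the class priors p(y)."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = {
            "prior": len(Xc) / len(X),
            "mean": Xc.mean(axis=0),
            # small ridge keeps the covariance invertible
            "cov": np.cov(Xc, rowvar=False) + 1e-6 * np.eye(X.shape[1]),
        }
    return params

def predict_proba(params, X):
    """Bayes' rule (1.1): p(y|x) is proportional to p(x|y) p(y), normalized over classes."""
    joint = np.column_stack([
        p["prior"] * multivariate_normal.pdf(X, p["mean"], p["cov"])
        for p in params.values()
    ])
    return joint / joint.sum(axis=1, keepdims=True)
```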

1.1.2 Semi-Supervised Learning

Semi-supervised learning (SSL) is halfway between supervised and unsupervised learning. In addition to unlabeled data, the algorithm is provided with some supervision information – but not necessarily for all examples. Often, this information will be the targets associated with some of the examples. In this case, the data set X = (xi)i∈[n] can be divided into two parts: the points Xl := (x1, . . . , xl), for which labels Yl := (y1, . . . , yl) are provided, and the points Xu := (xl+1, . . . , xl+u), the

labels of which are not known. This is the “standard” semi-supervised learning as

investigated in this book; most chapters will refer to this setting.
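
A common way to represent this standard setting in code is to stack labeled and unlabeled points in one array and mark unknown labels with a sentinel value; the sketch below uses -1 for "unlabeled", which is merely one convention, and the toy data are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
X_l = rng.normal(size=(5, 2))        # X_l = (x_1, ..., x_l), labeled inputs
y_l = rng.integers(0, 2, size=5)     # Y_l = (y_1, ..., y_l), their labels
X_u = rng.normal(size=(20, 2))       # X_u = (x_{l+1}, ..., x_{l+u}), unlabeled inputs

X = np.vstack([X_l, X_u])                                   # all n = l + u points
y = np.concatenate([y_l, -np.ones(len(X_u), dtype=int)])    # -1 marks "label unknown"
labeled, unlabeled = y != -1, y == -1                       # masks most SSL code works with
```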

Other forms of partial supervision are possible. For example, there may be

constraints such as “these points have (or do not have) the same target” (cf. Abu-

Mostafa [1995]). This more general setting is considered in Chapter 3. The different setting corresponds to a different view on semi-supervised learning: in Chapter

3, SSL is seen as unsupervised learning guided by constraints. In contrast, most

other approaches see SSL as supervised learning with additional information on the

distribution of the examples x. The latter interpretation seems to be more in line

with most applications, where the goal is the same as in supervised learning: to

predict a target value for a given xi. However, this view does not readily apply if

the number and nature of the classes are not known in advance but have to be inferred from the data. In contrast, SSL with constraints may still remain applicable in

such situations.

A problem related to SSL was introduced by Vapnik several decades ago: so-called transductive learning. In this setting, one is given a (labeled) training set and an (unlabeled) test set. The idea of transduction is to perform predictions only for the test points. This is in contrast to inductive learning, where the goal is to output a prediction function which is defined on the entire space X.

Many methods described in this book will be transductive; in particular, this is

rather natural for inference based on graph representations of the data. This issue

will be addressed again below.


1.2 When Can Semi-Supervised Learning Work?

A natural question arises: is semi-supervised learning meaningful? More precisely:

in comparison with a supervised algorithm that uses only labeled data, can one

hope to have a more accurate prediction by taking into account the unlabeled

points? As you may have guessed from the size of the book in your hands, in

principle the answer is “yes.” However, there is an important prerequisite: that the

distribution of examples, which the unlabeled data will help elucidate, be relevant

for the classification problem.

In a more mathematical formulation, one could say that the knowledge on p(x)

that one gains through the unlabeled data has to carry information that is useful

in the inference of p(y|x). If this is not the case, semi-supervised learning will not

yield an improvement over supervised learning. It might even happen that using

the unlabeled data degrades the prediction accuracy by misguiding the inference;

this effect is investigated in detail in Chapter 5.

One should thus not be too surprised that for semi-supervised learning to work,

certain assumptions will have to hold. In this context, note that plain supervised

learning also has to rely on assumptions: one of the most popular such assumptions can be formulated as follows.2

Smoothness Assumption of Supervised Learning. If two points x1, x2 are

close, then so should be the corresponding outputs y1, y2.

Clearly, without such assumptions, it would never be possible to generalize from a

finite training set to a set of possibly infinitely many unseen test cases.

1.2.1 The Semi-Supervised Smoothness Assumption

We now propose a generalization of the Smoothness Assumption that is useful

for semi-supervised learning; we thus call it the “Semi-Supervised Smoothness

Assumption.” While in the supervised case according to our prior beliefs the output

varies smoothly with the distance, we now also take into account the density of the

inputs. The assumption is that the label function is smoother in high-density regions than in low-density regions:

Semi-Supervised Smoothness Assumption. If two points x1, x2 in a high-

density region are close, then so should be the corresponding outputs y1, y2.

Note that by transitivity, this assumption implies that if two points are linked by

a path of high density (e.g., if they belong to the same cluster), then their outputs

are likely to be close. If, on the other hand, they are separated by a low density

2. Strictly speaking, this assumption only refers to continuity rather than smoothness; however, the term smoothness is commonly used, possibly because in regression estimation practice, y is often modelled as a smooth function of x.


region, then their outputs need not be close.

Note that the Semi-Supervised Smoothness Assumption applies to both regres-

sion and classification. In the next section, we will show that in the case of clas-

sification, it reduces to assumptions commonly used in SSL. At present, it is less clear how useful the assumption is for regression problems. As an alternative, Chapter 22

explores how semi-supervised regression can take advantage of assumptions about

the output space, which is richer in regression than it is in classification.

1.2.2 The Cluster Assumption

Suppose we knew that the points of each class tended to form a cluster. Then the unlabeled data could aid in finding the boundary of each cluster more accurately:

one could run a clustering algorithm and use the labeled points to assign a class to

each cluster. That is in fact one of the earliest forms of semi-supervised learning (see

Chapter 2). The underlying, now classical, assumption may be stated as follows:

Cluster Assumption. If points are in the same cluster, they are likely to be of

the same class.

This assumption may be considered reasonable on the basis of the sheer existence

of classes: if there is a continuum of objects, it is unlikely that they were ever

distinguished into different classes.

Note that the Cluster Assumption does not imply that each class forms a single,

compact cluster: it only means that, usually, we do not observe objects of two

distinct classes in the same cluster.

The Cluster Assumption can easily be seen to be a special case of the proposed Semi-Supervised Smoothness Assumption, considering that clusters are frequently defined as sets of points that can be connected by short curves which traverse only high-density regions.

The cluster assumption can be formulated in an equivalent way:

Low Density Separation. The decision boundary should lie in a low-density region.

The equivalence is easy to see: A decision boundary in a high density region would

cut a cluster into two different classes; many objects of different classes in the same

cluster would require the decision boundary to cut the cluster, i.e. to go through a

high density region.

Although the two formulations are conceptually equivalent, they can inspire

different algorithms, as we will argue in Section 1.3. The Low Density version

also gives additional intuition why the assumption is sensible in many real world

problems. Consider digit recognition, for instance, and suppose that one wants to

learn how to distinguish a hand-written digit “0” against digit “1.” The decision

boundary will be between a 0 and a 1, most likely a digit looking like a very

elongated zero. But the probability that someone wrote this “weird” digit is very

small.


1.2.3 The Manifold Assumption

A different but related assumption that forms the basis of several semi-supervised

learning methods is the manifold assumption:

Manifold Assumption. The (high-dimensional) data lie (roughly) on a low-dimensional manifold.

How can this be useful? A well-known problem of many statistical methods

and learning algorithms is the so-called "Curse of Dimensionality." It is related to the fact that volume grows exponentially with the number of dimensions, and an

exponentially growing number of examples is required for statistical tasks such as

the reliable estimation of densities. This is a problem that directly affects generative

approaches that are based on density estimates in input space. A related problem

of high dimensions, which may be more severe for discriminative methods, is that

pairwise distances tend to become more similar, and thus less expressive.

If the data happen to lie on a low-dimensional manifold, however, then the

learning algorithm can essentially operate in a space of corresponding dimension,

thus avoiding the curse of dimensionality.

As above, one can argue that algorithms working with manifolds may be seen

as approximately implementing the semi-supervised smoothness assumption: such

algorithms use the metric of the manifold for computing geodesic distances. If we

view the manifold as an approximation of the high-density regions, then it becomes

clear that in this case, the semi-supervised smoothness assumption reduces to the

standard smoothness assumption of supervised learning, applied on the manifold.

Note that if the manifold is embedded into the high-dimensional input space in a

curved fashion (i.e., it is not just a subspace), geodesic distances differ from those in

the input space. By ensuring more accurate density estimates and more appropriate

distances, the manifold assumption may be useful for classification as well as for

regression.
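
One way to "use the metric of the manifold" is to approximate geodesic distances by shortest paths in a k-nearest-neighbor graph built on all labeled and unlabeled points, as in Isomap-style methods. The sketch below is such an approximation under the assumption that the graph is connected; k = 10 and the toy data are arbitrary choices.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import shortest_path

def approximate_geodesics(X, k=10):
    """Approximate geodesic distances by shortest paths in a k-NN graph."""
    knn = kneighbors_graph(X, n_neighbors=k, mode="distance")
    knn = knn.maximum(knn.T)                   # symmetrize the neighborhood graph
    return shortest_path(knn, method="D", directed=False)

# Points near a circle: graph distances follow the circle, while straight-line
# Euclidean distances cut through its interior.
theta = np.linspace(0, 2 * np.pi, 200, endpoint=False)
X = np.c_[np.cos(theta), np.sin(theta)] + 0.01 * np.random.default_rng(0).normal(size=(200, 2))
D_geo = approximate_geodesics(X)
```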

1.2.4 Transduction

As mentioned before, some algorithms naturally operate in a transductive setting.

According to the philosophy put forward by Vapnik, high-dimensional estimation

problems should attempt to follow the following principle:

Vapnik’s Principle. When trying to solve some problem, one should not solve a

more difficult problem as an intermediate step.

Consider as an example supervised learning, where predictions of labels y corre-

sponding to some objects x are asked for. Generative models estimate the density

of x as an intermediate step, while discriminative methods directly estimate the labels.

In a similar way, if label predictions are only required for a given test set,

transduction can be argued to be more direct than the corresponding induction:

while an inductive method infers a function f : X → Y on the entire space X, and


afterwards returns the evaluations f(xi) at the test points, transduction consists

of directly estimating the finite set of test labels, i.e. a function f : Xu → Y only

defined on the test set. Note that transduction is different from SSL as we defined

it above.

Now suppose we are given a transductive algorithm which produces a solution

superior to an inductive algorithm trained on the same labeled data (but discarding

the unlabeled data). Then the performance difference might be due to one of the

following two points (or a combination thereof):

1. transduction follows Vapnik’s principle more closely than induction does, or

2. the transductive algorithm takes advantage of the unlabeled data in a way similar

to semi-supervised learning algorithms.

There is ample evidence for improvements being due to the second of these points.

We are presently not aware of empirical results that selectively support the first

point.

In fact, one could argue that prediction on an arbitrary point x is no less difficult

than prediction on all points. Any algorithm performing transduction can also be

seen as one performing induction: instead of returning the label function, it directly

implements it. The function can be evaluated at any point by adding it to the

original test set and calling the transduction algorithm.

1.3 Classes of Algorithms and Organization of This Book

Although many methods were not explicitly derived from one of the above assump-

tions, most algorithms can be seen to correspond to or implement one or more

assumptions. We try to organize the semi-supervised learning methods presented

in this book into four classes that roughly correspond to the underlying assump-

tion. Although the classification is not always unique, we hope that this organiza-

tion makes the book and its contents more accessible to the reader, by providing a

guiding scheme.

For the same reason, this book is organized in “parts.” There is one part for each

class of SSL algorithms and an extra part focussing on generative approaches. Two

further parts are devoted to applications and perspectives of SSL. In the following

we briefly introduce the ideas of each book part.

1.3.1 Generative Models

Part I presents the history and state of the art of SSL with generative models. Chapter

2 starts with a thorough review of the field.

Inference using a generative model involves the estimation of the conditional

density p(x|y). In this setting, any additional information on p(x) is useful. As

a simple example, assume that p(x|y) is Gaussian. Then one can use the EM


algorithm to find the parameters of the Gaussian corresponding to each class.

The only difference to the standard EM algorithm as used for clustering is that

the “hidden variable” associated with any labeled example is actually not hidden,

but it is known and equals its class label. It implements the Cluster Assumption

(cf. Section 2.2), since a given cluster belongs to only one class.
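
A rough sketch of the procedure just described is given below, under simplifying assumptions made for this example: classes are indexed 0, ..., K−1, every class has at least two labeled points (so the initial covariances are defined), and a small ridge is added for numerical stability.

```python
import numpy as np
from scipy.stats import multivariate_normal

def semi_supervised_em(X_l, y_l, X_u, n_classes, n_iter=50, reg=1e-6):
    """EM for a Gaussian mixture in which labeled points keep known components."""
    X, d = np.vstack([X_l, X_u]), X_l.shape[1]
    R_l = np.eye(n_classes)[y_l]                      # fixed one-hot responsibilities
    pi = R_l.mean(axis=0)                             # initialize from labeled data
    mu = np.array([X_l[y_l == c].mean(axis=0) for c in range(n_classes)])
    cov = np.array([np.cov(X_l[y_l == c], rowvar=False) + reg * np.eye(d)
                    for c in range(n_classes)])
    for _ in range(n_iter):
        # E-step: infer responsibilities for the unlabeled points only.
        dens = np.column_stack([pi[c] * multivariate_normal.pdf(X_u, mu[c], cov[c])
                                for c in range(n_classes)])
        R_u = dens / dens.sum(axis=1, keepdims=True)
        R = np.vstack([R_l, R_u])                     # labeled ones stay "not hidden"
        # M-step: weighted updates of priors, means, and covariances.
        Nk = R.sum(axis=0)
        pi, mu = Nk / len(X), (R.T @ X) / Nk[:, None]
        for c in range(n_classes):
            diff = X - mu[c]
            cov[c] = (R[:, c, None] * diff).T @ diff / Nk[c] + reg * np.eye(d)
    return pi, mu, cov, R_u
```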

This small example already highlights different interpretations of semi-supervised

learning with a generative model:

It can be seen as classification with additional information on the marginal

density.

It can be seen as clustering with additional information. In the standard setting,

this information would be the labels of a subset of points, but it could also come

in the more general form of constraints (see Chapter 3).

A strength of the generative approach is that knowledge on the structure of the

problem or the data can naturally be incorporated by modeling it. In Chapter 4, this

is demonstrated for the application of the EM algorithm to text data. It is observed

that, in some circumstances, unlabeled data can decrease prediction accuracy. This

effect is investigated in depth in Chapter 5.

In statistical learning, before performing inference, one chooses a class of func-

tions, or a prior over functions. One has to choose it according to what is known

in advance about the problem. In the semi-supervised learning context, if one has

some ideas about what the structure of the data tells about the target function, the

choice of this prior can be made more precise after seeing the unlabeled data: one could typically put a higher prior probability on functions that satisfy the cluster assumption. In Chapter 8, an example is given of such a data-dependent prior.

1.3.2 Low Density Separation

Part II of this book aims at describing algorithms which try to directly implement

the Low Density Separation Assumption by pushing the decision boundary away

from the unlabeled points.

The most common approach to achieve this goal is to use a maximum margin

algorithm such as Support Vector Machines. The method of maximizing the margin

also for unlabeled points is called the Transductive SVM (TSVM). However, the corresponding problem is nonconvex and thus difficult to optimize.

One optimization algorithm for the TSVM is presented in Chapter 6. Starting

from the SVM solution as trained on the labeled data only, the unlabeled points are

labeled by SVM predictions, and the SVM is retrained on all points. This is iterated

while the weight of the unlabeled points is slowly increased. Another possibility is

the SDP relaxation suggested in Chapter 7.
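
The iterative scheme described above can be written down roughly as follows with a standard SVM implementation; this is a simplified illustration of the idea (predict labels for the unlabeled points, retrain on everything while ramping up their weight), not the exact algorithm of Chapter 6, and the linear weight schedule is an arbitrary choice.

```python
import numpy as np
from sklearn.svm import SVC

def tsvm_like_self_labeling(X_l, y_l, X_u, n_rounds=10, C=1.0):
    """Crude TSVM-style loop: label X_u with the current SVM, then retrain on
    all points while slowly increasing the weight of the unlabeled ones."""
    clf = SVC(kernel="linear", C=C).fit(X_l, y_l)
    X_all = np.vstack([X_l, X_u])
    for t in range(1, n_rounds + 1):
        y_u = clf.predict(X_u)                            # current guesses for X_u
        w_u = t / n_rounds                                # unlabeled weight ramps to 1
        weights = np.concatenate([np.ones(len(X_l)), np.full(len(X_u), w_u)])
        clf = SVC(kernel="linear", C=C).fit(X_all, np.concatenate([y_l, y_u]),
                                            sample_weight=weights)
    return clf
```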

Two alternatives to the TSVM are presented that are formulated in a probabilistic

and in an information theoretic framework, respectively. In Chapter 8, binary

Gaussian Process classification is augmented by the introduction of a null-class

that occupies the space between the two regular classes. As an advantage over the


TSVM, this allows for probabilistic outputs.

This advantage is shared by the entropy minimization presented in Chapter 9.

It encourages the conditional class probabilities P(y|x) to be close to either 1 or 0

at labeled and unlabeled points. As a consequence of the smoothness assumption,

the probability will tend to be close to 0 or 1 throughout any high density region,

while class boundaries correspond to intermediate probabilities.
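
As a small illustration of the entropy criterion (not the exact formulation of Chapter 9), the sketch below fits a binary logistic regression by minimizing the negative log-likelihood on the labeled points plus λ times the entropy of its predictions on the unlabeled points; λ = 0.5, the optimizer, and the assumption of labels in {0, 1} are choices made for this example.

```python
import numpy as np
from scipy.optimize import minimize

def entropy_regularized_logreg(X_l, y_l, X_u, lam=0.5):
    """Binary logistic regression with an entropy penalty on unlabeled predictions."""
    d = X_l.shape[1]

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def objective(w):
        b, v = w[0], w[1:]
        p_l, p_u, eps = sigmoid(X_l @ v + b), sigmoid(X_u @ v + b), 1e-12
        nll = -np.mean(y_l * np.log(p_l + eps) + (1 - y_l) * np.log(1 - p_l + eps))
        ent = -np.mean(p_u * np.log(p_u + eps) + (1 - p_u) * np.log(1 - p_u + eps))
        return nll + lam * ent    # entropy term pushes P(y|x) on X_u toward 0 or 1

    return minimize(objective, np.zeros(d + 1), method="L-BFGS-B").x
```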

1.3.3 Graph-Based Methods

During the last couple of years, the most active area of research in semi-supervised

learning has been in graph based methods, which are the topic of Part III of this

book. The common denominator of these methods is that the data are represented

by the nodes of a graph, the edges of which are weighted by the pairwise distances

of the incident nodes (and a missing edge corresponds to infinite distance). If the

distance of two points is computed by minimizing the aggregate path distance over

all paths connecting the two points, this can be seen as an approximation of the

geodesic distance of the two points with respect to the manifold of data points.

Thus, graph methods can be argued to build on the manifold assumption.

Most graph methods refer to the graph by utilizing the graph Laplacian. Let

G = (V,E) be a graph with real edge weights given by w : E → R. Here, the weight

w(e) of an edge e indicates the similarity of the incident nodes (and a missing edge

corresponds to zero similarity). Now the weighted adjacency matrix (or weight matrix, for short) W of the graph G = (V,E) is defined by

$$W_{ij} := \begin{cases} w(e) & \text{if } e = (i,j) \in E \\ 0 & \text{if } e = (i,j) \notin E \end{cases} \qquad (1.2)$$

The diagonal matrix D defined by $D_{ii} := \sum_j W_{ij}$ is called the degree matrix of

G. Now there are different ways of defining the graph Laplacian, the two most prominent of which are the normalized graph Laplacian, $\mathcal{L}$, and the unnormalized graph Laplacian, $L$:

$$\mathcal{L} := D^{-1/2}(D - W)\,D^{-1/2}, \qquad L := D - W. \qquad (1.3)$$
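
The sketch below builds W from a Gaussian similarity between data points, forms the degree matrix D, and computes the two Laplacians of (1.3); the Gaussian weighting and the bandwidth σ are illustrative choices rather than anything prescribed here.

```python
import numpy as np
from scipy.spatial.distance import cdist

def graph_laplacians(X, sigma=1.0):
    """Weight matrix W, degree matrix D, and the two Laplacians of Eq. (1.3)."""
    W = np.exp(-cdist(X, X, metric="sqeuclidean") / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)                        # no self-edges
    D = np.diag(W.sum(axis=1))                      # D_ii = sum_j W_ij
    L_unnorm = D - W                                # unnormalized graph Laplacian
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.diag(D)))
    L_norm = D_inv_sqrt @ L_unnorm @ D_inv_sqrt     # normalized graph Laplacian
    return W, D, L_norm, L_unnorm
```

The smoothness penalties used by many of the graph methods discussed next are then simple quadratic forms in these matrices, e.g. f⊤Lf for a candidate label vector f.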

Many graph methods that penalize non-smoothness along the edges of a weighted

graph can in retrospect be seen as different instances of a rather general family

of algorithms, as is outlined in Chapter 10. An alternative approach is pursued in

Chapter 13, where the graph Laplacian is used to regularize boosting.

Usually the prediction consists of labels for the unlabeled nodes. For this reason,

this kind of algorithm is intrinsically transductive, i.e., it returns only the value of

the decision function on the unlabeled points and not the decision function itself.

However, there has been recent work in order to extend graph based methods to

produce inductive solutions, as discussed in Chapter 11.

Often the graph G is constructed by computing similarities of objects in some


other representation, e.g. using a kernel function on Euclidean data points. But

sometimes the original data already have the form of a graph. Examples include

the linkage pattern of web pages and the interactions of proteins (see Chapter 20).

In such cases, the directionality of the edges may be important. In Chapter 12,

principled approaches to dealing with directed graphs are developed.

1.3.4 Change of Representation

The topic of Part IV is algorithms that are not intrinsically semi-supervised, but

instead perform a two-step learning:

1. Perform an unsupervised step on all data, labeled and unlabeled, but ignoring

the available labels. This can, for instance, be a change of representation, or the

construction of a new metric or a new kernel.

2. Ignore the unlabeled data and perform plain supervised learning using the new

distance, representation or kernel.

This can be seen as direct implementation of the Semi-Supervised Smoothness

Assumption, since the representation is changed in such a way that small distances

in high density regions are conserved.
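
A minimal sketch of this two-step scheme uses kernel PCA on all points as the unsupervised change of representation and a plain logistic regression on the labeled part afterwards; the particular components, the RBF kernel, and the parameter values are illustrative assumptions, not recommendations from this book.

```python
import numpy as np
from sklearn.decomposition import KernelPCA
from sklearn.linear_model import LogisticRegression

def two_step_ssl(X_l, y_l, X_u, n_components=10, gamma=0.5):
    """Step 1: unsupervised representation built from labeled + unlabeled data.
       Step 2: plain supervised learning in the new representation."""
    X_all = np.vstack([X_l, X_u])
    embed = KernelPCA(n_components=n_components, kernel="rbf", gamma=gamma)
    Z_all = embed.fit_transform(X_all)            # uses all points, ignores labels
    Z_l, Z_u = Z_all[:len(X_l)], Z_all[len(X_l):]
    clf = LogisticRegression().fit(Z_l, y_l)      # uses labels, ignores X_u now
    return clf.predict(Z_u), clf, embed
```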

Note that the graph-based methods (Part III) are closely related to the ones

presented in this book part: the very construction of the graph from the data

can be seen as an unsupervised change of representation. Consequently, the first

chapter of this part discusses spectral transforms of such graphs in order to build

kernels. Spectral methods can also be used for nonlinear dimensionality reduction, as presented in Chapter 15. Furthermore, in Chapter 16, metrics derived from graphs are investigated, for example those derived from shortest paths.

1.3.5 Self-Learning

For completeness, we mention the idea of self-learning, which is also known as self-

training, self-labeling, or decision directed learning. This is a wrapper-algorithm

that repeatedly uses a supervised learning method. It starts by training on the

labeled data only. In each step a part of the unlabeled points are labeled according

to the current decision function; then the supervised method is retrained using its

own predictions as additional labeled points. This idea has appeared in the literature for some time (e.g., Scudder [1965], Fralick [1967], Agrawala [1970]).
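
A compact sketch of such a wrapper is given below; the base learner, the confidence threshold, and the stopping rule are all illustrative choices. (scikit-learn also ships a similar ready-made wrapper, sklearn.semi_supervised.SelfTrainingClassifier.)

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_training(X_l, y_l, X_u, base=None, threshold=0.95, max_rounds=20):
    """Wrapper self-learning: repeatedly adopt confident self-predictions as labels."""
    clf = base if base is not None else LogisticRegression()
    X_lab, y_lab, X_pool = X_l.copy(), y_l.copy(), X_u.copy()
    for _ in range(max_rounds):
        clf.fit(X_lab, y_lab)
        if len(X_pool) == 0:
            break
        proba = clf.predict_proba(X_pool)
        confident = proba.max(axis=1) >= threshold
        if not confident.any():
            break                                  # nothing confident left to adopt
        y_new = clf.classes_[proba[confident].argmax(axis=1)]
        X_lab = np.vstack([X_lab, X_pool[confident]])
        y_lab = np.concatenate([y_lab, y_new])
        X_pool = X_pool[~confident]
    return clf
```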

An unsatisfactory aspect of self-learning is that the resulting effect depends

on the employed supervised method. If self-learning is used with empirical risk

minimization and the 0-1 loss, the unlabeled data will have no effect on the solution

at all. If instead a margin maximizing method is used, as a result the decision

boundary is pushed away from the unlabeled points. In other cases it seems unclear what self-learning is really doing, and which assumption it corresponds

to.

Correspondingly, we limit the coverage of self-learning in this book to the


Transductive SVM (cf. Chapter 6), which is probably its most successful instance.

1.3.6 Semi-Supervised Learning in Practice

Semi-supervised learning will be most useful whenever there are far more unlabeled

data than labeled ones. This can be the case if obtaining data points is cheap, but obtaining the labels costs a lot of time, effort, or money. This is the case in many

application areas of machine learning, for example:

In speech recognition, it costs almost nothing to record huge amounts of speech,

but labeling it requires some human to listen to it and type a transcript.

Billions of web pages are directly available for automated processing, but to

reliably classify them, humans have to read them.

Protein sequences are nowadays determined at industrial speed (by genome se-

quencing, computational gene finding and automatic translation), but to determine

3D structure or functions of a single protein requires up to years of scientific work.

Web page classification has already been introduced in Chapter 4 in the context of

generative models.

Since unlabeled data carry less information than labeled data, they are required

in large amounts in order to significantly increase the prediction accuracy. This

implies the need for fast and efficient SSL algorithms. Chapters 18 and 19 present

two approaches to being able to deal with huge numbers of points. In Chapter 18

methods are developed for speeding up the label propagation methods introduced in

Chapter 10. In Chapter 19 cluster kernels are shown to be an efficient SSL method.

Chapter 19 also presents the first of two approaches to an important bioinformatics application of semi-supervised learning: the classification of protein sequences.

While here the predictions are based on the protein sequences themselves, Chapter

20 moves on to a somewhat more complex setting: The information is here assumed

to be present in the form of graphs that characterize the interactions of proteins.

Several such graphs exist and have to be combined in an appropriate way.

1.3.7 Perspectives

The last part of the book, Part VI, shows two of the most interesting directions of

ongoing research in SSL.

So far, this book has mostly presented algorithms for semi-supervised learning. While the assumptions discussed above supply some intuition on when and why SSL works, and Chapter 5 investigates when and why it can fail, it would clearly be more satisfactory to have a thorough theoretical understanding of SSL as a whole. Chapter

21 offers a PAC-style framework that yields error bounds for SSL problems.

Furthermore, until now this book has mostly restricted itself to classification.

Chapter 22 introduces another assumption for SSL that is particularly well suited

for regression, and derives algorithms from it.


I Generative Models


2 Probabilistic Graphical Models for Semi-Supervised Learning

Matthias Seeger

In this chapter we propose a simple taxonomy of probabilistic graphical models for

the SSL problem. We give some broad classes of algorithms for each of the families

and point to specific realizations in the literature. Finally, we shed more detailed

light on the family of methods using input-dependent regularization (or conditional

prior distributions) and show parallels to the co-training paradigm.

2.1 The Semi-Supervised Learning Problem

2.2 The Sampling Paradigm (Generative Models)

2.3 The Diagnostic Paradigm. Input-Dependent Regularization

2.4 Examples: Generative Models

2.5 Examples: Input-Dependent Regularization

2.6 A General Scheme for Input-Dependent Regularization. Relations to Co-Training


3 Probabilistic Semi-Supervised Clustering with Constraints

Sugato Basu

Raymond Mooney

Mikhail Bilenko

Arindam Banerjee

In certain clustering problems it is often natural to get limited supervision in

the form of pairwise constraints, i.e., pairs of instances labeled as belonging to

same or different clusters. In recent years some algorithms have been proposed

for enhancing clustering quality by using supervision in the form of constraints.

These algorithms typically utilize the pairwise constraints to either modify the

clustering objective function or to learn the clustering distortion measure. We

propose a probabilistic model for semi-supervised clustering based on Hidden Markov

Random Fields (HMRFs) that integrates these two approaches and provides a

principled framework for incorporating constraint-based supervision into prototype-

based clustering. The model allows the use of a broad range of clustering distortion

measures, including Bregman divergences (e.g., squared Euclidean distance and I-

divergence) and directional distance measures (e.g., cosine distance), making it

applicable to several domains. We present an algorithm that performs partitional

semi-supervised clustering of data by minimizing an objective function derived

from the posterior energy of the HMRF model. We also present a method for

actively selecting informative pairwise constraints in a query-driven framework, to

get improved clustering performance with relatively small amounts of supervision.

3.1 Introduction

Description of the problem of semi-supervised clustering with constraints, overview

of different domains where constraints are a natural form of supervision


3.2 Related work

Brief overview of existing work in constraint-based and metric-based semi-

supervised clustering

3.3 Model

Motivation of HMRF model, connection between MAP estimation on HMRF and

constrained semi-supervised clustering objective function

3.4 Algorithm

Details of initialization step, E-step and M-step, proof of convergence for different

distortion measures, integration of constraint-based and metric-based approaches

3.5 Constraint acquisition

Active learning algorithm for acquiring constraints, analysis of 2-phase algorithm

3.6 Experiments

Demonstration of effectiveness of semi-supervised clustering algorithm and active

selection of constraints for different domains


4 Semi-Supervised Text Classification Using EM

Kamal Nigam

Andrew McCallum

Tom Mitchell

One key difficulty with text classification learning algorithms is that they require

many hand-labeled examples to learn accurately. This chapter demonstrates that

supervised learning algorithms that use a small number of labeled examples and

many inexpensive unlabeled examples can create high-accuracy text classifiers. By

assuming that documents are created by a parametric generative model, Expectation-

Maximization (EM) finds local maximum a posteriori models and classifiers from

all the data—labeled and unlabeled. These generative models do not capture all the

intricacies of text. However in some domains this technique substantially improves

classification accuracy, especially when labeled data are sparse.

Two problems arise from this basic approach. First, unlabeled data can hurt

performance in domains where the generative modeling assumptions are too strongly

violated. In this case the assumptions can be made more representative in two ways:

by modeling sub-topic class structure, and by modeling super-topic hierarchical class

relationships. By doing so, model probability and classification accuracy come into

correspondence, allowing unlabeled data to improve classification performance. The

second problem is that even with a representative model, the improvements given

by unlabeled data do not sufficiently compensate for a paucity of labeled data. Here,

limited labeled data provide EM initializations that lead to low-probability models.

Performance can be significantly improved by using alternatives to EM that avoid

low-probability local maxima.

4.1 Introduction


4.2 The Basic Setting

4.2.1 A Generative Model for Text

4.2.2 Naive Bayes Text Classification

4.2.3 Learning a Naive Bayes Model with Labeled and Unlabeled Data

4.2.4 Experimental Results

4.2.5 Discussion

4.3 Modeling Sub-topic Structure

4.3.1 EM with Multiple Mixtures Per Class

4.3.2 Experimental Results

4.3.3 Discussion

4.4 Modeling Super-topic Structure

4.4.1 Estimating Parameters of a Hierarchical Model

4.4.2 Experimental Results

4.4.3 Discussion

4.5 Finding More Probable Models

4.5.1 Many Random Starting Points

4.5.2 Deterministic Annealing

4.5.3 Experimental Results

4.5.4 Discussion

4.6 Related Work

4.7 Conclusions


5 Risks of Semi-Supervised Learning: How Unlabeled Data Can Degrade Performance of Generative Classifiers

Fabio Cozman

Ira Cohen

Generative classifiers, even supervised ones, are often led astray by bad modeling as-

sumptions. But in semi-supervised learning we face a very embarrassing situation:

it is often the case that, in the presence of bad modeling assumptions, we would

be better off just discarding the unlabeled data and employing a supervised method,

rather than using both the labeled and unlabeled data. Typically we do not expect

to be better off by discarding data, so how can this possibility arise in semi-supervised

learning? In this chapter we discuss theoretical and empirical results in the liter-

ature concerning the (relative) value of labeled and unlabeled data; we explain the

performance degradation phenomenon and show several examples of it. We then

discuss the effect of model search on generative classifiers based on semi-supervised

learning, and summarize related attempts in the literature to overcome performance

degradation.

5.1 Introduction


5.2 Do unlabeled data improve or degrade classification performance?

5.3 The Boy-Girl classifier

5.4 Insights from asymptotic analysis

5.5 The value of labeled and unlabeled data

5.5.1 When the model is correct

5.5.2 When the model is incorrect

5.6 Understanding the risks of semi-supervised learning

5.6.1 A Gaussian example

5.6.2 Building examples

5.6.3 Finite sample effects

5.7 Model search (and some philosophy...) in semi-supervised learning of

generative classifiers

5.7.1 Searching for models: is it necessary?

5.7.2 Searching for models: is it possible?

5.7.3 Using unlabeled data as an auxiliary device


II Low Density Separation


6 Transductive Support Vector Machines

Thorsten Joachims

In the transductive setting proposed by V. Vapnik, instead of learning a general

rule, the goal is to make predictions only at a fixed number of test points. This

allows the learning algorithm to exploit the location of the test points, making it

a particular type of semi-supervised learning problem. Transductive Support Vector

Machines (TSVMs) implement the idea of transductive learning by including test

points in the computation of the margin. This chapter will provide some examples

for why the margin on the test examples can provide useful prior information

for learning, in particular for the problem of text classification. The resulting

optimization problems, however, are difficult to solve. The chapter reviews exact

and approximate optimization methods and discusses their properties. Finally, the

chapter discusses connections to other related semi-supervised learning approaches

like co-training and methods based on graph cuts, which can be seen as solving a

simplified variant of the TSVM optimization problem.

6.1 Introduction


6.2 TSVM Optimization Problems

6.3 Why use margin on the test set?

6.4 Algorithms

6.4.1 Exact

6.4.2 Approximate

6.5 Experiments

6.5.1 text classification

6.5.2 Use of TSVMs (Literature overview)

6.6 Problems with the TSVM objective

6.7 Connection to related approaches

6.7.1 co-training

6.7.2 graph cuts

6.7.3 other related work


7 Semi-Supervised Learning using Semi-Definite Programming

Tijl De Bie

Nello Cristianini

We discuss the problem of support vector machine transduction, which is a combinatorial problem with exponential computational complexity in the number of unlabeled samples. Different approaches to such combinatorial problems exist, among which are exhaustive search strategies (only feasible for very small sample sizes) and local search heuristics starting from a suitably chosen starting point (scalable to large problem sizes, but sensitive to local minima).

In this chapter, we discuss an alternative approach, based on a convex relaxation of the optimization problem associated with support vector machine transduction. The result is a semi-definite programming problem that can be solved in polynomial time, yielding an approximation of the optimal labeling as well as a bound on the true optimum of the original transduction objective function. To further decrease the computational complexity, we introduce an approximation that allows us to solve transduction problems with up to 1000 unlabeled samples.

Lastly, we extend the formulation to more general settings of semi-supervised

learning, where equivalence and inequivalence constraints are given on labels of some

of the samples.
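To give a flavour of the relaxation (a sketch in generic notation; the chapter's precise derivation may differ): the unknown label vector $y \in \{-1, +1\}^n$ typically enters the transduction objective only through the matrix $\Gamma = y y^\top$, and the combinatorial constraint set is replaced by a convex one,

$$\{\, \Gamma = y y^\top : y \in \{-1, +1\}^n \,\} \quad\longrightarrow\quad \{\, \Gamma \succeq 0, \;\; \mathrm{diag}(\Gamma) = \mathbf{1} \,\},$$

i.e. the rank-one condition is dropped while positive semi-definiteness and the unit diagonal are kept. The relaxed problem is a semi-definite program, and its optimal value bounds the optimum of the original combinatorial problem.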

7.1 Introduction


7.2 Relaxing SVM-transduction

7.2.1 Relaxation to an SDP-problem

7.2.2 Some simplifications

7.2.3 Recovering the label vector

7.3 An efficient approximation

7.3.1 The subspace trick

7.3.2 Finding a subspace close to the label vector

7.4 General semi-supervised learning settings

7.4.1 Equivalence and inequivalence label constraints

7.4.2 The subspace trick

7.5 Empirical results

7.6 Conclusions


8 Gaussian Processes and the Null-Category

Noise Model

Neil Lawrence

Michael Jordan

With Gaussian process classifiers (GPC) we aim to predict the posterior probability

of the class labels given an input position, p (yn|xn). In general we find that this

posterior distribution is unaffected by unlabelled data points during learning. Support

vector machines are strongly related to GPCs, but one notable difference is that the

decision boundary in a SVM can be influenced by unlabelled data. The source of

this discrepancy is the SVM’s margin: a characteristic which is not shared with the

GPC. The presence of the margin allows the support vector machine to seek low data

density regions for the decision boundary, effectively allowing it to incorporate the cluster hypothesis (see chapter 6). In this chapter we present the null category noise model, a probabilistic equivalent of the margin. By combining this noise model with a GPC we are able to incorporate the cluster hypothesis without explicitly modelling the input data density or requiring a special choice of kernel.

8.1 Introduction


8.2 The Noise Model

8.2.1 Ordinal Regression

8.2.2 The Null Category Noise Model

8.2.3 Prediction vs Transduction

8.3 The Process Model

8.3.1 Gaussian Processes

8.3.2 Assumed Density Filtering

8.4 Results

8.5 Discussion


9 Entropy-Based Methods

Yves Grandvalet

Yoshua Bengio

In this chapter, we consider the problem of learning a decision rule from labeled

and unlabeled data. In this framework, we motivate the use of discriminative models and devise an entropy regularization scheme that benefits from unlabeled data in the supervised learning setting. The learning criterion, which is based on the penalized

likelihood principle, derives from clearly stated assumptions and can be applied

to any smoothly parametrized model of posterior probabilities. The regularization

scheme favors low density separation, without any modeling of the density of input

features. Other approaches to the semi-supervised problem are shown to be particular

or limiting cases of entropy regularization.
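As a sketch of the shape of such a criterion (illustrative notation; the exact form and its derivation are given in the chapter): with labeled pairs $(x_i, y_i)$, unlabeled points $x_j$, and a conditional model $P(y \mid x; \theta)$, one maximizes a penalized log-likelihood of the form

$$C(\theta; \lambda) \;=\; \sum_{i \in L} \log P(y_i \mid x_i; \theta) \;-\; \lambda \sum_{j \in U} H\bigl(P(\cdot \mid x_j; \theta)\bigr),$$

where $H$ is the Shannon entropy of the predicted class posterior. The entropy term penalizes decision boundaries that leave unlabeled points with uncertain predictions, which is what pushes the separator toward low-density regions.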

9.1 Introduction


9.2 Derivation of the Criterion

9.2.1 Likelihood

9.2.2 When Are Unlabeled Examples Informative?

9.2.3 Entropy Regularization

9.3 Optimization Algorithms

9.3.1 Annealed EM and IRLS

9.3.2 Conjugate gradient

9.4 Related Work

9.4.1 Minimum Entropy in Pattern Recognition

9.4.2 Self-Training

9.4.3 Information Regularization

9.4.4 Transductive Support Vector Machines

9.5 Experiments

9.5.1 Artificial Data

9.5.2 Real Data

9.6 Discussion


III Graph-Based Methods


10 Label Propagation and Quadratic Criterion

Yoshua Bengio

Olivier Delalleau

Nicolas Le Roux

(edited by Misha Belkin, Dengyong Zhou, Xiaojin Zhu)

Various graph-based algorithms for semi-supervised learning have been proposed in

the recent literature. They rely on the idea of building a graph whose nodes are

data points (labeled and unlabeled) and edges represent similarities between points.

Known labels are used to propagate information through the graph in order to label

all nodes. In this chapter, we show how these different algorithms can be cast into

a common framework where one minimizes a quadratic cost criterion whose closed-

form solution is found by solving a linear system of size n. The cost criterion

naturally leads to an extension of such algorithms to the inductive setting, where one

obtains test samples one at a time: the derived induction formula can be evaluated

in O(n) time, which is much more efficient than solving the linear system again (which in general costs O(kn²) time for a sparse graph where each data point has k neighbors).
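As a minimal illustration of the propagation idea (a sketch only, not the chapter's exact algorithm; it assumes a precomputed symmetric similarity matrix W, with every node having at least one neighbor, and clamps the known labels after each sweep):

import numpy as np

def label_propagation(W, Y, labeled_mask, n_iter=200):
    """Iteratively propagate labels on a similarity graph.

    W            : (n, n) symmetric, non-negative similarity matrix
    Y            : (n, c) one-hot label matrix (rows of unlabeled points are zero)
    labeled_mask : boolean array of length n marking the labeled points
    """
    P = W / W.sum(axis=1, keepdims=True)    # row-normalized propagation matrix
    F = Y.astype(float).copy()
    for _ in range(n_iter):
        F = P @ F                           # spread label mass along the edges
        F[labeled_mask] = Y[labeled_mask]   # clamp the known labels
    return F.argmax(axis=1)                 # predicted class for every node

The closed-form view mentioned above corresponds to solving the fixed point of this iteration as one linear system instead of iterating.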

10.1 Label propagation on a similarity graph


10.2 Quadratic cost criterion

10.2.1 Regularization on graphs

10.2.2 Link with label propagation

10.2.3 Optimization framework

10.3 From transduction to induction

10.4 Discussion


11 The Manifold Basis of Semi-supervised

Learning

Misha Belkin

Vikas Sindhwani

11.1 Introduction

11.1.1 Manifold methods and manifolds in real-world data

11.1.2 Regularizing on manifolds

11.1.3 From manifolds to the ambient space

11.2 Theoretical framework: RKHS/Manifolds, manifold representer theorems

11.3 Algorithmic framework: Laplacian SVM/ Laplacian RLS

11.4 Data-dependent kernels and linear methods

11.5 Future directions


12 Regularization on Directed Graphs

Dengyong Zhou

Jiayuan Huang

Bernhard Schoelkopf

Thomas Hofmann

Given a directed graph in which some of the nodes are labeled, we investigate the

question of how to exploit the link structure of the graph to infer the labels of the

remaining unlabeled nodes. To this end, we propose a family of regularization frameworks for functions defined over the nodes of a directed graph that force the classification function to change slowly on densely linked subgraphs, and we compare these frameworks with each other on real-world web classification problems.

12.1 Preliminaries on Directed Graphs and Markov Chains

12.2 Regularization Framework A: hubs and authorities

12.3 Regularization Framework B: stationary distribution

12.4 Regularization Framework C: water-pipe model

12.5 Discrete Differential Geometry on Directed graphs

12.6 Experiments on Web Classification and Protein Prediction


13 Boosting on Manifolds

Balazs Kegl

Ligen Wang

In this chapter we combine two powerful ideas, boosting and manifold learning. The

algorithm is based on RegBoost, an extension of AdaBoost that allows us to incorpo-

rate a data-dependent complexity regularizer into the base learning step of AdaBoost.

If the regularization term does not depend on the labels, then the algorithm can be used for semi-supervised learning in a natural manner: in each iteration, the error

term is calculated using the labeled points, and the regularization term is calculated

using all the data. The convergence of the algorithm can be controlled by a simple

regularization parameter. The convergence theorem also suggests various interpre-

tations that indicate why the algorithm would indeed achieve regularization. The

particular regularizer that we use is the graph Laplacian, which captures the natural

cluster or manifold structure of the data. For large data sets, computing the Lapla-

cian can be prohibitive, so, for simple base classifier sets such as decision trees or

decision stumps, we propose a simple regularizer that approximates the Laplacian

in linear time.
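For reference, the graph Laplacian penalty mentioned above can be written in the standard form

$$h^\top L\, h \;=\; \tfrac{1}{2} \sum_{i,j} w_{ij} \bigl(h(x_i) - h(x_j)\bigr)^2, \qquad L = D - W, \quad D_{ii} = \sum_j w_{ij},$$

where $W$ is the similarity (edge-weight) matrix over all labeled and unlabeled points; the penalty is small exactly when $h$ changes slowly between strongly connected points, which is how the cluster or manifold structure enters the base learning step.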

13.1 Introduction


13.2 Marginal AdaBoost

13.2.1 Convergence

13.3 RegBoost

13.3.1 The goal of the base learner

13.3.2 Theoretical justifications

13.4 The graph Laplacian regularizer

13.4.1 An approximation in linear time

13.4.2 Experiments

13.5 Other approaches


IV Change of Representation


14 Graph Kernels by Spectral Transforms

Xiaojin Zhu

Jaz Kandola

Many graph-based semi-supervised learning methods can be viewed as regularizers, i.e., kernels, over the graph. Such regularization penalizes unsmooth functions on

the graph. The central quantity in regularization is the spectral decomposition of the

graph Laplacian, a matrix derived from edge weights. The eigenvectors with small

eigenvalues are smooth and ideally represent large cluster structures within the data. The ones with large eigenvalues are rugged and are considered noise.

Functions are compared against these eigenvectors to determine their smoothness.

One can alter the eigenvalues to emphasize different eigenvectors. Such change, or

spectral transform, produces different regularizers. Many familiar kernels can be

produced by spectral transform, for example the diffusion kernels.

The question naturally arises: can we find a good spectral transform automatically? The answer is yes. We learn a nonparametric spectral transform based on

convex optimization, to maximize the kernel alignment to the labeled data. We in-

corporate order constraints to encode our preference on smooth eigenvectors. This

results in flexible kernels and avoids the need to choose among different parametric

spectral transforms. Our approach relies on a quadratically constrained quadratic

program (QCQP), and is computationally feasible for large datasets.
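To fix notation for what follows, a kernel obtained by a spectral transform can be sketched as

$$L = \sum_{i=1}^{n} \lambda_i\, \phi_i \phi_i^\top \quad\Longrightarrow\quad K = \sum_{i=1}^{n} r(\lambda_i)\, \phi_i \phi_i^\top,$$

where $(\lambda_i, \phi_i)$ is the eigensystem of the graph Laplacian and $r(\cdot)$ is a non-negative transform that is typically large for small eigenvalues (smooth eigenvectors) and small for large ones; for example, $r(\lambda) = \exp(-\sigma^2 \lambda / 2)$ yields a diffusion kernel. The nonparametric approach described above learns the values $r(\lambda_i)$ directly instead of fixing such a parametric form.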

14.1 Graph Laplacian and Spectral Clustering

In this section I plan to describe the graph Laplacian (normalized and unnor-

malized), with a little bit of spectral graph theory (Chung). I will emphasize the

importance of the smallest eigenvectors, linking it to spectral clustering.

14.2 Kernels by Spectral Transforms


I will describe regularization on the graph, and how spectral transforms produce

different kernels. (I suspect this and the previous sections are also discussed

elsewhere in the book). In the end the question is how to automatically find a

good transform. This section will draw material from Chapelle et al., “Cluster Kernels”, and Smola and Kondor, “Kernels and Regularization on Graphs”.

14.3 Kernel Alignment using QCQP

I introduce kernel alignment as a convex measure for spectral transform. The

problem is then turned into a QCQP. This and the next section draw from Zhu et al., “Nonparametric Transforms”.

14.4 Semi-Supervised Kernels with Order Constraints

Finally I introduce order constraints on the eigenvalues. The solution is a nonpara-

metric spectral transform, which produces a semi-supervised kernel.

14.5 Experimental Results


15 Spectral Methods for Dimensionality Reduction

Lawrence Saul

Kilian Weinberger

Fei Sha

Jihun Ham

Daniel Lee

How can we detect low dimensional structure in high dimensional data? If the data

is mainly confined to a low dimensional subspace, then simple linear methods can

be used to discover the subspace and estimate its dimensionality. More generally,

though, if the data lies on (or near) a low dimensional manifold, then its structure

may be highly nonlinear, and linear methods are bound to fail.

The last few years have witnessed several advances in the problem of nonlin-

ear dimensionality reduction. Given high dimensional data sampled from a low di-

mensional manifold, researchers now have several frameworks for estimating the

data’s intrinsic dimensionality and computing a faithful low dimensional embed-

ding. Surprisingly, the main computations for “manifold learning” are based on

highly tractable optimizations, such as nearest-neighbor searches, least squares fits,

semidefinite programming, and eigenvalue problems. This chapter will survey the

numerous algorithms in machine learning that can be viewed as spectral methods

for linear and nonlinear dimensionality reduction.

15.1 introduction


15.2 algorithms

15.2.1 PCA

15.2.2 multidimensional scaling

15.2.3 isomap

15.2.4 maximum variance unfolding

15.2.5 locally linear embedding (LLE)

15.2.6 laplacian eigenmaps

15.2.7 hessian LLE

15.2.8 local tangent space alignment

15.2.9 conformal reweighting

15.2.10 kernel PCA

15.3 discussion

15.3.1 comparison of algorithms

15.3.2 open questions

15.4 appendix (if necessary)

15.4.1 rayleigh-ritz variational method for eigenvalues

15.4.2 semidefinite programming


16 Modifying Distances

Pascal Vincent

Yoshua Bengio

Matthias Hein

Alexander Zien

In this chapter, we examine how we can implement the following intuitive idea:

modify the geometry of input space, more specifically the way we compute pairwise

distances, to induce a “natural closeness” for points likely to be in the same cluster

(according to the cluster assumption, i.e. if they are linked by a region of high

density) and a “natural separatedness” for points of different clusters (if they can

only be connected by traversing regions of low density). The idea is thus to “warp space” in such a way as to “shrink” distances in high-density regions and to “blow up” (comparatively speaking) distances in low-density regions. After motivating and illustrating this geometric intuition in more detail, we briefly introduce the

theoretical bases that will allow us to properly devise these new distances: we

formally introduce metric spaces, shortest path metrics and Riemannian metrics.

We then show several approaches that have been proposed in the literature, to

define local metrics that depend on the data’s estimated density. We also detail

the necessary approximation techniques used to make these distance computations

feasible in practice. Finally, we show how these techniques can be applied in the

context of semi-supervised learning, by taking into account unlabeled points to obtain

the modified distances, and from those distances either define density-sensitive

kernels to be used in kernel methods, or perform a preprocessing step that will map

the initial data to a new space in which clusters are easier to separate.
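A minimal sketch of one such construction (an illustrative variant, not the chapter's specific metric): build a k-nearest-neighbour graph and inflate the length of long edges, which tend to bridge low-density regions, before computing shortest-path distances. The exponent rho below is a hypothetical knob for how strongly space is “warped”.

import numpy as np
from scipy.spatial.distance import cdist
from scipy.sparse.csgraph import shortest_path

def density_sensitive_distances(X, k=10, rho=2.0):
    """Shortest-path distances on a kNN graph whose edge lengths are inflated
    (Euclidean length raised to the power rho), so that paths through sparse,
    low-density regions become comparatively longer."""
    D = cdist(X, X)                           # pairwise Euclidean distances
    n = len(X)
    G = np.full((n, n), np.inf)               # inf = no edge (csgraph convention)
    nn = np.argsort(D, axis=1)[:, 1:k + 1]    # indices of the k nearest neighbours
    for i in range(n):
        G[i, nn[i]] = D[i, nn[i]] ** rho      # "blow up" long (low-density) edges
    G = np.minimum(G, G.T)                    # symmetrize the graph
    return shortest_path(G, method="D", directed=False)

The resulting distance matrix can then be fed into a kernel construction or into multidimensional scaling, as discussed in section 16.4.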

16.1 A simple geometric intuition


16.2 Theoretical basis for (re)defining distances

16.2.1 Metric space

16.2.2 Path metrics

16.2.3 Riemannian metrics

16.3 Defining density-sensitive distances

16.3.1 General Riemannian metrics based on an actual density estimate

16.3.2 Approximations used in shortest path computations

16.3.3 Other simple data-dependent path metrics proposed in the

literature

16.3.4 Illustration of the resulting geometric transformations

16.4 Using modified distances for semi-supervised learning

16.4.1 Forming kernels from the modified distances

16.4.2 Preprocessing through Multidimensional Scaling to map the

data to a new space

16.4.3 Experiments showing improved classification accuracy


17 Data-Dependent Regularization

Tommi Jaakkola

Adrian Corduneanu


V Semi-Supervised Learning in Practice


18 Large-Scale Algorithms

Olivier Delalleau

Yoshua Bengio

Nicolas Le Roux

In chapter 10, it is shown how a number of graph-based semi-supervised learning

algorithms can be seen as the minimization of a specific cost function, leading to

a linear system with n equations and unknowns. Solving such a linear system will

in general require on the order of O(kn²) time and O(kn) memory (for a sparse

graph where each data point has k neighbors), which can be prohibitive on large

datasets. We present in this chapter a subset selection algorithm that can be used

to reduce the original system to one of size m ≪ n. The idea is to solve for the

labels of a subset S ⊂ X of only m points, while still retaining information from

the rest of the data by approximating their label with a linear combination of the

labels in S (using the induction formula presented in chapter 10). This can give

an improvement of up to O(n²/m²) in time and O(n/m) in memory usage, thus

allowing one to take advantage of significantly bigger unlabeled datasets than with

the original algorithms.

18.1 Introduction


18.2 Cost approximation

18.2.1 Reducing the cost using a subset

18.2.2 Resolution

18.2.3 Computational issues

18.3 Subset selection

18.3.1 Random selection

18.3.2 Smart data sampling

18.4 Discussion


19 Semi-Supervised Protein Classification using

Cluster Kernels

Jason Weston

Protein classification is a classic case of the semi-supervised learning problem. Biologists have built huge databases containing hundreds of thousands of proteins (amino acid sequences), but the 3D structure and structural class (label) of only a few thousand of these proteins are actually known. Kernel methods have proven

effective at the problem of protein classification (classifying proteins into known

families and superfamilies) by defining similarity metrics between sequences.

In this chapter, we describe simple and scalable cluster kernel techniques for

incorporating unlabeled data into the representation of protein sequences. We show

that these methods greatly improve the classification performance of standard string

kernels and outperform naive approaches for using unlabeled data, such as adding

close homologs of the positive examples to the training data. These methods also

achieve equal or superior performance to previously presented cluster kernel methods

while achieving far greater computational efficiency.

19.1 Introduction

19.2 Representations and kernels for protein sequences

19.3 Cluster kernels for protein sequences

(review section, this should mostly refer to elsewhere in the book I guess)

19.4 The neighborhood mismatch kernel


19.5 The bagged mismatch kernel

19.6 The profile kernel

19.7 Experiments

19.7.1 Transductive Setting

19.7.2 Semi-supervised setting

19.8 Discussion


20 Prediction of Protein Function from Networks

Hyunjung Shin

Koji Tsuda

Bernhard Scholkopf

In some application areas, it is natural to represent domain knowledge using graphs.

Frequently there exist multiple graphs for the same set of nodes, representing infor-

mation from different sources, and no single graph is sufficient to predict class labels

of unlabelled nodes reliably. One way to enhance reliability is to integrate multiple

graphs, since individual graphs are partly independent and partly complementary

to each other for prediction. In this chapter, we propose a novel algorithm to as-

sign weights to multiple graphs within graph-based semi-supervised learning. Both predicting class labels and searching for the combining weights are formulated as one

convex optimization problem. The proposed method is applied to functional class

prediction of yeast proteins. When compared with individual graphs, the combined

graph with optimized weights performs significantly better than any single graph.

When compared with the state-of-the-art kernel method, a semi-definite programming based support vector machine method (SDP/SVM), it shows comparable accuracy in a remarkably short time. Compared with fixed-weight integration (without weight optimization), it selects important graphs over redundant ones, which

implies the desirable property of integration with selectivity.

20.1 Introduction


20.2 Graph-based Semi-Supervised Learning

20.3 Combining Multiple Graphs

20.3.1 Regularized Version

20.3.2 Optimization

20.4 Experiments on Function Prediction of Proteins

20.4.1 Comparison with Individual Graphs

20.4.2 Comparison with SDP/SVM

20.4.3 Comparison with Fixed Weight Integration

20.5 Conclusion and Outlook


VI Perspectives


21 An Augmented PAC Model for Semi-

Supervised Learning

Maria-Florina Balcan

Avrim Blum

The standard PAC-learning model has proven to be a useful theoretical framework

for thinking about the problem of supervised learning. However, it does not tend

to capture many of the assumptions underlying semi-supervised learning methods.

In this chapter we describe an augmented version of the PAC model designed with

semi-supervised learning in mind, which can be used to help think about the problem of

learning from labeled and unlabeled data and many of the different approaches taken.

The model provides a unified framework for analyzing when and why unlabeled data

can help, in which one can discuss both sample-complexity and algorithmic issues.

Our model can be viewed as an extension of the standard PAC model, where in

addition to a concept class C, one also proposes a compatibility function: a type of

compatibility that one believes the target concept should have with the underlying

distribution of data. For example, it could be that one believes the target should

cut through a low-density region of space, or that it should be self-consistent in

some way as in co-training. This belief is then explicitly represented in the model.

Unlabeled data is then potentially helpful in this setting because it allows one to

estimate compatibility over the space of hypotheses, and to reduce the size of the

search space from the whole set of hypotheses C down to those that, according to

one’s assumptions, are a-priori reasonable with respect to the distribution.

After proposing the model, we then analyze sample-complexity issues in this

setting: that is, how much of each type of data one should expect to need in order

to learn well, and what are the basic quantities that these numbers depend on. We

provide examples of sample-complexity bounds both for uniform convergence and

ε-cover based algorithms, as well as several algorithmic results.


21.1 Introduction

As we have already seen in the previous chapters, there has been growing interest in

practice in using unlabeled data together with labeled data in machine learning, and

a number of different approaches have been developed. However, the assumptions

these methods are based on are often quite distinct and not captured by standard

theoretical models.

One difficulty from a theoretical point of view is that standard discriminative

learning models do not really capture how and why unlabeled data can be of

help. In particular, in the PAC model there is purposefully a complete disconnect

between the data distribution D and the target function f being learned [Valiant,

1984, Blumer et al., 1989, Kearns and Vazirani, 1994]. The only prior belief is

that f belongs to some class C: even if D is known fully, any function f ∈ C

is still possible. For instance, it is perfectly natural (and common) to talk about

the problem of learning a concept class such as DNF formulas [Linial et al., 1989,

Verbeurgt, 1990] or an intersection of halfspaces [Baum, 1990, Blum and Kannan,

1997, Vempala, 1997, Klivans et al., 2002] over the uniform distribution; but clearly

in this case unlabeled data is useless — you can just generate it yourself. For learning

over an unknown distribution (the standard PAC setting), unlabeled data can help

somewhat, by allowing one to use distribution-specific sample-complexity bounds,

but this does not seem to fully capture the power of unlabeled data in practice.

In generative-model settings, one can easily talk theoretically about the use

of unlabeled data, e.g., [Castelli and Cover, 1995, 1996]. However, these results

typically make strong assumptions that essentially imply that there is only one

natural distinction to be made for a given (unlabeled) data distribution. For

instance, a typical generative-model setting would be that we assume positive

examples are generated by one Gaussian, and negative examples are generated by

another Gaussian. In this case, given enough unlabeled data, we could recover the

Gaussians and would need labeled data only to tell us which Gaussian is the positive

one and which is the negative one.1 This is too strong an assumption for most real-

world settings. Instead, we would like our model to allow for a distribution over

data (e.g., documents we want to classify) where there are a number of plausible

distinctions we might want to make.2 In addition, we would like a general framework

that can be used to model many different uses of unlabeled data.

We present here a PAC-style framework that bridges between these positions

and we believe can be used to help think about many of the ways unlabeled data

1. Castelli and Cover [1995, 1996] do not assume Gaussians in particular, but they do assume the distributions are distinguishable, which from our perspective has the same issue.
2. In fact, there has been recent work in the generative model setting on the practical side that goes in this direction (see [Nigam et al., 2000, Nigam, 2001]). We discuss connections to generative models further in Section 21.5.2.


is typically used, including approaches discussed in other chapters. This framework

extends the PAC model in a way that allows one to express not only the form

of target function one is considering, but also relationships that one hopes the

target function and underlying distribution will possess. We then analyze sample-

complexity issues in this setting: that is, how much of each type of data one should

expect to need in order to learn well, and also give examples of algorithmic results

in this model.

Specifically, the idea of the proposed model is to augment the PAC notion of a

concept class, which is a set of functions (like linear separators or decision trees),

with a notion of compatibility between a function and the data distribution that we

hope the target function will satisfy. Then, rather than talking of “learning a concept class C,” we will talk of “learning a concept class C under compatibility notion χ.”

For example, suppose we believe there should exist a good linear separator, and that

furthermore, if the data happens to cluster, then this separator probably does not

slice through the middle of any such clusters. Then we would want a compatibility

notion that penalizes functions that do, in fact, slice through clusters. In this

framework, the extent to which unlabeled data helps depends on two quantities:

first, the extent to which the true target function satisfies the given assumption,

and second, the extent to which the distribution allows this assumption to rule

out alternative hypotheses. For instance, if the data does not cluster at all, then

all functions equally satisfy this compatibility notion and the assumption ends up

not helping. From a Bayesian perspective, one can think of this as a PAC model

for a setting in which one’s prior is not just over functions, but also over how the

function and underlying distribution relate to each other.

To make our model formal, we will need to ensure that the degree of compatibility

be something that can be estimated from a finite sample. To do this, we will require

that the compatibility notion χ actually be a function from C × X to [0, 1], where

the compatibility of a function f with the data distribution D is Ex∼D[χ(f, x)].

The degree of incompatibility is then something we can think of as a kind of

“unlabeled error rate” that measures how a-priori unreasonable we believe some

proposed hypothesis to be. For instance, in the example above of a “margin-style”

compatibility, we could define χ(f, x) to be an increasing function of the distance of

x to the separator f . In this case, the unlabeled error rate, 1−χ(f,D), is a measure

of the probability mass close to the proposed separator. In co-training, where each

example x has two “views” (x = 〈x1, x2〉), the underlying belief is that the true

target c∗ can be decomposed into functions 〈c∗1, c∗2〉 over each view such that for

most examples, c∗1(x1) = c∗2(x2). In this case, we can define χ(〈f1, f2〉, 〈x1, x2〉) = 1

if f1(x1) = f2(x2), and 0 if f1(x1) ≠ f2(x2). Then the compatibility of a hypothesis

〈f1, f2〉 with an underlying distribution D is Pr〈x1,x2〉∼D[f1(x1) = f2(x2)].
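To make these two running examples concrete, here is a small sketch (the separator, sample, and views are hypothetical and purely illustrative) of how such compatibility functions are evaluated and how the unlabeled error rate is estimated from a sample:

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear separator f(x) = sign(w . x) in R^2 (illustrative only).
w = np.array([1.0, -1.0])

def chi_margin(x, gamma=0.5):
    """Margin-style compatibility: 1 if x lies farther than gamma from the
    hyperplane w . x = 0, else 0 (the hard-threshold variant of Example 1)."""
    return float(abs(x @ w) / np.linalg.norm(w) > gamma)

def chi_cotrain(f1, f2, x1, x2):
    """Co-training compatibility: 1 if the two views agree, else 0."""
    return float(f1(x1) == f2(x2))

# Empirical compatibility chi(f, S): the average of chi(f, x) over an unlabeled
# sample S; the "unlabeled error rate" is 1 minus this average.
S = rng.normal(size=(1000, 2))
chi_hat = np.mean([chi_margin(x) for x in S])
print("estimated compatibility:", chi_hat, "unlabeled error rate:", 1.0 - chi_hat)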

This setup allows us to analyze the ability of a finite unlabeled sample to reduce

our dependence on labeled examples, as a function of the compatibility of the target

function (i.e., how correct we were in our assumption) and various measures of the

“helpfulness” of the distribution. In particular, in our model, we find that unlabeled

data can help in several distinct ways.


If the target function is highly compatible with D, then if we have enough

unlabeled data to estimate compatibility over all f ∈ C, we can in principle reduce

the size of the search space from C down to just those f ∈ C whose estimated

compatibility is high. For instance, if D is “helpful”, then the set of such functions

will be much smaller than the entire set C.

By providing an estimate of D, unlabeled data can allow us to use a more refined distribution-specific notion of “hypothesis space size” such as Annealed VC-entropy

[Devroye et al., 1996], Rademacher complexities [Koltchinskii, 2001, Bartlett and

Mendelson, 2002, Boucheron et al., 2004] or the size of the smallest ε-cover [Benedek

and Itai, 1991], rather than VC-dimension [Blumer et al., 1989, Kearns and Vazirani,

1994]. In fact, for natural cases (such as those above) we find that the sense in

which unlabeled data reduces the “size” of the search space is best described in

these distribution-specific measures.

Finally, if the distribution is especially nice, we may find that not only does the

set of “compatible” f ∈ C have a small ε-cover, but also the elements of the cover

are far apart. In that case, if we assume the target function is fully compatible, we

may be able to learn from even fewer labeled examples than the 1/ε needed just to

verify a good hypothesis! (Though here D is effectively committing to the target

as in generative models.)

Our framework also allows us to address the issue of how much unlabeled data

we should expect to need. Roughly, the “VCdim/ε²” form of standard PAC sample

complexity bounds now becomes a bound on the number of unlabeled examples we

need. However, technically, the set whose VC-dimension we now care about is not

C but rather a set defined by both C and χ: that is, the overall complexity depends

both on the complexity of C and the complexity of the notion of compatibility (see

Section 21.3.1.2). One consequence of our model is that if the target function and

data distribution are both well behaved with respect to the compatibility notion,

then the sample-size bounds we get for labeled data can substantially beat what

one could hope to achieve through pure labeled-data bounds, and we illustrate this

with a number of examples through the chapter.

21.2 A Formal Framework

In this section we formally introduce what we mean by a notion of compatibility,

and illustrate it through a number of examples including margins and co-training.

We assume that examples (both labeled and unlabeled) come according to a

fixed unknown distribution D over an instance space X, and they are labeled

by some unknown target function c∗. As in the standard PAC model, a concept

class or hypothesis space is a set of functions over the instance space X, and we

will often make the assumption (the “realizable case”) that the target function

belongs to a given class C. For a given hypothesis f , the (true) error rate of f is

defined as err(f) = errD(f) = Prx∼D[f(x) ≠ c∗(x)]. For any two hypotheses


f1, f2 ∈ C, the distance with respect to D between f1 and f2 is defined as d(f1, f2) = dD(f1, f2) = Prx∼D[f1(x) ≠ f2(x)]. We will use $\widehat{err}(f)$ to denote the empirical error rate of f on a given labeled sample and $\widehat{d}(f_1, f_2)$ to denote the empirical distance between f1 and f2 on a given unlabeled sample.

We define a notion of compatibility to be a mapping from a hypothesis f and a

distribution D to [0, 1] indicating how “compatible” f is with D. In order for this to

be estimable from a finite sample, we require that compatibility be an expectation

over individual examples. (Though one could imagine more general notions with

this property as well.) Specifically, we define:

Definition 21.1 A legal notion of compatibility is a function χ : C × X → [0, 1]

where we (overloading notation) define χ(f,D) = Ex∼D[χ(f, x)]. Given a sample

S, we define χ(f, S) to be the empirical average over the sample.

Remark 21.2 One could also allow compatibility functions over k-tuples of exam-

ples, in which case our (unlabeled) sample-complexity bounds would simply increase

by a factor of k. For settings in which D is actually known in advance (e.g., trans-

ductive learning, see Section 21.5.1) we can drop this requirement entirely and allow

any notion of compatibility χ(f,D) to be legal.

Definition 21.3 Given compatibility notion χ, the incompatibility of f with D is

1 − χ(f,D). We will also call this its unlabeled error rate, errunl(f), when χ and

D are clear from context. For a given sample S, we use $\widehat{err}_{unl}(f)$ to denote the

empirical average over S.

Finally, we need a notation for the set of functions whose incompatibility is at

most some given value τ .

Definition 21.4 Given threshold τ, we define $C_{D,\chi}(\tau) = \{f \in C : err_{unl}(f) \le \tau\}$. So, e.g., $C_{D,\chi}(1) = C$. Similarly, for a sample S, we define $C_{S,\chi}(\tau) = \{f \in C : \widehat{err}_{unl}(f) \le \tau\}$.

We now illustrate our compatibility notion through a number of examples:

Example 1. Suppose examples are points in Rd and C is the class of linear

separators. A natural belief in this setting is that data should be “well-separated”:

not only should the target function separate the positive and negative examples,

but it should do so by some reasonable margin γ. This is the assumption used by

Transductive SVM (see [Joachims, 1999] and also Chapter 6 in this book). In this

case, if we are given γ up front, we could define χ(f, x) = 1 if x is farther than

distance γ from the hyperplane defined by f , and χ(f, x) = 0 otherwise. So, the

incompatibility of f with D is the probability mass within distance γ of f · x = 0. Or we

could define χ(f, x) to be a smooth function of the distance of x to the separator,

if we do not want to commit to a specific γ in advance. (In contrast, defining

compatibility of a hypothesis based on the largest γ such that D has probability

mass exactly zero within distance γ of the separator would not fit our model: it


cannot be written as an expectation over individual examples and indeed would

not be a good definition since one cannot distinguish “zero” from “exponentially

close to zero” from a small sample of unlabeled data).

Example 2. In co-training [Blum and Mitchell, 1998], we assume examples come

as pairs 〈x1, x2〉, and our goal is to learn a pair of functions 〈f1, f2〉. For instance,

if our goal is to classify web pages, x1 might represent the words on the page itself

and x2 the words attached to links pointing to this page from other pages. The

hope that underlies co-training is that the two parts of the example are consistent,

which then allows the co-training algorithm to bootstrap from unlabeled data. For

example, iterative co-training uses a small amount of labeled data to get some initial

information (e.g., if a link with the words “my advisor” points to a page then that

page is probably a faculty member’s home page) and then when it finds an unlabeled

example where one half is confident (e.g., the link says “my advisor”), it uses that to

label the example for training its hypothesis over the other half. This approach and

several variants have been used for a variety of learning problems, including named

entity classification [Collins and Singer, 1999], text classification [Nigam and Ghani,

2000, Ghani, 2001], natural language processing [Pierce and Cardie, 2001], large

scale document classification [Park and Zhang, 2003], and visual detectors [Levin

et al., 2003].3 As mentioned in Section 21.1, the assumptions underlying co-training

fit naturally into our framework. In particular, we can define the incompatibility of

some hypothesis 〈f1, f2〉 with distribution D as Pr〈x1,x2〉∼D[f1(x1) ≠ f2(x2)].

Example 3. In transductive graph-based methods, we are given a set of unlabeled

examples connected in a graph g, where the interpretation of an edge is that we

believe the two endpoints of the edge should have the same label. Given a few

labeled vertices, various graph-based methods then attempt to use them to infer

labels for the remaining points. If we are willing to view D as a distribution over edges (a uniform distribution if g is unweighted), then as in co-training we can

define the incompatibility of some hypothesis f as the probability mass of edges

that are cut by f , which then motivates various cut-based algorithms. For instance,

if we require f to be boolean, then the mincut method of Blum and Chawla [2001]

finds the most-compatible hypothesis consistent with the labeled data; if we allow f

to be fractional and define 1 − χ(f, 〈x1, x2〉) = (f(x1) − f(x2))², then the algorithm

of Zhu et al. [2003a] finds the most-compatible consistent hypothesis. If we do not

wish to view D as a distribution over edges, we could have D be a distribution over

vertices and broaden Definition 21.1 to allow for χ to be a function over pairs of

examples. In fact, as mentioned in Remark 21.2, since we have perfect knowledge

of D in this setting we can allow any compatibility function χ(f,D) to be legal. We

discuss more connections with graph-based methods in Section 21.5.1.
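As a tiny illustration of the cut-based incompatibility in this example (the toy graph and labeling below are made up):

import numpy as np

# Fraction of graph edges cut by a boolean labeling f -- the graph-based
# incompatibility of Example 3, with D viewed as uniform over edges.
# 'edges' is a list of (i, j) pairs; 'f' assigns a 0/1 label to each vertex.
def incompatibility(edges, f):
    f = np.asarray(f)
    cut = sum(f[i] != f[j] for i, j in edges)
    return cut / len(edges)

# Toy graph: two triangles joined by a single bridge edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
print(incompatibility(edges, [0, 0, 0, 1, 1, 1]))  # 1/7: only the bridge is cut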

Example 4. As a special case of co-training, suppose examples are pairs of points

in Rd, C is the class of linear separators, and we believe the two points in each

3. For more discussion regarding co-training see also Chapter 2 in this book.


pair should both be on the same side of the target function. (So, this is a version

of co-training where we require f1 = f2.) The motivation is that we want to use

pairwise information as in Example 3, but we also want to use the features of

each data point. For instance, in the word-sense disambiguation problem studied

by Yarowsky [1995], the goal is to determine which of several dictionary definitions

is intended for some target word in a piece of text (e.g., is “plant” being used to

indicate a tree or a factory?). The local context around each word can be viewed as placing it into Rd, but the edges correspond to a completely different type of

information: the belief that if a word appears twice in the same document, it is

probably being used in the same sense both times. In this setting, we could use the

same compatibility function as in Example 3, but rather than having the concept

class C be all possible functions, we reduce C to just linear separators.

Example 5. In a related setting to co-training, considered by Leskes [2005], ex-

amples are single points in X but we have a pair of hypothesis spaces 〈C1, C2〉 (or more generally a k-tuple 〈C1, . . . , Ck〉), and the goal is to find a pair of hypotheses

〈f1, f2〉 ∈ C1 × C2 with low error over labeled data and that agree over the distri-

bution. For instance, if data is sufficiently “well-separated”, one might expect there

to exist both a good linear separator and a good decision tree, and one would like

to use this assumption to reduce the need for labeled data. In this case one could

define compatibility of 〈f1, f2〉 with D as Prx∼D[f1(x) = f2(x)], or the similar

notion given in [Leskes, 2005].

21.3 Sample Complexity Results

We now present several sample-complexity bounds that fall out of this framework,

showing how unlabeled data, together with a suitable compatibility notion, can

reduce the need for labeled examples.

The basic structure of all of these results is as follows. First, given enough unla-

beled data (where “enough” will be a function of some measure of the complexity

of C and possibly of χ as well), we can uniformly estimate the true compatibilities

of all functions in C by their empirical compatibilities over the sample. Then, by

using this quantity to give a preference ordering over the functions in C, we can re-

duce “C” down to “the set of functions in C whose unlabeled error rate is not much larger than that of the true target function” in bounds for the number of labeled examples needed

for learning. The specific bounds differ in terms of the exact complexity measures

used (and a few other issues such as stratification and realizability) and we provide

examples illustrating when certain complexity measures can be significantly more

powerful than others. In particular, ε-cover bounds (Section 21.3.2) can provide

especially good bounds for co-training and graph-based settings.


21.3.1 Uniform Convergence Bounds

We begin with uniform convergence bounds (later in Section 21.3.2 we give tighter

ε-cover bounds that apply to algorithms of a particular form). For clarity, we begin

with the case of finite hypothesis spaces where we measure the “size” of a set

of functions by just the number of functions in the set. We then discuss several

issues that arise when considering infinite hypothesis spaces, such as what is an

appropriate measure for the “size” of the set of compatible functions, and the

need to account for the complexity of the compatibility notion itself. Note that in

the standard PAC model, one typically talks of either the realizable case, where we

assume that c∗ ∈ C, or the agnostic case where we do not (see [Kearns and Vazirani,

1994]). In our setting, we have the additional issue of unlabeled error rate, and can

either make an a-priori assumption that the target function’s unlabeled error is low,

or else aim for a more “Occam-style” bound in which we have a stream of labeled

examples and halt once they are sufficient to justify the hypothesis produced.

21.3.1.1 Finite hypothesis spaces

We first give a bound for the “doubly realizable” case.

Theorem 21.5 If we see $m_u$ unlabeled examples and $m_l$ labeled examples, where

$$m_u \ge \frac{1}{\epsilon}\Big[\ln|C| + \ln\frac{2}{\delta}\Big] \qquad \text{and} \qquad m_l \ge \frac{1}{\epsilon}\Big[\ln|C_{D,\chi}(\epsilon)| + \ln\frac{2}{\delta}\Big],$$

then with probability at least $1-\delta$, all $f \in C$ with $\widehat{err}(f) = 0$ and $\widehat{err}_{unl}(f) = 0$ have $err(f) \le \epsilon$.

Proof The probability that a given hypothesis f with $err_{unl}(f) > \epsilon$ has $\widehat{err}_{unl}(f) = 0$ is at most $(1-\epsilon)^{m_u} < \delta/(2|C|)$ for the given value of $m_u$. Therefore, by the union bound, the number of unlabeled examples is sufficient to ensure that with probability $1 - \delta/2$, only hypotheses in $C_{D,\chi}(\epsilon)$ have $\widehat{err}_{unl}(f) = 0$. The

number of labeled examples then similarly ensures that with probability 1 − δ/2,

none of those whose true error is at least ε have an empirical error of 0, yielding

the theorem.

So, if the target function indeed is perfectly correct and compatible, then Theo-

rem 21.5 gives sufficient conditions on the number of examples needed to ensure

that an algorithm that optimizes both quantities over the observed data will, in

fact, achieve a PAC guarantee. To emphasize this, we will say that an algorithm

efficiently PACunl-learns the pair (C, χ) if it is able to achieve a PAC guarantee

using time and sample sizes polynomial in the bounds of Theorem 21.5.
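For a rough sense of the scale involved, here is an illustrative instantiation of the two bounds with made-up numbers (a finite class of 2^30 hypotheses, of which only 2^10 are ε-compatible with D):

import math

eps, delta = 0.1, 0.01
size_C, size_C_compat = 2**30, 2**10     # |C| and |C_{D,chi}(eps)| (hypothetical)

m_u = (math.log(size_C) + math.log(2 / delta)) / eps          # unlabeled examples
m_l = (math.log(size_C_compat) + math.log(2 / delta)) / eps   # labeled examples
print(f"m_u >= {m_u:.0f}   m_l >= {m_l:.0f}")   # roughly 261 unlabeled, 122 labeled
# Without the compatibility assumption, the labeled-sample bound would use ln|C|
# instead of ln|C_{D,chi}(eps)|, i.e. roughly as many labeled examples as m_u.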

We can think of Theorem 21.5 as bounding the number of labeled examples we

need as a function of the “helpfulness” of the distribution D with respect to our

notion of compatibility. That is, in our context, a helpful distribution is one in

which CD,χ(ε) is small, and so we do not need much labeled data to identify a good


function among them. We can get a similar bound in the situation when the target

function is not fully compatible:

Theorem 21.6 Given $t \in [0, 1]$, if we see $m_u$ unlabeled examples and $m_l$ labeled examples, where

$$m_u \ge \frac{2}{\epsilon^2}\Big[\ln|C| + \ln\frac{4}{\delta}\Big] \qquad \text{and} \qquad m_l \ge \frac{1}{\epsilon}\Big[\ln|C_{D,\chi}(t+2\epsilon)| + \ln\frac{2}{\delta}\Big],$$

then with probability at least $1-\delta$, all $f \in C$ with $\widehat{err}(f) = 0$ and $\widehat{err}_{unl}(f) \le t + \epsilon$ have $err(f) \le \epsilon$, and furthermore all $f \in C$ with $err_{unl}(f) \le t$ have $\widehat{err}_{unl}(f) \le t + \epsilon$.

In particular, this implies that if $err_{unl}(c^*) \le t$ and $err(c^*) = 0$ then with high probability the $f \in C$ that optimizes $\widehat{err}(f)$ and $\widehat{err}_{unl}(f)$ has $err(f) \le \epsilon$.

Proof Same as Theorem 21.5 except apply Hoeffding bounds (see Devroye et al.

[1996]) to the unlabeled error rates.

Finally, we give a simple Occam/luckiness type of bound for this setting. Given

a sample S, let us define $desc_S(f) = \ln|C_{S,\chi}(\widehat{err}_{unl}(f))|$. That is, $desc_S(f)$ is the description length of f (in “nats”) if we sort hypotheses by their empirical compatibility and output the index of f in this ordering. Similarly, define $\epsilon\text{-}desc_D(f) = \ln|C_{D,\chi}(err_{unl}(f) + \epsilon)|$. This is an upper bound on the description length of f if we sort hypotheses by an ε-approximation to their true compatibility. Then we can get a bound as follows:

Theorem 21.7 For any set S of unlabeled data, given $m_l$ labeled examples, with probability at least $1-\delta$, all $f \in C$ satisfying $\widehat{err}(f) = 0$ and $desc_S(f) \le \epsilon m_l - \ln(1/\delta)$ have $err(f) \le \epsilon$. Furthermore, if $|S| \ge \frac{2}{\epsilon^2}\big[\ln|C| + \ln\frac{2}{\delta}\big]$, then with probability at least $1-\delta$, all $f \in C$ satisfy $desc_S(f) \le \epsilon\text{-}desc_D(f)$.

The point of this theorem is that an algorithm can use observable quantities to

determine if it can be confident. Furthermore, if we have enough unlabeled data,

the observable quantities will be no worse than if we were learning a slightly less

compatible function using an infinite-size unlabeled sample.

Note that if we begin with a non-distribution-dependent ordering of hypotheses,

inducing some description length desc(f), and our compatibility assumptions turn

out to be wrong, then it could well be that descD(c∗) > desc(c∗). In this case our

use of unlabeled data would end up hurting rather than helping.

21.3.1.2 Infinite hypothesis spaces

To reduce notation, we will assume in the rest of this chapter that χ(f, x) ∈ {0, 1}

so that χ(f,D) = Prx∼D[χ(f, x) = 1]. However, all our sample complexity results

can be easily extended to the general case.


For infinite hypothesis spaces, the first issue that arises is that in order to achieve

uniform convergence of unlabeled error rates, the set whose complexity we care

about is not C but rather χ(C) = {χf : f ∈ C}, where we define χf(x) = χ(f, x). For instance, suppose examples are just points on the line, and C = {fa : fa(x) = 1 iff x ≤ a}. In this case, VCdim(C) = 1. However, we could imagine a compatibility

function such that χ(fa, x) depends on some complicated relationship between the

real numbers a and x. In this case, VCdim(χ(C)) is much larger, and indeed we

would need many more unlabeled examples to estimate compatibility over all of C.

A second issue is that we need an appropriate measure for the “size” of the set

of surviving functions. VC-dimension tends not to be a good choice: for instance,

if we consider the case of Example 1 (margins), then even if data is concentrated

in two well-separated “blobs”, the set of compatible separators still has as large a

VC-dimension as the entire class even though they are all very similar with respect

to D. Instead, it is better to consider distribution dependent complexity measures

such as annealed VC-entropy and Rademacher averages. We give in the following several examples of bounds of this type. For this we introduce some notation.

Specifically, for any C, we denote by C[m,D] the expected number of splits of m

points (drawn i.i.d.) from D with concepts in C. Also, for a given (fixed) S ⊆ X,

we will denote by S̄ the uniform distribution over S, and by C[m, S̄] the expected number of splits of m points (drawn i.i.d.) from S̄ with concepts in C. Also, for a given sample S of n points from D, we denote by Rn(C)[S] the conditional Rademacher average of C given S. Then we can get bounds as follows:

Theorem 21.8 An unlabeled sample of size

mu = O( (VCdim(χ(C))/ε²) log(1/ε) + (1/ε²) log(2/δ) )

and a labeled sample of size

ml > (2/ε) [log(2s) + log(2/δ)], where s = CD,χ(t + 2ε)[2ml, D],

is sufficient so that with probability 1 − δ, all f ∈ C with err(f) = 0 and errunl(f) ≤ t + ε have err(f) ≤ ε, and furthermore the empirical and true unlabeled error rates of every f ∈ C differ by at most ε.

This is the analog of Theorem 21.6 for the infinite case. In particular, this implies that if err(c∗) = 0 and errunl(c∗) ≤ t, then with high probability the f ∈ C that optimizes err(f) and errunl(f) on the samples has err(f) ≤ ε.

Proof Sketch: By standard VC-bounds [Devroye et al., 1996, Vapnik, 1998], the

number of unlabeled examples is sufficient to ensure that with probability 1−δ/2 we

can estimate, within ε, Prx∈D[χf (x) = 1] for all χf ∈ χ(C). Since χf (x) = χ(f, x),

this implies we can estimate, within ε, the unlabeled error rate errunl(f) for

all f ∈ C, and so the set of hypotheses with errunl(f) ≤ t + ε is contained in

CD,χ(t + 2ε).

The bound on the number of labeled examples follows from [Devroye et al., 1996]

(where it is shown that the expected number of partitions can be used instead of


the maximum in the standard VC proof). This bound ensures that with probability

1− δ/2, none of the functions in CD,χ(t + 2ε) whose true (labeled) error is at least

ε have an empirical (labeled) error of 0.

We can also give a bound where we specify the number of labeled examples as a

function of the unlabeled sample; this is useful because we can imagine our learning

algorithm performing some calculations over the unlabeled data and then deciding

how many labeled examples to purchase.

Theorem 21.9 Given t ≥ 0, an unlabeled sample S of size

O( (max[VCdim(C), VCdim(χ(C))]/ε²) log(1/ε) + (1/ε²) log(2/δ) )

is sufficient so that if we label ml examples drawn uniformly at random from S, where

ml > (4/ε) [log(2s) + log(2/δ)] and s = CS,χ(t + ε)[2ml, S̄],

then with probability ≥ 1 − δ, all f ∈ C with err(f) = 0 and errunl(f) ≤ t + ε have err(f) ≤ ε. Furthermore, the empirical and true unlabeled error rates of every f ∈ C differ by at most ε.

Proof Standard VC-bounds (in the same form as for Theorem 21.8) imply that

the number of labeled examples ml is sufficient to guarantee the conclusion of the

theorem with “err(f)” replaced by “errS(f)” (the error with respect to S) and “ε”

replaced with “ε/2”. The number of unlabeled examples is enough to ensure that,

with probability ≥ 1−δ/2, for all f ∈ C, |err(f)−errS(f)| ≤ ε/2. Combining these

two statements yields the theorem.

So, if err(c∗) = 0 and errunl(c∗) ≤ t, then with high probability the f ∈ C that

optimizes err(f) and errunl(f) has err(f) ≤ ε. If we assume errunl(c∗) = 0 then

we can use CS,χ(0) instead of CS,χ(t + ε).

Notice that for the case of Example 1, in the worst case (over distributions D)

this will essentially recover the standard margin sample-complexity bounds. In

particular, CS,χ(0) contains only those separators that split S with margin ≥ γ,

and therefore, s is no greater than the maximum number of ways of splitting 2ml

points with margin γ. However, if the distribution is nice, then the bounds can be

much better because there may be many fewer ways of splitting S with margin γ.

For instance, in the case of two well-separated “blobs” discussed above, if S is large

enough, we would have just s = 4.

We can finally give a stratified version of Theorem 21.9 as follows:

Theorem 21.10 An unlabeled sample S of size

O( (max[VCdim(C), VCdim(χ(C))]/ε²) log(1/ε) + (1/ε²) log(2/δ) )

is sufficient so that with probability ≥ 1 − δ we have that, simultaneously for every k ≥ 0, the following is true: if we label mk examples drawn uniformly at random from S, where

mk > (4/ε) [log(2s) + log(2(k + 1)(k + 2)/δ)] and s = CS,χ((k + 1)ε)[2mk, S̄],

then all f ∈ C with err(f) = 0 and errunl(f) ≤ (k + 1)ε have err(f) ≤ ε.

This theorem is an analog of Theorem 21.7, and it essentially justifies a stratification based on the estimated unlabeled error rates. We can also imagine having data-dependent bounds for both labeled and unlabeled data, and also doing a double

stratification, with respect to both labeled and unlabeled error rates. In particular,

we can derive a bound as follows:

Theorem 21.11 An unlabeled sample S of size

O( (max[VCdim(C), VCdim(χ(C))]/ε²) log(1/ε) + (1/ε²) log(2/δ) )

is sufficient so that with probability ≥ 1 − δ we have that, simultaneously for every i ≥ 0 and k ≥ 0, the following is true: if we label mk,i examples drawn uniformly at random from S, where

mk,i > (8/ε²) [log(2s) + log(4(k + 1)(k + 2)(i + 1)(i + 2)/δ)] and s = CS,χ((k + 1)ε)[2mk,i, S̄],

then all f ∈ C with err(f) ≤ (i + 1)ε and errunl(f) ≤ (k + 1)ε have err(f) ≤ i · ε.

We can similarly derive a bound using Rademacher averages as follows.4

Theorem 21.12 An unlabeled sample S of size

O( (max[VCdim(C), VCdim(χ(C))]/ε²) log(1/ε) + (1/ε²) log(2/δ) )

is sufficient so that with probability ≥ 1 − δ we have that, simultaneously for every i ≥ 0 and k ≥ 0, the following is true: if the number of labeled examples mk,i satisfies

mk,i > (16/ε²) [ 2 Rmk(CS,χ((k + 1)ε))[Sk] + 2 log(8(k + 1)(k + 2)(i + 1)(i + 2)/δ) ],

then all f ∈ C with err(f) ≤ (i + 1)ε and errunl(f) ≤ (k + 1)ε have err(f) ≤ i · ε.

21.3.2 ε-Cover-based Bounds

The bounds in the previous section are for uniform convergence: they provide

guarantees for any algorithm that optimizes well on the observed data. In this

section, we consider stronger bounds based on ε-covers that can be obtained for

4. For different versions of these bounds using recent stronger bounds ([Boucheron et al., 2000], [Boucheron et al., 2004]), see [Balcan and Blum, 2005].


algorithms that behave in a specific way: they first use the unlabeled examples to

choose a “representative” set of compatible hypotheses, and then use the labeled

sample to choose among these. Bounds based on ε-covers exist in the classical PAC

setting, but in our framework bounds and algorithms of this type are especially

natural and convenient.

Recall that a set Cε ⊆ 2X is an ε-cover for C with respect to D if for every f ∈ C there is an f′ ∈ Cε which is ε-close to f; that is, Prx∼D(f(x) ≠ f′(x)) ≤ ε.

To illustrate how this can produce stronger bounds, consider the setting of

Example 3 (graph-based algorithms) where the graph g consists of two cliques of

n/2 vertices, connected together by o(n²) edges (in particular, the number of edges connecting the cliques is small compared to εn²). Suppose the target function labels

one of the cliques as positive and one as negative, and we define the incompatibility of a hypothesis to be the fraction of edges cut by it (so the target function indeed

has unlabeled error rate less than ε). Now, given any set SL of ml ≪ εn labeled examples, there is always a highly compatible hypothesis consistent with SL that just separates the positive points in SL from the entire rest of the graph: the number of edges cut will be at most n · ml ≪ εn². However, such a hypothesis clearly has

high true error since it is so unbalanced. So, we do not have uniform convergence.

On the other hand, the set of functions of unlabeled error rate less than ε/4 has a small ε-cover: in particular, any partition of g that cuts less than εn²/4 edges must be ε-close to (a) the all-positive function, (b) the all-negative function, (c) the target function c∗, or (d) the complement of the target function 1 − c∗. So, ε-cover bounds require many fewer labeled examples.5
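As a concrete illustration, the following short Python sketch (purely illustrative; the graph size, the number of cross edges, and the labeled set are arbitrary assumptions) builds two cliques joined by a few edges and compares the fraction of edges cut by the target labeling with the fraction cut by the unbalanced hypothesis that marks only the labeled positives as positive.

import itertools, random

n = 100                                  # total vertices; two cliques of n/2 each
pos = set(range(n // 2))                 # target: first clique positive
neg = set(range(n // 2, n))

edges = list(itertools.combinations(sorted(pos), 2)) + list(itertools.combinations(sorted(neg), 2))
random.seed(0)
edges += [(random.choice(tuple(pos)), random.choice(tuple(neg))) for _ in range(20)]  # few cross edges

def frac_cut(labels):
    # incompatibility = fraction of edges whose endpoints receive different labels
    return sum(1 for (u, v) in edges if labels[u] != labels[v]) / len(edges)

target = {v: (v in pos) for v in range(n)}
S_L = random.sample(sorted(pos), 3)              # a few labeled positive examples
unbalanced = {v: (v in S_L) for v in range(n)}   # labels only the labeled positives as positive

print("target cut fraction:    ", frac_cut(target))      # only the cross edges are cut
print("unbalanced cut fraction:", frac_cut(unbalanced))   # also small, yet its true error is about 1/2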

For another case where ε-cover bounds can beat uniform-convergence bounds,

imagine examples are pairs of points in {0, 1}^d, C is the class of linear separators,

and compatibility is determined by whether both points are on the same side

of the separator (i.e., the case of Example 4). Now suppose for simplicity that

the target function just splits the hypercube on the first coordinate, and the

distribution is uniform over pairs having the same first coordinate (so the target is

fully compatible). It is not hard to show that given polynomially many unlabeled

examples SU and (1/4) log d labeled examples SL, with high probability there will exist

high-error functions consistent with SL and compatible with SU .6 So, we do not

5. Effectively, ε-cover bounds allow one to rule out a hypothesis that, say, just separates the positive points in SL from the rest of the graph by noting that this hypothesis is very close (with respect to D) to the all-negative hypothesis, and that hypothesis has a high labeled-error rate.
6. Proof: Let V be the set of all variables that (a) appear in every positive example of SL and (b) appear in no negative example of SL. Over the draw of SL, each variable has a (1/2)^(2|SL|) = 1/√d chance of belonging to V, so with high probability V has size at least √d/2. Now, consider the hypothesis corresponding to the conjunction of all variables in V. This correctly classifies the examples in SL, and w.h.p. it classifies every other example in SU negative because each example in SU has only a 1/2^|V| chance of satisfying every variable in V, and the size of SU is much less than 2^|V|. So, this means it is compatible with SU and consistent with SL, even though its true error is high.


yet have uniform convergence. In contrast, the cover-size of the set of functions

compatible with SU is constant, so ε-cover based bounds allow learning from just

a constant number of labeled examples.

In particular, we can give an ε-cover based bound as follows.

Theorem 21.13 If t is an upper bound for errunl(c∗) and p is the size of a minimum ε-cover for CD,χ(t + 4ε), then using mu unlabeled examples and ml labeled examples for

mu = O( (VCdim(χ(C))/ε²) log(1/ε) + (1/ε²) log(2/δ) ) and ml = O( (1/ε) ln(p/δ) ),

we can with probability 1 − δ identify a hypothesis which is 10ε-close to c∗.

Proof Sketch: First, given the unlabeled sample SU , define Hε ⊆ C as follows: for

every labeling of SU that is consistent with some f in C, choose a hypothesis in

C for which errunl(f) is smallest among all the hypotheses corresponding to that

labeling. Next, we obtain Cε by eliminating from Hε those hypotheses f with the

property that errunl(f) > t + 3ε. We then apply a greedy procedure on Cε, and we

obtain Gε = {g1, . . . , gs}, as follows:

Initialize H^1_ε = Cε and i = 1.

1. Let gi = argmin_{f ∈ H^i_ε} errunl(f).

2. Using unlabeled data, determine H^{i+1}_ε by crossing out from H^i_ε those hypotheses f with the property that d(gi, f) < 3ε.

3. If H^{i+1}_ε = ∅ then set s = i and stop; else, increase i by 1 and go to step 1.
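A minimal Python sketch of this greedy step (illustrative only; the finite candidate pool, the unlabeled-error oracle, and the distance function are assumptions standing in for the quantities used in the proof):

def greedy_cover(C_eps, err_unl, dist, eps):
    # C_eps:   candidate hypotheses, already pruned to err_unl(f) <= t + 3*eps
    # err_unl: f -> estimated unlabeled error rate
    # dist:    (f, g) -> estimated disagreement d(f, g) on the unlabeled sample
    remaining = list(C_eps)     # H^1_eps
    G = []                      # g_1, ..., g_s
    while remaining:
        g = min(remaining, key=err_unl)                              # step 1
        G.append(g)
        remaining = [f for f in remaining if dist(g, f) >= 3 * eps]  # step 2: cross out close hypotheses
    return G                    # step 3: stop once nothing remains

Empirical risk minimization over the returned set, using the labeled sample, then completes the procedure as in the remainder of the proof.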

Our bound on mu is sufficient to ensure that, with probability ≥ 1 − δ/2, Hε is

an ε-cover of C, which implies that, with probability ≥ 1− δ/2, Cε is an ε-cover for

CD,χ(t). It is then possible to show Gε is, with probability ≥ 1− δ/2, a 5ε-cover for

CD,χ(t) of size at most p. The idea here is that by greedily creating a 3ε-cover of

Cε with respect to distribution SU , we are creating a 4ε-cover of Cε with respect to

D, which is a 5ε-cover of CD,χ(t) with respect to D. Furthermore, we are doing this

using no more functions than would a greedy 2ε-cover procedure for CD,χ(t + 4ε)

with respect to D, which is no more than the optimal ε-cover of CD,χ(t + 4ε).

Now to learn c∗ we use labeled data and we do empirical risk minimization on

Gε. By standard bounds (see for instance [Benedek and Itai, 1991]), the number of

labeled examples is enough to ensure that with probability ≥ 1− δ/2 the empirical

optimum hypothesis in Gε has true error at most 10ε. This implies that overall,

with probability ≥ 1 − δ, we find a hypothesis of error at most 10ε.

As an interesting case where unlabeled data helps substantially, consider a

co-training setting where the target c∗ is fully compatible and D satisfies the

conditional independence given the label property. As shown by Blum and Mitchell

[1998], one can boost any weak hypothesis from unlabeled data in this setting


(assuming one has enough labeled data to produce a weak hypothesis).7 We can show that, given enough unlabeled data, we can in fact learn from just a single labeled example. Specifically, it is possible to show that for any concept

classes C1 and C2, we have:

Theorem 21.14 Assume that err(c∗) = errunl(c∗) = 0 and D satisfies independence given the label. Then using mu unlabeled examples and ml labeled examples we can find a hypothesis that with probability 1 − δ has error at most ε, provided that

mu = O( (1/ε) · [ (VCdim(C1) + VCdim(C2)) · ln(1/ε) + ln(1/δ) ] )

and

ml = O( log_{1/ε}(1/δ) ).

Proof Sketch: For convenience we will show a bound with 6ε instead of ε, 3δ

instead of δ, and we will assume for simplicity the setting of Example 3, where

c∗ = c∗1 = c∗2 and also that D1 = D2 = D (the general case is handled similarly,

but just requires more notation). We first characterize the hypotheses with true

unlabeled error rate at most ε. Recall that χ(f,D) = Pr〈x1,x2〉∼D[f(x1) = f(x2)],

and for concreteness assume f predicts using x1 if f(x1) ≠ f(x2). Consider f ∈ C with errunl(f) ≤ ε, and let us define p− = Prx∈D[c∗(x) = 0], p+ = Prx∈D[c∗(x) = 1], and for i, j ∈ {0, 1} define pij = Prx∈D[f(x) = i, c∗(x) = j]. We clearly have err(f) = p10 + p01. From errunl(f) = Pr(x1,x2)∼D[f(x1) ≠ f(x2)] ≤ ε, using the independence given the label of D, we get 2p10p00/p− + 2p01p11/p+ ≤ ε. This implies that the almost-compatible hypothesis f must be of one of the following four types:

1. f is “close to c∗” or more exactly err(f) ≤ 2ε.

2. f is “close to the opposite of c∗” or more exactly err(f) ≥ 1 − 2ε.

3. f “predicts almost always negative” or more exactly p10 + p11 ≤ 3ε.

4. f “predicts almost always positive” or more exactly p01 + p00 ≤ 3ε.
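The inequality used just before this list follows directly from the conditional independence of the two views given the label; a short derivation, in the notation of this proof:

\begin{align*}
\mathrm{err}_{unl}(f) &= \Pr_{\langle x_1,x_2\rangle\sim D}\left[f(x_1)\neq f(x_2)\right]\\
&= p_-\cdot 2\,\Pr[f(x)=1\mid c^*(x)=0]\,\Pr[f(x)=0\mid c^*(x)=0] \;+\; p_+\cdot 2\,\Pr[f(x)=1\mid c^*(x)=1]\,\Pr[f(x)=0\mid c^*(x)=1]\\
&= p_-\cdot 2\,\frac{p_{10}}{p_-}\,\frac{p_{00}}{p_-} \;+\; p_+\cdot 2\,\frac{p_{01}}{p_+}\,\frac{p_{11}}{p_+}
\;=\; \frac{2\,p_{10}\,p_{00}}{p_-} + \frac{2\,p_{01}\,p_{11}}{p_+},
\end{align*}

which is at most ε by assumption.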

Now, consider f1 to be the constant positive function and f0 to be the constant negative function. The unlabeled sample SU is sufficient to ensure that, with probability ≥ 1 − δ, every hypothesis with zero estimated unlabeled error has true unlabeled error at most ε. Therefore, by our previous analysis, there are only four kinds of hypotheses consistent with unlabeled data: those close to c∗, those close to its complement ¬c∗, those close to f0, and those close to f1. Furthermore, c∗, ¬c∗, f0, and f1 are compatible with the unlabeled data.

We now check if there exists a hypothesis g ∈ C with errunl(g) = 0 such that

d(f1, g) ≥ 4ε and d(f0, g) ≥ 4ε. If such a hypothesis g exists, then we know that one of

7. Related sample complexity results are given in [Dasgupta et al., 2001]


g and ¬g, where ¬g is the opposite of g, is 2ε-close to c∗. If not, we must have p+ ≤ 6ε or p− ≤ 6ε, in which case we know that one of f0, f1 is 6ε-close to c∗. So, we have a set of two functions, opposite to each other, one of which is 6ε-close to c∗. We now use labeled data to pick one of these to output, using Lemma 21.15

below.

Lemma 21.15 Consider ε < 1/8. Let Cε = {f, ¬f} be a subset of C containing two opposite hypotheses with the property that one of them is ε-close to c∗. Then ml > 6 log_{1/ε}(1/δ) labeled examples are sufficient so that with probability ≥ 1 − δ, the concept in Cε that is ε-close to c∗ in fact has lower empirical error.

Proof Easy calculation: if ml > 6 log_{1/ε}(1/δ), then

Σ_{k=0}^{⌊ml/2⌋} (ml choose k) ε^(ml−k) (1 − ε)^k ≤ δ.
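As a quick numerical sanity check of this calculation (an illustrative sketch; the particular values of ε and δ are arbitrary choices), one can evaluate the tail sum directly in Python:

import math

def tail(ml, eps):
    # probability that the eps-close hypothesis errs on at least half of the ml labeled examples:
    # sum_{k=0}^{floor(ml/2)} C(ml, k) * eps^(ml-k) * (1-eps)^k
    return sum(math.comb(ml, k) * eps ** (ml - k) * (1 - eps) ** k
               for k in range(ml // 2 + 1))

eps, delta = 0.05, 0.01
ml = math.ceil(6 * math.log(1 / delta) / math.log(1 / eps)) + 1   # ml > 6 log_{1/eps}(1/delta)
print(ml, tail(ml, eps), delta)   # the tail probability comes out well below delta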

In particular, by reducing ε to poly(δ), we can reduce the number of labeled examples ml needed to 1. In fact, this result can be extended to the case, considered in [Balcan et al., 2004], in which D+ and D− merely satisfy constant expansion.

This example illustrates that if data is especially well behaved with respect to

the compatibility notion, then our bounds on labeled data can be extremely good.

More interestingly, as shown in Section 21.4.2, for the case of linear separators and independence given the label, we can achieve the bounds in Theorem 21.14 in terms of labeled examples with an efficient (polynomial-time) algorithm.

21.4 Algorithmic Results

In this section we give several examples of efficient algorithms in our model.

21.4.1 A simple case

We give here a simple example to illustrate the bounds in Section 21.3.1.1, and for

which we can give a polynomial-time algorithm that takes advantage of them. Let

the instance space X = {0, 1}^d, and for x ∈ X, let vars(x) be the set of variables

set to 1 by x. Let C be the class of monotone disjunctions (e.g., x1 ∨ x3 ∨ x6),

and for f ∈ C, let vars(f) be the set of variables disjoined by f . Now, suppose we

say an example x is compatible with function f if either vars(x) ⊆ vars(f) or else

vars(x) ∩ vars(f) = ∅. This is a very strong notion of “margin”: it says, in essence,

that every variable is either a positive indicator or a negative indicator, and no

example should contain both positive and negative indicators.

Given this setup, we can give a simple PACunl-learning algorithm for this pair

(C, χ). We begin by using our unlabeled data to construct a graph on d vertices (one

per variable), putting an edge between two vertices i and j if there is any example

x in our unlabeled sample with i, j ∈ vars(x). We now use our labeled data to

label the components. If the target function is fully compatible, then no component

will get multiple labels (if some component does get multiple labels, we halt with


failure). Finally, we produce the hypothesis f such that vars(f) is the union of the

positively-labeled components. This is fully compatible with the unlabeled data and

has zero error on the labeled data, so by Theorem 21.5, if the sizes of the data sets

are as given in the bounds, with high probability the hypothesis produced will have

error ≤ ε.
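A compact Python sketch of this algorithm (illustrative only; the data format, with examples given as sets of the variables set to 1 and labels in {0, 1}, and all helper names are assumptions):

def learn_monotone_disjunction(unlabeled, labeled, d):
    # unlabeled: list of sets vars(x); labeled: list of (vars(x), y) pairs with y in {0, 1}
    parent = list(range(d))                       # union-find over the d variables
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    def union(i, j):
        parent[find(i)] = find(j)

    # connect variables that co-occur in some unlabeled example
    for x in unlabeled:
        x = sorted(x)
        for j in x[1:]:
            union(x[0], j)

    # use labeled data to label whole components; halt on a conflict
    component_label = {}
    for x, y in labeled:
        for i in x:
            c = find(i)
            if component_label.get(c, y) != y:
                raise ValueError("component received conflicting labels; target not fully compatible")
            component_label[c] = y

    positive_vars = {i for i in range(d) if component_label.get(find(i)) == 1}
    return lambda x: int(bool(x & positive_vars))   # the learned monotone disjunction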

Notice that if we want to view the algorithm as “purchasing” labeled data, then we can simply examine the graph, count the number of connected components k, and then request (1/ε)[k ln 2 + ln(2/δ)] labeled examples. (Here, 2^k = |CS,χ(0)|.) By the proof of Theorem 21.5, with high probability 2^k ≤ |CD,χ(ε)|, so we are purchasing no more than the number of labeled examples in the theorem statement.

Also, it is interesting to see the difference between a “helpful” and “non-helpful”

distribution for this problem. An especially non-helpful distribution would be the

uniform distribution over all examples x with |vars(x)| = 1, in which case there are d components. In this case, unlabeled data does not help at all, and one still needs Ω(d) labeled examples (or even Ω(d/ε) if the distribution is non-uniform, as in

VC-dimension lower bounds [Ehrenfeucht et al., 1989]). On the other hand, a helpful

distribution is one such that with high probability the number of components is

small, such as the case of features appearing independently given the label.

21.4.2 Co-training with linear separators

We now consider the case of co-training where the hypothesis class is the class of

linear separators. For simplicity we focus first on the case of Example 4: the target

function is a linear separator in Rd and each example is a pair of points, both of

which are assumed to be on the same side of the separator (i.e., an example is a

line-segment that does not cross the target hyperplane). We then show how our

results can be extended to the more general setting.

As in the previous example, a natural approach is to try to solve the “consistency”

problem: given a set of labeled and unlabeled data, our goal is to find a separator

that is consistent with the labeled examples and compatible with the unlabeled ones

(i.e., it gets the labeled data correct and doesn’t cut too many edges). Unfortunately,

this consistency problem is NP-hard: given a graph g embedded in Rd with two

distinguished points s and t, it is NP-hard to find the linear separator that cuts

the minimum number of edges, even if the minimum is 0 [Flaxman, 2003]. For

this reason, we will make an additional assumption, that the two points in an

example are each drawn independently given the label. That is, there is a single

distribution D over Rd, and with some probability p+, two points are drawn

i.i.d. from D+ (D restricted to the positive side of the target function) and with

probability 1 − p+, the two are drawn i.i.d from D− (D restricted to the negative

side of the target function). Note that our sample complexity results in Section 21.3.2 extend to weaker assumptions such as the distributional expansion introduced by Balcan et al. [2004], but we need true independence for our algorithmic results.

Blum and Mitchell [1998] have also given positive algorithmic results for co-training

when (a) the two halves of an example are drawn independently given the label


(which we are assuming now), (b) the underlying function is learnable via Statistical

Query algorithms8 (which is true for linear separators by Blum et al. [1998]), and

(c) we have enough labeled data to produce a weakly-useful hypothesis on one of

the halves to begin with. A weakly-useful predictor is defined to be a predictor f

such that:

Pr[f(x) = 1|c∗(x) = 1] > Pr[f(x) = 1|c∗(x) = 0] + ε.

It is equivalent to the usual notion of a “weak hypothesis” (see [Kearns and Vazirani, 1994]) when the target function is balanced, but requires the hypothesis to give more information when the target function is unbalanced — see [Blum and Mitchell,

1998].

We show here how we can run the algorithm in [Blum and Mitchell, 1998] with

only a single labeled example, thus obtaining an efficient algorithm in our model.

It is worth noticing that in the process, we also simplify the results of Blum et al.

[1998] somewhat.

Theorem 21.16 There is a polynomial-time algorithm (in d and b, where b is the

number of bits per example) to learn a linear separator under the above assumptions,

from a polynomial number of unlabeled examples and a single labeled example.

Proof Sketch: Assume for convenience that the target separator passes through the

origin, and let us denote the separator by c∗ · x = 0. We will also assume for

convenience that PrD(c∗(x) = 1) ∈ [ε/2, 1− ε/2]; that is, the target function is not

overwhelmingly positive or overwhelmingly negative (if it is, this is actually an easy

case, but it makes the arguments more complicated). Define the margin of some

point x as the distance of x/|x| to the separating plane, or equivalently, the cosine

of the angle between c∗ and x.

We begin by drawing a large unlabeled sample S = {〈x^i_1, x^i_2〉}; denote by Sj the set of points x^i_j

, for j = 1, 2. (We describe our algorithm as working with the fixed

unlabeled sample S, since we just need to apply standard VC-dimension arguments

to get the desired result.) The first step is to perform a transformation T on S1

to ensure that some reasonable (1/poly) fraction of T (S1) has margin at least

1/poly, which we can do via the Outlier Removal Lemma of Blum et al. [1998]

and Dunagan and Vempala [2001].9 The Outlier Removal Lemma states that one

can algorithmically remove an ε′ fraction of S1 and ensure that for the remainder,

for any vector w, maxx∈S1 (w · x)² ≤ poly(n, b, 1/ε′) · Ex∈S1[(w · x)²], where b is the

number of bits needed to describe the input points. We reduce the dimensionality

(if necessary) to get rid of any of the vectors for which the above quantity is zero.

We then determine a linear transformation (as described in Blum et al. [1998]) so

8. For a detailed description of the Statistical Query model see [Kearns, 1998] and [Kearns and Vazirani, 1994].
9. If the reader is willing to allow running time polynomial in the margin of the data set, then this part of the argument is not needed.


that in the transformed space, for all unit-length w, Ex∈T(S1)[(w · x)²] = 1.

Since the maximum is bounded, this guarantees that at least a 1/poly fraction of

the points in T (S1) have at least a 1/poly margin with respect to the separating

hyperplane.

To avoid cumbersome notation in the rest of the discussion, we drop our use

of “T” and simply use S and c∗ to denote the points and separator in the

transformed space. (If the distribution originally had a reasonable probability mass

at a reasonable margin from c∗, then T could be the identity anyway.)

In the second step, we argue that a random halfspace has at least a 1/poly chance

of being a weak predictor on S1. ([Blum et al., 1998] uses the Perceptron algorithm

to get weak learning; here, we need something simpler since we do not yet have any

labeled data.) Specifically, consider a point x such that the angle between x and c∗

is π/2− γ, and imagine that we draw f at random subject to f · c∗ ≥ 0 (half of the

f ’s will have this property). Then,

Prf(f(x) ≠ c∗(x) | f · c∗ ≥ 0) = (π/2 − γ)/π = 1/2 − γ/π.

Since at least a 1/poly fraction of the points in S1 have at least a 1/poly margin, this implies that:

Prf,x[f(x) = 1|c∗(x) = 1] > Prf,x[f(x) = 1|c∗(x) = 0] + 1/poly.

This means that a 1/poly probability mass of functions f must in fact be weakly-

useful predictors.
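The probability computed above is easy to check by simulation; here is a small numpy sketch (illustrative only; the dimension, γ, and sample size are arbitrary choices) that estimates Prf(f(x) ≠ c∗(x) | f · c∗ ≥ 0) for a point at angle π/2 − γ from c∗ and compares it with 1/2 − γ/π.

import numpy as np

rng = np.random.default_rng(0)
d, gamma, trials = 10, 0.3, 200_000

c_star = np.zeros(d); c_star[0] = 1.0
x = np.zeros(d); x[0], x[1] = np.sin(gamma), np.cos(gamma)   # angle(x, c_star) = pi/2 - gamma

f = rng.standard_normal((trials, d))
f = f[f @ c_star >= 0]                                  # condition on f . c_star >= 0
err = np.mean(np.sign(f @ x) != np.sign(x @ c_star))    # fraction of f with f(x) != c_star(x)
print(err, 0.5 - gamma / np.pi)                         # the two numbers should be close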

The final step of the algorithm is as follows. Using the above observation, we pick

a random f , and plug it into the bootstrapping theorem of [Blum and Mitchell,

1998] (which, given unlabeled pairs 〈x^i_1, x^i_2〉 ∈ S, will use f(x^i_1) as a noisy label of x^i_2, feeding the result into an SQ algorithm), repeating this process poly(n) times.

With high probability, our random f was a weakly-useful predictor on at least one

of these steps, and we end up with a low-error hypothesis. For the rest of the runs

of the algorithm, we have no guarantees. We now observe the following. First of

all, any function f with small err(f) must have small errunl(f). Secondly, because

of the assumption of independence given the label, as shown in Theorem 21.14, the

only functions with low unlabeled error rate are functions close to c∗, close to ¬c∗,

close to the “all positive” function, or close to the “all negative” function.

So, if we simply examine all the hypotheses produced by this procedure, and pick

some h with a low unlabeled error rate that is at least ε/2-far from the “all-positive” or “all-negative” functions, then either h or ¬h is close to c∗. We can now just draw

a single labeled example to determine which case is which.

We can easily extend our algorithm to the standard co-training setting (where c∗1 can be different from c∗2) as follows: we repeat the procedure in a symmetric way, and then, in order to find a good pair of functions, just try all combinations of pairs of functions to find one with small unlabeled error rate that is not close to the “all positive” or “all negative” functions; finally, we use a constant number of labeled examples to

produce a low error hypothesis (and here we use only one part of the example and


only one of the functions in the pair).

21.5 Related Models and Discussion

21.5.1 A Transductive Analog of our Model

We can also talk about a transductive analog of our (inductive) model that

incorporates many of the existing transductive methods for learning with labeled

and unlabeled data. In a transductive setting one assumes that the unlabeled sample

S is given, a random small subset is labeled, and the goal is to predict well on

the rest of S. In order to make use of unlabeled examples, we will again express

the relationship we hope the target function has with the distribution through a

compatibility notion χ. However, since in this case the compatibility between a

given hypothesis and D is completely determined by S (which is known), we will

not need to require that compatibility be an expectation over unlabeled examples.

Given this setup, from the sample complexity point of view we only care about how

much labeled data we need, and algorithmically we need to find a highly compatible

hypothesis with low error on the labeled data.

Rather than presenting general theorems, we instead focus on the modeling

aspect and give here several examples in the context of graph-based semi-supervised

algorithms for binary classification. In these methods one usually assumes that there is a weighted graph g defined over S, which is given a priori and encodes the prior

knowledge. In the following we denote by W the weighted adjacency matrix of g

and by CS the set of all binary functions over S.

Minimum cut Suppose for f ∈ CS we define the incompatibility of f to be the

weight of the cut in g determined by f. This is the implicit notion of compatibility

considered in [Blum and Chawla, 2001], and algorithmically the goal is to find the

most compatible hypothesis that gets the labeled data correct, which can be solved

efficiently using network flow. From a sample-complexity point of view, the number

of labeled examples we need is proportional to the VC-dimension of the class of

hypotheses that are at least as compatible as the target function, which is known

to be O(k/λ) (see [Kleinberg, 2000], [Kleinberg et al., 2004]), where k is the number

of edges cut by c∗ and λ is the size of the global minimum cut in the graph. Also note

that the Randomized Mincut algorithm (considered by Blum et al. [2004]), which

is an extension of the basic mincut approach, is actually motivated by a PAC-Bayes

sample complexity analysis of the problem.
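A small sketch of the mincut approach in Python, using networkx for the max-flow/min-cut computation (an illustrative sketch, assuming networkx is available; the toy graph, edge weights, and node names are made up and not part of the chapter):

import networkx as nx

def mincut_ssl(weighted_edges, labeled):
    # weighted_edges: iterable of (u, v, w); labeled: dict node -> 0/1
    G = nx.DiGraph()
    for u, v, w in weighted_edges:
        G.add_edge(u, v, capacity=w)      # add both directions so the cut behaves as undirected
        G.add_edge(v, u, capacity=w)
    BIG = 1e9                             # large finite stand-in for infinite capacity
    for node, y in labeled.items():       # tie labeled nodes to the source (+) or sink (-)
        if y == 1:
            G.add_edge("source", node, capacity=BIG)
        else:
            G.add_edge(node, "sink", capacity=BIG)
    _, (pos_side, _) = nx.minimum_cut(G, "source", "sink")
    return {v: int(v in pos_side) for v in G if v not in ("source", "sink")}

# toy example: two triangles joined by one weak edge; one labeled node on each side
edges = [("a","b",1),("b","c",1),("a","c",1),("d","e",1),("e","f",1),("d","f",1),("c","d",0.1)]
print(mincut_ssl(edges, {"a": 1, "f": 0}))   # cuts the weak c-d edge: a,b,c positive; d,e,f negative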

Normalized Cut Consider the normalized cut setting of Joachims [2003] and for

f ∈ CS define size(f) to be the weight of the cut in g determined by f , and

let fneg and fpos be the number of points in S on which f predicts negative and positive, respectively. For f ∈ CS, define the incompatibility of f to be size(f)/(fneg · fpos).

Note that this is the implicit compatibility function used in Joachims [2003], and

again, algorithmically the goal would be to find a highly compatible hypothesis


that gets the labeled data correct. Unfortunately, the corresponding optimization problem is in this case NP-hard. Still, several “approximate” solutions have been

considered, leading to different semi-supervised learning algorithms; for instance

Joachims considers in [Joachims, 2003] a spectral relaxation that leads to the “SGT

algorithm”; another relaxation based on Semi-Definite programming is considered

by Bie and Cristianini [2004]10.

Harmonic Function We can also model the algorithms introduced in [Zhu et al.,

2003a], [Zhu et al., 2003b] as follows. If we consider f to be a probabilistic

prediction function defined over S, then the incompatibility of f is given by Σi,j wi,j (f(i) − f(j))² = fT L f, where L is the un-normalized Laplacian of g.

Similarly we can model the algorithm introduced by Zhou et al. [2004] by noticing

that the incompatibility of f is given by fT Lf where L is the normalized Laplacian

of g. More generally, all the Graph Kernel methods can be viewed in our framework

if we consider that the incompatibility of f is given by ||f ||K = fT Kf where K is

a kernel derived from the graph (see for instance [Zhu et al., 2003c]).
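For the harmonic-function case, the minimizer of fT L f with the labeled values held fixed has a closed form; a small numpy sketch (illustrative only; the toy weight matrix and the split into labeled and unlabeled indices are assumptions):

import numpy as np

def harmonic_solution(W, labeled_idx, labeled_vals, unlabeled_idx):
    # W: symmetric weighted adjacency matrix; minimize f^T L f with f fixed on labeled nodes,
    # where L = D - W. The solution satisfies f_u = (D_uu - W_uu)^{-1} W_ul f_l.
    D = np.diag(W.sum(axis=1))
    L = D - W
    L_uu = L[np.ix_(unlabeled_idx, unlabeled_idx)]
    W_ul = W[np.ix_(unlabeled_idx, labeled_idx)]
    f_l = np.asarray(labeled_vals, dtype=float)
    return np.linalg.solve(L_uu, W_ul @ f_l)

# toy chain graph 0-1-2-3 with node 0 labeled 1 and node 3 labeled 0
W = np.array([[0,1,0,0],[1,0,1,0],[0,1,0,1],[0,0,1,0]], dtype=float)
print(harmonic_solution(W, labeled_idx=[0,3], labeled_vals=[1.0,0.0], unlabeled_idx=[1,2]))
# roughly [0.667, 0.333]: the values interpolate linearly along the chain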

21.5.2 Connections to Generative Models

It is also interesting to consider how generative models fit into our model. As

mentioned in Section 21.1, a typical assumption in a generative setting is that D is

a mixture with the probability density function p(x|θ) = p0 · p0(x|θ0)+ p1 · p1(x|θ1)

(see for instance [Ratsaby and Venkatesh, 1995], [Castelli and Cover, 1995, 1996]).

That means that the labeled examples are generated according to the following

mechanism: a label y ∈ {0, 1} is drawn according to the distribution of classes (p0, p1), and then a corresponding random feature vector is drawn according to

the class-conditional density py. The assumption typically used is that the mixture

is identifiable. Identifiability ensures that the Bayes optimal decision border {x : p0 · p0(x|θ0) = p1 · p1(x|θ1)} can be deduced if p(x|θ) is known, and therefore one can construct an estimate of the Bayes border by using p(x|θ̂) instead of p(x|θ), where θ̂ is an estimate of θ obtained from unlabeled data.

Essentially once the decision border is estimated, a small labeled sample suffices to

learn (with high confidence and small error) the appropriate class labels associated

with the two disjoint regions generated by the estimate of the Bayes decision border. To see how we can incorporate this setting in our model, consider for illustration the

setting in Ratsaby and Venkatesh [1995]; there they assume that p0 = p1, and that

the class conditional densities are d-dimensional Gaussians with unit covariance and

unknown mean vectors θi ∈ Rd. The algorithm used is the following: the unknown

parameter vector θ = (θ0, θ1) is estimated from unlabeled data using a maximum

likelihood estimate; this determines a hypothesis which is a linear separator that

passes through the point (θ0 + θ1)/2 and is orthogonal to the vector θ1 − θ0; finally

each of the two decision regions separated by the hyperplane is labeled according

10. For a more detailed discussion on this see also Chapter 7 in this book.


to the majority of the labeled examples in the region. Given this setting, a natural

notion of compatibility we can consider is the expected log-likelihood function

(where the expectation is taken with respect to the unknown distribution specified

by θ). Specifically, we can identify a legal hypothesis fθ with the set of parameters

θ = (θ0, θ1) that determine it, and then we can define χ(fθ,D) = Ex∈D[log(p(x|θ))].

Ratsaby and Venkatesh [1995] show that if the unlabeled sample is large enough,

then all hypotheses specified by parameters θ′ that are close enough to the true θ will have the property that their empirical compatibilities will be close enough to their true compatibilities. This then implies (together with other observations about the Gaussian mixture) that the maximum likelihood estimate will be close enough to θ, up to

permutations. (This actually motivates χ as a good compatibility function in our

model.)
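A brief numpy sketch of this compatibility notion for the Gaussian case (purely illustrative; the true means, the candidate parameters, and the sample size are all assumptions): it estimates χ(fθ, D) = Ex[log p(x|θ)] from an unlabeled sample and shows that the empirical compatibility is highest near the true parameters, identically under the two permutations.

import numpy as np

def log_mixture_density(X, theta0, theta1):
    # log p(x|theta) for an equal-weight mixture of two unit-covariance Gaussians
    d = X.shape[1]
    def log_gauss(X, mu):
        return -0.5 * np.sum((X - mu) ** 2, axis=1) - 0.5 * d * np.log(2 * np.pi)
    a, b = log_gauss(X, theta0), log_gauss(X, theta1)
    m = np.maximum(a, b)
    return m + np.log(0.5 * np.exp(a - m) + 0.5 * np.exp(b - m))   # stable log-sum-exp

def empirical_compatibility(X, theta0, theta1):
    # empirical version of chi(f_theta, D) = E_x[log p(x|theta)]
    return float(np.mean(log_mixture_density(X, theta0, theta1)))

rng = np.random.default_rng(1)
true0, true1 = np.array([0.0, 0.0]), np.array([3.0, 0.0])
labels = rng.integers(0, 2, size=2000)
X = np.where(labels[:, None] == 0, true0, true1) + rng.standard_normal((2000, 2))

for cand in [(true0, true1), (true1, true0), (np.array([1.5, 0.0]), np.array([1.5, 0.0]))]:
    print(empirical_compatibility(X, *cand))
# the two permutations of the true means score (equally) highest; the collapsed candidate scores lower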

More generally, if we deal with other parametric families (but we are in the

same setting), we can use the same compatibility notion; however, we will need

to impose certain constraints on the distributions allowed in order to ensure that

the compatibility is actually well defined (the expected log-likelihood is bounded).

As mentioned in Section 21.1, this kind of generative setting is really at the extreme of our model. The assumption that the distribution that generates the data is really a mixture implies that if we knew the distribution, then there would be only two possible

concepts left (and this makes the unlabeled data extremely useful).

21.5.3 Connections to the Luckiness Framework

It is worth noticing that there is a strong connection between our approach and

the luckiness framework (see [Shawe-Taylor et al., 1998], [Mendelson and Philips,

2003]). In both cases, the idea is to define an ordering of hypotheses that dependsRelationship to

the luckiness

framework

on the data, in the hope that we will be “lucky” and find that not too many other

functions are as compatible as the target. There are two main differences, however.

The first is that the luckiness framework (being designed for supervised learning

only) uses labeled data both for estimating compatibility and for learning: this is a

more difficult task, and as a result our bounds on labeled data can be significantly

better. For instance, in Example 4 described in Section 21.2, for any non-degenerate

distribution, a dataset of d/2 pairs can with probability 1 be completely shattered by

fully-compatible hypotheses, so the luckiness framework does not help. In contrast,

with a larger (unlabeled) sample, one can potentially reduce the space of compatible

functions quite significantly, and learn from o(d) or even O(1) labeled examples

depending on the distribution; see Sections 21.3.2 and 21.4. Secondly, the luckiness

framework talks about compatibility between a hypothesis and a sample, whereas

we define compatibility with respect to a distribution. This allows us to talk about

the amount of unlabeled data needed to estimate true compatibility. There are also

a number of differences at the technical level of the definitions.


21.5.4 Conclusions

There has been widespread disagreement and no clear understanding about

when semi-supervised learning helps and by how much. In this chapter, we have

provided a PAC-style model that incorporates both labeled and unlabeled data,

that captures many of the ways unlabeled data is typically used, and provides a

unified framework for thinking about when and why unlabeled data can help. The

main high-level implication of our analysis is that unlabeled data is useful if (a)

we have a good notion of compatibility so that the target function indeed has a

low unlabeled error rate, (b) the distribution D is helpful in the sense that not

too many other hypotheses also have a low unlabeled error rate, and (c) we have

enough unlabeled data to estimate unlabeled error rates well. One consequence of our

model is that if the target function and data distribution are both well behaved with

respect to the compatibility notion, then the sample-size bounds we get for labeled

data can substantially beat what one could hope to achieve through pure labeled-

data bounds, and we have illustrated this with a number of examples through the

chapter.


22 Metric Based Approaches for Semi-Supervised

Regression

Dale Schuurmans

Finnegan Southey

Semi-supervised learning methods typically require an explicit relationship to be

asserted between the labelled and unlabelled data, as illustrated, for example, by

the neighbourhoods used in graph-based methods. We present semi-supervised model

selection and regularization methods that require instead only that the labelled and

unlabelled data are drawn from the same distribution. From this assumption a metric

can be constructed over hypotheses based solely on their predictions for unlabelled

data. This metric can then be used to detect untrustworthy training error estimates,

leading to model selection strategies that select the richest hypothesis class while

providing theoretical guarantees against overfitting. The general approach is then

adapted to regularization for supervised regression. This regularization adapts not

only to the hypothesis class but also to the specific data provided, allowing for better

performance than regularizers that account only for class complexity.

22.1 Metrics for Supervised Learning

22.2 Model Selection

22.3 Regularization

22.4 Connections to Active Learning

22.5 Classification


References

Y. S. Abu-Mostafa. Machines that learn from hints. Scientific American, 272(4):64–69, 1995.

A. K. Agrawala. Learning with a probabilistic teacher. IEEE Transactions on Information Theory, 16:373–379, 1970.

M.-F. Balcan and A. Blum. An augmented PAC model for semi-supervised learning. Manuscript, 2005.

M.-F. Balcan, A. Blum, and K. Yang. Co-training and expansion: Towards bridging theory and practice. In NIPS, 2004.

P. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, pages 463–482, 2002.

E. B. Baum. Polynomial time algorithms for learning neural nets. In Proceedings of the Third Annual Workshop on Computational Learning Theory, pages 258–272, 1990.

G. M. Benedek and A. Itai. Learnability with respect to a fixed distribution. Theoretical Computer Science, 86:377–389, 1991.

T. De Bie and N. Cristianini. Convex transduction with the normalized cut. Manuscript, 2004.

A. Blum and S. Chawla. Learning from labeled and unlabeled data using graph mincuts. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML 2001), pages 19–26, 2001.

A. Blum and R. Kannan. Learning an intersection of k halfspaces over a uniform distribution. Journal of Computer and Systems Sciences, 54(2):371–380, 1997.

A. Blum and T. M. Mitchell. Combining labeled and unlabeled data with co-training. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, pages 92–100, 1998.

A. Blum, A. Frieze, R. Kannan, and S. Vempala. A polynomial-time algorithm for learning noisy linear threshold functions. Algorithmica, 22:35–52, 1998.

A. Blum, J. Lafferty, R. Reddy, and M. R. Rwebangira. Semi-supervised learning using randomized mincuts. In ICML ’04, 2004.

A. Blumer, A. Ehrenfeucht, D. Haussler, and M. K. Warmuth. Learnability and the Vapnik-Chervonenkis dimension. Journal of the ACM, 36(4):929–965, 1989.

S. Boucheron, G. Lugosi, and P. Massart. A sharp concentration inequality with applications. Random Structures and Algorithms, 16:277–292, 2000.

S. Boucheron, O. Bousquet, and G. Lugosi. Theory of classification: A survey of recent advances. Manuscript, 2004.

V. Castelli and T. M. Cover. On the exponential value of labeled samples. Pattern Recognition Letters, 16:105–111, 1995.

V. Castelli and T. M. Cover. The relative value of labeled and unlabeled samples in pattern recognition with an unknown mixing parameter. IEEE Transactions on Information Theory, 42(6):2102–2117, 1996.

M. Collins and Y. Singer. Unsupervised models for named entity classification. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, pages 189–196, 1999.

S. Dasgupta, M. L. Littman, and D. McAllester. PAC generalization bounds for co-training. In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems 14, Cambridge, MA, 2001. MIT Press.

L. Devroye, L. Gyorfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Springer-Verlag, 1996.

J. Dunagan and S. Vempala. Optimal outlier removal in high-dimensional spaces. In Proceedings of the 33rd ACM Symposium on Theory of Computing, 2001.

A. Ehrenfeucht, D. Haussler, M. Kearns, and L. Valiant. A general lower bound on the number of examples needed for learning. Inf. and Comput., 82:246–261, 1989.

A. Flaxman. Personal communication, 2003.

S. C. Fralick. Learning to recognize patterns without a teacher. IEEE Transactions on Information Theory, 13:57–64, 1967.

R. Ghani. Combining labeled and unlabeled data for text classification with a large number of categories. In Proceedings of the IEEE International Conference on Data Mining, 2001.

T. Joachims. Transductive learning via spectral graph partitioning. In Proceedings of the International Conference on Machine Learning (ICML 2003), 2003.

T. Joachims. Transductive inference for text classification using support vector machines. In Proceedings of the Sixteenth International Conference on Machine Learning (ICML 1999), pages 200–209, Bled, Slovenia, 1999. Morgan Kaufmann.

M. Kearns. Efficient noise-tolerant learning from statistical queries. Journal of the ACM (JACM), pages 983–1006, 1998.

M. Kearns and U. Vazirani. An Introduction to Computational Learning Theory. MIT Press, 1994.

J. Kleinberg. Detecting a network failure. In Proceedings of the 41st IEEE Symposium on Foundations of Computer Science, pages 231–239, 2000.

J. Kleinberg, M. Sandler, and A. Slivkins. Network failure detection and graph connectivity. In Proceedings of the 41st IEEE Symposium on Foundations of Computer Science, pages 231–239, 2004.

A. R. Klivans, R. O’Donnell, and R. Servedio. Learning intersections and thresholds of halfspaces. In Proceedings of the 43rd Symposium on Foundations of Computer Science, pages 177–186, 2002.

V. Koltchinskii. Rademacher penalties and structural risk minimization. IEEE Trans. Inform. Theory, pages 1902–1914, 2001.

B. Leskes. The value of agreement, a new boosting algorithm. In COLT, pages 51–56, 2005.

A. Levin, P. Viola, and Y. Freund. Unsupervised improvement of visual detectors using co-training. In Proceedings of the Ninth IEEE International Conference on Computer Vision (ICCV 2003), pages 626–633, Nice, France, 2003. IEEE.

N. Linial, Y. Mansour, and N. Nisan. Constant depth circuits, Fourier transform, and learnability. In Proceedings of the Thirtieth Annual Symposium on Foundations of Computer Science, pages 574–579, Research Triangle Park, North Carolina, October 1989.

S. Mendelson and P. Philips. Random subclass bounds. In Proceedings of the 16th Annual Conference on Computational Learning Theory (COLT), 2003.

K. Nigam. Using unlabeled data to improve text classification. Technical Report Doctoral Dissertation, CMU-CS-01-126, Carnegie Mellon University, 2001.

K. Nigam and R. Ghani. Analyzing the effectiveness and applicability of co-training. In Proc. ACM CIKM Int. Conf. on Information and Knowledge Management, pages 86–93, 2000.

K. Nigam, A. McCallum, S. Thrun, and T. M. Mitchell. Text classification from labeled and unlabeled documents using EM. Machine Learning, 39(2/3):103–134, 2000.

S. Park and B. Zhang. Large scale unstructured document classification using unlabeled data and syntactic information. In PAKDD 2003, LNCS vol. 2637, pages 88–99. Springer, 2003.

D. Pierce and C. Cardie. Limitations of Co-Training for natural language learning from large datasets. In Proc. Conference on Empirical Methods in NLP, pages 1–9, 2001.

J. Ratsaby and S. Venkatesh. Learning from a mixture of labeled and unlabeled examples with parametric side information. In Proceedings of the Eighth Annual Conference on Computational Learning Theory, pages 412–417, 1995.

H. J. Scudder. Probability of error of some adaptive pattern-recognition machines. IEEE Transactions on Information Theory, 11:363–371, 1965.

J. Shawe-Taylor, P. L. Bartlett, R. C. Williamson, and M. Anthony. Structural risk minimization over data-dependent hierarchies. IEEE Transactions on Information Theory, 44(5):1926–1940, 1998.

L. G. Valiant. A theory of the learnable. Commun. ACM, 27(11):1134–1142, 1984.

V. N. Vapnik. Statistical Learning Theory. John Wiley and Sons Inc., 1998.

S. Vempala. A random sampling based algorithm for learning the intersection of half-spaces. In Proceedings of the 38th Symposium on Foundations of Computer Science, pages 508–513, 1997.

K. A. Verbeurgt. Learning DNF under the uniform distribution in quasi-polynomial time. In COLT, pages 314–326, 1990.

D. Yarowsky. Unsupervised word sense disambiguation rivaling supervised methods. In Meeting of the Association for Computational Linguistics, pages 189–196, 1995.

D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Schölkopf. Learning with local and global consistency. In NIPS, 2004.

X. Zhu, Z. Ghahramani, and J. Lafferty. Semi-supervised learning using Gaussian fields and harmonic functions. In The Twentieth International Conference on Machine Learning, pages 912–912, Washington, DC, USA, 2003a. AAAI Press.

X. Zhu, Z. Ghahramani, and J. Lafferty. Combining active learning and semi-supervised learning using Gaussian fields and harmonic functions. In ICML-2003 Workshop on The Continuum from Labeled to Unlabeled Data in Machine Learning, pages 912–912, Washington, DC, USA, 2003b.

X. Zhu, Z. Ghahramani, and J. Lafferty. Semi-supervised learning: From Gaussian fields to Gaussian processes. Technical report, Carnegie Mellon University, 2003c.


Index

characteristic function, 93

Dirac δ, 93

notation table, 91

symbol list, 91


Notation and Symbols

Sets of Numbers

N the set of natural numbers, N = {1, 2, . . .}

R the set of reals

[n] compact notation for {1, . . . , n}

x ∈ [a, b] interval a ≤ x ≤ b

x ∈ (a, b] interval a < x ≤ b

x ∈ (a, b) interval a < x < b

|C| cardinality of a set C (for finite sets, the number of elements)

Data

X the input domain

d (used if X is a vector space) dimension of X

M number of classes (for classification)

l, u number of labeled, unlabeled training examples

n total number of examples, n = l + u.

i, j indices, often running over [l] or [n]

xi input patterns xi ∈ X

yi classes yi ∈ [M ] (for regression: target values yi ∈ R)

X a sample of input patterns, X = (x1, . . . , xn)

Y a sample of output targets, Y = (y1, . . . , yn)

Xl labeled part of X, Xl = (x1, . . . , xl)

Yl labeled part of Y , Yl = (y1, . . . , yl)

Xu unlabeled part of X, Xu = (xl+1, . . . , xl+u)

Yu unlabeled part of Y , Yu = (yl+1, . . . , yl+u)


Kernels

H feature space induced by a kernel

Φ feature map, Φ : X → H

k (positive definite) kernel

K kernel matrix or Gram matrix, Kij = k(xi, xj)

Vectors, Matrices and Norms

1 vector with all entries equal to one

I identity matrix

A⊤ transposed matrix (or vector)

A−1 inverse matrix (in some cases, pseudo-inverse)

tr (A) trace of a matrix

det (A) determinant of a matrix

〈x,x′〉 dot product between x and x′

‖·‖ 2-norm, ‖x‖ := √〈x, x〉

‖·‖p p-norm, ‖x‖p := (∑_{i=1}^N |xi|^p)^{1/p}, N ∈ N ∪ {∞}

‖·‖∞ ∞-norm, ‖x‖∞ := sup_{i=1}^N |xi|, N ∈ N ∪ {∞}

Functions

ln logarithm to base e

log2 logarithm to base 2

f a function, often from X or [n] to R, RM or [M ]

F a family of functions

Lp(X) function spaces, 1 ≤ p ≤ ∞

Probability

P{·} probability of a logical formula

P(C) probability of a set (event) C

p(x) density evaluated at x ∈ X

E [·] expectation of a random variable

Var [·] variance of a random variable

N(µ, σ2) normal distribution with mean µ and variance σ2


Graphs

g graph g = (V,E) with nodes V and edges E

G set of graphs

W weighted adjacency matrix of a graph (Wij ≠ 0 ⇔ (i, j) ∈ E)

D (diagonal) degree matrix of a graph, Dii = ∑_j Wij

L normalized graph Laplacian, L = D^{−1/2}WD^{−1/2}

L un-normalized graph Laplacian, L = D − W

SVM-related

ρf (x, y) margin of function f on the example (x, y), i.e., y · f(x)

ρf margin of f on the training set, i.e., min_{i=1,...,m} ρf(xi, yi)

h VC dimension

C regularization parameter in front of the empirical risk term

λ regularization parameter in front of the regularizer

w weight vector

b constant offset (or threshold)

αi Lagrange multiplier or expansion coefficient

βi Lagrange multiplier

α,β vectors of Lagrange multipliers

ξi slack variables

ξ vector of all slack variables

Q Hessian of a quadratic program

Miscellaneous

IA characteristic (or indicator) function on a set A

i.e., IA(x) = 1 if x ∈ A and 0 otherwise

δij Kronecker δ (δij = 1 if i = j, 0 otherwise)

δx Dirac δ, satisfying ∫ δx(y)f(y) dy = f(x)

O(g(n)) a function f(n) is said to be O(g(n)) if there exist constants C > 0

and n0 ∈ N such that |f(n)| ≤ Cg(n) for all n ≥ n0

o(g(n)) a function f(n) is said to be o(g(n)) if for every constant c > 0 there exists n0 ∈ N such that |f(n)| ≤ cg(n) for all n ≥ n0

rhs/lhs shorthand for “right/left hand side”

□ the end of a proof