
Mathematics of Machine Learning
Rajen D. Shah (r.shah@statslab.cam.ac.uk)

1 Introduction

Consider a pair of random elements (X, Y) ∈ X × Y with joint distribution P₀, where X is to be thought of as an input or vector of predictors, and Y as an output or response. For instance X may represent a collection of disease risk factors (e.g. BMI, age, genetic indicators etc.) for a subject randomly selected from a population and Y may represent their disease status; or X could represent the number of bedrooms and other facilities in a randomly selected house, and Y could be its price. In the former case we may take Y = {−1, 1}, and this setting, known as the classification setting, will be of primary interest to us in this course. The latter case where Y = R is an instance of a regression setting. We will take X = R^p unless otherwise specified.

It is of interest to predict the random Y from X; we may attempt to do this via a (measurable) function h : X → Y, known as a hypothesis. To measure the quality of such a prediction we will introduce a loss function

ℓ : Y × Y → R.

In the classification setting we typically take ℓ to be the misclassification error

ℓ(h(x), y) = 0 if h(x) = y, and 1 otherwise.

In this context h is also referred to as a classifier. In regression settings the squared error ℓ(h(x), y) = (h(x) − y)² is common. We will aim to pick a hypothesis h such that the risk

R(h) := ∫_{X×Y} ℓ(h(x), y) dP₀(x, y)

is small. For a deterministic h, R(h) = E ℓ(h(X), Y). In what follows we will take ℓ and R to be the misclassification loss and risk respectively, unless otherwise stated.

A classifier h₀ that minimises the misclassification risk is known as a Bayes classifier, and its risk is called the Bayes risk. Define the regression function η by

η(x) := P(Y = 1 | X = x).

Proposition 1. A Bayes classifier h₀ is given by¹

h₀(x) = 1 if η(x) > 1/2, and −1 otherwise.

¹When η(x) = 1/2, we can equally well take h₀(x) = ±1 and achieve the same misclassification error.
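[The following numerical sketch is an addition to these notes. It takes a made-up regression function η (an assumption, purely for illustration) and estimates by Monte Carlo the misclassification risk of the Bayes classifier of Proposition 1, alongside that of a deliberately suboptimal classifier.]

import numpy as np

rng = np.random.default_rng(0)

def eta(x):
    # A made-up regression function P(Y = 1 | X = x), for illustration only.
    return 1 / (1 + np.exp(-2 * x))

n = 100_000
X = rng.normal(size=n)                            # X ~ N(0, 1)
Y = np.where(rng.uniform(size=n) < eta(X), 1, -1)

bayes = np.where(eta(X) > 0.5, 1, -1)             # h_0(x) = 1 iff eta(x) > 1/2
naive = np.where(X > 1.0, 1, -1)                  # some other classifier

print("Bayes risk (MC):", np.mean(bayes != Y))    # approximately E min(eta(X), 1 - eta(X))
print("naive risk (MC):", np.mean(naive != Y))    # strictly larger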


In most settings of interest, the joint distribution P₀ of (X, Y), which determines the optimal h, will be unknown. Instead we will suppose we have i.i.d. copies (X₁, Y₁), …, (Xₙ, Yₙ) of the pair (X, Y), known as training data. Our task is to use this data to construct a classifier ĥ such that R(ĥ) is small. Important point: R(ĥ) is a random variable depending on the random training data:

R(ĥ) = E(ℓ(ĥ(X), Y) | X₁, Y₁, …, Xₙ, Yₙ).

A statistical approach to classification may attempt to model P₀ up to some unknown parameters, estimate these parameters, and thereby obtain an estimate of the regression function (or the conditional expectation in the case of least squares; see below). We will take a different approach and assume that we are given a class H of hypotheses from which to pick our ĥ. Possible choices of H include for instance

• H = {h : h(x) = sgn(µ + xᵀβ) where µ ∈ R, β ∈ R^p};

• H = {h : h(x) = sgn(µ + Σ_{j=1}^d φⱼ(x)βⱼ) where µ ∈ R, β ∈ R^d} for a given dictionary of functions φ₁, …, φ_d : X → R.

Technical note: In this course we will take sgn(0) = −1. (It does not matter much whether we take sgn(0) = ±1, but we need to specify a choice in order that the h defined above are classifiers.)

1.1 Brief review of conditional expectation

For many of the mathematical arguments in this course we will need to manipulate conditional expectations.

Recall that if Z ∈ R and W ∈ R^d are random elements with joint probability density function (pdf) f_{Z,W}, then the conditional pdf f_{Z|W} of Z given W satisfies

f_{Z|W}(z|w) = f_{Z,W}(z, w)/f_W(w) if f_W(w) ≠ 0, and 0 otherwise,

where f_W is the marginal pdf of W. When one or more of Z and W are discrete we typically work with probability mass functions.

Suppose E|Z| < ∞. Then the conditional expectation function E(Z|W = w) is given by

g(w) := E(Z|W = w) = ∫ z f_{Z|W}(z|w) dz.    (1.1)

We write E(Z|W) for the random variable g(W) (note this is a function of W, not Z). This is not a fully general definition of conditional expectation (for that see the Stochastic Financial Models course), but we will not need the general version. We will however make frequent use of the following properties of conditional expectation.


(i) Role of independence: If Z and W are independent, then E(Z|W) = EZ. (Recall: Z and W being independent means P(Z ∈ A, W ∈ B) = P(Z ∈ A)P(W ∈ B) for all measurable A ⊆ R, B ⊆ R^d.)

(ii) Tower property: Let f : R^d → R^m be a (measurable) function. Then

E{E(Z|W) | f(W)} = E{Z | f(W)}.

In particular, E{E(Z|W) | W₁, …, W_m} = E(Z | W₁, …, W_m) for m ≤ d. Taking f ≡ c ∈ R and using (i) gives us that E{E(Z|W)} = E(Z) (as f(W) is a constant it is independent of any random variable).

(iii) Taking out what is known: If EZ² < ∞ and f : R^d → R is such that E[{f(W)}²] < ∞, then E{f(W)Z | W} = f(W)E(Z|W).

Probabilistic results can be 'applied conditionally', for example:

Conditional Jensen. Recall that f : R → R is a convex function if

tf(x) + (1 − t)f(y) ≥ f(tx + (1 − t)y) for all x, y ∈ R and t ∈ (0, 1).

The conditional version of Jensen's inequality states that if f : R → R is convex and the random variable Z has E|f(Z)| < ∞, then

E(f(Z) | W) ≥ f(E(Z|W)).

1.2 Bayes risk

Proof of Proposition 1. We have

R(h) = (1/4) E{(Y − h(X))²}
     = (1/4) E{(Y − E(Y|X) + E(Y|X) − h(X))²}
     = (1/4) E{(Y − E(Y|X))²} + (1/4) E{(E(Y|X) − h(X))²} + (1/2) E{(Y − E(Y|X))(E(Y|X) − h(X))}.

But

E{(Y − E(Y|X))(E(Y|X) − h(X))} = E E{(Y − E(Y|X))(E(Y|X) − h(X)) | X}   (tower property)
                               = E[(E(Y|X) − h(X)) E{(Y − E(Y|X)) | X}]   (taking out what is known)
                               = 0.

Thus minimising R(h) is equivalent to minimising E{(E(Y|X) − h(X))²}, and since h(X) ∈ {−1, 1}, this is minimised pointwise by h₀(X) = sgn(E(Y|X)). As E(Y|X) = 2η(X) − 1, this equals 1 precisely when η(X) > 1/2.

The proof also shows that the risk under least squares loss is minimised by taking h(x) = E(Y|X = x) (provided EY² < ∞).


1.3 Empirical risk minimisation

Empirical risk minimisation replaces the expectation over the unknown P₀ in the definition of the risk with the empirical distribution, and seeks to minimise the resulting objective over h ∈ H:

R̂(h) := (1/n) Σ_{i=1}^n ℓ(h(Xᵢ), Yᵢ),    ĥ ∈ argmin_{h∈H} R̂(h).

R̂(h) is the empirical risk or training error of h.

Example. Consider the regression setting with Y = R, squared error loss and H = {x ↦ µ + xᵀβ for µ ∈ R, β ∈ R^p}. Then empirical risk minimisation is equivalent to ordinary least squares, i.e. we have

ĥ(x) = µ̂ + β̂ᵀx where (µ̂, β̂) ∈ argmin_{(µ,β)∈R×R^p} (1/n) Σ_{i=1}^n (Yᵢ − µ − Xᵢᵀβ)².
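[A quick sanity check, added to these notes: the sketch below solves this empirical risk minimisation problem on simulated data via the closed-form least-squares solution, and confirms the ERM's training error is no larger than that of the data-generating parameters. The particular generating model is an assumption for illustration.]

import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -2.0, 0.5])
Y = 0.3 + X @ beta_true + rng.normal(size=n)

# ERM with squared error loss over H = {x -> mu + x^T beta} is ordinary least squares.
D = np.column_stack([np.ones(n), X])       # design matrix with an intercept column
coef, *_ = np.linalg.lstsq(D, Y, rcond=None)
mu_hat, beta_hat = coef[0], coef[1:]

def emp_risk(mu, beta):
    return np.mean((Y - mu - X @ beta) ** 2)

print(emp_risk(mu_hat, beta_hat) <= emp_risk(0.3, beta_true))  # True: the ERM minimises R-hat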

A good choice for the class H will result in a low generalisation error R(ĥ). This is a measure of how well we can expect the empirical risk minimiser (ERM) ĥ to predict a new data point (X_new, Y_new) ~ P₀ given only knowledge of X_new. Define h* ∈ argmin_{h∈H} R(h)² and consider the decomposition

R(ĥ) − R(h₀) = {R(ĥ) − R(h*)} + {R(h*) − R(h₀)},

where R(ĥ) − R(h*) is the stochastic error and R(h*) − R(h₀) is the approximation error.

Clearly a richer class H will decrease the approximation error. However, it will tend to increase the stochastic error as empirical risk minimisation will fit to the realised Y₁, …, Yₙ too closely and result in poor generalisation. There is thus a tradeoff between the stochastic error due to the complexity of the class H, and its approximation error.

We will primarily study the stochastic term or excess risk³, and aim to provide bounds on this in terms of the complexity of H. Recall that whilst for a fixed h ∈ H, R(h) is deterministic, R(ĥ) is a random variable. The bounds we obtain will be of the form "with probability at least 1 − δ,

R(ĥ) − R(h*) ≤ ε."

²If there is no h* that achieves the associated infimum, we can consider an approximate minimiser with R(h*) < inf_{h∈H} R(h) + ε for arbitrary ε > 0 and all our analysis will carry through. Similar reasoning is applicable to ĥ.
³Sometimes "excess risk" is used for R(ĥ) − R(h₀). However since we are considering H to be fixed in advance for much of the course, we will use excess risk to refer to the risk relative to that of h*.


2 Statistical learning theory

Consider the following decomposition of the excess risk:

R(ĥ) − R(h*) = {R(ĥ) − R̂(ĥ)} + {R̂(ĥ) − R̂(h*)} + {R̂(h*) − R(h*)}
             ≤ {R(ĥ) − R̂(ĥ)} + {R̂(h*) − R(h*)},

since R̂(ĥ) − R̂(h*) ≤ 0 by the definition of ĥ; the two remaining terms will be handled by concentration arguments.

Note that R̂(h*) is an average of n i.i.d. random variables, each with expectation R(h*). To bound R̂(h*) − R(h*) we will consider the general problem of how random variables concentrate around their expectation, a problem which is the topic of an important area of probability theory concerning concentration inequalities. The term R(ĥ) − R̂(ĥ) is more complicated as R̂(ĥ) is not a sum of i.i.d. random variables, but we will see how extensions of techniques for the simpler case may be used to tackle this.

2.1 Sub-Gaussianity and Hoeffding’s inequality

We begin our discussion of concentration inequalities with the simplest tail bound, Markov's inequality. Let W be a non-negative random variable. Taking expectations of both sides of t1_{W≥t} ≤ W for t > 0, we obtain after dividing through by t

P(W ≥ t) ≤ E(W)/t.

This immediately implies that given a strictly increasing function φ : R → [0, ∞) and any random variable W,

P(W ≥ t) = P(φ(W) ≥ φ(t)) ≤ E(φ(W))/φ(t).

Applying this with φ(t) = e^{αt} (α > 0) yields the so-called Chernoff bound:

P(W ≥ t) ≤ inf_{α>0} e^{−αt} E e^{αW}.

Example. Consider the case when W ~ N(0, σ²). Recall that

E e^{αW} = e^{α²σ²/2}.    (2.1)

Thus for t ≥ 0,

P(W ≥ t) ≤ inf_{α>0} e^{α²σ²/2 − αt} = e^{−t²/(2σ²)}.    (2.2)

Note that to arrive at this bound, all we required was (an upper bound on) the moment generating function (mgf) of W (2.1). This motivates the following definition.


Definition 1. We say a random variable W is sub-Gaussian with parameter σ > 0 if

E e^{α(W−EW)} ≤ e^{α²σ²/2} for all α ∈ R.

From (2.2) we immediately have the following result.

Proposition 2. If W is sub-Gaussian with parameter σ > 0, then

P(W − EW ≥ t) ≤ e^{−t²/(2σ²)} for all t ≥ 0.

Note that if W is sub-Gaussian with parameter σ > 0, then

• it is also sub-Gaussian with parameter σ′ for any σ′ ≥ σ;

• −W is also sub-Gaussian with parameter σ. This means we have from (2.2) that

P(|W − EW| ≥ t) ≤ P(W − EW ≥ t) + P(−(W − EW) ≥ t) ≤ 2e^{−t²/(2σ²)}.

Proposition 3. Suppose W₁, …, Wₙ are independent and each Wᵢ is sub-Gaussian with parameter σᵢ and has mean 0. Then for γ ∈ Rⁿ, γᵀW is sub-Gaussian with parameter (Σᵢ γᵢ²σᵢ²)^{1/2}.

Proof. Using independence for the first equality,

E exp(α Σ_{i=1}^n γᵢWᵢ) = Π_{i=1}^n E exp(αγᵢWᵢ) ≤ Π_{i=1}^n exp(α²γᵢ²σᵢ²/2) = exp((α²/2) Σ_{i=1}^n γᵢ²σᵢ²).

Combining with Proposition 2 we obtain

P((1/n) Σ_{i=1}^n (Wᵢ − EWᵢ) ≥ t) ≤ exp(−n²t² / (2 Σ_{i=1}^n σᵢ²)) for t ≥ 0.    (2.3)

As well as implying concentration around the mean, the bound on the mgf satisfied by sub-Gaussian random variables also offers a bound on the expected maximum of d sub-Gaussians. We do not need the following result at this stage, but will make use of it later.

Proposition 4. Suppose W₁, …, W_d are all mean-zero and sub-Gaussian with parameter σ > 0 (but are not necessarily independent). Then

E maxⱼ Wⱼ ≤ σ √(2 log d).


Proof. Let α > 0. By convexity of x ↦ exp(αx) and Jensen's inequality we have

exp(α E maxⱼ Wⱼ) ≤ E exp(α maxⱼ Wⱼ) = E maxⱼ exp(αWⱼ).

Now

E max_{j=1,…,d} exp(αWⱼ) ≤ Σ_{j=1}^d E exp(αWⱼ) ≤ d e^{α²σ²/2}.

Thus

E maxⱼ Wⱼ ≤ log(d)/α + ασ²/2.

Optimising over α > 0 yields the result.
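[The following simulation is an addition to these notes: it compares E maxⱼ Wⱼ for i.i.d. N(0, σ²) variables, which are sub-Gaussian with parameter σ, against the σ√(2 log d) bound of Proposition 4.]

import numpy as np

rng = np.random.default_rng(2)
sigma = 1.5

for d in [10, 100, 1000]:
    W = sigma * rng.normal(size=(5000, d))   # 5000 Monte Carlo replicates
    emax = W.max(axis=1).mean()              # estimate of E max_j W_j
    bound = sigma * np.sqrt(2 * np.log(d))
    print(f"d = {d:4d}: E max = {emax:.3f} <= bound {bound:.3f}")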

Gaussian random variables are sub-Gaussian, but the sub-Gaussian class is much broader than this.

Example. A Rademacher random variable ε takes values in {−1, 1} with equal probability. It is sub-Gaussian with parameter σ = 1:

E e^{αε} = (1/2)(e^{−α} + e^{α}) = (1/2)(Σ_{k=0}^∞ (−α)^k/k! + Σ_{k=0}^∞ α^k/k!) = Σ_{k=0}^∞ α^{2k}/(2k)!
         ≤ Σ_{k=0}^∞ α^{2k}/(2^k k!) = e^{α²/2}   (using (2k)! ≥ 2^k k!).    (2.4)

Recall that we are interested in the concentration properties of 1_{h(Xᵢ)≠Yᵢ} − P(h(Xᵢ) ≠ Yᵢ), which in particular is bounded.

Lemma 5 (Hoeffding’s lemma). If W is mean-zero and takes values in [a, b], then W issub-Gaussian with parameter σ = (b− a)/2.

Proof. We will prove a weaker result here with σ = b − a; see the Example sheet for a proof with σ = (b − a)/2. Let W′ be an independent copy of W. We have

E e^{αW} = E e^{α(W − EW′)}
         = E e^{E{α(W − W′) | W}}   (using E(W′) = E(W′|W) and E(W|W) = W)
         ≤ E e^{α(W − W′)}          (Jensen conditional on W and tower prop.).

Now W − W′ =ᵈ −(W − W′) =ᵈ ε(W − W′) where ε is Rademacher with ε independent of (W, W′). (Here "=ᵈ" means "equal in distribution".) Thus

E e^{αW} ≤ E e^{αε(W−W′)} = E{E(e^{αε(W−W′)} | W, W′)}.

We now apply our previous result (2.4) conditionally on (W, W′) to obtain

E e^{αW} ≤ E e^{α²(W−W′)²/2} ≤ e^{α²(b−a)²/2},

as |W − W′| ≤ b − a.

The introduction of an independent copy W′ and a Rademacher random variable here is an example of a symmetrisation argument; we will make use of this technique again later in the course. As an application of the result above, suppose W₁, …, Wₙ are independent, mean-zero and aᵢ ≤ Wᵢ ≤ bᵢ almost surely for all i. Then

P((1/n) Σ_{i=1}^n Wᵢ ≥ t) ≤ exp(−2n²t² / Σ_{i=1}^n (bᵢ − aᵢ)²) for t ≥ 0,    (2.5)

which is known as Hoeffding's inequality. We are now in a position to bound R(ĥ) − R(h*) when H is finite.

2.2 Finite hypothesis classes

Theorem 6. Suppose H is finite and ℓ takes values in [0, M]. Then with probability at least 1 − δ, the ERM ĥ satisfies

R(ĥ) − R(h*) ≤ M √(2(log|H| + log(1/δ))/n).

The assumption on ℓ includes as a special case misclassification loss. However the extra generality will prove helpful later in the course.

Proof. Recall that

R(ĥ) − R(h*) = R(ĥ) − R̂(ĥ) + R̂(ĥ) − R̂(h*) + R̂(h*) − R(h*),

where R̂(ĥ) − R̂(h*) ≤ 0. Now for each h, R̂(h) − R(h) is an average of mean-zero i.i.d. quantities of the form ℓ(h(Xᵢ), Yᵢ) − µ with µ = E ℓ(h(Xᵢ), Yᵢ), which take values in [−µ, M − µ]. For t > 0,

P(R(ĥ) − R(h*) > t) = P(R(ĥ) − R(h*) > t, ĥ ≠ h*)
                     ≤ P(R(ĥ) − R̂(ĥ) > t/2, ĥ ≠ h*) + P(R̂(h*) − R(h*) > t/2).

We can immediately apply Hoeffding's inequality to the second term to obtain

P(R̂(h*) − R(h*) ≥ t/2) ≤ exp(−nt²/(2M²)).

However the complicated dependence among the summands in R̂(ĥ) prevents this line of attack for bounding the first term. To tackle this issue, we note that when ĥ ≠ h*,

R(ĥ) − R̂(ĥ) ≤ max_{h∈H⁻} {R(h) − R̂(h)},

where H⁻ := H \ {h*}. We then have, using a union bound,

P(max_{h∈H⁻} {R(h) − R̂(h)} ≥ t/2) = P(∪_{h∈H⁻} {R(h) − R̂(h) ≥ t/2})
                                   ≤ Σ_{h∈H⁻} P(R(h) − R̂(h) ≥ t/2)
                                   ≤ |H⁻| exp(−nt²/(2M²)).

Thus

P(R(ĥ) − R(h*) > t) ≤ |H| exp(−nt²/(2M²)).

Writing δ := |H| exp(−nt²/(2M²)) and then expressing t in terms of δ gives the result.

Example. Consider a simple classification setting with Xᵢ ∈ [0, 1)². Let us divide [0, 1)² into m² disjoint squares R₁, …, R_{m²} ⊂ [0, 1)² of the form [r/m, (r+1)/m) × [s/m, (s+1)/m) for r, s = 0, …, m − 1. Let

Ŷⱼ = sgn(Σ_{i : Xᵢ∈Rⱼ} Yᵢ)

and define

ĥ_hist(x) = Σ_{j=1}^{m²} Ŷⱼ 1_{Rⱼ}(x).

Then ĥ_hist is equivalent to the ERM over the hypothesis class H consisting of the 2^{m²} classifiers, each corresponding to a way of assigning labels in {−1, 1} to each of the regions R₁, …, R_{m²}. The result above tells us that, with probability at least 1 − δ, the generalisation error (with misclassification loss) of ĥ_hist satisfies

R(ĥ_hist) − R(h*) ≤ m √(2(log 2 + log(1/δ)/m²)/n) ≤ m √(2(log 2 + log(1/δ))/n).

[In fact it can be shown that the approximation error R(h*) − R(h₀) → 0 if m → ∞ for any given P₀. Combining with the above, we then see that choosing e.g. m = n^{1/3} we can approach the Bayes risk for n sufficiently large.]
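[The histogram classifier is straightforward to implement. The following sketch is an addition to these notes, with a made-up data-generating distribution for illustration: it fits ĥ_hist with m = n^{1/3} and estimates its misclassification risk on fresh data.]

import numpy as np

rng = np.random.default_rng(4)

def eta(x):
    # Made-up regression function on [0, 1)^2, for illustration only.
    return np.where(x[:, 0] + x[:, 1] > 1, 0.8, 0.2)

def sample(n):
    X = rng.uniform(size=(n, 2))
    Y = np.where(rng.uniform(size=n) < eta(X), 1, -1)
    return X, Y

n = 10_000
m = int(round(n ** (1 / 3)))
X, Y = sample(n)

# Y-hat_j = sgn(sum of labels in cell j); empty cells get sgn(0) = -1.
idx = np.minimum((X * m).astype(int), m - 1)
sums = np.bincount(idx[:, 0] * m + idx[:, 1], weights=Y, minlength=m * m)
labels = np.where(sums > 0, 1, -1)

X_new, Y_new = sample(50_000)                # fresh data to estimate the risk
idx_new = np.minimum((X_new * m).astype(int), m - 1)
pred = labels[idx_new[:, 0] * m + idx_new[:, 1]]
print("estimated risk:", np.mean(pred != Y_new), "(the Bayes risk here is 0.2)")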

Whilst a union bound and Hoeffding's inequality sufficed to give us a bound in the case where H is finite, to handle the more common setting where H is infinite, we will need more sophisticated techniques. Our approach will be to view the key quantity

G(X₁, Y₁, …, Xₙ, Yₙ) := sup_{h∈H} {R(h) − R̂(h)}

as a function G of the i.i.d. random elements (X₁, Y₁), …, (Xₙ, Yₙ). We currently only have at our disposal concentration inequalities where the function takes the form of an average; however G will in general clearly be much more complex. Intuitively though, the key property of the empirical average that results in concentration is that the individual contributions of each of the random elements is not too large. Can we show that our G would, despite having an intractable form, nevertheless share this property in common with the empirical average?

Given data (x₁, y₁), …, (xₙ, yₙ) and ε > 0, let h̄ ∈ H be such that

G(x₁, y₁, …, xₙ, yₙ) < R(h̄) − R̂(h̄) + ε.

Now consider perturbing (wlog) the first pair of arguments of G. We have

G(x₁, y₁, …, xₙ, yₙ) − G(x′₁, y′₁, x₂, y₂, …, xₙ, yₙ)
  < R(h̄) − (1/n) Σ_{i=1}^n ℓ(h̄(xᵢ), yᵢ) − sup_{h∈H} (R(h) − (1/n)ℓ(h(x′₁), y′₁) − (1/n) Σ_{i=2}^n ℓ(h(xᵢ), yᵢ)) + ε
  ≤ (1/n){ℓ(h̄(x′₁), y′₁) − ℓ(h̄(x₁), y₁)} + ε,

taking h = h̄ in the supremum for the second inequality. As ε was arbitrary, if ℓ takes values in [0, M] we have

G(x₁, y₁, …, xₙ, yₙ) − G(x′₁, y′₁, x₂, y₂, …, xₙ, yₙ) ≤ M/n.

We thus seek a concentration inequality for multivariate functions where arbitrary perturbations of a single argument change the output by a bounded amount.

2.3 Bounded differences inequality

The result we are going to aim for is the so-called bounded differences inequality. Let us adopt the notation that for a sequence of vectors a_s, a_{s+1}, a_{s+2}, … (where the starting index s can be e.g. 0 or 1), a_{j:k} for j ≤ k is the subsequence a_j, …, a_k.

Theorem 7 (Bounded differences inequality). Let f : R^{d·n} → R satisfy the bounded differences property that

f(w₁, …, w_{i−1}, wᵢ, w_{i+1}, …, wₙ) − f(w₁, …, w_{i−1}, w′ᵢ, w_{i+1}, …, wₙ) ≤ Lᵢ

for all w₁, …, wₙ, w′ᵢ ∈ R^d and all i = 1, …, n. Suppose the random vectors W₁, …, Wₙ taking values in R^d are independent. Then for t ≥ 0,

P(f(W_{1:n}) − E f(W_{1:n}) ≥ t) ≤ exp(−2t² / Σ_{i=1}^n Lᵢ²).

Note that taking d = 1, f(W_{1:n}) = Σᵢ {Wᵢ − E(Wᵢ)}/n and assuming the Wᵢ take values in [aᵢ, bᵢ], we recover Hoeffding's inequality.

To motivate the proof, consider the sequence of random variables given by Z₀ = E f(W_{1:n}), Zₙ = f(W_{1:n}) and

Zᵢ = E(f(W_{1:n}) | W_{1:i}) for i = 1, …, n − 1.    (2.6)


Note that in the special case where f(W_{1:n}) = Σᵢ Wᵢ and EWᵢ = 0, we have Z_k − Z₀ = Σ_{i=1}^k Wᵢ. Our approach centres on the telescoping decomposition

f(W_{1:n}) − E f(W_{1:n}) = Zₙ − Z₀ = Σ_{i=1}^n (Zᵢ − Z_{i−1}) =: Σ_{i=1}^n Dᵢ;

the differences Dᵢ play an analogous role to the individual independent random variables in the case of bounding sums. These considerations help to motivate the definition of a martingale⁴, which encapsulates the key features of the Zᵢ above that will allow our arguments to go through.

Definition 2. A sequence of random variables Z₀, Z₁, …, Zₙ is a martingale sequence with respect to another sequence of random vectors W₀, W₁, …, Wₙ ∈ R^d if

(i) E|Zᵢ| < ∞ for i = 0, …, n,

(ii) Zᵢ is a function of W_{0:i} for i = 0, …, n,

(iii) E(Zᵢ | W_{0:(i−1)}) = Z_{i−1} for i = 1, …, n.

We call Dᵢ := Zᵢ − Z_{i−1} a martingale difference sequence with respect to W₀, W₁, …, Wₙ. We also extend the definitions to infinite sequences.

Example. Consider a random walk with Z₀ = 0 and Zᵢ = Z_{i−1} ± 1, each with probability 1/2. Then (Zᵢ)_{i≥1} is just a sequence of partial sums of i.i.d. Rademacher random variables and is a martingale sequence with respect to (Z_{i−1})_{i≥1}. Now suppose instead that Zᵢ = Z_{i−1} whenever Z_{i−1} = z for some fixed z ∈ Z, with the dynamics governed by the random walk otherwise. We still have

E(Zᵢ | Z_{0:(i−1)}) = z 1_{Z_{i−1}=z} + Z_{i−1} 1_{Z_{i−1}≠z} = Z_{i−1}.

Note that in this case the corresponding martingale difference sequence is not i.i.d. and exhibits some dependence.

Example. The sequence Z_{0:n} defined earlier via (2.6) is an example of a Doob martingale. Formally it is a martingale sequence with respect to W_{0:n} where we may set W₀ to an arbitrary constant. That (ii) holds is clear. (i) certainly holds when f is bounded, but holds more generally provided E|f(W_{1:n})| < ∞:

E|Zᵢ| = E|E{f(W_{1:n}) | W_{1:i}}| ≤ E|f(W_{1:n})| < ∞,

using Jensen's inequality conditional on W_{1:i} applied to the convex function |·|, and then the tower property. That (iii) holds follows from the tower property of conditional expectation.

We are now in a position to prove a generalisation of Proposition 3 applicable to averages of martingale differences.

⁴For a more general and formal definition of a martingale, see the Stochastic Financial Models course.


Lemma 8. Let D₁, …, Dₙ be a martingale difference sequence with respect to W₀, …, Wₙ ∈ R^d such that

E(e^{αDᵢ} | W_{0:(i−1)}) ≤ e^{α²σᵢ²/2} almost surely.

Let γ ∈ Rⁿ and write D = (D₁, …, Dₙ)ᵀ. Then γᵀD is sub-Gaussian with parameter (Σᵢ γᵢ²σᵢ²)^{1/2}.

Proof. We have, by the tower property and taking out what is known (Σ_{i=1}^{n−1} γᵢDᵢ is a function of W_{0:(n−1)}),

E exp(α Σ_{i=1}^n γᵢDᵢ) = E{exp(α Σ_{i=1}^{n−1} γᵢDᵢ) E(e^{αγₙDₙ} | W_{0:(n−1)})}
                        ≤ e^{α²γₙ²σₙ²/2} E exp(α Σ_{i=1}^{n−1} γᵢDᵢ)
                        ≤ exp((α²/2) Σ_{i=1}^n γᵢ²σᵢ²)   (by induction).

The Azuma–Hoeffding inequality applies the above result to the case of bounded random variables.

Theorem 9 (Azuma–Hoeffding). Let D₁, …, Dₙ be a martingale difference sequence with respect to W₀, …, Wₙ ∈ R^d. Suppose the following holds for each i = 1, …, n: there exist random variables Aᵢ and Bᵢ that are functions of W_{0:(i−1)} such that Aᵢ ≤ Dᵢ ≤ Bᵢ almost surely and Bᵢ − Aᵢ ≤ Lᵢ almost surely for a constant Lᵢ. Then for t ≥ 0,

P(Σ_{i=1}^n Dᵢ ≥ t) ≤ exp(−2t² / Σ_{i=1}^n Lᵢ²).    (2.7)

Proof. Conditional on W_{0:(i−1)}, Aᵢ and Bᵢ are constant. Thus we may apply Hoeffding's lemma (Lemma 5) conditionally on W_{0:(i−1)} to obtain

E(e^{αDᵢ} | W_{0:(i−1)}) ≤ e^{α²(Lᵢ/2)²/2} almost surely.

The martingale difference sequence thus satisfies the hypotheses of Lemma 8, so the sum Σᵢ Dᵢ is sub-Gaussian with parameter σ = (Σᵢ Lᵢ²)^{1/2}/2. The result then follows from the sub-Gaussian tail bound (Proposition 2).

We are finally ready to prove the Bounded differences inequality.

Proof of Theorem 7. Let D₁, …, Dₙ be the martingale difference sequence associated with the Doob martingale, so D₁ = E(f(W_{1:n}) | W₁) − E f(W_{1:n}) and, for i = 2, …, n,

Dᵢ = E(f(W_{1:n}) | W_{1:i}) − E(f(W_{1:n}) | W_{1:(i−1)}).

Recall that f(W_{1:n}) − E f(W_{1:n}) = Σ_{i=1}^n Dᵢ. Using the Azuma–Hoeffding inequality, it suffices to prove that Aᵢ ≤ Dᵢ ≤ Bᵢ almost surely where Aᵢ and Bᵢ are functions of W_{1:(i−1)} satisfying Bᵢ − Aᵢ ≤ Lᵢ for all i, which we now do.

Let us define for each i = 1, …, n the function Fᵢ : R^{d·i} → R given by

Fᵢ(w₁, …, wᵢ) := E(f(W_{1:n}) | W₁ = w₁, …, Wᵢ = wᵢ).

Then define the random variables

A₁ := inf_{w₁∈R^d} F₁(w₁) − E f(W_{1:n}),   Aᵢ := inf_{wᵢ∈R^d} Fᵢ(W_{1:(i−1)}, wᵢ) − E(f(W_{1:n}) | W_{1:(i−1)})   (i ≥ 2),
B₁ := sup_{w₁∈R^d} F₁(w₁) − E f(W_{1:n}),   Bᵢ := sup_{wᵢ∈R^d} Fᵢ(W_{1:(i−1)}, wᵢ) − E(f(W_{1:n}) | W_{1:(i−1)})   (i ≥ 2).

Note that, setting W₀ ≡ 0, Aᵢ and Bᵢ are functions of W_{0:(i−1)}. Furthermore for i = 2, …, n,

Dᵢ − Aᵢ = Fᵢ(W_{1:i}) − inf_{wᵢ∈R^d} Fᵢ(W_{1:(i−1)}, wᵢ) ≥ 0,
Dᵢ − Bᵢ = Fᵢ(W_{1:i}) − sup_{wᵢ∈R^d} Fᵢ(W_{1:(i−1)}, wᵢ) ≤ 0,

so Aᵢ ≤ Dᵢ ≤ Bᵢ. Also

Bᵢ − Aᵢ = sup_{wᵢ∈R^d} Fᵢ(W_{1:(i−1)}, wᵢ) − inf_{wᵢ∈R^d} Fᵢ(W_{1:(i−1)}, wᵢ)
        = sup_{wᵢ,w′ᵢ∈R^d} {Fᵢ(W_{1:(i−1)}, wᵢ) − Fᵢ(W_{1:(i−1)}, w′ᵢ)}
        = sup_{wᵢ,w′ᵢ∈R^d} {E(f(W_{1:(i−1)}, wᵢ, W_{(i+1):n}) | W_{1:(i−1)}, Wᵢ = wᵢ) − E(f(W_{1:(i−1)}, w′ᵢ, W_{(i+1):n}) | W_{1:(i−1)}, Wᵢ = w′ᵢ)}.

Now as the W_{1:n} are independent, the distribution of W_{(i+1):n} conditional on W_{1:(i−1)} and Wᵢ = wᵢ is equal to its unconditional distribution, and the same holds when conditioning on Wᵢ = w′ᵢ. Thus

Bᵢ − Aᵢ = sup_{wᵢ,w′ᵢ∈R^d} E{f(W_{1:(i−1)}, wᵢ, W_{(i+1):n}) − f(W_{1:(i−1)}, w′ᵢ, W_{(i+1):n}) | W_{1:(i−1)}} ≤ Lᵢ,

since the difference inside the expectation is at most Lᵢ by the bounded differences property. Analogous properties hold when i = 1.

We have verified all the conditions of the Azuma–Hoeffding inequality, which may now be applied to give the result.


2.4 Rademacher complexity

Recall our setup: H is a (now possibly infinite) hypothesis class, ℓ takes values in [0, M] and we are aiming to bound the right-hand side of

R(ĥ) − R(h*) ≤ G + R̂(h*) − R(h*),

where G := sup_{h∈H} {R(h) − R̂(h)}. The bounded differences inequality provides a means to bound G − EG, but in order to make use of this, we must find a way of bounding EG. Let us write Zᵢ = (Xᵢ, Yᵢ) for i = 1, …, n and

F := {(x, y) ↦ −ℓ(h(x), y) : h ∈ H}.    (2.8)

Then we have

G = sup_{f∈F} (1/n) Σ_{i=1}^n {f(Zᵢ) − E f(Zᵢ)}.

We will prove the following result, which applies for a general function class F (not necessarily coming from (2.8)).

Theorem 10. Let F be a class of real-valued functions f : Z → R and let Z₁, …, Zₙ be i.i.d. random elements taking values in Z. Then

E(sup_{f∈F} (1/n) Σ_{i=1}^n {f(Zᵢ) − E f(Zᵢ)}) ≤ 2Rₙ(F),

where Rₙ(F) is the Rademacher complexity of F defined by

Rₙ(F) := E(sup_{f∈F} (1/n) Σ_{i=1}^n εᵢ f(Zᵢ)).

Here ε₁, …, εₙ are i.i.d. Rademacher random variables independent of Z_{1:n}.

Some intuition: consider a classification problem with inputs Z₁, …, Zₙ and completely random labels ε₁, …, εₙ. The Rademacher complexity then captures how closely aligned the 'predictions' f(Zᵢ) can be to the random labels.

Before we prove Theorem 10, let us reflect on what it might achieve. Considering our main problem of bounding EG, a key challenge is that it depends strongly and in a complicated way on the unknown P₀. To understand the potential advantages of Rademacher complexity, it is helpful to introduce the following.

Definition 3. Let F be a class of real-valued functions f : Z → R and let z₁, …, zₙ ∈ Z. Writing

F(z_{1:n}) = {(f(z₁), …, f(zₙ)) : f ∈ F},

define the empirical Rademacher complexity

R(F(z_{1:n})) := E(sup_{f∈F} (1/n) Σ_{i=1}^n εᵢ f(zᵢ)),    (2.9)

where ε₁, …, εₙ are i.i.d. Rademacher random variables. Given i.i.d. random elements Z₁, …, Zₙ taking values in Z, we sometimes view the empirical Rademacher complexity as a random variable:

R(F(Z_{1:n})) := E(sup_{f∈F} (1/n) Σ_{i=1}^n εᵢ f(Zᵢ) | Z_{1:n}).

Note that R(F(z_{1:n})) is well-defined in that the right-hand side of (2.9) only depends on F(z_{1:n}), the behaviours of the functions in F on the fixed set of points z_{1:n}.

Key point: R(F(z_{1:n})) does not depend on P₀. It is conceivable that we could upper bound R(F(z_{1:n})) uniformly in z_{1:n} ∈ Zⁿ. We then immediately get a bound on Rₙ(F) = E{R(F(Z_{1:n}))} that is independent of P₀.
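[The following sketch is an addition to these notes: it estimates an empirical Rademacher complexity by Monte Carlo for the finite set of behaviours of the threshold classifiers x ↦ sgn(x − t) on n fixed points, and compares it with the bound √(2 log|F(z_{1:n})|/n) established in Lemma 12 below.]

import numpy as np

rng = np.random.default_rng(6)
n = 50
x = np.sort(rng.uniform(size=n))

# The n + 1 distinct behaviour vectors of h_t(x) = sgn(x - t) on x_1, ..., x_n.
A = np.array([np.where(x > t, 1.0, -1.0)
              for t in np.concatenate(([-1.0], x))])

eps = rng.choice([-1, 1], size=(100_000, n))      # Rademacher draws
emp_rad = np.max(eps @ A.T / n, axis=1).mean()    # E sup_f (1/n) sum_i eps_i f(z_i)
print("empirical Rademacher complexity =", round(emp_rad, 4))
print("Lemma 12 bound =", round(float(np.sqrt(2 * np.log(len(A)) / n)), 4))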

We now turn to the proof of the result, which uses a symmetrisation technique.

Proof of Theorem 10. Let us introduce an independent copy (Z′₁, …, Z′ₙ) of (Z₁, …, Zₙ). We have

sup_{f∈F} (1/n) Σ_{i=1}^n {f(Zᵢ) − E f(Zᵢ)}
    = sup_{f∈F} (1/n) Σ_{i=1}^n E{f(Zᵢ) − f(Z′ᵢ) | Z_{1:n}}   (independence of Z_{1:n} and Z′_{1:n})
    ≤ E(sup_{f∈F} (1/n) Σ_{i=1}^n {f(Zᵢ) − f(Z′ᵢ)} | Z_{1:n}).

Note we have used the fact that for any collection of random variables V_t, sup_{t′} E V_{t′} ≤ E sup_t V_t; this may easily be verified by removing the supremum over t′ and noting that the resulting inequality must hold for all t′. Now let ε₁, …, εₙ be i.i.d. Rademacher random variables, independent of Z_{1:n} and Z′_{1:n}. Then

sup_{f∈F} (1/n) Σ_{i=1}^n {f(Zᵢ) − f(Z′ᵢ)}  =ᵈ  sup_{f∈F} (1/n) Σ_{i=1}^n εᵢ{f(Zᵢ) − f(Z′ᵢ)}
    ≤ sup_{f∈F} (1/n) Σ_{i=1}^n εᵢ f(Zᵢ) + sup_{g∈F} (1/n) Σ_{i=1}^n {−εᵢ g(Zᵢ)}.

Noting that ε_{1:n} =ᵈ −ε_{1:n}, we have

E(sup_{f∈F} (1/n) Σ_{i=1}^n {f(Zᵢ) − f(Z′ᵢ)}) ≤ E(sup_{f∈F} (2/n) Σ_{i=1}^n εᵢ f(Zᵢ)) = 2Rₙ(F).

Taking expectations in the first display and combining gives the result.


Theorem 11 (Generalisation bound based on Rademacher complexity). Let F := {(x, y) ↦ ℓ(h(x), y) : h ∈ H} and suppose ℓ takes values in [0, M]. With probability at least 1 − δ,

R(ĥ) − R(h*) ≤ 2Rₙ(F) + M√(2 log(2/δ)/n).

Proof. Let G := sup_{h∈H} {R(h) − R̂(h)} and recall that

R(ĥ) − R(h*) ≤ G + R̂(h*) − R(h*) = (G − EG) + EG + R̂(h*) − R(h*).

Further recall that, viewing G as a function of Z₁, …, Zₙ where Zᵢ = (Xᵢ, Yᵢ), it satisfies a bounded differences property with constants Lᵢ = M/n. Thus the bounded differences inequality gives us that

P(G − EG ≥ t/2) ≤ exp(−t²n/(2M²)).

Hoeffding's inequality (or bounded differences with the average function) also gives P(R̂(h*) − R(h*) ≥ t/2) ≤ exp(−t²n/(2M²)). Also, noting that Rₙ(F) = Rₙ(−F), from Theorem 10 we have EG ≤ 2Rₙ(F). Thus taking t = M√(2 log(2/δ)/n) and applying a union bound gives the result.

2.5 VC dimension

All we need to do in order to bound the generalisation error is to obtain bounds on the Rademacher complexity. There are various ways of tackling this problem in general. Here, we will explore an approach suited to the classification setting with misclassification loss and F := {(x, y) ↦ ℓ(h(x), y) : h ∈ H}. Our bounds will be in terms of the number of behaviours of the function class on n points, |F(z_{1:n})|. Observe first that |F(z_{1:n})| = |H(x_{1:n})| where zᵢ = (xᵢ, yᵢ).

Lemma 12. We have R(F(z_{1:n})) ≤ √(2 log|F(z_{1:n})|/n) = √(2 log|H(x_{1:n})|/n).

Proof. Let d = |F(z_{1:n})| and let F′ := {f₁, …, f_d} be such that F(z_{1:n}) = F′(z_{1:n}) (so each fⱼ has a unique behaviour on z_{1:n}). For j = 1, …, d, let

Wⱼ = (1/n) Σ_{i=1}^n εᵢ fⱼ(zᵢ),

where ε_{1:n} are i.i.d. Rademacher random variables. Then R(F(z_{1:n})) = E maxⱼ Wⱼ. By Lemma 5 and Proposition 3, each Wⱼ is sub-Gaussian with parameter 1/√n. Thus we may apply Proposition 4 on the expected maximum of sub-Gaussian random variables to give the result.

As each h(xᵢ) ∈ {−1, 1}, we always have |H(x_{1:n})| ≤ 2ⁿ. Considering the result above, an interesting case then is when |H(x_{1:n})| grows more slowly than exponentially in n, e.g. polynomially in n.


Definition 4. Let F be a class of functions f : X → {a, b} with a ≠ b (e.g. {a, b} = {−1, 1}) and |F| ≥ 2.

• We say F shatters x_{1:n} ∈ Xⁿ if |F(x_{1:n})| = 2ⁿ.

• Define also s(F, n) := max_{x_{1:n}∈Xⁿ} |F(x_{1:n})|; this is known as the shattering coefficient.

• The VC dimension VC(F) is the largest integer n such that some x_{1:n} is shattered by F, or ∞ if no such n exists. Equivalently, VC(F) = sup{n ∈ N : s(F, n) = 2ⁿ}.

Example. Let X = R and consider F = {f_{a,b} : a, b ∈ R} where f_{a,b}(x) = 1_{[a,b)}(x). Consider n distinct points x₁ < · · · < xₙ. These divide up the real line into n + 1 intervals (−∞, x₁], (x₁, x₂], …, (x_{n−1}, xₙ], (xₙ, ∞). Now if a and a′ are in the same interval, and b and b′ are in the same interval, then (f_{a,b}(xᵢ))_{i=1}^n = (f_{a′,b′}(xᵢ))_{i=1}^n. Thus every possible behaviour (f_{a,b}(xᵢ))_{i=1}^n can be obtained by picking one of the n + 1 intervals for each of a and b, so

s(F, n) ≤ (n + 1)².

Now consider VC(F). Any x_{1:2} can be shattered, but with three points x₁ < x₂ < x₃, we can never have f(x₁) = f(x₃) = 1 but f(x₂) = 0. Thus VC(F) = 2.

It is a bit tedious to determine the shattering coefficient individually for each F and see whether it grows polynomially; we would like a more streamlined approach. Observe that in the previous example, we have s(F, n) ≤ (n + 1)^{VC(F)}. The usefulness of the VC dimension, named after its inventors Vladimir Vapnik and Alexey Chervonenkis, is due to the remarkable fact that this is true more generally. The result below is known as the Sauer–Shelah lemma.

Lemma 13 (Sauer–Shelah). Let F be a class with finite VC dimension d. Then

s(F, n) ≤ Σ_{i=0}^d C(n, i) ≤ (n + 1)^d,

where C(n, i) denotes the binomial coefficient.

What is striking about this result is that whilst we know from the definition that s(F, n) < 2ⁿ for all n > d, it is not immediately obvious that we cannot have s(F, n) = 2ⁿ − 1, or s(F, n) = 1.8ⁿ, for n > d. The result shows that beyond d the growth of s(F, n) is radically different in that it is polynomial. The important consequence of this is that from Lemma 12 we have

Rₙ(F) ≤ √(2 VC(F) log(n + 1) / n).

*Proof of Lemma 13*. This proof is non-examinable. We will prove the following stronger result. Fix x_{1:n} ∈ Xⁿ and, for any non-empty Q ⊆ {1, …, n} with Q = {i₁, …, i_{|Q|}}, let x_Q denote (x_{i₁}, …, x_{i_{|Q|}}). Then we claim that there are at least |F(x_{1:n})| − 1 non-empty sets Q ⊆ {1, …, n} such that F shatters x_Q.


That this implies the statement of the lemma may be seen from the following reasoning. Take x_{1:n} to be such that |F(x_{1:n})| = s(F, n) and (for a contradiction) such that

|F(x_{1:n})| > Σ_{i=0}^d C(n, i) = Σ_{i=1}^d C(n, i) + 1.

Then as the right-hand side is one more than the number of non-empty subsets Q ⊆ {1, …, n} of size at most d, we must have that some x_Q with |Q| > d is shattered by F, but this contradicts VC(F) = d.

It remains to prove the claim, which we do by induction on |F(x_{1:n})|. Wlog assume the functions in F map to {−1, 1}. The claim when |F(x_{1:n})| = 1 is clearly true (the statement is vacuous in this case). Now take k ≥ 1 and suppose the result is true for all n ∈ N, x_{1:n} ∈ Xⁿ and F with |F(x_{1:n})| ≤ k. We will show the result holds at k + 1. Take any n ∈ N, x_{1:n} ∈ Xⁿ and F with |F(x_{1:n})| = k + 1. Let xⱼ be such that F⁺ := {f ∈ F : f(xⱼ) = 1} and F⁻ := {f ∈ F : f(xⱼ) = −1} are both non-empty. Then |F⁺(x_{1:n})| and |F⁻(x_{1:n})| sum to |F(x_{1:n})| = k + 1 and both are non-zero.

Let X⁻ and X⁺ be the sets of subvectors x_Q that are shattered by F⁻ and F⁺ respectively. By the induction hypothesis, |X⁻| + |X⁺| ≥ k − 1. Clearly if x_Q ∈ X⁻ ∪ X⁺, then x_Q can be shattered by F. Now none of the subvectors in X⁻ ∪ X⁺ can have xⱼ as a component, as then the subvector could not be shattered (each subfamily of hypotheses has all f(xⱼ) taking the same value). But then when x_Q ∈ X⁻ ∩ X⁺, it must be the case that both x_Q and x_{Q∪{j}} (distinct) can be shattered by F. Also xⱼ itself is shattered. Thus we see that the number of sets shattered by F is at least

1 + |X⁻ ∪ X⁺| + |X⁻ ∩ X⁺| = 1 + |X⁻| + |X⁺| ≥ 1 + (k − 1) = k,

thereby completing the induction step.

An important class of hypotheses H is based on functions that form a vector space. Let F be a vector space of functions f : X → R, e.g. consider X = R^p and

F = {x ↦ xᵀβ : β ∈ R^p}.

From F form a class of hypotheses

H = {h : h(x) = sgn(f(x)) where f ∈ F}.    (2.10)

The following proposition bounds the VC dimension of H.

Proposition 14. Consider the hypothesis class H given by (2.10) where F is a vector space of functions. Then

VC(H) ≤ dim(F).

Proof. Let d = dim(F) + 1 and take x_{1:d} ∈ X^d. We need to show that x_{1:d} cannot be shattered by H. Consider the linear map L : F → R^d given by

L(f) = (f(x₁), …, f(x_d)) ∈ R^d.


The rank of L is at most dim(F) = d − 1 < d. Therefore there must exist a non-zero γ ∈ R^d orthogonal to everything in the image L(F), i.e.

Σ_{i : γᵢ>0} γᵢ f(xᵢ) + Σ_{i : γᵢ≤0} γᵢ f(xᵢ) = 0 for all f ∈ F,    (2.11)

where wlog at least one component of γ is strictly positive. Let I⁺ = {i : γᵢ > 0} and I⁻ = {i : γᵢ ≤ 0}. Then no f ∈ F can have both

f(xᵢ) > 0 for all i ∈ I⁺, and f(xᵢ) ≤ 0 for all i ∈ I⁻

(recall we are taking sgn(0) := −1), as otherwise the LHS of (2.11) would be strictly positive. Hence the labelling assigning h(xᵢ) = 1 for i ∈ I⁺ and h(xᵢ) = −1 for i ∈ I⁻ cannot be realised, so x_{1:d} cannot be shattered and VC(H) ≤ d − 1 as required.

3 Computation for empirical risk minimisation

The results of the previous section have given us a good understanding of the theoretical properties of the ERM ĥ corresponding to a given hypothesis class. We have not yet discussed whether ĥ can be computed in practice, and how to do so; these questions are the topic of this chapter.

For a general hypothesis class H, computation of the ERM ĥ can be arbitrarily hard. Things simplify greatly if computing ĥ may be equivalently phrased in terms of minimising a convex function over a convex set.

3.1 Basic properties of convex functions

Recall that a set C ⊆ R^d is convex if

x, y ∈ C ⇒ (1 − t)x + ty ∈ C for all t ∈ (0, 1).

In the following, let C ⊆ R^d be a convex set. A function f : C → R is convex if

f((1 − t)x + ty) ≤ (1 − t)f(x) + tf(y) for all x, y ∈ C and t ∈ (0, 1).

Then −f is a concave function. f is strictly convex if the inequality is strict for all x, y ∈ C with x ≠ y and all t ∈ (0, 1).

Convex functions exhibit a 'local to global' phenomenon: for example, local minima are necessarily global minima. Indeed, if x ∈ C is a local minimum, so that for each y ∈ C we have f((1 − t)x + ty) ≥ f(x) for all t sufficiently small, then by convexity

f(x) ≤ f((1 − t)x + ty) ≤ (1 − t)f(x) + tf(y),

so f(x) ≤ f(y) for all y ∈ C. On the other hand, non-convex functions can have many local minima whose objective values are far from the global minimum, which can make them very hard to optimise.

We collect together several useful properties of convex functions in the following proposition.


Proposition 15. In the following, let C ⊆ R^d be a convex set and let f : C → R be a convex function.

(i) Let g : C → R be a convex function. Then if a, b ≥ 0, af + bg is a convex function.

(ii) Let A ∈ R^{d×m} and b ∈ R^d and take C = R^d. Then g : R^m → R given by g(x) = f(Ax − b) is a convex function.

(iii) Suppose f_α : C → R is convex for all α ∈ I where I is some index set, and define g(x) := sup_{α∈I} f_α(x). Then

(a) D := {x ∈ C : g(x) < ∞} is convex, and

(b) g restricted to D is convex.

(iv) If f is differentiable at x ∈ int(C) then f(y) ≥ f(x) + ∇f(x)ᵀ(y − x) for all y ∈ C.

(v) If f : R^d → R is twice continuously differentiable then

(a) f is convex iff its Hessian matrix H(x) at x is positive semi-definite for all x,

(b) f is strictly convex if H(x) is positive definite for all x.

3.2 Convex surrogates

In the classification setting, one problem with using misclassification loss is that the ERM optimisation can be intractable for many hypothesis classes. For example, taking H based on half-spaces, the ERM problem minimises over β ∈ R^p the objective

Σ_{i=1}^n 1_{sgn(Xᵢᵀβ)≠Yᵢ} ≈ Σ_{i=1}^n 1_{(−∞,0]}(YᵢXᵢᵀβ)

(ignoring when Xᵢᵀβ = 0). The RHS is not convex, and in fact not continuous, due to the indicator function. If 1_{(−∞,0]} above were somehow replaced with a convex function, we know from Proposition 15 (ii) that the resulting objective would be a convex function of β. The minimising β̂ may still be able to deliver classification performance via x ↦ sgn(xᵀβ̂) that is comparable to that of the ERM, provided the convex function is a sufficiently good approximation to an indicator function.

These considerations motivate the following changes to the classification framework that we have been studying thus far.

• Rather than performing ERM over a set of classifiers, let us consider a family H of functions h : X → R. Each h ∈ H determines a classifier via x ↦ sgn(h(x)).

• We will consider loss functions ℓ : R × R → [0, ∞) of the form

ℓ(h(x), y) = φ(yh(x)),

where φ : R → [0, ∞) is convex. We will refer to the corresponding risk as the φ-risk and denote it by R_φ. Note formally we will be taking Y = R (even though the data (Yᵢ)_{i=1}^n are in {−1, 1}).

[Figure: the 0–1 loss together with the hinge, logistic and exponential losses φ(u), plotted as functions of u.]

Common choices of φ include the following (see the sketch after this list):

• Hinge loss: φ(u) = max(1 − u, 0).

• Exponential loss: φ(u) = e^{−u}.

• Logistic loss: φ(u) = log₂(1 + e^{−u}) = log(1 + e^{−u})/log 2.
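[The sketch below is an addition to these notes: the three surrogates are one-liners, and a grid check confirms that each upper bounds the misclassification loss 1_{u≤0}, in line with the figure above.]

import numpy as np

u = np.linspace(-3, 3, 601)

zero_one = (u <= 0).astype(float)       # recall the convention sgn(0) = -1
hinge = np.maximum(1 - u, 0)
exponential = np.exp(-u)
logistic = np.log2(1 + np.exp(-u))      # = log(1 + e^{-u}) / log 2

for name, phi in [("hinge", hinge), ("exponential", exponential), ("logistic", logistic)]:
    print(name, "upper bounds the 0-1 loss:", bool(np.all(phi >= zero_one)))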

For the strategy of using a surrogate loss to be useful, ERM with the surrogate loss should hopefully mimic using misclassification loss. For example, we would ideally like the h_{φ,0} that minimises R_φ (assuming it exists) to be such that x ↦ sgn(h_{φ,0}(x)) is (equivalent to) the Bayes classifier x ↦ sgn(η(x) − 1/2). To understand when this is the case, we introduce the following definitions.

The conditional φ-risk of h is

E(φ(Y h(X)) | X = x) = η(x)φ(h(x)) + (1 − η(x))φ(−h(x)),

where recall η(x) = P(Y = 1 | X = x). It will be helpful to consider this in terms of a generic conditional probability η ∈ [0, 1] and a generic value α ∈ R of h(x). We thus introduce

C_η(α) := ηφ(α) + (1 − η)φ(−α).

The following definition encapsulates our idea of sgn ∘ h_{φ,0} minimising the Bayes misclassification risk, but also allows for the possibility that inf_h R_φ(h) is not attained.

Definition 5. We say φ is classification calibrated if for any η ∈ [0, 1] with η ≠ 1/2,

inf_{α∈R} C_η(α) < inf_{α : α(2η−1)≤0} C_η(α).


In words, the display above says that the infimal generic conditional φ-risk is strictly less than the infimum where α (playing the role of h(x)) is forced to disagree in sign with the Bayes classifier. The following result tells us when the favourable case of classification calibration occurs for convex φ.

Theorem 16. Let φ be convex. Then φ is classification calibrated if it is differentiable at 0 and φ′(0) < 0.

Proof. Note that C_η is convex and differentiable at 0 with

C′_η(0) = (2η − 1)φ′(0).

Suppose η > 1/2, so C′_η(0) < 0. Then from Proposition 15 (iv),

C_η(α) ≥ C_η(0) + C′_η(0)α ≥ C_η(0)

for α ≤ 0. Also, as

0 > C′_η(0) = lim_{α↓0} (C_η(α) − C_η(0))/α,

for some α* > 0 we have C_η(α*) < C_η(0). Similarly when η < 1/2, there exists some α* < 0 with C_η(α*) < C_η(0). Thus in both cases inf_{α∈R} C_η(α) ≤ C_η(α*) < inf_{α : α(2η−1)≤0} C_η(α).

We thus see that the popular choices of φ above are all classification calibrated.

3.3 Rademacher complexity revisited

One remaining issue is whether we can obtain guarantees on when the generalisation error measured in terms of φ-risk is small. Theorem 11 gives us a bound in terms of the Rademacher complexity of

F = {(x, y) ↦ φ(yh(x)) : h ∈ H}.

Our bounds for Rₙ(F) involving shattering coefficients and VC dimension relied heavily on the use of misclassification loss, so we will need a different approach. One useful step would be to relate Rₙ(F) to Rₙ(H), which is potentially simpler to handle. The following result, sometimes known as the contraction lemma, helps in this regard.

Lemma 17 (Contraction lemma). Let r = sup_{x∈X, h∈H} |h(x)|. Suppose there exists L ≥ 0 with |φ(u) − φ(u′)| ≤ L|u − u′| for all u, u′ ∈ [−r, r], so φ is Lipschitz with constant L on [−r, r]. Then Rₙ(F) ≤ L Rₙ(H).

Proof. Let (x₁, y₁), …, (xₙ, yₙ) ∈ X × {−1, 1} and let ε₁, …, εₙ be a sequence of i.i.d. Rademacher random variables. Then, writing zᵢ = (xᵢ, yᵢ), we have

R(F(z_{1:n})) = E(sup_{h∈H} (1/n) Σ_{i=1}^n εᵢ φ(yᵢh(xᵢ))).


Let us consider z_{1:n} as fixed, and for any i write ε_{−i} for the sequence ε_{1:n} with εᵢ removed. We claim that for any (suitable) function A : H × {−1, 1}^{n−1} → R,

E sup_{h∈H} ((1/n)εᵢφ(yᵢh(xᵢ)) + A(h, ε_{−i})) ≤ E sup_{h∈H} ((L/n)εᵢh(xᵢ) + A(h, ε_{−i})).    (3.1)

Applying this with i = 1 and

A(h, ε_{−1}) = (1/n) Σ_{i=2}^n εᵢφ(yᵢh(xᵢ)),

we get

E sup_{h∈H} ((1/n)ε₁φ(y₁h(x₁)) + (1/n) Σ_{i=2}^n εᵢφ(yᵢh(xᵢ))) ≤ E sup_{h∈H} ((L/n)ε₁h(x₁) + (1/n) Σ_{i=2}^n εᵢφ(yᵢh(xᵢ))).    (3.2)

Next applying (3.1) with i = 2 and

A(h, ε_{−2}) = (1/n) Σ_{i=3}^n εᵢφ(yᵢh(xᵢ)) + (L/n)ε₁h(x₁),

we get that the RHS of (3.2) is at most

E sup_{h∈H} ((L/n) Σ_{i=1}^2 εᵢh(xᵢ) + (1/n) Σ_{i=3}^n εᵢφ(yᵢh(xᵢ))).

Continuing this argument yields the result. It remains to prove the claim, which we do now. We have

E(sup_{h∈H} ((1/n)εᵢφ(yᵢh(xᵢ)) + A(h, ε_{−i})) | ε_{−i})
  = (1/(2n)) [sup_{h∈H} {φ(yᵢh(xᵢ)) + nA(h, ε_{−i})} + sup_{h∈H} {−φ(yᵢh(xᵢ)) + nA(h, ε_{−i})}]
  = (1/(2n)) sup_{h,g∈H} {φ(yᵢh(xᵢ)) − φ(yᵢg(xᵢ)) + nA(h, ε_{−i}) + nA(g, ε_{−i})},

and φ(yᵢh(xᵢ)) − φ(yᵢg(xᵢ)) ≤ L|h(xᵢ) − g(xᵢ)| as |yᵢ| = 1. But by symmetry,

sup_{h,g∈H} {L|h(xᵢ) − g(xᵢ)| + nA(h, ε_{−i}) + nA(g, ε_{−i})}
  = sup_{h,g∈H} [L{h(xᵢ) − g(xᵢ)} + nA(h, ε_{−i}) + nA(g, ε_{−i})]
  = sup_{h∈H} {Lh(xᵢ) + nA(h, ε_{−i})} + sup_{h∈H} {−Lh(xᵢ) + nA(h, ε_{−i})}.

Hence

E(sup_{h∈H} ((1/n)εᵢφ(yᵢh(xᵢ)) + A(h, ε_{−i})) | ε_{−i}) ≤ E(sup_{h∈H} ((L/n)εᵢh(xᵢ) + A(h, ε_{−i})) | ε_{−i}).

Taking expectations proves the claim.

Corollary 18. Consider the setup of Lemma 17 and suppose r is finite. Suppose φ is non-increasing and let M = φ(−r). Then with probability at least 1 − δ,

R_φ(ĥ) − R_φ(h*) ≤ 2L Rₙ(H) + M√(2 log(2/δ)/n).

In order for the result above to be applicable when φ is e.g. one of the convex surrogates discussed earlier, we need H to be such that r is finite, so that M is finite. This will not hold for our example with X = R^p and

H = {x ↦ xᵀβ : β ∈ R^p}.

However, if we constrain the norm of β and X is a bounded subset of R^p, we can achieve this.

3.4 ℓ₂-constraint

Suppose X = {x ∈ R^p : ‖x‖₂ ≤ C} and consider

H = {x ↦ xᵀβ : β ∈ R^p and ‖β‖₂ ≤ λ}    (3.3)

for λ > 0. Then we have that for any x_{1:n} ∈ Xⁿ,

R(H(x_{1:n})) = (1/n) E(sup_{β:‖β‖₂≤λ} Σ_{i=1}^n εᵢxᵢᵀβ)
             ≤ (λ/n) E‖Σ_{i=1}^n εᵢxᵢ‖₂   (Cauchy–Schwarz)
             ≤ (λ/n) (E‖Σ_{i=1}^n εᵢxᵢ‖₂²)^{1/2},

where the last inequality follows from concavity of √· and Jensen's inequality. Now for i ≠ j, E(εᵢxᵢᵀxⱼεⱼ) = 0, so

E‖Σ_{i=1}^n εᵢxᵢ‖₂² = Σ_{i=1}^n ‖xᵢ‖₂² ≤ nC².

Thus

R(H(x_{1:n})) ≤ λC/√n.

Furthermore

sup_{x∈X, h∈H} |h(x)| = sup_{x:‖x‖₂≤C, β:‖β‖₂≤λ} xᵀβ = λC.


Example. Take φ to be the hinge loss and H given by (3.3). Then from Corollary 18, with probability at least 1 − δ,

R_φ(ĥ) − R_φ(h*) ≤ 2λC/√n + (λC + 1)√(2 log(2/δ)/n).

3.5 ℓ₁-constraint

The ℓ₁-norm of a vector u is ‖u‖₁ := Σᵢ |uᵢ| and the ℓ∞-norm is ‖u‖∞ := maxᵢ |uᵢ|. Suppose now that X = {x ∈ R^p : ‖x‖∞ ≤ C} and consider

H = {x ↦ xᵀβ : β ∈ R^p and ‖β‖₁ ≤ λ}.

To compute the Rademacher complexity of H, it is helpful to introduce the following definitions.

Definition 6.

• For a set S ⊆ R^d, the convex hull conv S is the intersection of all convex sets containing S.

• A point v ∈ R^d is a convex combination of v₁, …, v_m ∈ R^d if

v = α₁v₁ + · · · + α_m v_m,

where α₁, …, α_m ≥ 0 and Σ_{j=1}^m αⱼ = 1.

As we will see later, for any set A ⊆ Rⁿ,

R(A) = E(sup_{a∈A} (1/n) Σ_{i=1}^n aᵢεᵢ) = R(conv A),

and H(x_{1:n}) = conv A for a set A ⊆ Rⁿ with |A| = 2p. To show this we need a few results on convex hulls.

Lemma 19. For S ⊆ R^d, v ∈ conv S if and only if v is a convex combination of some set of points in S.

Proof. Let D be the set of all convex combinations of sets of points from S. We want to show

D ⊇ conv S,    (3.4)
D ⊆ conv S.    (3.5)

Showing (3.4) is a task on the example sheet. We show (3.5). Clearly a convex combination of any v₁, v₂ ∈ S is in conv S. Suppose then that for m ≥ 2, any convex combination of m points from S is in conv S. Take v₁, …, v_{m+1} ∈ S and α₁, …, α_{m+1} ≥ 0 with Σ_{j=1}^{m+1} αⱼ = 1. Consider v = Σ_{j=1}^{m+1} αⱼvⱼ. If α_{m+1} = 1, then v = v_{m+1} ∈ S ⊆ conv S. Otherwise, writing t = Σ_{j=1}^m αⱼ, we have t > 0 and α_{m+1} = 1 − t, so

v = t((α₁/t)v₁ + · · · + (α_m/t)v_m) + (1 − t)v_{m+1} ∈ conv S,

since (α₁/t)v₁ + · · · + (α_m/t)v_m ∈ conv S by the induction hypothesis and conv S is convex.

Lemma 20. Let S ⊆ R^d. For any linear map L : R^d → Rⁿ, conv L(S) = L(conv S).

Proof. u ∈ conv L(S) iff there exist v₁, …, v_m ∈ S and α₁, …, α_m ≥ 0 such that Σ_{j=1}^m αⱼ = 1 and

u = Σⱼ αⱼ L(vⱼ).

But the RHS is L(Σⱼ αⱼvⱼ) ∈ L(conv S), and u ∈ L(conv S) iff u takes this form.

Lemma 21. For any A ⊆ Rⁿ, R(A) = R(conv A).

Proof. See example sheet.

To use this, observe that if β has ‖β‖₁ = λ, then writing

β = λ Σ_{j=1}^p (|βⱼ|/λ) sgn(βⱼ) eⱼ,

we see that β ∈ conv S, where S = ∪_{j=1}^p {λeⱼ, −λeⱼ} and eⱼ is the jth standard basis vector. Next, if 0 < ‖β‖₁ < λ, then β is the convex combination

β = ((λ + ‖β‖₁)/(2λ)) (λ/‖β‖₁)β + ((λ − ‖β‖₁)/(2λ)) (−λ/‖β‖₁)β

of the points ±(λ/‖β‖₁)β ∈ conv S, so β ∈ conv S as conv S is convex (and β = 0 is the average of λe₁ and −λe₁). Then given x₁, …, xₙ, let L : R^p → Rⁿ be the linear map given by

L(β) = (x₁ᵀβ, …, xₙᵀβ)ᵀ.

Then H(x_{1:n}) = L(conv S) = conv L(S) from Lemma 20. Thus from Lemma 21 we have

R(H(x_{1:n})) = R(L(S)) = (λ/n) E(max_{j=1,…,p} |Σ_{i=1}^n εᵢxᵢⱼ|),

where ε₁, …, εₙ are i.i.d. Rademacher random variables. Now each ±Σ_{i=1}^n εᵢxᵢⱼ is sub-Gaussian with parameter

(Σ_{i=1}^n xᵢⱼ²)^{1/2} ≤ C√n   (Proposition 3).


Thus from Proposition 4 we have

R(H(x_{1:n})) ≤ (λ/n) × C√n × √(2 log|S|) = (λC/√n) √(2 log(2p)).

Also

sup_{x∈X, h∈H} |h(x)| = sup_{x:‖x‖∞≤C, β:‖β‖₁≤λ} xᵀβ = λC.

Example. Take φ to be the hinge loss and H as above. Suppose X = [−1, 1]^p. Then from Corollary 18, with probability at least 1 − δ,

R_φ(ĥ) − R_φ(h*) ≤ 2λ√(2 log(2p)/n) + (λ + 1)√(2 log(2/δ)/n).

In contrast, with H given by the ℓ₂-constraint (3.3) we would have a bound of order λ√(p/n).

Some notable differences are as follows.

• The dimension p contributes a factor of order √(log p) in the ℓ₁-constraint case, versus √p in the ℓ₂-constraint case.

• Suppose that the components of Xᵢ have independent U[−1, 1] distributions. Then for any fixed β, E(Xᵢᵀβ)² = ‖β‖₂²/3. In order for φ(YᵢXᵢᵀβ) = 0 we need |Xᵢᵀβ| ≥ 1, which suggests we need our constraint sets to contain some β for which ‖β‖₂² ≥ 3, say. But note that

β = (1/√p, …, 1/√p)ᵀ

has ‖β‖₂ = 1 but ‖β‖₁ = √p. Thus if every component of Xᵢ is equally important for predicting Yᵢ, an ℓ₁ constraint will typically not perform well. On the other hand,

β = (1/√s, …, 1/√s, 0, …, 0)ᵀ   (s non-zero entries)

has ‖β‖₂ = 1 and ‖β‖₁ = √s. If only the first s predictors are important and s is small, the ℓ₁ approach can perform well.

3.6 Projections on to convex sets

Empirical risk minimisation (with a convex surrogate) over the ℓ₂- and ℓ₁-constraint classes discussed above involves minimising a convex function subject to the minimiser lying in a convex set. In order to perform this optimisation it will be helpful to project points on to convex constraint sets.


Proposition 22. Let C ⊆ R^d be a closed convex set. Then for each x ∈ R^d, the minimiser of ‖x − z‖₂ over z ∈ C exists and is unique. Moreover, writing

π_C(x) = argmin_{z∈C} ‖x − z‖₂,

we have that for all x ∈ R^d,

(x − π_C(x))ᵀ(z − π_C(x)) ≤ 0 for all z ∈ C,    (3.6)
‖π_C(x) − π_C(z)‖₂ ≤ ‖x − z‖₂ for all z ∈ R^d.    (3.7)

Proof. Existence: Let µ = inf_{z∈C} ‖x − z‖₂ and write B = {w : ‖w − x‖₂ ≤ µ + 1}. Then

inf_{z∈C} ‖x − z‖₂ = inf_{z∈C∩B} ‖x − z‖₂,

and the RHS is an infimum of a continuous function over a closed and bounded set, so the infimum is achieved at π = π_C(x), say.

Uniqueness: For each fixed x, z ↦ ‖x − z‖₂² is a strictly convex function, so any minimiser over the convex set C must be unique (see example sheet).

(3.6): We have (1 − t)π + tz ∈ C for all t ∈ [0, 1], so

‖x − π‖₂² ≤ ‖x − π + t(π − z)‖₂² = ‖x − π‖₂² − 2t(x − π)ᵀ(z − π) + t²‖π − z‖₂²,

whence

(x − π)ᵀ(z − π) ≤ (t/2)‖π − z‖₂² for all t ∈ (0, 1].

Letting t → 0 shows (3.6).

(3.7): From (3.6) we have

(x − π_C(x))ᵀ(π_C(z) − π_C(x)) ≤ 0,
(z − π_C(z))ᵀ(π_C(x) − π_C(z)) ≤ 0.

Adding these we have

‖π_C(x) − π_C(z)‖₂² + (π_C(x) − π_C(z))ᵀ(z − x) ≤ 0.

Also

0 ≤ (1/2)‖π_C(x) − π_C(z) + (z − x)‖₂² = (1/2)‖π_C(x) − π_C(z)‖₂² + (π_C(x) − π_C(z))ᵀ(z − x) + (1/2)‖z − x‖₂².

Combining with the last display we get

(1/2)‖π_C(x) − π_C(z)‖₂² ≤ (1/2)‖z − x‖₂².


Definition 7. We call πC(x) above the projection of x on C.
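[For the constraint sets used earlier the projection has a closed form; for the ℓ₂-ball it is a simple rescaling. The sketch below is an addition to these notes and checks properties (3.6) and (3.7) numerically.]

import numpy as np

rng = np.random.default_rng(7)

def proj_l2_ball(x, lam):
    """Projection onto C = {z : ||z||_2 <= lam}: rescale x if it lies outside."""
    nrm = np.linalg.norm(x)
    return x if nrm <= lam else (lam / nrm) * x

lam, d = 1.0, 5
x, w = 3 * rng.normal(size=d), 3 * rng.normal(size=d)
px, pw = proj_l2_ball(x, lam), proj_l2_ball(w, lam)
z = proj_l2_ball(rng.normal(size=d), lam)          # an arbitrary point of C

print((x - px) @ (z - px) <= 1e-12)                       # (3.6)
print(np.linalg.norm(px - pw) <= np.linalg.norm(x - w))   # (3.7)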

Proposition 23. Let C ⊆ R^d be a closed convex set and let x ∉ C. Then there exist v ∈ R^d and ε > 0 such that for all z ∈ C,

vᵀz ≤ vᵀx − ε.

Proof. By Proposition 22,

(x − π_C(x))ᵀ(z − π_C(x)) ≤ 0 for all z ∈ C.

Define v = x − π_C(x). Note that ε := ‖v‖₂² > 0 as x ∉ C. Then

vᵀz ≤ vᵀπ_C(x) = vᵀ(x − v) = vᵀx − ε for all z ∈ C.

3.7 Subgradients

For a convex function f : R^d → R differentiable at x ∈ R^d, we have that

f(z) ≥ f(x) + ∇f(x)ᵀ(z − x) for all z ∈ R^d,

so in particular there is a hyperplane passing through (x, f(x)) that lies below the graph of the function. This also holds true more generally at points where f may not be differentiable, with ∇f(x) above replaced by a subgradient.

Definition 8. A vector $g \in \mathbb{R}^d$ is a subgradient of a convex function $f : \mathbb{R}^d \to \mathbb{R}$ at $x$ if
$$f(z) \geq f(x) + g^T(z - x) \quad \text{for all } z \in \mathbb{R}^d.$$
The set of subgradients of $f$ at $x$ is called the subdifferential of $f$ at $x$ and is denoted $\partial f(x)$.

Proposition 24. If $f : \mathbb{R}^d \to \mathbb{R}$ is convex, then $\partial f(x)$ is non-empty for all $x \in \mathbb{R}^d$.

*Proof*. The set $C = \{(z, y) \in \mathbb{R}^d \times \mathbb{R} : y \geq f(z)\}$ (known as the epigraph of $f$) is closed and convex. Take a sequence $w_1, w_2, \ldots \in \mathbb{R}^{d+1}$ such that $w_k \notin C$ for each $k$ and $w_k \to (x, f(x))$ as $k \to \infty$. Then from Proposition 23 we know that for each $k$, there exists $v_k \in \mathbb{R}^{d+1}$ with $v_k^Tw < v_k^Tw_k$ for all $w \in C$. Wlog we can take $\|v_k\|_2 = 1$ for each $k$, so the sequence $(v_k)$ lies in a closed bounded set. By the Bolzano–Weierstrass theorem, there exists a convergent subsequence $v_{k_j} \to v = (-g, \alpha)$ as $j \to \infty$. Then in particular
$$-g^Tz + \alpha y \leq -g^Tx + \alpha f(x) \quad \text{for all } (z, y) \in C.$$
Clearly this is only possible if $\alpha < 0$ (letting $y \to \infty$ rules out $\alpha > 0$, and $\alpha = 0$ would force the impossible $-g^Tz \leq -g^Tx$ for all $z$). Dividing by $\alpha$ and setting $y = f(z)$, we obtain
$$f(z) \geq f(x) + (g/\alpha)^T(z - x) \quad \text{for all } z,$$
so $g/\alpha \in \partial f(x)$.

To compute subgradients, the following facts will be helpful.


Proposition 25. Let $f : \mathbb{R}^d \to \mathbb{R}$ be convex, and suppose $f$ is differentiable at $x$. Then $\partial f(x) = \{\nabla f(x)\}$.

Proof. Suppose $g \in \mathbb{R}^d$ is a subgradient of $f$ at $x$. Then, for any $z \in \mathbb{R}^d$, we have
$$\nabla f(x)^Tz = \lim_{t \downarrow 0}\frac{f(x + tz) - f(x)}{t} \geq g^Tz.$$
In particular, taking $z = g - \nabla f(x)$, we have $\|\nabla f(x) - g\|_2^2 \leq 0$, so we must have $\nabla f(x) = g$.

Proposition 26 (Subgradient calculus). Let $f, f_1, f_2 : \mathbb{R}^d \to \mathbb{R}$ be convex. Then

(i) $\partial(\alpha f)(x) = \{\alpha g : g \in \partial f(x)\}$ for $\alpha > 0$,

(ii) $\partial(f_1 + f_2)(x) = \{g_1 + g_2 : g_1 \in \partial f_1(x),\ g_2 \in \partial f_2(x)\}$.

Also if $h : \mathbb{R}^m \to \mathbb{R}$ is given by $h(x) = f(Ax + b)$ where $A \in \mathbb{R}^{d \times m}$ and $b \in \mathbb{R}^d$, then

(iii) $\partial h(x) = A^T\partial f(Ax + b)$.

Example. Consider
$$f(\beta) = \frac{1}{n}\sum_{i=1}^n \max(1 - y_ix_i^T\beta, 0).$$
Let $\phi(u) = \max(1 - u, 0)$. Then
$$\partial\phi(u) = \begin{cases} \{0\} & \text{if } u > 1, \\ [-1, 0] & \text{if } u = 1, \\ \{-1\} & \text{if } u < 1. \end{cases}$$
By Proposition 26(iii), writing $h_i(\beta) = \max(1 - y_ix_i^T\beta, 0)$, we have $\partial h_i(\beta) = \{-y_ix_it : t \in [0,1]\}$ when $y_ix_i^T\beta = 1$. From Proposition 26(i) and (ii), we see that $\partial f(\beta)$ consists of sums of the form $-\frac{1}{n}\sum_{i=1}^n y_ix_it_i$ where each $t_i$ may be $0$, $1$ or anything in $[0,1]$ depending on the value of $y_ix_i^T\beta$.
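A short sketch of this calculation (our own illustration, making the particular choice $t_i = 1$ at the kink, which is one valid element of the subdifferential):

```python
import numpy as np

def hinge_subgradient(beta, X, y):
    """An element of the subdifferential of
    f(beta) = (1/n) sum_i max(1 - y_i x_i^T beta, 0),
    taking t_i = 1 whenever y_i x_i^T beta <= 1 (at the kink
    y_i x_i^T beta = 1 any t_i in [0, 1] would do)."""
    n = X.shape[0]
    t = (y * (X @ beta) <= 1).astype(float)          # t_i as in the text
    return -(X * (y * t)[:, None]).sum(axis=0) / n   # -(1/n) sum_i y_i x_i t_i
```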

3.8 Gradient descent

Suppose we wish to minimise a function $f$ that is differentiable at a point $\beta$ with gradient $g = \nabla f(\beta)$. A first-order Taylor expansion gives $f(z) \approx f(\beta) + g^T(z - \beta)$, so for small $\eta > 0$,
$$\min_{\delta : \|\delta\|_2 = 1} f(\beta + \eta\delta) \approx f(\beta) + \eta\min_{\delta : \|\delta\|_2 = 1} g^T\delta.$$
Thus to minimise the linear approximation of $f$ at $\beta$, one should move in the direction of the negative gradient.

The procedure of (projected) gradient descent for minimising $f$ over a closed convex set $C$ uses this intuition to produce a sequence of iterates $\beta_1, \beta_2, \ldots$, aiming to have $f(\beta_s)$ close to the minimum $f(\beta^*)$ for large $s$.


Algorithm 1 Gradient descent

Input: $\beta_1 \in C$; number of iterations $k \in \mathbb{N}$; sequence of positive step sizes $(\eta_s)_{s=1}^{k-1}$
for $s = 1$ to $k - 1$ do
  Compute $g_s \in \partial f(\beta_s)$
  $z_{s+1} = \beta_s - \eta_s g_s$
  $\beta_{s+1} = \pi_C(z_{s+1})$
end for
return $\bar\beta = \frac{1}{k}\sum_{s=1}^k \beta_s$
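A minimal sketch of Algorithm 1 (our own code; `subgrad` is assumed to return an element of $\partial f$ and `project` to compute $\pi_C$, for instance the ball projections sketched in Section 3.6):

```python
import numpy as np

def projected_gradient_descent(subgrad, project, beta1, k, eta):
    """Algorithm 1: projected subgradient descent with constant step
    size eta, returning the average of the k iterates."""
    betas = [beta1]
    beta = beta1
    for s in range(k - 1):
        g = subgrad(beta)
        beta = project(beta - eta * g)   # z_{s+1}, then beta_{s+1} = pi_C(z_{s+1})
        betas.append(beta)
    return np.mean(betas, axis=0)        # the averaged iterate
```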

Theorem 27. Suppose $\beta^*$ is a minimiser of a convex function $f : \mathbb{R}^p \to \mathbb{R}$ over a closed convex set $C \subseteq \mathbb{R}^p$. Suppose $\sup_{\beta \in C}\|\beta\|_2 \leq R < \infty$ and $\sup_{\beta \in C}\sup_{g \in \partial f(\beta)}\|g\|_2 \leq L < \infty$. Then if $\eta_s \equiv \eta = 2R/(L\sqrt{k})$, the output $\bar\beta$ of the gradient descent algorithm above satisfies
$$f(\bar\beta) - f(\beta^*) \leq \frac{2LR}{\sqrt{k}}.$$

Proof. We have
$$f(\beta_s) - f(\beta^*) \leq g_s^T(\beta_s - \beta^*) \qquad \text{(definition of subgradient)}$$
$$= -\frac{1}{\eta}(z_{s+1} - \beta_s)^T(\beta_s - \beta^*)$$
$$= \frac{1}{2\eta}\left\{\|\beta_s - z_{s+1}\|_2^2 + \|\beta_s - \beta^*\|_2^2 - \|z_{s+1} - \beta^*\|_2^2\right\}. \qquad (3.8)$$
From Proposition 22, $\|\pi_C(z) - \pi_C(x)\|_2 \leq \|z - x\|_2$, so in particular
$$\|z_{s+1} - \beta^*\|_2^2 \geq \|\beta_{s+1} - \beta^*\|_2^2.$$
Using this and (3.8),
$$f(\beta_s) - f(\beta^*) \leq \frac{1}{2\eta}\left\{\eta^2\|g_s\|_2^2 + \|\beta_s - \beta^*\|_2^2 - \|\beta_{s+1} - \beta^*\|_2^2\right\}. \qquad (3.9)$$
Now $\|g_s\|_2 \leq L$. Also $\beta_1, \beta^* \in C$ so $\|\beta_1 - \beta^*\|_2^2 \leq 4R^2$. Thus summing and telescoping we get
$$\frac{1}{k}\sum_{s=1}^k f(\beta_s) - f(\beta^*) \leq \frac{\eta L^2}{2} + \frac{1}{2\eta k}\left(\|\beta_1 - \beta^*\|_2^2 - \|\beta_{k+1} - \beta^*\|_2^2\right) \leq \frac{\eta L^2}{2} + \frac{2R^2}{\eta k}.$$
Taking the minimising $\eta = 2R/(L\sqrt{k})$ and using Jensen's inequality to give $f(\bar\beta) \leq \frac{1}{k}\sum_{s=1}^k f(\beta_s)$, we get the result.


Example. Consider ERM with hinge loss, $\mathcal{X} = \{x \in \mathbb{R}^p : \|x\|_2 \leq C\}$ and the $\ell_2$-constrained hypothesis class $\mathcal{H} = \{x \mapsto x^T\beta : \|\beta\|_2 \leq \lambda\}$. Then a subgradient of the objective function $f$ at $\beta$ takes the form
$$g = -\frac{1}{n}\sum_{i=1}^n y_ix_it_i \quad \text{where } t_i \in [0,1].$$
Thus $\|g\|_2 \leq C$ by the triangle inequality. From Theorem 27 we see that the output $\bar\beta$ of gradient descent with step size $\eta = 2\lambda/(C\sqrt{k})$ satisfies $f(\bar\beta) - f(\beta^*) \leq 2C\lambda/\sqrt{k}$.
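Putting the sketches of the last few pages together for this example (synthetic data; `projected_gradient_descent`, `hinge_subgradient` and `project_l2_ball` are the illustrative helpers defined above):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, lam = 200, 10, 2.0
beta_true = np.zeros(p); beta_true[:3] = 1.0
X = rng.uniform(-1, 1, size=(n, p))          # so ||x_i||_2 <= sqrt(p) =: C
y = np.sign(X @ beta_true + 0.1 * rng.normal(size=n))

k = 1000
C = np.sqrt(p)
eta = 2 * lam / (C * np.sqrt(k))             # the step size from Theorem 27
beta_bar = projected_gradient_descent(
    subgrad=lambda b: hinge_subgradient(b, X, y),
    project=lambda b: project_l2_ball(b, lam),
    beta1=np.zeros(p), k=k, eta=eta)
print(np.mean(np.sign(X @ beta_bar) == y))   # training accuracy
```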

3.9 Stochastic gradient descent

One issue with gradient descent is that the gradients themselves may be computationally expensive to compute: in the case of ERM the gradient is a sum of $n$ terms corresponding to each data point, and so computing the gradient typically involves a sweep over the entire dataset at each iteration.

Stochastic gradient descent can circumvent this issue in the case of minimising convex functions of the form $f(\beta) = \mathbb{E}f(\beta; U)$, where

• $f : \mathbb{R}^p \times \mathcal{U} \to \mathbb{R}$ is such that $\beta \mapsto f(\beta; u)$ is convex for all $u \in \mathcal{U}$,

• $U$ is a random variable taking values in $\mathcal{U}$.

This encompasses empirical risk minimisation. Indeed let $U$ be uniformly distributed on $\{1, \ldots, n\}$. Then the ERM objective function with $\mathcal{H} = \{h_\beta : \beta \in C\}$ may be written as
$$\frac{1}{n}\sum_{i=1}^n \ell(h_\beta(x_i), y_i) = \mathbb{E}\,\ell(h_\beta(x_U), y_U) = \mathbb{E}f(\beta; U).$$
Note we are thinking of the data $(x_1, y_1), \ldots, (x_n, y_n)$ as fixed; only $U$ is random.

Algorithm 2 Stochastic gradient descent

Input: $\beta_1 \in C$; number of iterations $k \in \mathbb{N}$; sequence of positive step sizes $(\eta_s)_{s=1}^{k-1}$; i.i.d. copies $U_1, \ldots, U_{k-1}$ of $U$
for $s = 1$ to $k - 1$ do
  Compute $g_s \in \partial f(\beta_s; U_s)$ (to be interpreted as $g_s \in \partial h(\beta_s)$ where $h(\beta) = f(\beta; U_s)$)
  $z_{s+1} = \beta_s - \eta_s g_s$
  $\beta_{s+1} = \pi_C(z_{s+1})$
end for
return $\bar\beta = \frac{1}{k}\sum_{s=1}^k \beta_s$

The key point to note is that computing $g_s$ involves just a single data point $(x_{U_s}, y_{U_s})$.
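A sketch of Algorithm 2 specialised to ERM (our own code; `subgrad_i(beta, u)` should return a subgradient of the single-observation loss $f(\beta; u)$):

```python
import numpy as np

def stochastic_gradient_descent(subgrad_i, project, beta1, k, eta, n, rng):
    """Algorithm 2 for ERM: at each step draw U_s uniformly from
    {0, ..., n-1} and step along a subgradient of f(beta; U_s)."""
    betas = [beta1]
    beta = beta1
    for s in range(k - 1):
        u = rng.integers(n)                             # U_s
        beta = project(beta - eta * subgrad_i(beta, u))
        betas.append(beta)
    return np.mean(betas, axis=0)

# For the hinge loss, a per-observation subgradient might look like:
def hinge_subgradient_single(beta, u, X, y):
    margin = y[u] * (X[u] @ beta)
    return -y[u] * X[u] if margin <= 1 else np.zeros_like(beta)
# usage: subgrad_i = lambda b, u: hinge_subgradient_single(b, u, X, y)
```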


Theorem 28. Suppose $\beta^*$ is a minimiser of $f$ as above over a closed convex set $C \subseteq \mathbb{R}^p$. Suppose $\sup_{\beta \in C}\|\beta\|_2 \leq R < \infty$ and $\sup_{\beta \in C}\mathbb{E}\big(\sup_{g \in \partial f(\beta; U)}\|g\|_2^2\big) \leq L^2 < \infty$. Then if $\eta_s \equiv \eta = 2R/(L\sqrt{k})$, the output $\bar\beta$ of the stochastic gradient descent algorithm above satisfies
$$\mathbb{E}f(\bar\beta) - f(\beta^*) \leq \frac{2LR}{\sqrt{k}}.$$

Proof. Let $\bar g_s = \mathbb{E}(g_s|\beta_s)$. Then $\bar g_s \in \partial f(\beta_s)$. Indeed we have $f(\beta; U_s) \geq f(\beta_s; U_s) + g_s^T(\beta - \beta_s)$ for all $\beta$. Note $U_s$ is independent of $\beta_s$, so taking expectations conditional on $\beta_s$ shows $\bar g_s \in \partial f(\beta_s)$. Then arguing as in the proof of Theorem 27,
$$f(\beta_s) - f(\beta^*) \leq \bar g_s^T(\beta_s - \beta^*) = \mathbb{E}\{g_s^T(\beta_s - \beta^*)\,|\,\beta_s\}$$
$$= -\frac{1}{\eta}\mathbb{E}\{(z_{s+1} - \beta_s)^T(\beta_s - \beta^*)\,|\,\beta_s\}$$
$$= \frac{1}{2\eta}\mathbb{E}\{\|\beta_s - z_{s+1}\|_2^2 + \|\beta_s - \beta^*\|_2^2 - \|z_{s+1} - \beta^*\|_2^2\,|\,\beta_s\}$$
$$\leq \frac{1}{2\eta}\mathbb{E}\{\eta^2\|g_s\|_2^2 + \|\beta_s - \beta^*\|_2^2 - \|\beta_{s+1} - \beta^*\|_2^2\,|\,\beta_s\} \qquad \text{(Prop. 22)}.$$
Taking expectations and summing we get
$$\mathbb{E}\Big(\frac{1}{k}\sum_{s=1}^k f(\beta_s)\Big) - f(\beta^*) \leq \frac{\eta L^2}{2} + \frac{2R^2}{\eta k}.$$
Taking $\eta = 2R/(L\sqrt{k})$ and using Jensen's inequality we get the result.

4 Popular machine learning methods

In the course so far, we have developed a coherent framework giving statistical and computational guarantees for a variety of procedures. However many popular machine learning methods do not fall precisely within this framework. In this last part of the course, we will briefly describe a selection of such methods in routine use today. We begin by discussing an important technique for selecting tuning parameters for machine learning methods (e.g. the $\lambda$ in the cases of $\ell_1$- and $\ell_2$-constrained hypotheses), or more generally selecting a good classifier or regression method from among a number of competing methods.

4.1 Cross-validation

Let $H^1, \ldots, H^m$ be a collection of machine learning methods: each $H^j$ takes as its argument i.i.d. training data $(X_i, Y_i)_{i=1}^n =: D$ and outputs a hypothesis, so $H^j_D : \mathcal{X} \to \mathbb{R}$. Given a loss function $\ell$, we may ideally want to pick a $j$ such that
$$\mathbb{E}\{\ell(H^j_D(X), Y)\,|\,D\} \qquad (4.1)$$


is minimised. Here $(X, Y) \in \mathcal{X} \times \mathcal{Y}$ is independent of $D$ and has the same distribution as $(X_1, Y_1)$. This $j$ is such that conditional on the original training data, it minimises the expected loss on a new observation drawn from the same distribution as the training data.

A less ambitious goal is to find a $j$ to minimise
$$\mathbb{E}[\mathbb{E}\{\ell(H^j_D(X), Y)\,|\,D\}] \qquad (4.2)$$
where compared with (4.1), we have taken a further expectation over the training data $D$. We still have no way of computing (4.2) directly, but we can attempt to estimate it.

The idea of $v$-fold cross-validation is to split the data into $v$ groups or folds of roughly equal size. Let $D_{-k}$ be all the data except that in the $k$th fold, and let $A_k \subset \{1, \ldots, n\}$ be the observation indices corresponding to the $k$th fold. For each $j$ we apply $H^j$ to data $D_{-k}$ to obtain hypothesis $H^j_{-k} := H^j_{D_{-k}}$. We choose the value of $j$ that minimises
$$\mathrm{CV}(j) := \frac{1}{n}\sum_{k=1}^v\sum_{i \in A_k}\ell(H^j_{-k}(X_i), Y_i). \qquad (4.3)$$

Writing $\hat j$ for the minimiser, we may take the final selected hypothesis to be $H^{\hat j}_D$.

Note that for each $i \in A_k$,
$$\mathbb{E}\,\ell(H^j_{-k}(X_i), Y_i) = \mathbb{E}[\mathbb{E}\{\ell(H^j_{-k}(X_i), Y_i)\,|\,D_{-k}\}]. \qquad (4.4)$$
This is precisely the expected loss in (4.2) but with training data $D$ replaced with a training data set of smaller size. If all the folds have the same size, then $\mathrm{CV}(j)$ is an average of $n$ identically distributed quantities, each with expected value as in (4.4). However, the quantities being averaged are not independent as they share the same data.

Thus cross-validation gives a biased estimate of the expected prediction error. The amount of the bias depends on the size of the folds, the case $v = n$ giving the least bias; this is known as leave-one-out cross-validation. The quality of the estimate, though, may be worse as the quantities being averaged in (4.3) will be highly positively correlated. Typical choices of $v$ are 5 or 10.
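A compact sketch of $v$-fold cross-validation as in (4.3) (names `cross_validate`, `methods` and `loss` are ours; each method maps training data to a prediction function, and `loss` is vectorised):

```python
import numpy as np

def cross_validate(methods, X, y, loss, v=5, rng=None):
    """Return the index j minimising CV(j) in (4.3), together with the
    CV estimates themselves."""
    rng = rng or np.random.default_rng()
    n = len(y)
    idx = rng.permutation(n)
    folds = np.array_split(idx, v)               # the folds A_1, ..., A_v
    cv = np.zeros(len(methods))
    for A_k in folds:
        train = np.setdiff1d(idx, A_k)           # indices of D_{-k}
        for j, H in enumerate(methods):
            h = H(X[train], y[train])            # hypothesis H^j_{-k}
            cv[j] += loss(h(X[A_k]), y[A_k]).sum()
    return int(np.argmin(cv)), cv / n

# e.g. loss = lambda yhat, y: (np.sign(yhat) != y).astype(float)  # misclassification
```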

4.1.1 *Stacking*

Cross-validation aims to allow us to choose the single best machine learning method; we could instead aim to find the best weighted combination of methods. To do this, we can attempt to minimise
$$\frac{1}{n}\sum_{k=1}^v\sum_{i \in A_k}\ell\Big(\sum_{j=1}^m w_jH^j_{-k}(X_i),\ Y_i\Big)$$
over $w$ in the convex set
$$\{u \in \mathbb{R}^m : u_j \geq 0 \text{ for all } j\}.$$
Additional $\ell_1$ or $\ell_2$ constraints may be added to the set. This sort of idea is known as stacking and it can often outperform cross-validation.
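A sketch of stacking with squared error (our own illustration: `P[i, j]` holds the out-of-fold prediction $H^j_{-k}(X_i)$, and the nonnegativity constraint is enforced by projecting onto the orthant, which is just coordinatewise clipping):

```python
import numpy as np

def stack_weights(P, y, k=2000, eta=0.01):
    """Nonnegative stacking weights minimising the squared error version
    of the stacking objective, by projected gradient descent."""
    n, m = P.shape
    w = np.full(m, 1 / m)
    for _ in range(k):
        grad = 2 * P.T @ (P @ w - y) / n
        w = np.maximum(w - eta * grad, 0)   # projection onto {w : w_j >= 0}
    return w
```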


4.2 Adaboost

Empirical risk minimisation is a technique for finding a single good hypothesis from a given class $\mathcal{H}$. In an analogy with stacking, we could alternatively attempt to find a good weighted combination of hypotheses. Specifically, given an initial set $\mathcal{H}$ of classifiers $h : \mathcal{X} \to \{-1, 1\}$, consider the class
$$\mathcal{H}_M = \Big\{\sum_{m=1}^M \beta_mh_m : \beta_m \in \mathbb{R},\ h_m \in \mathcal{H} \text{ for } m = 1, \ldots, M\Big\}.$$

This is clearly a richer class than $\mathcal{H}$ itself and turns out to be a useful way of creating a more complex hypothesis class from a simpler one, with the tuning parameter $M$ controlling the complexity. Performing ERM over $\mathcal{H}_M$, however, can be computationally challenging. The Adaboost algorithm can be motivated as a greedy empirical risk minimisation procedure with exponential loss. As we shall see, one attractive feature of the algorithm is that it only relies on being able to perform ERM over the simpler class $\mathcal{H}$ given different weighted versions of the data.

Adaboost first sets $f_0$ to be the function $x \mapsto 0$ and then performs the following for $m = 1, \ldots, M$:
$$(\beta_m, h_m) = \operatorname*{arg\,min}_{\beta \in \mathbb{R},\,h \in \mathcal{H}} \frac{1}{n}\sum_{i=1}^n \exp[-Y_i\{f_{m-1}(X_i) + \beta h(X_i)\}],$$
$$f_m = f_{m-1} + \beta_mh_m.$$

The final classification is performed according to $\mathrm{sgn} \circ f_M$. Let us examine the minimisation above in more detail. Set $w^{(m)}_i = n^{-1}\exp(-Y_if_{m-1}(X_i))$. Then
$$\frac{1}{n}\sum_{i=1}^n \exp[-Y_i\{f_{m-1}(X_i) + \beta h(X_i)\}] = e^{\beta}\sum_{i=1}^n w^{(m)}_i1_{\{h(X_i) \neq Y_i\}} + e^{-\beta}\sum_{i=1}^n w^{(m)}_i1_{\{h(X_i) = Y_i\}}$$
$$= (e^{\beta} - e^{-\beta})\sum_{i=1}^n w^{(m)}_i1_{\{h(X_i) \neq Y_i\}} + e^{-\beta}\sum_{i=1}^n w^{(m)}_i.$$

Provided no $h \in \mathcal{H}$ perfectly classifies the data, so that
$$\mathrm{err}_m(h) := \frac{\sum_{i=1}^n w^{(m)}_i1_{\{h(X_i) \neq Y_i\}}}{\sum_{i=1}^n w^{(m)}_i} > 0 \quad \text{for all } h \in \mathcal{H},$$
we have that
$$h_m = \operatorname*{arg\,min}_{h \in \mathcal{H}}\mathrm{err}_m(h) \qquad \text{and} \qquad \beta_m = \frac{1}{2}\log\bigg(\frac{1 - \mathrm{err}_m(h_m)}{\mathrm{err}_m(h_m)}\bigg).$$

If $M$ is large, the weighted empirical risk minimisation step to produce the $h_m$ must be performed many times. In order for this approach to be practical, we need $\mathcal{H}$ to be such that these optimisations can be done very fast. More generally, the $h_m$ need not be formed through ERM but may be the output of some machine learning method.
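A sketch of the algorithm (our own code; `weak_learner(X, y, w)` stands in for weighted ERM over $\mathcal{H}$ and should return a classifier with values in $\{-1, +1\}$; as in the derivation above we assume $0 < \mathrm{err}_m(h_m) < 1$):

```python
import numpy as np

def adaboost(X, y, weak_learner, M):
    """Adaboost viewed as greedy ERM with exponential loss."""
    n = len(y)
    f = np.zeros(n)                       # f_m evaluated at the training points
    hs, betas = [], []
    for m in range(M):
        w = np.exp(-y * f) / n            # weights w_i^{(m)}
        h = weak_learner(X, y, w)
        pred = h(X)
        err = w[pred != y].sum() / w.sum()       # err_m(h_m)
        beta = 0.5 * np.log((1 - err) / err)     # beta_m
        f += beta * pred
        hs.append(h); betas.append(beta)
    return lambda Xnew: np.sign(sum(b * h(Xnew) for b, h in zip(betas, hs)))
```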


Example. Let $\mathcal{X} = \mathbb{R}^p$ and consider the class of decision stumps
$$\mathcal{H} = \{h_{a,j,1}(x) = \mathrm{sgn}(x_j - a),\ h_{a,j,2}(x) = \mathrm{sgn}(a - x_j) : a \in \mathbb{R},\ j = 1, \ldots, p\}.$$
To perform weighted ERM with weights $w_1, \ldots, w_n > 0$ (we have dropped the superscript $m$), for each $j = 1, \ldots, p$, first sort $\{X_{ij}\}_{i=1}^n$ so that $X_{(1)j} < \cdots < X_{(n)j}$ (we assume these are distinct for simplicity). Fixing $j$, wlog we may assume $X_{(i)j} = X_{ij} = x_i$. Now observe that (dropping the subscript $m$)
$$\mathrm{err}(h_{x_{k+1},j,1}) - \mathrm{err}(h_{x_k,j,1}) = Y_{k+1}w_{k+1}\Big/\sum_l w_l.$$
Thus picking the optimal $h_{a,j,1}$ (for fixed $j$) amounts to picking the minimum across a sequence of cumulative sums, and similarly for $h_{a,j,2}$. This needs to be performed for each $j = 1, \ldots, p$. Assuming the sorting is performed as part of pre-processing, the weighted empirical risk minimisation has $O(np)$ computational complexity.

4.3 Gradient boosting

Consider the following thought experiment. Let us imagine applying gradient descent directly to minimise $R(h) = \mathbb{E}\ell(h(X), Y)$. This would involve the following steps.

1. Start with an initial guess $f_0 : \mathcal{X} \to \mathbb{R}$.

2. For $m = 1, \ldots, M$, iteratively compute
$$g_m(x) = \frac{\partial\,\mathbb{E}(\ell(\theta, Y)\,|\,X = x)}{\partial\theta}\bigg|_{\theta = f_{m-1}(x)} = \mathbb{E}\bigg(\frac{\partial\ell(\theta, Y)}{\partial\theta}\bigg|_{\theta = f_{m-1}(x)}\,\bigg|\,X = x\bigg),$$
assuming sufficient regularity conditions.

3. Update $f_m = f_{m-1} - \eta g_m$, where $\eta > 0$ is a small step length.

3. Update fm = fm−1 − ηgm, where η > 0 is a small step length.

If we want to create a version of the ‘algorithm’ above that works with finite data(X1, Y1), . . . , (Xn, Yn), we need to find a way of approximating the conditional expectationfunction

x 7→ E(∂`(θ, Y )

∂θ

∣∣∣fm−1(x)

∣∣∣∣X = x

).

Recall from Proposition 1 and following discussion on page 3, that this minimises

E(∂`(θ, Y )

∂θ

∣∣∣fm−1(X)

− h(X)

)2

(4.5)


among all (measurable) functions $h : \mathcal{X} \to \mathbb{R}$ under suitable conditions. This observation motivates the following algorithm, known as gradient boosting, where we try to minimise an empirical version of (4.5) using regression, thereby approximating the conditional expectation. This regression is performed using some base regression method $H$ that takes as its argument some training data $D$ and outputs a hypothesis $H_D : \mathcal{X} \to \mathbb{R}$. In what follows, the loss $\ell$ may correspond to a convex surrogate or least squares loss, for example.

Algorithm 3 Gradient boosting

Input: Data $X_{1:n}, Y_{1:n}$; $\eta > 0$; base regression method $H$; stopping iteration $M$
Compute $\hat\mu = \operatorname*{arg\,min}_{\mu \in \mathbb{R}}\frac{1}{n}\sum_{i=1}^n \ell(\mu, Y_i)$ and set $f_0(x) = \hat\mu$
for $m = 1$ to $M$ do
  Compute $W_i = \frac{\partial}{\partial\theta}\ell(\theta, Y_i)\big|_{\theta = f_{m-1}(X_i)}$
  Apply $H$ to data $X_{1:n}, W_{1:n}$ to give $\hat g_m = H(X_{1:n}, W_{1:n}) : \mathcal{X} \to \mathbb{R}$
  Update $f_m = f_{m-1} - \eta\hat g_m$
end for
return $f_M$ (or $\mathrm{sgn} \circ f_M$ in the classification setting)
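A sketch of Algorithm 3 (our own code; `base_regressor` is any regression method $H$, e.g. the tree fitter sketched in the next subsection, and `loss_grad` computes $\partial\ell(\theta, y)/\partial\theta$; taking $f_0$ to be the mean is exact for squared error, while other losses would need a one-dimensional minimisation):

```python
import numpy as np

def gradient_boost(X, y, base_regressor, loss_grad, M, eta):
    """Algorithm 3: at each step regress the pointwise loss derivatives
    W_i on X and take a small step against the fitted function."""
    f0 = y.mean()
    f_train = np.full(len(y), f0)
    gs = []
    for m in range(M):
        W = loss_grad(f_train, y)        # W_i at theta = f_{m-1}(X_i)
        g = base_regressor(X, W)         # g_m approximates E(W | X = x)
        f_train -= eta * g(X)
        gs.append(g)
    return lambda Xnew: f0 - eta * sum(g(Xnew) for g in gs)

# e.g. squared error: loss_grad = lambda theta, y: 2 * (theta - y)
```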

4.4 Decision trees

Gradient boosting requires a fast base regression procedure. In the setting where $\mathcal{X} = \mathbb{R}^p$, methods for fitting decision trees are the most popular choice. Decision trees are a generalisation of decision stumps and take the form
$$T(x) = \sum_{j=1}^J \gamma_j1_{\{x \in R_j\}}$$
where the $R_j$ are rectangular regions that form a partition of $\mathbb{R}^p$ and the $\gamma_j$ are coefficients in $\mathbb{R}$.

The regions and coefficients are typically computed from data $(X_i, Y_i)_{i=1}^n$ using the following recursive binary partitioning algorithm.

1. Input maximum number of regions $J$. Initialise $\mathcal{R} = \{\mathbb{R}^p\}$.

2. For each region $R \in \mathcal{R}$ such that $I := \{i : X_i \in R\}$ has $|I| > 1$, find the predictor $j_R$ and split point $s_R$ to minimise over $j \in \{1, \ldots, p\}$ and $s \in S_j$
$$\min_{c_1 \in \mathbb{R}}\sum_{i \in I : X_{ij} \leq s}(Y_i - c_1)^2 + \min_{c_2 \in \mathbb{R}}\sum_{i \in I : X_{ij} > s}(Y_i - c_2)^2 - \sum_{i \in I}\bigg(Y_i - \frac{1}{|I|}\sum_{l \in I}Y_l\bigg)^2, \qquad (4.6)$$
where $S_j$ is the set of mid-points between adjacent $\{X_{ij}\}_{i \in I}$.


3. Let $\hat R$ be the region yielding the lowest value of (4.6) and define
$$\hat R_1 = \{x \in \hat R : x_{j_{\hat R}} \leq s_{\hat R}\}, \qquad \hat R_2 = \hat R \setminus \hat R_1.$$
Refine the partition via $\mathcal{R} \leftarrow (\mathcal{R} \setminus \{\hat R\}) \cup \{\hat R_1, \hat R_2\}$.

4. Repeat steps 2 and 3 until $|\mathcal{R}| = J$.

5. Writing $\mathcal{R} = \{R_1, \ldots, R_J\}$, let $I_j = \{i : X_i \in R_j\}$ and
$$\hat\gamma_j = \frac{1}{|I_j|}\sum_{i \in I_j}Y_i.$$
Output $T : \mathbb{R}^p \to \mathbb{R}$ such that $T(x) = \sum_{j=1}^J \hat\gamma_j1_{\{x \in R_j\}}$.
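A compact sketch of the recursive partitioning above (our own code; regions are tracked through their defining split rules, and each split minimises (4.6), i.e. maximises the decrease in residual sum of squares):

```python
import numpy as np

def fit_tree(X, y, J):
    """Grow a regression tree by recursive binary partitioning until
    there are J regions; returns a prediction function."""
    leaves = [{"idx": np.arange(len(y)), "rules": []}]   # rules: (j, s, go_left)

    def best_split(idx):
        """Best (gain, j, s) over features j and mid-point splits s."""
        Y = y[idx]
        base = ((Y - Y.mean()) ** 2).sum()
        best = (0.0, None, None)
        for j in range(X.shape[1]):
            xs = np.unique(X[idx, j])
            for s in (xs[:-1] + xs[1:]) / 2:             # mid-points S_j
                L, R = Y[X[idx, j] <= s], Y[X[idx, j] > s]
                rss = ((L - L.mean())**2).sum() + ((R - R.mean())**2).sum()
                if base - rss > best[0]:
                    best = (base - rss, j, s)
        return best

    while len(leaves) < J:
        cands = [(best_split(lf["idx"]), i)
                 for i, lf in enumerate(leaves) if len(lf["idx"]) > 1]
        if not cands:
            break
        (gain, j, s), i = max(cands, key=lambda c: c[0][0])
        if j is None:
            break
        lf = leaves.pop(i)
        le = lf["idx"][X[lf["idx"], j] <= s]
        ri = lf["idx"][X[lf["idx"], j] > s]
        leaves.append({"idx": le, "rules": lf["rules"] + [(j, s, True)]})
        leaves.append({"idx": ri, "rules": lf["rules"] + [(j, s, False)]})

    for lf in leaves:
        lf["gamma"] = y[lf["idx"]].mean()                # the coefficients

    def predict(Xnew):
        out = np.empty(len(Xnew))
        for lf in leaves:
            m = np.ones(len(Xnew), dtype=bool)
            for j, s, go_left in lf["rules"]:
                m &= (Xnew[:, j] <= s) if go_left else (Xnew[:, j] > s)
            out[m] = lf["gamma"]
        return out
    return predict
```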

4.5 Random forests

Whilst decision trees as above are a useful machine learning method in their own right, they are most useful for prediction when used in conjunction with gradient boosting or within the random forest procedure, which we now describe.

Consider the regression setting where $Y_i \in \mathbb{R}$ and we are using squared error loss. Let $T_D$ be a decision tree trained on data $D := (X_i, Y_i)_{i=1}^n$. Also let $\bar T = \mathbb{E}T_D$ and let $(X, Y)$ be independent of $D$ with $(X, Y) \stackrel{d}{=} (X_1, Y_1)$. Then we have the following decomposition of the expected risk of $T_D$:
$$\mathbb{E}R(T_D) = \mathbb{E}[\{Y - \underbrace{\bar T(X)}_{=\mathbb{E}(T_D(X)|X)} + \bar T(X) - T_D(X)\}^2]$$
$$= R(\bar T) + \mathbb{E}\,\mathrm{Var}(T_D(X)|X) + 2\underbrace{\mathbb{E}[\mathbb{E}\{(Y - \bar T(X))(\bar T(X) - T_D(X))\,|\,X\}]}_{=0}.$$

If the number of regions $J$ used by $T_D$ is large, some of these regions will contain only small numbers of observations, so the corresponding coefficients $\hat\gamma_j$ will be highly variable and consequently $\mathbb{E}\,\mathrm{Var}(T_D(X)|X)$ will tend to be large. On the other hand, $R(\bar T)$ may be fairly low as a large $J$ would allow $\bar T$ to approximate $x \mapsto \mathbb{E}(Y|X = x)$ well.

Random forest effectively attempts to 'estimate' $\bar T$. If we had multiple independent datasets $D_1, \ldots, D_B$, we could form an unbiased estimate via $B^{-1}\sum_{b=1}^B T_{D_b}$. Random forest samples the data $D$ with replacement to form new datasets $D^*_1, \ldots, D^*_B$ and performs the following.

1. For each $b = 1, \ldots, B$, fit a decision tree $T^{(b)}_{D^*_b}$, but when searching for the best predictor to split on, randomly sample (without replacement) $m_{\mathrm{try}}$ of the $p$ predictors and choose the best split from among these variables.

2. Output $\hat f_{\mathrm{rf}} = \frac{1}{B}\sum_{b=1}^B T^{(b)}_{D^*_b}$.


The reason for sampling predictors is to try to make the $T^{(b)} := T^{(b)}_{D^*_b}$ more independent. To see why this would be useful, suppose for $b_1 \neq b_2$ and some $x \in \mathbb{R}^p$ that $\mathrm{Corr}(T^{(b_1)}(x), T^{(b_2)}(x)) = \rho \geq 0$. Then
$$\mathrm{Var}(\hat f_{\mathrm{rf}}(x)) = \frac{1}{B}\mathrm{Var}(T^{(1)}(x)) + \frac{\rho B(B-1)}{B^2}\mathrm{Var}(T^{(1)}(x)) = \frac{1-\rho}{B}\mathrm{Var}(T^{(1)}(x)) + \rho\,\mathrm{Var}(T^{(1)}(x)).$$
Whilst the first term can be made small for large $B$, the second term does not depend on $B$, so we would like $\rho$ to be small. The extra randomisation in the form of sampling predictors can help to achieve this.
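A sketch of the procedure using the `fit_tree` helper above (our own code; for brevity the $m_{\mathrm{try}}$ features are sampled once per tree, whereas the algorithm as stated resamples them at every split):

```python
import numpy as np

def random_forest(X, y, B, mtry, J=16, rng=None):
    """Average of B trees, each fitted to a bootstrap resample D*_b
    and to a random subset of mtry predictors."""
    rng = rng or np.random.default_rng()
    n, p = X.shape
    trees = []
    for b in range(B):
        boot = rng.integers(n, size=n)                   # sample D with replacement
        feats = rng.choice(p, size=mtry, replace=False)
        trees.append((fit_tree(X[np.ix_(boot, feats)], y[boot], J), feats))
    return lambda Xnew: np.mean([t(Xnew[:, f]) for t, f in trees], axis=0)
```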

4.6 Feedforward neural networks

In recent years, (artificial) neural networks have been shown to be very successful for a variety of learning tasks. The class of feedforward neural networks is based around a particular class of hypotheses $h : \mathcal{X} = \mathbb{R}^p \to \mathbb{R}$ with general form
$$h(x) = A^{(d)} \circ g \circ A^{(d-1)} \circ g \circ \cdots \circ g \circ A^{(2)} \circ g \circ A^{(1)}(x)$$
where

• $d$ is known as the depth of the network;

• $A^{(k)}(v) = \beta^{(k)}v + \mu^{(k)}$ where $v \in \mathbb{R}^{m_k}$, $\beta^{(k)} \in \mathbb{R}^{m_{k+1} \times m_k}$, $\mu^{(k)} \in \mathbb{R}^{m_{k+1}}$ with $m_1 = p$ and $m_{d+1} = 1$;

• $g : \mathbb{R}^m \to \mathbb{R}^m$ applies (for any given $m$) a so-called activation function $\psi : \mathbb{R} \to \mathbb{R}$ elementwise, i.e. for $v = (v_1, \ldots, v_m)^T$, $g(v) = (\psi(v_1), \ldots, \psi(v_m))^T$. The activation function is nonlinear and typical choices include

(i) $u \mapsto \max(u, 0)$ (known as a rectified linear unit (ReLU));

(ii) $u \mapsto 1/(1 + e^{-u})$ (sigmoid).

This cascade of alternating linear and nonlinear compositions can be visualised in the form of a graph. Here we have set $h^{(0)} := x$ and for $k = 1, \ldots, d-1$, $x^{(k)} = A^{(k)}(h^{(k-1)})$, $h^{(k)} = g(x^{(k)})$. The intermediate outputs $h^{(1)}, \ldots, h^{(d-1)}$ are known as hidden layers and $x^{(d)} = A^{(d)}(h^{(d-1)}) = h(x)$ is sometimes known as the output layer.

The parameters $(\beta^{(k)}, \mu^{(k)})_{k=1}^d$ are typically fitted to data $(x_1, y_1), \ldots, (x_n, y_n) \in \mathbb{R}^p \times \{-1, 1\}$ with empirical risk minimisation using a surrogate loss $\phi$. Despite the resulting optimisation being highly nonconvex, stochastic gradient descent has been shown empirically to be extremely effective in selecting good parameters. A key factor in their success has been the fact that the gradients involved can be computed quickly, due to the compositional nature of the hypotheses, using the chain rule.
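A sketch of the forward pass just described (our own code; ReLU activation by default, with `betas[k]` and `mus[k]` standing for $\beta^{(k+1)}$ and $\mu^{(k+1)}$):

```python
import numpy as np

def forward(x, betas, mus, psi=lambda u: np.maximum(u, 0)):
    """h(x) = A^(d) o g o A^(d-1) o ... o g o A^(1)(x)."""
    h = x                                   # h^(0) = x
    for beta, mu in zip(betas[:-1], mus[:-1]):
        h = psi(beta @ h + mu)              # x^(k), then h^(k) = g(x^(k))
    return betas[-1] @ h + mus[-1]          # output layer x^(d)

# depth d = 3 with hidden widths (4, 3) on p = 5 inputs:
rng = np.random.default_rng(0)
dims = [5, 4, 3, 1]
betas = [rng.normal(size=(m2, m1)) for m1, m2 in zip(dims[:-1], dims[1:])]
mus = [np.zeros(m) for m in dims[1:]]
print(forward(rng.normal(size=5), betas, mus))
```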


Suppose $\phi$ and $\psi$ are differentiable. At an observation $(x, y) = (x_{U_s}, y_{U_s})$ we first compute all the intermediate quantities $h^{(l)}$ and $x^{(l)}$ given the current values of the parameters. Let $z = \phi(yh(x)) = \phi(yx^{(d)})$. We then compute, in order,
$$\frac{\partial z}{\partial x^{(d)}} = y\phi'(yx^{(d)}), \qquad \frac{\partial z}{\partial\beta^{(d)}_{1k}} = \frac{\partial z}{\partial x^{(d)}}h^{(d-1)}_k, \qquad \frac{\partial z}{\partial\mu^{(d)}} = \frac{\partial z}{\partial x^{(d)}}, \qquad (4.7)$$
$$\frac{\partial z}{\partial h^{(d-1)}_j} = \frac{\partial z}{\partial x^{(d)}}\beta^{(d)}_{1j}, \qquad \frac{\partial z}{\partial x^{(d-1)}_j} = \frac{\partial z}{\partial h^{(d-1)}_j}\psi'(x^{(d-1)}_j),$$
$$\frac{\partial z}{\partial\beta^{(d-1)}_{jk}} = \frac{\partial z}{\partial x^{(d-1)}_j}h^{(d-2)}_k, \qquad \frac{\partial z}{\partial\mu^{(d-1)}_j} = \frac{\partial z}{\partial x^{(d-1)}_j}, \qquad (4.8)$$
$$\frac{\partial z}{\partial h^{(d-2)}_j} = \sum_k \frac{\partial z}{\partial x^{(d-1)}_k}\beta^{(d-1)}_{kj},$$
and so on. This process is known as back propagation. Note that only (4.7) and (4.8) out of the equations presented above are directly used in the SGD update step; the remaining equations simply facilitate computation of the gradient with respect to the $(\beta^{(k)}, \mu^{(k)})_{k=1}^d$.
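A sketch of these recursions for the network above (our own code, reusing `betas`, `mus` and `rng` from the previous sketch; `dphi` and `dpsi` are the derivatives of the surrogate loss and activation, so the usage line pairs the everywhere-differentiable exponential loss with the sigmoid):

```python
import numpy as np

def backprop(x, y, betas, mus, psi, dpsi, dphi):
    """Compute dz/dbeta^(k) and dz/dmu^(k) for z = phi(y * h(x)),
    following (4.7), (4.8) and their analogues for earlier layers."""
    hs, xs = [x], []                        # h^(0); pre-activations x^(1..d-1)
    for beta, mu in zip(betas[:-1], mus[:-1]):
        xs.append(beta @ hs[-1] + mu)
        hs.append(psi(xs[-1]))
    out = betas[-1] @ hs[-1] + mus[-1]      # x^(d)

    dbetas, dmus = [None] * len(betas), [None] * len(betas)
    dx = y * dphi(y * out)                  # dz/dx^(d), as in (4.7)
    for k in range(len(betas) - 1, -1, -1):
        dbetas[k] = np.outer(dx, hs[k])     # dz/dbeta^(k+1)
        dmus[k] = dx                        # dz/dmu^(k+1)
        if k > 0:
            dh = betas[k].T @ dx            # dz/dh^(k)
            dx = dh * dpsi(xs[k - 1])       # dz/dx^(k)
    return dbetas, dmus

# e.g. sigmoid activation with exponential surrogate loss phi(u) = exp(-u):
sig = lambda u: 1 / (1 + np.exp(-u))
dbetas, dmus = backprop(rng.normal(size=5), 1.0, betas, mus,
                        psi=sig, dpsi=lambda u: sig(u) * (1 - sig(u)),
                        dphi=lambda u: -np.exp(-u))
```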
