Page 1

Statistical Machine Learning: Introduction

Dino Sejdinovic
Department of Statistics
University of Oxford

22-24 June 2015, Novi Sad
slides available at:

http://www.stats.ox.ac.uk/~sejdinov/talks.html

Page 2

Introduction Introduction

What is Machine Learning?

Arthur Samuel, 1959: Field of study that gives computers the ability to learn without being explicitly programmed.

Tom Mitchell, 1997: Any computer program that improves its performance at some task through experience.

Kevin Murphy, 2012: To develop methods that can automatically detect patterns in data, and then to use the uncovered patterns to predict future data or other outcomes of interest.

Page 5

Introduction Introduction

What is Machine Learning?

[Figure (http://gureckislab.org): data → information, structure, prediction, decisions, actions]

Larry Page about DeepMind's ML systems that can learn to play video games like humans

Page 6

Introduction Introduction

What is Machine Learning?

[Figure: Machine Learning at the intersection of statistics, computer science, cognitive science, psychology, mathematics, engineering, operations research, physics, biology, genetics, business and finance]

Page 7

Introduction Introduction

What is Data Science?

’Data Scientists’ Meld Statistics and Software WSJ article

Page 8

Introduction Introduction

Information Revolution

Traditional Statistical Inference Problems
Well-formulated question that we would like to answer.
Expensive data gathering and/or expensive computation.
Create specially designed experiments to collect high-quality data.

Information Revolution
Improvements in data processing and data storage.
Powerful, cheap, easy data capturing.
Lots of (low-quality) data with potentially valuable information inside.

Page 9

Introduction Introduction

Statistics and Machine Learning in the age of Big Data

ML becoming a thorough blending of computer science, engineering and statistics
unified framework of data, inferences, procedures, algorithms
statistics taking computation seriously
computing taking statistical risk seriously
scale and granularity of data
personalization, societal and business impact
multidisciplinarity - and you are the interdisciplinary glue
it's just getting started

Michael Jordan: On the Computational and Statistical Interface and "Big Data"

Page 10

Introduction Applications of Machine Learning

Applications of Machine Learning

spam filtering
recommendation systems
fraud detection
self-driving cars
image recognition
stock market analysis

ImageNet: Krizhevsky et al, 2012; Machine Learning is Eating the World: Forbes article

Page 11

Introduction Types of Machine Learning

Types of Machine Learning

Supervised learning

Data contains "labels": every example is an input-output pair
classification, regression
Goal: prediction on new examples

Unsupervised learning

Extract key features of the "unlabelled" data
clustering, signal separation, density estimation
Goal: representation, hypothesis generation, visualization

Page 12

Introduction Types of Machine Learning

Types of Machine Learning

Semi-supervised Learning

A database of examples, only a small subset of which are labelled.

Multi-task Learning

A database of examples, each of which has multiple labels corresponding to different prediction tasks.

Reinforcement Learning

An agent acting in an environment, given rewards for performing appropriate actions, learns to maximize its reward.

Page 13

Introduction Types of Machine Learning

Literature

Page 14

Supervised Learning Supervised Learning

Supervised Learning

We observe the input-output pairs $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \mathcal{X}$, $y_i \in \mathcal{Y}$.

Types of supervised learning:
Classification: discrete responses, e.g. $\mathcal{Y} = \{+1, -1\}$ or $\{1, \ldots, K\}$.
Regression: a numerical value is observed and $\mathcal{Y} = \mathbb{R}$.

The goal is to accurately predict the response $Y$ on new observations of $X$, i.e., to learn a function $f: \mathbb{R}^p \to \mathcal{Y}$ such that $f(X)$ will be close to the true response $Y$.

Page 15

Supervised Learning Decision Theory

Loss function

Suppose we made a prediction $\hat{Y} = f(X) \in \mathcal{Y}$ based on observation of $X$. How good is the prediction? We can use a loss function $L : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}^+$ to formalize the quality of the prediction.

Typical loss functions:

Misclassification loss (or 0-1 loss) for classification:
$$L(Y, f(X)) = \begin{cases} 0 & f(X) = Y \\ 1 & f(X) \neq Y \end{cases}$$

Squared loss for regression:
$$L(Y, f(X)) = (f(X) - Y)^2.$$

Many other choices are possible, e.g., weighted misclassification loss. In classification, the vector of estimated probabilities $(\hat{p}(k))_{k \in \mathcal{Y}}$ can be returned, and in this case the log-likelihood loss (or log loss) $L(Y, \hat{p}) = -\log \hat{p}(Y)$ is often used.

Page 16

Supervised Learning Decision Theory

Risk

Paired observations $\{(x_i, y_i)\}_{i=1}^n$ are viewed as i.i.d. realizations of a random variable $(X, Y)$ on $\mathcal{X} \times \mathcal{Y}$ with joint distribution $P_{XY}$.

Risk
For a given loss function $L$, the risk $R$ of a learned function $f$ is given by the expected loss
$$R(f) = \mathbb{E}_{P_{XY}}\left[L(Y, f(X))\right],$$
where the expectation is with respect to the true (unknown) joint distribution of $(X, Y)$.

The risk is unknown, but we can compute the empirical risk:
$$R_n(f) = \frac{1}{n}\sum_{i=1}^n L(y_i, f(x_i)).$$
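As a quick illustration, here is a minimal numpy sketch (my own, not from the slides; all names are illustrative) that evaluates the empirical risk $R_n(f)$ under the 0-1 loss for a toy constant predictor.

```python
import numpy as np

def zero_one_loss(y, y_pred):
    """Misclassification (0-1) loss, elementwise."""
    return (y != y_pred).astype(float)

def squared_loss(y, y_pred):
    """Squared loss for regression, elementwise."""
    return (y_pred - y) ** 2

def empirical_risk(loss, y, y_pred):
    """R_n(f): average loss over the observed pairs."""
    return np.mean(loss(y, y_pred))

# Toy classification labels and a constant classifier f(x) = +1.
y = np.array([+1, -1, +1, +1, -1])
y_pred = np.ones_like(y)
print(empirical_risk(zero_one_loss, y, y_pred))  # 0.4: two of five points misclassified
```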

Page 17

Supervised Learning Decision Theory

The Bayes Classifier

What is the optimal classifier if the joint distribution of $(X, Y)$ were known?
The density $g$ of $X$ can be written as a mixture of $K$ components (corresponding to each of the classes):
$$g(x) = \sum_{k=1}^K \pi_k g_k(x),$$
where, for $k = 1, \ldots, K$:
$P(Y = k) = \pi_k$ are the class probabilities,
$g_k(x)$ is the conditional density of $X$ given $Y = k$.

The Bayes classifier $f_{\mathrm{Bayes}} : \mathcal{X} \to \{1, \ldots, K\}$ is the one with minimum risk:
$$R(f) = \mathbb{E}\left[L(Y, f(X))\right] = \mathbb{E}_X\left[\mathbb{E}_{Y|X}\left[L(Y, f(X)) \mid X\right]\right] = \int_{\mathcal{X}} \mathbb{E}\left[L(Y, f(X)) \mid X = x\right] g(x)\, dx.$$

The minimum risk attained by the Bayes classifier is called the Bayes risk.
Minimizing $\mathbb{E}[L(Y, f(X)) \mid X = x]$ separately for each $x$ suffices.

Page 18

Supervised Learning Decision Theory

The Bayes Classifier

Consider the 0-1 loss. The risk simplifies to:
$$\mathbb{E}\left[L(Y, f(X)) \mid X = x\right] = \sum_{k=1}^K L(k, f(x))\, P(Y = k \mid X = x) = 1 - P(Y = f(x) \mid X = x).$$

The risk is minimized by choosing the class with the greatest posterior probability:
$$f_{\mathrm{Bayes}}(x) = \arg\max_{k=1,\ldots,K} P(Y = k \mid X = x) = \arg\max_{k=1,\ldots,K} \frac{\pi_k g_k(x)}{\sum_{j=1}^K \pi_j g_j(x)} = \arg\max_{k=1,\ldots,K} \pi_k g_k(x).$$

The functions $x \mapsto \pi_k g_k(x)$ are called discriminant functions. The discriminant function with maximum value determines the predicted class of $x$.

Page 19

Supervised Learning Decision Theory

The Bayes Classifier: Example

A simple two-Gaussian example: suppose $X \sim \mathcal{N}(\mu_Y, 1)$, where $\mu_1 = -1$ and $\mu_2 = 1$, and assume equal priors $\pi_1 = \pi_2 = 1/2$. Then
$$g_1(x) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{(x+1)^2}{2}\right) \quad \text{and} \quad g_2(x) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{(x-1)^2}{2}\right).$$

[Figure: conditional densities $g_1$, $g_2$ and the marginal density $g(x)$ for $x \in [-3, 3]$]

Optimal classification is
$$f_{\mathrm{Bayes}}(x) = \arg\max_{k=1,\ldots,K} \pi_k g_k(x) = \begin{cases} 1 & \text{if } x < 0, \\ 2 & \text{if } x \geq 0. \end{cases}$$

Page 20

Supervised Learning Decision Theory

The Bayes Classifier: Example

How do you classify a new observation $x$ if the standard deviation is still 1 for class 1 but is now 1/3 for class 2?

[Figure: conditional densities $g_1$, $g_2$ on a linear scale (left) and on a log scale (right) for $x \in [-3, 3]$]

Looking at the densities on a log scale, the optimal classification is to select class 2 if and only if $x \in [0.34, 2.16]$.
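A minimal sketch (my own, not from the slides) that recovers this decision region numerically for the stated model: class 1 is $\mathcal{N}(-1, 1)$, class 2 is $\mathcal{N}(1, (1/3)^2)$, equal priors.

```python
import numpy as np
from scipy.stats import norm

# Class-conditional densities: class 1 ~ N(-1, 1), class 2 ~ N(1, (1/3)^2), equal priors.
x = np.linspace(-3, 3, 600001)
log_disc1 = np.log(0.5) + norm.logpdf(x, loc=-1, scale=1.0)
log_disc2 = np.log(0.5) + norm.logpdf(x, loc=1, scale=1/3)

# Bayes rule: predict class 2 wherever its discriminant pi_k g_k(x) is larger.
pick2 = log_disc2 > log_disc1
print(x[pick2].min(), x[pick2].max())  # ~0.335 and ~2.165, i.e. [0.34, 2.16] up to rounding
```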

Page 21

Supervised Learning Decision Theory

Plug-in Classification

The Bayes classifier chooses the class with the greatest posterior probability:
$$f_{\mathrm{Bayes}}(x) = \arg\max_{k=1,\ldots,K} \pi_k g_k(x).$$

We know neither the conditional densities $g_k$ nor the class probabilities $\pi_k$! The plug-in classifier chooses the class
$$\hat{f}(x) = \arg\max_{k=1,\ldots,K} \hat{\pi}_k \hat{g}_k(x),$$
where we plugged in estimates $\hat{\pi}_k$ of $\pi_k$, $k = 1, \ldots, K$, and estimates $\hat{g}_k(x)$ of the conditional densities $g_k(x)$.

Page 22

Supervised Learning Linear Discriminant Analysis

Linear Discriminant Analysis

LDA is the simplest example of plug-in classification.
Assume a multivariate normal conditional density $g_k(x)$ for each class $k$:
$$X \mid Y = k \sim \mathcal{N}(\mu_k, \Sigma), \qquad g_k(x) = (2\pi)^{-p/2} |\Sigma|^{-1/2} \exp\left(-\frac{1}{2}(x - \mu_k)^\top \Sigma^{-1} (x - \mu_k)\right),$$
where each class can have a different mean $\mu_k$, and all classes share the same covariance $\Sigma$.

For an observation $x$, the $k$-th log-discriminant function is
$$\log \pi_k g_k(x) = \text{const} + \log \pi_k - \frac{1}{2}(x - \mu_k)^\top \Sigma^{-1} (x - \mu_k) = \text{const} + a_k + b_k^\top x.$$

Linear log-discriminant functions ↔ linear decision boundaries.
The quantity $(x - \mu_k)^\top \Sigma^{-1} (x - \mu_k)$ is the squared Mahalanobis distance between $x$ and $\mu_k$.
If $\Sigma = I_p$ and $\pi_k = \frac{1}{K}$, LDA simply chooses the class $k$ with the nearest (in the Euclidean sense) class mean.
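A minimal numpy sketch (my own illustration, not the lecturer's code) of LDA as a plug-in classifier: estimate class priors, class means, and a shared (pooled) covariance, then classify by the largest linear discriminant $a_k + b_k^\top x$ with $b_k = \hat\Sigma^{-1}\hat\mu_k$ and $a_k = \log\hat\pi_k - \tfrac{1}{2}\hat\mu_k^\top\hat\Sigma^{-1}\hat\mu_k$.

```python
import numpy as np

def lda_fit(X, y):
    """Plug-in estimates for LDA: class priors, means, pooled covariance."""
    classes = np.unique(y)
    n, p = X.shape
    priors = np.array([np.mean(y == k) for k in classes])
    means = np.array([X[y == k].mean(axis=0) for k in classes])
    # Pooled within-class covariance (shared Sigma across classes).
    Sigma = sum(np.cov(X[y == k].T) * (np.sum(y == k) - 1) for k in classes) / (n - len(classes))
    return classes, priors, means, Sigma

def lda_predict(X, classes, priors, means, Sigma):
    """Assign each x to the class with the largest linear discriminant a_k + b_k^T x."""
    Sigma_inv = np.linalg.inv(Sigma)
    b = means @ Sigma_inv                                    # K x p
    a = np.log(priors) - 0.5 * np.einsum('kp,kp->k', b, means)
    scores = X @ b.T + a                                     # n x K
    return classes[np.argmax(scores, axis=1)]

# Toy data: two Gaussian classes with a shared covariance.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([-1, 0], 1.0, size=(100, 2)), rng.normal([1, 1], 1.0, size=(100, 2))])
y = np.repeat([0, 1], 100)
params = lda_fit(X, y)
print(np.mean(lda_predict(X, *params) == y))  # training accuracy of the plug-in rule
```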

Page 23

Supervised Learning Linear Discriminant Analysis

Iris Dataset

[Figure: Iris dataset, scatter plot of Petal.Width against Petal.Length]

Page 24

Supervised Learning Statistical Learning Theory

Generative vs Discriminative Learning

Generative learning: find parameters which explain all the data available.
$$\hat{\theta} = \arg\max_\theta \sum_{i=1}^n \log P(x_i, y_i \mid \theta)$$
Examples: LDA, naïve Bayes.
Makes use of all the data available.
Flexible framework, can incorporate other tasks, incomplete data.
Stronger modelling assumptions.

Discriminative learning: find parameters that aid in prediction.

$$\hat{\theta} = \arg\min_\theta \frac{1}{n}\sum_{i=1}^n L(y_i, f_\theta(x_i)) \quad \text{or} \quad \hat{\theta} = \arg\max_\theta \sum_{i=1}^n \log P(y_i \mid x_i, \theta)$$
Examples: logistic regression, support vector machines.
Typically performs better on a given task.
Weaker modelling assumptions.
Can overfit more easily.

Page 25

Supervised Learning Statistical Learning Theory

Generative Learning

We work with a joint distribution $P_{X,Y}(x, y)$ over data vectors and labels.
A learning algorithm: construct $f : \mathcal{X} \to \mathcal{Y}$ which predicts the label of $X$.
Given a loss function $L$, the risk $R$ of $f(X)$ is
$$R(f) = \mathbb{E}_{X,Y}\left[L(Y, f(X))\right].$$

For the 0/1 loss in classification, the Bayes classifier
$$f_{\mathrm{Bayes}}(x) = \arg\max_{k=1,\ldots,K} P(Y = k \mid x) = \arg\max_{k=1,\ldots,K} P_{X,Y}(x, k)$$
has the minimum risk (the Bayes risk), but is unknown since $P_{X,Y}$ is unknown.
Assume a parametric model for the joint: $P_{X,Y}(x, y) = P_{X,Y}(x, y \mid \theta)$.
Fit $\hat{\theta} = \arg\max_\theta \sum_{i=1}^n \log P(x_i, y_i \mid \theta)$ and plug back into the Bayes classifier:
$$\hat{f}(x) = \arg\max_{k=1,\ldots,K} P_{X,Y}(x, k \mid \hat{\theta}).$$

Page 26

Supervised Learning Statistical Learning Theory

Hypothesis space and Empirical Risk Minimization

Find the best function in $\mathcal{H}$ minimizing the risk:
$$f^\star = \arg\min_{f \in \mathcal{H}} \mathbb{E}_{X,Y}\left[L(Y, f(X))\right]$$

Empirical Risk Minimization (ERM): minimize the empirical risk instead, since we typically do not know $P_{X,Y}$:
$$\hat{f} = \arg\min_{f \in \mathcal{H}} \frac{1}{n}\sum_{i=1}^n L(y_i, f(x_i))$$

The hypothesis space $\mathcal{H}$ is the space of functions $f$ under consideration.
How complex should we allow functions $f$ to be? If the hypothesis space $\mathcal{H}$ is "too large", ERM will overfit. The function
$$f(x) = \begin{cases} y_i & \text{if } x = x_i, \\ 0 & \text{otherwise} \end{cases}$$
will have zero empirical risk, but is useless for generalization, since it has simply "memorized" the dataset.

Page 27

Supervised Learning Statistical Learning Theory

Training and Test Performance

Training error is the empirical risk
$$\frac{1}{n}\sum_{i=1}^n L(y_i, f(x_i)).$$
For the 0-1 loss in classification, this is the misclassification error on the training data $\{x_i, y_i\}_{i=1}^n$, which were used in learning $f$.

Test error is the empirical risk on new, previously unseen observations $\{x_i, y_i\}_{i=1}^m$,
$$\frac{1}{m}\sum_{i=1}^m L(y_i, f(x_i)),$$
which were NOT used in learning $f$.
Test error is a much better gauge of how well the learned function generalizes to new data.
The test error is in general larger than the training error.
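To make the training/test gap concrete, here is a small sketch (my own, using a hypothetical 1-nearest-neighbour-style "memorizer", a close cousin of the function on the previous slide) showing zero training error but noticeably larger test error under the 0-1 loss.

```python
import numpy as np

rng = np.random.default_rng(1)

def make_data(n):
    """Toy binary classification problem: labels depend noisily on the sign of x[0]."""
    X = rng.normal(size=(n, 2))
    y = np.where(X[:, 0] + 0.5 * rng.normal(size=n) > 0, 1, -1)
    return X, y

def memorizer_predict(X_train, y_train, X):
    """Predict the label of the closest training point (memorizes the training set)."""
    d = ((X[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    return y_train[np.argmin(d, axis=1)]

X_tr, y_tr = make_data(200)
X_te, y_te = make_data(1000)

train_err = np.mean(memorizer_predict(X_tr, y_tr, X_tr) != y_tr)  # 0.0: every point recalls itself
test_err = np.mean(memorizer_predict(X_tr, y_tr, X_te) != y_te)   # noticeably larger
print(train_err, test_err)
```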

Page 28

Supervised Learning Statistical Learning Theory

Hypothesis space for two-class LDA

Assume we have two classes $\{+1, -1\}$.
Recall that the discriminant functions in LDA are linear. Assuming that data vectors in class $k$ are modelled as $\mathcal{N}(\mu_k, \Sigma)$, choosing class $+1$ over $-1$ involves:
$$a_{+1} + b_{+1}^\top x > a_{-1} + b_{-1}^\top x \quad \Leftrightarrow \quad a_\star + b_\star^\top x > 0,$$
where $a_\star = a_{+1} - a_{-1}$, $b_\star = b_{+1} - b_{-1}$.
Thus, the hypothesis space of two-class LDA consists of functions $f(x) = \mathrm{sign}(a + b^\top x)$.
We obtain the coefficients $a$ and $b$, and thus the function $f$, through fitting the parameters of the generative model.
Discriminative learning: restrict $\mathcal{H}$ to the class of functions $f(x) = \mathrm{sign}(a + b^\top x)$ and select $a$ and $b$ which minimize the empirical risk.

Page 29

Supervised Learning Statistical Learning Theory

Space of linear decision functions

Hypothesis space $\mathcal{H} = \{f : f(x) = \mathrm{sign}(a + b^\top x),\ a \in \mathbb{R},\ b \in \mathbb{R}^p\}$.
Find $a, b$ that minimize the empirical risk under the 0-1 loss:
$$\arg\min_{a,b} \frac{1}{n}\sum_{i=1}^n L(y_i, f_{a,b}(x_i)) = \arg\min_{a,b} \frac{1}{n}\sum_{i=1}^n \begin{cases} 0 & \text{if } y_i = \mathrm{sign}(a + b^\top x_i) \\ 1 & \text{otherwise} \end{cases} = \arg\min_{a,b} \frac{1}{2n}\sum_{i=1}^n \left[1 - \mathrm{sign}\left(y_i(a + b^\top x_i)\right)\right].$$

Combinatorial problem - not typically possible to solve...
Maybe easier with a different loss function? (Logistic regression)

Page 30

Supervised Learning Logistic Regression

Linearity of log-posterior odds

Another way to express the linear decision boundary of LDA:
$$\log \frac{p(Y = +1 \mid X = x)}{p(Y = -1 \mid X = x)} = a + b^\top x.$$

Solve explicitly for the conditional class probabilities:
$$p(Y = +1 \mid X = x) = \frac{1}{1 + \exp(-(a + b^\top x))} =: s(a + b^\top x),$$
$$p(Y = -1 \mid X = x) = \frac{1}{1 + \exp(+(a + b^\top x))} = s(-a - b^\top x),$$
where $s(\cdot)$ is the logistic function.

[Figure: the logistic function $s(z)$ for $z \in [-8, 8]$, rising from 0 through 0.5 at $z = 0$ towards 1]

Page 31

Supervised Learning Logistic Regression

Logistic Regression

Consider maximizing the conditional log-likelihood:
$$\ell(a, b) = \sum_{i=1}^n \log p(Y = y_i \mid X = x_i) = \sum_{i=1}^n -\log\left(1 + \exp(-y_i(a + b^\top x_i))\right)$$

Equivalent to minimizing the empirical risk associated with the log loss:
$$R^{\mathrm{emp}}_{\log} = \frac{1}{n}\sum_{i=1}^n \log\left(1 + \exp(-y_i(a + b^\top x_i))\right) = \frac{1}{n}\sum_{i=1}^n -\log \underbrace{s\left(y_i(a + b^\top x_i)\right)}_{p_{a,b}(y_i \mid x_i)}$$

[Figure: the log loss $\log(1 + \exp(-y(a + b^\top x)))$ and the 0-1 loss $\frac{1}{2} + \frac{1}{2}\mathrm{sign}(-y(a + b^\top x))$ plotted against the margin $y(a + b^\top x)$]

Page 32

Supervised Learning Logistic Regression

Logistic Regression

Not possible to find the optimal $a, b$ analytically.
For simplicity, absorb $a$ as an entry in $b$ by appending a '1' to the $x$ vector.
Objective function:
$$R^{\mathrm{emp}}_{\log} = \frac{1}{n}\sum_{i=1}^n -\log s(y_i x_i^\top b)$$

Logistic function identities:
$$s(-z) = 1 - s(z), \qquad \nabla_z s(z) = s(z)s(-z), \qquad \nabla_z \log s(z) = s(-z), \qquad \nabla_z^2 \log s(z) = -s(z)s(-z)$$

Differentiate with respect to $b$:
$$\nabla_b R^{\mathrm{emp}}_{\log} = \frac{1}{n}\sum_{i=1}^n -s(-y_i x_i^\top b)\, y_i x_i, \qquad \nabla_b^2 R^{\mathrm{emp}}_{\log} = \frac{1}{n}\sum_{i=1}^n s(y_i x_i^\top b)\, s(-y_i x_i^\top b)\, x_i x_i^\top$$

Page 33

Supervised Learning Logistic Regression

Logistic Regression

The second derivative is positive-definite: the objective function is convex and there is a single, unique global minimum.
Many different algorithms can find the optimal $b$, e.g.:

Gradient descent:
$$b^{\mathrm{new}} = b + \varepsilon \frac{1}{n}\sum_{i=1}^n s(-y_i x_i^\top b)\, y_i x_i$$

Stochastic gradient descent:
$$b^{\mathrm{new}} = b + \varepsilon_t \frac{1}{|I(t)|}\sum_{i \in I(t)} s(-y_i x_i^\top b)\, y_i x_i$$
where $I(t)$ is a subset of the data at iteration $t$, and $\varepsilon_t \to 0$ slowly ($\sum_t \varepsilon_t = \infty$, $\sum_t \varepsilon_t^2 < \infty$).

Newton-Raphson:
$$b^{\mathrm{new}} = b - \left(\nabla_b^2 R^{\mathrm{emp}}_{\log}\right)^{-1} \nabla_b R^{\mathrm{emp}}_{\log}$$
This is also called iteratively reweighted least squares.

Conjugate gradient, L-BFGS and other methods from numerical analysis.
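A compact numpy sketch (mine, not the lecturer's) of the gradient descent update above, with labels $y_i \in \{+1, -1\}$ and the intercept absorbed into $b$ by appending a 1 to each $x_i$; the step size and iteration count are arbitrary choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_gd(X, y, eps=0.5, n_iter=2000):
    """Minimize the empirical log loss by gradient descent.

    X: n x p data matrix, y: labels in {+1, -1}.
    Returns b with the intercept as its first entry.
    """
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])        # absorb intercept: append a constant 1
    b = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        margins = y * (Xb @ b)
        grad = -(Xb.T @ (sigmoid(-margins) * y)) / len(y)  # (1/n) sum -s(-y_i x_i^T b) y_i x_i
        b -= eps * grad                                    # gradient descent step
    return b

# Toy problem: two overlapping Gaussian classes.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, size=(100, 2)), rng.normal(1, 1, size=(100, 2))])
y = np.repeat([-1, 1], 100)
b = logistic_gd(X, y)
pred = np.sign(np.hstack([np.ones((200, 1)), X]) @ b)
print(np.mean(pred == y))  # training accuracy
```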

Page 34

Supervised Learning Logistic Regression

Logistic Regression vs. LDA

Both have linear decision boundaries and model the log-posterior odds as
$$\log \frac{p(Y = +1 \mid X = x)}{p(Y = -1 \mid X = x)} = a + b^\top x.$$

LDA models the marginal density of $x$ as a Gaussian mixture with shared covariance
$$g(x) = \pi_{-1}\, \mathcal{N}(x; \mu_{-1}, \Sigma) + \pi_{+1}\, \mathcal{N}(x; \mu_{+1}, \Sigma)$$
and fits the parameters $\theta = (\mu_{-1}, \mu_{+1}, \pi_{-1}, \pi_{+1}, \Sigma)$ by maximizing the joint log-likelihood $\sum_{i=1}^n \log p(x_i, y_i \mid \theta)$. The coefficients $a$ and $b$ are then determined from $\theta$.

Logistic regression leaves the marginal density $g(x)$ as an arbitrary density function, and fits the parameters $a, b$ by maximizing the conditional log-likelihood $\sum_{i=1}^n \log p(y_i \mid x_i; a, b)$.

Page 35

Supervised Learning Logistic Regression

Logistic Regression

Properties of logistic regression:
Makes fewer modelling assumptions than generative classifiers.
A simple example of a generalised linear model (GLM), with much supporting statistical theory:

assessment of fit via deviance and plots,
interpretation of the entries of $b$ as odds-ratios,
fitting categorical data (sometimes called multinomial logistic regression),
well-founded approaches to removing insignificant features (drop-in-deviance test, Wald test).

Page 36

Supervised Learning Model Complexity and Generalization

Model Complexity and Generalization

Page 37

Supervised Learning Model Complexity and Generalization

Generalization

Generalization ability: what is the out-of-sample error of the learner $f$?
training error ≠ testing error.
We learn $f$ by minimizing some variant of the empirical risk $R^{\mathrm{emp}}(f)$ - what can we say about the true risk $R(f)$?
Two important factors determining generalization ability:

Model complexity
Training data size

Page 38

Supervised Learning Model Complexity and Generalization

Learning Curves

[Figure: prediction error against model complexity/flexibility for a fixed dataset size - training error decreases with flexibility, while test error first decreases (underfit), reaches a minimum (just right), then increases again (overfit). Fixed dataset size, varying model complexity.]

Page 39

Supervised Learning Model Complexity and Generalization

Learning Curves

[Figure: two panels of training and testing error against training dataset size, with the overfitting region marked. Fixed model complexity, varying dataset size.]

Two models: one simple, one complex. Which is which?

Page 40

Supervised Learning Model Complexity and Generalization

Bias-Variance Tradeoff

Where does the prediction error come from?
Example: squared loss in regression: $\mathcal{X} = \mathbb{R}^p$, $\mathcal{Y} = \mathbb{R}$,
$$L(Y, f(X)) = (Y - f(X))^2.$$
The optimal $f$ is the conditional mean:
$$f_*(x) = \mathbb{E}\left[Y \mid X = x\right].$$
This follows from $R(f) = \mathbb{E}_X \mathbb{E}\left[(Y - f(X))^2 \mid X\right]$ and
$$\mathbb{E}\left[(Y - f(X))^2 \mid X = x\right] = \mathbb{E}\left[Y^2 \mid X = x\right] - 2f(x)\,\mathbb{E}\left[Y \mid X = x\right] + f(x)^2 = \mathrm{Var}\left[Y \mid X = x\right] + \left(\mathbb{E}\left[Y \mid X = x\right] - f(x)\right)^2.$$

Page 41

Supervised Learning Model Complexity and Generalization

Bias-Variance Tradeoff

The optimal risk is the intrinsic conditional variability of $Y$ (noise):
$$R(f_*) = \mathbb{E}_X\left[\mathrm{Var}\left[Y \mid X\right]\right]$$
Excess risk:
$$R(f) - R(f_*) = \mathbb{E}_X\left[\left(f(X) - f_*(X)\right)^2\right]$$
How does the excess risk behave on average?
Consider a randomly selected dataset $\mathcal{D} = \{(X_i, Y_i)\}_{i=1}^n$ and $f^{(\mathcal{D})}$ trained on $\mathcal{D}$ using a model (hypothesis class) $\mathcal{H}$.
$$\mathbb{E}_{\mathcal{D}}\left[R(f^{(\mathcal{D})}) - R(f_*)\right] = \mathbb{E}_{\mathcal{D}}\mathbb{E}_X\left[\left(f^{(\mathcal{D})}(X) - f_*(X)\right)^2\right] = \mathbb{E}_X\mathbb{E}_{\mathcal{D}}\left[\left(f^{(\mathcal{D})}(X) - f_*(X)\right)^2\right].$$

Page 42

Supervised Learning Model Complexity and Generalization

Bias-Variance Tradeoff

Denote $\bar{f}(x) = \mathbb{E}_{\mathcal{D}} f^{(\mathcal{D})}(x)$ (the average decision function over all possible datasets). Then
$$\mathbb{E}_{\mathcal{D}}\left[\left(f^{(\mathcal{D})}(X) - f_*(X)\right)^2\right] = \underbrace{\mathbb{E}_{\mathcal{D}}\left[\left(f^{(\mathcal{D})}(X) - \bar{f}(X)\right)^2\right]}_{\mathrm{Var}_X(\mathcal{H}, n)} + \underbrace{\left(\bar{f}(X) - f_*(X)\right)^2}_{\mathrm{Bias}^2_X(\mathcal{H}, n)}$$
Now,
$$\mathbb{E}_{\mathcal{D}} R(f^{(\mathcal{D})}) = R(f_*) + \mathbb{E}_X \mathrm{Var}_X(\mathcal{H}, n) + \mathbb{E}_X \mathrm{Bias}^2_X(\mathcal{H}, n)$$

Where does the prediction error come from?
Noise: intrinsic difficulty of the regression problem.
Bias: how far away is the best learner in the model (the average learner over all possible datasets) from the optimal one?
Variance: how variable is our learning method if given different datasets?
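The decomposition can be checked by simulation. Below is a rough sketch (my own construction, not from the slides) estimating bias² and variance of polynomial least-squares fits of two different degrees over repeatedly drawn datasets from a known $f_*$.

```python
import numpy as np

rng = np.random.default_rng(0)
f_star = lambda x: np.sin(2 * np.pi * x)       # true regression function E[Y|X=x]
x_grid = np.linspace(0, 1, 200)                # points at which bias and variance are evaluated
n, sigma, n_datasets = 30, 0.3, 500

def fit_predict(degree):
    """Fit a degree-d polynomial by least squares on a fresh dataset; predict on x_grid."""
    preds = np.empty((n_datasets, len(x_grid)))
    for d in range(n_datasets):
        x = rng.uniform(0, 1, n)
        y = f_star(x) + sigma * rng.normal(size=n)
        coef = np.polyfit(x, y, degree)
        preds[d] = np.polyval(coef, x_grid)
    return preds

for degree in (1, 9):
    preds = fit_predict(degree)
    f_bar = preds.mean(axis=0)                        # average decision function over datasets
    bias2 = np.mean((f_bar - f_star(x_grid)) ** 2)    # estimate of E_X Bias^2_X(H, n)
    var = np.mean(preds.var(axis=0))                  # estimate of E_X Var_X(H, n)
    print(f"degree {degree}: bias^2 = {bias2:.3f}, variance = {var:.3f}")
# Typically: the linear fit shows high bias and low variance; the degree-9 fit the reverse.
```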

Page 43

Supervised Learning Model Complexity and Generalization

Learning Curves

[Figure: training and testing error against training dataset size for two models, annotated with bias, variance, and the overfitting region.]

Page 44

Supervised Learning Model Complexity and Generalization

Learning Curve

[Figure: prediction error against model complexity/flexibility - underfit: high bias, low variance; overfit: low bias, high variance; "just right" in between. Training and test error curves shown.]

Page 45

Supervised Learning Artificial Neural Networks

Artificial Neural Networks

Page 46

Supervised Learning Artificial Neural Networks

Biological inspiration

Basic computational elements: neurons.
Receive signals from other neurons via dendrites.
Send processed signals via axons.
Axon-dendrite interactions at synapses.
$10^{10}$-$10^{11}$ neurons.
$10^{14}$-$10^{15}$ synapses.

Page 47

Supervised Learning Artificial Neural Networks

Single Neuron Classifier

[Diagram: inputs $x^{(1)}, \ldots, x^{(p)}$ with weights $w_1, \ldots, w_p$ and bias $b$ feeding a unit $s(\cdot)$ whose output is $P(Y = 1 \mid X = x)$]

activation $w^\top x + b$ (linear in the inputs $x$)
activation/transfer function $s$ gives the output/activity (potentially nonlinear in $x$)
common nonlinear activation function $s(a) = \frac{1}{1 + e^{-a}}$: logistic regression
learn $w$ and $b$ via gradient descent

Page 48

Supervised Learning Artificial Neural Networks

Single Neuron Classifier

[Figure: decision boundary of a single neuron classifier in the $(x_{i1}, x_{i2})$ plane]

Page 49

Supervised Learning Artificial Neural Networks

Overfitting

[Figures: fit at iterations 30 and 80]

Figures from D. MacKay, Information Theory, Inference and Learning Algorithms

prevent overfitting by:
early stopping: just halt the gradient descent
regularization: L2-regularization, called weight decay in the neural networks literature.

Page 50

Supervised Learning Artificial Neural Networks

Overfitting

[Figures: fit at iterations 500 and 3000]

Figures from D. MacKay, Information Theory, Inference and Learning Algorithms

prevent overfitting by:
early stopping: just halt the gradient descent
regularization: L2-regularization, called weight decay in the neural networks literature.

Page 51

Supervised Learning Artificial Neural Networks

Overfitting

[Figures: fit at iterations 10000 and 40000]

Figures from D. MacKay, Information Theory, Inference and Learning Algorithms

prevent overfitting by:
early stopping: just halt the gradient descent
regularization: L2-regularization, called weight decay in the neural networks literature.

Page 52

Supervised Learning Artificial Neural Networks

Multilayer Networks

Data vectors $x_i \in \mathbb{R}^p$, binary labels $y_i \in \{0, 1\}$.
inputs $x_{i1}, \ldots, x_{ip}$
output $\hat{y}_i = P(Y = 1 \mid X = x_i)$
hidden unit activities $h_{i1}, \ldots, h_{im}$

Compute the hidden unit activities:
$$h_{il} = s\left(b^h_l + \sum_{j=1}^p w^h_{jl} x_{ij}\right)$$

Compute the output probability:
$$\hat{y}_i = s\left(b^o + \sum_{l=1}^m w^o_l h_{il}\right)$$

[Diagram: inputs $x_{i1}, \ldots, x_{i4}$, hidden units $h_{i1}, \ldots, h_{i7}$, output $\hat{y}_i$]
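A tiny numpy sketch (mine) of this forward pass for a single hidden layer, with arbitrarily initialized weights; shapes follow the notation above ($w^h$ is $p \times m$, $w^o$ is a length-$m$ vector).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, bh, Wh, bo, wo):
    """Single-hidden-layer forward pass: returns hidden activities H and outputs y_hat."""
    H = sigmoid(bh + X @ Wh)          # n x m hidden unit activities h_il
    y_hat = sigmoid(bo + H @ wo)      # n outputs, P(Y=1 | X = x_i)
    return H, y_hat

# Random initialization for a network with p=4 inputs and m=7 hidden units.
rng = np.random.default_rng(0)
p, m = 4, 7
bh, Wh = rng.normal(size=m), rng.normal(size=(p, m))
bo, wo = rng.normal(), rng.normal(size=m)
X = rng.normal(size=(5, p))           # five example input vectors
H, y_hat = forward(X, bh, Wh, bo, wo)
print(y_hat)                          # five probabilities in (0, 1)
```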

Page 53

Supervised Learning Artificial Neural Networks

Multilayer Networks

[Figure: decision boundary of a multilayer network in the $(x_{i1}, x_{i2})$ plane]

Page 54

Supervised Learning Artificial Neural Networks

Training a Neural Network

Objective function: L2-regularized log-loss
$$J = -\sum_{i=1}^n \left[y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i)\right] + \frac{\lambda}{2}\left(\sum_{jl}(w^h_{jl})^2 + \sum_l (w^o_l)^2\right)$$
where
$$\hat{y}_i = s\left(b^o + \sum_{l=1}^m w^o_l h_{il}\right), \qquad h_{il} = s\left(b^h_l + \sum_{j=1}^p w^h_{jl} x_{ij}\right).$$

Optimize the parameters $\theta = \{b^h, w^h, b^o, w^o\}$, where $b^h \in \mathbb{R}^m$, $w^h \in \mathbb{R}^{p \times m}$, $b^o \in \mathbb{R}$, $w^o \in \mathbb{R}^m$, with gradient descent:
$$\frac{\partial J}{\partial w^o_l} = \lambda w^o_l + \sum_{i=1}^n \frac{\partial J}{\partial \hat{y}_i}\frac{\partial \hat{y}_i}{\partial w^o_l} = \lambda w^o_l + \sum_{i=1}^n (\hat{y}_i - y_i)\, h_{il},$$
$$\frac{\partial J}{\partial w^h_{jl}} = \lambda w^h_{jl} + \sum_{i=1}^n \frac{\partial J}{\partial \hat{y}_i}\frac{\partial \hat{y}_i}{\partial h_{il}}\frac{\partial h_{il}}{\partial w^h_{jl}} = \lambda w^h_{jl} + \sum_{i=1}^n (\hat{y}_i - y_i)\, w^o_l\, h_{il}(1 - h_{il})\, x_{ij}.$$

L2-regularization is often called weight decay.
Multiple hidden layers: backpropagation algorithm.
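A minimal gradient-descent training sketch (my own) implementing exactly these weight gradients; for brevity the biases are updated with the analogous, unregularized gradients, and the learning rate and $\lambda$ are arbitrary choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_step(X, y, bh, Wh, bo, wo, lam=0.01, lr=0.01):
    """One gradient descent step on the L2-regularized log loss J."""
    H = sigmoid(bh + X @ Wh)            # n x m hidden activities
    y_hat = sigmoid(bo + H @ wo)        # n output probabilities
    delta = y_hat - y                   # (yhat_i - y_i)

    grad_wo = lam * wo + H.T @ delta                 # lambda*w_l^o + sum_i (yhat_i - y_i) h_il
    grad_bo = delta.sum()                            # bias: same pattern, no weight decay
    back = np.outer(delta, wo) * H * (1 - H)         # n x m: (yhat_i - y_i) w_l^o h_il (1 - h_il)
    grad_Wh = lam * Wh + X.T @ back                  # lambda*w_jl^h + sum_i back_il x_ij
    grad_bh = back.sum(axis=0)

    return bh - lr * grad_bh, Wh - lr * grad_Wh, bo - lr * grad_bo, wo - lr * grad_wo

# Toy usage: XOR-like data, p=2 inputs, m=5 hidden units.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
y = ((X[:, 0] * X[:, 1]) > 0).astype(float)
bh, Wh, bo, wo = rng.normal(size=5), rng.normal(size=(2, 5)), 0.0, rng.normal(size=5)
for _ in range(5000):
    bh, Wh, bo, wo = train_step(X, y, bh, Wh, bo, wo)
print(np.mean((sigmoid(bo + sigmoid(bh + X @ Wh) @ wo) > 0.5) == y))  # training accuracy
```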

Page 55

Supervised Learning Artificial Neural Networks

Multiple hidden layers

$$\hat{y}_i = h^{L+1}_i$$
Hidden layers $h^\ell_{i1}, \ldots, h^\ell_{im}$ for $\ell = 1, \ldots, L$, with inputs $x_{i1} = h^0_{i1}, \ldots, x_{ip} = h^0_{ip}$.
$$h^{\ell+1}_i = s\left(W^{\ell+1} h^\ell_i\right)$$
$W^{\ell+1} = (w^{\ell+1}_{jk})_{jk}$: weight matrix at the $(\ell+1)$-th layer, with weight $w^\ell_{jk}$ on the edge between $h^{\ell-1}_{ik}$ and $h^\ell_{ij}$.
$s$: entrywise (logistic) transfer function.
$$\hat{y}_i = s\left(W^{L+1} s\left(W^L\left(\cdots s\left(W^1 x_i\right)\right)\right)\right)$$

Page 56

Supervised Learning Artificial Neural Networks

Backpropagation

[Diagram: output $\hat{y}_i = h^{L+1}_i$; a hidden unit $h^\ell_{ij}$ connected to $h^{\ell-1}_{ik}$ below it by weight $w^\ell_{jk}$, with the layer $h^{\ell+1}_{i1}, \ldots, h^{\ell+1}_{im}$ above]

$$J = -\sum_{i=1}^n \left[y_i \log h^{L+1}_i + (1 - y_i)\log(1 - h^{L+1}_i)\right]$$

Gradients with respect to $h^\ell_{ij}$ are computed by recursive applications of the chain rule, and propagated through the network backwards:
$$\frac{\partial J}{\partial h^{L+1}_i} = -\frac{y_i}{h^{L+1}_i} + \frac{1 - y_i}{1 - h^{L+1}_i}$$
$$\frac{\partial J}{\partial h^\ell_{ij}} = \sum_{r=1}^m \frac{\partial J}{\partial h^{\ell+1}_{ir}}\frac{\partial h^{\ell+1}_{ir}}{\partial h^\ell_{ij}}, \qquad \frac{\partial J}{\partial w^\ell_{jk}} = \sum_{i=1}^n \frac{\partial J}{\partial h^\ell_{ij}}\frac{\partial h^\ell_{ij}}{\partial w^\ell_{jk}}$$
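For concreteness, a compact matrix-form backpropagation sketch (my own, with logistic activations everywhere and no biases or weight decay) following the recursion above; `Ws` is a hypothetical list of weight matrices $W^1, \ldots, W^{L+1}$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop(Ws, x, y):
    """Gradients of J = -[y log yhat + (1-y) log(1-yhat)] w.r.t. each weight matrix,
    for a single example x with label y in {0, 1} and logistic units in every layer."""
    hs = [x]                                   # h^0 = x
    for W in Ws:                               # forward pass: h^{l+1} = s(W^{l+1} h^l)
        hs.append(sigmoid(W @ hs[-1]))
    y_hat = hs[-1]                             # h^{L+1}, here a length-1 vector

    grads = [None] * len(Ws)
    dJ_dh = -y / y_hat + (1 - y) / (1 - y_hat)         # dJ/dh^{L+1}
    for l in reversed(range(len(Ws))):
        dJ_dpre = dJ_dh * hs[l + 1] * (1 - hs[l + 1])  # chain rule through the logistic unit
        grads[l] = np.outer(dJ_dpre, hs[l])            # dJ/dW^{l+1}
        dJ_dh = Ws[l].T @ dJ_dpre                      # propagate backwards to the layer below
    return grads

# Usage on a random 3-layer network: 4 inputs -> 5 -> 3 -> 1 output.
rng = np.random.default_rng(0)
Ws = [rng.normal(size=(5, 4)), rng.normal(size=(3, 5)), rng.normal(size=(1, 3))]
grads = backprop(Ws, rng.normal(size=4), y=1.0)
print([g.shape for g in grads])   # [(5, 4), (3, 5), (1, 3)]
```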

Page 57

Supervised Learning Artificial Neural Networks

Neural Networks

[Figures: decision boundaries in the $(x_1, x_2)$ unit square - left: the global solution and three local minima of the objective; right: a neural network fit with a weight decay of 0.01]

R package implementing neural networks with a single hidden layer: nnet.

Page 58

Supervised Learning Artificial Neural Networks

Neural Networks – Discussion

Nonlinear hidden units introduce modelling flexibility.
In contrast to user-introduced nonlinearities, features are global, and can be learned to maximize predictive performance.
Neural networks with a single hidden layer and sufficiently many hidden units can model arbitrarily complex functions.
The optimization problem is not convex, and the objective function can have many local optima, plateaus and ridges.
On large-scale problems, one often uses stochastic gradient descent, along with a whole host of techniques for optimization, regularization, and initialization.
Recent developments, especially by Geoffrey Hinton, Yann LeCun, Yoshua Bengio, Andrew Ng and others. See also http://deeplearning.net/.

Page 59

Supervised Learning Artificial Neural Networks

Dropout Training of Neural Networks

Neural network with a single layer of hidden units:

Hidden unit activations:
$$h_{ik} = s\left(b^h_k + \sum_{j=1}^p W^h_{jk} x_{ij}\right)$$

Output probability:
$$\hat{y}_i = s\left(b^o + \sum_{k=1}^m W^o_k h_{ik}\right)$$

Large, overfitted networks often have co-adapted hidden units.
What each hidden unit learns may in fact be useless, e.g. predicting the negation of predictions from other units.
Can prevent co-adaptation by randomly dropping out units from the network.

[Diagram: inputs $x_{i1}, \ldots, x_{i4}$, hidden units $h_{i1}, \ldots, h_{i7}$, output $\hat{y}_i$]

Hinton et al (2012).
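A rough sketch (mine, not Hinton et al.'s recipe verbatim) of dropout in the single-hidden-layer forward pass: during training each hidden unit is retained with probability q and zeroed otherwise; at test time all units are kept and the outgoing weights are scaled by q (the usual weight-scaling approximation to averaging over the ensemble of sub-networks).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_dropout(X, bh, Wh, bo, wo, q=0.5, train=True, rng=None):
    """Forward pass with dropout on the hidden layer.

    Training: each hidden unit is kept with probability q (others set to 0).
    Test: keep all units and scale the output weights by q."""
    H = sigmoid(bh + X @ Wh)
    if train:
        mask = (rng.random(H.shape) < q).astype(float)   # sample a sub-network
        H = H * mask
        y_hat = sigmoid(bo + H @ wo)
    else:
        y_hat = sigmoid(bo + H @ (q * wo))
    return y_hat

# Usage with random weights: p=4 inputs, m=7 hidden units.
rng = np.random.default_rng(0)
bh, Wh = rng.normal(size=7), rng.normal(size=(4, 7))
bo, wo = 0.0, rng.normal(size=7)
X = rng.normal(size=(3, 4))
print(forward_dropout(X, bh, Wh, bo, wo, train=True, rng=rng))   # stochastic (sub-network sampled)
print(forward_dropout(X, bh, Wh, bo, wo, train=False))           # deterministic test-time prediction
```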

Page 60

Supervised Learning Artificial Neural Networks

Dropout Training of Neural Networks

Model as an ensemble of networks:

$$p(y_i = 1 \mid x_i, \theta) = \sum_{b \subset \{1,\ldots,m\}} q^{|b|}(1-q)^{m-|b|}\; p(y_i = 1 \mid x_i, \theta, \text{drop out units } b)$$

[Diagram: four sampled sub-networks of the full network, each retaining a different subset of the hidden units $h_{i1}, \ldots, h_{i7}$]

Weight-sharing among all networks: each network uses a subset of the parameters of the full network (corresponding to the retained units).
Training by stochastic gradient descent: at each iteration a network is sampled from the ensemble, and its subset of parameters is updated.
Biological inspiration: $10^{14}$ weights to be fitted in a lifetime of $10^9$ seconds.
Poisson spikes as a regularization mechanism which prevents co-adaptation: Geoff Hinton on Brains, Sex and Machine Learning.

Page 61

Supervised Learning Artificial Neural Networks

Dropout Training of Neural Networks

Classification of phonemes in speech.

Figure from Hinton et al.