Page 1

Introduction to Machine Learning

Brown University CSCI 1950-F, Spring 2012 Prof. Erik Sudderth

Lecture 15: Online Learning: Stochastic Gradient Descent, Perceptron Algorithm, Kernel Methods

Many figures courtesy Kevin Murphy’s textbook, Machine Learning: A Probabilistic Perspective

Page 2

Batch versus Online Learning

• Many learning algorithms we've seen can be phrased as batch minimization of the following objective (for ML, similar for MAP):

f(θ) = ∑_{i=1}^N f(θ, y_i, x_i), where (x_i, y_i) is one of N data points

• This produces effective prediction algorithms, but can require significant computation and storage for training.

• We can do online learning from streaming data via stochastic gradient descent:

θ_{k+1} = θ_k − η_k ∇_θ f(θ_k, y_k, x_k)

• SGD takes a small step based on single observations "sampled" from the data's empirical distribution:

p(z) = (1/N) ∑_{i=1}^N δ_{z_i}(z)

Page 3

Stochastic Gradient Descent

Recall the empirical distribution that SGD samples from: p(z) = (1/N) ∑_{i=1}^N δ_{z_i}(z)

• How can we produce a single parameter estimate? Polyak-Ruppert averaging of the SGD iterates.
• How should we set the step size η_k? Use the Robbins-Monro conditions; conventional batch step size rules fail (in theory & practice).
• How does this work in practice? Excellent for big datasets, but tuning the parameters is tricky.
• Refinement: take batches of B data points for each step, 1 < B ≪ N (see the sketch below).
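The answers above (Robbins-Monro step sizes, Polyak-Ruppert averaging) can be summarized in a short Python sketch. This is illustrative code under my own naming (sgd, grad, eta0, tau, kappa), not the course's implementation:

```python
import numpy as np

def sgd(grad, theta0, data, n_epochs=5, eta0=0.5, tau=1.0, kappa=0.75):
    """Minimal SGD sketch: Robbins-Monro step sizes and Polyak-Ruppert
    averaging. grad(theta, z) returns the gradient for one observation z."""
    theta = np.asarray(theta0, dtype=float).copy()
    theta_avg = theta.copy()
    k = 0
    for _ in range(n_epochs):
        for i in np.random.permutation(len(data)):       # "sample" from the empirical distribution
            k += 1
            eta_k = eta0 * (tau + k) ** (-kappa)          # sum eta_k = inf, sum eta_k^2 < inf
            theta = theta - eta_k * grad(theta, data[i])  # small step on a single observation
            theta_avg += (theta - theta_avg) / k          # Polyak-Ruppert running average
    return theta_avg
```

With 0.5 < kappa ≤ 1 the step sizes satisfy the Robbins-Monro conditions; returning the running average rather than the final iterate gives a single, less noisy parameter estimate.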

Page 4

Least Mean Squares (LMS)

[Figure: RSS vs. iteration; black line = LMS trajectory over (w0, w1) towards the LS solution (red cross).]

Stochastic gradient descent applied to linear regression model:

f(θ, y_i, x_i) = (1/2)(y_i − θ^T φ(x_i))^2

ŷ_k = θ_k^T φ(x_k)
θ_{k+1} = θ_k + η_k (y_k − ŷ_k) φ(x_k)
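A minimal Python sketch of the LMS update above (illustrative, not the lecture's code); rows of Phi play the role of the feature vectors φ(x_i), and a fixed step size eta is assumed for simplicity:

```python
import numpy as np

def lms(Phi, y, eta=0.01, n_epochs=20):
    """LMS sketch: SGD on the squared error for linear regression.
    Rows of Phi are the feature vectors phi(x_i); eta is a fixed step size."""
    theta = np.zeros(Phi.shape[1])
    for _ in range(n_epochs):
        for k in np.random.permutation(len(y)):
            y_hat = theta @ Phi[k]                         # y_hat_k = theta_k^T phi(x_k)
            theta = theta + eta * (y[k] - y_hat) * Phi[k]  # theta_{k+1} = theta_k + eta (y_k - y_hat_k) phi(x_k)
    return theta
```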

Page 5

SGD for Logistic Regression

[Figure: the logistic sigmoid function σ(z), increasing from 0 to 1.]

σ(z) = sigm(z) = 1 / (1 + e^{−z})

p(y_i | x_i, θ) = Ber(y_i | μ_i), μ_i = σ(θ^T φ(x_i))

f(θ) = −∑_{i=1}^N [ y_i log μ_i + (1 − y_i) log(1 − μ_i) ]

• Batch gradient function:

∇f(θ) = ∑_{i=1}^N (μ_i − y_i) φ(x_i)

• Stochastic gradient descent:

θ_{k+1} = θ_k + η_k (y_k − μ_k) φ(x_k), where μ_k = σ(θ_k^T φ(x_k)), 0 < μ_k < 1
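The SGD update above translates directly into code. A minimal sketch with illustrative names (Phi, logreg_sgd), assuming 0/1 labels and a fixed step size:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logreg_sgd(Phi, y, eta=0.1, n_epochs=20):
    """SGD for logistic regression (illustrative sketch).
    Rows of Phi are feature vectors phi(x_i); y holds 0/1 labels."""
    theta = np.zeros(Phi.shape[1])
    for _ in range(n_epochs):
        for k in np.random.permutation(len(y)):
            mu_k = sigmoid(theta @ Phi[k])                 # mu_k = sigma(theta_k^T phi(x_k)), 0 < mu_k < 1
            theta = theta + eta * (y[k] - mu_k) * Phi[k]   # theta_{k+1} = theta_k + eta (y_k - mu_k) phi(x_k)
    return theta
```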

Page 6

Perceptron

[Photo: the MARK 1 Perceptron computer; Frank Rosenblatt, late 1950s.]

Decision Rule: ŷ_i = I(θ^T φ(x_i) > 0)

Learning Rule:
If ŷ_k = y_k: θ_{k+1} = θ_k
If ŷ_k ≠ y_k: θ_{k+1} = θ_k + ȳ_k φ(x_k), where ȳ_k = 2y_k − 1 ∈ {+1, −1}
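The decision and learning rules above give a very short algorithm. A minimal Python sketch with illustrative names (Phi, perceptron), assuming 0/1 labels:

```python
import numpy as np

def perceptron(Phi, y, n_epochs=10):
    """Rosenblatt's perceptron learning rule (illustrative sketch).
    Rows of Phi are feature vectors phi(x_i); y holds 0/1 labels."""
    theta = np.zeros(Phi.shape[1])
    y = np.asarray(y)
    y_bar = 2 * y - 1                                 # y_bar_k = 2 y_k - 1 in {+1, -1}
    for _ in range(n_epochs):
        for k in range(len(y)):
            y_hat = 1 if theta @ Phi[k] > 0 else 0    # decision rule I(theta^T phi(x_k) > 0)
            if y_hat != y[k]:                         # update only on mistakes
                theta = theta + y_bar[k] * Phi[k]
    return theta
```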

Page 7

Perceptron Algorithm Convergence

[Figure: four snapshots of perceptron updates on 2D data, converging to a separating decision boundary. C. Bishop, Pattern Recognition & Machine Learning.]

Page 8

Perceptron Algorithm Properties

Strengths
• Guaranteed to converge if the data are linearly separable (in feature space; each update reduces the angle to a true separator)
• Easy to construct a kernel representation of the algorithm

Weaknesses
• May be slow to converge (worst-case performance is poor)
• If the data are not linearly separable, it will never converge
• The solution depends on the order in which the data are visited; there is no notion of a best separating hyperplane
• Non-probabilistic: no measure of confidence in decisions, and difficult to generalize to other problems

Page 9

Covariance Matrices

• Eigenvalues and eigenvectors: Σ u_i = λ_i u_i, i = 1, ..., d

• For a symmetric matrix: λ_i ∈ R, u_i^T u_i = 1, u_i^T u_j = 0 for i ≠ j, and

Σ = U Λ U^T = ∑_{i=1}^d λ_i u_i u_i^T

• For a positive semidefinite matrix: λ_i ≥ 0

• For a positive definite matrix: λ_i > 0, and

Σ^{−1} = U Λ^{−1} U^T = ∑_{i=1}^d (1/λ_i) u_i u_i^T

Coordinates in the eigenbasis: y_i = u_i^T (x − μ)
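These decompositions are easy to verify numerically. A small illustrative check (not course code), using numpy's eigh for the symmetric eigendecomposition:

```python
import numpy as np

# Numerical check of the spectral decompositions above.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
Sigma = A @ A.T + 1e-3 * np.eye(4)            # a positive definite covariance matrix

lam, U = np.linalg.eigh(Sigma)                # eigenvalues lam_i, orthonormal eigenvectors u_i (columns of U)
assert np.all(lam > 0)                        # positive definite => lam_i > 0
assert np.allclose(Sigma, U @ np.diag(lam) @ U.T)                    # Sigma = U Lambda U^T
assert np.allclose(np.linalg.inv(Sigma), U @ np.diag(1/lam) @ U.T)   # Sigma^{-1} = U Lambda^{-1} U^T
```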

Page 10

Mercer Kernel Functions

X: an arbitrary input space (vectors, functions, strings, graphs, ...)

• A kernel function maps pairs of inputs to real numbers:

k : X × X → R, k(x_i, x_j) = k(x_j, x_i)

Intuition: larger values indicate inputs are "more similar".

• A kernel function is positive semidefinite if and only if, for any n ≥ 1 and any x = {x_1, x_2, ..., x_n}, the Gram matrix K ∈ R^{n×n} with K_{ij} = k(x_i, x_j) is positive semidefinite.

• Mercer's Theorem: assuming certain technical conditions, every positive definite kernel function can be represented as

k(x_i, x_j) = ∑_{ℓ=1}^d φ_ℓ(x_i) φ_ℓ(x_j)

for some feature mapping φ (but may need d → ∞).

Page 11

Exponential Kernels

X: real vectors of some fixed dimension

k(x_i, x_j) = exp{ −( ||x_i − x_j|| / σ )^γ }, 0 < γ ≤ 2

We can construct a covariance matrix by evaluating the kernel at any set of inputs, and then sample from the zero-mean Gaussian distribution with that covariance. This is a Gaussian process.
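A minimal sketch of that construction (illustrative, not the lecture's code): evaluate the kernel with γ = 2 (the squared exponential case) on a grid of 1-D inputs and draw one Gaussian process sample.

```python
import numpy as np

def exp_kernel(xi, xj, sigma=1.0, gamma=2.0):
    return np.exp(-(np.abs(xi - xj) / sigma) ** gamma)

x = np.linspace(0.0, 5.0, 100)                            # 1-D inputs for illustration
K = exp_kernel(x[:, None], x[None, :])                    # covariance K_ij = k(x_i, x_j)
K += 1e-8 * np.eye(len(x))                                # small jitter for numerical stability
f = np.random.multivariate_normal(np.zeros(len(x)), K)    # one GP sample evaluated at the inputs x
```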

Page 12

Polynomial Kernels

X: real vectors of some fixed dimension

• The polynomial kernel has an explicit feature mapping, but the number of features grows exponentially with both the input dimension and the polynomial degree (see the sketch below).

• The squared exponential kernel requires an infinite number of features (roughly, radial basis functions at all possible locations in the input space).
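To illustrate the explicit feature map, here is a small sketch for 2-D inputs, assuming the common inhomogeneous polynomial kernel k(x, x') = (1 + x^T x')^2 of degree 2 (the slide does not spell out the exact form). The kernel value matches the inner product of six explicit quadratic features:

```python
import numpy as np

def poly_kernel(x, xp):
    return (1.0 + x @ xp) ** 2

def quad_features(x):
    """Explicit feature map for (1 + x^T x')^2 in 2 dimensions."""
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2,
                     np.sqrt(2) * x1 * x2])

x, xp = np.array([1.0, 2.0]), np.array([0.5, -1.0])
assert np.isclose(poly_kernel(x, xp), quad_features(x) @ quad_features(xp))
```

For degree d in n input dimensions the explicit map has on the order of C(n+d, d) features, while evaluating the kernel itself costs only O(n).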

Page 13

String Kernels

X: strings of characters from some finite alphabet of size A

• Feature vector: the count of the number of times that every substring, of every possible length, occurs within the string. The total number of such features is

D = A + A^2 + A^3 + A^4 + · · ·

• Using suffix trees, the kernel can be evaluated in time linear in the length of the input strings (a brute-force version appears below).

[Figure: two amino acid sequences x and x′.]
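A brute-force sketch of the substring-count feature map (illustrative only): count every substring of every length and take the inner product of the two count vectors. The suffix-tree evaluation mentioned above is linear time; this version is quadratic per string and just exposes the underlying features.

```python
from collections import Counter

def substring_counts(s):
    return Counter(s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1))

def string_kernel(s, t):
    cs, ct = substring_counts(s), substring_counts(t)
    return sum(cs[sub] * ct[sub] for sub in cs if sub in ct)

print(string_kernel("GATTACA", "TACCATA"))   # hypothetical example strings
```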

Page 14

Kernels and Features

• What features lead to valid, positive semidefinite kernels?

k(x_i, x_j) = φ(x_i)^T φ(x_j) is valid for any φ : X → R^d

• When is a hypothesized kernel function positive semidefinite? It can be tricky to verify whether an underlying feature mapping exists (a numerical check appears below).

• How can I build new kernel functions? Given valid kernels k_1, k_2, the following are also valid:

k(x_i, x_j) = c k_1(x_i, x_j), c > 0
k(x_i, x_j) = f(x_i) k_1(x_i, x_j) f(x_j) for any f(x)
k(x_i, x_j) = k_1(x_i, x_j) + k_2(x_i, x_j)
k(x_i, x_j) = k_1(x_i, x_j) k_2(x_i, x_j)
k(x_i, x_j) = exp(k_1(x_i, x_j))
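A quick numerical sanity check (not a proof, and not course code): a positive semidefinite kernel must produce Gram matrices with no negative eigenvalues. Here the candidate is exp(k_1) with k_1 the linear kernel, which the closure rules above say is valid; kernel_candidate is an illustrative name.

```python
import numpy as np

def kernel_candidate(xi, xj):
    return np.exp(xi @ xj)                            # exp of a valid (linear) kernel

rng = np.random.default_rng(1)
X = rng.standard_normal((30, 5))                      # 30 arbitrary inputs in R^5
K = np.array([[kernel_candidate(a, b) for b in X] for a in X])
print(np.linalg.eigvalsh(K).min() >= -1e-9)           # True (up to round-off) if the Gram matrix is PSD
```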

Page 15

Kernelizing Learning Algorithms

• Start with any learning algorithm based on features φ(x). (Don't worry that computing the features might be expensive or impossible.)
• Manipulate the steps in the algorithm so that it depends not directly on the features, but only on their inner products: k(x_i, x_j) = φ(x_i)^T φ(x_j)
• Write code that only uses calls to the kernel function.
• Basic identity: the squared distance between feature vectors is

||φ(x_i) − φ(x_j)||_2^2 = k(x_i, x_i) + k(x_j, x_j) − 2 k(x_i, x_j)

Examples:
• Feature-based nearest neighbor classification
• Feature-based clustering algorithms (later)
• Feature-based nearest centroid classification (see the sketch below):

ŷ_test = argmin_c ||φ(x_test) − μ_c||_2, where μ_c = (1/N_c) ∑_{i : y_i = c} φ(x_i) is the mean of the N_c training examples of class c
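A minimal sketch of kernelized nearest centroid classification (illustrative names: X_train, y_train, kernel). Expanding ||φ(x_test) − μ_c||^2 with the basic identity leaves only kernel calls; the k(x_test, x_test) term is the same for every class, so it can be dropped when comparing classes.

```python
import numpy as np

def nearest_centroid_predict(x_test, X_train, y_train, kernel):
    """Kernelized nearest centroid: score each class using only kernel calls."""
    scores = {}
    for c in np.unique(y_train):
        Xc = X_train[y_train == c]                                      # the N_c examples of class c
        cross = np.mean([kernel(x_test, xi) for xi in Xc])              # (1/Nc) sum_i k(x_test, x_i)
        within = np.mean([[kernel(xi, xj) for xj in Xc] for xi in Xc])  # (1/Nc^2) sum_ij k(x_i, x_j)
        scores[c] = -2.0 * cross + within                               # ||phi(x) - mu_c||^2 up to a constant
    return min(scores, key=scores.get)
```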

Page 16

Kernelized Perceptron Algorithm

Feature-based form:
Decision Rule: ŷ_test = I(θ^T φ(x_test) > 0)
Learning Rule: if ŷ_k = y_k, θ_{k+1} = θ_k; if ŷ_k ≠ y_k, θ_{k+1} = θ_k + ȳ_k φ(x_k), where ȳ_k = 2y_k − 1 ∈ {+1, −1}
Representation: D feature weights
Problem: it may be intractable to compute/store φ(x_k), θ_k

Kernelized form:
Initialize with θ_0 = 0. By induction, for all k,

θ_k = ∑_{i=1}^N s_{ik} φ(x_i) for some integers s_{ik}

Decision Rule: ŷ_test = I( ∑_{i=1}^N s_i k(x_test, x_i) > 0 )
Learning Rule: if ŷ_k = y_k, s_{k,k+1} = s_{k,k}; if ŷ_k ≠ y_k, s_{k,k+1} = s_{k,k} + ȳ_k
Representation: N training example weights (see the sketch below)
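A minimal Python sketch of the kernelized form (illustrative names, assuming 0/1 labels): one integer weight s_i per training example replaces the D feature weights, and both learning and prediction use only calls to the kernel function.

```python
import numpy as np

def kernel_perceptron(X, y, kernel, n_epochs=10):
    """Kernelized perceptron sketch: store one weight s_i per training example."""
    N = len(y)
    y = np.asarray(y)
    s = np.zeros(N)                                       # all zeros corresponds to theta_0 = 0
    y_bar = 2 * y - 1                                     # y_bar in {+1, -1}
    K = np.array([[kernel(a, b) for b in X] for a in X])  # precomputed Gram matrix
    for _ in range(n_epochs):
        for k in range(N):
            y_hat = 1 if s @ K[:, k] > 0 else 0           # I(sum_i s_i k(x_k, x_i) > 0)
            if y_hat != y[k]:                             # mistake: s_k <- s_k + y_bar_k
                s[k] += y_bar[k]
    return s

def kernel_perceptron_predict(x_test, X, s, kernel):
    return 1 if sum(si * kernel(x_test, xi) for si, xi in zip(s, X)) > 0 else 0
```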