TRANSCRIPT
Introduction to Machine Learning
Brown University CSCI 1950-F, Spring 2012 Prof. Erik Sudderth
Lecture 15: Online Learning: Stochastic Gradient Descent, the Perceptron Algorithm, and Kernel Methods
Many figures courtesy Kevin Murphy’s textbook, Machine Learning: A Probabilistic Perspective
Batch versus Online Learning
• Many learning algorithms we've seen can be phrased as batch minimization of an objective of the form

  f(θ) = Σ_{i=1}^N f(θ, z_i)    (z_i is one of the N data points; this is the ML objective, and MAP is similar)

• This produces effective prediction algorithms, but can require significant computation and storage for training.
• We can do online learning from streaming data via stochastic gradient descent (SGD).
• SGD takes a small step based on single observations "sampled" from the data's empirical distribution:

  p(z) = (1/N) Σ_{i=1}^N δ_{z_i}(z)
Stochastic Gradient Descent
• How can we produce a single parameter estimate? Polyak-Ruppert averaging of the iterates.
• How should we set the step size η_k? Via the Robbins-Monro conditions; conventional batch step-size rules fail (in theory and in practice).
• How does this work in practice? Excellent for big datasets, but tuning the parameters is tricky. Refinement: take mini-batches of B data points for each step, 1 < B ≪ N.
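As a concrete sketch (not from the slides), the Robbins-Monro and Polyak-Ruppert ideas can be tried on a toy one-dimensional problem; the schedule parameters tau0 and kappa below are illustrative choices, not prescribed values:

```python
import numpy as np

# Illustrative sketch: SGD on a 1-D quadratic f(theta) = E[(theta - z)^2 / 2]
# with noisy gradients, using a Robbins-Monro step size
# eta_k = (tau0 + k)**(-kappa) with 0.5 < kappa <= 1, which satisfies
# sum_k eta_k = infinity and sum_k eta_k^2 < infinity.
rng = np.random.default_rng(0)
z = rng.normal(loc=3.0, scale=1.0, size=1000)   # toy data; optimum is mean(z)

theta = 0.0
avg = 0.0                    # Polyak-Ruppert running average of the iterates
tau0, kappa = 10.0, 0.7      # hypothetical schedule parameters
for k in range(5000):
    zi = z[rng.integers(len(z))]        # sample from the empirical distribution
    eta = (tau0 + k) ** (-kappa)        # Robbins-Monro step size
    theta = theta - eta * (theta - zi)  # stochastic gradient step
    avg = avg + (theta - avg) / (k + 1) # running average: single final estimate
```

The individual iterates keep fluctuating around the optimum, while the averaged iterate settles near the batch minimizer.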
Least Mean Squares (LMS)
[Figure: left panel plots RSS versus iteration; right panel shows the black LMS trajectory in (w0, w1) space converging toward the least-squares solution (red cross).]
Stochastic gradient descent applied to the linear regression model:

  f(θ, y_i, x_i) = (1/2)(y_i − θ^T φ(x_i))^2

  ŷ_k = θ_k^T φ(x_k)
  θ_{k+1} = θ_k + η_k (y_k − ŷ_k) φ(x_k)
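A minimal sketch of the LMS update on synthetic data (the features φ(x) = [1, x], the constant step size, and the toy targets are all illustrative; a constant step leaves a small noise floor rather than converging exactly):

```python
import numpy as np

# LMS sketch: SGD on f(theta, y_i, x_i) = (1/2)(y_i - theta^T phi(x_i))^2
# with phi(x) = [1, x] (bias plus raw input).
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=200)
y = 1.5 + 2.0 * x + 0.1 * rng.normal(size=200)   # toy targets, true weights [1.5, 2.0]

theta = np.zeros(2)
for k in range(20000):
    i = rng.integers(len(x))              # sample one observation
    phi = np.array([1.0, x[i]])           # feature vector phi(x_k)
    y_hat = theta @ phi                   # prediction y_hat_k = theta_k^T phi(x_k)
    theta = theta + 0.01 * (y[i] - y_hat) * phi   # LMS update
```

After many passes the weights hover near the least-squares solution, matching the trajectory plot on the slide.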
SGD for Logistic Regression
[Figure: the logistic (sigmoid) function σ(z) plotted for −10 ≤ z ≤ 10, rising from 0 to 1.]

  σ(z) = sigm(z) = 1 / (1 + e^{−z})
  p(y_i | x_i, θ) = Ber(y_i | μ_i),  μ_i = σ(θ^T φ(x_i))
  f(θ) = − Σ_{i=1}^N [ y_i log μ_i + (1 − y_i) log(1 − μ_i) ]

• Batch gradient:

  ∇f(θ) = Σ_{i=1}^N (μ_i − y_i) φ(x_i)

• Stochastic gradient descent:

  θ_{k+1} = θ_k + η_k (y_k − μ_k) φ(x_k),  μ_k = σ(θ_k^T φ(x_k)),  0 < μ_k < 1
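The SGD update for logistic regression can be sketched directly from the formulas above; the one-dimensional data, features φ(x) = [1, x], and step size below are illustrative:

```python
import numpy as np

# SGD for logistic regression:
# mu_k = sigma(theta_k^T phi(x_k)),  theta_{k+1} = theta_k + eta (y_k - mu_k) phi(x_k)
rng = np.random.default_rng(2)
x = rng.uniform(-3, 3, size=500)
y = (x > 0.5).astype(float)              # labels in {0, 1}, separable at x = 0.5

sigma = lambda z: 1.0 / (1.0 + np.exp(-z))
theta = np.zeros(2)
for k in range(30000):
    i = rng.integers(len(x))
    phi = np.array([1.0, x[i]])
    mu = sigma(theta @ phi)              # 0 < mu_k < 1
    theta = theta + 0.05 * (y[i] - mu) * phi

acc = np.mean((sigma(theta[0] + theta[1] * x) > 0.5) == (y == 1))
```

Note that the update has the same form as LMS, with the prediction passed through the sigmoid; this is no accident, as both are generalized linear models.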
Perceptron (Mark 1 Computer)
Frank Rosenblatt, late 1950s

Decision Rule: ŷ_i = I(θ^T φ(x_i) > 0)
Learning Rule: If ŷ_k = y_k, then θ_{k+1} = θ_k. If ŷ_k ≠ y_k, then θ_{k+1} = θ_k + ỹ_k φ(x_k), where ỹ_k = 2y_k − 1 ∈ {+1, −1}.
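A sketch of the perceptron learning rule, working directly in the {+1, −1} encoding ỹ = 2y − 1 with features φ(x) = [1, x1, x2]; the toy data (filtered to leave a margin so convergence is quick) are illustrative:

```python
import numpy as np

# Perceptron: update only on mistakes, theta <- theta + ytilde_k phi(x_k).
rng = np.random.default_rng(3)
X = rng.uniform(-1, 1, size=(200, 2))
X = X[np.abs(X[:, 0] + X[:, 1]) > 0.2]        # keep a margin around the separator
yt = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)   # ytilde in {+1, -1}

theta = np.zeros(3)
for epoch in range(200):                       # cycle through the data
    mistakes = 0
    for i in range(len(X)):
        phi = np.array([1.0, X[i, 0], X[i, 1]])
        y_hat = 1 if theta @ phi > 0 else -1
        if y_hat != yt[i]:                     # mistake: rotate the hyperplane
            theta = theta + yt[i] * phi
            mistakes += 1
    if mistakes == 0:                          # a clean pass: data separated
        break
```

Because this data is linearly separable with a margin, the classical convergence guarantee applies and a mistake-free pass is eventually reached.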
Perceptron Algorithm Convergence
[Figure: four panels showing successive perceptron updates on 2-D data in [−1, 1]^2; each misclassified point rotates the decision boundary until the two classes are separated. From C. Bishop, Pattern Recognition & Machine Learning.]
Perceptron Algorithm Properties
Strengths:
• Guaranteed to converge if the data are linearly separable (in feature space; each update reduces the angle to a true separating hyperplane)
• Easy to construct a kernel representation of the algorithm
Weaknesses:
• May be slow to converge (worst-case performance is poor)
• If the data are not linearly separable, it will never converge
• The solution depends on the order in which the data are visited; there is no notion of a best separating hyperplane
• Non-probabilistic: no measure of confidence in decisions, and difficult to generalize to other problems
Covariance Matrices
• Eigenvalues and eigenvectors:  Σ u_i = λ_i u_i,  i = 1, …, d
• For a symmetric matrix: λ_i ∈ ℝ, with orthonormal eigenvectors u_i^T u_i = 1 and u_i^T u_j = 0 for i ≠ j
• For a positive semidefinite matrix: λ_i ≥ 0
• For a positive definite matrix: λ_i > 0

  Σ = U Λ U^T = Σ_{i=1}^d λ_i u_i u_i^T
  Σ^{−1} = U Λ^{−1} U^T = Σ_{i=1}^d (1/λ_i) u_i u_i^T
  y_i = u_i^T (x − μ)   (projection onto the eigenbasis)
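These eigendecomposition facts are easy to verify numerically; the 3×3 matrix below is an arbitrary illustrative example, built to be symmetric positive definite:

```python
import numpy as np

# For symmetric positive definite Sigma, eigh gives real eigenvalues and
# orthonormal eigenvectors, and Sigma = U Lambda U^T, Sigma^{-1} = U Lambda^{-1} U^T.
rng = np.random.default_rng(4)
A = rng.normal(size=(3, 3))
Sigma = A @ A.T + 3 * np.eye(3)            # symmetric positive definite

lam, U = np.linalg.eigh(Sigma)             # Sigma u_i = lambda_i u_i
assert np.all(lam > 0)                     # positive definite: lambda_i > 0
assert np.allclose(U.T @ U, np.eye(3))     # orthonormal eigenvectors

# Rebuild Sigma and its inverse from the rank-one terms lambda_i u_i u_i^T.
recon = sum(lam[i] * np.outer(U[:, i], U[:, i]) for i in range(3))
inv = sum((1 / lam[i]) * np.outer(U[:, i], U[:, i]) for i in range(3))
```

The shift by 3I guarantees positive definiteness, since A A^T alone is only guaranteed to be positive semidefinite.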
Mercer Kernel Functions
X is an arbitrary input space (vectors, functions, strings, graphs, …)
• A kernel function maps pairs of inputs to real numbers:

  k : X × X → ℝ,  k(x_i, x_j) = k(x_j, x_i)

  Intuition: larger values indicate that the inputs are "more similar".
• A kernel function is positive semidefinite if and only if, for any n ≥ 1 and any x = {x_1, x_2, …, x_n}, the Gram matrix K ∈ ℝ^{n×n} with K_ij = k(x_i, x_j) is positive semidefinite.
• Mercer's Theorem: Assuming certain technical conditions, every positive definite kernel function can be represented as

  k(x_i, x_j) = Σ_{ℓ=1}^d φ_ℓ(x_i) φ_ℓ(x_j)

  for some feature mapping φ (but we may need d → ∞).
Exponential Kernels
X: real vectors of some fixed dimension

  k(x_i, x_j) = exp{ −(|x_i − x_j| / σ)^γ },  0 < γ ≤ 2

We can construct a covariance matrix by evaluating the kernel at any set of inputs, and then sample from the zero-mean Gaussian distribution with that covariance. This is a Gaussian process.
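The Gaussian-process construction above can be sketched in a few lines; the length scale, the choice γ = 2 (the squared exponential case), and the input grid are illustrative:

```python
import numpy as np

# Evaluate k(x_i, x_j) = exp(-(|x_i - x_j| / sigma)^gamma) on a grid of inputs,
# then draw a sample from the zero-mean Gaussian with that covariance.
rng = np.random.default_rng(5)
xs = np.linspace(0, 5, 50)
sigma_len, gamma = 1.0, 2.0
K = np.exp(-(np.abs(xs[:, None] - xs[None, :]) / sigma_len) ** gamma)

jitter = 1e-6 * np.eye(len(xs))            # numerical safeguard for the Cholesky
L = np.linalg.cholesky(K + jitter)
sample = L @ rng.normal(size=len(xs))      # one draw from N(0, K)
```

The small diagonal jitter is a standard numerical device: the smallest eigenvalues of a squared-exponential Gram matrix can round to slightly negative values in floating point.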
Polynomial Kernels
X: real vectors of some fixed dimension
• The polynomial kernel has an explicit feature mapping, but the number of features grows exponentially with both the input dimension and the polynomial degree.
• The squared exponential kernel requires an infinite number of features (roughly, radial basis functions at all possible locations in the input space).
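The explicit feature mapping is easy to exhibit in a small case; the degree-2 kernel k(x, x') = (x^T x')^2 on 2-D inputs below is an illustrative instance (the slides do not fix a particular polynomial kernel):

```python
import numpy as np

# For k(x, x') = (x^T x')^2 in two dimensions, the explicit feature map is
# phi(x) = [x1^2, sqrt(2) x1 x2, x2^2], so the kernel equals phi(x)^T phi(x')
# using 3 features. In dimension m at degree d the feature count grows
# combinatorially, which is the slide's point.
def phi(x):
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x = np.array([1.0, 2.0])
xp = np.array([3.0, -1.0])
k_direct = (x @ xp) ** 2          # kernel evaluated directly
k_feature = phi(x) @ phi(xp)      # same value via explicit features
```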
String Kernels
X: strings of characters from some finite alphabet of size A
• Feature vector: counts of the number of times that every substring, of every possible length, occurs within the string; the number of possible features is D = A + A^2 + A^3 + A^4 + ⋯
• Using suffix trees, the kernel can be evaluated in time linear in the length of the input strings.
[Figure: two amino-acid sequences x and x′ compared as strings.]
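A naive sketch of the substring-count feature map makes the kernel concrete; this direct version is quadratic per string and only meant to expose the features, in contrast to the linear-time suffix-tree evaluation the slide mentions:

```python
from collections import Counter

# Count every substring of every length, then take the inner product of the
# two (sparse) count vectors.
def substring_counts(s):
    return Counter(s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1))

def string_kernel(s, t):
    cs, ct = substring_counts(s), substring_counts(t)
    return sum(cs[u] * ct[u] for u in cs if u in ct)
```

For example, "ab" has substrings {a, b, ab}, each occurring once, and "abab" contains a, b, and ab twice each, so string_kernel("abab", "ab") = 2 + 2 + 2 = 6.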
Kernels and Features
• What features lead to valid, positive semidefinite kernels?

  k(x_i, x_j) = φ(x_i)^T φ(x_j) is valid for any φ : X → ℝ^d

• When is a hypothesized kernel function positive semidefinite? It can be tricky to verify whether an underlying feature mapping exists.
• How can I build new kernel functions? Given valid kernels k, k_1, k_2, the following are also valid:

  k′(x_i, x_j) = c k(x_i, x_j),  c > 0
  k′(x_i, x_j) = f(x_i) k(x_i, x_j) f(x_j)  for any function f(x)
  k′(x_i, x_j) = k_1(x_i, x_j) + k_2(x_i, x_j)
  k′(x_i, x_j) = k_1(x_i, x_j) k_2(x_i, x_j)
  k′(x_i, x_j) = exp(k(x_i, x_j))
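A quick numerical sanity check of three of these closure rules on random Gram matrices (the feature dimensions and matrix sizes below are arbitrary illustrative choices):

```python
import numpy as np

# If K1, K2 are PSD Gram matrices, then K1 + K2, the elementwise product
# K1 * K2 (Gram matrix of the kernel k1 k2, by the Schur product theorem),
# and elementwise exp(K1) are PSD as well.
rng = np.random.default_rng(6)
P1 = rng.normal(size=(5, 3))
P2 = rng.normal(size=(5, 4))
K1 = P1 @ P1.T                         # Gram matrix of an explicit feature map
K2 = P2 @ P2.T

def is_psd(K, tol=1e-9):
    return bool(np.all(np.linalg.eigvalsh(K) > -tol))

checks = [is_psd(K1 + K2), is_psd(K1 * K2), is_psd(np.exp(K1))]
```

A numerical check like this can refute a hypothesized kernel (one negative eigenvalue suffices), but it can never prove validity for all input sets.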
Kernelizing Learning Algorithms
• Start with any learning algorithm based on features φ(x). (Don't worry that computing the features might be expensive or impossible.)
• Manipulate the steps in the algorithm so that it depends not directly on the features, but only on their inner products: k(x_i, x_j) = φ(x_i)^T φ(x_j)
• Write code that only uses calls to the kernel function.
• Basic identity: the squared distance between feature vectors is

  ||φ(x_i) − φ(x_j)||_2^2 = k(x_i, x_i) + k(x_j, x_j) − 2 k(x_i, x_j)

• Feature-based nearest neighbor classification
• Feature-based clustering algorithms (later)
• Feature-based nearest centroid classification:

  ŷ_test = argmin_c ||φ(x_test) − μ_c||^2,  μ_c = (1/N_c) Σ_{i : y_i = c} φ(x_i)

  where μ_c is the mean of the N_c training examples of class c.
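Nearest centroid classification kernelizes by expanding the squared distance to each class mean so that only kernel evaluations remain; the RBF kernel and toy data below are illustrative choices:

```python
import numpy as np

# ||phi(x) - mu_c||^2 = k(x, x) - (2/N_c) sum_i k(x, x_i)
#                     + (1/N_c^2) sum_{i,j} k(x_i, x_j)
# where the sums run over the N_c training examples of class c.
def k_rbf(a, b):
    return np.exp(-np.sum((a - b) ** 2))

def centroid_dist2(x, Xc):
    N = len(Xc)
    return (k_rbf(x, x)
            - 2.0 / N * sum(k_rbf(x, xi) for xi in Xc)
            + 1.0 / N ** 2 * sum(k_rbf(xi, xj) for xi in Xc for xj in Xc))

X0 = np.array([[0.0, 0.0], [0.2, 0.1]])    # class 0 training examples
X1 = np.array([[2.0, 2.0], [2.1, 1.9]])    # class 1 training examples
x_test = np.array([0.1, 0.0])
label = 0 if centroid_dist2(x_test, X0) < centroid_dist2(x_test, X1) else 1
```

Note that the feature vectors φ(x) are never formed; for the RBF kernel they would be infinite-dimensional.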
Kernelized Perceptron Algorithm

Feature representation (D feature weights):
Decision Rule: ŷ_test = I(θ^T φ(x_test) > 0)
Learning Rule: If ŷ_k = y_k, then θ_{k+1} = θ_k. If ŷ_k ≠ y_k, then θ_{k+1} = θ_k + ỹ_k φ(x_k), where ỹ_k = 2y_k − 1 ∈ {+1, −1}.
Problem: It may be intractable to compute or store φ(x_k) and θ_k.

Kernel representation (N training example weights):
Initialize with θ_0 = 0. By induction, for all k,

  θ_k = Σ_{i=1}^N s_{ik} φ(x_i)  for some integers s_{ik}

Decision Rule: ŷ_test = I( Σ_{i=1}^N s_i k(x_test, x_i) > 0 )
Learning Rule: If ŷ_k = y_k, then s_{k,k+1} = s_{k,k}. If ŷ_k ≠ y_k, then s_{k,k+1} = s_{k,k} + ỹ_k.
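The kernel representation above can be sketched directly: keep one integer weight per training example and never form φ or θ. The degree-2 polynomial kernel and XOR-style data below (not linearly separable in the input space) are illustrative:

```python
import numpy as np

# Kernelized perceptron: score a point by sum_i s_i k(x, x_i); on a mistake
# at example k, add ytilde_k to s_k.
def k_poly(a, b):
    return (np.dot(a, b) + 1.0) ** 2

X = np.array([[1.0, 1.0], [-1.0, -1.0], [1.0, -1.0], [-1.0, 1.0]])
yt = np.array([1, 1, -1, -1])              # XOR labels in {+1, -1}

s = np.zeros(len(X))                       # N training-example weights
for epoch in range(20):
    mistakes = 0
    for kk in range(len(X)):
        score = sum(s[i] * k_poly(X[kk], X[i]) for i in range(len(X)))
        y_hat = 1 if score > 0 else -1
        if y_hat != yt[kk]:
            s[kk] += yt[kk]                # s_{k,k+1} = s_{k,k} + ytilde_k
            mistakes += 1
    if mistakes == 0:
        break
```

The polynomial kernel's implicit features include the cross term x1·x2, which is exactly what separates XOR, so the algorithm converges even though no linear separator exists in the original space.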