TRANSCRIPT
Introduction to Machine Learning
Brown University CSCI 1950-F, Spring 2012 Prof. Erik Sudderth
Lecture 15: Online Learning: Stochastic Gradient Descent, the Perceptron Algorithm, and Kernel Methods
Many figures courtesy Kevin Murphy’s textbook, Machine Learning: A Probabilistic Perspective
Batch versus Online Learning
• Many learning algorithms we've seen can be phrased as batch minimization of an objective of the form

  f(θ) = Σ_{i=1}^N f(θ, z_i)    (z_i is one of the N data points; this is the ML objective, and MAP is similar)

• This produces effective prediction algorithms, but can require significant computation and storage for training.
• We can do online learning from streaming data via stochastic gradient descent (SGD).
• SGD takes a small step based on single observations "sampled" from the data's empirical distribution:

  p(z) = (1/N) Σ_{i=1}^N δ_{z_i}(z)
Stochastic Gradient Descent
• How can we produce a single parameter estimate? Polyak-Ruppert averaging of the iterates.
• How should we set the step size η_k? Via the Robbins-Monro conditions; conventional batch step-size rules fail (in theory and in practice).
• How does this work in practice? Excellent for big datasets, but tuning the parameters is tricky. Refinement: take mini-batches of B data points for each step, 1 < B ≪ N.
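As a concrete sketch (not from the slides), the Robbins-Monro and Polyak-Ruppert ideas can be tried on a toy one-dimensional problem; the schedule parameters tau0 and kappa below are illustrative choices, not prescribed values:

```python
import numpy as np

# Illustrative sketch: SGD on a 1-D quadratic f(theta) = E[(theta - z)^2 / 2]
# with noisy gradients, using a Robbins-Monro step size
# eta_k = (tau0 + k)**(-kappa) with 0.5 < kappa <= 1, which satisfies
# sum_k eta_k = infinity and sum_k eta_k^2 < infinity.
rng = np.random.default_rng(0)
z = rng.normal(loc=3.0, scale=1.0, size=1000)   # toy data; optimum is mean(z)

theta = 0.0
avg = 0.0                    # Polyak-Ruppert running average of the iterates
tau0, kappa = 10.0, 0.7      # hypothetical schedule parameters
for k in range(5000):
    zi = z[rng.integers(len(z))]        # sample from the empirical distribution
    eta = (tau0 + k) ** (-kappa)        # Robbins-Monro step size
    theta = theta - eta * (theta - zi)  # stochastic gradient step
    avg = avg + (theta - avg) / (k + 1) # running average: single final estimate
```

The individual iterates keep fluctuating around the optimum, while the averaged iterate settles near the batch minimizer.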
Least Mean Squares (LMS)
[Figure: left panel plots RSS versus iteration; right panel shows the black LMS trajectory in (w0, w1) space converging toward the least-squares solution (red cross).]
Stochastic gradient descent applied to the linear regression model:

  f(θ, y_i, x_i) = (1/2)(y_i − θ^T φ(x_i))^2

  ŷ_k = θ_k^T φ(x_k)
  θ_{k+1} = θ_k + η_k (y_k − ŷ_k) φ(x_k)
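A minimal sketch of the LMS update on synthetic data (the features φ(x) = [1, x], the constant step size, and the toy targets are all illustrative; a constant step leaves a small noise floor rather than converging exactly):

```python
import numpy as np

# LMS sketch: SGD on f(theta, y_i, x_i) = (1/2)(y_i - theta^T phi(x_i))^2
# with phi(x) = [1, x] (bias plus raw input).
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=200)
y = 1.5 + 2.0 * x + 0.1 * rng.normal(size=200)   # toy targets, true weights [1.5, 2.0]

theta = np.zeros(2)
for k in range(20000):
    i = rng.integers(len(x))              # sample one observation
    phi = np.array([1.0, x[i]])           # feature vector phi(x_k)
    y_hat = theta @ phi                   # prediction y_hat_k = theta_k^T phi(x_k)
    theta = theta + 0.01 * (y[i] - y_hat) * phi   # LMS update
```

After many passes the weights hover near the least-squares solution, matching the trajectory plot on the slide.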
SGD for Logistic Regression
[Figure: the logistic (sigmoid) function σ(z) plotted for −10 ≤ z ≤ 10, rising from 0 to 1.]

  σ(z) = sigm(z) = 1 / (1 + e^{−z})
  p(y_i | x_i, θ) = Ber(y_i | μ_i),  μ_i = σ(θ^T φ(x_i))
  f(θ) = − Σ_{i=1}^N [ y_i log μ_i + (1 − y_i) log(1 − μ_i) ]

• Batch gradient:

  ∇f(θ) = Σ_{i=1}^N (μ_i − y_i) φ(x_i)

• Stochastic gradient descent:

  θ_{k+1} = θ_k + η_k (y_k − μ_k) φ(x_k),  μ_k = σ(θ_k^T φ(x_k)),  0 < μ_k < 1
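The SGD update for logistic regression can be sketched directly from the formulas above; the one-dimensional data, features φ(x) = [1, x], and step size below are illustrative:

```python
import numpy as np

# SGD for logistic regression:
# mu_k = sigma(theta_k^T phi(x_k)),  theta_{k+1} = theta_k + eta (y_k - mu_k) phi(x_k)
rng = np.random.default_rng(2)
x = rng.uniform(-3, 3, size=500)
y = (x > 0.5).astype(float)              # labels in {0, 1}, separable at x = 0.5

sigma = lambda z: 1.0 / (1.0 + np.exp(-z))
theta = np.zeros(2)
for k in range(30000):
    i = rng.integers(len(x))
    phi = np.array([1.0, x[i]])
    mu = sigma(theta @ phi)              # 0 < mu_k < 1
    theta = theta + 0.05 * (y[i] - mu) * phi

acc = np.mean((sigma(theta[0] + theta[1] * x) > 0.5) == (y == 1))
```

Note that the update has the same form as LMS, with the prediction passed through the sigmoid; this is no accident, as both are generalized linear models.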
Perceptron (Mark 1 Computer)
Frank Rosenblatt, late 1950s

Decision Rule: ŷ_i = I(θ^T φ(x_i) > 0)
Learning Rule: If ŷ_k = y_k, then θ_{k+1} = θ_k. If ŷ_k ≠ y_k, then θ_{k+1} = θ_k + ỹ_k φ(x_k), where ỹ_k = 2y_k − 1 ∈ {+1, −1}.
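A sketch of the perceptron learning rule, working directly in the {+1, −1} encoding ỹ = 2y − 1 with features φ(x) = [1, x1, x2]; the toy data (filtered to leave a margin so convergence is quick) are illustrative:

```python
import numpy as np

# Perceptron: update only on mistakes, theta <- theta + ytilde_k phi(x_k).
rng = np.random.default_rng(3)
X = rng.uniform(-1, 1, size=(200, 2))
X = X[np.abs(X[:, 0] + X[:, 1]) > 0.2]        # keep a margin around the separator
yt = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)   # ytilde in {+1, -1}

theta = np.zeros(3)
for epoch in range(200):                       # cycle through the data
    mistakes = 0
    for i in range(len(X)):
        phi = np.array([1.0, X[i, 0], X[i, 1]])
        y_hat = 1 if theta @ phi > 0 else -1
        if y_hat != yt[i]:                     # mistake: rotate the hyperplane
            theta = theta + yt[i] * phi
            mistakes += 1
    if mistakes == 0:                          # a clean pass: data separated
        break
```

Because this data is linearly separable with a margin, the classical convergence guarantee applies and a mistake-free pass is eventually reached.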
Perceptron Algorithm Convergence
[Figure: four panels showing successive perceptron updates on 2-D data in [−1, 1]^2; each misclassified point rotates the decision boundary until the two classes are separated. From C. Bishop, Pattern Recognition & Machine Learning.]
Perceptron Algorithm Properties
Strengths:
• Guaranteed to converge if the data are linearly separable (in feature space; each update reduces the angle to a true separating hyperplane)
• Easy to construct a kernel representation of the algorithm
Weaknesses:
• May be slow to converge (worst-case performance is poor)
• If the data are not linearly separable, it will never converge
• The solution depends on the order in which the data are visited; there is no notion of a best separating hyperplane
• Non-probabilistic: no measure of confidence in decisions, and difficult to generalize to other problems
Covariance Matrices
• Eigenvalues and eigenvectors:  Σ u_i = λ_i u_i,  i = 1, …, d
• For a symmetric matrix: λ_i ∈ ℝ, with orthonormal eigenvectors u_i^T u_i = 1 and u_i^T u_j = 0 for i ≠ j
• For a positive semidefinite matrix: λ_i ≥ 0
• For a positive definite matrix: λ_i > 0

  Σ = U Λ U^T = Σ_{i=1}^d λ_i u_i u_i^T
  Σ^{−1} = U Λ^{−1} U^T = Σ_{i=1}^d (1/λ_i) u_i u_i^T
  y_i = u_i^T (x − μ)   (projection onto the eigenbasis)
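These eigendecomposition facts are easy to verify numerically; the 3×3 matrix below is an arbitrary illustrative example, built to be symmetric positive definite:

```python
import numpy as np

# For symmetric positive definite Sigma, eigh gives real eigenvalues and
# orthonormal eigenvectors, and Sigma = U Lambda U^T, Sigma^{-1} = U Lambda^{-1} U^T.
rng = np.random.default_rng(4)
A = rng.normal(size=(3, 3))
Sigma = A @ A.T + 3 * np.eye(3)            # symmetric positive definite

lam, U = np.linalg.eigh(Sigma)             # Sigma u_i = lambda_i u_i
assert np.all(lam > 0)                     # positive definite: lambda_i > 0
assert np.allclose(U.T @ U, np.eye(3))     # orthonormal eigenvectors

# Rebuild Sigma and its inverse from the rank-one terms lambda_i u_i u_i^T.
recon = sum(lam[i] * np.outer(U[:, i], U[:, i]) for i in range(3))
inv = sum((1 / lam[i]) * np.outer(U[:, i], U[:, i]) for i in range(3))
```

The shift by 3I guarantees positive definiteness, since A A^T alone is only guaranteed to be positive semidefinite.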
Mercer Kernel Functions
X is an arbitrary input space (vectors, functions, strings, graphs, …)
• A kernel function maps pairs of inputs to real numbers:

  k : X × X → ℝ,  k(x_i, x_j) = k(x_j, x_i)

  Intuition: larger values indicate that the inputs are "more similar".
• A kernel function is positive semidefinite if and only if, for any n ≥ 1 and any x = {x_1, x_2, …, x_n}, the Gram matrix K ∈ ℝ^{n×n} with K_ij = k(x_i, x_j) is positive semidefinite.
• Mercer's Theorem: Assuming certain technical conditions, every positive definite kernel function can be represented as

  k(x_i, x_j) = Σ_{ℓ=1}^d φ_ℓ(x_i) φ_ℓ(x_j)

  for some feature mapping φ (but we may need d → ∞).
Exponential Kernels
X: real vectors of some fixed dimension

  k(x_i, x_j) = exp{ −(|x_i − x_j| / σ)^γ },  0 < γ ≤ 2

We can construct a covariance matrix by evaluating the kernel at any set of inputs, and then sample from the zero-mean Gaussian distribution with that covariance. This is a Gaussian process.
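The Gaussian-process construction above can be sketched in a few lines; the length scale, the choice γ = 2 (the squared exponential case), and the input grid are illustrative:

```python
import numpy as np

# Evaluate k(x_i, x_j) = exp(-(|x_i - x_j| / sigma)^gamma) on a grid of inputs,
# then draw a sample from the zero-mean Gaussian with that covariance.
rng = np.random.default_rng(5)
xs = np.linspace(0, 5, 50)
sigma_len, gamma = 1.0, 2.0
K = np.exp(-(np.abs(xs[:, None] - xs[None, :]) / sigma_len) ** gamma)

jitter = 1e-6 * np.eye(len(xs))            # numerical safeguard for the Cholesky
L = np.linalg.cholesky(K + jitter)
sample = L @ rng.normal(size=len(xs))      # one draw from N(0, K)
```

The small diagonal jitter is a standard numerical device: the smallest eigenvalues of a squared-exponential Gram matrix can round to slightly negative values in floating point.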
Polynomial Kernels
X: real vectors of some fixed dimension
• The polynomial kernel has an explicit feature mapping, but the number of features grows exponentially with both the input dimension and the polynomial degree.
• The squared exponential kernel requires an infinite number of features (roughly, radial basis functions at all possible locations in the input space).
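The explicit feature mapping is easy to exhibit in a small case; the degree-2 kernel k(x, x') = (x^T x')^2 on 2-D inputs below is an illustrative instance (the slides do not fix a particular polynomial kernel):

```python
import numpy as np

# For k(x, x') = (x^T x')^2 in two dimensions, the explicit feature map is
# phi(x) = [x1^2, sqrt(2) x1 x2, x2^2], so the kernel equals phi(x)^T phi(x')
# using 3 features. In dimension m at degree d the feature count grows
# combinatorially, which is the slide's point.
def phi(x):
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x = np.array([1.0, 2.0])
xp = np.array([3.0, -1.0])
k_direct = (x @ xp) ** 2          # kernel evaluated directly
k_feature = phi(x) @ phi(xp)      # same value via explicit features
```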
String Kernels
X: strings of characters from some finite alphabet of size A
• Feature vector: counts of the number of times that every substring, of every possible length, occurs within the string; the number of possible features is D = A + A^2 + A^3 + A^4 + ⋯
• Using suffix trees, the kernel can be evaluated in time linear in the length of the input strings.
[Figure: two amino-acid sequences x and x′ compared as strings.]
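A naive sketch of the substring-count feature map makes the kernel concrete; this direct version is quadratic per string and only meant to expose the features, in contrast to the linear-time suffix-tree evaluation the slide mentions:

```python
from collections import Counter

# Count every substring of every length, then take the inner product of the
# two (sparse) count vectors.
def substring_counts(s):
    return Counter(s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1))

def string_kernel(s, t):
    cs, ct = substring_counts(s), substring_counts(t)
    return sum(cs[u] * ct[u] for u in cs if u in ct)
```

For example, "ab" has substrings {a, b, ab}, each occurring once, and "abab" contains a, b, and ab twice each, so string_kernel("abab", "ab") = 2 + 2 + 2 = 6.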
Kernels and Features
• What features lead to valid, positive semidefinite kernels?

  k(x_i, x_j) = φ(x_i)^T φ(x_j) is valid for any φ : X → ℝ^d

• When is a hypothesized kernel function positive semidefinite? It can be tricky to verify whether an underlying feature mapping exists.
• How can I build new kernel functions? Given valid kernels k, k_1, k_2, the following are also valid:

  k′(x_i, x_j) = c k(x_i, x_j),  c > 0
  k′(x_i, x_j) = f(x_i) k(x_i, x_j) f(x_j)  for any function f(x)
  k′(x_i, x_j) = k_1(x_i, x_j) + k_2(x_i, x_j)
  k′(x_i, x_j) = k_1(x_i, x_j) k_2(x_i, x_j)
  k′(x_i, x_j) = exp(k(x_i, x_j))
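A quick numerical sanity check of three of these closure rules on random Gram matrices (the feature dimensions and matrix sizes below are arbitrary illustrative choices):

```python
import numpy as np

# If K1, K2 are PSD Gram matrices, then K1 + K2, the elementwise product
# K1 * K2 (Gram matrix of the kernel k1 k2, by the Schur product theorem),
# and elementwise exp(K1) are PSD as well.
rng = np.random.default_rng(6)
P1 = rng.normal(size=(5, 3))
P2 = rng.normal(size=(5, 4))
K1 = P1 @ P1.T                         # Gram matrix of an explicit feature map
K2 = P2 @ P2.T

def is_psd(K, tol=1e-9):
    return bool(np.all(np.linalg.eigvalsh(K) > -tol))

checks = [is_psd(K1 + K2), is_psd(K1 * K2), is_psd(np.exp(K1))]
```

A numerical check like this can refute a hypothesized kernel (one negative eigenvalue suffices), but it can never prove validity for all input sets.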
Kernelizing Learning Algorithms
• Start with any learning algorithm based on features φ(x). (Don't worry that computing the features might be expensive or impossible.)
• Manipulate the steps in the algorithm so that it depends not directly on the features, but only on their inner products: k(x_i, x_j) = φ(x_i)^T φ(x_j)
• Write code that only uses calls to the kernel function.
• Basic identity: the squared distance between feature vectors is

  ||φ(x_i) − φ(x_j)||_2^2 = k(x_i, x_i) + k(x_j, x_j) − 2 k(x_i, x_j)

• Feature-based nearest neighbor classification
• Feature-based clustering algorithms (later)
• Feature-based nearest centroid classification:

  ŷ_test = argmin_c ||φ(x_test) − μ_c||^2,  μ_c = (1/N_c) Σ_{i : y_i = c} φ(x_i)

  where μ_c is the mean of the N_c training examples of class c.
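Nearest centroid classification kernelizes by expanding the squared distance to each class mean so that only kernel evaluations remain; the RBF kernel and toy data below are illustrative choices:

```python
import numpy as np

# ||phi(x) - mu_c||^2 = k(x, x) - (2/N_c) sum_i k(x, x_i)
#                     + (1/N_c^2) sum_{i,j} k(x_i, x_j)
# where the sums run over the N_c training examples of class c.
def k_rbf(a, b):
    return np.exp(-np.sum((a - b) ** 2))

def centroid_dist2(x, Xc):
    N = len(Xc)
    return (k_rbf(x, x)
            - 2.0 / N * sum(k_rbf(x, xi) for xi in Xc)
            + 1.0 / N ** 2 * sum(k_rbf(xi, xj) for xi in Xc for xj in Xc))

X0 = np.array([[0.0, 0.0], [0.2, 0.1]])    # class 0 training examples
X1 = np.array([[2.0, 2.0], [2.1, 1.9]])    # class 1 training examples
x_test = np.array([0.1, 0.0])
label = 0 if centroid_dist2(x_test, X0) < centroid_dist2(x_test, X1) else 1
```

Note that the feature vectors φ(x) are never formed; for the RBF kernel they would be infinite-dimensional.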
Kernelized Perceptron Algorithm

Feature representation (D feature weights):
Decision Rule: ŷ_test = I(θ^T φ(x_test) > 0)
Learning Rule: If ŷ_k = y_k, then θ_{k+1} = θ_k. If ŷ_k ≠ y_k, then θ_{k+1} = θ_k + ỹ_k φ(x_k), where ỹ_k = 2y_k − 1 ∈ {+1, −1}.
Problem: It may be intractable to compute or store φ(x_k) and θ_k.

Kernel representation (N training example weights):
Initialize with θ_0 = 0. By induction, for all k,

  θ_k = Σ_{i=1}^N s_{ik} φ(x_i)  for some integers s_{ik}

Decision Rule: ŷ_test = I( Σ_{i=1}^N s_i k(x_test, x_i) > 0 )
Learning Rule: If ŷ_k = y_k, then s_{k,k+1} = s_{k,k}. If ŷ_k ≠ y_k, then s_{k,k+1} = s_{k,k} + ỹ_k.
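The kernel representation above can be sketched directly: keep one integer weight per training example and never form φ or θ. The degree-2 polynomial kernel and XOR-style data below (not linearly separable in the input space) are illustrative:

```python
import numpy as np

# Kernelized perceptron: score a point by sum_i s_i k(x, x_i); on a mistake
# at example k, add ytilde_k to s_k.
def k_poly(a, b):
    return (np.dot(a, b) + 1.0) ** 2

X = np.array([[1.0, 1.0], [-1.0, -1.0], [1.0, -1.0], [-1.0, 1.0]])
yt = np.array([1, 1, -1, -1])              # XOR labels in {+1, -1}

s = np.zeros(len(X))                       # N training-example weights
for epoch in range(20):
    mistakes = 0
    for kk in range(len(X)):
        score = sum(s[i] * k_poly(X[kk], X[i]) for i in range(len(X)))
        y_hat = 1 if score > 0 else -1
        if y_hat != yt[kk]:
            s[kk] += yt[kk]                # s_{k,k+1} = s_{k,k} + ytilde_k
            mistakes += 1
    if mistakes == 0:
        break
```

The polynomial kernel's implicit features include the cross term x1·x2, which is exactly what separates XOR, so the algorithm converges even though no linear separator exists in the original space.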