CS489/698: Intro to ML
Lecture 03: Multi-layer Perceptron
Yao-Liang Yu, 9/19/17
Outline
• Failure of Perceptron
• Neural Network
• Backpropagation
• Universal Approximator
The XOR problem
• X = { (0,0), (0,1), (1,0), (1,1) }, y = { -1, 1, 1, -1 }
• f*(x) = sign[ f_xor(x) - ½ ], where f_xor(x) = (x1 & ¬x2) | (¬x1 & x2)
No separating hyperplane
• x1 = (0,0), y1 = -1 ⟹ b < 0
• x2 = (0,1), y2 = 1 ⟹ w2 + b > 0
• x3 = (1,0), y3 = 1 ⟹ w1 + b > 0
• x4 = (1,1), y4 = -1 ⟹ w1 + w2 + b < 0
• Contradiction! Adding the two middle inequalities gives (w1 + w2 + b) + b > 0, while the first and last force both terms to be negative. [At least one of the inequalities has to be strict for the argument to go through.]
Exercise: what happens if we run perceptron/winnow on this example?
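One way to try this exercise empirically is a minimal sketch like the following (assuming the standard mistake-driven perceptron update; the variable names and the epoch cap are mine). Because no separating hyperplane exists, the mistake count never reaches zero:

```python
import numpy as np

# Run the perceptron update on XOR: it never stops making mistakes.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, 1, 1, -1])
w, b = np.zeros(2), 0.0

for epoch in range(100):
    mistakes = 0
    for xi, yi in zip(X, y):
        if yi * (w @ xi + b) <= 0:   # mistake: apply the perceptron update
            w += yi * xi
            b += yi
            mistakes += 1
    if mistakes == 0:                # would mean we found a separator
        break

print("epochs run:", epoch + 1, "mistakes in last epoch:", mistakes)  # never 0
```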
Fixing the problem
• Our model (hyperplanes) underfits the data (the XOR function)
• Keep the representation fixed, use a richer model
• Keep the model fixed, use a richer representation
• NNs: still use hyperplanes, but learn the representation simultaneously
Outline
• Failure of Perceptron
• Neural Network
• Backpropagation
• Universal Approximator
Two-layer perceptron
[Figure: a two-layer perceptron on inputs x1, x2. A first linear layer with weights u11, u12, u21, u22 and offsets c1, c2 (bias input 1) computes z1, z2; a nonlinear activation function turns these into the learned representation h1, h2; a second linear layer with weights w1, w2 and bias b produces the output ŷ. The nonlinear transform between the two linear layers makes all the difference!]
Does it work?
• x1 = (0,0), y1 = -1: z1 = (0,-1), h1 = (0,0), ŷ1 = -1 ✔
• x2 = (0,1), y2 = 1: z2 = (1,0), h2 = (1,0), ŷ2 = 1 ✔
• x3 = (1,0), y3 = 1: z3 = (1,0), h3 = (1,0), ŷ3 = 1 ✔
• x4 = (1,1), y4 = -1: z4 = (2,1), h4 = (2,1), ŷ4 = -1 ✔
Activation: Rectified Linear Unit (ReLU), h = max(z, 0)
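The table above pins down the first layer (U and c follow from the z values), while the output weights are only constrained by the signs; the sketch below uses one consistent choice (w = (1, -2), b = -½ are my assumption) to reproduce the table:

```python
import numpy as np

# Two-layer perceptron for XOR. U and c are forced by the slide's z values;
# w and b are one choice (an assumption) that reproduces the slide's outputs.
U = np.array([[1.0, 1.0],
              [1.0, 1.0]])   # first linear layer weights
c = np.array([0.0, -1.0])    # first layer offsets
w = np.array([1.0, -2.0])    # second linear layer weights (assumed)
b = -0.5                     # second layer bias (assumed)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, 1, 1, -1])

for x, yi in zip(X, y):
    z = U @ x + c                 # first linear layer
    h = np.maximum(z, 0.0)        # ReLU: learned representation
    y_hat = np.sign(w @ h + b)    # second linear layer + sign
    print(x, "->", int(y_hat), "target", yi)
```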
Multi-layer perceptron
[Figure: a multi-layer perceptron. Input layer x1, …, xd in R^d (plus bias 1), fully connected weights to a hidden layer z1, …, zk → h1, …, hk in R^k, and fully connected weights from the hidden layer to the output layer ŷ1, …, ŷm in R^m.]
Multi-layer perceptron (stacked)
[Figure: p copies of the previous block chained together, numbered 1, 2, …, p. Width = number of hidden units per layer; depth = number of layers. The feed-forward network computes a learned hierarchical nonlinear feature representation, followed by a linear predictor.]
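A minimal sketch of this feed-forward computation (function and variable names are mine, not the course's notation): every layer but the last applies a linear map followed by the nonlinearity, and the last layer is the linear predictor.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def mlp_forward(x, weights, biases, activation=relu):
    """Feed-forward pass through a stacked MLP of depth p = len(weights)."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = activation(W @ h + b)   # learned nonlinear representation
    W, b = weights[-1], biases[-1]
    return W @ h + b                # final linear predictor

# Example: d = 3 inputs, two hidden layers of width k = 4, m = 2 outputs.
rng = np.random.default_rng(0)
dims = [3, 4, 4, 2]
weights = [rng.standard_normal((dims[i + 1], dims[i])) for i in range(len(dims) - 1)]
biases = [np.zeros(dims[i + 1]) for i in range(len(dims) - 1)]
print(mlp_forward(rng.standard_normal(3), weights, biases))
```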
Activation function
• Sigmoid: σ(z) = 1 / (1 + e^(-z)) (smooth; saturates)
• Tanh: tanh(z) = (e^z - e^(-z)) / (e^z + e^(-z)) (smooth; saturates)
• Rectified Linear (ReLU): max(z, 0) (nonsmooth at 0; does not saturate)
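For concreteness, the three activations as NumPy one-liners (a sketch; the comments restate the slide's smooth/nonsmooth/saturate labels):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # smooth; saturates near 0 and 1

def tanh(z):
    return np.tanh(z)                 # smooth; saturates near -1 and 1

def relu(z):
    return np.maximum(z, 0.0)         # nonsmooth at 0; does not saturate

z = np.linspace(-5, 5, 5)
print(sigmoid(z), tanh(z), relu(z), sep="\n")
```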
Underfitting vs. Overfitting
• Linear predictors (perceptron / winnow / linear regression) underfit
• NNs learn hierarchical nonlinear features jointly with a linear predictor
  • may overfit
  • tons of heuristics (some later)
• Size of an NN: # of weights/connections
  • ~1 billion; human brain: ~10^6 billion (estimated)
Weights training
[Figure: the stacked MLP from before, mapping x in R^d to ŷ in R^m.]
• Need a loss to measure the difference between the prediction ŷ and the truth y; e.g., (ŷ - y)²; more later
• Need a training set {(x1, y1), …, (xn, yn)} to train the weights Θ of the model ŷ = q(x; Θ)
Gradient Descent

Θ ← Θ - η_t · ∇L(Θ)   (the training objective L averages the loss over all n samples)

• Step size (learning rate) η_t:
  • constant, if L is smooth
  • diminishing, otherwise
• Each (generalized) gradient costs O(n)!
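A minimal gradient-descent sketch (assuming a least-squares problem so the gradient has a closed form; in an NN the gradient would come from backpropagation instead). Note the full gradient touches all n samples at every step:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))          # n = 100 samples, d = 3
y = X @ np.array([1.0, -2.0, 0.5])         # ground-truth weights (my choice)
theta = np.zeros(3)
eta = 0.1                                  # constant step size: L is smooth here

for t in range(200):
    grad = X.T @ (X @ theta - y) / len(X)  # full gradient: averages all n samples, O(n)
    theta -= eta * grad

print(theta)                               # ≈ [1.0, -2.0, 0.5]
```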
Stochastic Gradient Descent (SGD)

Instead of averaging the gradient over all n samples, a single random sample suffices.

• diminishing step size, e.g., 1/√t or 1/t
• refinements: averaging, momentum, variance reduction, etc.
• in practice: sample w/o replacement, i.e., cycle through the data, permuting it in each pass
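The same least-squares problem with SGD (again a sketch with my variable names): one sample per update, a diminishing 1/√t step size, and a fresh permutation in each pass:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))
y = X @ np.array([1.0, -2.0, 0.5])
theta = np.zeros(3)

t = 0
for epoch in range(50):
    for i in rng.permutation(len(X)):            # sample w/o replacement: permute each pass
        t += 1
        eta = 0.5 / np.sqrt(t)                   # diminishing step size ~ 1/sqrt(t)
        grad_i = (X[i] @ theta - y[i]) * X[i]    # one-sample gradient: O(1), not O(n)
        theta -= eta * grad_i

print(theta)                                     # ≈ [1.0, -2.0, 0.5]
```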
A little history on optimization
• Gradient descent first mentioned in (Cauchy, 1847)
• First rigorous convergence proof (Curry, 1944)
• SGD proposed and analyzed (Robbins & Monro, 1951)
Herbert Robbins (1915–2001)
Outline
• Failure of Perceptron
• Neural Network
• Backpropagation
• Universal Approximator
Backpropagation
• A fancy name for the chain rule for derivatives: if f(x) = g[h(x)], then f'(x) = g'[h(x)] · h'(x)
• Efficiently computes the derivative in an NN
• Two passes; complexity = O(size(NN))
  • forward pass: compute function values sequentially
  • backward pass: compute derivatives sequentially
Algorithm

[Slide: pseudocode of backpropagation. Forward pass: compute each layer's values in sequence, with f the activation function. Backward pass: propagate derivatives of the training objective J = L + λΩ back through the layers.]
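A minimal backpropagation sketch for one hidden ReLU layer with squared loss and no regularizer, so J = L (the layer sizes and names are my assumptions); the finite-difference check at the end confirms the backward pass:

```python
import numpy as np

def forward_backward(x, y, W1, b1, W2, b2):
    # Forward pass: compute function values sequentially.
    z = W1 @ x + b1
    h = np.maximum(z, 0.0)           # ReLU activation f
    y_hat = W2 @ h + b2
    loss = 0.5 * np.sum((y_hat - y) ** 2)

    # Backward pass: compute derivatives sequentially (chain rule).
    d_yhat = y_hat - y               # dJ / d y_hat for squared loss
    dW2 = np.outer(d_yhat, h)
    db2 = d_yhat
    d_h = W2.T @ d_yhat              # propagate through 2nd linear layer
    d_z = d_h * (z > 0)              # propagate through ReLU
    dW1 = np.outer(d_z, x)
    db1 = d_z
    return loss, (dW1, db1, dW2, db2)

# Quick check against a finite difference on one weight entry.
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((4, 3)), np.zeros(4)
W2, b2 = rng.standard_normal((2, 4)), np.zeros(2)
x, y = rng.standard_normal(3), rng.standard_normal(2)
loss, grads = forward_backward(x, y, W1, b1, W2, b2)
eps = 1e-6
W1p = W1.copy()
W1p[0, 0] += eps
loss_p, _ = forward_backward(x, y, W1p, b1, W2, b2)
print(grads[0][0, 0], (loss_p - loss) / eps)   # should nearly match
```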
History (perfect for a course project!)
Outline
• Failure of Perceptron
• Neural Network
• Backpropagation
• Universal Approximator
Rationals are dense in R
• Any real number can be approximated arbitrarily well by some rational number
• Or, in fancy mathematical language:
  ∀x ∈ R, ∀ε > 0, ∃q ∈ Q such that |x - q| ≤ ε
  (R: the domain of interest; Q: the subset in our pocket; |·|: the metric for approximation)
Kolmogorov-Arnold Theorem

Theorem (Kolmogorov, Arnold, Lorentz, Sprecher, …). Any continuous function g: [0,1]^d → R can be written (exactly!) as
  g(x1, …, xd) = Σ_{q=0}^{2d} h_q( Σ_{p=1}^{d} ψ_{p,q}(x_p) )
for continuous univariate functions h_q and ψ_{p,q}.

• Binary addition is the "only" truly multivariate function!
• Solves Hilbert's 13th problem!
• The outer functions can be constructed from g, which is unknown…
Andrey Kolmogorov (1903–1987)
Foundations of the theory of probability
"Every mathematician believes that he is ahead of the others. The reason none state this belief in public is because they are intelligent people."
Universal Approximator

Theorem (Cybenko, Hornik et al., Leshno et al., …). Any continuous function g: [0,1]^d → R can be uniformly approximated to arbitrary precision by a two-layer NN with an activation function f that
1. is locally bounded,
2. has a "negligible" closure of its set of discontinuity points, and
3. is not a polynomial.

• the conditions on f are necessary in some sense
• they cover (almost) all activation functions used in practice
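As an illustration only (not the theorem's construction): a two-layer net with random ReLU hidden units, whose output layer is fit by least squares, can drive the uniform error down on a sample continuous g. Everything here (g, the width k, the weight ranges) is my choice:

```python
import numpy as np

rng = np.random.default_rng(0)
g = lambda x: np.sin(2 * np.pi * x) + 0.5 * np.cos(5 * x)   # a continuous target on [0, 1]

k = 200                                       # hidden width
a = rng.uniform(-10, 10, size=k)              # random first-layer weights
c = rng.uniform(-10, 10, size=k)              # random first-layer biases

x = np.linspace(0, 1, 500)
H = np.maximum(np.outer(x, a) + c, 0.0)       # hidden ReLU features, shape (500, k)
w, *_ = np.linalg.lstsq(H, g(x), rcond=None)  # fit the second (linear) layer

print("max |error|:", np.max(np.abs(H @ w - g(x))))  # small for large enough k
```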
Caveat and Remedy
• NNs were praised for being "universal"
  • but we shall see later that many kernels are universal as well
  • desirable, but perhaps not THE explanation
• May need exponentially many hidden units…
• Increasing depth may reduce the network size, exponentially!
  • can be a course project; ask for references
Questions?