Lecture 03: Multi-layer Perceptron · y328yu/mycourses/489/lectures/lec03_org.pdf


Page 1

CS489/698: Intro to ML · Lecture 03: Multi-layer Perceptron

9/19/17 · Yao-Liang Yu

Page 2

Outline

• Failure of Perceptron

• Neural Network

• Backpropagation

• Universal Approximator



Page 4

The XOR problem


• X = { (0,0), (0,1), (1,0), (1,1) }, y = { -1, 1, 1, -1 }
• f*(x) = sign[ fxor(x) - ½ ], where fxor(x) = (x1 & ¬x2) | (¬x1 & x2)

Page 5

No separating hyperplane

• x1 = (0,0), y1 = -1  ⇒  b < 0
• x2 = (0,1), y2 = 1  ⇒  w2 + b > 0
• x3 = (1,0), y3 = 1  ⇒  w1 + b > 0
• x4 = (1,1), y4 = -1  ⇒  w1 + w2 + b < 0

• Contradiction! Adding the two ">" inequalities gives (w1 + w2 + b) + b > 0, while adding the two "<" inequalities gives (w1 + w2 + b) + b < 0. [At least one of the inequalities has to be strict.]

Ex. what happens if we run perceptron/winnow on this example?
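As a sketch of this exercise (illustrative code, not from the slides): running the standard perceptron update on the four XOR points never converges, since no separating hyperplane exists.

```python
# Run the perceptron update on the XOR data and watch it fail to converge.
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, 1, 1, -1], dtype=float)

w = np.zeros(2)
b = 0.0
for epoch in range(1000):
    mistakes = 0
    for xi, yi in zip(X, y):
        if yi * (w @ xi + b) <= 0:   # misclassified (or on the boundary)
            w += yi * xi             # standard perceptron update
            b += yi
            mistakes += 1
    if mistakes == 0:
        print(f"converged after epoch {epoch}")
        break
else:
    print(f"no convergence after 1000 epochs; final w={w}, b={b}")
# Since XOR is not linearly separable, the loop never reports convergence.
```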

Page 6

Fixing the problem

• Our model (hyperplanes) underfits the data (the XOR function)

• Fix the representation, use a richer model

• Fix the model, use a richer representation

• NN: still use hyperplanes, but learn the representation simultaneously


Page 7

Outline

• Failure of Perceptron

• Neural Network

• Backpropagation

• Universal Approximator


Page 8

Two-layer perceptron


[Figure: a two-layer perceptron. Input layer: x1, x2 plus a constant 1. A 1st linear layer with weights u11, u12, u21, u22 and offsets c1, c2 produces z1, z2; a nonlinear activation function maps them to h1, h2, the learned representation; a 2nd linear layer with weights w1, w2 and bias b produces the output ŷ. The nonlinear transform between the two linear layers makes all the difference!]
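Written out (reconstructed from the figure labels above; f denotes the activation function):

$$ z_j = \sum_i u_{ij}\, x_i + c_j, \qquad h_j = f(z_j), \qquad \hat{y} = \operatorname{sign}\Big(\sum_j w_j h_j + b\Big) $$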

Page 9

Does it work?

• x1 = (0,0), y1 = -1: z1 = (0,-1), h1 = (0,0), ŷ1 = -1 ✔
• x2 = (0,1), y2 = 1: z2 = (1,0), h2 = (1,0), ŷ2 = 1 ✔
• x3 = (1,0), y3 = 1: z3 = (1,0), h3 = (1,0), ŷ3 = 1 ✔
• x4 = (1,1), y4 = -1: z4 = (2,1), h4 = (2,1), ŷ4 = -1 ✔


Nonlinear activation used here: Rectified Linear Unit (ReLU), i.e., h = max(z, 0).
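A quick check of the table as code. The first-layer weights are forced by the z values above (z1 = x1 + x2, z2 = x1 + x2 - 1); the output weights w = (1, -2) and b = -0.5 are one valid choice assumed here, not taken from the slide.

```python
# Verify the two-layer ReLU network on XOR.
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, 1, 1, -1])

U = np.array([[1.0, 1.0],
              [1.0, 1.0]])      # z1 = x1 + x2, z2 = x1 + x2 - 1
c = np.array([0.0, -1.0])
w = np.array([1.0, -2.0])       # assumed output weights (not on the slide)
b = -0.5

Z = X @ U + c                   # 1st linear layer
H = np.maximum(Z, 0)            # ReLU
y_hat = np.sign(H @ w + b)      # 2nd linear layer + sign

print(Z)        # [[0,-1],[1,0],[1,0],[2,1]], as in the table
print(H)        # [[0,0],[1,0],[1,0],[2,1]]
print(y_hat)    # [-1, 1, 1, -1], matches y
assert (y_hat == y).all()
```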

Page 10

Multi-layer perceptron


[Figure: a multi-layer perceptron. Input layer x1, …, xd (in R^d, plus a constant 1); hidden layer z1, …, zk followed by activations h1, …, hk (in R^k); output layer ŷ1, …, ŷm (in R^m). Both weight layers are fully connected.]

Page 11

Multi-layer perceptron (stacked)


[Figure: p copies of the block above stacked in sequence, labeled 1, 2, …, p. The number of stacked blocks is the depth; the number of units per hidden layer is the width. The network is feed-forward: the hidden layers form a learned hierarchical nonlinear feature representation, followed by a linear predictor at the output.]
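A minimal forward-pass sketch of this stacked architecture, assuming ReLU activations and placeholder random weights (d, k, m, p as labeled above):

```python
# Feed-forward pass through p stacked blocks (linear layer -> ReLU),
# followed by a final linear predictor. Weights are random placeholders.
import numpy as np

rng = np.random.default_rng(0)
d, k, m, p = 3, 4, 2, 3                 # input dim, width, output dim, depth

dims = [d] + [k] * p                    # the first block maps R^d -> R^k
hidden = [(rng.normal(size=(din, k)) * 0.1, np.zeros(k))
          for din in dims[:-1]]
W_out, b_out = rng.normal(size=(k, m)) * 0.1, np.zeros(m)

def forward(x):
    h = x
    for U, c in hidden:
        h = np.maximum(h @ U + c, 0)    # z = U^T h + c, then ReLU
    return h @ W_out + b_out            # final linear predictor

print(forward(rng.normal(size=d)))      # a point in R^m
```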

Page 12

Activation function

• Sigmoid

• Tanh

• Rectified Linear


(Sigmoid and tanh are smooth but saturate; ReLU is nonsmooth.)
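The three activations as code (standard definitions, not from the slide):

```python
# Standard definitions of the three activation functions.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # smooth, saturates at 0 and 1

def tanh(z):
    return np.tanh(z)                 # smooth, saturates at -1 and 1

def relu(z):
    return np.maximum(z, 0.0)         # nonsmooth at 0, does not saturate

z = np.linspace(-3, 3, 7)
print(sigmoid(z), tanh(z), relu(z), sep="\n")
```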

Page 13

Underfitting vs. Overfitting

• Linear predictor (perceptron / winnow / linear regression) underfits

• NNs learn hierarchical nonlinear features jointly with the linear predictor
• may overfit
• tons of heuristics (some later)

• Size of NN: # of weights/connections
• ~1 billion; human brain ~10^6 billion (estimated)


Page 14

Weights training


[Figure: the stacked multi-layer perceptron from page 11, viewed as a single map taking x in R^d to ŷ in R^m.]

• Need a loss to measure the difference between the prediction ŷ and the truth y
• e.g., (ŷ - y)²; more later

• Need a training set {(x1,y1), …, (xn,yn)} to train weights Θ

ŷ = q(x; Θ)
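Putting the pieces together (the slide only names the loss and the training set; this standard empirical-average objective is an assumption):

$$ \min_{\Theta}\; L(\Theta) = \frac{1}{n} \sum_{i=1}^{n} \big( q(x_i; \Theta) - y_i \big)^2 $$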

Page 15

Gradient Descent


Θ_{t+1} = Θ_t - η_t ∇L(Θ_t)

• Step size η_t (learning rate): constant if L is smooth; diminishing otherwise
• ∇L is a (generalized) gradient, averaged over all n training samples: O(n) per step!
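A minimal gradient-descent sketch (illustrative; a linear least-squares model stands in for the network so the gradient is explicit):

```python
# Gradient descent on L(w) = (1/n) * sum_i (x_i . w - y_i)^2.
# Each step touches all n samples: the O(n) cost noted above.
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 3
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n)

w = np.zeros(d)
eta = 0.1                                # constant step size (L is smooth here)
for t in range(200):
    grad = 2.0 / n * X.T @ (X @ w - y)   # full gradient: O(n) work
    w -= eta * grad

print(w)   # close to [1.0, -2.0, 0.5]
```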

Page 16

Stochastic Gradient Descent (SGD)

• diminishing step size, e.g., 1/√t or 1/t
• averaging, momentum, variance-reduction, etc.
• sample w/o replacement; cycle; permute in each pass


Instead of averaging over all n samples, a random sample suffices.
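The same problem with SGD and a diminishing step size (a sketch; a damped 1/√t schedule is assumed here for stability):

```python
# SGD: replace the O(n) full gradient with the gradient at one random sample.
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 3
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n)

w = np.zeros(d)
for t in range(1, 5001):
    i = rng.integers(n)                      # one random sample suffices
    grad_i = 2.0 * X[i] * (X[i] @ w - y[i])  # unbiased estimate of the gradient
    w -= 0.1 / np.sqrt(t) * grad_i           # diminishing step size ~ 1/sqrt(t)

print(w)   # roughly recovers [1.0, -2.0, 0.5], up to noise
```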

Page 17

A little history on optimization

• Gradient descent first mentioned in (Cauchy, 1847)


• First rigorous convergence proof (Curry, 1944)

• SGD proposed and analyzed (Robbins & Monro, 1951)

Page 18

Herbert Robbins (1915–2001)


Page 19

Outline

• Failure of Perceptron

• Neural Network

• Backpropagation

• Universal Approximator


Page 20

Backpropagation

• A fancy name for the chain rule for derivatives: f(x) = g[ h(x) ]  ⇒  f′(x) = g′[ h(x) ] · h′(x) (a quick numerical check appears after this list)

• Efficiently computes the derivative in NN

• Two passes; complexity = O(size(NN))
• forward pass: compute function values sequentially
• backward pass: compute derivatives sequentially
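A numerical sanity check of the chain rule (illustrative; g and h are arbitrary smooth functions chosen here):

```python
# Chain rule: f(x) = g(h(x))  =>  f'(x) = g'(h(x)) * h'(x).
# Check the analytic derivative against a central finite difference.
import math

h  = lambda x: x * x                    # h(x) = x^2,   h'(x) = 2x
g  = lambda u: math.sin(u)              # g(u) = sin u, g'(u) = cos u
f  = lambda x: g(h(x))
df = lambda x: math.cos(h(x)) * 2 * x   # g'(h(x)) * h'(x)

x, eps = 1.3, 1e-6
numeric = (f(x + eps) - f(x - eps)) / (2 * eps)
print(df(x), numeric)                   # agree to ~1e-9
```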


Page 21

Algorithm


Notation: f is the activation function; J = L + λΩ is the training objective. [Slide shows the algorithm as a forward pass followed by a backward pass; a code sketch follows.]
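A sketch of the two passes for the one-hidden-layer network from page 8, with sigmoid activation and squared loss, no regularizer (these specifics are assumptions for illustration; the slide's pseudocode is general):

```python
# Backpropagation for a one-hidden-layer network with sigmoid activation f
# and squared loss, J = L = (y_hat - y)^2.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
d, k = 3, 4
U, c = rng.normal(size=(d, k)), np.zeros(k)   # 1st linear layer
w, b = rng.normal(size=k), 0.0                # 2nd linear layer
x, y = rng.normal(size=d), 1.0

# Forward pass: compute function values sequentially, caching intermediates.
z = x @ U + c
h = sigmoid(z)
y_hat = h @ w + b
J = (y_hat - y) ** 2

# Backward pass: compute derivatives sequentially, reusing the cache.
dJ_dyhat = 2 * (y_hat - y)
dJ_dw, dJ_db = dJ_dyhat * h, dJ_dyhat
dJ_dh = dJ_dyhat * w
dJ_dz = dJ_dh * h * (1 - h)          # sigmoid'(z) = sigmoid(z)(1 - sigmoid(z))
dJ_dU, dJ_dc = np.outer(x, dJ_dz), dJ_dz

# Finite-difference check on one weight.
eps = 1e-6
U2 = U.copy(); U2[0, 0] += eps
J2 = (sigmoid(x @ U2 + c) @ w + b - y) ** 2
print(dJ_dU[0, 0], (J2 - J) / eps)   # should agree to ~1e-5
```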

Page 22

History (perfect for a course project!)


Page 23

Outline

• Failure of Perceptron

• Neural Network

• Backpropagation

• Universal Approximator


Page 24

Rationals are dense in R

• Any real number can be approximated by some rational number arbitrarily well

• Or in fancy mathematical language


[The slide annotates the formal statement: R is the domain of interest, Q is the subset "in pocket", and |·| is the metric used for approximation. Reconstructed below.]
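The annotated statement itself (reconstructed; this is the standard density claim):

$$ \forall\, x \in \mathbb{R},\ \forall\, \epsilon > 0,\ \exists\, q \in \mathbb{Q} \ \text{such that}\ |x - q| \le \epsilon, \quad \text{i.e.,}\ \overline{\mathbb{Q}} = \mathbb{R}. $$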

Page 25

Kolmogorov-Arnold Theorem

• Binary addition is the “only” multivariate function!
• Solves Hilbert’s 13th problem!
• Can be constructed from g, which is unknown…


Theorem (Kolmogorov, Arnold, Lorentz, Sprecher, …). Any continuous function g: [0,1]^d → R can be written (exactly!) as:
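The displayed formula did not survive extraction; the standard form of the superposition theorem reads:

$$ g(x_1, \dots, x_d) \;=\; \sum_{q=0}^{2d} \Phi_q\!\Big( \sum_{p=1}^{d} \phi_{q,p}(x_p) \Big), $$

with continuous univariate functions Φ_q (which depend on g) and φ_{q,p} (which do not), so that only addition combines the variables.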

Page 26

Andrey Kolmogorov (1903–1987)


Foundations of the theory of probability

"Every mathematician believes that he is ahead of the others. The reason none state this belief in public is because they are intelligent people."

Page 27

Universal Approximator

• conditions are necessary in some sense
• includes (almost) all activation functions in practice


Theorem (Cybenko, Hornik et al., Leshno et al., …). Any continuous function g: [0,1]^d → R can be uniformly approximated to arbitrary precision by a two-layer NN with an activation function f that
1. is locally bounded,
2. has “negligible” closure of its set of discontinuity points,
3. is not a polynomial.
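In symbols (the standard form of the conclusion, reconstructed rather than copied from the slide): for every ε > 0 there exist a width k, weights w_j, a_j, and offsets b_j with

$$ \sup_{x \in [0,1]^d} \Big| g(x) - \sum_{j=1}^{k} w_j\, f(a_j^{\top} x + b_j) \Big| \le \epsilon. $$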

Page 28

Caveat and Remedy

• NNs were praised for being “universal”
• but we shall see later that many kernels are universal as well
• desirable, but perhaps not THE explanation

• May need exponentially many hidden units…

• Increasing depth may reduce network size, exponentially!
• can be a course project; ask for references


Page 29

Questions?
