CS 5751 Machine Learning
Chapter 4: Artificial Neural Networks


Page 1

Artificial Neural Networks

• Threshold units

• Gradient descent

• Multilayer networks

• Backpropagation

• Hidden layer representations

• Example: Face recognition

• Advanced topics

Page 2

Connectionist Models

Consider humans:
• Neuron switching time ~0.001 second
• Number of neurons ~10^10
• Connections per neuron ~10^4 to 10^5
• Scene recognition time ~0.1 second
• 100 inference steps does not seem like enough

→ must use lots of parallel computation!

Properties of artificial neural nets (ANNs):
• Many neuron-like threshold switching units
• Many weighted interconnections among units
• Highly parallel, distributed processing
• Emphasis on tuning weights automatically

Page 3

When to Consider Neural Networks

• Input is high-dimensional discrete or real-valued (e.g., raw sensor input)

• Output is discrete or real valued

• Output is a vector of values

• Possibly noisy data

• Form of target function is unknown

• Human readability of result is unimportant

Examples:

• Speech phoneme recognition [Waibel]

• Image classification [Kanade, Baluja, Rowley]

• Financial prediction

Page 4

ALVINN drives 70 mph on highways

[Figure: the ALVINN network: a 30x32 sensor input retina feeds 4 hidden units, whose outputs drive steering units ranging from Sharp Left through Straight Ahead to Sharp Right.]

Page 5

Perceptron

[Figure: perceptron: inputs x_1, …, x_n with weights w_1, …, w_n, plus the constant input x_0 = 1 with weight w_0, feeding a summation Σ_{i=0}^{n} w_i x_i and then a threshold.]

o(x_1, …, x_n) = 1 if w_0 + w_1 x_1 + … + w_n x_n > 0
                 −1 otherwise

Sometimes we will use simpler vector notation:

o(x) = 1 if w · x > 0
       −1 otherwise
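The threshold output above can be sketched in a few lines of Python (a minimal illustration; the weight values shown are just one choice, not from the slides):

```python
# A minimal perceptron (threshold unit): o(x) = 1 if w.x > 0, else -1.
# x[0] is the constant input x_0 = 1, so w[0] plays the role of w_0.
def perceptron_output(w, x):
    """w and x are equal-length sequences of numbers."""
    net = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if net > 0 else -1

# For inputs in {0, 1}, the weights w0 = -0.8, w1 = w2 = 0.5 implement
# AND(x1, x2): only x1 = x2 = 1 pushes the weighted sum above zero.
w_and = [-0.8, 0.5, 0.5]
for x1 in (0, 1):
    for x2 in (0, 1):
        print((x1, x2), perceptron_output(w_and, [1, x1, x2]))
```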

Page 6

Decision Surface of Perceptron

[Figure: two plots in the (x1, x2) plane: one where a single line separates the + and − examples, and one where no single line can separate them.]

Represents some useful functions

• What weights represent g(x1,x2) = AND(x1,x2)?

But some functions not representable

• e.g., not linearly separable

• therefore, we will want networks of these ...

Page 7

Perceptron Training Rule

w_i ← w_i + Δw_i
where Δw_i = η (t − o) x_i

and where:
• t = c(x) is the target value
• o is the perceptron output
• η is a small constant (e.g., 0.1) called the learning rate

Can prove it will converge:
• if the training data is linearly separable
• and η is sufficiently small
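The training rule above can be sketched as follows (the function name, the learning-rate default, and the OR training data are illustrative, not from the slides):

```python
# Perceptron training rule: w_i <- w_i + eta * (t - o) * x_i.
# The update only changes the weights when the output o disagrees with t.
def train_perceptron(examples, n_inputs, eta=0.1, epochs=100):
    w = [0.0] * (n_inputs + 1)            # w[0] is the bias weight (x_0 = 1)
    for _ in range(epochs):
        for x, t in examples:             # t is the target value c(x), in {-1, +1}
            xb = [1] + list(x)            # prepend the constant bias input
            o = 1 if sum(wi * xi for wi, xi in zip(w, xb)) > 0 else -1
            for i in range(len(w)):
                w[i] += eta * (t - o) * xb[i]
    return w

# Logical OR is linearly separable, so convergence is guaranteed.
data = [((0, 0), -1), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
w = train_perceptron(data, n_inputs=2)
print([round(wi, 2) for wi in w])
```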

Page 8

Gradient Descent

To understand, consider a simpler linear unit, where

  o = w_0 + w_1 x_1 + … + w_n x_n

Idea: learn w_i's that minimize the squared error

  E[w] = ½ Σ_{d∈D} (t_d − o_d)²

where D is the set of training examples.

Page 9

Gradient Descent

[Figure: the error surface E plotted over the weight space (w0, w1): a bowl with a single global minimum, which gradient descent approaches by stepping along the negative gradient.]

Page 10

Gradient Descent

Gradient:

  ∇E[w] = [ ∂E/∂w_0, ∂E/∂w_1, …, ∂E/∂w_n ]

Training rule:

  Δw = −η ∇E[w]

i.e.,

  Δw_i = −η ∂E/∂w_i

Page 11

Gradient Descent

∂E/∂w_i = ∂/∂w_i ½ Σ_d (t_d − o_d)²
        = ½ Σ_d ∂/∂w_i (t_d − o_d)²
        = ½ Σ_d 2 (t_d − o_d) ∂/∂w_i (t_d − o_d)
        = Σ_d (t_d − o_d) ∂/∂w_i (t_d − w · x_d)

∂E/∂w_i = Σ_d (t_d − o_d)(−x_{i,d})

Page 12

Gradient Descent

GRADIENT-DESCENT(training_examples, η)

Each training example is a pair of the form ⟨x, t⟩, where x is the vector of input values and t is the target output value. η is the learning rate (e.g., 0.05).

• Initialize each w_i to some small random value
• Until the termination condition is met, do
  - Initialize each Δw_i to zero
  - For each ⟨x, t⟩ in training_examples, do
    * Input the instance x to the unit and compute the output o
    * For each linear unit weight w_i, do
        Δw_i ← Δw_i + η (t − o) x_i
  - For each linear unit weight w_i, do
        w_i ← w_i + Δw_i
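The GRADIENT-DESCENT pseudocode above translates almost line for line into Python; the linear target data used here is illustrative, not from the slides:

```python
# GRADIENT-DESCENT(training_examples, eta) for a linear unit, batch mode:
# updates are accumulated over a full pass, then applied all at once.
import random

def gradient_descent(training_examples, n_inputs, eta=0.05, iterations=1000):
    random.seed(0)                                     # reproducible runs
    w = [random.uniform(-0.05, 0.05) for _ in range(n_inputs + 1)]
    for _ in range(iterations):
        delta = [0.0] * len(w)                         # each Delta w_i starts at zero
        for x, t in training_examples:
            xb = [1.0] + list(x)                       # x_0 = 1 bias input
            o = sum(wi * xi for wi, xi in zip(w, xb))  # linear unit output
            for i in range(len(w)):
                delta[i] += eta * (t - o) * xb[i]      # accumulate the updates
        for i in range(len(w)):
            w[i] += delta[i]                           # apply after the full pass
    return w

# Illustrative data from the linear target t = 1 + 2*x1 - 3*x2 (not from slides).
data = [((x1, x2), 1 + 2 * x1 - 3 * x2) for x1 in (0, 1, 2) for x2 in (0, 1, 2)]
w = gradient_descent(data, n_inputs=2)
print([round(wi, 2) for wi in w])   # -> approximately [1.0, 2.0, -3.0]
```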

Page 13

Summary

Perceptron training rule guaranteed to succeed if
• Training examples are linearly separable
• Sufficiently small learning rate η

Linear unit training rule uses gradient descent
• Guaranteed to converge to hypothesis with minimum squared error
• Given sufficiently small learning rate η
• Even when training data contains noise
• Even when training data not separable by H

Page 14

Incremental (Stochastic) Gradient Descent

Batch mode Gradient Descent:
Do until satisfied:
  1. Compute the gradient ∇E_D[w]
  2. w ← w − η ∇E_D[w]

Incremental mode Gradient Descent:
Do until satisfied:
- For each training example d in D
  1. Compute the gradient ∇E_d[w]
  2. w ← w − η ∇E_d[w]

where

  E_D[w] = ½ Σ_{d∈D} (t_d − o_d)²
  E_d[w] = ½ (t_d − o_d)²

Incremental Gradient Descent can approximate Batch Gradient Descent arbitrarily closely if η is made small enough.
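The incremental mode can be sketched the same way as the batch version, with the weights updated after every example d (the function name and the linear target data are illustrative, not from the slides):

```python
# Incremental (stochastic) gradient descent for a linear unit: the weights
# change after every example d, using the gradient of E_d alone.
import random

def incremental_gd(training_examples, n_inputs, eta=0.02, epochs=500):
    random.seed(0)
    w = [random.uniform(-0.05, 0.05) for _ in range(n_inputs + 1)]
    for _ in range(epochs):
        for x, t in training_examples:          # one update per example d
            xb = [1.0] + list(x)
            o = sum(wi * xi for wi, xi in zip(w, xb))
            for i in range(len(w)):
                w[i] += eta * (t - o) * xb[i]   # gradient of E_d, not E_D
    return w

# Illustrative data from the linear target t = 1 + 2*x1 - 3*x2 (not from slides).
data = [((x1, x2), 1 + 2 * x1 - 3 * x2) for x1 in (0, 1, 2) for x2 in (0, 1, 2)]
w = incremental_gd(data, n_inputs=2)
print([round(wi, 2) for wi in w])   # -> approximately [1.0, 2.0, -3.0]
```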

Page 15

Multilayer Networks of Sigmoid Units

[Figure: a feedforward network with inputs x1 and x2, sigmoid hidden units h1 and h2, and output unit o; accompanying plots show the unit activations dividing the input space among regions labeled d0, d1, d2, d3.]

Page 16

Multilayer Decision Space

[Figure: the decision regions produced by a multilayer network in the input plane.]

Page 17

Sigmoid Unit

[Figure: sigmoid unit: inputs x_1, …, x_n with weights w_1, …, w_n, plus the constant input x_0 = 1 with weight w_0, feeding a summation and then a sigmoid.]

net = Σ_{i=0}^{n} w_i x_i

o = σ(net) = 1 / (1 + e^{−net})

σ(x) is the sigmoid function:

  σ(x) = 1 / (1 + e^{−x})

Nice property:

  dσ(x)/dx = σ(x)(1 − σ(x))

We can derive gradient descent rules to train:
• One sigmoid unit
• Multilayer networks of sigmoid units → Backpropagation

Page 18

The Sigmoid Function

[Figure: plot of σ(x) = 1 / (1 + e^{−x}); the output rises smoothly from 0 to 1 as the net input goes from −6 to 6.]

Sort of a rounded step function. Unlike the step function, we can take its derivative (which makes learning possible).
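The sigmoid and its "nice property" σ′(x) = σ(x)(1 − σ(x)) can be checked numerically in a few lines (a quick sanity sketch, not part of the slides):

```python
# sigma(x) = 1 / (1 + e^(-x)) and its derivative sigma(x) * (1 - sigma(x)),
# verified against a central finite difference at a few points.
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_deriv(x):
    s = sigmoid(x)
    return s * (1.0 - s)

h = 1e-6
for x in (-2.0, 0.0, 3.0):
    numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2.0 * h)
    assert abs(numeric - sigmoid_deriv(x)) < 1e-6
print(round(sigmoid(0.0), 3))   # -> 0.5
```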

Page 19

Error Gradient for a Sigmoid Unit

∂E/∂w_i = ∂/∂w_i ½ Σ_{d∈D} (t_d − o_d)²
        = ½ Σ_d ∂/∂w_i (t_d − o_d)²
        = ½ Σ_d 2 (t_d − o_d) ∂/∂w_i (t_d − o_d)
        = Σ_d (t_d − o_d) (−∂o_d/∂w_i)
        = −Σ_d (t_d − o_d) (∂o_d/∂net_d)(∂net_d/∂w_i)

But we know:

  ∂o_d/∂net_d = ∂σ(net_d)/∂net_d = o_d (1 − o_d)

  ∂net_d/∂w_i = ∂(w · x_d)/∂w_i = x_{i,d}

So:

  ∂E/∂w_i = −Σ_{d∈D} (t_d − o_d) o_d (1 − o_d) x_{i,d}

Page 20

Backpropagation Algorithm

Initialize all weights to small random numbers.
Until satisfied, do
  For each training example, do
    1. Input the training example and compute the outputs
    2. For each output unit k:
         δ_k ← o_k (1 − o_k)(t_k − o_k)
    3. For each hidden unit h:
         δ_h ← o_h (1 − o_h) Σ_{k∈outputs} w_{k,h} δ_k
    4. Update each network weight w_{j,i}:
         w_{j,i} ← w_{j,i} + Δw_{j,i}
       where Δw_{j,i} = η δ_j x_{j,i}
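The four steps above can be sketched as a small list-based Python implementation for one hidden layer of sigmoid units (written for clarity, not speed; the XOR data, the hyperparameters, and all names are illustrative assumptions, not from the slides):

```python
# List-based backpropagation with one hidden layer of sigmoid units.
import math, random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def backprop_train(examples, n_in, n_hid, n_out, eta=0.5, epochs=8000):
    random.seed(0)
    # w_ih[i][h]: input i -> hidden h; w_ho[h][k]: hidden h -> output k.
    # Row 0 of each matrix is the bias weight (constant input/activation 1).
    w_ih = [[random.uniform(-0.5, 0.5) for _ in range(n_hid)]
            for _ in range(n_in + 1)]
    w_ho = [[random.uniform(-0.5, 0.5) for _ in range(n_out)]
            for _ in range(n_hid + 1)]
    for _ in range(epochs):
        for x, t in examples:
            xb = [1.0] + list(x)
            # 1. forward pass: compute hidden and output activations
            hid = [1.0] + [sigmoid(sum(xb[i] * w_ih[i][h]
                                       for i in range(n_in + 1)))
                           for h in range(n_hid)]
            out = [sigmoid(sum(hid[h] * w_ho[h][k]
                               for h in range(n_hid + 1)))
                   for k in range(n_out)]
            # 2. output unit errors: delta_k = o_k (1 - o_k)(t_k - o_k)
            d_out = [out[k] * (1 - out[k]) * (t[k] - out[k])
                     for k in range(n_out)]
            # 3. hidden unit errors: delta_h = o_h (1 - o_h) sum_k w_kh delta_k
            d_hid = [hid[h] * (1 - hid[h]) *
                     sum(w_ho[h][k] * d_out[k] for k in range(n_out))
                     for h in range(1, n_hid + 1)]
            # 4. weight updates: w <- w + eta * delta_j * x_ji
            for h in range(n_hid + 1):
                for k in range(n_out):
                    w_ho[h][k] += eta * d_out[k] * hid[h]
            for i in range(n_in + 1):
                for h in range(n_hid):
                    w_ih[i][h] += eta * d_hid[h] * xb[i]
    return w_ih, w_ho

# XOR is not linearly separable, so a single unit fails, but this network can learn it.
xor = [((0, 0), (0,)), ((0, 1), (1,)), ((1, 0), (1,)), ((1, 1), (0,))]
w_ih, w_ho = backprop_train(xor, n_in=2, n_hid=3, n_out=1)
```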

Page 21

More on Backpropagation

• Gradient descent over entire network weight vector

• Easily generalized to arbitrary directed graphs

• Will find a local, not necessarily global error minimum

– In practice, often works well (can run multiple times)

• Often include weight momentum

• Minimizes error over training examples

• Will it generalize well to subsequent examples?

• Training can take thousands of iterations -- slow!

– Using network after training is fast

  Δw_{j,i}(n) = η δ_j x_{j,i} + α Δw_{j,i}(n − 1)
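The momentum rule can be illustrated directly; note how repeated updates in the same direction accumulate (the function name and numbers are illustrative, not from the slides):

```python
# Momentum: Delta w(n) = eta * delta_j * x_ji + alpha * Delta w(n-1).
# Each weight remembers its previous change, smoothing the descent direction.
def momentum_update(prev_delta, eta, delta_j, x_ji, alpha=0.9):
    """Return the new weight change Delta w(n)."""
    return eta * delta_j * x_ji + alpha * prev_delta

dw = 0.0
for _ in range(3):     # repeated identical gradients make the step grow
    dw = momentum_update(dw, eta=0.1, delta_j=1.0, x_ji=1.0)
print(round(dw, 3))    # -> 0.271
```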

Page 22

Learning Hidden Layer Representations

[Figure: a network with 8 inputs, a small hidden layer, and 8 outputs, trained to reproduce its input at its output.]

Input      Output
00000001 → 00000001
00000010 → 00000010
00000100 → 00000100
00001000 → 00001000
00010000 → 00010000
00100000 → 00100000
01000000 → 01000000
10000000 → 10000000

Page 23

Learning Hidden Layer Representations

Learned encoding (Input → Hidden Values → Output):

Input      Hidden Values    Output
00000001   .01 .94 .60      00000001
00000010   .98 .01 .80      00000010
00000100   .99 .99 .22      00000100
00001000   .02 .05 .03      00001000
00010000   .71 .97 .99      00010000
00100000   .27 .97 .01      00100000
01000000   .88 .11 .01      01000000
10000000   .08 .04 .89      10000000
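One way to see that the hidden layer has learned a compact representation: thresholding the hidden values above at 0.5 yields a distinct 3-bit code for each of the 8 inputs (the threshold of 0.5 is our own illustrative choice):

```python
# Hidden values from the table, one row per input pattern (values from the slides).
hidden = [(.01, .94, .60), (.98, .01, .80), (.99, .99, .22), (.02, .05, .03),
          (.71, .97, .99), (.27, .97, .01), (.88, .11, .01), (.08, .04, .89)]

# Threshold each hidden value at 0.5 to get a binary code per input.
codes = [tuple(int(v > 0.5) for v in h) for h in hidden]
print(codes)           # eight distinct 3-bit patterns
```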

Page 24

Output Unit Error during Training

Sum of squared errors for each output unit

[Figure: the per-output-unit error curves decrease from about 0.9 toward 0 over 2500 training iterations.]

Page 25

Hidden Unit Encoding

Hidden unit encoding for one input

[Figure: the hidden unit values for one input evolve over 2500 training iterations before settling.]

Page 26

Input to Hidden Weights

Weights from inputs to one hidden unit

[Figure: the weights from the inputs to one hidden unit evolve over 2500 training iterations, spreading from near 0 to values between about −5 and 4.]

Page 27

Convergence of Backpropagation

Gradient descent to some local minimum
• Perhaps not global minimum

• Momentum can cause quicker convergence

• Stochastic gradient descent also results in faster convergence

• Can train multiple networks and get different results (using different initial weights)

Nature of convergence• Initialize weights near zero

• Therefore, initial networks near-linear

• Increasingly non-linear functions as training progresses

Page 28

Expressive Capabilities of ANNs

Boolean functions:
• Every Boolean function can be represented by a network with a single hidden layer
• But that might require an exponential (in the number of inputs) number of hidden units

Continuous functions:
• Every bounded continuous function can be approximated with arbitrarily small error by a network with one hidden layer [Cybenko 1989; Hornik et al. 1989]
• Any function can be approximated to arbitrary accuracy by a network with two hidden layers [Cybenko 1988]

Page 29

Overfitting in ANNs

Error versus weight updates (example 1)

[Figure: over 20000 weight updates, training set error decreases steadily while validation set error levels off and then begins to rise, the signature of overfitting.]

Page 30

Overfitting in ANNs

Error versus weight updates (Example 2)

[Figure: training set and validation set error over 6000 weight updates; training error keeps falling while validation error flattens.]

Page 31

Neural Nets for Face Recognition

[Figure: a network with 30x32 inputs and four output units (left, straight, right, up), shown with typical input images.]

90% accurate at learning head pose, and at recognizing 1-of-20 faces.

Page 32

Learned Network Weights

[Figure: the learned weights of the 30x32-input network visualized, alongside typical input images.]

Page 33

Alternative Error Functions

Penalize large weights:

  E(w) = ½ Σ_{d∈D} Σ_{k∈outputs} (t_{kd} − o_{kd})² + γ Σ_{i,j} w_{ji}²

Train on target slopes as well as values:

  E(w) = ½ Σ_{d∈D} Σ_{k∈outputs} [ (t_{kd} − o_{kd})² + μ Σ_{j∈inputs} ( ∂t_{kd}/∂x_d^j − ∂o_{kd}/∂x_d^j )² ]

Tie together weights:
• e.g., in phoneme recognition
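The effect of the large-weight penalty can be seen in isolation: its gradient contributes 2γw to each weight's update, shrinking the weight toward zero even when the data gradient is zero (a minimal sketch with illustrative names and numbers):

```python
# Gradient of the penalty gamma * sum_ij w_ji^2 with respect to one weight
# is 2 * gamma * w, so each step shrinks that weight toward zero.
def decay_step(w, grad_E, eta=0.1, gamma=0.01):
    """One gradient step on the penalized error for a single weight."""
    return w - eta * (grad_E + 2.0 * gamma * w)

w = 5.0
for _ in range(10):    # data gradient zero: only the penalty acts
    w = decay_step(w, grad_E=0.0)
print(round(w, 4))     # shrunk by a factor of 0.998 per step
```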

Page 34

Recurrent Networks

[Figure: three diagrams: (a) a feedforward network computing y(t+1) from x(t); (b) a recurrent network whose output y(t+1) is fed back as an additional input; (c) the same recurrent network unfolded in time, with successive stages taking x(t−2), x(t−1), x(t) and producing y(t−1), y(t), y(t+1).]

Page 35

Neural Network Summary

• physiologically (neurons) inspired model
• powerful (accurate), slow, opaque (hard to understand resulting model)
• bias: preferential
  - based on gradient descent
  - finds local minimum
  - affected by initial conditions, parameters
• neural units
  - linear
  - linear threshold
  - sigmoid

Page 36

Neural Network Summary (cont)

• gradient descent
  - convergence
• linear units
  - limitation: hyperplane decision surface
  - learning rule
• multilayer network
  - advantage: can have non-linear decision surface
  - backpropagation to learn
    • backprop learning rule
• learning issues
  - units used

Page 37

Neural Network Summary (cont)

• learning issues (cont)
  - batch versus incremental (stochastic)
  - parameters
    • initial weights
    • learning rate
    • momentum
  - cost (error) function
    • sum of squared errors
    • can include penalty terms
• recurrent networks
  - simple
  - backpropagation through time