CS 5751 Machine Learning
Chapter 4 Artificial Neural Networks 1
Artificial Neural Networks
• Threshold units
• Gradient descent
• Multilayer networks
• Backpropagation
• Hidden layer representations
• Example: Face recognition
• Advanced topics
Connectionist Models
Consider humans:
• Neuron switching time ~0.001 second
• Number of neurons ~10^10
• Connections per neuron ~10^4-5
• Scene recognition time ~0.1 second
• 100 inference steps do not seem like enough
→ must use lots of parallel computation!
Properties of artificial neural nets (ANNs):
• Many neuron-like threshold switching units
• Many weighted interconnections among units
• Highly parallel, distributed processing
• Emphasis on tuning weights automatically
When to Consider Neural Networks
• Input is high-dimensional discrete or real-valued (e.g., raw sensor input)
• Output is discrete or real valued
• Output is a vector of values
• Possibly noisy data
• Form of target function is unknown
• Human readability of result is unimportant
Examples:
• Speech phoneme recognition [Waibel]
• Image classification [Kanade, Baluja, Rowley]
• Financial prediction
ALVINN drives 70 mph on highways
(figure: 30x32 sensor input retina feeding 4 hidden units, with outputs Sharp Left, Straight Ahead, Sharp Right)
Perceptron
(diagram: inputs $x_0 = 1, x_1, \ldots, x_n$ with weights $w_0, w_1, \ldots, w_n$ feeding a threshold unit)

$o(x_1, \ldots, x_n) = \begin{cases} 1 & \text{if } w_0 + w_1 x_1 + \cdots + w_n x_n > 0 \\ -1 & \text{otherwise} \end{cases}$

Sometimes we will use simpler vector notation:

$o(\vec{x}) = \begin{cases} 1 & \text{if } \vec{w} \cdot \vec{x} > 0 \\ -1 & \text{otherwise} \end{cases}$
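The threshold rule above can be sketched in a few lines of Python (a minimal illustration; the function name and list-based weight representation are my own choices, not from the slides):

```python
def perceptron_output(weights, inputs):
    """weights = [w0, w1, ..., wn]; inputs = [x1, ..., xn] (x0 = 1 is implicit).
    Returns 1 if w . x > 0, else -1, exactly as in the threshold rule."""
    net = weights[0] + sum(w * x for w, x in zip(weights[1:], inputs))
    return 1 if net > 0 else -1
```

With weights [-1.5, 1, 1] this unit computes AND(x1, x2), anticipating the question on the next slide.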
Decision Surface of Perceptron
(figure: two plots in the (x1, x2) plane: a set of examples a single line can separate, and a set no line can separate)
Represents some useful functions
• What weights represent g(x1,x2) = AND(x1,x2)?
But some functions not representable
• e.g., not linearly separable
• therefore, we will want networks of these ...
Perceptron Training Rule

$w_i \leftarrow w_i + \Delta w_i$, where $\Delta w_i = \eta (t - o) x_i$

where:
• $t = c(\vec{x})$ is the target value
• $o$ is the perceptron output
• $\eta$ is a small constant (e.g., 0.1) called the learning rate

Can prove it will converge if
• the training data is linearly separable, and
• $\eta$ is sufficiently small
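One application of the training rule can be sketched as follows (a hedged illustration; the function name and the choice to fold in $x_0 = 1$ are mine):

```python
def perceptron_train_step(weights, inputs, target, eta=0.1):
    """One application of w_i <- w_i + eta * (t - o) * x_i, with x0 = 1 implicit."""
    net = weights[0] + sum(w * x for w, x in zip(weights[1:], inputs))
    o = 1 if net > 0 else -1                      # current perceptron output
    xs = [1.0] + list(inputs)                     # prepend x0 = 1
    return [w + eta * (target - o) * x for w, x in zip(weights, xs)]
```

Note that when the perceptron already classifies the example correctly, $t - o = 0$ and the weights are left unchanged.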
Gradient Descent

To understand, consider a simpler linear unit, where

$o = w_0 + w_1 x_1 + \cdots + w_n x_n$

Idea: learn $w_i$'s that minimize the squared error

$E[\vec{w}] \equiv \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2$

where $D$ is the set of training examples.
Gradient Descent
Gradient Descent

Gradient: $\nabla E[\vec{w}] \equiv \left[ \frac{\partial E}{\partial w_0}, \frac{\partial E}{\partial w_1}, \ldots, \frac{\partial E}{\partial w_n} \right]$

Training rule: $\Delta \vec{w} = -\eta \nabla E[\vec{w}]$

i.e., $\Delta w_i = -\eta \frac{\partial E}{\partial w_i}$
Gradient Descent

$\frac{\partial E}{\partial w_i} = \frac{\partial}{\partial w_i} \frac{1}{2} \sum_d (t_d - o_d)^2$
$= \frac{1}{2} \sum_d \frac{\partial}{\partial w_i} (t_d - o_d)^2$
$= \frac{1}{2} \sum_d 2 (t_d - o_d) \frac{\partial}{\partial w_i} (t_d - o_d)$
$= \sum_d (t_d - o_d) \frac{\partial}{\partial w_i} (t_d - \vec{w} \cdot \vec{x}_d)$
$\frac{\partial E}{\partial w_i} = \sum_d (t_d - o_d)(-x_{i,d})$
Gradient Descent

GRADIENT-DESCENT(training_examples, η)

Each training example is a pair of the form <x, t>, where x is the vector of input values and t is the target output value. η is the learning rate (e.g., 0.05).

- Initialize each $w_i$ to some small random value
- Until the termination condition is met, do
  - Initialize each $\Delta w_i$ to zero
  - For each <x, t> in training_examples, do
    * Input the instance x and compute the output $o$
    * For each linear unit weight $w_i$, do
      $\Delta w_i \leftarrow \Delta w_i + \eta (t - o) x_i$
  - For each linear unit weight $w_i$, do
    $w_i \leftarrow w_i + \Delta w_i$
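The pseudocode above can be sketched directly in Python (a minimal sketch for a linear unit; the function name, list representation, and fixed epoch count standing in for "until the termination condition is met" are my own choices):

```python
import random

def gradient_descent(training_examples, eta=0.05, epochs=500):
    """Batch gradient descent for a linear unit, following the pseudocode.
    training_examples: list of (x_list, t) pairs. Returns [w0, w1, ..., wn]."""
    n = len(training_examples[0][0])
    # Initialize each w_i to some small random value
    w = [random.uniform(-0.05, 0.05) for _ in range(n + 1)]
    for _ in range(epochs):            # stand-in for the termination condition
        dw = [0.0] * len(w)            # initialize each delta_w_i to zero
        for x, t in training_examples:
            xs = [1.0] + list(x)       # x0 = 1
            o = sum(wi * xi for wi, xi in zip(w, xs))  # linear unit output
            # delta_w_i <- delta_w_i + eta * (t - o) * x_i
            dw = [dwi + eta * (t - o) * xi for dwi, xi in zip(dw, xs)]
        # w_i <- w_i + delta_w_i (one batch update per pass over D)
        w = [wi + dwi for wi, dwi in zip(w, dw)]
    return w
```

On noise-free data generated by t = 2x + 1, this converges to weights close to [1, 2], as the summary slide's convergence guarantee suggests.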
Summary
Perceptron training rule guaranteed to succeed if
• Training examples are linearly separable
• Sufficiently small learning rate η
Linear unit training rule uses gradient descent
• Guaranteed to converge to hypothesis with minimum squared error
• Given sufficiently small learning rate η
• Even when training data contains noise
• Even when training data not separable by H
Incremental (Stochastic) Gradient Descent

Batch mode Gradient Descent:
Do until satisfied:
1. Compute the gradient $\nabla E_D[\vec{w}]$
2. $\vec{w} \leftarrow \vec{w} - \eta \nabla E_D[\vec{w}]$

Incremental mode Gradient Descent:
Do until satisfied:
- For each training example d in D
  1. Compute the gradient $\nabla E_d[\vec{w}]$
  2. $\vec{w} \leftarrow \vec{w} - \eta \nabla E_d[\vec{w}]$

$E_D[\vec{w}] \equiv \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2$
$E_d[\vec{w}] \equiv \frac{1}{2} (t_d - o_d)^2$

Incremental Gradient Descent can approximate Batch Gradient Descent arbitrarily closely if η is made small enough.
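The incremental mode differs from batch mode only in *when* the update is applied: after every single example rather than after a full pass. A hedged sketch for a one-input linear unit (function name and data representation are my own):

```python
def incremental_gd(training_examples, eta=0.05, epochs=500):
    """Incremental (stochastic) mode: update the weights after each example d,
    using the single-example error E_d = 0.5 * (t_d - o_d)^2."""
    w = [0.0, 0.0]  # [w0, w1] for a one-input linear unit
    for _ in range(epochs):
        for x, t in training_examples:
            xs = [1.0, x]
            o = sum(wi * xi for wi, xi in zip(w, xs))
            # w_i <- w_i + eta * (t - o) * x_i, applied immediately per example
            w = [wi + eta * (t - o) * xi for wi, xi in zip(w, xs)]
    return w
```

Compared with the batch version, the accumulator Δw is gone; each example nudges the weights directly.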
Multilayer Networks of Sigmoid Units
(figure: a multilayer network with inputs x_1, x_2, hidden units h_1, h_2, and output o, shown with example decision regions labeled d_0 through d_3)
Multilayer Decision Space
Sigmoid Unit

(diagram: inputs $x_0 = 1, x_1, \ldots, x_n$ with weights $w_0, w_1, \ldots, w_n$; the unit computes $net$ and outputs $o = \sigma(net)$)

$net = \sum_{i=0}^{n} w_i x_i$

$o = \sigma(net) = \frac{1}{1 + e^{-net}}$

$\sigma(x) = \frac{1}{1 + e^{-x}}$ is the sigmoid function.

Nice property: $\frac{d\sigma(x)}{dx} = \sigma(x)(1 - \sigma(x))$

We can derive gradient descent rules to train:
• One sigmoid unit
• Multilayer networks of sigmoid units → Backpropagation
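The sigmoid and its "nice property" are easy to check numerically (a small sketch; function names are mine):

```python
import math

def sigmoid(x):
    """The sigmoid (logistic) function sigma(x) = 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_derivative(x):
    """The nice property from the slide: dsigma/dx = sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)
```

Comparing `sigmoid_derivative` against a finite-difference estimate confirms the identity, and this closed form is what makes the backpropagation deltas cheap to compute.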
The Sigmoid Function
(plot: output $\sigma(x) = \frac{1}{1 + e^{-x}}$ versus net input, over the range -6 to 6)

Sort of a rounded step function. Unlike the step function, we can take its derivative (this makes learning possible).
Error Gradient for a Sigmoid Unit

$\frac{\partial E}{\partial w_i} = \frac{\partial}{\partial w_i} \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2$
$= \frac{1}{2} \sum_d \frac{\partial}{\partial w_i} (t_d - o_d)^2$
$= \frac{1}{2} \sum_d 2 (t_d - o_d) \frac{\partial}{\partial w_i} (t_d - o_d)$
$= \sum_d (t_d - o_d) \left( -\frac{\partial o_d}{\partial w_i} \right)$
$= -\sum_d (t_d - o_d) \frac{\partial o_d}{\partial net_d} \frac{\partial net_d}{\partial w_i}$

But we know:
$\frac{\partial o_d}{\partial net_d} = \frac{\partial \sigma(net_d)}{\partial net_d} = o_d (1 - o_d)$
$\frac{\partial net_d}{\partial w_i} = \frac{\partial (\vec{w} \cdot \vec{x}_d)}{\partial w_i} = x_{i,d}$

So:
$\frac{\partial E}{\partial w_i} = -\sum_{d \in D} (t_d - o_d) \, o_d (1 - o_d) \, x_{i,d}$
Backpropagation Algorithm

Initialize all weights to small random numbers.
Until satisfied, do
- For each training example, do
  1. Input the training example and compute the outputs
  2. For each output unit k:
     $\delta_k \leftarrow o_k (1 - o_k)(t_k - o_k)$
  3. For each hidden unit h:
     $\delta_h \leftarrow o_h (1 - o_h) \sum_{k \in outputs} w_{h,k} \delta_k$
  4. Update each network weight $w_{i,j}$:
     $w_{i,j} \leftarrow w_{i,j} + \Delta w_{i,j}$, where $\Delta w_{i,j} = \eta \, \delta_j \, x_{i,j}$
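Steps 1-4 for a single training example can be sketched for a tiny fixed topology (a hedged illustration for a 2-input, 2-hidden-unit, 1-output sigmoid network; the function names and the nested-list weight layout are my own choices, not from the slides):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def backprop_step(w_in, w_out, x, t, eta=0.5):
    """One training-example update. w_in[j] = [w0, w1, w2] into hidden unit j;
    w_out = [w0, wh1, wh2] into the single output unit."""
    # 1. Input the training example and compute the outputs
    xs = [1.0] + list(x)
    h = [sigmoid(sum(w * xi for w, xi in zip(wj, xs))) for wj in w_in]
    hs = [1.0] + h
    o = sigmoid(sum(w * hi for w, hi in zip(w_out, hs)))
    # 2. Delta for the output unit: o(1-o)(t-o)
    delta_o = o * (1 - o) * (t - o)
    # 3. Delta for each hidden unit: o_h(1-o_h) * sum_k w_hk * delta_k
    delta_h = [h[j] * (1 - h[j]) * w_out[j + 1] * delta_o for j in range(len(h))]
    # 4. Update each weight: w_ij <- w_ij + eta * delta_j * x_ij
    w_out = [w + eta * delta_o * hi for w, hi in zip(w_out, hs)]
    w_in = [[w + eta * delta_h[j] * xi for w, xi in zip(w_in[j], xs)]
            for j in range(len(w_in))]
    return w_in, w_out
```

Repeating this step over all examples until satisfied gives the full algorithm; for a small learning rate, each step moves downhill on the single-example error.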
More on Backpropagation
• Gradient descent over entire network weight vector
• Easily generalized to arbitrary directed graphs
• Will find a local, not necessarily global error minimum
– In practice, often works well (can run multiple times)
• Often include weight momentum
• Minimizes error over training examples
• Will it generalize well to subsequent examples?
• Training can take thousands of iterations -- slow!
– Using network after training is fast
$\Delta w_{i,j}(n) = \eta \, \delta_j \, x_{i,j} + \alpha \, \Delta w_{i,j}(n-1)$
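The momentum update keeps a running record of the previous weight change; a minimal sketch for a single weight (the function name and the stand-in `grad_term` for $\delta_j x_{i,j}$ are my own):

```python
def momentum_update(w, grad_term, prev_dw, eta=0.1, alpha=0.9):
    """Weight momentum: dw(n) = eta * grad_term + alpha * dw(n-1),
    where grad_term stands for delta_j * x_ij and alpha is the momentum constant."""
    dw = eta * grad_term + alpha * prev_dw
    return w + dw, dw
```

When the gradient keeps pointing the same way, successive Δw's reinforce each other, which is why momentum can speed convergence across flat regions.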
Learning Hidden Layer Representations

A network trained to learn the identity function:

Input    → Output
00000001 → 00000001
00000010 → 00000010
00000100 → 00000100
00001000 → 00001000
00010000 → 00010000
00100000 → 00100000
01000000 → 01000000
10000000 → 10000000

(figure: network with 8 input units, a small hidden layer, and 8 output units)
Learning Hidden Layer Representations

Learned hidden layer values:

Input    → Hidden values → Output
00000001 → .01 .94 .60 → 00000001
00000010 → .98 .01 .80 → 00000010
00000100 → .99 .99 .22 → 00000100
00001000 → .02 .05 .03 → 00001000
00010000 → .71 .97 .99 → 00010000
00100000 → .27 .97 .01 → 00100000
01000000 → .88 .11 .01 → 01000000
10000000 → .08 .04 .89 → 10000000
Output Unit Error during Training
Sum of squared errors for each output unit
(plot: per-output-unit error versus training iterations, 0 to 2500)
Hidden Unit Encoding
Hidden unit encoding for one input
(plot: the hidden unit values for one input versus training iterations, 0 to 2500)
Input to Hidden Weights
Weights from inputs to one hidden unit
(plot: weights from the inputs to one hidden unit versus training iterations, 0 to 2500; values range roughly from -5 to 4)
Convergence of Backpropagation
Gradient descent to some local minimum
• Perhaps not global minimum
• Momentum can cause quicker convergence
• Stochastic gradient descent also results in faster convergence
• Can train multiple networks and get different results (using different initial weights)
Nature of convergence
• Initialize weights near zero
• Therefore, initial networks are near-linear
• Increasingly non-linear functions become representable as training progresses
Expressive Capabilities of ANNs
Boolean functions:
• Every Boolean function can be represented by a network with a single hidden layer
• But that might require an exponential (in the number of inputs) number of hidden units
Continuous functions:
• Every bounded continuous function can be approximated with arbitrarily small error by a network with one hidden layer [Cybenko 1989; Hornik et al. 1989]
• Any function can be approximated to arbitrary accuracy by a network with two hidden layers [Cybenko 1988]
Overfitting in ANNs
Error versus weight updates (example 1)
(plot: training-set and validation-set error versus number of weight updates, 0 to 20000)
Overfitting in ANNs
Error versus weight updates (Example 2)
(plot: training-set and validation-set error versus number of weight updates, 0 to 6000)
Neural Nets for Face Recognition
(figure: 30x32 inputs feeding a network with outputs left, straight, right, up; typical input images shown)

90% accurate at learning head pose and recognizing 1-of-20 faces
Learned Network Weights
(figure: learned network weights visualized alongside typical input images; 30x32 inputs with outputs left, straight, right, up)
Alternative Error Functions

Penalize large weights:
$E(\vec{w}) \equiv \frac{1}{2} \sum_{d \in D} \sum_{k \in outputs} (t_{kd} - o_{kd})^2 + \gamma \sum_{i,j} w_{ji}^2$

Train on target slopes as well as values:
$E(\vec{w}) \equiv \frac{1}{2} \sum_{d \in D} \sum_{k \in outputs} \left[ (t_{kd} - o_{kd})^2 + \mu \sum_{j \in inputs} \left( \frac{\partial t_{kd}}{\partial x_d^j} - \frac{\partial o_{kd}}{\partial x_d^j} \right)^2 \right]$

Tie weights together:
• e.g., in phoneme recognition
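The weight-penalty variant is simple to compute; a hedged sketch for a single training example (the function name, the per-example restriction, and the sample γ value are my own choices):

```python
def penalized_error(targets, outputs, weights, gamma=0.01):
    """Squared error over the output units plus gamma times the sum of
    squared weights, as in the 'penalize large weights' error function.
    Shown for one training example only."""
    squared = 0.5 * sum((t - o) ** 2 for t, o in zip(targets, outputs))
    penalty = gamma * sum(w * w for w in weights)
    return squared + penalty
```

Because the penalty's gradient with respect to each weight is 2γw, gradient descent on this error shrinks every weight toward zero on each step (weight decay).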
Recurrent Networks
(figure: three diagrams. Left: a feedforward network mapping x(t) to y(t+1). Middle: a recurrent network in which y(t+1) is fed back as an input alongside x(t). Right: the recurrent network unfolded in time, with copies receiving x(t-2), x(t-1), x(t) and producing y(t-1), y(t), y(t+1))
Neural Network Summary
• physiologically (neuron) inspired model
• powerful (accurate), slow, opaque (hard to understand resulting model)
• bias: preferential
  - based on gradient descent
  - finds local minimum
  - affected by initial conditions, parameters
• neural units
  - linear
  - linear threshold
  - sigmoid
Neural Network Summary (cont)
• gradient descent
  - convergence
• linear units
  - limitation: hyperplane decision surface
  - learning rule
• multilayer network
  - advantage: can have non-linear decision surface
  - backpropagation to learn
    * backprop learning rule
• learning issues
  - units used
Neural Network Summary (cont)
• learning issues (cont)
  - batch versus incremental (stochastic)
  - parameters
    * initial weights
    * learning rate
    * momentum
  - cost (error) function
    * sum of squared errors
    * can include penalty terms
• recurrent networks
  - simple
  - backpropagation through time