
MACHINE LEARNING 12. Multilayer Perceptrons

Neural Networks

Lecture Notes for E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.1)


Networks of processing units (neurons) with connections (synapses) between them

Large number of neurons: ~10^10

Large connectivity: ~10^5

Parallel processing; distributed computation/memory; robust to noise and failures

Perceptron


Learn the weights for a given task

$$y = \sum_{j=1}^{d} w_j x_j + w_0 = \mathbf{w}^T \mathbf{x}$$

where $\mathbf{w} = [w_0, w_1, \dots, w_d]^T$ and $\mathbf{x} = [1, x_1, \dots, x_d]^T$
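A minimal NumPy sketch of this computation (the weights and input below are illustrative, not from the lecture):

import numpy as np

def perceptron_output(w, x):
    """y = w^T x, where x is augmented with a leading 1 so that w[0] acts as the bias w0."""
    x_aug = np.concatenate(([1.0], x))   # x = [1, x1, ..., xd]
    return float(np.dot(w, x_aug))       # y = sum_j w_j x_j + w_0

w = np.array([-0.5, 1.0, 2.0])           # [w0, w1, w2], d = 2
x = np.array([0.3, 0.4])
y = perceptron_output(w, x)              # 0.6 here; choose C1 if y > 0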

Perceptron

The weights define a hyperplane (a line in 2D)

The hyperplane splits the space into 2 parts (above/below)

Can be used to split data into 2 classes (linear discriminant)

Threshold function: output 1 (class $C_1$) if $\mathbf{w}^T \mathbf{x} > 0$, and 0 otherwise


K Outputs: Classification


$$y_i = \mathbf{w}_i^T \mathbf{x}$$

$$\text{choose } C_i \text{ if } y_i = \max_k y_k$$

Online Learning

Batch training: all data is given at once

Online training: sample by sample
- Update the parameters on each new instance
- Saves memory
- Handles data that changes over time


Online Learning: Regression

Error on a single instance:
$$E^t(\mathbf{w} \mid \mathbf{x}^t, r^t) = \frac{1}{2}\left(r^t - y^t\right)^2, \qquad y^t = \mathbf{w}^T \mathbf{x}^t$$

Update:
$$\Delta w_j^t = \eta \left(r^t - y^t\right) x_j^t$$

Use a gradually decreasing learning factor $\eta$
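A minimal sketch of this online procedure (NumPy, synthetic data, and a hypothetical 1/t learning-rate schedule):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                     # 100 instances, d = 3
r = X @ np.array([1.0, -2.0, 0.5]) + 0.3          # synthetic targets
X_aug = np.hstack([np.ones((100, 1)), X])         # prepend 1 for the bias w0
w = np.zeros(4)

for t, (x_t, r_t) in enumerate(zip(X_aug, r), start=1):
    eta = 0.1 / t                 # gradually decreasing learning factor
    y_t = w @ x_t                 # y^t = w^T x^t
    w += eta * (r_t - y_t) * x_t  # Delta w_j^t = eta (r^t - y^t) x_j^t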


Online Learning: Classification

Input: $(\mathbf{x}, r)$ where $\mathbf{x}$ is a data instance and $r = 1$ if $\mathbf{x}$ is from class $C_1$ ($r = 0$ otherwise)

Option I: use the threshold (step) function directly

Difficult to treat analytically: the threshold is not differentiable, so there is no useful gradient

Online Learning: Classification

- Option II: replace the threshold with a sigmoid
- The sigmoid output can be interpreted as the probability of a class
- The true (training) probability $r$ is either 0 or 1
- Error measure: cross-entropy


Online Learning: Classification

Minimize the cross-entropy error by gradient descent. The gradient descent rule is

$$\Delta w_j = \eta \left(r - y\right) x_j$$

i.e. Update = LearningFactor × (DesiredOutput − ActualOutput) × Input
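A sketch of one such online step with a sigmoid output (names and the learning rate are illustrative):

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def online_classification_step(w, x, r, eta=0.1):
    """One update: Delta w = eta (r - y) x with y = sigmoid(w^T x)."""
    x_aug = np.concatenate(([1.0], x))   # bias input
    y = sigmoid(w @ x_aug)               # estimated P(C1 | x)
    return w + eta * (r - y) * x_aug     # same form as the regression update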


Example: Learning a Boolean Function
- Two Boolean inputs (0 or 1)
- Single Boolean output (0 or 1)
- Can be seen as a classification problem
- Not all Boolean functions are linearly separable (i.e., learnable by a single perceptron)


Learning Boolean AND
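For example, hand-picked weights such as $w_1 = w_2 = 1$ and $w_0 = -1.5$ (not necessarily the values gradient descent would converge to) implement AND with a threshold unit:

$$y = s(x_1 + x_2 - 1.5), \qquad s(a) = \begin{cases} 1 & \text{if } a > 0 \\ 0 & \text{otherwise} \end{cases}$$

so $y = 1$ only when $x_1 = x_2 = 1$.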


XOR

No w0, w1, w2 satisfy:

$$
\begin{aligned}
w_0 &\le 0 \\
w_2 + w_0 &> 0 \\
w_1 + w_0 &> 0 \\
w_1 + w_2 + w_0 &\le 0
\end{aligned}
$$


(Minsky and Papert, 1969)
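Indeed, adding the two strict inequalities gives $w_1 + w_2 + 2w_0 > 0$, while adding the first and last gives $w_1 + w_2 + 2w_0 \le 0$, a contradiction: XOR is not linearly separable.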

Multilayer Perceptron (MLP)

A single perceptron can learn only a linear discriminant or a linear regression

Introduce intermediate ("hidden") layers

Use a non-linear "activation" function in the hidden layers


Multilayer Perceptrons


$$z_h = \mathrm{sigmoid}\!\left(\mathbf{w}_h^T \mathbf{x}\right) = \frac{1}{1 + \exp\!\left[-\left(\sum_{j=1}^{d} w_{hj} x_j + w_{h0}\right)\right]}, \qquad h = 1, \dots, H$$

$$y_i = \mathbf{v}_i^T \mathbf{z} = \sum_{h=1}^{H} v_{ih} z_h + v_{i0}$$
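A minimal NumPy sketch of this forward pass (sizes and random values are illustrative assumptions):

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def mlp_forward(W, V, x):
    """W: (H, d+1) first-layer weights, V: (K, H+1) second-layer weights.
    Column 0 of each matrix holds the bias terms. Returns the K outputs y."""
    x_aug = np.concatenate(([1.0], x))   # [1, x1, ..., xd]
    z = sigmoid(W @ x_aug)               # z_h = sigmoid(w_h^T x), h = 1..H
    z_aug = np.concatenate(([1.0], z))   # [1, z1, ..., zH]
    return V @ z_aug                     # y_i = v_i^T z

rng = np.random.default_rng(1)
W = rng.normal(scale=0.1, size=(4, 4))   # H = 4 hidden units, d = 3 inputs
V = rng.normal(scale=0.1, size=(2, 5))   # K = 2 outputs
y = mlp_forward(W, V, rng.normal(size=3))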

Multilayer Perceptron

- Hidden layer: a non-linear transformation from the d-dimensional input space to an H-dimensional space
- The second (output) layer implements a linear combination of these non-linear basis functions
- More hidden layers can be added


x1 XOR x2 = (x1 AND ~x2) OR (~x1 AND x2)
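A hand-wired illustration of this decomposition (weights chosen by inspection, not learned; threshold units are used instead of sigmoids for clarity):

def step(a):
    return 1 if a > 0 else 0

def xor_mlp(x1, x2):
    z1 = step(x1 - x2 - 0.5)    # x1 AND NOT x2
    z2 = step(x2 - x1 - 0.5)    # NOT x1 AND x2
    return step(z1 + z2 - 0.5)  # z1 OR z2

print([xor_mlp(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]])  # [0, 1, 1, 0]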

Universal Approximator

An MLP with one hidden layer and an arbitrary number of hidden units can approximate any continuous nonlinear function of the input arbitrarily well


Backpropagation: Intro


$$y_i = \mathbf{v}_i^T \mathbf{z} = \sum_{h=1}^{H} v_{ih} z_h + v_{i0}, \qquad z_h = \mathrm{sigmoid}\!\left(\mathbf{w}_h^T \mathbf{x}\right) = \frac{1}{1 + \exp\!\left[-\left(\sum_{j=1}^{d} w_{hj} x_j + w_{h0}\right)\right]}$$

The error depends on a first-layer weight only through the hidden unit it feeds, so the chain rule gives

$$\frac{\partial E}{\partial w_{hj}} = \frac{\partial E}{\partial y_i}\,\frac{\partial y_i}{\partial z_h}\,\frac{\partial z_h}{\partial w_{hj}}$$

Backpropagation: Regression

Model (single output):
$$y^t = \sum_{h=1}^{H} v_h z_h^t + v_0, \qquad z_h^t = \mathrm{sigmoid}\!\left(\mathbf{w}_h^T \mathbf{x}^t\right)$$

Error:
$$E(\mathbf{W}, \mathbf{v} \mid \mathcal{X}) = \frac{1}{2} \sum_t \left(r^t - y^t\right)^2$$

Update for the second layer:
$$\Delta v_h = \eta \sum_t \left(r^t - y^t\right) z_h^t$$

Backpropagation: Regression — use the chain rule to obtain the first-layer update


Backpropagation: Regression
- Update the first layer using the old values (v) of the second layer
- Then update the second layer using the new values of the first layer


$$\Delta v_h = \eta \sum_t \left(r^t - y^t\right) z_h^t$$

$$\Delta w_{hj} = \eta \sum_t \left(r^t - y^t\right) v_h\, z_h^t \left(1 - z_h^t\right) x_j^t$$
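Putting the two update rules together, a minimal batch-training sketch for the regression case (NumPy; the data, network size, learning rate, and epoch count are illustrative assumptions):

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
N, H, eta = 200, 10, 0.05
X = rng.uniform(-1, 1, size=(N, 1))           # d = 1 input
r = np.sin(3 * X[:, 0])                       # synthetic regression targets
X_aug = np.hstack([np.ones((N, 1)), X])       # prepend 1 for the bias
W = rng.normal(scale=0.1, size=(H, 2))        # first layer (H, d+1)
v = rng.normal(scale=0.1, size=H + 1)         # second layer, v[0] = v0

for epoch in range(2000):
    Z = sigmoid(X_aug @ W.T)                  # z_h^t for all instances, (N, H)
    Z_aug = np.hstack([np.ones((N, 1)), Z])
    y = Z_aug @ v                             # y^t
    err = r - y                               # r^t - y^t
    delta_v = eta * Z_aug.T @ err             # Delta v_h = eta sum_t (r - y) z_h
    delta_W = eta * ((err[:, None] * v[1:] * Z * (1 - Z)).T @ X_aug)
    v += delta_v                              # delta_W above was computed with the old v
    W += delta_W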

[Figure: plots of the hidden-unit weighted sums $w_h x + w_0$, the hidden-unit outputs $z_h$, and their weighted contributions $v_h z_h$]

Two-Class Discrimination

One sigmoid output $y^t$ for $P(C_1 \mid \mathbf{x}^t)$, with $P(C_2 \mid \mathbf{x}^t) \equiv 1 - y^t$


$$y^t = \mathrm{sigmoid}\!\left(\sum_{h=1}^{H} v_h z_h^t + v_0\right)$$

$$E(\mathbf{W}, \mathbf{v} \mid \mathcal{X}) = -\sum_t \left[ r^t \log y^t + \left(1 - r^t\right) \log\left(1 - y^t\right) \right]$$

$$\Delta v_h = \eta \sum_t \left(r^t - y^t\right) z_h^t$$

$$\Delta w_{hj} = \eta \sum_t \left(r^t - y^t\right) v_h\, z_h^t \left(1 - z_h^t\right) x_j^t$$

Multiple Hidden Layers


$$z_{1h} = \mathrm{sigmoid}\!\left(\mathbf{w}_{1h}^T \mathbf{x}\right) = \mathrm{sigmoid}\!\left(\sum_{j=1}^{d} w_{1hj} x_j + w_{1h0}\right), \qquad h = 1, \dots, H_1$$

$$z_{2l} = \mathrm{sigmoid}\!\left(\mathbf{w}_{2l}^T \mathbf{z}_1\right) = \mathrm{sigmoid}\!\left(\sum_{h=1}^{H_1} w_{2lh} z_{1h} + w_{2l0}\right), \qquad l = 1, \dots, H_2$$

$$y = \mathbf{v}^T \mathbf{z}_2 = \sum_{l=1}^{H_2} v_l z_{2l} + v_0$$

MLP with one hidden layer is a universal approximator (Hornik et al., 1989), but using multiple layers may lead to simpler networks

Gradient Descent

- Simple
- Local: changing a weight requires only the input and output of a single perceptron
- Can be used with online training (no need to store the data)
- Can be implemented in hardware
- May converge slowly
- Techniques to improve convergence (momentum, adaptive learning rate) are frequently used

Momentum

- At each parameter update, successive Δw values may differ so much that large oscillations occur and convergence slows
- Take a running average by adding a fraction of the previous update:


$$\Delta w_i^t = -\eta \frac{\partial E^t}{\partial w_i} + \alpha\, \Delta w_i^{t-1}$$
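A sketch of one momentum step (the gradient, eta, and alpha are placeholders for whatever model is being trained):

def momentum_step(w, grad, prev_delta, eta=0.1, alpha=0.9):
    """Delta w^t = -eta * dE/dw + alpha * Delta w^(t-1)."""
    delta = -eta * grad + alpha * prev_delta
    return w + delta, delta   # keep delta so it can be reused at the next step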

Adaptive Learning Rate


- Keep increasing the learning rate while the error decreases (we are going in the right direction)
- Decrease the learning rate when the error increases (we have missed the minimum)

$$\Delta \eta = \begin{cases} +a & \text{if } E^{t+1} < E^t \\ -b\,\eta & \text{otherwise} \end{cases}$$
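A sketch of this heuristic (the constants a and b and the error checkpoints are illustrative):

def adapt_learning_rate(eta, err_new, err_old, a=0.01, b=0.5):
    """Increase eta additively while the error keeps falling; cut it multiplicatively otherwise."""
    if err_new < err_old:
        return eta + a        # still going in the right direction
    return eta - b * eta      # missed the minimum: shrink the learning rate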

Overfitting/Overtraining

Overfitting:
- Too many parameters/hidden nodes
- Fits the noise present in the training sample
- Poor generalization error
- Use cross-validation to choose the number of hidden units

Overtraining:
- The error function has several minima
- The most important weights become non-zero first
- As training continues, more weights move away from zero, effectively adding more parameters
- Stop early using cross-validation


Overfitting

Number of weights: H(d+1) + (H+1)K
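For example, with d = 2 inputs, H = 2 hidden units, and K = 1 output this gives 2·(2+1) + (2+1)·1 = 9 weights.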


Structuring the network

In some applications the input has a "local structure":
- Nearby pixels are correlated
- Speech samples close in time are correlated

Hidden units are not connected to all inputs:
- Define windows over the input
- The number of parameters is reduced
- Connect nodes only to a subset of the nodes in other layers
- Hierarchical cone: features get more complex in higher layers


Structured MLP

Hints

- We may have prior expert knowledge about the task
- Hint: a property of the target function known independently of the training examples
- For instance: invariance to rotation


Hints

- Create virtual examples: generate multiple rotated copies of each example and feed them to the MLP
- Preprocessing stage: scale and align all inputs
- Incorporate the hint into the network structure (local structure)

Tuning Network Size

Networks with an unnecessarily large number of parameters generalize poorly

There are too many possible configurations (nodes/connections) to try them all with cross-validation

Use structural adaptation:
- Destructive approach: gradually remove unnecessary nodes and connections
- Constructive approach: gradually add new nodes/connections

Destructive method: Weight Decay
- Penalize large connection weights
- Connections whose weights stay close to zero effectively "disappear"


$$E' = E + \frac{\lambda}{2} \sum_i w_i^2$$

$$\Delta w_i = -\eta \frac{\partial E}{\partial w_i} - \lambda w_i$$
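A sketch of the penalized update (the gradient, eta, and lambda are illustrative placeholders):

def weight_decay_step(w, grad, eta=0.1, lam=0.01):
    """Delta w_i = -eta * dE/dw_i - lambda * w_i: every weight is also pulled toward zero."""
    return w - eta * grad - lam * w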