Supervised learning neural networks Multilayer perceptron Adaptive-Network-based Fuzzy Inference System (ANFIS) First part based on slides by Walter Kosters


Page 1: Supervised learning neural networks (source: liacs.leidenuniv.nl/~nijssensgr/CI/2011/5 neural networks)

Supervised learning neural networks

• Multilayer perceptron

• Adaptive-Network-based Fuzzy Inference System (ANFIS)

First part based on slides by Walter Kosters

Page 2:

Neural networks
• Massively connected computational units inspired by the working of the human brain
• Provide a mathematical model for biological neural networks (brains)
• Characteristics: learning from examples; adaptive and fault tolerant; robust for fulfilling complex tasks

Page 3:

Network variations
• Learning methods (supervised, unsupervised)
• Architectures (feedforward, recurrent)
• Output types (binary, continuous)
• Node types (uniform, hybrid)
• Implementations (software, hardware)
• Connection weights (adjustable, hard-wired)
• Inspirations (biological, psychological)

Page 4:

Artificial Neuron
[Figure: a single neuron/unit with weighted inputs producing one output.]
A network with one unit is called a perceptron.

Page 5:

Activation Functions

Step: step_t(x) = 1 if x >= t, step_t(x) = 0 otherwise

Sign: sign(x) = 1 if x >= 0, sign(x) = -1 if x < 0

Page 6:

Activation Functions

Sigmoid (logistic): sigmoid(x) = 1 / (1 + e^(-bx))

Hyperbolic tangent: tanh(x/2) = (1 - e^(-x)) / (1 + e^(-x))

These resemble fuzzy membership functions!
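The four activation functions can be written out directly (a minimal sketch; the function names are mine):

```python
import math

def step(x, t=0.0):
    # step_t(x) = 1 if x >= t, 0 otherwise
    return 1 if x >= t else 0

def sign(x):
    # sign(x) = 1 if x >= 0, -1 otherwise
    return 1 if x >= 0 else -1

def sigmoid(x, b=1.0):
    # sigmoid(x) = 1 / (1 + e^(-bx)); b controls the steepness
    return 1.0 / (1.0 + math.exp(-b * x))

def tanh_half(x):
    # tanh(x/2) = (1 - e^(-x)) / (1 + e^(-x))
    return (1.0 - math.exp(-x)) / (1.0 + math.exp(-x))
```

The last identity is easy to check against `math.tanh(x / 2)`.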

Page 7:

Perceptron: Encoding Simple Boolean Functions
• Step activation function
• Input in [0,1]
• Output in [0,1]
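For instance, with both weights set to 1, a single step unit encodes AND with threshold 1.5 and OR with threshold 0.5 (a minimal sketch; the helper name is mine):

```python
def perceptron(inputs, weights, threshold):
    # Step-activated unit: fires iff the weighted input sum reaches the threshold
    s = sum(w * x for w, x in zip(weights, inputs))
    return 1 if s >= threshold else 0

# AND: both inputs are needed to reach threshold 1.5
# OR: a single active input already reaches threshold 0.5
for x1 in (0, 1):
    for x2 in (0, 1):
        assert perceptron((x1, x2), (1, 1), 1.5) == (x1 and x2)
        assert perceptron((x1, x2), (1, 1), 0.5) == (x1 or x2)
```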

Page 8:

Perceptron: Linearly Separable
Step function: output 1 if the weighted input sum reaches the threshold, 0 otherwise.
A perceptron can only represent linearly separable functions.

Page 9:

Perceptron: Learning
Update rule: w_i <- w_i + α · Error · x_i
where Error = target - predicted, α = learning rate
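The rule can be sketched in a few lines, here training the OR function; a bias input of 1 stands in for the threshold (function name, learning rate and epoch count are illustrative):

```python
# Perceptron learning rule: w_i <- w_i + alpha * (target - predicted) * x_i
def train_perceptron(samples, alpha=0.1, epochs=20):
    w = [0.0, 0.0, 0.0]                    # bias weight + two input weights
    for _ in range(epochs):
        for x1, x2, target in samples:
            x = (1, x1, x2)                # constant 1 is the bias input
            predicted = 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else 0
            error = target - predicted
            w = [wi + alpha * error * xi for wi, xi in zip(w, x)]
    return w

# OR is linearly separable, so the rule converges
w = train_perceptron([(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 1)])
```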

Page 10:

Feed-forward network

[Figure: inputs x1 and x2 feed hidden nodes 3, 4, 5 (Layer 1), then nodes 6, 7 (Layer 2), then output nodes 8, 9 producing outputs x8, x9. The input nodes are fixed; the remaining nodes are adaptive.]

Input layer, Layer 1, Layer 2, Output layer

If there is one layer in the network (no hidden units): perceptron. Otherwise: multi-layer.

Page 11:

Backpropagation

Page 12:

Backpropagation
Update rule, output layer: w_{k,i} <- w_{k,i} + α · a_k · Δ_i with Δ_i = Err_i · g'(in_i)

Update rule, hidden layer: w_{k,j} <- w_{k,j} + α · a_k · Δ_j with Δ_j = g'(in_j) · Σ_i w_{j,i} Δ_i

α = learning rate; a_k = activation of input node k; in_i = weighted input of the ith output node; in_j = weighted input of the jth hidden node
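These rules can be sketched as a small sigmoid network trained on XOR (a minimal numpy sketch, not the lecture's code; the 4 hidden units and learning rate 0.5 are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)      # XOR targets

def g(x):  return 1.0 / (1.0 + np.exp(-x))           # sigmoid activation
def gp(a): return a * (1.0 - a)                      # g'(in), written via the activation

W1 = rng.normal(0, 1, (2, 4)); b1 = np.zeros(4)
W2 = rng.normal(0, 1, (4, 1)); b2 = np.zeros(1)
alpha, losses = 0.5, []

for _ in range(5000):
    a1 = g(X @ W1 + b1)                              # forward pass
    a2 = g(a1 @ W2 + b2)
    losses.append(float(np.mean((T - a2) ** 2)))
    d2 = (T - a2) * gp(a2)                           # Δ for the output layer
    d1 = gp(a1) * (d2 @ W2.T)                        # Δ for the hidden layer
    W2 += alpha * a1.T @ d2; b2 += alpha * d2.sum(0) # w <- w + α · a · Δ
    W1 += alpha * X.T @ d1;  b1 += alpha * d1.sum(0)
```

Full-batch updates are used here; the online variant would apply the same rules after each example.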

Page 13:

[Figure: the same network shown twice, once for the forward pass (inputs x1, x2, outputs x8, x9) and once for the backward pass, where error terms ε1, …, ε9 are propagated back along weights such as w31, w52, w75, w83 and w97.]

Page 14:

Backpropagation Derivation
Gradient descent on the error E = ½ Σ_i (target_i - output_i)², summed over all output nodes.

Page 15:

Backpropagation Derivation

Page 16:

[Figure: left, a single unit with inputs x1, x2, x3, weights w1, w2 and threshold θ = -w0. Right, a two-layer network with inputs x1, x2, hidden nodes x3, x4 and output node x5, with connection weights +1 and -1 and thresholds 0.5, 1.5 and 0.5.]

Page 17:

• One hidden layer

• Hidden neurons: tanh or logistic activation

• Output neurons: linear activation

Page 18:

Online vs Offline learning
• Online: update after each example p
• Offline: update after seeing all examples

Page 19:

Speeding up backpropagation
• Momentum term (output layer and general case)
• Regularized initialization
• Learning rate rescaling: the learning rate of the layers decreases towards the output layer
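The momentum idea can be sketched as a generic weight update that mixes the current step with the previous weight change (a minimal sketch; function name and the values of α and μ are illustrative):

```python
def momentum_update(w, delta_w_prev, step, alpha=0.1, mu=0.9):
    # New change = alpha * current step + mu * previous change;
    # the running average smooths oscillating gradients.
    delta_w = alpha * step + mu * delta_w_prev
    return w + delta_w, delta_w

# Two consecutive updates with the same step: the second change is larger
w, dw = momentum_update(1.0, 0.0, 2.0)   # dw = 0.2
w, dw = momentum_update(w, dw, 2.0)      # dw = 0.2 + 0.9 * 0.2 = 0.38
```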

Page 20:

Approximation power
General function approximators: a feed-forward neural network with one hidden layer and sigmoidal activation functions can approximate any continuous function arbitrarily well (Cybenko).

Page 21:

ANFIS
• Takagi-Sugeno fuzzy system mapped onto a neural network structure
• Different representations are possible, but one with 5 layers is the most common
• Network nodes in different layers have different structures

[Figure: two-rule example with inputs x, y, membership functions A1, A2 on X and B1, B2 on Y, and firing strengths w1, w2.]

f1 = p1 x + q1 y + r1
f2 = p2 x + q2 y + r2

f = (w1 f1 + w2 f2) / (w1 + w2) = w̄1 f1 + w̄2 f2

Page 22:

[Figure: the corresponding 5-layer ANFIS network; the layer-2 nodes compute a product, the layer-3 nodes a normalization.]

Page 23:

ANFIS layers
• Layer 1: every node is adaptive, with function O1,i = μAi(x), the membership degree of the input
• Layer 2: every node is fixed and computes a T-norm of the inputs, e.g. the product wi = μAi(x) · μBi(y)
• Layer 3: every node is fixed, with function w̄i = wi / (w1 + w2) (normalization)
• Layer 4: every node is adaptive, with function w̄i fi = w̄i (pi x + qi y + ri)
• Layer 5: a single node sums up its inputs: f = Σi w̄i fi
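The five layers translate directly into a forward pass; a minimal sketch for two rules with Gaussian membership functions (all parameter values and function names are illustrative):

```python
import math

def gauss(x, c, s):
    # Gaussian membership function with center c and width s
    return math.exp(-((x - c) / s) ** 2)

def anfis_forward(x, y, premise, consequent):
    # Layer 1: membership degrees (adaptive premise parameters c, s)
    mu_A = [gauss(x, c, s) for c, s in premise["A"]]
    mu_B = [gauss(y, c, s) for c, s in premise["B"]]
    # Layer 2: T-norm (product) -> firing strengths w_i
    w = [a * b for a, b in zip(mu_A, mu_B)]
    # Layer 3: normalization
    wbar = [wi / sum(w) for wi in w]
    # Layer 4: w̄_i * f_i with f_i = p_i x + q_i y + r_i (adaptive consequents)
    f = [p * x + q * y + r for p, q, r in consequent]
    # Layer 5: sum
    return sum(wb * fi for wb, fi in zip(wbar, f))

premise = {"A": [(-1.0, 2.0), (1.0, 2.0)], "B": [(-1.0, 2.0), (1.0, 2.0)]}
consequent = [(1.0, 1.0, 0.0), (2.0, -1.0, 1.0)]
out = anfis_forward(0.5, -0.5, premise, consequent)
```

A sanity check: if both rules share the consequent f = x, the normalization guarantees the output equals x regardless of the firing strengths.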

Page 24:
Page 25:

Hybrid learning for ANFIS
• Consider an ANFIS with two inputs x & y
• Let there be 2 nodes in layers 2-3-4
• Denote the consequent parameters by p, q, r
• ANFIS output is now given by

f = w1/(w1 + w2) f1 + w2/(w1 + w2) f2
  = w̄1 (p1 x + q1 y + r1) + w̄2 (p2 x + q2 y + r2)
  = (w̄1 x) p1 + (w̄1 y) q1 + w̄1 r1 + (w̄2 x) p2 + (w̄2 y) q2 + w̄2 r2

• Partition S as follows: S1 = premise parameters (nonlinear), S2 = consequent parameters (linear)

Page 26:

Hybrid learning for ANFIS
How to learn the parameters θ of a linear model f(x) = θᵀx such that Σp (yp - θᵀxp)² is minimal?

(where xp and yp represent input and output for the training examples)

→ least squares linear regression

Page 27:

Linear Regression
Error function: E(θ) = Σp (yp - θᵀxp)²

Compute the global minimum by means of the derivative: setting ∂E/∂θ = 0 yields the normal equations XᵀX θ = Xᵀy.
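A minimal sketch of solving the normal equations on a toy dataset (the data values are illustrative; the first column of ones carries the intercept):

```python
import numpy as np

# Three training examples of y = 1 + 2x, one per row of X
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])

# Normal equations: (X^T X) theta = X^T y
theta = np.linalg.solve(X.T @ X, X.T @ y)
```

For rank-deficient design matrices, `np.linalg.lstsq` is the more robust choice.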

Page 28:

Stone-Weierstrass theorem
Let D be a compact space of N dimensions and let F be a set of continuous real-valued functions on D satisfying the following:
• Identity function: f(x) = 1 is in F
• Separability: for any two points x1 ≠ x2 in D, there is an f in F s.t. f(x1) ≠ f(x2)
• Algebraic closure: if f and g are two functions in F, then fg and af + bg are also in F for real a and b
Then F is dense in the set of continuous functions C(D) on D:

(∀ε>0)(∀g∈C(D))(∃f∈F)(∀x∈D: |g(x) - f(x)| < ε)

Page 29:

Universal approximation in ANFIS
According to the Stone-Weierstrass theorem, an ANFIS can approximate any continuous nonlinear function arbitrarily well.

Restriction to rules with a constant in the consequent and Gaussian membership functions:
• Identity: f(x) = 1 is in F, obtained by having a constant consequent
• Separability: for any two points x1 ≠ x2 in D, there is an f in F s.t. f(x1) ≠ f(x2), obtained by selecting different parameters in the network

Page 30:

Algebraic closure
Consider two systems with two rules each:

z = (w1 f1 + w2 f2) / (w1 + w2)  and  ẑ = (ŵ1 f̂1 + ŵ2 f̂2) / (ŵ1 + ŵ2)

Additive:

a z + b ẑ = a (w1 f1 + w2 f2)/(w1 + w2) + b (ŵ1 f̂1 + ŵ2 f̂2)/(ŵ1 + ŵ2)
= [w1 ŵ1 (a f1 + b f̂1) + w1 ŵ2 (a f1 + b f̂2) + w2 ŵ1 (a f2 + b f̂1) + w2 ŵ2 (a f2 + b f̂2)] / (w1 ŵ1 + w1 ŵ2 + w2 ŵ1 + w2 ŵ2)

Multiplicative:

z ẑ = [w1 ŵ1 f1 f̂1 + w1 ŵ2 f1 f̂2 + w2 ŵ1 f2 f̂1 + w2 ŵ2 f2 f̂2] / (w1 ŵ1 + w1 ŵ2 + w2 ŵ1 + w2 ŵ2)

The firing strengths of the combined system are the products wi ŵj: use Gaussian membership functions (the product of two Gaussians is again Gaussian)!

Page 31:

Model building guidelines
• Select the number of fuzzy sets per variable: empirically, by examining the data; by clustering techniques; by regression trees (CART)
• Initially, distribute bell-shaped membership functions evenly
• Using an adaptive step size can speed up training

Page 32:

What to do?
• Number of inputs, and which inputs
• Number of outputs, and which outputs
• Number of rules
• Type of consequent functions
• Normalize inputs
• Collect data
• Define objective function
• Determine initial rules
• Initialize network
• Define stop conditions
• TRAIN

Page 33:

Two-input sinc function

sinc(x, y) = z = sin(x)/x · sin(y)/y

• Input range: [-10,10] x [-10,10]
• 121 data points
• ANFIS: 16 rules, 4 membership functions per variable, 72 parameters (48 consequent, 24 premise)
• MLP: 18 neurons in hidden layer, 73 parameters, quick propagation
• ANFIS takes longer for the same number of epochs
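The training set can be generated as a grid (a sketch; the 11×11 spacing is inferred from the 121-point count, and note that `np.sinc` is the normalized sin(πx)/(πx), so the argument is divided by π):

```python
import numpy as np

g = np.linspace(-10, 10, 11)          # 11 points per axis -> 121 grid points
X, Y = np.meshgrid(g, g)
Z = np.sinc(X / np.pi) * np.sinc(Y / np.pi)   # sin(x)/x * sin(y)/y, with sinc(0) = 1
```

Using `np.sinc` avoids the 0/0 singularity at the origin, where the function value is 1.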

Page 34:

Average of 10 runs

Page 35:

[Figures: training data and ANFIS output surfaces over X, Y ∈ [-10, 10]; root mean squared error curve and step size (learning rate) curve over 100 epochs.]

Page 36:

[Figures: initial and final membership functions (degree of membership) on X and Y over [-10, 10].]

Page 37:

Modeling dynamic systems
f(·) has the following form:

f(u) = 0.6 sin(πu) + 0.3 sin(3πu) + 0.1 sin(5πu)

where u(k) = sin(2πk/250)

• ANFIS parameters updated at each step
• Learning rate: 0.1; forgetting factor: 0.99
• ANFIS can adapt even after the input changes
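The two formulas transcribe directly (function names are mine):

```python
import math

def f(u):
    # Plant nonlinearity: sum of three sinusoids
    return (0.6 * math.sin(math.pi * u)
            + 0.3 * math.sin(3 * math.pi * u)
            + 0.1 * math.sin(5 * math.pi * u))

def u(k):
    # Slowly varying input signal with period 250 steps
    return math.sin(2 * math.pi * k / 250)
```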

Page 38:
Page 39:

[Figures, ANFIS with 5 membership functions: initial MFs, final MFs, f(u) and ANFIS outputs, and each rule's outputs over u ∈ [-1, 1].]

Page 40:

[Figures, ANFIS with 4 membership functions: initial MFs, final MFs, f(u) and ANFIS outputs, and each rule's outputs over u ∈ [-1, 1].]

Page 41:

[Figures, ANFIS with 3 membership functions: initial MFs, final MFs, f(u) and ANFIS outputs, and each rule's outputs over u ∈ [-1, 1].]

Page 42:

Three-input nonlinear function
System mapping is given by

o = (1 + x^0.5 + y^-1 + z^-1.5)²

• Two membership functions per variable, 8 rules
• Input ranges: [1,6] x [1,6] x [1,6]
• 215 training data, 125 validation data

[Figures: root mean squared error curves (training error and checking error) and step size curve over 100 epochs.]

Page 43:

[Figures: initial MFs on X, Y and Z and final MFs (degree of membership) over the input range [1, 6].]

Page 44:

Predicting chaotic time series
Consider a chaotic time series generated by the Mackey-Glass delay differential equation:

dx(t)/dt = 0.2 x(t-τ) / (1 + x¹⁰(t-τ)) - 0.1 x(t)

Task: predict the output of the system at some future instance t+P by using past output.
• 500 training data, 500 validation data
• ANFIS input: [x(t-18), x(t-12), x(t-6), x(t)]
• ANFIS output: x(t+6)
• Two MFs per variable, 16 rules
• 104 parameters (24 premise, 80 consequent)
• Data generated from t=118 to t=1117
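The series can be generated numerically, e.g. by Euler integration (a sketch; τ = 17, the step size dt = 0.1 and the constant initial history 1.2 are assumptions, since the slides do not state them):

```python
tau, dt = 17, 0.1
n_delay = int(tau / dt)            # number of stored steps covering the delay
x = [1.2] * (n_delay + 1)          # constant history for t <= 0 (assumed)

for _ in range(20000):
    x_tau = x[-n_delay - 1]        # x(t - tau)
    dx = 0.2 * x_tau / (1 + x_tau ** 10) - 0.1 * x[-1]
    x.append(x[-1] + dt * dx)      # Euler step: x(t + dt) = x(t) + dt * dx/dt
```

The ANFIS input/output pairs would then be sampled from this trajectory at the lags listed above.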

Page 45:

[Figures: final MFs on the four inputs x(t-18), x(t-12), x(t-6) and x(t), over roughly [0.6, 1.2].]

Page 46:

[Figures: error curves (training and checking error, around 2·10⁻³), step sizes, desired and ANFIS outputs over t = 200 to 1000, and prediction errors on the order of 10⁻³.]

Page 47:

Auto-Regression (AR) Model

Linear prediction: x(t) = a0 + Σᵢ aᵢ x(t-i)

Order = number of terms
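Fitting the AR coefficients is itself a least-squares problem; a sketch on synthetic data (the `fit_ar` helper and the test signal are illustrative, not from the lecture):

```python
import numpy as np

def fit_ar(series, order):
    # One row per prediction: [1, x(t-1), ..., x(t-order)] -> x(t)
    X = np.array([[1.0] + list(series[t - order:t][::-1])
                  for t in range(order, len(series))])
    y = series[order:]
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coeffs          # [a0, a1, ..., a_order]

# A sinusoid obeys x(t) = 2cos(w) x(t-1) - x(t-2) exactly,
# so an AR(2) fit recovers these coefficients
t = np.arange(100, dtype=float)
series = np.sin(0.3 * t)
a = fit_ar(series, 2)
```

Choosing the order is then the model-selection problem discussed on the next slide.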

Page 48:
Page 49:

Order selection
• Select the optimal order of the AR model in order to prevent overfitting
• Select the order that minimizes the error on a test set
• Error measure: root mean squared error divided by the standard deviation of the target

Page 50:
Page 51:
Page 52:

ANFIS extensions
• Different types of membership functions in layer 1
• Parameterized t-norms in layer 2
• Interpretability: constrained gradient descent optimization; bounds on fuzziness via a penalty term on the error:

E' = E + β Σᵢ₌₁^NP wᵢ ln(wᵢ)

• Parameterize to reflect constraints
• Structure identification