Supervised learning neural networks
• Multilayer perceptron
• Adaptive-Network-based Fuzzy Inference System (ANFIS)
First part based on slides by Walter Kosters
Neural networks
• Massively connected computational units inspired by the working of the human brain
• Provide a mathematical model for biological neural networks (brains)
• Characteristics:
  – learning from examples
  – adaptive and fault tolerant
  – robust for fulfilling complex tasks
Network variations
• Learning methods (supervised, unsupervised)
• Architectures (feedforward, recurrent)
• Output types (binary, continuous)
• Node types (uniform, hybrid)
• Implementations (software, hardware)
• Connection weights (adjustable, hard-wired)
• Inspirations (biological, psychological)
Artificial neuron
• Input: the weighted sum in = Σi wi·xi of the incoming signals
• Output: the activation a = g(in) of the neuron/unit
• A network with one unit: perceptron
Activation functions
• Step: stept(x) = 1 if x ≥ t, stept(x) = 0 otherwise
• Sign: sign(x) = 1 if x ≥ 0, sign(x) = −1 if x < 0
• Sigmoid (logistic): sigmoid(x) = 1 / (1 + e^(−βx))
• Hyperbolic tangent: tanh(x/2) = (1 − e^(−x)) / (1 + e^(−x))
Fuzzy membership functions!
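To make the shapes concrete, here is a minimal NumPy sketch of these four activation functions; the default threshold t and slope β are illustrative assumptions, not values fixed by the slides:

```python
import numpy as np

def step(x, t=0.0):
    # step_t(x) = 1 if x >= t, 0 otherwise
    return np.where(x >= t, 1.0, 0.0)

def sign(x):
    # sign(x) = 1 if x >= 0, -1 otherwise
    return np.where(x >= 0, 1.0, -1.0)

def sigmoid(x, beta=1.0):
    # logistic function: 1 / (1 + e^(-beta * x))
    return 1.0 / (1.0 + np.exp(-beta * x))

def tanh_half(x):
    # tanh(x/2) = (1 - e^(-x)) / (1 + e^(-x)) = 2*sigmoid(x) - 1
    return (1.0 - np.exp(-x)) / (1.0 + np.exp(-x))
```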
Perceptron: encoding simple Boolean functions
• Step activation function
• Inputs in {0,1}
• Output in {0,1}
Perceptron: linear separability
Step function: output 1 if Σi wi·xi ≥ t, 0 otherwise
The decision boundary Σi wi·xi = t is a hyperplane, so a perceptron can only compute linearly separable functions.
Perceptron: learning
Update rule: wi ← wi + α · Error · xi
where Error = target − predicted and α = learning rate.
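As an illustration of this rule, a minimal sketch that learns the Boolean AND function (which is linearly separable); folding the threshold in as a weight on a constant +1 input is an assumption, consistent with the θ = −w0 trick shown later:

```python
import numpy as np

# Training set for Boolean AND, with a constant +1 input for the bias weight.
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
y = np.array([0, 0, 0, 1], dtype=float)

w = np.zeros(3)      # weights, including the bias weight w0
alpha = 0.1          # learning rate

for epoch in range(20):
    for x, target in zip(X, y):
        predicted = 1.0 if w @ x >= 0 else 0.0   # step activation
        error = target - predicted
        w += alpha * error * x                   # w_i <- w_i + alpha * Error * x_i

print(w)  # converges to [-0.3, 0.2, 0.1], which implements AND
```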
Feed-forward network
[Figure: feed-forward network with inputs x1, x2, hidden layers 1 and 2, and outputs x8, x9; the input-layer nodes are fixed, the remaining nodes are adaptive.]
If the network has only one layer: perceptron (no hidden units). Otherwise: multi-layer network.
Backpropagation
Update rule, output layer: wj,i ← wj,i + α · aj · Δi, with Δi = Erri · g′(ini)
Update rule, hidden layer: wk,j ← wk,j + α · ak · Δj, with Δj = g′(inj) · Σi wj,i · Δi
where
• α = learning rate
• ak = activation of input node k
• ini = weighted input of the i-th output node
• inj = weighted input of the j-th hidden node
[Figure: the same network in the forward pass (activations x1, x2 flow towards the outputs) and in the backward pass (errors ε8, ε9 flow back along weights such as w83, w97, w75, w52, w31).]
Backpropagation derivation
Gradient descent: minimize the error summed over all output nodes, E = ½ · Σi (yi − ai)², by moving each weight in the direction of the negative gradient, w ← w − α · ∂E/∂w.
[Figures: (1) a unit with inputs x1, x2 and weights w1, w2 whose threshold θ is implemented as a weight −w0 on a constant input x3; (2) a two-layer network over x1, x2 with weights ±1 and thresholds 0.5, 1.5, 0.5, the classical construction for XOR.]
• One hidden layer
• Hidden neurons: tanh or logistic activation
• Output neurons: linear activation
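A minimal sketch of the backpropagation updates for exactly this architecture (one hidden layer with tanh units, linear output neurons); the layer sizes, learning rate, and the XOR toy data are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 2, 5, 1
W1 = rng.normal(0.0, 0.5, (n_hidden, n_in))   # input -> hidden weights
b1 = np.zeros(n_hidden)
W2 = rng.normal(0.0, 0.5, (n_out, n_hidden))  # hidden -> output weights
b2 = np.zeros(n_out)
alpha = 0.05                                  # learning rate

# Toy data: XOR, which is not linearly separable, so a hidden layer is needed.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([[0.0], [1.0], [1.0], [0.0]])

for epoch in range(5000):                     # online: update after each example
    for x, y in zip(X, Y):
        # Forward pass
        in_h = W1 @ x + b1                    # weighted inputs of hidden nodes
        a_h = np.tanh(in_h)                   # hidden activations
        out = W2 @ a_h + b2                   # linear output neurons
        # Backward pass (deltas)
        delta_out = y - out                            # g'(in) = 1 for linear output
        delta_h = (1.0 - a_h**2) * (W2.T @ delta_out)  # tanh'(in) = 1 - tanh(in)^2
        # Gradient-descent weight updates
        W2 += alpha * np.outer(delta_out, a_h)
        b2 += alpha * delta_out
        W1 += alpha * np.outer(delta_h, x)
        b1 += alpha * delta_h

for x in X:
    print(x, (W2 @ np.tanh(W1 @ x + b1) + b2).item())  # should approach 0, 1, 1, 0
```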
Online vs. offline learning
• Online: update after each example p
• Offline: update after seeing all examples
Speeding up backpropagation
• Momentum term (see the sketch below): Δw(t) = α · Δ · a + μ · Δw(t−1), applied at the output layer and, in the general case, at every layer
• Regularized initialization
• Learning rate rescaling: the learning rate of a layer decreases towards the output layer
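For illustration, a self-contained sketch of the momentum update for one weight matrix; the values of α and μ and the placeholder gradient are assumptions:

```python
import numpy as np

def momentum_step(W, grad, dW_prev, alpha=0.05, mu=0.9):
    """One gradient step with momentum (applicable at any layer).
    W: weight matrix; grad: current gradient term for W;
    dW_prev: previous weight change; mu: momentum coefficient."""
    dW = alpha * grad + mu * dW_prev   # reuse a fraction of the last step
    return W + dW, dW                  # return dW for the next call

# Usage: thread dW (initially zeros) through successive updates.
W, dW = np.zeros((1, 5)), np.zeros((1, 5))
grad = np.ones((1, 5))                 # placeholder gradient
W, dW = momentum_step(W, grad, dW)
```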
Approximation power
General function approximators: a feed-forward neural network with one hidden layer and sigmoidal activation functions can approximate any continuous function arbitrarily well (Cybenko).
ANFIS
• Takagi-Sugeno fuzzy system mapped onto a neural network structure
• Different representations are possible, but one with 5 layers is the most common
• Network nodes in different layers have different structures
[Figure: two-rule, two-input ANFIS with membership functions A1, A2 on input x and B1, B2 on input y, and firing strengths w1, w2.]
Rule consequents:
f1 = p1·x + q1·y + r1
f2 = p2·x + q2·y + r2
Firing strengths: wi = μAi(x) · μBi(y)   (product)
Overall output: f = (w1·f1 + w2·f2) / (w1 + w2) = w̄1·f1 + w̄2·f2   (normalization: w̄i = wi / (w1 + w2))
ANFIS layers
• Layer 1: every node is adaptive, with function O1,i = μAi(x) (membership degree)
• Layer 2: every node is fixed and computes a T-norm of the inputs: wi = μAi(x) · μBi(y)
• Layer 3: every node is fixed, with function w̄i = wi / (w1 + w2) (normalization)
• Layer 4: every node is adaptive, with function O4,i = w̄i·fi = w̄i·(pi·x + qi·y + ri)
• Layer 5: a single node sums up its inputs: f = Σi w̄i·fi
(a forward-pass sketch follows below)
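A minimal sketch of this five-layer forward pass for the two-rule, two-input ANFIS above, assuming Gaussian membership functions in layer 1; all parameter values are placeholders:

```python
import numpy as np

def gauss(x, c, sigma):
    # Layer 1: adaptive node, Gaussian membership function
    return np.exp(-0.5 * ((x - c) / sigma) ** 2)

def anfis_forward(x, y, premise, consequent):
    # premise: per rule, (cA, sA, cB, sB) for A_i on x and B_i on y
    # consequent: per rule, (p, q, r)
    w = np.array([gauss(x, cA, sA) * gauss(y, cB, sB)   # Layer 2: product T-norm
                  for (cA, sA, cB, sB) in premise])
    w_bar = w / w.sum()                                 # Layer 3: normalization
    f = np.array([p * x + q * y + r for (p, q, r) in consequent])
    return np.sum(w_bar * f)                            # Layers 4-5: weighted consequents, summed

# Placeholder parameters for two rules.
premise = [(-1.0, 1.0, -1.0, 1.0), (1.0, 1.0, 1.0, 1.0)]
consequent = [(0.5, 0.2, 0.1), (-0.3, 0.4, 0.0)]
print(anfis_forward(0.3, -0.7, premise, consequent))
```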
Hybrid learning for ANFIS
• Consider an ANFIS with two inputs x and y
• Let there be 2 nodes in layers 2-3-4
• Denote the consequent parameters by p, q, r
• The ANFIS output is then given by
f = (w1 / (w1 + w2))·f1 + (w2 / (w1 + w2))·f2
  = w̄1·(p1·x + q1·y + r1) + w̄2·(p2·x + q2·y + r2)
  = (w̄1·x)·p1 + (w̄1·y)·q1 + w̄1·r1 + (w̄2·x)·p2 + (w̄2·y)·q2 + w̄2·r2
which is linear in the consequent parameters.
Partition the parameter set S as follows:
• S1: premise parameters (nonlinear)
• S2: consequent parameters (linear)
Hybrid learning for ANFIS
How to learn the parameters θ of a linear model A·θ = y such that ‖A·θ − y‖² is minimal? (Here A and y collect the inputs and outputs of the training examples.)
→ least squares linear regression
Linear regression
Error function: E(θ) = ‖A·θ − y‖² = Σk (akᵀ·θ − yk)²
Compute the global minimum by setting the derivative to zero:
∇E(θ) = 2·Aᵀ·(A·θ − y) = 0  ⇒  θ = (AᵀA)⁻¹·Aᵀ·y
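A sketch of this solution in NumPy on synthetic data (the problem sizes are placeholders); note that np.linalg.lstsq is numerically preferable to forming (AᵀA)⁻¹ explicitly:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(100, 6))          # design matrix: one row per training example
theta_true = np.array([2.0, -1.0, 0.5, 0.0, 3.0, 1.0])
y = A @ theta_true + 0.01 * rng.normal(size=100)

# Normal equations: theta = (A^T A)^(-1) A^T y
theta_ne = np.linalg.solve(A.T @ A, A.T @ y)
# Numerically preferred equivalent:
theta_ls, *_ = np.linalg.lstsq(A, y, rcond=None)
```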
Stone-Weierstrass theorem
Let D be a compact space of N dimensions and let F be a set of continuous real-valued functions on D satisfying the following:
• Identity function: f(x) = 1 is in F
• Separability: for any two points x1 ≠ x2 in D, there is an f in F s.t. f(x1) ≠ f(x2)
• Algebraic closure: if f and g are two functions in F, then fg and af+bg are also in F for real a and b
Then F is dense in the set of continuous functions C(D) on D:
(∀ε>0)(∀g∈C(D))(∃f∈F)(∀x∈D): |g(x) − f(x)| < ε
Universal approximation in ANFIS
According to the Stone-Weierstrass theorem, an ANFIS can approximate any continuous nonlinear function arbitrarily well.
Restriction: rules with a constant in the consequent and Gaussian membership functions.
• Identity: f(x) = 1 is in F — obtained by having a constant consequent
• Separability: for any two points x1 ≠ x2 in D, there is an f in F s.t. f(x1) ≠ f(x2) — obtained by selecting different parameters in the network
Algebraic closure
Consider two systems with two rules each:
z = (w1·f1 + w2·f2) / (w1 + w2)   and   ẑ = (ŵ1·f̂1 + ŵ2·f̂2) / (ŵ1 + ŵ2)
Additive:
a·z + b·ẑ = [w1ŵ1·(a·f1 + b·f̂1) + w1ŵ2·(a·f1 + b·f̂2) + w2ŵ1·(a·f2 + b·f̂1) + w2ŵ2·(a·f2 + b·f̂2)] / (w1ŵ1 + w1ŵ2 + w2ŵ1 + w2ŵ2)
Multiplicative:
z·ẑ = [w1ŵ1·f1·f̂1 + w1ŵ2·f1·f̂2 + w2ŵ1·f2·f̂1 + w2ŵ2·f2·f̂2] / (w1ŵ1 + w1ŵ2 + w2ŵ1 + w2ŵ2)
Both have the same form as an ANFIS with four rules, provided the products wi·ŵj are again valid firing strengths
⇒ Use Gaussian membership functions! (The product of two Gaussians is again Gaussian.)
Model building guidelines
• Select the number of fuzzy sets per variable:
  – empirically, by examining the data
  – clustering techniques
  – regression trees (CART)
• Initially, distribute bell-shaped membership functions evenly
• Using an adaptive step size can speed up training
What to do?
• Number of inputs, and which inputs
• Number of outputs, and which outputs
• Number of rules
• Type of consequent functions
• Collect data, normalize inputs
• Define objective function
• Determine initial rules, initialize network
• Define stop conditions
• TRAIN
Two-input sinc function
z = sinc(x, y) = (sin(x) · sin(y)) / (x · y)
• Input range: [−10,10] × [−10,10]
• 121 data points (generation sketch below)
• ANFIS: 16 rules, 4 membership functions per variable, 72 parameters (48 consequent, 24 premise)
• MLP: 18 neurons in the hidden layer, 73 parameters, quick propagation
• ANFIS takes longer for the same number of epochs
• Results are the average of 10 runs
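A sketch of generating the 121 training points; note the removable singularity at x = 0 (NumPy's np.sinc(x) is sin(πx)/(πx), so sin(x)/x can be written as np.sinc(x/π)):

```python
import numpy as np

# 11 x 11 = 121 grid points on [-10, 10] x [-10, 10], as in the experiment.
x = np.linspace(-10, 10, 11)
y = np.linspace(-10, 10, 11)
X, Y = np.meshgrid(x, y)
Z = np.sinc(X / np.pi) * np.sinc(Y / np.pi)   # sin(x)/x * sin(y)/y, safe at 0
data = np.column_stack([X.ravel(), Y.ravel(), Z.ravel()])
```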
[Figures: training data and ANFIS output surfaces on [−10,10] × [−10,10]; root-mean-squared-error curve and step-size (learning rate) curve over 100 epochs; initial and final membership functions on X and Y.]
Modeling dynamic systems
f(·) has the following form:
f(u) = 0.6·sin(πu) + 0.3·sin(3πu) + 0.1·sin(5πu)
where u(k) = sin(2πk/250)
• ANFIS parameters are updated at each step
• Learning rate: 0.1; forgetting factor: 0.99 (see the sketch below)
• ANFIS can adapt even after the input changes
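The slides do not spell out the online update, but a forgetting factor normally points to recursive least squares on the (linear) consequent parameters; a hedged sketch, with a toy linear regressor standing in for the ANFIS layer-4 regressors:

```python
import numpy as np

def rls_step(theta, P, a, y, lam=0.99):
    """One recursive-least-squares update with forgetting factor lam.
    theta: current linear (consequent) parameters; P: inverse covariance;
    a: regressor vector for this step; y: observed target."""
    Pa = P @ a
    k = Pa / (lam + a @ Pa)            # gain vector
    theta = theta + k * (y - a @ theta)
    P = (P - np.outer(k, Pa)) / lam    # discount old information by 1/lam
    return theta, P

# Toy usage on the signal from the slide: u(k) = sin(2*pi*k/250).
f = lambda u: 0.6*np.sin(np.pi*u) + 0.3*np.sin(3*np.pi*u) + 0.1*np.sin(5*np.pi*u)
theta, P = np.zeros(2), 1e6 * np.eye(2)
for k in range(1000):
    u = np.sin(2 * np.pi * k / 250)
    a = np.array([u, 1.0])             # toy regressor [u, 1]
    theta, P = rls_step(theta, P, a, f(u))
```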
[Figures: results for 5, 4, and 3 membership functions per input, each showing initial MFs, final MFs, f(u) vs. ANFIS output, and each rule's output on u ∈ [−1, 1].]
Three-input nonlinear function
The system mapping is given by
o = (1 + x^0.5 + y^−1 + z^−1.5)²
• Two membership functions per variable, 8 rules
• Input ranges: [1,6] × [1,6] × [1,6]
• 215 training data, 125 validation data
[Figures: training and checking RMSE curves and step-size curve over 100 epochs; initial MFs (identical on X, Y and Z) and final MFs on the inputs, over [1,6].]
Predicting chaotic time series
Consider a chaotic time series generated by the Mackey-Glass delay differential equation (integration sketch below):
dx(t)/dt = 0.2·x(t−τ) / (1 + x¹⁰(t−τ)) − 0.1·x(t)
Task: predict the output of the system at some future instant t+P using past outputs.
• 500 training data, 500 validation data
• ANFIS input: [x(t−18), x(t−12), x(t−6), x(t)]; ANFIS output: x(t+6)
• Two MFs per variable, 16 rules, 104 parameters (24 premise, 80 consequent)
• Data generated from t=118 to t=1117
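A minimal sketch of generating this series by forward Euler integration; the delay τ = 17 and initial value 1.2 are the conventional choices for this benchmark (assumptions, as the slide does not state them), and the unit step size is illustrative:

```python
import numpy as np

tau, dt, n_steps = 17, 1.0, 1200
delay = int(tau / dt)
x = np.empty(n_steps)
x[:delay + 1] = 1.2                    # assumed constant initial history

for t in range(delay, n_steps - 1):
    x_tau = x[t - delay]               # delayed state x(t - tau)
    dx = 0.2 * x_tau / (1 + x_tau**10) - 0.1 * x[t]
    x[t + 1] = x[t] + dt * dx          # forward Euler step

# Build (input, output) pairs: [x(t-18), x(t-12), x(t-6), x(t)] -> x(t+6)
T = np.arange(118, 1118 - 6)
inputs = np.stack([x[T - 18], x[T - 12], x[T - 6], x[T]], axis=1)
targets = x[T + 6]
```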
[Figures: final MFs on the four inputs x(t−18), x(t−12), x(t−6), x(t); training and checking error curves (×10⁻³); step sizes; desired vs. ANFIS outputs; prediction errors (×10⁻³).]
Auto-regression (AR) model
Linear prediction (sketched below): x̂(t) = a0 + a1·x(t−1) + a2·x(t−2) + … + an·x(t−n)
Order = number of terms n
Order selection:
• Select the optimal order of the AR model in order to prevent overfitting
• Select the order that minimizes the error on a test set
• Error measure: root mean squared error / standard deviation of the target
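A sketch of fitting AR models of increasing order by least squares and selecting the order with the lowest test-set RMSE; the series and the train/test split are placeholder assumptions:

```python
import numpy as np

def lag_matrix(x, order):
    # Rows: [1, x[t-1], ..., x[t-order]] for t = order .. len(x)-1
    return np.column_stack([np.ones(len(x) - order)] +
                           [x[order - i:len(x) - i] for i in range(1, order + 1)])

def fit_ar(x, order):
    # Least-squares fit of x[t] ~ a0 + sum_i a_i * x[t-i]
    coeffs, *_ = np.linalg.lstsq(lag_matrix(x, order), x[order:], rcond=None)
    return coeffs

def ar_rmse(x_test, coeffs):
    order = len(coeffs) - 1
    pred = lag_matrix(x_test, order) @ coeffs
    return np.sqrt(np.mean((x_test[order:] - pred) ** 2))

# Placeholder series and split; in the slides this would be the chaotic series.
x = np.sin(0.3 * np.arange(1000)) + 0.05 * np.random.default_rng(2).normal(size=1000)
train, test = x[:500], x[500:]
best_order = min(range(1, 11), key=lambda p: ar_rmse(test, fit_ar(train, p)))
```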
ANFIS extensions
• Different types of membership functions in layer 1
• Parameterized T-norms in layer 2
• Interpretability:
  – constrained gradient descent optimization
  – bounds on fuzziness
  – parameterize to reflect constraints
• Structure identification
Regularized error: E' = E + β · Σ_{i=1}^{N_P} wi · ln(wi)