Center for Big Data Analytics and Discovery Informatics
Artificial Intelligence Research Laboratory
Fall 2018 Vasant G Honavar

Approximating Real-Valued Functions from Data: Neural Networks

Vasant Honavar
Artificial Intelligence Research Laboratory
Informatics Graduate Program
Computer Science and Engineering Graduate Program
Bioinformatics and Genomics Graduate Program
Neuroscience Graduate Program
Center for Big Data Analytics and Discovery Informatics
Huck Institutes of the Life Sciences
Institute for Cyberscience
Clinical and Translational Sciences Institute
Northeast Big Data Hub
Pennsylvania State University
[email protected]
http://faculty.ist.psu.edu/vhonavar
http://ailab.ist.psu.edu
Neural Networks
• Learning to approximate real-valued functions
• Bayesian recipe for learning real-valued functions
• Review: Learning linear functions using gradient descent in weight space
• Universal function approximation theorem
• Learning nonlinear functions using gradient descent in weight space
• Practical considerations and examples
Review: Bayesian Recipe for Learning
• P(h) = prior probability of hypothesis h
• P(D) = prior probability of training data D
• P(h | D) = probability of h given D
• P(D | h) = probability of D given h

P(h | D) = P(D | h) P(h) / P(D)
Bayesian recipe for learning
Choose the most likely hypothesis given the data:

h_MAP = argmax_{h ∈ H} P(h | D)   (maximum a posteriori hypothesis)
      = argmax_{h ∈ H} P(D | h) P(h) / P(D)
      = argmax_{h ∈ H} P(D | h) P(h)

If P(h_i) = P(h_j) ∀ i, j, then

h_ML = argmax_{h ∈ H} P(D | h)   (maximum likelihood hypothesis)
Learning a Real Valued Function
• Consider a real-valued target function f
• Training examples ⟨x_i, d_i⟩, where d_i is a noisy training value: d_i = f(x_i) + e_i
• e_i is a random variable (noise) drawn independently for each x_i according to a Gaussian distribution with zero mean
• ⇒ d_i has mean f(x_i) and the same variance

Then the maximum likelihood hypothesis h_ML is the one that minimizes the sum of squared errors:

h_ML = argmin_{h ∈ H} Σ_{i=1}^m (d_i − h(x_i))²
Learning a Real Valued Function
h_ML = argmax_{h ∈ H} P(h | D)
     = argmax_{h ∈ H} P(D | h)
     = argmax_{h ∈ H} ∏_{i=1}^m p(d_i, x_i | h)
     = argmax_{h ∈ H} ∏_{i=1}^m p(d_i | h, x_i) P(x_i)
     = argmax_{h ∈ H} ∏_{i=1}^m (1/√(2πσ²)) e^{−(1/2)((d_i − h(x_i))/σ)²} ∏_{i=1}^m P(x_i)
Maximize natural log of this instead...
Learning a Real Valued Function
h_ML = argmax_{h ∈ H} Σ_{i=1}^m [ ln(1/√(2πσ²)) − (1/2)((d_i − h(x_i))/σ)² ]
     = argmax_{h ∈ H} Σ_{i=1}^m −(1/2)((d_i − h(x_i))/σ)²
     = argmax_{h ∈ H} Σ_{i=1}^m −(d_i − h(x_i))²
     = argmin_{h ∈ H} Σ_{i=1}^m (d_i − h(x_i))²
Maximum Likelihood hypothesis is one that minimizes the mean squared error!
Approximating a linear function using a linear neuron
[Figure: a linear neuron with inputs x_1, …, x_n, a constant bias input x_0 = 1, synaptic weights w_0, w_1, …, w_n, and output y]

y = Σ_{i=0}^n w_i x_i
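As a concrete illustration (not part of the original slides), here is a minimal numpy sketch of this linear neuron, with the bias handled as w_0 via the constant input x_0 = 1; the weight and input values are made up for the example:

```python
import numpy as np

def linear_neuron(w, x):
    """Output y = sum_{i=0}^{n} w_i * x_i, with x[0] == 1 acting as the bias input."""
    return np.dot(w, x)

# Hypothetical example: two real inputs plus the constant bias input x0 = 1.
w = np.array([0.5, 2.0, -1.0])   # w0 (bias weight), w1, w2
x = np.array([1.0, 3.0, 4.0])    # x0 = 1, x1, x2
y = linear_neuron(w, x)          # 0.5 + 6.0 - 4.0 = 2.5
```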
Learning Task
W = [w_0 … w_n]^T is the weight vector
X_p = [x_0p … x_np]^T is the p-th training sample
y_p = Σ_i w_i x_ip = W · X_p is the output of the neuron for input X_p
d_p = f(X_p) is the desired output for input X_p
e_p = (d_p − y_p) is the error of the neuron on input X_p
S = {(X_p, d_p)} is a (multi)set of training examples
E_S(W) = E_S(w_0, w_1, …, w_n) = (1/2) Σ_p e_p² is the estimated error of W on the training set S

Goal: Find W* = argmin_W E_S(W)
Learning linear functions
[Figure: the error E_S plotted as a surface over the weights w_0 and w_1, with gradient descent moving the weight vector W_t toward the minimum W*]

The error is a quadratic function of the weights in the case of a linear neuron.
Learning linear functions
∂E/∂w_i = (1/2) ∂/∂w_i ( Σ_p e_p² )
        = (1/2) Σ_p ∂/∂w_i ( e_p² )
        = (1/2) Σ_p (2 e_p) ∂e_p/∂w_i
        = Σ_p e_p (∂e_p/∂y_p)(∂y_p/∂w_i)
        = Σ_p e_p (−1) ∂/∂w_i ( Σ_{j=0}^n w_j x_jp )
        = − Σ_p (d_p − y_p) ∂/∂w_i ( w_i x_ip + Σ_{j≠i} w_j x_jp )
        = − Σ_p (d_p − y_p) x_ip

Gradient descent: w_i ← w_i − η ∂E/∂w_i, that is,

w_i ← w_i + η Σ_p (d_p − y_p) x_ip
Least Mean Square Error (LMSE) Learning Rule

Batch update: w_i ← w_i + η Σ_p (d_p − y_p) x_ip
Per-sample update: w_i ← w_i + η (d_p − y_p) x_ip
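A minimal numpy sketch of the two update modes (an illustration added here, not from the slides); `X` is assumed to hold one training sample per row with x_0 = 1 prepended:

```python
import numpy as np

def lmse_batch_update(w, X, d, eta=0.01):
    """Batch update: w <- w + eta * sum_p (d_p - y_p) * x_p."""
    y = X @ w                      # outputs for all P samples at once
    return w + eta * (d - y) @ X   # accumulate the correction over all samples

def lmse_per_sample_update(w, x_p, d_p, eta=0.01):
    """Per-sample update: w <- w + eta * (d_p - y_p) * x_p."""
    y_p = np.dot(w, x_p)
    return w + eta * (d_p - y_p) * x_p
```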
Momentum update

w_i(t + 1) = w_i(t) + Δw_i(t)
Δw_i(t) = −η (∂E/∂w_i)|_{w_i = w_i(t)} + α Δw_i(t − 1), where 0 ≤ α < 1 and Δw_i(0) = 0

The momentum update allows the effective learning rate to increase when feasible and decrease when necessary. Converges for 0 ≤ α < 1.
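A sketch of the momentum update as a reusable step function (illustrative; the function name and default rates are assumptions, not from the slides):

```python
import numpy as np

def momentum_step(w, grad, delta_prev, eta=0.1, alpha=0.9):
    """One step of: Delta w(t) = -eta * dE/dw + alpha * Delta w(t-1)."""
    delta = -eta * grad + alpha * delta_prev
    return w + delta, delta   # return Delta w(t) to feed into the next step

# Usage: initialize delta = np.zeros_like(w) (i.e., Delta w(0) = 0) and
# thread the returned delta through successive calls.
```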
Learning approximations of nonlinear functions from data – the generalized delta rule
• Motivations
• Universal function approximation theorem (UFAT)
• Derivation of the generalized delta rule
• Back-propagation algorithm
• Practical considerations
• Applications
Motivations
• Psychology – empirical inadequacy of behaviorist theories of learning: simple reward-punishment based learning models are incapable of learning functions (e.g., exclusive OR) that are readily learned by animals (e.g., monkeys)
• Artificial Intelligence – the need for learning highly nonlinear functions where the form of the nonlinear relationship is unknown a priori
• Statistics – limitations of linear regression in fitting data when the relationship is highly nonlinear and the form of the relationship is unknown
• Control – the need for nonlinear control methods
Universal function approximation theorem (UFAT) (Cybenko, 1989)
• Let φ: ℝ → ℝ be a non-constant, bounded (hence nonlinear), monotone, continuous function. Let I_N be the N-dimensional unit hypercube in ℝ^N.
• Let C(I_N) = {f: I_N → ℝ} be the set of all continuous functions with domain I_N and range ℝ. Then for any function f ∈ C(I_N) and any ε > 0, there exist an integer L and sets of real values θ, α_j, θ_j, w_ji (1 ≤ j ≤ L; 1 ≤ i ≤ N) such that

F(x_1, x_2, …, x_N) = Σ_{j=1}^L α_j φ( Σ_{i=1}^N w_ji x_i − θ_j ) − θ

is a uniform approximation of f – that is,

∀ (x_1, …, x_N) ∈ I_N, |F(x_1, …, x_N) − f(x_1, …, x_N)| < ε
Universal function approximation theorem (UFAT)
• Unlike Kolmogorov's theorem, UFAT requires only one kind of nonlinearity to approximate any arbitrary nonlinear function to any desired accuracy
• The sigmoid function satisfies the UFAT requirements:

F(x_1, x_2, …, x_N) = Σ_{j=1}^L α_j φ( Σ_{i=1}^N w_ji x_i − θ_j ) − θ

φ(z) = 1 / (1 + e^{−az}), a > 0;  lim_{z → −∞} φ(z) = 0, lim_{z → +∞} φ(z) = 1

Similar universal approximation properties can be guaranteed for other functions – e.g., radial basis functions.
Universal function approximation theorem
• UFAT guarantees the existence of arbitrarily accurate approximations of continuous functions defined over bounded subsets of ℝ^N
• UFAT tells us the representational power of a certain class of multi-layer networks relative to the set of continuous functions defined on bounded subsets of ℝ^N
• UFAT is not constructive – it does not tell us how to choose the parameters to construct a desired function
• To learn an unknown function from data, we need an algorithm to search the hypothesis space of multilayer networks
• The generalized delta rule allows nonlinear functions to be learned from the training data
Feed-forward neural networks
• A feed-forward n-layer network consists of n layers of nodes:
• 1 layer of input nodes
• n − 2 layers of hidden nodes
• 1 layer of output nodes
• interconnected by modifiable weights from the input nodes to the hidden nodes and from the hidden nodes to the output nodes
• More general topologies (e.g., with connections that skip layers, such as direct connections between input and output nodes) are possible
[Figure: a three-layer network that approximates the exclusive-OR function]
Three-layer feed-forward neural network
• A single bias unit is connected to each unit other than the input units
• Net input:

n_j = Σ_{i=1}^d x_i w_ji + w_j0 = Σ_{i=0}^d x_i w_ji ≡ W_j · X

• where the subscript i indexes units in the input layer and j indexes units in the hidden layer; w_ji denotes the input-to-hidden layer weight at hidden unit j
• The output of a hidden unit is a nonlinear function of its net input, that is, y_j = f(n_j), e.g.,

y_j = 1 / (1 + e^{−n_j})
Three-layer feed-forward neural network
• Each output unit similarly computes its net activation based on the hidden unit signals:

n_k = Σ_{j=1}^{n_H} y_j w_kj + w_k0 = Σ_{j=0}^{n_H} y_j w_kj = W_k · Y

• where the subscript k indexes units in the output layer and n_H denotes the number of hidden units
• The output can be a linear or nonlinear function of the net input, e.g., z_k = n_k
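Putting the last two slides together, a minimal forward-pass sketch for a three-layer network with sigmoid hidden units and linear output units (the weight-matrix layout with the bias in column 0 is an assumption of this sketch, not from the slides):

```python
import numpy as np

def sigmoid(n):
    return 1.0 / (1.0 + np.exp(-n))

def forward(x, W_hid, W_out):
    """Forward pass of a three-layer network.
    W_hid: (n_H, d+1) input-to-hidden weights, column 0 holding the bias w_j0.
    W_out: (M, n_H+1) hidden-to-output weights, column 0 holding the bias w_k0.
    Hidden units are sigmoidal, output units are linear (z_k = n_k)."""
    x = np.concatenate(([1.0], x))    # prepend the bias input x0 = 1
    y = sigmoid(W_hid @ x)            # hidden outputs y_j = f(n_j)
    yb = np.concatenate(([1.0], y))   # prepend the bias unit y0 = 1
    z = W_out @ yb                    # linear outputs z_k = n_k
    return y, z
```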
Computing nonlinear functions
Realizing non-linearly separable class boundaries
Learning nonlinear real-valued functions
• Given a training set, determine:
• Network structure – the number of hidden nodes or, more generally, the network topology
• Start small and grow the network, or
• Start with a sufficiently large network and prune away the unnecessary connections
• For a given structure, determine the parameters (weights) that minimize the error on the training samples (e.g., the mean squared error)
• For now, we focus on the latter
Generalized delta rule – error back-propagation
• Challenge – we know the desired outputs for nodes in the output layer, but not for the hidden layer
• We need to solve the credit assignment problem – dividing the credit or blame for the performance of the output nodes among the hidden nodes
• The generalized delta rule offers an elegant solution to the credit assignment problem in feed-forward neural networks in which each neuron computes a differentiable function of its inputs
• The solution can be generalized to other kinds of networks, including networks with cycles
Feed-forward networks
• Forward operation – computing the output for a given input based on the current weights
• Learning – modification of the network parameters (weights) to minimize an appropriate error measure
• Because each neuron computes a differentiable function of its inputs:
– if the error is a differentiable function of the network outputs, the error is a differentiable function of the weights in the network – so we can perform gradient descent!
A fully connected 3-layer network
Generalized delta rule
• Let t_kp be the k-th target (or desired) output for input pattern X_p, z_kp the output produced by the k-th output node, and let W represent all the weights in the network
• Training error:

E_S(W) = Σ_p E_p(W) = (1/2) Σ_p Σ_{k=1}^M (t_kp − z_kp)²

• The weights are initialized with pseudo-random values and are changed in a direction that will reduce the error:

Δw_kj = −η ∂E_S/∂w_kj    Δw_ji = −η ∂E_S/∂w_ji
Generalized delta rule
η > 0 is a suitable learning rate; W ← W + ΔW

Hidden-to-output weights:

∂E_p/∂w_kj = (∂E_p/∂n_kp)(∂n_kp/∂w_kj), with ∂n_kp/∂w_kj = y_jp
∂E_p/∂n_kp = (∂E_p/∂z_kp)(∂z_kp/∂n_kp) = −(t_kp − z_kp)(1)   (for linear output units)

w_kj ← w_kj − η ∂E_p/∂w_kj = w_kj + η (t_kp − z_kp) y_jp = w_kj + η δ_kp y_jp, where δ_kp = (t_kp − z_kp)
Generalized delta rule
Weights from input to hidden units
∂E_p/∂w_ji = Σ_{k=1}^M (∂E_p/∂z_kp)(∂z_kp/∂w_ji)
           = Σ_{k=1}^M (∂E_p/∂z_kp)(∂z_kp/∂y_jp)(∂y_jp/∂n_jp)(∂n_jp/∂w_ji)
           = ( ∂/∂y_jp [ (1/2) Σ_{l=1}^M (t_lp − z_lp)² ] ) y_jp(1 − y_jp) x_ip
           = − ( Σ_{k=1}^M (t_kp − z_kp) w_kj ) y_jp (1 − y_jp) x_ip
           = − δ_jp x_ip, where δ_jp = y_jp(1 − y_jp) Σ_{k=1}^M δ_kp w_kj

w_ji ← w_ji + η δ_jp x_ip
Back-propagation algorithm

Start with small random initial weights.
Until the desired stopping criterion is satisfied, do:
  Select a training sample from S
  Compute the outputs of all nodes based on the current weights and the input sample
  Compute the weight updates for the output nodes
  Compute the weight updates for the hidden nodes
  Update the weights
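A compact per-sample back-propagation sketch implementing the updates derived above, for sigmoid hidden units and linear output units (illustrative; the names and matrix layout match the forward-pass sketch earlier, not any code from the slides):

```python
import numpy as np

def backprop_epoch(X, T, W_hid, W_out, eta=0.1):
    """One pass of per-sample back-propagation (sigmoid hidden, linear output).
    X: (P, d) input patterns, T: (P, M) target outputs."""
    for p in np.random.permutation(len(X)):                  # randomized order
        x = np.concatenate(([1.0], X[p]))
        y = 1.0 / (1.0 + np.exp(-(W_hid @ x)))               # hidden outputs
        yb = np.concatenate(([1.0], y))
        z = W_out @ yb                                       # linear outputs
        delta_k = T[p] - z                                   # output deltas
        delta_j = y * (1 - y) * (W_out[:, 1:].T @ delta_k)   # hidden deltas
        W_out += eta * np.outer(delta_k, yb)                 # hidden-to-output update
        W_hid += eta * np.outer(delta_j, x)                  # input-to-hidden update
    return W_hid, W_out
```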
Using neural networks for classification

Network outputs are real-valued. How can we use the networks for classification?
Classify a pattern by assigning it to the class that corresponds to the index of the output node with the largest output for the pattern:

F(X_p) = argmax_k z_kp
Training multi-layer networks – Some Useful Tricks
• Initializing weights to small random values that place the neurons in the linear portion of their operating range for most of the patterns in the training set improves the speed of convergence, e.g., for the input-to-hidden layer weights,

w_ji = ± 1 / ( 2 · (1/N) Σ_{i=1}^N |x_i| )

with the sign of each weight chosen at random, and analogously for the hidden-to-output layer weights, scaled by the typical magnitude of the hidden-unit outputs, again with the sign of each weight chosen at random.
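A sketch of the simpler 1/√N initialization developed later in these slides ("Initializing weights (revisited)"), which serves the same purpose; the function signature is an assumption:

```python
import numpy as np

def init_weights(n_in, n_hidden, n_out, seed=0):
    """Uniform weights in (-1/sqrt(N), +1/sqrt(N)), N = fan-in of each layer,
    so that net inputs land in the linear range of the sigmoid."""
    rng = np.random.default_rng(seed)
    W_hid = rng.uniform(-1, 1, (n_hidden, n_in + 1)) / np.sqrt(n_in)
    W_out = rng.uniform(-1, 1, (n_out, n_hidden + 1)) / np.sqrt(n_hidden)
    return W_hid, W_out
```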
Some Useful Tricks
• Use of a momentum term allows the effective learning rate for each weight to adapt as needed and helps speed up convergence – in a network with 2 layers of weights:

w_ji(t + 1) = w_ji(t) + Δw_ji(t),  Δw_ji(t) = −η (∂E_S/∂w_ji)|_{w_ji = w_ji(t)} + α Δw_ji(t − 1)
w_kj(t + 1) = w_kj(t) + Δw_kj(t),  Δw_kj(t) = −η (∂E_S/∂w_kj)|_{w_kj = w_kj(t)} + α Δw_kj(t − 1)

where 0 < η, α < 1, with typical values of η = 0.05 to 0.6 and α = 0.8 to 0.9
Some Useful Tricks
• Using a sigmoid function that satisfies φ(−z) = −φ(z) helps speed up convergence, e.g.,

φ(z) = a (1 − e^{−bz}) / (1 + e^{−bz})

with a = 1.716 and b = 2/3 ⇒ (∂φ/∂z)|_{z=0} ≈ 1, and φ is linear in the range −1 < z < 1
Some Useful Tricks
• Randomizing the order of presentation of training examples from one pass to the next helps avoid local minima
• Introducing small amounts of noise into the weight updates (or into the examples) during training helps improve generalization – it minimizes overfitting, makes the learned approximation more robust to noise, and helps avoid local minima
• If using the suggested sigmoid nodes in the output layer, set the target output to 1 for the target class and −1 for all others
• Regularization helps avoid overfitting and improves generalization:

R(W) = λ E(W) + (1 − λ) C(W);  0 ≤ λ ≤ 1

C(W) = (1/2) ( Σ_{ji} w_ji² + Σ_{kj} w_kj² ), so that ∂C/∂w_ji = w_ji and ∂C/∂w_kj = w_kj

Start with λ close to 1 and gradually lower it during training. When λ < 1, the complexity term tends to drive the weights toward zero, setting up a tension between error reduction and complexity minimization.
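As a sketch, one gradient step on the regularized objective R(W); since ∂C/∂w = w, the (1 − λ) term amounts to ordinary weight decay (the function name and default rate are assumptions):

```python
def regularized_step(W, grad_E, lam, eta=0.1):
    """Gradient step on R(W) = lam * E(W) + (1 - lam) * C(W).
    Since dC/dw = w, the (1 - lam) * W term decays the weights toward zero."""
    return W - eta * (lam * grad_E + (1 - lam) * W)
```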
Some Useful Tricks
Input and output encodings
• Do not eliminate natural proximity in the input or output space
• Do not normalize input patterns to be of unit length if the length is likely to be relevant for distinguishing between classes
• Do not introduce unwarranted proximity as an artifact
• Do not use log₂M outputs to encode M classes; use M outputs instead to avoid spurious proximity in the output space
• Use error-correcting codes when feasible
Some Useful Tricks
Examples of a good code
• Binary thermometer codes for encoding real values
• Suppose we can use 10 bits to represent a value between 0.0 and 1.0
• We can quantize the interval [0, 1] into 10 equal parts
• 0.38 in thermometer code is 1111000000
• 0.60 in thermometer code is 1111110000
• Note that values that are close along the real number line have thermometer codes that are close in Hamming distance

Example of a bad code
• Ordinary binary representations of integers
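A small sketch of a thermometer encoder consistent with the examples above (the rounding convention via ceil is an assumption chosen to reproduce the two codes shown):

```python
import math

def thermometer_code(x, n_bits=10, lo=0.0, hi=1.0):
    """Encode x in [lo, hi] as a thermometer code over n_bits equal parts."""
    level = math.ceil((x - lo) / (hi - lo) * n_bits)   # number of leading 1s
    level = max(0, min(n_bits, level))
    return [1] * level + [0] * (n_bits - level)

# thermometer_code(0.38) -> [1,1,1,1,0,0,0,0,0,0]   i.e., 1111000000
# thermometer_code(0.60) -> [1,1,1,1,1,1,0,0,0,0]   i.e., 1111110000
```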
Some Useful Tricks
• Normalizing inputs – know when and when not to normalize
• Standardize each component of the input separately so that, across the training set, it has a mean of 0 and a standard deviation of 1:

μ_i = (1/P) Σ_{q=1}^P x_qi
σ_i² = (1/P) Σ_{q=1}^P (x_qi − μ_i)²
x_ip ← (x_ip − μ_i) / σ_i
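A one-function numpy sketch of this standardization, computing the per-component statistics over the P training patterns:

```python
import numpy as np

def standardize(X):
    """Scale each input component to zero mean and unit standard deviation.
    X holds one pattern per row; the statistics run over the P rows."""
    mu = X.mean(axis=0)       # mu_i
    sigma = X.std(axis=0)     # sigma_i
    return (X - mu) / sigma   # x_ip <- (x_ip - mu_i) / sigma_i
```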
Some Useful Tricks
Initializing weights (revisited)

Suppose the weights are uniformly distributed between −w and +w. The standardized input to a hidden neuron is then distributed between −w√N and +w√N. We want this to fall between −1 and 1:

⇒ w = 1/√N ⇒ −1/√N < w_ji < 1/√N

and similarly for the hidden-to-output weights: −1/√n_H < w_kj < 1/√n_H
Some Useful Tricks
• Normalizing inputs – know when and when not to normalize
• Normalizing each input pattern so that it is of unit length,

X_p ← X_p / ‖X_p‖,

is commonplace, but often inappropriate
Some Useful Tricks
• Use of problem specific information (if known) speeds up convergence and improves generalization
• In networks designed for translation-invariant visual image classification, building in translation invariance as a constraint on the weights helps
• If we know that the function to be approximated is smooth, we can build that in as part of the criterion to be minimized – minimize, in addition to the error, the gradient of the error with respect to the inputs
Some Useful Tricks
• Manufacture training data – train networks with translated and rotated versions of the patterns if translation- and rotation-invariant recognition is desired
• Incorporate hints during training
• Hints are used as additional outputs during training to help shape the hidden layer representation
Hint nodes (e.g., vowels versus consonants in training a phoneme recognizer)
Some Useful Tricks
• Reducing the effective number of free parameters (degrees of freedom) helps improve generalization
• Regularization
• Preprocess the data to reduce the dimensionality of the input:
• Train a neural network whose target output is the same as its input, but with fewer hidden neurons than the number of inputs
• Use the hidden layer outputs as inputs to a second network that does the function approximation (see the sketch below)
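A sketch of this dimensionality-reduction trick, reusing the backprop_epoch sketch from earlier (the names, epoch count, and learning rate are assumptions):

```python
import numpy as np

def train_autoencoder(X, n_hidden, epochs=100, eta=0.05):
    """Train a network whose target output equals its input, with
    n_hidden < number of inputs; the hidden layer then yields a
    reduced-dimensionality representation of the data."""
    n_in = X.shape[1]
    rng = np.random.default_rng(0)
    W_hid = rng.uniform(-1, 1, (n_hidden, n_in + 1)) / np.sqrt(n_in)
    W_out = rng.uniform(-1, 1, (n_in, n_hidden + 1)) / np.sqrt(n_hidden)
    for _ in range(epochs):
        W_hid, W_out = backprop_epoch(X, X, W_hid, W_out, eta)  # targets = inputs
    return W_hid   # the hidden layer defines the compressed features
```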
Some Useful Tricks
• Choice of an appropriate error function is critical – do not blindly minimize the sum-squared error; there are many cases where other criteria are appropriate
• Example:

E_S(W) = Σ_{p=1}^P Σ_{k=1}^M t_kp ln( t_kp / z_kp )
is appropriate for minimizing the distance between the target probability distribution over the M output variables and the probability distribution represented by the network
Some Useful Tricks
• Interpreting the outputs as class-conditional probabilities
• Use exponential output nodes:

n_kp = Σ_{j=0}^{n_H} w_kj y_jp

linear output: z_kp = n_kp / Σ_{l=1}^M n_lp
exponential (softmax) output: z_kp = e^{n_kp} / Σ_{l=1}^M e^{n_lp}
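A numpy sketch of the exponential (softmax) outputs together with the relative-entropy error from the previous slide (subtracting the max before exponentiating is an added implementation detail for numerical stability, not from the slides):

```python
import numpy as np

def softmax(n):
    """Exponential outputs: z_k = exp(n_k) / sum_l exp(n_l)."""
    e = np.exp(n - n.max())   # subtracting the max avoids overflow
    return e / e.sum()

def relative_entropy_error(t, z):
    """E = sum_k t_k * ln(t_k / z_k); terms with t_k = 0 contribute 0."""
    mask = t > 0
    return float(np.sum(t[mask] * np.log(t[mask] / z[mask])))
```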
Let ω_1, …, ω_M denote the classes, with Σ_{l=1}^M P(ω_l | X) = 1.

t_k(X) = 1 if X ∈ ω_k, t_k(X) = 0 if X ∉ ω_k   (k-th target output)
g_k(X; W) = k-th output of the network for input X

E_S(W) = Σ_{p=1}^P Σ_k ( g_k(X_p; W) − t_kp )²
       = Σ_k [ Σ_{p: X_p ∈ ω_k} ( g_k(X_p; W) − 1 )² + Σ_{p: X_p ∉ ω_k} ( g_k(X_p; W) )² ]
Bayes classification and Neural Networks
lim_{|S| → ∞} (1/|S|) E_S(W)
  = ∫ ( g_k(X; W) − P(ω_k | X) )² P(X) dX + ∫ P(ω_k | X) (1 − P(ω_k | X)) P(X) dX

where the second term is independent of W. Because the generalized delta rule minimizes this quantity with respect to W, we have

g_k(X; W) ≈ P(ω_k | X)

assuming that the network is expressive enough to represent P(ω_k | X).
Radial-Basis Function Networks
• A function is approximated as a linear combination of radial basis functions (RBF). RBFs capture local behaviors of functions.
• RBFs represent local receptive fields
Radial Basis Function Networks
• Hidden layer applies a non-linear transformation from the input space to the hidden space.
• Output layer applies a linear transformation from the hidden space to the output space.
[Figure: an RBF network with inputs x_1, x_2, …, x_m, hidden RBF units φ_1, …, φ_H, output weights w_1, …, w_H, and output y]
Example of a radial basis function
• Hidden units use a radial basis function φ_σ(‖X − W‖²); the output of a hidden unit depends on the distance of the input X from the center W
• W is called the center, σ is called the spread; the center and spread are parameters

[Figure: a hidden unit computing φ_σ(‖X − W‖²) over inputs x_1, x_2, …, x_N]
Radial basis function
• A hidden neuron is more sensitive to data points near its center. This sensitivity may be tuned by adjusting the spread σ.
• Larger spread ⇒ less sensitivity
• Neurons in the visual cortex have locally tuned frequency responses.
Gaussian Radial Basis Function φ
[Figure: Gaussian radial basis function φ centered at the center; σ is a measure of how spread the curve is – large σ gives a wide curve, small σ a narrow one]
Types of φ
• Multiquadrics: φ(r) = (r² + c²)^{1/2}, c > 0
• Inverse multiquadrics: φ(r) = 1 / (r² + c²)^{1/2}, c > 0
• Gaussian functions: φ(r) = exp( −r² / (2σ²) ), σ > 0

where r = ‖X − W‖
Implementing Exclusive OR using an RBF network
• Input space: the points (0,0), (0,1), (1,0), (1,1) in the (x_1, x_2) plane
• Output space: y ∈ {0, 1}
• Construct an RBF pattern classifier such that:
(0,0) and (1,1) are mapped to 0, class C1
(1,0) and (0,1) are mapped to 1, class C2
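One way to realize this classifier, sketched with two Gaussian RBF units centered on the class-C1 patterns; the particular weights, bias, and spread below are hand-picked assumptions that happen to separate the two classes:

```python
import numpy as np

def gaussian_rbf(x, center, sigma=0.5):
    return np.exp(-np.sum((x - center) ** 2) / (2 * sigma ** 2))

# Two Gaussian units centered on the class-C1 patterns (0,0) and (1,1);
# the output unit thresholds a linear combination of the hidden outputs.
centers = [np.array([0.0, 0.0]), np.array([1.0, 1.0])]

def xor_rbf(x, w=(-1.0, -1.0), bias=0.8):
    hidden = np.array([gaussian_rbf(x, c) for c in centers])
    return 1 if np.dot(w, hidden) + bias > 0 else 0

for point in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(point, xor_rbf(np.array(point, dtype=float)))   # 0, 1, 1, 0
```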
Δw_ji = −η ∂E_S/∂w_ji    Δα_j = −η_α ∂E_S/∂α_j    Δσ_j = −η_σ ∂E_S/∂σ_j

Depending on the specific radial basis function, each derivative can be computed using the chain rule of calculus.
z_jp = e^{ −‖X_p − W_j‖² / (2σ_j²) }
y_p = Σ_{j=0}^L α_j z_jp
E_p = (1/2) (t_p − y_p)²
X_p = [x_1p … x_Np]^T    W_j = [w_j1 … w_jN]^T
RBF Learning Algorithm

Δα_j = −η_α ∂E_p/∂α_j = η_α (t_p − y_p) z_jp
α_j ← α_j + η_α (t_p − y_p) z_jp

∂E_p/∂w_ji = (∂E_p/∂y_p)(∂y_p/∂z_jp)(∂z_jp/∂w_ji) = −(t_p − y_p) α_j z_jp (x_ip − w_ji) / σ_j²
w_ji ← w_ji + η (t_p − y_p) α_j z_jp (x_ip − w_ji) / σ_j²
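A per-sample numpy sketch combining the α, center, and spread updates derived on this and the next slide (a single learning rate and the vectorized layout are assumptions of this sketch; the bias term α_0 is omitted):

```python
import numpy as np

def rbf_sample_update(x, t, centers, sigmas, alphas, eta=0.05):
    """One per-sample gradient step for a Gaussian RBF network.
    centers: (L, N), sigmas: (L,), alphas: (L,); x: (N,), t: scalar."""
    z = np.exp(-np.sum((x - centers) ** 2, axis=1) / (2 * sigmas ** 2))
    y = np.dot(alphas, z)
    err = t - y
    d_alpha = eta * err * z
    d_centers = eta * err * (alphas * z / sigmas ** 2)[:, None] * (x - centers)
    d_sigmas = eta * err * alphas * z * (-2 * np.log(np.maximum(z, 1e-12))) / sigmas
    return alphas + d_alpha, centers + d_centers, sigmas + d_sigmas
```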
RBF Learning Algorithm
∂E_p/∂σ_j = (∂E_p/∂y_p)(∂y_p/∂z_jp)(∂z_jp/∂σ_j) = −(t_p − y_p) α_j z_jp (−2 ln z_jp) / σ_j
σ_j ← σ_j + η_σ (t_p − y_p) α_j z_jp (−2 ln z_jp) / σ_j
RBF Learning Algorithm (continued)
Some useful facts:

‖V‖² = V^T V   (norm)
‖V‖²_C = V^T C V   (weighted norm)
V^T C V = V^T V if C is the identity matrix
(d/dX)(X^T A X) = (A + A^T) X = 2 A X   (when A is a symmetric matrix)
z_jp = e^{ −(1/2)(X_p − W_j)^T Σ_j^{−1} (X_p − W_j) }
y_p = Σ_{j=0}^L α_j z_jp
E_p = (1/2) (t_p − y_p)²
X_p = [x_1p … x_Np]^T    W_j = [w_j1 … w_jN]^T

Exercise: Derive the weight update equations from first principles.
RBF Learning Algorithm (continued)
More general form of radial basis function:

z_jp = e^{ −(X_p − W_j)^T C_j^{−1} (X_p − W_j) }

where C_j^{−1} is the inverse of an N×N covariance matrix. Note that the covariance matrix is symmetric.

Exercise: derive a learning rule for an RBF network with such neurons in the hidden layer and linear neurons in the output layer.