
Approximating Real-Valued Functions from Data: Neural Networks

Vasant Honavar
Artificial Intelligence Research Laboratory
Informatics Graduate Program
Computer Science and Engineering Graduate Program
Bioinformatics and Genomics Graduate Program
Neuroscience Graduate Program
Center for Big Data Analytics and Discovery Informatics
Huck Institutes of the Life Sciences
Institute for Cyberscience
Clinical and Translational Sciences Institute
Northeast Big Data Hub
Pennsylvania State University

[email protected]
http://faculty.ist.psu.edu/vhonavar
http://ailab.ist.psu.edu

Fall 2018

Neural Networks

• Learning to approximate real-valued functions
• Bayesian recipe for learning real-valued functions
• Review: learning linear functions using gradient descent in weight space
• Universal function approximation theorem
• Learning nonlinear functions using gradient descent in weight space
• Practical considerations and examples

Review: Bayesian Recipe for Learning

• P(h) = prior probability of hypothesis h
• P(D) = prior probability of training data D
• P(h | D) = probability of h given D
• P(D | h) = probability of D given h

$$P(h \mid D) = \frac{P(D \mid h)\, P(h)}{P(D)}$$

Bayesian recipe for learning

Choose the most likely hypothesis given the data:

$$h_{MAP} = \arg\max_{h \in H} P(h \mid D) = \arg\max_{h \in H} \frac{P(D \mid h)\,P(h)}{P(D)} = \arg\max_{h \in H} P(D \mid h)\,P(h) \quad \text{(Maximum a posteriori hypothesis)}$$

If $P(h_i) = P(h_j)$ for all $h_i, h_j \in H$, then

$$h_{ML} = \arg\max_{h \in H} P(D \mid h) \quad \text{(Maximum likelihood hypothesis)}$$

Learning a Real Valued Function

• Consider a real-valued target function f
• Training examples ⟨x_i, d_i⟩, where d_i is a noisy training value: d_i = f(x_i) + e_i
• e_i is a random variable (noise) drawn independently for each x_i according to a Gaussian distribution with zero mean
• ⇒ d_i has mean f(x_i) and the same variance

Then the maximum likelihood hypothesis h_ML is the one that minimizes the sum of squared errors:

$$h_{ML} = \arg\min_{h \in H} \sum_{i=1}^{m} \big(d_i - h(x_i)\big)^2$$

Learning a Real Valued Function

$$h_{ML} = \arg\max_{h \in H} P(h \mid D) = \arg\max_{h \in H} P(D \mid h) = \arg\max_{h \in H} \prod_{i=1}^{m} p(d_i, x_i \mid h)$$

$$= \arg\max_{h \in H} \prod_{i=1}^{m} p(d_i \mid h, x_i)\, P(x_i) = \arg\max_{h \in H} \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2}\left(\frac{d_i - h(x_i)}{\sigma}\right)^2} \prod_{i=1}^{m} P(x_i)$$

Maximize the natural log of this instead...

Learning a Real Valued Function

$$h_{ML} = \arg\max_{h \in H} \sum_{i=1}^{m} \left[\ln\frac{1}{\sqrt{2\pi\sigma^2}} - \frac{1}{2}\left(\frac{d_i - h(x_i)}{\sigma}\right)^2\right] = \arg\max_{h \in H} \sum_{i=1}^{m} -\frac{1}{2}\left(\frac{d_i - h(x_i)}{\sigma}\right)^2$$

$$= \arg\max_{h \in H} \sum_{i=1}^{m} -\big(d_i - h(x_i)\big)^2 = \arg\min_{h \in H} \sum_{i=1}^{m} \big(d_i - h(x_i)\big)^2$$

The maximum likelihood hypothesis is the one that minimizes the mean squared error!

Approximating a linear function using a linear neuron

[Figure: a linear neuron with inputs x_1, ..., x_n, a bias input x_0 = 1, synaptic weights w_0, ..., w_n, and output y]

$$y = \sum_{i=0}^{n} w_i x_i$$

Learning Task

$W = [W_0, \ldots, W_n]^T$ is the weight vector
$X_p = [X_{0p}, \ldots, X_{np}]^T$ is the p-th training sample
$y_p = \sum_i W_i X_{ip} = W \cdot X_p$ is the output of the neuron for input $X_p$
$d_p = f(X_p)$ is the desired output for input $X_p$
$e_p = (d_p - y_p)$ is the error of the neuron on input $X_p$
$S = \{(X_p, d_p)\}$ is a (multi)set of training examples
$E_S(W) = E_S(W_0, W_1, \ldots, W_n) = \frac{1}{2}\sum_p e_p^2$ is the estimated error of $W$ on the training set $S$

Goal: Find $W^* = \arg\min_W E_S(W)$

Learning linear functions

[Figure: the error surface E_S over the weights (w_0, w_1), with the current weight vector W_t descending toward the minimum W*]

The error is a quadratic function of the weights in the case of a linear neuron.

Learning linear functions

$$\frac{\partial E}{\partial w_i} = \frac{1}{2}\frac{\partial}{\partial w_i}\left(\sum_p e_p^2\right) = \frac{1}{2}\sum_p \frac{\partial}{\partial w_i}\big(e_p^2\big) = \frac{1}{2}\sum_p \big(2e_p\big)\frac{\partial e_p}{\partial w_i} = \sum_p e_p \left(\frac{\partial e_p}{\partial y_p}\right)\left(\frac{\partial y_p}{\partial w_i}\right)$$

$$= \sum_p e_p\,(-1)\,\frac{\partial}{\partial w_i}\left(\sum_{j=0}^{n} w_j x_{jp}\right) = -\sum_p \big(d_p - y_p\big)\,\frac{\partial}{\partial w_i}\left(w_i x_{ip} + \sum_{j\neq i} w_j x_{jp}\right) = -\sum_p \big(d_p - y_p\big)\, x_{ip}$$

$$w_i \leftarrow w_i - \eta\,\frac{\partial E}{\partial w_i} \quad\Longrightarrow\quad w_i \leftarrow w_i + \eta\sum_p \big(d_p - y_p\big)\, x_{ip}$$

Least Mean Square Error (LMSE) Learning Rule

Batch update: $w_i \leftarrow w_i + \eta \sum_p (d_p - y_p)\, x_{ip}$

Per-sample update: $w_i \leftarrow w_i + \eta\, (d_p - y_p)\, x_{ip}$
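A minimal sketch of both update modes in Python (NumPy), assuming a bias input x_0 = 1 is prepended to each sample; the data, learning rates, and epoch counts below are illustrative choices, not prescriptions from the slides:

```python
import numpy as np

def lms_batch(X, d, eta=0.01, epochs=200):
    """Batch LMSE: accumulate the gradient over all samples, then update."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        y = X @ w                    # outputs for all samples
        w += eta * (d - y) @ X       # w_i <- w_i + eta * sum_p (d_p - y_p) x_ip
    return w

def lms_per_sample(X, d, eta=0.05, epochs=100):
    """Per-sample LMSE: update after each training example."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_p, d_p in zip(X, d):
            w += eta * (d_p - w @ x_p) * x_p
    return w

# Illustrative data: d = 0.5 + 2*x1 - x2, with a prepended bias column x_0 = 1
rng = np.random.default_rng(0)
X = np.hstack([np.ones((50, 1)), rng.uniform(-1, 1, (50, 2))])
d = 0.5 * X[:, 0] + 2 * X[:, 1] - X[:, 2]
print(lms_batch(X, d))   # approximately [0.5, 2.0, -1.0]
```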

Momentum update

$$w_i(t+1) = w_i(t) + \Delta w_i(t)$$

$$\Delta w_i(t) = -\eta\,\frac{\partial E}{\partial w_i}\bigg|_{w_i = w_i(t)} + \alpha\,\Delta w_i(t-1), \qquad 0 \le \alpha < 1$$

The momentum update allows the effective learning rate to increase when feasible and decrease when necessary. Converges for 0 ≤ α < 1.
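A sketch of the momentum variant of the per-sample LMSE rule above (NumPy); the buffer `dw` holds Δw(t−1), and the particular η and α values are illustrative:

```python
import numpy as np

def lms_momentum(X, d, eta=0.02, alpha=0.9, epochs=100):
    """Per-sample LMSE with momentum: dw(t) = -eta*grad + alpha*dw(t-1)."""
    w = np.zeros(X.shape[1])
    dw = np.zeros_like(w)                    # previous update, dw(t-1)
    for _ in range(epochs):
        for x_p, d_p in zip(X, d):
            grad = -(d_p - w @ x_p) * x_p    # dE_p/dw for the squared error
            dw = -eta * grad + alpha * dw
            w += dw
    return w
```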

Learning approximations of nonlinear functions from data – the generalized delta rule

• Motivations
• Universal function approximation theorem (UFAT)
• Derivation of the generalized delta rule
• Back-propagation algorithm
• Practical considerations
• Applications

Motivations

• Psychology – empirical inadequacy of behaviorist theories of learning: simple reward-punishment based learning models are incapable of learning functions (e.g., exclusive OR) which are readily learned by animals (e.g., monkeys)
• Artificial Intelligence – the need for learning highly nonlinear functions where the form of the nonlinear relationship is unknown a priori
• Statistics – limitations of linear regression in fitting data when the relationship is highly nonlinear and the form of the relationship is unknown
• Control – need for nonlinear control methods

Universal function approximation theorem (UFAT) (Cybenko, 1989)

• Let φ: ℝ → ℝ be a non-constant, bounded (hence non-linear), monotone, continuous function. Let I_N be the N-dimensional unit hypercube in ℝ^N.
• Let C(I_N) = {f: I_N → ℝ} be the set of all continuous functions with domain I_N and range ℝ. Then for any function f ∈ C(I_N) and any ε > 0, there exist an integer L and sets of real values θ, α_j, θ_j, w_ji (1 ≤ j ≤ L; 1 ≤ i ≤ N) such that

$$F(x_1, x_2, \ldots, x_N) = \sum_{j=1}^{L} \alpha_j\, \varphi\!\left(\sum_{i=1}^{N} w_{ji}\, x_i - \theta_j\right) - \theta$$

is a uniform approximation of f – that is,

$$\forall (x_1, \ldots, x_N) \in I_N, \quad \big|F(x_1, \ldots, x_N) - f(x_1, \ldots, x_N)\big| < \varepsilon$$

Universal function approximation theorem (UFAT)

• Unlike Kolmogorov's theorem, UFAT requires only one kind of nonlinearity to approximate any arbitrary nonlinear function to any desired accuracy:

$$F(x_1, x_2, \ldots, x_N) = \sum_{j=1}^{L} \alpha_j\, \varphi\!\left(\sum_{i=1}^{N} w_{ji}\, x_i - \theta_j\right) - \theta$$

• The sigmoid function satisfies the UFAT requirements:

$$\varphi(z) = \frac{1}{1 + e^{-az}},\ a > 0; \qquad \lim_{z\to-\infty}\varphi(z) = 0, \quad \lim_{z\to+\infty}\varphi(z) = 1$$

Similar universal approximation properties can be guaranteed for other functions – e.g., radial basis functions.
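A small sketch of the form F guaranteed by UFAT, evaluated in NumPy; the particular L, N, and parameter values here are arbitrary illustrations, not a construction of any specific target f:

```python
import numpy as np

def sigmoid(z, a=1.0):
    return 1.0 / (1.0 + np.exp(-a * z))

def F(x, alpha, W, theta_j, theta):
    """F(x) = sum_j alpha_j * phi(w_j . x - theta_j) - theta  (the UFAT form)."""
    return alpha @ sigmoid(W @ x - theta_j) - theta

# Arbitrary illustration with L = 3 hidden terms and N = 2 inputs
rng = np.random.default_rng(0)
alpha, W = rng.normal(size=3), rng.normal(size=(3, 2))
theta_j, theta = rng.normal(size=3), 0.1
print(F(np.array([0.25, 0.75]), alpha, W, theta_j, theta))
```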

Universal function approximation theorem

• UFAT guarantees the existence of arbitrarily accurate approximations of continuous functions defined over bounded subsets of ℝ^N
• UFAT tells us the representational power of a certain class of multi-layer networks relative to the set of continuous functions defined on bounded subsets of ℝ^N
• UFAT is not constructive – it does not tell us how to choose the parameters to construct a desired function
• To learn an unknown function from data, we need an algorithm to search the hypothesis space of multilayer networks
• The generalized delta rule allows nonlinear functions to be learned from the training data

Feed-forward neural networks

• A feed-forward n-layer network consists of n layers of nodes:
  • 1 layer of input nodes
  • n−2 layers of hidden nodes
  • 1 layer of output nodes
  interconnected by modifiable weights from the input nodes to the hidden nodes and from the hidden nodes to the output nodes
• More general topologies (e.g., with connections that skip layers, such as direct connections between input and output nodes) are possible

[Figure: a three-layer network that approximates the exclusive OR function]

Three-layer feed-forward neural network

• A single bias unit is connected to each unit other than the input units
• Net input:

$$n_j = \sum_{i=1}^{d} x_i w_{ji} + w_{j0} = \sum_{i=0}^{d} x_i w_{ji} \equiv W_j \cdot X$$

where the subscript i indexes units in the input layer and j indexes units in the hidden layer; w_ji denotes the input-to-hidden weights at hidden unit j.
• The output of a hidden unit is a nonlinear function of its net input, i.e., y_j = f(n_j), e.g.,

$$y_j = \frac{1}{1 + e^{-n_j}}$$

Three-layer feed-forward neural network

• Each output unit similarly computes its net activation based on the hidden unit signals:

$$n_k = \sum_{j=1}^{n_H} y_j w_{kj} + w_{k0} = \sum_{j=0}^{n_H} y_j w_{kj} = W_k \cdot Y$$

where the subscript k indexes units in the output layer and n_H denotes the number of hidden units.
• The output can be a linear or nonlinear function of the net input, e.g., z_k = n_k
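A sketch of the forward pass these two slides describe (NumPy), with sigmoid hidden units and linear output units; the layer sizes are illustrative, and the bias units are handled by prepending a constant 1:

```python
import numpy as np

def forward(x, W_hidden, W_out):
    """Forward pass of a 3-layer network: sigmoid hidden units, linear outputs.

    W_hidden: (n_H, d+1) input-to-hidden weights, column 0 holds the biases w_j0.
    W_out:    (M, n_H+1) hidden-to-output weights, column 0 holds the biases w_k0.
    """
    x = np.concatenate(([1.0], x))      # bias unit x_0 = 1
    n_j = W_hidden @ x                  # net input to hidden units
    y = 1.0 / (1.0 + np.exp(-n_j))      # y_j = sigmoid(n_j)
    y = np.concatenate(([1.0], y))      # bias unit for the output layer
    z = W_out @ y                       # linear outputs z_k = n_k
    return y, z

rng = np.random.default_rng(0)
y, z = forward(rng.uniform(-1, 1, 4), rng.normal(size=(3, 5)), rng.normal(size=(2, 4)))
print(z)
```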

Computing nonlinear functions

[Figure: hidden sigmoid units combining to compute a nonlinear function of the inputs]

Realizing nonlinearly separable class boundaries

[Figure: decision regions formed by a multi-layer network for classes that are not linearly separable]

Learning nonlinear real-valued functions

• Given a training set, determine:
  • Network structure – the number of hidden nodes or, more generally, the network topology:
    • Start small and grow the network, or
    • Start with a sufficiently large network and prune away the unnecessary connections
  • For a given structure, the parameters (weights) that minimize the error on the training samples (e.g., the mean squared error)
• For now, we focus on the latter

Generalized delta rule – error back-propagation

• Challenge – we know the desired outputs for nodes in the output layer, but not for the hidden layer
• Need to solve the credit assignment problem – dividing the credit or blame for the performance of the output nodes among the hidden nodes
• The generalized delta rule offers an elegant solution to the credit assignment problem in feed-forward neural networks in which each neuron computes a differentiable function of its inputs
• The solution can be generalized to other kinds of networks, including networks with cycles

Feed-forward networks

• Forward operation (computing the output for a given input based on the current weights)
• Learning – modification of the network parameters (weights) to minimize an appropriate error measure
• Because each neuron computes a differentiable function of its inputs, if the error is a differentiable function of the network outputs, then the error is a differentiable function of the weights in the network – so we can perform gradient descent!

[Figure: a fully connected 3-layer network]

Generalized delta rule

• Let t_kp be the k-th target (or desired) output for input pattern X_p and z_kp be the output produced by the k-th output node, and let W represent all the weights in the network
• Training error:

$$E_S(W) = \sum_p E_p(W) = \frac{1}{2}\sum_p \sum_{k=1}^{M} \big(t_{kp} - z_{kp}\big)^2$$

• The weights are initialized with pseudo-random values and are changed in a direction that will reduce the error:

$$\Delta w_{ji} = -\eta\,\frac{\partial E_S}{\partial w_{ji}}, \qquad \Delta w_{kj} = -\eta\,\frac{\partial E_S}{\partial w_{kj}}$$

Generalized delta rule

η > 0 is a suitable learning rate; W ← W + ΔW.

Hidden-to-output weights:

$$\frac{\partial E_p}{\partial w_{kj}} = \frac{\partial E_p}{\partial n_{kp}} \cdot \frac{\partial n_{kp}}{\partial w_{kj}}, \qquad n_{kp} = \sum_j w_{kj}\, y_{jp}$$

$$\frac{\partial E_p}{\partial n_{kp}} = \frac{\partial E_p}{\partial z_{kp}} \cdot \frac{\partial z_{kp}}{\partial n_{kp}} = -\big(t_{kp} - z_{kp}\big)\, z_{kp}\big(1 - z_{kp}\big)$$

$$w_{kj} \leftarrow w_{kj} - \eta\,\frac{\partial E_p}{\partial w_{kj}} = w_{kj} + \eta\,\big(t_{kp} - z_{kp}\big)\, z_{kp}\big(1 - z_{kp}\big)\, y_{jp} = w_{kj} + \eta\,\delta_{kp}\, y_{jp}$$

where $\delta_{kp} = \big(t_{kp} - z_{kp}\big)\, z_{kp}\big(1 - z_{kp}\big)$ for sigmoid output units.

Generalized delta rule

Weights from input to hidden units:

$$\frac{\partial E_p}{\partial w_{ji}} = \sum_{k=1}^{M} \frac{\partial E_p}{\partial n_{kp}} \cdot \frac{\partial n_{kp}}{\partial y_{jp}} \cdot \frac{\partial y_{jp}}{\partial n_{jp}} \cdot \frac{\partial n_{jp}}{\partial w_{ji}} = \frac{\partial}{\partial y_{jp}}\!\left[\frac{1}{2}\sum_{l=1}^{M}\big(t_{lp} - z_{lp}\big)^2\right] y_{jp}\big(1 - y_{jp}\big)\, x_{ip}$$

$$= -\left(\sum_{k=1}^{M} \delta_{kp}\, w_{kj}\right) y_{jp}\big(1 - y_{jp}\big)\, x_{ip} = -\,\delta_{jp}\, x_{ip}$$

where $\delta_{jp} = y_{jp}\big(1 - y_{jp}\big)\sum_{k=1}^{M} \delta_{kp}\, w_{kj}$, giving the update

$$w_{ji} \leftarrow w_{ji} + \eta\,\delta_{jp}\, x_{ip}$$

Back-propagation algorithm

Start with small random initial weights.
Until the desired stopping criterion is satisfied do:
  Select a training sample from S
  Compute the outputs of all nodes based on the current weights and the input sample
  Compute the weight updates for the output nodes
  Compute the weight updates for the hidden nodes
  Update the weights
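A compact sketch of this loop in Python (NumPy) for the three-layer network described earlier, with sigmoid hidden and output units so the deltas match the two preceding derivation slides; the layer size, learning rate, epoch count, and the XOR data are illustrative choices:

```python
import numpy as np

def sigmoid(n):
    return 1.0 / (1.0 + np.exp(-n))

def train_backprop(X, T, n_hidden=4, eta=0.5, epochs=2000, seed=0):
    """Per-sample back-propagation for one hidden layer of sigmoid units."""
    rng = np.random.default_rng(seed)
    d, M = X.shape[1], T.shape[1]
    W_h = rng.uniform(-0.5, 0.5, (n_hidden, d + 1))   # input-to-hidden (with bias)
    W_o = rng.uniform(-0.5, 0.5, (M, n_hidden + 1))   # hidden-to-output (with bias)
    for _ in range(epochs):
        for x_p, t_p in zip(X, T):
            x = np.concatenate(([1.0], x_p))
            y = np.concatenate(([1.0], sigmoid(W_h @ x)))  # hidden outputs
            z = sigmoid(W_o @ y)                           # network outputs
            delta_k = (t_p - z) * z * (1 - z)              # output deltas
            delta_j = y[1:] * (1 - y[1:]) * (W_o[:, 1:].T @ delta_k)  # hidden deltas
            W_o += eta * np.outer(delta_k, y)              # w_kj <- w_kj + eta*delta_k*y_j
            W_h += eta * np.outer(delta_j, x)              # w_ji <- w_ji + eta*delta_j*x_i
    return W_h, W_o

# Illustrative use: learn exclusive OR
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)
W_h, W_o = train_backprop(X, T)
```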

Using neural networks for classification

Network outputs are real valued. How can we use the networks for classification?

Classify a pattern by assigning it to the class that corresponds to the index of the output node with the largest output for the pattern:

$$F(X_p) = \arg\max_k\, z_{kp}$$
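In code this decision rule is a single argmax over the output vector `z` produced by the forward pass sketched earlier:

```python
import numpy as np

def classify(z):
    """Assign the pattern to the class of the largest output: F(X_p) = argmax_k z_kp."""
    return int(np.argmax(z))
```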

Training multi-layer networks – Some Useful Tricks

• Initializing weights to small random values that place the neurons in the linear portion of their operating range for most of the patterns in the training set improves the speed of convergence – e.g., choose input-to-hidden weight magnitudes on the order of 1/√N (for N inputs) with the sign of each weight chosen at random, and hidden-to-output weight magnitudes on the order of 1/√n_H with the sign of each weight chosen at random (see "Initializing weights (revisited)" below for the derivation)

Some Useful Tricks

• Use of a momentum term allows the effective learning rate for each weight to adapt as needed and helps speed up convergence – in a network with 2 layers of weights:

$$w_{ji}(t+1) = w_{ji}(t) + \Delta w_{ji}(t), \qquad \Delta w_{ji}(t) = -\eta\,\frac{\partial E_S}{\partial w_{ji}}\bigg|_{w_{ji} = w_{ji}(t)} + \alpha\,\Delta w_{ji}(t-1)$$

$$w_{kj}(t+1) = w_{kj}(t) + \Delta w_{kj}(t), \qquad \Delta w_{kj}(t) = -\eta\,\frac{\partial E_S}{\partial w_{kj}}\bigg|_{w_{kj} = w_{kj}(t)} + \alpha\,\Delta w_{kj}(t-1)$$

where $0 < \alpha < 1$ and $0 < \eta \le 1$, with typical values of η from 0.05 to 0.6 and α from 0.8 to 0.9.

Some Useful Tricks

• Use of a sigmoid function which satisfies φ(–z) = –φ(z) helps speed up convergence, e.g.,

$$\varphi(z) = a\left(\frac{e^{bz} - e^{-bz}}{e^{bz} + e^{-bz}}\right) = a \tanh(bz)$$

with $a = 1.716$ and $b = 2/3$, which gives $\dfrac{\partial \varphi}{\partial z}\bigg|_{z=0} \approx 1$, and φ is linear in the range $-1 < z < 1$.

Some Useful Tricks

• Randomizing the order of presentation of training examples from one pass to the next helps avoid local minima
• Introducing small amounts of noise into the weight updates (or into the examples) during training helps improve generalization – it minimizes overfitting, makes the learned approximation more robust to noise, and helps avoid local minima
• If using the suggested antisymmetric sigmoid nodes in the output layer, set the target output to 1 for the target class and -1 for all others

Some Useful Tricks

• Regularization helps avoid overfitting and improves generalization:

$$R(W) = \lambda\, E(W) + (1-\lambda)\, C(W), \qquad 0 \le \lambda \le 1$$

$$C(W) = \frac{1}{2}\left(\sum w_{ji}^2 + \sum w_{kj}^2\right), \qquad \frac{\partial C}{\partial w_{ji}} = w_{ji} \ \text{ and } \ \frac{\partial C}{\partial w_{kj}} = w_{kj}$$

Start with λ close to 1 and gradually lower it during training. When λ < 1, the complexity term tends to drive the weights toward zero, setting up a tension between error reduction and complexity minimization.
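A sketch of how the regularized criterion changes a gradient-descent step (illustrative; `grad_E` stands for the error gradient ∂E_S/∂w computed by back-propagation):

```python
def regularized_update(w, grad_E, eta=0.1, lam=0.95):
    """Gradient step on R(W) = lam*E(W) + (1-lam)*C(W), C(W) = 0.5*sum(w**2),
    so dR/dw = lam*grad_E + (1-lam)*w; the second term decays the weights."""
    return w - eta * (lam * grad_E + (1 - lam) * w)
```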

Some Useful Tricks

Input and output encodings
• Do not eliminate natural proximity in the input or output space
• Do not normalize input patterns to be of unit length if the length is likely to be relevant for distinguishing between classes
• Do not introduce unwarranted proximity as an artifact
• Do not use log₂ M outputs to encode M classes; use M outputs instead to avoid spurious proximity in the output space
• Use error-correcting codes when feasible

Some Useful Tricks

Example of a good code
• Binary thermometer codes for encoding real values
• Suppose we can use 10 bits to represent a value between 0 and 1
• We can quantize the interval [0, 1] into 10 equal parts
• 0.38 in thermometer code is 1111000000
• 0.60 in thermometer code is 1111110000
• Note that values that are close along the real number line have thermometer codes that are close in Hamming distance

Example of a bad code
• Ordinary binary representations of integers
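A sketch of the encoder implied by the examples above, assuming values in [0, 1] and 10 bits; rounding at the interval boundaries is an implementation choice the slide leaves open:

```python
def thermometer(value, n_bits=10):
    """Encode a value in [0, 1] as a thermometer code: round(value*n_bits) leading ones."""
    ones = round(value * n_bits)
    return "1" * ones + "0" * (n_bits - ones)

print(thermometer(0.38))  # 1111000000
print(thermometer(0.60))  # 1111110000
```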

Some Useful Tricks

• Normalizing inputs – know when and when not to normalize
• Scale each component of the input separately so that it has mean 0 and standard deviation 1:

$$\mu_i = \frac{1}{P}\sum_{q=1}^{P} x_{qi}, \qquad \sigma_i^2 = \frac{1}{P}\sum_{q=1}^{P}\big(x_{qi} - \mu_i\big)^2, \qquad x_{pi} \leftarrow \frac{x_{pi} - \mu_i}{\sigma_i}$$
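A one-line version of this standardization in NumPy, assuming `X` holds one pattern per row:

```python
import numpy as np

def standardize(X):
    """Scale each input component to zero mean and unit standard deviation."""
    return (X - X.mean(axis=0)) / X.std(axis=0)
```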

Some Useful Tricks

Initializing weights (revisited)

Suppose the weights are uniformly distributed between −w and +w. The standardized input to a hidden neuron is then distributed between $-w\sqrt{N}$ and $+w\sqrt{N}$. We want this to fall between −1 and +1:

$$w = \frac{1}{\sqrt{N}} \quad\Rightarrow\quad -\frac{1}{\sqrt{N}} < w_{ji} < \frac{1}{\sqrt{N}}$$

Similarly, for the hidden-to-output weights, $-\dfrac{1}{\sqrt{n_H}} < w_{kj} < \dfrac{1}{\sqrt{n_H}}$.
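A sketch of this initialization in NumPy; `n_inputs` and `n_hidden` play the roles of N and n_H (bias columns omitted for brevity):

```python
import numpy as np

def init_weights(n_inputs, n_hidden, n_outputs, seed=0):
    """Uniform initialization in (-1/sqrt(fan_in), +1/sqrt(fan_in)) for each layer."""
    rng = np.random.default_rng(seed)
    W_h = rng.uniform(-1, 1, (n_hidden, n_inputs)) / np.sqrt(n_inputs)
    W_o = rng.uniform(-1, 1, (n_outputs, n_hidden)) / np.sqrt(n_hidden)
    return W_h, W_o
```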

Some Useful Tricks

• Normalizing inputs – know when and when not to normalize
• Normalizing each input pattern so that it is of unit length is commonplace, but often inappropriate:

$$X_p \leftarrow \frac{X_p}{\|X_p\|}$$

Some Useful Tricks

• Use of problem-specific information (if known) speeds up convergence and improves generalization
• In networks designed for translation-invariant visual image classification, building in translation invariance as a constraint on the weights helps
• If we know the function to be approximated is smooth, we can build that in as part of the criterion to be minimized – minimize, in addition to the error, the gradient of the error with respect to the inputs

Some Useful Tricks

• Manufacture training data – train networks with translated and rotated patterns if translation- and rotation-invariant recognition is desired
• Incorporate hints during training
• Hints are used as additional outputs during training to help shape the hidden layer representation (e.g., hint nodes for vowels versus consonants in training a phoneme recognizer)

Some Useful Tricks

• Reducing the effective number of free parameters (degrees of freedom) helps improve generalization
• Regularization
• Preprocess the data to reduce the dimensionality of the input:
  • Train a neural network with output the same as the input, but with fewer hidden neurons than the number of inputs
  • Use the hidden layer outputs as inputs to a second network that does the function approximation

Some Useful Tricks

• The choice of an appropriate error function is critical – do not blindly minimize the sum squared error; there are many cases where other criteria are appropriate
• Example:

$$E_S(W) = \sum_{p=1}^{P}\sum_{k=1}^{M} t_{kp}\,\ln\!\left(\frac{t_{kp}}{z_{kp}}\right)$$

is appropriate for minimizing the distance between the target probability distribution over the M output variables and the probability distribution represented by the network.
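A sketch of this criterion in NumPy; `t` and `z` are the target and network distributions over the M outputs for one pattern, and the mask plus the small `eps` guard are implementation choices, not part of the slide's formula:

```python
import numpy as np

def kl_error(t, z, eps=1e-12):
    """E = sum_k t_k * ln(t_k / z_k); terms with t_k = 0 contribute 0."""
    t, z = np.asarray(t), np.asarray(z)
    mask = t > 0
    return float(np.sum(t[mask] * np.log(t[mask] / (z[mask] + eps))))
```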

Some Useful Tricks

• Interpreting the outputs as class conditional probabilities
• Use exponential output nodes:

$$n_{kp} = \sum_{j=0}^{n_H} w_{kj}\, y_{jp}$$

$$z_{kp} = \frac{n_{kp}}{\sum_{l=1}^{M} n_{lp}} \ \text{(linear output)}, \qquad z_{kp} = \frac{e^{n_{kp}}}{\sum_{l=1}^{M} e^{n_{lp}}} \ \text{(exponential output)}$$
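A sketch of the exponential (softmax) output in NumPy; subtracting the maximum before exponentiating is a standard numerical-stability trick, not something the slide prescribes:

```python
import numpy as np

def softmax(n):
    """Exponential outputs z_k = exp(n_k) / sum_l exp(n_l)."""
    e = np.exp(n - np.max(n))   # shift for numerical stability
    return e / e.sum()
```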

Bayes classification and Neural Networks

With 1-of-M target coding,

$$t_k(X) = 1 \ \text{if} \ X \in \omega_k \ \text{(the target class)}, \qquad t_k(X) = 0 \ \text{if} \ X \notin \omega_k$$

Let $g_k(X; W)$ denote the k-th output for input X. Then

$$E_S(W) = \sum_k \sum_p \big(g_k(X_p; W) - t_{kp}\big)^2 = \sum_k \left[\sum_{p:\, X_p \in \omega_k} \big(g_k(X_p; W) - 1\big)^2 + \sum_{p:\, X_p \notin \omega_k} \big(g_k(X_p; W)\big)^2\right]$$

Bayes classification and Neural Networks

$$\lim_{|S| \to \infty} \frac{1}{|S|}\, E_S(W) = \sum_k \int \big(g_k(X; W) - P(\omega_k \mid X)\big)^2\, p(X)\, dX \;+\; \underbrace{\sum_k \int P(\omega_k \mid X)\big(1 - P(\omega_k \mid X)\big)\, p(X)\, dX}_{\text{independent of } W}$$

Because the generalized delta rule minimizes this quantity with respect to W, we have

$$g_k(X; W) \approx P(\omega_k \mid X)$$

assuming that the network is expressive enough to represent $P(\omega_k \mid X)$.

Radial-Basis Function Networks

• A function is approximated as a linear combination of radial basis functions (RBFs). RBFs capture local behaviors of functions.
• RBFs represent local receptive fields.

Radial Basis Function Networks

• The hidden layer applies a nonlinear transformation from the input space to the hidden space.
• The output layer applies a linear transformation from the hidden space to the output space.

[Figure: inputs x_1, ..., x_m feed H radial basis units φ_1, ..., φ_H, whose outputs are combined with weights w_1, ..., w_H to produce the output y]

Example of a radial basis function

• Hidden units use a radial basis function $\varphi_\sigma\big(\|X - W\|^2\big)$, where W is called the center and σ is called the spread; the center and spread are parameters.
• The output depends on the distance of the input X from the center W.

Radial basis function

• A hidden neuron is more sensitive to data points near its center. This sensitivity may be tuned by adjusting the spread σ.
• Larger spread ⇒ less sensitivity.
• Neurons in the visual cortex have locally tuned frequency responses.

Gaussian Radial Basis Function φ

[Figure: Gaussian RBF curves centered at the center; σ is a measure of how spread out the curve is – large σ gives a wide curve, small σ a narrow one]

Types of φ

• Multiquadrics: $\varphi(r) = \big(r^2 + c^2\big)^{1/2}$, with $c > 0$
• Inverse multiquadrics: $\varphi(r) = \dfrac{1}{\big(r^2 + c^2\big)^{1/2}}$, with $c > 0$
• Gaussian functions: $\varphi(r) = \exp\!\left(-\dfrac{r^2}{2\sigma^2}\right)$, with $\sigma > 0$

where $r = \|X - W\|$.
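These three kernels in NumPy, for reference; `r` is the distance ‖X − W‖:

```python
import numpy as np

def multiquadric(r, c=1.0):
    return np.sqrt(r**2 + c**2)

def inverse_multiquadric(r, c=1.0):
    return 1.0 / np.sqrt(r**2 + c**2)

def gaussian(r, sigma=1.0):
    return np.exp(-r**2 / (2 * sigma**2))
```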

Implementing Exclusive OR using an RBF network

• Input space: the four corners (0,0), (0,1), (1,0), (1,1) of the unit square
• Output space: y ∈ {0, 1}
• Construct an RBF pattern classifier such that:
  (0,0) and (1,1) are mapped to 0, class C1
  (1,0) and (0,1) are mapped to 1, class C2
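A minimal sketch of one such classifier, assuming two Gaussian hidden units with centers at (1,1) and (0,0) (a standard choice for this example); the output weights and threshold below were picked by hand, not learned:

```python
import numpy as np

# Centers of the two Gaussian hidden units; spread chosen so that 2*sigma^2 = 1
t1, t2 = np.array([1.0, 1.0]), np.array([0.0, 0.0])

def rbf_xor(x):
    """Hand-built RBF solution to XOR: threshold a linear combination of two RBFs."""
    x = np.asarray(x, dtype=float)
    phi1 = np.exp(-np.sum((x - t1)**2))
    phi2 = np.exp(-np.sum((x - t2)**2))
    return int(-phi1 - phi2 + 0.95 > 0)   # hand-picked output weights and bias

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, rbf_xor(x))   # prints 0, 1, 1, 0
```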

RBF Learning Algorithm

The centers, output weights, and spreads can all be trained by gradient descent:

$$\Delta w_{ji} = -\eta\,\frac{\partial E_S}{\partial w_{ji}}, \qquad \Delta \alpha_j = -\eta_\alpha\,\frac{\partial E_S}{\partial \alpha_j}, \qquad \Delta \sigma_j = -\eta_\sigma\,\frac{\partial E_S}{\partial \sigma_j}$$

Depending on the specific radial basis function, each derivative can be computed using the chain rule of calculus.

For Gaussian basis functions, the network and its per-sample error are:

$$z_{jp} = e^{-\frac{\|X_p - W_j\|^2}{2\sigma_j^2}}, \qquad y_p = \sum_{j=0}^{L} \alpha_j\, z_{jp}, \qquad E_p = \frac{1}{2}\big(t_p - y_p\big)^2$$

$$X_p = [x_{1p}\, \ldots\, x_{Np}]^T, \qquad W_j = [w_{j1}\, \ldots\, w_{jN}]^T$$

RBF Learning Algorithm

$$\Delta \alpha_j = -\eta_\alpha\,\frac{\partial E_p}{\partial \alpha_j} = \eta_\alpha\,\big(t_p - y_p\big)\, z_{jp}, \qquad \alpha_j \leftarrow \alpha_j + \eta_\alpha\,\big(t_p - y_p\big)\, z_{jp}$$

$$\frac{\partial E_p}{\partial w_{ji}} = \frac{\partial E_p}{\partial y_p} \cdot \frac{\partial y_p}{\partial z_{jp}} \cdot \frac{\partial z_{jp}}{\partial w_{ji}} = -\big(t_p - y_p\big)\,\alpha_j\, z_{jp}\left(\frac{x_{ip} - w_{ji}}{\sigma_j^2}\right)$$

$$w_{ji} \leftarrow w_{ji} + \eta\,\big(t_p - y_p\big)\,\alpha_j\, z_{jp}\left(\frac{x_{ip} - w_{ji}}{\sigma_j^2}\right)$$

RBF Learning Algorithm

$$\frac{\partial E_p}{\partial \sigma_j} = \frac{\partial E_p}{\partial y_p} \cdot \frac{\partial y_p}{\partial \sigma_j} = -\big(t_p - y_p\big)\,\alpha_j\,\frac{\partial z_{jp}}{\partial \sigma_j} = -\big(t_p - y_p\big)\,\alpha_j\left(-\frac{2\, z_{jp}\ln z_{jp}}{\sigma_j}\right)$$

$$\sigma_j \leftarrow \sigma_j - \eta_\sigma\,\big(t_p - y_p\big)\,\alpha_j\left(\frac{2\, z_{jp}\ln z_{jp}}{\sigma_j}\right)$$

(using $\ln z_{jp} = -\|X_p - W_j\|^2 / (2\sigma_j^2)$, so that $\partial z_{jp}/\partial \sigma_j = z_{jp}\,\|X_p - W_j\|^2/\sigma_j^3 = -2\, z_{jp}\ln z_{jp}/\sigma_j$).
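A per-sample sketch of these three updates in NumPy for Gaussian hidden units; the learning rates are illustrative, all parameters must be float arrays, and the log guard is an implementation choice:

```python
import numpy as np

def rbf_step(x_p, t_p, W, alpha, sigma, eta_w=0.05, eta_a=0.05, eta_s=0.01):
    """One gradient step for an RBF net: centers W (L,N), weights alpha (L,), spreads sigma (L,)."""
    z = np.exp(-np.sum((x_p - W)**2, axis=1) / (2 * sigma**2))  # z_j
    err = t_p - alpha @ z                                       # (t_p - y_p)
    d_alpha = eta_a * err * z
    d_W = eta_w * err * (alpha * z / sigma**2)[:, None] * (x_p - W)
    d_sigma = eta_s * err * alpha * (-2.0 * z * np.log(np.maximum(z, 1e-300)) / sigma)
    alpha += d_alpha
    W += d_W
    sigma += d_sigma
    return err
```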

RBF Learning Algorithm (continued)

Some useful facts:

$$\|V\|^2 = V^T V \quad \text{(norm)}$$

$$\|V\|_C^2 = V^T C V \quad \text{(weighted norm)}; \qquad \|V\|_C^2 = \|V\|^2 \ \text{if } C \text{ is the identity matrix}$$

$$\frac{d}{dX}\big(X^T A X\big) = \big(A + A^T\big)X = 2AX \ \text{(when } A \text{ is a symmetric matrix)}$$

With a full covariance in the exponent:

$$z_{jp} = e^{-\frac{1}{2}(X_p - W_j)^T \Sigma_j^{-1} (X_p - W_j)}, \qquad y_p = \sum_{j=0}^{L} \alpha_j\, z_{jp}, \qquad E_p = \frac{1}{2}\big(t_p - y_p\big)^2$$

$$X_p = [x_{1p}\, \ldots\, x_{Np}]^T, \qquad W_j = [w_{j1}\, \ldots\, w_{jN}]^T$$

Exercise: Derive the weight update equations from first principles.

RBF Learning Algorithm

Collecting the three updates:

$$\alpha_j \leftarrow \alpha_j + \eta_\alpha\,\big(t_p - y_p\big)\, z_{jp}$$

$$w_{ji} \leftarrow w_{ji} + \eta\,\big(t_p - y_p\big)\,\alpha_j\, z_{jp}\left(\frac{x_{ip} - w_{ji}}{\sigma_j^2}\right)$$

$$\frac{\partial E_p}{\partial \sigma_j} = -\big(t_p - y_p\big)\,\alpha_j\left(-\frac{2\, z_{jp}\ln z_{jp}}{\sigma_j}\right) \quad\Rightarrow\quad \sigma_j \leftarrow \sigma_j - \eta_\sigma\,\big(t_p - y_p\big)\,\alpha_j\,\frac{2\, z_{jp}\ln z_{jp}}{\sigma_j}$$

RBF Learning Algorithm (continued)

A more general form of radial basis function:

$$z_{jp} = e^{-(X_p - W_j)^T C_j^{-1} (X_p - W_j)}$$

where $C_j^{-1}$ is the inverse of an N×N covariance matrix. Note that the covariance matrix is symmetric.

Exercise: Derive a learning rule for an RBF network with such neurons in the hidden layer and linear neurons in the output layer.
