Section 5: Radial Basis Function (RBF) Networks
Course: Introduction to Neural Networks
Instructor: Jeen-Shing Wang
Department of Electrical Engineering
National Cheng Kung University
Fall 2005
Outline
- Origin: Cover's theorem
- Interpolation problem
- Regularization theory
- Generalized RBFN
- Universal approximation
- Comparison with MLP
- RBFN = kernel regression
- Learning: centers, widths, and weights
- Simulations
Origin: Cover’s Theorem
A complex pattern-classification problem cast in a high-dimensional space nonlinearly is more likely to be linearly separable than in a low-dimensional space (Cover, 1965).
Cover, T. M., 1965. "Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition." IEEE Transactions on Electronic Computers, EC-14, 326-334.
Cover’s Theorem
Cover's theorem on separability of patterns (1965):

Given $N$ points $x_1, x_2, \ldots, x_N$, $x_i \in \mathbb{R}^p$, assigned to two classes $C_1$ and $C_2$, the dichotomy is $\varphi$-separable if there exists a vector $w \neq 0$ such that

$w^T \varphi(x) > 0$ for $x \in C_1$
$w^T \varphi(x) < 0$ for $x \in C_2$

where $\varphi(x) = [\varphi_1(x), \varphi_2(x), \ldots, \varphi_M(x)]^T$.
Cover’s Theorem (cont’d)
Two basic ingredients of Cover's theorem:
- Nonlinear functions $\varphi(x)$
- Dimension of the hidden space ($M$) > dimension of the input space ($p$) → probability of separability closer to 1

[Figure: example dichotomies: (a) linearly separable, (b) spherically separable, (c) quadrically separable]
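To make this concrete, here is a minimal sketch of the classic XOR example (assuming NumPy; the Gaussian hidden functions and their centers at (0,0) and (1,1) are the standard textbook choice, not something fixed by these slides): four points that are not linearly separable in the input plane become linearly separable after the nonlinear map $\varphi$.

```python
import numpy as np

# XOR: not linearly separable in R^2.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
d = np.array([0, 1, 1, 0])  # XOR labels

# Nonlinear hidden functions: Gaussians centered at (0,0) and (1,1).
c1, c2 = np.array([0.0, 0.0]), np.array([1.0, 1.0])
phi = np.column_stack([
    np.exp(-np.sum((X - c1) ** 2, axis=1)),
    np.exp(-np.sum((X - c2) ** 2, axis=1)),
])
print(phi)
# In (phi1, phi2)-space, class-0 points have phi1 + phi2 ~ 1.14 and
# class-1 points ~ 0.74, so the line phi1 + phi2 = 1 separates them.
```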
Interpolation Problem
Given $N$ points $(x_i, d_i)$, $x_i \in \mathbb{R}^p$, $d_i \in \mathbb{R}$, $1 \le i \le N$: find $F$ such that $F(x_i) = d_i$, $1 \le i \le N$.

The radial basis function (RBF) technique (Powell, 1988):

$F(x) = \sum_{i=1}^{N} w_i\, \varphi(\|x - x_i\|)$

- The $\varphi(\|x - x_i\|)$ are arbitrary nonlinear functions
- The number of functions is the same as the number of data points
- The centers are fixed at the known points $x_i$
Interpolation Problem (cont’d)
With $\varphi_{ji} = \varphi(\|x_j - x_i\|)$ and $F(x_j) = d_j$, the interpolation conditions become, in matrix form:

$\begin{bmatrix} \varphi_{11} & \varphi_{12} & \cdots & \varphi_{1N} \\ \varphi_{21} & \varphi_{22} & \cdots & \varphi_{2N} \\ \vdots & \vdots & & \vdots \\ \varphi_{N1} & \varphi_{N2} & \cdots & \varphi_{NN} \end{bmatrix} \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_N \end{bmatrix} = \begin{bmatrix} d_1 \\ d_2 \\ \vdots \\ d_N \end{bmatrix}$

i.e., $\Phi w = d$, so that $w = \Phi^{-1} d$.

Vital question: is $\Phi$ non-singular?
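A minimal sketch of this exact interpolation in one dimension, assuming NumPy; the Gaussian $\varphi$, its width, and the toy $\sin$ data are illustrative choices, not prescribed by the slides:

```python
import numpy as np

# Exact RBF interpolation: one basis function per data point,
# centers fixed at the data points, weights from w = Phi^{-1} d.
x = np.linspace(0.0, 1.0, 10)      # training inputs x_i
d = np.sin(2 * np.pi * x)          # targets d_i (toy example)

sigma = 0.15                       # illustrative width
Phi = np.exp(-((x[:, None] - x[None, :]) ** 2) / (2 * sigma ** 2))
w = np.linalg.solve(Phi, d)        # Phi is non-singular (Micchelli)

def F(xq):
    """Interpolant F(x) = sum_i w_i * phi(|x - x_i|)."""
    return np.exp(-((xq[:, None] - x[None, :]) ** 2) / (2 * sigma ** 2)) @ w

assert np.allclose(F(x), d)        # fits the training data exactly
```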
Micchelli's Theorem

If the points $x_i$ are distinct, $\Phi$ is non-singular (regardless of the dimension of the input space).

This is valid for a large class of RBF functions, e.g.:

1. Inverse multiquadrics: $\varphi(r) = \dfrac{1}{(r^2 + c^2)^{1/2}}$ for some $c > 0$ and $r \ge 0$

2. Gaussian functions: $\varphi(r) = \exp\!\left(-\dfrac{r^2}{2\sigma^2}\right)$ for $\sigma > 0$ and $r \ge 0$
Learning: Ill- and Well-Posed Problems

Given a set of data points, learning is viewed as a hypersurface reconstruction or approximation problem, i.e., an inverse problem.

Well-posed problem:
- A mapping from input to output exists for all input values
- The mapping is unique
- The mapping function is continuous

Ill-posed problem:
- Noise or imprecision in the data adds uncertainty, preventing unique reconstruction of the mapping
- Not enough training data to reconstruct the mapping uniquely
- Degraded generalization performance
- Regularization is needed
Regularization Theory
The basic idea of regularization is to stabilize the solution by means of an auxiliary functional that embeds prior information, e.g., smoothness constraints on the input-output mapping (i.e., the solution to the approximation problem), and thereby turn an ill-posed problem into a well-posed one (Poggio and Girosi, 1990).
Solution to the Regularization Problem
Minimize the cost function $E(F)$ with respect to $F$:

$E(F) = E_s(F) + \lambda E_c(F) = \frac{1}{2}\sum_{i=1}^{N}\bigl(d_i - F(x_i)\bigr)^2 + \frac{\lambda}{2}\,\|CF\|^2$

The first term is the standard error term, the second is the regularizing term, and $\lambda$ is the regularization parameter.
Solution to the Regularization Problem (cont'd)

Poggio & Girosi (1990): If $C$ is a (problem-dependent) linear differential operator, the solution to

$E(F) = \frac{1}{2}\sum_{i=1}^{N}\bigl(d_i - F(x_i)\bigr)^2 + \frac{\lambda}{2}\,\|CF\|^2$

is of the following form:

$F(x) = \sum_{i=1}^{N} w_i\, G(x, x_i)$

where $G(\cdot,\cdot)$ is a Green's function, and

$w = (G + \lambda I)^{-1} d$, with $(G)_{ki} = G(x_k, x_i)$.
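A minimal sketch of this regularized solution, assuming NumPy, a Gaussian Green's function, and illustrative width, $\lambda$, and noisy toy data:

```python
import numpy as np

# Regularized RBF fit: w = (G + lambda*I)^{-1} d, with Gaussian
# Green's function G(x, x_i) = exp(-|x - x_i|^2 / (2 sigma^2)).
rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 10)
d = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(10)  # noisy targets

sigma, lam = 0.15, 1e-2            # illustrative width and lambda
G = np.exp(-((x[:, None] - x[None, :]) ** 2) / (2 * sigma ** 2))
w = np.linalg.solve(G + lam * np.eye(len(x)), d)
# lam = 0 recovers the exact interpolation solution w = G^{-1} d;
# lam > 0 trades exact fit for smoothness, which helps with noise.
```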
Interpolation vs Regularization
Interpolation:
- Exact interpolator: $F(x) = \sum_{i=1}^{N} w_i\, \varphi(x, x_i)$ with $w = \Phi^{-1} d$
- Possible RBF: $\varphi(x, x_i) = \exp\!\left(-\dfrac{\|x - x_i\|^2}{2\sigma_i^2}\right)$

Regularization:
- Exact interpolator: $F(x) = \sum_{i=1}^{N} w_i\, G(x, x_i)$ with $w = (G + \lambda I)^{-1} d$
- Equal to the "interpolation" solution if $\lambda = 0$
- Example of Green's function: $G(x, x_i) = \exp\!\left(-\dfrac{\|x - x_i\|^2}{2\sigma_i^2}\right)$
Generalized RBF Network (GRBFN)
With as many radial basis functions as training patterns, the approach is:
- Computationally intensive
- Prone to an ill-conditioned matrix
- Hard to regularize ($C$ is problem-dependent)

Possible solution → the generalized RBFN approach:

$F(x) = \sum_{i=1}^{K} w_i\, \varphi(x, c_i), \quad K \le N$

Typically: $\varphi(x, c_i) = \exp\!\left(-\dfrac{\|x - c_i\|^2}{2\sigma_i^2}\right)$

where $w_i$, $c_i$, and $\sigma_i$ are adjustable parameters.
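A minimal sketch of the GRBFN forward pass with $K < N$ Gaussian units, assuming NumPy (the helper name and the idea of passing centers/widths/weights explicitly are mine; how those parameters are chosen is covered under "Learning Strategies" below):

```python
import numpy as np

def grbfn_forward(X, centers, widths, weights):
    """Generalized RBFN: F(x) = sum_i w_i * exp(-||x - c_i||^2 / (2 s_i^2)).

    X:       (N, p) input patterns
    centers: (K, p) RBF centers c_i, with K < N
    widths:  (K,)   per-unit widths sigma_i
    weights: (K,)   output-layer weights w_i
    """
    sq = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)  # (N, K)
    hidden = np.exp(-sq / (2.0 * widths ** 2))                     # (N, K)
    return hidden @ weights                                        # (N,)
```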
D-Dimensional Gaussian Distribution

Let $X = [x_1, x_2, \ldots, x_d]^T$ with $x_1, x_2, \ldots, x_d$ independent of each other, mean $\mu = [\mu_1, \mu_2, \ldots, \mu_d]^T$, and covariance

$\Sigma = \begin{bmatrix} \sigma_1^2 & 0 & \cdots & 0 \\ 0 & \sigma_2^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_d^2 \end{bmatrix}$

General form:

$p(X) = \dfrac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\!\left\{-\dfrac{1}{2}(X - \mu)^T \Sigma^{-1} (X - \mu)\right\}$

With independent components:

$p(X) = \dfrac{1}{(2\pi)^{d/2}\, \sigma_1 \sigma_2 \cdots \sigma_d} \exp\!\left\{-\dfrac{(x_1 - \mu_1)^2}{2\sigma_1^2} - \dfrac{(x_2 - \mu_2)^2}{2\sigma_2^2} - \cdots - \dfrac{(x_d - \mu_d)^2}{2\sigma_d^2}\right\}$

If $\sigma_1 = \sigma_2 = \cdots = \sigma_d = \sigma$:

$p(X) = \dfrac{1}{(2\pi)^{d/2} \sigma^d} \exp\!\left\{-\dfrac{\|X - \mu\|^2}{2\sigma^2}\right\}$
D-Dimensional Gaussian Distribution (cont'd)

2-Dimensional Gaussian Distribution:

$X = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}, \quad \mu = \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \quad \Sigma = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$

[Figure: plot of the 2-D Gaussian density over the $(x_1, x_2)$ plane]
Radial Basis Function Networks

$\varphi(x, c_i) = \exp\!\left(-\dfrac{\|x - c_i\|^2}{2\sigma_i^2}\right)$

$F(x) = \sum_{i=1}^{K} w_i\, \varphi(x, x_i)$

[Figure: RBFN architecture with inputs $x_1, x_2, x_3, x_4$, hidden RBF units $j = 1, 2, 3, \ldots, J$, and outputs $O_1, \ldots, O_K$]
RBFN: Universal Approximation
Park & Sandberg (1991): For any continuous input-output mapping function $f(x)$ and any $\varepsilon > 0$, there exists an RBFN

$F(x) = \sum_{i=1}^{K} w_i\, \varphi(x, x_i)$ such that $\|f(x) - F(x)\|_{L^p} < \varepsilon$, for $p \in [1, \infty)$.

- The theorem is stronger than needed here (radial symmetry is not required)
- $K$ is not specified
- Provides a theoretical basis for practical RBFNs!
Kernel Regression
Consider the nonlinear regression model:

$y_i = f(x_i) + \varepsilon_i \quad (i = 1, 2, \ldots, N)$

Recall:

$f(x) = E[y \mid x] = \int_{-\infty}^{\infty} y\, f_Y(y \mid x)\, dy \quad (1)$

From probability theory:

$f_Y(y \mid x) = \dfrac{f_{X,Y}(x, y)}{f_X(x)} \quad (2)$

By using (2) in (1):

$f(x) = \dfrac{\int_{-\infty}^{\infty} y\, f_{X,Y}(x, y)\, dy}{f_X(x)} \quad (3)$
Kernel Regression (cont'd)

We do not know $f_{X,Y}(x, y)$, but it can be estimated with the Parzen-Rosenblatt density estimator:

$\hat{f}_X(x) = \dfrac{1}{N h^{m_0}} \sum_{i=1}^{N} K\!\left(\dfrac{x - x_i}{h}\right) \quad \text{for } x \in \mathbb{R}^{m_0} \quad (4)$

$\hat{f}_{X,Y}(x, y) = \dfrac{1}{N h^{m_0+1}} \sum_{i=1}^{N} K\!\left(\dfrac{x - x_i}{h}\right) K\!\left(\dfrac{y - y_i}{h}\right) \quad \text{for } x \in \mathbb{R}^{m_0},\ y \in \mathbb{R}$

where $h$ is a positive number (the bandwidth) and $K(\cdot)$ is a kernel, symmetric about the origin, with $\int_{\mathbb{R}^{m_0}} K(x)\, dx = 1$.
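A minimal sketch of estimator (4) in one dimension ($m_0 = 1$), assuming NumPy; the Gaussian kernel, bandwidth, and toy samples are illustrative choices:

```python
import numpy as np

def parzen_rosenblatt(x_query, samples, h):
    """Density estimate f_hat(x) = (1/(N h)) * sum_i K((x - x_i)/h),
    here with the Gaussian kernel K(u) = exp(-u^2/2) / sqrt(2 pi)."""
    u = (x_query[:, None] - samples[None, :]) / h
    K = np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)
    return K.sum(axis=1) / (len(samples) * h)

samples = np.random.default_rng(2).standard_normal(500)
grid = np.linspace(-3.0, 3.0, 7)
print(parzen_rosenblatt(grid, samples, h=0.3))  # ~ N(0, 1) density values
```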
Kernel Regression (cont'd)

Integrating $y\,\hat{f}_{X,Y}(x, y)$ over $y$, substituting $z = (y - y_i)/h$, and using the symmetry of $K$, we get:

$\int_{-\infty}^{\infty} y\, \hat{f}_{X,Y}(x, y)\, dy = \dfrac{1}{N h^{m_0}} \sum_{i=1}^{N} y_i\, K\!\left(\dfrac{x - x_i}{h}\right) \quad (5)$
Kernel Regression (cont'd)

By using (4) and (5) as estimates of the corresponding parts of (3):

$F(x) = \hat{f}(x) = \dfrac{\sum_{i=1}^{N} y_i\, K\!\left(\dfrac{x - x_i}{h}\right)}{\sum_{i=1}^{N} K\!\left(\dfrac{x - x_i}{h}\right)} \quad (6)$
Nadaraya-Watson Regression Estimator
By defining the normalized weighting function

$W_{N,i}(x) = \dfrac{K\!\left(\dfrac{x - x_i}{h}\right)}{\sum_{j=1}^{N} K\!\left(\dfrac{x - x_j}{h}\right)} \quad (i = 1, 2, \ldots, N),$

with $\sum_{i=1}^{N} W_{N,i}(x) = 1$ for all $x$, we can rewrite (6) as:

$F(x) = \sum_{i=1}^{N} W_{N,i}(x)\, y_i$

$F(x)$ is a weighted average of the $y$-observables.
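A minimal sketch of the Nadaraya-Watson estimator, assuming NumPy and a Gaussian kernel; the toy data and bandwidth are illustrative:

```python
import numpy as np

def nadaraya_watson(x_query, x_train, y_train, h):
    """F(x) = sum_i W_{N,i}(x) * y_i, with kernel weights normalized
    so that sum_i W_{N,i}(x) = 1 for every query point x."""
    u = (x_query[:, None] - x_train[None, :]) / h
    K = np.exp(-0.5 * u ** 2)                 # Gaussian kernel values
    W = K / K.sum(axis=1, keepdims=True)      # normalized weights
    return W @ y_train

rng = np.random.default_rng(3)
x_train = np.sort(rng.uniform(0.0, 1.0, 50))
y_train = np.sin(2 * np.pi * x_train) + 0.1 * rng.standard_normal(50)
print(nadaraya_watson(np.array([0.25, 0.5, 0.75]), x_train, y_train, h=0.05))
```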
Normalized RBF Network
Assuming spherical symmetry of $K(x)$:

$K\!\left(\dfrac{x - x_i}{h}\right) = K\!\left(\dfrac{\|x - x_i\|}{h}\right) \quad \text{for all } i$

The normalized radial basis function is defined as:

$\psi_N(x, x_i) = \dfrac{K\!\left(\dfrac{\|x - x_i\|}{h}\right)}{\sum_{j=1}^{N} K\!\left(\dfrac{\|x - x_j\|}{h}\right)} \quad (i = 1, 2, \ldots, N)$

with $\sum_{i=1}^{N} \psi_N(x, x_i) = 1$ for all $x$.
Normalized RBF Network (cont'd)

Letting $w_i = y_i$ for all $i$, we may rewrite (6) as:

$F(x) = \sum_{i=1}^{N} w_i\, \psi_N(x, x_i)$

$\psi_N(x, x_i)$ may be interpreted as the probability of an event $x$ conditional on $x_i$.
Multivariate Gaussian Distribution
If we take the kernel function to be the multivariate Gaussian distribution

$K(x) = \dfrac{1}{(2\pi)^{m_0/2}} \exp\!\left(-\dfrac{\|x\|^2}{2}\right),$

then we can write

$K\!\left(\dfrac{x - x_i}{h}\right) = \dfrac{1}{(2\pi)^{m_0/2}} \exp\!\left(-\dfrac{\|x - x_i\|^2}{2\sigma^2}\right) \quad (i = 1, 2, \ldots, N)$

where $\sigma = h$ plays the role of the width.
Multivariate Gaussian Distribution (cont'd)

The normalized RBF network is then:

$F(x) = \dfrac{\sum_{i=1}^{N} w_i \exp\!\left(-\dfrac{\|x - x_i\|^2}{2\sigma^2}\right)}{\sum_{i=1}^{N} \exp\!\left(-\dfrac{\|x - x_i\|^2}{2\sigma^2}\right)} \quad (7)$

The centers of the RBFs coincide with the data points $\{x_i\}_{i=1}^{N}$.
RBFN vs MLP
RBFN:
- Single hidden layer
- Nonlinear hidden layer and linear output layer
- Argument of hidden units: Euclidean norm
- Universal approximation property
- Local approximators

MLP:
- Single or multiple hidden layers
- Nonlinear hidden layer and linear or nonlinear output layer
- Argument of hidden units: scalar product
- Universal approximation property
- Global approximators
Learning Strategies
Parameters to be determined: $w_i$, $c_i$, and $\sigma_i$ in

$F(x) = \sum_{i=1}^{K} w_i\, \varphi(x, c_i), \quad \varphi(x, c_i) = \exp\!\left(-\dfrac{\|x - c_i\|^2}{2\sigma_i^2}\right)$

Traditional learning strategy: split the computation into
- Centers, $c_i$
- Widths, $\sigma_i$
- Weights, $w_i$
Computation of Centers
Vector quantization: the centers $c_i$ should reflect the density of the training patterns $x_i$. Options include:
- Random selection from the training set
- Competitive learning
- Frequency-sensitive learning
- Kohonen learning

This phase uses only the input information ($x_i$), not the output ($d_i$).
K-Means Clustering
$k(x)$ = index of the best-matching (winning) center:

$k(x) = \arg\min_k \|x(n) - t_k(n)\|, \quad k = 1, 2, \ldots, M$

where $t_k(n)$ is the location of the $k$th center and $M$ is the number of clusters.

Update rule:

$t_k(n+1) = \begin{cases} t_k(n) + \eta\,[x(n) - t_k(n)], & k = k(x(n)) \\ t_k(n), & \text{otherwise} \end{cases}$
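A minimal sketch of this online update, assuming NumPy; the learning rate $\eta$, number of epochs, and toy data are illustrative choices:

```python
import numpy as np

def kmeans_online(X, M, eta=0.05, epochs=20, seed=0):
    """Online k-means: move only the winning center t_k toward each
    presented pattern x(n); the other centers stay unchanged."""
    rng = np.random.default_rng(seed)
    t = X[rng.choice(len(X), size=M, replace=False)].copy()  # initial centers
    for _ in range(epochs):
        for x in X[rng.permutation(len(X))]:
            k = np.argmin(np.linalg.norm(x - t, axis=1))     # winning neuron
            t[k] += eta * (x - t[k])                         # move the winner
    return t

X = np.random.default_rng(4).standard_normal((200, 2))
centers = kmeans_online(X, M=5)
```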
Computation of Widths
- The universal approximation property holds even with identical widths
- In practice (limited training patterns), variable widths $\sigma_i$ are used
- One approach: use local clusters and select $\sigma_i$ according to the standard deviation of each cluster, as sketched below
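Continuing the k-means sketch above, one illustrative way to set each width from the spread of its cluster (the helper name and the RMS-distance proxy for the cluster standard deviation are my assumptions):

```python
import numpy as np

def widths_from_clusters(X, centers, fallback=1.0):
    """sigma_i = RMS distance of cluster i's patterns from its center,
    a simple proxy for the cluster's standard deviation."""
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    assign = dists.argmin(axis=1)              # nearest-center labels
    sigmas = np.full(len(centers), fallback)   # fallback for empty clusters
    for i in range(len(centers)):
        members = X[assign == i]
        if len(members) > 0:
            sigmas[i] = np.sqrt(((members - centers[i]) ** 2).sum(axis=1).mean())
    return sigmas
```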
Computation of Widths (cont'd)

[Figure: estimated distribution (red dotted line) vs. actual distribution (blue solid line)]
Computation of Weights (SVD)
With the centers and widths kept constant in

$F(x) = \sum_{i=1}^{K} w_i\, \varphi(x, c_i), \quad \varphi(x, c_i) = \exp\!\left(-\dfrac{\|x - c_i\|^2}{2\sigma_i^2}\right)$

the problem becomes linear! The least-squares criterion

$E(F) = \frac{1}{2}\sum_{i=1}^{N}\bigl(d_i - F(x_i)\bigr)^2$

leads to the solution

$w = (\Phi^T \Phi)^{-1} \Phi^T d = \Phi^{+} d$

In practice, use the SVD (to compute the pseudo-inverse $\Phi^{+}$).
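A minimal sketch of the linear weight solve, assuming NumPy; np.linalg.lstsq computes the SVD-based least-squares solution, and the design-matrix helper is my own naming:

```python
import numpy as np

def design_matrix(X, centers, widths):
    """Phi[j, i] = exp(-||x_j - c_i||^2 / (2 * sigma_i^2))."""
    sq = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / (2.0 * widths ** 2))

def solve_weights(X, d, centers, widths):
    """Least-squares weights w = Phi^+ d, computed via the SVD."""
    Phi = design_matrix(X, centers, widths)
    w, *_ = np.linalg.lstsq(Phi, d, rcond=None)
    return w
```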
Computation of Weights (Gradient Descent)
With instantaneous cost $E(n) = \frac{1}{2}\sum_{j=1}^{N} e_j^2(n)$:

Linear weights (output layer):

$\dfrac{\partial E(n)}{\partial w_i(n)} = \sum_{j=1}^{N} e_j(n)\, \varphi\bigl(\|x_j - c_i(n)\|\bigr)$

$w_i(n+1) = w_i(n) - \eta_1 \dfrac{\partial E(n)}{\partial w_i(n)}, \quad i = 1, 2, \ldots, M$

Positions of centers (hidden layer):

$\dfrac{\partial E(n)}{\partial c_i(n)} = 2\, w_i(n) \sum_{j=1}^{N} e_j(n)\, \varphi'\bigl(\|x_j - c_i(n)\|\bigr)\, \Sigma_i^{-1}\,[x_j - c_i(n)]$

$c_i(n+1) = c_i(n) - \eta_2 \dfrac{\partial E(n)}{\partial c_i(n)}, \quad i = 1, 2, \ldots, M$

Widths of centers (hidden layer):

$\dfrac{\partial E(n)}{\partial \Sigma_i^{-1}(n)} = -w_i(n) \sum_{j=1}^{N} e_j(n)\, \varphi'\bigl(\|x_j - c_i(n)\|\bigr)\, Q_{ji}(n)$

where $Q_{ji}(n) = [x_j - c_i(n)]\,[x_j - c_i(n)]^T$

$\Sigma_i^{-1}(n+1) = \Sigma_i^{-1}(n) - \eta_3 \dfrac{\partial E(n)}{\partial \Sigma_i^{-1}(n)}$
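A simplified sketch of one gradient step, assuming NumPy and isotropic Gaussian units (a scalar width $s_i$ per hidden unit instead of the full $\Sigma_i^{-1}$ above); the gradients are derived directly from $E = \frac{1}{2}\sum_j e_j^2$:

```python
import numpy as np

def rbf_gradient_step(X, d, w, C, s, etas=(0.01, 0.01, 0.01)):
    """One full-batch gradient-descent step on E = 1/2 * sum_j e_j^2
    for an isotropic-Gaussian RBFN (scalar width s_i per hidden unit)."""
    e1, e2, e3 = etas
    diff = X[:, None, :] - C[None, :, :]      # (N, K, p): x_j - c_i
    sq = (diff ** 2).sum(axis=2)              # (N, K): ||x_j - c_i||^2
    Phi = np.exp(-sq / (2 * s ** 2))          # (N, K): hidden activations
    e = d - Phi @ w                           # (N,): errors e_j

    coef = w[None, :] * Phi * e[:, None]      # (N, K): w_i * phi_ij * e_j
    grad_w = -Phi.T @ e                                           # dE/dw_i
    grad_C = -(coef[:, :, None] * diff).sum(axis=0) / (s ** 2)[:, None]
    grad_s = -(coef * sq).sum(axis=0) / s ** 3                    # dE/ds_i

    return w - e1 * grad_w, C - e2 * grad_C, s - e3 * grad_s
```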
Summary
- Learning is finding the surface in multidimensional space that best fits the training data.
- Approximate the function with a linear combination of radial basis functions:
  $F(x) = \sum_{i=1}^{N} w_i\, G(\|x - x_i\|)$
  where $G(\|x - x_i\|)$ is called a Green's function; it can be, e.g., a uniform or Gaussian function.
- When the number of basis functions equals the number of samples, we call it a regularization network.
- When the number of basis functions is smaller than the number of samples, it is a (generalized) radial basis function network.