

Extending the functional training approach for B-Splines

Cristiano L. Cabrita
Higher Institute for Engineering, University of Algarve, Faro, Portugal
[email protected]

António E. Ruano, Pedro M. Ferreira
CSI, IDMEC, IST and University of Algarve, Portugal
[email protected], [email protected]

László T. Kóczy
Faculty of Engineering Sciences, Széchenyi István University, Győr, Hungary
[email protected]

Abstract - When used for function approximation purposes, neural networks belong to a class of models whose parameters can be separated into linear and nonlinear, according to their influence in the model output. This concept of parameter separability can also be applied when the training problem is formulated as the minimization of the integral of the (functional) squared error, over the input domain. Using this approach, the computation of the gradient involves terms that are dependent only on the model and the input domain, and terms which are the projection of the target function on the basis functions and on their derivatives with respect to the nonlinear parameters, over the input domain. This paper extends the application of this formulation to B-splines, describing how the Levenberg-Marquardt method can be applied using this methodology. Simulation examples show that the use of the functional approach obtains important savings in computational complexity and a better approximation over the whole input domain.

Keywords—neural network training, parameter separability, functional training, Levenberg-Marquardt algorithm

I. INTRODUCTION

When used for function approximation purposes, neural networks (and neuro-fuzzy systems) belong to a class of models whose parameters can be separated into linear and nonlinear, according to their influence on the model output. This concept, for general nonlinear least-squares problems, was introduced in [1], and was proposed to the neural network community in [2]. Other authors have, more or less explicitly, exploited this concept (see, for instance, [3]). A good review on the applications of the concept of separability of parameters in general nonlinear least-squares problems can be found in [4].

The ultimate goal of modeling is to obtain a “good” approximation of the function behind the data, over the whole input domain. To accomplish this, the authors have proposed in [5] to formulate the learning problem as the minimization of the integral of the (functional) squared error over the input domain, instead of the usual sum of the squares of the errors over the data. The concept of parameter separability can also be applied to this functional approach, resulting in two types of terms: i) terms that are dependent only on the model and the input domain, and independent of the target function, and ii) terms which are dependent on the model, the input domain and the target function. The first class of terms can be computed analytically beforehand, according to the model type and topology. The second class of terms, if the target function were known and an analytical solution could be found, could also be obtained, and a complete analytical solution would be available for training. As in practice what is available is the data, the training data can be used to numerically approximate this second class of terms.

This approach was applied to Radial Basis Function networks in [6] and first presented for B-spline networks in [7]. This paper extends this last work. A brief review of the proposed methodology is presented in Section II. Section III explains how the Levenberg-Marquardt algorithm can be applied with this methodology. Didactical examples are shown in Section IV, emphasizing the savings obtained in computational complexity and the better results achieved, for the whole input domain, compared with the standard (discrete) methodology. Section V concludes the paper and points out directions for future work.

II. FUNCTIONAL METHODOLOGY

All models addressed in this paper are assumed to have their parameters separable into linear (denoted by u) and nonlinear (denoted by v), according to the dependence of the model output on the parameters:

y(X, v, u) = \Gamma(X, v)\, u \qquad (1)

In (1), y is the model's output, a vector with m elements, and X is the input matrix, with dimensions m \times n (n is the number of inputs). It is assumed that the model has n_u linear parameters and n_v nonlinear parameters. \Gamma is the matrix of the basis functions (of dimensions m \times n_u). Eq. (1) assumes that the data is, as usual, discrete. The training criterion normally employed is:

\Omega_d(X, v, u, t) = \frac{\left\| e(X, v, u, t) \right\|_2^2}{2} = \frac{\left\| t - \Gamma(X, v)\, u \right\|_2^2}{2}, \qquad (2)

where t is the target vector, e is the error vector and \left\| \cdot \right\|_2 is the Euclidean norm.

Assuming that the function generating the data is known, and denoting it by t(x), (1) and (2) can be recast as:

This work was supported by FCT, through IDMEC, under LAETA, and partially supported by the project PTDC/ENR/73345/2006 and PROTEC SFRH/BD/49438/2009. The third author thanks the European Commission for the funding through grant PERG-GA-2008-239451.


y(x, v, u) = \phi^T(x, v)\, u \qquad (3)

\Omega_f(u, v, t(x), x_{\min}, x_{MAX}) = \frac{1}{2} \int_{x_{\min}}^{x_{MAX}} e^2(x, v, u, t(x))\, dx = \frac{1}{2} \int_{x_{\min}}^{x_{MAX}} \left( t(x) - \phi^T(x, v)\, u \right)^2 dx \qquad (4)

In (3), x is now a multi-dimensional real variable, x = [x_1, \ldots, x_i, \ldots, x_n], and in (4),

\int_{x_{\min}}^{x_{MAX}} f(x, \cdot)\, dx = \int_{x_{1\min}}^{x_{1 MAX}} \cdots \int_{x_{k\min}}^{x_{k MAX}} \cdots \int_{x_{n\min}}^{x_{n MAX}} f(x_1, \ldots, x_k, \ldots, x_n, \cdot)\, dx_1 \cdots dx_k \cdots dx_n \qquad (5)

y , t and e are real functions. φ is a vector of basis functions, and not a matrix as in the discrete case. In order to identify the two cases, the subscript d will denote the discrete case and f the functional case, whenever needed.

A. Exploiting the Parameter Separability

As the model's parameters are linearly separable, the optimal value of the linear parameters with respect to (wrt) the nonlinear ones can be determined. In the discrete case,

\hat{u}_d(X, v, t) = \left( \Gamma^T \Gamma \right)^{-1} \Gamma^T t = \Gamma^{+} t, \qquad (6)

where the symbol ’+’ denotes a pseudo-inverse. In the continuous case (please see [5] for details), we have:
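As a concrete illustration, the following minimal Python sketch solves (6) for a toy problem; the basis matrix, the target and the example basis functions are placeholders chosen for the example, not the B-spline basis used later in the paper.

```python
import numpy as np

# Minimal sketch of eq. (6): optimal linear parameters for fixed nonlinear ones.
# Gamma is the m x n_u matrix of basis-function outputs and t the m-element target.
def optimal_linear_params(Gamma, t):
    # lstsq realises the pseudo-inverse solution u = Gamma^+ t robustly.
    u_hat, *_ = np.linalg.lstsq(Gamma, t, rcond=None)
    return u_hat

# toy example: 3 arbitrary basis functions evaluated on 50 samples
x = np.linspace(-1.0, 1.0, 50)
Gamma = np.column_stack([np.ones_like(x), x, x**2])
t = x**2
print(optimal_linear_params(Gamma, t))  # approximately [0, 0, 1]
```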

\hat{u}_f(v, t(x), x_{\min}, x_{MAX}) = \left[ \int_{x_{\min}}^{x_{MAX}} \phi\, \phi^T\, dx \right]^{-1} \int_{x_{\min}}^{x_{MAX}} \phi\, t\, dx \qquad (7)

Incorporating (6) and (7) in (2) and (4), respectively, new criteria can be defined. In the discrete case,

\Psi_d(X, v, t) = \frac{\left\| t - \Gamma\, \hat{u}_d \right\|_2^2}{2} = \frac{\left\| t - \Gamma\, \Gamma^{+} t \right\|_2^2}{2}, \qquad (8)

and in the functional case,

\Psi_f(v, t(x), x_{\min}, x_{MAX}) = \frac{1}{2} \int_{x_{\min}}^{x_{MAX}} \left( t(x) - \phi^T \left[ \int_{x_{\min}}^{x_{MAX}} \phi\, \phi^T dx \right]^{-1} \int_{x_{\min}}^{x_{MAX}} \phi\, t\, dx \right)^2 dx
= \frac{1}{2} \int_{x_{\min}}^{x_{MAX}} t^2(x)\, dx - \int_{x_{\min}}^{x_{MAX}} t(x)\, \phi^T(x, v)\, \hat{u}_f\, dx + \frac{1}{2} \int_{x_{\min}}^{x_{MAX}} \hat{u}_f^T\, \phi(x, v)\, \phi^T(x, v)\, \hat{u}_f\, dx \qquad (9)
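The reduced criterion (9) can be evaluated numerically once a basis and a target are fixed. The sketch below is illustrative only: the basis functions, the target t(x) = x^2 and the Gauss-Legendre rule are assumptions made for the example, not part of the method's definition.

```python
import numpy as np

# Illustrative sketch of the reduced functional criterion (9) for a univariate
# model: the linear parameters are eliminated via eq. (7) and the integrals are
# approximated with Gauss-Legendre quadrature.
def psi_f(basis, target, a=-1.0, b=1.0, n_quad=64):
    nodes, weights = np.polynomial.legendre.leggauss(n_quad)
    x = 0.5 * (b - a) * nodes + 0.5 * (b + a)
    w = 0.5 * (b - a) * weights
    Phi = basis(x)                      # n_quad x n_u matrix of phi(x)^T rows
    A = Phi.T @ (w[:, None] * Phi)      # ~ integral of phi phi^T dx
    b_vec = Phi.T @ (w * target(x))     # ~ integral of phi t dx
    u_hat = np.linalg.solve(A, b_vec)   # eq. (7)
    err = target(x) - Phi @ u_hat
    return 0.5 * np.sum(w * err**2)     # eq. (9)

# toy basis (not the B-spline basis of Section III) and target t(x) = x^2
basis = lambda x: np.column_stack([np.ones_like(x), x, np.abs(x)])
print(psi_f(basis, lambda x: x**2))
```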

B. Training Algorithms

The most common derivative-based training algorithms for neural networks and neuro-fuzzy systems are the Back-Propagation (BP) [8] and the Levenberg-Marquardt (LM) [9, 10] algorithms.

The BP update, for the functional formulation, is given by:

v[k+1] = v[k] - \eta\, g_f[k], \qquad (10)

while the LM update is obtained as the solution of:

\left( \int_{x_{\min}}^{x_{MAX}} j_f^T[k]\, j_f[k]\, dx + \lambda[k]\, I \right) s[k] = - g_f[k], \qquad (11)

In (10) and (11), g_f is the gradient of the functional criterion (9):

g_f = \frac{\partial \Psi_f}{\partial v^T}, \qquad (12)

and j_f is the Jacobian:

j_f = \frac{\partial \left[ \phi^T \hat{u}_f \right]}{\partial v^T} \qquad (13)

In [5], it is proved that the gradient (12) can be obtained as:

g_f = - \frac{\partial \left[ \int_{x_{\min}}^{x_{MAX}} t\, \phi^T dx \right]}{\partial v^T} \left[ \int_{x_{\min}}^{x_{MAX}} \phi\, \phi^T dx \right]^{-1} \int_{x_{\min}}^{x_{MAX}} \phi\, t\, dx
+ \frac{1}{2} \left\{ \left[ \int_{x_{\min}}^{x_{MAX}} t\, \phi^T dx \right] \left[ \int_{x_{\min}}^{x_{MAX}} \phi\, \phi^T dx \right]^{-1} \frac{\partial \left[ \int_{x_{\min}}^{x_{MAX}} \phi\, \phi^T dx \right]}{\partial v^T} \left[ \int_{x_{\min}}^{x_{MAX}} \phi\, \phi^T dx \right]^{-1} \int_{x_{\min}}^{x_{MAX}} \phi\, t\, dx \right\} \qquad (14)

and that the term \int_{x_{\min}}^{x_{MAX}} j_f^T[k]\, j_f[k]\, dx in (11), using Golub and Pereyra's formulation [1], can be obtained as:

\int_{x_{\min}}^{x_{MAX}} j_f^T\, j_f\, dx = \int_{x_{\min}}^{x_{MAX}} \left[ \frac{\partial \hat{u}_f^T}{\partial v}\, \phi + \frac{\partial \phi^T}{\partial v}\, \hat{u}_f \right] \left[ \hat{u}_f^T\, \frac{\partial \phi}{\partial v^T} + \phi^T\, \frac{\partial \hat{u}_f}{\partial v^T} \right] dx
= \int_{x_{\min}}^{x_{MAX}} \left( \frac{\partial \hat{u}_f^T}{\partial v}\, \phi\, \hat{u}_f^T\, \frac{\partial \phi}{\partial v^T} + \frac{\partial \hat{u}_f^T}{\partial v}\, \phi\, \phi^T\, \frac{\partial \hat{u}_f}{\partial v^T} + \frac{\partial \phi^T}{\partial v}\, \hat{u}_f\, \phi^T\, \frac{\partial \hat{u}_f}{\partial v^T} + \frac{\partial \phi^T}{\partial v}\, \hat{u}_f\, \hat{u}_f^T\, \frac{\partial \phi}{\partial v^T} \right) dx \qquad (15)

The first term in (15) can be given as:

\frac{\partial \hat{u}_f^T}{\partial v} \left[ \int_{x_{\min}}^{x_{MAX}} \phi\, \frac{\partial \phi^T}{\partial v^T}\, dx \right] \hat{u}_f \qquad (16)

The 3rd term in (15) is just the transpose of (16). The 2nd term, \int_{x_{\min}}^{x_{MAX}} \frac{\partial \hat{u}_f^T}{\partial v}\, \phi\, \phi^T\, \frac{\partial \hat{u}_f}{\partial v^T}\, dx, is given by:

\frac{\partial \hat{u}_f^T}{\partial v} \left[ \int_{x_{\min}}^{x_{MAX}} \phi\, \phi^T\, dx \right] \frac{\partial \hat{u}_f}{\partial v^T} \qquad (17)

And the 4th term is:

\hat{u}_f^T \left[ \int_{x_{\min}}^{x_{MAX}} \frac{\partial \phi}{\partial v}\, \frac{\partial \phi^T}{\partial v^T}\, dx \right] \hat{u}_f \qquad (18)

If eqs. (14) and (16)-(18) are examined, after substituting (7) for \hat{u}_f and differentiating, it can be concluded that they are composed of two classes of terms:

1) Terms that are independent of the function to approximate, obtainable analytically, off-line, for the model at hand, and evaluated on-line, during training:

\left[ \int_{x_{\min}}^{x_{MAX}} \phi\, \phi^T\, dx \right]^{-1} \qquad (19)

\left[ \int_{x_{\min}}^{x_{MAX}} \phi\, \phi^T\, dx \right]^{-1} \frac{\partial \left[ \int_{x_{\min}}^{x_{MAX}} \phi\, \phi^T\, dx \right]}{\partial v^T} \left[ \int_{x_{\min}}^{x_{MAX}} \phi\, \phi^T\, dx \right]^{-1} \qquad (20)

\left[ \int_{x_{\min}}^{x_{MAX}} \phi\, \phi^T\, dx \right]^{-1} \left[ \int_{x_{\min}}^{x_{MAX}} \frac{\partial \phi}{\partial v}\, \frac{\partial \phi^T}{\partial v^T}\, dx \right] \left[ \int_{x_{\min}}^{x_{MAX}} \phi\, \phi^T\, dx \right]^{-1} \qquad (21)

\left[ \int_{x_{\min}}^{x_{MAX}} \phi\, \phi^T\, dx \right]^{-1} \frac{\partial \left[ \int_{x_{\min}}^{x_{MAX}} \phi\, \phi^T\, dx \right]}{\partial v} \left[ \int_{x_{\min}}^{x_{MAX}} \phi\, \phi^T\, dx \right]^{-1} \left[ \int_{x_{\min}}^{x_{MAX}} \phi\, \frac{\partial \phi^T}{\partial v^T}\, dx \right] \left[ \int_{x_{\min}}^{x_{MAX}} \phi\, \phi^T\, dx \right]^{-1} \qquad (22)

\left[ \int_{x_{\min}}^{x_{MAX}} \phi\, \phi^T\, dx \right]^{-1} \left[ \int_{x_{\min}}^{x_{MAX}} \phi\, \frac{\partial \phi^T}{\partial v^T}\, dx \right] \left[ \int_{x_{\min}}^{x_{MAX}} \phi\, \phi^T\, dx \right]^{-1} \qquad (23)

\left[ \int_{x_{\min}}^{x_{MAX}} \phi\, \phi^T\, dx \right]^{-1} \frac{\partial \left[ \int_{x_{\min}}^{x_{MAX}} \phi\, \phi^T\, dx \right]}{\partial v} \left[ \int_{x_{\min}}^{x_{MAX}} \phi\, \phi^T\, dx \right]^{-1} \frac{\partial \left[ \int_{x_{\min}}^{x_{MAX}} \phi\, \phi^T\, dx \right]}{\partial v^T} \left[ \int_{x_{\min}}^{x_{MAX}} \phi\, \phi^T\, dx \right]^{-1} \qquad (24)

2) And terms that involve the function to approximate:

\int_{x_{\min}}^{x_{MAX}} \phi\, t\, dx \qquad (25)

\int_{x_{\min}}^{x_{MAX}} \frac{\partial \phi^T}{\partial v^T}\, t\, dx \qquad (26)

In a practical application the underlying function is not known (otherwise it should be used). The integrals (25) and (26) can be numerically approximated using the training data and a suitable numerical integration technique.
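As an illustration of this numerical approximation, the sketch below assumes that the training inputs are quadrature nodes with known weights (as in the examples of Section IV) and that Phi and dPhi_dv hold the basis functions and their knot derivatives evaluated at those inputs; the function name and array layouts are assumptions made for the example.

```python
import numpy as np

# Sketch of approximating the target-dependent integrals (25) and (26) from
# sampled data: the quadrature weights turn sample sums into integral estimates.
# Phi: (m, n_u) basis values; dPhi_dv: (m, n_u, n_v) knot derivatives; t: (m,).
def projection_integrals(Phi, dPhi_dv, t, weights):
    int_phi_t = Phi.T @ (weights * t)                           # approximates (25)
    int_dphi_t = np.einsum('k,kuj,k->uj', weights, dPhi_dv, t)  # approximates (26)
    return int_phi_t, int_dphi_t
```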

Besides being an elegant solution – the updates are computed with a set of terms that only depend on the model and the input domain, and another set of terms which are the projection of the function to approximate on the basis functions and on their partial derivatives, over the domain – the use of this approach reduces also the computational complexity and obtains a better approximation of the whole input domain, as it will be demonstrated in Section IV.

III. B-SPLINES

An extended description of the B-spline architecture can be found, for instance, in [11]. B-splines are functionally equivalent to Mamdani fuzzy models, under certain assumptions. By extending this equivalence, all derivative-based training algorithms (and also the functional approach proposed here) are applicable to Sugeno and Takagi-Sugeno fuzzy models. For more details, please see [12].

B-Spline neural networks belong to the class of networks denoted as lattice-based Associative Memory Networks (AMNs). In order to define a lattice in the input space, vectors of knots must be defined, one for each input axis.

The interior knots, for the ith input, are \lambda_{i,j}, j = 1, \ldots, r_i. They are arranged in such a way that:

x_{i}^{\min} < \lambda_{i,1} \le \lambda_{i,2} \le \cdots \le \lambda_{i,r_i} < x_{i}^{\max} \qquad (27)

In the context of the previous discussion, the interior knots are the nonlinear parameters, v. At each extreme of each axis, a set of k_i exterior knots must be given which satisfy:

\lambda_{i,1-k_i} \le \cdots \le \lambda_{i,0} = x_{i}^{\min}, \qquad x_{i}^{\max} = \lambda_{i,r_i+1} \le \cdots \le \lambda_{i,r_i+k_i} \qquad (28)

These exterior knots are required to generate the basis functions that are close to the boundaries. These knots are usually coincident with the extremes of the input axes, or are equidistant. The network input space is \left[ x_{1}^{\min}, x_{1}^{\max} \right] \times \cdots \times \left[ x_{n}^{\min}, x_{n}^{\max} \right], and so the exterior knots are only used for defining the basis functions at the extremes of the lattice.

The jth interval of the ith input is denoted as I_{i,j} and is defined as:

I_{i,j} = \begin{cases} \left[ \lambda_{i,j-1}, \lambda_{i,j} \right) & \text{for } j = 1, \ldots, r_i \\ \left[ \lambda_{i,j-1}, \lambda_{i,j} \right] & \text{if } j = r_i + 1 \end{cases} \qquad (29)

This way, within the range of the ith input, there are r_i + 1 intervals, which means that there are p' = \prod_{i=1}^{n} (r_i + 1) cells in an n-dimensional lattice.
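A small sketch of how one axis of such a lattice can be set up, assuming the usual choice of exterior knots coincident with the domain extremes (function and argument names are illustrative):

```python
import numpy as np

# Knot vector of one axis, eqs. (27)-(28): r interior knots inside [x_min, x_max]
# plus k exterior knots at each end, here placed at the domain extremes.
def knot_vector(x_min, x_max, interior, k):
    interior = np.asarray(interior, dtype=float)
    return np.concatenate([np.full(k, x_min), interior, np.full(k, x_max)])

print(knot_vector(-1.0, 1.0, interior=[-0.2, 0.3], k=2))
# -> [-1.  -1.  -0.2  0.3  1.   1. ]
```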

The output of the hidden layer is determined by a set of p basis functions defined on the n-dimensional lattice. The shape, size and distribution of the basis functions are characteristics of the particular AMN employed, and the support of each basis function is bounded.

In B-Spline neural networks, the order of the spline implicitly sets the size and shape of the basis functions' support. The univariate B-Spline basis function of order k has a support which is k intervals wide. Hence, each input is assigned to k basis functions.

The jth univariate basis function of order k is denoted as N_k^j(x), and it is defined by the following relationships:

N_k^j(x) = \left( \frac{x - \lambda_{j-k}}{\lambda_{j-1} - \lambda_{j-k}} \right) N_{k-1}^{j-1}(x) + \left( \frac{\lambda_j - x}{\lambda_j - \lambda_{j-k+1}} \right) N_{k-1}^{j}(x)

N_1^j(x) = \begin{cases} 1 & \text{if } x \in I_j \\ 0 & \text{otherwise} \end{cases} \qquad (30)
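The recursion (30) translates directly into code. The sketch below is a straightforward, unoptimised transcription, using the common convention of dropping a term whenever its knot-difference denominator vanishes (coincident exterior knots); the indexing and the toy knot vector are assumptions made for the example.

```python
# Direct transcription of the recursion (30), clarity over speed.
# knots is the full knot vector of one axis, with knots[j] playing the role of
# lambda_j; intervals are half-open, with the one ending at the domain maximum
# treated as closed, in the spirit of (29).
def N(j, k, x, knots):
    if k == 1:
        inside = knots[j - 1] <= x < knots[j] or (x == knots[j] == knots[-1])
        return 1.0 if inside else 0.0
    left_den = knots[j - 1] - knots[j - k]
    right_den = knots[j] - knots[j - k + 1]
    left = (x - knots[j - k]) / left_den * N(j - 1, k - 1, x, knots) if left_den > 0 else 0.0
    right = (knots[j] - x) / right_den * N(j, k - 1, x, knots) if right_den > 0 else 0.0
    return left + right

knots = [-1.0, -1.0, 0.0, 1.0, 1.0]               # order 2, one interior knot at 0
print([N(j, 2, 0.25, knots) for j in (2, 3, 4)])  # -> [0.0, 0.75, 0.25]
```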

Multivariate basis functions are formed by taking the tensor product of the univariate basis functions. Therefore, each multivariate basis function is formed from the product of n univariate basis functions, one from each input, and every possible combination of univariate basis functions is taken:

N_{\mathbf{k}}^{j}(x) = \prod_{i=1}^{n} N_{k_i}^{j_i}(x_i) \qquad (31)

The number of basis functions of order k_i defined on an axis with r_i interior knots is r_i + k_i. Therefore, the total number of basis functions for a multivariate B-spline is

p = \prod_{i=1}^{n} (r_i + k_i). \qquad (32)

The output of an AMN is a linear combination of the outputs of the basis functions:

y = \sum_{i=1}^{p} \phi_i\, u_i, \qquad (33)

where \phi_i = N_{\mathbf{k}}^{i}(x), i = 1, \ldots, p.
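A minimal sketch of (31)-(33), assuming a helper phi_axis(x_i) that returns the univariate basis vector of one axis (for instance built from the recursion above); the Kronecker product enumerates every combination of univariate factors.

```python
import numpy as np

# Sketch of eqs. (31)-(33): the multivariate basis vector is the tensor
# (Kronecker) product of the univariate basis vectors, and the network output
# is its inner product with the linear weights u.
def model_output(x, u, phi_axis_list):
    phi = np.array([1.0])
    for xi, phi_axis in zip(x, phi_axis_list):
        phi = np.kron(phi, phi_axis(xi))   # eq. (31), every combination of factors
    return phi @ u                          # eq. (33)
```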

A. Computation of Terms

In a previous work [7], it was shown how to apply the functional BP algorithm to B-splines. The computation of terms (19), (20), and the derivatives of the basis functions, needed to compute (26), are therein detailed and will not be shown here. The present paper deals with the application of the LM method and, consequently, the derivation of eqs. (21)-(24) will be detailed below, for constant (k=1) and triangular (k=2) splines. Although their derivation for quadratic and cubic splines is straightforward, the expressions will not be shown, due to lack of space.

1) Computation of (21)

Knowing how to compute (19), the availability of (21) requires the computation of \int_{x_{\min}}^{x_{MAX}} \frac{\partial \phi}{\partial v}\, \frac{\partial \phi^T}{\partial v^T}\, dx. Denote \Phi = \int_{x_{\min}}^{x_{MAX}} \phi\, \phi^T\, dx and \Phi^{(i)} = \int_{x_{i}^{\min}}^{x_{i}^{\max}} \phi^{(i)}\, \phi^{(i)T}\, dx_i, i.e., the integral for the ith dimension, computed with the univariate basis functions for this dimension. Denote \Phi' = \int_{x_{\min}}^{x_{MAX}} \frac{\partial \phi}{\partial v}\, \frac{\partial \phi^T}{\partial v^T}\, dx.
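For illustration, a univariate block \Phi^{(i)} can also be obtained numerically; the sketch below uses a Gauss-Legendre rule instead of the closed-form expressions derived next, and assumes a helper basis(x) returning the univariate basis vector of the ith axis.

```python
import numpy as np

# Sketch: Phi^(i) = integral of phi^(i) phi^(i)T dx_i, approximated by quadrature.
def gram_matrix(basis, a=-1.0, b=1.0, n_quad=32):
    nodes, weights = np.polynomial.legendre.leggauss(n_quad)
    x = 0.5 * (b - a) * nodes + 0.5 * (b + a)
    w = 0.5 * (b - a) * weights
    B = np.array([basis(xi) for xi in x])      # n_quad x (r_i + k_i)
    return B.T @ (w[:, None] * B)
```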

a) Univariate case

Constant splines (k=1)

\Phi' = 0 \qquad (34)

Triangular splines (k=2)

Denote \left[ \Phi' \right]_{ilmj} = \int_{x^{\min}}^{x^{MAX}} \frac{\partial \varphi_l}{\partial \lambda_i}\, \frac{\partial \varphi_m}{\partial \lambda_j}\, dx.

The resulting elements satisfy the symmetries \left[ \Phi' \right]_{ilmj} = \left[ \Phi' \right]_{jmli} = \left[ \Phi' \right]_{imlj} = \left[ \Phi' \right]_{jlmi}; they are nonzero only when the knot indices i, j and the basis-function indices l, m refer to neighbouring knots and basis functions, in which case they are given by closed-form piecewise expressions in the interior-knot differences \lambda_{i,m+1} - \lambda_{i,m} (with coefficients \pm 1/3 and \pm 1/6 and ratios of adjacent knot differences), and they are zero otherwise. \qquad (35)

b) Multivariate case

Assume that v^{(i)} = \left\{ \lambda_{i,1}, \ldots, \lambda_{i,r_i} \right\} are the interior knots for the ith dimension, and that \Phi'^{(l)} = \int_{x_{l}^{\min}}^{x_{l}^{\max}} \frac{\partial \phi^{(l)}}{\partial v^{(l)}}\, \frac{\partial \phi^{(l)T}}{\partial v^{(l)T}}\, dx_l. If \otimes denotes the tensor product operation, then

\int_{x_{\min}}^{x_{MAX}} \frac{\partial \phi^T}{\partial v^{(l)}}\, \frac{\partial \phi}{\partial v^{(l)T}}\, dx = \Phi^{(1)} \otimes \cdots \otimes \Phi^{(l-1)} \otimes \Phi'^{(l)} \otimes \Phi^{(l+1)} \otimes \cdots \otimes \Phi^{(n)} \qquad (36)
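The assembly in (36) (and the same pattern reused in (39) below) is a plain Kronecker product of per-axis matrices in which one factor is replaced by its "primed" counterpart. A sketch, assuming blocks holds the matrices \Phi^{(1)}, ..., \Phi^{(n)} and block_prime the matrix slice that replaces the lth factor:

```python
import numpy as np

# Kronecker assembly of a multivariate term from univariate blocks, with the
# lth factor swapped for its derivative counterpart.
def assemble(blocks, l, block_prime):
    factors = list(blocks)
    factors[l] = block_prime
    out = np.array([[1.0]])
    for F in factors:
        out = np.kron(out, F)
    return out
```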

2) Computation of (22) - (24)

Knowing how to compute (19) and \frac{\partial \left[ \int_{x_{\min}}^{x_{MAX}} \phi\, \phi^T\, dx \right]}{\partial v^T} (please see [7]), in order to compute eqs. (22)-(24) we need to know how to obtain \int_{x_{\min}}^{x_{MAX}} \frac{\partial \left[ \phi\, \phi^T \right]}{\partial \lambda_l}\, dx = \left[ \Phi'' \right]_l.

a) Univariate case

Constant splines (k=1)

\left[ \Phi'' \right]_l = 0 \qquad (37)

Triangular splines (k=2)

\left[ \Phi''_l \right]_{ij} is nonzero only when the indices i, j and l refer to neighbouring basis functions and knots (i and j equal or differing by one, with l adjacent to them); the nonzero values are the constants \pm 1/6 and \pm 1/3, and \left[ \Phi''_l \right]_{ij} = 0 otherwise. \qquad (38)

b) Multivariate case

As every knot is associated with only one dimension, we have:

\int_{x_{\min}}^{x_{MAX}} \frac{\partial \left[ \phi\, \phi^T \right]}{\partial v^{(k)}}\, dx = \Phi^{(1)} \otimes \cdots \otimes \Phi^{(k-1)} \otimes \Phi''^{(k)} \otimes \Phi^{(k+1)} \otimes \cdots \otimes \Phi^{(n)} \qquad (39)

IV. EXAMPLES

We start by studying the computational complexity of the functional version, with a univariate example. We compare the time spent (in seconds) with the standard approach (discrete version) and with the functional version, for different sizes of the input data set. These results have been obtained with a Pentium Intel Core [email protected], running Mathematica version 7.0.0.

Subsequently, results comparing the performance between the discrete and functional versions using two bivariate examples are shown.

A. Experimental setup

In order to differentiate the analytical version from the case where (25) and (26) are approximated using sampled data, we shall denote the error criterion using the true function as \Psi_a(v), and the error criterion using a numerical integration method as \Psi_f(v). In all simulations, the learning rate \eta is set to 1 when BP is used. Training stops when the maximum number of 200 iterations is reached, or if the following set of termination criteria is satisfied:

\Psi[k-1] - \Psi[k] < \beta[k]
\left\| v[k-1] - v[k] \right\|_2 < \sqrt{\tau}\, \left( 1 + \left\| v[k] \right\|_2 \right)
\left\| g[k] \right\| \le \sqrt[3]{\tau}\, \left( 1 + \Psi[k] \right) \qquad (40)

where \beta[k] = \tau\, (1 + \Psi[k]) and \tau is a measure of the desired number of correct digits in the objective function. This is a criterion typically used in unconstrained optimization [13]. In all simulations, \tau = 10^{-8} was used.
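A direct transcription of the stopping test (40), with \tau = 10^{-8} as used in the simulations; the variable names are illustrative.

```python
import numpy as np

# Sketch of the stopping test (40); Psi_prev/Psi are successive criterion values,
# v_prev/v the parameter vectors and g the current gradient.
def converged(Psi_prev, Psi, v_prev, v, g, tau=1e-8):
    beta = tau * (1.0 + Psi)
    c1 = (Psi_prev - Psi) < beta
    c2 = np.linalg.norm(v_prev - v) < np.sqrt(tau) * (1.0 + np.linalg.norm(v))
    c3 = np.linalg.norm(g) <= np.cbrt(tau) * (1.0 + Psi)
    return c1 and c2 and c3
```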

For all problems, the input data was obtained using a Gaussian quadrature method. In order to produce illustrative plots, only examples using two nonlinear parameters will be used.

B. Univariate Problem

For this example, triangular splines (order 2) will be considered. The nonlinear parameters are v^T = [\lambda_1, \lambda_2]. There are 4 linear weights, associated with the 4 basis functions. The function to be approximated will be t(x) = x^2, i.e., we shall be approximating a 3rd order spline, with no internal knots, by a 2nd order B-spline, with 2 interior knots. The domain considered will be x \in [-1, 1]. This was the example used in [7], where a comparison of performance between the BP functional and discrete versions can be inspected.

The aim here is to compare the time per iteration achieved with the different versions of BP and LM when the training set size varies. Table 1 shows the time spent by the analytical version (i.e., when the real function was used), the functional version and the discrete version.

TABLE 1. TIME PER ITERATION SPENT OPTIMIZING 2 PARAMETERS, USING DIFFERENT TRAINING DATA SET SIZES

Algorithm    Number of samples
             8        100      500      1000
BP_ana       0.016    0.016    0.016    0.016
BP_f         0.006    0.027    0.090    0.188
BP_d         0.002    0.015    0.157    0.780
LM_ana       0.020    0.020    0.020    0.020
LM_f         0.008    0.027    0.116    0.208
LM_d         0.001    0.024    0.174    0.808

As can be seen, the analytical approaches do not depend on the training data size. The LM algorithm takes a bit more time than BP. If we compare the results for the functional and discrete versions, the latter is faster for small sample sizes, but is much slower for larger numbers of samples. The time taken by the functional version increases nearly linearly with the sample size. The same does not hold for the discrete version.

Table 2 illustrates equivalent results, but this time with a model with 4 interior knots. Similar conclusions can be drawn from this table. If the two tables are compared, the time for the functional version increases, on average, less than that of the discrete version as the number of parameters increases.

TABLE 2. TIME PER ITERATION SPENT OPTIMIZING 4 PARAMETERS, USING DIFFERENT TRAINING DATA SET SIZES

Algorithm    Number of samples
             8        100      500      1000
BP_ana       0.024    0.024    0.024    0.024
BP_f         0.010    0.028    0.184    0.322
BP_d         0.004    0.034    0.290    1.506
LM_ana       0.027    0.027    0.027    0.027
LM_f         0.019    0.052    0.197    0.393
LM_d         0.006    0.036    0.306    1.754

C. Bivariate Problems

In the first example, triangular splines (order 2) with 1 interior knot will be considered for both dimensions. The nonlinear parameters are v^T = [\lambda_{1,1}, \lambda_{2,1}]. There are 9 linear weights, associated with the tensor product between 3 basis functions from each dimension. For this example, the function to be approximated will be t_1(x) = t_1(x_1, x_2) = x_1^2 x_2^2, which means that we shall be approximating a bivariate spline, 3rd order for each dimension, with no internal knots, by a bivariate spline, 2nd order with 1 interior knot for each dimension. The domain considered will be \{x_1, x_2\} \in [-1, 1].

The input data were generated from Gaussian quadrature forming a bivariate grid with 8 samples in each dimension:

x_{1,2} = \{ 0.96029, 0.796666, 0.525532, 0.183435, -0.183435, -0.525532, -0.796666, -0.96029 \} \qquad (41)
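The values in (41) are the nodes of an 8-point Gauss-Legendre rule on [-1, 1]; a sketch of how such a grid (and the matching quadrature weights) can be generated with NumPy:

```python
import numpy as np

nodes, weights = np.polynomial.legendre.leggauss(8)
x1, x2 = np.meshgrid(nodes, nodes)       # bivariate 8 x 8 grid
print(np.round(nodes, 6))                # the eight values listed in (41), ascending
```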

The target function to be approximated is depicted in the 3D plot of Fig 1. The analytical, functional and discrete performance surfaces are shown in Figures 2 to 4, which also illustrate the evolution of 4 different trainings, starting from 4 different initial points.

Figure 1. Input-Output 3D plot for t1(x)

The analytical optimum is located at [\lambda_{1,1}, \lambda_{2,1}] = [0, 0] with \hat{\Psi}_a = 0.00438. As can be seen in Figure 2, the analytical performance surface has just one minimum, and the 4 different trainings converge to this minimum.

Figure 2. Evolution of 4 different trainings with LM (analytical version) for t1(x)

Table 3 shows some results related to the 4 analytical trainings. N denotes the final iteration, i.e., where (40) is satisfied. The next line shows the final value of the interior knots, \hat{v}_a. The third line, indicated by \hat{\Psi}_a, shows the analytical criterion (9) at the optimum; this is the criterion that is minimized. The fourth line shows the value of the functional criterion (i.e. eq. (9), but where the integrals involving the target function are approximated using the training data), evaluated for \hat{v}_a. The last line shows the discrete criterion (8), again evaluated for \hat{v}_a.

TABLE 3: TRAININGS WITH LM (ANALYTICAL)

v[1]                [-0.6 0.6]      [-0.4 -0.4]      [0.4 -0.4]      [-0.4 0.4]
N                   26              25               25              25
\hat{v}_a           [-1e-4 1e-4]    [-1e-4 -1e-4]    [1e-4 -1e-4]    [-1e-4 1e-4]
\hat{\Psi}_a        4.38e-3         4.38e-3          4.38e-3         4.38e-3
\Psi_f(\hat{v}_a)   3.12e-3         3.12e-3          3.12e-3         3.12e-3
\Psi_d(\hat{v}_a)   1.48e-1         1.48e-1          1.48e-1         1.48e-1

The functional performance surface is shown in Figure 3.

Figure 3. Evolution of 4 different trainings with LM (functional version) for t1(x)

Notice that even with this simple problem the performance surface exhibits many local optima.

TABLE 4: TRAININGS WITH LM (FUNCTIONAL)

v[1]                [-0.6 0.6]       [-0.4 -0.4]      [0.4 -0.4]      [-0.4 0.4]
N                   8                26               26              26
\hat{v}_f           [-0.587 0.587]   [-1e-4 -1e-4]    [1e-4 -1e-4]    [-1e-4 1e-4]
\Psi_a(\hat{v}_f)   2.51e-2          4.38e-3          4.38e-3         4.38e-3
\hat{\Psi}_f        2.47e-2          4.61e-3          4.61e-3         4.61e-3
\Psi_d(\hat{v}_f)   3.35             1.49e-1          1.49e-1         1.49e-1

As can be seen, in the functional version 3 out of the 4 cases attain the global optimum. The same is not achieved by the discrete version, as can be seen in Figure 4 and in Table 5.

Figure 4. Evolution of 4 different trainings with LM (discrete version) for t1(x)

TABLE 5: TRAININGS WITH LM (DISCRETE)

v[1]                [-0.6 0.6]       [-0.4 -0.4]       [0.4 -0.4]       [-0.4 0.4]
N                   7                6                 6                6
\hat{v}_d           [-0.236 0.236]   [-0.236 -0.236]   [0.236 -0.236]   [-0.236 0.236]
\Psi_a(\hat{v}_d)   9.61e-3          9.61e-3           9.61e-3          9.61e-3
\Psi_f(\hat{v}_d)   1.054e-2         1.054e-2          1.054e-2         1.054e-2
\hat{\Psi}_d        2.35e-1          2.35e-1           2.35e-1          2.35e-1

The discrete trainings converge to different local minima.

Visually comparing the functional and discrete performance surfaces, it is apparent that the former is smoother than the latter.

As a final example, we maintained the previous model type but the function to be approximated is now:

t_2(x_1, x_2) = e^{-10\left[ (x_1 - 0.7)^2 + (x_2 - 0.7)^2 \right]} + e^{-5\left[ (x_1 + 0.7)^2 + (x_2 + 0.7)^2 \right]} \qquad (42)

We shall be approximating an RBF with 2 neurons by a bivariate spline. The domain considered will be \{x_1, x_2\} \in [-1, 1]. To obtain the training data, a Gaussian quadrature method was once again used, with the 8 input samples (41) in each dimension.

Figure 5. Input-Output 3D plot for t2(x)

The analytical performance surface is shown in Figure 6, together with 4 different training evolutions. The global optimum is located at [\lambda_{1,1}, \lambda_{2,1}] = [0.079, 0.079] with a value of \hat{\Psi}_a = 0.0303.

Figure 6. Evolution of 4 different trainings with LM (analytical version) for t2(x)

TABLE 6: TRAININGS WITH LM (ANALYTICAL)

v[1]                [-0.6 0.6]      [-0.4 -0.4]     [0.4 -0.4]      [-0.4 0.4]
N                   13              11              10              10
\hat{v}_a           [0.079 0.079]   [0.079 0.079]   [0.079 0.079]   [0.079 0.079]
\hat{\Psi}_a        3.03e-2         3.03e-2         3.03e-2         3.03e-2
\Psi_f(\hat{v}_a)   2.98e-2         2.98e-2         2.98e-2         2.98e-2
\Psi_d(\hat{v}_a)   9.80e-1         9.80e-1         9.80e-1         9.80e-1

The functional performance surface is shown below. The functional version fails to converge to the global optimum in two tries. This can be explained by the shape of the graph in this region, which resembles a saddle, leading the estimated parameters into saddle points, i.e. [\lambda_{1,1}, \lambda_{2,1}] = [-0.184, 0.065] \vee [\lambda_{1,1}, \lambda_{2,1}] = [0.065, -0.184]. Note that the starting point located furthest from the global optimum, though requiring more iterations, converges smoothly, being able to skip the saddle points on its way towards the global optimum.

Figure 7. Evolution of 4 different trainings with LM (functional version) for t2(x)

TABLE 7: TRAININGS WITH LM (FUNCTIONAL)

v[1]                [-0.6 0.6]      [-0.4 -0.4]     [0.4 -0.4]       [-0.4 0.4]
N                   21              10              11               11
\hat{v}_f           [0.069 0.069]   [0.069 0.069]   [0.069 -0.184]   [-0.184 0.069]
\Psi_a(\hat{v}_f)   3.03e-2         3.03e-2         3.35e-2          3.35e-2
\hat{\Psi}_f        2.95e-2         2.95e-2         3.36e-3          3.36e-3
\Psi_d(\hat{v}_f)   9.84e-1         9.84e-1         1.15             1.15

The next figure shows the performance of the discrete version.

Figure 8. Evolution of 4 different trainings with LM (discrete version) for t2(x)

TABLE 8: TRAININGS WITH LM (DISCRETE)

v[1]                [-0.6 0.6]       [-0.4 -0.4]     [0.4 -0.4]      [-0.4 0.4]
N                   200              200             200             200
\hat{v}_d           [-0.803 0.183]   [0.183 0.183]   [0.175 0.183]   [0.183 0.175]
\Psi_a(\hat{v}_d)   4.81e-2          3.96e-2         3.96e-2         3.9e-2
\Psi_f(\hat{v}_d)   4.67e-2          3.83e-2         3.83e-2         3.83e-2
\hat{\Psi}_d        7.78e-1          6.90e-1         6.90e-1         6.90e-1

The discrete version fails to converge to the minima within the maximum number of iterations. The final points, for all cases, are far from the global analytical optimum.

V. CONCLUSIONS

In this paper it was shown how to apply the functional LM algorithm to B-splines. It was also demonstrated that the use of the functional approach achieves important savings in computational time, if the training data is large, and, in most of the cases, obtains a better approximation over the whole input domain than the standard, discrete training algorithms. This is due to the fact that the functional performance surface is closer to the analytical performance surface than the discrete one.

Future work will apply this methodology to neuro-fuzzy systems, profiting from the functional equivalence between the two model types. Integration techniques other than Gaussian quadrature will also be investigated.

REFERENCES

[1] G. H. Golub and V. Pereyra, "Differentiation of pseudo-inverses and nonlinear least-squares problems whose variables separate," SIAM Journal on Numerical Analysis, vol. 10, pp. 413-432, 1973.
[2] A. E. B. Ruano, D. I. Jones, and P. J. Fleming, "A New Formulation of the Learning Problem for a Neural Network Controller," in 30th IEEE Conference on Decision and Control, Brighton, UK, 1991, pp. 865-866.
[3] J. Sjoberg and M. Viberg, "Separable non-linear least-squares minimization - Possible improvements for neural net fitting," Neural Networks for Signal Processing VII, pp. 345-354, 1997.
[4] G. Golub and V. Pereyra, "Separable nonlinear least squares: the variable projection method and its applications," Inverse Problems, vol. 19, pp. R1-R26, 2003.
[5] A. E. Ruano, C. L. Cabrita, and P. M. Ferreira, "Towards a More Analytical Training of Neural Networks and Neuro-Fuzzy Systems," in IEEE 7th International Symposium on Intelligent Signal Processing (WISP 2011), Floriana, Malta, 2011.
[6] C. L. Cabrita, A. E. Ruano, and P. M. Ferreira, "Exploiting the functional training approach in Radial Basis Function Networks," in IEEE 7th International Symposium on Intelligent Signal Processing (WISP 2011), Floriana, Malta, 2011.
[7] A. E. Ruano, C. L. Cabrita, P. M. Ferreira, and L. T. Kóczy, "Exploiting the functional training approach in B-Splines," in 1st IFAC Conference on Embedded Systems, Computational Intelligence and Telematics in Control, Wurzburg, Germany, 2012, in press.
[8] P. Werbos, "Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences," PhD thesis, Appl. Math., Harvard University, 1974.
[9] K. Levenberg, "A method for the solution of certain problems in least squares," Quarterly of Applied Mathematics, vol. 2, pp. 164-168, 1944.
[10] D. Marquardt, "An algorithm for least-squares estimation of nonlinear parameters," SIAM Journal of Applied Mathematics, vol. 11, pp. 431-441, 1963.
[11] M. Brown and C. Harris, Neurofuzzy Adaptive Modelling and Control. Prentice-Hall, 1994.
[12] A. E. Ruano, C. Cabrita, J. V. Oliveira, and L. T. Koczy, "Supervised training algorithms for B-Spline neural networks and neuro-fuzzy systems," International Journal of Systems Science, vol. 33, pp. 689-711, Jun 2002.
[13] P. E. Gill, W. Murray, and M. H. Wright, Practical Optimization. Academic Press, 1981.