Neural Networks for Solving Systems of Linear Equations


Page 1: Neural Networks for Solving Systems of Linear Equations

Artificial Neural Networks (Spring 2007)

Neural Networks for Solving Systems of Linear Equations

Seyed Jalal Kazemitabar, Reza Sadraei

Instructor: Dr. Saeed Bagheri
Artificial Neural Networks Course (Spring 2007)

Page 2: Neural Networks for Solving Systems of Linear Equations


Outline

- Historical Introduction
- Problem Formulation
- Standard Least Squares Solution
- General ANN Solution
- Minimax Solution
- Least Absolute Value Solution
- Conclusion


Page 4: Neural Networks for Solving Systems of Linear Equations


History

1970s: Kohonen solved optimization problems using neural networks.

1980s: Hopfield used a Lyapunov function (energy function) to prove the convergence of iterative methods in optimization problems.

Mapping: differential equations ↔ neural networks

Page 5: Neural Networks for Solving Systems of Linear Equations


History

Many problems in science and engineering involve solving a large system of linear equations:

Machine learning, physics, image processing, statistics, …

In many applications an on-line solution of a set of linear equations is desired.

Page 6: Neural Networks for Solving Systems of Linear Equations


History

1940s: Kaczmarz introduced an iterative method for solving linear equations.

1950s-80s: Various methods based on Kaczmarz's were proposed in different fields, e.g. the conjugate gradient method.

Still, there was no good method for the on-line solution of large systems.

Page 7: Neural Networks for Solving Systems of Linear Equations


1990: Andrzej Cichocki, a mathematician who received his PhD in electrical engineering, proposed a neural network for solving systems of linear equations in real time.

Page 8: Neural Networks for Solving Systems of Linear Equations


Outline

- Historical Introduction
- Problem Formulation
- Standard Least Squares Solution
- General ANN Solution
- Minimax Solution
- Least Absolute Value Solution
- Conclusion

Page 9: Neural Networks for Solving Systems of Linear Equations


Problem Formulation

Linear parameter estimation model (linear equation):

$$Ax = b + r = b_{\text{true}}$$

where

- $A = [a_{ij}] \in \mathbb{R}^{m \times n}$: model matrix
- $x = [x_1, x_2, \ldots, x_n]^T \in \mathbb{R}^n$: unknown vector of the system parameters to be estimated
- $b \in \mathbb{R}^m$: vector of observations
- $r \in \mathbb{R}^m$: unknown measurement errors
- $b_{\text{true}} \in \mathbb{R}^m$: vector of true values (usually unknown)

Written out in full:

$$\begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} = \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_m \end{bmatrix} + \begin{bmatrix} r_1 \\ r_2 \\ \vdots \\ r_m \end{bmatrix} = \begin{bmatrix} b_1^{\text{true}} \\ b_2^{\text{true}} \\ \vdots \\ b_m^{\text{true}} \end{bmatrix}$$

Page 10: Neural Networks for Solving Systems of Linear Equations


Types of Equations

A set of linear equations is said to be overdetermined if m > n.

Usually inconsistent due to noise and errors, e.g. linear parameter estimation problems arising in signal processing, biology, medicine, and automatic control.

A set of linear equations is said to be underdetermined if m < n (due to the lack of information).

Arises in inverse and extrapolation problems. Involves far fewer problems than the overdetermined case.

$$A = [a_{ij}] \in \mathbb{R}^{m \times n}, \qquad Ax = b + r = b_{\text{true}}$$

Page 11: Neural Networks for Solving Systems of Linear Equations


Mathematical Solutions

Why not use $x = A^{-1}b$? It is not applicable, since most of the time m ≠ n, so A is not invertible.

What if we use the least-squares error method?

$$y = (Ax - b)^T(Ax - b), \qquad y' = A^T(Ax - b) = 0, \qquad A^TAx = A^Tb, \qquad x = (A^TA)^{-1}A^Tb$$

Inverting $A^TA$ is considered too time-consuming for large A in real-time systems.
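As a quick numerical sanity check (our own sketch, not part of the original slides; the data here is made up), the closed-form solution can be compared against a library solver:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 5))   # overdetermined: m = 100 > n = 5
b = rng.standard_normal(100)

# Normal-equations solution x = (A^T A)^{-1} A^T b; inverting/factoring
# A^T A is exactly the step real-time systems try to avoid.
x_normal = np.linalg.solve(A.T @ A, A.T @ b)

# The numerically preferred library routine finds the same minimizer.
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)
assert np.allclose(x_normal, x_lstsq)
```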

Page 12: Neural Networks for Solving Systems of Linear Equations


Outline

- Historical Introduction
- Problem Formulation
- Standard Least Squares Solution
- General ANN Solution
- Minimax Solution
- Least Absolute Value Solution
- Conclusion

Page 13: Neural Networks for Solving Systems of Linear Equations


Least Squares Error Function

Find the vector $x^* \in \mathbb{R}^n$ that minimizes the least-squares function

$$E(x) = \frac{1}{2}(Ax - b)^T(Ax - b) = \frac{1}{2}\sum_{i=1}^{m} r_i^2(x)$$

where

$$r_i(x) = A_i x - b_i = \sum_{j=1}^{n} a_{ij}x_j - b_i$$

represents a component of the residual vector

$$r(x) = [r_1(x), r_2(x), \ldots, r_m(x)]^T = Ax - b$$

Page 14: Neural Networks for Solving Systems of Linear Equations


Gradient Descent Approach

Basic idea: compute a trajectory $x(t)$, starting at the initial point $x(0)$, that has the solution $x^*$ as a limit point (for $t \to \infty$).

General gradient approach for the minimization of a function:

$$\frac{dx}{dt} = -\mu \nabla E(x)$$

$\mu$ is chosen in a way that ensures the stability of the differential equations and an appropriate convergence speed. Componentwise, with a learning-rate matrix $[\mu_{jp}]$:

$$\begin{bmatrix} \dfrac{dx_1}{dt} \\ \dfrac{dx_2}{dt} \\ \vdots \\ \dfrac{dx_n}{dt} \end{bmatrix} = -\begin{bmatrix} \mu_{11} & \mu_{12} & \cdots & \mu_{1n} \\ \mu_{21} & \mu_{22} & \cdots & \mu_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ \mu_{n1} & \mu_{n2} & \cdots & \mu_{nn} \end{bmatrix} \begin{bmatrix} \dfrac{\partial E}{\partial x_1} \\ \dfrac{\partial E}{\partial x_2} \\ \vdots \\ \dfrac{\partial E}{\partial x_n} \end{bmatrix}$$

Page 15: Neural Networks for Solving Systems of Linear Equations


Solving LE Using Least Squares Criterion

Gradient of the energy function:

$$\nabla E = \left[\frac{\partial E}{\partial x_1}, \frac{\partial E}{\partial x_2}, \ldots, \frac{\partial E}{\partial x_n}\right]^T = A^T(Ax - b)$$

So

$$\frac{dx}{dt} = -\mu A^T(Ax - b)$$

Scalar representation:

$$\frac{dx_j}{dt} = -\sum_{p=1}^{n}\mu_{jp}\sum_{i=1}^{m}a_{ip}\left(\sum_{k=1}^{n}a_{ik}x_k - b_i\right), \qquad x_j(0) = x_j^{(0)}, \quad j = 1, 2, \ldots, n$$
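A minimal simulation sketch of these dynamics (ours; it assumes a scalar learning rate mu in place of the matrix $[\mu_{jp}]$, and uses forward-Euler integration as a stand-in for the analog circuit):

```python
import numpy as np

def ls_gradient_flow(A, b, mu=1.0, dt=1e-3, steps=10000):
    """Integrate dx/dt = -mu * A^T (A x - b) by forward Euler."""
    x = np.zeros(A.shape[1])                 # x(0) = 0
    for _ in range(steps):
        x -= dt * mu * (A.T @ (A @ x - b))   # one Euler step of the flow
    return x

A = np.array([[1., 0.], [1., 1.], [1., 2.]])
b = np.array([0., 1., 1.])
print(ls_gradient_flow(A, b))                # -> least-squares solution [1/6, 1/2]
```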

Page 16: Neural Networks for Solving Systems of Linear Equations


$$\frac{dx_j}{dt} = -\sum_{p=1}^{n}\mu_{jp}\sum_{i=1}^{m}a_{ip}\left(\sum_{k=1}^{n}a_{ik}x_k - b_i\right)$$

Page 17: Neural Networks for Solving Systems of Linear Equations


ANN With Identity Activation Function

Page 18: Neural Networks for Solving Systems of Linear Equations


Outline

- Historical Introduction
- Problem Formulation
- Standard Least Squares Solution
- General ANN Solution
- Minimax Solution
- Least Absolute Value Solution
- Conclusion

Page 19: Neural Networks for Solving Systems of Linear Equations


General ANN Solution

The key step in designing an algorithm for neural networks is to construct an appropriate computational energy function (Lyapunov function) $E(x)$.

The lowest energy state will correspond to the desired solution $x^*$.

By differentiation, the energy-function minimization problem is transformed into a set of ordinary differential equations.

Page 20: Neural Networks for Solving Systems of Linear Equations


General ANN Solution

In general, the optimization problem can be formulated as: find the vector $x^* \in \mathbb{R}^n$ that minimizes the energy function

$$E(x) = \sum_{i=1}^{m}\sigma(A_i x - b_i) = \sum_{i=1}^{m}\sigma(r_i(x))$$

$\sigma(r_i(x))$ is called the weighting function. The derivative of the weighting function is called the activation function:

$$g(r_i) = \frac{\partial \sigma(r_i)}{\partial r_i} = \frac{\partial E}{\partial r_i}$$

Page 21: Neural Networks for Solving Systems of Linear Equations


General ANN Solution

Gradient descent approach:

$$\frac{dx}{dt} = -\mu \nabla E(x)$$

with the same learning-rate matrix $[\mu_{jp}]$ as before. The minimization of the energy function leads to the set of differential equations

$$\frac{dx_j}{dt} = -\sum_{p=1}^{n}\mu_{jp}\frac{\partial E}{\partial x_p} = -\sum_{p=1}^{n}\mu_{jp}\sum_{i=1}^{m}\frac{\partial E}{\partial r_i}\frac{\partial r_i}{\partial x_p} = -\sum_{p=1}^{n}\mu_{jp}\sum_{i=1}^{m}a_{ip}\,g\!\left(\sum_{k=1}^{n}a_{ik}x_k - b_i\right)$$
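The same dynamics in code (a sketch under the same assumptions as before: scalar mu, forward-Euler integration), now with the activation g left pluggable:

```python
import numpy as np

def ann_flow(A, b, g, mu=1.0, dt=1e-3, steps=10000):
    """Integrate dx/dt = -mu * A^T g(A x - b) for an activation function g."""
    x = np.zeros(A.shape[1])
    for _ in range(steps):
        x -= dt * mu * (A.T @ g(A @ x - b))  # residuals pass through g first
    return x

# With the identity activation this reduces to the least-squares network.
identity = lambda r: r
```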

Page 22: Neural Networks for Solving Systems of Linear Equations


General ANN Architecture

$$\frac{dx_j}{dt} = -\sum_{p=1}^{n}\mu_{jp}\sum_{i=1}^{m}a_{ip}\,g\!\left(\sum_{k=1}^{n}a_{ik}x_k - b_i\right)$$

Remember that $g$ here is the activation function.

[Figure: network diagram with activation blocks $g_1, g_2, \ldots, g_m$]

Page 23: Neural Networks for Solving Systems of Linear Equations


Drawbacks of Least Square Error Criterion

Why not always use the least-squares energy function?

- It is not robust in the presence of large outliers.
- It is only optimal for a Gaussian distribution of the error.

The proper choice of criterion depends on:
- the specific application;
- the distribution of the errors in the measurement vector b:
  Gaussian dist.* → least-squares criterion; uniform dist. → Chebyshev-norm criterion.

*However, the assumption that the set of measurements or observations has a Gaussian error distribution is frequently unrealistic due to different sources of errors such as instrument errors, modeling errors, sampling errors, and human errors.

Page 24: Neural Networks for Solving Systems of Linear Equations


Special Energy Functions

Huber's function:

$$\rho_H(e) = \begin{cases} \dfrac{e^2}{2} & |e| \le \beta \\[4pt] \beta|e| - \dfrac{\beta^2}{2} & |e| > \beta \end{cases}$$

[Figures: weighting function and activation function]

Page 25: Neural Networks for Solving Systems of Linear Equations


Special Energy Functions

Talwar's function:

$$\rho_T(e) = \begin{cases} \dfrac{e^2}{2} & |e| \le \beta \\[4pt] \dfrac{\beta^2}{2} & |e| > \beta \end{cases}$$

This function has a direct implementation.

[Figures: weighting function and activation function]

Page 26: Neural Networks for Solving Systems of Linear Equations


Special Energy Functions

Logistic function:

$$\rho_L(e) = \beta^2 \ln\!\left(\cosh\!\left(\frac{e}{\beta}\right)\right)$$

The iteratively reweighted method uses this activation function.

[Figures: weighting function and activation function]
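For reference, a sketch (ours, not from the slides) of the activation functions $g(e) = d\rho/de$ implied by the three weighting functions above; beta is the robustness threshold, and any of these can be passed to the ann_flow sketch shown earlier:

```python
import numpy as np

def g_huber(e, beta=1.0):
    return np.clip(e, -beta, beta)              # linear inside, saturates at beta

def g_talwar(e, beta=1.0):
    return np.where(np.abs(e) <= beta, e, 0.0)  # outliers contribute nothing

def g_logistic(e, beta=1.0):
    return beta * np.tanh(e / beta)             # smooth saturation: d/de of beta^2 ln cosh(e/beta)
```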

Page 27: Neural Networks for Solving Systems of Linear Equations


Special Energy Functions

$L_p$-normed function:

$$E_p(x) = \frac{1}{p}\sum_{i=1}^{m}|r_i|^p$$

[Figure: activation function]

Page 28: Neural Networks for Solving Systems of Linear Equations


Lp-Norm Energy Functions

A well-known criterion is the $L_1$-norm energy function

$$E_1(x) = \sum_{i=1}^{m}|r_i(x)|$$

[Figures: weighting function and activation function]

Page 29: Neural Networks for Solving Systems of Linear Equations


Special Energy Functions

Another well-known criterion is the $L_\infty$-norm (Chebyshev) criterion, which can be formulated as the minimax problem:

$$\min_{x \in \mathbb{R}^n}\left\{\max_{1 \le i \le m}|r_i(x)|\right\}$$

This criterion is optimal for a uniform distribution of the error.

Page 30: Neural Networks for Solving Systems of Linear Equations


Outline

- Historical Introduction
- Problem Formulation
- Standard Least Squares Solution
- General ANN Solution
- Minimax Solution
- Least Absolute Value Solution
- Conclusion

Page 31: Neural Networks for Solving Systems of Linear Equations


Minimax (L∞-Norm) Criterion

For the case p = ∞ of the $L_p$-norm problem, the activation function $g[r_i(x)]$ cannot be explicitly expressed by $|r_i(x)|^{p-1}$.

The error function can instead be defined as

$$E_\infty(x) = \max_{1 \le i \le m}\{|r_i(x)|\}$$

resulting in the following activation function:

$$g[r_i(x)] = \begin{cases} \operatorname{sign}[r_i(x)] & \text{if } |r_i(x)| = \max_{1 \le k \le m}\{|r_k(x)|\} \\ 0 & \text{otherwise} \end{cases}$$
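In code, this activation might look as follows (our sketch; only the residuals attaining the maximal magnitude produce output):

```python
import numpy as np

def g_minimax(r):
    out = np.zeros_like(r)
    winners = np.abs(r) == np.abs(r).max()   # residuals with |r_i| = max_k |r_k|
    out[winners] = np.sign(r[winners])
    return out
```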

Page 32: Neural Networks for Solving Systems of Linear Equations


Minimax (L∞-Norm) Criterion

Although straightforward, some problems arise in practical implementations of the system of differential equations:

- Exact realization of the signum functions is rather difficult (electrically).
- $E_\infty$ has a derivative discontinuity at any x where, for some i ≠ k,

$$|r_i(x)| = |r_k(x)| = E_\infty(x)$$

*This is often responsible for various anomalous results (e.g. hysteresis phenomena).

Page 33: Neural Networks for Solving Systems of Linear Equations


Transforming the problem to an equivalent one

Rather than directly implementing the proposed system, we transform the minimax problem

$$\min_{x \in \mathbb{R}^n}\left\{\max_{1 \le i \le m}|r_i(x)|\right\}$$

into an equivalent one:

Minimize $\varepsilon$ subject to the constraints $|r_i(x)| \le \varepsilon$, $\varepsilon \ge 0$.

Thus the problem can be viewed as finding the smallest non-negative value

$$\varepsilon^* = E_\infty(x^*) \ge 0$$

where x* is a vector of the optimal values of the parameters.

Page 34: Neural Networks for Solving Systems of Linear Equations


New Energy Function

Applying the standard quadratic penalty function, we can consider the cost function

$$E(x, \varepsilon) = \kappa\varepsilon + \frac{\nu}{2}\sum_{i=1}^{m}\left\{\left([\varepsilon + r_i(x)]^-\right)^2 + \left([\varepsilon - r_i(x)]^-\right)^2\right\}$$

where $\nu > 0$, $\kappa > 0$ are coefficients and $[y]^- = \min\{0, y\}$.

Page 35: Neural Networks for Solving Systems of Linear Equations


New Energy Function

Applying now the gradient strategy, we obtain the associated system of differential equations

$$\frac{d\varepsilon}{dt} = -\mu_0\left(\frac{\kappa}{\nu} + \sum_{i=1}^{m}\left[(r_i(x) + \varepsilon)S_{i1} - (r_i(x) - \varepsilon)S_{i2}\right]\right)$$

$$\frac{dx_j}{dt} = -\mu_j\sum_{i=1}^{m}a_{ij}\left[(r_i(x) + \varepsilon)S_{i1} + (r_i(x) - \varepsilon)S_{i2}\right] \qquad (j = 1, 2, \ldots, n)$$

with the switching functions

$$S_{i1} = \begin{cases} 0 & \text{if } r_i(x) + \varepsilon \ge 0 \\ 1 & \text{otherwise} \end{cases} \qquad S_{i2} = \begin{cases} 0 & \text{if } \varepsilon - r_i(x) \ge 0 \\ 1 & \text{otherwise} \end{cases}$$

Page 36: Neural Networks for Solving Systems of Linear Equations


Simplifying architecture

It is interesting to note that the system of differential equations can be simplified by introducing

$$\varphi(r_i(x), \varepsilon) = \begin{cases} r_i + \varepsilon & \text{if } r_i < -\varepsilon \\ 0 & \text{if } -\varepsilon \le r_i \le \varepsilon \\ r_i - \varepsilon & \text{if } r_i > \varepsilon \end{cases}$$

This nonlinear function represents a typical dead-zone function.

Page 37: Neural Networks for Solving Systems of Linear Equations


Simplifying architecture

It is easy to check that

$$(r_i(x) + \varepsilon)S_{i1} + (r_i(x) - \varepsilon)S_{i2} = \varphi(r_i(x), \varepsilon)$$

$$(r_i(x) + \varepsilon)S_{i1} - (r_i(x) - \varepsilon)S_{i2} = -|\varphi(r_i(x), \varepsilon)|$$

Thus the system of differential equations can be simplified to the form:

$$\frac{d\varepsilon}{dt} = -\mu_0\left(\frac{\kappa}{\nu} - \sum_{i=1}^{m}|\varphi(r_i(x), \varepsilon)|\right), \qquad \varepsilon(0) = \varepsilon^{(0)}$$

$$\frac{dx_j}{dt} = -\mu_j\sum_{i=1}^{m}a_{ij}\,\varphi(r_i(x), \varepsilon), \qquad x_j(0) = x_j^{(0)} \quad (j = 1, 2, \ldots, n)$$

Page 38: Neural Networks for Solving Systems of Linear Equations


$$\frac{dx_j}{dt} = -\mu_j\sum_{i=1}^{m}a_{ij}\,\varphi(r_i(x), \varepsilon)$$

Page 39: Neural Networks for Solving Systems of Linear Equations


$$\frac{d\varepsilon}{dt} = -\mu_0\left(\frac{\kappa}{\nu} - \sum_{i=1}^{m}|\varphi(r_i(x), \varepsilon)|\right)$$
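Putting the simplified system together (our sketch following the equations above; kappa, nu, mu, mu0, dt, and the step count are illustrative choices, not values from the slides):

```python
import numpy as np

def dead_zone(r, eps):
    """phi(r_i, eps): zero inside [-eps, eps], shifted linear outside."""
    return np.sign(r) * np.maximum(np.abs(r) - eps, 0.0)

def minimax_flow(A, b, kappa=1.0, nu=100.0, mu=1.0, mu0=1.0,
                 dt=1e-4, steps=200000):
    x = np.zeros(A.shape[1])
    eps = np.abs(b).max()                    # eps(0): an upper bound on E_inf
    for _ in range(steps):
        phi = dead_zone(A @ x - b, eps)
        x -= dt * mu * (A.T @ phi)           # dx_j/dt = -mu sum_i a_ij phi_i
        eps -= dt * mu0 * (kappa / nu - np.abs(phi).sum())
        eps = max(eps, 0.0)                  # enforce eps >= 0
    return x, eps
```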

Page 40: Neural Networks for Solving Systems of Linear Equations


Outline

- Historical Introduction
- Problem Formulation
- Standard Least Squares Solution
- General ANN Solution
- Minimax Solution
- Least Absolute Value Solution
- Conclusion

Page 41: Neural Networks for Solving Systems of Linear Equations


Least Absolute Values (L1-Norm) Energy Function

Find the design vector $x^* \in \mathbb{R}^n$ that minimizes the error function

$$E_1(x) = \sum_{i=1}^{m}|r_i(x)|, \qquad \text{where} \quad r_i(x) = \sum_{j=1}^{n}a_{ij}x_j - b_i$$

Why should one choose this function, knowing that it has differentiation problems?

Page 42: Neural Networks for Solving Systems of Linear Equations


Important L1-Norm Properties

1. Least absolute value problems are equivalent to linear programming problems, and vice versa.

2. Although the energy function E1(x) is not differentiable, the terms $|r_i(x)|$ can be approximated very closely by smoothly differentiable functions.

3. For a full-rank* matrix A, there always exists a minimum L1-norm solution which passes through at least n of the m data points. The L2-norm solution does not, in general, interpolate any of the points.

These properties are not shared by the L2-norm.

*Matrix A is said to be of full rank if all its rows or columns are linearly independent.

Page 43: Neural Networks for Solving Systems of Linear Equations


Important L1-Norm Properties

Theorem: There is a minimizer $x^* \in \mathbb{R}^n$ of the energy function $E_1(x) = \sum_{i=1}^{m}|r_i(x)|$ for which the residuals $r_i(x^*) = 0$ for at least n values of i, say $i_1, i_2, \ldots, i_n$, where n denotes the rank of the matrix A.

We can say that the L1-norm solution is the median solution, while the L2-norm solution is the mean solution (see the check below).
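A quick numerical illustration of this median/mean contrast (our own, with made-up data): fitting a single constant c to observations b, the L2 energy is minimized by the mean and the L1 energy by the median:

```python
import numpy as np

b = np.array([1., 2., 3., 4., 100.])        # one gross outlier
c = np.linspace(-50., 150., 20001)          # candidate constant fits
E2 = ((c[:, None] - b) ** 2).sum(axis=1)    # L2 energy of fitting c to b
E1 = np.abs(c[:, None] - b).sum(axis=1)     # L1 energy
print(c[E2.argmin()], b.mean())             # 22.0 -> dragged by the outlier
print(c[E1.argmin()], np.median(b))         # 3.0  -> robust, hits a data point
```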

Page 44: Neural Networks for Solving Systems of Linear Equations


Least Absolute Error Implementation

The algorithm is as follows (see the sketch after this list):

1. First phase:
   - Solve the problem using the ordinary least-squares technique and compute all m residuals.
   - Select from them the n residuals which are smallest in absolute value.
2. Second phase:
   - Discarding the rest of the equations, the n equations related to the selected residuals are solved by driving their residuals to zero.

The ANN implementation is done in three layers using an inhibition control circuit.
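A software analogue of this two-phase procedure (a sketch only; ties among the smallest residuals, as in the example that follows, require the rerun logic handled by the inhibition control circuit):

```python
import numpy as np

def two_phase_l1(A, b):
    m, n = A.shape
    # Phase 1: ordinary least squares over all m equations.
    x_ls, *_ = np.linalg.lstsq(A, b, rcond=None)
    r = A @ x_ls - b
    keep = np.argsort(np.abs(r))[:n]     # n residuals smallest in absolute value
    # Phase 2: discard the rest and drive the kept residuals to zero.
    return np.linalg.solve(A[keep], b[keep])
```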

Page 45: Neural Networks for Solving Systems of Linear Equations


[Figure: ANN architecture for solving the L1-norm estimation problem, Phase #1 and Phase #2]


Page 48: Neural Networks for Solving Systems of Linear Equations


Example

Consider the matrix A and the observation vector b below. Find the solution to Ax = b using the least absolute error energy function.

$$A = \begin{bmatrix} 1 & 0 & 0 \\ 1 & 1 & 1 \\ 1 & 2 & 4 \\ 1 & 3 & 9 \\ 1 & 4 & 16 \end{bmatrix}, \qquad b = \begin{bmatrix} 1 \\ 2 \\ 1 \\ -1 \\ -10 \end{bmatrix}, \qquad Ax - b = 0$$

Page 49: Neural Networks for Solving Systems of Linear Equations


In the first phase all the switches (S1-S5) were closed, and the network was able to find the following standard least-squares solution:

$$x_I^* = \begin{bmatrix} 0.6 \\ 3.5 \\ -1.5 \end{bmatrix}, \qquad r(x_I^*) = \begin{bmatrix} -0.4 \\ 0.6 \\ 0.6 \\ -1.4 \\ 0.6 \end{bmatrix}$$

In this case it is impossible to select the two largest (in absolute value) residuals, because $r_2 = r_3 = r_5 = 0.6$. Phase one was therefore rerun with switch S4 opened, and the network then found

$$x_{II}^* = \begin{bmatrix} 0.9182 \\ 2.6409 \\ -1.3409 \end{bmatrix}, \qquad r(x_{II}^*) = \begin{bmatrix} -0.0818 \\ 0.2182 \\ -0.1636 \\ -2.2273 \\ 0.0273 \end{bmatrix}$$

Page 50: Neural Networks for Solving Systems of Linear Equations


Cichocki’s Circuit Simulation Results

In the second phase (and third run of the network) the inhibitive control network opened switch S2. So in the third run only switches S1, S3, S5 were closed, and the network found the equilibrium point:

$$x^* = \begin{bmatrix} 1 \\ 2.750 \\ -1.375 \end{bmatrix}, \qquad r(x^*) = \begin{bmatrix} 0 \\ 0.375 \\ 0 \\ -2.125 \\ 0 \end{bmatrix}$$

Page 51: Neural Networks for Solving Systems of Linear Equations


Cichocki’s Circuit Simulation Results

The residuals of n = 3 of the m = 5 equations converge to zero within 50 nanoseconds.

Page 52: Neural Networks for Solving Systems of Linear Equations


Using MATLAB, we observed that zeroing $r_1$, $r_3$, and $r_5$ results in the minimum value of

$$E_1(x) = \sum_{i=1}^{m}|r_i(x)|$$
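The same check can be reproduced in Python (our sketch of that MATLAB experiment): enumerate every way of zeroing n = 3 of the m = 5 residuals and compare the resulting E1 values:

```python
import numpy as np
from itertools import combinations

A = np.array([[1, 0, 0], [1, 1, 1], [1, 2, 4], [1, 3, 9], [1, 4, 16]], float)
b = np.array([1., 2., 1., -1., -10.])

for s in combinations(range(5), 3):
    x = np.linalg.solve(A[list(s)], b[list(s)])  # zero the residuals in s
    print(s, np.abs(A @ x - b).sum())            # E1 for this choice
# The minimum E1 = 2.5 occurs for s = (0, 2, 4), i.e. zeroing r1, r3, r5.
```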

Page 53: Neural Networks for Solving Systems of Linear Equations


Outline

- Historical Introduction
- Problem Formulation
- Standard Least Squares Solution
- General ANN Solution
- Minimax Solution
- Least Absolute Value Solution
- Conclusion

Page 54: Neural Networks for Solving Systems of Linear Equations


Conclusion

There is a great need for real-time solution of linear equations.

Cichocki's proposed ANN is different from classical ANNs.

The idea is to construct a proper energy function whose minimization yields the optimal solution to Ax = b.

'Proper function' may have different meanings in different applications.

The standard least-squares error function gives the optimal answer for a Gaussian distribution of the error.

Page 55: Neural Networks for Solving Systems of Linear Equations


Conclusion (Cont.)

The least-squares function does not behave well when there are large outliers in the observations.

Various energy functions have been proposed to solve the outlier problem (e.g. the logistic function).

Minimax yields the optimal answer for a uniform distribution of the error. It also has some implementation and mathematical problems that lead to an indirect approach to solving the problem.

The least absolute error function has some properties that distinguish it from other error functions.
