[12] R.P. Lippmann, An introduction to computing with neural nets, IEEE ASSP Magazine, pp. 4–22, 1987.
[13] J. Mandel, Balancing domain decomposition, Comm. Appl. Numer. Methods, 9 (1993), pp. 233–241.
[14] J. Mandel, Adaptive iterative solvers in finite elements, in Solving Large-Scale Problems in Mechanics, M. Papadrakakis, ed., John Wiley, 1993, pp. 65–88.
[15] B. Nour-Omid, A. Raefsky, and G. Lyzenga, Solving finite element equations on concurrent computers, in Parallel Computations and Their Impact on Mechanics, A.K. Noor, ed., ASME Press, New York, 1986, pp. 209–227.
[16] C. Peterson and J.R. Anderson, Neural networks and NP-complete optimization problems; a performance study on the graph bisection problem, Complex Systems, 2 (1988), pp. 59–89.
[17] A. Pothen, H.D. Simon, and K.-P. Liou, Partitioning sparse matrices with eigenvectors of graphs, SIAM J. Matrix Anal. Appl., 11 (1990), pp. 430–452.
[18] H.D. Simon, Partitioning of unstructured problems for parallel processing, Comput. Systems Engrg., 2 (1991), pp. 135–148.
[19] D.E. Rumelhart and D. Zipser, Feature discovery by competitive learning, Cognitive Science, 9 (1985), pp. 75–112.
[20] J. Xu, Iterative methods by space decomposition and subspace correction, SIAM Rev., 34 (1992), pp. 581–613.
[21] C. Vaughan, Structural analysis on massively parallel computers, in Proceedings of the Conference on Parallel Methods on Large Scale Structural Analysis and Physics Applications, Pergamon Press, 1991.
[22] R.D. Williams, Performance of dynamic load balancing algorithms for unstructured mesh calculations, Tech. Report C3P913, California Institute of Technology, Pasadena, June 1990.
7. Conclusions. Two novel techniques are proposed in this paper for generating substructures for domain decomposition using neural network paradigms. This study demonstrates the usefulness of neural networks for approximately solving large NP-hard optimization problems with reasonable speed on serial machines. Both techniques yield significantly better partitions than a popular greedy algorithm when only the element-to-element connectivity of the FE mesh is used. Sparse implementations are developed for fast serial execution and reduced memory requirements. Both neural net-based techniques are competitive with benchmark algorithms. Future research will concentrate on incorporating the aspect ratio requirement explicitly into the optimization process; using faster ANN algorithms, e.g., of the VFSR [11] and mean-field [16] types, that attempt to avoid local minima; automatic selection of the network parameters; and generation of submeshes respecting the physics of the problem. Generation of 3-D hexahedral finite element meshes from given 3-D node points is a similar problem and will be studied following a similar philosophy.
Acknowledgments. Test meshes were provided by Charbel Farhat, Dept. of Aerospace Engr., University of Colorado at Boulder. The performance comparison numbers were obtained using the TOP/DOMDEC software developed by Charbel Farhat and his coworkers. Horst Simon of NASA Ames Research Center provided the RSB code. We would also like to thank Ellen Applebaum for her suggestions regarding competitive learning.
REFERENCES
[1] S.V.B. Aiyer, M. Niranjan, and F. Fallside, A theoretical investigation into the performance of the Hopfield model, IEEE Trans. Neural Networks, 1 (1990), pp. 204–215.
[2] S.T. Barnard and H.D. Simon, A fast multilevel implementation of recursive spectral bisection for partitioning unstructured problems, Report RNR-92-033, NASA Ames Research Center, April 1993; Concurrency: Practice and Experience, 1994, to appear.
[3] C. Farhat, H. Simon, and S. Lanteri, TOP/DOMDEC, a software tool for mesh partitioning and parallel processing, in press.
[4] C. Farhat, J. Mandel, and F.X. Roux, Optimal convergence properties of the FETI domain decomposition method, Comput. Meth. Appl. Mech. Engrg., 115 (1994), pp. 367–388.
[5] C. Farhat and M. Lesoinne, Automatic partitioning of unstructured meshes for the parallel solution of problems in computational mechanics, Intl. J. Numer. Meth. Engr., 36 (1993), pp. 745–764.
[6] M. Fiedler, A property of eigenvectors of nonnegative symmetric matrices and its application to graph theory, Czechoslovak Math. J., 25 (1975), pp. 619–633.
[7] S. Ghosal, J. Mandel, and R. Tezaur, Automatic substructuring for domain decomposition using neural networks, in Proceedings of the IEEE Conference on Neural Networks, Orlando, June 1994, to appear.
[8] S. Grossberg, Competitive learning: from interactive activation to adaptive resonance, Cognitive Science, 11 (1987), pp. 23–63.
[9] J. Hertz, A. Krogh, and R.G. Palmer, Introduction to the Theory of Neural Computation, Addison-Wesley, Redwood City, CA, 1991.
[10] J.J. Hopfield and D.W. Tank, "Neural" computation of decisions in optimization problems, Biological Cybernetics, 52 (1985), pp. 141–152.
[11] L. Ingber, Simulated annealing: practice versus theory, Mathl. Comput. Modelling, 18 (1993), pp. 29–57.
Fig. 6. Mesh partitioning by neural networks. (a) 2-D FE mesh, ENGINE with 22,936 elements and 12,233 nodes. (b) Partitions generated by the competitive learning network. (c) Partitions generated by the Hopfield network.
Table 5
Mesh partitioning performance of neural networks for the unstructured 2-D mesh, ENGINE with 22,936 elements and 12,233 nodes.

          Competitive Learning          Hopfield Net
   N_s    N_I    Load    Time           N_I    Load    Time
    16     420   0.93     5.09           454   0.99     4.24
    32     716   0.97     6.13           787   0.99     5.55
    64    1197   0.97     7.36          1277   0.93    10.15
   128    1928   0.93    10.15          2051   0.92    15.62
Table 6
Mesh partitioning performance of recursive spectral bisection and simulated annealing-based techniques for the unstructured 2-D mesh, ENGINE with 22,936 elements and 12,233 nodes.

          RSB                           Simulated Annealing
   N_s    N_I    Load    Time           N_I    Load    Time
    16     403   1.0     17.14           425   0.97     3.71
    32     689   1.0     20.11           725   0.99     4.42
    64    1204   0.99    27.78          1161   0.85    11.52
   128    2070   0.99    31.94          1829   0.82    14.69
Fig. 5. Mesh partitioning by neural networks. (a) 3-D FE mesh, MACHINE with 3648 elements and 5336 nodes. (b) Partitions generated by the competitive learning network. (c) Partitions generated by the Hopfield network.
Table 3
Mesh partitioning performance of neural networks for the unstructured 3-D mesh, MACHINE with 3648 elements and 5336 nodes.

          Competitive Learning          Hopfield Net
   N_s    N_I    Load    Time           N_I    Load    Time
     8     722   0.95    1.67            783   0.90    1.30
    16    1107   0.91    1.78           1177   0.84    1.83
    32    1573   0.91    2.08           1683   0.82    2.59
    64    2023   0.84    2.58           2170   0.79    3.47
Table 4
Mesh partitioning performance of recursive spectral bisection and simulated annealing-based techniques for the unstructured 3-D mesh, MACHINE with 3648 elements and 5336 nodes.

          RSB                           Simulated Annealing
   N_s    N_I    Load    Time           N_I    Load    Time
     8    1234   1.0     4.01            803   0.97    2.77
    16    1720   1.0     5.59           1131   0.91    4.06
    32    2144   1.0     7.21           1694   0.96    7.88
    64    2608   1.0     8.37           2071   0.82    6.89
Fig. 4. Mesh partitioning by neural networks. (a) 3-D FE mesh, BLADE with 944 elements and 1820 nodes. (b) Partitions generated by the competitive learning network. (c) Partitions generated by the Hopfield network.
Table 1
Mesh partitioning performance of neural networks for the unstructured 3-D mesh, BLADE with 944 elements and 1820 nodes.

          Competitive Learning          Hopfield Net
   N_s    N_I    Load    Time           N_I    Load    Time
     8     360   0.89    0.35            378   0.86    0.67
    16     530   0.88    0.36            572   0.79    1.04
    32     689   0.82    0.47            784   0.84    1.36
    64     879   0.74    0.56            893   0.82    1.61
Table 2
Mesh partitioning performance of recursive spectral bisection and simulated annealing-based techniques for the unstructured 3-D mesh, BLADE with 944 elements and 1820 nodes.

          RSB                           Simulated Annealing
   N_s    N_I    Load    Time           N_I    Load    Time
     8     417   1.0     1.00            354   0.82    0.72
    16     586   1.0     1.22            512   0.86    1.24
    32     814   1.0     1.62            656   0.89    1.27
    64    1057   1.0     1.75            886   0.79    5.42
Fig. 3. Substructuring by neural networks. (a) Output of Farhat's greedy algorithm. (b) After optimization by the competitive learning network. (c) After optimization by the Hopfield network.
sparse data structures. It also ensures that the initial state of the network is away from the basins of attraction corresponding to the corners of the hypercube.

Reasonable mesh partitions can be obtained in 5–15 epochs. The learning rate parameter ω in the competitive learning network and the constants g_0, A, B, and γ in the Hopfield network are chosen empirically to generate acceptable mesh partitions. The values ω = 0.0001, g_0 = 0.15, A = 0.5, B = 0.001, and γ = 0.3 generate reasonable partitions for a wide variety of FE meshes. Automatic selection of the parameters is an open research issue.
Fig. 3(a) shows a 2-D FE mesh with 260 elements, decomposed into 3 subdomains by the greedy algorithm. The ratio of the number of interface nodes (i.e., nodes shared by more than one subdomain) to the total number of nodes in the mesh is 0.1346. Also, note that the decomposition suffers from "unfavorable" aspect ratios for all the subdomains; the aspect ratio is important for the convergence properties of the domain decomposition computations [4]. Decompositions obtained using the competitive learning and Hopfield network-based techniques are shown in Fig. 3(b) and Fig. 3(c), respectively. The interface ratio is 0.07407 in both cases, which is 45.0% lower than that obtained by the greedy algorithm, since (0.1346 − 0.07407)/0.1346 ≈ 0.450. Also, the aspect ratios of the substructures generated by both ANN paradigms are satisfactory.
Fig. 4(a) shows an unstructured 3-D mesh, BLADE with 944 elements and 1820 nodes. Submeshes obtained using the proposed neural net-based techniques are depicted in Fig. 4(b) and Fig. 4(c) for N_d = 4. Corresponding partitioning results are reported in Tables 1 and 2 for the proposed techniques, the RSB method, and the simulated annealing-based partitioning technique. The RSB and simulated annealing-based methods are run in the TOP/DOMDEC environment. N_I denotes the total number of interface nodes for a given number of subdomains, and Load represents the ratio of the average submesh size to the size of the largest submesh created by the respective partitioning algorithm. Time denotes the sum of the execution times of the greedy algorithm and the optimizing algorithm for the neural network- and simulated annealing-based methods; the execution time of the greedy algorithm is almost negligible for most problems. Fig. 5 shows the partitionings generated by the proposed techniques for the 3-D mesh, MACHINE with 3648 elements and 5336 nodes, for N_d = 8. Comparative performances are presented in Tables 3 and 4. Partitioning of a large 2-D mesh, ENGINE with 22,936 triangular elements and 12,233 nodes, is shown in Fig. 6 for N_d = 8. Tables 5 and 6 summarize the relative performance of the proposed techniques and the benchmark algorithms for this mesh. Both proposed techniques outperform the recursive spectral bisection method for 3-D meshes; the results are comparable for 2-D meshes. The proposed techniques are also competitive with simulated annealing. The essence of ANN algorithms is that they are simple in nature and exhibit fine-grain parallelism. Thus, it is expected that the neural net-based algorithms, with carefully chosen parameter values, would perform excellently, especially in a parallel environment. The results reported here on a wide variety of large FE meshes clearly establish the effectiveness of neural network paradigms for FE mesh partitioning even on serial computers.
where

    Δu_Xi(t) = −2A a_X(t) − B b_i(t) + γ c_Xi(t) + (2A − γ s(X)) V_Xi(t) + A + B N̄

and

    Δu_Xi(t) − Δu_Xi(t−1) = −2A Σ_{j∈S_X} (V_Xj(t) − V_Xj(t−1)) − B Σ_Y (V_Yi(t) − V_Yi(t−1))
        + γ Σ_{Y∈A(X)} (V_Yi(t) − V_Yi(t−1)) + (2A − γ s(X)) (V_Xi(t) − V_Xi(t−1)).   (15)
Since the initial state of the network corresponds to a reasonable decomposition, only a small fraction of the elements change their memberships in each iteration as the network evolves in time. Consequently, for a given element X, V_Xi(t) − V_Xi(t−1) remains zero for almost all i's and thus does not enter into the computational process. Therefore, for any element X the inputs-outputs u_Xi-V_Xi need to be updated for only a small number of i's (compared with |S_X| for regular updates). Also, it is found for many meshes that only the elements near the subdomain interfaces change their memberships as the interface length is minimized, and suboptimal partitions can be obtained by updating, at each iteration, the states of only those elements which are near the subdomain boundaries. Thus, with sparse data structures and the proposed state updating rule, the Hopfield network can be implemented with complexity O(N_I) per time step on a serial machine, where N_I is the number of interface nodes present in the initial decomposition generated by the greedy algorithm.
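A minimal Python sketch of this boundary-restricted updating follows. The adjacency-list mesh representation and the helper name are our own illustrative choices, not the paper's C implementation.

    # Sketch: restrict per-iteration work to elements near subdomain boundaries.
    def boundary_elements(label, adjacency):
        """Elements with at least one neighbor assigned to a different subdomain."""
        return [x for x, nbrs in enumerate(adjacency)
                if any(label[y] != label[x] for y in nbrs)]

    # Per iteration, u_Xi and V_Xi are then recomputed only for
    # x in boundary_elements(label, adjacency), so the cost per time step is
    # proportional to the interface size rather than to N * N_d.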
6. Numerical Results. Numerous experiments have been conducted with model problems as well as real-world 2-D and 3-D FE meshes to demonstrate the effectiveness of neural network principles for automatic substructuring for domain decomposition. Some of these results are reported in this section. Both neural net paradigms are implemented in C, incorporated into the TOP/DOMDEC software [3], and executed on an SGI Crimson workstation.

A two-step approach is adopted in this study to generate submeshes from FE meshes. In the first step, an initial decomposition satisfying the load-balance criterion is obtained using Farhat's greedy algorithm. Competitive learning and Hopfield networks are employed in the second step to minimize the interface length between adjacent subdomains while keeping the sizes of the subdomains nearly equal. The outputs of the greedy algorithm are used to initialize the synaptic strengths of the competitive learning network and the inputs-outputs of the Hopfield network, respectively. Let element i be allocated to subdomain j by the greedy algorithm. Then w_ij in the competitive learning network is set to c_i, such that Σ_{i∈S_j} c_i = 1, and w_ik = 0 for k ≠ j. The outputs of the Hopfield net are initialized to V_Xi = 0.5 − 0.01 rand(), if the element X is allocated to the i-th subdomain by the greedy algorithm, and V_Xj = 0 for j ≠ i; rand() is a random number uniformly distributed between 0 and 1. Such an initialization cuts down the memory requirements for the Hopfield network and helps in maintaining
O(N N_d) per time step. For partitioning of large FE meshes, N may be in the millions and N_d can vary as √N. Thus, the complexity of the Hopfield network on serial machines is very high for real-time mesh partitioning.
The properties of FE meshes are utilized to develop a sparse implementation of the Hopfield network that, in addition to reducing memory requirements, effectively lowers the computational expense. The states of the network are initialized according to the decomposition generated by the greedy algorithm. Since the greedy algorithm generates an almost acceptable partitioning for most problems, an element initially belonging to a subdomain either stays in that subdomain during the entire time evolution of the network or changes its membership only to the neighboring subdomains present in the initial decomposition. That is, if an element X belongs to subdomain i_0 in the initial decomposition, then X either remains in subdomain i_0 during the evolution of the network, or moves to a subdomain i' such that i' ∈ S_X, where S_X is the set of subdomains neighboring subdomain i_0. This is an acceptable proposition since the number of interface nodes can be minimized by local adjustment of the membership of a given element. The upper limit of |S_X| is typically 8 for 2-D square FEs and 16 for 3-D cubic FEs. Thus the state updating rule given in (11) can be written as
cubic FEs. Thus the state updating rule given in (11) can be written as,
u
Xi
(t 1) = u
Xi
(t)� 2Aa
X
�Bb
i
c
Xi
(2A� s( ))
Xi
A BN
X
and
a
X
=
j2fS
X
Xj
1 N
b
i
=
e
X
Xi X
c
Xi
=
2 (X)
i X
.
Now, a_X can be computed in Σ_X |S_X| = O(N) floating-point operations for all elements, and so can b_i for all subdomains. Given any element X, c_Xi can be evaluated in s(X)|S_X| operations for all i. Thus, a_X, b_i, and the c_Xi's can be computed with a number of floating-point operations proportional to N per iteration. The constant of proportionality evidently depends on the number of subdomains |S_X| neighboring the subdomain in which the element initially resides. Constraining the movement of an element to only neighboring subdomains in fact helps the optimization process for some problems by cutting down the search space, and it facilitates a sparse implementation of the Hopfield network for large-scale problems.
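As a sketch, the restricted quantities can be accumulated as follows in Python. The dict-per-element layout is our choice, and we read S_X as containing the element's initial subdomain together with its neighboring subdomains; V_Xi is assumed stored only for candidate subdomains i ∈ S_X.

    def sparse_quantities(V, S, adjacency, n_sub):
        """V[x]: dict {i: V_xi} kept only for candidate subdomains i in S[x]."""
        a = [sum(vx.values()) for vx in V]            # a_X, O(sum_X |S_X|) total
        b = [0.0] * n_sub                             # b_i, same total cost
        for vx in V:
            for i, v in vx.items():
                b[i] += v
        c = []                                        # c_Xi, O(s(X)|S_X|) per element
        for x, nbrs in enumerate(adjacency):
            cx = {i: 0.0 for i in S[x]}
            for y in nbrs:
                for i in cx:
                    cx[i] += V[y].get(i, 0.0)
            c.append(cx)
        return a, b, c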
A new state updating scheme is also developed, utilizing the present as well as the past states, to further reduce the computational complexity. The original state update rule of the Hopfield network can be rewritten as

    u_Xi(t+1) = u_Xi(t) + Δu_Xi(t)
              = u_Xi(t) + (Δu_Xi(t) − Δu_Xi(t−1)) + Δu_Xi(t−1),   (14)
a_j̃, where j̃ is the winning output node. Thus, the complexity of step 5 is reduced to O(N). In order to prevent overflow of the a_j's, an occasional full renormalization may be necessary, which is hardly ever needed in the partitioning process of most meshes.
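In Python, the factorized steps (12) and (13) amount to the following sketch; the variable names are ours, with W_hat and a standing for the scaled weights and the per-column scale factors.

    import numpy as np

    def update_winner(W_hat, a, j, p, omega):
        """Steps 4 and 5 in the scaled representation w_ij = a_j * w_hat_ij."""
        nz = np.flatnonzero(p)                 # sparse pattern: touch nonzeros only
        W_hat[nz, j] += omega * p[nz] / a[j]   # (12): update the scaled weights
        a[j] /= 1.0 + omega * p.sum()          # (13): renormalize one scalar
        # true weights, whenever needed: W[:, j] = a[j] * W_hat[:, j]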
A further reduction of complexity is inspired by the behavior of the algorithm. The initialization of the weights is based on a decomposition generated by Farhat's greedy algorithm. Initially, there is only one nonzero weight w_ij for each i. In the process of learning, one can expect that only a small number of the w_ij will become nonzero when low values of the learning rate are used; computational results confirm this proposition. In fact, if an element resides well inside a subdomain, then it will quite probably stay in the same subdomain along with its neighbors after the learning is over. Thus, we assume that only a small number of weights w_ij become nonzero for each i. This number is likely to grow with the number of subdomains for a particular mesh; however, the growth proves to be much slower than linear. This motivates a sparse representation of the weight matrix. From the implementation point of view, a linked list for each i turns out to be appropriate.

Storing only the nonzeros of W not only reduces memory requirements but also reduces the number of necessary floating-point operations. Each of the steps 2–5 then takes only O(N) floating-point operations. Computational results indicate a slow growth with the number of subdomains, which is caused by the reasons mentioned above and by the necessity of initializing the output vector o = (o_1, ..., o_{N_d})^T of length N_d to zero N times in each epoch. Overall, the complexity of the algorithm is approximately O(N).
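For illustration, a small map per input node (a Python dict standing in for the linked list) captures the idea; the class below is our own sketch and assumes patterns are supplied as (index, value) pairs of their nonzeros.

    class SparseWeights:
        """Keep only the nonzero w_ij, one small map per input node (element)."""
        def __init__(self, initial_label, c):
            # one nonzero per element initially: w[i][j0] = c_i, where j0 is
            # the subdomain assigned to element i by the greedy algorithm
            self.w = [{j0: ci} for j0, ci in zip(initial_label, c)]

        def outputs(self, pattern_nz, n_out):
            """o_j = sum_i w_ij p_i over the few stored nonzeros."""
            o = [0.0] * n_out
            for i, p_i in pattern_nz:
                for j, w_ij in self.w[i].items():
                    o[j] += w_ij * p_i
            return o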
5.2. Hopfield Network. Contrary to competitive learning, the operation of the Hopfield network is guided by the input dynamics of the neurons. The connection matrix T is built according to the cost function associated with the partitioning problem and is kept fixed. The inputs to the neurons are updated in every iteration based on the present states (outputs V_Xi) of the neurons. Let us analyze the complexity of the input updates per iteration according to (11). Without loss of generality, (11) can be rewritten as

    u_Xi(t+1) = u_Xi(t) − 2A a_X − B b_i + γ c_Xi + (2A − γ s(X)) V_Xi + A + B N̄,

where

    a_X = Σ_j V_Xj,   1 ≤ X ≤ N,
    b_i = Σ_X V_Xi,   1 ≤ i ≤ N_d,
    c_Xi = Σ_{Y∈A(X)} V_Yi,   1 ≤ X ≤ N, 1 ≤ i ≤ N_d.
Given u_Xi(t) for all X and i, the output states V_Xi can be computed with O(N N_d) complexity per iteration. Calculation of a_X requires N_d floating-point operations for any X; b_i can be computed in O(N) operations for any i; and c_Xi in s(X) operations for any X and i. Thus, the network can be implemented serially with a complexity of
The constants A, B, γ influence the convergence and stability of the Hopfield net in addition to controlling the penalties associated with the different optimization criteria. Since T is symmetric (in order that the Lyapunov function be valid), it can be completely characterized by its eigenvalues λ_1, ..., λ_n, with λ_i = λ_i(A, B, γ), and corresponding orthogonal eigenvectors e_1, ..., e_n. A, B, γ should be chosen such that the network is able to minimize E by moving so as to reduce to zero all components V_i of V (in the direction of e_i) for which λ_i < 0, and to increase all V_j's for which λ_j > 0 [1].
5. Fast Sparse Implementation of Neural Networks. The main motivation behind neural net-based computation is that such networks are simple and exhibit massive parallelism suitable for large-scale scientific problems. Neural computations can, however, be expensive to execute on serial machines. The experiments reported in this paper are all performed on a single-processor machine. In this section, implementation issues are discussed for fast serial execution of the neural networks, taking into account the special nature of the mesh partitioning problem.
5.1. Competitive Learning. This type of neural net algorithm involves updating the connection weights from n_i input nodes to n_o output nodes at each iteration (or epoch) for every input pattern. Before explaining the proposed implementation of the competitive learning algorithm, let us first calculate the computational complexity of the algorithm as described in Section 3. Initialization of the weight matrix can be done in O(N N_d) operations. This step is performed only once and presents a constant computational overhead. Steps 2–5 are performed N times in each epoch and involve the presentation of patterns of size N. Therefore, step 2 requires O(N² N_d) floating-point operations per epoch. Finding the winner in step 3 takes O(N N_d) operations per epoch. Finally, steps 4 and 5 (winner's weight update and renormalization) can be accomplished in O(N²) operations per epoch. Thus, the entire algorithm has complexity O(N² N_d), which is unreasonably high for partitioning large FE meshes.
The computational complexity of the entire algorithm can be reduced by utilizing the sparse nature of the data as well as the behavior of the algorithm. In FE meshes, each element is adjacent to only a small number of other elements, so the input patterns contain only a few nonzeros. This number is typically a constant depending on the type of elements in the mesh. Making use of this fact, we can cut down the number of operations needed for steps 2, 3, 4, and 5 to O(N N_d), O(N N_d), O(N), and O(N²) per epoch, respectively. The complexity is now concentrated in the renormalization step 5. The complexity of the renormalization step can be reduced by representing w_ij as w_ij = a_j ŵ_ij. Steps 4 and 5 can now be expressed as

    ŵ_ij̃ = ŵ_ij̃ + ω p_i / a_j̃,   i = 1, ..., n_i  s.t.  p_i ≠ 0,   (12)

    a_j̃ = a_j̃ / (1 + ω Σ_{i=1}^{n_i} p_i).   (13)

This increases the number of operations in steps 2 and 4 by a small extent, but it does not change the order of complexity, and step 5 is reduced to the normalization of one number
3. Compute the outputs

    V_i = g(u_i).

4. End or repeat from 2.
Let us now develop a cost function for the substructuring problem. Let 1 ≤ X ≤ N denote an element and 1 ≤ i ≤ N_d a subdomain. Then, if the element X belongs to the subdomain j, V_Xj = 1 and V_Xi = 0 for i ≠ j. The part of the cost function arising from this constraint can be written as

    E_s = A Σ_X (Σ_i V_Xi − 1)² − A Σ_{X,i} V_Xi (V_Xi − 1).   (9)

The first term ensures that the sum of the outputs of the N_d neurons associated with an element is equal to 1. The second term ensures that V_Xi = 0 or 1. Thus, the sum of these two terms enforces the validity of the solution, i.e., exactly one of the N_d neurons associated with each element is 1 and the rest are zeros. The same penalty A can be associated with both terms, since they are equally important for obtaining a valid solution. A substructuring is desired such that all subdomains are approximately of size N̄ = N/N_d and the interface size is minimized. So we need to minimize

    E = E_s + (B/2) Σ_i (Σ_X V_Xi − N̄)² + (γ/4) Σ_{X,i} Σ_{Y∈A(X)} (V_Xi − V_Yi)²,   (10)

where A(X) denotes the set of elements connected to the element X. Note that the second and third terms are conflicting: the third term is minimized if all neighboring elements belong to the same subdomain, i.e., if there is only one subdomain and the number of interface nodes is minimal; however, this results in a high penalty in the second term. From Equations (6), (7), and (10),
    du_Xi(t)/dt = −u_Xi(t)/τ − 2A Σ_{j≠i} V_Xj − B Σ_Y V_Yi + γ (Σ_{Y∈A(X)} V_Yi − s(X) V_Xi) + A + B N̄,   (11)
where s(X) = |A(X)| is the number of elements connected to the element X. If τ is chosen to be 1 without loss of generality, the input updates can be written as

    u(t+1) = u(t) + T V(t) + I_b,

where

    T_{Xi,Yj} = −2A δ_{XY} (1 − δ_{ij}) − B δ_{ij} + γ [Y ∈ A(X)] δ_{ij} − γ s(X) δ_{XY} δ_{ij},
    I_b = A + B N̄ (in every component),

δ denotes the Kronecker delta, and [Y ∈ A(X)] equals 1 if Y is adjacent to X and 0 otherwise.
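To make the dynamics concrete, the following Python sketch iterates the discrete update above for a toy mesh, using the grouped sums a_X, b_i, c_Xi (analyzed in Section 5) and the piecewise-linear activation (5). The 2-by-2 mesh, the initialization, and the iteration count are placeholders rather than the authors' setup; parameter values follow Section 6.

    import numpy as np

    A, B, GAMMA, G0 = 0.5, 0.001, 0.3, 0.15   # Section 6 parameter values

    def g(u):
        # piecewise-linear activation (5): 0 below, linear in between, 1 above
        return np.clip(0.5 + G0 * u, 0.0, 1.0)

    def hopfield_step(u, V, adjacency, n_bar):
        a = V.sum(axis=1)                                  # a_X = sum_j V_Xj
        b = V.sum(axis=0)                                  # b_i = sum_X V_Xi
        c = np.array([V[nbrs].sum(axis=0) for nbrs in adjacency])   # c_Xi
        s = np.array([len(nbrs) for nbrs in adjacency], dtype=float)
        du = (-2.0 * A * a[:, None] - B * b[None, :] + GAMMA * c
              + (2.0 * A - GAMMA * s[:, None]) * V + A + B * n_bar)
        u = u + du                                         # discrete update, cf. (8)
        return u, g(u)

    # toy mesh of four elements in a 2x2 block, two subdomains
    adjacency = [[1, 2], [0, 3], [0, 3], [1, 2]]
    rng = np.random.default_rng(1)
    V = 0.5 - 0.01 * rng.random((4, 2))                    # outputs near 0.5, cf. Section 6
    u = np.zeros((4, 2))
    for _ in range(20):
        u, V = hopfield_step(u, V, adjacency, n_bar=2.0)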
Fig. 2. Hopfield network.
to constrain the output states between 0 and 1. Here,

    g(λ) = 0,             λ ≤ −1/(2 g_0),
           0.5 + g_0 λ,   |λ| < 1/(2 g_0),   (5)
           1,             λ ≥ 1/(2 g_0).
The output of the j-th neuron is fed to the input of the i-th neuron through a weight T_ij. In addition, each neuron has an offset bias I_i fed to its input. The state u_i of the i-th neuron is updated as a function of the total input to the neuron. The dynamical behavior of the network is governed by the differential equation

    du_i/dt = −u_i/τ + Σ_j T_ij V_j + I_i,   V_j = g(u_j).   (6)
For symmetric T = (T_ij), as the system of neurons evolves in time and converges, its stable state corresponds to a minimum of a Lyapunov function defined by

    E = −(1/2) Σ_{i,j} T_ij V_i V_j − Σ_i I_i V_i + (1/τ) Σ_i ∫_0^{V_i} g⁻¹(V) dV.   (7)

If g_0 is large enough that the nonlinearity g(·) asymptotically approaches a step function, the third term can be neglected. The state space over which the system operates is the interior of the N-dimensional hypercube defined by V_i ∈ [0, 1]. However, for sufficiently high values of g_0 the minima occur only at vertices of the hypercube [10].
The main computational steps can be summarized as follows.
1. Initialize the inputs-outputs u_i-V_i of the network, and specify the connection matrix T for the given problem.
2. Update the inputs according to

    u_i(t+1) = u_i(t) + Σ_j T_ij V_j(t) + I_i.   (8)
For the automatic subdomain generation problem, the input nodes of the competitive learning network correspond to the elements in the domain, and the output units to the subdomains. After presenting a sufficient number of input patterns, the element i is assigned to the subdomain j̃ if w_ij̃ ≥ w_ij for all j ≠ j̃. Input patterns are created to represent the element-to-element adjacency in the mesh. Possible choices of input patterns are:
1. Dipole patterns: p_i = p_j = 1 if elements i and j are adjacent; p_k = 0 for k ≠ i, j. Thus, the training set consists of patterns with all possible element-to-element adjacencies.
2. Multipole patterns: p_{i1} = p_{i2} = ... = p_{ik} = 1 iff elements i1, ..., ik are adjacent to the element i. The training set size is equal to the number of elements in the mesh.
3. Weighted multipole patterns: essentially the same as multipole patterns, but the input component magnitudes are proportional to the number of nodes shared between adjacent elements.
Since in finite element meshes each element is connected to only a few neighboring elements, multipole and weighted multipole patterns are quite sparse and are most suitable for the automatic substructuring problem. It can be shown for the dipole and multipole types of input patterns that in some special cases (uniform element-to-element adjacency), the competitive learning mechanism solves the graph partitioning problem corresponding to the element-to-element adjacency so that the number of interconnections between partitions is minimal [19]. Good suboptimal solutions can be achieved for most practical problems.
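A minimal sketch of the multipole construction (choice 2 above) from an element-to-element adjacency list follows; the function name is ours, and placing a 1 at the element's own position in its pattern is our assumption.

    import numpy as np

    def multipole_patterns(adjacency):
        """One pattern per element: 1s at the element's adjacent elements."""
        n = len(adjacency)
        P = np.zeros((n, n))
        for x, nbrs in enumerate(adjacency):
            P[x, x] = 1.0        # include the element itself (our assumption)
            P[x, nbrs] = 1.0     # its adjacent elements
        return P

    # a strip of four elements 0-1-2-3 gives four sparse patterns
    P = multipole_patterns([[1], [0, 2], [1, 3], [2]])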
4. Substructuring by Quadratic Optimization. In this section, substructuring of FE meshes is formulated as an optimization problem and mapped onto a Hopfield-type [10] neural network. The objective function is the same as the Lyapunov function of the network, with the synaptic interconnection weights between the neurons representing the objective function associated with the substructuring problem. The Lyapunov function represents the collective behavior of the network: when the network is at its stable state, the energy function is at a local minimum. The basic idea of this type of neural network is to make a cooperative (global) decision based on the simultaneous input of a whole community of neurons, in which each neuron receives information from and gives information to every other neuron (the sparse implementation developed in the next section does not always take into account the entire set of neurons, to reduce the complexity of serial execution). This information is used by each neuron to force the network to converge to a stable state in order to make a decision.

The Hopfield net is constructed by connecting a large number of simple processing units to each other, as shown in Fig. 2. V_i denotes the output state of the i-th neuron, i.e., V_i = g(u_i), where g(·) is the activation function of the neurons and is monotonic in nature. u_i is the input to the i-th neuron. g(·) is a nonlinear or piecewise linear function
steps can be described as follows.
1. Initialize the weights so that they are normalized:

    Σ_{i=1}^{n_i} w_ij = 1,   j = 1, ..., n_o.   (1)

2. Select randomly one of the input vectors p and compute the output vector o = (o_1, ..., o_{n_o})^T:

    o_j = Σ_{i=1}^{n_i} w_ij p_i,   j = 1, ..., n_o.   (2)

3. Find the winner unit:

    j̃ = arg max_{j=1,...,n_o} o_j.

4. Adjust the winner's weights:

    w_ij̃ = w_ij̃ + ω p_i,   i = 1, ..., n_i,   (3)

where the parameter ω is called the learning rate. Moderate values of the learning rate ensure the stability of the network. Thus, for each input pattern, the winner j̃ is found and the weights w_ij̃ are updated to make the weight vector w_j̃ closer to the present input pattern.
5. Renormalize the winner's weights so that condition (1) holds, i.e.,

    w_ij̃ = w_ij̃ / (1 + ω Σ_{i=1}^{n_i} p_i),   i = 1, ..., n_i.   (4)

The incremental weight updating rule in Step 4 makes the winner j̃ more likely to win on that specific input pattern in the future. It also makes the weights grow without bound, so that one unit would eventually come to dominate the competition for all inputs; the renormalization step is necessary to ensure fair competition among the output nodes.
6. End or repeat from 2.
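The steps above translate directly into a short program. The following Python/NumPy sketch is illustrative only: the function name, the random initialization, and the default parameter values are our choices, and the implementation described later in this paper is in C.

    import numpy as np

    def competitive_learning(patterns, n_out, omega=1e-4, epochs=10, seed=0):
        """patterns: (n_pat, n_in) array of input patterns, one per row."""
        rng = np.random.default_rng(seed)
        n_pat, n_in = patterns.shape
        W = rng.random((n_in, n_out))
        W /= W.sum(axis=0)                        # step 1: columns sum to 1
        for _ in range(epochs):
            for k in rng.permutation(n_pat):      # step 2: random pattern order
                p = patterns[k]
                o = p @ W                         # o_j = sum_i w_ij p_i
                j = int(np.argmax(o))             # step 3: winner unit
                W[:, j] += omega * p              # step 4: pull winner toward p
                W[:, j] /= 1.0 + omega * p.sum()  # step 5: renormalize winner
        return W

    # element i is then assigned to the subdomain with the largest weight:
    # labels = competitive_learning(P, n_sub).argmax(axis=1)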
Training stops when the network reaches a statistically stable state, which corresponds to the condition that the average change in the connection weights w_ij is equal to zero. At equilibrium, an output node responds most strongly to patterns that overlap other patterns to which the node responds, and responds most weakly to patterns that are far from the patterns to which it responds. If the patterns are highly structured, the classifications are highly stable in the sense that the same units will always respond to the same patterns. Grossberg in fact proved that if the patterns are sufficiently sparse, and/or when there are sufficiently many output nodes, the competitive learning network converges to a so-called perfectly stable state [8].
Fig. 1. Competitive learning network: an input layer of elements with pattern components p_1, ..., p_{n_i}, connected through adjustable weights w_ij to an output layer of subdomains with outputs o_1, ..., o_{n_o}.
medium-grain parallel algorithm.

All the above-mentioned mesh partitioning algorithms are sequential or medium-grain parallel in nature, which may not be efficient for truly large-scale problems (millions of elements). In this paper, we propose two fine-grain algorithms based on neural networks for the mesh partitioning problem. An initial partitioning is obtained using Farhat's greedy algorithm. Competitive learning and Hopfield net-based algorithms are applied next to generate optimized mesh partitions. Sophisticated implementations are developed such that the inherently parallel NN-based techniques can be run on serial machines with linear complexity. Sparse implementation issues are also addressed to reduce complexity as well as memory requirements.
3. Substructuring by Competitive Learning. In this section, we propose a simple competitive learning-based technique for 3-D unstructured FE mesh partitioning. Competitive learning is an unsupervised algorithm, based on the nonassociative statistical learning principle and well suited for regularity detection [19]. It consists of a set of input nodes connected to a set of output nodes via adjustable weights, as shown in Fig. 1. An output node responds most strongly to a particular input pattern if the weights to that output node (also called the "winner") resemble the pattern vector most closely. The inner product of the normalized weight vector and the given pattern vector is a standard choice for the "closeness" metric. Learning involves moving the weights to the winner output node towards the input pattern components, and eventually the network discovers the clusters of overlapping input vectors [9, p. 220]. For the mesh partitioning problem, the decomposition of the mesh into subdomains is stored in the connection weight matrix of the network. Let n_i denote the number of input units and n_o the number of output units. w_ij is the weight from the input node i to the output node j. The vector p = (p_1, ..., p_{n_i}) denotes one of the input patterns. Then the computational
2. Mesh Partitioning. Substructuring a finite element mesh is in essence a graph partitioning problem, where every element in the mesh can be viewed as a vertex in the graph. The desired number of subdomains is the number of partitions of the graph. The graph partitioning problem can be formally stated as follows. Given an undirected graph G(V, E) with a set of vertices V = {v_1, ..., v_N} and a set of edges E, a partitioning is desired such that V = ∪_i V_i, i = 1, ..., N_d, where V_i ∩ V_j = ∅ for i ≠ j; |V_i| ≈ |V_j| for all i, j; and the total cut-size Σ_{i≠j} |E_{i,j}|, where E_{i,j} = {(a, b) ∈ E | a ∈ V_i, b ∈ V_j}, is minimized. N is the number of vertices in the graph, and N_d is the desired number of partitions.
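Stated as code, the objective is simply the following two-line helper (hypothetical names; part[v] gives the partition index of vertex v):

    def cut_size(edges, part):
        """Total number of edges whose endpoints lie in different partitions."""
        return sum(1 for a, b in edges if part[a] != part[b])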
Several neural net algorithms have been proposed for solving the graph partitioning problem [9, 16, 19]. For our applications, such as analyzing an aircraft fuselage, the number of elements is in the thousands, even millions, and the number of subdomains may vary approximately as the square root of the number of elements. Moreover, partitioning is just a preprocessing stage, and the amount of time spent in this stage must be at most of the order of the actual domain decomposition computations. So it is essential to develop an algorithm that is fast and has approximately linear complexity with respect to the problem size, i.e., the number of elements in the FE mesh.
With the availability of parallel machines, a large number of different algorithms for mesh partitioning have been investigated. In general, these algorithms can be broadly categorized into two groups: (1) engineering-based and (2) graph-theoretic. Farhat proposed [5] a simple greedy algorithm which can generate reasonably balanced partitionings of FE meshes with great speed. In this technique, given a mesh and the desired number of subdomains, a seed element is chosen in the unassigned part of the mesh, and the subdomain is grown by adding adjacent elements until the number of elements belonging to the subdomain is approximately equal to N̄ = N/N_d. This process is repeated for all subdomains. The algorithm is inherently sequential in nature but very fast; a sketch is given below.
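The following Python sketch illustrates the seed-and-grow idea just described. It is not Farhat's actual implementation; in particular, the seed choice here (the lowest-numbered unassigned element) is our simplification.

    from collections import deque

    def greedy_partition(adjacency, n_sub):
        """Grow subdomains of roughly N/n_sub elements by breadth-first accretion."""
        n = len(adjacency)
        target = n // n_sub
        label = [-1] * n
        unassigned = set(range(n))
        for s in range(n_sub):
            limit = target if s < n_sub - 1 else len(unassigned)
            queue, size = deque(), 0
            while size < limit and unassigned:
                if not queue:                     # (re)seed in the unassigned part
                    queue.append(min(unassigned))
                x = queue.popleft()
                if label[x] != -1:
                    continue
                label[x] = s                      # add element x to subdomain s
                unassigned.discard(x)
                size += 1
                queue.extend(y for y in adjacency[x] if label[y] == -1)
        return label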
Simulated annealing has been used by Williams [22] and Nour-Omid et al. [15]. FE meshes for structural analysis have also been partitioned using the "peeling" algorithm and a bisection method based on the centroid of a structure and its principal directions [18]; explicit geometric information is needed in these techniques. Vaughan applied the Kernighan-Lin (KL) algorithm to mesh partitioning [21]. This algorithm begins with an initial partition of the graph into two subsets which differ in size by at most one. At each iteration, the algorithm chooses two candidate subsets of equal size to swap between the initial two subsets, thereby reducing the cut-size between the two. The algorithm terminates when it is no longer possible to decrease the cut-size by swapping subsets. Though the KL algorithm generates nearly optimal solutions, it suffers from a lack of speed (the complexity is about O(N² log N) per iteration). Pothen et al. proposed an elegant method, called recursive spectral bisection (RSB) [17], based on Fiedler's results on the eigenvalue properties of graphs [6]. The RSB method requires computation of the eigenvector associated with the second smallest eigenvalue of the Laplacian matrix of the graph of the problem. The Laplacian is a sparse, symmetric, positive semidefinite matrix of the same order as the problem. The RSB method is in a certain sense an optimal graph partitioning algorithm [2]. However, in spite of strong theoretical foundations, it is not the best possible algorithm in some practical situations. It is at best a
cial neural network-based techniques for automatic substructure generation for iterative solvers.

Large systems of equations resulting from FE discretizations are increasingly being solved on highly parallel machines and clusters of high-end workstations. Solution of such large systems of equations on parallel machines calls for innovative iterative solution techniques such as domain decomposition and multigrid methods [20]. Domain decomposition is a class of powerful iterative methods for solving systems of linear algebraic equations arising from finite element discretizations of linear, elliptic, self-adjoint boundary value problems, and it often achieves fast convergence for large problems. It is based on a combination of local computations on the desired discretization and smaller non-local problems that exchange information between distant parts of the system. Consider a system of linear algebraic equations Au = f resulting from an FE discretization on a domain Ω. Let the domain be split into non-overlapping subdomains Ω_1, Ω_2, ..., Ω_{N_d}, each of which is the union of some finite elements. The main idea is to reduce the problem by eliminating the DOF associated with the interiors of the subdomains. The reduced problem is then solved by preconditioned conjugate gradients [13].
Execution of FE problems on distributed memory machines requires that their data structures be partitioned and distributed across processors. Even on virtual shared memory machines, mesh partitioning is desirable for the sake of data locality and enhanced performance [5]. If the mesh elements (and corresponding nodes) are allocated to different processors, interprocessor communication takes place (along the edges) whenever a computation involves nodal variables allocated to neighboring processors. Thus, interprocessor communication can be reduced by minimizing the interface length in the partitioned mesh. For explicit solution methods, e.g., time-marching algorithms or global iterative methods, computations on the nodes residing on a single processor are followed by non-local computations related to the interfaces. Thus, load balance in the number of variables per processor is at a premium for processor synchronization [18]. However, for domain decomposition techniques, most of the computation involves the variables on the interface nodes. Therefore, for domain decomposition-based iterative methods, a substructuring is desired that minimizes the number of interface nodes between subdomains (whose number is often determined by the number of available processors) of approximately the same size. Such an optimization criterion also generates smooth interfaces and improves the aspect ratios of the resulting subdomains. Aspect ratios close to one and smooth interfaces in turn enhance the numerical performance of the subsequent domain decomposition-based iterative solvers [4].
Section 2 reviews some important partitioning techniques. Competitive learning and Hopfield net-based mesh partitioning techniques are proposed in Sections 3 and 4, respectively. Implementation issues for fast serial execution of the neural net-based partitioning algorithms are discussed in Section 5. Experimental results are presented in Section 6. Finally, we summarize the present investigation and mention some future research issues in Section 7.
AUTOMATIC SUBSTRUCTURING FOR DOMAIN DECOMPOSITION USING NEURAL NETWORKS

SUGATA GHOSAL, JAN MANDEL, AND RADEK TEZAUR
Abstract. The solution of large finite element problems on distributed memory (even shared virtual memory) computers calls for efficient partitioning of large and unstructured 3-D meshes into submeshes such that computations can be distributed across processors. It is desired that the resulting subdomains (submeshes) are approximately of the same size and that the total number of interface nodes between adjacent subdomains is minimal. Two fine-grain scalable parallel algorithms employing neural network paradigms are proposed that can efficiently perform mesh partitioning for subsequent domain decomposition computations. New implementations are developed such that both techniques have almost linear complexity with respect to the problem size for serial execution. These substructuring techniques compare favorably with the well-known recursive spectral bisection (RSB) method and a simulated annealing-based partitioning algorithm.

Key words. mesh partitioning, domain decomposition, neural networks, competitive learning, Hopfield network, parallel processing, scalable algorithms
1. Introduction. Artificial neural networks (ANN) are parallel systems with a large number of interconnected processing elements or "neurons", in which desired input/output mappings are obtained by adapting (training) the interconnections (also called connection weights or synaptic strengths) according to a suitable learning algorithm [12]. With the advent of massively parallel computer architectures, neural network-guided solutions have become a reality in applications such as forecasting, pattern recognition, diagnostics, and automatic control. Although ANN can serve as models for further understanding of brain functions, as a computational tool they can provide fast solutions to large scientific problems because of their parallel nature and learning capabilities. The focus of this article is on one such application of neural network paradigms in high-performance computing, namely unstructured 3-D finite element (FE) mesh partitioning for domain decomposition.

FE analysis has evolved into a powerful tool in diverse scientific disciplines, e.g., computational fluid dynamics and computational structural analysis, because of its simplicity and general nature. However, almost all commercial FE solvers are of a direct nature: their solution time increases proportionally to the square of the number of degrees of freedom (NDOF) or worse, and their storage requirement grows approximately proportionally to (NDOF) for large problems [14]. "Smart" iterative solvers, suitable for implementation on parallel computers, on the other hand require less storage; their solution time grows only about linearly with the problem size, but their performance is much more problem dependent. This work deals with the application of fast artifi-
This paper is partially based on the presentation at the IEEE Conference on Neural Networks [7].
Center for Computational Mathematics, University of Colorado at Denver, Denver, CO 80217. This research was supported by National Science Foundation grants ASC-9121431 and ASC-9217394.