
[12] R.P. Lippmann, An introduction to computing with neural nets, IEEE ASSP Magazine, pp. 4-22, 1987.

[13] J. Mandel, Balancing domain decomposition, Comm. Appl. Numer. Methods, 9 (1993), pp. 233-241.

[14] J. Mandel, Adaptive iterative solvers in finite elements, in Solving Large-Scale Problems in Mechanics, M. Papadrakakis, ed., John Wiley, 1993, pp. 65-88.

[15] B. Nour-Omid, A. Raefsky, and G. Lyzenga, Solving finite element equations on concurrent computers, in Parallel Computations and Their Impact on Mechanics, A.K. Noor, ed., ASME Press, New York, 1986, pp. 209-227.

[16] C. Peterson and J.R. Anderson, Neural networks and NP-complete optimization problems; a performance study on the graph bisection problem, Complex Systems, 2 (1988), pp. 59-89.

[17] A. Pothen, H.D. Simon, and K.-P. Liou, Partitioning sparse matrices with eigenvectors of graphs, SIAM J. Matrix Anal. Appl., 11 (1990), pp. 430-452.

[18] H.D. Simon, Partitioning of unstructured problems for parallel processing, Comput. Systems Engrg., 2 (1991), pp. 135-148.

[19] D.E. Rumelhart and D. Zipser, Feature discovery by competitive learning, Cognitive Science, 9 (1985), pp. 75-112.

[20] J. Xu, Iterative methods by space decomposition and subspace correction, SIAM Rev., 34 (1992), pp. 581-613.

[21] C. Vaughn, Structural analysis on massively parallel computers, in Proceedings of the Conference on Parallel Methods on Large Scale Structural Analysis and Physics Applications, Pergamon Press, 1991.

[22] R.D. Williams, Performance of dynamic load balancing algorithms for unstructured mesh calculations, Tech. Report C3P913, California Institute of Technology, Pasadena, June 1990.


7. Conclusions. Two novel techniques are proposed in this paper for generating substructures for domain decomposition using neural network paradigms. This study demonstrates the usefulness of neural networks for approximately solving large NP-hard optimization problems with reasonable speed on serial machines. Both techniques yield significantly better outputs than a popular greedy algorithm when only the element-to-element connectivity of the FE mesh is used. Sparse implementations are developed for fast serial execution and reduced memory requirements. Both neural net-based techniques are competitive with the benchmark algorithms. Future research will concentrate on incorporating the aspect ratio requirement explicitly into the optimization process; the use of faster ANN algorithms, e.g., VFSR [11] and mean field [16] types, that attempt to avoid local minima; automatic selection of the network parameters; and generation of submeshes respecting the physics of the problem. Generation of 3-D hexahedral finite element meshes, given 3-D node points, is a similar problem and will be studied following a similar philosophy.

Acknowledgments. Test meshes have been provided by Charbel Farhat, Dept. of Aerospace Engineering, University of Colorado at Boulder. The performance comparison numbers were obtained using the TOP/DOMDEC software developed by Charbel Farhat and his coworkers. Horst Simon of NASA Ames Research Center provided the RSB code. We would also like to thank Ellen Applebaum for her suggestions regarding competitive learning.

REFERENCES

[1] S.V.B. Aiyer, M. Niranjan, and F. Fallside, A theoretical investigation into the performance of the Hopfield model, IEEE Trans. Neural Networks, 1 (1990), pp. 204-215.

[2] S.T. Barnard and H.D. Simon, A fast multilevel implementation of recursive spectral bisection for partitioning unstructured problems, Report RNR-92-033, NASA Ames Research Center, April 1993; Concurrency: Practice and Experience, 1994, to appear.

[3] C. Farhat, H. Simon, and S. Lanteri, TOP/DOMDEC, a software tool for mesh partitioning and parallel processing, in press.

[4] C. Farhat, J. Mandel, and F.X. Roux, Optimal convergence properties of the FETI domain decomposition method, Comput. Meth. Appl. Mech. Engrg., 115 (1994), pp. 367-388.

[5] C. Farhat and M. Lesoinne, Automatic partitioning of unstructured meshes for the parallel solution of problems in computational mechanics, Intl. J. Numer. Meth. Engrg., 36 (1993), pp. 745-764.

[6] M. Fiedler, A property of eigenvectors of nonnegative symmetric matrices and its application to graph theory, Czechoslovak Math. J., 25 (1975), pp. 619-633.

[7] S. Ghosal, J. Mandel, and R. Tezaur, Automatic substructuring for domain decomposition using neural networks, in Proceedings of the IEEE Conference on Neural Networks, Orlando, June 1994, to appear.

[8] S. Grossberg, Competitive learning: From interactive activation to adaptive resonance, Cognitive Science, 11 (1987), pp. 23-63.

[9] J. Hertz, A. Krogh, and R.G. Palmer, Introduction to the Theory of Neural Computation, Addison-Wesley, Redwood City, CA, 1991.

[10] J.J. Hopfield and D.W. Tank, "Neural" computation of decisions in optimization problems, Biological Cybernetics, 52 (1985), pp. 141-152.

[11] A.L. Ingber, Simulated annealing: Theory vs. practice, J. Math. Computer Model., 1993.


Fig. 6. Mesh partitioning by neural networks. (a) 2-D FE mesh, ENGINE with 22,936 elements and 12,233 nodes. (b) Partitions generated by competitive learning network. (c) Partitions generated by Hopfield network.

Table 5
Mesh partitioning performance of neural networks for the unstructured 2-D mesh, ENGINE with 22,936 elements and 12,233 nodes.

        Competitive Learning          Hopfield Net
 N_s     N_I    Load    Time          N_I    Load    Time
  16     420    0.93    5.09          454    0.99    4.24
  32     716    0.97    6.13          787    0.99    5.55
  64    1197    0.97    7.36         1277    0.93   10.15
 128    1928    0.93   10.15         2051    0.92   15.62

Table 6
Mesh partitioning performance of recursive spectral bisection and simulated annealing-based techniques for the unstructured 2-D mesh, ENGINE with 22,936 elements and 12,233 nodes.

              RSB                     Simulated Annealing
 N_s     N_I    Load    Time          N_I    Load    Time
  16     403    1.0    17.14          425    0.97    3.71
  32     689    1.0    20.11          725    0.99    4.42
  64    1204    0.99   27.78         1161    0.85   11.52
 128    2070    0.99   31.94         1829    0.82   14.69


[Fig. 6, panels (a) and (b).]


Fig. 5. Mesh partitioning by neural networks. (a) 3-D FE mesh, MACHINE with 3648 elements and 5336 nodes. (b) Partitions generated by competitive learning network. (c) Partitions generated by Hopfield network.

Table 3
Mesh partitioning performance of neural networks for the unstructured 3-D mesh, MACHINE with 3648 elements and 5336 nodes.

        Competitive Learning          Hopfield Net
 N_s     N_I    Load    Time          N_I    Load    Time
   8     722    0.95    1.67          783    0.90    1.30
  16    1107    0.91    1.78         1177    0.84    1.83
  32    1573    0.91    2.08         1683    0.82    2.59
  64    2023    0.84    2.58         2170    0.79    3.47

Table 4
Mesh partitioning performance of recursive spectral bisection and simulated annealing-based techniques for the unstructured 3-D mesh, MACHINE with 3648 elements and 5336 nodes.

              RSB                     Simulated Annealing
 N_s     N_I    Load    Time          N_I    Load    Time
   8    1234    1.0     4.01          803    0.97    2.77
  16    1720    1.0     5.59         1131    0.91    4.06
  32    2144    1.0     7.21         1694    0.96    7.88
  64    2608    1.0     8.37         2071    0.82    6.89


[Fig. 5, panels (a) and (b).]


Fig. 4. Mesh partitioning by neural networks. (a) 3-D FE mesh, BLADE with 944 elements and 1820 nodes. (b) Partitions generated by competitive learning network. (c) Partitions generated by Hopfield network.

Table 1
Mesh partitioning performance of neural networks for the unstructured 3-D mesh, BLADE with 944 elements and 1820 nodes.

        Competitive Learning          Hopfield Net
 N_s     N_I    Load    Time          N_I    Load    Time
   8     360    0.89    0.35          378    0.86    0.67
  16     530    0.88    0.36          572    0.79    1.04
  32     689    0.82    0.47          784    0.84    1.36
  64     879    0.74    0.56          893    0.82    1.61

Table 2
Mesh partitioning performance of recursive spectral bisection and simulated annealing-based techniques for the unstructured 3-D mesh, BLADE with 944 elements and 1820 nodes.

              RSB                     Simulated Annealing
 N_s     N_I    Load    Time          N_I    Load    Time
   8     417    1.0     1.00          354    0.82    0.72
  16     586    1.0     1.22          512    0.86    1.24
  32     814    1.0     1.62          656    0.89    1.27
  64    1057    1.0     1.75          886    0.79    5.42


[Fig. 4, panels (a) and (b).]


[Fig. 3, panels (a)-(c).]
Fig. 3. Substructuring by neural networks. (a) Output of Farhat's greedy algorithm. (b) After optimization by competitive learning network. (c) After optimization by Hopfield network.


The initialization also ensures that the initial state of the network is away from the basins of attraction corresponding to the corners of the hypercube.
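A minimal sketch of this initialization follows: the neuron of the greedy-assigned subdomain is set just below 0.5, all others to zero. The array names sub and V and the row-major layout are illustrative assumptions, not the authors' data structures.

#include <stdlib.h>

/* Initialize Hopfield outputs from the greedy decomposition:
 * V[X][i] is close to 0.5 for the subdomain sub[X] assigned by the
 * greedy algorithm (perturbed away from the hypercube corners) and
 * 0 otherwise.  V is stored row-major as V[X*Nd + i]. */
static void init_outputs(int N, int Nd, const int *sub, double *V) {
    for (int X = 0; X < N; ++X) {
        for (int i = 0; i < Nd; ++i)
            V[X * Nd + i] = 0.0;
        V[X * Nd + sub[X]] = 0.5 - 0.01 * ((double)rand() / RAND_MAX);
    }
}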

Reasonable mesh partitions can be obtained in 5-15 epochs. The learning rate parameter ω in the competitive learning network and the constants g_0, A, B, and C in the Hopfield network are chosen empirically to generate acceptable mesh partitions. The values ω = 0.0001, g_0 = 0.15, A = 0.5, B = 0.001, and C = 0.3 generate reasonable partitions for a wide variety of FE meshes. Automatic selection of these parameters is an open research issue.

Fig. 3(a) shows a 2-D FE mesh with 260 elements, decomposed into 3 subdomains by the greedy algorithm. The ratio of the number of interface nodes (i.e., nodes shared by more than one subdomain) to the total number of nodes in the mesh is 0.1346. Note also that the decomposition suffers from "unfavorable" aspect ratios for all the subdomains. The aspect ratio is important for the convergence properties of the domain decomposition computations [4]. Decompositions obtained using the competitive learning and Hopfield network-based techniques are shown in Fig. 3(b) and Fig. 3(c), respectively. The interface ratio is 0.07407 in both cases, which is 45.0% lower than that obtained by the greedy algorithm. The aspect ratios of the substructures generated by both ANN paradigms are also satisfactory.

Fig. 4(a) shows an unstructured 3-D mesh, BLADE, with 944 elements and 1820 nodes. Submeshes obtained using the proposed neural net-based techniques are depicted in Fig. 4(b) and Fig. 4(c) for N_d = 4. The corresponding partitioning results are reported in Tables 1 and 2 for the proposed techniques, the RSB method, and the simulated annealing-based partitioning technique. The RSB and simulated annealing-based methods are run in the TOP/DOMDEC environment. N_I denotes the total number of interface nodes for a given number of subdomains, and Load represents the ratio of the average submesh size to the size of the largest submesh created by the respective partitioning algorithm. Time denotes the sum of the execution times of the greedy algorithm and the optimizing algorithm for the neural network- and simulated annealing-based methods. The execution time of the greedy algorithm is almost negligible for most problems. Fig. 5 shows the partitionings generated by the proposed techniques for the 3-D mesh MACHINE with 3648 elements and 5336 nodes for N_d = 8. Comparative performances are presented in Tables 3 and 4. Partitioning of a large 2-D mesh, ENGINE, with 22,936 triangular elements and 12,233 nodes is shown in Fig. 6 for N_d = 8. Tables 5 and 6 summarize the relative performance of the proposed techniques and the benchmark algorithms for this mesh. Both proposed techniques outperform the recursive spectral bisection method for 3-D meshes; results are comparable for 2-D meshes. The proposed techniques are also competitive with simulated annealing. The main strength of ANN algorithms is that they are simple in nature and exhibit fine-grain parallelism. Thus, it is expected that the neural net-based algorithms, with carefully chosen parameter values, would perform excellently, especially in a parallel environment. The results reported here on a wide variety of large FE meshes clearly establish the effectiveness of neural network paradigms for FE mesh partitioning even on serial computers.


In (14),

    \Delta u_{Xi}(t) = -2A\,a_X(t) - B\,b_i(t) + C\,c_{Xi}(t) + (2A - C\,s(X))\,V_{Xi}(t) + A + B\bar{N}

and

    \Delta u_{Xi}(t) - \Delta u_{Xi}(t-1)
        = -2A \sum_{j \in S_X} (V_{Xj}(t) - V_{Xj}(t-1))
          - B \sum_{Y} (V_{Yi}(t) - V_{Yi}(t-1))
          + C \sum_{Y \in C(X)} (V_{Yi}(t) - V_{Yi}(t-1))
          + (2A - C\,s(X)) (V_{Xi}(t) - V_{Xi}(t-1)).                      (15)

Since the initial state of the network corresponds to a reasonable decomposition, only a small fraction of the elements change their memberships in each iteration as the network evolves in time. Consequently, for a given element X, V_{Xi}(t) - V_{Xi}(t-1) remains zero for almost all i's, and thus does not enter the computation. Therefore, for any element X the inputs-outputs u_{Xi}-V_{Xi} need to be updated for only a small number of i's (compared to |S_X| for regular updates). It is also found for many meshes that only the elements near the subdomain interfaces change their memberships as the interface length is minimized, and suboptimal partitions can be obtained by updating at each iteration only the states of those elements which are near the subdomain boundaries. Thus, with sparse data structures and the proposed state updating rule, the Hopfield network can be implemented with complexity O(N_I) per time step on a serial machine, where N_I is the number of interface nodes present in the initial decomposition generated by the greedy algorithm.
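As a small illustration of why the per-step cost can be tied to the interface size, the sketch below marks the elements whose neighborhood touches another subdomain; only these need to be revisited by the incremental update. The CSR-style adjacency arrays and identifiers are illustrative assumptions, not the authors' data structures.

/* Mark elements that lie on a subdomain boundary: element X is active
 * if some neighbour Y is currently assigned to a different subdomain.
 * sub[X] is the current assignment; adjacency is given in CSR form
 * (adj_ptr, adj).  Returns the number of active elements. */
static int mark_boundary_elements(int N, const int *adj_ptr, const int *adj,
                                  const int *sub, char *active) {
    int n_active = 0;
    for (int X = 0; X < N; ++X) {
        active[X] = 0;
        for (int k = adj_ptr[X]; k < adj_ptr[X + 1]; ++k)
            if (sub[adj[k]] != sub[X]) { active[X] = 1; break; }
        n_active += active[X];
    }
    return n_active;
}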

6. Numerical Results. Numerous experiments have been conducted with model problems as well as real-world 2-D and 3-D FE meshes to demonstrate the effectiveness of neural network principles for automatic substructuring for domain decomposition. Some of these results are reported in this section. Both neural net paradigms are implemented in C, incorporated into the TOP/DOMDEC software [3], and executed on an SGI Crimson workstation.

A two-step approach is adopted in this study to generate submeshes from FE meshes. In the first step, an initial decomposition is obtained using Farhat's greedy algorithm, which satisfies the load balance criterion. Competitive learning and Hopfield networks are employed in the second step to minimize the interface length between adjacent subdomains while keeping the size of each subdomain nearly equal. The outputs of the greedy algorithm are used to initialize the synaptic strengths and the inputs-outputs of the competitive learning and Hopfield networks, respectively. Let S_j denote the set of elements allocated to the subdomain j by the greedy algorithm. Then w_{ij} in the competitive learning network is set to c_i such that \sum_{i \in S_j} c_i = 1, and w_{ij} = 0 for i \notin S_j. The outputs of the Hopfield net are initialized to V_{Xi} = 0.5 - 0.01 rand(X) if the element X is allocated to the i-th subdomain by the greedy algorithm, and V_{Xj} = 0 for j \neq i; here rand(X) is a random number uniformly distributed between 0 and 1. Such an initialization cuts down the memory requirements of the Hopfield network and helps in maintaining sparse data structures.


For partitioning of large FE meshes, N may be in the millions and N_d can vary as \sqrt{N}. The O(N N_d) per-time-step complexity of the Hopfield network on serial machines is therefore very high for real-time mesh partitioning.

The properties of FE meshes are utilized to develop a sparse implementation of the Hopfield network that, in addition to reducing memory requirements, effectively lowers the computational expense. The states of the network are initialized according to the decomposition generated by the greedy algorithm. Since the greedy algorithm generates an almost acceptable partitioning for most problems, an element initially belonging to a subdomain either stays in that subdomain during the entire time evolution of the network or changes its membership only to one of the neighboring subdomains present in the initial decomposition. That is, if an element X belongs to subdomain i in the initial decomposition, then X either remains in subdomain i during the evolution of the network or moves to a subdomain i' \in S_X, where S_X is the set of subdomains neighboring subdomain i. This is an acceptable restriction, since the number of interface nodes can be minimized by local adjustment of the membership of a given element. The upper limit of |S_X| is typically 8 for 2-D square FEs and 16 for 3-D cubic FEs. Thus the state updating rule given in (11) can be written as

    u_{Xi}(t+1) = u_{Xi}(t) - 2A\,a_X - B\,b_i + C\,c_{Xi} + (2A - C\,s(X))\,V_{Xi} + A + B\bar{N},
        1 \le X \le N,  i \in S_X,

where

    a_X = \sum_{j \in S_X} V_{Xj},  1 \le X \le N,
    b_i = \sum_{X} V_{Xi},          1 \le i \le N_d,
    c_{Xi} = \sum_{Y \in C(X)} V_{Yi}.

Now, a_X can be computed in \sum_X |S_X| = O(N) floating-point operations for all elements, and so can b_i for all subdomains. Given any element X, c_{Xi} can be evaluated in s(X)|S_X| operations for all i. Thus, the a_X, b_i, and c_{Xi} can essentially be computed with a number of floating-point operations proportional to N per iteration. The constant of proportionality evidently depends on the number of subdomains |S_X| neighboring the subdomain in which the element initially resides. Constraining the movement of an element to only neighboring subdomains in fact helps the optimization process for some problems by cutting down the search space, and facilitates a sparse implementation of the Hopfield network for large-scale problems.
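A minimal sketch of this accumulation for one element, under assumed CSR-style data structures (S_ptr/S for the candidate subdomain sets S_X, adj_ptr/adj for element adjacency); b_i is assumed to be maintained globally elsewhere, and all identifiers are illustrative rather than the authors' implementation.

/* For one element X, accumulate
 *   aX    = sum over j in S_X of V[X][j]
 *   cX[k] = sum over neighbours Y of X of V[Y][i],  for the k-th
 *           candidate subdomain i = S[S_ptr[X] + k].
 * V is stored row-major as V[X*Nd + i]; cX must have room for
 * S_ptr[X+1] - S_ptr[X] entries.  Returns aX. */
static double accumulate_sparse(int X, int Nd, const double *V,
                                const int *S_ptr, const int *S,
                                const int *adj_ptr, const int *adj,
                                double *cX) {
    double aX = 0.0;
    for (int k = S_ptr[X]; k < S_ptr[X + 1]; ++k)
        aX += V[X * Nd + S[k]];
    for (int k = S_ptr[X]; k < S_ptr[X + 1]; ++k) {
        int i = S[k];
        double s = 0.0;
        for (int m = adj_ptr[X]; m < adj_ptr[X + 1]; ++m)
            s += V[adj[m] * Nd + i];       /* neighbours' outputs for i */
        cX[k - S_ptr[X]] = s;
    }
    return aX;
}

The cost per element is |S_X| + s(X)|S_X| operations, in line with the O(N) total per iteration stated above.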

A new state updating scheme is also developed, utilizing the present as well as the past states, to further reduce the computational complexity. The original state update rule of the Hopfield network can be rewritten as

    u_{Xi}(t+1) = u_{Xi}(t) + \Delta u_{Xi}(t)
                = u_{Xi}(t) + (\Delta u_{Xi}(t) - \Delta u_{Xi}(t-1)) + \Delta u_{Xi}(t-1),     (14)


Here ĵ is the winning output node. Thus, the complexity of step 5 is reduced to O(N) per epoch. In order to prevent overflow of the a_j's, an occasional full renormalization may be necessary; this hardly ever happens in the partitioning process of most meshes.
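In code, the factorization w_{ij} = a_j ŵ_{ij} turns the renormalization into a single scalar operation per pattern. Below is a minimal sketch of steps 4-5 for the winning node ĵ, assuming the pattern is supplied through its nonzero entries; the identifiers (idx, val, what, a) are illustrative assumptions.

/* Winner update with factored weights w_ij = a[j] * what[i][j]:
 * step 4 touches only the entries where p_i != 0, and step 5 rescales
 * the single factor a[jhat] instead of the whole column (cf. (12)-(13)).
 * The pattern has nz nonzeros: p[idx[k]] = val[k]. */
static void cl_winner_update(int jhat, int no, int nz, const int *idx,
                             const double *val, double omega,
                             double *what, double *a) {
    double psum = 0.0;
    for (int k = 0; k < nz; ++k) {
        what[idx[k] * no + jhat] += omega * val[k] / a[jhat];   /* (12) */
        psum += val[k];
    }
    a[jhat] /= 1.0 + omega * psum;                               /* (13) */
}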

Further reduction of the complexity is inspired by the behavior of the algorithm. The initialization of the weights is based on a decomposition generated by Farhat's greedy algorithm. Initially, there is only one nonzero weight w_{ij} for each i. In the process of learning, one can expect that only a small number of the w_{ij} will become nonzero when low values of the learning rate are used. Computational results confirm this proposition. In fact, if an element resides well inside a subdomain, then it will quite probably stay in the same subdomain along with its neighbors after the learning is over. Thus, we assume that only a small number of weights w_{ij} become nonzero for each i. This number is likely to grow with the number of subdomains for a particular mesh; however, the growth proves to be much slower than linear. This motivates a sparse representation of the weight matrix. From the implementation point of view, a linked list for each i turns out to be appropriate.

Storing only the nonzeros of the weight matrix not only reduces memory requirements, but also reduces the number of necessary floating-point operations. Each of the steps 2-5 then takes only O(N) floating-point operations per epoch. Computational results indicate a slow growth with the number of subdomains, which is caused by the reasons mentioned above and by the need to initialize the output vector o = (o_1, ..., o_{N_d})^t of length N_d to zero N times in each epoch. Overall, the complexity of the algorithm is approximately O(N).
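One possible C realization of the per-element linked lists mentioned above; the struct layout and helper function are illustrative assumptions.

#include <stdlib.h>

/* One nonzero weight w_ij: subdomain index j and its value, chained
 * into a per-element list.  rows[i] points to the list of element i. */
typedef struct wnode {
    int j;                 /* output (subdomain) index        */
    double w;              /* weight value                    */
    struct wnode *next;    /* next nonzero for this element   */
} wnode;

/* Return the node for (i, j), inserting a zero-valued one if absent. */
static wnode *weight_entry(wnode **rows, int i, int j) {
    for (wnode *n = rows[i]; n != NULL; n = n->next)
        if (n->j == j)
            return n;
    wnode *n = malloc(sizeof *n);   /* error handling omitted in sketch */
    n->j = j;
    n->w = 0.0;
    n->next = rows[i];
    rows[i] = n;
    return n;
}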

5.2. Hopfield Network. Contrary to competitive learning, the operation of the Hopfield network is guided by the input dynamics of the neurons. The connection matrix T is built according to the cost function associated with the partitioning problem and kept fixed. The inputs to the neurons are updated in every iteration based on the present states (outputs V_{Xi}) of the neurons. Let us analyze the complexity of the input updates per iteration according to (11). Without loss of generality, (11) can be rewritten as

    u_{Xi}(t+1) = u_{Xi}(t) - 2A\,a_X - B\,b_i + C\,c_{Xi} + (2A - C\,s(X))\,V_{Xi} + A + B\bar{N},

where

    a_X = \sum_{j} V_{Xj},          1 \le X \le N,
    b_i = \sum_{X} V_{Xi},          1 \le i \le N_d,
    c_{Xi} = \sum_{Y \in C(X)} V_{Yi},   1 \le X \le N,  1 \le i \le N_d.

Given u_{Xi}(t) for all X and i, the output states V_{Xi} can be computed with O(N N_d) complexity per iteration. Calculation of a_X requires N_d floating-point operations for any X; b_i can be computed in O(N) operations for any i; and c_{Xi} in s(X) operations for any X and i. Thus, the network can be implemented serially with a complexity of O(N N_d) per time step.


The constants A, B, and C influence the convergence and stability of the Hopfield net in addition to controlling the penalties associated with the different optimization criteria. Since T is symmetric (in order that the Lyapunov function be valid), it can be completely characterized by its eigenvalues λ_i = λ_i(A, B, C) and the corresponding orthogonal eigenvectors e_i. A, B, and C should be chosen such that the network is able to minimize E by moving so as to reduce to zero the components of V in the directions of the eigenvectors e_i with λ_i < 0, and to increase the components in the directions of the e_j with λ_j > 0 [1].

5. Fast Serial Implementation of Neural Networks. The main motivation behind neural net-based computation is that such networks are simple and exhibit massive parallelism suitable for large-scale scientific problems. Neural computations can, however, be expensive to execute on serial machines. The experiments reported in this paper are all performed on a single-processor machine. In this section, implementation issues are discussed for fast serial execution of the neural networks, taking into account the special nature of the mesh partitioning problem.

5.1. Competitive Learning. This type of neural net algorithm involves updating the connection weights from n_i input nodes to n_o output nodes at each iteration (or epoch) for every input pattern. Before explaining the proposed implementation of the competitive learning algorithm, let us first calculate the computational complexity of the algorithm as described in Section 3. Initialization of the weight matrix can be done in O(N N_d) operations. This step is performed only once and presents a constant computational overhead. Steps 2-5 are performed N times in each epoch and involve the presentation of patterns of size N. Therefore, step 2 requires O(N^2 N_d) floating-point operations per epoch. Finding the winner in step 3 takes O(N N_d) operations per epoch. Finally, steps 4 and 5 (winner's weight update and renormalization) can be accomplished in O(N^2) operations per epoch. Thus, the entire algorithm has a complexity of O(N^2 N_d) per epoch, which is unreasonably high for the partitioning of large FE meshes.

The computational complexity of the entire algorithm can be reduced by utilizing the sparse nature of the data as well as the behavior of the algorithm. In FE meshes, each element is adjacent to only a small number of other elements, so the input patterns contain only a few nonzeros. This number is typically a constant depending on the type of elements in the mesh. Making use of this fact, we can cut down the number of operations needed for steps 2, 3, 4, and 5 to O(N N_d), O(N N_d), O(N), and O(N^2) per epoch, respectively. The complexity is now concentrated in the renormalization step 5. The complexity of the renormalization step can be reduced by representing w_{ij} as w_{ij} = a_j \hat{w}_{ij}.

Steps 4 and 5 can now be expressed as

    \hat{w}_{i\hat{j}} = \hat{w}_{i\hat{j}} + \omega p_i / a_{\hat{j}},    i = 1, ..., n_i  such that  p_i \neq 0,     (12)

    a_{\hat{j}} = a_{\hat{j}} / (1 + \omega \sum_i p_i).                                                                (13)

This increases the number of operations in steps 2 and 4 by a small extent, but it does not change the order of complexity, and step 5 is reduced to the normalization of the single number a_{\hat{j}}.


3. Compute the outputs V_i = g(u_i).
4. End, or repeat from 2.
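A minimal serial sketch of these steps, assuming dense storage of T and the piecewise-linear activation (5); it illustrates the update (8) and is not the authors' implementation.

/* Piecewise-linear activation of (5): clamps 0.5 + g0*u to [0, 1]. */
static double g(double u, double g0) {
    double v = 0.5 + g0 * u;
    if (v < 0.0) return 0.0;
    if (v > 1.0) return 1.0;
    return v;
}

/* One synchronous sweep of (8): u <- u + T v + I, then v <- g(u).
 * n is the total number of neurons; T is dense, row-major. */
static void hopfield_sweep(int n, const double *T, const double *I,
                           double *u, double *v, double g0) {
    for (int i = 0; i < n; ++i) {
        double s = I[i];
        for (int j = 0; j < n; ++j)
            s += T[i * n + j] * v[j];
        u[i] += s;
    }
    for (int i = 0; i < n; ++i)
        v[i] = g(u[i], g0);
}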

Let us now develop a cost function for the substructuring problem. Let X, 1 ≤ X ≤ N, denote an element and i, 1 ≤ i ≤ N_d, denote a subdomain. Then, if the element X belongs to the subdomain i, V_{Xi} = 1 and V_{Xj} = 0 for j ≠ i. The part of the cost function arising from this constraint can be written as

    E_s = A \sum_X \Big( \sum_i V_{Xi} - 1 \Big)^2 + A \sum_{X,i} V_{Xi} (V_{Xi} - 1).     (9)

The first term ensures that the sum of the outputs of the N_d neurons associated with an element equals 1. The second term ensures that V_{Xi} = 0 or 1. Thus, the sum of these two terms ensures the validity of the solution, i.e., exactly one of the N_d neurons associated with each element is 1 and the rest are zero. The same penalty A can be associated with both terms, since they are equally important for obtaining a valid solution. A substructuring is desired such that all subdomains are approximately of size \bar{N} = N/N_d and the interface size is minimized. So we need to minimize

    E = E_s + \frac{B}{2} \sum_i \Big( \sum_X V_{Xi} - \bar{N} \Big)^2
            + \frac{C}{2} \sum_{X,i} \sum_{Y \in C(X)} (V_{Xi} - V_{Yi})^2,     (10)

where C(X) denotes the set of elements connected to the element X. Note that the second and third terms are conflicting. The third term is minimized if all neighboring elements belong to the same subdomain, i.e., if there is only one subdomain and the number of interface nodes is minimal; however, this incurs a high penalty in the second term. From Equations (6), (7), and (10),

    \frac{\partial u_{Xi}(t)}{\partial t}
        = -\frac{u_{Xi}(t)}{\tau}
          - 2A \sum_{j \neq i} V_{Xj}
          - B \sum_{Y} V_{Yi}
          + C \Big( \sum_{Y \in C(X)} V_{Yi} - s(X) V_{Xi} \Big)
          + A + B\bar{N},                                              (11)

where s(X) = |C(X)| is the number of elements connected to the element X. If Δt is chosen to be 1 without loss of generality, the input updates can be written as

    u(t+1) = u(t) + T V(t) + I,

where

    T_{Xi,Yj} = -2A\,\delta_{XY}(1 - \delta_{ij}) - B\,\delta_{ij} + C\,(a_{XY} - s(X)\,\delta_{XY})\,\delta_{ij},
    I_{Xi} = A + B\bar{N},

a_{XY} = 1 if Y \in C(X) and 0 otherwise, δ is the Kronecker delta, and

    E = -\frac{1}{2} \sum_{X,i,Y,j} T_{Xi,Yj} V_{Xi} V_{Yj} - \sum_{X,i} I_{Xi} V_{Xi}.


[Fig. 2 shows the n-th neuron receiving the outputs of the other neurons through weights T_{n1}, ..., T_{nn} and a bias input I_n, producing output V_n.]
Fig. 2. Hopfield network.

The activation function constrains the output states between 0 and 1. Here

    g(u) = 0,               u \le -1/(2 g_0),
         = 0.5 + g_0 u,     |u| < 1/(2 g_0),
         = 1,               u \ge 1/(2 g_0).                            (5)

The output of the j-th neuron is fed to the input of the i-th neuron through a weight T_{ij}. In addition, each neuron has an offset bias I_i fed to its input. The state of the i-th neuron, u_i, is updated as a function of the total input to the neuron. The dynamical behavior of the network is governed by the differential equation

    \frac{du_i}{dt} = -\frac{u_i}{\tau} + \sum_j T_{ij} V_j + I_i,     V_j = g(u_j).     (6)

For symmetric T = (T_{ij}), as the system of neurons evolves in time and converges, its stable state corresponds to a minimum of a Lyapunov function defined by

    E = -\frac{1}{2} \sum_{i,j} T_{ij} V_i V_j - \sum_i I_i V_i + \frac{1}{\tau} \sum_i \int_0^{V_i} g^{-1}(V)\, dV.     (7)

If g_0 is large enough that the nonlinearity g(·) asymptotically approaches a step function, the third term can be neglected. The state space over which the system operates is the interior of the N-dimensional hypercube defined by V_i ∈ [0, 1]. However, for a sufficiently high value of g_0 the minima occur only at vertices of the hypercube [10].
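For completeness, the standard argument that the dynamics (6) do not increase E (assuming symmetric T and a nondecreasing g, as in [10]) is

    \frac{dE}{dt} = \sum_i \frac{\partial E}{\partial V_i}\,\frac{dV_i}{dt}
                  = -\sum_i \Big( \sum_j T_{ij} V_j + I_i - \frac{u_i}{\tau} \Big)\frac{dV_i}{dt}
                  = -\sum_i \frac{du_i}{dt}\,\frac{dV_i}{dt}
                  = -\sum_i g'(u_i)\Big(\frac{du_i}{dt}\Big)^2 \le 0,

since V_i = g(u_i) and g' ≥ 0; E is therefore nonincreasing along trajectories and stationary only where du_i/dt = 0.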

The main computational steps can be summarized as follows.
1. Initialize the inputs-outputs u_i-V_i of the network, and specify the connection matrix T for the given problem.
2. Update the inputs according to

    u_i(t+1) = u_i(t) + \sum_j T_{ij} V_j(t) + I_i.     (8)


For the automatic subdomain generation problem, the input nodes of the competitive learning network correspond to the elements of the domain, and the output units to the subdomains. After presenting a sufficient number of input patterns, the element i is assigned to the subdomain ĵ if w_{iĵ} ≥ w_{ij} for all j ≠ ĵ. Input patterns are created to represent the element-to-element adjacency in the mesh. Possible choices of input patterns are:
1. Dipole patterns: p_i = p_k = 1 if elements i and k are adjacent, and p_l = 0 for l ≠ i, k. Thus, the training set consists of patterns for all possible element-to-element adjacencies.
2. Multipole patterns: p_{k_1} = p_{k_2} = ... = p_{k_m} = 1 iff elements k_1, ..., k_m are adjacent to the element i. The training set size is equal to the number of elements in the mesh.
3. Weighted multipole patterns: essentially the same as multipole patterns, but the input component magnitudes are proportional to the number of nodes shared between adjacent elements.
Since in finite element meshes each element is connected to only a few neighboring elements, multipole and weighted multipole patterns are quite sparse and most suitable for the automatic substructuring problem. It can be shown for the dipole and multipole types of input patterns that in some special cases (uniform element-to-element adjacency), the competitive learning mechanism solves the graph partitioning problem corresponding to the element-to-element adjacency so that the number of interconnections between partitions is minimal [19]. Good suboptimal solutions can be achieved for most practical problems.
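As an illustration, a (weighted) multipole pattern is just the indicator vector of an element's adjacency list, optionally scaled by the number of shared nodes. The sketch below assumes a CSR-style adjacency representation; the names adj_ptr, adj, and shared are illustrative, not taken from the paper.

/* Build the (weighted) multipole pattern for element X into p[0..N-1].
 * adj_ptr[X] .. adj_ptr[X+1]-1 index the neighbours of X in adj[],
 * and shared[k] is the number of FE nodes shared with neighbour adj[k]. */
static void multipole_pattern(int N, int X,
                              const int *adj_ptr, const int *adj,
                              const int *shared, int weighted, double *p) {
    for (int i = 0; i < N; ++i)
        p[i] = 0.0;                       /* patterns are very sparse */
    for (int k = adj_ptr[X]; k < adj_ptr[X + 1]; ++k)
        p[adj[k]] = weighted ? (double)shared[k] : 1.0;
}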

4. Substructuring by Quadratic Optimization. In this section, substructuring of FE meshes is formulated as an optimization problem and mapped onto a Hopfield-type [10] neural network. The objective function is the same as the Lyapunov function of the network, with the synaptic interconnection weights between the neurons representing the objective function associated with the substructuring problem. The Lyapunov function represents the collective behavior of the network; when the network is at its stable state, the energy function is at a local minimum. The basic idea of this type of neural network is to make a cooperative (global) decision based on the simultaneous input of a whole community of neurons, in which each neuron receives information from and gives information to every other neuron (the sparse implementation developed in the next section does not always take into account the entire set of neurons, in order to reduce the complexity for serial execution). This information is used by each neuron to force the network to converge to a stable state in order to make a decision.

The Hopfield net is constructed by connecting a large number of simple processing units to each other, as shown in Fig. 2. V_i denotes the output state of the i-th neuron, i.e., V_i = g(u_i), where g(·) is the activation function of the neurons and is monotonic in nature. u_i is the input to the i-th neuron. g(·) is a nonlinear or piecewise linear function used to constrain the output states between 0 and 1.


The computational steps can be described as follows.
1. Initialize the weights so that they are normalized,

    \sum_{i=1}^{n_i} w_{ij} = 1,     j = 1, ..., n_o.     (1)

2. Select randomly one of the input vectors p and compute the output vector o = (o_1, ..., o_{n_o})^t,

    o_j = \sum_{i=1}^{n_i} w_{ij} p_i,     j = 1, ..., n_o.     (2)

3. Find the winner unit

    ĵ = arg max_{j = 1, ..., n_o} o_j.

4. Adjust the winner's weights,

    w_{iĵ} = w_{iĵ} + \omega p_i,     i = 1, ..., n_i,     (3)

where the parameter ω is called the learning rate. Moderate values of the learning rate ensure the stability of the network. Thus, for each input pattern, the winner ĵ is found and the weights w_{iĵ} are updated to move the weight vector w_{ĵ} closer to the present input pattern.
5. Renormalize the winner's weights so that condition (1) holds, i.e.,

    w_{iĵ} = w_{iĵ} / (1 + \omega \sum_i p_i),     i = 1, ..., n_i.     (4)

The incremental weight updating rule in step 4 makes the winner ĵ more likely to win on that specific input pattern in the future. It also makes the weights grow without bound, so that one unit would eventually come to dominate the competition for all inputs; the renormalization step is necessary to ensure fair competition among the output nodes.
6. End, or repeat from 2.

Training stops when the network reaches a statistically stable state, which corresponds to the condition that the average change in the connection weights w_{ij} is zero. At equilibrium, an output node responds most strongly to patterns that overlap other patterns to which the node responds, and most weakly to patterns that are far from the patterns to which it responds. If the patterns are highly structured, the classifications are highly stable in the sense that the same units always respond to the same patterns. Grossberg in fact proved that if the patterns are sufficiently sparse, and/or when there are sufficiently many output nodes, the competitive learning network converges to a so-called perfectly stable state [8].
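A minimal serial sketch of steps 2-5 for a single presented pattern, with dense weight storage; the identifiers and layout are illustrative assumptions, not the authors' TOP/DOMDEC code.

/* One presentation of pattern p (length ni) to a competitive learning
 * network with weights w[i*no + j] (ni inputs, no outputs) and
 * learning rate omega: output, winner, update, renormalization. */
static void cl_present(int ni, int no, double *w, const double *p,
                       double omega) {
    int jhat = 0;
    double best = -1.0, psum = 0.0;

    /* Steps 2-3: o_j = sum_i w_ij p_i, winner jhat = argmax_j o_j. */
    for (int j = 0; j < no; ++j) {
        double oj = 0.0;
        for (int i = 0; i < ni; ++i)
            oj += w[i * no + j] * p[i];
        if (oj > best) { best = oj; jhat = j; }
    }

    /* Step 4: move the winner's weight vector towards the pattern (3). */
    for (int i = 0; i < ni; ++i) {
        w[i * no + jhat] += omega * p[i];
        psum += p[i];
    }

    /* Step 5: renormalize the winner's column so it sums to 1 again (4). */
    for (int i = 0; i < ni; ++i)
        w[i * no + jhat] /= 1.0 + omega * psum;
}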


[Fig. 1 shows an input layer of element nodes p_1, ..., p_{n_i} connected by adjustable weights w_{ij} to an output layer of subdomain nodes o_1, ..., o_{n_o}.]
Fig. 1. Competitive learning network.

All of the above-mentioned mesh partitioning algorithms are sequential or medium-grain parallel in nature, which may not be efficient for really large-scale problems (10^5-10^6 elements). In this paper, we propose two fine-grain algorithms based on neural networks for the mesh partitioning problem. An initial partitioning is obtained using Farhat's greedy algorithm. Competitive learning and Hopfield net-based algorithms are then applied to generate optimized mesh partitions. Sophisticated implementations are developed such that the inherently parallel NN-based techniques can be run on serial machines with linear complexity. Sparse implementation issues are also addressed to reduce both complexity and memory requirements.

3. Substructuring by Competitive Learning. In this section, we propose a simple competitive learning-based technique for 3-D unstructured FE mesh partitioning. Competitive learning is an unsupervised algorithm, based on a nonassociative statistical learning principle, and is well suited for regularity detection [19]. It consists of a set of input nodes connected to a set of output nodes via adjustable weights, as shown in Fig. 1. An output node responds most strongly to a particular input pattern if the weights to that output node (also called the "winner") resemble the pattern vector most closely. The inner product of the normalized weight vector and the given pattern vector is a standard choice for the "closeness" metric. Learning involves moving the weights to the winner output node towards the input pattern components, and eventually the network discovers the clusters of overlapping input vectors [9, p. 220]. For the mesh partitioning problem, the decomposition of the mesh into subdomains is stored in the connection weight matrix of the network. Let n_i denote the number of input units and n_o the number of output units. w_{ij} is the weight from the input node i to the output node j. The vector p = (p_1, ..., p_{n_i}) denotes one of the input patterns. Then the computational steps can be described as follows.


2. Mesh Partitioning. Substructuring a finite element mesh is in essence a graph partitioning problem, where every element in the mesh can be viewed as a vertex in the graph. The desired number of subdomains is the number of partitions in the graph. The graph partitioning problem can be formally stated as follows. Given an undirected graph G = (V, E) with a set of vertices V = {v_1, ..., v_N} and a set of edges E, a partitioning is desired such that V = \cup_{i=1}^{N_d} V_i, where V_i \cap V_j = \emptyset for i ≠ j; |V_i| ≈ |V_j| for all i, j; and the total cut-size \sum_{i \neq j} |E_{i,j}|, where E_{i,j} = {(a, b) \in E | a \in V_i, b \in V_j}, is minimized. N is the number of vertices in the graph, and N_d the desired number of partitions.

Several neural net algorithms have been proposed for solving the graph partitioning problem [9, 16, 19]. For our applications, such as analyzing an aircraft fuselage, the number of elements is in the thousands, or even millions, and the number of subdomains may vary approximately as the square root of the number of elements. Moreover, partitioning is just a preprocessing stage, and the amount of time spent in this stage must be at most of the order of the actual domain decomposition computations. So it is essential to develop an algorithm that is fast and has approximately linear complexity with respect to the problem size, i.e., the number of elements in the FE mesh.

With the availability of parallel machines, a large number of different algorithms for mesh partitioning have been investigated. In general, these algorithms can be broadly categorized into two groups: (1) engineering-based and (2) graph-theoretic. Farhat proposed [5] a simple greedy algorithm which can generate reasonably balanced partitionings of FE meshes with great speed. In this technique, given a mesh and the desired number of subdomains, a seed element is chosen in the unassigned part of the mesh, and the subdomain is grown by adding adjacent elements until the number of elements belonging to the subdomain is approximately equal to \bar{N} = N/N_d. This process is repeated for all subdomains. The algorithm is inherently sequential in nature but very fast. Simulated annealing has been used by Williams [22] and Nour-Omid et al. [15]. FE meshes for structural analysis have also been partitioned using the "peeling" algorithm, and by a bisection method based on the centroid of a structure and its principal directions [18]; explicit geometric information is needed in these techniques. Vaughn applied the Kernighan-Lin algorithm to mesh partitioning [21]. This algorithm begins with an initial partition of the graph into two subsets, which differ in their sizes by at most one. At each iteration, the algorithm chooses two candidate subsets of equal size to swap between the initial two subsets, thereby reducing the cut-size between the two. The algorithm terminates when it is no longer possible to decrease the cut-size by swapping subsets. Though the KL algorithm generates nearly optimal solutions, it suffers from a lack of speed (the complexity is about O(N^2 log(N)) per iteration). Pothen et al. proposed an elegant method, called recursive spectral bisection (RSB) [17], based on Fiedler's eigenvalue properties of graphs [6]. The RSB method requires computation of the eigenvector corresponding to the second smallest eigenvalue of the Laplacian matrix associated with the graph of the problem. The Laplacian is a sparse, symmetric, positive semidefinite matrix of the same order as the problem. The RSB method is in a certain sense an optimal graph partitioning algorithm [2]. However, in spite of strong theoretical foundations, it is not the best possible algorithm in some practical situations. It is at best a medium-grain parallel algorithm.


Large systems of equations resulting from FE discretizations are increasingly being solved on highly parallel machines and clusters of high-end workstations. The solution of such large systems of equations on parallel machines calls for innovative iterative solution techniques such as domain decomposition and multigrid methods [20]. Domain decomposition is a class of powerful iterative methods for solving a system of linear algebraic equations arising from a finite element discretization of a linear, elliptic, self-adjoint boundary value problem, and it often achieves fast convergence for large problems. It is based on a combination of local computations on the desired discretization and smaller non-local problems that exchange information between distant parts of the system. Consider a system of linear algebraic equations Ax = b resulting from an FE discretization on a domain Ω. Let the domain be split into non-overlapping subdomains Ω_1, Ω_2, ..., Ω_{N_d}, each of which is the union of some finite elements. The main idea is to reduce the problem by eliminating the DOF associated with the interiors of the subdomains. The reduced problem is then solved by preconditioned conjugate gradients [13].
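As a concrete illustration of the elimination of the interior DOF (a standard formulation, not a formula taken from this paper), split the unknowns into interior (I) and interface (B) blocks:

    A = \begin{pmatrix} A_{II} & A_{IB} \\ A_{BI} & A_{BB} \end{pmatrix}, \qquad
    S = A_{BB} - A_{BI} A_{II}^{-1} A_{IB}, \qquad
    S\,x_B = b_B - A_{BI} A_{II}^{-1} b_I .

A_{II} is block diagonal over the subdomains, so its inverse action parallelizes, and the preconditioned conjugate gradient iteration of [13] is applied to the reduced interface system.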

Execution of FE problems on distributed memory machines requires that their data structures be partitioned and distributed across processors. Even on virtual shared memory machines, mesh partitioning is desirable for the sake of data locality and enhanced performance [5]. If the mesh elements (and corresponding nodes) are allocated to different processors, interprocessor communication takes place (along the edges) when a computation involves nodal variables allocated to neighboring processors. Thus, interprocessor communication can be reduced by minimizing the interface length in the partitioned mesh. For explicit solution methods, e.g., time-marching algorithms or global iterative methods, computations on the nodes residing on a single processor are followed by non-local computation related to the interfaces. Thus, load balance in the number of variables per processor is at a premium for processor synchronization [18]. However, for domain decomposition techniques, most of the computation involves the variables on the interface nodes. Therefore, for domain decomposition-based iterative methods, a substructuring is desired that minimizes the number of interface nodes between subdomains (whose number is often determined by the number of available processors) of approximately the same size. Such an optimization criterion also generates smooth interfaces and improves the aspect ratios of the resulting subdomains. Aspect ratios close to one and smooth interfaces in turn enhance the numerical performance of the subsequent domain decomposition-based iterative solvers [4].

Section 2 reviews some important partitioning techniques. Competitive learning and Hopfield net-based mesh partitioning techniques are proposed in Sections 3 and 4, respectively. Implementation issues are discussed in Section 5 for fast serial execution of the neural net-based partitioning algorithms. Experimental results are presented in Section 6. Finally, we summarize the present investigation and mention some future research issues in Section 7.


AUTOMATIC SUBSTRUCTURING FOR DOMAIN DECOMPOSITION USING NEURAL NETWORKS

S. GHOSAL, JAN MANDEL, AND RADEK TEZAUR

Abstract. The solution of large finite element problems on distributed memory (even shared virtual memory) computers calls for efficient partitioning of large and unstructured 3-D meshes into submeshes, such that computations can be distributed across processors. It is desired that the resulting subdomains (submeshes) are approximately of the same size and that the total number of interface nodes between adjacent subdomains is minimal. Two fine-grain scalable parallel algorithms are proposed, employing neural network paradigms, that can efficiently perform mesh partitioning for subsequent domain decomposition computations. New implementations are developed such that both techniques have almost linear complexity with respect to the problem size for serial execution. These substructuring techniques compare favorably to the well-known recursive spectral bisection (RSB) method and a simulated annealing-based partitioning algorithm.

Key words. mesh partitioning, domain decomposition, neural networks, competitive learning, Hopfield network, parallel processing, scalable algorithms

1. Introduction. Artificial neural networks (ANN) are parallel systems with a large number of interconnected processing elements or "neurons", where desired input/output mappings are obtained by adapting (training) the interconnections (also called connection weights or synaptic strengths) according to a suitable learning algorithm [12]. With the advent of massively parallel computer architectures, neural network-guided solutions have become a reality in applications such as forecasting, pattern recognition, diagnostics, and automatic control. Although ANNs can serve as models for further understanding of brain functions, as a computational tool they can provide fast solutions to large scientific problems because of their parallel nature and learning capabilities. The focus of this article is on one such application of neural network paradigms in high-performance computing, namely unstructured 3-D finite element (FE) mesh partitioning for domain decomposition.

FE analysis has evolved into a powerful tool in diverse scientific disciplines, e.g., computational fluid dynamics and computational structural analysis, because of its simplicity and general nature. However, almost all commercial FE solvers are of a direct nature. Their solution time increases proportionally to the square of the number of degrees of freedom (NDOF) or worse, and their storage requirement grows approximately proportionally to (NDOF) for large problems [14]. "Smart" iterative solvers, suitable for implementation on parallel computers, on the other hand require less storage. Their solution time grows almost only linearly with the problem size, but their performance is much more problem dependent. This work deals with the application of fast artificial neural network-based techniques for automatic substructure generation for iterative solvers.

This paper is partially based on a presentation at the IEEE Conference on Neural Networks [7].
Center for Computational Mathematics, University of Colorado at Denver, Denver, CO 80217. This research was supported by National Science Foundation grants ASC-9121431 and ASC-9217394.
