7/28/2019 Java Performance in FEM
JAVA PERFORMANCE IN FINITE ELEMENT COMPUTATIONS
G.P. NIKISHKOV
University of Aizu
Aizu-Wakamatsu, Fukushima 965-8580, Japan
email: [email protected]
ABSTRACT 1
The performance of the developed Java finite element code
is compared to that of the C finite element code on the
solution of three-dimensional elasticity problems on an Intel
Pentium 4 computer. Untuned Java code is approximately
two times slower than the analogous C code. It is shown that
code tuning with a blocking technique can raise the
Java/C performance ratio to 90% for the LDU solution of
finite element equations. Java performance for the PCG iterative
solution algorithm tuned by inner loop unrolling is 75% of
that of the C code. We recommend using Java Virtual Machine
(JVM) 1.2, since in many cases it is considerably faster in
finite element computations than JVMs 1.3 and 1.4.
KEY WORDS
Finite Element Methods, Java-based Simulation, Performance,
Tuning
1 Introduction
Finite element codes were traditionally developed in Fortran [1]
and, more recently, in Fortran 90 [2]. During the last decade
FEM developers started using the C++ language in order to
handle complexity in finite element software [3-5]. The
object-oriented approach, with data hiding, encapsulation
and inheritance, allows creating reliable and extensible
finite element codes.
The Java language [6], developed by Sun Microsystems,
possesses features that make it attractive for use in
computational modelling. Java is a simple language with a
rich collection of libraries implementing various APIs
(Application Programming Interfaces). With Java it is easy to
create Graphical User Interfaces and to communicate with
other computers over a network. Java has a built-in garbage
collector that prevents memory leaks. Another advantage of
Java is its portability. Java Virtual Machines (JVM) [7] are
developed for all major computer systems. A JVM is
embedded in most popular Web browsers. Java applets can
be downloaded over the network and executed within a Web
browser. While object-oriented programming can be done
with the C++ language, other useful features such as actual
portability and garbage collection are unique characteristics
of the Java language.
1Applied Simulation and Modeling, Procs of the 12th IASTED Int.
Conf., Sept. 3-5, 2003, Marbella, Spain., ACTA Press, Anaheim, 2003,
pp. 130-135.
Despite its attractive features, Java is not widely used
in engineering computations. Translation of Java bytecode
into native instructions makes Java code run more slowly.
However, a Just-In-Time (JIT) compiler can significantly
speed up the execution of Java applications and applets.
The JIT compiler, which is an integral part of the JVM, takes
the bytecodes and compiles them into native code before
execution. Since Java is a dynamic language, the JIT compiles
methods on a method-by-method basis just before
they are called. If the same method is called many times,
or if the method contains a loop with many repetitions, the
effect of re-executing the native code can make the
performance of Java code acceptable.
Java performance in numerical computing was considered
in several publications [8-10]. It was shown that
high-performance numerical codes could be developed in
Java with suitable code development techniques. While
papers [8-10] deal with general issues of numerical computing,
this paper addresses Java performance and tuning in
finite element computations. We present our experience in
designing an efficient finite element code in Java. The
performance of the developed Java finite element code is
compared to that of the analogous C code on finite element
solutions of three-dimensional elasticity problems on an Intel
computer. For running the Java code we employed Sun JVMs
1.2, 1.3 and 1.4. It is shown that with proper coding and
JVM selection the Java finite element code can be almost
as fast as the C code.
2 Java Finite Element Code
The object-oriented approach is widely used in order to create
reusable, extensible, and reliable components, which can
be used in later research and practical applications. However,
a fully object-oriented programming approach might not
always be ideal for computationally intensive sections of
codes. Object creation and destruction in Java are expensive
operations. The use of a large number of small objects
can lead to considerable time and space overhead. As
experiments show, a possible way to increase computing
performance is to reduce the expense of object creation in the
code by using primitive types in place of objects.
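As a minimal illustration of this point, the sketch below stores the same nodal coordinates once as an array of small objects and once as a single flat array of primitives. The names NodeStorageDemo, Point3, sumObjects and sumPrimitive are hypothetical and are not taken from the code described in this paper.

```java
// Hypothetical illustration; class and method names are not from the paper.
final class NodeStorageDemo {

    /** Object-based storage: one small heap object per node. */
    static final class Point3 {
        final double x, y, z;

        Point3(double x, double y, double z) {
            this.x = x;
            this.y = y;
            this.z = z;
        }
    }

    /** Sums all coordinates of nodes stored as individual objects. */
    static double sumObjects(Point3[] nodes) {
        double s = 0.0;
        for (Point3 p : nodes) {
            s += p.x + p.y + p.z;   // one extra indirection per node
        }
        return s;
    }

    /** Sums all coordinates stored in one flat array (x0, y0, z0, x1, ...). */
    static double sumPrimitive(double[] xyz) {
        double s = 0.0;
        for (int i = 0; i < xyz.length; i++) {
            s += xyz[i];            // consecutive access, no per-node object
        }
        return s;
    }
}
```

For N nodes the first layout needs N + 1 allocations (the array plus one object per node), while the second needs a single allocation of 3N doubles.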
For a variable of a primitive type the JVM allocates
the variable directly on the stack (local variable) or within
the memory used for the object (member variable). For
such variables there is no object creation overhead, and no
than 3. Material parameters λ and μ are Lamé elastic constants.
In our computer code, integration of the stiffness matrix [k]
for the 20-node element is performed using a special
14-point integration rule. Since the element stiffness matrix
is symmetric, only one symmetric half of the matrix,
together with the diagonal coefficients, is computed and then
used for assembly of the global stiffness matrix.
Assembly of the global stiffness matrix is performed
with the use of element connectivity information. The
assembly algorithm depends on the storage format of the finite
element equation system.
3.2 LDU Solution of Equation System
The symmetric part of the global stiffness matrix of order n
is stored in a profile form by columns. Each column of
the matrix starts from the first (top) nonzero element and
ends at the diagonal element. The matrix is represented
by two arrays: a one-dimensional double array a, containing
the matrix elements, and a pointer array pcol. Assuming that
array indices begin from one, the ith element of pcol contains
the index in the array a of the first element of the ith
column minus one. The length of the ith column is given
by pcol[i+1]-pcol[i]. The length of the array a is
equal to pcol[n+1]. The location (row number) of the
first nonzero element in the ith column of the matrix [A] is
given by the function FN(i):
FN(i) = i-(pcol[i+1]-pcol[i])+1.
The following correspondence relation can be easily obtained
for a transition from two-index matrix notation to
one-dimensional array notation:
a[i,j] → a[i+pcol[j+1]-j].
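The storage scheme and the two index relations above can be sketched in Java as follows. This is only an illustration under stated assumptions: the class name ProfileMatrix, its constructor argument colHeight and the helper methods are hypothetical, and the one-based indexing of the text is kept by leaving entry 0 of each array unused.

```java
// Hypothetical sketch of the profile storage described in the text.
// One-based indices are kept; entry 0 of each array is unused.
final class ProfileMatrix {
    final int n;        // matrix order
    final int[] pcol;   // pcol[j] = index in a of the first element of column j, minus one
    final double[] a;   // columns stored one after another, each ending at the diagonal

    /** colHeight[j], j = 1..n, is the number of stored elements in column j. */
    ProfileMatrix(int n, int[] colHeight) {
        this.n = n;
        pcol = new int[n + 2];
        for (int j = 1; j <= n; j++) {
            pcol[j + 1] = pcol[j] + colHeight[j];
        }
        a = new double[pcol[n + 1] + 1];   // pcol[n+1] elements plus the unused entry 0
    }

    /** FN(j): row number of the first stored element of column j. */
    int fn(int j) {
        return j - (pcol[j + 1] - pcol[j]) + 1;
    }

    /** Position in a of the matrix element (i, j), valid for FN(j) <= i <= j. */
    int index(int i, int j) {
        return i + pcol[j + 1] - j;
    }

    double get(int i, int j) {
        return a[index(i, j)];
    }

    void set(int i, int j, double v) {
        a[index(i, j)] = v;
    }
}
```

For example, with n = 3 and column heights 1, 2, 2 the pointer array is pcol = (0, 1, 3, 5), so FN(3) = 2 and the element (2,3) is stored at position 4 of the array a.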
Solution of a symmetric equation system consists of the
[U]^T[D][U] decomposition of the system matrix followed
by forward reduction and backsubstitution for the right-hand
side. The [U]^T[D][U] decomposition takes the majority
of the computing time. The right-looking algorithm of the
decomposition can be presented as the following pseudo-code:
do j=2,n
  Cdivt(j)
  do i=j,n
    Cmod(j,i)
  end do
end do
do j=2,n
  Cdiv(j)
end do

Cdivt(j) =
  do i=FN(j),j-1
    t[i] = a[i,j]/a[i,i]
  end do

Cmod(j,i) =
  do k=max(FN(j),FN(i)),j-1
    a[j,i] -= t[k]*a[k,i]
  end do

Cdiv(j) =
  do i=FN(j),j-1
    a[i,j] /= a[i,i]
  end do
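A direct Java transcription of this pseudo-code might look like the sketch below. It is not the author's implementation: the class name ProfileLdu and the method signatures are assumptions, and one-based indices are kept (entry 0 of every array is unused) so that each loop can be compared with the pseudo-code line by line.

```java
// Hypothetical sketch: untuned [U]^T[D][U] decomposition of a profile matrix.
// One-based indices as in the text; entry 0 of every array is unused.
final class ProfileLdu {

    /** FN(j): row number of the first stored element of column j. */
    static int fn(int[] pcol, int j) {
        return j - (pcol[j + 1] - pcol[j]) + 1;
    }

    /** Overwrites a with the factors: D on the diagonal, U above it. */
    static void decompose(double[] a, int[] pcol, int n) {
        double[] t = new double[n + 1];
        for (int j = 2; j <= n; j++) {
            int fnj = fn(pcol, j);
            int colJ = pcol[j + 1] - j;           // a[i,j] is a[i + colJ]
            // Cdivt(j): t[i] = a[i,j] / a[i,i]; a[i,i] is a[pcol[i+1]]
            for (int i = fnj; i <= j - 1; i++) {
                t[i] = a[i + colJ] / a[pcol[i + 1]];
            }
            // Cmod(j,i) for all columns i >= j
            for (int i = j; i <= n; i++) {
                int colI = pcol[i + 1] - i;       // a[k,i] is a[k + colI]
                for (int k = Math.max(fnj, fn(pcol, i)); k <= j - 1; k++) {
                    a[j + colI] -= t[k] * a[k + colI];
                }
            }
        }
        // Cdiv(j): scale the off-diagonal entries by the diagonal
        for (int j = 2; j <= n; j++) {
            int colJ = pcol[j + 1] - j;
            for (int i = fn(pcol, j); i <= j - 1; i++) {
                a[i + colJ] /= a[pcol[i + 1]];
            }
        }
    }
}
```

The innermost loop over k is empty whenever FN(i) is not smaller than j, so elements outside the profile of column i are never touched.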
The do loop that takes most of the time of the LDU decomposition
is contained in the procedure Cmod(j,i). One column of
the matrix is used to modify another column inside the inner
do loop. Two operands must be loaded from memory in
order to perform one Floating-point Multiply-Add (FMA)
operation. Data loads can be economized by tuning with
a blocking technique. After unrolling the two outer
loops, the tuned version of the LDU decomposition is as
follows:
do j=1,n,d
  Bdivt(j,d)
  do i=j+d,n,d
    BBmod(j,i,d)
  end do
end do
do j=2,n
  Cdiv(j)
end do

Bdivt(k,d) =
  do j=k,k+d-1
    do i=FN(k),j-1
      t[i,j] = a[i,j]/a[i,i]
    end do
    do i=j,k+d-1
      do l=max(FN(j),FN(i)),j-1
        a[j,i] -= t[l,j]*a[l,i]
      end do
    end do
  end do

BBmod(j,i,d=2) =
  do k=max(FN(j),FN(i)),j-1
    a[j,i] -= t[k,j]*a[k,i]
    a[j+1,i] -= t[k,j+1]*a[k,i]
    a[j,i+1] -= t[k,j]*a[k,i+1]
    a[j+1,i+1] -= t[k,j+1]*a[k,i+1]
  end do
  if j>=FN(i) then
    a[j+1,i] -= t[j,j+1]*a[j,i]
    a[j+1,i+1] -= t[j,j+1]*a[j,i+1]
  end if
Method BBmod(j,i,d) performs modification of a column
block that starts from column i by a column block
that starts from column j and contains d columns. The
pseudo-code above is given for the block size d = 2 for
brevity. In the three-dimensional problems solved
here, the block size d = 3 is used. It is assumed that the
columns in a block start at the same row of the matrix
a. This is fulfilled automatically if the column block
contains columns that are related to one node of the finite
element model.
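To make the data reuse in the blocked update concrete, the following Java sketch implements the decomposition for d = 2 and, for comparison only, the unblocked version. It is a sketch under assumptions rather than the code benchmarked in this paper: the class and method names are hypothetical, n must be even, both columns of a block must start at the same row, and the update is skipped when row j lies outside the profile of column block i. One-based indices are kept, with entry 0 of every array unused.

```java
// Hypothetical sketch of the blocked decomposition for d = 2 (names are illustrative).
final class BlockedProfileLdu {

    static int fn(int[] pcol, int j) {            // FN(j)
        return j - (pcol[j + 1] - pcol[j]) + 1;
    }

    static int idx(int[] pcol, int i, int j) {    // position of a[i,j] in a
        return i + pcol[j + 1] - j;
    }

    /** Unblocked reference version, included only for comparison. */
    static void decomposeUnblocked(double[] a, int[] pcol, int n) {
        double[] t = new double[n + 1];
        for (int j = 2; j <= n; j++) {
            for (int i = fn(pcol, j); i < j; i++) {
                t[i] = a[idx(pcol, i, j)] / a[pcol[i + 1]];
            }
            for (int i = j; i <= n; i++) {
                for (int k = Math.max(fn(pcol, j), fn(pcol, i)); k < j; k++) {
                    a[idx(pcol, j, i)] -= t[k] * a[idx(pcol, k, i)];
                }
            }
        }
        divideByDiagonal(a, pcol, n);
    }

    /** Blocked version, d = 2. */
    static void decomposeBlocked(double[] a, int[] pcol, int n) {
        double[][] t = new double[2][n + 1];      // t[c][l] stores t[l, j + c]
        for (int j = 1; j <= n; j += 2) {
            bdivt(a, pcol, t, j);
            for (int i = j + 2; i <= n; i += 2) {
                bbmod(a, pcol, t, j, i);
            }
        }
        divideByDiagonal(a, pcol, n);
    }

    /** Cdiv for all columns: a[i,j] /= a[i,i]. */
    static void divideByDiagonal(double[] a, int[] pcol, int n) {
        for (int j = 2; j <= n; j++) {
            for (int i = fn(pcol, j); i < j; i++) {
                a[idx(pcol, i, j)] /= a[pcol[i + 1]];
            }
        }
    }

    /** Bdivt(k, 2): fills t for columns k, k+1 and finishes the diagonal block. */
    static void bdivt(double[] a, int[] pcol, double[][] t, int k) {
        int first = fn(pcol, k);
        for (int j = k; j <= k + 1; j++) {
            double[] tj = t[j - k];
            for (int i = first; i < j; i++) {
                tj[i] = a[idx(pcol, i, j)] / a[pcol[i + 1]];
            }
            for (int i = j; i <= k + 1; i++) {
                for (int l = first; l < j; l++) {
                    a[idx(pcol, j, i)] -= tj[l] * a[idx(pcol, l, i)];
                }
            }
        }
    }

    /** BBmod(j, i, 2): updates rows j, j+1 of columns i, i+1. */
    static void bbmod(double[] a, int[] pcol, double[][] t, int j, int i) {
        int fni = fn(pcol, i);
        if (fni > j) {
            return;                               // rows j, j+1 lie outside the profile
        }
        int c0 = pcol[i + 1] - i;                 // a[k,i]   is a[k + c0]
        int c1 = pcol[i + 2] - (i + 1);           // a[k,i+1] is a[k + c1]
        double a00 = a[j + c0], a10 = a[j + 1 + c0];
        double a01 = a[j + c1], a11 = a[j + 1 + c1];
        for (int k = Math.max(fn(pcol, j), fni); k < j; k++) {
            double ak0 = a[k + c0], ak1 = a[k + c1];   // each loaded once, used twice
            double tk0 = t[0][k], tk1 = t[1][k];
            a00 -= tk0 * ak0;
            a10 -= tk1 * ak0;
            a01 -= tk0 * ak1;
            a11 -= tk1 * ak1;
        }
        a10 -= t[1][j] * a00;                     // k = j term for row j + 1
        a11 -= t[1][j] * a01;
        a[j + c0] = a00;
        a[j + 1 + c0] = a10;
        a[j + c1] = a01;
        a[j + 1 + c1] = a11;
    }
}
```

In the inner loop of bbmod four operands are loaded for four multiply-add operations, compared with two loads per multiply-add in the unblocked loop.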
3.3 PCG Solution of Equation System
The preconditioned conjugate gradient (PCG) method is an
iterative procedure that does not alter the equation matrix.
Because of this, it is sufficient to store only the nonzero
coefficients of the finite element global stiffness matrix.
The sparse structure of the matrix should be taken into
account in matrix-vector multiplications.
We use the sparse row format for the equation matrix.
In this format all information about the matrix is contained in
three arrays:
a - array of doubles containing non-zero elements
of the matrix, row by row;
col - array of column indices for non-zero elements
of the array a;
prow - pointer array of indices of starting elements
of matrix rows in the array a, again assuming that
indices start from one.
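The three arrays can be illustrated with a small Java sketch that builds them from a dense matrix. The class name SparseRowMatrix, its constructor and the get method are hypothetical and serve only to show the layout; one-based indices are kept, so entry 0 of each array is unused and prow[n+1] is assumed to hold the number of stored elements plus one.

```java
// Hypothetical sketch of the sparse row format (one-based, entry 0 unused).
final class SparseRowMatrix {
    final int n;        // matrix order
    final double[] a;   // non-zero values, row by row
    final int[] col;    // col[p] = column number of the value a[p]
    final int[] prow;   // prow[j] = position in a of the first value of row j

    /** Builds the three arrays from a dense, zero-based square matrix. */
    SparseRowMatrix(double[][] dense) {
        n = dense.length;
        int nnz = 0;
        for (double[] row : dense) {
            for (double v : row) {
                if (v != 0.0) {
                    nnz++;
                }
            }
        }
        a = new double[nnz + 1];
        col = new int[nnz + 1];
        prow = new int[n + 2];
        int pos = 1;
        for (int j = 1; j <= n; j++) {
            prow[j] = pos;
            for (int c = 1; c <= dense[j - 1].length; c++) {
                double v = dense[j - 1][c - 1];
                if (v != 0.0) {
                    a[pos] = v;
                    col[pos] = c;
                    pos++;
                }
            }
        }
        prow[n + 1] = pos;   // one past the last stored value
    }

    /** Returns the element (i, j), or 0 if it is not stored. */
    double get(int i, int j) {
        for (int p = prow[i]; p < prow[i + 1]; p++) {
            if (col[p] == j) {
                return a[p];
            }
        }
        return 0.0;
    }
}
```

For the 3 × 3 matrix with rows (4 0 1), (0 5 0), (2 0 6) this gives a = (4, 1, 5, 2, 6), col = (1, 3, 2, 1, 3) and prow = (1, 3, 4, 6).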
Preconditioning techniques are not the subject of this work.
Simple diagonal preconditioning is used in our PCG solution
procedure for finite element equations. The most time-consuming
operation in the PCG solution procedure is the
sparse matrix-vector product inside the iteration loop.
Matrix-vector multiplication for a matrix [A] in sparse row
format is performed as follows:
do j=1,n
y[j] = 0
do i=prow[j],prow[j+1]-1
y[j] = y[j] + a[i]*x[col[i]]
end do
end do
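In Java the same loop could be written as in the sketch below. The class name SparseMatVec and the method signature are assumptions made for illustration; arrays are one-based with entry 0 unused, as in the pseudo-code.

```java
// Hypothetical sketch: y = [A] x for a matrix in sparse row format.
// One-based arrays as in the text; entry 0 of every array is unused.
final class SparseMatVec {

    static void multiply(int n, double[] a, int[] col, int[] prow,
                         double[] x, double[] y) {
        for (int j = 1; j <= n; j++) {
            double s = 0.0;                      // accumulate row j in a local variable
            for (int i = prow[j]; i < prow[j + 1]; i++) {
                s += a[i] * x[col[i]];           // x is accessed non-consecutively
            }
            y[j] = s;
        }
    }
}
```

The values a[i] are read consecutively, while the operand x[col[i]] is fetched through an index array, which is the non-consecutive access pattern referred to in the text.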
Experience with tuning C codes shows that little can be
done to speed up the sparse matrix-vector product. To our
surprise, the following simple inner loop unrolling may
improve Java code performance:
do j=1,n
y[j] = 0
do i=prow[j],prow[j+1]-1,3
y[j] = y[j]+a[i]*x[col[i]]
+a[i+1]*x[col[i+1]]+a[i+2]*x[col[i+2]]
end do
end do
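A possible Java form of the unrolled loop is sketched below. The pseudo-code steps through each row three entries at a time, which fits rows whose length is a multiple of three; the remainder loop in the sketch is an added assumption so that rows of any length are handled, and the names SparseMatVecUnrolled and multiply are hypothetical.

```java
// Hypothetical sketch: sparse matrix-vector product with the inner loop unrolled by 3.
// One-based arrays as in the text; entry 0 of every array is unused.
final class SparseMatVecUnrolled {

    static void multiply(int n, double[] a, int[] col, int[] prow,
                         double[] x, double[] y) {
        for (int j = 1; j <= n; j++) {
            double s = 0.0;
            int i = prow[j];
            int end = prow[j + 1];
            // three multiply-add operations per pass, as in the pseudo-code
            for (; i + 2 < end; i += 3) {
                s = s + a[i] * x[col[i]]
                      + a[i + 1] * x[col[i + 1]]
                      + a[i + 2] * x[col[i + 2]];
            }
            // remainder (not in the pseudo-code): rows whose length is not a multiple of 3
            for (; i < end; i++) {
                s += a[i] * x[col[i]];
            }
            y[j] = s;
        }
    }
}
```

The products are added in the same left-to-right order as in the simple loop, so both versions give the same sums.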
Experiments with unrolling the outer loop led to slower
calculations. The speedup of the sparse matrix-vector
product after inner loop unrolling, and the lack of it after
outer loop unrolling, can be explained by the internal
compilation features of the Java compilers.
4 Experimental Results
We compared our C and Java implementations of the finite
element method on a series of three-dimensional elasticity
problems. The test problem is simple tension of an elastic
cube. Three-dimensional meshes of E × E × E brick-type
20-node elements are used for C-Java benchmarking.
The value of E varies from 4 to 14, thus providing meshes
from 64 elements (1275 degrees of freedom) to 2744 elements
(38475 degrees of freedom). The mesh with E = 8
is shown in Fig. 2.
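The quoted problem sizes can be checked with a short calculation. A 20-node brick element has nodes at its 8 corners and 12 edge midpoints, so a regular E × E × E mesh has (E+1)^3 corner nodes and 3E(E+1)^2 mid-edge nodes, with three degrees of freedom per node. The Java sketch below encodes this count; the class and method names are hypothetical.

```java
// Hypothetical sketch: size of an E x E x E mesh of 20-node brick elements.
final class MeshSize {

    /** Number of elements in the mesh. */
    static int elements(int e) {
        return e * e * e;
    }

    /** Number of nodes: corner nodes plus mid-edge nodes. */
    static int nodes(int e) {
        int corners = (e + 1) * (e + 1) * (e + 1);
        int midEdges = 3 * e * (e + 1) * (e + 1);
        return corners + midEdges;
    }

    /** Degrees of freedom: three displacement components per node. */
    static int dof(int e) {
        return 3 * nodes(e);
    }
}
```

For E = 4 this gives 125 + 300 = 425 nodes and 1275 degrees of freedom, and for E = 14 it gives 3375 + 9450 = 12825 nodes and 38475 degrees of freedom, in agreement with the values above.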
A desktop computer with an Intel Pentium 4 2.80 GHz processor
(533 MHz front-side bus and 512 KB L2 cache) was
used for running the C and the Java finite element codes.
The C code was compiled using Microsoft Visual C++ 6.0
with maximum speed optimization. The Java code was
compiled using the javac compiler developed by Sun
Microsystems with the optimization option -O and run using a Java
virtual machine (JVM). Three JVMs were used:
JVM 1.2.2-015 with Symantec Just-In-Time compiler;
Java HotSpot Client VM 1.3.1_07-b02;
Java HotSpot Client VM 1.4.1_02-b06.
Figure 2. Finite element mesh of 8 × 8 × 8 brick-type 20-node elements.
Figure 3. Ratio of the C code time to the Java code time for
assembly of the global stiffness matrix in the profile format.
(Assembly of profile system, Pentium 4 2.8 GHz; axes: number of DOF, 10^3,
versus tC/tJava; curves: JVM 1.2, JVM 1.3, JVM 1.4.)
Results for assembly of the global stiffness matrix in the
profile format and for the LDU solution of the equation
system are presented in Figures 3-4. Since it is difficult
to determine the megaflops rate for the assembly phase, we
present the C/Java performance comparison as ratios of the
computing time used by the C code to the computing time used
by the Java code. Assembly of the stiffness matrix in the
profile format is faster with JVM 1.2 than with the C code.
Performance of JVMs 1.3 and 1.4 is around 75% of the C code
performance. Fig. 4 shows megaflops rates for the LDU
solution of the equation system stored in the profile format.
The untuned version of the Java code produces approximately
the same speed of calculation for all JVMs. Java performance
of the untuned code is roughly 40% of C performance.
Tuning of the C and Java codes changes the performance ratios
dramatically (Fig. 4b). JVM 1.2 shows computing rates
that are around 90% of the C code rates. JVMs 1.3 and
1.4 produce lower speeds for the tuned LDU code. Significant
performance drops are observed for the tuned LDU
code when using JVM 1.3. Such phenomena can be explained
by data block conflicts in cache memory for certain
profiles of the equation system.
Figure 4. Java and C Megaflops rates for the LDU solution before tuning (a) and after tuning (b).
(Untuned and tuned LDU solution, Pentium 4 2.8 GHz; axes: number of DOF, 10^3,
versus MFlops; curves: JVM 1.2, JVM 1.3, JVM 1.4, MSC.)
Figure 5. Ratio of the C code time to the Java code time for
assembly of the stiffness matrix in the sparse row format.
(Assembly of sparse row system, Pentium 4 2.8 GHz; axes: number of DOF, 10^3,
versus tC/tJava; curves: JVM 1.2, JVM 1.3, JVM 1.4.)
Fig. 5 presents a comparison of C and Java speeds for
the assembly of the global stiffness matrix in the sparse row
format. JVM 1.2 produces the best speed: the Java
code run with JVM 1.2 is faster than the C code.
JVMs 1.3 and 1.4 show lower speeds (60% of the
C speed).
Megaflops rates for the PCG solution of the equation system
are depicted in Fig. 6. For the untuned PCG solution,
Java is about two times slower than C. Tuning does not
affect the speed of the C code. However, simple code tuning
that unrolls only the inner loop of the sparse matrix-vector
product improves Java performance considerably, making
the Java speed equal to 75% of the C speed.
There is a recommendation [9] to use JVM 1.4 and
to run it with the -server option in order to increase
the speed of Java codes. Our attempts to do so showed that
the finite element computations are 20% slower with the
-server option than with the default -client
option.
The data presented in Figs 3-6 show performance results
for three types of computations:
1) Calculation of element stiffness matrices and assembly
of the global stiffness matrix: mostly computations
with scalar variables;
2) LDU solution of the equation system: mostly a triple
loop of multiply-add operations on columns with
consecutive access to operands;
3) PCG solution of the equation system: mostly a
double loop of multiply-add operations with
non-consecutive access to operands.
The experimental results show that the performance of Java
is on par with C for computations involving mostly scalar
variables. For multiply-add operations with consecutive
access to array elements inside the triple loop, the Java
performance can be 90% of the C performance after tuning.
For multiply-add operations with non-consecutive
access to array elements inside double loops, the Java
performance is 75% of the C performance. It should be noted
that this conclusion holds only if the proper Java
machine is chosen (JVM 1.2). While it is reasonable to use
the latest Java SDK (Software Development Kit) for most
purposes, we also recommend installing Java Runtime
Environment (JRE) 1.2 and employing it for large
finite element analyses.
Figure 6. Java and C Megaflops rates for the PCG solution before tuning (a) and after tuning (b).
(Untuned and tuned PCG solution, Pentium 4 2.8 GHz; axes: number of DOF, 10^3,
versus MFlops; curves: JVM 1.2, JVM 1.3, JVM 1.4, MSC.)
5 Conclusion
We have designed an object-oriented version of a three-dimensional
finite element code for elasticity problems and
implemented it in the Java programming language. Special
attention has been devoted to the efficient implementation of
computationally intensive sections of the code.
The performance of the Java code has been compared
to the performance of the analogous C code on the solution
of three-dimensional elasticity problems using a computer
with an Intel Pentium 4 processor. Java Virtual Machines 1.2,
1.3 and 1.4 were used for running the Java code.
The experimental results show that the performance
of the Java finite element code is roughly equal to the
performance of the C code for calculation of element stiffness
matrices and assembly of the global equation system
when using JVM 1.2. JVMs 1.3 and 1.4 provide lower
performance. Untuned Java code demonstrates relatively
low performance for the LDU solution of the equation system
in the profile format. However, tuning with the blocking
technique affects the speed of the Java code more than that
of the C code. Performance of the tuned Java code
running on JVM 1.2 is about 90% of the C code performance.
The PCG iterative solution of the equation system is 30%
slower using the tuned Java code in comparison to the
tuned C code.
It is possible to conclude that the Java language is
quite suitable for development of finite element software.
With proper coding the performance of the Java
code is comparable to the performance of the corresponding
tuned C code. We recommend using JVM 1.2 for
large finite element analyses.
References
[1] K.-J. Bathe, Finite Element Procedures (Englewood
Cliffs: Prentice-Hall, 1996).
[2] I.M. Smith and D.V. Griffiths, Programming the Finite
Element Method (Chichester: Wiley, 1998).
[3] R.I. Mackie, Using objects to handle complexity in
finite element software, Engineering with Computers,
13, 1997, 99-111.
[4] R.I. Mackie, Object-Oriented Methods and Finite Element
Analysis (Stirling: Saxe-Coburg, 2001).
[5] Y. Dubois-Pelerin and P. Pegon, Object-oriented programming
in nonlinear finite element analysis, Computers
and Structures, 67, 1998, 225-241.
[6] J. Gosling, B. Joy and G. Steele, The Java Language
Specification (Reading, MA: Addison-Wesley, 1996).
[7] T. Lindholm and F. Yellin, The Java Virtual Machine
Specification (Reading, MA: Addison-Wesley, 1996).
[8] R.F. Boisvert, J. Moreira, M. Philippsen and R. Pozo,
Java and numerical computing, Computing in Science
and Engineering, March/April, 2001, 18-24.
[9] D. Kruger, Performance tuning in Java, Java Developers
Journal, August, 2002, 44-52.
[10] J.E. Moreira, S.P. Midkiff, M. Gupta, P.V. Artigas,
M. Snir and R.D. Lawrence, Java programming for
high-performance numerical computing, IBM Systems
Journal, 39, 2000, 21-56.