
  • 7/28/2019 Java Performance in FEM


    JAVA PERFORMANCE IN FINITE ELEMENT COMPUTATIONS

    G.P. NIKISHKOV

    University of Aizu

    Aizu-Wakamatsu, Fukushima 965-8580, Japan

    email: [email protected]

    ABSTRACT 1

    The performance of the developed Java finite element code is compared to that of the C finite element code on the solution of three-dimensional elasticity problems using an Intel Pentium 4 computer. Untuned Java code is approximately two times slower than analogous C code. It is shown that code tuning with the use of a blocking technique can provide a Java/C performance ratio of 90% for the LDU solution of finite element equations. Java performance for the PCG iterative solution algorithm tuned by inner loop unrolling is 75% of the C code performance. We recommend using Java Virtual Machine (JVM) 1.2, since in many cases it is considerably faster in finite element computations than JVMs 1.3 and 1.4.

    KEY WORDS

    Finite Element Methods, Java-based Simulation, Performance, Tuning

    1 Introduction

    Finite element codes were traditionally developed in Fortran [1] and recently in Fortran 90 [2]. During the last decade FEM developers started using the C++ language in order to handle complexity in finite element software [3-5]. Using the object-oriented approach with data hiding, encapsulation and inheritance allows creating reliable and extensible finite element codes.

    The Java language [6], developed by Sun Microsystems, possesses features that make it attractive for use in computational modelling. Java is a simple language with a rich collection of libraries implementing various APIs (Application Programming Interfaces). With Java it is easy to create Graphical User Interfaces and to communicate with other computers over a network. Java has a built-in garbage collector that prevents memory leaks. Another advantage of Java is its portability. Java Virtual Machines (JVMs) [7] are developed for all major computer systems. A JVM is embedded in most popular Web browsers. Java applets can be downloaded through the net and executed within a Web browser. While object-oriented programming can be done with the C++ language, other useful features such as actual portability and garbage collection are unique characteristics of the Java language.

    1 Applied Simulation and Modeling, Procs of the 12th IASTED Int. Conf., Sept. 3-5, 2003, Marbella, Spain, ACTA Press, Anaheim, 2003, pp. 130-135.

    Despite its attractive features, Java is not widely used in engineering computations. Translation of Java byte code into native instructions leads to slower operation of Java code. However, a Just-In-Time (JIT) compiler can significantly speed up the execution of Java applications and applets. The JIT, which is an integral part of the JVM, takes the bytecodes and compiles them into native code before execution. Since Java is a dynamic language, the JIT compiles methods on a method-by-method basis just before they are called. If the same method is called many times, or if the method contains a loop with many repetitions, the effect of re-execution of the native code can make the performance of Java code acceptable.

    Java performance in numerical computing was considered in several publications [8-10]. It was shown that high-performance numerical codes could be developed in Java with suitable code development techniques. While papers [8-10] deal with general issues of numerical computing, this paper addresses Java performance and tuning in finite element computations. We present our experience in designing an efficient finite element code in Java. The performance of the developed Java finite element code is compared to that of the analogous C code on finite element solutions of three-dimensional elasticity problems using an Intel computer. For running the Java code we employed Sun JVMs 1.2, 1.3 and 1.4. It is shown that with proper coding and JVM selection the Java finite element code can be almost as fast as the C code.

    2 Java Finite Element Code

    The object-oriented approach is widely used in order to create reusable, extensible, and reliable components, which can be used in later research and practical applications. However, a fully object-oriented programming approach might not always be ideal for computationally intensive sections of codes. Object creation and destruction in Java are expensive operations. The use of a large number of small objects can lead to considerable time and space overhead. As experiments show, a possible way to increase computing performance is to reduce the cost of object creation in the code by using primitive types in place of objects.
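The contrast between small objects and primitive types can be illustrated with a minimal sketch. The class and method names below (NodeStorageSketch, Point3, sumXObjects, sumXFlat) are hypothetical and are not taken from the finite element code described in this paper; the sketch only shows nodal coordinates stored as one object per node versus one flat array of doubles.

```java
// Hypothetical illustration: nodal coordinates stored as small objects
// versus one flat array of primitives. Names are not from the paper's code.
final class NodeStorageSketch {

    // Object-per-node layout: one heap allocation for every node.
    static final class Point3 {
        final double x, y, z;
        Point3(double x, double y, double z) { this.x = x; this.y = y; this.z = z; }
    }

    // Sum of x-coordinates using an array of small objects.
    static double sumXObjects(Point3[] nodes) {
        double s = 0.0;
        for (Point3 p : nodes) s += p.x;
        return s;
    }

    // Same sum using a flat array: node i occupies xyz[3*i], xyz[3*i+1], xyz[3*i+2].
    static double sumXFlat(double[] xyz) {
        double s = 0.0;
        for (int i = 0; i < xyz.length; i += 3) s += xyz[i];
        return s;
    }

    public static void main(String[] args) {
        int n = 4;
        Point3[] nodes = new Point3[n];
        double[] xyz = new double[3 * n];   // a single allocation for all nodes
        for (int i = 0; i < n; i++) {
            nodes[i] = new Point3(i, 2 * i, 3 * i);
            xyz[3 * i] = i;
            xyz[3 * i + 1] = 2 * i;
            xyz[3 * i + 2] = 3 * i;
        }
        // Both layouts hold the same data and give the same result.
        System.out.println(sumXObjects(nodes) + " " + sumXFlat(xyz));
    }
}
```

The flat layout needs one allocation regardless of the number of nodes, while the object layout needs one allocation per node; which layout is faster in practice depends on the JVM.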

    For a variable of a primitive type the JVM allocates

    the variable directly on the stack (local variable) or within

    the memory used for the object (member variable). For

    such variables there is no object creation overhead, and no


    than 3. Material parameters λ and μ are Lamé elastic constants. In our computer code, integration of the stiffness matrix [k] for the 20-node element is performed using a special 14-point integration rule. Since the element stiffness matrix is symmetric, only the symmetric part of the matrix and the diagonal coefficients are computed and then used for assembly of the global stiffness matrix.

    Assembly of the global stiffness matrix is performed with the use of element connectivity information. The assembly algorithm depends on the storage format for the finite element equation system.

    3.2 LDU Solution of Equation System

    The symmetric part of the global stiffness matrix of order n is stored in a profile form by columns. Each column of the matrix starts from the first top nonzero element and ends at the diagonal element. The matrix is represented by two arrays: a one-dimensional double array a, containing the matrix elements, and a pointer array pcol. Assuming that array indices begin from one, the ith element of pcol contains the index in the array a of the first element of the ith column minus one. The length of the ith column is given by pcol[i+1]-pcol[i]. The length of the array a is equal to pcol[n+1]. The location (row number) of the first nonzero element in the ith column of the matrix [A] is given by the function FN(i):

    FN(i) = i - (pcol[i+1] - pcol[i]) + 1.

    The following correspondence relation can be easily obtained for a transition from two-index matrix notation to one-dimensional array notation:

    a[i,j] → a[i + pcol[j+1] - j].
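The profile storage and the two index relations above can be sketched in Java as follows. This is a minimal illustration under stated assumptions, not the code used in the paper: the class name ProfileMatrixSketch, its methods and the 4x4 example values are hypothetical, and the arrays keep an unused element at index 0 so that the one-based formulas can be used unchanged.

```java
// Sketch of the profile (column) storage described above.
// Arrays are 1-based to match the text: index 0 of a and pcol is unused.
final class ProfileMatrixSketch {
    final int n;
    final double[] a;   // column-by-column entries, a[1..pcol[n+1]]
    final int[] pcol;   // pcol[1..n+1]; pcol[j] = index of first entry of column j, minus one

    ProfileMatrixSketch(int n, double[] a, int[] pcol) {
        this.n = n;
        this.a = a;
        this.pcol = pcol;
    }

    // Row number of the first stored entry in column i: FN(i).
    int fn(int i) {
        return i - (pcol[i + 1] - pcol[i]) + 1;
    }

    // Entry (i,j) of the symmetric matrix; entries above the profile are zero.
    double get(int i, int j) {
        if (i > j) { int tmp = i; i = j; j = tmp; }   // symmetry: use the upper part
        if (i < fn(j)) return 0.0;
        return a[i + pcol[j + 1] - j];
    }

    // Small 4x4 example (values chosen for illustration only):
    //   [4 1 0 0]
    //   [1 5 2 3]
    //   [0 2 6 1]
    //   [0 3 1 7]
    static ProfileMatrixSketch example() {
        double[] a = {0, 4, 1, 5, 2, 6, 3, 1, 7};
        int[] pcol = {0, 0, 1, 3, 5, 8};
        return new ProfileMatrixSketch(4, a, pcol);
    }

    public static void main(String[] args) {
        ProfileMatrixSketch m = example();
        for (int j = 1; j <= m.n; j++) {
            System.out.println("FN(" + j + ") = " + m.fn(j));
        }
        System.out.println("a[2,4] = " + m.get(2, 4));
    }
}
```

In the example, column 4 starts at row 2, so pcol[5] - pcol[4] = 3 and FN(4) = 2; the entry a[2,4] is found at position 2 + pcol[5] - 4 = 6 of the one-dimensional array.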

    Solution of a symmetric equation system consists of the [U]^T[D][U] decomposition of the system matrix followed by forward reduction and backsubstitution for the right-hand side. The [U]^T[D][U] decomposition takes the majority of the computing time. The right-looking algorithm of the decomposition can be presented as the following pseudo-code:

    do j=2,n
      Cdivt(j)
      do i=j,n
        Cmod(j,i)
      end do
    end do
    do j=2,n
      Cdiv(j)
    end do

    Cdivt(j) =
      do i=FN(j),j-1
        t[i] = a[i,j]/a[i,i]
      end do

    Cmod(j,i) =
      do k=max(FN(j),FN(i)),j-1
        a[j,i] -= t[k]*a[k,i]
      end do

    Cdiv(j) =
      do i=FN(j),j-1
        a[i,j] /= a[i,i]
      end do
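The untuned decomposition pseudo-code above can be transcribed to Java roughly as follows. This is a sketch under assumptions rather than the authors' implementation: the class name LduProfileSketch and the helper methods fn and idx are hypothetical, arrays are one-based with index 0 unused, no pivoting or zero-diagonal check is done, and the example matrix is chosen only for illustration.

```java
// Sketch of the untuned [U]^T[D][U] decomposition for a matrix in profile storage.
// After the call, the diagonal positions of a hold D and the other positions hold U.
final class LduProfileSketch {

    // FN(i): row number of the first stored entry of column i.
    static int fn(int[] pcol, int i) {
        return i - (pcol[i + 1] - pcol[i]) + 1;
    }

    // Position of a[i,j] (i <= j, inside the profile) in the one-dimensional array.
    static int idx(int[] pcol, int i, int j) {
        return i + pcol[j + 1] - j;
    }

    static void decompose(int n, double[] a, int[] pcol) {
        double[] t = new double[n + 1];
        for (int j = 2; j <= n; j++) {
            int fj = fn(pcol, j);
            // Cdivt(j): scaled copy of column j above the diagonal.
            for (int i = fj; i <= j - 1; i++) {
                t[i] = a[idx(pcol, i, j)] / a[idx(pcol, i, i)];
            }
            // Cmod(j,i): modify row j of columns i = j..n by column j.
            for (int i = j; i <= n; i++) {
                int k0 = Math.max(fj, fn(pcol, i));
                if (k0 > j - 1) continue;   // empty loop; a[j,i] may lie outside the profile
                int ji = idx(pcol, j, i);
                for (int k = k0; k <= j - 1; k++) {
                    a[ji] -= t[k] * a[idx(pcol, k, i)];
                }
            }
        }
        // Cdiv(j): divide the off-diagonal entries by the diagonal to obtain U.
        for (int j = 2; j <= n; j++) {
            for (int i = fn(pcol, j); i <= j - 1; i++) {
                a[idx(pcol, i, j)] /= a[idx(pcol, i, i)];
            }
        }
    }

    public static void main(String[] args) {
        // Illustrative 4x4 symmetric matrix in profile form (1-based arrays).
        double[] a = {0, 4, 1, 5, 2, 6, 3, 1, 7};
        int[] pcol = {0, 0, 1, 3, 5, 8};
        decompose(4, a, pcol);
        for (int p = 1; p < a.length; p++) {
            System.out.println("a[" + p + "] = " + a[p]);
        }
    }
}
```

The innermost k loop is the Cmod(j,i) loop of the pseudo-code: each multiply-add needs t[k] and one entry of column i.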

    The do loop that takes most of the LDU decomposition time is contained in the procedure Cmod(j,i). One column of the matrix is used to modify another column inside the inner do loop. Two operands must be loaded from memory in order to perform one floating-point multiply-add (FMA) operation. Data loads can be economized by tuning with the use of a blocking technique. After unrolling the two outer loops, the tuned version of the LDU decomposition is as follows:

    do j=1,n,d
      Bdivt(j,d)
      do i=j+d,n,d
        BBmod(j,i,d)
      end do
    end do
    do j=2,n
      Cdiv(j)
    end do

    Bdivt(k,d) =
      do j=k,k+d-1
        do i=FN(k),j-1
          t[i,j] = a[i,j]/a[i,i]
        end do
        do i=j,k+d-1
          do l=max(FN(j),FN(i)),j-1
            a[j,i] -= t[l,j]*a[l,i]
          end do
        end do
      end do

    BBmod(j,i,d=2) =
      do k=max(FN(j),FN(i)),j-1
        a[j,i] -= t[k,j]*a[k,i]
        a[j+1,i] -= t[k,j+1]*a[k,i]
        a[j,i+1] -= t[k,j]*a[k,i+1]
        a[j+1,i+1] -= t[k,j+1]*a[k,i+1]
      end do
      if j>=FN(j) then
        a[j+1,i] -= t[j,j+1]*a[j,i]
        a[j+1,i+1] -= t[j,j+1]*a[j,i+1]
      end if

    Method BBmod(j,i,d) performs modification of a column block that starts from column i by a column block that starts from column j and contains d columns. The pseudo-code above is given for the block size d = 2 for brevity. In the three-dimensional problems solved here, the block size d = 3 is used. It is assumed that the columns in a block start at the same row of the matrix a. This is fulfilled automatically if the column block contains columns that are related to one node of the finite element model.
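The saving in data loads can be seen in isolation with a small sketch. It is not the BBmod procedure itself: the class BlockedDotSketch and its methods are hypothetical, and the columns are passed as separate zero-based arrays of equal length instead of being addressed in profile storage. The unblocked loop loads two values per multiply-add, while the blocked loop with d = 2 loads four values and performs four multiply-adds with them, assuming the accumulators stay in registers.

```java
// Sketch of the idea behind blocking with d = 2: two "t" columns and two "a"
// columns are combined in one loop, so each loaded value is used twice.
final class BlockedDotSketch {

    // Unblocked: one dot product, two loads per multiply-add.
    static double dot(double[] t, double[] a, int len) {
        double s = 0.0;
        for (int k = 0; k < len; k++) s += t[k] * a[k];
        return s;
    }

    // Blocked: four dot products at once, four loads per four multiply-adds.
    // Returns {tj.ai, tj1.ai, tj.ai1, tj1.ai1}.
    static double[] dot2x2(double[] tj, double[] tj1, double[] ai, double[] ai1, int len) {
        double s00 = 0.0, s10 = 0.0, s01 = 0.0, s11 = 0.0;
        for (int k = 0; k < len; k++) {
            double t0 = tj[k], t1 = tj1[k];   // loaded once, used twice
            double a0 = ai[k], a1 = ai1[k];   // loaded once, used twice
            s00 += t0 * a0;
            s10 += t1 * a0;
            s01 += t0 * a1;
            s11 += t1 * a1;
        }
        return new double[] {s00, s10, s01, s11};
    }

    public static void main(String[] args) {
        double[] tj = {1, 2, 3}, tj1 = {0, 1, 0}, ai = {2, 2, 2}, ai1 = {1, 0, 1};
        double[] s = dot2x2(tj, tj1, ai, ai1, 3);
        System.out.println(s[0] + " " + s[1] + " " + s[2] + " " + s[3]);
    }
}
```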

    3.3 PCG Solution of Equation System

    The preconditioned conjugate gradient (PCG) method is an iterative procedure that does not alter the equation matrix. Because of this, it is sufficient to store only the nonzero coefficients of the finite element global stiffness matrix. The sparse structure of the matrix should be taken into account in matrix-vector multiplications.

    We use the sparse row format for the equation matrix. In this format all information about the matrix is contained in three arrays:

    a - array of doubles containing non-zero elements of the matrix, row by row;

    col - array of column indices for non-zero elements of the array a;


    prow - pointer array of indices of starting elements of matrix rows in the array a, again assuming that indices start from one.
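The three arrays can be illustrated with a short sketch that builds them from a small dense matrix. The class SparseRowSketch and the method fromDense are hypothetical names, and building from a dense array is only a convenient illustration (a finite element code would fill the arrays from element connectivity). Arrays are one-based with index 0 unused, and prow[n+1] is assumed to point one past the last stored element, which is what the matrix-vector loop bounds require.

```java
// Sketch of the sparse row format: arrays a, col and prow, all 1-based.
final class SparseRowSketch {
    final int n;
    final double[] a;   // non-zero values, row by row
    final int[] col;    // column index of each value in a
    final int[] prow;   // prow[j] = index in a of the first value of row j; prow[n+1] = nnz + 1

    SparseRowSketch(int n, double[] a, int[] col, int[] prow) {
        this.n = n;
        this.a = a;
        this.col = col;
        this.prow = prow;
    }

    // Build the three arrays from a dense square matrix (illustration only).
    static SparseRowSketch fromDense(double[][] dense) {
        int n = dense.length;
        int nnz = 0;
        for (double[] row : dense) {
            for (double v : row) {
                if (v != 0.0) nnz++;
            }
        }
        double[] a = new double[nnz + 1];
        int[] col = new int[nnz + 1];
        int[] prow = new int[n + 2];
        int p = 1;
        for (int j = 1; j <= n; j++) {
            prow[j] = p;
            for (int c = 1; c <= dense[j - 1].length; c++) {
                double v = dense[j - 1][c - 1];
                if (v != 0.0) {
                    a[p] = v;
                    col[p] = c;
                    p++;
                }
            }
        }
        prow[n + 1] = p;
        return new SparseRowSketch(n, a, col, prow);
    }

    public static void main(String[] args) {
        double[][] dense = {{4, 0, 1}, {0, 5, 0}, {2, 0, 6}};
        SparseRowSketch m = fromDense(dense);
        System.out.println("a    = " + java.util.Arrays.toString(m.a));
        System.out.println("col  = " + java.util.Arrays.toString(m.col));
        System.out.println("prow = " + java.util.Arrays.toString(m.prow));
    }
}
```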

    Preconditioning techniques are not the subject of this work. Simple diagonal preconditioning is used in our PCG solution procedure for the finite element equations. The most time-consuming operation in the PCG solution procedure is the sparse matrix-vector product inside the iteration loop. Matrix-vector multiplication for a matrix [A] in sparse row format is performed as follows:

    do j=1,n

    y[j] = 0

    do i=prow[j],prow[j+1]-1

    y[j] = y[j] + a[i]*x[col[i]]

    end do

    end do
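To show where this product sits in the solution procedure, the following sketch combines a Java transcription of the loop above with a textbook conjugate gradient iteration using diagonal preconditioning. It is an illustration under assumptions, not the authors' solver: the class SparsePcgSketch, the method names, the stopping test and the 3x3 example are hypothetical, arrays are one-based with index 0 unused, and the full matrix (both triangles) is assumed to be stored so that the product loop can be used as written.

```java
// Sketch: sparse-row matrix-vector product inside a diagonally preconditioned CG loop.
final class SparsePcgSketch {

    // y = A*x for A in sparse row format (1-based arrays).
    static void multiply(int n, double[] a, int[] col, int[] prow, double[] x, double[] y) {
        for (int j = 1; j <= n; j++) {
            double s = 0.0;
            for (int i = prow[j]; i <= prow[j + 1] - 1; i++) {
                s += a[i] * x[col[i]];
            }
            y[j] = s;
        }
    }

    static double dot(int n, double[] u, double[] v) {
        double s = 0.0;
        for (int j = 1; j <= n; j++) s += u[j] * v[j];
        return s;
    }

    // Solves A*x = b for a symmetric positive definite A; returns x (1-based).
    static double[] solve(int n, double[] a, int[] col, int[] prow, double[] b,
                          double tol, int maxIter) {
        double[] diag = new double[n + 1];
        for (int j = 1; j <= n; j++) {
            for (int i = prow[j]; i < prow[j + 1]; i++) {
                if (col[i] == j) diag[j] = a[i];
            }
        }
        double[] x = new double[n + 1];   // initial guess x = 0
        double[] r = b.clone();           // residual r = b - A*0 = b
        double[] z = new double[n + 1];
        double[] p = new double[n + 1];
        double[] q = new double[n + 1];
        for (int j = 1; j <= n; j++) {
            z[j] = r[j] / diag[j];        // diagonal preconditioning
            p[j] = z[j];
        }
        double rz = dot(n, r, z);
        double bNorm = Math.sqrt(dot(n, b, b));
        for (int it = 0; it < maxIter; it++) {
            if (Math.sqrt(dot(n, r, r)) <= tol * bNorm) break;
            multiply(n, a, col, prow, p, q);   // the dominant cost of each iteration
            double alpha = rz / dot(n, p, q);
            for (int j = 1; j <= n; j++) {
                x[j] += alpha * p[j];
                r[j] -= alpha * q[j];
            }
            for (int j = 1; j <= n; j++) z[j] = r[j] / diag[j];
            double rzNew = dot(n, r, z);
            double beta = rzNew / rz;
            for (int j = 1; j <= n; j++) p[j] = z[j] + beta * p[j];
            rz = rzNew;
        }
        return x;
    }

    public static void main(String[] args) {
        // A = [4 1 0; 1 3 1; 0 1 2], b = A * (1, 2, 3)^T
        double[] a = {0, 4, 1, 1, 3, 1, 1, 2};
        int[] col = {0, 1, 2, 1, 2, 3, 2, 3};
        int[] prow = {0, 1, 3, 6, 8};
        double[] b = {0, 6, 10, 8};
        double[] x = solve(3, a, col, prow, b, 1e-12, 100);
        System.out.println(x[1] + " " + x[2] + " " + x[3]);
    }
}
```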

    Experience with tuning C codes shows that little can be done to speed up the sparse matrix-vector product. To our surprise, the following simple inner loop unrolling may improve Java code performance:

    do j=1,n

    y[j] = 0

    do i=prow[j],prow[j+1]-1,3

    y[j] = y[j]+a[i]*x[col[i]]

    +a[i+1]*x[col[i+1]]+a[i+2]*x[col[i+2]]

    end do

    end do

    Experiments with unrolling the outer loop lead to slower calculations. The speedup of the sparse matrix-vector product after inner loop unrolling, and the lack of it after outer loop unrolling, can be explained by the internal compilation features of the Java compilers.
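A Java form of the unrolled inner loop might look as follows. The pseudo-code steps through each row in groups of three, which presumes that every row holds a multiple of three stored elements; the sketch below adds a cleanup loop so that it also works for other row lengths. That cleanup loop, the class name UnrolledSpmvSketch and the example data are additions for illustration and are not taken from the paper. Arrays are one-based with index 0 unused.

```java
// Sketch of the sparse matrix-vector product with the inner loop unrolled by three.
final class UnrolledSpmvSketch {

    static void multiplyUnrolled(int n, double[] a, int[] col, int[] prow,
                                 double[] x, double[] y) {
        for (int j = 1; j <= n; j++) {
            double s = 0.0;
            int i = prow[j];
            int end = prow[j + 1];   // one past the last element of row j
            // Unrolled part: three multiply-adds per pass.
            for (; i + 2 < end; i += 3) {
                s += a[i] * x[col[i]]
                   + a[i + 1] * x[col[i + 1]]
                   + a[i + 2] * x[col[i + 2]];
            }
            // Cleanup for rows whose length is not a multiple of three.
            for (; i < end; i++) {
                s += a[i] * x[col[i]];
            }
            y[j] = s;
        }
    }

    public static void main(String[] args) {
        // Rows with 3, 4, 2 and 1 stored elements (illustrative values).
        double[] a = {0, 1, 1, 1, 1, 2, 3, 4, 2, 1, 5};
        int[] col = {0, 1, 2, 3, 1, 2, 3, 4, 3, 4, 4};
        int[] prow = {0, 1, 4, 8, 10, 11};
        double[] x = {0, 1, 2, 3, 4};
        double[] y = new double[5];
        multiplyUnrolled(4, a, col, prow, x, y);
        System.out.println(y[1] + " " + y[2] + " " + y[3] + " " + y[4]);
    }
}
```

Whether this form is faster than the plain loop depends on the JVM and its JIT compiler, so it should be timed on the target machine before being adopted.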

    4 Experimental Results

    We compared our C and Java implementations of the finite element method on a series of three-dimensional elasticity problems. The test problem is simple tension of an elastic cube. Three-dimensional meshes of E × E × E brick-type 20-node elements are used for C-Java benchmarking. The value of E varies from 4 to 14, thus providing meshes from 64 elements (1275 degrees of freedom) to 2744 elements (38475 degrees of freedom). The mesh with E = 8 is shown in Fig. 2.
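The quoted problem sizes can be checked with a short calculation. A 20-node brick element has nodes at its corners and at the midpoints of its edges, so an E × E × E mesh has (E+1)^3 corner nodes and 3E(E+1)^2 mid-edge nodes, each with three displacement degrees of freedom. The formula and the class MeshSizeSketch below are a derivation added for illustration (they are not stated in the paper), and the count includes all degrees of freedom before any constraints are applied.

```java
// Sketch: number of nodes and degrees of freedom for an E x E x E mesh
// of 20-node brick elements (corner nodes plus one node per element edge).
final class MeshSizeSketch {

    static long nodes(int e) {
        long corners = (long) (e + 1) * (e + 1) * (e + 1);
        long midEdges = 3L * e * (e + 1) * (e + 1);   // edges in three directions
        return corners + midEdges;
    }

    // Three displacement components per node.
    static long dof(int e) {
        return 3 * nodes(e);
    }

    public static void main(String[] args) {
        for (int e = 4; e <= 14; e += 2) {
            long elements = (long) e * e * e;
            System.out.println("E=" + e + " elements=" + elements + " DOF=" + dof(e));
        }
    }
}
```

For E = 4 this gives 425 nodes and 1275 degrees of freedom, and for E = 14 it gives 12825 nodes and 38475 degrees of freedom, in agreement with the values quoted above.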

    A desktop computer with an Intel Pentium 4 2.80 GHz processor (533 MHz front-side bus and 512 KB L2 cache) was used for running the C and the Java finite element codes. The C code was compiled using Microsoft Visual C++ 6.0 with maximum speed optimization. The Java code was compiled using the javac compiler developed by Sun Microsystems with the optimization option -O and run using a Java virtual machine (JVM). Three JVMs were used:

    JVM 1.2.2-015 with Symantec Just-In-Time compiler;

    Java HotSpot Client VM 1.3.1_07-b02;

    Java HotSpot Client VM 1.4.1_02-b06.

    Figure 2. Finite element mesh of 8 × 8 × 8 brick-type 20-node elements.

    [Figure 3 plot. Title: Assembly of profile system, Pentium 4 2.8 GHz. Vertical axis: tC/tJava. Horizontal axis: Number of DOF, 10^3. Curves: JVM 1.2, JVM 1.3, JVM 1.4.]

    Figure 3. Ratio of the C code time to the Java code time for assembly of the global stiffness matrix in the profile format.

    Results for assembly of the global stiffness matrix in the profile format and for the LDU solution of the equation system are presented in Figures 3-4. Since it is difficult to determine the megaflops rate for the assembly phase, we present the C/Java performance comparison as ratios of the computing time used by the C code to the computing time used by the Java code. Assembly of the stiffness matrix in the profile format is faster with JVM 1.2 than with the C code. Performance of JVMs 1.3 and 1.4 is around 75% of the C code performance. Fig. 4 shows megaflops rates for the LDU solution of the equation system stored in the profile format. The untuned version of the Java code produces approximately the same speed of calculation for all JVMs. Java performance of the untuned code is roughly 40% of C performance. Tuning of the C and Java codes changes the performance ratios dramatically (Fig. 4b). JVM 1.2 shows computing rates that are around 90% of the C code rates. JVMs 1.3 and 1.4 produce lower speeds for the tuned LDU code. Significant performance drops are observed for the tuned LDU code when using JVM 1.3. Such phenomena can be explained by data block conflicts in cache memory for certain profiles of the equation system.

    [Figure 4 plots. Panel (a): Untuned LDU solution, Pentium 4 2.8 GHz. Panel (b): Tuned LDU solution, Pentium 4 2.8 GHz. Vertical axis: MFlops. Horizontal axis: Number of DOF, 10^3. Curves: JVM 1.2, JVM 1.3, JVM 1.4, MSC.]

    Figure 4. Java and C Megaflops rates for the LDU solution before tuning (a) and after tuning (b).

    [Figure 5 plot. Title: Assembly of sparse row system, Pentium 4 2.8 GHz. Vertical axis: tC/tJava. Horizontal axis: Number of DOF, 10^3. Curves: JVM 1.2, JVM 1.3, JVM 1.4.]

    Figure 5. Ratio of the C code time to the Java code time for assembly of the stiffness matrix in the sparse row format.

    Fig. 5 presents a comparison of C and Java speeds for the assembly of the global stiffness matrix in the sparse row format. JVM 1.2 produces the best speed: the Java code run with JVM 1.2 is faster than the C code. Lower speeds are shown by JVMs 1.3 and 1.4 (60% of the C speed).

    Megaflops rates for the PCG solution of the equation system are depicted in Fig. 6. For the untuned PCG solution, Java is about two times slower than C. Tuning does not affect the speed of the C code. However, simple code tuning by unrolling only the inner loop of the sparse matrix-vector product improves Java performance considerably, making the Java speed equal to 75% of the C speed.

    There is a recommendation [9] to use JVM 1.4 and to run it with the -server option in order to increase the speed of Java codes. Our attempts to do so showed that the finite element computations are 20% slower with the -server option than with the default -client option.

    The data presented in Figs 3-6 show performance results for three types of computations:

    1) Calculation of element stiffness matrices and assembly of the global stiffness matrix: mostly computations with scalar variables;

    2) LDU solution of the equation system: mostly a triple loop of multiply-add operations for columns with consecutive access to operands;

    3) PCG solution of the equation system: mostly a double loop of multiply-add operations with non-consecutive access to operands.

    The experimental results show that the performance of Java is on par with C for computations involving mostly scalar variables. For multiply-add operations with consecutive access to array elements inside the triple loop, the Java performance can be 90% of the C performance after tuning. For multiply-add operations with non-consecutive access to array elements inside double loops, the Java performance is 75% of the C performance. It should be noted that this conclusion holds only if the proper Java machine is chosen (JVM 1.2). While it is reasonable to use the latest Java SDK (Software Development Kit) for most purposes, we also recommend installing Java Runtime Environment (JRE) 1.2 and employing it for performing large finite element analyses.

    [Figure 6 plots. Panel (a): Untuned PCG solution, Pentium 4 2.8 GHz. Panel (b): Tuned PCG solution, Pentium 4 2.8 GHz. Vertical axis: MFlops. Horizontal axis: Number of DOF, 10^3. Curves: JVM 1.2, JVM 1.3, JVM 1.4, MSC.]

    Figure 6. Java and C Megaflops rates for the PCG solution before tuning (a) and after tuning (b).

    5 Conclusion

    We have designed an object-oriented version of the three-dimensional finite element code for elasticity problems and implemented it in the Java programming language. Special attention has been devoted to the efficient implementation of computationally intensive sections of the code.

    The performance of the Java code has been compared to the performance of the analogous C code on the solution of three-dimensional elasticity problems using a computer with an Intel Pentium 4 processor. Java Virtual Machines 1.2, 1.3 and 1.4 were used for running the Java code.

    The experimental results show that the performance of the Java finite element code is roughly equal to the performance of the C code for calculation of element stiffness matrices and assembly of the global equation system when using JVM 1.2. JVMs 1.3 and 1.4 provide lower performance. Untuned Java code demonstrates relatively low performance for the LDU solution of the equation system in the profile format. However, tuning with the blocking technique affects the speed of the Java code more than the speed of the C code. Performance of the tuned Java code running on JVM 1.2 is about 90% of the C code performance. The PCG iterative solution of the equation system is 30% slower using the tuned Java code in comparison to the tuned C code.

    It is possible to conclude that the Java language is quite suitable for development of finite element software. With the use of proper coding the performance of the Java code is comparable to the performance of the corresponding tuned C code. We recommend using JVM 1.2 for large finite element analyses.

    References

    [1] K.-J. Bathe, Finite Element Procedures (Englewood Cliffs: Prentice-Hall, 1996).

    [2] I.M. Smith and D.V. Griffiths, Programming the Finite Element Method (Chichester: Wiley, 1998).

    [3] R.I. Mackie, Using objects to handle complexity in finite element software, Engineering with Computers, 13, 1997, 99-111.

    [4] R.I. Mackie, Object-Oriented Methods and Finite Element Analysis (Stirling: Saxe-Coburg, 2001).

    [5] Y. Dubois-Pelerin and P. Pegon, Object-oriented programming in nonlinear finite element analysis, Computers and Structures, 67, 1998, 225-241.

    [6] J. Gosling, B. Joy and G. Steele, The Java Language Specification (Reading, MA: Addison-Wesley, 1996).

    [7] T. Lindholm and F. Yellin, The Java Virtual Machine Specification (Reading, MA: Addison-Wesley, 1996).

    [8] R.F. Boisvert, J. Moreira, M. Philippsen and R. Pozo, Java and numerical computing, Computing in Science and Engineering, March/April, 2001, 18-24.

    [9] D. Kruger, Performance tuning in Java, Java Developers Journal, August, 2002, 44-52.

    [10] J.E. Moreira, S.P. Midkiff, M. Gupta, P.V. Artigas, M. Snir and R.D. Lawrence, Java programming for high-performance numerical computing, IBM Systems Journal, 39, 2000, 21-56.