Parallel Computing Explained: Porting Issues
Slides prepared from the CI-Tutor courses at NCSA (http://ci-tutor.ncsa.uiuc.edu/) by S. Masoud Sadjadi, School of Computing and Information Sciences, Florida International University.


Page 1: Parallel Computing Explained Porting Issues

Slides Prepared from the CI-Tutor Courses at NCSA
http://ci-tutor.ncsa.uiuc.edu/
By S. Masoud Sadjadi
School of Computing and Information Sciences
Florida International University

March 2009

Parallel Computing Explained

Porting Issues

Page 2: Parallel Computing Explained Porting Issues

Agenda
1 Parallel Computing Overview
2 How to Parallelize a Code
3 Porting Issues
3.1 Recompile
3.2 Word Length
3.3 Compiler Options for Debugging
3.4 Standards Violations
3.5 IEEE Arithmetic Differences
3.6 Math Library Differences
3.7 Compute Order Related Differences
3.8 Optimization Level Too High
3.9 Diagnostic Listings
3.10 Further Information

Page 3: Parallel Computing Explained Porting Issues

Porting Issues
To run a computer program that presently runs on a workstation, a mainframe, a vector computer, or another parallel computer on a new parallel computer, you must first "port" the code.

After porting the code, it is important to have benchmark results you can use for comparison. To obtain them, run the original program on a well-defined dataset and save the results from the old, or "baseline", computer. Then run the ported code on the new computer and compare the results.

If the results are different, don't automatically assume that the new results are wrong; they may actually be better. There are several reasons why this might be true, including:
Precision Differences: the new results may actually be more accurate than the baseline results.
Code Flaws: porting your code to a new computer may have uncovered a hidden flaw that was already there.
Detection methods for finding code flaws, solutions, and workarounds are provided in this lecture.

Page 4: Parallel Computing Explained Porting Issues

Recompile
Some codes just need to be recompiled to get accurate results. The compilers available on the NCSA computer platforms are shown in the following table:

Language                  | SGI Origin2000 (MIPSpro) | IA-32 Linux (Intel, GNU, Portland Group) | IA-64 Linux (Intel, GNU, Portland Group)
Fortran 77                | f77                      | ifort, g77, pgf77                        | ifort, g77
Fortran 90                | f90                      | ifort, pgf90                             | ifort
Fortran 95                | f95                      | ifort                                    | ifort
High Performance Fortran  |                          | pghpf                                    | pghpf
C                         | cc                       | icc, gcc, pgcc                           | icc, gcc
C++                       | CC                       | icpc, g++, pgCC                          | icpc, g++

Page 5: Parallel Computing Explained Porting Issues

Word Length
Code flaws can occur when you are porting your code to a computer with a different word length. For C, the size of a long integer differs depending on the machine and on how the code is compiled. On the IA-32 and IA-64 Linux clusters, a long is 4 and 8 bytes, respectively (an int is 4 bytes on both). On the SGI Origin2000, the corresponding value is 4 bytes if the code is compiled with the -n32 flag, and 8 bytes if compiled without any flags or explicitly with the -64 flag.

For Fortran, the SGI MIPSpro and Intel compilers provide the following flags to set the default variable size:
-i<n>, where n is a number: set the default INTEGER to INTEGER*n. The value of n can be 4 or 8 on SGI, and 2, 4, or 8 on the Linux clusters.
-r<n>, where n is a number: set the default REAL to REAL*n. The value of n can be 4 or 8 on SGI, and 4, 8, or 16 on the Linux clusters.

Page 6: Parallel Computing Explained Porting Issues

Compiler Options for Debugging
On the SGI Origin2000, the MIPSpro compilers include debugging options via the -DEBUG option group. The syntax is as follows:
-DEBUG:option1[=value1]:option2[=value2]...
Two examples are:
Array-bound checking: check for subscripts out of range at runtime.
-DEBUG:subscript_check=ON
Force all uninitialized stack, automatic, and dynamically allocated variables to be initialized.
-DEBUG:trap_uninitialized=ON

Page 7: Parallel Computing Explained Porting Issues

Compiler Options for Debugging
On the IA-32 Linux cluster, the Fortran compiler is equipped with the following -C flags for runtime diagnostics:
-CA: pointers and allocatable references
-CB: array and subscript bounds
-CS: consistent shape of intrinsic procedures
-CU: use of uninitialized variables
-CV: correspondence between dummy and actual arguments

Page 8: Parallel Computing Explained Porting Issues

Standards Violations
Code flaws can occur when the program contains non-ANSI-standard Fortran coding. ANSI-standard Fortran is a set of rules for compiler writers that specifies, for example, the value of the do-loop index upon exit from the loop.

Standards Violations Detection
To detect standards violations on the SGI Origin2000 computer, use the -ansi flag. This option generates a listing of warning messages for the use of non-ANSI-standard coding.
On the Linux clusters, the -ansi[-] flag enables or disables the assumption of ANSI conformance.

Page 9: Parallel Computing Explained Porting Issues

IEEE Arithmetic Differences
Code flaws can occur when the baseline computer conforms to the IEEE arithmetic standard and the new computer does not. The IEEE Arithmetic Standard is a set of rules governing arithmetic roundoff and overflow behavior. For example, it prohibits the compiler writer from replacing x/y with x*recip(y), since the two results may differ slightly for some operands. You can make your program strictly conform to the IEEE standard.

To make your program conform to the IEEE Arithmetic Standard on the SGI Origin2000 computer, use:
f90 -OPT:IEEE_arithmetic=n ... prog.f
where n is 1, 2, or 3. This option specifies the level of conformance to the IEEE standard, where 1 is the most stringent and 3 is the most liberal.
On the Linux clusters, the Intel compilers can achieve conformance to the IEEE standard at a stringent level with the -mp flag, or at a slightly relaxed level with the -mp1 flag.

Page 10: Parallel Computing Explained Porting Issues

Math Library Differences
Most high-performance parallel computers are equipped with vendor-supplied math libraries. On the SGI Origin2000 platform, there are the SGI/Cray Scientific Library (SCSL) and Complib.sgimath.
SCSL contains Level 1, 2, and 3 Basic Linear Algebra Subprograms (BLAS), LAPACK, and Fast Fourier Transform (FFT) routines. SCSL can be linked with -lscs for the serial version, or -mp -lscs_mp for the parallel version.
The Complib library can be linked with -lcomplib.sgimath for the serial version, or -mp -lcomplib.sgimath_mp for the parallel version.
The Intel Math Kernel Library (MKL) contains the complete set of functions from BLAS, the extended BLAS (sparse), the complete set of LAPACK routines, and Fast Fourier Transform (FFT) routines.

Page 11: Parallel Computing Explained Porting Issues

Math Library Differences
On the IA-32 Linux cluster, the libraries to link to are:
For BLAS: -L/usr/local/intel/mkl/lib/32 -lmkl -lguide -lpthread
For LAPACK: -L/usr/local/intel/mkl/lib/32 -lmkl_lapack -lmkl -lguide -lpthread
When calling MKL routines from C/C++ programs, you also need to link with -lF90.
On the IA-64 Linux cluster, the corresponding libraries are:
For BLAS: -L/usr/local/intel/mkl/lib/64 -lmkl_itp -lpthread
For LAPACK: -L/usr/local/intel/mkl/lib/64 -lmkl_lapack -lmkl_itp -lpthread
When calling MKL routines from C/C++ programs, you also need to link with -lPEPCF90 -lCEPCF90 -lF90 -lintrins

Page 12: Parallel Computing Explained Porting Issues

Compute Order Related Differences
Code flaws can occur because of the non-deterministic computation of data elements on a parallel computer. The order in which the threads run cannot be guaranteed. For example, in a data-parallel program, the 50th index of a do loop may be computed before the 10th index of the loop. Furthermore, the threads may run in one order on the first run, and in another order on the next run of the program.
Note: If your algorithm depends on data being compared in a specific order, your code is inappropriate for a parallel computer.

Use the following method to detect compute order related differences. If your loop looks like
DO I = 1, N
change it to
DO I = N, 1, -1
The results should not change if the iterations are independent.

Page 13: Parallel Computing Explained Porting Issues

Optimization Level Too High
Code flaws can occur when the optimization level has been set too high, thus trading accuracy for speed. The compiler reorders and optimizes your code based on assumptions it makes about your program. This can sometimes cause answers to change at higher optimization levels.
Setting the Optimization Level
Both the SGI Origin2000 computer and the Linux clusters provide Level 0 (no optimization) through Level 3 (most aggressive) optimization, using the -O{0,1,2,3} flag. Bear in mind that Level 3 optimization may carry out loop transformations that affect the correctness of calculations. Checking the correctness and precision of the calculation is highly recommended when -O3 is used.
For example, on the Origin2000,
f90 -O0 ... prog.f
turns off all optimizations.

Page 14: Parallel Computing Explained Porting Issues

Optimization Level Too High
Isolating Optimization Level Problems
You can sometimes isolate optimization level problems using the method of binary chop.
To do this, divide your program prog.f into halves. Name them prog1.f and prog2.f. Compile the first half with -O0 and the second half with -O3:
f90 -c -O0 prog1.f
f90 -c -O3 prog2.f
f90 prog1.o prog2.o
a.out > results
If the results are correct, the optimization problem lies in prog1.f.
Next divide prog1.f into halves. Name them prog1a.f and prog1b.f. Compile prog1a.f with -O0 and prog1b.f with -O3:
f90 -c -O0 prog1a.f
f90 -c -O3 prog1b.f
f90 prog1a.o prog1b.o prog2.o
a.out > results
Continue in this manner until you have isolated the section of code that is producing incorrect results.

Page 15: Parallel Computing Explained Porting Issues

Diagnostic Listings
The SGI Origin2000 compiler will generate all kinds of diagnostic warnings and messages, but not always by default. Some useful listing options are:
f90 -listing ...
f90 -fullwarn ...
f90 -showdefaults ...
f90 -version ...
f90 -help ...

Page 16: Parallel Computing Explained Porting Issues

Further Information
SGI
man f77/f90/cc
man debug_group
man math
man complib.sgimath
MIPSpro 64-Bit Porting and Transition Guide
Online Manuals
Linux clusters pages
ifort/icc/icpc -help (IA-32, IA-64, Intel64)
Intel Fortran Compiler for Linux
Intel C/C++ Compiler for Linux

Page 17: Parallel Computing Explained Porting Issues

Agenda
1 Parallel Computing Overview
2 How to Parallelize a Code
3 Porting Issues
4 Scalar Tuning
4.1 Aggressive Compiler Options
4.2 Compiler Optimizations
4.3 Vendor Tuned Code
4.4 Further Information

Page 18: Parallel Computing Explained Porting Issues

Scalar Tuning
If you are not satisfied with the performance of your program on the new computer, you can tune the scalar code to decrease its runtime. This chapter describes many of these techniques:
The use of the most aggressive compiler options
The improvement of loop unrolling
The use of subroutine inlining
The use of vendor-supplied tuned code

The detection of cache problems and their solutions are presented in the Cache Tuning chapter.

Page 19: Parallel Computing Explained Porting Issues

Aggressive Compiler Options
For the SGI Origin2000 and the Linux clusters, the main optimization switch is -On, where n ranges from 0 to 3.
-O0 turns off all optimizations.
-O1 and -O2 perform beneficial optimizations that will not affect the accuracy of results.
-O3 specifies the most aggressive optimizations. It takes the most compile time, may produce changes in accuracy, and turns on software pipelining.

Page 20: Parallel Computing Explained Porting Issues

Aggressive Compiler Options
It should be noted that -O3 might carry out loop transformations that produce incorrect results in some codes. It is recommended that one compare the answer obtained from Level 3 optimization with one obtained from a lower optimization level.
On the SGI Origin2000 and the Linux clusters, -O3 can be used together with -OPT:IEEE_arithmetic=n (n=1, 2, or 3) and -mp (or -mp1), respectively, to enforce operation conformance to the IEEE standard at different levels.
On the SGI Origin2000, the option -Ofast=ip27 is also available. This option specifies the most aggressive optimizations that are specifically tuned for the Origin2000 computer.