CODE TUNING AND OPTIMIZATION
Kadin Tseng
Boston University
Scientific Computing and Visualization
Outline
• Introduction
• Timing
• Example Code
• Profiling
• Cache
• Tuning
• Parallel Performance
Code Tuning and Optimization 2
Introduction
• Timing
  • Where is most time being used?
• Tuning
  • How to speed it up
  • Often as much art as science
• Parallel Performance
  • How to assess how well parallelization is working
Timing
Timing
• When tuning/parallelizing a code, need to assess effectiveness of your efforts
• Can time whole code and/or specific sections
• Some types of timers
  • Unix time command
  • function/subroutine calls
  • profiler
CPU Time or Wall-Clock Time?
• CPU time
  • How much time the CPU is actually crunching away
  • User CPU time
    • Time spent executing your source code
  • System CPU time
    • Time spent in system calls such as i/o
• Wall-clock time
  • What you would measure with a stopwatch
CPU Time or Wall-Clock Time? (cont’d)
• Both are useful
• For serial runs without interaction from keyboard, CPU and wall-clock times are usually close
• If you prompt for keyboard input, wall-clock time will keep accumulating while you get a cup of coffee, but CPU time will not
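To see the two clocks diverge, here is a minimal C sketch (not from the original slides; the function name is ours) that times a short sleep with both a CPU timer and a wall-clock timer — the sleep adds wall-clock time but almost no CPU time:

```c
#include <time.h>
#include <sys/time.h>

/* Time a 0.2 s sleep: returns wall-clock seconds, stores CPU seconds. */
double time_a_sleep(double *cpu_sec)
{
    struct timeval w0, w1;
    struct timespec nap = {0, 200000000L};   /* 0.2 s */
    clock_t c0, c1;

    gettimeofday(&w0, NULL);
    c0 = clock();
    nanosleep(&nap, NULL);     /* idle: wall-clock advances, CPU does not */
    c1 = clock();
    gettimeofday(&w1, NULL);

    *cpu_sec = (double)(c1 - c0) / CLOCKS_PER_SEC;
    return (w1.tv_sec - w0.tv_sec) + 1.0e-6 * (w1.tv_usec - w0.tv_usec);
}
```

On a quiet machine the wall-clock result is about 0.2 s while the CPU result is near zero.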
CPU Time or Wall-Clock Time? (3)
• Parallel runs
  • Want wall-clock time, since CPU time will be about the same or even increase as the number of procs. is increased
  • Wall-clock time may not be accurate if sharing processors
  • Wall-clock timings should always be performed in batch mode
Unix Time Command
• easiest way to time code
• simply type time before your run command
• output differs between C-type shells (csh, tcsh) and Bourne-type shells (sh, bash, ksh)
Unix Time Command (cont’d)
katana:~ % time mycode
1.570u 0.010s 0:01.77 89.2% 75+1450k 0+0io 64pf+0w
user CPU time (s)
system CPU time (s)
wall-clock time (s)
(u+s)/wc
avg. shared + unshared text space

input + output operations

page faults + no. times proc. was swapped
Unix Time Command (3)
• Bourne shell results

$ time mycode
real    0m1.62s
user    0m1.57s
sys     0m0.03s
wall-clock time
user CPU time
system CPU time
Example Code
Example Code
• Simulation of response of eye to stimuli (CNS Dept.)
• Based on Grossberg & Todorovic paper
• Contains 6 levels of response
  • Our code only contains levels 1 through 5
  • Level 6 takes a long time to compute, and would skew our timings!
Example Code (cont’d)
• All calculations done on a square array
• Array size and other constants are defined in gt.h (C) or in the “mods” module at the top of the code (Fortran)
Level 1 Equations
• Computational domain is a square
• Defines square array I over domain (initial condition)

[Figure: input array I, with bright and dark regions]
Level 2 Equations
$$C_{pqij} = \bar{C}\exp\{-\alpha^{-2}[(p-i)^2 + (q-j)^2]\}$$

$$E_{pqij} = \bar{E}\exp\{-\beta^{-2}[(p-i)^2 + (q-j)^2]\}$$

$$x_{ij} = \frac{B\sum_{p,q} C_{pqij}\,I_{pq} - D\sum_{p,q} E_{pqij}\,I_{pq}}{A + \sum_{p,q} (C_{pqij} + E_{pqij})\,I_{pq}}$$

$$X_{ij} = \max(x_{ij},\,0)$$

$I_{pq}$ = initial condition
Level 3 Equations
$$G_{pqij} = \bar{G}\exp\{-\gamma^{-2}[(p-i)^2 + (q-j)^2]\}$$

$$H^{(k)}_{pqij} = \bar{H}\exp\{-\delta^{-2}[(p-i-m_k)^2 + (q-j-n_k)^2]\}$$

$$F^{(k)}_{pqij} = G_{pqij} - H^{(k)}_{pqij}$$

$$y_{ijk} = \sum_{p,q} F^{(k)}_{pqij}\,X_{pq}$$

$$m_k = \sin\!\left(\frac{2\pi k}{K}\right), \qquad n_k = \cos\!\left(\frac{2\pi k}{K}\right)$$

$$Y_{ijk} = \max(y_{ijk},\,0)$$
Level 4 Equations
$$z_{ijk} = [\,Y_{ijk} + Y_{ij(k+K/2)}\,]$$

$$Z_{ijk} = \max(z_{ijk} - L,\,0)$$
Level 5 Equation
$$Z_{ij} = \sum_k Z_{ijk}$$
Exercise 1
• Copy files from /scratch disc
  Katana% cp /scratch/kadin/tuning/* .
• Choose C (gt.c and gt.h) or Fortran (gt.f90)
• Compile with no optimization (–O0 is capital O, zero):
  pgcc –O0 –o gt gt.c
  pgf90 –O0 –o gt gt.f90
• Submit rungt script to batch queue
  katana% qsub -b y rungt
Exercise 1 (cont’d)
• Check status
  qstat –u username
• After the run has completed, a file will appear named rungt.o??????, where ?????? represents the process number
• File contains result of time command
• Write down wall-clock time
• Re-compile using –O3
• Re-run and check time
Function/Subroutine Calls
• often need to time part of code
• timers can be inserted in source code
• language-dependent
cpu_time
• intrinsic subroutine in Fortran
• returns user CPU time (in seconds)
  • no system time is included

real :: t1, t2
call cpu_time(t1) ! Start timer
... perform computation here ...
call cpu_time(t2) ! Stop timer
print*, 'CPU time = ', t2-t1, ' sec.'
system_clock
• intrinsic subroutine in Fortran
• good for measuring wall-clock time
system_clock (cont’d)
• t1 and t2 are tic counts
• count_rate is optional argument containing tics/sec.

integer :: t1, t2, count_rate
call system_clock(t1, count_rate) ! Start clock
... perform computation here ...
call system_clock(t2) ! Stop clock
print*, 'wall-clock time = ', &
        real(t2-t1)/real(count_rate), ' sec'
times
• can be called from C to obtain CPU time

#include <sys/times.h>
#include <unistd.h>
#include <stdio.h>
int main() {
    int tics_per_sec;
    float tic1, tic2;
    struct tms timedat;
    tics_per_sec = sysconf(_SC_CLK_TCK);
    times(&timedat);               // start clock
    tic1 = timedat.tms_utime;
    /* ... perform computation here ... */
    times(&timedat);               // stop clock
    tic2 = timedat.tms_utime;
    printf("CPU time = %5.2f\n",
           (float)(tic2-tic1)/(float)tics_per_sec);
    return 0;
}

• can also get system time with tms_stime
gettimeofday
• can be called from C to obtain wall-clock time

#include <sys/time.h>
#include <stdio.h>
int main() {
    struct timeval t;
    double t1, t2;
    gettimeofday(&t, NULL);        // start clock
    t1 = t.tv_sec + 1.0e-6*t.tv_usec;
    /* ... perform computation here ... */
    gettimeofday(&t, NULL);        // stop clock
    t2 = t.tv_sec + 1.0e-6*t.tv_usec;
    printf("wall-clock time = %5.3f\n", t2-t1);
    return 0;
}
MPI_Wtime
• convenient wall-clock timer for MPI codes
MPI_Wtime (cont’d)
• Fortran

double precision t1, t2
t1 = mpi_wtime() ! Start clock
... perform computation here ...
t2 = mpi_wtime() ! Stop clock
print*, 'wall-clock time = ', t2-t1

• C

double t1, t2;
t1 = MPI_Wtime(); // start clock
/* ... perform computation here ... */
t2 = MPI_Wtime(); // stop clock
printf("wall-clock time = %5.3f\n", t2-t1);
omp_get_wtime
• convenient wall-clock timer for OpenMP codes
• resolution available by calling omp_get_wtick()
omp_get_wtime (cont’d)
• Fortran

double precision t1, t2, omp_get_wtime
t1 = omp_get_wtime() ! Start clock
... perform computation here ...
t2 = omp_get_wtime() ! Stop clock
print*, 'wall-clock time = ', t2-t1

• C

double t1, t2;
t1 = omp_get_wtime(); // start clock
/* ... perform computation here ... */
t2 = omp_get_wtime(); // stop clock
printf("wall-clock time = %5.3f\n", t2-t1);
Timer Summary
          CPU        Wall
Fortran   cpu_time   system_clock
C         times      gettimeofday
MPI                  MPI_Wtime
OpenMP               omp_get_wtime
Exercise 2
• Put wall-clock timer around each “level” in the example code
• Print time for each level
• Compile and run
PROFILING
Profilers
• profile tells you how much time is spent in each routine
  • gives a level of granularity not available with previous timers
  • e.g., function may be called from many places
• various profilers available, e.g.
  • gprof (GNU) -- function level profiling
  • pgprof (Portland Group) -- function and line level profiling
gprof
• compile with -pg
• when you run executable, file gmon.out will be created
• gprof executable > myprof
• this processes gmon.out into myprof
• for multiple processes (MPI), copy or link gmon.out.n to gmon.out, then run gprof
gprof (cont’d)
ngranularity: Each sample hit covers 4 bytes. Time: 435.04 seconds
  %   cumulative    self               self     total
 time   seconds    seconds     calls  ms/call  ms/call  name
 20.5     89.17     89.17         10  8917.00 10918.00  .conduct [5]
  7.6    122.34     33.17        323   102.69   102.69  .getxyz [8]
  7.5    154.77     32.43                               .__mcount [9]
  7.2    186.16     31.39     189880     0.17     0.17  .btri [10]
  7.2    217.33     31.17                               .kickpipes [12]
  5.1    239.58     22.25  309895200     0.00     0.00  .rmnmod [16]
  2.3    249.67     10.09        269    37.51    37.51  .getq [24]
gprof (3)
ngranularity: Each sample hit covers 4 bytes. Time: 435.04 seconds
                                called/total     parents
index  %time   self  descendents  called+self   name          index
                                called/total     children

               0.00     340.50       1/1        .__start [2]
[1]    78.3    0.00     340.50       1          .main [1]
               2.12     319.50      10/10       .contrl [3]
               0.04       7.30      10/10       .force [34]
               0.00       5.27       1/1        .initia [40]
               0.56       3.43       1/1        .plot3da [49]
               0.00       1.27       1/1        .data [73]
pgprof
• compile with Portland Group compiler
  • pgf90 (pgf95, etc.)
  • pgcc
  • –Mprof=func
    • similar to –pg
• run code
• pgprof –exe executable
  • pops up window with flat profile
pgprof (cont’d)

[Screenshot: pgprof flat-profile window]
pgprof (3)
• To save profile data to a file:
  • re-run pgprof using –text flag
  • at command prompt type p > filename
    • filename is the name you want to give the profile file
  • type quit to get out of profiler
• Close pgprof as soon as you’re through
  • Leaving window open ties up a license (only a few available)
Line-Level Profiling
• Times individual lines
• For pgprof, compile with the flag –Mprof=line
• Optimizer will re-order lines
  • profiler will lump lines in some loops or other constructs
  • may want to compile without optimization, though timings then may not reflect the optimized code
• In flat profile, double-click on function to get line-level data
Line-Level Profiling (cont’d)

[Screenshot: pgprof line-level profile window]
CACHE
Cache
• Cache is a small chunk of fast memory between the main memory and the registers

[Diagram: registers ↔ primary cache ↔ secondary cache ↔ main memory]
Cache (cont’d)
• If variables are used repeatedly, code will run faster since cache memory is much faster than main memory
• Variables are moved from main memory to cache in lines
• L1 cache line sizes on our machines
  • Opteron (katana cluster)     64 bytes
  • Xeon (katana cluster)        64 bytes
  • Power4 (p-series)           128 bytes
  • PPC440 (Blue Gene)           32 bytes
  • Pentium III (linux cluster)  32 bytes
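Since variables move in whole lines, the line size determines how many consecutive array elements arrive per miss. A small sketch (the constant and function name are ours, assuming the 64-byte lines quoted above for the Opteron and Xeon):

```c
#include <stddef.h>

enum { LINE_BYTES = 64 };   /* assumed L1 line size (Opteron/Xeon above) */

/* Number of consecutive elements of the given size that fit in one line. */
int elems_per_line(size_t elem_size)
{
    return LINE_BYTES / (int)elem_size;
}
```

With 8-byte doubles this gives 8 elements per line, so a contiguous sweep incurs one miss per 8 accesses.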
Cache (3)
• Why not just make the main memory out of the same stuff as cache?
  • Expensive
  • Runs hot
  • This was actually done in Cray computers
    • Liquid cooling system
Cache (4)
• Cache hit
  • Required variable is in cache
• Cache miss
  • Required variable not in cache
  • If cache is full, something else must be thrown out (sent back to main memory) to make room
• Want to minimize number of cache misses
Cache (5)

for(i=0; i<10; i++) x[i] = i;

[Diagram: main memory holds x[0]–x[9], a, b, …; a “mini” cache holds 2 lines, 4 words each]
Cache (6)

for(i=0; i<10; i++) x[i] = i;

• will ignore i for simplicity
• need x[0], not in cache → cache miss
• load line from memory into cache
• next 3 loop indices result in cache hits

[Diagram: cache now holds the line x[0]–x[3]]
Cache (7)

for(i=0; i<10; i++) x[i] = i;

• need x[4], not in cache → cache miss
• load line from memory into cache
• next 3 loop indices result in cache hits

[Diagram: cache now holds x[0]–x[3] and x[4]–x[7]]
Cache (8)

for(i=0; i<10; i++) x[i] = i;

• need x[8], not in cache → cache miss
• load line from memory into cache
• no room in cache!
• replace old line

[Diagram: the x[0]–x[3] line is evicted; cache holds x[4]–x[7] and x[8], x[9], a, b]
Cache (9)
• Contiguous access is important
• In C, multidimensional array is stored in memory as

a[0][0]  a[0][1]  a[0][2]  …
Cache (10)
• In Fortran and Matlab, multidimensional array is stored the opposite way:

a(1,1)  a(2,1)  a(3,1)  …
Cache (11)
• Rule: Always order your loops appropriately
  • will usually be taken care of by optimizer
  • suggestion: don’t rely on optimizer

C:
for(i=0; i<N; i++){
  for(j=0; j<N; j++){
    a[i][j] = 1.0;
  }
}

Fortran:
do j = 1, n
  do i = 1, n
    a(i,j) = 1.0
  enddo
enddo
TUNING TIPS
Tuning Tips
• Some of these tips will be taken care of by compiler optimization
  • It’s best to do them yourself, since compilers vary
• Two important rules
  • minimize number of operations
  • access cache contiguously
Tuning Tips (cont’d)
• Access arrays in contiguous order
  • For multi-dimensional arrays, rightmost index varies fastest for C and C++, leftmost for Fortran and Matlab

Bad:
for(j=0; j<N; j++){
  for(i=0; i<N; i++){
    a[i][j] = 1.0;
  }
}

Good:
for(i=0; i<N; i++){
  for(j=0; j<N; j++){
    a[i][j] = 1.0;
  }
}
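A self-contained sketch of the two orders (function names are ours): both compute the same sum, but the i-outer version walks memory contiguously in C and is the cache-friendly one.

```c
#define N 512
static double a[N][N];

void fill_a(void)                /* some data to traverse */
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] = i + j;
}

double sum_row_major(void)       /* Good in C: rightmost index innermost */
{
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

double sum_col_major(void)       /* Bad in C: strides N doubles per access */
{
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];
    return s;
}
```

Timing the two with any of the wall-clock timers above shows the difference; the results are identical.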
Tuning Tips (3)
• Eliminate redundant operations in loops

Bad:
for(i=0; i<N; i++){
  x = 10;
  …
}

Good:
x = 10;
for(i=0; i<N; i++){
  …
}
Tuning Tips (4)
• Eliminate or minimize if statements within loops
  • if may inhibit pipelining

Bad:
for(i=0; i<N; i++){
  if(i == 0)
    perform i=0 calculations
  else
    perform i>0 calculations
}

Good:
perform i=0 calculations
for(i=1; i<N; i++){
  perform i>0 calculations
}
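A concrete sketch of the peeling (the recurrence and function names are made up for illustration): the i = 0 case moves out of the loop so the body carries no branch, and the results are unchanged.

```c
/* Branchy version: tests i on every iteration. */
void fill_with_branch(double *x, int n)
{
    for (int i = 0; i < n; i++) {
        if (i == 0)
            x[i] = 1.0;              /* i = 0 calculation */
        else
            x[i] = 0.5 * x[i-1];     /* i > 0 calculation */
    }
}

/* Peeled version: same results, branch-free loop body. */
void fill_peeled(double *x, int n)
{
    if (n <= 0) return;
    x[0] = 1.0;                      /* i = 0 calculation, done once */
    for (int i = 1; i < n; i++)
        x[i] = 0.5 * x[i-1];         /* i > 0 calculation */
}
```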
Tuning Tips (5)
• Divides are expensive
  • Intel x86 clock cycles per operation
    • add       3-6
    • multiply  4-8
    • divide   32-45

Bad:
for(i=0; i<N; i++){
  x[i] = y[i]/scalarval;
}

Good:
qs = 1.0/scalarval;
for(i=0; i<N; i++){
  x[i] = y[i]*qs;
}
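Wrapped up as functions (names ours), the hoisted reciprocal performs one divide instead of N and gives the same results up to rounding:

```c
/* One divide per element. */
void scale_by_divide(double *x, const double *y, int n, double s)
{
    for (int i = 0; i < n; i++)
        x[i] = y[i] / s;
}

/* One divide total, then cheap multiplies. */
void scale_by_multiply(double *x, const double *y, int n, double s)
{
    double qs = 1.0 / s;         /* hoisted out of the loop */
    for (int i = 0; i < n; i++)
        x[i] = y[i] * qs;
}
```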
Tuning Tips (6)
• There is overhead associated with a function call

Bad:
for(i=0; i<N; i++)
  myfunc(i);

Good:
myfunc( );

void myfunc( ){
  for(int i=0; i<N; i++){
    do stuff
  }
}
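A runnable sketch of the same transformation (the per-element work and the function names are made-up placeholders): the second version enters one function that loops internally, instead of making N calls.

```c
static double work(int i)            /* hypothetical per-element work */
{
    return 0.5 * i;
}

double sum_many_calls(int n)         /* Bad: n function calls */
{
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += work(i);
    return s;
}

double sum_one_call(int n)           /* Good: the loop lives in one call */
{
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += 0.5 * i;                /* work inlined into the loop */
    return s;
}
```

In practice compilers often inline small functions at higher optimization levels, which is why the slide's advice matters most for calls the optimizer cannot see through.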
Tuning Tips (7)
• Minimize calls to math functions

Bad:
for(i=0; i<N; i++)
  z[i] = log(x[i]) + log(y[i]);

Good:
for(i=0; i<N; i++)
  z[i] = log(x[i] * y[i]);
Tuning Tips (8)
• recasting may be costlier than you think

Bad:
sum = 0.0;
for(i=0; i<N; i++)
  sum += (float) i;

Good:
isum = 0;
for(i=0; i<N; i++)
  isum += i;
sum = (float) isum;
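Written out as functions (names ours), both forms produce the same value for this small N; the second performs a single cast instead of one per iteration:

```c
float sum_with_casts(int n)      /* Bad: one int-to-float cast per iteration */
{
    float sum = 0.0f;
    for (int i = 0; i < n; i++)
        sum += (float) i;
    return sum;
}

float sum_cast_once(int n)       /* Good: accumulate as integer, cast once */
{
    long isum = 0;
    for (int i = 0; i < n; i++)
        isum += i;
    return (float) isum;
}
```

For very large N the integer version is also the more accurate one, since the float accumulator loses low-order bits once the partial sums grow large.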
Exercise 3 (not in class)
• The example code provided is written in a clear, readable style that also happens to violate many of the tuning tips we have just reviewed
• Examine the line-level profile. Which lines are using the most time? Is there anything we might be able to do to make the code run faster?
• We will discuss options as a group
  • come up with a strategy
  • modify code
  • re-compile and run
  • compare timings
• Re-examine line-level profile, come up with another strategy, repeat procedure, etc.
Speedup Ratio and Parallel Efficiency
• S is the ratio of T1 to TN, the elapsed times on 1 and N workers
• f is the fraction of T1 due to code sections that are not parallelizable
• Amdahl’s Law below states that a code whose parallelizable component comprises 90% of total computation time can at best achieve a 10x speedup with many workers; a code that is 50% parallelizable speeds up at most two-fold
• The parallel efficiency is E = S / N
  • A program that scales linearly (S = N) has parallel efficiency 1
  • A task-parallel program is usually more efficient than a data-parallel program
  • Parallel codes can sometimes achieve super-linear speedup due to efficient cache usage per worker
$$S_N = \frac{T_1}{T_N} = \frac{T_1}{f\,T_1 + (1-f)\,T_1/N} = \frac{1}{f + (1-f)/N} \;\longrightarrow\; \frac{1}{f} \quad (N \to \infty)$$
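The law can be checked numerically; a small sketch (function name ours), with f the serial fraction:

```c
/* Amdahl speedup: f = serial (non-parallelizable) fraction, n = workers. */
double amdahl_speedup(double f, int n)
{
    return 1.0 / (f + (1.0 - f) / n);
}
```

With f = 0.1 (90% parallelizable) the speedup approaches 10 as n grows; with f = 0.5 it approaches 2, matching the slide.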
Example of Speedup Ratio & Parallel Efficiency