Cache Optimizations & the Loop Nest Optimizer


Post on 21-Dec-2015


Page 1: Cache Optimizations & the Loop Nest Optimizer

Page 2: Improvement Opportunities

A program runs slowly when not all resources are used:
• processor:
– missed opportunities for superscalar execution (ILP)
– suboptimal scheduling of instructions (too many wait states)
• memory access:
– not all data in a cache line is used (spatial locality)
– data in the cache is not reused (temporal locality)

Performance analysis is used to diagnose the problem.

The compiler will attempt to optimize the program for the given architecture:
• data structures can inhibit compiler optimizations
• the presentation of the algorithm can inhibit compiler optimizations

Often it is necessary to rewrite critical parts of the code (loops) so that the compiler can do better performance optimization. To do this well, it helps to understand the compiler's optimization techniques.

Page 3: Compiler Optimization Techniques

The following optimizations are built into the compiler:
• general:
– procedure inlining
– data and array padding
• loop based:
– loop interchange
– outer and inner loop unrolling
– cache blocking
– loop fusion (merge) and fission (split)
• code generation:
– software pipelining
– instruction reordering

Presenting the algorithm in the program so that the compiler can apply these optimization techniques leads to optimal program performance on the machine.

Loop nests imply the use of multi-dimensional arrays. Loop nest optimization is enabled at -O3 or with -LNO:opt=[1|0].

Page 4: Scalar Architecture: Cache System

• The hierarchy of memory devices: registers, L1 and L2 caches, main memory, disk.
• The goal of the memory hierarchy:
– access speed ~ that of the fastest memory
– effective capacity ~ the size of the largest memory
-> Programs should follow the principle of locality (use items in the cache):
– spatial locality of reference (use all words in a cache line)
– temporal locality of reference (reuse the same cache line)

[Figure: the cache subsystem, by device capacity and speed of access: 64 registers (1/clock), 32 KB L1 (~2-3 cycles), 8 MB L2 (~10 cycles), ~1-100s GB main memory (~100 cycles), disk (~4000 cycles).]
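The two locality principles can be shown with a minimal C sketch (function names are ours). Both functions compute the same sum, but in C's storage order only the row-wise walk uses every word of each fetched cache line:

```c
#include <stddef.h>

#define N 512

/* Row-wise walk: j varies fastest, so consecutive accesses hit
   consecutive memory words and every word of a fetched cache line
   is used (spatial locality). */
double sum_rowwise(double a[N][N]) {
    double s = 0.0;
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

/* Column-wise walk: consecutive accesses are N*sizeof(double) bytes
   apart, so for large N each cache line yields only one useful word
   before it is evicted (no spatial locality). */
double sum_colwise(double a[N][N]) {
    double s = 0.0;
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            s += a[i][j];
    return s;
}
```

The results are identical; only the memory access pattern, and therefore the cache behavior, differs.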

Page 5: Scalar Architecture: Cache Organization

The goal of scalar optimization:
– spatial locality of reference (use all words in a cache line)
– temporal locality of reference (reuse the same cache line)

[Figure: words in memory grouped into cache lines (32 words each), mapping into an example 8 MB (2097152-word) L2 cache on the O2K. A load instruction (ld) requests a single word, but a whole cache line is transferred.]

‡ A cache hit loads the word from the cache.
‡ A cache miss loads the whole cache line from memory.

Page 6: Problems of Scalar Optimization

DO i=1,n
  DO j=1,n
    DO k=1,n
      C(i,j) = C(i,j) + A(i,k)*B(k,j)
    ENDDO
  ENDDO
ENDDO

– each C(i,j) value is accumulated in a register for A(i,k)*B(k,j)
– B is traversed in sequence of cache lines (spatial locality)
– A accesses only 1 word from each cache line (no locality)
– for A and B there is no reuse of cache lines (if n is large)

This is a problem only if A, B and C do not fit into the cache.

[Figure: C = A x B, showing how cache lines run through the three matrices relative to the i, j, k loop directions.]
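The same triple loop in C (row-major storage) has the opposite preference to Fortran. A sketch with our own function names: both orders give identical results, but in the i,k,j order the inner loop walks C and B with stride 1 while A(i,k) stays in a register:

```c
#define M 16  /* small size for the demonstration; locality matters for large M */

/* i,j,k order: the inner k loop walks B down a column (stride M in C). */
void mm_ijk(double C[M][M], double A[M][M], double B[M][M]) {
    for (int i = 0; i < M; i++)
        for (int j = 0; j < M; j++)
            for (int k = 0; k < M; k++)
                C[i][j] += A[i][k] * B[k][j];
}

/* i,k,j order: the inner j loop walks C and B along rows (stride 1),
   while A[i][k] is invariant in the inner loop -- full cache lines
   are used. */
void mm_ikj(double C[M][M], double A[M][M], double B[M][M]) {
    for (int i = 0; i < M; i++)
        for (int k = 0; k < M; k++)
            for (int j = 0; j < M; j++)
                C[i][j] += A[i][k] * B[k][j];
}
```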

Page 7: Loop Nest Optimizer

LNO performs loop restructuring to optimize data access:
• loop interchange
• loop unrolling
• loop blocking for cache
• loop fusion
• loop fission
• pre-fetching

LNO is controlled with compiler options and/or compiler directives or pragmas (the same options exist for both):
• LNO is on by default at -O3, but can be turned on/off individually with -LNO:opt=[1|0]
• directive/pragma syntax:
– Fortran: C*$* keyword [=value(s)]
– C/C++: #pragma keyword [=value(s)]
• directives/pragmas can be disabled with the compiler switch -LNO:ignore_pragmas

Page 8: Array Indexing

There are several ways to index arrays:
• the addressing scheme has an impact on performance
• arrays should be accessed in the most natural, direct way so that the compiler can apply its loop optimization techniques

Direct addressing (++):
DO J=1,M
  DO I=1,N
    ... A(I,J) ...
  ENDDO
ENDDO

Explicit addressing (+):
DO J=1,M
  DO I=1,N
    ... A(I+(J-1)*N) ...
  ENDDO
ENDDO

Indirect addressing (--):
DO J=1,M
  DO I=1,N
    ... A(INDEX(I,J)) ...
  ENDDO
ENDDO

Loop-carried addressing (-):
DO J=1,M
  DO I=1,N
    K = K + 1
    ... A(K) ...
  ENDDO
ENDDO

Page 9: Data Storage in Memory

Data storage order is language dependent:
• Fortran stores multi-dimensional arrays "column-wise": for A(I,J) the left-most index changes fastest, so A(i,j), A(i+1,j), A(i+2,j), ... are adjacent in memory
• C stores multi-dimensional arrays "row-wise": for a[i][j] the right-most index changes fastest, so a[i][j], a[i][j+1], a[i][j+2], ... are adjacent in memory
• accessing array elements in storage order greatly improves performance for arrays that do not fit in the cache(s)
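The "row-wise" rule can be checked directly from the linearized offset. A small C sketch (the helper name is ours):

```c
#include <stddef.h>

/* C is row-major: element a[i][j] of an n-by-m array lives at
   linear offset i*m + j from the start, so the right-most index j
   moves through adjacent memory words. */
size_t row_major_offset(size_t i, size_t j, size_t m) {
    return i * m + j;
}
```

For a Fortran array A(N,M) the corresponding column-major offset would be (j-1)*N + (i-1), with the left-most index varying fastest.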

Page 10: Loop Interchange: FORTRAN

• The distribution of data in memory is not changed; only the access pattern changes.
• The compiler can do this optimization automatically: -LNO:interchange=[on|off] (default on)

Original loop:
c*$* no interchange
DO I=1,N
  DO J=1,M
    C(I,J)=A(I,J)+B(I,J)
  ENDDO
ENDDO

Interchanged loops:
c*$* interchange(J,I)
DO J=1,M
  DO I=1,N
    C(I,J)=A(I,J)+B(I,J)
  ENDDO
ENDDO

[Figure: in the original loop the access order of A(I,J), B(I,J), C(I,J) runs against the column-wise storage order; after the interchange the access order matches the storage order.]

Page 11: Index Reversal

Original loop:
DO I=1,N
  DO J=1,M
    C(I,J)=A(I,J)+B(J,I)
  ENDDO
ENDDO

The access is poor for A and C, while it is optimal for B. A plain interchange would be good for A and C but bad for B.

Interchanged loops + index reversal on B:
DO J=1,M
  DO I=1,N
    C(I,J)=A(I,J)+B(I,J)
  ENDDO
ENDDO

• Index reversal on B, i.e. replacing B(I,J) by B(J,I), must be done everywhere in the program.
• This has to be done manually; there is no compiler optimization that does index reversal.

Page 12: The Significance of Loop Interchange

DO I=1,700
  DO J=1,700
    DO K=1,700
      A(I,J,K)=A(I,J,K)+B(I,J,K)*C(I,J,K)
    ENDDO
  ENDDO
ENDDO

Run time in seconds obtained on an Origin 3000 (R12K@400MHz, 8 MB cache):

loop order   run time (s)
i,j,k        535.0
j,i,k        32.0
k,j,i        11.0

Page 13: Loop Interchange in C

In C, the situation is exactly the opposite of Fortran:
• the performance benefits in C are the same as in Fortran
• in most practical situations, loop interchange (supported by the compiler) is much easier to achieve than index reversal

Original loop (addressing of c[i][j] and a[i][j] is poor; addressing of b[j][i] is optimal):
#pragma no interchange
for(j=0; j<m; j++)
  for(i=0; i<n; i++)
    c[i][j]=a[i][j]+b[j][i];

Interchanged loop:
#pragma interchange(i,j)
for(i=0; i<n; i++)
  for(j=0; j<m; j++)
    c[i][j]=a[i][j]+b[j][i];

Index reversal loop:
for(j=0; j<m; j++)
  for(i=0; i<n; i++)
    c[j][i]=a[j][i]+b[j][i];

Page 14: Array Placement Effects

"Poor" data placement in memory can lead to cache thrashing.

There are 2 techniques built into the compiler to avoid cache thrashing:
• array padding
• leading dimension extension

NOTE: the leading dimension of arrays should be an odd number; if a multi-dimensional array has small extents (e.g. A(64,64,64,...)), several leading dimensions should be odd numbers.

Page 15: Direct-Mapped Caches: Thrashing

COMMON //A(8192), B(8192)
DO I=1,N
  PROD = PROD + A(I)*B(I)
ENDDO

[Figure: a 32 KB direct-mapped cache with 4-word cache lines (2048 lines), fed from (virtual) memory that holds A(1)...A(8192) immediately followed by B(1)...B(8192); each array is exactly 32 KB.]

Location in the cache: (memory address) mod (cache size).
In this case loc(A(1)) mod 32KB = loc(B(1)) mod 32KB
[because B(1) = A(1) + 8192 words, and 8192*4B mod 32KB = 0]

Thrashing: every memory reference results in a cache miss, because A(I) and B(I) always map to the same cache line and evict each other.
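The mapping rule on this slide can be checked with a few lines of C (the address used below is hypothetical; only the modular arithmetic matters):

```c
#include <stdint.h>

/* Location of a byte address in a direct-mapped cache:
   simply the address modulo the cache size. */
uint32_t cache_loc(uint32_t byte_addr, uint32_t cache_size) {
    return byte_addr % cache_size;
}
```

Because B(1) lies exactly 8192 REAL*4 words (= one full cache size) past A(1), every pair A(I), B(I) lands on the same cache location.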

Page 16: Set-Associative Caches

COMMON //A(8192), B(8192)
DO I=1,N
  PROD = PROD + A(I)*B(I)
ENDDO

[Figure: a 32 KB 2-way set-associative cache with 4-word cache lines (1024 sets), a set-select bit (LRU), and the same memory layout of A and B as before.]

Location in the cache: (memory address) mod (cache size / associativity).
In this case loc(A(1)) mod 16KB = loc(B(1)) mod 16KB — BUT the two lines go into the two different ways of the set (chosen by the LRU set-select bit)!

No thrashing: conflicting cache lines are stored in different ways of the set.

Page 17: Array Padding: Example

Assume a 32 KB cache.

COMMON // A(1024,1024), B(1024,1024), C(1024,1024)
DO J=1,1024
  DO I=1,1024
    A(I,J) = A(I,J)+B(I,J)*C(I,J)
  ENDDO
ENDDO

Addr[C(1,1)] = Addr[B(1,1)] + 1024*1024*4
Position in the cache: C(1,1) maps onto B(1,1), since (1024*1024*4) mod 32KB = 0.

With padding:
COMMON // A(1024,1024), pad1(129), B(1024,1024), pad2(129), C(1024,1024)
DO J=1,1024
  DO I=1,1024
    A(I,J) = A(I,J)+B(I,J)*C(I,J)
  ENDDO
ENDDO

Addr[C(1,1)] = Addr[B(1,1)] + 1024*1024*4 + 129*4
Position in the cache: C(1,1) now maps onto B(129,1) (mod 32KB).

• Padding causes conflicting cache lines to be placed in different cache locations.
• The compiler will try to do this padding automatically.
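The effect of the 129-word pad can be verified with the same modular arithmetic (a sketch assuming 4-byte REALs and a 32 KB direct-mapped cache, as on the slide):

```c
#include <stdint.h>

/* Cache location of a byte offset in a 32 KB direct-mapped cache. */
uint32_t loc32k(uint32_t byte_offset) {
    return byte_offset % (32u * 1024u);
}
```

Without the pad the distance between corresponding elements is an exact multiple of the cache size, so they collide; the pad shifts every B element by 129 words relative to A.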

Page 18: Maxwell Code Example

REAL EX(NX,NY,NZ),EY(NX,NY,NZ),EZ(NX,NY,NZ) ! Electric field
REAL HX(NX,NY,NZ),HY(NX,NY,NZ),HZ(NX,NY,NZ) ! Magnetic field
...
DO K=2,NZ-1
  DO J=2,NY-1
    DO I=2,NX-1
      HX(I,J,K)=HX(I,J,K)-(EZ(I,J,K)-EZ(I,J-1,K))*CHDY
     &                   +(EY(I,J,K)-EY(I,J,K-1))*CHDZ
      HY(I,J,K)=HY(I,J,K)-(EX(I,J,K)-EX(I,J,K-1))*CHDZ
     &                   +(EZ(I,J,K)-EZ(I-1,J,K))*CHDX
      HZ(I,J,K)=HZ(I,J,K)-(EY(I,J,K)-EY(I-1,J,K))*CHDX
     &                   +(EX(I,J,K)-EX(I,J-1,K))*CHDY
    ENDDO
  ENDDO
ENDDO

Here NX=NY=NZ = 32, 64, 128, 256 (i.e. with REAL*4 elements: 0.8 MB, 6.3 MB, 50 MB, 403 MB).

Compiling with -mips4 -O3 -LNO:opt=0 -OPT:reorg_common=off (to show the effect of the compiler not performing the necessary optimizations) gives a performance of 4.6 Mflop/s on this code.

Reusing loads from the previous iteration (I-1) gives in total:
• 13 memory operations (6 H + 7 E) -> minimum 13 cycles/iteration
• 18 floating point operations in this code
• 18/(13*2) = 69% of the 800 Mflop/s peak of the R10000@400MHz processor

Page 19: Maxwell Example - continued

Problem:
• the array dimensions are small even numbers (powers of 2), so the arrays map to the same locations in both the 1st level and the 2nd level caches

In general:
primary cache 32 KB = 2 (way-set-ass.) * 4 (size of REAL) * 4096
secondary cache 8 MB = 2 (way-set-ass.) * 4 (size of REAL) * 1048576

C print position of arrays in memory with the code:
      INTEGER*8 aEX
      aEX = %LOC(EX(1,1,1))
      print *,'Addr EX=',mod(aEX,4096),mod(aEX,1048576),' words'

• for the Maxwell example with NX=NY=NZ=64 the print shows:
Addr EX= 3720 470664
Addr EY= 3720 470664
........ etc.
Addr HZ= 3720 470664
-> all arrays map to the same locations in both caches

• The compiler is able to pad the arrays automatically. Compiling with the default optimizations -mips4 -O3 gives a performance of 162 Mflop/s.

Page 20: Dangers of Array Padding

• The compiler will automatically pad local data.
• -O3 optimization will automatically pad common blocks.
• Padding of common blocks is safe as long as the Fortran standard is not violated. This code violates the standard by indexing A past its declared size, relying on A and B being contiguous in the common block:

SUBROUTINE SUB
COMMON // A(512,512), B(512,512)
DO I=1, 2*512*512
  A(I) = 0.0
ENDDO
END

• Fix the violation, or do not use this optimization: either compile with a lower optimization level or use the explicit compiler flag -OPT:reorg_common=off.

Page 21: Variable Length Arrays (VLA)

The SGI compiler supports variable length arrays in C and Fortran.
• It is standard in F90 and an SGI extension in F77:

SUBROUTINE NAME1(N,M)
DIMENSION R(N,M)
... etc. ...
END

• In C it is an SGI extension:

void name1(int m, int n)
{
  double r[m][n][n+m];
  ... etc. ...
}

• These arrays are created on the stack, as opposed to a location in a static area.
• VLAs are very handy as scratch arrays, since they are created each time execution enters the subroutine and destroyed at exit.
• Unlike static arrays, VLAs allow for proper aliasing and alignment considerations by the compiler.

Page 22: Loop Unrolling

Loop unrolling: perform multiple loop iterations at the same time.

Advantages of loop unrolling:
• more opportunities for super-scalar code
• more data re-use & pseudo-prefetch
• exploits the presence of cache lines
• reduction in loop overhead (minor)

NOTE: inner loops should "never" be unrolled by hand; the compiler will typically unroll the inner loop by the necessary amount for software pipelining (SWP).

Original loop:
DO I=1,N,1
  ...(I)...
ENDDO

Unrolled loop plus cleanup:
DO I=1,N-MOD(N,UNROLL),UNROLL
  ...(I)...
  ...(I+1)...
  ...(I+2)...
  ...(I+UNROLL-1)...
ENDDO
DO I=N-MOD(N,UNROLL)+1,N   ! cleanup
  ...(I)...
ENDDO

The unroll factor is controlled with the directive C*$* unroll(p):
p = 0  default unrolling
p = 1  no unrolling
p = UNROLL  unroll by that factor
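In C the unrolled-plus-cleanup pattern looks like the sketch below (a hand-written illustration of what the compiler generates; the four separate accumulators additionally shorten the dependence chain between additions):

```c
/* Sum with a manual unroll-by-4 plus a cleanup loop. The result is
   the same as the simple loop; only the schedule changes. */
double sum_unrolled(const double *b, int n) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    int i;
    for (i = 0; i + 3 < n; i += 4) {  /* main unrolled body */
        s0 += b[i];
        s1 += b[i + 1];
        s2 += b[i + 2];
        s3 += b[i + 3];
    }
    for (; i < n; i++)                /* cleanup: the n mod 4 leftovers */
        s0 += b[i];
    return (s0 + s1) + (s2 + s3);
}
```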

Page 23: Prefetch Data from Memory

Reordering instructions in an unrolled loop leads to effective (pseudo-) prefetch of the data:

for(i=0; i<n; i+=4){
  a += b[i+0];
  a += b[i+1];
  a += b[i+2];
  a += b[i+3];
}

becomes

for(i=0; i<n; i+=4){
  t = b[i+3];      /* touches the next cache line early */
  a += b[i+0];
  a += b[i+1];
  a += b[i+2];
  a += t;
}

• There is no instruction overhead; the compiler does this optimization automatically.

Explicit (manual) prefetch from memory:
• prefetch into the 1st level cache should be done in the form of pseudo-prefetch
• the compiler will insert prefetches into the 2nd level cache automatically (LNO)
• manual prefetch into the 2nd level cache can be done with a compiler directive:

c*$* prefetch_ref=a(1)
c*$* prefetch_ref=a(1+16)

      do I=1,n
c*$*     prefetch_ref=a(I+32),stride=16,kind=rd
         sum = sum + a(I)
      enddo

• the same is available in C with the corresponding #pragma directive

Page 24: Outer Loop Unrolling

Original loop:
DO I=1,N
  DO J=1,N
    A(I)=A(I)+B(I,J)*C(J)
  ENDDO
ENDDO

Problem:
– A(I) is constant for the inner loop J
– C(J) is traversed on every I iteration
– B(I,J) is traversed poorly (stride N)

Outer loop unrolled by 4:
DO I=1,N,4  ! unrolling by 4
  DO J=1,N
    A(I+0)=A(I+0)+B(I+0,J)*C(J)
    A(I+1)=A(I+1)+B(I+1,J)*C(J)
    A(I+2)=A(I+2)+B(I+2,J)*C(J)
    A(I+3)=A(I+3)+B(I+3,J)*C(J)
  ENDDO
ENDDO

Unrolling the outer loop loads a complete cache line of B (one 1st-level cache line) into the registers -> data re-use.

• the unroll factor should match the cache line size
• this is mostly a 1st level cache optimization
• if the data fits into the 2nd level cache, this is a good optimization to use

-LNO:outer_unroll=n


Page 26: Blocking for Cache (Tiling)

Blocking for cache:
• an optimization that applies to data sets that do not fit into the (2nd level) data cache
• a way to increase spatial locality of reference (i.e. exploit full cache lines)
• a way to increase temporal locality of reference (i.e. to improve data re-use)
• it is beneficial mostly with multi-dimensional arrays

Original loop:
DO I=1,N
  .... (I) ....
ENDDO

Blocked loop:
DO i1=1,N,nb
  DO I=i1,min(i1+nb-1,N)
    .... (I) ....
  ENDDO
ENDDO

The inner loop is traversed only over a range of nb elements at a time.

-LNO:blocking=[on|off] (default on)
-LNO:blocking_size=n1,n2 (for L1 and L2)

By default LNO assumes L1=32KB and L2=1MB; use -LNO:cs2=8M to specify an 8 MB L2 cache.

Page 27: Blocking: Example

The following loop nest:

for(i=0; i<n; i++)
  for(j=0; j<m; j++)
    x[i][j] = y[i] + z[j];

• x[i][j] is traversed in storage order, y[i] is loop invariant, and z[j] is traversed sequentially, so changing the loop order is not beneficial in this case
• z[j] is reused on each i iteration
• for large m the array z will not be reused from the cache

Blocking the loops for cache:

for(it=0; it<n; it+=nb)
  for(jt=0; jt<m; jt+=nb)
    for(i=it; i<min(it+nb,n); i++)
      for(j=jt; j<min(jt+nb,m); j++)
        x[i][j] = y[i] + z[j];

• nb elements of the z array are brought into the cache and reused nb times before moving on to the next tile

Page 28: Loop Fusion

Loop fusion (merging two or more loops together):
• fusing loops that refer to the same data enhances temporal locality
• a larger loop body allows more effective scalar optimizations

Example, original loops:
for(i=0; i<n; i++) a[i] = b[i] + 1;
for(i=0; i<n; i++) c[i] = a[i]/2;
for(i=0; i<n; i++) d[i] = 1/c[i+1];

Fused loops:
for(i=0; i<n; i++){
  a[i] = b[i] + 1;
  c[i] = a[i]/2;
}
for(i=0; i<n; i++) d[i] = 1/c[i+1];

Fusing all loops with loop peeling:
a[0] = b[0] + 1;
c[0] = a[0]/2;
for(i=1; i<n; i++){
  a[i] = b[i] + 1;
  c[i] = a[i]/2;
  d[i-1] = 1/c[i];
}
d[n-1] = 1/c[n];

• loop peeling can be used to break data dependencies when fusing loops
• sometimes temporary arrays can be replaced by scalars (this optimization has to be done manually)
• the compiler will attempt to fuse loops only if they are adjacent, i.e. there is no code between the loops to be fused

-LNO:fusion=[0,1,2] (default 1)

Page 29: Loop Fusion in Array Assignments

Loop fusion is instrumental in generating good F90 code.

F90 code sequence:
A(1:N) = B(1:N)+1
C(1:N) = A(1:N)/2
D(1:N) = 1/C(2:N+1)

The compiler will typically generate the following instruction sequence:

Allocate T(1:N)
DO I=1,N
  T(I)=B(I)+1
ENDDO
DO I=1,N
  A(I) = T(I)
ENDDO
DO I=1,N
  T(I)= A(I)/2
ENDDO
DO I=1,N
  C(I) = T(I)
ENDDO
DO I=1,N
  T(I)=1/C(I+1)
ENDDO
DO I=1,N
  D(I) = T(I)
ENDDO

The compiler can optimize this loop sequence by fusion:
• for that, all assignments (loops) must be adjacent
• preserving the data dependencies, this can be fused to:

DO I=1,N
  A(I) = B(I)+1
  C(I) = A(I)/2
ENDDO
DO I=1,N
  D(I) = 1/C(I+1)
ENDDO

• further peeling to break the data dependence would merge the two remaining loops
• for this optimization to work automatically, no code should be placed between the array assignments, so that the assignments stay adjacent

Page 30: Loop Fission

Loop fission (splitting), or loop distribution:
• improves memory locality by splitting out loops that refer to different independent arrays

Original loop:
for(i=1; i<n; i++){
  a[i] = a[i] + b[i-1];
  b[i] = c[i-1]*x + y;
  c[i] = 1/b[i];
  d[i] = sqrt(c[i]);
}

After fission:
for(i=0; i<n-1; i++){
  b[i+1] = c[i]*x + y;
  c[i+1] = 1/b[i+1];
}
for(i=0; i<n-1; i++)
  a[i+1] = a[i+1] + b[i];
for(i=0; i<n-1; i++)
  d[i+1] = sqrt(c[i+1]);
i=n+1

-LNO:fission=[0,1,2] (default 1)
0 = no fission
1 = normal fission
2 = fission tried before fusion

LNO attempts to distribute inner loops.

Page 31: LNO: Gather-Scatter

A special form of loop fission:
• if the loop to be optimized contains conditional execution, it is often faster to evaluate all the conditions first
• the computationally intensive loop then runs only over the indices for which the condition was true and can be better optimized (SWP)
• LNO will not evaluate nested IF conditions unless -LNO:gather_scatter=2 is used

Original:
subroutine fred(a,b,c,n)
real*8 a(n), b(n), c(n)
do I=1,n
  if(c(I) .gt. 0) then
    a(I) = c(I)/b(I)
    c(I) = c(I)*b(I)
    b(I) = 2*b(I)
  endif
enddo
end

Transformed (conditional execution removed from the compute loop):
inc_0 = 0
do I=1,n
  deref_gs(inc_0+1) = I
  if(c(I) .gt. 0) then
    inc_0 = inc_0 + 1
  endif
enddo
do ind_0=0,inc_0-1
  I=deref_gs(ind_0+1)
  a(I) = c(I)/b(I)
  c(I) = c(I)*b(I)
  b(I) = 2*b(I)
enddo
end
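The same transformation sketched in C (the function and buffer names are ours; the gather buffer must be able to hold n indices):

```c
/* Gather-scatter by hand: first collect the indices where the
   condition holds, then run a branch-free compute loop over them.
   The dense second loop is what the software pipeliner can schedule. */
void fred_gs(double *a, double *b, double *c, int *gather, int n) {
    int cnt = 0;
    for (int i = 0; i < n; i++)      /* gather pass: conditions only */
        if (c[i] > 0.0)
            gather[cnt++] = i;
    for (int k = 0; k < cnt; k++) {  /* compute pass: no branches */
        int i = gather[k];
        a[i] = c[i] / b[i];
        c[i] = c[i] * b[i];
        b[i] = 2.0 * b[i];
    }
}
```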

Page 32: LNO: Vector Intrinsics

Most intrinsics have "vector" equivalents. The compiler will automatically substitute vector intrinsics, where legal, when the functions are invoked in a loop:

SUBROUTINE VFRED(A,N)
REAL*8 A(N)
DO I=1,N
  A(I) = A(I) + COS(A(I))
ENDDO
END

is transformed into:

CALL VCOS$(A(1),DEREF_SE1_F8(1),%VAL(N-1),%VAL(1),%VAL(1))
DO I=1,N
  A(I) = A(I) + DEREF_SE1_F8(I)
ENDDO

• vector intrinsics are faster if N>10, for most intrinsics
• vector intrinsics have different precision rules (1 or 2 ulp less)
• illegal arguments cannot be trapped with the vector intrinsics
• use -LNO:vintr=off to disable the generation of vector intrinsics

Page 33: Vector Intrinsics: Performance

[Figure: performance chart for the vector intrinsics; not reproduced in this transcript.]

Page 34: Data Dependence in Loops

In loops, each statement can be executed many times.
• loop carried data dependence: dependence between statements in different iterations
• loop independent data dependence: dependence between statements in the same iteration
• lexically forward dependence: the source precedes the target lexically
• lexically backward dependence: the opposite of the above
• the right-hand side of an assignment precedes the left-hand side

Example:

(1) for( i=2; i<9; i++){
(2)   x[i] = y[i] + z[i];
(3)   a[i] = x[i-1] + 1;
(4) }

Unroll to analyze: statement S2 (line 2) writes x[i], and statement S3 (line 3) reads that value in the next iteration as x[i-1] — a loop carried, lexically forward dependence from S2 to S3.

Page 35: Specifying the Dependency Rules

In the following example:

SUBROUTINE DAXPYI(N,X,K,A)
INTEGER N,K
REAL*8 X(N),A
DO I=1,N
  X(K+I) = X(K+I) + A*X(I)
ENDDO
END

If K>N there is no dependency; if K<N there is a dependency. The value of K is unknown to the compiler, so the compiler must assume there are dependencies.

Compiler schedules:
K<N (dependence): 14% of peak
K>N (no dependence): 33% of peak

The IVDEP directive (IVDEP = Ignore Vector DEPendency) can be used to communicate the data dependency rules to the compiler:

SUBROUTINE DAXPYI(N,X,K,A)
INTEGER N,K
REAL*8 X(N),A
cdir$ ivdep
DO I=1,N
  X(K+I) = X(K+I) + A*X(I)
ENDDO
END

Page 36: The IVDEP Directive

With indexed (indirect) addressing, IVDEP is the only way to tell the compiler that there are no data dependencies:

void update(int n, float *a, float *b, int *indx, float s)
{
  int i;
#pragma ivdep
  for(i=0; i<n; i++)
    a[indx[i]] += s*b[i];
}

• here ivdep means that the integer values stored in the indx array are all different, i.e. indx is a permutation array
• assuming no data dependencies produces faster code, because the compiler has fewer constraints on the ordering of the load and store instructions

The IVDEP directive is not part of the language and its interpretation is not standardized.

Page 37: Three Types of IVDEP Directive

The IVDEP directive is not part of any language and its interpretation is not standardized.

CDIR$ IVDEP
DO I=1,N
  A(INDEX(1,I)) = B(I)
  A(INDEX(2,I)) = C(I)
ENDDO

• SGI default interpretation: the accesses in different iterations are independent, which breaks both lexically forward (i+k) and backward (i-k) dependencies:
– index(1,i) != index(1,j)
– index(2,i) != index(2,j)
– but for some i, index(1,*) may still equal index(2,*)
• The default interpretation can be changed with -OPT: compiler options. Possible other interpretations:
– break only lexically backward dependencies (Cray IVDEP), i.e. assume only index(*,i) != index(*,i-k) (-OPT:cray_ivdep=on)
– there are no dependencies whatsoever (liberal IVDEP, enabled with -OPT:liberal_ivdep=on)

Page 38: The Argument Alias Problem

SUBROUTINE COPY(A,B,N)
REAL*8 A(N),B(N)
DO I=1,N
  B(I) = A(I)
ENDDO
END

In Fortran, the compiler assumes A and B do not overlap: it is a mistake to invoke COPY with overlapping arguments, and the compiler will perform optimizations assuming A and B are not aliases over the computational range.

void copy(double *a, double *b, int n)
{
  int i;
  for(i=0; i<n; i++)
    b[i] = a[i];
}

In C, argument aliases are allowed: the compiler must assume the pointers a and b can point to the same addresses. Therefore optimizations (SWP) that change the original order of loads and stores are not possible. There are several ways to remove this restriction:
– the ivdep pragma
– the compiler optimization flag -OPT:alias=<memory-access-model>
– the restrict keyword

Page 39: Aliases: the Optimizer Options

These options apply to the whole compilation unit:
-OPT:alias=[any,typed,unnamed,restrict,disjoint]
• any is the default: any pair of memory references may be aliased.

Of the other memory access models, the most important are:
• restrict: assume that any pair of memory references that are named differently do not point to the same region in memory

  float *p, *q;
  /* *p does not alias with *q, q, p or any global variable */

• disjoint: assume the same restrictions as restrict; in addition, no pointer de-reference points to an overlapping region in memory

  float *p, *q;
  /* *p does not alias with *q, q, p or any global variable;
     *p does not alias with **q, **p, ***q, etc. */

Page 40: The restrict Keyword

The Numerical C Extensions Group (X3J11.1) proposed (1993) a restrict keyword as the way to specify pointer access models. The restrict semantics:
• assume de-referencing the qualified pointer is the only way the program can access the memory pointed to by that pointer
• loads and stores through such a pointer do not alias with any other loads and stores, except those through the same pointer

void copy(double * restrict a, double * restrict b, int n)
{
  int i;
  for(i=0; i<n; i++)
    b[i] = a[i];
}

• in this example it is sufficient to qualify only b, since it is necessary to qualify only the pointers being stored through
• to enable the restrict keyword it is necessary to use the compiler flag (7.2 and 7.3 compilers): -LANG:restrict

Page 41: Alias in Storage Allocation

Program data can be stored in memory in 2 ways:
• Storage in the global area:
– memory pages are allocated statically, i.e. all data is placed at a fixed (virtual) address at load time
– loading such data often takes 2 instructions, since the load-immediate instruction in MIPS is limited to a 64 KB offset:
    la  R1,addr        #load base pointer
    ldw R2,R1+offset   #load base+offset
– COMMON block data, global data, SAVE data, malloc, mmap
– compilation with -static: all variables are allocated in the global area
• Storage on the stack:
– memory pages are allocated dynamically during program execution
– each subroutine invocation gets a new stack area for its local data
– loading data from the stack requires a single instruction:
    ldw R2,TOS+offset  #load TopOfStack+offset
– local (automatic) variables, temporary storage, alloca data
• Routines called from a parallel region:
– allocate a private stack area
– variables allocated on the private stack are private
– variables in the global area are shared (aliases)

Page 42: Procedure Inlining

Inlining: replace a function call by that function's source code.

DO I=1,N
  call DO_WORK(A(I),C(I))
ENDDO

SUBROUTINE DO_WORK(X,Y)
Y = 1 + X*(1 + X*0.5)
END

Advantages:
• increased opportunities for processor optimizations
• more opportunities for loop nest optimizations

Candidates for inlining are modules that:
• are "small", i.e. not much source code
• are called very often (typically in a loop)
• do not take much time per call

Inhibitors to inlining:
• mismatches in the subroutine arguments (type or shape)
• no inlining across languages (e.g. Fortran calling a C subroutine)
• no static (SAVE) local variables
• no varargs routines, no recursive routines
• no functions with alternate entry points
• no nested subroutines (as in F90)

-INLINE:list=[on|off] (default off)
-INLINE:must=sub1:never=sub2
-IPA:inline=[on|off] (default on)


Page 47: Software Pipelining (SWP)

Software pipelining is a way to mix iterations of a loop such that all processor execution slots are filled:
• SWP is performed by the code generator (CG), which also unrolls the inner loop to achieve the best SWP schedule (-O3 opt level). This can be computationally intensive.
• Vector loops are well-suited for SWP; short loops may run slower with SWP.

Inhibitors to SWP:
• loops with subroutine (or intrinsic) calls cannot be software pipelined
• loops with complicated conditionals or branching
• loops that are too long cannot be software pipelined, because the compiler runs out of available registers (use loop fission)
• data dependences between iterations make loops harder to software pipeline

Page 48: Summary

• Scalar optimization:
– improving ILP by code transformation and grouping independent instructions
– improving memory access by restructuring loop nests to take better advantage of the memory hierarchy
• Compilers are good at instruction level optimizations and loop transformations. It depends on the language, however:
– F77 is the easiest for the compiler to work with
– C is more difficult
– F90/C++ are the most complex for compiler optimizations
• The user is responsible for presenting the code in a way that allows compiler optimizations:
– don't violate the language standard
– write clean and clear code
– consider the data structures for (false) sharing and alignment
– consider the data structures for data dependencies
– use the most natural presentation of algorithms with multi-dimensional arrays

Page 49: Case Study: Vector Update

Scalar Optimization Techniques

Page 50: Vector Update Code

Profiling tells us that we spend most of the time in this part:

ll=0
do jj=1,nj
  do ii=1,ni
    ll=ll+1
    res=0
    do n=1,nib
      na=ii+(n-1)*nra+(i-1)*nru+(l-1)*nra*nrub
      nb=n+(jj-1)*nib
      ndb1=nmb1/2
      naa1=nma1+na
      nbb1=ndb1+nb
      res=res+p(naa1)*dp(nbb1)
    end do
    nde1=nme1/2
    lle1=nde1+ll
    dp(lle1)=dp(lle1)+res   ! this is the net result of all the computations
  end do
end do

L1 Cache (sec)   L2 Cache (sec)   TLB (sec)   Execution (sec)
50               37               215         286

Page 51: Vector Update: Stride Analysis

do jj=1,nj
  do ii=1,ni
    ....
    do n=1,nib
      na=ii+(n-1)*nra+(i-1)*nru+(l-1)*nra*nrub
      nb=n+(jj-1)*nib
      ndb1=nmb1/2
      naa1=nma1+na
      nbb1=ndb1+nb
      res=res+P(naa1)*DP(nbb1)
    end do
    ....

• for the inner loop, the stride on array P is controlled by naa1:
  naa1 = nma1+ii+(n-1)*nra+(i-1)*nru+(l-1)*nra*nrub
  the loop index is n, therefore the stride is nra
• the stride on array DP is controlled by nbb1:
  nbb1 = ndb1+n+(jj-1)*nib
  therefore the stride is 1
• loop exchange considerations (note: nra, nib ~ 5000):

  inner loop over   n     ii    jj
  stride on P:      nra   1     0
  stride on DP:     1     0     nib

• thus ii should be the inner loop

Page 52: TM Cache Optimizations & the Loop Nest Optimizer

Vector Update: Loop Interchange

To interchange the loops they have to be properly nested:
• substituting expressions and eliminating temporary variables

• now the loops can be interchanged

do jj=1,nj
  do ii=1,ni
    res=0
    do n=1,nib
      ndb1=nmb1/2
      naa1=nma1+ii+(n-1)*nra+(i-1)*nru+(l-1)*nra*nrub
      nbb1=ndb1+n+(jj-1)*nib
      res=res+p(naa1)*dp(nbb1)
    end do
    nde1=nme1/2
    lle1=nde1+ii+(jj-1)*ni
    dp(lle1)=dp(lle1)+res
  end do
end do

res can be eliminated by placing the accumulation in the inner loop

Eliminated LL

Substituted NB

Substituted NA

do jj=1,nj
  do ii=1,ni
    do n=1,nib
      ndb1=nmb1/2
      naa1=nma1+ii+(n-1)*nra+(i-1)*nru+(l-1)*nra*nrub
      nbb1=ndb1+n+(jj-1)*nib
      nde1=nme1/2
      lle1=nde1+ii+(jj-1)*ni
      dp(lle1)=dp(lle1)+p(naa1)*dp(nbb1)
    end do
  end do
end do
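The interchange is legal because each dp(lle1) accumulation is independent of the loop order. A minimal Python sketch of that property on a simplified stand-in kernel (not the original code; sizes and data are made up):

```python
import random

ni, nk = 8, 6
random.seed(0)
p  = [[random.random() for _ in range(nk)] for _ in range(ni)]
dq = [random.random() for _ in range(nk)]

# Order 1: ii outer, n inner (strided access on p in Fortran terms)
out1 = [0.0] * ni
for ii in range(ni):
    for n in range(nk):
        out1[ii] += p[ii][n] * dq[n]

# Order 2: n outer, ii inner (unit stride on p's columns)
out2 = [0.0] * ni
for n in range(nk):
    for ii in range(ni):
        out2[ii] += p[ii][n] * dq[n]

# Each out[ii] sees its additions in the same n order either way,
# so the results are identical; only the memory access order changed.
assert out1 == out2
print("interchange preserves results")
```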

Page 53: TM Cache Optimizations & the Loop Nest Optimizer

Vector Update: DAXPY Form

simplifying indexing….

this is a DAXPY operation

ndb1=nmb1/2
nde1=nme1/2
do jj=1,nj
  do n=1,nib
    do ii=1,ni
      naa1=nma1+ii+(n-1)*nra+(i-1)*nru+(l-1)*nra*nrub
      nbb1=ndb1+n+(jj-1)*nib
      lle1=nde1+ii+(jj-1)*ni
      dp(lle1)=dp(lle1)+p(naa1)*dp(nbb1)
    end do
  end do
end do

ndb1=nmb1/2
nde1=nme1/2
do jj=1,nj
  do n=1,nib
    naa1=nma1+(n-1)*nra+(i-1)*nru+(l-1)*nra*nrub
    dp_temp=dp(ndb1+n+(jj-1)*nib)
    lle1=nde1+(jj-1)*ni
    do ii=1,ni
      dp(lle1+ii)=dp(lle1+ii)+p(naa1+ii)*dp_temp
    end do
  end do
end do

ndb1=nmb1/2
nde1=nme1/2
id1 =nma1+(i-1)*nru+(l-1)*nra*nrub
do jj=1,nj
  id2 = ndb1+(jj-1)*nib
  lle1= nde1+(jj-1)*ni
  id3 = id1
  do n=1,nib
    dp_temp=dp(id2+n)
    do ii=1,ni
      dp(lle1+ii)=dp(lle1+ii)+p(id3+ii)*dp_temp
    end do
    id3 = id3 + nra
  end do
end do
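The inner loop now has the shape of the BLAS Level-1 DAXPY operation, y(i) = y(i) + a*x(i). A minimal Python sketch of that pattern:

```python
def daxpy(a, x, y):
    """y <- y + a*x elementwise: the BLAS-1 DAXPY pattern.
    Unit stride on both x and y, which is what the rewrite above achieves."""
    for i in range(len(x)):
        y[i] += a * x[i]
    return y

y = daxpy(2.0, [1.0, 2.0, 3.0], [10.0, 10.0, 10.0])
print(y)  # [12.0, 14.0, 16.0]
```

Recognizing a DAXPY matters because it is exactly the shape compilers (and tuned BLAS libraries) know how to software-pipeline and unroll.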

Page 54: TM Cache Optimizations & the Loop Nest Optimizer

Vector Update: 2D Form

With a DAXPY operation in the inner loop, we should consider further optimization with outer loop unrolling and blocking.
• hand tuning was necessary

• the compiler would not perform loop interchange because in the original code the loops are not properly nested

• with the DAXPY formulation, we can consider a 2-dimensional implementation of that code:

real*8 dp(ni,nj), p(ni,nib)

ndb1=nmb1/2
nde1=nme1/2
id1 =nma1+(i-1)*nru+(l-1)*nra*nrub

do jj=nde1,nj
  do n=ndb1,nib
    dp_temp=dp(n,jj-nde1)
    do ii=1,ni
      dp(ii,jj)=dp(ii,jj)+p(ii,n)*dp_temp
    end do
  end do
end do
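Recasting the hand-linearized indices as 2-D arrays is mechanical: in a column-major (Fortran-order) ni x nib array, element (ii, n) lives at linear offset ii + (n-1)*ni (1-based). A small Python sketch checking that equivalence (array contents are arbitrary illustration values):

```python
ni, nib = 4, 3

# Column-major (Fortran-order) flattening: ii varies fastest in memory
flat = [10 * ii + n for n in range(1, nib + 1) for ii in range(1, ni + 1)]

def at_2d(ii, n):
    """1-based 2-D view, as in p(ii,n) of the rewritten code."""
    return flat[(ii - 1) + (n - 1) * ni]

def at_linear(ii, n):
    """The hand-computed linear index, as in p(naa1+ii) of the 1-D form."""
    return flat[ii + (n - 1) * ni - 1]

assert all(at_2d(ii, n) == at_linear(ii, n)
           for n in range(1, nib + 1) for ii in range(1, ni + 1))
print("2-D indexing matches the linearized form")
```

The payoff of the 2-D form is not speed by itself: it makes the array shapes and strides visible to the compiler, which is what lets the LNO interchange, tile, and unroll the nest automatically.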

Page 55: TM Cache Optimizations & the Loop Nest Optimizer

Vector Update: Compiler Opt

Compiling the new 2D version with -O3:
• the compiler can automatically perform the necessary loop transforms

DO tile2jj = 1, nj, 126
  DO tile1ii = 1, ni, 544
    DO n = 1, (nib + -3), 4
      DO jj = tile2jj, MIN((nj + -1), (tile2jj + 124)), 2
        mi0 = dp2(n, jj)
        mi1 = dp2(n + 3, jj + 1)
        mi2 = dp2(n + 2, jj + 1)
        mi3 = dp2(n + 1, jj + 1)
        mi4 = dp2(n, jj + 1)
        mi5 = dp2(n + 1, jj)
        mi6 = dp2(n + 2, jj)
        mi7 = dp2(n + 3, jj)
        DO ii = tile1ii, MIN((tile1ii + 543), ni), 1
          dp1(ii, jj) = (dp1(ii, jj) + (p(ii, n) * mi0))
          dp1(ii, jj) = (dp1(ii, jj) + (p(ii, n + 1) * mi5))
          dp1(ii, jj) = (dp1(ii, jj) + (p(ii, n + 2) * mi6))
          dp1(ii, jj) = (dp1(ii, jj) + (p(ii, n + 3) * mi7))
          dp1(ii, jj + 1) = (dp1(ii, jj + 1) + (p(ii, n) * mi4))
          dp1(ii, jj + 1) = (dp1(ii, jj + 1) + (p(ii, n + 1) * mi3))
          dp1(ii, jj + 1) = (dp1(ii, jj + 1) + (p(ii, n + 2) * mi2))
          dp1(ii, jj + 1) = (dp1(ii, jj + 1) + (p(ii, n + 3) * mi1))
        END DO
      END DO
      DO wd_jj0 = jj, MIN((tile2jj + 125), nj), 1
        mi8 = dp2(n, wd_jj0)
        mi9 = dp2(n + 1, wd_jj0)
        mi10 = dp2(n + 3, wd_jj0)
        mi11 = dp2(n + 2, wd_jj0)
        DO ii0 = tile1ii, MIN((tile1ii + 543), ni), 1
          dp1(ii0, wd_jj0) = (dp1(ii0, wd_jj0) + (p(ii0, n) * mi8))
          dp1(ii0, wd_jj0) = (dp1(ii0, wd_jj0) + (p(ii0, n + 1) * mi9))
          dp1(ii0, wd_jj0) = (dp1(ii0, wd_jj0) + (p(ii0, n + 2) * mi11))
          dp1(ii0, wd_jj0) = (dp1(ii0, wd_jj0) + (p(ii0, n + 3) * mi10))
        END DO
      END DO
    END DO
    DO wd_n = n, nib, 1
      DO jj0 = tile2jj, MIN((nj + -1), (tile2jj + 124)), 2
        mi12 = dp2(wd_n, jj0)
        mi13 = dp2(wd_n, jj0 + 1)
        DO ii1 = tile1ii, MIN((tile1ii + 543), ni), 1
          dp1(ii1, jj0) = (dp1(ii1, jj0) + (p(ii1, wd_n) * mi12))
          dp1(ii1, jj0 + 1) = (dp1(ii1, jj0 + 1) + (p(ii1, wd_n) * mi13))
        END DO
      END DO
      DO wd_jj = jj0, MIN((tile2jj + 125), nj), 1
        mi14 = dp2(wd_n, wd_jj)
        DO ii2 = tile1ii, MIN((tile1ii + 543), ni), 1
          dp1(ii2, wd_jj) = (dp1(ii2, wd_jj) + (p(ii2, wd_n) * mi14))
        END DO
      END DO
    END DO
  END DO
END DO
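The generated code combines cache tiling with unroll-and-jam (4x in n, 2x in jj) and remainder loops for the leftover iterations. A minimal Python sketch of plain tiling (illustrative sizes, not the compiler's 126x544 tiles), showing that it only reorders iterations:

```python
ni, nj, B = 10, 9, 4  # made-up loop bounds and tile size

visited = []
for ti in range(0, ni, B):            # tile over i
    for tj in range(0, nj, B):        # tile over j
        for i in range(ti, min(ti + B, ni)):       # sweep one tile,
            for j in range(tj, min(tj + B, nj)):   # handling edge tiles
                visited.append((i, j))

# Every (i, j) pair is visited exactly once, just in a
# cache-friendlier order that keeps a B x B working set hot.
assert sorted(visited) == [(i, j) for i in range(ni) for j in range(nj)]
print(len(visited))  # 90
```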

Page 56: TM Cache Optimizations & the Loop Nest Optimizer

Vector Update Summary

ORIGINAL CODE
