Modifying Program Code to Take Advantage of Compiler Optimizations
Post on 06-Jan-2016
*All other brands and names are the property of their respective owners
Intel Confidential IA64_Tools_Overview2.ppt
www.intel.com/software/products
Responsible Pointer Usage
• Compiler alias analysis limits optimizations
• The developer knows the application: tell the compiler!
• Avoid pointing to the same memory address with 2 different pointers
• Use array notation when possible
• Avoid pointer arithmetic if possible
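The last two bullets can be illustrated with a minimal C sketch. The function names `add_ptr` and `add_idx` are hypothetical, chosen only for this example. Both compute the same result; the indexed form states the access pattern directly, which tends to be easier for a compiler's dependence analysis to follow.

```c
#include <stddef.h>

/* Pointer-arithmetic style: three moving pointers; the compiler has to
   reconstruct the access pattern from the increments. */
void add_ptr(double *dst, const double *a, const double *b, size_t n)
{
    while (n--)
        *dst++ = *a++ + *b++;
}

/* Array-notation style: one induction variable i, and every access is
   visibly base[i]. */
void add_idx(double dst[], const double a[], const double b[], size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = a[i] + b[i];
}
```

Whether the two forms actually compile to different code depends on the compiler and optimization level; the sketch only shows the source-level difference.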
Data Issues
Pointer Disambiguation
• -Oa file.c (Windows), -fno-alias file.c (Linux): all pointers in file.c are assumed not to alias
• -Ow file.c (Windows), not (yet) on Linux: assume no aliasing within functions (i.e., pointer arguments are unique)
• -Qrestrict file.c (Windows), -restrict (Linux): restrict qualifier, enables pointer disambiguation
• -Za file.c (Windows), -ansi (Linux): enforce strict ANSI compliance (requires that pointers to different data types are not aliased)
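As a sketch of what the restrict qualifier expresses at source level, here is a small C99 example. The function name `axpy` and its parameters are assumptions made for illustration. The `restrict` keyword is the programmer's promise that `x` and `y` never refer to overlapping memory, so the compiler does not have to assume that a store to `y[i]` changes a later `x[i]`.

```c
#include <stddef.h>

/* y[i] += alpha * x[i]
   restrict: the caller guarantees x and y do not overlap.
   Calling this with overlapping arrays is undefined behavior. */
void axpy(size_t n, double alpha,
          const double *restrict x, double *restrict y)
{
    for (size_t i = 0; i < n; i++)
        y[i] += alpha * x[i];
}
```

The promise is per pointer, which makes it narrower than a file-wide no-alias switch: only the functions whose arguments are known to be distinct need to be annotated.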
High Level Optimizations Available at O3
• Prefetch
• Loop interchange
• Unrolling
• Cache blocking
• Unroll-and-jam
• Scalar replacement
• Redundant zero-trip elimination
• Data dependence analysis
• Reuse analysis
• Loop recovery
• Canonical expressions
• Loop fusion
• Loop distribution
• Loop reversal
• Loop skewing
• Loop peeling
• Scalar expansion
• Register blocking
HLO
Data Prefetching

Original loop:
    for i = 1, M
      for j = 1, N
        A[j, i] = B[0, j] + B[0, j+1]
      end_for
    end_for

With prefetching:
    for i = 1, M
      for j = 1, N
        A[j, i] = B[0, j] + B[0, j+1]
        if (mod(j,8) == 0) lfetch.nta(A[j+d, i])
        if (i == 1) lfetch.nt1(B[0, j+d])
      end_for
    end_for
• Adds prefetch instructions using selective prefetching
• Works for arrays, pointers, C structures, C/C++ parameters
• Goal: issue one prefetch instruction per cache line
• Itanium cache lines are L1: 32B, L2: 64B, L3: 64B
• Itanium 2 cache lines are L1: 64B, L2: 128B, L3: 128B
-O3 does this for you
“Let the Compiler do the work!”
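For readers who want to see what selective prefetching looks like when written by hand, here is a rough C sketch. It does not use the Itanium `lfetch` instruction; it uses `__builtin_prefetch`, a builtin provided by GCC and Clang, as a stand-in. The function name `sum_neighbors`, the constants `LINE_DOUBLES` and `PF_DIST`, and the mapping of locality hints are all assumptions for illustration, not tuned values.

```c
#include <stddef.h>

#define LINE_DOUBLES 8   /* assumed: 64-byte cache line / 8-byte double */
#define PF_DIST      64  /* assumed prefetch distance, in elements */

/* a[j] = b[j] + b[j+1]; b must hold n + 1 elements. */
void sum_neighbors(double *a, const double *b, size_t n)
{
    for (size_t j = 0; j < n; j++) {
        /* Issue one prefetch per cache line, not one per element. */
        if (j % LINE_DOUBLES == 0 && j + PF_DIST < n) {
            __builtin_prefetch(&a[j + PF_DIST], 1, 0); /* for write, low reuse */
            __builtin_prefetch(&b[j + PF_DIST], 0, 3); /* for read, keep cached */
        }
        a[j] = b[j] + b[j + 1];
    }
}
```

The prefetches are hints only and do not change the result, so the function can be checked against a plain loop.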
Loop Interchange
Note: c[i][j] term is constant in inner loop
Interchange to allow unit stride memory access
    for(i=0;i<NUM;i++) {
      for(j=0;j<NUM;j++) {
        for(k=0;k<NUM;k++) {
          c[i][j] = c[i][j] + a[i][k] * b[k][j];
        }
      }
    }
[Callouts on the code: "Fast inner loop index"; "Consecutive memory index"]
Lab : Matrix with Loop Interchange, -O2
Unit Stride memory access (C/C++ example; Fortran is the opposite)
[Figure: two matrix layouts. Matrix b, accessed as b[k][j]: non-unit strided data access; incrementing k gets non-consecutive memory elements. Matrix a, accessed as a[i][k]: unit strided data access; incrementing k gets consecutive memory elements.]
Loop after interchange
Note: a[i][k] term is constant in inner loop
Two loads, one store, one FMA: F/M = 0.33, unit stride
    for(i=0;i<NUM;i++) {
      for(k=0;k<NUM;k++) {
        for(j=0;j<NUM;j++) {
          c[i][j] = c[i][j] + a[i][k] * b[k][j];
        }
      }
    }
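The two loop orders can be put side by side as compilable C to confirm that interchange changes only the memory access order, not the result. This is a sketch: the function names `matmul_ijk` and `matmul_ikj` and the value of `NUM` are assumptions for illustration.

```c
#define NUM 32  /* assumed size for the sketch */

/* Original order: the inner k loop walks b[k][j] down a column,
   a stride of NUM doubles in row-major C. */
void matmul_ijk(double c[NUM][NUM], double a[NUM][NUM], double b[NUM][NUM])
{
    for (int i = 0; i < NUM; i++)
        for (int j = 0; j < NUM; j++)
            for (int k = 0; k < NUM; k++)
                c[i][j] = c[i][j] + a[i][k] * b[k][j];
}

/* Interchanged order: the inner j loop walks c[i][j] and b[k][j]
   along rows (unit stride); a[i][k] is invariant in the inner loop. */
void matmul_ikj(double c[NUM][NUM], double a[NUM][NUM], double b[NUM][NUM])
{
    for (int i = 0; i < NUM; i++)
        for (int k = 0; k < NUM; k++)
            for (int j = 0; j < NUM; j++)
                c[i][j] = c[i][j] + a[i][k] * b[k][j];
}
```

Each c[i][j] still accumulates its products in increasing k, so both versions add the same terms in the same order.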
Lab : Matrix with Loop Interchange, -O3
Unit Stride memory access (C/C++)
[Figure: all unit strided data access. Matrix a, accessed as a[i][k]: k is the next fastest loop index and a consecutive memory index. Matrix b, accessed as b[k][j]: j is the fastest incremented index and gives consecutive memory access.]
Loop Unrolling

Original loop:
    N=1025
    M=5
    DO I=1,N
      DO J=1,M
        A(J,I) = B(J,I) + C(J,I) * D
      ENDDO
    ENDDO

Preconditioning loop (handles the first IMOD(N,4) iterations):
    II = IMOD (N,4)
    DO I = 1, II
      DO J=1,M
        A(J,I) = B(J,I) + C(J,I) * D
      ENDDO
    ENDDO

Unroll outer loop by 4 (starts after the preconditioning loop, at II+1):
    DO I = II+1,N,4
      DO J=1,M
        A(J,I) = B(J,I) + C(J,I) * D
        A(J,I+1) = B(J,I+1) + C(J,I+1) * D
        A(J,I+2) = B(J,I+2) + C(J,I+2) * D
        A(J,I+3) = B(J,I+3) + C(J,I+3) * D
      ENDDO
    ENDDO
• Unroll the largest loops
• If the loop trip count is known, the preconditioning loop can be eliminated by choosing an unroll factor that divides it evenly
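The same transformation can be sketched in C, with the array indices swapped because C is row-major. The function names `update_ref` and `update_unrolled` are hypothetical; `N` and `M` follow the values on the slide. The preconditioning loop runs the first `N % 4` iterations so that the unrolled loop always has a multiple of 4 left and never reads past the end of the arrays.

```c
#define N 1025
#define M 5

/* Reference version, no unrolling. */
void update_ref(double a[N][M], double b[N][M], double c[N][M], double d)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < M; j++)
            a[i][j] = b[i][j] + c[i][j] * d;
}

/* Outer loop unrolled by 4, with a preconditioning loop. */
void update_unrolled(double a[N][M], double b[N][M], double c[N][M], double d)
{
    int r = N % 4;  /* leftover iterations, done first */

    for (int i = 0; i < r; i++)
        for (int j = 0; j < M; j++)
            a[i][j] = b[i][j] + c[i][j] * d;

    /* N - r is a multiple of 4, so i + 3 stays below N. */
    for (int i = r; i < N; i += 4)
        for (int j = 0; j < M; j++) {
            a[i][j]     = b[i][j]     + c[i][j]     * d;
            a[i + 1][j] = b[i + 1][j] + c[i + 1][j] * d;
            a[i + 2][j] = b[i + 2][j] + c[i + 2][j] * d;
            a[i + 3][j] = b[i + 3][j] + c[i + 3][j] * d;
        }
}
```

If `N` were a multiple of 4, `r` would be 0 and the first loop would do nothing, which is the case the second bullet describes.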
Lab : Matrix with Loop Unrolling by 2
Loop Unrolling - Candidates
• If the trip count is low and known at compile time, it may make sense to fully unroll
• Poor candidates (similar issues for SWP or the vectorizer):
  • Low trip count loops – for (j=0; j < N; j++) : N=4 at runtime
  • Fat loops – loop body already has lots of computation taking place
  • Loops containing procedure calls
  • Loops with branches
Loop Unrolling - Benefits
Benefits
• Performs more computations per loop iteration
• Reduces the effect of loop overhead
• Can increase floating point to memory access ratio (F/M)
Costs
• Register pressure
• Code bloat
Loop Unrolling - Example

    for(i=0;i<NUM;i=i+2) {
      for(k=0;k<NUM;k=k+2) {
        for(j=0;j<NUM;j++) {
          c[i][j]   = c[i][j]   + a[i][k]     * b[k][j];
          c[i+1][j] = c[i+1][j] + a[i+1][k]   * b[k][j];
          c[i][j]   = c[i][j]   + a[i][k+1]   * b[k+1][j];
          c[i+1][j] = c[i+1][j] + a[i+1][k+1] * b[k+1][j];
        }
      }
    }

[Callout on the code: "Loop invariant"]
All loops unrolled by 4 results in (per iteration) 32 loads, 16 stores, 64 FMA: F/M = 1.33
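As a worked check of the F/M figures quoted on these slides, the ratio can be written as floating-point operations over memory operations. Counting each FMA as a single floating-point operation is an assumption here, made because it reproduces the slide's numbers.

$$
\frac{F}{M} = \frac{\text{FMA}}{\text{loads} + \text{stores}}
$$

$$
\text{interchanged loop: } \frac{1}{2 + 1} \approx 0.33
\qquad
\text{all loops unrolled by 4: } \frac{64}{32 + 16} = \frac{64}{48} \approx 1.33
$$

Unrolling by 4 in all three loops yields 4 × 4 × 4 = 64 FMAs per iteration, while values held in registers are reused across them, which is why the memory operation count grows more slowly than the FMA count.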
Lab : Matrix with Loop Unrolling by 4
Cache Blocking

Original loop:
    for i = 1, 1000
      for j = 1, 1000
        for k = 1, 1000
          A[i, j, k] = A[i, j, k] + B[i, k, j]
        end_for
      end_for
    end_for

Blocked loop:
    for v = 1, 1000, 20
      for u = 1, 1000, 20
        for k = v, v+19
          for j = u, u+19
            for i = 1, 1000
              A[i, j, k] = A[i, j, k] + B[i, k, j]
            end_for
          end_for
        end_for
      end_for
    end_for
• Use when all arrays in the loop do not fit in cache
• Effective for huge out-of-core memory applications
• Effective for large out-of-cache applications
• Work on “neighborhoods” of data and keep these neighborhoods in cache
• Helps reduce TLB and cache misses
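The blocking idea can be tried on a smaller problem. The following C sketch is a simplified 2D variant, not the 3D loop nest from the slide: it computes a[i][j] += b[j][i], where one array is read along rows and the other along columns. The function names, `N`, and the tile size `BLK` are assumptions for illustration; `BLK` mirrors the block size of 20 used above and is not a tuned value.

```c
#define N   100  /* assumed problem size for the sketch */
#define BLK 20   /* assumed tile edge; the code assumes it divides N */

/* Straightforward version: b is read with a stride of N doubles. */
void add_transposed(double a[N][N], double b[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] = a[i][j] + b[j][i];
}

/* Blocked version: finish one BLK x BLK "neighborhood" of a and b
   before moving on, so the lines it touches can stay in cache. */
void add_transposed_blocked(double a[N][N], double b[N][N])
{
    for (int i0 = 0; i0 < N; i0 += BLK)
        for (int j0 = 0; j0 < N; j0 += BLK)
            for (int i = i0; i < i0 + BLK; i++)
                for (int j = j0; j < j0 + BLK; j++)
                    a[i][j] = a[i][j] + b[j][i];
}
```

Each element is updated exactly once in both versions, so the results match; only the order in which memory is visited differs. At this size everything likely fits in cache anyway, so a timing difference would only be expected with much larger arrays.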