Modifying Program Code to Take Advantage of Compiler Optimizations

*All other brands and names are the property of their respective owners

Intel Confidential IA64_Tools_Overview2.ppt


www.intel.com/software/products

Data Issues: Responsible Pointer Usage

• Compiler alias analysis limits optimizations
• The developer knows the application: tell the compiler!
• Avoid pointing to the same memory address with two different pointers
• Use array notation when possible
• Avoid pointer arithmetic if possible
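A minimal C sketch of why possible aliasing limits the optimizer. The function names add_step_aliased and add_step_local are hypothetical and not from the slides. In the first function the compiler has to assume that step may point into a, so it cannot keep *step in a register across stores. The second function copies the value into a local and uses array notation, which removes the question.

```c
#include <stddef.h>

/* Adds *step to every element of a[]. Both arguments are int pointers, so
   the compiler must assume step may point into a[] and has to reload *step
   after every store. */
void add_step_aliased(int *a, size_t n, const int *step)
{
    for (size_t i = 0; i < n; i++)
        a[i] += *step;
}

/* Same loop, but the value is read once into a local. No store to a[] can
   change s, so the loop body is trivially free of aliasing. */
void add_step_local(int *a, size_t n, const int *step)
{
    const int s = *step;
    for (size_t i = 0; i < n; i++)
        a[i] += s;
}
```

When step really does point into a (for example add_step_aliased(a, n, &a[0])), the two functions produce different results, which is exactly why the compiler may not turn the first into the second on its own.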

Pointer Disambiguation

• -Oa file.c (Windows), -fno-alias file.c (Linux): all pointers in file.c are assumed not to alias.
• -Ow file.c (Windows), not (yet) available on Linux: assume no aliasing within functions (i.e., pointer arguments are unique).
• -Qrestrict file.c (Windows), -restrict (Linux): enables the restrict qualifier for pointer disambiguation.
• -Za file.c (Windows), -ansi (Linux): enforce strict ANSI compliance (requires that pointers to different data types are not aliased).
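As an illustration of the restrict qualifier, here is a short sketch in standard C99. The function name vec_add is hypothetical. The qualifier is a promise from the programmer that the three arrays never overlap; the compiler does not check it, and passing overlapping arrays is undefined behavior.

```c
#include <stddef.h>

/* restrict tells the compiler that dst, x and y refer to distinct memory,
   so loads from x and y may be reordered freely around stores to dst. */
void vec_add(float *restrict dst, const float *restrict x,
             const float *restrict y, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = x[i] + y[i];
}
```

A call such as vec_add(d, x, y, 4) with three separate arrays is valid; vec_add(x, x, y, 4) would break the promise.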


High Level Optimizations (HLO) Available at -O3

• Prefetch
• Loop interchange
• Unrolling
• Cache blocking
• Unroll-and-jam
• Scalar replacement
• Redundant zero-trip elimination
• Data dependence analysis
• Reuse analysis
• Loop recovery
• Canonical expressions
• Loop fusion
• Loop distribution
• Loop reversal
• Loop skewing
• Loop peeling
• Scalar expansion
• Register blocking

Data Prefetching

Original loop:

    for i = 1, M
      for j = 1, N
        A[j, i] = B[0, j] + B[0, j+1]
      end_for
    end_for

With selective prefetching:

    for i = 1, M
      for j = 1, N
        A[j, i] = B[0, j] + B[0, j+1]
        if (mod(j,8) == 0) lfetch.nta(A[j+d, i])
        if (i == 1) lfetch.nt1(B[0, j+d])
      end_for
    end_for

• Prefetch instructions are added using selective prefetching.
• Works for arrays, pointers, C structures, and C/C++ parameters.
• Goal: issue one prefetch instruction per cache line.
• Itanium cache lines: L1 32B, L2 64B, L3 64B.
• Itanium 2 cache lines: L1 64B, L2 128B, L3 128B.

-O3 does this for you: "Let the compiler do the work!"
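For readers who want to see the idea in source form, the following is a rough C sketch of one prefetch per cache line. It uses the GCC/Clang builtin __builtin_prefetch as a stand-in for the Itanium lfetch instruction, which is an assumption and not what the slides show. The function name add_neighbors, the 64-byte line size, and the prefetch distance are likewise hypothetical.

```c
#include <stddef.h>

#define LINE_DOUBLES   8   /* assumption: 64-byte cache line / 8-byte double */
#define PREFETCH_AHEAD 64  /* hypothetical prefetch distance d, in elements */

/* a[j] = b[j] + b[j+1] for j in [0, n); b must hold n + 1 elements. */
void add_neighbors(double *a, const double *b, size_t n)
{
    for (size_t j = 0; j < n; j++) {
        /* Issue at most one prefetch per array per cache line. */
        if (j % LINE_DOUBLES == 0 && j + PREFETCH_AHEAD < n) {
            __builtin_prefetch(&a[j + PREFETCH_AHEAD], 1, 0); /* write, non-temporal */
            __builtin_prefetch(&b[j + PREFETCH_AHEAD], 0, 3); /* read, keep cached */
        }
        a[j] = b[j] + b[j + 1];
    }
}
```

In practice the slide's advice stands: at -O3 the compiler inserts such prefetches itself, so hand-written prefetching is rarely the first thing to try.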


Loop Interchange

• Note: the c[i][j] term is constant in the inner loop
• Interchange to allow unit-stride memory access

    for (i = 0; i < NUM; i++) {
      for (j = 0; j < NUM; j++) {
        for (k = 0; k < NUM; k++) {
          c[i][j] = c[i][j] + a[i][k] * b[k][j];
        }
      }
    }

Figure callouts: fast inner loop index (k); consecutive memory (a[i][k]).

Demo Lab: Matrix with Loop Interchange, -O2

Unit Stride Memory Access (C/C++ example; Fortran is the opposite)

[Figure: row-major layouts of matrices b (b00 ... bN-1N-1) and a (a00 ... aN-1N-1). Matrix b: non-unit-strided data access; incrementing k gets non-consecutive memory elements. Matrix a: unit-strided data access; incrementing k gets consecutive memory elements.]

Loop after Interchange

• Note: the a[i][k] term is constant in the inner loop
• Two loads, one store, one FMA: F/M = 0.33, unit stride

    for (i = 0; i < NUM; i++) {
      for (k = 0; k < NUM; k++) {
        for (j = 0; j < NUM; j++) {
          c[i][j] = c[i][j] + a[i][k] * b[k][j];
        }
      }
    }

Demo Lab: Matrix with Loop Interchange, -O3
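The two loop orders can be put side by side in a small self-contained C sketch. The function names matmul_ijk and matmul_ikj and the tiny matrix size are hypothetical and chosen only so the result is easy to check by hand. Both functions accumulate into c, so the caller is assumed to zero c first.

```c
#define NUM 4  /* hypothetical size, small enough to verify by hand */

/* Original order (i, j, k): the inner loop walks b[k][j] down a column,
   so consecutive iterations touch addresses NUM elements apart. */
void matmul_ijk(double c[NUM][NUM], double a[NUM][NUM], double b[NUM][NUM])
{
    for (int i = 0; i < NUM; i++)
        for (int j = 0; j < NUM; j++)
            for (int k = 0; k < NUM; k++)
                c[i][j] += a[i][k] * b[k][j];
}

/* Interchanged order (i, k, j): a[i][k] is invariant in the inner loop and
   both c[i][j] and b[k][j] are accessed with unit stride. */
void matmul_ikj(double c[NUM][NUM], double a[NUM][NUM], double b[NUM][NUM])
{
    for (int i = 0; i < NUM; i++)
        for (int k = 0; k < NUM; k++) {
            const double aik = a[i][k];
            for (int j = 0; j < NUM; j++)
                c[i][j] += aik * b[k][j];
        }
}
```

Each c[i][j] still receives its products in increasing k order, so the interchange changes only the memory access pattern, not the arithmetic.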

Unit Stride Memory Access (C/C++)

[Figure: row-major layouts of matrices a (a00 ... aN-1N-1) and b (b00 ... bN-1N-1) after interchange, all unit-strided data access. Callouts: the fastest incremented index gives consecutive memory access; the next fastest loop index gives a consecutive memory index.]

Loop Unrolling

Original loop:

    N = 1025
    M = 5
    DO I = 1, N
      DO J = 1, M
        A(J,I) = B(J,I) + C(J,I) * D
      ENDDO
    ENDDO

Preconditioning loop (handles the N mod 4 leftover iterations):

    II = MOD(N, 4)
    DO I = 1, II
      DO J = 1, M
        A(J,I) = B(J,I) + C(J,I) * D
      ENDDO
    ENDDO

Outer loop unrolled by 4 (starts at II+1, after the preconditioning loop):

    DO I = II+1, N, 4
      DO J = 1, M
        A(J,I)   = B(J,I)   + C(J,I)   * D
        A(J,I+1) = B(J,I+1) + C(J,I+1) * D
        A(J,I+2) = B(J,I+2) + C(J,I+2) * D
        A(J,I+3) = B(J,I+3) + C(J,I+3) * D
      ENDDO
    ENDDO

• Unroll the largest loops
• If the loop size is known, the preconditioning loop can be eliminated by choosing an unroll factor that divides the trip count evenly

Demo Lab: Matrix with Loop Unrolling by 2
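The preconditioning pattern can also be sketched in C on a one-dimensional array. The function name add_scaled_unrolled is hypothetical; the unroll factor of 4 follows the slide. The short first loop absorbs the n % 4 leftover iterations, so the unrolled loop always runs over a multiple of four elements.

```c
#include <stddef.h>

/* a[i] = b[i] + c[i] * d for i in [0, n), unrolled by 4. */
void add_scaled_unrolled(double *a, const double *b, const double *c,
                         double d, size_t n)
{
    size_t i;
    const size_t rem = n % 4;

    /* Preconditioning loop: the first n % 4 iterations. */
    for (i = 0; i < rem; i++)
        a[i] = b[i] + c[i] * d;

    /* Main loop: the remaining count is a multiple of 4. */
    for (; i < n; i += 4) {
        a[i]     = b[i]     + c[i]     * d;
        a[i + 1] = b[i + 1] + c[i + 1] * d;
        a[i + 2] = b[i + 2] + c[i + 2] * d;
        a[i + 3] = b[i + 3] + c[i + 3] * d;
    }
}
```

With n = 7, for example, the first loop handles elements 0 to 2 and the unrolled loop runs once over elements 3 to 6.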

Loop Unrolling - Candidates

• If the trip count is low and known at compile time, it may make sense to fully unroll
• Poor candidates (similar issues for SWP, i.e. software pipelining, or the vectorizer):
  - Low trip count loops: for (j=0; j < N; j++) where N=4 at runtime
  - Fat loops: the loop body already has lots of computation taking place
  - Loops containing procedure calls
  - Loops with branches

Loop Unrolling - Benefits

Benefits
• Performs more computations per loop iteration
• Reduces the effect of loop overhead
• Can increase the floating-point to memory access ratio (F/M)

Costs
• Register pressure
• Code bloat

Loop Unrolling - Example

    for (i = 0; i < NUM; i = i + 2) {
      for (k = 0; k < NUM; k = k + 2) {
        for (j = 0; j < NUM; j++) {
          c[i][j]   = c[i][j]   + a[i][k]     * b[k][j];
          c[i+1][j] = c[i+1][j] + a[i+1][k]   * b[k][j];
          c[i][j]   = c[i][j]   + a[i][k+1]   * b[k+1][j];
          c[i+1][j] = c[i+1][j] + a[i+1][k+1] * b[k+1][j];
        }
      }
    }

Figure callout: loop invariant (the a[...][...] terms do not depend on j).

All loops unrolled by 4 results in (per iteration): 32 loads, 16 stores, 64 FMA: F/M = 1.33

Demo Lab: Matrix with Loop Unrolling by 4

Cache Blocking

Original loop:

    for i = 1, 1000
      for j = 1, 1000
        for k = 1, 1000
          A[i, j, k] = A[i, j, k] + B[i, k, j]
        end_for
      end_for
    end_for

Blocked loop (20 x 20 blocks in j and k):

    for v = 1, 1000, 20
      for u = 1, 1000, 20
        for k = v, v+19
          for j = u, u+19
            for i = 1, 1000
              A[i, j, k] = A[i, j, k] + B[i, k, j]
            end_for
          end_for
        end_for
      end_for
    end_for

• Use when all arrays in the loop do not fit in cache
• Effective for huge out-of-core memory applications
• Effective for large out-of-cache applications
• Work on "neighborhoods" of data and keep these neighborhoods in cache
• Helps reduce TLB and cache misses
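The blocking idea can be sketched in C on a simplified two-dimensional case, a[i][j] += b[j][i], rather than the three-dimensional loop above. The function name add_transposed_blocked, the array size, and the tile edge are all hypothetical; a real tile size would be tuned to the target cache.

```c
#include <stddef.h>

#define N     64  /* hypothetical array size; must be a multiple of BLOCK */
#define BLOCK 16  /* hypothetical tile edge */

/* a[i][j] += b[j][i], processed one BLOCK x BLOCK tile at a time so that
   the tile of b, which is read column-wise, stays in cache while in use. */
void add_transposed_blocked(double a[N][N], double b[N][N])
{
    for (size_t jj = 0; jj < N; jj += BLOCK)
        for (size_t ii = 0; ii < N; ii += BLOCK)
            for (size_t i = ii; i < ii + BLOCK; i++)
                for (size_t j = jj; j < jj + BLOCK; j++)
                    a[i][j] += b[j][i];
}
```

The result is identical to the untiled double loop; only the order in which elements are visited changes, so each "neighborhood" of b is reused before it is evicted.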

