multi-core programming
Multi-core Programming
Tools
Intel Compilers 9.x on the Intel® Core Duo™ Processor Windows version
Topics
• General Ideas
• Compiler Switches
• Dual Core
• Vectorization
General Ideas - Optimization
• Exploiting Architectural Power requires Sophisticated Compilers
• Optimal use of
  – Registers & functional units
  – Dual-Core/Multi-processor
  – SSE instructions
  – Cache architecture
General Ideas – Compatibility
Compatibility is a key concern for software development. Always read the manuals to ensure:
• Tool compatibility with hardware revisions.
• Tool compatibility with software revisions.
• Tool compatibility with the IDE.
• Tool compatibility with the native (development) and target (deployment) operating systems.
General Ideas – Intel C++ Compatibility with Microsoft
• Source & binary compatible with VC 2003 under /Qvc71.
• Source & binary compatible with VC 2005 under /Qvc8.
• Microsoft* & Intel OpenMP binaries are not compatible.
• Use the same compiler for all modules compiled with OpenMP.
• For more information, refer to the User’s Guide
General Ideas – Tools
Never ignore or take tools for granted.
A key part of system development is initial specification and qualification of tools required to get the job done.
The wrong tool alone can destroy a project's chances of success.
General Ideas - Use Intel Compiler in Microsoft IDE C++
Topics
• General Ideas
• Compiler Switches
• Dual Core
• Vectorization
Compiler Switches - General Optimizations
| Windows* | Linux* | Mac* | Description |
|---|---|---|---|
| /Od | -O0 | -O0 | Disables optimizations |
| /Zi | -g | -g | Creates symbols |
| /O1 | -O1 | -O1 | Optimize for binary size: server code |
| /O2 | -O2 | -O2 | Optimizes for speed (default) |
| /O3 | -O3 | -O3 | Optimize for data cache: loopy floating-point code |
Compiler Switches - Multi-pass Optimization - Interprocedural Optimizations (IPO)
• ip: Enables interprocedural optimizations for single-file compilation
• ipo: Enables interprocedural optimizations across files
  – Can inline functions in separate files
• Enhances optimization when used in combination with other compiler features
| Windows* | Linux* | Mac* |
|---|---|---|
| /Qip | -ip | -ip |
| /Qipo | -ipo | -ipo |
Compiler Switches - Multi-pass Optimization - IPO Usage: Two-Step Process
Pass 1: Compiling (produces virtual .o files; .obj on Windows)
  Windows*  icl -c /Qipo main.c func1.c func2.c
  Linux*    icc -c -ipo main.c func1.c func2.c
  Mac*      icc -c -ipo main.c func1.c func2.c
Pass 2: Linking (produces the executable)
  Windows*  icl /Qipo main.obj func1.obj func2.obj
  Linux*    icc -ipo main.o func1.o func2.o
  Mac*      icc -ipo main.o func1.o func2.o
Compiler Switches - Profile Guided Optimizations (PGO)
• Use execution-time feedback to guide many other compiler optimizations
• Helps I-cache, paging, branch-prediction
• Enabled optimizations:
  – Basic block ordering
  – Better register allocation
  – Better decision of functions to inline
  – Function ordering
  – Switch-statement optimization
  – Better vectorization decisions
Compiler Switches - Multi-pass Optimization
PGO: Three-Step Process
Step 1: Instrumented Compilation (produces the instrumented executable)
  (Mac*/Linux*) icc -prof_gen[x] prog.c
  (Windows*) icl -Qprof_gen[x] prog.c
Step 2: Instrumented Execution (produces a DYN file containing dynamic info: .dyn)
  Run program on a typical dataset
Step 3: Feedback Compilation (uses the merged DYN summary file: .dpi)
  (Mac/Linux) icc -prof_use prog.c
  (Windows) icl -Qprof_use prog.c
  Delete old .dyn files if you do not want their info included
Topics
• General Ideas
• Compiler Switches
• Dual Core
  – Auto Parallelization
  – OpenMP
  – Threading Diagnostics
• Vectorization
Auto-parallelization
• Auto-parallelization: Automatic threading of loops without having to manually insert OpenMP* directives.
– Compiler can identify “easy” candidates for parallelization, but large applications are difficult to analyze.
| Windows* | Linux* | Mac* |
|---|---|---|
| /Qparallel | -parallel | -parallel |
| /Qpar_report[n] | -par_report[n] | -par_report[n] |
OpenMP* Threading Technology
• Pragma based approach to parallelism
• Usage:
  – OpenMP switches: -openmp : /Qopenmp
  – OpenMP reports: -openmp-report : /Qopenmp-report
    #pragma omp parallel for
    for (i=0;i<MAX;i++)
        A[i]= c*A[i] + B[i];
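The loop on this slide can be wrapped in a complete function to show where the pragma sits. This is a minimal sketch: the function name `scale_add` and its parameter list are assumptions made for illustration, not part of the slides.

```c
#include <stddef.h>

/* Hypothetical helper: computes A[i] = c*A[i] + B[i] for n elements.
   The pragma asks the compiler to divide the iterations among threads.
   If OpenMP is not enabled, the pragma is ignored and the loop runs
   serially with the same result, because the iterations are independent. */
void scale_add(double *A, const double *B, double c, int n)
{
    int i;
#pragma omp parallel for
    for (i = 0; i < n; i++)
        A[i] = c * A[i] + B[i];
}
```

Calling `scale_add(A, B, 2.0, n)` doubles each element of `A` and adds the matching element of `B`. The loop index is declared outside the loop only to stay close to the slide's style; OpenMP makes the index of a `parallel for` loop private to each thread.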
OpenMP: Workqueuing Extension Example
• Intel Compiler’s Workqueuing extension
  – Create Queue of tasks… Works on…
    • Recursive functions
    • Linked lists, etc.
#pragma intel omp parallel taskq shared(p)
{
  while (p != NULL) {
#pragma intel omp task captureprivate(p)
    do_work1(p);
    p = p->next;
  }
}
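The workqueuing pragmas above are specific to the Intel compiler. For comparison, standard OpenMP tasks (OpenMP 3.0 and later) can express a similar linked-list traversal. The sketch below swaps in that standard construct; the `struct node` layout, the body of `do_work1`, and the name `process_list` are assumptions made for illustration.

```c
#include <stddef.h>

/* Hypothetical list node; the slides do not define one. */
struct node {
    int value;
    struct node *next;
};

/* Placeholder for the per-node work. */
static void do_work1(struct node *p)
{
    p->value *= 2;
}

void process_list(struct node *head)
{
#pragma omp parallel
    {
        /* One thread walks the list and creates a task per node;
           the other threads in the team execute the tasks. */
#pragma omp single
        {
            struct node *p;
            for (p = head; p != NULL; p = p->next) {
                /* firstprivate copies the current pointer into the task,
                   much like captureprivate in the Intel extension. */
#pragma omp task firstprivate(p)
                do_work1(p);
            }
        } /* implicit barrier: all tasks have completed here */
    }
}
```

Without OpenMP enabled, the pragmas are ignored and the list is processed serially, which makes the function easy to check before threading is turned on.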
Parallel Diagnostics
• Source Instrumentation for Intel Thread Checker
• Allows thread checker to diagnose threading correctness bugs
• To use tcheck/Qtcheck you must have Intel Thread Checker installed
| Windows* | Linux* | Mac* |
|---|---|---|
| /Qtcheck | -tcheck | No support |
Topics
• General Ideas
• Compiler Switches
• Dual Core
• Vectorization
  – SSE & Vectorization
  – Vectorization Reports
  – Explanations of a few specific vectorization inhibitors
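SIMD – SSE, SSE2, SSE3 Support
[Figure: packed data types held in the SIMD registers, labeled by instruction set (MMX*, SSE, SSE2, SSE3)]
• Integer: 16x bytes, 8x words, 4x dwords, 2x qwords, 1x dqword
• Floating point: 4x floats, 2x doubles
* MMX actually used the x87 floating-point registers; SSE, SSE2, and SSE3 use the new SSE registers.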
SSE3 Instructions
| Use | Instructions |
|---|---|
| SIMD FP using AOS format* | HADDPD, HSUBPD, HADDPS, HSUBPS |
| Thread synchronization | MONITOR, MWAIT |
| Video encoding | LDDQU |
| Complex arithmetic | ADDSUBPD, ADDSUBPS, MOVDDUP, MOVSHDUP, MOVSLDUP |
| FP to integer conversions | FISTTP |
* Also benefits Complex and Vectorization
Using SSE3 - Your Task: Convert This…
for (i=0;i<=MAX;i++) c[i]=a[i]+b[i];
[Figure: 128-bit registers. Each addition C[i] = A[i] + B[i] (shown for i = 0 and i = 1) occupies a single lane of its register; the remaining lanes are marked "not used".]
… Into This …
for (i=0;i<=MAX;i++) c[i]=a[i]+b[i];
[Figure: 128-bit registers. Each register holds two elements: A[1] A[0] + B[1] B[0] gives C[1] C[0], and A[3] A[2] + B[3] B[2] gives C[3] C[2], so every lane is used.]
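To make the two-elements-per-register picture concrete, the addition can be written by hand with SSE2 intrinsics from `<emmintrin.h>`. This is a sketch of roughly what a vectorizing compiler might generate, not the Intel compiler's actual output; the function name `add_arrays`, the `double` element type, and the explicit length parameter are assumptions. The `__SSE2__` guard falls back to the scalar loop on targets where that macro is not defined.

```c
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper: c[i] = a[i] + b[i] for n doubles. */
void add_arrays(const double *a, const double *b, double *c, size_t n)
{
    size_t i = 0;
#if defined(__SSE2__)
    /* Two doubles fit in one 128-bit register, so each pass of this
       loop performs two additions with a single packed add. */
    for (; i + 2 <= n; i += 2) {
        __m128d va = _mm_loadu_pd(&a[i]);   /* loads a[i], a[i+1] */
        __m128d vb = _mm_loadu_pd(&b[i]);   /* loads b[i], b[i+1] */
        _mm_storeu_pd(&c[i], _mm_add_pd(va, vb));
    }
#endif
    /* Scalar loop: handles the leftover element when n is odd, or the
       whole array when the SSE2 path is not compiled in. */
    for (; i < n; i++)
        c[i] = a[i] + b[i];
}
```

The unaligned load and store intrinsics are used so the sketch makes no assumption about how the arrays are aligned.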
Compiler Based Vectorization - Processor Specific
| Description | Use | Windows* | Linux* | Mac* |
|---|---|---|---|---|
| Generate instructions and optimize for Intel® Pentium® 4 compatible processors including MMX, SSE and SSE2. | W | /QxW | -xW | Does not apply |
| Generate instructions and optimize for Intel® processors with SSE3 capability including Core Duo. These processors support SSE3 as well as MMX, SSE and SSE2. | P | /QxP, /QaxP | -xP, -axP | Vectorization occurs by default |
Compiler Based Vectorization - Automatic Processor Dispatch – ax[?]
• Single executable
  – Optimized for Intel® Core Duo processors, plus generic code that runs on all IA32 processors.
• For each target processor it uses:
  – Processor-specific instructions
  – Vectorization
• Low overhead
  – Some increase in code size
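Conceptually, automatic dispatch means the executable carries more than one version of a routine and picks one at run time. The sketch below imitates that idea by hand and is not the mechanism the Intel compiler emits. It relies on the GCC/Clang builtins `__builtin_cpu_init` and `__builtin_cpu_supports`, and the names `add_generic`, `add_sse3_path`, `cpu_has_sse3`, and `add_dispatch` are all hypothetical.

```c
#include <stddef.h>

/* Generic version: runs on any processor. */
static void add_generic(const float *a, const float *b, float *c, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* Stand-in for a processor-specific version. In a real build this
   function would be compiled with processor-specific options; here it
   has the same body so the sketch stays portable. */
static void add_sse3_path(const float *a, const float *b, float *c, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* Returns nonzero when the running CPU reports SSE3 support.
   Assumption: GCC or Clang on x86; elsewhere it reports 0. */
static int cpu_has_sse3(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    return __builtin_cpu_supports("sse3") != 0;
#else
    return 0;
#endif
}

/* The dispatch point: one run-time check selects the code path. */
void add_dispatch(const float *a, const float *b, float *c, size_t n)
{
    if (cpu_has_sse3())
        add_sse3_path(a, b, c, n);
    else
        add_generic(a, b, c, n);
}
```

The cost is one check per call plus a second copy of the routine, which matches the slide's "low overhead, some increase in code size".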
Why Loops Don’t Vectorize
• Independence
  – Loop iterations generally must be independent
• Some relevant qualifiers:
  – Some dependent loops can be vectorized.
  – Most function calls cannot be vectorized.
  – Some conditional branches prevent vectorization.
  – Loops must be countable.
  – Outer loop of nest cannot be vectorized.
  – Mixed data types cannot be vectorized.
Why Didn’t My Loop Vectorize?
| Windows* | Linux* | Macintosh* |
|---|---|---|
| -Qvec_reportn | -vec_reportn | -vec_reportn |
Set diagnostic level dumped to stdout
  – n=0: No diagnostic information
  – n=1: (Default) Loops successfully vectorized
  – n=2: Loops not vectorized, and the reason why not
  – n=3: Adds dependency information
  – n=4: Reports only non-vectorized loops
  – n=5: Reports only non-vectorized loops and adds dependency info
Why Loops Don’t Vectorize
  – “Existence of vector dependence”
  – “Nonunit stride used”
  – “Mixed Data Types”
  – “Unsupported Loop Structure”
  – “Contains unvectorizable statement at line XX”
  – There are more reasons loops don’t vectorize, but we will discuss the reasons above
“Existence of Vector Dependency”
• Usually, indicates a real dependency between iterations of the loop, as shown here:
for (i = 0; i < 100; i++) x[i] = A * x[i + 1];
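A small experiment shows why a dependence between iterations matters. The sketch below uses a loop that reads `x[i - 1]`, so each iteration consumes the value written by the one before it; this variant is chosen for illustration and is not the slide's exact loop. The second function emulates an evaluation in which all inputs are read before any output is written, which is how a vector load followed by a vector store behaves within one vector's width. All function names and the `scratch` parameter are assumptions.

```c
#include <stddef.h>
#include <string.h>

/* Sequential semantics: iteration i sees the value that iteration
   i - 1 just stored. */
void scale_chain_sequential(double *x, size_t n, double a)
{
    size_t i;
    for (i = 1; i < n; i++)
        x[i] = a * x[i - 1];
}

/* Emulates reading every input before writing any output by taking a
   snapshot of x first. scratch must have room for n doubles. */
void scale_chain_loads_first(double *x, size_t n, double a, double *scratch)
{
    size_t i;
    memcpy(scratch, x, n * sizeof *x);
    for (i = 1; i < n; i++)
        x[i] = a * scratch[i - 1];
}
```

Starting from `{1, 1, 1, 1}` with `a = 2`, the sequential version produces `{1, 2, 4, 8}`, while the loads-first version produces `{1, 2, 2, 2}`. Because the two orders disagree, the compiler has to keep the sequential order for a loop of this shape.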
Defining Loop Independence
Iteration Y of a loop is independent of when (or whether) iteration X occurs.
int a[MAX], b[MAX];
for (j=0;j<MAX;j++) {
a[j] = b[j];
}
“Nonunit stride used”
for (I=0;I<=MAX;I++)
  for (J=0;J<=MAX;J++) {
    c[I][J]+=1;               // Unit stride
    c[J][I]+=1;               // Non-unit
    A[J*J]+=1;                // Non-unit
    A[B[J]]+=1;               // Non-unit
    if (A[MAX-J]==1) last1=J; // Non-unit
  }
End result: loading the vector may take more cycles than executing the operation sequentially.
[Figure: memory]
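The difference between the first two accesses in the loop above comes from C's row-major array layout: elements that differ only in the last index are adjacent in memory. The sketch below isolates that contrast with two traversals of the same matrix. The size `N` and both function names are assumptions made for illustration.

```c
#define N 64  /* hypothetical matrix dimension */

/* Inner loop varies the last index: consecutive iterations touch
   adjacent ints, so the access is unit stride. */
long sum_unit_stride(int m[N][N])
{
    long sum = 0;
    int i, j;
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            sum += m[i][j];
    return sum;
}

/* Inner loop varies the first index: consecutive iterations are a
   whole row (N ints) apart, so the access is non-unit stride. */
long sum_non_unit_stride(int m[N][N])
{
    long sum = 0;
    int i, j;
    for (j = 0; j < N; j++)
        for (i = 0; i < N; i++)
            sum += m[i][j];
    return sum;
}
```

Both functions return the same total; only the order in which memory is visited differs. When the loop body allows it, interchanging the loops so the last index varies fastest gives the vectorizer contiguous data to load.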
“Mixed Data Types”
An example:
int howmany_close(double *x, double *y) {
  int withinborder=0;
  double dist;
  for(int i=0;i<MAX;i++) {
    dist=sqrtf(x[i]*x[i] + y[i]*y[i]);
    if (dist<5) withinborder++;
  }
  return withinborder;
}
Mixed data types are possible, but they complicate things, e.g. 2 doubles vs. 4 ints per SIMD register.
Some operations with specific data types won’t work.
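One way to reduce the type mixing in the example is to keep all floating-point work in `double` and avoid the single-precision `sqrtf` call. The sketch below does that by comparing squared distances, which is an assumption about what the original code intends; results can differ from the original for points very close to the border, because the original rounds through `float`. The function name and the explicit length parameter `n` are also assumptions, and whether a particular compiler then vectorizes the loop is not guaranteed, since the `int` counter remains.

```c
/* Hypothetical variant: counts points whose distance from the origin
   is below 5, using only double arithmetic inside the loop. */
int howmany_close_uniform(const double *x, const double *y, int n)
{
    int withinborder = 0;
    for (int i = 0; i < n; i++) {
        /* dist < 5 is equivalent to dist*dist < 25 for
           non-negative distances, so no square root is needed. */
        double dist2 = x[i] * x[i] + y[i] * y[i];
        if (dist2 < 25.0)
            withinborder++;
    }
    return withinborder;
}
```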
“Unsupported Loop Structure”
Example:
struct _xx {
  int data;
  int bound;
};
void doit1(int *a, struct _xx *x) {
  for (int i=0; i<x->bound; i++)
    a[i] = 0;
}
An unsupported loop structure means the loop is not countable, or the compiler cannot construct a run-time expression for the trip count.
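In the example, a plausible obstacle is that a store through `a` could in principle modify `x->bound`, so the compiler cannot treat the loop limit as fixed. A common remedy is to copy the bound into a local variable before the loop. The sketch below shows that rewrite; the name `doit1_hoisted` is an assumption, and the rewrite changes behavior only in the unusual case where `a` really does overlap `x->bound`.

```c
struct _xx {
    int data;
    int bound;
};

/* Hypothetical rewrite of doit1: the bound is read once into a local,
   so the trip count is known before the loop starts. */
void doit1_hoisted(int *a, struct _xx *x)
{
    int n = x->bound;
    for (int i = 0; i < n; i++)
        a[i] = 0;
}
```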
“Contains unvectorizable statement”
for (i=1;i<nx;i++) {
  B[i] = func(A[i]);
}
[Figure: 128-bit registers holding A[1] A[0] and A[3] A[2]; func would have to be applied to every lane to produce B[1] B[0] and B[3] B[2].]
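A function call in the loop body blocks vectorization when the compiler cannot see what the function does. If the function is small and defined in the same file, the compiler may inline it, leaving plain arithmetic in the loop. The sketch below illustrates that arrangement; the body of `func` and the name `apply_func` are invented for the example, and inlining is a decision the compiler makes, not a guarantee.

```c
/* Hypothetical function body; visible in this file so the compiler
   has the option of inlining it into the loop. */
static inline double func(double v)
{
    return 2.0 * v + 1.0;
}

/* Same loop shape as on the slide, starting at index 1. */
void apply_func(const double *A, double *B, int nx)
{
    for (int i = 1; i < nx; i++)
        B[i] = func(A[i]);
}
```

When the function lives in another source file, the interprocedural optimization switches described earlier in the deck are the relevant tool, since they allow inlining across files.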
Reference
• White papers and technical notes
  – www.intel.com/ids
  – www.intel.com/software/products
• Product support resources
  – www.intel.com/software/products/support