Finding the Limits of Hardware Optimization through Software De-optimization

Post on 27-May-2022


Finding the Limits of Hardware Optimization through Software De-optimization

Presented By: Derek Kern, Roqyah Alalqam, Ahmed Mehzer, Mohammed Mohammed

• Flashback

• Project Structure

• Judging de-optimizations

• What does a de-op look like?

• General Areas of Focus

  • Instruction Fetching and Decoding

  • Instruction Scheduling

  • Instruction Type Usage (e.g. Integer vs. FP)

  • Branch Prediction

  • Idiosyncrasies

• Our Methods

  • Measuring clock cycles

  • Eliminating noise

• Something about the de-ops that didn't work

• Lots and lots of de-ops

During the research project

• We studied de-optimizations

• We studied the Opteron

For the implementation project

• We have chosen de-optimizations to implement

• We have chosen algorithms that may best reflect our de-optimizations

• We have implemented the de-optimizations

• …And, we're here to report the results

Judging de-optimizations (de-ops)

• Whether a de-op affects scheduling, caching, branching, etc., its impact will be felt in the clock cycles needed to execute an algorithm.

• So, our metric of choice is CPU clock cycles

What does a de-op look like?

• A de-op is a change to an optimal implementation of an algorithm that increases the clock cycles needed to execute the algorithm and that demonstrates some interesting fact about the CPU in question

• The CPUs

  • AMD Opteron (Hydra)

  • Intel Nehalem (Derek's Laptop)

• Our primary focus was the Opteron

  • The de-optimizations were designed to affect the Opteron

  • We also tested them on the Intel in order to give you an idea of how universal a de-optimization is

  • When we know why something does or doesn't affect the Intel, we will try to let you know

• The code

  • Most of the de-optimizations are written in C (GCC)

  • Some of them have a wrapper that is written in C, while the code being de-optimized is written in NASM (assembly)

    • E.g. mod_ten_counter, factorial_over_array

  • Typically, if a de-op is written in NASM, then the C wrapper does all of the grunt work prior to calling the de-optimized NASM module

• Problem: How do we measure clock cycles?

• An obvious answer: CodeAnalyst

  • Actually, we were getting strange results from CodeAnalyst

  • …And, it is hard to separate important code sections from unimportant code sections

  • …And, it is cumbersome to work with

• A better answer: Embed code that measures clock cycles for important sections

• Ok… but how?

#if defined(__i386__)
static __inline__ unsigned long long rdtsc(void)
{
    unsigned long long int x;
    __asm__ volatile (".byte 0x0f, 0x31" : "=A" (x));
    return x;
}
#elif defined(__x86_64__)
static __inline__ unsigned long long rdtsc(void)
{
    unsigned hi, lo;
    __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
    return ( (unsigned long long)lo ) | ( ((unsigned long long)hi) << 32 );
}
#endif

Answer: Read the CPU Timestamp Counter

• CPU Timestamp Counter

  • In all x86 CPUs since the Pentium

  • Counts the number of clock cycles since the last reset

  • It's a little tricky in multi-core environments

    • Care must be taken to control the cores that do the relevant processing

• CPU Timestamp Counter

Windows:

start /realtime /affinity 4 /b <exe name> <arguments>

Runs the executable on core 3 (of 1 – 4)

Linux (Hydra):

bpsh 11 taskset 0x000000008 <exe name> <arguments>

Runs the executable on node 11, CPU 3 (of 0 – 11)

So, by restricting our runs to specific CPUs, we can rely on the CPU timestamp values

• CPU Timestamp Counter

  • Wrapping code so that clock cycles can be counted:

//
// Send the array off to be counted by the assembly code
//
unsigned long long start = rdtsc();

#ifdef _WIN64
    mod_ten_counter( counts, mod_ten_array, size_of_array );
#elif _WIN32
    mod_ten_counter( counts, mod_ten_array, size_of_array );
#elif __linux__
    _mod_ten_counter( counts, mod_ten_array, size_of_array );
#endif

printf( "Cycles=%llu\n", ( rdtsc() - start ) );

The important section is wrapped and the number of clock cycles will be the difference between the start and the finish

• Eliminating noisy results

  • Even with our precautions, there can be some noise in the clock cycles

  • So, we need lots of iterations that we can use to generate a good average

  • But, this can be very, very time consuming

  • How, oh how?

Answer: The Version Tester

• Eliminating noisy results – The Version Tester

  • Used to iteratively test executables

  • Expects each executable to return the number of cycles that need to be counted

  • Remember this?

//
// Send the array off to be counted by the assembly code
//
unsigned long long start = rdtsc();

#ifdef _WIN64
    mod_ten_counter( counts, mod_ten_array, size_of_array );
#elif _WIN32
    mod_ten_counter( counts, mod_ten_array, size_of_array );
#elif __linux__
    _mod_ten_counter( counts, mod_ten_array, size_of_array );
#endif

printf( "Cycles=%llu\n", ( rdtsc() - start ) );

• Eliminating noisy results – The Version Tester

  • Runs executables for a specified number of iterations and then averages the number of cycles

Example run on Hydra:

> bpsh 10 taskset 0x000000004 version_tester mtc.hydra-core3.config
Running Optimized for 1000 for 200 iterations
Done running Optimized for 1000 with an average of 19058 cycles
Running De-optimized #1 for 1000 for 200 iterations
Done running De-optimized #1 for 1000 with an average of 21039 cycles
Running Optimized for 10000 for 200 iterations
Done running Optimized for 10000 with an average of 187296 cycles
Running De-optimized #1 for 10000 for 200 iterations
Done running De-optimized #1 for 10000 with an average of 206060 cycles

Runs version_tester.exe on CPU 2 and mod_ten_counter.exe on CPU 3

• Eliminating noisy results – The Version Tester

  • Running

Command Format:

version_tester <tester_configuration>

Configuration File (for Hydra):

ITERATIONS=200

__EXECUTABLES__
Optimized for 1000=taskset 0x000000008 ./mod_ten_counter_op 1000
De-optimized #1 for 1000=taskset 0x000000008 ./mod_ten_counter_deop 1000
Optimized for 10000=taskset 0x000000008 ./mod_ten_counter_op 10000
De-optimized #1 for 10000=taskset 0x000000008 ./mod_ten_counter_deop 10000
Optimized for 100000=taskset 0x000000008 ./mod_ten_counter_op 100000
De-optimized #1 for 100000=taskset 0x000000008 ./mod_ten_counter_deop 100000
Optimized for 1000000=taskset 0x000000008 ./mod_ten_counter_op 1000000
De-optimized #1 for 1000000=taskset 0x000000008 ./mod_ten_counter_deop 1000000

• Eliminating noisy results – The Version Tester

  • Running

Configuration File (for Windows):

ITERATIONS=200

__EXECUTABLES__
Optimized for 10=.\mod_ten_counter\mod_ten_counter_op 10
De-optimized #1 for 10=.\mod_ten_counter\mod_ten_counter_deop 10
Optimized for 100=.\mod_ten_counter\mod_ten_counter_op 100
De-optimized #1 for 100=.\mod_ten_counter\mod_ten_counter_deop 100
Optimized for 1000=.\mod_ten_counter\mod_ten_counter_op 1000
De-optimized #1 for 1000=.\mod_ten_counter\mod_ten_counter_deop 1000
Optimized for 10000=.\mod_ten_counter\mod_ten_counter_op 10000
De-optimized #1 for 10000=.\mod_ten_counter\mod_ten_counter_deop 10000
Optimized for 100000=.\mod_ten_counter\mod_ten_counter_op 100000
De-optimized #1 for 100000=.\mod_ten_counter\mod_ten_counter_deop 100000
Optimized for 1000000=.\mod_ten_counter\mod_ten_counter_op 1000000
De-optimized #1 for 1000000=.\mod_ten_counter\mod_ten_counter_deop 1000000
Optimized for 10000000=.\mod_ten_counter\mod_ten_counter_op 10000000
De-optimized #1 for 10000000=.\mod_ten_counter\mod_ten_counter_deop 10000000

• Eliminating noisy results – The Version Tester

  • Therefore, using the Version Tester, we can iterate hundreds or thousands of times in order to obtain a solid average number of cycles

• So, we believe our results fairly represent the CPUs in question

• You are going to see the various de-optimizations that we implemented and the corresponding results

  • These de-optimizations were tested using the Version Tester and were executed while restricting the execution to a single core (CPU)

• …something about the de-optimizations that were less than successful

• Branch Patterns

  • Remember: We wanted to challenge the CPU with branching patterns that could force misses

  • This turned out to be very difficult to do

  • Random data caused a significant slowdown. But random data will break any branch prediction mechanism

  • The branch prediction mechanism on the Opteron is very, very good

• Unpredictable Instructions – Recursion

  • Remember: Writing recursive functions that call other functions near their return

  • This was supposed to overload the return address buffer and cause mispredictions

  • It turned out to be very difficult to implement

  • We never really showed any performance degradation

  • So, don't worry about this one

So, without further ado...

De-Optimization Results

Area: Instruction Scheduling

Dependency Chain

• Description

  • As we have seen in this class, data dependencies have an impact on ILP

  • Dynamic scheduling, as we saw, can eliminate WAW & WAR dependencies

  • However, at some point dynamic scheduling can be overwhelmed, which affects performance, as we will see next

• The Opteron

  • The Opteron, like all other architectures, is highly affected by data hazards

  • The purpose of this de-optimization is to show the impact of a chain of data dependencies on performance

• dependency_chain.exe

  • We implemented two versions of a program called 'dependency_chain'

  • The program takes an array size as an argument

  • It then generates an array of the specified size in which each element is populated with an integer x where 0 <= x <= 20

  • The array's elements are summed and the output is the number of cycles taken by the program

• dependency_chain.exe

  • The optimized version adds the elements of the array by striding through the array in four-element chunks and adding elements to four different temporary variables

  • Then the four temporary variables are added together

  • The advantage is allowing four shorter dependency chains instead of one massive one

  • In the de-optimized version, each element of the array is summed into one variable

  • This creates a massive dependency chain which quickly exhausts the scheduling resources of the dynamic scheduler

• dependency_chain.exe

Source

Optimized:

for ( i = 0; i < size_of_array; i += 4 ) {
    sum1 += test_array[i];
    sum2 += test_array[i + 1];
    sum3 += test_array[i + 2];
    sum4 += test_array[i + 3];
}
sum = sum1 + sum2 + sum3 + sum4;

De-Optimized:

for ( i = 0; i < size_of_array; i++ ) {
    sum += test_array[i];
}

[Chart: dependency_chain – Clock Cycles vs. Array Size (10 – 10000000); series: Optmzd/De-Optmzd on Opteron and Nehalem]

• dependency_chain.exe

  • The results below show that not breaking up a dependency chain can be extraordinarily costly. On the Opteron, it caused a slowdown of ~150% for all array sizes.

  • The scheduling resources of the Opteron become overwhelmed, essentially causing the program to run sequentially, i.e. with no ILP

  • The Nehalem was impacted by this de-optimization too. Given the lesser impact, one can only imagine that it has more scheduling resources

Difference between Optimized and De-Optimized Versions


Array Size    AMD Opteron: Difference*    Slowdown (%)    Intel Nehalem: Difference*    Slowdown (%)
10                        43                  19.03                     3                     2.5
100                      841                 120.31                   256                    57.66
1000                    8857                 160.02                  2793                    79.03
10000                  87503                 160.64                 30073                    86.46
100000                877172                 155.3                 272096                    78.83
1000000              8633066                 142.06               2937193                    88.53
10000000            90226731                 132.98              25436239                    71.12

* In Clock Cycles

• Lessons

  • The code for the de-optimization is so natural that it is a little scary. It is elegant and parsimonious

  • However, this elegance and parsimony may come at a very high cost

  • If you don't get the performance that you expect from a program, then it is definitely worth looking for these types of dependency chains

  • Break these chains up to give dynamic schedulers more scheduling options

De-Optimization Results

Area: Instruction Fetching and Decoding

High Instruction Latency

• Description

  • CPUs often have instructions that can perform almost the same operation

  • Yet, in spite of their seeming similarity, they have very different latencies. By choosing the high-latency version when the low-latency version would suffice, code can be de-optimized

• The Opteron

  • The LOOP instruction on the Opteron has a latency of 8 cycles, while a test (like DEC) and jump (like JNZ) has a latency of less than 4 cycles

  • Therefore, substituting LOOP instructions for DEC/JNZ combinations will be a de-optimization

• fib.exe

  • We implemented a program called 'fib'

  • It takes an array size as an argument

  • A Fibonacci number is calculated for each element in the array

• fib.exe

  • The Fibonacci numbers are calculated in assembly code

  • The optimized version uses DEC & JNZ instructions, which take up to 4 cycles

  • The de-optimized version uses the LOOP instruction, which takes 8 cycles

• fib.exe

Source

Optimized:

calculate:
    mov eax? 

• fib.exe

Compiled

Optimized:

08048481 <calculate>:
 8048481: 89 c2                mov %eax,%edx
 8048483: 01 d3                add %edx,%ebx
 8048485: 89 d8                mov %ebx,%eax
 8048487: 89 1f                mov %ebx,(%edi)
 8048489: 81 c7 04 00 00 00    add $0x4,%edi
 804848f: 49                   dec %ecx
 8048490: 75 ef                jne 8048481 <calculate>

De-Optimized:

08048481 <calculate>:
 8048481: 89 c2                mov %eax,%edx
 8048483: 01 d3                add %edx,%ebx
 8048485: 89 d8                mov %ebx,%eax
 8048487: 89 1f                mov %ebx,(%edi)
 8048489: 81 c7 04 00 00 00    add $0x4,%edi
 804848f: e2 f0                loop 8048481 <calculate>

[Chart: De-Optimization (fib) – Clock Cycles vs. Array Size (10 – 10000000); series: Optmzd/De-Optmzd on Opteron and Nehalem]

• fib.exe

  • In the table below we can see that the optimized version significantly outperforms the de-optimized version. The results on the Nehalem are even more impressive

Difference between Optimized and De-Optimized Versions


Array Size    AMD Opteron: Difference*    Slowdown (%)    Intel Nehalem: Difference*    Slowdown (%)
10                        17                   9.04                     9                     8.57
100                      155                  29.4                    244                    81.87
1000                    1386                  34.74                  2239                   104.52
10000                  14588                  22.03                 19519                    62.16
100000                123271                  17.91                256839                    87.42
1000000              1206678                  16.94               2430301                    81.51
10000000            11716747                  14.07              19896396                    65.26

* In Clock Cycles

• Lessons

  • As we have seen, different instructions can affect your program's performance when you don't choose them carefully.

  • It's important to know which instructions take more cycles, and to avoid using them when possible.

De-Optimization Results

Area: Instruction Type Usage

Costly Instructions

• Description

  • Some instructions can do the same job but at a greater cost in terms of number of cycles

• The Opteron

  • Integer division on the Opteron costs 22–47 cycles for signed operands, and 17–41 cycles for unsigned

  • Meanwhile, multiplication takes only 3–8 cycles for both signed and unsigned operands

• mult_vs_div_deop_1.exe & mult_vs_div_op.exe

  • We implemented two programs, optimized and de-optimized versions

  • They take an array size as an argument; the array is initialized randomly with powers of 2 (less than or equal to 2^12)

  • The de-optimized version divides each element by 2.0. The optimized version multiplies each element by 0.5.

  • The versions are functionally equivalent

• mult_vs_div_deop_1.exe & mult_vs_div_op.exe

De-optimized:

for ( i = 0; i < size_of_array; i++ ) {
    test_array[i] = test_array[i] / 2.0;
}

Optimized:

for ( i = 0; i < size_of_array; i++ ) {
    test_array[i] = test_array[i] * 0.5;
}

[Chart: De-Optimization (Costly_Instruction) – Clock Cycles vs. Array Size (10 – 10000000); series: Optmzd/De-Optmzd on Opteron and Nehalem]

• mult_vs_div_deop_1.exe & mult_vs_div_op.exe

  • Looking at the table below, you can see this de-optimization has a huge impact on the Opteron, averaging 23%. It still has an effect on the Nehalem, even if it is not as big as on the Opteron

Difference between Optimized and De-Optimized Versions

Array Size    AMD Opteron: Difference*    Slowdown (%)    Intel Nehalem: Difference*    Slowdown (%)
10                        99                  18.1                    -42                   -14.34
100                      949                  24.73                    57                     2.04
1000                    9754                  26.49                  1630                     5.96
10000                  94658                  25.67                 16364                     5.07
100000                938712                  25.27                117817                     4.54
1000000              9619257                  23.94                675940                     2.64
10000000            96477334                  23.98               8585454                     3.41

* In Clock Cycles

• Lessons

  • Small changes in your code can have a real impact on performance

  • It is so important to know the difference between instructions in terms of cost

  • Seek out discount instructions when possible

De-Optimization Results

Area: Instruction Type Usage

Costly Instructions

• Description

  • Some instructions can do the same job but at a greater cost in terms of number of cycles

Example:

float f1, f2;
if (f1 < f2) ...

  • This is a common usage for programmers, and it can be considered a de-optimization technique

• The Opteron

  • Branches based on floating-point comparisons are often slow

• Compare_two_floats.exe

  • We implemented a program called 'Compare_two_floats'

  • It takes a number of iterations as an argument

  • Comparisons between two floating-point numbers are done in this program.

• Compare_two_floats_deop.exe & Compare_two_floats_op.exe

  • In the de-optimized version, we compare two floats using the old common way, as we will see in the next slide

  • In the optimized version, however, we reinterpret the float as an integer and use that as the condition instead

  • The condition was specified on purpose so that it is not taken all the time

• Compare_two_floats_deop.exe & Compare_two_floats_op.exe

De-Optimized:

for (i = 0; i < numberof_iteration; i++)
{
    if (f1 <= f2)
    {
        Count_numbers(i);
        count++;
    }
    else
        count++;
}

Optimized:

for (j = 0; j < numberof_iteration; j++)
{
    if (FLOAT2INTCAST(t) <= 0)
    {
        Count_numbers(j);
        count++;
    }
    else
        count++;
}

• Compare_two_floats.exe

[Chart: De-Optimization (Compare_two_floats) – Clock Cycles vs. Array Size (10 – 10000000); series: Optmzd/De-Optmzd on Opteron and Nehalem]

• Compare_two_floats.exe

  • The table below shows a small impact on the Opteron; however, the results were surprising for the Nehalem, even though the de-optimization was designed primarily for the Opteron.

Difference between Optimized and De-Optimized Versions

Array Size    AMD Opteron: Difference*    Slowdown (%)    Intel Nehalem: Difference*    Slowdown (%)
10                    -10991                 -97.19                    43                    26.70
100                   -10960                 -86.18                   471                    49.68
1000                   -9892                 -37.97                  5285                    59.90
10000                   -982                  -0.60                 52211                    58.82
100000                 88962                   5.86                413932                    41.79
1000000               987559                   6.56               4298502                    44.46
10000000             9949917                   6.62              41906333                    46.14

* In Clock Cycles

• Lessons

  • Usually float comparisons are more expensive than integer comparisons in terms of number of cycles

  • Even though the Opteron passed this test, that does not mean your computer will do the same!!

  • Float comparisons still have a big impact on the Nehalem

  • Again, great care must be taken when a program deals with many float comparisons.

De-Optimization Results

Area: Instruction Scheduling

Loop Re-rolling

• Description

  • Loops not only affect branch prediction. They can also affect dynamic scheduling. How?

  • Suppose instructions 1 and 2 are within loops A and B, respectively. 1 and 2 could be part of a unified loop. If they were, then they could be scheduled together. Yet, they are separate and cannot be

• The Opteron

  • Given that the Opteron is 3-way superscalar, this de-optimization could significantly reduce IPC

  • The de-optimization would be two consecutive loops, each containing one or more instructions, such that the loops could be combined

• loop_re_rolling_deop.exe & loop_re_rolling_op.exe

  • We implemented two programs: optimized and de-optimized versions

  • They take an array size as an argument, and initialize the array randomly

  • A cubic and a quadratic are calculated for each element in the array

  • In the de-optimized version, the cubic and quadratic calculations are in two consecutive loops. They are combined into the same loop in the optimized version.

  • Both versions are functionally equivalent

• loop_re_rolling_deop.exe & loop_re_rolling_op.exe

  • We want to show whether removing some of the flexibility available to the dynamic scheduler affects the number of cycles or not.

  • The de-optimized instructions are not expected to be scheduled at the same time. The de-optimization should prevent this

• loop_re_rolling_deop.exe & loop_re_rolling_op.exe

De-optimized:

for (i = 0; i < size_of_array; i++) {
    quadratic_array[i] = load_store_array[i] * load_store_array[i];
}

for (i = 0; i < size_of_array; i++) {
    cubic_array[i] = load_store_array[i] * load_store_array[i] * load_store_array[i];
}

Optimized:

for (i = 0; i < size_of_array; i++) {
    quadratic_array[i] = load_store_array[i] * load_store_array[i];
    cubic_array[i] = load_store_array[i] * load_store_array[i] * load_store_array[i];
}

[Chart: De-Optimization (Loop_re_Rolling) – Clock Cycles vs. Array Size (10 – 100000); series: Optmzd/De-Optmzd on Opteron and Nehalem]

• loop_re_rolling_deop.exe & loop_re_rolling_op.exe

  • The slowdown percentage is definitely large for the Opteron. It is almost 50% on average.

  • It is large for the Nehalem as well.

  • These results show the difference that room for dynamic scheduling can make

Difference between Optimized and De-Optimized Versions

Array Size    AMD Opteron: Difference*    Slowdown (%)    Intel Nehalem: Difference*    Slowdown (%)
10                       189                  43.95                    83                    28.32
100                     1653                  53.68                   357                    15.35
1000                   21311                  54.96                  4696                    21.28
10000                 234336                  52.42                 39311                    17.10
100000               2352328                  50.93                134786                     6.05

* In Clock Cycles

• Lessons

  • Dynamic scheduling is absolutely important

  • Loops should be used carefully, and combined when it is possible

  • Instructions that do not depend on each other (no true dependencies) allow better dynamic scheduling, which enhances performance, especially when they are repeated frequently

De-Optimization Results

Area: Instruction Type Usage

Store-to-load dependency

• Description

  • A store-to-load dependency takes place when stored data needs to be used again shortly afterward

  • This type of dependency increases the pressure on the load and store unit and might cause the CPU to stall, especially when the dependency occurs frequently

  • It arises in many instruction sequences where we load data that was stored shortly before

• The Opteron

• dependecy_deop.exe & dependency_op.exe

  • We implemented two versions of a dependency program, one optimized and the other de-optimized

  • They take an array size as an argument; the array is initialized randomly

  • Both versions perform a prefix sum over the array. Thus, the final array element will contain the sum of it and all previous elements within the array

• dependecy_deop.exe & dependency_op.exe

  • In the de-optimized version, we store and then shortly afterward load the same elements of the array

  • In the optimized version, we used temporary variables to avoid this type of dependency

  • The optimized code has more instructions, and one would expect that adding more instructions would increase the number of cycles compared to the version with fewer instructions

• dependecy_deop.exe & dependency_op.exe

De-optimized:

for ( i = 1; i < size_of_array; i++ ) {
    test_array[i] = test_array[i] + test_array[i - 1];
}

Optimized:

for ( i = 3; i < size_of_array; i += 3 )
{
    temp2 = test_array[i - 2] + temp_prev;
    temp1 = test_array[i - 1] + temp2;
    test_array[i - 2] = temp2;
    test_array[i - 1] = temp1;
    test_array[i] = temp_prev = test_array[i] + temp1;
}

[Chart: De-Optimization (Costly_Instruction) – Clock Cycles vs. Array Size (10 – 10000000); series: Optmzd/De-Optmzd on Opteron and Nehalem]

• dependecy_deop.exe & dependency_op.exe

  • In the table below we can see that this de-optimization technique slows the Opteron down by 60% on average, which is a huge difference. The Nehalem was affected by this code as well

Difference between Optimized and De-Optimized Versions

Array Size    AMD Opteron: Difference*    Slowdown (%)    Intel Nehalem: Difference*    Slowdown (%)
10                        87                  37.5                     12                     8.16
100                     2392                  77.98                   230                    21.9
1000                    9858                  78.41                  1610                    15.38
10000                  79180                  63.89                 17517                    17.84
100000                786708                  63.1                 176893                    18.7
1000000              8181981                  64.12               1205000                    12.12
10000000            81498679                  58.83              17276216                    18.06

* In Clock Cycles

• Lessons

  • Store-to-load dependencies are something you should be aware of

  • Writing more instructions does not always mean your program will run slower

  • Avoid this common store-load dependency pattern to improve your program's performance

De-Optimization Results

Area: Instruction Type Usage

Costly Behavior

• Description

  • Conditional statements are an active player in almost everyone's code. Would you believe it if someone told you a minor change could have a real effect on your program?

  • The sequence in which conditions are checked is SO IMPORTANT.

  • Most architectures check these conditions in the same sequence regardless of the programming language and platform you use.

• IF_Condition.exe

  • We implemented two versions of a program called 'IF_Condition'

  • They take a number of iterations as an argument and initialize an array randomly with floats between 0.5 and 11.0

  • For each element in the array, we add one to a dummy variable if its index is equal to 0 (mod 2) and its value is greater than 1.5.

• IF_deop.exe & IF_op.exe

  • The if statement will hold true only if both conditions are true

  • In the de-optimized version, we put the condition that has more chance of being false as the second condition, while we put it first in the optimized version

• IF_deop.exe & IF_op.exe

Optimized:

for ( i = 0; i < size_of_array; i++ ) {
    mod = ( i % 2 );
    if ( test_array[i] > 1.5 && mod == 0 )
        dummy++;
    else
        dummy--;
}

De-Optimized:

for ( i = 0; i < size_of_array; i++ ) {
    mod = ( i % 2 );
    if ( mod == 0 && test_array[i] > 1.5 )
        dummy++;
    else
        dummy--;
}

[Chart: De-Optimization (If_Condition) – Clock Cycles vs. Array Size (10 – 10000000); series: Optmzd/De-Optmzd on Opteron and Nehalem]

• IF_Condition.exe

  • The table below shows that the optimized version outperforms the de-optimized version on both the Opteron and the Nehalem.

Difference between Optimized and De-Optimized Versions

Array Size    AMD Opteron: Difference*    Slowdown (%)    Intel Nehalem: Difference*    Slowdown (%)
10                      -207                 -26.90                    57                    23.85
100                      338                  15.48                   990                    56.99
1000                    6438                  38.27                  4886                    33.86
10000                  63975                  38.09                 47374                    37.87
100000                651438                  38.94                520198                    50.64
1000000              6385986                  37.37               4577710                    41.86
10000000            66023847                  35.89              49997484                    47.52

* In Clock Cycles

• Lessons

  • Conditional statements can have a negative impact if their ordering is ignored

  • One case was implemented (&&); other cases would be equivalent in terms of increasing the number of cycles

  • If it is possible to predict which condition will more often be true or false, then putting that condition in the right position will save some cycles

De-Optimization Results

Area: Branch Prediction

Branch Density

• Description

  • This de-optimization attempts to overwhelm the CPU's ability to predict branches by packing branches as tightly as possible

  • Whether or not a bubble is created depends upon the hardware

  • However, at some point, the hardware can only predict so much and pre-load so much code

• The Opteron

  • The Opteron's BTB (Branch Target Buffer) can only maintain 3 (used) branch entries per (aligned) 16 bytes of code [AMD05]

  • Thus, the Opteron cannot successfully maintain predictions for all of the branches within the following sequence of instructions:

401399: 8b 44 24 10    mov 0x10(%esp),%eax
40139d: 48             dec %eax
40139e: 74 7a          je 40141a <_mod_ten_counter+0x8a>
4013a0: 8b 0f          mov (%edi),%ecx
4013a2: 74 1b          je 4013bf <_mod_ten_counter+0x2f>
4013a4: 49             dec %ecx
4013a5: 74 1f          je 4013c6 <_mod_ten_counter+0x36>
4013a7: 49             dec %ecx
4013a8: 74 25          je 4013cf <_mod_ten_counter+0x3f>
4013aa: 49             dec %ecx
4013ab: 74 2b          je 4013d8 <_mod_ten_counter+0x48>
4013ad: 49             dec %ecx
4013ae: 74 31          je 4013e1 <_mod_ten_counter+0x51>
4013b0: 49             dec %ecx
4013b1: 74 37          je 4013ea <_mod_ten_counter+0x5a>
4013b3: 49             dec %ecx
4013b4: 74 3d          je 4013f3 <_mod_ten_counter+0x63>
4013b6: 49             dec %ecx
4013b7: 74 43          je 4013fc <_mod_ten_counter+0x6c>
4013b9: 49             dec %ecx
4013ba: 74 49          je 401405 <_mod_ten_counter+0x75>
4013bc: 49             dec %ecx
4013bd: 74 4f          je 40140e <_mod_ten_counter+0x7e>

• mod_ten_counter.exe

  • We implemented a program called 'mod_ten_counter'

  • It takes an array size as an argument

  • The array is populated with a repeating pattern of consecutive integers from zero to nine

    • Like: 012345678901234567890123456789…

    • In other words, the contents are not random

  • Very simply, it counts the number of times that each integer (0 – 9) appears within the array

• mod_ten_counter.exe

  • The optimized version maintains proper spacing between branch instructions

  • The de-optimized version (seen on the previous slide) has densely packed branches

  • Notes:

    • The spacing in the optimized version is achieved with NOP instructions

    • It has one extra NOP per branch, so it has roughly 5 more instructions per iteration than the de-optimized version

    • Thus, if the optimized version outperforms the de-optimized version, then the difference will be even more impressive

• mod_ten_counter.exe

Source

Optimized:

cmp ecx, 0
je mark_0     ; We have a 0
nop
dec ecx
je mark_1     ; We have a 1
nop
dec ecx
je mark_2     ; We have a 2
nop
dec ecx
je mark_3
...

De-Optimized:

cmp ecx, 0
je mark_0     ; We have a 0
dec ecx
je mark_1     ; We have a 1
dec ecx
je mark_2     ; We have a 2
dec ecx
je mark_3
...

• mod_ten_counter.exe

[Chart: De-optimization: Branch Density (mod_ten_counter) – Clock Cycles vs. Array Size (10 – 10000000); series: Optmzd/De-Optmzd on Opteron and Nehalem]

• mod_ten_counter.exe

  • As you can see from the table below, in spite of its handicap, the optimized version significantly outperforms the de-optimized version

  • Interestingly, this de-optimization is more impressive on the Intel, even though it was designed with the Opteron in mind

Difference between Optimized and De-Optimized Versions

Array Size    AMD Opteron: Difference*    Slowdown (%)    Intel Nehalem: Difference*    Slowdown (%)
10                       -47                  -7.31                    39                     9.92
100                      185                   7.80                   331                    27.36
1000                    2234                  11.86                  6355                    89.42
10000                  23230                  12.69                 67653                   110.38
100000                203760                  10.98                537362                    84.27
1000000              1620652                   8.40               5306766                    89.63
10000000            17263048                   8.34              52082971                    78.85

* In Clock Cycles

� So, what’s up with the Nehalem?

� The Nehalem performs well generally, but is very susceptible to this de-

optimization. Why?

� There isn’t great information on this facet of the Nehalem

� But…

� The Nehalem can handle 4 active branches per 16 bytes

� The misprediction penalty is ~17 cycles, which tells us the Nehalem has a long

pipeline

� Therefore, missing the BTB is probably very costly as well

� Lessons

� Branch density can adversely affect performance and make otherwise

predictable branches unpredictable

� Great care must be taken when designing branches, if-then-else

structures and case-switch structures

De-Optimization Results

Area: Branch Prediction

Unpredictable Instructions

� Description

� Some CPUs can predict only a limited number of branch instructions within a given span of bytes

� If this limit is exceeded, or if branch instructions are not aligned properly, then branches cannot be predicted

� The Opteron

� The return (RET) instruction can be encoded in a single byte

� If a branch instruction immediately precedes a one-byte RET instruction, then the RET cannot be predicted

� A one-byte RET instruction can cause a misprediction even when there is only one branch instruction per 16 bytes

� Alignment: 9 branch indicators are associated with byte addresses 0, 1, 3, 5, 7, 9, 11, 13, and 15 within each 16-byte segment

� factorial_over_array.exe

� We implemented a program called ‘factorial_over_array’

� It takes an array size as an argument

� The array is populated with random integers between 1 and 12

e.g. { 3, 7, 4, 10, 9, 1, 5, 2, 12 }

� Factorial is calculated for each element in the array

� factorial_over_array.exe

� Factorial is calculated in assembly code

� In the optimized version, the RET instruction is padded with a NOP so that it is not immediately adjacent to another branch and so that it falls on an odd byte offset within the 16-byte segment

� In the de-optimized version, the RET instruction sits immediately after a branch instruction and falls on an even byte offset within the 16-byte segment

� factorial_over_array.exe

Optimized (source):

    global _factorial
    section .text

    _factorial:
        nop
        mov eax, [esp+4]
        cmp eax, 1
        jne calculate
        nop
        ret

    calculate:
        dec eax
        push eax
        call _factorial
        add esp, 4
        imul eax, [esp+4]
        ret

De-Optimized (source):

    global _factorial
    section .text

    _factorial:
        nop
        mov eax, [esp+4]
        cmp eax, 1
        jne calculate
        ret

    calculate:
        dec eax
        push eax
        call _factorial
        add esp, 4
        imul eax, [esp+4]
        ret

� factorial_over_array.exe

Optimized (compiled):

     0: 90                nop
     1: 8b 44 24 04       mov    0x4(%esp),%eax
     5: 83 f8 01          cmp    $0x1,%eax
     8: 75 02             jne    c <_factorial+0xc>
     a: 90                nop
     b: c3                ret
     c: 48                dec    %eax
     d: 50                push   %eax
     e: e8 ed ff ff ff    call   0 <_factorial>
    13: 83 c4 04          add    $0x4,%esp
    16: 0f af 44 24 04    imul   0x4(%esp),%eax
    1b: c3                ret

De-Optimized (compiled):

     0: 90                nop
     1: 8b 44 24 04       mov    0x4(%esp),%eax
     5: 83 f8 01          cmp    $0x1,%eax
     8: 75 01             jne    b <_factorial+0xb>
     a: c3                ret
     b: 48                dec    %eax
     c: 50                push   %eax
     d: e8 ee ff ff ff    call   0 <_factorial>
    12: 83 c4 04          add    $0x4,%esp
    15: 0f af 44 24 04    imul   0x4(%esp),%eax
    1a: c3                ret

� factorial_over_array.exe

[Chart: De-optimization: Unpredictable instructions (factorial_over_array). Clock cycles vs. array size (10 to 10,000,000); series: Optmzd and De-Optmzd on the Opteron and the Nehalem]

� factorial_over_array.exe

� As you can see from the chart below, the optimized version significantly

outperforms the de-optimized version

� Interestingly, this de-optimization has an inconclusive effect on the

Nehalem

Difference between Optimized and De-Optimized Versions (in clock cycles)

Array Size     AMD Opteron:               Intel Nehalem:
               Difference   Slowdown(%)   Difference   Slowdown(%)
10                     80          7.07           21          2.58
100                  1510          0.88         2436         32.15
1000                10429         11.61         -990         -1.24
10000              115975         12.76        36158          5.33
100000            1139748         12.59       238852          3.54
1000000          11529624         12.66     -3505949         -5.42
10000000        175191774         18.90      3081557          0.51


� Lessons

� Alignment is one of many ways that instructions can become

unpredictable

� These constant misses can be very costly

� Again, great care must be taken. Brevity, at times, can create

inefficiencies

� We’ve shown you lots of de-optimizations

� Most of them were successful

� So, now, you know some of the costs associated with ignoring CPU architecture when writing code

� If you are like us, then you must be reconsidering how you write software

� As you’ve seen, some of the simple habits that you’ve accumulated may be causing your code to run more slowly than it would have otherwise

