TRANSCRIPT

Finding the Limits of Hardware Optimization
through Software De-optimization

Presented By:
Derek Kern, Roqyah Alalqam,
Ahmed Mehzer, Mohammed Mohammed
• Flashback
• Project Structure
• Judging de-optimizations
• What does a de-op look like?
• General Areas of Focus
  • Instruction Fetching and Decoding
  • Instruction Scheduling
  • Instruction Type Usage (e.g. Integer vs. FP)
  • Branch Prediction
  • Idiosyncrasies
• Our Methods
  • Measuring clock cycles
  • Eliminating noise
• Something about the de-ops that didn’t work
• Lots and lots of de-ops
During the research project
• We studied de-optimizations
• We studied the Opteron
For the implementation project
• We have chosen de-optimizations to implement
• We have chosen algorithms that may best reflect our de-optimizations
• We have implemented the de-optimizations
• …And, we’re here to report the results
Judging de-optimizations (de-ops)
• Whether a de-op affects scheduling, caching, branching, etc., its
impact will be felt in the clock cycles needed to execute an algorithm.
• So, our metric of choice will be CPU clock cycles
What does a de-op look like?
• A de-op is a change to an optimal implementation of an algorithm that
increases the clock cycles needed to execute the algorithm and that
demonstrates some interesting fact about the CPU in question
• The CPUs
  • AMD Opteron (Hydra)
  • Intel Nehalem (Derek’s Laptop)
• Our primary focus was the Opteron
  • The de-optimizations were designed to affect the Opteron
  • We also tested them on the Intel in order to give you an idea of how universal a de-optimization is
  • When we know why something does or doesn’t affect the Intel, we will try to let you know
• The code
  • Most of the de-optimizations are written in C (GCC)
  • Some of them have a wrapper that is written in C, while the code being
de-optimized is written in NASM (assembly)
    • E.g. mod_ten_counter, factorial_over_array
  • Typically, if a de-op is written in NASM, then the C wrapper does all of
the grunt work prior to calling the de-optimized NASM module
• Problem: How do we measure clock cycles?
• An obvious answer: CodeAnalyst
  • Actually, we were getting strange results from CodeAnalyst
  • …And, it is hard to separate important code sections from unimportant
code sections
  • …And, it is cumbersome to work with
• A better answer
  • Embed code that measures clock cycles for important sections
  • Ok… but how?
#if defined(__i386__)
static __inline__ unsigned long long rdtsc(void) {
    unsigned long long int x;
    __asm__ volatile (".byte 0x0f, 0x31" : "=A" (x));
    return x;
}
#elif defined(__x86_64__)
static __inline__ unsigned long long rdtsc(void) {
    unsigned hi, lo;
    __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
    return ((unsigned long long)lo) | (((unsigned long long)hi) << 32);
}
#endif
Answer: Read the CPU Timestamp Counter
• CPU Timestamp Counter
  • In all x86 CPUs since the Pentium
  • Counts the number of clock cycles since the last reset
  • It’s a little tricky in multi-core environments
  • Care must be taken to control the cores that do the relevant processing
• CPU Timestamp Counter
  Windows:
    start /realtime /affinity 4 /b <exe name> <arguments>
    Runs the executable on core 3 (of 1 – 4)
  Linux (Hydra):
    bpsh 11 taskset 0x000000008 <exe name> <arguments>
    Runs the executable on node 11, CPU 3 (of 0 – 11)
So, by restricting our runs to specific CPUs, we can rely on the CPU
timestamp values
• CPU Timestamp Counter
• Wrapping code so that clock cycles can be counted

//
// Send the array off to be counted by the assembly code
//
unsigned long long start = rdtsc();
#ifdef _WIN64
    mod_ten_counter( counts, mod_ten_array, size_of_array );
#elif _WIN32
    mod_ten_counter( counts, mod_ten_array, size_of_array );
#elif __linux__
    _mod_ten_counter( counts, mod_ten_array, size_of_array );
#endif
printf( "Cycles=%llu\n", ( rdtsc() - start ) );

The important section is wrapped and the number of clock cycles will
be the difference between the start and the finish
• Eliminating noisy results
  • Even with our precautions, there can be some noise in the clock cycles
  • So, we need lots of iterations that we can use to generate a good
average
  • But, this can be very, very time consuming
  • How, oh how?
Answer: The Version Tester
• Eliminating noisy results – The Version Tester
  • Used to iteratively test executables
  • Expects each executable to return the number of cycles that need to
be counted
  • Remember this?

//
// Send the array off to be counted by the assembly code
//
unsigned long long start = rdtsc();
#ifdef _WIN64
    mod_ten_counter( counts, mod_ten_array, size_of_array );
#elif _WIN32
    mod_ten_counter( counts, mod_ten_array, size_of_array );
#elif __linux__
    _mod_ten_counter( counts, mod_ten_array, size_of_array );
#endif
printf( "Cycles=%llu\n", ( rdtsc() - start ) );
• Eliminating noisy results – The Version Tester
  • Runs executables for a specified number of iterations and then averages
the number of cycles

Example run on Hydra:

> bpsh 10 taskset 0x000000004 version_tester mtc.hydra-core3.config
Running Optimized for 1000 for 200 iterations
Done running Optimized for 1000 with an average of 19058 cycles
Running De-optimized #1 for 1000 for 200 iterations
Done running De-optimized #1 for 1000 with an average of 21039 cycles
Running Optimized for 10000 for 200 iterations
Done running Optimized for 10000 with an average of 187296 cycles
Running De-optimized #1 for 10000 for 200 iterations
Done running De-optimized #1 for 10000 with an average of 206060 cycles

Runs version_tester.exe on CPU 2 and mod_ten_counter.exe on CPU 3
• Eliminating noisy results – The Version Tester
• Running

Command Format:
version_tester <tester_configuration>

Configuration File (for Hydra):
ITERATIONS=200
__EXECUTABLES__
Optimized for 1000=taskset 0x000000008 ./mod_ten_counter_op 1000
De-optimized #1 for 1000=taskset 0x000000008 ./mod_ten_counter_deop 1000
Optimized for 10000=taskset 0x000000008 ./mod_ten_counter_op 10000
De-optimized #1 for 10000=taskset 0x000000008 ./mod_ten_counter_deop 10000
Optimized for 100000=taskset 0x000000008 ./mod_ten_counter_op 100000
De-optimized #1 for 100000=taskset 0x000000008 ./mod_ten_counter_deop 100000
Optimized for 1000000=taskset 0x000000008 ./mod_ten_counter_op 1000000
De-optimized #1 for 1000000=taskset 0x000000008 ./mod_ten_counter_deop 1000000
• Eliminating noisy results – The Version Tester
• Running

Configuration File (for Windows):
ITERATIONS=200
__EXECUTABLES__
Optimized for 10=.\mod_ten_counter\mod_ten_counter_op 10
De-optimized #1 for 10=.\mod_ten_counter\mod_ten_counter_deop 10
Optimized for 100=.\mod_ten_counter\mod_ten_counter_op 100
De-optimized #1 for 100=.\mod_ten_counter\mod_ten_counter_deop 100
Optimized for 1000=.\mod_ten_counter\mod_ten_counter_op 1000
De-optimized #1 for 1000=.\mod_ten_counter\mod_ten_counter_deop 1000
Optimized for 10000=.\mod_ten_counter\mod_ten_counter_op 10000
De-optimized #1 for 10000=.\mod_ten_counter\mod_ten_counter_deop 10000
Optimized for 100000=.\mod_ten_counter\mod_ten_counter_op 100000
De-optimized #1 for 100000=.\mod_ten_counter\mod_ten_counter_deop 100000
Optimized for 1000000=.\mod_ten_counter\mod_ten_counter_op 1000000
De-optimized #1 for 1000000=.\mod_ten_counter\mod_ten_counter_deop 1000000
Optimized for 10000000=.\mod_ten_counter\mod_ten_counter_op 10000000
De-optimized #1 for 10000000=.\mod_ten_counter\mod_ten_counter_deop 10000000
• Eliminating noisy results – The Version Tester
  • Therefore, using the Version Tester, we can iterate hundreds or
thousands of times in order to obtain a solid average number of cycles
• So, we believe our results fairly represent the CPUs in question
• You are going to see the various de-optimizations that we
implemented and the corresponding results
• These de-optimizations were tested using the Version Tester
and were executed while restricting the execution to a single core (CPU)
• …and something about the de-optimizations that were less than successful
• Branch Patterns
  • Remember: We wanted to challenge the CPU with branching patterns
that could force misses
  • This turned out to be very difficult to do
  • Random data caused a significant slowdown. But random data will
break any branch prediction mechanism
  • The branch prediction mechanism on the Opteron is very, very good
• Unpredictable Instructions – Recursion
  • Remember: Writing recursive functions that call other functions near
their return
  • This was supposed to overload the return address buffer and cause
mispredictions
  • It turned out to be very difficult to implement
  • We never really showed any performance degradation
  • So, don’t worry about this one
So, without further ado...
De-Optimization Results
Area: Instruction Scheduling
Dependency Chain
• Description
  • As we have seen in this class, data dependencies have an impact on ILP
  • Dynamic scheduling, as we saw, can eliminate WAW & WAR dependencies
  • However, at some point dynamic scheduling can be overwhelmed,
which affects performance, as we will see next
• The Opteron
  • The Opteron, like all other architectures, is highly affected by data hazards
  • The purpose of this de-optimization is to show the impact of a data
dependency chain on performance
• dependency_chain.exe
  • We implemented two versions of a program called ‘dependency_chain’
  • The program takes an array size as an argument
  • It then generates an array of the specified size in which each element is
populated with an integer x where 0 <= x <= 20
  • The array’s elements are summed, and the output is the number of
clock cycles taken by the program
• dependency_chain.exe
  • The optimized version adds the elements of the array by striding through the array in four-element chunks and adding elements to four different temporary variables
  • Then the four temporary variables are added together
  • The advantage is allowing four smaller dependency chains instead of one massive one
  • In the de-optimized version, however, every element of the array is summed into one variable
  • This creates a massive dependency chain, which quickly exhausts the resources of the dynamic scheduler
• dependency_chain.exe

Source – Optimized:
for ( i = 0; i < size_of_array; i += 4 ) {
    sum1 += test_array[i];
    sum2 += test_array[i + 1];
    sum3 += test_array[i + 2];
    sum4 += test_array[i + 3];
}
sum = sum1 + sum2 + sum3 + sum4;

Source – De-Optimized:
for ( i = 0; i < size_of_array; i++ ) {
    sum += test_array[i];
}
[Chart: dependency_chain – clock cycles vs. array size (10 to 10,000,000);
series: Optmzd/De-Optmzd on Opteron and Nehalem]
• dependency_chain.exe
  • The chart shows that not breaking up a dependency chain can be extraordinarily
costly. On the Opteron, it caused a slowdown of ~150% for most array sizes.
  • The scheduling resources of the Opteron become overwhelmed, essentially causing the
program to run sequentially, i.e. with no ILP
  • The Nehalem was impacted by this de-optimization too. Given the lesser impact, one can only
imagine that it has more scheduling resources
Difference between Optimized and De-Optimized Versions
Array Size | AMD Opteron: Difference* | Slowdown (%) | Intel Nehalem: Difference* | Slowdown (%)
10         | 43       | 19.03  | 3        | 2.5
100        | 841      | 120.31 | 256      | 57.66
1000       | 8857     | 160.02 | 2793     | 79.03
10000      | 87503    | 160.64 | 30073    | 86.46
100000     | 877172   | 155.3  | 272096   | 78.83
1000000    | 8633066  | 142.06 | 2937193  | 88.53
10000000   | 90226731 | 132.98 | 25436239 | 71.12
* In Clock Cycles
• Lessons
  • The code for the de-optimization is so natural that it is a little scary. It is
elegant and parsimonious
  • However, this elegance and parsimony may come at a very high cost
  • If you don’t get the performance that you expect from a program, then
it is definitely worth looking for these types of dependency chains
  • Break these chains up to give dynamic schedulers more scheduling options
High Instruction Latency
De-Optimization Results
Area: Instruction Fetching and Decoding
• Description
  • CPUs often have instructions that can perform almost the same operation
  • Yet, in spite of their seeming similarity, they have very different latencies. By choosing the high-latency version when the low-latency version would suffice, code can be de-optimized
• The Opteron
  • The LOOP instruction on the Opteron has a latency of 8 cycles, while
a test (like DEC) and jump (like JNZ) has a latency of less than 4 cycles
  • Therefore, substituting LOOP instructions for DEC/JNZ combinations will be a de-optimization
• fib.exe
  • We implemented a program called ‘fib’
  • It takes an array size as an argument
  • A Fibonacci number is calculated for each element in the array
• fib.exe
  • The Fibonacci numbers are calculated in assembly code
  • The optimized version uses DEC & JNZ instructions, which take up to 4 cycles
  • The de-optimized version uses the LOOP instruction, which takes 8 cycles
• fib.exe

Source – Optimized:
calculate:
    mov edx, eax
    add ebx, edx
    mov eax, ebx
    mov dword [edi], ebx
    add edi, 4
    dec ecx
    jnz calculate

Source – De-Optimized:
calculate:
    mov edx, eax
    add ebx, edx
    mov eax, ebx
    mov dword [edi], ebx
    add edi, 4
    loop calculate
• fib.exe

Compiled – Optimized:
08048481 <calculate>:
8048481: 89 c2                mov    %eax,%edx
8048483: 01 d3                add    %edx,%ebx
8048485: 89 d8                mov    %ebx,%eax
8048487: 89 1f                mov    %ebx,(%edi)
8048489: 81 c7 04 00 00 00    add    $0x4,%edi
804848f: 49                   dec    %ecx
8048490: 75 ef                jne    8048481 <calculate>

Compiled – De-Optimized:
08048481 <calculate>:
8048481: 89 c2                mov    %eax,%edx
8048483: 01 d3                add    %edx,%ebx
8048485: 89 d8                mov    %ebx,%eax
8048487: 89 1f                mov    %ebx,(%edi)
8048489: 81 c7 04 00 00 00    add    $0x4,%edi
804848f: e2 f0                loop   8048481 <calculate>
[Chart: De-Optimization (fib) – clock cycles vs. array size (10 to 10,000,000);
series: Optmzd/De-Optmzd on Opteron and Nehalem]
• fib.exe
  • In the chart we can see that the optimized version significantly
outperforms the de-optimized version. The results on the Nehalem are
even more impressive
Difference between Optimized and De-Optimized Versions
Array Size | AMD Opteron: Difference* | Slowdown (%) | Intel Nehalem: Difference* | Slowdown (%)
10         | 17       | 9.04  | 9        | 8.57
100        | 155      | 29.4  | 244      | 81.87
1000       | 1386     | 34.74 | 2239     | 104.52
10000      | 14588    | 22.03 | 19519    | 62.16
100000     | 123271   | 17.91 | 256839   | 87.42
1000000    | 1206678  | 16.94 | 2430301  | 81.51
10000000   | 11716747 | 14.07 | 19896396 | 65.26
* In Clock Cycles
• Lessons
  • As we have seen, instructions that are not chosen carefully can hurt
your program’s performance
  • It is important to know which instructions take more cycles, so that you
can avoid using them whenever possible
Costly Instructions
De-Optimization Results
Area: Instruction Type Usage
• Description
  • Some instructions can do the same job, but at a greater cost in terms of
the number of cycles
• The Opteron
  • Integer division on the Opteron costs 22-47 cycles for signed operands,
and 17-41 cycles for unsigned
  • Meanwhile, multiplication takes only 3-8 cycles for both signed and
unsigned operands
• mult_vs_div_deop_1.exe & mult_vs_div_op.exe
  • We implemented two programs, an optimized and a de-optimized version
  • They take an array size as an argument; the array is initialized randomly with powers of 2 (less than or equal to 2^12)
  • The de-optimized version divides each element by 2.0. The optimized version multiplies each element by 0.5.
  • The versions are functionally equivalent
• mult_vs_div_deop_1.exe & mult_vs_div_op.exe

Source – De-optimized:
for ( i = 0; i < size_of_array; i++ ) {
    test_array[i] = test_array[i] / 2.0;
}

Source – Optimized:
for ( i = 0; i < size_of_array; i++ ) {
    test_array[i] = test_array[i] * 0.5;
}
[Chart: De-Optimization (Costly_Instruction) – clock cycles vs. array size (10 to 10,000,000);
series: Optmzd/De-Optmzd on Opteron and Nehalem]
• mult_vs_div_deop_1.exe & mult_vs_div_op.exe
  • Looking at the chart, you can see this de-optimization has a huge
impact on the Opteron, averaging around 23%. It still has an effect on the
Nehalem, even if it is not as big as on the Opteron
Difference between Optimized and De-Optimized Versions
Array Size | AMD Opteron: Difference* | Slowdown (%) | Intel Nehalem: Difference* | Slowdown (%)
10         | 99       | 18.1  | -42     | -14.34
100        | 949      | 24.73 | 57      | 2.04
1000       | 9754     | 26.49 | 1630    | 5.96
10000      | 94658    | 25.67 | 16364   | 5.07
100000     | 938712   | 25.27 | 117817  | 4.54
1000000    | 9619257  | 23.94 | 675940  | 2.64
10000000   | 96477334 | 23.98 | 8585454 | 3.41
* In Clock Cycles
• Lessons
  • Small changes in your code can have a real impact on performance
  • It is very important to know the differences between instructions in
terms of cost
  • Seek out the cheaper instruction when possible
De-Optimization Results
Area: Instruction Type Usage
Costly Instructions
• Description
  • Some instructions can do the same job, but at a greater cost in terms of
the number of cycles

Example:
float f1, f2;
if (f1 < f2) ...

This is a common pattern for programmers, and it can be considered a de-
optimization technique
• The Opteron
  • Branches based on floating-point comparisons are often slow
• Compare_two_floats.exe
  • We implemented a program called ‘Compare_two_floats’
  • It takes a number of iterations as an argument
  • The program performs comparisons between two floating-point numbers
• Compare_two_floats_deop.exe & Compare_two_floats_op.exe
  • In the de-optimized version, we compare the two floats in the usual way,
as we will see in the next slide
  • In the optimized version, however, we reinterpret the float as an integer
and use that as the condition instead
  • The condition was deliberately written so that it is not taken all the time
• Compare_two_floats_deop.exe & Compare_two_floats_op.exe

De-Optimized:
for (i = 0; i < numberof_iteration; i++)
{
    if (f1 <= f2)
    {
        Count_numbers(i);
        count++;
    }
    else
        count++;
}

Optimized:
for (j = 0; j < numberof_iteration; j++)
{
    if (FLOAT2INTCAST(t) <= 0)
    {
        Count_numbers(j);
        count++;
    }
    else
        count++;
}
• Compare_two_floats.exe

[Chart: De-Optimization (Compare_Two_floats) – clock cycles vs. array size (10 to 10,000,000);
series: Optmzd/De-Optmzd on Opteron and Nehalem]
• Compare_two_floats.exe
  • The chart shows a small impact on the Opteron; however, the results
on the Nehalem were surprising, even though the de-optimization was
designed basically for the Opteron
Difference between Optimized and De-Optimized Versions
Array Size | AMD Opteron: Difference* | Slowdown (%) | Intel Nehalem: Difference* | Slowdown (%)
10         | -10991  | -97.19 | 43       | 26.70
100        | -10960  | -86.18 | 471      | 49.68
1000       | -9892   | -37.97 | 5285     | 59.90
10000      | -982    | -0.60  | 52211    | 58.82
100000     | 88962   | 5.86   | 413932   | 41.79
1000000    | 987559  | 6.56   | 4298502  | 44.46
10000000   | 9949917 | 6.62   | 41906333 | 46.14
* In Clock Cycles
• Lessons
  • Float comparisons are usually more expensive than integer comparisons
in terms of the number of cycles
  • Even though the Opteron passed this test, that does not mean your
computer will do the same!!
  • Float comparisons still have a big impact on the Nehalem
  • Again, great care must be taken when a program performs many
float comparisons
De-Optimization Results
Area: Instruction Scheduling
Loop Re-rolling
• Description
  • Loops not only affect branch prediction. They can also affect dynamic
scheduling. How?
  • Suppose instructions 1 and 2 sit within loops A and B, respectively.
1 and 2 could be part of a unified loop. If they were, then they could
be scheduled together. Yet, they are separate and cannot be
• The Opteron
  • Given that the Opteron is 3-way superscalar, this de-optimization could
significantly reduce IPC
  • The de-optimization would be two consecutive loops, each containing one or more
instructions, such that the loops could be combined
• Loop_re_rolling_deop.exe & loop_re_rolling_op.exe
  • We implemented two programs: optimized and de-optimized versions
  • They take an array size as an argument, and initialize the array randomly
  • The square and cube are calculated for each element in the array
  • In the de-optimized version, the square and cube calculations are in two consecutive loops. They are combined into the same loop in the optimized version.
  • Both versions are functionally equivalent
• Loop_re_rolling_deop.exe & loop_re_rolling_op.exe
  • We want to show whether removing some of the flexibility available to the dynamic scheduler affects the number of cycles or not
  • The de-optimized instructions are not expected to be scheduled at the same time. The de-optimization should prevent this
• Loop_re_rolling_deop.exe & loop_re_rolling_op.exe

Source – De-optimized:
for ( i = 0; i < size_of_array; i++ ) {
    quadratic_array[i] = load_store_array[i] * load_store_array[i];
}
for ( i = 0; i < size_of_array; i++ ) {
    cubic_array[i] = load_store_array[i] * load_store_array[i] * load_store_array[i];
}

Source – Optimized:
for ( i = 0; i < size_of_array; i++ ) {
    quadratic_array[i] = load_store_array[i] * load_store_array[i];
    cubic_array[i] = load_store_array[i] * load_store_array[i] * load_store_array[i];
}
[Chart: De-Optimization (Loop_re_Rolling) – clock cycles vs. array size (10 to 100,000);
series: Optmzd/De-Optmzd on Opteron and Nehalem]
• Loop_re_rolling_deop.exe & loop_re_rolling_op.exe
  • The slowdown is definitely large for the Opteron: almost 50% on average
  • It is large for the Nehalem as well
  • These results show the difference that dynamic scheduling flexibility can make
Difference between Optimized and De-Optimized Versions
Array Size | AMD Opteron: Difference* | Slowdown (%) | Intel Nehalem: Difference* | Slowdown (%)
10         | 189     | 43.95 | 83     | 28.32
100        | 1653    | 53.68 | 357    | 15.35
1000       | 21311   | 54.96 | 4696   | 21.28
10000      | 234336  | 52.42 | 39311  | 17.10
100000     | 2352328 | 50.93 | 134786 | 6.05
* In Clock Cycles
• Lessons
  • Dynamic scheduling is absolutely important
  • Loops should be used carefully, and re-rolled (combined) when possible
  • Instructions that do not depend on each other (no true dependency)
allow better dynamic scheduling, which enhances performance,
especially when they are repeated frequently
Store-to-load dependency
De-Optimization Results
Area: Instruction Type Usage
• Description
  • A store-to-load dependency takes place when stored data needs to be used shortly afterwards
  • This type of dependency increases the pressure on the load/store unit and might cause the CPU to stall, especially when the dependency occurs frequently
  • It arises whenever an instruction loads data that was stored shortly before
� The Opteron
• dependecy_deop.exe & dependency_op.exe
  • We implemented two versions of a dependency program, one
optimized and the other de-optimized
  • They take an array size as an argument; the array is initialized randomly
  • Both versions perform a prefix sum over the array. Thus, the final
array element will contain the sum of itself and all previous elements
within the array
• dependecy_deop.exe & dependency_op.exe
  • In the de-optimized version, we store and then shortly afterwards load the
same elements of the array
  • In the optimized version, however, we used temporary variables to avoid
this type of dependency
  • The optimized code has more instructions, and one would normally expect
the extra instructions to increase the cycle count relative to the version
with fewer instructions
• dependecy_deop.exe & dependency_op.exe

Source – De-optimized:
for ( i = 1; i < size_of_array; i++ ) {
    test_array[i] = test_array[i] + test_array[i - 1];
}

Source – Optimized:
for ( i = 3; i < size_of_array; i += 3 )
{
    temp2 = test_array[i - 2] + temp_prev;
    temp1 = test_array[i - 1] + temp2;
    test_array[i - 2] = temp2;
    test_array[i - 1] = temp1;
    test_array[i] = temp_prev = test_array[i] + temp1;
}
[Chart: store-to-load dependency (titled "Costly_Instruction" on the original slide) –
clock cycles vs. array size (10 to 10,000,000); series: Optmzd/De-Optmzd on Opteron and Nehalem]
• dependecy_deop.exe & dependency_op.exe
  • In the chart we can see that this de-optimization technique causes an
average slowdown of around 60% on the Opteron, which is a huge
difference. The Nehalem has been affected by this code as well
Difference between Optimized and De-Optimized Versions
Array Size | AMD Opteron: Difference* | Slowdown (%) | Intel Nehalem: Difference* | Slowdown (%)
10         | 87       | 37.5  | 12       | 8.16
100        | 2392     | 77.98 | 230      | 21.9
1000       | 9858     | 78.41 | 1610     | 15.38
10000      | 79180    | 63.89 | 17517    | 17.84
100000     | 786708   | 63.1  | 176893   | 18.7
1000000    | 8181981  | 64.12 | 1205000  | 12.12
10000000   | 81498679 | 58.83 | 17276216 | 18.06
* In Clock Cycles
• Lessons
  • Store-to-load dependencies are something you should be aware of
  • Writing more instructions does not always mean your program will run slower
  • Avoiding this common store-load dependency pattern can have a good
impact on your program
De-Optimization Results
Area: Instruction type usage
Costly Behavior
• Description
  • Conditional statements are an active player in almost everyone’s code.
Would you believe that a minor change could have a real effect on your
program?
  • The order in which the conditions are checked is SO IMPORTANT
  • Most architectures check the conditions in the same order, regardless of
the programming language you use and the platform
• IF_Condition.exe
  • We implemented two versions of a program called ‘IF_Condition’
  • It takes a number of iterations as an argument and initializes an array
randomly with floats between 0.5 and 11.0
  • For each element in the array, we add one to a dummy variable if its
index is equal to 0 (mod 2) and its value is greater than 1.5
• IF_deop.exe & IF_op.exe
  • The if statement will hold true only if both conditions are true
  • In the de-optimized version we put the condition that has more chance
of being false second, while we put it first in the optimized version
• IF_deop.exe & IF_op.exe

Optimized:
for ( i = 0; i < size_of_array; i++ ) {
    mod = ( i % 2 );
    if ( test_array[i] > 1.5 && mod == 0 )
        dummy++;
    else
        dummy--;
}

De-Optimized:
for ( i = 0; i < size_of_array; i++ ) {
    mod = ( i % 2 );
    if ( mod == 0 && test_array[i] > 1.5 )
        dummy++;
    else
        dummy--;
}
[Chart: De-Optimization (If_Condition) – clock cycles vs. array size (10 to 10,000,000);
series: Optmzd/De-Optmzd on Opteron and Nehalem]
• IF_Condition.exe
  • The chart shows that the optimized version outperforms the de-optimized
version on both the Opteron and the Nehalem
Difference between Optimized and De-Optimized Versions
Array Size | AMD Opteron: Difference* | Slowdown (%) | Intel Nehalem: Difference* | Slowdown (%)
10         | -207     | -26.90 | 57       | 23.85
100        | 338      | 15.48  | 990      | 56.99
1000       | 6438     | 38.27  | 4886     | 33.86
10000      | 63975    | 38.09  | 47374    | 37.87
100000     | 651438   | 38.94  | 520198   | 50.64
1000000    | 6385986  | 37.37  | 4577710  | 41.86
10000000   | 66023847 | 35.89  | 49997484 | 47.52
* In Clock Cycles
• Lessons
  • Conditional statements can have a negative impact if their ordering is ignored
  • One case (&&) was implemented; the other cases would be equivalent
in terms of increasing the number of cycles
  • If it is possible to predict which condition is more likely to be true or false,
then putting that condition in the right position will save some cycles
De-Optimization Results
Area: Branch Prediction
Branch Density
• Description
  • This de-optimization attempts to overwhelm the CPU’s ability to
predict branches by packing branches as tightly as possible
  • Whether or not a bubble is created is dependent upon the hardware
  • However, at some point, the hardware can only predict so much
and pre-load so much code
• The Opteron
  • The Opteron's BTB (Branch Target Buffer) can only maintain 3
(used) branch entries per (aligned) 16 bytes of code [AMD05]
  • Thus, the Opteron cannot successfully maintain predictions for all
of the branches within the following sequence of instructions
401399: 8b 44 24 10    mov    0x10(%esp),%eax
40139d: 48             dec    %eax
40139e: 74 7a          je     40141a <_mod_ten_counter+0x8a>
4013a0: 8b 0f          mov    (%edi),%ecx
4013a2: 74 1b          je     4013bf <_mod_ten_counter+0x2f>
4013a4: 49             dec    %ecx
4013a5: 74 1f          je     4013c6 <_mod_ten_counter+0x36>
4013a7: 49             dec    %ecx
4013a8: 74 25          je     4013cf <_mod_ten_counter+0x3f>
4013aa: 49             dec    %ecx
4013ab: 74 2b          je     4013d8 <_mod_ten_counter+0x48>
4013ad: 49             dec    %ecx
4013ae: 74 31          je     4013e1 <_mod_ten_counter+0x51>
4013b0: 49             dec    %ecx
4013b1: 74 37          je     4013ea <_mod_ten_counter+0x5a>
4013b3: 49             dec    %ecx
4013b4: 74 3d          je     4013f3 <_mod_ten_counter+0x63>
4013b6: 49             dec    %ecx
4013b7: 74 43          je     4013fc <_mod_ten_counter+0x6c>
4013b9: 49             dec    %ecx
4013ba: 74 49          je     401405 <_mod_ten_counter+0x75>
4013bc: 49             dec    %ecx
4013bd: 74 4f          je     40140e <_mod_ten_counter+0x7e>
• mod_ten_counter.exe
  • We implemented a program called ‘mod_ten_counter’
  • It takes an array size as an argument
  • The array is populated with a repeating pattern of consecutive integers
from zero to nine
    • Like: 012345678901234567890123456789…
    • In other words, the contents are not random
  • Very simply, it counts the number of times that each integer (0 – 9)
appears within the array
• mod_ten_counter.exe
  • The optimized version maintains proper spacing between branch instructions
  • The de-optimized version (seen on the previous slide) has densely
packed branches
  • Notes:
    • The spacing for the optimized version is achieved with NOP instructions
    • It has one extra NOP per branch, so it executes roughly 5 more instructions
per iteration than the de-optimized version
    • Thus, if the optimized version outperforms the de-optimized version,
then the difference will be even more impressive
• mod_ten_counter.exe

Source – Optimized:
    cmp ecx, 0
    je mark_0      ; We have a 0
    nop
    dec ecx
    je mark_1      ; We have a 1
    nop
    dec ecx
    je mark_2      ; We have a 2
    nop
    dec ecx
    je mark_3
    ...

Source – De-Optimized:
    cmp ecx, 0
    je mark_0      ; We have a 0
    dec ecx
    je mark_1      ; We have a 1
    dec ecx
    je mark_2      ; We have a 2
    dec ecx
    je mark_3
    ...
• mod_ten_counter.exe

[Chart: De-optimization: Branch Density (mod_ten_counter) – clock cycles vs. array size (10 to 10,000,000);
series: Optmzd/De-Optmzd on Opteron and Nehalem]
• mod_ten_counter.exe
  • As you can see from the chart, in spite of its handicap, the
optimized version significantly outperforms the de-optimized version
  • Interestingly, this de-optimization is more impressive on the Intel, even
though it was designed with the Opteron in mind
Difference between Optimized and De-Optimized Versions
Array Size | AMD Opteron: Difference* | Slowdown (%) | Intel Nehalem: Difference* | Slowdown (%)
10         | -47      | -7.31 | 39       | 9.92
100        | 185      | 7.80  | 331      | 27.36
1000       | 2234     | 11.86 | 6355     | 89.42
10000      | 23230    | 12.69 | 67653    | 110.38
100000     | 203760   | 10.98 | 537362   | 84.27
1000000    | 1620652  | 8.40  | 5306766  | 89.63
10000000   | 17263048 | 8.34  | 52082971 | 78.85
* In Clock Cycles
• So, what’s up with the Nehalem?
  • The Nehalem performs well generally, but is very susceptible to this de-optimization. Why?
  • There isn’t great information on this facet of the Nehalem
  • But…
    • The Nehalem can handle 4 active branches per 16 bytes
    • The misprediction penalty is ~17 cycles, so the Nehalem has a long pipeline
    • Therefore, missing the BTB is probably very costly as well
• Lessons
  • Branch density can adversely affect performance and make otherwise
predictable branches unpredictable
  • Great care must be taken when designing branches, if-then-else
structures and case-switch structures
De-Optimization Results
Area: Branch Prediction
Unpredictable Instructions
• Description
  • Some CPUs restrict branch instructions to one within a certain number of bytes
  • If this limit is exceeded, or if branch instructions are not aligned properly, then
branches cannot be predicted
• The Opteron
  • The return (RET) instruction may take up only one byte
  • If a branch instruction immediately precedes a one-byte RET instruction, then the RET cannot be predicted
  • A one-byte RET instruction can cause a misprediction even if there is only one branch instruction per 16 bytes
  • Alignment: 9 branch indicators are associated with byte addresses 0, 1, 3, 5, 7, 9, 11, 13, & 15 within each 16-byte segment
� factorial_over_array.exe
� We implemented a program called ‘factorial_over_array’
� It takes an array size as an argument
� The array is populated with random integers between 1 and 12
e.g. { 3, 7, 4, 10, 9, 1, 5, 2, 12 }
� Factorial is calculated for each element in the array
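The steps above can be sketched as a minimal all-C harness. The project's real factorial is written in NASM with a C wrapper; here, as an assumption, the factorial is plain C, `factorial_over_array` is a hypothetical name for the driver, and GCC's `__rdtsc()` intrinsic stands in for the project's cycle-measurement code:

```c
/* Minimal sketch of the factorial_over_array harness (assumptions noted
 * in the text above; not the project's original source). */
#include <stdlib.h>
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc() on GCC/Clang, x86 only */

static uint64_t factorial(uint64_t n) {
    return n <= 1 ? 1 : n * factorial(n - 1);
}

/* Populate an n-element array with random integers 1..12, compute the
 * factorial of every element, and return the elapsed clock cycles. */
uint64_t factorial_over_array(size_t n) {
    int *a = malloc(n * sizeof *a);
    if (!a) return 0;
    for (size_t i = 0; i < n; i++)
        a[i] = rand() % 12 + 1;          /* random integers 1..12 */

    uint64_t start = __rdtsc();          /* read time-stamp counter */
    volatile uint64_t sink = 0;          /* keep the work from being optimized out */
    for (size_t i = 0; i < n; i++)
        sink += factorial((uint64_t)a[i]);
    uint64_t cycles = __rdtsc() - start;

    free(a);
    return cycles;
}
```

Because 12! still fits in 64 bits, capping the random inputs at 12 keeps the arithmetic overflow-free while exercising the recursive call and its RET on every element.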
� factorial_over_array.exe
� Factorial is calculated in assembly code
� In the optimized version, the RET instruction is aligned using a NOP so
that it is not immediately adjacent to another branch and so that it falls
at an odd byte address within the 16-byte segment
� In the de-optimized version, the RET instruction is placed immediately
after a branch instruction, so that it falls at an even byte address
within the 16-byte segment
� factorial_over_array.exe
Source

Optimized:

global _factorial
section .text
_factorial:
        nop
        mov     eax, [esp+4]
        cmp     eax, 1
        jne     calculate
        nop
        ret
calculate:
        dec     eax
        push    eax
        call    _factorial
        add     esp, 4
        imul    eax, [esp+4]
        ret

De-Optimized:

global _factorial
section .text
_factorial:
        nop
        mov     eax, [esp+4]
        cmp     eax, 1
        jne     calculate
        ret
calculate:
        dec     eax
        push    eax
        call    _factorial
        add     esp, 4
        imul    eax, [esp+4]
        ret
� factorial_over_array.exe

Compiled

Optimized:

 0: 90               nop
 1: 8b 44 24 04      mov    0x4(%esp),%eax
 5: 83 f8 01         cmp    $0x1,%eax
 8: 75 02            jne    c <_factorial+0xc>
 a: 90               nop
 b: c3               ret
 c: 48               dec    %eax
 d: 50               push   %eax
 e: e8 ed ff ff ff   call   0 <_factorial>
13: 83 c4 04         add    $0x4,%esp
16: 0f af 44 24 04   imul   0x4(%esp),%eax
1b: c3               ret

De-Optimized:

 0: 90               nop
 1: 8b 44 24 04      mov    0x4(%esp),%eax
 5: 83 f8 01         cmp    $0x1,%eax
 8: 75 01            jne    b <_factorial+0xb>
 a: c3               ret
 b: 48               dec    %eax
 c: 50               push   %eax
 d: e8 ee ff ff ff   call   0 <_factorial>
12: 83 c4 04         add    $0x4,%esp
15: 0f af 44 24 04   imul   0x4(%esp),%eax
1a: c3               ret
� factorial_over_array.exe
[Chart: "De-optimization: Unpredictable instructions (factorial_over_array)". Clock Cycles vs. Array Size (10 to 10,000,000). Series: Optmzd, Opteron; De-Optmzd, Opteron; Optmzd, Nehalem; De-Optmzd, Nehalem]
� factorial_over_array.exe
� As you can see from the chart above, the optimized version significantly
outperforms the de-optimized version
� Interestingly, this de-optimization has an inconclusive effect on the
Nehalem
Difference between Optimized and De-Optimized Versions (in clock cycles)

Array Size   AMD Opteron:             Intel Nehalem:
             Difference  Slowdown(%)  Difference  Slowdown(%)
10                   80         7.07          21         2.58
100                1510         0.88        2436        32.15
1000              10429        11.61        -990        -1.24
10000            115975        12.76       36158         5.33
100000          1139748        12.59      238852         3.54
1000000        11529624        12.66    -3505949        -5.42
10000000      175191774       18.90     3081557         0.51
� Lessons
� Alignment is one of many ways that instructions can become
unpredictable
� The resulting constant mispredictions can be very costly
� Again, great care must be taken: brevity, at times, can create
inefficiencies
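The talk aligned the RET by hand-placing NOPs in NASM; as an aside (our assumption, not the project's method), GCC also exposes alignment control directly from C:

```c
/* Hypothetical illustration (not the project's code): asking GCC to
 * place a function on a 16-byte boundary, so its first bytes (and any
 * early RET) land at a known offset within the 16-byte fetch window.
 * The same effect can be requested globally with -falign-functions=16. */
__attribute__((aligned(16)))
int is_factorial_base_case(int n) {
    return n <= 1;   /* tiny function whose RET sits near its entry */
}
```

This delegates the byte-counting to the compiler, rather than recomputing NOP padding by hand every time the surrounding code changes.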
� We’ve shown you lots of de-optimizations
� Most of them were successful
� So, now, you know some of the costs associated with ignoring CPU architecture when writing code
� If you are like us, then you are probably reconsidering how you write software
� As you’ve seen, some of the simple habits you’ve accumulated may be causing your code to run more slowly than it otherwise would
[AMD05] AMD64 Technology. Software Optimization Guide for AMD64 Processors, 2005
[AMD11a] AMD64 Technology. AMD64 Architecture Programmer’s Manual, Volume 1: Application
Programming. 2011
[AMD11b] AMD64 Technology. AMD64 Architecture Programmer’s Manual, Volume 2: System
Programming. 2011
[AMD11c] AMD64 Technology. AMD64 Architecture Programmer’s Manual, Volume 3: General-
Purpose and System Instructions. 2011