Assembly Code Optimization Techniques for the AMD64 Athlon and Opteron Architectures
David Phillips, Robert Duckles
CSE 520 Spring 2007 Term Project Presentation


Page 1

Assembly Code Optimization Techniques for the AMD64 Athlon and Opteron Architectures

David Phillips
Robert Duckles

CSE 520 Spring 2007
Term Project Presentation

Page 2

Background

•In order to generate efficient code, modern compilers must consider the architecture for which they generate code.

•Therefore, the engineer who implements the compiler must be very familiar with the architecture.

•There is rarely “one way” to write code for one solution. However, some implementations may be able to take better advantage of the architecture’s features than others and achieve higher performance.

•Naturally, we always want the most “efficient” (fastest) solution.

Page 3

Methodology

•For our project, we researched the architecture of the AMD64 Athlon and Opteron processors.

•While gathering information on the AMD64 architecture, we selected a subset of relevant optimization techniques that should “in theory” yield better performance than other similar approaches.

•Using the Microsoft Macro Assembler (MASM), we implemented a series of 15 small sample programs, where each program isolates a single optimization technique (or the lack thereof).

Page 4

Methodology II

•After assembling the test programs, all were instrumented and profiled on a machine with a single-core AMD64 Athlon processor.

•We used AMD’s downloadable “Code Analyst” suite to profile each program’s behavior and collect results such as clock events, cache misses, dispatch stalls, and time to completion.

•The goal was to determine which optimization techniques yielded the best performance gain and validate our assumptions pertaining to the system architecture.

Page 5

Optimization Classes Examined

•Decode

•Memory Access

•Arithmetic

•Integer Scheduling

•Loop Instruction Overhead

•Loop Unrolling

•Function Inline

Page 6

Decode Optimization

- Decoding IA-32 instructions is complicated. Using a single complex instruction rather than multiple simpler instructions reduces the number of instructions that must be decoded into micro-operations.

Example:

    add edx, DWORD PTR [eax]

instead of

    mov ebx, DWORD PTR [eax]
    add edx, ebx

This optimization reduces:
- decoder usage.
- register pressure (allocated registers in the register file).
- data-dependent instructions in the pipeline (stalls).

Page 7

Memory Access Optimization

- The L1 cache is implemented as 8 separate banks. Each line in a bank is 8 bytes wide. If two consecutive load instructions try to read from different lines in the same bank, the second instruction has to wait for the first to finish. If no bank conflict occurs, both loads can complete during the same cycle.

Example:

    mov ebx, DWORD PTR [edi]
    mov edx, DWORD PTR [edi + TYPE intarray]
    imul ebx, ebx
    imul edx, edx

instead of

    mov ebx, DWORD PTR [edi]
    imul ebx, ebx
    mov edx, DWORD PTR [edi + TYPE intarray]
    imul edx, edx

Assuming that each array value is 4 bytes in size, a bank conflict cannot occur since both loads read from the same cache line.

Page 8

Arithmetic Optimization

- Dividing by 16-bit integers is faster than dividing by 32-bit integers.

Example:

    mov dx, 0
    mov ax, 65535
    mov bx, 5
    idiv bx

instead of

    mov edx, 0
    mov eax, 65535
    mov ebx, 5
    idiv ebx

This optimization reduces:
- time to perform the division.

Page 9

Integer Scheduling Optimization

- Pushing data onto the stack directly from memory is faster than loading the value into a register first and then pushing the register onto the stack.

Example:

    push DWORD PTR [edi]

instead of

    mov ebx, DWORD PTR [edi]
    push ebx

This optimization reduces:
- register pressure (allocated registers in the register file).
- data-dependent instructions in the pipeline (stalls).

Page 10

Integer Scheduling Optimization

- Two writes to different portions of the same register are slower than two writes to different registers.

Example:

    mov bl, 0012h   ; Load the constant into the low byte of EBX
    mov ah, 0000h   ; Load the constant into the high byte of AX
                    ; No false dependency on the completion of the
                    ; previous instruction, since BL and AH are in
                    ; different registers.

instead of

    mov bl, 0012h   ; Load the constant into the low byte of EBX
    mov bh, 0000h   ; Load the constant into the high byte of BX
                    ; This instruction has a false dependency on the
                    ; completion of the previous instruction, since
                    ; BL and BH share EBX.

This optimization reduces:
- dependent instructions in the pipeline.

Page 11

Loop Instruction Overhead Optimization

- The “LOOP” instruction has an 8-cycle latency. It is faster to use other instructions, such as a decrement followed by a conditional jump.

Example:

    mov ecx, LENGTHOF intarray
    dec ecx
    jnz L1

instead of

    mov ecx, LENGTHOF intarray
    loop L1

This optimization reduces:
- loop overhead latency.

Page 12

Loop Unrolling Optimization

- Unrolling the body of a loop reduces the total number of iterations that need to be performed, which eliminates a great deal of loop overhead and yields faster overall execution.

Example:

    lea ebx, [edi]    ; Compute the address of the next element
    push ebx          ; Push it onto the stack
    pop ebx           ; Pop it back off

    lea ebx, 2[edi]
    push ebx
    pop ebx

    lea ebx, 4[edi]
    push ebx
    pop ebx

    lea ebx, 6[edi]
    push ebx
    pop ebx

    lea ebx, 8[edi]
    push ebx
    pop ebx

instead of

    lea ebx, [edi]    ; Compute the address of the next element
    push ebx          ; Push it onto the stack
    pop ebx           ; Pop it back off

Page 13

Function Inline Optimization

- The body of a small function can replace the function call in order to reduce function-call overhead.

Example:

    mov edx, 0      ; Zero the upper half of the dividend
    mov eax, 65535  ; Load the dividend
    mov ebx, 5      ; Load the divisor
    idiv ebx        ; Perform the division

instead of

    call DoDivision ; Perform the division

Page 14

Code Analyst Runtime Performance Counter Results

Program                 CPU Clock  Retired Instr.  DCM     Dispatch      Instructions  Stalls Per   DCM Per      Total
                        Events     Events          Events  Stall Events  Per Cycle     Instruction  Instruction  Runtime (ms)
DIVIDE16                28232      50203           31      9238          0.177823038   0.036802582  3.08746E-05  13979
DIVIDE32                44186      50223           43      9838          0.113662699   0.039177269  4.28091E-05  21898
DIVIDE32FuncCall        45925      69730           41      15286         0.151834513   0.043843396  2.93991E-05  22858
LEA16                   4114       40153           9       5427          0.976008751   0.027031604  1.12071E-05  1774
LEA32                   4124       40151           11      5446          0.973593598   0.027127593  1.36983E-05  1861
IssueNoOp               256887     2504269         941     195018        0.974852367   0.015574844  1.87879E-05  124485
IssueWithOp             271579     2007048         315596  790831        0.739029159   0.07880539   0.007862194  131240
IssueWithNoLOOP         134073     3005102         1150    617630        2.241392376   0.041105427  1.91341E-05  62244
IssueWithLoopUnrolled   137075     1707411         696     1556030       1.245603502   0.182267773  2.03817E-05  64344
LoadExecuteNoOp         273144     2508666         632342  808336        0.918440822   0.064443493  0.012603152  130687
LoadExecuteWithOp       225221     2008493         631482  961432        0.891787622   0.095736654  0.015720294  107365
MemAccessNoOp           124539     1202488         253502  99917         0.965551353   0.016618378  0.010540729  59872
MemAccessWithOp         124605     1202935         253494  99018         0.96539866    0.016462735  0.01053648   59778
PartialRWNoOp           16297      119865          27      8379          0.735503467   0.013980728  1.12627E-05  7818
PartialRWWithOp         16343      120192          35      8357          0.735434131   0.013906084  1.456E-05    7969

Sampling configuration:
Retired Instructions Per Event: 100000
Data Cache Misses Per Event: 5000
CPU Clocks Per Event: 1000000
Dispatch Stalls Per Event: 20000
Collection Time (seconds): 10

Page 15

Decode Results Discussion

LoadExecuteNoOp vs LoadExecuteWithOp

Confirmed expectations.

Data cache misses were about the same for both programs.

The non-optimized program required 273144 / 225221 = 1.21x as many cycles as the optimized program.

The non-optimized program executed 2508666 / 2008493 = 1.25x as many instructions as the optimized program.

The optimized program had a 0.095736654 / 0.064443493 = 1.49x higher stalls-per-instruction ratio than the non-optimized program.

The non-optimized program took 130687 / 107365 = 1.22x as long to finish as the optimized program.

Even though the optimized program had a higher stall rate, the reduction in instructions and cycles produced a net performance gain.

Page 16

Memory Access Results Discussion

MemAccessNoOp vs MemAccessWithOp

Did not confirm expectations.

There was no real observable difference between these two programs.

Both executed roughly the same number of cycles in the same period of time, with similar stall and cache-miss occurrences.

We suspect that the same micro-operations are generated even for the optimized program.

LEA16 vs LEA32

Did not confirm expectations.

There was no real observable difference between these two programs.

Both executed roughly the same number of cycles in the same period of time, with similar stall and cache-miss occurrences.

Page 17

Arithmetic Results Discussion

DIVIDE32 vs DIVIDE16

Confirmed expectations.

Data cache misses were about the same for both programs.

Both programs executed roughly the same number of instructions.

However, the DIVIDE32 took 44186 / 28232 = 1.57x as many cycles as DIVIDE16 to finish.

DIVIDE16 finished 21898 / 13979 = 1.57x faster than DIVIDE32.

The instructions per cycle decreased by 0.177823038 / 0.113662699 = 1.56x for the DIVIDE32.

The stalls-per-instruction ratio increased slightly in the DIVIDE32 program: 0.039177269 / 0.036802582 = 1.06x.

As expected, the 32-bit division ran significantly slower than the 16-bit division.

Page 18

Integer Scheduling Results Discussion

IssueNoOp vs IssueWithOp

Did not confirm expectations.

The optimized program required 271579 / 256887 = 1.06x as many cycles as the non-optimized program.

The non-optimized program executed 2504269 / 2007048 = 1.25x as many instructions.

The optimized program had 315596 / 941 = 335x as many cache misses as the non-optimized program.

The optimized program achieved 0.739029159 instructions per cycle versus 0.974852367 for the non-optimized program, a 1.32x decrease.

It took the optimized program 131240 / 124485 = 1.05x as long to finish as the non-optimized program.

While the optimization did reduce the instruction count, the cache-miss rate increased so greatly that it erased any gains and performance actually became worse.

Page 19

Integer Scheduling Results Discussion

PartialRWNoOp vs PartialRWWithOp

Did not confirm expectations.

Both of these programs had nearly identical performance profiles.

We expected the number of stalls to decrease in the optimized program, since the false dependencies were eliminated.

However, both had about the same number of measured stalls, and both finished execution in about the same amount of time.

Page 20

Loop Instruction Overhead Results Discussion

IssueNoOp vs IssueWithNoLOOP

Confirmed expectations.

Data cache misses were about the same for both programs.

Replacing the LOOP instruction with dec/jnz had the following effects:
- Number of cycles reduced by 256887 / 134073 = 1.91x
- Instruction count increased by 3005102 / 2504269 = 1.20x
- Stalls per instruction increased by 0.041105427 / 0.015574844 = 2.64x
- Total runtime decreased by 124485 / 62244 = 2.0x

While the overall instruction count and stall count increased, the total number of cycles needed was reduced by almost half, which allowed a great performance gain.

Page 21

Loop Unrolling Results Discussion

IssueNoOp vs IssueWithLoopUnrolled

Confirmed expectations.

Unrolling the loop body 5 times had the following effects:

Number of cycles reduced by 256887 / 137075 = 1.87x

Number of instructions reduced by 2504269 / 1707411 = 1.47x

Cache misses were slightly lower: 941 / 696 = 1.35x fewer.

Instructions per cycle increased by 1.245603502 / 0.974852367 = 1.28x.

Stalls per instruction increased by 0.182267773 / 0.015574844 = 11.7x.

Total runtime decreased by 124485 / 64344 = 1.93x.

Even though more stalls were introduced by the optimization, the total number of required cyclesdecreased significantly. Much of the loop overhead was removed which is what allowed an overallnet performance increase.

Page 22

Function Inline Results Discussion

DIVIDE32 vs DIVIDE32FuncCall

Confirmed expectations.

Data cache misses were about the same for both programs.

The DIVIDE32FuncCall executed 69730 / 50223 = 1.39x as many instructions as the inlined DIVIDE32.

The instructions per cycle increased in the DIVIDE32FuncCall, but this is most likely misleading, as the function-call overhead simply introduced more instructions into the pipeline.

The inline DIVIDE32 program finished 22858 / 21898 = 1.043x faster than the function-call implementation.

The stalls-per-instruction ratio increased by 0.043843396 / 0.039177269 = 1.12x in the function-call implementation.

The function call implementation required 45925 / 44186 = 1.04x as many clock cycles as the inline version.

As expected, the added overhead of the function call made a noticeable impact on performance. Also note that we did not pass any parameters to the function. Had parameters been passed, it could be expected that the overhead would increase.

Page 23

Most Significant Performance Gains

1.) Loop Instruction Overhead (2x speedup)

2.) Loop Unrolling (1.93x speedup)

3.) Arithmetic (1.57x speedup)

4.) Decode (1.22x speedup)

5.) Function Inline (1.043x speedup)

6.) Memory Access (no speedup)

7.) Integer Scheduling (1.05x slowdown)

Page 24

Thank you.

Questions?