
Chapter 21

IA-64 Architecture

(Think Intel Itanium)

also known as

(EPIC – Explicitly Parallel Instruction Computing)

Superpipelined & Superscalar Machines

Superpipelined machine:

• Superpipelined machines overlap pipe stages

— Relies on a stage being able to begin its operation before the previous one is complete.

Superscalar machine:

• A superscalar machine employs multiple independent pipelines to execute multiple independent instructions in parallel.

— Particularly common instructions (arithmetic, load/store, conditional branch) can be executed independently.

Why A New Architecture Direction?

Processor designers' obvious choices for using the increasing number of transistors on a chip and the extra speed:

• Bigger caches: diminishing returns

• Increase the degree of superscaling by adding more execution units: hits a complexity wall (more logic, need for improved branch prediction, more renaming registers, more complicated dependencies)

• Multiple processors: a challenge to use them effectively in general computing

• Longer pipelines: greater penalty for misprediction

IA-64 : Background

• Explicitly Parallel Instruction Computing (EPIC), jointly developed by Intel & Hewlett-Packard (HP)

• New 64-bit architecture

— Not an extension of the x86 series

— Not an adaptation of HP's 64-bit RISC architecture

• To exploit increasing chip transistors and increasing speeds

• Utilizes systematic parallelism

• Departure from superscalar trend

Note: Became the architecture of the Intel Itanium

Basic Concepts for IA-64

• Instruction level parallelism

— EXPLICIT in machine instruction, rather than determined at run time by processor

• Long or very long instruction words (LIW/VLIW)

— Fetch bigger chunks already “preprocessed”

• Predicated Execution

— Marking groups of instructions for a late decision on “execution”.

• Control Speculation

— Go ahead and fetch & decode instructions, but keep track of them so the decision to “issue” them, or not, can be practically made later

• Data Speculation (or Speculative Loading)

— Go ahead and load data early so it is ready when needed, and have a practical way to recover if speculation proved wrong

• Software Pipelining

— Multiple iterations of a loop can be executed in parallel

• “Revolvable” Register Stack

— Stack Frames are programmable and used to reduce unnecessary movement of data on procedure calls

Predication

Speculative Loading

General Organization

IA-64 Key Hardware Features

• Large number of registers

— IA-64 instruction format assumes 256 registers

– 128 * 64 bit integer, logical & general purpose

– 128 * 82 bit floating point and graphic

— 64 (1 bit) predicated execution registers (to support high degree of parallelism)

• Multiple execution units

— Probably 8 or more, pipelined

IA-64 Register Set

Predicate Registers

• Used as a flag for instructions that may or may not be executed.

• A set of instructions is assigned a predicate register when it is uncertain whether the instruction sequence will actually be executed (think branch).

• Only instructions with a predicate value of true are executed.

• When it is known that the instructions are going to be executed, their predicate is set. All instructions with that predicate true can now be completed.

• Those instructions with predicate false are now candidates for cleanup.

Instruction Format

128 bit bundles

• Can fetch one or more bundles at a time

• Bundle holds three instructions plus template

• Instructions are usually 41 bit long

— Have associated predicated execution registers

• Template contains info on which instructions can be executed in parallel

— Not confined to single bundle

— e.g. a stream of 8 instructions may be executed in parallel

— Compiler will have re-ordered instructions to form contiguous bundles

— Can mix dependent and independent instructions in same bundle
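The arithmetic behind the bundle format (three 41-bit instruction slots plus a template filling 128 bits) can be sketched in Python. This is an illustrative sketch, not an encoder: the names `pack_bundle` and `unpack_bundle` are hypothetical, the template width of 5 bits is simply what remains of 128 bits after three 41-bit slots, and placing the template in the low-order bits followed by slots 0 to 2 is an assumption that the slides do not spell out.

```python
TEMPLATE_BITS = 5                      # 128 - 3 * 41 bits left over for the template
SLOT_BITS = 41                         # width of one instruction slot
SLOT_MASK = (1 << SLOT_BITS) - 1


def pack_bundle(template, slots):
    """Pack a template and three instruction slots into one 128-bit integer."""
    if not 0 <= template < (1 << TEMPLATE_BITS):
        raise ValueError("template must fit in 5 bits")
    if len(slots) != 3:
        raise ValueError("a bundle holds exactly three instruction slots")
    bundle = template                  # assumed layout: template in the low bits
    for index, slot in enumerate(slots):
        if not 0 <= slot <= SLOT_MASK:
            raise ValueError("each slot must fit in 41 bits")
        bundle |= slot << (TEMPLATE_BITS + index * SLOT_BITS)
    return bundle


def unpack_bundle(bundle):
    """Split a 128-bit bundle back into (template, [slot0, slot1, slot2])."""
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    slots = [
        (bundle >> (TEMPLATE_BITS + index * SLOT_BITS)) & SLOT_MASK
        for index in range(3)
    ]
    return template, slots
```

Packing the largest possible template and slot values yields an integer of exactly 128 bits, which is a quick way to check that the field widths add up.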

Instruction Format Diagram

IA-64 Execution Units

• I-Unit

— Integer arithmetic

— Shift and add

— Logical

— Compare

— Integer multimedia ops

• M-Unit

— Load and store

– Between register and memory

— Some integer ALU operations

• B-Unit

— Branch instructions

• F-Unit

— Floating point instructions

Relationship between Instruction Type & Execution Unit

Field Encoding & Instr Set Mapping

Note: A bar indicates a stop: instructions after the stop may depend on instructions before it.

Predication (review)

Speculative Loading (review)

Assembly Language Format

[qp] mnemonic [.comp] dest = srcs ;; //

• qp - qualifying predicate register

– 1 at execution time: execute and commit result to hardware

– 0 at execution time: result is discarded

• mnemonic - name of instruction

• comp – one or more instruction completers used to qualify mnemonic

• dest – one or more destination operands

• srcs – one or more source operands

• ;; - instruction group stop

– A group is a sequence without hazards (read after write, write after write, ...)

• // - comment follows

Assembly Example

ld8 r1 = [r5] ;; //first group

add r3 = r1, r4 //second group

• Second instruction depends on value in r1

— Changed by first instruction

— Cannot be in same group for parallel execution

• Note ;; ends the group of instructions that can be executed in parallel

Register Dependency:

Assembly Example

ld8 r1 = [r5] //first group

sub r6 = r8, r9 ;; //first group

add r3 = r1, r4 //second group

st8 [r6] = r12 //second group

• Last instruction stores in the memory location whose address is in r6, which is established in the second instruction
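The grouping rule in the two examples above can be checked mechanically. The following Python sketch is a simplified model under stated assumptions: `parse` and `find_hazards` are hypothetical helper names, only the `[qp] mnemonic dest = srcs ;; //` line shape shown in these slides is handled, and the special cases that the real architecture permits inside a group are ignored.

```python
import re

REGISTER = re.compile(r"\b(?:r|p|f)\d+\b")   # r5, p1, f32, ...


def parse(line):
    """Return (registers read, registers written, ends_group) for one line."""
    code = line.split("//")[0].strip()        # drop the trailing comment
    ends_group = code.endswith(";;")
    if ends_group:
        code = code[:-2].strip()
    reads = set()
    predicate = re.match(r"\((p\d+)\)\s*", code)
    if predicate:                             # the qualifying predicate is read
        reads.add(predicate.group(1))
        code = code[predicate.end():]
    _mnemonic, _, operands = code.partition(" ")
    dest, _, srcs = operands.partition("=")
    reads |= set(REGISTER.findall(srcs))
    writes = set()
    for operand in dest.split(","):
        operand = operand.strip()
        if operand.startswith("["):           # store: the address register is read
            reads |= set(REGISTER.findall(operand))
        else:
            writes |= set(REGISTER.findall(operand))
    return reads, writes, ends_group


def find_hazards(lines):
    """List (line number, kind, register) for RAW/WAW hazards inside a group."""
    hazards = []
    written = set()                           # registers written in current group
    for number, line in enumerate(lines, start=1):
        reads, writes, ends_group = parse(line)
        hazards += [(number, "RAW", reg) for reg in sorted(reads & written)]
        hazards += [(number, "WAW", reg) for reg in sorted(writes & written)]
        written |= writes
        if ends_group:                        # a stop starts a fresh group
            written = set()
    return hazards
```

Run on the four-instruction example with its stop in place, the checker reports nothing; remove the `;;` and it flags the `add` (reads r1) and the `st8` (reads r6) as read-after-write hazards.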

Multiple Register Dependencies:

Assembly Example – Predicated Code

Consider the following program with branches:

Source Code

if (a && b)

    j = j + 1;

else

    if (c)

        k = k + 1;

    else

        k = k - 1;

i = i + 1;

Pentium Assembly Code

    cmp a, 0 ; compare with 0

    je L1 ; branch to L1 if a = 0

    cmp b, 0

    je L1

    add j, 1 ; j = j + 1

    jmp L3

L1: cmp c, 0

    je L2

    add k, 1 ; k = k + 1

    jmp L3

L2: sub k, 1 ; k = k - 1

L3: add i, 1 ; i = i + 1



Example of Predication

IA-64 Code:

cmp.eq p1, p2 = 0, a ;;

(p2) cmp.eq p1, p3 = 0, b

(p3) add j = 1, j

(p1) cmp.ne p4, p5 = 0, c

(p4) add k = 1, k

(p5) add k = -1, k

add i = 1, i
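To see that the predicated IA-64 sequence computes the same result as the branching source, the two versions can be modelled side by side in Python. This is a sketch under assumptions: the function names are hypothetical, the instructions are executed one after another (stops and parallel issue are not modelled), and predicates that no executed compare has written are assumed to start out false. Each `if pN:` below stands for an instruction whose result is discarded when its predicate is 0, not for a branch.

```python
def branching(a, b, c, i, j, k):
    """The source-level if/else from the slide."""
    if a and b:
        j = j + 1
    elif c:
        k = k + 1
    else:
        k = k - 1
    i = i + 1
    return i, j, k


def predicated(a, b, c, i, j, k):
    """Instruction-by-instruction model of the IA-64 listing."""
    p1 = p2 = p3 = p4 = p5 = False         # assumption: predicates start cleared
    p1, p2 = (a == 0), (a != 0)            # cmp.eq p1, p2 = 0, a
    if p2:                                 # (p2) cmp.eq p1, p3 = 0, b
        p1, p3 = (b == 0), (b != 0)
    if p3:                                 # (p3) add j = 1, j
        j = j + 1
    if p1:                                 # (p1) cmp.ne p4, p5 = 0, c
        p4, p5 = (c != 0), (c == 0)
    if p4:                                 # (p4) add k = 1, k
        k = k + 1
    if p5:                                 # (p5) add k = -1, k
        k = k - 1
    i = i + 1                              # add i = 1, i (unpredicated)
    return i, j, k
```

Calling both functions for every combination of zero and non-zero `a`, `b`, and `c` is enough to cover all paths through the original branches.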

Data Speculation

• Load data from memory before needed

• What might go wrong?

— The load could be completed before another required memory access, or the loaded value could later be shown to be incorrect

— Need a subsequent check on the value

Assembly Example – Control Speculation (Speculative Load)

Consider the following code:

Original code

(p1) br some_label //cycle 0

ld8 r1 = [r5] ;; //cycle 0 (indirect memory op – 2 cycles)

add r2 = r1, r3 //cycle 2

Speculated code

ld8.s r1 = [r5] ;; //cycle -2

// other instructions

(p1) br some_label //cycle 0

chk.s r1, recovery //cycle 0

add r2 = r1, r3 //cycle 0
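The speculated sequence above hoists the load over the branch, so a fault on that load must not be raised unless the value is actually needed. A minimal Python sketch of that idea follows. It is a model under assumptions, not the hardware mechanism: `PageFault`, `load`, `speculative_load`, and `run_speculated` are hypothetical names, memory is a plain dictionary, and the deferred-exception flag (the NaT bit in IA-64 terminology) is represented by a boolean.

```python
class PageFault(Exception):
    """Stands in for any exception a load could raise."""


def load(memory, address):
    """Ordinary load (ld8): faults immediately on a bad address."""
    if address not in memory:
        raise PageFault(address)
    return memory[address]


def speculative_load(memory, address):
    """Speculative load (ld8.s): never faults, returns (value, deferred)."""
    try:
        return load(memory, address), False
    except PageFault:
        return 0, True                     # exception deferred, value is unusable


def run_speculated(memory, r5, r3, p1):
    """Model of the speculated code; returns r2, or None if the branch is taken."""
    r1, deferred = speculative_load(memory, r5)   # ld8.s r1 = [r5], hoisted
    # ... other instructions ...
    if p1:                                        # (p1) br some_label
        return None                               # speculated value is never used
    if deferred:                                  # chk.s r1, recovery
        r1 = load(memory, r5)                     # recovery: redo the load for real
    return r1 + r3                                # add r2 = r1, r3
```

With a bad address in `r5`, the model faults only when `p1` is false, which matches the behaviour of the original, unspeculated code.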


Assembly Example – Data Speculation

Consider the following code. What if r4 and r8 point to the same address?

Without Data Speculation

st8 [r4] = r12 //cycle 0

ld8 r6 = [r8] ;; //cycle 0 (indirect memory op – 2 cycles)

add r5 = r6, r7 ;; //cycle 2

st8 [r18] = r5 //cycle 3

With Data Speculation

ld8.a r6 = [r8] ;; //cycle -2, adv

// other instructions

st8 [r4] = r12 //cycle 0

ld8.c r6 = [r8] //cycle 0, check

add r5 = r6, r7 ;; //cycle 0

st8 [r18] = r5 //cycle 1

Note: The Advanced Load Address Table is checked for an entry for the advanced load. The entry should still be there; if another access had been made to that target address, the entry would have been removed.


Consider the following code with an additional data dependency:

Speculation

ld8.a r6 = [r8] ;; //cycle -2

// other instructions

st8 [r4] = r12 //cycle 0

ld8.c r6 = [r8] //cycle 0

add r5 = r6, r7 ;; //cycle 0

st8 [r18] = r5 //cycle 1

Speculation with data dependency

ld8.a r6 = [r8] ;; //cycle -3, adv ld

// other instructions

add r5 = r6, r7 //cycle -1, uses r6

// other instructions

st8 [r4] = r12 //cycle 0

chk.a r6, recover //cycle 0, check

back: //return pt

st8 [r18] = r5 //cycle 0

recover:

ld8 r6 = [r8] ;; //get r6 from [r8]

add r5 = r6, r7 ;; //re-execute

br back //jump back
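The advanced-load and check-load pair can be modelled with a small table keyed by destination register. The Python sketch below is a simplified illustration under assumptions: the `Machine` class, its method names, and `run_with_data_speculation` are hypothetical, memory is a dictionary, and the table holds one exact address per register, which is coarser than real hardware.

```python
class Machine:
    """Toy memory plus an Advanced Load Address Table (ALAT)."""

    def __init__(self, memory):
        self.memory = dict(memory)
        self.alat = {}                       # register name -> advanced-load address

    def load_advanced(self, reg, address):
        """ld8.a: load early and record the address in the table."""
        self.alat[reg] = address
        return self.memory[address]

    def store(self, address, value):
        """st8: write memory and drop any table entry for that address."""
        self.memory[address] = value
        self.alat = {r: a for r, a in self.alat.items() if a != address}

    def load_check(self, reg, address, current):
        """ld8.c: keep the early value if the entry survived, else reload."""
        if self.alat.get(reg) == address:
            return current
        return self.memory[address]


def run_with_data_speculation(memory, r4, r8, r12, r7):
    """Model of the speculated listing; returns the value computed into r5."""
    machine = Machine(memory)
    r6 = machine.load_advanced("r6", r8)     # ld8.a r6 = [r8]
    # ... other instructions ...
    machine.store(r4, r12)                   # st8 [r4] = r12
    r6 = machine.load_check("r6", r8, r6)    # ld8.c r6 = [r8]
    return r6 + r7                           # add r5 = r6, r7
```

When `r4` and `r8` differ, the early value is used unchanged; when they are equal, the store removes the table entry and the check load picks up the newly stored value, so both cases give the same answer as the unspeculated code.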

Software Pipelining

Consider a loop in which: y[i] = x[i] + c

L1: ld4 r4=[r5],4 ;; //cycle 0, load postinc 4

add r7=r4,r9 ;; //cycle 2, r9 holds c

st4 [r6]=r7,4 //cycle 3, store postinc 4

br.cloop L1 ;; //cycle 3

• Adds a constant to one vector and stores the result in another

• No opportunity for instruction level parallelism within one iteration

• Instructions in iteration x are all executed before iteration x+1 begins

IA-64 Register Set (recall)

Pipeline - Unrolled Loop, Pipeline Display

Original Loop

L1: ld4 r4=[r5],4 ;; //cycle 0, load postinc 4

add r7=r4,r9 ;; //cycle 2

st4 [r6]=r7,4 //cycle 3, store postinc 4

br.cloop L1 ;; //cycle 3

Unrolled loop

ld4 r32=[r5],4;; //cycle 0

ld4 r33=[r5],4;; //cycle 1

ld4 r34=[r5],4 //cycle 2

add r36=r32,r9;; //cycle 2

ld4 r35=[r5],4 //cycle 3

add r37=r33,r9 //cycle 3

st4 [r6]=r36,4;; //cycle 3

ld4 r36=[r5],4 //cycle 4

add r38=r34,r9 //cycle 4

st4 [r6]=r37,4;; //cycle 4

add r39=r35,r9 //cycle 5

st4 [r6]=r38,4;; //cycle 5

add r40=r36,r9 //cycle 6

st4 [r6]=r39,4;; //cycle 6

st4 [r6]=r40,4;; //cycle 7

Mechanism for “Unrolling” Loops

• Automatic Register Naming

— r32-r127, fr32-fr127, and pr16-pr63 are capable of rotation for automatic renaming of registers

• Predication of Loops

— each instruction in a given loop is predicated

— in the prolog, each cycle an additional instruction's predicate becomes true

— in the kernel, n instructions' predicates are true

— in the epilog, each cycle an additional predicate is made false

• Special Loop Termination Instructions

— the loop count and epilog count are used to determine when the loop is complete and the process stops
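The prolog, kernel, and epilog phases described above fall out of rotating a small set of stage predicates while counting down a loop count and an epilog count. The Python sketch below is a deliberately simplified model, not the exact semantics of the loop-termination branch: `pipeline_predicates` is a hypothetical name, one predicate is used per pipeline stage, and the starting values (loop count = iterations - 1, epilog count = number of stages) are assumptions of this model.

```python
def pipeline_predicates(iterations, stages):
    """Return the tuple of stage predicates seen on each pass through the loop."""
    loop_count = iterations - 1          # assumed initial loop count
    epilog_count = stages                # assumed initial epilog count
    predicates = [True] + [False] * (stages - 1)   # only the first stage is enabled
    history = []
    while True:
        history.append(tuple(predicates))
        if loop_count > 0:               # prolog/kernel: another iteration starts
            loop_count -= 1
            incoming = True
        elif epilog_count > 1:           # epilog: no new iterations, drain stages
            epilog_count -= 1
            incoming = False
        else:
            break                        # both counts exhausted: loop is complete
        # Rotation: each stage inherits the predicate of the stage before it.
        predicates = [incoming] + predicates[:-1]
    return history


if __name__ == "__main__":
    for cycle, preds in enumerate(pipeline_predicates(5, 3)):
        print(cycle, preds)
```

In this model every stage ends up enabled exactly `iterations` times, and the number of passes is `iterations + stages - 1`, which is where the saving over running each iteration to completion comes from.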

Unrolled Loop Example Observations

• Completes 5 iterations in 7 cycles

— Compared with 20 cycles in original code

• Assumes two memory ports

— Load and store can be done in parallel
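The cycle numbers in the unrolled loop follow a regular pattern: iteration i loads in cycle i, adds in cycle i + 2, and stores in cycle i + 3. The Python sketch below rebuilds that schedule and executes it cycle by cycle. It is an illustration under assumptions: `pipelined_schedule` and `run_pipelined` are hypothetical names, and values are tracked per iteration in dictionaries, so the register names r32 onward are abstracted away.

```python
def pipelined_schedule(iterations):
    """Map each cycle to the (operation, iteration) pairs issued in it."""
    schedule = {}
    for i in range(iterations):
        schedule.setdefault(i, []).append(("ld", i))        # load x[i]
        schedule.setdefault(i + 2, []).append(("add", i))   # load takes 2 cycles
        schedule.setdefault(i + 3, []).append(("st", i))    # store the sum
    return schedule


def run_pipelined(x, c):
    """Execute y[i] = x[i] + c by cycle; return (y, index of the last cycle)."""
    schedule = pipelined_schedule(len(x))
    loaded, summed = {}, {}
    y = [None] * len(x)
    for cycle in sorted(schedule):
        # Operations in one cycle belong to different iterations, so they
        # are independent and their order within the cycle does not matter.
        for operation, i in schedule[cycle]:
            if operation == "ld":
                loaded[i] = x[i]
            elif operation == "add":
                summed[i] = loaded[i] + c
            else:
                y[i] = summed[i]
    return y, max(schedule)
```

For five iterations the last store issues in cycle 7, against 4 cycles per iteration (20 in total) when each iteration runs to completion first. No cycle holds more than one load and one store, which is consistent with the two memory ports the slide assumes.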

IA-64 Register Stack

• The Register Stack mechanism avoids unnecessary movement of register data during procedure call and return (r32-r127 are used in a rotation)

— the numbers of local and pass/return registers are specifiable

— the “register renaming” allows locals to become hidden and pass/return registers to become local on a call, and changed back on a return

— if the stacking mechanism runs out of registers, the oldest stacked registers are moved to memory
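The renaming on call and return can be pictured as moving a base pointer over one physical register file, so that the caller's pass/return registers become the callee's first registers without any copying. The Python sketch below is a toy model under assumptions: the `RegisterStack` class and its method names are hypothetical, the saved frame state lives in a Python list, and spilling to memory is not modelled (running out of registers simply raises an error).

```python
class RegisterStack:
    """Toy model of stacked registers r32 and up, renamed on call and return."""

    def __init__(self, physical=96):
        self.physical = [0] * physical   # backing store for r32..r127
        self.base = 0                    # physical index currently named r32
        self.frame_size = 0              # locals + pass/return registers
        self.local_size = 0              # registers hidden from a callee
        self.saved = []                  # caller frames, restored on return

    def alloc(self, local_count, output_count):
        """Declare how many local and pass/return registers the frame has."""
        self.local_size = local_count
        self.frame_size = local_count + output_count

    def _index(self, reg):
        if not 32 <= reg < 32 + self.frame_size:
            raise IndexError("register is outside the current frame")
        index = self.base + reg - 32
        if index >= len(self.physical):
            raise OverflowError("out of registers: a spill to memory is needed")
        return index

    def read(self, reg):
        return self.physical[self._index(reg)]

    def write(self, reg, value):
        self.physical[self._index(reg)] = value

    def call(self):
        """Hide the caller's locals; its pass/return registers start at r32."""
        self.saved.append((self.base, self.frame_size, self.local_size))
        self.base += self.local_size
        self.frame_size -= self.local_size
        self.local_size = 0

    def ret(self):
        """Restore the caller's view of the registers."""
        self.base, self.frame_size, self.local_size = self.saved.pop()
```

For example, a caller with two locals and one pass/return register writes its argument to r34; after `call()` the callee reads the same physical register as r32, and whatever the callee leaves there is visible to the caller as r34 again after `ret()`.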

Basic Concepts for IA-64

• Instruction level parallelism

— EXPLICIT in machine instruction, rather than determined at run time by processor

• Long or very long instruction words (LIW/VLIW)

— Fetch bigger chunks already “preprocessed”

• Predicated Execution

— Marking groups of instructions for a late decision on “execution”.

• Control Speculation

— Go ahead and fetch & decode instructions, but keep track of them so the decision to “issue” them, or not, can be practically made later

• Data Speculation (or Speculative Loading)

— Go ahead and load data early so it is ready when needed, and have a practical way to recover if speculation proved wrong

• Software Pipelining

— Multiple iterations of a loop can be executed in parallel

• “Revolvable” Register Stack

— Stack Frames are programmable and used to reduce unnecessary movement of data on procedure calls
