Instruction Level Parallelism (ILP) Limitations

Jose P. Pinilla. EECE528: Parallel and Reconfigurable Computing. Instruction-Level Parallelism Limitations.


DESCRIPTION

A presentation about ILP, its limitations, and its applications in today's architectures.

TRANSCRIPT

Page 1: Instruction Level Parallelism (ILP) Limitations

Jose P. Pinilla

EECE528: Parallel and Reconfigurable Computing

Instruction-Level Parallelism Limitations

Page 2: Instruction Level Parallelism (ILP) Limitations

CONTENT

I. ILP Background

II. Hardware Model

III. Study of Limitations

IV. Simultaneous Multithreading

V. ILP today

Page 3: Instruction Level Parallelism (ILP) Limitations

CONTENT

I. ILP Background

II. Hardware Model

III. Study of Limitations

IV. Simultaneous Multithreading

V. ILP today

Page 4: Instruction Level Parallelism (ILP) Limitations

I. ILP

• MIPS Example

– Hazards

• Structural

• Data

• Control

• Power5

• ILP Optimizations

– Register Renaming

– Branch/Jump Prediction

– Alias Analysis

Page 5: Instruction Level Parallelism (ILP) Limitations

I. ILP: MIPS

[Figure: classic five-stage MIPS pipeline diagram. A load and four following instructions (Instr 1-4) overlap across the instruction cache (I$), register read, ALU, data cache (D$), and register write-back stages over successive clock cycles.]

Page 6: Instruction Level Parallelism (ILP) Limitations

I. ILP: Hazards

1. Structural

2. Data

3. Control

Page 7: Instruction Level Parallelism (ILP) Limitations

I. ILP: Structural Hazards

Conflict over the use of resources

[Figure: pipeline diagram in which overlapping instructions compete for the same hardware resource in the same clock cycle.]

Page 8: Instruction Level Parallelism (ILP) Limitations

I. ILP: Structural Hazards

Conflict over the use of resources

[Figure: the same structural-hazard pipeline diagram as on the previous slide.]

Page 9: Instruction Level Parallelism (ILP) Limitations

I. ILP: Structural Hazards

[Figure: structural-hazard pipeline diagram; the register-file conflict is addressed by the solution below.]

Solution for register-file read/write conflicts: write and read in the same clock cycle, using separate read and write ports.

Page 10: Instruction Level Parallelism (ILP) Limitations

I. ILP: Data Hazards

[Figure: pipeline diagram of five overlapping instructions, setting up the register dependences shown in the code on the next slide.]

Page 11: Instruction Level Parallelism (ILP) Limitations

I. ILP: Data Hazards

add $t0, $t1, $t2   # produces $t0
sub $t4, $t0, $t3   # reads $t0 (RAW dependence on the add)
and $t5, $t0, $t6   # reads $t0
or  $t7, $t0, $t8   # reads $t0
xor $t9, $t0, $t2   # reads $t0

Page 12: Instruction Level Parallelism (ILP) Limitations

I. ILP: Data Hazards

Page 13: Instruction Level Parallelism (ILP) Limitations

I. ILP: Data Hazards

Forwarding

Page 14: Instruction Level Parallelism (ILP) Limitations

I. ILP: Data Hazards

Forwarding

Page 15: Instruction Level Parallelism (ILP) Limitations

I. ILP: Data Hazards

Hardware Interlock

Allows Forwarding
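The forwarding figures on these slides did not survive extraction. As a stand-in, here is a minimal C sketch of the classic forwarding-unit conditions for a five-stage MIPS pipeline; the struct fields and register numbers are illustrative assumptions, not taken from the slides.

/* Minimal sketch of MIPS-style forwarding-unit logic (hypothetical field
 * names; follows the usual EX/MEM and MEM/WB hazard conditions). */
#include <stdio.h>

enum fwd { FROM_REGFILE = 0, FROM_MEM_WB = 1, FROM_EX_MEM = 2 };

struct pipe_regs {
    int ex_mem_regwrite, ex_mem_rd;   /* instruction now in MEM */
    int mem_wb_regwrite, mem_wb_rd;   /* instruction now in WB  */
    int id_ex_rs, id_ex_rt;           /* operands needed in EX  */
};

static enum fwd forward_for(const struct pipe_regs *p, int src)
{
    /* EX hazard: the newest producer wins. */
    if (p->ex_mem_regwrite && p->ex_mem_rd != 0 && p->ex_mem_rd == src)
        return FROM_EX_MEM;
    /* MEM hazard: forward from WB only if EX/MEM does not already cover it. */
    if (p->mem_wb_regwrite && p->mem_wb_rd != 0 && p->mem_wb_rd == src)
        return FROM_MEM_WB;
    return FROM_REGFILE;
}

int main(void)
{
    /* add writing $t0 ($8) is in MEM; sub reading $t0 is in EX. */
    struct pipe_regs p = { 1, 8, 0, 0, 8, 11 };
    printf("ForwardA = %d (2 = take ALU result from EX/MEM)\n",
           forward_for(&p, p.id_ex_rs));
    return 0;
}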

Page 16: Instruction Level Parallelism (ILP) Limitations

I. ILP: Data Hazards

Stalling by Compiler

Page 17: Instruction Level Parallelism (ILP) Limitations

I. ILP: Data Hazards

Stalling by Compiler

The compiler could schedule a more useful instruction into that stall cycle; the hardware can also do this.

Page 18: Instruction Level Parallelism (ILP) Limitations

I. ILP: Data Hazards

Avoid Stalling

Page 19: Instruction Level Parallelism (ILP) Limitations

I. ILP: Control Hazards

Page 20: Instruction Level Parallelism (ILP) Limitations

I. ILP: Control Hazards

Solutions:

Add hardware to compute the branch outcome in stage 2 (DECODE).

Predict the branch: to simplify hardware, predict branches as NOT TAKEN and keep fetching sequentially; a loop-closing branch is then mispredicted on every iteration, whereas predicting backward branches as taken is wrong only at the loop exit, and that happens just once (see the sketch after this list).

Branch delay slot: insert an instruction after the branch that always gets executed; the slot is filled by the compiler (as in MIPS).
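As a rough illustration of the static-prediction trade-off above, here is a small C sketch that counts mispredictions of a loop-closing backward branch under two static policies; the trip count of 100 is an assumption for illustration, not a figure from the slides.

/* Mispredictions for one loop-closing backward branch under two
 * static policies (hypothetical trip count). */
#include <stdio.h>

int main(void)
{
    int n = 100;                 /* loop iterations (assumed)            */
    int taken = n - 1;           /* bottom branch is taken n-1 times ... */
    int fallthrough = 1;         /* ... and falls through once at exit   */

    int miss_not_taken = taken;        /* predict-not-taken: wrong whenever taken */
    int miss_backward_taken = fallthrough; /* backward-taken: wrong only at exit  */

    printf("predict not-taken : %d/%d mispredicted\n", miss_not_taken, n);
    printf("backward-taken    : %d/%d mispredicted\n", miss_backward_taken, n);
    return 0;
}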

Page 21: Instruction Level Parallelism (ILP) Limitations

I. ILP: Power5 Architecture

16 different stages

Page 22: Instruction Level Parallelism (ILP) Limitations

I. ILP: Optimizations

Instruction window: the trace of incoming instructions that is analyzed for execution.

Register Renaming: on false data dependences, the hardware can rename registers.

Compilers should also avoid these false dependences: R2R (register-to-register) memory model.

Branch Prediction:

Static: Always not taken, always taken. Forward/Backward taken. Branch delay slot.

Dynamic: One-level (1bit, 2bit...), Two-level and Multiple Component

Jump Prediction: Static profiling. Dynamic: Last taken, 2bit tables, return stack.

Alias Analysis: Indirect memory references. Instruction Inspection.

Page 23: Instruction Level Parallelism (ILP) Limitations

I. ILP: Branch Prediction

Saturating counter: increment on a taken branch, decrement on a not-taken branch; the counter saturates, so it never overflows or underflows (a code sketch follows below).

Branch correlation: inter/intra.

Two-level: remembers the history of the last n occurrences of the branch and uses one saturating counter for each of the 2^n possible history patterns.

Many more...
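A minimal C sketch of the 2-bit saturating counter described above: increment on taken, decrement on not taken, no overflow or underflow, predict taken when the counter is 2 or 3. The outcome stream and initial counter value are made-up examples; a two-level predictor would simply index a table of such counters with the recent branch history.

/* 2-bit saturating counter predictor for a single branch. */
#include <stdio.h>

static unsigned char update(unsigned char ctr, int taken)
{
    if (taken)  return ctr < 3 ? ctr + 1 : 3;   /* saturate at 3 */
    else        return ctr > 0 ? ctr - 1 : 0;   /* saturate at 0 */
}

static int predict(unsigned char ctr) { return ctr >= 2; }  /* taken if 2 or 3 */

int main(void)
{
    /* Hypothetical outcome stream for one branch: mostly taken. */
    int outcomes[] = { 1, 1, 1, 0, 1, 1, 0, 1, 1, 1 };
    int n = sizeof outcomes / sizeof outcomes[0];
    unsigned char ctr = 1;          /* start weakly not-taken */
    int hits = 0;

    for (int i = 0; i < n; i++) {
        hits += (predict(ctr) == outcomes[i]);
        ctr = update(ctr, outcomes[i]);
    }
    printf("correct predictions: %d/%d\n", hits, n);
    return 0;
}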

Page 24: Instruction Level Parallelism (ILP) Limitations

CONTENT

I. ILP Background

II. Hardware Model

III. Study of Limitations

IV. Simultaneous Multithreading

V. ILP today

Page 25: Instruction Level Parallelism (ILP) Limitations

II. HARDWARE MODEL

• Profiling Framework

– Assumptions

– Window Size

Page 26: Instruction Level Parallelism (ILP) Limitations

II. HW MODEL: Profiling Framework

A set of assumptions and a methodology to experimentally extract a parallelism profile from a set of benchmarks.

The program is executed completely, producing a trace of instructions.

The trace includes the data addresses referenced and the outcomes of branches and jumps.

(D. Wall's 1993 study) divides the trace into cycles of up to 64 instructions in flight.

The only limits on ILP in such a processor are those imposed by the actual data flows through registers or memory.

Page 27: Instruction Level Parallelism (ILP) Limitations

II. HW MODEL: Assumptions

• No limits on replicated functional units or ports to registers or memory.

• Register Renaming: Perfect, Infinite, Finite, None

• Branch Prediction: Perfect, Infinite, Finite, None

• Jump Prediction: Perfect, Infinite, Finite, None

• Memory Address Alias Analysis: Perfect, Inspection, None

• Perfect Caches

• Unit-cycle operation latencies

• Window size of 2K instructions

Page 28: Instruction Level Parallelism (ILP) Limitations

II. HW MODEL: Register Renaming

• Perfect: Infinite number of registers to avoid false register dependencies.

• Finite: Normally 256 integer registers and 256 floating point registers used in LRU (Least Recently Used) fashion.

• No renaming: Number of registers used in the code.
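To make the renaming models above concrete, here is a minimal C sketch of a rename map in which every architectural destination receives a fresh physical register, so only true (RAW) dependences remain. The 32/64 register counts and the simple free-list policy are assumptions for illustration; the study itself uses 256 registers managed in LRU fashion.

/* Register renaming sketch: fresh physical register per destination write. */
#include <stdio.h>

#define ARCH 32
#define PHYS 64

static int map[ARCH];        /* architectural -> physical           */
static int next_free = ARCH; /* physical regs ARCH..PHYS-1 are free */

static int rename_src(int a)  { return map[a]; }
static int rename_dst(int a)
{
    if (next_free < PHYS) map[a] = next_free++;
    /* else: out of physical registers, rename would stall (not modeled) */
    return map[a];
}

int main(void)
{
    for (int a = 0; a < ARCH; a++) map[a] = a;   /* identity mapping at start */

    /* add writes $t0, a later xor also writes $t0: the WAW hazard vanishes. */
    int d1 = rename_dst(8);                      /* first  write of $t0 */
    int d2 = rename_dst(8);                      /* second write of $t0 */
    printf("first  write of $t0 -> p%d\n", d1);
    printf("second write of $t0 -> p%d (no WAW conflict)\n", d2);
    printf("later read of $t0   -> p%d\n", rename_src(8));
    return 0;
}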

Page 29: Instruction Level Parallelism (ILP) Limitations

II. HW MODEL: Branch Prediction

• Perfect: All branches are correctly predicted.

• 2-bit predictor with finite tables: dynamic; one 2-bit counter per branch, indexed by the low-order bits of the branch address; incremented on taken and saturating (no overflow); the branch is predicted taken if the entry is 2 or 3; up to 512 2-bit entries.

• 2bit predictor with infinite tables: Infinite number of counters.

• Tournament-based branch predictor: two 2-bit predictors competing, plus a 2-bit selector that is incremented or decremented according to which predictor was correct (a selector sketch follows this list).

• Profile based: Static predictions.

• No prediction: Every branch is predicted wrong.

Not in order of performance
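A minimal C sketch of the 2-bit tournament selector described above: when the two component predictors disagree, the selector is nudged toward whichever one was correct. The component predictions in main are hypothetical.

/* Tournament chooser: a 2-bit selector picks between two component
 * predictions and is trained only when they disagree. */
#include <stdio.h>

static unsigned char sel = 2;   /* >= 2: trust predictor A, else B */

static int choose(int pred_a, int pred_b)
{
    return sel >= 2 ? pred_a : pred_b;
}

static void train(int pred_a, int pred_b, int outcome)
{
    if (pred_a == pred_b) return;              /* agreement: no change */
    if (pred_a == outcome && sel < 3) sel++;   /* A was right          */
    if (pred_b == outcome && sel > 0) sel--;   /* B was right          */
}

int main(void)
{
    /* Hypothetical case: A says taken, B says not taken, branch is taken. */
    int a = 1, b = 0, outcome = 1;
    printf("prediction used: %d\n", choose(a, b));
    train(a, b, outcome);
    printf("selector after training: %u\n", sel);
    return 0;
}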

Page 30: Instruction Level Parallelism (ILP) Limitations

II. HW MODEL: Jump Prediction

• Direct Jumps are known.

• Indirect jumps

– Perfect: Always performed correctly.

– Finite prediction: a table of destination addresses, indexed by the address of the jump; whenever a jump executes, its target address is stored in the table, and the next execution of that jump is predicted to go to the stored address (see the sketch after this list).

– Infinite prediction: Infinite table entries.

• No prediction: Every jump is predicted wrong.
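A minimal C sketch of the finite indirect-jump predictor described above: a table of last-seen target addresses indexed by the jump's own address. The table size and the addresses used in main are assumptions for illustration.

/* Indirect-jump target table: remember the last target per table slot. */
#include <stdio.h>

#define ENTRIES 256
static unsigned target[ENTRIES];   /* last-seen destination per slot */

static unsigned predict_jump(unsigned jump_pc)
{
    return target[(jump_pc >> 2) % ENTRIES];   /* 0 means "no prediction yet" */
}

static void update_jump(unsigned jump_pc, unsigned actual_target)
{
    target[(jump_pc >> 2) % ENTRIES] = actual_target;  /* remember last target */
}

int main(void)
{
    unsigned pc = 0x00400120, tgt = 0x00400200;   /* hypothetical addresses */
    update_jump(pc, tgt);                         /* the jump executes once */
    printf("next prediction for 0x%08x: 0x%08x\n", pc, predict_jump(pc));
    return 0;
}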

Page 31: Instruction Level Parallelism (ILP) Limitations

II. HW MODEL: Alias Analysis

• If two memory references do not refer to the same address, then they may be safely interchanged.

• The addresses of indirect memory references are known (from the trace) before the instructions are scheduled.

• No need to predict the actual values, only whether those values conflict.

• Perfect: global and stack references are disambiguated perfectly; heap references are assumed to conflict.

• Inspection: examine the base register and offset of each reference to decide whether two references can conflict (a sketch follows this list).

• None: All indirect memory references conflict.
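A minimal C sketch of alias analysis by inspection, in the spirit of the slide above: two references through the same base register with different compile-time offsets cannot conflict, and anything else is conservatively assumed to alias. The register numbers and offsets are made up.

/* Alias analysis "by inspection": compare base registers and offsets. */
#include <stdio.h>

struct memref { int base_reg; int offset; };

static int may_alias(struct memref a, struct memref b)
{
    if (a.base_reg == b.base_reg)
        return a.offset == b.offset;   /* same base: offsets decide          */
    return 1;                          /* different bases: assume a conflict */
}

int main(void)
{
    struct memref sw = { 29, 0 };      /* sw $t0, 0($sp)  ($sp = reg 29) */
    struct memref lw = { 29, 8 };      /* lw $t1, 8($sp)                 */
    printf("may alias? %s\n", may_alias(sw, lw) ? "yes" : "no (safe to reorder)");
    return 0;
}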

Page 32: Instruction Level Parallelism (ILP) Limitations

II. HW MODEL: Window Size

• The set of instructions which is examined for simultaneous execution.

• The cycle width limits the number of instructions which can be scheduled.

• A window size of 2k will look at 2048 instructions.

• Cycle width: assume we have found 111 instructions that could execute in parallel; a cycle width of 64 would limit the actual parallelism to 64 instructions in flight (see the sketch below).
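To tie the window size and cycle width together, here is a small C sketch of the limit-study scheduling rule: with perfect prediction, renaming, and alias analysis, an instruction can issue one cycle after its last producer, except that at most 64 instructions may share a cycle. The eight-instruction trace and its dependences are invented for illustration.

/* Greedy dataflow scheduling with a cycle-width cap, as in the limit study. */
#include <stdio.h>

#define WIDTH 64
#define N     8

int main(void)
{
    /* dep[i][0..1]: indices of producer instructions, -1 = none. */
    int dep[N][2] = { {-1,-1}, {-1,-1}, {0,1}, {-1,-1},
                      {2,-1},  {2,3},   {-1,-1}, {4,5} };
    int cycle[N], used[N + 1] = {0};

    for (int i = 0; i < N; i++) {
        int ready = 0;                        /* dataflow limit only  */
        for (int d = 0; d < 2; d++)
            if (dep[i][d] >= 0 && cycle[dep[i][d]] + 1 > ready)
                ready = cycle[dep[i][d]] + 1;
        while (used[ready] >= WIDTH) ready++; /* cycle-width limit (never
                                                 binds for this tiny trace) */
        cycle[i] = ready;
        used[ready]++;
    }

    int last = 0;
    for (int i = 0; i < N; i++) last = cycle[i] > last ? cycle[i] : last;
    printf("IPC = %.2f (%d instructions in %d cycles)\n",
           (double)N / (last + 1), N, last + 1);
    return 0;
}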

Page 33: Instruction Level Parallelism (ILP) Limitations

II. HW MODEL

ctr: counter

gsh: gshare (global history)

Page 34: Instruction Level Parallelism (ILP) Limitations

CONTENT

I. ILP Background

II. Hardware Model

III. Study of Limitations

IV. Simultaneous Multithreading

V. ILP today

Page 35: Instruction Level Parallelism (ILP) Limitations

III. STUDY OF LIMITATIONS

• Effects of...

– Register Renaming

– Branch/Jump Prediction

– Alias Analysis

– Realizable processor

• Window Size (Discrete/Continuous)

• Results

Page 36: Instruction Level Parallelism (ILP) Limitations

III. LIMITATIONS: Benchmarks

Page 37: Instruction Level Parallelism (ILP) Limitations

III. LIMITATIONS: Register Renaming

Page 38: Instruction Level Parallelism (ILP) Limitations

III. LIMITATIONS: Branch Prediction

Page 39: Instruction Level Parallelism (ILP) Limitations

III. LIMITATIONS: Branch Prediction

Page 40: Instruction Level Parallelism (ILP) Limitations

III. LIMITATIONS: Alias Analysis

Page 41: Instruction Level Parallelism (ILP) Limitations

III. LIMITATIONS: Results

Page 42: Instruction Level Parallelism (ILP) Limitations

III. LIMITATIONS: Realizable Processor

• Up to 64 instruction issues per clock with no issue restrictions, or roughly 10 times the total issue width of the widest processor in 2011

• A tournament predictor with 1K entries and a 16-entry return predictor. This predictor is comparable to the best predictors in 2011; the predictor is not a primary bottleneck

• Perfect disambiguation of memory references done dynamically—this is ambitious but perhaps attainable for small window sizes (and hence small issue rates and load-store buffers) or through address aliasing prediction

• Register renaming with 64 additional integer and 64 additional FP registers, which is slightly less than the most aggressive processor in 2011

• No issue restrictions, no cache misses, unit latencies

• Variable Window Size (Power5 200, Intel Core i7 ~128)

Page 43: Instruction Level Parallelism (ILP) Limitations

III. LIMITATIONS: Realizable Processor

Page 44: Instruction Level Parallelism (ILP) Limitations

III. LIMITATIONS: Conclusions

• Plateau behavior

• The window-size effect on the integer programs (the top three in the results) is not as severe, because the FP programs have more loop-level parallelism.

• Designers are faced with the challenge:

– Simpler processors with larger caches and higher clock rates

vs.

– More emphasis on ILP, with a slower clock and smaller caches

• Persistent limitations:

– WAW and WAR hazards through memory

– Unnecessary dependences

– Data flow limit

Page 45: Instruction Level Parallelism (ILP) Limitations

CONTENT

I. ILP Background

II. Hardware Model

III. Study of Limitations

IV. Simultaneous Multithreading

V. ILP today

Page 46: Instruction Level Parallelism (ILP) Limitations

IV. SIMULTANEOUS MULTITHREADING

• TLP Background

– TLP approaches

– Design Challenges

• Limits of Multiple-Issue Processors

– Power

– Complexity

Page 47: Instruction Level Parallelism (ILP) Limitations

IV. SMT: TLP Background

• Threads are largely independent

– Separate copy of the register file, PC, and page table

• Thread could represent

– A process that is part of a parallel program consisting of multiple processes

– An independent program on its own

• Thread level parallelism occurs naturally

• It can be used to keep functional units busy that would otherwise sit idle when ILP is insufficient

Page 48: Instruction Level Parallelism (ILP) Limitations

IV. SMT: TLP Approaches

Page 49: Instruction Level Parallelism (ILP) Limitations

IV. SMT: Changes

• Increasing the associativity of the L1 instruction cache and the instruction address translation buffers

• Adding per-thread load and store queues

• Increasing the size of the L2 and L3 caches

• Adding separate instruction prefetch and buffering

• Increasing the number of virtual registers from 152 to 240

• Increasing the size of several issue queues

Page 50: Instruction Level Parallelism (ILP) Limitations

IV. SMT: Results

Page 51: Instruction Level Parallelism (ILP) Limitations

IV. SMT: Results

• SMT reduces energy by 7%

• “Because of the costs and diminishing returns in performance, however, rather than implement wider superscalars and more aggressive versions of SMT, many designers are opting to implement multiple CPU cores on a single die with slightly less aggressive support for multiple issue and multithreading; we return to this topic in the next chapter.” - Hennessy et al.

Page 52: Instruction Level Parallelism (ILP) Limitations

CONTENT

I. ILP Background

II. Hardware Model

III. Study of Limitations

IV. Simultaneous Multithreading

V. ILP today

Page 53: Instruction Level Parallelism (ILP) Limitations

V. ILP TODAY: x86

• Instruction fetch—The processor uses a multilevel branch target buffer to achieve a balance between speed and prediction accuracy. There is also a return address stack to speed up function return. Mispredictions cause a penalty of about 17 cycles. Using the predicted address, the instruction fetch unit fetches 16 bytes from the instruction cache.

• Macro-ops and micro-ops: x86 instructions (macro-ops) are decoded into simpler micro-ops

• Total pipeline depth is 14 stages

• 128-entry reorder (renaming) buffer
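A back-of-the-envelope C sketch of what the roughly 17-cycle misprediction penalty quoted above costs in CPI; the branch frequency and predictor accuracy are assumptions, not figures from the slides.

/* Added CPI from branch mispredictions = branch freq x miss rate x penalty. */
#include <stdio.h>

int main(void)
{
    double branch_freq = 0.20;   /* assumed: 1 in 5 instructions is a branch */
    double miss_rate   = 0.05;   /* assumed: 95% prediction accuracy         */
    double penalty     = 17.0;   /* cycles, from the slide                   */

    double extra_cpi = branch_freq * miss_rate * penalty;
    printf("added CPI from mispredictions: %.3f cycles/instruction\n", extra_cpi);
    return 0;
}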

Page 54: Instruction Level Parallelism (ILP) Limitations

V. ILP TODAY: x86

• Hyper-Threading:

– SMT

– The processor may stall due to a cache miss, branch misprediction, or data dependency.

– Branch misprediction costs 17 cycles

Page 55: Instruction Level Parallelism (ILP) Limitations

V. ILP TODAY: x86

Page 56: Instruction Level Parallelism (ILP) Limitations

V. ILP TODAY: ARM

• The average CPI for the ARM7 family is about 1.9 cycles per instruction.

• The average CPI for the ARM9 family is about 1.5 cycles per instruction.

• The average CPI for the ARM11 family is about 1.39 cycles per instruction.
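A short C sketch that turns the average CPI figures above into execution time using the basic performance equation (time = instruction count x CPI / clock rate). The instruction count and the 200 MHz clock are assumptions for illustration only, since the three families do not share a clock rate.

/* Execution time from CPI: time = instructions * CPI / clock rate. */
#include <stdio.h>

int main(void)
{
    double insts = 1e8;                       /* assumed workload          */
    double clock_hz = 200e6;                  /* assumed 200 MHz clock     */
    double cpi[] = { 1.9, 1.5, 1.39 };        /* ARM7, ARM9, ARM11 (slide) */
    const char *fam[] = { "ARM7", "ARM9", "ARM11" };

    for (int i = 0; i < 3; i++)
        printf("%-5s: %.3f s\n", fam[i], insts * cpi[i] / clock_hz);
    return 0;
}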

Page 57: Instruction Level Parallelism (ILP) Limitations

SOURCES

Computer Architecture: A Quantitative Approach. J. L. Hennessy, D. A. Patterson, K. Asanović. 5th ed., Morgan Kaufmann/Elsevier, 2012.

Limits of Instruction-Level Parallelism. D. W. Wall. Proc. 4th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 176-188, 1991.

Computer Science 61C, Lecture 31: Instruction Level Parallelism. Mike Franklin, Dan Garcia. UC Berkeley, Fall 2011.

ILP and TLP in Shared Memory Applications: A Limit Study. E. Fatehi, P. V. Gratz. Proc. 23rd International Conference on Parallel Architectures and Compilation, pages 113-126, 2014.

MIPS Multicycle Model: Pipelining. Michael Langer. Introduction to Computer Systems, McGill University, 2012.

IBM Power5 Chip: A Dual-Core Multithreaded Processor. R. Kalla, B. Sinharoy, J. M. Tendler. IBM. IEEE CS, 2004.