Review of Chapters 3 & 4
Copyright © 2012, Elsevier Inc. All rights reserved.


Chapter 3 Review

Baseline: the simple MIPS 5-stage pipeline (IF, ID, EX, MEM, WB)

How can we exploit instruction-level parallelism (ILP) to improve performance?

Pipeline CPI = Ideal pipeline CPI + Structural stalls + Data hazard stalls + Control stalls
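
A quick numeric instance of this equation (the stall contributions below are hypothetical, chosen only to show how the terms add up):

```latex
\text{Pipeline CPI} = \underbrace{1.0}_{\text{ideal}} + \underbrace{0.05}_{\text{structural}} + \underbrace{0.20}_{\text{data hazard}} + \underbrace{0.10}_{\text{control}} = 1.35
```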


Hazards & Stalls

Structural hazards
Cause: resource contention
Solution: add more resources & better scheduling

Control hazards
Cause: branch instructions change the program flow
Solution: loop unrolling, branch prediction, hardware speculation

Data hazards
Cause: dependences
True data dependence: a property of the program (RAW)
Name dependence: reuse of registers (WAR & WAW)
Solution: loop unrolling, dynamic scheduling, register renaming, hardware speculation
(the three dependence kinds are illustrated in the sketch below)
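
A minimal sketch of the three dependence kinds in ordinary C++ statements (the variables are hypothetical; on real hardware these dependences arise between instructions through registers):

```cpp
#include <cstdio>

int main() {
    int b = 1, c = 2;
    int a = b + c;  // writes a
    int d = a * 2;  // RAW: true dependence, reads a after it is written
    b = 7;          // WAR: name (anti-)dependence, writes b after it was read
    a = d - 1;      // WAW: name (output) dependence, writes a a second time
    printf("%d %d %d %d\n", a, b, c, d);
}
```

Renaming can remove the two name dependences (give b and the second a fresh names); the RAW dependence must be preserved.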


Decreasing the Ideal CPI: Multiple Issue


Loop Unrolling (p. 161)
Find that the loop iterations are independent
Use different registers to avoid unnecessary constraints (name dependences)
Eliminate the extra test and branch instructions (control dependences)
Interchange load and store instructions if possible (to make use of stall time)
Schedule the code to avoid/mitigate stalls while maintaining the true data dependences
(a sketch of an unrolled loop follows)
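
A minimal sketch of unrolling by four, using the familiar DAXPY loop as an illustration (function and array names are assumptions):

```cpp
// One test-and-branch per four elements; the four statements in the body
// are independent, so the compiler can schedule them into stall slots.
void daxpy_unrolled(int n, double a, const double *x, double *y) {
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        y[i]     = a * x[i]     + y[i];
        y[i + 1] = a * x[i + 1] + y[i + 1];
        y[i + 2] = a * x[i + 2] + y[i + 2];
        y[i + 3] = a * x[i + 3] + y[i + 3];
    }
    for (; i < n; ++i)               // clean-up loop for leftover elements
        y[i] = a * x[i] + y[i];
}
```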


Branch Prediction
1-bit or 2-bit predictor: a local predictor
Uses the past outcomes of the branch itself as the indicator
Correlating predictor: a global predictor
Uses the past outcomes of correlated branches as the indicator
(m, n) predictor: a two-level predictor
The number of bits in an (m, n) predictor is 2^m × n × number of prediction entries
Tournament predictor: an adaptive predictor
Combines a local & a global predictor
Selects the right predictor for a particular branch
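
A minimal sketch of the 2-bit local predictor, assuming a simple PC-indexed table of saturating counters (table size and indexing are illustrative):

```cpp
#include <cstdint>
#include <vector>

struct TwoBitPredictor {
    std::vector<uint8_t> table;   // 2-bit counters: 0,1 predict not taken; 2,3 predict taken
    explicit TwoBitPredictor(size_t entries) : table(entries, 1) {}
    bool predict(uint64_t pc) const { return table[pc % table.size()] >= 2; }
    void update(uint64_t pc, bool taken) {
        uint8_t &c = table[pc % table.size()];
        if (taken)  { if (c < 3) ++c; }   // saturate at strongly taken
        else        { if (c > 0) --c; }   // saturate at strongly not taken
    }
};
```

A correlating (m, n) predictor would additionally index the table with an m-bit global history of recent branch outcomes; with 1024 entries, m = 2 and n = 2, the formula above gives 2^2 × 2 × 1024 = 8192 bits.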


Dynamic Scheduling
Hardware rearranges the instruction execution to reduce stalls while maintaining data flow and exception behavior.

Simple pipeline: in-order issue, in-order execution, in-order completion
Dynamic scheduling: in-order issue, out-of-order execution, out-of-order completion

Out-of-order execution results in WAR & WAW hazards
Out-of-order completion results in unexpected exception behavior


Dynamic Scheduling
Addresses WAW & WAR hazards caused by out-of-order execution
Tomasulo's approach: register renaming (reservation stations, common data bus)
Stages: Issue, Execute, Write Result
Basic structure of Tomasulo's algorithm: p. 173
(a toy renaming sketch follows)
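
A toy register-renaming sketch (Tomasulo renames implicitly through reservation-station tags and the CDB; the explicit map below is only an illustration of the idea):

```cpp
#include <cstdio>

int main() {
    int rename_map[8];           // architectural register -> current physical name
    int next_phys = 8;           // next free physical name
    for (int r = 0; r < 8; ++r) rename_map[r] = r;

    struct { int dst, src1, src2; } prog[] = {
        {1, 2, 3},   // R1 = R2 op R3
        {4, 1, 5},   // R4 = R1 op R5  (RAW on R1: must read the renamed R1)
        {1, 6, 7},   // R1 = R6 op R7  (WAW/WAR on R1: removed by a fresh name)
    };
    for (auto &in : prog) {
        int p1 = rename_map[in.src1], p2 = rename_map[in.src2];
        rename_map[in.dst] = next_phys++;     // fresh name for each result
        printf("P%d = P%d op P%d\n", rename_map[in.dst], p1, p2);
    }
}
```

The third instruction writes P10 rather than reusing R1's old name P8, so it can execute before or alongside the second one.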


Dynamic Scheduling
Addresses unexpected exception behavior caused by out-of-order completion
Hardware speculation: reorder buffer (passes the results along, guaranteeing in-order completion)
Stages: Issue, Execute, Write Result, Commit
Basic structure of hardware speculation: p. 185

Now: a pipeline with dynamic scheduling
In-order issue, out-of-order execution, in-order completion
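
A minimal reorder-buffer sketch: results may be written out of order, but instructions retire strictly from the head, in program order (sizes and values are illustrative):

```cpp
#include <cstdio>

constexpr int ROB_SIZE = 4;
struct RobEntry { bool done; int result; };

int main() {
    RobEntry rob[ROB_SIZE] = {};
    int head = 0;                          // oldest instruction not yet committed

    int completion_order[] = {2, 0, 3, 1}; // out-of-order completion
    for (int e : completion_order) {
        rob[e] = {true, e * 10};           // Write Result into the ROB
        while (head < ROB_SIZE && rob[head].done) {
            printf("commit entry %d (result %d)\n", head, rob[head].result);
            ++head;                        // Commit only from the head
        }
    }
}
```

Entry 2 finishes first but cannot commit until entries 0 and 1 have, which is what keeps exceptions precise.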


Decreasing the CPI: Multiple Issue
Statically scheduled superscalar processors
VLIW (very long instruction word) processors
Dynamically scheduled superscalar processors
See the summary table on p. 194


Chapter 4 Review

SISD (single instruction, single data) architecture
Examples in Chapter 3

SIMD (single instruction, multiple data) architecture: exploiting data-level parallelism
Vector architectures
Multimedia SIMD instruction set extensions
Graphics processing units (GPUs)



Vector Architecture
Primary components of VMIPS:
Vector registers
Vector functional units
Vector load/store unit
A set of scalar registers
Basic structure of a vector architecture: p. 265


Vector Architecture
Execution time depends on:
Length of the operand vectors
Structural hazards among the operations
Data dependences

Convoy: the set of vector instructions that could potentially execute together (no structural hazards)
Chaining: addresses data dependences within a convoy
Chime: the unit of time taken to execute one convoy


Vector Architecture
Execution time: a vector sequence of m convoys with vector length n takes approximately m × n clock cycles
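
For example (a hypothetical but typical DAXPY-style sequence), 3 convoys over 64-element vectors take about

```latex
m \times n = 3 \times 64 = 192 \text{ clock cycles (3 chimes)}
```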


Vector Architecture
Multiple lanes: execute a single vector at faster than one element per clock cycle
Vector-length register (MTC1 VLR, R1): handles programs where the vector lengths are not the same as the length of the vector register
Strip mining: used when the vector length is longer than the MVL (sketch below)
Vector mask registers (CVM, POP): handle IF statements in vector loops
Memory banks: supply bandwidth for the vector load/store units by allowing multiple independent data accesses
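
A minimal strip-mining sketch in scalar code: the loop is split into chunks of at most MVL elements, with the vector-length register set to the chunk size (the MVL value and names are assumptions):

```cpp
#include <algorithm>

constexpr int MVL = 64;                      // maximum vector length

void daxpy_strip_mined(int n, double a, const double *x, double *y) {
    for (int low = 0; low < n; ) {
        int vl = std::min(MVL, n - low);     // set VLR: full MVL or the remainder
        for (int i = low; i < low + vl; ++i) // stands in for one vector operation
            y[i] = a * x[i] + y[i];
        low += vl;                           // advance to the next strip
    }
}
```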


Vector Architecture
Handling multidimensional arrays: stride
LVWS V1, (R1, R2)
SVWS (R1, R2), V1

Handling sparse matrices: gather-scatter (sketch below)
LVI V1, (R1, V2)
SVI (R1, V2), V1

Programming vector architectures: program structure affects performance; most of the effort goes into improving memory accesses, and most of the techniques are modifications to the vector instruction set
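
A scalar sketch of what the gather (LVI) and scatter (SVI) instructions do (array names are illustrative):

```cpp
// Gather: read elements of mem at positions given by an index vector.
void gather(int n, const double *mem, const int *index, double *v) {
    for (int i = 0; i < n; ++i)
        v[i] = mem[index[i]];      // LVI V1, (R1, V2)
}

// Scatter: write a vector back to those indexed positions.
void scatter(int n, const double *v, const int *index, double *mem) {
    for (int i = 0; i < n; ++i)
        mem[index[i]] = v[i];      // SVI (R1, V2), V1
}
```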


SIMD Instruction Set Extensions
Observation: many media applications operate on narrower data types than the 32-bit processors were optimized for
8 bits represent each of the three primary colors
8 bits for transparency

Limitations:
The number of data operands is fixed in the opcode
No sophisticated addressing modes of vector architectures: strided & gather-scatter accesses
No mask registers

Roofline visual performance model (see the worked bound below)
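
The roofline model bounds attainable throughput by the lower of the compute ceiling and the memory ceiling:

```latex
\text{Attainable GFLOP/s} = \min\big(\text{Peak GFLOP/s},\; \text{Peak memory BW} \times \text{Arithmetic intensity}\big)
```

For instance, with a hypothetical 100 GFLOP/s peak, 25 GB/s of bandwidth, and 2 FLOPs/byte of arithmetic intensity, the bound is min(100, 50) = 50 GFLOP/s: the kernel is memory-bound.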


SIMD Implementations
Intel MMX (1996)
Eight 8-bit integer ops or four 16-bit integer ops
Streaming SIMD Extensions (SSE) (1999)
Eight 16-bit integer ops
Four 32-bit integer/FP ops or two 64-bit integer/FP ops
Advanced Vector Extensions (AVX) (2010)
Four 64-bit integer/FP ops

Operands must be in consecutive and aligned memory locations
Generally designed to accelerate carefully written libraries rather than to be targeted by compilers (see the intrinsics sketch below)

Advantages over vector architectures:
Cost little to add to the standard ALU and are easy to implement
Require little extra state, so context switches stay easy
Require little extra memory bandwidth
No virtual-memory problems with cross-page accesses and page faults
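
A minimal SSE intrinsics sketch of four single-precision adds per instruction (library code would also worry about alignment; names are illustrative):

```cpp
#include <xmmintrin.h>

void add_packed(const float *x, const float *y, float *out, int n) {
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 a = _mm_loadu_ps(x + i);            // load 4 floats
        __m128 b = _mm_loadu_ps(y + i);
        _mm_storeu_ps(out + i, _mm_add_ps(a, b));  // 4 adds in one instruction
    }
    for (; i < n; ++i) out[i] = x[i] + y[i];       // scalar tail
}
```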



Graphics Processing Units
Challenges:
Not simply getting good performance on the GPU
Coordinating the scheduling of computation on the system processor and the GPU, and the transfer of data between system memory and GPU memory

Heterogeneous architecture & computing: CPU + GPU
Individual memories for CPU & GPU, like a distributed system on a node
CUDA or OpenCL languages
Programming model is "Single Instruction, Multiple Thread" (SIMT)
(a minimal CUDA sketch follows)


Threads, Blocks, Grids
A thread is associated with each data element
Threads are organized into blocks
Blocks are organized into a grid
GPU hardware handles thread management, not applications or the OS
(the CUDA sketch above walks this hierarchy via threadIdx and blockIdx)



NVIDIA GPU Architecture

Similarities to vector machines:
Works well with data-level parallel problems
Scatter-gather transfers
Mask registers
Large register files

Differences:
No scalar processor
Uses multithreading to hide memory latency
Has many functional units, as opposed to a few deeply pipelined units like a vector processor



Terminology

Threads of SIMD instructions
Each has its own PC
The thread scheduler uses a scoreboard to dispatch
No data dependences between threads!
Keeps track of up to 48 threads of SIMD instructions
Hides memory latency

The thread block scheduler schedules blocks to SIMD processors

Within each SIMD processor: 32 SIMD lanes
Wide and shallow compared to vector processors
