Computation Gap
Instruction Level Parallelism

A.R. Hurson
Computer Science Department
Missouri Science & Technology


Post on 13-Mar-2020


Computation Gap

Computation gap is defined as the difference between the computational power demanded by the application environments and the computational capability of the existing computers.

Today, one can find many applications which require orders of magnitude more computations than the capability of the so-called super-computers and super-systems.

Some Applications

It is estimated that the so-called Problem Solving and Inference Systems require an environment with a computational power on the order of 100 MLIPS to 1 GLIPS (1 LIPS ≈ 100-1000 instructions).

Experiences in Fluid Dynamics have shown that the conventional super-computers can calculate steady 2-dimensional flow in minutes.

However, conventional super-computers require up to 20 hours to handle time-dependent 2-dimensional flow or steady 3-dimensional flows on simple objects.

Numerical Aerodynamics Simulator requires an environment with a sustained speed of 1 billion FLOPS.

Strategic Defense Initiative requires a distributed, fault tolerant computing environment with a processing rate of [...] MOPS.

U.S. Patent and Trademark Office has a database of size 25 terabytes subject to search and update.

An angiogram department of a mid-size hospital generates more than $64 \times 10^{11}$ bits of data a year.

NASA's Earth Observing System will generate more than 11,000 terabytes of data during the 15-year time period of the project.

Performance

System                    Performance
CDC STAR-100              25-100 MFLOPS
DAP                       100 MFLOPS
ILLIAC IV                 160 MFLOPS
HEP                       160 MFLOPS
CRAY-1                    25-80 MFLOPS
CRAY X-MP(1)              210 MFLOPS
CRAY X-MP(4)              840 MFLOPS
CRAY-2                    250 MFLOPS
CDC CYBER 200             400 MFLOPS
Hitachi S-810(10)         315 MFLOPS
Hitachi S-810(20)         630 MFLOPS
Fujitsu FACOM VP-50       140 MFLOPS
Fujitsu FACOM VP-100      285 MFLOPS
Fujitsu FACOM VP-200      570 MFLOPS
Fujitsu FACOM VP-400      1,140 MFLOPS
NEC SX-1                  570 MFLOPS
NEC SX-2                  1,300 MFLOPS
IBM RP3                   1,000 MFLOPS
MPP 8-bit integer         1,545-6,553 MIPS
MPP 12-bit integer        795-4,428 MIPS
MPP 16-bit integer        484-3,343 MIPS
MPP 32-bit FL             165-470 MIPS
MPP 40-bit FL             126-383 MIPS
NEC (Earth System)        35 tera FLOPS
IBM Blue Gene             70 tera FLOPS

Some New Applications

A recent estimate puts the amount of new information generated in 2002 at 5 exabytes (1 exabyte = $10^{18}$ bytes), which is approximately equal to all words spoken by human beings, and 92% of this information is on hard disk. While a good fraction of this information is of transient interest, useful information of archival value will continue to accumulate.

The TREC database holds around 800 million static pages having 6 trillion bytes of plain text, equal to the size of [...] million books.

The Google system routinely accumulates millions of pages of new text information every week.

[Figure: timeline from 1970 to 2000 with the data-type labels Text, Image, Multimedia, Sensors, and Binary.]

Problem

Suppose a machine capable of handling $10^{6}$ characters per second is in hand. How long does it take to search 25 terabytes of data?

$$\frac{25 \times 10^{12}}{10^{6}} = 25 \times 10^{6}\ \text{sec} \approx 4 \times 10^{5}\ \text{min} \approx 7 \times 10^{3}\ \text{hours} \approx 290\ \text{days}$$

Even if the performance is improved by a factor of 1000, it takes about 7 hours to exhaustively search this database!
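The arithmetic on this slide can be checked with a few lines of Python. This is a minimal sketch, assuming one character occupies one byte and a scan rate of 10^6 characters per second (the rate implied by the division above); the function name `scan_seconds` is illustrative and not from the slides.

```python
def scan_seconds(size_bytes: int, chars_per_second: int) -> float:
    """Time to read every character once, assuming one byte per character."""
    return size_bytes / chars_per_second


DATABASE_BYTES = 25 * 10**12  # 25 terabytes
RATE = 10**6                  # assumed scan rate, characters per second

seconds = scan_seconds(DATABASE_BYTES, RATE)
days = seconds / 86_400
hours_if_1000x_faster = seconds / 1000 / 3600

print(f"{seconds:.3g} s, about {days:.0f} days")
print(f"1000x faster: about {hours_if_1000x_faster:.1f} hours")
```

Even a thousandfold speed-up leaves an exhaustive scan in the range of hours, which is the point the slide makes.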

Problem

NOT PRACTICAL!
WHAT ARE THE SOLUTIONS?

Reduce the amount of needed computations (advances in software technology and algorithms).

Improve the speed of the computers:

• Physical Speed (Advances in hardware technology).
• Logical Speed (Advances in computer architecture/organization).

Architectural Advances of the Uni-processor Organization

Organization of the conventional uni-processor systems can be modified in order to remove the existing bottlenecks. For example, the access gap is one of the problems in the von Neumann organization.

Access Gap

Access gap is defined as the time difference between the CPU cycle time and the main memory cycle time.

The access gap problem was created by the advances in technology.

In early computers such as the IBM 704, CPU and main memory cycle times were identical, i.e., 12 μsec.

IBM 360/195 had a logic delay of 5 nsec per stage, a CPU cycle time of 54 nsec, and a main memory cycle time of .756 μsec, and CDC 7600 had CPU and main memory cycle times of 27.5 nsec and [...] sec, respectively.

System Architecture/Organization

To overcome the technological limitations, computer designers have long been attracted to techniques that are classified under the term "Concurrency".

Concurrency

Concurrency is a generic term which defines the ability of the computer hardware to simultaneously execute many actions at any instant.

Within this general term are several well-recognized techniques such as Parallelism, Pipelining, and Multiprocessing.

Although these techniques have the same origin and are often hard to distinguish, in practice they are different in their general approach.

Parallelism achieves concurrency by replicating/duplicating the hardware structure many times.

Pipelining takes the approach of splitting the function to be performed into smaller pieces, allocating separate hardware to each piece, and overlapping operations of each piece.

Concurrent Systems: Classification

• Feng's Classification
• Flynn's Classification
• Handler's Classification

Concurrent Systems: Feng's Classification

• In this classification, the concurrent space is identified as a two-dimensional space based on the bit and word multiplicities.

[Figure: Feng's concurrent space. Horizontal axis: bit multiplicity (1, 16, 32). Vertical axis: word multiplicity. Plotted points, as (bit multiplicity, word multiplicity): A (1,1); B, IBM 360 (32,1); C (1,N); D (M,N); C.mmp (16,16); Staran (1,256); MPP (1,16384).]

Concurrent Systems: Feng's Classification

• Point A represents a pure sequential machine, i.e., a uni-processor with serial ALU.
• Point B represents a uni-processor with parallel ALU.
• Point C represents a parallel bit slice organization.
• Point D represents a parallel word organization.
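The four corner points can be expressed as a small classifier over the (bit multiplicity, word multiplicity) pair. This is a hedged sketch: the machine coordinates are the ones printed in the figure, while the function names and the use of the product of the two multiplicities as a summary "degree of parallelism" are assumptions made for illustration.

```python
# (bit multiplicity, word multiplicity) pairs as printed in the figure
FENG_POINTS = {
    "IBM 360": (32, 1),
    "C.mmp": (16, 16),
    "Staran": (1, 256),
    "MPP": (1, 16384),
}


def feng_point(bit_mult: int, word_mult: int) -> str:
    """Map a machine to the corner point (A, B, C or D) it resembles."""
    if bit_mult < 1 or word_mult < 1:
        raise ValueError("multiplicities must be positive")
    if bit_mult == 1 and word_mult == 1:
        return "A"  # sequential machine, serial ALU
    if word_mult == 1:
        return "B"  # uni-processor, parallel ALU
    if bit_mult == 1:
        return "C"  # parallel bit-slice organization
    return "D"      # parallel word organization


def degree_of_parallelism(bit_mult: int, word_mult: int) -> int:
    """Bits processed at once: an illustrative summary measure."""
    return bit_mult * word_mult


for name, (bits, words) in FENG_POINTS.items():
    print(name, feng_point(bits, words), degree_of_parallelism(bits, words))
```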

Concurrent Systems: Flynn's Classification

• Flynn has classified the concurrent space according to the multiplicity of instruction and data streams:

I = { Single Instruction Stream (SI), Multiple Instruction Stream (MI) }
D = { Single Data Stream (SD), Multiple Data Stream (MD) }

• The Cartesian product of these two sets will define four different classes:
− SISD
− SIMD
− MISD
− MIMD
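As a small illustration of the Cartesian product described above, the sketch below enumerates the four classes and maps a machine to its class from its stream counts. The function name `flynn_class` and the idea of passing stream counts as integers are assumptions for the example, not part of the slides.

```python
from itertools import product

INSTRUCTION = ["SI", "MI"]  # single / multiple instruction stream
DATA = ["SD", "MD"]         # single / multiple data stream

# Cartesian product I x D, joined into the familiar four-letter names
FLYNN_CLASSES = [i + d for i, d in product(INSTRUCTION, DATA)]


def flynn_class(instruction_streams: int, data_streams: int) -> str:
    """Classify a machine by how many instruction and data streams it has."""
    if instruction_streams < 1 or data_streams < 1:
        raise ValueError("stream counts must be positive")
    i = "SI" if instruction_streams == 1 else "MI"
    d = "SD" if data_streams == 1 else "MD"
    return i + d


print(FLYNN_CLASSES)
print(flynn_class(1, 64))  # one instruction stream driving 64 data streams
```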

Concurrent Systems: Flynn's Classification, Revisited

• The MIMD class can be further divided based on:
− Memory structure: global or distributed
− Communication/synchronization mechanism: shared variable or message passing.

• As a result we have four additional classes of computers:
− GMSV: Shared memory multiprocessors
− GMMP: ?
− DMSV: Distributed shared memory
− DMMP: Distributed memory (multi-computers)
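The four MIMD subclasses are again a product of two two-valued choices. The sketch below builds the names mechanically and attaches the descriptions given on the slide; the dictionary layout is an illustrative choice, and GMMP is left without a description because the slide leaves it open.

```python
from itertools import product

MEMORY = {"GM": "global memory", "DM": "distributed memory"}
COMMUNICATION = {"SV": "shared variables", "MP": "message passing"}

# Descriptions as given on the slide; GMMP is left open there ("?")
DESCRIPTION = {
    "GMSV": "shared memory multiprocessors",
    "GMMP": None,
    "DMSV": "distributed shared memory",
    "DMMP": "distributed memory (multi-computers)",
}

MIMD_CLASSES = [m + c for m, c in product(MEMORY, COMMUNICATION)]

for name in MIMD_CLASSES:
    memory, comm = MEMORY[name[:2]], COMMUNICATION[name[2:]]
    print(f"{name}: {memory} + {comm} -> {DESCRIPTION[name]}")
```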

Concurrent Systems: Handler's Classification

• Handler has extended Feng's concurrent space by a third dimension, namely, the number of control units.

• Handler's space is defined as T = (k, d, w):
− k: number of control units,
− d: number of ALUs controlled by a control unit,
− w: number of bits handled by an ALU.

[Figure: Handler's three-dimensional space. Axes: k, number of CUs (1, 4, 16); d, number of ALUs (1, 3, 4, 64); w, word length (1, 16, 64). Plotted points: C.mmp (16,1,16); IBM 360/91 (1,3,64); TI ASC (1,4,64); ILLIAC IV* (4,64,64).]

Concurrent Systems: Handler's Classification

• Point (1,1,1) represents a von Neumann machine with serial ALU.
• Point (1,1,M) represents a von Neumann machine with parallel ALU.
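Handler's triple T = (k, d, w) maps naturally onto a small record type. The following is a sketch under stated assumptions: the machine triples are the ones shown in the figure for this classification, while the class name `Handler` and the `describe` helper are hypothetical names introduced here.

```python
from typing import NamedTuple


class Handler(NamedTuple):
    k: int  # number of control units
    d: int  # number of ALUs controlled by one control unit
    w: int  # number of bits handled by one ALU


def describe(t: Handler) -> str:
    """Name the two von Neumann corner points; otherwise summarize the triple."""
    if t.k == 1 and t.d == 1:
        alu = "serial" if t.w == 1 else "parallel"
        return f"von Neumann machine with {alu} ALU"
    return f"{t.k} control unit(s), {t.d} ALU(s) each, {t.w}-bit ALUs"


# Triples as printed in the figure
MACHINES = {
    "C.mmp": Handler(16, 1, 16),
    "IBM 360/91": Handler(1, 3, 64),
    "TI ASC": Handler(1, 4, 64),
    "ILLIAC IV": Handler(4, 64, 64),
}

for name, t in MACHINES.items():
    print(f"{name}: {describe(t)}")
```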

Concurrent Systems: Handler's Classification

• To represent pipelining at different levels (macro pipeline, instruction pipeline, and arithmetic pipeline), diversity, sequentiality, and flexibility/adaptability, the original Handler's scheme has been extended by three variables (k΄, d΄, w΄) and three operators (+, *, v).

• k΄ represents macro pipeline
• d΄ represents instruction pipeline
• w΄ represents arithmetic pipeline
• + represents diversity (parallelism)
• * represents sequentiality (pipelining)
• v represents flexibility/adaptability

Concurrent Systems: Handler's Classification

• According to the extended Handler's scheme:

ILLIAC IV: (1*1, 1*1, 48*1) * (1*1, 64*1, 64*1)
    first term: front-end (B 6700); second term: array processor
CDC 7600: (1*1, 1*9, 60*1)
CDC Star: (1*1, 2*1, 64*4)
DAP: (1*1, 1*1, 32*1) * [ (1*1, 128*1, 32*1) v (1*1, 4096*1, 1*1) ]
    first term: front-end; bracketed term: array processor

Questions

What are the motivations behind the classification of computer systems?
What are the shortcomings of the aforementioned classification schemes?
Can you propose a new classification scheme?

Why classify computer architecture?

Generalize and identify the characteristics of different systems.

Group machines with common architectural features:
• To study systems more easily.
• To transfer solutions more easily.

Better estimate the weak and strong points of a system:
• To utilize a system more effectively.

Anticipate future trends and developments:
• Research directions.

Goals of a classification scheme

Categorize all existing and foreseeable designs.
Differentiate different designs.
Assign an architecture to a unique class.

Summary

Computation Gap
How to reduce Computation Gap:
• Advances in Software and Algorithms
• Advances in Technology
• Advances in Computer Organization/Architecture

Concurrency
Classification
• Feng
• Flynn/Extended MIMD
• Handler

[Figure: taxonomy of concurrent systems. Recoverable labels: Scalar; Sequential; Lookahead; I/E Overlap; Functional Parallelism; Multiple Functional Units; Pipeline; Implicit Vector; Explicit Vector; Memory-to-Memory; Register-to-Register; SIMD; MIMD; Associative Processor; Processor Array; Multi-computer; Multiprocessor; VLIW; Superscalar.]

Concurrent Systems

We group concurrent systems into two groups:

• Control Flow
• Data Flow

In the control flow model of computation, execution of an instruction activates the execution of the next instruction.

In the data flow model of computation, availability of the data activates the execution of the next instruction(s).
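The data flow firing rule can be made concrete with a toy interpreter in which an instruction runs as soon as all of its operands exist, regardless of the order in which the program lists it. This is a minimal sketch, not a model of any real data flow machine; `run_dataflow` and the instruction encoding are assumptions made for the example.

```python
import operator


def run_dataflow(instructions, initial_values):
    """Fire every instruction whose inputs are available; repeat until done.

    instructions: dict mapping a result name to (function, [input names]).
    Returns the final values and the order in which instructions fired.
    """
    values = dict(initial_values)
    pending = dict(instructions)
    fired = []
    while pending:
        ready = [name for name, (_, inputs) in pending.items()
                 if all(i in values for i in inputs)]
        if not ready:
            raise ValueError("no instruction can fire: missing operands")
        for name in ready:  # all ready instructions could run concurrently
            fn, inputs = pending.pop(name)
            values[name] = fn(*(values[i] for i in inputs))
            fired.append(name)
    return values, fired


# (a + b) * (c - d), deliberately listed with the final multiply first
program = {
    "prod": (operator.mul, ["sum", "diff"]),
    "sum": (operator.add, ["a", "b"]),
    "diff": (operator.sub, ["c", "d"]),
}
values, fired = run_dataflow(program, {"a": 1, "b": 2, "c": 5, "d": 3})
print(values["prod"], fired)
```

In a control flow machine the listed order would dictate execution; here `prod` fires last because its operands arrive last, and `sum` and `diff` become ready in the same step.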

[Figure: classification tree of concurrent systems. Concurrent Systems divide into Control Flow and Data Flow. Labels shown: Parallel Systems (Ensemble P[...], SIMD Arr[...], Associative); Multiprocessor Systems (Loosely Coupled, Tightly Coupled); Pipelined Systems (Linear/Feedback, Uni-function/Multifunction, Static/Dynamic, ...); Data Driven Systems (Static, Dynamic); Demand Driven Systems.]

Concurrent Systems

Within the scope of the control flow systems, we distinguish three classes, namely:

• Parallel Systems
• Pipeline Systems
• Multiprocessors

This distinction is due to the exploitation of concurrency and the interrelationships among the control unit, processing elements, and memory modules in each group.

Multiprocessor Systems

Multiprocessor systems can be grouped into two classes:

• Tightly Coupled (Central Memory: not scalable)
• Loosely Coupled (Distributed Memory: scalable)

Tightly Coupled (Central Memory: not scalable): shared memory modules are separated from processors by an interconnection network or a multiport interface. All processors have equal access time to all memory words. Therefore, the memory access time (assuming no conflict) is independent of the module being accessed (C.mmp, HEP, Encore's Multimax, Cedar, NYU Ultracomputer).

Loosely Coupled (Distributed Memory: scalable): each processor has a local-public memory. Each processor can directly access its memory module, but all other accesses to non-local memory modules must be made through an interconnection network. The access time varies with the location of the memory module (Cm*, BBN Butterfly, and Dash).

Multiprocessor Systems

Besides the higher throughput, multiprocessor systems offer more reliability, since failure in any one of the redundant components can be tolerated through system reconfiguration.

Multiprocessor organization is a logical extension of the parallel system, i.e., the array of processors organization. However, the degree of freedom associated with the processors is much higher than it is in an array of processors.

The independence of the processors and the sharing of resources among the processors, both desirable features, are achieved at the expense of an increase in complexity at both the hardware and software levels.

Multi-computer Systems

A multi-computer system is a collection of processors interconnected by a message-passing network.

Each processor is an autonomous computer having its own local memory and communicating with the others through the message-passing network (iPSC, nCUBE, and Mosaic).

Pipeline Systems

The term pipelining refers to a design technique that introduces concurrency by taking a basic function that is invoked repeatedly in a process and partitioning it into several sub-functions with the following properties:

• Evaluation of the basic function is equivalent to some sequential evaluation of the sub-functions.
• Other than the exchange of inputs and outputs, there are no interrelationships between sub-functions.
• Hardware may be developed to execute each sub-function.
• The execution times of these hardware units are usually approximately equal.

Under the aforementioned conditions, the speed-up from pipelining equals the number of pipe stages.

However, stages are rarely balanced and, furthermore, pipelining does involve some overhead.

Pipeline Systems

The concept of pipelining can be implemented at different levels. With regard to this issue, one can address:

• Arithmetic Pipelining
• Instruction Pipelining
• Processor Pipelining

Pipeline Systems

Non-pipelined instruction cycle:

Inst. i      IF ID EX MEM WB
Inst. i+1                    IF ID EX MEM WB
Inst. i+2                                    IF ID EX

Pipelined instruction cycle:

Inst. i      IF  ID  EX  MEM WB
Inst. i+1        IF  ID  EX  MEM WB
Inst. i+2            IF  ID  EX  MEM WB
Inst. i+3                IF  ID  EX  MEM WB

A pipelined instruction cycle gives a peak performance of one instruction every step.

Pipeline Systems: Example

Assume an unpipelined machine has a 10 ns clock cycle. It requires four clock cycles for the ALU and branch operations and five clock cycles for the memory reference operations. Calculate the average instruction execution time, if the relative frequencies of these operations are 40%, 20%, and 40%, respectively.

Ave. instr. exec. time = 10 * [ (40% + 20%) * 4 + 40% * 5 ] = 44 ns

Now assume we have a pipelined version of this machine. Furthermore, due to the clock skew and set-up, pipelining adds 1 ns of overhead to the clock time. Ignoring the latency, now calculate the average instruction execution time.

Ave. instr. exec. time = 10 + 1 = 11 ns, and
Speed up = 44/11 = 4
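The two averages in this example can be reproduced as follows. This is a minimal sketch using the numbers from the slide; the function name and the choice to express the instruction mix in whole percentages (to keep the arithmetic exact) are illustrative assumptions.

```python
def average_instruction_time(clock_ns, mix):
    """Weighted average time per instruction on the unpipelined machine.

    mix: list of (frequency in percent, clock cycles) pairs.
    """
    return clock_ns * sum(pct * cycles for pct, cycles in mix) / 100


CLOCK_NS = 10
MIX = [(40, 4),  # ALU operations
       (20, 4),  # branch operations
       (40, 5)]  # memory reference operations

unpipelined_ns = average_instruction_time(CLOCK_NS, MIX)

# Pipelined: ideally one instruction completes per clock; skew and
# set-up stretch the clock by 1 ns.
pipelined_ns = CLOCK_NS + 1

speedup = unpipelined_ns / pipelined_ns
print(unpipelined_ns, pipelined_ns, speedup)
```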

Pipeline Systems: Example

Assume that the times required for the five units in an instruction cycle are 10 ns, 8 ns, 10 ns, 10 ns, and 7 ns. Further, assume that pipelining adds 1 ns of overhead. Find the speed-up factor:

Ave. instr. exec. time (unpipelined) = 10 + 8 + 10 + 10 + 7 = 45 ns

The pipeline clock is set by the slowest unit (10 ns) plus the overhead:

Ave. instr. exec. time (pipelined) = 10 + 1 = 11 ns
Speed up = 45/11 = 4.1
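With unbalanced stages, the pipelined clock has to accommodate the slowest unit, which is where the 11 ns figure comes from (10 ns plus 1 ns overhead). A short sketch under that assumption; the function name `pipeline_speedup` is hypothetical.

```python
def pipeline_speedup(stage_ns, overhead_ns):
    """Speed-up of a pipeline whose clock is set by its slowest stage."""
    unpipelined = sum(stage_ns)              # one instruction uses every unit in turn
    pipelined = max(stage_ns) + overhead_ns  # one instruction completes per clock
    return unpipelined, pipelined, unpipelined / pipelined


unpipelined, pipelined, speedup = pipeline_speedup([10, 8, 10, 10, 7], 1)
print(unpipelined, pipelined, round(speedup, 1))
```

The result falls short of the ideal factor of 5 for a five-stage pipe because the 8 ns and 7 ns units sit idle for part of every clock and every clock pays the 1 ns overhead.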

Pipeline Systems

Issues of concern:

• Overlapping operations should not overcommit resources: every pipe stage is active on every clock cycle.
• All operations in a pipe stage should complete in one clock, and any combination of operations should be able to occur at once.

Pipeline Systems

Pipeline systems can be further classified as:

• Linear Pipe / Feedback Pipe
• Scalar Pipe / Vector Pipe
• Uni-function Pipe / Multifunction Pipe
• Statically/Dynamically Configured Pipe

Uni-function Pipeline: the pipeline is dedicated to a specific function. CRAY-1 has 12 dedicated pipelines.

Multifunction Pipeline: the pipeline system may perform different functions either at different times or at the same time. TI-ASC has multifunction pipelines reconfigurable for a variety of arithmetic operations.

Static Pipeline: the pipeline system assumes one configuration at a time.

Dynamic Pipeline: the pipeline system allows several functional configurations to exist simultaneously.

Pipeline Systems: Example

In a multifunction pipe of 5 stages, calculate the speed-up factor for

$$Y = \sum_{i=1}^{n} A(i) \times B(i)$$

Product terms will be generated in $(n-1) + 5$ steps.

Additions will be performed in

$$\left[5 + \left(\tfrac{n}{2} - 1\right)\right] + \left[5 + \left(\tfrac{n}{4} - 1\right)\right] + \dots + \left[5 + (1 - 1)\right] \approx 4\log_2 n + n$$

steps.

Speed-up ratio:

$$S = \frac{5(2n-1)}{2n + 4\log_2 n + 4} \approx 5 \quad \text{for large } n$$
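The step counts in this example can be tallied directly. The sketch below assumes n is a power of two and that each round of pairwise additions starts only after the previous round has drained, as the sum on the slide implies; `dot_product_steps` and `speedup` are illustrative names. The exact addition count comes out as 4 log2 n + n - 1, one step fewer than the rounded figure used in the slide's denominator.

```python
def dot_product_steps(n: int, stages: int = 5) -> int:
    """Pipeline steps for n products followed by a tree of pairwise additions."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    steps = stages + (n - 1)           # n multiplications streamed through the pipe
    terms = n
    while terms > 1:
        terms //= 2                    # this round performs `terms` additions
        steps += stages + (terms - 1)  # fill the pipe, then one result per step
    return steps


def speedup(n: int, stages: int = 5) -> float:
    """Ratio against a non-pipelined unit taking `stages` steps per operation."""
    serial = stages * (2 * n - 1)      # n multiplications + (n - 1) additions
    return serial / dot_product_steps(n, stages)


for n in (8, 1024, 2**20):
    print(n, dot_product_steps(n), round(speedup(n), 3))
```

The ratio approaches the number of stages only for long vectors, since each of the log2 n addition rounds pays the pipe-fill cost again.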

Pipeline Systems

A concept known as hazard is a major concern in a pipeline organization.

A hazard prevents the pipeline from accepting data at the maximum rate that the staging clock might support.

A hazard can be of three types:

• Structural Hazard: Arises from resource conflicts when the hardware cannot support all possible combinations of instructions in simultaneous overlapped execution; two different pieces of data attempt to use the same stage at the same time.
• Data-Dependent Hazard: Arises when an instruction depends on the result of a previous instruction; the pass through a stage is a function of the data value.
• Control Hazard: Arises from the pipelining of instructions that affect the PC (branches).

Pipeline Systems

Structural Hazard (assume a single-memory pipeline system)

[Figure: pipeline timing diagram. Instructions i through i+4 each pass through IF, ID, EX, MEM, WB; the diagram marks a stall.]

Page 73: Computation Gap Instruction Level ParallelismComputation Gap XAccess Gap ¾In early computers such IBM 704, CPU and m memory cycle time were identical — i.e., 12 Psec. ¾IBM 360/195

Computation Gap

Pipeline Systems
Data Hazard

• A data hazard is created whenever there is a dependence between instructions, and they are close enough that the overlap caused by pipelining would change the order of access to an operand.

ADD R1, R2, R3
SUB R4, R1, R5


Computation Gap

Pipeline Systems
Data Hazard

• A data hazard can be resolved with a simple forwarding technique: if the forwarding hardware detects that the previous ALU operation has written to a source register of the current ALU operation, control logic selects the forwarded result as the ALU input rather than the value read from the register file.
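The selection rule described above can be sketched in a few lines of Python. This is only an illustration: the function name `select_alu_input`, the dictionary standing in for the register file, and the sample register values are assumptions made for the sketch, not part of the slides.

```python
def select_alu_input(src_reg, register_file, prev_dest, prev_result):
    """Pick the ALU operand for src_reg.

    If the previous ALU operation wrote src_reg, its result has not yet
    reached the register file, so the forwarded value is selected.
    Otherwise the value read from the register file is used.
    """
    if prev_dest is not None and prev_dest == src_reg:
        return prev_result          # forwarded from the previous ALU result
    return register_file[src_reg]   # normal register file read


# Hypothetical register contents; R1 still holds a stale value.
regs = {"R1": 0, "R2": 5, "R3": 7, "R5": 2}

# ADD R1, R2, R3 has produced 12 but has not written R1 back yet.
add_result = regs["R2"] + regs["R3"]

# SUB R4, R1, R5 picks up the forwarded R1 and the stored R5.
a = select_alu_input("R1", regs, prev_dest="R1", prev_result=add_result)
b = select_alu_input("R5", regs, prev_dest="R1", prev_result=add_result)
sub_result = a - b
```

Without the forwarding branch, the SUB would read the stale value of R1 from the register file.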


Computation Gap

Pipeline Systems
Data Hazard: Classification

• Assume i and j are two instructions and j is a successor of i; then one could expect three types of data hazard:
− Read after write (RAW)
− Write after write (WAW)
− Write after read (WAR)


Computation Gap

Pipeline Systems
Data Hazard: Classification

• Read after write (RAW): j reads a source before i writes it (flow dependence).

• Write after write (WAW): j writes into the same destination as i does (output dependence).

LW R1, 0(R2)      IF ID EX MEM1 MEM2 WB
Add R1, R2, R3    IF ID EX WB


Computation Gap

Pipeline Systems
Data Hazard: Classification

• Write after read (WAR): j writes into a source of i before i reads it (anti-dependence).

SW 0(R1), R2      IF ID EX MEM1 MEM2 WB
Add R2, R4, R3    IF ID EX WB
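The three classes can be told apart by comparing the destination and source registers of i and j. The sketch below is a hedged illustration: the function name and the `(destination, sources)` tuple encoding are assumptions. It reports potential hazards only; whether a stall actually occurs depends on the pipeline timing.

```python
def classify_hazards(i, j):
    """Return the potential data hazards between i and its successor j.

    Each instruction is given as (destination, sources), where
    destination is a register name or None (for example, a store) and
    sources is a set of register names.
    """
    i_dest, i_srcs = i
    j_dest, j_srcs = j
    hazards = set()
    if i_dest is not None and i_dest in j_srcs:
        hazards.add("RAW")   # j reads what i writes
    if i_dest is not None and i_dest == j_dest:
        hazards.add("WAW")   # j writes the same destination as i
    if j_dest is not None and j_dest in i_srcs:
        hazards.add("WAR")   # j writes a register that i reads
    return hazards


# The instruction pairs used on the slides:
raw = classify_hazards(("R1", {"R2", "R3"}), ("R4", {"R1", "R5"}))  # ADD, SUB
waw = classify_hazards(("R1", {"R2"}), ("R1", {"R2", "R3"}))        # LW, Add
war = classify_hazards((None, {"R1", "R2"}), ("R2", {"R4", "R3"}))  # SW, Add
```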


Computation Gap

Pipeline Systems
Data Hazard: Forwarding

• One can use the concept of data forwarding to overcome stall(s) due to data hazards.

Add R1, R2, R3      IF ID EX MEM WB
Sub R4, R1, R5      IF ID EX MEM WB
And R6, R1, R7      IF ID EX MEM WB
OR R8, R1, R9       IF ID EX MEM WB
XOR R10, R1, R11    IF ID EX MEM WB


Computation Gap

Pipeline Systems
Data Hazard: Forwarding

• In some cases data forwarding does not work:

LW R1, 0(R2)      IF ID EX MEM WB
Sub R4, R1, R5    IF ID EX MEM WB
Add R6, R1, R7    IF ID EX MEM WB
OR R8, R1, R9     IF ID EX MEM WB

(Figure labels: Forwarding works; Forwarding does not work)


Computation Gap

Pipeline Systems
Data Hazard: Stalling

• In cases where data forwarding does not work, the pipe has to be stalled:

LW R1, A          IF ID EX MEM WB
Add R4, R1, R7    IF ID EX MEM WB
Sub R5, R1, R8    IF ID EX MEM WB
And R6, R1, R7    IF ID EX MEM WB


Computation Gap

Pipeline Systems
Data Hazard: Stalling

LW R1, A          IF ID EX MEM WB
Add R4, R1, R7    IF ID EX MEM WB
Sub R5, R1, R8    IF ID EX MEM WB
And R6, R1, R7    IF ID EX MEM WB

(Figure label: Stalls)


Computation Gap

Pipeline Systems
Data Hazard: Stalling

• The pipeline interlock detects a hazard and stalls the pipeline until the hazard is cleared.

• This delay cycle, called a bubble or pipeline stall, allows the load data to be generated by the time it is needed by the instruction.


Computation Gap

Pipeline Systems
Data Hazard: Stalling

• Let us look at A ← B + C

LW R1, B          IF ID EX MEM WB
LW R2, C          IF ID EX MEM WB
Add R3, R1, R2    IF ID EX MEM WB
ST A, R3          IF ID EX MEM WB

(Figure labels: Stall needed to allow the load of C to complete; Forwarding)


Computation Gap

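Pipeline Systems
Data Hazard: Example

• Assume 30% of the instructions are loads, and half the time the instruction following a load instruction depends on the result of the load. If the hazard creates a single-cycle delay, how much faster is the ideal pipelined machine?

CPI_ideal = 1
CPI_new = (0.7 * 1 + 0.3 * 1.5) = 1.15

A load followed by a dependent instruction half of the time costs 1 + 0.5 * 1 = 1.5 cycles on average, so the ideal machine is 1.15 / 1 = 1.15 times faster.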
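The same arithmetic can be checked with a short Python sketch. The function name and parameter names are assumptions chosen for illustration; the numbers are the ones given in the example.

```python
def cpi_with_load_stalls(load_fraction, dependent_fraction,
                         stall_cycles=1, base_cpi=1.0):
    """Average CPI when a fraction of loads is followed by a dependent
    instruction that stalls the pipeline for stall_cycles cycles."""
    return base_cpi + load_fraction * dependent_fraction * stall_cycles


cpi_ideal = 1.0
cpi_new = cpi_with_load_stalls(load_fraction=0.3, dependent_fraction=0.5)

# With equal clock rates and instruction counts, the speedup of the
# ideal machine is the ratio of the two CPI values.
speedup_of_ideal = cpi_new / cpi_ideal
```

Writing the CPI as 1 + 0.3 * 0.5 * 1 is equivalent to the slide's 0.7 * 1 + 0.3 * 1.5; both weight the extra stall cycle by how often it occurs.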


Computation Gap

Pipeline Systems
Data Hazard: Pipeline Scheduling or Instruction Scheduling

• The compiler attempts to schedule the pipeline to avoid the stalls by rearranging the code sequence to eliminate the hazard. This is software support to avoid data hazards.

• Sometimes, if the compiler cannot schedule the interlocks, a no-op instruction may be inserted.


Computation Gap

Pipeline Systems
Data Hazard: Pipeline Scheduling or Instruction Scheduling

• Let us look at the following sequence of instructions:

a = b + c
d = e - f


Computation Gap

Pipeline Systems
Data Hazard: Pipeline Scheduling or Instruction Scheduling

Load Rb, b        IF ID EX MEM WB
Load Rc, c        IF ID EX MEM WB
ADD Ra, Rb, Rc    IF ID EX MEM WB
Store a, Ra       IF ID EX MEM WB
Load Re, e        IF ID EX MEM WB
Load Rf, f        IF ID EX MEM WB
SUB Rd, Re, Rf    IF ID EX MEM WB
Store d, Rd       IF ID EX MEM WB
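To see what scheduling buys for this sequence, the following Python sketch counts load-use stalls. It assumes a pipeline with forwarding in which only an instruction immediately following a load that it depends on stalls, for one cycle. The `Instr` type, the function name, and the particular reordering shown are assumptions for illustration; the reordering is one possible schedule, not necessarily the one drawn on the slide.

```python
from typing import NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    op: str
    dest: Optional[str]      # register written, or None for a store
    srcs: Tuple[str, ...]    # registers read


def count_load_use_stalls(program):
    """Count one stall for every instruction that reads the destination
    of the load immediately before it."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        if prev.op == "Load" and prev.dest in cur.srcs:
            stalls += 1
    return stalls


load_b = Instr("Load", "Rb", ())
load_c = Instr("Load", "Rc", ())
add_a = Instr("ADD", "Ra", ("Rb", "Rc"))
store_a = Instr("Store", None, ("Ra",))
load_e = Instr("Load", "Re", ())
load_f = Instr("Load", "Rf", ())
sub_d = Instr("SUB", "Rd", ("Re", "Rf"))
store_d = Instr("Store", None, ("Rd",))

# Straightforward order: each arithmetic instruction follows its second load.
unscheduled = [load_b, load_c, add_a, store_a, load_e, load_f, sub_d, store_d]

# One possible reordering that separates every load from its first use.
scheduled = [load_b, load_c, load_e, add_a, load_f, store_a, sub_d, store_d]

stalls_before = count_load_use_stalls(unscheduled)
stalls_after = count_load_use_stalls(scheduled)
```

Under these assumptions the straightforward order incurs two stalls (ADD after the load of c, SUB after the load of f), while the reordered sequence incurs none and still computes the same values.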


Computation Gap

Pipeline Systems
Control Hazard

• If instruction i is a successful branch, then the PC is changed at the end of the MEM phase. This means stalling the next instructions for three clock cycles.


Computation Gap

Pipeline Systems
Control Hazard


Computation Gap

Pipeline Systems
Control Hazard: Observations

• Three clock cycles are wasted for every branch.
• However, the above sequence is not even possible, since we do not know the nature of the instruction until after instruction i + 1 is fetched.
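The cost of a three-cycle penalty on every branch can be estimated with a small Python sketch. The function name and the 20% branch frequency are hypothetical values chosen for illustration; only the three-cycle penalty comes from the slides.

```python
def cpi_with_branch_stalls(branch_fraction, penalty_cycles=3, base_cpi=1.0):
    """Average CPI when every branch wastes penalty_cycles clock cycles."""
    return base_cpi + branch_fraction * penalty_cycles


# Hypothetical workload in which 20% of the instructions are branches.
cpi = cpi_with_branch_stalls(branch_fraction=0.2)
slowdown_vs_ideal = cpi / 1.0
```

Even a modest branch frequency raises the CPI noticeably, which is why the penalty per branch is worth reducing.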


Computation Gap

Pipeline Systems
Control Hazard: Solution


Computation Gap

Pipeline Systems
Control Hazard: Solution

• Still, the performance penalty is severe.
• What are the solution(s) to speed up the pipeline?


Computation Gap

Pipeline Systems
Control Hazard: Reducing pipeline branch penalties

• Detect, earlier in the pipeline, whether or not the branch is successful.

• For a successful branch, calculate the value of the PC earlier.

• It should be noted that these solutions come at the expense of extra hardware.


Computation Gap

Pipeline Systems
Control Hazard: Reducing pipeline branch penalties

• Freeze the pipeline: hold any instruction after the branch until the branch destination is known. Easy to enforce.

• Assume unsuccessful branch: continue to fetch instructions as if the branch were a normal instruction. If the branch is taken, then stop the pipeline and restart the fetch.


Computation Gap

Pipeline Systems
Control Hazard: Reducing pipeline branch penalties

• Assume the branch is successful: as soon as the target address is calculated, fetch and execute instructions at the target.

• Delayed Branch: software attempts to make the successor instruction valid and useful.


Computation Gap

Pipeline Systems
Control Hazard: Reducing pipeline branch penalties

• Assume the branch is not successful:

Untaken branch Inst.    IF ID EX MEM WB
Inst. i + 1             IF ID EX MEM WB
Inst. i + 2             IF ID EX MEM WB
Inst. i + 3             IF ID EX MEM WB
Inst. i + 4             IF ID EX MEM WB


Computation Gap

Pipeline Systems
Control Hazard: Reducing pipeline branch penalties


Computation Gap

Pipeline Systems
Structural Hazard

• For statically configured pipelines, one can predict precisely when a structural hazard might occur, and hence it is possible to schedule the pipeline so that the collisions do not occur.


Computation Gap

Pipeline Systems
Structural Hazard

• Let S_i (1 ≤ i ≤ n) denote a stage of a pipeline that performs a well-defined subtask with a delay time τ_i.
• Define latency as the minimum time elapsed between the initiation of two processes. Therefore, for a linear pipeline the latency is max(τ_i), 1 ≤ i ≤ n.


Computation Gap

Pipeline Systems
Structural Hazard

• A reservation table represents the flow of data through the pipeline for one complete evaluation of a given function.

• It is a table which shows the activation of the pipeline stages at each moment of time.


Computation Gap

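Pipeline Systems
Structural Hazard

• Assume the following pipeline:

[Figure: pipeline diagram. The input enters S0, followed by S1, S2, and S3; the figure also shows S4, S5, and feedback paths labeled "2nd pass" and "3rd pass".]

τ_i = t for 0 ≤ i ≤ 6 and i ≠ 3; τ_3 = 2t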


Computation Gap

Pipeline Systems
Structural Hazard

• The following is the reservation table of the aforementioned pipeline organization:

[Reservation table: stages S0 to S6 (rows) against time t0 to t12 (columns); an x marks each time unit in which a stage is active.]


Computation Gap

Pipeline Systems
Structural Hazard

• For a given pipeline organization, one can always derive its unique reservation table. However, different pipeline organizations might have the same reservation table.


Computation Gap

Pipeline Systems
Structural Hazard

• A pipeline is statically configured if it assumes the same reservation table for each activation.

• A pipeline is multiply configured if the reservation table is one from a set of reservation tables.

• A pipeline is dynamically configured if an activation does not have a predetermined reservation table.


Computation Gap

Pipeline Systems
Structural Hazard

• A collision occurs if two or more activations attempt to use the same pipeline segment simultaneously.

• A collision will occur if reservation tables are offset by l time units and activations of the same pipeline segment overlap.

• l is called a forbidden latency: two activations should not be initiated l time units apart.
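The forbidden latencies follow directly from a reservation table: any distance between two marks in the same row would make two activations meet in that stage. A minimal Python sketch, assuming the table is stored as a mapping from stage name to the time units in which it is active. The function name and the small three-stage table are hypothetical and are not the table from the slides.

```python
def forbidden_latencies(reservation_table):
    """Return the sorted forbidden latencies of a reservation table.

    reservation_table maps each stage to the time units in which that
    stage is used during one activation. Every distance between two
    uses of the same stage is a forbidden latency.
    """
    forbidden = set()
    for times in reservation_table.values():
        ordered = sorted(times)
        for a in range(len(ordered)):
            for b in range(a + 1, len(ordered)):
                forbidden.add(ordered[b] - ordered[a])
    return sorted(forbidden)


# Hypothetical table: S0 is reused after 4 time units, S1 after 2.
example_table = {"S0": [0, 4], "S1": [1, 3], "S2": [2]}
example_forbidden = forbidden_latencies(example_table)
```

In this hypothetical table, starting a second activation 2 or 4 time units after the first would make both claim S1 or S0 in the same time unit.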


Computation Gap

Pipeline Systems
Structural Hazard

• Given a pipeline system, one can define the forbidden list L as the set of forbidden latencies:

L = (l_1, l_2, ..., l_n)

• For a given L we can define the collision vector C as:

C = (C_n, C_{n-1}, ..., C_1), where n = max(l_j), l_j ∈ L, and
C_i = 1 if i ∈ L
C_i = 0 otherwise
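The definition translates into a few lines of Python. The function name is an assumption; the vector is returned as a bit string written from C_n down to C_1, matching the notation above.

```python
def collision_vector(forbidden_list):
    """Build the collision vector (C_n ... C_1) as a bit string,
    where n is the largest forbidden latency."""
    forbidden = set(forbidden_list)
    n = max(forbidden)
    return "".join("1" if i in forbidden else "0" for i in range(n, 0, -1))


# The forbidden list used later in the slides.
example_cv = collision_vector([4, 8, 1, 3, 5])
```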


Computation Gap

Pipeline Systems
Structural Hazard

• The collision vector can be interpreted as follows: an initiation is allowed at every time unit i such that C_i = 0. This allows us to build a finite state diagram of possible initiations.

• The initial state is the collision vector, and each state is represented by a combination of collision vectors which have led to such a state.


Computation Gap

Pipeline Systems
Structural Hazard

• In the case of our example we have:

L = (4, 8, 1, 3, 5)
C = 10011101

• Initial state 10011101 indicates that we can have a new initiation at time 2, 6, or 7. If we have a new initiation at time 2, then the finite state machine would be in state (00100111) OR (10011101) = 10111111, where 00100111 is the initial state shifted right by two positions. Following such a procedure, we then have:


Computation Gap

Pipeline Systems
Structural Hazard


Computation Gap

Pipeline Systems
Structural Hazard

• From the state diagram, one can then design a controller to regulate the initiation of the activations.

• In a state diagram:
− The simple cycle is a cycle in which each state appears only once.
− The average latency of a cycle is the sum of its latencies divided by the number of states in the cycle.
− The greedy cycle is a cycle that always minimizes the latency between the current initiation and the very next initiation.
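The state diagram and its greedy cycle can be generated mechanically from the collision vector. The Python sketch below is one way to do it, under the usual assumption that any latency larger than the vector length is always allowed and returns the machine to the initial state. The function names and the integer encoding of states are illustrative choices, not taken from the slides.

```python
def build_state_diagram(collision_vector):
    """Map each reachable state to {latency: next_state}.

    States are integers in which bit (i - 1) holds C_i. The key
    len(collision_vector) + 1 stands for any latency beyond the
    vector length, which leads back to the initial state.
    """
    n = len(collision_vector)
    initial = int(collision_vector, 2)
    diagram = {}
    pending = [initial]
    while pending:
        state = pending.pop()
        if state in diagram:
            continue
        edges = {}
        for latency in range(1, n + 1):
            if ((state >> (latency - 1)) & 1) == 0:      # C_latency is 0
                edges[latency] = (state >> latency) | initial
        edges[n + 1] = initial
        diagram[state] = edges
        pending.extend(edges.values())
    return diagram


def greedy_cycle(diagram, start):
    """Follow the smallest allowed latency from each state until a state
    repeats; return the cycle's latencies and its average latency."""
    latencies = []
    first_visit = {}
    state = start
    while state not in first_visit:
        first_visit[state] = len(latencies)
        latency = min(diagram[state])
        latencies.append(latency)
        state = diagram[state][latency]
    cycle = latencies[first_visit[state]:]
    return cycle, sum(cycle) / len(cycle)


diagram = build_state_diagram("10011101")
states = sorted(format(s, "08b") for s in diagram)
cycle, average_latency = greedy_cycle(diagram, int("10011101", 2))
```

For the collision vector 10011101 this yields three states (10011101, 10011111, 10111111) and a greedy cycle with latencies 2 and 7, whose average latency is 4.5.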


Computation Gap

Pipeline Systems: Example
For a 5-stage pipeline characterized by the following reservation table:

[Reservation table: stages 1 to 5 (rows) against time 0 to 8 (columns); an x marks each time unit in which a stage is active.]

a) Determine the forbidden list and the collision vector.
b) Draw the state diagram and determine the minimal average latency and the maximum throughput.


Computation Gap

Pipeline Systems
Multifunction Pipeline

• The scheduling method for static uni-function pipelines can be generalized for multifunction pipelines.

• A pipeline that can perform P distinct functions can be classified by P overlaid reservation tables.


Computation Gap

Pipeline Systems
Multifunction Pipeline

• Each task to be initiated can be associated with a function tag to identify the reservation table to be used.

• In this case a collision may occur between two or more tasks with the same function tag or with distinct function tags.


Computation Gap

Pipeline Systems
Multifunction Pipeline

• For example, the following reservation table characterizes a 3-stage, 2-function pipeline:

[Reservation table: stages 1 to 3 (rows) against time 0 to 4 (columns), with entries marked A, B, or AB.]


Computation Gap

Pipeline Systems
Multifunction Pipeline

• A forbidden set of latencies for a multifunction pipeline is the collection of collision-causing latencies.

• A cross-collision vector marks the forbidden latencies between the functions, i.e., vAB represents the forbidden latencies between A and B. Therefore, for a P-function pipeline one can define P^2 cross-collision vectors.

• The P^2 cross-collision vectors can be represented by P collision matrices.


Computation Gap

Pipeline Systems
Multifunction Pipeline

• For our example we have:

Cross-Collision Vectors
vAA = (0110)   vAB = (1011)
vBA = (1010)   vBB = (0110)

Collision Matrices


Computation Gap

Pipeline Systems
Multifunction Pipeline

• Similar to the uni-function pipeline, one can use the collision matrices in order to construct a state diagram.


Computation Gap

Pipeline Systems
Multifunction Pipeline

[Figure: state diagram of the 2-function pipeline. State labels: 01101010, 01111111, 10110111, 01111010, 11110111, 10110110. Edge labels: A1, A3, A4, A5+, B1, B3, B4, B5+.]


Computation Gap

Pipeline Systems
Vector Processors

• A vector processor is equipped with multiple vector pipelines that can be concurrently used under hardware or firmware control.

• There are two groups of pipeline processors:
− Memory-to-Memory Architecture
− Register-to-Register Architecture


Computation Gap

Pipeline Systems
Vector Processors

• Memory-to-Memory architecture supports the pipelined flow of vector operands directly from the memory to the pipelines and then back to the memory (Cyber 205).

• Register-to-Register architecture uses vector registers as operands for the functional pipelines (Cray series that use fixed-size registers and Fujitsu VP2000 series that use reconfigurable vector registers).


Computation Gap

Pipeline Systems
Vector Processors

• Vector machines allow efficient use of pipelining while reducing memory latency and pipeline scheduling penalties.

• Computations on vector elements are mainly independent of each other, hence the lack of data hazards.


Computation Gap

Pipeline Systems
Vector Processors

• A vector instruction is equivalent to a loop, which implies:
− Smaller program size, hence reducing the instruction bandwidth requirement.
− Fewer control hazards.

• Vector instructions initiate a regular operand fetch pattern, allowing efficient use of memory interleaving and efficient address calculations.


Computation Gap

Pipeline Systems
Vector Processors: Vector Stride

• What if adjacent vector elements of a vector operand are not positioned in sequence in the memory?

• The distance separating elements that ought to be merged into a single vector is called the stride.

• Almost all vector machines allow access to vectors with any constant stride. Some constant strides may cause memory-bank conflicts.
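Why only some strides cause conflicts can be illustrated with a short Python sketch. It assumes low-order interleaving, in which the element at word address a lives in bank a mod B; the function names and the choice of 8 banks are hypothetical.

```python
from math import gcd


def banks_touched(stride, num_banks):
    """Number of distinct banks visited by a long constant-stride access
    under low-order interleaving (bank = address mod num_banks)."""
    return num_banks // gcd(stride, num_banks)


def banks_touched_by_simulation(stride, num_banks, length):
    """Same quantity obtained by enumerating the element addresses."""
    return len({(k * stride) % num_banks for k in range(length)})


# Hypothetical memory with 8 banks.
unit_stride = banks_touched(1, 8)     # all banks share the load
stride_two = banks_touched(2, 8)      # only half of the banks are used
stride_eight = banks_touched(8, 8)    # every access hits the same bank
```

Under this assumption, a stride that shares a large common factor with the number of banks concentrates the accesses on a few banks, while a stride that is relatively prime to it spreads them over all banks.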


Computation Gap

Pipeline Systems
Vector Processors: Chaining

• Chaining allows a vector operation to start as soon as the individual elements of its vector source operand become available.

• Results of the first functional unit (pipeline) in the chain are forwarded to the second functional unit.


Computation Gap

Pipeline Systems
Efficient Use of Vector Processors: Memory-to-Memory Organization

• Increase the vector size, if possible:
− Change the nesting order of the loop,
− Convert multidimensional arrays into one-dimensional arrays,
− Rearrange data into unconventional forms so that small vectors may be combined into a single large vector.
• Perform as many operations on an input vector as possible before storing the result vector back in the main memory.


Computation Gap

Pipeline Systems
Efficient Use of Vector Processors: Memory-to-Memory Organization

• Changing the nesting order of the loop:

Do I = 1, 100
   A(I, 1:60) = 0
End

becomes

Do J = 1, 60
   A(1:100, J) = 0
End


Computation Gap

Pipeline Systems
Efficient Use of Vector Processors: Memory-to-Memory Organization

• Convert multidimensional arrays into one-dimensional arrays:

Do I = 1, 100
   A(I, 1:60) = 0
End

becomes

A(1:6000) = 0


Computation Gap

Pipeline Systems
Efficient Use of Vector Processors: Register-to-Register Organization

• Values often used in a program should be kept in internal registers.

• Perform as many operations on an input vector as possible before storing the result vector back in the main memory.

• Organize vectors into sections of size equal to the length of the vector registers (strip mining).

• Convert multidimensional arrays into one-dimensional arrays.
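Strip mining, mentioned in the list above, can be sketched as a loop that cuts a long vector into register-sized sections. The function name and the register length of 64 elements are assumptions for illustration.

```python
def strip_mine(vector_length, register_length):
    """Split a vector into (start, length) sections no longer than
    the vector register length."""
    return [
        (start, min(register_length, vector_length - start))
        for start in range(0, vector_length, register_length)
    ]


# Hypothetical 200-element vector on a machine with 64-element registers.
sections = strip_mine(200, 64)
```

Each section can then be processed with one pass through the vector registers; the last section is shorter when the vector length is not a multiple of the register length.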


Computation Gap

Question
Compare and contrast Memory-to-Memory and Register-to-Register pipeline systems against each other.