cs252 spring 2017 graduate computer architecture lecture...

WU UCB CS252 SP17

CS252 Spring 2017Graduate Computer Architecture

Lecture 4:Pipelining

Lisa Wu, Krste Asanovichttp://inst.eecs.berkeley.edu/~cs252/sp17

©KrsteAsanovic,2015CS252,Fall2015,Lecture4

LastTimeinLecture3

§ Microcoding,aneffectivetechniquetomanagecontrolunitcomplexity,inventedinerawhenlogic(tubes),mainmemory(magneticcore),andROM(diodes)useddifferenttechnologies

§ DifferencebetweenROMandRAMspeedmotivatedadditionalcomplexinstructions

§ TechnologyadvancesleadingtofastSRAMmadetechnologyassumptionsinvalid

§ Complexinstructionssetsimpedeparallelandpipelinedimplementations

§ Load/store,register-richISAs(pioneeredbyCray,popularizedbyRISC)performbetterinnewVLSItechnology

2


§ Instructionsperprogramdependsonsourcecode,compilertechnology,andISA

§ Cyclesperinstructions(CPI)dependsonISAandµarchitecture

§ Timepercycledependsupontheµarchitectureandbasetechnology

3

Time =Instructions Cycles TimeProgramProgram*Instruction*Cycle

“IronLaw”ofProcessorPerformance

Microarchitecture CPI cycletimeMicrocoded >1 shortSingle-cycleunpipelined 1 longPipelined 1 short


MemoryEXecuteDecodeFetch

Classic5-StageRISCPipeline

4

Registers

ALU

BA

DataCache

PC

InstructionCache

Store

Imm

Inst.Register

Writeback

Thisversiondesignedforregfiles/memorieswithsynchronousreadsandwrites.


CPIExamples

5

Time

Inst3

7cycles

Inst1 Inst2

5cycles 10cyclesMicrocoded machine

3instructions,22cycles,CPI=7.33Unpipelined machine

3instructions,3cycles,CPI=1

Inst1 Inst2 Inst3

Pipelinedmachine

3instructions,3cycles,CPI=1Inst1

Inst2Inst3 5-stagepipelineCPI≠5!!!


Instructionsinteractwitheachotherinpipeline§ Aninstructioninthepipelinemayneedaresourcebeingusedbyanotherinstructioninthepipelineà structuralhazard

§ Aninstructionmaydependonsomethingproducedbyanearlierinstruction- Dependencemaybeforadatavalue

à datahazard- Dependencemaybeforthenextinstruction’saddress

à controlhazard(branches,exceptions)

§ HandlinghazardsgenerallyintroducesbubblesintopipelineandreducesidealCPI>1

6


PipelineCPIExamples

7

Time

3instructionsfinishin3cyclesCPI=3/3=1

Inst1Inst2

Inst3

3instructionsfinishin4cyclesCPI=4/3=1.33

Inst1Inst2

Inst3Bubble

Inst1

Inst2Inst3

Bubble1

Bubble2

Measurefromwhenfirstinstructionfinishestowhenlastinstructioninsequencefinishes.

3instructionsfinishin5cyclesCPI=5/3=1.67

Inst3


ResolvingStructuralHazards

§ Structuralhazardoccurswhentwoinstructionsneedsamehardwareresourceatsametime- Canresolveinhardwarebystallingnewerinstructiontillolderinstructionfinishedwithresource

§ Astructuralhazardcanalwaysbeavoidedbyaddingmorehardwaretodesign- E.g.,iftwoinstructionsbothneedaporttomemoryatsametime,couldavoidhazardbyaddingsecondporttomemory

§ ClassicRISC5-stageintegerpipelinehasnostructuralhazardsbydesign-ManyRISCimplementationshavestructuralhazardsonmulti-cycleunitssuchasmultipliers,dividers,floating-pointunits,etc.,andcanhaveonregisterwriteback ports

8


TypesofDataHazards

9

Considerexecutingasequenceofregister-registerinstructionsoftype:

rk ß ri oprjData-dependence

r3 ß r1 opr2 Read-after-Writer5 ß r3 opr4 (RAW)hazard

Anti-dependencer3 ß r1 opr2 Write-after-Readr1 ß r4 opr5 (WAR)hazard

Output-dependencer3 ß r1 opr2 Write-after-Writer3 ß r6 opr7 (WAW)hazard


ThreeStrategiesforDataHazards

§ Interlock-Waitforhazardtoclearbyholdingdependentinstructioninissuestage

§ Bypass- Resolvehazardearlierbybypassingvalueassoonasavailable

§ Speculate-Guessonvalue,correctifwrong

10


InterlockingVersusBypassing

11

add x1, x3, x5sub x2, x1, x4

F add x1, x3, x5D

F

X

D

F

sub x2, x1, x4

W

M

X bubble

F

D

W

X M W

M W

W

M

D

X bubble

M

X bubble

D

F

Instructioninterlockedindecodestage

F D X M W add x1, x3, x5

F D X M W sub x2, x1, x4

BypassaroundALUwithnobubbles



ExampleBypassPath

12

Registers

ALU

BA

DataCache

PC

InstructionCache

Store

Imm

Inst.Register

Writeback



FullyBypassedDataPath

13

Registers

ALU

BA

DataCache

PC

InstructionCache

Store

Imm

Inst.Register

Writeback

F D X M W

F D X M W

F D X M W

F D X M W[AssumesdatawrittentoregistersinaWcycleisreadableinparallelDcycle(dottedline).Extrawritedataregisterandbypasspathsrequiredifthisisnotpossible.]


ValueSpeculationforRAWDataHazards

§ Ratherthanwaitforvalue,canguessvalue!

§ Sofar,onlyeffectiveincertainlimitedcases:- Branchprediction- Stackpointerupdates-Memoryaddressdisambiguation

14


ControlHazards

WhatdoweneedtocalculatenextPC?

§ ForUnconditionalJumps- Opcode,PC,andoffset

§ ForJumpRegister- Opcode,Registervalue,andoffset

§ ForConditionalBranches- Opcode,Register(forcondition),PCandoffset

§ Forallotherinstructions- Opcode andPC(andhavetoknowit’snotoneofabove)

15



Controlflowinformationinpipeline

16

Registers B

A

DataCache

PC

InstructionCache

Store

Imm

Inst.Register

Writeback

PCknown Opcode,offsetknown

Branchcondition,Jumpregistervalueknown

ALU


EXecuteDecodeFetch

RISC-VUnconditionalPC-RelativeJumps

17

Registers B

A

InstructionCache

Imm

Inst.Register

ALU

PC_d

ecod

e

Add

Jump?PCJumpSelPC

_fetch

Kill

FKill

+4

[Killbitturnsinstructionintoabubble]


PipeliningforUnconditionalPC-RelativeJumps

18

M W

X M W

D X M W

j targetF D

F

target: add x1, x2, x3

X

D

F

bubble


BranchDelaySlots§ EarlyRISCsadoptedideafrompipelinedmicrocodeengines,andchangedISAsemanticssoinstructionafter branch/jumpisalwaysexecutedbeforecontrolflowchangeoccurs:0x100 j target0x104 add x1, x2, x3 // Executed before target…0x205 target: xori x1, x1, 7

§ Softwarehastofilldelayslotwithusefulwork,orfillwithexplicitNOPinstruction

19

M W

X M W

D X M W

j targetF D

F

target: xori x1, x1, 7

X

D

F

add x1, x2, x3


Post-1990RISCISAsdon’thavedelayslots

§ Encodesmicroarchitectural detailintoISA- c.f.IBM650drumlayout

§ Performanceissues- IncreasedI-cachemissesfromNOPsinunuseddelayslots- I-cachemissondelayslotcausesmachinetowait,evenifdelayslotisaNOP

§ Complicatesmoreadvancedmicroarchitectures- Consider30-stagepipelinewithfour-instruction-per-cycleissue

§ Betterbranchpredictionreducedneed- Branchpredictioninlaterlecture

20


EXecuteDecodeFetch

RISC-VConditionalBranches

21

Registers B

A

InstructionCache

Inst.

Inst.Register

ALU

PC_d

ecod

e

Add

Branch?PCSelPC

_fetch

Kill

FKill

+4

Cond?

PC_execute

Add

Kill

DKill


PipeliningforConditionalBranches

22

M W

X M W

D X M W

beq x1, x2, targetF D

F


X

D

F

bubble

bubble

F D X M W


PipeliningforJumpRegister

§ Registervalueobtainedinexecutestage

23

M W

X M W

D X M W

jr x1F D

F


X

D

F

bubble

bubble

F D X M W


Whyinstructionmaynotbedispatchedeverycycleinclassic5-stagepipeline(CPI>1)

§ Fullbypassingmaybetooexpensivetoimplement- typicallyallfrequentlyusedpathsareprovided- someinfrequentlyusedbypasspathsmayincreasecycletimeand

counteractthebenefitofreducingCPI§ Loadshavetwo-cyclelatency

- Instructionafterloadcannotuseloadresult- MIPS-IISAdefinedloaddelayslots,asoftware-visiblepipelinehazard

(compilerschedulesindependentinstructionorinserts NOPtoavoidhazard).RemovedinMIPS-II(pipelineinterlocksaddedinhardware)- MIPS:“Microprocessor withoutInterlockedPipelineStages”

§ Jumps/Conditionalbranchesmaycausebubbles- killfollowinginstruction(s)ifnodelayslots

24

Machineswithsoftware-visibledelayslotsmayexecutesignificantnumberofNOPinstructionsinsertedbythecompiler.NOPsreduceCPI,butincreaseinstructions/program!


TrapsandInterrupts

Inclass,we’llusefollowingterminology§ Exception:Anunusualinternaleventcausedbyprogramduringexecution- E.g.,pagefault,arithmeticunderflow

§ Trap:Forcedtransferofcontroltosupervisorcausedbyexception-Notallexceptionscausetraps(c.f.IEEE754floating-pointstandard)

§ Interrupt:Anexternaleventoutsideofrunningprogram,whichcausestransferofcontroltosupervisor

§ Trapsandinterruptsusuallyhandledbysamepipelinemechanism

25


HistoryofExceptionHandling

§ (AnalyticalEnginehadoverflowexceptions)§ FirstsystemwithtrapswasUnivac-I,1951

- Arithmeticoverflowwouldeither- 1.triggertheexecutionatwo-instructionfix-uproutineataddress0,or

- 2.attheprogrammer'soption,causethecomputertostop- LaterUnivac1103,1955,modifiedtoaddexternalinterrupts- Usedtogatherreal-timewindtunneldata

§ FirstsystemwithI/OinterruptswasDYSEAC,1954- Hadtwoprogramcounters,andI/OsignalcausedswitchbetweentwoPCs

- Also,firstsystemwithDMA(directmemoryaccessbyI/Odevice)- And,firstmobilecomputer(twotractortrailers,12tons+8tons)

26

WU UCB CS252 SP1727

DYSEAC:

front – control console and magnetic-wire input-output equipment

middle – the computer itself

middle rear – 512-word mercury delay-line memory

rear – air-conditioning


AsynchronousInterrupts

§ AnI/Odevicerequestsattentionbyassertingoneoftheprioritizedinterruptrequestlines

§ Whentheprocessordecidestoprocesstheinterrupt- ItstopsthecurrentprogramatinstructionIi,completingalltheinstructionsuptoIi-1 (preciseinterrupt)

- ItsavesthePCofinstructionIi inaspecialregister(EPC)- Itdisablesinterruptsandtransferscontroltoadesignatedinterrupthandlerrunninginthekernelmode

28


InterruptHandler

§ SavesEPCbeforeenablinginterruptstoallownestedinterruptsÞ- needaninstructiontomoveEPCintoGPRs- needawaytomaskfurtherinterruptsatleastuntilEPCcanbesaved

§ Needstoreada statusregister thatindicatesthecauseoftheinterrupt

§ Usesaspecial indirectjumpinstructionERET(return-from-environment)which- enablesinterrupts- restorestheprocessortotheusermode- restoreshardwarestatusandcontrolstate

29


SynchronousTrap

§ Asynchronoustrapiscausedbyanexceptiononaparticularinstruction

§ Ingeneral,theinstructioncannotbecompletedandneedstoberestarted aftertheexceptionhasbeenhandled- requiresundoingtheeffectofoneormorepartiallyexecutedinstructions

§ Inthecaseofasystemcalltrap,theinstructionisconsideredtohavebeencompleted- aspecialjumpinstructioninvolvingachangetoaprivilegedmode

30


ExceptionHandling5-StagePipeline

§ Howtohandlemultiplesimultaneousexceptionsindifferentpipelinestages?

§ Howandwheretohandleexternalasynchronousinterrupts?

31

PCInst.Mem D Decode E M

DataMem W+

IllegalOpcode Overflow Dataaddress

ExceptionsPCaddressException




32

PCInst.Mem D Decode E M

DataMem W+

IllegalOpcode

Overflow DataaddressExceptions

PCaddressException


ExcD

PCD

ExcE

PCE

ExcM

PCM

Cause

EPC

KillDStage

KillFStage

KillEStage

SelectHandlerPC

KillWriteback

CommitPoint



§ Holdexceptionflagsinpipelineuntilcommitpoint(Mstage)

§ Exceptionsinearlierpipestagesoverridelaterexceptionsforagiveninstruction

§ Injectexternalinterruptsatcommitpoint(overrideothers)

§ Ifexceptionatcommit:updateCauseandEPCregisters,killallstages,injecthandlerPCintofetchstage

33


SpeculatingonExceptions

§ Predictionmechanism- Exceptionsarerare,sosimplypredictingnoexceptionsisveryaccurate!

§ Checkpredictionmechanism- Exceptionsdetectedatendofinstructionexecutionpipeline,specialhardwareforvariousexceptiontypes

§ Recoverymechanism- Onlywritearchitecturalstateatcommitpoint,socanthrowawaypartiallyexecutedinstructionsafterexception

- Launchexceptionhandlerafterflushingpipeline

§ Bypassingallowsuseofuncommittedinstructionresultsbyfollowinginstructions

34


IssuesinComplexPipelineControl

35

IF ID WB

ALU Mem

Fadd

Fmul

Fdiv

Issue

GPRsFPRs

• Structuralconflictsattheexecutionstageifsome FPUormemoryunitisnotpipelinedandtakesmorethanonecycle• Structuralconflictsatthewrite-backstagedueto variablelatenciesofdifferentfunctionalunits• Out-of-orderwritehazardsduetovariablelatenciesofdifferentfunctionalunits• Howtohandleexceptions?


ComplexIn-OrderPipeline

§ Delaywriteback soalloperationshavesamelatencytoWstage- Writeportsneveroversubscribed

(oneinst.in&oneinst.outeverycycle)

- Stallpipelineonlonglatencyoperations,e.g.,divides,cachemisses

- Handleexceptionsin-orderatcommitpoint

36

CommitPoint

PCInst.Mem D Decode X1 X2

DataMem W+GPRs

X2 WFAdd X3

X3

FPRs X1

X2 FMul X3

X2FDiv X3

Unpipelineddivider

Howtopreventincreasedwriteback latencyfromslowingdownsinglecycleintegeroperations? Bypassing


In-OrderSuperscalarPipeline

§ Fetchtwoinstructionspercycle;issuebothsimultaneouslyifoneisinteger/memoryandotherisfloatingpoint

§ Inexpensivewayofincreasingthroughput,examplesincludeAlpha21064(1992)&MIPSR5000series(1996)

§ Sameideacanbeextendedtowiderissuebyduplicatingfunctionalunits(e.g.4-issueUltraSPARC &Alpha21164)butregfile portsandbypassingcostsgrowquickly

37

CommitPoint

2PC

Inst.Mem D

DualDecode X1 X2

DataMem W+GPRs

X2 WFAdd X3

X3

FPRs X1

X2 FMul X3

X2FDiv X3Unpipelineddivider


Acknowledgements

§ ThiscourseispartlyinspiredbypreviousMIT6.823andBerkeleyCS252computerarchitecturecoursescreatedbymycollaboratorsandcolleagues:- Arvind (MIT)- JoelEmer (Intel/MIT)- JamesHoe(CMU)- JohnKubiatowicz (UCB)- DavidPatterson(UCB)

38

cs252 spring 2017 graduate computer architecture lecture...

Documents