lecture 01: introduc/on - github pageslecture 01: introduc/on cse 564 computer architecture summer...

57
Lecture 01: Introduc/on CSE 564 Computer Architecture Summer 2017 Department of Computer Science and Engineering Yonghong Yan [email protected] www.secs.oakland.edu/~yan 1

Upload: others

Post on 10-Mar-2020

29 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Lecture 01: Introduc/on - GitHub PagesLecture 01: Introduc/on CSE 564 Computer Architecture Summer 2017 Department of Computer Science and Engineering Yonghong Yan yan@oakland.edu

Lecture01:Introduc/on

CSE564ComputerArchitectureSummer2017

DepartmentofComputerScienceandEngineeringYonghongYan

[email protected]/~yan

1

Page 2: Lecture 01: Introduc/on - GitHub PagesLecture 01: Introduc/on CSE 564 Computer Architecture Summer 2017 Department of Computer Science and Engineering Yonghong Yan yan@oakland.edu

CopyrightandAcknowledgement•  Mostslideswereadaptedfromlecturesnotesofthetwotextbookswithcopyrightofpublisherortheoriginal

authorsincludingElsevierInc,MorganKaufmann,DavidA.PaIersonandJohnL.Hennessy.•  Someslideswereadaptedfromthefollowingcourses:

–  UCBerkeleycourse“ComputerScience252:GraduateComputerArchitecture”ofDavidE.CullerCopyright2005UCB•  hIp://people.eecs.berkeley.edu/~culler/courses/cs252-s05/

–  GreatIdeasinComputerArchitecture(MachineStructures)byRandyKatzandBernhardBoser•  hIp://inst.eecs.berkeley.edu/~cs61c/fa16/

•  Ialsorefertothefollowingcoursesandlecturenoteswhenpreparingmaterialsforthiscourse–  ComputerScience152:ComputerArchitectureandEngineering,Spring2016byDr.GeorgeMichelogiannakisfrom

UCBerkeley•  hIp://www-inst.eecs.berkeley.edu/~cs152/sp16/

–  ComputerScience252:GraduateComputerArchitecture,Fall2015byProf.KrsteAsanovićfromUCBerkeley•  hIp://www-inst.eecs.berkeley.edu/~cs252/fa15/

–  ComputerScienceS250:VLSISystemsDesign,Spring2016byProf.JohnWawrzynekfromUCBerkeley•  hIp://www-inst.eecs.berkeley.edu/~cs250/sp16/

–  ComputerSystemArchitecture,Fall2005byDr.JoelEmerandProf.ArvindfromMIT•  hIp://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-823-computer-system-architecture-

fall-2005/–  SynthesisLecturesonComputerArchitecture

•  hIp://www.morganclaypool.com/toc/cac/1/1

•  Theusesoftheslidesofthiscourseareforeduca/onalpurposesonlyandshouldbeusedonlyinconjunc/onwiththetextbook.Deriva/vesoftheslidesmustacknowledgethecopyrightno/cesofthisandtheoriginals.Permissionforcommercialpurposesshouldbeobtainedfromtheoriginalcopyrightholderandthesuccessivecopyrightholdersincludingmyself. 2

Page 3: Lecture 01: Introduc/on - GitHub PagesLecture 01: Introduc/on CSE 564 Computer Architecture Summer 2017 Department of Computer Science and Engineering Yonghong Yan yan@oakland.edu

Contents

•  Computersandcomputercomponents•  Computerarchitecturesandgreatideasinhistoryandnow•  Performance

3

Page 4: Lecture 01: Introduc/on - GitHub PagesLecture 01: Introduc/on CSE 564 Computer Architecture Summer 2017 Department of Computer Science and Engineering Yonghong Yan yan@oakland.edu

TheComputerRevolu/on

•  Progressincomputertechnology–  UnderpinnedbyMoore’sLaw

•  Makesnovelapplicaconsfeasible–  Computersinautomobiles–  Cellphones–  Humangenomeproject–  WorldWideWeb–  SearchEngines

•  Computersarepervasive

4

Page 5: Lecture 01: Introduc/on - GitHub PagesLecture 01: Introduc/on CSE 564 Computer Architecture Summer 2017 Department of Computer Science and Engineering Yonghong Yan yan@oakland.edu

ClassesofComputers

•  PersonalMobileDevice(PMD)–  e.g.smartphones,tabletcomputers–  Emphasisonenergyefficiencyandreal-cme

•  DesktopCompucng–  Emphasisonprice-performance

•  Servers–  Emphasisonavailability,scalability,throughput

•  Clusters/WarehouseScaleComputers–  Usedfor“SogwareasaService(SaaS)”–  Emphasisonavailabilityandprice-performance–  Sub-class:Supercomputers,emphasis:floacng-pointperformanceand

fastinternalnetworks

•  EmbeddedComputers–  Emphasis:price

5

Page 6: Lecture 01: Introduc/on - GitHub PagesLecture 01: Introduc/on CSE 564 Computer Architecture Summer 2017 Department of Computer Science and Engineering Yonghong Yan yan@oakland.edu

ThePostPCEra

6

Page 7: Lecture 01: Introduc/on - GitHub PagesLecture 01: Introduc/on CSE 564 Computer Architecture Summer 2017 Department of Computer Science and Engineering Yonghong Yan yan@oakland.edu

ThePostPCEra

•  PersonalMobileDevice(PMD)–  BaIeryoperated–  ConnectstotheInternet–  Hundredsofdollars–  Smartphones,tablets,electronicglasses

•  Cloudcompucng–  WarehouseScaleComputers(WSC)–  SogwareasaService(SaaS)–  PorconofsogwarerunonaPMDandaporconruninthe

Cloud–  AmazonandGoogle

7

Page 8: Lecture 01: Introduc/on - GitHub PagesLecture 01: Introduc/on CSE 564 Computer Architecture Summer 2017 Department of Computer Science and Engineering Yonghong Yan yan@oakland.edu

OldSchoolComputer

8

Page 9: Lecture 01: Introduc/on - GitHub PagesLecture 01: Introduc/on CSE 564 Computer Architecture Summer 2017 Department of Computer Science and Engineering Yonghong Yan yan@oakland.edu

NewSchoolComputer(#1)

PersonalMobileDevices

99

Page 10: Lecture 01: Introduc/on - GitHub PagesLecture 01: Introduc/on CSE 564 Computer Architecture Summer 2017 Department of Computer Science and Engineering Yonghong Yan yan@oakland.edu

NewSchool“Computer”(#2)

1010

Page 11: Lecture 01: Introduc/on - GitHub PagesLecture 01: Introduc/on CSE 564 Computer Architecture Summer 2017 Department of Computer Science and Engineering Yonghong Yan yan@oakland.edu

ComponentsofaComputer

•  Samecomponentsforallkindsofcomputer–  Desktop,server,

embedded

•  Input/outputincludes–  User-interfacedevices

•  Display,keyboard,mouse–  Storagedevices

•  Harddisk,CD/DVD,flash–  Networkadapters

•  Forcommunicacngwithothercomputers

TheBIGPicture

Page 12: Lecture 01: Introduc/on - GitHub PagesLecture 01: Introduc/on CSE 564 Computer Architecture Summer 2017 Department of Computer Science and Engineering Yonghong Yan yan@oakland.edu

InsidetheProcessor(CPU)

•  Funcconalunits:performscomputacons•  Datapath:performsoperaconsondata•  Control:sequencesdatapath,memory,...•  Cachememory

–  SmallfastSRAMmemoryforimmediateaccesstodata

Apple A5

12

Page 13: Lecture 01: Introduc/on - GitHub PagesLecture 01: Introduc/on CSE 564 Computer Architecture Summer 2017 Department of Computer Science and Engineering Yonghong Yan yan@oakland.edu

ASafePlaceforData

•  Volaclemainmemory–  Losesinstrucconsanddatawhenpoweroff

•  Non-volaclesecondarymemory–  Magneccdisk–  Flashmemory–  Opccaldisk(CDROM,DVD)

Page 14: Lecture 01: Introduc/on - GitHub PagesLecture 01: Introduc/on CSE 564 Computer Architecture Summer 2017 Department of Computer Science and Engineering Yonghong Yan yan@oakland.edu

Contents

•  Computersandcomputercomponents•  Computerarchitecturesandgreatideasinhistoryandnow

•  Performance

14

Page 15: Lecture 01: Introduc/on - GitHub PagesLecture 01: Introduc/on CSE 564 Computer Architecture Summer 2017 Department of Computer Science and Engineering Yonghong Yan yan@oakland.edu

Whatis“ComputerArchitecture”?

15

Applica'ons

InstrucconSetArchitecture

Compiler

OperacngSystem

Firmware

I/OsystemInstr.SetProc.

DigitalDesignCircuitDesign

Datapath&Control

Layout&fabSemiconductorMaterials

Page 16: Lecture 01: Introduc/on - GitHub PagesLecture 01: Introduc/on CSE 564 Computer Architecture Summer 2017 Department of Computer Science and Engineering Yonghong Yan yan@oakland.edu

16

TheInstruc/onSet:aCri/calInterface

instrucconset

sogware

hardware

•  Propercesofagoodabstraccon–  Laststhroughmanygeneracons(portability)–  Usedinmanydifferentways(generality)–  Providesconvenientfuncconalitytohigherlevels–  Permitsanefficientimplementaconatlowerlevels

Page 17: Lecture 01: Introduc/on - GitHub PagesLecture 01: Introduc/on CSE 564 Computer Architecture Summer 2017 Department of Computer Science and Engineering Yonghong Yan yan@oakland.edu

17

ElementsofanISA

•  Setofmachine-recognizeddatatypes–  bytes,words,integers,floacngpoint,strings,...

•  Operaconsperformedonthosedatatypes–  Add,sub,mul,div,xor,move,….

•  Programmablestorage–  regs,PC,memory

•  Methodsofidencfyingandobtainingdatareferencedbyinstruccons(addressingmodes)–  Literal,reg.,absolute,relacve,reg+offset,…

•  Format(encoding)oftheinstruccons–  Opcode,operandfields,…

Page 18: Lecture 01: Introduc/on - GitHub PagesLecture 01: Introduc/on CSE 564 Computer Architecture Summer 2017 Department of Computer Science and Engineering Yonghong Yan yan@oakland.edu

ComputerArchitectureHowthingsareputtogetherindesignandimplementa/on

• Capabilices&PerformanceCharacterisccsofPrincipalFuncconalUnits

– (e.g.,Registers,ALU,Shigers,LogicUnits,...)• Waysinwhichthesecomponentsareinterconnected• Informaconflowsbetweencomponents• Logicandmeansbywhichsuchinformaconflowiscontrolled.

• ChoreographyofFUstorealizetheISA

18

Page 19: Lecture 01: Introduc/on - GitHub PagesLecture 01: Introduc/on CSE 564 Computer Architecture Summer 2017 Department of Computer Science and Engineering Yonghong Yan yan@oakland.edu

Great Ideas in Computer Architectures

1.  Design for Moore’s Law

2.  Use abstraction to simplify design

3.  Make the common case fast

4.  Performance via parallelism

5.  Performance via pipelining

6.  Performance via prediction

7.  Hierarchy of memories

8.  Dependability via redundancy

19

Page 20: Lecture 01: Introduc/on - GitHub PagesLecture 01: Introduc/on CSE 564 Computer Architecture Summer 2017 Department of Computer Science and Engineering Yonghong Yan yan@oakland.edu

GreatIdea:“Moore’sLaw”

GordonMoore,FounderofIntel•  1965:sincetheintegratedcircuitwasinvented,thenumberof

transistors/inch2inthesecircuitsroughlydoubledeveryyear;thistrendwouldconcnuefortheforeseeablefuture

•  1975:revised-circuitcomplexitydoubleseverytwoyears

20Imagecredit:Intel

Page 21: Lecture 01: Introduc/on - GitHub PagesLecture 01: Introduc/on CSE 564 Computer Architecture Summer 2017 Department of Computer Science and Engineering Yonghong Yan yan@oakland.edu

MicroprocessorTransistorCounts1971-2011&Moore'sLaw

21

hcps://en.wikipedia.org/wiki/Transistor_count

Page 22: Lecture 01: Introduc/on - GitHub PagesLecture 01: Introduc/on CSE 564 Computer Architecture Summer 2017 Department of Computer Science and Engineering Yonghong Yan yan@oakland.edu

Moore’sLawtrends•  Moretransistors=↑opportunicesforexploicngparallelisminthe

instrucconlevel(ILP)–  Pipeline,superscalar,VLIW(VeryLongInstrucconWord),SIMD(Single

InstrucconMulcpleData)orvector,speculacon,branchprediccon•  Generalpathofscaling

–  Widerinstrucconissue,longerpiepline–  Morespeculacon–  Moreandlargerregistersandcache

•  Increasingcircuitdensity~=increasingfrequency~=increasingperformance

•  Transparenttousers–  AneasyjobofgevngbeIerperformance:buyingfasterprocessors(higher

frequency)

•  Wehaveenjoyedthisfreelunchforseveraldecades,however(TBD)…

22

Page 23: Lecture 01: Introduc/on - GitHub PagesLecture 01: Introduc/on CSE 564 Computer Architecture Summer 2017 Department of Computer Science and Engineering Yonghong Yan yan@oakland.edu

GreatIdea:PipelineFundamentalExecu/onCycle

23

Instruc'onFetch

Instruc'onDecode

OperandFetch

Execute

ResultStore

NextInstruc'on

Obtaininstrucconfromprogramstorage

Determinerequiredacconsandinstrucconsize

Locateandobtainoperanddata

Computeresultvalueorstatus

Depositresultsinstorageforlateruse

Determinesuccessorinstruccon

Processor

regs

F.U.s

Memory

program

Data

vonNeumanboIleneck

Page 24: Lecture 01: Introduc/on - GitHub PagesLecture 01: Introduc/on CSE 564 Computer Architecture Summer 2017 Department of Computer Science and Engineering Yonghong Yan yan@oakland.edu

PipelinedInstruc/onExecu/on

24

I n s t r. O r d e r

Time (clock cycles)

Reg ALU

DMem Ifetch Reg

Reg ALU

DMem Ifetch Reg

Reg ALU

DMem Ifetch Reg

Reg ALU

DMem Ifetch Reg

Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 6 Cycle 7 Cycle 5

Page 25: Lecture 01: Introduc/on - GitHub PagesLecture 01: Introduc/on CSE 564 Computer Architecture Summer 2017 Department of Computer Science and Engineering Yonghong Yan yan@oakland.edu

GreatIdea:Abstrac/on(LevelsofRepresenta/on/Interpreta/on)

lw $t0,0($2)lw $t1,4($2)sw $t1,0($2)sw $t0,4($2)

HighLevelLanguageProgram(e.g.,C)

AssemblyLanguageProgram(e.g.,MIPS)

MachineLanguageProgram(MIPS)

HardwareArchitectureDescrip/on(e.g.,blockdiagrams)

Compiler

Assembler

MachineInterpreta4on

temp=v[k];v[k]=v[k+1];v[k+1]=temp;

0000 1001 1100 0110 1010 1111 0101 1000 1010 1111 0101 1000 0000 1001 1100 0110 1100 0110 1010 1111 0101 1000 0000 1001 0101 1000 0000 1001 1100 0110 1010 1111 !

LogicCircuitDescrip/on(CircuitSchema/cDiagrams)

ArchitectureImplementa4on

Anythingcanberepresentedasanumber,

i.e.,dataorinstruccons

25

Page 26: Lecture 01: Introduc/on - GitHub PagesLecture 01: Introduc/on CSE 564 Computer Architecture Summer 2017 Department of Computer Science and Engineering Yonghong Yan yan@oakland.edu

TheMemoryAbstrac/on

•  Associaconof<name,value>pairs–  typicallynamedasbyteaddresses–  ogenvaluesalignedonmulcplesofsize

•  SequenceofReadsandWrites•  Writebindsavaluetoanaddress•  ReadofaddrreturnsmostrecentlywriIenvalueboundtothataddress

26

address (name) command (R/W)

data (W)

data (R)

done

Page 27: Lecture 01: Introduc/on - GitHub PagesLecture 01: Introduc/on CSE 564 Computer Architecture Summer 2017 Department of Computer Science and Engineering Yonghong Yan yan@oakland.edu

Processor-DRAMMemoryGap(latency)

27

µProc 60%/yr. (2X/1.5yr)

DRAM 9%/yr. (2X/10 yrs)

1!

10!

100!

1000!19

80!

1981

!

1983

!19

84!

1985

!19

86!

1987

!19

88!

1989

!19

90!

1991

!19

92!

1993

!19

94!

1995

!19

96!

1997

!19

98!

1999

!20

00!

DRAM

CPU!19

82!

Processor-Memory Performance Gap: (grows 50% / year)

Perf

orm

ance

Time

“Joy’s Law”

Page 28: Lecture 01: Introduc/on - GitHub PagesLecture 01: Introduc/on CSE 564 Computer Architecture Summer 2017 Department of Computer Science and Engineering Yonghong Yan yan@oakland.edu

ThePrincipleofLocality

•  ThePrincipleofLocality:–  Programaccessarelacvelysmallporconoftheaddressspace

atanyinstantofcme.•  TwoDifferentTypesofLocality:

–  TemporalLocality(LocalityinTime):Ifanitemisreferenced,itwilltendtobereferencedagainsoon(e.g.,loops,reuse)

–  SpacalLocality(LocalityinSpace):Ifanitemisreferenced,itemswhoseaddressesareclosebytendtobereferencedsoon(e.g.,straightlinecode,arrayaccess)

•  Last30years,HWreliedonlocalityforspeed

28

P MEM$

Page 29: Lecture 01: Introduc/on - GitHub PagesLecture 01: Introduc/on CSE 564 Computer Architecture Summer 2017 Department of Computer Science and Engineering Yonghong Yan yan@oakland.edu

Greatidea:MemoryHierarchyLevelsoftheMemoryHierarchy

29

CPU Registers 100s Bytes << 1s ns

Cache 10s-100s K Bytes ~1 ns $1s/ MByte

Main Memory M Bytes 100ns- 300ns $< 1/ MByte

Disk 10s G Bytes, 10 ms (10,000,000 ns) $0.001/ MByte

Capacity Access Time Cost

Tape infinite sec-min $0.0014/ MByte

Registers

Cache

Memory

Disk

Tape

Instr. Operands

Blocks

Pages

Files

Staging Xfer Unit

prog./compiler 1-8 bytes

cache cntl 8-128 bytes

OS 512-4K bytes

user/operator Mbytes

Upper Level

Lower Level

faster

Larger

Page 30: Lecture 01: Introduc/on - GitHub PagesLecture 01: Introduc/on CSE 564 Computer Architecture Summer 2017 Department of Computer Science and Engineering Yonghong Yan yan@oakland.edu

JimGray’sStorageLatencyAnalogy:HowFarAwayistheData?

30Registers

On Chip Cache On Board Cache

Main Memory

Disk

1

2 10

100

Tape /Optical Robot

10 9

10 6

Lansing

This Campus

This Room My Head

10 min

1.5 hr

2 Years

1 min

Pluto

2,000 Years

Andromeda

(ns)

JimGrayTuringAwardB.S.Cal1966Ph.D.Cal1969!

Page 31: Lecture 01: Introduc/on - GitHub PagesLecture 01: Introduc/on CSE 564 Computer Architecture Summer 2017 Department of Computer Science and Engineering Yonghong Yan yan@oakland.edu

TheCacheDesignSpace

•  Severalinteraccngdimensions–  cachesize–  blocksize–  associacvity–  replacementpolicy–  write-throughvswrite-back

•  Theopcmalchoiceisacompromise–  dependsonaccesscharacterisccs

•  workload•  use(I-cache,D-cache,TLB)

–  dependsontechnology/cost•  Simplicityogenwins

31

Associativity

Cache Size

Block Size

Bad

Good Less More

Factor A Factor B

Page 32: Lecture 01: Introduc/on - GitHub PagesLecture 01: Introduc/on CSE 564 Computer Architecture Summer 2017 Department of Computer Science and Engineering Yonghong Yan yan@oakland.edu

GreatIdea:Parallelism

32

Page 33: Lecture 01: Introduc/on - GitHub PagesLecture 01: Introduc/on CSE 564 Computer Architecture Summer 2017 Department of Computer Science and Engineering Yonghong Yan yan@oakland.edu

DefiningComputerArchitecture

•  “Old”viewofcomputerarchitecture:–  InstrucconSetArchitecture(ISA)design–  i.e.decisionsregarding:

•  registers,memoryaddressing,addressingmodes,instrucconoperands,availableoperacons,controlflowinstruccons,instrucconencoding

•  “Real”computerarchitecture:–  Specificrequirementsofthetargetmachine–  Designtomaximizeperformancewithinconstraints:cost,

power,andavailability–  IncludesISA,microarchitecture,hardware

33

Page 34: Lecture 01: Introduc/on - GitHub PagesLecture 01: Introduc/on CSE 564 Computer Architecture Summer 2017 Department of Computer Science and Engineering Yonghong Yan yan@oakland.edu

ComputerArchitectureTopics

34

Instruction Set Architecture

Pipelining, Hazard Resolution, Superscalar, Reordering, Prediction, Speculation, Vector, Dynamic Compilation

Addressing, Protection, Exception Handling

L1 Cache

L2/L3 Cache

DRAM

Disks, WORM, Tape

Coherence, Bandwidth, Latency

Emerging Technologies Interleaving Bus protocols

RAID

VLSI

Input/Output and Storage

Memory Hierarchy

Pipelining and Instruction Level Parallelism

Network Communication

Oth

er P

roce

ssor

s

Page 35: Lecture 01: Introduc/on - GitHub PagesLecture 01: Introduc/on CSE 564 Computer Architecture Summer 2017 Department of Computer Science and Engineering Yonghong Yan yan@oakland.edu

WhyisArchitectureExci/ngToday?

35

CPUClockS

peed+15%

/year

CPUSpeedFlat

Page 36: Lecture 01: Introduc/on - GitHub PagesLecture 01: Introduc/on CSE 564 Computer Architecture Summer 2017 Department of Computer Science and Engineering Yonghong Yan yan@oakland.edu

Problemsoftradi/onalILPscaling

•  Fundamentalcircuitlimitacons1–  delays⇑asissuequeues⇑andmulc-portregisterfiles⇑–  increasingdelayslimitperformancereturnsfromwiderissue

•  Limitedamountofinstruccon-levelparallelism1

–  inefficientforcodeswithdifficult-to-predictbranches

•  Powerandheatstallclockfrequencies

36

[1]Thecaseforasingle-chipmulcprocessor,K.Olukotun,B.Nayfeh,L.Hammond,K.Wilson,andK.Chang,ASPLOS-VII,1996.

Page 37: Lecture 01: Introduc/on - GitHub PagesLecture 01: Introduc/on CSE 564 Computer Architecture Summer 2017 Department of Computer Science and Engineering Yonghong Yan yan@oakland.edu

ILPimpacts

37

Page 38: Lecture 01: Introduc/on - GitHub PagesLecture 01: Introduc/on CSE 564 Computer Architecture Summer 2017 Department of Computer Science and Engineering Yonghong Yan yan@oakland.edu

Simula/onsof8-issueSuperscalar

38

Page 39: Lecture 01: Introduc/on - GitHub PagesLecture 01: Introduc/on CSE 564 Computer Architecture Summer 2017 Department of Computer Science and Engineering Yonghong Yan yan@oakland.edu

Power/heatdensitylimitsfrequency

39

•  Somefundamentalphysicallimitsarebeingreached

Page 40: Lecture 01: Introduc/on - GitHub PagesLecture 01: Introduc/on CSE 564 Computer Architecture Summer 2017 Department of Computer Science and Engineering Yonghong Yan yan@oakland.edu

Wewillhavethis…

40

Page 41: Lecture 01: Introduc/on - GitHub PagesLecture 01: Introduc/on CSE 564 Computer Architecture Summer 2017 Department of Computer Science and Engineering Yonghong Yan yan@oakland.edu

41

Revolu/onishappeningnow•  Chipdensityis

concnuingincrease~2xevery2years–  Clockspeedisnot–  Numberofprocessor

coresmaydoubleinstead

•  ThereisliIleornohiddenparallelism(ILP)tobefound

•  Parallelismmustbeexposedtoandmanagedbysogware–  Nofreelunch

Source:Intel,Microsog(SuIer)andStanford(Olukotun,Hammond)

Page 42: Lecture 01: Introduc/on - GitHub PagesLecture 01: Introduc/on CSE 564 Computer Architecture Summer 2017 Department of Computer Science and Engineering Yonghong Yan yan@oakland.edu

SingleProcessorPerformance

RISC

Move to multi-processor

42

Page 43: Lecture 01: Introduc/on - GitHub PagesLecture 01: Introduc/on CSE 564 Computer Architecture Summer 2017 Department of Computer Science and Engineering Yonghong Yan yan@oakland.edu

IBMBG/L

ASCIWhite Pacific

EDSAC1 UNIVAC1

IBM7090 CDC6600

IBM360/195 CDC7600

Cray1

CrayX-MP Cray2

TMCCM-2

TMCCM-5 CrayT3D

ASCIRed

1950 1960 1970 1980 1990 2000 2010

1KFlop/s

1MFlop/s

1GFlop/s

1TFlop/s

1PFlop/s

Scalar

Super Scalar

Parallel

Vector

1941 1 (Floating Point operations / second, Flop/s) 1945 100 1949 1,000 (1 KiloFlop/s, KFlop/s) 1951 10,000 1961 100,000 1964 1,000,000 (1 MegaFlop/s, MFlop/s) 1968 10,000,000 1975 100,000,000 1987 1,000,000,000 (1 GigaFlop/s, GFlop/s) 1992 10,000,000,000 1993 100,000,000,000 1997 1,000,000,000,000 (1 TeraFlop/s, TFlop/s) 2000 10,000,000,000,000 2005 131,000,000,000,000 (131 Tflop/s)

Super Scalar/Vector/Parallel

(103)

(106)

(109)

(1012)

(1015)

2XTransistors/ChipEvery1.5Years

Thetrends

43

Page 44: Lecture 01: Introduc/on - GitHub PagesLecture 01: Introduc/on CSE 564 Computer Architecture Summer 2017 Department of Computer Science and Engineering Yonghong Yan yan@oakland.edu

Recentmul/coreprocessors

44

Page 45: Lecture 01: Introduc/on - GitHub PagesLecture 01: Introduc/on CSE 564 Computer Architecture Summer 2017 Department of Computer Science and Engineering Yonghong Yan yan@oakland.edu

RecentmanycoreGPUprocessors

45

��

An�Overview�of�the�GK110�Kepler�Architecture�Kepler�GK110�was�built�first�and�foremost�for�Tesla,�and�its�goal�was�to�be�the�highest�performing�parallel�computing�microprocessor�in�the�world.�GK110�not�only�greatly�exceeds�the�raw�compute�horsepower�delivered�by�Fermi,�but�it�does�so�efficiently,�consuming�significantly�less�power�and�generating�much�less�heat�output.��

A�full�Kepler�GK110�implementation�includes�15�SMX�units�and�six�64�bit�memory�controllers.��Different�products�will�use�different�configurations�of�GK110.��For�example,�some�products�may�deploy�13�or�14�SMXs.��

Key�features�of�the�architecture�that�will�be�discussed�below�in�more�depth�include:�

� The�new�SMX�processor�architecture�� An�enhanced�memory�subsystem,�offering�additional�caching�capabilities,�more�bandwidth�at�

each�level�of�the�hierarchy,�and�a�fully�redesigned�and�substantially�faster�DRAM�I/O�implementation.�

� Hardware�support�throughout�the�design�to�enable�new�programming�model�capabilities�

Kepler�GK110�Full�chip�block�diagram�

��

Streaming�Multiprocessor�(SMX)�Architecture�

Kepler�GK110)s�new�SMX�introduces�several�architectural�innovations�that�make�it�not�only�the�most�powerful�multiprocessor�we)ve�built,�but�also�the�most�programmable�and�power�efficient.��

SMX:�192�single�precision�CUDA�cores,�64�double�precision�units,�32�special�function�units�(SFU),�and�32�load/store�units�(LD/ST).�

��

Kepler�Memory�Subsystem�/�L1,�L2,�ECC�

Kepler&s�memory�hierarchy�is�organized�similarly�to�Fermi.�The�Kepler�architecture�supports�a�unified�memory�request�path�for�loads�and�stores,�with�an�L1�cache�per�SMX�multiprocessor.�Kepler�GK110�also�enables�compiler�directed�use�of�an�additional�new�cache�for�read�only�data,�as�described�below.�

64�KB�Configurable�Shared�Memory�and�L1�Cache�

In�the�Kepler�GK110�architecture,�as�in�the�previous�generation�Fermi�architecture,�each�SMX�has�64�KB�of�on�chip�memory�that�can�be�configured�as�48�KB�of�Shared�memory�with�16�KB�of�L1�cache,�or�as�16�KB�of�shared�memory�with�48�KB�of�L1�cache.�Kepler�now�allows�for�additional�flexibility�in�configuring�the�allocation�of�shared�memory�and�L1�cache�by�permitting�a�32KB�/�32KB�split�between�shared�memory�and�L1�cache.�To�support�the�increased�throughput�of�each�SMX�unit,�the�shared�memory�bandwidth�for�64b�and�larger�load�operations�is�also�doubled�compared�to�the�Fermi�SM,�to�256B�per�core�clock.�

48KB�Read�Only�Data�Cache�

In�addition�to�the�L1�cache,�Kepler�introduces�a�48KB�cache�for�data�that�is�known�to�be�read�only�for�the�duration�of�the�function.�In�the�Fermi�generation,�this�cache�was�accessible�only�by�the�Texture�unit.�Expert�programmers�often�found�it�advantageous�to�load�data�through�this�path�explicitly�by�mapping�their�data�as�textures,�but�this�approach�had�many�limitations.��

•  ~3kcores

Page 46: Lecture 01: Introduc/on - GitHub PagesLecture 01: Introduc/on CSE 564 Computer Architecture Summer 2017 Department of Computer Science and Engineering Yonghong Yan yan@oakland.edu

CurrentTrendsinArchitecture

•  CannotconcnuetoleverageInstruccon-Levelparallelism(ILP)–  Singleprocessorperformanceimprovementendedin2003

•  Newmodelsforperformance:–  Data-levelparallelism(DLP)–  Thread-levelparallelism(TLP)–  Heterogeneity

•  Theserequireexplicitrestructuringoftheapplicacon

46

Page 47: Lecture 01: Introduc/on - GitHub PagesLecture 01: Introduc/on CSE 564 Computer Architecture Summer 2017 Department of Computer Science and Engineering Yonghong Yan yan@oakland.edu

Parallelism

•  Classesofparallelisminapplicacons:–  Data-LevelParallelism(DLP)–  Task-LevelParallelism(TLP)

•  Classesofarchitecturalparallelism:–  Instruccon-LevelParallelism(ILP)–  Vectorarchitectures/GraphicProcessorUnits(GPUs)–  Thread-LevelParallelism–  Heterogeneity

47

Page 48: Lecture 01: Introduc/on - GitHub PagesLecture 01: Introduc/on CSE 564 Computer Architecture Summer 2017 Department of Computer Science and Engineering Yonghong Yan yan@oakland.edu

ArchitecturalChallenges

•  Massive(ca.4X)increaseinconcurrency–  Mulccore(4-<100)àManycores(100s–1ks)

•  Heterogeneity–  System-level(accelerators)vschiplevel(embedded)

•  Computepowerandmemoryspeedchallenges(twowalls)–  500xcomputepowerand30xmemoryof2PFHW–  Memoryaccess'melagsfurtherbehind

48

• Complex Digital ASIC Design • Activity 1 Case Study: Scalar vs. Vector Processors Activity 2

Course Motivation: Research Perspective

���

�����

�����

� ���� ������ ����

��� ��� ��� ��� ���� ���� �� � �� � ����

��� ��� ��� ������� �� ��

�������

���������

����������

� �����

������

!"�#���

"��$������������

���%�����

&�'(�)

(�*�!+(

�,����$����

'�����'�+-��

�����'.�� �

�/�0��1'���

&��2$���-�������

'.�� �

(�$3

�������

�4��5��2�6�&0�.&���7��8�9��1:*�4��9 ��'.�� ���������$�� �4�6����� ����;����� ��2��<�� �4��9 ��'"<�����5$�� <��5$�� �")

��������

ECE 5950 Course Overview 18 / 35Data$Processing$in$Exascale1class$Computing$Systems$$|$$April$27,$2011$$|$$CRM$4"

Three"Eras"of"Processor"Performance"

Single4Core""Era"

Single1thread$$Performance$

?$

Time$

we#are#here#

o"

Enabled$by:$� ���������$� Voltage$Scaling$� MicroArchitecture$

$

Constrained$by:$Power$Complexity$

Multi4Core""Era"

Throughput$$Performance$

Time$(##of#Processors)#

we#are#here#

o"

Enabled$by:$� ���������$� Desire$for$Throughput$� 20$years$of$SMP$arch$

$

Constrained$by:$Power$Parallel$SW$availability$Scalability$

Heterogeneous"Systems"Era"

Targeted$Application$$

Performance$

Time$(Data1parallel#exploitation)#

we#are#here#

o"

Enabled$by:$� ���������$� Abundant$data$parallelism$� Power$efficient$GPUs$

$

Currently)constrained$by:$Programming$models$Communication$overheads$

Source:ChuckMoore,DataProcessinginExaScale-ClassComputerSystems,Salishan,April2011

Page 49: Lecture 01: Introduc/on - GitHub PagesLecture 01: Introduc/on CSE 564 Computer Architecture Summer 2017 Department of Computer Science and Engineering Yonghong Yan yan@oakland.edu

Exercise:InspectISAforsum

•  cp~yan/sum.c~(copysum.cfilefrommyhomefoldertoyourhomefolder)

•  gcc-save-tempssum.c–osum•  ./sum102400

•  visum.c•  visum.s•  Orcheckfrom:

–  hIps://passlab.github.io/CSE564/exercises/sum/

•  ViewthemfromHdrive•  Othersystemcommands:

–  cat/proc/cpuinfotoshowtheCPUand#cores–  topcommandtoshowsystemusageandmemory

49

Page 50: Lecture 01: Introduc/on - GitHub PagesLecture 01: Introduc/on CSE 564 Computer Architecture Summer 2017 Department of Computer Science and Engineering Yonghong Yan yan@oakland.edu

Backup

50

Page 51: Lecture 01: Introduc/on - GitHub PagesLecture 01: Introduc/on CSE 564 Computer Architecture Summer 2017 Department of Computer Science and Engineering Yonghong Yan yan@oakland.edu

New-SchoolMachineStructures

•  ParallelRequestsAssignedtocomputere.g.,Search“cats”

•  ParallelThreadsAssignedtocoree.g.,Lookup,Ads

•  ParallelInstruccons>[email protected].,5pipelinedinstruccons

•  ParallelData>[email protected].,Addof4pairsofwords

•  HardwaredescripconsAllgatesfuncconinginparallel

atsamecme

51

SmartPhone

Warehouse-Scale

Computer

SoLwareHardware

HarnessParallelism&AchieveHighPerformance

LogicGates

Core Core…

Memory(Cache)

Input/Output

Computer

MainMemory

Core

InstrucconUnit(s)

FuncconalUnit(s)

A3+B3A2+B2A1+B1A0+B0

51

Page 52: Lecture 01: Introduc/on - GitHub PagesLecture 01: Introduc/on CSE 564 Computer Architecture Summer 2017 Department of Computer Science and Engineering Yonghong Yan yan@oakland.edu

CopingwithFailures

•  4disks/server,50,000servers•  Failurerateofdisks:2%to10%/year

–  Assume4%annualfailurerate•  Onaverage,howogendoesadiskfail?

a)  1/monthb)  1/weekc)  1/dayd)  1/hour

52

Page 53: Lecture 01: Introduc/on - GitHub PagesLecture 01: Introduc/on CSE 564 Computer Architecture Summer 2017 Department of Computer Science and Engineering Yonghong Yan yan@oakland.edu

CopingwithFailures

•  4disks/server,50,000servers•  Failurerateofdisks:2%to10%/year

–  Assume4%annualfailurerate•  Onaverage,howogendoesadiskfail?

a)  1/monthb)  1/weekc)  1/dayd)  1/hour 50,000x4=200,000disks

200,000x4%=8000disksfail365daysx24hours=8760hours

53

Page 54: Lecture 01: Introduc/on - GitHub PagesLecture 01: Introduc/on CSE 564 Computer Architecture Summer 2017 Department of Computer Science and Engineering Yonghong Yan yan@oakland.edu

GreatIdea:DependabilityviaRedundancy

•  Redundancysothatafailingpiecedoesn’tmakethewholesystemfail

1+1=2 1+1=2 1+1=1

1+1=22of3agree

FAIL!

Increasingtransistordensityreducesthecostofredundancy54

Page 55: Lecture 01: Introduc/on - GitHub PagesLecture 01: Introduc/on CSE 564 Computer Architecture Summer 2017 Department of Computer Science and Engineering Yonghong Yan yan@oakland.edu

GreatIdea:DependabilityviaRedundancy

•  Appliestoeverythingfromdatacenterstostoragetomemorytoinstructors–  Redundantdatacenterssothatcanlose1datacenterbut

Internetservicestaysonline–  Redundantdiskssothatcanlose1diskbutnotlosedata

(RedundantArraysofIndependentDisks/RAID)–  Redundantmemorybitsofsothatcanlose1bitbutnodata

(ErrorCorreccngCode/ECCMemory)

55

Page 56: Lecture 01: Introduc/on - GitHub PagesLecture 01: Introduc/on CSE 564 Computer Architecture Summer 2017 Department of Computer Science and Engineering Yonghong Yan yan@oakland.edu

UnderstandingComputerArchitecture

56de.pinterest.com

Page 57: Lecture 01: Introduc/on - GitHub PagesLecture 01: Introduc/on CSE 564 Computer Architecture Summer 2017 Department of Computer Science and Engineering Yonghong Yan yan@oakland.edu

EndofMoore’sLaw?

57

Costpertransistorisrisingastransistorsizecon/nuesto

shrink