CSE 502 Graduate Computer Architecture Lec 1-2 - Introduction

Larry Wittie, Computer Science, Stony Brook University
http://www.cs.sunysb.edu/~cse502 and ~lw
Slides adapted from David Patterson, UC-Berkeley cs252-s06

Outline
- Computer Science at a Crossroads
- Computer Architecture vs. Instruction Set Architecture
- How would you like your CSE502?
- What Computer Architecture brings to the table
  - Quantitative Principles of Design
  - Technology Performance Trends
  - Careful, Quantitative Comparisons

Crossroads: Conventional Wisdom in Computer Architecture
- Old Conventional Wisdom: power is free, transistors are expensive.
- New Conventional Wisdom: the "power wall" - power is expensive, transistors are free (we can put more transistors on a chip than we can afford to turn on).
- Old CW: instruction-level parallelism can keep increasing via compilers and innovation (out-of-order execution, speculation, VLIW, ...).
- New CW: the "ILP wall" - a law of diminishing returns on adding more hardware for ILP.
- Old CW: multiplies are slow, memory access is fast.
- New CW: the "memory wall" - memory is slow, multiplies are fast (~200 clock cycles to DRAM, ~4 clocks for a multiply).
- Old CW: uniprocessor performance doubles every 1.5 years.
- New CW: Power Wall + ILP Wall + Memory Wall = Brick Wall; uniprocessor performance now doubles only every ~5(?) years.
- Sea change in chip design: multiple cores (2X processors per chip every ~2 years).
  - Increase the on-chip number of simple processors that are power efficient.
  - Simple processor cores use less power per useful calculation done.

Crossroads: Uniprocessor Performance
- VAX: 25%/year, 1978 to 1986
- RISC + x86: 52%/year, 1986 to 2002
- RISC + x86: ??%/year, 2002 to 2006
- From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, October 2006.
[Chart: performance relative to the VAX-11/780 for machines from the 1978 VAX-11/780 through a 2005 3.6 GHz 64-bit Intel Xeon, rising from 1 to roughly 6,500, with the three growth segments listed above.]
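
A quick arithmetic check, not on the original slide: compounding the 52%/year rate over the 16 years from 1986 to 2002 gives roughly an 800x gain, which matches the chart data's rise from about 5x the VAX-11/780 in 1986 to about 4,200x in 2002.

```latex
\[
  1.52^{16} \approx 8.1 \times 10^{2}
  \qquad\text{compared with}\qquad
  \frac{4195}{5} \approx 8.4 \times 10^{2}.
\]
```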

Sea Change in Chip Design
- Intel 4004 (1971): 4-bit processor, 2,312 transistors, 0.4 MHz, 10 micron PMOS, 11 mm^2 chip.
- Is the processor the new transistor?
- RISC II (1983): 32-bit, 5-stage pipeline, 40,760 transistors, 3 MHz, 3 micron NMOS, 60 mm^2 chip (6 x 10 mm).
- Today (2006): a 125 mm^2 chip in 0.065 micron CMOS holds the equivalent of 2,312 copies of RISC II + FPU + Icache + Dcache (~10^8 transistors).
  - RISC II shrinks to ~0.02 mm^2 at 65 nm.
  - Caches via DRAM or 1-transistor SRAM (www.t-ram.com)?
  - Proximity communication via capacitive coupling at > 1 TB/s? (Ivan Sutherland @ Sun / Berkeley)

Déjà vu all over again?
- Multiprocessors were "imminent" in the 1970s, '80s, '90s, ...
- "... today's processors ... are nearing an impasse as technologies approach the speed of light." - David Mitchell, The Transputer: The Time Is Now (1989)
  - The Transputer was premature; custom multiprocessors strove to outpace uniprocessors. Procrastination was rewarded: 2X sequential performance every 1.5 years.
- "We are dedicating all of our future product development to multicore designs. This is a sea change in computing." - Paul Otellini, President, Intel (2004)
  - The difference now is that all microprocessor companies have switched to multiprocessors (AMD, Intel, IBM, Sun; all new Apple machines use 2 CPUs). Procrastination is penalized: 2X sequential performance every ~5 years.
  - Biggest programming challenge: going from 1 to 2 CPUs.

Problems with the Sea Change
- Algorithms, programming languages, compilers, operating systems, architectures, libraries, ... are not ready to supply thread-level parallelism or data-level parallelism for 1,000 CPUs per chip.
- Architectures are not ready for 1,000 CPUs per chip.
- Unlike instruction-level parallelism, this cannot be solved by computer architects and compiler writers alone, but it also cannot be solved without the participation of computer architects.
- This edition of CSE 502 (and the 4th edition of the textbook Computer Architecture: A Quantitative Approach) explores the shift from instruction-level parallelism to thread-level and data-level parallelism.
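
A rough scaling check, not on the original slide: die area scales with the square of the feature size, so shrinking RISC II from 3 microns to 65 nm should shrink its 60 mm^2 die by a factor of about (3.0/0.065)^2, close to the ~0.02 mm^2 quoted above.

```latex
\[
  60~\text{mm}^2 \times \left(\frac{0.065~\mu\text{m}}{3~\mu\text{m}}\right)^{2} \approx 0.03~\text{mm}^2
\]
```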

Outline (repeat) - next: Computer Architecture vs. Instruction Set Architecture

Instruction Set Architecture: A Critical Interface
- The instruction set is the interface between software and hardware: the computing system as seen by programmers.
- Properties of a good abstraction:
  - Lasts through many generations (portability)
  - Used in many different ways (generality)
  - Provides convenient functionality to higher levels
  - Permits an efficient implementation at lower levels
- What matters today is the performance of complete computer systems.

Example: MIPS
- Programmable storage: 2^32 bytes of memory; 31 x 32-bit GPRs r1-r31 (r0 = 0); 32 x 32-bit FP registers (paired for double precision); HI, LO, PC.
- Questions an ISA answers: data types? instruction formats? addressing modes?
- Arithmetic/logical: Add, AddU, Sub, SubU, And, Or, Xor, Nor, SLT, SLTU, AddI, AddIU, SLTI, SLTIU, AndI, OrI, XorI, LUI, SLL, SRL, SRA, SLLV, SRLV, SRAV
- Memory access: LB, LBU, LH, LHU, LW, LWL, LWR, SB, SH, SW, SWL, SWR
- Control: J, JAL, JR, JALR, BEQ, BNE, BLEZ, BGTZ, BLTZ, BGEZ, BLTZAL, BGEZAL
- 32-bit instructions aligned on word boundaries. (A short compiled-code sketch follows this slide.)
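
To make the instruction classes concrete, here is a tiny C function with, in the comments, one plausible unoptimized hand translation to MIPS. This is my own illustration, not from the slides; it assumes the standard o32 register conventions ($a0-$a2 for arguments, $v0 for the return value) and uses only instructions from the lists above.

```c
/* Hypothetical illustration: sum two loaded words, or return the first one. */
int sum_or_first(int *a, int *b, int flag)
{
    if (flag != 0) {             /*        beq  $a2, $zero, Lelse  # flag == 0? take else */
        return *a + *b;          /*        lw   $t0, 0($a0)        # t0 = *a              */
                                 /*        lw   $t1, 0($a1)        # t1 = *b              */
                                 /*        add  $v0, $t0, $t1      # v0 = t0 + t1         */
                                 /*        jr   $ra                # return v0            */
    }
    return *a;                   /* Lelse: lw   $v0, 0($a0)        # v0 = *a              */
}                                /*        jr   $ra                # return v0            */
```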

Outline (repeat) - next: How would you like your CSE502?

CSE502: Administrivia
- Instructor: Prof. Larry Wittie
- Office/Lab: 1308 CompSci, lw AT ic DOT sunysb DOT edu
- Office hours: TuTh, 3:45-5:15 pm, if the door is open, or by appointment
- T.A.: To Be Determined (TBD)
- Class: TuTh, 2:20-3:40 pm, 2120 Comp Sci
- Text: Computer Architecture: A Quantitative Approach, 4th Ed. (Oct. 2006), ISBN 0123704901 or 978-0123704900; about $65 ($50?) on Amazon, $77 used at SBU, Spring 2011.
- Web page: http://www.cs.sunysb.edu/~cse502/
- First reading assignment: Chapter 1 for today (Tuesday); Appendix A (at the back of the CAQA4 text) for next week.
- (A new edition is due in fall 2011, so the 4th edition has little trade-back value.)

Coping with CSE 502
- Undergrads from SBU have taken CSE320; grad students arrive with more varied backgrounds.
- You will have a difficult time if you have not had an undergraduate course using a Hennessy & Patterson text.
- Grads without a CSE320 equivalent may have to work hard. Review the CSE502 text's Appendices A, B, and C; the CSE320 home page; and maybe the CSE320 text, Computer Organization and Design (COD), 4/e.
  - Read chapters 1 to 8 of COD if you never took the prerequisite.
  - If you took such a class, be sure COD Chapters 2, 6, and 7 are very familiar.
- We will spend two weeks of lectures reviewing Pipelining (Appendix A) and Memory Hierarchy (Appendix C) before an in-class quiz to check that everyone is OK.

Grading
- 18% Homeworks (practice for the exams). Not doing the homework yourself will doom you on the exams (and there are penalties).
- 74% Exams: 4% quiz, 20% midterm, 50% final exam.
- 8% Optional research project (work in pairs):
  - You need to show initiative: you propose the project to me.
  - Pick a topic (more on this later).
  - Give an oral presentation or poster session.
  - Write a report like a conference paper.
  - About 5 weeks of full-time work for 2 people.
  - An opportunity to do "research in the small," helping the transition from good undergrad student to research colleague.
- I may add up to 3% to a student's final score, usually given only to people showing marked improvement during the course.

Outline (repeat) - next: What Computer Architecture brings to the table

What Computer Architecture Brings to the Table
- Quantitative principles of design:
  - Take advantage of parallelism
  - Principle of locality
  - Focus on the common case
  - Amdahl's Law
  - The processor performance equation
- A culture of anticipating and exploiting advances in technology (technology performance trends).
- Careful, quantitative comparisons:
  - Define, quantify, and summarize relative performance
  - Define and quantify relative cost
  - Define and quantify dependability
  - Define and quantify power
- A culture of crafting well-defined interfaces that are carefully implemented and thoroughly checked.

1) Taking Advantage of Parallelism
- Increase the throughput of server computers via multiple processors or multiple disks.
- Detailed hardware design:
  - Carry-lookahead adders use parallelism to speed up computing sums from linear to logarithmic in the number of bits per operand (a C sketch of the carry equations follows this slide).
  - Multiple memory banks are searched in parallel in set-associative caches.
- Pipelining: overlap instruction execution to reduce the total time to complete an instruction sequence.
  - Not every instruction depends on its immediate predecessor, so executing instructions completely or partially in parallel is possible.
  - Classic 5-stage pipeline: 1) instruction fetch (Ifetch), 2) register read (Reg), 3) execute (ALU), 4) data memory access (Dmem), 5) register write (Reg).
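
A minimal C sketch of the carry-lookahead idea (my illustration, not from the slides): with per-bit generate and propagate signals, every carry can be written directly in terms of g, p, and the carry-in, so the carry chain no longer ripples bit by bit.

```c
#include <stdio.h>

/* 4-bit carry-lookahead adder sketch.
 * Per-bit generate g_i = a_i & b_i and propagate p_i = a_i ^ b_i let every
 * carry be computed directly from g, p, and the carry-in c0. */
static unsigned cla4(unsigned a, unsigned b, unsigned c0, unsigned *cout)
{
    unsigned g = a & b;   /* per-bit generate  */
    unsigned p = a ^ b;   /* per-bit propagate */
    unsigned g0 = (g >> 0) & 1, g1 = (g >> 1) & 1, g2 = (g >> 2) & 1, g3 = (g >> 3) & 1;
    unsigned p0 = (p >> 0) & 1, p1 = (p >> 1) & 1, p2 = (p >> 2) & 1, p3 = (p >> 3) & 1;

    /* Lookahead carry equations: each carry depends only on g, p, and c0. */
    unsigned c1 = g0 | (p0 & c0);
    unsigned c2 = g1 | (p1 & g0) | (p1 & p0 & c0);
    unsigned c3 = g2 | (p2 & g1) | (p2 & p1 & g0) | (p2 & p1 & p0 & c0);
    unsigned c4 = g3 | (p3 & g2) | (p3 & p2 & g1) | (p3 & p2 & p1 & g0)
                     | (p3 & p2 & p1 & p0 & c0);

    *cout = c4;
    /* Sum bit i is p_i ^ c_i. */
    return (p0 ^ c0) | ((p1 ^ c1) << 1) | ((p2 ^ c2) << 2) | ((p3 ^ c3) << 3);
}

int main(void)
{
    unsigned cout, sum = cla4(0xB, 0x6, 0, &cout);   /* 11 + 6 = 17 */
    printf("sum=0x%X carry=%u\n", sum, cout);        /* prints sum=0x1 carry=1 */
    return 0;
}
```

A single 4-bit block like this has constant carry depth; real adders chain such blocks with a second level of lookahead, which is what turns the linear ripple delay into the logarithmic depth mentioned on the slide.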

Pipelined Instruction Execution Is Faster
[Figure: successive instructions overlapped in time, each passing through the Ifetch, Reg, ALU, Dmem, and Reg stages, one stage per clock cycle.]

Limits to Pipelining
- Hazards prevent the next instruction from executing during its designated clock cycle:
  - Structural hazards: attempting to use the same hardware to do two different things at once.
  - Data hazards: an instruction depends on the result of a prior instruction that is still in the pipeline.
  - Control hazards: caused by the delay between fetching instructions and deciding on changes in control flow (branches and jumps).
[Figure: pipeline diagram, instruction order vs. time in clock cycles.]

2) The Principle of Locality => Caches ($)
- The principle of locality: programs access a relatively small portion of the address space at any instant of time.
- Two different types of locality:
  - Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse).
  - Spatial locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access).
- For 30 years, hardware has relied on locality for memory performance. (A small code sketch at the end of these notes illustrates both kinds.)
[Figure: processor - cache ($) - main memory.]

Levels of the Memory Hierarchy

  Level            Capacity          Access time               Cost          Transfer unit (managed by)
  CPU registers    100s of bytes     300-500 ps (0.3-0.5 ns)                 instruction operands, 1-8 bytes (programmer/compiler)
  L1 cache         10s-100s of KB    ~1 ns                     $1000s/GB     blocks, 32-64 bytes (cache controller)
  L2 cache         10s-100s of KB    ~10 ns                    $1000s/GB     blocks, 64-128 bytes (cache controller)
  Main memory      GBs               80-200 ns                 ~$100/GB      pages, 4-8 KB (OS)
  Disk             10s of TB         10 ms (10,000,000 ns)     ~$0.25/GB     files, MBytes (user/operator)
  Tape vault       semi-infinite     seconds to minutes        ~$1/GB

- Upper levels are faster; lower levels are larger.

3) Focus on the Common Case
- Make the frequent case fast and the rest right.
- Common sense guides computer design; since it is engineering, common sense is valuable.
- In making a design trade-off, favor the frequent case over the infrequent case.
  - E.g., the instruction fetch and decode unit is used more frequently than the multiplier, so optimize it first.
  - E.g., if a database server has 50 disks per processor, storage dependability dominates system dependability, so optimize it first.
- The frequent case is often simpler and can be made faster than the infrequent case.
  - E.g., overflow is rare when adding two numbers, so improve performance by optimizing the more common case of no overflow.
  - This may slow down the overflow case, but overall performance improves by optimizing for the normal case.
- What is the frequent case, and how much can performance be improved by making it faster? => Amdahl's Law.

4) Amdahl's Law - Partial Enhancement Limits
- Speedup from an enhancement is limited by the fraction of time the enhancement can be used; the best one can ever achieve is 1 / (1 - Fraction_enhanced), even if the enhanced part becomes infinitely fast. (The full formula and the worked numbers for the example below appear at the end of these notes.)
- Example: an I/O-bound server gets a new CPU that is 10X faster, but 60% of server time is spent waiting for I/O. A 10X faster CPU is alluring, but the server ends up only about 1.6X faster.

5) The Processor Performance Equation
- CPU time = Instruction Count x CPI x Clock Cycle Time
- Which design factors affect which term:

                      Inst count    CPI    Clock cycle time
    Program               X
    Compiler              X         (X)
    Instruction set       X          X
    Organization                     X            X
    Technology                                    X

What Determines a Clock Cycle?
- At the transition edge(s) of each clock pulse, state devices sample and save their present input signals.
- Past: 1 cycle = the time for signals to pass through 10 levels of gates.
- Today: the cycle time is determined by numerous time-of-flight issues plus gate delays.
[Figure: clock waveform driving the state elements.]
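
The code sketch promised on the caches slide, my own illustration rather than anything from the deck: both functions compute the same sum, but the sequential scan enjoys temporal locality (the accumulator and loop variables are reused every iteration) and spatial locality (consecutive elements share cache blocks), while the strided scan gives up most of the spatial locality once the stride exceeds the cache block size.

```c
#include <stddef.h>

/* Sequential scan: consecutive addresses share cache blocks (spatial
 * locality), and the accumulator is reused every iteration (temporal). */
long sum_sequential(const int *a, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++)
        total += a[i];
    return total;
}

/* Strided scan over the same data: identical work, but once the stride
 * exceeds the block size each access touches a different cache block, so
 * spatial locality is largely lost and more time is spent waiting on memory. */
long sum_strided(const int *a, size_t n, size_t stride)
{
    long total = 0;
    for (size_t s = 0; s < stride; s++)
        for (size_t i = s; i < n; i += stride)
            total += a[i];
    return total;
}
```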
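
For reference, the formula behind the Amdahl's Law slide above, together with the arithmetic of its server example: the CPU work is the enhanced 40% of the time, and it becomes 10X faster, giving roughly the 1.6X overall speedup quoted on the slide.

```latex
\[
  \text{Speedup}_{\text{overall}}
    = \frac{1}{(1-\text{Fraction}_{\text{enhanced}})
        + \text{Fraction}_{\text{enhanced}}/\text{Speedup}_{\text{enhanced}}}
    \;\le\; \frac{1}{1-\text{Fraction}_{\text{enhanced}}}
\]
\[
  \text{Speedup}_{\text{server}} = \frac{1}{0.6 + 0.4/10} = \frac{1}{0.64} \approx 1.56
\]
```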