Transforming Computer System Design Derek Chiou University of Texas at Austin Electrical and Computer Engineering Supported in part by DOE, NSF, SRC, Bluespec, Intel, Xilinx, IBM, and Freescale

Page 1: Transforming Computer System Design

Transforming Computer System Design

Derek Chiou
University of Texas at Austin
Electrical and Computer Engineering

Supported in part by DOE, NSF, SRC, Bluespec, Intel, Xilinx, IBM, and Freescale

Page 2: Transforming Computer System Design

3/12/2009 SOS 13 Workshop, Hilton Head, SC

First, Some Terminology

Host
  The system on which a simulator runs
  E.g.,
    Dell 390 with a single 1.8GHz Core 2 Duo, 4GB of RAM, 10K RPM Seagate HD
    A Xilinx XUP board

Target
  The system being modeled
  E.g.,
    Alpha 21264 processor
    Dell 390 with a single 1.8GHz Core 2 Duo, 4GB of RAM, 10K RPM Seagate HD

Page 3: Transforming Computer System Design

Simulation

Many large computers used to simulate physical world
  Simulators (generally) get faster as computers get faster
  Real world does not increase in complexity

Such simulators are fantastically capable, and getting better, quickly

Page 4: Transforming Computer System Design

Except For Simulating Computers

Accurate simulators are slow
  x86: 1KIPS-10KIPS (Intel/AMD)
  Simulating 1 sec of a 3GIPS processor takes 83 hours at 10KIPS
  Simulating 2 minutes takes over 1 year

Getting slower
  Unlike physical world, computers grow in complexity faster than they get faster
    More complex cores, more cores, more features, etc.
  Slowing down by a factor of 2, relative to target, per year (Mukherjee, Intel)
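The slowdown figures on this slide follow from dividing the number of target instructions by the simulator's throughput. A minimal Python sketch of that arithmetic, assuming the 3 GIPS target rate and the 10 KIPS simulator rate quoted above (the function name `host_seconds` is illustrative and not from the talk):

```python
def host_seconds(target_seconds, target_ips, simulator_ips):
    """Host wall-clock seconds needed to simulate `target_seconds` of target time."""
    instructions = target_seconds * target_ips
    return instructions / simulator_ips

TARGET_IPS = 3e9      # 3 GIPS target processor (figure from the slide)
SIMULATOR_IPS = 10e3  # 10 KIPS accurate simulator (figure from the slide)

one_second_hours = host_seconds(1, TARGET_IPS, SIMULATOR_IPS) / 3600
two_minutes_days = host_seconds(120, TARGET_IPS, SIMULATOR_IPS) / 86400

print(f"1 target second: {one_second_hours:.1f} host hours")
print(f"2 target minutes: {two_minutes_days:.0f} host days")
```

At 1 KIPS, the low end of the quoted range, both numbers grow by another factor of ten.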

Page 5: Transforming Computer System Design

Why Faster Simulators?

Why not run many short simulations in parallel?
  Cannot run full, unmodified, unbenchmarked software

Better architect computer systems before they are built
  Speed enables longer, heavier evaluation of software on future hardware
    What architectural mechanisms improve database/game speeds?
  Facilitate co-design of hardware and software
    E.g., Intel and Microsoft making a system that works well together
    Need to be able to execute sufficient cycles to run Microsoft software
  Reduce/eliminate work by transforming simulator to implementation

Tune software for correctness, performance, and power after real system exists
  Provide full visibility at usable speeds
    Difficult/impossible to achieve on a real system

Requires Unified Simulator for Architects, Designers, Software, Algorithms

Page 6: Transforming Computer System Design

Benefits of a Unified Simulator

A common sandbox for Arch/RTL/Software
  Promotes sharing, information transferred immediately
  Write/verify once, eliminating ambiguity, wasted work, effort to keep simulators consistent
  Work proceeds in parallel

Architecture/software/algorithms co-developed
  Simulator runs software at interactive speeds while predicting performance, power, temperature, etc.
  Generate implementation from simulator???

Page 7: Transforming Computer System Design

But, a Unified Simulator has Unified Requirements

Architecture
  Timely: enough time to make decisions
  Accurate: compare architectural mechanisms
  Flexible: quick changes
  Transparent: full visibility with little/no performance impact
  Power: compare architectural mechanisms

Software
  Fast: (~10MIPS+/host in accurate mode, 20-100MIPS+/host nearly accurate)
  Full-system: run unmodified operating systems, applications,…
  Bottleneck Detection: automatically find problems

Implementation
  Accurate: produce cycle-accurate numbers
  Synthesizable: convert to implementation
  Reducible: convert back to a simulator

Page 8: Transforming Computer System Design

Cycle-Accurate Simulators

[Figure: target pipeline block diagram with blocks Fetch, iTLB, L1, Decode, Rename, RS, Br, ALU, dTLB, L1, L2, ROB; horizontal axis: Time, in cycles 0 to 14; example instruction sequence:]

  R2 = MEM[R1]
  R3 = R2 + R2
  R4 = R4 + R4

If simulator models every transition correctly, it models performance correctly

Many lightweight operations that occur in parallel
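The three example instructions on this slide can be used to illustrate why a cycle-accurate simulator performs many small operations that could run in parallel. The following Python sketch tracks only data dependences between the instructions. The latencies (3 cycles for a load, 1 for an add) are assumptions chosen for illustration, and the sketch ignores the fetch, decode, and rename stages and any structural limits shown in the figure.

```python
# Hypothetical latencies in target cycles; the slide does not give numbers.
LATENCY = {"load": 3, "add": 1}

# (destination, source registers, operation) for the three instructions on the slide.
PROGRAM = [
    ("R2", ["R1"], "load"),       # R2 = MEM[R1]
    ("R3", ["R2", "R2"], "add"),  # R3 = R2 + R2
    ("R4", ["R4", "R4"], "add"),  # R4 = R4 + R4
]

def schedule(program, latency):
    """Return (start, finish) target cycles per instruction, limited only by data dependences."""
    ready = {}  # register -> cycle at which its newest value becomes available
    times = []
    for dest, sources, op in program:
        # An instruction starts once all of its source registers are ready.
        start = max((ready.get(reg, 0) for reg in sources), default=0)
        finish = start + latency[op]
        ready[dest] = finish
        times.append((start, finish))
    return times

for (dest, _, op), (start, finish) in zip(PROGRAM, schedule(PROGRAM, LATENCY)):
    print(f"{dest} ({op}): starts cycle {start}, finishes cycle {finish}")
```

Under these assumptions the second instruction waits for the load, while the third is independent and can proceed at cycle 0 alongside it.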

Page 9: Transforming Computer System Design

FPGA-Accelerated Simulation Technologies (FAST) Simulator Architecture

[Figure: Functional Model and Timing Model, each a separate entity (processor, FPGA resources). Instructions executed by the Functional Model are sent to the Timing Model (TM). The Timing Model runs on the FPGA and contains the target pipeline blocks Fetch, iTLB, L1, Decode, Rename, RS, Br, ALU, dTLB, L1, L2, ROB. Reference: MICRO 2007]

Parallelize Timing Model

Parallelize FM-TM
  Functional model speculates at simulator level
  Roll back required if functional path != timing path
    Branch misprediction likely to require 2 rollbacks
  Functional model runs ahead, improves parallelism

Parallel slowdown due to communication?
  FM runs ahead, speculatively, round-trip communication infrequent
  Round-trip communication only when functional path != timing path
    The better the target micro-architecture, the faster the simulator
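The claim that a better target micro-architecture makes the simulator faster can be made concrete by counting round trips between the functional model (FM) and timing model (TM). The Python sketch below is a toy model, not the FAST implementation: it assumes the only source of path divergence is a mispredicted branch and charges the two rollbacks per misprediction mentioned on the slide. The function name and the sample branch sequences are hypothetical.

```python
def round_trips(actual_taken, predicted_taken, rollbacks_per_mispredict=2):
    """Count FM<->TM round trips for a sequence of branches.

    A round trip is needed only when the functional path differs from the
    timing path, modeled here as a branch whose prediction was wrong.
    """
    if len(actual_taken) != len(predicted_taken):
        raise ValueError("one prediction is required per branch")
    mispredicts = sum(a != p for a, p in zip(actual_taken, predicted_taken))
    return mispredicts * rollbacks_per_mispredict

# Hypothetical branch outcomes for a short instruction stream.
actual = [True, True, False, True, False, True, True, True]

perfect_predictor = list(actual)       # a 100% accurate target predictor
always_taken = [True] * len(actual)    # a naive target predictor

print(round_trips(actual, perfect_predictor))  # no divergence, no round trips
print(round_trips(actual, always_taken))       # two mispredictions, four rollbacks
```

With a perfect predictor the functional model never has to stop and wait for the timing model, which is the best case for simulator parallelism.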

Page 10: Transforming Computer System Design

Timing Models are Simple Compared to Implementation

[Figure: trace-driven datapath of a simple pipelined processor; labeled blocks include PC, Instruction Memory, GPR File, Immed. Extend, ALU, Data Memory, Bypass/interlock, and pipeline registers (IR, A, B, Y, MD1, MD2, R)]

Small, relatively simple
  Eventually as simple as software timing model?

Easy to modularize, compose from libraries
  makes hardware easier

Page 11: Transforming Computer System Design

Why are FAST Timing Models Simpler than Implementation?

Basically no functionality

Multiple host cycles per target cycle
  Can write a loop for a CAM

Not trying to meet timing
  Multiple host cycles helps with that

Many modules are just delays
  Fixed cycle ALUs

Can approximate when desired
  Cache studies (have all addresses) without detailed core

Often derived from previous micro-architecture
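The point about spending multiple host cycles per target cycle can be sketched with the CAM example from the slide. A real CAM compares every entry in parallel within one cycle. A timing model may instead search sequentially, one entry per host cycle, and still report a single target cycle. The Python sketch below illustrates that idea only; the function name, the sample tags, and the one-target-cycle charge are assumptions, not details from the talk.

```python
def cam_lookup(entries, key):
    """Search a CAM one entry per host cycle.

    Returns (matching index or None, host cycles spent). The target still
    sees a single-cycle lookup; only the host takes longer.
    """
    host_cycles = 0
    for index, tag in enumerate(entries):
        host_cycles += 1
        if tag == key:
            return index, host_cycles
    return None, host_cycles

TARGET_CYCLES_PER_LOOKUP = 1  # what the timing model charges the target (assumed)

tags = [0x10, 0x24, 0x38, 0x4C]  # hypothetical CAM contents
index, host_cycles = cam_lookup(tags, 0x38)
print(f"hit at entry {index}: {host_cycles} host cycles, "
      f"{TARGET_CYCLES_PER_LOOKUP} target cycle")
```

The loop needs one comparator instead of one per entry, which is the kind of hardware saving the slide attributes to decoupling host cycles from target cycles.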

Page 12: Transforming Computer System Design

FAST Prototype Overview

Software (QEMU modified with trace/rollback) functional model
  Eventually hardware functional model, but software sim exists

FPGA-based timing model written in Bluespec
  Complex OoO micro-architecture fits in a single FPGA

[Figure: the Functional Model (software, on the processor) sends a trace to the Timing Model (Bluespec HDL, on the FPGA). Host platform labels: DRC Computer, HT, Xilinx FPGA, PowerPC 405, Xilinx/Intel ACP, FSB]

Page 13: Transforming Computer System Design

Old FAST on DRC Platform (~1.2MIPS, 1000x Intel/AMD)

Page 14: Transforming Computer System Design

Current Simulator Performance

[Figure: bar chart, vertical axis from 0 to 16 (units not recoverable from the transcript); series: gshare, 97% BP, 100% BP, DRC 100%, ACP 100% BP]

Includes Operating System Code

Page 15: Transforming Computer System Design

Are Simulation/Implementation Two Forms of Same Thing?

Both can
  Run entire software stack
  Accurately predict performance

Analogous to frequency vs time domain in DSP?
  Easier to filter in freq domain than in time domain
  Transform & operate can make things easier

Requires “transforms” to convert back and forth
  High speed simulation in simulator domain
  Verify in implementation domain
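The DSP analogy on this slide rests on the convolution theorem: filtering by circular convolution in the time domain gives the same result as transforming, multiplying pointwise, and transforming back. The Python sketch below demonstrates only that DSP fact with a naive discrete Fourier transform. It says nothing about how a simulator-to-implementation transform would work, and the signal and filter values are arbitrary.

```python
import cmath

def dft(x):
    """Naive O(n^2) discrete Fourier transform."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * i * k / n) for k in range(n))
            for i in range(n)]

def idft(spectrum):
    """Inverse of dft()."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * i * k / n) for k in range(n)) / n
            for i in range(n)]

def circular_convolve(x, h):
    """Filter directly in the time domain."""
    n = len(x)
    return [sum(x[k] * h[(i - k) % n] for k in range(n)) for i in range(n)]

def filter_via_transform(x, h):
    """Transform, multiply pointwise, transform back."""
    product = [a * b for a, b in zip(dft(x), dft(h))]
    return [value.real for value in idft(product)]

signal = [1.0, 2.0, 3.0, 4.0]
kernel = [0.5, 0.5, 0.0, 0.0]  # two-tap moving average

time_domain = circular_convolve(signal, kernel)
freq_domain = filter_via_transform(signal, kernel)
print(time_domain)
print([round(v, 6) for v in freq_domain])
```

Both routes produce the same filtered signal up to floating-point rounding; the choice of domain affects only how convenient the operation is.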

Page 16: Transforming Computer System Design

FAST Capability Status

Architect
  Timely: enough time to make decisions
  Accurate: compare architectural mechanisms
  Flexible: quick changes
  Transparent: full visibility with little/no performance impact
  Performance: prediction
  Power: prediction

Software
  Fast (~10MIPS/host, 20-50-100MIPS+ /host nearly accurate)
  Full-system: run unmodified operating systems, applications,…
  Bottleneck Detection: automatically find performance problems

Implementation
  Accurate: produce cycle-accurate numbers
  Synthesizable: convert to implementation
  Reducible: convert back to a simulator

Page 17: Transforming Computer System Design

Acknowledgements

Students
  Dam Sunwoo (FM)
  Nikhil Patil (TM, tools, FAST2Imp)
  Bill Reinhart (TM connector)
  Eric Johnson (TM, machine learning + arch)
  Joonsoo Kim (TM, Implementation2FAST)
  Gene Wu (FM)

Funding, Equipment
  DOE, NSF, SRC, Sandia Graduate Fellowship
  Intel, IBM, Xilinx, Freescale

Software, tools
  Bluespec, Xilinx

Open-source full system simulators
  QEMU, Bochs

Page 18: Transforming Computer System Design

Conclusions

Accurately simulates next generation on current generation at roughly current generation speeds
  Within a factor of 100 or so today
  Could be within a factor of 2 to 10 in future
    Timing model speed is current limit
    Lower accuracy is faster

Supports integrated software and hardware exploration
  Easy to do “what-if” experiments
    Map target software functions to arbitrary timing
    Arbitrary sized matrix multiply in 1 target cycle

Software performance/power tuning tool

Simulator scaling can be purchased
  More host systems communicating over standard host interconnect
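The "what-if" experiment described above, mapping a target software function to an arbitrary timing, can be sketched by keeping functional behavior and charged time separate. In the Python sketch below the function still computes a real matrix product, while a lookup table decides how many target cycles it costs. The table, the default cost, and the `run` helper are hypothetical names for illustration and do not describe the FAST interface.

```python
# Hypothetical timing table: target cycles charged per software function.
TIMING = {"matmul": 1}
DEFAULT_CYCLES_PER_CALL = 100  # arbitrary placeholder for unmapped functions

def matmul(a, b):
    """Functional behavior: an ordinary matrix multiply on lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def run(func, *args, timing=TIMING):
    """Execute `func` functionally and return (result, target cycles charged)."""
    result = func(*args)
    cycles = timing.get(func.__name__, DEFAULT_CYCLES_PER_CALL)
    return result, cycles

product, cycles = run(matmul, [[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(product, cycles)
```

Changing one table entry is enough to ask how the rest of the software would behave if a hardware unit performed the whole multiply in a single target cycle.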