counterscounterscounterscounters counterscounters counters ... t standard tool flows (edk, ise) t...

1
The Open-Source ProtoFlex Simulator Eric S. Chung, Michael K. Papamichael*, James C. Hoe, Ken Mai, Babak Falsafi {echung, jhoe, kenmai, babak}@ece.cmu.edu, *[email protected] http://www.ece.cmu.edu/~protoflex Computer Architecture Lab at ProtoFlex: FPGA-Accelerated Full-System Multiprocessor Simulation Acknowledgements We would like to thank our colleagues in the RAMP and TRUSS projects. Our work in this area has been supported in part by NSF, IBM, Intel, Sun, Xilinx, and Microsoft. History Project started (circa 2007) to build scalable, full-system multiprocessor simulators using FPGAs Key Features Functional simulator for N-way UltraSPARC III server (~50-90 MIPS) Using hybrid simulation, runs real server apps + Solaris OS Employs multithreading to virtualize # CPUs per FPGA core 1 Hybrid Simulation Virtualization Why open source? Demonstration of FPGAs as viable architecture research vehicle Facilitate adoption of hybrid simulation & host multithreading Encourage building on top of our work What are we releasing? Bluespec source HDL, Verilog and pre-generated netlists for SPARCV9 CPU model + interfaces XUPV5 Reference Design for EDK 10.1 Virtutech Simics plug-ins for hybrid simulation Top-level SW controller, user command-line interface Documentation through online wiki 1 ProtoFlex (GNU GPL Release) 1 FPGA Linux PC PFMON PFMON SIMICS (I/O) SIMICS (I/O) FPGA Core FPGA Core Main Memory Main Memory PowerPC (or uBlaze) PowerPC (or uBlaze) Ethernet User Interfac e User Interfac e FPGA Linux PC PFMON PFMON SIMICS (I/O) SIMICS (I/O) FPGA Core FPGA Core Main Memory Main Memory PowerPC (or uBlaze) PowerPC (or uBlaze) Ethernet User Interfac e User Interfac e User perceives familiar SW-like UltraSPARC III simulator Software user interface similar to Simics Applications load directly from Simics checkpoints Standard simulation features: state viewing, scripting, single-stepping, checkpointing, terminal, profiling/monitoring ISA Specifications 64-bit SPARCV9 ISA + US III extensions 8 register windows, 4 global register files 512-entry D-TLB, 128 I-TLB Implementation 14-stage, multi-threaded pipeline, switch context on each cycle On Virtex-5, XST~148MHz, Placed & routed @ 100MHz Parameterized non-blocking caches FP + rare MMU instructions are SW-emulated by nearby uBlaze 100% mirrors Virtutech Simics model 1 I-TLB Stage 1 I-TLB Stage 2 64-bit ALU Stage 1 Context Scheduler I-Fetch Address Generate I-Fetch Tag Check US III Decoder 64-bit ALU Stage 2 D-TLB Stage 1 D-TLB Stage 2 D-TLB Stage 3 D-Cache Address Generate D-Cache Tag Check Multi-Cycle Instruction Unit Nonblocking I-cache (BRAM) Writeback Arbiter to DDR Memory Integer RF (BRAM) Nonblocking D- cache (BRAM) I-TLB Stage 1 I-TLB Stage 2 64-bit ALU Stage 1 Context Scheduler I-Fetch Address Generate I-Fetch Tag Check US III Decoder 64-bit ALU Stage 2 D-TLB Stage 1 D-TLB Stage 2 D-TLB Stage 3 D-Cache Address Generate D-Cache Tag Check Multi-Cycle Instruction Unit Nonblocking I-cache (BRAM) Writeback Arbiter to DDR Memory Integer RF (BRAM) Nonblocking D- cache (BRAM) Runs 100MHz on V5 Synthesizes up to 148MHz using standard tools (ISE XST) Logic usage 23.5 KLUTs (11.3% LX330T) BRAM usage 120 BRAMs for 16-context configuration (37% LX330T) Future optimizations Paging structures to SRAM or DRAM can reduce BRAM by significant amount Will release in future updates 1 Examples Functional-first CMP cache coherency model for first-order timing models and functional warming [TRETS09] Real-time stack trace profiling CMP interconnect model (in progress) Realistic CPU traffic generators (in progress) running real 16-CPU server workloads Oracle TPC-C, IBM DB/2 TPC-C, TPC-H, SPEC2K 1 Piranha CMP Cache (First-Order Timing Model) Piranha CMP Cache (First-Order Timing Model) Statistics + Warmed Coherency & Tag States Statistics + Warmed Coherency & Tag States Piranha CMP Cache (First-Order Timing Model) Piranha CMP Cache (First-Order Timing Model) Statistics + Warmed Coherency & Tag States Statistics + Warmed Coherency & Tag States We use Bluespec System Verilog (high-level, synthesizable HDL) 4-8 weeks learning curve for normal HDL users Once learned, easier to read/modify than conventional RTL Requires BSV compiler (free for academics) Paper in MEMOCODE09 describes BSV coding/validation of core Sample code: 1 rule split_ALU_pipeline (True); p1 = piperegs[DECODE]; piperegs[ALU1] <= doALUStage1(p1, alu_ifc); p2 = piperegs[ALU1]; piperegs[ALU2] <= doALUStage2(p2, alu_ifc); endrule rule merged_ALU_pipeline (True); p1 = piperegs[DECODE]; p_tmp1 = doALUStage1(p1, alu_ifc); p_tmp2 = doALUStage2(p_tmp1, alu_ifc); piperegs[ALU] <= p_tmp2; endrule 2-stage ALU 1-stage ALU Why XUPv5? Inexpensive (~$750), easily accessible Standard tool flows (EDK, ISE) Reference design portable to other platforms just drop in our pcoresSupporting other platforms Future ports to BEE3 & Xilinx Accelerated Computing Platform (ACP) Plan to release with future updates 1 BlueSPARC Core and Statistics ProtoFlex Instrumentation and Applications 0 10 20 30 40 50 60 70 oracle bzip 2 crafty gcc gzip parser vortex average MIPS BlueSPARC (90mhz) Simics-fast (2.0GHz C2Duo) Simics-trace 1.18 Performance on BEE2 Platform Changing Core Parameters Number of CPU contexts Cache sizes Merge/split pipeline stages Enable/disable modules for profiling & debugging Clock frequency (tested @ 10 MHz 100 MHz) Set optimal LUTRAM size (16 = V2P, 64 = V5) Choose LUTRAMs or BRAMs for any CPU state System Parameters UDP or TCP/IP (for PFMON-to-FPGA communication) XUPv5, BEE2 1 Hardware Requirements Bluespec Language & Flow Other simulator features Thank you! Add passive monitors Counters, histograms Roll your own 1 I-TLB Stage 1 I-TLB Stage 2 64-bit ALU Stage 1 Context Scheduler I-Fetch Address Generate I-Fetch Tag Check US III Decoder 64-bit ALU Stage 2 D-TLB Stage 1 D-TLB Stage 2 D-TLB Stage 3 D-Cache Address Generate D-Cache Tag Check Multi-Cycle Instruction Unit Nonblocking I-cache (BRAM) Writeback Arbiter to DDR Memory Integer RF (BRAM) Nonblocking D-cache (BRAM) I-TLB Stage 1 I-TLB Stage 2 64-bit ALU Stage 1 Context Scheduler I-Fetch Address Generate I-Fetch Tag Check US III Decoder 64-bit ALU Stage 2 D-TLB Stage 1 D-TLB Stage 2 D-TLB Stage 3 D-Cache Address Generate D-Cache Tag Check Multi-Cycle Instruction Unit Nonblocking I-cache (BRAM) Writeback Arbiter to DDR Memory Integer RF (BRAM) Nonblocking D-cache (BRAM) Trace-based simulation Collect dynamic traces Feed traces to functional-first timing model Sampled Program Monitoring Use micro-blaze (or PPC) to monitor core/memory state Unintrusive profiling w/o changes to target SW Counters Counters Counters Counters Counters Counters Histogram Tracker Histogram Tracker Histogram Tracker Histogram Tracker Histogram Trackers Histogram Trackers Counters Counters Counters Counters Counters Counters Counters Counters Counters Counters Counters Counters Histogram Tracker Histogram Tracker Histogram Tracker Histogram Tracker Histogram Trackers Histogram Trackers Histogram Tracker Histogram Tracker Histogram Tracker Histogram Tracker Histogram Trackers Histogram Trackers Timing Model Timing Model Timing Model Timing Model FPGA Hard/Soft Core (PowerPC or MicroBlaze) FPGA Hard/Soft Core (PowerPC or MicroBlaze) FPGA Hard/Soft Core (PowerPC or MicroBlaze) FPGA Hard/Soft Core (PowerPC or MicroBlaze) Topics & References Hybrid Simulation and FPGA Core Multithreading A Complexity-Effective Architecture for Accelerating Full-System Multiprocessor Simulations Using FPGAs [FPGA08] Functional-First, First-Order Timing Model for a CMP cache ProtoFlex: Towards Scalable, Full-System Multiprocessor Simulations Using FPGAs [TRETS09] FPGA Core Design & Validation Using Flight Data Recorder Implementing a High-performance Multithreaded Microprocessor: A Case Study in High- level Design and Validation [MEMOCODE09] 1 FPGA Board Linux PC Ethernet BlueSPARC PFMON + Simics 1 Bluespec Verilog Bluespec compiler ~30 minutes Verilog Bitstream Xilinx EDK ~ 3 hours Bitstream Working System Stream mem. image over ethernet ~ 5 minutes (for 512MB image) Bluespec code Verilog code Bitstream 1 2 3 1 2 Working System 3

Upload: others

Post on 10-Aug-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: CountersCountersCountersCounters CountersCounters Counters ... t Standard tool flows (EDK, ISE) t Reference design portable to other platforms t just drop in our Zpcores[{ Supporting

The Open-Source ProtoFlex SimulatorEric S. Chung, Michael K. Papamichael*, James C. Hoe, Ken Mai, Babak Falsafi

{echung, jhoe, kenmai, babak}@ece.cmu.edu, *[email protected]

http://www.ece.cmu.edu/~protoflexComputer Architecture Lab at

ProtoFlex: FPGA-Accelerated Full-System Multiprocessor Simulation

Acknowledgements We would like to thank our colleagues in the RAMP and TRUSS projects. Our work in this area has been supported in part by NSF, IBM, Intel, Sun, Xilinx, and Microsoft.

• History

– Project started (circa 2007) to build scalable, full-system multiprocessor simulators using FPGAs

• Key Features

– Functional simulator for N-way UltraSPARC III server (~50-90 MIPS)

– Using hybrid simulation, runs real server apps + Solaris OS

– Employs multithreading to virtualize # CPUs per FPGA core

1

Hybrid Simulation Virtualization

• Why open source?

– Demonstration of FPGAs as viable architecture research vehicle

– Facilitate adoption of hybrid simulation & host multithreading

– Encourage building on top of our work

• What are we releasing?

– Bluespec source HDL, Verilog and pre-generated netlistsfor SPARCV9 CPU model + interfaces

– XUPV5 Reference Design for EDK 10.1

– Virtutech Simics plug-ins for hybrid simulation

– Top-level SW controller, user command-line interface

– Documentation through online wiki 1

ProtoFlex (GNU GPL Release)

1

FPGA Linux PC

PFMONPFMON

SIMICS

(I/O)

SIMICS

(I/O)

FPGA

Core

FPGA

Core

Main MemoryMain Memory

PowerPC

(or uBlaze)

PowerPC

(or uBlaze)

Ethernet User

Interfac

e

User

Interfac

e

FPGA Linux PC

PFMONPFMON

SIMICS

(I/O)

SIMICS

(I/O)

FPGA

Core

FPGA

Core

Main MemoryMain Memory

PowerPC

(or uBlaze)

PowerPC

(or uBlaze)

Ethernet User

Interfac

e

User

Interfac

e

• User perceives familiar SW-like UltraSPARC III simulator

– Software user interface similar to Simics

– Applications load directly from Simics checkpoints

– Standard simulation features: state viewing, scripting, single-stepping, checkpointing, terminal, profiling/monitoring

• ISA Specifications

– 64-bit SPARCV9 ISA + US III extensions

– 8 register windows, 4 global register files

– 512-entry D-TLB, 128 I-TLB

• Implementation

– 14-stage, multi-threaded pipeline, switch context on each cycle

– On Virtex-5, XST~148MHz,Placed & routed @ 100MHz

– Parameterized non-blocking caches

– FP + rare MMU instructions areSW-emulated by nearby uBlaze

– 100% mirrors Virtutech Simics model 1

I-TLB Stage 1

I-TLB Stage 2

64-bit ALU Stage 1

Context Scheduler

I-Fetch Address Generate

I-Fetch Tag Check

US III Decoder

64-bit ALU Stage 2

D-TLB Stage 1

D-TLB Stage 2

D-TLB Stage 3

D-Cache Address Generate

D-Cache Tag Check

Multi-Cycle Instruction Unit

NonblockingI-cache(BRAM)

Writeback

Arbiter to DDR Memory

Integer RF(BRAM)

Nonblocking D-cache

(BRAM)

I-TLB Stage 1

I-TLB Stage 2

64-bit ALU Stage 1

Context Scheduler

I-Fetch Address Generate

I-Fetch Tag Check

US III Decoder

64-bit ALU Stage 2

D-TLB Stage 1

D-TLB Stage 2

D-TLB Stage 3

D-Cache Address Generate

D-Cache Tag Check

Multi-Cycle Instruction Unit

NonblockingI-cache(BRAM)

Writeback

Arbiter to DDR Memory

Integer RF(BRAM)

Nonblocking D-cache

(BRAM)

• Runs 100MHz on V5

– Synthesizes up to 148MHz using standard tools (ISE XST)

• Logic usage

– 23.5 KLUTs (11.3% LX330T)

• BRAM usage

– 120 BRAMs for 16-context configuration (37% LX330T)

• Future optimizations

– Paging structures to SRAM or DRAM can reduce BRAM by significant amount

– Will release in future updates 1

• Examples

– Functional-first CMP cache coherency model for first-order timing models and functional warming [TRETS’09]

– Real-time stack trace profiling

– CMP interconnect model (in progress)

– Realistic CPU traffic generators (in progress)

• … running real 16-CPU server workloads

– Oracle TPC-C, IBM DB/2 TPC-C, TPC-H, SPEC2K

1

Piranha CMP Cache

(First-Order Timing

Model)

Piranha CMP Cache

(First-Order Timing

Model)

Statistics + Warmed

Coherency &

Tag States

Statistics + Warmed

Coherency &

Tag States

Piranha CMP Cache

(First-Order Timing

Model)

Piranha CMP Cache

(First-Order Timing

Model)

Statistics + Warmed

Coherency &

Tag States

Statistics + Warmed

Coherency &

Tag States

• We use Bluespec System Verilog (high-level, synthesizable HDL)

– 4-8 weeks learning curve for normal HDL users

– Once learned, easier to read/modify than conventional RTL

– Requires BSV compiler (free for academics)

– Paper in MEMOCODE’09 describes BSV coding/validation of core

• Sample code:

1

rule split_ALU_pipeline (True);…p1 = piperegs[DECODE];piperegs[ALU1] <= doALUStage1(p1, alu_ifc);p2 = piperegs[ALU1];piperegs[ALU2] <= doALUStage2(p2, alu_ifc);…

endrule

rule merged_ALU_pipeline (True);…p1 = piperegs[DECODE];p_tmp1 = doALUStage1(p1, alu_ifc);p_tmp2 = doALUStage2(p_tmp1, alu_ifc);piperegs[ALU] <= p_tmp2;…

endrule

2-stage ALU 1-stage ALU• Why XUPv5?

– Inexpensive (~$750), easily accessible

– Standard tool flows (EDK, ISE)

– Reference design portable to other platforms – just drop in our ‘pcores’

• Supporting other platforms

– Future ports to BEE3 & Xilinx Accelerated Computing Platform (ACP)

– Plan to release with future updates

1

BlueSPARC Core and Statistics

ProtoFlex Instrumentation and Applications

010203040506070

orac

le

bzip

2

craf

ty gcc

gzip

pars

ervo

rtex

aver

age

MIP

S

BlueSPARC (90mhz)Simics-fast (2.0GHz C2Duo)Simics-trace

1.18

Performance on BEE2 Platform

• Changing Core Parameters

– Number of CPU contexts

– Cache sizes

– Merge/split pipeline stages

– Enable/disable modules for profiling & debugging

– Clock frequency (tested @ 10 MHz – 100 MHz)

– Set optimal LUTRAM size (16 = V2P, 64 = V5)

– Choose LUTRAMs or BRAMs for any CPU state

• System Parameters

– UDP or TCP/IP (for PFMON-to-FPGA communication)

– XUPv5, BEE2

1

Hardware Requirements Bluespec Language & Flow Other simulator features

Thank you!

• Add passive monitors

– Counters, histograms

– Roll your own

1

I-TLB Stage 1

I-TLB Stage 2

64-bit ALU Stage 1

Context Scheduler

I-Fetch Address Generate

I-Fetch Tag Check

US III Decoder

64-bit ALU Stage 2

D-TLB Stage 1

D-TLB Stage 2

D-TLB Stage 3

D-Cache Address Generate

D-Cache Tag Check

Multi-Cycle Instruction Unit

NonblockingI-cache(BRAM)

Writeback

Arbiter to DDR Memory

Integer RF(BRAM)

NonblockingD-cache(BRAM)

I-TLB Stage 1

I-TLB Stage 2

64-bit ALU Stage 1

Context Scheduler

I-Fetch Address Generate

I-Fetch Tag Check

US III Decoder

64-bit ALU Stage 2

D-TLB Stage 1

D-TLB Stage 2

D-TLB Stage 3

D-Cache Address Generate

D-Cache Tag Check

Multi-Cycle Instruction Unit

NonblockingI-cache(BRAM)

Writeback

Arbiter to DDR Memory

Integer RF(BRAM)

NonblockingD-cache(BRAM)

• Trace-based simulation

– Collect dynamic traces

– Feed traces to functional-first timing model

• Sampled Program Monitoring

– Use micro-blaze (or PPC) to monitor core/memory state

– Unintrusive profiling w/o changes to target SW

CountersCountersCountersCounters

CountersCounters

Histogram

Tracker

Histogram

TrackerHistogram

Tracker

Histogram

TrackerHistogram

Trackers

Histogram

Trackers

CountersCountersCountersCounters

CountersCountersCountersCounters

CountersCountersCountersCounters

Histogram

Tracker

Histogram

TrackerHistogram

Tracker

Histogram

TrackerHistogram

Trackers

Histogram

Trackers

Histogram

Tracker

Histogram

TrackerHistogram

Tracker

Histogram

TrackerHistogram

Trackers

Histogram

Trackers

Timing

Model

Timing

ModelTiming

Model

Timing

Model

FPGA

Hard/Soft Core

(PowerPC or

MicroBlaze)

FPGA

Hard/Soft Core

(PowerPC or

MicroBlaze)

FPGA

Hard/Soft Core

(PowerPC or

MicroBlaze)

FPGA

Hard/Soft Core

(PowerPC or

MicroBlaze)

• Topics & References

– Hybrid Simulation and FPGA Core MultithreadingA Complexity-Effective Architecture for Accelerating Full-System Multiprocessor Simulations Using FPGAs [FPGA’08]

– Functional-First, First-Order Timing Model for a CMP cacheProtoFlex: Towards Scalable, Full-System Multiprocessor Simulations Using FPGAs [TRETS’09]

– FPGA Core Design & Validation Using Flight Data RecorderImplementing a High-performance Multithreaded Microprocessor: A Case Study in High-level Design and Validation [MEMOCODE’09]

1

FPGA BoardLinux PC

Ethernet

BlueSPARCPFMON+ Simics

1

• Bluespec Verilog

– Bluespec compiler

– ~30 minutes

• Verilog Bitstream

– Xilinx EDK

– ~ 3 hours

• BitstreamWorking System

– Stream mem. image over ethernet

– ~ 5 minutes (for 512MB image)

Bluespec code

Verilog code

Bitstream

1

2

3

1

2

WorkingSystem

3