HAsim FPGA-Based Processor Models: Basic Models
Michael Adler, Elliott Fleming, Michael Pellauer, Joel Emer
2
Managing Complexity
How do we reduce the work of writing software models?
– Re-use components
– Split functional & timing models
– Model only what is necessary to compute target performance
– Limit data manipulation in timing models, e.g.:
• Manage cache tags but not data
• Let the functional model handle DEC and EXE details

Same on the FPGA, plus:
– Use software for operations that don’t affect model speed
3
Functional Model
• ISA semantics only – no timing
• Similarities to software models:
– ISA-agnostic code implements core functions:
• Register file
• Virtual to physical address translation
• Store buffer
– ISA-specific code:
• Decode
• Execute
HAsim currently implements Alpha and SMIPS
4
Functional Model is Hybrid FPGA / Software
• Difficult but infrequent tasks do not need to be accelerated
• HAsim uses gem5 for:
– Loading programs
– Target machine address space management
– Emulation of rare & complex instructions
5
Functional Model is Latency Insensitive (LI)
• LI design enables hybrid FPGA / software implementation
– Functional memory is cached on the FPGA, homed on the host
– Emulate system calls (gem5 user mode) in software
• Area-efficient FPGA implementation
– Store buffer uses a hash table instead of a CAM
– Choose the number of register-file read ports to handle the common case
– Serialized rewind
6
Functional Memory
[Figure: the read path over time, split between FPGA and software. A memory read interface first checks the private memory cache; a hit responds immediately, while a miss goes to the shared central cache. A central-cache miss crosses to the software memory interface and is serviced by memory in m5.]
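The hit/miss flow in the figure can be sketched in Python (a toy model of the hierarchy, not the HAsim/LEAP implementation; all names here are illustrative):

```python
# Sketch of the functional-memory read path: a private cache backed by
# a shared central cache on the FPGA, homed in host memory (gem5/m5)
# in software on a miss.

class Level:
    def __init__(self, backing):
        self.data = {}
        self.backing = backing  # next cache level, or a dict for host memory

    def read(self, addr):
        if addr in self.data:                 # hit: serviced at this level
            return self.data[addr]
        if isinstance(self.backing, dict):    # homed in software (m5)
            value = self.backing.get(addr, 0)
        else:
            value = self.backing.read(addr)   # miss: ask the next level
        self.data[addr] = value               # fill on the way back
        return value

host_memory = {0x2000: 7}                     # software home for the data
central_cache = Level(host_memory)
private_cache = Level(central_cache)

assert private_cache.read(0x2000) == 7        # miss -> central -> host
assert 0x2000 in central_cache.data           # filled along the miss path
```

Because the design is latency insensitive, the model is correct whether a read is satisfied in one FPGA cycle (private hit) or thousands (host round trip).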
7
Reducing Model Complexity: Shared Functional Model
[Figure: the shared functional pipeline – ITranslate, Fetch, Decode, Execute, DTranslate, Memory, Local Commit, Global Commit – all backed by shared functional state.]
• Similar philosophy to software models:
– Write the ISA functional model once
– Functional machine state is completely managed
– Timing models can be ISA-independent
• Each functional pipeline stage behaves like a request/response FIFO
ISPASS 2008 Paper: Quick Performance Models Quickly: Timing-Directed Simulation on FPGAs
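The request/response FIFO behavior of a functional pipeline stage can be sketched in Python (the real stages are Bluespec; the stage function below is a made-up stand-in):

```python
from collections import deque

# Sketch: each functional stage accepts requests and returns responses
# in order through a FIFO. The timing model never sees the stage's
# internal latency, only the ordered stream of responses.

class FuncStage:
    def __init__(self, fn):
        self.fn = fn
        self.responses = deque()

    def make_req(self, req):
        self.responses.append(self.fn(req))

    def get_rsp(self):
        return self.responses.popleft()

# Illustrative "decode" stage; the predicate is arbitrary toy logic.
decode = FuncStage(lambda pc: {"pc": pc, "isLoad": pc % 2 == 0})
decode.make_req(0x1000)
decode.make_req(0x1001)
assert decode.get_rsp()["isLoad"] is True   # responses in request order
assert decode.get_rsp()["pc"] == 0x1001
```

This is why the same functional model can sit behind many different timing models: the interface is just ordered requests and responses.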
8
FPGA Functional Detail: Token Indexed Scoreboard
• Functional model maintains a “token” for each active instruction• Like a pointer, but better on hardware than new, delete and global
storage• Token indexed state in functional scoreboard:
– Decode information (isLoad, isStore, …)– Architectural to physical register mapping– Memory information– Model verification logic (e.g. prove that an instruction properly flows
through all functional stages.)
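A Python sketch of the token idea (not HAsim source; the fields and limits below are illustrative): a token is a small integer index into dense arrays, so per-instruction state needs no heap allocation or pointer chasing, which maps well onto FPGA BRAM.

```python
# Sketch of token-indexed scoreboard state, including the verification
# idea from the slide: prove an instruction flowed through every
# functional stage before it retires.

MAX_TOKENS = 8

class TokenScoreboard:
    def __init__(self):
        self.free = list(range(MAX_TOKENS))      # available token indices
        self.is_load = [False] * MAX_TOKENS      # decode information
        self.dst_preg = [None] * MAX_TOKENS      # arch->phys reg mapping
        self.stage_seen = [set() for _ in range(MAX_TOKENS)]

    def alloc(self):
        return self.free.pop(0)

    def note_stage(self, tok, stage):
        self.stage_seen[tok].add(stage)

    def retire(self, tok, expected_stages):
        # Verification: the instruction must have visited every stage.
        assert self.stage_seen[tok] >= expected_stages
        self.stage_seen[tok].clear()
        self.free.append(tok)

sb = TokenScoreboard()
tok = sb.alloc()
for stage in ("FET", "DEC", "EXE", "LCO"):
    sb.note_stage(tok, stage)
sb.retire(tok, {"FET", "DEC", "EXE", "LCO"})     # passes the check
```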
10
Functional Model API (Named Connections)
Connection_Client#(FUNCP_REQ_DO_ITRANSLATE, FUNCP_RSP_DO_ITRANSLATE) linkToITR <- mkConnection_Client("funcp_doITranslate");
Connection_Client#(FUNCP_REQ_GET_INSTRUCTION, FUNCP_RSP_GET_INSTRUCTION) linkToFET <- mkConnection_Client("funcp_getInstruction");
Connection_Client#(FUNCP_REQ_GET_DEPENDENCIES, FUNCP_RSP_GET_DEPENDENCIES) linkToDEC <- mkConnection_Client("funcp_getDependencies");
Connection_Client#(FUNCP_REQ_GET_RESULTS, FUNCP_RSP_GET_RESULTS) linkToEXE <- mkConnection_Client("funcp_getResults");
Connection_Client#(FUNCP_REQ_DO_DTRANSLATE, FUNCP_RSP_DO_DTRANSLATE) linkToDTR <- mkConnection_Client("funcp_doDTranslate");
Connection_Client#(FUNCP_REQ_DO_LOADS, FUNCP_RSP_DO_LOADS) linkToLOA <- mkConnection_Client("funcp_doLoads");
Connection_Client#(FUNCP_REQ_DO_STORES, FUNCP_RSP_DO_STORES) linkToSTO <- mkConnection_Client("funcp_doSpeculativeStores");
Connection_Client#(FUNCP_REQ_COMMIT_RESULTS, FUNCP_RSP_COMMIT_RESULTS) linkToLCO <- mkConnection_Client("funcp_commitResults");
Connection_Client#(FUNCP_REQ_COMMIT_STORES, FUNCP_RSP_COMMIT_STORES) linkToGCO <- mkConnection_Client("funcp_commitStores");
Connection_Send#(CONTROL_MODEL_CYCLE_MSG) linkModelCycle <- mkConnection_Send("model_cycle");
Connection_Send#(CONTROL_MODEL_COMMIT_MSG) linkModelCommit <- mkConnection_Send("model_commits");
11
Functional: Sample Data Structures
// FUNCP_REQ_DO_ITRANSLATE
typedef struct
{
    CONTEXT_ID contextId;
    ISA_ADDRESS virtualAddress;    // Virtual address to translate
}
FUNCP_REQ_DO_ITRANSLATE
    deriving (Eq, Bits);
// FUNCP_RSP_DO_ITRANSLATE
typedef struct
{
    CONTEXT_ID contextId;
    MEM_ADDRESS physicalAddress;   // Result of translation.
    MEM_OFFSET offset;             // Offset of the instruction.
    Bool fault;                    // Translation failure: fault will be raised on
                                   // attempts to commit this token. physicalAddress
                                   // is on the guard page, so it can still be used
                                   // in order to simplify timing model logic.
    Bool hasMore;                  // More translations coming? (The fetch spans two addresses.)
}
FUNCP_RSP_DO_ITRANSLATE
    deriving (Eq, Bits);
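One way a timing model might consume the hasMore field above, sketched in Python (the tuples stand in for the BSV struct; the helper name is hypothetical): keep draining responses while hasMore is set, since a fetch spanning two pages yields two translations.

```python
# Sketch: gather all translation responses belonging to one fetch.
# Each response is (physicalAddress, fault, hasMore).

def collect_translations(responses):
    out = []
    for rsp in responses:
        phys, fault, has_more = rsp
        out.append((phys, fault))
        if not has_more:        # last response for this fetch
            break
    return out

# A fetch spanning a page boundary: two responses, the first with
# hasMore set; a third response belongs to the next fetch.
rsps = [(0x1FF8, False, True), (0x2000, False, False), ("next", False, False)]
assert collect_translations(rsps) == [(0x1FF8, False), (0x2000, False)]
```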
12
Step 1: Basic Unpipelined Target
13
Key Components
• How do I start running?
• Signal completion?
• Execute instructions?
14
The “Starter”
• Global controller is in software
– Orchestrates program loading
– Tells local controllers when to begin
• Local controllers (FPGA)
– Associated with target machine pipeline stage models
– LI (like most of the API)
LOCAL_CONTROLLER#(MAX_NUM_CPUS) localCtrl <- mkLocalController(inports, outports);
let cpu_iid <- localCtrl.startModelCycle();
localCtrl.endModelCycle(cpu_iid, 1);
15
Timing: Starting the Pipeline
rule stage1_itrReq (True);
    // Begin a new model cycle.
    let cpu_iid <- localCtrl.startModelCycle();
    linkModelCycle.send(cpu_iid);
    debugLog.nextModelCycle(cpu_iid);
…
16
Must Drive the Functional Pipeline
[Figure: the shared functional pipeline – ITranslate, Fetch, Decode, Execute, DTranslate, Memory, Local Commit, Global Commit – backed by functional state; the timing model must drive every stage.]
17
Timing Model
[Figure: a timing pipeline holding the IP and computing the next IP, driving the functional pipeline – ITranslate, Fetch, Decode, Execute, DTranslate, Memory, Local Commit, Global Commit – backed by functional state.]
• Timing & functional models communicate state using tokens
• Minimal timing model:
– Only state is the IP
– Drives a single token at a time
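The minimal timing model can be sketched in a few lines of Python (the real model is Bluespec; funcp_execute below is a toy stand-in for the functional model's request/response API, and the addresses are made up):

```python
# Sketch: the timing model's only state is the IP (pc), and it drives a
# single token through the functional pipeline at a time.

def funcp_execute(pc):
    """Toy functional model: instruction at 0x100C is a taken branch."""
    if pc == 0x100C:
        return ("RBranchTaken", 0x2000)
    return ("RBranchNotTaken", pc + 4)

def run_timing_model(start_pc, model_cycles):
    pc = start_pc                       # the ONLY timing-model state
    for _ in range(model_cycles):
        kind, addr = funcp_execute(pc)  # one token in flight at a time
        pc = addr if kind == "RBranchTaken" else pc + 4
    return pc

assert run_timing_model(0x1000, 3) == 0x100C
assert run_timing_model(0x1000, 4) == 0x2000   # branch redirects the IP
```

Because each model cycle waits for the previous token to complete, this model is correct but fully serialized; the pipelined target in Step 2 removes that serialization.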
18
Timing: Stage 1 (ITranslate)
rule stage1_itrReq (True);
    // Begin a new model cycle.
    let cpu_iid <- localCtrl.startModelCycle();
    linkModelCycle.send(cpu_iid);
    debugLog.nextModelCycle(cpu_iid);

    // Translate the next pc.
    Reg#(ISA_ADDRESS) pc = pcPool[cpu_iid];
    let ctx_id = getContextId(cpu_iid);
    linkToITR.makeReq(initFuncpReqDoITranslate(ctx_id, pc));
    debugLog.record_next_cycle(cpu_iid, $format("Translating virtual address: 0x%h", pc));
endrule

pcPool is the only state in the timing model!
19
Timing: Fetch
rule stage2_itrRsp_fetReq (True);
    // Get the ITrans response started by stage1_itrReq
    let rsp = linkToITR.getResp();
    linkToITR.deq();

    let cpu_iid = getCpuInstanceId(rsp.contextId);
    debugLog.record(cpu_iid, $format("ITR Responded, hasMore: %0d", rsp.hasMore));

    // Fetch the next instruction
    linkToFET.makeReq(initFuncpReqGetInstruction(rsp.contextId, rsp.physicalAddress, rsp.offset));
    debugLog.record(cpu_iid, $format("Fetching physical address: 0x%h, offset: 0x%h", rsp.physicalAddress, rsp.offset));
endrule
21
Timing: Execute Response

rule stage5_exeRsp (True);
    // Get the execution result
    let exe_resp = linkToEXE.getResp();
    linkToEXE.deq();

    let tok = exe_resp.token;
    let res = exe_resp.result;

    let cpu_iid = tokCpuInstanceId(tok);
    Reg#(ISA_ADDRESS) pc = pcPool[cpu_iid];

    // If it was a branch we must update the PC.
    case (res) matches
        tagged RBranchTaken .addr:
            pc <= addr;

        tagged RBranchNotTaken .addr:
            pc <= pc + exe_resp.instructionSize;

        default:
            pc <= pc + exe_resp.instructionSize;
    endcase
    . . .
23
Timing: Execute (Minimal Requirement)
• Send the PC to the functional model
• Receive the updated PC from the functional model
24
Functional: Execute
• Read required input registers from the token scoreboard
• Wait for input registers to have valid values
– Functional model supports OOO models but enforces execution in a valid order
– Timing model must execute instructions in a valid order or simulation will deadlock
• Forward values to an ISA-specific datapath
• Write result to physical register(s)
– Functional model may return without finishing the result register write (e.g. floating point emulation)
– State returned to the timing model (e.g. branch resolution) must be computed before return
25
Hybrid Instruction Emulation (Infrequent/Complicated Instructions)
[Sequence diagram, FPGA vs. software over time: Execute hands the instruction to the FPGA-side emulation server, which syncs registers to software across the RRR layer and requests “Emulate Instruction”. The software functional instruction simulator emulates it, with the memory server writing back or invalidating lines in the FPGA functional cache as needed. On “Emulation Done” the registers are synced back and Execute receives an ack.]
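The round trip in the figure can be sketched in Python (a toy stand-in; the opcode names, register file, and function split are all illustrative, not the HAsim RRR protocol):

```python
# Sketch of the hybrid emulation handshake: the FPGA handles common
# instructions directly; a rare instruction syncs registers to software,
# is emulated there (as gem5 would), and the updated state returns.

def software_emulate(opcode, regs):
    """Software-side emulation server (e.g. a rare FP instruction)."""
    if opcode == "fdiv":
        regs = dict(regs, f0=regs["f1"] / regs["f2"])
    return regs

def fpga_execute(opcode, regs):
    if opcode in ("add", "sub"):          # fast path stays on the FPGA
        return regs, "fpga"
    # Slow path: sync registers out, emulate in software, sync back.
    regs = software_emulate(opcode, regs)
    return regs, "software"

regs, path = fpga_execute("fdiv", {"f1": 6.0, "f2": 3.0})
assert path == "software" and regs["f0"] == 2.0
```

The key property is that only correctness, not model throughput, depends on the slow path, so it can live in software without hurting simulation speed.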
26
Unpipelined Model Source
Single source file: unpipelined-pipeline.bsv
27
Step 2: Simple Pipelined Target
28
Model Performance Goal: Pipeline Parallelism
[Figure: the shared functional pipeline – ITranslate, Fetch, Decode, Execute, DTranslate, Memory, Local Commit, Global Commit – with the timing model now tracking multiple IPs and next IPs.]
• The unpipelined target necessarily serialized functional model calls
• A pipelined target could run each stage in parallel
• Must solve one problem: how to manage time
29
Managing Time: A-Ports and Soft Connections
FPGA cycles != simulated cycles
– HAsim computes target time algorithmically
– We are building a timing model, NOT a prototype
– A 1:n cycle mapping would force us to slow the timing clock to the longest operation, even if it is infrequent
– 1:n would also force us either to fit an entire design on the FPGA or synchronize clock domains
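The decoupling being argued for can be shown with a tiny Python sketch (the per-cycle costs are made up for illustration): simulated time is a computed counter, not a fixed multiple of the FPGA clock.

```python
# Sketch: each model cycle may take a different number of FPGA cycles,
# so simulated time advances by one per model cycle regardless of how
# long the FPGA took to compute it.

def simulate(fpga_cycles_per_model_cycle):
    fpga_clock, model_clock = 0, 0
    for cost in fpga_cycles_per_model_cycle:
        fpga_clock += cost      # rare slow events cost more FPGA cycles
        model_clock += 1        # but exactly one unit of simulated time
    return fpga_clock, model_clock

# Nine cheap model cycles and one rare 50-cycle event:
fpga, model = simulate([1] * 9 + [50])
assert (fpga, model) == (59, 10)   # pay the average cost, not the worst case
```

Under a fixed 1:n mapping the whole simulation would run at the worst-case cost every cycle; here the rare event is paid for only when it occurs.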
30
Option #1: Global Controller [rejected]
Central controller advances the cycle when all modules are ready
• Slowest possible cycle no longer dictates throughput
• However:
– Place & route becomes difficult
– Long signal to the global controller is on the critical path

[Figure: FET, DEC, EXE, MEM, and WB stages all wired to a central controller holding curCC.]
31
Option #2: A-Ports
• Extension of Asim named channels
• FIFO with user-specified latency and capacity
• Pass exactly one message per cycle per port
– Beginning of model cycle: read all input ports
– End of model cycle: write all output ports
ISFPGA 2008 Paper: A-Ports: An Efficient Abstraction for Cycle-Accurate Performance Models on FPGAs
[Figure: FET, DEC, EXE, MEM, and WB connected by A-Ports annotated with per-port latencies (1, 1, 1, 2, and a latency-10 edge).]
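A-Port semantics can be sketched in Python (illustrative only; the real A-Ports are Bluespec): a FIFO with a fixed model-cycle latency, where each producer writes exactly one message (possibly empty) per model cycle and each consumer reads exactly one. Pre-filling the FIFO with empty slots gives the port its latency.

```python
from collections import deque

# Sketch of an A-Port: a latency-L FIFO. The initial None fill means a
# message written on model cycle t is read on model cycle t + L.

class APort:
    def __init__(self, latency):
        self.q = deque([None] * latency)

    def write(self, msg):   # end of the producer's model cycle
        self.q.append(msg)

    def read(self):         # start of the consumer's model cycle
        return self.q.popleft()

fet_to_dec = APort(latency=1)
fet_to_dec.write("inst@0x1000")            # FET, model cycle 0
assert fet_to_dec.read() is None           # DEC, model cycle 0: nothing yet
fet_to_dec.write("inst@0x1004")            # FET, model cycle 1
assert fet_to_dec.read() == "inst@0x1000"  # arrives one model cycle later
```

Because every stage reads and writes exactly one message per model cycle, time advances consistently without any global controller.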
32
Now the Timing Model is LI!
• Each target machine pipeline stage may take multiple FPGA cycles
• This reduces
– Algorithmic complexity
– FPGA timing pressure
• Less work per cycle
• May use FPGA area-efficient code (e.g. a linear search replacing a CAM)
33
Target Machine Model Pipeline Stage Conventions
• Separate source module for each target pipeline stage
• Private local controller for each module
– Local controllers automatically assemble themselves onto a ring
– All are managed by the global controller
• All external communication is via A-Ports
– Functional model API is the only exception (we have debated switching to A-Ports)
34
Example: In-order Decode FPGA Pipeline Stages
These FPGA stages represent one target machine cycle:
1. Receive register scoreboard writebacks from EXE and MEM stages
2. Consume faults from EXE and COMMIT stages (trigger rewind)
3. Consume dependence info (scoreboard state and writebacks)
4. Attempt issue (if data ready and EXE slot available)
5. Update local state (scoreboard)
inorder-decode-stage.bsv
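The five FPGA stages above can be sketched as one model cycle of a decode model in Python (a toy stand-in for inorder-decode-stage.bsv; the data structures, register names, and simplifications are all illustrative):

```python
# Sketch of one decode model cycle: writebacks, faults, dependence
# check, issue attempt, and local state update, in that order.

def decode_model_cycle(scoreboard, writebacks, faults, waiting, exe_slot_free):
    # 1. Receive register scoreboard writebacks from EXE and MEM
    for reg in writebacks:
        scoreboard[reg] = True             # result is now available
    # 2. Consume faults (would trigger a rewind; simplified to a flag)
    if faults:
        return None, "rewind"
    # 3+4. Consume dependence info and attempt issue
    issued = None
    if waiting and exe_slot_free:
        dst, srcs = waiting[0]
        if all(scoreboard.get(r, False) for r in srcs):
            issued = waiting.pop(0)
            # 5. Update local state: the destination becomes busy
            scoreboard[dst] = False
    return issued, "ok"

sb = {"r1": False, "r2": True}
issued, status = decode_model_cycle(sb, ["r1"], [], [("r3", ("r1", "r2"))], True)
assert status == "ok" and issued == ("r3", ("r1", "r2"))
assert sb["r3"] is False                   # r3 is now pending writeback
```

In the real module each numbered step is its own FPGA pipeline stage, so the FPGA implementation is itself pipelined even though it models a single target cycle.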
35
Key Observation: Parallel Target Machine Yields Parallel FPGA Model
• Each pipeline stage is a separate module
• Only dependence between modules is through A-Ports
• A-Ports are parallel
• Parallelism in the model is proportional to the parallelism of the target
36
Pipeline Parallelism
[Figure: the pipelined timing model, tracking multiple IPs and next IPs, driving the shared functional pipeline – ITranslate, Fetch, Decode, Execute, DTranslate, Memory, Local Commit, Global Commit – backed by functional state.]
• Model of a pipelined design naturally runs pipelined on an FPGA
• Detailed model of a pipelined design runs faster than a trivial, unpipelined model!
37
Aggregate Simulator Throughput (Parsec Black-Scholes)
39
Modeling Caches
40
What Makes Modeling Caches Hard on FPGAs?
• Storage
• Not much else – cache management is just a set of pipeline stages
41
Modeled Caches Are Not as Big as You Might Think
• Only model tags – no data in the timing model
• Cache model is LI
– Connected only by A-Ports
– FPGA latency of the cache tag storage is irrelevant!
– Tag storage on the FPGA may be hierarchical
– Build an FPGA cache to model a target-machine cache, but they are unrelated (LEAP provides automatically generated caches)

Terminology gets messy, but the implementation does not. LEAP scratchpad storage has the same interface as an array.
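A tags-only cache model can be sketched in Python (illustrative parameters and structure, not the HAsim cache model): the timing model tracks only hit/miss outcomes via tags and replacement state, while the data itself lives in the functional model.

```python
# Sketch of a tags-only timing-model cache: per-set LRU lists of tags,
# and no data arrays at all.

class TagOnlyCache:
    def __init__(self, num_sets=4, ways=2, line_bytes=64):
        self.num_sets, self.ways, self.line_bytes = num_sets, ways, line_bytes
        self.sets = [[] for _ in range(num_sets)]  # LRU order, oldest first

    def access(self, addr):
        line = addr // self.line_bytes
        s, tag = line % self.num_sets, line // self.num_sets
        ways = self.sets[s]
        hit = tag in ways
        if hit:
            ways.remove(tag)           # refresh LRU position
        elif len(ways) == self.ways:
            ways.pop(0)                # evict the LRU tag; no data to move
        ways.append(tag)               # most recently used at the back
        return hit

c = TagOnlyCache()
assert c.access(0x0000) is False       # cold miss
assert c.access(0x0000) is True        # tag hit; no data was ever stored
```

Since eviction moves no data, even a large modeled cache costs only tag storage on the FPGA, which is why modeled caches are smaller than one might expect.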
42
Multiple Cores and Shared Cache
• Later, we will consider multiplexed timing pipelines
• The on-chip network connecting a shared cache becomes interesting (interleaving varies with OCN topology)