University of Toronto 1
Compiler Scheduling for a Wide-Issue Multithreaded FPGA-Based Compute Engine
Ilian Tili, Kalin Ovtcharov, J. Gregory Steffan
(University of Toronto)
What is an FPGA?
• FPGA = Field Programmable Gate Array
• E.g., a large Altera Stratix IV: 40nm, 2.5B transistors
– 820K logic elements (LEs), 3.1Mb block-RAMs, 1.2K multipliers
– High-speed I/Os
• Can be programmed to implement any circuit
IBM and FPGAs
• DataPower
– FPGA-accelerated XML processing
• Netezza
– Data warehouse appliance; FPGAs accelerate DBMS
• Algorithmics
– Acceleration of financial algorithms
• Lime (Liquid Metal)
– Java synthesized to heterogeneous hardware (CPUs, FPGAs)
• HAL (Hardware Acceleration Lab)
– IBM Toronto; FPGA-based acceleration
• New: IBM Canada Research & Development Centre
– One of 5 thrusts is “agile computing”
• SURGE IN FPGA-BASED COMPUTING!
FPGA Programming
• Requires expert hardware designer
• Long compile times: up to a day for a large design
-> Options for programming with high-level languages?
Option 1: Behavioural Synthesis
• Mapping high-level languages to hardware
– E.g., Liquid Metal, ImpulseC, LegUp
– OpenCL: increasingly popular acceleration language
[Diagram: OpenCL -> Hardware]
Option 2: Overlay Processing Engines
• Quickly reprogrammed (vs regenerating hardware)
• Versatile (multiple software functions per area)
• Ideally high throughput-per-area (area efficient)
[Diagram: OpenCL mapped onto an array of ENGINE blocks]
-> Opportunity to architect novel processor designs
Option 3: Option 1 + Option 2
• Engines and custom circuit can be used in concert
[Diagram: OpenCL mapped onto ENGINE blocks alongside synthesized HARDWARE]
This talk: wide-issue multithreaded overlay engines
• Variable latency FUs
– add/subtract, multiply, divide, exponent (7, 5, 6, 17 cycles)
• Deeply-pipelined
• Multiple threads
[Diagram: Pipeline and Functional Units attached to a Storage & Crossbar block marked “?”]
-> Architecture and control of storage+interconnect to allow full utilization
Our Approach
• Avoid hardware complexity
– Compiler controlled/scheduled
• Explore large, real design space
– We measure 490 designs
• Future features:
– Coherence protocol
– Access to external memory (DRAM)
Our Objective
Find Best Design
1. Fully utilizes datapath
– Multiple ALUs of significant and varying pipeline depth
2. Reduces FPGA area usage
– Thread data storage
– Connections between components
• Exploring a very large design space
Hardware Architecture Possibilities
Single-Threaded Single-Issue
[Diagram: Multiported Banked Memory feeding one Pipeline; thread T0 issue slots separated by stalls (X)]
-> Simple system but utilization is low
Single-Threaded Multiple-Issue
[Diagram: Multiported Banked Memory feeding several Pipelines; thread T0 issues to more than one pipeline, with stalls (X) still present]
-> ILP within a thread improves utilization but stalls remain
Multi-Threaded Single-Issue
[Diagram: Multiported Banked Memory feeding one Pipeline; threads T0 to T4 issue in turn]
-> Multithreading easily improves utilization
Our Base Hardware Architecture
[Diagram: Multiported Banked Memory, Pipeline, threads T0 to T4]
-> Supports ILP and TLP
TLP Increase
[Diagram: Adding TLP; Memory with threads T0 to T5]
-> Utilization is improved but more storage banks required
ILP Increase
[Diagram: Adding ILP; Memory with threads T0 to T5]
-> Increased storage multiporting required
Design space exploration
• Vary parameters
– ILP
– TLP
– Functional Unit Instances
• Measure/Calculate
– Throughput
– Utilization
– FPGA Area Usage
– Compute Density
Compiler Scheduling
(Implemented in LLVM)
Compiler Flow
1. C code -> IR code (LLVM)
2. IR code -> Data Flow Graph (LLVM Pass)
Data Flow Graph
• Each node represents an arithmetic operation (+, -, *, /)
• Edges represent dependencies
• Weights on edges: delay between operations
[Diagram: example DFG with edge weights 7, 7, 5, 5, 6, 6]
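A data flow graph of this kind can be held as a plain adjacency structure. The sketch below is illustrative only: the node names, the operation assigned to each node, and the graph shape are hypothetical. Only the latencies (7, 5, 6, 17 cycles for add/subtract, multiply, divide, exponent) come from the slides, and the edge weight is assumed to equal the latency of the producing operation.

```python
# FU latencies in cycles, as quoted earlier in the talk.
LATENCY = {"add": 7, "sub": 7, "mul": 5, "div": 6, "exp": 17}

# Hypothetical DFG: node -> (operation, list of predecessor nodes).
DFG = {
    "A": ("add", []),
    "B": ("mul", []),
    "C": ("div", ["A", "B"]),
    "D": ("sub", ["C"]),
}

def weighted_edges(dfg):
    """Return (producer, consumer, delay) triples.

    Assumption: the delay on an edge is the latency of the producer.
    """
    edges = []
    for node, (_, preds) in dfg.items():
        for p in preds:
            edges.append((p, node, LATENCY[dfg[p][0]]))
    return edges

for src, dst, delay in weighted_edges(DFG):
    print(f"{src} -> {dst}: {delay} cycles")
```

Storing predecessors rather than successors keeps the "are all inputs scheduled?" check cheap, which is the test a list scheduler performs most often.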
Initial Algorithm: List Scheduling
• Find nodes in DFG that have no predecessors or whose predecessors are already scheduled.
• Schedule them in the earliest possible slot.
Cycle + , - * /
1 A B G
2 D F C
3 H
4
[M. Lam, ACM SIGPLAN, 1988]
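The two steps above can be sketched as a small list scheduler. This is a simplified illustration, not the authors' LLVM pass: the example graph is hypothetical, each FU is assumed to be fully pipelined (it accepts one new operation per cycle), and ties between ready nodes are broken alphabetically rather than by priority.

```python
LATENCY = {"add": 7, "sub": 7, "mul": 5, "div": 6}
FU_CLASS = {"add": "addsub", "sub": "addsub", "mul": "mul", "div": "div"}

def list_schedule(dfg, fu_count):
    """dfg: node -> (op, preds); fu_count: FU class -> instance count.

    Returns node -> issue cycle (cycles start at 1).
    """
    issue = {}
    used = {}  # (cycle, fu_class) -> issue slots already taken
    remaining = dict(dfg)
    while remaining:
        # Ready nodes: no predecessors, or all predecessors scheduled.
        ready = sorted(n for n, (_, preds) in remaining.items()
                       if all(p in issue for p in preds))
        if not ready:
            raise ValueError("DFG contains a cycle")
        for n in ready:
            op, preds = remaining.pop(n)
            # Earliest cycle at which every operand has been produced.
            cycle = max([issue[p] + LATENCY[dfg[p][0]] for p in preds],
                        default=1)
            cls = FU_CLASS[op]
            # Slide forward until this FU class has a free issue slot.
            while used.get((cycle, cls), 0) >= fu_count[cls]:
                cycle += 1
            used[(cycle, cls)] = used.get((cycle, cls), 0) + 1
            issue[n] = cycle
    return issue

# Hypothetical graph: D = A * B, E = D / C.
DFG = {
    "A": ("add", []),
    "B": ("mul", []),
    "C": ("add", []),
    "D": ("mul", ["A", "B"]),
    "E": ("div", ["D", "C"]),
}
schedule = list_schedule(DFG, {"addsub": 1, "mul": 1, "div": 1})
print(schedule)
```

With a single add/subtract unit, C is pushed to cycle 2 because A already occupies the unit's issue slot in cycle 1; D must wait for the 7-cycle add before it can issue.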
Operation Priorities
• Mobility = ALAP(op) - ASAP(op)
• Lower mobility indicates higher priority

ASAP
Cycle  Add  Sub
1      Op1  Op3
2
3      Op2
4
5      Op4
6
7      Op5

ALAP
Cycle  Add  Sub
1      Op1
2
3      Op2
4
5      Op4  Op3
6
7      Op5
[C.-T. Hwang, et al, IEEE Transactions, 1991]
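The mobility calculation can be reproduced for the five-operation example with a short sketch. The dependency edges and the uniform 2-cycle delay are assumptions inferred from the ASAP and ALAP tables (the slide does not state them), and resource limits are ignored.

```python
DELAY = 2  # assumed uniform delay between dependent operations

# Assumed dependencies: Op1 -> Op2 -> Op4 -> Op5, and Op3 -> Op5.
PREDS = {
    "Op1": [],
    "Op2": ["Op1"],
    "Op3": [],
    "Op4": ["Op2"],
    "Op5": ["Op4", "Op3"],
}
ORDER = ["Op1", "Op2", "Op3", "Op4", "Op5"]  # a topological order

def asap(preds, order):
    """Earliest cycle for each op: right after its latest input."""
    t = {}
    for n in order:
        t[n] = max([t[p] + DELAY for p in preds[n]], default=1)
    return t

def alap(preds, order, last_cycle):
    """Latest cycle for each op that does not delay any consumer."""
    succs = {n: [] for n in order}
    for n in order:
        for p in preds[n]:
            succs[p].append(n)
    t = {}
    for n in reversed(order):
        t[n] = min([t[s] - DELAY for s in succs[n]], default=last_cycle)
    return t

early = asap(PREDS, ORDER)
late = alap(PREDS, ORDER, max(early.values()))
mobility = {n: late[n] - early[n] for n in ORDER}
# Lower mobility first, i.e. higher priority first.
by_priority = sorted(ORDER, key=lambda n: mobility[n])
print(mobility)
print(by_priority)
```

Under these assumptions only Op3 has slack (it can issue anywhere from cycle 1 to cycle 5), so it is scheduled last; every other operation lies on the critical chain and has mobility 0.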
Scheduling Variations
1. Greedy
2. Greedy Mix
3. Greedy with Variable Groups
4. Longest Path
Greedy
• Schedule each thread fully
• Schedule next thread in remaining spots
Greedy Mix
• Round-robin scheduling across threads
Greedy with Variable Groups
• Group = number of threads that are fully scheduled before scheduling the next group
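One way to read the three greedy variants is as a single ordering rule with a group-size parameter. The sketch below is an interpretation, not the authors' implementation: it assumes a group of 1 behaves like Greedy and a group spanning all threads behaves like Greedy Mix. The function name `issue_order` and the toy operation lists are hypothetical, and the code only produces the order in which operations are offered to the scheduler; slot placement is left to a list scheduler.

```python
from itertools import zip_longest

def issue_order(threads, group_size):
    """threads: per-thread operation lists, already in priority order.

    Returns (thread_id, op) pairs in the order they are offered to
    the scheduler. Threads in a group take turns (round-robin); a
    group is finished before the next group starts.
    """
    order = []
    for start in range(0, len(threads), group_size):
        group = range(start, min(start + group_size, len(threads)))
        for turn in zip_longest(*(threads[t] for t in group)):
            for t, op in zip(group, turn):
                if op is not None:  # this thread has run out of ops
                    order.append((t, op))
    return order

threads = [["a", "b"] for _ in range(4)]  # 4 threads, same 2-op program
greedy = issue_order(threads, 1)
greedy_mix = issue_order(threads, len(threads))
groups_of_two = issue_order(threads, 2)
print(groups_of_two)
```

Treating the group size as a tunable makes it straightforward to sweep it alongside the thread count when comparing schedules.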
Longest Path
• First schedule the nodes in the longest path
• Use Prioritized Greedy Mix or Variable Groups
[Diagram: schedule split into Longest Path Nodes and Rest of Nodes]
[Xu et al, IEEE Conf. on CSAE, 2011]
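Finding the longest path is a standard longest-path computation on a DAG. The sketch below is a minimal illustration under stated assumptions: the graph is hypothetical, path length is taken to be the sum of FU latencies along the path, and the nodes are supplied in topological order.

```python
LATENCY = {"add": 7, "sub": 7, "mul": 5, "div": 6, "exp": 17}

# Hypothetical DFG: node -> (operation, predecessors).
DFG = {
    "A": ("add", []),
    "B": ("exp", []),
    "C": ("mul", ["A", "B"]),
    "D": ("div", ["C"]),
    "E": ("sub", ["A"]),
}
ORDER = ["A", "B", "C", "D", "E"]  # a topological order

def longest_path(dfg, order):
    """Return (nodes on the longest path, its length in cycles)."""
    dist = {}  # longest latency sum of any path ending at the node
    back = {}  # predecessor on that path, or None
    for n in order:
        op, preds = dfg[n]
        best = max(preds, key=lambda p: dist[p], default=None)
        dist[n] = LATENCY[op] + (dist[best] if best is not None else 0)
        back[n] = best
    end = max(order, key=lambda n: dist[n])
    path, node = [], end
    while node is not None:
        path.append(node)
        node = back[node]
    return path[::-1], dist[end]

path, length = longest_path(DFG, ORDER)
rest = [n for n in ORDER if n not in path]
print(path, length, rest)
```

The `path` nodes would be placed first, and the `rest` nodes then fill the remaining slots using one of the greedy orderings.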
All Scheduling Algorithms
Longest path scheduling can produce a shorter schedule than other methods
[Figure: example schedules for Greedy, Greedy Mix, Variable Groups, and Longest Path]
Compilation Results
Sample App: Neuron Simulation
• Hodgkin-Huxley
• Differential equations
• Computationally intensive
• Floating point operations:
– Add, Subtract, Divide, Multiply, Exponent
Hodgkin-Huxley
• High-level overview of data flow
Schedule Utilization
-> No significant benefit going beyond 16 threads
-> Best algorithm varies by case
Design Space Considered
• Varying number of threads
• Varying FU instance counts
• Using Longest Path Groups Algorithm
Maximum 8 FUs in total
[Diagram: FU instances (Add/Sub, Mult, Div, Exp, plus extra Add/Sub, Mult, and Div units) shared by threads T0 to T6]
-> 490 designs considered
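The size of this design space can be sanity-checked by enumeration. The sketch below assumes every design has at least one instance of each of the four FU types and at most 8 FUs in total; the at-least-one constraint is an assumption, not something the slides state.

```python
from itertools import product

MAX_FUS = 8

# (add/sub, mult, div, exp) instance counts; at least one of each.
mixes = [m for m in product(range(1, MAX_FUS + 1), repeat=4)
         if sum(m) <= MAX_FUS]

print(len(mixes))          # number of FU mixes under these assumptions
print(490 // len(mixes))   # thread settings implied by 490 designs
```

Under these assumptions there are 70 FU mixes, so 490 designs would correspond to 7 thread-count settings per mix. The slides do not list the thread counts, so that pairing is an inference.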
Throughput vs num threads
• Throughput depends on configuration of FU mix and number of threads
[Plot: IPC vs number of threads; highlighted configuration: 3-add/2-mul/2-div/1-exp]
Real Hardware Results
Methodology
• Design built on FPGA
• Altera Stratix IV (EP4SGX530)
• Quartus 12.0
• Area = equivalent ALMs
– Takes into account BRAM (memory) requirement
• IEEE-754 compliant floating point units
– Clock Frequency at least 200MHz
Area vs threads
• Area depends on the number of FU instances and the number of threads
[Plot: area (eALM) vs number of threads]
Compute Density
Compute Density = throughput / area = IPC / eALM (instr/cycle/area)
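As a quick illustration of how the metric trades throughput against area, the sketch below ranks three designs by compute density. The design names and their (IPC, eALM) values are made-up placeholders chosen for the arithmetic; they are not measurements from the talk.

```python
def compute_density(ipc, area_ealm):
    """Instructions per cycle per unit of area (eALM)."""
    return ipc / area_ealm

# HYPOTHETICAL (IPC, area in eALM) pairs, for illustration only.
designs = {
    "design_a": (1.0, 10_000),
    "design_b": (2.0, 16_000),
    "design_c": (2.4, 40_000),
}

best = max(designs, key=lambda d: compute_density(*designs[d]))
for name, (ipc, area) in designs.items():
    print(name, compute_density(ipc, area))
print("best:", best)
```

Here design_c has the highest IPC but the lowest density, because its area grows faster than its throughput; design_b strikes the better balance.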
• Balance of throughput and area consumption
[Plot: compute density vs number of threads; highlighted configurations: 2-add/1-mul/1-div/1-exp and 3-add/2-mul/2-div/1-exp]
• Best configuration at 8 or 16 threads.
• Fewer than 8 threads: not enough parallelism
• More than 16 threads: too expensive
• FU mix is crucial to getting the best density
• Normalized FU Usage in DFG = [3.2,1.6,1.87,1]
-> Closest FU mix: (3,2,2,1), i.e. 3-add/2-mul/2-div/1-exp
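The step from normalized usage to an FU mix can be sketched as rounding each entry to a whole number of units. This is an assumed reading of the slide: the vector is taken to be ordered add/sub, mult, div, exp, normalization is taken to mean dividing by the smallest entry, and the floor of one unit per type is an added safeguard.

```python
# Normalized FU usage in the DFG, as given on the slide.
# Assumed order: add/sub, mult, div, exp.
usage = {"add": 3.2, "mul": 1.6, "div": 1.87, "exp": 1.0}

def normalize(counts):
    """Scale so the least-used FU type has usage 1 (assumed meaning)."""
    smallest = min(counts.values())
    return {fu: c / smallest for fu, c in counts.items()}

def fu_mix(normalized):
    """Round to whole FU instances, keeping at least one of each."""
    return {fu: max(1, round(u)) for fu, u in normalized.items()}

mix = fu_mix(normalize(usage))
print(tuple(mix.values()))
```

Rounding the slide's vector this way gives (3, 2, 2, 1), which matches the 3-add/2-mul/2-div/1-exp configuration.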
Conclusions
• Longest Path Scheduling seems best
– Highest utilization on average
• Best compute density found through simulation
– 8 and 16 threads give best compute densities
– Best FU mix proportional to FU usage in DFG
• Compiler finds best hardware configuration