1
Executive Summary
2
Overall Architecture of ARC
♦ Architecture of ARC
 Multiple cores and accelerators
 Global Accelerator Manager (GAM)
 Shared L2 cache banks and NoC routers between multiple accelerators
[Figure: tiled layout of the GAM, accelerator + DMA + SPM blocks, shared routers, cores, shared L2 cache banks, and memory controllers]
3
What are the Problems with ARC?
♦ Dedicated accelerators are inflexible
 An LCA may be useless for new algorithms or new domains
 Often under-utilized
 LCAs contain many replicated structures
 • Things like FP-ALUs, DMA engines, and SPMs
 • Unused when the accelerator is unused
♦ We want flexibility and better resource utilization. Solution: CHARM
♦ Private SPM is wasteful. Solution: BiN
4
A Composable Heterogeneous Accelerator-Rich Microprocessor (CHARM) [ISLPED’12]
♦ Motivation
 Great deal of data parallelism
 • Tasks performed by accelerators tend to have a great deal of data parallelism
 Variety of LCAs with possible overlap
 • Utilization of any particular LCA is somewhat sporadic
 It is expensive to have both:
 • Sufficient diversity of LCAs to handle the various applications
 • Sufficient quantity of a particular LCA to handle the parallelism
 Overlap in functionality
 • LCAs can be built from a limited number of smaller, more general blocks: accelerator building blocks (ABBs)
♦ Idea: flexible accelerator building blocks (ABBs) that can be composed into accelerators
♦ Leverage economy of scale
5
Micro-Architecture of CHARM
♦ ABB
 Accelerator building blocks (ABBs): primitive components that can be composed into accelerators
♦ ABB islands
 Multiple ABBs
 Shared DMA controller, SPM, and NoC interface
♦ ABC
 Accelerator Block Composer (ABC)
 • Orchestrates the data flow between ABBs to create a virtual accelerator
 • Arbitrates requests from cores
♦ Other components: cores, L2 banks, memory controllers
6
CAMEL
♦ What are the problems with CHARM?
 What if new algorithms introduce new ABBs?
 What if we want to use this architecture on multiple domains?
♦ Adding programmable fabric to increase
 Longevity
 Domain span
7
CAMEL
♦ What are the problems with CHARM?
 What if new algorithms introduce new ABBs?
 What if we want to use this architecture on multiple domains?
♦ Adding programmable fabric to increase
 Longevity
 Domain span
♦ The ABC is now responsible for allocating the programmable fabric
8
9
Extensive Use of Accelerators
♦ Accelerators provide high power-efficiency over general-purpose processors
 IBM wire-speed processor
 Intel Larrabee
♦ ITRS 2007 system-drivers prediction: the number of accelerators will approach 1500 by 2022
♦ Two kinds of accelerators
 Tightly coupled – part of the datapath
 Loosely coupled – shared via the NoC
♦ Challenges
 Accelerator extraction and synthesis
 Efficient accelerator management
 • Scheduling • Sharing • Virtualization …
 Friendly programming models
[Figure: CMP with Core1–Core4, private L1 caches, shared L2 banks, Accelerator1, and Accelerator2 connected by a NoC]
10
Architecture Support for Accelerator-Rich CMPs (ARC) [DAC’2012]
Motivation
♦ Managing accelerators through the OS is expensive
♦ In an accelerator-rich CMP, management should be cheaper in both time and energy
 Invoke “open”s the driver and returns the handle to the driver; it is called once. RD/WR is called multiple times.

Operation latency (# cycles):
Operation | 1 core | 2 cores | 4 cores | 8 cores | 16 cores
Invoke | 214413 | 256401 | 266133 | 308434 | 316161
RD/WR | 703 | 725 | 781 | 837 | 885

[Figure: an application on a CPU invoking accelerators through the OS-resident accelerator manager]
11
Overall Architecture of ARC
♦ Architecture of ARC
 Multiple cores and accelerators
 Global Accelerator Manager (GAM)
 Shared L2 cache banks and NoC routers between multiple accelerators
[Figure: tiled layout of the GAM, accelerator + DMA + SPM blocks, shared routers, cores, shared L2 cache banks, and memory controllers]
12
Overall Communication Scheme in ARC
1. The core requests a given type of accelerator (lcacc-req).

New ISA:
 lcacc-req t
 lcacc-rsrv t, e
 lcacc-cmd id, f, addr
 lcacc-free id

[Figure: step 1 – request message from the CPU to the GAM]
13
Overall Communication Scheme in ARC
2. The GAM responds with a “list + waiting time” or a NACK.
[Figure: step 2 – response from the GAM to the CPU]
14
Overall Communication Scheme in ARC
3. The core reserves (lcacc-rsrv) and waits.
[Figure: step 3 – reserve message from the CPU to the GAM]
15
Overall Communication Scheme in ARC
4. The GAM ACKs the reservation and sends the core ID to the accelerator.
[Figure: step 4 – ACK to the CPU and core ID to the LCA]
16
Overall Communication Scheme in ARC
5. The core shares a task description with the accelerator through memory and starts it (lcacc-cmd).
 • The task description consists of:
  o Function ID and input parameters
  o Input/output addresses and strides
[Figure: step 5 – the core writes the task description to memory for the accelerator]
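The task description passed via lcacc-cmd can be sketched as a small record. A hypothetical illustration: field names beyond those the slide lists (function ID, input parameters, input/output addresses and strides) are assumptions, not the actual memory layout:

```python
from dataclasses import dataclass, field

@dataclass
class TaskDescription:
    """Task description shared with the accelerator through memory.

    Field names are illustrative; the slide only specifies function ID,
    input parameters, and input/output addresses and strides.
    """
    function_id: int                              # which function the LCA runs
    params: list = field(default_factory=list)    # input parameters
    in_addr: int = 0                              # logical base address of input
    in_stride: int = 1                            # stride between input elements
    out_addr: int = 0                             # logical base address of output
    out_stride: int = 1                           # stride between output elements

# Example: a task reading from 0x1000 and writing to 0x2000
td = TaskDescription(function_id=7, params=[64], in_addr=0x1000,
                     out_addr=0x2000)
```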
17
Overall Communication Scheme in ARC
6. The accelerator reads the task description and begins working.
 • Reads/writes from/to memory are overlapped with compute
 • The core is interrupted on a TLB miss
[Figure: step 6 – the LCA reads the task description from memory]
18
Overall Communication Scheme in ARC
7. When the accelerator finishes its current task, it notifies the core.
[Figure: step 7 – completion message from the LCA to the CPU]
19
Overall Communication Scheme in ARC
8. The core then sends a message to the GAM freeing the accelerator (lcacc-free).
[Figure: step 8 – free message from the CPU to the GAM]
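Steps 1–8 of the communication scheme can be sketched as a toy simulation of the core/GAM/LCA handshake. All class and message names here are illustrative, not the actual hardware interface:

```python
class GAM:
    """Toy Global Accelerator Manager tracking free LCAs by type."""
    def __init__(self, lcas):
        self.free = dict(lcas)            # accelerator type -> free count

    def request(self, t):                 # steps 1/2: lcacc-req -> list+wait or NACK
        return ("LIST", 0) if self.free.get(t, 0) > 0 else ("NACK", None)

    def reserve(self, t):                 # steps 3/4: lcacc-rsrv -> ACK or NACK
        if self.free.get(t, 0) > 0:
            self.free[t] -= 1
            return "ACK"
        return "NACK"

    def release(self, t):                 # step 8: lcacc-free
        self.free[t] += 1

def run_task(gam, t, task_description):
    """One pass through steps 1-8 of the ARC communication scheme."""
    resp, wait = gam.request(t)           # 1: core asks for an accelerator type
    if resp == "NACK":                    # 2: GAM answers with list+wait or NACK
        return None
    assert gam.reserve(t) == "ACK"        # 3/4: reserve; GAM ACKs
    memory = {"task": task_description}   # 5: share the task description via memory
    result = ("done", memory["task"])     # 6/7: LCA works, then notifies the core
    gam.release(t)                        # 8: core frees the accelerator
    return result

gam = GAM({"fft": 1})
print(run_task(gam, "fft", {"function_id": 7}))
```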
20
Accelerator Chaining and Composition
♦ Chaining
 Efficient accelerator-to-accelerator communication
♦ Composition
 Constructing virtual accelerators
[Figure: two accelerators, each with a scratchpad and DMA controller, chained; two M-point 1-D FFTs and an N-point 2-D FFT virtualized into a 3-D FFT]
21
Accelerator Virtualization
♦ The application programmer or compilation framework selects high-level functionality
♦ Implementation via
 Monolithic accelerator
 Distributed accelerators composed into a virtual accelerator
 Software decomposition libraries
♦ Example: implementing a 4x4 2-D FFT using two 4-point 1-D FFTs
22
Accelerator Virtualization
Step 1: 1-D FFT on rows 1 and 2
23
Accelerator Virtualization
Step 2: 1-D FFT on rows 3 and 4
24
Accelerator Virtualization
Step 3: 1-D FFT on columns 1 and 2
25
Accelerator Virtualization
Step 4: 1-D FFT on columns 3 and 4
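The four steps above implement the standard row–column decomposition: a 2-D DFT equals 1-D transforms over all rows followed by 1-D transforms over all columns. A sketch using a direct 4-point DFT as a software analogue of the composed 1-D FFT accelerators:

```python
import cmath

def dft(x):
    """Direct N-point 1-D DFT (software stand-in for a 1-D FFT accelerator)."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
            for j in range(n)]

def fft2d_composed(grid):
    """4x4 2-D FFT via 1-D FFTs: rows 1-2 and 3-4, then columns 1-2 and 3-4."""
    rows = [dft(r) for r in grid]                                   # steps 1-2
    cols = [dft([rows[i][j] for i in range(4)]) for j in range(4)]  # steps 3-4
    return [[cols[j][i] for j in range(4)] for i in range(4)]

grid = [[float(4 * i + j) for j in range(4)] for i in range(4)]
out = fft2d_composed(grid)
# out[0][0] is the DC term: the sum of all 16 inputs (0 + 1 + ... + 15 = 120)
```

The decomposition is exact: applying the 1-D transform along both axes gives term-by-term the same result as the full 2-D transform.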
26
Light-Weight Interrupt Support
[Figure: CPU, LCA, and GAM connected by interrupt paths]
27
Light-Weight Interrupt Support
Request/reserve confirmations and NACKs are sent by the GAM.
[Figure: interrupt path from the GAM to the CPU]
28
Light-Weight Interrupt Support
The LCA interrupts the core on a TLB miss or when a task is done.
[Figure: interrupt path from the LCA to the CPU]
29
Light-Weight Interrupt Support
The LCA interrupts the core on a TLB miss or when a task is done.
The core sends logical addresses to the LCA; the LCA keeps a small TLB for the addresses it is working on.
30
Light-Weight Interrupt Support
The core sends logical addresses to the LCA; the LCA keeps a small TLB for the addresses it is working on.
Why logical addresses?
1. Accelerators can work on irregular addresses (e.g., indirect addressing)
2. Using a large page size could be a solution but would affect other applications
31
Light-Weight Interrupt Support
It is expensive to handle interrupts via the OS.

Latency to switch to the ISR and back (# cycles):
Operation | 1 core | 2 cores | 4 cores | 8 cores | 16 cores
Interrupt | 16 K | 20 K | 24 K | 27 K | 29 K
32
Light-Weight Interrupt Support
Extending the core with light-weight interrupt (LWI) support.
[Figure: CPU extended with an LWI unit; the LCA and GAM send interrupt packets to it]
33
Light-Weight Interrupt Support
Extending the core with light-weight interrupt (LWI) support.
Two main components are added:
 A table to store ISR info
 An interrupt controller to queue and prioritize incoming interrupt packets
Each thread registers:
 The address of the ISR, its arguments, and the lw-int source
Limitations:
 Can only be used while the thread the LW interrupt belongs to is running
 Otherwise the interrupt is OS-handled
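The two components above (ISR table plus interrupt controller) can be sketched as follows. The names, the priority scheme, and the dispatch policy are assumptions made for illustration, not the hardware design:

```python
import heapq

class LightWeightInterrupt:
    """Toy model of the LWI extension: an ISR table and a priority queue."""
    def __init__(self):
        self.isr_table = {}   # (thread_id, source) -> (isr, args)
        self.pending = []     # min-heap of (priority, seq, thread_id, source)
        self.seq = 0

    def register(self, thread_id, source, isr, args):
        """A thread registers its ISR, the ISR arguments, and the lw-int source."""
        self.isr_table[(thread_id, source)] = (isr, args)

    def raise_interrupt(self, thread_id, source, priority=0):
        """Queue an incoming interrupt packet, ordered by priority."""
        heapq.heappush(self.pending, (priority, self.seq, thread_id, source))
        self.seq += 1

    def dispatch(self, running_thread):
        """Run queued ISRs; fall back to the OS if another thread owns one."""
        results = []
        while self.pending:
            _, _, tid, src = heapq.heappop(self.pending)
            if tid != running_thread:
                results.append(("os-handled", src))   # limitation from the slide
                continue
            isr, args = self.isr_table[(tid, src)]
            results.append(isr(*args))
        return results

lwi = LightWeightInterrupt()
lwi.register(thread_id=1, source="task-done",
             isr=lambda acc: ("done", acc), args=("fft",))
lwi.raise_interrupt(1, "task-done")
```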
34
Programming Interface to ARC
 Platform creation
 Application mapping & development
35
Evaluation Methodology
♦ Benchmarks
 Medical imaging
 Vision & navigation
36
Application Domain: Medical Image Processing
 Reconstruction: compressive sensing
 • Medical images exhibit sparsity and can be sampled at a rate below the classical Shannon–Nyquist rate
 Denoising: total variational algorithm
 Registration: fluid registration (Navier–Stokes equations)
 Segmentation: level set methods
 Analysis
[Figure: the medical-imaging pipeline – reconstruction, denoising, registration, segmentation, and analysis – annotated with the governing equations of each stage]
37
Area Overhead
♦ AutoESL (from Xilinx) for C-to-RTL synthesis
♦ Synopsys for ASIC synthesis
 32 nm Synopsys Educational library
♦ CACTI for L2
♦ Orion for NoC
♦ One UltraSPARC IIIi core (area scaled to 32 nm)
 178.5 mm^2 at 0.13 um (http://en.wikipedia.org/wiki/UltraSPARC_III)

Component | Core | NoC | L2 | Deblur | Denoise | Segmentation | Registration | SPM banks
Number of instances / size | 1 | 1 | 8 MB | 1 | 1 | 1 | 1 | 39 x 2 KB
Area (mm^2) | 10.8 | 0.3 | 39.8 | 0.03 | 0.01 | 0.06 | 0.01 | 0.02
Percentage (%) | 18.0 | 0.5 | 66.2 | 3.4 | 0.8 | 6.5 | 1.2 | 2.4

Total ARC: 14.3%
38
Experimental Results – Performance (N cores, N threads, N accelerators)
Performance improvement over OS-based approaches: on average 51x, up to 292x
Performance improvement over SW-only approaches: on average 168x, up to 380x
[Charts: gain (X) over SW-only and over OS-based for Registration, Deblur, Denoise, and Segmentation across configurations 1, 2, 4, 8, and 16]
39
Experimental Results – Energy (N cores, N threads, N accelerators)
Energy improvement over OS-based approaches: on average 17x, up to 63x
Energy improvement over SW-only approaches: on average 241x, up to 641x
[Charts: energy gain (X) over SW-only and over OS-based for Registration, Deblur, Denoise, and Segmentation across configurations 1, 2, 4, 8, and 16]
40
What are the Problems with ARC?
♦ Dedicated accelerators are inflexible
 An LCA may be useless for new algorithms or new domains
 Often under-utilized
 LCAs contain many replicated structures
 • Things like FP-ALUs, DMA engines, and SPMs
 • Unused when the accelerator is unused
♦ We want flexibility and better resource utilization. Solution: CHARM
♦ Private SPM is wasteful. Solution: BiN
41
A Composable Heterogeneous Accelerator-Rich Microprocessor (CHARM) [ISLPED’12]
♦ Motivation
 Great deal of data parallelism
 • Tasks performed by accelerators tend to have a great deal of data parallelism
 Variety of LCAs with possible overlap
 • Utilization of any particular LCA is somewhat sporadic
 It is expensive to have both:
 • Sufficient diversity of LCAs to handle the various applications
 • Sufficient quantity of a particular LCA to handle the parallelism
 Overlap in functionality
 • LCAs can be built from a limited number of smaller, more general blocks: accelerator building blocks (ABBs)
♦ Idea: flexible accelerator building blocks (ABBs) that can be composed into accelerators
♦ Leverage economy of scale
42
Micro-Architecture of CHARM
♦ ABB
 Accelerator building blocks (ABBs): primitive components that can be composed into accelerators
♦ ABB islands
 Multiple ABBs
 Shared DMA controller, SPM, and NoC interface
♦ ABC
 Accelerator Block Composer (ABC)
 • Orchestrates the data flow between ABBs to create a virtual accelerator
 • Arbitrates requests from cores
♦ Other components: cores, L2 banks, memory controllers
43
An Example of ABB Library (for Medical Imaging)
[Figure: table of ABBs for medical imaging; inset shows the internal structure of the Poly ABB]
44
Example of ABB Flow-Graph (Denoise)
[Figure: dataflow graph of the denoise kernel]
45
Example of ABB Flow-Graph (Denoise)
[Figure: denoise dataflow graph built from subtract, multiply, add, sqrt, and 1/x nodes]
46
Example of ABB Flow-Graph (Denoise)
[Figure: the denoise dataflow graph partitioned into ABB1: Poly, ABB2: Poly, ABB3: Sqrt, and ABB4: Inv]
47
Example of ABB Flow-Graph (Denoise)
[Figure: the denoise dataflow graph partitioned into ABB1: Poly, ABB2: Poly, ABB3: Sqrt, and ABB4: Inv]
48
LCA Composition Process
[Figure: core and ABC connected to four ABB islands; the islands contain ABBs of types x/y, x/w, z/w, and y/z]
49
LCA Composition Process
1. Core initiation
 The core sends the task description – the task flow-graph of the desired LCA together with the polyhedral space for the input and output – to the ABC
 Here: a flow-graph over ABB types x, y, and z with 10x10 input and output
[Figure: the core sends the task description to the ABC; four ABB islands attached]
50
LCA Composition Process
2. Task-flow parsing and task-list creation
 The ABC parses the task-flow graph, breaks the request into a set of tasks with smaller data size, and fills the task list (generated internally by the ABC)
 Needed ABBs: “x”, “y”, “z”
 With a task size of 5x5 blocks, the ABC generates 4 tasks
[Figure: core, ABC, and four ABB islands]
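Breaking the 10x10 request into 5x5-block tasks, as described above, can be sketched as follows. The task record format is illustrative, not the ABC's internal encoding:

```python
def make_tasks(width, height, block):
    """Split an input/output region into block-sized tasks, as the ABC does
    when it fills the task list."""
    tasks = []
    for y in range(0, height, block):
        for x in range(0, width, block):
            tasks.append({"x": x, "y": y,
                          "w": min(block, width - x),    # clip edge blocks
                          "h": min(block, height - y)})
    return tasks

# A 10x10 request with 5x5 task blocks -> the 4 tasks the ABC generates
tasks = make_tasks(10, 10, 5)
```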
51
LCA Composition Process
3. Dynamic ABB mapping
 The ABC uses a pattern-matching algorithm to assign ABBs to islands
 It fills the composed-LCA table and the resource allocation table

Resource allocation table:
Island ID | ABB Type | ABB ID | Status
1 | x | 1 | Free
1 | y | 1 | Free
2 | x | 1 | Free
2 | w | 1 | Free
3 | z | 1 | Free
3 | w | 1 | Free
4 | y | 1 | Free
4 | z | 1 | Free
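The dynamic mapping step can be sketched as a greedy match of the needed ABB types against the resource table. The real ABC uses a pattern-matching algorithm; this first-free-match simplification is only illustrative:

```python
def map_abbs(needed, resource_table):
    """Greedily claim one free ABB per needed type; claimed rows become Busy.

    resource_table: list of dicts with island, abb_type, abb_id, and status,
    mirroring the resource allocation table on the slide.
    """
    mapping = {}
    for abb_type in needed:
        for row in resource_table:
            if row["abb_type"] == abb_type and row["status"] == "Free":
                row["status"] = "Busy"
                mapping[abb_type] = (row["island"], row["abb_id"])
                break
        else:
            return None       # composition fails if any ABB type is unavailable
    return mapping

# The resource allocation table from the slide (all entries initially Free)
table = [{"island": i, "abb_type": t, "abb_id": 1, "status": "Free"}
         for i, t in [(1, "x"), (1, "y"), (2, "x"), (2, "w"),
                      (3, "z"), (3, "w"), (4, "y"), (4, "z")]]
mapping = map_abbs(["x", "y", "z"], table)
```

After this call the x and y entries of island 1 and the z entry of island 3 are Busy, matching the allocation shown on the next slide.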
52
LCA Composition Process
3. Dynamic ABB mapping
 The ABC uses a pattern-matching algorithm to assign ABBs to islands
 It fills the composed-LCA table and the resource allocation table

Resource allocation table:
Island ID | ABB Type | ABB ID | Status
1 | x | 1 | Busy
1 | y | 1 | Busy
2 | x | 1 | Free
2 | w | 1 | Free
3 | z | 1 | Busy
3 | w | 1 | Free
4 | y | 1 | Free
4 | z | 1 | Free
53
LCA Composition Process
4. LCA cloning
 Repeat to generate more LCAs if ABBs are available

Resource allocation table:
Island ID | ABB Type | ABB ID | Status
1 | x | 1 | Busy
1 | y | 1 | Busy
2 | x | 1 | Busy
2 | w | 1 | Free
3 | z | 1 | Busy
3 | w | 1 | Free
4 | y | 1 | Busy
4 | z | 1 | Busy
54
LCA Composition Process
5. ABBs finishing a task
 When an ABB finishes, it signals the ABC (DONE). If the ABC has another task it sends it; otherwise it frees the ABB.

Resource allocation table:
Island ID | ABB Type | ABB ID | Status
1 | x | 1 | Busy
1 | y | 1 | Busy
2 | x | 1 | Busy
2 | w | 1 | Free
3 | z | 1 | Busy
3 | w | 1 | Free
4 | y | 1 | Busy
4 | z | 1 | Busy
55
LCA Composition Process
5. ABBs being freed
 When an ABB finishes, it signals the ABC. If the ABC has another task it sends it; otherwise it frees the ABB.

Resource allocation table:
Island ID | ABB Type | ABB ID | Status
1 | x | 1 | Busy
1 | y | 1 | Busy
2 | x | 1 | Free
2 | w | 1 | Free
3 | z | 1 | Busy
3 | w | 1 | Free
4 | y | 1 | Free
4 | z | 1 | Free
56
LCA Composition Process
6. Core notified of end of task
 When the LCA finishes, the ABC signals the core (DONE).

Resource allocation table:
Island ID | ABB Type | ABB ID | Status
1 | x | 1 | Free
1 | y | 1 | Free
2 | x | 1 | Free
2 | w | 1 | Free
3 | z | 1 | Free
3 | w | 1 | Free
4 | y | 1 | Free
4 | z | 1 | Free
57
ABC Internal Design
♦ ABC sub-components
 Resource Table (RT): keeps track of available/used ABBs
 Composed-LCA Table (CLT): eliminates the need to re-compose LCAs
 Task List (TL): queues the LCA requests broken into smaller data sizes
 TLB: services and shares the translation requests from ABBs
 Task Flow-Graph Interpreter (TFGI): breaks the LCA DFG into ABBs
 LCA Composer (LC): composes the LCA using available ABBs
♦ Implementation
 RT, CLT, TL, and TLB are implemented using RAM
 The TFGI has a table of ABB types and an FSM that reads the task flow-graph and compares
 The LC has an FSM that walks the CLT and RT and marks the available ABBs
[Figure: ABC block diagram – resource table, composed-LCA table, TLB, task list, DFG interpreter, and LCA composer; inputs from cores and from ABB done signals, outputs to ABBs (allocate) and to the ABB TLB service]
58
CHARM Software Infrastructure
♦ ABB type extraction
 Input: compute-intensive kernels from different applications
 Output: ABB super-patterns
 Currently semi-automatic
♦ ABB template mapping
 Input: kernels + ABB types
 Output: covered kernels as an ABB flow-graph
♦ CHARM uProgram generation
 Input: ABB flow-graph
 Output: CHARM uProgram
59
Evaluation Methodology
♦ Simics + GEMS based simulation
♦ AutoPilot/Xilinx + Synopsys for ABB/ABC/DMA-C synthesis
♦ CACTI for memory (SPM) synthesis
♦ Automatic flow to generate the CHARM software and simulation modules
♦ Case studies
 Physical LCA sharing with the Global Accelerator Manager (LCA+GAM)
 Physical LCA sharing with the ABC (LCA+ABC)
 ABB composition and sharing with the ABC (ABB+ABC)
♦ Medical imaging benchmarks: Denoise, Deblur, Segmentation, and Registration
60
Area Overhead Analysis
♦ Area-equivalent
 The total area consumed by the ABBs equals the total area of all LCAs required to run a single instance of each benchmark
♦ Total CHARM area is 14% of the 1 cm x 1 cm chip
 A bit less than the LCA-based design
61
Results: Improvement Over LCA-based Design
♦ “Nx” has N times the area-equivalent accelerators
♦ Performance
 2.5X vs. LCA+GAM (max 5X)
 1.4X vs. LCA+ABC (max 2.6X)
♦ Energy
 1.9X vs. LCA+GAM (max 3.4X)
 1.3X vs. LCA+ABC (max 2.2X)
♦ ABB+ABC gives better energy and performance
 The ABC starts composing ABBs to create new LCAs
 This creates more parallelism
[Charts: normalized performance and normalized energy of LCA+GAM, LCA+ABC, and ABB+ABC at 1x, 2x, 4x, and 8x area for Deb, Den, Reg, and Seg]
62
Results: Platform Flexibility
♦ Two applications from two domains unrelated to medical imaging
 Computer vision
 • Log-Polar Coordinate Image Patches (LPCIP)
 Navigation
 • Extended Kalman Filter-based Simultaneous Localization and Mapping (EKF-SLAM)
♦ Only one ABB is added: Indexed Vector Load

Max benefit over LCA+GAM: 3.64X
Avg benefit over LCA+GAM: 2.46X
Max benefit over LCA+ABC: 3.04X
Avg benefit over LCA+ABC: 2.05X
63
CAMEL
♦ What are the problems with CHARM?
 What if new algorithms introduce new ABBs?
 What if we want to use this architecture on multiple domains?
♦ Adding programmable fabric to increase
 Longevity
 Domain span
64
CAMEL
♦ What are the problems with CHARM?
 What if new algorithms introduce new ABBs?
 What if we want to use this architecture on multiple domains?
♦ Adding programmable fabric to increase
 Longevity
 Domain span
♦ The ABC is now responsible for allocating the programmable fabric
65
CAMEL Microarchitecture
66
CAMEL Programmable Fabric Block
67
Programmable Fabric Allocation by ABC
♦ ABB replication in order to rate-match
♦ ABB assignment to the paths with larger slack, to decrease the negative effect
♦ Interval-based allocation to increase performance across multiple requests
♦ When to use the programmable fabric?
 The ABB does not exist in the system
 The ABB is being used by other accelerators
 ● Beneficial if it can be rate-matched
 ● Potential slow-down otherwise
68
FPGA Allocation by ABC – Block Diagram
♦ Greedy approach
 Single-application service
♦ Interval-based approach
 Wait for some fixed interval
 Use the greedy approach to find the best fit satisfying multiple requests
[Flowchart: from the LCA task flow-graph and the list of available ABBs, find the needed ABBs on the PF; if the minimum area is not feasible, return FALSE; otherwise set the maximum rate and test whether it fits the PF, decreasing the rate until it fits (program the fabric) or the lowest rate is reached (return FALSE)]
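The flowchart's greedy loop can be sketched as follows. The area/rate cost model is a made-up stand-in for the real one; only the control flow mirrors the diagram:

```python
def allocate_on_fabric(needed_area_at_rate, fabric_area, max_rate, min_rate=1):
    """Greedy PF allocation: start at the maximum rate and decrease the rate
    until the needed ABBs fit on the fabric, or fail at the lowest rate.

    needed_area_at_rate(rate) -> area the replicated ABBs need at that rate.
    Returns the chosen rate, or None (the flowchart's FALSE).
    """
    if needed_area_at_rate(min_rate) > fabric_area:
        return None                   # minimum area not feasible -> FALSE
    rate = max_rate                   # set maximum rate
    while rate >= min_rate:
        if needed_area_at_rate(rate) <= fabric_area:
            return rate               # fits the PF: program the fabric
        rate -= 1                     # otherwise decrease the rate and retry
    return None                       # lowest rate reached -> FALSE

# Toy cost model: each unit of rate costs 10 area units
rate = allocate_on_fabric(lambda r: 10 * r, fabric_area=35, max_rate=8)
# rate == 3: the highest rate whose area (30) fits in 35
```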
69
CAMEL Results – All Domains
70
CAMEL Results – Speedup
71
CAMEL Results – Energy improvement