hrl: efficient and flexible reconfigurable logic for near...
TRANSCRIPT
HRL: Efficient and Flexible Reconfigurable Logic for Near-Data Processing
Mingyu Gao and Christos Kozyrakis
Stanford University
http://mast.stanford.edu
HPCA – March 14, 2016
PIM is Coming Back …
2
Near-Data Processing
(NDP)
End of Dennard scaling
In-memory analytics
3D stacking
MapReduce, graph processing, deep neural networks, …
HMC, HBM
Figs: www.extremetech.comwww.cisl.columbia.edu/grads/tuku/research/www.oceanaute.blogspot.com/2015/06/how-to-shuffle-sort-mapreduce.html
Energy-bound systems
NDP Logic Requirements
Area-efficient
o High processing throughput to match the high memory bandwidth
o 128 GBps per 50 mm2 stack > 32 Gflops > 0.6 Gflops/mm2
Power-efficient
o Thermal constraints limit clock frequency
o 5 W per stack 100 mW/mm2
Flexible
o Must amortize manufacturing cost through reuse across apps
3
NDP Logic Options
Programmable cores[IRAM, FlexRAM, NDC, TOP-PIM]
FPGA (fine-grained)[Active Pages]
CGRA (coarse-grained)[NDA]
ASIC[MSA, LiM]
4
AreaEfficiency
PowerEfficiency
Flexibility
Reconfigurable Logic Challenges
5
FPGA
o Area overhead due to support for bit-level configuration
CGRA
o Traditional GGRAs
• Limited flexibility in interconnects, only for regular computation patterns
o DySER [HPCA’11] and NDA [HPCA’15]
• High power due to circuit-switched routing
• Inefficient for branches and irregular data layouts
Heterogeneity: achieve the best of FPGA and CGRA
Outline
Motivation
NDP System Design
Heterogeneous Reconfigurable Logic (HRL)
Evaluation
Conclusions
6
Overall System Architecture
Multiple stacks linked to host processor through serial links7
Host Processor
Memory Stack
High-Speed Serial Link
Multi-core chip with cache hierarchy:Runs code with high temporal locality
Memory stack with NDP capability:runs memory-intensive code
NDP Stack
Vault Logic
NoC
Logic Die
DRAM Die
...
Bank
Channel
Vault
8
Vault: • Vertical channel• Dedicated memory controller• 8 – 16 vaults per stack
vs. DDR3 channel
• 10x bandwidth (160 GBps)
• 3-5x power improvement Vault logic:• Multiple PEs + control logic• NoC to interconnect vaults
Iterative Execution Flow
Processing phase: PEs run tasks independently and in parallel
Communication phase: data exchange and sync b/w PEs [PACT’15]
o Communication within and across stacks
9
Task Task Task Task
PEs PEs PEs PEs Process
LocalBuffer
Task Task Task Task Pull
Output Queues
Mem CtrlTo remotememory
Generic Load/Store UnitTask
Queue
op,a
dd
r,sz
op,a
ddr,
sz
op,a
ddr,
sz
RouterTo localmemory
Scratchpad Buffer
Global Controller
(Host Processor)
DMA
...
PE
PE
PEUser-defined Logic
Fixed Logic
Output write queues
DMA Ctrl
Vault Logic
Handles task control and data communication
Allows the use of reconfigurable or custom PEs
10
Input & output data info from host or other PEs
Cached data(e.g., graph vertices)
Streaming data(e.g., graph edge list)
Separate queue for each consumer• Coalesce writebacks• In-place combine ops
Outline
Motivation
NDP System Design
Heterogeneous Reconfigurable Logic (HRL)
Evaluation
Conclusions
11
HRL Features
Fine-grained + coarse-grained reconfigurable blocks
o LUTs for flexible control
o ALUs for efficient arithmetic
Static interconnects
o Wide network for data
o Separate and narrow network for control
Special blocks for branches & irregular data layout
Compute throughput per Watt:
2.2x over FPGA, 1.7x over CGRA12
Area-efficiency and flexibility
Power-efficiency
Flexibility
HRL Array: Logic Blocks
13
Control Input Data Input
FU FU FU FU FU
FU FU FU FU FU
FU FU FU FUFU
OMB OMB OMB OMB OMB
CLB
CLB
CLB
CLB
16-bit data
routing tracks
1-bit control routing tracks
Control Output Data Output
CGRA-style functional unit• Efficient 48-bit arithmetic/logic ops• Registers for pipelining and retiming
FPGA-style configurable block• LUTs for embedded control logic• Special functions: sigmoid, tanh, etc.
Output MUX Block• Configurable MUXes (tree, cascading, parallel)• Put close to output, low cost and flexible
Flexible IO Interface• Simple but flexible alignment
HRL Array: Routing
14
Control Input Data Input
FU FU FU FU FU
FU FU FU FU FU
FU FU FU FUFU
OMB OMB OMB OMB OMB
CLB
CLB
CLB
CLB
16-bit data
routing tracks
1-bit control routing tracks
Control Output Data Output
HRL Array: Routing
15
FU
MXBCLB
out
in
INA INBsel
OUT
A
B
Y out
16-bit data net
tracks
1-bit control net
tracks
Block output to routing track connection
Routing switch box
Block input MUX connection
Fully static for low power
Separate data and control to reduce area
16-bit data, bus-based1-bit control, few tracks
No switches b/w two networks
Connection box and switch
HRL Array: IO
16
Control Input Data Input
FU FU FU FU FU
FU FU FU FU FU
FU FU FU FUFU
OMB OMB OMB OMB OMB
CLB
CLB
CLB
CLB
16-bit data
routing tracks
1-bit control routing tracks
Control Output Data Output
HRL Array: IO
17
48-bit fixed-point 16-bit short 32-bit int
48-bit fixed-point
16-bit short
32-bit int
Ctrl IO
1-bit control tracks
16-bit data tracks
Data IO
Control IO• Connects to control tracks• Same as FPGA
Data IO• Connect to data net• Simple 16-bit chunk alignment
Low cost and sufficiently flexible even for irregular data
Outline
Motivation
NDP System Design
Heterogeneous Reconfigurable Logic (HRL)
Evaluation
Conclusions
18
Methodology
Workloads
o 3 analytics frameworks: MapReduce, graph, DNN
o 9 representative applications, 11 kernel circuits (KCs)
Technology
o 45 nm area and power model
o CGRA: DySER as in NDA [HPCA’11, HPCA’15]
o FPGA: Xilinx Virtex-6
Tools
o Synthesize, place & route by Yosys + VTR
o System simulation by zsim 19
Array Area
Same logic capacity for each type array
20
0
0.2
0.4
0.6
0.8
1
1.2
1.4
FPGA CGRA HRL
Sin
gle
Arr
ay A
rea
(mm
2)
Ctrl Routing
Data Routing
Routing
Logic
Coarse-grained FUs improve area efficiency
Array Power
21
0
5
10
15
20
25
30
35
40
KC1 KC2 KC3 KC4 KC5 KC6 KC7 KC8 KC9 KC10 KC11 Average
Pow
er (
mW
)
FPGA CGRA HRL
HRL: static but flexible routing
CGRA: high power on circuit-switched network
FPGA: half the frequency
Vault Power Efficiency
22
0
0.5
1
1.5
2
2.5
3
3.5
KC1 KC2 KC3 KC4 KC5 KC6 KC7 KC8 KC9 KC10 KC11 Average
No
rmal
ized
Per
f/W
att
FPGA CGRA HRL
HRL: 2.2x FPGA, 1.7x CGRA on perf/Watt
Overall Performance
ASIC represents the upper bound of efficiency
Cores, FPGA, CGRA only match 30% to 80% of ASIC
HRL has 92% of ASIC performance on average
23
0
0.2
0.4
0.6
0.8
1
GroupBy Hist LinReg PageRank SSSP CC ConvNet MLP dA
No
rmal
ized
Perf
orm
ance
Cores FPGA CGRA HRL ASIC
Memory bandwidth saturatedMemory bandwidth not saturated
Conclusions
NDP logic requirements: area + power efficiency, flexibility
Heterogeneous reconfigurable logic (HRL)
o Fine-grained + coarse-grained logic blocks
o Static and separate data and control networks
o Special blocks for branching and layout management
o Vault logic handles communication and control
HRL for in-memory analytics
o 2.2x performance/Watt over FPGA and 1.7x over CGRA
o Within 92% of ASIC performance24
Thanks!Questions?