hrl: efficient and flexible reconfigurable logic for near...

HRL: Efficient and Flexible Reconfigurable Logic for Near-Data Processing

Mingyu Gao and Christos Kozyrakis

Stanford University

http://mast.stanford.edu

HPCA – March 14, 2016

PIM is Coming Back …

2

Near-Data Processing

(NDP)

End of Dennard scaling

In-memory analytics

3D stacking

MapReduce, graph processing, deep neural networks, …

HMC, HBM

Figs: www.extremetech.comwww.cisl.columbia.edu/grads/tuku/research/www.oceanaute.blogspot.com/2015/06/how-to-shuffle-sort-mapreduce.html

Energy-bound systems

NDP Logic Requirements

Area-efficient

o High processing throughput to match the high memory bandwidth

o 128 GBps per 50 mm2 stack > 32 Gflops > 0.6 Gflops/mm2

Power-efficient

o Thermal constraints limit clock frequency

o 5 W per stack 100 mW/mm2

Flexible

o Must amortize manufacturing cost through reuse across apps

3

NDP Logic Options

Programmable cores[IRAM, FlexRAM, NDC, TOP-PIM]

FPGA (fine-grained)[Active Pages]

CGRA (coarse-grained)[NDA]

ASIC[MSA, LiM]

4

AreaEfficiency

PowerEfficiency

Flexibility

Reconfigurable Logic Challenges

5

FPGA

o Area overhead due to support for bit-level configuration

CGRA

o Traditional GGRAs

• Limited flexibility in interconnects, only for regular computation patterns

o DySER [HPCA’11] and NDA [HPCA’15]

• High power due to circuit-switched routing

• Inefficient for branches and irregular data layouts

Heterogeneity: achieve the best of FPGA and CGRA

Outline

Motivation

NDP System Design

Heterogeneous Reconfigurable Logic (HRL)

Evaluation

Conclusions

6

Overall System Architecture

Multiple stacks linked to host processor through serial links7

Host Processor

Memory Stack

High-Speed Serial Link

Multi-core chip with cache hierarchy:Runs code with high temporal locality

Memory stack with NDP capability:runs memory-intensive code

NDP Stack

Vault Logic

NoC

Logic Die

DRAM Die

...

Bank

Channel

Vault

8

Vault: • Vertical channel• Dedicated memory controller• 8 – 16 vaults per stack

vs. DDR3 channel

• 10x bandwidth (160 GBps)

• 3-5x power improvement Vault logic:• Multiple PEs + control logic• NoC to interconnect vaults

Iterative Execution Flow

Processing phase: PEs run tasks independently and in parallel

Communication phase: data exchange and sync b/w PEs [PACT’15]

o Communication within and across stacks

9

Task Task Task Task

PEs PEs PEs PEs Process

LocalBuffer

Task Task Task Task Pull

Output Queues

Mem CtrlTo remotememory

Generic Load/Store UnitTask

Queue

op,a

dd

r,sz

op,a

ddr,

sz

op,a

ddr,

sz

RouterTo localmemory

Scratchpad Buffer

Global Controller

(Host Processor)

DMA

...

PE

PE

PEUser-defined Logic

Fixed Logic

Output write queues

DMA Ctrl

Vault Logic

Handles task control and data communication

Allows the use of reconfigurable or custom PEs

10

Input & output data info from host or other PEs

Cached data(e.g., graph vertices)

Streaming data(e.g., graph edge list)

Separate queue for each consumer• Coalesce writebacks• In-place combine ops

Outline

Motivation

NDP System Design


Evaluation

Conclusions

11

HRL Features

Fine-grained + coarse-grained reconfigurable blocks

o LUTs for flexible control

o ALUs for efficient arithmetic

Static interconnects

o Wide network for data

o Separate and narrow network for control

Special blocks for branches & irregular data layout

Compute throughput per Watt:

2.2x over FPGA, 1.7x over CGRA12

Area-efficiency and flexibility

Power-efficiency

Flexibility

HRL Array: Logic Blocks

13

Control Input Data Input

FU FU FU FU FU

FU FU FU FU FU

FU FU FU FUFU

OMB OMB OMB OMB OMB

CLB

CLB

CLB

CLB

16-bit data

routing tracks

1-bit control routing tracks

Control Output Data Output

CGRA-style functional unit• Efficient 48-bit arithmetic/logic ops• Registers for pipelining and retiming

FPGA-style configurable block• LUTs for embedded control logic• Special functions: sigmoid, tanh, etc.

Output MUX Block• Configurable MUXes (tree, cascading, parallel)• Put close to output, low cost and flexible

Flexible IO Interface• Simple but flexible alignment

HRL Array: Routing

14


FU FU FU FU FU

FU FU FU FU FU

FU FU FU FUFU

OMB OMB OMB OMB OMB

CLB

CLB

CLB

CLB

16-bit data

routing tracks



HRL Array: Routing

15

FU

MXBCLB

out

in

INA INBsel

OUT

A

B

Y out

16-bit data net

tracks

1-bit control net

tracks

Block output to routing track connection

Routing switch box

Block input MUX connection

Fully static for low power

Separate data and control to reduce area

16-bit data, bus-based1-bit control, few tracks

No switches b/w two networks

Connection box and switch

HRL Array: IO

16


FU FU FU FU FU

FU FU FU FU FU

FU FU FU FUFU

OMB OMB OMB OMB OMB

CLB

CLB

CLB

CLB

16-bit data

routing tracks



HRL Array: IO

17

48-bit fixed-point 16-bit short 32-bit int

48-bit fixed-point

16-bit short

32-bit int

Ctrl IO

1-bit control tracks

16-bit data tracks

Data IO

Control IO• Connects to control tracks• Same as FPGA

Data IO• Connect to data net• Simple 16-bit chunk alignment

Low cost and sufficiently flexible even for irregular data

Outline

Motivation

NDP System Design


Evaluation

Conclusions

18

Methodology

Workloads

o 3 analytics frameworks: MapReduce, graph, DNN

o 9 representative applications, 11 kernel circuits (KCs)

Technology

o 45 nm area and power model

o CGRA: DySER as in NDA [HPCA’11, HPCA’15]

o FPGA: Xilinx Virtex-6

Tools

o Synthesize, place & route by Yosys + VTR

o System simulation by zsim 19

Array Area

Same logic capacity for each type array

20

0

0.2

0.4

0.6

0.8

1

1.2

1.4

FPGA CGRA HRL

Sin

gle

Arr

ay A

rea

(mm

2)

Ctrl Routing

Data Routing

Routing

Logic

Coarse-grained FUs improve area efficiency

Array Power

21

0

5

10

15

20

25

30

35

40

KC1 KC2 KC3 KC4 KC5 KC6 KC7 KC8 KC9 KC10 KC11 Average

Pow

er (

mW

)

FPGA CGRA HRL

HRL: static but flexible routing

CGRA: high power on circuit-switched network

FPGA: half the frequency

Vault Power Efficiency

22

0

0.5

1

1.5

2

2.5

3

3.5

KC1 KC2 KC3 KC4 KC5 KC6 KC7 KC8 KC9 KC10 KC11 Average

No

rmal

ized

Per

f/W

att

FPGA CGRA HRL

HRL: 2.2x FPGA, 1.7x CGRA on perf/Watt

Overall Performance

ASIC represents the upper bound of efficiency

Cores, FPGA, CGRA only match 30% to 80% of ASIC

HRL has 92% of ASIC performance on average

23

0

0.2

0.4

0.6

0.8

1

GroupBy Hist LinReg PageRank SSSP CC ConvNet MLP dA

No

rmal

ized

Perf

orm

ance

Cores FPGA CGRA HRL ASIC

Memory bandwidth saturatedMemory bandwidth not saturated

Conclusions

NDP logic requirements: area + power efficiency, flexibility

Heterogeneous reconfigurable logic (HRL)

o Fine-grained + coarse-grained logic blocks

o Static and separate data and control networks

o Special blocks for branching and layout management

o Vault logic handles communication and control

HRL for in-memory analytics

o 2.2x performance/Watt over FPGA and 1.7x over CGRA

o Within 92% of ASIC performance24

Thanks!Questions?

hrl: efficient and flexible reconfigurable logic for near...

Documents