cma-cube: a scalable reconfigurable accelerator with · pdf filecma-cube: a scalable...

1
CMA-CUBE: A SCALABLE RECONFIGURABLE ACCELERATOR WITH 3-D WIRELESS INDUCTIVE COUPLING INTERCONNECT Yusuke Koizumi , Eiichi Sasaki , Hideharu Amano , Hiroki Matsutani , Yasuhiro Take , Tadahiro Kuroda , Ryuichi Sakamoto †† , Mitaro Namiki †† , Kimiyoshi Usami ††† , Masaaki Kondo †††† , Hiroshi Nakamura ††††† † Keio Univ. †† Tokyo Univ. of Agri. And Tech. ††† Shibaura Institute of Technology †††† University of Electro-Cmmunications ††††† University of Tokyo Introduction CMA-Cube Inductive Coupling -Wireless Interconnection Technology- Network on Chip CMA Core -Low Power Reconfigurable Accelerator- Evaluation Recent battery driven IT devices require versatile functions and high performance with low energy consumption The target applications require wide range of performance It is difficult to develop an SoC (System-on-a Chip) for each product because of the grown initial cost of LSI A scalable reconfigurable accelerator CMA-Cube has been proposed An inductor is implemented as a square coil with metal in common CMOS layout Chips are stacked with a certain length of gap so that the TX channel is located exactly on the same place of the RX channel of the next chip RX TX TX RX RX TX TX RX RX TX TX RX RX TX TX RX Router Interface scheme of CMA-Cube Layout of CMA-Cube CMA-Cube consists of the following components Reconfigurable accelerator core called CMA (Cool Mega Array) Inductive Coupling interconnect 2 Routers with bubble flow controll Interface between a core and routers PE PE PE PE PE PE PE PE PE PE PE PE PE PE PE PE PE PE PE PE PE PE PE PE PE PE PE PE PE PE PE PE PE PE PE PE PE PE PE PE PE PE PE PE PE PE PE PE PE PE PE PE PE PE PE PE PE PE PE PE PE PE PE PE μ‐Controller 8 x 8 PE(Processing Element) Array 8 x 8 PEs only with combinational circuit Eliminating the energy consumption for storing the intermediate data in registers and their clock distribution Supply voltage ranging from 0.5V to 1.2V Bank0 Bank1 Data Memory Router Standard 3-stage pipeline structure with 3-input/output Bubble flow control for dead lock free The following simple rule is applied (1) A packet on a ring can move to the next hop along the ring when the input buffer of the next hop has an empty space of at least one packet. (2) The network interface can inject a packet to a ring when the vertical input buffer of the ingress router has an empty space of at least two packets. (3) A packet can exit from a ring only when the horizontal output buffer of the egress router has an empty space of at least one packet. Interface Decode packets from a router Create packets to a router A packet consists of flits Target System 3 CMA-Cubes (Accelerators) Geyser-Cube (Host CPU) Benchmark Application Result Header Analysis Decode Geyser-Cube Huffman decode CMA-Cube0 Inverse Quantization CMA-Cube1 Inverse DCT CMA-Cube2 YUV to RGB Convert JPEG image RGB image MCU Image data 0 100000 200000 300000 400000 500000 600000 700000 800000 execution cycles Target System other convert yuv to rgb store intermediate inverse dct inverse quantization huffman decode collect result dma trans processing image data trans Cube-1 Quad-Core No Acceleration (Geyser-Cube + 3 CMA-Cubes) (Geyser-Cube) Cube-1 Quad-Core Geyser-Cube MIPS R3000 compatible CPU Fine-Grained Run-Time Power Gating 4KByte instruction cache 4KByte data cache 16 entry TLB μ-Controller Flexible data management between PE Array and Data Memory Data Memory 25bits x 1024 entry 2 memory bank Data translation from external and operation on PE Array can be overlapped An inductor to receive data An inductor to transfer data Bonding wires for VDD, GND and CLK Data flow control CMA-Cube control Calculation with low energy consumption JPEG decoder Cube-1 Quad-Core vs No Acceleration Simulation tool Cadence NC-verilog 8.1 Frequency Geyser-Cube:200MHz CMA-Cube:200MHz Router:200MHz Data bus:100MHz Cube-1 Quad-Core vs No Acceleration Specifications of Inductilve Coupling Clock Freq. 4GHz for both edges Bandwidth 36bits/5ns Latency 2clocks/chip 0 35 32 31 32 bit Data Flit Type Types of packet single flit packet Command 2-flit packet (header flit, data flit) Single data transfer 5-flit packet (header flit, data flit x 4) Block data transfer Flit format

Upload: dinhlien

Post on 15-Mar-2018

224 views

Category:

Documents


4 download

TRANSCRIPT

Page 1: CMA-CUBE: A SCALABLE RECONFIGURABLE ACCELERATOR WITH · PDF fileCMA-CUBE: A SCALABLE RECONFIGURABLE ACCELERATOR ... the vertical input buffer of the ingress router has an ... A packet

CMA-CUBE: A SCALABLE RECONFIGURABLE ACCELERATOR

WITH 3-D WIRELESS INDUCTIVE COUPLING INTERCONNECT Yusuke Koizumi

†, Eiichi Sasaki

†, Hideharu Amano

†, Hiroki Matsutani

†, Yasuhiro Take

†, Tadahiro Kuroda

†,

Ryuichi Sakamoto††

, Mitaro Namiki††

, Kimiyoshi Usami†††

, Masaaki Kondo††††

, Hiroshi Nakamura†††††

† Keio Univ. †† Tokyo Univ. of Agri. And Tech. ††† Shibaura Institute of Technology

†††† University of Electro-Cmmunications ††††† University of Tokyo

Introduction

CMA-Cube

Inductive Coupling -Wireless Interconnection Technology-

Network on Chip CMA Core -Low Power Reconfigurable Accelerator-

Evaluation

• Recent battery driven IT devices require versatile functions and high performance with low energy consumption • The target applications require wide range of performance • It is difficult to develop an SoC (System-on-a Chip) for each product because of the grown initial cost of LSI

A scalable reconfigurable accelerator CMA-Cube has been proposed

• An inductor is implemented as a square coil with metal in common CMOS layout

• Chips are stacked with a certain length of gap so that the TX channel is located exactly on the same place of the RX channel of the next chip

RX TX TX RX

RX TX TX RX

RX TX TX RX

RX TX TX RX

Router

Interface

scheme of CMA-Cube

Layout of CMA-Cube

• CMA-Cube consists of the following components Reconfigurable accelerator core called CMA (Cool Mega Array) Inductive Coupling interconnect 2 Routers with bubble flow controll Interface between a core and routers

PE PE PE PE PE PE PE PE

PE PE PE PE PE PE PE PE

PE PE PE PE PE PE PE PE

PE PE PE PE PE PE PE PE

PE PE PE PE PE PE PE PE

PE PE PE PE PE PE PE PE

PE PE PE PE PE PE PE PE

PE PE PE PE PE PE PE PE

μ‐Controlle

r

8 x 8

PE(Processing Element) Array

• 8 x 8 PEs only with combinational circuit • Eliminating the energy consumption for

storing the intermediate data in registers and their clock distribution

• Supply voltage ranging from 0.5V to 1.2V

Ban

k0

Ban

k1

Dat

a M

emo

ry

Router

• Standard 3-stage pipeline structure with 3-input/output

• Bubble flow control for dead lock free • The following simple rule is applied

(1) A packet on a ring can move to the next hop along the ring when the input buffer of the next hop has an empty space of at least one packet.

(2) The network interface can inject a packet to a ring when the vertical input buffer of the ingress router has an empty space of at least two packets.

(3) A packet can exit from a ring only when the horizontal output buffer of the egress router has an empty space of at least one packet.

Interface

• Decode packets from a router • Create packets to a router • A packet consists of flits

Target System

3 CMA-Cubes (Accelerators)

Geyser-Cube (Host CPU)

Benchmark Application Result

Header Analysis

Decode

Geyser-Cube Huffman decode

CMA-Cube0 Inverse Quantization

CMA-Cube1 Inverse DCT

CMA-Cube2 YUV to RGB Convert

JPEG image

RGB image

:MCU

Image data

0

100000

200000

300000

400000

500000

600000

700000

800000

exe

cuti

on

cyc

les

Target System

other

convert yuv to rgb

store intermediate

inverse dct

inverse quantization

huffman decode

collect result

dma trans

processing

image data trans Cube-1 Quad-Core No Acceleration (Geyser-Cube + 3 CMA-Cubes) (Geyser-Cube)

• Cube-1 Quad-Core

Geyser-Cube ・MIPS R3000 compatible CPU ・Fine-Grained Run-Time Power Gating ・4KByte instruction cache ・4KByte data cache ・16 entry TLB

μ-Controller

• Flexible data management between PE Array and Data Memory

Data Memory

• 25bits x 1024 entry • 2 memory bank • Data translation from external and operation

on PE Array can be overlapped

An inductor to receive data

An inductor to transfer data

Bonding wires for VDD, GND and CLK

Data flow control CMA-Cube control

Calculation with low energy consumption

• JPEG decoder

• Cube-1 Quad-Core vs No Acceleration

Simulation tool Cadence NC-verilog 8.1

Frequency

Geyser-Cube:200MHz

CMA-Cube:200MHz

Router:200MHz

Data bus:100MHz

• Cube-1 Quad-Core vs No Acceleration

Specifications of Inductilve Coupling

Clock Freq. 4GHz for both edges

Bandwidth 36bits/5ns

Latency 2clocks/chip

0 35 32 31

32 bit Data Flit Type

Types of packet

single flit packet Command

2-flit packet (header flit, data flit)

Single data transfer

5-flit packet (header flit, data flit x 4)

Block data transfer

Flit format