Neural Cache: Bit-Serial In-Cache Acceleration of Deep Neural Networks
Charles Eckert Xiaowei Wang Jingcheng Wang Arun Subramaniyan
Ravi Iyer Dennis Sylvester David Blaauw Reetuparna Das
M-Bits Research Group
Can we transform the CPU into a neural accelerator?
Compared to the CPU and the GPU, Neural Cache offers:
++ Parallelism
-- Data Movement
Transforming caches into massively parallel vector ALUs

An 18-core Xeon processor has a 45 MB last-level cache (LLC) organized as 18 slices. Each 2.5 MB LLC slice contains 20 ways plus a CBOX (and, in Neural Cache, a Transpose Memory Unit, TMU); a way is built from 32 kB data banks, and each bank from 8 kB SRAM arrays. An 8 kB array is a 256-wordline x 256-bitline (BL/BLB) SRAM with a row decoder.

18 LLC slices -> 360 ways -> 5,760 arrays -> 1,474,560 bitline ALUs

With a second row decoder and logic at the bitline periphery, two rows (one holding a bit of operand A, one a bit of operand B) can be activated together and combined on the bitlines. Each column's sense amplifiers (sensed against Vref) feed a small bitline ALU that produces A&B, A^B, ~A&~B, and the sum S = A^B^C with a latched carry (Cin/Cout), so every bitline acts as a one-bit ALU computing, e.g., A + B.

The passive last-level cache is thus transformed into roughly 1.5 million active bit-serial ALUs operating at 2.5 GHz, with configurable precision, supporting:
✓ Add ✓ Multiply ✓ Divide
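The totals follow from the cache geometry; a quick sanity check in Python (the 16-arrays-per-way factor is inferred from the sizes shown: a 128 kB way = four 32 kB banks = sixteen 8 kB arrays):

```python
# Sanity-check the hierarchy totals quoted on the slides.
slices = 18                    # 18 LLC slices, one per core
ways = slices * 20             # 20 ways per 2.5 MB slice
arrays = ways * 16             # inferred: 128 kB way = 4 x 32 kB banks = 16 x 8 kB arrays
alus = arrays * 256            # one bit-serial ALU per bitline, 256 bitlines per array
capacity_mb = arrays * 8 / 1024

print(ways, arrays, alus, capacity_mb)   # 360 5760 1474560 45.0
```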
Why bit-serial?

Bit-parallel arithmetic: the conventional layout stores each word horizontally along a wordline (Word 0 ... Word 3, one bit per bitline). Adding two such words (A + B, activating WL1 and WL2) forces the carry C to propagate across bitlines from each sum bit S to its neighbor: extra inter-column logic, high complexity, and a loss of throughput and efficiency.

Bit-serial arithmetic: with transposed data, each word lies vertically along one bitline as a stack of bit-slices (Bit-Slice 0 ... Bit-Slice 3). Activating one wordline in array A (WL1) and one in array B (WL2) presents one bit-slice of every word to the bitline logic at once:

Cycle 1: add bit-slice 0 of all words; latch a carry C in each column.
Cycle 2: add bit-slice 1 with the latched carries.
Cycle 3: add bit-slice 2.
Cycle 4: add bit-slice 3.

The carry never leaves its bitline, so a 4-bit add finishes in four cycles across all words of the array in parallel.

✓ Low area complexity
✓ High throughput
✓ Configurable & high precision
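A minimal software model of the transposed layout and the cycle-per-bit-slice add can make this concrete (illustrative Python, not the hardware's actual control sequence):

```python
# Toy model of bit-serial addition over transposed data.
# Each column ("bitline") holds one word; row k holds bit-slice k
# (bit k of every word), so one wordline read returns one bit per column.

def transpose(words, nbits):
    """Lay words out vertically: slices[k][i] = bit k of words[i]."""
    return [[(w >> k) & 1 for w in words] for k in range(nbits)]

def bit_serial_add(a_words, b_words, nbits=4):
    A = transpose(a_words, nbits)          # array A, one word per bitline
    B = transpose(b_words, nbits)          # array B
    ncols = len(a_words)
    carry = [0] * ncols                    # per-bitline carry latch
    sums = []
    for k in range(nbits):                 # one cycle per bit-slice
        a, b = A[k], B[k]                  # activate WL1 (A) and WL2 (B)
        s = [a[i] ^ b[i] ^ carry[i] for i in range(ncols)]   # S = A^B^C
        carry = [(a[i] & b[i]) | (carry[i] & (a[i] ^ b[i]))  # Cout stays local
                 for i in range(ncols)]
        sums.append(s)                     # write sum bit-slice back
    sums.append(carry)                     # final cycle: carry out
    # Re-assemble per-column integers for checking
    return [sum(sums[k][i] << k for k in range(nbits + 1))
            for i in range(ncols)]

print(bit_serial_add([3, 5, 15, 0], [1, 7, 15, 0]))   # [4, 12, 30, 0]
```

Note that the carry update only ever touches its own column, which is the whole point of the transposed layout.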
Outline
• Motivation
• Bit-Serial Arithmetic
• Transpose
• Mapping of Convolution to Array
• Methodology
• Results
In-SRAM Arithmetic

(Recap: 18 LLC slices -> 360 ways -> 5,760 arrays -> 1,474,560 bitline ALUs; each 8 kB SRAM array is augmented with a bitline ALU at its sense amplifiers, computing A + B over transposed bit-slices.)
Logical Operations In-SRAM

A conventional SRAM array has one row decoder driving the wordlines and differential sense amplifiers across each bitline pair (BLB0/BL0 ... BLBn/BLn). Neural Cache makes three changes:
• Single-ended sense amplifiers, each comparing one bitline against a reference voltage Vref
• An additional row decoder, so two wordlines can be activated simultaneously
• Reconfigurable sense amplifiers

Activating rows A and B together computes logic directly on the bitlines: a cell storing 0 discharges BL, so BL stays high only when both cells store 1, and sensing BL yields A AND B. Symmetrically, a cell storing 1 discharges BLB, so sensing BLB yields A NOR B.
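The dual-wordline read can be modeled in a few lines (`dual_wordline_read` is an illustrative name, not an API from the paper):

```python
def dual_wordline_read(row_a, row_b):
    """Model activating two wordlines at once with single-ended sensing.

    Each cell pulls BL low when it stores 0 and BLB low when it stores 1,
    so with rows A and B both active:
      BL  stays high  <=> A = B = 1   ->  sense(BL)  = A AND B
      BLB stays high  <=> A = B = 0   ->  sense(BLB) = A NOR B
    """
    and_out = [a & b for a, b in zip(row_a, row_b)]
    nor_out = [1 - (a | b) for a, b in zip(row_a, row_b)]
    return and_out, nor_out

a = [0, 0, 1, 1]
b = [0, 1, 0, 1]
print(dual_wordline_read(a, b))   # ([0, 0, 0, 1], [1, 0, 0, 0])
```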
Addition In-SRAM

Operands live in transposed form: rows A0 and A1 hold the bit-slices of operand vector A, rows B0 and B1 those of B, and rows P0-P2 receive the result, one element per bitline and 256 elements side by side. The two row decoders (one addressing the A rows, one the B rows) drive one bit-slice of each operand onto the bitlines per cycle, and the bitline ALU combines the sensed A^B and A&B with the latched carry C to form S = A^B^C:

Cycle 1: activate A0 and B0; write the sum bit to P0; latch the carry.
Cycle 2: activate A1 and B1; add with the latched carry; write P1.
Cycle 3: write the final carry out to P2.

An N-bit addition takes N+1 cycles, performed simultaneously in every bitline of every array.
Multiplication In-SRAM

A 2-bit example: P = A x B with A = (A1 A0) and B = (B1 B0). The partial-product tableau

        A1   A0
    x   B1   B0
    -----------
       A1B0 A0B0
  A1B1 A0B1
  -------------
    P2   P1   P0  (with the final carry going into P3)

accumulates into result rows P0-P3, initialized to 0. Conditional writes use a Tag latch per bitline: a step such as P1 <- P1 + A0B1 is executed as "if (B1), P1 <- P1 + A0; else, P1 <- P1", with B1 loaded into the Tag.

Cycle 1: load B0 into the Tag latch.
Cycle 2: P0 <- A0B0 (a copy of A0, predicated on the Tag).
Cycle 3: P1 <- A1B0.
Cycle 4: load B1 into the Tag latch.
Cycle 5: P1 <- A1B0 + A0B1 (predicated add; may generate a carry).
Cycle 6: P2 <- A1B1, plus the latched carry.
Cycle 7: P3 <- Cin (the remaining carry).
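The Tag-predication scheme can be sketched as follows (a functional model only; the real schedule interleaves the bit-serial add cycles, seven cycles in total for 2-bit operands):

```python
def bit_serial_multiply(a_words, b_words, nbits=2):
    """Toy model of the predicated in-SRAM multiply.

    One element per "bitline". The Tag latch holds one bit of B and
    predicates writes, turning P += A * B_j into:
        if Tag: P += A << j   (else leave P unchanged)
    """
    ncols = len(a_words)
    p = [0] * ncols                              # product rows P0..P3, packed
    for j in range(nbits):                       # for each bit-slice of B
        tag = [(b >> j) & 1 for b in b_words]    # load the Tag latch from B_j
        # predicated bit-serial add of (A << j) into P
        p = [p[i] + (a_words[i] << j) if tag[i] else p[i]
             for i in range(ncols)]
    return p

print(bit_serial_multiply([3, 2, 1, 0], [3, 2, 0, 3]))   # [9, 4, 0, 0]
```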
Supported Arithmetic

Operation    Cycles (N-bit)
ADD          N + 1
SUB          2N + 1
MUL          N^2 + 5N - 2
DIV          1.5N^2 + 5.5N
Comparison   2N + 1

Overhead: 7.5% area in the synthesized array, 2% of the processor chip.
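The table translates directly into a cycle-count helper; as a worked example, the 8-bit operands used later for Inception V3 give a 102-cycle multiply:

```python
# Cycle counts for N-bit bit-serial operations, per the table above.
def cycles(op, n):
    return {
        "ADD": n + 1,
        "SUB": 2 * n + 1,
        "MUL": n * n + 5 * n - 2,
        "DIV": int(1.5 * n * n + 5.5 * n),
        "CMP": 2 * n + 1,
    }[op]

print([cycles(op, 8) for op in ("ADD", "SUB", "MUL", "DIV", "CMP")])
# [9, 17, 102, 140, 17]
```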
Outline
• Motivation
• Bit-Serial Arithmetic
• Transpose
• Mapping of Convolution to Array
• Methodology
• Results
Transpose

Bit-serial arithmetic requires operands in transposed layout, while cores and DRAM produce data in the regular layout. Each slice therefore adds a Transpose Memory Unit (TMU) next to the CBOX. The TMU is an array of 8-T transpose bit-cells with sense amplifiers and drivers (SA/DR) on two edges: a row decoder for regular read/write and a column decoder for transpose read/write, under a small controller. Words A0, A1, A2, ... written in regular order can be read back as bit-slices (A0[MSB], A1[MSB], A2[MSB], ... down to the LSB slice), and vice versa, so a block such as

A2 A1 A0
B2 B1 B0
C2 C1 C0

is delivered to the compute arrays in its transposed form.
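Functionally, the TMU behaves like a bit-matrix transpose; a sketch (`tmu_transpose` is an illustrative name):

```python
# Functional model of the TMU: the 8-T cell array can be accessed through
# either edge, so a regular write followed by a transpose read re-emits the
# data bit-slice by bit-slice (MSB slice first, as in the figure).

def tmu_transpose(words, nbits=8):
    cells = [[(w >> k) & 1 for k in range(nbits - 1, -1, -1)]  # MSB..LSB
             for w in words]                                   # regular write
    return [list(bit_slice) for bit_slice in zip(*cells)]      # transpose read

print(tmu_transpose([0b101, 0b011, 0b110], nbits=3))
# [[1, 0, 1], [0, 1, 1], [1, 1, 0]]
```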
Outline
• Motivation
• Transpose
• Bit-Serial Arithmetic
• Mapping of Convolution to Array
• Methodology
• Results
A Convolutional Layer

Input activations: C channels, each H x W.
Filters: M 3D filters; each filter has C channels of R x S weights.
Output activations: M channels, each E x F.

Each output activation is produced by multiply-accumulating (MAC) an R x S window of the input against the filter weights in every channel, then reducing (summing) the partial sums across the C channels. Unrolling the R x S x C filter weights and the matching input window turns this into a dot product followed by a reduction.
Mapping CNN to Neural Cache

Within one 8 kB SRAM array (256 wordlines x 256 bitlines), each bitline stores, transposed along the wordlines: the R x S x 8-bit input activations of one channel, the corresponding R x S x 8-bit filter weights, and room for 4 x 8-bit partial sums and outputs. A filter with C = 256 channels thus spreads its channels across the 256 bitlines of an array (channel 1, channel 2, ..., channel 256); the MACs run bit-serially in all bitlines at once, and the per-channel partial sums are then reduced.

A 2.5 MB LLC slice is organized as quadrants (Quad 1-4) across its ways (Way 1 ... Way 20), holding M = 32 filters and successive output positions (Output Position 1, Output Position 2, ...). Ways 1-18 of each slice compute while ways 19-20 are set apart, and slices 1-14 together tile the output volume (M x E x F).

Putting it together: cores (Core 1 ... Core 14), the 14 LLC slices, and DRAM communicate over the ring interconnect, with way 19 reserved. A layer then executes in four steps:
1. Filter loading
2. Input loading
3. MAC + reduction
4. Output transfer
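The unroll-MAC-reduce structure for a single output position can be sketched as follows (illustrative only; in hardware each channel's MAC runs bit-serially in its own bitline, and the reduction runs in place across bitlines):

```python
# One output activation = dot product of an unrolled R x S x C input window
# with an unrolled filter, followed by a reduction over channels.

def conv_output_at(inp, filt, y, x):
    """inp: [C][H][W], filt: [C][R][S]; output activation at position (y, x)."""
    C, R, S = len(filt), len(filt[0]), len(filt[0][0])
    # MAC within each "bitline" (one channel per bitline)
    partial = [sum(inp[c][y + r][x + s] * filt[c][r][s]
                   for r in range(R) for s in range(S))
               for c in range(C)]
    return sum(partial)                  # reduction across bitlines

inp = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]   # C=2, H=W=2
filt = [[[1]], [[2]]]                        # C=2, R=S=1
print(conv_output_at(inp, filt, 0, 0))       # 1*1 + 5*2 = 11
```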
Outline
• Motivation
• Transpose
• Bit-Serial Arithmetic
• Mapping of Convolution to Array
• Methodology
• Results
Evaluation Methodology

CPU (2 sockets): Intel Xeon E5-2597 v3, 2.6 GHz, 28 cores / 56 threads; 78.96 MB on-chip memory; 64 GB DRAM off-chip. Performance: TensorFlow tfprof. Energy: Intel RAPL interface.

GPU (1 card): Nvidia Titan Xp, 1.6 GHz, 3840 CUDA cores; 9.14 MB on-chip memory; 12 GB DRAM off-chip. Performance: TensorFlow tfprof. Energy: NVIDIA System Management Interface.

Neural Cache: 2.5 GHz compute SRAM, 1,032,192 bit-serial ALUs; 70 MB on-chip memory (dual socket); 64 GB DRAM off-chip. Performance: cycle-accurate simulator + C microbenchmarks. Energy: SPICE simulation + Intel RAPL interface.

DNN model: Inception V3, with 8-bit weights and inputs.
Outline
• Motivation
• Transpose
• Bit-Serial Arithmetic
• Mapping of Convolution to Array
• Methodology
• Results
Throughput

[Chart: throughput (inferences/sec, 0-700) vs. batch size (1-256) for CPU (Xeon E5), GPU (Titan Xp), and Neural Cache.]

2.2x improved throughput over the GPU.

Latency

[Chart: latency (ms, 0-100) for CPU, GPU, and Neural Cache.]

7.7x latency improvement over the GPU.
Power/Energy Comparison

[Chart: total energy (Joules, 0-10) and average power (Watts, 0-120) for CPU, GPU, and Neural Cache.]

Neural Cache Summary

Repurpose the cache into a data-parallel DNN accelerator, built on massively parallel bit-serial in-SRAM arithmetic and a data layout for CNNs:
• 12x / 20x improvement over a server-class CPU, at 2% area overhead
• 2x / 16x improvement over a server-class GPU