
Bit-Tactical: Exploiting Ineffectual Computations in Convolutional Neural Networks: Which, Why, and How

Alberto Delmas, Patrick Judd, Dylan Malone Stuart, Zissis Poulos, Mostafa Mahmoud, Sayeh Sharify, Milos Nikolic, Andreas Moshovos

Abstract

Bit-Tactical is a deep neural network inference accelerator:
• Targets primarily CNNs, can run any layer type
• Exploits sparsity and effectual bit content
• Over 10x faster than data-parallel accelerators
• 2x more energy efficient
• 32% area cost

Baseline

Straightforward implementation: synchronized parallel units that do not skip any work.

Frontend: Skipping 0-Weights

Sparsified networks present ineffectual computation in the form of zero-valued weights. A data-parallel accelerator cannot take advantage of this, and needs 4 cycles in the accompanying example. Bit-Tactical borrows effectual weights temporally (from the future) and spatially (from different coordinates within the same filter), and needs only 2 cycles to process the same example.
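To make the borrowing concrete, here is a toy scheduling model (a sketch under assumed connectivity, not the scheduler described in the paper): each lane may issue its own weight, reach up to `lookahead` steps ahead in its own lane, or borrow a pending weight from a nearby lane one step ahead, and the shared window advances past steps whose weights have all been issued or were zero. The `lookahead` and `lookaside` parameters loosely mirror the <h,d> notation of the TCL configurations; the function name is mine.

```python
def cycles_with_skipping(weights, lookahead=2, lookaside=5):
    """Sketch of the weight-skipping frontend, not the authors' algorithm.

    weights: one list of weight values per lane; a dense, synchronized
    engine would spend len(weights[0]) cycles regardless of the zeros.
    """
    lanes = len(weights)
    pending = {(l, t) for l, lane in enumerate(weights)
               for t, w in enumerate(lane) if w != 0}
    cycles = 0
    while pending:
        t = min(step for _, step in pending)   # window skips drained steps
        for lane in range(lanes):
            # Own lane: current step plus the lookahead window.
            candidates = [(lane, t + dt) for dt in range(lookahead + 1)
                          if (lane, t + dt) in pending]
            # Lookaside: borrow from a nearby lane, one step ahead.
            candidates += [((lane + d) % lanes, t + 1)
                           for d in range(1, lookaside + 1)
                           if ((lane + d) % lanes, t + 1) in pending]
            if candidates:
                pending.discard(candidates[0])  # issue one weight this cycle
        cycles += 1
    return cycles

# The poster's example in miniature: half the weights are zero, so the
# dense schedule needs 4 cycles while the skipping schedule needs 2.
print(cycles_with_skipping([[1, 0, 3, 0],
                            [0, 2, 0, 4]]))    # -> 2
```

In hardware the equivalent choice is made by a small per-lane multiplexer; because the weights are known before execution, the selects can be computed ahead of time rather than at run time.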

Motivation

Previous accelerators exploit data sparsity along a single dimension only (zero values, precision variability, or zero bits, on either the weights or the activations). Combining zero-value weight sparsity with zero-bit activation sparsity offers the best performance potential for sparse networks, while still benefiting dense ones.

Most computation is ineffectual: over 31x potential speedup.
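One way to see where such a bound comes from is to count, for each weight and activation pair that gets multiplied, how much of the work is effectual. The sketch below is an illustrative estimate under simple assumptions (16-bit fixed point, one unit of work per effectual activation bit that is paired with a nonzero weight); the function and its cost model are mine and are not the methodology behind the poster's numbers.

```python
import numpy as np

def potential_speedup(weights, activations, bits=16):
    """Upper-bound speedup from skipping zero weights and ineffectual
    activation bits: baseline work is `bits` per multiplication, the
    remaining work is one unit per 1 bit of an activation that is
    paired with a nonzero weight. Illustrative sketch only."""
    w = np.asarray(weights).ravel()
    a = np.asarray(activations).ravel()
    mask = (1 << bits) - 1
    ones = np.array([bin(int(x) & mask).count("1") for x in a])
    baseline = w.size * bits                  # dense, bit-parallel equivalent
    effectual = int(ones[w != 0].sum())       # work that remains
    return baseline / max(effectual, 1)

# Toy example: mostly-zero weights and low-magnitude activations.
rng = np.random.default_rng(0)
w = rng.integers(-3, 4, size=1000) * (rng.random(1000) < 0.4)  # ~60% zero weights
a = rng.integers(0, 64, size=1000)                             # small activations
print(round(potential_speedup(w, a), 1))
```

Analogous bounds for the other dimensions (zero-valued activations, activation precision, or weight sparsity alone) follow by swapping the per-pair cost.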

Backend: precision adaptability

Bit-Tactical can work with multiple backends to take advantage of this: TCLp uses a bit-serial, synchronized backend that skips the zero bits at both ends of the value (adapting to the precision the values actually need), while TCLe uses a term-serial backend that processes only the effectual terms.
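As a rough illustration of the difference between the two backends, the sketch below counts the cycles one activation group would take under each policy. It assumes, as a simplification, that a bit-serial group is synchronized to the widest bit range any of its values spans, and a term-serial group to the largest number of 1 bits in any of its values; the helper names are mine.

```python
def tclp_cycles(group, bits=16):
    """Bit-serial with precision adaptability (sketch): the group takes one
    cycle per bit position between its highest and lowest set bits, so the
    zero bits at both ends of the values are skipped."""
    hi, lo = 0, bits + 1
    for a in group:
        a &= (1 << bits) - 1
        if a:
            hi = max(hi, a.bit_length())           # highest set bit (1-based)
            lo = min(lo, (a & -a).bit_length())    # lowest set bit (1-based)
    return 0 if hi == 0 else hi - lo + 1

def tcle_cycles(group, bits=16):
    """Term-serial (sketch): the group takes one cycle per effectual term,
    i.e. per 1 bit of its busiest value."""
    return max(bin(a & ((1 << bits) - 1)).count("1") for a in group)

group = [0x0018, 0x0004, 0x0020, 0x0000]       # toy 16-bit activation group
print(tclp_cycles(group), tcle_cycles(group))  # 4 cycles vs 2, instead of 16
```

A bit-serial engine without precision adaptability would spend all 16 cycles on the same group; both backends sit behind the same weight-skipping frontend.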

Results

[Chart: speedup of the TCLp<1,6>, TCLp<2,5>, TCLp<2,5>+S, TCLp<4,3>, TCLe<1,6>, TCLe<2,5>, TCLe<2,5>+S, and TCLe<4,3> configurations on AlexNet-ss, AlexNet-es, GoogLeNet-ss, GoogLeNet-es, ResNet50-ss, and their geomean.]

[Chart: relative performance of base, dense-SCNN, SCNN, DynSTR, TCLp<2,5>+S, PRA, and TCLe<2,5>+S on the same networks.]

[Chart: the same TCLp/TCLe configurations on the same networks, with values between roughly 1.8x and 4.5x.]

Over 90% of activation bits are zero.

[Figure: distribution of activation values using a 16-bit fixed-point representation.]

[Figure: the weight-skipping example from "Frontend: Skipping 0-Weights"; with the zero-valued weights interleaved, the dense schedule needs 4 cycles and Bit-Tactical needs 2.]

Organization

[Figure: tile, general view; WSU (Weight Skipping Unit Slice) and ASU (Activation Selection Unit); the TCLe/TCLp tile enabling serial activation processing.]

[Chart: off-chip bandwidth characterization.]

No-Crossbar design

Instead of using crossbars to shuffle data around (either on the outputs or on one of the inputs), Bit-Tactical uses small, local muxes to select effectual term pairs.
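A small sketch of why local muxes suffice (illustrative; the encoding and helper names are assumptions, not the actual WSU/ASU datapath): because the weight schedule is fixed before execution, each lane only needs a per-cycle select into a window of 1 + lookahead + lookaside candidates, and the activation side can reuse that select to fetch the matching activation, so no lane ever needs the full reach of an N-to-N crossbar.

```python
def mux_fanin(lookahead, lookaside):
    # Candidates one lane can choose from: its own slot plus the window.
    return 1 + lookahead + lookaside             # e.g. a <2,5> config -> 8 inputs

def gather_pairs(weight_windows, activation_windows, selects):
    """Pick one (weight, activation) pair per lane with a precomputed select;
    a full crossbar would instead let every lane reach every position."""
    return [(w[s], a[s])
            for w, a, s in zip(weight_windows, activation_windows, selects)]

# Toy use: two lanes, three candidates each (hypothetical values).
weights = [[0, 5, 0], [3, 0, 0]]
acts    = [[7, 2, 9], [4, 6, 1]]
print(mux_fanin(2, 5))                               # 8
print(gather_pairs(weights, acts, selects=[1, 0]))   # [(5, 2), (3, 4)]
```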

Comparison with other accelerators:

[Chart: performance relative to DaDianNao++ for SCNN, Dstripes, Pragmatic, TCLp<2,5>, and TCLe<2,5>.]

Energy efficiency:

[Chart: energy efficiency of TCLp<2,5> and TCLe<2,5> relative to the baseline.]

Throughput:

[Chart: speedup of TCLp<1,6>, TCLp<2,5>, TCLp<4,3>, TCLe<1,6>, TCLe<2,5>, and TCLe<4,3> over the baseline.]

Potential speedup:

[Chart: potential speedup on AlexNet-ES, AlexNet-SS, GoogleNet-ES, GoogleNet-SS, ResNet50-SS, Mobilenet, Bi-LSTM, and the geomean for the A, W, W+A, Ap, Ae, W+Ap, and W+Ae combinations of activation and weight sparsity, precision, and effectual bit content.]