Neural Cache: Bit-Serial In-Cache Acceleration of Deep Neural Networks
Charles Eckert Xiaowei Wang Jingcheng Wang Arun Subramaniyan
Ravi Iyer Dennis Sylvester David Blaauw Reetuparna Das
M-Bits Research Group
Can we transform the CPU into a neural accelerator?
Compared to the CPU and the GPU, Neural Cache offers:
++ Parallelism
-- Data Movement
Transforming caches into massively parallel vector ALUs

An 18-core Xeon processor has a 45 MB last-level cache (LLC) organized as 18 slices. Each 2.5 MB LLC slice contains 20 ways plus a CBOX (and, in Neural Cache, a Transpose Memory Unit, TMU); a way is built from 32 kB data banks, and each bank from 8 kB SRAM arrays. An 8 kB array is a 256-wordline x 256-bitline (BL/BLB) SRAM with a row decoder.

18 LLC slices -> 360 ways -> 5,760 arrays -> 1,474,560 bitline ALUs

With a second row decoder and logic at the bitline periphery, two rows (one holding a bit of operand A, one a bit of operand B) can be activated together and combined on the bitlines. Each column's sense amplifiers (sensed against Vref) feed a small bitline ALU that produces A&B, A^B, ~A&~B, and the sum S = A^B^C with a latched carry (Cin/Cout), so every bitline acts as a one-bit ALU computing, e.g., A + B.

The passive last-level cache is thus transformed into roughly 1.5 million active bit-serial ALUs operating at 2.5 GHz, with configurable precision, supporting:
✓ Add ✓ Multiply ✓ Divide
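The totals follow from the cache geometry; a quick sanity check in Python (the 16-arrays-per-way factor is inferred from the sizes shown: a 128 kB way = four 32 kB banks = sixteen 8 kB arrays):

```python
# Sanity-check the hierarchy totals quoted on the slides.
slices = 18                    # 18 LLC slices, one per core
ways = slices * 20             # 20 ways per 2.5 MB slice
arrays = ways * 16             # inferred: 128 kB way = 4 x 32 kB banks = 16 x 8 kB arrays
alus = arrays * 256            # one bit-serial ALU per bitline, 256 bitlines per array
capacity_mb = arrays * 8 / 1024

print(ways, arrays, alus, capacity_mb)   # 360 5760 1474560 45.0
```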
Why bit-serial?

Bit-parallel arithmetic: the conventional layout stores each word horizontally along a wordline (Word 0 ... Word 3, one bit per bitline). Adding two such words (A + B, activating WL1 and WL2) forces the carry C to propagate across bitlines from each sum bit S to its neighbor: extra inter-column logic, high complexity, and a loss of throughput and efficiency.

Bit-serial arithmetic: with transposed data, each word lies vertically along one bitline as a stack of bit-slices (Bit-Slice 0 ... Bit-Slice 3). Activating one wordline in array A (WL1) and one in array B (WL2) presents one bit-slice of every word to the bitline logic at once:

Cycle 1: add bit-slice 0 of all words; latch a carry C in each column.
Cycle 2: add bit-slice 1 with the latched carries.
Cycle 3: add bit-slice 2.
Cycle 4: add bit-slice 3.

The carry never leaves its bitline, so a 4-bit add finishes in four cycles across all words of the array in parallel.

✓ Low area complexity
✓ High throughput
✓ Configurable & high precision
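A minimal software model of the transposed layout and the cycle-per-bit-slice add can make this concrete (illustrative Python, not the hardware's actual control sequence):

```python
# Toy model of bit-serial addition over transposed data.
# Each column ("bitline") holds one word; row k holds bit-slice k
# (bit k of every word), so one wordline read returns one bit per column.

def transpose(words, nbits):
    """Lay words out vertically: slices[k][i] = bit k of words[i]."""
    return [[(w >> k) & 1 for w in words] for k in range(nbits)]

def bit_serial_add(a_words, b_words, nbits=4):
    A = transpose(a_words, nbits)          # array A, one word per bitline
    B = transpose(b_words, nbits)          # array B
    ncols = len(a_words)
    carry = [0] * ncols                    # per-bitline carry latch
    sums = []
    for k in range(nbits):                 # one cycle per bit-slice
        a, b = A[k], B[k]                  # activate WL1 (A) and WL2 (B)
        s = [a[i] ^ b[i] ^ carry[i] for i in range(ncols)]   # S = A^B^C
        carry = [(a[i] & b[i]) | (carry[i] & (a[i] ^ b[i]))  # Cout stays local
                 for i in range(ncols)]
        sums.append(s)                     # write sum bit-slice back
    sums.append(carry)                     # final cycle: carry out
    # Re-assemble per-column integers for checking
    return [sum(sums[k][i] << k for k in range(nbits + 1))
            for i in range(ncols)]

print(bit_serial_add([3, 5, 15, 0], [1, 7, 15, 0]))   # [4, 12, 30, 0]
```

Note that the carry update only ever touches its own column, which is the whole point of the transposed layout.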
Outline
• Motivation
• Bit-Serial Arithmetic
• Transpose
• Mapping of Convolution to Array
• Methodology
• Results
In-SRAM Arithmetic

(Recap: 18 LLC slices -> 360 ways -> 5,760 arrays -> 1,474,560 bitline ALUs; each 8 kB SRAM array is augmented with a bitline ALU at its sense amplifiers, computing A + B over transposed bit-slices.)
Logical Operations In-SRAM

A conventional SRAM array has one row decoder driving the wordlines and differential sense amplifiers across each bitline pair (BLB0/BL0 ... BLBn/BLn). Neural Cache makes three changes:
• Single-ended sense amplifiers, each comparing one bitline against a reference voltage Vref
• An additional row decoder, so two wordlines can be activated simultaneously
• Reconfigurable sense amplifiers

Activating rows A and B together computes logic directly on the bitlines: a cell storing 0 discharges BL, so BL stays high only when both cells store 1, and sensing BL yields A AND B. Symmetrically, a cell storing 1 discharges BLB, so sensing BLB yields A NOR B.
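The dual-wordline read can be modeled in a few lines (`dual_wordline_read` is an illustrative name, not an API from the paper):

```python
def dual_wordline_read(row_a, row_b):
    """Model activating two wordlines at once with single-ended sensing.

    Each cell pulls BL low when it stores 0 and BLB low when it stores 1,
    so with rows A and B both active:
      BL  stays high  <=> A = B = 1   ->  sense(BL)  = A AND B
      BLB stays high  <=> A = B = 0   ->  sense(BLB) = A NOR B
    """
    and_out = [a & b for a, b in zip(row_a, row_b)]
    nor_out = [1 - (a | b) for a, b in zip(row_a, row_b)]
    return and_out, nor_out

a = [0, 0, 1, 1]
b = [0, 1, 0, 1]
print(dual_wordline_read(a, b))   # ([0, 0, 0, 1], [1, 0, 0, 0])
```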
Addition In-SRAM

Operands live in transposed form: rows A0 and A1 hold the bit-slices of operand vector A, rows B0 and B1 those of B, and rows P0-P2 receive the result, one element per bitline and 256 elements side by side. The two row decoders (one addressing the A rows, one the B rows) drive one bit-slice of each operand onto the bitlines per cycle, and the bitline ALU combines the sensed A^B and A&B with the latched carry C to form S = A^B^C:

Cycle 1: activate A0 and B0; write the sum bit to P0; latch the carry.
Cycle 2: activate A1 and B1; add with the latched carry; write P1.
Cycle 3: write the final carry out to P2.

An N-bit addition takes N+1 cycles, performed simultaneously in every bitline of every array.
Multiplication In-SRAM

A 2-bit example: P = A x B with A = (A1 A0) and B = (B1 B0). The partial-product tableau

        A1   A0
    x   B1   B0
    -----------
       A1B0 A0B0
  A1B1 A0B1
  -------------
    P2   P1   P0  (with the final carry going into P3)

accumulates into result rows P0-P3, initialized to 0. Conditional writes use a Tag latch per bitline: a step such as P1 <- P1 + A0B1 is executed as "if (B1), P1 <- P1 + A0; else, P1 <- P1", with B1 loaded into the Tag.

Cycle 1: load B0 into the Tag latch.
Cycle 2: P0 <- A0B0 (a copy of A0, predicated on the Tag).
Cycle 3: P1 <- A1B0.
Cycle 4: load B1 into the Tag latch.
Cycle 5: P1 <- A1B0 + A0B1 (predicated add; may generate a carry).
Cycle 6: P2 <- A1B1, plus the latched carry.
Cycle 7: P3 <- Cin (the remaining carry).
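The Tag-predication scheme can be sketched as follows (a functional model only; the real schedule interleaves the bit-serial add cycles, seven cycles in total for 2-bit operands):

```python
def bit_serial_multiply(a_words, b_words, nbits=2):
    """Toy model of the predicated in-SRAM multiply.

    One element per "bitline". The Tag latch holds one bit of B and
    predicates writes, turning P += A * B_j into:
        if Tag: P += A << j   (else leave P unchanged)
    """
    ncols = len(a_words)
    p = [0] * ncols                              # product rows P0..P3, packed
    for j in range(nbits):                       # for each bit-slice of B
        tag = [(b >> j) & 1 for b in b_words]    # load the Tag latch from B_j
        # predicated bit-serial add of (A << j) into P
        p = [p[i] + (a_words[i] << j) if tag[i] else p[i]
             for i in range(ncols)]
    return p

print(bit_serial_multiply([3, 2, 1, 0], [3, 2, 0, 3]))   # [9, 4, 0, 0]
```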
Supported Arithmetic

Operation    Cycles (N-bit)
ADD          N + 1
SUB          2N + 1
MUL          N^2 + 5N - 2
DIV          1.5N^2 + 5.5N
Comparison   2N + 1

Overhead: 7.5% area in the synthesized array, 2% of the processor chip.
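The table translates directly into a cycle-count helper; as a worked example, the 8-bit operands used later for Inception V3 give a 102-cycle multiply:

```python
# Cycle counts for N-bit bit-serial operations, per the table above.
def cycles(op, n):
    return {
        "ADD": n + 1,
        "SUB": 2 * n + 1,
        "MUL": n * n + 5 * n - 2,
        "DIV": int(1.5 * n * n + 5.5 * n),
        "CMP": 2 * n + 1,
    }[op]

print([cycles(op, 8) for op in ("ADD", "SUB", "MUL", "DIV", "CMP")])
# [9, 17, 102, 140, 17]
```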
Outline
• Motivation
• Bit-Serial Arithmetic
• Transpose
• Mapping of Convolution to Array
• Methodology
• Results
Transpose

Bit-serial arithmetic requires operands in transposed layout, while cores and DRAM produce data in the regular layout. Each slice therefore adds a Transpose Memory Unit (TMU) next to the CBOX. The TMU is an array of 8-T transpose bit-cells with sense amplifiers and drivers (SA/DR) on two edges: a row decoder for regular read/write and a column decoder for transpose read/write, under a small controller. Words A0, A1, A2, ... written in regular order can be read back as bit-slices (A0[MSB], A1[MSB], A2[MSB], ... down to the LSB slice), and vice versa, so a block such as

A2 A1 A0
B2 B1 B0
C2 C1 C0

is delivered to the compute arrays in its transposed form.
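Functionally, the TMU behaves like a bit-matrix transpose; a sketch (`tmu_transpose` is an illustrative name):

```python
# Functional model of the TMU: the 8-T cell array can be accessed through
# either edge, so a regular write followed by a transpose read re-emits the
# data bit-slice by bit-slice (MSB slice first, as in the figure).

def tmu_transpose(words, nbits=8):
    cells = [[(w >> k) & 1 for k in range(nbits - 1, -1, -1)]  # MSB..LSB
             for w in words]                                   # regular write
    return [list(bit_slice) for bit_slice in zip(*cells)]      # transpose read

print(tmu_transpose([0b101, 0b011, 0b110], nbits=3))
# [[1, 0, 1], [0, 1, 1], [1, 1, 0]]
```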
Outline
• Motivation
• Transpose
• Bit-Serial Arithmetic
• Mapping of Convolution to Array
• Methodology
• Results
A Convolutional Layer

Input activations: C channels, each H x W.
Filters: M 3D filters; each filter has C channels of R x S weights.
Output activations: M channels, each E x F.

Each output activation is produced by multiply-accumulating (MAC) an R x S window of the input against the filter weights in every channel, then reducing (summing) the partial sums across the C channels. Unrolling the R x S x C filter weights and the matching input window turns this into a dot product followed by a reduction.
Mapping CNN to Neural Cache

Within one 8 kB SRAM array (256 wordlines x 256 bitlines), each bitline stores, transposed along the wordlines: the R x S x 8-bit input activations of one channel, the corresponding R x S x 8-bit filter weights, and room for 4 x 8-bit partial sums and outputs. A filter with C = 256 channels thus spreads its channels across the 256 bitlines of an array (channel 1, channel 2, ..., channel 256); the MACs run bit-serially in all bitlines at once, and the per-channel partial sums are then reduced.

A 2.5 MB LLC slice is organized as quadrants (Quad 1-4) across its ways (Way 1 ... Way 20), holding M = 32 filters and successive output positions (Output Position 1, Output Position 2, ...). Ways 1-18 of each slice compute while ways 19-20 are set apart, and slices 1-14 together tile the output volume (M x E x F).

Putting it together: cores (Core 1 ... Core 14), the 14 LLC slices, and DRAM communicate over the ring interconnect, with way 19 reserved. A layer then executes in four steps:
1. Filter loading
2. Input loading
3. MAC + reduction
4. Output transfer
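The unroll-MAC-reduce structure for a single output position can be sketched as follows (illustrative only; in hardware each channel's MAC runs bit-serially in its own bitline, and the reduction runs in place across bitlines):

```python
# One output activation = dot product of an unrolled R x S x C input window
# with an unrolled filter, followed by a reduction over channels.

def conv_output_at(inp, filt, y, x):
    """inp: [C][H][W], filt: [C][R][S]; output activation at position (y, x)."""
    C, R, S = len(filt), len(filt[0]), len(filt[0][0])
    # MAC within each "bitline" (one channel per bitline)
    partial = [sum(inp[c][y + r][x + s] * filt[c][r][s]
                   for r in range(R) for s in range(S))
               for c in range(C)]
    return sum(partial)                  # reduction across bitlines

inp = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]   # C=2, H=W=2
filt = [[[1]], [[2]]]                        # C=2, R=S=1
print(conv_output_at(inp, filt, 0, 0))       # 1*1 + 5*2 = 11
```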
Outline
• Motivation
• Transpose
• Bit-Serial Arithmetic
• Mapping of Convolution to Array
• Methodology
• Results
Evaluation Methodology

CPU (2 sockets): Intel Xeon E5-2597 v3, 2.6 GHz, 28 cores / 56 threads; 78.96 MB on-chip memory; 64 GB DRAM off-chip. Performance: TensorFlow tfprof. Energy: Intel RAPL interface.

GPU (1 card): Nvidia Titan Xp, 1.6 GHz, 3840 CUDA cores; 9.14 MB on-chip memory; 12 GB DRAM off-chip. Performance: TensorFlow tfprof. Energy: NVIDIA System Management Interface.

Neural Cache: 2.5 GHz compute SRAM, 1,032,192 bit-serial ALUs; 70 MB on-chip memory (dual socket); 64 GB DRAM off-chip. Performance: cycle-accurate simulator + C microbenchmarks. Energy: SPICE simulation + Intel RAPL interface.

DNN model: Inception V3, with 8-bit weights and inputs.
Outline
• Motivation
• Transpose
• Bit-Serial Arithmetic
• Mapping of Convolution to Array
• Methodology
• Results
Throughput

[Chart: throughput (inferences/sec, 0-700) vs. batch size (1-256) for CPU (Xeon E5), GPU (Titan Xp), and Neural Cache.]

2.2x improved throughput over the GPU.

Latency

[Chart: latency (ms, 0-100) for CPU, GPU, and Neural Cache.]

7.7x latency improvement over the GPU.
Power/Energy Comparison

[Chart: total energy (Joules, 0-10) and average power (Watts, 0-120) for CPU, GPU, and Neural Cache.]

Neural Cache Summary

Repurpose the cache into a data-parallel DNN accelerator, built on massively parallel bit-serial in-SRAM arithmetic and a data layout for CNNs:
• 12x / 20x improvement over a server-class CPU, at 2% area overhead
• 2x / 16x improvement over a server-class GPU