Neural Cache: Bit-Serial In-Cache Acceleration of Deep Neural Networks
Charles Eckert Xiaowei Wang Jingcheng Wang Arun Subramaniyan
Ravi Iyer Dennis Sylvester David Blaauw Reetuparna Das
M-Bits Research Group
Can we transform a CPU into a neural accelerator?
[Figure: CPU and GPU; the CPU cache ($) is turned into a Neural Cache]
Neural Cache: ++ Parallelism, -- Data Movement
Transforming caches into massively parallel vector ALUs
18-core Xeon processor, 45 MB LLC
18 LLC slices, 360 ways, 5760 arrays, 1,474,560 ALUs
2.5 MB LLC slice: Way 1 to Way 20, CBOX, TMU
Each way is built from 32 kB data banks; each bank is built from 8 kB arrays
8 kB SRAM array: wordlines (WL) 0 to 255 driven by row decoders, bitlines (BL/BLB) 0 to 255, logic at the bottom of each bitline
Array A and Array B are stored as Bit-Slice 0 to Bit-Slice 3 on the same bitlines; the bitline logic computes A + B
Bitline ALU: sense amplifiers (SA, Vref) on BL and BLB give A&B and ~A & ~B; from these, A^B and S = A^B^C; carry latch (Cin, Cout, C_EN); write driver (DR)
Passive Last Level Cache transformed into ∼1 million bit-serial active ALUs
✓ Multiply ✓ Divide ✓ Add
Bit-serial operation @2.5 GHz
Configurable Precision
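The ALU count on the slide follows from the cache hierarchy. A minimal sketch of that arithmetic is given below; the slide states the totals, while the per-way array count and the "one ALU per bitline" factor are derived here from the slice and array sizes and should be read as assumptions.

```python
# Figures stated on the slide.
LLC_SLICES = 18
WAYS_PER_SLICE = 20
SLICE_BYTES = int(2.5 * 1024 * 1024)   # 2.5 MB LLC slice
ARRAY_BYTES = 8 * 1024                 # 8 kB SRAM array

# Derived values (assumptions, not stated explicitly on the slide).
WAY_BYTES = SLICE_BYTES // WAYS_PER_SLICE      # 128 kB per way
ARRAYS_PER_WAY = WAY_BYTES // ARRAY_BYTES      # 16 arrays per way
BITLINES_PER_ARRAY = 256                       # one bit-serial ALU per bitline

ways = LLC_SLICES * WAYS_PER_SLICE
arrays = ways * ARRAYS_PER_WAY
alus = arrays * BITLINES_PER_ARRAY

print(ways, arrays, alus)
```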
Why bit-serial?
Bit-parallel arithmetic
[Figure: 8 kB array with row decoders (0 to 255), bitlines (BL/BLB) and bitline logic; Word 0 to Word 3 of Array A and Array B lie along wordlines WL1 and WL2, and A + B is written to a third row]
Each sum bit (S) depends on the carry (C) from the neighboring bitline
Carry propagation across bitlines
! High complexity
! Loss of throughput and efficiency
Why bit-serial?
Bit-serial arithmetic
Transposed data: Word 0 to Word 3 of Array A and Array B each occupy one bitline, with Bit-Slice 0 to Bit-Slice 3 on successive wordlines
[Figure: 8 kB array with row decoders (0 to 255), bitlines (BL/BLB), and a Sum output and Carry latch on every bitline]
Cycle 1 to Cycle 4: each cycle activates two wordlines (WL1, WL2), produces one Sum bit (S) per bitline, and latches the Carry (C, initially 0) for the next cycle
✓ Low area complexity
✓ High throughput
✓ Configurable & High precision
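The cycle-by-cycle behavior above can be sketched in software. The following is a minimal model, not the hardware implementation: each list position stands for one bitline, `to_bit_slices` and `bit_serial_add` are hypothetical helper names, and the N+1 cycle count matches the ADD entry in the Supported Arithmetic table.

```python
def to_bit_slices(words, n_bits):
    """Transposed layout: slice k holds bit k of every word (one word per bitline)."""
    return [[(w >> k) & 1 for w in words] for k in range(n_bits)]


def bit_serial_add(a_words, b_words, n_bits):
    """Add two vectors one bit-slice per cycle, all bitlines in parallel."""
    a = to_bit_slices(a_words, n_bits)
    b = to_bit_slices(b_words, n_bits)
    carry = [0] * len(a_words)          # carry latches start at 0
    result = []
    for k in range(n_bits):             # one cycle per bit-slice, LSB first
        s = [x ^ y ^ c for x, y, c in zip(a[k], b[k], carry)]
        carry = [(x & y) | ((x ^ y) & c) for x, y, c in zip(a[k], b[k], carry)]
        result.append(s)
    result.append(carry)                # final cycle stores the carry-out
    cycles = n_bits + 1
    sums = [
        sum(result[k][i] << k for k in range(n_bits + 1))
        for i in range(len(a_words))
    ]
    return sums, cycles


sums, cycles = bit_serial_add([3, 5, 15, 0], [1, 10, 15, 0], n_bits=4)
print(sums, cycles)
```

The number of cycles depends only on the precision, not on the number of words, which is why throughput grows with the number of bitlines.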
Outline
• Motivation
• Bit-Serial Arithmetic
• Transpose
• Mapping of Convolution to Array
• Methodology
• Results
In-SRAM Arithmetic
18 LLC slices, 360 ways, 5760 arrays, 1,474,560 ALUs
[Figure: 18-core Xeon processor with 45 MB LLC; 2.5 MB LLC slice (Way 1 to Way 20, CBOX, TMU); 32 kB data bank; 8 kB SRAM array holding Bit-Slice 0 to Bit-Slice 3 of Array A and Array B with A + B computed by the bitline logic; bitline ALU with S = A^B^C]
Logical Operations In-SRAM
[Figure: SRAM array with wordlines, bitline pairs BL0/BLB0 to BLn/BLBn, row decoder, and differential sense amplifiers (SA)]
Changes:
• Additional row decoder
• Reconfigurable sense amplifiers: the differential sense amplifiers become single-ended sense amplifiers, each compared against Vref
With the rows holding A and B activated together, BL senses A AND B and BLB senses A NOR B
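A simplified behavioral model of this sensing scheme is sketched below. It assumes precharged bitlines where any activated cell storing 0 discharges BL and any activated cell storing 1 discharges BLB; the function names are hypothetical, and deriving XOR from the two sensed values is an illustration consistent with the A^B signal in the bitline ALU figure.

```python
def sense_bitlines(a, b):
    """Activate the rows holding bits a and b on one bitline pair.

    Assumed model: BL stays high only if no activated cell stores 0,
    BLB stays high only if no activated cell stores 1.
    """
    bl = 1 if (a == 1 and b == 1) else 0    # A AND B
    blb = 1 if (a == 0 and b == 0) else 0   # A NOR B
    return bl, blb


def bitline_logic(a, b):
    """Return (AND, NOR, XOR) using only the two single-ended sense results."""
    and_ab, nor_ab = sense_bitlines(a, b)
    xor_ab = 1 - (and_ab | nor_ab)          # neither all-ones nor all-zeros
    return and_ab, nor_ab, xor_ab


for a in (0, 1):
    for b in (0, 1):
        print(a, b, bitline_logic(a, b))
```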
Addition In-SRAM
[Figure: rows A0, A1 (operand A), B0, B1 (operand B), and P0, P1, P2 (result P) across 256 bitlines, with a Sum output and a Carry latch per bitline]
Bitline ALU: A&B and ~A & ~B from the sense amplifiers (SA, Vref) on BL and BLB, A^B, S = A^B^C, carry latch (Cin, Cout, C_EN), write driver (DR)
Addition [Cycle 1]: read A0 and B0, write Sum to P0, latch Carry
Addition [Cycle 2]: read A1 and B1, add the latched Carry, write Sum to P1, latch Carry
Addition [Cycle 3]: write the latched Carry to P2
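The per-bitline adder can be sketched from the signals named in the figure. This is an illustrative model under the assumption that the sum and carry are formed from the sensed AND and NOR values; `bitline_alu` and `add_2bit` are hypothetical names, and the 2-bit walk-through mirrors the three cycles on the slides.

```python
def bitline_alu(a, b, cin):
    """One bitline's ALU step: S = A^B^C, with the carry-out latched for the next cycle."""
    and_ab = a & b                       # sensed on BL
    nor_ab = (1 - a) & (1 - b)           # sensed on BLB
    xor_ab = 1 - (and_ab | nor_ab)       # A^B
    s = xor_ab ^ cin
    cout = and_ab | (xor_ab & cin)
    return s, cout


def add_2bit(a, b):
    """2-bit addition on one bitline: P0 and P1 are sums, P2 is the final carry."""
    carry = 0
    p = []
    for k in range(2):                   # Cycle 1 and Cycle 2
        s, carry = bitline_alu((a >> k) & 1, (b >> k) & 1, carry)
        p.append(s)
    p.append(carry)                      # Cycle 3
    return p[0] | (p[1] << 1) | (p[2] << 2)


print(add_2bit(3, 3), add_2bit(1, 2))
```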
Multiplication In-SRAM
[Figure: rows A0, A1, B0, B1 and P0 to P3 across the bitlines, with Sum, Carry, and Tag latches per bitline]
2-bit example: (A1 A0) X (B1 B0), with partial products A0B0, A1B0, A0B1, A1B1 accumulated into P0 to P3
Multiplication [Cycle 1]: load B0 into Tag
Multiplication [Cycle 2]: P0 <- A0B0
Multiplication [Cycle 3]: P1 <- A1B0
Multiplication [Cycle 4]: load B1 into Tag
Multiplication [Cycle 5]: P1 <- A1B0 + A0B1
    P1 <- P1 + A0B1
    If(B1), P1 <- P1 + A0
    Else, P1 <- P1
Multiplication [Cycle 6]: P2 <- A1B1
Multiplication [Cycle 7]: P3 <- Cin
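The tag-predicated accumulation can be modeled as a shift-and-add loop. The sketch below is a software illustration only: `bit_serial_multiply` is a hypothetical name, each list position stands for one bitline, and the exact hardware cycle schedule (including initialization) is not reproduced.

```python
def bit_serial_multiply(a_words, b_words, n_bits):
    """Multiply element-wise, one multiplier bit at a time, all bitlines in parallel."""
    lanes = range(len(a_words))
    # p[k][i] is bit k of the product on bitline i; the product has 2N bits.
    p = [[0] * len(a_words) for _ in range(2 * n_bits)]
    for j in range(n_bits):
        tag = [(b_words[i] >> j) & 1 for i in lanes]      # Tag <- Bj
        carry = [0 for _ in lanes]
        for k in range(n_bits):
            for i in lanes:
                a_bit = ((a_words[i] >> k) & 1) & tag[i]  # If(Bj), add Ak; else add 0
                total = p[j + k][i] + a_bit + carry[i]
                p[j + k][i] = total & 1
                carry[i] = total >> 1
        for i in lanes:
            p[j + n_bits][i] = carry[i]                   # top bit of this pass <- carry
    return [sum(p[k][i] << k for k in range(2 * n_bits)) for i in lanes]


print(bit_serial_multiply([3, 2, 0, 3], [3, 1, 3, 0], n_bits=2))
```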
Supported Arithmetic

| Operation  | Cycles         |
|------------|----------------|
| ADD        | N + 1          |
| SUB        | 2N + 1         |
| MUL        | N^2 + 5N - 2   |
| DIV        | 1.5N^2 + 5.5N  |
| Comparison | 2N + 1         |

Synthesized array: 7.5% area overhead
Processor chip: 2% area overhead
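As a quick worked example, the cycle formulas can be evaluated for 8-bit operands, the precision used in the evaluation. The function name `cycles` and the `CMP` key are illustrative choices, not part of the original material.

```python
def cycles(op, n):
    """Cycle count for an N-bit bit-serial operation, per the Supported Arithmetic table."""
    table = {
        "ADD": n + 1,
        "SUB": 2 * n + 1,
        "MUL": n * n + 5 * n - 2,
        "DIV": 1.5 * n * n + 5.5 * n,
        "CMP": 2 * n + 1,
    }
    return table[op]


for op in ("ADD", "SUB", "MUL", "DIV", "CMP"):
    print(op, cycles(op, 8))
```

Because every bitline runs the same operation at once, these counts are per vector operation rather than per element.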
Transpose
Each LLC slice (Way 1 to Way 20) has a CBOX and a TMU
[Figure: 8-T transpose bit-cell array holding words A0, A1, A2, ... and B0, B1, B2, ..., each stored MSB to LSB, with sense amplifiers (SA) and write drivers (DR) on two sides, a column decoder, a row decoder, and control logic]
Regular read/write accesses the array in one direction; transpose read/write accesses it in the other
Transpose
TMU: the rows (A2 A1 A0), (B2 B1 B0), (C2 C1 C0) become columns, so that bits A0, A1, A2 of word A lie along one bitline, and likewise for B and C
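The data reorganization performed by the TMU can be illustrated in software. This is a minimal sketch with hypothetical function names; it shows only the layout change, not the 8-T bit-cell mechanism.

```python
def transpose_words(words, n_bits):
    """Row-major words -> bit-slices. Slice k holds bit k of every word (LSB slice first)."""
    return [[(w >> k) & 1 for w in words] for k in range(n_bits)]


def untranspose(slices):
    """Bit-slices -> row-major words (inverse of transpose_words)."""
    n_words = len(slices[0])
    return [
        sum(slices[k][i] << k for k in range(len(slices)))
        for i in range(n_words)
    ]


words = [5, 3, 6]                      # A = 101, B = 011, C = 110
slices = transpose_words(words, 3)
print(slices)
print(untranspose(slices))
```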
A Convolutional Layer
Input Activations (C channels): H x W x C
3D Filters (M): each filter has C channels; each channel has R x S weights
Output Activations (M channels): E x F x M
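For reference, the layer dimensions above correspond to the following loop nest. This is a plain-Python sketch that assumes stride 1 and no padding (so E = H - R + 1 and F = W - S + 1); neither assumption is stated on the slide.

```python
def conv_layer(inputs, filters):
    """inputs[c][h][w] and filters[m][c][r][s] -> outputs[m][e][f]."""
    C, H, W = len(inputs), len(inputs[0]), len(inputs[0][0])
    M, R, S = len(filters), len(filters[0][0]), len(filters[0][0][0])
    E, F = H - R + 1, W - S + 1          # assumed: stride 1, no padding
    out = [[[0] * F for _ in range(E)] for _ in range(M)]
    for m in range(M):                   # one output channel per filter
        for e in range(E):
            for f in range(F):
                acc = 0
                for c in range(C):       # reduction over input channels
                    for r in range(R):
                        for s in range(S):
                            acc += inputs[c][e + r][f + s] * filters[m][c][r][s]
                out[m][e][f] = acc
    return out


ones_input = [[[1] * 3 for _ in range(3)] for _ in range(2)]      # C=2, H=W=3
ones_filter = [[[[1] * 2 for _ in range(2)] for _ in range(2)]]   # M=1, C=2, R=S=2
print(conv_layer(ones_input, ones_filter))
```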
Mapping CNN to Neural Cache
Filter weights (M filters, each R x S x C) and input activations (H x W x C) are unrolled: each of the C channels computes R x S MACs into a partial sum, and a reduction (∑) over the C partial sums gives 1 output activation of the E x F x M output
8 kB SRAM array (256 wordlines x 256 bitlines), rows allocated along each bitline:
• Input Activation: R x S x 8
• Weights: R x S x 8
• Partial Sum: 4 x 8
• Output: 4 x 8
The C channels are spread across the bitlines
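The MAC-then-reduce mapping can be sketched as follows. In this illustration each list entry stands for one bitline holding one channel; the pairwise halving used for the reduction is an assumption about how partial sums might be combined, and `output_activation` is a hypothetical name.

```python
def output_activation(window, weights):
    """window[c] and weights[c] are flat lists of R*S values for channel c (one bitline each)."""
    # Step 1 (MAC): every bitline accumulates its own R*S products.
    partial_sums = [
        sum(x * w for x, w in zip(window[c], weights[c]))
        for c in range(len(window))
    ]
    # Step 2 (Reduction): combine partial sums across bitlines in halving steps.
    sums = list(partial_sums)
    steps = 0
    while len(sums) > 1:
        if len(sums) % 2:
            sums.append(0)               # pad so the list splits evenly
        half = len(sums) // 2
        sums = [sums[i] + sums[i + half] for i in range(half)]
        steps += 1
    return sums[0], partial_sums, steps


window = [[1, 2, 3, 4]] * 4              # C = 4 channels, R x S = 2 x 2
weights = [[1, 1, 1, 1]] * 4
print(output_activation(window, weights))
```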
Mapping CNN to Neural Cache
2.5 MB LLC slice: Way 1, Way 2, Way 3, ..., Way 20, each divided into Quad 1 to Quad 4
8 kB SRAM array (256 wordlines x 256 bitlines): the bitlines hold channel 1 to channel 256 of Filter 1 (C = 256); each bitline stores Input Activation (R x S x 8), Weights (R x S x 8), and Partial Sum (4 x 8)
M = 32; Output Position 1, Output Position 2, ... of the E x F x M output are mapped across the ways
Mapping of Convolution to Array
[Figure: the E x F x M output distributed over Slice 1 to Slice 14; each slice is shown as Way 1-18 and Way 19-20]
Put it together
[Figure: Core 1 to Core 14 and LLC Slice 1 to LLC Slice 14 on a ring interconnect, connected to DRAM; each 2.5 MB LLC slice shows Way 1, Way 2, Way 3, ... (Quad 1 to Quad 4) and Way 19 (Reserved)]
1. Filter Loading (filter weights)
2. Input Loading (input activations)
3. MAC + Reduction
4. Output Transfer (output activations)
Evaluation Methodology

|                                    | CPU (2 sockets)                                     | GPU (1 card)                             | Neural Cache                                    |
|------------------------------------|-----------------------------------------------------|------------------------------------------|-------------------------------------------------|
| Processor                          | Intel Xeon E5-2597 v3, 2.6 GHz, 28 cores, 56 threads | Nvidia Titan Xp, 1.6 GHz, 3840 CUDA cores | 2.5 GHz ComputeSRAM, 1,032,192 bit-serial ALUs  |
| On-chip memory                     | 78.96 MB                                            | 9.14 MB                                  | 70 MB (Dual Socket)                             |
| Off-chip memory                    | 64 GB DRAM                                          | 12 GB DRAM                               | 64 GB DRAM                                      |
| Profiler / Simulator (Performance) | TensorFlow tfprof                                   | TensorFlow tfprof                        | Cycle accurate simulator + C Microbench         |
| Profiler / Simulator (Energy)      | Intel RAPL Interface                                | NVIDIA System Management Interface       | SPICE simulation + Intel RAPL Interface         |

DNN Models
- Inception V3
- 8-bit weights and inputs
Throughput
[Figure: throughput (inferences / sec) versus batch size (1, 4, 16, 64, 256) for CPU - Xeon E5, GPU - Titan Xp, and Neural Cache]
2.2x Improved throughput over GPU
Latency
[Figure: latency (ms) for CPU, GPU, and Neural Cache]
7.7x Latency improvement over GPU
Power/Energy Comparison
[Figure: total energy (Joules) and average power (Watts) for CPU, GPU, and Neural Cache]
Neural Cache Summary
Repurpose Cache to Data Parallel DNN Accelerator
Massively Parallel Bit-Serial In-SRAM Arithmetic
Data Layout for CNNs
12x 20x .. over server class CPU at 2% area overhead
2x 16x .. over server class GPU