18-447 computer architecture lecture 30: in-memory processingece447/s15/lib/exe/... · 18-447...
TRANSCRIPT
![Page 1: 18-447 Computer Architecture Lecture 30: In-memory Processingece447/s15/lib/exe/... · 18-447 Computer Architecture Lecture 30: In-memory Processing Vivek Seshadri Carnegie Mellon](https://reader033.vdocuments.net/reader033/viewer/2022052722/5f0cb2f07e708231d436b222/html5/thumbnails/1.jpg)
18-447 Computer Architecture
Lecture 30: In-memory Processing
Vivek Seshadri Carnegie Mellon University Spring 2015, 4/13/2015
![Page 2: 18-447 Computer Architecture Lecture 30: In-memory Processingece447/s15/lib/exe/... · 18-447 Computer Architecture Lecture 30: In-memory Processing Vivek Seshadri Carnegie Mellon](https://reader033.vdocuments.net/reader033/viewer/2022052722/5f0cb2f07e708231d436b222/html5/thumbnails/2.jpg)
Goals for This Lecture
• Understand DRAM technology – How it is built? – How it operates? – What are the trade-‐offs?
• Can we use DRAM for more than just storage? – In-‐DRAM copying – In-‐DRAM bitwise operaCons
2
![Page 3: 18-447 Computer Architecture Lecture 30: In-memory Processingece447/s15/lib/exe/... · 18-447 Computer Architecture Lecture 30: In-memory Processing Vivek Seshadri Carnegie Mellon](https://reader033.vdocuments.net/reader033/viewer/2022052722/5f0cb2f07e708231d436b222/html5/thumbnails/3.jpg)
DRAM Module and Chip
3
![Page 4: 18-447 Computer Architecture Lecture 30: In-memory Processingece447/s15/lib/exe/... · 18-447 Computer Architecture Lecture 30: In-memory Processing Vivek Seshadri Carnegie Mellon](https://reader033.vdocuments.net/reader033/viewer/2022052722/5f0cb2f07e708231d436b222/html5/thumbnails/4.jpg)
Goals of DRAM Design
• Cost • Latency • Bandwidth • Parallelism • Power • Energy • Reliability
4
![Page 5: 18-447 Computer Architecture Lecture 30: In-memory Processingece447/s15/lib/exe/... · 18-447 Computer Architecture Lecture 30: In-memory Processing Vivek Seshadri Carnegie Mellon](https://reader033.vdocuments.net/reader033/viewer/2022052722/5f0cb2f07e708231d436b222/html5/thumbnails/5.jpg)
DRAM Chip
5
Bank
![Page 6: 18-447 Computer Architecture Lecture 30: In-memory Processingece447/s15/lib/exe/... · 18-447 Computer Architecture Lecture 30: In-memory Processing Vivek Seshadri Carnegie Mellon](https://reader033.vdocuments.net/reader033/viewer/2022052722/5f0cb2f07e708231d436b222/html5/thumbnails/6.jpg)
DRAM Cell – Capacitor
6
Empty State Fully Charged State
Logical “0” Logical “1”
1
2
Small – Cannot drive circuits
Reading destroys the state
![Page 7: 18-447 Computer Architecture Lecture 30: In-memory Processingece447/s15/lib/exe/... · 18-447 Computer Architecture Lecture 30: In-memory Processing Vivek Seshadri Carnegie Mellon](https://reader033.vdocuments.net/reader033/viewer/2022052722/5f0cb2f07e708231d436b222/html5/thumbnails/7.jpg)
Sense Amplifier
7
enable
top
bo*om
Inverter
![Page 8: 18-447 Computer Architecture Lecture 30: In-memory Processingece447/s15/lib/exe/... · 18-447 Computer Architecture Lecture 30: In-memory Processing Vivek Seshadri Carnegie Mellon](https://reader033.vdocuments.net/reader033/viewer/2022052722/5f0cb2f07e708231d436b222/html5/thumbnails/8.jpg)
Sense Amplifier – Two Stable States
8
en en
0
0 VDD
VDD
Logical “1” Logical “0”
![Page 9: 18-447 Computer Architecture Lecture 30: In-memory Processingece447/s15/lib/exe/... · 18-447 Computer Architecture Lecture 30: In-memory Processing Vivek Seshadri Carnegie Mellon](https://reader033.vdocuments.net/reader033/viewer/2022052722/5f0cb2f07e708231d436b222/html5/thumbnails/9.jpg)
Sense Amplifier OperaMon
9
dis
VT
VB
VT > VB en
0
VDD
![Page 10: 18-447 Computer Architecture Lecture 30: In-memory Processingece447/s15/lib/exe/... · 18-447 Computer Architecture Lecture 30: In-memory Processing Vivek Seshadri Carnegie Mellon](https://reader033.vdocuments.net/reader033/viewer/2022052722/5f0cb2f07e708231d436b222/html5/thumbnails/10.jpg)
Capacitor to Sense Amplifier
10
en
0
VDD
en
VDD
0 ?
![Page 11: 18-447 Computer Architecture Lecture 30: In-memory Processingece447/s15/lib/exe/... · 18-447 Computer Architecture Lecture 30: In-memory Processing Vivek Seshadri Carnegie Mellon](https://reader033.vdocuments.net/reader033/viewer/2022052722/5f0cb2f07e708231d436b222/html5/thumbnails/11.jpg)
DRAM Cell OperaMon
11
½VDD
½VDD
dis en
0
VDD ½VDD+δ Cell loses charge
Cell regains charge
![Page 12: 18-447 Computer Architecture Lecture 30: In-memory Processingece447/s15/lib/exe/... · 18-447 Computer Architecture Lecture 30: In-memory Processing Vivek Seshadri Carnegie Mellon](https://reader033.vdocuments.net/reader033/viewer/2022052722/5f0cb2f07e708231d436b222/html5/thumbnails/12.jpg)
AmorMzing Cost – DRAM Tile
12
Row Driv
er
![Page 13: 18-447 Computer Architecture Lecture 30: In-memory Processingece447/s15/lib/exe/... · 18-447 Computer Architecture Lecture 30: In-memory Processing Vivek Seshadri Carnegie Mellon](https://reader033.vdocuments.net/reader033/viewer/2022052722/5f0cb2f07e708231d436b222/html5/thumbnails/13.jpg)
DRAM Subarray
13
Row Driv
er
Tile Tile Tile
Row Decod
er
![Page 14: 18-447 Computer Architecture Lecture 30: In-memory Processingece447/s15/lib/exe/... · 18-447 Computer Architecture Lecture 30: In-memory Processing Vivek Seshadri Carnegie Mellon](https://reader033.vdocuments.net/reader033/viewer/2022052722/5f0cb2f07e708231d436b222/html5/thumbnails/14.jpg)
DRAM Subarray
14
Row Driv
er
Tile Tile Tile
Row Decod
er
Tile Tile Tile Tile
![Page 15: 18-447 Computer Architecture Lecture 30: In-memory Processingece447/s15/lib/exe/... · 18-447 Computer Architecture Lecture 30: In-memory Processing Vivek Seshadri Carnegie Mellon](https://reader033.vdocuments.net/reader033/viewer/2022052722/5f0cb2f07e708231d436b222/html5/thumbnails/15.jpg)
DRAM Bank
15
Row Decod
er
Array of Sense Amplifiers (8Kb)
Cell Array
Cell Array
Row Decod
er
Array of Sense Amplifiers
Cell Array
Cell Array
Bank I/O (64b)
Address
Address Data
![Page 16: 18-447 Computer Architecture Lecture 30: In-memory Processingece447/s15/lib/exe/... · 18-447 Computer Architecture Lecture 30: In-memory Processing Vivek Seshadri Carnegie Mellon](https://reader033.vdocuments.net/reader033/viewer/2022052722/5f0cb2f07e708231d436b222/html5/thumbnails/16.jpg)
DRAM Chip
16
Shared internal bus
Memory channel -‐ 8bits
Row Decoder
Array of Sense Amplifiers (8Kb)
Cell Array
Cell Array
Row Decoder
Array of Sense Amplifiers
Cell Array
Cell Array
Bank I/O (64b)
Row Decoder
Array of Sense Amplifiers (8Kb)
Cell Array
Cell Array
Row Decoder
Array of Sense Amplifiers
Cell Array
Cell Array
Bank I/O (64b)
Row Decoder
Array of Sense Amplifiers (8Kb)
Cell Array
Cell Array
Row Decoder
Array of Sense Amplifiers
Cell Array
Cell Array
Bank I/O (64b)
Row Decoder
Array of Sense Amplifiers (8Kb)
Cell Array
Cell Array
Row Decoder
Array of Sense Amplifiers
Cell Array
Cell Array
Bank I/O (64b)
Row Decoder
Array of Sen
se Amplifiers (8K
b)
Cell Array
Cell Array
Row Decoder
Array of Sen
se Amplifiers
Cell Array
Cell Array
Bank
I/O (6
4b)
Row Decoder
Array of Sen
se Amplifiers (8K
b)
Cell Array
Cell Array
Row Decoder
Array of Sen
se Amplifiers
Cell Array
Cell Array
Bank
I/O (6
4b)
Row Decoder
Array of Sen
se Amplifiers (8K
b)
Cell Array
Cell Array
Row Decoder
Array of Sen
se Amplifiers
Cell Array
Cell Array
Bank
I/O (6
4b)
Row Decoder
Array of Sen
se Amplifiers (8K
b)
Cell Array
Cell Array
Row Decoder
Array of Sen
se Amplifiers
Cell Array
Cell Array
Bank
I/O (6
4b)
![Page 17: 18-447 Computer Architecture Lecture 30: In-memory Processingece447/s15/lib/exe/... · 18-447 Computer Architecture Lecture 30: In-memory Processing Vivek Seshadri Carnegie Mellon](https://reader033.vdocuments.net/reader033/viewer/2022052722/5f0cb2f07e708231d436b222/html5/thumbnails/17.jpg)
DRAM OperaMon
17
Row Decod
er
Row Decod
er
Array of Sense Amplifiers
Cell Array
Cell Array
Bank I/O Data
1
2
ACTIVATE Row
READ/WRITE Column
3 PRECHARGE
Row Add
ress
Column Address
![Page 18: 18-447 Computer Architecture Lecture 30: In-memory Processingece447/s15/lib/exe/... · 18-447 Computer Architecture Lecture 30: In-memory Processing Vivek Seshadri Carnegie Mellon](https://reader033.vdocuments.net/reader033/viewer/2022052722/5f0cb2f07e708231d436b222/html5/thumbnails/18.jpg)
Goals for This Lecture
• Understand DRAM technology – How it is built? – How it operates? – What are the trade-‐offs?
• Can we use DRAM for more than just storage? – In-‐DRAM copying – In-‐DRAM bitwise operaCons
18
![Page 19: 18-447 Computer Architecture Lecture 30: In-memory Processingece447/s15/lib/exe/... · 18-447 Computer Architecture Lecture 30: In-memory Processing Vivek Seshadri Carnegie Mellon](https://reader033.vdocuments.net/reader033/viewer/2022052722/5f0cb2f07e708231d436b222/html5/thumbnails/19.jpg)
Trade-‐offs in DRAM Design
• Cost • Latency • Bandwidth • Parallelism • Power • Energy • Reliability
19
— Rows/Subarray — Data width, Chips/DIMM — Banks/Chip
![Page 20: 18-447 Computer Architecture Lecture 30: In-memory Processingece447/s15/lib/exe/... · 18-447 Computer Architecture Lecture 30: In-memory Processing Vivek Seshadri Carnegie Mellon](https://reader033.vdocuments.net/reader033/viewer/2022052722/5f0cb2f07e708231d436b222/html5/thumbnails/20.jpg)
Goals for This Lecture
ü Understand DRAM technology – How it is built? – How it operates? – What are the trade-‐offs?
• Can we use DRAM for more than just storage? – In-‐DRAM copying – In-‐DRAM bitwise operaCons
20
![Page 21: 18-447 Computer Architecture Lecture 30: In-memory Processingece447/s15/lib/exe/... · 18-447 Computer Architecture Lecture 30: In-memory Processing Vivek Seshadri Carnegie Mellon](https://reader033.vdocuments.net/reader033/viewer/2022052722/5f0cb2f07e708231d436b222/html5/thumbnails/21.jpg)
RowClone Fast and Energy-‐Efficient In-‐DRAM Bulk Data Copy and IniMalizaMon
Y. Kim, C. Fallin, D. Lee, R. Ausavarungnirun,
G. Pekhimenko, Y. Luo, O. Mutlu, P. B. Gibbons, M. A. Kozuch, T. C. Mowry
Vivek Seshadri
![Page 22: 18-447 Computer Architecture Lecture 30: In-memory Processingece447/s15/lib/exe/... · 18-447 Computer Architecture Lecture 30: In-memory Processing Vivek Seshadri Carnegie Mellon](https://reader033.vdocuments.net/reader033/viewer/2022052722/5f0cb2f07e708231d436b222/html5/thumbnails/22.jpg)
Memory Channel – Bo\leneck
Core
Core
Cache
MC
Mem
ory
Channel
Limited Bandwidth
High Energy
![Page 23: 18-447 Computer Architecture Lecture 30: In-memory Processingece447/s15/lib/exe/... · 18-447 Computer Architecture Lecture 30: In-memory Processing Vivek Seshadri Carnegie Mellon](https://reader033.vdocuments.net/reader033/viewer/2022052722/5f0cb2f07e708231d436b222/html5/thumbnails/23.jpg)
Goal: Reduce Memory Bandwidth Demand
Core
Core
Cache
MC
Mem
ory
Channel
Reduce unnecessary data movement
![Page 24: 18-447 Computer Architecture Lecture 30: In-memory Processingece447/s15/lib/exe/... · 18-447 Computer Architecture Lecture 30: In-memory Processing Vivek Seshadri Carnegie Mellon](https://reader033.vdocuments.net/reader033/viewer/2022052722/5f0cb2f07e708231d436b222/html5/thumbnails/24.jpg)
Bulk Data Copy and IniMalizaMon
Bulk Data Copy
Bulk Data IniMalizaMon
src dst
dst val
![Page 25: 18-447 Computer Architecture Lecture 30: In-memory Processingece447/s15/lib/exe/... · 18-447 Computer Architecture Lecture 30: In-memory Processing Vivek Seshadri Carnegie Mellon](https://reader033.vdocuments.net/reader033/viewer/2022052722/5f0cb2f07e708231d436b222/html5/thumbnails/25.jpg)
Bulk Data Copy and IniMalizaMon
Bulk Data Copy
Bulk Data IniMalizaMon
src dst
dst val
![Page 26: 18-447 Computer Architecture Lecture 30: In-memory Processingece447/s15/lib/exe/... · 18-447 Computer Architecture Lecture 30: In-memory Processing Vivek Seshadri Carnegie Mellon](https://reader033.vdocuments.net/reader033/viewer/2022052722/5f0cb2f07e708231d436b222/html5/thumbnails/26.jpg)
Bulk Copy and IniMalizaMon – ApplicaMons
Forking
000000000000000
Zero iniMalizaMon (e.g., security)
VM Cloning DeduplicaMon
CheckpoinMng
Page MigraMon
Many more
![Page 27: 18-447 Computer Architecture Lecture 30: In-memory Processingece447/s15/lib/exe/... · 18-447 Computer Architecture Lecture 30: In-memory Processing Vivek Seshadri Carnegie Mellon](https://reader033.vdocuments.net/reader033/viewer/2022052722/5f0cb2f07e708231d436b222/html5/thumbnails/27.jpg)
Shortcomings of ExisMng Approach
Core
Core
Cache
MC Channel src
dst
High latency (1046ns to copy 4KB)
Interference
High Energy (3600nJ to copy 4KB)
![Page 28: 18-447 Computer Architecture Lecture 30: In-memory Processingece447/s15/lib/exe/... · 18-447 Computer Architecture Lecture 30: In-memory Processing Vivek Seshadri Carnegie Mellon](https://reader033.vdocuments.net/reader033/viewer/2022052722/5f0cb2f07e708231d436b222/html5/thumbnails/28.jpg)
Our Approach: In-‐DRAM Copy with Low Cost
Core
Core
Cache
MC Channel dst
High latency
Interference
High Energy
src
X
X
X
?
![Page 29: 18-447 Computer Architecture Lecture 30: In-memory Processingece447/s15/lib/exe/... · 18-447 Computer Architecture Lecture 30: In-memory Processing Vivek Seshadri Carnegie Mellon](https://reader033.vdocuments.net/reader033/viewer/2022052722/5f0cb2f07e708231d436b222/html5/thumbnails/29.jpg)
RowClone: In-‐DRAM Copy
29
![Page 30: 18-447 Computer Architecture Lecture 30: In-memory Processingece447/s15/lib/exe/... · 18-447 Computer Architecture Lecture 30: In-memory Processing Vivek Seshadri Carnegie Mellon](https://reader033.vdocuments.net/reader033/viewer/2022052722/5f0cb2f07e708231d436b222/html5/thumbnails/30.jpg)
Bulk Copy in DRAM – RowClone
30
½VDD
½VDD
0 1
0
VDD ½VDD +δ
Data gets copied
![Page 31: 18-447 Computer Architecture Lecture 30: In-memory Processingece447/s15/lib/exe/... · 18-447 Computer Architecture Lecture 30: In-memory Processing Vivek Seshadri Carnegie Mellon](https://reader033.vdocuments.net/reader033/viewer/2022052722/5f0cb2f07e708231d436b222/html5/thumbnails/31.jpg)
Fast Parallel Mode – Benefits
31
Latency Energy
Bulk Data Copy (4KB across a module)
1046ns to 90ns 3600nJ to 40nJ
No bandwidth consumpMon
Very li\le changes to the DRAM chip
11X 74X
![Page 32: 18-447 Computer Architecture Lecture 30: In-memory Processingece447/s15/lib/exe/... · 18-447 Computer Architecture Lecture 30: In-memory Processing Vivek Seshadri Carnegie Mellon](https://reader033.vdocuments.net/reader033/viewer/2022052722/5f0cb2f07e708231d436b222/html5/thumbnails/32.jpg)
Fast Parallel Mode – Constraints
• LocaCon constraint – Source and desCnaCon in same subarray
• Size constraint – EnCre row gets copied (no parCal copy)
32
1
2
Can sCll accelerate many exisCng primiCves (copy-‐on-‐write, bulk zeroing)
Alternate mechanism to copy data across banks (pipelined serial mode – lower benefits than Fast Parallel)
![Page 33: 18-447 Computer Architecture Lecture 30: In-memory Processingece447/s15/lib/exe/... · 18-447 Computer Architecture Lecture 30: In-memory Processing Vivek Seshadri Carnegie Mellon](https://reader033.vdocuments.net/reader033/viewer/2022052722/5f0cb2f07e708231d436b222/html5/thumbnails/33.jpg)
End-‐to-‐end System Design
• Soeware interface – memcpy and meminit instrucCons
• Managing cache coherence – Use exisCng DMA support!
• Maximizing use of Fast Parallel Mode – Smart OS page allocaCon
33
![Page 34: 18-447 Computer Architecture Lecture 30: In-memory Processingece447/s15/lib/exe/... · 18-447 Computer Architecture Lecture 30: In-memory Processing Vivek Seshadri Carnegie Mellon](https://reader033.vdocuments.net/reader033/viewer/2022052722/5f0cb2f07e708231d436b222/html5/thumbnails/34.jpg)
ApplicaMons Summary
34
0
0.2
0.4
0.6
0.8
1
bootup compile forkbench mcached mysql shell
Frac
tion
of M
emor
y Tr
affic
Zero Copy Write Read
![Page 35: 18-447 Computer Architecture Lecture 30: In-memory Processingece447/s15/lib/exe/... · 18-447 Computer Architecture Lecture 30: In-memory Processing Vivek Seshadri Carnegie Mellon](https://reader033.vdocuments.net/reader033/viewer/2022052722/5f0cb2f07e708231d436b222/html5/thumbnails/35.jpg)
Results Summary
35
0%
10%
20%
30%
40%
50%
60%
70%
bootup compile forkbench mcached mysql shell
Com
pare
d to
Bas
elin
e
IPC Improvement Memory Energy Reduction
![Page 36: 18-447 Computer Architecture Lecture 30: In-memory Processingece447/s15/lib/exe/... · 18-447 Computer Architecture Lecture 30: In-memory Processing Vivek Seshadri Carnegie Mellon](https://reader033.vdocuments.net/reader033/viewer/2022052722/5f0cb2f07e708231d436b222/html5/thumbnails/36.jpg)
Goals for This Lecture
ü Understand DRAM technology – How it is built? – How it operates? – What are the trade-‐offs?
• Can we use DRAM for more than just storage? – In-‐DRAM copying – In-‐DRAM bitwise operaCons
36
![Page 37: 18-447 Computer Architecture Lecture 30: In-memory Processingece447/s15/lib/exe/... · 18-447 Computer Architecture Lecture 30: In-memory Processing Vivek Seshadri Carnegie Mellon](https://reader033.vdocuments.net/reader033/viewer/2022052722/5f0cb2f07e708231d436b222/html5/thumbnails/37.jpg)
Triple Row AcMvaMon
37
½VDD
½VDD
dis
A
B
C
Final State AB + BC + AC
½VDD+δ
C(A + B) + ~C(AB) en
0
VDD
![Page 38: 18-447 Computer Architecture Lecture 30: In-memory Processingece447/s15/lib/exe/... · 18-447 Computer Architecture Lecture 30: In-memory Processing Vivek Seshadri Carnegie Mellon](https://reader033.vdocuments.net/reader033/viewer/2022052722/5f0cb2f07e708231d436b222/html5/thumbnails/38.jpg)
In-‐DRAM Bitwise AND/OR
Required OperaCon: Perform a bitwise AND of two rows A and B and store the result in C • R0 – reserved zero row, R1 – reserved one row • D1, D2, D3 – Designated rows for triple acCvaCon
1. RowClone A into D1 2. RowClone B into D2 3. RowClone R0 into D3 4. ACTIVATE D1,D2,D3 5. RowClone Result into C
38
![Page 39: 18-447 Computer Architecture Lecture 30: In-memory Processingece447/s15/lib/exe/... · 18-447 Computer Architecture Lecture 30: In-memory Processing Vivek Seshadri Carnegie Mellon](https://reader033.vdocuments.net/reader033/viewer/2022052722/5f0cb2f07e708231d436b222/html5/thumbnails/39.jpg)
Throughput Results
39
0 10 20 30 40 50 60 70 80 8K
B
16KB
32KB
64KB
128K
B
256K
B
512K
B
1MB
2MB
4MB
8MB
16MB
32MB AN
D/OR Th
roughp
ut (G
B/s)
Size of vectors involved in AND/OR
Intel-‐AVX (one core) Our Proposal (Aggressive) (one bank)
L1
L2 L3 Memory
![Page 40: 18-447 Computer Architecture Lecture 30: In-memory Processingece447/s15/lib/exe/... · 18-447 Computer Architecture Lecture 30: In-memory Processing Vivek Seshadri Carnegie Mellon](https://reader033.vdocuments.net/reader033/viewer/2022052722/5f0cb2f07e708231d436b222/html5/thumbnails/40.jpg)
Bitmap Index • AlternaCve to B-‐tree and its variants • Efficient for performing range queries and joins
40
Bitm
ap 1
Bitm
ap 2
Bitm
ap 4
Bitm
ap 3
age < 18 18 < age < 25 25 < age < 60 age > 60
![Page 41: 18-447 Computer Architecture Lecture 30: In-memory Processingece447/s15/lib/exe/... · 18-447 Computer Architecture Lecture 30: In-memory Processing Vivek Seshadri Carnegie Mellon](https://reader033.vdocuments.net/reader033/viewer/2022052722/5f0cb2f07e708231d436b222/html5/thumbnails/41.jpg)
Performance EvaluaMon
41
0.0 0.2 0.4 0.6 0.8 1.0 1.2 1.4
3 9 20 45 98 118 128
Performan
ce RelaM
ve to
Ba
selin
e
Number of OR bins
ConservaMve (1 Bank) Aggressive (1 Bank)
ConservaMve (4 Banks) Aggressive (4 Banks)
![Page 42: 18-447 Computer Architecture Lecture 30: In-memory Processingece447/s15/lib/exe/... · 18-447 Computer Architecture Lecture 30: In-memory Processing Vivek Seshadri Carnegie Mellon](https://reader033.vdocuments.net/reader033/viewer/2022052722/5f0cb2f07e708231d436b222/html5/thumbnails/42.jpg)
Goals for This Lecture
ü Understand DRAM technology – How it is built? – How it operates? – What are the trade-‐offs?
ü Can we use DRAM for more than just storage? – In-‐DRAM copying – In-‐DRAM bitwise operaCons
42