Snatch: Opportunistically Reassigning Power Allocation between Processor
and Memory in 3D Stacks
Dimitrios Skarlatos, Renji Thomas, Aditya Agrawal, Shibin Qin, Robert Pilawa, Ulya Karpuzcu, Radu Teodorescu, Nam Sung Kim, and Josep Torrellas
UIUC, OSU, UMN, NVIDIA
Motivation: Cost of Power/Ground Pins in 3D stacks
• Size & cost of packages are proportional to # of pins
• 3D Stacks: Disjoint Power/Ground pins for Processor and Memory
• Each dimensioned for the worst case
[Figure: cross-section of a 3D stack. Labels: Processor VR, Memory VR, PCB substrate, BGA pins, C4 bumps, Processor die, TSVs, microbumps, DRAM0 die, DRAM1 die]
Motivation: Underutilization of Power Budget
• High Processor or Memory Power phases
Contribution: Snatch
• Dynamically and opportunistically divert power between processor and memory
• On-chip voltage regulator connects the two Power Delivery Networks
• Processor or Memory can consume more power for the same # of pins
[Figure: Conventional vs. Snatch. Conventional: Processor VR feeds the Proc PDN (Processor die); Memory VR feeds the Mem PDN (Mem0 and Mem1 dies, through TSVs). Snatch: same, plus an On-chip VR connecting the Proc PDN and the Mem PDN]
Impact Compared to Conventional 3D Stacks
• For same # of power/ground pins: application can consume more power, up to 23% application speedup
• For the same maximum power in Processor and Memory: fewer pins, about 30% package cost reduction
Snatch Outline
• Implementation
• Operation
• Case 1: Same Max Power in Processor and Memory, reduced # of pins
• Case 2: Same # of pins, improved performance
• Evaluation
Conventional Implementation
[Figure: cross-section. Off-chip Processor VR: 12V input, 0.8-0.95V output, 5.5W. Off-chip Memory VR: 12V input, 1.1V output, 4.5W. Labels: PCB substrate, BGA pins, C4 bumps, Processor die, TSVs, microbumps, DRAM0 die, DRAM1 die]
Snatch implementation
• Small 2W on-chip bidirectional VR on Proc die
• Bulk of the work is done by the off-chip VRs
[Figure: cross-section. Off-chip Processor VR: 12V input, 0.8-0.95V output, 5.5W. Off-chip Memory VR: 12V input, 1.1V output, 4.5W. A single on-chip VR is added on the Processor die. Labels: PCB substrate, BGA pins, C4 bumps, Processor die, TSVs, microbumps, DRAM0 die, DRAM1 die]
Snatch: Dynamic power reassignment
• Up/down-convert the snatched power between the processor (0.8-0.95V) and memory (1.1V) voltage domains
[Figure: cross-section with the single on-chip VR transferring up to 2W between the two domains]
Snatch: Cross-Section and Top Down
• Small 2W on-chip bidirectional VR on Proc die
[Figure: cross-section and top-down views. Top down: Processor VR (0.8-0.95V, 5.5W) feeds the processor PDN, Memory VR (1.1V, 4.5W) feeds the memory PDN, and the on-chip VR sits between the two PDNs]
Snatching Memory Power
• On processor intensive phase
• Snatch Memory Power → TurboBoost Processor
[Figure: Processor VR and Memory VR still deliver 5.5W and 4.5W. The on-chip VR moves 2W from the memory PDN to the processor PDN, so the processor consumes 7.5W and the memory 2.5W]
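Snatching Processor Power
• On memory intensive phase
• Snatch Processor Power → TurboBoost Memory
[Figure: Processor VR and Memory VR still deliver 5.5W and 4.5W. The on-chip VR moves 2W from the processor PDN to the memory PDN, so the processor consumes 3.5W and the memory 6.5W]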
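The two snatching directions above can be illustrated with a small Python sketch. It only reproduces the arithmetic shown on the slides (5.5W and 4.5W delivered through the pins, up to 2W moved through the on-chip VR). The names `Budget` and `snatch` are hypothetical, and conversion losses in the on-chip VR are ignored, which is an assumption and not something the slides state.

```python
from dataclasses import dataclass

PROC_PINS_W = 5.5   # power delivered through the processor pins (from the slides)
MEM_PINS_W = 4.5    # power delivered through the memory pins (from the slides)
ON_CHIP_VR_W = 2.0  # rating of the on-chip bidirectional VR (from the slides)


@dataclass
class Budget:
    """Power each side can consume after a transfer (hypothetical helper type)."""
    processor_w: float
    memory_w: float


def snatch(transfer_w: float) -> Budget:
    """Move transfer_w watts from the memory PDN to the processor PDN.

    Positive values snatch memory power for the processor; negative values
    snatch processor power for the memory. The transfer is clamped to the
    on-chip VR rating. Conversion losses are ignored (an assumption).
    """
    t = max(-ON_CHIP_VR_W, min(ON_CHIP_VR_W, transfer_w))
    return Budget(processor_w=PROC_PINS_W + t, memory_w=MEM_PINS_W - t)


if __name__ == "__main__":
    print(snatch(2.0))    # processor-intensive phase
    print(snatch(-2.0))   # memory-intensive phase
```

In this model the sum of the two budgets stays at 10W for any transfer, which matches the idea that the power entering through each set of pins is unchanged and only its destination moves.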
Snatching Decisions
• Processor or Memory Intensive Phase?
• How much Power is available?
• How much Power can we Snatch?
[Figure: Processor VR (5.5W) and Memory VR (4.5W) feeding their PDNs; the amount moved through the on-chip VR is the unknown (?W)]
Conservative Snatching Algorithm
• Keep track of power values over past 10µs epochs
• Average of past values for activity detection
• MAX of past values for power availability
• Avoid hysteresis
[Figure: Processor VR (5.5W) and Memory VR (4.5W) feeding their PDNs; the amount moved through the on-chip VR is the unknown (?W)]
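A minimal Python sketch of how the bullets above could fit together is given below. It is not the algorithm from the paper: the history length, the 90% utilization threshold, the hold-off counter, and the names `SnatchController` and `step` are all assumptions made for illustration. Reading "Avoid hysteresis" as "do not flip decisions every epoch" is also an interpretation. Only the 5.5W/4.5W budgets, the 2W limit, and the use of an average for activity and a MAX for availability come from the slides.

```python
from collections import deque
from statistics import mean

PROC_BUDGET_W = 5.5   # from the slides
MEM_BUDGET_W = 4.5    # from the slides
VR_LIMIT_W = 2.0      # on-chip VR rating, from the slides
HISTORY_EPOCHS = 8    # assumed number of past 10us epochs to remember
BUSY_FRACTION = 0.9   # assumed: "intensive" means avg use >= 90% of the cap
HOLD_EPOCHS = 4       # assumed hold-off before growing or reversing a transfer


class SnatchController:
    """Hypothetical per-epoch controller; positive transfer = memory -> processor."""

    def __init__(self) -> None:
        self.proc_util = deque(maxlen=HISTORY_EPOCHS)
        self.mem_util = deque(maxlen=HISTORY_EPOCHS)
        self.proc_pw = deque(maxlen=HISTORY_EPOCHS)
        self.mem_pw = deque(maxlen=HISTORY_EPOCHS)
        self.transfer_w = 0.0
        self.hold = 0

    def _target(self) -> float:
        # Average utilization over the history detects the intensive side.
        proc_busy = mean(self.proc_util) >= BUSY_FRACTION
        mem_busy = mean(self.mem_util) >= BUSY_FRACTION
        # MAX power over the history gives a conservative estimate of spare power.
        proc_spare = max(0.0, PROC_BUDGET_W - max(self.proc_pw))
        mem_spare = max(0.0, MEM_BUDGET_W - max(self.mem_pw))
        if proc_busy and not mem_busy:
            return min(VR_LIMIT_W, mem_spare)
        if mem_busy and not proc_busy:
            return -min(VR_LIMIT_W, proc_spare)
        return 0.0

    def step(self, proc_w: float, mem_w: float) -> float:
        """Record one epoch of measured power and return the transfer to apply."""
        self.proc_util.append(proc_w / (PROC_BUDGET_W + self.transfer_w))
        self.mem_util.append(mem_w / (MEM_BUDGET_W - self.transfer_w))
        self.proc_pw.append(proc_w)
        self.mem_pw.append(mem_w)
        target = self._target()
        giving_back = (abs(target) < abs(self.transfer_w)
                       and target * self.transfer_w >= 0)
        if giving_back:
            # Always return power to the donor immediately.
            self.transfer_w = target
            self.hold = HOLD_EPOCHS
        elif self.hold > 0:
            self.hold -= 1
        elif target != self.transfer_w:
            self.transfer_w = target
            self.hold = HOLD_EPOCHS
        return self.transfer_w
```

For example, `SnatchController().step(5.4, 2.0)` sees a busy processor and 2.5W of unused memory budget, so it requests the full 2W transfer. If the memory then runs at its reduced 2.5W cap for several epochs, its average utilization crosses the threshold and the controller hands the power back.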
Conventional Power Provisioning
• Processor provisioned for 7.5W
• Memory provisioned for 6.5W
• Total = Processor + Memory = 14W
[Figure: Processor VR delivers up to 7.5W to the processor PDN; Memory VR delivers up to 6.5W to the memory PDN]
Snatch: Provisioning 3D Stacks Just Right
• Processor provisioned for 7.5W - 2W = 5.5W
• Memory provisioned for 6.5W - 2W = 4.5W
• Total = Processor + Memory = 14W - 4W = 10W
Reduce Total Provisioning from 14W to 10W, approx. same performance
30% Reduction in Package Power/Ground Pins
[Figure: Processor VR provisioned for 5.5W instead of 7.5W and Memory VR for 4.5W instead of 6.5W. The on-chip VR snatches up to 2W in either direction, so the processor sees 5.5W ± 2W and the memory 4.5W ± 2W]
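The provisioning arithmetic on this slide can be checked with a short Python sketch. The function name `provision` and its dictionary keys are hypothetical. The sketch reproduces only the power numbers; how pin count maps to provisioned power is not stated in the slides, so the 30% pin figure is not derived here.

```python
SNATCH_W = 2.0  # on-chip VR rating, from the slides


def provision(peak_proc_w: float, peak_mem_w: float,
              snatch_w: float = SNATCH_W) -> dict:
    """Compare conventional worst-case provisioning with Snatch provisioning.

    Conventional: each side's pins are sized for that side's own peak.
    Snatch: each side's pins are sized for its peak minus what the on-chip
    VR can supply from the other side.
    """
    conventional_total = peak_proc_w + peak_mem_w
    proc = peak_proc_w - snatch_w
    mem = peak_mem_w - snatch_w
    snatch_total = proc + mem
    return {
        "conventional_total_w": conventional_total,
        "snatch_proc_w": proc,
        "snatch_mem_w": mem,
        "snatch_total_w": snatch_total,
        "power_reduction": (conventional_total - snatch_total) / conventional_total,
    }


if __name__ == "__main__":
    for key, value in provision(7.5, 6.5).items():
        print(f"{key}: {value:.3f}")
```

With the slide's peaks of 7.5W and 6.5W, the provisioned power drops from 14W to 10W, a reduction of 4/14, or roughly 28.6%. The trade-off is that both sides can no longer hit their peaks at the same time: a processor drawing 7.5W leaves at most 2.5W for the memory.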
Conventional Power Provisioning
• Processor & Memory provisioned for 5.5W & 4.5W
[Figure: Processor VR delivers up to 5.5W to the processor PDN; Memory VR delivers up to 4.5W to the memory PDN]
Snatch: Boost Performance with Same # of Pins
• Processor & Memory provisioned for 5.5W & 4.5W
• Snatch power and boost performance
• Same # of pins as conventional
Higher performance for the same package cost
IR-drop and EM characteristics remain the same
[Figure: Processor VR (5.5W) and Memory VR (4.5W) feeding their PDNs. The on-chip VR snatches up to 2W in either direction, so the processor sees 5.5W ± 2W and the memory 4.5W ± 2W]
Evaluation Methodology
• Case 2: Same # of pins, improved performance
• Processor: 22nm LP 8-core w/ SESC + McPAT
• Memory: 4GB 2-layer WideIO2 w/ DRAMSim2
• Benchmarks: SPLASH-2, NAS, and SPEC
Performance
[Figure: bar chart of speedup (y-axis, 0.4 to 1.3) for Baseline, Turbo Boost, and Snatch. Groups: Average(Splash + NAS), Average(SPEC)]
Baseline: P = {5.5W, 1.2GHz}, M = {4.5W, 400MHz}
Turbo Boost: P = {5.5W, 1.2-1.5GHz}, M = {4.5W, 400-900MHz}, DVFS within power budget
Snatch: P = {5.5W, 1.2-1.5GHz}, M = {4.5W, 400-900MHz}, DVFS and Snatch up to 2W
• Splash and NAS benchmarks: Snatch boosts performance, on average, by 25% against Baseline and 8% against Turbo Boost
• SPEC benchmarks: Snatch boosts performance, on average, by 10% against Baseline; negligible gains against Turbo Boost
Snatching Activity Overview
[Figure: bar chart of % of Total Time Snatching (y-axis, 0 to 90) per benchmark, split into M->P and P->M. Splash + NAS: Barnes, BT, CG, Cholesky, FFT, FMM, FT, IS, LU, LU(NAS), MG, Radiosity, Radix, Raytrace, SP, W-Nsquared, W-Spatial, Avg. SPEC: mcf, milc, lbm, bzip2, Avg]
Applications Snatch, on average, 30% of the time for Splash+NAS and 9.4% for SPEC
More in the Paper
• Design and Implementation: On-chip Voltage Regulator, Snatch Algorithm
• Additional Evaluation: Snatch Algorithm, Power Delivery Network, Pin Reliability, 3D Stack Thermals
Summary
• Snatch: An opportunistic power reassignment design for 3D Stacked architectures
• Small on-chip bidirectional VR
• Processor - Memory phase detection and power availability estimation
• Up to 23% application speedup
• Alternatively, about 30% package cost reduction
Image Source: http://www.hutui6.com/data/out/178/67609941-snatch-wallpapers.jpg