Department of Electrical and Computer Engineering
University of Wisconsin - Madison
Optimizing Total Power of Many-core Processors Considering Voltage Scaling Limit and Process Variations
Jungseob Lee and Nam Sung Kim
October 9, 2009
Outline
Introduction
Supply Voltage and Power Scaling
  Supply Voltage Scaling of Many-Core Processors
  Power Scaling of Many-Core Processors
Impacts of Within-Die (WID) Spatial Process Variations
  Global Clocking
  Frequency-Island Clocking
Conclusions
Parallel Processing
  More cores improve the throughput of computing systems.
  With all cores running, throughput is limited by power and thermal constraints.
Challenges: how do we
  determine the number of cores for the best performance-power efficiency?
  exploit process variations in multicore processors?
[Figure: serial processing vs. parallel processing [1]]
[Figure: multicore processors; a GPU, which has many cores [2]]
[1] Source: http://www.interactivesupercomputing.com/starpexpress/042007/3_Task_Parallel.html
[2] Source: NVIDIA
Types of Process Variations
Process variations
  Die-to-Die (D2D) variations
  Within-Die (WID) variations
[Figure: wafer-scale D2D and WID variations. Courtesy: K. Bowman, Intel]
[Figure: a systematic Vth variation map for a 16-core processor, and the corresponding normalized Fmax and Pleak map]
Core-to-core (C2C) frequency and leakage power variations due to spatially correlated WID variations become considerable.
Supply Voltage Scaling (1)
Supply voltage scaling of many-core processors
  Throughput with a certain number of cores at maximum VDD (thus at Fmax)
  = throughput with more cores at a lower VDD (thus at a lower Fmax).
  The potential throughput increase from more cores can be traded for a lower VDD, which reduces power.
[Figure: 4 cores operating at VDD vs. 8 cores operating at a voltage V lower than VDD, each shown with its operating frequency]
Supply Voltage Scaling (2)
Supply voltage scaling of many-core processors

$$M \cdot T_{cycle}(V_{DD}) = M \cdot \left((1-F) + \frac{F}{N}\right) \cdot T_{cycle}(V)$$

  $M$           Number of operations
  $T_{cycle}$   Cycle time of a processor at a given supply voltage
  $V_{DD}$      Nominal supply voltage of the base-core processor
  $F$           Fraction of operations parallelizable without overhead
  $N$           Relative number of cores
  $V$           Scaled supply voltage of the processor with N x more cores
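The equal-throughput condition above fixes the frequency that the scaled cores must sustain: dividing both sides by M and by Tcycle(V) gives Fmax(V)/Fmax(VDD) = (1−F) + F/N. The sketch below computes that factor and then solves for V by bisection. The frequency model is an assumption made purely for illustration: it uses the alpha-power law with hypothetical parameters (`vdd=1.0`, `vth=0.3`, `alpha=1.3`), whereas the slides obtain the frequency scaling from HSPICE simulations of PTM 32nm models. All function names are hypothetical.

```python
def required_freq_factor(F, N):
    """Frequency factor Fmax(V)/Fmax(VDD) that keeps throughput equal.

    Follows from M*Tcycle(VDD) = M*((1-F) + F/N)*Tcycle(V).
    """
    return (1.0 - F) + F / N


def alpha_power_freq_factor(v, vdd=1.0, vth=0.3, alpha=1.3):
    """ASSUMED frequency model (alpha-power law), normalized to 1 at vdd.

    vth and alpha are hypothetical values, not taken from the slides.
    """
    def fmax(x):
        return (x - vth) ** alpha / x
    return fmax(v) / fmax(vdd)


def solve_scaled_voltage(target, vdd=1.0, vth=0.3, alpha=1.3, tol=1e-9):
    """Bisection for V in (vth, vdd] such that the assumed model hits target.

    The assumed model increases monotonically with V above vth,
    so bisection is sufficient.
    """
    lo, hi = vth + 1e-6, vdd
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if alpha_power_freq_factor(mid, vdd, vth, alpha) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    f_req = required_freq_factor(F=0.8, N=4)  # (1 - 0.8) + 0.8 / 4
    v = solve_scaled_voltage(f_req)
    print(f"required frequency factor = {f_req:.2f}, scaled V = {v:.3f} V (assumed model)")
```

With a measured frequency curve in place of `alpha_power_freq_factor`, the same bisection would return the scaled voltage for any (F, N) pair.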
[Figure: scaled supply voltage for PTM 32nm HP and PTM 32nm LP; annotation: > 40% ↓. LP requires a higher VDD because of its high Vth.]
Dynamic Power Analysis (1)
Dynamic power scaling
  Dynamic power of a base many-core processor:

$$P_{dyn,base} = C_{eff} \cdot V_{DD}^{2} \cdot F_{max}(V_{DD})$$

  Dynamic power of a processor with N x more cores than the base processor:

$$P_{dyn,N} = \left((1-F) \cdot (1 + (N-1) \cdot K) + F \cdot N\right) \cdot C_{eff} \cdot V^{2} \cdot F_{max}(V) = k(F,K,N) \cdot f(V) \cdot \left(\frac{V}{V_{DD}}\right)^{2} \cdot P_{dyn,base}$$

  $P_{dyn,base}$   Dynamic power of the base processor
  $C_{eff}$        Effective total switching capacitance
  $V_{DD}$         Nominal voltage of the base core
  $F_{max}$        Maximum operating frequency of the base processor
  $P_{dyn,N}$      Dynamic power of the processor with N x more cores
  $K$              Fraction of dynamic power consumed by idle cores
  $k(F,K,N)$       $(1-F) \cdot (1 + (N-1) \cdot K) + F \cdot N$
  $f(V)$           Frequency scaling factor at V; $F_{max}(V) / F_{max}(V_{DD})$
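As a quick numeric check of the dynamic power relation, the following sketch evaluates k(F, K, N) and the ratio Pdyn,N / Pdyn,base. The function names and the example operating point (V/VDD = 0.8) are hypothetical; only the formulas come from the slide.

```python
def k_factor(F, K, N):
    """Activity factor k(F, K, N) = (1-F)*(1 + (N-1)*K) + F*N."""
    return (1.0 - F) * (1.0 + (N - 1) * K) + F * N


def dyn_power_ratio(F, K, N, f_v, v_ratio):
    """Pdyn,N / Pdyn,base = k(F,K,N) * f(V) * (V/VDD)^2.

    f_v     -- frequency scaling factor f(V) = Fmax(V)/Fmax(VDD)
    v_ratio -- V/VDD
    """
    return k_factor(F, K, N) * f_v * v_ratio ** 2


if __name__ == "__main__":
    # Illustrative numbers only: a fully parallel workload on 2x the cores,
    # each running at half frequency at a hypothetical V/VDD of 0.8.
    print(dyn_power_ratio(F=1.0, K=0.1, N=2, f_v=0.5, v_ratio=0.8))
```

For N = 1 at nominal voltage the ratio reduces to 1, which is a convenient sanity check on any implementation of k(F, K, N).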
Dynamic Power Analysis (2)
Dynamic power scaling
[Figure: normalized Pdyn vs. relative number of cores; panels: PTM 32nm HP, PTM 32nm LP. Dotted lines show the projected power consumption when there is no supply voltage limit.]

Optimal normalized Pdyn / relative number of cores

Model    VDD,min (V)   F=0.6    F=0.7    F=0.8    F=0.9    F=1.0
PTM HP   0.7           0.75/2   0.66/3   0.60/2   0.52/2   0.45/2
PTM HP   0.6           0.75/2   0.66/3   0.54/3   0.41/4   0.34/3
PTM HP   No limit      0.75/2   0.66/3   0.54/3   0.41/5   0.20/8
PTM LP   0.7           0.75/2   0.70/2   0.65/2   0.56/3   0.46/3
PTM LP   0.6           0.75/2   0.70/2   0.65/2   0.55/4   0.35/8
PTM LP   No limit      0.75/2   0.70/2   0.65/2   0.55/4   0.35/8

With VDD,min = 0.7 V, less VDD scaling is possible, so the Pdyn reduction is smaller.
Pdyn reduction (VDD,min = 0.7 V rows): HP 25~55%, LP 25~54%
Leakage Power Analysis (1)
Leakage power scaling
  In nanoscale technology, leakage power is a significant fraction of total power consumption.
  Leakage power of a base many-core processor:

$$P_{leak,base} = I_{leak}(V_{DD}) \cdot V_{DD}$$

  Leakage power of a processor with N x more cores than the base processor:

$$P_{leak,N} = N \cdot I_{leak}(V) \cdot V = N \cdot l(V) \cdot \frac{V}{V_{DD}} \cdot P_{leak,base}$$

  $P_{leak,base}$   Leakage power of the base processor
  $I_{leak}$        Total leakage current of the base processor
  $V_{DD}$          Nominal voltage of the base core
  $P_{leak,N}$      Leakage power of the processor with N x more cores
  $l(V)$            Leakage scaling factor at V; $I_{leak}(V) / I_{leak}(V_{DD})$
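The leakage relation is linear in N, so adding cores lowers leakage only if the per-core factor l(V)·(V/VDD) falls below 1/N. A minimal sketch of that check follows; the function names and sample values are hypothetical, and the leakage scaling factor is taken as an input rather than modeled.

```python
def leak_power_ratio(N, l_v, v_ratio):
    """Pleak,N / Pleak,base = N * l(V) * (V/VDD).

    l_v     -- leakage scaling factor l(V) = Ileak(V)/Ileak(VDD)
    v_ratio -- V/VDD
    """
    return N * l_v * v_ratio


def max_leak_factor_for_saving(N, v_ratio):
    """Largest l(V) for which N cores at V/VDD = v_ratio still leak no more
    than the base processor, i.e. N * l(V) * (V/VDD) <= 1."""
    return 1.0 / (N * v_ratio)


if __name__ == "__main__":
    # Illustrative operating point (not from the slides): 2x cores at 0.8*VDD.
    print(leak_power_ratio(N=2, l_v=0.5, v_ratio=0.8))
    print(max_leak_factor_for_saving(N=2, v_ratio=0.8))
```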
Leakage Power Analysis (2)
Leakage power scaling
[Figure: normalized Pleak vs. relative number of cores; panels: PTM 32nm HP, PTM 32nm LP]

Optimal normalized Pleak / relative number of cores

Model    VDD,min (V)   F=0.6    F=0.7    F=0.8    F=0.9    F=1.0
PTM HP   0.7           0.46/3   0.35/3   0.31/2   0.25/2   0.20/2
PTM HP   0.6           0.46/3   0.35/3   0.27/3   0.21/4   0.16/3
PTM HP   No limit      0.46/3   0.35/3   0.27/3   0.21/4   0.15/5
PTM LP   0.7           0.67/2   0.62/2   0.58/2   0.54/2   0.50/2
PTM LP   0.6           0.67/2   0.62/2   0.58/2   0.54/2   0.50/2
PTM LP   No limit      0.67/2   0.62/2   0.58/2   0.54/2   0.50/2

Pleak reduction (VDD,min = 0.7 V rows): HP 54~80%, LP 33~50%
The relative reduction for LP is smaller, but the absolute Pleak of LP is much lower than that of HP.
Total Power Analysis (1)
Total power scaling
  The total power of a base many-core processor is the sum of its dynamic and leakage power:

$$P_{tot,base} = P_{dyn,base} + P_{leak,base} = P_{dyn,base} \cdot (1 + LF)$$

  The total power of a processor with N x more cores than the base processor is likewise the sum of its dynamic and leakage power:

$$P_{tot,N} = P_{dyn,N} + P_{leak,N} = P_{dyn,base} \cdot \left\{ k(F,K,N) \cdot f(V) \cdot \left(\frac{V}{V_{DD}}\right)^{2} + N \cdot l(V) \cdot \frac{V}{V_{DD}} \cdot LF \right\}$$

  $P_{tot,base}$   Total power of the base processor
  $LF$             Ratio of leakage to dynamic power of the base processor; $P_{leak,base} / P_{dyn,base}$
  $P_{tot,N}$      Total power of the processor with N x more cores
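Dividing the two total power expressions gives the normalized quantity Ptot,N / Ptot,base = {k·f(V)·(V/VDD)² + N·l(V)·(V/VDD)·LF} / (1 + LF). The sketch below evaluates it and picks the core count with the lowest ratio from a set of candidate operating points. The candidate values are hypothetical placeholders; in the slides the scaling factors come from HSPICE simulations, and the function names are illustrative.

```python
def total_power_ratio(F, K, N, f_v, l_v, v_ratio, LF):
    """Ptot,N / Ptot,base, built from the dynamic and leakage relations."""
    k = (1.0 - F) * (1.0 + (N - 1) * K) + F * N
    dyn = k * f_v * v_ratio ** 2        # Pdyn,N  / Pdyn,base
    leak = N * l_v * v_ratio * LF       # Pleak,N / Pdyn,base
    return (dyn + leak) / (1.0 + LF)    # Ptot,base / Pdyn,base = 1 + LF


def best_core_count(F, K, LF, points):
    """points maps N -> (f_v, l_v, v_ratio); returns (N, ratio) of the minimum."""
    ratios = {n: total_power_ratio(F, K, n, *p, LF) for n, p in points.items()}
    n_best = min(ratios, key=ratios.get)
    return n_best, ratios[n_best]


if __name__ == "__main__":
    # HYPOTHETICAL operating points for a fully parallel workload (F = 1),
    # where the required f(V) equals 1/N; l(V) and V/VDD are made up.
    candidates = {
        1: (1.0, 1.0, 1.0),
        2: (0.5, 0.5, 0.8),
        4: (0.25, 0.3, 0.7),
    }
    print(best_core_count(F=1.0, K=0.0, LF=0.25, points=candidates))
```

Feeding in simulated f(V) and l(V) values, clamped at VDD,min, would reproduce the kind of "normalized power / relative number of cores" entries tabulated in the slides.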
Total Power Analysis (2)
Total power scaling
[Figure: normalized Ptot vs. relative number of cores; panels: PTM 32nm HP (LF 0.4/0.6), PTM 32nm LP (LF 0.2/0.8)]

Optimal normalized Ptot / relative number of cores

Model    LF        VDD,min (V)   F=0.6    F=0.7    F=0.8    F=0.9    F=1.0
PTM HP   0.4/0.6   0.7           0.64/2   0.53/3   0.48/2   0.41/2   0.35/2
PTM HP   0.4/0.6   0.6           0.64/2   0.53/3   0.43/3   0.33/4   0.27/3
PTM HP   0.4/0.6   No limit      0.64/2   0.53/3   0.43/3   0.33/5   0.18/8
PTM LP   0.2/0.8   0.7           0.74/2   0.69/2   0.63/2   0.57/3   0.48/3
PTM LP   0.2/0.8   0.6           0.74/2   0.69/2   0.63/2   0.57/3   0.46/5
PTM LP   0.2/0.8   No limit      0.74/2   0.69/2   0.63/2   0.57/3   0.46/5

Ptot reduction (VDD,min = 0.7 V rows): HP 36~65%, LP 26~52%
Scaling VDD further yields only 17% more Ptot reduction but requires more on-die memory area.
Impacts of WID Variations - GC
Global Clocking
  Limits the Fmax of a many-core processor to that of the slowest core.
  The previous Pdyn,N equation can still be used to estimate Pdyn,N.
  The estimation of Pleak,N has to account for each core's leakage variations as follows:

$$P_{leak,N} = \sum_{i=1}^{N} l_i(V) \cdot \frac{V}{V_{DD}} \cdot P_{leak,base}$$

  $l_i(V)$   Leakage scaling factor of the i-th core, normalized to $I_{leak}(V_{DD})$
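Reading the leakage estimate as a sum over the N active cores, global clocking can be sketched in a few lines: the clock follows the slowest active core, and leakage adds up the per-core factors. The per-core values below are hypothetical, and the function names are illustrative rather than taken from the slides.

```python
def gc_freq_factor(f_per_core):
    """Under global clocking, every active core runs at the slowest core's Fmax."""
    return min(f_per_core)


def gc_leak_power_ratio(l_per_core, v_ratio):
    """Pleak,N / Pleak,base = sum_i l_i(V) * (V/VDD) over the active cores."""
    return sum(l_per_core) * v_ratio


if __name__ == "__main__":
    # HYPOTHETICAL per-core factors for four active cores on one die.
    f_cores = [0.52, 0.48, 0.55, 0.50]
    l_cores = [0.50, 0.60, 0.40, 0.50]
    print(gc_freq_factor(f_cores))
    print(gc_leak_power_ratio(l_cores, v_ratio=0.8))
```

When every core has the same factor l(V), the sum collapses to N·l(V)·(V/VDD), which is the variation-free leakage expression.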
[Figure: a systematic Vth variation map for a 16-core processor, and the corresponding Fmax and Pleak map; x-axis: core ID; y-axis: normalized Fmax and Pleak]
Impacts of WID Variations - GC
Global Clocking
[Figure: panels: HP, slowest base core; HP, fastest base core]

Average total power of 100 die samples / relative number of cores (N)

Base core   VDD,min (V)   F=0.6    F=0.7    F=0.8    F=0.9    F=1.0
Slow        0.7           0.77/2   0.67/2   0.59/2   0.52/2   0.46/2
Slow        0.6           0.77/2   0.67/2   0.57/3   0.46/3   0.37/2
Slow        No limit      0.77/2   0.67/2   0.57/3   0.46/4   0.29/8
Fast        0.7           0.23/3   0.18/3   0.14/4   0.12/2   0.10/2
Fast        0.6           0.23/3   0.18/3   0.14/4   0.10/4   0.07/3
Fast        No limit      0.23/3   0.18/3   0.14/4   0.10/4   0.06/8

Total power reduction (VDD,min = 0.7 V rows): Slow 23~54%, Fast 77~90%
With the fastest core as the base, the relative total power reduction is much larger because the fastest base core is not power efficient.
Impacts of WID Variations - FI
Frequency-Island Clocking
  FI clocking is more performance- and power-efficient than GC because each core can run at its own fastest frequency.
  The previous GC Pleak,N equation can be used to estimate Pleak,N.
  The equation for supply voltage scaling has to be modified as follows, where $f_i$ is the frequency factor of the i-th active core and $j$ is the core that processes the sequential portion:

$$M \cdot T_{cycle,base}(V_{DD}) = M \cdot \left(\frac{1-F}{f_j} + \frac{F}{\sum_{i=1}^{N} f_i}\right) \cdot T_{cycle}(V)$$

  The estimation of Pdyn,N also has to account for an independent clock frequency per core:

$$P_{dyn,N} = \left((1-F) \cdot \left(f_j + \sum_{i=1,\, i \neq j}^{N} f_i \cdot K\right) + F \cdot \sum_{i=1}^{N} f_i\right) \cdot \left(\frac{V}{V_{DD}}\right)^{2} \cdot P_{dyn,base}$$

  Using the fastest of the chosen active cores for the fully sequential portion of the workload always offers the optimal total power.
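The FI dynamic power estimate can be sketched as follows, reading the per-core terms as sums over the active cores and assuming that f_i denotes core i's frequency factor normalized to the base Fmax(VDD). Following the slide's observation, the sequential portion is assigned to the fastest active core. Function names and sample values are hypothetical.

```python
def fi_time_factor(F, f_per_core):
    """Bracketed term of the FI scaling equation: (1-F)/f_j + F/sum_i f_i,
    with j chosen as the fastest active core."""
    f_j = max(f_per_core)
    return (1.0 - F) / f_j + F / sum(f_per_core)


def fi_dyn_power_ratio(F, K, f_per_core, v_ratio):
    """Pdyn,N / Pdyn,base under frequency-island clocking.

    Sequential phase: core j runs at f_j while the other cores idle and
    consume a fraction K of their active dynamic power.
    Parallel phase: every active core runs at its own f_i.
    """
    f_j = max(f_per_core)
    f_sum = sum(f_per_core)
    idle = f_sum - f_j  # sum of f_i over all cores except j
    return ((1.0 - F) * (f_j + K * idle) + F * f_sum) * v_ratio ** 2


if __name__ == "__main__":
    # HYPOTHETICAL per-core frequency factors for four active cores.
    f_cores = [0.52, 0.48, 0.55, 0.50]
    print(fi_time_factor(0.8, f_cores))
    print(fi_dyn_power_ratio(F=0.8, K=0.1, f_per_core=f_cores, v_ratio=0.8))
```

With identical cores (all f_i equal to f) the expression collapses to k(F, K, N)·f·(V/VDD)², matching the variation-free dynamic power relation; this makes a useful consistency check.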
Impacts of WID Variations - FI
Frequency-Island Clocking
[Figure: panels: HP, slowest base core; HP, fastest base core]

Average total power of 100 die samples / relative number of cores (N)

Base core   VDD,min (V)   F=0.6    F=0.7    F=0.8    F=0.9    F=1.0
Slow        0.7           0.70/2   0.63/2   0.56/2   0.50/2   0.42/2
Slow        0.6           0.70/2   0.62/3   0.53/3   0.44/3   0.36/2
Slow        No limit      0.70/2   0.62/3   0.52/3   0.43/4   0.27/8
Fast        0.7           0.19/3   0.15/4   0.12/4   0.10/3   0.10/2
Fast        0.6           0.19/3   0.15/4   0.12/4   0.09/5   0.07/3
Fast        No limit      0.19/3   0.15/4   0.12/4   0.09/5   0.06/8

Total power reduction (VDD,min = 0.7 V rows): Slow 30~58%, Fast 81~90%
FI clocking is more power-efficient than global clocking (GC), which often wastes the Fmax of faster cores.
On average, FI clocking offers 7% lower total power consumption than GC.
Experimental Methodology
HSPICE simulations
  32nm PTM HP and LP models
Frequency / leakage scaling factors
  VDD range: 0.55 ~ 1.05 V
  24 FO4 inverter chain for measuring f(VDD)
  Complex gates for measuring l(VDD)
Vth and Leff WID spatial and D2D variation maps [3]
  WID variation ($\sigma^{sys}_{Vth}$)       6.4%
  Correlation distance coefficient (Φ)       0.5
  D2D variation ($\sigma^{D2D}_{Vth}$)       5.0%
[Figure: variation map; label: 1 grid point]
[3] Smruti R. Sarangi et al., "VARIUS: A Model of Process Variation and Resulting Timing Errors for Microarchitects", IEEE Transactions on Semiconductor Manufacturing (IEEE TSM), February 2008.
Conclusions
Determined the optimal number of active cores that minimizes the total power consumption of many-core processors.
  2x more active cores at a lower voltage offer more than 50% total power reduction at the same throughput as a base core.
Extended the power analysis to consider WID C2C frequency and leakage variations.
  2x more active cores at a lower voltage is the optimal choice.
FI clocking provides lower power consumption than GC because it can exploit C2C variations.
  Using the fastest of the active cores for the sequential portion of an application leads to the lowest power consumption.
Backup
Introduction
Process variations
  Manufactured dies exhibit a large spread of transistor delay and leakage power across dies and within each die.
  Die-to-die (D2D) variations affect all transistors on a die equally.
  Within-die (WID) variations induce different characteristics across each die.
  As individual cores become smaller, core-to-core (C2C) frequency and leakage power variations due to spatially correlated WID variations will become considerable.
[Figure: die-to-die variations and spatial within-die variations. Source: Synopsys]
Supply Voltage and Power Scaling (2)
Supply voltage scaling of many-core processors
  Throughput with a certain number of cores at maximum VDD (thus at Fmax)
  = throughput with more cores at a lower VDD (thus at a lower Fmax).
  The potential throughput increase from more cores can be traded for a lower VDD, which reduces power.
[Figure: many-core processor [1] with active and idle cores marked; 1 active core operating at VDD vs. 8 active cores operating at a voltage V lower than VDD]