
Department of Electrical and Computer Engineering, University of Wisconsin - Madison

Optimizing Total Power of Many-core Processors

Considering Voltage Scaling Limit and Process Variations

Jungseob Lee and Nam Sung Kim

October 9, 2009

Outline

Introduction

Supply Voltage and Power Scaling
  Supply Voltage Scaling of Many-Core Processors
  Power Scaling of Many-Core Processors

Impacts of Within-Die (WID) Spatial Process Variations
  Global Clocking
  Frequency-Island Clocking

Conclusions

Parallel Processing
  More cores improve the throughput of computing systems.
  With all cores running, throughput is limited by power and thermal constraints.

Challenges: how do we
  determine the number of cores for the best performance-power efficiency?
  exploit process variations in multicore processors?

[Figure: multicore processors, serial processing vs. parallel processing [1]; a GPU with many cores [2]]

[1] Source: http://www.interactivesupercomputing.com/starpexpress/042007/3_Task_Parallel.html

[2] Source: NVIDIA

Types of Process Variations

Process variations comprise within-die (WID) variations and die-to-die (D2D) variations.

[Figure: wafer-scale view of WID and D2D variations. Courtesy: K. Bowman, Intel]

[Figure: a systematic Vth variation map for a 16-core processor and the corresponding normalized Fmax and Pleak map]

Core-to-core (C2C) frequency and leakage power variations due to spatially correlated WID variations become considerable.

Supply Voltage Scaling (1)

Supply voltage scaling of many-core processors

Throughput with a certain number of cores at maximum VDD (thus Fmax) = throughput with more cores at a lower VDD (thus a lower Fmax).

The potential throughput increase from more cores can be traded for a lower VDD, which reduces power.

[Figure: 4 cores operating at VDD vs. 8 cores operating at a voltage V lower than VDD]

Supply Voltage Scaling (2)

Supply voltage scaling of many-core processors

$$M \cdot T_{cycle}(V_{DD}) = M \cdot \left((1-F) + \frac{F}{N}\right) \cdot T_{cycle}(V)$$

M: Number of operations
Tcycle: Cycle time of a processor at a given supply voltage
VDD: Nominal supply voltage of the base processor
F: Fraction of operations parallelizable without overhead
N: Relative number of cores
V: Scaled supply voltage of N x more cores
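Because Fmax = 1/Tcycle, the throughput-equality condition above says that the frequency scaling factor must satisfy f(V) = Fmax(V)/Fmax(VDD) = (1−F) + F/N. The sketch below solves that condition for V by bisection. It is only an illustration: the slides obtain f(V) from HSPICE simulations, whereas this code assumes an alpha-power-law delay model, and `VDD`, `VTH`, `ALPHA`, and all function names are hypothetical.

```python
# Sketch: solve the throughput-equality condition for the scaled voltage V.
# ASSUMPTION: Fmax(V) follows the alpha-power law, Fmax ~ (V - VTH)**ALPHA / V.
# VDD, VTH and ALPHA are illustrative values, not taken from the slides.

VDD = 0.9     # nominal supply voltage (V), hypothetical
VTH = 0.3     # threshold voltage (V), hypothetical
ALPHA = 1.3   # velocity-saturation exponent, hypothetical


def freq_scale(v):
    """f(V) = Fmax(V) / Fmax(VDD) under the assumed alpha-power law."""
    fmax = lambda x: (x - VTH) ** ALPHA / x
    return fmax(v) / fmax(VDD)


def required_freq_scale(f_par, n):
    """f(V) needed so that N x more cores match the base throughput.

    From M*Tcycle(VDD) = M*((1-F) + F/N)*Tcycle(V) and Fmax = 1/Tcycle:
    f(V) = Tcycle(VDD)/Tcycle(V) = (1-F) + F/N.
    """
    return (1.0 - f_par) + f_par / n


def scaled_voltage(f_par, n, v_min=None, tol=1e-6):
    """Bisect for the lowest V whose f(V) meets the requirement."""
    target = required_freq_scale(f_par, n)
    lo, hi = VTH + 1e-3, VDD          # f(V) is increasing on this interval
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if freq_scale(mid) >= target:
            hi = mid
        else:
            lo = mid
    # Clamp at the voltage scaling limit VDD,min when one is given.
    return max(hi, v_min) if v_min is not None else hi
```

For example, `scaled_voltage(0.8, 4)` returns the voltage at which four times as many cores match the base throughput when 80% of the operations are parallelizable, and passing `v_min=0.7` applies a 0.7 V scaling limit.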

[Figure: scaled supply voltage for PTM 32nm HP and PTM 32nm LP. LP requires a higher VDD because of its high Vth. Annotation: > 40% ↓]

Dynamic Power Analysis (1)

Dynamic power scaling

Dynamic power of a base many-core processor:

$$P_{dyn,base} = C_{eff} \cdot V_{DD}^2 \cdot F_{max}(V_{DD})$$

Dynamic power of N x more cores than the base processor:

$$P_{dyn,N} = \left((1-F) \cdot (1+(N-1) \cdot K) + F \cdot N\right) \cdot C_{eff} \cdot V^2 \cdot F_{max}(V) = k(F,K,N) \cdot f(V) \cdot (V/V_{DD})^2 \cdot P_{dyn,base}$$

Pdyn,base: Dynamic power of the base processor
Ceff: Effective total switching capacitance
VDD: Nominal voltage of the base processor
Fmax: Maximum operating frequency of the base processor
Pdyn,N: Dynamic power of N x more cores
K: Fraction of dynamic power consumed by idle cores
k(F,K,N): ((1−F) ∙ (1+(N−1) ∙ K) + F ∙ N)
f(V): Frequency scaling factor at V; Fmax(V)/Fmax(VDD)
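The normalized form of the dynamic power equation is easy to evaluate directly. The following sketch computes k(F,K,N) and Pdyn,N/Pdyn,base; the function and argument names are hypothetical, and f(V) is passed in as a number because the slides take it from simulation.

```python
def k_factor(f_par, k_idle, n):
    """k(F, K, N) = (1-F)*(1 + (N-1)*K) + F*N."""
    return (1.0 - f_par) * (1.0 + (n - 1) * k_idle) + f_par * n


def pdyn_normalized(f_par, k_idle, n, f_v, v, vdd):
    """Pdyn,N / Pdyn,base = k(F, K, N) * f(V) * (V/VDD)**2."""
    return k_factor(f_par, k_idle, n) * f_v * (v / vdd) ** 2
```

One consequence worth noting: for a fully parallel workload (F = 1) at equal throughput, f(V) = 1/N and k = N, so k ∙ f(V) = 1 and the dynamic power saving comes entirely from the (V/VDD)² term.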

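Dynamic Power Analysis (2)

Dynamic power scaling

Optimal normalized Pdyn / relative number of cores:

| Model | VDD,min (V) | F=0.6 | F=0.7 | F=0.8 | F=0.9 | F=1.0 |
|---|---|---|---|---|---|---|
| PTM HP | 0.7 | 0.75/2 | 0.66/3 | 0.60/2 | 0.52/2 | 0.45/2 |
| PTM HP | 0.6 | 0.75/2 | 0.66/3 | 0.54/3 | 0.41/4 | 0.34/3 |
| PTM HP | No limit | 0.75/2 | 0.66/3 | 0.54/3 | 0.41/5 | 0.20/8 |
| PTM LP | 0.7 | 0.75/2 | 0.70/2 | 0.65/2 | 0.56/3 | 0.46/3 |
| PTM LP | 0.6 | 0.75/2 | 0.70/2 | 0.65/2 | 0.55/4 | 0.35/8 |
| PTM LP | No limit | 0.75/2 | 0.70/2 | 0.65/2 | 0.55/4 | 0.35/8 |

[Figure: normalized Pdyn vs. relative number of cores for PTM 32nm HP and PTM 32nm LP. Dotted lines show the projected power consumption when there is no supply voltage limit.]

With VDD,min = 0.7 V, less VDD scaling is possible, so the Pdyn reduction is smaller: 25~55% for HP and 25~54% for LP.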

Leakage Power Analysis (1)

Leakage power scaling

In nanoscale technology, leakage power is a significant fraction of total power consumption.

Leakage power of a base many-core processor:

$$P_{leak,base} = I_{leak}(V_{DD}) \cdot V_{DD}$$

Leakage power of N x more cores than the base processor:

$$P_{leak,N} = N \cdot I_{leak}(V) \cdot V = N \cdot l(V) \cdot (V/V_{DD}) \cdot P_{leak,base}$$

Pleak,base: Leakage power of the base processor
Ileak: Total leakage current of the base processor
Pleak,N: Leakage power of N x more cores
l(V): Leakage scaling factor at V; Ileak(V)/Ileak(VDD)
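The leakage equation can be evaluated the same way as the dynamic one. This is a minimal sketch with hypothetical names; l(V) is supplied by the caller, since the slides measure it in simulation. The second helper rearranges N ∙ l(V) ∙ (V/VDD) < 1 to show how small l(V) must be before adding cores reduces leakage.

```python
def pleak_normalized(n, leak_scale, v, vdd):
    """Pleak,N / Pleak,base = N * l(V) * (V/VDD)."""
    return n * leak_scale * (v / vdd)


def max_leak_scale_for_saving(n, v, vdd):
    """Largest l(V) for which N x more cores leak no more than the base.

    Follows from requiring N * l(V) * (V/VDD) <= 1.
    """
    return vdd / (n * v)
```

With illustrative inputs (not from the slides) N = 2, l(V) = 0.3, V = 0.7 V and VDD = 0.9 V, `pleak_normalized` gives about 0.47, that is, roughly half the base leakage despite twice the cores.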


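Leakage Power Analysis (2)

Leakage power scaling

Optimal normalized Pleak / relative number of cores:

| Model | VDD,min (V) | F=0.6 | F=0.7 | F=0.8 | F=0.9 | F=1.0 |
|---|---|---|---|---|---|---|
| PTM HP | 0.7 | 0.46/3 | 0.35/3 | 0.31/2 | 0.25/2 | 0.20/2 |
| PTM HP | 0.6 | 0.46/3 | 0.35/3 | 0.27/3 | 0.21/4 | 0.16/3 |
| PTM HP | No limit | 0.46/3 | 0.35/3 | 0.27/3 | 0.21/4 | 0.15/5 |
| PTM LP | 0.7 | 0.67/2 | 0.62/2 | 0.58/2 | 0.54/2 | 0.50/2 |
| PTM LP | 0.6 | 0.67/2 | 0.62/2 | 0.58/2 | 0.54/2 | 0.50/2 |
| PTM LP | No limit | 0.67/2 | 0.62/2 | 0.58/2 | 0.54/2 | 0.50/2 |

[Figure: normalized Pleak vs. relative number of cores for PTM 32nm HP and PTM 32nm LP]

Pleak reduction: 54~80% for HP and 33~50% for LP. The absolute Pleak of LP, however, is much less than that of HP.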

Total Power Analysis (1)

Total power scaling

The total power of a base many-core processor is the sum of its dynamic and leakage power:

$$P_{tot,base} = P_{dyn,base} + P_{leak,base} = P_{dyn,base} \cdot (1 + LF)$$

The total power of N x more cores than the base processor is likewise the sum of dynamic and leakage power:

$$P_{tot,N} = P_{dyn,N} + P_{leak,N} = P_{dyn,base} \cdot \left\{ k(F,K,N) \cdot f(V) \cdot (V/V_{DD})^2 + N \cdot l(V) \cdot (V/V_{DD}) \cdot LF \right\}$$

Ptot,base: Total power of the base processor
LF: Ratio of Pleak to Pdyn (Pleak/Pdyn)
Ptot,N: Total power of N x more cores
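Dividing Ptot,N by Ptot,base gives a ratio that can be minimized over N. The sketch below does this by brute force. It is not the authors' flow: f(V) and l(V) come from HSPICE in the slides, while here they are replaced by an assumed alpha-power law and an assumed exponential leakage model, and every constant and function name is hypothetical. It also assumes that when V is clamped at VDD,min the cores run at the required frequency rather than at Fmax(V).

```python
import math

# All device parameters below are hypothetical stand-ins for HSPICE data.
VDD = 0.9         # nominal supply voltage (V)
VTH = 0.3         # threshold voltage (V)
ALPHA = 1.3       # alpha-power-law exponent
LEAK_SLOPE = 3.0  # assumed l(V) = exp(LEAK_SLOPE * (V - VDD))


def freq_scale(v):
    """Assumed f(V) = Fmax(V) / Fmax(VDD), alpha-power law."""
    fmax = lambda x: (x - VTH) ** ALPHA / x
    return fmax(v) / fmax(VDD)


def leak_scale(v):
    """Assumed l(V) = Ileak(V) / Ileak(VDD)."""
    return math.exp(LEAK_SLOPE * (v - VDD))


def voltage_for(target, v_min):
    """Lowest V with f(V) >= target, clamped at the scaling limit v_min."""
    lo, hi = VTH + 1e-3, VDD
    while hi - lo > 1e-6:
        mid = 0.5 * (lo + hi)
        if freq_scale(mid) >= target:
            hi = mid
        else:
            lo = mid
    return max(hi, v_min)


def ptot_ratio(f_par, k_idle, n, lf, v_min=0.0):
    """Ptot,N / Ptot,base at equal throughput."""
    f_req = (1.0 - f_par) + f_par / n   # f(V) required for equal throughput
    v = voltage_for(f_req, v_min)
    k = (1.0 - f_par) * (1.0 + (n - 1) * k_idle) + f_par * n
    # ASSUMPTION: if V is clamped at v_min, cores still run at f_req.
    dyn = k * f_req * (v / VDD) ** 2
    leak = n * leak_scale(v) * (v / VDD) * lf
    return (dyn + leak) / (1.0 + lf)


def best_core_count(f_par, k_idle, lf, v_min=0.0, n_max=8):
    """Return (minimum Ptot ratio, N) over N = 1..n_max."""
    return min((ptot_ratio(f_par, k_idle, n, lf, v_min), n)
               for n in range(1, n_max + 1))
```

Calling `best_core_count(0.8, 0.1, 0.4 / 0.6, v_min=0.7)` uses the HP leakage ratio LF = 0.4/0.6 from the slides with a 0.7 V limit; the resulting numbers depend on the assumed device models and should not be expected to match the slides' tables.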

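Total Power Analysis (2)

Total power scaling

Optimal normalized Ptot / relative number of cores:

| Model | LF | VDD,min (V) | F=0.6 | F=0.7 | F=0.8 | F=0.9 | F=1.0 |
|---|---|---|---|---|---|---|---|
| PTM HP | 0.4/0.6 | 0.7 | 0.64/2 | 0.53/3 | 0.48/2 | 0.41/2 | 0.35/2 |
| PTM HP | 0.4/0.6 | 0.6 | 0.64/2 | 0.53/3 | 0.43/3 | 0.33/4 | 0.27/3 |
| PTM HP | 0.4/0.6 | No limit | 0.64/2 | 0.53/3 | 0.43/3 | 0.33/5 | 0.18/8 |
| PTM LP | 0.2/0.8 | 0.7 | 0.74/2 | 0.69/2 | 0.63/2 | 0.57/3 | 0.48/3 |
| PTM LP | 0.2/0.8 | 0.6 | 0.74/2 | 0.69/2 | 0.63/2 | 0.57/3 | 0.46/5 |
| PTM LP | 0.2/0.8 | No limit | 0.74/2 | 0.69/2 | 0.63/2 | 0.57/3 | 0.46/5 |

[Figure: normalized Ptot vs. relative number of cores for PTM 32nm HP (LF = 0.4/0.6) and PTM 32nm LP (LF = 0.2/0.8)]

More VDD scaling yields only 17% more Ptot reduction, but it requires more on-die memory area.

Ptot reduction: 36~65% for HP and 26~52% for LP.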
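The quoted reduction ranges follow directly from the VDD,min = 0.7 V rows of the total power table: the reduction is one minus the normalized power. A small check, with hypothetical variable and function names:

```python
# Normalized Ptot at VDD,min = 0.7 V for F = 0.6, 0.7, 0.8, 0.9, 1.0,
# copied from the total power table.
ptot_hp = [0.64, 0.53, 0.48, 0.41, 0.35]
ptot_lp = [0.74, 0.69, 0.63, 0.57, 0.48]


def reduction_range(normalized):
    """Return (min, max) power reduction in percent for a table row."""
    pct = [round((1.0 - p) * 100) for p in normalized]
    return min(pct), max(pct)


print(reduction_range(ptot_hp))  # (36, 65)
print(reduction_range(ptot_lp))  # (26, 52)
```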

Impacts of WID Variations - GC

Global Clocking (GC)

Global clocking limits Fmax of a many-core processor to that of the slowest core.

The previous Pdyn,N equation can still be used to estimate Pdyn,N.

The estimate of Pleak,N has to account for each core's leakage variation as follows:

$$P_{leak,N} = \sum_{i=1}^{N} l_i(V) \cdot (V/V_{DD}) \cdot P_{leak,base}$$

li(V): Leakage scaling factor of the i-th core, normalized to Ileak(VDD)
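Under global clocking, the two per-core quantities enter in different ways: the clock is set by the minimum Fmax, while leakage is summed over cores. A minimal sketch with hypothetical names and made-up per-core values:

```python
def gc_fmax(core_fmax):
    """Global clocking: the chip runs at the slowest core's Fmax."""
    return min(core_fmax)


def pleak_gc_normalized(core_leak_scale, v, vdd):
    """Pleak,N / Pleak,base = sum_i l_i(V) * (V/VDD)."""
    return sum(core_leak_scale) * (v / vdd)


# Hypothetical normalized per-core values for a 4-core configuration.
fmax = [1.05, 0.92, 1.00, 0.97]
leak = [0.40, 0.25, 0.33, 0.29]
chip_fmax = gc_fmax(fmax)                      # limited by the 0.92 core
chip_leak = pleak_gc_normalized(leak, 0.7, 0.9)
```

When every core has the same l_i(V), the sum reduces to N ∙ l(V) ∙ (V/VDD), which is the variation-free leakage equation.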

[Figure: a systematic Vth variation map for a 16-core processor and the corresponding Fmax and Pleak map (normalized Fmax and Pleak vs. core ID)]

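Impacts of WID Variations - GC

Global Clocking (GC)

Average total power of 100 die samples / relative number of cores (N), PTM HP:

| Base core | VDD,min (V) | F=0.6 | F=0.7 | F=0.8 | F=0.9 | F=1.0 |
|---|---|---|---|---|---|---|
| Slowest | 0.7 | 0.77/2 | 0.67/2 | 0.59/2 | 0.52/2 | 0.46/2 |
| Slowest | 0.6 | 0.77/2 | 0.67/2 | 0.57/3 | 0.46/3 | 0.37/2 |
| Slowest | No limit | 0.77/2 | 0.67/2 | 0.57/3 | 0.46/4 | 0.29/8 |
| Fastest | 0.7 | 0.23/3 | 0.18/3 | 0.14/4 | 0.12/2 | 0.10/2 |
| Fastest | 0.6 | 0.23/3 | 0.18/3 | 0.14/4 | 0.10/4 | 0.07/3 |
| Fastest | No limit | 0.23/3 | 0.18/3 | 0.14/4 | 0.10/4 | 0.06/8 |

[Figure: average total power vs. N for HP with the slowest base core and with the fastest base core]

Total power reduction: 23~54% with the slowest base core and 77~90% with the fastest base core. The relative reduction is much larger in the latter case because the fastest base core is not power efficient.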

Impacts of WID Variations - FI

Frequency-Island (FI) Clocking

FI clocking is more performance- and power-efficient than GC because each core can run at its own fastest frequency.

The previous GC Pleak,N equation can be used to estimate Pleak,N.

The equation for supply voltage scaling has to be modified as follows:

$$M \cdot T_{cycle,base}(V_{DD}) = M \cdot \left( \frac{1-F}{f_j} + \frac{F}{\sum_{i=1}^{N} f_i} \right) \cdot T_{cycle}(V)$$

The estimate of Pdyn,N also has to account for an independent clock frequency per core:

$$P_{dyn,N} = \left( (1-F) \cdot \Big( f_j + \sum_{i=1,\, i \neq j}^{N} f_i \cdot K \Big) + F \cdot \sum_{i=1}^{N} f_i \right) \cdot (V/V_{DD})^2 \cdot P_{dyn,base}$$

Here fi is the frequency factor of the i-th active core, and j is the core that processes the sequential portion.

The fastest of the chosen active cores always offers the optimal total power for processing the fully sequential portion of the workload.
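The two FI equations can be sketched as follows. The summation limits are garbled in the slide text, so this code assumes that both sums run over the N active cores and that the idle-core sum in the sequential term excludes core j, the core running the sequential portion. The normalization of the per-core factors f_i is taken as given by the slides, and all names are hypothetical.

```python
def fi_time_factor(f_par, core_freq, j):
    """(1-F)/f_j + F/sum_i f_i, the factor multiplying Tcycle(V)."""
    return (1.0 - f_par) / core_freq[j] + f_par / sum(core_freq)


def fi_pdyn_normalized(f_par, k_idle, core_freq, j, v, vdd):
    """Pdyn,N / Pdyn,base with an independent clock per core."""
    # During the sequential portion, core j is active and the others idle.
    idle = sum(f for i, f in enumerate(core_freq) if i != j)
    seq = (1.0 - f_par) * (core_freq[j] + idle * k_idle)
    par = f_par * sum(core_freq)
    return (seq + par) * (v / vdd) ** 2


def fastest_core(core_freq):
    """Index of the fastest active core, used for the sequential portion."""
    return max(range(len(core_freq)), key=core_freq.__getitem__)
```

As a consistency check, setting every f_i to 1 turns `fi_time_factor` into (1−F) + F/N, and setting every f_i to a common f(V) turns `fi_pdyn_normalized` into k(F,K,N) ∙ f(V) ∙ (V/VDD)², so both reduce to the variation-free equations.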

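Impacts of WID Variations - FI

Frequency-Island (FI) Clocking

Average total power of 100 die samples / relative number of cores (N), PTM HP:

| Base core | VDD,min (V) | F=0.6 | F=0.7 | F=0.8 | F=0.9 | F=1.0 |
|---|---|---|---|---|---|---|
| Slowest | 0.7 | 0.70/2 | 0.63/2 | 0.56/2 | 0.50/2 | 0.42/2 |
| Slowest | 0.6 | 0.70/2 | 0.62/3 | 0.53/3 | 0.44/3 | 0.36/2 |
| Slowest | No limit | 0.70/2 | 0.62/3 | 0.52/3 | 0.43/4 | 0.27/8 |
| Fastest | 0.7 | 0.19/3 | 0.15/4 | 0.12/4 | 0.10/3 | 0.10/2 |
| Fastest | 0.6 | 0.19/3 | 0.15/4 | 0.12/4 | 0.09/5 | 0.07/3 |
| Fastest | No limit | 0.19/3 | 0.15/4 | 0.12/4 | 0.09/5 | 0.06/8 |

[Figure: average total power vs. N for HP with the slowest base core and with the fastest base core]

FI clocking is more power-efficient than global clocking (GC), which often wastes the Fmax of faster cores. On average, FI clocking offers 7% lower total power consumption than GC.

Total power reduction: 30~58% with the slowest base core and 81~90% with the fastest base core.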

Experimental Methodology

HSPICE simulations
  32nm PTM HP and LP models
  Frequency and leakage scaling factors over a VDD range of 0.55~1.05 V
  Vth and Leff WID spatial and D2D variation maps
  Complex gates for measuring l(VDD)
  24 FO4 inverter chain for measuring f(VDD)

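Variation parameters [3]:

| Parameter | Value |
|---|---|
| WID variation (σVth,sys) | 6.4% |
| Correlation distance coefficient (Φ) | 0.5 |
| D2D variation (σVth,D2D) | 5.0% |

[Figure: variation map; label: 1 grid point]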
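To make the variation parameters concrete, the sketch below draws one spatially correlated Vth map for a 4x4 grid of cores. It is an illustration only, not the authors' generator: it assumes the spherical correlation function associated with the VARIUS model [3], models Vth only (not Leff), places one sample at each core centre, and builds the correlated samples with a hand-written Cholesky factorization. All function names are hypothetical; the three constants are the values listed above.

```python
import math
import random

PHI = 0.5          # correlation distance, as a fraction of chip width
SIGMA_SYS = 0.064  # systematic WID sigma of Vth (6.4%)
SIGMA_D2D = 0.05   # D2D sigma of Vth (5.0%)


def spherical_corr(r, phi=PHI):
    """Spherical correlation: 1 at r = 0, falling to 0 at r >= phi."""
    if r >= phi:
        return 0.0
    x = r / phi
    return 1.0 - 1.5 * x + 0.5 * x ** 3


def cholesky(a):
    """Lower-triangular L with L * L^T = a (a must be positive definite)."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def vth_map(grid=4, seed=0):
    """Relative Vth deviation per core for one die (grid x grid cores)."""
    rng = random.Random(seed)
    # Core centres on a chip of unit width.
    pts = [((c + 0.5) / grid, (r + 0.5) / grid)
           for r in range(grid) for c in range(grid)]
    corr = [[spherical_corr(math.dist(p, q)) for q in pts] for p in pts]
    low = cholesky(corr)
    z = [rng.gauss(0.0, 1.0) for _ in pts]
    d2d = rng.gauss(0.0, SIGMA_D2D)  # one offset shared by the whole die
    return [d2d + SIGMA_SYS * sum(low[i][k] * z[k] for k in range(i + 1))
            for i in range(len(pts))]
```

Calling `vth_map(seed=s)` for 100 different seeds would give a population of dies comparable in spirit to the 100 die samples averaged in the GC and FI results.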

[3] Smruti R. Sarangi et al., "VARIUS: A Model of Process Variation and Resulting Timing Errors for Microarchitects", IEEE Transactions on Semiconductor Manufacturing (IEEE TSM), February 2008.

Conclusions

Optimal number of active cores to minimize the total power consumption of many-core processors
  2x more active cores at a lower voltage offer more than 50% total power reduction at the same throughput as a base core.

Extended power analysis considering WID C2C frequency and leakage variations
  2x more active cores at a lower voltage is the optimal choice.
  FI clocking provides lower power consumption than GC because it can exploit C2C variations.
  Using the fastest of the active cores for the sequential portion of the application leads to the lowest power consumption.

Backup

Introduction

Process variations
  Manufactured dies exhibit a large spread of transistor delay and leakage power across dies and within each die.
  Die-to-die (D2D) variations affect all transistors on a die equally.
  Within-die (WID) variations induce different characteristics across each die.
  As individual cores become smaller, core-to-core (C2C) frequency and leakage power variations due to spatially correlated WID variations will become considerable.

[Figure: die-to-die variations and spatial within-die variations. Source: Synopsys]

Supply Voltage and Power Scaling (2)

Supply voltage scaling of many-core processors

Throughput with a certain number of cores at maximum VDD (thus Fmax) = throughput with more cores at a lower VDD (thus a lower Fmax).

The potential throughput increase from more cores can be traded for a lower VDD, which reduces power.

[Figure: a many-core processor [1] with active and idle cores marked: 1 active core operating at VDD vs. 8 active cores operating at a voltage V lower than VDD]
