software optimization for performance, energy, and thermal distribution: initial case studies
DESCRIPTION
Software Optimization for Performance, Energy, and Thermal Distribution: Initial Case Studies. Md. Ashfaquzzaman Khan, Can Hankendi, Ayse Kivilcim Coskun , Martin C. Herbordt { azkhan | hankendi | acoskun | herbordt }@ bu.edu Electrical and Computer Engineering BOSTON UNIVERSITY - PowerPoint PPT PresentationTRANSCRIPT
![Page 1: Software Optimization for Performance, Energy, and Thermal Distribution: Initial Case Studies](https://reader035.vdocuments.net/reader035/viewer/2022070501/5681692e550346895de073c1/html5/thumbnails/1.jpg)
Software Optimization for Performance, Energy, and Thermal Distribution: Initial Case Studies
Md. Ashfaquzzaman Khan, Can Hankendi, Ayse Kivilcim Coskun, Martin C. Herbordt
{azkhan| hankendi | acoskun | herbordt}@bu.edu
Electrical and Computer Engineering
BOSTON UNIVERSITY
HPEC’11
09/21/2011 Funded by Dean’s Catalyst Award, BU.
![Page 2: Software Optimization for Performance, Energy, and Thermal Distribution: Initial Case Studies](https://reader035.vdocuments.net/reader035/viewer/2022070501/5681692e550346895de073c1/html5/thumbnails/2.jpg)
Energy consumption by data centers increases by 15% per year [Koomey 08].
Motivation
(BAU: Business as Usual)
Source: Pike ResearchMGHPCC
Holyoke Dam
![Page 3: Software Optimization for Performance, Energy, and Thermal Distribution: Initial Case Studies](https://reader035.vdocuments.net/reader035/viewer/2022070501/5681692e550346895de073c1/html5/thumbnails/3.jpg)
Energy consumption by data centers increases by 15% per year [Koomey 08].
Motivation
(BAU: Business as Usual)
Source: Pike ResearchMGHPCC
Holyoke Dam
Goal: Reduce energy consumptionthrough temperature-aware softwareoptimization
![Page 4: Software Optimization for Performance, Energy, and Thermal Distribution: Initial Case Studies](https://reader035.vdocuments.net/reader035/viewer/2022070501/5681692e550346895de073c1/html5/thumbnails/4.jpg)
Software to Reduce Power (basic) If cores are idle shut them off (Dynamic Power Management) If you have slack in the schedule, slow down (Dynamic
Voltage/Frequency Scaling)
![Page 5: Software Optimization for Performance, Energy, and Thermal Distribution: Initial Case Studies](https://reader035.vdocuments.net/reader035/viewer/2022070501/5681692e550346895de073c1/html5/thumbnails/5.jpg)
Temperature is also a concern
Prohibitive cooling costs Performance problems
Increased circuit delay Harder performance
prediction at design Increased leakage power
Reaches 35-40%(e.g., 45nm process)
Reliability degradation Higher permanent fault rate
Hot spots Thermal cycles Spatial gradients
Figure: Intel
Figure: Santarini, EDN’05
![Page 6: Software Optimization for Performance, Energy, and Thermal Distribution: Initial Case Studies](https://reader035.vdocuments.net/reader035/viewer/2022070501/5681692e550346895de073c1/html5/thumbnails/6.jpg)
Software to Reduce Temperature (basic)
If the chip is too hot, back off (Dynamic Thermal Management)
![Page 7: Software Optimization for Performance, Energy, and Thermal Distribution: Initial Case Studies](https://reader035.vdocuments.net/reader035/viewer/2022070501/5681692e550346895de073c1/html5/thumbnails/7.jpg)
A More Sophisticated Strategy* Physical Goals
Minimize total energy Minimize max energy for each core Per core -- reduce time spent above threshold temperature Minimize spatial thermal gradients Minimize temporal thermal gradients
Software Goals include Avoid hot spots by penalizing clustered jobs
*Coskun et al. TVLSI 2008
![Page 8: Software Optimization for Performance, Energy, and Thermal Distribution: Initial Case Studies](https://reader035.vdocuments.net/reader035/viewer/2022070501/5681692e550346895de073c1/html5/thumbnails/8.jpg)
Coskun, BU
Is Energy Management Sufficient?
Energy or performance-aware methods are not always effective for managing temperature. We need: Dynamic techniques specifically addressing temperature-induced
problems Efficient framework for evaluating dynamic techniques
% T
ime
Spen
t at
Va
riou
s Te
mpe
ratu
re
Rang
es
![Page 9: Software Optimization for Performance, Energy, and Thermal Distribution: Initial Case Studies](https://reader035.vdocuments.net/reader035/viewer/2022070501/5681692e550346895de073c1/html5/thumbnails/9.jpg)
Opt for P ≠ Opt for E ≠ Opt for T P vs. E:
For E, less likely to work hard to gain marginal improvement in performance
E vs. T: For T, less likely to concentrate work temporally or spatially
![Page 10: Software Optimization for Performance, Energy, and Thermal Distribution: Initial Case Studies](https://reader035.vdocuments.net/reader035/viewer/2022070501/5681692e550346895de073c1/html5/thumbnails/10.jpg)
Problems with these approaches Assume slack Assume knowledge of tasks Assume (known) deadlines rather than “fastest possible” Assume that critical temps are a problem Do not take advantage of intra-task optimization
Plan: design software to be thermally optimized
![Page 11: Software Optimization for Performance, Energy, and Thermal Distribution: Initial Case Studies](https://reader035.vdocuments.net/reader035/viewer/2022070501/5681692e550346895de073c1/html5/thumbnails/11.jpg)
Energy consumption by data centers increases by 15% per year [Koomey 08].
Motivation High temperature:
Higher cooling cost, degraded reliability
Software optimization for improving Performance, Energy, and Temperature: Potential for significantly
better P, E, T profiles than HW-only optimization
Jointly optimizing P,E and T is necessary for high energy-efficiency while maintaining high performance and reliability.
(BAU: Business as Usual)
Source: Pike Research
![Page 12: Software Optimization for Performance, Energy, and Thermal Distribution: Initial Case Studies](https://reader035.vdocuments.net/reader035/viewer/2022070501/5681692e550346895de073c1/html5/thumbnails/12.jpg)
Contributions Demonstrating the need for optimizing PET instead of
optimizing PE or PT only. Developing guidelines to design PET-aware
software.
Providing application-specific analysis to design metrics and tools to evaluate P, E and T.
Two case studies for SW-based optimization:Software restructuring and
tuning• 36% reduction in system
energy• 30% reduction in CPU energy• 56% reduction in temporal
thermal variations.
Investigating the effect of software on cooling energy3°C increase in peak temperature translates into 12.7W increase in system power.
![Page 13: Software Optimization for Performance, Energy, and Thermal Distribution: Initial Case Studies](https://reader035.vdocuments.net/reader035/viewer/2022070501/5681692e550346895de073c1/html5/thumbnails/13.jpg)
Outline
Motivation/Goals Methodology Case studies
Software tuning to improve P,E and T Effect of temperature optimization on system-energy
Conclusion Questions
✔
![Page 14: Software Optimization for Performance, Energy, and Thermal Distribution: Initial Case Studies](https://reader035.vdocuments.net/reader035/viewer/2022070501/5681692e550346895de073c1/html5/thumbnails/14.jpg)
System-under-test: 12-core AMD Magny Cours processor, U1 server
Measurement Setup
10mV/A conversion
40 ms data sampling period
Temperature sensor & performance counter readings
System power
Data logger
![Page 15: Software Optimization for Performance, Energy, and Thermal Distribution: Initial Case Studies](https://reader035.vdocuments.net/reader035/viewer/2022070501/5681692e550346895de073c1/html5/thumbnails/15.jpg)
Power and Temperature Estimation (ideal)
Power Measurement
s
HotSpot-5.0
Thermal Simulator[Skadron ISCA’03]
Layout for Magny Cours
Per-core power traces
Temperature
trace for each core
Northbridge Bus
CPU0
CPU3
CPU4
CPU5
CPU2
CPU1
L20
L21
L22
L23
L24
L25
L2 caches
![Page 16: Software Optimization for Performance, Energy, and Thermal Distribution: Initial Case Studies](https://reader035.vdocuments.net/reader035/viewer/2022070501/5681692e550346895de073c1/html5/thumbnails/16.jpg)
Power and Temperature Estimation (practical)
• Sparse power measurements• Performance monitors
Our Power Estimation
Model
HotSpot-5.0
Thermal Simulator[Skadron ISCA’03]
Layout for Magny Cours Per-core power traces
Temperature
trace for each core
Northbridge Bus
CPU0
CPU3
CPU4
CPU5
CPU2
CPU1
L20
L21
L22
L23
L24
L25
L2 caches
Per-core power and temperature measurements are often unavailable.
![Page 17: Software Optimization for Performance, Energy, and Thermal Distribution: Initial Case Studies](https://reader035.vdocuments.net/reader035/viewer/2022070501/5681692e550346895de073c1/html5/thumbnails/17.jpg)
Motivation: Per-core power and temperature measurements are often not available.
We custom-designed six microbenchmarks to build the power estimation model.
Power Estimation Methodology
LinearRegressi
on
Performance counter data
+ power
measurements
Coefficients
• CPU cycles• Retired micro-ops• Retired MMX and FP
instructions• Retired SSE operations
• L2 cache misses• Dispatch stalls• Dispatched FPU instructions• Retired branch instructions
Hardware events collected through the counters:
![Page 18: Software Optimization for Performance, Energy, and Thermal Distribution: Initial Case Studies](https://reader035.vdocuments.net/reader035/viewer/2022070501/5681692e550346895de073c1/html5/thumbnails/18.jpg)
L2-Cache
FPUUnit
ALU
BRL1-Cache
L2-Cache
FPUUnit
ALU
BRL1-Cache
In-cache matrix multiplication (double)
Microbenchmarking
L3-Cache
In-cache matrix multiplication (short)
Intensive memory access w/o sharing
Intensive memory access w/ sharing
Intensive memory access w/ frequent synchronization
In-cache matrix multiplication (short-simple)
Main Memory
![Page 19: Software Optimization for Performance, Energy, and Thermal Distribution: Initial Case Studies](https://reader035.vdocuments.net/reader035/viewer/2022070501/5681692e550346895de073c1/html5/thumbnails/19.jpg)
Power Model Validation
Power estimation for microbenchmarks
* Average error for PARSEC benchmarks is less than 5 %.
ProjectedMeasured
Error % for PARSEC benchmarks [Bienia
PACT’08]ROI Error % (Avg.P)All Error % (Avg.P)
![Page 20: Software Optimization for Performance, Energy, and Thermal Distribution: Initial Case Studies](https://reader035.vdocuments.net/reader035/viewer/2022070501/5681692e550346895de073c1/html5/thumbnails/20.jpg)
Outline
Motivation/Goals Methodology Case studies
Software tuning to improve P,E and T Effect of temperature optimization on system-energy
Conclusion Questions
✔✔
![Page 21: Software Optimization for Performance, Energy, and Thermal Distribution: Initial Case Studies](https://reader035.vdocuments.net/reader035/viewer/2022070501/5681692e550346895de073c1/html5/thumbnails/21.jpg)
A kernel in PARSEC benchmark suite
Implements a data compression method called “deduplication”
Combines local and global compression
“deduplication” is an emerging method for compressing: storage footprints communication data
Parallelization of dedup
Fine-grained segmentation
Break up of data into coarse-grained chunks
Hash-computation
Compression (Ziv-Lempel)
Hashes and compressed data output
ParallelStages
![Page 22: Software Optimization for Performance, Energy, and Thermal Distribution: Initial Case Studies](https://reader035.vdocuments.net/reader035/viewer/2022070501/5681692e550346895de073c1/html5/thumbnails/22.jpg)
Default dedup (Pipelined)
Default version: Pipelined model
S1
Stage2
Stage2
Stage2
Stage3
Stage3
Stage3
Stage4
Stage4
Stage4
S5
Parallel threads
Data Queue OS schedules the
parallel threads as data become available
Heavy data dependency among threads
Increased need for synchronization
Increased data movement (less reuse) inside processing cores
Uneven computational load leads to uneven power consumption
![Page 23: Software Optimization for Performance, Energy, and Thermal Distribution: Initial Case Studies](https://reader035.vdocuments.net/reader035/viewer/2022070501/5681692e550346895de073c1/html5/thumbnails/23.jpg)
Task-decomposed dedup
Proposed version: Task-decomposed
S1
Stage1+Stage2+Stage3
Stage1+Stage2+Stage3
Stage1+Stage2+Stage3
S5
Parallel threads
Data Queue
More data reuse
Less synchronization
Balanced computation load among cores
Improved performance, energy and temperature behavior
Parameter optimized for target architecture
![Page 24: Software Optimization for Performance, Energy, and Thermal Distribution: Initial Case Studies](https://reader035.vdocuments.net/reader035/viewer/2022070501/5681692e550346895de073c1/html5/thumbnails/24.jpg)
Dedup threads takes specific number of tasks from the queue (default=20)
Number of tasks between two synchronization points is critical for the application performance
Tuning the number of tasks balances the workload across threads Tuned value=10
Parameter Tuning
Ideal
Performance scaling of default and proposed version of dedup
![Page 25: Software Optimization for Performance, Energy, and Thermal Distribution: Initial Case Studies](https://reader035.vdocuments.net/reader035/viewer/2022070501/5681692e550346895de073c1/html5/thumbnails/25.jpg)
Power & Temperature ResultsDEFAULT TASK-DECOMPOSED TASK-DECOMPOSED &
PARAMETER TUNED
Per-core temperature (°C )
Chip power (W)8070605040302010 0 50 100 150 200 250 300 350
400 a Sampling Interval (x40ms)
50 100 150 200 250 300 350 400 a Sampling Interval (x40ms)
50 100 150 200 250 300 350 400 Sampling Interval (x40ms)
605958575655 50 100 150 200 250 300 350
400 a Sampling Interval (x40ms)
50 100 150 200 250 300 350 400 a Sampling Interval (x40ms)
50 100 150 200 250 300 350 400 Sampling Interval (x40ms)
![Page 26: Software Optimization for Performance, Energy, and Thermal Distribution: Initial Case Studies](https://reader035.vdocuments.net/reader035/viewer/2022070501/5681692e550346895de073c1/html5/thumbnails/26.jpg)
Parameter-tuned task-based model improvements with respect to default parallelization model:
30% reduction in CPU energy
35% reduction in system energy
56% reduction in per-core maximum temporal thermal variation
41% reduction in spatial thermal variation
Energy & Temperature Results
![Page 27: Software Optimization for Performance, Energy, and Thermal Distribution: Initial Case Studies](https://reader035.vdocuments.net/reader035/viewer/2022070501/5681692e550346895de073c1/html5/thumbnails/27.jpg)
Outline
Motivation/Goals Methodology Case studies
Software tuning to improve P,E and T Effect of temperature optimization on system-energy
Conclusion Questions
✔✔
✔
![Page 28: Software Optimization for Performance, Energy, and Thermal Distribution: Initial Case Studies](https://reader035.vdocuments.net/reader035/viewer/2022070501/5681692e550346895de073c1/html5/thumbnails/28.jpg)
Optimizing temperature at μs granularity has substantial benefits.
Effect of SW Optimization on Temperature
Quantifying effect of temperature optimization on system power mPrime stress test Microbenchmarks
![Page 29: Software Optimization for Performance, Energy, and Thermal Distribution: Initial Case Studies](https://reader035.vdocuments.net/reader035/viewer/2022070501/5681692e550346895de073c1/html5/thumbnails/29.jpg)
mPrime (Prime95 on Windows) is a commonly used stress benchmark Δ25°C +17W system power, +10W chip power
mPrime Stress Test
System powerNB temperatureChip power
120
140
160
180
200
220
Power (W)
100 200 300400 500 600 700 800Time(s)
10
20
30
40
50
60
70Temperature (°C) Δ17W
Δ25°C
Δ10W
![Page 30: Software Optimization for Performance, Energy, and Thermal Distribution: Initial Case Studies](https://reader035.vdocuments.net/reader035/viewer/2022070501/5681692e550346895de073c1/html5/thumbnails/30.jpg)
Two benchmarks with different P and T profiles: In-cache matrix multiplication (double) -- MM
High power due to stress on FPU units Intensive memory access w/ frequent synchronization --
Spinlock Low power due to memory and synchronization operations
Effect of Temperature on System Power
ΔPeak Temp. = 3°C
ΔChip Power = 10.8W
ΔSystem Power = 23.5W
![Page 31: Software Optimization for Performance, Energy, and Thermal Distribution: Initial Case Studies](https://reader035.vdocuments.net/reader035/viewer/2022070501/5681692e550346895de073c1/html5/thumbnails/31.jpg)
We presented our initial results in application-level SW optimization for performance, energy and thermal distribution.
Our evaluation infrastructure includes: direct measurements and power/temperature modeling.
We presented 2 case studies: Effect of code restructuring on P, E, and T.
Software optimization reduces system energy and maximum thermal variance by %35 and 56%.
Potential energy savings from temperature optimization: 3°C reduction in peak temperature causes 12.7W
system power savings.
Future work: Expanding the SW tuning strategies for parallel workloads, explicitly focusing on temperature.
Conclusions