Adaptive Power Profiling for Many-Core HPC Architectures
Jaimie Kelley, Christopher Stewart (The Ohio State University)
Devesh Tiwari, Saurabh Gupta (Oak Ridge National Laboratory)
Summarized by: Darwin Mach
For: CS 788 – Autonomic Computing, Fall 2017
George Mason University
Overview
• Background & Problem Statement
• Experimental Design
• Observations
• Observations on Power Consumption
• Predicting Peak Power Using Reference Workloads
• Analyzing Power Consumption Profile of Scientific Applications
• Adaptive Power Profiling
Background
• The number of cores available for HPC (high-performance computing) continues to increase
• Computing power is aggregated into "supercomputers" such as Jaguar, Titan, etc.
• Many workloads don’t use every core effectively
• PARSEC benchmark gets 90% of its speed from just 35 out of 442 cores
• NAS workload on Intel Phi gets 85% of its speed from 32 of the 61 cores
• Core scaling: restrict workload to subset of cores
• Side effect: increases peak power use for workload cores
• Increasing demand to set power caps
Problem
• Current HPC schedulers use static workload profiles to allocate resources and adjust provisioning while the workload runs
• But resource contention is dynamic, and many factors can affect it
• Actual power usage is complex and varies
• Static models only determine what's possible, not what actually happens
• How can we accurately predict peak power dynamically, in as little time as possible?
• Peak power because that’s going to determine minimum power cap
Experimental Design
• Power Measurement
• Architectures
• Platforms
• Workloads
Experimental Design: Power Measurement
• Use Intel Running Average Power Limit (RAPL)
• Stores measurements per CPU socket in Model-Specific Registers (MSRs)
• Measure energy and convert to power by associating each reading with a timestamp
• Power = Energy / Time
• Measure every 100 ms
• For Xeon Phi, use micsmc
• Because the Phi is a coprocessor on a PCIe card (at least the model they used is), so RAPL isn't available
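The energy-to-power conversion above can be sketched as follows. This is a minimal sketch, not the authors' code: it assumes the RAPL counter has already been scaled to joules (real MSR reads need the energy units from MSR_RAPL_POWER_UNIT plus counter-wraparound handling), and all names are mine.

```python
from dataclasses import dataclass

@dataclass
class EnergySample:
    timestamp_s: float   # timestamp of the counter read, in seconds
    energy_j: float      # cumulative energy counter, in joules

def power_trace(samples):
    """Convert successive cumulative-energy readings into average power.

    Power over each interval = delta-energy / delta-time, mirroring the
    slide's "measure energy, associate with timestamp" approach.
    """
    trace = []
    for prev, cur in zip(samples, samples[1:]):
        dt = cur.timestamp_s - prev.timestamp_s
        de = cur.energy_j - prev.energy_j
        trace.append(de / dt)  # average watts over this ~100 ms window
    return trace

# Readings 100 ms apart, 4 J consumed per window: ~40 W average per window.
samples = [EnergySample(0.0, 0.0), EnergySample(0.1, 4.0), EnergySample(0.2, 8.0)]
print(power_trace(samples))
```

Sampling every 100 ms, as the slides describe, just means collecting one `EnergySample` per tick and keeping the running list.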
Experimental Design: Architectures
• I7-2600K (I7)
  • On-Demand CPU Governor
• Xeon E5-2670 (Sandy Bridge, SB)
  • P-states (per core)
• Xeon Phi 5110P (Phi)
  • Many Integrated Core (MIC)
  • P-states and C3 (whole CPU)
• All 3 CPUs are P-state and C3 capable. Not sure why they didn't keep this constant. Or did they?
• Also, the I7 and the Xeon E5 are both based on the Sandy Bridge architecture, so better aliases would have been I7 and E5.
Intel Xeon Phi
Source: https://www.intel.com/content/www/us/en/architecture-and-technology/many-integrated-core/intel-many-integrated-core-architecture.html
Intel Xeon Phi (Internal)
Source: https://software.intel.com/en-us/blogs/2014/01/31/intel-xeon-phi-coprocessor-power-management-configuration-using-the-micsmc-command
Observations
• 3 Sections:
• Observations on Power Consumption
• Predicting Peak Power Using Reference Workloads
• Analyzing Power Consumption Profile of Scientific Applications
• An attempt to characterize the peak power of workloads and corresponding HPC components (L1, L2, L3 cache, memory, etc)
Observation 1
• Different architectures have different increases in power
• Floor
• Benchmarks that target only the CPU registers (no caches or memory)
• Ceiling
• Benchmarks that target everything
• I7 and SB have more dramatic increases because other resources (cache, interconnects, etc) scale up with increasing cores
• Phi has a ring bus that is fully powered if even 1 core is in use
Observation 2
• Figure 2A
• Relative peak power increases are different between architectures
• They are also different depending on the chosen workload
• More cores, more variation
Observation 3
• Using a single reference workload to predict the peak power of all others is highly inaccurate
Observation 3 (continued)
• (A) Pairs of workloads that are similar on one architecture can be different on another
• (B) Different parallelization platforms (MPI vs OMP, same workload) can be similar on one architecture and different on another
Observation 4
• Different workloads reach their peak power usage at different times
• That same workload may reach peak power at a different time on a different architecture
• Same workloads have different power profiles from architecture to architecture (see next slide)
Observation 4 (continued)
Observation 5
• Power profiles are similar across different numbers of active cores on the same architecture
• Regardless of parallelization platform (MPI vs OMP)
17
Observation 5 (continued)
• Peak power spikes are seen early in execution
• After 40% of the workload has executed, predicted peak power error is below 5%
Adaptive Power Profiling (APP)
• k% Sampling
• Authors’ Approach
• Evaluation
• Corner Cases
APP: k% Sampling
• Widely used approach (k% sampling)
1. Choose % of workload to run (k%)
2. Run the workload for k% of its run time
3. Collect power usage
• For multiple core-count settings
• Total profiling time = (# core settings) * k%
• e.g., 5 settings (1, 2, 3, 4, 5 cores) require 5 × k% of the workload to be run
• Rationale: consistent with observation #5 (figure 9)
• After a certain % of workload is run, error is minimum
• More cores need more of the workload to run to be accurate
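The linear cost of plain k% sampling can be sketched as follows (the function and parameter names are mine, not the paper's):

```python
def k_percent_profiling_cost(full_runtime_s, k_percent, core_settings):
    """Total profiling time for plain k% sampling.

    Each core-scaling setting (e.g. 1, 2, 3, 4, 5 cores) gets its own
    run of k% of the workload, so total cost is
    len(core_settings) * (k / 100) * full_runtime.
    """
    per_run_s = (k_percent / 100.0) * full_runtime_s
    return per_run_s * len(core_settings)

# A 1000 s workload profiled at k = 10% across 5 core settings
# costs 5 * 100 s = 500 s of profiling time.
print(k_percent_profiling_cost(1000.0, 10.0, [1, 2, 3, 4, 5]))  # -> 500.0
```

This is exactly the overhead APP tries to cut: the k% requested up front is conservative, and each additional core setting repeats it in full.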
APP: Authors’ Approach
1. Profile k% using maximum core count
2. Construct estimation error curve
3. Find the normalized run time where the error falls below the user-specified maximum error; this becomes the new k%
4. Profile the remaining core-scaling settings using the new k%
APP: Authors' Approach (continued)
• Run k% for max cores and collect a power trace
• PP(i) = power at time i
• Peak power so far = PPmax(i)
• Calculate the expected error curve, PPEC(i)
• For user-specified accuracy (a%), find i where PPEC(i) < a
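One plausible reading of these quantities, sketched below: it assumes PPEC(i) is the relative error of the running peak PPmax(i) against the final observed peak (the slides don't give the exact formula), and all names are mine, not the authors'.

```python
def running_peak(power_trace):
    """PPmax(i): peak power observed up to and including sample i."""
    peaks, cur = [], float("-inf")
    for p in power_trace:
        cur = max(cur, p)
        peaks.append(cur)
    return peaks

def estimation_error_curve(power_trace):
    """PPEC(i): relative error (%) of the running peak vs. the final peak.

    Assumes the final running peak approximates the workload's true peak.
    """
    peaks = running_peak(power_trace)
    final = peaks[-1]
    return [100.0 * (final - p) / final for p in peaks]

def find_k_percent(power_trace, a_percent):
    """Earliest normalized time (% of the trace) where PPEC(i) < a%."""
    ppec = estimation_error_curve(power_trace)
    for i, err in enumerate(ppec):
        if err < a_percent:
            return 100.0 * (i + 1) / len(ppec)
    return 100.0

# Peak power (100 W) appears halfway through this toy trace, so a
# 5% error bound is met at 50% of the trace.
trace = [60, 80, 95, 100, 90, 85, 99, 100]
print(find_k_percent(trace, 5.0))  # -> 50.0
```

The returned percentage is the new, tighter k% that the remaining core-scaling settings are profiled with, which is where APP saves time over plain k% sampling.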
APP: Authors' Approach (continued)
Evaluation: Time Used vs Requested
• Shows a reduction in actual time used vs. the k% method, which simply uses the requested time
• Not sure of the criteria used to determine workload percentiles
Evaluation: APP vs k%
Shows the difference in prediction between the authors' method and the k% method used as a baseline
Evaluation: Relaxing Accuracy
• Relaxing the accuracy requirement (a%) doesn't necessarily mean predictions will be inaccurate by that same amount
• Not sure of the criteria used to determine workload percentiles
Evaluation: Accuracy vs Profiling Time
• Small changes relaxing accuracy (up to 5%) greatly reduce time to profile
• Characterizes the accuracy vs. profiling time tradeoff
• Not sure of the criteria used to determine workload percentiles
Corner Cases
• Comparison of finding k% for APP with min, median, and max core counts
• Starting with the max core count to find k% (as previously described) is optimal
• Would be helpful to state how many cores were used for each architecture
  • Likely Min = 2 for all of them
  • Median for SB = 4
  • What about I7? Phi?
  • Doesn't fit the 2^n constraint for benchmarks they imposed in the experimental design