Adaptive Power Profiling for Many-Core HPC Architectures
Jaimie Kelley, Christopher Stewart (The Ohio State University)
Devesh Tiwari, Saurabh Gupta (Oak Ridge National Laboratory)
Summarized by: Darwin Mach
For: CS 788 – Autonomic Computing, Fall 2017
George Mason University
Overview
• Background & Problem Statement
• Experimental Design
• Observations
• Observations on Power Consumption
• Predicting Peak Power Using Reference Workloads
• Analyzing Power Consumption Profile of Scientific Applications
• Adaptive Power Profiling
Background
• The number of cores available for HPC (high-performance computing) continues to increase
• Computing power is aggregated into "supercomputers" such as Jaguar, Titan, etc.
• Many workloads don’t use every core effectively
• PARSEC benchmark gets 90% of its speed from just 35 out of 442 cores
• NAS workload on Intel Phi gets 85% of its speed from 32 of the 61 cores
• Core scaling: restrict workload to subset of cores
• Side effect: increases peak power use for workload cores
• Increasing demand to set power caps
Problem
• Current HPC schedulers use static workload profiles to allocate resources and adjust provisioning while the workload runs
• But resource contention is dynamic, and many factors can affect it
• Actual power usage is complex and varies
• Static models only determine what's possible, not what actually happens
• How can we accurately predict peak power dynamically, in as little time as possible?
• Peak power because that’s going to determine minimum power cap
Experimental Design
• Power Measurement
• Architectures
• Platforms
• Workloads
Experimental Design: Power Measurement
• Use Intel Running Average Power Limit (RAPL)
• Stores measurements per CPU socket in Model-Specific Registers (MSRs)
• Measure energy and convert to power by associating each reading with a timestamp
• Power = Energy / Time
• Measure every 100 ms
• For Xeon Phi, use micsmc
• Because the Phi is a coprocessor on a PCIe card (at least the model they used is), so RAPL isn't available
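The energy-to-power conversion above can be sketched as follows. This is a minimal sketch, not the authors' code: it assumes the RAPL counter has already been scaled to joules (real MSR reads need the energy units from MSR_RAPL_POWER_UNIT plus counter-wraparound handling), and all names are mine.

```python
from dataclasses import dataclass

@dataclass
class EnergySample:
    timestamp_s: float   # timestamp of the counter read, in seconds
    energy_j: float      # cumulative energy counter, in joules

def power_trace(samples):
    """Convert successive cumulative-energy readings into average power.

    Power over each interval = delta-energy / delta-time, mirroring the
    slide's "measure energy, associate with timestamp" approach.
    """
    trace = []
    for prev, cur in zip(samples, samples[1:]):
        dt = cur.timestamp_s - prev.timestamp_s
        de = cur.energy_j - prev.energy_j
        trace.append(de / dt)  # average watts over this ~100 ms window
    return trace

# Readings 100 ms apart, 4 J consumed per window: ~40 W average per window.
samples = [EnergySample(0.0, 0.0), EnergySample(0.1, 4.0), EnergySample(0.2, 8.0)]
print(power_trace(samples))
```

Sampling every 100 ms, as the slides describe, just means collecting one `EnergySample` per tick and keeping the running list.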
Experimental Design: Architectures
• I7-2600K (I7)
  • On-Demand CPU Governor
• Xeon E5-2670 (Sandy Bridge, SB)
  • P-states (per core)
• Xeon Phi 5110P (Phi)
  • Many Integrated Core (MIC)
  • P-states and C3 (whole CPU)
• All 3 CPUs are P-state and C3 capable. Not sure why they didn't keep this constant. Or did they?
• Also, the I7 and the Xeon E5 are both based on the Sandy Bridge architecture, so better aliases would have been I7 and E5.
Intel Xeon Phi
Source: https://www.intel.com/content/www/us/en/architecture-and-technology/many-integrated-core/intel-many-integrated-core-architecture.html
Intel Xeon Phi (Internal)
Source: https://software.intel.com/en-us/blogs/2014/01/31/intel-xeon-phi-coprocessor-power-management-configuration-using-the-micsmc-command
Observations
• 3 Sections:
• Observations on Power Consumption
• Predicting Peak Power Using Reference Workloads
• Analyzing Power Consumption Profile of Scientific Applications
• An attempt to characterize the peak power of workloads and corresponding HPC components (L1, L2, L3 cache, memory, etc)
Observation 1
• Different architectures have different increases in power
• Floor
• Benchmarks that target only the CPU registers (no caches or memory)
• Ceiling
• Benchmarks that target everything
• I7 and SB have more dramatic increases because other resources (cache, interconnects, etc) scale up with increasing cores
• Phi has a ring bus that is fully powered if even 1 core is in use
Observation 2
• Figure 2A
• Relative peak power increases are different between architectures
• They are also different depending on the chosen workload
• More cores, more variation
Observation 3
• Using a single reference workload to predict the peak power of all others is highly inaccurate
Observation 3 (continued)
• (A) Pairs of workloads that are similar on one architecture can be different on another
• (B) Different parallelization platforms (MPI vs OMP, same workload) can be similar on one architecture and different on another
Observation 4
• Different workloads reach their peak power usage at different times
• That same workload may reach peak power at a different time on a different architecture
• Same workloads have different power profiles from architecture to architecture (see next slide)
Observation 4 (continued)
Observation 5
• Power profiles are similar across different numbers of active cores on the same architecture
• Regardless of parallelization platform (MPI vs OMP)
17
Observation 5 (continued)
• Peak power spikes are seen early in execution
• After 40% of the workload has executed, predicted peak power error is below 5%
Adaptive Power Profiling (APP)
• k% Sampling
• Authors’ Approach
• Evaluation
• Corner Cases
APP: k% Sampling
• Widely used approach (k% sampling)
1. Choose % of workload to run (k%)
2. Run the workload for k% of its run time
3. Collect power usage
• For multiple core-count settings
• Total profiling time = (# core settings) * k%
• e.g., 5 settings (1, 2, 3, 4, 5 cores) require 5 × k% of the workload to be run
• Rationale: consistent with observation #5 (figure 9)
• After a certain % of workload is run, error is minimum
• More cores need more of the workload to run to be accurate
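The linear cost of plain k% sampling can be sketched as follows (the function and parameter names are mine, not the paper's):

```python
def k_percent_profiling_cost(full_runtime_s, k_percent, core_settings):
    """Total profiling time for plain k% sampling.

    Each core-scaling setting (e.g. 1, 2, 3, 4, 5 cores) gets its own
    run of k% of the workload, so total cost is
    len(core_settings) * (k / 100) * full_runtime.
    """
    per_run_s = (k_percent / 100.0) * full_runtime_s
    return per_run_s * len(core_settings)

# A 1000 s workload profiled at k = 10% across 5 core settings
# costs 5 * 100 s = 500 s of profiling time.
print(k_percent_profiling_cost(1000.0, 10.0, [1, 2, 3, 4, 5]))  # -> 500.0
```

This is exactly the overhead APP tries to cut: the k% requested up front is conservative, and each additional core setting repeats it in full.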
APP: Authors’ Approach
1. Profile k% using maximum core count
2. Construct estimation error curve
3. Find the normalized run time where the error falls below the user-specified maximum error; this becomes the new k%
4. Profile the remaining core-scaling settings using the new k%
APP: Authors' Approach (continued)
• Run k% for max cores and collect a power trace
• PP(i) = power at time i
• Peak power so far = PPmax(i)
• Calculate the expected error curve, PPEC(i)
• For user-specified accuracy (a%), find i where PPEC(i) < a
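One plausible reading of these quantities, sketched below: it assumes PPEC(i) is the relative error of the running peak PPmax(i) against the final observed peak (the slides don't give the exact formula), and all names are mine, not the authors'.

```python
def running_peak(power_trace):
    """PPmax(i): peak power observed up to and including sample i."""
    peaks, cur = [], float("-inf")
    for p in power_trace:
        cur = max(cur, p)
        peaks.append(cur)
    return peaks

def estimation_error_curve(power_trace):
    """PPEC(i): relative error (%) of the running peak vs. the final peak.

    Assumes the final running peak approximates the workload's true peak.
    """
    peaks = running_peak(power_trace)
    final = peaks[-1]
    return [100.0 * (final - p) / final for p in peaks]

def find_k_percent(power_trace, a_percent):
    """Earliest normalized time (% of the trace) where PPEC(i) < a%."""
    ppec = estimation_error_curve(power_trace)
    for i, err in enumerate(ppec):
        if err < a_percent:
            return 100.0 * (i + 1) / len(ppec)
    return 100.0

# Peak power (100 W) appears halfway through this toy trace, so a
# 5% error bound is met at 50% of the trace.
trace = [60, 80, 95, 100, 90, 85, 99, 100]
print(find_k_percent(trace, 5.0))  # -> 50.0
```

The returned percentage is the new, tighter k% that the remaining core-scaling settings are profiled with, which is where APP saves time over plain k% sampling.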
APP: Authors' Approach (continued)
Evaluation: Time Used vs Requested
• Shows a reduction in actual time used vs. the k% method, which simply uses the requested time
• Not sure of the criteria used to determine workload percentiles
Evaluation: APP vs k%
Shows the difference in prediction between the authors' method and the k% method used as a baseline
Evaluation: Relaxing Accuracy
• Relaxing the accuracy requirement (a%) doesn't necessarily mean predictions will be inaccurate by that same amount
• Not sure of the criteria used to determine workload percentiles
Evaluation: Accuracy vs Profiling Time
• Small changes relaxing accuracy (up to 5%) greatly reduce time to profile
• Characterizes the accuracy vs. profiling time tradeoff
• Not sure of the criteria used to determine workload percentiles
Corner Cases
• Comparison of finding k% for APP with min, median, and max core counts
• Starting with the max core count to find k% (as previously described) is optimal
• Would be helpful to state how many cores were used for each architecture
  • Likely Min = 2 for all of them
  • Median for SB = 4
  • What about I7? Phi?
  • Doesn't fit the 2^n constraint for benchmarks they imposed in the experimental design