TRANSCRIPT
IC2E 2017 – Wes J. Lloyd 4/6/2017
Mitigating Resource Contention and Heterogeneity in Public Clouds for Scientific Modeling Services
Wes Lloyd, Shrideep Pallickara, Olaf David, Mazdak Arabi, Ken Rojas
April 6, 2017
Institute of Technology, University of Washington, Tacoma, Washington USA
IC2E 2017: IEEE International Conference on Cloud Engineering
Outline
• Background
• Research Questions
• Experimental Workloads
• Experiments/Evaluation
• Conclusions
Rosetta Protein Folding
Computational methods for accurate design of new hyperstable constrained peptides
In 53 hours, using 5,904 EC2 compute cores:
• Generated 5.2 million peptide structures
• Cost $3,400 using spot instances
Upfront cost of a physical cluster to achieve the same result in ~53 hours: $857,752
The cloud enables ad hoc large-scale experimentation
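The cost comparison above can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch using the figures quoted on the slide; the variable names are illustrative.

```python
# Figures quoted on the slide.
CORES = 5_904
HOURS = 53
SPOT_COST_USD = 3_400
CLUSTER_COST_USD = 857_752

core_hours = CORES * HOURS                        # total compute consumed
cost_per_core_hour = SPOT_COST_USD / core_hours   # effective spot price per core-hour
cost_ratio = CLUSTER_COST_USD / SPOT_COST_USD     # cluster purchase vs. cloud run

print(f"core-hours: {core_hours:,}")
print(f"spot cost per core-hour: ${cost_per_core_hour:.4f}")
print(f"cluster / cloud cost ratio: {cost_ratio:.0f}x")
```

The run works out to roughly 313 thousand core-hours at about one cent per core-hour, and the upfront cluster cost is about 250 times the cloud bill.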
Research Challenges
How can we improve performance and costs for hosting scientific application workloads on the cloud?
• Resource heterogeneity
• Resource contention
Relative to: HPC compute clusters
VM-type heterogeneity - Amazon EC2
From: "Is The Same Instance Type Created Equal", 2013 IEEE Transactions on Cloud Computing
Trial-and-better Resource Provisioning
Z. Ou et al., 2013 IEEE Trans. on Cloud Computing
Using Amazon EC2:
1. Provision instances
2. Perform trial(s) - VM testing
3. Keep desired instances
4. Replace undesirable instances
Test: Underlying CPU Type
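The four steps above can be sketched as a simple loop. This is a minimal illustration, not VM-Scaler's implementation: `launch_vm` and `terminate_vm` are hypothetical callbacks standing in for real cloud API calls, and the simulated cloud below uses a made-up 96%/4% CPU mix.

```python
import itertools
import random
from typing import Callable, List, Tuple

VM = Tuple[str, str]  # (instance id, backing CPU model)


def trial_and_better(launch_vm: Callable[[], VM],
                     terminate_vm: Callable[[VM], None],
                     wanted_cpu: str,
                     pool_size: int,
                     max_launches: int = 200) -> List[VM]:
    """Launch VMs one at a time, keep those backed by wanted_cpu,
    and terminate the rest, until the pool is full or the budget runs out."""
    pool: List[VM] = []
    for _ in range(max_launches):
        if len(pool) == pool_size:
            break
        vm = launch_vm()              # step 1: provision
        if vm[1] == wanted_cpu:       # step 2: trial (check the backing CPU)
            pool.append(vm)           # step 3: keep desired instance
        else:
            terminate_vm(vm)          # step 4: replace undesirable instance
    return pool


# Simulated cloud (assumption): 96% of launches land on one CPU type.
rng = random.Random(42)
ids = itertools.count(1)


def fake_launch() -> VM:
    cpu = rng.choices(["E5-2650 v0", "E5645"], weights=[96, 4])[0]
    return (f"i-{next(ids):04d}", cpu)


terminated: List[VM] = []
pool = trial_and_better(fake_launch, terminated.append, "E5-2650 v0", pool_size=5)
print(len(pool), "VMs kept;", len(terminated), "terminated")
```

The `max_launches` cap is a design choice for the sketch: without it, a request for a CPU type that is rare or absent in a zone would loop indefinitely.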
VM-Scaler
• Web services application
• REST-based/JSON
• Harnesses EC2 API
• Manages virtual cloud infrastructure
• Supports scientific modeling-as-a-service
• Supports Amazon, Eucalyptus clouds
Trial and Better – VM-Scaler
• Harness this approach for VM pools
• Ensure every VM has the same backing CPU
• Provide more consistent test results
Resource Utilization Data Collection
Profile resource utilization for scientific workloads running across many VMs
• Sensor on every VM
• Transmits data to VM-Scaler
CPU
- CPU time
- cpu_usr: CPU time in user mode
- cpu_krn: CPU time in kernel mode
- cpu_idle: CPU idle time
- contextsw: # of context switches
- cpu_io_wait: CPU time waiting for I/O
- cpu_sint_time: CPU time serving soft interrupts
- loadavg: (# proc / 60 secs)
- cpuSteal: VM CPU ready, physical CPU unavailable
Disk
- dsr: disk sector reads
- dsreads: disk sector reads completed
- drm: merged adjacent disk reads
- readtime: time spent reading from disk
- dsw: disk sector writes
- dswrites: disk sector writes completed
- dwm: merged adjacent disk writes
- writetime: time spent writing to disk
Network
- nbr: network bytes received
- nbs: network bytes sent
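On Linux, several of the CPU metrics listed above can be read from the aggregate `cpu` line of `/proc/stat`. The sketch below shows one way a sensor might do this. The mapping from kernel counters to the slide's metric names is an assumption, and the sample line is invented for illustration.

```python
from typing import Dict

# Order of the first eight counters on the aggregate "cpu" line of /proc/stat.
PROC_STAT_FIELDS = ("user", "nice", "system", "idle",
                    "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line: str) -> Dict[str, int]:
    """Map the aggregate 'cpu' line of /proc/stat to the slide's metric names.

    The name mapping is an assumption made for this sketch."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    ticks = dict(zip(PROC_STAT_FIELDS, map(int, parts[1:])))
    return {
        "cpu_usr": ticks["user"],
        "cpu_krn": ticks["system"],
        "cpu_idle": ticks["idle"],
        "cpu_io_wait": ticks["iowait"],
        "cpu_sint_time": ticks["softirq"],
        "cpuSteal": ticks["steal"],
    }


# Invented sample; on a real VM, read the first line of /proc/stat instead.
sample = "cpu  4705 356 584 3699 23 0 17 42 0 0"
metrics = parse_cpu_line(sample)
print(metrics)
```

The counters are cumulative ticks since boot, so a sensor would report differences between successive samples rather than the raw values.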
CpuSteal
CpuSteal: the VM's CPU core is ready to execute but the physical CPU core is busy
Symptom of over-provisioning physical servers
Factors which cause CpuSteal:
1. Processors shared by too many busy VMs
2. Hypervisor kernel (Xen dom0) is occupying the CPU
3. VM's CPU time share is <100% for 1 or more cores, and 100% is needed for a CPU-intensive workload
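To see how much stolen time a workload experienced, one option is to compare two samples of the cumulative CPU tick counters and express the stolen ticks as a share of all elapsed ticks. This is a sketch under the assumption that counters are ordered as on the `cpu` line of Linux `/proc/stat`; the sample values are invented.

```python
from typing import Sequence

STEAL_INDEX = 7  # position of "steal" among user, nice, system, idle, iowait, irq, softirq, steal


def steal_percent(before: Sequence[int], after: Sequence[int]) -> float:
    """Percentage of elapsed CPU ticks that were stolen between two samples."""
    deltas = [a - b for a, b in zip(after, before)]
    total = sum(deltas)
    if total <= 0:
        return 0.0  # no time elapsed (or counters reset); report no steal
    return 100.0 * deltas[STEAL_INDEX] / total


# Invented samples taken before and after a workload run.
before = (100, 0, 50, 800, 10, 0, 5, 35)
after = (600, 0, 150, 1000, 20, 0, 10, 220)
print(f"cpuSteal share: {steal_percent(before, after):.1f}%")
```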
Research Questions
RQ1: How common is public cloud VM-type implementation heterogeneity?
RQ2: What performance implications result from VM-type heterogeneity for hosting scientific application workloads?
Research Questions - 2
RQ3: How effective is cpuSteal at identifying VMs with high resource contention due to multi-tenancy (e.g. noisy neighbor VMs) in a public cloud?
RQ4: What are the performance implications of hosting scientific modeling workloads on worker VMs with consistently high cpuSteal measurements in a public cloud? Is there a pattern to cpuSteal behavior across worker VMs over time?
CSIP Model Services
Cloud Services Innovation Platform
• Java-based framework to support development of scientific model services (modeling-as-a-service)
• Increase availability and throughput of models
• Harness scalable cloud infrastructure
• Cloud virtualization supports the variety of legacy software required for scientific applications (e.g. FORTRAN, Visual C++ 6.0, etc.)
Scientific Application Workloads
RUSLE2
• Soil erosion from water
• Median runtime ~1.89s
WEPS
• Soil erosion from wind
• Median runtime ~55s
• Years of weather data * years of crop rotation
Scientific Modeling Workloads - 2
WEPS / RUSLE2 CPU utilization (figure)
Testing for VM Type Heterogeneity
• Identified CPU by checking /proc/cpuinfo
• Launched 50 VMs of a given type
• If there was heterogeneity, launched 50 more
• Tested 12 VM types, across 3 generations
- 1st: m1.medium, m1.large, m1.xlarge, c1.medium, c1.xlarge
- 2nd: m2.xlarge, m2.2xlarge, and m2.4xlarge
- 3rd: c3.large, c3.xlarge, c3.2xlarge, m3.large
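The check described above can be sketched as follows: extract the `model name` field from each VM's `/proc/cpuinfo` text and tally the distinct models seen across the pool. The function names and the sample strings are illustrative assumptions, not output captured from the experiments.

```python
from collections import Counter
from typing import Iterable, Optional


def cpu_model(cpuinfo_text: str) -> Optional[str]:
    """Return the first 'model name' value found in /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "model name":
            return value.strip()
    return None


def tally_models(cpuinfo_texts: Iterable[str]) -> Counter:
    """Count how many VMs report each backing CPU model."""
    return Counter(cpu_model(text) for text in cpuinfo_texts)


# Illustrative /proc/cpuinfo excerpts gathered from three VMs of one type.
vm_reports = [
    "processor\t: 0\nmodel name\t: Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz\n",
    "processor\t: 0\nmodel name\t: Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz\n",
    "processor\t: 0\nmodel name\t: Intel(R) Xeon(R) CPU E5645 @ 2.40GHz\n",
]
tally = tally_models(vm_reports)
heterogeneous = len(tally) > 1  # more than one model means launch more VMs
print(tally, heterogeneous)
```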
Amazon EC2 VM Type Heterogeneity
(c = cores, w = watts, % = share of VMs observed)

VM type    Region      Backing CPU                                   Backing CPU
m1.medium  us-east-1c  Intel Xeon E5-2650 v0 (8c, 95w, 96%)          Intel Xeon E5645 (6c, 80w, 4%)
m2.xlarge  us-east-1c  Intel Xeon X5550 (4c, 95w, 48%)               Intel Xeon E5-2665 v0 (8c, 115w, 42%)
m1.large   us-east-1d  Intel Xeon E5-2650 v0 (8c, 95w, 74%)          Intel Xeon E5-2651 v2 (12c, 105w, 19%)
m1.large   us-east-1d  Intel Xeon E5645 (6c, 80w, 7%)                --
m2.xlarge  us-east-1d  Intel Xeon E5-2665 v0 (8c, 115w, 78%)         Intel Xeon X5550 (4c, 95w, 22%)
VM Type Heterogeneity Performance Implications
• Tested small 5-VM pools
• Compared the two most abundant hardware implementations
- m1.large - Intel Xeon E5-2650 v0, 8 cores, 95 w vs. E5-2651 v2, 12 cores, 105 w
- m2.xlarge - Intel Xeon E5-2665 v0, 8 cores, 115 w vs. X5550, 4 cores, 95 w
• Workloads
- WEPS: 10 x 100 runs
- RUSLE2: 10 x 660 runs
VM Type Heterogeneity Performance Variation
Performance Variance - m1.large
RUSLE2 8%, WEPS 9%
Performance Variance – m2.xlarge
RUSLE2 14%, WEPS 4%
Noisy Neighbor Detection Methodology (NN-Detect)
Noisy neighbors cause resource contention and degrade performance of worker VMs
Identify noisy neighbors by analyzing cpuSteal
Detection method:
Step 1: Execute a processor-intensive workload across a pool of VMs.
Step 2: Capture total cpuSteal for each VM for the workload.
Step 3: Calculate average cpuSteal for the workload (cpuSteal_avg).
Identify NNs using statistical outliers, and by assigning application-specific thresholds through observation…
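The detection steps above can be sketched in a few lines. The slide does not state the exact outlier rule, so the rule used here (flag a VM whose total cpuSteal exceeds a multiple of the pool average and also clears a minimum threshold) is an assumption. `multiplier` and `min_steal` are hypothetical, application-specific parameters, and the sample data is invented.

```python
from statistics import mean
from typing import Dict, List


def detect_noisy_neighbors(steal_by_vm: Dict[str, float],
                           multiplier: float = 2.0,
                           min_steal: float = 0.0) -> List[str]:
    """Return the VMs whose total cpuSteal is an outlier relative to the pool.

    Assumed rule: steal > multiplier * pool average, and steal >= min_steal."""
    if not steal_by_vm:
        return []
    steal_avg = mean(steal_by_vm.values())   # step 3: cpuSteal_avg
    return sorted(
        vm for vm, steal in steal_by_vm.items()
        if steal > multiplier * steal_avg and steal >= min_steal
    )


# Invented totals captured per VM for one workload run (step 2).
pool_steal = {"vm-a": 10, "vm-b": 12, "vm-c": 11, "vm-d": 95, "vm-e": 9}
print(detect_noisy_neighbors(pool_steal))
```

A minimum threshold is useful when the whole pool has near-zero cpuSteal, where a multiple of the average would otherwise flag VMs over negligible differences.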
Amazon EC2 CpuSteal Analysis

VM Type        Host CPU (Intel Xeon)  Average R² linear reg.  Average cpuSteal per core  % with Noisy Neighbors
us-east-1c
c3.large-2c    E5-2680v2/10c          .1753                   2.35                       0%
m3.large-2c    E5-2670v2/10c          -                       1.58                       0%
m1.large-2c    E5-2650v0/8c           .5568                   7.62                       12%
m2.xlarge-2c   X5550/4c               .4490                   310.25                     18%
m1.xlarge-4c   E5-2651v2/12c          .9431                   7.25                       4%
m3.medium-1c   E5-2670v2/10c          .0646                   17683.21                   n/a
c1.xlarge-8c   E5-2651v2/12c          .3658                   1.86                       0%
us-east-1d
m1.medium-1c   E5-2650v0/8c           .4545                   6.2                        10%
m2.xlarge-2c   E5-2665v0/8c           .0911                   3.14                       0%

Test Configuration:
• Completed 4 x 1000 WEPS runs over ~5 hours
• ~50 VM pools (c1.xlarge 25, m3/m1.medium 60)
• Round robin load balancing of runs across pools
Key Result #1
4 of 9 VM types had R² > 0.44: m1.large, m2.xlarge, m1.xlarge, m1.medium
Key Result #2
Where cpuSteal could not be predicted, it did not exist. This hardware tended to be CPU core dense (e.g. 8, 10, or 12 cores).
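The R² values above come from linear regression. The slides do not say which variables were regressed, so the sketch below assumes the regression relates each VM's cpuSteal in one batch of runs to its cpuSteal in a later batch. For simple linear regression, R² equals the squared Pearson correlation, which is what the function computes. The data is invented.

```python
from statistics import mean
from typing import Sequence


def r_squared(x: Sequence[float], y: Sequence[float]) -> float:
    """Coefficient of determination for a simple linear regression of y on x."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally sized samples with at least 2 points")
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        return 0.0  # a constant series carries no predictive information
    return (sxy * sxy) / (sxx * syy)


# Invented per-VM cpuSteal totals for two batches of runs (assumption).
batch1 = [3.0, 8.0, 2.0, 40.0, 5.0]
batch2 = [4.0, 9.0, 2.5, 37.0, 6.0]
print(f"R^2 = {r_squared(batch1, batch2):.4f}")
```

Under this reading, a high R² means that VMs with high cpuSteal early on tend to keep it, which is what makes early detection useful.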
Noisy Neighbor Performance Degradation
• Compared performance of small 5-VM pools
- 5 noisy-neighbor VMs
- 5 regular VMs
• WEPS: 10 x 100 runs
• RUSLE2: 10 x 660 runs
• Normalized results to VM pools w/o NNs
EC2 Noisy Neighbor Performance Degradation

VM type (host CPU)        Region      WEPS                                 RUSLE2
m1.large E5-2650v0/8c     us-east-1c  117.68%, df=9.866, p=6.847·10^-8     125.42%, df=9.003, p=.016
m2.xlarge X5550/4c        us-east-1c  107.3%, df=19.159, p=.05232          102.76%, df=25.34, p=1.73·10^-11
c1.xlarge E5-2651v2/12c   us-east-1c  100.73%, df=9.54, p=.1456            102.91%, n.s.
m1.medium E5-2650v0/8c    us-east-1d  111.6%, df=13.459, p=6.25·10^-8      104.32%, df=9.196, p=1.173·10^-5
Key Result #1
Maximum performance loss: WEPS 18%, RUSLE2 25%
Key Result #2
3 VM types with significant performance loss (p < .05)
Average performance loss: WEPS/RUSLE2 ~9%
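The fractional degrees of freedom reported in the table are consistent with an unequal-variance (Welch) t-test, although the slides do not name the test. Under that assumption, the sketch below computes a normalized runtime percentage, the Welch t statistic, and the Welch-Satterthwaite degrees of freedom for two sets of runtimes. The runtimes are invented, and a p-value would additionally require a t-distribution CDF, for example from a statistics library.

```python
from math import sqrt
from statistics import mean, variance
from typing import Sequence, Tuple


def normalized_percent(nn: Sequence[float], baseline: Sequence[float]) -> float:
    """Mean runtime with noisy neighbors as a percentage of the baseline mean."""
    return 100.0 * mean(nn) / mean(baseline)


def welch_t(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float]:
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    na, nb = len(a), len(b)
    va, vb = variance(a) / na, variance(b) / nb   # squared standard errors
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df


# Invented batch runtimes in seconds: pools with and without noisy neighbors.
nn_runs = [61.2, 63.5, 60.8, 64.1, 62.7]
regular_runs = [55.1, 54.8, 55.6, 55.0, 54.9]
t_stat, dof = welch_t(nn_runs, regular_runs)
print(f"normalized: {normalized_percent(nn_runs, regular_runs):.2f}%")
print(f"t = {t_stat:.3f}, df = {dof:.3f}")
```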
Conclusions
• Hardware (CPU-type) heterogeneity is more prevalent for legacy instance implementations
- Performance variance up to ~14%
• For large VM pools, 4 of 9 instance types had noisy neighbors, which produced statistically significant performance variance
- Performance variance up to ~25%
Future Work
• VM-Scaler: trial-and-better VM pool creation
- Provide cpuSteal noisy neighbor detection
- Consider: are we detecting ourselves, or other users?
• Extend noisy neighbor detection techniques
- Memory, disk, network resource contention
• Evaluate new instance types, public clouds, and science applications