

Mitigating Resource Contention and Heterogeneity in Public Clouds for Scientific Modeling Services

Wes Lloyd, Shrideep Pallickara, Olaf David, Mazdak Arabi, Ken Rojas

April 6, 2017

Institute of Technology, University of Washington, Tacoma, Washington USA

IC2E 2017: IEEE International Conference on Cloud Engineering

Outline

• Background
• Research Questions
• Experimental Workloads
• Experiments/Evaluation
• Conclusions



Rosetta Protein Folding

Computational methods for accurate design of new hyperstable constrained peptides

In 53 hours, using 5,904 EC2 compute cores:
• Generated 5.2 million peptide structures
• $3,400 in spot instances
• Upfront cost of a physical cluster to achieve the same result in ~53 hours: $857,752

Cloud enables ad hoc large-scale experimentation


Research Challenges

How can we improve performance and costs for hosting scientific application workloads on the cloud?
• Resource heterogeneity
• Resource contention

Relative to: HPC compute clusters


VM-Type Heterogeneity – Amazon EC2
From: Is the Same Instance Type Created Equal?, 2013 IEEE Transactions on Cloud Computing


Trial-and-Better Resource Provisioning
Z. Ou et al., 2013 IEEE Trans. on Cloud Computing

Using Amazon EC2:
1. Provision instances
2. Perform trial(s) – VM testing
3. Keep desired instances
4. Replace undesirable instances

Test: underlying CPU type
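Below is a minimal sketch of what such a trial-and-better loop could look like with boto3; the AMI id, the desired CPU string, and the probe_cpu callback are illustrative assumptions rather than details taken from the slides or the cited paper.

```python
import boto3

def trial_and_better(vm_type, count, desired_cpu, probe_cpu,
                     image_id="ami-xxxxxxxx", region="us-east-1"):
    """Provision `count` VMs of `vm_type`, keep only those whose backing CPU
    matches `desired_cpu`, and replace the rest.

    probe_cpu(instance) -> str is supplied by the caller, e.g. by SSHing
    into the VM and reading the 'model name' line of /proc/cpuinfo.
    """
    ec2 = boto3.resource("ec2", region_name=region)
    kept = []
    while len(kept) < count:
        needed = count - len(kept)
        # 1. Provision instances of the requested type
        batch = ec2.create_instances(ImageId=image_id, InstanceType=vm_type,
                                     MinCount=needed, MaxCount=needed)
        for inst in batch:
            inst.wait_until_running()
            # 2. Perform the trial: test the underlying CPU type
            if probe_cpu(inst) == desired_cpu:
                kept.append(inst)       # 3. Keep desired instances
            else:
                inst.terminate()        # 4. Replace undesirable instances
    return kept
```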


VM-Scaler
• Web services application
• REST-based / JSON
• Harnesses the EC2 API
• Manages virtual cloud infrastructure
• Supports scientific modeling-as-a-service
• Supports Amazon and Eucalyptus clouds


Trial-and-Better – VM-Scaler
• Harness this approach for VM pools
• Ensure every VM has the same backing CPU
• Provide more consistent test results


Resource Utilization Data Collection

Profile resource utilization for scientific workloads running across many VMs
• Sensor on every VM
• Transmits data to VM-Scaler

CPU:
• CPU time
• cpu_usr: CPU time in user mode
• cpu_krn: CPU time in kernel mode
• cpu_idle: CPU idle time
• contextsw: number of context switches
• cpu_io_wait: CPU time waiting for I/O
• cpu_sint_time: CPU time servicing soft interrupts
• loadavg: load average (# processes / 60 secs)
• cpuSteal: VM CPU ready, but physical CPU unavailable

Disk:
• dsr: disk sector reads
• dsreads: disk sector reads completed
• drm: merged adjacent disk reads
• readtime: time spent reading from disk
• dsw: disk sector writes
• dswrites: disk sector writes completed
• dwm: merged adjacent disk writes
• writetime: time spent writing to disk

Network:
• nbr: network bytes received
• nbs: network bytes sent
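For illustration only (this is not the actual VM-Scaler sensor): several of the CPU metrics above can be sampled on Linux from the aggregate 'cpu' line of /proc/stat and from /proc/loadavg.

```python
def read_cpu_counters():
    """Sample the aggregate 'cpu' line of /proc/stat (cumulative jiffies since boot)."""
    with open("/proc/stat") as f:
        fields = f.readline().split()[1:]          # drop the leading 'cpu' label
    user, nice, system, idle, iowait, irq, softirq, steal = map(int, fields[:8])
    return {
        "cpu_usr": user,            # CPU time in user mode
        "cpu_krn": system,          # CPU time in kernel mode
        "cpu_idle": idle,           # CPU idle time
        "cpu_io_wait": iowait,      # CPU time waiting for I/O
        "cpu_sint_time": softirq,   # CPU time servicing soft interrupts
        "cpuSteal": steal,          # VM ready to run, physical CPU unavailable
    }

def read_loadavg():
    """1-minute load average from /proc/loadavg."""
    with open("/proc/loadavg") as f:
        return float(f.read().split()[0])
```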


CpuSteal

CpuSteal: VM’s CPU core is ready to execute but the physical CPU core is busy

Symptom of overprovisioning physical servers

Factors which cause CpuSteal:
1. Processors shared by too many busy VMs
2. Hypervisor kernel (Xen dom0) is occupying the CPU
3. VM's CPU time share is <100% for one or more cores, while 100% is needed for a CPU-intensive workload


Outline

• Background
• Research Questions
• Experimental Workloads
• Experiments/Evaluation
• Conclusions


Research Questions

RQ1: How common is public cloud VM-type implementation heterogeneity?

RQ2: What performance implications result from VM-type heterogeneity for hosting scientific application workloads?


Research Questions – 2

RQ3: How effective is cpuSteal at identifying VMs with high resource contention due to multi-tenancy (e.g. noisy neighbor VMs) in a public cloud?

RQ4: What are the performance implications of hosting scientific modeling workloads on worker VMs with consistently high cpuSteal measurements in a public cloud? Is there a pattern to cpuSteal behavior across worker VMs over time?


Outline

• Background
• Research Questions
• Experimental Workloads
• Experiments/Evaluation
• Conclusions


CSIP Model Services

• Cloud Services Innovation Platform
• Java-based framework to support development of scientific model services (modeling-as-a-service)
• Increase availability and throughput of models
• Harness scalable cloud infrastructure
• Cloud virtualization supports a variety of legacy software required for scientific applications (e.g. FORTRAN, Visual C++ 6.0, etc.)


Scientific Application Workloads

Rusle2
• Soil erosion from water
• Median runtime: ~1.89 s

WEPS
• Soil erosion from wind
• Median runtime: ~55 s
• Years of weather data × years of crop rotation


Scientific Modeling Workloads - 2

WEPS / RUSLE2 CPU utilization (charts)


Outline

• Background
• Research Questions
• Experimental Workloads
• Experiments/Evaluation
• Conclusions


Testing for VM Type Heterogeneity
• Identified CPU by checking /proc/cpuinfo
• Launched 50 VMs of a given type
• If there was heterogeneity, launched 50 more
• Tested 12 VM types across 3 generations:
  1st: m1.medium, m1.large, m1.xlarge, c1.medium, c1.xlarge
  2nd: m2.xlarge, m2.2xlarge, m2.4xlarge
  3rd: c3.large, c3.xlarge, c3.2xlarge, m3.large
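The CPU check itself is a small read of /proc/cpuinfo; a minimal sketch:

```python
def cpu_model(path="/proc/cpuinfo"):
    """Return the backing CPU 'model name' visible inside the VM,
    e.g. 'Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz'."""
    with open(path) as f:
        for line in f:
            if line.startswith("model name"):
                return line.split(":", 1)[1].strip()
    return "unknown"
```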


Amazon EC2 VM Type Heterogeneity
(backing CPUs listed as: cores, watts, % of VMs observed)

m1.medium (us-east-1c): Intel Xeon E5-2650 v0 (8c, 95 W, 96%); Intel Xeon E5645 (6c, 80 W, 4%)
m2.xlarge (us-east-1c): Intel Xeon X5550 (4c, 95 W, 48%); Intel Xeon E5-2665 v0 (8c, 115 W, 42%)
m1.large (us-east-1d): Intel Xeon E5-2650 v0 (8c, 95 W, 74%); Intel Xeon E5-2651 v2 (12c, 105 W, 19%); Intel Xeon E5645 (6c, 80 W, 7%)
m2.xlarge (us-east-1d): Intel Xeon E5-2665 v0 (8c, 115 W, 78%); Intel Xeon X5550 (4c, 95 W, 22%)


VM Type Heterogeneity Performance Implications

• Tested small 5-VM pools
• Compared the two most abundant hardware implementations:
  m1.large – Intel Xeon E5-2650 v0 (8 cores, 95 W) vs. E5-2651 v2 (12 cores, 105 W)
  m2.xlarge – Intel Xeon E5-2665 v0 (8 cores, 115 W) vs. X5550 (4 cores, 95 W)
• Workloads:
  WEPS: 10 x 100 runs
  RUSLE2: 10 x 660 runs



VM Type Heterogeneity Performance Variation

Performance variance – m1.large: RUSLE2 8%, WEPS 9%
Performance variance – m2.xlarge: RUSLE2 14%, WEPS 4%


Noisy Neighbor Detection Methodology (NN-Detect)

Noisy neighbors cause resource contention and degrade performance of worker VMs. Identify noisy neighbors by analyzing cpuSteal.

Detection method:
Step 1: Execute a processor-intensive workload across a pool of VMs.
Step 2: Capture total cpuSteal for each VM for the workload.
Step 3: Calculate average cpuSteal for the workload (cpuSteal_avg).

Identify NNs using statistical outliers, assigning application-specific thresholds through observation (see the sketch below).
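A sketch of Steps 1–3 plus one possible outlier rule (mean plus k standard deviations); the slides do not give the exact threshold, so the cutoff below is an assumption.

```python
from statistics import mean, stdev

def detect_noisy_neighbors(cpu_steal_totals, k=2.0):
    """cpu_steal_totals: {vm_id: total cpuSteal over the workload} (Step 2).
    Flags VMs whose cpuSteal is a high outlier relative to the pool average."""
    values = list(cpu_steal_totals.values())
    cpu_steal_avg = mean(values)                    # Step 3: pool average
    threshold = cpu_steal_avg + k * stdev(values)   # assumed outlier cutoff
    return [vm for vm, steal in cpu_steal_totals.items() if steal > threshold]
```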


Amazon EC2 CpuSteal Analysis

Test configuration:
• Completed 4 x 1000 WEPS runs over ~5 hours
• VM pools of ~50 VMs (c1.xlarge: 25, m3.medium/m1.medium: 60)
• Round-robin load balancing of runs across pools (see the sketch below)

Per VM type: host CPU (Intel Xeon), average R² of the linear regression, average cpuSteal per core, and % of VMs with noisy neighbors

us-east-1c:
• c3.large (2c): E5-2680 v2 / 10c, R² = .1753, cpuSteal = 2.35, NN = 0%
• m3.large (2c): E5-2670 v2 / 10c, R² = –, cpuSteal = 1.58, NN = 0%
• m1.large (2c): E5-2650 v0 / 8c, R² = .5568, cpuSteal = 7.62, NN = 12%
• m2.xlarge (2c): X5550 / 4c, R² = .4490, cpuSteal = 310.25, NN = 18%
• m1.xlarge (4c): E5-2651 v2 / 12c, R² = .9431, cpuSteal = 7.25, NN = 4%
• m3.medium (1c): E5-2670 v2 / 10c, R² = .0646, cpuSteal = 17683.21, NN = n/a
• c1.xlarge (8c): E5-2651 v2 / 12c, R² = .3658, cpuSteal = 1.86, NN = 0%

us-east-1d:
• m1.medium (1c): E5-2650 v0 / 8c, R² = .4545, cpuSteal = 6.2, NN = 10%
• m2.xlarge (2c): E5-2665 v0 / 8c, R² = .0911, cpuSteal = 3.14, NN = 0%
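The round-robin dispatch of runs across a pool can be sketched as follows; submit_run is a hypothetical hook standing in for the actual run-submission call, which the slides do not show.

```python
from itertools import cycle

def dispatch_round_robin(model_runs, worker_vms):
    """Assign each model run to the next worker VM in the pool, in turn."""
    workers = cycle(worker_vms)
    for run in model_runs:
        submit_run(next(workers), run)

def submit_run(vm, run):
    """Hypothetical submission hook (e.g. an HTTP POST to the model service)."""
    raise NotImplementedError
```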




Key Result #1: 4 of 9 VM types had R² > 0.44 (m1.large, m2.xlarge, m1.xlarge, m1.medium)

Key Result #2: Where cpuSteal could not be predicted, it did not exist; this hardware tended to be CPU-core dense (e.g. 8, 10, or 12 cores).
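The slides do not restate the regression variables behind these R² values; one plausible reading, given RQ4's question about cpuSteal patterns over time, is a linear fit of cumulative cpuSteal against elapsed time, which could be computed as:

```python
import numpy as np

def cpusteal_r_squared(times, cum_steal):
    """R^2 of a linear fit of cumulative cpuSteal against elapsed time."""
    times, cum_steal = np.asarray(times), np.asarray(cum_steal)
    slope, intercept = np.polyfit(times, cum_steal, 1)
    residuals = cum_steal - (slope * times + intercept)
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((cum_steal - cum_steal.mean()) ** 2)
    return 1.0 - ss_res / ss_tot
```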


Noisy Neighbor Performance Degradation

• Compared performance of small 5-VM pools: 5 noisy-neighbor VMs vs. 5 regular VMs
• WEPS: 10 x 100 runs; RUSLE2: 10 x 660 runs
• Results normalized to VM pools without noisy neighbors


EC2 Noisy Neighbor Performance Degradation
(runtime normalized to pools without noisy neighbors; df and p values as reported)

m1.large, E5-2650 v0 / 8c, us-east-1c:
  WEPS 117.68% (df = 9.866, p = 6.847×10^-8); RUSLE2 125.42% (df = 9.003, p = .016)
m2.xlarge, X5550 / 4c, us-east-1c:
  WEPS 107.3% (df = 19.159, p = .05232); RUSLE2 102.76% (df = 25.34, p = 1.73×10^-11)
c1.xlarge, E5-2651 v2 / 12c, us-east-1c:
  WEPS 100.73% (df = 9.54, p = .1456); RUSLE2 102.91% (n.s.)
m1.medium, E5-2650 v0 / 8c, us-east-1d:
  WEPS 111.6% (df = 13.459, p = 6.25×10^-8); RUSLE2 104.32% (df = 9.196, p = 1.173×10^-5)



Key Result #1: Maximum performance loss: WEPS 18%, RUSLE2 25%

Key Result #2: 3 VM types showed significant performance loss (p < .05); average performance loss for WEPS/RUSLE2 ~9%
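The fractional degrees of freedom in the table (e.g. df = 9.866) are consistent with Welch's two-sample t-test; assuming that is the test used, such a comparison of ensemble runtimes could be reproduced with SciPy as:

```python
from scipy import stats

def compare_pools(nn_runtimes, regular_runtimes):
    """Welch's t-test (unequal variances) between noisy-neighbor pool
    and regular pool runtimes; returns the t statistic and p-value."""
    t_stat, p_value = stats.ttest_ind(nn_runtimes, regular_runtimes,
                                      equal_var=False)
    return t_stat, p_value
```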


Outline

• Background
• Research Questions
• Experimental Workloads
• Experiments/Evaluation
• Conclusions


Conclusions

Hardware (CPU-type) heterogeneity is more prevalent for legacy instance type implementations

Performance variance up to ~14%

For large VM pools, 4 of 9 instance types had noisy neighbors which produced statistically significant performance variance

Performance variance up to ~25%


Future Work

VM-Scaler:
• Trial-and-better VM pool creation
• Provide cpuSteal noisy neighbor detection
• Consider: are we detecting ourselves, or other users?

Extend noisy neighbor detection techniques:
• Memory, disk, and network resource contention

Evaluate new instance types, public clouds, and science applications