spoton: a batch computing service for the spot …spoton: a batch computing service for the spot...

49
SpotOn: A Batch Computing Service for the Spot Market Supreeth Subramanya, Tian Guo, Prateek Sharma, David Irwin, Prashant Shenoy University of Massachusetts Amherst

Upload: others

Post on 10-Jul-2020

15 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: SpotOn: A Batch Computing Service for the Spot …SpotOn: A Batch Computing Service for the Spot Market Supreeth Subramanya, Tian Guo, Prateek Sharma, David Irwin, Prashant Shenoy

SpotOn: A Batch Computing Service

for the Spot Market

Supreeth Subramanya, Tian Guo, Prateek Sharma, David Irwin, Prashant Shenoy

University of Massachusetts Amherst

Page 2: SpotOn: A Batch Computing Service for the Spot …SpotOn: A Batch Computing Service for the Spot Market Supreeth Subramanya, Tian Guo, Prateek Sharma, David Irwin, Prashant Shenoy

infrastructure cloud

On-demand

Page 3: SpotOn: A Batch Computing Service for the Spot …SpotOn: A Batch Computing Service for the Spot Market Supreeth Subramanya, Tian Guo, Prateek Sharma, David Irwin, Prashant Shenoy

infrastructure cloud

On-demand

Cost vs. Availability Tradeoff in IaaS CloudsC

heap

Expe

nsiv

e

Guaranteed, Non-revocable

Not guaranteed, Non-revocable

Not guaranteed, Revocable

Cos

t (pe

r hou

r)

Availability

Reserved

Spot

Page 4: SpotOn: A Batch Computing Service for the Spot …SpotOn: A Batch Computing Service for the Spot Market Supreeth Subramanya, Tian Guo, Prateek Sharma, David Irwin, Prashant Shenoy

infrastructure cloud

Spot Instances

Preemptible VM

On-demand

Cost vs. Availability Tradeoff in IaaS CloudsC

heap

Expe

nsiv

e

Guaranteed, Non-revocable

Not guaranteed, Non-revocable

Not guaranteed, Revocable

Cos

t (pe

r hou

r)

Availability

Reserved

Spot

Page 5: SpotOn: A Batch Computing Service for the Spot …SpotOn: A Batch Computing Service for the Spot Market Supreeth Subramanya, Tian Guo, Prateek Sharma, David Irwin, Prashant Shenoy

spot markets

Amazon EC2

Bid in a 2nd price auction Acquire when bid > spot price Terminate when spot price > bid

Time of the day

Pric

e

Spot User bid

Page 6: SpotOn: A Batch Computing Service for the Spot …SpotOn: A Batch Computing Service for the Spot Market Supreeth Subramanya, Tian Guo, Prateek Sharma, David Irwin, Prashant Shenoy

spot markets

Amazon EC2

Bid in a 2nd price auction Acquire when bid > spot price Terminate when spot price > bid

How do we mitigate the impact of revocation?

1. Raise the bid 2. Employ fault-tolerance mechanisms

Time of the day

Pric

e

Spot User bid

Page 7: SpotOn: A Batch Computing Service for the Spot …SpotOn: A Batch Computing Service for the Spot Market Supreeth Subramanya, Tian Guo, Prateek Sharma, David Irwin, Prashant Shenoy

spot markets

Amazon EC2

Bid in a 2nd price auction Acquire when bid > spot price Terminate when spot price > bid

How do we mitigate the impact of revocation?

1. Raise the bid 2. Employ fault-tolerance mechanisms

Time of the day

Pric

e

Spot User bid

2. Employ fault-tolerance mechanisms

Page 8: SpotOn: A Batch Computing Service for the Spot …SpotOn: A Batch Computing Service for the Spot Market Supreeth Subramanya, Tian Guo, Prateek Sharma, David Irwin, Prashant Shenoy

challenges — spot market complexity

Amazon EC2 operates ~4000 spot markets

Page 9: SpotOn: A Batch Computing Service for the Spot …SpotOn: A Batch Computing Service for the Spot Market Supreeth Subramanya, Tian Guo, Prateek Sharma, David Irwin, Prashant Shenoy

challenges — spot market complexity

Scatterplot of ranks for EC2 spot markets

Amazon EC2 operates ~4000 spot markets

Page 10: SpotOn: A Batch Computing Service for the Spot …SpotOn: A Batch Computing Service for the Spot Market Supreeth Subramanya, Tian Guo, Prateek Sharma, David Irwin, Prashant Shenoy

challenges — spot market complexity

Scatterplot of ranks for EC2 spot markets

Amazon EC2 operates ~4000 spot markets

Selecting an instance that yields

lowest cost per unit of computation

while also considering the

probability of revocation is complex

Page 11: SpotOn: A Batch Computing Service for the Spot …SpotOn: A Batch Computing Service for the Spot Market Supreeth Subramanya, Tian Guo, Prateek Sharma, David Irwin, Prashant Shenoy

challenges — application complexity

CPU : IO 1:1

Working Set 8GB

Running Time 1 hour

Resource Vector

Disk Type Remote

Page 12: SpotOn: A Batch Computing Service for the Spot …SpotOn: A Batch Computing Service for the Spot Market Supreeth Subramanya, Tian Guo, Prateek Sharma, David Irwin, Prashant Shenoy

challenges — application complexity

Cost 20% On-demand

Revocation Rate 2.4 per day

Spot VM

Fault-tolerance Mechanism Checkpoint (every 900s)

CPU : IO 1:1

Working Set 8GB

Running Time 1 hour

Resource Vector

Disk Type Remote

Page 13: SpotOn: A Batch Computing Service for the Spot …SpotOn: A Batch Computing Service for the Spot Market Supreeth Subramanya, Tian Guo, Prateek Sharma, David Irwin, Prashant Shenoy

Com

plet

ion

time

(s)

0

2000

4000

6000

8000

Cost

(¢)

0

25

50

75

100

On-demand Checkpoint

challenges — application complexity

Cost 20% On-demand

Revocation Rate 2.4 per day

Spot VM

Fault-tolerance Mechanism Checkpoint (every 900s)

CPU : IO 1:1

Working Set 8GB

Running Time 1 hour

Resource Vector

Disk Type Remote

Page 14: SpotOn: A Batch Computing Service for the Spot …SpotOn: A Batch Computing Service for the Spot Market Supreeth Subramanya, Tian Guo, Prateek Sharma, David Irwin, Prashant Shenoy

CPU : IO 1:1

Working Set 8GB

Running Time 1 hour

Resource Vector

Cost 20% On-demand

Revocation Rate 2.4 per day

Spot VM

Fault-tolerance Mechanism Checkpoint (every 900s)

Disk Type Remote

Replicate (deg=2)

challenges — application complexity

Local

Page 15: SpotOn: A Batch Computing Service for the Spot …SpotOn: A Batch Computing Service for the Spot Market Supreeth Subramanya, Tian Guo, Prateek Sharma, David Irwin, Prashant Shenoy

CPU : IO 1:1

Working Set 8GB

Running Time 1 hour

Resource Vector

Cost 20% On-demand

Revocation Rate 2.4 per day

Spot VM

Fault-tolerance Mechanism Checkpoint (every 900s)

Disk Type Remote

Com

plet

ion

time

(s)

0

2000

4000

6000

8000

Cost

(¢)

0

25

50

75

100

On-demand Checkpoint Replicate

Replicate (deg=2)

challenges — application complexity

Local

Page 16: SpotOn: A Batch Computing Service for the Spot …SpotOn: A Batch Computing Service for the Spot Market Supreeth Subramanya, Tian Guo, Prateek Sharma, David Irwin, Prashant Shenoy

CPU : IO 1:1

Working Set 8GB

Running Time 1 hour

Resource Vector

Cost 20% On-demand

Revocation Rate 2.4 per day

Spot VM

Fault-tolerance Mechanism Checkpoint (every 900s)

Disk Type Local

challenges — application complexity

>24 per day

Page 17: SpotOn: A Batch Computing Service for the Spot …SpotOn: A Batch Computing Service for the Spot Market Supreeth Subramanya, Tian Guo, Prateek Sharma, David Irwin, Prashant Shenoy

CPU : IO 1:1

Working Set 8GB

Running Time 1 hour

Resource Vector

Cost 20% On-demand

Revocation Rate 2.4 per day

Spot VM

Fault-tolerance Mechanism Checkpoint (every 900s)

Disk Type Local

Com

plet

ion

time

(s)

0

2000

4000

6000

8000

Cost

(¢)

0

25

50

75

100

On-demand

Checkpoint

Replicate

Replicate (Revoked)

challenges — application complexity

>24 per day

Page 18: SpotOn: A Batch Computing Service for the Spot …SpotOn: A Batch Computing Service for the Spot Market Supreeth Subramanya, Tian Guo, Prateek Sharma, David Irwin, Prashant Shenoy

Spoton: A batch computing service

SpotOn

Page 19: SpotOn: A Batch Computing Service for the Spot …SpotOn: A Batch Computing Service for the Spot Market Supreeth Subramanya, Tian Guo, Prateek Sharma, David Irwin, Prashant Shenoy

Spoton: A batch computing service

SpotOn

Service that accepts batch jobs from users and runs them on spot instances

Manages application and spot market complexity transparently

Page 20: SpotOn: A Batch Computing Service for the Spot …SpotOn: A Batch Computing Service for the Spot Market Supreeth Subramanya, Tian Guo, Prateek Sharma, David Irwin, Prashant Shenoy

Spoton: A batch computing service

SpotOn

Service that accepts batch jobs from users and runs them on spot instances

Manages application and spot market complexity transparently

Run batch jobs at on-demand performance but paying spot market price

Page 21: SpotOn: A Batch Computing Service for the Spot …SpotOn: A Batch Computing Service for the Spot Market Supreeth Subramanya, Tian Guo, Prateek Sharma, David Irwin, Prashant Shenoy

greedy selection algorithm

Selecting the best spot market and fault-tolerance mechanism

Jb

Batch Job

Page 22: SpotOn: A Batch Computing Service for the Spot …SpotOn: A Batch Computing Service for the Spot Market Supreeth Subramanya, Tian Guo, Prateek Sharma, David Irwin, Prashant Shenoy

greedy selection algorithm

Selecting the best spot market and fault-tolerance mechanism

Jb

Spot Markets

∀Si

Batch Job

Page 23: SpotOn: A Batch Computing Service for the Spot …SpotOn: A Batch Computing Service for the Spot Market Supreeth Subramanya, Tian Guo, Prateek Sharma, David Irwin, Prashant Shenoy

greedy selection algorithm

Selecting the best spot market and fault-tolerance mechanism

Jb

Spot Markets

∀Si

Batch Job

∀Ft

Fault Tolerance

Migrate Dup Chkp

Page 24: SpotOn: A Batch Computing Service for the Spot …SpotOn: A Batch Computing Service for the Spot Market Supreeth Subramanya, Tian Guo, Prateek Sharma, David Irwin, Prashant Shenoy

greedy selection algorithm

Selecting the best spot market and fault-tolerance mechanism

Jb $

Minimum cost of running Jb using Ft on Si

Spot Markets

∀Si

Batch Job

∀Ft

Fault Tolerance

Migrate Dup Chkp

Page 25: SpotOn: A Batch Computing Service for the Spot …SpotOn: A Batch Computing Service for the Spot Market Supreeth Subramanya, Tian Guo, Prateek Sharma, David Irwin, Prashant Shenoy

greedy selection algorithm

Selecting the best spot market and fault-tolerance mechanism

Jb $

Minimum cost of running Jb using Ft on Si

Acquire Si

Spot Markets

∀Si

Batch Job

∀Ft

Fault Tolerance

Migrate Dup Chkp

Page 26: SpotOn: A Batch Computing Service for the Spot …SpotOn: A Batch Computing Service for the Spot Market Supreeth Subramanya, Tian Guo, Prateek Sharma, David Irwin, Prashant Shenoy

greedy selection algorithm

Selecting the best spot market and fault-tolerance mechanism

Jb $

Minimum cost of running Jb using Ft on Si

Acquire Si

Repeat on spot revocation

(until Jb finishes)

Spot Markets

∀Si

Batch Job

∀Ft

Fault Tolerance

Migrate Dup Chkp

Page 27: SpotOn: A Batch Computing Service for the Spot …SpotOn: A Batch Computing Service for the Spot Market Supreeth Subramanya, Tian Guo, Prateek Sharma, David Irwin, Prashant Shenoy

Reactive Migration

fault tolerance (1/3)

Migrated

Page 28: SpotOn: A Batch Computing Service for the Spot …SpotOn: A Batch Computing Service for the Spot Market Supreeth Subramanya, Tian Guo, Prateek Sharma, David Irwin, Prashant Shenoy

Reactive Migration

fault tolerance (1/3)

Migrated

size of memory + local disk remote disk bandwidth

TM ∝

Page 29: SpotOn: A Batch Computing Service for the Spot …SpotOn: A Batch Computing Service for the Spot Market Supreeth Subramanya, Tian Guo, Prateek Sharma, David Irwin, Prashant Shenoy

Reactive Migration

fault tolerance (1/3)

Migrated

Zk → Random variable measuring time to revocation

Pk → Probability job gets revoked before completion

size of memory + local disk remote disk bandwidth

TM ∝

Page 30: SpotOn: A Batch Computing Service for the Spot …SpotOn: A Batch Computing Service for the Spot Market Supreeth Subramanya, Tian Guo, Prateek Sharma, David Irwin, Prashant Shenoy

Reactive Migration

fault tolerance (1/3)

Migrated

Zk → Random variable measuring time to revocation

Pk → Probability job gets revoked before completion

E[Pricek] [(1 — Pk) * T + Pk * (E(Zk) + TM)] * spot-price

E[Timek] (1 — Pk) * T + Pk * E(Zk)

=Cost =

size of memory + local disk remote disk bandwidth

TM ∝

Page 31: SpotOn: A Batch Computing Service for the Spot …SpotOn: A Batch Computing Service for the Spot Market Supreeth Subramanya, Tian Guo, Prateek Sharma, David Irwin, Prashant Shenoy

Proactive Checkpoint

fault tolerance (2/3)

Restored

Page 32: SpotOn: A Batch Computing Service for the Spot …SpotOn: A Batch Computing Service for the Spot Market Supreeth Subramanya, Tian Guo, Prateek Sharma, David Irwin, Prashant Shenoy

Proactive Checkpoint

fault tolerance (2/3)

Restored

𝜏: checkpoint frequency

size of memory + local disk remote disk bandwidth

TL ∝ (𝜏/2)

TC ∝

Page 33: SpotOn: A Batch Computing Service for the Spot …SpotOn: A Batch Computing Service for the Spot Market Supreeth Subramanya, Tian Guo, Prateek Sharma, David Irwin, Prashant Shenoy

Proactive Checkpoint

fault tolerance (2/3)

Total Overhead = (T /𝜏) * Tc + TL

Restored

𝜏: checkpoint frequency

size of memory + local disk remote disk bandwidth

TL ∝ (𝜏/2)

TC ∝

Page 34: SpotOn: A Batch Computing Service for the Spot …SpotOn: A Batch Computing Service for the Spot Market Supreeth Subramanya, Tian Guo, Prateek Sharma, David Irwin, Prashant Shenoy

Proactive Checkpoint

fault tolerance (2/3)

Total Overhead = (T /𝜏) * Tc + TL

Restored

𝜏: checkpoint frequency

size of memory + local disk remote disk bandwidth

TL ∝ (𝜏/2)

TC ∝

Cost overhead is primarily a function of job’s resource usage

Page 35: SpotOn: A Batch Computing Service for the Spot …SpotOn: A Batch Computing Service for the Spot Market Supreeth Subramanya, Tian Guo, Prateek Sharma, David Irwin, Prashant Shenoy

Spot Replication

fault tolerance (3/3)

TL ∝ market volatility

Page 36: SpotOn: A Batch Computing Service for the Spot …SpotOn: A Batch Computing Service for the Spot Market Supreeth Subramanya, Tian Guo, Prateek Sharma, David Irwin, Prashant Shenoy

Spot Replication

fault tolerance (3/3)

TL ∝ market volatility

Total Overhead ∝ (Replication factor, TL)

Cost overhead is primarily a function of market characteristics

Page 37: SpotOn: A Batch Computing Service for the Spot …SpotOn: A Batch Computing Service for the Spot Market Supreeth Subramanya, Tian Guo, Prateek Sharma, David Irwin, Prashant Shenoy

evaluation (1/3)

SpotOn Prototype

App Emulator to create synthetic jobs with varying resource usage

Built on Linux Containers for efficient checkpointing / migration

App emulator

SpotOn job manager

daemondaemon

Page 38: SpotOn: A Batch Computing Service for the Spot …SpotOn: A Batch Computing Service for the Spot Market Supreeth Subramanya, Tian Guo, Prateek Sharma, David Irwin, Prashant Shenoy

Effect of Application and Spot Market Characteristics

evaluation (2/3)

Page 39: SpotOn: A Batch Computing Service for the Spot …SpotOn: A Batch Computing Service for the Spot Market Supreeth Subramanya, Tian Guo, Prateek Sharma, David Irwin, Prashant Shenoy

Effect of Application and Spot Market Characteristics

evaluation (2/3)

Page 40: SpotOn: A Batch Computing Service for the Spot …SpotOn: A Batch Computing Service for the Spot Market Supreeth Subramanya, Tian Guo, Prateek Sharma, David Irwin, Prashant Shenoy

Effect of Application and Spot Market Characteristics

evaluation (2/3)

Best choice of fault-tolerance mechanism is a function of the spot market and job characteristics

Page 41: SpotOn: A Batch Computing Service for the Spot …SpotOn: A Batch Computing Service for the Spot Market Supreeth Subramanya, Tian Guo, Prateek Sharma, David Irwin, Prashant Shenoy

Effect of Cost-aware Selection

evaluation (3/3)

Page 42: SpotOn: A Batch Computing Service for the Spot …SpotOn: A Batch Computing Service for the Spot Market Supreeth Subramanya, Tian Guo, Prateek Sharma, David Irwin, Prashant Shenoy

Effect of Cost-aware Selection

evaluation (3/3)

Page 43: SpotOn: A Batch Computing Service for the Spot …SpotOn: A Batch Computing Service for the Spot Market Supreeth Subramanya, Tian Guo, Prateek Sharma, David Irwin, Prashant Shenoy

Effect of Cost-aware Selection

evaluation (3/3)

2x better than just checkpointing

Page 44: SpotOn: A Batch Computing Service for the Spot …SpotOn: A Batch Computing Service for the Spot Market Supreeth Subramanya, Tian Guo, Prateek Sharma, David Irwin, Prashant Shenoy

On Google cluster trace, Cost-aware selection achieved 91.9% savings with little impact on performance

Effect of Cost-aware Selection

evaluation (3/3)

2x better than just checkpointing

Page 45: SpotOn: A Batch Computing Service for the Spot …SpotOn: A Batch Computing Service for the Spot Market Supreeth Subramanya, Tian Guo, Prateek Sharma, David Irwin, Prashant Shenoy

Spot markets offer arbitrage opportunities

SpotOn manages application and market complexities

We model fault tolerance and propose a selection algorithm

Prototype on Amazon EC2

Achieves ~90% cost savings

conclusion

Page 46: SpotOn: A Batch Computing Service for the Spot …SpotOn: A Batch Computing Service for the Spot Market Supreeth Subramanya, Tian Guo, Prateek Sharma, David Irwin, Prashant Shenoy

Spot markets offer arbitrage opportunities

SpotOn manages application and market complexities

We model fault tolerance and propose a selection algorithm

Prototype on Amazon EC2

Achieves ~90% cost savings

conclusion

Thank you!

Page 47: SpotOn: A Batch Computing Service for the Spot …SpotOn: A Batch Computing Service for the Spot Market Supreeth Subramanya, Tian Guo, Prateek Sharma, David Irwin, Prashant Shenoy

backup slides

Page 48: SpotOn: A Batch Computing Service for the Spot …SpotOn: A Batch Computing Service for the Spot Market Supreeth Subramanya, Tian Guo, Prateek Sharma, David Irwin, Prashant Shenoy

bidding

Spot price distribution has long tail Spot price changes are peaky

Page 49: SpotOn: A Batch Computing Service for the Spot …SpotOn: A Batch Computing Service for the Spot Market Supreeth Subramanya, Tian Guo, Prateek Sharma, David Irwin, Prashant Shenoy

bidding

Bidding does affect volatility but not drastically

SpotOn always bids at the on-demand price If spot price goes above on-demand, SpotOn would choose on-demand

Spot price distribution has long tail Spot price changes are peaky