AWS re:Invent 2014 talk: Scheduling using Apache Mesos in the cloud

Post on 07-Jul-2015

Category:

Software

DESCRIPTION

How can you reliably schedule tasks in an unreliable, autoscaling cloud environment? In this presentation, we'll talk about the design of our scheduler, built on top of Apache Mesos, that serves as the core of Mantis, our stream-processing platform designed for real-time insights. We'll focus on the following aspects of the scheduler:

- Coarse-grained vs. fine-grained resource scheduling
- Fault tolerance via a combination of task reconciliation and life cycle event processing
- Scheduling optimizations for bin packing, for stream locality to reduce network bandwidth usage, for task placement to achieve autoscaling of the cluster size, etc.

This talk will also include detailed information about approaches to scheduling in a distributed, autoscaling environment.

TRANSCRIPT

[Slide: Analytics and insights: system health, customer experience, anomaly detection]

[Diagram: three users, each with a job (Job 1, Job 2, Job 3); labeled components: Discovery, Mantis, Apache Mesos, ASG, Fenzo, Mesos framework]

[Diagram: Mesos architecture. Two frameworks (FrmWrk1, FrmWrk2) with jobs; a Mesos master with two standby masters; Mesos slaves running FrmWrk1 and FrmWrk2 executors, each executor running tasks]

[Diagram: task placement on instances, from one task per instance (Instance 1: Task A; Instance 2: Task B) to several tasks on one instance (Instance 1: Tasks A, B, C, D)]

[Diagram: stream locality. A data stream processed by Task1, Task2, and Task3 placed on separate hosts (Host A, Host B, Host C), compared with all three tasks placed on a single host (Host X)]

[Diagram: Mantis with Fenzo and the Mesos framework; Mesos slaves running framework executors and tasks]

[Diagram: Apache Mesos with the Mesos master, the framework, and persistence]

    .setName(name)
    .setFailoverTimeout(to)
    .setId(id)
    .setCheckpoint(true)
    .build();

Heterogeneous

Autoscale

Visibility

Plugins for Constraints, Fitness

High speed

[Diagram: Mesos master, Mesos framework, and Fenzo task scheduler; labeled flows: task requests, available resource offers; Persistence shown alongside]

[Chart axes: Fitness, Urgency]

Speed: first fit assignment, ~ O(1)

Accuracy: optimal assignment, ~ O(N * M) [1]

Real-world trade-offs fall between the two.

[1] Assuming tasks are not reassigned
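
The trade-off above can be illustrated with a minimal Java sketch. It is not taken from the talk or from Fenzo: the class `AssignmentSketch` and both method names are hypothetical, only CPU is considered, and "best" is assumed to mean the fitting host that leaves the least CPU free. First fit returns as soon as any host fits, while the exhaustive variant compares every host for every task, which is where a cost on the order of N * M comes from if N is the number of tasks and M the number of hosts (the slide does not define either symbol).

```java
/** Illustrative only: first-fit vs. exhaustive best-fit host selection (hypothetical names). */
final class AssignmentSketch {

    /** Returns the index of the first host with enough free CPUs, or -1 if none fits. */
    static int firstFit(double[] freeCpus, double requestedCpus) {
        for (int i = 0; i < freeCpus.length; i++) {
            if (freeCpus[i] >= requestedCpus) {
                return i; // stop at the first match without comparing other hosts
            }
        }
        return -1;
    }

    /** Scans every host and returns the fitting host with the least free CPUs, or -1. */
    static int bestFit(double[] freeCpus, double requestedCpus) {
        int best = -1;
        for (int i = 0; i < freeCpus.length; i++) {
            boolean fits = freeCpus[i] >= requestedCpus;
            if (fits && (best == -1 || freeCpus[i] < freeCpus[best])) {
                best = i; // tighter fit than anything seen so far
            }
        }
        return best;
    }
}
```

With free CPUs of 8, 2, and 4 and a request for 2 CPUs, `firstFit` picks the first host and leaves it partially used, while `bestFit` picks the second host and fills it completely.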

[Charts: number of hosts (#Hosts) that are full, partial, or empty, with no bin packing used vs. with bin packing]
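
As a rough illustration of what bin packing on CPU means for the comparison above, the sketch below scores each host by how full its CPUs would be after placing the task and picks the highest score. This is an assumption about the general technique, not the code behind the charts or Fenzo's implementation; `CpuBinPackingSketch`, `fitness`, and `pickHost` are hypothetical names.

```java
/** Illustrative only: a CPU bin-packing fitness score (hypothetical names, not Fenzo's code). */
final class CpuBinPackingSketch {

    /**
     * Returns a score in [0, 1]: the fraction of the host's CPUs in use after
     * placing the task, or 0 if the task does not fit on the host.
     */
    static double fitness(double totalCpus, double usedCpus, double requestedCpus) {
        if (totalCpus <= 0 || usedCpus + requestedCpus > totalCpus) {
            return 0.0;
        }
        return (usedCpus + requestedCpus) / totalCpus;
    }

    /** Returns the index of the host with the highest fitness, or -1 if no host fits. */
    static int pickHost(double[] totalCpus, double[] usedCpus, double requestedCpus) {
        int best = -1;
        double bestFitness = 0.0;
        for (int i = 0; i < totalCpus.length; i++) {
            double f = fitness(totalCpus[i], usedCpus[i], requestedCpus);
            if (f > bestFitness) {
                bestFitness = f;
                best = i;
            }
        }
        return best;
    }
}
```

Preferring the fullest host that still fits tends to leave other hosts completely empty, which matters when empty hosts are the ones an autoscaler can terminate.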

[Charts: number of hosts (#Hosts) holding tasks with different runtimes, holding tasks with the same runtimes, or unused, with no task runtime-based packer vs. using the task runtime-based packer]

ASG/Cluster: mantisagent

MinIdle: 8

MaxIdle: 20

CooldownSecs: 360

Fenzo

ScaleUp action: Cluster, N

ScaleDown action: Cluster, HostList
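
One way to read the rule and the two actions above is sketched below in Java. It is a hypothetical model, not Fenzo's API: the class `AutoscaleRuleSketch`, its `Action` type, and the choice to scale up to exactly MinIdle and scale down to exactly MaxIdle are all assumptions, since the slides give only the parameter names and the action payloads (a host count N for scale-up, a host list for scale-down).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Illustrative model of an idle-host autoscale rule (hypothetical, not Fenzo's API). */
final class AutoscaleRuleSketch {
    enum Kind { NONE, SCALE_UP, SCALE_DOWN }

    /** Result of one evaluation: add `count` hosts, or terminate the listed hosts. */
    static final class Action {
        final Kind kind;
        final int count;
        final List<String> hosts;

        Action(Kind kind, int count, List<String> hosts) {
            this.kind = kind;
            this.count = count;
            this.hosts = hosts;
        }
    }

    private final int minIdle;
    private final int maxIdle;
    private final long cooldownSecs;
    private boolean hasActed = false;
    private long lastActionAtSecs = 0L;

    AutoscaleRuleSketch(int minIdle, int maxIdle, long cooldownSecs) {
        this.minIdle = minIdle;
        this.maxIdle = maxIdle;
        this.cooldownSecs = cooldownSecs;
    }

    /** Decides what to do given the currently idle hosts and the current time in seconds. */
    Action evaluate(List<String> idleHosts, long nowSecs) {
        // Assumption: no action of either kind is taken during the cooldown window.
        if (hasActed && nowSecs - lastActionAtSecs < cooldownSecs) {
            return new Action(Kind.NONE, 0, Collections.<String>emptyList());
        }
        int idle = idleHosts.size();
        if (idle < minIdle) {
            markActed(nowSecs);
            // Assumption: request just enough hosts to get back to minIdle.
            return new Action(Kind.SCALE_UP, minIdle - idle, Collections.<String>emptyList());
        }
        if (idle > maxIdle) {
            markActed(nowSecs);
            // Assumption: terminate the idle hosts beyond maxIdle, in list order.
            List<String> surplus = new ArrayList<>(idleHosts.subList(maxIdle, idle));
            return new Action(Kind.SCALE_DOWN, surplus.size(), surplus);
        }
        return new Action(Kind.NONE, 0, Collections.<String>emptyList());
    }

    private void markActed(long nowSecs) {
        hasActed = true;
        lastActionAtSecs = nowSecs;
    }
}
```

With MinIdle 8, MaxIdle 20, and CooldownSecs 360, five idle hosts would trigger a scale-up of three under these assumptions, and a second evaluation 100 seconds later would do nothing because the cooldown has not elapsed.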

    .withLeaseOfferExpirySecs(60)
    .withLeaseRejectAction((lease) -> {
        mesosDriver.declineOffer(lease.getOffer().getId());
    })
    .withFitnessCalculator(
        BinPackingFitnessCalculators.cpuBinPacker)

Talk     Time               Title
PFC-305  Wednesday, 1:15pm  Embracing Failure: Fault Injection and Service Reliability
BDT-403  Wednesday, 2:15pm  Next Generation Big Data Platform at Netflix
PFC-306  Wednesday, 3:30pm  Performance Tuning EC2
DEV-309  Wednesday, 3:30pm  From Asgard to Zuul, How Netflix’s proven Open Source Tools can accelerate and scale your services
ARC-317  Wednesday, 4:30pm  Maintaining a Resilient Front-Door at Massive Scale
PFC-304  Wednesday, 4:30pm  Effective Inter-process Communications in the Cloud: The Pros and Cons of Micro Services Architectures
ENT-209  Wednesday, 4:30pm  Cloud Migration, Dev-Ops and Distributed Systems
APP-310  Friday, 9:00am     Scheduling using Apache Mesos in the Cloud

Please give us your feedback on this presentation
