batch processing with amazon ec2 container service

42
Batch Processing with Amazon EC2 Container Service Brennan Saeta, Frank Chen Infrastructure Engineering @ Coursera

Upload: amazon-web-services

Post on 11-Aug-2015

1.468 views

Category:

Documents


2 download

TRANSCRIPT

Page 1: Batch Processing with Amazon EC2 Container Service

Batch Processing with Amazon EC2 Container Service

Brennan Saeta, Frank ChenInfrastructure Engineering @ Coursera

Page 2: Batch Processing with Amazon EC2 Container Service
Page 3: Batch Processing with Amazon EC2 Container Service

What We Do

Massive Open Online Courses

14 millionlearners

120partners

1000courses

2.2 millioncourse completions

Page 4: Batch Processing with Amazon EC2 Container Service

Coursera Technical Architecture

Page 5: Batch Processing with Amazon EC2 Container Service

Batch Processing Enables...

Reporting

Instructor Reports● Grade exports● Learner demographics● Course progress statistics

Internal Reports● Business metrics● Payments

reconciliation

Page 6: Batch Processing with Amazon EC2 Container Service

Batch Processing Enables...

Marketing

● Recommendation emails

● Batch marketing / reactivation emails

Page 7: Batch Processing with Amazon EC2 Container Service

Batch Processing Enables...

Pedagogical Innovation

● Auto-graded programming assignments

● Peer-review matching & analysis

Page 8: Batch Processing with Amazon EC2 Container Service

Bad Old Days of Batch Processing

Cascade● PHP-based job runner● Run in screen sessions● Polled for new jobs● Fragile and unreliable● Forced restarts on regular basis

Page 9: Batch Processing with Amazon EC2 Container Service

Bad Old Days of Batch Processing II

Saturn● Scala-based scheduled batch process runner

o Powered by Quartz Scheduler library● Cron-only jobs

o Cannot run jobs on demando Some jobs have to wake up frequently to poll

● All jobs ran on same instance (and JVM), causing interference

Page 10: Batch Processing with Amazon EC2 Container Service

What Do We Want?

Reliability● Saturn / Cascade were flaky● Developers became frustrated

with jobs not running properly

flickr.com/rclarkeimages, cc by-nc 2.0

Page 11: Batch Processing with Amazon EC2 Container Service

What Do We Want?

Easy

Development● Developing & testing

locally was difficult

● Little or no boilerplateshould be required flickr.com/riebart, cc by 2.0

Page 12: Batch Processing with Amazon EC2 Container Service

What Do We Want?

Easy

Deployment● Deployment was difficult

and non-deterministic● “Other services have one-

click tools, why can’t your service have that too?” flickr.com/derelllicht, cc by-nd 2.0

Page 13: Batch Processing with Amazon EC2 Container Service

What Do We Want?

High Efficiency● Cost-conscious

● Most jobs complete < 20 minuteso EC2 rounds costs up to full hour

● Startup time of individual instancestoo long for batch processing

flickr.com/koertmichiels, cc by-nc-nd 2.0

Page 14: Batch Processing with Amazon EC2 Container Service

What Do We Want?

Low Ops Load● Only one dev-ops engineer --

can’t manage everything● Developers own their services● Developers shouldn’t have to

actively monitor services

flickr.com/reynermedia, cc by 2.0

Page 15: Batch Processing with Amazon EC2 Container Service

Alternative Technologies

Home-grown Tech

● Tried, but proved to be unreliable

● Difficult to handle coordination and synchronization

● Very powerful, but hard to productionize

● Needs actual DevOps team

● GCE first-class, everything else second-class

Page 16: Batch Processing with Amazon EC2 Container Service

Amazon ECS

● Low-to-no maintenance solution● Integrated with AWS infrastructure● Easy to understand and program for

Page 17: Batch Processing with Amazon EC2 Container Service

However…

● No scheduled tasks

● No fine-grained monitoring of tasks

● No retries / delays when cluster out of resources

● Does not integrate well with ourexisting Scala APIs and tooling

Page 18: Batch Processing with Amazon EC2 Container Service

Iguazú -- Batch Management for ECS

● Named for Iguazú Fallso World’s largest waterfall

● Batch Task Schedulero Immediatelyo Deferred (run once at X time)o Scheduled recurring (cron-like)

● Programmatically accessible via standard APIs / clients

flickr.com/mrpunto, cc by-nc-nd 2.0

Page 19: Batch Processing with Amazon EC2 Container Service

Iguazú Semantic Guarantees

● At most once execution for all jobs● Jobs will be provided with at least the

CPU and RAM they requested● Scheduler may elect to skip execution

of some scheduled jobs under adverse conditions

Page 20: Batch Processing with Amazon EC2 Container Service

Iguazú Architecture

Services

Services

IguazúAdmin

Iguazúfrontend

Iguazú backend

Amazon SQS

Cassandra Worker Machines

Worker Worker

schedulerdevelopers

learners

ECS APIs

Page 21: Batch Processing with Amazon EC2 Container Service

Iguazú Design

● Frontend + Schedulero Generates requests (either via API calls or internally from the

scheduler)o Puts new requests in SQS queueso Handles requests for status from other services

● Backendo Attempts to run tasks via ECS API

Failure (e.g. lack of resources) means task goes back into queue to try again later

o Keeps track of task status and updates Cassandra

Page 22: Batch Processing with Amazon EC2 Container Service

Iguazú Admin User Interface

Page 23: Batch Processing with Amazon EC2 Container Service

Iguazú Admin User Interface

Page 24: Batch Processing with Amazon EC2 Container Service

Developing Iguazú Tasks

class Job extends AbstractJob with StrictLogging { override val reservedCpu = 1024 override val reservedMemory = 1024

def run(parameters: JsValue) = { logger.info(“I am running my Job!”) expensiveComputationHere() }}

Page 25: Batch Processing with Amazon EC2 Container Service

Running Tasks Locally$ sbt> project iguazuJobs[info] Set current project to iguazu-jobs > run sample sample.json[info] Running org.coursera.iguazu.internal.IguazuJobsMain sample sample.json[info] 2015-07-08 13:31:52,368 INFO [o.c.i.j.s.Job] >>> I am running my job![success] Total time: 5 s, completed Jul 8, 2015 1:31:58 PM>

Page 26: Batch Processing with Amazon EC2 Container Service

Running Tasks from Other Services

// invoking a job with one command// from another service via Naptime REST framework

val invocationId = IguazuJobInvocationClient .create(IguazuJobInvocationRequest( family = "mailer", jobName = "recommendationsEmail", parameters = emailParams))

Page 27: Batch Processing with Amazon EC2 Container Service

Deploying Tasks

Easy Deployment1. Merge into master. Done!

Jenkins Build Steps:2. Builds zip package from master3. Prepares Docker image4. Pushes docker image into docker registry5. Registers updated tasks with ECS APIs

Page 28: Batch Processing with Amazon EC2 Container Service

Logs

● Logs are in /var/lib/docker/containers/*

● Upload into log analysis service (Sumologic for us)

● Wrapper prints out task name and task IDat the start for easy searching

● Good enough for now

Page 29: Batch Processing with Amazon EC2 Container Service

Metrics

● Using third-party metrics collector (Datadog)

● Metrics for both tasks and container instances

● So long as can talk to Internet,things will work out pretty well

Page 30: Batch Processing with Amazon EC2 Container Service

Programming Assignments

Page 31: Batch Processing with Amazon EC2 Container Service

Programming Assignments

Page 32: Batch Processing with Amazon EC2 Container Service
Page 33: Batch Processing with Amazon EC2 Container Service
Page 34: Batch Processing with Amazon EC2 Container Service

Grading Prog. Assignments

● Special case of batch processing● Near real-time feedback

o <30 seconds for fast graders

● Compiling and running untrusted code

● Infrastructure security huge concerno Minimize exfiltration of info (e.g. test cases)o Avoid turning into bitcoin miners or DDoS

attack

Page 35: Batch Processing with Amazon EC2 Container Service

GRading Inside Docker: Architecture

Alternate (GRID) AWS AccountProduction AWS Account

Amazon S3

Amazon ECS

Strict network firewall

Worker WorkerCtrlCluster

CtrlCluster

Page 36: Batch Processing with Amazon EC2 Container Service

GRID: Defense in depth

Network

Completely separate AWS account

Network ACLs & Routing Tables

HostDocker /

Other

+ additional mitigation techniques and defenses

Custom cleaning of container images

Page 37: Batch Processing with Amazon EC2 Container Service

Modifying the ECS Agent

● Coursera has 2 simple forks of ECS agento Allow privileged docker-in-docker access for

the “cleaning” agento Disable networking and disk writes for

untrusted code in the “grading” agent

● Check it out at: github.com/coursera/amazon-ecs-agent

Page 38: Batch Processing with Amazon EC2 Container Service
Page 39: Batch Processing with Amazon EC2 Container Service

Usage

● Iguazú o 38 tasks written since launch in April o 24 scheduled taskso >1000 invocations per day

● Grido Pre-production at this time (launching in

weeks)o Dozens of graders already written by multiple

instructional teams!

Page 40: Batch Processing with Amazon EC2 Container Service

Future Improvements

● True Autoscalingo Scaling up is easy, scaling down not so much

● Task prioritization (multiple queues)

● Simulate memory and CPU limits in dev modes

Page 41: Batch Processing with Amazon EC2 Container Service

Lessons Learned / Docker war stories

● Docker instability funo Container format changing between 1.0 and

1.5.

● btrfs wedging on older kernelso Default 14.04 kernel not new enough!

● Disk usageo Docker-in-docker can’t clean up after itself

(yet).

Page 42: Batch Processing with Amazon EC2 Container Service

Thank You! Questions?

We are hiring! www.coursera.org/jobs