Condor in Dynamic Environments

Ian Alderman [email protected]

Post on 07-Jan-2016

DESCRIPTION

Condor in Dynamic Environments. Ian Alderman [email protected]. A Little History… Condor - A Hunter of Idle Workstations. 1988 paper introducing Condor. Described system using cycle-scavenging: batch jobs run on desktop machines when their daytime users are idle.

TRANSCRIPT

Page 1: Condor in   Dynamic Environments

Ian Alderman

[email protected] (cyclecomputing.com)

Page 2: Condor in   Dynamic Environments

A Little History…

Page 3: Condor in   Dynamic Environments

Condor - A Hunter of Idle Workstations
1988 paper introducing Condor.
Described system using cycle-scavenging: batch jobs run on desktop machines when their daytime users are idle.
Before: 23 VAXstation II workstations utilized at ~30%.
After: 200 CPU-days used over 5 months by 1000 jobs.
Key features:
  Transparent to interactive users, batch users.
  Fair sharing among batch users.
  Checkpointing and migration for batch jobs.

Page 4: Condor in   Dynamic Environments

A busy week…

Page 5: Condor in   Dynamic Environments

Fast forward to today
Condor widely used, very successful.
  Look around you!
Key reason: assumption of dynamic environment:
  Design goals from 1988 still valid today.
  “Workstations are autonomous computing resources and should be managed by their own users.”
  High Throughput via automated policy management.
There are differences between today’s dynamic environments and idle workstations.

Page 6: Condor in   Dynamic Environments

Today’s Dynamic Environments?
Virtualized Environments!
Slots/machines come and go on demand.
Essentially the same problem as autonomous user-managed workstations.
Similar but different policy considerations…
  e.g. Workstations: Round robin to distribute load.
  VMs: Fill entire host to minimize instances.
Thesis: Even with these differences, the same features make Condor work well in both use cases…
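The contrast between the workstation policy and the VM policy can be sketched in a few lines of Python. This is an illustrative sketch only, not Condor's negotiator: the function names are hypothetical, and least-loaded selection stands in for round robin (on equal hosts it visits them in the same order).

```python
from typing import Callable, List, Optional


def place_spread(used: List[int], capacity: int) -> Optional[int]:
    """Workstation-style policy: pick the least-loaded host to distribute load."""
    candidates = [i for i, u in enumerate(used) if u < capacity]
    if not candidates:
        return None
    return min(candidates, key=lambda i: used[i])


def place_pack(used: List[int], capacity: int) -> Optional[int]:
    """VM-style policy: pick the fullest host that still has a free slot."""
    candidates = [i for i, u in enumerate(used) if u < capacity]
    if not candidates:
        return None
    return max(candidates, key=lambda i: used[i])


def hosts_in_use(policy: Callable[[List[int], int], Optional[int]],
                 jobs: int, hosts: int, capacity: int) -> int:
    """Place `jobs` one at a time and count how many hosts end up non-empty."""
    used = [0] * hosts
    for _ in range(jobs):
        target = policy(used, capacity)
        if target is None:
            break  # every host is full
        used[target] += 1
    return sum(1 for u in used if u > 0)
```

With four jobs, four hosts, and four slots per host, the spreading policy touches every host while the packing policy keeps all four jobs on one, which is what keeps the instance count (and the bill) down in a virtualized pool.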

Page 7: Condor in   Dynamic Environments

Today: Notes from the field
My work on CycleCloud: managed Condor pools on Amazon EC2.
  Instance Types/Profiles
  Auto-start/stop
  Spot instances
VMware Environments: tiered use of available resources.
  Still have peak vs. median usage problem that leads to waste.

Page 8: Condor in   Dynamic Environments

Wasted during Spring Break…

Page 9: Condor in   Dynamic Environments

EC2 Background: Amazon Elastic Compute Cloud
Pay for use utility computing.
BYOVM: you have root, they have kernel.
Request instance; charged for each hour or portion.
Range of machine configurations.
Appearance of unlimited capacity.
Proprietary highly-available storage backend: S3.
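The "charged for each hour or portion" rule means every started hour is billed in full. A minimal sketch of that rounding, with hypothetical function names:

```python
import math


def billed_hours(runtime_seconds: float) -> int:
    """Hours charged when every started hour is billed as a whole hour."""
    if runtime_seconds <= 0:
        return 0
    return math.ceil(runtime_seconds / 3600)


def instance_cost(runtime_seconds: float, hourly_price: float) -> float:
    """Total charge for one instance at a fixed hourly price."""
    return billed_hours(runtime_seconds) * hourly_price
```

An instance that runs for 61 minutes is billed for two hours, which is why the later auto-stop logic checks for idle machines just before the next hour begins.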

Page 10: Condor in   Dynamic Environments

Back to Spring Break…

Page 11: Condor in   Dynamic Environments

Condor pools on demand
CycleCloud Service: at-cost or supported pools.
Relies on dynamic features of Condor: instances stop and start as needed.

Page 12: Condor in   Dynamic Environments

EC2 Instances Menu

AWS Name    Arch  Cores  Memory (GB)  Price (VA AZ)  Price/Core  Memory (GB)/Core
m1.small    32    1      1.7          $0.085         $0.085      1.7
m1.large    64    2      7.5          $0.340         $0.170      3.75
m1.xlarge   64    4      15           $0.680         $0.170      3.75
c1.medium   32    2      1.7          $0.170         $0.085      0.85
c1.xlarge   64    8      7            $0.680         $0.085      0.875
m2.xlarge   64    2      17.1         $0.500         $0.250      8.55
m2.2xlarge  64    4      34.2         $1.200         $0.300      8.55
m2.4xlarge  64    8      68.4         $2.400         $0.300      8.55
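The per-core columns are derived from the first five, so the menu can be turned into a small selection helper. The sketch below copies the table's figures; the helper names and the memory-per-core filter are illustrative assumptions, not part of CycleCloud.

```python
from typing import List, Tuple

# (name, arch_bits, cores, memory_gb, price) copied from the instance menu.
INSTANCES: List[Tuple[str, int, int, float, float]] = [
    ("m1.small", 32, 1, 1.7, 0.085),
    ("m1.large", 64, 2, 7.5, 0.340),
    ("m1.xlarge", 64, 4, 15.0, 0.680),
    ("c1.medium", 32, 2, 1.7, 0.170),
    ("c1.xlarge", 64, 8, 7.0, 0.680),
    ("m2.xlarge", 64, 2, 17.1, 0.500),
    ("m2.2xlarge", 64, 4, 34.2, 1.200),
    ("m2.4xlarge", 64, 8, 68.4, 2.400),
]


def per_core(row: Tuple[str, int, int, float, float]) -> Tuple[str, float, float]:
    """Return (name, price per core, memory in GB per core) for one table row."""
    name, _arch, cores, memory_gb, price = row
    return name, price / cores, memory_gb / cores


def cheapest_per_core(min_mem_per_core_gb: float = 0.0) -> List[str]:
    """Instance types with the lowest price per core that meet the memory need."""
    eligible = [per_core(r) for r in INSTANCES]
    eligible = [e for e in eligible if e[2] >= min_mem_per_core_gb]
    if not eligible:
        return []
    best = min(price for _, price, _ in eligible)
    # A small tolerance avoids surprises from floating-point division.
    return [name for name, price, _ in eligible if abs(price - best) < 1e-9]
```

With no memory constraint the cheapest cores are on m1.small, c1.medium, and c1.xlarge; asking for at least 3 GB per core moves the answer to m1.large and m1.xlarge.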

Page 13: Condor in   Dynamic Environments
Page 14: Condor in   Dynamic Environments
Page 15: Condor in   Dynamic Environments

High-CPU Amazon EC2 nodes have best price/performance

Page 16: Condor in   Dynamic Environments

Making Condor work in EC2
Naïve: Boot instances in EC2, add to pool.
Better: Create a pool in EC2.
No Condor configuration changes required, but…
  Disable preemption.
  Configure authentication, integrity, encryption.
  Optimize for security, performance, scalability.
Additional configuration:
  Patches, information flow, installed applications, shared files, encrypt data at rest and in transit, etc…

Page 17: Condor in   Dynamic Environments

[Figure: workflow diagram with stages Preprocessing; Algorithm A, Algo B, Algo C; Final Consensus. Labels: Short, Long, Wide. Note: Lots of intermediate data!]

Page 18: Condor in   Dynamic Environments

Auto-start, auto-stop mechanisms
Auto-start: Jobs present in queue -> machines start.
  Jobs run!
  User sets type of machine, limit # of instances to start.
Auto-stop: Before beginning of next hour:
  Check to see if jobs are still running; if not, shut down.
Users manually start machines that will auto-stop.
Mechanisms for auto-starting different machine types based on user requirements.
Users can supply hints about job run time for auto-start.
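The two mechanisms reduce to a pair of small decisions. The sketch below is a guess at their shape, not CycleCloud's implementation: the function names, the slots-per-instance parameter, and the five-minute check window before the hour boundary are all assumptions.

```python
import math


def instances_to_start(idle_jobs: int, running_instances: int,
                       max_instances: int, slots_per_instance: int = 1) -> int:
    """Auto-start: request enough instances for queued jobs, up to the user's limit."""
    if idle_jobs <= 0:
        return 0
    wanted = math.ceil(idle_jobs / slots_per_instance)
    headroom = max(0, max_instances - running_instances)
    return min(wanted, headroom)


def should_stop(uptime_seconds: float, running_jobs: int,
                check_window_seconds: float = 300.0) -> bool:
    """Auto-stop: shortly before the next billed hour begins, stop if nothing is running."""
    seconds_into_hour = uptime_seconds % 3600
    in_window = seconds_into_hour >= 3600 - check_window_seconds
    return in_window and running_jobs == 0
```

Checking only near the end of the hour keeps an idle instance available for the time that has already been paid for, and shuts it down before a new hour is charged.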

Page 19: Condor in   Dynamic Environments

Spot Instances
Same set of instance options.
Lower cost, weaker service characteristics:
  Could go away at any time.
  If it goes away, you don’t pay.
Bid for maximum you’re willing to pay, get machines at that price if available (i.e. going rate is <= your bid).
If going rate goes above your maximum, your instances are terminated.
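The bid rule can be written as a one-line comparison plus a loop over a price history. This is a simplified sketch: the function names are hypothetical, prices are sampled once per hour, and billing is left out entirely.

```python
from typing import Iterable


def spot_is_running(bid: float, going_rate: float) -> bool:
    """An instance keeps running only while the going rate is at or below the bid."""
    return going_rate <= bid


def hours_until_terminated(bid: float, hourly_rates: Iterable[float]) -> int:
    """Count consecutive hours survived before the going rate exceeds the bid."""
    survived = 0
    for rate in hourly_rates:
        if not spot_is_running(bid, rate):
            break
        survived += 1
    return survived
```

For example, with a made-up price history of 0.03, 0.04, 0.05, 0.06, 0.03 and a bid of 0.05, the instance survives three hours and is terminated in the fourth, even though the price later drops again.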

Page 20: Condor in   Dynamic Environments

Spring Break volatility…

Page 21: Condor in   Dynamic Environments

Spot Instance Policy Brainstorm
Jobs expected to run for more than one hour need dedicated resources to avoid waste (getting billed but not finishing)
  Job:
    REQUIREMENTS = isSpotInstance =?= FALSE
  Machine:
    START = Target.EstimatedRuntime =?= UNDEFINED || Target.EstimatedRuntime >= 3600
    isOwner = False
Jobs run on the cheapest nodes possible
Jobs prefer to run on machines up for lower fractions of an hour (to allow auto-stop to work)
  RANK = 100 * SlotHourCost + FractionOfHourUP
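The brainstormed policy can be mimicked outside Condor to see how the pieces interact. The sketch below is an interpretation rather than the ClassAd semantics: it assumes the START expression belongs to the dedicated (non-spot) machines, which the slide does not state, and it treats the RANK formula as a cost to minimize so that it matches the stated goals (cheapest nodes, lower fraction of the hour used). HTCondor itself prefers larger RANK values, so a real expression would need its sign checked. All class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Machine:
    name: str
    is_spot: bool
    slot_hour_cost: float
    fraction_of_hour_up: float  # 0.0 to 1.0


@dataclass
class Job:
    estimated_runtime: Optional[int] = None  # seconds; None stands in for UNDEFINED


def needs_dedicated(job: Job) -> bool:
    """Runtime unknown or at least one hour, as in the START expression."""
    return job.estimated_runtime is None or job.estimated_runtime >= 3600


def eligible(job: Job, machine: Machine) -> bool:
    """Long or unknown jobs go to dedicated machines; short jobs go to spot machines."""
    if needs_dedicated(job):
        return not machine.is_spot  # job side: isSpotInstance =?= FALSE
    return machine.is_spot  # assumed: dedicated START rejects short jobs


def cost_score(machine: Machine) -> float:
    """The slide's RANK formula, used here as a cost (lower is better)."""
    return 100 * machine.slot_hour_cost + machine.fraction_of_hour_up


def choose(job: Job, machines: List[Machine]) -> Optional[Machine]:
    """Pick the eligible machine with the lowest cost score, if any."""
    candidates = [m for m in machines if eligible(job, m)]
    return min(candidates, key=cost_score) if candidates else None
```

In this reading, a two-hour job never lands on a spot instance, and among equally priced machines the one with the least of its current hour used is chosen first.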

Page 22: Condor in   Dynamic Environments
Page 23: Condor in   Dynamic Environments
Page 24: Condor in   Dynamic Environments

Some folks don’t want EC2
What about internal VM environments?
  Most places have them, and they’re over-provisioned.
VMware environments:
  Help with server consolidation (Run 40 VMs on beefy servers rather than 40 servers).
  Still have peak vs. median usage problem.
  For example, 500 Core VMware environment running SAP that is 25-40% utilized on average, but still needs all that hardware for peak.

Page 25: Condor in   Dynamic Environments

VMware tiered applications
Thankfully VMware has tiers:
  Production (PRD) usurps User Acceptance Testing (UAT) environment, which usurps Dev (DEV) environment.
  Perfect for harvesting (just like Spot Instances).
Create a GRID tier that PRD/UAT/DEV usurp for resources and have the cores join locally.
Add VMs to pool when there are jobs in queue, and remove GRID VMs when PRD/UAT need them back.
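The tier ordering amounts to strict-priority allocation, with GRID taking whatever the higher tiers leave unused. A minimal sketch, assuming a single shared pool of cores and a hypothetical `allocate` helper (this is not how VMware schedules resources internally):

```python
from typing import Dict, List

# Highest priority first; each tier usurps the ones after it.
TIER_ORDER: List[str] = ["PRD", "UAT", "DEV", "GRID"]


def allocate(total_cores: int, demand: Dict[str, int]) -> Dict[str, int]:
    """Grant cores tier by tier in priority order; lower tiers get the remainder."""
    remaining = total_cores
    granted: Dict[str, int] = {}
    for tier in TIER_ORDER:
        share = min(demand.get(tier, 0), remaining)
        granted[tier] = share
        remaining -= share
    return granted
```

With made-up demands on a 500-core environment, a quiet period (PRD 150, UAT 50, DEV 25) leaves 275 cores for GRID, while a peak (PRD 400, UAT 80, DEV 50) squeezes DEV down to 20 and GRID to zero.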

Page 26: Condor in   Dynamic Environments

VMware Policy Use Cases
Same high level policies work:
  Jobs prefer dedicated nodes.
  Still cycle harvesting, just from a multi-core VM environment rather than just workstations.
  Just like auto-start, we’ll add VM requests to VMware when there are jobs, remove them when they’re idle.
Goal: Turn 40% utilization into 90-95% utilization using technical computing workloads.
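To put the goal in concrete terms, the arithmetic can be applied to the 500-core example from the earlier slide. The helper name is hypothetical, and integer percentages are used to keep the result exact.

```python
def cores_to_fill(total_cores: int, current_pct: int, target_pct: int) -> int:
    """Cores of extra work needed to raise utilization from current_pct to target_pct."""
    if not 0 <= current_pct <= target_pct <= 100:
        raise ValueError("expected 0 <= current_pct <= target_pct <= 100")
    return total_cores * (target_pct - current_pct) // 100
```

For 500 cores at 40% utilization, reaching 90% means keeping about 250 more cores busy, and reaching 95% means about 275.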

Page 27: Condor in   Dynamic Environments

Back to a busy week…

Page 28: Condor in   Dynamic Environments

A lot has changed…

Thanks to Moore’s Law

1988                       2010
The original 200 CPU days  ~11 CPU minutes
~3 CPU years               ~1 core hour
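The slide's figures (200 CPU days shrinking to about 11 CPU minutes, and about 3 CPU years shrinking to about 1 core hour) imply roughly the same speedup factor. A quick check, using hypothetical helper names and a 365-day year:

```python
MINUTES_PER_DAY = 24 * 60
HOURS_PER_YEAR = 365 * 24


def speedup_from_days(cpu_days_1988: float, cpu_minutes_2010: float) -> float:
    """Ratio of 1988 CPU time (in days) to 2010 CPU time (in minutes)."""
    return cpu_days_1988 * MINUTES_PER_DAY / cpu_minutes_2010


def speedup_from_years(cpu_years_1988: float, core_hours_2010: float) -> float:
    """Ratio of 1988 CPU time (in years) to 2010 CPU time (in hours)."""
    return cpu_years_1988 * HOURS_PER_YEAR / core_hours_2010
```

Both comparisons come out at a factor of roughly 26,000, so the two pairs of figures are consistent with each other.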

Page 29: Condor in   Dynamic Environments

But some things have stayed the same:

1988                                2010
“We need more compute power”        “We need more compute power”
Dynamic Environments are plentiful  Dynamic Environments are plentiful