Production Condor in the Cloud

Ian D. Alderman
CycleComputing
http://twitter.com/cyclecomputing

Case Studies
This is a presentation of some work we’ve recently completed with customers.
Focus on two particular customer workflows:
  Part 1: 10,000 core cluster “Tanuki.”
  Part 2: Easily running the same workflow “locally” and in the cloud, including data synchronization.
We will discuss the technical challenges.

Case 1: 10,000 Cores “Tanuki”
Run time = 8 hours
1.14 compute-years of computing executed every hour
Cluster time = 80,000 core-hours = 9.1 compute-years
Total run time cost = ~$8,500
1250 c1.xlarge EC2 instances (8 cores / 7 GB RAM each)
10,000 cores, 8.75 TB RAM, 2 PB of disk space
Weighs in at number 75 on the Top 500 supercomputing list
Cost to run = ~$1,060 / hour
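
The figures on this slide follow from the instance count, the per-instance specs, and the hourly rate. A quick check, assuming a 365-day year and decimal units (1 TB = 1000 GB); the cost per core-hour is derived here and does not appear on the slide:

```python
HOURS_PER_YEAR = 24 * 365   # assumption: non-leap year, 8,760 hours

instances = 1250            # c1.xlarge instances (from the slide)
cores_per_instance = 8
ram_gb_per_instance = 7
run_hours = 8
cost_per_hour = 1060        # approximate US dollars per cluster-hour (from the slide)

cores = instances * cores_per_instance
ram_tb = instances * ram_gb_per_instance / 1000   # decimal terabytes
core_hours = cores * run_hours
compute_years_per_hour = cores / HOURS_PER_YEAR
compute_years_total = core_hours / HOURS_PER_YEAR
total_cost = cost_per_hour * run_hours
cost_per_core_hour = cost_per_hour / cores        # derived, not on the slide

print(f"cores:                  {cores}")
print(f"RAM (TB):               {ram_tb}")
print(f"core-hours:             {core_hours}")
print(f"compute-years per hour: {compute_years_per_hour:.2f}")
print(f"compute-years total:    {compute_years_total:.1f}")
print(f"total cost (USD):       {total_cost}")
print(f"USD per core-hour:      {cost_per_core_hour:.3f}")
```

Eight hours at ~$1,060 per hour gives $8,480, which the slide rounds to ~$8,500.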

Customer Goals
Genentech: “Examine how proteins bind to each other in research that may lead to medical treatments.” - www.networkworld.com
Customer wants to test the scalability of CycleCloud: “Can we run 10,000 jobs at once?”
Same workflow would take weeks or months on existing internal infrastructure.

Customer Goals (cont)
They can’t get answers to their scientific questions quickly enough on their internal cluster.
  Need to know “What do we do next?”
Genentech wanted to find out what Cycle can do:
  How much compute time can we get simultaneously?
  How much will it cost?

Run Timeline
12:35 – 10,000 jobs submitted and requests for batches of cores are initiated
12:45 – 2,000 cores acquired
1:18 – 10,000 cores acquired
9:15 – Cluster shut down

System Components
Condor (& Workflow)
Chef
CycleCloud custom CentOS AMIs
CycleCloud.com
AWS

Technical Challenges: AWS
Rare problems:
  If a particular problem occurs 0.4% of the time and you run 1254 instances, it will happen 5 times on average.
Rare problem examples:
  DOA instances (unreachable).
  Disk problems: can’t mount “ephemeral” disk.
Need chunking: “Thundering Herd” both for us and AWS.
Almost certainly will fill up availability zones.
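
The rare-problem arithmetic can be checked in a few lines. This is a minimal sketch that assumes each instance fails independently with the same probability, which real launches may not satisfy (failures can cluster by availability zone or host):

```python
def expected_failures(n_instances: int, p_fail: float) -> float:
    """Mean number of affected instances under the independence assumption."""
    return n_instances * p_fail


def prob_at_least_one(n_instances: int, p_fail: float) -> float:
    """Chance that at least one instance hits the problem."""
    return 1.0 - (1.0 - p_fail) ** n_instances


n, p = 1254, 0.004   # figures from the slide: 1254 instances, 0.4% failure rate
print(f"expected failures: {expected_failures(n, p):.3f}")
print(f"P(at least one):   {prob_at_least_one(n, p):.4f}")
```

At this scale a 0.4% problem is close to certain to show up at least once, which is why detection has to be automated.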

Technical Challenges: Chef
Chef works by periodic client pulls.
  If all the pulls occur at the same time, the server runs out of resources.
  That’s exactly what happens in our infrastructure when you start 1250 machines at once.
  Machines do ‘fast converge’ initially.
So we need to use beefy servers, configure appropriately, and then stagger launches at a rate the server can handle.
  This was the bottleneck in the system.
Future work: pre-stage/cache Chef results locally to reduce initial impact on the server.
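
Why staggering helps can be seen by counting how many clients are converging against the server at the same moment. A rough sketch: the 120-second converge time and the 2-second launch interval are illustrative assumptions, not measurements from this run, and the function names are hypothetical:

```python
def staggered_starts(n_clients: int, interval_seconds: int) -> list[int]:
    """Start time (in seconds) of each client when launches are evenly spaced."""
    return [i * interval_seconds for i in range(n_clients)]


def peak_concurrency(starts: list[int], converge_seconds: int) -> int:
    """Largest number of clients converging at the same moment."""
    events = []
    for start in starts:
        events.append((start, 1))                       # client begins its pull
        events.append((start + converge_seconds, -1))   # client finishes
    # Tuples sort by time, then by delta, so a finish (-1) is processed
    # before a start (+1) that falls on the same second.
    events.sort()
    active = peak = 0
    for _, delta in events:
        active += delta
        peak = max(peak, active)
    return peak


all_at_once = [0] * 1250
print(peak_concurrency(all_at_once, 120))                   # every client hits the server together
print(peak_concurrency(staggered_starts(1250, 2), 120))     # load spread over ~42 minutes
```

Under these assumed numbers the server sees 60 concurrent converges instead of 1250. The trade-off is a longer ramp-up, so the launch rate should be as fast as the server can actually sustain.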

Technical Challenges: CycleCloud
Implemented monitoring and detection of classes of rare problems.
Batching of requests with delays between successive requests.
Testing: better to request 256 every 5 minutes or 128 every 2.5 minutes? What’s the best rate to make requests?
Set chunk size to 256 EC2 instances at a time
  Did not overwhelm AWS/CycleCloud/Chef infrastructure
  2048 cores got job work stream running immediately
  1250 EC2 requests launched in 25 minutes (12:35 am – 12:56 am Eastern Time)
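
The two candidate request rates can be compared by laying out the request schedule. A minimal sketch: `chunk_schedule` is a hypothetical helper, not part of CycleCloud, and the 5-minute delay paired with the 256-instance chunk is taken from the question on the slide rather than confirmed as the final setting:

```python
def chunk_schedule(total: int, chunk_size: int, delay_minutes: float) -> list[tuple[float, int]]:
    """Return (minutes_after_start, instances_requested) for each batch."""
    schedule = []
    remaining = total
    batch = 0
    while remaining > 0:
        size = min(chunk_size, remaining)   # the last batch may be smaller
        schedule.append((batch * delay_minutes, size))
        remaining -= size
        batch += 1
    return schedule


for when, size in chunk_schedule(1250, 256, 5):
    print(f"t+{when:>4} min: request {size} instances ({size * 8} cores)")
```

With chunks of 256, the first batch alone is 2048 cores, which matches the slide's note that 2048 cores got the job stream running immediately. Both candidate rates average the same 51.2 instances per minute; they differ only in burst size.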

Technical Challenges: Images + FS
Images not affected by scale of this run.
Filesystem: large NFS server.
We didn’t have problems with this run (we were worried!)
Future work: parallel filesystem in the cloud.

Technical Challenges: Condor
Basic configuration changes.
  See condor-wiki.cs.wisc.edu
Make sure user job log isn’t on NFS
  Really a workflow problem…
Used 3 scheduler instances (could have used one)
  1 main scheduler
  2 auxiliary schedulers
  Worked very well and handled the queue of 10,000 jobs just fine
    Scheduler instance 1: 3333 jobs
    Scheduler instance 2: 3333 jobs
    Scheduler instance 3: 3334 jobs
Very nice, steady and even stream of Condor job distribution (see graph)
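
The even split across schedulers, and a submit description that keeps the user job log off NFS, might look roughly like this. The executable name and log path are hypothetical placeholders, and the actual submit files used in the run are not shown in the slides:

```python
def split_jobs(n_jobs: int, n_schedulers: int) -> list[int]:
    """Divide jobs as evenly as possible; leftovers go to the last schedulers."""
    base, extra = divmod(n_jobs, n_schedulers)
    return [base + (1 if i >= n_schedulers - extra else 0)
            for i in range(n_schedulers)]


def submit_description(executable: str, count: int, local_log: str) -> str:
    """Build a minimal Condor submit description with the log on local disk."""
    return "\n".join([
        f"executable = {executable}",
        f"log = {local_log}",      # keep this path on a local disk, not NFS
        f"queue {count}",
    ]) + "\n"


counts = split_jobs(10_000, 3)
print(counts)
for i, count in enumerate(counts, start=1):
    # Hypothetical names: run_job.sh and /var/tmp are placeholders.
    print(submit_description("run_job.sh", count, f"/var/tmp/scheduler{i}.log"))
```

Each description would be submitted on its own scheduler with `condor_submit`.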

So how did it go?
Set up cluster using CycleCloud.com web interface.
All the customer had to do was submit jobs, go to bed, and check results in the morning.

Graph: Cores used in Condor

Graph: Cores used (cluster lifetime)

Graph: Condor Queue

Case 2: Synchronized Cloud Overflow
Insurance companies often have periodic (quarterly) compute demand spikes. They want results ASAP but don’t want to pay for hardware to sit idle the other 11+ weeks of the quarter.
Replicate internal filesystem to cloud.
Run jobs (on Windows).
Replicate results.

Customer Goals
Step outside the fixed cluster size/speed trade-off.
Replicate internal Condor/Windows pipeline using shared filesystem.
Make it look the same to the user – minor UI change to run jobs in cloud vs. local.
Security policy constraints – only outgoing connections to cloud.
Continue to monitor job progress using CycleServer.

System Components
Condor (& Workflow)
CycleServer
Chef
CycleCloud custom Windows AMIs / Filesystem

Technical Challenges: Images and File System
Customized Windows images (Condor).
Init scripts a particular problem.
Machines reboot to get their hostname.
Configure SAMBA on Linux Condor scheduler.

Technical Challenges: Chef
Implemented support for a number of features for Chef on Windows that were missing.
Lots of special case recipes because of the differences.
First restart caused problems.

Technical Challenges: Condor
DAGMan submission has three stages.
  Ensure input gets transferred.
  Submit actual work (1000-5000 compute-hours)
  Transfer results
Cross platform submission.
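
The three stages map naturally onto a linear DAG. A sketch that generates a DAGMan input file with `JOB` and `PARENT ... CHILD` lines; the node names and submit-file names are hypothetical, and the customer's actual DAG is not shown in the slides:

```python
def three_stage_dag(input_sub: str, work_sub: str, results_sub: str) -> str:
    """Return DAGMan input text: transfer input, then compute, then transfer results."""
    lines = [
        f"JOB transfer_input {input_sub}",
        f"JOB compute {work_sub}",
        f"JOB transfer_results {results_sub}",
        # Each stage starts only after the previous one succeeds.
        "PARENT transfer_input CHILD compute",
        "PARENT compute CHILD transfer_results",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical submit-file names, for illustration only.
dag_text = three_stage_dag("sync_in.sub", "work.sub", "sync_out.sub")
print(dag_text)
```

The resulting file would be written to disk and handed to `condor_submit_dag`. If the input transfer fails, DAGMan does not start the compute node, which is the point of making the transfer a stage of its own.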

Technical Challenges: CycleServer
CycleServer is used:
  As a job submission point for cloud jobs.
  To monitor the progress of jobs.
  To monitor other components (Collect-L/Ganglia).
  To coordinate file synchronization on both ends.
Custom plugins.
  DAG jobs initiating file synchronization – wait until it completes.
  Plugins do push/pull from local to remote due to firewall.
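
A DAG job that starts a synchronization and then blocks until it finishes can be as simple as a polling loop. This is a generic sketch, not the CycleServer plugin API: `is_complete` is a hypothetical callable that would ask the local side (over an outgoing connection, per the firewall constraint) whether the transfer is done:

```python
import time
from typing import Callable


def wait_for_sync(
    is_complete: Callable[[], bool],
    poll_seconds: float = 30.0,
    timeout_seconds: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll until is_complete() is true; return False if the timeout passes first."""
    deadline = clock() + timeout_seconds
    while True:
        if is_complete():
            return True
        if clock() >= deadline:
            return False
        sleep(poll_seconds)


if __name__ == "__main__":
    # Stand-in check that reports completion on the third poll.
    answers = iter([False, False, True])
    done = wait_for_sync(lambda: next(answers), poll_seconds=0.1, timeout_seconds=5)
    print("synchronized" if done else "timed out")
```

The job's exit code would then tell DAGMan whether to run the next stage: exit 0 when the loop returns True, non-zero on timeout.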

Cloud vs. On-demand Clusters

Cloud Cluster
  Actions taken to provision: button pushed on website
  Duration to start: 45 minutes
  Copying software – minutes
  Copying data – minutes to hours
  Ready for jobs

Physical Cluster
  Actions taken to provision:
    Budgeting
    Evaluating vendors
    Picking hardware options
    Procurement process (4-5 months)
    Waiting for servers to ship
    Getting datacenter space
    Upgrading power/cooling
    Unpacking, stacking, racking the servers
    Plugging in networking equipment and storage
    Installing images/software
    Testing burn-in and networking addresses
  Ready for jobs
