
In 2016, the virtual machine (VM) definition for the Vacuum platform [2] was rewritten from scratch to use the industry-standard cloud-init format for contextualization. This allowed common elements of the contextualization recipe used across cloud platforms in ATLAS to be consolidated.

Moreover, VM instances now run as HTCondor workers rather than directly retrieving payloads from PanDA. This aligns the workflow with that of other cloud and batch systems used in ATLAS, while retaining the feedback-based philosophy of the vacuum model. Future work is planned to continue converging the various ATLAS virtual machine definitions.
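As a rough sketch of the kind of contextualization described above, a cloud-init user-data file can install and configure HTCondor so the VM joins a central pool as an execute node. The pool hostname, file paths and settings below are illustrative examples, not the actual ATLAS recipe:

```yaml
#cloud-config
# Illustrative cloud-init contextualization for a worker VM.
# All hostnames and values are hypothetical.
package_update: true
packages:
  - condor

write_files:
  - path: /etc/condor/config.d/50-worker.conf
    content: |
      # Join a (hypothetical) central HTCondor pool as an execute node
      CONDOR_HOST = condor-pool.example.org
      DAEMON_LIST = MASTER, STARTD
      START = TRUE

runcmd:
  - systemctl enable --now condor
```

Because the same user-data format is understood on OpenStack, EC2 and other platforms, a recipe like this is what allows the common elements to be shared across clouds.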

Consolidation of Cloud Computing in ATLAS

Ryan P Taylor,1 Cristovao Jose Domingues Cordeiro,2 Domenico Giordano,2 Alessandro Di Girolamo,2 John Hover,3 Tomas Kouba,4 Peter Love,5 Andrew McNab,6 Jaroslava Schovancova2 and Randall Sobie,1 on behalf of the ATLAS Collaboration

1 University of Victoria
2 CERN
3 Brookhaven National Laboratory
4 Czech Academy of Sciences
5 Lancaster University
6 University of Manchester

Amazon Scale Testing

Vac Integration

Commercial Cloud Procurement

Cloud Benchmarking Suite

Cloud Monitoring

Right: 40,000 cores used by jobs on EC2 East.

Below: Data workflow on AWS.

Building on successful scaling tests of Amazon EC2 in 2014 at the level of 20,000 cores [5], a 40,000-core run was conducted in 2015. High-bandwidth direct network peering and S3 integration have been established to enable running at this scale.

To efficiently exploit the spot market, an interruptible workload is needed. Event Service jobs meet this need, and natively support data stage-out to S3. The S3 storage is integrated with the grid using FTS and an SRM and GridFTP server on S3FS.

Above: Number of ATLAS jobs running on six Vac sites in the UK over the last year, peaking around 700.

CERN has conducted several exercises in commercial cloud resource procurement, resulting in valuable experience in integrating these resources into the ATLAS distributed computing framework, which will lead to increased adoption in the future [6].

In the context of commercial providers, accounting is of particular importance for understanding performance, usage and efficiency. Data from the existing monitoring infrastructure was leveraged to provide accounting information, and performance monitoring is based on the benchmarking suite.

Left: Performance monitoring of commercial clouds. Contractual review or compensatory measures may be called for if performance falls below the agreed level into the compensation region.
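The threshold logic behind such a plot can be sketched as a simple classification of a measured benchmark score against the agreed level. The function name, the 10% tolerance band and the sample numbers below are illustrative, not the actual contractual figures:

```python
# Sketch of a compensation-region check for commercial-cloud performance.
# The agreed benchmark level and the 10% tolerance band are illustrative
# values, not the real contractual thresholds.

def performance_status(measured_hs06: float, agreed_hs06: float,
                       tolerance: float = 0.10) -> str:
    """Classify a measured benchmark score against the agreed level."""
    if measured_hs06 >= agreed_hs06:
        return "ok"              # at or above the agreed level
    if measured_hs06 >= agreed_hs06 * (1.0 - tolerance):
        return "review"          # slightly below: contractual review
    return "compensation"        # in the compensation region

print(performance_status(11.0, 10.0))  # ok
print(performance_status(9.5, 10.0))   # review
print(performance_status(8.0, 10.0))   # compensation
```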

The Sim@P1 project enables the ATLAS High-Level Trigger farm to be used for offline production activities. Using OpenStack, up to 70,000 cores in 2,300 compute nodes are exploited for running primarily CPU-intensive jobs, such as event generation and Monte Carlo production, when not needed for TDAQ purposes.

Recent developments have been made on a toolkit for fast switching between the TDAQ and Sim@P1 modes of operation, as well as improved resource partitioning and monitoring. Enhancements in resource provisioning now allow the entire farm to be ramped up for running jobs in as little as two hours. Testing of Event Service pilots is also underway [1].

The Cloud Benchmarking Suite [4] is a configurable framework for running a selection of benchmarks, and transporting, storing and analysing the measured results. It enables users to profile the performance of computing resources and quickly identify possible issues, in a unified way.

The benchmarking suite is being used by ATLAS to quantify the performance of every VM on several clouds, in HS06-equivalent units. A fast benchmark runs at bootup, and the result is joined with the delivered walltime and CPU time for each VM, to generate HS06-normalized accounting data.
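The join described above can be sketched as follows. The field names, the convention that the boot-time benchmark is reported per core, and the sample numbers are all illustrative assumptions, not taken from the actual accounting pipeline:

```python
# Sketch of HS06-normalized accounting: the fast-benchmark score measured
# at VM boot is joined with the walltime and CPU time delivered by that VM.
# Field names, units and sample numbers are illustrative.

def hs06_accounting(benchmarks: dict, usage: list) -> list:
    """Join per-VM benchmark scores (HS06 per core) with delivered time."""
    records = []
    for vm in usage:
        hs06_per_core = benchmarks[vm["vm_id"]]  # fast benchmark at boot
        records.append({
            "vm_id": vm["vm_id"],
            # walltime is per VM, so scale by core count to get core-hours
            "hs06_walltime_hours": hs06_per_core * vm["cores"] * vm["walltime_h"],
            # CPU time is assumed already summed over cores (core-hours)
            "hs06_cputime_hours": hs06_per_core * vm["cputime_h"],
        })
    return records

benchmarks = {"vm-01": 11.2}   # HS06 per core, measured at bootup
usage = [{"vm_id": "vm-01", "cores": 4, "walltime_h": 10.0, "cputime_h": 34.0}]
print(hs06_accounting(benchmarks, usage))
```

In this sketch the normalization is a per-VM multiplication, so heterogeneous VM flavours across clouds all end up reported in the same HS06-hour units.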

Left: architecture of the cloud benchmark system. Results are transported using ActiveMQ, stored in Elasticsearch, and visualized with Kibana.

When operating several HTCondor/Cloud Scheduler instances, each with several clouds, job types, and VM types, it is essential to have a unified monitoring interface to present a coherent view of all VMs and jobs [3].

The Cloud Monitor stack accomplishes this using standard tools such as Ganglia, ELK, Graphite, Grafana, Sensu, Plotly.js and Flask. All monitored metrics can be visualized in time-series graphs in a flexible manner. One can select any VM to see the jobs which ran on it, and examine the job logs as well.

Left: running jobs of different types over time.

Above: the current quantity of VMs and jobs in different states on one cloud.

Sim@P1 Improvements

1. T. Kouba, "Evolution and Experience With the ATLAS Simulation at Point1 Project", presented at CHEP 2016 poster session

2. A. McNab, "The Vacuum Platform", presented at CHEP 2016 poster session

3. R. Sobie, "Context­Aware Cloud Computing for HEP", proceedings of ISGC 2016

4. D. Giordano et al, "Benchmarking Cloud Resources", presented at CHEP 2016 poster session

5. R. Taylor et al, "The Evolution of Cloud Computing in ATLAS", 2015 J. Phys.: Conf. Ser. 664 022038

6. D. Giordano et al, "CERN Computing in Commercial Clouds", presented at CHEP 2016 lecture session

Above: cores used by Sim@P1 over the last year. About 5,000 cores are generally available for use, with higher opportunistic usage during technical stops, etc.
