* authors contributed equally - biorxiv · 7/19/2016  · 120 the climb core infrastructure is a...

10
CLIMB (the Cloud Infrastructure for Microbial Bioinformatics): an online resource for the 1 medical microbiology community 2 3 Thomas R. Connor* +1 , Nicholas J. Loman* +2 , Simon Thompson* 3 , Andy Smith 2 , Joel 4 Southgate 1 , Radoslaw Poplawski 2,3 , Matthew J. Bull 1 , Emily Richardson 2 , Matthew Ismail 4 , 5 Simon Elwood-Thompson 5 , Christine Kitchen 6 , Martyn Guest 6 , Marius Bakke 7 , Sam K. 6 Sheppard* 8 , Mark J. Pallen* 7 7 * Authors contributed equally 8 + For correspondence; [email protected]; [email protected] 9 10 Affiliations 11 12 1. Cardiff University School of Biosciences, The Sir Martin Evans Building, Cardiff University, 13 Cardiff, CF10 2AX, UK 14 2. Institute of Microbiology and Infection, University of Birmingham, Birmingham, B15 2TT, 15 UK 16 3. IT Services (Research Computing), University of Birmingham, Birmingham, B15 2TT, UK 17 4. Centre for Scientific Computing, University of Warwick, Coventry, CV4 7AL, UK. 18 5. College of Medicine, Swansea University, Swansea, UK. 19 6. Advanced Research Computing@Cardiff (ARCCA), Cardiff University, UK 20 7. Microbiology and Infection Unit, Warwick Medical School, University of Warwick, 21 Coventry, United Kingdom 22 8. The Milner Centre for Evolution, Department of Biology and Biochemistry, University of 23 Bath, Bath, BA2 7AY, UK 24 25 ABSTRACT 26 27 The increasing availability and decreasing cost of high-throughput sequencing has 28 transformed academic medical microbiology, delivering an explosion in available genomes 29 while also driving advances in bioinformatics. However, many microbiologists are unable to 30 exploit the resulting large genomics datasets because they do not have access to relevant 31 computational resources and to an appropriate bioinformatics infrastructure. Here, we 32 present the Cloud Infrastructure for Microbial Bioinformatics (CLIMB) facility, a shared 33 computing infrastructure that has been designed from the ground up to provide an 34 environment where microbiologists can share and reuse methods and data. 35 36 DATA SUMMARY 37 38 The paper describes a new, freely available public resource and therefore no data has been 39 generated. The resource can be accessed at http://www.climb.ac.uk. Source code for 40 software developed for the project can be found at http://github.com/MRC-CLIMB/ 41 42 I/We confirm all supporting data, code and protocols have been provided within the 43 article or through supplementary data files. 44 45 IMPACT STATEMENT 46 47 . CC-BY 4.0 International license not certified by peer review) is the author/funder. It is made available under a The copyright holder for this preprint (which was this version posted July 19, 2016. . https://doi.org/10.1101/064451 doi: bioRxiv preprint

Upload: others

Post on 28-Jul-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: * Authors contributed equally - bioRxiv · 7/19/2016  · 120 The CLIMB core infrastructure is a cloud system running the free open-source cloud 121 operating system OpenStack (9)

CLIMB(theCloudInfrastructureforMicrobialBioinformatics):anonlineresourceforthe1medicalmicrobiologycommunity2

3ThomasR.Connor*+1,NicholasJ.Loman*+2,SimonThompson*3,AndySmith2,Joel4Southgate1,RadoslawPoplawski2,3,MatthewJ.Bull1,EmilyRichardson2,MatthewIsmail4,5SimonElwood-Thompson5,ChristineKitchen6,MartynGuest6,MariusBakke7,SamK.6Sheppard*8,MarkJ.Pallen*77*Authorscontributedequally8+Forcorrespondence;[email protected];n.j.loman@bham.ac.uk910Affiliations11121.CardiffUniversitySchoolofBiosciences,TheSirMartinEvansBuilding,CardiffUniversity,13Cardiff,CF102AX,UK142.InstituteofMicrobiologyandInfection,UniversityofBirmingham,Birmingham,B152TT,15UK163.ITServices(ResearchComputing),UniversityofBirmingham,Birmingham,B152TT,UK174.CentreforScientificComputing,UniversityofWarwick,Coventry,CV47AL,UK.185.CollegeofMedicine,SwanseaUniversity,Swansea,UK.196.AdvancedResearchComputing@Cardiff(ARCCA),CardiffUniversity,UK207.MicrobiologyandInfectionUnit,WarwickMedicalSchool,UniversityofWarwick,21Coventry,UnitedKingdom228.TheMilnerCentreforEvolution,DepartmentofBiologyandBiochemistry,Universityof23Bath,Bath,BA27AY,UK24

25ABSTRACT2627Theincreasingavailabilityanddecreasingcostofhigh-throughputsequencinghas28transformedacademicmedicalmicrobiology,deliveringanexplosioninavailablegenomes29whilealsodrivingadvancesinbioinformatics.However,manymicrobiologistsareunableto30exploittheresultinglargegenomicsdatasetsbecausetheydonothaveaccesstorelevant31computationalresourcesandtoanappropriatebioinformaticsinfrastructure.Here,we32presenttheCloudInfrastructureforMicrobialBioinformatics(CLIMB)facility,ashared33computinginfrastructurethathasbeendesignedfromthegrounduptoprovidean34environmentwheremicrobiologistscanshareandreusemethodsanddata.35

36DATASUMMARY3738Thepaperdescribesanew,freelyavailablepublicresourceandthereforenodatahasbeen39generated.Theresourcecanbeaccessedathttp://www.climb.ac.uk.Sourcecodefor40softwaredevelopedfortheprojectcanbefoundathttp://github.com/MRC-CLIMB/4142I/Weconfirmallsupportingdata,codeandprotocolshavebeenprovidedwithinthe43articleorthroughsupplementarydatafiles.☒ 44

45IMPACTSTATEMENT4647

.CC-BY 4.0 International licensenot certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which wasthis version posted July 19, 2016. . https://doi.org/10.1101/064451doi: bioRxiv preprint

Page 2: * Authors contributed equally - bioRxiv · 7/19/2016  · 120 The CLIMB core infrastructure is a cloud system running the free open-source cloud 121 operating system OpenStack (9)

Technologicaladvancesmeanthatgenomesequencingisnowrelativelysimple,quick,and48affordable.However,handlinglargegenomedatasetsremainsasignificantchallengefor49manymicrobiologists,withsubstantialrequirementsforcomputationalresourcesand50expertiseindatastorageandanalysis.Thishasledtofragmentaryapproachestosoftware51developmentanddatasharingthatreducethereproducibilityofresearchandlimits52opportunitiesforbioinformaticstraining.Here,wedescribeanationwideelectronic53infrastructurethathasbeendesignedtosupporttheUKmicrobiologycommunity,providing54simplemechanismsforaccessinglarge,shared,computationalresourcesdesignedtomeet55thebioinformaticneedsofmicrobiologists.56

57INTRODUCTION58Genomesequencinghastransformedthescaleofquestionsthatcanbeaddressedby59biologicalresearchers.Sincethepublicationofthefirstbacterialgenomesequenceover60twentyyearsago(1),therehasbeenanexplosionintheproductionofmicrobialgenome61sequencedata,fuelledmostrecentlybyhigh-throughputsequencing(2).Thishasplaced62microbiologyattheforefrontofdata-drivenscience(3).Asaconsequence,thereisnow63hugedemandforphysicalandcomputationalinfrastructurestoproduce,analyseandshare64microbiologicalsoftwareanddatasetsandarequirementfortrainedbioinformaticiansthat65canusegenomedatatoaddressimportantquestionsinmicrobiology(4).Itisworth66stressingthatmicrobialgenomics,withitsfocusontheriotousvariationseeninmicrobial67genomes,bringschallengesaltogetherdifferentfromtheanalysisofthelargerbutless68variablegenomesofhumans,animalsorplants.6970Onesolutiontothedata-delugechallengeisforeverymicrobiologyresearchgroupto71establishtheirowndedicatedbioinformaticshardwareandsoftware.However,thisentails72considerableupfrontinfrastructurecostsandgreatinefficienciesofeffort,whilealso73encouragingaworking-in-silosmentality,whichmakesitdifficulttosharedataandpipelines74andthushardtoreplicateresearch.Cloudcomputingprovidesanalternativeapproachthat75facilitatestheuseoflargegenomedatasetsinbiologicalresearch(5).7677Thecloud-computingapproachincorporatesasharedonlinecomputationalinfrastructure,78whichsparestheenduserfromworryingabouttechnicalissuessuchastheinstallation79maintenanceand,even,thelocationofphysicalcomputingresources,togetherwithother80potentiallytroublingissuessuchassystemsadministration,datasharing,scalability,security81andbackup.Attheheartofcloudcomputingliesvirtualization,anapproachinwhicha82physicalcomputingset-upisre-purposedintoascalablesystemofmultipleindependent83virtualmachines,eachofwhichcanbepre-loadedwithsoftware,customisedbyendusers84andsavedassnapshotsforre-usebyothers.Ideally,suchaninfrastructurealsoprovides85large-scaledatastorageandcomputecapacityondemand,reducingcoststothepublic86pursebyoptimisingutilizationofhardwareandavoidingresourcessittingidlewhilestill87capitalisingontheeconomiesofscale.8889Thepotentialforcloudcomputinginbiologicalresearchhasbeenrecognizedbyfunding90organisationsandhasseenthedevelopmentofnationwideresourcessuchasiPlant(6)91(nowCyVerse),NECTAR(7)andXSEDE(8)thatprovideresearcherswithaccesstolarge92cloudinfrastructures.Here,wedescribeanewfacility,designedspecificallyfor93microbiologists,toprovideacomputationalandbioinformaticsinfrastructurefortheUK’s94

.CC-BY 4.0 International licensenot certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which wasthis version posted July 19, 2016. . https://doi.org/10.1101/064451doi: bioRxiv preprint

Page 3: * Authors contributed equally - bioRxiv · 7/19/2016  · 120 The CLIMB core infrastructure is a cloud system running the free open-source cloud 121 operating system OpenStack (9)

academicmedicalmicrobiologycommunity,facilitatingtraining,accesstohardwareand95sharingofdataandsoftware.9697

98Resourceoverview99TheCloudInfrastructureforMicrobialBioinformatics(CLIMB)facilityisintendedasa100generalsolutiontopressingissuesinbig-datamicrobiology.Theresourcecomprisesacore101physicalinfrastructure(Figure1),combinedwiththreekeyfeaturesmakingthecloud102suitableformicrobiologists.103104First,CLIMBprovidesasingleenvironment,completewithpipelinesanddatasetsthatare105co-locatedwithcomputingresource.Thismakestheprocessofaccessingpublished106packagesandsequencedatasimplerandfaster,improvingreuseofsoftwareanddata.107108Second,CLIMBhasbeendesignedwithtraininginmind.Ratherthanhavingtrainees109configurepersonallaptopsorfacechallengesingainingaccesstosharedhighperformance110computingresources,weprovidetrainingimagesonvirtualmachinesthathaveallthe111necessarysoftwareinstalledandweprovideeachtraineewithherownpersonalserverto112continuelearningaftertheworkshopconcludes.113114Third,bybringingtogetherexpertbioinformaticians,educatorsandbiologistsinaunified115system,CLIMBprovidesanenvironmentwhereresearchersacrossinstitutionscanshare116dataandcode,permittingcomplexprojectsiterativelytoberemixed,reproduced,updated117andimproved.118119TheCLIMBcoreinfrastructureisacloudsystemrunningthefreeopen-sourcecloud120operatingsystemOpenStack(9).Thissystemallowsustorunover1000virtualmachinesat121anyonetime,eachpreconfiguredwithastandarduserconfiguration.Acrossthecloud,we122haveaccesstoalmost43terabytesofRAM.Specialistuserscanrequestaccesstooneofour123twelvehigh-memoryvirtualmachineseachwith3terabytesofRAMforespeciallylarge,124complexanalyses(Figure1).Thesystemisspreadoverfoursitestoenhanceitsresilience125andissupportedbylocalscratchstorageof500TBpersiteemployingIBM’sSpectrumScale126storage(formerlyGPFS).Thesystemisunderpinnedbyalargesharedobjectstoragesystem127thatprovidesapproximately2.5petabytesofdatastorage,whichmaybereplicatedbetween128sites.Thisstoragesystem,runningthefreeopen-sourceCephsystem(10),providesaplace129tostoreandshareverylargemicrobialdatasets—forcomparison,thebacterialcomponent130oftheEuropeanNucleotideArchiveiscurrentlyaround400terabytesinsize.TheCLIMB131systemcanbecoupledtosequencingservices;forexample,sequencedatageneratedbythe132MicrobesNGservice(11)hasbeenmadeavailablewithintheCLIMBsystem.133134Resourceperformance135ToassesstheperformanceoftheCLIMBsystemincomparisontotraditionalHigh136PerformanceComputing(HPC)systemsandsimilarcloudsystems,weundertookasmall-137scalebenchmarkingexercise(Figure2).ComparedtotheRavenHPCresourceatCardiff138(runningIntelprocessorsagenerationbehindthoseinCLIMB),performanceonCLIMBwas139generallygood,offeringarelativeincreaseinperformanceofupto38%ontaskscommonly140undertakenbymicrobialbioinformaticians.TheCLIMBsystemalsocompareswelltocloud141

.CC-BY 4.0 International licensenot certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which wasthis version posted July 19, 2016. . https://doi.org/10.1101/064451doi: bioRxiv preprint

Page 4: * Authors contributed equally - bioRxiv · 7/19/2016  · 120 The CLIMB core infrastructure is a cloud system running the free open-source cloud 121 operating system OpenStack (9)

serversfrommajorproviders,offeringbetteraggregateperformancethanMicrosoftAzure142A8andGoogleN1S2virtualmachines.Theresultsalsorevealanumberoffeaturesthatmay143berelevanttowhereauserchoosestoruntheiranalysis.CLIMBperformsworsethanRaven144whenrunningBEAST,andprovidesalimitedincreaseinperformanceforthepackage145nhmmer,suggestingthatwhileitispossibletoruntheseanalysesonCLIMB,otherresources146–suchaslocalHPCfacilitiesmightbemoreappropriate.Conversely,thelargestperformance147increasesareobservedforProkka,SnippyandPhyML,whichencompasssomeofthemost148commonlyusedanalysesundertakeninmicrobialgenomics.Itisalsointerestingtonotethat149bothcommercialcloudsofferexcellentperformancerelativetoRavenfortwoworkloads;150muscleandPhyML.Thesourceofthisperformanceisdifficulttopredict,butitispossible151thattheseworkloadsmaybemoresimilartothesortofworkloadsthatthesecloudservices152havebeendesignedtohandle.Onthebasisoftheperformanceresultsmoregenerally,153however,CLIMBislikelytoofferanumberofperformancebenefitsoverlocalresourcesfor154manymicrobialbioinformaticsworkloads.155156Providingasingleenvironmentfortraining,dataandsoftwaresharing157TheCLIMBsystemisaccessedthroughtheInternet,viaasimplesetofwebinterfaces158enablingthesharingofsoftwareonvirtualmachines(Figure3).Usersrequestavirtual159machineviaawebform.Eachvirtualmachinemakesavailablethemicrobialversionofthe160GenomicsVirtualLaboratory(Figure3)(7).Thisincludesasetofwebtools(Galaxy,Jupyter161NotebookandRStudio,withanoptionalPacBioSMRTportal),aswellasasetofpipelines162andtoolsthatcanbeaccessedviathecommandline.Thisstandardisedenvironment163providesacommonplatformforteaching,whilethebaseimageprovidesaversatile164platformthatcanbecustomizedtomeettheneedsofindividualresearchers.165166Systemaccess167Usersregisteratourwebsite,usingtheirUKacademiccredentials(http://bryn.climb.ac.uk/).168Uponregistration,usershaveoneoftwomodesofaccess:thefirstistolaunchaninstance169runningapreconfiguredvirtualmachine,withasetofpredefinedpipelinesortools,which170includestheGenomicsVirtualLaboratory.Thesecondoptionisaimedatexpert171bioinformaticiansanddeveloperswhomaywanttobeabletodeveloptheirownvirtual172machinesfromabaseimage—toenablethiswealsoallowuserstoaccessthesystemviaa173dashboard,similartothatprovidedbyAmazonWebServices,whereuserscanspecifythe174sizeandtypeofvirtualmachinethattheywouldlike,withthesystemthenprovisioningthis175upondemand.Tosharetheresourcefairly,userswillhaveindividualquotasthatcanbe176increasedonrequest.Irrespectiveofquotasize,accesstothesystemisfreeofchargetoUK177academicusers.178179

180CONCLUSION181182CLIMBisprobablythelargestcomputersystemdedicatedtomicrobiologyintheworld.The183systemhasalreadybeenusedtoaddressmicrobiologicalquestionsfeaturingbacteria(12)184andviruses(13).CLIMBhasbeendesignedfromthegrounduptomeettheneedsof185microbiologists,providingacoreinfrastructurethatispresentedinasimple,intuitiveway.186Individualelementsofthesystem—suchasthelargesharedstorageandextremelylarge187memorysystems—providecapabilitiesthatareusuallynotavailablelocallyto188

.CC-BY 4.0 International licensenot certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which wasthis version posted July 19, 2016. . https://doi.org/10.1101/064451doi: bioRxiv preprint

Page 5: * Authors contributed equally - bioRxiv · 7/19/2016  · 120 The CLIMB core infrastructure is a cloud system running the free open-source cloud 121 operating system OpenStack (9)

microbiologistswithinmostUKinstitutions,whilethesharednatureofthesystemprovides189newopportunitiesfordataandsoftwaresharingthathavethepotentialenhanceresearch190reproducibilityindataintensivebiology.Cloudcomputingclearlyhasthepotentialto191revolutionizehowbiologicaldataaresharedandanalysed.Wehopethatthemicrobiology192researchcommunitywillcapitaliseonthesenewopportunitiesbyexploitingtheCLIMB193facility.194195

196ACKNOWLEDGEMENTS197WewouldliketoespeciallyacknowledgetheassistanceofSimonGladman,AndrewLonie,198TorstenSeemanandNuwanGoonasekerafortheirextensiveassistanceingettingtheGVL199runningonCLIMB.WethankIsabelDodd(Warwick)andBenPascoe(Bath)forassistance200managingtheproject,andwouldalsoliketothankthelocalUniversitynetworkandITstaff201(particularlyDrIanMerrickandKevinMunnatCardiffandChrisJonesatSwansea)whohave202helpedtogetthesystemupandrunning,andUniversityprocurementstaff(especially203AnthonyHaleatCardiff)whoworkedtogetthesystempurchasedinchallengingtimescales.204Inaddition,wewouldalsoliketothankearlyaccessCLIMBusersfortestingandreporting205issueswithourservice,withparticularthankstoPhilAshton(OxfordUniversityClinical206ResearchUnitVietnam),EdFeil,HarryThorpe,SionBayliss(UniversityofBath).Wethank207EmilyRichardson(UniversityofBirmingham)fordevelopingtutorialmaterialsforCLIMBand208testing.WethankallourprojectpartnersandsuppliersatOCF,IBM,RedHat/Ceph,Dell,209MellanoxandBrocadeforsupportoftheprojectwithparticularthankstoArifAliand210GeorginaEllis(OCF),DaveCoughlinandHenryBennett(Dell),BenHarrison(RedHat),211StephanHohn(RedHat/InkTank),JimRoche(Lenovo).212213

214REFERENCES2152162171. FleischmannRD,AdamsMD,WhiteO,ClaytonRA,KirknessEF,KerlavageAR,Bult218

CJ,TombJF,DoughertyBA,MerrickJM,etal.1995.Whole-genomerandom219sequencingandassemblyofHaemophilusinfluenzaeRd.Science269:496-512.220

2. LomanNJ,PallenMJ.2015.Twentyyearsofbacterialgenomesequencing.NatRev221Microbiol13:787-794.222

3. MarxV.2013.Biology:Thebigchallengesofbigdata.Nature498:255-260.2234. ChangJ.2015.Coreservices:Rewardbioinformaticians.Nature520:151-152.2245. SteinLD,KnoppersBM,CampbellP,GetzG,KorbelJO.2015.Dataanalysis:Createa225

cloudcommons.Nature523:149-151.2266. MerchantN,LyonsE,GoffS,VaughnM,WareD,MicklosD,AntinP.2016.The227

iPlantCollaborative:CyberinfrastructureforEnablingDatatoDiscoveryfortheLife228Sciences.PLoSBiol14:e1002342.229

7. AfganE,SloggettC,GoonasekeraN,MakuninI,BensonD,CroweM,GladmanS,230KowsarY,PheasantM,HorstR,LonieA.2015.GenomicsVirtualLaboratory:A231PracticalBioinformaticsWorkbenchfortheCloud.PLoSOne10:e0140829.232

8. TownsJ,CockerillT,DahanM,FosterI,GaitherK,GrimshawA,HazlewoodV,233LathropS,LifkaD,PetersonGD,RoskiesR,ScottJR,Wilkens-DiehrN.2014.XSEDE:234Acceleratingscientificdiscovery.ComputinginScienceandEngineering16:62-74.235

.CC-BY 4.0 International licensenot certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which wasthis version posted July 19, 2016. . https://doi.org/10.1101/064451doi: bioRxiv preprint

Page 6: * Authors contributed equally - bioRxiv · 7/19/2016  · 120 The CLIMB core infrastructure is a cloud system running the free open-source cloud 121 operating system OpenStack (9)

9. Openstack.http://www.openstack.org.23610. RedHatCephStorage.http://www.redhat.com/en/technologies/storage/ceph.23711. MicrobesNG:https://microbesng.uk.23812. ConnorTR,BarkerCR,BakerKS,WeillFX,TalukderKA,SmithAM,BakerS,Gouali239

M,PhamThanhD,JahanAzmiI,DiasdaSilveiraW,SemmlerT,WielerLH,Jenkins240C,CraviotoA,FaruqueSM,ParkhillJ,WookKimD,KeddyKH,ThomsonNR.2015.241Species-widewholegenomesequencingrevealshistoricalglobalspreadandrecent242localpersistenceinShigellaflexneri.Elife4.243

13. QuickJ,LomanNJ,DuraffourS,SimpsonJT,SeveriE,CowleyL,BoreJA,244KoundounoR,DudasG,MikhailA,OuedraogoN,AfroughB,BahA,BaumJH,245Becker-ZiajaB,BoettcherJP,Cabeza-CabrerizoM,Camino-SanchezA,CarterLL,246DoerrbeckerJ,EnkirchT,Garcia-DorivalI,HetzeltN,HinzmannJ,HolmT,247KafetzopoulouLE,KoropoguiM,KosgeyA,KuismaE,LogueCH,MazzarelliA,248MeiselS,MertensM,MichelJ,NgaboD,NitzscheK,PallaschE,PatronoLV,249PortmannJ,RepitsJG,RickettNY,SachseA,SingethanK,VitorianoI,250YemanaberhanRL,ZekengEG,RacineT,BelloA,SallAA,FayeO,etal.2016.Real-251time,portablegenomesequencingforEbolasurveillance.Nature530:228-232.252

253254

DATABIBLIOGRAPHY255256Notapplicable257

258 259

.CC-BY 4.0 International licensenot certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which wasthis version posted July 19, 2016. . https://doi.org/10.1101/064451doi: bioRxiv preprint

Page 7: * Authors contributed equally - bioRxiv · 7/19/2016  · 120 The CLIMB core infrastructure is a cloud system running the free open-source cloud 121 operating system OpenStack (9)

FIGURESANDTABLES260261

262Figure1.Overviewofthesystem.A.Thesiteswherethecomputationalhardwareisbased.263B.Highleveloverviewofthesystemandhowthedifferentsoftwarecomponentsconnectto264

.CC-BY 4.0 International licensenot certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which wasthis version posted July 19, 2016. . https://doi.org/10.1101/064451doi: bioRxiv preprint

Page 8: * Authors contributed equally - bioRxiv · 7/19/2016  · 120 The CLIMB core infrastructure is a cloud system running the free open-source cloud 121 operating system OpenStack (9)

oneanother.C.Computehardwarepresentateachofthefoursites.D.Hardware265comprisingtheCephstoragesystemateachsite.E.Typeandroleofnetworkhardwareused266ateachsite.267268Figure2(appendedatendofdocument).RelativeperformanceofVMsrunningoncloud269services,comparedtotheCardiffUniversityHPCsystem,Raven.A.Valuesforeachpackage270arethemeanofthewalltimetakenfor10runsperformedonRaven,dividedbythemean271walltimeof40runsundertakenontheVMonthenamedservice.Valuesgreaterthan1are272fasterthanRaven,valueslessthan1areslower.B.Showingtherawwalltimevaluesforthe273namedsoftwareoneachofthesystems.274275276Figure3.(appendedatendofdocument)CLIMBvirtualmachinelaunchworkflow.Auser,277onloggingintotheBrynlauncherinterfaceispresentedwithalistofthevirtualmachines278theyarerunningandareabletostop,rebootorterminatethem(panelA).Userslauncha279newGVLserverwithaminimalinterface,specifyinganame,theserver‘flavour’(useror280group)andanaccesspassword(panelB).Onbooting,theuseraccessesawebserver281runningontheGVLinstance,whichgivesaccesstovariousservicesthatarestarted282automatically(panelC).TheGVLprovidesaccesstoaCloudman,aGalaxyserver,an283administrationinterface,JupyternotebookandRStudio(panelD,toptobottom).284285SupplementaryTable1.Tablecontainstherawwalltimerecordedfor40independentruns286oneachoftheindicatedcomparisonsystems.Forfigure2theserawdatawerecompared287againstthemeanwalltimerecordedfrom10runsontheCardiffUniversityRavensystem.288Therawandcalculatedvaluesareintheattachedspreadsheet,indifferentworkbooks.289290

291

.CC-BY 4.0 International licensenot certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which wasthis version posted July 19, 2016. . https://doi.org/10.1101/064451doi: bioRxiv preprint

Page 9: * Authors contributed equally - bioRxiv · 7/19/2016  · 120 The CLIMB core infrastructure is a cloud system running the free open-source cloud 121 operating system OpenStack (9)

velvth

velvetg

snippy

prokka

phyml

nhmmer

muscle

gunzip

blastn

beast

Walltime as proportion of Raven walltime (larger is better) Raw walltime (s)

CLIMB

Azure

A8

Raven

iPlant

4vCPU 8GB

Google

N1S2

Azure

G1

A B.CC-BY 4.0 International licensenot certified by peer review) is the author/funder. It is made available under a

The copyright holder for this preprint (which wasthis version posted July 19, 2016. . https://doi.org/10.1101/064451doi: bioRxiv preprint

Page 10: * Authors contributed equally - bioRxiv · 7/19/2016  · 120 The CLIMB core infrastructure is a cloud system running the free open-source cloud 121 operating system OpenStack (9)

.CC-BY 4.0 International licensenot certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which wasthis version posted July 19, 2016. . https://doi.org/10.1101/064451doi: bioRxiv preprint