CLIMB(theCloudInfrastructureforMicrobialBioinformatics):anonlineresourceforthe1medicalmicrobiologycommunity2
3ThomasR.Connor*+1,NicholasJ.Loman*+2,SimonThompson*3,AndySmith2,Joel4Southgate1,RadoslawPoplawski2,3,MatthewJ.Bull1,EmilyRichardson2,MatthewIsmail4,5SimonElwood-Thompson5,ChristineKitchen6,MartynGuest6,MariusBakke7,SamK.6Sheppard*8,MarkJ.Pallen*77*Authorscontributedequally8+Forcorrespondence;[email protected];n.j.loman@bham.ac.uk910Affiliations11121.CardiffUniversitySchoolofBiosciences,TheSirMartinEvansBuilding,CardiffUniversity,13Cardiff,CF102AX,UK142.InstituteofMicrobiologyandInfection,UniversityofBirmingham,Birmingham,B152TT,15UK163.ITServices(ResearchComputing),UniversityofBirmingham,Birmingham,B152TT,UK174.CentreforScientificComputing,UniversityofWarwick,Coventry,CV47AL,UK.185.CollegeofMedicine,SwanseaUniversity,Swansea,UK.196.AdvancedResearchComputing@Cardiff(ARCCA),CardiffUniversity,UK207.MicrobiologyandInfectionUnit,WarwickMedicalSchool,UniversityofWarwick,21Coventry,UnitedKingdom228.TheMilnerCentreforEvolution,DepartmentofBiologyandBiochemistry,Universityof23Bath,Bath,BA27AY,UK24
25ABSTRACT2627Theincreasingavailabilityanddecreasingcostofhigh-throughputsequencinghas28transformedacademicmedicalmicrobiology,deliveringanexplosioninavailablegenomes29whilealsodrivingadvancesinbioinformatics.However,manymicrobiologistsareunableto30exploittheresultinglargegenomicsdatasetsbecausetheydonothaveaccesstorelevant31computationalresourcesandtoanappropriatebioinformaticsinfrastructure.Here,we32presenttheCloudInfrastructureforMicrobialBioinformatics(CLIMB)facility,ashared33computinginfrastructurethathasbeendesignedfromthegrounduptoprovidean34environmentwheremicrobiologistscanshareandreusemethodsanddata.35
36DATASUMMARY3738Thepaperdescribesanew,freelyavailablepublicresourceandthereforenodatahasbeen39generated.Theresourcecanbeaccessedathttp://www.climb.ac.uk.Sourcecodefor40softwaredevelopedfortheprojectcanbefoundathttp://github.com/MRC-CLIMB/4142I/Weconfirmallsupportingdata,codeandprotocolshavebeenprovidedwithinthe43articleorthroughsupplementarydatafiles.☒ 44
45IMPACTSTATEMENT4647
.CC-BY 4.0 International licensenot certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which wasthis version posted July 19, 2016. . https://doi.org/10.1101/064451doi: bioRxiv preprint
Technologicaladvancesmeanthatgenomesequencingisnowrelativelysimple,quick,and48affordable.However,handlinglargegenomedatasetsremainsasignificantchallengefor49manymicrobiologists,withsubstantialrequirementsforcomputationalresourcesand50expertiseindatastorageandanalysis.Thishasledtofragmentaryapproachestosoftware51developmentanddatasharingthatreducethereproducibilityofresearchandlimits52opportunitiesforbioinformaticstraining.Here,wedescribeanationwideelectronic53infrastructurethathasbeendesignedtosupporttheUKmicrobiologycommunity,providing54simplemechanismsforaccessinglarge,shared,computationalresourcesdesignedtomeet55thebioinformaticneedsofmicrobiologists.56
57INTRODUCTION58Genomesequencinghastransformedthescaleofquestionsthatcanbeaddressedby59biologicalresearchers.Sincethepublicationofthefirstbacterialgenomesequenceover60twentyyearsago(1),therehasbeenanexplosionintheproductionofmicrobialgenome61sequencedata,fuelledmostrecentlybyhigh-throughputsequencing(2).Thishasplaced62microbiologyattheforefrontofdata-drivenscience(3).Asaconsequence,thereisnow63hugedemandforphysicalandcomputationalinfrastructurestoproduce,analyseandshare64microbiologicalsoftwareanddatasetsandarequirementfortrainedbioinformaticiansthat65canusegenomedatatoaddressimportantquestionsinmicrobiology(4).Itisworth66stressingthatmicrobialgenomics,withitsfocusontheriotousvariationseeninmicrobial67genomes,bringschallengesaltogetherdifferentfromtheanalysisofthelargerbutless68variablegenomesofhumans,animalsorplants.6970Onesolutiontothedata-delugechallengeisforeverymicrobiologyresearchgroupto71establishtheirowndedicatedbioinformaticshardwareandsoftware.However,thisentails72considerableupfrontinfrastructurecostsandgreatinefficienciesofeffort,whilealso73encouragingaworking-in-silosmentality,whichmakesitdifficulttosharedataandpipelines74andthushardtoreplicateresearch.Cloudcomputingprovidesanalternativeapproachthat75facilitatestheuseoflargegenomedatasetsinbiologicalresearch(5).7677Thecloud-computingapproachincorporatesasharedonlinecomputationalinfrastructure,78whichsparestheenduserfromworryingabouttechnicalissuessuchastheinstallation79maintenanceand,even,thelocationofphysicalcomputingresources,togetherwithother80potentiallytroublingissuessuchassystemsadministration,datasharing,scalability,security81andbackup.Attheheartofcloudcomputingliesvirtualization,anapproachinwhicha82physicalcomputingset-upisre-purposedintoascalablesystemofmultipleindependent83virtualmachines,eachofwhichcanbepre-loadedwithsoftware,customisedbyendusers84andsavedassnapshotsforre-usebyothers.Ideally,suchaninfrastructurealsoprovides85large-scaledatastorageandcomputecapacityondemand,reducingcoststothepublic86pursebyoptimisingutilizationofhardwareandavoidingresourcessittingidlewhilestill87capitalisingontheeconomiesofscale.8889Thepotentialforcloudcomputinginbiologicalresearchhasbeenrecognizedbyfunding90organisationsandhasseenthedevelopmentofnationwideresourcessuchasiPlant(6)91(nowCyVerse),NECTAR(7)andXSEDE(8)thatprovideresearcherswithaccesstolarge92cloudinfrastructures.Here,wedescribeanewfacility,designedspecificallyfor93microbiologists,toprovideacomputationalandbioinformaticsinfrastructurefortheUK’s94
.CC-BY 4.0 International licensenot certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which wasthis version posted July 19, 2016. . https://doi.org/10.1101/064451doi: bioRxiv preprint
academicmedicalmicrobiologycommunity,facilitatingtraining,accesstohardwareand95sharingofdataandsoftware.9697
98Resourceoverview99TheCloudInfrastructureforMicrobialBioinformatics(CLIMB)facilityisintendedasa100generalsolutiontopressingissuesinbig-datamicrobiology.Theresourcecomprisesacore101physicalinfrastructure(Figure1),combinedwiththreekeyfeaturesmakingthecloud102suitableformicrobiologists.103104First,CLIMBprovidesasingleenvironment,completewithpipelinesanddatasetsthatare105co-locatedwithcomputingresource.Thismakestheprocessofaccessingpublished106packagesandsequencedatasimplerandfaster,improvingreuseofsoftwareanddata.107108Second,CLIMBhasbeendesignedwithtraininginmind.Ratherthanhavingtrainees109configurepersonallaptopsorfacechallengesingainingaccesstosharedhighperformance110computingresources,weprovidetrainingimagesonvirtualmachinesthathaveallthe111necessarysoftwareinstalledandweprovideeachtraineewithherownpersonalserverto112continuelearningaftertheworkshopconcludes.113114Third,bybringingtogetherexpertbioinformaticians,educatorsandbiologistsinaunified115system,CLIMBprovidesanenvironmentwhereresearchersacrossinstitutionscanshare116dataandcode,permittingcomplexprojectsiterativelytoberemixed,reproduced,updated117andimproved.118119TheCLIMBcoreinfrastructureisacloudsystemrunningthefreeopen-sourcecloud120operatingsystemOpenStack(9).Thissystemallowsustorunover1000virtualmachinesat121anyonetime,eachpreconfiguredwithastandarduserconfiguration.Acrossthecloud,we122haveaccesstoalmost43terabytesofRAM.Specialistuserscanrequestaccesstooneofour123twelvehigh-memoryvirtualmachineseachwith3terabytesofRAMforespeciallylarge,124complexanalyses(Figure1).Thesystemisspreadoverfoursitestoenhanceitsresilience125andissupportedbylocalscratchstorageof500TBpersiteemployingIBM’sSpectrumScale126storage(formerlyGPFS).Thesystemisunderpinnedbyalargesharedobjectstoragesystem127thatprovidesapproximately2.5petabytesofdatastorage,whichmaybereplicatedbetween128sites.Thisstoragesystem,runningthefreeopen-sourceCephsystem(10),providesaplace129tostoreandshareverylargemicrobialdatasets—forcomparison,thebacterialcomponent130oftheEuropeanNucleotideArchiveiscurrentlyaround400terabytesinsize.TheCLIMB131systemcanbecoupledtosequencingservices;forexample,sequencedatageneratedbythe132MicrobesNGservice(11)hasbeenmadeavailablewithintheCLIMBsystem.133134Resourceperformance135ToassesstheperformanceoftheCLIMBsystemincomparisontotraditionalHigh136PerformanceComputing(HPC)systemsandsimilarcloudsystems,weundertookasmall-137scalebenchmarkingexercise(Figure2).ComparedtotheRavenHPCresourceatCardiff138(runningIntelprocessorsagenerationbehindthoseinCLIMB),performanceonCLIMBwas139generallygood,offeringarelativeincreaseinperformanceofupto38%ontaskscommonly140undertakenbymicrobialbioinformaticians.TheCLIMBsystemalsocompareswelltocloud141
.CC-BY 4.0 International licensenot certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which wasthis version posted July 19, 2016. . https://doi.org/10.1101/064451doi: bioRxiv preprint
serversfrommajorproviders,offeringbetteraggregateperformancethanMicrosoftAzure142A8andGoogleN1S2virtualmachines.Theresultsalsorevealanumberoffeaturesthatmay143berelevanttowhereauserchoosestoruntheiranalysis.CLIMBperformsworsethanRaven144whenrunningBEAST,andprovidesalimitedincreaseinperformanceforthepackage145nhmmer,suggestingthatwhileitispossibletoruntheseanalysesonCLIMB,otherresources146–suchaslocalHPCfacilitiesmightbemoreappropriate.Conversely,thelargestperformance147increasesareobservedforProkka,SnippyandPhyML,whichencompasssomeofthemost148commonlyusedanalysesundertakeninmicrobialgenomics.Itisalsointerestingtonotethat149bothcommercialcloudsofferexcellentperformancerelativetoRavenfortwoworkloads;150muscleandPhyML.Thesourceofthisperformanceisdifficulttopredict,butitispossible151thattheseworkloadsmaybemoresimilartothesortofworkloadsthatthesecloudservices152havebeendesignedtohandle.Onthebasisoftheperformanceresultsmoregenerally,153however,CLIMBislikelytoofferanumberofperformancebenefitsoverlocalresourcesfor154manymicrobialbioinformaticsworkloads.155156Providingasingleenvironmentfortraining,dataandsoftwaresharing157TheCLIMBsystemisaccessedthroughtheInternet,viaasimplesetofwebinterfaces158enablingthesharingofsoftwareonvirtualmachines(Figure3).Usersrequestavirtual159machineviaawebform.Eachvirtualmachinemakesavailablethemicrobialversionofthe160GenomicsVirtualLaboratory(Figure3)(7).Thisincludesasetofwebtools(Galaxy,Jupyter161NotebookandRStudio,withanoptionalPacBioSMRTportal),aswellasasetofpipelines162andtoolsthatcanbeaccessedviathecommandline.Thisstandardisedenvironment163providesacommonplatformforteaching,whilethebaseimageprovidesaversatile164platformthatcanbecustomizedtomeettheneedsofindividualresearchers.165166Systemaccess167Usersregisteratourwebsite,usingtheirUKacademiccredentials(http://bryn.climb.ac.uk/).168Uponregistration,usershaveoneoftwomodesofaccess:thefirstistolaunchaninstance169runningapreconfiguredvirtualmachine,withasetofpredefinedpipelinesortools,which170includestheGenomicsVirtualLaboratory.Thesecondoptionisaimedatexpert171bioinformaticiansanddeveloperswhomaywanttobeabletodeveloptheirownvirtual172machinesfromabaseimage—toenablethiswealsoallowuserstoaccessthesystemviaa173dashboard,similartothatprovidedbyAmazonWebServices,whereuserscanspecifythe174sizeandtypeofvirtualmachinethattheywouldlike,withthesystemthenprovisioningthis175upondemand.Tosharetheresourcefairly,userswillhaveindividualquotasthatcanbe176increasedonrequest.Irrespectiveofquotasize,accesstothesystemisfreeofchargetoUK177academicusers.178179
180CONCLUSION181182CLIMBisprobablythelargestcomputersystemdedicatedtomicrobiologyintheworld.The183systemhasalreadybeenusedtoaddressmicrobiologicalquestionsfeaturingbacteria(12)184andviruses(13).CLIMBhasbeendesignedfromthegrounduptomeettheneedsof185microbiologists,providingacoreinfrastructurethatispresentedinasimple,intuitiveway.186Individualelementsofthesystem—suchasthelargesharedstorageandextremelylarge187memorysystems—providecapabilitiesthatareusuallynotavailablelocallyto188
.CC-BY 4.0 International licensenot certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which wasthis version posted July 19, 2016. . https://doi.org/10.1101/064451doi: bioRxiv preprint
microbiologistswithinmostUKinstitutions,whilethesharednatureofthesystemprovides189newopportunitiesfordataandsoftwaresharingthathavethepotentialenhanceresearch190reproducibilityindataintensivebiology.Cloudcomputingclearlyhasthepotentialto191revolutionizehowbiologicaldataaresharedandanalysed.Wehopethatthemicrobiology192researchcommunitywillcapitaliseonthesenewopportunitiesbyexploitingtheCLIMB193facility.194195
196ACKNOWLEDGEMENTS197WewouldliketoespeciallyacknowledgetheassistanceofSimonGladman,AndrewLonie,198TorstenSeemanandNuwanGoonasekerafortheirextensiveassistanceingettingtheGVL199runningonCLIMB.WethankIsabelDodd(Warwick)andBenPascoe(Bath)forassistance200managingtheproject,andwouldalsoliketothankthelocalUniversitynetworkandITstaff201(particularlyDrIanMerrickandKevinMunnatCardiffandChrisJonesatSwansea)whohave202helpedtogetthesystemupandrunning,andUniversityprocurementstaff(especially203AnthonyHaleatCardiff)whoworkedtogetthesystempurchasedinchallengingtimescales.204Inaddition,wewouldalsoliketothankearlyaccessCLIMBusersfortestingandreporting205issueswithourservice,withparticularthankstoPhilAshton(OxfordUniversityClinical206ResearchUnitVietnam),EdFeil,HarryThorpe,SionBayliss(UniversityofBath).Wethank207EmilyRichardson(UniversityofBirmingham)fordevelopingtutorialmaterialsforCLIMBand208testing.WethankallourprojectpartnersandsuppliersatOCF,IBM,RedHat/Ceph,Dell,209MellanoxandBrocadeforsupportoftheprojectwithparticularthankstoArifAliand210GeorginaEllis(OCF),DaveCoughlinandHenryBennett(Dell),BenHarrison(RedHat),211StephanHohn(RedHat/InkTank),JimRoche(Lenovo).212213
214REFERENCES2152162171. FleischmannRD,AdamsMD,WhiteO,ClaytonRA,KirknessEF,KerlavageAR,Bult218
CJ,TombJF,DoughertyBA,MerrickJM,etal.1995.Whole-genomerandom219sequencingandassemblyofHaemophilusinfluenzaeRd.Science269:496-512.220
2. LomanNJ,PallenMJ.2015.Twentyyearsofbacterialgenomesequencing.NatRev221Microbiol13:787-794.222
3. MarxV.2013.Biology:Thebigchallengesofbigdata.Nature498:255-260.2234. ChangJ.2015.Coreservices:Rewardbioinformaticians.Nature520:151-152.2245. SteinLD,KnoppersBM,CampbellP,GetzG,KorbelJO.2015.Dataanalysis:Createa225
cloudcommons.Nature523:149-151.2266. MerchantN,LyonsE,GoffS,VaughnM,WareD,MicklosD,AntinP.2016.The227
iPlantCollaborative:CyberinfrastructureforEnablingDatatoDiscoveryfortheLife228Sciences.PLoSBiol14:e1002342.229
7. AfganE,SloggettC,GoonasekeraN,MakuninI,BensonD,CroweM,GladmanS,230KowsarY,PheasantM,HorstR,LonieA.2015.GenomicsVirtualLaboratory:A231PracticalBioinformaticsWorkbenchfortheCloud.PLoSOne10:e0140829.232
8. TownsJ,CockerillT,DahanM,FosterI,GaitherK,GrimshawA,HazlewoodV,233LathropS,LifkaD,PetersonGD,RoskiesR,ScottJR,Wilkens-DiehrN.2014.XSEDE:234Acceleratingscientificdiscovery.ComputinginScienceandEngineering16:62-74.235
.CC-BY 4.0 International licensenot certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which wasthis version posted July 19, 2016. . https://doi.org/10.1101/064451doi: bioRxiv preprint
9. Openstack.http://www.openstack.org.23610. RedHatCephStorage.http://www.redhat.com/en/technologies/storage/ceph.23711. MicrobesNG:https://microbesng.uk.23812. ConnorTR,BarkerCR,BakerKS,WeillFX,TalukderKA,SmithAM,BakerS,Gouali239
M,PhamThanhD,JahanAzmiI,DiasdaSilveiraW,SemmlerT,WielerLH,Jenkins240C,CraviotoA,FaruqueSM,ParkhillJ,WookKimD,KeddyKH,ThomsonNR.2015.241Species-widewholegenomesequencingrevealshistoricalglobalspreadandrecent242localpersistenceinShigellaflexneri.Elife4.243
13. QuickJ,LomanNJ,DuraffourS,SimpsonJT,SeveriE,CowleyL,BoreJA,244KoundounoR,DudasG,MikhailA,OuedraogoN,AfroughB,BahA,BaumJH,245Becker-ZiajaB,BoettcherJP,Cabeza-CabrerizoM,Camino-SanchezA,CarterLL,246DoerrbeckerJ,EnkirchT,Garcia-DorivalI,HetzeltN,HinzmannJ,HolmT,247KafetzopoulouLE,KoropoguiM,KosgeyA,KuismaE,LogueCH,MazzarelliA,248MeiselS,MertensM,MichelJ,NgaboD,NitzscheK,PallaschE,PatronoLV,249PortmannJ,RepitsJG,RickettNY,SachseA,SingethanK,VitorianoI,250YemanaberhanRL,ZekengEG,RacineT,BelloA,SallAA,FayeO,etal.2016.Real-251time,portablegenomesequencingforEbolasurveillance.Nature530:228-232.252
253254
DATABIBLIOGRAPHY255256Notapplicable257
258 259
.CC-BY 4.0 International licensenot certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which wasthis version posted July 19, 2016. . https://doi.org/10.1101/064451doi: bioRxiv preprint
FIGURESANDTABLES260261
262Figure1.Overviewofthesystem.A.Thesiteswherethecomputationalhardwareisbased.263B.Highleveloverviewofthesystemandhowthedifferentsoftwarecomponentsconnectto264
.CC-BY 4.0 International licensenot certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which wasthis version posted July 19, 2016. . https://doi.org/10.1101/064451doi: bioRxiv preprint
oneanother.C.Computehardwarepresentateachofthefoursites.D.Hardware265comprisingtheCephstoragesystemateachsite.E.Typeandroleofnetworkhardwareused266ateachsite.267268Figure2(appendedatendofdocument).RelativeperformanceofVMsrunningoncloud269services,comparedtotheCardiffUniversityHPCsystem,Raven.A.Valuesforeachpackage270arethemeanofthewalltimetakenfor10runsperformedonRaven,dividedbythemean271walltimeof40runsundertakenontheVMonthenamedservice.Valuesgreaterthan1are272fasterthanRaven,valueslessthan1areslower.B.Showingtherawwalltimevaluesforthe273namedsoftwareoneachofthesystems.274275276Figure3.(appendedatendofdocument)CLIMBvirtualmachinelaunchworkflow.Auser,277onloggingintotheBrynlauncherinterfaceispresentedwithalistofthevirtualmachines278theyarerunningandareabletostop,rebootorterminatethem(panelA).Userslauncha279newGVLserverwithaminimalinterface,specifyinganame,theserver‘flavour’(useror280group)andanaccesspassword(panelB).Onbooting,theuseraccessesawebserver281runningontheGVLinstance,whichgivesaccesstovariousservicesthatarestarted282automatically(panelC).TheGVLprovidesaccesstoaCloudman,aGalaxyserver,an283administrationinterface,JupyternotebookandRStudio(panelD,toptobottom).284285SupplementaryTable1.Tablecontainstherawwalltimerecordedfor40independentruns286oneachoftheindicatedcomparisonsystems.Forfigure2theserawdatawerecompared287againstthemeanwalltimerecordedfrom10runsontheCardiffUniversityRavensystem.288Therawandcalculatedvaluesareintheattachedspreadsheet,indifferentworkbooks.289290
291
.CC-BY 4.0 International licensenot certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which wasthis version posted July 19, 2016. . https://doi.org/10.1101/064451doi: bioRxiv preprint
velvth
velvetg
snippy
prokka
phyml
nhmmer
muscle
gunzip
blastn
beast
Walltime as proportion of Raven walltime (larger is better) Raw walltime (s)
CLIMB
Azure
A8
Raven
iPlant
4vCPU 8GB
N1S2
Azure
G1
A B.CC-BY 4.0 International licensenot certified by peer review) is the author/funder. It is made available under a
The copyright holder for this preprint (which wasthis version posted July 19, 2016. . https://doi.org/10.1101/064451doi: bioRxiv preprint
.CC-BY 4.0 International licensenot certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which wasthis version posted July 19, 2016. . https://doi.org/10.1101/064451doi: bioRxiv preprint