
CLIMB (the Cloud Infrastructure for Microbial Bioinformatics): an online resource for the medical microbiology community

Thomas R. Connor*+1, Nicholas J. Loman*+2, Simon Thompson*3, Andy Smith2, Joel Southgate1, Radoslaw Poplawski2,3, Matthew J. Bull1, Emily Richardson2, Matthew Ismail4, Simon Elwood-Thompson5, Christine Kitchen6, Martyn Guest6, Marius Bakke7, Sam K. Sheppard*8, Mark J. Pallen*7

* Authors contributed equally
+ For correspondence; [email protected]; n.j.loman@bham.ac.uk

Affiliations

1. Cardiff University School of Biosciences, The Sir Martin Evans Building, Cardiff University, Cardiff, CF10 2AX, UK
2. Institute of Microbiology and Infection, University of Birmingham, Birmingham, B15 2TT, UK
3. IT Services (Research Computing), University of Birmingham, Birmingham, B15 2TT, UK
4. Centre for Scientific Computing, University of Warwick, Coventry, CV4 7AL, UK.
5. College of Medicine, Swansea University, Swansea, UK.
6. Advanced Research Computing @ Cardiff (ARCCA), Cardiff University, UK
7. Microbiology and Infection Unit, Warwick Medical School, University of Warwick, Coventry, United Kingdom
8. The Milner Centre for Evolution, Department of Biology and Biochemistry, University of Bath, Bath, BA2 7AY, UK

ABSTRACT

The increasing availability and decreasing cost of high-throughput sequencing have transformed academic medical microbiology, delivering an explosion in available genomes while also driving advances in bioinformatics. However, many microbiologists are unable to exploit the resulting large genomics datasets because they do not have access to relevant computational resources and to an appropriate bioinformatics infrastructure. Here, we present the Cloud Infrastructure for Microbial Bioinformatics (CLIMB) facility, a shared computing infrastructure that has been designed from the ground up to provide an environment where microbiologists can share and reuse methods and data.

DATA SUMMARY

The paper describes a new, freely available public resource and therefore no data has been generated. The resource can be accessed at http://www.climb.ac.uk. Source code for software developed for the project can be found at http://github.com/MRC-CLIMB/

I/We confirm all supporting data, code and protocols have been provided within the article or through supplementary data files. ☒

IMPACT STATEMENT

bioRxiv preprint doi: https://doi.org/10.1101/064451; this version posted July 19, 2016. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. It is made available under a CC-BY 4.0 International license (http://creativecommons.org/licenses/by/4.0/).

Technological advances mean that genome sequencing is now relatively simple, quick, and affordable. However, handling large genome datasets remains a significant challenge for many microbiologists, with substantial requirements for computational resources and expertise in data storage and analysis. This has led to fragmentary approaches to software development and data sharing that reduce the reproducibility of research and limit opportunities for bioinformatics training. Here, we describe a nationwide electronic infrastructure that has been designed to support the UK microbiology community, providing simple mechanisms for accessing large, shared, computational resources designed to meet the bioinformatic needs of microbiologists.

INTRODUCTION

Genome sequencing has transformed the scale of questions that can be addressed by biological researchers. Since the publication of the first bacterial genome sequence over twenty years ago (1), there has been an explosion in the production of microbial genome sequence data, fuelled most recently by high-throughput sequencing (2). This has placed microbiology at the forefront of data-driven science (3). As a consequence, there is now huge demand for physical and computational infrastructures to produce, analyse and share microbiological software and datasets, and a requirement for trained bioinformaticians who can use genome data to address important questions in microbiology (4). It is worth stressing that microbial genomics, with its focus on the riotous variation seen in microbial genomes, brings challenges altogether different from the analysis of the larger but less variable genomes of humans, animals or plants.

One solution to the data-deluge challenge is for every microbiology research group to establish their own dedicated bioinformatics hardware and software. However, this entails considerable upfront infrastructure costs and great inefficiencies of effort, while also encouraging a working-in-silos mentality, which makes it difficult to share data and pipelines and thus hard to replicate research. Cloud computing provides an alternative approach that facilitates the use of large genome datasets in biological research (5).

The cloud-computing approach incorporates a shared online computational infrastructure, which spares the end user from worrying about technical issues such as the installation, maintenance and, even, the location of physical computing resources, together with other potentially troubling issues such as systems administration, data sharing, scalability, security and backup. At the heart of cloud computing lies virtualization, an approach in which a physical computing set-up is re-purposed into a scalable system of multiple independent virtual machines, each of which can be pre-loaded with software, customised by end users and saved as snapshots for re-use by others. Ideally, such an infrastructure also provides large-scale data storage and compute capacity on demand, reducing costs to the public purse by optimising utilization of hardware and avoiding resources sitting idle while still capitalising on the economies of scale.

The potential for cloud computing in biological research has been recognized by funding organisations and has seen the development of nationwide resources such as iPlant (6) (now CyVerse), NECTAR (7) and XSEDE (8) that provide researchers with access to large cloud infrastructures. Here, we describe a new facility, designed specifically for microbiologists, to provide a computational and bioinformatics infrastructure for the UK's academic medical microbiology community, facilitating training, access to hardware and sharing of data and software.

Resource overview

The Cloud Infrastructure for Microbial Bioinformatics (CLIMB) facility is intended as a general solution to pressing issues in big-data microbiology. The resource comprises a core physical infrastructure (Figure 1), combined with three key features making the cloud suitable for microbiologists.

First, CLIMB provides a single environment, complete with pipelines and datasets that are co-located with computing resource. This makes the process of accessing published packages and sequence data simpler and faster, improving reuse of software and data.

Second, CLIMB has been designed with training in mind. Rather than having trainees configure personal laptops or face challenges in gaining access to shared high performance computing resources, we provide training images on virtual machines that have all the necessary software installed and we provide each trainee with her own personal server to continue learning after the workshop concludes.

Third, by bringing together expert bioinformaticians, educators and biologists in a unified system, CLIMB provides an environment where researchers across institutions can share data and code, permitting complex projects to be iteratively remixed, reproduced, updated and improved.

The CLIMB core infrastructure is a cloud system running the free open-source cloud operating system OpenStack (9). This system allows us to run over 1000 virtual machines at any one time, each preconfigured with a standard user configuration. Across the cloud, we have access to almost 43 terabytes of RAM. Specialist users can request access to one of our twelve high-memory virtual machines, each with 3 terabytes of RAM, for especially large, complex analyses (Figure 1). The system is spread over four sites to enhance its resilience and is supported by local scratch storage of 500 TB per site employing IBM's Spectrum Scale storage (formerly GPFS). The system is underpinned by a large shared object storage system that provides approximately 2.5 petabytes of data storage, which may be replicated between sites. This storage system, running the free open-source Ceph system (10), provides a place to store and share very large microbial datasets; for comparison, the bacterial component of the European Nucleotide Archive is currently around 400 terabytes in size. The CLIMB system can be coupled to sequencing services; for example, sequence data generated by the MicrobesNG service (11) has been made available within the CLIMB system.

Resource performance

To assess the performance of the CLIMB system in comparison to traditional High Performance Computing (HPC) systems and similar cloud systems, we undertook a small-scale benchmarking exercise (Figure 2). Compared to the Raven HPC resource at Cardiff (running Intel processors a generation behind those in CLIMB), performance on CLIMB was generally good, offering a relative increase in performance of up to 38% on tasks commonly undertaken by microbial bioinformaticians. The CLIMB system also compares well to cloud servers from major providers, offering better aggregate performance than Microsoft Azure A8 and Google N1S2 virtual machines. The results also reveal a number of features that may be relevant to where a user chooses to run their analysis. CLIMB performs worse than Raven when running BEAST, and provides a limited increase in performance for the package nhmmer, suggesting that while it is possible to run these analyses on CLIMB, other resources, such as local HPC facilities, might be more appropriate. Conversely, the largest performance increases are observed for Prokka, Snippy and PhyML, which encompass some of the most commonly used analyses undertaken in microbial genomics. It is also interesting to note that both commercial clouds offer excellent performance relative to Raven for two workloads: muscle and PhyML. The source of this performance is difficult to pinpoint, but it is possible that these workloads may be more similar to the sort of workloads that these cloud services have been designed to handle. On the basis of the performance results more generally, however, CLIMB is likely to offer a number of performance benefits over local resources for many microbial bioinformatics workloads.

Providing a single environment for training, data and software sharing

The CLIMB system is accessed through the Internet, via a simple set of web interfaces enabling the sharing of software on virtual machines (Figure 3). Users request a virtual machine via a web form. Each virtual machine makes available the microbial version of the Genomics Virtual Laboratory (Figure 3) (7). This includes a set of web tools (Galaxy, Jupyter Notebook and RStudio, with an optional PacBio SMRT portal), as well as a set of pipelines and tools that can be accessed via the command line. This standardised environment provides a common platform for teaching, while the base image provides a versatile platform that can be customized to meet the needs of individual researchers.

System access

Users register at our website, using their UK academic credentials (http://bryn.climb.ac.uk/). Upon registration, users have one of two modes of access. The first is to launch an instance running a preconfigured virtual machine, with a set of predefined pipelines or tools, which includes the Genomics Virtual Laboratory. The second option is aimed at expert bioinformaticians and developers who may want to be able to develop their own virtual machines from a base image. To enable this, we also allow users to access the system via a dashboard, similar to that provided by Amazon Web Services, where users can specify the size and type of virtual machine that they would like, with the system then provisioning this upon demand. To share the resource fairly, users will have individual quotas that can be increased on request. Irrespective of quota size, access to the system is free of charge to UK academic users.
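The quota-based sharing described above can be illustrated with a minimal sketch. The flavour names, resource sizes and quota limits below are hypothetical and are not CLIMB's actual values; the check itself is only an assumption about how a request for a given size of virtual machine might be compared with a user's remaining quota.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flavour:
    """Size of a requested virtual machine (hypothetical values)."""
    name: str
    vcpus: int
    ram_gb: int


@dataclass
class Quota:
    """Per-user limits and current usage (hypothetical model)."""
    max_vcpus: int
    max_ram_gb: int
    used_vcpus: int = 0
    used_ram_gb: int = 0

    def can_launch(self, flavour: Flavour) -> bool:
        """Return True if the flavour fits within the remaining quota."""
        return (self.used_vcpus + flavour.vcpus <= self.max_vcpus
                and self.used_ram_gb + flavour.ram_gb <= self.max_ram_gb)

    def launch(self, flavour: Flavour) -> None:
        """Record the resources used by a newly launched machine."""
        if not self.can_launch(flavour):
            raise RuntimeError(f"quota exceeded for flavour {flavour.name}")
        self.used_vcpus += flavour.vcpus
        self.used_ram_gb += flavour.ram_gb


# Illustrative flavours and quota; none of these numbers come from CLIMB.
small = Flavour("example.small", vcpus=2, ram_gb=8)
large = Flavour("example.large", vcpus=8, ram_gb=64)
quota = Quota(max_vcpus=8, max_ram_gb=64)

quota.launch(small)
print(quota.can_launch(large))  # the large flavour no longer fits
```

In this model, raising a quota on request would simply mean increasing `max_vcpus` or `max_ram_gb` for that user.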

CONCLUSION

CLIMB is probably the largest computer system dedicated to microbiology in the world. The system has already been used to address microbiological questions featuring bacteria (12) and viruses (13). CLIMB has been designed from the ground up to meet the needs of microbiologists, providing a core infrastructure that is presented in a simple, intuitive way. Individual elements of the system, such as the large shared storage and extremely large memory systems, provide capabilities that are usually not available locally to microbiologists within most UK institutions, while the shared nature of the system provides new opportunities for data and software sharing that have the potential to enhance research reproducibility in data-intensive biology. Cloud computing clearly has the potential to revolutionize how biological data are shared and analysed. We hope that the microbiology research community will capitalise on these new opportunities by exploiting the CLIMB facility.
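To put the storage and memory figures quoted in the resource overview in proportion, the following sketch works through the arithmetic. It uses only numbers stated in the text and assumes decimal units (1 petabyte = 1000 terabytes); the variable names are illustrative.

```python
# Figures quoted in the resource overview.
SITES = 4
SCRATCH_TB_PER_SITE = 500
OBJECT_STORAGE_PB = 2.5       # approximate
ENA_BACTERIAL_TB = 400        # approximate, at the time of writing
HIGH_MEMORY_VMS = 12
RAM_TB_PER_HIGH_MEMORY_VM = 3

TB_PER_PB = 1000  # assumption: decimal units

# Scratch storage summed over all sites.
total_scratch_tb = SITES * SCRATCH_TB_PER_SITE

# Shared object storage expressed as a multiple of the ENA bacterial component.
object_storage_tb = OBJECT_STORAGE_PB * TB_PER_PB
ena_multiple = object_storage_tb / ENA_BACTERIAL_TB

# Combined RAM of the high-memory virtual machines.
high_memory_ram_tb = HIGH_MEMORY_VMS * RAM_TB_PER_HIGH_MEMORY_VM

print(f"total scratch storage: {total_scratch_tb} TB")
print(f"object storage vs ENA bacterial component: {ena_multiple:.2f}x")
print(f"combined RAM of high-memory VMs: {high_memory_ram_tb} TB")
```

On these figures, the shared object store is roughly six times the size of the bacterial component of the European Nucleotide Archive.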

ACKNOWLEDGEMENTS

We would especially like to thank Simon Gladman, Andrew Lonie, Torsten Seeman and Nuwan Goonasekera for their extensive assistance in getting the GVL running on CLIMB. We thank Isabel Dodd (Warwick) and Ben Pascoe (Bath) for assistance managing the project, and would also like to thank the local University network and IT staff (particularly Dr Ian Merrick and Kevin Munn at Cardiff and Chris Jones at Swansea) who have helped to get the system up and running, and University procurement staff (especially Anthony Hale at Cardiff) who worked to get the system purchased in challenging timescales. In addition, we would like to thank early access CLIMB users for testing and reporting issues with our service, with particular thanks to Phil Ashton (Oxford University Clinical Research Unit Vietnam), and Ed Feil, Harry Thorpe and Sion Bayliss (University of Bath). We thank Emily Richardson (University of Birmingham) for developing tutorial materials for CLIMB and testing. We thank all our project partners and suppliers at OCF, IBM, Red Hat/Ceph, Dell, Mellanox and Brocade for support of the project, with particular thanks to Arif Ali and Georgina Ellis (OCF), Dave Coughlin and Henry Bennett (Dell), Ben Harrison (Red Hat), Stephan Hohn (Red Hat/InkTank) and Jim Roche (Lenovo).

REFERENCES

1. Fleischmann RD, Adams MD, White O, Clayton RA, Kirkness EF, Kerlavage AR, Bult CJ, Tomb JF, Dougherty BA, Merrick JM, et al. 1995. Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science 269:496-512.
2. Loman NJ, Pallen MJ. 2015. Twenty years of bacterial genome sequencing. Nat Rev Microbiol 13:787-794.
3. Marx V. 2013. Biology: The big challenges of big data. Nature 498:255-260.
4. Chang J. 2015. Core services: Reward bioinformaticians. Nature 520:151-152.
5. Stein LD, Knoppers BM, Campbell P, Getz G, Korbel JO. 2015. Data analysis: Create a cloud commons. Nature 523:149-151.
6. Merchant N, Lyons E, Goff S, Vaughn M, Ware D, Micklos D, Antin P. 2016. The iPlant Collaborative: Cyberinfrastructure for Enabling Data to Discovery for the Life Sciences. PLoS Biol 14:e1002342.
7. Afgan E, Sloggett C, Goonasekera N, Makunin I, Benson D, Crowe M, Gladman S, Kowsar Y, Pheasant M, Horst R, Lonie A. 2015. Genomics Virtual Laboratory: A Practical Bioinformatics Workbench for the Cloud. PLoS One 10:e0140829.
8. Towns J, Cockerill T, Dahan M, Foster I, Gaither K, Grimshaw A, Hazlewood V, Lathrop S, Lifka D, Peterson GD, Roskies R, Scott JR, Wilkens-Diehr N. 2014. XSEDE: Accelerating scientific discovery. Computing in Science and Engineering 16:62-74.
9. Openstack. http://www.openstack.org.
10. Red Hat Ceph Storage. http://www.redhat.com/en/technologies/storage/ceph.
11. MicrobesNG: https://microbesng.uk.
12. Connor TR, Barker CR, Baker KS, Weill FX, Talukder KA, Smith AM, Baker S, Gouali M, Pham Thanh D, Jahan Azmi I, Dias da Silveira W, Semmler T, Wieler LH, Jenkins C, Cravioto A, Faruque SM, Parkhill J, Wook Kim D, Keddy KH, Thomson NR. 2015. Species-wide whole genome sequencing reveals historical global spread and recent local persistence in Shigella flexneri. Elife 4.
13. Quick J, Loman NJ, Duraffour S, Simpson JT, Severi E, Cowley L, Bore JA, Koundouno R, Dudas G, Mikhail A, Ouedraogo N, Afrough B, Bah A, Baum JH, Becker-Ziaja B, Boettcher JP, Cabeza-Cabrerizo M, Camino-Sanchez A, Carter LL, Doerrbecker J, Enkirch T, Garcia-Dorival I, Hetzelt N, Hinzmann J, Holm T, Kafetzopoulou LE, Koropogui M, Kosgey A, Kuisma E, Logue CH, Mazzarelli A, Meisel S, Mertens M, Michel J, Ngabo D, Nitzsche K, Pallasch E, Patrono LV, Portmann J, Repits JG, Rickett NY, Sachse A, Singethan K, Vitoriano I, Yemanaberhan RL, Zekeng EG, Racine T, Bello A, Sall AA, Faye O, et al. 2016. Real-time, portable genome sequencing for Ebola surveillance. Nature 530:228-232.

DATA BIBLIOGRAPHY

Not applicable

FIGURES AND TABLES

Figure 1. Overview of the system. A. The sites where the computational hardware is based. B. High level overview of the system and how the different software components connect to one another. C. Compute hardware present at each of the four sites. D. Hardware comprising the Ceph storage system at each site. E. Type and role of network hardware used at each site.

Figure 2 (appended at end of document). Relative performance of VMs running on cloud services, compared to the Cardiff University HPC system, Raven. A. Values for each package are the mean of the wall time taken for 10 runs performed on Raven, divided by the mean wall time of 40 runs undertaken on the VM on the named service. Values greater than 1 are faster than Raven, values less than 1 are slower. B. Raw wall time values for the named software on each of the systems.

Figure 3 (appended at end of document). CLIMB virtual machine launch workflow. A user, on logging in to the Bryn launcher interface, is presented with a list of the virtual machines they are running and is able to stop, reboot or terminate them (panel A). Users launch a new GVL server with a minimal interface, specifying a name, the server 'flavour' (user or group) and an access password (panel B). On booting, the user accesses a web server running on the GVL instance, which gives access to various services that are started automatically (panel C). The GVL provides access to a Cloudman, a Galaxy server, an administration interface, Jupyter notebook and RStudio (panel D, top to bottom).

Supplementary Table 1. Table contains the raw wall time recorded for 40 independent runs on each of the indicated comparison systems. For Figure 2 these raw data were compared against the mean wall time recorded from 10 runs on the Cardiff University Raven system. The raw and calculated values are in the attached spreadsheet, in different workbooks.
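As a minimal sketch of the calculation described in the Figure 2 caption, the relative performance of a service for one package is the mean Raven wall time divided by the mean wall time on that service. The function name and the wall-time values below are hypothetical placeholders, not measurements from Supplementary Table 1.

```python
from statistics import mean


def relative_performance(raven_walltimes, service_walltimes):
    """Mean Raven wall time divided by mean service wall time.

    Values above 1 mean the service was faster than Raven;
    values below 1 mean it was slower.
    """
    if not raven_walltimes or not service_walltimes:
        raise ValueError("both systems need at least one run")
    return mean(raven_walltimes) / mean(service_walltimes)


# Hypothetical wall times in seconds (illustrative only, not real data).
raven_runs = [138.0] * 10    # 10 runs on Raven
service_runs = [100.0] * 40  # 40 runs on the compared VM

ratio = relative_performance(raven_runs, service_runs)
print(f"relative performance: {ratio:.2f}")
```

With these placeholder values the ratio is 1.38, which, read as in panel A, would correspond to the service completing the task 38% faster than Raven.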

[Figure 2. Panel A: wall time as a proportion of Raven wall time (larger is better). Panel B: raw wall time (s). Packages shown: velveth, velvetg, snippy, prokka, phyml, nhmmer, muscle, gunzip, blastn, beast. Systems shown: CLIMB, Azure A8, Raven, iPlant (4 vCPU, 8 GB), Google N1S2, Azure G1.]
