Page 1: * Authors contributed equally - bioRxiv · 7/19/2016  · 120 The CLIMB core infrastructure is a cloud system running the free open-source cloud 121 operating system OpenStack (9)

CLIMB (the Cloud Infrastructure for Microbial Bioinformatics): an online resource for the medical microbiology community

Thomas R. Connor*+1, Nicholas J. Loman*+2, Simon Thompson*3, Andy Smith2, Joel Southgate1, Radoslaw Poplawski2,3, Matthew J. Bull1, Emily Richardson2, Matthew Ismail4, Simon Elwood-Thompson5, Christine Kitchen6, Martyn Guest6, Marius Bakke7, Sam K. Sheppard*8, Mark J. Pallen*7

* Authors contributed equally
+ For correspondence: [email protected]; n.j.loman@bham.ac.uk

Affiliations

1. Cardiff University School of Biosciences, The Sir Martin Evans Building, Cardiff University, Cardiff, CF10 2AX, UK
2. Institute of Microbiology and Infection, University of Birmingham, Birmingham, B15 2TT, UK
3. IT Services (Research Computing), University of Birmingham, Birmingham, B15 2TT, UK
4. Centre for Scientific Computing, University of Warwick, Coventry, CV4 7AL, UK.
5. College of Medicine, Swansea University, Swansea, UK.
6. Advanced Research Computing @ Cardiff (ARCCA), Cardiff University, UK
7. Microbiology and Infection Unit, Warwick Medical School, University of Warwick, Coventry, United Kingdom
8. The Milner Centre for Evolution, Department of Biology and Biochemistry, University of Bath, Bath, BA2 7AY, UK

ABSTRACT

The increasing availability and decreasing cost of high-throughput sequencing have transformed academic medical microbiology, delivering an explosion in available genomes while also driving advances in bioinformatics. However, many microbiologists are unable to exploit the resulting large genomics datasets because they do not have access to relevant computational resources and to an appropriate bioinformatics infrastructure. Here, we present the Cloud Infrastructure for Microbial Bioinformatics (CLIMB) facility, a shared computing infrastructure that has been designed from the ground up to provide an environment where microbiologists can share and reuse methods and data.

DATA SUMMARY

The paper describes a new, freely available public resource and therefore no data has been generated. The resource can be accessed at http://www.climb.ac.uk. Source code for software developed for the project can be found at http://github.com/MRC-CLIMB/

I/We confirm all supporting data, code and protocols have been provided within the article or through supplementary data files. ☒

IMPACT STATEMENT

bioRxiv preprint doi: https://doi.org/10.1101/064451; this version posted July 19, 2016. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. It is made available under a CC-BY 4.0 International license.


Technological advances mean that genome sequencing is now relatively simple, quick, and affordable. However, handling large genome datasets remains a significant challenge for many microbiologists, with substantial requirements for computational resources and expertise in data storage and analysis. This has led to fragmentary approaches to software development and data sharing that reduce the reproducibility of research and limit opportunities for bioinformatics training. Here, we describe a nationwide electronic infrastructure that has been designed to support the UK microbiology community, providing simple mechanisms for accessing large, shared, computational resources designed to meet the bioinformatic needs of microbiologists.

INTRODUCTION

Genome sequencing has transformed the scale of questions that can be addressed by biological researchers. Since the publication of the first bacterial genome sequence over twenty years ago (1), there has been an explosion in the production of microbial genome sequence data, fuelled most recently by high-throughput sequencing (2). This has placed microbiology at the forefront of data-driven science (3). As a consequence, there is now huge demand for physical and computational infrastructures to produce, analyse and share microbiological software and datasets, and a requirement for trained bioinformaticians who can use genome data to address important questions in microbiology (4). It is worth stressing that microbial genomics, with its focus on the riotous variation seen in microbial genomes, brings challenges altogether different from the analysis of the larger but less variable genomes of humans, animals or plants.

One solution to the data-deluge challenge is for every microbiology research group to establish their own dedicated bioinformatics hardware and software. However, this entails considerable upfront infrastructure costs and great inefficiencies of effort, while also encouraging a working-in-silos mentality, which makes it difficult to share data and pipelines and thus hard to replicate research. Cloud computing provides an alternative approach that facilitates the use of large genome datasets in biological research (5).

The cloud-computing approach incorporates a shared online computational infrastructure, which spares the end user from worrying about technical issues such as the installation, maintenance and even the location of physical computing resources, together with other potentially troubling issues such as systems administration, data sharing, scalability, security and backup. At the heart of cloud computing lies virtualization, an approach in which a physical computing set-up is re-purposed into a scalable system of multiple independent virtual machines, each of which can be pre-loaded with software, customised by end users and saved as snapshots for re-use by others. Ideally, such an infrastructure also provides large-scale data storage and compute capacity on demand, reducing costs to the public purse by optimising utilization of hardware and avoiding resources sitting idle while still capitalising on the economies of scale.

The potential for cloud computing in biological research has been recognized by funding organisations and has seen the development of nationwide resources such as iPlant (6) (now CyVerse), NECTAR (7) and XSEDE (8) that provide researchers with access to large cloud infrastructures. Here, we describe a new facility, designed specifically for microbiologists, to provide a computational and bioinformatics infrastructure for the UK's academic medical microbiology community, facilitating training, access to hardware and sharing of data and software.

Resource overview

The Cloud Infrastructure for Microbial Bioinformatics (CLIMB) facility is intended as a general solution to pressing issues in big-data microbiology. The resource comprises a core physical infrastructure (Figure 1), combined with three key features making the cloud suitable for microbiologists.

First, CLIMB provides a single environment, complete with pipelines and datasets that are co-located with computing resources. This makes the process of accessing published packages and sequence data simpler and faster, improving reuse of software and data.

Second, CLIMB has been designed with training in mind. Rather than having trainees configure personal laptops or face challenges in gaining access to shared high performance computing resources, we provide training images on virtual machines that have all the necessary software installed and we provide each trainee with her own personal server to continue learning after the workshop concludes.

Third, by bringing together expert bioinformaticians, educators and biologists in a unified system, CLIMB provides an environment where researchers across institutions can share data and code, permitting complex projects to be iteratively remixed, reproduced, updated and improved.

The CLIMB core infrastructure is a cloud system running the free open-source cloud operating system OpenStack (9). This system allows us to run over 1000 virtual machines at any one time, each preconfigured with a standard user configuration. Across the cloud, we have access to almost 43 terabytes of RAM. Specialist users can request access to one of our twelve high-memory virtual machines, each with 3 terabytes of RAM, for especially large, complex analyses (Figure 1). The system is spread over four sites to enhance its resilience and is supported by local scratch storage of 500 TB per site employing IBM's Spectrum Scale storage (formerly GPFS). The system is underpinned by a large shared object storage system that provides approximately 2.5 petabytes of data storage, which may be replicated between sites. This storage system, running the free open-source Ceph system (10), provides a place to store and share very large microbial datasets; for comparison, the bacterial component of the European Nucleotide Archive is currently around 400 terabytes in size. The CLIMB system can be coupled to sequencing services; for example, sequence data generated by the MicrobesNG service (11) has been made available within the CLIMB system.

Resource performance

To assess the performance of the CLIMB system in comparison to traditional High Performance Computing (HPC) systems and similar cloud systems, we undertook a small-scale benchmarking exercise (Figure 2). Compared to the Raven HPC resource at Cardiff (running Intel processors a generation behind those in CLIMB), performance on CLIMB was generally good, offering a relative increase in performance of up to 38% on tasks commonly undertaken by microbial bioinformaticians. The CLIMB system also compares well to cloud servers from major providers, offering better aggregate performance than Microsoft Azure A8 and Google N1S2 virtual machines. The results also reveal a number of features that may be relevant to where a user chooses to run their analysis. CLIMB performs worse than Raven when running BEAST, and provides a limited increase in performance for the package nhmmer, suggesting that while it is possible to run these analyses on CLIMB, other resources, such as local HPC facilities, might be more appropriate. Conversely, the largest performance increases are observed for Prokka, Snippy and PhyML, which encompass some of the most commonly used analyses undertaken in microbial genomics. It is also interesting to note that both commercial clouds offer excellent performance relative to Raven for two workloads: muscle and PhyML. The source of this performance is difficult to predict, but it is possible that these workloads may be more similar to the sort of workloads that these cloud services have been designed to handle. On the basis of the performance results more generally, however, CLIMB is likely to offer a number of performance benefits over local resources for many microbial bioinformatics workloads.

Providing a single environment for training, data and software sharing

The CLIMB system is accessed through the Internet, via a simple set of web interfaces enabling the sharing of software on virtual machines (Figure 3). Users request a virtual machine via a web form. Each virtual machine makes available the microbial version of the Genomics Virtual Laboratory (Figure 3) (7). This includes a set of web tools (Galaxy, Jupyter Notebook and RStudio, with an optional PacBio SMRT portal), as well as a set of pipelines and tools that can be accessed via the command line. This standardised environment provides a common platform for teaching, while the base image provides a versatile platform that can be customized to meet the needs of individual researchers.

System access

Users register at our website, using their UK academic credentials (http://bryn.climb.ac.uk/). Upon registration, users have one of two modes of access. The first is to launch an instance running a preconfigured virtual machine, with a set of predefined pipelines or tools, which includes the Genomics Virtual Laboratory. The second option is aimed at expert bioinformaticians and developers who may want to be able to develop their own virtual machines from a base image. To enable this, we also allow users to access the system via a dashboard, similar to that provided by Amazon Web Services, where users can specify the size and type of virtual machine that they would like, with the system then provisioning this upon demand. To share the resource fairly, users will have individual quotas that can be increased on request. Irrespective of quota size, access to the system is free of charge to UK academic users.
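The capacity figures quoted under Resource overview can be cross-checked with a little arithmetic. The sketch below is illustrative only: the constant names are our own, and it assumes decimal units (1 petabyte = 1000 terabytes), which the text does not state.

```python
# Figures taken from the Resource overview; names are illustrative, not CLIMB's.
SITES = 4
SCRATCH_TB_PER_SITE = 500        # local scratch storage per site
HIGH_MEM_VMS = 12
RAM_TB_PER_HIGH_MEM_VM = 3
OBJECT_STORE_PB = 2.5            # shared object storage (approximate)
ENA_BACTERIAL_TB = 400           # bacterial component of the ENA (approximate)

TB_PER_PB = 1000                 # assumption: decimal units

# Aggregate scratch storage across all sites, in terabytes.
total_scratch_tb = SITES * SCRATCH_TB_PER_SITE

# RAM held by the high-memory virtual machines alone, in terabytes.
high_mem_ram_tb = HIGH_MEM_VMS * RAM_TB_PER_HIGH_MEM_VM

# How many copies of the ENA bacterial component the object store could hold.
object_store_tb = OBJECT_STORE_PB * TB_PER_PB
ena_copies = object_store_tb / ENA_BACTERIAL_TB

print(f"Scratch storage across sites: {total_scratch_tb} TB")
print(f"RAM in high-memory VMs: {high_mem_ram_tb} TB")
print(f"Object store holds about {ena_copies:.2f}x the ENA bacterial component")
```

Under these assumptions, the object store is roughly six times the size of the bacterial component of the European Nucleotide Archive, which puts the "very large microbial datasets" claim into perspective.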

CONCLUSION

CLIMB is probably the largest computer system dedicated to microbiology in the world. The system has already been used to address microbiological questions featuring bacteria (12) and viruses (13). CLIMB has been designed from the ground up to meet the needs of microbiologists, providing a core infrastructure that is presented in a simple, intuitive way. Individual elements of the system, such as the large shared storage and extremely large memory systems, provide capabilities that are usually not available locally to microbiologists within most UK institutions, while the shared nature of the system provides new opportunities for data and software sharing that have the potential to enhance research reproducibility in data-intensive biology. Cloud computing clearly has the potential to revolutionize how biological data are shared and analysed. We hope that the microbiology research community will capitalise on these new opportunities by exploiting the CLIMB facility.

ACKNOWLEDGEMENTS

We would especially like to acknowledge Simon Gladman, Andrew Lonie, Torsten Seeman and Nuwan Goonasekera for their extensive assistance in getting the GVL running on CLIMB. We thank Isabel Dodd (Warwick) and Ben Pascoe (Bath) for assistance managing the project, and would also like to thank the local University network and IT staff (particularly Dr Ian Merrick and Kevin Munn at Cardiff and Chris Jones at Swansea) who have helped to get the system up and running, and University procurement staff (especially Anthony Hale at Cardiff) who worked to get the system purchased in challenging timescales. In addition, we would like to thank early access CLIMB users for testing and reporting issues with our service, with particular thanks to Phil Ashton (Oxford University Clinical Research Unit Vietnam), Ed Feil, Harry Thorpe and Sion Bayliss (University of Bath). We thank Emily Richardson (University of Birmingham) for developing tutorial materials for CLIMB and testing. We thank all our project partners and suppliers at OCF, IBM, Red Hat/Ceph, Dell, Mellanox and Brocade for support of the project, with particular thanks to Arif Ali and Georgina Ellis (OCF), Dave Coughlin and Henry Bennett (Dell), Ben Harrison (Red Hat), Stephan Hohn (Red Hat/InkTank) and Jim Roche (Lenovo).

REFERENCES

1. Fleischmann RD, Adams MD, White O, Clayton RA, Kirkness EF, Kerlavage AR, Bult CJ, Tomb JF, Dougherty BA, Merrick JM, et al. 1995. Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science 269:496-512.
2. Loman NJ, Pallen MJ. 2015. Twenty years of bacterial genome sequencing. Nat Rev Microbiol 13:787-794.
3. Marx V. 2013. Biology: The big challenges of big data. Nature 498:255-260.
4. Chang J. 2015. Core services: Reward bioinformaticians. Nature 520:151-152.
5. Stein LD, Knoppers BM, Campbell P, Getz G, Korbel JO. 2015. Data analysis: Create a cloud commons. Nature 523:149-151.
6. Merchant N, Lyons E, Goff S, Vaughn M, Ware D, Micklos D, Antin P. 2016. The iPlant Collaborative: Cyberinfrastructure for Enabling Data to Discovery for the Life Sciences. PLoS Biol 14:e1002342.
7. Afgan E, Sloggett C, Goonasekera N, Makunin I, Benson D, Crowe M, Gladman S, Kowsar Y, Pheasant M, Horst R, Lonie A. 2015. Genomics Virtual Laboratory: A Practical Bioinformatics Workbench for the Cloud. PLoS One 10:e0140829.
8. Towns J, Cockerill T, Dahan M, Foster I, Gaither K, Grimshaw A, Hazlewood V, Lathrop S, Lifka D, Peterson GD, Roskies R, Scott JR, Wilkens-Diehr N. 2014. XSEDE: Accelerating scientific discovery. Computing in Science and Engineering 16:62-74.
9. Openstack. http://www.openstack.org.
10. Red Hat Ceph Storage. http://www.redhat.com/en/technologies/storage/ceph.
11. MicrobesNG: https://microbesng.uk.
12. Connor TR, Barker CR, Baker KS, Weill FX, Talukder KA, Smith AM, Baker S, Gouali M, Pham Thanh D, Jahan Azmi I, Dias da Silveira W, Semmler T, Wieler LH, Jenkins C, Cravioto A, Faruque SM, Parkhill J, Wook Kim D, Keddy KH, Thomson NR. 2015. Species-wide whole genome sequencing reveals historical global spread and recent local persistence in Shigella flexneri. Elife 4.
13. Quick J, Loman NJ, Duraffour S, Simpson JT, Severi E, Cowley L, Bore JA, Koundouno R, Dudas G, Mikhail A, Ouedraogo N, Afrough B, Bah A, Baum JH, Becker-Ziaja B, Boettcher JP, Cabeza-Cabrerizo M, Camino-Sanchez A, Carter LL, Doerrbecker J, Enkirch T, Garcia-Dorival I, Hetzelt N, Hinzmann J, Holm T, Kafetzopoulou LE, Koropogui M, Kosgey A, Kuisma E, Logue CH, Mazzarelli A, Meisel S, Mertens M, Michel J, Ngabo D, Nitzsche K, Pallasch E, Patrono LV, Portmann J, Repits JG, Rickett NY, Sachse A, Singethan K, Vitoriano I, Yemanaberhan RL, Zekeng EG, Racine T, Bello A, Sall AA, Faye O, et al. 2016. Real-time, portable genome sequencing for Ebola surveillance. Nature 530:228-232.

DATA BIBLIOGRAPHY

Not applicable


FIGURES AND TABLES

Figure 1. Overview of the system. A. The sites where the computational hardware is based. B. High level overview of the system and how the different software components connect to one another. C. Compute hardware present at each of the four sites. D. Hardware comprising the Ceph storage system at each site. E. Type and role of network hardware used at each site.

Figure 2 (appended at end of document). Relative performance of VMs running on cloud services, compared to the Cardiff University HPC system, Raven. A. Values for each package are the mean of the wall time taken for 10 runs performed on Raven, divided by the mean wall time of 40 runs undertaken on the VM on the named service. Values greater than 1 are faster than Raven, values less than 1 are slower. B. Showing the raw wall time values for the named software on each of the systems.

Figure 3 (appended at end of document). CLIMB virtual machine launch workflow. A user, on logging in to the Bryn launcher interface, is presented with a list of the virtual machines they are running and is able to stop, reboot or terminate them (panel A). Users launch a new GVL server with a minimal interface, specifying a name, the server 'flavour' (user or group) and an access password (panel B). On booting, the user accesses a webserver running on the GVL instance, which gives access to various services that are started automatically (panel C). The GVL provides access to a Cloudman, a Galaxy server, an administration interface, Jupyter notebook and RStudio (panel D, top to bottom).

Supplementary Table 1. Table contains the raw wall time recorded for 40 independent runs on each of the indicated comparison systems. For Figure 2 these raw data were compared against the mean wall time recorded from 10 runs on the Cardiff University Raven system. The raw and calculated values are in the attached spreadsheet, in different workbooks.
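The ratio plotted in Figure 2A, as described in the caption, can be sketched in a few lines. This is a minimal illustration rather than the authors' analysis code: the function names are our own, the wall times are invented placeholders and not values from Supplementary Table 1, and the percentage helper assumes that a quoted percentage increase corresponds to this ratio minus one.

```python
from statistics import mean


def relative_performance(raven_walltimes, vm_walltimes):
    """Mean Raven wall time divided by mean VM wall time.

    Values greater than 1 mean the VM was faster than Raven;
    values less than 1 mean it was slower.
    """
    if not raven_walltimes or not vm_walltimes:
        raise ValueError("both systems need at least one timed run")
    return mean(raven_walltimes) / mean(vm_walltimes)


def percent_faster(ratio):
    """Express a ratio as a percentage change relative to Raven (assumed convention)."""
    return (ratio - 1) * 100


# Hypothetical wall times in seconds, invented for illustration only.
raven_runs = [100.0] * 10   # 10 runs on Raven, as in the caption
vm_runs = [80.0] * 40       # 40 runs on the cloud VM, as in the caption

ratio = relative_performance(raven_runs, vm_runs)
print(f"relative performance: {ratio:.2f}")
print(f"change relative to Raven: {percent_faster(ratio):+.0f}%")
```

Dividing the Raven time by the VM time, rather than the reverse, is what makes larger values better in panel A.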


[Figure 2 (image not reproduced). Panel A: wall time as proportion of Raven wall time (larger is better). Panel B: raw wall time (s). Packages shown: velveth, velvetg, snippy, prokka, phyml, nhmmer, muscle, gunzip, blastn, beast. Systems compared: CLIMB, Azure A8, Raven, iPlant (4 vCPU, 8 GB), Google N1S2, Azure G1.]
