climb stateoftheartintro

Microbial Bioinformatics in the cloud

Introducing CLIMB

Dr Tom Connor

Cardiff University

Biological Cloud Computing Workshop

www.climb.ac.uk

@tomrconnor ; @mrcclimb

Overview

• Background

• View from a newly minted academic Bioinformatician at a “regular” University

– Bioinformatics needs and challenges

• Introducing CLIMB

Big Data

[Figure: Wave 1, Wave 2 and Wave 3, each annotated with date ranges]

Adapted from Mutreja, Kim, Thomson, Connor et al, Nature, 2011

Population genomics; using genomics to reconstruct the global spread of pathogens

Bioinformatics; developing new approaches to analyse massive datasets

From Marttinen, Hanage, Croucher, Connor, Harris, Bentley and Corander, Nucleic Acids Res. 2011

From Cheng, Connor, Sirén, Aanensen, Corander, MBE, 2013

Grand challenges; fighting antimicrobial resistance

From Fookes et al, PLoS Pathogens 2011

From Reuter, Connor et al, PNAS 2014

From Okoro, Kingsley, Connor et al, Nature Genetics, 2012

From He et al, Nature Genetics, 2013

From Dziva, Hauser, Connor et al, I&I, 2013

Pathogen genomics: understanding how pathogens evolve

About Me

• 2006-2010 - PhD at Imperial; Population Genetics / Molecular Epidemiology of bacterial pathogens

• 2010-2012 - Post-Doc, Wellcome Trust Sanger Institute; Pathogen Genomics

• 2012-present - Lecturer, then Senior Lecturer, Cardiff University

Hanage, Fraser, Tang, Connor, Corander, Science, 2009

Building bioinformatics capacity

• When I arrived at Cardiff, I had the joy of working out how I was going to do my research in a new place

• How to get scripts/software installed

• Where to install scripts/software

• And I had to do this mostly on my own

• Not an unusual story

Key Challenges

• Infrastructure

– Storage

– HTC capacity

• Portability of software

• Portability of datasets

• But, we do have the University Supercomputer….

Advanced Research Computing @ Cardiff (ARCCA)

• 2048 core HPC cluster

• Second hand ~868 Westmere core “HTC” partition

• 8 ‘large memory’ 128GB nodes

• Lustre file system (scratch, nominally unlimited, but 50TB total space initially)

• NFS /home mount (50GB maximum quota)

• Freely available to University Researchers

…. So the first thing I did was to buy myself some servers

• Large HPC clusters often don’t meet our needs

• Bioinformaticians aren’t the ideal HPC users

– Disruptive software needs

– Disruptive usage patterns

– Disruptive storage needs

• Setting up our “own” system seems the most intuitive way to ensure that you have something that works

Biologists often end up working in silos

• As a discipline we have probably been taught to think in terms of ‘labs’, ‘groups’ and ‘experiments’ being wet work

• We build capacity and teams locally, and those are the resources that we use every day

• For bioinformaticians this means we are likely to develop our solutions locally first, building a local group and local capacity

• Our software, data set storage, LIMS etc are usually bespoke

• Because our software/data is locally stored/setup – it is often less portable than wet lab methods / approaches

• Bioinformaticians should be working differently

Key Challenges Overall

• We need systems that allow us to rapidly and easily share complete systems, from the perspective of both novice users and experienced developers

• We need to develop systems to share complete datasets, rather than forcing users to install loads of bits of software to reconstitute the development environment we used, or forcing us to become proper software developers

• We need to lower the barrier to access for research scientists with a limited understanding of UNIX/Computer Science

• We need a system that allows us to train users on systems that they will then be able to use when they go home

• We need to understand the needs of individual fields

• We need to integrate activities across these needs, to avoid reinventing the wheel

The cloud

• All of the infrastructure issues are much easier when tackled at scale

• This concept led to Amazon, followed by others, offering Cloud services

• A cloud infrastructure provides a mechanism to share systems/software and data, at scale

• Let someone else do the admin etc, and all you have to worry about is running the software

Why not use a commercial cloud?

• We often want lots of RAM – Amazon max flavour size is ~250GB

• Prices are high ($1200/month for ~250GB RAM flavour)

• Storage costs also high – 1TB on Amazon S3 costs $30/month (our current costs are £7/month)

• Additionally Amazon isn’t designed to facilitate sharing of data etc between different people who have VMs

• There are possible issues around T&C’s, governance etc

• Even if we overcome these, often these services are too hard for novice users to make use of
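
The storage prices quoted above can be turned into a rough monthly comparison. This is only a sketch: it assumes the £7/month figure is per TB (as the slide implies), that both prices are flat per-TB rates, and it uses a hypothetical exchange rate (`ASSUMED_USD_PER_GBP`) that does not come from the talk. The function names are illustrative.

```python
# Prices as quoted on the slide; the exchange rate is a hypothetical placeholder.
S3_USD_PER_TB_MONTH = 30.0     # "1TB on Amazon S3 costs $30/month"
LOCAL_GBP_PER_TB_MONTH = 7.0   # "our current costs are £7/month" (assumed per TB)
ASSUMED_USD_PER_GBP = 1.5      # assumption, not from the talk


def monthly_storage_cost_usd(terabytes: float, usd_per_tb_month: float) -> float:
    """Flat-rate monthly cost in USD for storing `terabytes` of data."""
    if terabytes < 0:
        raise ValueError("terabytes must be non-negative")
    return terabytes * usd_per_tb_month


def compare_storage(terabytes: float,
                    usd_per_gbp: float = ASSUMED_USD_PER_GBP) -> dict:
    """Compare commercial and local monthly storage cost for one dataset size."""
    s3 = monthly_storage_cost_usd(terabytes, S3_USD_PER_TB_MONTH)
    local = monthly_storage_cost_usd(
        terabytes, LOCAL_GBP_PER_TB_MONTH * usd_per_gbp
    )
    ratio = s3 / local if local else float("inf")
    return {"s3_usd": s3, "local_usd": local, "ratio": ratio}


if __name__ == "__main__":
    # 100TB is an arbitrary illustrative dataset size.
    print(compare_storage(100))
```

Changing `usd_per_gbp` shows how sensitive the comparison is to the assumed exchange rate; the per-TB prices themselves are the only values taken from the slide.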

Needs

• Core infrastructure now to take advantage of new technologies

• Systems to easily share data

• Repositories that can make tools/methods/data available rapidly and easily

• Better use of the existing RCUK/University server estate

• To change the view of Biologists about working in silos

Introducing the CLoud Infrastructure for Microbial Bioinformatics (CLIMB)

• MRC funded project to develop Cloud Infrastructure for microbial bioinformatics

• ~£4M of hardware, capable of supporting >1000 individual virtual servers

• Providing a core, national cyberinfrastructure for Microbial Bioinformatics

The CLIMB Consortium Are

• Professor Mark Pallen (Warwick) and Dr Sam Sheppard (Swansea) – Joint PIs

• Professor Mark Achtman (Warwick), Professor Steve Busby FRS (Birmingham), Dr Tom Connor (Cardiff)*, Professor Tim Walsh (Cardiff), Dr Robin Howe (Public Health Wales) – Co-Is

• Dr Nick Loman (Birmingham)* and Dr Chris Quince (Warwick) ; MRC Research Fellows

* Principal bioinformaticians architecting and designing the system

The CLoud Infrastructure for Microbial Bioinformatics (climb.ac.uk)

• We are creating a one stop shop for Microbial Bioinformatics

– Public/private cloud for use by UK academics

– Standardised cloud images that implement key pipelines

– Storage repository for data/images that are made available online and within our system, anywhere (‘eduroam for microbial genomics’)

• We will provide access to other databases from within the system

• As well as providing a place to support orphan databases and tools

System Outline

• 4 sites

• Connected over Janet

• Different sizes of VM available; personal, standard, large memory, huge memory

• Able to support >1,000 VMs simultaneously (1:1 vCPUs/vRAM : CPUs/RAM)

• 7-8PB of object storage across 4 sites (~2-3PB usable with erasure coding)

• 400-500TB of local high performance storage per site

• A single system, with common log in, and between site data replication

• System has been designed to enable the addition of extra nodes / Universities
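
The gap between raw and usable object storage can be illustrated with a small calculation. The talk does not state which erasure coding profile or replication factor is used, so the values below (4 data chunks, 2 parity chunks, 2 cross-site copies) are hypothetical, chosen only because they land inside the quoted ~2-3PB range. The function name is illustrative.

```python
def usable_capacity_pb(raw_pb: float, data_chunks: int, parity_chunks: int,
                       site_copies: int = 1) -> float:
    """Usable capacity for a k+m erasure-coded pool kept in `site_copies` copies.

    Each object is split into `data_chunks` (k) data chunks plus
    `parity_chunks` (m) parity chunks, so only k / (k + m) of the raw
    space holds user data. Keeping several full copies divides that again.
    """
    if data_chunks < 1 or parity_chunks < 0 or site_copies < 1:
        raise ValueError("need k >= 1, m >= 0 and at least one copy")
    return raw_pb * data_chunks / (data_chunks + parity_chunks) / site_copies


if __name__ == "__main__":
    # Raw totals from the slide; k, m and the copy count are assumptions.
    for raw in (7.0, 8.0):
        usable = usable_capacity_pb(raw, data_chunks=4, parity_chunks=2,
                                    site_copies=2)
        print(f"{raw:.0f}PB raw -> {usable:.2f}PB usable")
```

Other profiles give different overheads; the point is only that parity and replication together account for most of the difference between the raw and usable figures.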

CLIMB Overview

• 4 sites, running OpenStack

• Hardware procured in a two stage process

• IBM/OCF provided compute, Dell/Red Hat provided storage

• Networks provided by Brocade

• We are defining a reference architecture to enable other sites to trivially be added

Hardware (per site)

• 2 router/firewalls (capable of routing >80Gb each)

• 3 Controllers

• 21x 64 vCore, 512GB RAM nodes

• 3x 192 vCore, 3TB RAM nodes

• ~500TB GPFS (local)

– 4 controllers

– Infiniband, with 10Gb failover

• ~2PB Ceph (shared)

– 27x 64TB nodes/site

– Cross site replication

– 10Gb Backbone
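
The per-site node counts above can be summed to give site and system totals. This is a minimal sketch using only the node figures from the slide; it treats 3TB as 3072GB, which is an assumption about units, and the class and function names are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeType:
    count: int    # nodes of this type per site
    vcores: int   # vCores per node
    ram_gb: int   # RAM per node, in GB


SITES = 4
COMPUTE_NODES = (
    NodeType(count=21, vcores=64, ram_gb=512),
    NodeType(count=3, vcores=192, ram_gb=3 * 1024),  # 3TB taken as 3072GB (assumption)
)
CEPH_NODES_PER_SITE = 27
CEPH_TB_PER_NODE = 64


def site_totals() -> dict:
    """Aggregate vCores, RAM and raw Ceph capacity for one site."""
    return {
        "vcores": sum(n.count * n.vcores for n in COMPUTE_NODES),
        "ram_gb": sum(n.count * n.ram_gb for n in COMPUTE_NODES),
        "ceph_raw_tb": CEPH_NODES_PER_SITE * CEPH_TB_PER_NODE,
    }


def system_totals(sites: int = SITES) -> dict:
    """Scale the per-site totals to the whole system, assuming identical sites."""
    return {key: value * sites for key, value in site_totals().items()}


if __name__ == "__main__":
    print("per site:", site_totals())
    print("all sites:", system_totals())
```

With 27 nodes of 64TB, each site holds roughly 1.7PB of raw Ceph storage, and four such sites give roughly 6.9PB, which is close to the 7-8PB of object storage quoted in the System Outline.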

Overview – 4 sites, (virtually) identical hardware

Each site is connected to the others over VPN tunnels. Sites can be easily added. System can use free router software and commodity hardware, pay-for software, or dedicated router/firewalls

Our intention is for the system to be presented to users as a single system, with single login, via Shibboleth. We are currently working on that bit

A single system makes it easy(er) to share methods and data!

[Diagram labels: External clouds, External databases]

Flavours

• User configurable, with standard flavours

• Regular; up to 8 vCPUs, 64GB RAM

• xlarge; up to 16 vCPUs, 256GB RAM

• Huge; up to 192 vCPUs, 3TB RAM

• System also supports a scalable virtual cluster (large embarrassingly parallel projects)

– 2+ nodes with 2+ vCPUs, 2-4GB RAM/vCPU

• Also provides for Long Term Hosting (for orphan datasets/tools)
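
The flavour limits above can be expressed as a small lookup that picks the smallest standard flavour able to hold a request. This is a sketch for illustration, not how CLIMB or OpenStack actually selects flavours: `Flavour`, `pick_flavour` and `valid_virtual_cluster` are hypothetical names, 3TB is taken as 3072GB, and the virtual cluster rule is read as "each node has at least 2 vCPUs".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Flavour:
    name: str
    max_vcpus: int
    max_ram_gb: int


# Upper limits as listed on the slide, ordered smallest to largest.
FLAVOURS = (
    Flavour("regular", 8, 64),
    Flavour("xlarge", 16, 256),
    Flavour("huge", 192, 3 * 1024),  # 3TB taken as 3072GB (assumption)
)


def pick_flavour(vcpus: int, ram_gb: int) -> Optional[Flavour]:
    """Return the smallest flavour that fits the request, or None if none does."""
    for flavour in FLAVOURS:
        if vcpus <= flavour.max_vcpus and ram_gb <= flavour.max_ram_gb:
            return flavour
    return None


def valid_virtual_cluster(nodes: int, vcpus_per_node: int,
                          ram_gb_per_node: float) -> bool:
    """Check a request against '2+ nodes with 2+ vCPUs, 2-4GB RAM/vCPU'."""
    if nodes < 2 or vcpus_per_node < 2:
        return False
    ram_per_vcpu = ram_gb_per_node / vcpus_per_node
    return 2 <= ram_per_vcpu <= 4


if __name__ == "__main__":
    print(pick_flavour(vcpus=8, ram_gb=128))
    print(valid_virtual_cluster(nodes=4, vcpus_per_node=8, ram_gb_per_node=24))
```

A request is matched on both vCPUs and RAM, so a job needing few cores but a lot of memory is pushed up to a larger flavour, which matches the "we often want lots of RAM" pattern described earlier in the talk.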

Access

• Microbial researchers will be able to access the system in one of two ways

– Externally, via a federated access system, with login via .ac.uk user login in the first instance, later (hopefully) open to anyone who uses Shibboleth

– Internally, via user accounts set up by the consortium for collaborators

• Researchers will be able to provision a set number of VMs

Where are we now?

• Computational hardware was procured by March 2015 (~6 month process)

• Ahead of schedule - system is now online and in use for research

• Adopting two models for access

– Access for registered users to core OpenStack system online now

– “version 1.0” system providing universal access to predefined images starting with the GVL – Autumn 2015

VMs are already up

Users are already using CLIMB to do research

Challenges

• Future Planning (CLIMB will run for 5 years, then what?)

• Cross-Cloud Integration

• Not reinventing the wheel

• Standardising software stacks for .ac.uk clouds

• Being able to embrace new technologies

• Meeting cloud development needs

The Sequencing Iceberg

All of the sequencing platforms available now make producing large genomics datasets relatively cheap and easy

However, the major costs and difficulties do not lie with the generation of data, they lie with the pre-requisites for storing and analysing that data

Informatics expertise

Storage availability

Appropriate HTC capacity

These are interlinked, and expensive

Iceberg breaking with the Cloud?

• It is a mechanism for sharing servers

– Clouds remove the need for hardware maintenance and support

– Storage, compute, networking are most expensive when bought one by one; building a large system represents better value for money

• Sharing servers means you can have standardised systems, simplifying the process of installing and maintaining software

– It provides a mechanism for software/data reuse as well as sharing

– Also makes training easier; you can use the system you trained on, once you get back home

CLIMB Next Steps – and future needs

• New images/analytics tools (GVL!)

• Integration of datasets

• Expanding our userbase

• Collaboration with other cloud services

• Integrating with databases

• Integrating with other clouds

• Developing new sites

• Developing the system to meet the needs of our users

• Developing policy

• Defining and developing security policy

• Developing/setting up federated access

• Possibly looking at capacity to burst out or accept bursts from other resources

• Developing our training programme and outreach

Cloud Infrastructure for Microbial Bioinformatics

• A multi site system to provide a one-stop-bioinformatics-shop, designed specifically to support Microbial researchers

• For both Bioinformaticians and wet lab scientists

• Combines hardware with training

• Free, simple interface, easy to use

• Common login

• Easy data and method sharing

• Already have multiple users from across UK academia and healthcare


CLoud Infrastructure for Microbial Bioinformatics (CLIMB)

• MRC funded project to develop Cloud Infrastructure for microbial bioinformatics

• ~£4M of hardware, capable of supporting >1000 individual virtual servers

• Amazon/Google cloud for Academics
