Delivering Bioinformatics Training Using Cloud Computing Infrastructure Nathan S. Watson-Haigh

Upload: australian-bioinformatics-network

Post on 10-May-2015


Page 1: Delivering Bioinformatics Training Using Cloud Computing Infrastructure - Nathan Watson-Haigh

Delivering Bioinformatics Training Using Cloud Computing Infrastructure

Nathan S. Watson-Haigh

Page 2: Delivering Bioinformatics Training Using Cloud Computing Infrastructure - Nathan Watson-Haigh

Take-Home Message

• Cloud computing infrastructure
  – Solves some issues in delivering hands-on bioinformatics training
  – Has its own unique set of issues
• Code and materials (CC and Open Access)
  – NGS Workshop
  – Try rolling your own!

github.com/BPA-CSIRO-Workshops

Watson-Haigh, N.S., et al. (2013). Next-generation sequencing: a challenge to meet the increasing demand for training workshops in Australia. Brief Bioinform 14, 563–574. http://bib.oxfordjournals.org/content/14/5/563

Page 3: Delivering Bioinformatics Training Using Cloud Computing Infrastructure - Nathan Watson-Haigh

ACKNOWLEDGEMENTS

Catherine Shang (Bioplatforms Australia)

Nathan Watson-Haigh (ACPFG)
Nandan Deshpande (Systems Biology Initiative, UNSW)
Paula Moolhuijzen (CCG, Murdoch University)
Sonika Tyagi (Australian Genome Research Facility)
Matthew Field (ANU)

Annette McGrath (CSIRO Bioinformatics Core, Digital Productivity and Services Flagship)

Konsta Duesing (Food & Nutrition Flagship, CSIRO)
Xi (Sean) Li (CSIRO Bioinformatics Core, Digital Productivity and Services Flagship)
Sean McWilliam (CSIRO Agricultural Productivity Flagship)
Paul Greenfield (CSIRO Digital Productivity and Services Flagship)

Cath Brooksbank (EBI)
Vicky Schneider (TGAC)

Matthias Haimel (University of Cambridge)
Myrto Kostadima (University of Cambridge)
Remco Loos (EBI)
Alex Mitchell (EBI)
Hubert Denise (EBI)

Jerico Revote (Monash e-Research Centre)
Simon Michnowicz (Monash e-Research Centre)
Steve Quenette (Monash University)

Mark Crowe (QFAB)
Peter Sterk (Oxford e-Research Centre)

Page 4: Delivering Bioinformatics Training Using Cloud Computing Infrastructure - Nathan Watson-Haigh

THE SETUP

[Diagram labels: Office, Host, Workshop VM]

Page 5: Delivering Bioinformatics Training Using Cloud Computing Infrastructure - Nathan Watson-Haigh

SETTING UP

[Diagram labels: Office, Host, Admin VM, Cloud API Tools]

Page 6: Delivering Bioinformatics Training Using Cloud Computing Infrastructure - Nathan Watson-Haigh

DEALING WITH HICCUPS

[Diagram labels: Office, Host, Sysadmin Computer, Portable Apps, Admin VM, parallel-ssh / parallel-scp / parallel-slurp]
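The parallel-ssh tools named on this slide can send one command to every workshop VM at once. The sketch below only builds and prints such a command so it can be reviewed before use. The hosts file name, the "ubuntu" login and the disk-space check are assumptions for illustration, not details from the slides, and the quoting is deliberately naive.

```shell
#!/bin/sh
# Sketch only: hosts.txt, the "ubuntu" login and the example command are
# assumptions, not taken from the slides.

# Print the parallel-ssh command line for review before running it.
# -h: file listing one host per line, -l: remote login name,
# -i: show each host's output inline, -t: per-host timeout in seconds.
# Note: the remote command is wrapped in single quotes without escaping,
# so it must not itself contain single quotes.
pssh_cmd() {
    hosts_file=$1
    user=$2
    remote_cmd=$3
    printf "parallel-ssh -h %s -l %s -i -t 60 '%s'\n" \
        "$hosts_file" "$user" "$remote_cmd"
}

# Dry run: print the command that would check free disk space on every VM.
pssh_cmd hosts.txt ubuntu 'df -h /'
```

Once the printed line looks right it can be pasted into a shell on the admin VM. On systems where the suite is installed under the name pssh, the command name would need to change accordingly.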

Page 7: Delivering Bioinformatics Training Using Cloud Computing Infrastructure - Nathan Watson-Haigh

Drivers

• A need for bioinformatics training
  – Good bioinformaticians work at, and understand, command line tools
• Take the workshops to the trainees
  – Maximise participation

Page 8: Delivering Bioinformatics Training Using Cloud Computing Infrastructure - Nathan Watson-Haigh

Goals

• Minimise maintenance of the training environment
  – No monolithic installs
• Minimise cognitive burden on trainees
  – The training environment should go unseen
• Make everything publicly accessible and as reusable as possible

Page 9: Delivering Bioinformatics Training Using Cloud Computing Infrastructure - Nathan Watson-Haigh

NGS Workshop: Key Elements

• Knowledgeable, friendly trainers
  – Obviously
• Content
  – Tools, data, handout
• Mode of delivery
  – Dedicated training suite, BYO laptop, roadshow
• Training environment
  – Tailored to mode of delivery

Page 10: Delivering Bioinformatics Training Using Cloud Computing Infrastructure - Nathan Watson-Haigh

THE REUSABLE RESOURCES: VM SETUP

[Diagram labels: Office, Vanilla Ubuntu VM, NGS Workshop VM]

• Gnome
• FreeNX
• Generic Tools

• NGS tools
• NGS data
• NGS handout

Page 11: Delivering Bioinformatics Training Using Cloud Computing Infrastructure - Nathan Watson-Haigh

THE REUSABLE RESOURCES: HANDOUT

[Diagram labels: Office, Trainee Handout, Trainer Handout]

Page 12: Delivering Bioinformatics Training Using Cloud Computing Infrastructure - Nathan Watson-Haigh

Rolling Your Own Handout

• Style file provided
  – Makes it easy(er) to write/edit LaTeX
• Trainee Handout
• Trainer Handout

https://github.com/BPA-CSIRO-Workshops/handout-template

Page 13: Delivering Bioinformatics Training Using Cloud Computing Infrastructure - Nathan Watson-Haigh

Simplified Styling

\begin{information}
Information to be provided to the trainee.
\end{information}

Page 14: Delivering Bioinformatics Training Using Cloud Computing Infrastructure - Nathan Watson-Haigh

Simplified Styling

\begin{questions}
First question.
\begin{answer}
Answer to first question.
\end{answer}

Second question.
\begin{answer}
Answer to second question.
\end{answer}
\end{questions}

Page 15: Delivering Bioinformatics Training Using Cloud Computing Infrastructure - Nathan Watson-Haigh

Simplified Styling

\begin{lstlisting}
# several lines of code
cd ~/
ls -l
# a long command that line wraps automatically
tophat --solexa-quals -g 2 --library-type fr-unstranded -j annotation/Danio_rerio.Zv9.66.spliceSites -o tophat/ZV9_2cells genome/ZV9 data/2cells_1.fastq data/2cells_2.fastq
\end{lstlisting}

Page 16: Delivering Bioinformatics Training Using Cloud Computing Infrastructure - Nathan Watson-Haigh

Resources Refresher

• Plain text files (Bash, LaTeX) for
  – Generic tools install
  – Workshop-specific tool install
  – Workshop-specific data download/configuration
  – Handout document
• Why plain text?
  – Version control
  – Collaboration
  – Reuse
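A plain-text install script of the kind listed above might look roughly like the following. This is a minimal sketch, not the workshop's actual script: the helper names and the tools checked are hypothetical, and the real install command is left as a comment so the sketch is safe to run anywhere.

```shell
#!/bin/sh
# Sketch of a generic-tools install script. Helper names and the tool
# list are hypothetical; the real install step is only shown as a comment.

# True when the named command is not yet on the PATH.
needs_install() {
    ! command -v "$1" >/dev/null 2>&1
}

# Report what would be installed; the package name defaults to the tool name.
install_if_missing() {
    tool=$1
    pkg=${2:-$1}
    if needs_install "$tool"; then
        # A real script might run: sudo apt-get install -y "$pkg"
        echo "would install: $pkg"
    else
        echo "already present: $tool"
    fi
}

install_if_missing sh
install_if_missing hypothetical-ngs-tool hypothetical-ngs-package
```

Because the script skips tools that are already present, it can be rerun on a partly configured VM, and because it is plain text it can be versioned and shared like the rest of the workshop materials.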

Page 17: Delivering Bioinformatics Training Using Cloud Computing Infrastructure - Nathan Watson-Haigh

Cloud Pros and Cons

Pros
• Consistent training environment
• No “alien” OS on host network
• Minimal host network configuration and traffic
  – Firewall (port 22)
• Minimal local computer specification and configuration
  – NX Client plus session files
• Scalable resources
• Encourages reproducible work

Cons
• Remote vs local confusion
  – Hide this using NX
• How to analyse own data?
• Requires a computer suite
• Sysadmin skills required

Page 18: Delivering Bioinformatics Training Using Cloud Computing Infrastructure - Nathan Watson-Haigh

Workshops Using This System

[Timeline figure; labels listed in the order they were recovered from the slide]

• Feb 2012
• Jul 2012: MEL (22), SYD (22)
• Nov 2012: BNE (33), ADL (29)
• Feb 2013: CAN (35)
• Jun 2013: PER (38)
• Jul 2013: MEL (60)
• Jul 2013: MEL (38)
• Nov 2013: SYD (38), BNE (30)
• Feb 2014: SYD (38), MEL (34)
• Jul 2014: SYD (37)
• Jul 2014: CAN (60)
• Jul 2014: CAN (15)
• Sep 2013
• Feb 2012
• Dec 2012: BioInfoSummer (100)
• Dec 2013: BioInfoSummer (100)
• Nov 2013: ACAD (30)
• Nov 2012: ACAD (30)
• Jul 2014: R & rQTL (40)
• Apr 2013: Linux & RNA-Seq (30)
• Nov 2014: ACAD (30)

BPA/CSIRO Competitive Courses: ~650 applicants for ~400 places

EMBL Australia PhD Program: 120 places

Other workshops: 360 places

Total: ~900 places in 2.5 yrs

Page 19: Delivering Bioinformatics Training Using Cloud Computing Infrastructure - Nathan Watson-Haigh

Future Directions

• Better “glue” to enable easier reuse on
  – Local VMs
  – NeCTAR Research Cloud
  – Amazon Web Services
• Better documentation - ugh!
  – Easier for others to contribute and roll their own
  – Tagging workshop versions

Page 20: Delivering Bioinformatics Training Using Cloud Computing Infrastructure - Nathan Watson-Haigh

Take-Home Message

• Cloud computing infrastructure
  – Solves some issues in delivering hands-on bioinformatics training
  – Has its own unique set of issues
• Code and materials (CC and Open Access)
  – NGS Workshop
  – Try rolling your own!

github.com/BPA-CSIRO-Workshops

Watson-Haigh, N.S., et al. (2013). Next-generation sequencing: a challenge to meet the increasing demand for training workshops in Australia. Brief Bioinform 14, 563–574. http://bib.oxfordjournals.org/content/14/5/563

Page 21: Delivering Bioinformatics Training Using Cloud Computing Infrastructure - Nathan Watson-Haigh

ACKNOWLEDGEMENTS

Catherine Shang (Bioplatforms Australia)

Nathan Watson-Haigh (ACPFG)
Nandan Deshpande (Systems Biology Initiative, UNSW)
Paula Moolhuijzen (CCG, Murdoch University)
Sonika Tyagi (Australian Genome Research Facility)
Matthew Field (ANU)

Annette McGrath (CSIRO Bioinformatics Core, Digital Productivity and Services Flagship)

Konsta Duesing (Food & Nutrition Flagship, CSIRO)
Xi (Sean) Li (CSIRO Bioinformatics Core, Digital Productivity and Services Flagship)
Sean McWilliam (CSIRO Agricultural Productivity Flagship)
Paul Greenfield (CSIRO Digital Productivity and Services Flagship)

Cath Brooksbank (EBI)
Vicky Schneider-Gricar (TGAC)

Matthias Haimel (University of Cambridge)
Myrto Kostadima (University of Cambridge)
Remco Loos (EBI)
Alex Mitchell (EBI)
Hubert Denise (EBI)

Jerico Revote (Monash e-Research Centre)
Simon Michnowicz (Monash e-Research Centre)
Steve Quenette (Monash University)

Mark Crowe (QFAB)
Peter Sterk (Oxford e-Research Centre)

Page 22: Delivering Bioinformatics Training Using Cloud Computing Infrastructure - Nathan Watson-Haigh

Good, Bad and Ugly: Trainee

[Table with columns Good, Bad, Ugly; items are grouped by delivery mode and listed in slide order, running from Good to Ugly]

BYO laptop
• Familiar environment
• Accessible afterwards
• Permissions
• Poor hardware specification
• First hour (or 3) wasted by sorting out “issues”

Dedicated training room
• A dedicated facility
• Everything should just work
• Costs of residential courses

Remote server
• Access to more powerful hardware
• Usually CLI
• Remote vs local confusion
• Limited access
• I want a GUI
• Users competing over compute resources

Cloud virtualisation
• Accessible afterwards
• What’s a cloud!?
• Can I use this for my own data?

Page 23: Delivering Bioinformatics Training Using Cloud Computing Infrastructure - Nathan Watson-Haigh

Good, Bad and Ugly: Trainer

[Table with columns Good, Bad, Ugly; items are grouped by delivery mode and listed in slide order, running from Good to Ugly]

BYO laptop
• Just need a room
• Network access
• Multiple OSes
• We look like idiots
• We wasted so much time

Dedicated training room
• We know what works
• Local IT support
• Maintaining up-to-date hardware
• No access afterwards

Remote server
• Control over the OS
• A single OS to maintain
• Managing users
• Resource management

Cloud virtualisation
• Enables roadshows
• 1 VM per trainee
• New skills required
• Post-workshop access

Page 24: Delivering Bioinformatics Training Using Cloud Computing Infrastructure - Nathan Watson-Haigh

Puppet

• Helps sysadmins automate many repetitive tasks
• Puppet config files
  – Plain text (version control)
  – Define the required state “B”; Puppet figures out how to get from “A” to “B”
• Workshops defined in plain text, in terms of the tools and data needed
  – Collaborate and share on workshops
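The “A” to “B” idea can be illustrated without Puppet itself. The sketch below is plain shell, not Puppet syntax: it takes a desired list of tools (state “B”), inspects the current machine (state “A”) and prints only the actions still needed. The function name and the tool names are hypothetical.

```shell
#!/bin/sh
# Illustration of desired-state configuration in plain shell.
# This is not Puppet; plan_actions and the tool names are hypothetical.

# Desired state "B": one tool name per line.
desired_tools='sh
hypothetical-ngs-tool'

# Compare the current machine (state "A") with the desired list and
# print one "install" action for each tool that is missing.
plan_actions() {
    printf '%s\n' "$1" | while IFS= read -r tool; do
        [ -n "$tool" ] || continue
        command -v "$tool" >/dev/null 2>&1 || echo "install $tool"
    done
}

plan_actions "$desired_tools"
```

The list says what should be present, not how to get there, which is what makes such a definition easy to share and to rerun on machines in different starting states.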