Delivering Bioinformatics Training Using Cloud Computing Infrastructure
Nathan S. Watson-Haigh
Take-Home Message
• Cloud computing infrastructure
  – Solves some issues in delivering hands-on bioinformatics training
  – Has its own unique set of issues
• Code and materials (CC and Open Access)
  – NGS Workshop
  – Try rolling your own!
github.com/BPA-CSIRO-Workshops
Watson-Haigh, N.S., et al. (2013). Next-generation sequencing: a challenge to meet the increasing demand for training workshops in Australia. Brief Bioinform 14, 563–574. http://bib.oxfordjournals.org/content/14/5/563
ACKNOWLEDGEMENTS
Catherine Shang (Bioplatforms Australia)
Nathan Watson-Haigh (ACPFG)
Nandan Deshpande (Systems Biology Initiative, UNSW)
Paula Moolhuijzen (CCG, Murdoch University)
Sonika Tyagi (Australian Genome Research Facility)
Matthew Field (ANU)
Annette McGrath (CSIRO Bioinformatics Core, Digital Productivity and Services Flagship)
Konsta Duesing (Food & Nutrition Flagship, CSIRO)
Xi (Sean) Li (CSIRO Bioinformatics Core, Digital Productivity and Services Flagship)
Sean McWilliam (CSIRO Agricultural Productivity Flagship)
Paul Greenfield (CSIRO Digital Productivity and Services Flagship)
Cath Brooksbank (EBI)
Vicky Schneider-Gricar (TGAC)
Matthias Haimel (University of Cambridge)
Myrto Kostadima (University of Cambridge)
Remco Loos (EBI)
Alex Mitchell (EBI)
Hubert Denise (EBI)
Jerico Revote (Monash e-Research Centre)
Simon Michnowicz (Monash e-Research Centre)
Steve Quenette (Monash University)
Mark Crowe (QFAB)
Peter Sterk (Oxford e-Research Centre)
THE SETUP
[Diagram labels: Office, Host, Workshop VM]
SETTING UP
[Diagram labels: Office, Host, Admin VM, Cloud API Tools]
DEALING WITH HICCUPS
[Diagram labels: Office, Host, Sysadmin Computer, Portable Apps, Admin VM, parallel-ssh, parallel-scp, parallel-slurp]
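As a rough illustration of how a fix might be pushed to every trainee VM at once, the sketch below builds a parallel-ssh command line. The slides do not show the actual invocation, so the hosts file name, user name and remote command are hypothetical. The binary is assumed to be installed under its Debian/Ubuntu name, parallel-ssh; other systems ship it as pssh.

```python
import shlex


def build_pssh_command(hosts_file, user, remote_command, timeout_s=60):
    """Return an argv list that runs remote_command on every host in hosts_file."""
    return [
        "parallel-ssh",         # Debian/Ubuntu name; elsewhere the binary may be "pssh"
        "-h", hosts_file,       # file with one host per line
        "-l", user,             # remote login name
        "-t", str(timeout_s),   # per-host timeout in seconds
        "-i",                   # print each host's output inline
        remote_command,
    ]


# Hypothetical values, for illustration only.
argv = build_pssh_command("workshop_vms.txt", "trainee", "df -h /home")
print(shlex.join(argv))
```

On the admin VM, a list like this could be handed to subprocess.run(argv, check=True). Building the command as a list rather than a string avoids quoting problems when the remote command contains spaces.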
Drivers
• A need for bioinformatics training
  – Good bioinformaticians work at, and understand, command line tools
• Take the workshops to the trainees
  – Maximise participation
Goals
• Minimise maintenance of the training environment
  – No monolithic installs
• Minimise cognitive burden on trainees
  – The training environment should go unseen
• Make everything publicly accessible and as reusable as possible
NGS Workshop: Key Elements
• Knowledgeable, friendly trainers
  – Obviously
• Content
  – Tools, data, handout
• Mode of delivery
  – Dedicated training suite, BYO laptop, roadshow
• Training environment
  – Tailored to mode of delivery
THE REUSABLE RESOURCES: VM SETUP
[Diagram labels: Office, Vanilla Ubuntu VM, NGS Workshop VM]
• Gnome
• FreeNX
• Generic Tools
• NGS tools
• NGS data
• NGS handout
THE REUSABLE RESOURCES: HANDOUT
[Diagram labels: Office, Trainee Handout, Trainer Handout]
Rolling Your Own Handout
• Style file provided
  – Makes it easy(er) to write/edit LaTeX
• Trainee Handout
• Trainer Handout
https://github.com/BPA-CSIRO-Workshops/handout-template
Simplified Styling
\begin{information}
Information to be provided to the trainee.
\end{information}
Simplified Styling
\begin{questions}
First question.
\begin{answer}
Answer to first question.
\end{answer}
Second question.
\begin{answer}
Answer to second question.
\end{answer}
\end{questions}
Simplified Styling
\begin{lstlisting}
# several lines of code
cd ~/
ls -l
# a long command that line wraps automatically
tophat --solexa-quals -g 2 --library-type fr-unstranded -j annotation/Danio_rerio.Zv9.66.spliceSites -o tophat/ZV9_2cells genome/ZV9 data/2cells_1.fastq data/2cells_2.fastq
\end{lstlisting}
Resources Refresher
• Plain text files (Bash, LaTeX) for
  – Generic tools install
  – Workshop-specific tool install
  – Workshop-specific data download/configuration
  – Handout document
• Why plain text?
  – Version control
  – Collaboration
  – Reuse
Cloud Pros and Cons
Pros
• Consistent training environment
• No “alien” OS on host network
• Minimal host network configuration and traffic
  – Firewall (port 22)
• Minimal local computer specification and configuration
  – NX Client plus session files
• Scalable resources
• Encourages reproducible work
Cons
• Remote vs local confusion
  – Hide this using NX
• How to analyse own data?
• Requires a computer suite
• Sysadmin skills required
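Since the slide lists port 22 as the firewall requirement, a host site could run a quick reachability check before the workshop. The sketch below is one way to do that with the Python standard library. The function name and the placeholder host are illustrative and not part of the workshop materials.

```python
import socket


def port_is_reachable(host, port=22, timeout_s=3.0):
    """Return True if a TCP connection to host:port can be opened within timeout_s."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False
```

Run from a computer on the host network, a call such as port_is_reachable("vm.example.org") (placeholder host name) returning False would suggest that outbound SSH is blocked or the VM is down. It says nothing about whether an NX session will actually start.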
Workshops Using This System
[Timeline figure. Each entry gives a date, a location or workshop name, and a number in parentheses; entries are listed in the order they appeared.]
• Jul 2012: MEL (22), SYD (22)
• Nov 2012: BNE (33), ADL (29)
• Feb 2013: CAN (35)
• Jun 2013: PER (38)
• Jul 2013: MEL (60)
• Jul 2013: MEL (38)
• Nov 2013: SYD (38), BNE (30)
• Feb 2014: SYD (38), MEL (34)
• Jul 2014: SYD (37)
• Jul 2014: CAN (60)
• Jul 2014: CAN (15)
• Dec 2012: BioInfoSummer (100)
• Dec 2013: BioInfoSummer (100)
• Nov 2013: ACAD (30)
• Nov 2012: ACAD (30)
• Jul 2014: R & rQTL (40)
• Apr 2013: Linux & RNA-Seq (30)
• Nov 2014: ACAD (30)
[Other date labels in the figure: Feb 2012, Sep 2013]
BPA/CSIRO Competitive Courses: ~650 applicants for ~400 places
EMBL Australia PhD Program: 120 places
Other workshops: 360 places
Total: ~900 places in 2.5 yrs
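The total can be checked against the three categories above: 400 + 120 + 360 = 880, which the slide rounds to ~900. The short sketch below repeats that sum and also works out the applicants-per-place ratio for the competitive courses, using only the figures stated on the slide (the "~" values are approximate).

```python
# Figures as stated on the slide; the "~" values are approximate.
places = {
    "BPA/CSIRO competitive courses": 400,
    "EMBL Australia PhD Program": 120,
    "Other workshops": 360,
}
applicants_bpa_csiro = 650

total_places = sum(places.values())
applicants_per_place = applicants_bpa_csiro / places["BPA/CSIRO competitive courses"]

print(f"Total places: {total_places}")
print(f"Applicants per competitive place: {applicants_per_place:.1f}")
```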
Future Directions
• Better “glue” to enable easier reuse on
  – Local VMs
  – NeCTAR Research Cloud
  – Amazon Web Services
• Better documentation - ugh!
  – Easier for others to contribute and roll their own
  – Tagging workshop versions
Good, Bad and Ugly: Trainee
[Table: one row per delivery mode, with columns Good, Bad and Ugly. Entries are listed per row in the order they appeared.]
BYO laptop
• Familiar environment
• Accessible afterwards
• Permissions
• Poor hardware specification
• First hour (or 3) wasted by sorting out “issues”
Dedicated training room
• A dedicated facility
• Everything should just work
• Costs of residential courses
Remote server
• Access to more powerful hardware
• Usually CLI
• Remote vs local confusion
• Limited access
• I want a GUI
• Users competing over compute resources
Cloud virtualisation
• Accessible afterwards
• What’s a cloud!?
• Can I use this for my own data?
Good, Bad and Ugly: Trainer
[Table: one row per delivery mode, with columns Good, Bad and Ugly. Entries are listed per row in the order they appeared.]
BYO laptop
• Just need a room
• Network access
• Multiple OSes
• We look like idiots
• We wasted so much time
Dedicated training room
• We know what works
• Local IT support
• Maintaining up-to-date hardware
• No access afterwards
Remote server
• Control over the OS
• A single OS to maintain
• Managing users
• Resource management
Cloud virtualisation
• Enables roadshows
• 1 VM per trainee
• New skills required
• Post-workshop access
Puppet
• Helps sysadmins automate many repetitive tasks
• Puppet config files
  – Plain text (version control)
  – Defines the required state “B” - Puppet figures out how to get from “A” to “B”
• Workshops defined in terms of tools and data needed using plain text
  – Collaborate and share on workshops
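The "A" to "B" idea can be sketched in a few lines. This is not how Puppet is implemented and it does not use Puppet's own language. It is a toy model, with made-up tool names and versions, in which the desired state is plain data and a planner works out which steps are still needed.

```python
def plan_changes(current, desired):
    """Return the actions needed to bring current up to the desired state.

    Both arguments map a tool name to a version (a hypothetical data model).
    Tools that are not mentioned in desired are left alone.
    """
    actions = []
    for name, version in sorted(desired.items()):
        if name not in current:
            actions.append(("install", name, version))
        elif current[name] != version:
            actions.append(("upgrade", name, version))
    return actions


def apply_changes(current, actions):
    """Apply planned actions to a copy of the current state and return the result."""
    state = dict(current)
    for _action, name, version in actions:
        state[name] = version
    return state


# State "A": what a VM has now. State "B": what the workshop needs.
# Tool names and versions are invented for illustration.
state_a = {"bowtie": "0.12", "samtools": "0.1.18", "oldtool": "1.0"}
state_b = {"bowtie": "0.12", "samtools": "0.1.19", "tophat": "2.0"}

actions = plan_changes(state_a, state_b)
converged = apply_changes(state_a, actions)
print(actions)
print(plan_changes(converged, state_b))  # an empty plan once "B" is reached
```

Because the plan is derived from the difference between the two states, running it again on a converged VM does nothing. That property is what makes a plain-text description of the workshop safe to reapply and easy to share.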