building a bioinformatics pipeline - arc centre of...
Building a bioinformatics pipeline
Kate Wathen-Dunn
Background
Sugarcane yellow canopy syndrome (YCS) symptoms include
• specific pattern of leaf yellowing
• abnormal amount of sugar and starch in the leaf
Currently, the cause of YCS is unknown.
Background
Key Questions
Using RNAseq and differential expression
• Which molecular mechanisms are involved?
• Any enrichment for biological agents?
• What can the changes in gene expression tell us about YCS?
Research Problem
• There was no reference genome or transcriptome for sugarcane to map our reads to.
• The RNA was fragmented before sequencing
• We had to start by assembling the transcriptome, de novo, from our RNAseq reads.
(Diagram: DNA → RNA → protein)
What is a Bioinformatics Pipeline?
A series of steps you take to analyse your data, in the order you want, all linked together.
Start with raw data
Step 1: quality trimming
Step 2: normalization
Step 3: assembly
Link the steps together using pipes or workflows
A pipeline is only the process, not the end goal.
Where do I start?
What steps or processes do you always do with new data?
Every bioinformatician will need to build a pipeline at some point
Raw sequencing files → FastQC → Trimmomatic → FastQC
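The FastQC, Trimmomatic, FastQC chain above could be linked together along these lines. This is only a sketch under assumptions: it assumes `fastqc` and a `trimmomatic` wrapper script are on the PATH, that the reads are paired-end, and the file names, output folders and trimming settings (`SLIDINGWINDOW:4:20`, `MINLEN:36`) are illustrative choices, not values from the talk. It defaults to a dry run that only prints the commands.

```python
import shlex
import subprocess
from dataclasses import dataclass
from pathlib import Path


@dataclass
class Step:
    name: str
    argv: list      # command and arguments, one list item per token
    out_dir: Path   # directory the step writes into


def build_qc_steps(r1, r2, outdir):
    """Return FastQC -> Trimmomatic -> FastQC as an ordered list of steps."""
    out = Path(outdir)
    raw_qc, trimmed, trimmed_qc = out / "fastqc_raw", out / "trimmed", out / "fastqc_trimmed"
    # Hypothetical output names: paired (P) and unpaired (U) reads for each mate.
    p1, u1 = trimmed / "R1.P.fastq.gz", trimmed / "R1.U.fastq.gz"
    p2, u2 = trimmed / "R2.P.fastq.gz", trimmed / "R2.U.fastq.gz"
    return [
        Step("fastqc_raw",
             ["fastqc", "--outdir", str(raw_qc), str(r1), str(r2)], raw_qc),
        Step("trimmomatic",
             ["trimmomatic", "PE", str(r1), str(r2),
              str(p1), str(u1), str(p2), str(u2),
              "SLIDINGWINDOW:4:20", "MINLEN:36"],  # illustrative settings only
             trimmed),
        Step("fastqc_trimmed",
             ["fastqc", "--outdir", str(trimmed_qc), str(p1), str(p2)], trimmed_qc),
    ]


def run(steps, dry_run=True):
    """Run the steps in order; check=True stops the pipeline at the first failure."""
    done = []
    for step in steps:
        print(f"[{step.name}] {shlex.join(step.argv)}")
        if not dry_run:
            step.out_dir.mkdir(parents=True, exist_ok=True)
            subprocess.run(step.argv, check=True)
        done.append(step.name)
    return done


# Dry run: prints the three commands without executing anything.
run(build_qc_steps("sample01_R1.fastq.gz", "sample01_R2.fastq.gz", "results"))
```

Keeping each step as data (a name plus a command) makes it easy to add, drop or reorder steps later, and the second FastQC run reads exactly the files that Trimmomatic wrote.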
Benefits of building your pipeline
Why bother with all that effort?
Time-saver for later
• You’ve already done the thinking and research
• You know what the best tools are
• All the software is already installed and working
Reproducibility
• Re-use your pipeline when you get new data
• Results from any new data will be produced in exactly the same way
• Share your pipeline with others (e.g. workflows)
What steps do I need in my bioinformatics pipeline?
It depends on your dataset and your question.
It depends…?
Building a pipeline is just like this!
You have a goal to achieve, and a set of tools available to you.
Building your Bioinformatics Pipeline
How have other people done this?
• Literature review
• Talk to others who have done this kind of work before
• Will those approaches work for me?
What resources do I have or need?
• Computing
• Skills and Knowledge
• Time
Save yourself some time and pain by learning from other people’s mistakes!
The first step in any new adventure is gathering information.
Ten Rules for Building a Bioinformatics pipeline
1. Begin at the end
2. Don’t reinvent the wheel
3. Keep it simple
4. Be organised
5. Be prepared to fail (a lot!) before you succeed
6. Step back and check
7. Persevere
8. Be flexible
9. Think logically
10. Ask for help
Preventing problems
The first 4 rules are to prevent problems from occurring while building your bioinformatics pipeline
1) Begin at the end
This helps you to know what you’re working towards, and gives you a clear goal to achieve.
What will your finished product look like?
• How many transcripts are expected?
• What sizes should the transcripts be?
• Should the transcriptome be annotated?
• Who will use it?
How will you know if it is any good?
• What metrics should I use to assess my transcriptome?
• What transcripts should I expect to have definitely been assembled? (e.g. housekeeping genes)
• Do my reads map to it?
• Is there a software tool I can use to assess quality?
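Some of the questions above (how many transcripts, what sizes) can be answered with a few lines of code once an assembly exists. The sketch below is one possible way to do it, not the method used in the talk: it counts sequences in a FASTA file and reports N50, a common contiguity metric. The function names and the tiny demo sequences are made up for illustration.

```python
from io import StringIO


def read_fasta_lengths(handle):
    """Return the length of every sequence in an open FASTA file."""
    lengths, current = [], None
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif line and current is not None:
            current += len(line)   # sequences may be wrapped over several lines
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Largest length L such that sequences of length >= L hold at least half of all bases."""
    total, running = sum(lengths), 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


def summarise(lengths):
    """Collect basic size metrics for an assembly."""
    return {
        "transcripts": len(lengths),
        "total_bases": sum(lengths),
        "shortest": min(lengths, default=0),
        "longest": max(lengths, default=0),
        "n50": n50(lengths),
    }


# Made-up demo data; with a real assembly, pass open("assembly.fasta") instead.
demo = StringIO(">t1\nACGTACGTAC\n>t2\nACGTAC\nGT\n>t3\nACGT\n>t4\nAC\n")
print(summarise(read_fasta_lengths(demo)))
```

Size metrics alone do not show whether the transcripts are correct, which is why the remaining questions (expected genes present, reads mapping back, dedicated assessment tools) still matter.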
2) Don’t reinvent the wheel
No need for yet-another-method.
Who else has already done this?
Can I use their methods directly?
Do I need to adapt it to suit my data?
3) Keep it simple
Use only what you need to, no more.
What tools are essential?
What steps can I scrap?
4) Be organised
Have some structure to how you keep your files organised and named.
• Keep your original files unaltered
• Use a system to name and store your files
• Keep good notes and records
• Back up your files regularly
File names to avoid: Final.fasta, Final-final.fasta, This-time-I-mean-it-final.fasta
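Two of these habits are easy to automate. The sketch below is an illustration under assumptions, not part of the original talk: it records a SHA-256 checksum for each raw file so you can later confirm the originals are unaltered, and it builds output names from a fixed pattern. The `project_step_vNN.ext` scheme and all function names are hypothetical. Store the manifest outside the raw data folder.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Checksum a file in chunks so large sequencing files fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(raw_dir, manifest_path):
    """Record one 'checksum  filename' line per file in raw_dir."""
    files = [p for p in sorted(Path(raw_dir).iterdir()) if p.is_file()]
    lines = [f"{sha256_of(p)}  {p.name}" for p in files]
    Path(manifest_path).write_text("\n".join(lines) + "\n")
    return len(lines)


def changed_files(raw_dir, manifest_path):
    """Return names of files that are missing or no longer match the manifest."""
    changed = []
    for line in Path(manifest_path).read_text().splitlines():
        if not line.strip():
            continue
        recorded, name = line.split("  ", 1)
        target = Path(raw_dir) / name
        if not target.is_file() or sha256_of(target) != recorded:
            changed.append(name)
    return changed


def output_name(project, step, version, ext):
    """Hypothetical naming scheme: project_step_vNN.ext."""
    return f"{project}_{step}_v{version:02d}.{ext}"


# Small demonstration in a temporary folder with a made-up read file.
with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp) / "raw"
    raw.mkdir()
    (raw / "sample01_R1.fastq").write_text("@r1\nACGT\n+\nIIII\n")
    manifest = Path(tmp) / "raw_manifest.sha256"
    write_manifest(raw, manifest)
    print(changed_files(raw, manifest))
    print(output_name("ycs", "trinity", 3, "fasta"))
```

A version number in the name replaces the need for "final" and leaves a record of how many attempts came before.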
Solving problems
The next 6 rules are to help you solve the problems you’ll face when building your bioinformatics pipeline
5) Be prepared to fail (a lot!) before you succeed
Nothing personal, just saying.
Why didn’t that work??!
What can I change to fix it?
• Check for simple errors first, like the wrong path to a file or a spelling mistake
• Don’t waste your energy getting angry at your computer, it won’t care
6) Step back and check
Try a different perspective
Is this the best way to do this?
Is there a better way?
With the end in mind, consider if you’re still going in the right direction.
7) Persevere
If this is the best way, then keep going.
Take a break if you need to, sleep on it or tackle another task for a while.
But keep coming back to the problem.
There is a solution, you just have to find it.
8) Be flexible
If there’s a better way, take it.
Keep the destination in mind, but be flexible in how you get there.
Use your judgement
Don’t be afraid to change direction if you need to
9) Think logically
Software and computing problems are logical, so think logically to solve them.
Try a systematic approach
• Check the software
• Check your script
• Check your files
• Check the permissions
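Three of the four checks above (software, files, permissions) can be scripted and run before every job. The sketch below is one way to do that, offered as an assumption-laden example. The function name `preflight` and the tool and file names in the demonstration are hypothetical. Checking the script itself still needs a human read-through.

```python
import os
import shutil
import tempfile
from pathlib import Path


def preflight(tools, input_files, output_dir):
    """Return a list of problems found; an empty list means every check passed."""
    problems = []
    # Check the software: is each tool findable on the PATH?
    for tool in tools:
        if shutil.which(tool) is None:
            problems.append(f"software not found on PATH: {tool}")
    # Check your files: do they exist, and do they contain anything?
    for name in input_files:
        path = Path(name)
        if not path.is_file():
            problems.append(f"input file missing: {path}")
        elif path.stat().st_size == 0:
            problems.append(f"input file is empty: {path}")
        # Check the permissions: can we read the input?
        elif not os.access(path, os.R_OK):
            problems.append(f"input file not readable: {path}")
    # Check the permissions: can we write where the results should go?
    out = Path(output_dir)
    if not out.is_dir():
        problems.append(f"output directory missing: {out}")
    elif not os.access(out, os.W_OK):
        problems.append(f"output directory not writable: {out}")
    return problems


# Demonstration with one real temporary file and one deliberately wrong path.
with tempfile.TemporaryDirectory() as tmp:
    reads = Path(tmp) / "reads.fastq"
    reads.write_text("@r1\nACGT\n+\nIIII\n")
    wrong_path = Path(tmp) / "raeds.fastq"   # spelling mistake on purpose
    for problem in preflight(["fastqc", "trimmomatic"], [reads, wrong_path], tmp):
        print(problem)
```

Running the checks in the same order every time means a wrong path or a spelling mistake is reported in seconds, before a long job is queued.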
10) Ask for help
Someone, somewhere, has likely already faced this problem and solved it.
Ask them how they did it.
What help is available?
• Try pasting your error message directly into your favourite search engine.
• Software Google groups and forums, Biostars, Stack Overflow
• People you know, in your lab or field
• Collaborate with someone
• READMEs, user manuals and vignettes
Pay it forward
• If you find a solution, upload it and help someone else
My Bioinformatics Pipeline
Sugarcane RNA sequences from leaf (total samples: 70)
• Raw reads
• Trimmed reads, using Trimmomatic
• Normalised reads, using Trinity in silico normalisation
• Concatenated left and right reads
• Normalised reads into four assembly software tools:
  • Velvet assembly: cluster and merge 12 assemblies with different kmers using Oases → single Velvet assembly
  • Trinity assembly: 1 assembly made with single kmer → single Trinity assembly
  • SoapTrans denovo assembly: cluster and merge 8 assemblies with different kmers using CD-HIT-EST → single SoapTrans assembly
  • Soap2 denovo assembly: cluster and merge 8 assemblies with different kmers using CD-HIT-EST → single Soap2 assembly
• Clean up the assemblies output from each tool through the EvidentialGene pipeline
• Cluster and merge all contigs at 100% similarity using CD-HIT-EST
• Reassemble the contigs using Trinity, then annotate the transcripts using Blast2GO
• Clean up final assembly using EvidentialGene. Check quality with BUSCO, TransRate and Python scripts.
Result: Sugarcane Leaf Transcriptome
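To illustrate why contigs from several assemblers are merged at 100% similarity, here is a deliberately simplified stand-in. It is not CD-HIT-EST and does not reproduce its clustering: it only drops contigs that are exact copies (on either strand) of one already kept, and the assembler labels and sequences are made up.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the sequence of the opposite strand."""
    return seq.translate(_COMPLEMENT)[::-1]


def merge_identical(assemblies):
    """Pool contigs from several assemblies and keep one copy of each exact duplicate.

    `assemblies` maps an assembler label to {contig_id: sequence}. A contig is a
    duplicate if it, or its reverse complement, has already been kept.
    """
    kept, seen = {}, set()
    for label, contigs in assemblies.items():
        for contig_id, seq in contigs.items():
            seq = seq.upper()
            key = min(seq, reverse_complement(seq))   # same key for both strands
            if key in seen:
                continue
            seen.add(key)
            kept[f"{label}|{contig_id}"] = seq
    return kept


# Made-up contigs: t1 is the reverse complement of c1, t2 is c1 in lower case.
pooled = merge_identical({
    "velvet": {"c1": "ACGTT", "c2": "GGGCC"},
    "trinity": {"t1": "AACGT", "t2": "acgtt", "t3": "TTTTA"},
})
print(sorted(pooled))
```

Different assemblers often recover the same transcript, so removing identical copies before the next step keeps the combined set from being inflated by redundancy. A real clustering tool also handles contigs that are contained within longer ones, which this sketch does not.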
Purpose of building the pipeline
Sugarcane yellow canopy syndrome (YCS) symptoms include
• leaf yellowing
• abnormal amount of sugar and starch in the leaf
Currently, the cause of YCS is unknown.
RNAseq and Differential Expression
• Which molecular mechanisms are involved?
• Any enrichment for biological agents?
• What can the changes in gene expression tell us about YCS?
Research Problem
• There was no reference genome or transcriptome for sugarcane to map our reads to.
• We had to start by assembling the transcriptome, de novo, from our RNAseq reads.
Collaborators
• Sugar Research Australia: Prof Frikkie Botha, Annelie Marquardt, Gerard Scalia
• University of Queensland: A/Prof Mikael Boden
• University of Texas at Tyler: A/Prof Kate Hertweck
Acknowledgement and Thanks
Research funding from Sugar Research Australia and the Queensland government.
Thanks to UQ RCC’s Tinaroo and FlashLite HPC servers, and to TACC’s Stampede HPC cluster for compute resources.
Grateful thanks to Mikael Boden’s lab group for including me, sharing their expertise, and helping me develop my bioinformatics skills.