building a bioinformatics pipeline - arc centre of...

Building a bioinformatics pipeline

Kate Wathen-Dunn

Background

Sugarcane yellow canopy syndrome (YCS) symptoms include

• specific pattern of leaf yellowing

• abnormal amount of sugar and starch in the leaf

Currently, the cause of YCS is unknown.

Background

Key Questions

Using RNAseq and differential expression

• Which molecular mechanisms are involved?

• Any enrichment for biological agents?

• What can the changes in gene expression tell us about YCS?

Research Problem

• There was no reference genome or transcriptome for sugarcane to map our reads to.

• The RNA was fragmented before sequencing

• We had to start by assembling the transcriptome, de novo, from our RNAseq reads.

DNA → RNA → PROTEIN

What is a Bioinformatics Pipeline?

A series of steps you take to analyse your data, in the order you want, all linked together.

Start with raw data

Step 1: quality trimming

Step 2: normalization

Step 3: assembly

Link the steps together using pipes or workflows

A pipeline is only the process, not the end goal.
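As a minimal sketch of linking steps together, the Python below chains plain functions so that the output of each step feeds the next. The step functions, thresholds and reads are toy stand-ins made up for illustration, not real trimming or normalisation tools, and assembly is left out because it needs a real assembler.

```python
from collections import Counter
from functools import reduce


def quality_trim(reads, min_len=5):
    """Toy stand-in for quality trimming: strip trailing 'N' bases, drop short reads."""
    trimmed = (read.rstrip("N") for read in reads)
    return [read for read in trimmed if len(read) >= min_len]


def normalise(reads, max_copies=2):
    """Toy stand-in for normalisation: keep at most max_copies of each identical read."""
    seen = Counter()
    kept = []
    for read in reads:
        seen[read] += 1
        if seen[read] <= max_copies:
            kept.append(read)
    return kept


def run_pipeline(data, steps):
    """Run each step in order, passing the output of one step to the next."""
    return reduce(lambda current, step: step(current), steps, data)


raw_reads = ["ACGTACGTNN", "ACGTACGT", "ACGTACGT", "ACGNN", "TTGACCA"]
result = run_pipeline(raw_reads, [quality_trim, normalise])
print(result)
```

Adding, removing or reordering a step only means changing the list passed to run_pipeline, which is the point of linking the steps rather than running them by hand.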

Where do I start?

What steps or processes do you always do with new data?

Every bioinformatician will need to build a pipeline at some point

Raw sequencing files → FastQC → Trimmomatic → FastQC
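The FastQC, Trimmomatic, FastQC sequence can be sketched in Python as a function that builds the three command lines for one file. This is only an illustration under assumptions: it assumes a `trimmomatic` wrapper command is on the PATH (as a conda install provides; otherwise Trimmomatic is run with `java -jar`), single-end reads, and trimming settings picked as examples rather than recommendations. The file and directory names are hypothetical. The commands are printed, not run.

```python
import shlex
from pathlib import Path


def qc_trim_qc_commands(raw_fastq, out_dir):
    """Build (not run) the commands for FastQC -> Trimmomatic -> FastQC on one file."""
    raw = Path(raw_fastq)
    out = Path(out_dir)
    trimmed = out / f"trimmed_{raw.name}"
    return [
        # Quality report on the raw reads.
        ["fastqc", "-o", str(out / "fastqc_raw"), str(raw)],
        # Single-end mode; the trimming steps and thresholds are illustrative only.
        ["trimmomatic", "SE", str(raw), str(trimmed), "SLIDINGWINDOW:4:20", "MINLEN:36"],
        # Quality report on the trimmed reads, to compare with the first one.
        ["fastqc", "-o", str(out / "fastqc_trimmed"), str(trimmed)],
    ]


commands = qc_trim_qc_commands("sample1.fastq.gz", "results")
for cmd in commands:
    print(shlex.join(cmd))
```

Once the tools are installed and the output directories exist, each list could be passed to subprocess.run(cmd, check=True) so the script stops at the first step that fails.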

Benefits of building your pipeline

Why bother with all that effort?

Reproducibility

• Re-use your pipeline when you get new data

• Results from any new data will be produced in exactly the same way

• Share your pipeline with others (e.g. workflows)

Time-saver for later

• You’ve already done the thinking and research

• You know what the best tools are

• All the software is already installed and working

What steps do I need in my bioinformatics pipeline?

It depends on your dataset and your question

It depends…. ??

Building a pipeline is just like this!

You have a goal to achieve, and a set of tools available to you.

Building your Bioinformatics Pipeline

How have other people done this?

• Literature review

• Talk to others who have done this kind of work before

• Will those approaches work for me?

What resources do I have or need?

• Computing

• Skills and Knowledge

• Time

Save yourself some time and pain by learning from other people’s mistakes!

The first step in any new adventure is gathering information.

Ten Rules for Building a Bioinformatics pipeline

1. Begin at the end

2. Don’t reinvent the wheel

3. Keep it simple

4. Be organised

5. Be prepared to fail (a lot!) before you succeed

6. Step back and check

7. Persevere

8. Be flexible

9. Think logically

10. Ask for help

Preventing problems

The first 4 rules are to prevent problems from occurring while building your bioinformatics pipeline

1) Begin at the end

This helps you to know what you’re working towards, and gives you a clear goal to achieve.

What will your finished product look like?

• How many transcripts are expected?

• What sizes should the transcripts be?

• Should the transcriptome be annotated?

• Who will use it?

How will you know if it is any good?

• What metrics should I use to assess my transcriptome?

• What transcripts should I expect to have definitely been assembled? (eg: housekeeping genes)

• Do my reads map to it?

• Is there a software tool I can use to assess quality?
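Questions like "how many transcripts?" and "what sizes?" can be answered with a few lines of Python. The sketch below counts sequences in FASTA-formatted text and reports the shortest, longest and N50 lengths. N50 is a commonly reported assembly statistic chosen here as an example, not necessarily the metric used in this project, and the four sequences are made-up toy data.

```python
def read_fasta_lengths(lines):
    """Return the length of each sequence in FASTA-formatted lines."""
    lengths = []
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Length L such that sequences of length >= L hold at least half the total bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


fasta = """>t1
ACGTACGTAC
>t2
ACGTAC
ACGT
>t3
ACG
>t4
A
""".splitlines()

lengths = read_fasta_lengths(fasta)
summary = {"transcripts": len(lengths), "shortest": min(lengths),
           "longest": max(lengths), "n50": n50(lengths)}
print(summary)
```

Numbers like these only describe the assembly. Whether they are "good" still depends on what you expected at the start, which is why the expectations come first.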

2) Don’t reinvent the wheel

No need for yet-another-method.

Who else has already done this?

Can I use their methods directly?

Do I need to adapt it to suit my data?

3) Keep it simple

Use only what you need to, no more.

What tools are essential?

What steps can I scrap?

4) Be organised

Have some structure to how you keep your files organised and named.

Keep your original files unaltered

• Use a system to name and store your files

• Keep good notes and records

• Back up your files regularly

Final.fasta

Final-final.fasta

This-time-I-mean-it-final.fasta
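One way to avoid the "final-final" trap, sketched in Python under assumptions: build output names from a fixed pattern, and record a checksum of each original file so you can confirm later that it is still unaltered. The naming pattern, project name and file contents below are hypothetical examples, not the scheme used in this project.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def output_name(project, step, version, ext="fasta"):
    """Build a predictable name such as ycs_assembly_v03.fasta (hypothetical scheme)."""
    return f"{project}_{step}_v{version:02d}.{ext}"


with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / "raw_reads.fastq"
    original.write_text("@read1\nACGT\n+\nIIII\n")
    recorded = sha256_of(original)  # note this down when the data arrives
    # ... later, after running analyses ...
    unchanged = sha256_of(original) == recorded
    name = output_name("ycs", "assembly", 3)

print(name, unchanged)
```

Zero-padded version numbers sort correctly in a file listing, and the recorded checksums can live in the same notes file as the rest of your records.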

Solving problems

The next 6 rules are to help you solve the problems you’ll face when building your bioinformatics pipeline

5) Be prepared to fail (a lot!) before you succeed

Nothing personal, just saying.

Why didn’t that work??!

What can I change to fix it?

• Check for simple errors first, like the wrong path to a file or a spelling mistake

• Don’t waste your energy getting angry at your computer, it won’t care

6) Step back and check

Try a different perspective

Is this the best way to do this?

Is there a better way?

With the end in mind, consider if you’re still going in the right direction.

7) Persevere

If this is the best way, then keep going.

Take a break if you need to, sleep on it or tackle another task for a while.

But keep coming back to the problem.

There is a solution, you just have to find it.

8) Be flexible

If there’s a better way, take it.

Keep the destination in mind, but be flexible in how you get there.

Use your judgement

Don’t be afraid to change direction if you need to

9) Think logically

Software and computing problems are logical, so think logically to solve them.

Try a systematic approach

• Check the software

• Check your script

• Check your files

• Check the permissions
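Part of this checklist can be automated. The Python sketch below runs a "preflight" check before a pipeline starts: is each tool on the PATH, does each input file exist and can it be read, and can the output directory be written to? The function name and messages are illustrative assumptions, and checking the script itself still has to be done by eye.

```python
import os
import shutil
import tempfile


def preflight(tools, input_files, output_dir):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    # Check the software.
    for tool in tools:
        if shutil.which(tool) is None:
            problems.append(f"software not found on PATH: {tool}")
    # Check your files, then their permissions.
    for path in input_files:
        if not os.path.isfile(path):
            problems.append(f"input file missing: {path}")
        elif not os.access(path, os.R_OK):
            problems.append(f"input file not readable: {path}")
    if not os.path.isdir(output_dir):
        problems.append(f"output directory missing: {output_dir}")
    elif not os.access(output_dir, os.W_OK):
        problems.append(f"output directory not writable: {output_dir}")
    return problems


with tempfile.TemporaryDirectory() as tmp:
    reads = os.path.join(tmp, "reads.fastq")
    open(reads, "w").close()
    missing = os.path.join(tmp, "typo_in_name.fastq")
    ok = preflight([], [reads], tmp)
    bad = preflight(["no-such-tool-xyz"], [missing], tmp)

print(ok)
print(bad)
```

Running a check like this first catches the simple errors, such as a wrong path or a spelling mistake, before hours of compute time are spent.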

10) Ask for help

Someone, somewhere, has likely already faced this problem and solved it.

Ask them how they did it.

What help is available?

• Try pasting your error message directly into your favourite search engine.

• Software Google groups and forums, Biostars, Stack Overflow

• People you know, in your lab or field

• Collaborate with someone

• READMEs, user manuals and vignettes

Pay it forward

• If you find a solution, upload it and help someone else

My Bioinformatics Pipeline

• Raw Reads: sugarcane RNA sequences from leaf. Total samples: 70

• Trimmed Reads using Trimmomatic

• Normalised Reads using Trinity InSilico normalisation

• Concatenated Left and Right Reads

• Normalised reads into four assembly software tools

• Velvet assembly: cluster and merge 12 assemblies with different kmers using Oases

• Trinity assembly: 1 assembly made with single kmer

• SoapTrans denovo assembly: cluster and merge 8 assemblies with different kmers using CD-HIT-EST

• Soap2 denovo assembly: cluster and merge 8 assemblies with different kmers using CD-HIT-EST

• Clean up the assemblies output from each tool through the EvidentialGene pipeline: single Velvet assembly, single Trinity assembly, single Soap2 assembly, single SoapTrans assembly

• Cluster and merge all contigs at 100% similarity using CD-HIT-EST

• Reassemble the contigs using Trinity, then annotate the transcripts using Blast2GO

• Clean up final assembly using EvidentialGene. Check quality with BUSCO, TransRate and Python scripts.

• Sugarcane Leaf Transcriptome

Purpose of building the pipeline

Sugarcane yellow canopy syndrome (YCS) symptoms include

• leaf yellowing

• abnormal amount of sugar and starch in the leaf

Currently, the cause of YCS is unknown.

RNAseq and Differential Expression

• Which molecular mechanisms are involved?

• Any enrichment for biological agents?

• What can the changes in gene expression tell us about YCS?

Research Problem

• There was no reference genome or transcriptome for sugarcane to map our reads to.

• We had to start by assembling the transcriptome, de novo, from our RNAseq reads.

Collaborators

• Prof Frikkie Botha

• Annelie Marquardt, Sugar Research Australia

• Gerard Scalia

• A/Prof Mikael Boden, University of Queensland

• A/Prof Kate Hertweck, University of Texas at Tyler

Acknowledgement and Thanks

Research funding from Sugar Research Australia and the Queensland government.

Thanks to UQ RCC’s Tinaroo and FlashLite HPC servers, and to TACC’s Stampede HPC cluster for compute resources.

Grateful thanks to Mikael Boden’s lab group for including me, sharing their expertise, and helping me develop my bioinformatics skills.

Thank You

Kate Wathen-Dunn

(07) 3331 3333

[email protected]

www.sugarresearch.com.au
