Post on 16-Jan-2016


Ensembl use of RNASeq

Steve Searle


Ways we use RNASeq data in Ensembl:

• Build complete gene set from scratch for individual or pooled RNASeq data sets

• Incorporate into a new Ensembl gene set

• Add novel models into a gene set

• UTR

• Filtering Models

• Improve old gene sets

Introduction


RNASeq pipeline: Building genes from RNASeq


RNASeq Pipeline: Alignment and Initial Processing

• Reads are aligned to the genome with a quick ungapped alignment using BWA

• Transcriptome reads are split over introns, so we need to allow for this:

    • Align with up to 50% mismatches so that intron-spanning reads still align

• The alignments are then processed to collapse overlapping reads into blocks representing exons

• Read pairing is used (if available) to group the exon blocks into approximate transcript structures
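
The collapsing and pairing steps can be illustrated with a minimal Python sketch. It is not the Ensembl pipeline code: the coordinate convention (1-based, inclusive, one strand of one chromosome), the function names, and the rule of linking blocks through the start position of each mate are all assumptions made for illustration.

```python
from typing import Dict, Iterable, List, Optional, Tuple

Interval = Tuple[int, int]  # (start, end), 1-based inclusive (assumed convention)


def collapse_reads(reads: Iterable[Interval]) -> List[Interval]:
    """Merge overlapping read alignments into exon-like blocks."""
    blocks: List[Interval] = []
    for start, end in sorted(reads):
        if blocks and start <= blocks[-1][1]:
            # The read overlaps the current block, so extend the block.
            blocks[-1] = (blocks[-1][0], max(blocks[-1][1], end))
        else:
            blocks.append((start, end))
    return blocks


def group_blocks_by_pairs(
    blocks: List[Interval], pairs: Iterable[Tuple[Interval, Interval]]
) -> List[List[Interval]]:
    """Group exon blocks that are linked by at least one read pair."""
    parent = list(range(len(blocks)))

    def find(i: int) -> int:
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def block_of(position: int) -> Optional[int]:
        for index, (start, end) in enumerate(blocks):
            if start <= position <= end:
                return index
        return None

    for mate1, mate2 in pairs:
        a, b = block_of(mate1[0]), block_of(mate2[0])
        if a is not None and b is not None:
            parent[find(a)] = find(b)

    groups: Dict[int, List[Interval]] = {}
    for index, block in enumerate(blocks):
        groups.setdefault(find(index), []).append(block)
    return sorted(groups.values())


reads = [(100, 150), (120, 175), (170, 220), (500, 550), (540, 590), (900, 950)]
pairs = [((120, 175), (500, 550))]  # mates falling in two different blocks
blocks = collapse_reads(reads)
print(blocks)
print(group_blocks_by_pairs(blocks, pairs))
```

Each group returned by the second function corresponds to one approximate transcript structure; blocks with no linking pair stay on their own.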


RNASeq Pipeline: Intron Alignment

We align split reads using Exonerate, which has a good splice model but is not a short-read aligner

Intron alignment is made faster in 2 ways:

• Don’t realign all the reads:

    • Introns are resolved by realigning partially aligned reads.

    • Use Exonerate word length to define which reads to realign

• Align to a single transcript:

    • Reads are realigned either to the rough transcript sequence or to the genomic span of the rough transcript.

    • Limiting the search space allows us to do a more sophisticated Exonerate alignment with a splice model and a shorter word length.

    • Aligning to the genomic span of the transcript can identify small exons that were missed by the BWA alignment and that can be incorporated into the final model.
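
The two realignment targets named above differ only in whether intronic sequence is included. The following Python sketch shows how each could be cut out of a genome sequence given the exon blocks of a rough transcript. The function names, the in-memory genome string and the 1-based inclusive coordinates are assumptions for illustration, not part of the Ensembl code.

```python
from typing import List, Tuple

Block = Tuple[int, int]  # (start, end), 1-based inclusive (assumed convention)


def rough_transcript_sequence(genome: str, exons: List[Block]) -> str:
    """Concatenate the exon blocks only: introns are spliced out."""
    return "".join(genome[start - 1:end] for start, end in sorted(exons))


def genomic_span_sequence(genome: str, exons: List[Block]) -> str:
    """Take everything from the first exon start to the last exon end."""
    span_start = min(start for start, _ in exons)
    span_end = max(end for _, end in exons)
    return genome[span_start - 1:span_end]


# Toy genome: two exon blocks separated by a 6 bp "intron".
genome = "AAACCCGGGTTTAAACCC"
exons = [(4, 6), (13, 15)]
print(rough_transcript_sequence(genome, exons))
print(genomic_span_sequence(genome, exons))
```

Only the genomic span target still contains the intronic sequence, which is why it is the one that can reveal a small exon missed by the first alignment.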


[Figure: Exonerate spliced alignment. Labels: Partially aligned reads, Split reads, Collapsed Intron Features, Final Models]
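
Collapsing split reads into intron features amounts to counting how many reads support each distinct intron. A minimal Python sketch follows; the tuple layout and the `min_support` threshold are assumptions introduced here, not values taken from the pipeline.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# (seq_region, intron_start, intron_end, strand); layout assumed for illustration
Intron = Tuple[str, int, int, str]


def collapse_introns(
    split_reads: Iterable[Intron], min_support: int = 1
) -> List[Tuple[Intron, int]]:
    """Collapse identical introns from split reads into (intron, read_count) features."""
    counts = Counter(split_reads)
    return sorted(
        (intron, depth) for intron, depth in counts.items() if depth >= min_support
    )


split_reads = (
    [("1", 1000, 2000, "+")] * 3
    + [("1", 1000, 2050, "+")]
    + [("1", 3000, 4000, "+")] * 2
)
for intron, depth in collapse_introns(split_reads, min_support=2):
    print(intron, depth)
```

The read count per intron gives a simple measure of support that can be used when choosing between alternative splice sites.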


[Figure labels: BLASTP, Coverage, (PE12)]


Website Display of RNASeq pipeline results

Data visible in Ensembl

Transcript models

Intron features

BAM files of BWA alignments


Human gene ZMPSTE24

RNASeq introns by tissue

RNASeq models by tissue & merged

CCDS

GENCODE transcript


Nile tilapia: BAM files



RNASeq Volume

We are collecting more and more RNASeq

We now have sizeable RNASeq sets for 12+ species

Pipeline is now being used in production

Further automation has allowed us to speed up model building:

• Process spreadsheet data to automate the pipeline setup and configuration

• Parse metadata out of spreadsheets into the final BAM files
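
As a rough illustration of the metadata step, the Python sketch below turns spreadsheet rows (exported as CSV) into `@RG` read-group header lines of the kind a BAM header can carry. The column names `sample_id`, `species` and `tissue` are hypothetical, and the choice of the `@RG` record with `ID`, `SM` and `DS` tags is an assumption; the slides do not say which header fields the pipeline writes.

```python
import csv
import io
from typing import List


def read_group_lines(spreadsheet_csv: str) -> List[str]:
    """Build one SAM '@RG' header line per spreadsheet row."""
    lines = []
    for row in csv.DictReader(io.StringIO(spreadsheet_csv)):
        fields = [
            ("ID", row["sample_id"]),  # read group identifier
            ("SM", row["sample_id"]),  # sample name
            ("DS", f"tissue={row['tissue']};species={row['species']}"),  # free-text description
        ]
        lines.append("@RG\t" + "\t".join(f"{tag}:{value}" for tag, value in fields))
    return lines


# Hypothetical spreadsheet export; the column names are assumptions.
spreadsheet = "sample_id,species,tissue\nS1,Nile tilapia,liver\nS2,Nile tilapia,brain\n"
for line in read_group_lines(spreadsheet):
    print(line)
```

Keeping the metadata in the header means each BAM file stays self-describing after it leaves the pipeline.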


Using RNASeq in the Ensembl genebuild pipeline


Some species have little species-specific data, e.g. Nile tilapia:

• 131 proteins in UniProt

• 35 cDNAs, 119,531 ESTs

• Rely on data from related species

RNASeq supplements the above data:

• Species-specific

• Fills gaps, alternate splice sites, faster genebuild


Genebuild process

[Flow diagram labels: Raw Computes, Targeted stage, Similarity stage, cDNAs/ESTs, UTR addition, Final gene set, Filtering, Filtering, TranscriptConsensus, LayerAnnotation, Annotation Projection (primates)]


Genebuild process

[Flow diagram labels: Raw Computes, Targeted stage, Similarity stage, cDNAs/ESTs, UTR addition, Final gene set, Filtering, Filtering, Merged RNA-Seq models, Annotation Projection (primates)]


RNASeq helps with: 1. Choice of splice site

RNASeq

Similarity models

Ensembl model


RNASeq helps with: 2. UTR addition

RNASeq model

Similarity model

Ensembl model


RNASeq helps with: 3. New models

RNASeq introns

RNASeq model

Similarity model

Ensembl model


Species with RNASeq used in generating Ensembl gene set

Released:

• Zebrafish

• Tasmanian Devil

• Coelacanth

• Tilapia

In progress:

Dog, Turtle, Rat, Cat, Chicken, Platyfish

RNASeq is becoming a central part of the genebuild process, and many species' gene sets will include RNASeq components going forward


Gene set update pipeline using RNASeq


1. RNA-Seq

    • The RNA-Seq pipeline is highly automated; many species take around a week to process

2. Split core gene set into single transcript genes

3. Transcript scoring / filtering

    • UTR addition done at the same time

4. Layering

    • avoiding pseudogenes

    • gap filling with fragments

5. Rebuild core set

6. Transfer pseudogenes + ncRNAs

The gene set update pipeline is fast and uses existing code in a novel way with very few alterations
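
Steps 2 and 4 can be illustrated with a simplified Python sketch. It is not the Ensembl code: the data layout (a gene ID mapped to a list of transcript spans), the new ID scheme and the overlap rule used for layering are assumptions. The layering function only conveys the general idea that models from lower-priority layers are kept where they fill gaps left by higher-priority layers.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int]  # (start, end) of a transcript model, 1-based inclusive


def split_into_single_transcript_genes(
    genes: Dict[str, List[Span]]
) -> Dict[str, List[Span]]:
    """Step 2: give every transcript its own single-transcript gene."""
    singles: Dict[str, List[Span]] = {}
    for gene_id, transcripts in genes.items():
        for number, transcript in enumerate(transcripts, start=1):
            singles[f"{gene_id}_{number}"] = [transcript]  # hypothetical ID scheme
    return singles


def overlaps(a: Span, b: Span) -> bool:
    return a[0] <= b[1] and b[0] <= a[1]


def layer_models(layers: List[List[Span]]) -> List[Span]:
    """Step 4 (simplified): keep a model only if it overlaps nothing from a higher layer."""
    kept: List[Span] = []
    for layer in layers:  # highest-priority layer first
        accepted = [m for m in layer if not any(overlaps(m, k) for k in kept)]
        kept.extend(accepted)
    return sorted(kept)


core = {"GENE1": [(100, 500), (120, 480)], "GENE2": [(1000, 1500)]}
print(split_into_single_transcript_genes(core))

# Layer 1: high-confidence models; layer 2: fragments used only for gap filling.
print(layer_models([[(100, 500)], [(400, 800), (1000, 1500)]]))
```

In the second call the fragment at (400, 800) is dropped because it overlaps a higher-priority model, while (1000, 1500) is kept because it fills a gap.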


RNASeq model

Ensembl models

RNASeq Introns

Filter and add UTRs


Add ‘UTR’

Extend CDS

RNASeq models

Ensembl models

RNASeq Introns


Results

                Monodelphis              Platypus
                Genes    Transcripts     Genes    Transcripts
before merge    19,466   32,541          17,951   26,836
after merge     21,324   22,307          21,695   23,581
joined genes    132                      204


Gene set update pipeline: Summary

Quick, straightforward method of tidying up gene sets

Add species-specific models into gene sets that were previously mostly based on proteins from other species

Much more efficient than a new genebuild

Future work:

Lots of other species we could apply this to

See what effect it has on primates / projection builds - in progress


Ensembl Use of NHPRT data

Primates in Ensembl currently: Chimp, Gorilla, Rhesus macaque, Marmoset, Mouse lemur*, Squirrel monkey+, Baboon+, Orangutan, Gibbon, Tarsier* (+ = Pre!, * = 2x)

Run RNASeq pipeline on NHPRT primates in Ensembl to generate:

• Transcript models

• Introns

• BAM files of alignments

(would like individual tissue RNASeq data for this)

Use NHPRT RNASeq in Ensembl gene builds on new species, e.g. Baboon

Use NHPRT RNASeq to improve existing Ensembl gene sets, e.g. Rhesus macaque

Consider other uses:

• Targeted improvement of models for ‘important’ genes (disease related)

• Long non-coding genes

• Alignment to human


Steve Searle

Bronwen Aken

Daniel Barrell

Susan Fairley

Carlos Garcia Giron

Thibaut Hourlier

Andreas Kahari

Rishi Nag

Magali Ruffier

Amy Tang

Jan-Hinnerk Vogel

Amonida Zadissa

Acknowledgements

John E Collins

Stephen Keenan

Henrik Kaessmann

Jessica Alfoldi

Illumina (Human Body Map data)