
Post on 20-Aug-2020


Assembly, annotation, quantification

From reads to transcripts

De novo assembly

Genome based    Genome guided de novo

Reads

Putative transcripts

Mapping against genome

Read clusters

Transcript reconstruction

•  Now we know which exons are expressed, but:

– A gene can have more than one transcript (isoforms)

– With short read sequencing, no read spans the complete transcript

– Which isoforms were present in the sample?

StringTie

Pertea, et al. (2015). "StringTie enables improved reconstruction of a transcriptome from RNA-seq reads." Nat Biotech 33(3): 290-295.

As the reads get longer…

Mapping: Hisat2

Transcript annotation: GTF file

Transcript identification: StringTie

Raw reads: FASTQ file

Genome sequence: FASTA file

Genome annotation: GFF/GTF file

Aligned reads: BAM file

Question

StringTie can use an annotated genome, or predict transcripts de novo using just the genome sequence. When would the last option be useful?

De novo assembly

Without a reference genome

Tools: Trinity, Velvet/Oases, Trans-ABySS

Different from de novo genome assembly…

Splice variants

Read coverage depends on expression

k-mer

Read: GTAGAGCTGT

5-mers: GTAGA TAGAG AGAGC GAGCT AGCTG GCTGT
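
The decomposition of a read into k-mers can be sketched in a few lines of Python. This is only an illustration of the sliding window behind the example above; the function name `kmers` is made up for this sketch and is not taken from any assembler.

```python
def kmers(read, k):
    """Return every substring of length k, sliding one base at a time."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


# The read from the slide, cut into 5-mers.
print(" ".join(kmers("GTAGAGCTGT", 5)))
# GTAGA TAGAG AGAGC GAGCT AGCTG GCTGT
```

A read of length L yields L - k + 1 k-mers, so the 10 nt read above gives six 5-mers.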


Trinity

Effect of k-mer size

Two reads can be merged when they have a minimum overlap of k

Larger k-mers work better for highly expressed genes and have higher accuracy

Shorter k-mers work better for lowly expressed genes but can lead to lower accuracy

[Figure: scale from low to high, marking where a given k-mer size works and where it will not work]
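
To make the "minimum overlap of k" rule concrete, here is a minimal sketch that merges two reads only when a suffix of the first matches a prefix of the second over at least k bases. It is a toy suffix/prefix merge, not how Trinity or any other assembler actually builds its graph; `merge_reads` and the second read are invented for the illustration.

```python
def merge_reads(left, right, k):
    """Merge right onto left if they overlap by at least k bases, else return None."""
    if k < 1:
        raise ValueError("k must be at least 1")
    longest = min(len(left), len(right))
    # Try the longest possible overlap first, stop at the minimum k.
    for overlap in range(longest, k - 1, -1):
        if left[-overlap:] == right[:overlap]:
            return left + right[overlap:]
    return None


read_a = "GTAGAGCTGT"
read_b = "GCTGTACCA"   # hypothetical read sharing 5 bases with read_a

print(merge_reads(read_a, read_b, 5))  # GTAGAGCTGTACCA
print(merge_reads(read_a, read_b, 6))  # None
```

With k = 5 the two reads join, with k = 6 they do not. Under this toy rule a larger k demands longer overlaps, which is consistent with larger k-mers needing the deeper coverage of highly expressed genes.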

Annotation

As the reads get longer…

Mapping: Hisat2

Transcriptome: FASTA file

Transcript annotation: GTF file

Transcript identification: StringTie

Raw reads: FASTQ file

Assembly: Trinity

Genome sequence: FASTA file

Genome annotation: GFF/GTF file

Aligned reads: BAM file

Questions

How would a de novo assembly be affected by bad quality reads?

How could you 'repair' transcripts that were assembled in fragments?

How would you identify a read's mRNA if you only have the transcriptome?

Quantification

Map reads to the genome

Exon/intron structure

Alternative splicing: what were the transcripts?

Map reads to the transcriptome

Different isoforms

Normalization

Number of mapped reads per mRNA depends on:

The abundance of the mRNA

The total number of reads

The length of the mRNA

Technical "noise"


RPKM, FPKM, TPM, CPM

Count the reads mapping to a transcript/gene

Compare the same gene between samples: normalize for the total number of mapped reads - CPM (counts per million)

Compare different genes/isoforms: normalize for the length of the transcript - RPKM, FPKM, TPM (transcripts per million)
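
The two kinds of normalization can be sketched as follows. The counts and transcript lengths are invented purely for illustration, and the functions use the commonly given definitions of CPM and TPM rather than the code of any particular tool.

```python
def cpm(counts):
    """Counts per million: scale each count by the total number of mapped reads."""
    total = sum(counts.values())
    return {gene: n / total * 1e6 for gene, n in counts.items()}


def tpm(counts, lengths_nt):
    """Transcripts per million: divide by length first, then scale rates to sum to 1e6."""
    rates = {gene: counts[gene] / lengths_nt[gene] for gene in counts}
    rate_sum = sum(rates.values())
    return {gene: rate / rate_sum * 1e6 for gene, rate in rates.items()}


# Hypothetical read counts and transcript lengths (nt).
counts = {"geneA": 100, "geneB": 300, "geneC": 600}
lengths_nt = {"geneA": 1000, "geneB": 3000, "geneC": 2000}

for gene in counts:
    print(gene, round(cpm(counts)[gene]), round(tpm(counts, lengths_nt)[gene]))
```

In this made-up table geneB has three times the reads of geneA but is also three times as long, so the two get the same TPM while their CPM values differ. That is the difference between comparing one gene across samples and comparing different genes with each other.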

Correcting for transcript length and total number of reads

$$\mathrm{RPKM} = \frac{10^{9} \times \text{number of reads mapped to a region}}{\text{total reads} \times \text{region length}}$$

Example: all mapped reads = 10,000,000; feature length = 300 nt; RPKM ≈ 1.7

RPKM: Reads Per Kilobase of transcript per Million reads
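
The RPKM formula is easy to check numerically. The sketch below plugs in the 10,000,000 total reads and the 300 nt feature from the example. The number of reads mapped to the region did not survive in the text, so the value of 5 used here is an assumption, chosen because it reproduces the stated RPKM of about 1.7.

```python
def rpkm(reads_in_region, total_reads, region_length_nt):
    """Reads Per Kilobase of transcript per Million mapped reads."""
    return 1e9 * reads_in_region / (total_reads * region_length_nt)


# 5 reads in the region is an assumed value, see the note above.
value = rpkm(reads_in_region=5, total_reads=10_000_000, region_length_nt=300)
print(round(value, 1))  # 1.7
```

The factor of 10^9 combines the two scalings: 10^3 for "per kilobase" and 10^6 for "per million reads".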

Correct for outliers

Sequence coverage bias

Lior Pachter, Genome Informatics keynote, CSHL

Reads mapping to isoforms

Pseudoalignment

"it has been shown that accurate quantification does not require information on where inside transcripts the reads may have originated from, but rather which transcripts could have generated them"

Bray, N. L., H. Pimentel, P. Melsted and L. Pachter (2016). "Near-optimal probabilistic RNA-seq quantification." Nat Biotech 34(5): 525-527.

Pseudoalignment

Mapping reads to a transcriptome based on shared k-mers, only using informative k-mers.

+ Very fast

- Does not detect new transcripts

- Does not find SNPs

Tools: kallisto, salmon

Bray, N. L., H. Pimentel, P. Melsted and L. Pachter (2016). "Near-optimal probabilistic RNA-seq quantification." Nat Biotech 34(5): 525-527.

Patro, R., G. Duggal, M. I. Love, R. A. Irizarry, C. Kingsford (2017). "Salmon provides fast and bias-aware quantification of transcript expression." Nat Meth 14(4): 417-419.
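
The idea of asking which transcripts could have generated a read, rather than where the read aligns, can be sketched with a plain k-mer lookup. This is a deliberately simplified toy: it is not the kallisto or salmon implementation, which use more elaborate indexes and skip uninformative k-mers. The sequences, the names `iso1` and `iso2`, and both functions are invented for the illustration.

```python
from collections import defaultdict


def build_index(transcripts, k):
    """Map every k-mer to the set of transcripts that contain it."""
    index = defaultdict(set)
    for name, seq in transcripts.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return index


def pseudoalign(read, index, k):
    """Intersect the transcript sets of all k-mers of the read found in the index."""
    compatible = None
    for i in range(len(read) - k + 1):
        hits = index.get(read[i:i + k])
        if not hits:
            continue  # k-mer absent from the transcriptome: skipped in this toy version
        compatible = set(hits) if compatible is None else compatible & hits
    return compatible or set()


# Two hypothetical isoforms: iso2 skips the middle exon of iso1.
exon1, exon2, exon3 = "ATGGCGTACG", "TTCAGGCTAA", "CCGATTGACT"
transcripts = {"iso1": exon1 + exon2 + exon3, "iso2": exon1 + exon3}
index = build_index(transcripts, k=5)

print(sorted(pseudoalign("GGCGTAC", index, 5)))   # ['iso1', 'iso2'] (shared exon)
print(sorted(pseudoalign("CAGGCTA", index, 5)))   # ['iso1'] (middle exon)
print(sorted(pseudoalign("TACGCCGA", index, 5)))  # ['iso2'] (exon1-exon3 junction)
```

Reads from the shared exon stay ambiguous between the two isoforms, while reads from the skipped exon or from the new junction identify a single isoform. No alignment coordinates are computed at any point, which is where the speed comes from, and also why new transcripts and SNPs are not detected.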

Pseudoalignment: Kallisto

Bray, N. L., H. Pimentel, P. Melsted and L. Pachter (2016). "Near-optimal probabilistic RNA-seq quantification." Nat Biotech 34(5): 525-527.


Transcriptome: FASTA file

Transcript annotation: GTF file

Counts: csv file

Raw reads: FASTQ file

Genome sequence: FASTA file

Genome annotation: GFF/GTF file

Aligned reads: BAM file

Pseudoalignment: Kallisto

Mapping: Hisat2

Transcriptome: FASTA file

Transcript annotation: GTF file

Counts: csv file

Transcript identification & quantification: StringTie

Raw reads: FASTQ file

Assembly: Trinity

Genome sequence: FASTA file

Genome annotation: GFF/GTF file

Aligned reads: BAM file

Hands on
