Assembly, annotation, quantification
From reads to transcripts
Three routes from reads to transcripts: genome based, genome-guided de novo, de novo assembly
Figure labels: reads, putative transcripts, mapping against genome, read clusters
Transcript reconstruction
• Now we know which exons are expressed, but:
  – A gene can have more than one transcript (isoforms)
  – With short-read sequencing, no read spans the complete transcript
  – Which isoforms were present in the sample?
StringTie
Pertea, et al. (2015). "StringTie enables improved reconstruction of a transcriptome from RNA-seq reads." Nat Biotech 33(3): 290-295.
As the reads get longer…
Workflow (genome based):
Raw reads (FASTQ file) + Genome sequence (FASTA file) → Mapping (HISAT2) → Aligned reads (BAM file)
Aligned reads (BAM file) + Genome annotation (GFF/GTF file) → Transcript identification (StringTie) → Transcript annotation (GTF file)
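The genome-based workflow can be sketched as a list of command lines. This is a minimal sketch, not a tested pipeline: the file names, the `sample` prefix and the helper `genome_based_commands` are hypothetical, and the flags shown (`hisat2 --dta -x -1 -2 -S`, `samtools sort -o`, `stringtie -G -o`) should be checked against the manuals of the installed versions. The function only builds the commands; it does not run them.

```python
import shlex


def genome_based_commands(index_prefix, reads_1, reads_2,
                          reference_gtf=None, sample="sample"):
    """Build (but do not run) the commands for mapping and transcript identification.

    All paths and the sample prefix are placeholders chosen for illustration.
    """
    sam = f"{sample}.sam"
    bam = f"{sample}.sorted.bam"

    # Map paired-end reads to the genome; --dta tailors alignments for assemblers.
    hisat2 = ["hisat2", "--dta", "-x", index_prefix,
              "-1", reads_1, "-2", reads_2, "-S", sam]

    # StringTie expects a coordinate-sorted alignment file.
    sort = ["samtools", "sort", "-o", bam, sam]

    # Without -G, StringTie predicts transcripts from the alignments alone.
    stringtie = ["stringtie", bam, "-o", f"{sample}.gtf"]
    if reference_gtf is not None:
        stringtie += ["-G", reference_gtf]

    return [hisat2, sort, stringtie]


for command in genome_based_commands("genome_index", "reads_1.fastq",
                                     "reads_2.fastq", "annotation.gtf"):
    print(shlex.join(command))
```

Leaving out `reference_gtf` gives the variant in which transcripts are predicted using only the genome sequence.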
Question
StringTie can use an annotated genome, or predict transcripts de novo using just the genome sequence. When would the last option be useful?
De novo assembly
Without a reference genome
Tools: Trinity, Velvet/Oases, Trans-ABySS
Different from de novo genome assembly…
• Splice variants
• Read coverage depends on expression
k-mer
Read:   GTAGAGCTGT
5-mers: GTAGA TAGAG AGAGC GAGCT AGCTG GCTGT
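The decomposition of a read into k-mers can be reproduced in a few lines. This is a small illustration using the read from the slide; the function name `kmers` is chosen here for the example.

```python
def kmers(read, k):
    """Return all overlapping substrings of length k, in order of position."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


print(" ".join(kmers("GTAGAGCTGT", 5)))
# → GTAGA TAGAG AGAGC GAGCT AGCTG GCTGT
```

A read of length L yields L - k + 1 k-mers, so the 10-base read gives six 5-mers.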
Trinity
Effect of k-mer size
Two reads can be merged when they have a minimum overlap of k.
• Larger k-mers work better for highly expressed genes and give higher accuracy
• Shorter k-mers work better for lowly expressed genes but can lead to lower accuracy
Figure labels: expression from low to high; k-mer size works / k-mer size will not work
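The rule that two reads are merged only when they overlap by at least k bases can be illustrated with a toy function. This is a deliberate simplification: assemblers such as Trinity work on k-mer graphs rather than on pairwise read merging, and the second read below is invented for the example.

```python
def merge_reads(left, right, k):
    """Merge two reads if a suffix of `left` equals a prefix of `right`
    over at least k bases; return None when the overlap is too short."""
    if k < 1:
        raise ValueError("k must be at least 1")
    longest = min(len(left), len(right))
    # Try the longest possible overlap first, then shrink down to k.
    for size in range(longest, k - 1, -1):
        if left[-size:] == right[:size]:
            return left + right[size:]
    return None


read_a = "GTAGAGCTGT"
read_b = "GCTGTACCA"   # hypothetical read sharing 5 bases with read_a

print(merge_reads(read_a, read_b, 5))  # → GTAGAGCTGTACCA
print(merge_reads(read_a, read_b, 6))  # → None
```

With k = 5 the two reads are joined; with k = 6 the same pair stays separate. A lowly expressed transcript has few reads, so long overlaps between them are rare and a large k leaves it fragmented.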
Annotation
Workflow (genome based and de novo):
Raw reads (FASTQ file) + Genome sequence (FASTA file) → Mapping (HISAT2) → Aligned reads (BAM file)
Aligned reads (BAM file) + Genome annotation (GFF/GTF file) → Transcript identification (StringTie) → Transcript annotation (GTF file)
Raw reads (FASTQ file) → Assembly (Trinity) → Transcriptome (FASTA file)
Questions
• How would a de novo assembly be affected by bad-quality reads?
• How could you 'repair' transcripts that were assembled in fragments?
• How would you identify a read's mRNA if you only have the transcriptome?
Quantification
Map reads to the genome
• Exon/intron structure
• Alternative splicing: what were the transcripts?
Map reads to the transcriptome
• Different isoforms
Normalization
Number of mapped reads per mRNA depends on:
• The abundance of the mRNA
• The total number of reads
• The length of the mRNA
• Technical “noise”
RPKM, FPKM, TPM, CPM
Count the reads mapping to a transcript/gene
• Compare the same gene between samples: normalize for the total number of mapped reads: CPM (counts per million)
• Compare different genes/isoforms: normalize for the length of the transcript: RPKM, FPKM, TPM (transcripts per million)
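The difference between the two kinds of normalization can be made concrete with a small sketch. The counts and transcript lengths below are invented for illustration, and the function names `cpm` and `tpm` are chosen here; the formulas are the usual definitions of counts per million and transcripts per million.

```python
def cpm(counts):
    """Counts per million: scale by the total number of mapped reads only."""
    total = sum(counts)
    return [c * 1e6 / total for c in counts]


def tpm(counts, lengths_nt):
    """Transcripts per million: correct for length first, then scale to one million."""
    # Reads per kilobase for each transcript.
    rates = [c / (length / 1000) for c, length in zip(counts, lengths_nt)]
    scale = sum(rates)
    return [r * 1e6 / scale for r in rates]


counts = [10, 20, 70]            # hypothetical read counts for three transcripts
lengths_nt = [1000, 2000, 1000]  # hypothetical transcript lengths

print(cpm(counts))
print(tpm(counts, lengths_nt))
```

In this example the second transcript has twice the reads of the first but is also twice as long, so CPM reports it as twice as abundant while TPM reports the two as equal.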
Correcting for transcript length and total number of reads
$$\mathrm{RPKM} = \frac{10^{9} \times \text{number of reads mapped to a region}}{\text{total reads} \times \text{region length}}$$
Figure labels: all mapped reads (10,000,000), reads mapped to region, feature length (300 nt); RPKM ≈ 1.7
RPKM: Reads Per Kilobase of transcript per Million reads
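The RPKM formula can be checked with the numbers from the slide. The read count of 5 is an assumption, chosen because it reproduces the stated value of about 1.7 with 10,000,000 mapped reads and a 300 nt feature.

```python
def rpkm(reads_in_region, total_mapped_reads, region_length_nt):
    """Reads per kilobase of transcript per million mapped reads."""
    return 1e9 * reads_in_region / (total_mapped_reads * region_length_nt)


# 5 reads is an assumed count; the other two values are taken from the slide.
value = rpkm(reads_in_region=5, total_mapped_reads=10_000_000, region_length_nt=300)
print(round(value, 1))  # → 1.7
```

The factor 10^9 combines the two scalings: 10^3 for "per kilobase" and 10^6 for "per million reads".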
Correct for outliers
Sequence coverage bias
Lior Pachter, Genome Informatics keynote, CSHL
Reads mapping to isoforms
Pseudoalignment
"it has been shown that accurate quantification does not require information on where inside transcripts the reads may have originated from, but rather which transcripts could have generated them"
Bray, N. L., H. Pimentel, P. Melsted and L. Pachter (2016). "Near-optimal probabilistic RNA-seq quantification." Nat Biotech 34(5): 525-527.
Pseudoalignment
Mapping reads to a transcriptome based on shared k-mers, only using informative k-mers.
+ Very fast
- Does not detect new transcripts
- Does not find SNPs
Tools: kallisto, salmon
Bray, N. L., H. Pimentel, P. Melsted and L. Pachter (2016). "Near-optimal probabilistic RNA-seq quantification." Nat Biotech 34(5): 525-527.
Patro, R., G. Duggal, M. I. Love, R. A. Irizarry, C. Kingsford (2017). "Salmon provides fast and bias-aware quantification of transcript expression." Nat Meth 14(4): 417-419.
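The idea of finding which transcripts could have generated a read, without locating the read inside them, can be shown with a toy k-mer index. This is only a sketch of the principle and not how kallisto or salmon are implemented: it checks every k-mer instead of only the informative ones, it skips k-mers that are absent from the index, and the two isoform sequences are invented.

```python
from collections import defaultdict


def build_kmer_index(transcripts, k):
    """Map every k-mer to the set of transcripts that contain it."""
    index = defaultdict(set)
    for name, sequence in transcripts.items():
        for i in range(len(sequence) - k + 1):
            index[sequence[i:i + k]].add(name)
    return index


def pseudoalign(read, index, k):
    """Return the transcripts compatible with all indexed k-mers of the read."""
    compatible = None
    for i in range(len(read) - k + 1):
        hits = index.get(read[i:i + k])
        if not hits:
            continue  # toy simplification: ignore k-mers not in the transcriptome
        compatible = set(hits) if compatible is None else compatible & hits
    return compatible or set()


# Two hypothetical isoforms sharing their first nine bases.
transcripts = {
    "isoform_A": "ATGGCGTACGTTAGC",
    "isoform_B": "ATGGCGTACCCGGAT",
}
index = build_kmer_index(transcripts, k=5)

print(sorted(pseudoalign("ATGGCGTA", index, 5)))   # → ['isoform_A', 'isoform_B']
print(sorted(pseudoalign("GTACGTTAG", index, 5)))  # → ['isoform_A']
```

A read from the shared region is compatible with both isoforms, while a read crossing into the isoform-specific part narrows the set to one. No base-level alignment is computed, which is why the approach is fast but cannot report SNPs or new transcripts.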
Pseudoalignment: kallisto
Workflow (complete):
Raw reads (FASTQ file) + Genome sequence (FASTA file) → Mapping (HISAT2) → Aligned reads (BAM file)
Aligned reads (BAM file) + Genome annotation (GFF/GTF file) → Transcript identification & quantification (StringTie) → Transcript annotation (GTF file), counts (CSV file)
Raw reads (FASTQ file) → Assembly (Trinity) → Transcriptome (FASTA file)
Raw reads (FASTQ file) + Transcriptome (FASTA file) → Pseudoalignment (kallisto) → counts (CSV file)
Hands-on