Assembly, annotation, quantification

Post on 20-Aug-2020

Page 1: Assembly, annotation, quantification Assembly.pdf · annotation GTF file Transcript identification StringTie Raw reads FASTQ file Genome sequence FASTA file Genome annotation GFF/GTF

Assembly, annotation, quantification

Page 2:

From reads to transcripts

Three approaches: de novo assembly, genome-based, genome-guided de novo

[Diagram labels: reads; putative transcripts; mapping against genome; read clusters]

Page 3:

Transcript reconstruction

•  Now we know which exons are expressed, but:
– A gene can have more than one transcript (isoforms)
– With short-read sequencing, no read spans the complete transcript
– Which isoforms were present in the sample?

Page 4:

StringTie

Pertea, et al. (2015). "StringTie enables improved reconstruction of a transcriptome from RNA-seq reads." Nat Biotech 33(3): 290-295.

Page 5:

As the reads get longer…

Page 6:

Workflow diagram (box label: tool or file type)

•  Raw reads: FASTQ file
•  Genome sequence: FASTA file
•  Genome annotation: GFF/GTF file
•  Mapping: Hisat2
•  Aligned reads: BAM file
•  Transcript identification: StringTie
•  Transcript annotation: GTF file
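
The workflow on this slide can be sketched as a short sequence of command lines. This is a minimal sketch, not the course's own script: it assumes single-end reads and a prebuilt HISAT2 index, every file name is a hypothetical placeholder, and the samtools sorting step is an addition (StringTie expects coordinate-sorted alignments, which the slide does not show). The function only builds the commands; nothing is executed.

```python
import shlex


def build_commands(index_prefix, reads_fastq, annotation_gtf, sample="sample"):
    """Return argument lists for a HISAT2 -> samtools -> StringTie run.

    All file names are hypothetical placeholders chosen for illustration.
    """
    sam = f"{sample}.sam"
    bam = f"{sample}.sorted.bam"
    out_gtf = f"{sample}.transcripts.gtf"
    return [
        # Map raw reads (FASTQ) against the indexed genome sequence.
        ["hisat2", "--dta", "-x", index_prefix, "-U", reads_fastq, "-S", sam],
        # Sort the alignments by coordinate and write them as BAM.
        ["samtools", "sort", "-o", bam, sam],
        # Identify transcripts, guided by the genome annotation.
        ["stringtie", bam, "-G", annotation_gtf, "-o", out_gtf],
    ]


for cmd in build_commands("genome_index", "reads.fastq", "annotation.gtf"):
    print(shlex.join(cmd))
```

To run the steps for real, each argument list could be passed to `subprocess.run(cmd, check=True)`. Dropping the `-G` option from the StringTie call gives the annotation-free mode.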

Page 7:

Question

StringTie can use an annotated genome, or predict transcripts de novo using just the genome sequence. When would the latter option be useful?

Page 8:

De novo assembly

Without a reference genome
Tools: Trinity, Velvet/Oases, Trans-ABySS

Different from de novo genome assembly…
•  Splice variants
•  Read coverage depends on expression

Page 10:

k-mer

Read: GTAGAGCTGT
5-mers: GTAGA TAGAG AGAGC GAGCT AGCTG GCTGT
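
The decomposition on this slide can be reproduced in a few lines. The function name `kmers` is chosen here for illustration; the read and the value of k are the ones shown above.

```python
def kmers(read, k):
    """Return all overlapping substrings of length k, from left to right."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


print(kmers("GTAGAGCTGT", 5))
# ['GTAGA', 'TAGAG', 'AGAGC', 'GAGCT', 'AGCTG', 'GCTGT']
```

A read of length L yields L - k + 1 k-mers, so the 10-base read above gives six 5-mers.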

Page 16:

Trinity

Page 17:

Effect of k-mer size

•  Two reads can be merged when they have a minimum overlap of k
•  Larger k-mers work better for highly expressed genes and have higher accuracy
•  Shorter k-mers work better for lowly expressed genes but can lead to lower accuracy

[Figure labels: Low, High; "k-mer size works"; "k-mer size will not work"]
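
The merging rule above can be illustrated with a toy function. This is a simplified sketch of the overlap criterion only, assuming exact matches and two reads on the same strand; assemblers such as Trinity work on k-mer graphs rather than merging reads pairwise, and the name `merge_if_overlap` is hypothetical.

```python
def merge_if_overlap(left, right, k):
    """Merge two reads if a suffix of `left` equals a prefix of `right`
    and that overlap is at least k bases long; otherwise return None."""
    if k < 1:
        raise ValueError("k must be at least 1")
    longest = min(len(left), len(right))
    for size in range(longest, k - 1, -1):  # try the longest overlap first
        if left[-size:] == right[:size]:
            return left + right[size:]
    return None


# The two toy reads share a 5-base overlap (GAGCT).
print(merge_if_overlap("GTAGAGCT", "GAGCTGTT", 5))  # GTAGAGCTGTT
print(merge_if_overlap("GTAGAGCT", "GAGCTGTT", 6))  # None
```

With k = 6 the same pair no longer merges, which mirrors the slide's point: when few reads cover a transcript, long overlaps are less likely to exist, so a large k can leave lowly expressed transcripts fragmented.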

Page 18:

Annotation

Page 19:

As the reads get longer…

Page 20:

Workflow diagram (box label: tool or file type)

•  Raw reads: FASTQ file
•  Assembly: Trinity
•  Transcriptome: FASTA file
•  Genome sequence: FASTA file
•  Genome annotation: GFF/GTF file
•  Mapping: Hisat2
•  Aligned reads: BAM file
•  Transcript identification: StringTie
•  Transcript annotation: GTF file

Page 21:

Questions

•  How would a de novo assembly be affected by bad-quality reads?
•  How could you 'repair' transcripts that were assembled in fragments?
•  How would you identify a read's mRNA if you only have the transcriptome?

Page 22:

Quantification

Map reads to the genome
•  Exon/intron structure
•  Alternative splicing: what were the transcripts?

Map reads to the transcriptome
•  Different isoforms

Page 23:

Normalization

Number of mapped reads per mRNA depends on:
•  The abundance of the mRNA
•  The total number of reads
•  The length of the mRNA
•  Technical “noise”
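
These dependencies are often summarized in a simple expected-count model. The formula below is a common simplification given here for illustration, not something taken from the slide; the symbols are assumptions introduced for this sketch: $c_t$ is the number of reads mapped to mRNA $t$, $a_t$ its relative abundance, $\ell_t$ its length, and $N$ the total number of mapped reads.

```latex
\[
\mathbb{E}[c_t] \approx N \cdot \frac{a_t \, \ell_t}{\sum_{s} a_s \, \ell_s}
\]
```

Technical noise makes the observed counts scatter around this expectation. Normalization aims to remove the dependence on $N$ and $\ell_t$ so that what remains reflects abundance.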

Page 25:

RPKM, FPKM, TPM, CPM

Count the reads mapping to a transcript/gene

Compare the same gene between samples: normalize for the total number of mapped reads
•  CPM (counts per million)

Compare different genes/isoforms: normalize for the length of the transcript
•  RPKM, FPKM, TPM (transcripts per million)
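
The two kinds of normalization can be compared on a toy example. This is a minimal sketch using the standard definitions of CPM and TPM; the gene names, counts, and lengths are invented for illustration.

```python
def cpm(counts):
    """Counts per million: scale each count by the total number of mapped reads."""
    total = sum(counts.values())
    return {name: c / total * 1e6 for name, c in counts.items()}


def tpm(counts, lengths_nt):
    """Transcripts per million: divide by length first, then scale to one million."""
    rate = {name: counts[name] / lengths_nt[name] for name in counts}
    scale = sum(rate.values())
    return {name: r / scale * 1e6 for name, r in rate.items()}


counts = {"geneA": 100, "geneB": 100, "geneC": 800}      # toy read counts
lengths = {"geneA": 1000, "geneB": 2000, "geneC": 4000}  # toy lengths in nt

print(cpm(counts))
print(tpm(counts, lengths))
```

geneA and geneB have the same CPM because they have the same read count, but geneA gets twice the TPM of geneB because it produces those reads from a transcript half as long.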

Page 26:

Correcting for transcript length and total number of reads

\[
\mathrm{RPKM} = \frac{10^{9} \times \text{number of reads mapped to a region}}{\text{total reads} \times \text{region length}}
\]

[Figure labels: all mapped reads (10,000,000); reads mapped to region; feature length (300 nt); RPKM ≈ 1.7]

RPKM: Reads Per Kilobase of transcript per Million reads
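
The RPKM formula on this slide translates directly into code. In the sketch below, the 10,000,000 mapped reads and the 300 nt feature length come from the slide; the count of 5 reads on the feature is an assumption chosen for illustration, since the slide's figure is not recoverable from the transcript.

```python
def rpkm(region_reads, total_reads, region_length_nt):
    """RPKM = 1e9 * reads in region / (total mapped reads * region length in nt)."""
    return 1e9 * region_reads / (total_reads * region_length_nt)


# 5 reads on the feature is an assumed value, not taken from the slide.
print(round(rpkm(5, 10_000_000, 300), 2))  # 1.67
```

The factor 10^9 combines the two scalings in the name: 10^3 for "per kilobase" and 10^6 for "per million reads".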

Page 27:

Correct for outliers

Page 28:

Sequence coverage bias

Page 29:

Lior Pachter, Genome Informatics keynote, CSHL

Reads mapping to isoforms

Page 30:

Pseudoalignment

"it has been shown that accurate quantification does not require information on where inside transcripts the reads may have originated from, but rather which transcripts could have generated them"

Bray, N. L., H. Pimentel, P. Melsted and L. Pachter (2016). "Near-optimal probabilistic RNA-seq quantification." Nat Biotech 34(5): 525-527.

Page 31:

Pseudoalignment

Mapping reads to a transcriptome based on shared k-mers, only using informative k-mers.
+ Very fast
- Does not detect new transcripts
- Does not find SNPs
Tools: kallisto, salmon

Bray, N. L., H. Pimentel, P. Melsted and L. Pachter (2016). "Near-optimal probabilistic RNA-seq quantification." Nat Biotech 34(5): 525-527.
Patro, R., G. Duggal, M. I. Love, R. A. Irizarry, C. Kingsford (2017). "Salmon provides fast and bias-aware quantification of transcript expression." Nat Meth 14(4): 417-419.
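
The idea of asking which transcripts could have generated a read, rather than where it aligns, can be shown with a toy k-mer index. This is a simplified sketch and not how kallisto or salmon are implemented: it checks every k-mer of the read and requires exact matches, whereas the real tools use more elaborate index structures and skip uninformative k-mers. The sequences and the names `iso1` and `iso2` are invented; `iso2` is `iso1` with its middle exon removed.

```python
from collections import defaultdict


def build_index(transcripts, k):
    """Map every k-mer to the set of transcript names that contain it."""
    index = defaultdict(set)
    for name, seq in transcripts.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return index


def compatible_transcripts(read, index, k):
    """Intersect the transcript sets of all k-mers in the read.

    The result says which transcripts could have produced the read,
    not where in those transcripts the read lies.
    """
    result = None
    for i in range(len(read) - k + 1):
        hits = index.get(read[i:i + k], set())
        result = set(hits) if result is None else result & hits
    return result or set()


exon_a, exon_b, exon_c = "ATGGCGTACC", "TTGACAGGCA", "CGATTCAGTA"  # toy exons
transcripts = {"iso1": exon_a + exon_b + exon_c, "iso2": exon_a + exon_c}
index = build_index(transcripts, 5)

print(sorted(compatible_transcripts("GGCGTAC", index, 5)))    # ['iso1', 'iso2']
print(sorted(compatible_transcripts("TACCCGATT", index, 5)))  # ['iso2']
print(sorted(compatible_transcripts("TGACAGG", index, 5)))    # ['iso1']
```

The first read lies in an exon shared by both isoforms and stays ambiguous; the second spans the junction that exists only in `iso2`; the third lies in the exon unique to `iso1`. A read with no matching k-mers returns an empty set, which is why this approach cannot discover new transcripts.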

Page 32:

Pseudoalignment: Kallisto

Bray, N. L., H. Pimentel, P. Melsted and L. Pachter (2016). "Near-optimal probabilistic RNA-seq quantification." Nat Biotech 34(5): 525-527.

Page 33:

Pseudoalignment

Bray, N. L., H. Pimentel, P. Melsted and L. Pachter (2016). "Near-optimal probabilistic RNA-seq quantification." Nat Biotech 34(5): 525-527.

Page 34:

Workflow diagram (box label: file type)

•  Raw reads: FASTQ file
•  Transcriptome: FASTA file
•  Genome sequence: FASTA file
•  Genome annotation: GFF/GTF file
•  Aligned reads: BAM file
•  Transcript annotation: GTF file
•  Counts: CSV file

Page 35:

Workflow diagram (box label: tool or file type)

•  Raw reads: FASTQ file
•  Assembly: Trinity
•  Transcriptome: FASTA file
•  Pseudoalignment: Kallisto
•  Genome sequence: FASTA file
•  Genome annotation: GFF/GTF file
•  Mapping: Hisat2
•  Aligned reads: BAM file
•  Transcript identification & quantification: StringTie
•  Transcript annotation: GTF file
•  Counts: CSV file

Page 36:

Hands-on