![Page 1: Capturing the chicken transcriptome with PacBio long read RNA-seq data OR Chicken in awesome sauce: a recipe for new transcript identification Gladstone](https://reader038.vdocuments.net/reader038/viewer/2022102800/56649c745503460f949267b3/html5/thumbnails/1.jpg)
Capturing the chicken transcriptome with PacBio long read RNA-seq data
OR
Chicken in awesome sauce: a recipe for new transcript identification
Gladstone InstitutesPacific Biosciences
Sean Thomas and Alisha Holloway
![Page 2: Capturing the chicken transcriptome with PacBio long read RNA-seq data OR Chicken in awesome sauce: a recipe for new transcript identification Gladstone](https://reader038.vdocuments.net/reader038/viewer/2022102800/56649c745503460f949267b3/html5/thumbnails/2.jpg)
• Overarching goal: understand gene regulation during heart development and why children are born with congenital heart defects
• Accelerate discovery to clinical practice by fostering collaborations of basic, translational and clinical researchers
• www.benchtobassinet.org
Motivation
![Page 3: Capturing the chicken transcriptome with PacBio long read RNA-seq data OR Chicken in awesome sauce: a recipe for new transcript identification Gladstone](https://reader038.vdocuments.net/reader038/viewer/2022102800/56649c745503460f949267b3/html5/thumbnails/3.jpg)
Motivation
Chicken hearts are being used as models of cardiac development
chicken human
![Page 4: Capturing the chicken transcriptome with PacBio long read RNA-seq data OR Chicken in awesome sauce: a recipe for new transcript identification Gladstone](https://reader038.vdocuments.net/reader038/viewer/2022102800/56649c745503460f949267b3/html5/thumbnails/4.jpg)
Motivation
Functional genomics studies of the molecular mechanisms behind cardiac development require solid genome and transcript annotations.
![Page 5: Capturing the chicken transcriptome with PacBio long read RNA-seq data OR Chicken in awesome sauce: a recipe for new transcript identification Gladstone](https://reader038.vdocuments.net/reader038/viewer/2022102800/56649c745503460f949267b3/html5/thumbnails/5.jpg)
Motivation
Poor annotations are common for many model organisms that could be useful for understanding heart development and evolution
![Page 6: Capturing the chicken transcriptome with PacBio long read RNA-seq data OR Chicken in awesome sauce: a recipe for new transcript identification Gladstone](https://reader038.vdocuments.net/reader038/viewer/2022102800/56649c745503460f949267b3/html5/thumbnails/6.jpg)
Motivation
Koshiba-Takeuchi et al. 2009, Nature
Turtle
Chicken
Tbx5expression
![Page 7: Capturing the chicken transcriptome with PacBio long read RNA-seq data OR Chicken in awesome sauce: a recipe for new transcript identification Gladstone](https://reader038.vdocuments.net/reader038/viewer/2022102800/56649c745503460f949267b3/html5/thumbnails/7.jpg)
Motivation
Current best chicken annotations, as of 2012*: Ensembl and refSeq
refSeq annotation contained only 6,459 transcripts, but were well-polished
Ensembl annotation contained ~20k transcripts but with many errors
mouse and chicken have similar genome sizes and numbers of genes, but Ensembl annotation for mouse has ~95k transcripts
*galGal3 assembly
![Page 8: Capturing the chicken transcriptome with PacBio long read RNA-seq data OR Chicken in awesome sauce: a recipe for new transcript identification Gladstone](https://reader038.vdocuments.net/reader038/viewer/2022102800/56649c745503460f949267b3/html5/thumbnails/8.jpg)
Motivation available annotations were unreliable
![Page 9: Capturing the chicken transcriptome with PacBio long read RNA-seq data OR Chicken in awesome sauce: a recipe for new transcript identification Gladstone](https://reader038.vdocuments.net/reader038/viewer/2022102800/56649c745503460f949267b3/html5/thumbnails/9.jpg)
Motivation available annotations were unreliable
conservationEnsembl annotation
non-chicken refSeq
RNA-seq data
Znf503, a likely regulator of heart development
![Page 10: Capturing the chicken transcriptome with PacBio long read RNA-seq data OR Chicken in awesome sauce: a recipe for new transcript identification Gladstone](https://reader038.vdocuments.net/reader038/viewer/2022102800/56649c745503460f949267b3/html5/thumbnails/10.jpg)
Acquired deep short read data from many tissue typesIllumina data – 150 million uniquely mapping fragmentsTissues –brain, cerebellum, heart, kidney, liver, testicle
Acquired EST data from existing databasesEmployed de novo transcriptome assembly tools to generate annotation
TrinityMAKER
Solution? De novo transcriptome assembly of short read data and EST data
![Page 11: Capturing the chicken transcriptome with PacBio long read RNA-seq data OR Chicken in awesome sauce: a recipe for new transcript identification Gladstone](https://reader038.vdocuments.net/reader038/viewer/2022102800/56649c745503460f949267b3/html5/thumbnails/11.jpg)
Blue boxes are exonsBlack lines show exons joined by:
1. exon spanning reads2. paired-end reads
Solution?
Assembly of exons is possible with short reads but assembling isoforms is trickier…
![Page 12: Capturing the chicken transcriptome with PacBio long read RNA-seq data OR Chicken in awesome sauce: a recipe for new transcript identification Gladstone](https://reader038.vdocuments.net/reader038/viewer/2022102800/56649c745503460f949267b3/html5/thumbnails/12.jpg)
Solution?
![Page 13: Capturing the chicken transcriptome with PacBio long read RNA-seq data OR Chicken in awesome sauce: a recipe for new transcript identification Gladstone](https://reader038.vdocuments.net/reader038/viewer/2022102800/56649c745503460f949267b3/html5/thumbnails/13.jpg)
?
Solution?
Three exons can be joined by:1. one end of a pair mapping to exon 12. other end spanning exons 2 & 3
300 bp
Can’t join 2 - 3 - 4 because exon 3 longer than insert size
![Page 14: Capturing the chicken transcriptome with PacBio long read RNA-seq data OR Chicken in awesome sauce: a recipe for new transcript identification Gladstone](https://reader038.vdocuments.net/reader038/viewer/2022102800/56649c745503460f949267b3/html5/thumbnails/14.jpg)
Assembly of exons is possible with short reads but assembling isoforms is trickier…
Assembly of Illumina reads yielded 120k distinct contigs with average length of ~600bp, well below median transcript length
Solution?
Median Transcript LengthChicken – 1.2kbMouse – 1.5kb
![Page 15: Capturing the chicken transcriptome with PacBio long read RNA-seq data OR Chicken in awesome sauce: a recipe for new transcript identification Gladstone](https://reader038.vdocuments.net/reader038/viewer/2022102800/56649c745503460f949267b3/html5/thumbnails/15.jpg)
![Page 16: Capturing the chicken transcriptome with PacBio long read RNA-seq data OR Chicken in awesome sauce: a recipe for new transcript identification Gladstone](https://reader038.vdocuments.net/reader038/viewer/2022102800/56649c745503460f949267b3/html5/thumbnails/16.jpg)
isolate mRNA
generate cDNA
size-select 0-1kb1-2 kb2-3 kb3+ kb
create libraries
sequence
post-processingID full length reads
trim primers & polyA tails
sequencing
embryonic chicken hearts
error correction alignment gene modeling
long error-prone read
short high-fidelity reads
x x x x xlong corrected read
galGal4 genome
methodology
![Page 17: Capturing the chicken transcriptome with PacBio long read RNA-seq data OR Chicken in awesome sauce: a recipe for new transcript identification Gladstone](https://reader038.vdocuments.net/reader038/viewer/2022102800/56649c745503460f949267b3/html5/thumbnails/17.jpg)
Terminology
1,508,184 subreads mapped uniquely (~72%) to galGal4 assembly
PacBio SMRTbell
transcription
cDNA insert
full length subread
pacBio read
![Page 18: Capturing the chicken transcriptome with PacBio long read RNA-seq data OR Chicken in awesome sauce: a recipe for new transcript identification Gladstone](https://reader038.vdocuments.net/reader038/viewer/2022102800/56649c745503460f949267b3/html5/thumbnails/18.jpg)
All Subreads – includes incompletely sequenced transcript
HQRegion Subread – completely sequenced >= 1x
HQRegion Full-Pass Subreads – completely sequenced >= 2x
Most reads cover full length of transcript
![Page 19: Capturing the chicken transcriptome with PacBio long read RNA-seq data OR Chicken in awesome sauce: a recipe for new transcript identification Gladstone](https://reader038.vdocuments.net/reader038/viewer/2022102800/56649c745503460f949267b3/html5/thumbnails/19.jpg)
Most reads begin at the 5’ end of transcripts and end at the 3’ end
![Page 20: Capturing the chicken transcriptome with PacBio long read RNA-seq data OR Chicken in awesome sauce: a recipe for new transcript identification Gladstone](https://reader038.vdocuments.net/reader038/viewer/2022102800/56649c745503460f949267b3/html5/thumbnails/20.jpg)
transcript length
Coverage of refSeq transcripts
![Page 21: Capturing the chicken transcriptome with PacBio long read RNA-seq data OR Chicken in awesome sauce: a recipe for new transcript identification Gladstone](https://reader038.vdocuments.net/reader038/viewer/2022102800/56649c745503460f949267b3/html5/thumbnails/21.jpg)
Example of data…good existing annotation
![Page 22: Capturing the chicken transcriptome with PacBio long read RNA-seq data OR Chicken in awesome sauce: a recipe for new transcript identification Gladstone](https://reader038.vdocuments.net/reader038/viewer/2022102800/56649c745503460f949267b3/html5/thumbnails/22.jpg)
Illumina
Ensembl
RefSeq
PacBio data
Example of current coverage…
![Page 23: Capturing the chicken transcriptome with PacBio long read RNA-seq data OR Chicken in awesome sauce: a recipe for new transcript identification Gladstone](https://reader038.vdocuments.net/reader038/viewer/2022102800/56649c745503460f949267b3/html5/thumbnails/23.jpg)
New ensembl annotation based on galGal4 fixed many of the issues that motivated our efforts
New genome assembly, new annotation
![Page 24: Capturing the chicken transcriptome with PacBio long read RNA-seq data OR Chicken in awesome sauce: a recipe for new transcript identification Gladstone](https://reader038.vdocuments.net/reader038/viewer/2022102800/56649c745503460f949267b3/html5/thumbnails/24.jpg)
Remember this gene? available annotations were unreliable
conservationensembl annotation
non-chicken refSeq
RNA-seq data
Znf503, a likely regulator of heart development
![Page 25: Capturing the chicken transcriptome with PacBio long read RNA-seq data OR Chicken in awesome sauce: a recipe for new transcript identification Gladstone](https://reader038.vdocuments.net/reader038/viewer/2022102800/56649c745503460f949267b3/html5/thumbnails/25.jpg)
galGal3
galGal4ensembl
ensembl
non-chickenrefSeq
Annotation improvement
non-chickenrefSeq
Znf503, a likely regulator of heart development
![Page 26: Capturing the chicken transcriptome with PacBio long read RNA-seq data OR Chicken in awesome sauce: a recipe for new transcript identification Gladstone](https://reader038.vdocuments.net/reader038/viewer/2022102800/56649c745503460f949267b3/html5/thumbnails/26.jpg)
New ensembl annotation based on galGal4 fixed many of the issues that motivated our efforts
However, the PacBio data contains ~2,000 transcripts that represent improvements to even this newest annotation
New genome assembly, new annotation
![Page 27: Capturing the chicken transcriptome with PacBio long read RNA-seq data OR Chicken in awesome sauce: a recipe for new transcript identification Gladstone](https://reader038.vdocuments.net/reader038/viewer/2022102800/56649c745503460f949267b3/html5/thumbnails/27.jpg)
Categories of annotation improvements…
B2B PacBio isoforms
Ensembl 2013 annotation
RefSeq annotation
Illumina tag density
corrected genes missing from Ensembl
![Page 28: Capturing the chicken transcriptome with PacBio long read RNA-seq data OR Chicken in awesome sauce: a recipe for new transcript identification Gladstone](https://reader038.vdocuments.net/reader038/viewer/2022102800/56649c745503460f949267b3/html5/thumbnails/28.jpg)
Categories of annotation improvements…
B2B PacBio isoforms
Ensembl 2013 annotation
RefSeq annotation
Illumina tag density
corrected exons missing from Ensembl
![Page 29: Capturing the chicken transcriptome with PacBio long read RNA-seq data OR Chicken in awesome sauce: a recipe for new transcript identification Gladstone](https://reader038.vdocuments.net/reader038/viewer/2022102800/56649c745503460f949267b3/html5/thumbnails/29.jpg)
Categories of annotation improvements…
B2B PacBio isoforms
Ensembl 2013 annotation
RefSeq annotation
Illumina tag density
identify completely new isoforms
![Page 30: Capturing the chicken transcriptome with PacBio long read RNA-seq data OR Chicken in awesome sauce: a recipe for new transcript identification Gladstone](https://reader038.vdocuments.net/reader038/viewer/2022102800/56649c745503460f949267b3/html5/thumbnails/30.jpg)
Categories of annotation improvements…
B2B PacBio isoforms
Ensembl 2013 annotation
RefSeq annotation
Illumina tag density
corrected false exons
![Page 31: Capturing the chicken transcriptome with PacBio long read RNA-seq data OR Chicken in awesome sauce: a recipe for new transcript identification Gladstone](https://reader038.vdocuments.net/reader038/viewer/2022102800/56649c745503460f949267b3/html5/thumbnails/31.jpg)
Categories of annotation improvements…
B2B PacBio isoforms
Ensembl 2013 annotation
RefSeq annotation
Illumina tag density
identify new transcription start sites
![Page 32: Capturing the chicken transcriptome with PacBio long read RNA-seq data OR Chicken in awesome sauce: a recipe for new transcript identification Gladstone](https://reader038.vdocuments.net/reader038/viewer/2022102800/56649c745503460f949267b3/html5/thumbnails/32.jpg)
Categories of annotation improvements…
B2B PacBio isoforms
Illumina tag density
identify new low-abundance genes/exons
![Page 33: Capturing the chicken transcriptome with PacBio long read RNA-seq data OR Chicken in awesome sauce: a recipe for new transcript identification Gladstone](https://reader038.vdocuments.net/reader038/viewer/2022102800/56649c745503460f949267b3/html5/thumbnails/33.jpg)
Mapped ends of PacBio reads (GMAP) exhibit systematic splice donor site errors
![Page 34: Capturing the chicken transcriptome with PacBio long read RNA-seq data OR Chicken in awesome sauce: a recipe for new transcript identification Gladstone](https://reader038.vdocuments.net/reader038/viewer/2022102800/56649c745503460f949267b3/html5/thumbnails/34.jpg)
conservationIllumina tag density
Ensembl
PacBio rawGMAP alignments
(1,2)
(3)
Peculiar buildup at 3’ end of reads…
![Page 35: Capturing the chicken transcriptome with PacBio long read RNA-seq data OR Chicken in awesome sauce: a recipe for new transcript identification Gladstone](https://reader038.vdocuments.net/reader038/viewer/2022102800/56649c745503460f949267b3/html5/thumbnails/35.jpg)
1. short reads aligned to long read
long error-prone readx
short high-fidelity reads
long corrected read
2. consensus of aligned reads corrects error
Error correction wasn’t really useful in this case (good underlying genome build)
![Page 36: Capturing the chicken transcriptome with PacBio long read RNA-seq data OR Chicken in awesome sauce: a recipe for new transcript identification Gladstone](https://reader038.vdocuments.net/reader038/viewer/2022102800/56649c745503460f949267b3/html5/thumbnails/36.jpg)
1. New Ensembl annotation fixed many problematic transcripts2. PacBio data added another 2,000 transcripts to the set expressed in
embryonic chicken hearts
Recommendations for others with similar projects
3. Select mRNAs with mature 5’cap and poly-A tail to ensure full length transcript4. Perform normalization using double stranded nuclease to get greater coverage5. Don’t worry about error correction if you’ve got a good reference genome6. Be aware of some of the systematic errors associated with mapping results
Summary and recommendations
![Page 37: Capturing the chicken transcriptome with PacBio long read RNA-seq data OR Chicken in awesome sauce: a recipe for new transcript identification Gladstone](https://reader038.vdocuments.net/reader038/viewer/2022102800/56649c745503460f949267b3/html5/thumbnails/37.jpg)
Acknowledgements
Jason UnderwoodElizabeth TsengLuke Hickey
Alisha HollowaySean Thomas