Post-assembly Data Analysis - Cornell University (cbsuss05.tc.cornell.edu/doc/Trinity_Lecture2.pdf)
Post-assembly Data Analysis
• Quantification: the expression level of each gene in each sample
• DE genes: genes differentially expressed between samples
• Clustering/network analysis
• Identifying over-represented functional categories in DE genes
• Evaluation of the quality of the assembly
Assembled transcriptome
Different summarization strategies will result in the inclusion or exclusion of different sets of reads in the table of counts.
Part 1. Abundance estimation using RSEM
Map reads to genome (TOPHAT) vs. map reads to transcriptome (BOWTIE)
What is easier? No issues with alignment across splicing junctions.
What is more difficult? The same reads could be repeatedly aligned to different splicing isoforms and paralogous genes.
Haas et al. Nature Protocols 8:1494
RSEM assigns ambiguous reads based on unique reads mapped to the same transcript.
Red & yellow: unique regions
Blue: regions shared between two transcripts
Transcript 1
Transcript 2
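The idea of splitting ambiguous reads according to unique-read evidence can be sketched as a small iterative loop. This is only a simplified illustration, not RSEM's actual implementation (RSEM fits a fuller statistical model); the function name, read counts and lengths are hypothetical.

```python
def em_assign(unique_counts, shared_count, lengths, n_iter=200):
    """Split reads shared by all transcripts with a tiny EM-style loop.

    unique_counts: reads mapping to only one transcript (hypothetical numbers)
    shared_count:  reads compatible with every transcript
    lengths:       transcript lengths, used to turn read counts into densities
    """
    n = len(unique_counts)
    # Start from an even split of the shared reads.
    assigned = [u + shared_count / n for u in unique_counts]
    for _ in range(n_iter):
        # Reads per base is used as a proxy for transcript abundance.
        density = [a / l for a, l in zip(assigned, lengths)]
        total = sum(density)
        if total == 0:
            break
        # Each shared read is credited to a transcript in proportion to its abundance.
        assigned = [u + shared_count * d / total
                    for u, d in zip(unique_counts, density)]
    return assigned


# Two transcripts of equal length: 30 vs 10 unique reads, 40 shared reads.
print(em_assign([30, 10], 40, [1000, 1000]))
```

With equal lengths, the shared reads end up split 3:1, mirroring the ratio of unique reads.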
Trinity provides a script for calling BOWTIE and RSEM: align_and_estimate_abundance.pl
align_and_estimate_abundance.pl \
--transcripts Trinity.fasta \
--seqType fq \
--left sequence_1.fastq.gz \
--right sequence_2.fastq.gz \
--SS_lib_type RF \
--aln_method bowtie \
--est_method RSEM \
--thread_count 4 \
--trinity_mode \
--output_prefix tis1rep1
Parameters for align_and_estimate_abundance.pl
--aln_method: alignment method
Default: "bowtie". Alignment files from other aligners might not be supported.
--est_method: abundance estimation method
Default: RSEM, slightly more accurate.
Optional: eXpress, faster and less RAM required.
--thread_count: number of threads
--trinity_mode: the input reference is from Trinity. Non-trinity reference requires a gene-isoform mapping file (--gene_trans_map).
Output files from RSEM (two files per sample)

*.isoforms.results table

| transcript_id | gene_id | length | effective_length | expected_count | TPM | FPKM | IsoPct |
|---|---|---|---|---|---|---|---|
| gene1_isoform1 | gene1 | 2169 | 2004.97 | 22.1 | 3.63 | 3.93 | 92.08 |
| gene1_isoform2 | gene1 | 2170 | 2005.97 | 1.9 | 0.31 | 0.34 | 7.92 |
| … | | | | | | | |

*.genes.results table

| gene_id | transcript_id(s) | length | effective_length | expected_count | TPM | FPKM |
|---|---|---|---|---|---|---|
| gene1 | gene1_isoform1,gene1_isoform2 | 2169.1 | 2005.04 | 24 | 3.94 | 4.27 |
| … | | | | | | |
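Gene-level expected_count, TPM and FPKM are the sums of the isoform-level values (e.g. 22.1 + 1.9 = 24).
IsoPct: percentage of an isoform in a gene.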
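The relationship between these columns can be sketched with the commonly used definitions of TPM and FPKM. This is an illustration rather than RSEM's own code; the function names are hypothetical, and the toy counts in the last lines are made up.

```python
def tpm(counts, eff_lengths):
    """Transcripts per million: length-normalized rates rescaled to sum to 1e6."""
    rates = [c / l for c, l in zip(counts, eff_lengths)]
    total = sum(rates)
    return [r / total * 1e6 for r in rates]


def fpkm(counts, eff_lengths):
    """Fragments per kilobase of transcript per million mapped fragments."""
    n_fragments = sum(counts)
    return [c * 1e9 / (l * n_fragments) for c, l in zip(counts, eff_lengths)]


def iso_pct(isoform_tpms):
    """Share (in percent) of each isoform in the gene-level TPM."""
    gene_tpm = sum(isoform_tpms)
    return [100.0 * t / gene_tpm for t in isoform_tpms]


# IsoPct recomputed from the rounded TPM values of gene1 shown in the table.
print([round(p, 2) for p in iso_pct([3.63, 0.31])])

# Hypothetical three-transcript sample: TPM values always sum to one million.
print(sum(tpm([100, 50, 10], [2000.0, 1500.0, 800.0])))
```

The recomputed IsoPct for gene1_isoform1 comes out near 92.13, close to the 92.08 in the table; the small gap is presumably because the displayed TPM values are rounded.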
Filtering transcriptome reference based on RSEM: filter_fasta_by_rsem_values.pl
filter_fasta_by_rsem_values.pl \
--rsem_output=s1.isoforms.results,s2.isoforms.results \
--fasta=Trinity.fasta \
--output=Trinity.filtered.fasta \
--isopct_cutoff=5 \
--fpkm_cutoff=10 \
--tpm_cutoff=10
• Multiple RSEM files can be used for filtering simultaneously (a transcript that meets the criteria of any one filter will be filtered).
[Figure: Distribution of Expression Level of A Gene, with 3 biological replicates in each species. X axis: expression level; Y axis: % of replicates; curves: Condition 1 and Condition 2.]
Part 2. Differentially Expressed Genes
Reported for each gene:
• Normalized read count (Log2 count-per-million);
• Fold change between biological conditions (Log2 fold)
• Q(FDR).
Use edgeR or DESeq to identify DE genes
Use FDR, fold change and read count values to filter, e.g.:
• Log2(fold) > 2 or < -2
• FDR < 0.01
• Log2(counts) > 2
abundance_estimates_to_matrix.pl \
--est_method RSEM \
s1.genes.results \
s2.genes.results \
s3.genes.results \
s4.genes.results \
--out_prefix mystudy
Combine multiple-sample RSEM results into a matrix
s1 s2 s3 s4
TR10295|c0_g1 0 0 0 0
TR17714|c6_g5 51 70 122 169
TR23703|c1_g1 0 0 0 0
TR16168|c0_g2 1 1 1 0
TR16372|c0_g1 10 4 65 91
TR12445|c0_g3 0 0 0 0
Output file: mystudy.counts.matrix
……
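Conceptually, building the matrix means taking the expected_count column from each sample's *.genes.results file and placing the values side by side, one column per sample. The sketch below illustrates that idea only; it is not how abundance_estimates_to_matrix.pl is implemented, and the function name and the two toy tables are hypothetical.

```python
import csv
import io

HEADER = "gene_id\ttranscript_id(s)\tlength\teffective_length\texpected_count\tTPM\tFPKM\n"

# Hypothetical contents of two per-sample *.genes.results files.
TABLES = {
    "s1": HEADER + "gene1\tgene1_isoform1,gene1_isoform2\t2169.1\t2005.04\t24\t3.94\t4.27\n",
    "s2": HEADER + "gene1\tgene1_isoform1,gene1_isoform2\t2169.1\t2005.04\t10\t1.50\t1.62\n"
                   "gene2\tgene2_isoform1\t800\t650.0\t7\t2.10\t2.30\n",
}


def build_count_matrix(tables):
    """Return (sample names, {gene_id: {sample: expected_count}})."""
    samples = list(tables)
    matrix = {}
    for sample in samples:
        reader = csv.DictReader(io.StringIO(tables[sample]), delimiter="\t")
        for row in reader:
            # Genes missing from a sample keep a count of 0.
            gene_row = matrix.setdefault(row["gene_id"], dict.fromkeys(samples, 0.0))
            gene_row[sample] = float(row["expected_count"])
    return samples, matrix


samples, matrix = build_count_matrix(TABLES)
print("\t".join([""] + samples))
for gene, row in matrix.items():
    print("\t".join([gene] + [str(row[s]) for s in samples]))
```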
Use run_DE_analysis.pl script to call edgeR or DESeq
Make a condition-sample mapping file (mysamples):
condition1 s1
condition1 s2
condition2 s3
condition2 s4
run_DE_analysis.pl \
--matrix mystudy.counts.matrix \
--samples_file mysamples \
--method edgeR \
--min_rowSum_counts 10 \
--output edgeR_results
Run script run_DE_analysis.pl
--min_rowSum_counts 10: skip genes with summed read counts less than 10
logFC logCPM PValue FDR
TR8841|c0_g2 13.28489 7.805577 3.64E-20 1.05E-15
TR14584|c0_g2 -12.9853 7.507117 2.82E-19 4.04E-15
TR16945|c0_g1 -13.3028 7.823384 4.21E-18 3.11E-14
TR15899|c3_g2 9.485185 7.708629 5.35E-18 3.11E-14
TR8034|c0_g1 -10.1468 7.836908 5.42E-18 3.11E-14
TR3434|c0_g1 -12.909 7.431009 1.16E-17 5.55E-14
Output files from run_DE_analysis.pl
Filtering (done in R or Excel):
FDR: false discovery rate of DE detection (0.05 or below)
logFC: log2 fold change
logCPM: average expression level (log2 counts-per-million)
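The same filtering can also be scripted. Below is a minimal sketch using thresholds such as |logFC| > 2, FDR < 0.01 and logCPM > 2; the `gene_id` header, the function name and the last table row are assumptions added for illustration, while the first two rows come from the output shown above.

```python
import csv
import io

DE_TABLE = (
    "gene_id\tlogFC\tlogCPM\tPValue\tFDR\n"
    "TR8841|c0_g2\t13.28489\t7.805577\t3.64E-20\t1.05E-15\n"
    "TR14584|c0_g2\t-12.9853\t7.507117\t2.82E-19\t4.04E-15\n"
    "hypothetical_gene\t0.8\t5.1\t0.2\t0.6\n"  # made-up row that should be dropped
)


def filter_de(text, min_abs_logfc=2.0, max_fdr=0.01, min_logcpm=2.0):
    """Return IDs of genes passing the fold-change, FDR and expression cutoffs."""
    kept = []
    for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
        if (abs(float(row["logFC"])) > min_abs_logfc
                and float(row["FDR"]) < max_fdr
                and float(row["logCPM"]) > min_logcpm):
            kept.append(row["gene_id"])
    return kept


print(filter_de(DE_TABLE))
```

Using the absolute value of logFC keeps both up- and down-regulated genes.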
[Figures: heat map and hierarchical clustering; K-means clustering]
Part 3. Clustering/Network Analysis
Clustering tools in Trinity package
# Heat map
analyze_diff_expr.pl \
--matrix ../mystudy.TMM.fpkm.matrix \
--samples ../mysamples \
-P 1e-3 -C 2
# K-means clustering
define_clusters_by_cutting_tree.pl \
-K 6 -R cluster_results.matrix.RData
Output:
• Images in PDF format
• Gene list for each group from K-means analysis
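To show what K-means does with expression profiles, here is a minimal pure-Python sketch. It is not the code behind the Trinity scripts; the function name, the simple deterministic initialization and the four toy profiles are assumptions made for illustration.

```python
def kmeans(points, k, n_iter=100):
    """Cluster expression profiles (lists of numbers); returns one label per profile."""
    n = len(points)
    # Deterministic start: take k evenly spaced profiles as initial centers.
    centers = [list(points[i * n // k]) for i in range(k)]
    labels = None
    for _ in range(n_iter):
        # Assignment step: nearest center by squared Euclidean distance.
        new_labels = [
            min(range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centers[c])))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centers will not move again
        labels = new_labels
        # Update step: each center becomes the mean of its member profiles.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical profiles over four samples: two genes up in s3/s4, two down.
profiles = [
    [1.0, 1.0, 5.0, 5.0],
    [1.1, 0.9, 5.2, 4.8],
    [5.0, 5.0, 1.0, 1.0],
    [4.9, 5.1, 1.2, 0.8],
]
print(kmeans(profiles, 2))
```

Genes sharing a label form one group, analogous to the per-group gene lists listed under Output.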
Part 4. Functional Category Enrichment Analysis
Results from RNA-seq data analysis

DE gene list (from DESeq or edgeR, or from clustering):
TR14584|c0_g2
TR16945|c0_g1
TR8034|c0_g1
TR3434|c0_g1
TR13135|c1_g2
TR15586|c2_g4
TR14584|c0_g1
TR20065|c2_g3
TR1780|c0_g1
TR25746|c0_g2

Enriched GO categories (from GO enrichment analysis of DE genes):

| GO ID | over-represented p-value | FDR | GO term |
|---|---|---|---|
| GO:0044456 | 3.40E-10 | 3.78E-06 | synapse part |
| GO:0050804 | 1.49E-09 | 8.29E-06 | regulation of synaptic transmission |
| GO:0007268 | 5.64E-09 | 2.09E-05 | synaptic transmission |
| GO:0004889 | 1.00E-07 | 0.00012394 | acetylcholine-activated cation-selective channel activity |
| GO:0005215 | 1.23E-07 | 0.000136471 | transporter activity |
Steps in RNA-seq analysis:
1. Identify open reading frames and translate them into protein sequences (optional; not needed if using BLAST2GO, etc.);
2. Annotate the transcripts with GO
• BLAST2GO
• TRINOTATE
• INTERPROSCAN
3. Enrichment analysis
Transdecoder – Identify ORF from Transcript
• Long ORF
• Alignment to known protein sequences or motif
• Markov model of coding sequences (model trained on genes from the same genome)
# Identify longest ORFs
TransDecoder.LongOrfs -t myassembly.fa
# BLAST ORFs against a known protein database
blastp -query myassembly.fa.transdecoder_dir/longest_orfs.pep \
-db swissprot \
-max_target_seqs 1 \
-outfmt 6 -evalue 1e-5 \
-num_threads 8 > blastp.outfmt6
# Predict ORFs based on a Markov model of coding sequence (trained on long ORFs)
TransDecoder.Predict --cpu 8 \
--retain_blastp_hits blastp.outfmt6 \
-t myassembly.fa
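The "long ORF" criterion used in the first step can be illustrated with a toy scanner. This sketch only looks at the three forward-strand frames and is not TransDecoder's algorithm; the function name and the example sequence are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(seq):
    """Return the longest ATG-to-stop ORF on the forward strand (stop included), or ''."""
    seq = seq.upper()
    best = ""
    for frame in range(3):
        start = None
        # Walk the sequence codon by codon in this reading frame.
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                orf = seq[start:i + 3]
                if len(orf) > len(best):
                    best = orf
                start = None  # look for the next ORF in this frame
    return best


print(longest_orf("CCATGAAATTTGGGTAACC"))
```

A real ORF predictor would additionally scan the reverse strand and apply a minimum length, which this sketch leaves out.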
Function annotation
Predict gene function from sequences
Trinotate package
• BLAST Uniprot: homologs in known proteins
• PFAM: protein domain
• SignalP: signal peptide
• TMHMM: trans-membrane domain
• RNAMMER: rRNA
Output: go_annotations.txt (Gene Ontology annotation file)
Step-by-step guidance: https://cbsu.tc.cornell.edu/lab/userguide.aspx?a=software&i=143#c
BLAST2GO: https://cbsu.tc.cornell.edu/lab/userguide.aspx?a=software&i=73#c
or Recommended
run_GOseq.pl \
--genes_single_factor DE.filtered.txt \
--GO_assignments go_annotations.txt \
--lengths genes.lengths.txt
Enrichment analysis
• --genes_single_factor: use the gene IDs in the first column
• --GO_assignments: GO annotation file from Trinotate
• --lengths: 3rd column from RSEM *.genes.results file

Results:

| GO ID | over-represented p-value | FDR | GO term |
|---|---|---|---|
| GO:0044456 | 3.40E-10 | 3.78E-06 | synapse part |
| GO:0050804 | 1.49E-09 | 8.29E-06 | regulation of synaptic transmission |
| GO:0007268 | 5.64E-09 | 2.09E-05 | synaptic transmission |
| | | | acetylcholine-activated cation-… |
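The over-represented p-value can be understood through a plain hypergeometric test, sketched below. This is a simplification: GOseq additionally accounts for gene length bias (which is why it takes a lengths file), and this sketch ignores that. The function name and the example numbers are hypothetical.

```python
from math import comb


def enrichment_pvalue(n_total, n_in_category, n_de, n_de_in_category):
    """P(X >= n_de_in_category) when n_de genes are drawn at random from n_total."""
    upper = min(n_in_category, n_de)
    # Count the ways to draw at least the observed number of category genes.
    tail = sum(
        comb(n_in_category, k) * comb(n_total - n_in_category, n_de - k)
        for k in range(n_de_in_category, upper + 1)
    )
    return tail / comb(n_total, n_de)


# Hypothetical numbers: 10000 annotated genes, 200 in the GO category,
# 300 DE genes, 25 of which fall in the category.
print(enrichment_pvalue(10000, 200, 300, 25))
```

A small p-value means the category contains more DE genes than expected by chance; the FDR column then corrects these p-values for testing many categories.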
Part 5. Evaluation of Assembly Quality
https://github.com/trinityrnaseq/trinityrnaseq/wiki/Transcriptome-Assembly-Quality-Assessment
A. Representation of known proteins
• Compare to SWISSPROT (% of full length)
• Compare to genes in a closely related species
(% of full length; completeness)
• Compare to BUSCO core proteins (BUSCO)
BLAST against D. melanogaster protein database
blastx \
-query Trinity.fasta \
-db melanogaster.pep.all.fa \
-out blastx.outfmt6 -outfmt 6 -evalue 1e-20 -num_threads 6
Analyze the BLAST results and generate histogram
analyze_blastPlus_topHit_coverage.pl \
blastx.outfmt6 \
Trinity.fasta \
melanogaster.pep.all.fa
#hit_pct_cov_bin count_in_bin >bin_below
100 5165 5165
90 1293 6458
80 1324 7782
70 1434 9216
60 1634 10850
50 1385 12235
40 1260 13495
30 1091 14586
20 992 15578
10 507 16085
Comparison of the Drosophila yakuba assembly used in the exercise to Drosophila melanogaster proteins from FlyBase
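The third column of this table is the running total of the second, which makes it easy to read off how many matched proteins reach a given coverage level. A short sketch of that arithmetic, using the values from the table (the function name is hypothetical):

```python
from itertools import accumulate

bins = [100, 90, 80, 70, 60, 50, 40, 30, 20, 10]
counts = [5165, 1293, 1324, 1434, 1634, 1385, 1260, 1091, 992, 507]

# Running total of count_in_bin, i.e. the ">bin_below" column.
cumulative = list(accumulate(counts))


def fraction_at_least(min_pct):
    """Fraction of matched proteins whose coverage bin is >= min_pct."""
    covered = sum(c for b, c in zip(bins, counts) if b >= min_pct)
    return covered / cumulative[-1]


print(cumulative)
print(round(fraction_at_least(70), 3))
```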
B. Statistics scores
• Map RNA-Seq reads to the assembly (>80% of reads can be aligned)
• ExN50 transcript contig length
N50: the contig length such that contigs of this length or longer contain at least 50% of the total assembled bases
E90N50: N50 based on transcripts representing 90% of expression data
• DETONATE score
With or without reference sequences
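N50 and ExN50 can be computed in a few lines. The sketch below follows the definitions given above and is not the Trinity implementation; the function names and the toy transcripts are hypothetical.

```python
def n50(lengths):
    """Length L such that contigs of length >= L hold at least half of all bases."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0


def exn50(transcripts, x):
    """N50 of the most highly expressed transcripts covering x% of total expression.

    transcripts: list of (length, expression) pairs
    """
    ranked = sorted(transcripts, key=lambda t: t[1], reverse=True)
    target = sum(expr for _, expr in ranked) * x / 100.0
    running, selected = 0.0, []
    for length, expr in ranked:
        if running >= target:
            break
        selected.append(length)
        running += expr
    return n50(selected)


# Hypothetical transcripts as (length, expression) pairs.
toy = [(1000, 50.0), (500, 30.0), (200, 15.0), (100, 5.0)]
print(n50([t[0] for t in toy]), exn50(toy, 90))
```

Restricting N50 to well-expressed transcripts reduces the influence of the many short, lowly expressed contigs that are typical of de novo transcriptome assemblies.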