Gene Prediction Group Preliminary Results
Faction 1
Gene Prediction Group
03/01/2017
Final Proposed Pipeline
• Input: Assembled Genome
• Protein-coding Gene Prediction
  - Ab initio: Prodigal, Glimmer, GeneMarkS, FGenesB, AMIGene
  - Homology based: BLAST
• RNA Gene Prediction (ncRNA specific)
  - tRNAscan-SE (tRNA)
  - RNAmmer (rRNA)
  - Infernal
  - ARAGORN (tmRNA)
• Filtering: pseudogenes and correcting for start codon
• Merge
• Validation: GeneValidator, BLAST
• Final Output
Ab initio Tools
Ab Initio Tool Choices
We have run six tools so far:
- FGenesB
- GLIMMER
- GeneMarkS
- Prodigal
- AMIGene
- TICO (gene prediction improvement)
Validating Results with Annotated Genomes
Salmonella enterica subsp. enterica serovar Heidelberg str. CFSAN002069 (Genome 1):
Annotated using the RAST pipeline plus ab initio tools. The RAST pipeline runs a homology search against the FIGfam database.
Salmonella enterica subsp. enterica serovar Heidelberg str. B182 (Genome 2):
NCBI’s annotation pipeline: Homology, ProSplign, GeneMarkS+
Salmonella enterica subsp. enterica serovar Heidelberg str. CFSAN002069 (Genome 3):
Glimmer3 for initial prediction; HMM and BER (homology tools) for confirming annotation
Metrics for Comparing Results
● TP = True Positive (result matches NCBI annotation)
● FP = False Positive (results not found in NCBI annotation)
● FN = False Negative (found in NCBI annotation but not results)
● Sensitivity = TP / (TP + FN)
○ Higher sensitivity means less chance of missing a True Positive
● PPV = TP / (TP + FP)
○ Describes probability that a positive result is actually positive
● True Negatives cannot be counted, so specificity cannot be calculated
○ Specificity = TN / (TN + FP)
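A minimal sketch of the two metrics, using the GeneMarkS counts reported in this deck for Reference Genome 2. It assumes that exact matches, stop-only matches, and overlapping matches all count as true positives; that reading is an assumption, although it reproduces the reported GeneMarkS percentages. The function and variable names are illustrative.

```python
def sensitivity(tp: int, fn: int) -> float:
    """Fraction of annotated genes that were recovered: TP / (TP + FN)."""
    return tp / (tp + fn)


def ppv(tp: int, fp: int) -> float:
    """Fraction of predictions that match an annotated gene: TP / (TP + FP)."""
    return tp / (tp + fp)


# GeneMarkS counts against Reference Genome 2, as reported in this deck.
exact, stop, overlap, extra, missed = 3820, 352, 355, 153, 363

# Assumption: exact, stop-only, and overlapping matches are all true positives.
tp = exact + stop + overlap
fp = extra   # predictions with no counterpart in the annotation
fn = missed  # annotated genes with no prediction

print(f"Sensitivity = {sensitivity(tp, fn):.1%}")
print(f"PPV         = {ppv(tp, fp):.1%}")
```

Specificity is left out on purpose: without a count of true negatives there is nothing to put in its numerator.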
Types of Misannotation (Reference Genome 2)
Tool                                     Exact Match  Stop  Overlap  Extra (FP)  Missed (FN)  Sensitivity  PPV
GeneMarkS                                3820         352   355      153         363          92.6%        96.7%
Prodigal                                 4047         225   242      93          379          92.9%        98.0%
GLIMMER                                  1828         428   71       2605        2563         47.6%        47.2%
FGenesB                                  3694         319   364      106         515          89.5%        97.6%
Homology                                 1500         443   22       2630        2922         40.2%        42.8%
Blind Union of Prodigal and GeneMarkS    4148         233   221      428         319          93.5%        91.5%
Overlap Union of Prodigal and GeneMarkS  4073         222   202      284         424          91.4%        94.1%
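One way the misannotation categories could be counted is sketched below. The category definitions are assumptions, since the deck does not spell them out: "exact" means identical start and stop, "stop" means the same stop coordinate with a different start, "overlap" means any other same-strand overlap, "extra" means a prediction with no overlapping annotation, and "missed" means an annotated gene that no prediction overlaps. The `Gene` type and the toy coordinates are hypothetical.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    start: int   # 1-based, start <= end regardless of strand
    end: int
    strand: str  # "+" or "-"


def stop_coord(g: Gene) -> int:
    """Coordinate of the stop codon end: right edge on '+', left edge on '-'."""
    return g.end if g.strand == "+" else g.start


def overlaps(a: Gene, b: Gene) -> bool:
    return a.strand == b.strand and a.start <= b.end and b.start <= a.end


def classify(predicted, reference):
    """Count predictions per category (quadratic scan, fine for a sketch)."""
    counts = dict.fromkeys(["exact", "stop", "overlap", "extra", "missed"], 0)
    matched = set()
    for p in predicted:
        hits = [r for r in reference if overlaps(p, r)]
        matched.update(hits)
        if p in hits:
            counts["exact"] += 1
        elif any(stop_coord(p) == stop_coord(r) for r in hits):
            counts["stop"] += 1
        elif hits:
            counts["overlap"] += 1
        else:
            counts["extra"] += 1
    counts["missed"] = sum(1 for r in reference if r not in matched)
    return counts


# Toy data, not taken from any genome.
reference = [Gene(100, 400, "+"), Gene(500, 800, "-"),
             Gene(900, 1200, "+"), Gene(1500, 1800, "+")]
predicted = [Gene(100, 400, "+"), Gene(500, 830, "-"),
             Gene(950, 1250, "+"), Gene(2000, 2300, "+")]
print(classify(predicted, reference))
```

On real genomes an interval index would replace the nested loop, but the category logic would stay the same.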
Merging CDS Predictions
• Ab initio: GeneMarkS, Prodigal → Overlapping Results
• Homology: BLAST / ORFfinder → BLAST
• Merge Results
With the current progress in genome sequencing projects, more than 20 billion nucleotides of DNA sequence are available in online repositories, and the accuracy of genome annotation is steadily improving.
Principle: Coding regions of related species are highly conserved.
We rely on significant matches between the query sequence and the sequences of known genes from databases.
TOOLS USED:
• Basic Local Alignment Search Tool (BLAST)
• Open Reading Frame finder (ORFfinder)
Homology based Gene Prediction Pipeline
• Assembled Genome → ORFfinder
• Nucleotide ORFs → BLAST against RefSeq database (Salmonella genus)
• Translated ORFs → BLAST against the protein database
Stand-alone versions of BLAST and ORFfinder were used.
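To illustrate the first step of this pipeline, here is a minimal six-frame ORF scan. It is a simplified sketch and not the algorithm used by NCBI ORFfinder: it assumes ATG as the only start codon, the three standard stop codons, and an uppercase A/C/G/T alphabet. The function names and the `min_len` parameter are hypothetical.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq: str, min_len: int = 6):
    """Return (start, end, strand) for each ATG...stop ORF.

    Coordinates are 1-based, inclusive, and always given on the forward
    strand. min_len is the minimum ORF length in nucleotides, stop included.
    """
    seq = seq.upper()
    n = len(seq)
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            orf_start = None
            for i in range(frame, n - 2, 3):
                codon = s[i:i + 3]
                if orf_start is None and codon == "ATG":
                    orf_start = i          # first in-frame start codon
                elif orf_start is not None and codon in STOPS:
                    end = i + 3            # exclusive end, stop codon included
                    if end - orf_start >= min_len:
                        if strand == "+":
                            orfs.append((orf_start + 1, end, "+"))
                        else:
                            # Map reverse-complement offsets back to forward coordinates.
                            orfs.append((n - end + 1, n - orf_start, "-"))
                    orf_start = None
    return orfs


print(find_orfs("CCATGAAATAGGG"))
```

The nucleotide ORFs returned by such a scan, and their translations, would then be the queries for the two BLAST searches listed above.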
Prediction Accuracy of Homology Tool
                              Reference 1      Reference 2      Reference 3
                              (NC_011083.1)    (NC_017623.1)    (NC_021812.2)
#ORFs detected                11394            11098            11070
#Genes annotated              4,848            4,555            4,784
#Genes predicted using BLAST  4,630            4,688            4,597
#True positives               1,976            1,965            1,965
#False positives              2,720            2,630            2,652
% predicted by BLAST          43               42               43

                              OB0022           OB0023           OB0024
#ORFs detected                10971            11363            11425
#Genes predicted using BLAST  4601             4623             4693
Pseudogene Prediction
                        Reference 1      Reference 2      Reference 3
                        (NC_011083.1)    (NC_017623.1)    (NC_021812.2)
#Pseudogenes annotated  144              218              129
#Pseudogenes predicted  282              290              137
• Homologous sequences arising from active genes that have lost their function
• Gene-like; may have a promoter and CpG islands
• May also contain stop codons, repetitive elements, or frameshifts, and/or lack transcription
Validation?
Future Work
Blast against protein database
Validation for Pseudogene prediction
Expand to 24 samples
ncRNA Prediction Tools
Non-coding RNA Gene Prediction Process
• tRNAscan-SE: tRNA prediction → GTF format
• RNAmmer: rRNA prediction → GTF format
• ARAGORN: tmRNA prediction → GTF format
• Infernal with Rfam: other ncRNA prediction → GTF format
• Merge & Resolve Redundancies → ncRNA GTF
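The merge step could look roughly like the sketch below. The redundancy rule is an assumption, not the group's documented method: two records are treated as redundant when they overlap on the same sequence and strand, and the record from a specialised tool is kept over an overlapping Infernal hit. The `Feature` type, the `PRIORITY` table, and the toy records are hypothetical.

```python
from typing import NamedTuple


class Feature(NamedTuple):
    seqid: str
    source: str
    feature: str
    start: int
    end: int
    strand: str


# Hypothetical priority: a lower number wins when two records overlap.
PRIORITY = {"tRNAscan-SE": 0, "RNAmmer": 0, "ARAGORN": 0, "Infernal": 1}


def overlaps(a: Feature, b: Feature) -> bool:
    return (a.seqid == b.seqid and a.strand == b.strand
            and a.start <= b.end and b.start <= a.end)


def merge_features(*feature_lists):
    """Pool all records and drop any that overlap a higher-priority record."""
    pooled = [f for fl in feature_lists for f in fl]
    # Higher-priority sources come first, so they claim their interval.
    pooled.sort(key=lambda f: (PRIORITY.get(f.source, 99), f.seqid, f.start))
    kept = []
    for f in pooled:
        if not any(overlaps(f, k) for k in kept):
            kept.append(f)
    return sorted(kept, key=lambda f: (f.seqid, f.start, f.end))


# Toy records, not real predictions.
trna = [Feature("contig1", "tRNAscan-SE", "tRNA", 100, 175, "+")]
rrna = [Feature("contig1", "RNAmmer", "rRNA", 5000, 6500, "-")]
other = [Feature("contig1", "Infernal", "tRNA", 98, 176, "+"),
         Feature("contig1", "Infernal", "sRNA", 9000, 9100, "+")]

for f in merge_features(trna, rrna, other):
    print("\t".join(map(str, f)))
```

Here the Infernal tRNA hit is dropped because it duplicates the tRNAscan-SE call, while the Infernal sRNA hit survives because nothing else covers it.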
tRNAscan-SE v 1.3.1
Predicted all tRNA genes plus 1 pseudo tRNA gene in 4 reference genomes
tRNAscan-SE -P -o <output> <sequence>
GTF format
awk 'BEGIN { OFS = "\t" }
NR > 3 {    # skip the three header lines of the tRNAscan-SE output
    # Begin > End means the tRNA lies on the minus strand
    if ($3 > $4) { strand = "-"; start = $4; end = $3 }
    else         { strand = "+"; start = $3; end = $4 }
    print $1, "tRNAscan-SE", "tRNA", start, end, $9, strand, ".", "type-" $5 " anticodon-" $6
}' <tRNAscan-output>
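For pipelines that are scripted in Python rather than the shell, the same conversion can be sketched as a small function. It assumes the default tabular tRNAscan-SE output: three header lines, then nine whitespace-separated columns (sequence name, tRNA number, begin, end, type, anticodon, intron begin, intron end, score). The function name and the sample rows are hypothetical.

```python
def trnascan_to_gtf(lines):
    """Convert tRNAscan-SE tabular output lines to GTF-style records."""
    records = []
    for line in list(lines)[3:]:           # the first three lines are headers
        fields = line.split()
        if len(fields) < 9:
            continue                       # skip blank or malformed rows
        name, _number, begin, end, trna_type, anticodon = fields[:6]
        score = fields[8]
        begin, end = int(begin), int(end)
        # Begin > End means the tRNA lies on the minus strand.
        strand = "+" if begin <= end else "-"
        start, stop = min(begin, end), max(begin, end)
        attributes = f"type-{trna_type} anticodon-{anticodon}"
        records.append("\t".join([name, "tRNAscan-SE", "tRNA", str(start),
                                  str(stop), score, strand, ".", attributes]))
    return records


# Hypothetical sample in the shape of tRNAscan-SE output.
sample = [
    "Sequence\t\ttRNA\tBounds\ttRNA\tAnti\tIntron Bounds\tCove",
    "Name\ttRNA #\tBegin\tEnd\tType\tCodon\tBegin\tEnd\tScore",
    "--------\t------\t-----\t---\t----\t-----\t-----\t---\t-----",
    "contig1\t1\t1200\t1276\tArg\tACG\t0\t0\t85.12",
    "contig1\t2\t5400\t5325\tLeu\tCAG\t0\t0\t70.30",
]
for record in trnascan_to_gtf(sample):
    print(record)
```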
ARAGORN evaluation
Positives
Fast, >2000x as fast as Infernal
Negatives
Low sensitivity
High false positive rate
Contradictory results with lower threshold
Awkward output format
Lack of scores
Infernal 1.1.2 vs ARAGORN
Test Data: 390 tmRNA sequences from tmRDB
Tool            Total Hits  Test Seq Detected  % Test Seq Detected
ARAGORN         310         306                78.5%
Infernal 1.1.2  373         352                90.3%
ARAGORN
RNAmmer
● rnammer [-options] <fasta sequence>
● ./rnammer -S bac -multi -gff -f -h < fasta sequence
● -S kingdom: bac, arc, euk
● -multi runs all molecules and both strands in parallel
● -gff, -f, -h, -xml select the output formats
● -m molecule type: tsu (5S rRNA), lsu (23S rRNA), ssu (16S rRNA)
Web Interface
RNAmmer result
5S rRNA  23S rRNA  16S rRNA
16       3         3
Infernal 1.1.2 and Rfam 12.2
Reference Sequence:
Commands:
cmscan [-options] <cmdb> <seqfile>
cmscan --tblout ~/tbResult Rfam.cm ~/genome.fa
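The file written by `--tblout` is a whitespace-delimited table, so it can be read with a few lines of Python before it is converted to GTF. The sketch below assumes the default Infernal 1.1 `cmscan` table layout (target name, accession, query name, accession, mdl, mdl from, mdl to, seq from, seq to, strand, trunc, pass, gc, bias, score, E-value, inc, description). The function name, dictionary keys, and sample rows are hypothetical.

```python
def parse_cmscan_tblout(lines):
    """Parse cmscan --tblout lines into a list of hit dictionaries."""
    hits = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue                       # skip comments and blank lines
        f = line.rstrip("\n").split(None, 17)  # the description may contain spaces
        seq_from, seq_to = int(f[7]), int(f[8])
        hits.append({
            "family": f[0],                # Rfam family name (the target CM)
            "accession": f[1],
            "seqid": f[2],                 # query sequence name
            "start": min(seq_from, seq_to),
            "end": max(seq_from, seq_to),
            "strand": f[9],
            "score": float(f[14]),
            "evalue": float(f[15]),
            "significant": f[16] == "!",   # "!" marks hits above the inclusion threshold
        })
    return hits


# Hypothetical rows in the shape of a cmscan table.
sample = [
    "#target name accession query name accession mdl ...",
    "tRNA  RF00005 contig1 - cm 1  71 1200 1272 + no 1 0.55 0.0  62.3 1.2e-12 ! tRNA",
    "tmRNA RF00023 contig1 - cm 1 354 9000 8640 - no 1 0.50 0.0 210.4 3.4e-60 ! transfer-messenger RNA",
]
for hit in parse_cmscan_tblout(sample):
    print(hit["family"], hit["start"], hit["end"], hit["strand"], hit["evalue"])
```

For minus-strand hits cmscan reports seq from greater than seq to, which is why the sketch reorders the two coordinates and relies on the strand column.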
CMSCAN Result
Merging rRNA, tRNA, and CDS Predictions
[Decision flowchart; the Yes/No branches are not recoverable from the transcript]
Decision nodes:
• CDS overlaps rRNA?
• >30 bp overlap with tRNA?
• Pseudo or atypical tRNA and not hypothetical CDS?
• Hypothetical CDS and tRNA neither pseudo nor atypical?
Outcomes:
• Keep CDS
• Discard CDS
• Keep Both
• Keep tRNA only
• Keep CDS only
Appendix: Genome 1 Comparison Table
Tool           Exact Match  Stop  Overlap  Extra  Missed
GeneMarkS      3796         315   317      158    380
Prodigal       3950         225   199      117    403
GLIMMER        1801         378   20       2552   2574
FGenesB        3691         319   344      468    767
Homology       1451         489   25       2720   2811
Blind Union    4053         236   177      440    338
Overlap Union  3977         219   160      303    447

Appendix: Genome 3 Comparison Table
Tool           Exact Match  Stop  Overlap  Extra  Missed
GeneMarkS      4048         267   298      144    395
Prodigal       4234         184   182      82     408
GLIMMER        2003         457   26       2501   2555
FGenesB        3886         281   325      108    518
Homology       1491         467   18       2652   3031
Blind Union    4345         195   141      431    342
Overlap Union  4258         178   126      290    460
GeneValidator (GV)
Evaluates the quality of gene predictions by comparing them with similar known proteins from public and private databases.
What does it do?
• BLAST: searches the gene prediction sequences against a relevant database
• Compares each prediction with its HSPs on 7 parameters: how does the gene prediction deviate from the hit sequences?
• Combines the scores of the individual tests into an overall quality score (0-100)
Homework is on the Wiki
Under “Exercises”
You have one week to complete it
GO!
References
Lukashin A, Borodovsky M: GeneMark.hmm: new solutions for gene finding. Nucleic Acids Res 1998, 26(4):1107–1115.
10.1093/nar/26.4.1107
Delcher A, Bratke K, Powers E, Salzberg S: Identifying bacterial genes and endosymbiont DNA with Glimmer. Bioinformatics 2007,
23(6):673–679. 10.1093/bioinformatics/btm009
Lomsadze, A., Tang, S., Gemayel, K., & Borodovsky, M. GeneMarkS-2: Raising Standards of Accuracy in Gene Recognition.
Besemer, J., Lomsadze, A., & Borodovsky, M. (2001). GeneMarkS: a self-training method for prediction of gene starts in microbial genomes.
Implications for finding sequence motifs in regulatory regions. Nucleic acids research, 29(12), 2607-2618.
Bocs S, Cruveiller S, Vallenet D, Nuel G, Médigue C. AMIGene: Annotation of MIcrobial Genes. Nucleic Acids Research.
2003;31(13):3723-3726.
Oliynyk et al., (2007) Complete genome sequence of the erythromycin-producing bacterium Saccharopolyspora erythraea NRRL23338
Nature Biotechnology 25, 447 - 453
John E. Karro, Yangpan Yan, Deyou Zheng, Zhaolei Zhang, Nicholas Carriero, Philip Cayting, Paul Harrison, Mark Gerstein;
Pseudogene.org: a comprehensive database and comparison platform for pseudogene annotation. Nucleic Acids Res 2007; 35
(suppl_1): D55-D60. doi: 10.1093/nar/gkl851
Hori H. Methylated nucleosides in tRNA and tRNA methyltransferases. Frontiers in Genetics. 2014;5:144.
doi:10.3389/fgene.2014.00144.
Gong H, Vu G-P, Bai Y, et al. A Salmonella Small Non-Coding RNA Facilitates Bacterial Invasion and Intracellular Replication by
Modulating the Expression of Virulence Factors. Monack DM, ed. PLoS Pathogens. 2011;7(9):e1002120.
doi:10.1371/journal.ppat.1002120.
Sweeney B.A., Roy P., Leontis N.B. (2015) An introduction to recurrent nucleotide interactions in RNA. Wiley Interdisciplinary Reviews:
RNA, 6, 17–45
Harris KA, Lünse CE, Li S, Brewer KI, Breaker RR. Biochemical analysis of pistol self-cleaving ribozymes. RNA. 2015;21(11):1852-1858.
doi:10.1261/rna.052514.115.
Tjaden B, Goodwin SS, Opdyke JA, et al. Target prediction for small, noncoding RNAs in bacteria. Nucleic Acids Research.
2006;34(9):2791-2802. doi:10.1093/nar/gkl356.
Lowe TM, Eddy SR. tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids
Research. 1997;25(5):955-964.
Laslett D, Canback B. ARAGORN, a program to detect tRNA genes and tmRNA genes in nucleotide sequences. Nucleic Acids
Research. 2004;32(1):11-16. doi:10.1093/nar/gkh152.