Gene Prediction Group Preliminary Results (compgenomics2017.biology.gatech.edu/images/d/d0/...)

Upload: vuonganh

Post on 18-Jul-2018

Gene Prediction Group Preliminary Results

Faction 1

Gene Prediction Group

03/01/2017

Final Proposed Pipeline

Assembled Genome

Protein-coding Gene Prediction

Ab Initio

• Prodigal
• Glimmer
• GeneMarkS
• FGenesB
• AMIGene

Homology Based

• BLAST

RNA Gene Prediction

ncRNA Specific

• tRNAscan-SE (tRNA)
• RNAmmer (rRNA)
• Infernal
• ARAGORN (tmRNA)

Filtering

• Pseudogenes and correcting for start codon

Merge

Validation

• GeneValidator
• BLAST

Final Output

Ab initio tools

Ab Initio Tool Choices

We have run six tools so far:

- FGenesB

- GLIMMER

- GeneMarkS

- Prodigal

- AMIGene

- TICO (gene prediction improvement)

Validating Results with Annotated Genomes

Salmonella enterica subsp. enterica serovar Heidelberg str. CFSAN002069 (Genome 1):

Annotated using the RAST pipeline plus ab initio tools. The RAST pipeline performs homology search against the FIGfam database.

Salmonella enterica subsp. enterica serovar Heidelberg str. B182 (Genome 2):

NCBI’s annotation pipeline: Homology, ProSplign, GeneMarkS+

Salmonella enterica subsp. enterica serovar Heidelberg str. CFSAN002069 (Genome 3):

Glimmer3 for initial prediction; HMM and BER (homology tools) for confirming annotation

Metrics for Comparing Results

● TP = True Positive (result matches NCBI annotation)

● FP = False Positive (results not found in NCBI annotation)

● FN = False Negative (found in NCBI annotation but not results)

● Sensitivity = TP / (TP + FN)

○ Higher sensitivity means less chance of missing a True Positive

● PPV = TP / (TP + FP)

○ Describes probability that a positive result is actually positive

● Unable to calculate True Negative (TN), so specificity cannot be computed

○ Specificity = TN / (TN + FP)
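
As a rough illustration of the sensitivity and PPV formulas, the Python sketch below plugs in the GeneMark-S counts reported for reference Genome 2. It assumes that TP is the sum of the Exact Match, Stop, and Overlap columns; that assumption is ours and is not stated on the slides, although it reproduces the percentages reported for that row. The function and variable names are illustrative only.

```python
def sensitivity(tp: int, fn: int) -> float:
    """Sensitivity = TP / (TP + FN)."""
    return tp / (tp + fn)


def ppv(tp: int, fp: int) -> float:
    """Positive predictive value = TP / (TP + FP)."""
    return tp / (tp + fp)


# Counts for GeneMark-S against reference Genome 2 (from the comparison table).
exact_match, stop_match, overlap_match = 3820, 352, 355
extra_fp, missed_fn = 153, 363

# ASSUMPTION: every exact, stop, or overlap match is counted as a true positive.
tp = exact_match + stop_match + overlap_match

print(f"Sensitivity: {sensitivity(tp, missed_fn):.1%}")  # Sensitivity: 92.6%
print(f"PPV: {ppv(tp, extra_fp):.1%}")                   # PPV: 96.7%
```

Specificity is left out on purpose: without a count of true negatives there is nothing to put in the numerator.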

Types of Misannotation

Reference Genome 2

Tool                                     Exact Match  Stop  Overlap  Extra (FP)  Missed (FN)  Sensitivity    PPV
GeneMarkS                                       3820   352      355         153          363        92.6%  96.7%
Prodigal                                        4047   225      242          93          379        92.9%  98.0%
GLIMMER                                         1828   428       71        2605         2563        47.6%  47.2%
FGenesB                                         3694   319      364         106          515        89.5%  97.6%
Homology                                        1500   443       22        2630         2922        40.2%  42.8%
Blind Union of Prodigal and GeneMarkS           4148   233      221         428          319        93.5%  91.5%
Overlap Union of Prodigal and GeneMarkS         4073   222      202         284          424        91.4%  94.1%

Merging CDS Predictions

• Ab initio: GeneMarkS, Prodigal
• Homology: BLAST / ORFfinder
• Overlapping Results
• BLAST
• Merge Results

With the current progress in genome sequencing projects, more than 20 billion nucleotides of DNA sequence are available in online repositories, and the accuracy of genome annotation is steadily improving.

Principle: Coding regions of related species are highly conserved.

We rely on significant matches between the query sequence and sequences of known genes from databases.

Tools used:

• Basic Local Alignment Search Tool (BLAST)
• Open Reading Frame finder (ORFfinder)

Homology-Based Gene Prediction Pipeline

Assembled Genome → ORFfinder

• Nucleotide ORFs → BLAST against RefSeq database (Salmonella genus)
• Translated ORFs → BLAST against the protein database

Stand-alone versions of BLAST and ORFfinder were used.
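
To make the ORF-detection step more concrete, here is a minimal Python sketch of a six-frame ORF scan. It is not ORFfinder's algorithm and was not part of the group's pipeline; it assumes ATG as the only start codon, the three standard stop codons, and a hypothetical `min_len` cutoff, and it reports 1-based inclusive coordinates on the forward strand.

```python
from typing import List, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq: str, min_len: int = 90) -> List[Tuple[int, int, str]]:
    """Scan all six frames for ATG..stop ORFs of at least min_len nucleotides.

    Returns (start, end, strand) tuples; coordinates are 1-based, inclusive,
    and always expressed on the forward strand. The stop codon is included.
    """
    seq = seq.upper()
    n = len(seq)
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None  # 0-based index of the first ATG in the open frame
            for i in range(frame, n - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    end = i + 3  # exclusive end on the scanned strand
                    if end - start >= min_len:
                        if strand == "+":
                            orfs.append((start + 1, end, "+"))
                        else:
                            # Map reverse-strand indices back to the forward strand.
                            orfs.append((n - end + 1, n - start, "-"))
                    start = None
    return sorted(orfs)


toy = "CCATGAAATAGCC"  # hypothetical sequence with a single ATG AAA TAG ORF
print(find_orfs(toy, min_len=9))
```

In a homology workflow like the one on this slide, each ORF found this way would then be extracted (and translated, for the protein search) before being passed to BLAST.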

Prediction Accuracy of Homology Tool

                              Reference 1     Reference 2     Reference 3
                              (NC_011083.1)   (NC_017623.1)   (NC_021812.2)
#ORFs detected                11,394          11,098          11,070
#Genes annotated              4,848           4,555           4,784
#Genes predicted using BLAST  4,630           4,688           4,597
#True Positives               1,976           1,965           1,965
#False Positives              2,720           2,630           2,652
% predicted by BLAST          43              42              43

                              OB0022          OB0023          OB0024
#ORFs detected                10,971          11,363          11,425
#Genes predicted using BLAST  4,601           4,623           4,693

Pseudogene Prediction

                          Reference 1     Reference 2     Reference 3
                          (NC_011083.1)   (NC_017623.1)   (NC_021812.2)
#Pseudogenes annotated    144             218             129
#Pseudogenes predicted    282             290             137

• Homologous sequences arising from active genes that have lost their function

• Gene-like, might have promoter, CpG islands

• Might also have stop codons, repetitive elements, or frameshifts, and/or lack transcription

Validation?

Future Work

BLAST against protein database

Validation for Pseudogene prediction

Expand to 24 samples

ncRNA Prediction Tools

Non-coding RNA Gene Prediction Process

• tRNAscan-SE → tRNA prediction → GTF format
• RNAmmer → rRNA prediction → GTF format
• ARAGORN → tmRNA prediction → GTF format
• Infernal with Rfam → other ncRNA prediction → GTF format

Merge & Resolve Redundancies → ncRNA GTF

tRNAscan-SE v1.3.1

Predicted all tRNA genes plus 1 pseudo tRNA gene in 4 reference genomes

tRNAscan-SE -P -o <output> <sequence>

GTF format

awk 'BEGIN { OFS = "\t" }
NR > 3 {  # skip the three header lines of the tRNAscan-SE table
    if ($3 > $4) { strand = "-"; start = $4; stop = $3 }  # begin > end: minus strand
    else         { strand = "+"; start = $3; stop = $4 }
    print $1, "tRNAscan-SE", "tRNA", start, stop, $9, strand, ".", "type-" $5 " anticodon-" $6
}' <tRNAscan-output>
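
For readers who prefer to do this conversion in a script, the following Python sketch mirrors the awk one-liner above. It assumes the same nine-column tRNAscan-SE table with three header lines (name, tRNA number, begin, end, type, anticodon, intron begin, intron end, score); the function name `trnascan_to_gtf` and the sample rows are hypothetical and not taken from the group's data.

```python
from typing import Iterable, List


def trnascan_to_gtf(lines: Iterable[str], header_lines: int = 3) -> List[str]:
    """Convert tRNAscan-SE tabular output lines into GTF-style lines."""
    records = []
    for line_number, line in enumerate(lines, start=1):
        fields = line.split()
        # Skip the header block and any blank or truncated lines.
        if line_number <= header_lines or len(fields) < 9:
            continue
        begin, end = int(fields[2]), int(fields[3])
        # tRNAscan-SE reports begin > end for genes on the minus strand.
        if begin > end:
            strand, start, stop = "-", end, begin
        else:
            strand, start, stop = "+", begin, end
        attributes = f"type-{fields[4]} anticodon-{fields[5]}"
        records.append("\t".join([
            fields[0], "tRNAscan-SE", "tRNA",
            str(start), str(stop), fields[8], strand, ".", attributes,
        ]))
    return records


# Hypothetical rows, for illustration only.
sample = [
    "Sequence\t\ttRNA\tBounds\ttRNA\tAnti\tIntron Bounds\tCove",
    "Name\ttRNA #\tBegin\tEnd\tType\tCodon\tBegin\tEnd\tScore",
    "--------\t------\t-----\t---\t----\t-----\t-----\t---\t-----",
    "contig_1 \t1\t1000\t1076\tIle\tGAT\t0\t0\t85.30",
    "contig_1 \t2\t5090\t5004\tSer\tTGA\t0\t0\t70.12",
]
for record in trnascan_to_gtf(sample):
    print(record)
```

Keeping the score as the original string, as the awk version does, avoids introducing rounding differences between the two outputs.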

ARAGORN evaluation

Positives

Fast, >2000x as fast as Infernal

Negatives

Low sensitivity

High false positive rate

Contradictory results with lower threshold

Awkward output format

Lack of scores

Infernal 1.1.2 vs ARAGORN

Test Data: 390 tmRNA sequences from tmRDB

Tool             Total Hits   Test Seq Detected   % Test Seq Detected
ARAGORN          310          306                 78.5%
Infernal 1.1.2   373          352                 90.3%

RNAmmer

● rnammer [-options] <fasta sequence>

● ./rnammer -S bac -multi -gff -f -h < fasta sequence

● -S: kingdom (bac, arc, euk)

● -multi: runs all molecules and both strands in parallel

● -gff, -f, -h, -xml: output formats

● -m: molecule type, tsu (5S rRNA), lsu (23S rRNA), ssu (16S rRNA)

Web Interface

RNAmmer result

5S rRNA   23S rRNA   16S rRNA
16        3          3

Infernal 1.1.2 and Rfam 12.2

Reference Sequence:

Commands:

cmscan [-options] <cmdb> <seqfile>

cmscan --tblout ~/tbResult Rfam.cm ~/genome.fa
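
The table written by `--tblout` is whitespace-delimited, so it can be post-processed with a short script. The Python sketch below is one possible reader, assuming the default Infernal 1.1 column order (target name, accession, query name, accession, mdl, mdl from, mdl to, seq from, seq to, strand, trunc, pass, gc, bias, score, E-value, inc, description); check that order against the `#` header lines of your own file. The function name and the sample hit are hypothetical.

```python
from typing import Dict, Iterable, List


def parse_cmscan_tblout(lines: Iterable[str]) -> List[Dict[str, object]]:
    """Parse cmscan --tblout lines into one dictionary per hit."""
    hits = []
    for line in lines:
        # Comment lines start with '#'; blank lines carry no hit.
        if not line.strip() or line.startswith("#"):
            continue
        # The final column (description) may contain spaces, so cap the split.
        fields = line.split(None, 17)
        if len(fields) < 17:
            continue
        seq_from, seq_to = int(fields[7]), int(fields[8])
        hits.append({
            "family": fields[0],
            "accession": fields[1],
            "sequence": fields[2],
            "start": min(seq_from, seq_to),
            "end": max(seq_from, seq_to),
            "strand": fields[9],
            "score": float(fields[14]),
            "evalue": float(fields[15]),
            "significant": fields[16] == "!",  # '!' marks hits above the inclusion threshold
        })
    return hits


# Hypothetical hit line, for illustration only.
sample_tblout = [
    "#target name  accession  query name  accession  mdl ...",
    "tRNA  RF00005  contig_1  -  cm  1  71  5090  5004  -  no  1  0.55  0.0  60.2  1.3e-12  !  tRNA",
]
for hit in parse_cmscan_tblout(sample_tblout):
    print(hit["family"], hit["start"], hit["end"], hit["strand"], hit["score"])
```

Normalising start and end so that start is never larger than end makes the hits easier to convert to GTF alongside the other ncRNA predictions.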

CMSCAN Result

Merging rRNA, tRNA, and CDS Predictions

Flowchart (branches labeled Yes / No), starting from each CDS.

Decision points:

• Overlap rRNA?
• >30bp overlap tRNA?
• Pseudo or atypical tRNA?
• Not hypothetical CDS? / Hypothetical CDS
• tRNA neither pseudo nor atypical?

Outcomes:

• Keep Both
• Keep tRNA only
• Keep CDS only
• Keep CDS
• Discard CDS

Appendix: Genome 1 Comparison Table

Tool           Exact Match  Stop  Overlap  Extra  Missed
GeneMarkS             3796   315      317    158     380
Prodigal              3950   225      199    117     403
GLIMMER               1801   378       20   2552    2574
FGenesB               3691   319      344    468     767
Homology              1451   489       25   2720    2811
Blind Union           4053   236      177    440     338
Overlap Union         3977   219      160    303     447

Appendix: Genome 3 Comparison Table

Tool           Exact Match  Stop  Overlap  Extra  Missed
GeneMarkS             4048   267      298    144     395
Prodigal              4234   184      182     82     408
GLIMMER               2003   457       26   2501    2555
FGenesB               3886   281      325    108     518
Homology              1491   467       18   2652    3031
Blind Union           4345   195      141    431     342
Overlap Union         4258   178      126    290     460

GeneValidator (GV)

Evaluates the quality of gene predictions based on comparisons with similar known proteins from public and private databases.

What does it do?

• BLAST: searches gene prediction sequences against a relevant database
• Comparison with HSPs based on 7 parameters: how do gene predictions deviate from the hit sequences?
• Combines the scores of the individual tests into an overall quality score (0-100)

Final Proposed Pipeline

Assembled Genome

Protein-coding Gene Prediction

Ab Initio

• Prodigal
• Glimmer
• GeneMarkS
• FGenesB
• AMIGene

Homology Based

• BLAST

RNA Gene Prediction

ncRNA Specific

• tRNAscan-SE (tRNA)
• RNAmmer (rRNA)
• Infernal
• ARAGORN (tmRNA)

Filtering

• Pseudogenes and correcting for start codon

Merge

Validation

• GeneValidator
• BLAST

Final Output

Homework is on the Wiki

Under “Exercises”

You have one week to complete it

GO!
