Improving pan-genome annotation using whole genome multiple alignment
Raunak Shrestha
27th October 2011
Source: Angiuoli SV, Hotopp JC, Salzberg SL, Tettelin H. Improving pan-genome annotation using whole genome multiple alignment. BMC Bioinformatics. 2011 Jun 30;12:272.
Background
• Describing the genetic diversity of an organism is difficult on the basis of a single reference genome
• Pan-genomes: greater intra-specific genetic variation, even in closely related strains
• To aid gene prediction and annotation, genome sequences of several closely related strains are required
http://en.wikipedia.org/wiki/File:Pan-genome-graphics.png
Background
Figure (Schnoes et al., 2009): The change in misannotation over time in the NR database for the 37 families investigated.
Mugsy-Annotator (http://mugsy.sf.net)
• Steps:
1. Aligning multiple whole genomes
2. Mapping orthologs among the genomes
3. Identifying annotation anomalies
• Objectives: 1) identifying orthologs and 2) evaluating the quality of annotated gene structures in prokaryotic genomes.
Determining Orthologs
• Identifies orthologs on the basis of whole genome alignment (WGA), sequence position and sequence length.
• Expects one segment per organism in the whole genome alignment.
• For segmental duplications:
  • It reports separate ortholog groups for each copy only if the whole genome alignment identifies orthologous copies in other genomes
  • If not, it does not recognize the duplication and groups the copies under a single ortholog group
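The idea of mapping orthologs through aligned segments can be illustrated with a small sketch. This is not Mugsy-Annotator's actual algorithm: the names `Interval`, `Gene`, `group_by_alignment` and the `min_cov` threshold are hypothetical, the coordinates are invented, and each alignment block is assumed to span roughly one gene with one segment per genome.

```python
from collections import defaultdict
from typing import NamedTuple


class Interval(NamedTuple):
    genome: str
    start: int  # 0-based, inclusive
    end: int    # 0-based, exclusive


class Gene(NamedTuple):
    gene_id: str
    genome: str
    start: int
    end: int


def overlap_fraction(gene: Gene, seg: Interval) -> float:
    """Fraction of the gene's length covered by an aligned segment."""
    if gene.genome != seg.genome:
        return 0.0
    shared = min(gene.end, seg.end) - max(gene.start, seg.start)
    return max(shared, 0) / (gene.end - gene.start)


def group_by_alignment(blocks, genes, min_cov=0.5):
    """Group genes that are mostly covered by segments of the same block.

    blocks: dict mapping block id -> list of Interval (one per genome).
    min_cov is an assumed coverage cutoff, not a value from the paper.
    """
    groups = defaultdict(list)
    for gene in genes:
        for block_id, segments in blocks.items():
            if any(overlap_fraction(gene, s) >= min_cov for s in segments):
                groups[block_id].append(gene.gene_id)
                break  # assign each gene to at most one block
    return dict(groups)


# Toy data: block2 holds a duplicated copy in genome A with no aligned
# partner in genome B, so it ends up in a group of its own.
blocks = {
    "block1": [Interval("A", 100, 1100), Interval("B", 5000, 6000)],
    "block2": [Interval("A", 3000, 3900)],
}
genes = [
    Gene("A_0001", "A", 150, 1050),
    Gene("B_0042", "B", 5040, 5950),
    Gene("A_0007", "A", 3050, 3850),
]
ortholog_groups = group_by_alignment(blocks, genes)
print(ortholog_groups)
```

In this toy case the two genes covered by block1 form one group, while the extra copy in genome A is reported separately because the alignment placed it in its own block.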
Identification of annotation inconsistencies
• Evaluates start codons, stop codons and translation initiation sites (TIS)
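One way to picture this check is to compare where each gene of an ortholog group starts and stops in alignment coordinates. The sketch below is an illustration under assumptions, not the tool's implementation: `classify_group`, the labels it returns and the column numbers are all hypothetical.

```python
def classify_group(edges):
    """Label an ortholog group by agreement of its gene boundaries.

    edges: dict mapping gene id -> (start_col, stop_col), where both
    values are columns of the multiple alignment (assumed input format).
    """
    starts = {start for start, _ in edges.values()}
    stops = {stop for _, stop in edges.values()}
    if len(starts) == 1 and len(stops) == 1:
        return "consistent"
    if len(stops) == 1:
        return "inconsistent start"
    if len(starts) == 1:
        return "inconsistent stop"
    return "inconsistent start and stop"


# Hypothetical group: all three genes share a stop column, but one
# gene is annotated with a later start than the other two.
group = {
    "A_0001": (12, 912),
    "B_0042": (12, 912),
    "C_0310": (42, 912),
}
label = classify_group(group)
print(label)
```

Groups labelled "inconsistent start" are the natural candidates for a closer look at the annotated TIS.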
Data set
• Neisseria meningitidis (Nmen) dataset of 20 genomes
  • Nmen verA contained 13 genomes
  • Nmen verB contained 7 genomes
  • The annotation pipeline differs between Nmen verA and Nmen verB
• A dataset of genomes from 9 other bacterial species from the RefSeq database.
Comparison of the groups of orthologs for 20 Nmen genomes
• Within the genes reported exclusively by any one method
  • Intra-genome BLASTP matches predict most of the genes to be paralogs (40% for Mugsy-Annotator and 60% for OrthoMCL)
  • Some have functional names that indicate transposases
  • Some are hypothetical proteins
• The paper claims that OrthoMCL clusters paralogs and orthologs in a single group
Run Time Performance
• Nmen dataset of 20 genomes
• Processed on a single CPU in ~4 h
  • ~2 h for WGA with Mugsy
  • ~2 h for comparing annotations with Mugsy-Annotator
• OrthoMCL consumed ~32 CPU hours
• The WGA method is computationally efficient and has a significant runtime advantage over BLAST-based OrthoMCL
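The size of that advantage follows from the approximate figures above. A quick back-of-the-envelope check, treating the single-CPU wall-clock time as CPU hours (variable names are illustrative):

```python
# Approximate timings quoted on the slide for the 20-genome Nmen dataset.
wga_hours = 2.0            # whole genome alignment with Mugsy
annotator_hours = 2.0      # annotation comparison with Mugsy-Annotator
orthomcl_cpu_hours = 32.0  # OrthoMCL on the same dataset

# On a single CPU, wall-clock hours and CPU hours coincide.
mugsy_total = wga_hours + annotator_hours
speedup = orthomcl_cpu_hours / mugsy_total

print(f"Mugsy pipeline: ~{mugsy_total:.0f} CPU hours")
print(f"Approximate advantage over OrthoMCL: ~{speedup:.0f}x")
```

Because all inputs are rounded estimates, the ratio of roughly 8x should be read as an order-of-magnitude indication only.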
Consistency of annotated gene structures in several species' pan-genomes as reported by Mugsy-Annotator
Improve annotation consistency
• When TIS annotations are inconsistent, Mugsy-Annotator suggests alternative gene structures that improve annotation consistency
• Strategy: look for a conserved TIS in close proximity to the previously annotated TIS
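The strategy of searching near the annotated TIS for a conserved start can be sketched as follows. This is a simplified illustration, not the method implemented in Mugsy-Annotator: the function names, the `window` size and the toy sequences are assumptions, ATG, GTG and TTG are taken as the candidate bacterial start codons, and reading-frame checks are omitted for brevity.

```python
START_CODONS = {"ATG", "GTG", "TTG"}  # common bacterial start codons


def codon_at(aligned_seq, col):
    """First three non-gap bases starting at alignment column col."""
    if aligned_seq[col] == "-":
        return ""
    bases = [b for b in aligned_seq[col:] if b != "-"][:3]
    return "".join(bases).upper()


def suggest_conserved_start(aligned, annotated, window=30):
    """Find a column where every genome has a start codon.

    aligned:   dict genome -> gapped sequence (all of equal length)
    annotated: dict genome -> alignment column of the annotated TIS
    window:    assumed maximum distance from each annotated TIS
    Returns the qualifying column closest to the annotated starts,
    or None if no column qualifies.
    """
    length = len(next(iter(aligned.values())))
    best = None
    for col in range(length - 2):
        if any(abs(col - a) > window for a in annotated.values()):
            continue
        if all(codon_at(seq, col) in START_CODONS for seq in aligned.values()):
            cost = sum(abs(col - a) for a in annotated.values())
            if best is None or cost < best[0]:
                best = (cost, col)
    return None if best is None else best[1]


# Toy alignment: genomes A and B are annotated at the GTG in column 8,
# genome C at the ATG in column 2. Only column 2 is a start in all three.
aligned = {
    "A": "CCATGAAAGTGCCC",
    "B": "CCATGAAAGTGCCC",
    "C": "CCATGAAAGCGCCC",
}
annotated = {"A": 8, "B": 8, "C": 2}
suggested = suggest_conserved_start(aligned, annotated)
print(suggested)
```

Here the sketch proposes moving the starts in A and B upstream to the column that is a valid start codon in every genome, which makes the group's annotation consistent.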
Conclusion
• Aids in identifying and comparing gene content across a pan-genome
• Aids annotation and re-annotation of genes within a pan-genome rather than in a single genome
• The study demonstrates significant variation in annotation, primarily due to the different bioinformatics approaches used rather than true biological variation
• Mugsy-Annotator: an efficient, accurate method for finding orthologs within a pan-genome
• Mugsy (WGA approach) is computationally efficient compared to BLAST-based approaches for finding orthologs
Critique
• Mugsy-Annotator requires pre-predicted annotation information and is therefore not an independent annotation tool
• Mugsy-Annotator still has difficulty resolving segmental duplications and paralogs
• It would have been even better if the authors had measured the performance of Mugsy-Annotator on pan-genome datasets with larger evolutionary distances.
QUESTIONS?