Improving pan-genome annotation using whole genome multiple alignment

Raunak Shrestha

27th October 2011

Source: Angiuoli SV, Hotopp JC, Salzberg SL, Tettelin H. Improving pan-genome annotation using whole genome multiple alignment. BMC Bioinformatics. 2011 Jun 30;12:272.

Background

• Describing the genetic diversity of an organism is difficult on the basis of a single reference genome

• Pan-genomes

  • greater intra-specific genetic variation even in closely related strains

• To aid gene prediction and annotation, genome sequences of some closely related strains are required

http://en.wikipedia.org/wiki/File:Pan-genome-graphics.png

Background

Schnoes et al., 2009

The change in misannotation over time in the NR database for the 37 families investigated.

Mugsy-Annotator (http://mugsy.sf.net)

• Steps:

  1. Aligning multiple whole genomes

  2. Mapping orthologs among the genomes

  3. Identifying annotation anomalies

• Objectives: 1) identifying orthologs and 2) evaluating the quality of annotated gene structures in prokaryotic genomes.

Determining Orthologs

• Identifies orthologs on the basis of whole genome alignment (WGA), sequence position and sequence length.

• Expects one segment per organism in the whole genome alignment.

• For segmental duplications:

  • It will report separate ortholog groups for each copy only if the whole genome alignment identifies orthologous copies in other genomes

  • If not, it will not recognize the duplication and will place the copies in a single ortholog group
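The grouping idea on this slide can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Mugsy-Annotator's actual code: it assumes gene coordinates have already been projected into alignment columns, and the `AlignedGene` record, the function `group_orthologs` and the 50% coverage cutoff `min_cov` are all hypothetical. Genes from different genomes that share enough columns of the same aligned block go into one group, with at most one gene per genome.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AlignedGene:
    genome: str
    name: str
    block: int      # id of the aligned block the gene falls in (hypothetical)
    col_start: int  # first alignment column (inclusive)
    col_end: int    # last alignment column (exclusive)

def coverage(a, b):
    """Fraction of the longer gene covered by the shared alignment columns."""
    shared = max(0, min(a.col_end, b.col_end) - max(a.col_start, b.col_start))
    longest = max(a.col_end - a.col_start, b.col_end - b.col_start)
    return shared / longest

def group_orthologs(genes, min_cov=0.5):
    """Greedy grouping: one gene per genome, same block, enough shared columns."""
    groups = []
    for gene in sorted(genes, key=lambda g: (g.block, g.col_start)):
        for group in groups:
            seed = group[0]
            same_block = seed.block == gene.block
            new_genome = all(m.genome != gene.genome for m in group)
            if same_block and new_genome and coverage(seed, gene) >= min_cov:
                group.append(gene)
                break
        else:
            # No compatible group found: start a new one.
            groups.append([gene])
    return groups

# Invented toy data, for illustration only.
genes = [
    AlignedGene("strainA", "a1", 0, 10, 310),
    AlignedGene("strainB", "b1", 0, 13, 310),
    AlignedGene("strainC", "c1", 0, 10, 250),
    AlignedGene("strainA", "a2", 1, 0, 300),   # second copy, aligned in its own block
    AlignedGene("strainB", "b2", 0, 400, 700),
]
groups = group_orthologs(genes)
group_names = [sorted(g.name for g in grp) for grp in groups]
print(group_names)
```

In this toy input the second copy `a2` gets its own group only because it sits in a separate aligned block, which mirrors the segmental duplication rule above.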

Identification of annotation inconsistencies

• Evaluates start codons, stop codons and translation initiation sites (TIS)
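A rough sketch of such a consistency check is shown below. It assumes each gene in an ortholog group is described by the alignment column of its TIS and of its stop codon; the tuple layout, the function names and the flag labels are hypothetical and are not taken from Mugsy-Annotator.

```python
from collections import Counter

def classify_group(group):
    """group: list of (genome, tis_col, stop_col) in alignment columns."""
    tis_cols = {tis for _, tis, _ in group}
    stop_cols = {stop for _, _, stop in group}
    flags = []
    if len(tis_cols) > 1:
        flags.append("inconsistent_TIS")
    if len(stop_cols) > 1:
        flags.append("inconsistent_stop")
    return flags or ["consistent"]

def minority_genomes(group, index):
    """Genomes whose value at `index` differs from the most common one."""
    counts = Counter(rec[index] for rec in group)
    majority, _ = counts.most_common(1)[0]
    return [rec[0] for rec in group if rec[index] != majority]

# Invented toy group: strainC starts 30 columns downstream of the others.
group = [
    ("strainA", 10, 310),
    ("strainB", 10, 310),
    ("strainC", 40, 310),
]
flags = classify_group(group)
outliers = minority_genomes(group, 1)  # index 1 = TIS column
print(flags, outliers)
```

Comparing positions in alignment columns rather than in genome coordinates is what makes the start and stop sites of different strains directly comparable.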

Data set

• Neisseria meningitidis (Nmen) dataset of 20 genomes

• Nmen verA contained 13 genomes

• Nmen verB contained 7 genomes

• Annotation pipeline differs between Nmen verA and Nmen verB

• A genome dataset of 9 other bacterial species from the RefSeq database.

Comparison of the groups of orthologs for 20 Nmen genomes

• Within the genes reported exclusively by any one method

  • Intra-genome BLASTP matches predict many of the genes to be paralogs (40% for Mugsy-Annotator and 60% for OrthoMCL)

  • Some have functional names that indicate transposases

  • Some are hypothetical proteins

• The paper claims that OrthoMCL clusters paralogs and orthologs in a single group
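One simple way to compare the groups reported by two methods is with set operations, as in the sketch below. The gene identifiers and both groupings are invented for illustration and are not results from the paper; `compare_groupings` is a hypothetical helper.

```python
def compare_groupings(groups_a, groups_b):
    """Split ortholog groups into shared ones and those exclusive to one method."""
    a = {frozenset(g) for g in groups_a}
    b = {frozenset(g) for g in groups_b}
    return {"shared": a & b, "only_a": a - b, "only_b": b - a}

# Invented toy groupings: method B merges the paralog a3 into the a2/b2 group.
method_a = [["a1", "b1"], ["a2", "b2"], ["a3"]]
method_b = [["a1", "b1"], ["a2", "b2", "a3"]]

result = compare_groupings(method_a, method_b)
print({key: len(value) for key, value in result.items()})
```

Groups are stored as `frozenset` so that member order does not matter and whole groups can themselves be members of a set.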

Run Time Performance

• Nmen dataset of 20 genomes

• Single CPU in ~4 h

  • ~2 h for WGA with Mugsy

  • ~2 h for comparing annotations with Mugsy-Annotator

• OrthoMCL consumed ~32 CPU hours

• The WGA method is computationally efficient and has a significant runtime advantage over BLAST-based OrthoMCL
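As a quick check of the size of that advantage, the approximate figures on this slide can be put side by side. The numbers are the rounded values quoted above, so the resulting ratio is only indicative; the variable names are illustrative.

```python
# Approximate figures from the slide (Nmen dataset, 20 genomes).
mugsy_wga_h = 2.0         # whole genome alignment with Mugsy, single CPU
mugsy_annotator_h = 2.0   # annotation comparison with Mugsy-Annotator, single CPU
orthomcl_cpu_h = 32.0     # OrthoMCL, total CPU hours

mugsy_total_h = mugsy_wga_h + mugsy_annotator_h
ratio = orthomcl_cpu_h / mugsy_total_h
print(f"Mugsy pipeline: ~{mugsy_total_h:.0f} CPU h; OrthoMCL: ~{orthomcl_cpu_h:.0f} CPU h; ratio ~{ratio:.0f}x")
```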

Consistency of annotated gene structures in several species pan-genomes as reported by Mugsy-Annotator

Improve annotation consistency

• In case of inconsistency in TIS, Mugsy-Annotator suggests alternative gene structures that improve annotation consistency

• Strategy: look for a conserved TIS in close proximity to the previously annotated TIS
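The search strategy can be sketched as a scan over alignment columns near the annotated TIS. This is a simplified illustration, not the tool's implementation: it assumes gapped aligned sequences of equal length, treats ATG, GTG and TTG as start codons, ignores reading frame, and uses a hypothetical `window` parameter for "close proximity".

```python
START_CODONS = {"ATG", "GTG", "TTG"}

def codon_at(aligned_seq, col):
    """Next three ungapped bases starting at an alignment column."""
    if aligned_seq[col] == "-":
        return ""
    bases = aligned_seq[col:].replace("-", "")
    return bases[:3].upper()

def conserved_tis(aligned, annotated_col, window=30):
    """Nearest column to annotated_col where every genome has a start codon."""
    length = len(next(iter(aligned.values())))
    lo = max(0, annotated_col - window)
    hi = min(length - 3, annotated_col + window)
    candidates = [
        col for col in range(lo, hi + 1)
        if all(codon_at(seq, col) in START_CODONS for seq in aligned.values())
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda c: abs(c - annotated_col))

# Invented toy alignment: the start annotated at column 2 is not conserved
# in strainC, but all three strains have a start codon at column 11.
aligned = {
    "strainA": "CCATGAAACCCATGGGTTAA",
    "strainB": "CCATGAAACCCTTGGGTTAA",
    "strainC": "CCCTGAAACCCATGGGTTAA",
}
suggested = conserved_tis(aligned, annotated_col=2, window=12)
print(suggested)
```

A fuller version would also require the candidate start to be in frame with the annotated stop codon.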

Conclusion

• Aids in identifying and comparing gene content across a pan-genome

• Aids annotation and re-annotation of genes within a pan-genome rather than in a single genome

• The study demonstrates significant variation in annotation, due primarily to the different bioinformatics approaches used rather than to true biological variation

• Mugsy-Annotator: an efficient, accurate method for finding orthologs within a pan-genome

• Mugsy (WGA approach) is computationally efficient compared to BLAST-based approaches for finding orthologs

Critique

• Mugsy-Annotator requires pre-predicted annotation information and is therefore not an independent annotation tool

• Mugsy-Annotator still has difficulty resolving segmental duplications and paralogs

• It would have been better if the authors had measured the performance of Mugsy-Annotator on pan-genome datasets spanning larger evolutionary distances.

QUESTIONS?
