Genome Annotation using MAKER-P at iPlant
Collaboration with Mark Yandell Lab (University of Utah)
www.yandell-lab.org
iPlant: Josh Stein (CSHL), Matt Vaughn (TACC), Dian Jiao (TACC), Zhenyuan Lu (CSHL), Nirav Merchant (U. Arizona)
Carson Holt (Ontario Institute Cancer Research)
Cantarel et al. 2008. Genome Research 18:188
Holt & Yandell. 2011. BMC Bioinformatics 12:491
What Are Annotations?
• Annotations are descriptions of features of the genome
• Structural: exons, introns, UTRs, splice forms, etc.
• Coding & non-coding genes
• Functional: enzymatic activity, expression
• Annotations should include an evidence trail
• Assists in quality control of genome annotations
• Examples of evidence supporting a structural annotation:
• Ab initio gene predictions
• ESTs
• Protein homology
Secondary Annotation
• Protein Domains
• InterProScan: combines many HMM databases
• GO and other ontologies
• Pathway mapping
• E.g. BioCyc Pathway tools
Challenges in Plant Genome Annotation
• Genomes are BIG
• Highly repetitive
• Many pseudogenes
Yet it is important to get it right!
Contamination Issue
Annotation Error
Example: split gene models
Typical Annotation Pipeline
• Contamination screening
• Repeat/TE masking
• Ab initio prediction
• Evidence alignment (cDNA, EST, RNA-seq, protein)
• Evidence-based prediction
• Combiner
• Evaluation/filtering
• Manual curation
Options for Protein-coding Gene Annotation
• MAKER is an easy-to-use annotation pipeline designed to help smaller research groups convert the mountain of genomic data provided by next generation sequencing technologies into a usable resource.
MAKER identifies repeats, aligns ESTs and proteins to a genome, produces ab initio gene predictions, automatically synthesizes these data into gene annotations, and produces evidence-based quality values for downstream annotation management.
Quality Control evaluation of the MAKER-P and TAIR10 datasets using Annotation Edit Distance (AED).
[Figure axis: quality, from better to worse]
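The AED score used above can be sketched in a few lines. This follows the definition in Eilbeck et al. 2009 (cited later in these slides): AED = 1 - (SN + SP) / 2, with sensitivity and specificity counted at the nucleotide level, so 0 means perfect agreement with the evidence and 1 means no overlap. The function names and the interval representation (1-based, inclusive exon coordinates) are assumptions made for illustration, not MAKER's internal code.

```python
def covered_bases(intervals):
    """Return the set of positions covered by 1-based inclusive intervals."""
    bases = set()
    for start, end in intervals:
        bases.update(range(start, end + 1))
    return bases


def annotation_edit_distance(annotation_exons, evidence_exons):
    """AED = 1 - (SN + SP) / 2, computed over nucleotides.

    SN: fraction of evidence bases that the annotation also covers.
    SP: fraction of annotation bases that the evidence also covers.
    """
    annotation = covered_bases(annotation_exons)
    evidence = covered_bases(evidence_exons)
    if not annotation or not evidence:
        return 1.0  # nothing to compare: treat as no support
    shared = len(annotation & evidence)
    sensitivity = shared / len(evidence)
    specificity = shared / len(annotation)
    return 1.0 - (sensitivity + specificity) / 2.0


# Hypothetical gene model whose second exon is only half supported.
model = [(1, 100), (201, 300)]
evidence = [(1, 100), (201, 250)]
print(annotation_edit_distance(model, evidence))  # 0.125
```

Set-based counting is simple but memory-hungry; a real implementation would more likely work on merged intervals.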
MAKER-P MPI Support
Message Passing Interface (MPI) is a communication protocol for computer clusters that essentially allows multiple computers to act like a single powerful machine.
Current evidence
Current Assembly
Annotating the Genome – Apollo View
Identify and Mask Repetitive Elements
• RepeatMasker
  – RepBase
  – Species-specific library
• RepeatRunner
  – MAKER internal protein library
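What masking does to the sequence can be shown with a minimal sketch. It assumes the repeat finder has already reported repeat coordinates as 1-based inclusive intervals. It illustrates the two common conventions: soft masking (lowercase) and hard masking (replace with N). The function name `mask_repeats` is hypothetical, and this is not the code RepeatMasker or MAKER runs.

```python
def mask_repeats(sequence, repeats, soft=True):
    """Mask repeat intervals (1-based, inclusive) in a DNA sequence.

    soft=True  -> lowercase the repeat bases (sequence is preserved)
    soft=False -> replace the repeat bases with 'N'
    """
    chars = list(sequence)
    for start, end in repeats:
        # Clamp to the sequence length so a bad interval cannot overflow.
        for i in range(max(start, 1) - 1, min(end, len(chars))):
            chars[i] = chars[i].lower() if soft else "N"
    return "".join(chars)


print(mask_repeats("ACGTACGTAC", [(3, 6)]))              # ACgtacGTAC
print(mask_repeats("ACGTACGTAC", [(3, 6)], soft=False))  # ACNNNNGTAC
```

Soft masking keeps the underlying bases available to downstream tools, while hard masking hides them entirely.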
Ab initio Predictions
Generate Ab Initio Gene Predictions
• MAKER currently supports:
  – SNAP
  – Augustus
  – GeneMark
  – FGENESH
• Can be run internally or externally
Align EST and Protein Evidence
EST TBLASTX
EST BLASTN
Protein BLASTX
• Identify regions being actively transcribed (i.e. EST data)
• Identify regions with homology to a known protein
Polish BLAST Alignments with Exonerate
Polished protein
Polished EST
• All base pairs must align in order.
• No HSP overlap is permitted.
• Aligns HSPs correctly with respect to splice sites.
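The first two rules above (in-order alignment, no HSP overlap) can be expressed as a simple check. This is a sketch under stated assumptions: the `HSP` record and `is_colinear` function are hypothetical names, coordinates are 1-based inclusive, and everything is on the forward strand. It does not model the splice-site rule and is not Exonerate's algorithm.

```python
from typing import List, NamedTuple


class HSP(NamedTuple):
    """One high-scoring segment pair (hypothetical record for illustration)."""
    query_start: int
    query_end: int
    target_start: int
    target_end: int


def is_colinear(hsps: List[HSP]) -> bool:
    """True if HSPs appear in the same order on query and target
    and no two HSPs overlap on either sequence."""
    ordered = sorted(hsps, key=lambda h: h.query_start)
    for prev, curr in zip(ordered, ordered[1:]):
        if curr.query_start <= prev.query_end:
            return False  # overlap on the query (EST or protein)
        if curr.target_start <= prev.target_end:
            return False  # overlap or wrong order on the genome
    return True


two_exons = [HSP(1, 100, 1000, 1099), HSP(101, 200, 1500, 1599)]
shuffled = [HSP(1, 100, 1500, 1599), HSP(101, 200, 1000, 1099)]
print(is_colinear(two_exons))  # True
print(is_colinear(shuffled))   # False
```

Raw BLAST hits can fail this check, which is why a polishing step is needed before the alignments are used as evidence.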
Hint-based SNAP
Hint-based FGENESH
Pass Gene Finders Evidence-based ‘hints’
*Eilbeck, Moore, Holt & Yandell. 2009. Quantitative Measures for the Management and Comparison of Annotated Genomes. BMC Bioinformatics 10:67. doi:10.1186/1471-2105-10-67
Identify Gene Model Most Consistent with Evidence*
Revise it further if necessary; Create New Annotation
Compute Support for Each Portion of Gene Model
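One simple way to picture "support for each portion" is the fraction of each exon that is overlapped by aligned evidence. The sketch below assumes 1-based inclusive intervals and uses a hypothetical function name, `exon_support`. MAKER's own quality values are richer than this, so treat it as an illustration only.

```python
def exon_support(exons, evidence):
    """For each exon, return the fraction of its bases covered by evidence.

    Both arguments are lists of 1-based inclusive (start, end) intervals.
    """
    covered = set()
    for start, end in evidence:
        covered.update(range(start, end + 1))

    support = []
    for start, end in exons:
        length = end - start + 1
        hits = sum(1 for pos in range(start, end + 1) if pos in covered)
        support.append(hits / length)
    return support


# Hypothetical three-exon model: fully, half, and not supported.
exons = [(1, 100), (201, 300), (401, 500)]
evidence = [(1, 100), (251, 300)]
print(exon_support(exons, evidence))  # [1.0, 0.5, 0.0]
```

Exons with low support are natural candidates for revision or manual curation.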
MAKER-P v2.28 at iPlant
• TACC Lonestar
• Supercomputer with 22,656 CPU
• MPI enabled for parallel computation
• Can complete entire rice genome in ~2 hrs (1,152 cores), 96 CPU per chromosome
• Can complete Aegilops tauschii ALLPATHS-LG assembly in ~8 hrs (1,152 cores)
• Currently being integrated into the iPlant Discovery Environment
• Atmosphere
• MPI enabled for parallel computation
• Maximum instance size 16 CPU
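The "96 CPU per chromosome" figure for the rice run on Lonestar follows from dividing the 1,152 cores evenly across the 12 rice chromosomes. The helper below reproduces that arithmetic. The function name is hypothetical, and the even split is an assumption about how the job was laid out.

```python
def cores_per_unit(total_cores, units):
    """Split cores evenly across work units; return (cores_each, leftover)."""
    if units <= 0:
        raise ValueError("units must be a positive integer")
    return divmod(total_cores, units)


# 1,152 cores shared by the 12 rice chromosomes.
print(cores_per_unit(1152, 12))  # (96, 0)
```

By the same arithmetic, a 16-CPU Atmosphere instance would be equivalent to one sixth of a single chromosome's share in that run.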
Assembly & Annotation at iPlant