genome annotation using maker-p at iplant collaboration with mark yandell lab (university of utah) ...

31
Genome Annotation using MAKER-P at iPlant Collaboration with Mark Yandell Lab (University of Utah) www.yandell-lab.org iPlant: Josh Stein (CSHL) Matt Vaughn (TACC) Dian Jiao (TACC) Zhenyuan Lu (CSHL) Nirav Merchant (U. Arizona) Carson Holt (Ontario Institute Cancer Research) Cantarel et al. 2008. Genome Research 18:188 Holt & Yandell. 2011. BMC Bioinformatics 12:491

Upload: leonard-cain

Post on 28-Dec-2015

221 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: Genome Annotation using MAKER-P at iPlant Collaboration with Mark Yandell Lab (University of Utah)  iPlant: Josh Stein (CSHL) Matt Vaughn

Genome Annotation using MAKER-P at iPlant

Collaboration with Mark Yandell Lab (University of Utah)www.yandell-lab.org

iPlant:Josh Stein (CSHL)Matt Vaughn (TACC)Dian Jiao (TACC)Zhenyuan Lu (CSHL)Nirav Merchant (U. Arizona)

Carson Holt (Ontario Institute Cancer Research)

Cantarel et al. 2008. Genome Research 18:188Holt & Yandell. 2011. BMC Bioinformatics 12:491

Page 2: Genome Annotation using MAKER-P at iPlant Collaboration with Mark Yandell Lab (University of Utah)  iPlant: Josh Stein (CSHL) Matt Vaughn

What Are Annotations?

• Annotations are descriptions of features of the genome• Structural: exons, introns, UTRs, splice forms etc.

• Coding & non-coding genes

• Functional: enzymatic activity, expression

• Annotations should include evidence trail• Assists in quality control of genome annotations

• Examples of evidence supporting a structural annotation:• Ab initio gene predictions

• ESTs

• Protein homology

Page 3: Genome Annotation using MAKER-P at iPlant Collaboration with Mark Yandell Lab (University of Utah)  iPlant: Josh Stein (CSHL) Matt Vaughn

Secondary Annotation

• Protein Domains• InterPro Scan: combines many HMM databases

• GO and other ontologies• Pathway mapping

• E.g. BioCyc Pathway tools

Page 4: Genome Annotation using MAKER-P at iPlant Collaboration with Mark Yandell Lab (University of Utah)  iPlant: Josh Stein (CSHL) Matt Vaughn

Challenges in Plant Genome Annotation

• Genomes are BIG • Highly repetitive• Many pseudogenes

Yet it is important to get it right!

Page 5: Genome Annotation using MAKER-P at iPlant Collaboration with Mark Yandell Lab (University of Utah)  iPlant: Josh Stein (CSHL) Matt Vaughn

Contamination Issue

Page 6: Genome Annotation using MAKER-P at iPlant Collaboration with Mark Yandell Lab (University of Utah)  iPlant: Josh Stein (CSHL) Matt Vaughn

Annotation ErrorExample: split gene models

Page 7: Genome Annotation using MAKER-P at iPlant Collaboration with Mark Yandell Lab (University of Utah)  iPlant: Josh Stein (CSHL) Matt Vaughn

Typical Annotation Pipeline

• Contamination screening• Repeat/TE masking• Ab initio prediction• Evidence alignment (cDNA, EST, RNA-seq,

protein)• Evidence-based prediction• Combiner• Evaluation/filtering• Manual curation

Page 8: Genome Annotation using MAKER-P at iPlant Collaboration with Mark Yandell Lab (University of Utah)  iPlant: Josh Stein (CSHL) Matt Vaughn

Options for Protein-coding Gene Annotation

Page 9: Genome Annotation using MAKER-P at iPlant Collaboration with Mark Yandell Lab (University of Utah)  iPlant: Josh Stein (CSHL) Matt Vaughn

• MAKER is an easy-to-use annotation pipeline designed to help smaller research groups convert the mountain of genomic data provided by next generation sequencing technologies into a usable resource.

Page 10: Genome Annotation using MAKER-P at iPlant Collaboration with Mark Yandell Lab (University of Utah)  iPlant: Josh Stein (CSHL) Matt Vaughn

MAKER identifies repeats, aligns ESTs and proteins to a genome, producesab-initio gene predictions, automatically synthesizes these data into geneannotations, and produces evidence-based quality values for downstream annotation management

Page 11: Genome Annotation using MAKER-P at iPlant Collaboration with Mark Yandell Lab (University of Utah)  iPlant: Josh Stein (CSHL) Matt Vaughn

Quality Control evaluation of the MAKER-P and TAIR10 datasets using Annotation Edit Distance (AED).

Better Quality Worse

Page 12: Genome Annotation using MAKER-P at iPlant Collaboration with Mark Yandell Lab (University of Utah)  iPlant: Josh Stein (CSHL) Matt Vaughn

MAKER-P MPI Support

Message Passing Interface (MPI) is a communication protocol for computer clusters which essentially allows multiple computers to act like a single powerful machine.

Page 13: Genome Annotation using MAKER-P at iPlant Collaboration with Mark Yandell Lab (University of Utah)  iPlant: Josh Stein (CSHL) Matt Vaughn

Current evidence

Current Assembly

Annotating the Genome – Apollo View

Page 14: Genome Annotation using MAKER-P at iPlant Collaboration with Mark Yandell Lab (University of Utah)  iPlant: Josh Stein (CSHL) Matt Vaughn

Current evidence

Current Assembly

Identify and Mask Repetitive Elements

Page 15: Genome Annotation using MAKER-P at iPlant Collaboration with Mark Yandell Lab (University of Utah)  iPlant: Josh Stein (CSHL) Matt Vaughn

Current evidence

Current Assembly

Identify and Mask Repetitive Elements

• RepeatMasker– RepBase– Species specific library

• RepeatRunner– MAKER internal protein library

Page 16: Genome Annotation using MAKER-P at iPlant Collaboration with Mark Yandell Lab (University of Utah)  iPlant: Josh Stein (CSHL) Matt Vaughn

Current evidence

Current Assembly

Identify and Mask Repetitive Elements

Page 17: Genome Annotation using MAKER-P at iPlant Collaboration with Mark Yandell Lab (University of Utah)  iPlant: Josh Stein (CSHL) Matt Vaughn

Current evidence

Current Assembly

Ab initio Predictions

Generate Ab Initio Gene Predictions

Page 18: Genome Annotation using MAKER-P at iPlant Collaboration with Mark Yandell Lab (University of Utah)  iPlant: Josh Stein (CSHL) Matt Vaughn

Current evidence

Current Assembly

Ab initio Predictions

Generate Ab Initio Gene Predictions

• MAKER currently supports:– SNAP– Augustus– GeneMark– FGENESH

• Can be run internally or externally

Page 19: Genome Annotation using MAKER-P at iPlant Collaboration with Mark Yandell Lab (University of Utah)  iPlant: Josh Stein (CSHL) Matt Vaughn

Current evidence

Current Assembly

Ab initio Predictions

Generate Ab Initio Gene Predictions

Page 20: Genome Annotation using MAKER-P at iPlant Collaboration with Mark Yandell Lab (University of Utah)  iPlant: Josh Stein (CSHL) Matt Vaughn

Current evidence

Current Assembly

Ab initio Predictions

Align EST and Protein Evidence

EST TBLASTX

EST BLASTNProtein BLASTX

Page 21: Genome Annotation using MAKER-P at iPlant Collaboration with Mark Yandell Lab (University of Utah)  iPlant: Josh Stein (CSHL) Matt Vaughn

Current evidence

Current Assembly

Ab initio Predictions

Align EST and Protein Evidence

EST TBLASTX

EST BLASTNProtein BLASTX

• Identify regions being actively transcribed (i.e. EST data)• Identify region with homology to a known protein

Page 22: Genome Annotation using MAKER-P at iPlant Collaboration with Mark Yandell Lab (University of Utah)  iPlant: Josh Stein (CSHL) Matt Vaughn

Current evidence

Current Assembly

Ab initio Predictions

Align EST and Protein Evidence

EST TBLASTX

EST BLASTNProtein BLASTX

Page 23: Genome Annotation using MAKER-P at iPlant Collaboration with Mark Yandell Lab (University of Utah)  iPlant: Josh Stein (CSHL) Matt Vaughn

Polish BLAST Alignments with Exonerate

Current evidence

Current Assembly

Ab initio Predictions

Polished protein

Polished EST

Page 24: Genome Annotation using MAKER-P at iPlant Collaboration with Mark Yandell Lab (University of Utah)  iPlant: Josh Stein (CSHL) Matt Vaughn

Polish BLAST Alignments with Exonerate

Current evidence

Current Assembly

Ab initio Predictions

Polished protein

Polished EST

• All base pairs must aligns in order.

• No HSP overlap is permitted

• Aligns HSPs correctly with respect to splice sites.

Page 25: Genome Annotation using MAKER-P at iPlant Collaboration with Mark Yandell Lab (University of Utah)  iPlant: Josh Stein (CSHL) Matt Vaughn

Polish BLAST Alignments with Exonerate

Current evidence

Current Assembly

Ab initio Predictions

Polished protein

Polished EST

Page 26: Genome Annotation using MAKER-P at iPlant Collaboration with Mark Yandell Lab (University of Utah)  iPlant: Josh Stein (CSHL) Matt Vaughn

Current evidence

Current Assembly

Ab initio Predictions

Hint-based SNAP Hint-based FgenesH

Pass Gene Finders Evidence-based ‘hints’

Page 27: Genome Annotation using MAKER-P at iPlant Collaboration with Mark Yandell Lab (University of Utah)  iPlant: Josh Stein (CSHL) Matt Vaughn

Current evidence

Current Assembly

Ab initio Predictions

Hint-based SNAP Hint-based FgenesH

*

*Quantitative Measures for the Management and Comparison of Annotated Genomes Karen Eilbeck , Barry Moore , Carson Holt and Mark Yandell BMC Bioinformatics 2009

10:67doi:10.1186/1471-2105-10-67

Identify Gene Model Most Consistent with Evidence*

Page 28: Genome Annotation using MAKER-P at iPlant Collaboration with Mark Yandell Lab (University of Utah)  iPlant: Josh Stein (CSHL) Matt Vaughn

Current evidence

Current Assembly

Ab initio Predictions*

Revise it further if necessary; Create New Annotation

Page 29: Genome Annotation using MAKER-P at iPlant Collaboration with Mark Yandell Lab (University of Utah)  iPlant: Josh Stein (CSHL) Matt Vaughn

Compute Support for Each Portion of Gene Model

Page 30: Genome Annotation using MAKER-P at iPlant Collaboration with Mark Yandell Lab (University of Utah)  iPlant: Josh Stein (CSHL) Matt Vaughn

MAKER-P v2.28 at iPlant

• TACC Lonestar• Supercomputer with 22,656 CPU

• MPI enabled for parallel computation

• Can complete entire rice genome in ~2 hrs (1,152 cores)96 CPU per chromosome

• Can complete Aegilops tauschii ALLPATHS-LG assembly in ~8 hrs (1,152 cores)

• Currently being integrated into the iPlant Discovery Environment

• Atmosphere • MPI enabled for parallel computation

• Maximum instance size 16 CPU

Page 31: Genome Annotation using MAKER-P at iPlant Collaboration with Mark Yandell Lab (University of Utah)  iPlant: Josh Stein (CSHL) Matt Vaughn

Assembly & Annotation at iPlant