abstract although transposable elements (tes) were discovered over 50 years ago, the robust...

1
Abstract Although transposable elements (TEs) were discovered over 50 years ago, the robust discovery of them in newly sequenced genomes remains a difficult problem. Numerous types with different structural characteristics, sequence degradation, multiple insertions within existing elements, and co-option by the organism’s regulatory system are some of the issues confounding the discovery process. We have developed an automated pipeline employing a homology-based approach, complemented with de novo- and structure-based approaches, to discover and annotate TEs in invertebrate genomes. Once fully automated, our pipeline will be integrated with VectorBase, an NIAID Bioinformatics Resource Center for invertebrate vectors of human pathogens, to produce a first-pass discovery and annotation of TEs for newly sequenced genomes. Currently hosting five organisms with more on the way, VectorBase provides the Ensembl genome browser, computational tools, and other data specific to the study of invertebrate vectors. The annotation component of our pipeline includes enhancements to the Ensembl genome browser, elevating the importance of TEs by displaying genomic location, structural details, alignments with consensus TEs, and homology with other organisms. VectorBase has developed a community annotation system whereby the research community can upload annotation corrections to genes for curation and broad dissemination; we plan to extend this to TEs. We hope this will provide an invaluable resource for researchers studying the biology of TEs and their genomic impact. Ryan C. Kennedy 1,2 , Maria F. Unger 1,3 , Scott Christley 4 , Jenica L. Abrudan 1,3 , Neil F. Lobo 1,3 , Greg Madey 1,2 , Frank H. Collins 1,2,3 1 Eck Institute for Global Health, University of Notre Dame 2 Department of Computer Science and Engineering, University of Notre Dame 3 Department of Biological Sciences, University of Notre Dame 4 Department of Mathematics & Department of Computer Science, University of California, Irvine The VectorBase project is funded by the US National Institute of Allergy and Infectious Diseases (NIAID), contract HHSN266200400039C. TE Discovery TEs are difficult to thoroughly characterize because of their complex and varying structure (or lack thereof). Most current TE discovery techniques fall into the following categories: homology-based, structure-based, and de novo. Popular tools exist within each of these categories, yet most are not automated or easily accessible for all researchers. We have developed a semi-automated discovery pipeline that utilizes a homology-based approach and is complemented with de novo and structure-based components. Our pipeline is reliant on several well-known technologies, including BLAST, Perl (and BioPerl), and DNASTAR SeqMan II. We also require a library of representative TEs, which we obtain from Repbase, TEfam, and the literature. VectorBase VectorBase is an NIAID bioinformatics resource center that serves as a web-based facilitator to information and tools pertaining to invertebrate vectors of human pathogens. VectorBase currently houses genome information for the mosquito species Aedes aegypti, Anopheles gambiae, and Culex quinquefasciatus, as well as the body louse, Pediculus humanus, and the tick, Ixodes scapularis. Current features and capabilities include: Integrated use of the Ensembl genome browser Integrated tools, including BLAST, ClustalW, and HMMER Community annotation pipeline for genes Microarray and gene expression repository Controlled vocabularies TE Discovery Pipeline Our homology-based TE discovery pipeline can be broken down into the following steps and is also shown graphically in Figure 2: 1. BLAST representative sequences against the genome Sequences are individually blasted against the genome 2. Process results files and extract hits Hits that are within a prespecified threshold are combined and represented as a single hit Hits that do not meet a minimum length threshold are ignored Corresponding sequence from the genome is extracted, including flanking sequence 3. Assemble sequences Sequences are assembled into contigs Biologists familiar with TEs manually determine TE boundaries and consensus sequences are generated 4. BLAST results from step 3 against the genome and characterize results Generated consensus sequences are blasted against the genome to determine coverage Hits are then analyzed by scripts and TEs are annotated Community Gene Annotation Annotation is the process by which meaning is given to genomic data. Ensembl’s automatic gene annotation system is one of the better-known gene annotation systems. VectorBase currently hosts a community annotation pipeline, whereby registered users of the site can contribute annotation data for one of the hosted genomes. VectorBase can accept four types of annotation information: gene models, publications, controlled vocabulary terms, and comments. The following steps are taken by users and curators to submit gene models: 1. Users download, fill out, and upload the gene submission form through VectorBase 2. Users preview and submit the data 3. Curators can then approve the data 4. If approved, genes are integrated into the manual annotation DAS track and displayed in the genome browser, shown in Figure 1 Figure 2. Simplified visual diagram of homology-based discovery pipeline. Community TE Annotation While not yet fully implemented on VectorBase, annotation of TEs on VectorBase will follow the same general steps as genes and TEs will be shown within the genome browser. Current work has led to a means to store consensus TEs in the same Chado database schema as genes and also to provide a structural display of TEs. Current TE online repositories traditionally lack this structural display as well as the user-feedback system that VectorBase employs. Additionally, BLAST will be utilized to provide a mechanism to show coverage of TEs within a genome. Figure 1 graphically shows the information flow for TE community annotation on VectorBase. References D. Lawson, et al., VectorBase: a data resource for invertebrate vector genomics. Nucleic Acids Research, 37:D58307, 2009. Repbase. http://www.girinst.org/repbase/index.html. TEfam. http://tefam.biochem.vt.edu. Goal We aim to provide an automatic and easy-to-use method, integrated with VectorBase, to identify and annotate TEs in invertebrate genomes. Figure 1. Information flow diagram for TE annotation. Discovery and Annotation of Transposable Elements on VectorBase http://www.vectorbase.org

Upload: johnathan-williamson

Post on 11-Jan-2016

218 views

Category:

Documents


2 download

TRANSCRIPT

Page 1: Abstract Although transposable elements (TEs) were discovered over 50 years ago, the robust discovery of them in newly sequenced genomes remains a difficult

Abstract

Although transposable elements (TEs) were discovered over 50 years ago, the robust discovery of them in newly sequenced genomes remains a difficult problem. Numerous types with different structural characteristics, sequence degradation, multiple insertions within existing elements, and co-option by the organism’s regulatory system are some of the issues confounding the discovery process.

We have developed an automated pipeline employing a homology-based approach, complemented with de novo- and structure-based approaches, to discover and annotate TEs in invertebrate genomes. Once fully automated, our pipeline will be integrated with VectorBase, an NIAID Bioinformatics Resource Center for invertebrate vectors of human pathogens, to produce a first-pass discovery and annotation of TEs for newly sequenced genomes. Currently hosting five organisms with more on the way, VectorBase provides the Ensembl genome browser, computational tools, and other data specific to the study of invertebrate vectors.

The annotation component of our pipeline includes enhancements to the Ensembl genome browser, elevating the importance of TEs by displaying genomic location, structural details, alignments with consensus TEs, and homology with other organisms. VectorBase has developed a community annotation system whereby the research community can upload annotation corrections to genes for curation and broad dissemination; we plan to extend this to TEs. We hope this will provide an invaluable resource for researchers studying the biology of TEs and their genomic impact.

Ryan C. Kennedy1,2, Maria F. Unger1,3, Scott Christley4,Jenica L. Abrudan1,3, Neil F. Lobo1,3, Greg Madey1,2, Frank H. Collins1,2,3

1Eck Institute for Global Health, University of Notre Dame2Department of Computer Science and Engineering, University of Notre Dame

3Department of Biological Sciences, University of Notre Dame4Department of Mathematics & Department of Computer Science, University of California, Irvine

The VectorBase project is funded by the US National Institute of Allergy and Infectious Diseases (NIAID), contract HHSN266200400039C.

TE Discovery

TEs are difficult to thoroughly characterize because of their complex and varying structure (or lack thereof). Most current TE discovery techniques fall into the following categories: homology-based, structure-based, and de novo. Popular tools exist within each of these categories, yet most are not automated or easily accessible for all researchers. We have developed a semi-automated discovery pipeline that utilizes a homology-based approach and is complemented with de novo and structure-based components. Our pipeline is reliant on several well-known technologies, including BLAST, Perl (and BioPerl), and DNASTAR SeqMan II. We also require a library of representative TEs, which we obtain from Repbase, TEfam, and the literature.

VectorBase

VectorBase is an NIAID bioinformatics resource center that serves as a web-based facilitator to information and tools pertaining to invertebrate vectors of human pathogens. VectorBase currently houses genome information for the mosquito species Aedes aegypti, Anopheles gambiae, and Culex quinquefasciatus, as well as the body louse, Pediculus humanus, and the tick, Ixodes scapularis. Current features and capabilities include:

• Integrated use of the Ensembl genome browser

• Integrated tools, including BLAST, ClustalW, and HMMER

• Community annotation pipeline for genes

• Microarray and gene expression repository

• Controlled vocabularies TE Discovery Pipeline

Our homology-based TE discovery pipeline can be broken down into the following steps and is also shown graphically in Figure 2:

1. BLAST representative sequences against the genome

• Sequences are individually blasted against the genome

2. Process results files and extract hits

• Hits that are within a prespecified threshold are combined and represented as a single hit

• Hits that do not meet a minimum length threshold are ignored

• Corresponding sequence from the genome is extracted, including flanking sequence

3. Assemble sequences

• Sequences are assembled into contigs

• Biologists familiar with TEs manually determine TE boundaries and consensus sequences are generated

4. BLAST results from step 3 against the genome and characterize results

• Generated consensus sequences are blasted against the genome to determine coverage

• Hits are then analyzed by scripts and TEs are annotated

Community Gene Annotation

Annotation is the process by which meaning is given to genomic data. Ensembl’s automatic gene annotation system is one of the better-known gene annotation systems. VectorBase currently hosts a community annotation pipeline, whereby registered users of the site can contribute annotation data for one of the hosted genomes. VectorBase can accept four types of annotation information: gene models, publications, controlled vocabulary terms, and comments. The following steps are taken by users and curators to submit gene models:

1. Users download, fill out, and upload the gene submission form through VectorBase

2. Users preview and submit the data

3. Curators can then approve the data

4. If approved, genes are integrated into the manual annotation DAS track and displayed in the genome browser, shown in Figure 1

Figure 2. Simplified visual diagram of homology-based discovery pipeline.

Community TE Annotation

While not yet fully implemented on VectorBase, annotation of TEs on VectorBase will follow the same general steps as genes and TEs will be shown within the genome browser. Current work has led to a means to store consensus TEs in the same Chado database schema as genes and also to provide a structural display of TEs. Current TE online repositories traditionally lack this structural display as well as the user-feedback system that VectorBase employs. Additionally, BLAST will be utilized to provide a mechanism to show coverage of TEs within a genome. Figure 1 graphically shows the information flow for TE community annotation on VectorBase.

References

D. Lawson, et al., VectorBase: a data resource for invertebrate vector genomics. Nucleic Acids Research, 37:D58307, 2009.

Repbase. http://www.girinst.org/repbase/index.html.

TEfam. http://tefam.biochem.vt.edu.

Goal

We aim to provide an automatic and easy-to-use method, integrated with VectorBase, to identify and annotate TEs in invertebrate genomes.

Figure 1. Information flow diagram for TE annotation.

Discovery and Annotation of Transposable Elements on VectorBase

http://www.vectorbase.org