diya: an annotation pipeline for any genomics lab

Do It Yourself Annotator

An annotation pipeline for every genomics lab

Andrew Stewart *, Timothy Read

●Genomics Department, Biological Defense Research Directorate, Naval Medical Research Center, Rockville, Maryland, United States

●Distribution and source code are available at

https://sourceforge.net/projects/diyg/

●Contact: [email protected]

●DIYA is an open source pipeline for the rapid annotation of genomic sequences. The software takes DNA contigs as input, either complete genomes or the result of shotgun sequencing of a genome library, and produces a fully annotated sequence as output.

●The DIYA pipeline is modular and easily expanded to include further forms of feature finding. Each module follows a similar structure, using a standard format for input and output as a conduit between stages in the pipeline. The system demonstrates the usefulness of BioPerl (http://bioperl.org) as a format conversion utility and parser. SGE support allows multiple sequences to be run in parallel.
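
The modular design described above can be pictured as a chain of stages that each read and write one shared record type. This is an illustrative Python sketch and not DIYA's Perl code: the `Feature` and `Record` types, the stage functions, and `run_pipeline` are hypothetical names, and the stage bodies are placeholders standing in for calls to real tools.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Feature:
    kind: str    # e.g. "CDS" or "tRNA"
    start: int   # 1-based, inclusive
    end: int
    source: str  # name of the stage that produced the feature


@dataclass
class Record:
    """The shared format passed between stages (hypothetical)."""
    seq_id: str
    sequence: str
    features: List[Feature] = field(default_factory=list)


Stage = Callable[[Record], Record]


def find_genes(record: Record) -> Record:
    # Placeholder: a real stage would run a gene finder and parse its output.
    end = min(90, len(record.sequence))
    record.features.append(Feature("CDS", 1, end, "find_genes"))
    return record


def find_trnas(record: Record) -> Record:
    # Placeholder: a real stage would run a tRNA finder and parse its output.
    record.features.append(Feature("tRNA", 100, 172, "find_trnas"))
    return record


def run_pipeline(record: Record, stages: List[Stage]) -> Record:
    # Every stage takes a Record and returns a Record, so stages can be
    # added, removed, or reordered without changing the others.
    for stage in stages:
        record = stage(record)
    return record


annotated = run_pipeline(Record("contig1", "ATG" * 100), [find_genes, find_trnas])
print([(f.kind, f.source) for f in annotated.features])
```

Because each stage only depends on the shared record type, adding a new feature finder in this sketch means writing one more function and appending it to the list.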

Background

●“A sequencing center in every genomics lab”

●Thus, an annotation pipeline in every genomics lab

●Need for sequence analysis tools with the decentralization of sequencing technology

Background

●Explosion of tools in the bioinformatics community

●Inconsistent formats, need for ‘pipelining’, BioPerl

Background: BDRD

●454 Life Sciences FLX sequencers

●Push data off onto servers

o Assembly

o Annotation

o Analysis

Outline of the pipeline

●diya-assemble-pseudocontig

●diya-glimmer

●diya-blast

●diya-rfam_scan

●diya-tRNAscan

●Auxiliary scripts

Installation requirements

●Software

o Perl v5+, SGE, MUMmer, Glimmer, BLAST, tRNAscan-SE, Infernal, rfamscan.pl

●Databases

o Protein Clusters, Rfam

●Perl libraries

o BioPerl, Getopt::Long, Data::Dumper, XML::Simple, etc.

Pipeline: diya.pl

●Controller script for the pipeline

●Manages configuration and project data table generation

●Fires off jobs to SGE
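
How a controller might hand stages to SGE can be sketched as follows. The stage names come from the pipeline outline, but their command-line arguments, the job naming scheme, and the chaining of stages with `-hold_jid` are assumptions made for illustration; the actual submission logic of diya.pl is not described in these slides. The sketch only builds `qsub` command lines and does not submit anything.

```python
import shlex
from typing import List, Optional, Sequence

# Stage scripts named in the pipeline outline; their arguments are assumed.
STAGES = ("diya-glimmer", "diya-blast", "diya-rfam_scan", "diya-tRNAscan")


def build_qsub_commands(project: str, fasta: str,
                        stages: Sequence[str] = STAGES) -> List[List[str]]:
    """Build one qsub command per stage, each held until the previous finishes."""
    commands: List[List[str]] = []
    previous: Optional[str] = None
    for stage in stages:
        job_name = f"{project}_{stage}"
        # -N names the job, -cwd runs it in the current directory,
        # -b y treats the command as an executable rather than a job script.
        cmd = ["qsub", "-N", job_name, "-cwd", "-b", "y"]
        if previous is not None:
            # -hold_jid delays this job until the named job has completed.
            cmd += ["-hold_jid", previous]
        cmd += [stage, fasta]  # hypothetical argument: the input FASTA file
        commands.append(cmd)
        previous = job_name
    return commands


for cmd in build_qsub_commands("projA", "projA.fasta"):
    print(shlex.join(cmd))
```

In this sketch, separate projects get independent job chains, so the scheduler is free to run different sequences in parallel while the stages for one sequence run in order.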

Pipeline: Assembly

●Generate a ‘pseudocontig’

●MUMmer v3.20 (http://mummer.sourceforge.net/)
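
One way to picture a pseudocontig is as ordered contigs joined into a single sequence with a spacer, plus a table of offsets so that features can later be mapped back to their contig of origin. This is a sketch under assumptions: the spacer of 100 N characters, the function names, and the premise that contig order has already been decided (for example from a MUMmer alignment against a reference) are illustrative and not taken from DIYA.

```python
from typing import Dict, List, Optional, Tuple

SPACER = "N" * 100  # assumed spacer placed between neighbouring contigs

Span = Tuple[int, int]  # 1-based, inclusive coordinates on the pseudocontig


def build_pseudocontig(contigs: List[Tuple[str, str]]) -> Tuple[str, Dict[str, Span]]:
    """Join (name, sequence) pairs in the given order and record each contig's span."""
    parts: List[str] = []
    spans: Dict[str, Span] = {}
    position = 0  # number of bases emitted so far
    for index, (name, seq) in enumerate(contigs):
        if index > 0:
            parts.append(SPACER)
            position += len(SPACER)
        spans[name] = (position + 1, position + len(seq))
        parts.append(seq)
        position += len(seq)
    return "".join(parts), spans


def to_contig_coords(pos: int, spans: Dict[str, Span]) -> Optional[Tuple[str, int]]:
    """Map a pseudocontig position back to (contig name, position on that contig)."""
    for name, (start, end) in spans.items():
        if start <= pos <= end:
            return name, pos - start + 1
    return None  # the position falls inside a spacer


pseudo, spans = build_pseudocontig([("c1", "ACGT"), ("c2", "GGCC")])
print(len(pseudo), spans, to_contig_coords(106, spans))
```

Keeping the span table alongside the joined sequence is what makes a later disassembly step possible without re-aligning anything.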

Pipeline: Glimmer

●Prediction of gene coding regions

●Glimmer v3.02 (http://www.cbcb.umd.edu/software/glimmer/)

o g3-iterated.csh: two rounds of iteration

●Uses interpolated Markov models to distinguish between coding and non-coding regions
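
The idea of scoring a region with Markov models can be illustrated with a much simpler fixed-order version. Glimmer's interpolated models combine several context lengths; the sketch below uses a single context length with add-one smoothing, so it is a simplified stand-in and not Glimmer's algorithm. The training sequences and function names are made up for illustration.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"


def train(seqs, order=2):
    """Count how often each base follows each context of `order` bases."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in seqs:
        for i in range(order, len(seq)):
            counts[seq[i - order:i]][seq[i]] += 1
    return counts


def log_prob(seq, counts, order=2):
    """Log-probability of seq under the model, with add-one smoothing."""
    total = 0.0
    for i in range(order, len(seq)):
        context = counts.get(seq[i - order:i], {})
        seen = sum(context.values())
        total += math.log((context.get(seq[i], 0) + 1) / (seen + len(ALPHABET)))
    return total


def coding_log_odds(seq, coding, noncoding, order=2):
    """Positive values mean the coding model explains seq better."""
    return log_prob(seq, coding, order) - log_prob(seq, noncoding, order)


# Toy training data, invented for this example.
coding_model = train(["ATGGCGGCGGCGGCGGCG"])
noncoding_model = train(["ATATATATATATATATAT"])
print(coding_log_odds("GCGGCGGCG", coding_model, noncoding_model) > 0)
```

A real gene finder trains on many known genes and scores each reading frame separately; the log-odds comparison between two models is the part this sketch is meant to show.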

Pipeline: Blast

●BLAST v2.2.16 (ftp://ftp.ncbi.nih.gov/blast/)

●Two rounds of BLAST against:

o Reference genome

o Protein Clusters database
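
Combining the two BLAST rounds can be sketched by parsing tabular output and keeping the best hit per query. Assumptions: the results are in the standard 12-column tabular format (as produced by `blastall -m 8`), the E-value cutoff of 1e-5 is arbitrary, and preferring a reference genome hit over a Protein Clusters hit is an illustrative policy that the slides do not specify.

```python
from typing import Dict, Tuple

Hit = Tuple[str, float]  # (subject id, bit score)


def best_hits(tabular_text: str, max_evalue: float = 1e-5) -> Dict[str, Hit]:
    """Return the highest-scoring hit per query from 12-column tabular BLAST output."""
    best: Dict[str, Hit] = {}
    for line in tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue > max_evalue:
            continue  # too weak to use as an annotation
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return best


def merge_rounds(reference_hits: Dict[str, Hit], cluster_hits: Dict[str, Hit]) -> Dict[str, Hit]:
    """Use the reference genome hit when one exists, else the cluster hit."""
    merged = dict(cluster_hits)
    merged.update(reference_hits)
    return merged
```

Usage would be along the lines of `merge_rounds(best_hits(reference_output), best_hits(clusters_output))`, where both arguments are the text of the corresponding BLAST result files.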

Pipeline: rfam_scan

●Identification of ncRNA (rRNA, tRNA)

●Infernal v0.81 (http://infernal.janelia.org/)

●Rfam (http://www.sanger.ac.uk/Software/Rfam/)

●rfamscan.pl v0.1 (http://www.sanger.ac.uk/Users/sgj/code/)

Pipeline: tRNAscan-SE

●Identification of tRNA

●tRNAscan-SE v1.23 (http://lowelab.ucsc.edu/tRNAscan-SE/)

Pipeline: Auxiliary scripts

●Locus tag reordering (cleanup)

●Protein extraction (i.e., PIPA input)

●Pseudocontig disassembly

●Hooks

o Load databases

o Report genome statistics

o WikiLIMS integration
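
Locus tag reordering, the first auxiliary script listed above, can be pictured as sorting features by position and reissuing tags in that order. This is a minimal sketch, assuming features are simple dictionaries and that tags take the form prefix, underscore, zero-padded number; the prefix `DIYA`, the padding width, and the function name are hypothetical.

```python
from typing import Dict, List


def renumber_locus_tags(features: List[Dict], prefix: str = "DIYA",
                        width: int = 5) -> List[Dict]:
    """Return copies of the features, sorted by position, with sequential locus tags."""
    ordered = sorted(features, key=lambda f: (f["start"], f["end"]))
    return [
        dict(f, locus_tag=f"{prefix}_{n:0{width}d}")
        for n, f in enumerate(ordered, start=1)
    ]


features = [
    {"start": 500, "end": 900, "locus_tag": "X_00001"},
    {"start": 10, "end": 300, "locus_tag": "X_00002"},
]
for feature in renumber_locus_tags(features):
    print(feature["locus_tag"], feature["start"], feature["end"])
```

The function returns new dictionaries and leaves the input untouched, which makes it easy to compare tags before and after the cleanup.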

Modularity

●Adding extra modules is rather simple

●Things to come...

o CRISPR elements

o Pseudogenes

o Prophages

Do It Yourself Genomics

●A project community and collection of bioinformatics tools and applications for the analysis of genomic sequence data, with the intent of putting these tools into the hands of small- and medium-scale sequencing labs.

DIYG on disk

●OS (Linux) distribution with DIYG pre-installed

●Simplifies installation, compilation, and ‘prerequisite gathering’

●Run analysis directly on sequencer workstation?

●Easy deployment across a high performance computing cluster

DIYG: Virtual Machine

●Virtualization creates a complete, self-contained deployment of an operating system

●“Disposable” analysis machine

DIYG: Cloud Computing

●Ideal for labs without direct access to an HPC cluster

●Truly an annotation pipeline in every genomics lab

Deployment at BHSAI

●Make sequence annotation available to the wider DOD community

●Concerns about the Perl-based nature of DIYA

●Need to determine HPC guidelines

●Possible integration / hook into PIPA

Deployment at BHSAI

●Conventional installation (integration into existing systems, à la PIPA)

●Sourced from disk image

●Virtualization servers? (if available)
