User Manual

Contents

Brief technology overview
Availability
Installation
Initial Setup
Database structure
Putative promoter detection
Submit dataset for annotation
    Annotation by RSID
Performance testing
Database update
Custom annotation
Extra tools
    MendelianConsistency
    VCF-Info-Parser
    Pileup-To-VCF
    SampleExtractor
    Tab-To-VCF
    Transition-to-Transversion ratio
    VCF-To-BED
    Filter-Pass
    VCF-2-PED
Brief technology overview
Operating system: UNIX/Linux/Mac OS X
Programming languages: Python, SQL
Memory requirements: 256 MB
Pre-requisites:
1. Python 2.6 or later (2.7 recommended,
http://www.python.org/download/releases/2.7.2/)
2. MySQL 5.0 or later (5.5 recommended http://dev.mysql.com/downloads/mysql/)
3. MySQLdb (Python-MySQL driver http://sourceforge.net/projects/mysql-python/)
If any of the prerequisites above are missing, please install them before proceeding
further. The materials below assume that all requirements are met and all prerequisites
are already installed. Installation and usage examples are given for Unix-like operating
systems.
Availability
AnnTools is freely available for academic and non-commercial use from
http://anntools.sourceforge.net/. We provide source code, data tables, user manual,
installation and update scripts, demos and a useful set of helper tools.
Installation
1. Download the “setup.sh.gz” file from the left navigation tab to any directory on
your computer where you have full privileges and sufficient space (at least
100 GB) available. The shell file is zipped because some systems prevent
downloading executable files.
2. Unzip it, open it with any text editor, and modify the MySQL user name and
password as appropriate. You must have DROP, CREATE, INSERT and SELECT
privileges on the MySQL server in order to install the data tables.
3. Run the following command:
3. Run the following command:
~$ ./setup.sh
The script will download the software source code and the MySQL dump file, install
the software, and create and populate the database. After the installation finishes
(this may take several hours), proceed to the “Initial Setup” section.
Initial Setup
Modify the “config.txt” file according to your local database settings. To run
AnnTools (except Bed-To-Table), an account with SELECT-only privileges is adequate.
Database structure
After the installation is completed, the whole application is ready to use (Figure S1).
Figure S1. Data flow diagram
Table S1 presents a list of the data tables used by AnnTools and the information they
contain. All tables are sorted and indexed to improve the speed of search.
!"#$%&'()*+,-./&0()*$#/&123$)245
6""7%2%*8&7$%#$%+,-.5!""#$$%&
839:0
;61<=*';*"* ->%732"8
047?7%*4
9*@A$#B
1.C9;6A ;D69
A;,
23024%B-$B%7?&142EFB
5
Table S1. Data sources and functionality
Putative promoter detection
Putative promoter regions are predicted by AnnTools based on their typical location
within 500 bp upstream of the transcription start site and their overlap with CpG islands
(Antequera and Bird, 1999). An example of a non-coding SNP mapped to a predicted
promoter site is illustrated in Figure S2.
Figure S2. SNP rs2227295, chr1: 19,638,309 located in the non-coding region of the AKR7A2
gene (chr1: 19,630,459-19,638,640, strand-, NM_003689) overlaps with the promoter region of
the PQLC2 gene (chr1: 19,638,740-19,655,793, strand+, NM_001040125, NM_001040126,
NM_017765).
Submit dataset for annotation
We provide four small files (about 1,000 variants each) for SNP, INDEL and CNV, and a
list of RSIDs, in the “example” subdirectory. You may use them for practice.
AnnTools can run as a standalone application or in a distributed computing environment.
Possible run modes:
1. As a standalone program by running the wrapper shell scripts provided (see
Submit dataset for annotation section). In this mode one genome at a time is
annotated.
2. In a distributed computing environment (client–server architecture) in which the
MySQL database is installed on a dedicated database server optimized for
multiple concurrent connections and queries. Possible use case scenarios include
multiple local users connecting from their desktops or multiple applications
running on multiple nodes such as in a High Performance Computing Cluster
(HPCC). To submit jobs to an HPCC, a submission shell script will be required,
tailored to your HPCC specifications. Because of the multiplicity of possibilities (SGE,
PBS, etc.), writing the submission script is left to the local developer.
3. As an Application Programming Interface (API) to your own Python application.
Use "annotate.py" (for SNP and indel) or “annotate_cnv.py” python scripts as a
starting point to include AnnTools in your python application. See also
‘driver.py’, “driver_indel.py” and “driver_cnv.py” to see examples of calling
corresponding functions.
We provide three wrapper shells to annotate three different types of genomic variants
[single nucleotide substitutions (SNP/SNV), short insertions/deletions (INDEL) and
structural variation/copy number variation (SV/CNV)], called “run_snp.sh”,
“run_indel.sh” and “run_cnv.sh” respectively. To annotate each of them, open the
corresponding shell file in a text editor and edit the parameters as appropriate to
indicate the path to the dataset you wish to annotate. All three wrapper shells are
well commented.
The SNP/SNV and INDEL annotators accept files in VCF and SAMTools pileup formats.
Tabular format is also accepted, but requires prior conversion to VCF with the
Tab-To-VCF tool (see Extra Tools section). The tabular format for SNP and INDEL
requires four columns (CHROM, POS, REF, ALT).
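The required four-column layout can be illustrated with a minimal conversion sketch. The helper function below is hypothetical, shown only to make the column order concrete; use the bundled tab2vcf.py for real data.

```python
# Illustration of the (CHROM, POS, REF, ALT) tabular layout: turn such
# rows into minimal VCF data lines. Not part of AnnTools.

def tab_rows_to_vcf_lines(rows):
    """Turn (CHROM, POS, REF, ALT) tuples into minimal VCF body lines."""
    lines = ["##fileformat=VCFv4.1",
             "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO"]
    for chrom, pos, ref, alt in rows:
        # ID, QUAL, FILTER and INFO are unknown here, so use the VCF
        # missing-value placeholder "."
        lines.append("\t".join([chrom, str(pos), ".", ref, alt, ".", ".", "."]))
    return lines

vcf = tab_rows_to_vcf_lines([("chr1", 19638309, "C", "A")])
print(vcf[-1])
```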
The SV/CNV annotator accepts input in VCF and BED tabular formats. The tabular
format for CNV requires three columns (CHROM, START, END). The column name
INFO is a reserved word. AnnTools will append annotation information to the column
called INFO, if present. If the INFO column is absent, AnnTools will create a column
called “INFO” and append it at the rightmost position in the file.
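The INFO-column behaviour described above can be sketched on a header line as follows. The helper is hypothetical, not AnnTools code:

```python
# Sketch of the INFO-column rule: if the header already has an INFO
# column, annotations go there; otherwise INFO is added as the
# rightmost column. Hypothetical helper for illustration.

def ensure_info_column(header_fields):
    """Return (fields, index of the INFO column), adding INFO if absent."""
    if "INFO" in header_fields:
        return header_fields, header_fields.index("INFO")
    return header_fields + ["INFO"], len(header_fields)

fields, idx = ensure_info_column(["CHROM", "START", "END"])
print(fields, idx)
```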
To run, simply issue the command, as appropriate:
~$ ./run_snp.sh
~$ ./run_indel.sh
~$ ./run_cnv.sh
Annotation by RSID
While we expect that most of the time our users will be interested in annotation
based on genomic coordinates, we also provide the possibility of annotating RSIDs.
The “byRSID.py” script accepts a one-column file containing the list of RSIDs,
searches dbSNP for the genomic coordinates, reference and alternative alleles
(CHROM, POS, REF, ALT), internally converts the tabular format to VCF and
generates two separate annotated VCF files for SNPs. RSIDs not found in the
current dbSNP release are recorded in the log file. To run, issue the command:
~$ python byRSID.py
on the command line shell and follow instructions.
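The lookup-and-log workflow above can be sketched as follows. The dictionary stands in for the real dbSNP table, and its single entry uses made-up REF/ALT alleles; this is an illustration, not the byRSID.py implementation:

```python
# Sketch of the byRSID.py workflow: resolve RSIDs to coordinates and
# alleles, and separate hits from misses (misses go to the log file).
# The lookup table and its alleles are hypothetical, not real dbSNP data.

def resolve_rsids(rsids, dbsnp):
    found, missing = [], []
    for rsid in rsids:
        rec = dbsnp.get(rsid)  # the real script queries MySQL instead
        if rec is None:
            missing.append(rsid)
        else:
            found.append((rsid,) + rec)  # (RSID, CHROM, POS, REF, ALT)
    return found, missing

dbsnp = {"rs2227295": ("chr1", 19638309, "C", "A")}  # made-up alleles
found, missing = resolve_rsids(["rs2227295", "rs0"], dbsnp)
print(found, missing)
```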
Performance testing
We benchmarked AnnTools against two popular annotation programs (ANNOVAR and
SNP effect predictor ‘snpEff’, snpeff.sourceforge.net) and confirmed that our method is
sufficiently fast with speed comparable to other methods. AnnTools completed SNP
variant prediction for 32K variants in 300 seconds compared to 120 seconds by the
snpEff method and 730 seconds by ANNOVAR (including pre-computed SIFT scores for
all possible whole-genome non-synonymous mutations). It should be noted that
AnnTools and snpEff utilize standard VCF for input and output, whereas ANNOVAR
requires file conversion to its own format.
To evaluate the impact of the expanding size of the database on the annotation speed, we
tested AnnTools performance under different conditions. Annotation of human
chromosome 1 (3500 line VCF file) using regular dbSNP, refGene-big-table-b37 and
refGene tables (summarily 465M records) took 95 seconds. Annotation of the same file
using specially created, abbreviated tables Chr1_dbSNP, Chr1_refGene-big-table-b37
and Chr1_refGene (summarily 45M records) took 49 seconds. These results indicate that
a 10-fold expansion of the database causes only a 2-fold increase in annotation time due
to records being sorted and indexed. Further increase in speed can be achieved by
splitting large tables by chromosome, thus decreasing the number of records in each
table.
Database update
To ensure easy and timely updates, we provide the db_update utility for aggregate
database updates. It is available for download from the left navigation bar. Unzip it,
change to the “db_update” directory and modify the MySQL login name and password
as appropriate. No other modification is required. You must have DROP, CREATE,
INSERT and SELECT privileges on the MySQL server in order to perform the update.
To run, issue the command:
~$ run_db_update.sh
Only the database tables downloaded and installed from the AnnTools website during
the Installation step will be updated. Custom annotation tables (see Custom
annotation section) will not change.
Custom annotation
For comprehensive annotation of genomic variants, we assign information on the most
essential tracks of common interest among users such as gene name, cytogenetic band,
presence in promoter region, exon, intron, UTR or intergenic region, coding changes,
type of mutation for both known and novel SNPs, the exon number and the total number
of exons in the gene, conserved transcription factor binding sites, segmental duplications,
artifact prone regions, CNV in the Database of Genomic Variants and disease/trait-
associated CNV regions. Furthermore, users can annotate variants with user-specific
tracks by applying the extra tools BED-To-TABLE (bed2table.py) and CUSTOM-
ANNOTATION (custom_annotation.py and custom_annotation_cnv.py). The Python
scripts are located in the root of the AnnTools directory.
The user may also create and download custom tracks from the UCSC Genome Browser
for use with AnnTools as follows:
Go to http://genome.ucsc.edu/cgi-bin/hgTables
Select table as appropriate following the steps below (Figures S3 and S4):
assembly: GRCh37/hg19
group: All Tables
database: hg19
table: <table of your choice>
region: genome (or position if you are interested only in specific region)
output format: selected fields from primary and related tables
output file: <name of desired tabular file>
file type returned: gzip compressed for faster download
Click the 'Get Output' button
On the next page select desired fields (Figure S4)
Click the 'get output' button
Figure S3.
Figure S4.
Please note that the columns chrom, chromStart and chromEnd (Figure S5) are required
and must be the first, second and third columns, respectively. All other columns will be
appended to the “INFO” field in the custom database table to be created. The
column-names line must be commented out with a “#” character.
Figure S5.
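The column requirements above can be expressed as a small pre-import check. The helper below is hypothetical, not part of bed2table.py:

```python
# Sketch of a sanity check for a custom UCSC track file: the
# column-names line must start with "#", and the first three columns
# must be chrom, chromStart and chromEnd, in that order. The remaining
# columns are the ones that will end up in the INFO field.

def check_custom_track_header(header_line):
    if not header_line.startswith("#"):
        raise ValueError("column-names line must be commented with '#'")
    cols = header_line.lstrip("#").split()
    if cols[:3] != ["chrom", "chromStart", "chromEnd"]:
        raise ValueError("first three columns must be chrom, chromStart, chromEnd")
    return cols[3:]  # these columns go to the INFO field

extra = check_custom_track_header("#chrom\tchromStart\tchromEnd\tscore\tname")
print(extra)
```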
After the custom track has been created in BED format, run ‘bed2table.py’ as follows:
~$ python bed2table.py
where ‘mytable’ is the name of the table you wish to create for the custom track. This
script will import the content of the BED file into a persistent MySQL database table.
Note that you need an account with DROP, CREATE, INSERT and SELECT privileges
to run this script.
After the persistent table has been created, you may apply it to annotate as many datasets
as you wish by running the ‘custom_annotation.py’ script as follows:
~$ python custom_annotation.py
and follow instructions on the command line.
To custom-annotate CNVs, there is a similar tool called ‘custom_annotation_cnv.py’.
To run, issue the command:
~$ python custom_annotation_cnv.py myfile.vcf table format chrind
posstartind posendind extout
on the command line shell and follow instructions.
Extra tools
We provide a number of extra tools in the “extra” directory. For greater portability, all
tools are designed to be self-contained python programs without dependencies on any
other python programs. Each script is well commented. In addition to the two custom
annotation tools (Bed-To-Table and CustomAnnotator) described in the Custom
Annotation section, we expect the number of tools to grow. At the time of writing they
are:
MendelianConsistency
The mendelianConsistency.py tool verifies Mendelian consistency for a family
trio. It accepts text files in VCF format (the first nine columns being CHROM, POS,
ID, REF, ALT, QUAL, FILTER, INFO, FORMAT) and outputs a report with the
percentage of consistent variants. To run, issue the command:
~$ python mendelianConsistency.py
on the command line shell and follow instructions.
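The check at a single biallelic site can be sketched as follows, using VCF-style genotype strings. This is an illustration of the consistency rule, not the mendelianConsistency.py implementation, which operates on whole VCF files:

```python
# Sketch of a Mendelian-consistency check for one site in a
# father/mother/child trio: the child must be able to inherit one
# allele from each parent.

def alleles(gt):
    """Split a VCF genotype string such as '0/1' or '1|1' into a set."""
    return set(gt.replace("|", "/").split("/"))

def is_consistent(father_gt, mother_gt, child_gt):
    f, m, c = alleles(father_gt), alleles(mother_gt), alleles(child_gt)
    # Expand a homozygous child to two identical alleles.
    child = sorted(c) if len(c) == 2 else list(c) * 2
    a, b = child[0], child[-1]
    # Either assignment of the child's alleles to the parents is allowed.
    return (a in f and b in m) or (a in m and b in f)

print(is_consistent("0/0", "0/1", "0/1"))  # consistent
print(is_consistent("0/0", "0/0", "1/1"))  # inconsistent
```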
VCF-Info-Parser
The parse_info_vcf.py tool accepts text files in VCF format, parses the INFO field
and outputs a tabular format for easy viewing by human readers. By default it parses
all the fields from the VCF INFO column into separate columns, but this can be
changed by selecting only the columns of interest, achieved by modifying the KEY list
in the header of the Python script. The parse_info_table.py tool is similar to
parse_info.py but accepts any data in tabular format. To run, issue the
corresponding command:
~$ python parse_info.py
~$ python parse_info_table.py
on the command line shell and follow instructions.
Pileup-To-VCF
The pileup2vcf.py tool accepts SAMTools variant pileup format and converts it to
VCF format. The output (VCF) format has been verified with VCFTools
(http://vcftools.sourceforge.net/). To run, issue the command:
~$ python pileup2vcf.py
on the command line shell and follow instructions.
SampleExtractor
The sampleExtractor.py tool accepts text files in VCF format and extracts a
specified sample column from the VCF file. The column index must be 10 or greater,
as the first 9 columns have other information (CHROM, POS, ID, REF, ALT, QUAL,
FILTER, INFO, FORMAT). Program outputs: 9 first columns plus the specified
sample column. To run issue command:
~$ python sampleExtractor.py
on the command line shell and follow instructions.
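For a single data line, the extraction described above amounts to the following. The helper is hypothetical, not the actual tool:

```python
# Sketch of sample extraction for one VCF data line: keep the first
# nine columns plus one selected sample column (1-based column number,
# 10 or greater).

def extract_sample(vcf_line, sample_index):
    """sample_index is the 1-based VCF column number (>= 10)."""
    if sample_index < 10:
        raise ValueError("sample columns start at column 10")
    fields = vcf_line.rstrip("\n").split("\t")
    return "\t".join(fields[:9] + [fields[sample_index - 1]])

line = "1\t100\t.\tA\tG\t50\tPASS\t.\tGT\t0/1\t1/1"
print(extract_sample(line, 11))
```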
Tab-To-VCF
The tab2vcf.py tool accepts a tabular file and converts it to VCF format. The
tabular format for SNPs and INDELs requires four columns (CHROM, POS, REF, ALT).
These four column names are required for parsing. To run, issue the command:
~$ python tab2vcf.py
on the command line shell and follow instructions.
Transition-to-Transversion ratio
TiTvRatio.py accepts text files in VCF format and calculates the ratio of transitions
to transversions (Ti/Tv ratio). Transitions are interchanges of purines (A-G) or
pyrimidines (C-T). Transversions are interchanges of purine for pyrimidine bases or
vice versa. Transitions are less likely to result in amino acid substitutions and are,
therefore, more likely to represent “silent” substitutions. To run, issue the command:
~$ python TiTvRatio.py
on the command line shell and follow instructions. This tool has been tested against
SnpSift TsTv tool (part of the popular snpEff toolset, http://snpeff.sourceforge.net/)
with full concordance.
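The classification rule above can be sketched for single-nucleotide REF/ALT pairs. This is an illustration only; TiTvRatio.py works on full VCF files:

```python
# Sketch of the Ti/Tv computation: A<->G and C<->T changes are
# transitions, everything else is a transversion.

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def titv_ratio(pairs):
    ti = tv = 0
    for ref, alt in pairs:
        bases = {ref.upper(), alt.upper()}
        if bases == PURINES or bases == PYRIMIDINES:
            ti += 1
        else:
            tv += 1
    return float(ti) / tv if tv else float("inf")

print(titv_ratio([("A", "G"), ("C", "T"), ("A", "C")]))  # 2.0
```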
VCF-To-BED
vcf2bed.py accepts text files in VCF format or other tab-delimited text formats.
Other delimiters (comma, semicolon) can be specified. It generates Browser Extensible
Data (BED) format, which can be submitted to the UCSC Genome Browser for
viewing as a custom track. To run, issue the command:
~$ python vcf2bed.py
on the command line shell and follow instructions.
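The key detail of this conversion is the coordinate shift: VCF POS is 1-based, while BED uses 0-based, half-open intervals. The helper below is a hypothetical sketch of that shift, not the bundled vcf2bed.py:

```python
# Sketch of the VCF-to-BED coordinate conversion:
# chromStart = POS - 1, and the half-open end covers the REF allele.

def vcf_record_to_bed(chrom, pos, ref):
    """Convert a 1-based VCF position and REF allele to a BED interval."""
    start = pos - 1                # BED coordinates are 0-based
    end = start + len(ref)        # half-open end spans the REF allele
    return "%s\t%d\t%d" % (chrom, start, end)

print(vcf_record_to_bed("chr1", 19638309, "C"))
```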
Filter-Pass
filter_pass.py accepts a VCF file and filters it according to specified parameters in
order to remove low-quality variants. The output is a VCF file containing only the
variants that pass the specified criteria. To run, issue the command:
~$ python filter_pass.py
on the command line shell and follow instructions.
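The simplest filtering criterion, keeping only records whose FILTER column equals "PASS", can be sketched as follows. This is an illustration only; filter_pass.py supports additional criteria:

```python
# Sketch of PASS-only filtering: header lines are kept untouched, and
# data lines are kept only if the FILTER column (the 7th VCF column)
# equals "PASS".

def keep_pass_only(lines):
    out = []
    for line in lines:
        if line.startswith("#"):
            out.append(line)                 # keep header lines
        elif line.split("\t")[6] == "PASS":
            out.append(line)                 # keep passing variants
        # all other lines are dropped
    return out

lines = ["#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
         "1\t100\t.\tA\tG\t50\tPASS\t.",
         "1\t200\t.\tC\tT\t10\tq10\t."]
print(keep_pass_only(lines))
```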
VCF-2-PED
vcf2ped.py accepts a VCF file and a sample information file and generates a PED
(pedigree) file, which can be used with PLINK, as specified at
http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml.
The tool accepts a text file in VCF format and a sample information file. Before
running the tool you must prepare a text file containing the six columns below, in
the order specified (please note that the order is important):
FamilyID (0=unknown)
IndividualID (Must include all samples in the VCF file, order is not important)
PaternalID (0=unknown)
MaternalID (0=unknown)
Sex (1=male; 2=female; 0=unknown)
Phenotype (1=control/unaffected, 2=case/affected, 0=unknown)
To run, issue the command:
~$ python vcf2ped.py
on the command line shell and follow instructions.