User Manual

Contents

Brief technology overview
Availability
Installation
Initial Setup
Database structure
Putative promoter detection
Submit dataset for annotation
    Annotation by RSID
Performance testing
Database update
Custom annotation
Extra tools
    MendelianConsistency
    VCF-Info-Parser
    Pileup-To-VCF
    SampleExtractor
    Tab-To-VCF
    Transition-to-Transversion ratio
    VCF-To-BED
    Filter-Pass
    VCF-2-PED
Brief technology overview
Operating system: UNIX/Linux/Mac OS X
Programming languages: Python, SQL
Memory requirements: 256 MB
Pre-requisites:
1. Python 2.6 or later (2.7 recommended,
http://www.python.org/download/releases/2.7.2/)
2. MySQL 5.0 or later (5.5 recommended http://dev.mysql.com/downloads/mysql/)
3. MySQLdb (Python-MySQL driver http://sourceforge.net/projects/mysql-python/)
If any of the prerequisites above are missing, please install them before proceeding
further. The materials below assume that all requirements are met and all prerequisites
are already installed. Installation and usage examples are given for Unix-like operating
systems.
Availability
AnnTools is freely available for academic and non-commercial use from
http://anntools.sourceforge.net/. We provide source code, data tables, user manual,
installation and update scripts, demos and a useful set of helper tools.
Installation
1. Download the “setup.sh.gz” file from the left navigation tab to any directory on
your computer where you have full privileges and sufficient space (at least
100 GB) available. The shell file is zipped because some systems prevent
downloading executable files.
2. Unzip it, open it with any text editor, and modify the MySQL user name and
password as appropriate. You must have DROP, CREATE, INSERT and SELECT
privileges on the MySQL server in order to install the data tables.
3. Run the following command:
3. Run the following command:
~$ ./setup.sh
The script will download the software source code and the MySQL dump file, install
the software, and create and populate the database. After the installation finishes
(this may take several hours), proceed to the “Initial Setup” section.
Initial Setup
Modify the “config.txt” file according to your local database settings. To run
AnnTools (except Bed-To-Table), an account with SELECT-only privileges is adequate.
Database structure
After the installation is completed, the whole application is ready to use (Figure S1).
Figure S1. Data flow diagram
Table S1 presents a list of the data tables used by AnnTools and the information they
contain. All tables are sorted and indexed to improve the speed of search.
!"#$%&'()*+,-./&0()*$#/&123$)245
6""7%2%*8&7$%#$%+,-.5!""#$$%&
839:0
;61<=*';*"* ->%732"8
047?7%*4
9*@A$#B
1.C9;6A ;D69
A;,
23024%B-$B%7?&142EFB
5
Table S1. Data sources and functionality
Putative promoter detection
Putative promoter regions are predicted by AnnTools based on their typical location
within 500 bp upstream of the transcription start site and their overlap with CpG islands
(Antequera and Bird, 1999). An example of a non-coding SNP mapped to a predicted
promoter site is illustrated in Figure S2.
Figure S2. SNP rs2227295, chr1: 19,638,309 located in the non-coding region of the AKR7A2
gene (chr1: 19,630,459-19,638,640, strand-, NM_003689) overlaps with the promoter region of
the PQLC2 gene (chr1: 19,638,740-19,655,793, strand+, NM_001040125, NM_001040126,
NM_017765).
Submit dataset for annotation
We provide four small files (about 1,000 variants each) for SNP, INDEL and CNV, and a
list of RSIDs, in the “example” subdirectory. You may use them for practice.
AnnTools can run as a standalone application or in a distributed computing environment.
Possible run modes:
1. As a standalone program by running the wrapper shell scripts provided (see
Submit dataset for annotation section). In this mode one genome at a time is
annotated.
2. In a distributed computing environment (client–server architecture) in which the
MySQL database is installed on a dedicated database server optimized for
multiple concurrent connections and queries. Possible use case scenarios include
multiple local users connecting from their desktops or multiple applications
running on multiple nodes such as in a High Performance Computing Cluster
(HPCC). To submit jobs to an HPCC, a submission shell script will be required,
tailored to your HPCC specifications. Because of the multiplicity of possibilities (SGE,
PBS, etc.), writing the submission script is left to the local developer.
3. As an Application Programming Interface (API) to your own Python application.
Use "annotate.py" (for SNP and indel) or “annotate_cnv.py” python scripts as a
starting point to include AnnTools in your python application. See also
‘driver.py’, “driver_indel.py” and “driver_cnv.py” to see examples of calling
corresponding functions.
We provide three wrapper shells to annotate three different types of genomic variants
[single nucleotide substitutions (SNP/SNV), short insertions/deletions (INDEL) and
structural variation/copy number variation (SV/CNV)], called “run_snp.sh”,
“run_indel.sh” and “run_cnv.sh” respectively. To annotate each of them, open the
corresponding shell file in a text editor and edit the parameters as appropriate to
indicate the path to the dataset you wish to annotate. All three wrapper shells are
well commented.
The SNP/SNV and INDEL annotators accept files in VCF and SAMTools pileup formats.
Tabular format is also accepted, but requires prior conversion to VCF with the
Tab-To-VCF tool (see Extra Tools section). The tabular format for SNP and INDEL
requires four columns (CHROM, POS, REF, ALT).
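The required four-column layout can be illustrated with a minimal conversion sketch. The helper function below is hypothetical, shown only to make the column order concrete; use the bundled tab2vcf.py for real data.

```python
# Illustration of the (CHROM, POS, REF, ALT) tabular layout: turn such
# rows into minimal VCF data lines. Not part of AnnTools.

def tab_rows_to_vcf_lines(rows):
    """Turn (CHROM, POS, REF, ALT) tuples into minimal VCF body lines."""
    lines = ["##fileformat=VCFv4.1",
             "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO"]
    for chrom, pos, ref, alt in rows:
        # ID, QUAL, FILTER and INFO are unknown here, so use the VCF
        # missing-value placeholder "."
        lines.append("\t".join([chrom, str(pos), ".", ref, alt, ".", ".", "."]))
    return lines

vcf = tab_rows_to_vcf_lines([("chr1", 19638309, "C", "A")])
print(vcf[-1])
```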
The SV/CNV annotator accepts input in VCF and BED tabular formats. The tabular
format for CNV requires three columns (CHROM, START, END). The column name
INFO is a reserved word. AnnTools will append annotation information to the column
called INFO, if present. If the INFO column is absent, AnnTools will create a column
called “INFO” and append it at the rightmost position in the file.
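The INFO-column behaviour described above can be sketched on a header line as follows. The helper is hypothetical, not AnnTools code:

```python
# Sketch of the INFO-column rule: if the header already has an INFO
# column, annotations go there; otherwise INFO is added as the
# rightmost column. Hypothetical helper for illustration.

def ensure_info_column(header_fields):
    """Return (fields, index of the INFO column), adding INFO if absent."""
    if "INFO" in header_fields:
        return header_fields, header_fields.index("INFO")
    return header_fields + ["INFO"], len(header_fields)

fields, idx = ensure_info_column(["CHROM", "START", "END"])
print(fields, idx)
```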
To run, simply issue the command, as appropriate:
~$ ./run_snp.sh
~$ ./run_indel.sh
~$ ./run_cnv.sh
Annotation by RSID
While we expect that most of the time our users will be interested in annotation
based on genomic coordinates, we also provide the possibility of annotating RSIDs.
The “byRSID.py” script accepts a one-column file containing the list of RSIDs,
searches dbSNP for the genomic coordinates, reference and alternative alleles
(CHROM, POS, REF, ALT), internally converts the tabular format to VCF and
generates two separate annotated VCF files for SNPs. RSIDs not found in the
current dbSNP release are recorded in the log file. To run, issue the command:
~$ python byRSID.py
on the command line shell and follow instructions.
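The lookup-and-log workflow above can be sketched as follows. The dictionary stands in for the real dbSNP table, and its single entry uses made-up REF/ALT alleles; this is an illustration, not the byRSID.py implementation:

```python
# Sketch of the byRSID.py workflow: resolve RSIDs to coordinates and
# alleles, and separate hits from misses (misses go to the log file).
# The lookup table and its alleles are hypothetical, not real dbSNP data.

def resolve_rsids(rsids, dbsnp):
    found, missing = [], []
    for rsid in rsids:
        rec = dbsnp.get(rsid)  # the real script queries MySQL instead
        if rec is None:
            missing.append(rsid)
        else:
            found.append((rsid,) + rec)  # (RSID, CHROM, POS, REF, ALT)
    return found, missing

dbsnp = {"rs2227295": ("chr1", 19638309, "C", "A")}  # made-up alleles
found, missing = resolve_rsids(["rs2227295", "rs0"], dbsnp)
print(found, missing)
```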
Performance testing
We benchmarked AnnTools against two popular annotation programs (ANNOVAR and
SNP effect predictor ‘snpEff’, snpeff.sourceforge.net) and confirmed that our method is
sufficiently fast with speed comparable to other methods. AnnTools completed SNP
variant prediction for 32K variants in 300 seconds compared to 120 seconds by the
snpEff method and 730 seconds by ANNOVAR (including pre-computed SIFT scores for
all possible whole-genome non-synonymous mutations). It should be noted that
AnnTools and snpEff utilize standard VCF for input and output, whereas ANNOVAR
requires file conversion to its own format.
To evaluate the impact of the expanding size of the database on the annotation speed, we
tested AnnTools performance under different conditions. Annotation of human
chromosome 1 (3500 line VCF file) using regular dbSNP, refGene-big-table-b37 and
refGene tables (summarily 465M records) took 95 seconds. Annotation of the same file
using specially created, abbreviated tables Chr1_dbSNP, Chr1_refGene-big-table-b37
and Chr1_refGene (summarily 45M records) took 49 seconds. These results indicate that
a 10-fold expansion of the database causes only a 2-fold increase in annotation time due
to records being sorted and indexed. Further increase in speed can be achieved by
splitting large tables by chromosome, thus decreasing the number of records in each
table.
Database update
To ensure easy and timely updates, we provide the db_update utility for aggregate
database updates. It is available for download from the left navigation bar. Unzip it,
change to the “db_update” directory and modify the MySQL login name and password
as appropriate. No other modification is required. You must have DROP, CREATE,
INSERT and SELECT privileges on the MySQL server in order to perform the update.
To run, issue the command:
~$ run_db_update.sh
Only the database tables downloaded and installed from the AnnTools website during
the Installation step will be updated. Custom annotation tables (see Custom
annotation section) will not change.
Custom annotation
For comprehensive annotation of genomic variants, we assign information on the most
essential tracks of common interest among users such as gene name, cytogenetic band,
presence in promoter region, exon, intron, UTR or intergenic region, coding changes,
type of mutation for both known and novel SNPs, the exon number and the total number
of exons in the gene, conserved transcription factor binding sites, segmental duplications,
artifact prone regions, CNV in the Database of Genomic Variants and disease/trait-
associated CNV regions. Furthermore, users can annotate variants with user-specific
tracks by applying the extra tools BED-To-TABLE (bed2table.py) and CUSTOM-
ANNOTATION (custom_annotation.py and custom_annotation_cnv.py). The Python
scripts are located in the root of the AnnTools directory.
The user may also create and download custom tracks from the UCSC Genome Browser
for use with AnnTools as follows:
Go to http://genome.ucsc.edu/cgi-bin/hgTables
Select table as appropriate following the steps below (Figures S3 and S4):
assembly: GRCh37/hg19
group: All Tables
database: hg19
table: <table of your choice>
region: genome (or position if you are interested only in specific region)
output format: selected fields from primary and related tables
output file: <name of desired tabular file>
file type returned: gzip compressed for faster download
Click the 'Get Output' button
On the next page select desired fields (Figure S4)
Click the 'get output' button
Figure S3.
Figure S4.
Please note that the columns chrom, chromStart and chromEnd (Figure S5) are required
and must be the first, second and third columns, respectively. All other columns will be
appended to the “INFO” field in the custom database table to be created. The
column-names line must be commented out with a “#” character.
Figure S5.
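The column requirements above can be expressed as a small pre-import check. The helper below is hypothetical, not part of bed2table.py:

```python
# Sketch of a sanity check for a custom UCSC track file: the
# column-names line must start with "#", and the first three columns
# must be chrom, chromStart and chromEnd, in that order. The remaining
# columns are the ones that will end up in the INFO field.

def check_custom_track_header(header_line):
    if not header_line.startswith("#"):
        raise ValueError("column-names line must be commented with '#'")
    cols = header_line.lstrip("#").split()
    if cols[:3] != ["chrom", "chromStart", "chromEnd"]:
        raise ValueError("first three columns must be chrom, chromStart, chromEnd")
    return cols[3:]  # these columns go to the INFO field

extra = check_custom_track_header("#chrom\tchromStart\tchromEnd\tscore\tname")
print(extra)
```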
After the custom track has been created in BED format, run ‘bed2table.py’ as follows:
~$ python bed2table.py
where ‘mytable’ is the name of the table you wish to create for the custom track. This
script will import the content of the BED file into a persistent MySQL database table.
Note that you need an account with DROP, CREATE, INSERT and SELECT privileges
to run this script.
After the persistent table has been created, you may apply it to annotate as many datasets
as you wish by running the ‘custom_annotation.py’ script as follows:
~$ python custom_annotation.py
and follow instructions on the command line.
To custom-annotate CNVs, there is a similar tool called ‘custom_annotation_cnv.py’.
To run, issue the command:
~$ python custom_annotation_cnv.py myfile.vcf table format chrind
posstartind posendind extout
on the command line shell and follow instructions.
Extra tools
We provide a number of extra tools in the “extra” directory. For greater portability, all
tools are designed to be self-contained python programs without dependencies on any
other python programs. Each script is well commented. In addition to the two custom
annotation tools (Bed-To-Table and CustomAnnotator) described in the Custom
Annotation section, we expect the number of tools to grow. At the time of writing they
are:
MendelianConsistency
The mendelianConsistency.py tool verifies Mendelian consistency for a family
trio. It accepts text files in VCF format (the first nine columns being CHROM, POS,
ID, REF, ALT, QUAL, FILTER, INFO, FORMAT) and outputs a report with the
percentage of consistent variants. To run, issue the command:
~$ python mendelianConsistency.py
on the command line shell and follow instructions.
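The check at a single biallelic site can be sketched as follows, using VCF-style genotype strings. This is an illustration of the consistency rule, not the mendelianConsistency.py implementation, which operates on whole VCF files:

```python
# Sketch of a Mendelian-consistency check for one site in a
# father/mother/child trio: the child must be able to inherit one
# allele from each parent.

def alleles(gt):
    """Split a VCF genotype string such as '0/1' or '1|1' into a set."""
    return set(gt.replace("|", "/").split("/"))

def is_consistent(father_gt, mother_gt, child_gt):
    f, m, c = alleles(father_gt), alleles(mother_gt), alleles(child_gt)
    # Expand a homozygous child to two identical alleles.
    child = sorted(c) if len(c) == 2 else list(c) * 2
    a, b = child[0], child[-1]
    # Either assignment of the child's alleles to the parents is allowed.
    return (a in f and b in m) or (a in m and b in f)

print(is_consistent("0/0", "0/1", "0/1"))  # consistent
print(is_consistent("0/0", "0/0", "1/1"))  # inconsistent
```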
VCF-Info-Parser
The parse_info_vcf.py tool accepts text files in VCF format, parses the INFO field
and outputs a tabular format for easy viewing by human readers. By default it parses
all the fields from the VCF INFO column into separate columns, but this can be
changed by selecting only the columns of interest, achieved by modifying the KEY list
in the header of the Python script. The parse_info_table.py tool is similar to
parse_info.py but accepts any data in tabular format. To run, issue the
corresponding command:
~$ python parse_info.py
~$ python parse_info_table.py
on the command line shell and follow instructions.
Pileup-To-VCF
The pileup2vcf.py tool accepts SAMTools variant pileup format and converts it to
VCF format. The output (VCF) format has been verified with VCFTools
(http://vcftools.sourceforge.net/). To run, issue the command:
~$ python pileup2vcf.py
on the command line shell and follow instructions.
SampleExtractor
The sampleExtractor.py tool accepts text files in VCF format and extracts a
specified sample column from the VCF file. The column index must be 10 or greater,
as the first 9 columns have other information (CHROM, POS, ID, REF, ALT, QUAL,
FILTER, INFO, FORMAT). Program outputs: 9 first columns plus the specified
sample column. To run issue command:
~$ python sampleExtractor.py
on the command line shell and follow instructions.
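For a single data line, the extraction described above amounts to the following. The helper is hypothetical, not the actual tool:

```python
# Sketch of sample extraction for one VCF data line: keep the first
# nine columns plus one selected sample column (1-based column number,
# 10 or greater).

def extract_sample(vcf_line, sample_index):
    """sample_index is the 1-based VCF column number (>= 10)."""
    if sample_index < 10:
        raise ValueError("sample columns start at column 10")
    fields = vcf_line.rstrip("\n").split("\t")
    return "\t".join(fields[:9] + [fields[sample_index - 1]])

line = "1\t100\t.\tA\tG\t50\tPASS\t.\tGT\t0/1\t1/1"
print(extract_sample(line, 11))
```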
Tab-To-VCF
The tab2vcf.py tool accepts a tabular file and converts it to VCF format. The
tabular format for SNPs and INDELs requires four columns (CHROM, POS, REF, ALT).
These four column names are required for parsing. To run, issue the command:
~$ python tab2vcf.py
on the command line shell and follow instructions.
Transition-to-Transversion ratio
TiTvRatio.py accepts text files in VCF format and calculates the ratio of transitions
to transversions (Ti/Tv ratio). Transitions are interchanges of purines (A-G) or
pyrimidines (C-T). Transversions are interchanges of purine for pyrimidine bases or
vice versa. Transitions are less likely to result in amino acid substitutions and are,
therefore, more likely to represent “silent” substitutions. To run, issue the command:
~$ python TiTvRatio.py
on the command line shell and follow instructions. This tool has been tested against
SnpSift TsTv tool (part of the popular snpEff toolset, http://snpeff.sourceforge.net/)
with full concordance.
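The classification rule above can be sketched for single-nucleotide REF/ALT pairs. This is an illustration only; TiTvRatio.py works on full VCF files:

```python
# Sketch of the Ti/Tv computation: A<->G and C<->T changes are
# transitions, everything else is a transversion.

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def titv_ratio(pairs):
    ti = tv = 0
    for ref, alt in pairs:
        bases = {ref.upper(), alt.upper()}
        if bases == PURINES or bases == PYRIMIDINES:
            ti += 1
        else:
            tv += 1
    return float(ti) / tv if tv else float("inf")

print(titv_ratio([("A", "G"), ("C", "T"), ("A", "C")]))  # 2.0
```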
VCF-To-BED
vcf2bed.py accepts text files in VCF format or other tab-delimited text formats.
Other delimiters (comma, semicolon) can be specified. It generates Browser Extensible
Data (BED) format, which can be submitted to the UCSC Genome Browser for
viewing as a custom track. To run, issue the command:
~$ python vcf2bed.py
on the command line shell and follow instructions.
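The key detail of this conversion is the coordinate shift: VCF POS is 1-based, while BED uses 0-based, half-open intervals. The helper below is a hypothetical sketch of that shift, not the bundled vcf2bed.py:

```python
# Sketch of the VCF-to-BED coordinate conversion:
# chromStart = POS - 1, and the half-open end covers the REF allele.

def vcf_record_to_bed(chrom, pos, ref):
    """Convert a 1-based VCF position and REF allele to a BED interval."""
    start = pos - 1                # BED coordinates are 0-based
    end = start + len(ref)        # half-open end spans the REF allele
    return "%s\t%d\t%d" % (chrom, start, end)

print(vcf_record_to_bed("chr1", 19638309, "C"))
```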
Filter-Pass
filter_pass.py accepts a VCF file and filters it according to specified parameters in
order to remove low-quality variants. The output is a VCF file containing only the
variants that pass the specified criteria. To run, issue the command:
~$ python filter_pass.py
on the command line shell and follow instructions.
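The simplest filtering criterion, keeping only records whose FILTER column equals "PASS", can be sketched as follows. This is an illustration only; filter_pass.py supports additional criteria:

```python
# Sketch of PASS-only filtering: header lines are kept untouched, and
# data lines are kept only if the FILTER column (the 7th VCF column)
# equals "PASS".

def keep_pass_only(lines):
    out = []
    for line in lines:
        if line.startswith("#"):
            out.append(line)                 # keep header lines
        elif line.split("\t")[6] == "PASS":
            out.append(line)                 # keep passing variants
        # all other lines are dropped
    return out

lines = ["#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
         "1\t100\t.\tA\tG\t50\tPASS\t.",
         "1\t200\t.\tC\tT\t10\tq10\t."]
print(keep_pass_only(lines))
```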
VCF-2-PED
vcf2ped.py accepts a VCF file and a sample information file and generates a PED
(pedigree) file, which can be used with PLINK, as specified at
http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml.
The tool accepts a text file in VCF format and a sample information file. Before
running the tool you must prepare a text file containing the six columns below, in
the order specified (please note that the order is important):
FamilyID (0=unknown)
IndividualID (Must include all samples in the VCF file, order is not important)
PaternalID (0=unknown)
MaternalID (0=unknown)
Sex (1=male; 2=female; 0=unknown)
Phenotype (1=control/unaffected, 2=case/affected, 0=unknown)
To run, issue the command:
~$ python vcf2ped.py
on the command line shell and follow instructions.