User manual revision - SourceForge (anntools.sourceforge.net/pdf/user_manual.pdf)


Post on 03-May-2020


User Manual

Brief technology overview
Availability
Installation
Initial Setup
Database structure
Putative promoter detection
Submit dataset for annotation
    Annotation by RSID
Performance testing
Database update
Custom annotation
Extra tools
    MendelianConsistency
    VCF-Info-Parser
    Pileup-To-VCF
    SampleExtractor
    Tab-To-VCF
    Transition-to-Transversion ratio
    VCF-To-BED
    Filter-Pass
    VCF-2-PED

 

Brief technology overview

Operating system: UNIX/Linux/Mac OS X

Programming languages: Python, SQL

Memory requirements: 256 MB

Pre-requisites:

1. Python 2.6 or later (2.7 recommended, http://www.python.org/download/releases/2.7.2/)

2. MySQL 5.0 or later (5.5 recommended, http://dev.mysql.com/downloads/mysql/)

3. MySQLdb (Python-MySQL driver, http://sourceforge.net/projects/mysql-python/)


If any of the pre-requisites above are missing, please install them before proceeding

further. The materials below assume that all requirements are met and all prerequisites

are already installed. Installation and use examples are given for Unix-like operating

systems.

Availability  

AnnTools is freely available for academic and non-commercial use from

http://anntools.sourceforge.net/. We provide source code, data tables, user manual,

installation and update scripts, demos and a useful set of helper tools.

Installation  

1. Download the “setup.sh.gz” file from the left navigation tab to any directory on your computer where you have full privileges and sufficient free space (at least 100 GB). The shell file is gzipped because some systems may prevent downloading executable files.

2. Unzip it, open it with any text editor and set the MySQL user name and password as appropriate. You must have DROP, CREATE, INSERT and SELECT privileges on the MySQL server to install the data tables.

3. Run the following command:

~$ ./setup.sh

The script will download the software source code and the MySQL dump file, install the software, and create and populate the database. After the installation finishes (it will take several hours), proceed to the “Initial Setup” section.

 

Initial Setup

Modify the “config.txt” file according to your local database settings. To run AnnTools (except Bed-To-Table), an account with SELECT-only privileges is adequate.

Database  structure  

After installation and initial setup are completed, the whole application is ready to use (Figure S1).

Figure S1. Data flow diagram


Table S1 presents a list of the data tables used by AnnTools and the information they

contain. All tables are sorted and indexed to improve the speed of search.


Table S1. Data sources and functionality

 


Putative  promoter  detection    

AnnTools predicts putative promoter regions from two features: a location within 500 bp upstream of the transcription start site and an overlap with a CpG island (Antequera and Bird, 1999). An example of a non-coding SNP mapped to a predicted promoter site is illustrated in Figure S2.

Figure S2. SNP rs2227295 (chr1: 19,638,309), located in the non-coding region of the AKR7A2 gene (chr1: 19,630,459-19,638,640, strand -, NM_003689), overlaps with the promoter region of the PQLC2 gene (chr1: 19,638,740-19,655,793, strand +, NM_001040125, NM_001040126, NM_017765).
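The promoter rule described in this section can be sketched in Python as follows. This is a minimal illustration, not AnnTools' own code: the function names, the strand handling and the CpG island coordinates are assumptions made for the example. Only the 500 bp window and the gene and SNP coordinates are taken from this manual.

```python
def upstream_window(tx_start, tx_end, strand, size=500):
    """Return the (start, end) window reaching `size` bp upstream of the TSS."""
    if strand == "+":
        # On the plus strand the TSS is the transcript start.
        return (tx_start - size, tx_start)
    # On the minus strand the TSS is the transcript end; upstream lies beyond it.
    return (tx_end, tx_end + size)


def overlaps(a, b):
    """True if the closed intervals a and b share at least one position."""
    return a[0] <= b[1] and b[0] <= a[1]


def in_putative_promoter(pos, gene, cpg_islands, size=500):
    """A position counts as promoter-located if it lies in the upstream
    window of the gene and that window overlaps a CpG island."""
    window = upstream_window(gene["start"], gene["end"], gene["strand"], size)
    has_cpg = any(overlaps(window, island) for island in cpg_islands)
    return has_cpg and window[0] <= pos <= window[1]


# PQLC2 coordinates as quoted in the figure caption; the CpG island is hypothetical.
pqlc2 = {"start": 19638740, "end": 19655793, "strand": "+"}
cpg_islands = [(19638000, 19639000)]
snp_in_promoter = in_putative_promoter(19638309, pqlc2, cpg_islands)
```

With these inputs the SNP position falls inside the 500 bp window upstream of PQLC2, which is why a variant in the non-coding part of one gene can also be reported as promoter-located for its neighbor.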

 

Submit  dataset  for  annotation  

We provide four small example files (about 1000 variants each), one each for SNP, INDEL and CNV plus a list of RSIDs, in the “example” subdirectory. You may use them to practice.

AnnTools can run as a standalone application or in a distributed computing environment.

Possible run modes:


1. As a standalone program by running the wrapper shell scripts provided (see

Submit dataset for annotation section). In this mode one genome at a time is

annotated.

2. In a distributed computing environment (client–server architecture) in which the

MySQL database is installed on a dedicated database server optimized for

multiple concurrent connections and queries. Possible use case scenarios include

multiple local users connecting from their desktops or multiple applications

running on multiple nodes such as in a High Performance Computing Cluster

(HPCC). To submit jobs to an HPCC, you will need to write a submission script that matches your cluster's specifications. Because of the variety of schedulers (SGE, PBS, etc.), writing the submission script is left to the local developer.

3. As an Application Programming Interface (API) for your own Python application. Use the “annotate.py” (for SNP and INDEL) or “annotate_cnv.py” Python scripts as a starting point to include AnnTools in your Python application. See also “driver.py”, “driver_indel.py” and “driver_cnv.py” for examples of calling the corresponding functions.

We provide three wrapper shell scripts, “run_snp.sh”, “run_indel.sh” and “run_cnv.sh”, to annotate three types of genomic variants: single nucleotide substitutions (SNP/SNV), short insertions/deletions (INDEL) and structural variation/copy number variation (SV/CNV), respectively. To annotate a dataset, open the corresponding shell file in a text editor and edit the parameters to indicate the path to the dataset you wish to annotate. All three wrapper shells are well commented.


The SNP/SNV and INDEL annotators accept files in VCF and SAMTools pileup formats. Tabular format is also accepted, but requires prior conversion to VCF with the Tab-To-VCF tool (see the Extra tools section). The tabular format for SNP and INDEL requires four columns (CHROM, POS, REF, ALT).

The SV/CNV annotator accepts input in VCF and BED tabular formats. The tabular format for CNV requires three columns (CHROM, START, END). The column name INFO is a reserved word. AnnTools will append annotation information to the column called INFO, if present. If the INFO column is absent, AnnTools will create a column called “INFO” and append it as the rightmost column in the file.

To run, simply issue the command, as appropriate:

~$ ./run_snp.sh

~$ ./run_indel.sh

~$ ./run_cnv.sh

Annotation by RSID

While we envision that most users will be interested in annotation based on genomic coordinates, we also provide the possibility of annotation by RSID. The script “byRSID.py” accepts a one-column file containing the list of RSIDs, searches dbSNP for genomic coordinates and reference and alternative alleles (CHROM, POS, REF, ALT), internally converts the tabular format to VCF and generates two annotated VCF files for SNP separately. RSIDs not found in the current dbSNP release are recorded in the log file. To run, issue the command:

~$ python byRSID.py

on the command line shell and follow the instructions.

 

Performance  testing    

We benchmarked AnnTools against two popular annotation programs (ANNOVAR and

SNP effect predictor ‘snpEff’, snpeff.sourceforge.net) and confirmed that our method is

sufficiently fast with speed comparable to other methods. AnnTools completed SNP

variant prediction for 32K variants in 300 seconds compared to 120 seconds by the

snpEff method and 730 seconds by ANNOVAR (including pre-computed SIFT scores for

all possible whole-genome non-synonymous mutations). It should be noted that

AnnTools and snpEff utilize standard VCF for input and output, whereas ANNOVAR

requires file conversion to its own format.

To evaluate the impact of the expanding size of the database on the annotation speed, we

tested AnnTools performance under different conditions. Annotation of human

chromosome 1 (3500 line VCF file) using regular dbSNP, refGene-big-table-b37 and

refGene tables (465M records in total) took 95 seconds. Annotation of the same file

using specially created, abbreviated tables Chr1_dbSNP, Chr1_refGene-big-table-b37

and Chr1_refGene (45M records in total) took 49 seconds. These results indicate that

a 10-fold expansion of the database causes only a 2-fold increase in annotation time due


to records being sorted and indexed. Further increase in speed can be achieved by

splitting large tables by chromosome, thus decreasing the number of records in each

table.
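The scaling claim can be checked with a few lines of arithmetic. The snippet below only recomputes the ratios from the record counts and timings quoted in this section; the variable names are chosen for the example.

```python
# Record counts and timings as quoted in the benchmark text.
full_records, full_seconds = 465e6, 95.0
chr1_records, chr1_seconds = 45e6, 49.0

size_ratio = full_records / chr1_records   # growth of the database
time_ratio = full_seconds / chr1_seconds   # growth of the annotation time

print("database size ratio: %.1f" % size_ratio)
print("annotation time ratio: %.1f" % time_ratio)
```

The database grows by a factor of about 10.3 while the annotation time grows by a factor of about 1.9, which matches the "10-fold" and "2-fold" figures in the text.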

Database  update    

To ensure easy and timely updates, we provide the db_update utility for aggregate database update. It is available for download from the left navigation bar. Unzip it, change to the “db_update” directory and modify the MySQL login name and password as appropriate. No other modification is required. You must have DROP, CREATE, INSERT and SELECT privileges on the MySQL server to perform the update. To run, issue the command:

~$ ./run_db_update.sh

Only the database tables downloaded and installed from the AnnTools website during the Installation step will be updated. Custom annotation tables (see the Custom annotation section) will not change.

Custom  annotation  

For comprehensive annotation of genomic variants, we assign information on the most

essential tracks of common interest among users such as gene name, cytogenetic band,

presence in promoter region, exon, intron, UTR or intergenic region, coding changes,


type of mutation for both known and novel SNP, the exon number and the total number

of exons in the gene, conserved transcription factor binding sites, segmental duplications,

artifact prone regions, CNV in the Database of Genomic Variants and disease/trait-

associated CNV regions. Furthermore, users can annotate variants with user-specific

tracks by applying extra tools called BED-To-TABLE (bed2table.py) and CUSTOM-

ANNOTATION (custom_annotation.py and custom_annotation_cnv.py). The python

scripts are located in the root of the AnnTools directory.

The user may also create and download custom tracks from the UCSC Genome Browser

for use with AnnTools as follows:

Go to http://genome.ucsc.edu/cgi-bin/hgTables

Select table as appropriate following the steps below (Figures S3 and S4):

assembly: GRCh37/hg19

group: All Tables

database: hg19

table: <table of your choice>

region: genome (or position if you are interested only in specific region)

output format: selected fields from primary and related tables

output file: <name of desired tabular file>

file type returned: gzip compressed for faster download

Click the 'Get Output' button

On the next page select desired fields (Figure S4)

Click the 'get output' button


Figure S3.


Figure S4.

Please note that the columns chrom, chromStart and chromEnd (Figure S5) are required and must be the first, second and third columns, respectively. All other columns will be appended to the “INFO” field in the custom database table to be created. The line of column names must be commented out with the “#” character.

Figure S5.
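The column layout rule for custom tracks can be illustrated with a short Python sketch. It is not the code of bed2table.py: the function name is hypothetical, and the way extra columns are folded into INFO (key=value pairs joined by ";") is an assumption made for the example.

```python
def parse_custom_track(lines):
    """Yield (chrom, start, end, info) from lines of a tab-delimited track.

    The column-name line must start with '#', and the first three columns
    must be chrom, chromStart and chromEnd. Other columns go into INFO.
    """
    names = None
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("#"):
            names = line.lstrip("#").strip().split("\t")
            if names[:3] != ["chrom", "chromStart", "chromEnd"]:
                raise ValueError("first three columns must be chrom, chromStart, chromEnd")
            continue
        if names is None:
            raise ValueError("missing '#'-commented column-name line")
        fields = line.split("\t")
        # Assumed INFO encoding: key=value pairs joined by ';'.
        info = ";".join("%s=%s" % (k, v) for k, v in zip(names[3:], fields[3:]))
        yield fields[0], int(fields[1]), int(fields[2]), info


example = [
    "#chrom\tchromStart\tchromEnd\tname\tscore\n",
    "chr1\t1000\t2000\tmyFeature\t42\n",
]
rows = list(parse_custom_track(example))
```

Running a check like this on a downloaded track before importing it is a cheap way to catch a missing "#" or misordered columns.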


After the custom track has been created in BED format, run “bed2table.py” as follows:

~$ python bed2table.py

where ‘mytable’ is the name of the table you wish to create for the custom track. This script will import the content of the BED file into a persistent MySQL database table. Note that you need to use an account with DROP, CREATE, INSERT and SELECT privileges to run this script.

After the persistent table has been created, you may apply it to annotate as many datasets as you wish by running the “custom_annotation.py” script as follows:

~$ python custom_annotation.py

and follow the instructions on the command line.

To custom annotate CNV, there is a similar tool called “custom_annotation_cnv.py”. To run, issue the command:

~$ python custom_annotation_cnv.py myfile.vcf table format chrind posstartind posendind extout

on the command line shell and follow the instructions.

Extra  tools  

We provide a number of extra tools in the “extra” directory. For greater portability, all tools are designed to be self-contained Python programs without dependencies on any other Python programs. Each script is well commented. These tools are in addition to the two custom annotation tools (Bed-To-Table and CustomAnnotator) described in the Custom annotation section, and we expect their number to grow. At the time of writing they are:

MendelianConsistency

The mendelianConsistency.py tool verifies Mendelian consistency for a family trio. It accepts text files in VCF format (the first nine columns being CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT) and outputs a report of the percentages of consistent variants. To run, issue the command:

~$ python mendelianConsistency.py

on the command line shell and follow the instructions.
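The idea behind a trio consistency check can be sketched as follows. This is a generic illustration rather than the logic of mendelianConsistency.py: it assumes diploid GT values, skips any trio with a missing call, and uses hypothetical function names.

```python
def alleles(gt):
    """Split a VCF GT value such as '0/1' or '1|0' into its two alleles."""
    return gt.replace("|", "/").split("/")


def is_consistent(child, father, mother):
    """True if the child could inherit one allele from each parent."""
    c1, c2 = alleles(child)
    f, m = alleles(father), alleles(mother)
    return (c1 in f and c2 in m) or (c2 in f and c1 in m)


def percent_consistent(trios):
    """Percentage of fully called (child, father, mother) trios that are consistent."""
    called = [t for t in trios if "." not in "".join(t)]
    if not called:
        return 0.0
    ok = sum(1 for t in called if is_consistent(*t))
    return 100.0 * ok / len(called)


trios = [("0/1", "0/0", "1/1"),   # consistent
         ("1/1", "0/0", "0/1"),   # inconsistent: father carries no '1' allele
         ("0/0", "0/1", "0/1"),   # consistent
         ("./.", "0/0", "0/0")]   # skipped: missing call
pct = percent_consistent(trios)
```

Here two of the three fully called sites are consistent, so the reported percentage is about 66.7.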

 

VCF-Info-Parser

The parse_info.py tool accepts text files in VCF format, parses the INFO field and outputs a tabular format that is easy for human readers to inspect. By default it parses all the fields from the VCF INFO column into separate columns, but you can restrict the output to the columns of interest by modifying the KEY list in the header of the Python script. The parse_info_table.py tool is similar to parse_info.py but accepts any data in tabular format. To run, issue the corresponding command:

~$ python parse_info.py

~$ python parse_info_table.py

on the command line shell and follow the instructions.
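As an illustration of what parsing the INFO field involves, the sketch below splits a VCF INFO string into key/value pairs and emits one column per selected key. The function names and the example KEYS selection are hypothetical; the ";" and "=" separators and valueless flags follow the VCF format.

```python
def parse_info(info, keys=None):
    """Turn a VCF INFO string into a dict; flags without '=' map to True."""
    parsed = {}
    if info in ("", "."):
        return parsed
    for item in info.split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
        else:
            key, value = item, True
        if keys is None or key in keys:
            parsed[key] = value
    return parsed


def info_to_columns(info, keys):
    """Return one tab-delimited row with a column per requested key."""
    parsed = parse_info(info, keys)
    return "\t".join(str(parsed.get(k, ".")) for k in keys)


KEYS = ["DP", "AF", "DB"]  # hypothetical selection, in the spirit of the KEY list
row = info_to_columns("DP=14;AF=0.5;DB;H2", KEYS)
```

Keys that are absent from a record are written as "." so that every output row keeps the same number of columns.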

Pileup-To-VCF

The pileup2vcf.py tool accepts the SAMTools variant pileup format and converts it to VCF format. The output (VCF) format has been verified with VCFTools (http://vcftools.sourceforge.net/). To run, issue the command:

~$ python pileup2vcf.py

on the command line shell and follow the instructions.

   

SampleExtractor

The sampleExtractor.py tool accepts text files in VCF format and extracts a specified sample column from the VCF file. The column index must be 10 or greater, as the first 9 columns hold other information (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT). The program outputs the first 9 columns plus the specified sample column. To run, issue the command:

~$ python sampleExtractor.py

on the command line shell and follow the instructions.
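The column selection described above can be sketched in a few lines of Python. This is an illustration under the assumption that the column index is 1-based, as the "10 or greater" rule suggests; the function name is hypothetical and the code is not taken from sampleExtractor.py.

```python
def extract_sample(vcf_lines, column_index):
    """Yield VCF lines reduced to the 9 fixed columns plus one sample column.

    `column_index` is 1-based, so the first sample is column 10.
    """
    if column_index < 10:
        raise ValueError("sample columns start at column 10")
    for line in vcf_lines:
        line = line.rstrip("\n")
        if line.startswith("##"):
            yield line                      # meta-information lines pass through
            continue
        fields = line.split("\t")           # '#CHROM' header and data lines
        yield "\t".join(fields[:9] + [fields[column_index - 1]])


vcf = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2",
    "1\t100\t.\tA\tG\t50\tPASS\t.\tGT\t0/1\t1/1",
]
out = list(extract_sample(vcf, 11))
```

Requesting column 11 keeps the second sample (S2) and drops S1 from both the header line and the data line.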

 

Tab-To-VCF

The tab2vcf.py tool accepts a tabular file format and converts it to VCF format. The tabular format for SNPs and INDELs requires 4 columns (CHROM, POS, REF, ALT). These 4 column names are required for parsing. To run, issue the command:

~$ python tab2vcf.py

on the command line shell and follow the instructions.
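A conversion of this kind can be sketched as below. The sketch is not tab2vcf.py itself: the function name, the VCFv4.1 header line and the choice to fill ID, QUAL, FILTER and INFO with the VCF missing value "." are assumptions for the example.

```python
def tab_to_vcf(lines):
    """Convert tab-delimited rows with CHROM, POS, REF, ALT columns to VCF lines."""
    lines = iter(lines)
    header = next(lines).rstrip("\n").lstrip("#").split("\t")
    # header.index raises ValueError if a required column name is missing.
    idx = dict((name, header.index(name)) for name in ("CHROM", "POS", "REF", "ALT"))
    yield "##fileformat=VCFv4.1"
    yield "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO"
    for line in lines:
        f = line.rstrip("\n").split("\t")
        # ID, QUAL, FILTER and INFO are unknown here, so use the missing value '.'
        yield "\t".join([f[idx["CHROM"]], f[idx["POS"]], ".",
                         f[idx["REF"]], f[idx["ALT"]], ".", ".", "."])


table = ["CHROM\tPOS\tREF\tALT\n", "1\t100\tA\tG\n"]
vcf_lines = list(tab_to_vcf(table))
```

Looking the columns up by name rather than by position shows why the four column names are needed for parsing.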

 

Transition-to-Transversion ratio

TiTvRatio.py accepts text files in VCF format and calculates the ratio of transitions to transversions (Ti/Tv ratio). Transitions are interchanges of purines (A-G) or pyrimidines (C-T). Transversions are interchanges of purine for pyrimidine bases or vice versa. Transitions are less likely to result in amino acid substitutions and are, therefore, more likely to represent “silent” substitutions. To run, issue the command:

~$ python TiTvRatio.py

on the command line shell and follow the instructions. This tool has been tested against the SnpSift TsTv tool (part of the popular snpEff toolset, http://snpeff.sourceforge.net/) with full concordance.

 

VCF-To-BED

vcf2bed.py accepts text files in VCF format or another TAB-delimited text format. Other delimiters (comma, semicolon) can be specified. It generates a Browser Extensible Data (BED) file, which can be submitted to the UCSC Genome Browser for viewing as a custom track. To run, issue the command:

~$ python vcf2bed.py

on the command line shell and follow the instructions.
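The main subtlety in such a conversion is the coordinate system: VCF positions are 1-based, whereas BED uses a 0-based start and an exclusive end. The sketch below shows one way to handle this. The function name and the choice of output columns (chrom, start, end, ID) are assumptions for the example, not a description of vcf2bed.py.

```python
def vcf_to_bed(vcf_lines, delimiter="\t"):
    """Yield BED lines (chrom, start, end, name) for each VCF record."""
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue
        f = line.rstrip("\n").split(delimiter)
        chrom, pos, vid, ref = f[0], int(f[1]), f[2], f[3]
        start = pos - 1                  # VCF is 1-based, BED start is 0-based
        end = start + len(ref)           # BED end is exclusive
        yield "\t".join([chrom, str(start), str(end), vid])


records = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\trs1\tAT\tA\t50\tPASS\t.",
]
bed = list(vcf_to_bed(records))
```

A two-base reference allele at position 100 therefore becomes the BED interval 99 to 101.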

 

Filter-Pass

filter_pass.py accepts a VCF file and filters it according to specified parameters in order to remove low-quality variants. The output is a VCF file containing only the variants that pass the specified criteria. To run, issue the command:

~$ python filter_pass.py

on the command line shell and follow the instructions.
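This manual does not list the filtering parameters, so the sketch below uses two criteria purely as assumptions: the FILTER column must equal PASS and QUAL must reach a minimum value. The function name and the default threshold are hypothetical.

```python
def filter_pass(vcf_lines, min_qual=30.0):
    """Yield header lines and records whose FILTER is PASS and QUAL >= min_qual."""
    for line in vcf_lines:
        if line.startswith("#"):
            yield line                   # keep all header lines unchanged
            continue
        f = line.rstrip("\n").split("\t")
        qual, flt = f[5], f[6]
        if flt == "PASS" and qual != "." and float(qual) >= min_qual:
            yield line


records = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t100\t.\tA\tG\t50\tPASS\t.",
    "1\t200\t.\tC\tT\t10\tPASS\t.",
    "1\t300\t.\tG\tA\t99\tLowQual\t.",
]
kept = list(filter_pass(records))
```

Only the first record survives: the second fails the quality threshold and the third is not marked PASS.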

 

VCF-2-PED

vcf2ped.py accepts a VCF file and a sample information file and generates a PED (pedigree) file that can be used with Plink, as specified at http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml.

Before running the tool you must prepare the sample information file, a text file that contains six columns in the order specified below (please note that the order is important):

FamilyID (0=unknown)

IndividualID (must include all samples in the VCF file; the order of samples is not important)

PaternalID (0=unknown)

MaternalID (0=unknown)

Sex (1=male; 2=female; 0=unknown)

Phenotype (1=control/unaffected, 2=case/affected, 0=unknown)

To run, issue the command:

~$ python vcf2ped.py

on the command line shell and follow the instructions.
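Before running the conversion it can help to validate the sample information file. The sketch below reads the six columns and reports VCF samples that are missing from it. It assumes a whitespace-delimited file without a header line, and both function names are hypothetical; it is not part of vcf2ped.py.

```python
FIELDS = ["FamilyID", "IndividualID", "PaternalID", "MaternalID", "Sex", "Phenotype"]


def read_sample_info(lines):
    """Map IndividualID -> dict of the six pedigree fields."""
    samples = {}
    for line in lines:
        parts = line.split()
        if not parts:
            continue
        if len(parts) != 6:
            raise ValueError("expected 6 columns, got %d" % len(parts))
        record = dict(zip(FIELDS, parts))
        samples[record["IndividualID"]] = record
    return samples


def missing_samples(vcf_header_line, samples):
    """Return VCF sample names (columns 10 and up) absent from the sample info."""
    names = vcf_header_line.rstrip("\n").split("\t")[9:]
    return [n for n in names if n not in samples]


info = read_sample_info(["FAM1 child dad mom 1 2", "FAM1 dad 0 0 1 1"])
header = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tchild\tdad\tmom"
missing = missing_samples(header, info)
```

In this example the sample "mom" appears in the VCF header but not in the sample information, so it is reported as missing.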