

ESTAP Installation And Operation Guide

I. Hardware and Software Requirements

1. Operating Systems
The ESTAP pipeline analysis programs have been developed on Linux. The ESTAP Web programs are platform independent.

2. Programming Language
The ESTAP pipeline analysis programs (except the InterProScan wrapper programs) are written in C/Pro*C and shell scripts. The InterProScan wrapper programs are written in Java. The ESTAP Web programs are written in Java (Servlets and Applets).

3. Database Management
The current ESTAP database system is Oracle 9i.

4. Web Browser
The system supports Netscape 4.7 and above and Internet Explorer 5 and above.

5. Software Requirements

a. ESTAP database

The ESTAP database currently requires Oracle 9.2.0.3.0 (9i), Enterprise Edition or Standard Edition. The Enterprise Edition includes several features that are not included in the Standard Edition. These features are not used by the ESTAP database, so the Standard Edition meets the requirements for the database.

b. The ESTAP pipeline analysis programs require the following software:

• Oracle 9i client, including the Pro*C compiler and OCI.

• blastall and formatdb - from NCBI. To set up standalone BLAST for Linux, go to ftp://ftp.ncbi.nih.gov/blast/executables/ and download the BLAST programs (blast.linux.tar.Z).

• Phred - reads DNA sequencer trace data, calls bases, assigns quality values to the bases, and writes the base calls and quality values to output files. Please contact swxfr@u.washington.edu regarding licensing issues.

• Cross_match - used for vector screening and for masking repeats and ribosomal and mitochondrial DNA. Please contact phg@u.washington.edu regarding licensing issues.

• d2_cluster and enc_db - used for EST clustering. They are part of the stackPACK package. Please contact SANBI (http://www.sanbi.ac.za) for these programs. The repeat.seq file included in stackPACK (under the stackpack/supporting/ directory) is also required for masking repeats and ribosomal and mitochondrial sequences.

• CAP3 - used for EST assembly. Please contact xqhuang@cs.iastate.edu for licensing of the CAP3 program.

• Email server - used to automatically send emails to PIs when their data are processed. ESTAP uses the elm (electronic mail for UNIX) mail system to send email messages when raw sequence data are cleansed. If elm is not already installed on your system, download it from http://www.instinct.org/elm/files/tarballs/elm2.5.6.tar.gz and place the elm program in the /usr/local/bin directory.

• InterProScan - a tool developed at the European Bioinformatics Institute (EBI) that combines different protein signature recognition methods into one resource. ESTAP wraps this tool to query contigs and singlets against the InterPro public databases and automatically annotate protein functions. Go to ftp://ftp.ebi.ac.uk/pub/software/unix/iprscan/ to download the iprscan program and the InterPro databases (see Section IV.3.k for installation details).

c. The ESTAP web programs require the following software:

• Servlet and JSP software - ESTAP uses Apache Tomcat. Go to http://jakarta.apache.org/tomcat/ to download a current version of Tomcat. ESTAP currently uses Tomcat version 4.0.4.

• Apache AXIS for web services - go to http://ws.apache.org/axis/ to download the current version of Apache AXIS.

• Oracle 9i client, including JDBC.

• JDK - go to http://www.sun.com to download the latest version of the JDK. ESTAP uses JDK 1.3.1 or higher.

• Email server - used to automatically send dbEST submission files as attachments to PIs when they use the ESTAP dbEST submission tool. ESTAP uses a bash attachment mailer program called “BIABAM”. Go to http://panther.mmj.dk/biabam/ to download the BIABAM program. To send the mail attachment without a command line prompt, remove the following lines from the biabam file:

    echo "Email body (type CTRL-d on a blank line to finish)"
    cat >> $TEMPFILE
    echo >> $TEMPFILE

Rename your modified biabam program to “mail_attachment” and place it in the /usr/local/bin directory.

For more information about Phred, Phrap, and Cross_match, please visit http://www.phrap.com.

II. How to Download and Uncompress the Software

Go to the ESTAP Web site (http://www.vbi.vt.edu/~estap) to download the current version of the ESTAP software. The software package is prepared for installation on Linux and is compressed in tar.gz format. Uncompress it by first using the “gunzip” command and then the “tar -xvf” command.

The uncompressed software has four directories: server_side, web_side, database, and java_prog.

• server_side contains directories and files for the pipeline analysis programs.
• web_side contains directories and files for the ESTAP web programs.
• database contains documentation and scripts for creating the ESTAP database and importing data.
• java_prog contains stand-alone Java applications used for processing the dbEST submission confirmation file that NCBI sends to the user (ReadConfirmationFile.java) and for running InterProScan analysis for singlets and contigs (the remaining Java files in this directory).
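As a quick sanity check after unpacking, the four directories named above can be verified with a small shell function. This is only a sketch: missing_estap_dirs is a hypothetical helper that is not part of ESTAP, the archive name in the comment is a placeholder, and the check assumes the four directories sit directly inside the directory you pass.

```shell
# missing_estap_dirs DIR
# Prints the name of each expected top-level directory that is absent from DIR.
# Hypothetical helper for illustration; not shipped with ESTAP.
missing_estap_dirs() {
    for d in server_side web_side database java_prog; do
        [ -d "$1/$d" ] || printf '%s\n' "$d"
    done
}

# Typical use (estap.tar.gz is a placeholder for the downloaded file name):
#   gunzip estap.tar.gz
#   tar -xvf estap.tar
#   missing_estap_dirs .
```

No output means all four directories were found.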

III. How to Install the Oracle Software and Create an ESTAP Database

1. Install Oracle Software
ESTAP requires Oracle 9i. Oracle provides detailed instructions with its products.

2. Create the Database Structure
See the separate documentation for creating the ESTAP database. The document is named “README_db.txt” and is included in the software package under the database directory.

3. Load Data
See the same document, “README_db.txt”, under the database directory of the software package for instructions.

IV. How to Install and Run Pipeline Analysis Programs

1. Set up

a. Set up the directory hierarchy where you will place raw sequence data (.seq and .qal files). The user may specify any directory for the input sequence data, e.g. /home/estap/raw_seq, and may name it freely. Under this directory the user must create three subdirectories: input, done, and problem. These three names must be used exactly as written (they are case-sensitive) to be recognized by the ESTAP programs. Under the input directory, create a lab directory for each registered lab (XX). The lab directory names are the two-letter, case-sensitive lab codes, e.g. JC. The raw sequence data (.seq and .qal files) of a lab should be put under the directory for that lab (input/XX).

An ESTAP program (estapDriver.exe) is scheduled to run periodically to read data files from each lab subdirectory, verify and cleanse the sequences, and store them in the ESTAP database. A .seq and .qal pair in the input/XX directory is considered valid if the pair has a valid file name and all the sequences in the .seq file and all quality scores in the .qal file have valid names and format. After processing, a valid pair is moved from the input/XX directory to the done/XX directory. An invalid pair is moved from the input/XX directory to the problem/XX directory. The lab (XX) subdirectories under the done and problem directories are created automatically by estapDriver.exe.
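The directory hierarchy described in step a can be created in one go. The function below is a minimal sketch; make_raw_seq_tree is a hypothetical helper that is not part of ESTAP, and the root path and lab code in the usage line are only the examples used in this guide.

```shell
# make_raw_seq_tree ROOT LAB...
#   ROOT : user-chosen root directory, e.g. /home/estap/raw_seq
#   LAB  : one or more two-letter, case-sensitive lab codes, e.g. JC
# Hypothetical helper for illustration; not shipped with ESTAP.
make_raw_seq_tree() {
    root=$1
    shift
    # The three fixed, case-sensitive subdirectories.
    mkdir -p "$root/input" "$root/done" "$root/problem" || return 1
    # Lab directories are created under input/ only; done/XX and problem/XX
    # are left for estapDriver.exe to create.
    for lab in "$@"; do
        mkdir -p "$root/input/$lab" || return 1
    done
}

# Usage:
#   make_raw_seq_tree /home/estap/raw_seq JC
```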

b. Under the server_side directory of the uncompressed ESTAP software there are three subdirectories: estap_exe, include, and src.

The analysis programs (configuration files, executables, and scripts) are located in the estap_exe directory. The executable programs in this directory were compiled using gcc 2.95.3 and glibc-2.2.5 and tested on Linux version 2.4.18. The third-party software and files (Cross_match, d2_cluster, CAP3, repeat.seq) are not included here. The user should obtain them from their respective sources. The executable programs cross_match, d2_cluster, enc_db, and cap3 and the repeat.seq file should be placed in the estap_exe directory.

The configuration file db_password.txt is used for the database connection. The user needs to edit this file to provide the database user name and password with which the ESTAP analysis programs connect to the ESTAP database. The file must be named “db_password.txt” and its content must follow this format:

    USER_NAME=user_name@estap_db_name
    PASSWORD=user_password

For example:

    USER_NAME=estap_user@estapdb
    PASSWORD=password

The include and src directories contain the source code for the ESTAP analysis programs. A makefile is provided under the src directory. The user may recompile the programs using the “make build” command under the src directory. The newly compiled programs will go to the estap_exe directory.
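Because the analysis programs depend on the exact layout of db_password.txt, it may help to check the file before the first run. The sketch below assumes one KEY=value pair per line and a user name of the form user@database. check_db_password_file is a hypothetical helper, not an ESTAP program.

```shell
# check_db_password_file FILE
# Returns 0 if FILE has a "USER_NAME=user@db" line and a non-empty
# "PASSWORD=" line, and non-zero otherwise.
# Hypothetical helper for illustration; not shipped with ESTAP.
check_db_password_file() {
    file=$1
    [ -r "$file" ] || return 1
    # USER_NAME must look like user_name@estap_db_name (exactly one "@").
    grep -Eq '^USER_NAME=[^@[:space:]]+@[^@[:space:]]+$' "$file" || return 1
    # PASSWORD must not be empty.
    grep -Eq '^PASSWORD=.+$' "$file" || return 1
}

# Usage:
#   check_db_password_file estap_exe/db_password.txt || echo "fix db_password.txt"
```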

c. Under the java_prog directory of the uncompressed ESTAP software there are Java application programs and a configuration file. The configuration file, javaConfig.txt, provides parameters for the Java applications, such as the database connection parameters, the directory location for dbEST files, and the number of processors used for InterProScan analysis. The content of the file must follow this format:

    jdbcDriverClassName=oracle.jdbc.driver.OracleDriver
    jdbcURL=jdbc:oracle:thin:@server.vt.edu:1521:estap_db_name
    dbUserName=db_user_name
    dbUserPassword=db_user_password
    ncbiConfirmFilePath=directory for dbEST confirmation file
    iprscanNumProcessor=number of processors to be used for InterProScan analysis

For example:

    jdbcDriverClassName=oracle.jdbc.driver.OracleDriver
    jdbcURL=jdbc:oracle:thin:@server.vt.edu:1521:estapdb
    dbUserName=estap_user
    dbUserPassword=password
    ncbiConfirmFilePath=/home/estap/dbEST_Incoming
    iprscanNumProcessor=2
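A missing parameter in javaConfig.txt is easy to overlook, so a small check can list any of the six keys that have no value. This is a sketch under the assumption that each parameter sits on its own line as key=value. missing_java_config_keys is a hypothetical helper, not part of ESTAP.

```shell
# missing_java_config_keys FILE
# Prints each required key that has no non-empty "key=value" line in FILE.
# Hypothetical helper for illustration; not shipped with ESTAP.
missing_java_config_keys() {
    file=$1
    for key in jdbcDriverClassName jdbcURL dbUserName dbUserPassword \
               ncbiConfirmFilePath iprscanNumProcessor; do
        grep -q "^$key=." "$file" 2>/dev/null || printf '%s\n' "$key"
    done
}

# Usage:
#   missing_java_config_keys java_prog/javaConfig.txt
```

No output means every parameter is present.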

2. Environment Variables

a. Environment variables needed for Oracle: PATH, ORACLE_HOME, LD_LIBRARY_PATH, ORA_CLIENT_LIB, and ORACLE_SID. The following are examples of the variable settings (ORACLE_HOME is set first because PATH and LD_LIBRARY_PATH refer to it):

    export ORACLE_HOME=/home/oracle/products/9.2.0
    export PATH=$PATH:/home/estap/blast:$ORACLE_HOME/bin
    export LD_LIBRARY_PATH=$ORACLE_HOME/lib
    export ORA_CLIENT_LIB=shared
    export ORACLE_SID=estap

More information regarding database-related environment variables may be found in the document “README_db.txt” in the database directory.

b. Environment variables needed for the BLAST programs: see the NCBI site for details on how to set up the BLAST programs. If your BLAST programs are located at /home/estap/blast, create a .ncbirc file that sets Data=/home/estap/blast/data, and include /home/estap/blast in the PATH variable (see above). Also set the environment variable BLASTDB to the location of your BLAST databases, e.g. if they are located at /home/estap/blast/db:

    export BLASTDB=/home/estap/blast/db

c. Environment variable needed for the Phred program: PHRED_PARAMETER_FILE. If the Phred program (phred and phredpar.dat) is in the /home/estap/phred directory, PHRED_PARAMETER_FILE should be set to /home/estap/phred/phredpar.dat, e.g.

    export PHRED_PARAMETER_FILE=/home/estap/phred/phredpar.dat

If this variable is not set correctly, phred will not run properly.

d. Environment variable needed for the runPhred.exe program: PHRED_PATH. The runPhred.exe program wraps the phred program and combines all .ab1 files in one directory to produce one .seq and one .qal file in FASTA format. If the phred program (phred and phredpar.dat) is in the /home/estap/phred directory, PHRED_PATH should be set to /home/estap/phred/phred, e.g.

    export PHRED_PATH=/home/estap/phred/phred

If this variable is not set correctly, the program will not run properly.

e. Environment variable needed for the d2_cluster program: OMP_NUM_THREADS. If you use one processor, set this variable to 1, e.g.

    export OMP_NUM_THREADS=1

If this variable is not set, d2_cluster will not run. In some cases d2_cluster may not exit normally because of a memory problem. We recommend setting the environment variable MALLOC_CHECK_ to 1, e.g.

    export MALLOC_CHECK_=1

In addition, d2_cluster requires libpgthread.so. You may place it in the /usr/local/lib directory.
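Since several programs fail when one of these variables is missing, it can be useful to list the unset ones before starting a run. The function below is a sketch. unset_vars is a hypothetical helper, and the variable list in the usage comment simply collects the names from items a through e.

```shell
# unset_vars NAME...
# Prints the name of every listed variable that is unset or empty.
# Hypothetical helper for illustration; not shipped with ESTAP.
unset_vars() {
    for name in "$@"; do
        # Indirect lookup of the variable called $name.
        eval "value=\${$name:-}"
        [ -n "$value" ] || printf '%s\n' "$name"
    done
}

# Usage:
#   unset_vars ORACLE_HOME LD_LIBRARY_PATH ORA_CLIENT_LIB ORACLE_SID \
#              BLASTDB PHRED_PARAMETER_FILE PHRED_PATH OMP_NUM_THREADS
```

The names are passed to eval, so the function should only be called with fixed variable names such as those above, never with untrusted input.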

3. Run the Programs

a. FTP and format NCBI BLAST databases (e.g. nr, nt)

The fetched NCBI databases are updated every month, or as frequently as you wish. Make sure that the ESTAP analysis programs are not running when you update the NCBI databases. Otherwise the ESTAP analysis programs will not run properly. Also make sure that database formatting is complete before running the ESTAP programs.

We provide shell scripts (getAndFormatNcbiDB.sh, getDB.sh, uncompressFormatDB.sh) and a C program (getAndFormatNcbiDB.exe) under the estap_exe directory to update the NCBI databases automatically. We use crontab to schedule the update, e.g. at 03:30 on the 2nd of each month:

    30 3 2 * * ~/estap_exe/getAndFormatNcbiDB.sh > get_db.log 2>&1

getAndFormatNcbiDB.sh calls getAndFormatNcbiDB.exe, which calls getDB.sh and uncompressFormatDB.sh. getAndFormatNcbiDB.exe waits until all ESTAP analysis programs finish before updating the NCBI databases. getDB.sh fetches the NCBI databases via wget. uncompressFormatDB.sh uncompresses the fetched NCBI databases and formats them using formatdb. Revise the scripts if your directory setup is different.

b. Where to put raw data (.ab1 files)

The .ab1 files can be stored in a user-specified directory, e.g. /home/estap/raw_data. When receiving data from customers, create a new directory under this directory. For easier management we use the following naming convention (the user may choose a different one): labcode-ddmmmyy, e.g. JC-06may02. Then transfer the data to this directory. All sequence files must be named following the ESTAP naming conventions (see Section VIII for how to prepare raw data and for the naming conventions).
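The labcode-ddmmmyy convention can be generated from the current date rather than typed by hand. This is only a sketch: delivery_dir_name is a hypothetical helper, and it assumes the date in the name is the day the data are received.

```shell
# delivery_dir_name LABCODE
# Prints LABCODE-ddmmmyy for today's date, in the style of JC-06may02.
# Hypothetical helper for illustration; not shipped with ESTAP.
delivery_dir_name() {
    # LC_ALL=C forces English month abbreviations; tr lower-cases them.
    stamp=$(LC_ALL=C date +%d%b%y | tr 'A-Z' 'a-z')
    printf '%s-%s\n' "$1" "$stamp"
}

# Usage:
#   mkdir "/home/estap/raw_data/$(delivery_dir_name JC)"
```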

c. Run Phred

Go to the directory where runPhred.exe is located, e.g. /home/estap/server_side/estap_exe, and run runPhred.exe. This program wraps the phred program to do base calling and combines all .ab1 files in one sequence file directory to produce one .seq and one .qal file in FASTA format. If multiple sequence file directories exist, multiple .seq and .qal files will be generated.

Command:

    runPhred.exe input_directory output_directory

e.g.

    runPhred.exe /home/estap/raw_data/jc-06may02 /home/estap/raw_data/jc-06may02-phred-out

where runPhred.exe is the program name, /home/estap/raw_data/jc-06may02 is the directory that contains the .ab1 files from the JC lab, and /home/estap/raw_data/jc-06may02-phred-out is the output directory where the .seq and .qal files go. Before running the program, you must first create the output directory (/home/estap/raw_data/jc-06may02-phred-out). After the program is done, check phred.log in the /home/estap/server_side/estap_exe directory for any problems (the program logs its results automatically). If there are none, you may either rename the log file or remove it. In the output directory, remove any files that do not have a .seq or .qal extension.
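The final clean-up of the output directory can be scripted. The sketch below deletes every regular file directly inside the directory whose name does not end in .seq or .qal. prune_phred_output is a hypothetical helper, and the -maxdepth option assumes GNU find, as found on Linux.

```shell
# prune_phred_output DIR
# Deletes every regular file directly in DIR whose name does not end in
# .seq or .qal. Subdirectories are left untouched.
# Hypothetical helper for illustration; not shipped with ESTAP.
prune_phred_output() {
    dir=$1
    [ -d "$dir" ] || return 1
    find "$dir" -maxdepth 1 -type f ! -name '*.seq' ! -name '*.qal' \
        -exec rm -f {} +
}

# Usage, following the example above:
#   out=/home/estap/raw_data/jc-06may02-phred-out
#   mkdir -p "$out"
#   ./runPhred.exe /home/estap/raw_data/jc-06may02 "$out"
#   prune_phred_output "$out"
```

Check phred.log before pruning, because the function deletes files without asking.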

d. Copy the .seq and .qal files generated by the runPhred.exe program to the correct lab subdirectory under the input directory, e.g. if the files belong to the JC lab, copy them to /home/estap/raw_seq/input/JC.

e. Run the cleansing program to verify the integrity of the .seq and .qal files, cleanse the sequences, and store the results in the ESTAP database.

Command:

    estapDriver.exe root_path_name > clean.log 2>&1

e.g.

    estapDriver.exe /home/estap/raw_seq > clean.log 2>&1

where estapDriver.exe is the program name, root_path_name is the parent directory path of the input directory (see Section IV.1 for the directory hierarchy description), and > clean.log 2>&1 logs the program output (both stdout and stderr) to a log file named clean.log. This clean.log can be viewed for debugging purposes. After the cleansing program has run successfully, clean.log can be removed to save disk space.

It is better to run the command with nohup, which makes the program immune to hang-ups, and in the background (with & at the end of the command), e.g.

    nohup estapDriver.exe root_path_name > clean.log 2>&1 &

To check whether the program is running, use the command:

    ps -ef | grep estapDriver.exe

If for some reason you need to re-run the cleansing for some sequence files, follow these steps:

(1) Delete the sequence file(s) from the database directly.
(2) Reload the corresponding .seq and .qal files to the input/[lab_code] directory.
(3) Run estapDriver.exe as described above in e.

f. Run the ESTAP blast program to do blast according to the corresponding blast protocols.

Command:

    runBlast.exe > blast.log 2>&1

where runBlast.exe is the program name and > blast.log 2>&1 logs the program output (both stdout and stderr) to a log file named blast.log. This blast.log can be viewed for debugging purposes. After the runBlast program has run successfully, blast.log can be removed to save disk space. It is better to use nohup and run the program in the background (with & at the end of the command), e.g.

    nohup runBlast.exe > blast.log 2>&1 &

To check whether the program is running, use the command:

    ps -ef | grep runBlast.exe

Currently the runBlast.exe program is set to blast 1000 sequences at a time (in the future the setting can be changed to a bigger number). If there are more than 1000 sequences to blast, you need to run this program multiple times. Check blast.log to see how many sequences must be blasted.

Note: Wait at least 5-10 minutes before starting another runBlast.exe program, to make sure that the sequences to be blasted in the previous run are already recorded in the ESTAP database. Check the blast.log file to see whether the blast command has already appeared. When starting a second runBlast.exe program, use a different log file name instead of blast.log, e.g. blast2.log. The same precautions apply when you need to run the program three or more times.
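The arithmetic behind "run the program multiple times" is a ceiling division by the batch size, and each extra run needs its own log name. The two functions below sketch this. runs_needed and blast_log_name are hypothetical helpers, the default batch size of 1000 is the current setting named above, and the 600-second wait in the commented loop is just the upper end of the suggested 5-10 minutes.

```shell
# runs_needed PENDING [BATCH]
# Prints how many program runs are needed for PENDING sequences when each
# run handles BATCH sequences (default 1000).
# Hypothetical helper for illustration; not shipped with ESTAP.
runs_needed() {
    pending=$1
    batch=${2:-1000}
    echo $(( (pending + batch - 1) / batch ))
}

# blast_log_name I
# Prints blast.log for the first run and blastI.log for later runs.
blast_log_name() {
    if [ "$1" -eq 1 ]; then
        echo blast.log
    else
        echo "blast$1.log"
    fi
}

# Example loop for 2500 pending sequences (three runs):
#   n=$(runs_needed 2500)
#   i=1
#   while [ "$i" -le "$n" ]; do
#       nohup ./runBlast.exe > "$(blast_log_name "$i")" 2>&1 &
#       sleep 600          # wait 10 minutes before starting the next run
#       i=$((i + 1))
#   done
```

A fixed wait is a simplification. The note above also asks you to confirm in the log that the blast command has appeared before starting the next run.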

g. Run the reblast program to do reblast according to the corresponding blast protocols.

Command:

    reblast.exe > reblast.log 2>&1

where reblast.exe is the program name and > reblast.log 2>&1 logs the program output (both stdout and stderr) to a log file named reblast.log. This reblast.log can be viewed for debugging purposes. After the reblast program has run successfully, reblast.log can be removed to save disk space. It is better to use nohup, which makes the program immune to hang-ups, and run the program in the background (with & at the end of the command), e.g.

    nohup reblast.exe > reblast.log 2>&1 &

To check whether the program is running, use the command:

    ps -ef | grep reblast.exe

Currently the reblast.exe program is set to reblast 1000 sequences at a time (in the future the setting can be changed to a bigger number). If there are more than 1000 sequences to reblast, you need to run this program multiple times. Check reblast.log to see how many sequences must be blasted.

Note: Wait at least 5-10 minutes before starting another reblast.exe program, to make sure that the sequences to be blasted in the previous run are already recorded in the ESTAP database. Check the reblast.log file to see whether the blast command has already appeared. When starting a second reblast.exe program, use a different log file name instead of reblast.log, e.g. reblast2.log. The same precautions apply when you need to run the program three or more times.

h. Run the cluster/assembly programs

You have two options: 1) do cluster/assembly for a specific project; 2) do cluster/assembly for all projects.

1) Do cluster/assembly for a specific project

Command:

    cluster.exe project_id > cluster_project_id.log 2>&1

e.g.

    cluster.exe 377 > cluster_377.log 2>&1

where cluster.exe is the program name, project_id is the project id of the project you wish to run (note: this is the internal id, not the project code), and > cluster_project_id.log 2>&1 logs the program output (both stdout and stderr) to a log file named cluster_project_id.log. This log file can be viewed for debugging purposes. After the program has run successfully, the log file can be removed to save disk space. You may use nohup, which makes the program immune to hang-ups, e.g.

    nohup cluster.exe project_id > cluster_project_id.log 2>&1 &

To check whether the program is running, use the command:

    ps -ef | grep cluster.exe

2) Do cluster/assembly for all projects that are qualified for the process (this is preferred)

Command:

    clusterAll.exe > cluster.log 2>&1

where clusterAll.exe is the program name and > cluster.log 2>&1 logs the program output (both stdout and stderr) to a log file named cluster.log. This log file can be viewed for debugging purposes. After the program has run successfully, the log file can be removed to save disk space. You may use nohup, which makes the program immune to hang-ups, e.g.

    nohup clusterAll.exe > cluster.log 2>&1 &

To check whether the program is running, use the command:

    ps -ef | grep clusterAll.exe

i. Run the contig blast program to do blast for assembled contigs according to the corresponding blast protocols.

Command:

    blastAllContigs.exe > contig_blast.log 2>&1

where blastAllContigs.exe is the program name and > contig_blast.log 2>&1 logs the program output (both stdout and stderr) to a log file named contig_blast.log. This contig_blast.log can be viewed for debugging purposes. After the blastAllContigs program has run successfully, the log file can be removed to save disk space. It is better to use nohup, which makes the program immune to hang-ups, and run the program in the background (with & at the end of the command), e.g.

    nohup blastAllContigs.exe > contig_blast.log 2>&1 &

To check whether the program is running, use the command:

    ps -ef | grep blastAllContigs.exe

Currently the blastAllContigs.exe program is set to blast 1000 contigs at a time (in the future the setting can be changed to a bigger number). If there are more than 1000 contigs to blast, you need to run this program multiple times. Check contig_blast.log to see how many sequences must be blasted.

Note: Wait at least 5-10 minutes before starting another blastAllContigs.exe program, to make sure that the sequences to be blasted in the previous run are already recorded in the ESTAP database. Check the contig_blast.log file to see whether the blast command has already appeared. When starting a second blastAllContigs.exe program, use a different log file name instead of contig_blast.log, e.g. contig_blast2.log. The same precautions apply when you need to run the program three or more times.

j. Run the contig reblast program to reblast assembled contigs according to the corresponding blast protocols.

Command:

    reblastAllContigs.exe > contig_reblast.log 2>&1

where reblastAllContigs.exe is the program name and > contig_reblast.log 2>&1 logs the program output (both stdout and stderr) to a log file named contig_reblast.log. This log file can be viewed for debugging purposes. After the reblastAllContigs program has run successfully, the log file can be removed to save disk space. It is better to use nohup, which makes the program immune to hang-ups, and run the program in the background (with & at the end of the command), e.g.

    nohup reblastAllContigs.exe > contig_reblast.log 2>&1 &

To check whether the program is running, use the command:

    ps -ef | grep reblastAllContigs.exe

Currently the reblastAllContigs.exe program is set to reblast 1000 contigs at a time (in the future the setting can be changed to a bigger number). If there are more than 1000 contigs to reblast, you need to run this program multiple times. Check contig_reblast.log to see how many sequences must be blasted.

Note: Wait at least 5-10 minutes before starting another reblastAllContigs.exe program, to make sure that the sequences to be blasted in the previous run are already recorded in the ESTAP database. Check the contig_reblast.log file to see whether the blast command has already appeared. When starting a second reblastAllContigs.exe program, use a different log file name instead of contig_reblast.log, e.g. contig_reblast2.log. The same precautions apply when you need to run the program three or more times.

k. Set up and run the InterProScan application

InterProScan is a tool developed at EBI that combines different protein signature recognition methods into one resource. ESTAP wraps this tool to query contigs and singlets against the InterPro public databases and automatically annotate protein functions. The GO terms of the matched proteins are linked to the contigs and singlets. Follow the instructions below to set up and run the InterProScan application.

1) Download iprscan

Go to the ftp://ftp.ebi.ac.uk/pub/software/unix/iprscan/ site and download the following files:

(1) iprscan_vXXX.tar.gz, where XXX is the version number of the iprscan release, e.g. 3.2
(2) iprscan_bin_XXX.tar.gz, where XXX is your platform, e.g. Linux
(3) iprscan_DATA_XXX.tar.gz, where XXX is the version number of the InterPro database release, e.g. 6.1

2) Uncompress the files, in the order specified above, into your home directory ($HOME_DIR). The iprscan home directory will be $HOME_DIR/iprscan, e.g. /home/estap/iprscan.

3) First-time configuration

Go to the iprscan home directory and run:

    perl CONFIG.pl

Then:

(1) Modify CONFIG.pl as described below (save a copy of the original CONFIG.pl file as CONFIG_orig.pl).

(i) At the beginning of CONFIG.pl (in main), along with the other variable declarations, add the following lines:

    my $IprPWD = "iprpwd.txt";
    my $pwd = $ENV{PWD};
    unless ($pwd) {
        $pwd = `pwd`;
        chomp $pwd;
    }

(ii) Before the line that sets up applications,

    $applset = get_user_prompt("Setup applications $first (y|n)");

add these lines:

    open (FF, ">$IprPWD") || die "Cannot create $IprPWD: $!";
    print FF $pwd;
    close FF;

(2) Modify the InterProScan.pl file as described below (save a copy of the original file as InterProScan_orig.pl).

(i) Replace

    my $seqfile = $ARGV[0];

with

    my $seqfile = $ARGV[1];

(ii) After the line

    my $UserId = Manager->getUserId();

add these lines:

    my $OutDir = $ARGV[0];
    print "$OutDir\n";
    # Do not use $path in this file, since it won't be recognized by the
    # iprscan wrapper program. $OutDir, instead of $path, is used as an
    # argument for the iprscan wrapper program to store iprscan result files.

(iii) Replace $path with $OutDir in all the lines following the above comment lines.

4) Reconfiguration

Go to the iprscan home directory and run perl CONFIG.pl again. You will be asked to choose the member databases you would like to search against. If you only use the public data and applications that come with the iprscan distribution, answer “n” for SignalPHMM and TMHMM during configuration, since they are not public. When the configuration is complete, a file called iprpwd.txt will be created under the iprscan home directory. The iprpwd.txt file stores the iprscan path information, which is used by the iprscan wrapper programs. You must copy the iprpwd.txt file to the java_prog directory, where the iprscan wrapper programs are placed, so that the Java programs can read the iprscan path information from this file.

5) Run iprscan

Go to the java_prog directory, where the Java applications and configuration files (javaConfig.txt and iprpwd.txt) are placed, and set the proper parameters in the configuration file as described in Section IV.1.c. You may analyze a specific project or all projects at a time.

To run a specific project, use the command:

    nohup java InterProDriver lab_code project_code > ipr_project_code.log 2>&1 &

e.g.

    nohup java InterProDriver JC MCA > ipr_MCA.log 2>&1 &

where InterProDriver is the program name, lab_code is the two-letter lab code of the lab that the project belongs to, project_code is the three-letter project code of the project being analyzed, and > ipr_project_code.log 2>&1 logs the program output (both stdout and stderr) to a log file named ipr_project_code.log.

To run all projects that are qualified for this procedure, use the command:

    nohup java InterProScanAll > ipr.log 2>&1 &

where InterProScanAll is the program name and > ipr.log 2>&1 logs the program output (both stdout and stderr) to a log file named ipr.log.

The log files can be viewed for debugging purposes. After the program has run successfully, the log files can be removed to save disk space.

l. Run the genomic DNA assembly programs

You have two options: 1) do DNA assembly for a specific genome project; 2) do DNA assembly for all qualified genome projects.

1) Do assembly for a specific genome project

Command:

    nohup gAssemble.exe project_id > assemble_project_id.log 2>&1

e.g.

    nohup gAssemble.exe 377 > assemble_377.log 2>&1

where gAssemble.exe is the program name, project_id is the project id of the project you wish to run (note: this is the internal id, not the project code), and > assemble_project_id.log 2>&1 logs the program output (both stdout and stderr) to a log file named assemble_project_id.log.

2) Do assembly for all genome projects that are qualified for the process (this is preferred)

Command:

    nohup gAssembleAll.exe > assemble.log 2>&1

where gAssembleAll.exe is the program name and > assemble.log 2>&1 logs the program output (both stdout and stderr) to a log file named assemble.log.

m. Prepare EST database files from user projects

The ESTAP Web interface allows users to blast specific sequence(s) against the ESTs from their projects. We provide a program called writeAndFormatFasta.exe to create and format EST database files from user projects. This program can be scheduled to run daily, or as often as you wish, so that users blast against the most recent EST databases. To run this program, you need to create a directory where the EST database files will be placed, e.g. /home/estap/estdb. This directory path should be used for the ESTDBFilePath parameter setting in the web.xml file (see Section V.3.e) so that the ESTAP local blast Web service can recognize it.

4. Schedule Programs

The above programs can be run manually as described above or scheduled to run automatically using the UNIX cron command. Please see the UNIX man page for the crontab command. Following is an example of the scheduling:

    0 9,11,17 * * * ~/estap_exe/estapDriver.sh >> estap.log 2>&1
    0 19,20,21 * * * ~/estap_exe/runBlast.sh >> blast.log 2>&1
    30 22,23,24 * * * ~/estap_exe/reblast.sh >> reblast.log 2>&1
    30 3 2 * * ~/estap_exe/getAndFormatNcbiDB.sh > get_db.log 2>&1
    0 5 * * * ~/estap_exe/clusterAll.sh >> ~/estap_exe/cluster.log 2>&1
    0,30 19 * * * ~/estap_exe/blastAllContigs.sh >> ~/estap_exe/blast_contigs.log 2>&1
    0,30 12 * * * ~/estap_exe/reblastAllContigs.sh >> ~/estap_exe/reblast_contigs.log 2>&1
    0,30 7 * * * ~/estap_exe/gAssembleAll.sh >> ~/estap_exe/assemble.log 2>&1
    5 2 * * * ~/estap_exe/writeAndFormatFasta.sh >> ~/estap_exe/estap_formatdb.log 2>&1

Standard cron accepts hours 0 through 23 only. On such systems write 0 in place of 24 in the reblast entry.

The shell scripts used here call the corresponding programs and make sure that the programs are run under the correct directories and with the correct environment variable settings. These scripts should be revised if your directory setup and environment variable settings are different.
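The wrapper scripts themselves are not reproduced in this guide. As a rough sketch of what such a wrapper does (change to the right directory, set the environment, run the program), a generic function might look like the following. run_estap_program is a hypothetical name, and the actual scripts shipped in estap_exe may be organized differently.

```shell
# run_estap_program DIR PROGRAM [ARG...]
# Runs PROGRAM with its arguments from inside DIR, in a subshell so that the
# caller's working directory and environment are left unchanged.
# Hypothetical sketch; the wrapper scripts shipped with ESTAP may differ.
run_estap_program() {
    dir=$1
    shift
    (
        cd "$dir" || exit 1
        # Export the variables from Section IV.2 here, for example:
        #   export ORACLE_HOME=/home/oracle/products/9.2.0
        #   export LD_LIBRARY_PATH=$ORACLE_HOME/lib
        "$@"
    )
}

# Usage, e.g. as the body of a cron-driven wrapper:
#   run_estap_program "$HOME/estap_exe" ./estapDriver.exe /home/estap/raw_seq
```

Cron starts jobs with a minimal environment, which is why the exports belong inside the wrapper rather than in a login profile.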

5. Trouble-shooting
When a program is stopped abnormally (e.g., power or network failure, processes are killed, etc.), some data may be partially analyzed and partial results may be stored in the database. You may need to do some clean-up (removing partial results from the database) and rerun the program. Please use the following instructions:

a. Go to the ESTAP database table PROG_RUN_STATUS in the ESTAP_SYS_DEF schema. This table records which programs are running or completed, when they were started, and when they were completed. If a program is killed in the middle of a run for some reason, the end_time will not be recorded and the status will still be running (status = 1). You need to change the status from 1 to 0 to indicate that the program is no longer running.

b. If the program was runBlast.exe or reblast.exe: after you change the status in the PROG_RUN_STATUS table, please go to the ANALYSIS_TRACKING table in the ESTAP_ANALYSIS schema and delete those rows created by the program that have status = 1, e.g.,
    DELETE from ANALYSIS_TRACKING where status = 1;
Commit the changes after you are sure that you want to delete them. Then rerun the runBlast.exe program.

c. If the program was cluster.exe: after you change the status in the PROG_RUN_STATUS table, please go to the CLUSTER_SUMMARY table in the ESTAP_ANALYSIS schema and delete the row with the project id that you specified when you ran cluster.exe. Commit the changes after you are sure that you want to delete them. Then rerun the cluster.exe program.

d. If the program was clusterAll.exe: after you change the status in the PROG_RUN_STATUS table (look for "cluster.exe" and "clusterAll.exe" in the prog_name column), please go to the CLUSTER_SUMMARY table in the ESTAP_ANALYSIS schema and delete the rows created by the program that have a status of 1. Commit the changes after you are sure that you want to delete them. Then rerun the clusterAll.exe program.

e. If the program was blastAllContigs.exe or reblastAllContigs.exe: after you change the status in the PROG_RUN_STATUS table, please go to the CONTIG_ANALYSIS table in the ESTAP_ANALYSIS schema and delete those rows created by the program that have status = 1, e.g.,
    DELETE from CONTIG_ANALYSIS where status = 1;
Commit the changes after you are sure that you want to delete them. Then rerun the blastAllContigs.exe program.
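The clean-up rules in steps a through e amount to a small lookup from program name to table. The following Python sketch only builds the SQL statements as text, so that the curator can review them before running them by hand; it does not connect to the database. It assumes that the prog_name column stores the program names as they are written in this guide (e.g., runBlast.exe). The function name cleanup_statements and the dictionary are hypothetical and are not part of ESTAP.

```python
# Tables in the ESTAP_ANALYSIS schema that hold partial results (steps b, d, e).
# Assumption: prog_name values match the program names used in this guide.
PARTIAL_RESULT_TABLE = {
    "runBlast.exe": "ANALYSIS_TRACKING",
    "reblast.exe": "ANALYSIS_TRACKING",
    "clusterAll.exe": "CLUSTER_SUMMARY",
    "blastAllContigs.exe": "CONTIG_ANALYSIS",
    "reblastAllContigs.exe": "CONTIG_ANALYSIS",
}


def cleanup_statements(prog_name, project_id=None):
    """Return the SQL statements to review after an interrupted program."""
    if prog_name != "cluster.exe" and prog_name not in PARTIAL_RESULT_TABLE:
        raise ValueError("no clean-up rule for %r" % prog_name)

    # Step d: clusterAll.exe can leave both cluster.exe and clusterAll.exe
    # marked as running, so both entries are reset.
    if prog_name == "clusterAll.exe":
        names = ["cluster.exe", "clusterAll.exe"]
    else:
        names = [prog_name]
    statements = [
        "UPDATE ESTAP_SYS_DEF.PROG_RUN_STATUS SET status = 0 "
        "WHERE prog_name = '%s' AND status = 1" % name
        for name in names
    ]

    if prog_name == "cluster.exe":
        # Step c: remove the row of the project that was being clustered.
        if project_id is None:
            raise ValueError("cluster.exe clean-up needs the project id")
        statements.append(
            "DELETE FROM ESTAP_ANALYSIS.CLUSTER_SUMMARY "
            "WHERE project_id = %d" % int(project_id)
        )
    else:
        statements.append(
            "DELETE FROM ESTAP_ANALYSIS.%s WHERE status = 1"
            % PARTIAL_RESULT_TABLE[prog_name]
        )
    statements.append("COMMIT")
    return statements


for statement in cleanup_statements("runBlast.exe"):
    print(statement + ";")
```

Printing the statements instead of executing them keeps the "commit only after you are sure" step in the curator's hands.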

V. How to Install and Run Web Programs

1. Install Oracle Client including JDBC
Please contact Oracle to obtain licensing and product support. ESTAP uses Oracle 9i client.

2. Install JDK, Servlet, and Web Service Software

a. To install JDK 1.3 (Windows, Linux, Solaris) or higher, please go to http://java.sun.com/j2se.

b. To install Tomcat, please go to http://jakarta.apache.org/tomcat.

c. To install Apache Axis, please go to http://ws.apache.org/axis. Apache Axis beta 3 version is used for the ESTAP local blast Web service.

3. Configure Tomcat

a. Visit http://jakarta.apache.org/tomcat for instructions on Tomcat configuration.

b. Set up the correct PATH and CLASSPATH environment variables.

c. Modify server.xml in the Tomcat conf directory. In this file, add the following for the estap context:

    <Context path="/estap" docBase="estap" showdebuginfo="true" debug="DEBUG" reloadable="true" crossContext="false">
      <Logger className="org.apache.catalina.logger.FileLogger" prefix="localhost_estap_log." suffix=".txt" timestamp="true"/>
    </Context>

d. Place the estap Web application under the Tomcat webapps directory: copy the estap directory from the web_side directory of the uncompressed software to the webapps directory.

e. Modify the web.xml file in the estap/WEB-INF directory. The following section in the web.xml is specific to ESTAP. Please modify the param-value entries to match your own settings.

    <servlet>
      <servlet-name>ProjectList</servlet-name>
      <servlet-class>ProjectList</servlet-class>
      <init-param>
        <param-name>jdbcDriverClassName</param-name>
        <param-value>oracle.jdbc.driver.OracleDriver</param-value>
      </init-param>
      <init-param>
        <param-name>jdbcURL</param-name>
        <param-value>jdbc:oracle:thin:@server.vt.edu:1521:estapdb</param-value>
      </init-param>
      <init-param>
        <param-name>dbUserName</param-name>
        <param-value>estap_user</param-value>
      </init-param>
      <init-param>
        <param-name>dbUserPassword</param-name>
        <param-value>password</param-value>
      </init-param>
      <init-param>
        <param-name>dbInitConnect</param-name>
        <param-value>1</param-value>
      </init-param>
      <init-param>
        <param-name>dbMaxConnect</param-name>
        <param-value>50</param-value>
      </init-param>
      <init-param>
        <param-name>dbESTFilePath</param-name>
        <param-value>/home/estap/dbEST_Outgoing</param-value>
      </init-param>
      <init-param>
        <param-name>ESTDBFilePath</param-name>
        <param-value>/home/estap/estdb</param-value>
      </init-param>
      <init-param>
        <param-name>blastallPath</param-name>
        <param-value>/home/estap/blast/blastall</param-value>
      </init-param>
      <init-param>
        <param-name>formatdbPath</param-name>
        <param-value>/home/estap/blast/formatdb</param-value>
      </init-param>
      <init-param>
        <param-name>estapEmail</param-name>
        <param-value>estap@vbi.vt.edu</param-value>
      </init-param>
      <load-on-startup>1</load-on-startup>
    </servlet>

The following list defines the parameters:
    jdbcURL: the URL for the JDBC connection
    dbUserName: the ESTAP database user name
    dbUserPassword: the ESTAP database user password
    dbInitConnect: the number of database connections to be made initially
    dbMaxConnect: the maximum number of database connections
    dbESTFilePath: the directory path to store dbEST submission files
    ESTDBFilePath: the directory path to store the EST files in fasta format and their formatted files used by the ESTAP local blast Web service
    blastallPath: the path for the blastall program
    formatdbPath: the path for the formatdb program
    estapEmail: the ESTAP curator's email address
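Because a missing init-param is easy to overlook in a hand-edited web.xml, a quick check before starting Tomcat can help. The following Python sketch lists the parameters named in this section that are absent from the ProjectList servlet entry. It assumes a web.xml without XML namespaces (as in the Servlet 2.3 style shown above); the function name missing_init_params is hypothetical and is not part of ESTAP.

```python
import xml.etree.ElementTree as ET

# Parameter names taken from the ProjectList servlet section of this guide.
REQUIRED_PARAMS = [
    "jdbcDriverClassName", "jdbcURL", "dbUserName", "dbUserPassword",
    "dbInitConnect", "dbMaxConnect", "dbESTFilePath", "ESTDBFilePath",
    "blastallPath", "formatdbPath", "estapEmail",
]


def missing_init_params(web_xml_text, servlet_name="ProjectList"):
    """Return the required init-param names not set for the given servlet."""
    root = ET.fromstring(web_xml_text)
    for servlet in root.iter("servlet"):
        if (servlet.findtext("servlet-name") or "").strip() != servlet_name:
            continue
        present = {
            (param.findtext("param-name") or "").strip()
            for param in servlet.findall("init-param")
        }
        return [name for name in REQUIRED_PARAMS if name not in present]
    raise ValueError("servlet %r not found in web.xml" % servlet_name)
```

For example, calling missing_init_params(open("web.xml").read()) from the estap/WEB-INF directory returns an empty list when every parameter is present.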

f. Set heap size for JVM
Some of the ESTAP web programs require more memory than the JVM default heap size of 64MB. To avoid an "out of memory" error, the user should set a larger heap size for the JVM. To do this, the user must set the "JAVA_OPTS" variable in the catalina.sh/catalina.bat or setclasspath.sh/setclasspath.bat file in the tomcat_home/bin directory. For example, if you wish to set the heap size to 256MB, you may add the line "set JAVA_OPTS=-mx256m" in the catalina.bat file.

g. Start Tomcat server
To start the Tomcat server, execute tomcat_home/bin/startup.bat on Windows (or tomcat_home/bin/startup.sh on Unix/Linux). Then enter the URL http://localhost/ in your browser and make sure that you get the Tomcat welcome page, not an error message saying that the page cannot be displayed or that the server cannot be found. If you choose to use port number 8080 instead of 80, you will need to use a URL that includes the port number, such as http://localhost:8080/. If this is successful, enter http://localhost/estap/servlet/Login or http://localhost:8080/estap/servlet/Login to see whether the ESTAP login page is displayed.

4. Configure Axis

a. Go to http://ws.apache.org/axis for downloading and installation instructions.

b. Place the downloaded axis directory under the tomcat_home/webapps directory.

c. Move the server-config.wsdd file from the tomcat_home/webapps/estap/WEB-INF directory to the tomcat_home/webapps/axis/WEB-INF directory.

d. Add all jar files under the tomcat_home/webapps/axis/WEB-INF/lib and tomcat_home/webapps/axis/WEB-INF/classes directories to your CLASSPATH environment variable.

e. Move the blastService and createESTDBService directories from the tomcat_home/webapps/estap/WEB-INF/classes directory to the tomcat_home/webapps/axis/WEB-INF/classes directory.

f. Replace the server name and Tomcat port with your own server name and port number in the JBlastServiceLocator.java and JCreateESTDBServiceLocator.java files under the blastService/ws and createESTDBService/ws directories, respectively. E.g., for blastService, run the commands:
    cd blastService
    cd ws
Edit the JBlastServiceLocator.java file by changing http://arda.vbi.vt.edu:8080 to http://yourServerName:yourPort. Then compile the modified file by running the command:
    javac JBlastServiceLocator.java
Repeat the above process for the createESTDBService directory.

g. Copy the blastService and createESTDBService directories to the tomcat_home/webapps/estap/WEB-INF/classes directory.

h. Compile EstapWebBlast.java by running the command:
    javac EstapWebBlast.java

i. Restart the Tomcat server and go to http://localhost/axis or http://localhost:8080/axis to view the list of deployed web services. If Axis is configured correctly, you will see the createESTDB and blast services.

VI. How to Register and Update ESTAP Users

ESTAP defines three types of users with different role permissions: sponsors, PIs, and other users.

Sponsors are those who fund the development of ESTAP. If there is no 'sponsor' according to this definition, then the person who is in charge of the ESTAP system becomes the sponsor. There is one sponsor per institution. Sponsors can register/update PIs and have all PI privileges.

PIs are the people responsible for the projects. PIs can register/update projects, project users, and project protocols. PIs can also register ESTAP codes (e.g., primer, instrument, base caller, etc.), submit ESTs of their projects to dbEST, and view their project data and analysis results.

Other users include primary contacts with update permission, primary contacts with read-only permission, and project viewers with read-only permission. Primary contacts with update privileges (associated with particular project(s)) can update projects and project protocols and view project data and analysis results. Primary contacts with read-only permission and project viewers (associated with particular project(s)) can view project data and analysis results.

At least one sponsor must exist before PIs can register their projects and users. Sponsors are registered by the ESTAP curator/database administrator, who has direct access to the ESTAP tables in the database. Please use the following instructions to register sponsors:

1) In the INSTITUTION table, modify the first record, which has institution_id = 1. Change the institution_name to the institution name of the sponsor. If there is more than one sponsor to be added, add a new record for each remaining sponsor. For each record, increment institution_id by 1 and enter the institution name and description in the institution_name and institution_desc columns.

2) In the LAB table, modify the record with the lab_code of XX. Change all columns of this record to the desired values. The lab_email entry must be the email address of the sponsor. The lab_alt_email entry is the email address of the primary contact in the sponsor's lab. Each sponsor has one lab. You may add more records if there is more than one sponsor.

3) In the PERSON table, modify the person record with the person_id of 1. Change the user_name to the email address of the sponsor and the password to the sponsor's password for logging on to the Web site. The user_type is 1 and the sponsor_id is 1 (referring to the person_id of the sponsor him/herself in the PERSON table). Other columns are self-explanatory by their names. Add more records if there is more than one sponsor, and increment person_id for each additional sponsor.

4) In the CONTACT table, change the record with the contact_id of 1 and person_id of 1 to the contact information of the sponsor. Add more records if there is more than one sponsor, and increment contact_id for each additional sponsor.

Once sponsors are registered in the ESTAP system, they can log on to the ESTAP Web site to register or update PIs in their institutions. Once PIs are registered, they can log on to the ESTAP Web site to register or update their projects. In addition, PIs can register or update the primary contacts and the project viewers for their projects.

VII. How to Register and Update EST Projects and Protocols

1. Project Registration and Update
ESTAP defines two types of projects: regular and virtual projects. A regular project refers to a specific EST library that a lab constructed. The regular project information includes biological information regarding tissues and cloning methods used in cDNA library construction, and EST sequencing information. In order to analyze combined libraries from a particular taxonomy, we introduced the virtual project concept. We now allow PIs to register a virtual project that combines their regular projects from a particular taxonomy.

To register a regular project or a virtual project, please go to the "Project List" page, click the "Register Regular Project" link or the "Register Virtual Project" link, and follow the instructions in the registration form carefully. PIs and the primary contacts with update privilege may also update the project information via the Web.

The current version of ESTAP does not allow updating a project once the sequences of that project have been processed and stored in the ESTAP database. If the user needs to make corrections to the project information after the sequences are processed, he or she has to make a request to the ESTAP curator. The ESTAP curator should delete the already loaded sequences for that project, make corrections to the project information, and then reload the sequences by running the estapDriver.exe program.

2. Protocol Registration and Update
ESTAP offers the following protocols for data analysis:

1) Cleansing protocols
The cleansing protocols contain procedures to cleanse raw sequences by removing low-quality end sequences, vectors, and polyA (or polyT), and by screening for chimeras and contamination. There are two types of cleansing protocols: the cleansing protocol for BLAST and the cleansing protocol for assembly. The sequences cleansed by the cleansing protocol for BLAST are used for BLAST searching, and the sequences cleansed by the cleansing protocol for assembly are used for the cluster/assembly analysis. The user may utilize the same procedures and parameters for both protocols. There are several key points for the cleansing to be successful:

a. The cleansing program uses the 5' and 3' vector fragments provided in the project registration to match the vector portions in the clones. So the 5' and 3' vector fragments (both the orientation of the fragments and the sequences) must be correct.

b. The vector and insert adaptor sequences provided in the project registration should not be the whole adaptor sequences. Only the regions that are present in the clones should be included.

c. The sequencing primer code in the sequence names of the raw sequence data must be correct. The correct primer code indicates the correct orientation of the sequencing (5' or 3'). A wrong orientation will cause the removal of vector/adaptor to fail. Please check the primer table carefully to make sure the orientation of the primer is correct.

d. When the cleansing procedure does not seem to give the expected result, please check whether the above key points have been followed.

2) Blast protocol
The blast protocol provides procedures and parameters for BLAST (BLASTN, BLASTX, and/or TBLASTX) searches. Users may choose to run all, some, or none of the blast procedures in the protocol.

3) Cluster/assembly protocol
The cluster/assembly protocol provides procedures and parameters for clustering and assembly analysis. Projects are not clustered automatically without a cluster protocol, so users register this protocol only for the projects on which they wish to run cluster/assembly. Please go to the View Project page of the project you wish to cluster/assemble and follow the link called Cluster Protocol to register the protocol and set the procedure parameters. The parameters that ESTAP provides are from the d2_cluster and CAP3 programs. The Register Cluster/Assembly Protocol page has a link, "Cluster/Assembly Protocol Guide", which provides paper references for the d2_cluster and CAP3 programs. Once the protocol is registered, ESTAP will automatically perform cluster/assembly for your project according to the protocol you registered.

If you wish to perform cluster/assembly for combined libraries (not for each project), you need to register a virtual project and select the taxonomy of the libraries. Then register the BLAST and cluster/assembly protocols for this virtual project. Once the cluster protocol is registered, ESTAP will automatically perform cluster/assembly for the virtual project.

The cluster/assembly protocol also provides an option to perform InterProScan analysis. Users may choose this option to let ESTAP automatically annotate the protein functions of singlets and contigs after the clustering/assembly procedure is performed.

PIs and the primary contacts with update privilege may also update the protocols. The updated protocols will be effective for new sequence data, not for data that have already been analyzed. If the user wishes to reanalyze the old data, he or she must make such a request to the ESTAP curator. The curator should use the following instructions:

1) If the sequences of a project must be re-cleansed using the new protocol, the curator must delete the sequences of the project from the ESTAP database and reload the sequences by running the estapDriver.exe program. Use the following commands to delete sequences from the database:

    delete from raw_seq_file where project_id = x;
    commit;

project_id is the internal id for the project. The curator may get the id from the PROJECT table by matching the lab code and project code of the project:

    select project_id from project where lab_code = 'xx' and user_proj_code = 'xxx';

Please see Section IV.4.e for estapDriver.exe usage.

2) If the sequences of a project must be re-blasted using the new BLAST protocol, the curator must delete the old blast results of that project from the ESTAP database and run the runBlast.exe program. Use the following commands to delete the blast results from the database:

    delete from analysis_tracking where project_id = x;
    commit;

Please see Section IV.4.f for runBlast.exe usage.

3) If the sequences of a project must be re-clustered using the new clustering/assembly protocol, the curator must delete the old clustering/assembly and contig blast results of that project from the ESTAP database. Then run the cluster.exe program for clustering/assembly and the blastAllContigs.exe program for contig blast. Use the following commands to delete the old results from the database:

    delete from cluster_summary where project_id = x;
    commit;

Use the "cluster.exe [project_id]" command to redo the clustering/assembly analysis and "blastAllContigs.exe" to redo the contig blast analysis. Note: [project_id] is the internal id for that project. Please see Section IV.4.h and Section IV.4.i for how to run the cluster.exe and blastAllContigs.exe programs.

VIII. How to Prepare Raw EST Data

Users can send ESTAP either the chromatograms of the sequences (.ab1 files) or both the sequence files and their corresponding quality score files in FASTA format. If the user sends ESTAP the chromatograms, ESTAP will use the Phred base-calling program to generate the sequences and the quality scores. Each sequence file contains sequences from the same sequencing run of the same project. Each quality score file contains the quality scores of the sequences in the corresponding sequence file. Each sequence or quality score string in the file must start with a ">" followed by a sequence name.

To process data files, ESTAP requires that data providers use a standard file name for each file and a standard sequence name for each sequence in the file. The file name contains specific information about that file, and the sequence name contains specific information about the sequence. All the sequences in the same file must be from the same project and the same sequencing run. The naming convention and rules are described as follows:
• Both the file name and the sequence name are required to be of a fixed length.
• Each element in the name must conform to a required number of characters or digits.
• Pad with 0's (zeros) in front of any element that is shorter than the required length. The file names and sequence names are case-sensitive.

1. File Naming Convention
All sequence files will have ".seq" as a suffix and quality score files will have ".qal" as a suffix. Every seq string must have an exactly matching qal string by name. The file names use the following naming convention: LLPPPTTIIIBRRNNNCCYYMMDDHHMM.seq for the sequence file and LLPPPTTIIIBRRNNNCCYYMMDDHHMM.qal for the corresponding quality score file.

LL – two-character alphanumeric lab code (provided by VBI), e.g., UN
PPP – three-character alphanumeric project id (provided by VBI), e.g., MCA
TT – two-character alphanumeric instrument type (provided by VBI), e.g., A1
III – three-character alphanumeric instrument id (user-defined code that is unique to the lab), e.g., VT1
B – one-character alphanumeric base-calling program id (provided by VBI), e.g., A
RR – two-character alphanumeric run parameter id (provided by VBI), e.g., S1
NNN – three-digit run duration in minutes (actual running time in minutes, e.g., 090 for 90 minutes)
CC – two-character alphanumeric sequencing chemistry id (provided by VBI), e.g., Pr
YYMMDDHHMM – the date and time at which the sequence was run, e.g., 0104150930
    YY – two-digit year
    MM – two-digit month
    DD – two-digit day
    HH – two-digit hour
    MM – two-digit minute
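Because every element of the file name has a fixed width, the convention can be checked mechanically before files are placed in the input directory. The following Python sketch splits a file name into its elements with a regular expression. It is only an illustration of the convention described above, not the check performed by estapDriver.exe; the function name parse_file_name and the dictionary keys are hypothetical.

```python
import re
from datetime import datetime

# One named group per element: LL PPP TT III B RR NNN CC YYMMDDHHMM + suffix.
FILE_NAME_RE = re.compile(
    r"^(?P<lab>[A-Za-z0-9]{2})"
    r"(?P<project>[A-Za-z0-9]{3})"
    r"(?P<instrument_type>[A-Za-z0-9]{2})"
    r"(?P<instrument_id>[A-Za-z0-9]{3})"
    r"(?P<base_caller>[A-Za-z0-9])"
    r"(?P<run_param>[A-Za-z0-9]{2})"
    r"(?P<minutes>[0-9]{3})"
    r"(?P<chemistry>[A-Za-z0-9]{2})"
    r"(?P<run_time>[0-9]{10})"
    r"\.(?P<suffix>seq|qal)$"
)


def parse_file_name(name):
    """Split a .seq/.qal file name into its elements or raise ValueError."""
    match = FILE_NAME_RE.match(name)
    if match is None:
        raise ValueError("file name does not follow the convention: %r" % name)
    fields = match.groupdict()
    fields["minutes"] = int(fields["minutes"])
    # strptime also rejects impossible dates such as month 13.
    fields["run_time"] = datetime.strptime(fields["run_time"], "%y%m%d%H%M")
    return fields


fields = parse_file_name("VBOVTA2VT1CS1120Tb0102011615.seq")
print(fields["lab"], fields["project"], fields["minutes"], fields["run_time"])
```

The names are matched case-sensitively, in line with the rule that file names are case-sensitive.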

2. Sequence Naming Convention

The sequence names in the sequence and quality score files must be in the following format: TPPPNNNNNNNNNLLL

T – one-letter primer type: u for user-defined primer and s for standard primer
PPP – three-character alphanumeric primer id (provided or confirmed by VBI), e.g., T3a
NNNNNNNNN – nine-character alphanumeric clone name, unique within the project (user defined)
LLL – three-digit lane or capillary number, e.g., 012 for lane 12 in the sequence gel

Note: Using the correct primer code (the first 4 characters) is very important for the cleansing procedure. The correct primer code indicates the correct orientation of the sequencing (5' or 3'). A wrong orientation will cause the removal of vector/adaptor to fail.
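The sequence name can be split in the same way as the file name. The following Python sketch extracts the four elements and returns the 4-character primer code (primer type plus primer id) that the note above refers to. It illustrates the format only; the function name parse_sequence_name and the dictionary keys are hypothetical and not part of ESTAP.

```python
import re

# T (u or s), PPP, NNNNNNNNN, LLL as described in the convention above.
SEQ_NAME_RE = re.compile(
    r"^(?P<primer_type>[us])"
    r"(?P<primer_id>[A-Za-z0-9]{3})"
    r"(?P<clone>[A-Za-z0-9]{9})"
    r"(?P<lane>[0-9]{3})$"
)


def parse_sequence_name(name):
    """Split a 16-character sequence name into its elements."""
    match = SEQ_NAME_RE.match(name)
    if match is None:
        raise ValueError("sequence name does not follow the convention: %r" % name)
    return {
        "primer_code": match.group("primer_type") + match.group("primer_id"),
        "standard_primer": match.group("primer_type") == "s",
        "clone": match.group("clone"),
        "lane": int(match.group("lane")),
    }


print(parse_sequence_name("sM13rroot0002012"))
```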

3. Example of a Sequence and a Quality Score File

a. Sequence file
File name: VBOVTA2VT1CS1120Tb0102011615.seq
The following is the content of the file:

>sM13rroot0002012
GTCGTAGCTAGCTAGCTGTGGACTGACTGA
CTAGCATCAGCTTGACTGACTAGCATAGCT
AGCATACGCTATCGATCAGGGCAGACTAGC
TGACTAGCATGACTAGCATT
>sM13rroot0003013
GTCGTAGCTTAGCTAGTCAGATCAGCATCG
AATACGATCAGCTAGCTGTGGACTGACTGA
CTAGCATCAGCTTGACTGACTAGCATAGCT
AGCATACGCTTACGTACACAGACACACA

b. Quality score file
File name: VBOVTA2VT1CS1120Tb0102011615.qal
The following is the content of the file:

>sM13rroot0002012
9 13 16 17 16 12 7 7 19 19 25 27 29 28 30 31 29 30 29 29 30 30 37 37 33 37 40 44 40 35 35 40 40 40 46 40 35 35 35 35 35 35 35 40 40 37 35 40 40 40 40 40 40 40 46 46 46 46 46 47 51 51 42 42 45 45 45 45 45 45 51 46 56 56 56 56 46 51 51 51 45 40 40 40 40 40 37 45 46 42 42 44 44 44 44 56 56 37 40 40 40 39 37 37 37 37 39 40 40 43
>sM13rroot0003013
9 13 13 19 19 9 6 6 16 19 24 24 29 29 29 29 30 37 46 34 37 40 46 40 37 37 37 37 37 40 40 40 40 40 40 35 35 40 40 37 35 35 35 35 35 35 39 40 46 46 46 46 46 56 46 42 42 46 43 51 51 51 51 42 42 51 42 42 51 51 51 51 56 51 51 51 51 51 51 56 56 56 56 56 46 42 42 42 42 42 42 42 42 43 43 35 35 35 35 35 35 56 51 51 51 51 51 51 56 56 56 56 40 40 40 40 40 40

Explanation of file name: VBOVTA2VT1CS1120Tb0102011615.seq
The lab code is VB. The project code is OVT (a library of Orobanche minor v. Virginia tubercles). The instrument is a 377, so the code is A2. The id for this instrument is VT1. The base-calling program is ABI100, which has a code of C. The run parameter code is S1. Run time is 2 hours (120 minutes). I have used dye terminator chemistry with big dyes, so the code is Tb. I ran these samples on Feb. 1, 2001 at 4:15 pm, so the date/time stamp will be 0102011615. You only have to enter this once. The data itself carries a shorter name.

Explanation of sequence name: sM13rroot0002012
The primer is the standard primer M13f, so the name starts with an "s" followed by the three-letter primer code "M13". The clone name is rroot0002. The sequence was run on lane 12, which has the code "012".
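The rule that every seq string must have an exactly matching qal string by name can be checked before the file pair is sent. The following Python sketch compares the record names of a sequence file and a quality score file and reports names that appear in only one of them. It checks names only and is not the validation done by estapDriver.exe; the function names are hypothetical.

```python
def record_names(text):
    """Return the names of all records, i.e. the text after each leading '>'."""
    return [line[1:].strip() for line in text.splitlines() if line.startswith(">")]


def unmatched_names(seq_text, qal_text):
    """Return (names only in the .seq text, names only in the .qal text)."""
    seq_names = set(record_names(seq_text))
    qal_names = set(record_names(qal_text))
    return sorted(seq_names - qal_names), sorted(qal_names - seq_names)


seq_text = ">sM13rroot0002012\nGTCGTAGC\n>sM13rroot0003013\nGTCGTAGC\n"
qal_text = ">sM13rroot0002012\n9 13 16 17 16 12 7 7\n"
only_seq, only_qal = unmatched_names(seq_text, qal_text)
print("missing from .qal:", only_seq)
print("missing from .seq:", only_qal)
```

A file pair is consistent by name when both returned lists are empty.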

4. Sending Chromatogram Files
If you wish ESTAP to run Phred for you, send the chromatograms of the sequences. Put all sequences from the same sequencing run of the same project in a folder. The name of the folder follows the file naming convention without the .seq or .qal suffix, e.g., VBOVTA2VT1CS1120Tb0102011615. Each chromatogram file follows the sequence naming convention plus the suffix ".ab1", e.g., sM13rroot0002012.ab1.

IX. How to Use dbEST Submission Tool

The dbEST Submission Tool incorporated in ESTAP is available online to allow users to conveniently submit their cleansed EST sequences in the ESTAP database to NCBI's dbEST database. Note that only PIs have the privilege to submit their EST sequences to dbEST. PIs may follow the "dbEST Submission" link on the "Project List" page to ask ESTAP to prepare the dbEST submission files for them. During dbEST submission, a contact file, a library file, a publication file, and one or more EST files will be created and sent to the primary contact person via email. EST files will be in fasta format with clone names and clean sequence ids as identifiers.

After submitting the EST sequences to NCBI's dbEST, the PI should receive an email from NCBI regarding the submitted batch of ESTs. In order to store the dbEST_ID, USER_ID, and GENBANK_ACCN assigned by NCBI in the ESTAP database, the PI should send the confirmation file received from NCBI to the ESTAP curator. The ESTAP curator should use the ReadConfirmationFile program to process the confirmation file.

The ReadConfirmationFile program is located under the java_prog directory of the uncompressed software. It is a stand-alone Java program that reads and parses the dbEST confirmation file and stores the dbEST_ID, USER_ID, and GENBANK_ACCN assigned by NCBI in the ESTAP database. To make this program work, the curator needs to modify a configuration file called "javaConfig.txt" in the java_prog directory and set the correct parameters (see Section IV.1.c). ReadConfirmationFile will use the parameters defined in this file to find the confirmation file and connect to the ESTAP database. The confirmation file should be named with a .txt suffix (e.g., JC_MCA_020314052943.txt) and placed in the directory path specified in the variable "ncbiConfirmFilePath" in the javaConfig.txt file. Go to the directory where ReadConfirmationFile.class is located and run the ReadConfirmationFile program using the following command:
    java ReadConfirmationFile


• CAP3 – This program is used for EST assembly. Please contact xqhuang@cs.iastate.edu for licensing of the CAP3 program.

• Email server – used to automatically send emails to PIs when their data are processed. ESTAP uses the elm (electronic mail for unix) mail system for sending email messages when raw sequence data are cleansed. Please go to http://www.instinct.org/elm/files/tarballs/elm2.5.6.tar.gz to download the elm program if it is not already installed on your system. Place the elm program in the /usr/local/bin directory.

• InterProScan – This is a tool developed at the European Bioinformatics Institute (EBI) that combines different protein signature recognition methods into one resource. ESTAP wraps this tool to query contigs and singlets against the InterPro public databases to automatically annotate the protein functions. Please go to ftp://ftp.ebi.ac.uk/pub/software/unix/iprscan to download the iprscan program and the InterPro databases (see Section IV.3.k for installation details).

c. The ESTAP web programs require the following software:

• Servlet and JSP software – ESTAP uses Apache Tomcat. Please go to http://jakarta.apache.org/tomcat to download a current version of Tomcat. Currently ESTAP uses Tomcat version 4.0.4.

• Apache AXIS for web services – Please go to http://ws.apache.org/axis to download the current version of Apache AXIS.

• Oracle 9i client including JDBC

• JDK – Please go to http://www.sun.com to download the latest version of JDK. ESTAP uses JDK 1.3.1 or higher.

• Email server – used to automatically send dbEST submission files as attachments to PIs when they use the ESTAP dbEST submission tool. ESTAP uses a bash attachment mailer program called "BIABAM". Please go to http://panther.mmj.dk/biabam to download the BIABAM program. In order to send the mail attachment without a command line prompt, please remove the following lines from the biabam file:
    echo "Email body (type CTRL-d on a blank line to finish)"
    cat >> $TEMPFILE
    echo >> $TEMPFILE
Rename your modified biabam program to "mail_attachment" and place it in the /usr/local/bin directory.

For more information about Phred, Phrap, and Cross_match, please visit http://www.phrap.com.

II. How to Download and Uncompress the Software

Go to the ESTAP Web site (http://www.vbi.vt.edu/~estap) to download the current version of the ESTAP software. The software package is prepared for installation on Linux. The file is compressed in tar.gz format. Uncompress it by first using the "gunzip" command and then the "tar -xvf" command.

The uncompressed software has four directories: server_side, web_side, database, and java_prog. The server_side directory contains directories and files for the pipeline analysis programs. The web_side directory contains directories and files for the ESTAP web programs. The database directory contains documentation and scripts for creating the ESTAP database and importing data. The java_prog directory contains stand-alone Java applications used for processing the dbEST submission confirmation file that NCBI sends to the user (ReadConfirmationFile.java) and for running InterProScan analysis for singlets and contigs (the remaining java files in this directory).

III. How to Install the Oracle Software and Create an ESTAP Database

1. Install Oracle Software
ESTAP requires Oracle 9i. Oracle provides detailed instructions with their products.

2. Create the Database Structure
See the separate documentation for creation of the ESTAP database. The document is named "README_db.txt" and is included in the software package under the database directory.

3. Load data
See the separate documentation for instructions. Please refer to the document named "README_db.txt" that is included in the software package under the database directory.

IV. How to Install and Run Pipeline Analysis Programs

1. Set up

a. Set up the correct directory hierarchy where you will place raw sequence data (.seq and .qal files).

The user may specify a directory for placement of input sequence data, e.g., /home/estap/raw_seq. This directory can be named according to user preference. Under this directory, the user must set up the following subdirectories: input, done, and problem. These three directories must be named exactly as shown to be recognized by the ESTAP programs (note: the directory names are case-sensitive). Under the input directory, create a lab directory for each registered lab (XX). The lab directory names are the two-letter lab codes (case-sensitive) of the labs, e.g., JC. The raw sequence data (.seq and .qal files) of a lab should be put under the directory for that lab (input/XX).

An ESTAP program (estapDriver.exe) is scheduled to run periodically to read data files from each lab subdirectory, verify and cleanse the sequences, and store them in the ESTAP database. If a .seq and .qal pair in the input/XX directory has a valid file name, and all the sequences in the .seq file and all quality scores in the .qal file have valid names and format, the pair is considered valid. After data processing, a valid pair is moved from the input/XX directory to the done/XX directory. An invalid pair is moved from the input/XX directory to the problem/XX directory. The corresponding lab (XX) subdirectories under the done and problem directories are created automatically by estapDriver.exe.
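The directory layout described in item a can be created in one step. The following is a minimal sketch, assuming a scratch root directory (standing in for something like /home/estap/raw_seq) and the example lab code JC. Only input/XX is created by hand, since the guide states that estapDriver.exe creates the done/XX and problem/XX subdirectories itself.

```sh
#!/bin/sh
# Sketch: create the raw-sequence directory hierarchy expected by ESTAP.
# ESTAP_ROOT is a stand-in location; substitute e.g. /home/estap/raw_seq.
ESTAP_ROOT="${ESTAP_ROOT:-$(mktemp -d)}"
LAB_CODES="JC"                   # two-letter, case-sensitive lab codes (example)

# The three top-level names are fixed and case-sensitive.
for d in input done problem; do
    mkdir -p "$ESTAP_ROOT/$d"
done

# One subdirectory per registered lab, under input/ only.
for lab in $LAB_CODES; do
    mkdir -p "$ESTAP_ROOT/input/$lab"
done

ls -R "$ESTAP_ROOT"
```

Several lab codes can be given as a space-separated list in LAB_CODES.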

b. Under the server_side directory of the uncompressed ESTAP software, there are three sub-directories: estap_exe, include, and src.

The analysis programs (configuration files, executables, and scripts) are located in the estap_exe directory. The executable programs in this directory were compiled using gcc 2.95.3 and glibc-2.2.5 and tested on Linux version 2.4.18. The third-party software and files (Cross_match, d2_cluster, CAP3, repeat.seq) are not included here; users should obtain them from their corresponding sources. The executable programs cross_match, d2_cluster, enc_db, and cap3, and the repeat.seq file, should be placed in the estap_exe directory.

The configuration file db_password.txt is used for the database connection. The user needs to edit this file to provide a database user name and password for the ESTAP analysis programs to connect to the ESTAP database. The name of the file must be "db_password.txt". The content of the file must follow this format:

    USER_NAME=user_name@estap_db_name
    PASSWORD=user_password

For example:

    USER_NAME=estap_user@estapdb
    PASSWORD=password

The include and src directories contain the source code for the ESTAP analysis programs. A makefile is provided under the src directory. The user may recompile the programs using the "make build" command under the src directory. The newly compiled programs will go to the estap_exe directory.
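A quick way to avoid connection failures caused by a malformed db_password.txt is to check the file before running any analysis program. The sketch below rests on two assumptions that should be verified against the shipped sample file: that the two settings sit on separate lines, and that the user name and database name are joined with "@" as in an Oracle connect string. The check_db_password function is a hypothetical helper, not part of ESTAP, and the credentials are the placeholder values from the example.

```sh
#!/bin/sh
# Sketch: write db_password.txt in KEY=value form and sanity-check it.
CONF_DIR="$(mktemp -d)"          # stand-in for the estap_exe directory
CONF_FILE="$CONF_DIR/db_password.txt"

cat > "$CONF_FILE" <<'EOF'
USER_NAME=estap_user@estapdb
PASSWORD=password
EOF
chmod 600 "$CONF_FILE"           # the file holds a password; keep it private

# Report any required key that is missing or has an empty value.
check_db_password() {
    missing=0
    for key in USER_NAME PASSWORD; do
        if ! grep -q "^${key}=..*" "$1"; then
            echo "missing or empty: $key" >&2
            missing=1
        fi
    done
    return "$missing"
}

check_db_password "$CONF_FILE" && echo "db_password.txt looks well formed"
```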

c. Under the java_prog directory of the uncompressed ESTAP software, there are Java application programs and a configuration file. The configuration file javaConfig.txt provides parameters for the Java applications, such as database connection parameters, the directory location for dbEST files, and the number of processors used for InterProScan analysis. The content of the file must follow this format:

    jdbcDriverClassName=oracle.jdbc.driver.OracleDriver
    jdbcURL=jdbc:oracle:thin:@server.vt.edu:1521:estap_db_name
    dbUserName=db_user_name
    dbUserPassword=db_user_password
    ncbiConfirmFilePath=directory for dbEST confirmation file
    iprscanNumProcessor=number of processors to be used for InterProScan analysis

For example:

    jdbcDriverClassName=oracle.jdbc.driver.OracleDriver
    jdbcURL=jdbc:oracle:thin:@server.vt.edu:1521:estapdb
    dbUserName=estap_user
    dbUserPassword=password
    ncbiConfirmFilePath=/home/estap/dbEST_Incoming
    iprscanNumProcessor=2

2. Environment Variables

a. Environment variables needed for Oracle: PATH, ORACLE_HOME, LD_LIBRARY_PATH, ORA_CLIENT_LIB, and ORACLE_SID. The following are examples of the variable settings (ORACLE_HOME is set first because PATH and LD_LIBRARY_PATH refer to it):

    export ORACLE_HOME=/home/oracle/products/9.2.0
    export PATH=$PATH:/home/estap/blast:$ORACLE_HOME/bin
    export LD_LIBRARY_PATH=$ORACLE_HOME/lib
    export ORA_CLIENT_LIB=shared
    export ORACLE_SID=estap

More information regarding database-related environment variables may be found in the document "README_db.txt", located in the database directory.

b. Environment variables needed for BLAST programs. Please go to the NCBI site to view details about how to set up BLAST programs. You need to create a .ncbirc file to set Data=/home/estap/blast/data if your BLAST programs are located at /home/estap/blast. You need to include this path in the PATH variable (see above). You also need to set the environment variable BLASTDB to /home/estap/blast/db if your BLAST databases are located at /home/estap/blast/db, e.g.,

    export BLASTDB=/home/estap/blast/db

c. Environment variable needed for the Phred program: PHRED_PARAMETER_FILE. If the Phred program (phred and phredpar.dat) is in the /home/estap/phred directory, the variable PHRED_PARAMETER_FILE should be set to /home/estap/phred/phredpar.dat, e.g.,

    export PHRED_PARAMETER_FILE=/home/estap/phred/phredpar.dat

If this variable is not set correctly, phred will not run properly.

d. Environment variable needed for the runPhred.exe program: PHRED_PATH. The runPhred.exe program wraps the phred program and combines all .ab1 files in one directory to produce one .seq and one .qal file in FASTA format. If the phred program (phred and phredpar.dat) is in the /home/estap/phred directory, the variable PHRED_PATH should be set to /home/estap/phred/phred, e.g.,

    export PHRED_PATH=/home/estap/phred/phred

If this variable is not set correctly, the program will not run properly.

e. Environment variable needed for the d2_cluster program: OMP_NUM_THREADS. If you use one processor, you need to set this variable to 1, e.g.,

    export OMP_NUM_THREADS=1

If this variable is not set, d2_cluster will not run. In some cases, the d2_cluster program may not exit normally due to a memory problem. We recommend setting the environment variable MALLOC_CHECK_ to 1, e.g.,

    export MALLOC_CHECK_=1

In addition, d2_cluster requires libpgthread.so. You may place it in the /usr/local/lib directory.
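Because several programs fail when one of these variables is missing, it can help to keep all settings in one script and verify them before a run. The following is a sketch only: the paths are example values (adjust them to your installation), check_env is a hypothetical helper rather than an ESTAP program, and LD_LIBRARY_PATH is appended to instead of overwritten so that an existing value is preserved.

```sh
#!/bin/sh
# Sketch: set the variables named in this section, then confirm none is empty.
export ORACLE_HOME=/home/oracle/products/9.2.0
export ORACLE_SID=estap
export ORA_CLIENT_LIB=shared
# Append rather than overwrite, in case LD_LIBRARY_PATH is already in use.
export LD_LIBRARY_PATH="${LD_LIBRARY_PATH:+$LD_LIBRARY_PATH:}$ORACLE_HOME/lib"
export BLASTDB=/home/estap/blast/db
export PHRED_PARAMETER_FILE=/home/estap/phred/phredpar.dat
export PHRED_PATH=/home/estap/phred/phred
export OMP_NUM_THREADS=1
export MALLOC_CHECK_=1

# check_env NAME...: report every listed variable that is unset or empty.
check_env() {
    status=0
    for name in "$@"; do
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "not set: $name" >&2
            status=1
        fi
    done
    return "$status"
}

check_env ORACLE_HOME ORACLE_SID ORA_CLIENT_LIB LD_LIBRARY_PATH BLASTDB \
          PHRED_PARAMETER_FILE PHRED_PATH OMP_NUM_THREADS MALLOC_CHECK_ \
    && echo "all ESTAP environment variables are set"
```

Such a script could be sourced at the top of each scheduled wrapper script so that cron jobs see the same settings as an interactive shell.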

3. Run the Programs

a. FTP and format NCBI blast databases (e.g., nr, nt)

Fetched NCBI databases are updated every month, or as frequently as you wish. Please make sure that ESTAP analysis programs are not running when you update NCBI databases; otherwise, the ESTAP analysis programs will not run properly. Also make sure that database formatting is complete before running ESTAP programs.

We provide some shell scripts (getAndFormatNcbiDB.sh, getDB.sh, uncompressFormatDB.sh) and a C program (getAndFormatNcbiDB.exe) under the estap_exe directory to update the NCBI databases automatically. We use crontab to schedule the update, e.g.,

    30 3 2 * * ~/estap_exe/getAndFormatNcbiDB.sh > get_db.log 2>&1

getAndFormatNcbiDB.sh calls getAndFormatNcbiDB.exe, which calls getDB.sh and uncompressFormatDB.sh. The getAndFormatNcbiDB.exe program waits until all ESTAP analysis programs finish before updating the NCBI databases. getDB.sh fetches the NCBI databases via wget. uncompressFormatDB.sh uncompresses the fetched NCBI databases and formats them using formatdb. Please revise the scripts if your directory setup is different.

b. Where to put raw data (.ab1 files)

The .ab1 files can be stored in a user-specified directory, e.g., /home/estap/raw_data. When receiving data from customers, create a new directory under this directory using the following naming convention for better management (the user may choose a different naming convention): labcode-ddmmmyy, e.g., JC-06may02. Then transfer the data to this directory. All sequence files must be named following ESTAP naming conventions (see Section VIII for how to prepare raw data and for the naming convention).
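The labcode-ddmmmyy directory name can be generated rather than typed. Below is a minimal sketch, assuming a scratch directory in place of /home/estap/raw_data and the example lab code JC; the month abbreviation is lowercased to match the JC-06may02 example.

```sh
#!/bin/sh
# Sketch: build a batch directory name of the form labcode-ddmmmyy.
RAW_DATA="$(mktemp -d)"          # stand-in for e.g. /home/estap/raw_data
LAB_CODE="JC"                    # example lab code

# %d = day of month, %b = abbreviated month name, %y = two-digit year.
# LC_ALL=C keeps the month name in English regardless of the system locale.
STAMP="$(LC_ALL=C date +%d%b%y | tr '[:upper:]' '[:lower:]')"
BATCH_DIR="$RAW_DATA/$LAB_CODE-$STAMP"

mkdir -p "$BATCH_DIR"
echo "transfer the .ab1 files into: $BATCH_DIR"
```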

c. Run Phred

Go to the directory where runPhred.exe is located, e.g., /home/estap/server_side/estap_exe, and run the runPhred.exe program. This program wraps the phred program to perform base calling and combines all .ab1 files in one sequence file directory to produce one .seq and one .qal file in FASTA format. If multiple sequence file directories exist, multiple .seq and .qal files will be generated.

Command: runPhred.exe input_directory output_directory

e.g., runPhred.exe /home/estap/raw_data/jc-06may02 /home/estap/raw_data/jc-06may02-phred-out

where runPhred.exe is the program name, /home/estap/raw_data/jc-06may02 is the directory that contains the .ab1 files from the JC lab, and /home/estap/raw_data/jc-06may02-phred-out is the output directory where the .seq and .qal files go. Before running the program, you must first create the output directory (/home/estap/raw_data/jc-06may02-phred-out).

After the program is done, check phred.log (the program logs its results automatically) in the /home/estap/server_side/estap_exe directory for any problems. If there are none, you may either rename the log file or remove it. In the output directory, remove any files that do not have a .seq or .qal extension.
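The final clean-up step (removing everything except .seq and .qal files from the output directory) can be scripted. This sketch uses a scratch directory with made-up file names in place of a real phred output directory; it lists the files first so that the removal can be reviewed.

```sh
#!/bin/sh
# Sketch: keep only .seq and .qal files in the runPhred.exe output directory.
OUT_DIR="$(mktemp -d)"           # stand-in for e.g. .../jc-06may02-phred-out

# Demo content only: two result files and one stray file (hypothetical names).
touch "$OUT_DIR/demo.seq" "$OUT_DIR/demo.qal" "$OUT_DIR/stray.tmp"

# List the files that would be removed, then remove them.
find "$OUT_DIR" -maxdepth 1 -type f ! -name '*.seq' ! -name '*.qal' -print
find "$OUT_DIR" -maxdepth 1 -type f ! -name '*.seq' ! -name '*.qal' -exec rm -f {} +

ls "$OUT_DIR"
```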

d. Copy the .seq and .qal files generated by the runPhred.exe program to the correct lab subdirectory under the input directory, e.g., if the files belong to the JC lab, copy those files to /home/estap/raw_seq/input/JC.

e. Run the cleansing program to verify the integrity of the .seq and .qal files, cleanse the sequences, and store the results in the ESTAP database.

Command: estapDriver.exe root_path_name >clean.log 2>&1

e.g., estapDriver.exe /home/estap/raw_seq >clean.log 2>&1

where estapDriver.exe is the program name, root_path_name is the parent directory path of the input directory (see Section IV.1 for the directory hierarchy description), and >clean.log 2>&1 logs the program output (both stdout and stderr) to a log file named clean.log. The clean.log file can be viewed for debugging purposes. After the cleansing program has run successfully, clean.log can be removed to save disk space.

You may use nohup to run the command, which makes your program immune to hang-ups, e.g.,

    nohup estapDriver.exe root_path_name >clean.log 2>&1 &

It is better to use nohup and run the program in the background (with & at the end of the command). To check whether the program is running, use the command: ps -ef | grep estapDriver.exe

If for some reason you need to re-run the cleansing for some sequence files, follow these steps:
(1) Delete the sequence file(s) from the database directly.
(2) Reload the corresponding .seq and .qal files to the input/[lab_code] directory.
(3) Run estapDriver.exe as described in e.

f. Run the ESTAP blast program to blast according to the corresponding blast protocols.

Command: runBlast.exe >blast.log 2>&1

where runBlast.exe is the program name, and >blast.log 2>&1 logs the program output (both stdout and stderr) to a log file named blast.log. The blast.log file can be viewed for debugging purposes. After runBlast.exe has run successfully, blast.log can be removed to save disk space. You may use the nohup command, e.g.,

    nohup runBlast.exe >blast.log 2>&1 &

It is better to use nohup and run the program in the background (with & at the end of the command). To check whether the program is running, use the command: ps -ef | grep runBlast.exe

Currently, runBlast.exe is set to blast 1000 sequences at a time (in the future the setting can be changed to a bigger number). If there are more than 1000 sequences to blast, you need to run this program multiple times. Check blast.log to see how many sequences must be blasted.

Note: Wait at least 5-10 minutes before starting another runBlast.exe program, to make sure that the sequences to be blasted in the previous run are already recorded in the ESTAP database. Check the blast.log file to see whether the blast command has already appeared. When starting a second runBlast.exe program, use a different log file name instead of blast.log, e.g., blast2.log. The same precautions apply when you need to run the program three or more times.
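When the blast program has to be started several times, each run needs its own log file (blast.log, blast2.log, and so on). The sketch below shows one way to pick the next unused name automatically; next_log is a hypothetical helper, not part of ESTAP, and the demo runs in a scratch directory with empty placeholder logs.

```sh
#!/bin/sh
# Sketch: choose a log file name that is not already in use.
next_log() {
    base="$1"                    # e.g. "blast"
    if [ ! -e "$base.log" ]; then
        echo "$base.log"
        return
    fi
    n=2
    while [ -e "$base$n.log" ]; do
        n=$((n + 1))
    done
    echo "$base$n.log"
}

# Demo in a scratch directory: create two logs in sequence.
WORK="$(mktemp -d)"
cd "$WORK"
FIRST="$(next_log blast)"
: > "$FIRST"
SECOND="$(next_log blast)"
: > "$SECOND"
echo "$FIRST $SECOND"            # prints: blast.log blast2.log

# Real use (commented out because runBlast.exe is not available here):
# nohup ./runBlast.exe > "$(next_log blast)" 2>&1 &
```

The same helper works for the reblast and contig blast programs by passing a different base name, e.g., next_log reblast.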

g. Run the reblast program to reblast according to the corresponding blast protocols.

Command: reblast.exe >reblast.log 2>&1

where reblast.exe is the program name, and >reblast.log 2>&1 logs the program output (both stdout and stderr) to a log file named reblast.log. The reblast.log file can be viewed for debugging purposes. After reblast.exe has run successfully, reblast.log can be removed to save disk space. You may use nohup to run the command, which makes your program immune to hang-ups, e.g.,

    nohup reblast.exe >reblast.log 2>&1 &

It is better to use nohup and run the program in the background (with & at the end of the command). To check whether the program is running, use the command: ps -ef | grep reblast.exe

Currently, reblast.exe is set to reblast 1000 sequences at a time (in the future the setting can be changed to a bigger number). If there are more than 1000 sequences to reblast, you need to run this program multiple times. Check reblast.log to see how many sequences need to be blasted.

Note: Wait at least 5-10 minutes before starting another reblast.exe program, to make sure that the sequences to be blasted in the previous run are already recorded in the ESTAP database. Check the reblast.log file to see whether the blast command has already appeared. When starting a second reblast.exe program, use a different log file name instead of reblast.log, e.g., reblast2.log. The same precautions apply when you need to run the program three or more times.

h. Run the cluster/assembly programs. You have two options: 1) do cluster/assembly for a specific project; 2) do cluster/assembly for all projects.

1) Do cluster/assembly for a specific project.

Command: cluster.exe project_id >cluster_project_id.log 2>&1

e.g., cluster.exe 377 >cluster_377.log 2>&1

where cluster.exe is the program name, project_id is the project id of the project you wish to run (note: this is the internal id, not the project code), and >cluster_project_id.log 2>&1 logs the program output (both stdout and stderr) to a log file named cluster_project_id.log. This log file can be viewed for debugging purposes. After the program has run successfully, the log file can be removed to save disk space. You may use nohup to run the command, which makes your program immune to hang-ups, e.g.,

    nohup cluster.exe project_id >cluster_project_id.log 2>&1 &

To check whether the program is running, use the command: ps -ef | grep cluster.exe

2) Do cluster/assembly for all projects that are qualified for the process (this is preferred).

Command: clusterAll.exe >cluster.log 2>&1

where clusterAll.exe is the program name, and >cluster.log 2>&1 logs the program output (both stdout and stderr) to a log file named cluster.log. This log file can be viewed for debugging purposes. After the program has run successfully, the log file can be removed to save disk space. You may use nohup to run the command, which makes your program immune to hang-ups, e.g.,

    nohup clusterAll.exe >cluster.log 2>&1 &

To check whether the program is running, use the command: ps -ef | grep clusterAll.exe

i. Run the contig blast program to blast assembled contigs according to the corresponding blast protocols.

Command: blastAllContigs.exe >contig_blast.log 2>&1

where blastAllContigs.exe is the program name, and >contig_blast.log 2>&1 logs the program output (both stdout and stderr) to a log file named contig_blast.log. The contig_blast.log file can be viewed for debugging purposes. After blastAllContigs.exe has run successfully, the log file can be removed to save disk space. You may use nohup to run the command, which makes your program immune to hang-ups, e.g.,

    nohup blastAllContigs.exe >contig_blast.log 2>&1 &

It is better to use nohup and run the program in the background (with & at the end of the command). To check whether the program is running, use the command: ps -ef | grep blastAllContigs.exe

Currently, blastAllContigs.exe is set to blast 1000 contigs at a time (in the future the setting can be changed to a bigger number). If there are more than 1000 contigs to blast, you need to run this program multiple times. Check contig_blast.log to see how many sequences need to be blasted.

Note: Wait at least 5-10 minutes before starting another blastAllContigs.exe program, to make sure that the sequences to be blasted in the previous run are already recorded in the ESTAP database. Check the contig_blast.log file to see whether the blast command has already appeared. When starting a second blastAllContigs.exe program, use a different log file name instead of contig_blast.log, e.g., contig_blast2.log. The same precautions apply when you need to run the program three or more times.

j. Run the contig reblast program to reblast assembled contigs according to the corresponding blast protocols.

Command: reblastAllContigs.exe >contig_reblast.log 2>&1

where reblastAllContigs.exe is the program name, and >contig_reblast.log 2>&1 logs the program output (both stdout and stderr) to a log file named contig_reblast.log. This log file can be viewed for debugging purposes. After reblastAllContigs.exe has run successfully, the log file can be removed to save disk space. You may use nohup to run the command, which makes your program immune to hang-ups, e.g.,

    nohup reblastAllContigs.exe >contig_reblast.log 2>&1 &

It is better to use nohup and run the program in the background (with & at the end of the command). To check whether the program is running, use the command: ps -ef | grep reblastAllContigs.exe

Currently, reblastAllContigs.exe is set to reblast 1000 contigs at a time (in the future the setting can be changed to a bigger number). If there are more than 1000 contigs to reblast, you need to run this program multiple times. Check contig_reblast.log to see how many sequences need to be blasted.

Note: Wait at least 5-10 minutes before starting another reblastAllContigs.exe program, to make sure that the sequences to be blasted in the previous run are already recorded in the ESTAP database. Check the contig_reblast.log file to see whether the blast command has already appeared. When starting a second reblastAllContigs.exe program, use a different log file name instead of contig_reblast.log, e.g., contig_reblast2.log. The same precautions apply when you need to run the program three or more times.

k. Set up and run the InterProScan application

InterProScan is a tool developed at EBI that combines different protein signature recognition methods into one resource. ESTAP wraps this tool to query contigs and singlets against the InterPro public databases to annotate protein functions automatically. The GO terms of the matched proteins are linked to the contigs and singlets. Please follow the instructions below to set up and run the InterProScan application.

1) Download iprscan
Go to the ftp://ftp.ebi.ac.uk/pub/software/unix/iprscan site to download the following files:
(1) iprscan_vXXX.tar.gz, where XXX is the version number of the iprscan release (e.g., 3.2)
(2) iprscan_bin_XXX.tar.gz, where XXX is your platform (e.g., Linux)
(3) iprscan_DATA_XXX.tar.gz, where XXX is the version number of the InterPro database release (e.g., 6.1)

2) Uncompress the files, in the order specified above, into your home directory ($HOME_DIR). The iprscan home directory will be $HOME_DIR/iprscan, e.g., /home/estap/iprscan.

3) First-time configuration
Go to the iprscan home directory and run: perl CONFIG.pl
Then:

(1) Modify CONFIG.pl as described below (save a copy of the original CONFIG.pl file as CONFIG_orig.pl).

(i) At the beginning of CONFIG.pl (in main), along with the other variable declarations, add the following lines:

    my $IprPWD="iprpwd.txt";
    my $pwd = $ENV{PWD};
    unless ($pwd) {
        $pwd = `pwd`;
        chomp $pwd;
    }

(ii) Before the line that sets up applications:

    $applset = get_user_prompt("Setup applications $first (y|n)");

add these lines:

    open (FF, ">$IprPWD") || die "Cannot create $IprPWD: $!";
    print FF "$pwd";
    close FF;

(2) Modify the InterProScan.pl file as described below (save a copy of the original file as InterProScan_orig.pl).

(i) Replace

    my $seqfile = $ARGV[0];

with

    my $seqfile = $ARGV[1];

(ii) After the line

    my $UserId = Manager->getUserId();

add these lines:

    my $OutDir=$ARGV[0];
    print "$OutDir \n";
    # Do not use $path in this file since it won't be recognized by the iprscan wrapper program.
    # $OutDir, instead of $path, is used as an argument for the iprscan wrapper program to store iprscan result files.

(iii) Replace $path with $OutDir in all the lines following the above comment lines.

4) Reconfiguration
Go to the iprscan home directory and run: perl CONFIG.pl
You will be asked to choose the member databases you would like to search against. If you only use the public data and applications that come with the iprscan distribution, you should answer "n" for SignalPHMM and TMHMM during configuration, since they are not public. When the configuration is complete, a file called iprpwd.txt will be created under the iprscan home directory. The iprpwd.txt file stores the iprscan path information, which is used by the iprscan wrapper programs. You must copy the iprpwd.txt file to the java_prog directory, where the iprscan wrapper programs are placed, so that the Java programs can read the iprscan path information from this file.

5) Run iprscan
Go to the java_prog directory, where the Java application and configuration files (javaConfig.txt and iprpwd.txt) are placed, and set the proper parameters in the configuration file as described in Section IV.1.c. You may analyze a specific project or all projects at a time.

If you wish to run a specific project, use the command:

    nohup java InterProDriver lab_code project_code > ipr_project_code.log 2>&1 &

e.g., nohup java InterProDriver JC MCA > ipr_MCA.log 2>&1 &

where InterProDriver is the program name, lab_code is the two-letter lab code of the lab that the project belongs to, project_code is the three-letter project code of the project being analyzed, and > ipr_project_code.log 2>&1 logs the program output (both stdout and stderr) to a log file named ipr_project_code.log.

If you wish to run all projects that are qualified for this procedure, use the command:

    nohup java InterProScanAll > ipr.log 2>&1 &

where InterProScanAll is the program name, and > ipr.log 2>&1 logs the program output (both stdout and stderr) to a log file named ipr.log.

The log files can be viewed for debugging purposes. After the program has run successfully, the log files can be removed to save disk space.

l. Run the genomic DNA assembly programs

You have two options: 1) do DNA assembly for a specific genome project; 2) do DNA assembly for all qualified genome projects.

1) Do assembly for a specific genome project.

Command: nohup gAssemble.exe project_id >assemble_project_id.log 2>&1

e.g., nohup gAssemble.exe 377 > assemble_377.log 2>&1

where gAssemble.exe is the program name, project_id is the project id of the project you wish to run (note: this is the internal id, not the project code), and > assemble_project_id.log 2>&1 logs the program output (both stdout and stderr) to a log file named assemble_project_id.log.

2) Do assembly for all genome projects that are qualified for the process (this is preferred).

Command: nohup gAssembleAll.exe >assemble.log 2>&1

where gAssembleAll.exe is the program name, and > assemble.log 2>&1 logs the program output (both stdout and stderr) to a log file named assemble.log.

m. Prepare EST database files from user projects

The ESTAP Web interface allows users to blast specific sequence(s) against the ESTs from their projects. We provide a program called writeAndFormatFasta.exe to create and format EST database files from user projects. This program can be scheduled to run daily, or as often as you wish, to provide the most recent EST databases for users to blast against. To run this program, you need to create a directory where the EST database files will be placed, e.g., /home/estap/estdb. This directory path should be used for the ESTDBFilePath parameter setting in the web.xml file (see Section V.3.e) so that the ESTAP local blast Web service can recognize it.

4. Schedule Programs

The above programs can be run manually as described above, or scheduled to run automatically using the UNIX cron command. Please see the UNIX man page for the crontab command. The following is an example of the scheduling:

    0 9,11,17 * * * ~/estap_exe/estapDriver.sh >> estap.log 2>&1
    0 19,20,21 * * * ~/estap_exe/runBlast.sh >> blast.log 2>&1
    30 22,23,24 * * * ~/estap_exe/reblast.sh >> reblast.log 2>&1
    30 3 2 * * ~/estap_exe/getAndFormatNcbiDB.sh > get_db.log 2>&1
    0 5 * * * ~/estap_exe/clusterAll.sh >> ~/estap_exe/cluster.log 2>&1
    0,30 19 * * * ~/estap_exe/blastAllContigs.sh >> ~/estap_exe/blast_contigs.log 2>&1
    0,30 12 * * * ~/estap_exe/reblastAllContigs.sh >> ~/estap_exe/reblast_contigs.log 2>&1
    0,30 7 * * * ~/estap_exe/gAssembleAll.sh >> ~/estap_exe/assemble.log 2>&1
    5 2 * * * ~/estap_exe/writeAndFormatFasta.sh >> ~/estap_exe/estap_formatdb.log 2>&1

Note that standard cron accepts hours 0-23 only, so an hour of 24 must be written as 0.

The shell scripts used here call the corresponding programs and make sure the programs are run under the correct directories and with the correct environment variable settings. These scripts should be revised if your directory setup and environment variable settings are different.
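The wrapper scripts are described as changing to the right directory and setting the environment before calling a program. The sketch below illustrates that pattern only; it is not the content of the shipped scripts. run_in_dir is a hypothetical helper, the directory is a scratch stand-in for ~/estap_exe, and a harmless command takes the place of estapDriver.exe so that the sketch can run anywhere.

```sh
#!/bin/sh
# Sketch of a cron wrapper: set the environment, change directory, run, log.
run_in_dir() {
    dir="$1"
    shift
    ( cd "$dir" && "$@" )        # subshell: the caller's directory is unchanged
}

ESTAP_EXE="$(mktemp -d)"         # stand-in for ~/estap_exe
export OMP_NUM_THREADS=1         # example setting from the environment section

# Demo with a harmless command; a real wrapper would run something like
#   run_in_dir "$ESTAP_EXE" sh -c './estapDriver.exe /home/estap/raw_seq >> estap.log 2>&1'
run_in_dir "$ESTAP_EXE" sh -c 'pwd > where.log 2>&1'
cat "$ESTAP_EXE/where.log"
```

Setting the variables inside the wrapper matters because cron starts jobs with a minimal environment that does not include settings from an interactive login shell.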

5. Troubleshooting

When a program is stopped abnormally (e.g., power or network failure, processes killed), some data may be partially analyzed and partial results may be stored in the database. You may need to clean up (remove partial results from the database) and rerun the program. Please use the following instructions.

a. Go to the ESTAP database table PROG_RUN_STATUS in the ESTAP_SYS_DEF schema. This table records which programs are running or completed, when they were started, and when they were completed. If a program is killed in the middle of a run, the end_time will not be recorded and the status will still be "running" (status = 1). You need to change the status from 1 to 0 to indicate that the program is no longer running.

b. If the program was runBlast.exe or reblast.exe: after you change the status in the PROG_RUN_STATUS table, go to the ANALYSIS_TRACKING table in the ESTAP_ANALYSIS schema and delete the rows created by the program that have status = 1, e.g., DELETE from ANALYSIS_TRACKING where status = 1. Commit the changes once you are sure that you want to delete them. Then rerun the runBlast.exe (or reblast.exe) program.

c. If the program was cluster.exe: after you change the status in the PROG_RUN_STATUS table, go to the CLUSTER_SUMMARY table in the ESTAP_ANALYSIS schema and delete the row with the project id that you specified when you ran cluster.exe. Commit the changes once you are sure that you want to delete them. Then rerun the cluster.exe program.

d. If the program was clusterAll.exe: after you change the status in the PROG_RUN_STATUS table (look for "cluster.exe" and "clusterAll.exe" in the prog_name column), go to the CLUSTER_SUMMARY table in the ESTAP_ANALYSIS schema and delete the rows created by the program that have a status of 1. Commit the changes once you are sure that you want to delete them. Then rerun the clusterAll.exe program.

e. If the program was blastAllContigs.exe or reblastAllContigs.exe: after you change the status in the PROG_RUN_STATUS table, go to the CONTIG_ANALYSIS table in the ESTAP_ANALYSIS schema and delete the rows created by the program that have status = 1, e.g., DELETE from CONTIG_ANALYSIS where status = 1. Commit the changes once you are sure that you want to delete them. Then rerun the blastAllContigs.exe (or reblastAllContigs.exe) program.
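The clean-up for an interrupted blast run can be prepared as a script and reviewed before anything is changed in the database. The following sketch only writes the SQL to a file; it does not connect to Oracle. The table, schema, and column names are those mentioned in the instructions above, but the exact value stored in prog_name (assumed here to be 'runBlast.exe') should be checked against the actual table contents first.

```sh
#!/bin/sh
# Sketch: generate a reviewable SQL script for cleaning up after runBlast.exe.
SQL_FILE="$(mktemp)"
PROG_NAME="runBlast.exe"         # assumed prog_name value; verify in the table

cat > "$SQL_FILE" <<EOF
-- 1. Mark the interrupted program as no longer running.
UPDATE ESTAP_SYS_DEF.PROG_RUN_STATUS
   SET status = 0
 WHERE status = 1
   AND prog_name = '$PROG_NAME';

-- 2. Remove the partial results it left behind.
DELETE FROM ESTAP_ANALYSIS.ANALYSIS_TRACKING
 WHERE status = 1;

-- 3. Review the affected row counts, then issue COMMIT or ROLLBACK by hand.
EOF

cat "$SQL_FILE"
```

Leaving the commit as a manual step follows the advice above to commit only once you are sure the rows should be deleted.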

V. How to Install and Run Web Programs

1. Install Oracle Client, including JDBC
Please contact Oracle to obtain licensing and product support. ESTAP uses the Oracle 9i client.

2. Install JDK, Servlet, and Web Service Software
a. To install JDK 1.3 (Windows, Linux, Solaris) or higher, please go to http://java.sun.com/j2se.
b. To install Tomcat, please go to http://jakarta.apache.org/tomcat.
c. To install Apache Axis, please go to http://ws.apache.org/axis. The Apache Axis beta 3 version is used for the ESTAP local blast Web service.

3. Configure Tomcat
a. Visit http://jakarta.apache.org/tomcat for instructions on Tomcat configuration.
b. Set up the correct PATH and CLASSPATH environment variables.

c. Modify server.xml in the Tomcat conf directory. In this file, add the following for the estap context:

    <Context path="/estap" docBase="estap" showdebuginfo="true" debug="DEBUG" reloadable="true" crossContext="false">
      <Logger className="org.apache.catalina.logger.FileLogger" prefix="localhost_estap_log." suffix=".txt" timestamp="true"/>
    </Context>

d. Place the estap Web application under the Tomcat webapps directory: copy the estap directory from the web_side directory of the uncompressed software to the webapps directory.

e. Modify the web.xml file in the estap/WEB-INF directory. The following section in web.xml is specific to ESTAP. Please modify the parameter values for your own settings.

    <servlet>
      <servlet-name>ProjectList</servlet-name>
      <servlet-class>ProjectList</servlet-class>
      <init-param><param-name>jdbcDriverClassName</param-name><param-value>oracle.jdbc.driver.OracleDriver</param-value></init-param>
      <init-param><param-name>jdbcURL</param-name><param-value>jdbc:oracle:thin:@server.vt.edu:1521:estapdb</param-value></init-param>
      <init-param><param-name>dbUserName</param-name><param-value>estap_user</param-value></init-param>
      <init-param><param-name>dbUserPassword</param-name><param-value>password</param-value></init-param>
      <init-param><param-name>dbInitConnect</param-name><param-value>1</param-value></init-param>
      <init-param><param-name>dbMaxConnect</param-name><param-value>50</param-value></init-param>
      <init-param><param-name>dbESTFilePath</param-name><param-value>/home/estap/dbEST_Outgoing</param-value></init-param>
      <init-param><param-name>ESTDBFilePath</param-name><param-value>/home/estap/estdb</param-value></init-param>
      <init-param><param-name>blastallPath</param-name><param-value>/home/estap/blast/blastall</param-value></init-param>
      <init-param><param-name>formatdbPath</param-name><param-value>/home/estap/blast/formatdb</param-value></init-param>
      <init-param><param-name>estapEmail</param-name><param-value>estap@vbi.vt.edu</param-value></init-param>
      <load-on-startup>1</load-on-startup>
    </servlet>

The following list defines the parameters:

    jdbcURL: the URL for the JDBC connection
    dbUserName: the ESTAP database user name
    dbUserPassword: the ESTAP database user password
    dbInitConnect: the number of database connections to be made initially
    dbMaxConnect: the maximum number of database connections
    dbESTFilePath: the directory path to store dbEST submission files
    ESTDBFilePath: the directory path to store the EST files in FASTA format and their formatted files used by the ESTAP local blast Web service
    blastallPath: the path for the blastall program
    formatdbPath: the path for the formatdb program
    estapEmail: the ESTAP curator's email address

f. Set the heap size for the JVM

Some of the ESTAP web programs require more memory than the JVM default heap size of 64 MB. To avoid an "out of memory" error, the user should set a larger heap size for the JVM. To do this, set the "JAVA_OPTS" variable in the catalina.sh/catalina.bat or setclasspath.sh/setclasspath.bat file in the tomcat_home/bin directory. For example, if you wish to set the heap size to 256 MB, you may add the line "set JAVA_OPTS=-mx256m" to the catalina.bat file.
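The example in item f uses Windows batch syntax. On Unix/Linux, the equivalent line for catalina.sh could look like the sketch below. It uses -Xmx, the standard spelling of the maximum-heap option, and keeps any options already present in JAVA_OPTS; where exactly to place the line within catalina.sh is left to the administrator.

```sh
#!/bin/sh
# Sketch: raise the JVM maximum heap size to 256 MB for Tomcat on Unix/Linux.
# Existing JAVA_OPTS content, if any, is preserved and the heap option appended.
JAVA_OPTS="${JAVA_OPTS:+$JAVA_OPTS }-Xmx256m"
export JAVA_OPTS

echo "JAVA_OPTS=$JAVA_OPTS"
```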

g. Start the Tomcat server

To start the Tomcat server, execute tomcat_home/bin/startup.bat on Windows (or tomcat_home/bin/startup.sh on Unix/Linux). Then enter the URL http://localhost/ in your browser and make sure that you get the Tomcat welcome page, not an error message saying that the page cannot be displayed or that the server cannot be found. If you choose to use port number 8080 instead of 80, you will need to use a URL that includes the port number, such as http://localhost:8080/. If this is successful, enter http://localhost/estap/servlet/Login or http://localhost:8080/estap/servlet/Login to see whether the ESTAP login page is displayed.

4. Configure Axis

a. Go to http://ws.apache.org/axis for download and installation instructions.

b. Place the downloaded axis directory under the tomcat_home/webapps directory.

c. Move the server-config.wsdd file from the tomcat_home/webapps/estap/WEB-INF directory to the tomcat_home/webapps/axis/WEB-INF directory.

d. Add all jar files under the tomcat_home/webapps/axis/WEB-INF/lib and tomcat_home/webapps/axis/WEB-INF/classes directories to your CLASSPATH environment variable.

e. Move the blastService and createESTDBService directories from the tomcat_home/webapps/estap/WEB-INF/classes directory to the tomcat_home/webapps/axis/WEB-INF/classes directory.

f. Replace the server name and Tomcat port with your own server name and port number in the JBlastServiceLocator.java and JCreateESTDBServiceLocator.java files under the blastService/ws and createESTDBService/ws directories, respectively. E.g., for blastService, run the commands:

    cd blastService
    cd ws

Edit the JBlastServiceLocator.java file by changing http://arda.vbi.vt.edu:8080 to http://yourServerName:yourPort. Then compile the modified file by running the command:

    javac JBlastServiceLocator.java

Repeat the above process for the createESTDBService directory.

g. Copy the blastService and createESTDBService directories to the tomcat_home/webapps/estap/WEB-INF/classes directory.

h. Compile EstapWebBlast.java by running the command:

    javac EstapWebBlast.java

i. Restart the Tomcat server and go to http://localhost/axis or http://localhost:8080/axis to view the list of deployed web services. If Axis is configured correctly, you will see the createESTDB and blast services.
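The manual edit in step f can also be scripted. The following is a minimal sketch, assuming the server URL appears literally in the locator source file; set_service_url is a hypothetical helper name, and the old URL is passed as an argument so that it can be checked against your copy of the file rather than taken on trust.

```shell
#!/bin/sh
# Sketch: point a service locator source file at your own Tomcat server.
# set_service_url is a hypothetical helper, not part of ESTAP or Axis.
# Assumes neither URL contains "|", "&" or a backslash.
set_service_url() (
    file=$1 old=$2 new=$3
    tmp="$file.tmp.$$"
    # Escape the dots in the old URL so sed matches them literally
    old_re=$(printf '%s' "$old" | sed 's/\./\\./g')
    sed "s|$old_re|$new|g" "$file" > "$tmp" && mv "$tmp" "$file"
)
```

A call would look like: set_service_url JBlastServiceLocator.java OLD_URL http://yourServerName:yourPort, followed by the javac command from step f. The same call applies to JCreateESTDBServiceLocator.java.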

VI. How to Register and Update ESTAP Users

ESTAP defines three types of users with different role permissions: sponsors, PIs, and other users.

Sponsors are those who fund the development of ESTAP. If there is no "sponsor" according to this definition, then the person who is in charge of the ESTAP system becomes the sponsor. There is one sponsor per institution. Sponsors can register/update PIs and have all PI privileges.

PIs are the people responsible for the projects. PIs can register/update projects, project users, and project protocols. PIs can also register ESTAP codes (e.g., primer, instrument, base caller, etc.), submit ESTs of their projects to dbEST, and view their project data and analysis results.

Other users include primary contacts with update permission, primary contacts with read-only permission, and project viewers with read-only permission. Primary contacts with update privileges (associated with particular project(s)) can update projects and project protocols and view project data and analysis results. Primary contacts with read-only permission and project viewers (associated with particular project(s)) can view project data and analysis results.

At least one sponsor must exist before PIs can register their projects and users. Sponsors are registered by the ESTAP curator/database administrator, who has direct access to the ESTAP tables in the database. Please use the following instructions to register sponsors:

1) In the INSTITUTION table, modify the first record, which has institution_id = 1. Change the institution_name to the institution name of the sponsor. If there is more than one sponsor to be added, add a new record for each remaining sponsor. For each record, increment institution_id by 1 and enter the institution name and description in the institution_name and institution_desc columns.

2) In the LAB table, modify the record with the lab_code of XX. Change all columns of this record to the desired values. The lab_email entry must be the email address of the sponsor. The lab_alt_email entry is the email address of the primary contact in the sponsor's lab. Each sponsor has one lab. You may add more records if there is more than one sponsor.

3) In the PERSON table, modify the person record with the person_id of 1. Change the user_name to the email address of the sponsor and the password to the sponsor's password for logging on to the Web site. The user_type is 1 and the sponsor_id is 1 (referring to the person_id of the sponsor him/herself in the PERSON table). The other columns are self-explanatory by their names. Add more records if there is more than one sponsor, and increment person_id for each additional sponsor.

4) In the CONTACT table, change the record with the contact_id of 1 and person_id of 1 to the contact information of the sponsor. Add more records if there is more than one sponsor, and increment contact_id for each additional sponsor.

Once sponsors are registered in the ESTAP system, they can log on to the ESTAP Web site to register or update PIs in their institutions. Once PIs are registered, PIs can log on to the ESTAP Web site to register or update their projects. In addition, PIs can register or update the primary contacts and the project viewers for their projects.

VII. How to Register and Update EST Projects and Protocols

1. Project Registration and Update

ESTAP defines two types of projects: regular and virtual projects. A regular project refers to a specific EST library that a lab constructed. The regular project information includes biological information regarding tissues and cloning methods used in cDNA library construction, and EST sequencing information. In order to analyze combined libraries from a particular taxonomy, we introduced the virtual project concept. We now allow PIs to register a virtual project that combines their regular projects from a particular taxonomy.

To register a regular project or a virtual project, please go to the "Project List" page, click the "Register Regular Project" link or the "Register Virtual Project" link, and follow the instructions in the registration form carefully. PIs and the primary contacts with update privilege may also update the project information via the Web.

The current version of ESTAP does not allow updating a project once the sequences of that project have been processed and stored in the ESTAP database. If the user needs to make corrections to the project information after the sequences are processed, he or she has to make a request to the ESTAP curator. The ESTAP curator should delete the already loaded sequences for that project, make the corrections to the project information, and then reload the sequences by running the estapDriver.exe program.

2. Protocol Registration and Update

ESTAP offers the following protocols for data analysis:

1) Cleansing protocols

The cleansing protocols contain procedures to cleanse raw sequences by removing low-quality end sequences, vectors, and polyA (or polyT), and by screening for chimeras and contaminations. There are two types of cleansing protocols: one is the cleansing protocol for BLAST and the other is the cleansing protocol for assembly. The sequences cleansed by the cleansing protocol for BLAST are used for BLAST searching, and the sequences cleansed by the cleansing protocol for assembly are used for the cluster/assembly analysis. The user may utilize the same procedures and parameters for both protocols. There are several key points for the cleansing to be successful:

a. The cleansing program uses the 5' and 3' vector fragments provided in the project registration to match the vector portions in the clones. So the 5' and 3' vector fragments (both the orientation of the fragments and the sequences) must be correct.

b. The vector and insert adaptor sequences provided in the project registration should not be the whole adaptor sequences. Only the regions that are present in the clones should be included.

c. The sequencing primer code in the sequence names of the raw sequence data must be correct. The correct primer code indicates the correct orientation of the sequencing (5' or 3'). A wrong orientation will cause the removal of vector/adaptor to fail. Please check the primer table carefully to make sure the orientation of the primer is correct.

d. When the cleansing procedure does not seem to give the expected result, please check whether the above key points have been followed.

2) BLAST protocol

The BLAST protocol provides procedures and parameters for BLAST (BLASTN, BLASTX, and/or TBLASTX) searches. Users may choose to run all, some, or none of the BLAST procedures in the protocol.

3) Cluster/assembly protocol

The cluster/assembly protocol provides procedures and parameters for clustering and assembly analysis. Users register this protocol only if they wish to do cluster/assembly. Projects are not automatically clustered without a cluster protocol, so to let ESTAP run the cluster/assembly programs, users need to register the cluster protocol for each project they wish to cluster/assemble. Please go to the View Project page of that project and follow the link called Cluster Protocol to register the protocol and set the procedure parameters. The parameters that ESTAP provides are from the d2_cluster and CAP3 programs. The Register Cluster/Assembly Protocol page has a link, "Cluster/Assembly Protocol Guide", which provides paper references for the d2_cluster and CAP3 programs. Once the protocol is registered, ESTAP will automatically perform cluster/assembly for your project according to the protocol you registered.

If you wish to perform cluster/assembly for combined libraries (not for each project), you need to register a virtual project and select the taxonomy of the libraries. Then register the BLAST and cluster/assembly protocols for this virtual project. Once the cluster protocol is registered, ESTAP will automatically perform cluster/assembly for the virtual project.

The cluster/assembly protocol also provides an option to perform InterProScan analysis. Users may choose this option to let ESTAP automatically annotate the protein functions of singlets and contigs after the clustering/assembly procedure is performed.

PIs and the primary contacts with update privilege may also update the protocols. The updated protocols take effect for new sequence data, not for data that have already been analyzed. If the user wishes to reanalyze the old data, he or she must make such a request to the ESTAP curator. The curator should use the following instructions:

1) If the sequences of a project must be re-cleansed using the new protocol, the curator must delete the sequences of the project from the ESTAP database and reload the sequences by running the estapDriver.exe program. Use the following command to delete sequences from the database:

    delete from raw_seq_file where project_id = x;
    commit;

project_id is the internal id for the project. The curator may get the id from the PROJECT table by matching the lab code and project code of the project:

    select project_id from project where lab_code = 'xx' and user_proj_code = 'xxx';

Please see Section IV.4.e for estapDriver.exe usage.

2) If the sequences of a project must be re-blasted using the new BLAST protocol, the curator must delete the old BLAST results of that project from the ESTAP database and run the runBlast.exe program. Use the following command to delete the BLAST results from the database:

    delete from analysis_tracking where project_id = x;
    commit;

Please see Section IV.4.f for runBlast.exe usage.

3) If the sequences of a project must be re-clustered using the new clustering/assembly protocol, the curator must delete the old clustering/assembly results and the contig BLAST results of that project from the ESTAP database. Then run the cluster.exe program for clustering/assembly and the blastAllContig.exe program for contig BLAST. Use the following command to delete the old results from the database:

    delete from cluster_summary where project_id = x;
    commit;

Use the "cluster.exe [project_id]" command to redo the clustering/assembly analysis and "blastAllContig.exe" to redo the contig BLAST analysis. Note: [project_id] is the internal id for that project. Please see Section IV.4.h and Section IV.4.i for how to run the cluster.exe and blastAllContig.exe programs.
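The three cleanup statements above differ only in the table name, so a curator who repeats them often could generate them from one small function. The sketch below only prints SQL and does not connect to the database; reanalysis_sql and its keywords clean, blast, and cluster are hypothetical names chosen for illustration, while the table names are the ones given above.

```shell
#!/bin/sh
# Sketch: print the cleanup SQL for one kind of re-analysis.
# reanalysis_sql is a hypothetical helper, not an ESTAP program.
reanalysis_sql() {
    # $1 = clean | blast | cluster ; $2 = internal project_id (digits only)
    case $2 in
        ''|*[!0-9]*) echo "project_id must be a number" >&2; return 1 ;;
    esac
    case $1 in
        clean)   table=raw_seq_file ;;
        blast)   table=analysis_tracking ;;
        cluster) table=cluster_summary ;;
        *)       echo "unknown analysis: $1" >&2; return 1 ;;
    esac
    printf 'delete from %s where project_id = %s;\ncommit;\n' "$table" "$2"
}
```

Review the printed statements before running them in your usual SQL client. Rejecting non-numeric ids keeps the placeholder x from ever reaching the database.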

VIII. How to Prepare Raw EST Data

Users can send ESTAP either the chromatograms of the sequences (.ab1 files) or both the sequence files and their corresponding quality score files in FASTA format. If the user sends the chromatograms, ESTAP will use the Phred base-calling program to generate the sequences and the quality scores. Each sequence file contains sequences from the same sequencing run of the same project. Each quality score file contains the quality scores of the sequences in the corresponding sequence file. Each sequence or quality score string in the file must start with a ">" followed by a sequence name.

To process data files, ESTAP requires that data providers use a standard file name for each file and a standard sequence name for each sequence in the file. The file name contains specific information about that file, and the sequence name contains specific information about the sequence. All the sequences in the same file must be from the same project and the same sequencing run. The naming convention and rules are as follows:

• Both the file name and the sequence name are required to be of a fixed length.
• Each element in the name must conform to a required number of characters or digits.
• Pad with 0's (zeros) in front of any element that is shorter than the required length.

The file names and sequence names are case-sensitive.

1. File Naming Convention

All sequence files have ".seq" as a suffix and quality score files have ".qal" as a suffix. Every seq string must have an exactly matching qal string by name. The file names use the following naming convention: LLPPPTTIIIBRRNNNCCYYMMDDHHMM.seq for the sequence file and LLPPPTTIIIBRRNNNCCYYMMDDHHMM.qal for the corresponding quality score file.

LL – two-character alphanumeric lab code (provided by VBI), e.g., UN
PPP – three-character alphanumeric project id (provided by VBI), e.g., MCA
TT – two-character alphanumeric instrument type (provided by VBI), e.g., A1
III – three-character alphanumeric instrument id (user-defined code that is unique to the lab), e.g., VT1
B – one-character alphanumeric base-calling program id (provided by VBI), e.g., A
RR – two-character alphanumeric run parameter id (provided by VBI), e.g., S1
NNN – three-digit run duration in minutes (actual running time in minutes, e.g., 090 for 90 minutes)
CC – two-character alphanumeric sequencing chemistry id (provided by VBI), e.g., Pr
YYMMDDHHMM – the date and time at which the sequence was run, e.g., 0104150930
    YY – two-digit year
    MM – two-digit month
    DD – two-digit day
    HH – two-digit hour
    MM – two-digit minute
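Because every element has a fixed width, a file name can be checked and split by character position. The following is a minimal sketch of that idea; parse_estap_filename and the output labels (lab, project, and so on) are hypothetical names, and the sketch checks only the overall length and suffix, not whether the codes are registered.

```shell
#!/bin/sh
# Sketch: split an ESTAP data file name into its fixed-width fields.
# parse_estap_filename is a hypothetical helper, not an ESTAP program.
parse_estap_filename() (
    name=${1##*/}            # drop any leading directory
    base=${name%.*}          # part before the suffix
    ext=${name##*.}          # suffix, expected to be seq or qal
    case $ext in
        seq|qal) ;;
        *) echo "bad suffix: $name" >&2; exit 1 ;;
    esac
    # 2+3+2+3+1+2+3+2+10 characters = 28
    [ "${#base}" -eq 28 ] || { echo "bad length: $name" >&2; exit 1; }
    field() { printf '%s=%s\n' "$1" "$(printf '%s' "$base" | cut -c"$2")"; }
    field lab 1-2
    field project 3-5
    field instrument_type 6-7
    field instrument_id 8-10
    field base_caller 11
    field run_param 12-13
    field run_minutes 14-16
    field chemistry 17-18
    field run_time 19-28
)
```

For example, parse_estap_filename VBOVTA2VT1CS1120Tb0102011615.seq prints one name=value line per element, starting with lab=VB and ending with run_time=0102011615.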

2. Sequence Naming Convention

The sequence names in the sequence and quality score files must be in the following format: TPPPNNNNNNNNNLLL

T – one-letter primer type: u for a user-defined primer and s for a standard primer
PPP – three-character alphanumeric primer id (provided or confirmed by VBI), e.g., T3a
NNNNNNNNN – nine-character alphanumeric clone name, unique within the project (user defined)
LLL – three-digit lane or capillary number, e.g., 012 for lane 12 in the sequence gel

Note: Using the correct primer code (the first 4 letters) is very important for the cleansing procedure. The correct primer code indicates the correct orientation of the sequencing (5' or 3'). A wrong orientation will cause the removal of vector/adaptor to fail.
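The sequence name rules can be checked in the same positional way. This sketch is an illustration only: parse_estap_seqname is a hypothetical helper, it accepts an optional leading ">" as found in the FASTA files, and it cannot tell whether the primer id is actually registered.

```shell
#!/bin/sh
# Sketch: validate an ESTAP sequence name and print its parts.
# parse_estap_seqname is a hypothetical helper, not an ESTAP program.
parse_estap_seqname() (
    name=${1#">"}                        # allow a leading FASTA ">"
    [ "${#name}" -eq 16 ] || { echo "bad length: $name" >&2; exit 1; }
    case $name in
        *[!A-Za-z0-9]*) echo "non-alphanumeric character: $name" >&2; exit 1 ;;
    esac
    case $name in
        [us]*[0-9][0-9][0-9]) ;;         # primer type u or s, lane is three digits
        *) echo "bad primer type or lane: $name" >&2; exit 1 ;;
    esac
    part() { printf '%s' "$name" | cut -c"$1"; }
    printf 'primer_type=%s\n' "$(part 1)"
    printf 'primer_id=%s\n'   "$(part 2-4)"
    printf 'clone=%s\n'       "$(part 5-13)"
    printf 'lane=%s\n'        "$(part 14-16)"
)
```

Running it over the header lines of a file before submission catches the most common problems (wrong length, missing zero padding) early.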

3. Example of a Sequence and a Quality Score File

a. Sequence file

File name: VBOVTA2VT1CS1120Tb0102011615.seq

The following is the content of the file:

>sM13rroot0002012
GTCGTAGCTAGCTAGCTGTGGACTGACTGA
CTAGCATCAGCTTGACTGACTAGCATAGCT
AGCATACGCTATCGATCAGGGCAGACTAGC
TGACTAGCATGACTAGCATT
>sM13rroot0003013
GTCGTAGCTTAGCTAGTCAGATCAGCATCG
AATACGATCAGCTAGCTGTGGACTGACTGA
CTAGCATCAGCTTGACTGACTAGCATAGCT
AGCATACGCTTACGTACACAGACACACA

b. Quality score file

File name: VBOVTA2VT1CS1120Tb0102011615.qal

The following is the content of the file:

>sM13rroot0002012
9 13 16 17 16 12 7 7 19 19 25 27 29 28 30 31 29 30 29 29 30 30 37 37 33 37 40 44 40 35 35 40 40 40 46 40 35 35 35 35 35 35 35 40 40 37 35 40 40 40 40 40 40 40 46 46 46 46 46 47 51 51 42 42 45 45 45 45 45 45 51 46 56 56 56 56 46 51 51 51 45 40 40 40 40 40 37 45 46 42 42 44 44 44 44 56 56 37 40 40 40 39 37 37 37 37 39 40 40 43
>sM13rroot0003013
9 13 13 19 19 9 6 6 16 19 24 24 29 29 29 29 30 37 46 34 37 40 46 40 37 37 37 37 37 40 40 40 40 40 40 35 35 40 40 37 35 35 35 35 35 35 39 40 46 46 46 46 46 56 46 42 42 46 43 51 51 51 51 42 42 51 42 42 51 51 51 51 56 51 51 51 51 51 51 56 56 56 56 56 46 42 42 42 42 42 42 42 42 43 43 35 35 35 35 35 35 56 51 51 51 51 51 51 56 56 56 56 40 40 40 40 40 40

Explanation of the file name VBOVTA2VT1CS1120Tb0102011615.seq: The lab code is VB. The project code is OVT (a library of Orobanche minor v. Virginia tubercles). The instrument is a 377, so the code is A2. The id for this instrument is VT1. The base-calling program is ABI100, which has a code of C. The run parameter code is S1. Run time is 2 hours (120). I have used dye terminator chemistry with big dyes, so the code is Tb. I ran these samples on Feb. 1, 2001 at 4:15 pm, so the date/time stamp will be 0102011615. You only have to enter this once. The data itself carries a shorter name.

Explanation of the sequence name sM13rroot0002012: The primer is the standard primer M13f, so the name starts with an "s" followed by the three-letter primer code "M13". The clone name is rroot0002. The sequence was run on lane 12, which has the code "012".
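The requirement that every entry in the .seq file has an exactly matching entry in the .qal file can be checked before the files are sent. The sketch below compares the sets of ">" header lines of the two files; headers_match is a hypothetical helper name, and comparing the names regardless of order is an assumption, since the guide does not say whether the order must also agree.

```shell
#!/bin/sh
# Sketch: check that a .seq file and a .qal file contain the same
# sequence names. headers_match is a hypothetical helper, not an ESTAP program.
headers_match() (
    seq_file=$1 qal_file=$2
    a=$(grep '^>' "$seq_file" | sort)    # sequence names in the .seq file
    b=$(grep '^>' "$qal_file" | sort)    # sequence names in the .qal file
    # Succeed only if there is at least one name and both lists agree
    [ -n "$a" ] && [ "$a" = "$b" ]
)
```

A call such as headers_match VBOVTA2VT1CS1120Tb0102011615.seq VBOVTA2VT1CS1120Tb0102011615.qal returns success for the example pair above and failure if either file is missing an entry.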

4. Sending Chromatogram Files

If you wish ESTAP to run Phred for you, send the chromatograms of the sequences. Put all sequences from the same sequencing run of the same project in a folder. The name of the folder follows the file naming convention without the .seq or .qal suffix, e.g., VBOVTA2VT1CS1120Tb0102011615. Each chromatogram file follows the sequence naming convention plus the suffix ".ab1", e.g., sM13rroot0002012.ab1.

IX. How to Use dbEST Submission Tool

The dbEST Submission Tool incorporated in ESTAP is available online to allow users to conveniently submit their cleansed EST sequences in the ESTAP database to NCBI's dbEST database. Note that only PIs have the privilege to submit their EST sequences to dbEST. PIs may follow the "dbEST Submission" link on the "Project List" page to ask ESTAP to prepare the dbEST submission files for them. During dbEST submission, a contact file, a library file, a publication file, and one or more EST files will be created and sent to the primary contact person via email. EST files will be in FASTA format with clone names and clean sequence ids as identifiers.

After submitting the EST sequences to NCBI's dbEST, the PI should receive an email from NCBI regarding the batch of ESTs submitted. In order to store the dbEST_ID, USER_ID, and GENBANK_ACCN assigned by NCBI in the ESTAP database, the PI should send the confirmation file received from NCBI to the ESTAP curator. The ESTAP curator should use the ReadConfirmationFile program to process the confirmation file.

The ReadConfirmationFile program is located under the java_prog directory of the uncompressed software. It is a stand-alone Java program that reads and parses the dbEST confirmation file and stores the dbEST_ID, USER_ID, and GENBANK_ACCN assigned by NCBI in the ESTAP database. To make this program work, the curator needs to modify the configuration file called "javaConfig.txt" in the java_prog directory and set the correct parameters (see Section IV.1.c). ReadConfirmationFile uses the parameters defined in this file to find the confirmation file and to connect to the ESTAP database. The confirmation file should be named with a .txt suffix (e.g., JC_MCA_020314052943.txt) and placed in the directory specified by the variable "ncbiConfirmFilePath" in the javaConfig.txt file. Go to the directory where ReadConfirmationFile.class is located and run the program using the following command:

    java ReadConfirmationFile


"tar -xvf" commands. The uncompressed software has four directories: server_side, web_side, database, and java_prog. The server_side directory contains directories and files for the pipeline analysis programs. The web_side directory contains directories and files for the ESTAP web programs. The database directory contains documentation and scripts for creating the ESTAP database and importing data. The java_prog directory contains stand-alone Java applications used for processing the dbEST submission confirmation file that NCBI sends to the user (ReadConfirmationFile.java) and for running InterProScan analysis for singlets and contigs (the remaining Java files in this directory).

III. How to Install the Oracle Software and Create an ESTAP Database

1. Install Oracle Software

ESTAP requires Oracle 9i. Oracle provides detailed instructions with their products.

2. Create the Database Structure

See the separate documentation for creation of the ESTAP database. The document is named "README_db.txt" and is included in the software package under the database directory.

3. Load Data

See the separate documentation for instructions. Please refer to the document named "README_db.txt" that is included in the software package under the database directory.

IV. How to Install and Run Pipeline Analysis Programs

1. Set up

a. Set up the correct directory hierarchy where you will place raw sequence data (.seq and .qal files).

The user may specify a directory for placement of input sequence data, e.g., /home/estap/raw_seq. This directory can be named according to user preference. Under this directory, the user must set up the following subdirectories: input, done, and problem. The input, done, and problem directories have to be named exactly as shown to be recognized by the ESTAP programs (note: the directory names are case-sensitive). Under the input directory, create a lab directory for each registered lab (XX). The lab directory names are the two-letter lab codes (case-sensitive) of the labs, e.g., JC. The raw sequence data (.seq and .qal files) of a lab should be placed under the directory for that lab (input/XX).

An ESTAP program (estapDriver.exe) is scheduled to run periodically to read the data files from each lab subdirectory, verify and cleanse the sequences, and store them in the ESTAP database. If a .seq and .qal pair in the input/XX directory has a valid file name, and all the sequences in the .seq file and all the quality scores in the .qal file have valid names and formats, the pair is considered valid. After the data processing, a valid pair is moved from the input/XX directory to the done/XX directory. An invalid pair is moved from the input/XX directory to the problem/XX directory. The corresponding lab (XX) subdirectories under the done and problem directories are created automatically by the ESTAP program (estapDriver.exe).

b. Under the server_side directory of the uncompressed ESTAP software there are three sub-directories: estap_exe, include, and src.

The analysis programs (configuration files, executables, and scripts) are located in the estap_exe directory. The executable programs in this directory were compiled using gcc 2.95.3 and glibc-2.2.5 and tested on Linux version 2.4.18. The third-party software and files (Cross_match, d2_cluster, CAP3, repeat.seq) are not included here. Users should get them from their corresponding sources. The executable programs cross_match, d2_cluster, enc_db, and cap3 and the repeat.seq file should be placed in the estap_exe directory.

The configuration file db_password.txt is used for the database connection. The user needs to edit this file to provide a database user name and password for the ESTAP analysis programs to connect to the ESTAP database. The name of the file must be "db_password.txt". The content of the file must follow this format:

    USER_NAME=user_name@estap_db_name
    PASSWORD=user_password

For example:

    USER_NAME=estap_user@estapdb
    PASSWORD=password

The include and src directories contain the source code for the ESTAP analysis programs. A makefile is provided under the src directory. The user may recompile the programs using the "make build" command under the src directory. The newly compiled programs will go to the estap_exe directory.

c. Under the java_prog directory of the uncompressed ESTAP software there are Java application programs and a configuration file.

The configuration file javaConfig.txt provides parameters for the Java applications, such as database connection parameters, the directory location for dbEST files, and the number of processors used for InterProScan analysis. The content of the file must follow this format:

    jdbcDriverClassName=oracle.jdbc.driver.OracleDriver
    jdbcURL=jdbc:oracle:thin:@server.vt.edu:1521:estap_db_name
    dbUserName=db_user_name
    dbUserPassword=db_user_password
    ncbiConfirmFilePath=directory for dbEST confirmation file
    iprscanNumProcessor=number of processors to be used for InterProScan analysis

For example:

    jdbcDriverClassName=oracle.jdbc.driver.OracleDriver
    jdbcURL=jdbc:oracle:thin:@server.vt.edu:1521:estapdb
    dbUserName=estap_user
    dbUserPassword=password
    ncbiConfirmFilePath=/home/estap/dbEST_Incoming
    iprscanNumProcessor=2
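The directory hierarchy from step a can be created in one go. The following is a minimal sketch: make_estap_tree is a hypothetical helper name, the root directory and lab codes are passed as arguments, and the done/XX and problem/XX subdirectories are deliberately left for estapDriver.exe to create, as described in step a.

```shell
#!/bin/sh
# Sketch: create the raw-sequence directory hierarchy described in step a.
# make_estap_tree is a hypothetical helper, not an ESTAP program.
make_estap_tree() (
    root=$1; shift
    # These three names are fixed and case-sensitive
    mkdir -p "$root/input" "$root/done" "$root/problem" || exit 1
    for lab in "$@"; do
        # Lab codes are two alphanumeric characters (case-sensitive)
        case $lab in
            [A-Za-z0-9][A-Za-z0-9]) mkdir -p "$root/input/$lab" || exit 1 ;;
            *) echo "skipping invalid lab code: $lab" >&2 ;;
        esac
    done
)
```

For example, make_estap_tree /home/estap/raw_seq JC VB creates the input, done, and problem directories plus input/JC and input/VB.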

2. Environment Variables

a. Environment variables needed for Oracle: PATH, ORACLE_HOME, LD_LIBRARY_PATH, ORA_CLIENT_LIB, and ORACLE_SID.

The following are examples of the variable settings (ORACLE_HOME is set first because the PATH and LD_LIBRARY_PATH settings refer to it):

    export ORACLE_HOME=/home/oracle/products/9.2.0
    export PATH=$PATH:/home/estap/blast:$ORACLE_HOME/bin
    export LD_LIBRARY_PATH=$ORACLE_HOME/lib
    export ORA_CLIENT_LIB=shared
    export ORACLE_SID=estap

More information regarding database-related environment variables may be found in the document "README_db.txt" that is located in the database directory.

b. Environment variable needed for the BLAST programs: please go to the NCBI site to view details about how to set up the BLAST programs. You need to create a .ncbirc file that sets Data=/home/estap/blast/data if your BLAST programs are located at /home/estap/blast. You need to include this path in the PATH variable (see above). You also need to set the environment variable BLASTDB to /home/estap/blast/db if your BLAST databases are located at /home/estap/blast/db, e.g.:

    export BLASTDB=/home/estap/blast/db

c. Environment variable needed for the Phred program: PHRED_PARAMETER_FILE.

If the Phred program (phred and phredpar.dat) is in the /home/estap/phred directory, the variable PHRED_PARAMETER_FILE should be set to /home/estap/phred/phredpar.dat, e.g.:

    export PHRED_PARAMETER_FILE=/home/estap/phred/phredpar.dat

If this variable is not set correctly, phred will not run properly.

d. Environment variable needed for the runPhred.exe program: PHRED_PATH.

The runPhred.exe program wraps the phred program and combines all .ab1 files in one directory to produce one .seq and one .qal file in FASTA format. If the phred program (phred and phredpar.dat) is in the /home/estap/phred directory, the variable PHRED_PATH should be set to /home/estap/phred/phred, e.g.:

    export PHRED_PATH=/home/estap/phred/phred

If this variable is not set correctly, the program will not run properly.

e. Environment variable needed for the d2_cluster program: OMP_NUM_THREADS.

If you use one processor, you need to set this variable to 1, e.g.:

    export OMP_NUM_THREADS=1

If this variable is not set, d2_cluster will not run. In some cases the d2_cluster program may not exit normally due to a memory problem. We recommend setting the environment variable MALLOC_CHECK_ to 1, e.g.:

    export MALLOC_CHECK_=1

In addition, d2_cluster requires libpgthread.so. You may place it in the /usr/local/lib directory.
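The settings from steps a to e can be collected in one profile fragment that is sourced before any ESTAP program runs. The sketch below uses the example paths from this section, which are assumptions about your installation; ESTAP_HOME is a hypothetical convenience variable that ESTAP itself does not read, and the Oracle home shown is only a default that an existing ORACLE_HOME overrides.

```shell
#!/bin/sh
# Sketch of a profile fragment collecting the variables from steps a-e.
# All paths are example values; adjust them to your installation.
ESTAP_HOME=/home/estap        # hypothetical convenience variable, not read by ESTAP

# a. Oracle (ORACLE_HOME first, because PATH and LD_LIBRARY_PATH use it)
export ORACLE_HOME="${ORACLE_HOME:-/home/oracle/products/9.2.0}"
export PATH="$PATH:$ESTAP_HOME/blast:$ORACLE_HOME/bin"
export LD_LIBRARY_PATH="$ORACLE_HOME/lib"
export ORA_CLIENT_LIB=shared
export ORACLE_SID=estap

# b. BLAST databases
export BLASTDB="$ESTAP_HOME/blast/db"

# c. and d. Phred parameter file and executable
export PHRED_PARAMETER_FILE="$ESTAP_HOME/phred/phredpar.dat"
export PHRED_PATH="$ESTAP_HOME/phred/phred"

# e. d2_cluster: one processor, plus the recommended malloc check
export OMP_NUM_THREADS=1
export MALLOC_CHECK_=1
```

Sourcing the fragment from the shell profile of the account that runs the scheduled jobs keeps interactive and cron-started runs consistent, provided the cron entries also source it.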

3. Run the Programs

a. FTP and format NCBI BLAST databases (e.g., nr, nt)

Fetched NCBI databases are updated every month, or as frequently as you wish. Please make sure that the ESTAP analysis programs are not running when you update the NCBI databases. Otherwise, the ESTAP analysis programs will not run properly. Also make sure that database formatting is complete before running the ESTAP programs.

We provide some shell scripts (getAndFormatNcbiDB.sh, getDB.sh, uncompressFormatDB.sh) and a C program (getAndFormatNcbiDB.exe) under the estap_exe directory to automatically update the NCBI databases. We use crontab to schedule them, e.g.:

    30 3 2 * * ~/estap_exe/getAndFormatNcbiDB.sh > get_db.log 2>&1

getAndFormatNcbiDB.sh calls getAndFormatNcbiDB.exe, which calls getDB.sh and uncompressFormatDB.sh. The getAndFormatNcbiDB.exe program waits until all ESTAP analysis programs finish before updating the NCBI databases. getDB.sh fetches the NCBI databases via wget. uncompressFormatDB.sh uncompresses the fetched NCBI databases and formats them using formatdb. Please revise the scripts if your directory setup is different.

b. Where to put raw data (.ab1 files)

The .ab1 files can be stored in a user-specified directory, e.g., /home/estap/raw_data. When receiving data from customers, be sure to create a new directory under this directory using the following naming convention for better management (the user may choose a different naming convention for this): labcode-ddmmmyy, e.g., JC-06may02. Then transfer the data to this directory. All sequence files must be named following the ESTAP naming conventions (see Section VIII for how to prepare raw data and for the naming convention).

c. Run Phred

Go to the directory where runPhred.exe is located, e.g., /home/estap/server_side/estap_exe, and run the runPhred.exe program. This program wraps the phred program to perform base calling and combines all .ab1 files in one sequence file directory to produce one .seq and one .qal file in FASTA format. If multiple sequence file directories exist, multiple .seq and .qal files will be generated.

Command:

    runPhred.exe input_directory output_directory

e.g.,

    runPhred.exe /home/estap/raw_data/jc-06may02 /home/estap/raw_data/jc-06may02-phred-out

where runPhred.exe is the program name, /home/estap/raw_data/jc-06may02 is the directory that contains the .ab1 files from the JC lab, and /home/estap/raw_data/jc-06may02-phred-out is the output directory where the .seq and .qal files go. Before starting the program, you need to create the output directory /home/estap/raw_data/jc-06may02-phred-out. After the program is done, check phred.log (the program automatically logs the results) in the /home/estap/server_side/estap_exe directory for any problems. If there are none, you may either rename the log file or remove it. In the output directory, remove any files that do not have a .seq or .qal extension.

d. Copy the .seq and .qal files generated by the runPhred.exe program to the correct lab subdirectory under the input directory, e.g., if the files belong to the JC lab, copy those files to /home/estap/raw_seq/input/JC.

e Run cleansing program to verify the integrity of the seq and qal files cleanse the

sequences and store the results to the ESTAP database Command estapDriverexe root_path_name gtcleanlog 2gtamp1 eg estapDriverexe homeestapraw_seq gtcleanlog 2gtamp1 where estapDriverexe is the program name root_path_name is the parent directory path for the input directory (see Section IV1 for the directory hierarchy description) and gtcleanlog 2gtamp1 means to log the program output (from both stdout and stderr) to a log file named cleanlog This cleanlog can be viewed for debugging purposes After the cleansing program is successfully run the cleanlog can be removed to save disk space You may use nohup to run the command which immunizes your program to hang-ups eg nohup estapDriverexe root_path_name gtcleanlog 2gtamp1 amp It is better to use nohup and run the program in the background (with amp at the end of a command) To check whether the program is running use command ps ndashef|grep estapDriverexe If for some reason you need to re-run the cleansing for some sequence files please follow the following steps

(1) Delete the sequence file(s) from the database directly.
(2) Reload the corresponding .seq and .qal files to the input/[lab_code] directory.
(3) Run estapDriver.exe as described in e.

f. Run the ESTAP blast program to run BLAST according to the corresponding blast protocols.

Command: runBlast.exe > blast.log 2>&1
where runBlast.exe is the program name, and > blast.log 2>&1 logs the program output (both stdout and stderr) to a file named blast.log. This file can be viewed for debugging purposes. After runBlast.exe has run successfully, blast.log can be removed to save disk space. You may run the command with nohup,
e.g., nohup runBlast.exe > blast.log 2>&1 &
It is better to use nohup and run the program in the background (with & at the end of the command). To check whether the program is running, use the command ps -ef | grep runBlast.exe. Currently, runBlast.exe is set to blast 1000 sequences at a time (this setting can be raised in the future). If there are more than 1000 sequences to blast, you need to run the program multiple times. Check blast.log to see how many sequences remain to be blasted.
Note: Wait at least 5-10 minutes before starting another runBlast.exe run, so that the sequences being blasted by the previous run are already recorded in the ESTAP database. Check the blast.log file to confirm that the blast command has already appeared. When starting a second runBlast.exe run, use a different log file name instead of blast.log, e.g., blast2.log. The same precautions apply when you need to run the program three or more times.
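Two pieces of bookkeeping follow from the 1000-sequence batch size: how many runs are needed, and which log file name is still free. The sketch below illustrates both. The function names are hypothetical, and the numbering scheme (blast.log, blast2.log, blast3.log, ...) simply extends the blast2.log example.

```shell
# Hypothetical bookkeeping helpers for batched BLAST runs.

runs_needed() {
    # Usage: runs_needed TOTAL_SEQUENCES BATCH_SIZE
    # Integer ceiling of TOTAL / BATCH, e.g. 2500 sequences at 1000 per run.
    echo $(( ($1 + $2 - 1) / $2 ))
}

next_log_name() {
    # Usage: next_log_name DIRECTORY BASENAME
    # Prints the first unused name in the sequence BASENAME.log,
    # BASENAME2.log, BASENAME3.log, ... inside DIRECTORY.
    dir=$1
    base=$2
    candidate="$dir/$base.log"
    n=2
    while [ -e "$candidate" ]; do
        candidate="$dir/$base$n.log"
        n=$((n + 1))
    done
    echo "$candidate"
}
```

The same helpers apply to the reblast and contig blast programs by passing a different base name, such as reblast or contig_blast.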

g. Run the reblast program to reblast according to the corresponding blast protocols.

Command: reblast.exe > reblast.log 2>&1
where reblast.exe is the program name, and > reblast.log 2>&1 logs the program output (both stdout and stderr) to a file named reblast.log. This file can be viewed for debugging purposes. After the reblast program has run successfully, reblast.log can be removed to save disk space. You may run the command with nohup, which makes the program immune to hang-ups,
e.g., nohup reblast.exe > reblast.log 2>&1 &
It is better to use nohup and run the program in the background (with & at the end of the command). To check whether the program is running, use the command ps -ef | grep reblast.exe. Currently, reblast.exe is set to reblast 1000 sequences at a time (this setting can be raised in the future). If there are more than 1000 sequences to reblast, you need to run the program multiple times. Check reblast.log to see how many sequences remain to be blasted.
Note: Wait at least 5-10 minutes before starting another reblast.exe run, so that the sequences being blasted by the previous run are already recorded in the ESTAP database. Check the reblast.log file to confirm that the blast command has already appeared. When starting a second reblast.exe run, use a different log file name instead of reblast.log, e.g., reblast2.log. The same precautions apply when you need to run the program three or more times.

h. Run the cluster/assembly programs. You have two options: 1) run cluster/assembly for a specific project, or 2) run cluster/assembly for all projects.

1) Run cluster/assembly for a specific project.
Command: cluster.exe project_id > cluster_project_id.log 2>&1
e.g., cluster.exe 377 > cluster_377.log 2>&1
where cluster.exe is the program name, project_id is the project id of the project you wish to run (note: this is the internal id, not the project code), and > cluster_project_id.log 2>&1 logs the program output (both stdout and stderr) to a file named cluster_project_id.log. This log file can be viewed for debugging purposes. After the program has run successfully, the log file can be removed to save disk space. You may run the command with nohup, which makes the program immune to hang-ups,
e.g., nohup cluster.exe project_id > cluster_project_id.log 2>&1 &
To check whether the program is running, use the command ps -ef | grep cluster.exe.

2) Run cluster/assembly for all projects that qualify for the process (this is preferred).
Command: clusterAll.exe > cluster.log 2>&1
where clusterAll.exe is the program name, and > cluster.log 2>&1 logs the program output (both stdout and stderr) to a file named cluster.log. This log file can be viewed for debugging purposes. After the program has run successfully, the log file can be removed to save disk space. You may run the command with nohup, which makes the program immune to hang-ups,
e.g., nohup clusterAll.exe > cluster.log 2>&1 &
To check whether the program is running, use the command ps -ef | grep clusterAll.exe.

i. Run the contig blast program to blast the assembled contigs according to the corresponding blast protocols.

Command: blastAllContigs.exe > contig_blast.log 2>&1
where blastAllContigs.exe is the program name, and > contig_blast.log 2>&1 logs the program output (both stdout and stderr) to a file named contig_blast.log. This file can be viewed for debugging purposes. After the blastAllContigs program has run successfully, the log file can be removed to save disk space. You may run the command with nohup, which makes the program immune to hang-ups,
e.g., nohup blastAllContigs.exe > contig_blast.log 2>&1 &
It is better to use nohup and run the program in the background (with & at the end of the command). To check whether the program is running, use the command ps -ef | grep blastAllContigs.exe. Currently, blastAllContigs.exe is set to blast 1000 contigs at a time (this setting can be raised in the future). If there are more than 1000 contigs to blast, you need to run the program multiple times. Check contig_blast.log to see how many sequences remain to be blasted.
Note: Wait at least 5-10 minutes before starting another blastAllContigs.exe run, so that the sequences being blasted by the previous run are already recorded in the ESTAP database. Check the contig_blast.log file to confirm that the blast command has already appeared. When starting a second blastAllContigs.exe run, use a different log file name instead of contig_blast.log, e.g., contig_blast2.log. The same precautions apply when you need to run the program three or more times.

j. Run the contig reblast program to reblast the assembled contigs according to the corresponding blast protocols.

Command: reblastAllContigs.exe > contig_reblast.log 2>&1
where reblastAllContigs.exe is the program name, and > contig_reblast.log 2>&1 logs the program output (both stdout and stderr) to a file named contig_reblast.log. This log file can be viewed for debugging purposes. After the reblastAllContigs program has run successfully, the log file can be removed to save disk space. You may run the command with nohup, which makes the program immune to hang-ups,
e.g., nohup reblastAllContigs.exe > contig_reblast.log 2>&1 &
It is better to use nohup and run the program in the background (with & at the end of the command). To check whether the program is running, use the command ps -ef | grep reblastAllContigs.exe. Currently, reblastAllContigs.exe is set to reblast 1000 contigs at a time (this setting can be raised in the future). If there are more than 1000 contigs to reblast, you need to run the program multiple times. Check contig_reblast.log to see how many sequences remain to be blasted.
Note: Wait at least 5-10 minutes before starting another reblastAllContigs.exe run, so that the sequences being blasted by the previous run are already recorded in the ESTAP database. Check the contig_reblast.log file to confirm that the blast command has already appeared. When starting a second reblastAllContigs.exe run, use a different log file name instead of contig_reblast.log, e.g., contig_reblast2.log. The same precautions apply when you need to run the program three or more times.

k. Set up and run the InterProScan application.
InterProScan is a tool developed at EBI that combines different protein signature recognition methods into one resource. ESTAP wraps this tool to query contigs and singlets against the InterPro public databases and automatically annotate their protein functions. The GO terms of the matched proteins are linked to the contigs and singlets. Follow the instructions below to set up and run the InterProScan application.

1) Download iprscan
Go to the ftp://ftp.ebi.ac.uk/pub/software/unix/iprscan site and download the following files:
(1) iprscan_vXXX.tar.gz, where XXX is the version number of the iprscan release, e.g., 3.2
(2) iprscan_bin_XXX.tar.gz, where XXX is your platform, e.g., Linux
(3) iprscan_DATA_XXX.tar.gz, where XXX is the version number of the InterPro database release, e.g., 6.1

2) Uncompress the files, in the order specified above, into your home directory ($HOME_DIR). The iprscan home directory will be $HOME_DIR/iprscan, e.g., /home/estap/iprscan.

3) First-time configuration
Go to the iprscan home directory and run perl CONFIG.pl. Then:
(1) Modify CONFIG.pl as described below (save a copy of the original CONFIG.pl file as CONFIG_orig.pl).
(i) At the beginning of CONFIG.pl (in main), along with the other variable declarations, add the following lines:
    my $IprPWD = "iprpwd.txt";
    my $pwd = $ENV{PWD};
    unless ($pwd) { $pwd = `pwd`; chomp $pwd; }
(ii) Before the line setting up applications,
    $applset = get_user_prompt("Setup applications $first (y|n)");
add these lines:
    open (FF, ">$IprPWD") || die "Cannot create $IprPWD: $!";
    print FF $pwd;
    close FF;

(2) Modify the InterProScan.pl file as described below (save a copy of the original file as InterProScan_orig.pl).
(i) Replace
    my $seqfile = $ARGV[0];
with
    my $seqfile = $ARGV[1];
(ii) After the line
    my $UserId = Manager->getUserId();
add these lines:
    my $OutDir = $ARGV[0];
    print "$OutDir\n";
    # Do not use $path in this file since it won't be recognized by the iprscan wrapper program.
    # $OutDir, instead of $path, is used as an argument for the iprscan wrapper program to store iprscan result files.
(iii) Replace $path with $OutDir in all the lines following the above comment lines.

4) Reconfiguration
Go to the iprscan home directory and run perl CONFIG.pl. You will be asked to choose the member databases you would like to search against. If you only use the public data and applications that come with the iprscan distribution, answer "n" for SignalPHMM and TMHMM during configuration, since they are not public. When the configuration is complete, a file called iprpwd.txt is created under the iprscan home directory. The iprpwd.txt file stores the iprscan path information, which is used by the iprscan wrapper programs. You must copy the iprpwd.txt file to the java_prog directory, where the iprscan wrapper programs are placed, so that the Java programs can read the iprscan path information from this file.

5) Run iprscan
Go to the java_prog directory, where the Java application and configuration files (javaConfig.txt and iprpwd.txt) are placed, and set the proper parameters in the configuration file as described in Section IV.1.c. You may analyze a specific project or all projects at once.
To run a specific project, use the command
nohup java InterProDriver lab_code project_code > ipr_project_code.log 2>&1 &
e.g., nohup java InterProDriver JC MCA > ipr_MCA.log 2>&1
where InterProDriver is the program name, lab_code is the two-letter lab code of the lab that the project belongs to, project_code is the three-letter project code of the project being analyzed, and > ipr_project_code.log 2>&1 logs the program output (both stdout and stderr) to a file named ipr_project_code.log.
To run all projects that qualify for this procedure, use the command
nohup java InterProScanAll > ipr.log 2>&1 &
where InterProScanAll is the program name, and > ipr.log 2>&1 logs the program output (both stdout and stderr) to a file named ipr.log.
The log files can be viewed for debugging purposes. After the program has run successfully, the log files can be removed to save disk space.

l. Run the genomic DNA assembly programs.
You have two options: 1) run DNA assembly for a specific genome project, or 2) run DNA assembly for all qualified genome projects.

1) Run assembly for a specific genome project.
Command: nohup gAssemble.exe project_id > assemble_project_id.log 2>&1
e.g., nohup gAssemble.exe 377 > assemble_377.log 2>&1
where gAssemble.exe is the program name, project_id is the project id of the project you wish to run (note: this is the internal id, not the project code), and > assemble_project_id.log 2>&1 logs the program output (both stdout and stderr) to a file named assemble_project_id.log.

2) Run assembly for all genome projects that qualify for the process (this is preferred).
Command: nohup gAssembleAll.exe > assemble.log 2>&1
where gAssembleAll.exe is the program name, and > assemble.log 2>&1 logs the program output (both stdout and stderr) to a file named assemble.log.

m. Prepare EST database files from user projects.
The ESTAP Web interface allows users to blast specific sequence(s) against the ESTs from their projects. We provide a program called writeAndFormatFasta.exe to create and format EST database files from user projects. This program can be scheduled to run daily, or as often as you wish, so that users always blast against the most recent EST databases. To run this program, you need to create a directory where the EST database files will be placed, e.g., /home/estap/estdb. This directory path should be used for the ESTDBFilePath parameter setting in the web.xml file (see Section V.3.e) so that the ESTAP local blast Web service can recognize

4. Schedule Programs
The programs above can be run manually as described, or scheduled to run automatically using the UNIX cron facility. Please see the UNIX man page for the crontab command. The following is an example schedule:

0 9,11,17 * * * ~/estap_exe/estapDriver.sh >> estap.log 2>&1
0 19,20,21 * * * ~/estap_exe/runBlast.sh >> blast.log 2>&1
30 22,23,24 * * * ~/estap_exe/reblast.sh >> reblast.log 2>&1
30 3 2 ~/estap_exe/getAndFormatNcbiDB.sh > get_db.log 2>&1
0 5 * * * ~/estap_exe/clusterAll.sh >> ~/estap_exe/cluster.log 2>&1
0,30 19 * * * ~/estap_exe/blastAllContigs.sh >> ~/estap_exe/blast_contigs.log 2>&1
0,30 12 * * * ~/estap_exe/reblastAllContigs.sh >> ~/estap_exe/reblast_contigs.log 2>&1
0,30 7 * * * ~/estap_exe/gAssembleAll.sh >> ~/estap_exe/assemble.log 2>&1
5 2 * * * ~/estap_exe/writeAndFormatFasta.sh >> ~/estap_exe/estap_formatdb.log 2>&1

The shell scripts used here call the corresponding programs and make sure the programs run under the correct directories with the correct environment variable settings. These scripts should be revised when the directory setup or environment variable settings differ.
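As an illustration of what such a wrapper script does, the sketch below changes to the program directory and sets the environment before running a program. It is not one of the scripts shipped with ESTAP: the function name run_in_estap_env, the ESTAP_EXE_DIR variable, and the default ORACLE_HOME path are all placeholder assumptions that must be adapted to the local installation.

```shell
# Hypothetical wrapper in the spirit of the scheduled *.sh scripts.
run_in_estap_env() {
    # Usage: run_in_estap_env COMMAND [ARGS...]
    # Runs in a subshell so the caller's directory and environment stay unchanged.
    (
        # ESTAP_EXE_DIR is an assumed variable naming the program directory.
        cd "${ESTAP_EXE_DIR:-$HOME/estap_exe}" || exit 1
        # Placeholder Oracle client location; replace with the real path.
        ORACLE_HOME=${ORACLE_HOME:-/opt/oracle}
        PATH=$ORACLE_HOME/bin:$PATH
        export ORACLE_HOME PATH
        "$@"
    )
}
```

Because cron starts jobs with a minimal environment, setting the directory and variables inside the wrapper keeps the crontab entries short and the settings in one place.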

5. Trouble-shooting
When a program stops abnormally (e.g., power or network failure, killed processes, etc.), some data may be only partially analyzed, and partial results may be stored in the database. You may need to clean up (remove the partial results from the database) and rerun the program. Please use the following instructions.

a. Go to the ESTAP database table PROG_RUN_STATUS in the ESTAP_SYS_DEF schema. This table records which programs are running or completed, when they were started, and when they were completed. If a program is killed in the middle of a run, the end_time is not recorded and the status still shows running (status = 1). You need to change the status from 1 to 0 to indicate that the program is no longer running.

b. If the program was runBlast.exe or reblast.exe, after you change the status in the PROG_RUN_STATUS table, go to the ANALYSIS_TRACKING table in the ESTAP_ANALYSIS schema and delete the rows created by the program that have status = 1, e.g., DELETE from ANALYSIS_TRACKING where status = 1; Commit the changes after you are sure that you want to delete them. Then rerun the runBlast.exe program.

c. If the program was cluster.exe, after you change the status in the PROG_RUN_STATUS table, go to the CLUSTER_SUMMARY table in the ESTAP_ANALYSIS schema and delete the row with the project id that you specified when you ran cluster.exe. Commit the changes after you are sure that you want to delete them. Then rerun the cluster.exe program.

d. If the program was clusterAll.exe, after you change the status in the PROG_RUN_STATUS table (look for "cluster.exe" and "clusterAll.exe" in the prog_name column), go to the CLUSTER_SUMMARY table in the ESTAP_ANALYSIS schema and delete the rows created by the program that have a status of 1. Commit the changes after you are sure that you want to delete them. Then rerun the clusterAll.exe program.

e. If the program was blastAllContigs.exe or reblastAllContigs.exe, after you change the status in the PROG_RUN_STATUS table, go to the CONTIG_ANALYSIS table in the ESTAP_ANALYSIS schema and delete the rows created by the program that have status = 1, e.g., DELETE from CONTIG_ANALYSIS where status = 1; Commit the changes after you are sure that you want to delete them. Then rerun the blastAllContigs.exe program.
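The cleanup rules in a through e can be collected into one helper that prints the SQL statements for review. This is a sketch under stated assumptions: the function name cleanup_sql is hypothetical, the prog_name column is assumed to hold the executable name (e.g., 'runBlast.exe'), and the statements are only printed, never executed, so the curator can inspect them before running and committing them in an Oracle session.

```shell
# Hypothetical helper: print (never execute) cleanup SQL for an interrupted run.
cleanup_sql() {
    # Usage: cleanup_sql PROGRAM [PROJECT_ID]
    prog=$1
    project_id=$2
    names="'$prog'"
    case $prog in
        runBlast.exe|reblast.exe)
            delete="DELETE FROM ESTAP_ANALYSIS.ANALYSIS_TRACKING WHERE status = 1;" ;;
        cluster.exe)
            # Cleanup for cluster.exe is per project, so a numeric id is required.
            case $project_id in
                ''|*[!0-9]*) echo "cluster.exe needs a numeric project id" >&2; return 1 ;;
            esac
            delete="DELETE FROM ESTAP_ANALYSIS.CLUSTER_SUMMARY WHERE project_id = $project_id;" ;;
        clusterAll.exe)
            # Both program names may appear in PROG_RUN_STATUS for this case.
            names="'cluster.exe', 'clusterAll.exe'"
            delete="DELETE FROM ESTAP_ANALYSIS.CLUSTER_SUMMARY WHERE status = 1;" ;;
        blastAllContigs.exe|reblastAllContigs.exe)
            delete="DELETE FROM ESTAP_ANALYSIS.CONTIG_ANALYSIS WHERE status = 1;" ;;
        *)
            echo "no cleanup rule for: $prog" >&2; return 1 ;;
    esac
    echo "UPDATE ESTAP_SYS_DEF.PROG_RUN_STATUS SET status = 0 WHERE prog_name IN ($names) AND status = 1;"
    echo "$delete"
    echo "-- review the affected rows, then COMMIT;"
}
```

For example, cleanup_sql cluster.exe 377 prints the status reset followed by the CLUSTER_SUMMARY delete for project 377.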

V. How to Install and Run Web Programs

1. Install Oracle Client, including JDBC
Please contact Oracle to obtain licensing and product support. ESTAP uses the Oracle 9i client.

2. Install JDK, Servlet, and Web Service Software
a. To install JDK 1.3 (Windows, Linux, Solaris) or higher, please go to http://java.sun.com/j2se
b. To install Tomcat, please go to http://jakarta.apache.org/tomcat
c. To install Apache Axis, please go to http://ws.apache.org/axis. The Apache Axis beta 3 version is used for the ESTAP local blast Web service.

3. Configure Tomcat
a. Visit http://jakarta.apache.org/tomcat for instructions on Tomcat configuration.
b. Set up the correct PATH and CLASSPATH environment variables.
c. Modify server.xml in the Tomcat conf directory. In this file, add the following for the estap context:

<Context path="/estap" docBase="estap" showdebuginfo="true" debug="DEBUG" reloadable="true" crossContext="false">
  <Logger className="org.apache.catalina.logger.FileLogger" prefix="localhost_estap_log." suffix=".txt" timestamp="true"/>
</Context>

d. Place the estap Web application under the Tomcat webapps directory. Copy the estap directory from the web_side directory of the uncompressed software to the webapps directory.

e. Modify the web.xml file in the estap/WEB-INF directory. The following section in web.xml is specific to ESTAP. Please modify the text in bold for your own parameter settings.

<servlet>
  <servlet-name>ProjectList</servlet-name>
  <servlet-class>ProjectList</servlet-class>
  <init-param>
    <param-name>jdbcDriverClassName</param-name>
    <param-value>oracle.jdbc.driver.OracleDriver</param-value>
  </init-param>
  <init-param>
    <param-name>jdbcURL</param-name>
    <param-value>jdbc:oracle:thin:@server.vt.edu:1521:estapdb</param-value>
  </init-param>
  <init-param>
    <param-name>dbUserName</param-name>
    <param-value>estap_user</param-value>
  </init-param>
  <init-param>
    <param-name>dbUserPassword</param-name>
    <param-value>password</param-value>
  </init-param>
  <init-param>
    <param-name>dbInitConnect</param-name>
    <param-value>1</param-value>
  </init-param>
  <init-param>
    <param-name>dbMaxConnect</param-name>
    <param-value>50</param-value>
  </init-param>
  <init-param>
    <param-name>dbESTFilePath</param-name>
    <param-value>/home/estap/dbEST_Outgoing</param-value>
  </init-param>
  <init-param>
    <param-name>ESTDBFilePath</param-name>
    <param-value>/home/estap/estdb</param-value>
  </init-param>
  <init-param>
    <param-name>blastallPath</param-name>
    <param-value>/home/estap/blast/blastall</param-value>
  </init-param>
  <init-param>
    <param-name>formatdbPath</param-name>
    <param-value>/home/estap/blast/formatdb</param-value>
  </init-param>
  <init-param>
    <param-name>estapEmail</param-name>
    <param-value>estap@vbi.vt.edu</param-value>
  </init-param>
  <load-on-startup>1</load-on-startup>
</servlet>

The following list defines the parameters:
jdbcURL: the URL for the JDBC connection
dbUserName: the ESTAP database user name
dbUserPassword: the ESTAP database user password
dbInitConnect: the number of database connections to be made initially
dbMaxConnect: the maximum number of database connections
dbESTFilePath: the directory path to store dbEST submission files
ESTDBFilePath: the directory path to store the EST files in FASTA format and their formatted files used by the ESTAP local blast Web service
blastallPath: the path for the blastall program
formatdbPath: the path for the formatdb program
estapEmail: the ESTAP curator's email address

f. Set heap size for JVM
Some of the ESTAP Web programs require more memory than the JVM default heap size of 64 MB. To avoid an "out of memory" error, set a larger heap size for the JVM. To do this, set the JAVA_OPTS variable in the catalina.sh/catalina.bat or setclasspath.sh/setclasspath.bat file in the tomcat_home/bin directory. For example, to set the heap size to 256 MB, add the line "set JAVA_OPTS=-mx256m" to the catalina.bat file.
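On Unix/Linux the same setting is written in shell syntax rather than with the Windows set command. The following is a minimal sketch: heap_opts is a hypothetical helper name, and -Xmx is used as the standard spelling of the maximum-heap option that the older -mx form refers to.

```shell
# Hypothetical helper: build a JAVA_OPTS value that caps the JVM heap.
heap_opts() {
    # Usage: heap_opts MEGABYTES
    case $1 in
        ''|*[!0-9]*) echo "heap size must be a whole number of MB" >&2; return 1 ;;
    esac
    echo "-Xmx${1}m"
}

# Example for a Unix shell or catalina.sh: request a 256 MB maximum heap.
JAVA_OPTS=$(heap_opts 256)
export JAVA_OPTS
```

The exported variable is picked up the next time Tomcat is started from that environment.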

g. Start the Tomcat server
To start the Tomcat server, execute tomcat_home/bin/startup.bat on Windows (or tomcat_home/bin/startup.sh on Unix/Linux). Then enter the URL http://localhost/ in your browser and make sure that you get the Tomcat welcome page, not an error message saying that the page cannot be displayed or that the server cannot be found. If you choose to use port number 8080 instead of 80, you will need to use a URL that includes the port number, such as http://localhost:8080/. If this is successful, enter http://localhost/estap/servlet/Login or http://localhost:8080/estap/servlet/Login to see whether the ESTAP login page is displayed.

4. Configure Axis
a. Go to http://ws.apache.org/axis for downloading and installation instructions.
b. Place the downloaded axis directory under the tomcat_home/webapps directory.
c. Move the server-config.wsdd file from the tomcat_home/webapps/estap/WEB-INF directory to the tomcat_home/webapps/axis/WEB-INF directory.
d. Add all jar files under the tomcat_home/webapps/axis/WEB-INF/lib and tomcat_home/webapps/axis/WEB-INF/classes directories to your CLASSPATH environment variable.
e. Move the blastService and createESTDBService directories from the tomcat_home/webapps/estap/WEB-INF/classes directory to the tomcat_home/webapps/axis/WEB-INF/classes directory.
f. Replace the server name and Tomcat port with your own server name and port number in the JBlastServiceLocator.java and JCreateESTDBServiceLocator.java files under the blastService/ws and createESTDBService/ws directories, respectively. For example, for blastService, run the commands
cd blastService
cd ws
Edit the JBlastServiceLocator.java file by changing http://arda.vbi.vt.edu:8080 to http://yourServerName:yourPort. Then compile the modified file by running the command
javac JBlastServiceLocator.java
Repeat the above process for the createESTDBService directory.
g. Copy the blastService and createESTDBService directories to the tomcat_home/webapps/estap/WEB-INF/classes directory.
h. Compile EstapWebBlast.java by running the command
javac EstapWebBlast.java
i. Restart the Tomcat server and go to http://localhost/axis or http://localhost:8080/axis to view the list of deployed Web services. If Axis is configured correctly, you will see the createESTDB and blast services.

VI. How to Register and Update ESTAP Users

ESTAP defines 3 types of users with different role permissions: sponsors, PIs, and other users.

Sponsors are those who fund the development of ESTAP. If there is no "sponsor" according to this definition, then the person who is in charge of the ESTAP system becomes the sponsor. There is one sponsor per institution. Sponsors can register/update PIs and have all PI privileges.

PIs are the people responsible for the projects. PIs can register/update projects, project users, and project protocols. PIs can also register ESTAP codes (e.g., primer, instrument, base caller, etc.), submit ESTs of their projects to dbEST, and view their project data and analysis results.

Other users include primary contacts with update permission, primary contacts with read-only permission, and project viewers with read-only permission. Primary contacts with update privileges (associated with particular project(s)) can update projects and project protocols and view project data and analysis results. Primary contacts with read-only permission and project viewers (associated with particular project(s)) can view project data and analysis results.

At least one sponsor must exist before PIs can register their projects and users. Sponsors are registered by the ESTAP curator/database administrator, who has direct access to the ESTAP tables in the database. Please use the following instructions to register sponsors.

1) In the INSTITUTION table, modify the first record (institution_id = 1). Change the institution_name to the institution name of the sponsor. If there is more than one sponsor, add a new record for each remaining sponsor. For each record, increment institution_id by 1 and enter the institution name and description in the institution_name and institution_desc columns.

2) In the LAB table, modify the record with the lab_code of XX. Change all columns of this record to the desired values. The lab_email entry must be the email address of the sponsor. The lab_alt_email entry is the email address of the primary contact in the sponsor's lab. Each sponsor has one lab. You may add more records if there is more than one sponsor.

3) In the PERSON table, modify the person record with the person_id of 1. Change the user_name to the email address of the sponsor and the password to the sponsor's password for logging on to the Web site. The user_type is 1 and the sponsor_id is 1 (referring to the person_id of the sponsor him/herself in the PERSON table). The other columns are self-explanatory by their names. Add more records if there is more than one sponsor, and increment person_id for each additional sponsor.

4) In the CONTACT table, change the record with the contact_id of 1 and person_id of 1 to the contact information of the sponsor. Add more records if there is more than one sponsor, and increment contact_id for each additional sponsor.

Once sponsors are registered in the ESTAP system, they can log on to the ESTAP Web site to register or update PIs in their institutions. Once PIs are registered, they can log on to the ESTAP Web site to register or update their projects. In addition, PIs can register or update the primary contacts and the project viewers for their projects.

VII. How to Register and Update EST Projects and Protocols

1. Project Registration and Update
ESTAP defines two types of projects: regular and virtual. A regular project refers to a specific EST library that a lab constructed. The regular project information includes biological information regarding the tissues and cloning methods used in cDNA library construction, and EST sequencing information. To allow analysis of combined libraries from a particular taxonomy, ESTAP provides the virtual project concept: PIs can register a virtual project that combines their regular projects from a particular taxonomy. To register a regular project or a virtual project, go to the "Project List" page, click the "Register Regular Project" link or the "Register Virtual Project" link, and follow the instructions in the registration form carefully.

PIs and primary contacts with update privilege may also update the project information via the Web. The current version of ESTAP does not allow a project to be updated once the sequences of that project have been processed and stored in the ESTAP database. If the user needs to correct the project information after the sequences are processed, he or she has to make a request to the ESTAP curator. The ESTAP curator should delete the already loaded sequences for that project, correct the project information, and then reload the sequences by running the estapDriver.exe program.

2. Protocol Registration and Update
ESTAP offers the following protocols for data analysis.

1) Cleansing protocols
The cleansing protocols contain procedures to cleanse raw sequences by removing low-quality end sequences, vectors, and polyA (or polyT), and by screening for chimeras and contamination. There are two types of cleansing protocols: the cleansing protocol for BLAST and the cleansing protocol for assembly. The sequences cleansed by the cleansing protocol for BLAST are used for BLAST searching, and the sequences cleansed by the cleansing protocol for assembly are used for the cluster/assembly analysis. The user may use the same procedures and parameters for both protocols. There are several key points for the cleansing to be successful:
a. The cleansing program uses the 5' and 3' vector fragments provided in the project registration to match the vector portions in the clones. So the 5' and 3' vector fragments (both the orientation of the fragments and their sequences) must be correct.
b. The vector and insert adaptor sequences provided in the project registration should not be the whole adaptor sequences. Only the regions that are present in the clones should be included.
c. The sequencing primer code in the sequence names of the raw sequence data must be correct. The correct primer code indicates the correct orientation of the sequencing (5' or 3'). A wrong orientation will cause vector/adaptor removal to fail. Please check the primer table carefully to make sure the orientation of the primer is correct.
d. When the cleansing procedure does not give the expected result, check whether the above key points have been followed.

2) Blast protocol
The blast protocol provides procedures and parameters for BLAST (BLASTN, BLASTX, and/or TBLASTX) searches. Users may choose to run all, some, or none of the blast procedures in the protocol.

3) Cluster/assembly protocol
The cluster/assembly protocol provides procedures and parameters for clustering and assembly analysis. Users register this protocol only if they wish to run cluster/assembly. ESTAP runs the cluster/assembly programs only for projects that have a registered cluster protocol; projects without one are not clustered automatically. Go to the View Project page of the project you wish to cluster/assemble and follow the link called Cluster Protocol to register the protocol and set the procedure parameters. The parameters that ESTAP provides are from the d2_cluster and CAP3 programs. The Register Cluster/Assembly Protocol page has a link, "Cluster/Assembly Protocol Guide", which provides paper references for the d2_cluster and CAP3 programs. Once the protocol is registered, ESTAP will automatically perform cluster/assembly for your project according to the protocol you registered.

If you wish to perform cluster/assembly for combined libraries (rather than for each project), you need to register a virtual project and select the taxonomy of the libraries. Then register the BLAST and cluster/assembly protocols for this virtual project. Once the cluster protocol is registered, ESTAP will automatically perform cluster/assembly for the virtual project.

The cluster/assembly protocol also provides an option to perform InterProScan analysis. Users may choose this option to let ESTAP automatically annotate the protein functions of singlets and contigs after the clustering/assembly procedure is performed.

PIs and the primary contacts with update privilege may also update the protocols The updated protocols will be effective for new sequence data not the data that have already been analyzed If the user wishes to reanalyze the old data he or she must make such a request to the ESTAP curator The curator should use the following instructions 1) If the sequences of a project must be re-cleansed using the new protocol the curator

must delete the sequences of the project from the ESTAP database and reload the sequences again by running the estapDriverexe program Use the following command to delete sequences from the database

delete from raw_seq_file where project_id = x commit

project_id is the internal id for the project The curator may get the id from the PROJECT table by matching the lab code and project code of the project select project_id from project where lab_code = lsquoxxrsquo and user_proj_code =rsquoxxxrsquo Please see Section IV4e for estapDriverexe usage

2) If the sequences of a project must be re-blasted using the new BLAST protocol, the curator must delete the old blast results of that project from the ESTAP database and run the runBlast.exe program. Use the following commands to delete the blast results from the database:

    delete from analysis_tracking where project_id = x;
    commit;

Please see Section IV.3.f for runBlast.exe usage.

3) If the sequences of a project must be re-clustered using the new clustering/assembly protocol, the curator must delete the old clustering/assembly results and the contig blast results of that project from the ESTAP database, then run the cluster.exe program for clustering/assembly and the blastAllContigs.exe program for contig blast. Use the following commands to delete the old results from the database:

    delete from cluster_summary where project_id = x;
    commit;

Use the "cluster.exe [project_id]" command to redo the clustering/assembly analysis and "blastAllContigs.exe" to redo the contig blast analysis. Note: [project_id] is the internal id for that project. Please see Section IV.3.h and Section IV.3.i for how to run the cluster.exe and blastAllContigs.exe programs.
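As a rough illustration of the three reanalysis cases, the following Python sketch maps each case to the delete statement given in the instructions. The table names come from this guide; the function name, the dictionary, and the step labels ("cleanse", "blast", "cluster") are hypothetical, and the sketch assumes the internal project id is an integer.

```python
# Hypothetical helper: table names are taken from the guide, everything else
# (function name, step labels) is illustrative only.
REANALYSIS_TABLES = {
    "cleanse": "raw_seq_file",       # case 1: re-cleanse sequences
    "blast": "analysis_tracking",    # case 2: re-blast sequences
    "cluster": "cluster_summary",    # case 3: re-cluster and re-blast contigs
}


def delete_statements(step, project_id):
    """Return the SQL statements to run before redoing the given step."""
    table = REANALYSIS_TABLES[step]  # raises KeyError for an unknown step
    pid = int(project_id)            # assumption: internal ids are integers
    return [f"delete from {table} where project_id = {pid};", "commit;"]


if __name__ == "__main__":
    print("\n".join(delete_statements("blast", 377)))
```

The statements are only printed here; the curator would still review them and run them in an Oracle session before restarting the corresponding program.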

VIII. How to Prepare Raw EST Data

Users can send ESTAP either the chromatograms of the sequences (.ab1 files) or both sequence files and their corresponding quality score files in the FASTA format. If the user sends ESTAP the chromatograms, ESTAP will use the Phred base calling program to generate the sequences and the quality scores. Each sequence file contains sequences from the same sequencing run of the same project. Each quality score file contains the quality scores of the sequences in the corresponding sequence file. Each sequence or quality score string in the file must start with a ">" followed by a sequence name.

To process data files, ESTAP requires that data providers use a standard file name for each file and a standard sequence name for each sequence in the file. The file name contains specific information about that file, and the sequence name contains specific information about the sequence. All the sequences in the same file must be from the same project and the same sequencing run. The naming convention and rules are as follows:

• Both file name and sequence name are required to be of a fixed length.
• Each element in the name must conform to a required number of characters or digits.
• Pad with 0's (zeros) in front of any element that is shorter than the required length.

The file names and sequence names are case-sensitive.

1. File Naming Convention

All sequence files will have ".seq" as a suffix and quality score files will have ".qal" as a suffix. Every seq string must have an exactly matching qal string by name. The file names use the following naming convention: LLPPPTTIIIBRRNNNCCYYMMDDHHMM.seq for the sequence file and LLPPPTTIIIBRRNNNCCYYMMDDHHMM.qal for the corresponding quality score file.

LL - two-character alphanumeric lab code (provided by VBI), e.g., UN
PPP - three-character alphanumeric project id (provided by VBI), e.g., MCA
TT - two-character alphanumeric instrument type (provided by VBI), e.g., A1
III - three-character alphanumeric instrument id (user-defined code that is unique to the lab), e.g., VT1
B - one-character alphanumeric base calling program id (provided by VBI), e.g., A
RR - two-character alphanumeric run parameter id (provided by VBI), e.g., S1
NNN - three-digit run duration in minutes (actual running time, e.g., 090 for 90 minutes)
CC - two-character alphanumeric sequencing chemistry id (provided by VBI), e.g., Pr
YYMMDDHHMM - the date and time at which the sequence was run, e.g., 0104150930
    YY - two-digit year
    MM - two-digit month
    DD - two-digit day
    HH - two-digit hour
    MM - two-digit minute
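The fixed-width layout above lends itself to a simple check before files are sent. The following Python sketch is one possible way to split a file name into its elements; the regular expression, the field names, and the function name are illustrative and not part of ESTAP. It assumes "alphanumeric" means ASCII letters and digits.

```python
import re
from datetime import datetime

# Field widths follow LLPPPTTIIIBRRNNNCCYYMMDDHHMM; the group names are
# illustrative and are not identifiers used by ESTAP itself.
FILE_NAME_RE = re.compile(
    r"^(?P<lab>[A-Za-z0-9]{2})"
    r"(?P<project>[A-Za-z0-9]{3})"
    r"(?P<instrument_type>[A-Za-z0-9]{2})"
    r"(?P<instrument_id>[A-Za-z0-9]{3})"
    r"(?P<base_caller>[A-Za-z0-9])"
    r"(?P<run_param>[A-Za-z0-9]{2})"
    r"(?P<run_minutes>\d{3})"
    r"(?P<chemistry>[A-Za-z0-9]{2})"
    r"(?P<timestamp>\d{10})"
    r"\.(?P<suffix>seq|qal)$"
)


def parse_file_name(name):
    """Split a .seq/.qal file name into its elements or raise ValueError."""
    match = FILE_NAME_RE.match(name)
    if match is None:
        raise ValueError(f"file name does not follow the convention: {name!r}")
    fields = match.groupdict()
    fields["run_minutes"] = int(fields["run_minutes"])
    # strptime also rejects impossible dates such as month 13.
    fields["timestamp"] = datetime.strptime(fields["timestamp"], "%y%m%d%H%M")
    return fields


if __name__ == "__main__":
    # Name assembled from the "e.g." values listed above.
    print(parse_file_name("UNMCAA1VT1AS1090Pr0104150930.seq"))
```

Because every element has a fixed width, a name that is too short, too long, or carries a different suffix fails the match as a whole rather than being silently misread.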

2. Sequence Naming Convention

The sequence names in the sequence and quality score files must be in the following format: TPPPNNNNNNNNNLLL

T - one-letter primer type: u for user-defined primer and s for standard primer
PPP - three-character alphanumeric primer id (provided or confirmed by VBI), e.g., T3a
NNNNNNNNN - nine-character alphanumeric clone name, unique within the project (user defined)
LLL - three-digit lane or capillary number, e.g., 012 for lane 12 in the sequence gel

Note: Using the correct primer code (first 4 letters) is very important for the cleaning procedure. The correct primer code indicates the correct orientation of the sequencing (5' or 3'). A wrong orientation will cause vector/adaptor removal to fail.
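A sequence name can be assembled and checked in the same way. The sketch below is illustrative only: the function names are hypothetical, the clone name "clone42" is made up, and left-padding the clone name with zeros is one reading of the padding rule stated earlier in this section.

```python
import re

# Layout TPPPNNNNNNNNNLLL: primer type, primer id, clone name, lane number.
SEQ_NAME_RE = re.compile(
    r"^(?P<primer_type>[us])"
    r"(?P<primer_id>[A-Za-z0-9]{3})"
    r"(?P<clone>[A-Za-z0-9]{9})"
    r"(?P<lane>\d{3})$"
)


def parse_sequence_name(name):
    """Split a sequence name into its elements or raise ValueError."""
    match = SEQ_NAME_RE.match(name)
    if match is None:
        raise ValueError(f"sequence name does not follow the convention: {name!r}")
    fields = match.groupdict()
    fields["lane"] = int(fields["lane"])
    return fields


def make_sequence_name(primer_type, primer_id, clone, lane):
    """Build a name, zero-padding the clone name and lane to full width."""
    name = f"{primer_type}{primer_id}{clone.zfill(9)}{lane:03d}"
    parse_sequence_name(name)  # raises ValueError if any element is invalid
    return name


if __name__ == "__main__":
    # "clone42" is a made-up clone name; T3a is the example primer id above.
    print(make_sequence_name("s", "T3a", "clone42", 12))
```

Validating the name right after building it catches elements that are too long, which padding alone cannot fix.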

3. Example of a Sequence and a Quality Score File

a. Sequence file

File name: VBOVTA2VT1CS1120Tb0102011615.seq

The following is the content of the file:

>sM13rroot0002012
GTCGTAGCTAGCTAGCTGTGGACTGACTGA
CTAGCATCAGCTTGACTGACTAGCATAGCT
AGCATACGCTATCGATCAGGGCAGACTAGC
TGACTAGCATGACTAGCATT
>sM13rroot0003013
GTCGTAGCTTAGCTAGTCAGATCAGCATCG
AATACGATCAGCTAGCTGTGGACTGACTGA
CTAGCATCAGCTTGACTGACTAGCATAGCT
AGCATACGCTTACGTACACAGACACACA

b. Quality score file

File name: VBOVTA2VT1CS1120Tb0102011615.qal

The following is the content of the file:

>sM13rroot0002012
9 13 16 17 16 12 7 7 19 19 25 27 29 28 30 31 29 30 29 29 30 30 37 37 33 37 40 44 40 35 35 40 40 40 46 40 35 35 35 35 35 35 35 40 40 37 35 40 40 40 40 40 40 40 46 46 46 46 46 47 51 51 42 42 45 45 45 45 45 45 51 46 56 56 56 56 46 51 51 51 45 40 40 40 40 40 37 45 46 42 42 44 44 44 44 56 56 37 40 40 40 39 37 37 37 37 39 40 40 43
>sM13rroot0003013
9 13 13 19 19 9 6 6 16 19 24 24 29 29 29 29 30 37 46 34 37 40 46 40 37 37 37 37 37 40 40 40 40 40 40 35 35 40 40 37 35 35 35 35 35 35 39 40 46 46 46 46 46 56 46 42 42 46 43 51 51 51 51 42 42 51 42 42 51 51 51 51 56 51 51 51 51 51 51 56 56 56 56 56 46 42 42 42 42 42 42 42 42 43 43 35 35 35 35 35 35 56 51 51 51 51 51 51 56 56 56 56 40 40 40 40 40 40

Explanation of file name VBOVTA2VT1CS1120Tb0102011615.seq: The lab code is VB. The project code is OVT (a library of Orobanche minor v. Virginia tubercles). The instrument is a 377, so the code is A2. The id for this instrument is VT1. The base-calling program is ABI100, which has a code of C. The run parameter code is S1. Run time is 2 hours, so the run duration is 120. I have used dye terminator chemistry with big dyes, so the code is Tb. I ran these samples on Feb. 1, 2001 at 4:15 pm, so the date/time stamp will be 0102011615. You only have to enter this once. The data itself carries a shorter name.

Explanation of sequence name sM13rroot0002012: The primer is standard primer M13f, so the name starts with an "s" followed by the three-letter primer code "M13". The clone name is rroot0002. The sequence was run on lane 12, which has a code "012".
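Since every sequence record needs a quality score record with exactly the same name, a quick comparison of the two files before submission can save a round trip. The following Python sketch is a minimal illustration, not part of ESTAP; the function names are hypothetical and the two short records in the usage example are made up.

```python
def fasta_names(text):
    """Return the record names (the text after '>') in file order."""
    names = []
    for line in text.splitlines():
        if line.startswith(">"):
            names.append(line[1:].strip())
    return names


def unmatched_names(seq_text, qal_text):
    """Return (names only in the sequence file, names only in the quality file)."""
    seq_names = set(fasta_names(seq_text))
    qal_names = set(fasta_names(qal_text))
    return sorted(seq_names - qal_names), sorted(qal_names - seq_names)


if __name__ == "__main__":
    # Tiny made-up records; real files hold full sequences and score lists.
    seq_text = ">sM13rroot0002012\nGTCG\n>sM13rroot0003013\nGTCG\n"
    qal_text = ">sM13rroot0002012\n9 13 16 17\n"
    print(unmatched_names(seq_text, qal_text))
```

Names are compared as exact, case-sensitive strings, in line with the naming rules of this section.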

4. Sending Chromatogram Files

If you wish ESTAP to run Phred for you, send the chromatograms of the sequences. Put all sequences from the same sequencing run of the same project in one folder. The name of the folder follows the file naming convention without the .seq or .qal suffix, e.g., VBOVTA2VT1CS1120Tb0102011615. Each chromatogram file follows the sequence naming convention plus the suffix ".ab1", e.g., sM13rroot0002012.ab1.

IX. How to Use dbEST Submission Tool

The dbEST Submission Tool incorporated in ESTAP is available online to allow users to conveniently submit their cleansed EST sequences in the ESTAP database to NCBI's dbEST database. Note that only PIs have the privilege to submit their EST sequences to dbEST. PIs may follow the "dbEST Submission" link on the "Project List" page to ask ESTAP to prepare the dbEST submission files for them. During dbEST submission, a contact file, a library file, a publication file, and one or more EST files will be created and sent to the primary contact person via email. EST files will be in FASTA format with clone name and clean sequence ids as identifiers.

After submitting the EST sequences to NCBI's dbEST, the PI should receive an email from NCBI regarding the submitted batch of ESTs. In order to store the dbEST_ID, USER_ID, and GENBANK_ACCN assigned by NCBI in the ESTAP database, the PI should send this confirmation file received from NCBI to the ESTAP curator. The ESTAP curator should use the ReadConfirmationFile program to process the confirmation file.

The ReadConfirmationFile program is located under the java_prog directory of the uncompressed software. It is a stand-alone Java program which reads and parses the dbEST confirmation file and stores the dbEST_ID, USER_ID, and GENBANK_ACCN assigned by NCBI in the ESTAP database. To make this program work, the curator needs to modify a configuration file called "javaConfig.txt" in the java_prog directory and set the correct parameters (see Section IV.1.c). ReadConfirmationFile will use the parameters defined in this file to find the confirmation file and connect to the ESTAP database. The confirmation file should be named using the .txt suffix (e.g., JC_MCA_020314052943.txt) and placed in the directory path specified in the variable "ncbiConfirmFilePath" in the javaConfig.txt file. Go to the directory where ReadConfirmationFile.class is located and run the ReadConfirmationFile program using the following command:

    java ReadConfirmationFile


data processing, this .seq and .qal pair is moved from the input/XX directory to the done/XX directory. If the .seq and .qal pair is invalid, it is moved from the input/XX directory to the problem/XX directory. The corresponding lab (XX) subdirectories under the done and problem directories are automatically created by the ESTAP program (estapDriver.exe).

b. Under the server_side directory of the uncompressed ESTAP software, there are three sub-directories: estap_exe, include, and src.

The analysis programs (configuration files, executables, and scripts) are located in the estap_exe directory. The executable programs in this directory are compiled using gcc 2.95.3 and glibc-2.2.5 and tested on Linux version 2.4.18. The third-party software and files (cross_match, d2_cluster, CAP3, repeat.seq) are not included here; users should get them from their corresponding sources. The executable programs cross_match, d2_cluster, enc_db, and cap3 and the repeat.seq file should be placed in the estap_exe directory.

The configuration file db_password.txt is used for the database connection. The user needs to edit this file to provide a database user name and password for the ESTAP analysis programs to connect to the ESTAP database. The name of the file must be "db_password.txt". The content of the file must follow this format:

    USER_NAME=user_name@estap_db_name
    PASSWORD=user_password

For example:

    USER_NAME=estap_user@estapdb
    PASSWORD=password

The include and src directories contain the source code for the ESTAP analysis programs. A makefile is provided under the src directory. The user may recompile the programs using the "make build" command under the src directory. The newly compiled programs will go to the estap_exe directory.

c. Under the java_prog directory of the uncompressed ESTAP software, there are Java application programs and a configuration file. The configuration file javaConfig.txt provides parameters for the Java applications, such as the database connection parameters, the directory location for dbEST files, and the number of processors used for InterProScan analysis. The content of the file must follow this format:

    jdbcDriverClassName=oracle.jdbc.driver.OracleDriver
    jdbcURL=jdbc:oracle:thin:@server.vt.edu:1521:estap_db_name
    dbUserName=db_user_name
    dbUserPassword=db_user_password
    ncbiConfirmFilePath=directory for dbEST confirmation file
    iprscanNumProcessor=number of processors to be used for InterProScan analysis

For example:

    jdbcDriverClassName=oracle.jdbc.driver.OracleDriver
    jdbcURL=jdbc:oracle:thin:@server.vt.edu:1521:estapdb
    dbUserName=estap_user
    dbUserPassword=password
    ncbiConfirmFilePath=/home/estap/dbEST_Incoming
    iprscanNumProcessor=2
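A small pre-flight check of javaConfig.txt can catch a missing or mistyped setting before the Java programs are started. The following Python sketch is an illustration only: it assumes one key=value pair per line and that all six settings named in this section are required, and the function name is hypothetical. The ESTAP Java programs read the file themselves and may be more or less strict.

```python
# The six setting names are those listed in this section of the guide.
REQUIRED_KEYS = (
    "jdbcDriverClassName",
    "jdbcURL",
    "dbUserName",
    "dbUserPassword",
    "ncbiConfirmFilePath",
    "iprscanNumProcessor",
)


def load_java_config(text):
    """Parse key=value lines and check that every required setting is present."""
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        key, sep, value = line.partition("=")  # split on the first '=' only
        if not sep:
            raise ValueError(f"not a key=value line: {line!r}")
        config[key.strip()] = value.strip()
    missing = [key for key in REQUIRED_KEYS if not config.get(key)]
    if missing:
        raise ValueError("missing settings: " + ", ".join(missing))
    config["iprscanNumProcessor"] = int(config["iprscanNumProcessor"])
    return config


if __name__ == "__main__":
    with open("javaConfig.txt") as handle:
        print(load_java_config(handle.read()))
```

Run from the java_prog directory, this prints the parsed settings or stops with a message naming what is missing.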

2. Environment Variables

a. Environment variables needed for Oracle: PATH, ORACLE_HOME, LD_LIBRARY_PATH, ORA_CLIENT_LIB, and ORACLE_SID. The following are examples of the variable settings:

    export PATH=$PATH:/home/estap/blast:$ORACLE_HOME/bin
    export ORACLE_HOME=/home/oracle/products/9.2.0
    export LD_LIBRARY_PATH=$ORACLE_HOME/lib
    export ORA_CLIENT_LIB=shared
    export ORACLE_SID=estap

More information regarding database-related environment variables may be found in the document "README_db.txt" that is located in the database directory.

b. Environment variable needed for BLAST programs: please go to the NCBI site to view details about how to set up the BLAST programs. If your BLAST programs are located at /home/estap/blast, you need to create a .ncbirc file that sets Data=/home/estap/blast/data, and you need to include /home/estap/blast in the PATH variable (see above). You also need to set the environment variable BLASTDB to the directory where your blast databases are located, e.g., if they are located at /home/estap/blast/db:

    export BLASTDB=/home/estap/blast/db

c. Environment variable needed for the Phred program: PHRED_PARAMETER_FILE. If the Phred program (phred and phredpar.dat) is in the /home/estap/phred directory, the variable PHRED_PARAMETER_FILE should be set to /home/estap/phred/phredpar.dat, e.g.,

    export PHRED_PARAMETER_FILE=/home/estap/phred/phredpar.dat

If this variable is not set correctly, phred will not run properly.

d. Environment variable needed for the runPhred.exe program: PHRED_PATH. The runPhred.exe program wraps the phred program and combines all .ab1 files in one directory to produce one .seq and one .qal file in FASTA format. If the phred program (phred and phredpar.dat) is in the /home/estap/phred directory, the variable PHRED_PATH should be set to /home/estap/phred/phred, e.g.,

    export PHRED_PATH=/home/estap/phred/phred

If this variable is not set correctly, the program will not run properly.

e. Environment variable needed for the d2_cluster program: OMP_NUM_THREADS. If you use one processor, you need to set this variable to 1, e.g.,

    export OMP_NUM_THREADS=1

If this variable is not set, d2_cluster will not run. In some cases the d2_cluster program may not exit normally due to a memory problem. We recommend setting the environment variable MALLOC_CHECK_ to 1, e.g.,

    export MALLOC_CHECK_=1

In addition, d2_cluster requires libpgthread.so. You may place it in the /usr/local/lib directory.
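Because several programs fail quietly when a variable is missing, it can help to check the environment before starting the pipeline. The Python sketch below lists which of the variables named in items a through e are unset; the function name is hypothetical, and treating an empty value as "not set" is an assumption. MALLOC_CHECK_ is left out because it is only a recommendation.

```python
import os

# Variables named in items a-e of this section.
REQUIRED_VARS = [
    "PATH",
    "ORACLE_HOME",
    "LD_LIBRARY_PATH",
    "ORA_CLIENT_LIB",
    "ORACLE_SID",
    "BLASTDB",
    "PHRED_PARAMETER_FILE",
    "PHRED_PATH",
    "OMP_NUM_THREADS",
]


def missing_variables(env=None):
    """Return the required variables that are unset or empty in env."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    for name in missing_variables():
        print("not set:", name)
```

The check only confirms that the variables exist; it does not verify that the paths they point to are correct.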

3. Run the Programs

a. FTP and format NCBI blast databases (e.g., nr, nt)

The fetched NCBI databases are updated every month or as frequently as you wish. Please make sure that the ESTAP analysis programs are not running when you update the NCBI databases; otherwise the ESTAP analysis programs will not run properly. Also make sure that database formatting is complete before running the ESTAP programs.

We provide some shell scripts (getAndFormatNcbiDB.sh, getDB.sh, uncompressFormatDB.sh) and a C program (getAndFormatNcbiDB.exe) under the estap_exe directory to automatically update the NCBI databases. We use crontab to schedule the update, e.g.,

    30 3 2 * * ~/estap_exe/getAndFormatNcbiDB.sh > get_db.log 2>&1

getAndFormatNcbiDB.sh calls getAndFormatNcbiDB.exe, which calls getDB.sh and uncompressFormatDB.sh. The getAndFormatNcbiDB.exe program waits until all ESTAP analysis programs finish before updating the NCBI databases. getDB.sh fetches the NCBI databases via wget. uncompressFormatDB.sh uncompresses the fetched NCBI databases and formats them using formatdb. Please revise the scripts if your directory setup is different.

b. Where to put raw data (.ab1 files)

The .ab1 files can be stored in a user-specified directory, e.g., /home/estap/raw_data. When receiving data from customers, be sure to create a new directory under this directory using the following naming convention for better management (the user may choose a different naming convention for this): labcode-ddmmmyy, e.g., JC-06may02. Then transfer the data to this directory. All sequence files must be named following the ESTAP naming conventions (see Section VIII for how to prepare raw data and for the naming convention).
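The labcode-ddmmmyy directory name can be generated rather than typed by hand. The sketch below is illustrative only; the function name is hypothetical, and a fixed month table is used so that the result does not depend on the system locale.

```python
from datetime import date

# Lower-case English month abbreviations, as in the example JC-06may02.
MONTHS = ("jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec")


def incoming_dir_name(lab_code, received):
    """Return labcode-ddmmmyy for the date the data were received."""
    month = MONTHS[received.month - 1]
    return f"{lab_code}-{received.day:02d}{month}{received.year % 100:02d}"


if __name__ == "__main__":
    print(incoming_dir_name("JC", date(2002, 5, 6)))  # prints JC-06may02
```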

c. Run Phred

Go to the directory where runPhred.exe is located, e.g., /home/estap/server_side/estap_exe, and run the runPhred.exe program. This program wraps the phred program to do base calling and combines all .ab1 files in one sequence file directory to produce one .seq and one .qal file in FASTA format. If multiple sequence file directories exist, multiple .seq and .qal files will be generated.

Command:

    runPhred.exe input_directory output_directory

e.g.,

    runPhred.exe /home/estap/raw_data/jc-06may02 /home/estap/raw_data/jc-06may02-phred-out

where runPhred.exe is the program name, /home/estap/raw_data/jc-06may02 is the directory that contains the .ab1 files from the JC lab, and /home/estap/raw_data/jc-06may02-phred-out is the output directory where the .seq and .qal files go.

Before running the program, you need to create the output_directory (/home/estap/raw_data/jc-06may02-phred-out). After the program is done, check phred.log (the program automatically logs the results) in the /home/estap/server_side/estap_exe directory for any problems. If there are none, you may either rename the log file or remove it. In the output_directory, remove any files that do not have a .seq or .qal extension.

d. Copy the .seq and .qal files generated by the runPhred.exe program to the correct lab subdirectory under the input directory, e.g., if the files belong to the JC lab, copy those files to /home/estap/raw_seq/input/JC.

e. Run the cleansing program to verify the integrity of the .seq and .qal files, cleanse the sequences, and store the results in the ESTAP database.

Command:

    estapDriver.exe root_path_name > clean.log 2>&1

e.g.,

    estapDriver.exe /home/estap/raw_seq > clean.log 2>&1

where estapDriver.exe is the program name, root_path_name is the parent directory path of the input directory (see Section IV.1 for the directory hierarchy description), and "> clean.log 2>&1" logs the program output (both stdout and stderr) to a log file named clean.log. This clean.log can be viewed for debugging purposes. After the cleansing program has run successfully, clean.log can be removed to save disk space.

You may use nohup to run the command, which makes the program immune to hang-ups, e.g.,

    nohup estapDriver.exe root_path_name > clean.log 2>&1 &

It is better to use nohup and run the program in the background (with & at the end of the command). To check whether the program is running, use the command

    ps -ef | grep estapDriver.exe

If for some reason you need to re-run the cleansing for some sequence files, follow these steps:

(1) Delete the sequence file(s) from the database directly.
(2) Reload the corresponding .seq and .qal files to the input/[lab_code] directory.
(3) Run estapDriver.exe as described above in item e.

f. Run the ESTAP blast program to do blast according to the corresponding blast protocols.

Command:

    runBlast.exe > blast.log 2>&1

where runBlast.exe is the program name and "> blast.log 2>&1" logs the program output (both stdout and stderr) to a log file named blast.log. This blast.log can be viewed for debugging purposes. After the runBlast program has run successfully, blast.log can be removed to save disk space.

You may use the nohup command, e.g.,

    nohup runBlast.exe > blast.log 2>&1 &

It is better to use nohup and run the program in the background (with & at the end of the command). To check whether the program is running, use the command

    ps -ef | grep runBlast.exe

Currently the runBlast.exe program is set to blast 1000 sequences at a time (in the future the setting can be changed to a bigger number). If there are more than 1000 sequences to blast, you need to run this program multiple times. Check blast.log to see how many sequences remain to be blasted.

Note: Wait at least 5-10 minutes before starting another runBlast.exe program, to make sure that the sequences to be blasted in the previous run are already recorded in the ESTAP database. Check the blast.log file to see if the blast command already appears. When starting a second runBlast.exe program, use a different log file name instead of blast.log, e.g., blast2.log. The same precautions apply when you need to run the program three or more times.
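The number of runs follows directly from the batch size. The sketch below works out how many runBlast.exe runs are needed for a given number of pending sequences and suggests one log file name per run; the function names are hypothetical, and naming the third and later logs blast3.log, blast4.log, and so on simply extends the blast2.log example.

```python
BATCH_SIZE = 1000  # sequences handled by one runBlast.exe run, per this guide


def runs_needed(pending, batch_size=BATCH_SIZE):
    """Return how many runs are needed to blast `pending` sequences."""
    if pending < 0:
        raise ValueError("pending must not be negative")
    return (pending + batch_size - 1) // batch_size  # integer ceiling


def log_names(runs, stem="blast"):
    """Return one log file name per run: blast.log, blast2.log, ..."""
    return [f"{stem}.log" if i == 1 else f"{stem}{i}.log"
            for i in range(1, runs + 1)]


if __name__ == "__main__":
    print(log_names(runs_needed(2500)))
```

The same arithmetic applies to the other programs in this section that process 1000 items per run.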

g. Run the reblast program to do reblast according to the corresponding blast protocols.

Command:

    reblast.exe > reblast.log 2>&1

where reblast.exe is the program name and "> reblast.log 2>&1" logs the program output (both stdout and stderr) to a log file named reblast.log. This reblast.log can be viewed for debugging purposes. After the reblast program has run successfully, reblast.log can be removed to save disk space.

You may use nohup to run the command, which makes the program immune to hang-ups, e.g.,

    nohup reblast.exe > reblast.log 2>&1 &

It is better to use nohup and run the program in the background (with & at the end of the command). To check whether the program is running, use the command

    ps -ef | grep reblast.exe

Currently the reblast.exe program is set to reblast 1000 sequences at a time (in the future the setting can be changed to a bigger number). If there are more than 1000 sequences to reblast, you need to run this program multiple times. Check reblast.log to see how many sequences remain to be blasted.

Note: Wait at least 5-10 minutes before starting another reblast.exe program, to make sure that the sequences to be blasted in the previous run are already recorded in the ESTAP database. Check the reblast.log file to see if the blast command already appears. When starting a second reblast.exe program, use a different log file name instead of reblast.log, e.g., reblast2.log. The same precautions apply when you need to run the program three or more times.

h. Run the cluster/assembly programs

You have two options: 1) do cluster/assembly for a specific project; 2) do cluster/assembly for all projects.

1) Do cluster/assembly for a specific project

Command:

    cluster.exe project_id > cluster_project_id.log 2>&1

e.g.,

    cluster.exe 377 > cluster_377.log 2>&1

where cluster.exe is the program name, project_id is the project id of the project you wish to run (note: this is the internal id, not the project code), and "> cluster_project_id.log 2>&1" logs the program output (both stdout and stderr) to a log file named cluster_project_id.log. This log file can be viewed for debugging purposes. After the program has run successfully, the log file can be removed to save disk space. You may use nohup to run the command, which makes the program immune to hang-ups, e.g.,

    nohup cluster.exe project_id > cluster_project_id.log 2>&1 &

To check whether the program is running, use the command

    ps -ef | grep cluster.exe

2) Do cluster/assembly for all projects that are qualified for the process (this is preferred)

Command:

    clusterAll.exe > cluster.log 2>&1

where clusterAll.exe is the program name and "> cluster.log 2>&1" logs the program output (both stdout and stderr) to a log file named cluster.log. This log file can be viewed for debugging purposes. After the program has run successfully, the log file can be removed to save disk space. You may use nohup to run the command, which makes the program immune to hang-ups, e.g.,

    nohup clusterAll.exe > cluster.log 2>&1 &

To check whether the program is running, use the command

    ps -ef | grep clusterAll.exe

i. Run the contig blast program to do blast for assembled contigs according to the corresponding blast protocols.

Command:

    blastAllContigs.exe > contig_blast.log 2>&1

where blastAllContigs.exe is the program name and "> contig_blast.log 2>&1" logs the program output (both stdout and stderr) to a log file named contig_blast.log. This contig_blast.log can be viewed for debugging purposes. After the blastAllContigs program has run successfully, the log file can be removed to save disk space.

You may use nohup to run the command, which makes the program immune to hang-ups, e.g.,

    nohup blastAllContigs.exe > contig_blast.log 2>&1 &

It is better to use nohup and run the program in the background (with & at the end of the command). To check whether the program is running, use the command

    ps -ef | grep blastAllContigs.exe

Currently the blastAllContigs.exe program is set to blast 1000 contigs at a time (in the future the setting can be changed to a bigger number). If there are more than 1000 contigs to blast, you need to run this program multiple times. Check contig_blast.log to see how many sequences remain to be blasted.

Note: Wait at least 5-10 minutes before starting another blastAllContigs.exe program, to make sure that the sequences to be blasted in the previous run are already recorded in the ESTAP database. Check the contig_blast.log file to see if the blast command already appears. When starting a second blastAllContigs.exe program, use a different log file name instead of contig_blast.log, e.g., contig_blast2.log. The same precautions apply when you need to run the program three or more times.

j. Run the contig reblast program to reblast assembled contigs according to the corresponding blast protocols.

Command:

    reblastAllContigs.exe > contig_reblast.log 2>&1

where reblastAllContigs.exe is the program name and "> contig_reblast.log 2>&1" logs the program output (both stdout and stderr) to a log file named contig_reblast.log. This log file can be viewed for debugging purposes. After the reblastAllContigs program has run successfully, the log file can be removed to save disk space.

You may use nohup to run the command, which makes the program immune to hang-ups, e.g.,

    nohup reblastAllContigs.exe > contig_reblast.log 2>&1 &

It is better to use nohup and run the program in the background (with & at the end of the command). To check whether the program is running, use the command

    ps -ef | grep reblastAllContigs.exe

Currently the reblastAllContigs.exe program is set to reblast 1000 contigs at a time (in the future the setting can be changed to a bigger number). If there are more than 1000 contigs to reblast, you need to run this program multiple times. Check contig_reblast.log to see how many sequences remain to be blasted.

Note: Wait at least 5-10 minutes before starting another reblastAllContigs.exe program, to make sure that the sequences to be blasted in the previous run are already recorded in the ESTAP database. Check the contig_reblast.log file to see if the blast command already appears. When starting a second reblastAllContigs.exe program, use a different log file name instead of contig_reblast.log, e.g., contig_reblast2.log. The same precautions apply when you need to run the program three or more times.

k. Set up and run the InterProScan application

InterProScan is a tool developed at EBI that combines different protein signature recognition methods into one resource. ESTAP wraps this tool to query contigs and singlets against the InterPro public databases and automatically annotate the protein functions. The GO terms of the matched proteins are linked to the contigs and singlets. Please follow the instructions below to set up and run the InterProScan application.

1) Download iprscan

Go to the ftp://ftp.ebi.ac.uk/pub/software/unix/iprscan/ site to download the following files:

(1) iprscan_vXXX.tar.gz, where XXX is the version number of the iprscan release (e.g., 3.2)
(2) iprscan_bin_XXX.tar.gz, where XXX is your platform (e.g., Linux)
(3) iprscan_DATA_XXX.tar.gz, where XXX is the version number of the InterPro database release (e.g., 6.1)

2) Uncompress the files, in the order specified above, into your home directory ($HOME_DIR). The iprscan home directory will be $HOME_DIR/iprscan, e.g., /home/estap/iprscan.

3) First-time configuration

Go to the iprscan home directory and run "perl CONFIG.pl". Then:

(1) Modify CONFIG.pl as described below (save a copy of the original CONFIG.pl file as CONFIG_orig.pl).

(i) At the beginning of CONFIG.pl (in main), along with the other variable declarations, add the following lines:

    my $IprPWD = "iprpwd.txt";
    my $pwd = $ENV{PWD};
    unless ($pwd) {
        $pwd = `pwd`;
        chomp $pwd;
    }

(ii) Before the line that sets up applications,

    $applset = get_user_prompt("Setup applications $first (y|n)");

add these lines:

    open (FF, ">$IprPWD") || die "Cannot create $IprPWD: $!";
    print FF $pwd;
    close FF;

(2) Modify the InterProScan.pl file as described below (save a copy of the original file as InterProScan_orig.pl).

(i) Replace

    my $seqfile = $ARGV[0];

with

    my $seqfile = $ARGV[1];

(ii) After the line

    my $UserId = Manager->getUserId();

add these lines:

    my $OutDir = $ARGV[0];
    print "$OutDir \n";
    # Do not use $path in this file since it won't be recognized by the iprscan wrapper program.
    # $OutDir, instead of $path, is used as an argument for the iprscan wrapper program to store iprscan result files.

(iii) Replace $path with $OutDir in all the lines following the above comment lines.

4) Reconfiguration

Go to the iprscan home directory and run "perl CONFIG.pl". You will be asked to choose the member databases you would like to search against. If you only use the public data and applications that come with the iprscan distribution, you should answer "n" for SignalPHMM and TMHMM during configuration, since they are not public. When the configuration is complete, a file called iprpwd.txt will be created under the iprscan home directory. The iprpwd.txt file stores the iprscan path information, which is used by the iprscan wrapper programs. You must copy the iprpwd.txt file to the java_prog directory, where the iprscan wrapper programs are placed, so that the Java programs can read the iprscan path information from this file.

5) Run iprscan

Go to the java_prog directory, where the Java application and configuration files (javaConfig.txt and iprpwd.txt) are placed, and set the proper parameters in the configuration file as described in Section IV.1.c. You may analyze a specific project or all projects at a time.

If you wish to run a specific project, use the command

    nohup java InterProDriver lab_code project_code > ipr_project_code.log 2>&1 &

e.g.,

    nohup java InterProDriver JC MCA > ipr_MCA.log 2>&1 &

where InterProDriver is the program name, lab_code is the two-letter lab code of the lab that the project belongs to, project_code is the three-letter project code of the project being analyzed, and "> ipr_project_code.log 2>&1" logs the program output (both stdout and stderr) to a log file named ipr_project_code.log.

If you wish to run all projects that are qualified for this procedure, use the command

    nohup java InterProScanAll > ipr.log 2>&1 &

where InterProScanAll is the program name and "> ipr.log 2>&1" logs the program output (both stdout and stderr) to a log file named ipr.log.

The log files can be viewed for debugging purposes. After the program has run successfully, the log files can be removed to save disk space.

l. Run the genomic DNA assembly programs

You have two options: 1) do DNA assembly for a specific genome project; 2) do DNA assembly for all qualified genome projects.

1) Do assembly for a specific genome project

Command:

    nohup gAssemble.exe project_id > assemble_project_id.log 2>&1

e.g.,

    nohup gAssemble.exe 377 > assemble_377.log 2>&1

where gAssemble.exe is the program name, project_id is the project id of the project you wish to run (note: this is the internal id, not the project code), and "> assemble_project_id.log 2>&1" logs the program output (both stdout and stderr) to a log file named assemble_project_id.log.

2) Do assembly for all genome projects that are qualified for the process (this is preferred)

Command:

    nohup gAssembleAll.exe > assemble.log 2>&1

where gAssembleAll.exe is the program name and "> assemble.log 2>&1" logs the program output (both stdout and stderr) to a log file named assemble.log.

m. Prepare EST database files from user projects

The ESTAP Web interface allows users to blast specific sequence(s) against the ESTs from their projects. We provide a program called writeAndFormatFasta.exe to create and format EST database files from user projects. This program can be scheduled to run daily, or as often as you wish, to get the most recent EST databases for users to blast against. To run this program, you need to create a directory where the EST database files will be placed, e.g. /home/estap/estdb. This directory path should be used for the ESTDBFilePath parameter setting in the web.xml file (see Section V.3.e) so that the ESTAP local blast Web service can recognize

4. Schedule Programs

The above programs can be run manually as described above, or scheduled to run automatically using the UNIX cron facility. Please see the UNIX man page for the crontab command. The following is an example of the scheduling:

    0 9,11,17 * * * ~/estap_exe/estapDriver.sh >> estap.log 2>&1
    0 19,20,21 * * * ~/estap_exe/runBlast.sh >> blast.log 2>&1
    30 22,23,0 * * * ~/estap_exe/reblast.sh >> reblast.log 2>&1
    30 3 2 * * ~/estap_exe/getAndFormatNcbiDB.sh > get_db.log 2>&1
    0 5 * * * ~/estap_exe/clusterAll.sh >> ~/estap_exe/cluster.log 2>&1
    0,30 19 * * * ~/estap_exe/blastAllContigs.sh >> ~/estap_exe/blast_contigs.log 2>&1
    0,30 12 * * * ~/estap_exe/reblastAllContigs.sh >> ~/estap_exe/reblast_contigs.log 2>&1
    0,30 7 * * * ~/estap_exe/gAssembleAll.sh >> ~/estap_exe/assemble.log 2>&1
    5 2 * * * ~/estap_exe/writeAndFormatFasta.sh >> ~/estap_exe/estap_formatdb.log 2>&1

The shell scripts used here call the corresponding programs and make sure the programs are run under the correct directories and with the correct environment variable settings. These scripts should be revised when the directory setup and environment variable settings are different.
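The wrapper scripts themselves are not reproduced in this guide. As a rough sketch only, a wrapper in the spirit of estapDriver.sh might export the environment and change to the program directory before launching the program. The function name run_estap_prog and the directory layout are assumptions made for illustration; the variable values repeat the examples given in Section IV.2.

```shell
#!/bin/sh
# Hypothetical wrapper sketch; this is not the estapDriver.sh shipped with ESTAP.

# Example settings taken from Section IV.2; adjust them to the local installation.
ORACLE_HOME=/home/oracle/products/9.2.0
ORACLE_SID=estap
LD_LIBRARY_PATH=$ORACLE_HOME/lib
PATH=$PATH:/home/estap/blast:$ORACLE_HOME/bin
export ORACLE_HOME ORACLE_SID LD_LIBRARY_PATH PATH

# run_estap_prog DIR PROGRAM [ARGS...]
# Runs PROGRAM from inside DIR. The subshell keeps the caller's working
# directory unchanged, and a failed cd stops the program from being started.
run_estap_prog() {
    dir=$1
    shift
    ( cd "$dir" && "$@" )
}

# Example call (assumed layout):
#   run_estap_prog "$HOME/estap_exe" ./estapDriver.exe
```

Because cron starts jobs with a minimal environment, exporting the variables inside the wrapper rather than relying on the login shell is the point of this arrangement.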

5. Trouble-shooting

When a program is stopped abnormally (e.g. power or network failure, processes are killed, etc.), some data may be partially analyzed and partial results may be stored in the database. You may need to do some clean-up (removing partial results from the database) and rerun the program. Please use the following instructions.

a. Go to the ESTAP database table PROG_RUN_STATUS in the ESTAP_SYS_DEF schema. This table records which programs are running or completed, when they were started, and when they were completed. If a program is killed in the middle of a run for some reason, the end_time will not be recorded and the status will still be running (status = 1). You need to change the status from 1 to 0 to indicate that the program is no longer running.

b. If the program was runBlast.exe or reblast.exe, after you change the status in the PROG_RUN_STATUS table, please go to the ANALYSIS_TRACKING table in the ESTAP_ANALYSIS schema and delete those rows created by the program that have status = 1, e.g.

    DELETE from ANALYSIS_TRACKING where status = 1;

Commit the changes after you are sure that you want to delete them. Then rerun the runBlast.exe program.

c. If the program was cluster.exe, after you change the status in the PROG_RUN_STATUS table, please go to the CLUSTER_SUMMARY table in the ESTAP_ANALYSIS schema and delete the row with the project id that you specified when you ran cluster.exe. Commit the changes after you are sure that you want to delete them. Then rerun the cluster.exe program.

d. If the program was clusterAll.exe, after you change the status in the PROG_RUN_STATUS table (look for "cluster.exe" and "clusterAll.exe" in the prog_name column), please go to the CLUSTER_SUMMARY table in the ESTAP_ANALYSIS schema and delete the rows created by the program that have a status of 1. Commit the changes after you are sure that you want to delete them. Then rerun the clusterAll.exe program.

e. If the program was blastAllContigs.exe or reblastAllContigs.exe, after you change the status in the PROG_RUN_STATUS table, please go to the CONTIG_ANALYSIS table in the ESTAP_ANALYSIS schema and delete those rows created by the program that have status = 1, e.g.

    DELETE from CONTIG_ANALYSIS where status = 1;

Commit the changes after you are sure that you want to delete them. Then rerun the blastAllContigs.exe program.
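The clean-up rules in steps a to e can be summarized as a small lookup. The sketch below only prints the SQL for review; it does not connect to the database. The function name cleanup_sql is hypothetical, and the statements assume that the prog_name column holds the program names exactly as written in this guide and that CLUSTER_SUMMARY has the project_id column used elsewhere in this guide. Review the output before running it in SQL*Plus.

```shell
#!/bin/sh
# cleanup_sql PROGRAM [PROJECT_ID]
# Prints the clean-up SQL for an interrupted PROGRAM (illustrative sketch only).
cleanup_sql() {
    prog=$1
    project_id=${2:-}
    progs="'$prog'"
    case $prog in
        runBlast.exe|reblast.exe)
            delete_stmt="DELETE FROM ESTAP_ANALYSIS.ANALYSIS_TRACKING WHERE status = 1;" ;;
        cluster.exe)
            # Step c deletes by project id, so the id is mandatory here.
            [ -n "$project_id" ] || { echo "cluster.exe needs a project id" >&2; return 1; }
            delete_stmt="DELETE FROM ESTAP_ANALYSIS.CLUSTER_SUMMARY WHERE project_id = $project_id;" ;;
        clusterAll.exe)
            # Step d says to look for both program names in PROG_RUN_STATUS.
            progs="'cluster.exe', 'clusterAll.exe'"
            delete_stmt="DELETE FROM ESTAP_ANALYSIS.CLUSTER_SUMMARY WHERE status = 1;" ;;
        blastAllContigs.exe|reblastAllContigs.exe)
            delete_stmt="DELETE FROM ESTAP_ANALYSIS.CONTIG_ANALYSIS WHERE status = 1;" ;;
        *)
            echo "no clean-up rule for $prog" >&2
            return 1 ;;
    esac
    # Step a: mark the program as no longer running, then remove partial results.
    printf "UPDATE ESTAP_SYS_DEF.PROG_RUN_STATUS SET status = 0 WHERE prog_name IN (%s) AND status = 1;\n" "$progs"
    printf '%s\n' "$delete_stmt"
    printf 'COMMIT;\n'
}
```

For example, `cleanup_sql cluster.exe 377` prints the three statements for an interrupted cluster.exe run on internal project id 377.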

V. How to Install and Run Web Programs

1. Install Oracle Client including JDBC

Please contact Oracle to obtain licensing and product support. ESTAP uses the Oracle 9i client.

2. Install JDK, Servlet and Web Service Software

a. To install JDK 1.3 (Windows, Linux, Solaris) or higher, please go to http://java.sun.com/j2se/

b. To install Tomcat, please go to http://jakarta.apache.org/tomcat/

c. To install Apache Axis, please go to http://ws.apache.org/axis/. The Apache Axis beta 3 version is used for the ESTAP local blast Web service.

3. Configure Tomcat

a. Visit http://jakarta.apache.org/tomcat/ for instructions on Tomcat configuration.

b. Set up the correct PATH and CLASSPATH environment.

c. Modify server.xml in the Tomcat conf directory. In this file, add the following for the estap context:

    <Context path="/estap" docBase="estap" showdebuginfo="true"
             debug="DEBUG" reloadable="true" crossContext="false">
      <Logger className="org.apache.catalina.logger.FileLogger"
              prefix="localhost_estap_log." suffix=".txt" timestamp="true"/>
    </Context>

d. Place the estap Web application under the Tomcat webapps directory: copy the estap directory from the web_side directory of the uncompressed software to the webapps directory.

e. Modify the web.xml file in the estap/WEB-INF directory. The following section in web.xml is specific to ESTAP. Please replace the parameter values with your own settings.

    <servlet>
      <servlet-name>ProjectList</servlet-name>
      <servlet-class>ProjectList</servlet-class>
      <init-param>
        <param-name>jdbcDriverClassName</param-name>
        <param-value>oracle.jdbc.driver.OracleDriver</param-value>
      </init-param>
      <init-param>
        <param-name>jdbcURL</param-name>
        <param-value>jdbc:oracle:thin:@server.vt.edu:1521:estapdb</param-value>
      </init-param>
      <init-param>
        <param-name>dbUserName</param-name>
        <param-value>estap_user</param-value>
      </init-param>
      <init-param>
        <param-name>dbUserPassword</param-name>
        <param-value>password</param-value>
      </init-param>
      <init-param>
        <param-name>dbInitConnect</param-name>
        <param-value>1</param-value>
      </init-param>
      <init-param>
        <param-name>dbMaxConnect</param-name>
        <param-value>50</param-value>
      </init-param>
      <init-param>
        <param-name>dbESTFilePath</param-name>
        <param-value>/home/estap/dbEST_Outgoing</param-value>
      </init-param>
      <init-param>
        <param-name>ESTDBFilePath</param-name>
        <param-value>/home/estap/estdb</param-value>
      </init-param>
      <init-param>
        <param-name>blastallPath</param-name>
        <param-value>/home/estap/blast/blastall</param-value>
      </init-param>
      <init-param>
        <param-name>formatdbPath</param-name>
        <param-value>/home/estap/blast/formatdb</param-value>
      </init-param>
      <init-param>
        <param-name>estapEmail</param-name>
        <param-value>estap@vbi.vt.edu</param-value>
      </init-param>
      <load-on-startup>1</load-on-startup>
    </servlet>

The following list defines the parameters:

    jdbcURL: the URL for the JDBC connection
    dbUserName: the ESTAP database user name
    dbUserPassword: the ESTAP database user password
    dbInitConnect: the number of database connections to be made initially
    dbMaxConnect: the maximum number of database connections
    dbESTFilePath: the directory path to store dbEST submission files
    ESTDBFilePath: the directory path to store the EST files in FASTA format and their formatted files used by the ESTAP local blast Web service
    blastallPath: the path for the blastall program
    formatdbPath: the path for the formatdb program
    estapEmail: the ESTAP curator's email address

f. Set heap size for JVM

Some of the ESTAP Web programs require more memory than the JVM default heap size of 64MB. To avoid an "out of memory" error, the user should set a larger heap size for the JVM. To do this, the user must set the JAVA_OPTS variable in the catalina.sh/catalina.bat or setclasspath.sh/setclasspath.bat file in the tomcat_home/bin directory. For example, if you wish to set the heap size to 256MB, you may add the line "set JAVA_OPTS=-mx256m" to the catalina.bat file.

g. Start Tomcat server

To start the Tomcat server, execute tomcat_home/bin/startup.bat on Windows (or tomcat_home/bin/startup.sh on Unix/Linux). Then enter the URL http://localhost/ in your browser and make sure that you get the Tomcat welcome page, not an error message saying that the page cannot be displayed or that the server cannot be found. If you choose to use port number 8080 instead of 80, you will need to use a URL that includes the port number, such as http://localhost:8080/. If this is successful, enter http://localhost/estap/servlet/Login or http://localhost:8080/estap/servlet/Login to see whether the ESTAP login page is displayed.
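The rule for when the port appears in the login URL can be written down as a small helper, for instance for use in a post-install check script. This is an illustrative sketch; the function name estap_login_url is an assumption and is not part of the ESTAP distribution.

```shell
#!/bin/sh
# estap_login_url HOST [PORT]
# Prints the ESTAP login URL. The port is omitted when it is 80 (the HTTP
# default) and included otherwise, e.g. for Tomcat's usual 8080.
estap_login_url() {
    host=$1
    port=${2:-80}
    if [ "$port" = "80" ]; then
        printf 'http://%s/estap/servlet/Login\n' "$host"
    else
        printf 'http://%s:%s/estap/servlet/Login\n' "$host" "$port"
    fi
}
```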

4. Configure Axis

a. Go to http://ws.apache.org/axis/ for downloading and installation instructions.

b. Place the downloaded axis directory under the tomcat_home/webapps directory.

c. Move the server-config.wsdd file from the tomcat_home/webapps/estap/WEB-INF directory to the tomcat_home/webapps/axis/WEB-INF directory.

d. Add all jar files under the tomcat_home/webapps/axis/WEB-INF/lib and tomcat_home/webapps/axis/WEB-INF/classes directories to your CLASSPATH environment variable.

e. Move the blastService and createESTDBService directories from the tomcat_home/webapps/estap/WEB-INF/classes directory to the tomcat_home/webapps/axis/WEB-INF/classes directory.

f. Replace the server name and Tomcat port with your own server name and port number in the JBlastServiceLocator.java and JCreateESTDBServiceLocator.java files under the blastService/ws and createESTDBService/ws directories, respectively. E.g. for blastService, run the commands

    cd blastService
    cd ws

Edit the JBlastServiceLocator.java file by changing http://arda.vbi.vt.edu:8080 to http://yourServerName:yourPort. Then compile the file just modified by running the command

    javac JBlastServiceLocator.java

Repeat the above process for the createESTDBService directory.

g. Copy the blastService and createESTDBService directories to the tomcat_home/webapps/estap/WEB-INF/classes directory.

h. Compile EstapWebBlast.java by running the command

    javac EstapWebBlast.java

i. Restart the Tomcat server and go to http://localhost/axis or http://localhost:8080/axis to view the list of deployed Web services. If Axis is configured correctly, you will see the createESTDB and blast services.

VI. How to Register and Update ESTAP Users

ESTAP defines 3 types of users with different role permissions: sponsors, PIs, and other users.

Sponsors are those who fund the development of ESTAP. If there is no "sponsor" according to this definition, then the person who is in charge of the ESTAP system becomes the sponsor. There is one sponsor per institution. Sponsors can register/update PIs and have all PI privileges.

PIs are the people responsible for the projects. PIs can register/update projects, project users, and project protocols. PIs can also register ESTAP codes (e.g. primer, instrument, base caller, etc.), submit ESTs of their projects to dbEST, and view their project data and analysis results.

Other users include primary contacts with update permission, primary contacts with read-only permission, and project viewers with read-only permission. Primary contacts with update privileges (associated with particular project(s)) can update projects and project protocols, and view project data and analysis results. Primary contacts with read-only permission and project viewers (associated with particular project(s)) can view project data and analysis results.

At least one sponsor must exist before PIs can register their projects and users. Sponsors are registered by the ESTAP curator/database administrator, who has direct access to the ESTAP tables in the database. Please use the following instructions to register sponsors.

1) In the INSTITUTION table, modify the first record, which has institution_id = 1. Change the institution_name to the institution name of the sponsor. If there is more than one sponsor to be added, add a new record for each remaining sponsor. For each record, increment institution_id by 1 and enter the institution name and description in the institution_name and institution_desc columns.

2) In the LAB table, modify the record with the lab_code of XX. Change all columns of this record to the desired values. The lab_email entry must be the email address of the sponsor. The lab_alt_email entry is the email address of the primary contact in the sponsor's lab. Each sponsor has one lab. You may add more records if there is more than one sponsor.

3) In the PERSON table, modify the person record with the person_id of 1. Change the user_name to the email address of the sponsor, and the password to the sponsor's password for logging on to the Web site. The user_type is 1 and sponsor_id is 1 (referring to the person_id of the sponsor him/herself in the PERSON table). The other columns are self-explanatory by their names. Add more records if there is more than one sponsor, and increment person_id for each additional sponsor.

4) In the CONTACT table, change the record with the contact_id of 1 and person_id of 1 to the contact information of the sponsor. Add more records if there is more than one sponsor, and increment contact_id for each additional sponsor.

Once sponsors are registered in the ESTAP system, they can log on to the ESTAP Web site to register or update PIs in their institutions. Once PIs are registered, PIs can log on to the ESTAP Web site to register or update their projects. In addition, PIs can register or update the primary contacts and the project viewers for their projects.

VII. How to Register and Update EST Projects and Protocols

1. Project Registration and Update

ESTAP defines two types of projects: regular and virtual projects. A regular project refers to a specific EST library that a lab constructed. The regular project information includes biological information regarding the tissues and cloning methods used in cDNA library construction, and EST sequencing information. In order to analyze combined libraries from a particular taxonomy, we introduced the virtual project concept. We now allow PIs to register a virtual project that combines their regular projects from a particular taxonomy.

To register a regular project or a virtual project, please go to the "Project List" page, click the "Register Regular Project" link or the "Register Virtual Project" link, and follow the instructions in the registration form carefully.

PIs and the primary contacts with update privilege may also update the project information via the Web. The current version of ESTAP does not allow updating a project once the sequences of that project have been processed and stored in the ESTAP database. If the user needs to make corrections to the project information after the sequences are processed, he or she has to make a request to the ESTAP curator. The ESTAP curator should delete the already loaded sequences for that project, make corrections to the project information, and then reload the sequences by running the estapDriver.exe program.

2. Protocol Registration and Update

ESTAP offers the following protocols for data analysis.

1) Cleansing protocols

The cleansing protocols contain procedures to cleanse raw sequences by removing low quality end sequences, vectors, and polyA (or polyT), and by screening for chimeras and contamination. There are two types of cleansing protocols: one is the cleansing protocol for BLAST and the other is the cleansing protocol for assembly. The sequences cleansed by the cleansing protocol for BLAST are used for BLAST searching, and the sequences cleansed by the cleansing protocol for assembly are used for the cluster/assembly analysis. The user may utilize the same procedures and parameters for both protocols. There are several key points for the cleansing to be successful:

a. The cleansing program uses the 5' and 3' vector fragments provided in the project registration to match the vector portions in the clones. So the 5' and 3' vector fragments (both the orientation of the fragments and the sequences) must be correct.

b. The vector and insert adaptor sequences provided in the project registration should not be the whole adaptor sequences. Only the regions that are present in the clones should be included.

c. The sequencing primer code in the sequence names of the raw sequence data must be correct. The correct primer code indicates the correct orientation of the sequencing (5' or 3'). A wrong orientation will cause the removal of vector/adaptor to fail. Please check the primer table carefully to make sure the orientation of the primer is correct.

d. When the cleansing procedure does not seem to give the expected result, please check whether the above key points have been followed.

2) Blast protocol

The blast protocol provides procedures and parameters for BLAST (BLASTN, BLASTX, and/or TBLASTX) searches. Users may choose to run all, some, or none of the blast procedures in the protocol.

3) Cluster/assembly protocol

The cluster/assembly protocol provides procedures and parameters for clustering and assembly analysis. Users register this protocol only if they wish to do cluster/assembly. To let ESTAP run the cluster/assembly programs, users need to register the cluster protocol for each project they wish to cluster/assemble. Projects are not clustered automatically without the cluster protocol.

Please go to the View Project page of the project you wish to cluster/assemble, and follow the link called Cluster Protocol to register the protocol and set the procedure parameters. The parameters that ESTAP provides are from the d2_cluster and CAP3 programs. The Register Cluster/Assembly Protocol page has a link, "Cluster/Assembly Protocol Guide", which provides paper references for the d2_cluster and CAP3 programs. Once the protocol is registered, ESTAP will automatically perform cluster/assembly for your project according to the protocol you registered.

If you wish to perform cluster/assembly for combined libraries (not for each project), you need to register a virtual project and select the taxonomy of the libraries. Then register the BLAST and cluster/assembly protocols for this virtual project. Once the cluster protocol is registered, ESTAP will automatically perform cluster/assembly for the virtual project.

The cluster/assembly protocol also provides an option to perform InterProScan analysis. Users may choose this option to let ESTAP automatically annotate the protein functions of singlets and contigs after the clustering/assembly procedure is performed.

PIs and the primary contacts with update privilege may also update the protocols. The updated protocols will be effective for new sequence data, not for data that have already been analyzed. If the user wishes to reanalyze the old data, he or she must make such a request to the ESTAP curator. The curator should use the following instructions.

1) If the sequences of a project must be re-cleansed using the new protocol, the curator must delete the sequences of the project from the ESTAP database and reload the sequences by running the estapDriver.exe program. Use the following commands to delete the sequences from the database:

    delete from raw_seq_file where project_id = x;
    commit;

project_id is the internal id for the project. The curator may get the id from the PROJECT table by matching the lab code and project code of the project:

    select project_id from project where lab_code = 'xx' and user_proj_code = 'xxx';

Please see Section IV.4.e for estapDriver.exe usage.

2) If the sequences of a project must be re-blasted using the new BLAST protocol, the curator must delete the old blast results of that project from the ESTAP database and run the runBlast.exe program. Use the following commands to delete the blast results from the database:

    delete from analysis_tracking where project_id = x;
    commit;

Please see Section IV.4.f for runBlast.exe usage.

3) If the sequences of a project must be re-clustered using the new clustering/assembly protocol, the curator must delete the old clustering/assembly results and the contig blast results of that project from the ESTAP database, then run the cluster.exe program for clustering/assembly and the blastAllContigs.exe program for contig blast. Use the following commands to delete the old results from the database:

    delete from cluster_summary where project_id = x;
    commit;

Use the "cluster.exe [project_id]" command to redo the clustering/assembly analysis and "blastAllContigs.exe" to redo the contig blast analysis. Note: [project_id] is the internal id for that project. Please see Section IV.4.h and Section IV.4.i for how to run the cluster.exe and blastAllContigs.exe programs.

VIII. How to Prepare Raw EST Data

Users can send ESTAP either the chromatograms of the sequences (.ab1 files), or both sequence files and their corresponding quality score files in FASTA format. If the user sends ESTAP the chromatograms, ESTAP will use the Phred base calling program to generate the sequences and the quality scores.

Each sequence file contains sequences from the same sequencing run of the same project. Each quality score file contains the quality scores of the sequences in the corresponding sequence file. Each sequence or quality score string in the file must start with a ">" followed by a sequence name.

To process data files, ESTAP requires that data providers use a standard file name for each file and a standard sequence name for each sequence in the file. The file name contains specific information about that file, and the sequence name contains specific information about the sequence. All the sequences in the same file must be from the same project and the same sequencing run. The naming convention and rules are as follows:

• Both the file name and the sequence name are required to be of a fixed length.
• Each element in the name must conform to a required number of characters or digits.
• Pad with 0's (zeros) in front of any element that is shorter than the required length.

The file names and sequence names are case-sensitive.

1. File naming convention

All sequence files will have ".seq" as a suffix and quality score files will have ".qal" as a suffix. Every seq string must have an exactly matching qal string by name. The file names use the following naming convention:

    LLPPPTTIIIBRRNNNCCYYMMDDHHMM.seq for the sequence file, and
    LLPPPTTIIIBRRNNNCCYYMMDDHHMM.qal for the corresponding quality score file.

    LL - two-character alphanumeric lab code (provided by VBI), e.g. UN
    PPP - three-character alphanumeric project id (provided by VBI), e.g. MCA
    TT - two-character alphanumeric instrument type (provided by VBI), e.g. A1
    III - three-character alphanumeric instrument id (user-defined code that is unique to the lab), e.g. VT1
    B - one-character alphanumeric base calling program id (provided by VBI), e.g. A
    RR - two-character alphanumeric run parameter id (provided by VBI), e.g. S1
    NNN - three-digit run duration in minutes (actual running time in minutes, e.g. 090 for 90 minutes)
    CC - two-character alphanumeric sequencing chemistry id (provided by VBI), e.g. Pr
    YYMMDDHHMM - the date and time at which the sequence was run, e.g. 0104150930
        YY - two-digit year
        MM - two-digit month
        DD - two-digit day
        HH - two-digit hour
        MM - two-digit minute
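Because every element has a fixed width, a file name can be split by character position alone: the nine elements add up to 28 characters before the suffix. The sketch below shows one way to do this with cut(1). The function names estap_field and parse_file_name and the output labels are assumptions made for illustration; ESTAP's own parser is not shown in this guide.

```shell
#!/bin/sh
# estap_field STRING RANGE
# Prints the characters of STRING selected by a cut(1) range such as 3-5.
estap_field() {
    printf '%s\n' "$1" | cut -c "$2"
}

# parse_file_name NAME
# Splits an ESTAP data file name (with or without the .seq/.qal suffix)
# into its fixed-width elements, one label=value pair per line.
parse_file_name() {
    base=${1%.seq}
    base=${base%.qal}
    if [ "${#base}" -ne 28 ]; then
        echo "expected 28 characters before the suffix, got ${#base}" >&2
        return 1
    fi
    echo "lab_code=$(estap_field "$base" 1-2)"
    echo "project_code=$(estap_field "$base" 3-5)"
    echo "instrument_type=$(estap_field "$base" 6-7)"
    echo "instrument_id=$(estap_field "$base" 8-10)"
    echo "base_caller=$(estap_field "$base" 11)"
    echo "run_parameter=$(estap_field "$base" 12-13)"
    echo "run_minutes=$(estap_field "$base" 14-16)"
    echo "chemistry=$(estap_field "$base" 17-18)"
    echo "run_timestamp=$(estap_field "$base" 19-28)"
}
```

Running `parse_file_name VBOVTA2VT1CS1120Tb0102011615.seq` on the example file name used later in this section reports lab code VB, project code OVT, instrument type A2, instrument id VT1, base caller C, run parameter S1, a 120-minute run, chemistry Tb, and the time stamp 0102011615.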

2. Sequence Naming Convention

The sequence names in the sequence and quality score files must be in the following format:

    TPPPNNNNNNNNNLLL

    T - one-letter primer type: u for a user-defined primer and s for a standard primer
    PPP - three-character alphanumeric primer id (provided or confirmed by VBI), e.g. T3a
    NNNNNNNNN - nine-character alphanumeric clone name, unique within the project (user defined)
    LLL - three-digit lane or capillary number, e.g. 012 for lane 12 in the sequence gel

Note: Using the correct primer code (the first 4 letters) is very important for the cleansing procedure. The correct primer code indicates the correct orientation of the sequencing (5' or 3'). A wrong orientation will cause the removal of vector/adaptor to fail.
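The padding rule can be illustrated by assembling a sequence name from its four elements. In this sketch the function name build_seq_name is hypothetical, and left-padding a short clone name with zeros is our reading of the general rule "pad with 0's in front of any element that is shorter than the required length"; confirm with the ESTAP curator before relying on it for clone names.

```shell
#!/bin/sh
# build_seq_name PRIMER_TYPE PRIMER_ID CLONE LANE
# Assembles a 16-character sequence name in the TPPPNNNNNNNNNLLL layout.
build_seq_name() {
    ptype=$1
    primer=$2
    clone=$3
    lane=$4
    case $ptype in
        s|u) ;;
        *) echo "primer type must be s or u" >&2; return 1 ;;
    esac
    [ "${#primer}" -eq 3 ] || { echo "primer id must be 3 characters" >&2; return 1; }
    [ "${#clone}" -le 9 ] || { echo "clone name is longer than 9 characters" >&2; return 1; }
    case $lane in
        ''|*[!0-9]*) echo "lane must be a number" >&2; return 1 ;;
    esac
    # Strip leading zeros so printf does not read the lane as an octal number.
    lane=$(printf '%s' "$lane" | sed 's/^0*//')
    [ -n "$lane" ] || lane=0
    [ "$lane" -le 999 ] || { echo "lane must fit in 3 digits" >&2; return 1; }
    # Left-pad the clone name with zeros to 9 characters.
    while [ "${#clone}" -lt 9 ]; do
        clone=0$clone
    done
    printf '%s%s%s%03d\n' "$ptype" "$primer" "$clone" "$lane"
}
```

For instance, `build_seq_name s M13 rroot0002 12` yields sM13rroot0002012, the sequence name used in the example files of this section.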

3. Example of a Sequence and a Quality Score File

a. Sequence file

File name: VBOVTA2VT1CS1120Tb0102011615.seq

The following is the content of the file:

    >sM13rroot0002012
    GTCGTAGCTAGCTAGCTGTGGACTGACTGA
    CTAGCATCAGCTTGACTGACTAGCATAGCT
    AGCATACGCTATCGATCAGGGCAGACTAGC
    TGACTAGCATGACTAGCATT
    >sM13rroot0003013
    GTCGTAGCTTAGCTAGTCAGATCAGCATCG
    AATACGATCAGCTAGCTGTGGACTGACTGA
    CTAGCATCAGCTTGACTGACTAGCATAGCT
    AGCATACGCTTACGTACACAGACACACA

b. Quality score file

File name: VBOVTA2VT1CS1120Tb0102011615.qal

The following is the content of the file:

    >sM13rroot0002012
    9 13 16 17 16 12 7 7 19 19 25 27 29 28 30 31 29 30 29 29 30 30 37 37 33 37 40 44 40 35 35 40 40 40 46 40 35 35 35 35 35 35 35 40 40 37 35 40 40 40 40 40 40 40 46 46 46 46 46 47 51 51 42 42 45 45 45 45 45 45 51 46 56 56 56 56 46 51 51 51 45 40 40 40 40 40 37 45 46 42 42 44 44 44 44 56 56 37 40 40 40 39 37 37 37 37 39 40 40 43
    >sM13rroot0003013
    9 13 13 19 19 9 6 6 16 19 24 24 29 29 29 29 30 37 46 34 37 40 46 40 37 37 37 37 37 40 40 40 40 40 40 35 35 40 40 37 35 35 35 35 35 35 39 40 46 46 46 46 46 56 46 42 42 46 43 51 51 51 51 42 42 51 42 42 51 51 51 51 56 51 51 51 51 51 51 56 56 56 56 56 46 42 42 42 42 42 42 42 42 43 43 35 35 35 35 35 35 56 51 51 51 51 51 51 56 56 56 56 40 40 40 40 40 40

Explanation of the file name VBOVTA2VT1CS1120Tb0102011615.seq:

The lab code is VB. The project code is OVT (a library of Orobanche minor v. Virginia tubercles). The instrument is a 377, so the code is A2. The id for this instrument is VT1. The base-calling program is ABI100, which has a code of C. The run parameter code is S1. Run time is 2 hours. I have used dye terminator chemistry with big dyes, so the code is Tb. I ran these samples on Feb 1, 2001 at 4:15 pm, so the date/time stamp will be 0102011615. You only have to enter this once. The data itself carries a shorter name.

Explanation of the sequence name sM13rroot0002012:

The primer is standard primer M13f, so the name starts with an "s" followed by the three-letter primer code "M13". The clone name is rroot0002. The sequence was run on lane 12, which has the code "012".

4. Sending Chromatogram Files

If you wish ESTAP to run Phred for you, send the chromatograms of the sequences. Put all sequences from the same sequencing run of the same project in one folder. The name of the folder follows the file naming convention without the .seq or .qal suffix, e.g. VBOVTA2VT1CS1120Tb0102011615. Each chromatogram file follows the sequence naming convention plus the suffix ".ab1", e.g. sM13rroot0002012.ab1.

IX. How to Use dbEST Submission Tool

The dbEST Submission Tool incorporated in ESTAP is available online to allow users to conveniently submit their cleansed EST sequences in the ESTAP database to NCBI's dbEST database. Note that only PIs have the privilege to submit their EST sequences to dbEST. PIs may follow the "dbEST Submission" link on the "Project List" page to ask ESTAP to prepare the dbEST submission files for them. During dbEST submission, a contact file, a library file, a publication file, and one or more EST files will be created and sent to the primary contact person via email. The EST files will be in FASTA format, with clone name and clean sequence ids as identifiers.

After submitting the EST sequences to NCBI's dbEST, the PI should receive an email from NCBI regarding the batch of ESTs submitted. In order to store the dbEST_ID, USER_ID, and GENBANK_ACCN assigned by NCBI in the ESTAP database, the PI should send the confirmation file received from NCBI to the ESTAP curator. The ESTAP curator should use the ReadConfirmationFile program to process the confirmation file.

The ReadConfirmationFile program is located under the java_prog directory of the uncompressed software. It is a stand-alone Java program which reads and parses the dbEST confirmation file and stores the dbEST_ID, USER_ID, and GENBANK_ACCN assigned by NCBI in the ESTAP database. To make this program work, the curator needs to modify the configuration file called "javaConfig.txt" in the java_prog directory and set the correct parameters (see Section IV.1.c). ReadConfirmationFile uses the parameters defined in this file to find the confirmation file and to connect to the ESTAP database. The confirmation file should be named with a .txt suffix (e.g. JC_MCA_020314052943.txt) and placed in the directory specified by the variable "ncbiConfirmFilePath" in the javaConfig.txt file. Go to the directory where ReadConfirmationFile.class is located and run the ReadConfirmationFile program using the following command:

    java ReadConfirmationFile


For example:

    jdbcDriverClassName=oracle.jdbc.driver.OracleDriver
    jdbcURL=jdbc:oracle:thin:@server.vt.edu:1521:estapdb
    dbUserName=estap_user
    dbUserPassword=password
    ncbiConfirmFilePath=/home/estap/dbEST_Incoming
    iprscanNumProcessor=2

2 Environment Variables a Environment variables needed for Oracle PATH ORACLE_HOME

LD_LIBRARY_PATH ORA_CLIENT_LIB and ORACLE_SID The following are examples of the variable settings export PATH=$PATHhomeestapblast$ORACLE_HOMEbin export ORACLE_HOME=homeoracleproducts920 export LD_LIBRARY_PATH=$ORACLE_HOMElib export ORA_CLIENT_LIB=shared export ORACLE_SID=estap More information regarding database-related environment variables may be found in the document ldquoREADME_dbtxtrdquo that is located in the database directory

b Environment variable needed for BLAST programs ndash please go to NCBI site to view details about how to set up BLAST programs You need to create a ncbirc file to set Data=homeestapblastdata if your BLAST programs are located at homeestapblast You need to include this path to the PATH variable (see above) Also you need to set the environment variable BLASTDB to homeestapblastdb if your blast database are located at homeestapblastdb eg export BLASTDB=homeestapblastdb

c Environment variable needed for the Phred program

PHRED_PARAMETER_FILE If the Phred program (phred and phredpardat) is in the homeestapphred directory the variable PHRED_PARAMETER_FILE should be set to homeestapphredphredpardat eg export PHRED_PARAMETER_FILE=homeestapphredphredpardat If this variable is not set correctly phred will not run properly

d. Environment variable needed for the runPhred.exe program: PHRED_PATH

The runPhred.exe program wraps the phred program and combines all .ab1 files in one directory to produce one .seq file and one .qal file in FASTA format. If the phred program (phred and phredpar.dat) is in the /home/estap/phred directory, the variable PHRED_PATH should be set to /home/estap/phred/phred, e.g.:

    export PHRED_PATH=/home/estap/phred/phred

If this variable is not set correctly, the program will not run properly.

e. Environment variable needed for the d2_cluster program: OMP_NUM_THREADS

If you use one processor, you need to set this variable to 1, e.g.:

    export OMP_NUM_THREADS=1

If this variable is not set, d2_cluster will not run. In some cases the d2_cluster program may not exit normally because of a memory problem. We recommend setting the environment variable MALLOC_CHECK_ to 1, e.g.:

    export MALLOC_CHECK_=1

In addition, d2_cluster requires libpgthread.so. You may place it in the /usr/local/lib directory.

3. Run the Programs

a. FTP and format NCBI BLAST databases (e.g., nr, nt)

The fetched NCBI databases are updated every month, or as frequently as you wish. Please make sure that the ESTAP analysis programs are not running when you update the NCBI databases; otherwise, the ESTAP analysis programs will not run properly. Also make sure that database formatting is complete before running the ESTAP programs. We provide some shell scripts (getAndFormatNcbiDB.sh, getDB.sh, uncompressFormatDB.sh) and a C program (getAndFormatNcbiDB.exe) under the estap_exe directory to update the NCBI databases automatically. We use crontab to schedule the update, e.g.:

    30 3 2 * * ~/estap_exe/getAndFormatNcbiDB.sh > get_db.log 2>&1

getAndFormatNcbiDB.sh calls getAndFormatNcbiDB.exe, which calls getDB.sh and uncompressFormatDB.sh. The getAndFormatNcbiDB.exe program waits until all ESTAP analysis programs finish before updating the NCBI databases. getDB.sh fetches the NCBI databases via wget. uncompressFormatDB.sh uncompresses the fetched NCBI databases and formats them using formatdb. Please revise the scripts if your directory setup is different.

b. Where to put raw data (.ab1 files)

The .ab1 files can be stored in a user-specified directory, e.g., /home/estap/raw_data. When receiving data from customers, create a new directory under this directory using the following naming convention for better management (the user may choose a different naming convention): labcode-ddmmmyy, e.g., JC-06may02. Then transfer the data to this directory. All sequence files must be named following the ESTAP naming conventions (see Section VIII for how to prepare raw data and for the naming convention).
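The labcode-ddmmmyy convention can be generated rather than typed by hand. A small Python sketch, assuming a two-digit day, a lowercase three-letter English month abbreviation and a two-digit year, as in the JC-06may02 example; the function name is hypothetical and not part of ESTAP.

```python
from datetime import date

# Fixed English abbreviations so the result does not depend on the system locale.
_MONTHS = ["jan", "feb", "mar", "apr", "may", "jun",
           "jul", "aug", "sep", "oct", "nov", "dec"]


def raw_data_dir_name(lab_code, received):
    """Build a labcode-ddmmmyy directory name from a lab code and a receipt date."""
    month = _MONTHS[received.month - 1]
    return f"{lab_code}-{received.day:02d}{month}{received.year % 100:02d}"


if __name__ == "__main__":
    print(raw_data_dir_name("JC", date(2002, 5, 6)))
```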

c. Run Phred

Go to the directory where runPhred.exe is located (e.g., /home/estap/server_side/estap_exe) and run the runPhred.exe program. This program wraps the phred program to perform base calling and combines all .ab1 files in one sequence file directory to produce one .seq file and one .qal file in FASTA format. If multiple sequence file directories exist, multiple .seq and .qal files will be generated.

Command:

    runPhred.exe input_directory output_directory

e.g.:

    runPhred.exe /home/estap/raw_data/jc-06may02 /home/estap/raw_data/jc-06may02-phred-out

where runPhred.exe is the program name, /home/estap/raw_data/jc-06may02 is the directory that contains the .ab1 files from the JC lab, and /home/estap/raw_data/jc-06may02-phred-out is the output directory where the .seq and .qal files go. Before running the program, you must create the output directory (/home/estap/raw_data/jc-06may02-phred-out). After the program is done, check phred.log in the /home/estap/server_side/estap_exe directory for any problems (the program logs its results automatically). If there are none, you may either rename the log file or remove it. In the output directory, remove any files that do not have a .seq or .qal extension.
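The final clean-up step (removing everything in the output directory that is not a .seq or .qal file) is easy to get wrong by hand. A minimal Python sketch follows; the function names are hypothetical, and it defaults to a dry run so that the list of files can be reviewed before anything is deleted.

```python
from pathlib import Path

# Extensions that the later ESTAP steps expect to find in the output directory.
KEEP = {".seq", ".qal"}


def stray_files(output_dir):
    """Return, sorted, the regular files whose extension is neither .seq nor .qal."""
    return sorted(p for p in Path(output_dir).iterdir()
                  if p.is_file() and p.suffix not in KEEP)


def remove_stray_files(output_dir, dry_run=True):
    """List the stray files and delete them only when dry_run is False."""
    strays = stray_files(output_dir)
    if not dry_run:
        for path in strays:
            path.unlink()
    return strays
```

For example, remove_stray_files("/home/estap/raw_data/jc-06may02-phred-out") only reports what would be removed; passing dry_run=False performs the deletion.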

d. Copy the .seq and .qal files generated by the runPhred.exe program to the correct lab subdirectory under the input directory; e.g., if the files belong to the JC lab, copy them to /home/estap/raw_seq/input/JC.

e. Run the cleansing program to verify the integrity of the .seq and .qal files, cleanse the sequences, and store the results in the ESTAP database.

Command:

    estapDriver.exe root_path_name >clean.log 2>&1

e.g.:

    estapDriver.exe /home/estap/raw_seq >clean.log 2>&1

where estapDriver.exe is the program name, root_path_name is the parent directory path of the input directory (see Section IV.1 for the directory hierarchy description), and ">clean.log 2>&1" logs the program output (both stdout and stderr) to a log file named clean.log. The clean.log file can be viewed for debugging purposes. After the cleansing program has run successfully, clean.log can be removed to save disk space. You may use nohup to run the command, which makes the program immune to hang-ups, e.g.:

    nohup estapDriver.exe root_path_name >clean.log 2>&1 &

It is better to use nohup and run the program in the background (with & at the end of the command). To check whether the program is running, use the command:

    ps -ef | grep estapDriver.exe

If for some reason you need to re-run the cleansing for some sequence files, please follow these steps:

(1) Delete the sequence file(s) from the database directly.
(2) Reload the corresponding .seq and .qal files to the input/[lab_code] directory.
(3) Run estapDriver.exe as described in e.

f. Run the ESTAP blast program to run BLAST according to the corresponding blast protocols.

Command:

    runBlast.exe >blast.log 2>&1

where runBlast.exe is the program name and ">blast.log 2>&1" logs the program output (both stdout and stderr) to a log file named blast.log. The blast.log file can be viewed for debugging purposes. After the runBlast program has run successfully, blast.log can be removed to save disk space. You may use the nohup command, e.g.:

    nohup runBlast.exe >blast.log 2>&1 &

It is better to use nohup and run the program in the background (with & at the end of the command). To check whether the program is running, use the command:

    ps -ef | grep runBlast.exe

Currently the runBlast.exe program is set to blast 1000 sequences at a time (in the future the setting can be changed to a larger number). If there are more than 1000 sequences to blast, you need to run this program multiple times. Please check blast.log to see how many sequences must be blasted.

Note: Please wait at least 5-10 minutes before starting another runBlast.exe program, to make sure that the sequences to be blasted in the previous run are already recorded in the ESTAP database. Check the blast.log file to see whether the blast command already appears. When starting a second runBlast.exe program, please use a different log file name instead of blast.log, e.g., blast2.log. The same precautions apply when you need to run the program three or more times.
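Because each run handles at most 1000 sequences, the number of runs is the pending count divided by 1000, rounded up, and each run needs its own log file. The Python sketch below works this out; the function name is hypothetical, and the log names beyond blast2.log (blast3.log and so on) are an assumption that simply continues the pattern given above.

```python
def plan_runs(pending, batch_size=1000, stem="blast"):
    """Return one log file name per run needed to process `pending` sequences.

    The first run logs to <stem>.log, the second to <stem>2.log, and so on.
    """
    if pending < 0 or batch_size <= 0:
        raise ValueError("pending must be >= 0 and batch_size must be > 0")
    runs = -(-pending // batch_size)  # ceiling division without floats
    return [f"{stem}.log" if i == 1 else f"{stem}{i}.log"
            for i in range(1, runs + 1)]


if __name__ == "__main__":
    for log_name in plan_runs(2500):
        print(log_name)
```

The same calculation applies to the reblast and contig blast programs by changing stem, for example plan_runs(1200, stem="reblast").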

g. Run the reblast program to reblast according to the corresponding blast protocols.

Command:

    reblast.exe >reblast.log 2>&1

where reblast.exe is the program name and ">reblast.log 2>&1" logs the program output (both stdout and stderr) to a log file named reblast.log. The reblast.log file can be viewed for debugging purposes. After the reblast program has run successfully, reblast.log can be removed to save disk space. You may use nohup to run the command, which makes the program immune to hang-ups, e.g.:

    nohup reblast.exe >reblast.log 2>&1 &

It is better to use nohup and run the program in the background (with & at the end of the command). To check whether the program is running, use the command:

    ps -ef | grep reblast.exe

Currently the reblast.exe program is set to reblast 1000 sequences at a time (in the future the setting can be changed to a larger number). If there are more than 1000 sequences to reblast, you need to run this program multiple times. Please check reblast.log to see how many sequences need to be blasted.

Note: Please wait at least 5-10 minutes before starting another reblast.exe program, to make sure that the sequences to be blasted in the previous run are already recorded in the ESTAP database. Check the reblast.log file to see whether the blast command already appears. When starting a second reblast.exe program, please use a different log file name instead of reblast.log, e.g., reblast2.log. The same precautions apply when you need to run the program three or more times.

h. Run the cluster/assembly programs

You have two options: 1) do cluster/assembly for a specific project; 2) do cluster/assembly for all projects.

1) Do cluster/assembly for a specific project

Command:

    cluster.exe project_id >cluster_project_id.log 2>&1

e.g.:

    cluster.exe 377 >cluster_377.log 2>&1

where cluster.exe is the program name, project_id is the project id of the project you wish to run (note: this is the internal id, not the project code), and ">cluster_project_id.log 2>&1" logs the program output (both stdout and stderr) to a log file named cluster_project_id.log. This log file can be viewed for debugging purposes. After the program has run successfully, the log file can be removed to save disk space. You may use nohup to run the command, which makes the program immune to hang-ups, e.g.:

    nohup cluster.exe project_id >cluster_project_id.log 2>&1 &

To check whether the program is running, use the command:

    ps -ef | grep cluster.exe

2) Do cluster/assembly for all projects that are qualified for the process (this is preferred)

Command:

    clusterAll.exe >cluster.log 2>&1

where clusterAll.exe is the program name and ">cluster.log 2>&1" logs the program output (both stdout and stderr) to a log file named cluster.log. This log file can be viewed for debugging purposes. After the program has run successfully, the log file can be removed to save disk space. You may use nohup to run the command, which makes the program immune to hang-ups, e.g.:

    nohup clusterAll.exe >cluster.log 2>&1 &

To check whether the program is running, use the command:

    ps -ef | grep clusterAll.exe

i. Run the contig blast program to blast the assembled contigs according to the corresponding blast protocols.

Command:

    blastAllContigs.exe >contig_blast.log 2>&1

where blastAllContigs.exe is the program name and ">contig_blast.log 2>&1" logs the program output (both stdout and stderr) to a log file named contig_blast.log. The contig_blast.log file can be viewed for debugging purposes. After the blastAllContigs program has run successfully, the log file can be removed to save disk space. You may use nohup to run the command, which makes the program immune to hang-ups, e.g.:

    nohup blastAllContigs.exe >contig_blast.log 2>&1 &

It is better to use nohup and run the program in the background (with & at the end of the command). To check whether the program is running, use the command:

    ps -ef | grep blastAllContigs.exe

Currently the blastAllContigs.exe program is set to blast 1000 contigs at a time (in the future the setting can be changed to a larger number). If there are more than 1000 contigs to blast, you need to run this program multiple times. Please check contig_blast.log to see how many sequences need to be blasted.

Note: Please wait at least 5-10 minutes before starting another blastAllContigs.exe program, to make sure that the sequences to be blasted in the previous run are already recorded in the ESTAP database. Check the contig_blast.log file to see whether the blast command already appears. When starting a second blastAllContigs.exe program, please use a different log file name instead of contig_blast.log, e.g., contig_blast2.log. The same precautions apply when you need to run the program three or more times.

j. Run the contig reblast program to reblast the assembled contigs according to the corresponding blast protocols.

Command:

    reblastAllContigs.exe >contig_reblast.log 2>&1

where reblastAllContigs.exe is the program name and ">contig_reblast.log 2>&1" logs the program output (both stdout and stderr) to a log file named contig_reblast.log. This log file can be viewed for debugging purposes. After the reblastAllContigs program has run successfully, the log file can be removed to save disk space. You may use nohup to run the command, which makes the program immune to hang-ups, e.g.:

    nohup reblastAllContigs.exe >contig_reblast.log 2>&1 &

It is better to use nohup and run the program in the background (with & at the end of the command). To check whether the program is running, use the command:

    ps -ef | grep reblastAllContigs.exe

Currently the reblastAllContigs.exe program is set to reblast 1000 contigs at a time (in the future the setting can be changed to a larger number). If there are more than 1000 contigs to reblast, you need to run this program multiple times. Please check contig_reblast.log to see how many sequences need to be blasted.

Note: Please wait at least 5-10 minutes before starting another reblastAllContigs.exe program, to make sure that the sequences to be blasted in the previous run are already recorded in the ESTAP database. Check the contig_reblast.log file to see whether the blast command already appears. When starting a second reblastAllContigs.exe program, please use a different log file name instead of contig_reblast.log, e.g., contig_reblast2.log. The same precautions apply when you need to run the program three or more times.

k. Set up and run the InterProScan application

InterProScan is a tool developed at EBI that combines different protein signature recognition methods into one resource. ESTAP wraps this tool to query contigs and singlets against the InterPro public databases in order to annotate protein functions automatically. The GO terms of the matched proteins are linked to the contigs and singlets. Please follow the instructions below to set up and run the InterProScan application.

1) Download iprscan

Go to the ftp://ftp.ebi.ac.uk/pub/software/unix/iprscan/ site to download the following files:

(1) iprscan_vXXX.tar.gz, where XXX is the version number of the iprscan release, e.g., 3.2
(2) iprscan_bin_XXX.tar.gz, where XXX is your platform, e.g., Linux
(3) iprscan_DATA_XXX.tar.gz, where XXX is the version number of the InterPro database release, e.g., 6.1

2) Uncompress the files, in the order specified above, into your home directory ($HOME_DIR). The iprscan home directory will be $HOME_DIR/iprscan, e.g., /home/estap/iprscan.

3) First-time configuration

Go to the iprscan home directory and run perl CONFIG.pl. Then:

(1) Modify CONFIG.pl as described below (save a copy of the original CONFIG.pl file as CONFIG_orig.pl).

(i) At the beginning of CONFIG.pl (in main), along with the other variable declarations, add the following lines:

    my $IprPWD = "iprpwd.txt";
    my $pwd = $ENV{PWD};
    unless ($pwd) { $pwd = `pwd`; chomp $pwd; }

(ii) Before the line that sets up the applications:

    $applset = get_user_prompt("Setup applications $first (y|n)");

add these lines:

    open (FF, ">$IprPWD") || die "Cannot create $IprPWD: $!";
    print FF $pwd;
    close FF;

(2) Modify the InterProScan.pl file as described below (save a copy of the original file as InterProScan_orig.pl).

(i) Replace

    my $seqfile = $ARGV[0];

with

    my $seqfile = $ARGV[1];

(ii) After the line

    my $UserId = Manager->getUserId();

add these lines:

    my $OutDir = $ARGV[0];
    print "$OutDir\n";
    # Do not use $path in this file since it won't be recognized by the
    # iprscan wrapper program. $OutDir, instead of $path, is used as an
    # argument for the iprscan wrapper program to store iprscan result files.

(iii) Replace $path with $OutDir in all the lines following the above comment lines.

4) Reconfiguration

Go to the iprscan home directory and run perl CONFIG.pl. You will be asked to choose the member databases you would like to search against. If you only use the public data and applications that come with the iprscan distribution, you should answer "n" for SignalPHMM and TMHMM during configuration, since they are not public. When the configuration is complete, a file called iprpwd.txt will be created under the iprscan home directory. The iprpwd.txt file stores the iprscan path information, which is used by the iprscan wrapper programs. You must copy the iprpwd.txt file to the java_prog directory, where the iprscan wrapper programs are placed, so that the Java programs can read the iprscan path information from this file.

5) Run iprscan

Go to the java_prog directory, where the Java application and the configuration files (javaConfig.txt and iprpwd.txt) are placed, and set the proper parameters in the configuration file as described in Section IV.1.c. You may analyze a specific project or all projects at one time.

If you wish to run a specific project, use the command:

    nohup java InterProDriver lab_code project_code > ipr_project_code.log 2>&1 &

e.g.:

    nohup java InterProDriver JC MCA > ipr_MCA.log 2>&1 &

where InterProDriver is the program name, lab_code is the two-letter lab code of the lab that the project belongs to, project_code is the three-letter project code of the project being analyzed, and "> ipr_project_code.log 2>&1" logs the program output (both stdout and stderr) to a log file named ipr_project_code.log.

If you wish to run all projects that are qualified for this procedure, use the command:

    nohup java InterProScanAll > ipr.log 2>&1 &

where InterProScanAll is the program name and "> ipr.log 2>&1" logs the program output (both stdout and stderr) to a log file named ipr.log.

The log files can be viewed for debugging purposes. After the program has run successfully, the log files can be removed to save disk space.

l. Run the genomic DNA assembly programs

You have two options: 1) do DNA assembly for a specific genome project; 2) do DNA assembly for all qualified genome projects.

1) Do assembly for a specific genome project

Command:

    nohup gAssemble.exe project_id >assemble_project_id.log 2>&1

e.g.:

    nohup gAssemble.exe 377 > assemble_377.log 2>&1

where gAssemble.exe is the program name, project_id is the project id of the project you wish to run (note: this is the internal id, not the project code), and "> assemble_project_id.log 2>&1" logs the program output (both stdout and stderr) to a log file named assemble_project_id.log.

2) Do assembly for all genome projects that are qualified for the process (this is preferred)

Command:

    nohup gAssembleAll.exe >assemble.log 2>&1

where gAssembleAll.exe is the program name and "> assemble.log 2>&1" logs the program output (both stdout and stderr) to a log file named assemble.log.

m. Prepare EST database files from user projects

The ESTAP Web interface allows users to blast specific sequence(s) against the ESTs from their projects. We provide a program called writeAndFormatFasta.exe to create and format EST database files from user projects. This program can be scheduled to run daily, or as often as you wish, so that users always blast against the most recent EST databases. To run this program, you need to create a directory where the EST database files will be placed, e.g., /home/estap/estdb. This directory path should be used for the ESTDBFilePath parameter setting in the web.xml file (see Section V.3.e) so that the ESTAP local blast Web service can recognize

4. Schedule Programs

The above programs can be run manually, as described above, or scheduled to run automatically using the UNIX cron facility. Please see the UNIX man page for the crontab command. The following is an example of the scheduling:

    0 9,11,17 * * * ~/estap_exe/estapDriver.sh >> estap.log 2>&1
    0 19,20,21 * * * ~/estap_exe/runBlast.sh >> blast.log 2>&1
    30 22,23,24 * * * ~/estap_exe/reblast.sh >> reblast.log 2>&1
    30 3 2 * * ~/estap_exe/getAndFormatNcbiDB.sh > get_db.log 2>&1
    0 5 * * * ~/estap_exe/clusterAll.sh >> ~/estap_exe/cluster.log 2>&1
    0,30 19 * * * ~/estap_exe/blastAllContigs.sh >> ~/estap_exe/blast_contigs.log 2>&1
    0,30 12 * * * ~/estap_exe/reblastAllContigs.sh >> ~/estap_exe/reblast_contigs.log 2>&1
    0,30 7 * * * ~/estap_exe/gAssembleAll.sh >> ~/estap_exe/assemble.log 2>&1
    5 2 * * * ~/estap_exe/writeAndFormatFasta.sh >> ~/estap_exe/estap_formatdb.log 2>&1

The shell scripts used here call the corresponding programs and make sure that the programs are run under the correct directories and with the correct environment variable settings. These scripts should be revised if your directory setup and environment variable settings are different.
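A crontab entry has five time fields (minute, hour, day of month, month, day of week) followed by the command. The Python sketch below assembles entries of the shape used in this section and rejects out-of-range values before they reach crontab; the function name and its parameters are hypothetical, and it only covers fixed minutes, hours and an optional day of month.

```python
def cron_entry(minutes, hours, command, log, append=True, day_of_month="*"):
    """Build one crontab line: minute hour day-of-month month day-of-week command."""

    def field(value, low, high):
        # Accept a single value or a list of values; "*" means "every".
        if value == "*":
            return "*"
        values = value if isinstance(value, (list, tuple)) else [value]
        for v in values:
            if not low <= int(v) <= high:
                raise ValueError(f"{v} is outside {low}-{high}")
        return ",".join(str(int(v)) for v in values)

    redirect = ">>" if append else ">"  # append to the log or overwrite it
    return (f"{field(minutes, 0, 59)} {field(hours, 0, 23)} "
            f"{field(day_of_month, 1, 31)} * * {command} {redirect} {log} 2>&1")


if __name__ == "__main__":
    print(cron_entry(0, [9, 11, 17], "~/estap_exe/estapDriver.sh", "estap.log"))
    print(cron_entry(30, 3, "~/estap_exe/getAndFormatNcbiDB.sh", "get_db.log",
                     append=False, day_of_month=2))
```

Standard cron implementations accept hours 0 to 23 only, so a value such as 24 raises ValueError here; midnight is written as hour 0.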

5. Trouble-shooting

When a program is stopped abnormally (e.g., because of a power or network failure, or because its processes were killed), some data may be only partially analyzed and partial results may be stored in the database. You may need to clean up (remove the partial results from the database) and rerun the program. Please use the following instructions.

a. Go to the ESTAP database table PROG_RUN_STATUS in the ESTAP_SYS_DEF schema. This table records which programs are running or completed, when they were started, and when they were completed. If a program is killed in the middle of a run for some reason, the end_time will not be recorded and the status will still be "running" (status = 1). You need to change the status from 1 to 0 to indicate that the program is no longer running.

b. If the program was runBlast.exe or reblast.exe, after you change the status in the PROG_RUN_STATUS table, please go to the ANALYSIS_TRACKING table in the ESTAP_ANALYSIS schema and delete the rows created by the program that have status = 1, e.g.:

    DELETE FROM ANALYSIS_TRACKING WHERE status = 1;

Commit the changes after you are sure that you want to delete them. Then rerun the runBlast.exe program.

c. If the program was cluster.exe, after you change the status in the PROG_RUN_STATUS table, please go to the CLUSTER_SUMMARY table in the ESTAP_ANALYSIS schema and delete the row with the project id that you specified when you ran cluster.exe. Commit the changes after you are sure that you want to delete them. Then rerun the cluster.exe program.

d. If the program was clusterAll.exe, after you change the status in the PROG_RUN_STATUS table (look for "cluster.exe" and "clusterAll.exe" in the prog_name column), please go to the CLUSTER_SUMMARY table in the ESTAP_ANALYSIS schema and delete the rows created by the program that have a status of 1. Commit the changes after you are sure that you want to delete them. Then rerun the clusterAll.exe program.

e. If the program was blastAllContigs.exe or reblastAllContigs.exe, after you change the status in the PROG_RUN_STATUS table, please go to the CONTIG_ANALYSIS table in the ESTAP_ANALYSIS schema and delete the rows created by the program that have status = 1, e.g.:

    DELETE FROM CONTIG_ANALYSIS WHERE status = 1;

Commit the changes after you are sure that you want to delete them. Then rerun the blastAllContigs.exe program.
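The recovery sequence for the BLAST programs (reset the stale "running" flag, delete the partial rows, then commit only after review) can be rehearsed against a scratch database. The Python sketch below uses an in-memory SQLite database purely as a stand-in for Oracle: the table names and the prog_name, end_time and status columns come from the steps above, while the other columns, the sample rows and the function names are invented for the demonstration, and SQLite has no ESTAP_SYS_DEF or ESTAP_ANALYSIS schemas.

```python
import sqlite3

# Tables named in the steps above whose in-progress rows (status = 1) may be deleted.
CLEANUP_TABLES = {"ANALYSIS_TRACKING", "CONTIG_ANALYSIS"}


def reset_after_crash(conn, prog_name, cleanup_table):
    """Mark a killed program as not running and delete its partial rows.

    Nothing is committed here; the caller reviews the counts and then calls
    conn.commit() or conn.rollback().
    """
    if cleanup_table not in CLEANUP_TABLES:
        raise ValueError("unexpected table: " + cleanup_table)
    cur = conn.cursor()
    cur.execute(
        "UPDATE PROG_RUN_STATUS SET status = 0 "
        "WHERE prog_name = ? AND status = 1 AND end_time IS NULL",
        (prog_name,),
    )
    reset = cur.rowcount
    cur.execute(f"DELETE FROM {cleanup_table} WHERE status = 1")
    deleted = cur.rowcount
    return reset, deleted


def make_demo_db():
    """Create a scratch database with invented rows (demo data only)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE PROG_RUN_STATUS (prog_name TEXT, start_time TEXT,
                                      end_time TEXT, status INTEGER);
        CREATE TABLE ANALYSIS_TRACKING (seq_id INTEGER, status INTEGER);
        INSERT INTO PROG_RUN_STATUS VALUES ('runBlast.exe', '19:00', NULL, 1);
        INSERT INTO PROG_RUN_STATUS VALUES ('cluster.exe', '05:00', '06:00', 0);
        INSERT INTO ANALYSIS_TRACKING VALUES (1, 0), (2, 1), (3, 1);
        """
    )
    return conn


if __name__ == "__main__":
    db = make_demo_db()
    print(reset_after_crash(db, "runBlast.exe", "ANALYSIS_TRACKING"))
    db.rollback()  # or db.commit() once the counts look right
```

Keeping the commit outside the function mirrors the advice to commit only after you are sure that the rows should be deleted.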

V. How to Install and Run Web Programs

1. Install Oracle Client, including JDBC

Please contact Oracle to obtain licensing and product support. ESTAP uses the Oracle 9i client.

2. Install JDK, Servlet, and Web Service Software

a. To install JDK 1.3 (Windows, Linux, Solaris) or higher, please go to http://java.sun.com/j2se/.

b. To install Tomcat, please go to http://jakarta.apache.org/tomcat/.

c. To install Apache Axis, please go to http://ws.apache.org/axis/. The Apache Axis beta 3 version is used for the ESTAP local blast Web service.

3. Configure Tomcat

a. Visit http://jakarta.apache.org/tomcat/ for instructions on Tomcat configuration.

b. Set up the correct PATH and CLASSPATH environment variables.

c. Modify server.xml in the Tomcat conf directory. In this file, add the following for the estap context:

    <Context path="/estap" docBase="estap" showdebuginfo="true"
             debug="DEBUG" reloadable="true" crossContext="false">
      <Logger className="org.apache.catalina.logger.FileLogger"
              prefix="localhost_estap_log." suffix=".txt" timestamp="true"/>
    </Context>

d. Place the estap Web application under the Tomcat webapps directory: copy the estap directory from the web_side directory of the uncompressed software to the webapps directory.

e. Modify the web.xml file in the estap/WEB-INF directory. The following section of web.xml is specific to ESTAP. Please change the parameter values to your own settings.

    <servlet>
      <servlet-name>ProjectList</servlet-name>
      <servlet-class>ProjectList</servlet-class>
      <init-param>
        <param-name>jdbcDriverClassName</param-name>
        <param-value>oracle.jdbc.driver.OracleDriver</param-value>
      </init-param>
      <init-param>
        <param-name>jdbcURL</param-name>
        <param-value>jdbc:oracle:thin:@server.vt.edu:1521:estapdb</param-value>
      </init-param>
      <init-param>
        <param-name>dbUserName</param-name>
        <param-value>estap_user</param-value>
      </init-param>
      <init-param>
        <param-name>dbUserPassword</param-name>
        <param-value>password</param-value>
      </init-param>
      <init-param>
        <param-name>dbInitConnect</param-name>
        <param-value>1</param-value>
      </init-param>
      <init-param>
        <param-name>dbMaxConnect</param-name>
        <param-value>50</param-value>
      </init-param>
      <init-param>
        <param-name>dbESTFilePath</param-name>
        <param-value>/home/estap/dbEST_Outgoing</param-value>
      </init-param>
      <init-param>
        <param-name>ESTDBFilePath</param-name>
        <param-value>/home/estap/estdb</param-value>
      </init-param>
      <init-param>
        <param-name>blastallPath</param-name>
        <param-value>/home/estap/blast/blastall</param-value>
      </init-param>
      <init-param>
        <param-name>formatdbPath</param-name>
        <param-value>/home/estap/blast/formatdb</param-value>
      </init-param>
      <init-param>
        <param-name>estapEmail</param-name>
        <param-value>estap@vbi.vt.edu</param-value>
      </init-param>
      <load-on-startup>1</load-on-startup>
    </servlet>

The following list defines the parameters:

    jdbcURL: the URL for the JDBC connection
    dbUserName: the ESTAP database user name
    dbUserPassword: the ESTAP database user password
    dbInitConnect: the number of database connections to be made initially
    dbMaxConnect: the maximum number of database connections
    dbESTFilePath: the directory path to store dbEST submission files
    ESTDBFilePath: the directory path to store the EST files in FASTA format and their formatted files, used by the ESTAP local blast Web service
    blastallPath: the path for the blastall program
    formatdbPath: the path for the formatdb program
    estapEmail: the ESTAP curator's email address
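A quick way to confirm that an edited web.xml still defines every parameter in the list is to read the init-param entries back. The Python sketch below does this with the standard library; the function names are hypothetical, and it assumes a web.xml without an XML namespace (elements named plainly servlet, init-param and so on).

```python
import xml.etree.ElementTree as ET

# Parameter names taken from the list above, plus the JDBC driver class name.
REQUIRED = [
    "jdbcDriverClassName", "jdbcURL", "dbUserName", "dbUserPassword",
    "dbInitConnect", "dbMaxConnect", "dbESTFilePath", "ESTDBFilePath",
    "blastallPath", "formatdbPath", "estapEmail",
]


def servlet_init_params(web_xml_text, servlet_name):
    """Return {param-name: param-value} for the named servlet."""
    root = ET.fromstring(web_xml_text)
    for servlet in root.iter("servlet"):
        if (servlet.findtext("servlet-name") or "").strip() == servlet_name:
            return {
                (p.findtext("param-name") or "").strip():
                (p.findtext("param-value") or "").strip()
                for p in servlet.findall("init-param")
            }
    raise KeyError(servlet_name)


def missing_params(params):
    """List the required parameters that are absent or empty."""
    return [name for name in REQUIRED if not params.get(name)]


# Deliberately incomplete sample, used to show the check reporting gaps.
SAMPLE = """
<web-app>
  <servlet>
    <servlet-name>ProjectList</servlet-name>
    <servlet-class>ProjectList</servlet-class>
    <init-param>
      <param-name>dbInitConnect</param-name>
      <param-value>1</param-value>
    </init-param>
    <init-param>
      <param-name>dbMaxConnect</param-name>
      <param-value>50</param-value>
    </init-param>
  </servlet>
</web-app>
"""

if __name__ == "__main__":
    found = servlet_init_params(SAMPLE, "ProjectList")
    print(found)
    print(missing_params(found))
```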

f. Set the heap size for the JVM

Some of the ESTAP Web programs require more memory than the JVM default heap size of 64 MB. To avoid an "out of memory" error, the user should set a larger heap size for the JVM. To do this, set the JAVA_OPTS variable in the catalina.sh/catalina.bat or setclasspath.sh/setclasspath.bat file in the tomcat_home/bin directory. For example, if you wish to set the heap size to 256 MB, you may add the line "set JAVA_OPTS=-mx256m" to the catalina.bat file.

g. Start the Tomcat server

To start the Tomcat server, execute tomcat_home/bin/startup.bat on Windows (or tomcat_home/bin/startup.sh on Unix/Linux). Then enter the URL http://localhost/ in your browser and make sure that you get the Tomcat welcome page, not an error message saying that the page cannot be displayed or that the server cannot be found. If you choose to use port number 8080 instead of 80, you will need to use a URL that includes the port number, such as http://localhost:8080/. If this is successful, enter http://localhost/estap/servlet/Login or http://localhost:8080/estap/servlet/Login to see whether the ESTAP login page is displayed.

4. Configure Axis

a. Go to http://ws.apache.org/axis/ for downloading and installation instructions.

b. Place the downloaded axis directory under the tomcat_home/webapps directory.

c. Move the server-config.wsdd file from the tomcat_home/webapps/estap/WEB-INF directory to the tomcat_home/webapps/axis/WEB-INF directory.

d. Add all jar files under the tomcat_home/webapps/axis/WEB-INF/lib and tomcat_home/webapps/axis/WEB-INF/classes directories to your CLASSPATH environment variable.

e. Move the blastService and createESTDBService directories from the tomcat_home/webapps/estap/WEB-INF/classes directory to the tomcat_home/webapps/axis/WEB-INF/classes directory.

f. Replace the server name and Tomcat port with your own server name and port number in the JBlastServiceLocator.java and JCreateESTDBServiceLocator.java files, which are under the blastService/ws and createESTDBService/ws directories respectively. E.g., for blastService, run the commands:

    cd blastService
    cd ws

Edit the JBlastServiceLocator.java file by changing http://arda.vbi.vt.edu:8080 to http://yourServerName:yourPort. Then compile the modified file by running the command:

    javac JBlastServiceLocator.java

Repeat the above process for the createESTDBService directory.

g. Copy the blastService and createESTDBService directories to the tomcat_home/webapps/estap/WEB-INF/classes directory.

h. Compile EstapWebBlast.java by running the command:

    javac EstapWebBlast.java

i. Restart the Tomcat server and go to http://localhost/axis or http://localhost:8080/axis to view the list of deployed Web services. If Axis is configured correctly, you will see the createESTDB and blast services.

VI. How to Register and Update ESTAP Users

ESTAP defines 3 types of users with different role permissions: sponsors, PIs, and other users.

Sponsors are those who fund the development of ESTAP. If there is no "sponsor" according to this definition, then the person who is in charge of the ESTAP system becomes the sponsor. There is one sponsor per institution. Sponsors can register/update PIs and have all PI privileges.

PIs are the people responsible for the projects. PIs can register/update projects, project users, and project protocols. PIs can also register ESTAP codes (e.g., primer, instrument, base caller, etc.), submit ESTs of their projects to dbEST, and view their project data and analysis results.

Other users include primary contacts with update permission, primary contacts with read-only permission, and project viewers with read-only permission. Primary contacts with update privileges (associated with particular projects) can update projects and project protocols and view project data and analysis results. Primary contacts with read-only permission and project viewers (associated with particular projects) can view project data and analysis results.

At least one sponsor must exist before PIs can register their projects and users. Sponsors are registered by the ESTAP curator/database administrator, who has direct access to the ESTAP tables in the database. Please use the following instructions to register sponsors.

1) In the INSTITUTION table, modify the first record, which has institution_id = 1. Change the institution_name to the institution name of the sponsor. If there is more than one sponsor to be added, add a new record for each remaining sponsor. For each record, increment institution_id by 1 and enter the institution name and description in the institution_name and institution_desc columns.

2) In the LAB table, modify the record with the lab_code of XX. Change all columns of this record to the desired values. The lab_email entry must be the email address of the sponsor. The lab_alt_email entry is the email address of the primary contact in the sponsor's lab. Each sponsor has one lab. You may add more records if there is more than one sponsor.

3) In the PERSON table, modify the person record with the person_id of 1. Change the user_name to the email address of the sponsor and the password to the sponsor's password for logging on to the Web site. The user_type is 1 and the sponsor_id is 1 (referring to the person_id of the sponsor him/herself in the PERSON table). The other columns are self-explanatory from their names. Add more records if there is more than one sponsor, and increment person_id for each additional sponsor.

4) In the CONTACT table, change the record with the contact_id of 1 and person_id of 1 to the contact information of the sponsor. Add more records if there is more than one sponsor, and increment contact_id for each additional sponsor.

Once sponsors are registered in the ESTAP system, they can log on to the ESTAP Web site to register or update PIs in their institutions. Once PIs are registered, they can log on to the ESTAP Web site to register or update their projects. In addition, PIs can register or update the primary contacts and the project viewers for their projects.

VII. How to Register and Update EST Projects and Protocols

1. Project Registration and Update

ESTAP defines two types of projects: regular and virtual projects. A regular project refers to a specific EST library that a lab constructed. The regular project information includes biological information regarding the tissues and cloning methods used in cDNA library construction, and EST sequencing information. In order to analyze combined libraries from a particular taxonomy, we introduced the virtual project concept: PIs may register a virtual project that combines their regular projects from a particular taxonomy.

To register a regular project or a virtual project, please go to the "Project List" page, click the "Register Regular Project" link or the "Register Virtual Project" link, and follow the instructions in the registration form carefully. PIs and the primary contacts with update privilege may also update the project information via the Web.

The current version of ESTAP does not allow a project to be updated once the sequences of that project have been processed and stored in the ESTAP database. If the user needs to make corrections to the project information after the sequences are processed, he or she must make a request to the ESTAP curator. The ESTAP curator should delete the already loaded sequences for that project, make the corrections to the project information, and then reload the sequences by running the estapDriver.exe program.

2. Protocol Registration and Update

ESTAP offers the following protocols for data analysis.

1) Cleansing protocols

The cleansing protocols contain procedures that cleanse raw sequences by removing low-quality end sequences, vectors, and polyA (or polyT) tails, and by screening for chimeras and contamination. There are two types of cleansing protocols: the cleansing protocol for BLAST and the cleansing protocol for assembly. Sequences cleansed by the cleansing protocol for BLAST are used for BLAST searching, and sequences cleansed by the cleansing protocol for assembly are used for the cluster/assembly analysis. The user may use the same procedures and parameters for both protocols. Several key points must be observed for the cleansing to succeed:

a. The cleansing program uses the 5' and 3' vector fragments provided in the project registration to match the vector portions in the clones, so both the orientation and the sequences of the 5' and 3' vector fragments must be correct.

b. The vector and insert adaptor sequences provided in the project registration should not be the whole adaptor sequences. Only the regions that are present in the clones should be included.

c. The sequencing primer code in the sequence names of the raw sequence data must be correct. The primer code indicates the orientation of the sequencing (5' or 3'); a wrong orientation will cause vector/adaptor removal to fail. Please check the primer table carefully to make sure the orientation of the primer is correct.

d. If the cleansing procedure does not give the expected result, check whether the key points above have been followed.

2) BLAST protocol

The BLAST protocol provides procedures and parameters for BLAST (BLASTN, BLASTX, and/or TBLASTX) searches. Users may choose to run all, some, or none of the BLAST procedures in the protocol.

3) Cluster/assembly protocol

The cluster/assembly protocol provides procedures and parameters for clustering and assembly analysis. Projects are not clustered automatically: ESTAP runs the cluster/assembly programs only for projects that have a registered cluster/assembly protocol, so users register this protocol only for the projects they wish to cluster and assemble. Go to the View Project page of the project and follow the link called Cluster Protocol to register the protocol and set the procedure parameters. The parameters that ESTAP provides come from the d2_cluster and CAP3 programs. The Register Cluster/Assembly Protocol page has a link, "Cluster/Assembly Protocol Guide", which provides paper references for the d2_cluster and CAP3 programs. Once the protocol is registered, ESTAP automatically performs cluster/assembly for the project according to the registered protocol.

To perform cluster/assembly for combined libraries (rather than for each project), register a virtual project and select the taxonomy of the libraries, then register the BLAST and cluster/assembly protocols for this virtual project. Once the cluster protocol is registered, ESTAP automatically performs cluster/assembly for the virtual project.

The cluster/assembly protocol also provides an option to perform InterProScan analysis. Users may choose this option to let ESTAP automatically annotate the protein functions of singlets and contigs after the clustering/assembly procedure has been performed.

PIs and primary contacts with update privilege may also update the protocols. Updated protocols take effect for new sequence data only, not for data that have already been analyzed. If the user wishes to reanalyze old data, he or she must make a request to the ESTAP curator. The curator should use the following instructions.

1) If the sequences of a project must be re-cleansed using the new protocol, the curator must delete the sequences of the project from the ESTAP database and reload them by running the estapDriver.exe program. Use the following command to delete the sequences from the database:

    delete from raw_seq_file where project_id = x;
    commit;

Here x is the internal id of the project. The curator can get the id from the PROJECT table by matching the lab code and project code of the project:

    select project_id from project where lab_code = 'xx' and user_proj_code = 'xxx';

Please see Section IV.3.e for estapDriver.exe usage.

2) If the sequences of a project must be re-blasted using the new BLAST protocol, the curator must delete the old BLAST results of that project from the ESTAP database and run the runBlast.exe program. Use the following command to delete the BLAST results from the database:

    delete from analysis_tracking where project_id = x;
    commit;

Please see Section IV.3.f for runBlast.exe usage.

3) If the sequences of a project must be re-clustered using the new clustering/assembly protocol, the curator must delete the old clustering/assembly results and contig BLAST results of that project from the ESTAP database, then run the cluster.exe program for clustering/assembly and the blastAllContigs.exe program for contig BLAST. Use the following command to delete the old results from the database:

    delete from cluster_summary where project_id = x;
    commit;

Use the "cluster.exe [project_id]" command to redo the clustering/assembly analysis and "blastAllContigs.exe" to redo the contig BLAST analysis. Note that [project_id] is the internal id of the project. Please see Section IV.3.h and Section IV.3.i for how to run the cluster.exe and blastAllContigs.exe programs.
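The lookup-then-delete sequence in the curator instructions above can be sketched as follows. This is only an illustration under stated assumptions: it uses Python's built-in sqlite3 module with an in-memory stand-in for the schema (the real ESTAP database is Oracle), the stand-in tables contain only the columns quoted in the text, the sample rows are made up, and the helper names delete_project_rows and demo are hypothetical.

```python
import sqlite3

# Table that the text says to clear for each kind of re-analysis.
TABLE_FOR = {
    "cleanse": "raw_seq_file",
    "blast": "analysis_tracking",
    "cluster": "cluster_summary",
}


def delete_project_rows(conn, lab_code, user_proj_code, kind):
    """Look up the internal project_id, then delete that project's rows."""
    row = conn.execute(
        "select project_id from project "
        "where lab_code = ? and user_proj_code = ?",
        (lab_code, user_proj_code),
    ).fetchone()
    if row is None:
        raise LookupError("no project with that lab code and project code")
    project_id = row[0]
    table = TABLE_FOR[kind]  # name comes from the fixed mapping above
    conn.execute(f"delete from {table} where project_id = ?", (project_id,))
    conn.commit()
    return project_id


def demo():
    """Build a tiny stand-in schema with made-up rows and run one delete."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table project "
        "(project_id integer, lab_code text, user_proj_code text)"
    )
    for table in TABLE_FOR.values():
        conn.execute(f"create table {table} (project_id integer)")
    conn.execute("insert into project values (377, 'JC', 'MCA')")
    conn.execute("insert into raw_seq_file values (377)")
    conn.execute("insert into raw_seq_file values (12)")
    project_id = delete_project_rows(conn, "JC", "MCA", "cleanse")
    remaining = conn.execute("select project_id from raw_seq_file").fetchall()
    return project_id, remaining
```

Resolving the internal id first and deleting by that id mirrors the order given in the instructions; the matching analysis program still has to be re-run afterwards.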

VIII. How to Prepare Raw EST Data

Users can send ESTAP either the chromatograms of the sequences (.ab1 files) or both the sequence files and their corresponding quality score files in FASTA format. If the user sends chromatograms, ESTAP uses the Phred base-calling program to generate the sequences and the quality scores. Each sequence file contains sequences from the same sequencing run of the same project. Each quality score file contains the quality scores of the sequences in the corresponding sequence file. Each sequence or quality score string in a file must start with a ">" followed by a sequence name.

To process data files, ESTAP requires that data providers use a standard file name for each file and a standard sequence name for each sequence in the file. The file name carries specific information about the file, and the sequence name carries specific information about the sequence. The naming rules are as follows:

• Both the file name and the sequence name must be of a fixed length.
• Each element in a name must have the required number of characters or digits.
• Pad any element that is shorter than the required length with leading 0's (zeros).

The file names and sequence names are case-sensitive.

1. File Naming Convention

All sequence files have ".seq" as a suffix, and quality score files have ".qal" as a suffix. Every seq string must have an exactly matching qal string by name. The file names use the following naming convention:

    LLPPPTTIIIBRRNNNCCYYMMDDHHMM.seq for the sequence file
    LLPPPTTIIIBRRNNNCCYYMMDDHHMM.qal for the corresponding quality score file

LL - two-character alphanumeric lab code (provided by VBI), e.g. UN
PPP - three-character alphanumeric project id (provided by VBI), e.g. MCA
TT - two-character alphanumeric instrument type (provided by VBI), e.g. A1
III - three-character alphanumeric instrument id (user-defined code that is unique to the lab), e.g. VT1
B - one-character alphanumeric base-calling program id (provided by VBI), e.g. A
RR - two-character alphanumeric run parameter id (provided by VBI), e.g. S1
NNN - three-digit run duration in minutes (actual running time, e.g. 090 for 90 minutes)
CC - two-character alphanumeric sequencing chemistry id (provided by VBI), e.g. Pr
YYMMDDHHMM - the date and time at which the sequence was run, e.g. 0104150930
    YY - two-digit year
    MM - two-digit month
    DD - two-digit day
    HH - two-digit hour
    MM - two-digit minute
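Because every element of the file name has a fixed width, a name can be split by position alone. The following is a minimal sketch of that idea, not part of ESTAP: the function name parse_file_name and the field labels are hypothetical, and it assumes the suffix is written with a dot (.seq or .qal).

```python
from datetime import datetime

# (field label, width) in the order given by the naming convention.
FILE_FIELDS = [
    ("lab_code", 2),
    ("project_id", 3),
    ("instrument_type", 2),
    ("instrument_id", 3),
    ("base_caller", 1),
    ("run_parameter", 2),
    ("run_minutes", 3),
    ("chemistry", 2),
    ("run_datetime", 10),
]
NAME_LENGTH = sum(width for _, width in FILE_FIELDS)  # 28 characters


def parse_file_name(file_name):
    """Split a fixed-width ESTAP-style file name into its elements."""
    stem, dot, suffix = file_name.rpartition(".")
    if not dot or suffix not in ("seq", "qal"):
        raise ValueError("expected a .seq or .qal suffix")
    if len(stem) != NAME_LENGTH or not stem.isalnum():
        raise ValueError(f"name must be {NAME_LENGTH} alphanumeric characters")
    fields = {"suffix": suffix}
    pos = 0
    for label, width in FILE_FIELDS:
        fields[label] = stem[pos:pos + width]
        pos += width
    if not fields["run_minutes"].isdigit():
        raise ValueError("run duration must be three digits")
    fields["run_minutes"] = int(fields["run_minutes"])
    # YYMMDDHHMM; strptime raises ValueError for an impossible date or time.
    fields["run_datetime"] = datetime.strptime(
        fields["run_datetime"], "%y%m%d%H%M"
    )
    return fields
```

For instance, a name assembled from the example values listed above, UNMCAA1VT1AS1090Pr0104150930.seq, yields lab code UN, a run duration of 90 minutes, and a run time of 15 April 2001, 09:30.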

2. Sequence Naming Convention

The sequence names in the sequence and quality score files must be in the following format:

    TPPPNNNNNNNNNLLL

T - one-letter primer type: u for a user-defined primer and s for a standard primer
PPP - three-character alphanumeric primer id (provided or confirmed by VBI), e.g. T3a
NNNNNNNNN - nine-character alphanumeric clone name, unique within the project (user defined)
LLL - three-digit lane or capillary number, e.g. 012 for lane 12 in the sequencing gel

Note: Using the correct primer code (the first 4 characters) is very important for the cleansing procedure. The primer code indicates the orientation of the sequencing (5' or 3'); a wrong orientation will cause vector/adaptor removal to fail.
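The sequence name format can be checked with a single fixed-width pattern before data are sent. The sketch below is illustrative only; the name parse_sequence_name and the field labels are hypothetical, and it checks only the shape of the name, not whether the primer code is registered in the primer table.

```python
import re

# T (u or s), PPP (3 alphanumerics), clone name (9 alphanumerics), LLL (3 digits).
SEQ_NAME = re.compile(
    r"(?P<primer_type>[us])"
    r"(?P<primer_id>[A-Za-z0-9]{3})"
    r"(?P<clone>[A-Za-z0-9]{9})"
    r"(?P<lane>[0-9]{3})"
)


def parse_sequence_name(name):
    """Split a 16-character sequence name into its four elements."""
    match = SEQ_NAME.fullmatch(name)
    if match is None:
        raise ValueError(f"not a valid sequence name: {name!r}")
    fields = match.groupdict()
    fields["lane"] = int(fields["lane"])
    # The primer code used by the cleansing step is the first 4 characters.
    fields["primer_code"] = name[:4]
    return fields
```

Names are case-sensitive, so the pattern accepts only lower-case u or s as the primer type.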

3. Example of a Sequence and a Quality Score File

a. Sequence file

File name: VBOVTA2VT1CS1120Tb0102011615.seq

The following is the content of the file:

>sM13rroot0002012
GTCGTAGCTAGCTAGCTGTGGACTGACTGA
CTAGCATCAGCTTGACTGACTAGCATAGCT
AGCATACGCTATCGATCAGGGCAGACTAGC
TGACTAGCATGACTAGCATT
>sM13rroot0003013
GTCGTAGCTTAGCTAGTCAGATCAGCATCG
AATACGATCAGCTAGCTGTGGACTGACTGA
CTAGCATCAGCTTGACTGACTAGCATAGCT
AGCATACGCTTACGTACACAGACACACA

b. Quality score file

File name: VBOVTA2VT1CS1120Tb0102011615.qal

The following is the content of the file:

>sM13rroot0002012
9 13 16 17 16 12 7 7 19 19 25 27 29 28 30 31 29 30 29 29 30 30 37 37 33 37 40 44 40 35 35 40 40 40 46 40 35 35 35 35 35 35 35 40 40 37 35 40 40 40 40 40 40 40 46 46 46 46 46 47 51 51 42 42 45 45 45 45 45 45 51 46 56 56 56 56 46 51 51 51 45 40 40 40 40 40 37 45 46 42 42 44 44 44 44 56 56 37 40 40 40 39 37 37 37 37 39 40 40 43
>sM13rroot0003013
9 13 13 19 19 9 6 6 16 19 24 24 29 29 29 29 30 37 46 34 37 40 46 40 37 37 37 37 37 40 40 40 40 40 40 35 35 40 40 37 35 35 35 35 35 35 39 40 46 46 46 46 46 56 46 42 42 46 43 51 51 51 51 42 42 51 42 42 51 51 51 51 56 51 51 51 51 51 51 56 56 56 56 56 46 42 42 42 42 42 42 42 42 43 43 35 35 35 35 35 35 56 51 51 51 51 51 51 56 56 56 56 40 40 40 40 40 40

Explanation of the file name VBOVTA2VT1CS1120Tb0102011615.seq: The lab code is VB. The project code is OVT (a library of Orobanche minor v. Virginia tubercles). The instrument is a 377, so the instrument type code is A2. The id for this instrument is VT1. The base-calling program is ABI100, which has a code of C. The run parameter code is S1. The run time is 2 hours, so the run duration is 120. The samples were sequenced using dye terminator chemistry with big dyes, so the chemistry code is Tb. The samples were run on Feb 1, 2001 at 4:15 pm, so the date/time stamp is 0102011615. This information only has to be entered once, in the file name; the sequences themselves carry a shorter name.

Explanation of the sequence name sM13rroot0002012: The primer is the standard primer M13f, so the name starts with an "s" followed by the three-character primer code "M13". The clone name is rroot0002. The sequence was run on lane 12, which has the code "012".
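The requirement that every seq string has an exactly matching qal string by name can be checked before the files are sent. The sketch below is a minimal illustration, not the check that the ESTAP cleansing program performs; the function names read_names and unmatched_names are hypothetical, and the inputs are simply the lines of the two files.

```python
def read_names(lines):
    """Return the record names (the text after '>') found in a file's lines."""
    return [line[1:].strip() for line in lines if line.startswith(">")]


def unmatched_names(seq_lines, qal_lines):
    """Return (names only in the seq file, names only in the qal file)."""
    seq_names = set(read_names(seq_lines))
    qal_names = set(read_names(qal_lines))
    return sorted(seq_names - qal_names), sorted(qal_names - seq_names)
```

Two empty lists mean that the names in the two files pair up exactly. The comparison is case-sensitive, as the naming rules require.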

4. Sending Chromatogram Files

If you wish ESTAP to run Phred for you, send the chromatograms of the sequences. Put all sequences from the same sequencing run of the same project in one folder. The name of the folder follows the file naming convention without the .seq or .qal suffix, e.g. VBOVTA2VT1CS1120Tb0102011615. Each chromatogram file name follows the sequence naming convention plus the suffix ".ab1", e.g. sM13rroot0002012.ab1.

IX. How to Use dbEST Submission Tool

The dbEST Submission Tool incorporated in ESTAP is available online so that users can conveniently submit their cleansed EST sequences in the ESTAP database to NCBI's dbEST database. Note that only PIs have the privilege to submit their EST sequences to dbEST. PIs may follow the "dbEST Submission" link on the "Project List" page to ask ESTAP to prepare the dbEST submission files for them. During dbEST submission, a contact file, a library file, a publication file, and one or more EST files are created and sent to the primary contact person by email. The EST files are in FASTA format, with clone names and clean sequence ids as identifiers.

After submitting the EST sequences to NCBI's dbEST, the PI should receive an email from NCBI regarding the submitted batch of ESTs. So that the dbEST_ID, USER_ID, and GENBANK_ACCN assigned by NCBI can be stored in the ESTAP database, the PI should send the confirmation file received from NCBI to the ESTAP curator. The ESTAP curator should use the ReadConfirmationFile program to process the confirmation file.

The ReadConfirmationFile program is located under the java_prog directory of the uncompressed software. It is a stand-alone Java program that reads and parses the dbEST confirmation file and stores the dbEST_ID, USER_ID, and GENBANK_ACCN assigned by NCBI in the ESTAP database. To make this program work, the curator needs to modify the configuration file "javaConfig.txt" in the java_prog directory and set the correct parameters (see Section IV.1.c). ReadConfirmationFile uses the parameters defined in this file to find the confirmation file and to connect to the ESTAP database. The confirmation file should be named with a .txt suffix (e.g. JC_MCA_020314052943.txt) and placed in the directory specified by the variable "ncbiConfirmFilePath" in the javaConfig.txt file. Go to the directory where ReadConfirmationFile.class is located and run the program using the following command:

    java ReadConfirmationFile


e. Environment variables needed for the d2_cluster program

OMP_NUM_THREADS: If you use one processor, you need to set this variable to 1, e.g.

    export OMP_NUM_THREADS=1

If this variable is not set, d2_cluster will not run.

In some cases the d2_cluster program may not exit normally because of a memory problem. We recommend setting the environment variable MALLOC_CHECK_ to 1, e.g.

    export MALLOC_CHECK_=1

In addition, d2_cluster requires libpgthread.so. You may place it in the /usr/local/lib directory.

3. Run the Programs

a. FTP and format NCBI BLAST databases (e.g. nr, nt)

The NCBI databases can be fetched and updated every month or as frequently as you wish. Make sure that no ESTAP analysis programs are running while you update the NCBI databases; otherwise the ESTAP analysis programs will not run properly. Also make sure that database formatting is complete before running ESTAP programs.

We provide some shell scripts (getAndFormatNcbiDB.sh, getDB.sh, uncompressFormatDB.sh) and a C program (getAndFormatNcbiDB.exe) under the estap_exe directory to update the NCBI databases automatically. We use crontab to schedule the update, e.g.

    30 3 2 * * ~/estap_exe/getAndFormatNcbiDB.sh > get_db.log 2>&1

getAndFormatNcbiDB.sh calls getAndFormatNcbiDB.exe, which calls getDB.sh and uncompressFormatDB.sh. The getAndFormatNcbiDB.exe program waits until all ESTAP analysis programs have finished before updating the NCBI databases. getDB.sh fetches the NCBI databases via wget. uncompressFormatDB.sh uncompresses the fetched NCBI databases and formats them using formatdb. Please revise the scripts if your directory setup is different.

b. Where to put raw data (.ab1 files)

The .ab1 files can be stored in a user-specified directory, e.g. /home/estap/raw_data. When receiving data from customers, create a new directory under this directory for each delivery. For easier management we use the following naming convention (you may choose a different one): labcode-ddmmmyy, e.g. JC-06may02. Then transfer the data to this directory. All sequence files must be named following the ESTAP naming conventions (see Section VIII for how to prepare raw data and for the naming conventions).
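The labcode-ddmmmyy directory name can be generated rather than typed by hand. The following is a small sketch under stated assumptions: the function name delivery_dir_name is hypothetical, and the month is taken from a fixed table of lower-case English abbreviations so that the result does not depend on the system locale.

```python
from datetime import date

# Lower-case English month abbreviations, as in the example JC-06may02.
MONTHS = ("jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec")


def delivery_dir_name(lab_code, received):
    """Build a labcode-ddmmmyy directory name for a data delivery."""
    month = MONTHS[received.month - 1]
    return f"{lab_code}-{received.day:02d}{month}{received.year % 100:02d}"
```

Called with the lab code JC and the date 6 May 2002, the function returns JC-06may02, the example given above.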

c. Run Phred

Go to the directory where runPhred.exe is located, e.g. /home/estap/server_side/estap_exe, and run the runPhred.exe program. This program wraps the phred program: it performs base calling and combines all .ab1 files in one sequence file directory to produce one .seq file and one .qal file in FASTA format. If multiple sequence file directories exist, multiple .seq and .qal files are generated.

Command:

    runPhred.exe input_directory output_directory

e.g.

    runPhred.exe /home/estap/raw_data/jc-06may02 /home/estap/raw_data/jc-06may02-phred-out

where runPhred.exe is the program name, /home/estap/raw_data/jc-06may02 is the directory that contains the .ab1 files from the JC lab, and /home/estap/raw_data/jc-06may02-phred-out is the output directory where the .seq and .qal files go. Before running the program, you must first create the output directory (/home/estap/raw_data/jc-06may02-phred-out). After the program is done, check phred.log (the program logs its results automatically) in the /home/estap/server_side/estap_exe directory for any problems. If there are none, you may either rename the log file or remove it. In the output directory, remove any files that do not have a .seq or .qal extension.
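The final clean-up step, removing everything in the output directory that is not a .seq or .qal file, is easy to get wrong by hand. The sketch below only lists the candidates so that they can be reviewed before anything is deleted; the function names stray_files and demo and the file names used in the demonstration are hypothetical.

```python
import tempfile
from pathlib import Path

KEEP_SUFFIXES = {".seq", ".qal"}


def stray_files(output_dir):
    """List regular files in output_dir that are neither .seq nor .qal."""
    return sorted(
        path for path in Path(output_dir).iterdir()
        if path.is_file() and path.suffix not in KEEP_SUFFIXES
    )


def demo():
    """Create a throwaway directory with made-up files and list the strays."""
    with tempfile.TemporaryDirectory() as tmp:
        for name in ("run1.seq", "run1.qal", "run1.tmp"):
            (Path(tmp) / name).write_text("")
        return [path.name for path in stray_files(tmp)]
```

Once the list has been reviewed, each returned path can be removed with its unlink() method.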

d. Copy the .seq and .qal files generated by the runPhred.exe program to the correct lab subdirectory under the input directory, e.g. if the files belong to the JC lab, copy them to /home/estap/raw_seq/input/JC.

e. Run the cleansing program to verify the integrity of the .seq and .qal files, cleanse the sequences, and store the results in the ESTAP database.

Command:

    estapDriver.exe root_path_name > clean.log 2>&1

e.g.

    estapDriver.exe /home/estap/raw_seq > clean.log 2>&1

where estapDriver.exe is the program name, root_path_name is the parent directory path of the input directory (see Section IV.1 for the directory hierarchy description), and "> clean.log 2>&1" logs the program output (from both stdout and stderr) to a log file named clean.log. The clean.log file can be viewed for debugging purposes. After the cleansing program has run successfully, clean.log can be removed to save disk space.

We recommend using nohup, which makes the program immune to hang-ups, and running the program in the background (with & at the end of the command), e.g.

    nohup estapDriver.exe root_path_name > clean.log 2>&1 &

To check whether the program is running, use the command

    ps -ef | grep estapDriver.exe

If for some reason you need to re-run the cleansing for some sequence files, follow these steps:

(1) Delete the sequence file(s) from the database directly.
(2) Reload the corresponding .seq and .qal files to the input/[lab_code] directory.
(3) Run estapDriver.exe as described in e.
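Where the programs are started from a script rather than typed at a shell, the "nohup ... > log 2>&1 &" pattern has a rough Python analogue. The sketch below is an assumption-laden illustration, not part of ESTAP: launch_logged and demo are hypothetical names, the demonstration runs a harmless Python one-liner instead of estapDriver.exe, and starting the child in a new session (POSIX only) is comparable to nohup but not identical to it.

```python
import os
import subprocess
import sys
import tempfile


def launch_logged(command, log_path):
    """Start command in the background with stdout and stderr sent to log_path."""
    log = open(log_path, "w")
    try:
        return subprocess.Popen(
            command,
            stdin=subprocess.DEVNULL,
            stdout=log,
            stderr=subprocess.STDOUT,  # same effect as 2>&1
            start_new_session=True,    # detach from the terminal's session
        )
    finally:
        log.close()  # the child keeps its own copy of the file descriptor


def demo():
    """Run a stand-in command and return its exit code and log contents."""
    with tempfile.TemporaryDirectory() as tmp:
        log_path = os.path.join(tmp, "clean.log")
        stand_in = [
            sys.executable, "-c",
            "import sys; print('out'); print('err', file=sys.stderr)",
        ]
        proc = launch_logged(stand_in, log_path)
        proc.wait()
        with open(log_path) as handle:
            return proc.returncode, handle.read()
```

The function returns immediately, like a command ending in &; calling wait() on the returned object is only needed when the caller wants to block until the program finishes.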

f. Run the ESTAP BLAST program to do BLAST searches according to the corresponding BLAST protocols.

Command:

    runBlast.exe > blast.log 2>&1

where runBlast.exe is the program name and "> blast.log 2>&1" logs the program output (from both stdout and stderr) to a log file named blast.log. The blast.log file can be viewed for debugging purposes. After the runBlast program has run successfully, blast.log can be removed to save disk space.

We recommend using nohup and running the program in the background (with & at the end of the command), e.g.

    nohup runBlast.exe > blast.log 2>&1 &

To check whether the program is running, use the command

    ps -ef | grep runBlast.exe

Currently the runBlast.exe program is set to blast 1000 sequences at a time (the setting can be changed to a bigger number in the future). If there are more than 1000 sequences to blast, you need to run this program multiple times. Check blast.log to see how many sequences remain to be blasted.

Note: Wait at least 5-10 minutes before starting another runBlast.exe program, to make sure that the sequences to be blasted in the previous run have already been recorded in the ESTAP database. Check the blast.log file to see whether the blast command has already appeared. When starting a second runBlast.exe program, use a different log file name instead of blast.log, e.g. blast2.log. The same precautions apply when you need to run the program three or more times.
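Since each run handles at most 1000 sequences, the number of runs and the log file names to use can be worked out in advance. The sketch below is a simple illustration; the names runs_needed and log_names are hypothetical, and the numbering of the third and later log files (blast3.log and so on) is an assumption extrapolated from the blast2.log example.

```python
BATCH_SIZE = 1000  # current per-run limit stated in the text


def runs_needed(pending, batch_size=BATCH_SIZE):
    """Number of runs required to process `pending` sequences."""
    return -(-pending // batch_size)  # ceiling division on integers


def log_names(runs, stem="blast"):
    """Log file names for successive runs: blast.log, blast2.log, ..."""
    return [
        f"{stem}.log" if index == 1 else f"{stem}{index}.log"
        for index in range(1, runs + 1)
    ]
```

For 2500 pending sequences this gives three runs, logged to blast.log, blast2.log, and blast3.log, with the 5-10 minute wait described above between successive starts.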

g. Run the reblast program to do reblast according to the corresponding BLAST protocols.

Command:

    reblast.exe > reblast.log 2>&1

where reblast.exe is the program name and "> reblast.log 2>&1" logs the program output (from both stdout and stderr) to a log file named reblast.log. The reblast.log file can be viewed for debugging purposes. After the reblast program has run successfully, reblast.log can be removed to save disk space.

We recommend using nohup, which makes the program immune to hang-ups, and running the program in the background (with & at the end of the command), e.g.

    nohup reblast.exe > reblast.log 2>&1 &

To check whether the program is running, use the command

    ps -ef | grep reblast.exe

Currently the reblast.exe program is set to reblast 1000 sequences at a time (the setting can be changed to a bigger number in the future). If there are more than 1000 sequences to reblast, you need to run this program multiple times. Check reblast.log to see how many sequences remain to be blasted.

Note: Wait at least 5-10 minutes before starting another reblast.exe program, to make sure that the sequences to be blasted in the previous run have already been recorded in the ESTAP database. Check the reblast.log file to see whether the blast command has already appeared. When starting a second reblast.exe program, use a different log file name instead of reblast.log, e.g. reblast2.log. The same precautions apply when you need to run the program three or more times.

h. Run the cluster/assembly programs

You have two options: 1) do cluster/assembly for a specific project, or 2) do cluster/assembly for all projects.

1) Do cluster/assembly for a specific project

Command:

    cluster.exe project_id > cluster_project_id.log 2>&1

e.g.

    cluster.exe 377 > cluster_377.log 2>&1

where cluster.exe is the program name, project_id is the id of the project you wish to run (note that this is the internal id, not the project code), and "> cluster_project_id.log 2>&1" logs the program output (from both stdout and stderr) to a log file named cluster_project_id.log. This log file can be viewed for debugging purposes. After the program has run successfully, the log file can be removed to save disk space. You may use nohup, which makes the program immune to hang-ups, e.g.

    nohup cluster.exe project_id > cluster_project_id.log 2>&1 &

To check whether the program is running, use the command

    ps -ef | grep cluster.exe

2) Do cluster/assembly for all projects that are qualified for the process (this is preferred)

Command:

    clusterAll.exe > cluster.log 2>&1

where clusterAll.exe is the program name and "> cluster.log 2>&1" logs the program output (from both stdout and stderr) to a log file named cluster.log. This log file can be viewed for debugging purposes. After the program has run successfully, the log file can be removed to save disk space. You may use nohup, which makes the program immune to hang-ups, e.g.

    nohup clusterAll.exe > cluster.log 2>&1 &

To check whether the program is running, use the command

    ps -ef | grep clusterAll.exe

i. Run the contig BLAST program to blast assembled contigs according to the corresponding BLAST protocols.

Command:

    blastAllContigs.exe > contig_blast.log 2>&1

where blastAllContigs.exe is the program name and "> contig_blast.log 2>&1" logs the program output (from both stdout and stderr) to a log file named contig_blast.log. The contig_blast.log file can be viewed for debugging purposes. After the blastAllContigs program has run successfully, the log file can be removed to save disk space.

We recommend using nohup, which makes the program immune to hang-ups, and running the program in the background (with & at the end of the command), e.g.

    nohup blastAllContigs.exe > contig_blast.log 2>&1 &

To check whether the program is running, use the command

    ps -ef | grep blastAllContigs.exe

Currently the blastAllContigs.exe program is set to blast 1000 contigs at a time (the setting can be changed to a bigger number in the future). If there are more than 1000 contigs to blast, you need to run this program multiple times. Check contig_blast.log to see how many sequences remain to be blasted.

Note: Wait at least 5-10 minutes before starting another blastAllContigs.exe program, to make sure that the sequences to be blasted in the previous run have already been recorded in the ESTAP database. Check the contig_blast.log file to see whether the blast command has already appeared. When starting a second blastAllContigs.exe program, use a different log file name instead of contig_blast.log, e.g. contig_blast2.log. The same precautions apply when you need to run the program three or more times.

j. Run the contig reblast program to reblast assembled contigs according to the corresponding BLAST protocols.

Command:

    reblastAllContigs.exe > contig_reblast.log 2>&1

where reblastAllContigs.exe is the program name and "> contig_reblast.log 2>&1" logs the program output (from both stdout and stderr) to a log file named contig_reblast.log. This log file can be viewed for debugging purposes. After the reblastAllContigs program has run successfully, the log file can be removed to save disk space.

We recommend using nohup, which makes the program immune to hang-ups, and running the program in the background (with & at the end of the command), e.g.

    nohup reblastAllContigs.exe > contig_reblast.log 2>&1 &

To check whether the program is running, use the command

    ps -ef | grep reblastAllContigs.exe

Currently the reblastAllContigs.exe program is set to reblast 1000 contigs at a time (the setting can be changed to a bigger number in the future). If there are more than 1000 contigs to reblast, you need to run this program multiple times. Check contig_reblast.log to see how many sequences remain to be blasted.

Note: Wait at least 5-10 minutes before starting another reblastAllContigs.exe program, to make sure that the sequences to be blasted in the previous run have already been recorded in the ESTAP database. Check the contig_reblast.log file to see whether the blast command has already appeared. When starting a second reblastAllContigs.exe program, use a different log file name instead of contig_reblast.log, e.g. contig_reblast2.log. The same precautions apply when you need to run the program three or more times.

k. Set up and run the InterProScan application

InterProScan is a tool developed at EBI that combines different protein signature recognition methods into one resource. ESTAP wraps this tool to query contigs and singlets against the InterPro public databases and annotate protein functions automatically. The GO terms of the matched proteins are linked to the contigs and singlets. Follow the instructions below to set up and run the InterProScan application.

1) Download iprscan

Go to the ftp://ftp.ebi.ac.uk/pub/software/unix/iprscan/ site and download the following files:

(1) iprscan_vXXX.tar.gz, where XXX is the version number of the iprscan release, e.g. 3.2
(2) iprscan_bin_XXX.tar.gz, where XXX is your platform, e.g. Linux
(3) iprscan_DATA_XXX.tar.gz, where XXX is the version number of the InterPro database release, e.g. 6.1

2) Uncompress the files, in the order specified above, into your home directory ($HOME_DIR). The iprscan home directory will be $HOME_DIR/iprscan, e.g. /home/estap/iprscan.

3) First-time configuration

Go to the iprscan home directory and run

    perl CONFIG.pl

Then:

(1) Modify CONFIG.pl as described below (save a copy of the original CONFIG.pl file as CONFIG_orig.pl).

  (i) At the beginning of CONFIG.pl (in main), along with the other variable declarations, add the following lines:

        my $IprPWD = "iprpwd.txt";
        my $pwd = $ENV{PWD};
        unless ($pwd) {
            $pwd = `pwd`;
            chomp $pwd;
        }

  (ii) Before the line setting up applications,

        $applset = get_user_prompt("Setup applications $first (y|n)");

  add these lines:

        open (FF, ">$IprPWD") || die "Cannot create $IprPWD: $!";
        print FF $pwd;
        close FF;

(2) Modify the InterProScan.pl file as described below (save a copy of the original file as InterProScan_orig.pl).

  (i) Replace

        my $seqfile = $ARGV[0];

  with

        my $seqfile = $ARGV[1];

  (ii) After the line

        my $UserId = Manager->getUserId();

  add these lines:

        my $OutDir = $ARGV[0];
        print "$OutDir \n";
        # Do not use $path in this file since it won't be recognized by the
        # iprscan wrapper program. $OutDir, instead of $path, is used as an
        # argument for the iprscan wrapper program to store iprscan result files.

  (iii) Replace $path with $OutDir in all the lines following the above comment lines.

4) Reconfiguration

Go to the iprscan home directory and run

    perl CONFIG.pl

You will be asked to choose the member databases you would like to search against. If you only use the public data and applications that come with the iprscan distribution, answer "n" for SignalPHMM and TMHMM during configuration, since they are not public. When the configuration is complete, a file called iprpwd.txt is created under the iprscan home directory. The iprpwd.txt file stores the iprscan path information, which is used by the iprscan wrapper programs. You must copy the iprpwd.txt file to the java_prog directory, where the iprscan wrapper programs are placed, so that the Java programs can read the iprscan path information from this file.

5) Run iprscan

Go to the java_prog directory, where the Java application and the configuration files (javaConfig.txt and iprpwd.txt) are placed, and set the proper parameters in the configuration file as described in Section IV.1.c. You may analyze a specific project or all projects at a time.

To run a specific project, use the command

    nohup java InterProDriver lab_code project_code > ipr_project_code.log 2>&1 &

e.g.

    nohup java InterProDriver JC MCA > ipr_MCA.log 2>&1 &

where InterProDriver is the program name, lab_code is the two-letter lab code of the lab that the project belongs to, project_code is the three-letter project code of the project being analyzed, and "> ipr_project_code.log 2>&1" logs the program output (from both stdout and stderr) to a log file named ipr_project_code.log.

To run all projects that are qualified for this procedure, use the command

    nohup java InterProScanAll > ipr.log 2>&1 &

where InterProScanAll is the program name and "> ipr.log 2>&1" logs the program output (from both stdout and stderr) to a log file named ipr.log.

The log files can be viewed for debugging purposes. After the program has run successfully, the log files can be removed to save disk space.

l. Run the genomic DNA assembly programs

You have two options: 1) do DNA assembly for a specific genome project, or 2) do DNA assembly for all qualified genome projects.

1) Do assembly for a specific genome project

Command:

    nohup gAssemble.exe project_id > assemble_project_id.log 2>&1

e.g.

    nohup gAssemble.exe 377 > assemble_377.log 2>&1

where gAssemble.exe is the program name, project_id is the id of the project you wish to run (note that this is the internal id, not the project code), and "> assemble_project_id.log 2>&1" logs the program output (from both stdout and stderr) to a log file named assemble_project_id.log.

2) Do assembly for all genome projects that are qualified for the process (this is preferred)

Command:

    nohup gAssembleAll.exe > assemble.log 2>&1

where gAssembleAll.exe is the program name and "> assemble.log 2>&1" logs the program output (from both stdout and stderr) to a log file named assemble.log.

m. Prepare EST database files from user projects

The ESTAP Web interface allows users to blast specific sequence(s) against the ESTs from their projects. We provide a program called writeAndFormatFasta.exe to create and format EST database files from user projects. This program can be scheduled to run daily, or as often as you wish, to provide the most recent EST databases for users to blast against. To run this program, you need to create a directory where the EST database files will be placed, e.g. /home/estap/estdb. This directory path should be used for the ESTDBFilePath parameter setting in the web.xml file (see Section V.3.e) so that the ESTAP local blast Web service can recognize

4. Schedule Programs

The programs above can be run manually as described, or scheduled to run automatically using the UNIX cron facility. Please see the UNIX man page for the crontab command. The following is an example schedule:

    0 9,11,17 * * * ~/estap_exe/estapDriver.sh >> estap.log 2>&1
    0 19,20,21 * * * ~/estap_exe/runBlast.sh >> blast.log 2>&1
    30 22,23,24 * * * ~/estap_exe/reblast.sh >> reblast.log 2>&1
    30 3 2 ~/estap_exe/getAndFormatNcbiDB.sh > get_db.log 2>&1
    0 5 * * * ~/estap_exe/clusterAll.sh >> ~/estap_exe/cluster.log 2>&1
    0,30 19 * * * ~/estap_exe/blastAllContigs.sh >> ~/estap_exe/blast_contigs.log 2>&1
    0,30 12 * * * ~/estap_exe/reblastAllContigs.sh >> ~/estap_exe/reblast_contigs.log 2>&1
    0,30 7 * * * ~/estap_exe/gAssembleAll.sh >> ~/estap_exe/assemble.log 2>&1
    5 2 * * * ~/estap_exe/writeAndFormatFasta.sh >> ~/estap_exe/estap_formatdb.log 2>&1

The shell scripts used here call the corresponding programs and make sure that the programs run in the correct directories with the correct environment variable settings. These scripts should be revised if your directory setup or environment variable settings are different.
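The wrapper scripts mentioned above are not reproduced in this guide. As a rough illustration only, a wrapper of that kind might look like the following sketch. The function name run_in_dir, the Oracle path, and the example call are hypothetical placeholders, not part of the ESTAP distribution.

```shell
#!/bin/sh
# Sketch of a cron wrapper in the spirit of estapDriver.sh.
# All paths and values below are hypothetical placeholders.

# run_in_dir DIR CMD [ARGS...]
# Runs CMD from inside DIR with the environment the program expects.
run_in_dir() {
    dir=$1
    shift
    (
        # Fail early if the working directory does not exist.
        cd "$dir" || exit 1
        # Assumed Oracle client location; adjust for your installation.
        ORACLE_HOME=${ORACLE_HOME:-/opt/oracle/product/9.2.0}
        PATH=$ORACLE_HOME/bin:$PATH
        export ORACLE_HOME PATH
        "$@"
    )
}

# Example call (commented out so that sourcing this file has no side effects):
# run_in_dir "$HOME/estap_exe" ./estapDriver.exe /home/estap/raw_seq
```

Running the command inside a subshell keeps the directory change and the exported variables from leaking into the calling environment, which matters when several programs share one crontab.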

5. Trouble-shooting

When a program stops abnormally (e.g. power or network failure, killed processes), some data may be only partially analyzed and partial results may be stored in the database. You may need to clean up (remove the partial results from the database) and rerun the program. Please use the following instructions.

a. Go to the ESTAP database table PROG_RUN_STATUS in the ESTAP_SYS_DEF schema. This table records which programs are running or completed, when they were started, and when they were completed. If a program is killed in the middle of a run, the end_time will not be recorded and the status will still be running (status = 1). You need to change the status from 1 to 0 to indicate that the program is no longer running.

b. If the program was runBlast.exe or reblast.exe: after you change the status in the PROG_RUN_STATUS table, go to the ANALYSIS_TRACKING table in the ESTAP_ANALYSIS schema and delete the rows created by the program that have status = 1, e.g. DELETE from ANALYSIS_TRACKING where status = 1; Commit the changes after you are sure that you want to delete them. Then rerun the runBlast.exe program.

c. If the program was cluster.exe: after you change the status in the PROG_RUN_STATUS table, go to the CLUSTER_SUMMARY table in the ESTAP_ANALYSIS schema and delete the row with the project id that you specified when you ran cluster.exe. Commit the changes after you are sure that you want to delete them. Then rerun the cluster.exe program.

d. If the program was clusterAll.exe: after you change the status in the PROG_RUN_STATUS table (look for "cluster.exe" and "clusterAll.exe" in the prog_name column), go to the CLUSTER_SUMMARY table in the ESTAP_ANALYSIS schema and delete the rows created by the program that have a status of 1. Commit the changes after you are sure that you want to delete them. Then rerun the clusterAll.exe program.

e. If the program was blastAllContigs.exe or reblastAllContigs.exe: after you change the status in the PROG_RUN_STATUS table, go to the CONTIG_ANALYSIS table in the ESTAP_ANALYSIS schema and delete the rows created by the program that have status = 1, e.g. DELETE from CONTIG_ANALYSIS where status = 1; Commit the changes after you are sure that you want to delete them. Then rerun the blastAllContigs.exe program.
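Step a above describes the status reset in words only. The following sketch prints one possible SQL statement for it so that it can be reviewed before being pasted into an SQL session. The table, schema, and column names (status, prog_name) come from the text; the helper name reset_status_sql and the exact value stored in prog_name are assumptions.

```shell
#!/bin/sh
# reset_status_sql PROG_NAME
# Prints SQL that marks a crashed program as no longer running.
# PROG_NAME is assumed to match the value stored in the prog_name column.
# The argument is pasted into the SQL as-is, so pass trusted values only.
reset_status_sql() {
    cat <<EOF
UPDATE ESTAP_SYS_DEF.PROG_RUN_STATUS
   SET status = 0
 WHERE prog_name = '$1'
   AND status = 1;
-- Run COMMIT; only after checking the change.
EOF
}

# Example: print the statement for a crashed runBlast.exe run.
reset_status_sql runBlast.exe
```

The statement is printed rather than executed, in keeping with the advice above to commit only once you are sure of the change.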

V. How to Install and Run Web Programs

1. Install Oracle Client including JDBC

Please contact Oracle to obtain licensing and product support. ESTAP uses the Oracle 9i client.

2. Install JDK, Servlet, and Web Service Software

a. To install JDK 1.3 (Windows, Linux, Solaris) or higher, please go to http://java.sun.com/j2se

b. To install Tomcat, please go to http://jakarta.apache.org/tomcat

c. To install Apache Axis, please go to http://ws.apache.org/axis. The Apache Axis beta 3 version is used for the ESTAP local blast Web service.

3. Configure Tomcat

a. Visit http://jakarta.apache.org/tomcat for Tomcat configuration instructions.

b. Set up the correct PATH and CLASSPATH environment variables.

c. Modify server.xml in the Tomcat conf directory. In this file, add the following for the estap context:

    <Context path="/estap" docBase="estap" showdebuginfo="true" debug="DEBUG" reloadable="true" crossContext="false">
      <Logger className="org.apache.catalina.logger.FileLogger" prefix="localhost_estap_log." suffix=".txt" timestamp="true"/>
    </Context>

d. Place the estap Web application under the Tomcat webapps directory: copy the estap directory from the web_side directory of the uncompressed software to the webapps directory.

e. Modify the web.xml file in the estap/WEB-INF directory. The following section in web.xml is specific to ESTAP. Please modify the text in bold for your own parameter settings.

    <servlet>
      <servlet-name>ProjectList</servlet-name>
      <servlet-class>ProjectList</servlet-class>
      <init-param>
        <param-name>jdbcDriverClassName</param-name>
        <param-value>oracle.jdbc.driver.OracleDriver</param-value>
      </init-param>
      <init-param>
        <param-name>jdbcURL</param-name>
        <param-value>jdbc:oracle:thin:@server.vt.edu:1521:estapdb</param-value>
      </init-param>
      <init-param>
        <param-name>dbUserName</param-name>
        <param-value>estap_user</param-value>
      </init-param>
      <init-param>
        <param-name>dbUserPassword</param-name>
        <param-value>password</param-value>
      </init-param>
      <init-param>
        <param-name>dbInitConnect</param-name>
        <param-value>1</param-value>
      </init-param>
      <init-param>
        <param-name>dbMaxConnect</param-name>
        <param-value>50</param-value>
      </init-param>
      <init-param>
        <param-name>dbESTFilePath</param-name>
        <param-value>/home/estap/dbEST_Outgoing</param-value>
      </init-param>
      <init-param>
        <param-name>ESTDBFilePath</param-name>
        <param-value>/home/estap/estdb</param-value>
      </init-param>
      <init-param>
        <param-name>blastallPath</param-name>
        <param-value>/home/estap/blast/blastall</param-value>
      </init-param>
      <init-param>
        <param-name>formatdbPath</param-name>
        <param-value>/home/estap/blast/formatdb</param-value>
      </init-param>
      <init-param>
        <param-name>estapEmail</param-name>
        <param-value>estap@vbi.vt.edu</param-value>
      </init-param>
      <load-on-startup>1</load-on-startup>
    </servlet>

The following list defines the parameters:

jdbcURL: the URL for the JDBC connection
dbUserName: the ESTAP database user name
dbUserPassword: the ESTAP database user password
dbInitConnect: the number of database connections to be made initially
dbMaxConnect: the maximum number of database connections
dbESTFilePath: the directory path to store dbEST submission files
ESTDBFilePath: the directory path to store the EST files in FASTA format and their formatted files used by the ESTAP local blast Web service
blastallPath: the path for the blastall program
formatdbPath: the path for the formatdb program
estapEmail: the ESTAP curator's email address

f. Set heap size for JVM

Some of the ESTAP Web programs require more memory than the JVM default heap size of 64MB. To avoid an "out of memory" error, set a larger heap size for the JVM. To do this, set the "JAVA_OPTS" variable in the catalina.sh/catalina.bat or setclasspath.sh/setclasspath.bat file in the tomcat_home/bin directory. For example, to set the heap size to 256MB, you may add the line "set JAVA_OPTS=-mx256m" to the catalina.bat file.
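The example above uses the Windows batch syntax. On Unix/Linux, a comparable setting in catalina.sh might look like the sketch below. It assumes the standard -Xmx spelling of the maximum heap option, and where the lines go in the file is an assumption.

```shell
# Hypothetical addition near the top of tomcat_home/bin/catalina.sh.
# -Xmx256m sets the maximum JVM heap size to 256 MB.
JAVA_OPTS="-Xmx256m"
export JAVA_OPTS
```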

g. Start Tomcat server

To start the Tomcat server, execute tomcat_home/bin/startup.bat on Windows (or tomcat_home/bin/startup.sh on Unix/Linux). Then enter the URL http://localhost in your browser and make sure that you get the Tomcat welcome page, not an error message saying that the page cannot be displayed or that the server cannot be found. If you choose to use port number 8080 instead of 80, you will need to use a URL that includes the port number, such as http://localhost:8080. If this is successful, enter http://localhost/estap/servlet/Login or http://localhost:8080/estap/servlet/Login to see whether the ESTAP login page is displayed.

4. Configure Axis

a. Go to http://ws.apache.org/axis for downloading and installation instructions.

b. Place the downloaded axis directory under the tomcat_home/webapps directory.

c. Move the server-config.wsdd file from the tomcat_home/webapps/estap/WEB-INF directory to the tomcat_home/webapps/axis/WEB-INF directory.

d. Add all jar files under the tomcat_home/webapps/axis/WEB-INF/lib and tomcat_home/webapps/axis/WEB-INF/classes directories to your CLASSPATH environment variable.

e. Move the blastService and createESTDBService directories from the tomcat_home/webapps/estap/WEB-INF/classes directory to the tomcat_home/webapps/axis/WEB-INF/classes directory.

f. Replace the server name and Tomcat port with your own server name and port number in the JBlastServiceLocator.java and JCreateESTDBServiceLocator.java files under the blastService/ws and createESTDBService/ws directories, respectively. E.g. for blastService, run the commands

    cd blastService
    cd ws

Edit the JBlastServiceLocator.java file by changing the line http://arda.vbi.vt.edu:8080 to http://yourServerName:yourPort. Then compile the modified file by running the command

    javac JBlastServiceLocator.java

Repeat the above process for the createESTDBService directory.

g. Copy the blastService and createESTDBService directories to the tomcat_home/webapps/estap/WEB-INF/classes directory.

h. Compile EstapWebBlast.java by running the command

    javac EstapWebBlast.java

i. Restart the Tomcat server and go to http://localhost/axis or http://localhost:8080/axis to view the list of deployed Web services. If Axis is configured correctly, you will see the createESTDB and blast services.

VI. How to Register and Update ESTAP Users

ESTAP defines 3 types of users with different role permissions: sponsors, PIs, and other users.

Sponsors are those who fund the development of ESTAP. If there is no 'sponsor' according to this definition, then the person who is in charge of the ESTAP system becomes the sponsor. There is one sponsor per institution. Sponsors can register/update PIs and have all PI privileges.

PIs are the people responsible for the projects. PIs can register/update projects, project users, and project protocols. PIs can also register ESTAP codes (e.g. primer, instrument, base caller, etc.), submit ESTs of their projects to dbEST, and view their project data and analysis results.

Other users include primary contacts with update permission, primary contacts with read-only permission, and project viewers with read-only permission. Primary contacts with update privileges (associated with particular project(s)) can update projects and project protocols and view project data and analysis results. Primary contacts with read-only permission and project viewers (associated with particular project(s)) can view project data and analysis results.

At least one sponsor must exist before PIs can register their projects and users. Sponsors are registered by the ESTAP curator/database administrator, who has direct access to the ESTAP tables in the database. Please use the following instructions to register sponsors.

1) In the INSTITUTION table, modify the first record, which has institution_id = 1. Change the institution_name to the institution name of the sponsor. If there is more than one sponsor to be added, add a new record for each remaining sponsor. For each record, increment institution_id by 1 and enter the institution name and description in the institution_name and institution_desc columns.

2) In the LAB table, modify the record with the lab_code of XX. Change all columns of this record to the desired values. The lab_email entry must be the email address of the sponsor. The lab_alt_email entry is the email address of the primary contact in the sponsor's lab. Each sponsor has one lab. You may add more records if there is more than one sponsor.

3) In the PERSON table, modify the person record with the person_id of 1. Change the user_name to the email address of the sponsor and the password to the sponsor's password for logging on to the Web site. The user_type is 1 and the sponsor_id is 1 (referring to the person_id of the sponsor him/herself in the PERSON table). The other columns are self-explanatory by their names. Add more records if there is more than one sponsor, and increment person_id for each additional sponsor.

4) In the CONTACT table, change the record with the contact_id of 1 and person_id of 1 to the contact information of the sponsor. Add more records if there is more than one sponsor, and increment contact_id for each additional sponsor.

Once sponsors are registered in the ESTAP system, they can log on to the ESTAP Web site to register or update PIs in their institutions. Once PIs are registered, they can log on to the ESTAP Web site to register or update their projects. In addition, PIs can register or update the primary contacts and the project viewers for their projects.

VII. How to Register and Update EST Projects and Protocols

1. Project Registration and Update

ESTAP defines two types of projects: regular and virtual projects. A regular project refers to a specific EST library that a lab constructed. The regular project information includes biological information regarding the tissues and cloning methods used in cDNA library construction, and EST sequencing information. In order to analyze combined libraries from a particular taxonomy, we introduced the virtual project concept. PIs can now register a virtual project that combines their regular projects from a particular taxonomy.

To register a regular project or a virtual project, please go to the "Project List" page, click the "Register Regular Project" link or the "Register Virtual Project" link, and follow the instructions in the registration form carefully. PIs and primary contacts with update privilege may also update the project information via the Web.

The current version of ESTAP does not allow a project to be updated once the sequences of that project have been processed and stored in the ESTAP database. If the user needs to make corrections to the project information after the sequences are processed, he or she has to make a request to the ESTAP curator. The ESTAP curator should delete the already loaded sequences for that project, make the corrections to the project information, and then reload the sequences by running the estapDriver.exe program.

2. Protocol Registration and Update

ESTAP offers the following protocols for data analysis.

1) Cleansing protocols

The cleansing protocols contain procedures to cleanse raw sequences by removing low-quality end sequences, vectors, and polyA (or polyT), and by screening for chimeras and contamination. There are two types of cleansing protocols: the cleansing protocol for BLAST and the cleansing protocol for assembly. The sequences cleansed by the cleansing protocol for BLAST are used for BLAST searching, and the sequences cleansed by the cleansing protocol for assembly are used for the cluster/assembly analysis. The user may use the same procedures and parameters for both protocols. There are several key points for the cleansing to be successful:

a. The cleansing program uses the 5' and 3' vector fragments provided in the project registration to match the vector portions in the clones, so the 5' and 3' vector fragments (both the orientation of the fragments and the sequences) must be correct.

b. The vector and insert adaptor sequences provided in the project registration should not be the whole adaptor sequences. Only the regions that are present in the clones should be included.

c. The sequencing primer code in the sequence names of the raw sequence data must be correct. The correct primer code indicates the correct orientation of the sequencing (5' or 3'). A wrong orientation will cause vector/adaptor removal to fail. Please check the primer table carefully to make sure the orientation of the primer is correct.

d. When the cleansing procedure does not seem to give the expected result, please check whether the above key points have been followed.

2) Blast protocol

The blast protocol provides procedures and parameters for BLAST (BLASTN, BLASTX, and/or TBLASTX) searches. Users may choose to run all, some, or none of the blast procedures in the protocol.

3) Cluster/assembly protocol

The cluster/assembly protocol provides procedures and parameters for clustering and assembly analysis. Users register this protocol only if they wish to do cluster/assembly. Projects are not clustered automatically without a cluster protocol: to let ESTAP run the cluster/assembly programs for a project, the user needs to register the cluster protocol for that project. Please go to the View Project page of the project you wish to cluster/assemble and follow the link called Cluster Protocol to register the protocol and set the procedure parameters. The parameters that ESTAP provides are from the d2_cluster and CAP3 programs. The Register Cluster/Assembly Protocol page has a link, "Cluster/Assembly Protocol Guide", which provides paper references for the d2_cluster and CAP3 programs. Once the protocol is registered, ESTAP will automatically perform cluster/assembly for your project according to the protocol you registered.

If you wish to perform cluster/assembly for combined libraries (not for each project), you need to register a virtual project and select the taxonomy of the libraries. Then register the BLAST and cluster/assembly protocols for this virtual project. Once the cluster protocol is registered, ESTAP will automatically perform cluster/assembly for the virtual project.

The cluster/assembly protocol also provides an option to perform InterProScan analysis. Users may choose this option to let ESTAP automatically annotate the protein functions of singlets and contigs after the clustering/assembly procedure is performed.

PIs and primary contacts with update privilege may also update the protocols. The updated protocols take effect for new sequence data, not for data that have already been analyzed. If the user wishes to reanalyze the old data, he or she must make such a request to the ESTAP curator. The curator should use the following instructions.

1) If the sequences of a project must be re-cleansed using the new protocol, the curator must delete the sequences of the project from the ESTAP database and reload them by running the estapDriver.exe program. Use the following command to delete the sequences from the database:

    delete from raw_seq_file where project_id = x;
    commit;

project_id is the internal id for the project. The curator may get the id from the PROJECT table by matching the lab code and project code of the project:

    select project_id from project where lab_code = 'xx' and user_proj_code = 'xxx';

Please see Section IV.4.e for estapDriver.exe usage.

2) If the sequences of a project must be re-blasted using the new BLAST protocol, the curator must delete the old blast results of that project from the ESTAP database and run the runBlast.exe program. Use the following command to delete the blast results from the database:

    delete from analysis_tracking where project_id = x;
    commit;

Please see Section IV.4.f for runBlast.exe usage.

3) If the sequences of a project must be re-clustered using the new clustering/assembly protocol, the curator must delete the old clustering/assembly results and the contig blast results of that project from the ESTAP database, then run the cluster.exe program for clustering/assembly and the blastAllContigs.exe program for contig blast. Use the following command to delete the old results from the database:

    delete from cluster_summary where project_id = x;
    commit;

Use the "cluster.exe [project_id]" command to redo the clustering/assembly analysis and "blastAllContigs.exe" to redo the contig blast analysis. Note: [project_id] is the internal id for that project. Please see Section IV.4.h and Section IV.4.i for how to run the cluster.exe and blastAllContigs.exe programs.

VIII. How to Prepare Raw EST Data

Users can send ESTAP either the chromatograms of the sequences (.ab1 files) or both sequence files and their corresponding quality score files in FASTA format. If the user sends chromatograms, ESTAP will use the Phred base-calling program to generate the sequences and the quality scores. Each sequence file contains sequences from the same sequencing run of the same project. Each quality score file contains the quality scores of the sequences in the corresponding sequence file. Each sequence or quality score string in the file must start with a ">" followed by a sequence name.

To process data files, ESTAP requires that data providers use a standard file name for each file and a standard sequence name for each sequence in the file. The file name contains specific information about the file, and the sequence name contains specific information about the sequence. All the sequences in the same file must be from the same project and the same sequencing run. The naming conventions and rules are as follows:

• Both the file name and the sequence name are required to be of a fixed length.
• Each element in the name must conform to a required number of characters or digits.
• Pad with 0's (zeros) in front of any element that is shorter than the required length.

The file names and sequence names are case-sensitive.

1. File naming convention

All sequence files have ".seq" as a suffix, and quality score files have ".qal" as a suffix. Every seq string must have an exactly matching qal string by name. The file names use the following naming convention: LLPPPTTIIIBRRNNNCCYYMMDDHHMM.seq for the sequence file and LLPPPTTIIIBRRNNNCCYYMMDDHHMM.qal for the corresponding quality score file.

LL - two-alphanumeric letter lab code (provided by VBI), e.g. UN
PPP - three-alphanumeric letter project id (provided by VBI), e.g. MCA
TT - two-alphanumeric letter instrument type (provided by VBI), e.g. A1
III - three-alphanumeric letter instrument id (user-defined code that is unique to the lab), e.g. VT1
B - one-alphanumeric letter base calling program id (provided by VBI), e.g. A
RR - two-alphanumeric letter run parameter id (provided by VBI), e.g. S1
NNN - three-digit run duration in minutes (actual running time in minutes, e.g. 090 for 90 minutes)
CC - two-alphanumeric letter sequencing chemistry id (provided by VBI), e.g. Pr
YYMMDDHHMM - the date and time at which the sequence was run, e.g. 0104150930
    YY - two-digit year
    MM - two-digit month
    DD - two-digit day
    HH - two-digit hour
    MM - two-digit minute
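The fixed-width layout above can be checked mechanically. The following sketch splits a file name into its elements by character position. The helper names estap_field and parse_estap_filename and the output labels are illustrative only and are not part of ESTAP.

```shell
#!/bin/sh
# estap_field STRING RANGE
# Prints the characters of STRING at the given cut(1) range, e.g. 3-5.
estap_field() {
    printf '%s\n' "$1" | cut -c"$2"
}

# parse_estap_filename NAME
# Splits LLPPPTTIIIBRRNNNCCYYMMDDHHMM (with or without a .seq/.qal suffix)
# into its elements and prints one label=value pair per line.
parse_estap_filename() {
    base=${1%.*}    # drop the suffix, if any
    if [ "${#base}" -ne 28 ]; then
        echo "error: expected 28 characters, got ${#base}" >&2
        return 1
    fi
    echo "lab=$(estap_field "$base" 1-2)"
    echo "project=$(estap_field "$base" 3-5)"
    echo "instrument_type=$(estap_field "$base" 6-7)"
    echo "instrument_id=$(estap_field "$base" 8-10)"
    echo "base_caller=$(estap_field "$base" 11)"
    echo "run_param=$(estap_field "$base" 12-13)"
    echo "run_minutes=$(estap_field "$base" 14-16)"
    echo "chemistry=$(estap_field "$base" 17-18)"
    echo "timestamp=$(estap_field "$base" 19-28)"
}
```

For example, parse_estap_filename VBOVTA2VT1CS1120Tb0102011615.seq reports lab=VB, project=OVT, and run_minutes=120, while a name of the wrong length is rejected with a non-zero status.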

2. Sequence Naming Convention

The sequence names in the sequence and quality score files must be in the following format: TPPPNNNNNNNNNLLL

T - one-letter primer type: u for a user-defined primer and s for a standard primer
PPP - three-alphanumeric letter primer id (provided or confirmed by VBI), e.g. T3a
NNNNNNNNN - nine-alphanumeric letter unique clone name within the project (user defined)
LLL - three-digit lane or capillary number, e.g. 012 for lane 12 in the sequence gel

Note: Using the correct primer code (first 4 letters) is very important for the cleansing procedure. The correct primer code indicates the correct orientation of the sequencing (5' or 3'). A wrong orientation will cause vector/adaptor removal to fail.
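The zero-padding rule can be applied when sequence names are generated. The following sketch builds a name from its four elements. The helper names pad_left and make_seq_name are hypothetical, and the sketch does not reject elements that are longer than the required width.

```shell
#!/bin/sh
# pad_left STRING WIDTH
# Prefixes STRING with zeros until it is WIDTH characters long.
# A loop is used instead of printf '%03d' because printf would read
# an input such as 012 as an octal number.
pad_left() {
    s=$1
    while [ "${#s}" -lt "$2" ]; do
        s=0$s
    done
    printf '%s\n' "$s"
}

# make_seq_name PRIMER_TYPE PRIMER_ID CLONE LANE
# Assembles a TPPPNNNNNNNNNLLL sequence name.
make_seq_name() {
    printf '%s%s%s%s\n' "$1" "$2" "$(pad_left "$3" 9)" "$(pad_left "$4" 3)"
}
```

For example, make_seq_name s M13 rroot0002 12 yields sM13rroot0002012, the same name used in the example files in this section.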

3. Example of a Sequence and a Quality Score File

a. Sequence file

File name: VBOVTA2VT1CS1120Tb0102011615.seq

The following is the content of the file:

>sM13rroot0002012
GTCGTAGCTAGCTAGCTGTGGACTGACTGA
CTAGCATCAGCTTGACTGACTAGCATAGCT
AGCATACGCTATCGATCAGGGCAGACTAGC
TGACTAGCATGACTAGCATT
>sM13rroot0003013
GTCGTAGCTTAGCTAGTCAGATCAGCATCG
AATACGATCAGCTAGCTGTGGACTGACTGA
CTAGCATCAGCTTGACTGACTAGCATAGCT
AGCATACGCTTACGTACACAGACACACA

b. Quality score file

File name: VBOVTA2VT1CS1120Tb0102011615.qal

The following is the content of the file:

>sM13rroot0002012
9 13 16 17 16 12 7 7 19 19 25 27 29 28 30 31 29 30 29 29 30 30 37 37 33 37 40 44 40 35 35 40 40 40 46 40 35 35 35 35 35 35 35 40 40 37 35 40 40 40 40 40 40 40 46 46 46 46 46 47 51 51 42 42 45 45 45 45 45 45 51 46 56 56 56 56 46 51 51 51 45 40 40 40 40 40 37 45 46 42 42 44 44 44 44 56 56 37 40 40 40 39 37 37 37 37 39 40 40 43
>sM13rroot0003013
9 13 13 19 19 9 6 6 16 19 24 24 29 29 29 29 30 37 46 34 37 40 46 40 37 37 37 37 37 40 40 40 40 40 40 35 35 40 40 37 35 35 35 35 35 35 39 40 46 46 46 46 46 56 46 42 42 46 43 51 51 51 51 42 42 51 42 42 51 51 51 51 56 51 51 51 51 51 51 56 56 56 56 56 46 42 42 42 42 42 42 42 42 43 43 35 35 35 35 35 35 56 51 51 51 51 51 51 56 56 56 56 40 40 40 40 40 40
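Because every sequence name must have an exactly matching quality score entry, it can help to compare the two files before submitting them. The sketch below compares the ">" header lines of a sequence file and a quality score file. The function name check_seq_qal is hypothetical, and treating the headers as an unordered set is an assumption, since the guide does not say whether the order must match.

```shell
#!/bin/sh
# check_seq_qal SEQ_FILE QAL_FILE
# Succeeds only if both files contain the same set of ">" header lines
# and the sequence file has at least one header.
check_seq_qal() {
    seq_names=$(grep '^>' "$1" | sort)
    qal_names=$(grep '^>' "$2" | sort)
    [ -n "$seq_names" ] && [ "$seq_names" = "$qal_names" ]
}

# Example use (file names as in the example above):
# check_seq_qal VBOVTA2VT1CS1120Tb0102011615.seq \
#               VBOVTA2VT1CS1120Tb0102011615.qal || echo "names differ"
```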

Explanation of file name VBOVTA2VT1CS1120Tb0102011615.seq:

The lab code is VB. The project code is OVT (a library of Orobanche minor v. Virginia tubercles). The instrument is a 377, so the code is A2. The id for this instrument is VT1. The base-calling program is ABI100, which has a code of C. The run parameter code is S1. Run time is 2 hours, so the run duration is 120. I have used dye terminator chemistry with big dyes, so the code is Tb. I ran these samples on Feb. 1, 2001 at 4:15 pm, so the date/time stamp will be 0102011615. You only have to enter this once. The data itself carries a shorter name.

Explanation of sequence name sM13rroot0002012:

The primer is the standard primer M13f, so the name starts with an "s" followed by the three-letter primer code "M13". The clone name is rroot0002. The sequence was run on lane 12, which has the code "012".

4. Sending Chromatogram Files

If you wish ESTAP to run Phred for you, send the chromatograms of the sequences. Put all sequences from the same sequencing run of the same project in a folder. The name of the folder follows the file naming convention without the .seq or .qal suffix, e.g. VBOVTA2VT1CS1120Tb0102011615. Each chromatogram file follows the sequence naming convention plus the suffix ".ab1", e.g. sM13rroot0002012.ab1.

IX. How to Use dbEST Submission Tool

The dbEST Submission Tool incorporated in ESTAP is available online so that users can conveniently submit their cleansed EST sequences in the ESTAP database to NCBI's dbEST database. Note that only PIs have the privilege to submit their EST sequences to dbEST. PIs may follow the "dbEST Submission" link on the "Project List" page to ask ESTAP to prepare the dbEST submission files for them. During dbEST submission, a contact file, a library file, a publication file, and one or more EST files are created and sent to the primary contact person via email. The EST files are in FASTA format, with clone names and clean sequence ids as identifiers.

After submitting the EST sequences to NCBI's dbEST, the PI should receive an email from NCBI regarding the submitted batch of ESTs. In order to store the dbEST_ID, USER_ID, and GENBANK_ACCN assigned by NCBI in the ESTAP database, the PI should send the confirmation file received from NCBI to the ESTAP curator. The ESTAP curator should use the ReadConfirmationFile program to process the confirmation file.

The ReadConfirmationFile program is located under the java_prog directory of the uncompressed software. It is a stand-alone Java program that reads and parses the dbEST confirmation file and stores the dbEST_ID, USER_ID, and GENBANK_ACCN assigned by NCBI in the ESTAP database. To make this program work, the curator needs to modify a configuration file called "javaConfig.txt" in the java_prog directory and set the correct parameters (see Section IV.1.c). ReadConfirmationFile uses the parameters defined in this file to find the confirmation file and to connect to the ESTAP database. The confirmation file should be named with a .txt suffix (e.g. JC_MCA_020314052943.txt) and placed in the directory path specified in the variable "ncbiConfirmFilePath" in the javaConfig.txt file. Go to the directory where ReadConfirmationFile.class is located and run the ReadConfirmationFile program using the following command:

    java ReadConfirmationFile

Page 7: ESTAP Installation Guide (staff.vbi.vt.edu/estap/download/install_guide.pdf)

e.g. runPhred.exe /home/estap/raw_data/jc-06may02 /home/estap/raw_data/jc-06may02-phred-out

where runPhred.exe is the program name, /home/estap/raw_data/jc-06may02 is the directory that contains the .ab1 files from the JC lab, and /home/estap/raw_data/jc-06may02-phred-out is the output directory where the .seq and .qal files go. Before starting the program, you need to create the output directory /home/estap/raw_data/jc-06may02-phred-out. After the program is done, check phred.log (the program logs its results automatically) in the /home/estap/server_side/estap_exe directory for any problems. If there are none, you may either rename the log file or remove it. In the output directory, remove any files that do not have a .seq or .qal extension.

d. Copy the .seq and .qal files generated by the runPhred.exe program to the correct lab subdirectory under the input directory, e.g. if the files belong to the JC lab, copy them to /home/estap/raw_seq/input/JC.

e. Run the cleansing program to verify the integrity of the seq and qal files, cleanse the sequences, and store the results in the ESTAP database.

Command: estapDriver.exe root_path_name > clean.log 2>&1
e.g., estapDriver.exe /home/estap/raw_seq > clean.log 2>&1

where estapDriver.exe is the program name, root_path_name is the parent directory path of the input directory (see Section IV.1 for the directory hierarchy description), and > clean.log 2>&1 logs the program output (both stdout and stderr) to a log file named clean.log. The clean.log file can be viewed for debugging purposes. After the cleansing program has run successfully, clean.log can be removed to save disk space.

You may use nohup to run the command, which makes the program immune to hang-ups, e.g.,

nohup estapDriver.exe root_path_name > clean.log 2>&1 &

It is better to use nohup and run the program in the background (with & at the end of the command). To check whether the program is running, use the command: ps -ef | grep estapDriver.exe

If for some reason you need to re-run the cleansing for some sequence files, please follow these steps:
(1) Delete the sequence file(s) from the database directly.
(2) Reload the corresponding seq and qal files to the input/[lab_code] directory.
(3) Run estapDriver.exe as described in e.
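The nohup, redirection, and background pattern used above is the same for every pipeline program, so it can be wrapped once. This is a minimal sketch under stated assumptions: run_logged and LAST_PID are hypothetical names, not part of ESTAP.

```shell
# Hypothetical wrapper: start a command under nohup in the background,
# send stdout and stderr to one log file, and remember the process id
# in LAST_PID so the caller can check on it later.
# usage: run_logged log_file command [args...]
run_logged() {
    log_file=$1
    shift
    nohup "$@" > "$log_file" 2>&1 &
    LAST_PID=$!
}
```

For example, run_logged clean.log estapDriver.exe /home/estap/raw_seq starts the cleansing program, and kill -0 "$LAST_PID" then reports whether that process is still alive, as an alternative to searching the ps -ef output.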

f. Run the ESTAP blast program to run BLAST according to the corresponding blast protocols.

Command: runBlast.exe > blast.log 2>&1

where runBlast.exe is the program name and > blast.log 2>&1 logs the program output (both stdout and stderr) to a log file named blast.log. The blast.log file can be viewed for debugging purposes. After the runBlast program has run successfully, blast.log can be removed to save disk space. You may use the nohup command, e.g.,

nohup runBlast.exe > blast.log 2>&1 &

It is better to use nohup and run the program in the background (with & at the end of the command). To check whether the program is running, use the command: ps -ef | grep runBlast.exe

Currently, the runBlast.exe program is set to blast 1000 sequences at a time (in the future the setting can be changed to a bigger number). If there are more than 1000 sequences to blast, you need to run this program multiple times. Check blast.log to see how many sequences must be blasted.

Note: Wait at least 5-10 minutes before starting another runBlast.exe run, so that the sequences to be blasted in the previous run are already recorded in the ESTAP database. Check the blast.log file to see whether the blast command has already appeared. When starting a second runBlast.exe run, use a different log file name instead of blast.log, e.g., blast2.log. The same precautions apply when you need to run the program three or more times.
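The bookkeeping for repeated runs (how many runs are needed, and which log file each one should write) can be worked out with two small helpers. They are a sketch only: the function names are hypothetical, and the batch size of 1000 is the current setting described above.

```shell
# Number of runs needed to process `total` sequences when each run
# handles at most `batch` of them (ceiling division). Default batch: 1000.
runs_needed() {
    total=$1
    batch=${2:-1000}
    echo $(( (total + batch - 1) / batch ))
}

# Log file name for the n-th run: blast.log, blast2.log, blast3.log, ...
log_name_for_run() {
    prefix=$1
    n=$2
    if [ "$n" -le 1 ]; then
        echo "${prefix}.log"
    else
        echo "${prefix}${n}.log"
    fi
}
```

With 2500 sequences waiting, runs_needed 2500 gives 3, and the three runs would log to blast.log, blast2.log, and blast3.log. The same arithmetic applies to the reblast and contig blast programs, which use the same batch size.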

g. Run the reblast program to reblast according to the corresponding blast protocols.

Command: reblast.exe > reblast.log 2>&1

where reblast.exe is the program name and > reblast.log 2>&1 logs the program output (both stdout and stderr) to a log file named reblast.log. The reblast.log file can be viewed for debugging purposes. After the reblast program has run successfully, reblast.log can be removed to save disk space. You may use nohup to run the command, which makes the program immune to hang-ups, e.g.,

nohup reblast.exe > reblast.log 2>&1 &

It is better to use nohup and run the program in the background (with & at the end of the command). To check whether the program is running, use the command: ps -ef | grep reblast.exe

Currently, the reblast.exe program is set to reblast 1000 sequences at a time (in the future the setting can be changed to a bigger number). If there are more than 1000 sequences to reblast, you need to run this program multiple times. Check reblast.log to see how many sequences must be blasted.

Note: Wait at least 5-10 minutes before starting another reblast.exe run, so that the sequences to be blasted in the previous run are already recorded in the ESTAP database. Check the reblast.log file to see whether the blast command has already appeared. When starting a second reblast.exe run, use a different log file name instead of reblast.log, e.g., reblast2.log. The same precautions apply when you need to run the program three or more times.

h. Run the cluster/assembly programs
You have two options: 1) do cluster/assembly for a specific project; 2) do cluster/assembly for all projects.

1) Do cluster/assembly for a specific project

Command: cluster.exe project_id > cluster_project_id.log 2>&1
e.g., cluster.exe 377 > cluster_377.log 2>&1

where cluster.exe is the program name, project_id is the project id of the project you wish to run (note: this is the internal id, not the project code), and > cluster_project_id.log 2>&1 logs the program output (both stdout and stderr) to a log file named cluster_project_id.log. This log file can be viewed for debugging purposes. After the program has run successfully, the log file can be removed to save disk space. You may use nohup to run the command, which makes the program immune to hang-ups, e.g.,

nohup cluster.exe project_id > cluster_project_id.log 2>&1 &

To check whether the program is running, use the command: ps -ef | grep cluster.exe

2) Do cluster/assembly for all projects that are qualified for the process (this is preferred)

Command: clusterAll.exe > cluster.log 2>&1

where clusterAll.exe is the program name and > cluster.log 2>&1 logs the program output (both stdout and stderr) to a log file named cluster.log. This log file can be viewed for debugging purposes. After the program has run successfully, the log file can be removed to save disk space. You may use nohup to run the command, which makes the program immune to hang-ups, e.g.,

nohup clusterAll.exe > cluster.log 2>&1 &

To check whether the program is running, use the command: ps -ef | grep clusterAll.exe

i. Run the contig blast program to blast assembled contigs according to the corresponding blast protocols.

Command: blastAllContigs.exe > contig_blast.log 2>&1

where blastAllContigs.exe is the program name and > contig_blast.log 2>&1 logs the program output (both stdout and stderr) to a log file named contig_blast.log. The contig_blast.log file can be viewed for debugging purposes. After the blastAllContigs program has run successfully, the log file can be removed to save disk space. You may use nohup to run the command, which makes the program immune to hang-ups, e.g.,

nohup blastAllContigs.exe > contig_blast.log 2>&1 &

It is better to use nohup and run the program in the background (with & at the end of the command). To check whether the program is running, use the command: ps -ef | grep blastAllContigs.exe

Currently, the blastAllContigs.exe program is set to blast 1000 contigs at a time (in the future the setting can be changed to a bigger number). If there are more than 1000 contigs to blast, you need to run this program multiple times. Check contig_blast.log to see how many sequences must be blasted.

Note: Wait at least 5-10 minutes before starting another blastAllContigs.exe run, so that the sequences to be blasted in the previous run are already recorded in the ESTAP database. Check the contig_blast.log file to see whether the blast command has already appeared. When starting a second blastAllContigs.exe run, use a different log file name instead of contig_blast.log, e.g., contig_blast2.log. The same precautions apply when you need to run the program three or more times.

j. Run the contig reblast program to reblast assembled contigs according to the corresponding blast protocols.

Command: reblastAllContigs.exe > contig_reblast.log 2>&1

where reblastAllContigs.exe is the program name and > contig_reblast.log 2>&1 logs the program output (both stdout and stderr) to a log file named contig_reblast.log. This log file can be viewed for debugging purposes. After the reblastAllContigs program has run successfully, the log file can be removed to save disk space. You may use nohup to run the command, which makes the program immune to hang-ups, e.g.,

nohup reblastAllContigs.exe > contig_reblast.log 2>&1 &

It is better to use nohup and run the program in the background (with & at the end of the command). To check whether the program is running, use the command: ps -ef | grep reblastAllContigs.exe

Currently, the reblastAllContigs.exe program is set to reblast 1000 contigs at a time (in the future the setting can be changed to a bigger number). If there are more than 1000 contigs to reblast, you need to run this program multiple times. Check contig_reblast.log to see how many sequences must be blasted.

Note: Wait at least 5-10 minutes before starting another reblastAllContigs.exe run, so that the sequences to be blasted in the previous run are already recorded in the ESTAP database. Check the contig_reblast.log file to see whether the blast command has already appeared. When starting a second reblastAllContigs.exe run, use a different log file name instead of contig_reblast.log, e.g., contig_reblast2.log. The same precautions apply when you need to run the program three or more times.

k. Set up and run the InterProScan application
InterProScan is a tool developed at EBI that combines different protein signature recognition methods into one resource. ESTAP wraps this tool to query contigs and singlets against the InterPro public databases and automatically annotate protein functions. The GO terms of the matched proteins are linked to the contigs and singlets. Please follow the instructions below to set up and run the InterProScan application.

1) Download iprscan
Go to the ftp://ftp.ebi.ac.uk/pub/software/unix/iprscan/ site and download the following files:
(1) iprscan_vXXX.tar.gz, where XXX is the version number of the iprscan release, e.g., 3.2
(2) iprscan_bin_XXX.tar.gz, where XXX is your platform, e.g., Linux
(3) iprscan_DATA_XXX.tar.gz, where XXX is the version number of the InterPro database release, e.g., 6.1

2) Uncompress the files, in the order specified above, into your home directory ($HOME_DIR). The iprscan home directory will be $HOME_DIR/iprscan, e.g., /home/estap/iprscan.

3) First time configuration
Go to the iprscan home directory and run: perl CONFIG.pl
Then:

(1) Modify CONFIG.pl as described below (save a copy of the original CONFIG.pl file as CONFIG_orig.pl).

(i) At the beginning of CONFIG.pl (in main), along with the other variable declarations, add the following lines:

    my $IprPWD = "iprpwd.txt";
    my $pwd = $ENV{PWD};
    unless ($pwd) {
        $pwd = `pwd`;
        chomp $pwd;
    }

(ii) Before the line setting up applications:

    $applset = get_user_prompt("Setup applications $first (y|n)")

add these lines:

    open (FF, ">$IprPWD") || die "Cannot create $IprPWD: $!";
    print FF $pwd;
    close FF;

(2) Modify the InterProScan.pl file as described below (save a copy of the original file as InterProScan_orig.pl).

(i) Replace

    my $seqfile = $ARGV[0];

with

    my $seqfile = $ARGV[1];

(ii) After the line

    my $UserId = Manager->getUserId();

add these lines:

    my $OutDir = $ARGV[0];
    print "$OutDir\n";
    # Do not use $path in this file since it won't be recognized by the iprscan wrapper program.
    # $OutDir instead of $path is used as an argument for the iprscan wrapper program to store iprscan result files.

(iii) Replace $path with $OutDir in all the lines following the above comment lines.

4) Reconfiguration
Go to the iprscan home directory and run: perl CONFIG.pl
You will be asked to choose the member databases you would like to search against. If you only use the public data and applications that come with the iprscan distribution, you should answer "n" for SignalPHMM and TMHMM during configuration, since they are not public. When the configuration is complete, a file called iprpwd.txt will be created under the iprscan home directory. The iprpwd.txt file stores the iprscan path information, which is used by the iprscan wrapper programs. You must copy the iprpwd.txt file to the java_prog directory, where the iprscan wrapper programs are placed, so that the Java programs can read the iprscan path information from this file.

5) Run iprscan
Go to the java_prog directory, where the Java application and configuration files (javaConfig.txt and iprpwd.txt) are placed, and set the proper parameters in the configuration file as described in Section IV.1.c. You may analyze a specific project or all projects at a time.

If you wish to run a specific project, use the command:

nohup java InterProDriver lab_code project_code > ipr_project_code.log 2>&1 &
e.g., nohup java InterProDriver JC MCA > ipr_MCA.log 2>&1 &

where InterProDriver is the program name, lab_code is the two-letter lab code of the lab that the project belongs to, project_code is the three-letter project code of the project being analyzed, and > ipr_project_code.log 2>&1 logs the program output (both stdout and stderr) to a log file named ipr_project_code.log.

If you wish to run all projects that are qualified for this procedure, use the command:

nohup java InterProScanAll > ipr.log 2>&1 &

where InterProScanAll is the program name and > ipr.log 2>&1 logs the program output (both stdout and stderr) to a log file named ipr.log.

The log files can be viewed for debugging purposes. After the program has run successfully, the log files can be removed to save disk space.
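Because the lab code and project code have fixed lengths, the single-project command line can be checked before it is run. The helper below is a sketch only: ipr_command is a hypothetical name, and it assumes both codes consist of letters only, as the description "two-letter" and "three-letter" suggests. It prints the command rather than executing it.

```shell
# Hypothetical helper: validate the codes, then print (not run) the
# InterProDriver command line for one project.
# Assumption: lab codes are exactly two letters, project codes three letters.
ipr_command() {
    lab_code=$1
    project_code=$2
    case "$lab_code" in
        [A-Za-z][A-Za-z]) ;;
        *) echo "lab code must be two letters: $lab_code" >&2; return 1 ;;
    esac
    case "$project_code" in
        [A-Za-z][A-Za-z][A-Za-z]) ;;
        *) echo "project code must be three letters: $project_code" >&2; return 1 ;;
    esac
    echo "nohup java InterProDriver $lab_code $project_code > ipr_${project_code}.log 2>&1 &"
}
```

Calling ipr_command JC MCA prints the example command shown above, and a malformed code such as a one-letter lab code is rejected with a message on stderr.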

l. Run the genomic DNA assembly programs
You have two options: 1) do DNA assembly for a specific genome project; 2) do DNA assembly for all qualified genome projects.

1) Do assembly for a specific genome project

Command: nohup gAssemble.exe project_id > assemble_project_id.log 2>&1
e.g., nohup gAssemble.exe 377 > assemble_377.log 2>&1

where gAssemble.exe is the program name, project_id is the project id of the project you wish to run (note: this is the internal id, not the project code), and > assemble_project_id.log 2>&1 logs the program output (both stdout and stderr) to a log file named assemble_project_id.log.

2) Do assembly for all genome projects that are qualified for the process (this is preferred)

Command: nohup gAssembleAll.exe > assemble.log 2>&1

where gAssembleAll.exe is the program name and > assemble.log 2>&1 logs the program output (both stdout and stderr) to a log file named assemble.log.

m. Prepare EST database files from user projects
The ESTAP Web interface allows users to blast specific sequence(s) against the ESTs from their projects. We provide a program called writeAndFormatFasta.exe to create and format EST database files from user projects. This program can be scheduled to run daily, or as often as you wish, to give users the most recent EST databases to blast against. To run this program, you need to create a directory where the EST database files will be placed, e.g., /home/estap/estdb. This directory path should be used for the ESTDBFilePath parameter setting in the web.xml file (see Section V.3.e) so that the ESTAP local blast Web service can recognize

4. Schedule Programs
The programs above can be run manually as described, or scheduled to run automatically using the UNIX cron facility. Please see the UNIX man page for the crontab command. Following is an example of the scheduling:

0 9,11,17 * * * ~/estap_exe/estapDriver.sh >> estap.log 2>&1
0 19,20,21 * * * ~/estap_exe/runBlast.sh >> blast.log 2>&1
30 22,23,24 * * * ~/estap_exe/reblast.sh >> reblast.log 2>&1
30 3 2 ~/estap_exe/getAndFormatNcbiDB.sh > get_db.log 2>&1
0 5 * * * ~/estap_exe/clusterAll.sh >> ~/estap_exe/cluster.log 2>&1
0,30 19 * * * ~/estap_exe/blastAllContigs.sh >> ~/estap_exe/blast_contigs.log 2>&1
0,30 12 * * * ~/estap_exe/reblastAllContigs.sh >> ~/estap_exe/reblast_contigs.log 2>&1
0,30 7 * * * ~/estap_exe/gAssembleAll.sh >> ~/estap_exe/assemble.log 2>&1
5 2 * * * ~/estap_exe/writeAndFormatFasta.sh >> ~/estap_exe/estap_formatdb.log 2>&1

The shell scripts used here call the corresponding programs and make sure the programs are run under the correct directories with the correct environment variable settings. These scripts should be revised when your directory setup and environment variable settings are different.

5. Trouble-shooting
When a program stops abnormally (e.g., power or network failure, killed processes, etc.), some data may be only partially analyzed and partial results may be stored in the database. You may need to clean up (remove the partial results from the database) and rerun the program. Please use the following instructions:

a. Go to the ESTAP database table PROG_RUN_STATUS in the ESTAP_SYS_DEF schema. This table records which programs are running or completed, when they were started, and when they were completed. If a program is killed in the middle of a run, the end_time will not be recorded and the status will still be "running" (status = 1). You need to change the status from 1 to 0 to indicate that the program is no longer running.

b. If the program was runBlast.exe or reblast.exe, after you change the status in the PROG_RUN_STATUS table, go to the ANALYSIS_TRACKING table in the ESTAP_ANALYSIS schema and delete the rows created by the program that have status = 1, e.g.,
DELETE from ANALYSIS_TRACKING where status = 1;
Commit the changes after you are sure that you want to delete them. Then rerun the runBlast.exe program.

c. If the program was cluster.exe, after you change the status in the PROG_RUN_STATUS table, go to the CLUSTER_SUMMARY table in the ESTAP_ANALYSIS schema and delete the row with the project id that you specified when you ran cluster.exe. Commit the changes after you are sure that you want to delete them. Then rerun the cluster.exe program.

d. If the program was clusterAll.exe, after you change the status in the PROG_RUN_STATUS table (look for "cluster.exe" and "clusterAll.exe" in the prog_name column), go to the CLUSTER_SUMMARY table in the ESTAP_ANALYSIS schema and delete the rows created by the program that have a status of 1. Commit the changes after you are sure that you want to delete them. Then rerun the clusterAll.exe program.

e. If the program was blastAllContigs.exe or reblastAllContigs.exe, after you change the status in the PROG_RUN_STATUS table, go to the CONTIG_ANALYSIS table in the ESTAP_ANALYSIS schema and delete the rows created by the program that have status = 1, e.g.,
DELETE from CONTIG_ANALYSIS where status = 1;
Commit the changes after you are sure that you want to delete them. Then rerun the blastAllContigs.exe program.
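The clean-up rules in a through e can be collected in one place. The sketch below only prints the SQL for review and never connects to the database. The function name cleanup_sql is hypothetical. It also assumes that the prog_name column stores the program names exactly as written here, and that the tables can be addressed with their schema names as prefixes. No COMMIT is printed, since the changes should be committed only after they have been checked.

```shell
# Sketch only: print (do not execute) the clean-up SQL for an interrupted
# program. Assumptions: prog_name holds names such as 'runBlast.exe', and
# schema-qualified table names are accepted.
# usage: cleanup_sql program_name [project_id]
cleanup_sql() {
    prog=$1
    project_id=${2:-}
    case "$prog" in
        runBlast.exe|reblast.exe)
            names="'$prog'"
            delete="DELETE FROM ESTAP_ANALYSIS.ANALYSIS_TRACKING WHERE status = 1;" ;;
        cluster.exe)
            case "$project_id" in
                ''|*[!0-9]*) echo "cluster.exe needs a numeric project id" >&2; return 1 ;;
            esac
            names="'cluster.exe'"
            delete="DELETE FROM ESTAP_ANALYSIS.CLUSTER_SUMMARY WHERE project_id = $project_id;" ;;
        clusterAll.exe)
            names="'cluster.exe', 'clusterAll.exe'"
            delete="DELETE FROM ESTAP_ANALYSIS.CLUSTER_SUMMARY WHERE status = 1;" ;;
        blastAllContigs.exe|reblastAllContigs.exe)
            names="'$prog'"
            delete="DELETE FROM ESTAP_ANALYSIS.CONTIG_ANALYSIS WHERE status = 1;" ;;
        *)
            echo "no clean-up rule for: $prog" >&2; return 1 ;;
    esac
    # Step a: mark the program as no longer running. Steps b-e: remove partial rows.
    echo "UPDATE ESTAP_SYS_DEF.PROG_RUN_STATUS SET status = 0 WHERE prog_name IN ($names) AND status = 1;"
    echo "$delete"
}
```

For instance, cleanup_sql cluster.exe 377 prints the status update followed by the delete for project 377. The statements can then be reviewed and pasted into an SQL session.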

V. How to Install and Run Web Programs

1. Install Oracle Client including JDBC
Please contact Oracle to obtain licensing and product support. ESTAP uses the Oracle 9i client.

2. Install JDK, Servlet, and Web Service Software
a. To install JDK 1.3 (Windows, Linux, Solaris) or higher, please go to http://java.sun.com/j2se
b. To install Tomcat, please go to http://jakarta.apache.org/tomcat
c. To install Apache Axis, please go to http://ws.apache.org/axis. Apache Axis beta 3 is used for the ESTAP local blast Web service.

3. Configure Tomcat
a. Visit http://jakarta.apache.org/tomcat for instructions on Tomcat configuration.
b. Set up the correct PATH and CLASSPATH environment.
c. Modify server.xml in the Tomcat conf directory. In this file, add the following for the estap context:

    <Context path="/estap" docBase="estap" showdebuginfo="true" debug="DEBUG" reloadable="true" crossContext="false">
        <Logger className="org.apache.catalina.logger.FileLogger" prefix="localhost_estap_log." suffix=".txt" timestamp="true"/>
    </Context>

d. Place the estap Web application under the Tomcat webapps directory. Copy the estap directory from the web_side directory of the uncompressed software to the webapps directory.

e. Modify the web.xml file in the estap/WEB-INF directory. The following section in web.xml is specific to ESTAP. Please modify the text in bold for your own parameter settings.

    <servlet>
        <servlet-name>ProjectList</servlet-name>
        <servlet-class>ProjectList</servlet-class>
        <init-param>
            <param-name>jdbcDriverClassName</param-name>
            <param-value>oracle.jdbc.driver.OracleDriver</param-value>
        </init-param>
        <init-param>
            <param-name>jdbcURL</param-name>
            <param-value>jdbc:oracle:thin:@server.vt.edu:1521:estapdb</param-value>
        </init-param>
        <init-param>
            <param-name>dbUserName</param-name>
            <param-value>estap_user</param-value>
        </init-param>
        <init-param>
            <param-name>dbUserPassword</param-name>
            <param-value>password</param-value>
        </init-param>
        <init-param>
            <param-name>dbInitConnect</param-name>
            <param-value>1</param-value>
        </init-param>
        <init-param>
            <param-name>dbMaxConnect</param-name>
            <param-value>50</param-value>
        </init-param>
        <init-param>
            <param-name>dbESTFilePath</param-name>
            <param-value>/home/estap/dbEST_Outgoing</param-value>
        </init-param>
        <init-param>
            <param-name>ESTDBFilePath</param-name>
            <param-value>/home/estap/estdb</param-value>
        </init-param>
        <init-param>
            <param-name>blastallPath</param-name>
            <param-value>/home/estap/blast/blastall</param-value>
        </init-param>
        <init-param>
            <param-name>formatdbPath</param-name>
            <param-value>/home/estap/blast/formatdb</param-value>
        </init-param>
        <init-param>
            <param-name>estapEmail</param-name>
            <param-value>estap@vbi.vt.edu</param-value>
        </init-param>
        <load-on-startup>1</load-on-startup>
    </servlet>

The following list defines the parameters:
jdbcURL: the URL for the JDBC connection
dbUserName: the ESTAP database user name
dbUserPassword: the ESTAP database user password
dbInitConnect: the number of database connections to be made initially
dbMaxConnect: the maximum number of database connections
dbESTFilePath: the directory path to store dbEST submission files
ESTDBFilePath: the directory path to store the EST files in FASTA format and their formatted files used by the ESTAP local blast Web service
blastallPath: the path for the blastall program
formatdbPath: the path for the formatdb program
estapEmail: the ESTAP curator's email address

f. Set heap size for JVM
Some of the ESTAP Web programs require more memory than the JVM default heap size of 64MB. To avoid an "out of memory" error, set a larger heap size for the JVM. To do this, set the "JAVA_OPTS" variable in the catalina.sh/catalina.bat or setclasspath.sh/setclasspath.bat file in the tomcat_home/bin directory. For example, if you wish to set the heap size to 256MB, you may add the line "set JAVA_OPTS=-mx256m" to the catalina.bat file.

g. Start the Tomcat server
To start the Tomcat server, execute tomcat_home/bin/startup.bat on Windows (or tomcat_home/bin/startup.sh on Unix/Linux). Then enter the URL http://localhost/ in your browser and make sure that you get the Tomcat welcome page, not an error message saying that the page cannot be displayed or that the server cannot be found. If you choose to use port number 8080 instead of 80, you will need to use a URL that includes the port number, such as http://localhost:8080/. If this is successful, enter http://localhost/estap/servlet/Login or http://localhost:8080/estap/servlet/Login to see whether the ESTAP login page is displayed.
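The rule for when the port number appears in the URL is easy to get wrong when scripting a check. The following sketch builds the login URL from a host and port. The function name estap_login_url is hypothetical, and the path /estap/servlet/Login is the one given in the text above.

```shell
# Hypothetical helper: build the ESTAP login URL for a host and port.
# Port 80 is the HTTP default, so it is left out of the URL.
estap_login_url() {
    host=${1:-localhost}
    port=${2:-80}
    if [ "$port" = "80" ]; then
        echo "http://${host}/estap/servlet/Login"
    else
        echo "http://${host}:${port}/estap/servlet/Login"
    fi
}
```

If curl is installed, curl -sf "$(estap_login_url localhost 8080)" gives a quick command-line check that the page is being served, in addition to opening it in a browser.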

4. Configure Axis
a. Go to http://ws.apache.org/axis for downloading and installation instructions.

b. Place the downloaded axis directory under the tomcat_home/webapps directory.

c. Move the server-config.wsdd file from the tomcat_home/webapps/estap/WEB-INF directory to the tomcat_home/webapps/axis/WEB-INF directory.

d. Add all jar files under the tomcat_home/webapps/axis/WEB-INF/lib and tomcat_home/webapps/axis/WEB-INF/classes directories to your CLASSPATH environment variable.

e. Move the blastService and createESTDBService directories from the tomcat_home/webapps/estap/WEB-INF/classes directory to the tomcat_home/webapps/axis/WEB-INF/classes directory.

f. Replace the server name and Tomcat port with your own server name and port number in the JBlastServiceLocator.java and JCreateESTDBServiceLocator.java files under the blastService/ws and createESTDBService/ws directories, respectively. E.g., for blastService, run the commands:
cd blastService
cd ws
Edit the JBlastServiceLocator.java file by changing http://arda.vbi.vt.edu:8080 to http://yourServerName:yourPort. Then compile the modified file by running the command:
javac JBlastServiceLocator.java
Repeat the above process for the createESTDBService directory.

g. Copy the blastService and createESTDBService directories to the tomcat_home/webapps/estap/WEB-INF/classes directory.

h. Compile EstapWebBlast.java by running the command:
javac EstapWebBlast.java

i. Restart the Tomcat server and go to http://localhost/axis or http://localhost:8080/axis to view the list of deployed Web services. If Axis is configured correctly, you will see the createESTDB and blast services.
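The edit in step f can be done non-interactively. The sketch below is one possible way to do it, using a hypothetical function name set_service_url. It keeps a copy of the original file with an .orig suffix and rewrites every occurrence of the old URL. It assumes neither URL contains the characters | or &, which are special to sed in this position.

```shell
# Hypothetical helper: replace a server URL inside a source file,
# keeping the untouched original as <file>.orig.
# usage: set_service_url file old_url new_url
set_service_url() {
    file=$1
    old_url=$2
    new_url=$3
    [ -f "$file" ] || { echo "no such file: $file" >&2; return 1; }
    cp "$file" "${file}.orig" || return 1
    # '|' is the sed delimiter because URLs contain '/'. The dots in
    # old_url act as regex wildcards, which still match literal dots.
    sed "s|$old_url|$new_url|g" "${file}.orig" > "$file"
}
```

For example, set_service_url JBlastServiceLocator.java http://arda.vbi.vt.edu:8080 http://yourServerName:yourPort performs the change described in step f. The file still has to be recompiled with javac afterwards.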

VI. How to Register and Update ESTAP Users
ESTAP defines 3 types of users with different role permissions: sponsors, PIs, and other users.

Sponsors are those who fund the development of ESTAP. If there is no "sponsor" according to this definition, then the person who is in charge of the ESTAP system becomes the sponsor. There is one sponsor per institution. Sponsors can register/update PIs and have all PI privileges.

PIs are the people responsible for the projects. PIs can register/update projects, project users, and project protocols. PIs can also register ESTAP codes (e.g., primer, instrument, base caller, etc.), submit ESTs of their projects to dbEST, and view their project data and analysis results.

Other users include primary contacts with update permission, primary contacts with read-only permission, and project viewers with read-only permission. Primary contacts with update privileges (associated with particular project(s)) can update projects and project protocols and view project data and analysis results. Primary contacts with read-only permission and project viewers (associated with particular project(s)) can view project data and analysis results.

At least one sponsor must exist before PIs can register their projects and users. Sponsors are registered by the ESTAP curator/database administrator, who has direct access to the ESTAP tables in the database. Please use the following instructions to register sponsors:

1) In the INSTITUTION table, modify the first record, the one with institution_id = 1. Change the institution_name to the institution name of the sponsor. If there is more than one sponsor to be added, add a new record for each remaining sponsor. For each record, increment institution_id by 1 and enter the institution name and description in the institution_name and institution_desc columns.

2) In the LAB table, modify the record with the lab_code of XX. Change all columns of this record to the desired values. The lab_email entry must be the email address of the sponsor. The lab_alt_email entry is the email address of the primary contact in the sponsor's lab. Each sponsor has one lab. You may add more records if there is more than one sponsor.

3) In the PERSON table, modify the person record with the person_id of 1. Change the user_name to the email address of the sponsor and the password to the sponsor's password for logging on to the Web site. The user_type is 1 and the sponsor_id is 1 (referring to the person_id of the sponsor himself/herself in the PERSON table). The other columns are self-explanatory by their names. Add more records if there is more than one sponsor, and increment person_id for each additional sponsor.

4) In the CONTACT table, change the record with the contact_id of 1 and person_id of 1 to the contact information of the sponsor. Add more records if there is more than one sponsor, and increment contact_id for each additional sponsor.

Once sponsors are registered in the ESTAP system, they can log on to the ESTAP Web site to register or update PIs in their institutions. Once PIs are registered, they can log on to the ESTAP Web site to register or update their projects. In addition, PIs can register or update the primary contacts and the project viewers for their projects.

VII. How to Register and Update EST Projects and Protocols

1. Project Registration and Update
ESTAP defines two types of projects: regular and virtual projects. A regular project refers to a specific EST library that a lab constructed. The regular project information includes biological information regarding the tissues and cloning methods used in cDNA library construction, and EST sequencing information. In order to analyze combined libraries from a particular taxonomy, we introduced the virtual project concept. PIs can register a virtual project that combines their regular projects from a particular taxonomy.

To register a regular project or a virtual project, go to the "Project List" page, click the "Register Regular Project" link or the "Register Virtual Project" link, and follow the instructions in the registration form carefully. PIs and the primary contacts with update privilege may also update the project information via the Web.

The current version of ESTAP does not allow updating a project once the sequences of that project have been processed and stored in the ESTAP database. If the user needs to make corrections to the project information after the sequences are processed, he or she has to make a request to the ESTAP curator. The ESTAP curator should delete the already loaded sequences for that project, make the corrections to the project information, and then reload the sequences by running the estapDriver.exe program.

2. Protocol Registration and Update
ESTAP offers the following protocols for data analysis:

1) Cleansing protocols
The cleansing protocols contain procedures to cleanse raw sequences by removing low quality end sequences, vectors, and polyA (or polyT), and by screening for chimeras and contamination. There are two types of cleansing protocols: the cleansing protocol for BLAST and the cleansing protocol for assembly. The sequences cleansed by the cleansing protocol for BLAST are used for BLAST searching, and the sequences cleansed by the cleansing protocol for assembly are used for the cluster/assembly analysis. The user may use the same procedures and parameters for both protocols. There are several key points for the cleansing to be successful:

a. The cleansing program uses the 5' and 3' vector fragments provided in the project registration to match the vector portions in the clones. So the 5' and 3' vector fragments (both the orientation of the fragments and the sequences) must be correct.

b. The vector and insert adaptor sequences provided in the project registration should not be the whole adaptor sequences. Only the regions that are present in the clones should be included.

c. The sequencing primer code in the sequence names of the raw sequence data must be correct. The correct primer code indicates the correct orientation of the sequencing (5' or 3'). A wrong orientation will cause the removal of the vector/adaptor to fail. Please check the primer table carefully to make sure the orientation of the primer is correct.

d. When the cleansing procedure does not seem to give the expected result, please check whether the above key points have been followed.

2) Blast protocol
The blast protocol provides procedures and parameters for BLAST (BLASTN, BLASTX, and/or TBLASTX) searches. Users may choose to run all, some, or none of the blast procedures in the protocol.

3) Cluster/assembly protocol
The cluster/assembly protocol provides procedures and parameters for clustering and assembly analysis. Users register this protocol only if they wish to do cluster/assembly. To let ESTAP run the cluster/assembly programs, users need to register the cluster protocol for each project they wish to cluster/assemble. Projects are not clustered automatically without a cluster protocol. Go to the View Project page of the project you wish to cluster/assemble and follow the link called Cluster Protocol to register the protocol and set the procedure parameters. The parameters that ESTAP provides are from the d2_cluster and CAP3 programs. The Register Cluster/Assembly Protocol page has a link, "Cluster/Assembly Protocol Guide", which provides paper references for the d2_cluster and CAP3 programs. Once the protocol is registered, ESTAP will automatically perform cluster/assembly for your project according to the protocol you registered.

If you wish to perform cluster/assembly for combined libraries (not for each project), you need to register a virtual project and select the taxonomy of the libraries. Then register the BLAST and cluster/assembly protocols for this virtual project. Once the cluster protocol is registered, ESTAP will automatically perform cluster/assembly for the virtual project.

The cluster/assembly protocol also provides an option to perform InterProScan analysis. Users may choose this option to let ESTAP automatically annotate the protein functions of singlets and contigs after the clustering/assembly procedure is performed.

PIs and the primary contacts with update privilege may also update the protocols. The updated protocols will be effective for new sequence data, not for the data that have already been analyzed. If the user wishes to reanalyze the old data, he or she must make such a request to the ESTAP curator. The curator should use the following instructions.

1) If the sequences of a project must be re-cleansed using the new protocol, the curator must delete the sequences of the project from the ESTAP database and reload the sequences by running the estapDriver.exe program. Use the following command to delete sequences from the database:

    delete from raw_seq_file where project_id = x;
    commit;

project_id is the internal id for the project. The curator may get the id from the PROJECT table by matching the lab code and project code of the project:

    select project_id from project where lab_code = 'xx' and user_proj_code = 'xxx';

Please see Section IV.4.e for estapDriver.exe usage.

2) If the sequences of a project must be re-blasted using the new BLAST protocol, the curator must delete the old blast results of that project from the ESTAP database and run the runBlast.exe program. Use the following command to delete the blast results from the database:

    delete from analysis_tracking where project_id = x;
    commit;

Please see Section IV.4.f for runBlast.exe usage.

3) If the sequences of a project must be re-clustered using the new clustering/assembly protocol, the curator must delete the old clustering/assembly and contig blast results of that project from the ESTAP database, then run the cluster.exe program for clustering/assembly and the blastAllContigs.exe program for contig blast. Use the following command to delete the old results from the database:

    delete from cluster_summary where project_id = x;
    commit;

Use the "cluster.exe [project_id]" command to redo the clustering/assembly analysis and "blastAllContigs.exe" to redo the contig blast analysis. Note: [project_id] is the internal id for that project. Please see Section IV.4.h and Section IV.4.i for how to run the cluster.exe and blastAllContigs.exe programs.
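The three reanalysis cases above pair one table to clear with one or two programs to rerun. The following Python sketch only restates that pairing as a lookup so the statements can be generated for a given internal project id; it is not part of ESTAP, and the names REANALYSIS_STEPS and reanalysis_plan are hypothetical.

```python
# Table to clear and programs to rerun for each kind of reanalysis,
# taken from instructions 1) to 3) above.
REANALYSIS_STEPS = {
    "cleanse": ("raw_seq_file", ["estapDriver.exe"]),
    "blast": ("analysis_tracking", ["runBlast.exe"]),
    "cluster": ("cluster_summary", ["cluster.exe {project_id}", "blastAllContigs.exe"]),
}


def reanalysis_plan(kind, project_id):
    """Return (SQL statements, programs to rerun) for one project."""
    table, programs = REANALYSIS_STEPS[kind]
    project_id = int(project_id)  # the internal id, not the project code
    sql = [f"delete from {table} where project_id = {project_id};", "commit;"]
    commands = [program.format(project_id=project_id) for program in programs]
    return sql, commands
```

The sketch prints nothing and touches no database; the curator still runs the returned statements and programs by hand.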

VIII. How to Prepare Raw EST Data

Users can send ESTAP either the chromatograms of the sequences (.ab1 files) or both sequence files and their corresponding quality score files in the FASTA format. If the user sends ESTAP the chromatograms, ESTAP will use the Phred base calling program to generate the sequences and the quality scores.

Each sequence file contains sequences from the same sequencing run of the same project. Each quality score file contains the quality scores of the sequences in the corresponding sequence file. Each sequence or quality score string in the file must start with a ">" followed by a sequence name.

To process data files, ESTAP requires that data providers use a standard file name for each file and a standard sequence name for each sequence in the file. The file name contains specific information about that file, and the sequence name contains specific information about the sequence. All the sequences in the same file must be from the same project and the same sequencing run. The naming convention and rules are as follows:

• Both the file name and the sequence name are required to be of a fixed length.
• Each element in the name must conform to a required number of characters or digits.
• Pad with 0's (zeros) in front of any element that is shorter than the required length.

The file names and sequence names are case-sensitive.

1. File naming convention

All sequence files will have ".seq" as a suffix and quality score files will have ".qal" as a suffix. Every seq string must have an exactly matching qal string by name. The file names use the following naming convention: LLPPPTTIIIBRRNNNCCYYMMDDHHMM.seq for the sequence file and LLPPPTTIIIBRRNNNCCYYMMDDHHMM.qal for the corresponding quality score file.

LL - two-alphanumeric letter lab code (provided by VBI), e.g. UN
PPP - three-alphanumeric letter project id (provided by VBI), e.g. MCA
TT - two-alphanumeric letter instrument type (provided by VBI), e.g. A1
III - three-alphanumeric letter instrument id (user defined code that is unique to the lab), e.g. VT1
B - one-alphanumeric letter base calling program id (provided by VBI), e.g. A
RR - two-alphanumeric letter run parameter id (provided by VBI), e.g. S1
NNN - three-digit run duration in minutes (actual running time in minutes, e.g. 090 for 90 minutes)
CC - two-alphanumeric letter sequencing chemistry id (provided by VBI), e.g. Pr
YYMMDDHHMM - the date and time at which the sequence was run, e.g. 0104150930
    YY - two-digit year
    MM - two-digit month
    DD - two-digit day
    HH - two-digit hour
    MM - two-digit minute
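As an illustration of the fixed-width rule, the following Python sketch splits a file name into the elements listed above and checks their lengths. It is not part of ESTAP: the function and field names are hypothetical, it assumes the suffix is written as ".seq" or ".qal", and it interprets the two-digit year with Python's %y rule (00-68 become 2000-2068).

```python
from datetime import datetime

# Element names and widths, in the order given by the naming convention.
FILE_NAME_FIELDS = [
    ("lab_code", 2),
    ("project_code", 3),
    ("instrument_type", 2),
    ("instrument_id", 3),
    ("base_caller", 1),
    ("run_parameter", 2),
    ("run_minutes", 3),
    ("chemistry", 2),
    ("run_datetime", 10),
]


def parse_file_name(file_name):
    """Split an ESTAP data file name into its fixed-width elements."""
    stem, dot, suffix = file_name.rpartition(".")
    if dot != "." or suffix not in ("seq", "qal"):
        raise ValueError("file name must end in .seq or .qal")
    expected = sum(width for _, width in FILE_NAME_FIELDS)
    if len(stem) != expected:
        raise ValueError(f"expected {expected} characters before the suffix, got {len(stem)}")

    fields, pos = {"suffix": suffix}, 0
    for name, width in FILE_NAME_FIELDS:
        fields[name] = stem[pos:pos + width]
        pos += width

    if not fields["run_minutes"].isdigit():
        raise ValueError("run duration must be three digits")
    fields["run_minutes"] = int(fields["run_minutes"])
    # strptime raises ValueError for an impossible date or time.
    fields["run_datetime"] = datetime.strptime(fields["run_datetime"], "%y%m%d%H%M")
    return fields
```

Because every element has a fixed width, slicing by position is enough; no separator characters are needed, which is why short elements must be zero-padded.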

2. Sequence Naming Convention

The sequence names in the sequence and quality score files must be in the following format: TPPPNNNNNNNNNLLL

T - one-letter primer type: u for user defined primer and s for standard primer
PPP - three-alphanumeric letter primer id (provided or confirmed by VBI), e.g. T3a
NNNNNNNNN - nine-alphanumeric letter unique clone name within the project (user defined)
LLL - three-digit lane or capillary number, e.g. 012 for lane 12 in the sequence gel

Note: Using the correct primer code (first 4 letters) is very important for the cleaning procedure. The correct primer code indicates the correct orientation of the sequencing (5' or 3'). A wrong orientation will cause the removal of vector/adaptor to fail.
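The sequence name can be checked the same way. The sketch below is a hypothetical helper, not ESTAP code; it assumes that "alphanumeric" means the characters A-Z, a-z and 0-9.

```python
import re

# T (1) + PPP (3) + NNNNNNNNN (9) + LLL (3 digits) = 16 characters.
SEQUENCE_NAME = re.compile(r"([us])([A-Za-z0-9]{3})([A-Za-z0-9]{9})([0-9]{3})")


def parse_sequence_name(name):
    """Split a sequence name into primer code, clone name and lane number."""
    match = SEQUENCE_NAME.fullmatch(name)
    if match is None:
        raise ValueError(f"not a valid sequence name: {name!r}")
    primer_type, primer_id, clone, lane = match.groups()
    return {
        "primer_type": "user defined" if primer_type == "u" else "standard",
        "primer_code": primer_type + primer_id,  # the first 4 letters
        "clone_name": clone,
        "lane": int(lane),
    }
```

The pattern is case-sensitive on purpose, since the guide states that sequence names are case-sensitive.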

3. Example of a Sequence and a Quality Score File

a. Sequence file
File name: VBOVTA2VT1CS1120Tb0102011615.seq
The following is the content of the file:

    >sM13rroot0002012
    GTCGTAGCTAGCTAGCTGTGGACTGACTGA
    CTAGCATCAGCTTGACTGACTAGCATAGCT
    AGCATACGCTATCGATCAGGGCAGACTAGC
    TGACTAGCATGACTAGCATT
    >sM13rroot0003013
    GTCGTAGCTTAGCTAGTCAGATCAGCATCG
    AATACGATCAGCTAGCTGTGGACTGACTGA
    CTAGCATCAGCTTGACTGACTAGCATAGCT
    AGCATACGCTTACGTACACAGACACACA

b. Quality score file
File name: VBOVTA2VT1CS1120Tb0102011615.qal
The following is the content of the file:

    >sM13rroot0002012
    9 13 16 17 16 12 7 7 19 19 25 27 29 28 30 31 29 30 29 29 30 30 37 37 33 37 40 44 40 35 35 40 40 40 46 40 35 35 35 35 35 35 35 40 40 37 35 40 40 40 40 40 40 40 46 46 46 46 46 47 51 51 42 42 45 45 45 45 45 45 51 46 56 56 56 56 46 51 51 51 45 40 40 40 40 40 37 45 46 42 42 44 44 44 44 56 56 37 40 40 40 39 37 37 37 37 39 40 40 43
    >sM13rroot0003013
    9 13 13 19 19 9 6 6 16 19 24 24 29 29 29 29 30 37 46 34 37 40 46 40 37 37 37 37 37 40 40 40 40 40 40 35 35 40 40 37 35 35 35 35 35 35 39 40 46 46 46 46 46 56 46 42 42 46 43 51 51 51 51 42 42 51 42 42 51 51 51 51 56 51 51 51 51 51 51 56 56 56 56 56 46 42 42 42 42 42 42 42 42 43 43 35 35 35 35 35 35 56 51 51 51 51 51 51 56 56 56 56 40 40 40 40 40 40

Explanation of the file name VBOVTA2VT1CS1120Tb0102011615.seq:
The lab code is VB. The project code is OVT (a library of Orobanche minor v. Virginia tubercles). The instrument is a 377, so the code is A2. The id for this instrument is VT1. The base-calling program is ABI100, which has a code of C. The run parameter code is S1. Run time is 2 hours, so the run duration is 120. I have used dye terminator chemistry with big dyes, so the code is Tb. I ran these samples on Feb 1, 2001 at 4:15 pm, so the date/time stamp will be 0102011615. You only have to enter this once. The data itself carries a shorter name.

Explanation of the sequence name sM13rroot0002012:
The primer is the standard primer M13f, so the name starts with an "s" followed by the three-letter primer code "M13". The clone name is rroot0002. The sequence was run on lane 12, which has the code "012".
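Since every seq string needs an exactly matching qal string by name, a quick consistency check before sending the files can save a round trip. The following Python sketch is one way to do it and is not part of ESTAP; the function names are hypothetical, and the check that each sequence has one quality score per base is an assumption based on how Phred output is normally laid out.

```python
def read_records(text):
    """Return {name: [data lines]} for text whose records start with '>'."""
    records, current = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = line[1:].strip()
            if current in records:
                raise ValueError(f"duplicate record name: {current}")
            records[current] = []
        elif current is None:
            raise ValueError("data found before the first '>' line")
        else:
            records[current].append(line)
    return records


def check_seq_qal(seq_text, qal_text):
    """List the problems found when pairing a .seq file with its .qal file."""
    seqs = {n: "".join(lines).replace(" ", "") for n, lines in read_records(seq_text).items()}
    quals = {n: " ".join(lines).split() for n, lines in read_records(qal_text).items()}
    problems = []
    # Names present in only one of the two files.
    for name in sorted(set(seqs) ^ set(quals)):
        missing_from = "qal" if name in seqs else "seq"
        problems.append(f"{name}: no matching record in the .{missing_from} file")
    # Assumed rule: one quality score per base.
    for name in sorted(set(seqs) & set(quals)):
        if len(seqs[name]) != len(quals[name]):
            problems.append(
                f"{name}: {len(seqs[name])} bases but {len(quals[name])} quality scores"
            )
    return problems
```

An empty list means the two files pair up cleanly under these assumptions.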

4. Sending Chromatogram Files

If you wish ESTAP to run Phred for you, send the chromatograms of the sequences. Put all sequences from the same sequencing run of the same project in a folder. The name of the folder follows the file naming convention without the .seq or .qal suffix, e.g. VBOVTA2VT1CS1120Tb0102011615. Each chromatogram file follows the sequence naming convention plus the suffix ".ab1", e.g. sM13rroot0002012.ab1

IX. How to Use dbEST Submission Tool

The dbEST Submission Tool incorporated in ESTAP is available online to allow users to conveniently submit their cleansed EST sequences in the ESTAP database to NCBI's dbEST database. Note that only PIs have the privilege to submit their EST sequences to dbEST. PIs may follow the "dbEST Submission" link on the "Project List" page to ask ESTAP to prepare the dbEST submission files for them. During dbEST submission, a contact file, a library file, a publication file, and one or more EST files will be created and sent to the primary contact person via email. EST files will be in FASTA format with clone name and clean sequence ids as identifiers.

After submitting the EST sequences to NCBI's dbEST, the PI should receive an email from NCBI regarding this batch of submitted ESTs. In order to store the dbEST_ID, USER_ID, and GENBANK_ACCN assigned by NCBI in the ESTAP database, the PI should send this confirmation file received from NCBI to the ESTAP curator. The ESTAP curator should use the ReadConfirmationFile program to process the confirmation file.

The ReadConfirmationFile program is located under the java_prog directory of the uncompressed software. It is a stand-alone Java program which reads and parses the dbEST confirmation file and stores the dbEST_ID, USER_ID, and GENBANK_ACCN assigned by NCBI in the ESTAP database. To make this program work, the curator needs to modify a configuration file called "javaConfig.txt" in the java_prog directory and set the correct parameters (see Section IV.1.c). ReadConfirmationFile will use the parameters defined in this file to find the confirmation file and connect to the ESTAP database. The confirmation file should be named using the .txt suffix (e.g. JC_MCA_020314052943.txt) and placed in the directory path specified in the variable "ncbiConfirmFilePath" in the javaConfig.txt file. Go to the directory where ReadConfirmationFile.class is located and run the ReadConfirmationFile program using the following command:

    java ReadConfirmationFile


e.g.

    nohup runBlast.exe > blast.log 2>&1 &

It is better to use nohup and run the program in the background (with & at the end of the command). To check whether the program is running, use the command:

    ps -ef | grep runBlast.exe

Currently the runBlast.exe program is set to blast 1000 sequences at a time (in the future the setting can be changed to a bigger number). If there are more than 1000 sequences to blast, you need to run this program multiple times. Please go to blast.log to check how many sequences must be blasted.

Note: Please wait at least 5-10 minutes before starting another runBlast.exe program, to make sure that the sequences to be blasted in the previous run are already recorded in the ESTAP database. Check the blast.log file to see if the blast command already appears. When starting a second runBlast.exe program, please use a different log file name instead of blast.log, e.g. blast2.log. The same precautions apply when you need to run the program three or more times.

g. Run the reblast program to reblast according to the corresponding blast protocols

Command:

    reblast.exe > reblast.log 2>&1

where reblast.exe is the program name and "> reblast.log 2>&1" means to log the program output (from both stdout and stderr) to a log file named reblast.log. This reblast.log can be viewed for debugging purposes. After the reblast program has run successfully, reblast.log can be removed to save disk space. You may use nohup to make the command immune to hang-ups, e.g.

    nohup reblast.exe > reblast.log 2>&1 &

It is better to use nohup and run the program in the background (with & at the end of the command). To check whether the program is running, use the command:

    ps -ef | grep reblast.exe

Currently the reblast.exe program is set to reblast 1000 sequences at a time (in the future the setting can be changed to a bigger number). If there are more than 1000 sequences to reblast, you need to run this program multiple times. Please go to reblast.log to check how many sequences still need to be blasted.

Note: Please wait at least 5-10 minutes before starting another reblast.exe program, to make sure that the sequences to be blasted in the previous run are already recorded in the ESTAP database. Check the reblast.log file to see if the blast command already appears. When starting a second reblast.exe program, please use a different log file name instead of reblast.log, e.g. reblast2.log. The same precautions apply when you need to run the program three or more times.

h. Run the cluster/assembly programs

You have two options: 1) do cluster/assembly for a specific project; 2) do cluster/assembly for all projects.

1) Do cluster/assembly for a specific project

Command:

    cluster.exe project_id > cluster_project_id.log 2>&1

e.g.

    cluster.exe 377 > cluster_377.log 2>&1

where cluster.exe is the program name, project_id is the project id of the project you wish to run (note: this is the internal id, not the project code), and "> cluster_project_id.log 2>&1" means to log the program output (from both stdout and stderr) to a log file named cluster_project_id.log. This log file can be viewed for debugging purposes. After the program has run successfully, the log file can be removed to save disk space. You may use nohup to make the command immune to hang-ups, e.g.

    nohup cluster.exe project_id > cluster_project_id.log 2>&1 &

To check whether the program is running, use the command:

    ps -ef | grep cluster.exe

2) Do cluster/assembly for all projects that are qualified for the process (this is preferred)

Command:

    clusterAll.exe > cluster.log 2>&1

where clusterAll.exe is the program name and "> cluster.log 2>&1" means to log the program output (from both stdout and stderr) to a log file named cluster.log. This log file can be viewed for debugging purposes. After the program has run successfully, the log file can be removed to save disk space. You may use nohup to make the command immune to hang-ups, e.g.

    nohup clusterAll.exe > cluster.log 2>&1 &

To check whether the program is running, use the command:

    ps -ef | grep clusterAll.exe

i. Run the contig blast program to blast the assembled contigs according to the corresponding blast protocols

Command:

    blastAllContigs.exe > contig_blast.log 2>&1

where blastAllContigs.exe is the program name and "> contig_blast.log 2>&1" means to log the program output (from both stdout and stderr) to a log file named contig_blast.log. This contig_blast.log can be viewed for debugging purposes. After the blastAllContigs program has run successfully, the log file can be removed to save disk space. You may use nohup to make the command immune to hang-ups, e.g.

    nohup blastAllContigs.exe > contig_blast.log 2>&1 &

It is better to use nohup and run the program in the background (with & at the end of the command). To check whether the program is running, use the command:

    ps -ef | grep blastAllContigs.exe

Currently the blastAllContigs.exe program is set to blast 1000 contigs at a time (in the future the setting can be changed to a bigger number). If there are more than 1000 contigs to blast, you need to run this program multiple times. Please go to contig_blast.log to check how many sequences still need to be blasted.

Note: Please wait at least 5-10 minutes before starting another blastAllContigs.exe program, to make sure that the sequences to be blasted in the previous run are already recorded in the ESTAP database. Check the contig_blast.log file to see if the blast command already appears. When starting a second blastAllContigs.exe program, please use a different log file name instead of contig_blast.log, e.g. contig_blast2.log. The same precautions apply when you need to run the program three or more times.

j. Run the contig reblast program to reblast the assembled contigs according to the corresponding blast protocols

Command:

    reblastAllContigs.exe > contig_reblast.log 2>&1

where reblastAllContigs.exe is the program name and "> contig_reblast.log 2>&1" means to log the program output (from both stdout and stderr) to a log file named contig_reblast.log. This log file can be viewed for debugging purposes. After the reblastAllContigs program has run successfully, the log file can be removed to save disk space. You may use nohup to make the command immune to hang-ups, e.g.

    nohup reblastAllContigs.exe > contig_reblast.log 2>&1 &

It is better to use nohup and run the program in the background (with & at the end of the command). To check whether the program is running, use the command:

    ps -ef | grep reblastAllContigs.exe

Currently the reblastAllContigs.exe program is set to reblast 1000 contigs at a time (in the future the setting can be changed to a bigger number). If there are more than 1000 contigs to reblast, you need to run this program multiple times. Please go to contig_reblast.log to check how many sequences still need to be blasted.

Note: Please wait at least 5-10 minutes before starting another reblastAllContigs.exe program, to make sure that the sequences to be blasted in the previous run are already recorded in the ESTAP database. Check the contig_reblast.log file to see if the blast command already appears. When starting a second reblastAllContigs.exe program, please use a different log file name instead of contig_reblast.log, e.g. contig_reblast2.log. The same precautions apply when you need to run the program three or more times.
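Because each of the blast programs handles 1000 items per run, the number of runs and the log file names follow directly from the pending count. The Python sketch below only works out that arithmetic and prints nothing; it is a hypothetical helper, and the third and later log names (blast3.log and so on) are an assumption extending the blast.log / blast2.log pattern given above.

```python
import math


def plan_runs(pending, program="runBlast.exe", log_stem="blast", batch_size=1000):
    """Return one nohup command per run needed to cover `pending` items."""
    runs = math.ceil(pending / batch_size)
    commands = []
    for i in range(1, runs + 1):
        # First run logs to blast.log, the second to blast2.log, and so on.
        log = f"{log_stem}.log" if i == 1 else f"{log_stem}{i}.log"
        commands.append(f"nohup {program} > {log} 2>&1 &")
    return commands
```

For example, plan_runs(2500) yields three commands. They should still be started by hand 5-10 minutes apart, after checking the previous log, as the notes above require.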

k. Set up and run the InterProScan application

InterProScan is a tool developed at EBI that combines different protein signature recognition methods into one resource. ESTAP wraps this tool to query contigs and singlets against the InterPro public databases to automatically annotate the protein functions. The GO terms of the matched proteins are linked to the contigs and singlets. Please follow the instructions below to set up and run the InterProScan application.

1) Download iprscan

Go to the ftp://ftp.ebi.ac.uk/pub/software/unix/iprscan site to download the following files:
(1) iprscan_vXXX.tar.gz, where XXX is the version number of the iprscan release, e.g. 3.2
(2) iprscan_bin_XXX.tar.gz, where XXX is your platform, e.g. Linux
(3) iprscan_DATA_XXX.tar.gz, where XXX is the version number of the InterPro database release, e.g. 6.1

2) Uncompress the files, in the order specified above, into your home directory ($HOME_DIR). The iprscan home directory will be $HOME_DIR/iprscan, e.g. /home/estap/iprscan

3) First time configuration

Go to the iprscan home directory and run "perl CONFIG.pl". Then:

(1) Modify CONFIG.pl as described below (save a copy of the original CONFIG.pl file as CONFIG_orig.pl).

(i) At the beginning of CONFIG.pl (in main), along with the other variable declarations, add the following lines:

    my $IprPWD = "iprpwd.txt";
    my $pwd = $ENV{PWD};
    unless ($pwd) {
        $pwd = `pwd`;
        chomp $pwd;
    }

(ii) Before the line setting up applications:

    $applset = get_user_prompt("Setup applications $first (y|n)");

add these lines:

    open (FF, ">$IprPWD") || die "Cannot create $IprPWD: $!";
    print FF $pwd;
    close FF;

(2) Modify the InterProScan.pl file as described below (save a copy of the original file as InterProScan_orig.pl).

(i) Replace

    my $seqfile = $ARGV[0];

with

    my $seqfile = $ARGV[1];

(ii) After the line

    my $UserId = Manager->getUserId();

add these lines:

    my $OutDir = $ARGV[0];
    print "$OutDir\n";
    # Do not use $path in this file since it won't be recognized by the iprscan wrapper program.
    # $OutDir instead of $path is used as an argument for the iprscan wrapper program to store iprscan result files.

(iii) Replace $path with $OutDir in all the lines following the above comment lines.

4) Reconfiguration

Go to the iprscan home directory and run "perl CONFIG.pl". You will be asked to choose the member databases you would like to search against. If you only use the public data and applications that come with the iprscan distribution, you should answer "n" for SignalPHMM and TMHMM during configuration, since they are not public. When the configuration is complete, a file called iprpwd.txt will be created under the iprscan home directory. The iprpwd.txt file stores the iprscan path information, which will be used by the iprscan wrapper programs. You must copy the iprpwd.txt file to the java_prog directory, where the iprscan wrapper programs are placed, so that the Java programs can read the iprscan path information from this file.

5) Run iprscan

Go to the java_prog directory, where the Java application and configuration files (javaConfig.txt and iprpwd.txt) are placed, and set the proper parameters in the configuration file as described in Section IV.1.c. You may analyze a specific project or all projects at a time.

If you wish to run a specific project, use the command:

    nohup java InterProDriver lab_code project_code > ipr_project_code.log 2>&1 &

e.g.

    nohup java InterProDriver JC MCA > ipr_MCA.log 2>&1 &

where InterProDriver is the program name, lab_code is the two-letter lab code of the lab that the project belongs to, project_code is the three-letter project code of the project being analyzed, and "> ipr_project_code.log 2>&1" means to log the program output (from both stdout and stderr) to a log file named ipr_project_code.log.

If you wish to run all projects that are qualified for this procedure, use the command:

    nohup java InterProScanAll > ipr.log 2>&1 &

where InterProScanAll is the program name and "> ipr.log 2>&1" means to log the program output (from both stdout and stderr) to a log file named ipr.log.

The log files can be viewed for debugging purposes. After the program has run successfully, the log files can be removed to save disk space.

l. Run the genomic DNA assembly programs

You have two options: 1) do DNA assembly for a specific genome project; 2) do DNA assembly for all qualified genome projects.

1) Do assembly for a specific genome project

Command:

    nohup gAssemble.exe project_id > assemble_project_id.log 2>&1

e.g.

    nohup gAssemble.exe 377 > assemble_377.log 2>&1

where gAssemble.exe is the program name, project_id is the project id of the project you wish to run (note: this is the internal id, not the project code), and "> assemble_project_id.log 2>&1" means to log the program output (from both stdout and stderr) to a log file named assemble_project_id.log.

2) Do assembly for all genome projects that are qualified for the process (this is preferred)

Command:

    nohup gAssembleAll.exe > assemble.log 2>&1

where gAssembleAll.exe is the program name and "> assemble.log 2>&1" means to log the program output (from both stdout and stderr) to a log file named assemble.log.

m. Prepare EST database files from user projects

The ESTAP Web interface allows users to blast specific sequence(s) against the ESTs from their projects. We provide a program called writeAndFormatFasta.exe to create and format EST database files from user projects. This program can be scheduled to run daily, or as often as you wish, to get the most recent EST databases for users to blast against. To run this program, you need to create a directory where the EST database files will be placed, e.g. /home/estap/estdb. This directory path should be used for the ESTDBFilePath parameter setting in the web.xml file (see Section V.3.e) so that the ESTAP local blast Web service can recognize

4. Schedule Programs

The above programs can be run manually as described above, or scheduled to run automatically using the UNIX cron command. Please see the UNIX man page for the crontab command. The following is an example of the scheduling:

    0 9,11,17 * * * ~/estap_exe/estapDriver.sh >> estap.log 2>&1
    0 19,20,21 * * * ~/estap_exe/runBlast.sh >> blast.log 2>&1
    30 22,23,24 * * * ~/estap_exe/reblast.sh >> reblast.log 2>&1
    30 3 * * 2 ~/estap_exe/getAndFormatNcbiDB.sh > get_db.log 2>&1
    0 5 * * * ~/estap_exe/clusterAll.sh >> ~/estap_exe/cluster.log 2>&1
    0,30 19 * * * ~/estap_exe/blastAllContigs.sh >> ~/estap_exe/blast_contigs.log 2>&1
    0,30 12 * * * ~/estap_exe/reblastAllContigs.sh >> ~/estap_exe/reblast_contigs.log 2>&1
    0,30 7 * * * ~/estap_exe/gAssembleAll.sh >> ~/estap_exe/assemble.log 2>&1
    5 2 * * * ~/estap_exe/writeAndFormatFasta.sh >> ~/estap_exe/estap_formatdb.log 2>&1

Standard cron accepts hours 0-23 only, so the hour written as 24 in the reblast entry should be entered as 0.

The shell scripts used here call the corresponding programs and make sure the programs are run under the correct directories with the correct environment variable settings. These scripts should be revised when the directory setup and environment variable settings are different.
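Schedule entries are easy to mistype, and cron may reject an entry whose minute or hour is out of range. The Python sketch below checks only the first two fields of an entry against cron's ranges (minutes 0-59, hours 0-23). It is a hypothetical helper, not part of ESTAP, and it deliberately handles only "*" and comma-separated numbers, not ranges or step values.

```python
def check_schedule(entry):
    """Return a list of problems in the minute and hour fields of a crontab entry."""
    fields = entry.split()
    if len(fields) < 6:
        return ["expected five schedule fields followed by a command"]
    problems = []
    for label, field, upper in (("minute", fields[0], 59), ("hour", fields[1], 23)):
        if field == "*":
            continue
        for item in field.split(","):
            if not item.isdigit() or int(item) > upper:
                problems.append(f"{label} value {item!r} is outside 0-{upper}")
    return problems
```

An entry that passes returns an empty list; an hour list such as 22,23,24 is reported because 24 is not a valid cron hour.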

5. Trouble-shooting

When a program is stopped abnormally (e.g. by a power or network failure, or because processes are killed), some data may be partially analyzed and partial results may be stored in the database. You may need to clean up (remove the partial results from the database) and rerun the program. Please use the following instructions.

a. Go to the ESTAP database table PROG_RUN_STATUS in the ESTAP_SYS_DEF schema. This table records which programs are running or completed, when they were started, and when they were completed. If a program is killed in the middle of a run for some reason, the end_time will not be recorded and the status will still be running (status = 1). You need to change the status from 1 to 0 to indicate that the program is no longer running.

b. If the program was runBlast.exe or reblast.exe: after you change the status in the PROG_RUN_STATUS table, go to the ANALYSIS_TRACKING table in the ESTAP_ANALYSIS schema and delete the rows created by the program that have status = 1, e.g.

    DELETE from ANALYSIS_TRACKING where status = 1;

Commit the changes after you are sure that you want to delete them. Then rerun the runBlast.exe or reblast.exe program.

c. If the program was cluster.exe: after you change the status in the PROG_RUN_STATUS table, go to the CLUSTER_SUMMARY table in the ESTAP_ANALYSIS schema and delete the row with the project id that you specified when you ran cluster.exe. Commit the changes after you are sure that you want to delete them. Then rerun the cluster.exe program.

d. If the program was clusterAll.exe: after you change the status in the PROG_RUN_STATUS table (look for "cluster.exe" and "clusterAll.exe" in the prog_name column), go to the CLUSTER_SUMMARY table in the ESTAP_ANALYSIS schema and delete the rows created by the program that have a status of 1. Commit the changes after you are sure that you want to delete them. Then rerun the clusterAll.exe program.

e. If the program was blastAllContigs.exe or reblastAllContigs.exe: after you change the status in the PROG_RUN_STATUS table, go to the CONTIG_ANALYSIS table in the ESTAP_ANALYSIS schema and delete the rows created by the program that have status = 1, e.g.

    DELETE from CONTIG_ANALYSIS where status = 1;

Commit the changes after you are sure that you want to delete them. Then rerun the blastAllContigs.exe or reblastAllContigs.exe program.
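The clean-up in cases b, d and e follows one pattern: reset the stale status flag, delete the rows with status = 1, then commit. The Python sketch below walks through that pattern against an in-memory SQLite database standing in for Oracle. Everything beyond the table names and the prog_name, status and end_time columns is an assumption: the seq_id column, the use of 0 for finished rows, and the function name are hypothetical, and the Oracle schema prefixes are omitted.

```python
import sqlite3


def clean_up_after_crash(conn, prog_name, result_table):
    """Reset the stale 'running' flag, then delete the partial result rows."""
    allowed = {"ANALYSIS_TRACKING", "CLUSTER_SUMMARY", "CONTIG_ANALYSIS"}
    if result_table not in allowed:
        raise ValueError(f"unexpected table: {result_table}")
    cur = conn.cursor()
    # Step a: mark the killed program as no longer running.
    cur.execute(
        "UPDATE PROG_RUN_STATUS SET status = 0 WHERE prog_name = ? AND status = 1",
        (prog_name,),
    )
    # Steps b, d, e: remove the partially analyzed rows.
    cur.execute(f"DELETE FROM {result_table} WHERE status = 1")
    deleted = cur.rowcount
    conn.commit()
    return deleted


# Stand-in tables with only the columns this sketch needs (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE PROG_RUN_STATUS (prog_name TEXT, status INTEGER, end_time TEXT);
    CREATE TABLE ANALYSIS_TRACKING (seq_id INTEGER, status INTEGER);
    INSERT INTO PROG_RUN_STATUS VALUES ('runBlast.exe', 1, NULL);
    INSERT INTO ANALYSIS_TRACKING VALUES (1, 0), (2, 1), (3, 1);
""")
deleted = clean_up_after_crash(conn, "runBlast.exe", "ANALYSIS_TRACKING")
```

Case c differs in that the CLUSTER_SUMMARY row is selected by project id rather than by status, so it is not covered by this sketch.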

V. How to Install and Run Web Programs

1. Install Oracle Client including JDBC

Please contact Oracle to obtain licensing and product support. ESTAP uses the Oracle 9i client.

2. Install JDK, Servlet and Web Service Software

a. To install JDK 1.3 (Windows, Linux, Solaris) or higher, please go to http://java.sun.com/j2se
b. To install Tomcat, please go to http://jakarta.apache.org/tomcat
c. To install Apache Axis, please go to http://ws.apache.org/axis. The Apache Axis beta 3 version is used for the ESTAP local blast Web service.

3. Configure Tomcat

a. Visit http://jakarta.apache.org/tomcat for instructions on Tomcat configuration.

b. Set up the correct PATH and CLASSPATH environment variables.

c. Modify server.xml in the Tomcat conf directory. In this file, add the following for the estap context:

    <Context path="/estap" docBase="estap" showdebuginfo="true" debug="DEBUG" reloadable="true" crossContext="false">
        <Logger className="org.apache.catalina.logger.FileLogger" prefix="localhost_estap_log." suffix=".txt" timestamp="true"/>
    </Context>

d. Place the estap Web application under the Tomcat webapps directory: copy the estap directory from the web_side directory of the uncompressed software to the webapps directory.

e. Modify the web.xml file in the estap/WEB-INF directory. The following section in web.xml is specific to ESTAP. Please modify the param-value entries for your own parameter settings.

    <servlet>
        <servlet-name>ProjectList</servlet-name>
        <servlet-class>ProjectList</servlet-class>
        <init-param>
            <param-name>jdbcDriverClassName</param-name>
            <param-value>oracle.jdbc.driver.OracleDriver</param-value>
        </init-param>
        <init-param>
            <param-name>jdbcURL</param-name>
            <param-value>jdbc:oracle:thin:@server.vt.edu:1521:estapdb</param-value>
        </init-param>
        <init-param>
            <param-name>dbUserName</param-name>
            <param-value>estap_user</param-value>
        </init-param>
        <init-param>
            <param-name>dbUserPassword</param-name>
            <param-value>password</param-value>
        </init-param>
        <init-param>
            <param-name>dbInitConnect</param-name>
            <param-value>1</param-value>
        </init-param>
        <init-param>
            <param-name>dbMaxConnect</param-name>
            <param-value>50</param-value>
        </init-param>
        <init-param>
            <param-name>dbESTFilePath</param-name>
            <param-value>/home/estap/dbEST_Outgoing</param-value>
        </init-param>
        <init-param>
            <param-name>ESTDBFilePath</param-name>
            <param-value>/home/estap/estdb</param-value>
        </init-param>
        <init-param>
            <param-name>blastallPath</param-name>
            <param-value>/home/estap/blast/blastall</param-value>
        </init-param>
        <init-param>
            <param-name>formatdbPath</param-name>
            <param-value>/home/estap/blast/formatdb</param-value>
        </init-param>
        <init-param>
            <param-name>estapEmail</param-name>
            <param-value>estap@vbi.vt.edu</param-value>
        </init-param>
        <load-on-startup>1</load-on-startup>
    </servlet>

The following list defines the parameters:

jdbcURL: the URL for the JDBC connection
dbUserName: the ESTAP database user name
dbUserPassword: the ESTAP database user password
dbInitConnect: the number of database connections to be made initially
dbMaxConnect: the maximum number of database connections
dbESTFilePath: the directory path to store dbEST submission files
ESTDBFilePath: the directory path to store the EST files in FASTA format and their formatted files used by the ESTAP local blast Web service
blastallPath: the path for the blastall program
formatdbPath: the path for the formatdb program
estapEmail: the ESTAP curator's email address
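After editing web.xml, it can help to confirm that every parameter listed above is present and non-empty before restarting Tomcat. The Python sketch below is one way to do that and is not part of ESTAP; the function names are hypothetical, and it assumes the web.xml elements carry no XML namespace, as in the excerpt above.

```python
import xml.etree.ElementTree as ET

# Parameter names taken from the ESTAP servlet section of web.xml.
REQUIRED = [
    "jdbcDriverClassName", "jdbcURL", "dbUserName", "dbUserPassword",
    "dbInitConnect", "dbMaxConnect", "dbESTFilePath", "ESTDBFilePath",
    "blastallPath", "formatdbPath", "estapEmail",
]


def read_init_params(xml_text, servlet_name="ProjectList"):
    """Return {param-name: param-value} for the named servlet."""
    root = ET.fromstring(xml_text)
    for servlet in root.iter("servlet"):
        if servlet.findtext("servlet-name", "").strip() == servlet_name:
            return {
                p.findtext("param-name", "").strip(): p.findtext("param-value", "").strip()
                for p in servlet.findall("init-param")
            }
    raise ValueError(f"servlet {servlet_name!r} not found")


def missing_params(params):
    """List the required parameters that are absent or empty."""
    return [name for name in REQUIRED if not params.get(name)]
```

A typical use would read the file with open(...).read(), pass the text to read_init_params, and inspect the list returned by missing_params.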

f. Set the heap size for the JVM

Some of the ESTAP web programs require more memory than the JVM default heap size of 64MB. To avoid an "out of memory" error, the user should set a larger heap size for the JVM. To do this, set the "JAVA_OPTS" variable in the catalina.sh/catalina.bat or setclasspath.sh/setclasspath.bat file in the tomcat_home/bin directory. For example, if you wish to set the heap size to 256MB, you may add the line "set JAVA_OPTS=-mx256m" to the catalina.bat file.

g. Start the Tomcat server

To start the Tomcat server, execute tomcat_home/bin/startup.bat on Windows (or tomcat_home/bin/startup.sh on Unix/Linux). Then enter the URL http://localhost/ in your browser and make sure that you get the Tomcat welcome page, not an error message saying that the page cannot be displayed or that the server cannot be found. If you choose to use port number 8080 instead of 80, you will need to use a URL that includes the port number, such as http://localhost:8080/. If this is successful, enter http://localhost/estap/servlet/Login or http://localhost:8080/estap/servlet/Login to see if the ESTAP login page is displayed.

4. Configure Axis

a. Go to http://ws.apache.org/axis/ for downloading and installing instructions.

b. Place the downloaded axis directory under the tomcat_home/webapps directory.

c. Move the server-config.wsdd file from the tomcat_home/webapps/estap/WEB-INF directory to the tomcat_home/webapps/axis/WEB-INF directory.

d. Add all jar files under the tomcat_home/webapps/axis/WEB-INF/lib and tomcat_home/webapps/axis/WEB-INF/classes directories to your CLASSPATH environment variable.

e. Move the blastService and createESTDBService directories from the tomcat_home/webapps/estap/WEB-INF/classes directory to the tomcat_home/webapps/axis/WEB-INF/classes directory.

f. Replace the server name and Tomcat port with your server name and port number in the JBlastServiceLocator.java and JCreateESTDBServiceLocator.java files under the blastService/ws and createESTDBService/ws directories, respectively. E.g., for blastService, run the commands:
    cd blastService
    cd ws
Edit the JBlastServiceLocator.java file by changing "http://arda.vbi.vt.edu:8080" to "http://yourServerName:yourPort". Then compile the file just modified by running the command:
    javac JBlastServiceLocator.java
Repeat the above process for the createESTDBService directory.

g. Copy the blastService and createESTDBService directories to the tomcat_home/webapps/estap/WEB-INF/classes directory.

h. Compile EstapWebBlast.java by running the command:
    javac EstapWebBlast.java

i. Restart the Tomcat server and go to http://localhost/axis or http://localhost:8080/axis to view the list of deployed web services. If Axis is configured correctly, you will see the createESTDB and blast services.

VI. How to Register and Update ESTAP Users

ESTAP defines 3 types of users with different role permissions: sponsors, PIs, and other users.

Sponsors are those who fund the development of ESTAP. If there is no 'sponsor' according to this definition, then the person who is in charge of the ESTAP system becomes the sponsor. There is one sponsor per institution. Sponsors can register/update PIs and have all PI privileges.

PIs are those people responsible for the projects. PIs can register/update projects, project users, and project protocols. PIs can also register ESTAP codes (e.g., primer, instrument, base caller, etc.), submit ESTs of their projects to dbEST, and view their project data and analysis results.

Other users include primary contacts with update permission, primary contacts with read-only permission, and project viewers with read-only permission. Primary contacts with update privileges (associated with particular project(s)) can update projects and project protocols and view project data and analysis results. Primary contacts with read-only permission and project viewers (associated with particular project(s)) can view project data and analysis results.

At least one sponsor must exist before PIs can register their projects and users. Sponsors are registered by the ESTAP curator/database administrator, who has direct access to the ESTAP tables in the database. Please use the following instructions to register sponsors:

1) In the INSTITUTION table, modify the first record (institution_id = 1). Change the institution_name to the institution name of the sponsor. If there is more than one sponsor to be added, add a new record for each remaining sponsor. For each record, increment institution_id by 1 and enter the institution name and description in the institution_name and institution_desc columns.

2) In the LAB table, modify the record with the lab_code of XX. Change all columns of this record to the desired values. The lab_email entry must be the email address of the sponsor. The lab_alt_email entry is the email address of the primary contact in the sponsor's lab. Each sponsor has one lab. You may add more records if there is more than one sponsor.

3) In the PERSON table, modify the person record with the person_id of 1. Change the user_name to the email address of the sponsor and the password to the sponsor's password for logging on to the Web site. The user_type is 1 and the sponsor_id is 1 (referring to the person_id of the sponsor him/herself in the PERSON table). The other columns are self-explanatory from their names. Add more records if there is more than one sponsor, and increment person_id for each additional sponsor.

4) In the CONTACT table, change the record with the contact_id of 1 and person_id of 1 to the contact information of the sponsor. Add more records if there is more than one sponsor, and increment contact_id for each additional sponsor.

Once sponsors are registered in the ESTAP system, they can log on to the ESTAP Web site to register or update PIs in their institutions. Once PIs are registered, they can log on to the ESTAP Web site to register or update their projects. In addition, PIs can register or update the primary contacts and the project viewers for their projects.

VII. How to Register and Update EST Projects and Protocols

1. Project Registration and Update

ESTAP defines two types of projects: regular and virtual projects. A regular project refers to a specific EST library that a lab constructed. The regular project information includes biological information regarding tissues and cloning methods used in cDNA library construction, and EST sequencing information. In order to analyze combined libraries from a particular taxonomy, we introduced the virtual project concept. We now allow PIs to register a virtual project that combines their regular projects from a particular taxonomy.

To register a regular project or a virtual project, please go to the "Project List" page, click the "Register Regular Project" link or the "Register Virtual Project" link, and follow the instructions in the registration form carefully. PIs and the primary contacts with update privilege may also update the project information via the Web.

The current version of ESTAP does not allow updating a project once the sequences of that project have been processed and stored in the ESTAP database. If the user needs to make corrections to the project information after the sequences are processed, he or she has to make a request to the ESTAP curator. The ESTAP curator should delete the already loaded sequences for that project, make corrections to the project information, and then reload the sequences by running the estapDriver.exe program.

2. Protocol Registration and Update

ESTAP offers the following protocols for data analysis:

1) Cleansing protocols

The cleansing protocols contain procedures to cleanse raw sequences by removing low-quality end sequences, vectors, and polyA (or polyT), and by screening for chimeras and contamination. There are two types of cleansing protocols: the cleansing protocol for BLAST and the cleansing protocol for assembly. The sequences cleansed by the cleansing protocol for BLAST are used for BLAST searching, and the sequences cleansed by the cleansing protocol for assembly are used for the cluster/assembly analysis. The user may utilize the same procedures and parameters for both protocols. There are several key points for the cleansing to be successful:

a. The cleansing program uses the 5' and 3' vector fragments provided in the project registration to match the vector portions in the clones. So the 5' and 3' vector fragments (both the orientation of the fragments and their sequences) must be correct.

b. The vector and insert adaptor sequences provided in the project registration should not be the whole adaptor sequences. Only the regions that are present in the clones should be included.

c. The sequencing primer code in the sequence names of the raw sequence data must be correct. The correct primer code indicates the correct orientation of the sequencing (5' or 3'). A wrong orientation will cause vector/adaptor removal to fail. Please check the primer table carefully to make sure the orientation of the primer is correct.

d. When the cleansing procedure does not seem to give the expected result, please check whether the above key points have been followed.

2) BLAST protocol

The BLAST protocol provides procedures and parameters for BLAST (BLASTN, BLASTX, and/or TBLASTX) searches. Users may choose to run all, some, or none of the BLAST procedures in the protocol.

3) Cluster/assembly protocol

The cluster/assembly protocol provides procedures and parameters for clustering and assembly analysis. Users register this protocol only if they wish to do cluster/assembly; projects are not clustered automatically without a cluster protocol. To register the protocol, go to the "View Project" page of the project you wish to cluster/assemble and follow the link called "Cluster Protocol" to register the protocol and set the procedure parameters. The parameters that ESTAP provides are from the d2_cluster and CAP3 programs. The "Register Cluster/Assembly Protocol" page has a link, "Cluster/Assembly Protocol Guide", which provides paper references for the d2_cluster and CAP3 programs. Once the protocol is registered, ESTAP will automatically perform cluster/assembly for your project according to the protocol you registered.

If you wish to perform cluster/assembly for combined libraries (not for each project), you need to register a virtual project and select the taxonomy of the libraries. Then register the BLAST and cluster/assembly protocols for this virtual project. Once the cluster protocol is registered, ESTAP will automatically perform cluster/assembly for the virtual project.

The cluster/assembly protocol also provides an option to perform InterProScan analysis. Users may choose this option to let ESTAP automatically annotate the protein functions of singlets and contigs after the clustering/assembly procedure is performed.

PIs and the primary contacts with update privilege may also update the protocols. The updated protocols will be effective for new sequence data, not for data that have already been analyzed. If the user wishes to reanalyze the old data, he or she must make such a request to the ESTAP curator. The curator should use the following instructions:

1) If the sequences of a project must be re-cleansed using the new protocol, the curator must delete the sequences of the project from the ESTAP database and reload the sequences by running the estapDriver.exe program. Use the following commands to delete the sequences from the database:

    delete from raw_seq_file where project_id = x;
    commit;

project_id is the internal id for the project. The curator may get the id from the PROJECT table by matching the lab code and project code of the project:

    select project_id from project where lab_code = 'xx' and user_proj_code = 'xxx';

Please see Section IV.4.e for estapDriver.exe usage.

2) If the sequences of a project must be re-blasted using the new BLAST protocol, the curator must delete the old BLAST results of that project from the ESTAP database and run the runBlast.exe program. Use the following commands to delete the BLAST results from the database:

    delete from analysis_tracking where project_id = x;
    commit;

Please see Section IV.4.f for runBlast.exe usage.

3) If the sequences of a project must be re-clustered using the new clustering/assembly protocol, the curator must delete the old clustering/assembly results and the contig BLAST results of that project from the ESTAP database. Then run the cluster.exe program for clustering/assembly and the blastAllContigs.exe program for contig BLAST. Use the following commands to delete the old results from the database:

    delete from cluster_summary where project_id = x;
    commit;

Use the "cluster.exe [project_id]" command to redo the clustering/assembly analysis and "blastAllContigs.exe" to redo the contig BLAST analysis. Note: [project_id] is the internal id for that project. Please see Section IV.4.h and Section IV.4.i for how to run the cluster.exe and blastAllContigs.exe programs.
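The three reanalysis cases above follow one pattern: delete the rows of one table for the project, commit, and rerun one or two programs. A curator who handles such requests often might keep that mapping in a small helper. The following Python sketch only prints the statements and program names given in this section; it does not connect to Oracle, and the function name reanalysis_plan and the keys "cleanse", "blast", and "cluster" are hypothetical.

```python
# Table to clean up for each kind of reanalysis (table names from this section).
CLEANUP_TABLE = {
    "cleanse": "raw_seq_file",
    "blast": "analysis_tracking",
    "cluster": "cluster_summary",
}

# Programs to rerun afterwards (program names from this section).
RERUN_PROGRAMS = {
    "cleanse": ["estapDriver.exe"],
    "blast": ["runBlast.exe"],
    "cluster": ["cluster.exe", "blastAllContigs.exe"],
}


def reanalysis_plan(kind, project_id):
    """Return the SQL statements and programs for one reanalysis request.

    project_id is the internal id from the PROJECT table, not the project code.
    """
    project_id = int(project_id)  # refuse anything that is not a number
    table = CLEANUP_TABLE[kind]   # KeyError for an unknown kind
    return {
        "sql": [
            "delete from %s where project_id = %d;" % (table, project_id),
            "commit;",
        ],
        "rerun": list(RERUN_PROGRAMS[kind]),
    }
```

For example, reanalysis_plan("cluster", 377) lists the delete on cluster_summary followed by cluster.exe and blastAllContigs.exe. Review the statements before running them in SQL*Plus, since the deletes cannot be undone after the commit.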

VIII. How to Prepare Raw EST Data

Users can send ESTAP either the chromatograms of the sequences (.ab1 files) or both the sequence files and their corresponding quality score files in FASTA format. If the user sends the chromatograms, ESTAP will use the Phred base-calling program to generate the sequences and the quality scores. Each sequence file contains sequences from the same sequencing run of the same project. Each quality score file contains the quality scores of the sequences in the corresponding sequence file. Each sequence or quality score string in the file must start with a ">" followed by a sequence name.

To process data files, ESTAP requires that data providers use a standard file name for each file and a standard sequence name for each sequence in the file. The file name contains specific information about that file, and the sequence name contains specific information about the sequence. All the sequences in the same file must be from the same project and the same sequencing run. The naming convention and rules are described as follows:
• Both the file name and the sequence name are required to be of a fixed length.
• Each element in the name must conform to a required number of characters or digits.
• Pad with 0's (zeros) in front of any element that is shorter than the required length.

The file names and sequence names are case-sensitive.

1. File Naming Convention

All sequence files will have ".seq" as a suffix, and quality score files will have ".qal" as a suffix. Every seq string must have an exactly matching qal string by name. The file names use the following naming convention: LLPPPTTIIIBRRNNNCCYYMMDDHHMM.seq for the sequence file and LLPPPTTIIIBRRNNNCCYYMMDDHHMM.qal for the corresponding quality score file.

LL - two-character alphanumeric lab code (provided by VBI), e.g., UN
PPP - three-character alphanumeric project id (provided by VBI), e.g., MCA
TT - two-character alphanumeric instrument type (provided by VBI), e.g., A1
III - three-character alphanumeric instrument id (user-defined code that is unique to the lab), e.g., VT1
B - one-character alphanumeric base-calling program id (provided by VBI), e.g., A
RR - two-character alphanumeric run parameter id (provided by VBI), e.g., S1
NNN - three-digit run duration in minutes (actual running time in minutes, e.g., 090 for 90 minutes)
CC - two-character alphanumeric sequencing chemistry id (provided by VBI), e.g., Pr
YYMMDDHHMM - the date and time at which the sequence was run, e.g., 0104150930
    YY - two-digit year
    MM - two-digit month
    DD - two-digit day
    HH - two-digit hour
    MM - two-digit minute
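Because every element has a fixed width, a file name can be checked mechanically before the data are sent. The following Python sketch splits a name according to the widths listed above. The function name parse_file_name and the dictionary keys are hypothetical, and the sketch assumes the suffix is separated from the 28-character name by a dot (.seq or .qal).

```python
from datetime import datetime

# (element, width) pairs in the order LL PPP TT III B RR NNN CC YYMMDDHHMM.
FILE_FIELDS = [
    ("lab_code", 2),
    ("project_code", 3),
    ("instrument_type", 2),
    ("instrument_id", 3),
    ("base_caller", 1),
    ("run_parameter", 2),
    ("run_minutes", 3),
    ("chemistry", 2),
    ("timestamp", 10),
]
STEM_LENGTH = sum(width for _, width in FILE_FIELDS)  # 28 characters


def parse_file_name(file_name):
    """Split an ESTAP data file name into its fixed-width elements."""
    stem, dot, suffix = file_name.rpartition(".")
    if not dot or suffix not in ("seq", "qal"):
        raise ValueError("expected a .seq or .qal suffix: %r" % file_name)
    if len(stem) != STEM_LENGTH:
        raise ValueError("expected %d characters before the suffix, got %d"
                         % (STEM_LENGTH, len(stem)))
    fields = {"suffix": suffix}
    position = 0
    for name, width in FILE_FIELDS:
        fields[name] = stem[position:position + width]
        position += width
    if not fields["run_minutes"].isdigit():
        raise ValueError("run duration must be three digits")
    fields["run_minutes"] = int(fields["run_minutes"])
    # strptime raises ValueError for an impossible date or time.
    fields["timestamp"] = datetime.strptime(fields["timestamp"], "%y%m%d%H%M")
    return fields
```

The sketch checks only the layout. Whether a lab, project, or chemistry code is actually registered can only be checked against the codes provided by VBI.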

2. Sequence Naming Convention

The sequence names in the sequence and quality score files must be in the following format: TPPPNNNNNNNNNLLL

T - one-letter primer type: u for a user-defined primer and s for a standard primer
PPP - three-character alphanumeric primer id (provided or confirmed by VBI), e.g., T3a
NNNNNNNNN - nine-character alphanumeric clone name, unique within the project (user defined)
LLL - three-digit lane or capillary number, e.g., 012 for lane 12 in the sequence gel

Note: Using the correct primer code (the first 4 letters) is very important for the cleansing procedure. The correct primer code indicates the correct orientation of the sequencing (5' or 3'). A wrong orientation will cause vector/adaptor removal to fail.
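The sequence name can be checked in the same way as the file name. This Python sketch splits a 16-character name into the four elements described above; the function name parse_sequence_name and the dictionary keys are hypothetical. It accepts the name with or without the leading ">" used in the data files.

```python
SEQ_NAME_LENGTH = 16  # T (1) + PPP (3) + NNNNNNNNN (9) + LLL (3)


def parse_sequence_name(name):
    """Split a sequence name of the form TPPPNNNNNNNNNLLL into its elements."""
    if name.startswith(">"):
        name = name[1:]
    if len(name) != SEQ_NAME_LENGTH:
        raise ValueError("expected %d characters, got %d"
                         % (SEQ_NAME_LENGTH, len(name)))
    primer_type = name[0]
    if primer_type not in ("s", "u"):
        raise ValueError("primer type must be 's' or 'u', got %r" % primer_type)
    lane = name[13:16]
    if not lane.isdigit():
        raise ValueError("lane must be three digits, got %r" % lane)
    return {
        "primer_type": "standard" if primer_type == "s" else "user-defined",
        "primer_id": name[1:4],
        "primer_code": name[0:4],   # the first 4 letters, used for orientation
        "clone_name": name[4:13],
        "lane": int(lane),
    }
```

A name that passes this check can still carry the wrong primer code, so the primer table should still be consulted for the orientation.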

3. Example of a Sequence and a Quality Score File

a. Sequence file
File name: VBOVTA2VT1CS1120Tb0102011615.seq
The following is the content of the file:

>sM13rroot0002012
GTCGTAGCTAGCTAGCTGTGGACTGACTGA
CTAGCATCAGCTTGACTGACTAGCATAGCT
AGCATACGCTATCGATCAGGGCAGACTAGC
TGACTAGCATGACTAGCATT
>sM13rroot0003013
GTCGTAGCTTAGCTAGTCAGATCAGCATCG
AATACGATCAGCTAGCTGTGGACTGACTGA
CTAGCATCAGCTTGACTGACTAGCATAGCT
AGCATACGCTTACGTACACAGACACACA

b. Quality score file
File name: VBOVTA2VT1CS1120Tb0102011615.qal
The following is the content of the file:

>sM13rroot0002012
9 13 16 17 16 12 7 7 19 19 25 27 29 28 30 31 29 30 29 29 30 30 37 37 33 37 40 44 40 35 35 40 40 40 46 40 35 35 35 35 35 35 35 40 40 37 35 40 40 40 40 40 40 40 46 46 46 46 46 47 51 51 42 42 45 45 45 45 45 45 51 46 56 56 56 56 46 51 51 51 45 40 40 40 40 40 37 45 46 42 42 44 44 44 44 56 56 37 40 40 40 39 37 37 37 37 39 40 40 43
>sM13rroot0003013
9 13 13 19 19 9 6 6 16 19 24 24 29 29 29 29 30 37 46 34 37 40 46 40 37 37 37 37 37 40 40 40 40 40 40 35 35 40 40 37 35 35 35 35 35 35 39 40 46 46 46 46 46 56 46 42 42 46 43 51 51 51 51 42 42 51 42 42 51 51 51 51 56 51 51 51 51 51 51 56 56 56 56 56 46 42 42 42 42 42 42 42 42 43 43 35 35 35 35 35 35 56 51 51 51 51 51 51 56 56 56 56 40 40 40 40 40 40

Explanation of the file name VBOVTA2VT1CS1120Tb0102011615.seq:
The lab code is VB. The project code is OVT (a library of Orobanche minor v. Virginia tubercles). The instrument is a 377, so the code is A2. The id for this instrument is VT1. The base-calling program is ABI100, which has a code of C. The run parameter code is S1. The run time is 2 hours (120). I have used dye terminator chemistry with big dyes, so the code is Tb. I ran these samples on Feb. 1, 2001 at 4:15 pm, so the date/time stamp will be 0102011615. You only have to enter this once. The data itself carries a shorter name.

Explanation of the sequence name sM13rroot0002012:
The primer is standard primer M13f. So the name starts with an "s" followed by the three-letter primer code "M13". The clone name is rroot0002. The sequence was run on lane 12, which has the code "012".
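Before sending a pair of files, it may help to confirm that every sequence has quality scores under exactly the same name. The Python sketch below does this for the text of a sequence file and of a quality score file. The function names read_records and check_pair are hypothetical. The additional check that each sequence has one score per base is an assumption: this guide does not state it as a rule, although the first record of the example above (110 bases, 110 scores) is consistent with it.

```python
def read_records(text):
    """Group the lines of a FASTA-style file by the name on each '>' line."""
    records = {}
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = line[1:].strip()
            if not current:
                raise ValueError("'>' line without a sequence name")
            if current in records:
                raise ValueError("duplicate sequence name: %s" % current)
            records[current] = []
        elif current is None:
            raise ValueError("data found before the first '>' line")
        else:
            records[current].append(line)
    return records


def check_pair(seq_text, qal_text):
    """Return a list of problems; an empty list means the files agree."""
    # Join the base lines of each record and drop any embedded whitespace.
    bases = {name: "".join("".join(lines).split())
             for name, lines in read_records(seq_text).items()}
    # Quality scores are whitespace-separated numbers.
    scores = {name: " ".join(lines).split()
              for name, lines in read_records(qal_text).items()}
    problems = []
    for name, sequence in bases.items():
        if name not in scores:
            problems.append("%s: no quality scores" % name)
        elif len(scores[name]) != len(sequence):
            # Assumed rule: one quality score per base.
            problems.append("%s: %d bases but %d quality scores"
                            % (name, len(sequence), len(scores[name])))
    for name in scores:
        if name not in bases:
            problems.append("%s: no sequence" % name)
    return problems
```

Names are compared exactly, including case, which matches the case-sensitivity rule of the naming convention.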

4. Sending Chromatogram Files

If you wish ESTAP to run Phred for you, send the chromatograms of the sequences. Put all sequences from the same sequencing run of the same project in one folder. The name of the folder follows the file naming convention without the .seq or .qal suffix, e.g., VBOVTA2VT1CS1120Tb0102011615. Each chromatogram file follows the sequence naming convention plus the suffix ".ab1", e.g., sM13rroot0002012.ab1.

IX. How to Use dbEST Submission Tool

The dbEST Submission Tool incorporated in ESTAP is available online to allow users to conveniently submit their cleansed EST sequences in the ESTAP database to NCBI's dbEST database. Note that only PIs have the privilege to submit their EST sequences to dbEST. PIs may follow the "dbEST Submission" link on the "Project List" page to ask ESTAP to prepare the dbEST submission files for them. During dbEST submission, a contact file, a library file, a publication file, and one or more EST files will be created and sent to the primary contact person via email. EST files will be in FASTA format, with clone names and clean sequence ids as identifiers.

After submitting the EST sequences to NCBI's dbEST, the PI should receive an email from NCBI regarding the submitted batch of ESTs. In order to store the dbEST_ID, USER_ID, and GENBANK_ACCN assigned by NCBI in the ESTAP database, the PI should send the confirmation file received from NCBI to the ESTAP curator. The ESTAP curator should use the ReadConfirmationFile program to process the confirmation file.

The ReadConfirmationFile program is located under the java_prog directory of the uncompressed software. It is a stand-alone Java program which reads and parses the dbEST confirmation file and stores the dbEST_ID, USER_ID, and GENBANK_ACCN assigned by NCBI in the ESTAP database. To make this program work, the curator needs to modify a configuration file called "javaConfig.txt" in the java_prog directory and set the correct parameters (see Section IV.1.c). ReadConfirmationFile will use the parameters defined in this file to find the confirmation file and connect to the ESTAP database. The confirmation file should be named with a .txt suffix (e.g., JC_MCA_020314052943.txt) and placed in the directory path specified in the variable "ncbiConfirmFilePath" in the javaConfig.txt file. Go to the directory where ReadConfirmationFile.class is located and run the ReadConfirmationFile program using the following command:

    java ReadConfirmationFile


cluster.exe is the program name, project_id is the project id of the project you wish to run (note: this is the internal id, not the project code), and ">cluster_project_id.log 2>&1" means to log the program output (from both stdout and stderr) to a log file named cluster_project_id.log. This log file can be viewed for debugging purposes. After the program has run successfully, the log file can be removed to save disk space. You may use nohup to run the command so that it is immune to hang-ups, e.g.:
    nohup cluster.exe project_id >cluster_project_id.log 2>&1 &
To check whether the program is running, use the command:
    ps -ef | grep cluster.exe

2) Do cluster/assembly for all projects that are qualified for the process (this is preferred).
Command:
    clusterAll.exe >cluster.log 2>&1
where clusterAll.exe is the program name and ">cluster.log 2>&1" means to log the program output (from both stdout and stderr) to a log file named cluster.log. This log file can be viewed for debugging purposes. After the program has run successfully, the log file can be removed to save disk space. You may use nohup to run the command so that it is immune to hang-ups, e.g.:
    nohup clusterAll.exe >cluster.log 2>&1 &
To check whether the program is running, use the command:
    ps -ef | grep clusterAll.exe
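If the pipeline programs are started from a script rather than typed at the shell, the same pattern (detach from the terminal, send stdout and stderr to one log file, return immediately) can be written in Python 3 on Linux. This is only a sketch with a hypothetical function name, start_logged. It starts the program in a new session so that a terminal hang-up does not reach it, which is similar in effect to nohup but is not the nohup command itself.

```python
import subprocess


def start_logged(command, log_path):
    """Start a program in the background with stdout and stderr in one log.

    command is a list such as ["cluster.exe", "377"]; the call returns at once.
    """
    log_file = open(log_path, "w")           # like ">log" in the shell
    process = subprocess.Popen(
        command,
        stdout=log_file,
        stderr=subprocess.STDOUT,            # like "2>&1"
        stdin=subprocess.DEVNULL,
        start_new_session=True,              # detach from the terminal (POSIX)
    )
    log_file.close()                         # the child keeps its own handle
    return process
```

For example, start_logged(["clusterAll.exe"], "cluster.log") corresponds to the nohup command shown above. process.poll() returns None while the program is still running, which plays the role of the ps check.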

i. Run the contig BLAST program to do BLAST for assembled contigs according to the corresponding BLAST protocols.
Command:
    blastAllContigs.exe >contig_blast.log 2>&1
where blastAllContigs.exe is the program name and ">contig_blast.log 2>&1" means to log the program output (from both stdout and stderr) to a log file named contig_blast.log. This log file can be viewed for debugging purposes. After the blastAllContigs program has run successfully, the log file can be removed to save disk space. It is better to use nohup, which makes the command immune to hang-ups, and to run the program in the background (with & at the end of the command), e.g.:
    nohup blastAllContigs.exe >contig_blast.log 2>&1 &
To check whether the program is running, use the command:
    ps -ef | grep blastAllContigs.exe
Currently, the blastAllContigs.exe program is set to BLAST 1000 contigs at a time (in the future, the setting can be changed to a bigger number). If there are more than 1000 contigs to BLAST, you need to run this program multiple times. Please check contig_blast.log to see how many sequences need to be blasted.
Note: Please wait at least 5-10 minutes before starting another blastAllContigs.exe program, to make sure that the sequences to be blasted in the previous run are already recorded in the ESTAP database. Check the contig_blast.log file to see if the blast command already appears. When starting a second blastAllContigs.exe program, please use a different log file name instead of contig_blast.log, e.g., contig_blast2.log. The same precautions apply when you need to run the program three or more times.

j. Run the contig reblast program to reblast assembled contigs according to the corresponding BLAST protocols.
Command:
    reblastAllContigs.exe >contig_reblast.log 2>&1
where reblastAllContigs.exe is the program name and ">contig_reblast.log 2>&1" means to log the program output (from both stdout and stderr) to a log file named contig_reblast.log. This log file can be viewed for debugging purposes. After the reblastAllContigs program has run successfully, the log file can be removed to save disk space. It is better to use nohup, which makes the command immune to hang-ups, and to run the program in the background (with & at the end of the command), e.g.:
    nohup reblastAllContigs.exe >contig_reblast.log 2>&1 &
To check whether the program is running, use the command:
    ps -ef | grep reblastAllContigs.exe
Currently, the reblastAllContigs.exe program is set to reblast 1000 contigs at a time (in the future, the setting can be changed to a bigger number). If there are more than 1000 contigs to reblast, you need to run this program multiple times. Please check contig_reblast.log to see how many sequences need to be blasted.
Note: Please wait at least 5-10 minutes before starting another reblastAllContigs.exe program, to make sure that the sequences to be blasted in the previous run are already recorded in the ESTAP database. Check the contig_reblast.log file to see if the blast command already appears. When starting a second reblastAllContigs.exe program, please use a different log file name instead of contig_reblast.log, e.g., contig_reblast2.log. The same precautions apply when you need to run the program three or more times.
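Since each run of blastAllContigs.exe or reblastAllContigs.exe handles 1000 contigs, the number of runs needed is the contig count divided by 1000, rounded up. The following Python sketch works this out and proposes one log file name per run. The function name planned_logs is hypothetical, and the names for the third and later runs (contig_blast3.log and so on) simply extend the contig_blast2.log example given above.

```python
import math

BATCH_SIZE = 1000  # contigs handled per run, as currently set in the programs


def planned_logs(contigs_to_blast, log_stem="contig_blast"):
    """Return one log file name per run needed for the given contig count."""
    runs = math.ceil(contigs_to_blast / BATCH_SIZE)
    logs = []
    for run in range(1, runs + 1):
        # The first run keeps the plain name; later runs get a number.
        suffix = "" if run == 1 else str(run)
        logs.append("%s%s.log" % (log_stem, suffix))
    return logs
```

For 2500 contigs this gives three runs. Remember to wait at least 5-10 minutes between starting the runs, as described in the notes above.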

k. Set up and run the InterProScan application

InterProScan is a tool developed at EBI that combines different protein signature recognition methods into one resource. ESTAP wraps this tool to query contigs and singlets against the InterPro public databases to automatically annotate the protein functions. The GO terms of the matched proteins are linked to the contigs and singlets. Please follow the instructions below to set up and run the InterProScan application.

1) Download iprscan
Go to the ftp://ftp.ebi.ac.uk/pub/software/unix/iprscan/ site to download the following files:
(1) iprscan_vXXX.tar.gz, where XXX is the version number of the iprscan release, e.g., 3.2
(2) iprscan_bin_XXX.tar.gz, where XXX is your platform, e.g., Linux
(3) iprscan_DATA_XXX.tar.gz, where XXX is the version number of the InterPro database release, e.g., 6.1

2) Uncompress the files, in the order specified above, into your home directory ($HOME_DIR). The iprscan home directory will be $HOME_DIR/iprscan, e.g., /home/estap/iprscan.

3) First-time configuration
Go to the iprscan home directory and run "perl CONFIG.pl". Then:
(1) Modify CONFIG.pl as described below (save a copy of the original CONFIG.pl file as CONFIG_orig.pl).
(i) At the beginning of CONFIG.pl (in main), along with the other variable declarations, add the following lines:
    my $IprPWD = "iprpwd.txt";
    my $pwd = $ENV{PWD};
    unless ($pwd) {
        $pwd = `pwd`;
        chomp $pwd;
    }
(ii) Before the line that sets up applications,
    $applset = get_user_prompt("Setup applications $first (y|n)");
add these lines:
    open (FF, ">$IprPWD") || die "Cannot create $IprPWD: $!";
    print FF $pwd;
    close FF;

(2) Modify the InterProScan.pl file as described below (save a copy of the original file as InterProScan_orig.pl).
(i) Replace
    my $seqfile = $ARGV[0];
with
    my $seqfile = $ARGV[1];
(ii) After the line
    my $UserId = Manager->getUserId();
add these lines:
    my $OutDir = $ARGV[0];
    print "$OutDir\n";
    # Do not use $path in this file, since it won't be recognized by the iprscan wrapper program.
    # $OutDir, instead of $path, is used as an argument for the iprscan wrapper program to store iprscan result files.
(iii) Replace $path with $OutDir in all the lines following the above comment lines.

4) Reconfiguration

Go to the iprscan home directory and run "perl CONFIG.pl". You will be asked to choose the member databases you would like to search against. If you only use the public data and applications that come with the iprscan distribution, you should answer "n" for SignalPHMM and TMHMM during configuration, since they are not public. When the configuration is complete, a file called iprpwd.txt will be created under the iprscan home directory. The iprpwd.txt file stores the iprscan path information, which will be used by the iprscan wrapper programs. You must copy the iprpwd.txt file to the java_prog directory, where the iprscan wrapper programs are placed, so that the Java programs can read the iprscan path information from this file.

5) Run iprscan

Go to the java_prog directory, where the Java application and configuration files (javaConfig.txt and iprpwd.txt) are placed, and set the proper parameters in the configuration file as described in Section IV.1.c. You may analyze a specific project or all projects at a time.

If you wish to run a specific project, use the command:
    nohup java InterProDriver lab_code project_code > ipr_project_code.log 2>&1 &
e.g.,
    nohup java InterProDriver JC MCA > ipr_MCA.log 2>&1 &
where InterProDriver is the program name, lab_code is the two-letter lab code of the lab that the project belongs to, project_code is the three-letter project code of the project being analyzed, and "> ipr_project_code.log 2>&1" means to log the program output (from both stdout and stderr) to a log file named ipr_project_code.log.

If you wish to run all projects that are qualified for this procedure, use the command:
    nohup java InterProScanAll > ipr.log 2>&1 &
where InterProScanAll is the program name and "> ipr.log 2>&1" means to log the program output (from both stdout and stderr) to a log file named ipr.log.

The log files can be viewed for debugging purposes. After the program has run successfully, the log files can be removed to save disk space.

l. Run the genomic DNA assembly programs

You have two options: 1) do DNA assembly for a specific genome project; 2) do DNA assembly for all qualified genome projects.

1) Do assembly for a specific genome project.
Command:
    nohup gAssemble.exe project_id >assemble_project_id.log 2>&1
e.g.,
    nohup gAssemble.exe 377 > assemble_377.log 2>&1
where gAssemble.exe is the program name, project_id is the project id of the project you wish to run (note: this is the internal id, not the project code), and "> assemble_project_id.log 2>&1" means to log the program output (from both stdout and stderr) to a log file named assemble_project_id.log.

2) Do assembly for all genome projects that are qualified for the process (this is preferred).
Command:
    nohup gAssembleAll.exe >assemble.log 2>&1
where gAssembleAll.exe is the program name and "> assemble.log 2>&1" means to log the program output (from both stdout and stderr) to a log file named assemble.log.

m. Prepare EST database files from user projects

The ESTAP Web interface allows users to BLAST specific sequence(s) against the ESTs from their projects. We provide a program called writeAndFormatFasta.exe to create and format EST database files from user projects. This program can be scheduled to run daily, or as often as you wish, to get the most recent EST databases for users to BLAST against. To run this program, you need to create a directory where the EST database files will be placed, e.g., /home/estap/estdb. This directory path should be used for the ESTDBFilePath parameter setting in the web.xml file (see Section V.3.e) so that the ESTAP local BLAST Web service can recognize

4. Schedule Programs

The above programs can be run manually as described above or scheduled to run automatically using the UNIX cron facility. Please see the UNIX man page for the crontab command. Following is an example of the scheduling:

    0 9,11,17 * * * ~/estap_exe/estapDriver.sh >> estap.log 2>&1
    0 19,20,21 * * * ~/estap_exe/runBlast.sh >> blast.log 2>&1
    30 22,23,24 * * * ~/estap_exe/reblast.sh >> reblast.log 2>&1
    30 3 * * 2 ~/estap_exe/getAndFormatNcbiDB.sh > get_db.log 2>&1
    0 5 * * * ~/estap_exe/clusterAll.sh >> ~/estap_exe/cluster.log 2>&1
    0,30 19 * * * ~/estap_exe/blastAllContigs.sh >> ~/estap_exe/blast_contigs.log 2>&1
    0,30 12 * * * ~/estap_exe/reblastAllContigs.sh >> ~/estap_exe/reblast_contigs.log 2>&1
    0,30 7 * * * ~/estap_exe/gAssembleAll.sh >> ~/estap_exe/assemble.log 2>&1
    5 2 * * * ~/estap_exe/writeAndFormatFasta.sh >> ~/estap_exe/estap_formatdb.log 2>&1

The getAndFormatNcbiDB.sh entry is shown with 2 as the day of week; set this entry to the download frequency you need. Standard cron accepts hours 0-23 only, so replace 24 with 0 if crontab rejects the reblast.sh entry.

The shell scripts used here call the corresponding programs and make sure the programs are run under the correct directories and with the correct environment variable settings. These scripts should be revised when the directory setup and environment variable settings are different.
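When adapting the schedule, it can be useful to list the times of day at which an entry will fire, for instance to keep successive blastAllContigs runs well apart. The following Python sketch expands the minute and hour fields of an entry. It assumes the standard five-field crontab layout (minute, hour, day of month, month, day of week) and supports only "*", single values, comma lists, and ranges. The function names expand_field and daily_times are hypothetical.

```python
def expand_field(field, low, high):
    """Expand one cron field ("*", "5", "9,11,17" or "1-5") to sorted values."""
    if field == "*":
        return list(range(low, high + 1))
    values = set()
    for part in field.split(","):
        if "-" in part:
            start, end = part.split("-")
            values.update(range(int(start), int(end) + 1))
        else:
            values.add(int(part))
    for value in values:
        if not low <= value <= high:
            raise ValueError("value %d is outside %d-%d" % (value, low, high))
    return sorted(values)


def daily_times(entry):
    """Return the HH:MM times at which a crontab entry fires on a day it runs."""
    minute_field, hour_field = entry.split()[:2]
    minutes = expand_field(minute_field, 0, 59)
    hours = expand_field(hour_field, 0, 23)
    return ["%02d:%02d" % (hour, minute) for hour in hours for minute in minutes]
```

An entry with an hour of 24 raises a ValueError here, which mirrors the 0-23 range that standard cron accepts.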

5. Trouble-shooting
When a program is stopped abnormally (e.g., power or network failure, processes are killed, etc.), some data may be partially analyzed and partial results may be stored in the database. You may need to do some cleanup (removing partial results from the database) and rerun the program. Please use the following instructions:

a. Go to the ESTAP database table PROG_RUN_STATUS in the ESTAP_SYS_DEF schema. This table records which programs are running or completed, when they were started, and when they were completed. If a program is killed in the middle of a run for some reason, the end_time will not be recorded and the status will still be running (status = 1). You need to change the status from 1 to 0 to indicate the program is no longer running.

b. If the program was runBlast.exe or reblast.exe, after you change the status in the PROG_RUN_STATUS table, please go to the ANALYSIS_TRACKING table in the ESTAP_ANALYSIS schema and delete those rows created by the program that have status = 1, e.g., DELETE from ANALYSIS_TRACKING where status = 1; Commit the changes after you are sure that you want to delete them. Then rerun the runBlast.exe program.

c. If the program was cluster.exe, after you change the status in the PROG_RUN_STATUS table, please go to the CLUSTER_SUMMARY table in the ESTAP_ANALYSIS schema and delete the row with the project id that you specified when you ran cluster.exe. Commit the changes after you are sure that you want to delete them. Then rerun the cluster.exe program.

d. If the program was clusterAll.exe, after you change the status in the PROG_RUN_STATUS table (look for "cluster.exe" and "clusterAll.exe" in the prog_name column), please go to the CLUSTER_SUMMARY table in the ESTAP_ANALYSIS schema and delete the rows created by the program that have a status of 1. Commit the changes after you are sure that you want to delete them. Then rerun the clusterAll.exe program.

e. If the program was blastAllContigs.exe or reblastAllContigs.exe, after you change the status in the PROG_RUN_STATUS table, please go to the CONTIG_ANALYSIS table in the ESTAP_ANALYSIS schema and delete those rows created by the program that have status = 1, e.g., DELETE from CONTIG_ANALYSIS where status = 1; Commit the changes after you are sure that you want to delete them. Then rerun the blastAllContigs.exe program.
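The cleanup steps a through e can be summarized as a small helper that prints the SQL statements for review. This is only a sketch: the function name cleanup_sql is hypothetical, the schema-qualified table names and the use of ".exe" names in the prog_name column are assumptions drawn from the text above, and nothing here connects to Oracle. Check the statements against your own tables before running them.

```shell
#!/bin/sh
# Sketch only: prints SQL for review; it does not connect to the database.
# Schema-qualified names and prog_name values are assumptions.

# cleanup_sql PROGRAM [PROJECT_ID]
cleanup_sql() {
    prog=$1
    project_id=${2:-}
    names="'$prog'"
    case $prog in
        runBlast.exe|reblast.exe)
            delete="DELETE FROM ESTAP_ANALYSIS.ANALYSIS_TRACKING WHERE status = 1;" ;;
        cluster.exe)
            # cluster.exe results are removed per project, so an id is required.
            if [ -z "$project_id" ]; then
                echo "cluster.exe needs a project id" >&2
                return 1
            fi
            delete="DELETE FROM ESTAP_ANALYSIS.CLUSTER_SUMMARY WHERE project_id = $project_id;" ;;
        clusterAll.exe)
            # Step d says to look for both program names in prog_name.
            names="'cluster.exe', 'clusterAll.exe'"
            delete="DELETE FROM ESTAP_ANALYSIS.CLUSTER_SUMMARY WHERE status = 1;" ;;
        blastAllContigs.exe|reblastAllContigs.exe)
            delete="DELETE FROM ESTAP_ANALYSIS.CONTIG_ANALYSIS WHERE status = 1;" ;;
        *)
            echo "no cleanup rule for $prog" >&2
            return 1 ;;
    esac
    echo "UPDATE ESTAP_SYS_DEF.PROG_RUN_STATUS SET status = 0 WHERE status = 1 AND prog_name IN ($names);"
    echo "$delete"
    echo "-- COMMIT; only after checking the rows to be deleted"
}

# Example: cleanup_sql cluster.exe 42
```

Printing the statements instead of executing them keeps the final decision to commit with the curator, as the instructions above require.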

V. How to Install and Run Web Programs

1. Install Oracle Client including JDBC
Please contact Oracle to obtain licensing and product support. ESTAP uses the Oracle 9i client.

2. Install JDK, Servlet and Web Service Software

a. To install JDK 1.3 (Windows, Linux, Solaris) or higher, please go to http://java.sun.com/j2se/.

b. To install Tomcat, please go to http://jakarta.apache.org/tomcat/.

c. To install Apache Axis, please go to http://ws.apache.org/axis/. The Apache Axis beta 3 version is used for the ESTAP local blast Web service.

3. Configure Tomcat

a. Visit http://jakarta.apache.org/tomcat/ for instructions on Tomcat configuration.

b. Set up the correct PATH and CLASSPATH environment.

c. Modify server.xml in the Tomcat conf directory. In this file, add the following for the estap context:

<Context path="/estap" docBase="estap" showdebuginfo="true" debug="DEBUG" reloadable="true" crossContext="false">
  <Logger className="org.apache.catalina.logger.FileLogger" prefix="localhost_estap_log." suffix=".txt" timestamp="true"/>
</Context>

d. Place the estap Web application under the Tomcat webapps directory. Copy the estap directory from the web_side directory of the uncompressed software to the webapps directory.

e. Modify the web.xml file in the estap/WEB-INF directory. The following section in web.xml is specific to ESTAP. Please modify the parameter values for your own settings.

<servlet>
  <servlet-name>ProjectList</servlet-name>
  <servlet-class>ProjectList</servlet-class>
  <init-param>
    <param-name>jdbcDriverClassName</param-name>
    <param-value>oracle.jdbc.driver.OracleDriver</param-value>
  </init-param>
  <init-param>
    <param-name>jdbcURL</param-name>
    <param-value>jdbc:oracle:thin:@server.vt.edu:1521:estapdb</param-value>
  </init-param>
  <init-param>
    <param-name>dbUserName</param-name>
    <param-value>estap_user</param-value>
  </init-param>
  <init-param>
    <param-name>dbUserPassword</param-name>
    <param-value>password</param-value>
  </init-param>
  <init-param>
    <param-name>dbInitConnect</param-name>
    <param-value>1</param-value>
  </init-param>
  <init-param>
    <param-name>dbMaxConnect</param-name>
    <param-value>50</param-value>
  </init-param>
  <init-param>
    <param-name>dbESTFilePath</param-name>
    <param-value>/home/estap/dbEST_Outgoing</param-value>
  </init-param>
  <init-param>
    <param-name>ESTDBFilePath</param-name>
    <param-value>/home/estap/estdb</param-value>
  </init-param>
  <init-param>
    <param-name>blastallPath</param-name>
    <param-value>/home/estap/blast/blastall</param-value>
  </init-param>
  <init-param>
    <param-name>formatdbPath</param-name>
    <param-value>/home/estap/blast/formatdb</param-value>
  </init-param>
  <init-param>
    <param-name>estapEmail</param-name>
    <param-value>estap@vbi.vt.edu</param-value>
  </init-param>
  <load-on-startup>1</load-on-startup>
</servlet>

The following list defines the parameters:
jdbcURL: the URL for the JDBC connection
dbUserName: the ESTAP database user name
dbUserPassword: the ESTAP database user password
dbInitConnect: the number of database connections to be made initially
dbMaxConnect: the maximum number of database connections
dbESTFilePath: the directory path to store dbEST submission files
ESTDBFilePath: the directory path to store the EST files in FASTA format and their formatted files used by the ESTAP local blast Web service
blastallPath: the path for the blastall program
formatdbPath: the path for the formatdb program
estapEmail: the ESTAP curator's email address

f. Set heap size for JVM

Some of the ESTAP web programs require more memory than the JVM default heap size of 64MB. To avoid an "out of memory" error, the user should set a larger heap size for the JVM. To do this, the user must set the "JAVA_OPTS" variable in the catalina.sh/catalina.bat or setclasspath.sh/setclasspath.bat file in the tomcat_home/bin directory. For example, if you wish to set the heap size to 256MB, you may add the line "set JAVA_OPTS=-mx256m" to the catalina.bat file.

g. Start Tomcat server
To start the Tomcat server, execute tomcat_home/bin/startup.bat on Windows (or execute tomcat_home/bin/startup.sh on Unix/Linux). Then enter the URL http://localhost/ in your browser and make sure that you get the Tomcat welcome page, not an error message saying that the page cannot be displayed or that the server cannot be found. If you choose to use port number 8080 instead of 80, you will need to use a URL such as http://localhost:8080/ that includes the port number. If this is successful, enter http://localhost/estap/servlet/Login or http://localhost:8080/estap/servlet/Login to see if the ESTAP login page is displayed.

4. Configure Axis

a. Go to http://ws.apache.org/axis/ for downloading and installing instructions.

b. Place the downloaded axis directory under the tomcat_home/webapps directory.

c. Move the server-config.wsdd file from the tomcat_home/webapps/estap/WEB-INF directory to the tomcat_home/webapps/axis/WEB-INF directory.

d. Add all jar files under the tomcat_home/webapps/axis/WEB-INF/lib and tomcat_home/webapps/axis/WEB-INF/classes directories to your CLASSPATH environment variable.

e. Move the blastService and createESTDBService directories from the tomcat_home/webapps/estap/WEB-INF/classes directory to the tomcat_home/webapps/axis/WEB-INF/classes directory.

f. Replace the server name and Tomcat port with your own server name and port number in the JBlastServiceLocator.java and JCreateESTDBServiceLocator.java files under the blastService/ws and createESTDBService/ws directories, respectively. E.g., for blastService, run the commands cd blastService and cd ws. Edit the JBlastServiceLocator.java file by changing http://arda.vbi.vt.edu:8080 to http://yourServerName:yourPort. Then compile the file just modified by running the command javac JBlastServiceLocator.java. Repeat the above process for the createESTDBService directory.

g. Copy the blastService and createESTDBService directories to the tomcat_home/webapps/estap/WEB-INF/classes directory.

h. Compile EstapWebBlast.java by running the command javac EstapWebBlast.java.

i. Restart the Tomcat server and go to http://localhost/axis or http://localhost:8080/axis to view the list of deployed web services. If Axis is configured correctly, you will see the createESTDB and blast services.
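The edit in step f can also be made from the command line. The following is a minimal sketch under stated assumptions: the function name set_service_url is hypothetical, and the old and new URLs are passed in by the caller, so take them from your own copy of the locator files. The file still has to be recompiled with javac afterwards, as described in step f.

```shell
#!/bin/sh
# set_service_url FILE OLD_URL NEW_URL
# Keeps a backup as FILE.orig and rewrites every occurrence of OLD_URL in FILE.
# Limitation of this sketch: OLD_URL is used as a sed pattern, so a "." in it
# matches any character, and NEW_URL must not contain "#" or "&".
set_service_url() {
    file=$1
    old=$2
    new=$3
    cp "$file" "$file.orig" || return 1
    sed "s#$old#$new#g" "$file.orig" > "$file"
}

# Example (placeholders for your own values):
# set_service_url JBlastServiceLocator.java "http://oldServer:8080" "http://yourServerName:yourPort"
# javac JBlastServiceLocator.java
```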

VI. How to Register and Update ESTAP Users
ESTAP defines 3 types of users with different role permissions: sponsors, PIs, and other users.

Sponsors are those who fund the development of ESTAP. If there is no 'sponsor' according to this definition, then the person who is in charge of the ESTAP system becomes the sponsor. There is one sponsor per institution. Sponsors can register/update PIs and have all PI privileges.

PIs are those people responsible for the projects. PIs can register/update projects, project users, and project protocols. PIs can also register ESTAP codes (e.g., primer, instrument, base caller, etc.), submit ESTs of their projects to dbEST, and view their project data and analysis results.

Other users include primary contacts with update permission, primary contacts with read-only permission, and project viewers with read-only permission. Primary contacts with update privileges (associated with particular project(s)) can update projects and project protocols, and view project data and analysis results. Primary contacts with read-only permission and project viewers (associated with particular project(s)) can view project data and analysis results.

At least one sponsor must exist before PIs can register their projects and users. Sponsors are registered by the ESTAP curator/database administrator, who has direct access to the ESTAP tables in the database. Please use the following instructions to register sponsors:

1) In the INSTITUTION table, modify the first record with the institution_id = 1. Change the institution_name to the institution name of the sponsor. If there is more than one sponsor to be added, then add a new record for each remaining sponsor. For each record, increment institution_id by 1 and enter the institution name and description in the institution_name and institution_desc columns.

2) In the LAB table, modify the record with the lab_code of XX. Change all columns of this record to the desired values. The lab_email entry must be the email address of the sponsor. The lab_alt_email entry is the email address of the primary contact in the sponsor's lab. Each sponsor has one lab. You may add more records if there is more than one sponsor.

3) In the PERSON table, modify the person record with the person_id of 1. Change the user_name to the email address of the sponsor and the password to the sponsor's password for logging on to the Web site. The user_type is 1 and the sponsor_id is 1 (referring to the person_id of the sponsor him/herself in the PERSON table). Other columns are self-explanatory by their names. Add more records if there is more than one sponsor, and increment person_id for each additional sponsor.

4) In the CONTACT table, change the record with the contact_id of 1 and person_id of 1 to the contact information of the sponsor. Add more records if there is more than one sponsor, and increment contact_id for each additional sponsor.

Once sponsors are registered in the ESTAP system, they can log on to the ESTAP Web site to register or update PIs in their institutions. Once PIs are registered, PIs can log on to the ESTAP Web site to register or update their projects. In addition, PIs can register or update the primary contacts and the project viewers for their projects.

VII. How to Register and Update EST Projects and Protocols

1. Project Registration and Update
ESTAP defines two types of projects: regular and virtual projects. A regular project refers to a specific EST library that a lab constructed. The regular project information includes biological information regarding tissues and cloning methods used in cDNA library construction, and EST sequencing information. In order to analyze combined libraries from a particular taxonomy, we introduced the virtual project concept. We now allow PIs to register a virtual project that combines their regular projects from a particular taxonomy.

To register a regular project or a virtual project, please go to the "Project List" page, click the "Register Regular Project" link or the "Register Virtual Project" link, and follow the instructions in the registration form carefully. PIs and the primary contacts with update privilege may also update the project information via the Web.

The current version of ESTAP does not allow updating a project when the sequences of that project have been processed and stored in the ESTAP database. If the user needs to make corrections to the project information after the sequences are processed, he or she has to make a request to the ESTAP curator. The ESTAP curator should delete the already loaded sequences for that project, make corrections to the project information, and then reload the sequences by running the estapDriver.exe program.

2. Protocol Registration and Update
ESTAP offers the following protocols for data analysis:

1) Cleansing protocols

The cleansing protocols contain procedures to cleanse raw sequences by removing low quality end sequences, vectors, and polyA (or polyT), and by screening for chimeras and contamination. There are two types of cleansing protocols: one is the cleansing protocol for BLAST and the other is the cleansing protocol for assembly. The sequences cleansed by the cleansing protocol for BLAST are used for BLAST searching, and the sequences cleansed by the cleansing protocol for assembly are used for the cluster/assembly analysis. The user may utilize the same procedures and parameters for both protocols. There are several key points for the cleansing to be successful:

a. The cleansing program uses the 5' and 3' vector fragments provided in the project registration to match the vector portions in the clones. So the 5' and 3' vector fragments (both the orientation of the fragments and the sequences) must be correct.

b. The vector and insert adaptor sequences provided in the project registration should not be the whole adaptor sequences. Only the regions that are present in the clones should be included.

c. The sequencing primer code in the sequence names of the raw sequence data must be correct. The correct primer code indicates the correct orientation of the sequencing (5' or 3'). A wrong orientation will cause the vector/adaptor removal to fail. Please check the primer table carefully to make sure the orientation of the primer is correct.

d. When the cleansing procedure does not seem to give the expected result, please check whether the above key points have been followed.

2) Blast protocol

The blast protocol provides procedures and parameters for the BLAST (BLASTN, BLASTX, and/or TBLASTX) search. Users may choose to run all, some, or none of the blast procedures in the protocol.

3) Cluster/assembly protocol

The cluster/assembly protocol provides procedures and parameters for clustering and assembly analysis. Users register this protocol only if they wish to do cluster/assembly. To let ESTAP run the cluster/assembly programs for a project, the users need to register the cluster protocol for that project. Projects are not automatically clustered without a cluster protocol.

Please go to the View Project page of the project you wish to cluster/assemble, and follow the link called Cluster Protocol to register the protocol and set the procedure parameters. The parameters that ESTAP provides are from the d2_cluster and CAP3 programs. The Register Cluster/Assembly Protocol page has a link, "Cluster/Assembly Protocol Guide", which provides paper references for the d2_cluster and CAP3 programs. Once the protocol is registered, ESTAP will automatically perform cluster/assembly for your project according to the protocol you registered.

If you wish to perform cluster/assembly for combined libraries (not for each project), you need to register a virtual project and select the taxonomy of the libraries. Then register the BLAST and cluster/assembly protocols for this virtual project. Once the cluster protocol is registered, ESTAP will automatically perform cluster/assembly for the virtual project.

The cluster/assembly protocol also provides an option to perform InterProScan analysis. Users may choose this option to let ESTAP automatically annotate the protein functions of singlets and contigs after the clustering/assembly procedure is performed.

PIs and the primary contacts with update privilege may also update the protocols. The updated protocols will be effective for new sequence data, not for data that have already been analyzed. If the user wishes to reanalyze the old data, he or she must make such a request to the ESTAP curator. The curator should use the following instructions:

1) If the sequences of a project must be re-cleansed using the new protocol, the curator must delete the sequences of the project from the ESTAP database and reload the sequences by running the estapDriver.exe program. Use the following command to delete sequences from the database:

delete from raw_seq_file where project_id = x;
commit;

project_id is the internal id for the project. The curator may get the id from the PROJECT table by matching the lab code and project code of the project:

select project_id from project where lab_code = 'xx' and user_proj_code = 'xxx';

Please see Section IV.4.e for estapDriver.exe usage.

2) If the sequences of a project must be re-blasted using the new BLAST protocol, the curator must delete the old blast results of that project from the ESTAP database and run the runBlast.exe program. Use the following command to delete the blast results from the database:

delete from analysis_tracking where project_id = x;
commit;

Please see Section IV.4.f for runBlast.exe usage.

3) If the sequences of a project must be re-clustered using the new clustering/assembly protocol, the curator must delete the old clustering/assembly and contig blast results of that project from the ESTAP database. Then run the cluster.exe program for clustering/assembly and run the blastAllContigs.exe program for contig blast. Use the following command to delete the old results from the database:

delete from cluster_summary where project_id = x;
commit;

Use the "cluster.exe [project_id]" command to redo the clustering/assembly analysis and "blastAllContigs.exe" to redo the contig blast analysis. Note: [project_id] is the internal id for that project. Please see Section IV.4.h and Section IV.4.i for how to run the cluster.exe and blastAllContigs.exe programs.

VIII. How to Prepare Raw EST Data
Users can send ESTAP either the chromatograms of the sequences (.ab1 files) or both sequence files and their corresponding quality score files in the FASTA format. If the user sends ESTAP the chromatograms, ESTAP will use the Phred base calling program to generate the sequences and the quality scores. Each sequence file contains sequences from the same sequencing run of the same project. Each quality score file contains the quality scores of the sequences in the corresponding sequence file. Each sequence or quality score string in the file must start with a ">" followed by a sequence name.

To process data files, ESTAP requires that data providers use a standard file name for each file and a standard sequence name for each sequence in the file. The file name contains specific information about that file, and the sequence name contains specific information about the sequence. All the sequences in the same file must be from the same project and the same sequencing run. The naming convention and rules are described as follows:
• Both the file name and the sequence name are required to be of a fixed length.
• Each element in the name must conform to a required number of characters or digits.
• Pad with 0's (zeros) in front of any element that is shorter than the required length.
• The file names and sequence names are case-sensitive.

1. File naming convention
All sequence files will have "seq" as a suffix and quality score files will have "qal" as a suffix. Every seq string must have an exactly matching qal string by name. The file names use the following naming convention: LLPPPTTIIIBRRNNNCCYYMMDDHHMM.seq for the sequence file and LLPPPTTIIIBRRNNNCCYYMMDDHHMM.qal for the corresponding quality score file.

LL - two-alphanumeric letter lab code (provided by VBI), e.g., UN
PPP - three-alphanumeric letter project id (provided by VBI), e.g., MCA
TT - two-alphanumeric letter instrument type (provided by VBI), e.g., A1
III - three-alphanumeric letter instrument id (user defined code that is unique to the lab), e.g., VT1
B - one-alphanumeric letter base calling program id (provided by VBI), e.g., A
RR - two-alphanumeric letter run parameter id (provided by VBI), e.g., S1
NNN - three-digit run duration in minutes (actual running time in minutes, e.g., 090 for 90 minutes)
CC - two-alphanumeric letter sequencing chemistry id (provided by VBI), e.g., Pr
YYMMDDHHMM - the date and time at which the sequence was run, e.g., 0104150930
    YY - two-digit year
    MM - two-digit month
    DD - two-digit day
    HH - two-digit hour
    MM - two-digit minute
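Because every element has a fixed width, a file name can be split by character position. The following is a minimal sketch of such a check, not part of ESTAP: the function names and the key names in the output (lab_code, project_code, and so on) are made up for illustration.

```shell
#!/bin/sh
# estap_cut STRING RANGE: print the characters of STRING in RANGE (e.g. 3-5).
estap_cut() {
    printf '%s\n' "$1" | cut -c"$2"
}

# parse_estap_filename NAME
# Splits LLPPPTTIIIBRRNNNCCYYMMDDHHMM.seq (or .qal) into key=value lines.
# The key names are hypothetical and chosen for this sketch only.
parse_estap_filename() {
    name=$1
    base=${name%.*}
    suffix=${name##*.}
    case $suffix in
        seq|qal) ;;
        *) echo "suffix must be seq or qal: $name" >&2; return 1 ;;
    esac
    if [ "${#base}" -ne 28 ]; then
        echo "expected 28 characters before the suffix: $name" >&2
        return 1
    fi
    minutes=$(estap_cut "$base" 14-16)
    stamp=$(estap_cut "$base" 19-28)
    case $minutes$stamp in
        *[!0-9]*) echo "run duration and date/time must be digits: $name" >&2; return 1 ;;
    esac
    echo "lab_code=$(estap_cut "$base" 1-2)"
    echo "project_code=$(estap_cut "$base" 3-5)"
    echo "instrument_type=$(estap_cut "$base" 6-7)"
    echo "instrument_id=$(estap_cut "$base" 8-10)"
    echo "base_caller=$(estap_cut "$base" 11)"
    echo "run_parameter=$(estap_cut "$base" 12-13)"
    echo "run_minutes=$minutes"
    echo "chemistry=$(estap_cut "$base" 17-18)"
    echo "run_datetime=$stamp"
    echo "file_type=$suffix"
}

# Example: parse_estap_filename UNMCAA1VT1AS1090Pr0104150930.seq
```

The sketch checks only length, suffix, and the digit-only elements; whether a code such as the lab code is actually registered can only be checked against the ESTAP database.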

2. Sequence Naming Convention

The sequence names in the sequence and quality score files must be in the following format: TPPPNNNNNNNNNLLL

T - one letter primer type: u for user defined primer and s for standard primer
PPP - three-alphanumeric letter primer id (provided or confirmed by VBI), e.g., T3a
NNNNNNNNN - nine-alphanumeric letter unique clone name within the project (user defined)
LLL - three-digit lane or capillary number, e.g., 012 for lane 12 in the sequence gel

Note: Using the correct primer code (first 4 letters) is very important for the cleansing procedure. The correct primer code indicates the correct orientation of the sequencing (5' or 3'). A wrong orientation will cause the vector/adaptor removal to fail.
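The zero-padding rule can be illustrated with a small helper that assembles a sequence name from its elements. This is a sketch only; the function names pad_zeros and make_seq_name are hypothetical and not part of ESTAP.

```shell
#!/bin/sh
# pad_zeros VALUE WIDTH: put zeros in front of VALUE until it is WIDTH long.
pad_zeros() {
    value=$1
    width=$2
    while [ "${#value}" -lt "$width" ]; do
        value=0$value
    done
    printf '%s\n' "$value"
}

# make_seq_name PRIMER_TYPE PRIMER_ID CLONE LANE
# Builds a 16-character name in the TPPPNNNNNNNNNLLL layout.
make_seq_name() {
    ptype=$1
    primer=$2
    clone=$3
    lane=$4
    case $ptype in
        u|s) ;;
        *) echo "primer type must be u or s" >&2; return 1 ;;
    esac
    case $lane in
        ''|*[!0-9]*) echo "lane must be a number" >&2; return 1 ;;
    esac
    # Elements longer than their fixed width cannot be represented.
    if [ "${#primer}" -gt 3 ] || [ "${#clone}" -gt 9 ] || [ "${#lane}" -gt 3 ]; then
        echo "an element is longer than its fixed width" >&2
        return 1
    fi
    printf '%s%s%s%s\n' "$ptype" "$(pad_zeros "$primer" 3)" \
        "$(pad_zeros "$clone" 9)" "$(pad_zeros "$lane" 3)"
}

# Example: make_seq_name s M13 rroot0002 12
```

Padding is done on the text of each element, so a lane written as 012 is kept as is and is never reinterpreted as a number.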

3. Example of a Sequence and a Quality Score File

a. Sequence file
File name: VBOVTA2VT1CS1120Tb0102011615.seq
The following is the content of the file:

>sM13rroot0002012
GTCGTAGCTAGCTAGCTGTGGACTGACTGA
CTAGCATCAGCTTGACTGACTAGCATAGCT
AGCATACGCTATCGATCAGGGCAGACTAGC
TGACTAGCATGACTAGCATT
>sM13rroot0003013
GTCGTAGCTTAGCTAGTCAGATCAGCATCG
AATACGATCAGCTAGCTGTGGACTGACTGA
CTAGCATCAGCTTGACTGACTAGCATAGCT
AGCATACGCTTACGTACACAGACACACA

b. Quality score file
File name: VBOVTA2VT1CS1120Tb0102011615.qal
The following is the content of the file:

>sM13rroot0002012
9 13 16 17 16 12 7 7 19 19 25 27 29 28 30 31 29 30 29 29 30 30 37 37 33 37 40 44 40 35 35 40 40 40 46 40 35 35 35 35 35 35 35 40 40 37 35 40 40 40 40 40 40 40 46 46 46 46 46 47 51 51 42 42 45 45 45 45 45 45 51 46 56 56 56 56 46 51 51 51 45 40 40 40 40 40 37 45 46 42 42 44 44 44 44 56 56 37 40 40 40 39 37 37 37 37 39 40 40 43
>sM13rroot0003013
9 13 13 19 19 9 6 6 16 19 24 24 29 29 29 29 30 37 46 34 37 40 46 40 37 37 37 37 37 40 40 40 40 40 40 35 35 40 40 37 35 35 35 35 35 35 39 40 46 46 46 46 46 56 46 42 42 46 43 51 51 51 51 42 42 51 42 42 51 51 51 51 56 51 51 51 51 51 51 56 56 56 56 56 46 42 42 42 42 42 42 42 42 43 43 35 35 35 35 35 35 56 51 51 51 51 51 51 56 56 56 56 40 40 40 40 40 40

Explanation of file name: VBOVTA2VT1CS1120Tb0102011615.seq
The lab code is VB. The project code is OVT (a library of Orobanche minor v. Virginia tubercles). The instrument is a 377, so the code is A2. The id for this instrument is VT1. The base-calling program is ABI100, which has a code of C. The run parameter code is S1. Run time is 2 hours, so the run duration is 120. I have used dye terminator chemistry with big dyes, so the code is Tb. I ran these samples on Feb 1, 2001 at 4:15 pm, so the date/time stamp will be 0102011615. You only have to enter this once. The data itself carries a shorter name.

Explanation of sequence name: sM13rroot0002012
The primer is standard primer M13f. So the name starts with an "s" followed by the three letter primer code "M13". The clone name is rroot0002. The sequence was run on lane 12, which has a code "012".

4. Sending Chromatogram Files
If you wish ESTAP to run Phred for you, send the chromatograms of the sequences. Put all sequences from the same sequence run of the same project in a folder. The name of the folder follows the file naming convention without the seq or qal suffix, e.g., VBOVTA2VT1CS1120Tb0102011615. Each chromatogram file follows the sequence naming convention plus the suffix "ab1", e.g., sM13rroot0002012.ab1.
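Before sending a folder of chromatograms, the names can be checked with a short script. The following is a sketch under the naming rules above; the function name check_chromat_dir is hypothetical, and the check covers only the lengths, the primer type letter, the lane digits, and the .ab1 suffix.

```shell
#!/bin/sh
# check_chromat_dir DIR
# Reports names that do not fit the conventions and returns non-zero if any
# problem is found. Only the shape of the names is checked, not the codes.
check_chromat_dir() {
    dir=$1
    status=0
    folder=$(basename "$dir")
    if [ "${#folder}" -ne 28 ]; then
        echo "folder name is not 28 characters: $folder"
        status=1
    fi
    for path in "$dir"/*; do
        [ -e "$path" ] || continue
        file=$(basename "$path")
        case $file in
            # u or s, 12 further characters, 3 lane digits, then .ab1
            [us]????????????[0-9][0-9][0-9].ab1) ;;
            *) echo "unexpected file name: $file"; status=1 ;;
        esac
    done
    return $status
}

# Example: check_chromat_dir ./VBOVTA2VT1CS1120Tb0102011615
```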

IX. How to Use dbEST Submission Tool
The dbEST Submission Tool incorporated in ESTAP is available online to allow users to conveniently submit their cleansed EST sequences in the ESTAP database to NCBI's dbEST database. Note that only PIs have the privilege to submit their EST sequences to dbEST. PIs may follow the "dbEST Submission" link on the "Project List" page to ask ESTAP to prepare the dbEST submission files for them. During dbEST submission, a contact file, a library file, a publication file, and one or more EST files will be created and sent to the primary contact person via email. EST files will be in FASTA format with clone names and clean sequence ids as identifiers.

After submitting the EST sequences to NCBI's dbEST, the PI should receive an email from NCBI regarding the batch of ESTs submitted. In order to store the dbEST_ID, USER_ID, and GENBANK_ACCN assigned by NCBI in the ESTAP database, the PI should send the confirmation file received from NCBI to the ESTAP curator. The ESTAP curator should use the ReadConfirmationFile program to process the confirmation file.

The ReadConfirmationFile program is located under the java_prog directory of the uncompressed software. It is a stand-alone Java program which reads and parses the dbEST confirmation file and stores the dbEST_ID, USER_ID, and GENBANK_ACCN assigned by NCBI in the ESTAP database. To make this program work, the curator needs to modify a configuration file called "javaConfig.txt" in the java_prog directory and set the correct parameters (see Section IV.1.c). ReadConfirmationFile will use the parameters defined in this file to find the confirmation file and to connect to the ESTAP database. The confirmation file should be named using the txt suffix (e.g., JC_MCA_020314052943.txt) and placed in the directory path specified in the variable "ncbiConfirmFilePath" in the javaConfig.txt file. Go to the directory where ReadConfirmationFile.class is located and run the ReadConfirmationFile program using the following command: java ReadConfirmationFile


blastAllContigs.exe program, please use a different log file name instead of contig_blast.log, e.g., contig_blast2.log. The same precautions apply when you need to run the program three or more times.

j. Run the contig reblast program to reblast assembled contigs according to the corresponding blast protocols.
Command: reblastAllContigs.exe >contig_reblast.log 2>&1, where reblastAllContigs.exe is the program name and > contig_reblast.log 2>&1 means to log the program output (from both stdout and stderr) to a log file named contig_reblast.log. This log file can be viewed for debugging purposes. After the reblastAllContigs program has run successfully, the log file can be removed to save disk space.

You may use nohup to run the command so that it is immune to hang-ups, e.g., nohup reblastAllContigs.exe >contig_reblast.log 2>&1 &. It is better to use nohup and run the program in the background (with & at the end of the command). To check whether the program is running, use the command ps -ef | grep reblastAllContigs.exe.

Currently the reblastAllContigs.exe program is set to reblast 1000 contigs at a time (in the future the setting can be changed to a bigger number). If there are more than 1000 contigs to reblast, you need to run this program multiple times. Please go to contig_blast.log to check how many sequences need to be blasted.

Note: Please wait at least 5-10 minutes before starting another reblastAllContigs.exe program, to make sure that the sequences to be blasted in the previous run are already recorded in the ESTAP database. Check the contig_reblast.log file to see if the blast command already appears. When starting a second reblastAllContigs.exe program, please use a different log file name instead of contig_reblast.log, e.g., contig_reblast2.log. The same precautions apply when you need to run the program three or more times.
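When the program has to be started several times, picking an unused log file name by hand is easy to get wrong. The following is a minimal sketch of a helper that follows the contig_reblast.log, contig_reblast2.log pattern described above; the function name next_log_name is hypothetical.

```shell
#!/bin/sh
# next_log_name STEM
# Prints STEM.log if it does not exist yet, otherwise the first unused name
# among STEM2.log, STEM3.log, and so on, in the current directory.
next_log_name() {
    stem=$1
    if [ ! -e "$stem.log" ]; then
        echo "$stem.log"
        return 0
    fi
    n=2
    while [ -e "$stem$n.log" ]; do
        n=$((n + 1))
    done
    echo "$stem$n.log"
}

# Example, commented out so that loading this file starts nothing:
# nohup reblastAllContigs.exe > "$(next_log_name contig_reblast)" 2>&1 &
```

The helper only chooses the log file name; the 5-10 minute wait between runs described in the note still applies.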

k. Set up and run the InterProScan application
InterProScan is a tool developed at EBI that combines different protein signature recognition methods into one resource. ESTAP wraps this tool to query contigs and singlets against the InterPro public databases to automatically annotate the protein functions. The GO terms of the matched proteins are linked to the contigs and singlets. Please follow the instructions below to set up and run the InterProScan application.

1) Download iprscan

Go to the ftp://ftp.ebi.ac.uk/pub/software/unix/iprscan/ site to download the following files:
(1) iprscan_vXXX.tar.gz, where XXX is the version number of the iprscan release, e.g., 3.2
(2) iprscan_bin_XXX.tar.gz, where XXX is your platform, e.g., Linux
(3) iprscan_DATA_XXX.tar.gz, where XXX is the version number of the InterPro database release, e.g., 6.1

2) Uncompress the files in the order specified above into your home directory ($HOME_DIR). The iprscan home directory will be $HOME_DIR/iprscan, e.g., /home/estap/iprscan.

3) First time configuration
Go to the iprscan home directory and run perl CONFIG.pl. Then:

(1) Modify CONFIG.pl as described below (save a copy of the original CONFIG.pl file as CONFIG_orig.pl).

(i) At the beginning of CONFIG.pl (in main), along with the other variable declarations, add the following lines:

my $IprPWD="iprpwd.txt";
my $pwd = $ENV{PWD};
unless ($pwd) { $pwd = `pwd`; chomp $pwd; }

(ii) Before the line setting up applications:

$applset = get_user_prompt("Setup applications $first (y|n)");

add these lines:

open (FF, ">$IprPWD") || die "Cannot create $IprPWD: $!";
print FF $pwd;
close FF;

(2) Modify the InterProScan.pl file as described below (save a copy of the original file as InterProScan_orig.pl).

(i) Replace

my $seqfile = $ARGV[0];

with

my $seqfile = $ARGV[1];

(ii) After the line

my $UserId = Manager->getUserId();

add these lines:

my $OutDir=$ARGV[0];
print "$OutDir \n";
# Do not use $path in this file since it won't be recognized by the iprscan wrapper program.
# $OutDir instead of $path is used as an argument for the iprscan wrapper program to store iprscan result files.

(iii) Replace $path with $OutDir in all the lines following the above comment lines.

4) Reconfiguration

Go to the iprscan home directory and run: perl CONFIG.pl. You will be asked to choose the member databases you would like to search against. If you only use the public data and applications that come with the iprscan distribution, you should answer "n" for SignalPHMM and TMHMM during configuration, since they are not public. When the configuration is complete, a file called iprpwd.txt will be created under the iprscan home directory. The iprpwd.txt file stores the iprscan path information, which is used by the iprscan wrapper programs. You must copy the iprpwd.txt file to the java_prog directory, where the iprscan wrapper programs are placed, so that the Java programs can read the iprscan path information from this file.

5) Run iprscan

Go to the java_prog directory, where the Java application and configuration files (javaConfig.txt and iprpwd.txt) are placed, and set the proper parameters in the configuration file as described in Section IV.1.c. You may analyze a specific project or all projects at a time.

If you wish to run a specific project, use the command:

    nohup java InterProDriver lab_code project_code > ipr_project_code.log 2>&1 &

e.g.

    nohup java InterProDriver JC MCA > ipr_MCA.log 2>&1 &

where InterProDriver is the program name, lab_code is the two-letter lab code of the lab that the project belongs to, project_code is the three-letter project code of the project being analyzed, and "> ipr_project_code.log 2>&1" logs the program output (from both stdout and stderr) to a log file named ipr_project_code.log.

If you wish to run all projects that are qualified for this procedure, use the command:

    nohup java InterProScanAll > ipr.log 2>&1 &

where InterProScanAll is the program name and "> ipr.log 2>&1" logs the program output (from both stdout and stderr) to a log file named ipr.log.

The log files can be viewed for debugging purposes. After the program has run successfully, the log files can be removed to save disk space.

l. Run genomic DNA assembly programs

You have two options: 1) do DNA assembly for a specific genome project; 2) do DNA assembly for all qualified genome projects.

1) Do assembly for a specific genome project

Command:

    nohup gAssemble.exe project_id > assemble_project_id.log 2>&1

e.g.

    nohup gAssemble.exe 377 > assemble_377.log 2>&1

where gAssemble.exe is the program name, project_id is the project id of the project you wish to run (note: this is the internal id, not the project code), and "> assemble_project_id.log 2>&1" logs the program output (from both stdout and stderr) to a log file named assemble_project_id.log.

2) Do assembly for all genome projects that are qualified for the process (this is preferred)

Command:

    nohup gAssembleAll.exe > assemble.log 2>&1

where gAssembleAll.exe is the program name and "> assemble.log 2>&1" logs the program output (from both stdout and stderr) to a log file named assemble.log.

m. Prepare EST database files from user projects

The ESTAP Web interface allows users to blast specific sequence(s) against the ESTs from their projects. We provide a program called writeAndFormatFasta.exe to create and format EST database files from user projects. This program can be scheduled to run daily, or as often as you wish, to get the most recent EST databases for users to blast against. To run this program, you need to create a directory where the EST database files will be placed, e.g. /home/estap/estdb. This directory path should be used for the ESTDBFilePath parameter setting in the web.xml file (see Section V.3.e) so that the ESTAP local blast Web service can recognize it.

4. Schedule Programs

The above programs can be run manually as described above, or scheduled to run automatically using the UNIX cron command. Please see the UNIX man page for the crontab command. Following is an example of the scheduling:

    0 9,11,17 * * * ~/estap_exe/estapDriver.sh >> estap.log 2>&1
    0 19,20,21 * * * ~/estap_exe/runBlast.sh >> blast.log 2>&1
    30 22,23,24 * * * ~/estap_exe/reblast.sh >> reblast.log 2>&1
    30 3 2 ~/estap_exe/getAndFormatNcbiDB.sh > get_db.log 2>&1
    0 5 * * * ~/estap_exe/clusterAll.sh >> ~/estap_exe/cluster.log 2>&1
    0,30 19 * * * ~/estap_exe/blastAllContigs.sh >> ~/estap_exe/blast_contigs.log 2>&1
    0,30 12 * * * ~/estap_exe/reblastAllContigs.sh >> ~/estap_exe/reblast_contigs.log 2>&1
    0,30 7 * * * ~/estap_exe/gAssembleAll.sh >> ~/estap_exe/assemble.log 2>&1
    5 2 * * * ~/estap_exe/writeAndFormatFasta.sh >> ~/estap_exe/estap_formatdb.log 2>&1

The shell scripts used here call the corresponding programs and make sure the programs are run under the correct directories and with the correct environment variable settings. These scripts should be revised when the directory setup and environment variable settings are different.

5. Troubleshooting

When a program is stopped abnormally (e.g. power or network failure, processes are killed, etc.), some data may be partially analyzed and partial results may be stored in the database. You may need to do some clean-up (removing partial results from the database) and rerun the program. Please use the following instructions.

a. Go to the ESTAP database table PROG_RUN_STATUS in the ESTAP_SYS_DEF schema. This table records which programs are running or completed, when they were started, and when they were completed. If a program is killed in the middle of a run for some reason, the end_time will not be recorded and the status will still be running (status = 1). You need to change the status from 1 to 0 to indicate the program is no longer running.

b. If the program was runBlast.exe or reblast.exe: after you change the status in the PROG_RUN_STATUS table, please go to the ANALYSIS_TRACKING table in the ESTAP_ANALYSIS schema and delete those rows created by the program that have status = 1, e.g.

    DELETE from ANALYSIS_TRACKING where status = 1;

Commit the changes after you are sure that you want to delete them. Then rerun the runBlast.exe program.

c. If the program was cluster.exe: after you change the status in the PROG_RUN_STATUS table, please go to the CLUSTER_SUMMARY table in the ESTAP_ANALYSIS schema and delete the row with the project id that you specified when you ran cluster.exe. Commit the changes after you are sure that you want to delete them. Then rerun the cluster.exe program.

d. If the program was clusterAll.exe: after you change the status in the PROG_RUN_STATUS table (look for "cluster.exe" and "clusterAll.exe" in the prog_name column), please go to the CLUSTER_SUMMARY table in the ESTAP_ANALYSIS schema and delete the rows created by the program that have a status of 1. Commit the changes after you are sure that you want to delete them. Then rerun the clusterAll.exe program.

e. If the program was blastAllContigs.exe or reblastAllContigs.exe: after you change the status in the PROG_RUN_STATUS table, please go to the CONTIG_ANALYSIS table in the ESTAP_ANALYSIS schema and delete those rows created by the program that have status = 1, e.g.

    DELETE from CONTIG_ANALYSIS where status = 1;

Commit the changes after you are sure that you want to delete them. Then rerun the blastAllContigs.exe program.
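The clean-up sequence for an interrupted BLAST run (reset the status flag, delete the partial rows, commit) can be sketched as a short script. This is only an illustration under stated assumptions: it uses Python's built-in sqlite3 module as a stand-in for the Oracle database, it ignores the ESTAP_SYS_DEF and ESTAP_ANALYSIS schema prefixes, the table layouts are reduced to the columns named in this guide plus a hypothetical id column, and the function name clean_up_after_crash is made up for the example.

```python
import sqlite3

# Tables whose partial rows (status = 1) are deleted in steps b and e above.
RESULT_TABLES = {"ANALYSIS_TRACKING", "CONTIG_ANALYSIS"}


def clean_up_after_crash(conn, prog_name, results_table):
    """Reset the stale 'running' flag and delete partial results.

    Returns the number of partial result rows that were deleted.
    """
    # Table names cannot be bound as SQL parameters, so only accept known ones.
    if results_table not in RESULT_TABLES:
        raise ValueError(f"unexpected results table: {results_table}")
    cur = conn.cursor()
    # Step a: mark the killed program as no longer running (1 -> 0).
    cur.execute(
        "UPDATE PROG_RUN_STATUS SET status = 0 "
        "WHERE prog_name = ? AND status = 1",
        (prog_name,),
    )
    # Step b / e: remove rows left behind by the interrupted run.
    cur.execute(f"DELETE FROM {results_table} WHERE status = 1")
    deleted = cur.rowcount
    conn.commit()
    return deleted


# Demo on an in-memory database (simplified, assumed table layouts).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE PROG_RUN_STATUS (prog_name TEXT, end_time TEXT, status INTEGER);
    CREATE TABLE ANALYSIS_TRACKING (id INTEGER, status INTEGER);
    INSERT INTO PROG_RUN_STATUS VALUES ('runBlast.exe', NULL, 1);
    INSERT INTO ANALYSIS_TRACKING VALUES (1, 0), (2, 1), (3, 1);
    """
)
deleted = clean_up_after_crash(conn, "runBlast.exe", "ANALYSIS_TRACKING")
print(f"deleted {deleted} partial rows")
```

Against the real database the same three statements would be issued through an Oracle client, and the commit should only happen once you have checked which rows are affected.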

V. How to Install and Run Web Programs

1. Install Oracle Client including JDBC

Please contact Oracle to obtain licensing and product support. ESTAP uses the Oracle 9i client.

2. Install JDK, Servlet and Web Service Software

a. To install JDK 1.3 (Windows, Linux, Solaris) or higher, please go to http://java.sun.com/j2se/
b. To install Tomcat, please go to http://jakarta.apache.org/tomcat/
c. To install Apache Axis, please go to http://ws.apache.org/axis/. The Apache Axis beta 3 version is used for the ESTAP local blast Web service.

3. Configure Tomcat

a. Visit http://jakarta.apache.org/tomcat/ for instructions on Tomcat configuration.

b. Set up the correct PATH and CLASSPATH environment.

c. Modify server.xml in the Tomcat conf directory. In this file, add the following for the estap context:

    <Context path="/estap" docBase="estap" showdebuginfo="true" debug="DEBUG" reloadable="true" crossContext="false">
      <Logger className="org.apache.catalina.logger.FileLogger" prefix="localhost_estap_log." suffix=".txt" timestamp="true"/>
    </Context>

d. Place the estap Web application under the Tomcat webapps directory. Copy the estap directory from the web_side directory of the uncompressed software to the webapps directory.

e. Modify the web.xml file in the estap/WEB-INF directory. The following section in web.xml is specific to ESTAP. Please modify the param-value entries for your own parameter settings.

    <servlet>
      <servlet-name>ProjectList</servlet-name>
      <servlet-class>ProjectList</servlet-class>
      <init-param>
        <param-name>jdbcDriverClassName</param-name>
        <param-value>oracle.jdbc.driver.OracleDriver</param-value>
      </init-param>
      <init-param>
        <param-name>jdbcURL</param-name>
        <param-value>jdbc:oracle:thin:@server.vt.edu:1521:estapdb</param-value>
      </init-param>
      <init-param>
        <param-name>dbUserName</param-name>
        <param-value>estap_user</param-value>
      </init-param>
      <init-param>
        <param-name>dbUserPassword</param-name>
        <param-value>password</param-value>
      </init-param>
      <init-param>
        <param-name>dbInitConnect</param-name>
        <param-value>1</param-value>
      </init-param>
      <init-param>
        <param-name>dbMaxConnect</param-name>
        <param-value>50</param-value>
      </init-param>
      <init-param>
        <param-name>dbESTFilePath</param-name>
        <param-value>/home/estap/dbEST_Outgoing</param-value>
      </init-param>
      <init-param>
        <param-name>ESTDBFilePath</param-name>
        <param-value>/home/estap/estdb</param-value>
      </init-param>
      <init-param>
        <param-name>blastallPath</param-name>
        <param-value>/home/estap/blast/blastall</param-value>
      </init-param>
      <init-param>
        <param-name>formatdbPath</param-name>
        <param-value>/home/estap/blast/formatdb</param-value>
      </init-param>
      <init-param>
        <param-name>estapEmail</param-name>
        <param-value>estap@vbi.vt.edu</param-value>
      </init-param>
      <load-on-startup>1</load-on-startup>
    </servlet>

The following list defines the parameters:

    jdbcURL: the URL for the JDBC connection
    dbUserName: the ESTAP database user name
    dbUserPassword: the ESTAP database user password
    dbInitConnect: the number of database connections to be made initially
    dbMaxConnect: the maximum number of database connections
    dbESTFilePath: the directory path to store dbEST submission files
    ESTDBFilePath: the directory path to store the EST files in FASTA format and their formatted files, used by the ESTAP local blast Web service
    blastallPath: the path for the blastall program
    formatdbPath: the path for the formatdb program
    estapEmail: the ESTAP curator's email address
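Before starting Tomcat it can help to confirm that every parameter listed above is actually present in web.xml. The following is a minimal sketch of such a check using Python's standard xml.etree.ElementTree module. The helper names read_init_params and missing_params are hypothetical, and the sketch assumes a web.xml without XML namespaces, as in the fragment shown in this guide. The embedded sample is deliberately incomplete so the check has something to report.

```python
import xml.etree.ElementTree as ET

# Parameter names taken from the web.xml section described in this guide.
REQUIRED_PARAMS = [
    "jdbcDriverClassName", "jdbcURL", "dbUserName", "dbUserPassword",
    "dbInitConnect", "dbMaxConnect", "dbESTFilePath", "ESTDBFilePath",
    "blastallPath", "formatdbPath", "estapEmail",
]


def read_init_params(web_xml_text, servlet_name):
    """Return {param-name: param-value} for the named servlet."""
    root = ET.fromstring(web_xml_text)
    for servlet in root.iter("servlet"):
        if servlet.findtext("servlet-name", "").strip() == servlet_name:
            return {
                p.findtext("param-name", "").strip():
                    p.findtext("param-value", "").strip()
                for p in servlet.findall("init-param")
            }
    raise KeyError(f"servlet not found: {servlet_name}")


def missing_params(params):
    """List required parameters that are absent or empty."""
    return [name for name in REQUIRED_PARAMS if not params.get(name)]


# Deliberately incomplete sample to show what the check reports.
sample = """
<web-app>
  <servlet>
    <servlet-name>ProjectList</servlet-name>
    <servlet-class>ProjectList</servlet-class>
    <init-param>
      <param-name>jdbcURL</param-name>
      <param-value>jdbc:oracle:thin:@server.vt.edu:1521:estapdb</param-value>
    </init-param>
    <init-param>
      <param-name>dbMaxConnect</param-name>
      <param-value>50</param-value>
    </init-param>
  </servlet>
</web-app>
"""

params = read_init_params(sample, "ProjectList")
missing = missing_params(params)
print("missing parameters:", ", ".join(missing))
```

To check a real installation, read the text of estap/WEB-INF/web.xml from disk and pass it to read_init_params in place of the sample string.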

f. Set heap size for JVM

Some of the ESTAP web programs require more memory than the JVM default heap size of 64MB. To avoid an "out of memory" error, the user should set a larger heap size for the JVM. To do this, set the "JAVA_OPTS" variable in the catalina.sh/catalina.bat or setclasspath.sh/setclasspath.bat file in the tomcat_home/bin directory. For example, if you wish to set the heap size to 256MB, you may add the line "set JAVA_OPTS=-mx256m" to the catalina.bat file.

g. Start Tomcat server

To start the Tomcat server, execute tomcat_home/bin/startup.bat on Windows (or tomcat_home/bin/startup.sh on Unix/Linux). Then enter the URL http://localhost/ in your browser and make sure that you get the Tomcat welcome page, not an error message saying that the page cannot be displayed or that the server cannot be found. If you choose to use port number 8080 instead of 80, you will need to use a URL that includes the port number, such as http://localhost:8080/. If this is successful, enter http://localhost/estap/servlet/Login or http://localhost:8080/estap/servlet/Login to see if the ESTAP login page is displayed.

4. Configure Axis

a. Go to http://ws.apache.org/axis/ for downloading and installing instructions.

b. Place the downloaded axis directory under the tomcat_home/webapps directory.

c. Move the server-config.wsdd file from the tomcat_home/webapps/estap/WEB-INF directory to the tomcat_home/webapps/axis/WEB-INF directory.

d. Add all jar files under the tomcat_home/webapps/axis/WEB-INF/lib and tomcat_home/webapps/axis/WEB-INF/classes directories to your CLASSPATH environment variable.

e. Move the blastService and createESTDBService directories from the tomcat_home/webapps/estap/WEB-INF/classes directory to the tomcat_home/webapps/axis/WEB-INF/classes directory.

f. Replace the server name and Tomcat port with your own server name and port number in the JBlastServiceLocator.java and JCreateESTDBServiceLocator.java files under the blastService/ws and createESTDBService/ws directories, respectively. For example, for blastService, run the commands:

    cd blastService
    cd ws

Edit the JBlastServiceLocator.java file by changing the line http://arda.vbi.vt.edu:8080 to http://yourServerName:yourPort. Then compile the file just modified by running the command:

    javac JBlastServiceLocator.java

Repeat the above process for the createESTDBService directory.

g. Copy the blastService and createESTDBService directories to the tomcat_home/webapps/estap/WEB-INF/classes directory.

h. Compile EstapWebBlast.java by running the command:

    javac EstapWebBlast.java

i. Restart the Tomcat server and go to http://localhost/axis or http://localhost:8080/axis to view the list of deployed web services. If Axis is configured correctly, you will see the createESTDB and blast services.

VI. How to Register and Update ESTAP Users

ESTAP defines 3 types of users with different role permissions: sponsors, PIs, and other users.

Sponsors are those who fund the development of ESTAP. If there is no "sponsor" according to this definition, then the person who is in charge of the ESTAP system becomes the sponsor. There is one sponsor per institution. Sponsors can register/update PIs and have all PI privileges.

PIs are the people responsible for the projects. PIs can register/update projects, project users, and project protocols. PIs can also register ESTAP codes (e.g. primer, instrument, base caller, etc.), submit ESTs of their projects to dbEST, and view their project data and analysis results.

Other users include primary contacts with update permission, primary contacts with read-only permission, and project viewers with read-only permission. Primary contacts with update privileges (associated with particular project(s)) can update projects and project protocols, and view project data and analysis results. Primary contacts with read-only permission and project viewers (associated with particular project(s)) can view project data and analysis results.

At least one sponsor must exist before PIs can register their projects and users. Sponsors are registered by the ESTAP curator/database administrator, who has direct access to the ESTAP tables in the database. Please use the following instructions to register sponsors.

1) In the INSTITUTION table, modify the first record, with institution_id = 1. Change the institution_name to the institution name of the sponsor. If there is more than one sponsor to be added, add a new record for each remaining sponsor. For each record, increment institution_id by 1 and enter the institution name and description in the institution_name and institution_desc columns.

2) In the LAB table, modify the record with a lab_code of XX. Change all columns of this record to the desired values. The lab_email entry must be the email address of the sponsor. The lab_alt_email entry is the email address of the primary contact in the sponsor's lab. Each sponsor has one lab. You may add more records if there is more than one sponsor.

3) In the PERSON table, modify the person record with a person_id of 1. Change the user_name to the email address of the sponsor and the password to the sponsor's password for logging on to the Web site. The user_type is 1 and the sponsor_id is 1 (referring to the person_id of the sponsor him/herself in the PERSON table). Other columns are self-explanatory by their names. Add more records if there is more than one sponsor, and increment person_id for each additional sponsor.

4) In the CONTACT table, change the record with a contact_id of 1 and a person_id of 1 to the contact information of the sponsor. Add more records if there is more than one sponsor, and increment contact_id for each additional sponsor.

Once sponsors are registered in the ESTAP system, they can log on to the ESTAP Web site to register or update PIs in their institutions. Once PIs are registered, they can log on to the ESTAP Web site to register or update their projects. In addition, PIs can register or update the primary contacts and the project viewers for their projects.

VII. How to Register and Update EST Projects and Protocols

1. Project Registration and Update

ESTAP defines two types of projects: regular and virtual projects. A regular project refers to a specific EST library that a lab constructed. The regular project information includes biological information regarding tissues and cloning methods used in cDNA library construction, and EST sequencing information. In order to analyze combined libraries from a particular taxonomy, we introduced the virtual project concept. We now allow PIs to register a virtual project that combines their regular projects from a particular taxonomy.

To register a regular project or a virtual project, please go to the "Project List" page, click the "Register Regular Project" link or the "Register Virtual Project" link, and follow the instructions in the registration form carefully.

PIs and the primary contacts with update privilege may also update the project information via the Web. The current version of ESTAP does not allow updating a project once the sequences of that project have been processed and stored in the ESTAP database. If the user needs to make corrections to the project information after the sequences are processed, he or she has to make a request to the ESTAP curator. The ESTAP curator should delete the already loaded sequences for that project, make corrections to the project information, and then reload the sequences by running the estapDriver.exe program.

2. Protocol Registration and Update

ESTAP offers the following protocols for data analysis.

1) Cleansing protocols

The cleansing protocols contain procedures to cleanse raw sequences by removing low-quality end sequences, vectors, and polyA (or polyT), and by screening for chimeras and contamination. There are two types of cleansing protocols: one is the cleansing protocol for BLAST and the other is the cleansing protocol for assembly. The sequences cleansed by the cleansing protocol for BLAST are used for BLAST searching, and the sequences cleansed by the cleansing protocol for assembly are used for the cluster/assembly analysis. The user may utilize the same procedures and parameters for both protocols. There are several key points for the cleansing to be successful:

a. The cleansing program uses the 5' and 3' vector fragments provided in the project registration to match the vector portions in the clones. So the 5' and 3' vector fragments (both the orientation of the fragments and the sequences) must be correct.

b. The vector and insert adaptor sequences provided in the project registration should not be the whole adaptor sequences. Only the regions that are present in the clones should be included.

c. The sequencing primer code in the sequence names of the raw sequence data must be correct. The correct primer code indicates the correct orientation of the sequencing (5' or 3'). A wrong orientation will cause the removal of vector/adaptor to fail. Please check the primer table carefully to make sure the orientation of the primer is correct.

d. When the cleansing procedure does not seem to give the expected result, please check whether the above key points have been followed.

2) Blast protocol

The blast protocol provides procedures and parameters for BLAST (BLASTN, BLASTX, and/or TBLASTX) searches. Users may choose to run all, some, or none of the blast procedures in the protocol.

3) Cluster/assembly protocol

The cluster/assembly protocol provides procedures and parameters for clustering and assembly analysis. Users register this protocol only if they wish to do cluster/assembly. To let ESTAP run the cluster/assembly programs, users need to register the cluster protocol for each project they wish to cluster/assemble. Projects are not automatically clustered without the cluster protocol. Please go to the View Project page of the project you wish to cluster/assemble and follow the link called Cluster Protocol to register the protocol and set the procedure parameters. The parameters that ESTAP provides are from the d2_cluster and CAP3 programs. The Register Cluster/Assembly Protocol page has a link, "Cluster/Assembly Protocol Guide", which provides paper references for the d2_cluster and CAP3 programs. Once the protocol is registered, ESTAP will automatically perform cluster/assembly for your project according to the protocol you registered.

If you wish to perform cluster/assembly for combined libraries (not for each project), you need to register a virtual project and select the taxonomy of the libraries. Then register the BLAST and cluster/assembly protocols for this virtual project. Once the cluster protocol is registered, ESTAP will automatically perform cluster/assembly for the virtual project.

The cluster/assembly protocol also provides an option to perform InterProScan analysis. Users may choose this option to let ESTAP automatically annotate the protein functions of singlets and contigs after the clustering/assembly procedure is performed.

PIs and the primary contacts with update privilege may also update the protocols. The updated protocols will be effective for new sequence data, not for data that have already been analyzed. If the user wishes to reanalyze the old data, he or she must make such a request to the ESTAP curator. The curator should use the following instructions.

1) If the sequences of a project must be re-cleansed using the new protocol, the curator must delete the sequences of the project from the ESTAP database and reload the sequences by running the estapDriver.exe program. Use the following command to delete sequences from the database:

    delete from raw_seq_file where project_id = x;
    commit;

project_id is the internal id for the project. The curator may get the id from the PROJECT table by matching the lab code and project code of the project:

    select project_id from project where lab_code = 'xx' and user_proj_code = 'xxx';

Please see Section IV.4.e for estapDriver.exe usage.

2) If the sequences of a project must be re-blasted using the new BLAST protocol, the curator must delete the old blast results of that project from the ESTAP database and run the runBlast.exe program. Use the following command to delete the blast results from the database:

    delete from analysis_tracking where project_id = x;
    commit;

Please see Section IV.4.f for runBlast.exe usage.

3) If the sequences of a project must be re-clustered using the new clustering/assembly protocol, the curator must delete the old clustering/assembly and contig blast results of that project from the ESTAP database, then run the cluster.exe program for clustering/assembly and the blastAllContigs.exe program for contig blast. Use the following command to delete the old results from the database:

    delete from cluster_summary where project_id = x;
    commit;

Use the "cluster.exe [project_id]" command to redo the clustering/assembly analysis and "blastAllContigs.exe" to redo the contig blast analysis. Note: [project_id] is the internal id for that project. Please see Section IV.4.h and Section IV.4.i for how to run the cluster.exe and blastAllContigs.exe programs.

VIII. How to Prepare Raw EST Data

Users can send ESTAP either the chromatograms of the sequences (.ab1 files) or both sequence files and their corresponding quality score files in FASTA format. If the user sends ESTAP the chromatograms, ESTAP will use the Phred base calling program to generate the sequences and the quality scores. Each sequence file contains sequences from the same sequencing run of the same project. Each quality score file contains the quality scores of the sequences in the corresponding sequence file. Each sequence or quality score string in the file must start with a ">" followed by a sequence name.

To process data files, ESTAP requires that data providers use a standard file name for each file and a standard sequence name for each sequence in the file. The file name contains specific information about that file, and the sequence name contains specific information about the sequence. All the sequences in the same file must be from the same project and the same sequencing run. The naming convention and rules are as follows:

• Both the file name and the sequence name are required to be of a fixed length.
• Each element in the name must conform to a required number of characters or digits.
• Pad with 0's (zeros) in front of any element that is shorter than the required length.

The file names and sequence names are case-sensitive.

1. File Naming Convention

All sequence files have ".seq" as a suffix and quality score files have ".qal" as a suffix. Every seq string must have an exactly matching qal string by name. The file names use the following naming convention: LLPPPTTIIIBRRNNNCCYYMMDDHHMM.seq for a sequence file and LLPPPTTIIIBRRNNNCCYYMMDDHHMM.qal for the corresponding quality score file.

LL - two-character alphanumeric lab code (provided by VBI), e.g. UN
PPP - three-character alphanumeric project id (provided by VBI), e.g. MCA
TT - two-character alphanumeric instrument type (provided by VBI), e.g. A1
III - three-character alphanumeric instrument id (user-defined code that is unique to the lab), e.g. VT1
B - one-character alphanumeric base calling program id (provided by VBI), e.g. A
RR - two-character alphanumeric run parameter id (provided by VBI), e.g. S1
NNN - three-digit run duration in minutes (actual running time in minutes, e.g. 090 for 90 minutes)
CC - two-character alphanumeric sequencing chemistry id (provided by VBI), e.g. Pr
YYMMDDHHMM - the date and time at which the sequence was run, e.g. 0104150930
    YY - two-digit year
    MM - two-digit month
    DD - two-digit day
    HH - two-digit hour
    MM - two-digit minute
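Because every element of the file name has a fixed width, a name can be checked and split mechanically. The following is a minimal sketch in Python of one way to do that. The function name parse_file_name and the dictionary keys are illustrative choices, not part of ESTAP. The sketch only checks the shape of the name, not whether the codes are registered with VBI, and it assumes the two-digit year is interpreted by Python's usual %y rule.

```python
import re
from datetime import datetime

# Field widths follow the LLPPPTTIIIBRRNNNCCYYMMDDHHMM layout described above.
FILE_NAME_RE = re.compile(
    r"^(?P<lab>[A-Za-z0-9]{2})"              # LL
    r"(?P<project>[A-Za-z0-9]{3})"           # PPP
    r"(?P<instrument_type>[A-Za-z0-9]{2})"   # TT
    r"(?P<instrument_id>[A-Za-z0-9]{3})"     # III
    r"(?P<base_caller>[A-Za-z0-9])"          # B
    r"(?P<run_param>[A-Za-z0-9]{2})"         # RR
    r"(?P<duration_min>[0-9]{3})"            # NNN
    r"(?P<chemistry>[A-Za-z0-9]{2})"         # CC
    r"(?P<timestamp>[0-9]{10})"              # YYMMDDHHMM
    r"\.(?P<suffix>seq|qal)$"
)


def parse_file_name(name):
    """Split a sequence or quality file name into its elements.

    Raises ValueError if the name does not match the fixed-width layout
    or if the date/time stamp is not a real date.
    """
    match = FILE_NAME_RE.match(name)
    if match is None:
        raise ValueError(f"not a valid ESTAP file name: {name!r}")
    fields = match.groupdict()
    fields["duration_min"] = int(fields["duration_min"])
    fields["timestamp"] = datetime.strptime(fields["timestamp"], "%y%m%d%H%M")
    return fields


info = parse_file_name("VBOVTA2VT1CS1120Tb0102011615.seq")
print(info["lab"], info["project"], info["duration_min"], info["timestamp"])
```

A check like this can be run over a batch of files before sending them, so that naming mistakes are caught before the data reach the pipeline.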

2. Sequence Naming Convention

The sequence names in the sequence and quality score files must be in the following format: TPPPNNNNNNNNNLLL

T - one-letter primer type: u for a user-defined primer and s for a standard primer
PPP - three-character alphanumeric primer id (provided or confirmed by VBI), e.g. T3a
NNNNNNNNN - nine-character alphanumeric clone name, unique within the project (user defined)
LLL - three-digit lane or capillary number, e.g. 012 for lane 12 in the sequence gel

Note: Using the correct primer code (the first 4 letters) is very important for the cleansing procedure. The correct primer code indicates the correct orientation of the sequencing (5' or 3'). A wrong orientation will cause the removal of vector/adaptor to fail.
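The fixed-width sequence name, including the zero-padding rule, can be illustrated with a short Python sketch. The function names parse_sequence_name and build_sequence_name are hypothetical helpers written for this example. Padding is applied here only to the clone name and the lane number, and the primer id is assumed to be supplied at its full three characters.

```python
import re

# Layout TPPPNNNNNNNNNLLL as described above.
SEQ_NAME_RE = re.compile(
    r"^(?P<primer_type>[us])"          # T: u = user-defined, s = standard
    r"(?P<primer_id>[A-Za-z0-9]{3})"   # PPP
    r"(?P<clone>[A-Za-z0-9]{9})"       # NNNNNNNNN
    r"(?P<lane>[0-9]{3})$"             # LLL
)


def parse_sequence_name(name):
    """Split a sequence name into primer type, primer id, clone and lane."""
    match = SEQ_NAME_RE.match(name)
    if match is None:
        raise ValueError(f"not a valid ESTAP sequence name: {name!r}")
    fields = match.groupdict()
    fields["lane"] = int(fields["lane"])
    return fields


def build_sequence_name(primer_type, primer_id, clone, lane):
    """Assemble a name, padding short elements with leading zeros."""
    name = f"{primer_type}{primer_id}{clone.rjust(9, '0')}{lane:03d}"
    parse_sequence_name(name)  # raises ValueError if any element is malformed
    return name


example = build_sequence_name("s", "M13", "rroot0002", 12)
padded = build_sequence_name("u", "T3a", "ab12", 7)
print(example, padded)
```

Building names through one function keeps the padding consistent between the .seq and .qal files, which must agree name for name.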

3. Example of a Sequence and a Quality Score File

a. Sequence file

File name: VBOVTA2VT1CS1120Tb0102011615.seq

The following is the content of the file:

    >sM13rroot0002012
    GTCGTAGCTAGCTAGCTGTGGACTGACTGA
    CTAGCATCAGCTTGACTGACTAGCATAGCT
    AGCATACGCTATCGATCAGGGCAGACTAGC
    TGACTAGCATGACTAGCATT
    >sM13rroot0003013
    GTCGTAGCTTAGCTAGTCAGATCAGCATCG
    AATACGATCAGCTAGCTGTGGACTGACTGA
    CTAGCATCAGCTTGACTGACTAGCATAGCT
    AGCATACGCTTACGTACACAGACACACA

b. Quality score file

File name: VBOVTA2VT1CS1120Tb0102011615.qal

The following is the content of the file:

    >sM13rroot0002012
    9 13 16 17 16 12 7 7 19 19 25 27 29 28 30 31 29 30 29 29 30 30 37 37 33 37 40 44 40 35 35 40 40 40 46 40 35 35 35 35 35 35 35 40 40 37 35 40 40 40 40 40 40 40 46 46 46 46 46 47 51 51 42 42 45 45 45 45 45 45 51 46 56 56 56 56 46 51 51 51 45 40 40 40 40 40 37 45 46 42 42 44 44 44 44 56 56 37 40 40 40 39 37 37 37 37 39 40 40 43
    >sM13rroot0003013
    9 13 13 19 19 9 6 6 16 19 24 24 29 29 29 29 30 37 46 34 37 40 46 40 37 37 37 37 37 40 40 40 40 40 40 35 35 40 40 37 35 35 35 35 35 35 39 40 46 46 46 46 46 56 46 42 42 46 43 51 51 51 51 42 42 51 42 42 51 51 51 51 56 51 51 51 51 51 51 56 56 56 56 56 46 42 42 42 42 42 42 42 42 43 43 35 35 35 35 35 35 56 51 51 51 51 51 51 56 56 56 56 40 40 40 40 40 40

Explanation of the file name VBOVTA2VT1CS1120Tb0102011615.seq: The lab code is VB. The project code is OVT (a library of Orobanche minor v. Virginia tubercles). The instrument is a 377, so the code is A2. The id for this instrument is VT1. The base-calling program is ABI100, which has a code of C. The run parameter code is S1. The run time is 2 hours, so the run duration is 120. Dye terminator chemistry with big dyes was used, so the code is Tb. The samples were run on Feb 1, 2001 at 4:15 pm, so the date/time stamp is 0102011615. You only have to enter this once; the data itself carries a shorter name.

Explanation of the sequence name sM13rroot0002012: The primer is the standard primer M13f, so the name starts with an "s" followed by the three-letter primer code "M13". The clone name is rroot0002. The sequence was run on lane 12, which has the code "012".
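The requirement that every sequence has a quality record of exactly the same name can be verified before files are sent. The following Python sketch shows one way to do this on the text of a .seq file and its .qal file. The function names are hypothetical, and the additional check that the number of quality scores equals the number of bases is an assumption made for this example, not a rule stated in this guide.

```python
def parse_fasta_like(text):
    """Map each '>' record name to the whitespace-separated tokens of its body."""
    records = {}
    name = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            if not header:
                raise ValueError("record header without a name")
            name = header[0]
            if name in records:
                raise ValueError(f"duplicate record name: {name}")
            records[name] = []
        elif name is None:
            raise ValueError("data found before the first '>' header")
        else:
            records[name].extend(line.split())
    return records


def check_seq_qal(seq_text, qal_text):
    """Return a sorted list of problems; an empty list means the files agree."""
    seqs = parse_fasta_like(seq_text)
    quals = parse_fasta_like(qal_text)
    problems = []
    for name in sorted(seqs.keys() - quals.keys()):
        problems.append(f"{name}: no quality scores")
    for name in sorted(quals.keys() - seqs.keys()):
        problems.append(f"{name}: no sequence")
    for name in sorted(seqs.keys() & quals.keys()):
        n_bases = sum(len(chunk) for chunk in seqs[name])
        n_scores = len(quals[name])
        # Assumed rule: one quality score per base.
        if n_bases != n_scores:
            problems.append(f"{name}: {n_bases} bases but {n_scores} quality scores")
    return problems


seq_text = ">sM13rroot0002012\nGTCG\nTA\n>sM13rroot0003013\nACGT\n"
qal_text = ">sM13rroot0002012\n9 13 16 17\n16 12\n>sM13rroot0003013\n9 13 13\n"
problems = check_seq_qal(seq_text, qal_text)
print(problems)
```

For real data, read the contents of the paired .seq and .qal files and pass them to check_seq_qal; any non-empty result lists the records that need attention.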

4. Sending Chromatogram Files

If you wish ESTAP to run Phred for you, send the chromatograms of the sequences. Put all sequences from the same sequence run of the same project in a folder. The name of the folder follows the file naming convention without the .seq or .qal suffix, e.g. VBOVTA2VT1CS1120Tb0102011615. Each chromatogram file follows the sequence naming convention plus the suffix ".ab1", e.g. sM13rroot0002012.ab1.

IX. How to Use dbEST Submission Tool

The dbEST Submission Tool incorporated in ESTAP is available online to allow users to conveniently submit their cleansed EST sequences in the ESTAP database to NCBI's dbEST database. Note that only PIs have the privilege to submit their EST sequences to dbEST. PIs may follow the "dbEST Submission" link on the "Project List" page to ask ESTAP to prepare the dbEST submission files for them. During dbEST submission, a contact file, a library file, a publication file, and one or more EST files will be created and sent to the primary contact person via email. EST files will be in FASTA format, with clone name and clean sequence ids as identifiers.

After submitting the EST sequences to NCBI's dbEST, the PI should receive an email from NCBI regarding the batch of ESTs submitted. In order to store the dbEST_ID, USER_ID, and GENBANK_ACCN assigned by NCBI in the ESTAP database, the PI should send the confirmation file received from NCBI to the ESTAP curator. The ESTAP curator should use the ReadConfirmationFile program to process the confirmation file.

The ReadConfirmationFile program is located under the java_prog directory of the uncompressed software. It is a stand-alone Java program that reads and parses the dbEST confirmation file and stores the dbEST_ID, USER_ID, and GENBANK_ACCN assigned by NCBI in the ESTAP database. To make this program work, the curator needs to modify a configuration file called "javaConfig.txt" in the java_prog directory and set the correct parameters (see Section IV.1.c). ReadConfirmationFile will use the parameters defined in this file to find the confirmation file and connect to the ESTAP database. The confirmation file should be named using the .txt suffix (e.g. JC_MCA_020314052943.txt) and placed in the directory path specified in the variable "ncbiConfirmFilePath" in the javaConfig.txt file. Go to the directory where ReadConfirmationFile.class is located and run the ReadConfirmationFile program using the following command:

    java ReadConfirmationFile


2) Uncompress the files in the order specified above into your home directory ($HOME_DIR). The iprscan home directory will be $HOME_DIR/iprscan, e.g., /home/estap/iprscan.

3) First-time configuration

Go to the iprscan home directory and run perl CONFIG.pl. Then:

(1) Modify CONFIG.pl as described below (save a copy of the original CONFIG.pl file as CONFIG_orig.pl).

(i) At the beginning of CONFIG.pl (in main), along with the other variable declarations, add the following lines:

    my $IprPWD = "iprpwd.txt";
    my $pwd = $ENV{PWD};
    unless ($pwd) { $pwd = `pwd`; chomp $pwd; }

(ii) Before the line that sets up applications,

    $applset = get_user_prompt("Setup applications $first (y|n)");

add these lines:

    open (FF, ">$IprPWD") || die "Cannot create $IprPWD: $!";
    print FF $pwd;
    close FF;

(2) Modify the InterProScan.pl file as described below (save a copy of the original file as InterProScan_orig.pl).

(i) Replace

    my $seqfile = $ARGV[0];

with

    my $seqfile = $ARGV[1];

(ii) After the line

    my $UserId = Manager->getUserId();

add these lines:

    my $OutDir = $ARGV[0];
    print "$OutDir\n";
    # Do not use $path in this file since it won't be recognized by the iprscan wrapper program.
    # $OutDir, instead of $path, is used as an argument for the iprscan wrapper program to store iprscan result files.

(iii) Replace $path with $OutDir in all the lines following the above comment lines.

4) Reconfiguration

Go to the iprscan home directory and run perl CONFIG.pl. You will be asked to choose the member databases you would like to search against. If you only use the public data and applications that come with the iprscan distribution, you should answer “n” for SignalPHMM and TMHMM during configuration, since they are not public. When the configuration is complete, a file called iprpwd.txt will be created under the iprscan home directory. The iprpwd.txt file stores the iprscan path information, which is used by the iprscan wrapper programs. You must copy the iprpwd.txt file to the java_prog directory, where the iprscan wrapper programs are placed, so that the Java programs can read the iprscan path information from this file.

5) Run iprscan

Go to the java_prog directory, where the Java application and configuration files (javaConfig.txt and iprpwd.txt) are placed, and set the proper parameters in the configuration file as described in Section IV.1.c. You may analyze a specific project or all projects at a time.

If you wish to run a specific project, use the command:

    nohup java InterProDriver lab_code project_code > ipr_project_code.log 2>&1 &

e.g.,

    nohup java InterProDriver JC MCA > ipr_MCA.log 2>&1

where InterProDriver is the program name, lab_code is the two-letter lab code of the lab that the project belongs to, project_code is the three-letter project code of the project being analyzed, and > ipr_project_code.log 2>&1 means to log the program output (from both stdout and stderr) to a log file named ipr_project_code.log.

If you wish to run all projects that are qualified for this procedure, use the command:

    nohup java InterProScanAll > ipr.log 2>&1 &

where InterProScanAll is the program name and > ipr.log 2>&1 means to log the program output (from both stdout and stderr) to a log file named ipr.log. The log files can be viewed for debugging purposes. After the program has run successfully, the log files can be removed to save disk space.

l. Run genomic DNA assembly programs

You have two options: 1) do DNA assembly for a specific genome project; 2) do DNA assembly for all qualified genome projects.

1) Do assembly for a specific genome project

    Command: nohup gAssemble.exe project_id > assemble_project_id.log 2>&1
    e.g., nohup gAssemble.exe 377 > assemble_377.log 2>&1

where gAssemble.exe is the program name, project_id is the project id of the project you wish to run (note: this is the internal id, not the project code), and > assemble_project_id.log 2>&1 means to log the program output (from both stdout and stderr) to a log file named assemble_project_id.log.

2) Do assembly for all genome projects that are qualified for the process (this is preferred)

    Command: nohup gAssembleAll.exe > assemble.log 2>&1

where gAssembleAll.exe is the program name and > assemble.log 2>&1 means to log the program output (from both stdout and stderr) to a log file named assemble.log.

m. Prepare EST database files from user projects

The ESTAP Web interface allows users to blast specific sequence(s) against the ESTs from their projects. We provide a program called writeAndFormatFasta.exe to create and format EST database files from user projects. This program can be scheduled to run daily, or as often as you wish, to get the most recent EST databases for users to blast against. To run this program, you need to create a directory where the EST database files will be placed, e.g., /home/estap/estdb. This directory path should be used for the ESTDBFilePath parameter setting in the web.xml file (see Section V.3.e) so that the ESTAP local blast Web service can recognize

4. Schedule Programs
The above programs can be run manually as described above or scheduled to run automatically using the UNIX cron command. Please see the UNIX man page for the crontab command. Following is an example of the scheduling:

    0 91117 ~/estap_exe/estapDriver.sh >> estap.log 2>&1
    0 19 20 21 ~/estap_exe/runBlast.sh >> blast.log 2>&1
    30 22 23 24 ~/estap_exe/reblast.sh >> reblast.log 2>&1
    30 3 2 ~/estap_exe/getAndFormatNcbiDB.sh > get_db.log 2>&1
    0 5 ~/estap_exe/clusterAll.sh >> ~/estap_exe/cluster.log 2>&1
    030 19 ~/estap_exe/blastAllContigs.sh >> ~/estap_exe/blast_contigs.log 2>&1
    030 12 ~/estap_exe/reblastAllContigs.sh >> ~/estap_exe/reblast_contigs.log 2>&1
    030 7 ~/estap_exe/gAssembleAll.sh >> ~/estap_exe/assemble.log 2>&1
    5 2 ~/estap_exe/writeAndFormatFasta.sh >> ~/estap_exe/estap_formatdb.log 2>&1

The shell scripts used here call the corresponding programs and make sure the programs are run under the correct directories with the correct environment variable settings. These scripts should be revised when the directory setup and environment variable settings are different.
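The wrapper scripts themselves are not listed in this guide. As a rough sketch of what such a wrapper might do, the hypothetical helper below runs a command from a given working directory and appends both stdout and stderr to a log file. The function name run_logged and the paths in the usage note are assumptions for illustration, not part of the ESTAP distribution.

```shell
#!/bin/sh
# Hypothetical wrapper helper (not part of the ESTAP distribution).
# Usage: run_logged WORK_DIR LOG_FILE COMMAND [ARGS...]
# Runs COMMAND from WORK_DIR and appends its stdout and stderr to LOG_FILE.
# A relative LOG_FILE is resolved against WORK_DIR, because the directory
# change happens before the redirection is opened.
run_logged() {
    work_dir=$1
    log_file=$2
    shift 2
    # The subshell keeps the directory change from leaking into the caller.
    (
        cd "$work_dir" || exit 1
        "$@" >> "$log_file" 2>&1
    )
}
```

A wrapper such as estapDriver.sh could then export the Oracle environment variables it needs and call, for example, run_logged "$HOME/estap_exe" estap.log ./estapDriver.exe. A crontab entry for it consists of five time fields (minute, hour, day of month, month, day of week) followed by the path to the wrapper.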

5. Trouble-shooting
When a program is stopped abnormally (e.g., power or network failure, processes are killed, etc.), some data may be partially analyzed and partial results may be stored in the database. You may need to do some clean-up (removing partial results from the database) and rerun the program. Please use the following instructions:

a. Go to the ESTAP database table PROG_RUN_STATUS in the ESTAP_SYS_DEF schema. This table records which programs are running or completed, when they were started, and when they were completed. If a program is killed in the middle of a run for some reason, the end_time will not be recorded and the status will still be running (status = 1). You need to change the status from 1 to 0 to indicate that the program is no longer running.

b. If the program was runBlast.exe or reblast.exe, after you change the status in the PROG_RUN_STATUS table, please go to the ANALYSIS_TRACKING table in the ESTAP_ANALYSIS schema and delete those rows created by the program that have status = 1, e.g., DELETE from ANALYSIS_TRACKING where status = 1; Commit the changes after you are sure that you want to delete them. Then rerun the runBlast.exe program.

c. If the program was cluster.exe, after you change the status in the PROG_RUN_STATUS table, please go to the CLUSTER_SUMMARY table in the ESTAP_ANALYSIS schema and delete the row with the project id that you specified when you ran cluster.exe. Commit the changes after you are sure that you want to delete them. Then rerun the cluster.exe program.

d. If the program was clusterAll.exe, after you change the status in the PROG_RUN_STATUS table (look for “cluster.exe” and “clusterAll.exe” in the prog_name column), please go to the CLUSTER_SUMMARY table in the ESTAP_ANALYSIS schema and delete the rows created by the program that have a status of 1. Commit the changes after you are sure that you want to delete them. Then rerun the clusterAll.exe program.

e. If the program was blastAllContigs.exe or reblastAllContigs.exe, after you change the status in the PROG_RUN_STATUS table, please go to the CONTIG_ANALYSIS table in the ESTAP_ANALYSIS schema and delete those rows created by the program that have status = 1, e.g., DELETE from CONTIG_ANALYSIS where status = 1; Commit the changes after you are sure that you want to delete them. Then rerun the blastAllContigs.exe program.
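Steps a through e can be summarized as a small lookup from program name to clean-up statements. The sketch below only prints the SQL so the curator can review it before running and committing it. The function name cleanup_sql is hypothetical, and the sketch assumes that the prog_name column holds the program file names exactly as written here and that CLUSTER_SUMMARY has a project_id column.

```shell
#!/bin/sh
# Hypothetical helper: print (but do not run) the clean-up SQL for a program
# that stopped abnormally. Schema, table, and column names are those named in
# this section; exact prog_name values are an assumption.
# Usage: cleanup_sql PROG_NAME [PROJECT_ID]
cleanup_sql() {
    prog=$1
    project_id=${2:-}
    names="'$prog'"
    case $prog in
        runBlast.exe|reblast.exe)
            delete="DELETE FROM ESTAP_ANALYSIS.ANALYSIS_TRACKING WHERE status = 1;" ;;
        cluster.exe)
            if [ -z "$project_id" ]; then
                echo "cleanup_sql: cluster.exe needs a project id" >&2
                return 1
            fi
            delete="DELETE FROM ESTAP_ANALYSIS.CLUSTER_SUMMARY WHERE project_id = $project_id;" ;;
        clusterAll.exe)
            # Step d: both program names may appear in PROG_RUN_STATUS.
            names="'cluster.exe', 'clusterAll.exe'"
            delete="DELETE FROM ESTAP_ANALYSIS.CLUSTER_SUMMARY WHERE status = 1;" ;;
        blastAllContigs.exe|reblastAllContigs.exe)
            delete="DELETE FROM ESTAP_ANALYSIS.CONTIG_ANALYSIS WHERE status = 1;" ;;
        *)
            echo "cleanup_sql: no clean-up rule for $prog" >&2
            return 1 ;;
    esac
    # Step a: mark the program as no longer running.
    echo "UPDATE ESTAP_SYS_DEF.PROG_RUN_STATUS SET status = 0 WHERE status = 1 AND prog_name IN ($names);"
    echo "$delete"
}
```

The helper deliberately prints no COMMIT, since the instructions above ask the curator to commit only after checking what will be deleted.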

V. How to Install and Run Web Programs

1. Install Oracle Client including JDBC
Please contact Oracle to obtain licensing and product support. ESTAP uses the Oracle 9i client.

2. Install JDK, Servlet, and Web Service Software

a. To install JDK 1.3 (Windows, Linux, Solaris) or higher, please go to http://java.sun.com/j2se/

b. To install Tomcat, please go to http://jakarta.apache.org/tomcat/

c. To install Apache Axis, please go to http://ws.apache.org/axis/. The Apache Axis beta 3 version is used for the ESTAP local blast Web service.

3. Configure Tomcat
a. Visit http://jakarta.apache.org/tomcat/ for instructions on Tomcat configuration.
b. Set up the correct PATH and CLASSPATH environment variables.

c. Modify server.xml in the Tomcat conf directory. In this file, add the following for the estap context:

    <Context path="/estap" docBase="estap" showdebuginfo="true" debug="DEBUG" reloadable="true" crossContext="false">
      <Logger className="org.apache.catalina.logger.FileLogger" prefix="localhost_estap_log." suffix=".txt" timestamp="true"/>
    </Context>

d. Place the estap Web application under the Tomcat webapps directory: copy the estap directory from the web_side directory of the uncompressed software to the webapps directory.

e. Modify the web.xml file in the estap/WEB-INF directory. The following section in web.xml is specific to ESTAP. Please modify the text in bold for your own parameter settings.

    <servlet>
      <servlet-name>ProjectList</servlet-name>
      <servlet-class>ProjectList</servlet-class>
      <init-param>
        <param-name>jdbcDriverClassName</param-name>
        <param-value>oracle.jdbc.driver.OracleDriver</param-value>
      </init-param>
      <init-param>
        <param-name>jdbcURL</param-name>
        <param-value>jdbc:oracle:thin:@server.vt.edu:1521:estapdb</param-value>
      </init-param>
      <init-param>
        <param-name>dbUserName</param-name>
        <param-value>estap_user</param-value>
      </init-param>
      <init-param>
        <param-name>dbUserPassword</param-name>
        <param-value>password</param-value>
      </init-param>
      <init-param>
        <param-name>dbInitConnect</param-name>
        <param-value>1</param-value>
      </init-param>
      <init-param>
        <param-name>dbMaxConnect</param-name>
        <param-value>50</param-value>
      </init-param>
      <init-param>
        <param-name>dbESTFilePath</param-name>
        <param-value>/home/estap/dbEST_Outgoing</param-value>
      </init-param>
      <init-param>
        <param-name>ESTDBFilePath</param-name>
        <param-value>/home/estap/estdb</param-value>
      </init-param>
      <init-param>
        <param-name>blastallPath</param-name>
        <param-value>/home/estap/blast/blastall</param-value>
      </init-param>
      <init-param>
        <param-name>formatdbPath</param-name>
        <param-value>/home/estap/blast/formatdb</param-value>
      </init-param>
      <init-param>
        <param-name>estapEmail</param-name>
        <param-value>estap@vbi.vt.edu</param-value>
      </init-param>
      <load-on-startup>1</load-on-startup>
    </servlet>

The following list defines the parameters:
    jdbcURL: the URL for the JDBC connection
    dbUserName: the ESTAP database user name
    dbUserPassword: the ESTAP database user password
    dbInitConnect: the number of database connections to be made initially
    dbMaxConnect: the maximum number of database connections
    dbESTFilePath: the directory path to store dbEST submission files
    ESTDBFilePath: the directory path to store the EST files in FASTA format and their formatted files used by the ESTAP local blast Web service
    blastallPath: the path for the blastall program
    formatdbPath: the path for the formatdb program
    estapEmail: the ESTAP curator’s email address

f. Set heap size for JVM

Some of the ESTAP web programs require more memory than the JVM default heap size of 64MB. To avoid an “out of memory” error, the user should set a larger heap size for the JVM. To do this, the user must set the “JAVA_OPTS” variable in the catalina.sh/catalina.bat or setclasspath.sh/setclasspath.bat file in the tomcat_home/bin directory. For example, if you wish to set the heap size to 256MB, you may add the line “set JAVA_OPTS=-mx256m” to the catalina.bat file.

g. Start Tomcat server
To start the Tomcat server, execute tomcat_home/bin/startup.bat on Windows (or execute tomcat_home/bin/startup.sh on Unix/Linux). Then enter the URL http://localhost/ in your browser and make sure that you get the Tomcat welcome page, not an error message saying that the page cannot be displayed or that the server cannot be found. If you choose to use port number 8080 instead of 80, you will need to use a URL such as http://localhost:8080/ that includes the port number. If this is successful, enter http://localhost/estap/servlet/Login or http://localhost:8080/estap/servlet/Login to see if the ESTAP login page is displayed.
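The heap-size example in step f uses Windows batch syntax. On Unix/Linux, a comparable setting in catalina.sh might look like the sketch below. -Xmx is the standard JVM option for the maximum heap size; treating it as equivalent to the -mx256m value shown in step f, and placing the lines near the top of catalina.sh, are assumptions to check against your own JVM and Tomcat versions.

```shell
#!/bin/sh
# Sketch for tomcat_home/bin/catalina.sh (Bourne shell syntax).
# Append a 256 MB maximum heap to JAVA_OPTS, keeping any options already set.
JAVA_OPTS="${JAVA_OPTS:+$JAVA_OPTS }-Xmx256m"
export JAVA_OPTS
```

Appending rather than overwriting keeps options that other scripts may already have placed in JAVA_OPTS.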

4. Configure Axis
a. Go to http://ws.apache.org/axis/ for downloading and installing instructions.
b. Place the downloaded axis directory under the tomcat_home/webapps directory.
c. Move the server-config.wsdd file from the tomcat_home/webapps/estap/WEB-INF directory to the tomcat_home/webapps/axis/WEB-INF directory.
d. Add all jar files under the tomcat_home/webapps/axis/WEB-INF/lib and tomcat_home/webapps/axis/WEB-INF/classes directories to your CLASSPATH environment variable.

e. Move the blastService and createESTDBService directories from the tomcat_home/webapps/estap/WEB-INF/classes directory to the tomcat_home/webapps/axis/WEB-INF/classes directory.

f. Replace the server name and Tomcat port with your own server name and port number in the JBlastServiceLocator.java and JCreateESTDBServiceLocator.java files under the blastService/ws and createESTDBService/ws directories, respectively. E.g., for blastService, run the commands:

    cd blastService
    cd ws

Edit the JBlastServiceLocator.java file by changing http://arda.vbi.vt.edu:8080 to http://yourServerName:yourPort. Then compile the file just modified by running the command:

    javac JBlastServiceLocator.java

Repeat the above process for the createESTDBService directory.

g. Copy the blastService and createESTDBService directories to the tomcat_home/webapps/estap/WEB-INF/classes directory.

h. Compile EstapWebBlast.java by running the command:

    javac EstapWebBlast.java

i. Restart the Tomcat server and go to http://localhost/axis or http://localhost:8080/axis to view the list of deployed web services. If Axis is configured correctly, you will see the createESTDB and blast services.
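The manual edit in step f can also be scripted. The sketch below replaces one URL with another in a source file using sed. The function name set_service_url and the replacement host in the usage note are illustrative assumptions; the original URL is the one quoted in step f, written with its punctuation as http://arda.vbi.vt.edu:8080.

```shell
#!/bin/sh
# Hypothetical helper: replace OLD_URL with NEW_URL throughout FILE.
# Usage: set_service_url FILE OLD_URL NEW_URL
# Limitations: both URLs are placed inside a sed expression that uses '|' as
# the delimiter, so they must not contain '|', '&', or backslashes; dots in
# OLD_URL match any single character.
set_service_url() {
    file=$1
    old=$2
    new=$3
    tmp="$file.tmp.$$"
    # Write to a temporary file first so a failed sed leaves FILE untouched.
    sed "s|$old|$new|g" "$file" > "$tmp" && mv "$tmp" "$file"
}
```

For example, set_service_url JBlastServiceLocator.java http://arda.vbi.vt.edu:8080 http://myhost.example.org:8080 would be followed by the javac command from step f; the host name here is a placeholder.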

VI. How to Register and Update ESTAP Users
ESTAP defines three types of users with different role permissions: sponsors, PIs, and other users.

Sponsors are those who fund the development of ESTAP. If there is no ‘sponsor’ according to this definition, then the person who is in charge of the ESTAP system becomes the sponsor. There is one sponsor per institution. Sponsors can register/update PIs and have all PI privileges.

PIs are the people responsible for the projects. PIs can register/update projects, project users, and project protocols. PIs can also register ESTAP codes (e.g., primer, instrument, base caller, etc.), submit ESTs of their projects to dbEST, and view their project data and analysis results.

Other users include primary contacts with update permission, primary contacts with read-only permission, and project viewers with read-only permission. Primary contacts with update privileges (associated with particular project(s)) can update projects and project protocols and view project data and analysis results. Primary contacts with read-only permission and project viewers (associated with particular project(s)) can view project data and analysis results.

At least one sponsor must exist before PIs can register their projects and users. Sponsors are registered by the ESTAP curator/database administrator, who has direct access to the ESTAP tables in the database. Please use the following instructions to register sponsors:

1) In the INSTITUTION table, modify the first record, which has institution_id = 1. Change the institution_name to the institution name of the sponsor. If there is more than one sponsor to be added, add a new record for each remaining sponsor. For each record, increment institution_id by 1 and enter the institution name and description in the institution_name and institution_desc columns.

2) In the LAB table, modify the record with the lab_code of XX. Change all columns of this record to the desired values. The lab_email entry must be the email address of the sponsor. The lab_alt_email entry is the email address of the primary contact in the sponsor’s lab. Each sponsor has one lab. You may add more records if there is more than one sponsor.

3) In the PERSON table, modify the person record with the person_id of 1. Change the user_name to the email address of the sponsor and the password to the sponsor’s password for logging on to the Web site. The user_type is 1 and the sponsor_id is 1 (referring to the person_id of the sponsor him/herself in the PERSON table). Other columns are self-explanatory by their names. Add more records if there is more than one sponsor, and increment person_id for each additional sponsor.

4) In the CONTACT table, change the record with the contact_id of 1 and person_id of 1 to the contact information of the sponsor. Add more records if there is more than one sponsor, and increment contact_id for each additional sponsor.

Once sponsors are registered in the ESTAP system, they can log on to the ESTAP Web site to register or update PIs in their institutions. Once PIs are registered, PIs can log on to the ESTAP Web site to register or update their projects. In addition, PIs can register or update the primary contacts and the project viewers for their projects.

VII. How to Register and Update EST Projects and Protocols

1. Project Registration and Update
ESTAP defines two types of projects: regular and virtual projects. A regular project refers to a specific EST library that a lab constructed. The regular project information includes biological information regarding tissues and cloning methods used in cDNA library construction, and EST sequencing information. In order to analyze combined libraries from a particular taxonomy, we introduced the virtual project concept. We now allow PIs to register a virtual project that combines their regular projects from a particular taxonomy.

To register a regular project or a virtual project, please go to the “Project List” page, click the “Register Regular Project” link or the “Register Virtual Project” link, and follow the instructions in the registration form carefully. PIs and the primary contacts with update privilege may also update the project information via the Web.

The current version of ESTAP does not allow updating a project once the sequences of that project have been processed and stored in the ESTAP database. If the user needs to make corrections to the project information after the sequences are processed, he or she has to make a request to the ESTAP curator. The ESTAP curator should delete the already loaded sequences for that project, make corrections to the project information, and then reload the sequences by running the estapDriver.exe program.

2. Protocol Registration and Update
ESTAP offers the following protocols for data analysis:

1) Cleansing protocols

The cleansing protocols contain procedures to cleanse raw sequences by removing low-quality end sequences, vectors, and polyA (or polyT), and by screening for chimeras and contamination. There are two types of cleansing protocols: one is the cleansing protocol for BLAST and the other is the cleansing protocol for assembly. The sequences cleansed by the cleansing protocol for BLAST are used for BLAST searching, and the sequences cleansed by the cleansing protocol for assembly are used for the cluster/assembly analysis. The user may utilize the same procedures and parameters for both protocols. There are several key points for the cleansing to be successful:

a. The cleansing program uses the 5’ and 3’ vector fragments provided in the project registration to match the vector portions in the clones. So the 5’ and 3’ vector fragments (both the orientation of the fragments and the sequences) must be correct.

b. The vector and insert adaptor sequences provided in the project registration should not be the whole adaptor sequences. Only the regions that are present in the clones should be included.

c. The sequencing primer code in the sequence names of the raw sequence data must be correct. The correct primer code indicates the correct orientation of the sequencing (5’ or 3’). A wrong orientation will cause the vector/adaptor removal to fail. Please check the primer table carefully to make sure the orientation of the primer is correct.

d. When the cleansing procedure does not seem to give the expected result, please check whether the above key points have been followed.

2) Blast protocol

The blast protocol provides procedures and parameters for BLAST (BLASTN, BLASTX, and/or TBLASTX) searches. Users may choose to run all, some, or none of the blast procedures in the protocol.

3) Cluster/assembly protocol

The cluster/assembly protocol provides procedures and parameters for clustering and assembly analysis. Users register this protocol only if they wish to do cluster/assembly. To let ESTAP run the cluster/assembly programs, users need to register the cluster protocol for each project they wish to cluster/assemble. Projects are not clustered automatically without a cluster protocol.

Please go to the View Project page of the project you wish to cluster/assemble and follow the link called Cluster Protocol to register the protocol and set the procedure parameters. The parameters that ESTAP provides are from the d2_cluster and CAP3 programs. The Register Cluster/Assembly Protocol page has a link, “Cluster/Assembly Protocol Guide”, which provides paper references for the d2_cluster and CAP3 programs. Once the protocol is registered, ESTAP will automatically perform cluster/assembly for your project according to the protocol you registered.

If you wish to perform cluster/assembly for combined libraries (not for each project), you need to register a virtual project and select the taxonomy of the libraries. Then register the BLAST and cluster/assembly protocols for this virtual project. Once the cluster protocol is registered, ESTAP will automatically perform cluster/assembly for the virtual project.

The cluster/assembly protocol also provides an option to perform InterProScan analysis. Users may choose this option to let ESTAP automatically annotate the protein functions of singlets and contigs after the clustering/assembly procedure is performed.

PIs and the primary contacts with update privilege may also update the protocols. The updated protocols will be effective for new sequence data, not for data that have already been analyzed. If the user wishes to reanalyze the old data, he or she must make such a request to the ESTAP curator. The curator should use the following instructions:

1) If the sequences of a project must be re-cleansed using the new protocol, the curator must delete the sequences of the project from the ESTAP database and reload the sequences by running the estapDriver.exe program. Use the following command to delete sequences from the database:

    delete from raw_seq_file where project_id = x;
    commit;

project_id is the internal id for the project. The curator may get the id from the PROJECT table by matching the lab code and project code of the project:

    select project_id from project where lab_code = 'xx' and user_proj_code = 'xxx';

Please see Section IV.4.e for estapDriver.exe usage.

2) If the sequences of a project must be re-blasted using the new BLAST protocol, the curator must delete the old blast results of that project from the ESTAP database and run the runBlast.exe program. Use the following command to delete the blast results from the database:

    delete from analysis_tracking where project_id = x;
    commit;

Please see Section IV.4.f for runBlast.exe usage.

3) If the sequences of a project must be re-clustered using the new clustering/assembly protocol, the curator must delete the old clustering/assembly and contig blast results of that project from the ESTAP database, then run the cluster.exe program for clustering/assembly and the blastAllContigs.exe program for contig blast. Use the following command to delete these results from the database:

    delete from cluster_summary where project_id = x;
    commit;

Use the “cluster.exe [project_id]” command to redo the clustering/assembly analysis and “blastAllContigs.exe” to redo the contig blast analysis. Note: [project_id] is the internal id for that project. Please see Section IV.4.h and Section IV.4.i for how to run the cluster.exe and blastAllContigs.exe programs.

VIII. How to Prepare Raw EST Data
Users can send ESTAP either the chromatograms of the sequences (.ab1 files) or both sequence files and their corresponding quality score files in FASTA format. If the user sends ESTAP the chromatograms, ESTAP will use the Phred base-calling program to generate the sequences and the quality scores.

Each sequence file contains sequences from the same sequencing run of the same project. Each quality score file contains the quality scores of the sequences in the corresponding sequence file. Each sequence or quality score string in the file must start with a “>” followed by a sequence name.

To process data files, ESTAP requires that data providers use a standard file name for each file and a standard sequence name for each sequence in the file. The file name contains specific information about that file, and the sequence name contains specific information about the sequence. All the sequences in the same file must be from the same project and the same sequencing run. The naming convention and rules are as follows:

  • Both the file name and the sequence name are required to be of a fixed length.
  • Each element in the name must conform to a required number of characters or digits.
  • Pad with 0’s (zeros) in front of any element that is shorter than the required length.

The file names and sequence names are case-sensitive.

1. File Naming Convention
All sequence files have “.seq” as a suffix and quality score files have “.qal” as a suffix. Every seq string must have an exactly matching qal string by name. The file names use the following naming convention: LLPPPTTIIIBRRNNNCCYYMMDDHHMM.seq for the sequence file and LLPPPTTIIIBRRNNNCCYYMMDDHHMM.qal for the corresponding quality score file.

    LL – two-alphanumeric letter lab code (provided by VBI), e.g., UN
    PPP – three-alphanumeric letter project id (provided by VBI), e.g., MCA
    TT – two-alphanumeric letter instrument type (provided by VBI), e.g., A1
    III – three-alphanumeric letter instrument id (user-defined code that is unique to the lab), e.g., VT1
    B – one-alphanumeric letter base calling program id (provided by VBI), e.g., A
    RR – two-alphanumeric letter run parameter id (provided by VBI), e.g., S1
    NNN – three-digit run duration in minutes (actual running time in minutes, e.g., 090 for 90 minutes)
    CC – two-alphanumeric letter sequencing chemistry id (provided by VBI), e.g., Pr
    YYMMDDHHMM – the date and time at which the sequence was run, e.g., 0104150930
        YY – two-digit year
        MM – two-digit month
        DD – two-digit day
        HH – two-digit hour
        MM – two-digit minute
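Because every element has a fixed width, a file name can be split by character position. The following is a minimal sketch, assuming the suffix is written as .seq or .qal and the name carries no directory part; the function names and the output labels are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helpers: split an ESTAP data file name into its fixed-width fields.

# name_field LABEL TEXT START END: print "LABEL=<characters START..END of TEXT>".
name_field() {
    printf '%s=%s\n' "$1" "$(printf '%s' "$2" | cut -c"$3"-"$4")"
}

# parse_file_name NAME: NAME is LLPPPTTIIIBRRNNNCCYYMMDDHHMM plus .seq or .qal.
parse_file_name() {
    base=${1%.seq}
    base=${base%.qal}
    if [ "${#base}" -ne 28 ]; then
        echo "parse_file_name: expected 28 characters, got ${#base}" >&2
        return 1
    fi
    name_field lab "$base" 1 2
    name_field project "$base" 3 5
    name_field instrument_type "$base" 6 7
    name_field instrument_id "$base" 8 10
    name_field base_caller "$base" 11 11
    name_field run_param "$base" 12 13
    name_field run_minutes "$base" 14 16
    name_field chemistry "$base" 17 18
    name_field run_datetime "$base" 19 28
}
```

For a name such as VBOVTA2VT1CS1120Tb0102011615.seq, this prints lab=VB, project=OVT, instrument_type=A2, instrument_id=VT1, base_caller=C, run_param=S1, run_minutes=120, chemistry=Tb, and run_datetime=0102011615, one per line.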

2. Sequence Naming Convention

The sequence names in the sequence and quality score files must be in the following format: TPPPNNNNNNNNNLLL

    T – one-letter primer type: u for user-defined primer and s for standard primer
    PPP – three-alphanumeric letter primer id (provided or confirmed by VBI), e.g., T3a
    NNNNNNNNN – nine-alphanumeric letter unique clone name within the project (user defined)
    LLL – three-digit lane or capillary number, e.g., 012 for lane 12 in the sequence gel

Note: Using the correct primer code (first 4 letters) is very important for the cleansing procedure. The correct primer code indicates the correct orientation of the sequencing (5’ or 3’). A wrong orientation will cause the vector/adaptor removal to fail.
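The zero-padding rule can be applied mechanically when sequence names are generated. The sketch below assembles a name from its four elements and pads short elements with leading zeros; pad_left and make_seq_name are hypothetical helper names, not ESTAP programs.

```shell
#!/bin/sh
# pad_left VALUE WIDTH: prefix VALUE with zeros until it is WIDTH characters long.
pad_left() {
    value=$1
    while [ "${#value}" -lt "$2" ]; do
        value="0$value"
    done
    printf '%s' "$value"
}

# make_seq_name TYPE PRIMER CLONE LANE: assemble a TPPPNNNNNNNNNLLL name.
make_seq_name() {
    case $1 in
        u|s) ;;
        *)
            echo "make_seq_name: primer type must be u or s" >&2
            return 1 ;;
    esac
    name="$1$(pad_left "$2" 3)$(pad_left "$3" 9)$(pad_left "$4" 3)"
    # Padding never shortens an element, so a name longer than 16 characters
    # means at least one element exceeded its width.
    if [ "${#name}" -ne 16 ]; then
        echo "make_seq_name: an element is longer than its fixed width" >&2
        return 1
    fi
    printf '%s\n' "$name"
}
```

For example, make_seq_name s M13 rroot0002 12 prints sM13rroot0002012. Padding is done as text rather than with a numeric printf format, so a lane given as 08 is not misread as an octal number.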

3. Example of a Sequence and a Quality Score File

a. Sequence file
File name: VBOVTA2VT1CS1120Tb0102011615.seq
The following is the content of the file:

    >sM13rroot0002012
    GTCGTAGCTAGCTAGCTGTGGACTGACTGA
    CTAGCATCAGCTTGACTGACTAGCATAGCT
    AGCATACGCTATCGATCAGGGCAGACTAGC
    TGACTAGCATGACTAGCATT
    >sM13rroot0003013
    GTCGTAGCTTAGCTAGTCAGATCAGCATCG
    AATACGATCAGCTAGCTGTGGACTGACTGA
    CTAGCATCAGCTTGACTGACTAGCATAGCT
    AGCATACGCTTACGTACACAGACACACA

b. Quality score file
File name: VBOVTA2VT1CS1120Tb0102011615.qal
The following is the content of the file:

    >sM13rroot0002012
    9 13 16 17 16 12 7 7 19 19 25 27 29 28 30 31 29 30 29 29 30 30 37 37 33 37 40 44 40 35 35 40 40 40 46 40 35 35 35 35 35 35 35 40 40 37 35 40 40 40 40 40 40 40 46 46 46 46 46 47 51 51 42 42 45 45 45 45 45 45 51 46 56 56 56 56 46 51 51 51 45 40 40 40 40 40 37 45 46 42 42 44 44 44 44 56 56 37 40 40 40 39 37 37 37 37 39 40 40 43
    >sM13rroot0003013
    9 13 13 19 19 9 6 6 16 19 24 24 29 29 29 29 30 37 46 34 37 40 46 40 37 37 37 37 37 40 40 40 40 40 40 35 35 40 40 37 35 35 35 35 35 35 39 40 46 46 46 46 46 56 46 42 42 46 43 51 51 51 51 42 42 51 42 42 51 51 51 51 56 51 51 51 51 51 51 56 56 56 56 56 46 42 42 42 42 42 42 42 42 43 43 35 35 35 35 35 35 56 51 51 51 51 51 51 56 56 56 56 40 40 40 40 40 40
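Since every entry in a sequence file needs an entry of the same name in the quality score file, the pairing can be checked before the data are sent. This is a minimal sketch under the assumption that only the names need to agree and that their order within the two files does not matter; names_of and check_pair are hypothetical helper names.

```shell
#!/bin/sh
# names_of FILE: print the sorted record names (header lines without the ">").
names_of() {
    grep '^>' "$1" | sed 's/^>//' | sort
}

# check_pair SEQ_FILE QAL_FILE: succeed only if both files carry the same names.
check_pair() {
    seq_names=$(names_of "$1")
    qal_names=$(names_of "$2")
    if [ "$seq_names" = "$qal_names" ]; then
        return 0
    fi
    echo "check_pair: sequence names differ between $1 and $2" >&2
    return 1
}
```

A call such as check_pair VBOVTA2VT1CS1120Tb0102011615.seq VBOVTA2VT1CS1120Tb0102011615.qal returns a non-zero status when a name is missing from either file. It does not check that the number of quality scores matches the sequence length.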

Explanation of file name: VBOVTA2VT1CS1120Tb0102011615.seq
The lab code is VB. The project code is OVT (a library of Orobanche minor v. Virginia tubercles). The instrument is a 377, so the code is A2. The id for this instrument is VT1. The base-calling program is ABI100, which has a code of C. The run parameter code is S1. The run time is 2 hours, so the run duration is 120. I have used dye terminator chemistry with big dyes, so the code is Tb. I ran these samples on Feb 1, 2001 at 4:15 pm, so the date/time stamp will be 0102011615. You only have to enter this once. The data itself carries a shorter name.

Explanation of sequence name: sM13rroot0002012
The primer is standard primer M13f, so the name starts with an “s” followed by the three-letter primer code “M13”. The clone name is rroot0002. The sequence was run on lane 12, which has the code “012”.

4. Sending Chromatogram Files
If you wish ESTAP to run Phred for you, send the chromatograms of the sequences. Put all sequences from the same sequencing run of the same project in a folder. The name of the folder follows the file naming convention without the .seq or .qal suffix, e.g., VBOVTA2VT1CS1120Tb0102011615. Each chromatogram file follows the sequence naming convention plus the suffix “.ab1”, e.g., sM13rroot0002012.ab1

IX. How to Use dbEST Submission Tool
The dbEST Submission Tool incorporated in ESTAP is available online to allow users to conveniently submit their cleansed EST sequences in the ESTAP database to NCBI’s dbEST database. Note that only PIs have the privilege to submit their EST sequences to dbEST. PIs may follow the “dbEST Submission” link on the “Project List” page to ask ESTAP to prepare the dbEST submission files for them. During dbEST submission, a contact file, a library file, a publication file, and one or more EST files will be created and sent to the primary contact person via email. EST files will be in FASTA format with clone names and clean sequence ids as identifiers.

After submitting the EST sequences to NCBI’s dbEST, the PI should receive an email from NCBI regarding the submitted batch of ESTs. In order to store the dbEST_ID, USER_ID, and GENBANK_ACCN assigned by NCBI in the ESTAP database, the PI should send the confirmation file received from NCBI to the ESTAP curator. The ESTAP curator should use the ReadConfirmationFile program to process the confirmation file. The ReadConfirmationFile program is located under the java_prog directory of the uncompressed software. It is a stand-alone Java program that reads and parses the dbEST confirmation file and stores the dbEST_ID, USER_ID, and GENBANK_ACCN assigned by NCBI in the ESTAP database.

To make this program work, the curator needs to modify a configuration file called “javaConfig.txt” in the java_prog directory and set the correct parameters (see Section IV.1.c). ReadConfirmationFile uses the parameters defined in this file to find the confirmation file and connect to the ESTAP database. The confirmation file should be named with a .txt suffix (e.g., JC_MCA_020314052943.txt) and placed in the directory specified by the variable “ncbiConfirmFilePath” in the javaConfig.txt file. Go to the directory where ReadConfirmationFile.class is located and run the program using the following command: java ReadConfirmationFile

  • I. Hardware and Software Requirements
    • 1. Operating Systems
    • 2. Programming Language
    • 3. Database Management
    • 4. Web Browser
    • 5. Software Requirements
  • II. How to Download and Uncompress the Software
  • III. How to Install the Oracle Software and Create an ESTAP Database
    • 1. Install Oracle Software
    • 2. Create the Database Structure
  • IV. How to Install and Run Pipeline Analysis Programs
    • 1. Set up
    • 2. Environment Variables
    • 3. Run the Programs
    • 4. Schedule Programs
    • 5. Trouble-shooting
  • V. How to Install and Run Web Programs
    • 1. Install Oracle Client including JDBC
    • 2. Install JDK, Servlet and Web Service Software
    • 3. Configure Tomcat
    • 4. Configure Axis
  • VI. How to Register and Update ESTAP Users
  • VII. How to Register and Update EST Projects and Protocols
    • 1. Project Registration and Update
    • 2. Protocol Registration and Update
  • VIII. How to Prepare Raw EST Data
    • 1. File naming convention
    • 2. Sequence Naming Convention
    • 3. Example of a Sequence and a Quality Score File
    • 4. Sending Chromatogram Files
  • IX. How to Use dbEST Submission Tool

iprscan wrapper programs. You must copy the iprpwd.txt file to the java_prog directory, where the iprscan wrapper programs are placed, so that the Java programs can read the iprscan path information from this file.

5) Run iprscan

Go to the java_prog directory, where the Java application and configuration files (javaConfig.txt and iprpwd.txt) are placed, and set the proper parameters in the configuration file as described in Section IV.1.c. You may analyze a specific project or all projects at a time.

If you wish to run a specific project, use the command:

    nohup java InterProDriver lab_code project_code > ipr_project_code.log 2>&1 &

e.g.

    nohup java InterProDriver JC MCA > ipr_MCA.log 2>&1

where InterProDriver is the program name, lab_code is the two-letter lab code of the lab that the project belongs to, project_code is the three-letter project code of the project being analyzed, and "> ipr_project_code.log 2>&1" means to log the program output (from both stdout and stderr) to a log file named ipr_project_code.log.

If you wish to run all projects that are qualified for this procedure, use the command:

    nohup java InterProScanAll > ipr.log 2>&1 &

where InterProScanAll is the program name, and "> ipr.log 2>&1" means to log the program output (from both stdout and stderr) to a log file named ipr.log.

The log files can be viewed for debugging purposes. After the program has run successfully, the log files can be removed to save disk space.
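Because the log file name is derived from the project code, a small helper can assemble the per-project command line and catch a mistyped code before anything is launched. This is only an illustrative sketch; the function name iprscan_command is hypothetical, and the check for alphanumeric two- and three-character codes is an assumption based on the lab and project code lengths described in this guide.

```shell
# Print (without running) the InterProDriver command for one project.
# The body runs in a subshell, so "exit" only leaves the function.
iprscan_command() (
    lab_code=$1
    project_code=$2
    case $lab_code in
        [A-Za-z0-9][A-Za-z0-9]) ;;
        *) echo "lab code must be two alphanumeric characters" >&2; exit 1 ;;
    esac
    case $project_code in
        [A-Za-z0-9][A-Za-z0-9][A-Za-z0-9]) ;;
        *) echo "project code must be three alphanumeric characters" >&2; exit 1 ;;
    esac
    # stdout and stderr both go to ipr_<project_code>.log
    printf 'nohup java InterProDriver %s %s > ipr_%s.log 2>&1 &\n' \
        "$lab_code" "$project_code" "$project_code"
)

iprscan_command JC MCA
```

The printed line can be reviewed and then pasted into the shell from the java_prog directory.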

l. Run genomic DNA assembly programs

You have two options: 1) do DNA assembly for a specific genome project; 2) do DNA assembly for all qualified genome projects.

1) Do assembly for a specific genome project.

Command:

    nohup gAssemble.exe project_id > assemble_project_id.log 2>&1

e.g.

    nohup gAssemble.exe 377 > assemble_377.log 2>&1

where gAssemble.exe is the program name, project_id is the project id of the project you wish to run (note: this is the internal id, not the project code), and "> assemble_project_id.log 2>&1" means to log the program output (from both stdout and stderr) to a log file named assemble_project_id.log.

2) Do assembly for all genome projects that are qualified for the process (this is preferred).

Command:

    nohup gAssembleAll.exe > assemble.log 2>&1

where gAssembleAll.exe is the program name, and "> assemble.log 2>&1" means to log the program output (from both stdout and stderr) to a log file named assemble.log.

m. Prepare EST database files from user projects

The ESTAP Web interface allows users to blast specific sequence(s) against the ESTs from their projects. We provide a program called writeAndFormatFasta.exe to create and format EST database files from user projects. This program can be scheduled to run daily, or as often as you wish, to get the most recent EST databases for users to blast against. To run this program, you need to create a directory where the EST database files will be placed, e.g. /home/estap/estdb. This directory path should be used for the ESTDBFilePath parameter setting in the web.xml file (see Section V.3.e) so that the ESTAP local blast Web service can recognize

4. Schedule Programs

The above programs can be run manually as described above, or scheduled to run automatically using the UNIX cron command. Please see the UNIX man page for the crontab command. Following is an example of the scheduling:

    0 9,11,17 * * * ~/estap_exe/estapDriver.sh >> estap.log 2>&1
    0 19,20,21 * * * ~/estap_exe/runBlast.sh >> blast.log 2>&1
    30 22,23,24 * * * ~/estap_exe/reblast.sh >> reblast.log 2>&1
    30 3 * * 2 ~/estap_exe/getAndFormatNcbiDB.sh > get_db.log 2>&1
    0 5 * * * ~/estap_exe/clusterAll.sh >> ~/estap_exe/cluster.log 2>&1
    0,30 19 * * * ~/estap_exe/blastAllContigs.sh >> ~/estap_exe/blast_contigs.log 2>&1
    0,30 12 * * * ~/estap_exe/reblastAllContigs.sh >> ~/estap_exe/reblast_contigs.log 2>&1
    0,30 7 * * * ~/estap_exe/gAssembleAll.sh >> ~/estap_exe/assemble.log 2>&1
    5 2 * * * ~/estap_exe/writeAndFormatFasta.sh >> ~/estap_exe/estap_formatdb.log 2>&1

The shell scripts used here call the corresponding programs and make sure the programs are run under the correct directories and with the correct environment variable settings. These scripts should be revised when the directory setup and environment variable settings are different.
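The wrapper scripts themselves are not listed in this section. As a rough sketch of what such a script might do, the shell function below changes to a working directory and exports environment settings inside a subshell before running a program, so the caller's directory and environment are left untouched. The function name run_estap_program and the ORACLE_HOME default path are illustrative assumptions, not part of the ESTAP distribution.

```shell
# Run a command from a given working directory with the environment it needs.
# The body runs in a subshell, so "cd" and "export" do not leak to the caller.
run_estap_program() (
    work_dir=$1
    shift
    cd "$work_dir" || exit 1
    # Hypothetical default; replace with the value used by your installation.
    ORACLE_HOME=${ORACLE_HOME:-/opt/oracle/product/9.2.0}
    PATH=$ORACLE_HOME/bin:$PATH
    export ORACLE_HOME PATH
    # Run the remaining arguments as the command; its exit status is returned.
    "$@"
)
```

Under these assumptions a wrapper could consist of a single call, for example run_estap_program "$HOME/estap_exe" ./estapDriver.exe, with cron handling the log redirection as in the crontab entries.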

5. Trouble-shooting

When a program is stopped abnormally (e.g. power or network failure, processes are killed, etc.), some data may be partially analyzed and partial results may be stored in the database. You may need to do some cleanup (removing partial results from the database) and rerun the program. Please use the following instructions:

a. Go to the ESTAP database table PROG_RUN_STATUS in the ESTAP_SYS_DEF schema. This table records which programs are running or completed, when they were started, and when they were completed. If a program is killed in the middle of a run for some reason, the end_time will not be recorded and the status will still be running (status = 1). You need to change the status from 1 to 0 to indicate that the program is no longer running.

b. If the program was runBlast.exe or reblast.exe, after you change the status in the PROG_RUN_STATUS table, please go to the ANALYSIS_TRACKING table in the ESTAP_ANALYSIS schema and delete those rows created by the program that have status = 1, e.g.

    DELETE from ANALYSIS_TRACKING where status = 1;

Commit the changes after you are sure that you want to delete them. Then rerun the runBlast.exe program.

c. If the program was cluster.exe, after you change the status in the PROG_RUN_STATUS table, please go to the CLUSTER_SUMMARY table in the ESTAP_ANALYSIS schema and delete the row with the project id that you specified when you ran cluster.exe. Commit the changes after you are sure that you want to delete them. Then rerun the cluster.exe program.

d. If the program was clusterAll.exe, after you change the status in the PROG_RUN_STATUS table (look for "cluster.exe" and "clusterAll.exe" in the prog_name column), please go to the CLUSTER_SUMMARY table in the ESTAP_ANALYSIS schema and delete the rows created by the program that have a status of 1. Commit the changes after you are sure that you want to delete them. Then rerun the clusterAll.exe program.

e. If the program was blastAllContigs.exe or reblastAllContigs.exe, after you change the status in the PROG_RUN_STATUS table, please go to the CONTIG_ANALYSIS table in the ESTAP_ANALYSIS schema and delete those rows created by the program that have status = 1, e.g.

    DELETE from CONTIG_ANALYSIS where status = 1;

Commit the changes after you are sure that you want to delete them. Then rerun the blastAllContigs.exe program.
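Steps a through e can be summarized as one lookup from program name to cleanup statements. The sketch below only prints SQL for review and does not connect to the database. The function name cleanup_sql is hypothetical; the table, schema and column names (status, prog_name, project_id) are those mentioned in this guide, while restricting the status reset by prog_name is an assumption about how the PROG_RUN_STATUS rows would be selected.

```shell
# Print the cleanup statements for a program that stopped abnormally.
# Usage: cleanup_sql program_name [project_id]
# The body runs in a subshell, so "exit" only leaves the function.
cleanup_sql() (
    prog=$1
    project_id=$2
    where="prog_name = '$prog'"
    delete=""
    case $prog in
        runBlast.exe|reblast.exe)
            delete="DELETE FROM estap_analysis.analysis_tracking WHERE status = 1;" ;;
        cluster.exe)
            # cluster.exe is cleaned up per project, so an id is required.
            case $project_id in
                ''|*[!0-9]*) echo "cluster.exe needs a numeric project id" >&2; exit 1 ;;
            esac
            delete="DELETE FROM estap_analysis.cluster_summary WHERE project_id = $project_id;" ;;
        clusterAll.exe)
            where="prog_name IN ('cluster.exe', 'clusterAll.exe')"
            delete="DELETE FROM estap_analysis.cluster_summary WHERE status = 1;" ;;
        blastAllContigs.exe|reblastAllContigs.exe)
            delete="DELETE FROM estap_analysis.contig_analysis WHERE status = 1;" ;;
    esac
    # Step a: mark the interrupted run as no longer running.
    echo "UPDATE estap_sys_def.prog_run_status SET status = 0 WHERE $where AND status = 1;"
    # Steps b to e: remove the partial results, if the program leaves any.
    if [ -n "$delete" ]; then
        echo "$delete"
    fi
    echo "-- Review the affected rows, then COMMIT (or ROLLBACK)."
)
```

For example, cleanup_sql cluster.exe 377 prints the status reset followed by the CLUSTER_SUMMARY delete for project id 377; the output can be pasted into an SQL session once it has been checked.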

V. How to Install and Run Web Programs

1. Install Oracle Client including JDBC

Please contact Oracle to obtain licensing and product support. ESTAP uses the Oracle 9i client.

2. Install JDK, Servlet and Web Service Software

a. To install JDK 1.3 (Windows, Linux, Solaris) or higher, please go to http://java.sun.com/j2se

b. To install Tomcat, please go to http://jakarta.apache.org/tomcat

c. To install Apache Axis, please go to http://ws.apache.org/axis. The Apache Axis beta 3 version is used for the ESTAP local blast Web service.

3. Configure Tomcat

a. Visit http://jakarta.apache.org/tomcat for instructions on Tomcat configuration.

b. Set up the correct PATH and CLASSPATH environment.

c. Modify server.xml in the Tomcat conf directory. In this file, add the following for the estap context:

    <Context path="/estap" docBase="estap" showdebuginfo="true" debug="DEBUG" reloadable="true" crossContext="false">
      <Logger className="org.apache.catalina.logger.FileLogger" prefix="localhost_estap_log." suffix=".txt" timestamp="true"/>
    </Context>

d. Place the estap Web application under the Tomcat webapps directory. Copy the estap directory from the web_side directory of the uncompressed software to the webapps directory.

e. Modify the web.xml file in the estap/WEB-INF directory. The following section in web.xml is specific to ESTAP. Please modify the parameter values for your own settings.

    <servlet>
      <servlet-name>ProjectList</servlet-name>
      <servlet-class>ProjectList</servlet-class>
      <init-param>
        <param-name>jdbcDriverClassName</param-name>
        <param-value>oracle.jdbc.driver.OracleDriver</param-value>
      </init-param>
      <init-param>
        <param-name>jdbcURL</param-name>
        <param-value>jdbc:oracle:thin:@server.vt.edu:1521:estapdb</param-value>
      </init-param>
      <init-param>
        <param-name>dbUserName</param-name>
        <param-value>estap_user</param-value>
      </init-param>
      <init-param>
        <param-name>dbUserPassword</param-name>
        <param-value>password</param-value>
      </init-param>
      <init-param>
        <param-name>dbInitConnect</param-name>
        <param-value>1</param-value>
      </init-param>
      <init-param>
        <param-name>dbMaxConnect</param-name>
        <param-value>50</param-value>
      </init-param>
      <init-param>
        <param-name>dbESTFilePath</param-name>
        <param-value>/home/estap/dbEST_Outgoing</param-value>
      </init-param>
      <init-param>
        <param-name>ESTDBFilePath</param-name>
        <param-value>/home/estap/estdb</param-value>
      </init-param>
      <init-param>
        <param-name>blastallPath</param-name>
        <param-value>/home/estap/blast/blastall</param-value>
      </init-param>
      <init-param>
        <param-name>formatdbPath</param-name>
        <param-value>/home/estap/blast/formatdb</param-value>
      </init-param>
      <init-param>
        <param-name>estapEmail</param-name>
        <param-value>estap@vbi.vt.edu</param-value>
      </init-param>
      <load-on-startup>1</load-on-startup>
    </servlet>

The following list defines the parameters:

    jdbcURL: the URL for the JDBC connection
    dbUserName: the ESTAP database user name
    dbUserPassword: the ESTAP database user password
    dbInitConnect: the number of database connections to be made initially
    dbMaxConnect: the maximum number of database connections
    dbESTFilePath: the directory path to store dbEST submission files
    ESTDBFilePath: the directory path to store the EST files in FASTA format and their formatted files used by the ESTAP local blast Web service
    blastallPath: the path for the blastall program
    formatdbPath: the path for the formatdb program
    estapEmail: the ESTAP curator's email address

f. Set heap size for JVM

Some of the ESTAP web programs require more memory than the JVM default heap size of 64MB. To avoid an "out of memory" error, the user should set a larger heap size for the JVM. To do this, the user must set the "JAVA_OPTS" variable in the catalina.sh/catalina.bat or setclasspath.sh/setclasspath.bat file in the tomcat_home/bin directory. For example, if you wish to set the heap size to 256MB, you may add the line "set JAVA_OPTS=-mx256m" to the catalina.bat file.
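On Unix/Linux the corresponding setting goes into catalina.sh in shell syntax rather than with the Windows "set" command. The fragment below is a minimal sketch; it assumes the -Xmx256m spelling of the maximum heap option and keeps any JAVA_OPTS value that is already defined in the environment.

```shell
# Raise the maximum JVM heap to 256 MB unless JAVA_OPTS is already set.
JAVA_OPTS=${JAVA_OPTS:--Xmx256m}
export JAVA_OPTS
```

Placing these lines near the top of catalina.sh makes the setting apply each time Tomcat is started with startup.sh.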

g. Start Tomcat server

To start the Tomcat server, execute tomcat_home/bin/startup.bat on Windows (or tomcat_home/bin/startup.sh on Unix/Linux). Then enter the URL http://localhost/ in your browser and make sure that you get the Tomcat welcome page, not an error message saying that the page cannot be displayed or that the server cannot be found. If you choose to use port number 8080 instead of 80, you will need to use a URL that includes the port number, such as http://localhost:8080/. If this is successful, enter http://localhost/estap/servlet/Login or http://localhost:8080/estap/servlet/Login to see if the ESTAP login page is displayed.

4. Configure Axis

a. Go to http://ws.apache.org/axis for download and installation instructions.

b. Place the downloaded axis directory under the tomcat_home/webapps directory.

c. Move the server-config.wsdd file from the tomcat_home/webapps/estap/WEB-INF directory to the tomcat_home/webapps/axis/WEB-INF directory.

d. Add all jar files under the tomcat_home/webapps/axis/WEB-INF/lib and tomcat_home/webapps/axis/WEB-INF/classes directories to your CLASSPATH environment variable.

e. Move the blastService and createESTDBService directories from the tomcat_home/webapps/estap/WEB-INF/classes directory to the tomcat_home/webapps/axis/WEB-INF/classes directory.

f. Replace the server name and Tomcat port with your own server name and port number in the JBlastServiceLocator.java and JCreateESTDBServiceLocator.java files under the blastService/ws and createESTDBService/ws directories respectively. For example, for blastService run the commands:

    cd blastService
    cd ws

Edit the JBlastServiceLocator.java file by changing the line http://arda.vbi.vt.edu:8080 to http://yourServerName:yourPort. Then compile the file just modified by running the command:

    javac JBlastServiceLocator.java

Repeat the above process for the createESTDBService directory.

g. Copy the blastService and createESTDBService directories to the tomcat_home/webapps/estap/WEB-INF/classes directory.

h. Compile EstapWebBlast.java by running the command:

    javac EstapWebBlast.java

i. Restart the Tomcat server and go to http://localhost/axis or http://localhost:8080/axis to view the list of deployed web services. If Axis is configured correctly, you will see the createESTDB and blast services.

VI. How to Register and Update ESTAP Users

ESTAP defines 3 types of users with different role permissions: sponsors, PIs and other users.

Sponsors are those who fund the development of ESTAP. If there is no 'sponsor' according to this definition, then the person who is in charge of the ESTAP system becomes the sponsor. There is one sponsor per institution. Sponsors can register/update PIs and have all PI privileges.

PIs are those people responsible for the projects. PIs can register/update projects, project users and project protocols. PIs can also register ESTAP codes (e.g. primer, instrument, base caller, etc.), submit ESTs of their projects to dbEST, and view their project data and analysis results.

Other users include primary contacts with update permission, primary contacts with read-only permission, and project viewers with read-only permission. Primary contacts with update privileges (associated with particular project(s)) can update projects and project protocols and view project data and analysis results. Primary contacts with read-only permission and project viewers (associated with particular project(s)) can view project data and analysis results.

At least one sponsor must exist before PIs can register their projects and users. Sponsors are registered by the ESTAP curator/database administrator, who has direct access to the ESTAP tables in the database. Please use the following instructions to register sponsors:

1) In the INSTITUTION table, modify the first record, which has institution_id = 1. Change the institution_name to the institution name of the sponsor. If there is more than one sponsor to be added, add a new record for each remaining sponsor. For each record, increment institution_id by 1 and enter the institution name and description in the institution_name and institution_desc columns.

2) In the LAB table, modify the record with the lab_code of XX. Change all columns of this record to the desired values. The lab_email entry must be the email address of the sponsor. The lab_alt_email entry is the email address of the primary contact in the sponsor's lab. Each sponsor has one lab. You may add more records if there is more than one sponsor.

3) In the PERSON table, modify the person record with the person_id of 1. Change the user_name to the email address of the sponsor and the password to the sponsor's password for logging on to the Web site. The user_type is 1 and the sponsor_id is 1 (referring to the person_id of the sponsor himself/herself in the PERSON table). Other columns are self-explanatory by their names. Add more records if there is more than one sponsor, and increment person_id for each additional sponsor.

4) In the CONTACT table, change the record with the contact_id of 1 and person_id of 1 to the contact information of the sponsor. Add more records if there is more than one sponsor, and increment contact_id for each additional sponsor.

Once sponsors are registered in the ESTAP system, they can log on to the ESTAP Web site to register or update PIs in their institutions. Once PIs are registered, they can log on to the ESTAP Web site to register or update their projects. In addition, PIs can register or update the primary contacts and the project viewers for their projects.

VII. How to Register and Update EST Projects and Protocols

1. Project Registration and Update

ESTAP defines two types of projects: regular and virtual projects. A regular project refers to a specific EST library that a lab constructed. The regular project information includes biological information regarding tissues and cloning methods used in cDNA library construction, and EST sequencing information. In order to analyze combined libraries from a particular taxonomy, we introduced the virtual project concept. We now allow PIs to register a virtual project that combines their regular projects from a particular taxonomy.

To register a regular project or a virtual project, please go to the "Project List" page, click the "Register Regular Project" link or the "Register Virtual Project" link, and follow the instructions in the registration form carefully. PIs and the primary contacts with update privilege may also update the project information via the Web.

The current version of ESTAP does not allow updating a project when the sequences of that project have been processed and stored in the ESTAP database. If the user needs to make corrections to the project information after the sequences are processed, he or she has to make a request to the ESTAP curator. The ESTAP curator should delete the already loaded sequences for that project, make corrections to the project information, and then reload the sequences by running the estapDriver.exe program.

2. Protocol Registration and Update

ESTAP offers the following protocols for data analysis:

1) Cleansing protocols

The cleansing protocols contain procedures to cleanse raw sequences by removing low quality end sequences, vectors, and polyA (or polyT), and by screening for chimera and contaminations. There are two types of cleansing protocols: one is the cleansing protocol for BLAST and the other is the cleansing protocol for assembly. The sequences cleansed by the cleansing protocol for BLAST are used for BLAST searching, and the sequences cleansed by the cleansing protocol for assembly are used for the cluster/assembly analysis. The user may utilize the same procedures and parameters for both protocols. There are several key points for the cleansing to be successful:

a. The cleansing program uses the 5' and 3' vector fragments provided in the project registration to match the vector portions in the clones. So the 5' and 3' vector fragments (both the orientation of the fragments and the sequences) must be correct.

b. The vector and insert adaptor sequences provided in the project registration should not be the whole adaptor sequences. Only the regions that are present in the clones should be included.

c. The sequencing primer code in the sequence names of the raw sequence data must be correct. The correct primer code indicates the correct orientation of the sequencing (5' or 3'). A wrong orientation will cause the removal of the vector/adaptor to fail. Please check the primer table carefully to make sure the orientation of the primer is correct.

d. When the cleansing procedure does not seem to give the expected result, please check whether the above key points have been followed.

2) Blast protocol

The blast protocol provides procedures and parameters for BLAST (BLASTN, BLASTX and/or TBLASTX) searches. Users may choose to run all, some or none of the blast procedures in the protocol.

3) Cluster/assembly protocol

The cluster/assembly protocol provides procedures and parameters for clustering and assembly analysis. Users register this protocol only if they wish to do cluster/assembly. Projects are not clustered automatically without a cluster protocol: to let ESTAP run the cluster/assembly programs, users need to register the cluster protocol for each project they wish to cluster/assemble.

Please go to the View Project page of the project you wish to cluster/assemble and follow the link called Cluster Protocol to register the protocol and set the procedure parameters. The parameters that ESTAP provides are from the d2_cluster and CAP3 programs. The Register Cluster/Assembly Protocol page has a link, "Cluster/Assembly Protocol Guide", which provides paper references for the d2_cluster and CAP3 programs. Once the protocol is registered, ESTAP will automatically perform cluster/assembly for your project according to the protocol you registered.

If you wish to perform cluster/assembly for combined libraries (not for each project), you need to register a virtual project and select the taxonomy of the libraries. Then register the BLAST and cluster/assembly protocols for this virtual project. Once the cluster protocol is registered, ESTAP will automatically perform cluster/assembly for the virtual project.

The cluster/assembly protocol also provides an option to perform InterProScan analysis. Users may choose this option to let ESTAP automatically annotate the protein functions of singlets and contigs after the clustering/assembly procedure is performed.

PIs and the primary contacts with update privilege may also update the protocols. The updated protocols will be effective for new sequence data, not for data that have already been analyzed. If the user wishes to reanalyze the old data, he or she must make such a request to the ESTAP curator. The curator should use the following instructions:

1) If the sequences of a project must be re-cleansed using the new protocol, the curator must delete the sequences of the project from the ESTAP database and reload the sequences by running the estapDriver.exe program. Use the following command to delete sequences from the database:

    delete from raw_seq_file where project_id = x;
    commit;

project_id is the internal id for the project. The curator may get the id from the PROJECT table by matching the lab code and project code of the project:

    select project_id from project where lab_code = 'xx' and user_proj_code = 'xxx';

Please see Section IV.4.e for estapDriver.exe usage.

2) If the sequences of a project must be re-blasted using the new BLAST protocol, the curator must delete the old blast results of that project from the ESTAP database and run the runBlast.exe program. Use the following command to delete the blast results from the database:

    delete from analysis_tracking where project_id = x;
    commit;

Please see Section IV.4.f for runBlast.exe usage.

3) If the sequences of a project must be re-clustered using the new clustering/assembly protocol, the curator must delete the old clustering/assembly and contig blast results of that project from the ESTAP database, then run the cluster.exe program for clustering/assembly and the blastAllContigs.exe program for contig blast. Use the following command to delete the old results from the database:

    delete from cluster_summary where project_id = x;
    commit;

Use the "cluster.exe [project_id]" command to redo the clustering/assembly analysis and "blastAllContigs.exe" to redo the contig blast analysis. Note: [project_id] is the internal id for that project. Please see Section IV.4.h and Section IV.4.i for how to run the cluster.exe and blastAllContigs.exe programs.

VIII. How to Prepare Raw EST Data

Users can send ESTAP either the chromatograms of the sequences (.ab1 files) or both sequence files and their corresponding quality score files in FASTA format. If the user sends ESTAP the chromatograms, ESTAP will use the Phred base calling program to generate the sequences and the quality scores.

Each sequence file contains sequences from the same sequencing run of the same project. Each quality score file contains the quality scores of the sequences in the corresponding sequence file. Each sequence or quality score string in the file must start with a ">" followed by a sequence name.

To process data files, ESTAP requires that data providers use a standard file name for each file and a standard sequence name for each sequence in the file. The file name contains specific information about that file, and the sequence name contains specific information about the sequence. All the sequences in the same file must be from the same project and the same sequencing run. The naming convention and rules are described as follows:

• Both the file name and the sequence name are required to be of a fixed length.
• Each element in the name must conform to a required number of characters or digits.
• Pad with 0's (zeros) in front of any element that is shorter than the required length.

The file names and sequence names are case-sensitive.

1. File naming convention

All sequence files will have ".seq" as a suffix and quality score files will have ".qal" as a suffix. Every seq string must have an exactly matching qal string by name. The file names use the following naming convention: LLPPPTTIIIBRRNNNCCYYMMDDHHMM.seq for the sequence file and LLPPPTTIIIBRRNNNCCYYMMDDHHMM.qal for the corresponding quality score file.

    LL - two-character alphanumeric lab code (provided by VBI), e.g. UN
    PPP - three-character alphanumeric project id (provided by VBI), e.g. MCA
    TT - two-character alphanumeric instrument type (provided by VBI), e.g. A1
    III - three-character alphanumeric instrument id (user defined code that is unique to the lab), e.g. VT1
    B - one-character alphanumeric base calling program id (provided by VBI), e.g. A
    RR - two-character alphanumeric run parameter id (provided by VBI), e.g. S1
    NNN - three-digit run duration in minutes (actual running time in minutes, e.g. 090 for 90 minutes)
    CC - two-character alphanumeric sequencing chemistry id (provided by VBI), e.g. Pr
    YYMMDDHHMM - the date and time at which the sequence was run, e.g. 0104150930
        YY - two-digit year
        MM - two-digit month
        DD - two-digit day
        HH - two-digit hour
        MM - two-digit minute
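Because every element has a fixed width, a file name can be split by character position alone. The following POSIX shell function is a minimal sketch of that idea, not the parser ESTAP itself uses; the function name parse_estap_filename and the output labels (lab_code, run_minutes and so on) are hypothetical.

```shell
# Split an ESTAP data file name into its fixed-width fields.
# Prints one label=value pair per line; fails if the name is malformed.
# The body runs in a subshell, so "exit" only leaves the function.
parse_estap_filename() (
    name=$1
    # Drop the .seq or .qal suffix if present.
    case $name in
        *.seq) base=${name%.seq} ;;
        *.qal) base=${name%.qal} ;;
        *)     base=$name ;;
    esac
    # LL+PPP+TT+III+B+RR+NNN+CC+YYMMDDHHMM = 28 characters.
    if [ "${#base}" -ne 28 ]; then
        echo "expected 28 characters, got ${#base}: $base" >&2
        exit 1
    fi
    # Extract a character range (1-based, inclusive) from the base name.
    field() { printf '%s' "$base" | cut -c"$1"; }
    # Run duration and date/time stamp must consist of digits only.
    case "$(field 14-16)$(field 19-28)" in
        *[!0-9]*) echo "run duration and date/time must be numeric" >&2; exit 1 ;;
    esac
    echo "lab_code=$(field 1-2)"
    echo "project_code=$(field 3-5)"
    echo "instrument_type=$(field 6-7)"
    echo "instrument_id=$(field 8-10)"
    echo "base_caller=$(field 11)"
    echo "run_parameter=$(field 12-13)"
    echo "run_minutes=$(field 14-16)"
    echo "chemistry=$(field 17-18)"
    echo "run_timestamp=$(field 19-28)"
)
```

Calling parse_estap_filename with a name that is not 28 characters long (after the suffix is removed) returns a non-zero status, which makes the function usable as a quick pre-submission check.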

2. Sequence Naming Convention

The sequence names in the sequence and quality score files must be in the following format: TPPPNNNNNNNNNLLL

    T - one-letter primer type: u for user defined primer and s for standard primer
    PPP - three-character alphanumeric primer id (provided or confirmed by VBI), e.g. T3a
    NNNNNNNNN - nine-character alphanumeric clone name, unique within the project (user defined)
    LLL - three-digit lane or capillary number, e.g. 012 for lane 12 in the sequence gel

Note: Using the correct primer code (first 4 letters) is very important for the cleaning procedure. The correct primer code indicates the correct orientation of the sequencing (5' or 3'). A wrong orientation will cause the removal of the vector/adaptor to fail.
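The zero-padding rule can be applied mechanically when sequence names are generated. The sketch below pads each element on the left and assembles the 16-character name; the function names pad_left and build_sequence_name are hypothetical, and treating the elements as plain strings (rather than numbers) is a deliberate choice so that a lane such as 08 is not misread as an octal value.

```shell
# Left-pad a value with zeros until it reaches the required width.
pad_left() (
    value=$1
    width=$2
    while [ "${#value}" -lt "$width" ]; do
        value=0$value
    done
    printf '%s\n' "$value"
)

# Build a sequence name in the form T + PPP + NNNNNNNNN + LLL.
# Usage: build_sequence_name primer_type primer_id clone_name lane
# The body runs in a subshell, so "exit" only leaves the function.
build_sequence_name() (
    ptype=$1
    primer=$2
    clone=$3
    lane=$4
    case $ptype in
        s|u) ;;
        *) echo "primer type must be s or u" >&2; exit 1 ;;
    esac
    case $lane in
        ''|*[!0-9]*) echo "lane must be numeric" >&2; exit 1 ;;
    esac
    name=$ptype$(pad_left "$primer" 3)$(pad_left "$clone" 9)$(pad_left "$lane" 3)
    # Padding never shortens an element, so an over-long element shows up here.
    if [ "${#name}" -ne 16 ]; then
        echo "an element is longer than its field: $name" >&2
        exit 1
    fi
    printf '%s\n' "$name"
)
```

For instance, build_sequence_name s M13 rroot0002 12 yields the name used in the example of this guide, sM13rroot0002012.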

3. Example of a Sequence and a Quality Score File

a. Sequence file

File name: VBOVTA2VT1CS1120Tb0102011615.seq

The following is the content of the file:

>sM13rroot0002012
GTCGTAGCTAGCTAGCTGTGGACTGACTGA
CTAGCATCAGCTTGACTGACTAGCATAGCT
AGCATACGCTATCGATCAGGGCAGACTAGC
TGACTAGCATGACTAGCATT
>sM13rroot0003013
GTCGTAGCTTAGCTAGTCAGATCAGCATCG
AATACGATCAGCTAGCTGTGGACTGACTGA
CTAGCATCAGCTTGACTGACTAGCATAGCT
AGCATACGCTTACGTACACAGACACACA



The ESTAP Web interface allows users to blast specific sequence(s) against the ESTs from their projects. We provide a program called writeAndFormatFasta.exe to create and format EST database files from user projects. This program can be scheduled to run daily, or as often as you wish, so that users blast against the most recent EST databases. To run this program, you need to create a directory where the EST database files will be placed, e.g., /home/estap/estdb. This directory path should be used for the ESTDBFilePath parameter setting in the web.xml file (see Section V.3.e) so that the ESTAP local blast Web service can recognize

4. Schedule Programs

The above programs can be run manually as described above, or scheduled to run automatically using the UNIX cron command. Please see the UNIX man page for the crontab command. Following is an example of the scheduling:

    0 9,11,17 * * * ~/estap_exe/estapDriver.sh >> estap.log 2>&1
    0 19,20,21 * * * ~/estap_exe/runBlast.sh >> blast.log 2>&1
    30 22,23,24 * * * ~/estap_exe/reblast.sh >> reblast.log 2>&1
    30 3 * * 2 ~/estap_exe/getAndFormatNcbiDB.sh > get_db.log 2>&1
    0 5 * * * ~/estap_exe/clusterAll.sh >> ~/estap_exe/cluster.log 2>&1
    0,30 19 * * * ~/estap_exe/blastAllContigs.sh >> ~/estap_exe/blast_contigs.log 2>&1
    0,30 12 * * * ~/estap_exe/reblastAllContigs.sh >> ~/estap_exe/reblast_contigs.log 2>&1
    0,30 7 * * * ~/estap_exe/gAssembleAll.sh >> ~/estap_exe/assemble.log 2>&1
    5 2 * * * ~/estap_exe/writeAndFormatFasta.sh >> ~/estap_exe/estap_formatdb.log 2>&1

The shell scripts used here call the corresponding programs and make sure that the programs are run under the correct directories and with the correct environment variable settings. These scripts should be revised when your directory setup and environment variable settings are different.

5. Trouble-shooting

When a program is stopped abnormally (e.g., power or network failure, processes are killed, etc.), some data may be partially analyzed and partial results may be stored in the database. You may need to do some cleanup (removing partial results from the database) and rerun the program. Please use the following instructions.

a. Go to the ESTAP database table PROG_RUN_STATUS in the ESTAP_SYS_DEF schema. This table records which programs are running or completed, when they were started, and when they were completed. If a program is killed in the middle of a run for some reason, the end_time will not be recorded and the status will still be running (status = 1). You need to change the status from 1 to 0 to indicate that the program is no longer running.

b. If the program was runBlast.exe or reblast.exe: after you change the status in the PROG_RUN_STATUS table, go to the ANALYSIS_TRACKING table in the ESTAP_ANALYSIS schema and delete those rows created by the program that have status = 1, e.g.,

    DELETE from ANALYSIS_TRACKING where status = 1;

Commit the changes after you are sure that you want to delete them. Then rerun the runBlast.exe program.

c. If the program was cluster.exe: after you change the status in the PROG_RUN_STATUS table, go to the CLUSTER_SUMMARY table in the ESTAP_ANALYSIS schema and delete the row with the project id that you specified when you ran cluster.exe. Commit the changes after you are sure that you want to delete them. Then rerun the cluster.exe program.

d. If the program was clusterAll.exe: after you change the status in the PROG_RUN_STATUS table (look for "cluster.exe" and "clusterAll.exe" in the prog_name column), go to the CLUSTER_SUMMARY table in the ESTAP_ANALYSIS schema and delete the rows created by the program that have a status of 1. Commit the changes after you are sure that you want to delete them. Then rerun the clusterAll.exe program.

e. If the program was blastAllContigs.exe or reblastAllContigs.exe: after you change the status in the PROG_RUN_STATUS table, go to the CONTIG_ANALYSIS table in the ESTAP_ANALYSIS schema and delete those rows created by the program that have status = 1, e.g.,

    DELETE from CONTIG_ANALYSIS where status = 1;

Commit the changes after you are sure that you want to delete them. Then rerun the blastAllContigs.exe program.
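The cleanup rules in steps a through e can be summarized as a small lookup table. The following Python sketch is only an illustration and is not part of ESTAP: the names CLEANUP_RULES and cleanup_steps are hypothetical, and it assumes that CLUSTER_SUMMARY has project_id and status columns, as the delete commands in this guide suggest. It only builds the statements as text for the curator to review; it does not connect to the database.

```python
# Cleanup rules transcribed from steps b-e of the trouble-shooting section.
# All Python names here are illustrative; only the table and program names
# come from the guide.
CLEANUP_RULES = {
    # interrupted program: (table in ESTAP_ANALYSIS, row filter, program to rerun)
    "runBlast.exe": ("ANALYSIS_TRACKING", "status = 1", "runBlast.exe"),
    "reblast.exe": ("ANALYSIS_TRACKING", "status = 1", "runBlast.exe"),
    "cluster.exe": ("CLUSTER_SUMMARY", "project_id = {project_id}", "cluster.exe"),
    "clusterAll.exe": ("CLUSTER_SUMMARY", "status = 1", "clusterAll.exe"),
    "blastAllContigs.exe": ("CONTIG_ANALYSIS", "status = 1", "blastAllContigs.exe"),
    "reblastAllContigs.exe": ("CONTIG_ANALYSIS", "status = 1", "blastAllContigs.exe"),
}


def cleanup_steps(program, project_id=None):
    """Return the ordered cleanup steps (as text) for an interrupted program."""
    table, row_filter, rerun = CLEANUP_RULES[program]
    if "{project_id}" in row_filter:
        if project_id is None:
            raise ValueError("cluster.exe cleanup needs the project id of the interrupted run")
        row_filter = row_filter.format(project_id=int(project_id))
    return [
        "set status from 1 to 0 for " + program + " in ESTAP_SYS_DEF.PROG_RUN_STATUS",
        "DELETE FROM " + table + " WHERE " + row_filter,
        "COMMIT",
        "rerun " + rerun,
    ]


if __name__ == "__main__":
    for step in cleanup_steps("cluster.exe", project_id=12):
        print(step)
```

Printing the steps instead of executing them keeps the "commit only after you are sure" decision with the curator.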

V. How to Install and Run Web Programs

1. Install Oracle Client including JDBC

Please contact Oracle to obtain licensing and product support. ESTAP uses the Oracle 9i client.

2. Install JDK, Servlet, and Web Service Software

a. To install JDK 1.3 (Windows, Linux, Solaris) or higher, please go to http://java.sun.com/j2se/.

b. To install Tomcat, please go to http://jakarta.apache.org/tomcat/.

c. To install Apache Axis, please go to http://ws.apache.org/axis/. The Apache Axis beta 3 version is used for the ESTAP local blast Web service.

3. Configure Tomcat

a. Visit http://jakarta.apache.org/tomcat/ for instructions on Tomcat configuration.

b. Set up the correct PATH and CLASSPATH environment.

c. Modify server.xml in the Tomcat conf directory. In this file, add the following for the estap context:

    <Context path="/estap" docBase="estap" showdebuginfo="true"
             debug="DEBUG" reloadable="true" crossContext="false">
      <Logger className="org.apache.catalina.logger.FileLogger"
              prefix="localhost_estap_log." suffix=".txt" timestamp="true"/>
    </Context>

d. Place the estap Web application under the Tomcat webapps directory: copy the estap directory from the web_side directory of the uncompressed software to the webapps directory.

e. Modify the web.xml file in the estap/WEB-INF directory. The following section in web.xml is specific to ESTAP. Please modify the parameter values for your own settings.

    <servlet>
      <servlet-name>ProjectList</servlet-name>
      <servlet-class>ProjectList</servlet-class>
      <init-param>
        <param-name>jdbcDriverClassName</param-name>
        <param-value>oracle.jdbc.driver.OracleDriver</param-value>
      </init-param>
      <init-param>
        <param-name>jdbcURL</param-name>
        <param-value>jdbc:oracle:thin:@server.vt.edu:1521:estapdb</param-value>
      </init-param>
      <init-param>
        <param-name>dbUserName</param-name>
        <param-value>estap_user</param-value>
      </init-param>
      <init-param>
        <param-name>dbUserPassword</param-name>
        <param-value>password</param-value>
      </init-param>
      <init-param>
        <param-name>dbInitConnect</param-name>
        <param-value>1</param-value>
      </init-param>
      <init-param>
        <param-name>dbMaxConnect</param-name>
        <param-value>50</param-value>
      </init-param>
      <init-param>
        <param-name>dbESTFilePath</param-name>
        <param-value>/home/estap/dbEST_Outgoing</param-value>
      </init-param>
      <init-param>
        <param-name>ESTDBFilePath</param-name>
        <param-value>/home/estap/estdb</param-value>
      </init-param>
      <init-param>
        <param-name>blastallPath</param-name>
        <param-value>/home/estap/blast/blastall</param-value>
      </init-param>
      <init-param>
        <param-name>formatdbPath</param-name>
        <param-value>/home/estap/blast/formatdb</param-value>
      </init-param>
      <init-param>
        <param-name>estapEmail</param-name>
        <param-value>estap@vbi.vt.edu</param-value>
      </init-param>
      <load-on-startup>1</load-on-startup>
    </servlet>

The following list defines the parameters:

    jdbcURL         the URL for the JDBC connection
    dbUserName      the ESTAP database user name
    dbUserPassword  the ESTAP database user password
    dbInitConnect   the number of database connections to be made initially
    dbMaxConnect    the maximum number of database connections
    dbESTFilePath   the directory path to store dbEST submission files
    ESTDBFilePath   the directory path to store the EST files in FASTA format and their formatted files used by the ESTAP local blast Web service
    blastallPath    the path for the blastall program
    formatdbPath    the path for the formatdb program
    estapEmail      the ESTAP curator's email address

f. Set heap size for JVM

Some of the ESTAP Web programs require more memory than the JVM default heap size of 64MB. To avoid an "out of memory" error, the user should set a larger heap size for the JVM. To do this, set the "JAVA_OPTS" variable in the catalina.sh/catalina.bat or setclasspath.sh/setclasspath.bat file in the tomcat_home/bin directory. For example, if you wish to set the heap size to 256MB, you may add the line "set JAVA_OPTS=-mx256m" to the catalina.bat file.

g. Start Tomcat server

To start the Tomcat server, execute tomcat_home/bin/startup.bat on Windows (or tomcat_home/bin/startup.sh on Unix/Linux). Then enter the URL http://localhost/ in your browser and make sure that you get the Tomcat welcome page, not an error message saying that the page cannot be displayed or that the server cannot be found. If you choose to use port number 8080 instead of 80, you will need to use a URL that includes the port number, such as http://localhost:8080/. If this is successful, enter http://localhost/estap/servlet/Login or http://localhost:8080/estap/servlet/Login to see whether the ESTAP login page is displayed.

4. Configure Axis

a. Go to http://ws.apache.org/axis/ for downloading and installing instructions.

b. Place the downloaded axis directory under the tomcat_home/webapps directory.

c. Move the server-config.wsdd file from the tomcat_home/webapps/estap/WEB-INF directory to the tomcat_home/webapps/axis/WEB-INF directory.

d. Add all jar files under the tomcat_home/webapps/axis/WEB-INF/lib and tomcat_home/webapps/axis/WEB-INF/classes directories to your CLASSPATH environment variable.

e. Move the blastService and createESTDBService directories from the tomcat_home/webapps/estap/WEB-INF/classes directory to the tomcat_home/webapps/axis/WEB-INF/classes directory.

f. Replace the server name and Tomcat port with your own server name and port number in the JBlastServiceLocator.java and JCreateESTDBServiceLocator.java files under the blastService/ws and createESTDBService/ws directories, respectively. For example, for blastService, run the commands:

    cd blastService
    cd ws

Edit the JBlastServiceLocator.java file by changing the line http://arda.vbi.vt.edu:8080 to http://yourServerName:yourPort. Then compile the file just modified by running the command:

    javac JBlastServiceLocator.java

Repeat the above process for the createESTDBService directory.

g. Copy the blastService and createESTDBService directories to the tomcat_home/webapps/estap/WEB-INF/classes directory.

h. Compile EstapWebBlast.java by running the command:

    javac EstapWebBlast.java

i. Restart the Tomcat server and go to http://localhost/axis or http://localhost:8080/axis to view the list of deployed Web services. If Axis is configured correctly, you will see the createESTDB and blast services.

VI. How to Register and Update ESTAP Users

ESTAP defines 3 types of users with different role permissions: sponsors, PIs, and other users.

Sponsors are those who fund the development of ESTAP. If there is no 'sponsor' according to this definition, then the person who is in charge of the ESTAP system becomes the sponsor. There is one sponsor per institution. Sponsors can register/update PIs and have all PI privileges.

PIs are those people responsible for the projects. PIs can register/update projects, project users, and project protocols. PIs can also register ESTAP codes (e.g., primer, instrument, base caller, etc.), submit ESTs of their projects to dbEST, and view their project data and analysis results.

Other users include primary contacts with update permission, primary contacts with read-only permission, and project viewers with read-only permission. Primary contacts with update privileges (associated with particular project(s)) can update projects and project protocols, and view project data and analysis results. Primary contacts with read-only permission and project viewers (associated with particular project(s)) can view project data and analysis results.

At least one sponsor must exist before PIs can register their projects and users. Sponsors are registered by the ESTAP curator/database administrator, who has direct access to the ESTAP tables in the database. Please use the following instructions to register sponsors.

1) In the INSTITUTION table, modify the first record, which has institution_id = 1. Change the institution_name to the institution name of the sponsor. If there is more than one sponsor to be added, add a new record for each remaining sponsor. For each record, increment institution_id by 1 and enter the institution name and description in the institution_name and institution_desc columns.

2) In the LAB table, modify the record with the lab_code of XX. Change all columns of this record to the desired values. The lab_email entry must be the email address of the sponsor. The lab_alt_email entry is the email address of the primary contact in the sponsor's lab. Each sponsor has one lab. You may add more records if there is more than one sponsor.

3) In the PERSON table, modify the person record with the person_id of 1. Change the user_name to the email address of the sponsor and the password to the sponsor's password for logging on to the Web site. The user_type is 1 and the sponsor_id is 1 (referring to the person_id of the sponsor him/herself in the PERSON table). Other columns are self-explanatory by their names. Add more records if there is more than one sponsor, and increment person_id for each additional sponsor.

4) In the CONTACT table, change the record with the contact_id of 1 and person_id of 1 to the contact information of the sponsor. Add more records if there is more than one sponsor, and increment contact_id for each additional sponsor.

Once sponsors are registered in the ESTAP system, they can log on to the ESTAP Web site to register or update PIs in their institutions. Once PIs are registered, PIs can log on to the ESTAP Web site to register or update their projects. In addition, PIs can register or update the primary contacts and the project viewers for their projects.
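The role permissions described at the start of this section can be written down as a simple lookup structure. The Python sketch below is only an illustration of that description, not how ESTAP stores roles internally: the role keys, the permission names, and the function can are all hypothetical labels chosen for this example.

```python
# Permissions a PI holds according to Section VI. The names are illustrative.
PI_PERMISSIONS = frozenset({
    "register_project", "update_project",
    "register_project_user", "update_project_user",
    "register_protocol", "update_protocol",
    "register_code",      # ESTAP codes such as primer, instrument, base caller
    "submit_dbest",
    "view_project",
})

ROLE_PERMISSIONS = {
    # Sponsors have all PI privileges and can also register/update PIs.
    "sponsor": PI_PERMISSIONS | {"register_pi", "update_pi"},
    "pi": PI_PERMISSIONS,
    "primary_contact_update": frozenset({"update_project", "update_protocol", "view_project"}),
    "primary_contact_read_only": frozenset({"view_project"}),
    "project_viewer": frozenset({"view_project"}),
}


def can(role, permission):
    """Return True if the given role holds the given permission."""
    return permission in ROLE_PERMISSIONS.get(role, frozenset())


if __name__ == "__main__":
    print(can("sponsor", "submit_dbest"))
    print(can("primary_contact_update", "submit_dbest"))
```

Note that the permissions of primary contacts and project viewers apply only to the particular project(s) they are associated with, which this flat table does not model.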

VII. How to Register and Update EST Projects and Protocols

1. Project Registration and Update

ESTAP defines two types of projects: regular and virtual projects. A regular project refers to a specific EST library that a lab constructed. The regular project information includes biological information regarding the tissues and cloning methods used in cDNA library construction, and EST sequencing information. In order to analyze combined libraries from a particular taxonomy, we introduced the virtual project concept. We now allow PIs to register a virtual project that combines their regular projects from a particular taxonomy.

To register a regular project or a virtual project, please go to the "Project List" page, click the "Register Regular Project" link or the "Register Virtual Project" link, and follow the instructions in the registration form carefully. PIs and the primary contacts with update privilege may also update the project information via the Web.

The current version of ESTAP does not allow updating a project once the sequences of that project have been processed and stored in the ESTAP database. If the user needs to make corrections to the project information after the sequences are processed, he or she has to make a request to the ESTAP curator. The ESTAP curator should delete the already loaded sequences for that project, make the corrections to the project information, and then reload the sequences by running the estapDriver.exe program.

2. Protocol Registration and Update

ESTAP offers the following protocols for data analysis.

1) Cleansing protocols

The cleansing protocols contain procedures to cleanse raw sequences by removing low-quality end sequences, vectors, and polyA (or polyT), and by screening for chimeras and contamination. There are two types of cleansing protocols: one is the cleansing protocol for BLAST and the other is the cleansing protocol for assembly. The sequences cleansed by the cleansing protocol for BLAST are used for BLAST searching, and the sequences cleansed by the cleansing protocol for assembly are used for the cluster/assembly analysis. The user may use the same procedures and parameters for both protocols. There are several key points for the cleansing to be successful:

a. The cleansing program uses the 5' and 3' vector fragments provided in the project registration to match the vector portions in the clones. So the 5' and 3' vector fragments (both the orientation of the fragments and the sequences) must be correct.

b. The vector and insert adaptor sequences provided in the project registration should not be the whole adaptor sequences. Only the regions that are present in the clones should be included.

c. The sequencing primer code in the sequence names of the raw sequence data must be correct. The correct primer code indicates the correct orientation of the sequencing (5' or 3'). A wrong orientation will cause vector/adaptor removal to fail. Please check the primer table carefully to make sure the orientation of the primer is correct.

d. When the cleansing procedure does not seem to give the expected result, please check whether the above key points have been followed.

2) Blast protocol

The blast protocol provides procedures and parameters for BLAST (BLASTN, BLASTX, and/or TBLASTX) searches. Users may choose to run all, some, or none of the blast procedures in the protocol.

3) Cluster/assembly protocol

The cluster/assembly protocol provides procedures and parameters for clustering and assembly analysis. Users register this protocol only if they wish to do cluster/assembly. Projects are not automatically clustered without a cluster protocol: to let ESTAP run the cluster/assembly programs for a project, the users need to register the cluster protocol for that project.

Please go to the View Project page of the project you wish to cluster/assemble, and follow the link called Cluster Protocol to register the protocol and set the procedure parameters. The parameters that ESTAP provides are from the d2_cluster and CAP3 programs. The Register Cluster/Assembly Protocol page has a link, "Cluster/Assembly Protocol Guide", which provides paper references for the d2_cluster and CAP3 programs. Once the protocol is registered, ESTAP will automatically perform cluster/assembly for your project according to the protocol you registered.

If you wish to perform cluster/assembly for combined libraries (not for each project), you need to register a virtual project and select the taxonomy of the libraries. Then register the BLAST and cluster/assembly protocols for this virtual project. Once the cluster protocol is registered, ESTAP will automatically perform cluster/assembly for the virtual project.

The cluster/assembly protocol also provides an option to perform InterProScan analysis. Users may choose this option to let ESTAP automatically annotate the protein functions of singlets and contigs after the clustering/assembly procedure is performed.

PIs and the primary contacts with update privilege may also update the protocols. The updated protocols will be effective for new sequence data, not for data that have already been analyzed. If the user wishes to reanalyze the old data, he or she must make such a request to the ESTAP curator. The curator should use the following instructions.

1) If the sequences of a project must be re-cleansed using the new protocol, the curator must delete the sequences of the project from the ESTAP database and reload the sequences by running the estapDriver.exe program. Use the following command to delete the sequences from the database:

    delete from raw_seq_file where project_id = x;
    commit;

project_id is the internal id for the project. The curator may get the id from the PROJECT table by matching the lab code and project code of the project:

    select project_id from project where lab_code = 'xx' and user_proj_code = 'xxx';

Please see Section IV.4.e for estapDriver.exe usage.

2) If the sequences of a project must be re-blasted using the new BLAST protocol, the curator must delete the old blast results of that project from the ESTAP database and run the runBlast.exe program. Use the following command to delete the blast results from the database:

    delete from analysis_tracking where project_id = x;
    commit;

Please see Section IV.4.f for runBlast.exe usage.

3) If the sequences of a project must be re-clustered using the new clustering/assembly protocol, the curator must delete the old clustering/assembly results and the contig blast results of that project from the ESTAP database. Then run the cluster.exe program for clustering/assembly and the blastAllContigs.exe program for contig blast. Use the following command to delete these results from the database:

    delete from cluster_summary where project_id = x;
    commit;

Use the "cluster.exe [project_id]" command to redo the clustering/assembly analysis and "blastAllContigs.exe" to redo the contig blast analysis. Note: [project_id] is the internal id for that project. Please see Section IV.4.h and Section IV.4.i for how to run the cluster.exe and blastAllContigs.exe programs.

VIII. How to Prepare Raw EST Data

Users can send ESTAP either the chromatograms of the sequences (.ab1 files) or both the sequence files and their corresponding quality score files in FASTA format. If the user sends ESTAP the chromatograms, ESTAP will use the Phred base-calling program to generate the sequences and the quality scores.

Each sequence file contains sequences from the same sequencing run of the same project. Each quality score file contains the quality scores of the sequences in the corresponding sequence file. Each sequence or quality score string in the file must start with a ">" followed by a sequence name.

To process data files, ESTAP requires that data providers use a standard file name for each file and a standard sequence name for each sequence in the file. The file name contains specific information about that file, and the sequence name contains specific information about the sequence. All the sequences in the same file must be from the same project and the same sequencing run. The naming convention and rules are described as follows:

  • Both the file name and the sequence name are required to be of a fixed length.
  • Each element in the name must conform to a required number of characters or digits.
  • Pad with 0's (zeros) in front of any element that is shorter than the required length.

The file names and sequence names are case-sensitive.

1. File naming convention

All sequence files will have ".seq" as a suffix and quality score files will have ".qal" as a suffix. Every seq string must have an exactly matching qal string by name. The file names use the following naming convention: LLPPPTTIIIBRRNNNCCYYMMDDHHMM.seq for the sequence file and LLPPPTTIIIBRRNNNCCYYMMDDHHMM.qal for the corresponding quality score file.

    LL   - two-character alphanumeric lab code (provided by VBI), e.g., UN
    PPP  - three-character alphanumeric project id (provided by VBI), e.g., MCA
    TT   - two-character alphanumeric instrument type (provided by VBI), e.g., A1
    III  - three-character alphanumeric instrument id (user-defined code that is unique to the lab), e.g., VT1
    B    - one-character alphanumeric base-calling program id (provided by VBI), e.g., A
    RR   - two-character alphanumeric run parameter id (provided by VBI), e.g., S1
    NNN  - three-digit run duration in minutes (actual running time in minutes, e.g., 090 for 90 minutes)
    CC   - two-character alphanumeric sequencing chemistry id (provided by VBI), e.g., Pr
    YYMMDDHHMM - the date and time at which the sequence was run, e.g., 0104150930
        YY - two-digit year
        MM - two-digit month
        DD - two-digit day
        HH - two-digit hour
        MM - two-digit minute
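Because every element of the file name has a fixed width, a name can be split by position alone. The following Python sketch illustrates this for the LLPPPTTIIIBRRNNNCCYYMMDDHHMM layout. It is not part of ESTAP: the function name parse_file_name and the dictionary keys are hypothetical, and the sketch assumes that the suffix is written with a dot (.seq or .qal) and that a name without a suffix is also acceptable.

```python
from datetime import datetime

# (element name, width) in the order LLPPPTTIIIBRRNNNCCYYMMDDHHMM.
# The element names are illustrative labels, not ESTAP column names.
FILE_NAME_FIELDS = [
    ("lab_code", 2), ("project_code", 3), ("instrument_type", 2),
    ("instrument_id", 3), ("base_caller", 1), ("run_parameter", 2),
    ("run_minutes", 3), ("chemistry", 2), ("run_datetime", 10),
]
SUFFIXES = (".seq", ".qal")


def parse_file_name(file_name):
    """Split an ESTAP-style file name into its fixed-width elements."""
    stem, suffix = file_name, None
    for candidate in SUFFIXES:
        if file_name.endswith(candidate):
            stem, suffix = file_name[:-len(candidate)], candidate
    expected = sum(width for _, width in FILE_NAME_FIELDS)
    if len(stem) != expected or not stem.isalnum():
        raise ValueError(f"expected {expected} alphanumeric characters, got {stem!r}")
    fields, pos = {}, 0
    for name, width in FILE_NAME_FIELDS:
        fields[name] = stem[pos:pos + width]
        pos += width
    if not fields["run_minutes"].isdigit():
        raise ValueError("run duration must be three digits")
    fields["run_minutes"] = int(fields["run_minutes"])
    # YYMMDDHHMM; strptime raises ValueError for an impossible date or time.
    fields["run_datetime"] = datetime.strptime(fields["run_datetime"], "%y%m%d%H%M")
    fields["suffix"] = suffix
    return fields


if __name__ == "__main__":
    info = parse_file_name("VBOVTA2VT1CS1120Tb0102011615.seq")
    print(info["lab_code"], info["project_code"], info["run_minutes"], info["run_datetime"])
```

A check of this kind only verifies the layout; whether a given lab, project, or instrument code is actually registered still has to be confirmed against the ESTAP code tables.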

2. Sequence Naming Convention

The sequence names in the sequence and quality score files must be in the following format: TPPPNNNNNNNNNLLL

    T    - one-letter primer type: u for user-defined primer and s for standard primer
    PPP  - three-character alphanumeric primer id (provided or confirmed by VBI), e.g., T3a
    NNNNNNNNN - nine-character alphanumeric clone name, unique within the project (user defined)
    LLL  - three-digit lane or capillary number, e.g., 012 for lane 12 in the sequence gel

Note: Using the correct primer code (first 4 letters) is very important for the cleansing procedure. The correct primer code indicates the correct orientation of the sequencing (5' or 3'). A wrong orientation will cause vector/adaptor removal to fail.
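The TPPPNNNNNNNNNLLL layout can be checked with a single regular expression. The Python sketch below is an illustration only: the names SEQUENCE_NAME and parse_sequence_name and the dictionary keys are hypothetical, and the sketch checks the format, not whether the primer code is registered or has the right orientation.

```python
import re

# T (u or s), PPP (3 alphanumerics), clone name (9 alphanumerics), lane (3 digits).
SEQUENCE_NAME = re.compile(
    r"^(?P<primer_type>[us])"
    r"(?P<primer_id>[A-Za-z0-9]{3})"
    r"(?P<clone_name>[A-Za-z0-9]{9})"
    r"(?P<lane>[0-9]{3})$"
)


def parse_sequence_name(name):
    """Split a sequence name into primer type, primer id, clone name and lane."""
    match = SEQUENCE_NAME.match(name.lstrip(">"))  # tolerate a leading FASTA ">"
    if match is None:
        raise ValueError(f"{name!r} does not follow TPPPNNNNNNNNNLLL")
    fields = match.groupdict()
    fields["primer_code"] = fields["primer_type"] + fields["primer_id"]  # first 4 letters
    fields["lane"] = int(fields["lane"])
    return fields


if __name__ == "__main__":
    print(parse_sequence_name(">sM13rroot0002012"))
```

The expression is case-sensitive on purpose, since the guide states that sequence names are case-sensitive.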

3. Example of a Sequence and a Quality Score File

a. Sequence file

File name: VBOVTA2VT1CS1120Tb0102011615.seq

The following is the content of the file:

    >sM13rroot0002012
    GTCGTAGCTAGCTAGCTGTGGACTGACTGA
    CTAGCATCAGCTTGACTGACTAGCATAGCT
    AGCATACGCTATCGATCAGGGCAGACTAGC
    TGACTAGCATGACTAGCATT
    >sM13rroot0003013
    GTCGTAGCTTAGCTAGTCAGATCAGCATCG
    AATACGATCAGCTAGCTGTGGACTGACTGA
    CTAGCATCAGCTTGACTGACTAGCATAGCT
    AGCATACGCTTACGTACACAGACACACA

b. Quality score file

File name: VBOVTA2VT1CS1120Tb0102011615.qal

The following is the content of the file:

    >sM13rroot0002012
    9 13 16 17 16 12 7 7 19 19 25 27 29 28 30 31 29 30 29 29 30 30 37 37 33 37 40 44 40 35
    35 40 40 40 46 40 35 35 35 35 35 35 35 40 40 37 35 40 40 40 40 40 40 40 46 46 46 46 46 47
    51 51 42 42 45 45 45 45 45 45 51 46 56 56 56 56 46 51 51 51 45 40 40 40 40 40 37 45 46 42
    42 44 44 44 44 56 56 37 40 40 40 39 37 37 37 37 39 40 40 43
    >sM13rroot0003013
    9 13 13 19 19 9 6 6 16 19 24 24 29 29 29 29 30 37 46 34 37 40 46 40 37 37 37 37 37 40
    40 40 40 40 40 35 35 40 40 37 35 35 35 35 35 35 39 40 46 46 46 46 46 56 46 42 42 46 43 51
    51 51 51 42 42 51 42 42 51 51 51 51 56 51 51 51 51 51 51 56 56 56 56 56 46 42 42 42 42 42
    42 42 42 43 43 35 35 35 35 35 35 56 51 51 51 51 51 51 56 56 56 56 40 40 40 40 40 40

Explanation of file name VBOVTA2VT1CS1120Tb0102011615.seq:

The lab code is VB. The project code is OVT (a library of Orobanche minor v. Virginia tubercles). The instrument is a 377, so the code is A2. The id for this instrument is VT1. The base-calling program is ABI100, which has a code of C. The run parameter code is S1. Run time is 2 hours, so the run duration is 120. I have used dye terminator chemistry with big dyes, so the code is Tb. I ran these samples on Feb 1, 2001 at 4:15 pm, so the date/time stamp will be 0102011615. You only have to enter this once. The data itself carries a shorter name.

Explanation of sequence name sM13rroot0002012:

The primer is standard primer M13f. So the name starts with an "s", followed by the three-letter primer code "M13". The clone name is rroot0002. The sequence was run on lane 12, which has a code "012".
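The rule that every sequence entry must have an exactly matching quality score entry by name can be checked before the files are sent. The Python sketch below is one possible way to do so and is not part of ESTAP. The function names record_names and unmatched_names are hypothetical, and the sketch assumes that each record header is a line starting with ">" as in the example above.

```python
def record_names(fasta_text):
    """Return the record names (header lines without the leading '>')."""
    return [line[1:].strip() for line in fasta_text.splitlines() if line.startswith(">")]


def unmatched_names(seq_text, qal_text):
    """Return (names only in the sequence file, names only in the quality file)."""
    seq_names = set(record_names(seq_text))
    qal_names = set(record_names(qal_text))
    return sorted(seq_names - qal_names), sorted(qal_names - seq_names)


if __name__ == "__main__":
    seq = ">sM13rroot0002012\nGTCGTAGC\n>sM13rroot0003013\nGTCGTAGC\n"
    qal = ">sM13rroot0002012\n9 13 16 17 16 12 7 7\n"
    only_seq, only_qal = unmatched_names(seq, qal)
    print("missing quality scores for:", only_seq)
    print("missing sequences for:", only_qal)
```

Both lists are empty when the two files agree. The comparison is case-sensitive, in line with the naming rules.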

4. Sending Chromatogram Files

If you wish ESTAP to run Phred for you, send the chromatograms of the sequences. Put all sequences from the same sequencing run of the same project in one folder. The name of the folder follows the file naming convention without the .seq or .qal suffix, e.g., VBOVTA2VT1CS1120Tb0102011615. Each chromatogram file follows the sequence naming convention plus the suffix ".ab1", e.g., sM13rroot0002012.ab1.
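Before sending a chromatogram folder, its layout can be checked quickly. The following Python sketch is an illustration, not an ESTAP tool: the function name check_run_folder is hypothetical, and the sketch assumes a dot before the ab1 suffix and checks name lengths only (28 characters for the folder, 16 for each sequence name), not the individual codes.

```python
from pathlib import Path

FOLDER_NAME_LENGTH = 28    # LLPPPTTIIIBRRNNNCCYYMMDDHHMM
SEQUENCE_NAME_LENGTH = 16  # TPPPNNNNNNNNNLLL


def check_run_folder(folder):
    """Return a list of problems found; an empty list means the layout looks fine."""
    folder = Path(folder)
    problems = []
    if len(folder.name) != FOLDER_NAME_LENGTH:
        problems.append(
            f"folder name {folder.name!r} is not {FOLDER_NAME_LENGTH} characters long"
        )
    for path in sorted(folder.iterdir()):
        if path.suffix != ".ab1":
            problems.append(f"{path.name}: expected the .ab1 suffix")
        elif len(path.stem) != SEQUENCE_NAME_LENGTH:
            problems.append(
                f"{path.name}: sequence name is not {SEQUENCE_NAME_LENGTH} characters long"
            )
    return problems


if __name__ == "__main__":
    for problem in check_run_folder("VBOVTA2VT1CS1120Tb0102011615"):
        print(problem)
```

Running the script requires that the example folder exists in the current directory; pass any other path to check_run_folder as needed.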

IX. How to Use dbEST Submission Tool

The dbEST Submission Tool incorporated in ESTAP is available online to allow users to conveniently submit their cleansed EST sequences in the ESTAP database to NCBI's dbEST database. Note that only PIs have the privilege to submit their EST sequences to dbEST. PIs may follow the "dbEST Submission" link on the "Project List" page to ask ESTAP to prepare the dbEST submission files for them. During dbEST submission, a contact file, a library file, a publication file, and one or more EST files will be created and sent to the primary contact person via email. EST files will be in FASTA format, with clone name and clean sequence ids as identifiers.

After submitting the EST sequences to NCBI's dbEST, the PI should receive an email from NCBI regarding the batch of ESTs submitted. In order to store the dbEST_ID, USER_ID, and GENBANK_ACCN assigned by NCBI in the ESTAP database, the PI should send this confirmation file received from NCBI to the ESTAP curator. The ESTAP curator should use the ReadConfirmationFile program to process the confirmation file.

The ReadConfirmationFile program is located under the java_prog directory of the uncompressed software. It is a stand-alone Java program which reads and parses the dbEST confirmation file and stores the dbEST_ID, USER_ID, and GENBANK_ACCN assigned by NCBI in the ESTAP database. To make this program work, the curator needs to modify a configuration file called "javaConfig.txt" in the java_prog directory and set the correct parameters (see Section IV.1.c). ReadConfirmationFile will use the parameters defined in this file to find the confirmation file and connect to the ESTAP database. The confirmation file should be named with a .txt suffix (e.g., JC_MCA_020314052943.txt) and placed in the directory path specified in the variable "ncbiConfirmFilePath" in the javaConfig.txt file. Go to the directory where ReadConfirmationFile.class is located and run the ReadConfirmationFile program using the following command:

    java ReadConfirmationFile


status = 1, e.g., DELETE from ANALYSIS_TRACKING where status = 1. Commit the changes after you are sure that you want to delete them. Then rerun the runBlast.exe program.

c. If the program was cluster.exe, after you change the status in the PROG_RUN_STATUS table, please go to the CLUSTER_SUMMARY table in the ESTAP_ANALYSIS schema and delete the row with the project id that you specified when you ran cluster.exe. Commit the changes after you are sure that you want to delete them. Then rerun the cluster.exe program.

d. If the program was clusterAll.exe, after you change the status in the PROG_RUN_STATUS table (look for "cluster.exe" and "clusterAll.exe" in the prog_name column), please go to the CLUSTER_SUMMARY table in the ESTAP_ANALYSIS schema and delete the rows created by the program that have a status of 1. Commit the changes after you are sure that you want to delete them. Then rerun the clusterAll.exe program.

e. If the program was blastAllContigs.exe or reblastAllContigs.exe, after you change the status in the PROG_RUN_STATUS table, please go to the CONTIG_ANALYSIS table in the ESTAP_ANALYSIS schema and delete those rows created by the program that have status = 1, e.g., DELETE from CONTIG_ANALYSIS where status = 1. Commit the changes after you are sure that you want to delete them. Then rerun the blastAllContigs.exe program.

V. How to Install and Run Web Programs

1. Install Oracle Client including JDBC
Please contact Oracle to obtain licensing and product support. ESTAP uses the Oracle 9i client.

2. Install JDK, Servlet and Web Service Software

a. To install JDK 1.3 (Windows, Linux, Solaris) or higher, please go to http://java.sun.com/j2se/.

b. To install Tomcat, please go to http://jakarta.apache.org/tomcat/.

c. To install Apache Axis, please go to http://ws.apache.org/axis/. The Apache Axis beta 3 version is used for the ESTAP local BLAST Web service.

3. Configure Tomcat

a. Visit http://jakarta.apache.org/tomcat/ for instructions on Tomcat configuration.

b. Set up the correct PATH and CLASSPATH environment variables.

c. Modify server.xml in the Tomcat conf directory. In this file, add the following for the estap context:

    <Context path="/estap" docBase="estap" showdebuginfo="true" debug="DEBUG" reloadable="true" crossContext="false">
        <Logger className="org.apache.catalina.logger.FileLogger" prefix="localhost_estap_log." suffix=".txt" timestamp="true"/>
    </Context>

d. Place the estap Web application under the Tomcat webapps directory. Copy the estap directory from the web_side directory of the uncompressed software to the webapps directory.

e. Modify the web.xml file in the estap/WEB-INF directory. The following section in web.xml is specific to ESTAP. Please modify the parameter values for your own settings.

    <servlet>
        <servlet-name>ProjectList</servlet-name>
        <servlet-class>ProjectList</servlet-class>
        <init-param>
            <param-name>jdbcDriverClassName</param-name>
            <param-value>oracle.jdbc.driver.OracleDriver</param-value>
        </init-param>
        <init-param>
            <param-name>jdbcURL</param-name>
            <param-value>jdbc:oracle:thin:@server.vt.edu:1521:estapdb</param-value>
        </init-param>
        <init-param>
            <param-name>dbUserName</param-name>
            <param-value>estap_user</param-value>
        </init-param>
        <init-param>
            <param-name>dbUserPassword</param-name>
            <param-value>password</param-value>
        </init-param>
        <init-param>
            <param-name>dbInitConnect</param-name>
            <param-value>1</param-value>
        </init-param>
        <init-param>
            <param-name>dbMaxConnect</param-name>
            <param-value>50</param-value>
        </init-param>
        <init-param>
            <param-name>dbESTFilePath</param-name>
            <param-value>/home/estap/dbEST_Outgoing</param-value>
        </init-param>
        <init-param>
            <param-name>ESTDBFilePath</param-name>
            <param-value>/home/estap/estdb</param-value>
        </init-param>
        <init-param>
            <param-name>blastallPath</param-name>
            <param-value>/home/estap/blast/blastall</param-value>
        </init-param>
        <init-param>
            <param-name>formatdbPath</param-name>
            <param-value>/home/estap/blast/formatdb</param-value>
        </init-param>
        <init-param>
            <param-name>estapEmail</param-name>
            <param-value>estap@vbi.vt.edu</param-value>
        </init-param>
        <load-on-startup>1</load-on-startup>
    </servlet>

The following list defines the parameters:

    jdbcURL: the URL for the JDBC connection
    dbUserName: the ESTAP database user name
    dbUserPassword: the ESTAP database user password
    dbInitConnect: the number of database connections to be made initially
    dbMaxConnect: the maximum number of database connections
    dbESTFilePath: the directory path to store dbEST submission files
    ESTDBFilePath: the directory path to store the EST files in FASTA format and their formatted files used by the ESTAP local BLAST Web service
    blastallPath: the path for the blastall program
    formatdbPath: the path for the formatdb program
    estapEmail: the ESTAP curator's email address
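
Before starting Tomcat, it can help to sanity-check the values you put in web.xml. The following is a minimal sketch, not part of the ESTAP distribution: the class EstapWebConfigCheck and its methods are hypothetical names, and the checks (non-empty values, an Oracle thin-driver URL of the form jdbc:oracle:thin:@host:port:SID, sensible connection counts) are assumptions about what a reasonable configuration looks like. The parameter names and sample values are the ones listed above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper for checking ProjectList init-param values; not shipped with ESTAP. */
class EstapWebConfigCheck {

    /** Parameter names copied from the ProjectList servlet section of web.xml. */
    static final List<String> REQUIRED = Arrays.asList(
            "jdbcDriverClassName", "jdbcURL", "dbUserName", "dbUserPassword",
            "dbInitConnect", "dbMaxConnect", "dbESTFilePath", "ESTDBFilePath",
            "blastallPath", "formatdbPath", "estapEmail");

    /** Builds an Oracle thin-driver URL: jdbc:oracle:thin:@host:port:SID. */
    static String thinUrl(String host, int port, String sid) {
        return "jdbc:oracle:thin:@" + host + ":" + port + ":" + sid;
    }

    /** Returns the sample values shown in the guide. */
    static Map<String, String> sample() {
        Map<String, String> p = new LinkedHashMap<String, String>();
        p.put("jdbcDriverClassName", "oracle.jdbc.driver.OracleDriver");
        p.put("jdbcURL", thinUrl("server.vt.edu", 1521, "estapdb"));
        p.put("dbUserName", "estap_user");
        p.put("dbUserPassword", "password");
        p.put("dbInitConnect", "1");
        p.put("dbMaxConnect", "50");
        p.put("dbESTFilePath", "/home/estap/dbEST_Outgoing");
        p.put("ESTDBFilePath", "/home/estap/estdb");
        p.put("blastallPath", "/home/estap/blast/blastall");
        p.put("formatdbPath", "/home/estap/blast/formatdb");
        p.put("estapEmail", "estap@vbi.vt.edu");
        return p;
    }

    /** Returns a list of problems; an empty list means the values look plausible. */
    static List<String> validate(Map<String, String> params) {
        List<String> problems = new ArrayList<String>();
        for (String name : REQUIRED) {
            String value = params.get(name);
            if (value == null || value.trim().isEmpty()) {
                problems.add("missing value for " + name);
            }
        }
        String url = params.get("jdbcURL");
        if (url != null && !url.startsWith("jdbc:oracle:thin:@")) {
            problems.add("jdbcURL does not look like an Oracle thin-driver URL");
        }
        try {
            int init = Integer.parseInt(params.get("dbInitConnect"));
            int max = Integer.parseInt(params.get("dbMaxConnect"));
            if (init < 1 || max < init) {
                problems.add("expected 1 <= dbInitConnect <= dbMaxConnect");
            }
        } catch (NumberFormatException e) {
            problems.add("dbInitConnect and dbMaxConnect must be integers");
        }
        return problems;
    }

    public static void main(String[] args) {
        List<String> problems = validate(sample());
        System.out.println(problems.isEmpty() ? "configuration looks plausible" : problems.toString());
    }
}
```

The check works on a plain map, so the values can be typed in by hand or read from web.xml with any XML parser.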

f. Set heap size for JVM

Some of the ESTAP web programs require more memory than the JVM default heap size of 64MB. To avoid an "out of memory" error, the user should set a larger heap size for the JVM. To do this, the user must set the "JAVA_OPTS" variable in the catalina.sh/catalina.bat or setclasspath.sh/setclasspath.bat file in the tomcat_home/bin directory. For example, if you wish to set the heap size to 256MB, you may add a line "set JAVA_OPTS=-mx256m" in the catalina.bat file.
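
To see what maximum heap a JVM actually receives for a given option, a small stand-alone program can print it. This is only a sketch; the class name HeapSizeCheck is hypothetical and not part of ESTAP. It relies solely on the standard Runtime.maxMemory() call. Note that current JVM documentation spells the heap option -Xmx256m; the -mx256m form used above is the older spelling.

```java
/** Hypothetical helper; prints the maximum heap the running JVM may use. */
class HeapSizeCheck {

    /** Converts a byte count to whole megabytes (1 MB = 1024 * 1024 bytes). */
    static long toMegabytes(long bytes) {
        return bytes / (1024L * 1024L);
    }

    public static void main(String[] args) {
        long maxMb = toMegabytes(Runtime.getRuntime().maxMemory());
        System.out.println("Maximum heap available to this JVM: " + maxMb + " MB");
    }
}
```

Run it with the same option you placed in JAVA_OPTS, for example "java -Xmx256m HeapSizeCheck". The reported figure may be slightly below the requested size, depending on the JVM.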

g. Start Tomcat server

To start the Tomcat server, execute tomcat_home/bin/startup.bat on Windows (or execute tomcat_home/bin/startup.sh on Unix/Linux). Then enter the URL http://localhost/ in your browser and make sure that you get the Tomcat welcome page, not an error message saying that the page cannot be displayed or that the server cannot be found. If you choose to use port number 8080 instead of 80, you will need to use a URL such as http://localhost:8080/ that includes the port number. If this is successful, enter http://localhost/estap/servlet/Login or http://localhost:8080/estap/servlet/Login to see if the ESTAP login page is displayed.
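
The browser check above can also be done from the command line. The sketch below is an illustration only: EstapLoginProbe is a hypothetical class, not part of ESTAP, and it assumes the login servlet is mapped at /estap/servlet/Login as in the URLs above. It uses only the standard java.net classes and reports the HTTP status code returned by the server.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

/** Hypothetical smoke test for the ESTAP login page; not part of the ESTAP distribution. */
class EstapLoginProbe {

    /** Builds the login URL, omitting the port when it is the HTTP default (80). */
    static String loginUrl(String host, int port) {
        String authority = (port == 80) ? host : host + ":" + port;
        return "http://" + authority + "/estap/servlet/Login";
    }

    /** Sends a GET request and returns the HTTP status code. */
    static int statusOf(String url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setConnectTimeout(5000); // give up after 5 seconds if Tomcat is not listening
        conn.setReadTimeout(5000);
        conn.setRequestMethod("GET");
        try {
            return conn.getResponseCode();
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) {
        int port = args.length > 0 ? Integer.parseInt(args[0]) : 8080;
        String url = loginUrl("localhost", port);
        try {
            System.out.println(url + " -> HTTP " + statusOf(url));
        } catch (IOException e) {
            System.out.println(url + " is not reachable: " + e.getMessage());
        }
    }
}
```

A status of 200 suggests the servlet is deployed. A connection error usually means Tomcat is not running or is listening on a different port.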

4. Configure Axis

a. Go to http://ws.apache.org/axis/ for downloading and installing instructions.

b. Place the downloaded axis directory under the tomcat_home/webapps directory.

c. Move the server-config.wsdd file from the tomcat_home/webapps/estap/WEB-INF directory to the tomcat_home/webapps/axis/WEB-INF directory.

d. Add all jar files under the tomcat_home/webapps/axis/WEB-INF/lib and tomcat_home/webapps/axis/WEB-INF/classes directories to your CLASSPATH environment variable.

e. Move the blastService and createESTDBService directories from the tomcat_home/webapps/estap/WEB-INF/classes directory to the tomcat_home/webapps/axis/WEB-INF/classes directory.

f. Replace the server name and Tomcat port with your server name and port number in the JBlastServiceLocator.java and JCreateESTDBServiceLocator.java files under the blastService/ws and createESTDBService/ws directories, respectively. E.g., for blastService, run the commands

    cd blastService
    cd ws

Edit the JBlastServiceLocator.java file by changing the line http://arda.vbi.vt.edu:8080 to http://yourServerName:yourPort. Then compile the file just modified by running the command

    javac JBlastServiceLocator.java

Repeat the above process for the createESTDBService directory.

g. Copy the blastService and createESTDBService directories to the tomcat_home/webapps/estap/WEB-INF/classes directory.

h. Compile EstapWebBlast.java by running the command

    javac EstapWebBlast.java

i. Restart the Tomcat server and go to http://localhost/axis/ or http://localhost:8080/axis/ to view the list of deployed web services. If Axis is configured correctly, you will see the createESTDB and blast services.

VI. How to Register and Update ESTAP Users

ESTAP defines 3 types of users with different role permissions: sponsors, PIs, and other users.

Sponsors are those who fund the development of ESTAP. If there is no 'sponsor' according to this definition, then the person who is in charge of the ESTAP system becomes the sponsor. There is one sponsor per institution. Sponsors can register/update PIs and have all PI privileges.

PIs are those people responsible for the projects. PIs can register/update projects, project users, and project protocols. PIs can also register ESTAP codes (e.g., primer, instrument, base caller, etc.), submit ESTs of their projects to dbEST, and view their project data and analysis results.

Other users include primary contacts with update permission, primary contacts with read-only permission, and project viewers with read-only permission. Primary contacts with update privileges (associated with particular project(s)) can update projects and project protocols and view project data and analysis results. Primary contacts with read-only permission and project viewers (associated with particular project(s)) can view project data and analysis results.

At least one sponsor must exist before PIs can register their projects and users. Sponsors are registered by the ESTAP curator/database administrator, who has direct access to the ESTAP tables in the database. Please use the following instructions to register sponsors:

1) In the INSTITUTION table, modify the first record, which has institution_id = 1. Change the institution_name to the institution name of the sponsor. If there is more than one sponsor to be added, then add a new record for each remaining sponsor. For each record, increment institution_id by 1 and enter the institution name and description in the institution_name and institution_desc columns.

2) In the LAB table, modify the record with the lab_code of XX. Change all columns of this record to the desired values. The lab_email entry must be the email address of the sponsor. The lab_alt_email entry is the email address of the primary contact in the sponsor's lab. Each sponsor has one lab. You may add more records if there is more than one sponsor.

3) In the PERSON table, modify the person record with the person_id of 1. Change the user_name to the email address of the sponsor and the password to the sponsor's password for logging on to the Web site. The user_type is 1 and the sponsor_id is 1 (referring to the person_id of the sponsor him/herself in the PERSON table). Other columns are self-explanatory by their names. Add more records if there is more than one sponsor, and increment person_id for each additional sponsor.

4) In the CONTACT table, change the record with the contact_id of 1 and person_id of 1 to the contact information of the sponsor. Add more records if there is more than one sponsor, and increment contact_id for each additional sponsor.

Once sponsors are registered in the ESTAP system, they can log on to the ESTAP Web site to register or update PIs in their institutions. Once PIs are registered, PIs can log on to the ESTAP Web site to register or update their projects. In addition, PIs can register or update the primary contacts and the project viewers for their projects.

VII. How to Register and Update EST Projects and Protocols

1. Project Registration and Update

ESTAP defines two types of projects: regular and virtual projects. A regular project refers to a specific EST library that a lab constructed. The regular project information includes biological information regarding tissues and cloning methods used in cDNA library construction, and EST sequencing information. In order to analyze combined libraries from a particular taxonomy, we introduced the virtual project concept. We now allow PIs to register a virtual project that combines their regular projects from a particular taxonomy.

To register a regular project or a virtual project, please go to the "Project List" page, click the "Register Regular Project" link or the "Register Virtual Project" link, and follow the instructions carefully in the registration form.

PIs and the primary contacts with update privilege may also update the project information via the Web. The current version of ESTAP does not allow updating a project when the sequences of that project have been processed and stored into the ESTAP database. If the user needs to make corrections to the project information after the sequences are processed, he or she has to make a request to the ESTAP curator. The ESTAP curator should delete the already loaded sequences for that project, make corrections to the project information, and then reload the sequences by running the estapDriver.exe program.

2. Protocol Registration and Update

ESTAP offers the following protocols for data analysis:

1) Cleansing protocols

The cleansing protocols contain procedures to cleanse raw sequences by removing low quality end sequences, vectors, and polyA (or polyT), and by screening for chimeras and contamination. There are two types of cleansing protocols: one is the cleansing protocol for BLAST and the other is the cleansing protocol for assembly. The sequences cleansed by the cleansing protocol for BLAST are used for BLAST searching, and the sequences cleansed by the cleansing protocol for assembly are used for the cluster/assembly analysis. The user may utilize the same procedures and parameters for both protocols. There are several key points for the cleansing to be successful:

a. The cleansing program uses the 5' and 3' vector fragments provided in the project registration to match the vector portions in the clones. So the 5' and 3' vector fragments (both the orientation of the fragments and the sequences) must be correct.

b. The vector and insert adaptor sequences provided in the project registration should not be the whole adaptor sequences. Only the regions that are present in the clones should be included.

c. The sequencing primer code in the sequence names of the raw sequence data must be correct. The correct primer code indicates the correct orientation of the sequencing (5' or 3'). A wrong orientation will cause the vector/adaptor removal to fail. Please check the primer table carefully to make sure the orientation of the primer is correct.

d. When the cleansing procedure does not seem to give the expected result, please check whether the above key points have been followed.

2) Blast protocol

The blast protocol provides procedures and parameters for BLAST (BLASTN, BLASTX, and/or TBLASTX) searches. Users may choose to run all, some, or none of the BLAST procedures in the protocol.

3) Cluster/assembly protocol

The cluster/assembly protocol provides procedures and parameters for clustering and assembly analysis. Users register this protocol only if they wish to do cluster/assembly. Projects are not automatically clustered without a cluster protocol, so to let ESTAP run the cluster/assembly programs, users need to register the cluster protocol for each project they wish to cluster/assemble. Please go to the View Project page of the project you wish to cluster/assemble and follow the link called Cluster Protocol to register the protocol and set the procedure parameters. The parameters that ESTAP provides are from the d2_cluster and CAP3 programs. The Register Cluster/Assembly Protocol page has a link, "Cluster/Assembly Protocol Guide", which provides paper references for the d2_cluster and CAP3 programs. Once the protocol is registered, ESTAP will automatically perform cluster/assembly for your project according to the protocol you registered.

If you wish to perform cluster/assembly for combined libraries (not for each project), you need to register a virtual project and select the taxonomy of the libraries. Then register the BLAST and cluster/assembly protocols for this virtual project. Once the cluster protocol is registered, ESTAP will automatically perform cluster/assembly for the virtual project.

The cluster/assembly protocol also provides an option to perform InterProScan analysis. Users may choose this option to let ESTAP automatically annotate the protein functions of singlets and contigs after the clustering/assembly procedure is performed.

PIs and the primary contacts with update privilege may also update the protocols. The updated protocols will be effective for new sequence data, not for the data that have already been analyzed. If the user wishes to reanalyze the old data, he or she must make such a request to the ESTAP curator. The curator should use the following instructions:

1) If the sequences of a project must be re-cleansed using the new protocol, the curator must delete the sequences of the project from the ESTAP database and reload the sequences by running the estapDriver.exe program. Use the following commands to delete the sequences from the database:

    delete from raw_seq_file where project_id = x;
    commit;

project_id is the internal id for the project. The curator may get the id from the PROJECT table by matching the lab code and project code of the project:

    select project_id from project where lab_code = 'xx' and user_proj_code = 'xxx';

Please see Section IV.4.e for estapDriver.exe usage.

2) If the sequences of a project must be re-blasted using the new BLAST protocol, the curator must delete the old BLAST results of that project from the ESTAP database and run the runBlast.exe program. Use the following commands to delete the BLAST results from the database:

    delete from analysis_tracking where project_id = x;
    commit;

Please see Section IV.4.f for runBlast.exe usage.

3) If the sequences of a project must be re-clustered using the new clustering/assembly protocol, the curator must delete the old clustering/assembly and contig BLAST results of that project from the ESTAP database. Then run the cluster.exe program for clustering/assembly and the blastAllContigs.exe program for contig BLAST. Use the following commands to delete the old results from the database:

    delete from cluster_summary where project_id = x;
    commit;

Use the "cluster.exe [project_id]" command to redo the clustering/assembly analysis and "blastAllContigs.exe" to redo the contig BLAST analysis. Note: [project_id] is the internal id for that project. Please see Section IV.4.h and Section IV.4.i for how to run the cluster.exe and blastAllContigs.exe programs.
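
A curator who prefers to script these deletions rather than type them into an SQL prompt could wrap them in a small JDBC routine. The following is only a sketch under stated assumptions: ReanalysisCleanup and the step names "cleanse", "blast" and "cluster" are hypothetical, and the table names are the ones used in the delete statements above. It uses standard java.sql calls, binds the project id as a parameter, and commits only if the delete succeeds.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

/** Hypothetical helper for the re-analysis deletes; not part of the ESTAP distribution. */
class ReanalysisCleanup {

    /** Maps a re-analysis step (names are illustrative) to the table named in the guide. */
    static String tableFor(String step) {
        if ("cleanse".equals(step)) return "raw_seq_file";
        if ("blast".equals(step)) return "analysis_tracking";
        if ("cluster".equals(step)) return "cluster_summary";
        throw new IllegalArgumentException("unknown step: " + step);
    }

    /** Builds the parameterized delete statement for a step. */
    static String deleteSql(String step) {
        return "delete from " + tableFor(step) + " where project_id = ?";
    }

    /**
     * Deletes the rows for one project and commits. On any SQL error the
     * transaction is rolled back, so nothing is removed. Returns the row count.
     */
    static int deleteForProject(Connection conn, String step, int projectId) throws SQLException {
        boolean previousAutoCommit = conn.getAutoCommit();
        conn.setAutoCommit(false);
        try (PreparedStatement ps = conn.prepareStatement(deleteSql(step))) {
            ps.setInt(1, projectId);
            int rows = ps.executeUpdate();
            conn.commit();
            return rows;
        } catch (SQLException e) {
            conn.rollback();
            throw e;
        } finally {
            conn.setAutoCommit(previousAutoCommit);
        }
    }

    public static void main(String[] args) {
        // Print the statements so they can be reviewed before anything is run.
        for (String step : new String[] {"cleanse", "blast", "cluster"}) {
            System.out.println(step + ": " + deleteSql(step));
        }
    }
}
```

The Connection would come from the Oracle JDBC driver configured earlier. A more cautious variant would report the row count and wait for confirmation before calling commit.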

VIII. How to Prepare Raw EST Data

Users can send ESTAP either the chromatograms of the sequences (.ab1 files) or both sequence files and their corresponding quality score files in FASTA format. If the user sends ESTAP the chromatograms, ESTAP will use the Phred base calling program to generate the sequences and the quality scores. Each sequence file contains sequences from the same sequencing run of the same project. Each quality score file contains the quality scores of the sequences in the corresponding sequence file. Each sequence or quality score string in the file must start with a ">" followed by a sequence name.

To process data files, ESTAP requires that data providers use a standard file name for each file and a standard sequence name for each sequence in the file. The file name contains specific information about that file, and the sequence name contains specific information about the sequence. All the sequences in the same file must be from the same project and the same sequencing run. The naming convention and rules are described as follows:

• Both file name and sequence name are required to be of a fixed length.
• Each element in the name must conform to a required number of characters or digits.
• Pad with 0's (zeros) in front of any element that is shorter than the required length.

The file names and sequence names are case-sensitive.

1. File Naming Convention

All sequence files will have ".seq" as a suffix, and quality score files will have ".qal" as a suffix. Every seq string must have an exactly matching qal string by name. The file names use the following naming convention: LLPPPTTIIIBRRNNNCCYYMMDDHHMM.seq for the sequence file and LLPPPTTIIIBRRNNNCCYYMMDDHHMM.qal for the corresponding quality score file.

LL - two-character alphanumeric lab code (provided by VBI), e.g., UN
PPP - three-character alphanumeric project id (provided by VBI), e.g., MCA
TT - two-character alphanumeric instrument type (provided by VBI), e.g., A1
III - three-character alphanumeric instrument id (user defined code that is unique to the lab), e.g., VT1
B - one-character alphanumeric base calling program id (provided by VBI), e.g., A
RR - two-character alphanumeric run parameter id (provided by VBI), e.g., S1
NNN - three-digit run duration in minutes (actual running time in minutes, e.g., 090 for 90 minutes)
CC - two-character alphanumeric sequencing chemistry id (provided by VBI), e.g., Pr
YYMMDDHHMM - the date and time at which the sequence was run, e.g., 0104150930
    YY - two-digit year
    MM - two-digit month
    DD - two-digit day
    HH - two-digit hour
    MM - two-digit minute
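
Because every element has a fixed width, a file name can be split by position alone. The sketch below illustrates this. EstapFileName and the field labels are hypothetical names chosen for this example, and the widths (2, 3, 2, 3, 1, 2, 3, 2, 10) are taken from the list above. It is not the parser ESTAP itself uses.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical fixed-width parser for ESTAP data file names. */
class EstapFileName {

    // Field labels are illustrative; widths follow LLPPPTTIIIBRRNNNCCYYMMDDHHMM.
    private static final String[] FIELDS = {
        "labCode", "projectCode", "instrumentType", "instrumentId", "baseCaller",
        "runParameter", "runMinutes", "chemistry", "timestamp"};
    private static final int[] WIDTHS = {2, 3, 2, 3, 1, 2, 3, 2, 10};
    static final int BASE_LENGTH = 28; // sum of the widths above

    /** Splits a name such as XXXXX.seq into its elements, or throws if it is malformed. */
    static Map<String, String> parse(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0) {
            throw new IllegalArgumentException("missing .seq or .qal suffix: " + fileName);
        }
        String suffix = fileName.substring(dot + 1);
        if (!suffix.equals("seq") && !suffix.equals("qal")) {
            throw new IllegalArgumentException("suffix must be seq or qal: " + fileName);
        }
        String base = fileName.substring(0, dot);
        if (base.length() != BASE_LENGTH) {
            throw new IllegalArgumentException("expected " + BASE_LENGTH
                    + " characters before the suffix, found " + base.length());
        }
        Map<String, String> out = new LinkedHashMap<String, String>();
        int pos = 0;
        for (int i = 0; i < FIELDS.length; i++) {
            out.put(FIELDS[i], base.substring(pos, pos + WIDTHS[i]));
            pos += WIDTHS[i];
        }
        // The run duration and the date/time stamp are defined as digits only.
        if (!out.get("runMinutes").matches("\\d{3}")) {
            throw new IllegalArgumentException("run duration must be three digits");
        }
        if (!out.get("timestamp").matches("\\d{10}")) {
            throw new IllegalArgumentException("date/time stamp must be ten digits");
        }
        out.put("suffix", suffix);
        return out;
    }

    public static void main(String[] args) {
        System.out.println(parse("VBOVTA2VT1CS1120Tb0102011615.seq"));
    }
}
```

The same routine accepts both .seq and .qal names, so a sequence file and its quality score file can be paired by comparing everything except the suffix.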

2. Sequence Naming Convention

The sequence names in the sequence and quality score files must be in the following format: TPPPNNNNNNNNNLLL

T - one-letter primer type: u for user defined primer and s for standard primer
PPP - three-character alphanumeric primer id (provided or confirmed by VBI), e.g., T3a
NNNNNNNNN - nine-character alphanumeric clone name, unique within the project (user defined)
LLL - three-digit lane or capillary number, e.g., 012 for lane 12 in the sequence gel

Note: Using the correct primer code (first 4 letters) is very important for the cleansing procedure. The correct primer code indicates the correct orientation of the sequencing (5' or 3'). A wrong orientation will cause the vector/adaptor removal to fail.
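
The sequence name can be checked and assembled in the same positional way. The sketch below is an illustration, not ESTAP code: EstapSequenceName is a hypothetical class, and the format method applies the zero-padding rule from the start of this section to the clone name and lane number.

```java
/** Hypothetical parser and formatter for TPPPNNNNNNNNNLLL sequence names. */
class EstapSequenceName {
    final char primerType;   // 's' for standard primer, 'u' for user defined primer
    final String primerId;   // three characters
    final String cloneName;  // nine characters
    final int lane;          // lane or capillary number

    EstapSequenceName(char primerType, String primerId, String cloneName, int lane) {
        this.primerType = primerType;
        this.primerId = primerId;
        this.cloneName = cloneName;
        this.lane = lane;
    }

    /** Parses a 16-character name; a leading '>' from a FASTA header is ignored. */
    static EstapSequenceName parse(String text) {
        String name = text.startsWith(">") ? text.substring(1) : text;
        if (name.length() != 16) {
            throw new IllegalArgumentException("expected 16 characters, found " + name.length());
        }
        char type = name.charAt(0);
        if (type != 's' && type != 'u') {
            throw new IllegalArgumentException("primer type must be s or u");
        }
        String laneText = name.substring(13);
        if (!laneText.matches("\\d{3}")) {
            throw new IllegalArgumentException("lane must be three digits");
        }
        return new EstapSequenceName(type, name.substring(1, 4), name.substring(4, 13),
                Integer.parseInt(laneText));
    }

    /** Builds a name, padding the clone name and lane with leading zeros. */
    static String format(char primerType, String primerId, String cloneName, int lane) {
        if (primerId.length() != 3 || cloneName.length() > 9 || lane < 0 || lane > 999) {
            throw new IllegalArgumentException("element does not fit its fixed width");
        }
        StringBuilder clone = new StringBuilder(cloneName);
        while (clone.length() < 9) {
            clone.insert(0, '0');
        }
        return primerType + primerId + clone + String.format("%03d", lane);
    }

    public static void main(String[] args) {
        EstapSequenceName n = parse("sM13rroot0002012");
        System.out.println(n.primerType + " " + n.primerId + " " + n.cloneName + " " + n.lane);
    }
}
```

Checking the first four characters against the primer table before submission is a cheap way to catch the orientation mistakes the note above warns about.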

3. Example of a Sequence and a Quality Score File

a. Sequence file

File name: VBOVTA2VT1CS1120Tb0102011615.seq

The following is the content of the file:

    >sM13rroot0002012
    GTCGTAGCTAGCTAGCTGTGGACTGACTGA
    CTAGCATCAGCTTGACTGACTAGCATAGCT
    AGCATACGCTATCGATCAGGGCAGACTAGC
    TGACTAGCATGACTAGCATT
    >sM13rroot0003013
    GTCGTAGCTTAGCTAGTCAGATCAGCATCG
    AATACGATCAGCTAGCTGTGGACTGACTGA
    CTAGCATCAGCTTGACTGACTAGCATAGCT
    AGCATACGCTTACGTACACAGACACACA

b. Quality score file

File name: VBOVTA2VT1CS1120Tb0102011615.qal

The following is the content of the file:

    >sM13rroot0002012
    9 13 16 17 16 12 7 7 19 19 25 27 29 28 30 31 29 30 29 29 30 30 37 37 33 37 40 44 40 35
    35 40 40 40 46 40 35 35 35 35 35 35 35 40 40 37 35 40 40 40 40 40 40 40 46 46 46 46 46 47
    51 51 42 42 45 45 45 45 45 45 51 46 56 56 56 56 46 51 51 51 45 40 40 40 40 40 37 45 46 42
    42 44 44 44 44 56 56 37 40 40 40 39 37 37 37 37 39 40 40 43
    >sM13rroot0003013
    9 13 13 19 19 9 6 6 16 19 24 24 29 29 29 29 30 37 46 34 37 40 46 40 37 37 37 37 37 40
    40 40 40 40 40 35 35 40 40 37 35 35 35 35 35 35 39 40 46 46 46 46 46 56 46 42 42 46 43 51
    51 51 51 42 42 51 42 42 51 51 51 51 56 51 51 51 51 51 51 56 56 56 56 56 46 42 42 42 42 42
    42 42 42 43 43 35 35 35 35 35 35 56 51 51 51 51 51 51 56 56 56 56 40 40 40 40 40 40

Explanation of the file name VBOVTA2VT1CS1120Tb0102011615.seq: The lab code is VB. The project code is OVT (a library of Orobanche minor v. Virginia tubercles). The instrument is a 377, so the code is A2. The id for this instrument is VT1. The base-calling program is ABI100, which has a code of C. The run parameter code is S1. Run time is 2 hours, so the run duration is 120. I have used dye terminator chemistry with big dyes, so the code is Tb. I ran these samples on Feb 1, 2001 at 4:15 pm, so the date/time stamp will be 0102011615. You only have to enter this once. The data itself carries a shorter name.

Explanation of the sequence name sM13rroot0002012: The primer is the standard primer M13f. So the name starts with an "s" followed by the three-letter primer code "M13". The clone name is rroot0002. The sequence was run on lane 12, which has the code "012".
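
Since every seq string needs a qal string with exactly the same name, it is worth comparing the two files before sending them. The following sketch does that on the text of the two files. SeqQualCheck is a hypothetical class, not an ESTAP program. The additional check that each record has one quality score per base is an assumption based on the example above (110 bases and 110 scores in the first record), not a rule stated in this guide.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical pre-submission check for a .seq file and its .qal file. */
class SeqQualCheck {

    /**
     * Maps each record name to an item count: bases when scores is false,
     * whitespace-separated quality values when scores is true.
     */
    static Map<String, Integer> countPerRecord(String text, boolean scores) {
        Map<String, Integer> counts = new LinkedHashMap<String, Integer>();
        String name = null;
        for (String raw : text.split("\\r?\\n")) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue;
            }
            if (line.startsWith(">")) {
                name = line.substring(1).trim();
                counts.put(name, 0);
            } else if (name != null) {
                int n = scores ? line.split("\\s+").length
                               : line.replaceAll("\\s+", "").length();
                counts.put(name, counts.get(name) + n);
            }
        }
        return counts;
    }

    /** Returns a list of mismatches between the two files; empty means they agree. */
    static List<String> compare(String seqText, String qalText) {
        Map<String, Integer> seq = countPerRecord(seqText, false);
        Map<String, Integer> qal = countPerRecord(qalText, true);
        List<String> problems = new ArrayList<String>();
        for (Map.Entry<String, Integer> e : seq.entrySet()) {
            Integer scoreCount = qal.get(e.getKey());
            if (scoreCount == null) {
                problems.add("no quality record named " + e.getKey());
            } else if (!scoreCount.equals(e.getValue())) {
                // Assumption: one quality score per base, as in the example files.
                problems.add(e.getKey() + ": " + e.getValue() + " bases but "
                        + scoreCount + " scores");
            }
        }
        for (String name : qal.keySet()) {
            if (!seq.containsKey(name)) {
                problems.add("no sequence record named " + name);
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        String seq = ">sM13rroot0002012\nGTCGTAGC\n";
        String qal = ">sM13rroot0002012\n9 13 16 17 16 12 7 7\n";
        System.out.println(compare(seq, qal).isEmpty() ? "files agree" : "files differ");
    }
}
```

In practice the two strings would be read from the .seq and .qal files that share the same base name.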

4. Sending Chromatogram Files

If you wish ESTAP to run Phred for you, send the chromatograms of the sequences. Put all sequences from the same sequence run of the same project in a folder. The name of the folder follows the file naming convention without the .seq or .qal suffix, e.g., VBOVTA2VT1CS1120Tb0102011615. Each chromatogram file follows the sequence naming convention plus the suffix ".ab1", e.g., sM13rroot0002012.ab1.

IX. How to Use dbEST Submission Tool

The dbEST Submission Tool incorporated in ESTAP is available online to allow users to conveniently submit their cleansed EST sequences in the ESTAP database to NCBI's dbEST database. Note that only PIs have the privilege to submit their EST sequences to dbEST. PIs may follow the "dbEST Submission" link on the "Project List" page to ask ESTAP to prepare the dbEST submission files for them. During dbEST submission, a contact file, a library file, a publication file, and one or more EST files will be created and sent to the primary contact person via email. EST files will be in FASTA format, with clone names and clean sequence ids as identifiers.

After submitting the EST sequences to NCBI's dbEST, the PI should receive an email from NCBI regarding the batch of ESTs submitted. In order to store the dbEST_ID, USER_ID, and GENBANK_ACCN assigned by NCBI in the ESTAP database, the PI should send the confirmation file received from NCBI to the ESTAP curator. The ESTAP curator should use the ReadConfirmationFile program to process the confirmation file.

The ReadConfirmationFile program is located under the java_prog directory of the uncompressed software. It is a stand-alone Java program which reads and parses the dbEST confirmation file and stores the dbEST_ID, USER_ID, and GENBANK_ACCN assigned by NCBI in the ESTAP database. To make this program work, the curator needs to modify a configuration file called "javaConfig.txt" in the java_prog directory and set the correct parameters (see Section IV.1.c). ReadConfirmationFile will use the parameters defined in this file to find the confirmation file and connect to the ESTAP database. The confirmation file should be named with a .txt suffix (e.g., JC_MCA_020314052943.txt) and placed in the directory path specified in the variable "ncbiConfirmFilePath" in the javaConfig.txt file. Go to the directory where ReadConfirmationFile.class is located and run the ReadConfirmationFile program using the following command:

    java ReadConfirmationFile


ltContext path=estap docBase=estap showdebuginfo=true debug=DEBUG reloadable=true crossContext=falsegt ltLogger className=orgapachecatalinaloggerFileLogger prefix=localhost_estap_log suffix=txt timestamp=truegt ltContextgt

d Place the estap Web application under the Tomcat webapps directory Copy the estap directory from the web_side directory of the uncompressed software to the webapps directory

e Modify the webxml file in the estapWEB-INF directory The following section

in the webxml is specific for ESTAP Please modify the text in bold for your own parameter settings

ltservletgt ltservlet-namegtProjectListltservlet-namegt ltservlet-classgtProjectListltservlet-classgt ltinit-paramgt ltparam-namegtjdbcDriverClassNameltparam-namegt ltparam-valuegtoraclejdbcdriverOracleDriverltparam-valuegt ltinit-paramgt ltinit-paramgt ltparam-namegtjdbcURLltparam-namegt ltparam-valuegtjdbcoraclethinservervtedu1521estapdbltparam-valuegt ltinit-paramgt ltinit-paramgt ltparam-namegtdbUserNameltparam-namegt ltparam-valuegtestap_userltparam-valuegt ltinit-paramgt ltinit-paramgt ltparam-namegtdbUserPasswordltparam-namegt ltparam-valuegtpasswordltparam-valuegt ltinit-paramgt ltinit-paramgt ltparam-namegtdbInitConnectltparam-namegt ltparam-valuegt1ltparam-valuegt ltinit-paramgt ltinit-paramgt ltparam-namegtdbMaxConnectltparam-namegt ltparam-valuegt50ltparam-valuegt ltinit-paramgt ltinit-paramgt ltparam-namegtdbESTFilePathltparam-namegt ltparam-valuegthomeestapdbEST_Outgoingltparam-valuegt ltinit-paramgt

ltinit-paramgt ltparam-namegtESTDBFilePathltparam-namegt ltparam-valuegthomeestapestdbltparam-valuegt ltinit-paramgt ltinit-paramgt ltparam-namegtblastallPathltparam-namegt ltparam-valuegthomeestapblastblastallltparam-valuegt ltinit-paramgt ltinit-paramgt ltparam-namegtformatdbPathltparam-namegt ltparam-valuegthomeestapblastformatdbltparam-valuegt ltinit-paramgt ltinit-paramgt ltparam-namegtestapEmailltparam-namegt ltparam-valuegtestapvbivtedultparam-valuegt ltinit-paramgt ltload-on-startupgt1ltload-on-startupgt ltservletgt

The following list defines the parameters jdbcURL the URL for jdbc connection dbUserName the ESTAP database user name dbUserPassword the ESTAP database user password dbInitConnect number of database connections to be made initially dbMaxConnect the maximum number of database connections dbESTFilePath the directory path to store DBEST submission files ESTDBFilePath the directory path to store the EST files in fasta format and their

formatted files used by the ESTAP local blast Web service blastallPath the path for the blastall program formatdbPath the path for the formatdb program estapEmail the ESTAP curatorrsquos email address

f Set heap size for JVM

Some of the ESTAP web programs require more memory than JVM default heap size of 64MB To avoid ldquoout of memoryrdquo error the user should set a larger heap size for JVM To do this the user must set ldquoJAVA_OPTSrdquo variable in catalinashcatalinabat or setclasspathshsetclasspathbat file in the tomcat_homebin directory For example if you wish to set heap size to 256MB you may add a line ldquoset JAVA_OPTS=-mx256mrdquo in the catalinabat file

g. Start Tomcat server

To start the Tomcat server, execute tomcat_home/bin/startup.bat on Windows (or tomcat_home/bin/startup.sh on Unix/Linux). Then enter the URL http://localhost/ in your browser and make sure that you get the Tomcat welcome page, not an error message saying that the page cannot be displayed or that the server cannot be found. If you chose to use port number 8080 instead of 80, you will need to use a URL that includes the port number, such as http://localhost:8080/. If this is successful, enter http://localhost/estap/servlet/Login or http://localhost:8080/estap/servlet/Login to check that the ESTAP login page is displayed.
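The URLs to check in this step depend only on the host name and the Tomcat port. As a minimal sketch (the helper name estap_check_urls is hypothetical, and the paths are assumed to be the welcome page at the server root and the login servlet at /estap/servlet/Login, as described above), the following Python function builds both URLs and leaves out the port when it is the HTTP default of 80:

```python
def estap_check_urls(host="localhost", port=80):
    """Build the two URLs used to verify a Tomcat/ESTAP installation."""
    # Port 80 is the HTTP default, so it is omitted from the URL.
    base = f"http://{host}" if port == 80 else f"http://{host}:{port}"
    return {
        "tomcat_welcome": base + "/",
        "estap_login": base + "/estap/servlet/Login",
    }


if __name__ == "__main__":
    for label, url in estap_check_urls(port=8080).items():
        print(label, url)
```

Open each URL in a browser; the first should show the Tomcat welcome page and the second the ESTAP login page.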

4. Configure Axis

a. Go to http://ws.apache.org/axis/ for download and installation instructions.

b. Place the downloaded axis directory under the tomcat_home/webapps directory.

c. Move the server-config.wsdd file from the tomcat_home/webapps/estap/WEB-INF directory to the tomcat_home/webapps/axis/WEB-INF directory.

d. Add all jar files under the tomcat_home/webapps/axis/WEB-INF/lib and tomcat_home/webapps/axis/WEB-INF/classes directories to your CLASSPATH environment variable.

e. Move the blastService and createESTDBService directories from the tomcat_home/webapps/estap/WEB-INF/classes directory to the tomcat_home/webapps/axis/WEB-INF/classes directory.

f. Replace the server name and Tomcat port with your own server name and port number in the JBlastServiceLocator.java and JCreateESTDBServiceLocator.java files under the blastService/ws and createESTDBService/ws directories, respectively. For example, for blastService, run the commands:

    cd blastService
    cd ws

   Edit the JBlastServiceLocator.java file by changing http://arda.vbi.vt.edu:8080 to http://yourServerName:yourPort. Then compile the modified file by running the command:

    javac JBlastServiceLocator.java

   Repeat this process for the createESTDBService directory.

g. Copy the blastService and createESTDBService directories to the tomcat_home/webapps/estap/WEB-INF/classes directory.

h. Compile EstapWebBlast.java by running the command:

    javac EstapWebBlast.java

i. Restart the Tomcat server and go to http://localhost/axis or http://localhost:8080/axis to view the list of deployed Web services. If Axis is configured correctly, you will see the createESTDB and blast services.
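The edit in step f is a plain search-and-replace on the two locator source files. The following is a minimal sketch in Python, assuming the shipped endpoint reads http://arda.vbi.vt.edu:8080 in the source; the function name and the sample line are hypothetical and only stand in for the real file contents. The edited files must still be recompiled with javac as described in step f.

```python
# Endpoint assumed to be present in the shipped locator sources.
DEFAULT_ENDPOINT = "http://arda.vbi.vt.edu:8080"


def retarget_locator(source_text, server_name, port):
    """Return the locator source with the default endpoint replaced."""
    return source_text.replace(DEFAULT_ENDPOINT, f"http://{server_name}:{port}")


# Hypothetical line standing in for the contents of JBlastServiceLocator.java.
sample = 'String address = "http://arda.vbi.vt.edu:8080/axis/services/blast";'

if __name__ == "__main__":
    print(retarget_locator(sample, "myserver.example.org", 8080))
```

To apply it to a real file, read the file into a string, pass it through retarget_locator, and write the result back before compiling.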

VI. How to Register and Update ESTAP Users

ESTAP defines three types of users with different role permissions: sponsors, PIs, and other users.

Sponsors are those who fund the development of ESTAP. If there is no 'sponsor' according to this definition, then the person who is in charge of the ESTAP system becomes the sponsor. There is one sponsor per institution. Sponsors can register/update PIs and have all PI privileges.

PIs are the people responsible for the projects. PIs can register/update projects, project users, and project protocols. PIs can also register ESTAP codes (e.g., primer, instrument, base caller, etc.), submit ESTs of their projects to dbEST, and view their project data and analysis results.

Other users include primary contacts with update permission, primary contacts with read-only permission, and project viewers with read-only permission. Primary contacts with update privileges (associated with particular project(s)) can update projects and project protocols and view project data and analysis results. Primary contacts with read-only permission and project viewers (associated with particular project(s)) can view project data and analysis results.

At least one sponsor must exist before PIs can register their projects and users. Sponsors are registered by the ESTAP curator/database administrator, who has direct access to the ESTAP tables in the database. Please use the following instructions to register sponsors:

1) In the INSTITUTION table, modify the first record, which has institution_id = 1. Change the institution_name to the institution name of the sponsor. If there is more than one sponsor to be added, add a new record for each remaining sponsor. For each record, increment institution_id by 1 and enter the institution name and description in the institution_name and institution_desc columns.

2) In the LAB table, modify the record with the lab_code of XX. Change all columns of this record to the desired values. The lab_email entry must be the email address of the sponsor. The lab_alt_email entry is the email address of the primary contact in the sponsor's lab. Each sponsor has one lab. You may add more records if there is more than one sponsor.

3) In the PERSON table, modify the person record with the person_id of 1. Change the user_name to the email address of the sponsor and the password to the sponsor's password for logging on to the Web site. The user_type is 1 and the sponsor_id is 1 (referring to the person_id of the sponsor him/herself in the PERSON table). Other columns are self-explanatory by their names. Add more records if there is more than one sponsor, and increment person_id for each additional sponsor.

4) In the CONTACT table, change the record with the contact_id of 1 and person_id of 1 to the contact information of the sponsor. Add more records if there is more than one sponsor, and increment contact_id for each additional sponsor.
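Steps 1) to 3) can be expressed as parameterized UPDATE statements. The sketch below is illustrative only: it uses Python's built-in sqlite3 module with a cut-down stand-in schema instead of the real Oracle database, it covers only the columns named in the steps above, and the function name and sample values are hypothetical. The CONTACT update of step 4) is left out because its columns are not listed in this guide.

```python
import sqlite3


def sponsor_updates(institution_name, sponsor_email, sponsor_password, contact_email):
    """Return (sql, params) pairs that turn the seed records into the first sponsor."""
    return [
        ("UPDATE institution SET institution_name = :name WHERE institution_id = 1",
         {"name": institution_name}),
        ("UPDATE lab SET lab_email = :email, lab_alt_email = :alt WHERE lab_code = 'XX'",
         {"email": sponsor_email, "alt": contact_email}),
        ("UPDATE person SET user_name = :email, password = :pw, "
         "user_type = 1, sponsor_id = 1 WHERE person_id = 1",
         {"email": sponsor_email, "pw": sponsor_password}),
    ]


# Stand-in tables holding the seed records (assumed shape, not the real ESTAP schema).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE institution (institution_id INTEGER, institution_name TEXT, institution_desc TEXT);
CREATE TABLE lab (lab_code TEXT, lab_email TEXT, lab_alt_email TEXT);
CREATE TABLE person (person_id INTEGER, user_name TEXT, password TEXT,
                     user_type INTEGER, sponsor_id INTEGER);
INSERT INTO institution VALUES (1, 'placeholder', NULL);
INSERT INTO lab VALUES ('XX', NULL, NULL);
INSERT INTO person VALUES (1, 'placeholder', 'placeholder', NULL, NULL);
""")

for sql, params in sponsor_updates("Example University", "sponsor@example.org",
                                   "change-me", "contact@example.org"):
    conn.execute(sql, params)
conn.commit()
```

Binding the values as parameters, rather than pasting them into the SQL text, avoids quoting problems with names and email addresses.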

Once sponsors are registered in the ESTAP system, they can log on to the ESTAP Web site to register or update PIs in their institutions. Once PIs are registered, they can log on to the ESTAP Web site to register or update their projects. In addition, PIs can register or update the primary contacts and the project viewers for their projects.

VII. How to Register and Update EST Projects and Protocols

1. Project Registration and Update

ESTAP defines two types of projects: regular and virtual projects. A regular project refers to a specific EST library that a lab constructed. The regular project information includes biological information regarding the tissues and cloning methods used in cDNA library construction, and EST sequencing information. In order to analyze combined libraries from a particular taxonomy, we introduced the virtual project concept. PIs can register a virtual project that combines their regular projects from a particular taxonomy.

To register a regular project or a virtual project, go to the "Project List" page, click the "Register Regular Project" link or the "Register Virtual Project" link, and follow the instructions in the registration form carefully.

PIs and primary contacts with update privilege may also update the project information via the Web. The current version of ESTAP does not allow updating a project once the sequences of that project have been processed and stored in the ESTAP database. If the user needs to correct the project information after the sequences are processed, he or she has to make a request to the ESTAP curator. The ESTAP curator should delete the already loaded sequences for that project, correct the project information, and then reload the sequences by running the estapDriver.exe program.

2. Protocol Registration and Update

ESTAP offers the following protocols for data analysis:

1) Cleansing protocols

The cleansing protocols contain procedures to cleanse raw sequences by removing low-quality end sequences, vectors, and polyA (or polyT), and by screening for chimeras and contamination. There are two types of cleansing protocols: the cleansing protocol for BLAST and the cleansing protocol for assembly. The sequences cleansed by the cleansing protocol for BLAST are used for BLAST searching, and the sequences cleansed by the cleansing protocol for assembly are used for the cluster/assembly analysis. The user may use the same procedures and parameters for both protocols. There are several key points for the cleansing to be successful:

a. The cleansing program uses the 5' and 3' vector fragments provided in the project registration to match the vector portions in the clones. So the 5' and 3' vector fragments (both the orientation of the fragments and the sequences) must be correct.

b. The vector and insert adaptor sequences provided in the project registration should not be the whole adaptor sequences. Only the regions that are present in the clones should be included.

c. The sequencing primer code in the sequence names of the raw sequence data must be correct. The correct primer code indicates the correct orientation of the sequencing (5' or 3'). A wrong orientation will cause vector/adaptor removal to fail. Please check the primer table carefully to make sure the orientation of the primer is correct.

d. When the cleansing procedure does not seem to give the expected result, please check whether the above key points have been followed.

2) BLAST protocol

The BLAST protocol provides procedures and parameters for BLAST (BLASTN, BLASTX, and/or TBLASTX) searches. Users may choose to run all, some, or none of the BLAST procedures in the protocol.

3) Cluster/assembly protocol

The cluster/assembly protocol provides procedures and parameters for clustering and assembly analysis. Users register this protocol only if they wish to do cluster/assembly; projects are not clustered automatically without a cluster protocol. To let ESTAP run the cluster/assembly programs for a project, go to the View Project page of that project and follow the link called Cluster Protocol to register the protocol and set the procedure parameters. The parameters that ESTAP provides are from the d2_cluster and CAP3 programs. The Register Cluster/Assembly Protocol page has a link, "Cluster/Assembly Protocol Guide", which provides paper references for the d2_cluster and CAP3 programs. Once the protocol is registered, ESTAP will automatically perform cluster/assembly for your project according to the protocol you registered.

If you wish to perform cluster/assembly for combined libraries (not for each project), you need to register a virtual project and select the taxonomy of the libraries. Then register the BLAST and cluster/assembly protocols for this virtual project. Once the cluster protocol is registered, ESTAP will automatically perform cluster/assembly for the virtual project.

The cluster/assembly protocol also provides an option to perform InterProScan analysis. Users may choose this option to let ESTAP automatically annotate the protein functions of singlets and contigs after the clustering/assembly procedure is performed.

PIs and primary contacts with update privilege may also update the protocols. The updated protocols take effect for new sequence data, not for data that have already been analyzed. If the user wishes to reanalyze the old data, he or she must make such a request to the ESTAP curator. The curator should use the following instructions:

1) If the sequences of a project must be re-cleansed using the new protocol, the curator must delete the sequences of the project from the ESTAP database and reload them by running the estapDriver.exe program. Use the following command to delete sequences from the database:

    delete from raw_seq_file where project_id = x;
    commit;

   project_id is the internal id for the project. The curator may get the id from the PROJECT table by matching the lab code and project code of the project:

    select project_id from project where lab_code = 'xx' and user_proj_code = 'xxx';

   Please see Section IV.4.e for estapDriver.exe usage.

2) If the sequences of a project must be re-blasted using the new BLAST protocol, the curator must delete the old BLAST results of that project from the ESTAP database and run the runBlast.exe program. Use the following command to delete the BLAST results from the database:

    delete from analysis_tracking where project_id = x;
    commit;

   Please see Section IV.4.f for runBlast.exe usage.

3) If the sequences of a project must be re-clustered using the new clustering/assembly protocol, the curator must delete the old clustering/assembly and contig BLAST results of that project from the ESTAP database. Then run the cluster.exe program for clustering/assembly and the blastAllContig.exe program for contig BLAST. Use the following command to delete these results from the database:

    delete from cluster_summary where project_id = x;
    commit;

   Use the "cluster.exe [project_id]" command to redo the clustering/assembly analysis and "blastAllContigs.exe" to redo the contig BLAST analysis. Note: [project_id] is the internal id for that project. Please see Section IV.4.h and Section IV.4.i for how to run the cluster.exe and blastAllContig.exe programs.
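The three re-analysis cases above each pair one delete statement with the programs to rerun. As a hedged sketch, the Python lookup below returns that pairing for a given internal project id. The table and program names are taken from the steps above; the function name and the short keys ("cleanse", "blast", "cluster") are hypothetical, and the sketch only builds the text of the commands without running anything.

```python
# Table to clear and programs to rerun for each kind of re-analysis.
# Note: this guide spells the contig BLAST program both blastAllContig.exe
# and blastAllContigs.exe; check the name in your installation.
REANALYSIS_STEPS = {
    "cleanse": ("raw_seq_file", ["estapDriver.exe"]),
    "blast": ("analysis_tracking", ["runBlast.exe"]),
    "cluster": ("cluster_summary", ["cluster.exe {project_id}", "blastAllContig.exe"]),
}


def reanalysis_plan(kind, project_id):
    """Return (sql_statements, programs) for re-analyzing one project."""
    table, programs = REANALYSIS_STEPS[kind]
    project_id = int(project_id)  # accept only a plain integer id
    sql = [f"delete from {table} where project_id = {project_id}", "commit"]
    return sql, [p.format(project_id=project_id) for p in programs]


if __name__ == "__main__":
    statements, programs = reanalysis_plan("cluster", 42)
    print(statements)
    print(programs)
```

Keeping the mapping in one table makes it harder to delete from the wrong table for a given kind of re-analysis.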

VIII. How to Prepare Raw EST Data

Users can send ESTAP either the chromatograms of the sequences (.ab1 files) or both sequence files and their corresponding quality score files in FASTA format. If the user sends the chromatograms, ESTAP will use the Phred base calling program to generate the sequences and the quality scores. Each sequence file contains sequences from the same sequencing run of the same project. Each quality score file contains the quality scores of the sequences in the corresponding sequence file. Each sequence or quality score string in the file must start with a ">" followed by a sequence name.

To process data files, ESTAP requires that data providers use a standard file name for each file and a standard sequence name for each sequence in the file. The file name contains specific information about that file, and the sequence name contains specific information about the sequence. All the sequences in the same file must be from the same project and the same sequencing run. The naming convention and rules are as follows:

• Both the file name and the sequence name are required to be of a fixed length.
• Each element in the name must conform to a required number of characters or digits.
• Pad with 0's (zeros) in front of any element that is shorter than the required length.

The file names and sequence names are case-sensitive.

1. File naming convention

All sequence files have ".seq" as a suffix and quality score files have ".qal" as a suffix. Every seq string must have an exactly matching qal string by name. The file names use the following naming convention:

LLPPPTTIIIBRRNNNCCYYMMDDHHMM.seq for the sequence file and
LLPPPTTIIIBRRNNNCCYYMMDDHHMM.qal for the corresponding quality score file

LL – two-character alphanumeric lab code (provided by VBI), e.g., UN
PPP – three-character alphanumeric project id (provided by VBI), e.g., MCA
TT – two-character alphanumeric instrument type (provided by VBI), e.g., A1
III – three-character alphanumeric instrument id (user-defined code that is unique to the lab), e.g., VT1
B – one-character alphanumeric base calling program id (provided by VBI), e.g., A
RR – two-character alphanumeric run parameter id (provided by VBI), e.g., S1
NNN – three-digit run duration in minutes (actual running time in minutes, e.g., 090 for 90 minutes)
CC – two-character alphanumeric sequencing chemistry id (provided by VBI), e.g., Pr
YYMMDDHHMM – the date and time at which the sequence was run, e.g., 0104150930
    YY – two-digit year
    MM – two-digit month
    DD – two-digit day
    HH – two-digit hour
    MM – two-digit minute
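Because every element has a fixed width, a file name can be split purely by position. The following Python sketch illustrates this, assuming the suffix is separated from the 28-character stem by a dot (.seq or .qal) and that the hour is on a 24-hour clock; the function and field names are hypothetical and not part of ESTAP.

```python
from datetime import datetime

# (field name, width) in the order given by LLPPPTTIIIBRRNNNCCYYMMDDHHMM.
FILE_NAME_FIELDS = [
    ("lab_code", 2), ("project_code", 3), ("instrument_type", 2),
    ("instrument_id", 3), ("base_caller", 1), ("run_parameter", 2),
    ("run_minutes", 3), ("chemistry", 2), ("run_timestamp", 10),
]
STEM_LENGTH = sum(width for _, width in FILE_NAME_FIELDS)  # 28 characters


def parse_file_name(file_name):
    """Split an ESTAP data file name into its fixed-width elements."""
    stem, dot, suffix = file_name.rpartition(".")
    if not dot or suffix not in ("seq", "qal"):
        raise ValueError(f"expected a .seq or .qal suffix: {file_name!r}")
    if len(stem) != STEM_LENGTH:
        raise ValueError(f"expected {STEM_LENGTH} characters before the suffix: {file_name!r}")
    fields, pos = {"suffix": suffix}, 0
    for name, width in FILE_NAME_FIELDS:
        fields[name] = stem[pos:pos + width]
        pos += width
    # Both conversions raise ValueError on malformed input.
    fields["run_minutes"] = int(fields["run_minutes"])
    fields["run_timestamp"] = datetime.strptime(fields["run_timestamp"], "%y%m%d%H%M")
    return fields


if __name__ == "__main__":
    print(parse_file_name("VBOVTA2VT1CS1120Tb0102011615.seq"))
```

Running a check like this before sending data catches names with a missing zero pad or a misplaced element, since either one changes the length or shifts the date field.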

2. Sequence Naming Convention

The sequence names in the sequence and quality score files must be in the following format:

TPPPNNNNNNNNNLLL

T – one-letter primer type: u for a user-defined primer and s for a standard primer
PPP – three-character alphanumeric primer id (provided or confirmed by VBI), e.g., T3a
NNNNNNNNN – nine-character alphanumeric clone name, unique within the project (user defined)
LLL – three-digit lane or capillary number, e.g., 012 for lane 12 in the sequence gel

Note: Using the correct primer code (first 4 letters) is very important for the cleansing procedure. The correct primer code indicates the correct orientation of the sequencing (5' or 3'). A wrong orientation will cause vector/adaptor removal to fail.
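A sequence name can be checked the same way, by position. The sketch below is a minimal Python illustration of the TPPPNNNNNNNNNLLL layout; the function name and the returned keys are hypothetical. It validates only the format and cannot tell whether the primer code matches the real sequencing orientation.

```python
def parse_sequence_name(name):
    """Split a 16-character ESTAP sequence name into its elements."""
    if len(name) != 16:
        raise ValueError(f"expected 16 characters, got {len(name)}: {name!r}")
    primer_type = name[0]
    if primer_type not in ("s", "u"):
        raise ValueError(f"primer type must be 's' or 'u': {name!r}")
    lane = name[13:16]
    if not (lane.isascii() and lane.isdigit()):
        raise ValueError(f"lane must be three digits: {name!r}")
    return {
        "primer_type": "standard" if primer_type == "s" else "user-defined",
        "primer_id": name[1:4],
        "clone_name": name[4:13],
        "lane": int(lane),
    }


if __name__ == "__main__":
    print(parse_sequence_name("sM13rroot0002012"))
```

Names are case-sensitive, so an upper-case "S" or "U" as the first letter is rejected rather than silently accepted.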

3. Example of a Sequence and a Quality Score File

a. Sequence file

File name: VBOVTA2VT1CS1120Tb0102011615.seq

The following is the content of the file:

>sM13rroot0002012
GTCGTAGCTAGCTAGCTGTGGACTGACTGA
CTAGCATCAGCTTGACTGACTAGCATAGCT
AGCATACGCTATCGATCAGGGCAGACTAGC
TGACTAGCATGACTAGCATT
>sM13rroot0003013
GTCGTAGCTTAGCTAGTCAGATCAGCATCG
AATACGATCAGCTAGCTGTGGACTGACTGA
CTAGCATCAGCTTGACTGACTAGCATAGCT
AGCATACGCTTACGTACACAGACACACA

b. Quality score file

File name: VBOVTA2VT1CS1120Tb0102011615.qal

The following is the content of the file:

>sM13rroot0002012
9 13 16 17 16 12 7 7 19 19 25 27 29 28 30 31 29 30 29 29 30 30 37 37 33 37 40 44 40 35
35 40 40 40 46 40 35 35 35 35 35 35 35 40 40 37 35 40 40 40 40 40 40 40 46 46 46 46 46 47
51 51 42 42 45 45 45 45 45 45 51 46 56 56 56 56 46 51 51 51 45 40 40 40 40 40 37 45 46 42
42 44 44 44 44 56 56 37 40 40 40 39 37 37 37 37 39 40 40 43
>sM13rroot0003013
9 13 13 19 19 9 6 6 16 19 24 24 29 29 29 29 30 37 46 34 37 40 46 40 37 37 37 37 37 40
40 40 40 40 40 35 35 40 40 37 35 35 35 35 35 35 39 40 46 46 46 46 46 56 46 42 42 46 43 51
51 51 51 42 42 51 42 42 51 51 51 51 56 51 51 51 51 51 51 56 56 56 56 56 46 42 42 42 42 42
42 42 42 43 43 35 35 35 35 35 35 56 51 51 51 51 51 51 56 56 56 56 40 40 40 40 40 40

Explanation of the file name VBOVTA2VT1CS1120Tb0102011615.seq: The lab code is VB. The project code is OVT (a library of Orobanche minor v. Virginia tubercles). The instrument is a 377, so the code is A2. The id for this instrument is VT1. The base-calling program is ABI100, which has a code of C. The run parameter code is S1. The run time is 2 hours, so the run duration is 120. Dye terminator chemistry with big dyes was used, so the code is Tb. The samples were run on Feb. 1, 2001 at 4:15 pm, so the date/time stamp is 0102011615. You only have to enter this once. The data itself carries a shorter name.

Explanation of the sequence name sM13rroot0002012: The primer is the standard primer M13f, so the name starts with an "s" followed by the three-letter primer code "M13". The clone name is rroot0002. The sequence was run on lane 12, which has the code "012".
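Before sending a pair of files, it can help to confirm that every name in the sequence file has a matching name in the quality score file. The Python sketch below does this on the file contents; the function names are hypothetical. It also compares the number of bases with the number of scores per record, which is an assumption based on the usual Phred convention of one score per base (the example above is consistent with it, with 110 bases and 110 scores in the first record).

```python
def read_records(text):
    """Map each '>' record name to the whitespace-separated tokens that follow it."""
    records, current = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = line[1:].strip()
            records[current] = []
        elif current is not None:
            records[current].extend(line.split())
    return records


def check_pair(seq_text, qal_text):
    """Return a list of problems found between a sequence file and its quality file."""
    seqs, quals = read_records(seq_text), read_records(qal_text)
    problems = []
    # Names that appear in only one of the two files.
    for name in sorted(set(seqs) ^ set(quals)):
        problems.append(f"{name}: present in only one file")
    # Assumed rule: one quality score per base.
    for name in sorted(set(seqs) & set(quals)):
        n_bases = sum(len(token) for token in seqs[name])
        n_scores = len(quals[name])
        if n_bases != n_scores:
            problems.append(f"{name}: {n_bases} bases but {n_scores} quality scores")
    return problems


if __name__ == "__main__":
    print(check_pair(">sM13rroot0002012\nGTCG\n", ">sM13rroot0002012\n9 13 16 17\n"))
```

An empty list means the two files agree on names and lengths; any other result lists the records to fix.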

4. Sending Chromatogram Files

If you wish ESTAP to run Phred for you, send the chromatograms of the sequences. Put all sequences from the same sequencing run of the same project in one folder. The name of the folder follows the file naming convention without the .seq or .qal suffix, e.g., VBOVTA2VT1CS1120Tb0102011615. Each chromatogram file name follows the sequence naming convention plus the suffix ".ab1", e.g., sM13rroot0002012.ab1.

IX. How to Use dbEST Submission Tool

The dbEST Submission Tool incorporated in ESTAP is available online to allow users to conveniently submit their cleansed EST sequences in the ESTAP database to NCBI's dbEST database. Note that only PIs have the privilege to submit their EST sequences to dbEST. PIs may follow the "dbEST Submission" link on the "Project List" page to ask ESTAP to prepare the dbEST submission files for them. During dbEST submission, a contact file, a library file, a publication file, and one or more EST files will be created and sent to the primary contact person by email. EST files will be in FASTA format with clone names and clean sequence ids as identifiers.

After submitting the EST sequences to NCBI's dbEST, the PI should receive an email from NCBI regarding the submitted batch of ESTs. In order to store the dbEST_ID, USER_ID, and GENBANK_ACCN assigned by NCBI in the ESTAP database, the PI should send the confirmation file received from NCBI to the ESTAP curator. The ESTAP curator should use the ReadConfirmationFile program to process the confirmation file.

The ReadConfirmationFile program is located under the java_prog directory of the uncompressed software. It is a stand-alone Java program that reads and parses the dbEST confirmation file and stores the dbEST_ID, USER_ID, and GENBANK_ACCN assigned by NCBI in the ESTAP database. To make this program work, the curator needs to modify a configuration file called "javaConfig.txt" in the java_prog directory and set the correct parameters (see Section IV.1.c). ReadConfirmationFile uses the parameters defined in this file to find the confirmation file and connect to the ESTAP database. The confirmation file should be named with a .txt suffix (e.g., JC_MCA_020314052943.txt) and placed in the directory path specified by the variable "ncbiConfirmFilePath" in the javaConfig.txt file. Go to the directory where ReadConfirmationFile.class is located and run the program using the following command:

    java ReadConfirmationFile



must delete the sequences of the project from the ESTAP database and reload the sequences again by running the estapDriverexe program Use the following command to delete sequences from the database

delete from raw_seq_file where project_id = x commit

project_id is the internal id for the project The curator may get the id from the PROJECT table by matching the lab code and project code of the project select project_id from project where lab_code = lsquoxxrsquo and user_proj_code =rsquoxxxrsquo Please see Section IV4e for estapDriverexe usage

2) If the sequences of a project must be re-blasted using the new BLAST protocol the curator must delete the old blast results of that project from the ESTAP database and run the runBlastexe program Use the following command to delete the blast results from the database

delete from analysis_tracking where project_id = x commit

Please see Section IV4f for runBlastexe usage 3) If the sequences of a project must be re-clustered using the new clusteringassembly

protocol the curator must delete the old clusteringassembly and the contig blast results of that project from the ESTAP database Then run the clusterexe program for clusteringassembly and run the blastAllContigexe program for contig blast Use the following command to delete the blast results from the database

delete from cluster_summary where project_id = x commit

Use ldquoclusterexe [project_id]rdquo command to redo clusteringassembly analysis and ldquoblastAllContigsexerdquo to redo contig blast analysis Note [project_id] is the internal id for that project Please see Section IV4h and Section IV4i for how to run clusterexe and blastAllContigexe programs

VIII How to Prepare Raw EST Data User can send ESTAP either the chromatograms of the sequences (ab1 files) or both sequence files and their corresponding quality score files in the FASTA format If the user sends ESTAP the chromatograms ESTAP will use the Phred base calling program to generate sequences and the quality scores Each sequence file contains sequences from the same sequencing run of the same project Each quality score file contains quality scores of the sequences of the corresponding sequencing file Each sequence or quality score string in the file must start with a ldquogtrdquo followed by a sequence name To process data files ESTAP requires that data providers use a standard file name for each file and a standard sequence name for each sequence in the file The file name contains specific information about that file and the sequence name contains specific information about the sequence All the sequences in the same file must be from the same project and the same sequencing run The naming convention and rules are described as follows bull Both file name and sequence name are required to be of a fixed length bull Each element in the name must conform to a required number of characters or digits bull Pad with 0rsquos (zeros) in front of any element that is shorter than the required length

The file names and sequence names are case-sensitive

1 File naming convention All sequence files will have ldquoseqrdquo as a suffix and quality score files will have ldquoqalrdquo as a suffix Every seq string must have an exactly matching qal string by name The file names use the following naming convention LLPPPTTIIIBRRNNNCCYYMMDDHHMMseq for sequence file and LLPPPTTIIIBRRNNNCCYYMMDDHHMMqal for the corresponding quality score file

LL ndash two-alphanumeric letter lab code (provided by VBI) eg UN PPP ndash three-alphanumeric letter project id (provided by VBI) eg MCA TT ndash two-alphanumeric letter instrument type (provided by VBI) eg A1 III ndash three-alphanumeric letter instrument id (user defined code that is unique to the

lab) eg VT1 B ndash one-alphanumeric letter base calling program id (provided by VBI) eg A RR ndash two-alphanumeric letter run parameter id (provided by VBI) eg S1 NNN ndash three-digit run duration in minutes (actual running time in minutes eg 090

for 90 minutes) CC ndash two-alphanumeric letter sequencing chemistry id (provided by VBI) eg Pr YYMMDDHHMM - the date and time at which the sequence was run eg

0104150930 YY ndash two-digit year MM ndash two-digit month DD ndash two-digit day HH ndash two-digit hour MM ndash two-digit minute

2 Sequence Naming Convention

The sequence names in the sequence and quality score files must be in the following format TPPPNNNNNNNNNLLL

T ndash one letter primer type u for user defined primer and s for standard primer PPP ndash three-alphanumeric letter primer id (provided or confirmed by VBI) eg T3a NNNNNNNNN ndash nine-alphanumeric letter unique clone name within the project

(user defined) LLL ndash three-digit lane or capillary number eg 012 for lane 12 in the sequence gel

Note Using the correct primer code (first 4 letters) is very important for cleaning procedure The correct primer code indicates the correct orientation of the sequencing (5rsquo or 3rsquo) Wrong orientation will cause removing vectoradaptor to fail

3 Example of a Sequence and a Quality Score File

a Sequence file File name VBOVTA2VT1CS1120Tb0102011615seq The following is the content of the file gtsM13rroot0002012 GTCGTAGCTAGCTAGCTGTGGACTGACTGA CTAGCATCAGCTTGACTGACTAGCATAGCT AGCATACGCTATCGATCAGGGCAGACTAGC TGACTAGCATGACTAGCATT gtsM13rroot0003013 GTCGTAGCTTAGCTAGTCAGATCAGCATCG AATACGATCAGCTAGCTGTGGACTGACTGA CTAGCATCAGCTTGACTGACTAGCATAGCT AGCATACGCTTACGTACACAGACACACA

b Quality score file File name VBOVTA2VT1CS1120Tb0102011615qal The following is the content of the file gtsM13rroot0002012 9 13 16 17 16 12 7 7 19 19 25 27 29 28 30 31 29 30 29 29 30 30 37 37 33 37 40 44 40 35 35 40 40 40 46 40 35 35 35 35 35 35 35 40 40 37 35 40 40 40 40 40 40 40 46 46 46 46 46 47 51 51 42 42 45 45 45 45 45 45 51 46 56 56 56 56 46 51 51 51 45 40 40 40 40 40 37 45 46 42 42 44 44 44 44 56 56 37 40 40 40 39 37 37 37 37 39 40 40 43 gtsM13rroot0003013 9 13 13 19 19 9 6 6 16 19 24 24 29 29 29 29 30 37 46 34 37 40 46 40 37 37 37 37 37 40 40 40 40 40 40 35 35 40 40 37 35 35 35 35 35 35 39 40 46 46 46 46 46 56 46 42 42 46 43 51 51 51 51 42 42 51 42 42 51 51 51 51 56 51 51 51 51 51 51 56 56 56 56 56 46 42 42 42 42 42 42 42 42 43 43 35 35 35 35 35 35 56 51 51 51 51 51 51 56 56 56 56 40 40 40 40 40 40

Explanation of file name VBOVTA2VT1CS1120Tb0102011615seq The lab code is VB The project code is OVT (a library of Orbanche minor v Virginia tubercles) The instrument is a 377 so the code is A2 The id for this instrument is VT1 The base-calling program is ABI100 which has a code of C The run parameter code is S1 Run time is 2 hours I have used dye terminator chemistry with big dyes so the code is Tb I ran these samples on Feb 1 2001 at 415 pm so the datetime stamp will be 0102011615 You only have to enter this once The data itself carries a shorter name Explanation of sequence name sM13rroot0002012 The primer is standard primer M13f So the name starts with a ldquosrdquo followed by the three letter primer code ldquoM13rdquo The clone name is rroot0002 The sequence was run on lane 12 which has a code ldquo012rdquo

4 Sending Chromatogram Files If you wish ESTAP to run Phred for you send chromatograms of the sequences Put all sequences from the same sequence run of the same project in a folder The name of the folder follows the file naming convention without seq or qal suffixes eg VBOVTA2VT1CS1120Tb0102011615 Each chromatogram file follows the sequence naming convention plus the suffix ldquoab1rdquo eg sM13rroot0002012ab1

IX How to Use dbEST Submission Tool The dbEST Submission Tool incorporated in the ESTAP is available online to allow users to conveniently submit their cleansed EST sequences in the ESTAP database to NCBIrsquos dbEST database Note that only PIs have the privilege to submit their EST sequences to dbEST PIs may follow the ldquodbEST Submissionrdquo link on the ldquoProject Listrdquo page to ask ESTAP to prepare the dbEST submission files for them During dbEST submission a contact file a library file a publication file and one or more EST files will be created and sent to the primary contact person via emails EST files will be in fasta format with clone name and clean sequence ids as identifiers After submitting the EST

sequences to NCBIrsquos dbEST the PI should receive an email from NCBI regarding this batch of ESTs submitted In order to store the dbEST_ID USER_ID GENBANK_ACCN assigned by NCBI into the ESTAP database the PI should send this confirmation file received from NCBI to the ESTAP curator The ESTAP curator should use the ReadConfirmationFile program to process the confirmation file The ReadConfirmationFile program is located under the java_prog directory of the uncompressed software It is a stand-alone java program which reads and parses the dbEST confirmation file and stores the dbEST_ID USER_ID GENBANK_ACCN assigned by NCBI into the ESTAP database To make this program work the curator need to modify a configuration file called ldquojavaConfigtxtrdquo in the java_prog directory and set correct parameters (See Section IV1c) The ReadConfirmationFile will use the parameters defined in this file to find the confirmation file and make connection to the ESTAP database The confirmation file should be named using txt suffix (eg JC_MCA_020314052943txt) and placed in the directory path specified in the variable ldquoncbiConfirmFilePathrdquo in the javaConfigtxt file Go to the directory where the ReadConfirmationFileclass is located and run ReadConfirmationFile program using the following command java ReadConfirmationFile


will need to use a URL such as http://localhost:8080 that includes the port number. If this is successful, enter http://localhost/estap/servlet/Login or http://localhost:8080/estap/servlet/Login to check that the ESTAP login page is displayed.

4. Configure Axis

a. Go to http://ws.apache.org/axis for download and installation instructions.

b. Place the downloaded axis directory under the tomcat_home/webapps directory.

c. Move the server-config.wsdd file from the tomcat_home/webapps/estap/WEB-INF directory to the tomcat_home/webapps/axis/WEB-INF directory.

d. Add all jar files under the tomcat_home/webapps/axis/WEB-INF/lib and tomcat_home/webapps/axis/WEB-INF/classes directories to your CLASSPATH environment variable.

e. Move the blastService and createESTDBService directories from the tomcat_home/webapps/estap/WEB-INF/classes directory to the tomcat_home/webapps/axis/WEB-INF/classes directory.

f. Replace the server name and Tomcat port with your own server name and port number in the JBlastServiceLocator.java and JCreateESTDBServiceLocator.java files under the blastService/ws and createESTDBService/ws directories, respectively. For example, for blastService, run the commands:

       cd blastService
       cd ws

   Edit the JBlastServiceLocator.java file by changing http://arda.vbi.vt.edu:8080 to http://yourServerName:yourPort. Then compile the modified file by running the command:

       javac JBlastServiceLocator.java

   Repeat the above process for the createESTDBService directory.

g. Copy the blastService and createESTDBService directories to the tomcat_home/webapps/estap/WEB-INF/classes directory.

h. Compile EstapWebBlast.java by running the command:

       javac EstapWebBlast.java

i. Restart the Tomcat server and go to http://localhost/axis or http://localhost:8080/axis to view the list of deployed web services. If Axis is configured correctly, you will see the createESTDB and blast services.

VI. How to Register and Update ESTAP Users

ESTAP defines three types of users with different role permissions: sponsors, PIs, and other users.

Sponsors are those who fund the development of ESTAP. If there is no "sponsor" according to this definition, the person in charge of the ESTAP system becomes the sponsor. There is one sponsor per institution. Sponsors can register/update PIs and have all PI privileges.

PIs are the people responsible for the projects. PIs can register/update projects, project users, and project protocols. PIs can also register ESTAP codes (e.g., primer, instrument, base caller), submit ESTs of their projects to dbEST, and view their project data and analysis results.

Other users include primary contacts with update permission, primary contacts with read-only permission, and project viewers with read-only permission. Primary contacts with update permission (associated with particular projects) can update projects and project protocols and view project data and analysis results. Primary contacts with read-only permission and project viewers (associated with particular projects) can view project data and analysis results.

At least one sponsor must exist before PIs can register their projects and users. Sponsors are registered by the ESTAP curator/database administrator, who has direct access to the ESTAP tables in the database. Please use the following instructions to register sponsors:

1) In the INSTITUTION table, modify the first record (institution_id = 1). Change the institution_name to the institution name of the sponsor. If there is more than one sponsor to add, add a new record for each remaining sponsor. For each record, increment institution_id by 1 and enter the institution name and description in the institution_name and institution_desc columns.

2) In the LAB table, modify the record with the lab_code of XX. Change all columns of this record to the desired values. The lab_email entry must be the email address of the sponsor. The lab_alt_email entry is the email address of the primary contact in the sponsor's lab. Each sponsor has one lab. You may add more records if there is more than one sponsor.

3) In the PERSON table, modify the record with the person_id of 1. Change the user_name to the email address of the sponsor and the password to the sponsor's password for logging on to the Web site. The user_type is 1 and the sponsor_id is 1 (referring to the person_id of the sponsor himself/herself in the PERSON table). The other columns are self-explanatory by their names. Add more records if there is more than one sponsor, and increment person_id for each additional sponsor.

4) In the CONTACT table, change the record with the contact_id of 1 and person_id of 1 to the contact information of the sponsor. Add more records if there is more than one sponsor, and increment contact_id for each additional sponsor.

Once sponsors are registered in the ESTAP system, they can log on to the ESTAP Web site to register or update PIs in their institutions. Once PIs are registered, they can log on to the ESTAP Web site to register or update their projects. In addition, PIs can register or update the primary contacts and the project viewers for their projects.
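The role permissions described in this section can be summarized as a small lookup table. The following Python sketch is only an illustration: the role keys, action names, and the is_allowed function are hypothetical labels chosen for this example and are not part of the ESTAP code or database schema. It also ignores the fact that primary contacts and project viewers are tied to particular projects.

```python
# Hypothetical labels for the actions named in the text above.
PI_ACTIONS = frozenset({
    "register_project", "update_project",
    "register_project_user", "update_project_user",
    "register_protocol", "update_protocol",
    "register_code",      # e.g. primer, instrument, base caller codes
    "submit_dbest",
    "view_results",
})

PERMISSIONS = {
    # Sponsors have all PI privileges and can also register/update PIs.
    "sponsor": PI_ACTIONS | {"register_pi", "update_pi"},
    "pi": PI_ACTIONS,
    "primary_contact_update": frozenset(
        {"update_project", "update_protocol", "view_results"}
    ),
    "primary_contact_readonly": frozenset({"view_results"}),
    "project_viewer": frozenset({"view_results"}),
}


def is_allowed(role, action):
    """Return True if the given role may perform the given action."""
    return action in PERMISSIONS.get(role, frozenset())
```

For example, is_allowed("primary_contact_update", "update_project") is True, while the same role is not allowed to perform "register_project".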

VII. How to Register and Update EST Projects and Protocols

1. Project Registration and Update

ESTAP defines two types of projects: regular projects and virtual projects. A regular project refers to a specific EST library that a lab constructed. The regular project information includes biological information regarding the tissues and cloning methods used in cDNA library construction, and EST sequencing information. In order to analyze combined libraries from a particular taxonomy, we introduced the virtual project concept: PIs may register a virtual project that combines their regular projects from a particular taxonomy.

To register a regular project or a virtual project, please go to the "Project List" page, click the "Register Regular Project" link or the "Register Virtual Project" link, and follow the instructions in the registration form carefully.

PIs and the primary contacts with update privilege may also update the project information via the Web. The current version of ESTAP does not allow updating a project once the sequences of that project have been processed and stored in the ESTAP database. If the user needs to make corrections to the project information after the sequences are processed, he or she has to make a request to the ESTAP curator. The ESTAP curator should delete the already loaded sequences for that project, make the corrections to the project information, and then reload the sequences by running the estapDriver.exe program.

2. Protocol Registration and Update

ESTAP offers the following protocols for data analysis:

1) Cleansing protocols

The cleansing protocols contain procedures to cleanse raw sequences by removing low-quality end sequences, vectors, and polyA (or polyT), and by screening for chimeras and contamination. There are two types of cleansing protocols: the cleansing protocol for BLAST and the cleansing protocol for assembly. Sequences cleansed by the cleansing protocol for BLAST are used for BLAST searching, and sequences cleansed by the cleansing protocol for assembly are used for the cluster/assembly analysis. The user may use the same procedures and parameters for both protocols. There are several key points for the cleansing to be successful:

a. The cleansing program uses the 5' and 3' vector fragments provided in the project registration to match the vector portions in the clones, so the 5' and 3' vector fragments (both the orientation of the fragments and their sequences) must be correct.

b. The vector and insert adaptor sequences provided in the project registration should not be the whole adaptor sequences. Only the regions that are present in the clones should be included.

c. The sequencing primer code in the sequence names of the raw sequence data must be correct. The correct primer code indicates the correct orientation of the sequencing (5' or 3'). A wrong orientation will cause vector/adaptor removal to fail. Please check the primer table carefully to make sure the orientation of the primer is correct.

d. When the cleansing procedure does not give the expected result, please check whether the above key points have been followed.

2) BLAST protocol

The BLAST protocol provides procedures and parameters for BLAST (BLASTN, BLASTX, and/or TBLASTX) searches. Users may choose to run all, some, or none of the BLAST procedures in the protocol.

3) Cluster/assembly protocol

The cluster/assembly protocol provides procedures and parameters for clustering and assembly analysis. Users register this protocol only if they wish to perform cluster/assembly; projects are not clustered automatically without a cluster protocol. Go to the View Project page of the project you wish to cluster/assemble and follow the link called Cluster Protocol to register the protocol and set the procedure parameters. The parameters that ESTAP provides are from the d2_cluster and CAP3 programs. The Register Cluster/Assembly Protocol page has a link, "Cluster/Assembly Protocol Guide", which provides paper references for the d2_cluster and CAP3 programs. Once the protocol is registered, ESTAP will automatically perform cluster/assembly for your project according to the protocol you registered.

If you wish to perform cluster/assembly for combined libraries (not for each project), you need to register a virtual project and select the taxonomy of the libraries. Then register the BLAST and cluster/assembly protocols for this virtual project. Once the cluster protocol is registered, ESTAP will automatically perform cluster/assembly for the virtual project.

The cluster/assembly protocol also provides an option to perform InterProScan analysis. Users may choose this option to let ESTAP automatically annotate the protein functions of singlets and contigs after the clustering/assembly procedure is performed.

PIs and the primary contacts with update privilege may also update the protocols. The updated protocols take effect for new sequence data only, not for data that have already been analyzed. If the user wishes to reanalyze the old data, he or she must make such a request to the ESTAP curator. The curator should use the following instructions:

1) If the sequences of a project must be re-cleansed using the new protocol, the curator must delete the sequences of the project from the ESTAP database and reload the sequences by running the estapDriver.exe program. Use the following command to delete the sequences from the database:

       delete from raw_seq_file where project_id = x;
       commit;

   project_id is the internal id for the project. The curator may get the id from the PROJECT table by matching the lab code and project code of the project:

       select project_id from project where lab_code = 'xx' and user_proj_code = 'xxx';

   Please see Section IV.4.e for estapDriver.exe usage.

2) If the sequences of a project must be re-blasted using the new BLAST protocol, the curator must delete the old BLAST results of that project from the ESTAP database and run the runBlast.exe program. Use the following command to delete the BLAST results from the database:

       delete from analysis_tracking where project_id = x;
       commit;

   Please see Section IV.4.f for runBlast.exe usage.

3) If the sequences of a project must be re-clustered using the new clustering/assembly protocol, the curator must delete the old clustering/assembly results and the contig BLAST results of that project from the ESTAP database, then run the cluster.exe program for clustering/assembly and the blastAllContig.exe program for contig BLAST. Use the following command to delete the old results from the database:

       delete from cluster_summary where project_id = x;
       commit;

   Use the "cluster.exe [project_id]" command to redo the clustering/assembly analysis and "blastAllContigs.exe" to redo the contig BLAST analysis. Note: [project_id] is the internal id for that project. Please see Section IV.4.h and Section IV.4.i for how to run the cluster.exe and blastAllContig.exe programs.
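The lookup-then-delete step in instruction 1) can be sketched as follows. This is a minimal illustration, not the curator's actual procedure: it uses Python's built-in sqlite3 module as a stand-in for Oracle so that the example is runnable, the two tables contain only the columns named in the text, and the file_id column, the sample rows, and the function names are hypothetical. With a real Oracle connection, the driver and its placeholder syntax would differ.

```python
import sqlite3


def delete_project_sequences(conn, lab_code, user_proj_code):
    """Look up the internal project_id, then delete that project's raw sequence rows."""
    cur = conn.cursor()
    cur.execute(
        "select project_id from project where lab_code = ? and user_proj_code = ?",
        (lab_code, user_proj_code),
    )
    row = cur.fetchone()
    if row is None:
        raise LookupError(f"no project for lab {lab_code!r}, code {user_proj_code!r}")
    project_id = row[0]
    cur.execute("delete from raw_seq_file where project_id = ?", (project_id,))
    deleted = cur.rowcount  # number of rows removed by the delete
    conn.commit()
    return project_id, deleted


def demo():
    """Build a tiny in-memory stand-in schema and run the delete once."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        create table project (project_id integer primary key,
                              lab_code text, user_proj_code text);
        -- file_id is a hypothetical key column used only in this sketch
        create table raw_seq_file (file_id integer primary key, project_id integer);
        insert into project values (7, 'VB', 'OVT'), (8, 'UN', 'MCA');
        insert into raw_seq_file values (1, 7), (2, 7), (3, 8);
        """
    )
    project_id, deleted = delete_project_sequences(conn, "VB", "OVT")
    remaining = conn.execute("select count(*) from raw_seq_file").fetchone()[0]
    conn.close()
    return project_id, deleted, remaining
```

Resolving the project_id first and passing it as a bound parameter avoids typing the wrong id into the delete statement by hand.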

VIII. How to Prepare Raw EST Data

Users can send ESTAP either the chromatograms of the sequences (.ab1 files) or both sequence files and their corresponding quality score files in FASTA format. If the user sends chromatograms, ESTAP will use the Phred base-calling program to generate the sequences and the quality scores.

Each sequence file contains sequences from the same sequencing run of the same project. Each quality score file contains the quality scores of the sequences in the corresponding sequence file. Each sequence or quality score string in the file must start with a ">" followed by a sequence name.

To process data files, ESTAP requires that data providers use a standard file name for each file and a standard sequence name for each sequence in the file. The file name contains specific information about the file, and the sequence name contains specific information about the sequence. The naming rules are as follows:

• Both the file name and the sequence name must be of a fixed length.
• Each element in the name must conform to the required number of characters or digits.
• Pad with 0's (zeros) in front of any element that is shorter than the required length.

The file names and sequence names are case-sensitive.

1. File Naming Convention

All sequence files have ".seq" as a suffix, and quality score files have ".qal" as a suffix. Every seq string must have an exactly matching qal string by name. The file names use the following naming convention:

    LLPPPTTIIIBRRNNNCCYYMMDDHHMM.seq   for the sequence file
    LLPPPTTIIIBRRNNNCCYYMMDDHHMM.qal   for the corresponding quality score file

LL - two-character alphanumeric lab code (provided by VBI), e.g., UN
PPP - three-character alphanumeric project id (provided by VBI), e.g., MCA
TT - two-character alphanumeric instrument type (provided by VBI), e.g., A1
III - three-character alphanumeric instrument id (user-defined code that is unique to the lab), e.g., VT1
B - one-character alphanumeric base-calling program id (provided by VBI), e.g., A
RR - two-character alphanumeric run parameter id (provided by VBI), e.g., S1
NNN - three-digit run duration in minutes (actual running time, e.g., 090 for 90 minutes)
CC - two-character alphanumeric sequencing chemistry id (provided by VBI), e.g., Pr
YYMMDDHHMM - the date and time at which the sequence was run, e.g., 0104150930
    YY - two-digit year
    MM - two-digit month
    DD - two-digit day
    HH - two-digit hour
    MM - two-digit minute
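The fixed-width layout above lends itself to a simple pattern match. The following Python sketch is an illustration only: the function name, the field keys, and the regular expression are this example's own and are not part of ESTAP. It assumes the suffix is separated from the name by a dot (.seq or .qal) and also accepts a name without a suffix.

```python
import re
from datetime import datetime

# Field widths follow the LLPPPTTIIIBRRNNNCCYYMMDDHHMM layout described above.
FILE_NAME_RE = re.compile(
    r"^(?P<lab>[A-Za-z0-9]{2})"
    r"(?P<project>[A-Za-z0-9]{3})"
    r"(?P<instrument_type>[A-Za-z0-9]{2})"
    r"(?P<instrument_id>[A-Za-z0-9]{3})"
    r"(?P<base_caller>[A-Za-z0-9])"
    r"(?P<run_param>[A-Za-z0-9]{2})"
    r"(?P<run_minutes>[0-9]{3})"
    r"(?P<chemistry>[A-Za-z0-9]{2})"
    r"(?P<timestamp>[0-9]{10})"
    r"(?:\.(?P<suffix>seq|qal))?$"
)


def parse_file_name(name):
    """Split an ESTAP-style file name into its fields, or raise ValueError."""
    match = FILE_NAME_RE.match(name)
    if match is None:
        raise ValueError(f"not a valid file name: {name!r}")
    fields = match.groupdict()
    fields["run_minutes"] = int(fields["run_minutes"])
    # YYMMDDHHMM; strptime also rejects impossible dates such as month 13.
    fields["run_time"] = datetime.strptime(fields.pop("timestamp"), "%y%m%d%H%M")
    return fields
```

For example, parse_file_name("VBOVTA2VT1CS1120Tb0102011615.seq") yields lab "VB", project "OVT", a run duration of 120 minutes, and a run time of 1 February 2001, 16:15.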

2. Sequence Naming Convention

The sequence names in the sequence and quality score files must be in the following format:

    TPPPNNNNNNNNNLLL

T - one-letter primer type: u for a user-defined primer and s for a standard primer
PPP - three-character alphanumeric primer id (provided or confirmed by VBI), e.g., T3a
NNNNNNNNN - nine-character alphanumeric clone name, unique within the project (user defined)
LLL - three-digit lane or capillary number, e.g., 012 for lane 12 in the sequence gel

Note: Using the correct primer code (the first 4 letters) is very important for the cleansing procedure. The correct primer code indicates the correct orientation of the sequencing (5' or 3'). A wrong orientation will cause vector/adaptor removal to fail.
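A sequence name can be checked against the TPPPNNNNNNNNNLLL layout in the same way. The sketch below is illustrative only; the function name and field keys are hypothetical. It checks the shape of the name but cannot tell whether the primer code is the right one for the clone, which still has to be verified against the primer table.

```python
import re

# T (1) + PPP (3) + NNNNNNNNN (9) + LLL (3) = 16 characters.
SEQ_NAME_RE = re.compile(
    r"^(?P<primer_type>[us])"
    r"(?P<primer_id>[A-Za-z0-9]{3})"
    r"(?P<clone>[A-Za-z0-9]{9})"
    r"(?P<lane>[0-9]{3})$"
)

PRIMER_TYPES = {"u": "user-defined", "s": "standard"}


def parse_sequence_name(name):
    """Split an ESTAP-style sequence name into its fields, or raise ValueError."""
    match = SEQ_NAME_RE.match(name)
    if match is None:
        raise ValueError(f"not a valid sequence name: {name!r}")
    fields = match.groupdict()
    fields["primer_code"] = name[:4]  # primer type plus primer id
    fields["primer_kind"] = PRIMER_TYPES[fields["primer_type"]]
    fields["lane"] = int(fields["lane"])
    return fields
```

For example, parse_sequence_name("sM13rroot0002012") gives primer code "sM13", clone name "rroot0002", and lane 12.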

3. Example of a Sequence and a Quality Score File

a. Sequence file

File name: VBOVTA2VT1CS1120Tb0102011615.seq

The following is the content of the file:

>sM13rroot0002012
GTCGTAGCTAGCTAGCTGTGGACTGACTGA
CTAGCATCAGCTTGACTGACTAGCATAGCT
AGCATACGCTATCGATCAGGGCAGACTAGC
TGACTAGCATGACTAGCATT
>sM13rroot0003013
GTCGTAGCTTAGCTAGTCAGATCAGCATCG
AATACGATCAGCTAGCTGTGGACTGACTGA
CTAGCATCAGCTTGACTGACTAGCATAGCT
AGCATACGCTTACGTACACAGACACACA

b. Quality score file

File name: VBOVTA2VT1CS1120Tb0102011615.qal

The following is the content of the file:

>sM13rroot0002012
9 13 16 17 16 12 7 7 19 19 25 27 29 28 30 31 29 30 29 29 30 30 37 37 33 37 40 44 40 35
35 40 40 40 46 40 35 35 35 35 35 35 35 40 40 37 35 40 40 40 40 40 40 40 46 46 46 46 46 47
51 51 42 42 45 45 45 45 45 45 51 46 56 56 56 56 46 51 51 51 45 40 40 40 40 40 37 45 46 42
42 44 44 44 44 56 56 37 40 40 40 39 37 37 37 37 39 40 40 43
>sM13rroot0003013
9 13 13 19 19 9 6 6 16 19 24 24 29 29 29 29 30 37 46 34 37 40 46 40 37 37 37 37 37 40
40 40 40 40 40 35 35 40 40 37 35 35 35 35 35 35 39 40 46 46 46 46 46 56 46 42 42 46 43 51
51 51 51 42 42 51 42 42 51 51 51 51 56 51 51 51 51 51 51 56 56 56 56 56 46 42 42 42 42 42
42 42 42 43 43 35 35 35 35 35 35 56 51 51 51 51 51 51 56 56 56 56 40 40 40 40 40 40

Explanation of the file name VBOVTA2VT1CS1120Tb0102011615.seq: The lab code is VB. The project code is OVT (a library of Orobanche minor v. Virginia tubercles). The instrument is a 377, so the code is A2. The id for this instrument is VT1. The base-calling program is ABI100, which has a code of C. The run parameter code is S1. The run time is 2 hours, so the run duration is 120. I have used dye terminator chemistry with big dyes, so the code is Tb. I ran these samples on Feb. 1, 2001 at 4:15 pm, so the date/time stamp will be 0102011615. You only have to enter this once. The data itself carries a shorter name.

Explanation of the sequence name sM13rroot0002012: The primer is standard primer M13f, so the name starts with an "s" followed by the three-letter primer code "M13". The clone name is rroot0002. The sequence was run on lane 12, which has the code "012".
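Before sending a .seq/.qal pair, it can be useful to confirm that every sequence has a quality record with the same name. The following Python sketch shows one way to do this; the function names are hypothetical and the check is not part of ESTAP. The additional test that each sequence has exactly one quality value per base is an assumption based on how the example files above are laid out (110 bases with 110 scores, 118 bases with 118 scores), not a rule stated in this guide.

```python
def read_records(text):
    """Split FASTA-style text into {name: [data lines]}, keyed by the '>' header."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            if name in records:
                raise ValueError(f"duplicate record name: {name}")
            records[name] = []
        elif name is None:
            raise ValueError("data found before the first '>' header")
        else:
            records[name].append(line)
    return records


def check_seq_qal(seq_text, qal_text):
    """Return a list of problems found when pairing sequence and quality records."""
    seqs = {n: "".join(lines).replace(" ", "")
            for n, lines in read_records(seq_text).items()}
    quals = {n: [int(q) for line in lines for q in line.split()]
             for n, lines in read_records(qal_text).items()}
    problems = []
    for name in sorted(set(seqs) | set(quals)):
        if name not in quals:
            problems.append(f"{name}: no quality record")
        elif name not in seqs:
            problems.append(f"{name}: no sequence record")
        elif len(seqs[name]) != len(quals[name]):
            problems.append(
                f"{name}: {len(seqs[name])} bases but {len(quals[name])} quality values"
            )
    return problems
```

An empty list means the two files pair up under these checks; each entry otherwise names the sequence that needs attention.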

4. Sending Chromatogram Files

If you wish ESTAP to run Phred for you, send the chromatograms of the sequences. Put all sequences from the same sequencing run of the same project in one folder. The name of the folder follows the file naming convention without the .seq or .qal suffix, e.g., VBOVTA2VT1CS1120Tb0102011615. Each chromatogram file follows the sequence naming convention plus the suffix ".ab1", e.g., sM13rroot0002012.ab1.
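The folder layout for chromatograms can be checked with a short script before the data are sent. This Python sketch is only an illustration under the naming rules given in this section; the function name and the wording of the messages are hypothetical. It takes the folder name and the list of file names as plain strings, so it does not touch the file system.

```python
import re

# Folder: LLPPPTTIIIBRR (13 alphanumerics) + NNN + CC + YYMMDDHHMM, no suffix.
FOLDER_RE = re.compile(r"^[A-Za-z0-9]{13}[0-9]{3}[A-Za-z0-9]{2}[0-9]{10}$")
# Trace file: TPPPNNNNNNNNNLLL followed by the .ab1 suffix.
TRACE_RE = re.compile(r"^[us][A-Za-z0-9]{12}[0-9]{3}\.ab1$")


def check_chromatogram_folder(folder_name, file_names):
    """Return a list of naming problems for one chromatogram folder."""
    problems = []
    if FOLDER_RE.match(folder_name) is None:
        problems.append(f"folder name does not follow the file naming convention: {folder_name}")
    if not file_names:
        problems.append("folder contains no chromatogram files")
    for file_name in file_names:
        if TRACE_RE.match(file_name) is None:
            problems.append(f"file name does not follow the sequence naming convention: {file_name}")
    return problems
```

For example, check_chromatogram_folder("VBOVTA2VT1CS1120Tb0102011615", ["sM13rroot0002012.ab1"]) returns an empty list.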

IX. How to Use the dbEST Submission Tool

The dbEST Submission Tool incorporated in ESTAP is available online to allow users to conveniently submit their cleansed EST sequences in the ESTAP database to NCBI's dbEST database. Note that only PIs have the privilege to submit their EST sequences to dbEST. PIs may follow the "dbEST Submission" link on the "Project List" page to ask ESTAP to prepare the dbEST submission files for them. During dbEST submission, a contact file, a library file, a publication file, and one or more EST files will be created and sent to the primary contact person via email. The EST files will be in FASTA format, with clone names and clean sequence ids as identifiers.

After submitting the EST sequences to NCBI's dbEST, the PI should receive an email from NCBI regarding the submitted batch of ESTs. In order to store the dbEST_ID, USER_ID, and GENBANK_ACCN assigned by NCBI in the ESTAP database, the PI should send the confirmation file received from NCBI to the ESTAP curator. The ESTAP curator should use the ReadConfirmationFile program to process the confirmation file.

The ReadConfirmationFile program is located under the java_prog directory of the uncompressed software. It is a stand-alone Java program that reads and parses the dbEST confirmation file and stores the dbEST_ID, USER_ID, and GENBANK_ACCN assigned by NCBI in the ESTAP database. To make this program work, the curator needs to modify a configuration file called "javaConfig.txt" in the java_prog directory and set the correct parameters (see Section IV.1.c). ReadConfirmationFile uses the parameters defined in this file to find the confirmation file and connect to the ESTAP database. The confirmation file should be named with a .txt suffix (e.g., JC_MCA_020314052943.txt) and placed in the directory specified by the variable "ncbiConfirmFilePath" in the javaConfig.txt file. Go to the directory where ReadConfirmationFile.class is located and run the program using the following command:

    java ReadConfirmationFile


VI. How to Register and Update ESTAP Users

ESTAP defines three types of users with different role permissions: sponsors, PIs, and other users.

Sponsors are those who fund the development of ESTAP. If there is no 'sponsor' according to this definition, then the person who is in charge of the ESTAP system becomes the sponsor. There is one sponsor per institution. Sponsors can register/update PIs and have all PI privileges.

PIs are the people responsible for the projects. PIs can register/update projects, project users, and project protocols. PIs can also register ESTAP codes (e.g. primer, instrument, base caller, etc.), submit ESTs of their projects to dbEST, and view their project data and analysis results.

Other users include primary contacts with update permission, primary contacts with read-only permission, and project viewers with read-only permission. Primary contacts with update privileges (associated with particular project(s)) can update projects and project protocols and view project data and analysis results. Primary contacts with read-only permission and project viewers (associated with particular project(s)) can view project data and analysis results.

At least one sponsor must exist before PIs can register their projects and users. Sponsors are registered by the ESTAP curator/database administrator, who has direct access to the ESTAP tables in the database. Please use the following instructions to register sponsors:

1) In the INSTITUTION table, modify the first record, the one with institution_id = 1. Change the institution_name to the institution name of the sponsor. If there is more than one sponsor to be added, add a new record for each remaining sponsor. For each record, increment institution_id by 1 and enter the institution name and description in the institution_name and institution_desc columns.

2) In the LAB table, modify the record with lab_code of XX. Change all columns of this record to the desired values. The lab_email entry must be the email address of the sponsor. The lab_alt_email entry is the email address of the primary contact in the sponsor's lab. Each sponsor has one lab. You may add more records if there is more than one sponsor.

3) In the PERSON table, modify the person record with the person_id of 1. Change the user_name to the email address of the sponsor and the password to the sponsor's password for logging on to the Web site. The user_type is 1 and the sponsor_id is 1 (referring to the person_id of the sponsor him/herself in the PERSON table). Other columns are self-explanatory by their names. Add more records if there is more than one sponsor, and increment person_id for each additional sponsor.

4) In the CONTACT table, change the record with the contact_id of 1 and person_id of 1 to the contact information of the sponsor. Add more records if there is more than one sponsor, and increment contact_id for each additional sponsor.

Once sponsors are registered in the ESTAP system, they can log on to the ESTAP Web site to register or update PIs in their institutions. Once PIs are registered, they can log on to the ESTAP Web site to register or update their projects. In addition, PIs can register or update the primary contacts and the project viewers for their projects.

VII. How to Register and Update EST Projects and Protocols

1. Project Registration and Update

ESTAP defines two types of projects: regular and virtual projects. A regular project refers to a specific EST library that a lab constructed. The regular project information includes biological information regarding the tissues and cloning methods used in cDNA library construction, as well as EST sequencing information. In order to analyze combined libraries from a particular taxonomy, we introduced the virtual project concept: PIs can register a virtual project that combines their regular projects from a particular taxonomy.

To register a regular project or a virtual project, please go to the "Project List" page, click the "Register Regular Project" link or the "Register Virtual Project" link, and follow the instructions in the registration form carefully. PIs and the primary contacts with update privilege may also update the project information via the Web.

The current version of ESTAP does not allow a project to be updated once the sequences of that project have been processed and stored in the ESTAP database. If the user needs to make corrections to the project information after the sequences are processed, he or she has to make a request to the ESTAP curator. The ESTAP curator should delete the already loaded sequences for that project, make the corrections to the project information, and then reload the sequences by running the estapDriver.exe program.

2. Protocol Registration and Update

ESTAP offers the following protocols for data analysis:

1) Cleansing protocols

The cleansing protocols contain procedures to cleanse raw sequences by removing low-quality end sequences, vectors, and polyA (or polyT), and by screening for chimeras and contamination. There are two types of cleansing protocols: the cleansing protocol for BLAST and the cleansing protocol for assembly. The sequences cleansed by the cleansing protocol for BLAST are used for BLAST searching, and the sequences cleansed by the cleansing protocol for assembly are used for the cluster/assembly analysis. The user may utilize the same procedures and parameters for both protocols. There are several key points for the cleansing to be successful:

a. The cleansing program uses the 5' and 3' vector fragments provided in the project registration to match the vector portions in the clones. So the 5' and 3' vector fragments (both the orientation of the fragments and their sequences) must be correct.

b. The vector and insert adaptor sequences provided in the project registration should not be the whole adaptor sequences. Only the regions that are present in the clones should be included.

c. The sequencing primer code in the sequence names of the raw sequence data must be correct. The correct primer code indicates the correct orientation of the sequencing (5' or 3'). A wrong orientation will cause the vector/adaptor removal to fail. Please check the primer table carefully to make sure the orientation of the primer is correct.

d. When the cleansing procedure does not seem to give the expected result, please check whether the above key points have been followed.

2) BLAST protocol

The BLAST protocol provides procedures and parameters for BLAST (BLASTN, BLASTX, and/or TBLASTX) searches. Users may choose to run all, some, or none of the BLAST procedures in the protocol.

3) Cluster/assembly protocol

The cluster/assembly protocol provides procedures and parameters for clustering and assembly analysis. Users register this protocol only if they wish to do cluster/assembly. Projects are not clustered automatically without a cluster protocol: to let ESTAP run the cluster/assembly programs, users need to register the cluster protocol for each project they wish to cluster/assemble.

Please go to the View Project page of the project you wish to cluster/assemble, and follow the link called Cluster Protocol to register the protocol and set the procedure parameters. The parameters that ESTAP provides are from the d2_cluster and CAP3 programs. The Register Cluster/Assembly Protocol page has a link, "Cluster/Assembly Protocol Guide", which provides paper references for the d2_cluster and CAP3 programs. Once the protocol is registered, ESTAP will automatically perform cluster/assembly for your project according to the protocol you registered.

If you wish to perform cluster/assembly for combined libraries (not for each project), you need to register a virtual project and select the taxonomy of the libraries. Then register the BLAST and cluster/assembly protocols for this virtual project. Once the cluster protocol is registered, ESTAP will automatically perform cluster/assembly for the virtual project.

The cluster/assembly protocol also provides an option to perform InterProScan analysis. Users may choose this option to let ESTAP automatically annotate the protein functions of singlets and contigs after the clustering/assembly procedure is performed.

PIs and the primary contacts with update privilege may also update the protocols. The updated protocols take effect for new sequence data, not for data that have already been analyzed. If the user wishes to reanalyze the old data, he or she must make such a request to the ESTAP curator. The curator should use the following instructions:

1) If the sequences of a project must be re-cleansed using the new protocol, the curator must delete the sequences of the project from the ESTAP database and reload them by running the estapDriver.exe program. Use the following command to delete the sequences from the database:

delete from raw_seq_file where project_id = x;
commit;

project_id is the internal id for the project. The curator may get the id from the PROJECT table by matching the lab code and project code of the project:

select project_id from project where lab_code = 'xx' and user_proj_code = 'xxx';

Please see Section IV.4.e for estapDriver.exe usage.

2) If the sequences of a project must be re-blasted using the new BLAST protocol, the curator must delete the old BLAST results of that project from the ESTAP database and run the runBlast.exe program. Use the following command to delete the BLAST results from the database:

delete from analysis_tracking where project_id = x;
commit;

Please see Section IV.4.f for runBlast.exe usage.

3) If the sequences of a project must be re-clustered using the new clustering/assembly protocol, the curator must delete the old clustering/assembly results and the contig BLAST results of that project from the ESTAP database. Then run the cluster.exe program for clustering/assembly and the blastAllContig.exe program for contig BLAST. Use the following command to delete these results from the database:

delete from cluster_summary where project_id = x;
commit;

Use the "cluster.exe [project_id]" command to redo the clustering/assembly analysis and "blastAllContigs.exe" to redo the contig BLAST analysis. Note that [project_id] is the internal id for that project. Please see Section IV.4.h and Section IV.4.i for how to run the cluster.exe and blastAllContig.exe programs.
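The three reanalysis cases above can be summarized in a small lookup. The following Python sketch is only an illustration and is not part of ESTAP: the function name reanalysis_plan and the keys "cleanse", "blast", and "cluster" are hypothetical, while the table and program names are transcribed from the instructions above. It prints nothing and touches no database; it only returns the statements and commands a curator would then run by hand.

```python
# Table to purge and programs to rerun for each kind of reanalysis,
# transcribed from the curator instructions in this section.
REANALYSIS_STEPS = {
    "cleanse": ("raw_seq_file", ["estapDriver.exe"]),
    "blast": ("analysis_tracking", ["runBlast.exe"]),
    # The guide spells the contig BLAST program both blastAllContig.exe and
    # blastAllContigs.exe; the spelling of the quoted command is used here.
    "cluster": ("cluster_summary",
                ["cluster.exe {project_id}", "blastAllContigs.exe"]),
}


def reanalysis_plan(kind, project_id):
    """Return (sql_statements, commands) for one reanalysis request."""
    if kind not in REANALYSIS_STEPS:
        raise ValueError(f"unknown reanalysis kind: {kind!r}")
    # Accept only integers so the id can be placed in the SQL text safely.
    if not isinstance(project_id, int):
        raise TypeError("project_id must be the internal integer id")
    table, programs = REANALYSIS_STEPS[kind]
    sql = [f"delete from {table} where project_id = {project_id};", "commit;"]
    commands = [program.format(project_id=project_id) for program in programs]
    return sql, commands
```

For example, reanalysis_plan("blast", 7) returns the delete and commit statements for analysis_tracking together with the runBlast.exe command. Arguments for estapDriver.exe and runBlast.exe are deliberately left out, since their usage is described in Section IV.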

VIII. How to Prepare Raw EST Data

Users can send ESTAP either the chromatograms of the sequences (.ab1 files) or both the sequence files and their corresponding quality score files in FASTA format. If the user sends the chromatograms, ESTAP will use the Phred base-calling program to generate the sequences and the quality scores.

Each sequence file contains sequences from the same sequencing run of the same project. Each quality score file contains the quality scores of the sequences in the corresponding sequence file. Each sequence or quality score string in a file must start with a ">" followed by a sequence name.

To process data files, ESTAP requires that data providers use a standard file name for each file and a standard sequence name for each sequence in the file. The file name contains specific information about the file, and the sequence name contains specific information about the sequence. The naming convention and rules are as follows:

• Both the file name and the sequence name are required to be of a fixed length.
• Each element in the name must conform to a required number of characters or digits.
• Pad with 0's (zeros) in front of any element that is shorter than the required length.

The file names and sequence names are case-sensitive.

1. File Naming Convention

All sequence files have ".seq" as a suffix and all quality score files have ".qal" as a suffix. Every sequence entry in a .seq file must have an entry with exactly the same name in the corresponding .qal file. The file names use the following naming convention:

LLPPPTTIIIBRRNNNCCYYMMDDHHMM.seq for the sequence file, and
LLPPPTTIIIBRRNNNCCYYMMDDHHMM.qal for the corresponding quality score file.

LL - two-character alphanumeric lab code (provided by VBI), e.g. UN
PPP - three-character alphanumeric project id (provided by VBI), e.g. MCA
TT - two-character alphanumeric instrument type (provided by VBI), e.g. A1
III - three-character alphanumeric instrument id (user-defined code that is unique to the lab), e.g. VT1
B - one-character alphanumeric base-calling program id (provided by VBI), e.g. A
RR - two-character alphanumeric run parameter id (provided by VBI), e.g. S1
NNN - three-digit run duration (actual running time in minutes), e.g. 090 for 90 minutes
CC - two-character alphanumeric sequencing chemistry id (provided by VBI), e.g. Pr
YYMMDDHHMM - the date and time at which the sequence was run, e.g. 0104150930
    YY - two-digit year
    MM - two-digit month
    DD - two-digit day
    HH - two-digit hour
    MM - two-digit minute
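As an illustration of the fixed-width layout, the following Python sketch splits a file name into its elements. It is not part of ESTAP. The function name parse_file_name and the dictionary keys are hypothetical, and the sketch assumes that the suffix is separated from the 28-character name by a dot and that HH is a 24-hour value.

```python
from datetime import datetime

# (element, width) pairs in the order given by the naming convention.
FILE_NAME_FIELDS = [
    ("lab_code", 2),
    ("project_id", 3),
    ("instrument_type", 2),
    ("instrument_id", 3),
    ("base_caller_id", 1),
    ("run_parameter_id", 2),
    ("run_minutes", 3),
    ("chemistry_id", 2),
    ("run_timestamp", 10),
]
STEM_LENGTH = sum(width for _, width in FILE_NAME_FIELDS)  # 28 characters


def parse_file_name(file_name):
    """Split an ESTAP-style data file name into a dict of its elements."""
    stem, dot, suffix = file_name.rpartition(".")
    if not dot or suffix not in ("seq", "qal"):
        raise ValueError(f"expected a .seq or .qal suffix: {file_name!r}")
    if len(stem) != STEM_LENGTH or not (stem.isascii() and stem.isalnum()):
        raise ValueError(f"expected {STEM_LENGTH} alphanumeric characters: {stem!r}")
    fields = {}
    position = 0
    for name, width in FILE_NAME_FIELDS:
        fields[name] = stem[position:position + width]
        position += width
    if not fields["run_minutes"].isdigit():
        raise ValueError(f"run duration must be three digits: {fields['run_minutes']!r}")
    fields["run_minutes"] = int(fields["run_minutes"])
    # strptime raises ValueError if the stamp is not a valid YYMMDDHHMM date.
    fields["run_timestamp"] = datetime.strptime(fields["run_timestamp"], "%y%m%d%H%M")
    fields["suffix"] = suffix
    return fields
```

Names are compared as written, without case folding, because file names are case-sensitive.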

2. Sequence Naming Convention

The sequence names in the sequence and quality score files must be in the following format:

TPPPNNNNNNNNNLLL

T - one-letter primer type: u for a user-defined primer and s for a standard primer
PPP - three-character alphanumeric primer id (provided or confirmed by VBI), e.g. T3a
NNNNNNNNN - nine-character alphanumeric clone name, unique within the project (user defined)
LLL - three-digit lane or capillary number, e.g. 012 for lane 12 in the sequence gel

Note: Using the correct primer code (the first 4 letters) is very important for the cleansing procedure. The correct primer code indicates the correct orientation of the sequencing (5' or 3'). A wrong orientation will cause the vector/adaptor removal to fail.
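The sequence name layout, together with the zero-padding rule, can be sketched in Python as follows. This is an illustration only, not ESTAP code; the function names parse_sequence_name and build_sequence_name and the dictionary keys are hypothetical.

```python
SEQUENCE_NAME_LENGTH = 16  # T (1) + PPP (3) + NNNNNNNNN (9) + LLL (3)


def parse_sequence_name(name):
    """Split a TPPPNNNNNNNNNLLL sequence name into its four elements."""
    if len(name) != SEQUENCE_NAME_LENGTH or not (name.isascii() and name.isalnum()):
        raise ValueError(f"expected 16 alphanumeric characters: {name!r}")
    if name[0] not in ("u", "s"):
        raise ValueError(f"primer type must be 'u' or 's': {name[0]!r}")
    lane = name[13:16]
    if not lane.isdigit():
        raise ValueError(f"lane must be three digits: {lane!r}")
    return {
        "primer_type": name[0],
        "primer_id": name[1:4],
        "clone_name": name[4:13],
        "lane": int(lane),
    }


def build_sequence_name(primer_type, primer_id, clone_name, lane):
    """Assemble a sequence name, padding short elements with leading zeros."""
    name = (
        primer_type
        + primer_id.zfill(3)
        + clone_name.zfill(9)
        + str(lane).zfill(3)
    )
    parse_sequence_name(name)  # raises ValueError if any element is too long
    return name
```

Building a name and parsing it again is a quick way to check a naming scheme before a sequencing run is exported.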

3. Example of a Sequence and a Quality Score File

a. Sequence file

File name: VBOVTA2VT1CS1120Tb0102011615.seq

The following is the content of the file:

>sM13rroot0002012
GTCGTAGCTAGCTAGCTGTGGACTGACTGA
CTAGCATCAGCTTGACTGACTAGCATAGCT
AGCATACGCTATCGATCAGGGCAGACTAGC
TGACTAGCATGACTAGCATT
>sM13rroot0003013
GTCGTAGCTTAGCTAGTCAGATCAGCATCG
AATACGATCAGCTAGCTGTGGACTGACTGA
CTAGCATCAGCTTGACTGACTAGCATAGCT
AGCATACGCTTACGTACACAGACACACA

b. Quality score file

File name: VBOVTA2VT1CS1120Tb0102011615.qal

The following is the content of the file:

>sM13rroot0002012
9 13 16 17 16 12 7 7 19 19 25 27 29 28 30 31 29 30 29 29 30 30 37 37 33 37 40 44 40 35
35 40 40 40 46 40 35 35 35 35 35 35 35 40 40 37 35 40 40 40 40 40 40 40 46 46 46 46 46 47
51 51 42 42 45 45 45 45 45 45 51 46 56 56 56 56 46 51 51 51 45 40 40 40 40 40 37 45 46 42
42 44 44 44 44 56 56 37 40 40 40 39 37 37 37 37 39 40 40 43
>sM13rroot0003013
9 13 13 19 19 9 6 6 16 19 24 24 29 29 29 29 30 37 46 34 37 40 46 40 37 37 37 37 37 40
40 40 40 40 40 35 35 40 40 37 35 35 35 35 35 35 39 40 46 46 46 46 46 56 46 42 42 46 43 51
51 51 51 42 42 51 42 42 51 51 51 51 56 51 51 51 51 51 51 56 56 56 56 56 46 42 42 42 42 42
42 42 42 43 43 35 35 35 35 35 35 56 51 51 51 51 51 51 56 56 56 56 40 40 40 40 40 40

Explanation of file name: VBOVTA2VT1CS1120Tb0102011615.seq

The lab code is VB. The project code is OVT (a library of Orobanche minor v. Virginia tubercles). The instrument is a 377, so the code is A2. The id for this instrument is VT1. The base-calling program is ABI100, which has a code of C. The run parameter code is S1. Run time is 2 hours, so the run duration is 120. I have used dye terminator chemistry with big dyes, so the code is Tb. I ran these samples on Feb. 1, 2001 at 4:15 pm, so the date/time stamp will be 0102011615. You only have to enter this once. The data itself carries a shorter name.

Explanation of sequence name: sM13rroot0002012

The primer is standard primer M13f, so the name starts with an "s" followed by the three-letter primer code "M13". The clone name is rroot0002. The sequence was run on lane 12, which has a code "012".
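Before sending data, a provider may want to confirm that a sequence file and its quality score file agree. The following Python sketch is one possible check and is not part of ESTAP; the function names read_records and check_pair are hypothetical. The name-matching test follows the rule stated in the file naming convention. The second test, one quality score per base, is an assumption inferred from the example above, where each entry has as many scores as bases.

```python
def read_records(text):
    """Return {sequence name: list of data lines} for FASTA-style text."""
    records = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = line[1:]
            if current in records:
                raise ValueError(f"duplicate sequence name: {current}")
            records[current] = []
        elif current is None:
            raise ValueError("data found before the first '>' header")
        else:
            records[current].append(line)
    return records


def check_pair(seq_text, qal_text):
    """Return a list of problems found between .seq and .qal contents."""
    # Join data lines and drop all whitespace to get the base string.
    seqs = {name: "".join(" ".join(lines).split())
            for name, lines in read_records(seq_text).items()}
    # Split on whitespace to get one integer score per token.
    quals = {name: [int(token) for token in " ".join(lines).split()]
             for name, lines in read_records(qal_text).items()}
    problems = []
    for name in sorted(set(seqs) ^ set(quals)):
        missing_from = ".qal" if name in seqs else ".seq"
        problems.append(f"{name}: missing from {missing_from} file")
    for name in sorted(set(seqs) & set(quals)):
        if len(seqs[name]) != len(quals[name]):
            problems.append(
                f"{name}: {len(seqs[name])} bases but {len(quals[name])} scores")
    return problems
```

An empty list means that every entry in one file has a same-named entry in the other and that the lengths agree.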

4. Sending Chromatogram Files

If you wish ESTAP to run Phred for you, send the chromatograms of the sequences. Put all chromatograms from the same sequencing run of the same project in one folder. The name of the folder follows the file naming convention without the .seq or .qal suffix, e.g. VBOVTA2VT1CS1120Tb0102011615. Each chromatogram file name follows the sequence naming convention plus the suffix ".ab1", e.g. sM13rroot0002012.ab1.
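A run folder can be checked against both conventions before it is sent. The following Python sketch is an illustration, not ESTAP code; the function name check_run_folder and the two pattern names are hypothetical, and the sketch assumes the chromatogram suffix is written with a dot (.ab1). It checks names only and does not open any file.

```python
import re

# Folder: 13 alphanumeric characters (LL PPP TT III B RR), a three-digit run
# duration, a two-character chemistry id, and a ten-digit date/time stamp.
FOLDER_PATTERN = re.compile(r"[A-Za-z0-9]{13}[0-9]{3}[A-Za-z0-9]{2}[0-9]{10}")
# Chromatogram: primer type u or s, 12 alphanumeric characters (primer id and
# clone name), a three-digit lane number, and the .ab1 suffix.
TRACE_PATTERN = re.compile(r"[us][A-Za-z0-9]{12}[0-9]{3}\.ab1")


def check_run_folder(folder_name, file_names):
    """Return a list of names that do not follow the naming conventions."""
    problems = []
    if not FOLDER_PATTERN.fullmatch(folder_name):
        problems.append(f"bad folder name: {folder_name}")
    for file_name in file_names:
        if not TRACE_PATTERN.fullmatch(file_name):
            problems.append(f"bad chromatogram name: {file_name}")
    return problems
```

The patterns check only lengths and character classes; whether a code such as the lab code or primer id is actually registered still has to be confirmed against the ESTAP code tables.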



VII How to Register and Update EST Projects and Protocols

1 Project Registration and Update ESTAP defines two types of projects regular and virtual projects A regular project refers to a specific EST library that a lab constructed The regular project information includes biological information regarding tissues and cloning methods used in cDNA library construction and EST sequencing information In order to analyze combined libraries from a particular taxonomy we introduced the virtual project concept We now allow PIs to register a virtual project that combines their regular projects from a particular taxonomy To register a regular project or a virtual project please go to the ldquoProject Listrdquo page and click the ldquoRegister Regular Projectrdquo link or the ldquoRegister Virtual Projectrdquo link and follow the instructions carefully in the registration form PIs and the primary contacts with update privilege may also update the project information via Web The current version of ESTAP does not allow updating a project when the sequences of that project have been processed and stored into the ESTAP database If the user needs to make corrections to the project information after the sequences are processed he or she has to make a request to the ESTAP curator The ESTAP curator should delete already loaded sequences for that project and make corrections to the project information and then reload the sequences by running the estapDriverexe program

2 Protocol Registration and Update ESTAP offers the following protocols for data analysis 1) Cleansing protocols

The cleansing protocols contain procedures to cleanse raw sequences by removing low quality end sequences vectors polyA (or polyT) and screening for chimera and contaminations There are two types of cleansing protocols one is the cleansing protocol for blast and the other is the cleansing protocol for assembly The sequences cleansed by the cleansing protocol for BLAST are used for BLAST searching and the sequences cleansed by the cleansing protocol for assembly are used for the clusterassembly analysis The user may utilize the same procedures and parameters for both protocols There are several key points for the cleansing to be successful a The cleansing program uses the 5rsquo and 3rsquo vector fragments provided in the project

registration to match the vector portions in the clones So the 5rsquo and 3rsquo vector fragments (both the orientation of the fragments and sequences) must be correct

b The vector and insert adaptor sequences provided in the project registration should not be the whole adaptor sequences Only the regions that are present in the clones should be included

c The sequencing primer code in the sequence names of the raw sequence data must be correct The correct primer code indicates the correct orientation of the sequencing (5rsquo or 3rsquo) Wrong orientation will cause the removing vectoradaptor to fail Please check the primer table carefully to make sure the orientation of the primer is correct

d When cleansing procedure does not seem to give expected result please check whether the above key points have been followed

2) Blast protocol

The blast protocol provides procedures and parameters for BLAST (BLASTN BLASTX andor TBLASTX) search Users may choose to run all some or none of the blast procedures in the protocol

3) Clusterassembly protocol

The clusterassembly protocol provides procedures and parameters for clustering and assembly analysis Users register this protocol only if they wish to do clusterassembly To let ESTAP run clusterassembly programs for the projects the users need to register the cluster protocol for the project they wish to do clusterassembly The projects are not automatically clustered without the cluster protocols Please go to the View Project page of the project you wish to clusterassembly Please follow the link called Cluster Protocol to register the protocol and set the procedure parameters The parameters that ESTAP provides are from d2_cluster and CAP3 programs The Register ClusterAssembly Protocol page has a link ldquoClusterAssembly Protocol Guiderdquo which provides paper references for d2_cluster and CAP3 programs Once the protocol is registered ESTAP will automatically perform clusterassembly for your project according to the protocol you registered If you wish to perform clusterassembly for combined libraries (not for each project) you need to register a virtual project and select the taxonomy of the libraries Then register the BLAST and clusterassembly protocols for this virtual project Once the cluster protocol is registered ESTAP will automatically perform clusterassembly for the virtual project The clusterassembly protocol also provides an option to perform InterProScan analysis Users may choose this option to let ESTAP automatically annotate singlets and contigs protein functions after clusteringassembly procedure is performed

PIs and the primary contacts with update privilege may also update the protocols The updated protocols will be effective for new sequence data not the data that have already been analyzed If the user wishes to reanalyze the old data he or she must make such a request to the ESTAP curator The curator should use the following instructions 1) If the sequences of a project must be re-cleansed using the new protocol the curator

must delete the sequences of the project from the ESTAP database and reload the sequences again by running the estapDriverexe program Use the following command to delete sequences from the database

delete from raw_seq_file where project_id = x commit

project_id is the internal id for the project The curator may get the id from the PROJECT table by matching the lab code and project code of the project select project_id from project where lab_code = lsquoxxrsquo and user_proj_code =rsquoxxxrsquo Please see Section IV4e for estapDriverexe usage

2) If the sequences of a project must be re-blasted using the new BLAST protocol the curator must delete the old blast results of that project from the ESTAP database and run the runBlastexe program Use the following command to delete the blast results from the database

delete from analysis_tracking where project_id = x commit

Please see Section IV4f for runBlastexe usage 3) If the sequences of a project must be re-clustered using the new clusteringassembly

protocol the curator must delete the old clusteringassembly and the contig blast results of that project from the ESTAP database Then run the clusterexe program for clusteringassembly and run the blastAllContigexe program for contig blast Use the following command to delete the blast results from the database

delete from cluster_summary where project_id = x commit

Use ldquoclusterexe [project_id]rdquo command to redo clusteringassembly analysis and ldquoblastAllContigsexerdquo to redo contig blast analysis Note [project_id] is the internal id for that project Please see Section IV4h and Section IV4i for how to run clusterexe and blastAllContigexe programs

VIII How to Prepare Raw EST Data User can send ESTAP either the chromatograms of the sequences (ab1 files) or both sequence files and their corresponding quality score files in the FASTA format If the user sends ESTAP the chromatograms ESTAP will use the Phred base calling program to generate sequences and the quality scores Each sequence file contains sequences from the same sequencing run of the same project Each quality score file contains quality scores of the sequences of the corresponding sequencing file Each sequence or quality score string in the file must start with a ldquogtrdquo followed by a sequence name To process data files ESTAP requires that data providers use a standard file name for each file and a standard sequence name for each sequence in the file The file name contains specific information about that file and the sequence name contains specific information about the sequence All the sequences in the same file must be from the same project and the same sequencing run The naming convention and rules are described as follows bull Both file name and sequence name are required to be of a fixed length bull Each element in the name must conform to a required number of characters or digits bull Pad with 0rsquos (zeros) in front of any element that is shorter than the required length

The file names and sequence names are case-sensitive

1 File naming convention All sequence files will have ldquoseqrdquo as a suffix and quality score files will have ldquoqalrdquo as a suffix Every seq string must have an exactly matching qal string by name The file names use the following naming convention LLPPPTTIIIBRRNNNCCYYMMDDHHMMseq for sequence file and LLPPPTTIIIBRRNNNCCYYMMDDHHMMqal for the corresponding quality score file

LL ndash two-alphanumeric letter lab code (provided by VBI) eg UN PPP ndash three-alphanumeric letter project id (provided by VBI) eg MCA TT ndash two-alphanumeric letter instrument type (provided by VBI) eg A1 III ndash three-alphanumeric letter instrument id (user defined code that is unique to the

lab) eg VT1 B ndash one-alphanumeric letter base calling program id (provided by VBI) eg A RR ndash two-alphanumeric letter run parameter id (provided by VBI) eg S1 NNN ndash three-digit run duration in minutes (actual running time in minutes eg 090

for 90 minutes) CC ndash two-alphanumeric letter sequencing chemistry id (provided by VBI) eg Pr YYMMDDHHMM - the date and time at which the sequence was run eg

0104150930 YY ndash two-digit year MM ndash two-digit month DD ndash two-digit day HH ndash two-digit hour MM ndash two-digit minute

2 Sequence Naming Convention

The sequence names in the sequence and quality score files must be in the following format TPPPNNNNNNNNNLLL

T ndash one letter primer type u for user defined primer and s for standard primer PPP ndash three-alphanumeric letter primer id (provided or confirmed by VBI) eg T3a NNNNNNNNN ndash nine-alphanumeric letter unique clone name within the project

(user defined) LLL ndash three-digit lane or capillary number eg 012 for lane 12 in the sequence gel

Note Using the correct primer code (first 4 letters) is very important for cleaning procedure The correct primer code indicates the correct orientation of the sequencing (5rsquo or 3rsquo) Wrong orientation will cause removing vectoradaptor to fail

3. Example of a Sequence and a Quality Score File

a. Sequence file

File name: VBOVTA2VT1CS1120Tb0102011615.seq

The following is the content of the file:

>sM13rroot0002012
GTCGTAGCTAGCTAGCTGTGGACTGACTGA
CTAGCATCAGCTTGACTGACTAGCATAGCT
AGCATACGCTATCGATCAGGGCAGACTAGC
TGACTAGCATGACTAGCATT
>sM13rroot0003013
GTCGTAGCTTAGCTAGTCAGATCAGCATCG
AATACGATCAGCTAGCTGTGGACTGACTGA
CTAGCATCAGCTTGACTGACTAGCATAGCT
AGCATACGCTTACGTACACAGACACACA

b. Quality score file

File name: VBOVTA2VT1CS1120Tb0102011615.qal

The following is the content of the file:

>sM13rroot0002012
9 13 16 17 16 12 7 7 19 19 25 27 29 28 30 31 29 30 29 29 30 30 37 37 33 37 40 44 40 35 35 40 40 40 46 40 35 35 35 35 35 35 35 40 40 37 35 40 40 40 40 40 40 40 46 46 46 46 46 47 51 51 42 42 45 45 45 45 45 45 51 46 56 56 56 56 46 51 51 51 45 40 40 40 40 40 37 45 46 42 42 44 44 44 44 56 56 37 40 40 40 39 37 37 37 37 39 40 40 43
>sM13rroot0003013
9 13 13 19 19 9 6 6 16 19 24 24 29 29 29 29 30 37 46 34 37 40 46 40 37 37 37 37 37 40 40 40 40 40 40 35 35 40 40 37 35 35 35 35 35 35 39 40 46 46 46 46 46 56 46 42 42 46 43 51 51 51 51 42 42 51 42 42 51 51 51 51 56 51 51 51 51 51 51 56 56 56 56 56 46 42 42 42 42 42 42 42 42 43 43 35 35 35 35 35 35 56 51 51 51 51 51 51 56 56 56 56 40 40 40 40 40 40

Explanation of file name VBOVTA2VT1CS1120Tb0102011615.seq: The lab code is VB. The project code is OVT (a library of Orobanche minor v. Virginia tubercles). The instrument is a 377, so the code is A2. The id for this instrument is VT1. The base-calling program is ABI100, which has a code of C. The run parameter code is S1. Run time is 2 hours, so the run duration is 120. I have used dye terminator chemistry with big dyes, so the code is Tb. I ran these samples on Feb 1, 2001 at 4:15 pm, so the date/time stamp will be 0102011615. You only have to enter this once. The data itself carries a shorter name.

Explanation of sequence name sM13rroot0002012: The primer is standard primer M13f. So the name starts with an “s” followed by the three-letter primer code “M13”. The clone name is rroot0002. The sequence was run on lane 12, which has a code “012”.
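The requirement that every sequence has a matching quality score entry by name can be checked before files are sent. The sketch below is one possible check, not ESTAP's own validation: it assumes each entry starts with a “>” header line, and it additionally assumes one quality score per base, which is how Phred writes quality files. The function names and the short sample records in the demo are made up for illustration.

```python
def read_records(text):
    """Map each ">name" header to the whitespace-separated tokens that follow it."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            parts = line[1:].split()
            if not parts:
                raise ValueError("header line without a sequence name")
            name = parts[0]
            if name in records:
                raise ValueError(f"duplicate sequence name: {name}")
            records[name] = []
        elif name is None:
            raise ValueError("data found before the first header line")
        else:
            records[name].extend(line.split())
    return records


def check_pair(seq_text, qal_text):
    """Return a list of problems found between a sequence file and its quality file."""
    seqs = read_records(seq_text)
    quals = read_records(qal_text)
    problems = []
    for name in seqs:
        if name not in quals:
            problems.append(f"{name}: no quality scores")
    for name in quals:
        if name not in seqs:
            problems.append(f"{name}: no sequence")
    for name in seqs:
        if name in quals:
            n_bases = sum(len(chunk) for chunk in seqs[name])
            n_scores = len(quals[name])
            if n_bases != n_scores:
                problems.append(f"{name}: {n_bases} bases but {n_scores} scores")
    return problems


if __name__ == "__main__":
    # Shortened, made-up records that reuse the naming pattern only.
    seq = ">sM13rroot0002012\nGTCGTAGCTA\nGCTAG\n"
    qal = ">sM13rroot0002012\n9 13 16 17 16 12 7 7 19 19\n25 27 29 28 30\n"
    print(check_pair(seq, qal))
```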

4. Sending Chromatogram Files

If you wish ESTAP to run Phred for you, send chromatograms of the sequences. Put all sequences from the same sequencing run of the same project in a folder. The name of the folder follows the file naming convention without the .seq or .qal suffix, e.g. VBOVTA2VT1CS1120Tb0102011615. Each chromatogram file follows the sequence naming convention plus the suffix “.ab1”, e.g. sM13rroot0002012.ab1.
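A chromatogram folder can be checked against both conventions at once. The following sketch takes the folder name and a list of the file names inside it and reports anything that does not fit. The function name is hypothetical, and the patterns are one reading of the element descriptions (alphanumeric elements, digit-only run duration, date/time stamp, and lane number), with a dot assumed before the ab1 suffix.

```python
import re

# Folder: LLPPPTTIIIBRR (13 alphanumerics), NNN (3 digits), CC (2 alphanumerics),
# YYMMDDHHMM (10 digits), with no suffix.
FOLDER_RE = re.compile(r"[A-Za-z0-9]{13}[0-9]{3}[A-Za-z0-9]{2}[0-9]{10}")
# File: TPPPNNNNNNNNNLLL followed by ".ab1".
CHROMATOGRAM_RE = re.compile(r"[us][A-Za-z0-9]{3}[A-Za-z0-9]{9}[0-9]{3}\.ab1")


def check_run_folder(folder_name, file_names):
    """Return a list of naming problems for one sequencing-run folder."""
    problems = []
    if FOLDER_RE.fullmatch(folder_name) is None:
        problems.append(f"folder name does not follow the convention: {folder_name}")
    for file_name in file_names:
        if CHROMATOGRAM_RE.fullmatch(file_name) is None:
            problems.append(f"file name does not follow the convention: {file_name}")
    return problems


if __name__ == "__main__":
    print(check_run_folder(
        "VBOVTA2VT1CS1120Tb0102011615",
        ["sM13rroot0002012.ab1", "sM13rroot0003013.ab1"],
    ))
```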

IX. How to Use dbEST Submission Tool

The dbEST Submission Tool incorporated in ESTAP is available online to allow users to conveniently submit their cleansed EST sequences in the ESTAP database to NCBI’s dbEST database. Note that only PIs have the privilege to submit their EST sequences to dbEST. PIs may follow the “dbEST Submission” link on the “Project List” page to ask ESTAP to prepare the dbEST submission files for them. During dbEST submission, a contact file, a library file, a publication file, and one or more EST files will be created and sent to the primary contact person via email. EST files will be in FASTA format with clone name and clean sequence ids as identifiers.

After submitting the EST sequences to NCBI’s dbEST, the PI should receive an email from NCBI regarding the submitted batch of ESTs. In order to store the dbEST_ID, USER_ID, and GENBANK_ACCN assigned by NCBI in the ESTAP database, the PI should send the confirmation file received from NCBI to the ESTAP curator. The ESTAP curator should use the ReadConfirmationFile program to process the confirmation file.

The ReadConfirmationFile program is located under the java_prog directory of the uncompressed software. It is a stand-alone Java program which reads and parses the dbEST confirmation file and stores the dbEST_ID, USER_ID, and GENBANK_ACCN assigned by NCBI in the ESTAP database. To make this program work, the curator needs to modify a configuration file called “javaConfig.txt” in the java_prog directory and set the correct parameters (see Section IV.1.c). ReadConfirmationFile will use the parameters defined in this file to find the confirmation file and connect to the ESTAP database. The confirmation file should be named using the .txt suffix (e.g. JC_MCA_020314052943.txt) and placed in the directory path specified in the variable “ncbiConfirmFilePath” in the javaConfig.txt file. Go to the directory where ReadConfirmationFile.class is located and run the ReadConfirmationFile program using the following command:

java ReadConfirmationFile


2) Blast protocol

The blast protocol provides procedures and parameters for BLAST (BLASTN, BLASTX, and/or TBLASTX) searches. Users may choose to run all, some, or none of the blast procedures in the protocol.

3) Cluster/assembly protocol

The cluster/assembly protocol provides procedures and parameters for clustering and assembly analysis. Users register this protocol only if they wish to do cluster/assembly; projects are not clustered automatically without a cluster protocol. To register it, go to the View Project page of the project you wish to cluster/assemble and follow the link called Cluster Protocol to register the protocol and set the procedure parameters. The parameters that ESTAP provides are from the d2_cluster and CAP3 programs. The Register Cluster/Assembly Protocol page has a link, “Cluster/Assembly Protocol Guide”, which provides paper references for the d2_cluster and CAP3 programs. Once the protocol is registered, ESTAP will automatically perform cluster/assembly for your project according to the protocol you registered.

If you wish to perform cluster/assembly for combined libraries (not for each project), you need to register a virtual project and select the taxonomy of the libraries. Then register the BLAST and cluster/assembly protocols for this virtual project. Once the cluster protocol is registered, ESTAP will automatically perform cluster/assembly for the virtual project.

The cluster/assembly protocol also provides an option to perform InterProScan analysis. Users may choose this option to let ESTAP automatically annotate the protein functions of singlets and contigs after the clustering/assembly procedure is performed.

PIs and the primary contacts with update privilege may also update the protocols. The updated protocols take effect for new sequence data only, not for data that have already been analyzed. If the user wishes to reanalyze the old data, he or she must make such a request to the ESTAP curator. The curator should use the following instructions:

1) If the sequences of a project must be re-cleansed using the new protocol, the curator must delete the sequences of the project from the ESTAP database and reload the sequences by running the estapDriver.exe program. Use the following command to delete sequences from the database:

delete from raw_seq_file where project_id = x;
commit;

project_id is the internal id for the project. The curator may get the id from the PROJECT table by matching the lab code and project code of the project:

select project_id from project where lab_code = 'xx' and user_proj_code = 'xxx';

Please see Section IV.4.e for estapDriver.exe usage.

2) If the sequences of a project must be re-blasted using the new BLAST protocol, the curator must delete the old blast results of that project from the ESTAP database and run the runBlast.exe program. Use the following command to delete the blast results from the database:

delete from analysis_tracking where project_id = x;
commit;

Please see Section IV.4.f for runBlast.exe usage.

3) If the sequences of a project must be re-clustered using the new clustering/assembly protocol, the curator must delete the old clustering/assembly and contig blast results of that project from the ESTAP database. Then run the cluster.exe program for clustering/assembly and run the blastAllContig.exe program for contig blast. Use the following command to delete these results from the database:

delete from cluster_summary where project_id = x;
commit;

Use the “cluster.exe [project_id]” command to redo the clustering/assembly analysis and “blastAllContigs.exe” to redo the contig blast analysis. Note: [project_id] is the internal id for that project. Please see Section IV.4.h and Section IV.4.i for how to run the cluster.exe and blastAllContig.exe programs.
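The lookup-then-delete pattern in step 1 can be tried out safely before touching a real database. The sketch below uses Python's built-in sqlite3 module with an in-memory database as a stand-in for Oracle, so it only illustrates the order of operations. The table and column names project, raw_seq_file, project_id, lab_code, and user_proj_code come from the commands above; the file_name column, the sample rows, and the function names are assumptions made for the demo.

```python
import sqlite3


def make_demo_db():
    """Create a throwaway in-memory database with made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        create table project (project_id integer primary key,
                              lab_code text, user_proj_code text);
        create table raw_seq_file (project_id integer, file_name text);
        insert into project values (1, 'VB', 'OVT');
        insert into project values (2, 'UN', 'MCA');
        insert into raw_seq_file values (1, 'run_a');
        insert into raw_seq_file values (1, 'run_b');
        insert into raw_seq_file values (2, 'run_c');
        """
    )
    return conn


def delete_project_sequences(conn, lab_code, user_proj_code):
    """Look up the internal project id, then delete that project's sequence rows."""
    row = conn.execute(
        "select project_id from project where lab_code = ? and user_proj_code = ?",
        (lab_code, user_proj_code),
    ).fetchone()
    if row is None:
        raise LookupError(f"no project for {lab_code}/{user_proj_code}")
    project_id = row[0]
    # "with conn" commits on success and rolls back if the delete raises.
    with conn:
        cursor = conn.execute(
            "delete from raw_seq_file where project_id = ?", (project_id,)
        )
    return project_id, cursor.rowcount


if __name__ == "__main__":
    demo = make_demo_db()
    print(delete_project_sequences(demo, "VB", "OVT"))
```

Resolving the id from the lab code and project code first, and refusing to continue when no project matches, avoids running the delete with a mistyped id.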

VIII. How to Prepare Raw EST Data

Users can send ESTAP either the chromatograms of the sequences (ab1 files) or both sequence files and their corresponding quality score files in the FASTA format. If the user sends ESTAP the chromatograms, ESTAP will use the Phred base calling program to generate the sequences and the quality scores. Each sequence file contains sequences from the same sequencing run of the same project. Each quality score file contains the quality scores of the sequences in the corresponding sequence file. Each sequence or quality score string in the file must start with a “>” followed by a sequence name.

To process data files, ESTAP requires that data providers use a standard file name for each file and a standard sequence name for each sequence in the file. The file name contains specific information about that file and the sequence name contains specific information about the sequence. All the sequences in the same file must be from the same project and the same sequencing run. The naming convention and rules are described as follows:

• Both file name and sequence name are required to be of a fixed length.
• Each element in the name must conform to a required number of characters or digits.
• Pad with 0’s (zeros) in front of any element that is shorter than the required length.
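The fixed-length and zero-padding rules can be sketched in a few lines of Python. The helper names below are hypothetical, and the element widths used in build_sequence_name (1, 3, 9, and 3 characters) are an assumption taken from the TPPPNNNNNNNNNLLL sequence name layout used elsewhere in this guide.

```python
def pad_element(value, width):
    """Left-pad an element with zeros to its required length."""
    text = str(value)
    if len(text) > width:
        raise ValueError(f"{text!r} is longer than {width} characters")
    return text.rjust(width, "0")


def build_sequence_name(primer_type, primer_id, clone, lane):
    """Assemble a fixed-length sequence name from its four elements."""
    return (
        pad_element(primer_type, 1)
        + pad_element(primer_id, 3)
        + pad_element(clone, 9)
        + pad_element(lane, 3)
    )


if __name__ == "__main__":
    print(pad_element(90, 3))
    print(build_sequence_name("s", "M13", "rroot0002", 12))
```

An element that is too long is rejected rather than truncated, since silently cutting a clone name could make two different clones share one name.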



2) If the sequences of a project must be re-blasted using the new BLAST protocol the curator must delete the old blast results of that project from the ESTAP database and run the runBlastexe program Use the following command to delete the blast results from the database

delete from analysis_tracking where project_id = x commit

Please see Section IV4f for runBlastexe usage 3) If the sequences of a project must be re-clustered using the new clusteringassembly

protocol the curator must delete the old clusteringassembly and the contig blast results of that project from the ESTAP database Then run the clusterexe program for clusteringassembly and run the blastAllContigexe program for contig blast Use the following command to delete the blast results from the database

delete from cluster_summary where project_id = x commit

Use ldquoclusterexe [project_id]rdquo command to redo clusteringassembly analysis and ldquoblastAllContigsexerdquo to redo contig blast analysis Note [project_id] is the internal id for that project Please see Section IV4h and Section IV4i for how to run clusterexe and blastAllContigexe programs

VIII How to Prepare Raw EST Data User can send ESTAP either the chromatograms of the sequences (ab1 files) or both sequence files and their corresponding quality score files in the FASTA format If the user sends ESTAP the chromatograms ESTAP will use the Phred base calling program to generate sequences and the quality scores Each sequence file contains sequences from the same sequencing run of the same project Each quality score file contains quality scores of the sequences of the corresponding sequencing file Each sequence or quality score string in the file must start with a ldquogtrdquo followed by a sequence name To process data files ESTAP requires that data providers use a standard file name for each file and a standard sequence name for each sequence in the file The file name contains specific information about that file and the sequence name contains specific information about the sequence All the sequences in the same file must be from the same project and the same sequencing run The naming convention and rules are described as follows bull Both file name and sequence name are required to be of a fixed length bull Each element in the name must conform to a required number of characters or digits bull Pad with 0rsquos (zeros) in front of any element that is shorter than the required length

The file names and sequence names are case-sensitive

1 File naming convention All sequence files will have ldquoseqrdquo as a suffix and quality score files will have ldquoqalrdquo as a suffix Every seq string must have an exactly matching qal string by name The file names use the following naming convention LLPPPTTIIIBRRNNNCCYYMMDDHHMMseq for sequence file and LLPPPTTIIIBRRNNNCCYYMMDDHHMMqal for the corresponding quality score file

LL ndash two-alphanumeric letter lab code (provided by VBI) eg UN PPP ndash three-alphanumeric letter project id (provided by VBI) eg MCA TT ndash two-alphanumeric letter instrument type (provided by VBI) eg A1 III ndash three-alphanumeric letter instrument id (user defined code that is unique to the

lab) eg VT1 B ndash one-alphanumeric letter base calling program id (provided by VBI) eg A RR ndash two-alphanumeric letter run parameter id (provided by VBI) eg S1 NNN ndash three-digit run duration in minutes (actual running time in minutes eg 090

for 90 minutes) CC ndash two-alphanumeric letter sequencing chemistry id (provided by VBI) eg Pr YYMMDDHHMM - the date and time at which the sequence was run eg

0104150930 YY ndash two-digit year MM ndash two-digit month DD ndash two-digit day HH ndash two-digit hour MM ndash two-digit minute

2 Sequence Naming Convention

The sequence names in the sequence and quality score files must be in the following format TPPPNNNNNNNNNLLL

T ndash one letter primer type u for user defined primer and s for standard primer PPP ndash three-alphanumeric letter primer id (provided or confirmed by VBI) eg T3a NNNNNNNNN ndash nine-alphanumeric letter unique clone name within the project

(user defined) LLL ndash three-digit lane or capillary number eg 012 for lane 12 in the sequence gel

Note Using the correct primer code (first 4 letters) is very important for cleaning procedure The correct primer code indicates the correct orientation of the sequencing (5rsquo or 3rsquo) Wrong orientation will cause removing vectoradaptor to fail

3 Example of a Sequence and a Quality Score File

a Sequence file File name VBOVTA2VT1CS1120Tb0102011615seq The following is the content of the file gtsM13rroot0002012 GTCGTAGCTAGCTAGCTGTGGACTGACTGA CTAGCATCAGCTTGACTGACTAGCATAGCT AGCATACGCTATCGATCAGGGCAGACTAGC TGACTAGCATGACTAGCATT gtsM13rroot0003013 GTCGTAGCTTAGCTAGTCAGATCAGCATCG AATACGATCAGCTAGCTGTGGACTGACTGA CTAGCATCAGCTTGACTGACTAGCATAGCT AGCATACGCTTACGTACACAGACACACA

b Quality score file File name VBOVTA2VT1CS1120Tb0102011615qal The following is the content of the file gtsM13rroot0002012 9 13 16 17 16 12 7 7 19 19 25 27 29 28 30 31 29 30 29 29 30 30 37 37 33 37 40 44 40 35 35 40 40 40 46 40 35 35 35 35 35 35 35 40 40 37 35 40 40 40 40 40 40 40 46 46 46 46 46 47 51 51 42 42 45 45 45 45 45 45 51 46 56 56 56 56 46 51 51 51 45 40 40 40 40 40 37 45 46 42 42 44 44 44 44 56 56 37 40 40 40 39 37 37 37 37 39 40 40 43 gtsM13rroot0003013 9 13 13 19 19 9 6 6 16 19 24 24 29 29 29 29 30 37 46 34 37 40 46 40 37 37 37 37 37 40 40 40 40 40 40 35 35 40 40 37 35 35 35 35 35 35 39 40 46 46 46 46 46 56 46 42 42 46 43 51 51 51 51 42 42 51 42 42 51 51 51 51 56 51 51 51 51 51 51 56 56 56 56 56 46 42 42 42 42 42 42 42 42 43 43 35 35 35 35 35 35 56 51 51 51 51 51 51 56 56 56 56 40 40 40 40 40 40

Explanation of file name VBOVTA2VT1CS1120Tb0102011615seq The lab code is VB The project code is OVT (a library of Orbanche minor v Virginia tubercles) The instrument is a 377 so the code is A2 The id for this instrument is VT1 The base-calling program is ABI100 which has a code of C The run parameter code is S1 Run time is 2 hours I have used dye terminator chemistry with big dyes so the code is Tb I ran these samples on Feb 1 2001 at 415 pm so the datetime stamp will be 0102011615 You only have to enter this once The data itself carries a shorter name Explanation of sequence name sM13rroot0002012 The primer is standard primer M13f So the name starts with a ldquosrdquo followed by the three letter primer code ldquoM13rdquo The clone name is rroot0002 The sequence was run on lane 12 which has a code ldquo012rdquo

4 Sending Chromatogram Files If you wish ESTAP to run Phred for you send chromatograms of the sequences Put all sequences from the same sequence run of the same project in a folder The name of the folder follows the file naming convention without seq or qal suffixes eg VBOVTA2VT1CS1120Tb0102011615 Each chromatogram file follows the sequence naming convention plus the suffix ldquoab1rdquo eg sM13rroot0002012ab1

IX How to Use dbEST Submission Tool The dbEST Submission Tool incorporated in the ESTAP is available online to allow users to conveniently submit their cleansed EST sequences in the ESTAP database to NCBIrsquos dbEST database Note that only PIs have the privilege to submit their EST sequences to dbEST PIs may follow the ldquodbEST Submissionrdquo link on the ldquoProject Listrdquo page to ask ESTAP to prepare the dbEST submission files for them During dbEST submission a contact file a library file a publication file and one or more EST files will be created and sent to the primary contact person via emails EST files will be in fasta format with clone name and clean sequence ids as identifiers After submitting the EST

sequences to NCBIrsquos dbEST the PI should receive an email from NCBI regarding this batch of ESTs submitted In order to store the dbEST_ID USER_ID GENBANK_ACCN assigned by NCBI into the ESTAP database the PI should send this confirmation file received from NCBI to the ESTAP curator The ESTAP curator should use the ReadConfirmationFile program to process the confirmation file The ReadConfirmationFile program is located under the java_prog directory of the uncompressed software It is a stand-alone java program which reads and parses the dbEST confirmation file and stores the dbEST_ID USER_ID GENBANK_ACCN assigned by NCBI into the ESTAP database To make this program work the curator need to modify a configuration file called ldquojavaConfigtxtrdquo in the java_prog directory and set correct parameters (See Section IV1c) The ReadConfirmationFile will use the parameters defined in this file to find the confirmation file and make connection to the ESTAP database The confirmation file should be named using txt suffix (eg JC_MCA_020314052943txt) and placed in the directory path specified in the variable ldquoncbiConfirmFilePathrdquo in the javaConfigtxt file Go to the directory where the ReadConfirmationFileclass is located and run ReadConfirmationFile program using the following command java ReadConfirmationFile

  • I Hardware and Software Requirements
    • 1 Operating Systems
    • 2 Programming Language
    • 3 Database Management
    • 4 Web Browser
    • 5 Software Requirements
      • II How to Download and Uncompress the Software
      • III How to Install the Oracle Software and Create an ESTAP Database
        • 1 Install Oracle Software
        • 2 Create the Database Structure
          • IV How to Install and Run Pipeline Analysis Programs
            • 1 Set up
            • 2 Environment Variables
            • 3 Run the Programs
              • Go to the iprscan home directory and run perl CONFIGpl Then
                • 4 Schedule Programs
                • 5 Trouble-shooting
                  • V How to Install and Run Web Programs
                    • 1 Install Oracle Client including JDBC
                    • 2 Install JDK Servlet and Web Service Software
                    • 3 Configure Tomcat
                    • 4 Configure Axis
                      • VI How to Register and Update ESTAP Users
                      • VII How to Register and Update EST Projects and Protocols
                        • 1 Project Registration and Update
                        • 2 Protocol Registration and Update
                          • VIII How to Prepare Raw EST Data
                            • 1 File naming convention
                            • 2 Sequence Naming Convention
                            • 3 Example of a Sequence and a Quality Score File
                            • 4 Sending Chromatogram Files
                              • IX How to Use dbEST Submission Tool
Page 22: ESTAP Installation Guide - Biocomplexity Institute of …staff.vbi.vt.edu/estap/download/install_guide.pdfESTAP Installation And Operation Guide I. Hardware and Software Requirements

LL ndash two-alphanumeric letter lab code (provided by VBI) eg UN PPP ndash three-alphanumeric letter project id (provided by VBI) eg MCA TT ndash two-alphanumeric letter instrument type (provided by VBI) eg A1 III ndash three-alphanumeric letter instrument id (user defined code that is unique to the

lab) eg VT1 B ndash one-alphanumeric letter base calling program id (provided by VBI) eg A RR ndash two-alphanumeric letter run parameter id (provided by VBI) eg S1 NNN ndash three-digit run duration in minutes (actual running time in minutes eg 090

for 90 minutes) CC ndash two-alphanumeric letter sequencing chemistry id (provided by VBI) eg Pr YYMMDDHHMM - the date and time at which the sequence was run eg

0104150930 YY ndash two-digit year MM ndash two-digit month DD ndash two-digit day HH ndash two-digit hour MM ndash two-digit minute

2. Sequence Naming Convention

The sequence names in the sequence and quality score files must be in the following format: TPPPNNNNNNNNNLLL

T – one-letter primer type: u for a user-defined primer and s for a standard primer
PPP – three-character alphanumeric primer id (provided or confirmed by VBI), e.g., T3a
NNNNNNNNN – nine-character alphanumeric clone name, unique within the project (user defined)
LLL – three-digit lane or capillary number, e.g., 012 for lane 12 in the sequence gel

Note: Using the correct primer code (the first 4 letters) is very important for the cleaning procedure. The primer code indicates the orientation of the sequencing (5' or 3'); a wrong orientation will cause vector/adaptor removal to fail.
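As an illustration of the sequence naming format, the sketch below splits a name into its four parts and exposes the 4-letter primer code that the cleaning procedure depends on. The function and field names are hypothetical and are not part of ESTAP.

```python
import re

# TPPPNNNNNNNNNLLL: primer type, primer id, clone name, lane number.
# Group names are illustrative only.
SEQ_NAME_RE = re.compile(
    r"(?P<primer_type>[us])"          # T: u = user defined, s = standard
    r"(?P<primer_id>[A-Za-z0-9]{3})"  # PPP
    r"(?P<clone>[A-Za-z0-9]{9})"      # NNNNNNNNN
    r"(?P<lane>\d{3})"                # LLL
)


def parse_sequence_name(name):
    """Split a sequence name into its fields, or raise ValueError."""
    match = SEQ_NAME_RE.fullmatch(name)
    if match is None:
        raise ValueError(f"not a valid sequence name: {name!r}")
    fields = match.groupdict()
    fields["primer_code"] = name[:4]  # primer type + primer id
    fields["lane"] = int(fields["lane"])
    return fields


if __name__ == "__main__":
    print(parse_sequence_name("sM13rroot0002012"))
```

Checking names before submission catches a wrong or missing primer type early, before it can affect vector/adaptor removal.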

3. Example of a Sequence and a Quality Score File

a. Sequence file
File name: VBOVTA2VT1CS1120Tb0102011615.seq
The following is the content of the file:

>sM13rroot0002012
GTCGTAGCTAGCTAGCTGTGGACTGACTGA
CTAGCATCAGCTTGACTGACTAGCATAGCT
AGCATACGCTATCGATCAGGGCAGACTAGC
TGACTAGCATGACTAGCATT
>sM13rroot0003013
GTCGTAGCTTAGCTAGTCAGATCAGCATCG
AATACGATCAGCTAGCTGTGGACTGACTGA
CTAGCATCAGCTTGACTGACTAGCATAGCT
AGCATACGCTTACGTACACAGACACACA

b. Quality score file
File name: VBOVTA2VT1CS1120Tb0102011615.qal
The following is the content of the file:

>sM13rroot0002012
9 13 16 17 16 12 7 7 19 19 25 27 29 28 30 31 29 30 29 29 30 30 37 37 33 37 40 44 40 35
35 40 40 40 46 40 35 35 35 35 35 35 35 40 40 37 35 40 40 40 40 40 40 40 46 46 46 46 46 47
51 51 42 42 45 45 45 45 45 45 51 46 56 56 56 56 46 51 51 51 45 40 40 40 40 40 37 45 46 42
42 44 44 44 44 56 56 37 40 40 40 39 37 37 37 37 39 40 40 43
>sM13rroot0003013
9 13 13 19 19 9 6 6 16 19 24 24 29 29 29 29 30 37 46 34 37 40 46 40 37 37 37 37 37 40
40 40 40 40 40 35 35 40 40 37 35 35 35 35 35 35 39 40 46 46 46 46 46 56 46 42 42 46 43 51
51 51 51 42 42 51 42 42 51 51 51 51 56 51 51 51 51 51 51 56 56 56 56 56 46 42 42 42 42 42
42 42 42 43 43 35 35 35 35 35 35 56 51 51 51 51 51 51 56 56 56 56 40 40 40 40 40 40

Explanation of file name: VBOVTA2VT1CS1120Tb0102011615.seq
The lab code is VB. The project code is OVT (a library of Orobanche minor v. Virginia tubercles). The instrument is a 377, so the code is A2. The id for this instrument is VT1. The base-calling program is ABI100, which has a code of C. The run parameter code is S1. Run time is 2 hours, so the run duration is 120. I have used dye terminator chemistry with big dyes, so the code is Tb. I ran these samples on Feb 1, 2001 at 4:15 pm, so the date/time stamp will be 0102011615. You only have to enter this once. The data itself carries a shorter name.

Explanation of sequence name: sM13rroot0002012
The primer is standard primer M13f, so the name starts with an "s" followed by the three-letter primer code "M13". The clone name is rroot0002. The sequence was run on lane 12, which has a code "012".
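Because the sequence file and the quality score file describe the same reads, each sequence name should appear in both files with one quality score per base. The sketch below checks this before submission. It is not an ESTAP tool: the function names are hypothetical, and it assumes both files are FASTA-style, with each record starting with a ">" header line.

```python
def read_records(text):
    """Return {sequence name: [body lines]} for FASTA-style text."""
    records = {}
    name = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return records


def check_seq_against_qual(seq_text, qual_text):
    """List mismatches between a sequence file and a quality score file."""
    # Join wrapped lines; drop any whitespace inside the sequence.
    seqs = {n: "".join("".join(p).split())
            for n, p in read_records(seq_text).items()}
    quals = {n: " ".join(p).split()
             for n, p in read_records(qual_text).items()}
    problems = []
    for name in sorted(set(seqs) | set(quals)):
        if name not in seqs:
            problems.append(f"{name}: missing from sequence file")
        elif name not in quals:
            problems.append(f"{name}: missing from quality score file")
        elif len(seqs[name]) != len(quals[name]):
            problems.append(
                f"{name}: {len(seqs[name])} bases but "
                f"{len(quals[name])} quality scores"
            )
    return problems


if __name__ == "__main__":
    # Shortened, made-up records for demonstration only.
    seq_text = ">sM13rroot0002012\nGTCGTA\n"
    qual_text = ">sM13rroot0002012\n9 13 16 17 16 12\n"
    print(check_seq_against_qual(seq_text, qual_text) or "files agree")
```

An empty result means every record is present in both files and the lengths agree; anything else is reported by sequence name.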

4. Sending Chromatogram Files

If you wish ESTAP to run Phred for you, send chromatograms of the sequences. Put all sequences from the same sequence run of the same project in a folder. The name of the folder follows the file naming convention without the .seq or .qal suffix, e.g., VBOVTA2VT1CS1120Tb0102011615. Each chromatogram file follows the sequence naming convention plus the suffix ".ab1", e.g., sM13rroot0002012.ab1.

IX. How to Use dbEST Submission Tool

The dbEST Submission Tool incorporated in ESTAP is available online so that users can conveniently submit the cleansed EST sequences in the ESTAP database to NCBI's dbEST database. Note that only PIs have the privilege to submit their EST sequences to dbEST. PIs may follow the "dbEST Submission" link on the "Project List" page to ask ESTAP to prepare the dbEST submission files for them. During dbEST submission, a contact file, a library file, a publication file, and one or more EST files will be created and sent to the primary contact person by email. EST files will be in FASTA format, with clone name and clean sequence ids as identifiers.

After submitting the EST sequences to NCBI's dbEST, the PI should receive an email from NCBI regarding the submitted batch of ESTs. To store the dbEST_ID, USER_ID, and GENBANK_ACCN assigned by NCBI in the ESTAP database, the PI should send this confirmation file received from NCBI to the ESTAP curator. The ESTAP curator should use the ReadConfirmationFile program to process the confirmation file.

The ReadConfirmationFile program is located under the java_prog directory of the uncompressed software. It is a stand-alone Java program that reads and parses the dbEST confirmation file and stores the dbEST_ID, USER_ID, and GENBANK_ACCN assigned by NCBI in the ESTAP database. To make this program work, the curator needs to modify a configuration file called "javaConfig.txt" in the java_prog directory and set the correct parameters (see Section IV.1.c). ReadConfirmationFile uses the parameters defined in this file to find the confirmation file and to connect to the ESTAP database. The confirmation file should be named with a .txt suffix (e.g., JC_MCA_020314052943.txt) and placed in the directory specified by the variable "ncbiConfirmFilePath" in the javaConfig.txt file.

Go to the directory where ReadConfirmationFile.class is located and run the ReadConfirmationFile program using the following command:

java ReadConfirmationFile
