shortran v.0.43 manual


    A user's manual for

    shortran

    Released: 06-June-2012

    Vikas Gupta and Stig U. Andersen

    Bioinformatics Research Center and

    Department of Molecular Biology and Genetics

    Aarhus University, Denmark

    Contact: [email protected], [email protected]


    1 OVERVIEW
        1.1 Background
        1.2 Summary of shortran modules
        1.3 Implementation
        1.4 Availability
    2 Installation
        2.1 Python and Perl
        2.2 Bowtie
        2.3 MySQL and setting up database
        2.4 ECHO
        2.5 miRDeep
        2.6 Blast
        2.7 NiBLS
    3 QUICK START GUIDE
        3.1 Configuration
        3.2 Explore the database
    4 CONFIGURATION
        4.1 Configuration using the configuration.txt file
        4.2 Interactive configuration
    5 Modules
        5.1 Module-1: Adapter trimming, error correction, filtering and profiling
        5.2 Module-2: Genomic Mapping
        5.3 Module-3: Clustering analysis
        5.4 Module-4: miRNA predictions
        5.5 Module-5: ta-siRNA predictions
        5.6 Module-6: Annotation
        5.7 Module-7: Expression Plots
        5.8 Module-8: Pattern Search
        5.9 Module-9: Queries
    6 Application
        6.1 Module-1
        6.2 Module-2
        6.3 Module-3
        6.4 Module-4
        6.5 Module-5
        6.6 Module-6
        6.7 Module-7
        6.8 Module-8
        6.9 Module-9
        6.10 Computation summary


    1 OVERVIEW

    1.1 Background

    We found it complicated to combine small RNA (sRNA) profiling and annotation information using existing analysis tools. Here we describe the installation and use of shortran, which generates and combines expression profiling and annotation data from sRNA sequencing raw data. The output is provided in an easily accessible MySQL database format, together with graphical output to facilitate fast data analysis.

    1.2 Summary of shortran modules

    shortran is a combination of various modules, allowing a user to analyze data using either an interactive command line interface or a configuration file. A detailed description of the use of each module is provided in the Application section of this manual. shortran is capable of performing filtering, read error correction, expression profiling, mapping, clustering analysis, miRNA/ta-siRNA prediction, pattern search, target prediction and annotation.

    1.3 Implementation

    Python is the primary language of the pipeline, but Perl has also been used. The package contains multiple Python scripts. shortran.py is the main script, which invokes the other pipeline modules. shortran was developed using Python (v2.7.2) and Perl (v5.12.3). It has been tested on both Mac OS X (10.6.8) and Linux Ubuntu (10.04.3).

    1.4 Availability

    The shortran package, including an example dataset, is available at www.carb.dk/resources.asp.


    2 Installation

    The primary modules of the shortran package require no installation and can be invoked directly using Python after the dependencies have been installed. To run shortran, change directory (cd) to the folder containing the shortran.py file and run the script:

    > cd [path of shortran_folder]

    > python shortran.py

    The > indicates the command prompt and should not be typed. Brackets [] indicate text that needs to be adjusted. Brackets should not be typed. Comments are indicated by #. To obtain the path of a file or folder for use in a script, the icon of the respective item can be dragged into a terminal window, where the path will then be displayed.

    Detailed instructions for installation and setup of core dependencies on Mac OS and Linux platforms are included. Please refer to install_osx.README and install_linux.README.

    2.1 Python and Perl

    Most operating systems have built-in Python and Perl and no installation is required. If that is not the case, Python can be downloaded from http://python.org/getit/, and Perl from http://www.perl.org/get.html. In addition, the Python libraries numpy (http://numpy.scipy.org/), scipy (http://www.scipy.org/) and matplotlib (http://matplotlib.sourceforge.net/) need to be installed.

    2.2 Bowtie

    Bowtie is required by a number of modules and should be installed and added to the system path (so that typing bowtie in the terminal activates the program regardless of the working directory). If installing on Mac OS, Bowtie is installed and added to the path by the installation script. It can be found at http://bowtie-bio.sourceforge.net/index.shtml.

    2.3 MySQL and setting up database

    Installation of MySQL is optional, but it is required for use of the shortran database functionalities. A detailed description of downloading and setting up MySQL can be found at http://www.mysql.com/.

    MySQL setup for shortran is handled automatically by the Mac OS setup script and can be set up on Linux systems using the enclosed MySQL setup script:

    > sudo mysql < "mysql_setup.txt"

    You can also set up MySQL manually by creating a database and giving permissions to your user(s):


    > sudo mysql

    mysql> CREATE DATABASE profiles;

    mysql> CREATE USER user@localhost IDENTIFIED BY ;

    mysql> GRANT ALL PRIVILEGES ON profiles.* TO user@localhost;

    This basic MySQL setup corresponds to the following settings in the configuration.txt file:

    server=localhost

    user=user

    password=

    database=profiles

    2.4 ECHO

    ECHO is used only if k-mer correction of sequencing reads is required. It can be downloaded from http://uc-echo.sourceforge.net/.

    2.5 miRDeep

    All the scripts required for miRDeep are supplied with the shortran package, but its dependencies must be globally installed; see http://www.mdc-berlin.de/en/research/research_teams/systems_biology_of_gene_regulatory_elements/projects/miRDeep/. It is only required for Module-4.

    2.6 Blast

    formatdb and fastacmd from the Blast package should be installed. A description can be found here: http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/formatdb_fastacmd.html.

    2.7 NiBLS

    NiBLS (MacLean et al. BMC Bioinformatics. 2010 18;11:93) can optionally be used to detect small RNA loci instead of the simple cluster detection algorithm included with shortran. See https://github.com/danmaclean/NiBLS.


    3 QUICK START GUIDE

    3.1 Configuration

    For the first run, we recommend that you open the "configuration.txt" file in a text editor and replace "/Users/test/Desktop/shortran_0.43/demo" with the path to the shortran demo folder on your system.

    Open a terminal window, change to the shortran directory and run the program:

    > cd [path_to_shortran_folder]

    > python shortran.py

    Then hit "n" to run shortran according to the configuration file.

    If all dependencies are installed correctly, your output folder should look like the "example_output" folder available for download when the run is finished.

    3.2 Explore the database

    You can explore the MySQL database directly using the terminal window. For example, type:

    > mysql -u user

    mysql> show databases;

    mysql> use profiles;

    mysql> show tables;

    mysql> describe `demo_2012-05-09_profile_19_24_sorted.cut-1_score_1`;

    mysql> SELECT * FROM `demo_2012-05-09_profile_19_24_sorted.cut-1_score_1` LIMIT 10;

    When logged into the database, you can experiment with the queries. For instance, retrieve the 10 most abundant sRNA sequences with exon matches and sort them by abundance:

    mysql> SELECT `#Sequence`,`sum_norm`,`genomic_region` FROM `demo_2012-05-09_profile_19_24_sorted.cut-1_score_1` WHERE `genomic_region` LIKE "%exonic%" ORDER BY `sum_norm` DESC LIMIT 10;

    For further query examples, please see section 6.9 (Module-9).
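
The abundance query can also be run from a script. The following is a minimal sketch only: it uses Python's built-in sqlite3 module as a stand-in for the MySQL server so that it runs without any database setup. The table name profile_demo and its three rows are made up for illustration; only the column names are taken from the query in this section.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for the MySQL server (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE profile_demo ("#Sequence" TEXT, sum_norm REAL, genomic_region TEXT)'
)
# Hypothetical rows; real values come from the shortran profile table.
rows = [
    ("ACGTACGTACGTACGTACGTA", 120.5, "exonic"),
    ("TTGACCTGAATCGGATCCAAG", 310.0, "intergenic"),
    ("GGCATTCGATCGATTAGCCAT", 45.2, "intronic/exonic"),
]
conn.executemany("INSERT INTO profile_demo VALUES (?, ?, ?)", rows)


def top_exonic(connection, limit=10):
    """Return the most abundant sequences whose annotation mentions 'exonic'."""
    query = (
        'SELECT "#Sequence", sum_norm, genomic_region FROM profile_demo '
        "WHERE genomic_region LIKE ? ORDER BY sum_norm DESC LIMIT ?"
    )
    return connection.execute(query, ("%exonic%", limit)).fetchall()


for sequence, abundance, region in top_exonic(conn):
    print(sequence, abundance, region)
```

Against a real shortran database, the same SELECT statement would be sent through a MySQL client library instead, using the table name reported by show tables.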


    4 CONFIGURATION

    There are two ways of configuring the analysis. Either the file configuration.txt can be modified directly, or the interactive configuration interface can be used.

    4.1 Configuration using the configuration.txt file

    Parameters and variables can be defined during configuration. Here we describe the common parameters for the pipeline. Module-specific parameters are explained in section 5 of the manual. We start by defining a name (e.g. the date) for the folder:

    date = '[a string]' # e.g. 2012-02-21

    The range of small RNA lengths to be analyzed is defined by entering the minimum and maximum lengths:

    min_seq_size = '[an integer]' # e.g. 19

    max_seq_size = '[an integer]' # e.g. 24

    Rare reads can be filtered by using a minimum threshold (cutoff), which removes all sequences that do not occur at least cutoff times when summed across all libraries. Filtering using a variation score (score_regulation_cutoff) can be used to select sequences that show inter-library variation in abundance:

    cutoff = '[an integer]'

    score_regulation_cutoff = '[an integer]' # stdev/sqrt(average)
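
The two filters can be illustrated with a short sketch. This is not shortran's own code: the function names are hypothetical, the use of the population standard deviation is an assumption (the manual gives only stdev/sqrt(average)), and both thresholds are treated as inclusive minimums.

```python
import math
import statistics


def variation_score(counts):
    """Return stdev / sqrt(average) for one sequence's per-library counts."""
    average = statistics.mean(counts)
    if average == 0:
        return 0.0
    # Population standard deviation is an assumption; the manual does not say
    # whether the sample or population form is used.
    return statistics.pstdev(counts) / math.sqrt(average)


def passes_filters(counts, cutoff, score_regulation_cutoff):
    """Apply the abundance cutoff (sum across libraries) and the score cutoff."""
    return (
        sum(counts) >= cutoff
        and variation_score(counts) >= score_regulation_cutoff
    )


print(variation_score([4, 4, 4, 4]))  # identical counts in every library
print(variation_score([0, 0, 8, 8]))  # strongly library-dependent counts
```

A sequence with the same count in every library scores zero, while a sequence present in only some libraries scores higher and is more likely to pass score_regulation_cutoff.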

    Library headers, which are used in the plots and in the MySQL tables, can be defined:

    genotype = '[strings separated by commas]' # e.g. 'wt','mut'

    The MySQL server should be configured using the following options:

    server = '[string]' # e.g. localhost

    user = '[string]' # e.g. user

    password = '[string]' # e.g. ''

    database = '[string]' # e.g. profiles

    Bowtie mapping parameters, such as the maximum number of mismatches in the seed, the maximum sum of mismatch quality scores across the alignment, and the seed length, can be modified using the following parameters:

    mismatches = '[an integer]'

    max_maq_err = '[an integer]'


    seedlen = '[an integer]'

    Each library should have its own folder containing sequencing data in fastq format. The path to the directory containing all folders, and the folder names, should be configured (for details see the Application section).

    fq_file_dir = '[path to directory with all data folders]'

    lib = '[all the data folder names separated by commas]'

    The variable filtering_file in the configuration file can refer to any fasta file against which the reads should be filtered. If the option use_only_mapped_reads is True, all reads not mapping to the filtering_file are discarded; if it is False, only the mapped reads are discarded. filtering_file can be left empty if no filtering is required.

    filtering_file = '[file_path]'

    use_only_mapped_reads = '[True/False]'

    The reference genome file, annotation feature file (describing, for example, gene models), and conserved miRNA files are defined using the following options:

    genome_file = '[file_path]'

    gtf_file = '[file_path]'

    miRNA_database = '[file_path]'

    The parameters igv_infile and cluster_infile are optional and can be used to input mapping or clustering results from other sources into the pipeline. For file formats, please see the example data. If left empty, the files generated by earlier modules are used by default.
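
To check a configuration file before a run, the key = value lines can be read with a few lines of Python. The sketch below is illustrative only: how shortran itself reads configuration.txt is not described here, the function names are hypothetical, and the parser assumes that # never occurs inside a value.

```python
import ast


def parse_config_line(line):
    """Parse one "key = value  # comment" line; return (key, value) or None."""
    # Drop a trailing comment. This naive split assumes '#' never occurs
    # inside a value, which holds for the options shown in this section.
    body = line.split("#", 1)[0].strip()
    if not body or "=" not in body:
        return None
    key, raw = (part.strip() for part in body.split("=", 1))
    try:
        value = ast.literal_eval(raw)  # quoted strings, numbers, True/False
    except (ValueError, SyntaxError):
        value = raw  # fall back to the raw text, e.g. an unquoted host name
    return key, value


def parse_config(text):
    """Collect all settings from configuration text into a dict."""
    settings = {}
    for line in text.splitlines():
        parsed = parse_config_line(line)
        if parsed is not None:
            settings[parsed[0]] = parsed[1]
    return settings


example = """
date = '2012-02-21'   # run name
min_seq_size = '19'
max_seq_size = '24'
genotype = 'wt','mut'
server = localhost
"""
print(parse_config(example))
```

A comma-separated value such as the genotype list is returned as a tuple, which makes it easy to confirm that the number of library headers matches the number of data folders.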

    4.2 Interactive configuration

    The shortran pipeline also offers an interactive configuration mode, where the user is prompted for all options and parameters. The options are the same as described in section 4.1, but an additional help function is provided in the interactive version.

    5 Modules

    5.1 Module-1: Adapter trimming, error correction, filtering and profiling

    This module offers pre-processing of the raw data obtained from sequencing (fastq format). Sequencing errors can be corrected by k-mer correction using ECHO. This step is computationally expensive and is not critical for use of the other modules. It may, however, improve analysis accuracy.

    Raw_data_filtering = True

    ErrorCorrection = True

    Path_2_ECHO = '[full path to echo]'

    If raw fastq files are used, adapters should be clipped off and the files should be split by size:

    adapter_filtering = True

    keep_with_adapter_only = True

    # If True, discards reads if no adapter was detected

    adapter='[3-prime adapter sequence]' # e.g. 'AGCCTTCTA'

    # 3' adapter sequence. Multiple sequences can be entered as a comma-separated list in the same order as the sequencing libraries in the lib variable

    trim=0 # number of bases to trim from 5' end
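
The effect of these options on a single read can be sketched as follows. This is a simplified illustration, not the shortran implementation: adapter detection by exact string match is an assumption, and the function names are hypothetical. The parameter names mirror the configuration options above.

```python
def clip_read(read, adapter, trim=0, keep_with_adapter_only=True):
    """Clip a 3' adapter from a read and trim bases from its 5' end.

    Returns the processed read, or None if the read is discarded.
    Exact-match adapter detection is a simplifying assumption.
    """
    position = read.find(adapter)
    if position == -1:
        if keep_with_adapter_only:
            return None  # no adapter found: discard the read
        insert = read  # keep the unclipped read
    else:
        insert = read[:position]  # drop the adapter and everything after it
    return insert[trim:]


def split_by_size(reads, min_seq_size, max_seq_size):
    """Group reads by length, keeping only lengths within the given range."""
    fractions = {}
    for read in reads:
        if min_seq_size <= len(read) <= max_seq_size:
            fractions.setdefault(len(read), []).append(read)
    return fractions


raw = "TTGACCTGAATCGGATCCAAG" + "AGCCTTCTA" + "GG"
clipped = clip_read(raw, "AGCCTTCTA")
print(clipped, len(clipped))
print(split_by_size([clipped], 19, 24))
```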

    Reads can be filtered based on homology to a user-supplied multifasta file. For instance, the user could remove reads matching ribosomal RNA sequences or choose to focus only on reads mapping to the reference genome.

    homology_filtering='[full path and filename]' #multifasta file

    use_only_mapped_reads=False #removes reads matching the fasta file

    Following error correction and filtering, profile tables with raw and normalized read counts can be generated.

    Make_no_cut_profile_files=True

    Once the profile tables are available, additional filters for abundance and inter-library variation score (stdev/sqrt(average)) can be applied.

    cutoff='[a digit]' # the minimum read count sum across all libraries

    score_regulation_cutoff='[a digit]' # the minimum variation score

    The final step in Module-1 adds the filtered profile table to a MySQL database:

    Add_output_to_mysql_database=True

    Summary Module-1:

    Input:

      • Raw Fastq files

    Output:

      • Adapter filtered Fastq files - $output/module-1/adapter_filtered_data/
      • Fastq files split by size - $output/module-1/size_fractionated_data/
      • Homology filtered Fastq files - $output/module-1/filter/
      • Profile tables - $output/module-1/profiles/
          o Normalized count tables with abundance cutoff
          o Normalized count tables with abundance cutoff and variation score cutoff
          o Similar files for all the size fractions
      • Files with raw abundance total counts
          o Redundant counts - $output/module-1/libraries_abundance
          o Non-redundant counts - $output/module-1/libraries_abundance_unique
      • File for selecting control for libraries - libraries_reference
      • File for selecting pairs to be analyzed for correlation - compare_libraries
      • Graphical output
          o Plots for raw abundances across libraries
          o Sequence size variation across the libraries
          o Sequence size fraction variation within the library
      • MySQL table

    5.2 Module-2: Genomic Mapping

    Reads are mapped to a given reference genome and alignment .igv files are generated. These files contain information about mapping positions and expression values and can be displayed using the IGV browser (http://www.broadinstitute.org/igv/).

    genome_file = '[full path and filename]' #reference genome file

    Genome_mapping = True

    The variable infile allows using an alternative profile table. If left empty, the profile table generated by Module-1 is used by default.

    infile='[full path and filename]' #use alternative profile table

    Summary Module-2:

    Input:

      • File with all the reads
          o By default taken from Module-1 output
          o Optional: The user can submit any file where the first column contains sequences for genome mapping using the infile parameter.

    Output:

      • Mapped files
          o SAM format
          o IGV format
      • Fasta file
          o Contains all the reads

    5.3 Module-3: Clustering analysis

    In Module-3, a clustering algorithm is invoked where mapped reads are separated into different clusters based on their genomic positions. With default settings, a cluster is a genomic region that contains at least five unique reads with no gap longer than 50 nucleotides between consecutive reads. In addition, the abundances of these reads should sum to a total of at least 75. Parameters can be changed to adjust the stringency. An external IGV file to be used for clustering can be specified using the igv_infile parameter. If left blank, the IGV file produced by Module-2 is used by default.

    Cluster_prediction=True

    divide_by_size=False #do not treat size fractions separately in cluster analysis

    abundance_cutoff = 75 #minimum sum of read abundances per cluster

    Add_cluster_to_mySQL = True

    igv_infile=''
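
The default cluster definition (at least five unique reads, no gap longer than 50 nucleotides, summed abundance of at least 75) can be sketched as a single pass over position-sorted reads. This is an illustration under stated assumptions, not the shortran code: reads are taken to be (start, end, abundance) tuples on one chromosome, the gap is measured from the furthest end seen so far to the next start, and max_gap and min_unique_reads are hypothetical parameter names.

```python
def find_clusters(reads, max_gap=50, min_unique_reads=5, abundance_cutoff=75):
    """Group mapped reads into clusters on a single chromosome.

    reads: iterable of (start, end, abundance) tuples, one per unique read.
    A group is extended while the gap to the next read is at most max_gap.
    """
    groups = []
    current = []
    current_end = None
    for start, end, abundance in sorted(reads):
        if current and start - current_end > max_gap:
            groups.append(current)  # gap too long: close the current group
            current = []
            current_end = None
        current.append((start, end, abundance))
        current_end = end if current_end is None else max(current_end, end)
    if current:
        groups.append(current)
    # Keep only groups that satisfy both stringency thresholds.
    return [
        group
        for group in groups
        if len(group) >= min_unique_reads
        and sum(abundance for _, _, abundance in group) >= abundance_cutoff
    ]


reads = [
    (100, 121, 20), (110, 131, 20), (150, 171, 20), (200, 221, 20),
    (260, 281, 20), (1000, 1021, 500), (1030, 1051, 500),
]
for cluster in find_clusters(reads):
    print(cluster[0][0], cluster[-1][1], len(cluster))
```

In this made-up example the two highly abundant reads near position 1000 are not reported, because they fail the minimum of five unique reads even though their abundance is far above the cutoff.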

    If preferred, the NiBLS algorithm (MacLean et al. BMC Bioinformatics. 2010 18;11:93) can be used to detect clusters instead of the simple approach described above:

    Cluster_prediction=True

    NiBLS=True

    Finally, clusters can be annotated according to their overlap with the genomic features specified using the gtf_file parameter.

    Cluster_genomic_region_prediction=True

    Summary Module-3:

    Input:

      • File with reads mapped to the genome
          o By default the mapped file from Module-2
          o Optional: IGV format mapped file provided externally (igv_infile)

    Output:

      • Cluster file
          o BED format file for clusters comprising all size classes
          o Size fractionated clusters
          o Clusters summary
          o Clusters in fasta format file
      • Genomic annotation of clusters
          o Figure with genomic distribution of clusters
          o Counts in the file cluster_genomic_region.txt
          o File with all the sequences and their corresponding genomic annotation (file extension .region)
      • MySQL table

    5.4 Module-4: miRNA predictions

    Modules 4 and 5 are designed for categorizing small RNAs. Module-4 uses miRDeep or miRDeep-P for miRNA prediction. The miRDeep-P module is only applicable to small RNA datasets obtained from plants. For animal data, please use miRDeep instead. Predicted miRNAs are mapped against a fasta file containing conserved miRNA sequences (miRNA_database) to check the accuracy of the prediction. A summary of mapped conserved miRNA/miRNA*/miRNA-families is generated.

    miRNA_database = '[file name]'

    #fasta file with conserved miRNA sequences

    MiRNA_prediction=True

    mi_sequence_file = '[file name]' #optional external miRDeep format input file

    gff_file = '[file name]' #annotation file only for miRNA predictions

    The variable mi_sequence_file is the file path to the properly formatted (miRDeep standard input) fasta file. If left empty, a miRDeep format input fasta file is generated from the Module-1 output profile table by default. miRDeep can use an annotation file (gff_file) to filter tRNA and rRNA sequences from its prediction results.

    Summary Module-4:

    Input:

      • Genome fasta file (genome_file)
      • RNA classification annotation file (gff_file)
      • Sequence fasta file in miRDeep format (mi_sequence_file)
          o By default it is taken from Module-1
      • Conserved miRNA (miRNA_database)
          o By default it is taken from demo (miRBase v.17)

    Output:

      • Potential miRNA candidates - miRNA_filter_P_prediction_miRNA.fasta
      • miRNA precursors - miRNA_precursors
      • miRNA structures - miRNA_structures
      • Conserved miRNAs, family and homolog counts - miRNA_filter_P_prediction_miRNA.miRNA_mapped.miRNAfamily.txt


    5.5 Module-5: ta-siRNA predictions

    Trans-acting small interfering RNAs are another category of small RNAs generated by plants. Here we predict ta-siRNAs using a modified version of Chen's algorithm (Ho-Ming Chen et al., 2007). A fasta file containing ta-siRNA clusters is produced and supplemented with another file describing the probability that an sRNA is associated with a tas-locus.

    TasiRNA_prediction=True

    Summary Module-5:

    Input:

      • Genome fasta file (genome_file)
      • Cluster file (from Module-3)

    Output:

      • Predicted ta-siRNA sequences with p-value


    libraries. These libraries can be directly compared against one another, or they can be compared after being normalized to a given reference library. Modify the files libraries_reference and compare_libraries in the Module-1 folder to set up the library comparisons.

    Pattern_plot=True

    pattern_plot_file='' #external file with profile table, default is profile table from Module-1

    Summary Module-7:

    Input:

      • Profile table
          o By default from Module-1 (pattern_plot_file)

    Output:

      • Graphical output
          o Normalized expression values plot
          o Log of normalized expression values plot
          o Relative fraction plot when provided control libraries
      • Files with correlation coefficient values - *.spearman_values_*

    5.8 Module-8: Pattern Search

    This module calculates the Markov probability for the 5'-most two and three nucleotides. An output file with probabilities is generated to allow the user to identify sequence composition biases. Given a pattern in the sub-module, all the sequences with the pattern can be retrieved and further analysis can be performed on this set.

    Find_pattern=True

    infile_for_pattern='' #profile table from Module-1 or other file with sequences in first column

    Summary Module-8:

    Input:

      • File containing sequences
          o By default profile table from Module-1 (infile_for_pattern)

    Output:

      • Markov probabilities
          o For first three nucleotides - markov_prob.txt
      • File containing sequences with the targeted pattern - *.fasta.*
      • Mapping counts for sequences with pattern - pattern_read_stats.txt


    5.9 Module-9: Queries

    Once the MySQL table is created, the data can be accessed using SQL queries.

    MySQL_query=True

    query_infile='query_infile' #Text file containing MySQL queries. Should be located in the shortran main directory.

    Summary Module-9:

    For example queries, please see the Application section, Module-9.

    Input:

      • MySQL format query (query_infile)

    Output:

      • MySQL query result - mysql_query_output


    6 Application

    Here we explain the application of shortran for the analysis of the dataset supplied with the package.

    A sample dataset is available with the shortran package. It consists of a subset of four libraries from Lotus japonicus sRNA-seq data. Outputs from all the modules are included in the package, and we recommend that users go through the demo data examples before proceeding with their own data. Here we briefly explain the usage and output of each module. All shortran settings used in the analysis are presented in the configuration.txt file, which can be used to analyze the sample data after changing the directory and MySQL server settings to reflect your local environment.

    6.1 Module-1

    Raw data for processing are fastq files. In the demo, these fastq files are separated into four different folders named GHD-5_sample, GHD-6_sample, GHD-9_sample and GHD-10_sample. These four folders refer to the libraries mock_wild_type, wild_type, mock_NFR1 and NFR1, respectively. Each contains reads obtained by Illumina RNA-seq that map to chromosome 2 (35-40 Mb) of the genome.

    We pre-processed the reads in module 1.1 and an error correction algorithm was invoked. There were 627,147 reads in the raw dataset (Figure 1), and when the sequences were passed through ECHO, approximately 3.6% of the reads were corrected in each library.

    The two files lib_abundance and lib_abundance_unique contain abundances for each size and library for redundant and non-redundant reads, respectively; the data are plotted in Figures 2 and 3. Figures 2a and 2b are example figures showing the read length distribution (in %) among different libraries for the size classes 21 nt and 24 nt. Plots for all sizes are available in the plots folder. Figures 3a and 3b are plots of the size distribution within a library for the mock_NFR1 and NFR1 libraries, respectively.

    Homology filtering was performed to remove all the reads coming from Lotus repeat data, leaving 605,329 reads. Both the filtered reads and the reads left after filtering are provided as fastq files that can be used in separate analyses.

    In module 1.2, all the reads are processed to calculate abundances of all unique small RNA reads within each library, and normalization is performed to take into account differences in library sizes. There were 42,448 unique reads in the dataset and 25,329 were left after applying an abundance cut-off of 2. An additional variation score cut-off (stdev/sqrt(average) > 1) was applied to identify 12,814 differentially regulated sequences (unique reads). The profile tables are stored in the Module-1 output folder as well as in the MySQL output table.
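
Normalization for library size can be pictured with the sketch below. The manual does not state which scaling shortran applies, so this shows one common choice (scaling every library to the same total, here one million reads) purely as an assumption; the function name and the example counts are hypothetical.

```python
def normalize_counts(profile, scale=1_000_000):
    """Scale raw counts so that every library sums to `scale`.

    profile: dict mapping sequence -> list of raw counts, one per library.
    Scaling to a common total is an assumed method, not taken from shortran.
    """
    n_libraries = len(next(iter(profile.values())))
    totals = [
        sum(counts[i] for counts in profile.values()) for i in range(n_libraries)
    ]
    return {
        sequence: [
            count * scale / total if total else 0.0
            for count, total in zip(counts, totals)
        ]
        for sequence, counts in profile.items()
    }


# Two libraries of very different depth but identical composition.
raw_profile = {"ACGTACGTACGTACGTACGTA": [10, 100], "TTGACCTGAATCGGATCCAAG": [30, 300]}
print(normalize_counts(raw_profile, scale=100))
```

After scaling, the two libraries give identical values for each sequence, which is the behaviour wanted when only sequencing depth differs.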

    Figure 1. Raw read counts for each library.

    Figure 2a. Read length distributions among all libraries. 21 nt fraction.

    Figure 2b. Read length distributions among all libraries. 24 nt fraction.

    Figure 3a. Size distribution for the mock_NFR1 library.

    Figure 3b. Size distribution for the NFR1 library.

    6.2 Module-2

    All the reads were mapped to the genome using the Bowtie mapping software. Mapping parameter options were:

    mismatches=2

    max_maq_err=70

    seedlen=25

    All the reads were mapped back to the genome at 102,356 positions. As the demo data was constructed from mapped reads only, all the reads are expected to map back. An IGV format file is generated and placed in the Module-2 output folder. This IGV file also contains all the normalized expression values, and these can be visualized in the IGV browser.

    6.3 Module-3

    This module processes the IGV file from Module-2 and predicts clusters with a minimum abundance of 75. In total, 4315 clusters were predicted. A .bed format file is supplied containing all the clusters with embedded sequences. The file can be visualized using the IGV browser. Genomic annotation was performed: 48 clusters were exonic, 46 intronic, 3855 intergenic, and 2 clusters overlapped intron/exon boundaries (Figure 4).

    Figure 4. Genomic distribution of clusters


    6.4 Module-4

    miRNA prediction was done using the plant-adjusted version of miRDeep, miRDeep-P, and 29 miRNAs were predicted, including 1 conserved and 28 novel miRNAs. These comprise 1 miRNA family. Details of the homologs for each family, along with all other details, are given in the Module-4 output folder. Mapping is performed against miRBase release 17. Alternative miRNA databases, as defined by the user, can be used for the comparison instead.

    6.5 Module-5

    30 individual ta-siRNA sequences were predicted based on the relative positioning (phasing) of the sRNA loci on the genome. The majority of these loci were contained in 6 clusters (Module-3). A cluster fasta file is available, as is the list of all the small RNA reads with an associated p-value evaluating the relative positioning of the loci. A read should have a p-value lower than 0.001 to be considered a phased RNA.

    6.6 Module-6

    This module is implemented to add annotation to all the sequences in the profile table. This module requires a MySQL installation to be executed. As input, a variety of fasta files can be provided by the user. Sequences will be mapped against each file, and where sequences map to the respective reference, the associated mapping positions are reported.

    6.7 Module-7

    This module creates expression pattern plots for the libraries that need to be compared. Pairs to be considered can be defined in the file compare_libraries, provided in the Module-1 folder. By default, plots are generated for all possible sample pairs, including the plot against sum and score values from the profile table. As target libraries will usually be sequenced along with corresponding controls, the latter can be specified as references in the file named libraries_reference, also provided in the Module-1 folder.

    The file libraries_reference contains two columns, of which the first refers to the target library. Libraries are denoted by numbers in the same order as in the user-provided input list of libraries. The second column is filled with 0s by default, unless a control library is defined. Target/reference relationships are defined by replacing the 0s with the number of the appropriate control library.
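
The two-column layout of libraries_reference can be read with a few lines of Python. The sketch assumes whitespace-separated integer columns without a header line, which is not confirmed by the manual, and the function name is hypothetical. The example content pairs each treated demo library with its mock control, following the library order given in section 6.1.

```python
def parse_libraries_reference(text):
    """Map each target library number to its control library number.

    A control of 0 means that no reference library is defined (stored as None).
    """
    references = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or incomplete lines
        target, control = int(fields[0]), int(fields[1])
        references[target] = control or None
    return references


# Hypothetical content for the four demo libraries: libraries 1 and 3 (the
# mock treatments) serve as controls for libraries 2 and 4.
example_reference = "1 0\n2 1\n3 0\n4 3\n"
print(parse_libraries_reference(example_reference))
```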

    Four different kinds of plots can be generated. When no control library is defined, normalized counts and log values of normalized counts are plotted (Figures 5a and 5b). If reference libraries are defined, expression scores are plotted according to equations 1a and 1b. The expression plots are visualized in Figures 5c and 5d.


    Figure 5a. Normalized counts are plotted.

    Equation 1a. log-odd-score = log(Ti/Ci)

    Equation 1b. norm_control = (Ti-Ci)/(Ti+Ci)

    where Ti is the expression value in the target library and Ci is the expression value in the control library.
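
Equations 1a and 1b translate directly into code. In the sketch below the function names are hypothetical, and the natural logarithm is an assumption because the manual does not state the base. Zero counts are not handled: equation 1a is undefined when Ti or Ci is zero, and equation 1b when both are zero.

```python
import math


def log_odd_score(target, control):
    """Equation 1a: log(Ti / Ci). Natural log is an assumption."""
    return math.log(target / control)


def norm_control(target, control):
    """Equation 1b: (Ti - Ci) / (Ti + Ci), bounded between -1 and 1."""
    return (target - control) / (target + control)


# Hypothetical normalized expression values for one sequence.
ti, ci = 30.0, 10.0
print(log_odd_score(ti, ci))
print(norm_control(ti, ci))
```

Both scores are zero when target and control are equal. The second score stays within the range -1 to 1, which keeps highly induced and highly repressed sequences on the same plot scale.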

    Figure 5b. log(normalized counts) are plotted.

    Figure 5c. log-odd-score = log(Ti/Ci) plotted.

    Figure 5d. norm_control = (Ti-Ci)/(Ti+Ci) plotted.

    6.8 Module-8

    This module is used to find sequence composition biases at the 5' end of the sequences. Di- and tri-nucleotide Markov probabilities are calculated for each nucleotide combination. In the provided sample dataset, for example, an over-representation of GAA starting triplets is observed. Sequences with particular motifs are extracted and mapped back to the genome. If a bias is caused by technical artifacts, the proportion of sequences with the motif in question that map to the reference genome might be lower than the average.
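
A simple way to look for such a bias is to tabulate how often each leading di- or tri-nucleotide occurs. The sketch below uses plain observed frequencies as a stand-in; it is not the Markov probability calculation used by shortran, whose exact definition is not given in this manual, and the function names and example sequences are hypothetical.

```python
from collections import Counter


def five_prime_frequencies(sequences, k):
    """Return the relative frequency of each k-mer found at the 5' end."""
    counts = Counter(seq[:k] for seq in sequences if len(seq) >= k)
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}


def sequences_with_pattern(sequences, pattern):
    """Select the sequences that start with the given 5' pattern."""
    return [seq for seq in sequences if seq.startswith(pattern)]


example_sequences = ["GAATTC", "GAACCA", "TTGACC", "GAAGGT"]
print(five_prime_frequencies(example_sequences, 3))
print(sequences_with_pattern(example_sequences, "GAA"))
```

The selected sequences could then be mapped back to the genome to compare their mapping rate with that of the full dataset, as described above.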


    6.9 Module-9

    Using this module, the MySQL table can be queried. Complex queries concerning particular biological questions can be performed. Examples of queries are shown below, each preceded by its purpose.

    Purpose: Retrieve variation scores and miRBase annotations for predicted miRNAs.

    Syntax:

        SELECT
            `#Sequence`,
            `miRBase.fa`,
            `score`
        FROM `2012-03-04_profile_19_24_sorted.cut-1_score_1`
        WHERE
            `miRNA_prediction` = "predicted as miRNA";

    Purpose: Retrieve average response values for sRNAs that are up-regulated in the wild type and map to miRNAs annotated in miRBase.

    Syntax:

        SELECT
            COUNT(score) AS n,
            AVG(LOG(wild_type_norm/Mock_wild_type_norm)) AS wt,
            AVG(LOG(NFR1_norm/Mock_NFR1_norm)) AS nfr1
        FROM `2012-03-04_profile_19_24_sorted.cut-1_score_1`
        WHERE
            LOG(wild_type_norm/Mock_wild_type_norm) > 0
            AND `miRBase.fa` !="0";

    Purpose: Retrieve expression information for sRNAs mapping to miRNAs annotated in miRBase and sort in descending order by the wild type expression response.

    Syntax:

        SELECT
            `#Sequence` AS sequence,
            `read_size` AS length,
            LOG(wild_type_norm/Mock_wild_type_norm) AS wt,
            LOG(NFR1_norm/Mock_NFR1_norm) AS nfr1,
            `miRBase.fa`
        FROM `2012-03-04_profile_19_24_sorted.cut-1_score_1`
        WHERE
            `miRBase.fa` != 0
        ORDER BY wt DESC;

    6.10 Computation summary

    Module  Process                               Time (s)   CPU %   Memory %
    1.1     Adapter filtering/trimming                15.4     0.4        0.3
    1.1     Error correction                        5827.5    73.4        0.9
    1.1     Size splitting                             7.1    49.4        0.4
    1.1     Homology filtering                       116.0   113.5        0.5
    1.2     Making profile tables                     73.5    54.1        1.3
    1.3     Adding data to MySQL database              1.9    37.9        0.2
    2       Genomic mapping                          129.0   131.9        5.4
    3.1     Cluster prediction only                    9.6    98.5        1.6
    3.1     - including sequence extraction         1393.4   149.3        0.8
    3.2     Cluster genomic region annotation         26.7    97.8        0.4
    4.1     miRNA predictions-mirDeep2              2476.3   100         40.4
    4.1     miRNA predictions-mirDeepP             11875.1     1.8        3.9
    4.2     miRBase mapping                            5.0     3.3        0.8
    5       ta-siRNA prediction                       79.7    98.9        0.9
    6       Homology annotation - mapping              6.5    16.7        0.2
    6       Homology annotation - MySQL                1.1     0.1        0.1
    7       Expression profile plotting              107.8    97.6        1.1
    8.1     Find pattern                               3.7   100          0.2
    8.2     Map pattern                              155.2    99.8        5.9
    9       MySQL querying                             0.0     0.9        0.2

    The total number of reads processed was 627,147. Data was analyzed on a Macintosh iBook with 4 GB memory and a four-core 2.7 GHz Intel i7 processor.