09/20/2016 Snakemake Tutorial 1

Snakemake Tutorial

Summary

● Context

● Common uses (4 steps)

● Specific uses (2 steps)

● Still in progress

● An easy script scheduler written in Python

● Can be used on a computing grid

● Can be used with virtual environments

Context Common uses Specific uses Still in progress

● What for?

For example, you need to process the 1500 fastq files of the Lanner experiment through a succession of parallelised treatments.

● The pipeline we will build

[Figure: pipeline diagram. For each input file (ERR1307260, ERR1307264, ERR130726...), the rules 'fastqc' and 'unzip' lead to the rule 'miniQC', and the rules 'tophat' and 'htseq' run alongside; everything ends at the rule 'all'.]

● Prerequisites

conda installed

access to the BiRD grid

● 1st Step

– Virtual environment creation

– Basic snakefile creation (first part)

● Environment (shell commands)

conda create -n myTutoEnv python=3.5 snakemake fastQC bioconductor-deseq2 -c bioconda

qlogin

source activate myTutoEnv

touch snakefile

● Snakefile (1/2)

import glob, os

inFiles = set()  ## set of all input files (only file names)

## extract file names from paths
inPaths = glob.glob("/sandbox/ylelievre/singlecell/tuto/fastq/*.fastq.gz")
for p in inPaths:
    inFiles.add(os.path.basename(p).replace(".fastq.gz", ""))

##############

Basename extraction from the files to be processed
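The basename extraction can be tried outside Snakemake. The following is a minimal sketch, assuming a throwaway directory filled with empty placeholder files; the helper name `collect_basenames` and the file `notes.txt` are hypothetical and only serve the illustration.

```python
import glob
import os
import tempfile


def collect_basenames(fastq_dir, suffix=".fastq.gz"):
    """Return the file names found in fastq_dir, without directory and suffix."""
    names = set()
    for path in glob.glob(os.path.join(fastq_dir, "*" + suffix)):
        names.add(os.path.basename(path)[: -len(suffix)])
    return names


# Demo on a temporary directory containing empty placeholder files.
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ("ERR1307260", "ERR1307261"):
        open(os.path.join(demo_dir, name + ".fastq.gz"), "w").close()
    # This file does not match the suffix, so the glob pattern ignores it.
    open(os.path.join(demo_dir, "notes.txt"), "w").close()
    demo_names = collect_basenames(demo_dir)

print(sorted(demo_names))
```

A set is used, as in the Snakefile, so each sample name appears only once.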

● Snakefile (2/2)

rule all:
    input:
        I1 = expand("fastQC/{filename}_fastqc.zip", filename=inFiles)

##############

rule fastqc:
    input:
        "/sandbox/ylelievre/singlecell/tuto/fastq/{afile}.fastq.gz"
    output:
        "fastQC/{afile}_fastqc.zip"
    shell:
        "fastqc -o fastQC {input}"

The 1st rule (all) defines the target of the pipeline

The subsequent rules define the different steps to reach the target (here just one step)
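To see what the target list of the rule all looks like, here is a pure-Python sketch of the idea behind `expand`. It is not Snakemake's own implementation: `expand_like` is a hypothetical stand-in that fills a pattern with every combination of the given values.

```python
from itertools import product


def expand_like(pattern, **wildcards):
    """Fill `pattern` with every combination of the given wildcard values."""
    keys = list(wildcards)
    combos = product(*(wildcards[key] for key in keys))
    return [pattern.format(**dict(zip(keys, combo))) for combo in combos]


# A list is used here (instead of the set of the Snakefile) to get a stable order.
in_files = ["ERR1307260", "ERR1307261"]
targets = expand_like("fastQC/{filename}_fastqc.zip", filename=in_files)
print(targets)
```

Each element of the resulting list is one file that the rule all requests, and therefore one job for the rule fastqc.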

● Launch the pipeline (shell command)

snakemake -p -n -s mySnakefileName

-p : print the shell commands that are executed

-n : dry run (show what would be executed without running anything)

-s : path to the Snakefile (needed if it is not named 'Snakefile')

● What can be observed?

– The rules are executed sequentially (SGE not used)

– The 5 fastqc rules are executed before the rule all

– There is no specific order between the fastqc rules

[Figure: the rule 'fastqc' is applied to each file (ERR1307260, ERR1307265, ERR130726...), then the rule 'all' runs.]

● Preparation of the next step

– Delete a zip file (ERR1307264) from the directory ./fastQC

● 2nd Step

– Basic snakefile creation (second part)

● Snakefile update

rule all:
    input:
        I1 = expand("fastQC/{filename}_unzip_fastqc.done", filename=inFiles)

##############
...

rule unzip_fastqc:
    input:
        zips = "fastQC/{zipfile}_fastqc.zip"
    output:
        touch("fastQC/{zipfile}_unzip_fastqc.done")
    shell:
        "unzip -o {input.zips} -d ./fastQC"

Input of the rule all modified

Rule unzip_fastqc added
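The link between a requested .done file and the zip file it needs goes through the {zipfile} wildcard. The sketch below illustrates the principle with a regular expression; it is a simplified model, not Snakemake's actual matching code, and `match_wildcards` is a hypothetical helper.

```python
import re


def match_wildcards(pattern, target):
    """Return the wildcard values that make `pattern` equal `target`, or None."""
    regex = ""
    position = 0
    for found in re.finditer(r"\{(\w+)\}", pattern):
        # Literal text before the wildcard is matched as-is.
        regex += re.escape(pattern[position:found.start()])
        # The wildcard itself becomes a named capture group.
        regex += "(?P<{}>.+)".format(found.group(1))
        position = found.end()
    regex += re.escape(pattern[position:])
    result = re.fullmatch(regex, target)
    return result.groupdict() if result else None


output_pattern = "fastQC/{zipfile}_unzip_fastqc.done"
input_pattern = "fastQC/{zipfile}_fastqc.zip"

# The rule all asks for this file; the output pattern tells us the wildcard value.
wildcards = match_wildcards(output_pattern, "fastQC/ERR1307264_unzip_fastqc.done")
# The same value is then substituted into the input pattern.
needed_input = input_pattern.format(**wildcards)
print(wildcards, needed_input)
```

The wildcard name is local to each rule, which is why the rules can use {afile} and {zipfile} for the same sample name.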

● Launch the pipeline (shell command)

snakemake -p

● What can be observed?

– Not all the fastqc rules are necessarily executed before the unzip rules

[Figure: execution diagram showing the files ERR1307260, ERR1307263, ERR130726... and ERR1307264 with the rules 'fastqc', 'unzip' and 'all'.]

● Preparation of the next step

– Delete 2 zip files (ERR1307263 and ERR1307264) from the directory ./fastQC

– Delete 2 done files (ERR1307262 and ERR1307264) from the directory ./fastQC

● 3rd Step

– Basic snakefile creation (second part)

– Missing files and snakemake

● Snakefile update

rule all:
    input:
        "QC/miniQC.done"

##############
...

rule miniQC:
    input:
        I1 = expand("fastQC/{filename}_unzip_fastqc.done", filename=inFiles)
    output:
        touch("QC/miniQC.done")
    shell:
        "./miniQC.sh"

Input of the rule all modified

Rule miniQC added

● Launch the pipeline (shell command)

snakemake -p

● What can be observed?

[Figure: execution diagram showing the files ERR1307260, ERR1307261, ERR1307262, ERR1307263 and ERR1307264 with the rules 'fastqc', 'unzip', 'miniQC' and 'all'.]
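How missing files decide which jobs run can be approximated with a small model. The sketch below is a deliberate simplification, assuming that a file which already exists is never rebuilt and ignoring the timestamp checks that Snakemake also performs. The helpers `jobs_to_run` and `sample_rules` and the shortened `fastq/` directory are hypothetical.

```python
def jobs_to_run(target, rules, existing):
    """List the (rule, output) pairs needed to create `target`, upstream first.

    `rules` maps an output file to (rule_name, [input files]).
    Simplifying assumption: a file that already exists is never rebuilt.
    """
    if target in existing or target not in rules:
        return []
    rule_name, inputs = rules[target]
    plan = []
    for needed in inputs:
        for job in jobs_to_run(needed, rules, existing):
            if job not in plan:
                plan.append(job)
    plan.append((rule_name, target))
    return plan


def sample_rules(sample):
    """Build the two jobs (fastqc, then unzip_fastqc) of one sample."""
    zip_file = "fastQC/{}_fastqc.zip".format(sample)
    done_file = "fastQC/{}_unzip_fastqc.done".format(sample)
    return {
        zip_file: ("fastqc", ["fastq/{}.fastq.gz".format(sample)]),
        done_file: ("unzip_fastqc", [zip_file]),
    }


rules = {}
for sample in ("ERR1307260", "ERR1307262", "ERR1307264"):
    rules.update(sample_rules(sample))

# ERR1307260: nothing deleted; ERR1307262: done file deleted;
# ERR1307264: zip file and done file deleted.
existing = {
    "fastq/ERR1307260.fastq.gz",
    "fastq/ERR1307262.fastq.gz",
    "fastq/ERR1307264.fastq.gz",
    "fastQC/ERR1307260_fastqc.zip",
    "fastQC/ERR1307260_unzip_fastqc.done",
    "fastQC/ERR1307262_fastqc.zip",
}

plan_60 = jobs_to_run("fastQC/ERR1307260_unzip_fastqc.done", rules, existing)
plan_62 = jobs_to_run("fastQC/ERR1307262_unzip_fastqc.done", rules, existing)
plan_64 = jobs_to_run("fastQC/ERR1307264_unzip_fastqc.done", rules, existing)
print(plan_60, plan_62, plan_64, sep="\n")
```

In this model, only the unzip step is replayed when the zip file is still present, and the whole chain is replayed when both files are gone.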

● Preparation of the next step

– Delete 2 zip files (ERR1307263 and ERR1307264) from the directory ./fastQC

– Delete 2 done files (ERR1307263 and ERR1307264) from the directory ./fastQC

– Delete the miniQC.done file from the directory ./QC

● 4th Step

– Snakefile with a config file

● Snakefile with a config file (1/3)

import glob, os

configfile: "config_tuto.json"

inFiles = set()  ## set of all input files (only file names)

## extract file names from paths
inPaths = glob.glob(config["FASTQ_PATH"]+"*.fastq.gz")
for p in inPaths:
    inFiles.add(os.path.basename(p).replace(".fastq.gz", ""))

##############

configfile: loads the config file

config["FASTQ_PATH"]: reads the field FASTQ_PATH from the JSON file
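The content of config_tuto.json is not shown in the slides. The sketch below writes a hypothetical version to a temporary directory and loads it back: only the key names FASTQ_PATH and ANALYSIS_PATH come from the tutorial, the values are placeholders. It also checks one consequence of building paths with "+": each directory value has to end with a slash.

```python
import json
import os
import tempfile

# Hypothetical values: only the key names come from the tutorial.
example_config = {
    "FASTQ_PATH": "/path/to/fastq/",
    "ANALYSIS_PATH": "/path/to/analysis/",
}

with tempfile.TemporaryDirectory() as tmp:
    config_file = os.path.join(tmp, "config_tuto.json")
    with open(config_file, "w") as handle:
        json.dump(example_config, handle, indent=4)
    # Reading the file back gives a plain dictionary, like `config` in a Snakefile.
    with open(config_file) as handle:
        config = json.load(handle)

# The Snakefile concatenates with "+", so each path must end with "/".
for key in ("FASTQ_PATH", "ANALYSIS_PATH"):
    assert config[key].endswith("/"), key + " needs a trailing slash"

fastq_pattern = config["FASTQ_PATH"] + "*.fastq.gz"
qc_target = config["ANALYSIS_PATH"] + "QC/miniQC.done"
print(fastq_pattern)
print(qc_target)
```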

● Snakefile with a config file (2/3)

rule all:
    input:
        config["ANALYSIS_PATH"]+"QC/miniQC.done"

##############

rule fastqc:
    input:
        config["FASTQ_PATH"]+"{afile}.fastq.gz"
    output:
        config["ANALYSIS_PATH"]+"fastQC/{afile}_fastqc.zip"
    params:
        analysis_path=config["ANALYSIS_PATH"]
    shell:
        "fastqc -o {params.analysis_path}fastQC {input}"

params: is used here to inject a config field into the shell command
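The placeholders of the shell string follow Python's format syntax, so the substitution can be imitated with `str.format`. This is only a sketch of the principle (Snakemake's own formatting does more); the paths are hypothetical and `SimpleNamespace` stands in for the params object.

```python
from types import SimpleNamespace

config = {"ANALYSIS_PATH": "/path/to/analysis/"}  # hypothetical value
params = SimpleNamespace(analysis_path=config["ANALYSIS_PATH"])
input_file = "/path/to/fastq/ERR1307260.fastq.gz"  # hypothetical path

command_template = "fastqc -o {params.analysis_path}fastQC {input}"
# {params.analysis_path} is resolved by attribute access on `params`.
command = command_template.format(params=params, input=input_file)
print(command)
```

Because the config value ends with a slash, the directory name fastQC can be appended directly after the placeholder.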

● Snakefile with a config file (3/3)

rule unzip_fastqc:
    input:
        zips = config["ANALYSIS_PATH"]+"fastQC/{zipfile}_fastqc.zip"
    output:
        touch(config["ANALYSIS_PATH"]+"fastQC/{zipfile}_unzip_fastqc.done")
    params:
        analysis_path=config["ANALYSIS_PATH"]
    shell:
        "unzip -o {input.zips} -d {params.analysis_path}fastQC"

rule miniQC:
    input:
        expand(config["ANALYSIS_PATH"]+"fastQC/{filename}_unzip_fastqc.done", filename=inFiles)
    output:
        touch(config["ANALYSIS_PATH"]+"QC/miniQC.done")
    params:
        fastq_type=config["FASTQ_TYPE"]
    shell:
        "./miniQC.sh {params.fastq_type}"

● Launch the pipeline (shell command)

snakemake -p

● What can be observed?

[Figure: execution diagram showing the files ERR1307260, ERR1307261, ERR1307262, ERR1307263 and ERR1307264 with the rules 'fastqc', 'unzip', 'miniQC' and 'all'.]

● Preparation of the next step

– Delete all zip files from the directory ./fastQC

– Delete all done files from the directory ./fastQC

– Delete the miniQC.done file from the directory ./QC

– Shell commands:

source deactivate

exit

● 5th Step

– Snakemake and computing grid

– Snakemake and wrappers

Use the --cluster option to run Snakemake on SGE

● Environment (shell commands)

source activate myTutoEnv

mkdir logs

snakemake -p --cluster "qsub -o ./logs/ -e ./logs/" --jobs 50

-o and -e define the directories where SGE writes the standard output and standard error of each job

--jobs defines the maximum number of jobs submitted to the grid at the same time

● What can be observed? (1/2)

If you were lucky, you would observe this!

[Figure: execution diagram showing the files ERR1307260, ERR1307264 and ERR130726... with the rules 'fastqc', 'unzip', 'miniQC' and 'all'.]

● What can be observed? (2/2)

Unless you are doing this tutorial directly in your home directory (…), your jobs should fail because your current working directory has not been passed to the computing grid.

Moreover, the environment variables of your virtual environment have not been transmitted to the grid either.

We will use a wrapper in order to transmit the environment variables

● Environment (shell commands)

ctrl c

touch myGridWrapper.sh

● myGridWrapper.sh (shell commands)

#$ -S /bin/bash
#$ -cwd
#$ -V

{exec_job}

-cwd : transmit the current working directory

-V : transmit the virtual environment variables to the node

{exec_job} : execute the script

● Shell commands

snakemake -p --cluster "qsub -o ./logs/ -e ./logs/" --jobs 50 --jobscript myGridWrapper.sh

● What can be observed?

Missed again!

You should get an error like 'OSError: Missing files after 5 seconds:'

● Shell commands (try again)

ctrl c

snakemake -p --latency-wait 60 --cluster "qsub -o ./logs/ -e ./logs/" --jobs 50 --jobscript myGridWrapper.sh

On a grid engine, output files can take a while to become visible on the shared file system; --latency-wait sets how many seconds Snakemake waits for them before reporting an error
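The idea behind the latency wait can be sketched as a polling loop: keep checking for the expected output files until they exist or the allowed time is over. This is an illustration of the principle, not Snakemake's code; `wait_for_files` and the file names are hypothetical, and the error message only imitates the one quoted above.

```python
import os
import tempfile
import time


def wait_for_files(paths, latency_wait, poll_interval=0.1):
    """Wait up to `latency_wait` seconds for all `paths` to exist."""
    deadline = time.monotonic() + latency_wait
    while True:
        missing = [path for path in paths if not os.path.exists(path)]
        if not missing:
            return
        if time.monotonic() >= deadline:
            raise OSError(
                "Missing files after {} seconds:\n{}".format(
                    latency_wait, "\n".join(missing)
                )
            )
        time.sleep(poll_interval)


with tempfile.TemporaryDirectory() as tmp:
    present = os.path.join(tmp, "present.done")
    absent = os.path.join(tmp, "absent.done")
    open(present, "w").close()

    # The file exists, so this call returns immediately.
    wait_for_files([present], latency_wait=1)

    # This file never appears, so the wait ends with an error.
    try:
        wait_for_files([absent], latency_wait=0.3)
        timed_out = False
    except OSError:
        timed_out = True

print(timed_out)
```

A longer wait costs nothing when the files show up quickly, since the loop returns as soon as nothing is missing.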

● What can be observed?

Finally!!!!

[Figure: execution diagram showing the files ERR1307260, ERR1307264 and ERR130726... with the rules 'fastqc', 'unzip', 'miniQC' and 'all'.]

● 6th Step

– Snakemake and virtual environments

● Let's play with virtual environments!

Often, we need to use incompatible versions of tools in the same pipeline.

One solution is the use of virtual environments.

Example: use Snakemake with Python 3 and tophat2 with Python 2

Preparation of a virtual environment which is compatible with tophat and htseq-count

● Environment (shell commands)

conda create --name myTophatTutoEnv python=2.7.8 tophat htseq pysam=0.8.3 -c bioconda

● Snakefile with a config file (1/2)

Update the rule all

rule all:
    input:
        I1 = config["ANALYSIS_PATH"]+"QC/miniQC.done",
        I2 = expand(config["ANALYSIS_PATH"]+"counts/{filename}.counts", filename=inFiles)

Add the following rules

rule tophat2_SE:
    input:
        config["FASTQ_PATH"]+"{bamname}.fastq.gz"
    output:
        config["ANALYSIS_PATH"]+"BAM/{bamname}/accepted_hits.bam"
    params:
        cpu=config["TOPHAT_CPU"],
        gene=config["GENE_REFERENCE"],
        bowtie=config["BOWTIE_INDEX"],
        analysis_path=config["ANALYSIS_PATH"]
    shell:
        """
        source activate myTophatTutoEnv
        tophat2 -p {params.cpu} -G {params.gene} -o {params.analysis_path}BAM/{wildcards.bamname} {params.bowtie} {input}
        """

● Snakefile with a config file (2/2)

rule htseqcount:
    input:
        config["ANALYSIS_PATH"]+"BAM/{htseqname}/accepted_hits.bam"
    output:
        config["ANALYSIS_PATH"]+"counts/{htseqname}.counts"
    params:
        gene=config["GENE_REFERENCE"],
        analysis_path=config["ANALYSIS_PATH"]
    shell:
        """
        source activate myTophatTutoEnv
        htseq-count -f bam -r pos -i gene_name -s no {input} {params.gene} > {params.analysis_path}counts/{wildcards.htseqname}.counts
        """

● Preparation of the next step

– Delete all zip files from the directory ./fastQC

– Delete all done files from the directory ./fastQC

– Delete the miniQC.done file from the directory ./QC

● Shell commands

snakemake -p --latency-wait 60 --cluster "qsub -o ./logs/ -e ./logs/" --jobs 50 --jobscript myGridWrapper.sh

● What can be observed?

[Figure: complete pipeline diagram showing the files ERR1307260, ERR1307264 and ERR130726... with the rules 'fastqc', 'unzip', 'miniQC', 'tophat', 'htseq' and 'all'.]

● A young technology

● Problem with the NFS file system (solved with BeeGFS)

Thank you for your perseverance!