Roman Cherniatchik, 2020 May 6, St. Petersburg
Snakemake Advanced Tutorial
Snakemake
What for?
Snakemake vs Bash Scripts
Criterion | Snakemake pipeline | Bash scripts
“Entrance” level | Hard | Easy
Programming language | Python + Bash | Bash (convenient for simple scripts only)
Calculations automation | + | +
Results consistency / reproducibility | + | Depends on you
Computation environment reproducibility | Docker, Conda, Singularity integration | Depends on you
Multiple platforms (write once, launch everywhere) | PC, computational clusters (HPC, LSF, ...), cloud computing | New scripts for each platform; a universal solution is hard to write
Bash scripts are simple and easier to start with, but they can become a nightmare for large, complicated pipelines and for efficient cloud computing.
Snakemake Basics
<Reminder>
Snakemake Dependency Graph (DAG)
Understanding the DAG concept is crucial for writing Snakemake pipelines.
The dependency graph is used to decide:
• Which rules to execute?
• Which input files to use?
Dependency graph building: from output files back to input files
Rules execution order: from input files to output files
Pipeline Execution Order
reads/: A.fq.gz; B.fq.gz; C.fq.gz
Example:
snakemake --cores 1 --dag | dot -Tsvg > dag.svg
Pipeline Dependencies Lookup
Rule | RULE INPUT | RULE OUTPUT
all (START HERE) | plot.svg |
plot | peaks/A.bed, peaks/B.bed, peaks/C.bed | plot.svg
call_peaks (A, B, C -> {sample}) | bams/{sample}.bam | peaks/{sample}.bed
align (A, B, C -> {sample}) | reads/{sample}.fq.gz | bams/{sample}.bam
Source files (The End): reads/: A.fq.gz; B.fq.gz; C.fq.gz
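The backward lookup can be sketched in plain Python. This is only an illustration of the idea, not Snakemake's implementation: `RULES`, `SOURCE_FILES`, `match` and `plan` are hypothetical names, and an unconstrained wildcard is assumed to behave like the regex `.+`.

```python
import re

# Hypothetical rule table mirroring the slide: name -> (input patterns, output pattern).
RULES = {
    "plot": (["peaks/A.bed", "peaks/B.bed", "peaks/C.bed"], "plot.svg"),
    "call_peaks": (["bams/{sample}.bam"], "peaks/{sample}.bed"),
    "align": (["reads/{sample}.fq.gz"], "bams/{sample}.bam"),
}
SOURCE_FILES = {"reads/A.fq.gz", "reads/B.fq.gz", "reads/C.fq.gz"}


def match(pattern, path):
    """Return wildcard values if `path` fits `pattern`, else None."""
    # re.split with one capturing group alternates: literal, name, literal, ...
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = "".join(
        re.escape(part) if i % 2 == 0 else f"(?P<{part}>.+)"
        for i, part in enumerate(parts)
    )
    found = re.fullmatch(regex, path)
    return found.groupdict() if found else None


def plan(target, jobs=None):
    """Walk from `target` back to source files; return jobs in execution order."""
    jobs = [] if jobs is None else jobs
    if target in SOURCE_FILES:
        return jobs  # an existing file: nothing to run
    for name, (inputs, output) in RULES.items():
        wildcards = match(output, target)
        if wildcards is None:
            continue
        for pattern in inputs:  # resolve dependencies first
            plan(pattern.format(**wildcards), jobs)
        if (name, target) not in jobs:
            jobs.append((name, target))
        return jobs
    raise LookupError(f"no rule produces {target}")


for rule_name, output_file in plan("plot.svg"):
    print(rule_name, "->", output_file)
```

The lookup starts at the requested target and ends at the reads; the resulting job list runs in the opposite direction, from reads to plot.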
Demo time: example_01
Topic: Snakemake Dependency Graph
• Q1: Why doesn’t it work?
Demo time: example_01
• call_peaks looks for "bams/A.bam"
• Candidate found: align, with output "bams/{sample}" => {sample} in align is "A.bam" => its input should be "reads/A.bam.fq.gz"
• call_peaks {sample} ("A") != align {sample} ("A.bam")
• A wildcard variable makes sense only inside one rule
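The mismatch can be reproduced with a small regex sketch. It is not Snakemake's own matching code: `wildcards_for` is a hypothetical helper, and an unconstrained wildcard is assumed to behave like the regex `.+`.

```python
import re


def wildcards_for(output_pattern, requested_file):
    """Match a requested file against a rule's output pattern.

    Assumption: each unconstrained {name} matches '.+' (any non-empty text).
    Each wildcard name may appear only once in the pattern in this sketch.
    """
    # re.split with one capturing group alternates: literal, name, literal, ...
    parts = re.split(r"\{(\w+)\}", output_pattern)
    regex = "".join(
        re.escape(part) if i % 2 == 0 else f"(?P<{part}>.+)"
        for i, part in enumerate(parts)
    )
    found = re.fullmatch(regex, requested_file)
    return found.groupdict() if found else None


# Broken align rule: the output pattern has no ".bam" suffix.
broken = wildcards_for("bams/{sample}", "bams/A.bam")
print(broken, "->", "reads/{sample}.fq.gz".format(**broken))

# Fixed align rule: the suffix is part of the pattern, so {sample} is just "A".
fixed = wildcards_for("bams/{sample}.bam", "bams/A.bam")
print(fixed, "->", "reads/{sample}.fq.gz".format(**fixed))
```

In the broken case the input path derived inside align contains "A.bam" rather than "A", which is why the file is not found.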
Demo time: example_01
DAG: Jobs Graph vs Rules Graph
Practical advice:
Use the rules graph for your pipelines
• Real use cases: 10..1000 input files
• The full graph with all jobs will be extremely large
• The rules graph is compact
snakemake --cores 1 --rulegraph | dot -Tsvg > rulegraph.svg
Figures: DAG with rules only; DAG with all jobs
Demo time: example_01
How to quickly find things in the snakemake command-line help
• snakemake --help | less: in less, press ‘/’ and type ‘rulegraph’ (search in less works the same as in Vim)
Useful snakemake options for debugging:
• --dry-run: builds the DAG, checks the pipeline and exits; doesn’t execute rules
• --debug-dag: prints wildcard details etc. while inferring the DAG; doesn’t stop rules execution
• --rulegraph: prints a compact DAG (rules only) and exits
Some useful snakemake methods:
• touch: creates an empty file; can be used in the output: section to mock shell/run sections
• directory: marks an output: section argument that is a directory, not a file
• protected: marks an output: section argument to create ‘read-only’ files for important pipeline results
Snakemake Editing Tools Matter
Why am I using PyCharm + SnakeCharm?
Let’s compare:
• cat
• vim + snakemake syntax highlighting
• Atom (recommended by Snakemake)
• PyCharm + SnakeCharm Plugin
Example2: cat
Let’s use an editor w/o syntax highlighting, e.g. cat (or less / vim / nano / …)
Pros
• Installed on almost any Linux machine
• Works in an SSH session
• Fast & light
Cons
• Easy to make an error
• Hard to read pipeline code
Example2: vim + snakemake bundle
Let’s use an editor with a syntax highlighting plugin, e.g. vim (or nano / …)
Pros
• Installed on almost any Linux machine
• Works in an SSH session
• Fast & light
• Code looks better, some errors are easier to notice
Cons
• Requires installing the snakemake bundle
• Still easy to make an error
Supplementary: How do I enable syntax highlighting in Vim for Snakefiles?
Example2: Atom
Let’s use the IDE suggested by the official Snakemake tutorial. They should know better which tool to use, right?
Pros
• Fast & light
• Code looks readable, some errors are easier to notice
• Reasonable default choice
Cons
• Requires installation
• Unlikely to work via SSH
• Still easy to make an error (see later)
Is this code OK?
14 mistakes in 24 lines
Example_02: PyCharm + SnakeCharm
SnakeCharm is developed by the JetBrains Biolabs Team
Pros
• Code analysis, lots of errors highlighted
• Smart code completion
• PyCharm is also good for:
  • Python, Markdown, R, ..
  • Git
  • ….
Cons
• Requires installation
• Could show false positives
• Can’t be used via SSH directly (put the code in git, or use sshfs or other tricks)
Text Editors Takeaways:
My own preferences:
• Local machine
  • PyCharm IDE
  • SnakeCharm plugin: for Snakemake
  • IdeaVim plugin: `Vim style` emulation
• Remote machine (computation clusters, docker machines, …)
  • Vim / Nano + Snakemake syntax bundle
Keep your pipeline in Git:
• Use it to sync the pipeline with the remote machine
• Convenient for pipeline development
“Snakenstein”
Snakemake file = mixture of:
• Python code
• Some additional syntax
  • rule declarations, rule sections, etc.
  • special syntax in strings: "path/{sample}.bam"
snakemake tool:
• reads the Snakefile
• generates valid Python code
• executes the Python code in a special Python environment
A “Frankenstein” in terms of programming language
Example_03: Snakefile - Not Python
snakemake --print-compilation > Snakefile.py
Snakemake generates a Python file and executes it!
Section arguments types
Most sections support:
• positional arguments:
  • string arguments
  • lists of strings
  • other Python expressions
• named arguments: key1 = value
The input section TEXT is inserted into a Python function call, workflow.input(…)
=> same syntax as for Python method call arguments, e.g. as in
print("fooo", "boo", file=…, end=..)
See the output of --print-compilation
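A plain-Python sketch of why section arguments follow call syntax. `input_section` is a hypothetical stand-in for the generated call, not the real workflow.input signature, and the file names are made up for illustration.

```python
def input_section(*paths, **named_paths):
    """Hypothetical stand-in for the function that receives an `input:` section."""
    return {"positional": list(paths), "named": named_paths}


samples = ["A", "B"]

# Everything between the parentheses is what would be written inside the section.
args = input_section(
    "genome.fa",                             # string argument
    [f"reads/{s}.fq.gz" for s in samples],   # any Python expression, here a list
    index="genome.fa.fai",                   # named argument: key = value
)
print(args)
```

As in any Python call, positional arguments have to come before named ones.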
Lambda / Input Functions
Lambda functions
• Access to: wildcards, threads, input, output, resources
• Different sections - different sets of arguments for input functions
Input function:
• Similar to lambda functions, but for larger pieces of code
• Only for input: sections
• Can be used to handle dynamic dependencies (see checkpoints)
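A minimal sketch of the two forms, runnable without Snakemake. The `READS` table and its paths are hypothetical, and `SimpleNamespace` only imitates the wildcards object that Snakemake would pass in.

```python
from types import SimpleNamespace

# Hypothetical sample sheet: sample name -> reads path.
READS = {"A": "reads/run1/A.fq.gz", "B": "reads/run2/B.fq.gz"}


def reads_for_sample(wildcards):
    """Input function: receives the job's wildcards, returns the input path."""
    return READS[wildcards.sample]


# The same lookup as a lambda, the way it could be written inline in a section.
reads_lambda = lambda wildcards: READS[wildcards.sample]

# Stand-in for the wildcards object of one concrete job.
fake_wildcards = SimpleNamespace(sample="B")
print(reads_for_sample(fake_wildcards))
print(reads_lambda(fake_wildcards))
```

In a Snakefile the function itself, not its result, is passed to the section, so that it can be called once per job with that job's wildcards.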
Sections Syntax is not Equal
• output, log, benchmark:
  • lambdas/input functions cannot be used, only expressions
• input:
  • expressions, lambdas, also “input functions”
• threads: lambdas/functions or expressions that return integer or float values
• shell:, wrapper: only one positional argument, an expression returning a Python string
• run: Python code block
• ….
Check Snakemake docs / Trust SnakeCharm!
Wildcards Syntax
3 types:
1. Sections that introduce wildcards:
• output, log, benchmark. E.g.: "peaks/{sample}.bed"
• => lambdas cannot be used here
• => the set of wildcards should be the same in these sections
• everything in `{..}` is a wildcard name, e.g. `{config[reads]}`
2. Sections that use wildcards w/o the `wildcards.` prefix:
• input, params, … E.g.: "peaks/{sample}.bed"
• everything in `{..}` is a wildcard name, e.g. `{SOME_VARIABLE}`
3. Sections that require the `wildcards.` prefix:
• message, shell, run, … E.g.: "peaks/{wildcards.sample}.bed"
• w/o the wildcards prefix - just Python, e.g. `{config[reads]}`, `{SOME_VARIABLE}`
Constraining wildcards example: "sorted_reads/{sample,[A-Za-z0-9]+}.bam"
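What the constraint changes can be shown with ordinary regular expressions. The two patterns below are an assumed translation of the wildcard syntax (a constrained wildcard uses its own regex, an unconstrained one behaves like `.+`); `sample_of` is a hypothetical helper.

```python
import re

# Assumed regex equivalent of "sorted_reads/{sample,[A-Za-z0-9]+}.bam".
constrained = re.compile(r"sorted_reads/(?P<sample>[A-Za-z0-9]+)\.bam")
# The same pattern without a constraint: "sorted_reads/{sample}.bam".
unconstrained = re.compile(r"sorted_reads/(?P<sample>.+)\.bam")


def sample_of(regex, path):
    """Return the value of {sample} if the whole path matches, else None."""
    found = regex.fullmatch(path)
    return found.group("sample") if found else None


for path in ["sorted_reads/A.bam", "sorted_reads/A.sorted.bam", "sorted_reads/x/y.bam"]:
    print(path, sample_of(constrained, path), sample_of(unconstrained, path))
```

The constrained form rejects names with dots or slashes, which keeps a rule from matching files it was never meant to produce.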
Three Execution Phases
1. Python module loading
• File top level
• Section arguments
2. DAG computation
• input and lambda functions
3. Rule running
• run, script, shell, wrapper sections
• bonus: after DAG computation, but before/after all rules execution
See Example_03.6
Rules References
rules.NAME.output.key
• Performance improvement for DAG computation in large pipelines
• Reduces code duplication - fewer ERRORs!
• Helps in finding usages of a rule in code
Documentation
See Example_04
Dynamic Dependencies
Sometimes not all intermediate file names are known before pipeline execution
Examples:
• Download some files (e.g. fastq) from a database by id
• Align samples, perform QC, use only samples that passed QC downstream
Use:
• checkpoint rules (see data-dependent-conditional-execution)
• the dynamic flag for output is deprecated and will be removed
Checkpoints
Special rule sub-type:
• declaration ~ rule syntax
• usage: input function + checkpoint ref: checkpoints.NAME.get(**wildcards).output.key
DAG evaluation:
• Evaluate the DAG except for rules ‘using’ the checkpoint (syntax above), e.g. w/o expression rule
• Run pipeline
• Re-calc the DAG after the checkpoint has finished (separately for every wildcard, if used)
• Run pipeline
See Example_05
Figure labels: declaration; usage syntax; Do download
Shell Commands
• Different Python string syntaxes are available
• Use the most convenient one for the situation
See Example_06.1
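A short sketch of the string forms in plain Python. The commands and paths are illustrative only, and `str.format` stands in for Snakemake's own formatting; in a real shell: section the placeholder would be written as {wildcards.sample}.

```python
# Adjacent string literals are concatenated by Python; note the trailing space.
cmd_concat = (
    "bwa mem genome.fa reads/{sample}.fq.gz "
    "| samtools view -Sb - > bams/{sample}.bam"
)

# Triple-quoted string: newlines are kept, convenient for several commands.
cmd_triple = """
mkdir -p bams
bwa mem genome.fa reads/{sample}.fq.gz > bams/{sample}.sam
"""

# Pitfall: without the trailing space two arguments are silently glued together.
cmd_broken = (
    "bwa mem genome.fa"
    "reads/{sample}.fq.gz"
)

print(cmd_concat.format(sample="A"))
print(cmd_triple.format(sample="A"))
print(cmd_broken.format(sample="A"))
```

The last command prints as "bwa mem genome.fareads/A.fq.gz", a typical bug with multi-line commands split into several literals.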
Wrappers, Scripts, ...
shell is enough, but alternatives are:
• wrapper: See docs. Recommended way for standard tools.
  • Automatically installs the tool via conda
  • Keeps each tool in a separate conda env
  • Collects shell args for you
  • Wrappers repo: https://snakemake-wrappers.readthedocs.io
• script: Syntax sugar to pass arguments into Python, R scripts.
• notebook: A way to launch a notebook (R or Python) and use its output, see docs
See Example_06.2
Figure: wrapper example
Project Layout
• Recommended Structure
• Keep pipeline settings in config.yaml
• Pass input file information as TSV / CSV tables with columns like sample name, reads path, etc.
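A minimal sketch of reading such a table with the standard library. The column names `sample` and `reads`, the table contents and the `peaks/` targets are assumptions for illustration; a real project would read a file instead of the inline string.

```python
import csv
import io

# Hypothetical sample sheet; in a project this would be a file next to config.yaml.
SAMPLES_TSV = "sample\treads\nA\treads/A.fq.gz\nB\treads/B.fq.gz\n"


def load_samples(handle):
    """Return {sample name: reads path} from a tab-separated table."""
    return {row["sample"]: row["reads"] for row in csv.DictReader(handle, delimiter="\t")}


SAMPLES = load_samples(io.StringIO(SAMPLES_TSV))

# Target list that a final rule could request, one file per sample.
TARGETS = [f"peaks/{name}.bed" for name in SAMPLES]
print(SAMPLES)
print(TARGETS)
```

Keeping sample information in a table means that adding a sample changes the data and leaves the pipeline code untouched.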
Results Consistency / Reproducibility
Snakemake is smart enough, but only with your assistance!
• Rule output exists:
  • Recalculated only if input files changed (not always for checkpoints)
  • If an input is marked with `ancient`, the file modification date is not checked
• Rule fails:
  • Only files mentioned in output: of the failed rule are deleted
  • => Mention all tool output files in output:, or use shadow: and mention only the required files
• shadow:
  • For tools that output too many files
  • Runs the tool in a temp directory: .snakemake/shadow/tmpxxx
  • Copies only the files requested in the output: section
  • Uses symlinks to make everything work
  • Shadow levels: minimal, shallow, full
  • `minimal` - most cases
See Example_07
Bad Pipeline - Inconsistent Results
Snakemake - not a silver bullet
You can always break all Snakemake conventions and write an inconsistent, non-reproducible pipeline. Think about what you are doing and why.
Example of broken conventions: (code figure)
Bugs in the above example:
• ‘my_file.csv’ will be incorrect if several jobs work in parallel
• if one of the samples fails, ‘my_file.csv’ will contain inconsistent results
• also, ‘my_file.csv’ won’t be deleted
• also, ‘my_file.csv’ won’t be recalculated automatically on the next pipeline launch
• if an input file changes, ‘my_file.csv’ won’t be recalculated automatically
Computation Environment Reproducibility
See “distribution and reproducibility” in the snakemake docs
conda
• Used to install tools automatically with proper versions
• --use-conda snakemake option + conda: section in the pipeline
wrappers
• Shell commands + conda environment. Required: --use-conda option
docker
• You can run the whole pipeline, or each job, in a single docker container
• See the --use-singularity snakemake option and the container: section
• up to 100% reproducibility
Computational clusters
Different types: HPC, LSF, Slurm, …
Idea:
• Only Linux terminal (SSH) access to the cluster entry point (`login` node machine)
• Each rule - single job submission
Snakemake:
• Does all the complicated work for you ~ feels like a local machine
• localrules: forces a job to be launched without job submission
• Required options: --profile XXX, --jobscript XXX.sh, --restart-times NN
My latest data processing was:
• 120 WGBS samples
• 10 TB reads, 150 machines, 2 weeks, 100k+ jobs
• Each rule - own Docker container
Snakemake Pipeline Examples
• Community-created workflows - https://github.com/snakemake-workflows
  Curated, but not all are good
• Our team workflows:
  • ChIP-Seq: https://github.com/JetBrains-Research/chipseq-smk-pipeline
  • SC ATAC-Seq: https://github.com/JetBrains-Research/scasat-smk-pipeline
  • WGBS Methylation: <not published yet>
Out of Scope
Please read snakemake docs for
• Reports
• Jupyter integration
• Piped output
• Benchmark Rules
• Handling Ambiguous Rules
• Subworkflows
Resources
• Snakemake Documentation: https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html
• Snakemake Wrappers: https://snakemake.readthedocs.io/en/stable/
• SnakeCharm Plugin: https://jetbrains-research.github.io/snakecharm/