linux intro 5 extra: makefiles
Post on 21-Oct-2014
Lecture for the "Programming for Evolutionary Biology" workshop in Leipzig 2012 (http://evop.bioinf.uni-leipzig.de/)
Programming for Evolutionary Biology
March 17th - April 1st 2012
Leipzig, Germany
Introduction to Unix systems
Extra: writing simple pipelines with make
Giovanni Marco Dall'Olio
Universitat Pompeu Fabra
Barcelona (Spain)
GNU/make
make is a tool to store command-line instructions and re-execute them quickly, along with all their parameters
It is a declarative programming language
It belongs to a class of software called 'automated build tools'
Simplest Makefile example
The simplest Makefile contains just the name of a task and the commands associated with it:
print_hello is a makefile 'rule': it stores the commands needed to say 'Hello, world!' to the screen.
A Makefile rule is made of:
the target of the rule
the commands associated with the rule
Each command line starts with a tabulation (not 8 spaces)
Create a file in your computer and save it as 'Makefile'.
Write these instructions in it:
print_hello:
	echo 'Hello, world!!'
The command line starts with a tabulation (<Tab> key)
Then, open a terminal and type:
make -f Makefile print_hello
Simplest Makefile example – explanation
When invoked, the program 'make' looks for a file in the current directory called 'Makefile'
When we type 'make print_hello', it executes any procedure (target) called 'print_hello' in the makefile
It then shows the commands executed and their output
Tip1: the 'Makefile' file
The '-f' option allows you to define the file which contains the instructions for make
If you omit this option, make will look for any file called 'Makefile' in the current directory
make -f Makefile all
is equivalent to:
make all
A slightly longer example
You can add as many commands as you like to a rule
For example, this 'print_hello' rule contains 5 commands
Note: ignore the '@' thing, it is only to disable verbose mode (explained later)
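The slide with the five-command rule is not reproduced in this transcript, so here is a minimal sketch of a rule with several commands. The messages are made up for illustration; the first command has no '@' and is therefore shown before it runs, the others are silenced. The Makefile is written with printf so that every command line starts with a real tab (\t).

```shell
# Work in a scratch directory so no existing Makefile is overwritten.
cd "$(mktemp -d)"

# Each printf call writes one line of the Makefile; \t is the mandatory tab.
{
  printf 'print_hello:\n'
  printf '\techo "this command is shown before it runs"\n'
  printf '\t@echo "Hello, world!!"\n'
  printf '\t@echo "this rule has several commands"\n'
} > Makefile

# Run the rule and keep what make prints.
output=$(make print_hello)
printf '%s\n' "$output"
```

Only the first command appears twice in the output (once as the command, once as its result), because it is the only one without '@'.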
A more complex example
Make - advantages
Make allows you to save shell commands along with their parameters and re-execute them;
It lets you work with command-line tools, which are more flexible;
Combined with revision control software, it makes it possible to reproduce all the operations performed on your data;
Second part
A closer look at make syntax (target and commands)
The target syntax
Makefile syntax:
<target>: (prerequisites)
	<commands associated to the rule>
The target of a rule can be either a title for the task, or a file name.
Every time you call a make rule (example: 'make all'), the program looks for a file with the same name as the target (e.g. 'all', 'clean', 'inputdata.txt', 'results.txt')
The rule is executed only if that file doesn't exist.
Filename as target names
In this makefile, we have two rules: 'testfile.txt' and 'clean'
When we call 'make testfile.txt', make checks if a file called 'testfile.txt' already exists.
The commands associated with the rule 'testfile.txt' are executed only if that file doesn't already exist
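The Makefile from this slide is missing from the transcript. The following is a sketch of what such a file could look like: the two rule names come from the text, while the commands inside them are assumptions.

```shell
# Work in a scratch directory so no existing files are touched.
cd "$(mktemp -d)"

{
  printf 'testfile.txt:\n'
  printf '\techo "test file" > testfile.txt\n'
  printf 'clean:\n'
  printf '\trm -f testfile.txt\n'
} > Makefile

first=$(make testfile.txt)    # the file is missing, so the command runs
second=$(make testfile.txt)   # the file exists now, so make does nothing
printf '%s\n%s\n' "$first" "$second"

make clean                    # removes the file; the next make recreates it
```

The second call only prints a message saying that the target is up to date. 'clean' is not a file name, so its commands run every time it is called.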
Multiple target definition
A target can also be a list of files
You can retrieve the matched target with the special variable $@
Special characters
The % character can be used as a wild card
For example, a rule with the target:
%.txt:....
would be activated by any target ending with '.txt': 'make 1.txt', 'make 2.txt', etc.
We will be able to retrieve the matched expression with '$*'
Special character %: creating more than one file at a time
Makefile – cluster support
Note that in the previous example we created three files at the same time, by executing the command 'touch' three times
If we use the '-j' option when invoking make, the three processes will be launched in parallel
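The example these slides refer to is not part of the transcript. A possible reconstruction, assuming an 'all' rule that lists three .txt files and a '%.txt' rule that creates each one with 'touch':

```shell
# Work in a scratch directory.
cd "$(mktemp -d)"

# In a printf format string, a literal % has to be written as %%.
{
  printf 'all: 1.txt 2.txt 3.txt\n'
  printf '%%.txt:\n'
  printf '\ttouch $@\n'
} > Makefile

# Allow up to three commands to run at the same time.
make -j 3 all
ls
```

The three 'touch' commands are independent of each other, which is why make is free to run them in parallel. The order in which they are printed may change from one run to the next.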
The commands syntax
Makefile syntax:
<target>: (prerequisites)
	<commands associated to the rule>
Disabling verbose mode
You can disable the verbose mode for a line by adding '@' at its beginning.
Skipping errors
The modifier '-' tells make to ignore errors returned by a command
Example:
'mkdir /var' will cause an error (the '/var' directory already exists) and cause gnu/make to exit
'-mkdir /var' will cause an error anyway, but gnu/make will ignore it
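A small sketch of the two cases side by side. The rule names 'strict' and 'lenient' are made up for this illustration, and the example assumes that /var already exists, as it does on most Unix systems.

```shell
# Work in a scratch directory.
cd "$(mktemp -d)"

{
  printf 'strict:\n'
  printf '\tmkdir /var\n'
  printf '\t@echo "never reached"\n'
  printf 'lenient:\n'
  printf '\t-mkdir /var\n'
  printf '\t@echo "make kept going"\n'
} > Makefile

# Record the exit status of each run (error messages are hidden).
strict_status=0
make strict 2>/dev/null || strict_status=$?
lenient_status=0
make lenient 2>/dev/null || lenient_status=$?

echo "strict exit status: $strict_status, lenient exit status: $lenient_status"
```

In the 'strict' rule make stops at the failing 'mkdir' and returns a non-zero exit status. In the 'lenient' rule the error is reported but the next command still runs.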
Moving through directories
A big issue with make is that every line is executed as a different shell process.
So, this:
lsvar:
	cd /var
	ls
Won't work (it will list only the files in the current directory, not /var)
The solution is to put everything in a single process:
lsvar:
	(cd /var; ls)
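To make the difference easy to see, this sketch prints the working directory with 'pwd' instead of listing files with 'ls'. The rule names 'separate' and 'joined' are made up for the illustration.

```shell
# Work in a scratch directory.
cd "$(mktemp -d)"

{
  printf 'separate:\n'
  printf '\t@cd /var\n'
  printf '\t@pwd\n'
  printf 'joined:\n'
  printf '\t@(cd /var; pwd)\n'
} > Makefile

separate=$(make separate)   # 'cd' and 'pwd' run in two different shells
joined=$(make joined)       # both commands share one shell
echo "separate lines: $separate"
echo "single line:    $joined"
```

With separate lines, the 'cd' is forgotten as soon as its shell exits, so 'pwd' still reports the scratch directory. On a single line, 'pwd' reports /var.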
Third part
Prerequisites and conditional execution
Makefile syntax:
<target>: (prerequisites)
	<commands associated to the rule>
We will now look at the 'prerequisites' part of a make rule, which I skipped before
Real Makefile-rule syntax
Complete syntax for a Makefile rule:
<target>: <list of prerequisites>
	<commands associated to the rule>
Example:
result1.txt: data1.txt data2.txt
	cat data1.txt data2.txt > result1.txt
	@echo 'result1.txt has been calculated'
Prerequisites are files (or rules) that need to exist already in order to create the target file.
If 'data1.txt' and 'data2.txt' don't exist, the rule 'result1.txt' will exit with an error (no rule to create them)
Piping Makefile rules together
You can pipe two Makefile rules together by defining prerequisites
The rule 'result1.txt' depends on the rule 'data1.txt', which should be executed first
Let's look at this example again: what happens if we remove the file 'result1.txt' we just created?
The second time we run the 'make result1.txt' command, it is not necessary to create data1.txt again, so only one rule is executed
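The Makefile for this example is not included in the transcript. A sketch under the assumption that 'data1.txt' is created by a simple 'echo' and 'result1.txt' is a copy of it:

```shell
# Work in a scratch directory.
cd "$(mktemp -d)"

{
  printf 'result1.txt: data1.txt\n'
  printf '\tcat data1.txt > result1.txt\n'
  printf 'data1.txt:\n'
  printf '\techo "some data" > data1.txt\n'
} > Makefile

first=$(make result1.txt)    # runs both rules: data1.txt, then result1.txt
rm result1.txt
second=$(make result1.txt)   # data1.txt still exists: only one rule runs

printf 'first run:\n%s\n' "$first"
printf 'second run:\n%s\n' "$second"
```

The first run prints two commands, the second run only the 'cat' command.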
Other pipe example
all: result1.txt result2.txt
result1.txt: data1.txt calculate_result.py
	python calculate_result.py --input data1.txt
result2.txt: data2.txt
	cut -f 1,3 data2.txt > result2.txt
'make all' will calculate result1.txt and result2.txt if they don't exist already (or if they are older than their prerequisites)
Conditional execution by modification date
We have seen how make can be used to create a file, if it doesn't exist.
file.txt:
	# if file.txt doesn't exist, then create it:
	echo 'contents of file.txt' > file.txt
We can do better: create or update a file only if it is missing or older than its prerequisites
Let's have a better look at this example:
result1.txt: data1.txt calculate_result.py
	python calculate_result.py --input data1.txt
A great feature of make is that it executes a rule not only if the target file doesn't exist, but also if it has a 'last modification date' earlier than any of its prerequisites
result1.txt: data1.txt
	@sed 's/b/B/i' data1.txt > result1.txt
	@echo 'result1.txt has been calculated'
In this example, result1.txt will be recalculated every time 'data1.txt' is modified
$: touch data1.txt calculate_result.py
$: make result1.txt
result1.txt has been calculated
$: make result1.txt
result1.txt is already up-to-date
$: touch data1.txt
$: make result1.txt
result1.txt has been calculated
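The session above can be replayed as a script. This is a sketch rather than the original slide: the 'sed' command is simplified to the portable form 's/b/B/', and a 'sleep 1' is added as a precaution so that the second 'touch' gives data1.txt a strictly newer timestamp even on file systems with one-second resolution.

```shell
# Work in a scratch directory.
cd "$(mktemp -d)"

{
  printf 'result1.txt: data1.txt\n'
  printf '\t@sed "s/b/B/" data1.txt > result1.txt\n'
  printf '\t@echo "result1.txt has been calculated"\n'
} > Makefile
echo "abc" > data1.txt

run1=$(make result1.txt)   # target missing: the rule runs
run2=$(make result1.txt)   # target is newer than data1.txt: nothing to do
sleep 1                    # make sure the next timestamp is strictly newer
touch data1.txt
run3=$(make result1.txt)   # data1.txt is newer again: the rule runs

printf '%s\n%s\n%s\n' "$run1" "$run2" "$run3"
```

The exact wording of the "up to date" message in the second run depends on the version of make.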
Conditional execution - applications
This 'conditional execution by modification date comparison' feature of make is very useful
Let's say you discover an error in one of your input files: you will be able to repeat the analysis by executing only the operations needed
You can also use it to recalculate results every time you modify a script:
result.txt: scripts/calculate_result.py
	python scripts/calculate_result.py > result.txt
Another example
Fourth part
Variables and functions
You may have already noticed that Make's syntax is really old :)
In fact, it is a ~40-year-old language
It uses special variables like $@, $^, and it can be worse than perl!!! (perl developers – please don't get mad at me :) )
Variables
Variables are declared with a '=' and by convention are upper case.
They are referenced by including their name in '$()'
WORKING_DIR is a variable
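The Makefile shown on this slide is not in the transcript. A minimal sketch of declaring and using WORKING_DIR; the value './data' and the rule name 'show_dir' are assumptions made for the example.

```shell
# Work in a scratch directory.
cd "$(mktemp -d)"

{
  printf 'WORKING_DIR = ./data\n'
  printf 'show_dir:\n'
  printf '\t@echo "working in $(WORKING_DIR)"\n'
} > Makefile

# make replaces $(WORKING_DIR) with its value before running the command.
message=$(make show_dir)
echo "$message"
```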
Special variables - $@
Make uses some custom variables, with a syntax similar to perl
'$@' always corresponds to the target name:
$: cat >Makefile
%.txt:
	echo $@
$: make filename.txt
echo filename.txt
filename.txt
$:
$@ took the value of 'filename.txt'
Other special variables
$@   The rule's target
$<   The rule's first prerequisite
$?   All the rule's out-of-date prerequisites
$^   All prerequisites
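A sketch that prints the four variables for one rule. The file names 'out.txt', 'a.txt' and 'b.txt' are invented for the example. On the first run the target does not exist, so every prerequisite counts as out of date; on the second run only the touched file does.

```shell
# Work in a scratch directory.
cd "$(mktemp -d)"

{
  printf 'out.txt: a.txt b.txt\n'
  printf '\t@echo "target=$@ first=$< all=$^ newer=$?"\n'
  printf '\t@cat $^ > $@\n'
} > Makefile
echo a > a.txt
echo b > b.txt

first=$(make out.txt)    # out.txt is missing: $? lists both prerequisites
sleep 1                  # make sure the next timestamp is strictly newer
touch b.txt
second=$(make out.txt)   # only b.txt is newer than out.txt now

printf '%s\n%s\n' "$first" "$second"
```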
Functions
Usually you don't want to declare functions in make, but there are some builtin utilities that can be useful
Most frequently used functions:
$(addprefix <prefix>, <list>): adds a prefix to a space-separated list
example:
FILES = file1 file2 file3
$(addprefix /home/user/data/, $(FILES))
→ /home/user/data/file1 /home/user/data/file2 /home/user/data/file3
$(addsuffix <suffix>, <list>) works similarly
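Both functions can be tried out with a rule that just prints their result. The rule name 'show' and the '.txt' suffix are assumptions made for this sketch.

```shell
# Work in a scratch directory.
cd "$(mktemp -d)"

{
  printf 'FILES = file1 file2 file3\n'
  printf 'show:\n'
  printf '\t@echo $(addprefix /home/user/data/,$(FILES))\n'
  printf '\t@echo $(addsuffix .txt,$(FILES))\n'
} > Makefile

output=$(make show)
printf '%s\n' "$output"
```

Note that the prefix is glued to each name as it is, so the trailing '/' has to be part of the prefix.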
Full makefile example
INPUTFILES = lower_DAF lower_maf upper_maf \
             lower_daf upper_daf
RESULTSDIR = ./results
RESULTFILES = $(addprefix $(RESULTSDIR)/, \
              $(addsuffix _filtered.txt,$(INPUTFILES)))
help:
	@echo 'type "make all" to calculate results'
all: $(RESULTFILES)
$(RESULTSDIR)/%_filtered.txt: data/%.txt src/filter_genes.py
	python src/filter_genes.py --genes \
	data/Genes.txt --window $< --output $@
It looks very complicated, but in the end you always use the same Makefile structure
Fifth part
Testing, discussion, other examples and alternatives
Testing a makefile
make -n: only shows the commands that would be executed, without running them
You can pass variables to make:
$: make say_hello MYNAME="Giovanni"
hello, Giovanni
Strongly suggested: use a Revision Control Software with support for branching (git, hg, bazaar) and create a branch for testing
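The 'say_hello' rule itself is not shown in the transcript. Assuming it is a single 'echo' that uses $(MYNAME), both testing techniques can be tried like this:

```shell
# Work in a scratch directory.
cd "$(mktemp -d)"

{
  printf 'say_hello:\n'
  printf '\t@echo "hello, $(MYNAME)"\n'
} > Makefile

# Pass a variable on the command line.
greeting=$(make say_hello MYNAME="Giovanni")
# Dry run: the command is printed (even with '@') but not executed.
dry_run=$(make -n say_hello MYNAME="Giovanni")

printf '%s\n%s\n' "$greeting" "$dry_run"
```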
Another complex Makefile example
# make masked sequence
myseq.m: myseq
	rmask myseq > myseq.m
# run blast on masked seq
blastout: mydb.psq myseq.m
	blastx mydb myseq.m > blastout
	echo "ran blast!"
# index blastable db
mydb.psq: mydb
	formatdb -p T mydb
# rules follow this pattern:
target: subtarget1, ..., subtargetN
	shell command 1
	shell command 2
	...
our starting point is the file myseq, the end point is the blast results blastout
we first want to mask out any repeats using rmask to create myseq.m
we then blastx myseq.m against a protein db called mydb
before blastx is run the protein db must be indexed using formatdb
(slide taken from biomake web site)
The “make” command
# run blast on masked seq
blastout: mydb.psq myseq.m
	blastx mydb myseq.m > blastout
	echo "ran blast!"
# index blastable db
mydb.psq: mydb
	formatdb -p T mydb
# make masked sequence
myseq.m: myseq
	rmask myseq > myseq.m
% make blastout
formatdb -p T mydb
rmask myseq.fst > myseq.m
blastx mydb myseq.m > blastout
% make blastout
make: 'blastout' is up to date
% cat newseqs >> mydb
% make blastout
formatdb -p T mydb
blastx mydb myseq.m > blastout
make uses unix file modification timestamps when checking dependencies
if a subtarget is more recent than the goal target, then reexecute action
(slide taken from biomake web site)
BioMake and alternatives
BioMake is an alternative to make, designed for use in bioinformatics
Developed to annotate the Drosophila melanogaster genome (Berkeley university)
Cleaner syntax, derived from prolog
Separates the rule's name from the name of the target files
A BioMake example
formatdb(DB)
  req: DB
  run: formatdb DB
  comment: prepares blastdb for blasting (wublast)
rmask(Seq)
  flat: masked_seqs/Seq.masked
  req: Seq
  srun: RepeatMasker -lib $(LIB) Seq
  comment: masks out repeats from input sequence
mblastx(Seq,DB)
  flat: blast_results/Seq.DB.blastx
  req: formatdb(DB) rmask(Seq)
  srun: blastx -filter SEG+XNU DB rmask(Seq)
  comment: this target is for the results of running blastx on a masked input genomic sequence (wublast)
(slide taken from biomake web site)
Other alternatives
There are many other alternatives to make:
BioMake (prolog?)
o/q/dist/etc.. make
Ant (Java)
Scons (python)
Paver (python)
Waf (python)
This list is biased because I am a python programmer :)
These tools are more oriented to software development
Conclusions
Make is a very basic tool for bioinformatics
It is useful for the simpler tasks:
Logging the operations made to your data files
Working with clusters
Avoiding recalculations
Applying a pipeline to different datasets
It is installed in almost any unix system and has a standard syntax (interchangeable, reproducible)
Study it and understand its logic. Use it in the most basic way, without worrying about prerequisites and special variables. Later you can look for easier tools (biomake, rake, taverna, galaxy, your own, etc..)
Suggested readings
Software Carpentry for bioinformatics: http://swc.scipy.org/lec/build.html
A Makefile is a pipeline: http://www.nodalpoint.org/2007/03/18/a_pipeline_is_a_makefile
BioMake and SKAM: http://skam.sourceforge.net/
BioWiki Make Manifesto: http://biowiki.org/MakefileManifesto
Discussion on the BIP mailing list: http://www.mailarchive.com/biologyin[email protected]/msg00013.html
Gnu/Make manual by R. Stallman and R. McGrath: http://theory.uwinnipeg.ca/gnu/make/make_toc.html