linux intro 5 extra: makefiles

Programming for Evolutionary Biology, March 17th - April 1st 2012, Leipzig, Germany

Introduction to Unix systems. Extra: writing simple pipelines with make

Giovanni Marco Dall'Olio, Universitat Pompeu Fabra, Barcelona (Spain)

Post on 21-Oct-2014


DESCRIPTION

Lecture for the "Programming for Evolutionary Biology" workshop in Leipzig 2012 (http://evop.bioinf.uni-leipzig.de/)

TRANSCRIPT


GNU/make

make is a tool to store command-line instructions, along with all their parameters, and re-execute them quickly.

It is a declarative programming language.

It belongs to a class of software called 'automated build tools'.

Simplest Makefile example

The simplest Makefile contains just the name of a task and the commands associated with it:

print_hello is a makefile 'rule': it stores the commands needed to say 'Hello, world!' to the screen.

(Annotated figure: a Makefile rule is made of the target of the rule, followed by the commands associated with the rule. Each command line begins with a tabulation, not 8 spaces.)

Create a file on your computer and save it as 'Makefile'.

Write these instructions in it (the second line must begin with a tabulation, typed with the <Tab> key):

print_hello:
	echo 'Hello, world!!'

Then, open a terminal and type:

make -f Makefile print_hello

Simplest Makefile example - explanation

When invoked, the program 'make' looks for a file in the current directory called 'Makefile'

When we type 'make print_hello', it executes any procedure (target) called 'print_hello' in the makefile

It then shows the commands executed and their output
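The behaviour described above can be tried with a short shell session. This is only a sketch, assuming GNU make is installed; the scratch directory and the `out` variable are illustrative and not part of the original slides. Each `\t` in the printf arguments becomes the tab that must start a command line.

```shell
# Work in a scratch directory so that no existing Makefile is overwritten.
cd "$(mktemp -d)"

# Write the Makefile line by line; %b turns each \t into a real tab.
printf '%b\n' \
  'print_hello:' \
  '\techo "Hello, world!"' \
  > Makefile

# make finds ./Makefile, runs the target, and prints the command
# followed by the command's own output.
out=$(make print_hello)
echo "$out"
```

The first line printed is the command itself, the second is its output.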

Tip 1: the 'Makefile' file

The '-f' option lets you specify the file that contains the instructions for make.

If you omit this option, GNU make looks in the current directory for a file called 'GNUmakefile', 'makefile' or 'Makefile', in that order.

make -f Makefile all

is equivalent to:

make all

A slightly longer example

You can add as many commands as you like to a rule.

For example, this 'print_hello' rule contains 5 commands.

Note: ignore the '@' for now; it only disables verbose mode (explained later).

Make - advantages

Make allows you to save shell commands along with their parameters and re-execute them.

It allows you to use command-line tools, which are more flexible.

Combined with revision control software, it makes it possible to reproduce all the operations performed on your data.

Second part

A closer look at make syntax (target and commands)

The target syntax

Makefile syntax:

<target>: (prerequisites)
	<commands associated with the rule>

The target of a rule can be either a title for the task, or a file name.

Every time you call a make rule (example: 'make all'), the program looks for a file with the same name as the target (e.g. 'all', 'clean', 'inputdata.txt', 'results.txt').

In the simplest case (a rule without prerequisites), the rule is executed only if that file doesn't exist.

File names as target names

In this makefile, we have two rules: 'testfile.txt' and 'clean'.

When we call 'make testfile.txt', make checks whether a file called 'testfile.txt' already exists.

The commands associated with the rule 'testfile.txt' are executed only if that file doesn't already exist.
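The makefile shown on the slide was not preserved in this transcript, so the following is a hypothetical reconstruction with the same two rules, assuming GNU make is installed. The file contents and the variable names `first`, `second` and `third` are made up for illustration.

```shell
cd "$(mktemp -d)"

# 'testfile.txt' is a file-name target, 'clean' is a task title.
printf '%b\n' \
  'testfile.txt:' \
  '\techo "test file contents" > testfile.txt' \
  'clean:' \
  '\trm -f testfile.txt' \
  > Makefile

first=$(make testfile.txt)    # file is missing: the command runs
second=$(make testfile.txt)   # file exists: nothing is executed
make clean                    # removes the file, so the rule can run again
third=$(make testfile.txt)

echo "first run:  $first"
echo "second run: $second"
echo "third run:  $third"
```

The second run only reports that the target is up to date, while the first and third runs execute the echo command.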

Multiple target definition

A target can also be a list of files

You can retrieve the matched target with the special variable $@

Special characters

The % character can be used as a wildcard. For example, a rule with the target:

%.txt:....

would be activated by any target ending with '.txt': 'make 1.txt', 'make 2.txt', etc.

The part matched by % can be retrieved with '$*'.

Special character %: creating more than one file at a time

Makefile - cluster support

Note that in the previous example we created three files at the same time, by executing the command 'touch' three times.

If we use the '-j' option when invoking make, the three processes will be launched in parallel.
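The three-file example itself was an image and is missing from this transcript. A minimal sketch of what such a makefile could look like, assuming GNU make; the file names 1.txt, 2.txt and 3.txt and the echo message are assumptions:

```shell
cd "$(mktemp -d)"

# One pattern rule covers every .txt target: $@ is the full target name,
# $* is the part matched by %.
printf '%b\n' \
  'all: 1.txt 2.txt 3.txt' \
  '%.txt:' \
  '\t@echo "creating $@ (stem $*)"' \
  '\ttouch $@' \
  > Makefile

# -j 3 allows up to three commands to run at once, so the order of the
# output lines may vary from run to run.
out=$(make -j 3 all)
echo "$out"
ls
```

Giving '-j' a number caps how many commands run simultaneously, which is usually safer than leaving it unlimited.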

The commands syntax

Makefile syntax:

<target>: (prerequisites)
	<commands associated with the rule>

Disabling verbose mode

You can disable verbose mode for a line by adding '@' at its beginning: make then runs the command without printing it first.

Skipping errors

The modifier '-' tells make to ignore errors returned by a command.

Example:

'mkdir /var' will cause an error (the '/var' directory already exists) and cause GNU make to exit.

'-mkdir /var' will cause an error anyway, but GNU make will ignore it.
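The difference can be checked with a small experiment. This is a sketch assuming GNU make; the directory `existing_dir` is a hypothetical stand-in for '/var', and the target names `strict` and `lenient` are made up for illustration.

```shell
cd "$(mktemp -d)"
mkdir existing_dir   # stand-in for /var: a directory that already exists

printf '%b\n' \
  'strict:' \
  '\tmkdir existing_dir' \
  '\t@echo "strict: reached the second command"' \
  'lenient:' \
  '\t-mkdir existing_dir' \
  '\t@echo "lenient: reached the second command"' \
  > Makefile

# Without "-": mkdir fails, make stops and returns a non-zero exit status.
if make strict > strict.log 2>&1; then strict_status=0; else strict_status=$?; fi

# With "-": mkdir still fails, but make carries on with the next command.
if make lenient > lenient.log 2>&1; then lenient_status=0; else lenient_status=$?; fi

echo "strict exit status:  $strict_status"
echo "lenient exit status: $lenient_status"
cat lenient.log
```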

Moving through directories

A big issue with make is that every line is executed as a different shell process.

So, this:

lsvar:
	cd /var
	ls

won't work (it will list only the files in the current directory, not /var).

The solution is to put everything in a single process:

lsvar:
	(cd /var; ls)
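To see both behaviours side by side, here is a sketch assuming GNU make. It uses a hypothetical local directory `subdir` instead of '/var' so that the listing is predictable; the target names `separate` and `together` are also made up.

```shell
cd "$(mktemp -d)"
mkdir subdir               # stand-in for /var
touch subdir/inside.txt

printf '%b\n' \
  'separate:' \
  '\t@cd subdir' \
  '\t@ls' \
  'together:' \
  '\t@(cd subdir; ls)' \
  > Makefile

separate=$(make separate)   # cd and ls run in different shells
together=$(make together)   # cd and ls share one shell

echo "separate lines list: $separate"
echo "single line lists:   $together"
```

Only the 'together' target lists the contents of the subdirectory.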

Third part

Prerequisites and conditional execution

The commands syntax

Makefile syntax:

<target>: (prerequisites)
	<commands associated with the rule>

We will now look at the 'prerequisites' part of a make rule, which I had skipped before.

Real Makefile-rule syntax

Complete syntax for a Makefile rule:

<target>: <list of prerequisites>
	<commands associated with the rule>

Example:

result1.txt: data1.txt data2.txt
	cat data1.txt data2.txt > result1.txt
	@echo 'result1.txt has been calculated'

Prerequisites are files (or rules) that must already exist in order to create the target file.

If 'data1.txt' and 'data2.txt' don't exist and there is no rule to create them, 'make result1.txt' will exit with an error.

Piping Makefile rules together

You can pipe two Makefile rules together by defining prerequisites.

The rule 'result1.txt' depends on the rule 'data1.txt', which is executed first.

Let's look at this example again: what happens if we remove the file 'result1.txt' we just created?

The second time we run the 'make result1.txt' command, it is not necessary to create data1.txt again, so only one rule is executed.
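The makefile from these slides was an image and is not in the transcript. The sketch below is a hypothetical reconstruction, assuming GNU make: the commands inside the two rules and the messages they print are invented for illustration.

```shell
cd "$(mktemp -d)"

# result1.txt lists data1.txt as a prerequisite, so make builds data1.txt first.
printf '%b\n' \
  'result1.txt: data1.txt' \
  '\t@echo "rule result1.txt: building from data1.txt"' \
  '\t@sed "s/b/B/" data1.txt > result1.txt' \
  'data1.txt:' \
  '\t@echo "rule data1.txt: creating input data"' \
  '\t@echo "abc" > data1.txt' \
  > Makefile

first=$(make result1.txt)    # both rules run, data1.txt first
rm result1.txt
second=$(make result1.txt)   # data1.txt still exists: only one rule runs

echo "--- first run ---";  echo "$first"
echo "--- second run ---"; echo "$second"
```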

Other pipe example

all: result1.txt result2.txt

result1.txt: data1.txt calculate_result.py
	python calculate_result.py --input data1.txt

result2.txt: data2.txt
	cut -f 1,3 data2.txt > result2.txt

'make all' will calculate result1.txt and result2.txt if they don't already exist, or if they are older than their prerequisites.

Conditional execution by modification date

We have seen how make can be used to create a file if it doesn't exist.

file.txt:
	# if file.txt doesn't exist, then create it:
	echo 'contents of file.txt' > file.txt

We can do better: create or update a file only if it is missing or older than its prerequisites.

Let's have a better look at this example:

result1.txt: data1.txt calculate_result.py
	python calculate_result.py --input data1.txt

A great feature of make is that it executes a rule not only if the target file doesn't exist, but also if its 'last modification date' is earlier than that of any of its prerequisites.

result1.txt: data1.txt
	@sed 's/b/B/i' data1.txt > result1.txt
	@echo 'result1.txt has been calculated'

In this example, result1.txt will be recalculated every time 'data1.txt' is modified:

$: touch data1.txt calculate_result.py
$: make result1.txt
result1.txt has been calculated
$: make result1.txt
make: 'result1.txt' is up to date.
$: touch data1.txt
$: make result1.txt
result1.txt has been calculated

Conditional execution - applications

This 'conditional execution by modification date comparison' feature of make is very useful.

Let's say you discover an error in one of your input data files: you will be able to repeat the analysis by executing only the operations needed.

You can also use it to re-calculate results every time you modify a script:

result.txt: scripts/calculate_result.py
	python scripts/calculate_result.py > result.txt

Fourth part

Variables and functions

Variables and functions

You may have already noticed that Make's syntax is really old :)

In fact, it is a ~40-year-old language.

It uses special variables like $@ and $^, and it can be worse than perl!!! (perl developers - please don't get mad at me :-) )

Variables

Variables are declared with a '=' and by convention are upper case.

They are referenced by wrapping their name in '$()'.

For example, if WORKING_DIR is a variable, its value is obtained with $(WORKING_DIR).
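A minimal sketch of declaring and using a variable, assuming GNU make. The value './results' and the target name `show_dir` are assumptions, since the slide's own example was an image.

```shell
cd "$(mktemp -d)"

# Declare WORKING_DIR with '=', then reference it as $(WORKING_DIR).
printf '%b\n' \
  'WORKING_DIR = ./results' \
  'show_dir:' \
  '\t@echo "working directory: $(WORKING_DIR)"' \
  > Makefile

out=$(make show_dir)
echo "$out"
```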

Special variables - $@

Make uses some custom variables, with a syntax similar to perl.

'$@' always corresponds to the target name:

$: cat >Makefile
%.txt:
	echo $@

$: make filename.txt
echo filename.txt
filename.txt
$:

$@ took the value of 'filename.txt'

Other special variables

$@    The rule's target
$<    The rule's first prerequisite
$?    All the rule's out-of-date prerequisites
$^    All prerequisites
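The four variables can be printed from a single rule to see what each one holds. This is a sketch assuming GNU make; the file names a.txt, b.txt and merged.txt are hypothetical, and the timestamps are set explicitly with 'touch -t' so that the result does not depend on how fast the commands run.

```shell
cd "$(mktemp -d)"
echo "first"  > a.txt
echo "second" > b.txt

printf '%b\n' \
  'merged.txt: a.txt b.txt' \
  '\t@echo "target: $@"' \
  '\t@echo "first prerequisite: $<"' \
  '\t@echo "all prerequisites: $^"' \
  '\t@echo "out-of-date prerequisites: $?"' \
  '\t@cat $^ > $@' \
  > Makefile

# merged.txt does not exist yet, so every prerequisite counts as out of date.
first=$(make merged.txt)

# Set explicit timestamps: a.txt older than merged.txt, b.txt newer.
touch -t 200001010000 a.txt
touch -t 200101010000 merged.txt
touch -t 200201010000 b.txt

# Only b.txt is newer than the target now, so the last line lists b.txt alone.
second=$(make merged.txt)

echo "--- first run ---";  echo "$first"
echo "--- second run ---"; echo "$second"
```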

Functions

Usually you don't want to declare functions in make, but there are some built-in utilities that can be useful.

Most frequently used functions:

$(addprefix <prefix>, list): adds a prefix to each element of a space-separated list

example:

FILES = file1 file2 file3
$(addprefix /home/user/data/, $(FILES))

$(addsuffix) works similarly.
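To check what the two functions expand to, one can echo their results from a rule. A small sketch, assuming GNU make; the target name `show` and the '.txt' suffix are illustrative choices.

```shell
cd "$(mktemp -d)"

printf '%b\n' \
  'FILES = file1 file2 file3' \
  'show:' \
  '\t@echo $(addprefix /home/user/data/,$(FILES))' \
  '\t@echo $(addsuffix .txt,$(FILES))' \
  > Makefile

# First line: every name with the prefix; second line: every name with the suffix.
out=$(make show)
echo "$out"
```

Note that the prefix is glued to each name as-is, so a directory prefix needs its trailing slash.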

Full makefile example

INPUTFILES = lower_DAF lower_maf upper_maf \
             lower_daf upper_daf

RESULTSDIR = ./results

RESULTFILES = $(addprefix $(RESULTSDIR)/, \
              $(addsuffix _filtered.txt,$(INPUTFILES)))

help:
	@echo 'type "make all" to calculate results'

all: $(RESULTFILES)

$(RESULTSDIR)/%_filtered.txt: data/%.txt src/filter_genes.py
	python src/filter_genes.py --genes \
	data/Genes.txt --window $< --output $@

It looks very complicated, but in the end you always use the same Makefile structure.

Fifth part

Testing, discussion, other examples and alternatives

Testing a makefile

make -n: only shows the commands that would be executed, without running them.

You can pass variables to make:

$: make say_hello MYNAME="Giovanni"
hello, Giovanni

Strongly suggested: use revision control software with support for branching (git, hg, bazaar) and create a branch for testing.
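The 'say_hello' rule itself is not shown in the transcript. A hypothetical version, assuming GNU make, that demonstrates both the dry run and a variable passed on the command line; the default value 'world' is an assumption.

```shell
cd "$(mktemp -d)"

printf '%b\n' \
  'MYNAME = world' \
  'say_hello:' \
  '\t@echo "hello, $(MYNAME)"' \
  > Makefile

dry_run=$(make -n say_hello)                  # command is printed, not run
default=$(make say_hello)                     # uses the value from the Makefile
override=$(make say_hello MYNAME="Giovanni")  # command-line value wins

echo "dry run:  $dry_run"
echo "default:  $default"
echo "override: $override"
```

With '-n' the command is printed even though it starts with '@', which makes it a convenient way to inspect a rule before running it.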

Another complex Makefile example

# make masked sequence
myseq.m: myseq
	rmask myseq > myseq.m

# run blast on masked seq
blastout: mydb.psq myseq.m
	blastx mydb myseq.m > blastout
	echo "ran blast!"

# index blastable db
mydb.psq: mydb
	formatdb -p T mydb

# rules follow this pattern:
target: subtarget1 ... subtargetN
	shell command 1
	shell command 2
	...

our starting point is the file myseq, the end point is the blast results blastout

we first want to mask out any repeats using rmask to create myseq.m

we then blastx myseq.m against a protein db called mydb

before blastx is run the protein db must be indexed using formatdb

(slide taken from biomake web site)

The "make" command

# run blast on masked seq
blastout: mydb.psq myseq.m
	blastx mydb myseq.m > blastout
	echo "ran blast!"

# index blastable db
mydb.psq: mydb
	formatdb -p T mydb

# make masked sequence
myseq.m: myseq
	rmask myseq > myseq.m

% make blastout
formatdb -p T mydb
rmask myseq > myseq.m
blastx mydb myseq.m > blastout

% make blastout
make: 'blastout' is up to date

% cat newseqs >> mydb
% make blastout
formatdb -p T mydb
blastx mydb myseq.m > blastout

make uses unix file modification timestamps when checking dependencies

if a subtarget is more recent than the goal target, then re-execute action

(slide taken from biomake web site)

BioMake and alternatives

BioMake is an alternative to make, designed for use in bioinformatics.

Developed to annotate the Drosophila melanogaster genome (Berkeley university).

Cleaner syntax, derived from prolog.

Separates the rule's name from the name of the target files.

A BioMake example

formatdb(DB)
  req: DB
  run: formatdb DB
  comment: prepares blastdb for blasting (wublast)

rmask(Seq)
  flat: masked_seqs/Seq.masked
  req: Seq
  srun: RepeatMasker -lib $(LIB) Seq
  comment: masks out repeats from input sequence

mblastx(Seq,DB)
  flat: blast_results/Seq.DB.blastx
  req: formatdb(DB) rmask(Seq)
  srun: blastx -filter SEG+XNU DB rmask(Seq)
  comment: this target is for the results of running blastx on a masked input genomic sequence (wublast)

(slide taken from biomake web site)

Other alternatives

There are many other alternatives to make:

BioMake (prolog?)
o/q/dist/etc.. make
Ant (Java)
Scons (python)
Paver (python)
Waf (python)

This list is biased because I am a python programmer :)

These tools are more oriented to software development.

Conclusions

Make is very basic for bioinformatics.

It is useful for the simpler tasks:

Logging the operations made to your data files
Working with clusters
Avoiding re-calculations
Applying a pipeline to different datasets

It is installed on almost any unix system and has a standard syntax (interchangeable, reproducible).

Study it and understand its logic. Use it in the most basic way, without worrying about prerequisites and special variables. Later you can look for easier tools (biomake, rake, taverna, galaxy, your own, etc.).

Suggested readings

Software Carpentry for bioinformatics: http://swc.scipy.org/lec/build.html

A Makefile is a pipeline: http://www.nodalpoint.org/2007/03/18/a_pipeline_is_a_makefile

BioMake and SKAM: http://skam.sourceforge.net/

BioWiki Make Manifesto: http://biowiki.org/MakefileManifesto

Discussion on the BIP mailing list: http://www.mail-archive.com/biology-in-[email protected]/msg00013.html

GNU Make manual by R. Stallman and R. McGrath: http://theory.uwinnipeg.ca/gnu/make/make_toc.html