More “What Perl can do” With an introduction to BioPerl Ian Donaldson Biotechnology Centre of Oslo IMBV 3070



More “What Perl can do”

With an introduction to BioPerl

Ian Donaldson
Biotechnology Centre of Oslo

IMBV 3070


Much of the material in this lecture is from the “Perl” lecture and lab developed for the Canadian Bioinformatics Workshops by

Will Hsiao
Sohrab Shah
Sanja Rogic

and released under the Creative Commons license


http://creativecommons.org/licenses/by-sa/2.5/


More “What can Perl do”

• So far, we’ve had a very brief introduction to Perl

• Next, we want to go a little deeper into

  • Use of “strict”
  • Perl regular expressions
  • Modules
  • An introduction to object-oriented Perl and
  • BioPerl


strict


Effects of “use strict”

• Requires you to declare variables
• Warns you about possible typos in variables

Correct:
    my $DNA;
    $DNA = "ATCG";
or
    my $DNA = "ATCG";

Incorrect:
    $DNA = "ATCG";

No warning:
    my $DNA = "ATCG";
    $DNA =~ tr/ATCG/TAGC/;

Warning (under strict the misspelled, undeclared $DAN stops the program at compile time):
    my $DNA = "ATCG";
    $DAN =~ tr/ATCG/TAGC/;
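
A minimal sketch of the typo case above. The helper name strict_error is made up for this illustration, and the snippets are compiled with a string eval only so that the compile-time error can be caught and printed instead of stopping the script.

```perl
use strict;
use warnings;

# strict_error is a hypothetical helper for this demo: it compiles and runs a
# snippet under "use strict" and returns the error text ('' if there is none).
sub strict_error {
    my ($code) = @_;
    my $ok = eval "use strict; $code; 1";
    return $ok ? '' : $@;
}

# Single quotes keep $DNA and $DAN as literal text inside the snippets.
my $correct = 'my $DNA = "ATCG"; $DNA =~ tr/ATCG/TAGC/';
my $typo    = 'my $DNA = "ATCG"; $DAN =~ tr/ATCG/TAGC/';

for my $snippet ($correct, $typo) {
    my $error = strict_error($snippet);
    print "$snippet\n";
    if ($error) {
        print "  rejected: $error";
    }
    else {
        print "  accepted\n";
    }
}
```

In an ordinary script you would not need the eval: perl simply refuses to run the file and names the undeclared variable.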


Why bother with “use strict”?

• Enforces some good programming rules
• Helps to prevent silly errors
• Makes troubleshooting your program easier
• Becomes essential as your code becomes longer
• We will use strict in all the code you see today and in your assignment
• Bottom line: ALWAYS use strict


Exercise 12

Write a program that has one function.

Use a variable named “$some_variable” in this function and in the main body of the program.

Prove that you can alter the value of $some_variable in the function without changing the value of $some_variable in the main body of the program.

Try it yourself (15 minutes) then check the answer at the end of this lecture.


regular expressions


What is a Regular Expression?

• REGEX provides pattern matching ability
• Tells you whether a string contains a pattern or not (Note: it’s a yes or no question!)

Regular expression looking for “dog”:

“Yesterday I saw a big black dog”: “Yes” or “True”
“My dog ate my homework”: “Yes” or “True”
“I have a golden retriever”: “No” or “False”
“Dog! Human’s best friend”: “No” since REGEX is case sensitive
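
The yes/no idea can be tried directly. This is a small sketch using the four sentences from the slide; the function name contains_dog is invented for the example.

```perl
use strict;
use warnings;

# Returns "Yes" if the text contains the pattern "dog", otherwise "No".
sub contains_dog {
    my ($text) = @_;
    return $text =~ m/dog/ ? "Yes" : "No";
}

my @sentences = (
    "Yesterday I saw a big black dog",
    "My dog ate my homework",
    "I have a golden retriever",
    "Dog! Human's best friend",
);

for my $sentence (@sentences) {
    print contains_dog($sentence), ": $sentence\n";
}
```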


Regular expressions are “regular”

Look at these yeast open reading frame names.

YDR0001W
YDR4567C
YAL0045W
YBL0008C

While they are all different, they all follow a pattern (or regular expression).
1. ‘Y’ means yeast
2. A letter between A and L represents a chromosome
3. An ‘R’ or ‘L’ refers to an arm of the chromosome
4. A four-digit number refers to an open reading frame
5. A ‘W’ or a ‘C’ refers to either the Watson or Crick strand

You can write a regular expression to recognize ALL yeast open reading frame names.


Perl REGEX example

my $text = "The dog ate my homework";
if ($text =~ m/dog/) {
    print "The text contains a dog\n";
}

• =~ is the binding operator and m// is the match operator. Together they ask: “does the string on the left contain the pattern on the right?”

• /dog/ is my pattern or regular expression
• The matching operation results in a true or false answer


Regular Expressions in Perl

• A pattern that matches only one string is not very useful!

• We need symbols to represent classes of characters

• For example, say you wanted to recognize ‘Dog’ or ‘dog’ as being instances of the same thing

• REGEX is its own little language inside Perl
  – Has different syntax and symbols!
  – Symbols which you have used in Perl such as $ . { } [ ] have totally different meanings in REGEX


REGEX Metacharacters

• Metacharacters allow a pattern to match different strings
  – Wildcards are examples of metacharacters
  – /.og/ will match “dog”, “log”, “tog”, “ og”, etc.
  – So . means “any character”
  – Perl REGEX has much more powerful metacharacters used to represent classes of characters


Types of Metacharacters

.    matches any one character (including a space) except “\n”

[ ]  denotes a selection of characters and matches ONE of the characters in the selection. What does [ATCG] match?

\t, \s, \n  match a tab, any whitespace character, and a newline, respectively

\w   matches any one character in [a-zA-Z0-9_]

\d   matches [0-9]

\D   matches anything except [0-9]
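
A short sketch that tries one pattern per metacharacter from the list above. The test strings and the helper name matches are made up for the illustration.

```perl
use strict;
use warnings;

# Returns 1 if the text matches the pattern, otherwise 0.
sub matches {
    my ($text, $pattern) = @_;
    return $text =~ $pattern ? 1 : 0;
}

# Each row: a label, a test string, and a pattern using one metacharacter.
my @checks = (
    [ 'dot',            "log",     qr/.og/          ],
    [ 'dot needs char', "og",      qr/.og/          ],
    [ 'character set',  "GATTACA", qr/[ATCG]/       ],
    [ 'tab',            "a\tb",    qr/a\tb/         ],
    [ 'word and digit', "ORF_42",  qr/\w\w\w\w\d\d/ ],
    [ 'non-digit',      "1234",    qr/\D/           ],
);

for my $check (@checks) {
    my ($label, $text, $pattern) = @$check;
    printf "%-16s %s\n", $label, matches($text, $pattern) ? "match" : "no match";
}
```

The “ORF_42” row is there to show that \w also accepts the underscore.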


Using metacharacters to build a regular expression

YBL3456W

/Y[A-L][RL]\d\d\d\d[WC]/

Is this a good pattern for a yeast ORF name?
What else does it match?
What if the name only has 3 digits?
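
One way to explore these questions is to run the pattern against a few names. The second pattern below is only an assumed variant for comparison, not part of the original slides; it uses the {min,max} quantifier and the ^ and $ anchors that are covered later in this lecture. The names “xxYBL3456Wxx” and “YBL345W” are invented test cases.

```perl
use strict;
use warnings;

# The pattern from the slide, exactly as written.
my $orf_loose = qr/Y[A-L][RL]\d\d\d\d[WC]/;

# An assumed stricter variant: anchored at both ends, three or four digits.
my $orf_strict = qr/^Y[A-L][RL]\d{3,4}[WC]$/;

# Returns "match" or "no match" for a name and a pattern.
sub check {
    my ($name, $pattern) = @_;
    return $name =~ $pattern ? "match" : "no match";
}

for my $name ("YBL3456W", "xxYBL3456Wxx", "YBL345W") {
    printf "%-14s loose: %-9s strict: %s\n",
        $name, check($name, $orf_loose), check($name, $orf_strict);
}
```

The loose pattern accepts a valid name buried inside a longer string and rejects the three-digit name; the assumed variant does the opposite on both counts.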


REGEX Quantifiers

• What if you want to match a character more than once?

• What if you want to match an mRNA with a polyA tail that is 5 to 12 A’s long?

“ATG……AAAAAAAAAAA”


REGEX Quantifiers

• + matches one or more copies of the previous character

• * matches zero or more copies of the previous character

• ? matches zero or one copy of the previous character

• {min,max} matches a number of copies within the specified range

“ATG……AAAAAAAAAAA”

/ATG[ATCG]+A{5,12}/
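
A quick sketch that runs the pattern above on three invented sequences. The function name looks_like_polyA_mrna is hypothetical.

```perl
use strict;
use warnings;

# Returns 1 if the sequence contains ATG, then one or more bases,
# then a run of 5 to 12 A's (the pattern from the slide).
sub looks_like_polyA_mrna {
    my ($seq) = @_;
    return $seq =~ /ATG[ATCG]+A{5,12}/ ? 1 : 0;
}

for my $seq ("ATGGCCTTAAAAAAA", "ATGGCCTTAAA", "ATGAAAAA") {
    print "$seq: ", (looks_like_polyA_mrna($seq) ? "match" : "no match"), "\n";
}
```

The last sequence fails even though it has five A’s after ATG, because [ATCG]+ must use up at least one base before A{5,12} can start.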


REGEX Anchors

• The previous pattern is not strictly correct because:
  – It’ll match a string that doesn’t start with ATG
  – It’ll match a string that doesn’t end with poly A’s

• Anchors tell REGEX that a pattern must occur at the beginning or at the end of a string


REGEX Anchors

• ^ anchors the pattern to the start of a string

• $ anchors the pattern to the end of a string

/^ATG[ATCG]+A{5,12}$/
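
The effect of the anchors can be seen by comparing the two patterns side by side. The sequences below are invented for the comparison, and check is a hypothetical helper.

```perl
use strict;
use warnings;

my $unanchored = qr/ATG[ATCG]+A{5,12}/;
my $anchored   = qr/^ATG[ATCG]+A{5,12}$/;

# Returns "match" or "no match" for a sequence and a pattern.
sub check {
    my ($seq, $pattern) = @_;
    return $seq =~ $pattern ? "match" : "no match";
}

# The second sequence has extra bases before ATG, the third after the A's.
for my $seq ("ATGGCCTTAAAAAAA", "CCATGGCCTTAAAAAAA", "ATGGCCTTAAAAAAACC") {
    printf "%-18s unanchored: %-9s anchored: %s\n",
        $seq, check($seq, $unanchored), check($seq, $anchored);
}
```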


REGEX is greedy!

• The revised pattern is still incorrect because
  – It’ll match a string that has more than 12 A’s at the end

• Quantifiers will try to match as many copies of a sub-pattern as possible!

/^ATG[ATCG]+A{5,12}$/

“ATGGCCCGGCCTTTCCCAAAAAAAAAAAA”

Here the greedy [ATCG]+ grabs as much of the string as it can, including most of the A’s, and leaves only the last five A’s for A{5,12}.


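Curb that Greed!

• ? after a quantifier prevents REGEX from being greedy

/^ATG[ATCG]+?A{5,12}$/

“ATGGCCCGGCCTTTCCGAAAAAAAAAAAA”

Here the non-greedy [ATCG]+? stops as early as it can, so A{5,12} matches all of the trailing A’s. Non-greediness changes which sub-pattern matches which part of the string; it does not by itself reject a string that ends in more than 12 A’s.

• Note: this is the second use of the question mark. What is the other use of ? in REGEX?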
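
A sketch that checks the “more than 12 A’s” problem directly. The third pattern is an assumption added for comparison, not something from the original slides: it forces the base just before the tail to be a non-A, which also means at least one non-A base must follow ATG. The helper name accepts and the test sequences are made up.

```perl
use strict;
use warnings;

my $greedy = qr/^ATG[ATCG]+A{5,12}$/;
my $lazy   = qr/^ATG[ATCG]+?A{5,12}$/;

# Assumed variant: the base just before the poly-A tail must not be an A.
my $bounded = qr/^ATG[ATCG]*[TCG]A{5,12}$/;

# Returns 1 if the whole sequence is accepted by the pattern, otherwise 0.
sub accepts {
    my ($seq, $pattern) = @_;
    return $seq =~ $pattern ? 1 : 0;
}

my $tail_12 = "ATGGCC" . "A" x 12;
my $tail_20 = "ATGGCC" . "A" x 20;

for my $row ([ 'greedy', $greedy ], [ 'lazy', $lazy ], [ 'bounded', $bounded ]) {
    my ($label, $pattern) = @$row;
    printf "%-8s 12 A's: %d   20 A's: %d\n",
        $label, accepts($tail_12, $pattern), accepts($tail_20, $pattern);
}
```

Both the greedy and the non-greedy pattern accept the sequence with 20 A’s, because [ATCG] is allowed to absorb the extra A’s; only the bounded variant rejects it.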


REGEX Capture

• What if you want to keep the part of a string that matches your pattern?

• Use ( ) “memory parentheses”

“ATGGCCCGGCCTTTCCGAAAAAAAAAAAA”

/^ATG([ATCG]+?)A{5,12}$/


REGEX Capture

• What’s inside the first ( ) is assigned to $1
• What’s inside the second ( ) is $2 and so on
• So $2 eq “AAAAAAAAAAAA”

/^ATG([ATCG]+?)(A{5,12})$/

The first ( ) is $1 and the second ( ) is $2
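
A minimal sketch of capturing, which also makes the difference between the greedy and the non-greedy version visible. The helper name split_mrna is invented; the sequence is built from the slide’s example with a tail of twelve A’s.

```perl
use strict;
use warnings;

# Returns the two captured parts ($1 and $2), or an empty list if no match.
sub split_mrna {
    my ($seq, $pattern) = @_;
    if ($seq =~ $pattern) {
        return ($1, $2);
    }
    return;
}

my $mrna = "ATGGCCCGGCCTTTCCG" . "A" x 12;

my @rows = (
    [ 'non-greedy', qr/^ATG([ATCG]+?)(A{5,12})$/ ],
    [ 'greedy',     qr/^ATG([ATCG]+)(A{5,12})$/  ],
);

for my $row (@rows) {
    my ($label, $pattern) = @$row;
    my ($body, $tail) = split_mrna($mrna, $pattern);
    print "$label: \$1 = $body, \$2 = $tail\n";
}
```

With the non-greedy pattern $2 holds all twelve A’s; with the greedy one $1 swallows seven of them and $2 is left with five.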


REGEX Modifiers

• Modifiers come after a pattern and affect the entire pattern

• You have seen //g already, which does global matching (/T/g) and global replacement (s/T/U/g)

• Other useful modifiers:

//i    makes the pattern case insensitive

//s    lets . match a newline

//m    lets ^ and $ (anchors) match next to an embedded newline

s///e  evaluates the replacement as a Perl expression
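
Each modifier can be tried in one line. This is an illustrative sketch; the strings and the helper names transcribe and double_numbers are made up.

```perl
use strict;
use warnings;

# //i : ignore case
print "i: 'Dog' matches /dog/i\n" if "Dog! Human's best friend" =~ /dog/i;

# //s : let . match a newline
print "s: . matched the newline\n" if "ATG\nTAA" =~ /ATG.TAA/s;

# //m : let ^ and $ match next to an embedded newline
print "m: found a line that is exactly ATG\n" if ">seq1\nATG\nTAA" =~ /^ATG$/m;

# s///g : replace every T, not just the first one
sub transcribe {
    my ($dna) = @_;
    (my $rna = $dna) =~ s/T/U/g;
    return $rna;
}
print "g: ", transcribe("ATTGCT"), "\n";

# s///e : evaluate the replacement as Perl code (here, double each number)
sub double_numbers {
    my ($text) = @_;
    $text =~ s/(\d+)/$1 * 2/ge;
    return $text;
}
print "e: ", double_numbers("exons 3 and 21"), "\n";
```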


REGEX Summary

• REGEX is its own little language!!!
• REGEX is one of the main strengths of Perl

• To learn more:
  • Learning Perl (3rd ed.) Chapters 7, 8, 9
  • Programming Perl (3rd ed.) Chapter 5
  • Mastering Regular Expressions (2nd ed.)
  • http://www.perl.com/doc/manual/html/pod/perlre.html
  • A good cheat sheet is: http://www.biotek.uio.no/EMBNET/guides/guideRegExp.pdf


Exercise 13

In a text file, write out three strings that match the following regular expression

/^ATG?C*[ATCG]+?A{3,10}$/

Write a program that reads each string from the text file and checks your answers.

Try it yourself (30 min) then look at the answer at the end of this lecture.


modules


What are Modules

• A module is a “logical” collection of functions
• Using modules has the same advantage as using functions; i.e., it simplifies code (makes it modular) and facilitates code reuse

• Each collection (or module) has its own “name space”

Name space: a table containing the names of variables and functions used in your code
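
A small sketch of what separate name spaces buy you. The package names SeqTools and BlastTools are hypothetical; in real code each would normally live in its own .pm file.

```perl
use strict;
use warnings;

# Two packages (name spaces) can each define a function called describe
# without interfering with one another.
package SeqTools;
sub describe { return "SeqTools::describe works on sequences"; }

package BlastTools;
sub describe { return "BlastTools::describe works on BLAST reports"; }

# Back in the main program, the package name says which describe we mean.
package main;
print SeqTools::describe(), "\n";
print BlastTools::describe(), "\n";
```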


Why Use Modules?

• Modules allow you to use others’ code to extend the functionality of your program.

• There are a lot of Perl modules.


Finding out what modules you already have

In Perl, each module is a file stored in some directory in your system.

The system that this class is using stores Perl modules (like CGI.pm) in one of two directories:

C:\bin\Perl\lib
C:\bin\Perl\site\lib


Finding out what modules you already have

• To find out where modules are installed, type

perl -V

at the command prompt

• To find out what is installed, type

perldoc perllocal

at the command prompt.
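
The same questions can be asked from inside a program. This is a minimal sketch: @INC is Perl’s built-in list of directories searched for modules, while the helper name module_available is made up for the example.

```perl
use strict;
use warnings;

# @INC holds the directories that perl searches for modules.
print "Perl looks for modules in:\n";
print "  $_\n" for @INC;

# Returns 1 if the named module can be loaded, otherwise 0.
sub module_available {
    my ($name) = @_;
    (my $file = "$name.pm") =~ s{::}{/}g;   # Bio::SeqIO becomes Bio/SeqIO.pm
    my $ok = eval { require $file; 1 };
    return $ok ? 1 : 0;
}

for my $module ("strict", "Bio::SeqIO") {
    printf "%-12s %s\n", $module,
        module_available($module) ? "installed" : "not found";
}
```

Whether Bio::SeqIO is reported as installed depends on your own system.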


Using Modules

• To use a module, you need to include the line:

use modulename;

at the beginning of your program.

• But you already knew that…

use strict;
use warnings;
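
Modules that provide functions work the same way. A short sketch using List::Util, a module that ships with Perl; the ORF lengths are made-up numbers.

```perl
use strict;
use warnings;

# Load the module and ask for two of its functions by name.
use List::Util qw(max sum);

# Made-up ORF lengths, just for illustration.
my @orf_lengths = (300, 1200, 450);

print "longest ORF: ", max(@orf_lengths), "\n";
print "total length: ", sum(@orf_lengths), "\n";
```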


Where to find modules

• You can search for modules (and documentation) that may be useful to your particular problem using http://search.cpan.org/

• CPAN: Comprehensive Perl Archive Network

• Central repository for Perl modules and more

• “If it’s written in Perl, and it’s helpful and free, it’s probably on CPAN”

• http://www.perl.com/CPAN/


Exercise 14

Open a web browser

Go to http://search.cpan.org/

Type in “bioperl”

Follow the link to Bio::Tools::Blast

Read the example code

Copy the example code to a file and try to run it.


Bioperl Overview

• The Bioperl project – www.bioperl.org
• Comprehensive, well documented set of Perl modules
• A bioinformatics toolkit for:
  • Format conversion
  • Report processing
  • Data manipulation
  • Sequence analyses
  • and more!

• Written in object-oriented Perl


Bioperl Overview

• The last exercise most likely did not work (unless you have BioPerl installed)

• So let’s install it…


How to install modules

• This class is using the ActiveState version of Perl that comes with a program called ppm (Perl Package Manager)

• At the command prompt type

>ppm

And follow the instructions in the exercise


How to install modules (without ppm)

• If you are not using ActiveState Perl, you can also install modules from CPAN using:

>perl -MCPAN -e "install 'Some::Module'"

• Module dependency is taken care of automatically

• You’ll (usually) need to be root to install a module successfully


Exercise 15

Install bioperl

1. At the command line prompt type

>ppm

2. Then at the ppm prompt type

ppm> search bioperl

3. Then type

ppm> install bioperl

Try running the example code from the last exercise.

Enter the code on the next slide and run it.


Exercise 16

#bioperl example code
use strict;
use warnings;

#make the bioperl module (class) accessible to your program
use Bio::DB::RefSeq;

#make a new instance (object) of the class and name it
my $refseq = new Bio::DB::RefSeq;

#call a method of the object to do something
#in this case, another object is returned
my $molecule = $refseq->get_Seq_by_acc('NM_006732');

#call a method or retrieve an attribute of the object
#in this case, the sequence is returned
print "seq is ", $molecule->seq, "\n";


What are objects?

• Examples of objects in real life:
  – My car, my dog, my dishwasher…
• Objects have ATTRIBUTES and METHODS

Some attributes of my dog Fido:
• Color of fur = brown
• Height = 20 cm
• Owner’s Name = Ian
• Weight = 2 kg
• Tail position = up

Some methods of my dog Fido:
• Bark
• Walk
• Run
• Eat
• Wag tail


What is a class?

• A class is a type of object in the real world:
  – Cars, dogs, dishwashers…
• Classes have ATTRIBUTES and METHODS

Some attributes of a dog (the concept of a “dog”):
• Color of fur
• Height
• Owner’s Name
• Weight
• Tail position

Some methods of a dog:
• Bark
• Walk
• Run
• Eat
• Wag tail


So an object is an instance of a class

class: the concept of “dog”

object: Fido


Objects have unique names called “references” and classes have names too.

class: Dog (the class name)

object: Fido (the reference)


By convention, a class has a method called new that is used to create objects.

class: Dog

object: Fido (the reference)

Fido = new Dog();


A reference to an object can be used to access its properties or methods.

class: Dog

object: Fido

print Fido->bark();

woof
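
The Dog picture can be written out as real Perl. This is only a sketch under the usual Perl conventions (a hash for the attributes, bless to tie it to the class); the attribute and method names are taken from the Fido example or invented to match it.

```perl
use strict;
use warnings;

package Dog;

# The constructor: stores the attributes in a hash and blesses it into the class.
sub new {
    my ($class, %attributes) = @_;
    my $self = { name => "unnamed", tail_position => "down", %attributes };
    return bless $self, $class;
}

# A method: something every Dog object can do.
sub bark {
    return "woof";
}

# A method that changes an attribute of this particular object.
sub wag_tail {
    my ($self) = @_;
    $self->{tail_position} = "up";
    return $self->{tail_position};
}

# A method that reads an attribute.
sub name {
    my ($self) = @_;
    return $self->{name};
}

package main;

# $fido is the reference to one object (instance) of the class Dog.
my $fido = Dog->new(name => "Fido", fur_color => "brown");
print $fido->name(), " says ", $fido->bark(), "\n";
print "tail position is now ", $fido->wag_tail(), "\n";
```

Dog->new(...) is the arrow form of the “new Dog()” call shown on the slide.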


A reference to an object can be used to access its properties or methods.

class: Bio::DB::RefSeq

object: $refseq

$refseq = new Bio::DB::RefSeq;

$molecule = $refseq->get_Seq_by_acc("NP_01014");

$molecule = Some sequence record


Exercise 16

#bioperl example code
use strict;
use warnings;

#make the bioperl module (class) accessible to your program
use Bio::DB::RefSeq;

#make a new instance (object) of the class and name it
my $refseq = new Bio::DB::RefSeq;

#call a method of the object to do something
#in this case, another object is returned
my $molecule = $refseq->get_Seq_by_acc('NM_006732');

#call a method or retrieve an attribute of the object
#in this case, the sequence is returned
print "seq is ", $molecule->seq, "\n";


Putting it all together

So now that you understand (sort of)
• Classes
• Objects
• Attributes and
• Methods

What remains is learning what the different classes are that are available in BioPerl and what you can do with them.

For the next exercise, use the documentation at bioperl.org* to figure out what the following code does…

*see www.bioperl.org/wiki/HOWTOs and doc.bioperl.org (then click on bioperl-live)


Exercise 17

#!/usr/bin/perl -w

use strict;

#make the Bio::SeqIO class available to my program
use Bio::SeqIO;

#create a new Bio::SeqIO object and initialize some attributes
my $seq_in = Bio::SeqIO->new(
    -file   => "myGBrecord",
    -format => "genbank");

my $seq_out = Bio::SeqIO->new(
    -file   => ">myEMBLrec",
    -format => 'EMBL');

#a sequence object
my $seq_record = $seq_in->next_seq();

$seq_out->write_seq($seq_record);


More Bioperl modules

• Bio::SeqIO: Sequence Input/Output
  – Retrieve sequence records and write to files
  – Convert sequence records from one format to another

• Bio::Seq: Manipulating sequences
  – Get subsequences ($seq->subseq($start, $end))
  – Find the length of the object ($seq->length)
  – Reverse complement a DNA sequence
  – Translate a DNA sequence … etc.

• Bio::Annotation: Annotate a sequence
  – Assign journal references to a sequence, etc.
  – Bio::Annotation is associated with an entire sequence record and not just part of a sequence (see also Bio::SeqFeature)


Some more Bioperl modules

• Bio::SeqFeature: Associate feature annotation with a sequence
  – “Features” describe specific locations in the sequence
  – E.g. 5’ UTR, 3’ UTR, CDS, SNP, etc.
  – Using this object, you can add feature annotations to your sequences
  – When you parse a GenBank file using Bioperl, the “features” of a record are stored as SeqFeature objects

• Bio::DB::GenBank, GenPept, EMBL and Swissprot: Remote Database Access
  – You can retrieve a sequence from remote databases (through the Internet) using these objects


Even more Bioperl modules

• Bio::SearchIO: Parse sequence database search reports
  – Parse BLAST reports (make custom reports)
  – Parse HMMer, FASTA, SIM4, WABA, etc.
  – Custom reports can be output to various formats (HTML, Table, etc.)
• Bio::Tools::Run::StandAloneBLAST: Run Standalone BLAST through Perl
  – By combining this and SearchIO, you can automate and customize BLAST searches
• Bio::Graphics: Draw biological entities (e.g. a gene, an exon, BLAST alignments, etc.)


Bioperl Summary

• For online documentation:
  – For this workshop: http://doc.bioperl.org/releases/bioperl-1.4/
  – Tutorial: http://www.bioperl.org/wiki/HOWTO:Beginners
  – HOWTOs: http://www.bioperl.org/wiki/HOWTOs
  – Modules: http://www.bioperl.org/wiki/Category:Core_Modules
• Literature:
  – Stajich et al., The Bioperl toolkit: Perl modules for the life sciences. Genome Res. 2002 Oct;12(10):1611-8. PMID: 12368254
• Bioperl mailing list: [email protected]
  – Best way to get help using Bioperl
  – Very active list (upwards of 10 messages a day)
• Use with caution: things change fast and without warning (unless you are on the mailing list…)


Perl Documents

• In-line documentation
  – POD = Plain Old Documentation
  – Read POD by typing perldoc <module name>
  – E.g. perldoc perl, perldoc Bio::SeqIO
• On-line documentation
  – http://www.cpan.org
  – http://www.perl.com
  – http://www.bioperl.org
• Books
  – Learning Perl (the best way to learn Perl if you know a bit about programming already)
  – Beginning Perl for Bioinformatics (example-based way to learn Perl for Bioinformatics)
  – Programming Perl (THE Perl reference book; not for the faint of heart)


Additional Book References

• Perl Cookbook 2nd edition (quick solutions to 80% of what you want to do)
• Learning Perl Objects, References & Modules (for people who want to learn objects, references and modules in Perl)
• Perl in a Nutshell (an okay quick reference)
• Perl CD Bookshelf, Version 4.0 (electronic version of the above books; best value, searchable, and kills fewer trees)
• Mastering Perl for Bioinformatics (more example-based learning)
• CGI Programming with Perl (rather outdated treatment of the subject... Not really recommended)
• Perl Graphics Programming (if you want to generate graphics using Perl; side note: Perl is probably not the best tool for generating graphics)


Answer 12

#!/usr/bin/perl
use strict;
use warnings;

#TASK: demonstrate the use of "my" in setting the
#scope of a variable
my $some_variable = 100;

#body of the main program with the function call
print "the value of some_variable is: $some_variable\n";
subroutine1();
print "but here, some_variable is still: $some_variable\n";

#subroutine using $some_variable
sub subroutine1 {
    my $some_variable = 0;
    print "in subroutine1, some_variable is: $some_variable\n";
}

#what happens if you comment out "use strict" and
#remove "my" from both declarations of $some_variable?


Answer 13

#!/usr/bin/perl
use strict;
use warnings;

#TASK: check your answers to the regex exercise

#open the input file (stop with a message if it cannot be opened)
open(my $in, '<', "myanswers.txt") or die "Cannot open myanswers.txt: $!";

#read the input file line-by-line
#for each line test if it matches a regular expression
while (my $line = <$in>) {
    chomp $line;
    my $is_correct = does_it_match($line);
    if ($is_correct) {
        print "$line is a match\n";
    }
    else {
        print "$line is NOT a match\n";
    }
}

#close input file and exit
close($in);
exit();

#does it match
sub does_it_match {
    my ($answer) = @_;
    my $is_correct = 0;
    if ($answer =~ m/^ATG?C*[ATCG]+?A{3,10}$/) {
        $is_correct = 1;
    }
    return $is_correct;
}