using perl

Using Perl

Paul Tymann

Computer Science Department

Rochester Institute of Technology

[email protected]

2

Regular Expressions

• To review your knowledge of regular expressions from the previous session…– There is redundancy in the genetic code

• GCU, GCC, GCA, and GCG all encode Alanine

– Write a series of regular expressions that could be used to match a codon to the amino acid it encodes

– For example• (GC.) would match all the codons that encode Alanine

3

Fill In The Blanks

GCA

GCC

GCG

GCU

AGA

AGG

CGA

CGC

CGG

CGU

GAC

GAU

AAC

AAU

UGC

UGU

CAA

CAG

GGA

GGC

GGG

GGU

CAC

CAU

AUA

AUC

AUU

UUA

UUG

CUA

CUC

CUG

CUU

AAA

AAG

AUG UUC

UUU

CCA

CCC

CCG

CCU

AGC

AGU

UCA

UCC

UCG

UCU

ACA

ACC

ACG

ACU

UGG UAC

UAU

GUA

GUC

GUG

GUU

UAA

UAG

UGA

4

Answers

GCA

GCC

GCG

GCU

AGA

AGG

CGA

CGC

CGG

CGU

GAC

GAU

AAC

AAU

UGC

UGU

CAA

CAG

GGA

GGC

GGG

GGU

CAC

CAU

AUA

AUC

AUU

UUA

UUG

CUA

CUC

CUG

CUU

GC. AG[AG]|

CG.

GA[CU] AA[CU] UG[CU] CA[AG] GG. CA[CU] AU[ACU] UU[AG]|

CU.

AAA

AAG

AUG UUC

UUU

CCA

CCC

CCG

CCU

AGC

AGU

UCA

UCC

UCG

UCU

ACA

ACC

ACG

ACU

UGG UAC

UAU

GUA

GUC

GUG

GUU

UAA

UAG

UGA

AA[AG] AUG UU[CU] CC. AG[CU]|

UC.

AC. UGG UA[CU] GU. UA[AG]|

UGA

5

Using Files

• In order to use a file you must associate a handle with the file using the open function

open OUT, “>Results”open FROM, “<data”

• Open returns a boolean value that indicates whether the file was opened or not

• The die statement can be used to print an error message and terminate the program

open OUT,”>Results” or die “Unable to open file ($!);

6

Read & Writing

• To print text to a file, use the print statement– print OUT “Some output”;

• To read text from a file– $line = <IN>– $line = readline(IN)

• Note that these techniques only work with text files

7

copyfile.perl

open INFILE,"<$ARGV[0]" or die "Unable to open input file ($!)";

open OUTFILE,">$ARGV[1]" or die "Unable to open output file ($!)";

while ( !eof( INFILE ) ) { $line = readline( INFILE ); print OUTFILE $line;}

close INFILE;close OUTFILE;

Open the file whose name is the first command line argument

Print a message and terminate if it cannot be opened

As long as we have not hit the end of the file

Read the next line

Close the filesWrite the line

to the copy

8

copyfile2.perl


open OUTFILE,">$ARGV[1]" or die "Unable to open output file ($!)";

while ( <INFILE> ) { print OUTFILE;}

close INFILE;close OUTFILE;

9

Files and Lists

• It is possible to read a file directly into a list• Each line in the file is an element of the list• This is useful if you need to process the data

in a file several times– You only read it once and then work with the copy

in memory


@filedata = <INFILE>;

10

Parsing Genbank Records

• Recall that perl was designed to munge the output of another program

• When you do a BLAST search you get quite a bit of information, but often you only want a small part of it

• Perl is well suited for stripping out the part of the record that you are interested in

11

Genbank Record

LOCUS AF165912 5485 bp DNA linear PLN 29-JUL-1999DEFINITION Arabidopsis thaliana CTP:phosphocholine cytidylyltransferase (CCT) gene, complete cds.ACCESSION AF165912VERSION AF165912.1 GI:5640000KEYWORDS .SOURCE Arabidopsis thaliana. ORGANISM Arabidopsis thaliana Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta; Spermatophyta; Magnoliophyta; eudicotyledons; core eudicots; Rosidae; eurosids II; Brassicales; Brassicaceae; Arabidopsis.REFERENCE 1 (bases 1 to 5485) AUTHORS Choi,Y.H., Choi,S.B. and Cho,S.H. TITLE Structure of a CTP:Phosphocholine Cytidylyltransferase Gene from Arabidopsis thaliana JOURNAL UnpublishedREFERENCE 2 (bases 1 to 5485) AUTHORS Choi,Y.H., Choi,S.B. and Cho,S.H. TITLE Direct Submission JOURNAL Submitted (06-JUL-1999) Biology, Inha University, Yonghyon-Dong 253, Inchon 402-751, Korea

12

Genbank Record

FEATURES Location/Qualifiers source 1..5485 /organism="Arabidopsis thaliana" /cultivar="Columbia Col-O" /db_xref="taxon:3702" gene 1..4637 /gene="CCT" promoter 1..1602 /gene="CCT" TATA_signal 1554..1560 /gene="CCT" mRNA join(1603..1891,2322..2438,2538..2633,2801..2843, 2918..3073,3167..3247,3874..3972,4082..4637) /gene="CCT" /product="CTP:phosphocholine cytidylyltransferase" 5'UTR 1603..1712 /gene="CCT"

13

Genbank Record

CDS join(1713..1891,2322..2438,2538..2633,2801..2843, 2918..3073,3167..3247,3874..3972,4082..4309) /gene="CCT" /EC_number="2.7.7.15" /codon_start=1 /product="CTP:phosphocholine cytidylyltransferase" /protein_id="AAD45922.1" /db_xref="GI:5640001" /translation="MSNVIGDRTEDGLSTAAAASGSTAVQSSPPTDRPVRVYADGIYD LFHFGHARSLEQAKLAFPNNTYLLVGCCNDETTHKYKGRTVMTAEERYESLRHCKWVD EVIPDAPWVVNQEFLDKHQIDYVAHDSLPYADSSGAGKDVYEFVKKVGRFKETQRTEG ISTSDIIMRIVKDYNQYVMRNLDRGYSREDLGVSFVKEKRLRVNMRLKKLQERVKEQQ ERVGEKIQTVKMLRNEWVENADRWVAGFLEIFEEGCHKMGTAIVDSIQERLMRQKSAE RLENGQDDDTDDQFYEEYFDHDMGSDDDEDEKFYDEEEVKEEETEKTVMTDAKDNK" 3'UTR 4310..4637 /gene="CCT"

14

Genbank Record

BASE COUNT 1650 a 956 c 1046 g 1833 tORIGIN 1 ccagaatggt tactatggac atccgccaac catacaagct atggtgaaat gctttatcta 61 tctcattttt agtttcaaag cttttgttat aacacatgca aatccatatc cgtaaccaat 121 atccaatcgc ttgacatagt ctgatgaagt ttttggtagt taagataaag ctcgagactg 181 atatttcata tactggatga tttagggaaa cttgcattct attcatgaac gaatgagtca 241 atacgagaca caaccaagca tgcaaggagc tgtgagttga tgttctatgc tatttaagta

lines deleted to save space…

5101 tgttgttaac caactctctt tacatattag gaccgtgctt gtcaggccaa tggttttcac 5161 ttcgaaaaat tgcttccgat atcaaactat gtgtacatta ttggtggact gtggacataa 5221 cttaaacgca taattttatt gtgtaccttt aaaataaaca atagattaca catatatata 5281 tggcaaatat ttgaacatta gatgtcaaga gaaaagtaaa acatgtcatg attacaccat 5341 ctttgttatt atttagagtg attctcacta aatcttaggc ggttagcaac cgccatagtt 5401 ttcaaaatct cattctatcg ggattaaatc tgtttttggt gactatatat aaacattggt 5461 cgaattttta ggtaagtaaa atcag//

15

Accession

• How would you go about finding the accession number?

• You can read the file line by line• See if the line starts with ACCESSION

– Does it have to start with ACCESSION?– $line =~ /^ACCESSION/;

• How would you strip out the word ACCESSION leaving only the number?– $line =~ s/ACCESSION(\s)*//;

16

accession.perl

open INFILE,"<$ARGV[0]" or die "Unable to open Genbank file ($!)";

while ( !eof( INFILE ) ) { $line = <INFILE>;

if ( $line =~ /^ACCESSION/ ) { $line =~ s/ACCESSION(\s)*//; print $line,"\n"; }}

close INFILE;

17

Your Turn!!

• Modify the script accession.perl so that in addition to the accession number it also prints out the locus and the organism

• The output from your program should look like thisAccession: AF165912

Organism: Arabidopsis thaliana

Locus: AF165912 5485 bp DNA linear PLN 29-JUL-1999

• The information obtained from the file must be printed in the order specified

18

Sequence Data

• How would you go about getting the sequence data out of a Genbank record?– The sequence data is delimited by

• ORIGIN• //

• So read lines looking for one that starts with ORIGIN

• After seeing ORIGIN, read and print lines until you see the //

19

sequence.perl

open INFILE,"<$ARGV[0]" or die "Unable to open Genbank file ($!)";

$in_sequence = 0; # 0 is false

while ( !eof( INFILE ) ) { $line = <INFILE>; if ( $line =~ /^\/\/\n/ ) { $in_sequence = 1; # non-zero value is true } elsif ( $line =~ /^ORIGIN/ ) { $in_sequence = true; } elsif ( $in_sequence ) { print $line,"\n"; }}

close INFILE;

20

Your turn!!

• Modify the script sequence.perl so that the sequence data is placed into a scalar variable named $sequence_data.

• All of the spaces, newlines, and line numbers should be removed from the sequence data– You can do this as you read or all at once at the end

• Write your program so that it prints $sequence_data to verify it works correctly

• Testing hint– Modify the Genbank record so that it has many fewer

sequence lines so that you can determine if your program is working correctly

21

Where Are We?

• So now we have a program that extracts the DNA sequence information out of a Genbank record

• Lets go one step further and convert the sequence data into an equivalent sequence of amino acids– To simplify things assume that a reading frame

starts with the first nucleotide in the sequence

22

Hashes

• Associative arrays– Key value pairs– Given the key, the table returns the value

• You can build them by hand%codons = (‘TCA’,‘S’,‘TCC’,‘S’,…)

Alternatively…

%codons = ( ‘TCA’ => ‘S’,

‘TCC’ => ‘S’,

‘TCG’ => ‘S’,

… );

• This is a big pain for all of the codons

23

Building a Hash

• The hash can be built under program control– The file codons.txt contains 64 lines in the

following format• TCA S Serine

– A perl script could read the lines one at a time– Break extract the codon and the amino acid from

the line– Add the information to the hash

24

Handy Functions

• Here are some functions that you might find useful– chomp $line

• Removes the line termination character(s) from a string

– split /\s/, $line• Splits a line into fields (i.e., turns it into a list)• ($codon, $amino, $name) = split /\s/,$line

– sort @somelist• Sorts a list

– reverse @somelist• Reverses a list

25

Functions on Hashes

• Some operations you can perform on a hash$codons{‘TCA’}=‘S’;

• Adds the entry (TCA,S) to the hash

delete $codons{‘XXX’};• Removes the key XXX from the hash

keys %codons• Returns a list of the keys associated with the hash

sort keys %codons• Sorts the keys in the hash

26

What Do You Think This Does?

foreach $d ( sort keys %codons ) { print “$d: $codons{$d}\n”;}

Returns a sorted list of the keys in the hash %codonsSteps through a list element by element. The current element is placed in $d

Prints the current element and the amino acid associated with that element

27

Your Turn!!

• Write a script named buildhash that builds a hash that contains each of the 64 codons and the amino acid that each one encodes

• The information you need to build the hash is contained in the file codons.txt

• Print the hash after it has been built

28

Now This Is Cool!!

$dna = ‘CGACGTTTCGTACGGACTAGCT’;

$amino_acids = “”;

for ($i=0; $i<length($dna)-2; $i=$i+3) { $amino_acids .= $codons{ substr( $dna, $i, 3 ) };}

print $amino_acids,”\n”;

29

Subroutines

• Clearly the code to translate DNA to amino acids is something that we might want to use many times in a program

• Perl supports subroutines– Think of a subroutine as a small program that you

can call from other programs

• Subroutines often require parameters• For the translation program

– The string to translate– Where to start translation (ORFs?)

30

The Subroutine translate

sub translate { my($dna,$start)=@_; my $amino_acid; my $i;

for ($i=$start; $i<length($dna)-2; $i=$i+3) { $amino_acid .= $codons{ substr( $dna, $i, 3 ) }; }

return $amino_acid;}

31

Packaging

• Perl subroutines are typically placed in a separate file– Often referred to as a module or a library– The last line in the module must be a ‘1’– The module name typically ends with “.pm”

• The program that will call the subroutine must state that it will use the module before calling the subroutine

32

BioSubs.pm

sub translate { my($dna,$start)=@_; my $amino_acid; my $i;

for ($i=$start; $i<length($dna)-2; $i=$i+3) { $amino_acid .= $codons{ substr( $dna, $i, 3 ) }; }

return $amino_acid;}

1

33

Using BioSubs

use BioSubs;

open CODONS,"<$ARGV[0]" or die "Unable to open codon file ($!)";

while (<CODONS>) { ($codon,$amino,$name)=split /\s/; $codons{$codon}=$amino;}

print translate("GCAGCCGCGGCUAGAAGG”,0),"\n";

close CODONS;

34

Your Turn!!

• Write a perl script that will read sequence data from a Genbank record and will print out the 6 reading frames associated with the sequence data in the record

• Take advantage of subroutines– Clearly translate will be helpful

• If you have time add a subroutine named getsequence to the BioSubs module– getSequence takes the name of a file that contains a

Genbank record as an argument– It returns a scalar that contains the sequence data found in

the record

using perl

Documents