Perl Practical Extraction and Report Language

Post on 19-Dec-2015

Perl

Practical Extraction and Report Language

Perl language
    Windows
        Perl-Win32, ActiveState Perl
    Linux
        use the whereis command to locate Perl

Sources
    Learning Perl, O'Reilly, ISBN 0-596-10105-8
    http://www.comp.leeds.ac.uk/Perl/
    Perl for Dummies, 2nd ed., ISBN 0-7645-0460-6
    Perl by Example, Quigley, ISBN 0-13-028251-0

Command line
    perl filename.pl
    runs from a command-line interface
    use a text editor to create and save the .pl file

Perl: first line of the program
    #!/usr/bin/perl -w
    instructs the system to run perl with the warnings option
    not required in Windows versions
Options
    -c    check syntax
    -w    many warnings enabled
    -W    all warnings enabled
    -X    disable all warnings
    -v    version
    -e    one-line programs (immediate mode)
    -d    debugger

Comments
    a # character at the beginning of a line indicates a comment
    a comment can also appear in the middle of a line, after a command; the rest of the line is ignored
    blank lines are ignored

System Commands
    a command enclosed in ` characters ("backticks") is executed as a system command, and its output is returned to the script

Perl statements
    Conditional tests
    Loops
    Direct statements
        open(INFILE, $TheFile) or die "The file $TheFile could not be found.\n";
        $LineCount = $LineCount + 1;
    Statements end in ";"

Simple starts
    print "This is a test";
    case sensitivity (print, not PRINT)
Looping
    while (condition)
    {
    } # End of while loop

Scalar variables
    Hold both strings and numbers, completely interchangeably
        $priority = 9;
        $priority = 'high';
    Accept numbers as strings
        $priority = '9';
        $default = '0009';
    and can still cope with arithmetic and other operations quite happily
Variable names
    consist of numbers, letters and underscores
    are case sensitive
    should not start with a number
    $_ is a special variable (many exist)
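A small sketch of how a scalar such as '0009' behaves in numeric and in string context may help here. The variable names are made up for the illustration.

```perl
use strict;
use warnings;

# Hypothetical variable names, chosen only for this illustration.
my $default = '0009';        # a string that looks like a number

my $sum   = $default + 1;    # numeric context: '0009' is treated as 9
my $label = $default . 1;    # string context: the characters are kept as they are

print "$sum\n";              # prints 10
print "$label\n";            # prints 00091
```

The same scalar gives different results depending on the operator applied to it, which is why Perl has separate numeric and string operators.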

Math Operators
    Perl uses all the usual C arithmetic operators:
        $a = 1 + 2;     # Add 1 and 2 and store in $a
        $a = 3 - 4;     # Subtract 4 from 3 and store in $a
        $a = 5 * 6;     # Multiply 5 and 6
        $a = 7 / 8;     # Divide 7 by 8 to give 0.875
        $a = 9 ** 10;   # Nine to the power of 10
        $a = 5 % 2;     # Remainder of 5 divided by 2
        ++$a;           # Increment $a and then return it
        $a++;           # Return $a and then increment it
        --$a;           # Decrement $a and then return it
        $a--;           # Return $a and then decrement it

String Operators
        $a = $b . $c;   # Concatenate $b and $c
        $a = $b x $c;   # $b repeated $c times
    type man perlop for other operators

Perl Assignments
        $a = $b;        # Assign $b to $a
        $a += $b;       # Add $b to $a
        $a -= $b;       # Subtract $b from $a
        $a .= $b;       # Append $b onto $a

Interpolation
        $a = 'apples';
        $b = 'pears';
        print $a.' and '.$b;
    prints apples and pears using concatenation
Single quotes versus double quotes
        print '$a and $b';
    prints literally $a and $b
        print "$a and $b";
    double quotes force interpolation of any codes, including interpreting variables
    Other codes that are interpolated include special characters such as newline (\n) and tab (\t)
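The difference between the two quoting styles can be checked with a short sketch like the one below. The variable names are illustrative only.

```perl
use strict;
use warnings;

my $fruit = 'apples';

my $single = 'I like $fruit\n';   # single quotes: no interpolation at all
my $double = "I like $fruit\n";   # double quotes: $fruit and \n are interpreted

print $single, "\n";   # prints I like $fruit\n
print $double;         # prints I like apples
```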

Printing words
    When printing a list of words to STDOUT
        an unquoted word must start with an alphanumeric character
        the remainder consists of alphanumeric characters and underscores
    Perl words are case sensitive
    if unquoted, a word could conflict with identifiers
    If a word has no special meaning to Perl, it is treated as if surrounded by single quotes

Literals: numeric
    12345       integer
    0b1101      binary
    0x456fff    hex
    0777        octal (leading zero)
    23.45       float
    .234E-2     scientific notation

Literals: string literals
    \n      newline
    \t      tab
    \r      carriage return
    \f      form feed
    \b      backspace
    \a      alarm/bell
    \e      escape
    \033    octal character
    \xff    hex character
    \c[     control character
    \l      convert next char to lowercase
    \u      convert next char to uppercase
    \L      convert chars to lowercase until "\E" found
    \U      convert chars to uppercase until "\E" found
    \Q      backslash all following non-alphanumeric chars until "\E"
    \E      ends upper / lower conversion
    \\      backslash

Literals: special literals
    __LINE__       current line of the script
    __FILE__       name of the script
    __END__        logical end of the file
                   trailing text following it will be ignored
                   CTRL-d (\004) in Unix
                   CTRL-z (\032) in MS-DOS
    __DATA__       indicates data contained in the script instead of an external file
    __PACKAGE__    current package (default is main)

Print function
    prints a string or a comma-separated list of strings to a Perl filehandle (STDOUT by default)
    returns 1 on success, 0 on failure

    print "Hello", "world", "\n";
        Helloworld
    print "Hello world\n";
        Hello world
    print Hello, world, "\n";
        No comma allowed after filehandle at ./perl.s. line 1
        Perl thinks that 'Hello' is a filehandle
    print STDOUT Hello, world, "\n";
        Helloworld (no comma after STDOUT)

Printing literals
    print "The price is $100.\n";
        The price is .
    print "The price is \$100.\n";
        The price is $100.
    print "The price is \$",100,".\n";
        The price is $100.
    print "The binary number is converted to: ",0b10001,".\n";
        The binary number is converted to: 17.
    print "The octal number is converted to: ",0777,".\n";
        The octal number is converted to: 511.
    print "The hex number is converted to: ",0xAbcF,".\n";
        The hex number is converted to: 43983.
    print "The unformatted number is ", 14.56,".\n";
        The unformatted number is 14.56.

printf
    prints a formatted string to a filehandle (STDOUT is the default)
        printf("The name is %s and the number is %d\n", "John", 50);
    "John" is substituted for the %s
    50 is substituted for the %d
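As a hedged illustration of the format codes, the sketch below uses sprintf, which takes the same format string as printf but returns the result instead of printing it. The extra formats (%.2f, %-6s, %4d) are additions not shown on the slide.

```perl
use strict;
use warnings;

# sprintf builds the string; printf would print the same text directly.
my $line = sprintf("The name is %s and the number is %d\n", "John", 50);
print $line;

my $price  = sprintf("%.2f", 14.5);           # two decimal places
my $padded = sprintf("%-6s|%4d|", "ab", 7);   # left-justified string, right-justified number

print "$price\n";    # prints 14.50
print "$padded\n";   # prints ab    |   7|
```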

Printing without quotes: the "here" document
    prints 'from here to here' delimited text

$price = 1000;
print <<EOF;
The consumer said, "As I look over my budget, I'd
say the price of $price is right. I'll give you \$500 to start."\n
EOF

Output
The consumer said, "As I look over my budget, I'd
say the price of 1000 is right. I'll give you $500 to start."

    $price is interpolated (as if between double quotes)

Printing without quotes

$price = 1000;
print <<'FINIS';
The consumer said, "As I look over my budget, I'd
say the price of $price is too much.\n I'll settle for $500."
FINIS

Output
The consumer said, "As I look over my budget, I'd
say the price of $price is too much.\n I'll settle for $500."

    $price is not interpolated (the delimiter is in single quotes)
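The two kinds of here document can be compared side by side. This is a minimal sketch with shortened text that stores each here document in a variable (an assumption made so the results can be inspected) rather than printing it directly.

```perl
use strict;
use warnings;

my $price = 1000;

# Double-quoted terminator: variables and backslash escapes are interpreted.
my $interpolated = <<"EOF";
The price of $price is right. I'll give you \$500.
EOF

# Single-quoted terminator: the text is taken literally.
my $literal = <<'FINIS';
The price of $price is too much.\n
FINIS

print $interpolated;   # prints The price of 1000 is right. I'll give you $500.
print $literal;        # prints The price of $price is too much.\n
```

The terminator (EOF, FINIS) has to appear on a line of its own, starting in the first column.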

Printing without quotes

print << x 2;
Here's to a new day.
Woo-hoo!
(blank line)
print "\nLet's do some stuff.\n";
print <<`END`;    # backticks execute system commands
echo Today is
date
END

Output
Here's to a new day.
Woo-hoo!
Here's to a new day.
Woo-hoo!

Let's do some stuff.
Today is
Sun Mar 19 12:48:36 EST 2006

    a bare << ends the text at the first blank line; Perl 5.28 and later reject the bare form, so write <<"" instead

Arrays
        @food = ("apples", "pears", "eels");
        @music = ("whistle", "flute");
    $food[2] returns "eels" (index is 0-based)
        $ is used because it's a scalar now and not an array
        @moremusic = ("organ", @music, "harp");
    explodes @music; equivalent to
        @moremusic = ("organ", "whistle", "flute", "harp");
        push(@food, "eggs");
    adds the element to the end of the array

Arrays
    push two or more items
        push(@food, "eggs", "lard");
        push(@food, ("eggs", "lard"));
        push(@food, @morefood);
    the push function returns the length of the new list
    the pop function removes the last item from a list and returns it

Arrays
        $f = @food;
    assigns the length of @food to $f
        $f = "@food";
    turns the array into a space-delimited string and assigns it to $f

Arrays
    Multiple assignments
        ($a, $b) = ($c, $d);         # Same as $a=$c; $b=$d;
        ($a, $b) = @food;            # $a and $b are the first
                                     # two items of @food
        ($a, @somefood) = @food;     # $a is the first item of @food,
                                     # @somefood is a list of the
                                     # others
        (@somefood, $a) = @food;     # @somefood is @food and
                                     # $a is undefined

Arrays
    Finding the last index of an array
        $#food
    not to be confused with the number of elements
    Displaying arrays
        print @food;        # By itself
        print "@food";      # Embedded in double quotes
        print @food."";     # In a scalar context
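The array operations from the preceding slides can be tried together in one short sketch. It uses lexical (my) variables under strict, which is a modernisation and not how the slides declare things.

```perl
use strict;
use warnings;

my @food = ("apples", "pears", "eels");

my $count      = @food;      # scalar context: number of elements
my $last_index = $#food;     # index of the last element
my $joined     = "@food";    # elements joined with single spaces

my $new_length = push(@food, "eggs", "lard");   # push returns the new length
my $removed    = pop(@food);                    # pop returns the removed item

print "$count $last_index\n";       # prints 3 2
print "$joined\n";                  # prints apples pears eels
print "$new_length $removed\n";     # prints 5 lard
```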

File Handling
    Example
        $file = '/etc/passwd';      # Name the file
        open(INFO, $file);          # Open the file
        @lines = <INFO>;            # Read it into an array
        close(INFO);                # Close the file
        print @lines;               # Print the array
    Modes
        open(INFO, $file);          # Open for input
        open(INFO, ">$file");       # Open for output
        open(INFO, ">>$file");      # Open for appending
        open(INFO, "<$file");       # Also open for input
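The same three modes can also be written with the three-argument form of open and a lexical filehandle, a style that is not on the slides but is common in current Perl. The sketch below assumes a scratch file created with the core File::Temp module, so it does not touch any real file.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# A scratch file stands in for a real data file (assumption for this sketch).
my ($tmp_fh, $file) = tempfile(UNLINK => 1);
close($tmp_fh);

open(my $out, '>', $file) or die "Cannot write $file: $!\n";    # output mode
print $out "first line\n", "second line\n";
close($out);

open(my $app, '>>', $file) or die "Cannot append to $file: $!\n";   # append mode
print $app "third line\n";
close($app);

open(my $in, '<', $file) or die "Cannot read $file: $!\n";      # input mode
my @lines = <$in>;      # list context: one array element per line
close($in);

print @lines;
```

Keeping the mode in its own argument means a file name that happens to begin with > or < cannot change how the file is opened.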

Special Variables
    $_       default input
    $/       input record separator (newline by default)
    $[       index of the first list element
    $|       forces flushing to the file handle if set to true (false is the default)
    $]       Perl version
    $0       name of the file containing the Perl program being run
    $^T      time of program start
    $.       input line number of the last file handle read
    $ARGV    name of the current file when using <ARGV>
    @ARGV    command line arguments
    @INC     list of directories for do, require and use
    %INC     files that have been used by do and require
    %ENV     OS environment variables

File Handling
    print something to a file you've already opened for output
        print INFO "This line goes to the file.\n";
    open the standard input (usually the keyboard) and standard output (usually the screen)
        open(INFO, '-');        # Open standard input
        open(INFO, '>-');       # Open standard output

Conditional Expressions

Testing
        $a == $b        # Is $a numerically equal to $b?
                        # Beware: Don't use the = operator.
        $a != $b        # Is $a numerically unequal to $b?
        $a eq $b        # Is $a string-equal to $b?
        $a ne $b        # Is $a string-unequal to $b?
    You can also use logical and, or and not:
        ($a && $b)      # Are $a and $b both true?
        ($a || $b)      # Is either $a or $b true?
        !($a)           # Is $a false?
    non-zero numbers and non-empty strings are true in Perl (the one-character string "0" also counts as false)
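A brief sketch of why the numeric and string operators must not be mixed up, and of which values count as false. The sample values are chosen for illustration.

```perl
use strict;
use warnings;

my $x = '1.0';
my $y = '1';

my $numeric_equal = ($x == $y) ? 1 : 0;   # compared as numbers: 1.0 equals 1
my $string_equal  = ($x eq $y) ? 1 : 0;   # compared as text: '1.0' differs from '1'

# Keep only the values that Perl treats as false.
my @false_values = grep { !$_ } (0, '0', '', 'a', '00', 5);

print "$numeric_equal $string_equal\n";   # prints 1 0
print scalar(@false_values), "\n";        # prints 3
```

Note that '00' is true: only the empty string and the exact string '0' are false among strings.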

Control Structures

if
    if ($a)
    {
        print "The string is not empty\n";
    }
    else
    {
        print "The string is empty\n";
    }

if / elsif
    if (!$a)                    # The ! is the not operator
    {
        print "The string is empty\n";
    }
    elsif (length($a) == 1)     # If above fails, try this
    {
        print "The string has one character\n";
    }
    elsif (length($a) == 2)     # If that fails, try this
    {
        print "The string has two characters\n";
    }
    else                        # Now, everything has failed
    {
        print "The string has lots of characters\n";
    }

for
    for ($i = 0; $i < 10; ++$i)     # Start with $i = 0
                                    # Do it while $i < 10
                                    # Increment $i before repeating
    {
        print "$i\n";
    }

foreach
    foreach $morsel (@food)     # Visit each item in turn
                                # and call it $morsel
    {
        print "$morsel\n";      # Print the item
        print "Yum yum\n";      # That was nice
    }

while / until
    #!/usr/local/bin/perl
    print "Password? ";             # Ask for input
    $a = <STDIN>;                   # Get input
    chop $a;                        # Remove the newline at end
    while ($a ne "fred")            # While input is wrong...
    {
        print "sorry. Again? ";     # Ask again
        $a = <STDIN>;               # Get input again
        chop $a;                    # Chop off newline again
    }

while / until
    #!/usr/local/bin/perl
    do
    {
        print "Password? ";         # Ask for input
        $a = <STDIN>;               # Get input
        chop $a;                    # Chop off newline
    }
    while ($a ne "fred");           # Redo while wrong input
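The password loop can also be written with until, which repeats while the condition is false. In this sketch the keyboard input is replaced by a hypothetical list of attempts so that it runs without anyone typing.

```perl
use strict;
use warnings;

# Hypothetical attempts standing in for lines typed at the keyboard.
my @attempts = ("wilma", "barney", "fred", "betty");

my $tries = 0;
my $guess;
do {
    $guess = shift @attempts;   # take the next attempt
    $tries++;
} until ($guess eq "fred");     # same as: while ($guess ne "fred")

print "Accepted after $tries tries\n";   # prints Accepted after 3 tries

# foreach with the range operator, for comparison with the C-style for loop.
my $total = 0;
foreach my $n (1 .. 4) {
    $total += $n;
}
print "$total\n";   # prints 10
```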

Regular Expressions

Matching Strings and String Manipulation

Regular Expressions
    a regular expression is contained in slashes, and matching occurs with the =~ operator
    the following expression is true if the string "the" appears in the variable $sentence
        $sentence =~ /the/
    case sensitive!
        $sentence !~ /the/
    true if no match is found

Regular Expressions
    /abc/
        Any string matching this pattern
    ?abc?
        Only the first occurrence matching this pattern (since Perl 5.22 this must be written m?abc?)

RE Characters and Meanings
    .       # Any single character except a newline
    ^       # The beginning of the line or string
    $       # The end of the line or string
    *       # Zero or more of the last character
    +       # One or more of the last character
    ?       # Zero or one of the last character

RE expressions
    t.e     matches the, tre, tle, ...; does not match te or tale
    ^f      matches f at the beginning of a line
    ^ftp    matches ftp at the beginning of a line
    e$      matches e at the end of a line
    tle$    matches tle at the end of a line
    und*    matches un followed by zero or more d characters: un, und, undd, unddd, ...

RE expressions
    .*      Any string without a newline. This is because the . matches any character except a newline and the * means zero or more of these.
    ^$      A line with nothing in it (the beginning of the line is followed directly by the end of the line).

RE Options
    [qjk]       # Either q or j or k
    [^qjk]      # Neither q nor j nor k
    [a-z]       # Anything from a to z inclusive
    [^a-z]      # No lower case letters
    [a-zA-Z]    # Any letter
    [a-z]+      # Any non-zero sequence of lower case letters

RE Expressions
    The vertical bar "|" is used as an "or" operator
    jelly|cream     # Either jelly or cream
    (eg|le)gs       # Either eggs or legs
    (da)+           # Either da or dada or dadada or...

Special Characters
    \n      # A newline
    \t      # A tab
    \w      # Any alphanumeric (word) character.
            # The same as [a-zA-Z0-9_]
    \W      # Any non-word character.
            # The same as [^a-zA-Z0-9_]
    \d      # Any digit. The same as [0-9]
    \D      # Any non-digit. The same as [^0-9]
    \s      # Any whitespace character: space, tab, newline, etc
    \S      # Any non-whitespace character
    \b      # A word boundary, outside [] only
    \B      # No word boundary

Special Characters
    When you need to match a special character, use the backslash to indicate that a literal character follows
    \|      # Vertical bar
    \[      # An open square bracket
    \)      # A closing parenthesis
    \*      # An asterisk
    \^      # A caret symbol
    \/      # A slash
    \\      # A backslash

RE examples
    [01]            # Either "0" or "1"
    \/0             # A division by zero: "/0"
    \/ 0            # A division by zero with a space: "/ 0"
    \/\s0           # A division by zero with a whitespace:
                    # "/ 0" where the space may be a tab etc.
    \/ *0           # A division by zero with possibly some
                    # spaces: "/0" or "/ 0" or "/  0" etc.
    \/\s*0          # A division by zero with possibly some
                    # whitespace.
    \/\s*0\.0*      # As the previous one, but with decimal
                    # point and maybe some 0s after it. Accepts
                    # "/0." and "/0.0" and "/0.00" etc and
                    # "/ 0." and "/ 0.0" and "/ 0.00" etc.

Regular Expressions
    Matching modifiers
        i   turn off case sensitivity
        m   treat a string as multiple lines (as an operator in front of the pattern, m is optional if the pattern is enclosed in forward slashes)
        o   compile pattern once (optimize search)
        s   single line string (when \n is embedded)
        x   permit comments in a RE and ignore whitespace
        g   global matching

Regular Expressions
    x.txt file (file contains more text)

This is a test file to show how Perl can
read a file line-by-line and then hopefully
match patterns of the text in the file.
(516) 555-5555 Telephone number
777-77-7777 Social security number
192.168.0.100 IP address
10.4.6.8 IP address
00-4A-29-6D-01-F2 MAC @
9.1.2.2 IP address
$1,423.08 currency
1.54893 decimal number
1948.204356 decimal number
1,948.204356 decimal number
11548 zip code
11548-1300 zip+4
[email protected]
://myweb.cwpost.liu.edu/cmalinow/index.html

Regular Expressions
    can you find the following in 'x.txt' using REs?
        'file'
        'the'
        all telephone numbers
        only telephone numbers in the 516 area
        all zip codes (including zip+4)
        all IP addresses (and only IPs)
        MAC addresses
        currency
        email addresses
        URLs (not just websites)
        City and States as in addresses (2 letter abbreviation)
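As a starting point for the exercise, here is one possible sketch for three of the items. The patterns are assumptions about what counts as a match (for instance, the IP pattern does not check that each number is at most 255), and the sample lines are copied from x.txt into an array instead of being read from the file.

```perl
use strict;
use warnings;

# A few lines from x.txt, held in an array for this sketch.
my @lines = (
    "(516) 555-5555 Telephone number",
    "777-77-7777 Social security number",
    "192.168.0.100 IP address",
    "11548 zip code",
    "11548-1300 zip+4",
    "1948.204356 decimal number",
);

my $phone_re = qr/\(\d{3}\)\s*\d{3}-\d{4}/;      # (ddd) ddd-dddd
my $zip_re   = qr/^\d{5}(?:-\d{4})?\s/;          # ddddd or ddddd-dddd, then whitespace
my $ip_re    = qr/^(?:\d{1,3}\.){3}\d{1,3}\s/;   # four dot-separated groups of digits

my @phones = grep { /$phone_re/ } @lines;
my @zips   = grep { /$zip_re/ }   @lines;
my @ips    = grep { /$ip_re/ }    @lines;

print "Phones:\n", map { "  $_\n" } @phones;
print "Zips:\n",   map { "  $_\n" } @zips;
print "IPs:\n",    map { "  $_\n" } @ips;
```

The anchors and the trailing \s matter: without them the decimal number and the social security number would be easy to match by accident.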

Regular Expressions
        $_ = "xabcy\n";
        print if /abc/;
    could be written
        print $_ if $_ =~ /abc/;
    output is
        xabcy

Regular Expression Metacharacters
    a metacharacter is a character that does not represent itself
    metacharacters control a search pattern
        find only at the beginning, or the end, or starts with an uppercase character, etc.
    if preceded by a backslash, a metacharacter is taken literally
        the backslash turns off the meaning of the metacharacter

Metasymbols
    a simpler form of metacharacters, used to represent characters
    [0-9] represents the range of characters 0-9
        uses the bracket metacharacters
    \d represents the same thing
        \d is the metasymbol

Metacharacters
    .           any single character except a newline
    [a-zA-Z]    any single character in the set
    [^a-zA-Z]   any single char not in the set
    \d          matches one digit
    \D          matches a non-digit
    \w          matches an alphanumeric (word) character
    \W          matches a non-alphanumeric character

Metacharacters
    \s          whitespace char (space, tab, newline)
    \S          non-whitespace char
    \n          newline
    \r          return
    \t          tab
    \f          formfeed
    \b          backspace (inside [ ] only)
    \0          null character

Metacharacters
    \b          word boundary (when not in [ ])
    \B          non-word boundary
    ^           matches at the beginning of a line
    $           matches at the end of a line

DATA filehandle

. . .
while (<DATA>)
{
    print if /Norma/;
}
__DATA__
Dopey Dildock
Steve Blech
Norma X
Igor the Hunchback
Vlad the Impaler
Frank N. Stein

(Output)
Norma X

    __DATA__ is placed at the end of the script
    each time a line is read from the in-script data store, it is read into the special $_ variable
    the 'print if' assumes you're checking the $_ variable
    could also have been written
        print if $_ =~ /Norma/;
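The same line-by-line loop works on any filehandle. The sketch below reads from a string opened as an in-memory file, a stand-in used here so that the example is self-contained; the slide itself uses the DATA filehandle.

```perl
use strict;
use warnings;

# In-memory "file" standing in for the __DATA__ section (assumption for this sketch).
my $text = "Steve Blech\nNorma X\nIgor the Hunchback\n";
open(my $data, '<', \$text) or die "Cannot open in-memory data: $!\n";

my @matches;
while (<$data>) {                   # each line is read into $_
    push @matches, $_ if /Norma/;   # the match is tested against $_ by default
}
close($data);

print @matches;   # prints Norma X
```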

Displaying info

modifiers used in print statement

unless Modifier
        print $_ unless $_ =~ /http/;
        print unless /http/;

while Modifier
        print $_ while $_ =~ /liu/i;
        print while /liu/i;

until Modifier
        print $_ until $_ =~ /liu/i;
    stops once the condition is met
    beware of infinite loops

foreach Modifier
        @x = (a..z, "\n");      # use range operator to load array
        print foreach @x;
    prints each lower case letter in turn
    in turn, each element is assigned to $_

Pattern Binding
    used if the search variable is contained in something other than $_
        $salary = 5000;
        print if $salary =~ /5/;
    or
        $name =~ s/John/Joe/;

Regular Expression Operators

matching modifiers

m operator
    m is used to match patterns
        optional if the delimiters are forward slashes
        required if other delimiters are utilized
            /abc/    m#abc#    m{abc}
    as a modifier after the pattern, m means the string is treated as multiple lines
    Examples:
        m/Good Morning/         optional 'm'
        /Hi there/
        /\/usr\/var\/adm/       backslash indicates the following char is literal
        m#/usr/var/adm#
        m'$name'                no interpolation of $name

i modifier
    i indicates ignore case

g modifier
    global match performed
        otherwise only the first match in the line is matched
    useful for global substitutions

x modifier
    the "express" modifier allows for comments and whitespace in the RE, without them being interpreted as part of the RE
        express intentions within the RE

        $_ = "San Francisco to Hong Kong\n";
        /Francisco      # searching for Francisco/x;
        print "Comments and spaces removed and \$& is $&\n";
    (output)
        Comments and spaces removed and $& is Francisco
    the $& variable contains what was matched as a result of the search

s Operator
    Used in substitutions
    modifiers used for substitutions
        e   evaluate replacement side as an expression
        i   ignore case
        m   string is multiple lines
        o   only compile pattern once
        s   string is single line when newline embedded
        x   allow whitespace and comments in RE
        g   global replacements

Substitution

Substitution
        $sentence =~ s/london/London/
    the $_ variable is assumed in the following line
        s/london/London/
    global substitution
        s/london/London/g
    regardless of the case of letters
        s/[Ll][Oo][Nn][Dd][Oo][Nn]/London/g
    ignore case
        s/london/London/gi

Substitution Delimiters
    normally forward slashes
        s/abc/xyz/;
    any non-alphanumeric character can be used following the s operator
        s#abc#xyz#;
    if paired parens, curly braces, square or angle brackets are used around the search pattern, any other type of delimiter can be used for the replacement pattern
        s(john)/Joe/;

Remembering Patterns
    matches are kept in variables $1 through $9
    can be used in regular expressions as \1 through \9
        $_ = "Lord Whopper of Fibbing";
        s/([A-Z])/:\1:/g;
        print "$_\n";
    replaces all uppercase letters with that letter (the match) surrounded by colons
        :L:ord :W:hopper of :F:ibbing

Remembering Patterns
    Identify any words repeated
        if (/(\b.+\b) \1/)
        {
            print "Found $1 repeated\n";
        }
    \b represents a word boundary
    .+ matches any non-empty string
    \b.+\b matches anything between two word boundaries
    this is remembered by the parentheses and stored as \1 for regular expressions and as $1 for the rest of the program

Remembering Patterns
    Swap the first and last characters of a line
        s/^(.)(.*)(.)$/\3\2\1/
    ^ and $ match the beginning and end of the line.
    The \1 code stores the first character; the \2 code stores everything else up to the last character, which is stored in the \3 code.
    Then that whole line is replaced with \1 and \3 swapped
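The three remembered-pattern examples can be run together as below. This sketch departs from the slides in two labelled ways: it writes $1 rather than \1 on the replacement side (the form perl -w prefers), and it uses \w+ instead of .+ in the repeated-word test so that only single words are compared.

```perl
use strict;
use warnings;

# Copy-then-substitute idiom: the original string is left untouched.
my $title = "Lord Whopper of Fibbing";
(my $marked = $title) =~ s/([A-Z])/:$1:/g;          # wrap each capital in colons

my $line = "hello";
(my $swapped = $line) =~ s/^(.)(.*)(.)$/$3$2$1/;    # swap first and last characters

my $sentence = "this is is a test";
my $repeated = ($sentence =~ /\b(\w+) \1\b/) ? $1 : '';   # \1 refers back inside the pattern

print "$marked\n";     # prints :L:ord :W:hopper of :F:ibbing
print "$swapped\n";    # prints oellh
print "$repeated\n";   # prints is
```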

Translation
    tr function
        $sentence =~ tr/abc/edf/
    translates 'a' to 'e', 'b' to 'd' and 'c' to 'f'
    the function returns the number of substitutions made
    can be used to count the number of occurrences
        $count = ($sentence =~ tr/*/*/);
    counts the number of '*' in $sentence
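A short sketch of tr used both ways: for counting and for translating. The sample string is made up for the illustration.

```perl
use strict;
use warnings;

my $sentence = "a*b*c*";

# Replacing '*' with '*' changes nothing, but tr still returns the count.
my $stars = ($sentence =~ tr/*/*/);

# Character ranges work too: copy the string, then map lower case to upper case.
(my $upper = $sentence) =~ tr/a-z/A-Z/;

print "$stars\n";   # prints 3
print "$upper\n";   # prints A*B*C*
```

Unlike s///, tr works on single characters only and does not take a regular expression.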

Split
    Splits up a string and places it into an array
        $info = "Caine:Michael:Actor:14, Leafy Drive";
        @personal = split(/:/, $info);
    which has the same overall effect as
        @personal = ("Caine", "Michael", "Actor", "14, Leafy Drive");
    if the string is already in $_ then
        @personal = split(/:/);

Split
    If the fields are divided by any number of colons then we can use the RE codes to get round this. The code
        $_ = "Capes:Geoff::Shot putter:::Big Avenue";
        @personal = split(/:+/);
    is the same as
        @personal = ("Capes", "Geoff", "Shot putter", "Big Avenue");

Split
    A word can be split into characters, a sentence split into words and a paragraph split into sentences:
        @chars = split(//, $word);
        @words = split(/ /, $sentence);
        @sentences = split(/\./, $paragraph);
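The split forms above can be checked with a sketch like this one. The special pattern ' ' (a single space in quotes) is an addition not shown on the slide: it splits on any run of whitespace and skips leading whitespace, which / / does not.

```perl
use strict;
use warnings;

my @personal = split(/:+/, "Capes:Geoff::Shot putter:::Big Avenue");   # runs of colons
my @chars    = split(//, "perl");                                      # single characters
my @words    = split(' ', "  the quick   fox ");                       # whitespace-separated words

my $rejoined = join("-", @chars);   # join is the reverse operation

print scalar(@personal), "\n";   # prints 4
print "$rejoined\n";             # prints p-e-r-l
print "@words\n";                # prints the quick fox
```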

Exercise
    concordance
        a string displayed in its immediate context
    a concordance program identifying the target string "the" might produce some of the following output (note how 'the' lines up)

discovered (this is the truth) that when he
t kinds of metal to the leg of a frog, an e
rrent developed and the frog's leg kicked,
 longer attached to the frog, which was dea
normous advances in the field of amphibian
ch it hop back into the pond -- almost.  Bu
ond -- almost.  But the greatest Electrical
ectrical Pioneer of them all was Thomas Edi

Exercise
    Write a concordance program
    Read the entire file into an array
        Each item in the array will be a line of the file
        When the chop function is used on an array it chops off the last character of every item
        Recall that you can join the whole array together with a statement like $text = "@lines";
    Use the target string as the delimiter for splitting the text
        use the target string in place of the colon in our previous examples
    For each array element in turn
        print it out
        print the target string
        print the next array element
    at this stage the target strings won't line up vertically

substr function

Substring function
        substr(string, start, count)
    start is zero-based
        a negative offset can be used to count from the end of the string
    count: if left out, the substring runs to the end of the string
    if the substring extends beyond the string length, substr returns nothing or gives a warning
        To avoid this you can pad out the string by using the x operator mentioned earlier.
        The expression (" "x30) produces 30 spaces, for example

        substr("Once upon a time", 3, 4);       # returns "e up"
        substr("Once upon a time", 7);          # returns "on a time"
        substr("Once upon a time", -6, 5);      # returns "a tim"
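The padding trick is useful for the concordance exercise, so here is a minimal sketch of it alongside the three calls from the slide. The window position and width (20 and 15) are arbitrary values picked for the illustration.

```perl
use strict;
use warnings;

my $text = "Once upon a time";

my $middle   = substr($text, 3, 4);    # 4 characters starting at offset 3
my $tail     = substr($text, 7);       # from offset 7 to the end
my $near_end = substr($text, -6, 5);   # 5 characters starting 6 from the end

# Pad on the left with 30 spaces so a window that starts before the
# text still yields a full-width result.
my $padded = (" " x 30) . $text;
my $window = substr($padded, 20, 15);  # 10 spaces followed by "Once "

print "[$middle] [$tail] [$near_end]\n";   # prints [e up] [on a time] [a tim]
print "[$window]\n";
```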

Associative Arrays
    To define
        usual parenthesis notation
        the array itself is prefixed by a % sign
    To create an array of people and their ages
        %ages = ("Michael Caine", 39,
                 "Dirty Den", 34,
                 "Angie", 27,
                 "Willy", "21 in dog years",
                 "The Queen Mother", 108);

        $ages{"Michael Caine"};     # Returns 39
        $ages{"Dirty Den"};         # Returns 34
    Note: curly braces are used for the index instead of square brackets
    an associative array is converted back into a list array just by assigning it to a list array variable
        @info = %ages;      # @info is a list array. It now has 10 elements

Associative Arrays
    do not have any order to their elements
        just like hash tables
    access the elements in turn using the keys and values functions
        keys returns a list of the keys (indices) of the associative array
        values returns a list of the values of the array
        when keys and values are called in a scalar context they return the number of key/value pairs in the associative array
    the function each returns a two-element list of a key and its value
        while (($person, $age) = each(%ages))
        {
            print "$person is $age\n";
        }
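Because the elements have no order, output from each or keys can differ from run to run. A common remedy, sketched here with a shortened version of the %ages data, is to sort the keys first. The => operator used below is simply a comma that also quotes the word to its left when that word is a plain identifier.

```perl
use strict;
use warnings;

my %ages = ("Michael Caine" => 39, "Dirty Den" => 34, "Angie" => 27);

my $count = keys %ages;    # scalar context: number of key/value pairs

my @report;
foreach my $person (sort keys %ages) {      # sorted keys give a stable order
    push @report, "$person is $ages{$person}";
}

print "$count people\n";       # prints 3 people
print "$_\n" for @report;      # Angie first, then Dirty Den, then Michael Caine
```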

Getting Data

Data

while (<DATA>)
{
    @line = split(":", $_);
    print $line[0], "\n" if $line[1] =~ /516/;
}
__DATA__
Silly Sally:516-555-9087
Dopey Dildock:631-555-9265
Martin Martin:516-555-2835

(output)
Silly Sally
Martin Martin

Data

while (<DATA>)
{
    ($name, $phone, $address) = split(":", $_);
    print $name if $phone !~ /631/;
}
__DATA__
Silly Sally:516-555-9087:12 Main Street
Dopey Dildock:631-555-9265:1313 Mockingbird La.
Martin Martin:516-555-2835:1010 10th St.

(output)
Silly SallyMartin Martin

Data

while ($inputline = <DATA>)
{
    ($name, $phone, $address) = split(":", $inputline);
    print $name if $phone !~ /631/;
    print $inputline if $name =~ /^D/;
}
__DATA__
Silly Sally:516-555-9087:12 Main Street
Dopey Dildock:631-555-9265:1313 Mockingbird La.
Martin Martin:516-555-2835:1010 10th St.

(output)
Silly SallyDopey Dildock:631-555-9265:1313 Mockingbird La.
Martin Martin

Environment Variables
    Held in an associative array, %ENV
        print "You are called $ENV{'USER'} and you are ";
        print "using display $ENV{'DISPLAY'}\n";

Subroutines
    may be placed anywhere in your program
        sub mysubroutine
        {
            print "Not a very interesting routine\n";
            print "This does the same thing every time\n";
        }
    a subroutine is called with an & character in front of the name
        &mysubroutine;              # Call the subroutine
        &mysubroutine($_);          # Call it with a parameter
        &mysubroutine(1+2, $_);     # Call it with two params

Subroutines
    Parameters
        any parameters are passed as a list in the special @_ list array variable
        the following subroutine prints out the list that it was called with
            sub printargs
            {
                print "@_\n";
            }
            &printargs("perly", "king");            # Example prints "perly king"
            &printargs("frog", "and", "toad");      # Prints "frog and toad"
        elements of @_ can be accessed with the square bracket notation

Subroutines
    Parameters
        sub printfirsttwo
        {
            print "Your first argument was $_[0]\n";
            print "and $_[1] was your second\n";
        }
    the indexed scalars $_[0] and $_[1] and so on have nothing to do with the scalar $_, which can also be used without fear of a clash

Subroutines
    Returning values
        The result of a subroutine is always the last thing evaluated
            sub maximum
            {
                if ($_[0] > $_[1])
                {
                    $_[0];
                }
                else
                {
                    $_[1];
                }
            }
            $biggest = &maximum(37, 24);    # Now $biggest is 37
        the &printfirsttwo subroutine (prior slide) also returns a value, in this case 1
            because the last thing that subroutine did was a print statement
            the result of a successful print statement is always 1

Local Variables
    the @_ variable is local to the current subroutine
        so are $_[0], $_[1], $_[2], and so on
    Other variables can be made local
        useful if we want to start altering the input parameters
    the following subroutine tests to see if one string is inside another, spaces notwithstanding
        sub inside
        {
            local($a, $b);                      # Make local variables
            ($a, $b) = ($_[0], $_[1]);          # Assign values
            $a =~ s/ //g;                       # Strip spaces from
            $b =~ s/ //g;                       # local variables
            ($a =~ /$b/ || $b =~ /$a/);         # Is $b inside $a
                                                # or $a inside $b?
        }
        &inside("lemon", "dole money");         # true
    In fact, it can even be tidied up by replacing the first two lines with
        local($a, $b) = ($_[0], $_[1]);
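In current Perl the same subroutines are usually written with my, which creates lexically scoped variables, and an explicit return. The sketch below is one such rewrite rather than the slides' own version. The names $first and $second are arbitrary, and \Q...\E is an addition that makes any special characters in the arguments match literally.

```perl
use strict;
use warnings;

# Is one string inside the other, ignoring spaces?
sub inside {
    my ($first, $second) = @_;      # private copies of the arguments
    s/ //g for $first, $second;     # strip spaces from the copies only
    return ($first =~ /\Q$second\E/ || $second =~ /\Q$first\E/) ? 1 : 0;
}

# Return the larger of two numbers.
sub maximum {
    my ($x, $y) = @_;
    return $x > $y ? $x : $y;
}

my $found   = inside("lemon", "dole money");
my $biggest = maximum(37, 24);

print "$found $biggest\n";   # prints 1 37
```

Because my variables are invisible outside the subroutine, the caller's own variables cannot be changed by accident.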