working with text, regular expressions

Perl Programming Perl Programming CourseCourse

Working with textWorking with textRegular expressionsRegular expressions

Krassimir Berov

I-can.eu

http://i-can.eu/

ContentsContents

1.1. Simple word matchingSimple word matching

2.2. Character classesCharacter classes

3.3. Matching this or thatMatching this or that

4.4. GroupingGrouping

5.5. Extracting matchesExtracting matches

6.6. Matching repetitionsMatching repetitions

7.7. Search and replaceSearch and replace

8.8. The split operatorThe split operator

Simple word matchingSimple word matching

• It's all about identifying patterns in textIt's all about identifying patterns in text

• The simplest regex is simply The simplest regex is simply a word – a string of characters. a word – a string of characters.

• A regex consisting of a word matches any A regex consisting of a word matches any string that contains that wordstring that contains that word

• The sense of the match can be reversed The sense of the match can be reversed by using !~ operatorby using !~ operatormy $string ='some probably big string containing justmy $string ='some probably big string containing justabout anything in it';about anything in it';print "found 'string'\n" if $string =~ /string/;print "found 'string'\n" if $string =~ /string/;print "it is not about dogs\n" if $string !~ /dog/;print "it is not about dogs\n" if $string !~ /dog/;


• The literal string in the regex can be The literal string in the regex can be replaced by a variablereplaced by a variable

• If matching against $_ , the $_ =~ part If matching against $_ , the $_ =~ part can be omitted:can be omitted:

(2)

my $string ='stringify this world';my $string ='stringify this world';my $word = 'string'; my $animal = 'dog';my $word = 'string'; my $animal = 'dog';print "found '$word'\n" if $string =~ /$word/;print "found '$word'\n" if $string =~ /$word/;print "it is not about ${animal}s\n" print "it is not about ${animal}s\n" if $string !~ /$animal/;if $string !~ /$animal/;for('dog','string','dog'){for('dog','string','dog'){ print "$word\n" if /$word/print "$word\n" if /$word/}}


• The // default delimiters for a match can be The // default delimiters for a match can be changed to arbitrary delimiters by putting an 'm' changed to arbitrary delimiters by putting an 'm' in front in front

• Regexes must match a part of the string exactly Regexes must match a part of the string exactly in order for the statement to be truein order for the statement to be true

(3)

my $string ='Stringify this world!';my $string ='Stringify this world!';my $word = 'string'; my $animal = 'dog';my $word = 'string'; my $animal = 'dog';print "found '$word'\n" if $string =~ print "found '$word'\n" if $string =~ mm#$word#;#$word#;print "found '$word' in any case\n" print "found '$word' in any case\n" if $string =~ if $string =~ mm#$word##$word#ii;;print "it is not about ${animal}s\n" print "it is not about ${animal}s\n" if $string !~ if $string !~ mm($animal);($animal);for('dog','string','Dog'){for('dog','string','Dog'){ local $\=$/;local $\=$/; print if print if mm|$animal||$animal|}}


• perl will always match at the earliest possible perl will always match at the earliest possible point in the stringpoint in the string

• Some characters, called Some characters, called metacharactersmetacharacters, are , are reserved for use in regex notation. The reserved for use in regex notation. The metacharacters are (14):metacharacters are (14):

• A A metacharactermetacharacter can be matched by putting a can be matched by putting a backslash before itbackslash before it

(4)

my $string ='my $string ='StringStringify this stringy world!';ify this stringy world!';my $word = 'string'; my $word = 'string'; print "found '$word' in any case\n" print "found '$word' in any case\n" if $string =~ m{$word}i;if $string =~ m{$word}i;

{ } [ ] ( ) ^ $ . | * + ? \{ } [ ] ( ) ^ $ . | * + ? \

print "The string \n'$string'\n contains a DOT\n" print "The string \n'$string'\n contains a DOT\n" if $string =~ m|\.|;if $string =~ m|\.|;


• Non-printable ASCII characters are Non-printable ASCII characters are represented by escape sequencesrepresented by escape sequences

• Arbitrary bytes are represented by octal Arbitrary bytes are represented by octal escape sequencesescape sequences

use utf8; use utf8; binmode(STDOUT, ':utf8') if $ENV{LANG} =~/UTF-8/;binmode(STDOUT, ':utf8') if $ENV{LANG} =~/UTF-8/;$\=$/;$\=$/;my $string ="contains\r\n Then we have some\t\ttabs.my $string ="contains\r\n Then we have some\t\ttabs.б";б";print 'matched б(\x{431})' print 'matched б(\x{431})' if $string =~ /\x{431}/;if $string =~ /\x{431}/;print 'matched б' if $string =~/б/;print 'matched б' if $string =~/б/;print 'matched \r\n' if $string =~/\r\n/;print 'matched \r\n' if $string =~/\r\n/;print 'The string was:"' . $string.'"';print 'The string was:"' . $string.'"';

(5)


• To specify where it should match, use the To specify where it should match, use the anchor metacharactersanchor metacharacters ^̂ and and $$ . .

use strict; use warnings;$\=$/;use strict; use warnings;$\=$/;my $string ='A probably long chunk of text containing my $string ='A probably long chunk of text containing strings';strings';print 'matched "A"' if $string =~ /print 'matched "A"' if $string =~ /^̂A/;A/;print 'matched "strings"' if $string =~ /stringsprint 'matched "strings"' if $string =~ /strings$$/;/;print 'matched "A", matched "strings" and something in print 'matched "A", matched "strings" and something in between' between' if $string =~ /if $string =~ /^̂A.*?stringsA.*?strings$$/;/;

(6)

Character classesCharacter classes

• A A character classcharacter class allows a set of allows a set of possible possible characterscharacters, to match, to match

• Character classes are denoted by brackets Character classes are denoted by brackets [ ][ ]with the set of characters to be possibly matched with the set of characters to be possibly matched insideinside

• The special characters for a character class areThe special characters for a character class are - ] \ ^ $- ] \ ^ $ and are matched using an escape and are matched using an escape

• The special character '-' acts as a range operator The special character '-' acts as a range operator within character classes so you can write within character classes so you can write [0-9][0-9] and and [a-z][a-z]


• ExampleExample

use strict; use warnings;$\=$/;use strict; use warnings;$\=$/;my $string ='A probably long chunk of text containing my $string ='A probably long chunk of text containing strings';strings';my $thing = 'ong ung ang enanything';my $thing = 'ong ung ang enanything';my $every = 'iiiiii';my $every = 'iiiiii';my $nums = 'I have 4325 Euro';my $nums = 'I have 4325 Euro';my $class = 'dog'; my $class = 'dog'; print 'matched any of a, b or c' print 'matched any of a, b or c' if $string =~ /[abc]/;if $string =~ /[abc]/;

for($thing, $every, $string){for($thing, $every, $string){ print 'ingy brrrings nothing using: '.$_print 'ingy brrrings nothing using: '.$_ if /[$class]/if /[$class]/}}print $nums if $nums =~/[0-9]/;print $nums if $nums =~/[0-9]/;


• Perl has several abbreviations for common character Perl has several abbreviations for common character classesclasses• \d\d is a digit – is a digit – [0-9][0-9]• \s\s is a whitespace character – is a whitespace character – [\ \t\r\n\f][\ \t\r\n\f]• \w\w is a word character is a word character

(alphanumeric or _) – (alphanumeric or _) – [0-9a-zA-Z_][0-9a-zA-Z_]• \D\D is a negated \d – any character but a digit is a negated \d – any character but a digit [^0-9][^0-9]• \S\S is a negated \s; it represents any non-whitespace is a negated \s; it represents any non-whitespace

character character [^\s][^\s]• \W\W is a negated \w – any non-word character is a negated \w – any non-word character• The period 'The period '..' matches any character but "\n"' matches any character but "\n"• The \d\s\w\D\S\W inside and outside of character classesThe \d\s\w\D\S\W inside and outside of character classes• The word anchor The word anchor \b\b matches a boundary between a word matches a boundary between a word

character and a non-word character character and a non-word character \w\W\w\W or or \W\w\W\w


• ExampleExample

my $digits ="here are some digits3434 and then ";my $digits ="here are some digits3434 and then ";

print 'found digit' if $digits =~/\d/;print 'found digit' if $digits =~/\d/;

print 'found alphanumeric' if $digits =~/\w/;print 'found alphanumeric' if $digits =~/\w/;

print 'found space' if $digits =~/\s/;print 'found space' if $digits =~/\s/;

print 'digit followed by space, followed by letter'print 'digit followed by space, followed by letter' if $digits =~/\d\s[A-z]/;if $digits =~/\d\s[A-z]/;

Matching this or thatMatching this or that

• We can match different character strings We can match different character strings with the with the alternationalternation metacharacter '|' metacharacter '|'

• perl will try to match the regex at the perl will try to match the regex at the earliest possible point in the stringearliest possible point in the stringmy $digits ="here are some digits3434 and then ";my $digits ="here are some digits3434 and then ";

print 'found "are" or "and"' if $digits =~/are|and/;print 'found "are" or "and"' if $digits =~/are|and/;

Extracting matchesExtracting matches

• The grouping metacharacters The grouping metacharacters ()() also also allow the extraction of the parts of a allow the extraction of the parts of a string that matchedstring that matched

• For each grouping, the part that matched For each grouping, the part that matched inside goes into the special variables inside goes into the special variables $1$1 , , $2$2 , etc. , etc.

• They can be used just as ordinary They can be used just as ordinary variablesvariables

Extracting matchesExtracting matches

• The grouping metacharacters The grouping metacharacters ()() allow a part of a allow a part of a regex to be treated as a single unitregex to be treated as a single unit

• If the groupings in a regex are nested, If the groupings in a regex are nested, $1$1 gets the gets the group with the leftmost opening parenthesis, group with the leftmost opening parenthesis, $2 $2 the next opening parenthesis, etc. the next opening parenthesis, etc.

my $digits ="here are some digits3434 and then678 ";my $digits ="here are some digits3434 and then678 ";

print 'found a letter followed by leters or digits":'.$1 print 'found a letter followed by leters or digits":'.$1 if $digits =~/[a-z]([a-z]|\d+)/;if $digits =~/[a-z]([a-z]|\d+)/;print 'found a letter followed by digits":'.$1 print 'found a letter followed by digits":'.$1 if $digits =~/([a-z](\d+))/;if $digits =~/([a-z](\d+))/;# $1 $2# $1 $2print 'found letters followed by digits":'.$1 print 'found letters followed by digits":'.$1 if $digits =~/([a-z]+)(\d+)/;if $digits =~/([a-z]+)(\d+)/;# $1 $2# $1 $2

Matching repetitionsMatching repetitions

• The quantifier metacharacters The quantifier metacharacters ??, , ** , , ++ , and , and {}{} allow allow us to determine the number of repeats of a portion us to determine the number of repeats of a portion of a regexof a regex

• Quantifiers are put immediately after the character, Quantifiers are put immediately after the character, character class, or grouping character class, or grouping

• a?a? = match 'a' = match 'a' 11 or or 00 times times

• a*a* = match 'a' = match 'a' 00 or or moremore times, i.e., times, i.e., anyany number of times number of times

• a+a+ = match 'a' = match 'a' 11 or or moremore times, i.e., times, i.e., at least onceat least once

• a{n,m}a{n,m} = match = match at least nat least n times, but times, but not more than mnot more than m timestimes

• a{n,}a{n,} = match at least n or more times = match at least n or more times

• a{n}a{n} = match = match exactly nexactly n times times


use strict; use warnings;$\=$/;use strict; use warnings;$\=$/;my $digits ="here are some digits3434 and then678 ";my $digits ="here are some digits3434 and then678 ";

print 'found some letters followed by leters or print 'found some letters followed by leters or digits":'.$1 .$2digits":'.$1 .$2if $digits =~/([a-z]{2,})(\w+)/;if $digits =~/([a-z]{2,})(\w+)/;

print 'found three letter followed by digits":'.$1 .$2print 'found three letter followed by digits":'.$1 .$2if $digits =~/([a-z]{3}(\d+))/;if $digits =~/([a-z]{3}(\d+))/;

print 'found up to four letters followed by digits":'.print 'found up to four letters followed by digits":'.$1 .$2$1 .$2if $digits =~/([a-z]{1,4})(\d+)/;if $digits =~/([a-z]{1,4})(\d+)/;


• GreeeedyGreeeedy

use strict; use warnings;$\=$/;use strict; use warnings;$\=$/;my $digits ="here are some digits3434 and then678 ";my $digits ="here are some digits3434 and then678 ";

print 'found as much as possible letters print 'found as much as possible letters followed by digits":'.$1 .$2followed by digits":'.$1 .$2if $digits =~/([a-z]*)(\d+)/;if $digits =~/([a-z]*)(\d+)/;

Search and replaceSearch and replace

• Search and replace is performed using Search and replace is performed using s/regex/replacement/modifierss/regex/replacement/modifiers. .

• The replacement is a Perl double quoted string The replacement is a Perl double quoted string that replaces in the string whatever is matched with that replaces in the string whatever is matched with the regex . the regex .

• The operator =~ is used to associate a string with The operator =~ is used to associate a string with s///. s///.

• If matching against $_ , the $_ =~ can be dropped. If matching against $_ , the $_ =~ can be dropped.

• If there is a match, s/// returns the number of If there is a match, s/// returns the number of substitutions made, otherwise it returns falsesubstitutions made, otherwise it returns false


• The matched variables $1 , $2 , etc. are immediately The matched variables $1 , $2 , etc. are immediately available for use in the replacement expression.available for use in the replacement expression.

• With the global modifier, s///With the global modifier, s///gg will search and will search and replace all occurrences of the regex in the stringreplace all occurrences of the regex in the string

• The evaluation modifier s///The evaluation modifier s///ee wraps an eval{...} wraps an eval{...} around the replacement string and the evaluated around the replacement string and the evaluated result is substituted for the matched substring.result is substituted for the matched substring.

• s/// can use other delimiters, such as ss/// can use other delimiters, such as s!!!!!! and s and s{}{}{}{}, , and even sand even s{}//{}//

• If single quotes are used sIf single quotes are used s'''''', then the regex and , then the regex and replacement are treated as single quoted stringsreplacement are treated as single quoted strings


• ExampleExample

#TODO....#TODO....

The split operatorThe split operator

• split /regex/, string split /regex/, string splits string into a list of substrings and splits string into a list of substrings and returns that listreturns that list

• The regex determines the character sequence The regex determines the character sequence that string is split with respect tothat string is split with respect to#TODO....#TODO....

Regular expressionsRegular expressions

• ResourcesResources

• perlrequick - Perl regular expressions quick startperlrequick - Perl regular expressions quick start

• perlre - Perl regular expressionsperlre - Perl regular expressions

• perlreref - Perl Regular Expressions Referenceperlreref - Perl Regular Expressions Reference

• Beginning Perl Beginning Perl (Chapter 5 – Regular Expressions)(Chapter 5 – Regular Expressions)

Regular expressionsRegular expressions

Questions?Questions?

working with text, regular expressions

Technology

string print

string of characters

string n

literal string

big string

dog print

strings print

w print