Regular expressions
Aleksandr Lenin
Outline
• Motivation for regular expressions
• Constructing regular expressions
  – Atoms
  – Repetition operators
  – Concatenation
  – Alternation operator
  – Grouping operator
• Examples
• Perl extension to REs
• Backtracking problem
Acknowledgements
Thanks to Risto Vaarandi for sharing these wonderful materials.
Motivation for regular expressions
• Vehicle registration number in Estonia?
• Find all events from logs containing pattern
  – User *** login failure
• Regular expression:
  – is a pattern
  – describes the structure of some inputs
  – actual input either matches or does not match against the pattern
Application domain
• Input validation
  – securing our systems
• Template engines in CMS
  – separation of static and dynamic data
  – putting dynamic content together
  – macro substitution
• Querying textual data sets to find interesting information
Regular expression dialects
• Basic Regular Expression (BRE) language – the simplest dialect of RE language, supported by grep
• Extended regular expression (ERE) language – basically BRE enhanced with additional features, supported by egrep
• Perl regular expression language – adds lots of features to ERE language, supported by perl
• Other minor language flavors around
How regular expressions are built
[Diagram: atoms, optionally followed by repetition operators, form pieces; pieces are concatenated into branches; branches are combined by the branching (alternation) operator into a regular expression; the () operator turns a regular expression back into an atom.]
ERE atoms – matching single characters
• Regular characters are atoms that match themselves
  – a matches cat (but not CAT)
• To turn some special characters (like ?) into atoms that match themselves, add a backslash (\) in front of them
  – \? matches what?
  – \^ matches 2^3
  – \* matches 12*345
  – \\ matches a\b
ERE atoms – matching single characters (contd.)
• Dot (.) matches any character
  – . matches car, 98, X, .
  – \. matches a literal dot only, e.g. in .bashrc
• [list] matches any character in the list:
  – [9abc] matches rat, toc, 1984 (but not Art or dog)
• [^list] matches any character not in the list:
  – [^defD] matches federal, cab (but not Dede)
• Note that most special characters (like ? or .) are treated as regular characters inside [].
ERE atoms – matching single characters (contd.)
• A list may contain a character range:
  – a-z is the range of lowercase letters of the Latin alphabet: a, b, c, d, …, x, y, z
  – A-Z is the range of uppercase letters of the Latin alphabet: A, B, C, D, …, X, Y, Z
  – 0-9 is the range of digits: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9
  – [a-z0-9] matches Art, Dede, 1984 (but not FBI)
  – [^0-9.] matches myhost2 (but not 127.0.0.1)
  – Note that the dash (-) is a special character used for character ranges. To match a literal dash, always include it at the end of the list: [A-Za-z0-9_-]
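The single-character atoms above can be tried out with a short sketch. It assumes Python's re module as a stand-in for an ERE engine (the syntax used here is common to both); the helper name matches is hypothetical and mimics grep's "pattern occurs somewhere in the line" behaviour.

```python
import re

def matches(pattern: str, text: str) -> bool:
    """Return True if the pattern matches anywhere in text (grep-like)."""
    return re.search(pattern, text) is not None

# Escaped special characters match themselves.
print(matches(r"\?", "what?"))            # True
print(matches(r"\*", "12*345"))           # True

# Bracket lists, negated lists and ranges.
print(matches(r"[9abc]", "1984"))         # True: contains 9
print(matches(r"[9abc]", "dog"))          # False
print(matches(r"[^defD]", "Dede"))        # False: every character is in the list
print(matches(r"[a-z0-9]", "FBI"))        # False: uppercase letters only
print(matches(r"[^0-9.]", "127.0.0.1"))   # False: only digits and dots
print(matches(r"[A-Za-z0-9_-]", "-"))     # True: trailing dash is literal
```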
ERE atoms – matching single characters (contd.)
• A list may contain a character class:
  – [[:alpha:]] matches a (lower or upper case) letter
  – [[:alnum:]] matches a letter or a digit
  – [[:cntrl:][:blank:]] matches any control character, space, or tab
  – [^[:space:]_] matches any character that is not a tab, vertical tab, newline, carriage return, form feed, space, or underscore
  – [[:punct:]] matches any punctuation character like ! @ # ( ] | etc.
  – [^[:print:]] matches any non-printable character (not [:alnum:], [:punct:], or the space character)
ERE atoms – macros, boundaries and other stuff
Shortcuts are useful to keep your regular expressions simple and understandable.
• \w is identical to [[:alnum:]_]
• \W is identical to [^[:alnum:]_]
• \s is identical to [[:space:]]
• …
• ^ matches the beginning of a string
• $ matches the end of a string
ERE pieces & repetition operators
• A piece is an atom that may be followed by a single repetition operator (a.k.a. quantifier). An atom without a repetition operator is also a piece.
• X? matches 0 or 1 occurrences of strings matched by atom X:
  – a? matches John, ames, can
• X* matches 0 or more consecutive occurrences of strings matched by atom X:
  – [A-Z]* matches 12, AF12B (once the earliest minimum match is found, it gets expanded to maximum)
ERE pieces & repetition operators (contd.)
• X+ matches 1 or more consecutive occurrences of strings matched by atom X:
  – [[:digit:]]+ matches aa123bb12 (but not John)
• X{n} matches n consecutive occurrences of strings matched by atom X:
  – a{2} matches Caan, aaa (but not James)
• X{n,} matches at least n consecutive occurrences of strings matched by atom X:
  – [A-Za-z]{3,} matches aaBF8 (but not 12, AF12B)
• X{n,m} matches at least n but no more than m consecutive occurrences of strings matched by atom X:
  – [[:digit:]]{2,3} matches aa123bb12 (but not John2)
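To see which part of the input each repetition operator actually consumes, here is a minimal sketch. It assumes Python's re module in place of an ERE engine and writes [[:digit:]] as [0-9], since Python does not support POSIX bracket classes; the helper name first_match is hypothetical.

```python
import re
from typing import Optional

def first_match(pattern: str, text: str) -> Optional[str]:
    """Return the text of the earliest match, or None if there is no match."""
    m = re.search(pattern, text)
    return m.group(0) if m else None

print(repr(first_match(r"a?", "John")))             # '': zero occurrences match at position 0
print(repr(first_match(r"[A-Z]*", "12")))           # '': zero occurrences match at position 0
print(repr(first_match(r"[A-Z]*", "AF12B")))        # 'AF': expanded to the maximum
print(repr(first_match(r"[0-9]+", "aa123bb12")))    # '123'
print(repr(first_match(r"a{2}", "Caan")))           # 'aa'
print(repr(first_match(r"a{2}", "James")))          # None
print(repr(first_match(r"[A-Za-z]{3,}", "aaBF8")))  # 'aaBF'
print(repr(first_match(r"[0-9]{2,3}", "John2")))    # None
```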
ERE branches
A branch is a concatenation of one or more pieces. The resulting regular expression matches any string that is formed by concatenating substrings that are matched by the pieces:
– ^[0-9]+$ matches 1984 and 28 (but not A4)
– john\b matches john, Upjohn (but not johnson)
– ^[0-9]{3}[A-Z]{3}$ checks if the car has a standard Estonian registration number (three digits followed by three uppercase letters, e.g. 123ABC)
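As an illustration of input validation with a branch, the sketch below wraps the registration-number pattern in a function. It assumes Python's re module; the function name is_standard_plate is hypothetical. Note that in Python $ also matches just before a trailing newline, which is acceptable for this sketch.

```python
import re

# Three digits followed by three uppercase letters, anchored at both ends.
PLATE = re.compile(r"^[0-9]{3}[A-Z]{3}$")

def is_standard_plate(text: str) -> bool:
    """Check whether text has the shape of a standard Estonian plate."""
    return PLATE.search(text) is not None

print(is_standard_plate("123ABC"))    # True
print(is_standard_plate("ABC123"))    # False: wrong order
print(is_standard_plate("1234ABC"))   # False: anchors reject the extra digit

# \b asserts a word boundary after "john".
print(re.search(r"john\b", "Upjohn") is not None)    # True
print(re.search(r"john\b", "johnson") is not None)   # False
```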
ERE alternation operator
• A regular expression consists of one or more non-empty branches that are separated by |
• The branches represent a choice: a regular expression matches any string that any of its branches matches:
  – ^We like|apples matches either a string that begins with We like, or a string that contains apples
  – ^[0-9]{3}[A-Z]{3}$|^[A-Z]{3}[0-9]{3}$ checks if the car has a standard Estonian or Finnish registration number (three digits followed by three letters or vice versa, e.g. 123ABC or XYZ789).
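The point that | separates whole branches (so ^ belongs only to the first branch of ^We like|apples) can be checked with a small sketch. It assumes Python's re module; the helper name matches and the sample sentences are made up for illustration.

```python
import re

def matches(pattern: str, text: str) -> bool:
    """Return True if the pattern matches anywhere in text."""
    return re.search(pattern, text) is not None

# The anchor applies to the first branch only.
print(matches(r"^We like|apples", "We like pears"))     # True: begins with "We like"
print(matches(r"^We like|apples", "I eat apples"))      # True: contains "apples"
print(matches(r"^We like|apples", "They like pears"))   # False: neither branch matches

# Each branch carries its own anchors.
PLATES = r"^[0-9]{3}[A-Z]{3}$|^[A-Z]{3}[0-9]{3}$"
print(matches(PLATES, "123ABC"))   # True
print(matches(PLATES, "XYZ789"))   # True
print(matches(PLATES, "12ABCD"))   # False
```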
ERE grouping operator
• If a regular expression is enclosed in () it becomes an atom itself. Such a recursive definition of ERE allows for using complex expressions as building blocks for larger expressions:
  – ^([0-9]{1,3}\.){3}[0-9]{1,3}$ checks if the string looks like an IP address, e.g. matches 127.0.0.1
  – ^(We like (apples|bananas))?$ matches We like apples, We like bananas, or an empty string (note that a subexpression inside parentheses may contain other subexpressions in parentheses)
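A brief sketch of grouping, assuming Python's re module; the helper name matches is hypothetical. It also shows why the first pattern only checks that a string "looks like" an IP address: the octet values are not range-checked.

```python
import re

def matches(pattern: str, text: str) -> bool:
    """Return True if the pattern matches anywhere in text."""
    return re.search(pattern, text) is not None

IP_LIKE = r"^([0-9]{1,3}\.){3}[0-9]{1,3}$"
print(matches(IP_LIKE, "127.0.0.1"))         # True
print(matches(IP_LIKE, "1.2.3"))             # False: only three parts
print(matches(IP_LIKE, "999.999.999.999"))   # True: shape is right, values are not checked

OPTIONAL = r"^(We like (apples|bananas))?$"
print(matches(OPTIONAL, "We like apples"))   # True
print(matches(OPTIONAL, ""))                 # True: the whole group is optional
print(matches(OPTIONAL, "We like pears"))    # False
```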
ERE examples
– sshd\[[0-9]+\]: Connection from [0-9.]+$ matches the message:
  Jan 18 12:33:01 myhost sshd[1399]: Connection from 127.0.0.1
– sshd\[[0-9]+\]: Connection from ([0-9]{1,3}\.){3}[0-9]{1,3}$ matches the same message:
  Jan 18 12:33:01 myhost sshd[1399]: Connection from 127.0.0.1
ERE examples (contd.)
– sshd\[[0-9]+\]: Failed password for \w+ from ([0-9]{1,3}\.){3}[0-9]{1,3} port [0-9]+ ssh2$ matches the message:
  Jan 18 12:33:01 myhost sshd[1399]: Failed password for klaus from 127.0.0.1 port 2316 ssh2
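In practice such a pattern is often used to pull fields out of the log line. The sketch below assumes Python's re module and adds extra capturing parentheses around the PID, user name, address and port; these extra groups and the variable names are illustrative additions, not part of the pattern above.

```python
import re

LINE = ("Jan 18 12:33:01 myhost sshd[1399]: Failed password for klaus "
        "from 127.0.0.1 port 2316 ssh2")

# Same pattern as on the slide, with added capturing groups (an assumption
# made for this example). Group 4 is the inner repeated "octet plus dot".
PATTERN = re.compile(
    r"sshd\[([0-9]+)\]: Failed password for (\w+) "
    r"from (([0-9]{1,3}\.){3}[0-9]{1,3}) port ([0-9]+) ssh2$"
)

m = PATTERN.search(LINE)
pid, user, address, port = m.group(1), m.group(2), m.group(3), m.group(5)
print(pid, user, address, port)   # 1399 klaus 127.0.0.1 2316
```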
ERE examples (contd.)
• Task: match hostnames conforming to the following naming scheme: must begin with a letter; must end with a letter or digit; may contain letters, digits and hyphens between the first and last character:
  – ^[A-Za-z][A-Za-z0-9-]*[A-Za-z0-9]$ is a good starting point: it matches all such hostnames except single-letter names.
  – ^([A-Za-z]|[A-Za-z][A-Za-z0-9-]*[A-Za-z0-9])$ matches all such hostnames, including names consisting of one letter only.
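The difference between the two hostname patterns shows up on single-letter names. A minimal sketch, assuming Python's re module; the constant and function names are hypothetical.

```python
import re

FIRST_TRY = re.compile(r"^[A-Za-z][A-Za-z0-9-]*[A-Za-z0-9]$")
FINAL = re.compile(r"^([A-Za-z]|[A-Za-z][A-Za-z0-9-]*[A-Za-z0-9])$")

def is_valid_hostname(name: str) -> bool:
    """Check a name against the final hostname pattern."""
    return FINAL.search(name) is not None

print(FIRST_TRY.search("a") is not None)   # False: needs at least two characters
print(is_valid_hostname("a"))              # True
print(is_valid_hostname("web-01"))         # True
print(is_valid_hostname("1host"))          # False: must begin with a letter
print(is_valid_hostname("host-"))          # False: must not end with a hyphen
```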
Some differences between BRE and ERE languages
• BRE has no |, + and ? operators. However, some versions of grep support them as \|, \+, \?
• ()-operator must be written as \(\)
• {}-operator must be written as \{\}
• ^ and $ have a special meaning only at the beginning and at the end of the (sub)expression, respectively.
• Text matching subexpressions in parentheses is assigned to back reference atoms \1,…,\9 (also supported by some versions of egrep):
  – ^\([AB]\)\1$ matches strings AA or BB
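Back references can be tried out as follows. This sketch assumes Python's re module, which uses unescaped parentheses, so the BRE pattern ^\([AB]\)\1$ is written as ^([AB])\1$; the constant name DOUBLED is hypothetical.

```python
import re

# \1 must repeat exactly the text captured by the first group.
DOUBLED = re.compile(r"^([AB])\1$")

print(DOUBLED.search("AA") is not None)   # True
print(DOUBLED.search("BB") is not None)   # True
print(DOUBLED.search("AB") is not None)   # False: \1 is "A", but "B" follows
```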
NFA and DFA regexp engines
• NFA – the matching procedure is governed by the regular expression; during matching, the expression is treated as a mini-program with a single thread of execution; the same input byte may be read multiple times as the engine backtracks to find a solution.
• DFA – the matching procedure is governed by input, different branches of regular expression are checked in parallel.
• DFA does not support match variables, NFA does.
Important features of Perl regular expressions engine
• Perl regular expression engine is an NFA engine.
• Matching starts at the earliest possible position
  – ([0-9]+) will match in 2001-2010 ($1=2001)
• Repetition operators (?, *, +) are greedy
  – (.+), will match 1,2,3,4 ($1 = 1,2,3)
  – authentication failure;.*uid=([0-9]+) will match authentication failure; logname=klaus uid=500 euid=0 ($1 = 0)
  – "(.?)" will match both "" and "A"
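The greedy behaviour can be reproduced with Python's re module, which is assumed here as a stand-in for the Perl engine (it is also a backtracking engine with greedy quantifiers); m.group(1) plays the role of $1.

```python
import re

# Matching starts at the earliest possible position.
print(re.search(r"([0-9]+)", "2001-2010").group(1))   # 2001

# Greedy .+ takes as much as possible while still leaving a comma to match.
print(re.search(r"(.+),", "1,2,3,4").group(1))        # 1,2,3

# Greedy .* runs past "uid=500" to the last "uid=", which is inside "euid=0".
LOG = "authentication failure; logname=klaus uid=500 euid=0"
print(re.search(r"authentication failure;.*uid=([0-9]+)", LOG).group(1))   # 0
```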
Important features of Perl regular expressions engine (contd.)
• Leftmost match always wins
  – (.*)(.+) matches abcd ($1 = abc, $2 = d)
  – (.*)(.*)(.+) matches abcd ($1 = abc, $2 is empty, $3 = d). $2 can’t get c, because this would come at the expense of $1!
• In the alternation, the leftmost branch that allows a match will always be used, even if the following branches would provide a longer match (this is the main difference of Perl engine compared to other engines!)
  – (xyz|ab|abcd) matches abcd ($1=ab)
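The same group assignments can be observed in Python's re module, assumed here as a stand-in because it follows the same leftmost-wins rule as the Perl engine; groups() returns the equivalents of $1, $2, and so on.

```python
import re

# The leftmost group is served first; later groups get what is left.
print(re.search(r"(.*)(.+)", "abcd").groups())        # ('abc', 'd')
print(re.search(r"(.*)(.*)(.+)", "abcd").groups())    # ('abc', '', 'd')

# The first alternative that allows a match is used, not the longest one.
print(re.search(r"(xyz|ab|abcd)", "abcd").group(1))   # ab
```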
ERE matching in Linux
GNU egrep uses a DFA engine for regexp matching, which produces the longest possible match. In the example below the leftmost match produced by b? does not win, since (bc)? matches more, so abc is the matched part of the output line:
    echo abcd | egrep --color 'ab?(bc)?'
    abcd
ERE matching in Linux (contd.)
Perl uses an NFA engine and thus the results are slightly different. The leftmost match wins: b? takes the b, (bc)? then matches nothing, and only ab is matched:
    echo abcd | pcregrep --color 'ab?(bc)?'
    abcd
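Since terminal highlighting does not survive in plain text, the NFA-side result can be checked with a short sketch. It assumes Python's re module, which is a backtracking engine like Perl's; it cannot reproduce the longest-match behaviour of egrep.

```python
import re

m = re.search(r"ab?(bc)?", "abcd")
print(m.group(0))   # ab: b? consumed the b, so (bc)? had nothing to match
print(m.group(1))   # None: the optional group did not take part in the match
```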
Some useful Perl extensions to REs
• In order to make a repetition operator non-greedy, add the ?-suffix to the operator:
  – (.*?)(.+) matches abcd ($1 is empty, $2=abcd)
  – (.*?)(.*?)(.+?) matches abcd ($1 and $2 are empty, $3 = a)
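Non-greedy operators are also available in Python's re module with the same ?-suffix syntax, so the examples can be checked directly; Python is an assumption here, standing in for Perl.

```python
import re

# The lazy group takes as little as possible; the greedy group takes the rest.
print(re.search(r"(.*?)(.+)", "abcd").groups())         # ('', 'abcd')

# With every operator lazy, the match stops after a single character.
print(re.search(r"(.*?)(.*?)(.+?)", "abcd").groups())   # ('', '', 'a')
```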
Some useful Perl extensions to REs (contd.)
• (?modifier) prefixes:
  – (?i)regexp – case insensitive matching for regexp
  – (?s)regexp – dots (.) will also match newline
  – (?m)regexp – ^ and $ will match the beginning and the end of line anywhere inside the string if it’s multiline
  – Modifiers might be negated, e.g.
    • (?i)a(?-i)b matches Ab or ab, but not aB or AB
Some useful Perl extensions to REs (contd.)
• (?:regexp) – same as (regexp), but the text matched by regexp will not set a match variable; the () are used for grouping purposes only.
• (?<name>regexp) – in addition to the numbered match variable, also sets the named match variable name. This construct can also be written as (?P<name>regexp) or (?'name'regexp).
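A minimal sketch of non-capturing and named groups, assuming Python's re module. Python accepts only the (?P<name>regexp) spelling of named groups, so that form is used; the sample log text and the "Accepted" alternative are made up for illustration.

```python
import re

TEXT = "sshd[1399]: Failed password for klaus from 127.0.0.1"

# (?:...) groups the alternation without creating a match variable,
# so the named group "user" is also group number 1.
m = re.search(r"(?:Failed|Accepted) password for (?P<user>\w+)", TEXT)

print(m.group("user"))   # klaus
print(m.group(1))        # klaus
print(m.groups())        # ('klaus',)
```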
Some useful Perl extensions to REs (contd.)
• Character classes:
  – \w – alphanumeric or underscore (_)
  – \W – negates \w (neither alphanumeric nor _)
  – \s – whitespace character class
  – \S – negates \s (non-whitespace)
  – \d – digit character
  – \D – negates \d (non-digit)
Some useful Perl extensions to REs (contd.)
• Look-ahead and look-behind assertions:
  – (?=pattern) – zero-width positive look-ahead, e.g. AA(?=BB) matches AA that is followed by BB
  – (?!pattern) – zero-width negative look-ahead, e.g. AA(?!BB) matches AA not followed by BB
  – (?<=pattern) – zero-width positive look-behind, e.g. (?<=AA)BB matches BB that follows AA
  – (?<!pattern) – zero-width negative look-behind, e.g. (?<!AA)BB matches BB that does not follow AA
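"Zero-width" means the asserted text is checked but not included in the match. The sketch below shows this through match positions; it assumes Python's re module, which supports all four assertions with the same syntax, and uses made-up sample strings.

```python
import re

AHEAD = "AABB AACC"    # AA at index 0 is followed by BB, AA at index 5 is not
BEHIND = "AABB CCBB"   # BB at index 2 follows AA, BB at index 7 does not

m = re.search(r"AA(?=BB)", AHEAD)
print(m.span(), m.group(0))                     # (0, 2) AA: BB is not part of the match

print(re.search(r"AA(?!BB)", AHEAD).span())     # (5, 7)
print(re.search(r"(?<=AA)BB", BEHIND).span())   # (2, 4)
print(re.search(r"(?<!AA)BB", BEHIND).span())   # (7, 9)
```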
The backtracking problem
Suppose one wants to match sequences of ‘a’ which might contain single instances of B and end with a digit, for example: aaBaaaaaBaa9, aaaaaBaa7, aaaa1.
As a solution, the following expression is crafted:
    ^(a+B?)+\d$
The backtracking problem (contd.)
The expression matches strings it should match:
    echo "aaaBaaaa9" | pcregrep --color '^(a+B?)+\d$'
    aaaBaaaa9
(so far so good)
What about strings it should not match?
    perl -e 'print "a" x 1000' | pcregrep '^(a+B?)+\d$'
    …
    pcregrep: pcre_exec() error -8 while matching this text:
    pcregrep: error -8 means that a resource limit was exceeded
    pcregrep: check your regex for nested unlimited loops
The backtracking problem (contd.)
• After (a+B?)+ has matched all ‘a’ characters and sees no trailing digit, it has to go back in the string and try all other possible options for matching.
• Unfortunately, (a+B?)+ can match 1000 ‘a’ characters in a huge number of different ways, since the () group can divide these characters into up to 1000 parts of different sizes.
• This takes ages to compute and the regexp engine considers it to be a possible infinite loop.
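To get a feeling for "a huge number of different ways", the sketch below counts how many ways a run of n ‘a’ characters can be cut into consecutive non-empty parts, one part per iteration of the group. It is a rough model only: the function name count_splits is hypothetical, and a real engine's actual work depends on its internal optimisations and limits.

```python
def count_splits(n: int) -> int:
    """Count the ways to cut a run of n characters into non-empty consecutive parts."""
    ways = [1] + [0] * n   # ways[0] = 1: an empty remainder can be split in one way
    for total in range(1, n + 1):
        # The first part takes k characters; the rest is split in ways[total - k] ways.
        ways[total] = sum(ways[total - k] for k in range(1, total + 1))
    return ways[n]

print(count_splits(3))                   # 4: aaa, aa+a, a+aa, a+a+a
print(count_splits(10))                  # 512
print(count_splits(1000) == 2 ** 999)    # True: the count doubles with every extra character
```

Under this model every additional ‘a’ doubles the number of candidate splits, which is why the resource limit is hit long before all of them are tried.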
Avoiding unnecessary backtracking
• In order to avoid unnecessary backtracking, the following constructs have been introduced to recent versions of Perl regular expressions engine:
  – (?>pattern) – once the pattern has matched the maximum amount of data, backtracking is not attempted (even if pattern is a part of a larger expression which fails to match as a consequence)
  – Possessive repetition operators ?+, *+, ++, {}+: they behave like the ?, *, + and {} operators, but after they have matched the maximum amount of data, backtracking is not attempted.
Avoiding unnecessary backtracking (contd.)
We can address the previous problem by applying the (?>…) construct to (a+B?)+
    perl -e 'print "a" x 1000' | pcregrep '^(?>(a+B?)+)\d$'
We could also use the possessive quantifier ++, and write (a+B?)+ as (a+B?)++
    perl -e 'print "a" x 1000' | pcregrep '^(a+B?)++\d$'
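The same idea can be sketched in Python. Python's re module supports (?>…) and possessive quantifiers only from version 3.11 on, so this example swaps in a commonly used emulation that also works on older versions: a look-ahead captures the greedy match, and a back reference then consumes exactly that text. The engine does not backtrack into a look-ahead that has succeeded. The names ATOMIC and safe_match are hypothetical.

```python
import re

# (?=((?:a+B?)+)) matches the run once, greedily, without consuming it;
# \1 then consumes the captured text as a fixed string.
ATOMIC = re.compile(r"^(?=((?:a+B?)+))\1\d$")

def safe_match(text: str) -> bool:
    """Match ^(a+B?)+\\d$ without re-splitting the run of 'a' characters on failure."""
    return ATOMIC.search(text) is not None

print(safe_match("aaaBaaaa9"))      # True
print(safe_match("aaBaaaaaBaa9"))   # True
print(safe_match("a" * 1000))       # False, and it fails quickly
```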
Questions ???