Regular expressions

Aleksandr Lenin

Outline

• Motivation for regular expressions
• Constructing regular expressions
– Atoms
– Repetition operators
– Concatenation
– Alternation operator
– Grouping operator
• Examples
• Perl extension to REs
• Backtracking problem

Acknowledgements

Thanks to Risto Vaarandi for sharing these wonderful materials.

Motivation for regular expressions

• Vehicle registration number in Estonia?
• Find all events from logs containing pattern
– User *** login failure
• Regular expression:
– is a pattern
– describes the structure of some inputs
– an actual input either matches the pattern or does not

Application domain

• Input validation
– securing our systems
• Template engines in CMS
– separation of static and dynamic data
– putting dynamic content together
– macro substitution
• Querying textual data sets to find interesting information

Regular expression dialects

• Basic Regular Expression (BRE) language – the simplest dialect of RE language, supported by grep

• Extended regular expression (ERE) language – basically BRE enhanced with additional features, supported by egrep

• Perl regular expression language – adds lots of features to ERE language, supported by perl

• Other minor language flavors around

How regular expressions are built

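[Figure: structure of a regular expression. Atoms (Atom1 to Atom4) combined with repetition operators form pieces (Piece1 to Piece4); pieces are joined by concatenation into branches (Branch1, Branch2); branches are joined by branching operators into a regular expression; the () operator turns a regular expression back into an atom.]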

ERE atoms – matching single characters

• Regular characters are atoms that match themselves
– a matches cat (but not CAT)
• To turn some special characters (like ?) into atoms that match themselves, add a backslash (\) in front of them
– \? matches what?
– \^ matches 2^3
– \* matches 12*345
– \\ matches a\b

ERE atoms – matching single characters (contd.)

• Dot (.) matches any character
– . matches car, 98, X, .
– \. matches a literal dot only, e.g. the dot in .bashrc
• [list] matches any character in the list:
– [9abc] matches rat, toc, 1984 (but not Art or dog)
• [^list] matches any character not in the list:
– [^defD] matches federal, cab (but not Dede)
• Note that most special characters (like ? or .) are treated as regular characters inside [].


• A list may contain a character range:
– a-z is a range of lowercase characters of the Latin alphabet: a, b, c, d, …, x, y, z
– A-Z is a range of uppercase characters of the Latin alphabet: A, B, C, D, …, X, Y, Z
– 0-9 is a range of digits: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9
– [a-z0-9] matches Art, Dede, 1984 (but not FBI)
– [^0-9.] matches myhost2 (but not 127.0.0.1)
– Note that dash (-) is a special character used for character ranges. To match a literal dash, always include it at the end of the list: [A-Za-z0-9_-]
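The single-character atoms above can be tried out in a few lines. This is a minimal sketch that assumes Python's standard re module as a stand-in for an ERE engine (for these constructs the syntax is the same); the helper name matches is hypothetical.

```python
import re

def matches(pattern: str, text: str) -> bool:
    """Return True if the pattern matches anywhere in the text."""
    return re.search(pattern, text) is not None

# Escaped special characters match themselves; an unescaped dot matches anything.
print(matches(r"\?", "what?"))
print(matches(r"\.", ".bashrc"), matches(r"\.", "bashrc"))

# Lists and negated lists.
print(matches(r"[9abc]", "1984"), matches(r"[9abc]", "Art"))
print(matches(r"[^defD]", "federal"), matches(r"[^defD]", "Dede"))

# Ranges; a dash placed at the end of the list is a literal dash.
print(matches(r"[a-z0-9]", "Art"), matches(r"[a-z0-9]", "FBI"))
print(matches(r"[^0-9.]", "myhost2"), matches(r"[^0-9.]", "127.0.0.1"))
print(matches(r"[A-Za-z0-9_-]", "-"))
```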


• A list may contain a character class:
– [[:alpha:]] matches a (lower or upper case) letter
– [[:alnum:]] matches a letter or a digit
– [[:cntrl:][:blank:]] matches any control character, space, or tabulation symbol
– [^[:space:]_] matches any character that is not a tabulation, vertical tabulation, newline, carriage return, form feed, space, or underscore
– [[:punct:]] matches any punctuation character like ! @ # ( ] | etc.
– [^[:print:]] matches any non-printable character (anything other than [:alnum:], [:punct:] and the space character)

ERE atoms – macros, boundaries and other stuff

Shortcuts are useful to keep your regular expressions simple and understandable.

• \w is identical to [[:alnum:]_]
• \W is identical to [^[:alnum:]_]
• \s is identical to [[:space:]]
• …
• ^ matches the beginning of a string
• $ matches the end of a string

ERE pieces & repetition operators

• A piece is an atom that may be followed by a single repetition operator (a.k.a. quantifier). An atom without a repetition operator is also a piece.

• X? matches 0 or 1 occurrences of strings matched by atom X:
– a? matches John, ames, can
• X* matches 0 or more consecutive occurrences of strings matched by atom X:
– [A-Z]* matches 12, AF12B (once the earliest minimum match is found, it gets expanded to maximum)

ERE pieces & repetition operators (contd.)

• X+ matches 1 or more consecutive occurrences of strings matched by atom X:
– [[:digit:]]+ matches aa123bb12 (but not John)
• X{n} matches n consecutive occurrences of strings matched by atom X:
– a{2} matches Caan, aaa (but not James)
• X{n,} matches at least n consecutive occurrences of strings matched by atom X:
– [A-Za-z]{3,} matches aaBF8 (but not 12, AF12B)
• X{n,m} matches at least n but no more than m consecutive occurrences of strings matched by atom X:
– [[:digit:]]{2,3} matches aa123bb12 (but not John2)
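To see what each repetition operator actually consumes, the sketch below prints the first match for the examples above. It assumes Python's re module, which has no [[:digit:]] class, so [0-9] is used instead; first_match is a hypothetical helper.

```python
import re

def first_match(pattern: str, text: str):
    """Return the text of the first match, or None if there is no match."""
    m = re.search(pattern, text)
    return m.group(0) if m else None

# ? and * accept zero occurrences, so they can match the empty string.
print(repr(first_match(r"a?", "John")))
print(repr(first_match(r"[A-Z]*", "12")))
print(repr(first_match(r"[A-Z]*", "AF12B")))

# + and the {} forms need at least one (or n) occurrences.
print(repr(first_match(r"[0-9]+", "aa123bb12")))
print(repr(first_match(r"a{2}", "Caan")))
print(repr(first_match(r"[A-Za-z]{3,}", "aaBF8")))
print(repr(first_match(r"[A-Za-z]{3,}", "AF12B")))
print(repr(first_match(r"[0-9]{2,3}", "aa123bb12")))
print(repr(first_match(r"[0-9]{2,3}", "John2")))
```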

ERE branches

A branch is a concatenation of one or more pieces. The resulting regular expression matches any string that is formed by concatenating substrings that are matched by the pieces:
– ^[0-9]+$ matches 1984 and 28 (but not A4)
– john\b matches john, Upjohn (but not johnson)
– ^[0-9]{3}[A-Z]{3}$ checks whether a string is a standard Estonian vehicle registration number (three digits followed by three uppercase letters, e.g. 123ABC)
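The registration number check above might be wrapped in a small validation function. This is a sketch assuming Python's re module; the function name is_estonian_plate is hypothetical.

```python
import re

# Three digits followed by three uppercase letters, anchored at both ends.
PLATE = re.compile(r"^[0-9]{3}[A-Z]{3}$")

def is_estonian_plate(text: str) -> bool:
    """Return True if text has the form of a standard Estonian plate."""
    # Note: in Python, $ also matches just before a trailing newline.
    return PLATE.search(text) is not None

for candidate in ["123ABC", "12ABC", "123abc", "X123ABC"]:
    print(candidate, is_estonian_plate(candidate))

# \b marks a word boundary: john must not be followed by a word character.
print(bool(re.search(r"john\b", "Upjohn")), bool(re.search(r"john\b", "johnson")))
```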

ERE alternation operator

• A regular expression consists of one or more non-empty branches that are separated by |
• Each branch represents a choice: a regular expression matches any string that any of the branches matches:
– ^We like|apples matches either a string that begins with We like, or a string that contains apples
– ^[0-9]{3}[A-Z]{3}$|^[A-Z]{3}[0-9]{3}$ checks whether a string is a standard Estonian or Finnish registration number (three digits followed by three letters or vice versa, e.g. 123ABC or XYZ789)
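Because | separates whole branches, the anchor in ^We like|apples belongs to the first branch only. A short sketch, assuming Python's re module, makes this visible; the names LOOSE and PLATE are hypothetical.

```python
import re

# The ^ applies to "We like" only; "apples" may appear anywhere.
LOOSE = re.compile(r"^We like|apples")
print(bool(LOOSE.search("We like pears")))     # first branch
print(bool(LOOSE.search("green apples")))      # second branch
print(bool(LOOSE.search("They like pears")))   # neither branch

# Each branch carries its own anchors, so both plate formats are fully checked.
PLATE = re.compile(r"^[0-9]{3}[A-Z]{3}$|^[A-Z]{3}[0-9]{3}$")
for candidate in ["123ABC", "XYZ789", "123XYZ789"]:
    print(candidate, bool(PLATE.search(candidate)))
```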

ERE grouping operator

• If a regular expression is enclosed in () it becomes an atom itself. Such a recursive definition of ERE allows for using complex expressions as building blocks for larger expressions:
– ^([0-9]{1,3}\.){3}[0-9]{1,3}$ checks if the string looks like an IP address, e.g. matches 127.0.0.1
– ^(We like (apples|bananas))?$ matches We like apples, We like bananas, or an empty string (note that a subexpression inside parentheses may contain other subexpressions in parentheses)
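The two grouping examples can be checked as follows. This is a sketch assuming Python's re module; note that the IP pattern only checks the shape of the string, not that each number is at most 255.

```python
import re

# A group followed by {3} repeats the whole "digits plus dot" unit.
IP_LIKE = re.compile(r"^([0-9]{1,3}\.){3}[0-9]{1,3}$")
for candidate in ["127.0.0.1", "999.999.999.999", "127.0.0"]:
    print(candidate, bool(IP_LIKE.search(candidate)))

# Nested groups: the outer group is optional, the inner one is a choice.
FRUIT = re.compile(r"^(We like (apples|bananas))?$")
m = FRUIT.search("We like apples")
print(m.group(1), "/", m.group(2))
print(bool(FRUIT.search("")), bool(FRUIT.search("We like pears")))
```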

ERE examples

– sshd\[[0-9]+\]: Connection from [0-9.]+$ matches the message:
Jan 18 12:33:01 myhost sshd[1399]: Connection from 127.0.0.1
– sshd\[[0-9]+\]: Connection from ([0-9]{1,3}\.){3}[0-9]{1,3}$ matches the same message:
Jan 18 12:33:01 myhost sshd[1399]: Connection from 127.0.0.1

ERE examples (contd.)

– sshd\[[0-9]+\]: Failed password for \w+ from ([0-9]{1,3}\.){3}[0-9]{1,3} port [0-9]+ ssh2$ matches the message:
Jan 18 12:33:01 myhost sshd[1399]: Failed password for klaus from 127.0.0.1 port 2316 ssh2
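In practice such a pattern is usually used to pull fields out of the log line. The sketch below assumes Python's re module and adds capturing parentheses around the user name, address and port; these extra groups and the function name parse_failed_login are not part of the original expression.

```python
import re

# Same expression as above, with capture groups added around the three fields.
# (?:...) groups without capturing, so the address stays a single group.
FAILED_LOGIN = re.compile(
    r"sshd\[[0-9]+\]: Failed password for (\w+) "
    r"from ((?:[0-9]{1,3}\.){3}[0-9]{1,3}) port ([0-9]+) ssh2$"
)

def parse_failed_login(line: str):
    """Return (user, address, port) for a failed-login line, else None."""
    m = FAILED_LOGIN.search(line)
    return m.groups() if m else None

line = ("Jan 18 12:33:01 myhost sshd[1399]: "
        "Failed password for klaus from 127.0.0.1 port 2316 ssh2")
print(parse_failed_login(line))
print(parse_failed_login("Jan 18 12:33:01 myhost sshd[1399]: Connection from 127.0.0.1"))
```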

ERE examples (contd.)

• Task: match hostnames conforming to the following name scheme: must begin with a letter; must end with a letter or digit; may contain letters, digits and hyphens between the first and last character:
– ^[A-Za-z][A-Za-z0-9-]*[A-Za-z0-9]$ is a good starting point; it matches all hostnames except for single-letter names.
– ^([A-Za-z]|[A-Za-z][A-Za-z0-9-]*[A-Za-z0-9])$ matches all hostnames, including names consisting of one letter only.
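The difference between the two hostname expressions shows up only for one-letter names. A minimal sketch, assuming Python's re module; FIRST_TRY and HOSTNAME are hypothetical names.

```python
import re

FIRST_TRY = re.compile(r"^[A-Za-z][A-Za-z0-9-]*[A-Za-z0-9]$")
HOSTNAME = re.compile(r"^([A-Za-z]|[A-Za-z][A-Za-z0-9-]*[A-Za-z0-9])$")

for name in ["myhost2", "web-01", "a", "-web", "web-", "1web"]:
    print(f"{name!r}: first try={bool(FIRST_TRY.search(name))}, "
          f"final={bool(HOSTNAME.search(name))}")
```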

Some differences between BRE and ERE languages

• BRE has no |, + and ? operators. (However, some versions of grep support them as \|, \+, \?)
• ()-operator must be written as \(\)
• {}-operator must be written as \{\}
• ^ and $ have a special meaning only in the beginning and in the end of the (sub)expression.
• Text matching subexpressions in parentheses is assigned to back reference atoms \1,…,\9 (also supported by some versions of egrep):
– ^\([AB]\)\1$ matches strings AA or BB
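Back references can be tried in Python's re module as well, which is assumed here as a stand-in; it uses ERE-style unescaped parentheses, so the BRE expression above becomes ^([AB])\1$.

```python
import re

# \1 must repeat exactly what the first group matched, not just the same class.
DOUBLED = re.compile(r"^([AB])\1$")

for candidate in ["AA", "BB", "AB", "BA"]:
    print(candidate, bool(DOUBLED.search(candidate)))
```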

NFA and DFA regexp engines

• NFA: the matching procedure is governed by the regular expression. During matching, the expression is treated as a mini-program with a single thread of execution; the same input byte may be read multiple times as the engine backtracks to find a solution.

• DFA – the matching procedure is governed by input, different branches of regular expression are checked in parallel.

• DFA does not support match variables, NFA does.

Important features of Perl regular expressions engine

• Perl regular expression engine is an NFA engine.
• Matching starts at the earliest possible position
– ([0-9]+) will match in 2001-2010 ($1 = 2001)
• Repetition operators (?, *, +) are greedy
– (.+), will match 1,2,3,4 ($1 = 1,2,3)
– authentication failure;.*uid=([0-9]+) will match authentication failure; logname=klaus uid=500 euid=0 ($1 = 0)
– "(.?)" will match both "" and "A"
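These greedy-matching results can be reproduced with any backtracking engine. The sketch below assumes Python's re module, whose behaviour for these patterns is the same as Perl's; capture is a hypothetical helper that returns what Perl would put in $1.

```python
import re

def capture(pattern: str, text: str) -> str:
    """Return the text captured by the first group (Perl's $1)."""
    return re.search(pattern, text).group(1)

# Matching starts at the earliest possible position.
print(capture(r"([0-9]+)", "2001-2010"))

# Greedy .+ and .* grab as much as possible, then give back only what is needed.
print(capture(r"(.+),", "1,2,3,4"))
print(capture(r"authentication failure;.*uid=([0-9]+)",
              "authentication failure; logname=klaus uid=500 euid=0"))

# .? may match zero characters or one.
print(repr(capture(r'"(.?)"', '""')), repr(capture(r'"(.?)"', '"A"')))
```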

Important features of Perl regular expressions engine (contd.)

• Leftmost match always wins
– (.*)(.+) matches abcd ($1 = abc, $2 = d)
– (.*)(.*)(.+) matches abcd ($1 = abc, $2 is empty, $3 = d). $2 can't get c, because this would come at the expense of $1!
• In the alternation, the leftmost branch that allows a match will always be used, even if the following branches would provide a longer match (this is the main difference of the Perl engine compared to other engines!)
– (xyz|ab|abcd) matches abcd ($1 = ab)
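The way competing groups share the input can be inspected directly. A minimal sketch, assuming Python's re module, which follows the same leftmost-first rules as Perl for these patterns:

```python
import re

# The first greedy group takes everything it can; later groups get the rest.
print(re.search(r"(.*)(.+)", "abcd").groups())
print(re.search(r"(.*)(.*)(.+)", "abcd").groups())

# The first branch that allows a match wins, even though abcd would be longer.
print(re.search(r"(xyz|ab|abcd)", "abcd").group(1))
```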

ERE matching in Linux

GNU egrep uses a DFA engine for regexp matching, which produces the longest possible match. In the example below, the leftmost match produced by b? does not win, since (bc)? matches more; the highlighted match is abc.

echo abcd | egrep --color 'ab?(bc)?'
abcd

ERE matching in Linux (contd.)

Perl uses an NFA engine and thus the results are slightly different: the leftmost match wins, so b? takes the b, (bc)? matches nothing, and the highlighted match is ab.

echo abcd | pcregrep --color 'ab?(bc)?'
abcd
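The two results can be compared side by side. The sketch below assumes Python's re module, which is a backtracking engine and behaves like Perl here. The function leftmost_longest is a hypothetical brute-force stand-in for the longest-match rule: it tries every substring and is only meant to show the result egrep reports, not how a DFA actually works.

```python
import re

def leftmost_longest(pattern: str, text: str):
    """Brute-force emulation of the leftmost-longest rule (illustration only)."""
    compiled = re.compile(pattern)
    for start in range(len(text) + 1):
        # Try the longest candidate substring first.
        for end in range(len(text), start - 1, -1):
            if compiled.fullmatch(text, start, end):
                return text[start:end]
    return None

# Backtracking engine: b? grabs the b, so (bc)? has nothing left to match.
print(re.search(r"ab?(bc)?", "abcd").group(0))

# Longest-match rule: skipping b? lets (bc)? match, giving a longer result.
print(leftmost_longest(r"ab?(bc)?", "abcd"))
```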

Some useful Perl extensions to REs

• In order to make a repetition operator non-greedy, add the ?-suffix to the operator:
– (.*?)(.+) matches abcd ($1 is empty, $2 = abcd)
– (.*?)(.*?)(.+?) matches abcd ($1 and $2 are empty, $3 = a)
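Non-greedy operators take as little as they can. A short sketch, assuming Python's re module, which supports the same ?-suffix:

```python
import re

# .*? matches nothing if the rest of the expression can still succeed.
print(re.search(r"(.*?)(.+)", "abcd").groups())

# With every operator non-greedy, the overall match is just the first character.
print(re.search(r"(.*?)(.*?)(.+?)", "abcd").groups())

# A typical use: stop at the first comma instead of the last one.
print(re.search(r"(.+?),", "1,2,3,4").group(1))
```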

Some useful Perl extensions to REs (contd.)

• (?modifier) prefixes:
– (?i)regexp: case insensitive matching for regexp
– (?s)regexp: dots (.) will also match newline
– (?m)regexp: ^ and $ will match the beginning and the end of a line anywhere inside the string if it's multiline
– Modifiers might be negated, e.g. (?i)a(?-i)b matches Ab or ab, but not aB or AB
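The modifiers can be tried in Python's re module, assumed here as a stand-in for Perl. One difference to be aware of: Python accepts (?i), (?s) and (?m) only at the start of the expression and has no bare (?-i), so the scoped form (?i:a)b is used below as the closest equivalent of (?i)a(?-i)b.

```python
import re

# (?i): case-insensitive matching for the whole expression.
print(bool(re.search(r"(?i)regexp", "REGEXP")))

# (?s): the dot also matches a newline.
print(bool(re.search(r"a.b", "a\nb")), bool(re.search(r"(?s)a.b", "a\nb")))

# (?m): ^ and $ match at every line, not only at the string boundaries.
print(re.findall(r"^\w+$", "one\ntwo"), re.findall(r"(?m)^\w+$", "one\ntwo"))

# Scoped modifier: only the a is case-insensitive.
for candidate in ["Ab", "ab", "aB", "AB"]:
    print(candidate, bool(re.fullmatch(r"(?i:a)b", candidate)))
```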


• (?:regexp): same as (regexp), but the text matched by regexp will not set a match variable; the () are used for grouping purposes only.
• (?<name>regexp): in addition to the numbered match variable, also set the named match variable name. This construct can also be written as (?P<name>regexp) or (?'name'regexp).
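A short sketch of both constructs, assuming Python's re module. Python accepts only the (?P<name>regexp) spelling; the group name pid and the sample line are made up for illustration.

```python
import re

# (?:...) groups the alternatives without creating a match variable;
# (?P<pid>...) creates group 1 and also makes it available by name.
PROCESS = re.compile(r"(?:sshd|login)\[(?P<pid>[0-9]+)\]")

m = PROCESS.search("Jan 18 12:33:01 myhost sshd[1399]: Connection from 127.0.0.1")
print(m.groups())        # only one group was captured
print(m.group("pid"))    # access by name
print(m.group(1))        # the same group by number
```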


• Character classes:
– \w: alphanumeric or underscore (_)
– \W: negates \w (neither alphanumeric nor _)
– \s: whitespace character class
– \S: negates \s (non-whitespace)
– \d: digit character
– \D: negates \d (non-digit)


• Look-ahead and look-behind assertions:
– (?=pattern): zero-width positive look-ahead, e.g. AA(?=BB) matches AA that is followed by BB
– (?!pattern): zero-width negative look-ahead, e.g. AA(?!BB) matches AA not followed by BB
– (?<=pattern): zero-width positive look-behind, e.g. (?<=AA)BB matches BB that follows AA
– (?<!pattern): zero-width negative look-behind, e.g. (?<!AA)BB matches BB that does not follow AA
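Because the assertions are zero-width, the text they inspect is not part of the match. The sketch below assumes Python's re module, which supports all four forms, and reports where each match starts; starts is a hypothetical helper.

```python
import re

def starts(pattern: str, text: str):
    """Return the start offsets of all matches of pattern in text."""
    return [m.start() for m in re.finditer(pattern, text)]

# Look-ahead: the AA at offset 0 is followed by BB, the AA at offset 5 is not.
print(starts(r"AA(?=BB)", "AABB AACC"), starts(r"AA(?!BB)", "AABB AACC"))

# Look-behind: the BB at offset 2 follows AA, the BB at offset 7 does not.
print(starts(r"(?<=AA)BB", "AABB CCBB"), starts(r"(?<!AA)BB", "AABB CCBB"))

# The asserted text is not consumed: the match itself is only AA.
print(re.search(r"AA(?=BB)", "AABB").group(0))
```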

The backtracking problem

Suppose one wants to match sequences of 'a' which might contain single instances of B and end with a digit, for example: aaBaaaaaBaa9, aaaaaBaa7, aaaa1.

As a solution, the following expression is crafted: ^(a+B?)+\d$

The backtracking problem (contd.)

The expression matches strings it should match:

echo "aaaBaaaa9" | pcregrep --color '^(a+B?)+\d$'
aaaBaaaa9 (so far so good)

What about strings it should not match?

perl -e 'print "a" x 1000' | pcregrep '^(a+B?)+\d$'
…

pcregrep: pcre_exec() error -8 while matching this text:
pcregrep: error -8 means that a resource limit was exceeded
pcregrep: check your regex for nested unlimited loops

The backtracking problem (contd.)

• After (a+B?)+ has matched all ‘a’ characters and sees no trailing digit, it has to go back in the string and try all other possible options for matching.

• Unfortunately, (a+B?)+ can match 1000 'a' characters in a huge number of different ways, since the repeated () group can divide these characters into up to 1000 parts of different sizes.

• This takes ages to compute and the regexp engine considers it to be a possible infinite loop.
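To get a feeling for how "huge" that number is, one can count the ways a run of n 'a' characters can be cut into non-empty consecutive parts, which is what the repeated group can do with input that contains no B. This is a counting sketch rather than a model of the engine's exact steps; the function name ways_to_split is hypothetical.

```python
def ways_to_split(n: int) -> int:
    """Count the ways to cut n characters into non-empty consecutive parts."""
    ways = [1] + [0] * n          # ways[0] = 1: nothing left to split
    for length in range(1, n + 1):
        # The first part takes `first` characters; the rest is split recursively.
        ways[length] = sum(ways[length - first] for first in range(1, length + 1))
    return ways[n]

for n in (1, 2, 3, 10, 20):
    print(n, ways_to_split(n))

# The count doubles with every extra character: 2 ** (n - 1).
print(ways_to_split(1000) == 2 ** 999)
```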

Avoiding unnecessary backtracking

• In order to avoid unnecessary backtracking, the following constructs have been introduced to recent versions of the Perl regular expressions engine:
– (?>pattern): once the pattern has matched the maximum amount of data, backtracking is not attempted (even if pattern is a part of a larger expression which fails to match as a consequence)
– Possessive repetition operators ?+, *+, ++, {}+: they behave like the ?, *, + and {} operators, but after they have matched the maximum amount of data, backtracking is not attempted.

Avoiding unnecessary backtracking (contd.)

We can address the previous problem by applying the (?>…) construct to (a+B?)+

perl -e 'print "a" x 1000' | pcregrep '^(?>(a+B?)+)\d$'

We could also use the possessive quantifier ++, and write (a+B?)+ as (a+B?)++

perl -e 'print "a" x 1000' | pcregrep '^(a+B?)++\d$'
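The same idea can be sketched in Python, assumed here as a stand-in for Perl. Python's re module accepts (?>…) and ++ only from version 3.11 on, so the sketch uses a different, widely used idiom that works in older versions too: a look-ahead is not re-entered once it has succeeded, so capturing inside it and replaying the capture with \1 behaves like an atomic group. The names ATOMIC and is_valid are hypothetical.

```python
import re

# Emulation of ^(?>(a+B?)+)\d$ : the look-ahead finds the longest run once,
# \1 consumes exactly that run, and no other split of the run is ever tried.
ATOMIC = re.compile(r"^(?=((?:a+B?)+))\1\d$")

def is_valid(text: str) -> bool:
    """Return True for runs of 'a' with optional single B's, ending in a digit."""
    return ATOMIC.search(text) is not None

for sample in ["aaBaaaaaBaa9", "aaaaaBaa7", "aaaa1", "aaaBaaaa9"]:
    print(sample, is_valid(sample))

# The input that broke the original expression is now rejected immediately.
print(is_valid("a" * 1000))
```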

Questions ???
