regular expression howt tutorial of the re
TRANSCRIPT
7/23/2019 Regular Expression HOWT Tutorial of the Re
http://slidepdf.com/reader/full/regular-expression-howt-tutorial-of-the-re 1/23
7/23/2019 Regular Expression HOWT Tutorial of the Re
http://slidepdf.com/reader/full/regular-expression-howt-tutorial-of-the-re 2/23
`
Regular Expression HOWTΟ Tutorial of the re module
7/23/2019 Regular Expression HOWT Tutorial of the Re
http://slidepdf.com/reader/full/regular-expression-howt-tutorial-of-the-re 3/23
Abstract
This document is an introductory tutorial to using regular expressions in Python with the re module. It provides a
gentler introduction than the corresponding section in the Library Reference.
Introduction
Regular expressions (called REs or regexes or regex patterns! are essentially a tiny highly speciali"ed
programming language embedded inside Python and made available through the re module. #sing this little
language you specify the rules for the set of possible strings that you want to match$ this set might contain English
sentences or e%mail addresses or Te& commands or anything you li'e. ou can then as' )uestions such as
*+oes this string match the pattern,- or *Is there a match for the pattern anywhere in this string,-. ou can also
use REs to modify a string or to split it apart in various ways.
Regular expression patterns are compiled into a series of bytecodes which are then executed by a matching engine
written in . /or advanced use it may be necessary to pay careful attention to how the engine will execute a givenRE and write the RE in a certain way in order to produce bytecode that runs faster. 0ptimi"ation isn1t covered in
this document because it re)uires that you have a good understanding of the matching engine1s internals.
The regular expression language is relatively small and restricted so not all possible string processing tas's can be
done using regular expressions. There are also tas's thatcan be done with regular expressions but the
expressions turn out to be very complicated. In these cases you may be better off writing Python code to do the
processing$ while Python code will be slower than an elaborate regular expression it will also probably be more
understandable.
Simple atterns
2e1ll start by learning about the simplest possible regular expressions. 3ince regular expressions are used to
operate on strings we1ll begin with the most common tas'4 matching characters.
/or a detailed explanation of the computer science underlying regular expressions (deterministic and non%
deterministic finite automata! you can refer to almost any textboo' on writing compilers.
!atching "haracters
5ost letters and characters will simply match themselves. /or example the regular expression test will match
the string test exactly. (ou can enable a case%insensitive mode that would let this RE match Test or TE3T as well$
more about this later.!
There are exceptions to this rule$ some characters are special metacharacters and don1t match themselves.
Instead they signal that some out%of%the%ordinary thing should be matched or they affect other portions of the RE
by repeating them or changing their meaning. 5uch of this document is devoted to discussing various
metacharacters and what they do.
6ere1s a complete list of the metacharacters$ their meanings will be discussed in the rest of this 602T0.
. 7 8 9 : , ; < = > ? @ ( !
7/23/2019 Regular Expression HOWT Tutorial of the Re
http://slidepdf.com/reader/full/regular-expression-howt-tutorial-of-the-re 4/23
The first metacharacters we1ll loo' at are = and >. They1re used for specifying a character class which is a set of
characters that you wish to match. haracters can be listed individually or a range of characters can be indicated
by giving two characters and separating them by a A%A. /or example =abc> will match any of the characters a b or c$
this is the same as =a%c> which uses a range to express the same set of characters. If you wanted to match only
lowercase letters your RE would be =a%">.
5etacharacters are not active inside classes. /or example =a'm8> will match any of the characters AaA A'A AmA
or A8A$ A8A is usually a metacharacter but inside a character class it1s stripped of its special nature.
ou can match the characters not listed within the class by complementing the set. This is indicated by including
a A7A as the first character of the class$ A7A outside a character class will simply match the A7A character. /or
example =7B> will match any character except ABA.
Perhaps the most important metacharacter is the bac'slash ?. Cs in Python string literals the bac'slash can be
followed by various characters to signal various special se)uences. It1s also used to escape all the metacharacters
so you can still match them in patterns$ for example if you need to match a = or ? you can precede them with a
bac'slash to remove their special meaning4 ?= or ??.
3ome of the special se)uences beginning with A?A represent predefined sets of characters that are often useful such
as the set of digits the set of letters or the set of anything that isn1t whitespace.
Let1s ta'e an example4 ?w matches any alphanumeric character. If the regex pattern is expressed in bytes this is
e)uivalent to the class =a%"C%D%FG>. If the regex pattern is a string ?w will match all the characters mar'ed as
letters in the #nicode database provided by the unicodedata module. ou can use the more restricted definition
of ?w in a string pattern by supplying the re.C3II flag when compiling the regular expression.
The following list of special se)uences isn1t complete. /or a complete list of se)uences and expanded class
definitions for #nicode string patterns see the last part of Regular Expression Syntax in the 3tandard Library
reference. In general the #nicode versions match any character that1s in the appropriate category in the #nicode
database.?d
5atches any decimal digit$ this is e)uivalent to the class =%F>.
?+
5atches any non%digit character$ this is e)uivalent to the class =7%F>.
?s
5atches any whitespace character$ this is e)uivalent to the class = ?t?n?r?f?v>.
?3
5atches any non%whitespace character$ this is e)uivalent to the class =7 ?t?n?r?f?v>.
?w
5atches any alphanumeric character$ this is e)uivalent to the class =a%"C%D%FG>.
?2
5atches any non%alphanumeric character$ this is e)uivalent to the class =7a%"C%D%FG>.
These se)uences can be included inside a character class. /or example =?s.> is a character class that will match
any whitespace character or AA or A.A.
The final metacharacter in this section is .. It matches anything except a newline character and there1s an alternate
mode (re.+0TCLL! where it will match even a newline.A.A is often used where you want to match *any character-.
Repeating Things
7/23/2019 Regular Expression HOWT Tutorial of the Re
http://slidepdf.com/reader/full/regular-expression-howt-tutorial-of-the-re 5/23
Heing able to match varying sets of characters is the first thing regular expressions can do that isn1t already
possible with the methods available on strings. 6owever if that was the only additional capability of regexes they
wouldn1t be much of an advance. Cnother capability is that you can specify that portions of the RE must be
repeated a certain number of times.
The first metacharacter for repeating things that we1ll loo' at is 9. 9 doesn1t match the literal character 9$ instead it
specifies that the previous character can be matched "ero or more times instead of exactly once.
/or example ca9t will match ct ( a characters! cat ( a! caaat (J a characters! and so forth. The RE engine has
various internal limitations stemming from the si"e of 1s int type that will prevent it from matching over K
billion a characters$ patterns are usually not written to match that much data.
Repetitions such as 9 are greedy $ when repeating a RE the matching engine will try to repeat it as many times as
possible. If later portions of the pattern don1t match the matching engine will then bac' up and try again with few
repetitions.
C step%by%step example will ma'e this more obvious. Let1s consider the expression a=bcd>9b. This matches the
letter AaA "ero or more letters from the class =bcd> and finally ends with a AbA. ow imagine matching this RE againstthe string abcbd.
Step !atched Explanation
a The a in the RE matches.
K abcbd The engine matches =bcd>9 going as far as it can which is to the end of the string.
J Failure The engine tries to match b but the current position is at the end of the string so it fails.
M abcb Hac' up so that =bcd>9 matches one less character.
B Failure Try b again but the current position is at the last character which is a AdA.
N abc Hac' up again so that =bcd>9 is only matching bc.
N abcb Try b again. This time the character at the current position is AbA so it succeeds.
The end of the RE has now been reached and it has matched abcb. This demonstrates how the matching engine
goes as far as it can at first and if no match is found it will then progressively bac' up and retry the rest of the RE
again and again. It will bac' up until it has tried "ero matches for =bcd>9 and if that subse)uently fails the engine
will conclude that the string doesn1t match the RE at all.
Cnother repeating metacharacter is : which matches one or more times. Pay careful attention to the difference
between 9 and :$ 9 matches zero or more times so whatever1s being repeated may not be present at allwhile : re)uires at least one occurrence. To use a similar example ca:t will match cat ( a! caaat (J aOs! but won1t
match ct.
There are two more repeating )ualifiers. The )uestion mar' character , matches either once or "ero times$ you
can thin' of it as mar'ing something as being optional. /or example home%,brew matches
either homebrew or home%brew.
The most complicated repeated )ualifier is ;mn< where m and n are decimal integers. This )ualifier means there
must be at least m repetitions and at most n. /or examplea;J<b will match ab ab and ab. It won1t match ab
which has no slashes or ab which has four.
ou can omit either m or n$ in that case a reasonable value is assumed for the missing value. 0mitting m is
interpreted as a lower limit of while omitting n results in an upper bound of infinity Q actually the upper bound is
the K%billion limit mentioned earlier but that might as well be infinity.
7/23/2019 Regular Expression HOWT Tutorial of the Re
http://slidepdf.com/reader/full/regular-expression-howt-tutorial-of-the-re 6/23
Readers of a reductionist bent may notice that the three other )ualifiers can all be expressed using this
notation. ;< is the same as 9 ;< is e)uivalent to : and ;<is the same as ,. It1s better to use 9 : or , when
you can simply because they1re shorter and easier to read.
#sing Regular Expressions
ow that we1ve loo'ed at some simple regular expressions how do we actually use them in Python,
The re module provides an interface to the regular expression engine allowing you to compile REs into obects and
then perform matches with them.
"ompiling Regular Expressions
Regular expressions are compiled into pattern obects which have methods for various operations such as
searching for pattern matches or performing string substitutions.
SSS
$$$ import re
$$$ p re.compile(Aab9A!
$$$ p
re.compile(Aab9A!
re.compile(! also accepts an optional flags argument used to enable various special features and syntax variations.
2e1ll go over the available settings later but for now a single example will do4
SSS
$$$ p re.compile(Aab9A re.IU0REC3E!
The RE is passed to re.compile(! as a string. REs are handled as strings because regular expressions aren1t part of
the core Python language and no special syntax was created for expressing them. (There are applications that
don1t need REs at all so there1s no need to bloat the language specification by including them.! Instead
the remodule is simply a extension module included with Python ust li'e the soc'et or "lib modules.
Putting REs in strings 'eeps the Python language simpler but has one disadvantage which is the topic of the next
section.
The %ac&slash lague
Cs stated earlier regular expressions use the bac'slash character ( A?A! to indicate special forms or to allow special
characters to be used without invo'ing their special meaning. This conflicts with Python1s usage of the same
character for the same purpose in string literals.
Let1s say you want to write a RE that matches the string ?section which might be found in a LaTe& file. To figure out
what to write in the program code start with the desiredstring to be matched. ext you must escape any
bac'slashes and other metacharacters by preceding them with a bac'slash resulting in the string ??section. The
resultingstring that must be passed to re.compile(! must be ??section. 6owever to express this as a Python string
literal both bac'slashes must be escaped again.
7/23/2019 Regular Expression HOWT Tutorial of the Re
http://slidepdf.com/reader/full/regular-expression-howt-tutorial-of-the-re 7/23
"haracters Stage
?section Text string to be matched
??section Escaped bac'slash for re.compile(!
V????sectionV Escaped bac'slashes for a string literal
In short to match a literal bac'slash one has to write A????A as the RE string because the regular expression must
be ?? and each bac'slash must be expressed as ??inside a regular Python string literal. In REs that feature
bac'slashes repeatedly this leads to lots of repeated bac'slashes and ma'es the resulting strings difficult to
understand.
The solution is to use Python1s raw string notation for regular expressions$ bac'slashes are not handled in any
special way in a string literal prefixed with ArA so rV?nV is a two%character string containing A?A and AnA while V?nV is a
one%character string containing a newline. Regular expressions will often be written in Python code using this raw
string notation.
Regular String Ra' string
Vab9V rVab9V
V????sectionV rV??sectionV
V??w:??s:??V rV?w:?s:?V
erforming !atches
0nce you have an obect representing a compiled regular expression what do you do with it, Pattern obects have
several methods and attributes. 0nly the most significant ones will be covered here$ consult the re docs for acomplete listing.
!ethod(Attribute urpose
match(! +etermine if the RE matches at the beginning of the string.
search(! 3can through a string loo'ing for any location where this RE matches.
findall(! /ind all substrings where the RE matches and returns them as a list.
finditer(! /ind all substrings where the RE matches and returns them as aniterator .
match(! and search(! return one if no match can be found. If they1re successful a match object instance is
returned containing information about the match4 where it starts and ends the sub string it matched and more.
ou can learn about this by interactively experimenting with the re module. If you have t'inter available you may
also want to loo' at Toolsdemoredemo.py a demonstration program included with the Python distribution. It
allows you to enter REs and strings and displays whether the RE matches or fails. redemo.py can be )uite useful
when trying to debug a complicated RE. Phil 3chwart"1s Wodos is also an interactive tool for developing and testing
RE patterns.
This 602T0 uses the standard Python interpreter for its examples. /irst run the Python interpreter import
the re module and compile a RE4
SSS
$$$ import re
$$$ p re.compile(A=a%">:A!
7/23/2019 Regular Expression HOWT Tutorial of the Re
http://slidepdf.com/reader/full/regular-expression-howt-tutorial-of-the-re 8/23
$$$ p
re.compile(A=a%">:A!
ow you can try matching various strings against the RE =a%">:. Cn empty string shouldn1t match at all
since : means Oone or more repetitions1. match(! should returnone in this case which will cause the interpreter to
print no output. ou can explicitly print the result of match(! to ma'e this clear.
SSS
$$$ p.match(VV!
$$$ print(p.match(VV!!
one
ow let1s try it on a string that it should match such as tempo. In this case match(! will return a match object so
you should store the result in a variable for later use.
SSS
$$$ m p.match(AtempoA!
$$$ m
XGsre.3REG5atch obect$ span( B! matchAtempoAS
ow you can )uery the match object for information about the matching string. match object instances also have
several methods and attributes$ the most important ones are4
!ethod(Attribute urpose
group(! Return the string matched by the RE
start(! Return the starting position of the match
end(! Return the ending position of the match
span(! Return a tuple containing the (start end! positions of the match
Trying these methods will soon clarify their meaning4
SSS
$$$ m.group(!
AtempoA
$$$ m.start(! m.end(!
( B!
$$$ m.span(!
( B!
group(! returns the substring that was matched by the RE. start(! and end(! return the starting and ending index of
the match. span(! returns both start and end indexes in a single tuple. 3ince the match(! method only chec's if the
RE matches at the start of a string start(! will always be "ero. 6owever the search(! method of patterns scans
through the string so the match may not start at "ero in that case.
SSS
$$$ print(p.match(A444 messageA!!
one
$$$ m p.search(A444 messageA!$ print(m!
XGsre.3REG5atch obect$ span(M ! matchAmessageAS
7/23/2019 Regular Expression HOWT Tutorial of the Re
http://slidepdf.com/reader/full/regular-expression-howt-tutorial-of-the-re 9/23
$$$ m.group(!
AmessageA
$$$ m.span(!
(M !
In actual programs the most common style is to store the match object in a variable and then chec' if it was one.
This usually loo's li'e4
p re.compile( ... !
m p.match( Astring goes hereA !
if m4
print(A5atch found4 A m.group(!!
else4
print(Ao matchA!
Two pattern methods return all of the matches for a pattern. findall(! returns a list of matching strings4
SSS
$$$ p re.compile(A?d:A!
$$$ p.findall(AK drummers drumming pipers piping lords a%leapingA !
=AKA AA AA>
findall(! has to create the entire list before it can be returned as the result. The finditer(! method returns a
se)uenceof match object instances as an iterator 4
SSS
$$$ iterator p.finditer(AK drummers drumming ... ...A!
$$$ iterator
XcallableGiterator obect at x...S
$$$ for match in iterator4
))) print(match.span(!!
)))
( K!
(KK KM!
(KF J!
7/23/2019 Regular Expression HOWT Tutorial of the Re
http://slidepdf.com/reader/full/regular-expression-howt-tutorial-of-the-re 10/23
!odule*+e,el -unctions
ou don1t have to create a pattern obect and call its methods$ the re module also provides top%level functions
called match(! search(! findall(! sub(! and so forth. These functions ta'e the same arguments as the
corresponding pattern method with the RE string added as the first argument and still return either one or
a match object instance.
SSS
$$$ print(re.match(rA/rom?s:A A/romage am'A!!
one
$$$ re.match(rA/rom?s:A A/rom am' Thu 5ay M F4K4 FFYA!
XGsre.3REG5atch obect$ span( B! matchA/rom AS
#nder the hood these functions simply create a pattern obect for you and call the appropriate method on it. They
also store the compiled obect in a cache so future calls using the same RE won1t need to parse the pattern again
and again.
3hould you use these module%level functions or should you get the pattern and call its methods yourself, If you1re
accessing a regex within a loop pre%compiling it will save a few function calls. 0utside of loops there1s not much
difference than's to the internal cache.
"ompilation -lags
ompilation flags let you modify some aspects of how regular expressions wor'. /lags are available in
the re module under two names a long name such as IU0REC3Eand a short one%letter form such as I. (If
you1re familiar with Perl1s pattern modifiers the one%letter forms use the same letters$ the short form
of re.ZERH03E is re.& for example.! 5ultiple flags can be specified by bitwise 0R%ing them$ re.I @ re.5 sets both
the I and 5 flags for example.
6ere1s a table of the available flags followed by a more detailed explanation of each one.
-lag !eaning
C3II C5a'es several escapes li'e ?w ?b ?s and ?d match only on C3II characters with
the respective property.
+0TCLL 3 5a'e . match any character including newlines
IU0REC3E I +o case%insensitive matches
L0CLE L +o a locale%aware match
5#LTILIE 5 5ulti%line matching affecting 7 and 8
ZERH03E & (for
Oextended1!Enable verbose REs which can be organi"ed more cleanly and understandably.
I
I./ORE"ASE
Perform case%insensitive matching$ character class and literal strings will match letters by ignoring case. /or
example =C%D> will match lowercase letters too and 3pamwill match 3pam spam or spC5. This lowercasing
doesn1t ta'e the current locale into account$ it will if you also set the L0CLE flag.
7/23/2019 Regular Expression HOWT Tutorial of the Re
http://slidepdf.com/reader/full/regular-expression-howt-tutorial-of-the-re 11/23
+
+O"A+E
5a'e ?w ?2 ?b and ?H dependent on the current locale instead of the #nicode database.
Locales are a feature of the library intended to help in writing programs that ta'e account of language
differences. /or example if you1re processing /rench text you1d want to be able to write ?w: to match words
but ?w only matches the character class =C%Da%">$ it won1t match A[A or A\A. If your system is configured properly and a
/rench locale is selected certain functions will tell the program that A[A should also be considered a letter. 3etting
the L0CLE flag when compiling a regular expression will cause the resulting compiled obect to use these
functions for ?w$ this is slower but also enables ?w: to match /rench words as you1d expect.
!
!#+TI+I/E
(7 and 8 haven1t been explained yet$ they1ll be introduced in section More Metacharacters.!
#sually 7 matches only at the beginning of the string and 8 matches only at the end of the string and immediately
before the newline (if any! at the end of the string. 2hen this flag is specified 7 matches at the beginning of the
string and at the beginning of each line within the string immediately following each newline. 3imilarly
the 8metacharacter matches either at the end of the string and at the end of each line (immediately preceding each
newline!.
S
0OTA++
5a'es the A.A special character match any character at all including a newline$ without this flag A.A will match
anything except a newline.
A
AS"II
5a'e ?w ?2 ?b ?H ?s and ?3 perform C3II%only matching instead of full #nicode matching. This is onlymeaningful for #nicode patterns and is ignored for byte patterns.
1
2ER%OSE
This flag allows you to write regular expressions that are more readable by granting you more flexibility in how you
can format them. 2hen this flag has been specified whitespace within the RE string is ignored except when the
whitespace is in a character class or preceded by an unescaped bac'slash$ this lets you organi"e and indent the
RE more clearly. This flag also lets you put comments within a RE that will be ignored by the engine$ comments are
mar'ed by a A]A that1s neither in a character class or preceded by an unescaped bac'slash.
/or example here1s a RE that uses re.ZERH03E$ see how much easier it is to read,
charref re.compile(rVVV
^=]> ] 3tart of a numeric entity reference
(
=%_>: ] 0ctal form
@ =%F>: ] +ecimal form
@ x=%Fa%fC%/>: ] 6exadecimal form
!
$ ] Trailing semicolon
VVV re.ZERH03E!
2ithout the verbose setting the RE would loo' li'e this4
7/23/2019 Regular Expression HOWT Tutorial of the Re
http://slidepdf.com/reader/full/regular-expression-howt-tutorial-of-the-re 12/23
charref re.compile(V^](=%_>:V
V@=%F>:V
V@x=%Fa%fC%/>:!$V!
In the above example Python1s automatic concatenation of string literals has been used to brea' up the RE into
smaller pieces but it1s still more difficult to understand than the version using re.ZERH03E.
!ore attern o'er
3o far we1ve only covered a part of the features of regular expressions. In this section we1ll cover some new
metacharacters and how to use groups to retrieve portions of the text that was matched.
!ore !etacharacters
There are some metacharacters that we haven1t covered yet. 5ost of them will be covered in this section.3ome of the remaining metacharacters to be discussed are zero-width assertions. They don1t cause the engine to
advance through the string$ instead they consume no characters at all and simply succeed or fail. /or
example ?b is an assertion that the current position is located at a word boundary$ the position isn1t changed by
the ?b at all. This means that "ero%width assertions should never be repeated because if they match once at a
given location they can obviously be matched an infinite number of times.
@
Clternation or the *or- operator. If C and H are regular expressions C@H will match any string that matches
either C or H. @ has very low precedence in order to ma'e it wor' reasonably when you1re alternating multi%
character strings. row@3ervo will match either row or 3ervo not ro a AwA or an A3A and ervo.
To match a literal A@A use ?@ or enclose it inside a character class as in =@>.
7
5atches at the beginning of lines. #nless the 5#LTILIE flag has been set this will only match at the beginning of
the string. In 5#LTILIE mode this also matches immediately after each newline within the string.
/or example if you wish to match the word /rom only at the beginning of a line the RE to use is 7/rom.
SSS
$$$ print(re.search(A7/romA A/rom 6ere to EternityA!!
XGsre.3REG5atch obect$ span( M! matchA/romAS
$$$ print(re.search(A7/romA AReciting /rom 5emoryA!!
one
8
5atches at the end of a line which is defined as either the end of the string or any location followed by a newline
character.
SSS
$$$ print(re.search(A<8A A;bloc'<A!!
XGsre.3REG5atch obect$ span(N _! matchA<AS
$$$ print(re.search(A<8A A;bloc'< A!!
one
7/23/2019 Regular Expression HOWT Tutorial of the Re
http://slidepdf.com/reader/full/regular-expression-howt-tutorial-of-the-re 13/23
$$$ print(re.search(A<8A A;bloc'< 3nA!!
XGsre.3REG5atch obect$ span(N _! matchA<AS
To match a literal A8A use ?8 or enclose it inside a character class as in =8>.
?C
5atches only at the start of the string. 2hen not in 5#LTILIE mode ?C and 7 are effectively the same.
In 5#LTILIE mode they1re different4 ?C still matches only at the beginning of the string but 7 may match at anylocation inside the string that follows a newline character.
?D
5atches only at the end of the string.
?b
2ord boundary. This is a "ero%width assertion that matches only at the beginning or end of a word. C word is
defined as a se)uence of alphanumeric characters so the end of a word is indicated by whitespace or a non%
alphanumeric character.
The following example matches class only when it1s a complete word$ it won1t match when it1s contained inside
another word.
SSS
$$$ p re.compile(rA?bclass?bA!
$$$ print(p.search(Ano class at allA!!
XGsre.3REG5atch obect$ span(J Y! matchAclassAS
$$$ print(p.search(Athe declassified algorithmA!!
one
$$$ print(p.search(Aone subclass isA!!
one
There are two subtleties you should remember when using this special se)uence. /irst this is the worst collision
between Python1s string literals and regular expression se)uences. In Python1s string literals ?b is the bac'space
character C3II value Y. If you1re not using raw strings then Python will convert the ?b to a bac'space and your
RE won1t match as you expect it to. The following example loo's the same as our previous RE but omits the ArA in
front of the RE string.
SSS
$$$ p re.compile(A 3bclass 3bA!
$$$ print(p.search(Ano class at allA!!
one
$$$ print(p.search(A 3bA : AclassA : A 3bA!!
XGsre.3REG5atch obect$ span( _! matchA?xYclass?xYAS
3econd inside a character class where there1s no use for this assertion ?b represents the bac'space character
for compatibility with Python1s string literals.
?H
Cnother "ero%width assertion this is the opposite of ?b only matching when the current position is not at a word
boundary.
.rouping
7/23/2019 Regular Expression HOWT Tutorial of the Re
http://slidepdf.com/reader/full/regular-expression-howt-tutorial-of-the-re 14/23
/re)uently you need to obtain more information than ust whether the RE matched or not. Regular expressions are
often used to dissect strings by writing a RE divided into several subgroups which match different components of
interest. /or example an R/%YKK header line is divided into a header name and a value separated by a A4A li'e
this4
/rom4 author 4example.com
#ser % Cgent4 Thunderbird .B..F (&KNKK_!
5I5E%Zersion4 .
To4 editor 4example.com
This can be handled by writing a regular expression which matches an entire header line and has one group which
matches the header name and another group which matches the header1s value.
Uroups are mar'ed by the A(A A!A metacharacters. A(A and A!A have much the same meaning as they do in
mathematical expressions$ they group together the expressions contained inside them and you can repeat the
contents of a group with a repeating )ualifier such as 9 : , or ;mn<. /or example (ab!9 will match "ero or more
repetitions of ab.
SSS
$$$ p re.compile(A(ab!9A!
$$$ print(p.match(AabababababA!.span(!!
( !
Uroups indicated with A(A A!A also capture the starting and ending index of the text that they match$ this can be
retrieved by passing an argument to group(! start(!end(! and span(!. Uroups are numbered starting with . Uroup
is always present$ it1s the whole RE so match object methods
all have group as their default argument. Later we1ll see how to express groups that don1t capture the span of
text that they match.
SSS
$$$ p re.compile(A(a!bA!
$$$ m p.match(AabA!
$$$ m.group(!
AabA
$$$ m.group(!
AabA
3ubgroups are numbered from left to right from upward. Uroups can be nested$ to determine the number ust
count the opening parenthesis characters going from left to right.
SSS
$$$ p re.compile(A(a(b!c!dA!
$$$ m p.match(AabcdA!
$$$ m.group(!
AabcdA
$$$ m.group(!
AabcA
$$$ m.group(K!
7/23/2019 Regular Expression HOWT Tutorial of the Re
http://slidepdf.com/reader/full/regular-expression-howt-tutorial-of-the-re 15/23
AbA
group(! can be passed multiple group numbers at a time in which case it will return a tuple containing the
corresponding values for those groups.
SSS
$$$ m.group(KK!
(AbA AabcA AbA!
The groups(! method returns a tuple containing the strings for all the subgroups from up to however many there
are.
SSS
$$$ m.groups(!
(AabcA AbA!
Hac'references in a pattern allow you to specify that the contents of an earlier capturing group must also be found
at the current location in the string. /or example ? will succeed if the exact contents of group can be found at the
current position and fails otherwise. Remember that Python1s string literals also use a bac'slash followed by
numbers to allow including arbitrary characters in a string so be sure to use a raw string when incorporating
bac'references in a RE.
/or example the following RE detects doubled words in a string.
SSS
$$$ p re.compile(rA(?b?w:!?s:?A!
$$$ p.search(AParis in the the springA!.group(!
Athe theA
Hac'references li'e this aren1t often useful for ust searching through a string Q there are few text formats which
repeat data in this way Q but you1ll soon find out that they1re very useful when performing string substitutions.
/on*capturing and /amed .roups
Elaborate REs may use many groups both to capture substrings of interest and to group and structure the RE
itself. In complex REs it becomes difficult to 'eep trac' of the group numbers. There are two features which help
with this problem. Hoth of them use a common syntax for regular expression extensions so we1ll loo' at that first.
Perl B is well 'nown for its powerful additions to standard regular expressions. /or these new features the Perl
developers couldn1t choose new single%'eystro'e metacharacters or new special se)uences beginning
with ? without ma'ing Perl1s regular expressions confusingly different from standard REs. If they chose ^ as a new
metacharacter for example old expressions would be assuming that ^ was a regular character and wouldn1t have
escaped it by writing ?^ or =^>.
The solution chosen by the Perl developers was to use (,...! as the extension syntax. , immediately after a
parenthesis was a syntax error because the , would have nothing to repeat so this didn1t introduce any
compatibility problems. The characters immediately after the , indicate what extension is being used so (,foo! is
one thing (a positive loo'ahead assertion! and (,4foo! is something else (a non%capturing group containing thesubexpression foo!.
Python supports several of Perl1s extensions and adds an extension syntax to Perl1s extension syntax. If the first
character after the )uestion mar' is a P you 'now that it1s an extension that1s specific to Python.
7/23/2019 Regular Expression HOWT Tutorial of the Re
http://slidepdf.com/reader/full/regular-expression-howt-tutorial-of-the-re 16/23
ow that we1ve loo'ed at the general extension syntax we can return to the features that simplify wor'ing with
groups in complex REs.
3ometimes you1ll want to use a group to denote a part of a regular expression but aren1t interested in retrieving the
group1s contents. ou can ma'e this fact explicit by using a non%capturing group4 (,4...! where you can replace
the ... with any other regular expression.
SSS
$$$ m re.match(V(=abc>!:V VabcV!
$$$ m.groups(!
(AcA!
$$$ m re.match(V(,4=abc>!:V VabcV!
$$$ m.groups(!
(!
Except for the fact that you can1t retrieve the contents of what the group matched a non%capturing group behaves
exactly the same as a capturing group$ you can put anything inside it repeat it with a repetition metacharacter such
as 9 and nest it within other groups (capturing or non%capturing!. (,4...! is particularly useful when modifying an
existing pattern since you can add new groups without changing how all the other groups are numbered. It should
be mentioned that there1s no performance difference in searching between capturing and non%capturing groups$
neither form is any faster than the other.
C more significant feature is named groups4 instead of referring to them by numbers groups can be referenced by
a name.
The syntax for a named group is one of the Python%specific extensions4 (,PXnameS...!. name is obviously the
name of the group. amed groups behave exactly li'e capturing groups and additionally associate a name with a
group. The match object methods that deal with capturing groups all accept either integers that refer to the group
by number or strings that contain the desired group1s name. amed groups are still given numbers so you can
retrieve information about a group in two ways4
SSS
$$$ p re.compile(rA(,PXwordS?b?w:?b!A!
$$$ m p.search( A(((( Lots of punctuation !!!A !
$$$ m.group(AwordA!
ALotsA
$$$ m.group(!
ALotsA
amed groups are handy because they let you use easily%remembered names instead of having to remember
numbers. 6ere1s an example RE from the imaplib module4
Internal+ate re.compile(rAITERCL+CTE VA
rA(,PXdayS= KJ>=%F>!%(,PXmonS=C%D>=a%">=a%">!%A
rA(,PXyearS=%F>=%F>=%F>=%F>!A
rA (,PXhourS=%F>=%F>!4(,PXminS=%F>=%F>!4(,PXsecS=%F>=%F>!A
rA (,PX"onenS=%:>!(,PX"onehS=%F>=%F>!(,PX"onemS=%F>=%F>!A
rAVA!
It1s obviously much easier to retrieve m.group(A"onemA! instead of having to remember to retrieve group F.
7/23/2019 Regular Expression HOWT Tutorial of the Re
http://slidepdf.com/reader/full/regular-expression-howt-tutorial-of-the-re 17/23
The syntax for bac'references in an expression such as (...!? refers to the number of the group. There1s naturally a
variant that uses the group name instead of the number. This is another Python extension4 (,Pname! indicates
that the contents of the group called name should again be matched at the current point. The regular expression for
finding doubled words (?b?w:!?s:? can also be written as (,PXwordS?b?w:!?s:(,Pword!4
SSS
$$$ p re.compile(rA(,PXwordS?b?w:!?s:(,Pword!A!
$$$ p.search(AParis in the the springA!.group(!
Athe theA
+oo&ahead Assertions
Cnother "ero%width assertion is the loo'ahead assertion. Loo'ahead assertions are available in both positive and
negative form and loo' li'e this4
(,...!
Positive loo'ahead assertion. This succeeds if the contained regular expression represented here by ...
successfully matches at the current location and fails otherwise. Hut once the contained expression has been
tried the matching engine doesn1t advance at all$ the rest of the pattern is tried right where the assertion started.
(,`...!
egative loo'ahead assertion. This is the opposite of the positive assertion$ it succeeds if the contained
expression doesnt match at the current position in the string.
To ma'e this concrete let1s loo' at a case where a loo'ahead is useful. onsider a simple pattern to match a
filename and split it apart into a base name and an extension separated by a .. /or example in news.rc news is
the base name and rc is the filename1s extension.
The pattern to match this is )uite simple4
.9=.>.98
otice that the . needs to be treated specially because it1s a metacharacter so it1s inside a character class to only
match that specific character. Clso notice the trailing 8$ this is added to ensure that all the rest of the string must be
included in the extension. This regular expression
matches foo.bar and autoexec.bat and sendmail.cf andprinters.conf. ow consider complicating the problem a bit$
what if you want to match filenames where the extension is not bat, 3ome incorrect attempts4
.9=.>=7b>.98
The first attempt above tries to exclude bat by re)uiring that the first character of the extension is not a b. This is
wrong because the pattern also doesn1t match foo.bar.
.9=.>(=7b>..@.=7a>.@..=7t>!8
The expression gets messier when you try to patch up the first solution by re)uiring one of the following cases to
match4 the first character of the extension isn1t b$ the second character isn1t a$ or the third character isn1t t. This
accepts foo.bar and reects autoexec.bat but it re)uires a three%letter extension and won1t accept a filename with a
two%letter extension such as sendmail.cf. 2e1ll complicate the pattern again in an effort to fix it.
.9=.>(=7b>.,.,@.=7a>,.,@..,=7t>,!8
In the third attempt the second and third letters are all made optional in order to allow matching extensions shorter than three characters such as sendmail.cf.
7/23/2019 Regular Expression HOWT Tutorial of the Re
http://slidepdf.com/reader/full/regular-expression-howt-tutorial-of-the-re 18/23
The pattern1s getting really complicated now which ma'es it hard to read and understand. 2orse if the problem
changes and you want to exclude both bat and exe as extensions the pattern would get even more complicated
and confusing.
C negative loo'ahead cuts through all this confusion4
.9=.>(,`bat8!.98 The negative loo'ahead means4 if the expression bat doesn1t match at this point try the rest of the
pattern$ if bat8 does match the whole pattern will fail. The trailing 8 is re)uired to ensure that something
li'e sample.batch where the extension only starts with bat will be allowed.
Excluding another filename extension is now easy$ simply add it as an alternative inside the assertion. The
following pattern excludes filenames that end in either bat or exe4
.9=.>(,`bat8@exe8!.98
!odif5ing Strings
#p to this point we1ve simply performed searches against a static string. Regular expressions are also commonly
used to modify strings in various ways using the following pattern methods4
!ethod(Attribute urpose
split(! 3plit the string into a list splitting it wherever the RE matches
sub(! /ind all substrings where the RE matches and replace them with a different string
subn(! +oes the same thing as sub(! but returns the new string and the number of replacements
Splitting Strings
The split(! method of a pattern splits a string apart wherever the RE matches returning a list of the pieces. It1s
similar to the split(! method of strings but provides much more generality in the delimiters that you
can split by$ string split(! only supports splitting by whitespace or by a fixed string. Cs you1d expect there1s a
module%levelre.split(! function too.
.split(string = maxsplit!" >!
3plit string by the matches of the regular expression. If capturing parentheses are used in the RE then their
contents will also be returned as part of the resulting list. If maxsplit is non"ero at most maxsplit splits are
performed.
ou can limit the number of splits made by passing a value for max split . 2hen maxsplit is non"ero at
most maxsplit splits will be made and the remainder of the string is returned as the final element of the list. In the
following example the delimiter is any se)uence of non%alphanumeric characters.
SSS
$$$ p re.compile(rA?2:A!
$$$ p.split(AThis is a test short and sweet of split(!.A!
=AThisA AisA AaA AtestA AshortA AandA AsweetA AofA AsplitA AA>
$$$ p.split(AThis is a test short and sweet of split(!.A J!
=AThisA AisA AaA Atest short and sweet of split(!.A>
7/23/2019 Regular Expression HOWT Tutorial of the Re
http://slidepdf.com/reader/full/regular-expression-howt-tutorial-of-the-re 19/23
3ometimes you1re not only interested in what the text between delimiters is but also need to 'now what the
delimiter was. If capturing parentheses are used in the RE then their values are also returned as part of the list.
ompare the following calls4
SSS
$$$ p re.compile(rA?2:A!
$$$ pK re.compile(rA(?2:!A!
$$$ p.split(AThis... is a test.A!
=AThisA AisA AaA AtestA AA>
$$$ pK.split(AThis... is a test.A!
=AThisA A... A AisA A A AaA A A AtestA A.A AA>
The module%level function re. split (! adds the RE to be used as the first argument but is otherwise the same.
SSS
$$$ re.split(A=?2>:A A2ords words words.A!
=A2ordsA AwordsA AwordsA AA>
$$$ re.split(A(=?2>:!A A2ords words words.A!
=A2ordsA A A AwordsA A A AwordsA A.A AA>
$$$ re.split(A=?2>:A A2ords words words.A !
=A2ordsA Awords words.A>
Search and Replace
Cnother common tas' is to find all the matches for a pattern and replace them with a different string.The sub(! method ta'es a replacement value which can be either a string or a function and the string to be
processed.
.sub(replacement string = count!" >!
Returns the string obtained by replacing the leftmost non%overlapping occurrences of the RE in string by the
replacement replacement . If the pattern isn1t found string is returned unchanged.
The optional argument count is the maximum number of pattern occurrences to be replaced$ count must be a non%
negative integer. The default value of means to replace all occurrences.
6ere1s a simple example of using the sub(! method. It replaces colour names with the word colour4
SSS
$$$ p re.compile( A(blue@white@red!A !
$$$ p.sub( AcolourA Ablue soc's and red shoesA!Acolour soc's and colour shoesA
$$$ p.sub( AcolourA Ablue soc's and red shoesA count!
Acolour soc's and red shoesA
The subn(! method does the same wor' but returns a K%tuple containing the new string value and the number of
replacements that were performed4
SSS
$$$ p re.compile( A(blue@white@red!A !
$$$ p.subn( AcolourA Ablue soc's and red shoesA!
(Acolour soc's and colour shoesA K!
7/23/2019 Regular Expression HOWT Tutorial of the Re
http://slidepdf.com/reader/full/regular-expression-howt-tutorial-of-the-re 20/23
$$$ p.subn( AcolourA Ano colours at allA!
(Ano colours at allA !
Empty matches are replaced only when they1re not adacent to a previous match.
SSS
$$$ p re.compile(Ax9A!
$$$ p.sub(A%A AabxdA!
A%a%b%d%A
If replacement is a string any bac'slash escapes in it are processed. That is ?n is converted to a single newline
character ?r is converted to a carriage return and so forth. #n'nown escapes such as ?^ are left alone.
Hac'references such as ?N are replaced with the substring matched by the corresponding group in the RE. This
lets you incorporate portions of the original text in the resulting replacement string.
This example matches the word section followed by a string enclosed in ; < and changes section to subsection4
SSS
$$$ p re.compile(Asection; ( =7<>9 ! <A re.ZERH03E!
$$$ p.sub(rAsubsection;?<AAsection;/irst< section;second<A!
Asubsection;/irst< subsection;second<A
There1s also a syntax for referring to named groups as defined by the (,PXnameS...! syntax. ?gXnameS will use the
substring matched by the group named name and?gXnumberS uses the corresponding group number. ?gXKS is
therefore e)uivalent to ?K but isn1t ambiguous in a replacement string such as ?gXKS. (?K would be interpreted as
a reference to group K not a reference to group K followed by the literal character AA.! The following substitutions
are all e)uivalent but use all three variations of the replacement string.
SSS
$$$ p re.compile(Asection; (,PXnameS =7<>9 ! <A re.ZERH03E!
$$$ p.sub(rAsubsection;?<AAsection;/irst<A!
Asubsection;/irst<A
$$$ p.sub(rAsubsection;?gXS<A Asection;/irst<A!
Asubsection;/irst<A
$$$ p.sub(rAsubsection;?gXnameS<A Asection;/irst<A!
Asubsection;/irst<A
replacement can also be a function which gives you even more control. If replacement is a function the function iscalled for every non%overlapping occurrence of pattern. 0n each call the function is passed a match
object argument for the match and can use this information to compute the desired replacement string and return it.
In the following example the replacement function translates decimals into hexadecimal4
SSS
$$ def hexrepl(match!4
))) VReturn the hex string for a decimal numberV
))) value int(match.group(!!
))) return hex(value!
)))
$$$ p re.compile(rA?d:A!
7/23/2019 Regular Expression HOWT Tutorial of the Re
http://slidepdf.com/reader/full/regular-expression-howt-tutorial-of-the-re 21/23
$$$ p.sub(hexrepl Aall NBMF for printing MFBK for user code.A!
Aall xffdK for printing xc for user code.A
2hen using the module%level re.sub(! function the pattern is passed as the first argument. The pattern may be
provided as an obect or as a string$ if you need to specify regular expression flags you must either use a pattern
obect as the first parameter or use embedded modifiers in the pattern string e.g. sub(V(,
i!b:V VxV VbbbbHHHHV! returns Ax xA.
"ommon roblems
Regular expressions are a powerful tool for some applications but in some ways their behaviour isn1t intuitive and
at times they don1t behave the way you may expect them to. This section will point out some of the most common
pitfalls.
#se String !ethods
3ometimes using the re module is a mista'e. If you1re matching a fixed string or a single character class
andyou1re not using any re features such as the IU0REC3E flag then the full power of regular expressions may
not be re)uired. 3trings have several methods for performing operations with fixed strings and they1re usually much
faster because the implementation is a single small loop that1s been optimi"ed for the purpose instead of the
large more generali"ed regular expression engine.
0ne example might be replacing a single fixed string with another one$ for example you might
replace word with deed. re.sub(! seems li'e the function to use for this but consider the replace(! method. ote
that replace(! will also replace word inside words turning swordfish into sdeedfish but the naive RE word would
have done that too. (To avoid performing the substitution on parts of words the pattern would have to be ?bword?b
in order to re)uire that word have a word boundary on either side. This ta'es the ob beyond replace(!Os abilities.!
Cnother common tas' is deleting every occurrence of a single character from a string or replacing it with another
single character. ou might do this with something li'ere.sub(A?nA A A 3! but translate(! is capable of doing both
tas's and will be faster than any regular expression operation can be.
In short before turning to the re module consider whether your problem can be solved with a faster and
simpler string method.
match(! versus search(!The match(! function only chec's if the RE matches at the beginning of the string while search(! will scan forward
through the string for a match. It1s important to 'eep this distinction in mind. Remember match(! will only report a
successful match which will start at $ if the match wouldn1t start at "ero match(! will not report it.
SSS
$$$ print(re.match(AsuperA AsuperstitionA!.span(!!
( B!
$$$ print(re.match(AsuperA AinsuperableA!!
one
0n the other hand search(! will scan forward through the string reporting the first match it finds.
SSS
7/23/2019 Regular Expression HOWT Tutorial of the Re
http://slidepdf.com/reader/full/regular-expression-howt-tutorial-of-the-re 22/23
$$$ print(re.search(AsuperA AsuperstitionA!.span(!!
( B!
$$$ print(re.search(AsuperA AinsuperableA!.span(!!
(K _!
3ometimes you1ll be tempted to 'eep using re.match(! and ust add .9 to the front of your RE. Resist this temptation
and use re.search(! instead. The regular expression compiler does some analysis of REs in order to speed up theprocess of loo'ing for a match. 0ne such analysis figures out what the first character of a match must be$ for
example a pattern starting with row must match starting with a AA. The analysis lets the engine )uic'ly scan
through the string loo'ing for the starting character only trying the full match if a AA is found.
Cdding .9 defeats this optimi"ation re)uiring scanning to the end of the string and then bac'trac'ing to find a match
for the rest of the RE. #se re.search(! instead.
.reed5 ,ersus /on*.reed5
2hen repeating a regular expression as in a9 the resulting action is to consume as much of the pattern as
possible. This fact often bites you when you1re trying to match a pair of balanced delimiters such as the angle
brac'ets surrounding an 6T5L tag. The naive pattern for matching a single 6T5L tag doesn1t wor' because of the
greedy nature of .9.
SSS
$$$ s AXhtmlSXheadSXtitleSTitleXtitleSA
$$$ len(s!
JK$$$ print(re.match(AX.9SA s!.span(!!
( JK!
$$$ print(re.match(AX.9SA s!.group(!!
XhtmlSXheadSXtitleSTitleXtitleS
The RE matches the AXA in XhtmlS and the .9 consumes the rest of the string. There1s still more left in the RE
though and the S can1t match at the end of the string so the regular expression engine has to bac'trac' character
by character until it finds a match for the S. The final match extends from the AXA in XhtmlS to the ASA inXtitleS which
isn1t what you want.
In this case the solution is to use the non%greedy )ualifiers 9, :, ,, or ;mn<, which match as little text as
possible. In the above example the ASA is tried immediately after the first AXA matches and when it fails the engine
advances a character at a time retrying the ASA at every step. This produces ust the right result4
SSS
$$$ print(re.match(AX.9,SA s!.group(!!
XhtmlS
(ote that parsing 6T5L or &5L with regular expressions is painful. uic'%and%dirty patterns will handle common
cases but 6T5L and &5L have special cases that will brea' the obvious regular expression$ by the time you1ve
written a regular expression that handles all of the possible cases the patterns will be very complicated. #se an
6T5L or &5L parser module for such tas's.!
#sing re)2ER%OSE
7/23/2019 Regular Expression HOWT Tutorial of the Re
http://slidepdf.com/reader/full/regular-expression-howt-tutorial-of-the-re 23/23
yHy now you1ve probably noticed that regular expressions are a very compact notation but they1re not terribly
readable. REs of moderate complexity can become lengthy collections of bac'slashes parentheses and
metacharacters ma'ing them difficult to read and understand.
/or such REs specifying the re.ZERH03E flag when compiling the regular expression can be helpful because it
allows you to format the regular expression more clearly.
The re.ZERH03E flag has several effects. 2hitespace in the regular expression that isnt inside a character class
is ignored. This means that an expression such as dog @cat is e)uivalent to the less readable dog@cat but =a b> will
still match the characters AaA AbA or a space. In addition you can also put comments inside a RE$ comments extend
from a ] character to the next newline. 2hen used with triple%)uoted strings this enables REs to be formatted
more neatly4
pat re.compile(rVVV
?s9 ] 3'ip leading whitespace
(,PXheaderS=74>:! ] 6eader name
?s9 4 ] 2hitespace and a colon (,PXvalueS.9,! ] The headerAs value %% 9, used to
] lose the following trailing whitespace
?s98 ] Trailing whitespace to end%of%line
VVV re.ZERH03E!
This is far more readable than4
pat re.compile(rV?s9(,PXheaderS=74>:!?s94(,PXvalueS.9,!?s98V!
-eedbac&
Regular expressions are a complicated topic. +id this document help you understand them, 2ere there parts that
were unclear or Problems you encountered that weren1t covered here, If so please send suggestions for
improvements to the author.
The most complete boo' on regular expressions is almost certainly effrey /riedl1s 5astering Regular Expressions
published by 01Reilly. #nfortunately it exclusively concentrates on Perl and ava1s flavours of regular expressions
and doesn1t contain any Python material at all so it won1t be useful as a reference for programming in Python. (The
first edition covered Python1s now%removed regex module which won1t help you much.! onsider chec'ing it out
from your library.