regular expression howt tutorial of the re

7/23/2019 Regular Expression HOWT Tutorial of the Re

http://slidepdf.com/reader/full/regular-expression-howt-tutorial-of-the-re 1/23



`

Regular Expression HOWTΟ Tutorial of the re module

https://docs.python.org/3/library/re.html#module-re




Abstract

This document is an introductory tutorial to using regular expressions in Python with the re module. It provides a

gentler introduction than the corresponding section in the Library Reference.

Introduction

Regular expressions (called REs or regexes or regex patterns! are essentially a tiny highly speciali"ed

programming language embedded inside Python and made available through the re module. #sing this little

language you specify the rules for the set of possible strings that you want to match$ this set might contain English

sentences or e%mail addresses or Te& commands or anything you li'e. ou can then as' )uestions such as

*+oes this string match the pattern,- or *Is there a match for the pattern anywhere in this string,-. ou can also

use REs to modify a string or to split it apart in various ways.

Regular expression patterns are compiled into a series of bytecodes which are then executed by a matching engine

written in . /or advanced use it may be necessary to pay careful attention to how the engine will execute a givenRE and write the RE in a certain way in order to produce bytecode that runs faster. 0ptimi"ation isn1t covered in

this document because it re)uires that you have a good understanding of the matching engine1s internals.

The regular expression language is relatively small and restricted so not all possible string processing tas's can be

done using regular expressions. There are also tas's thatcan be done with regular expressions but the

expressions turn out to be very complicated. In these cases you may be better off writing Python code to do the

processing$ while Python code will be slower than an elaborate regular expression it will also probably be more

understandable.

Simple atterns

2e1ll start by learning about the simplest possible regular expressions. 3ince regular expressions are used to

operate on strings we1ll begin with the most common tas'4 matching characters.

/or a detailed explanation of the computer science underlying regular expressions (deterministic and non%

deterministic finite automata! you can refer to almost any textboo' on writing compilers.

!atching "haracters

5ost letters and characters will simply match themselves. /or example the regular expression test will match

the string test exactly. (ou can enable a case%insensitive mode that would let this RE match Test or TE3T as well$

more about this later.!

There are exceptions to this rule$ some characters are special metacharacters and don1t match themselves.

Instead they signal that some out%of%the%ordinary thing should be matched or they affect other portions of the RE

by repeating them or changing their meaning. 5uch of this document is devoted to discussing various

metacharacters and what they do.

6ere1s a complete list of the metacharacters$ their meanings will be discussed in the rest of this 602T0.

. 7 8 9 : , ; < = > ? @ ( !








The first metacharacters we1ll loo' at are = and >. They1re used for specifying a character class which is a set of

characters that you wish to match. haracters can be listed individually or a range of characters can be indicated

by giving two characters and separating them by a A%A. /or example =abc> will match any of the characters a b or c$

this is the same as =a%c> which uses a range to express the same set of characters. If you wanted to match only

lowercase letters your RE would be =a%">.

5etacharacters are not active inside classes. /or example =a'm8> will match any of the characters AaA A'A AmA

or A8A$ A8A is usually a metacharacter but inside a character class it1s stripped of its special nature.

ou can match the characters not listed within the class by complementing the set. This is indicated by including

a A7A as the first character of the class$ A7A outside a character class will simply match the A7A character. /or

example =7B> will match any character except ABA.

Perhaps the most important metacharacter is the bac'slash ?. Cs in Python string literals the bac'slash can be

followed by various characters to signal various special se)uences. It1s also used to escape all the metacharacters

so you can still match them in patterns$ for example if you need to match a = or ? you can precede them with a

bac'slash to remove their special meaning4 ?= or ??.

3ome of the special se)uences beginning with A?A represent predefined sets of characters that are often useful such

as the set of digits the set of letters or the set of anything that isn1t whitespace.

Let1s ta'e an example4 ?w matches any alphanumeric character. If the regex pattern is expressed in bytes this is

e)uivalent to the class =a%"C%D%FG>. If the regex pattern is a string ?w will match all the characters mar'ed as

letters in the #nicode database provided by the unicodedata module. ou can use the more restricted definition

of ?w in a string pattern by supplying the re.C3II flag when compiling the regular expression.

The following list of special se)uences isn1t complete. /or a complete list of se)uences and expanded class

definitions for #nicode string patterns see the last part of Regular Expression Syntax in the 3tandard Library

reference. In general the #nicode versions match any character that1s in the appropriate category in the #nicode

database.?d

5atches any decimal digit$ this is e)uivalent to the class =%F>.

?+

5atches any non%digit character$ this is e)uivalent to the class =7%F>.

?s

5atches any whitespace character$ this is e)uivalent to the class = ?t?n?r?f?v>.

?3

5atches any non%whitespace character$ this is e)uivalent to the class =7 ?t?n?r?f?v>.

?w

5atches any alphanumeric character$ this is e)uivalent to the class =a%"C%D%FG>.

?2

5atches any non%alphanumeric character$ this is e)uivalent to the class =7a%"C%D%FG>.

These se)uences can be included inside a character class. /or example =?s.> is a character class that will match

any whitespace character or AA or A.A.

The final metacharacter in this section is .. It matches anything except a newline character and there1s an alternate

mode (re.+0TCLL! where it will match even a newline.A.A is often used where you want to match *any character-.

Repeating Things

https://docs.python.org/3/library/unicodedata.html#module-unicodedata


https://docs.python.org/3/library/re.html#re.ASCII



https://docs.python.org/3/library/re.html#re-syntax







Heing able to match varying sets of characters is the first thing regular expressions can do that isn1t already

possible with the methods available on strings. 6owever if that was the only additional capability of regexes they

wouldn1t be much of an advance. Cnother capability is that you can specify that portions of the RE must be

repeated a certain number of times.

The first metacharacter for repeating things that we1ll loo' at is 9. 9 doesn1t match the literal character 9$ instead it

specifies that the previous character can be matched "ero or more times instead of exactly once.

/or example ca9t will match ct ( a characters! cat ( a! caaat (J a characters! and so forth. The RE engine has

various internal limitations stemming from the si"e of 1s int type that will prevent it from matching over K

billion a characters$ patterns are usually not written to match that much data.

Repetitions such as 9 are greedy $ when repeating a RE the matching engine will try to repeat it as many times as

possible. If later portions of the pattern don1t match the matching engine will then bac' up and try again with few

repetitions.

C step%by%step example will ma'e this more obvious. Let1s consider the expression a=bcd>9b. This matches the

letter AaA "ero or more letters from the class =bcd> and finally ends with a AbA. ow imagine matching this RE againstthe string abcbd.

Step !atched Explanation

a The a in the RE matches.

K abcbd The engine matches =bcd>9 going as far as it can which is to the end of the string.

J Failure The engine tries to match b but the current position is at the end of the string so it fails.

M abcb Hac' up so that =bcd>9 matches one less character.

B Failure Try b again but the current position is at the last character which is a AdA.

N abc Hac' up again so that =bcd>9 is only matching bc.

N abcb Try b again. This time the character at the current position is AbA so it succeeds.

The end of the RE has now been reached and it has matched abcb. This demonstrates how the matching engine

goes as far as it can at first and if no match is found it will then progressively bac' up and retry the rest of the RE

again and again. It will bac' up until it has tried "ero matches for =bcd>9 and if that subse)uently fails the engine

will conclude that the string doesn1t match the RE at all.

Cnother repeating metacharacter is : which matches one or more times. Pay careful attention to the difference

between 9 and :$ 9 matches zero or more times so whatever1s being repeated may not be present at allwhile : re)uires at least one occurrence. To use a similar example ca:t will match cat ( a! caaat (J aOs! but won1t

match ct.

There are two more repeating )ualifiers. The )uestion mar' character , matches either once or "ero times$ you

can thin' of it as mar'ing something as being optional. /or example home%,brew matches

either homebrew or home%brew.

The most complicated repeated )ualifier is ;mn< where m and n are decimal integers. This )ualifier means there

must be at least m repetitions and at most n. /or examplea;J<b will match ab ab and ab. It won1t match ab

which has no slashes or ab which has four.

ou can omit either m or n$ in that case a reasonable value is assumed for the missing value. 0mitting m is

interpreted as a lower limit of while omitting n results in an upper bound of infinity Q actually the upper bound is

the K%billion limit mentioned earlier but that might as well be infinity.



Readers of a reductionist bent may notice that the three other )ualifiers can all be expressed using this

notation. ;< is the same as 9 ;< is e)uivalent to : and ;<is the same as ,. It1s better to use 9 : or , when

you can simply because they1re shorter and easier to read.

#sing Regular Expressions

ow that we1ve loo'ed at some simple regular expressions how do we actually use them in Python,

The re module provides an interface to the regular expression engine allowing you to compile REs into obects and

then perform matches with them.

"ompiling Regular Expressions

Regular expressions are compiled into pattern obects which have methods for various operations such as

searching for pattern matches or performing string substitutions.

SSS

$$$ import re

$$$ p re.compile(Aab9A!

$$$ p

re.compile(Aab9A!

re.compile(! also accepts an optional flags argument used to enable various special features and syntax variations.

2e1ll go over the available settings later but for now a single example will do4

SSS

$$$ p re.compile(Aab9A re.IU0REC3E!

The RE is passed to re.compile(! as a string. REs are handled as strings because regular expressions aren1t part of

the core Python language and no special syntax was created for expressing them. (There are applications that

don1t need REs at all so there1s no need to bloat the language specification by including them.! Instead

the remodule is simply a extension module included with Python ust li'e the soc'et or "lib modules.

Putting REs in strings 'eeps the Python language simpler but has one disadvantage which is the topic of the next

section.

The %ac&slash lague

Cs stated earlier regular expressions use the bac'slash character ( A?A! to indicate special forms or to allow special

characters to be used without invo'ing their special meaning. This conflicts with Python1s usage of the same

character for the same purpose in string literals.

Let1s say you want to write a RE that matches the string ?section which might be found in a LaTe& file. To figure out

what to write in the program code start with the desiredstring to be matched. ext you must escape any

bac'slashes and other metacharacters by preceding them with a bac'slash resulting in the string ??section. The

resultingstring that must be passed to re.compile(! must be ??section. 6owever to express this as a Python string

literal both bac'slashes must be escaped again.




https://docs.python.org/3/library/re.html#re.compile






https://docs.python.org/3/library/socket.html#module-socket



https://docs.python.org/3/library/zlib.html#module-zlib









https://docs.python.org/3/library/zlib.html#module-zlib




"haracters Stage

?section Text string to be matched

??section Escaped bac'slash for re.compile(!

V????sectionV Escaped bac'slashes for a string literal

In short to match a literal bac'slash one has to write A????A as the RE string because the regular expression must

be ?? and each bac'slash must be expressed as ??inside a regular Python string literal. In REs that feature

bac'slashes repeatedly this leads to lots of repeated bac'slashes and ma'es the resulting strings difficult to

understand.

The solution is to use Python1s raw string notation for regular expressions$ bac'slashes are not handled in any

special way in a string literal prefixed with ArA so rV?nV is a two%character string containing A?A and AnA while V?nV is a

one%character string containing a newline. Regular expressions will often be written in Python code using this raw

string notation.

Regular String Ra' string

Vab9V rVab9V

V????sectionV rV??sectionV

V??w:??s:??V rV?w:?s:?V

erforming !atches

0nce you have an obect representing a compiled regular expression what do you do with it, Pattern obects have

several methods and attributes. 0nly the most significant ones will be covered here$ consult the re docs for acomplete listing.

!ethod(Attribute urpose

match(! +etermine if the RE matches at the beginning of the string.

search(! 3can through a string loo'ing for any location where this RE matches.

findall(! /ind all substrings where the RE matches and returns them as a list.

finditer(! /ind all substrings where the RE matches and returns them as aniterator .

match(! and search(! return one if no match can be found. If they1re successful a match object instance is

returned containing information about the match4 where it starts and ends the sub string it matched and more.

ou can learn about this by interactively experimenting with the re module. If you have t'inter available you may

also want to loo' at Toolsdemoredemo.py a demonstration program included with the Python distribution. It

allows you to enter REs and strings and displays whether the RE matches or fails. redemo.py can be )uite useful

when trying to debug a complicated RE. Phil 3chwart"1s Wodos is also an interactive tool for developing and testing

RE patterns.

This 602T0 uses the standard Python interpreter for its examples. /irst run the Python interpreter import

the re module and compile a RE4

SSS

$$$ import re

$$$ p re.compile(A=a%">:A!





https://docs.python.org/3/glossary.html#term-iterator


https://docs.python.org/3/library/re.html#re.regex.match


https://docs.python.org/3/library/re.html#re.regex.search


https://docs.python.org/3/library/re.html#match-objects




https://docs.python.org/3/library/tkinter.html#module-tkinter

https://hg.python.org/cpython/file/3.5/Tools/demo/redemo.py


http://kodos.sourceforge.net/










https://docs.python.org/3/library/tkinter.html#module-tkinter


http://kodos.sourceforge.net/




$$$ p

re.compile(A=a%">:A!

ow you can try matching various strings against the RE =a%">:. Cn empty string shouldn1t match at all

since : means Oone or more repetitions1. match(! should returnone in this case which will cause the interpreter to

print no output. ou can explicitly print the result of match(! to ma'e this clear.

SSS

$$$ p.match(VV!

$$$ print(p.match(VV!!

one

ow let1s try it on a string that it should match such as tempo. In this case match(! will return a match object so

you should store the result in a variable for later use.

SSS

$$$ m p.match(AtempoA!

$$$ m

XGsre.3REG5atch obect$ span( B! matchAtempoAS

ow you can )uery the match object for information about the matching string. match object instances also have

several methods and attributes$ the most important ones are4


group(! Return the string matched by the RE

start(! Return the starting position of the match

end(! Return the ending position of the match

span(! Return a tuple containing the (start end! positions of the match

Trying these methods will soon clarify their meaning4

SSS

$$$ m.group(!

AtempoA

$$$ m.start(! m.end(!

( B!

$$$ m.span(!

( B!

group(! returns the substring that was matched by the RE. start(! and end(! return the starting and ending index of

the match. span(! returns both start and end indexes in a single tuple. 3ince the match(! method only chec's if the

RE matches at the start of a string start(! will always be "ero. 6owever the search(! method of patterns scans

through the string so the match may not start at "ero in that case.

SSS

$$$ print(p.match(A444 messageA!!

one

$$$ m p.search(A444 messageA!$ print(m!

XGsre.3REG5atch obect$ span(M ! matchAmessageAS





https://docs.python.org/3/library/re.html#re.match.group

https://docs.python.org/3/library/re.html#re.match.start


https://docs.python.org/3/library/re.html#re.match.end

https://docs.python.org/3/library/re.html#re.match.span





https://docs.python.org/3/library/re.html#re.match.group


https://docs.python.org/3/library/re.html#re.match.end




$$$ m.group(!

AmessageA

$$$ m.span(!

(M !

In actual programs the most common style is to store the match object in a variable and then chec' if it was one.

This usually loo's li'e4

p re.compile( ... !

m p.match( Astring goes hereA !

if m4

print(A5atch found4 A m.group(!!

else4

print(Ao matchA!

Two pattern methods return all of the matches for a pattern. findall(! returns a list of matching strings4

SSS

$$$ p re.compile(A?d:A!

$$$ p.findall(AK drummers drumming pipers piping lords a%leapingA !

=AKA AA AA>

findall(! has to create the entire list before it can be returned as the result. The finditer(! method returns a

se)uenceof match object instances as an iterator 4

SSS

$$$ iterator p.finditer(AK drummers drumming ... ...A!

$$$ iterator

XcallableGiterator obect at x...S

$$$ for match in iterator4

))) print(match.span(!!

)))

( K!

(KK KM!

(KF J!


https://docs.python.org/3/library/re.html#re.regex.findall

https://docs.python.org/3/library/re.html#re.regex.finditer





https://docs.python.org/3/library/re.html#re.regex.findall






!odule*+e,el -unctions

ou don1t have to create a pattern obect and call its methods$ the re module also provides top%level functions

called match(! search(! findall(! sub(! and so forth. These functions ta'e the same arguments as the

corresponding pattern method with the RE string added as the first argument and still return either one or

a match object instance.

SSS

$$$ print(re.match(rA/rom?s:A A/romage am'A!!

one

$$$ re.match(rA/rom?s:A A/rom am' Thu 5ay M F4K4 FFYA!

XGsre.3REG5atch obect$ span( B! matchA/rom AS

#nder the hood these functions simply create a pattern obect for you and call the appropriate method on it. They

also store the compiled obect in a cache so future calls using the same RE won1t need to parse the pattern again

and again.

3hould you use these module%level functions or should you get the pattern and call its methods yourself, If you1re

accessing a regex within a loop pre%compiling it will save a few function calls. 0utside of loops there1s not much

difference than's to the internal cache.

"ompilation -lags

ompilation flags let you modify some aspects of how regular expressions wor'. /lags are available in

the re module under two names a long name such as IU0REC3Eand a short one%letter form such as I. (If

you1re familiar with Perl1s pattern modifiers the one%letter forms use the same letters$ the short form

of re.ZERH03E is re.& for example.! 5ultiple flags can be specified by bitwise 0R%ing them$ re.I @ re.5 sets both

the I and 5 flags for example.

6ere1s a table of the available flags followed by a more detailed explanation of each one.

-lag !eaning

C3II C5a'es several escapes li'e ?w ?b ?s and ?d match only on C3II characters with

the respective property.

+0TCLL 3 5a'e . match any character including newlines

IU0REC3E I +o case%insensitive matches

L0CLE L +o a locale%aware match

5#LTILIE 5 5ulti%line matching affecting 7 and 8

ZERH03E & (for

Oextended1!Enable verbose REs which can be organi"ed more cleanly and understandably.

I

I./ORE"ASE

Perform case%insensitive matching$ character class and literal strings will match letters by ignoring case. /or

example =C%D> will match lowercase letters too and 3pamwill match 3pam spam or spC5. This lowercasing

doesn1t ta'e the current locale into account$ it will if you also set the L0CLE flag.


https://docs.python.org/3/library/re.html#re.match



https://docs.python.org/3/library/re.html#re.search

https://docs.python.org/3/library/re.html#re.findall

https://docs.python.org/3/library/re.html#re.sub






https://docs.python.org/3/library/re.html#re.VERBOSE

https://docs.python.org/3/library/re.html#re.X





https://docs.python.org/3/library/re.html#re.findall








+

+O"A+E

5a'e ?w ?2 ?b and ?H dependent on the current locale instead of the #nicode database.

Locales are a feature of the library intended to help in writing programs that ta'e account of language

differences. /or example if you1re processing /rench text you1d want to be able to write ?w: to match words

but ?w only matches the character class =C%Da%">$ it won1t match A[A or A\A. If your system is configured properly and a

/rench locale is selected certain functions will tell the program that A[A should also be considered a letter. 3etting

the L0CLE flag when compiling a regular expression will cause the resulting compiled obect to use these

functions for ?w$ this is slower but also enables ?w: to match /rench words as you1d expect.

!

!#+TI+I/E

(7 and 8 haven1t been explained yet$ they1ll be introduced in section More Metacharacters.!

#sually 7 matches only at the beginning of the string and 8 matches only at the end of the string and immediately

before the newline (if any! at the end of the string. 2hen this flag is specified 7 matches at the beginning of the

string and at the beginning of each line within the string immediately following each newline. 3imilarly

the 8metacharacter matches either at the end of the string and at the end of each line (immediately preceding each

newline!.

S

0OTA++

5a'es the A.A special character match any character at all including a newline$ without this flag A.A will match

anything except a newline.

A

AS"II

5a'e ?w ?2 ?b ?H ?s and ?3 perform C3II%only matching instead of full #nicode matching. This is onlymeaningful for #nicode patterns and is ignored for byte patterns.

1

2ER%OSE

This flag allows you to write regular expressions that are more readable by granting you more flexibility in how you

can format them. 2hen this flag has been specified whitespace within the RE string is ignored except when the

whitespace is in a character class or preceded by an unescaped bac'slash$ this lets you organi"e and indent the

RE more clearly. This flag also lets you put comments within a RE that will be ignored by the engine$ comments are

mar'ed by a A]A that1s neither in a character class or preceded by an unescaped bac'slash.

/or example here1s a RE that uses re.ZERH03E$ see how much easier it is to read,

charref re.compile(rVVV

^=]> ] 3tart of a numeric entity reference

(

=%_>: ] 0ctal form

@ =%F>: ] +ecimal form

@ x=%Fa%fC%/>: ] 6exadecimal form

!

$ ] Trailing semicolon

VVV re.ZERH03E!

2ithout the verbose setting the RE would loo' li'e this4

https://docs.python.org/3/howto/regex.html?highlight=string%20split#more-metacharacters







charref re.compile(V^](=%_>:V

V@=%F>:V

V@x=%Fa%fC%/>:!$V!

In the above example Python1s automatic concatenation of string literals has been used to brea' up the RE into

smaller pieces but it1s still more difficult to understand than the version using re.ZERH03E.

!ore attern o'er

3o far we1ve only covered a part of the features of regular expressions. In this section we1ll cover some new

metacharacters and how to use groups to retrieve portions of the text that was matched.

!ore !etacharacters

There are some metacharacters that we haven1t covered yet. 5ost of them will be covered in this section.3ome of the remaining metacharacters to be discussed are zero-width assertions. They don1t cause the engine to

advance through the string$ instead they consume no characters at all and simply succeed or fail. /or

example ?b is an assertion that the current position is located at a word boundary$ the position isn1t changed by

the ?b at all. This means that "ero%width assertions should never be repeated because if they match once at a

given location they can obviously be matched an infinite number of times.

@

Clternation or the *or- operator. If C and H are regular expressions C@H will match any string that matches

either C or H. @ has very low precedence in order to ma'e it wor' reasonably when you1re alternating multi%

character strings. row@3ervo will match either row or 3ervo not ro a AwA or an A3A and ervo.

To match a literal A@A use ?@ or enclose it inside a character class as in =@>.

7

5atches at the beginning of lines. #nless the 5#LTILIE flag has been set this will only match at the beginning of

the string. In 5#LTILIE mode this also matches immediately after each newline within the string.

/or example if you wish to match the word /rom only at the beginning of a line the RE to use is 7/rom.

SSS

$$$ print(re.search(A7/romA A/rom 6ere to EternityA!!

XGsre.3REG5atch obect$ span( M! matchA/romAS

$$$ print(re.search(A7/romA AReciting /rom 5emoryA!!

one

8

5atches at the end of a line which is defined as either the end of the string or any location followed by a newline

character.

SSS

$$$ print(re.search(A<8A A;bloc'<A!!

XGsre.3REG5atch obect$ span(N _! matchA<AS

$$$ print(re.search(A<8A A;bloc'< A!!

one







$$$ print(re.search(A<8A A;bloc'< 3nA!!

XGsre.3REG5atch obect$ span(N _! matchA<AS

To match a literal A8A use ?8 or enclose it inside a character class as in =8>.

?C

5atches only at the start of the string. 2hen not in 5#LTILIE mode ?C and 7 are effectively the same.

In 5#LTILIE mode they1re different4 ?C still matches only at the beginning of the string but 7 may match at anylocation inside the string that follows a newline character.

?D

5atches only at the end of the string.

?b

2ord boundary. This is a "ero%width assertion that matches only at the beginning or end of a word. C word is

defined as a se)uence of alphanumeric characters so the end of a word is indicated by whitespace or a non%

alphanumeric character.

The following example matches class only when it1s a complete word$ it won1t match when it1s contained inside

another word.

SSS

$$$ p re.compile(rA?bclass?bA!

$$$ print(p.search(Ano class at allA!!

XGsre.3REG5atch obect$ span(J Y! matchAclassAS

$$$ print(p.search(Athe declassified algorithmA!!

one

$$$ print(p.search(Aone subclass isA!!

one

There are two subtleties you should remember when using this special se)uence. /irst this is the worst collision

between Python1s string literals and regular expression se)uences. In Python1s string literals ?b is the bac'space

character C3II value Y. If you1re not using raw strings then Python will convert the ?b to a bac'space and your

RE won1t match as you expect it to. The following example loo's the same as our previous RE but omits the ArA in

front of the RE string.

SSS

$$$ p re.compile(A 3bclass 3bA!

$$$ print(p.search(Ano class at allA!!

one

$$$ print(p.search(A 3bA : AclassA : A 3bA!!

XGsre.3REG5atch obect$ span( _! matchA?xYclass?xYAS

3econd inside a character class where there1s no use for this assertion ?b represents the bac'space character

for compatibility with Python1s string literals.

?H

Cnother "ero%width assertion this is the opposite of ?b only matching when the current position is not at a word

boundary.

.rouping



/re)uently you need to obtain more information than ust whether the RE matched or not. Regular expressions are

often used to dissect strings by writing a RE divided into several subgroups which match different components of

interest. /or example an R/%YKK header line is divided into a header name and a value separated by a A4A li'e

this4

/rom4 author 4example.com

#ser % Cgent4 Thunderbird .B..F (&KNKK_!

5I5E%Zersion4 .

To4 editor 4example.com

This can be handled by writing a regular expression which matches an entire header line and has one group which

matches the header name and another group which matches the header1s value.

Uroups are mar'ed by the A(A A!A metacharacters. A(A and A!A have much the same meaning as they do in

mathematical expressions$ they group together the expressions contained inside them and you can repeat the

contents of a group with a repeating )ualifier such as 9 : , or ;mn<. /or example (ab!9 will match "ero or more

repetitions of ab.

SSS

$$$ p re.compile(A(ab!9A!

$$$ print(p.match(AabababababA!.span(!!

( !

Uroups indicated with A(A A!A also capture the starting and ending index of the text that they match$ this can be

retrieved by passing an argument to group(! start(!end(! and span(!. Uroups are numbered starting with . Uroup

is always present$ it1s the whole RE so match object methods

all have group as their default argument. Later we1ll see how to express groups that don1t capture the span of

text that they match.

SSS

$$$ p re.compile(A(a!bA!

$$$ m p.match(AabA!

$$$ m.group(!

AabA

$$$ m.group(!

AabA

3ubgroups are numbered from left to right from upward. Uroups can be nested$ to determine the number ust

count the opening parenthesis characters going from left to right.

SSS

$$$ p re.compile(A(a(b!c!dA!

$$$ m p.match(AabcdA!

$$$ m.group(!

AabcdA

$$$ m.group(!

AabcA

$$$ m.group(K!





AbA

group(! can be passed multiple group numbers at a time in which case it will return a tuple containing the

corresponding values for those groups.

SSS

$$$ m.group(KK!

(AbA AabcA AbA!

The groups(! method returns a tuple containing the strings for all the subgroups from up to however many there

are.

SSS

$$$ m.groups(!

(AabcA AbA!

Hac'references in a pattern allow you to specify that the contents of an earlier capturing group must also be found

at the current location in the string. /or example ? will succeed if the exact contents of group can be found at the

current position and fails otherwise. Remember that Python1s string literals also use a bac'slash followed by

numbers to allow including arbitrary characters in a string so be sure to use a raw string when incorporating

bac'references in a RE.

/or example the following RE detects doubled words in a string.

SSS

$$$ p re.compile(rA(?b?w:!?s:?A!

$$$ p.search(AParis in the the springA!.group(!

Athe theA

Hac'references li'e this aren1t often useful for ust searching through a string Q there are few text formats which

repeat data in this way Q but you1ll soon find out that they1re very useful when performing string substitutions.

/on*capturing and /amed .roups

Elaborate REs may use many groups both to capture substrings of interest and to group and structure the RE

itself. In complex REs it becomes difficult to 'eep trac' of the group numbers. There are two features which help

with this problem. Hoth of them use a common syntax for regular expression extensions so we1ll loo' at that first.

Perl B is well 'nown for its powerful additions to standard regular expressions. /or these new features the Perl

developers couldn1t choose new single%'eystro'e metacharacters or new special se)uences beginning

with ? without ma'ing Perl1s regular expressions confusingly different from standard REs. If they chose ^ as a new

metacharacter for example old expressions would be assuming that ^ was a regular character and wouldn1t have

escaped it by writing ?^ or =^>.

The solution chosen by the Perl developers was to use (,...! as the extension syntax. , immediately after a

parenthesis was a syntax error because the , would have nothing to repeat so this didn1t introduce any

compatibility problems. The characters immediately after the , indicate what extension is being used so (,foo! is

one thing (a positive loo'ahead assertion! and (,4foo! is something else (a non%capturing group containing thesubexpression foo!.

Python supports several of Perl1s extensions and adds an extension syntax to Perl1s extension syntax. If the first

character after the )uestion mar' is a P you 'now that it1s an extension that1s specific to Python.



ow that we1ve loo'ed at the general extension syntax we can return to the features that simplify wor'ing with

groups in complex REs.

3ometimes you1ll want to use a group to denote a part of a regular expression but aren1t interested in retrieving the

group1s contents. ou can ma'e this fact explicit by using a non%capturing group4 (,4...! where you can replace

the ... with any other regular expression.

SSS

$$$ m re.match(V(=abc>!:V VabcV!

$$$ m.groups(!

(AcA!

$$$ m re.match(V(,4=abc>!:V VabcV!

$$$ m.groups(!

(!

Except for the fact that you can1t retrieve the contents of what the group matched a non%capturing group behaves

exactly the same as a capturing group$ you can put anything inside it repeat it with a repetition metacharacter such

as 9 and nest it within other groups (capturing or non%capturing!. (,4...! is particularly useful when modifying an

existing pattern since you can add new groups without changing how all the other groups are numbered. It should

be mentioned that there1s no performance difference in searching between capturing and non%capturing groups$

neither form is any faster than the other.

C more significant feature is named groups4 instead of referring to them by numbers groups can be referenced by

a name.

The syntax for a named group is one of the Python%specific extensions4 (,PXnameS...!. name is obviously the

name of the group. amed groups behave exactly li'e capturing groups and additionally associate a name with a

group. The match object methods that deal with capturing groups all accept either integers that refer to the group

by number or strings that contain the desired group1s name. amed groups are still given numbers so you can

retrieve information about a group in two ways4

SSS

$$$ p re.compile(rA(,PXwordS?b?w:?b!A!

$$$ m p.search( A(((( Lots of punctuation !!!A !

$$$ m.group(AwordA!

ALotsA

$$$ m.group(!

ALotsA

amed groups are handy because they let you use easily%remembered names instead of having to remember

numbers. 6ere1s an example RE from the imaplib module4

Internal+ate re.compile(rAITERCL+CTE VA

rA(,PXdayS= KJ>=%F>!%(,PXmonS=C%D>=a%">=a%">!%A

rA(,PXyearS=%F>=%F>=%F>=%F>!A

rA (,PXhourS=%F>=%F>!4(,PXminS=%F>=%F>!4(,PXsecS=%F>=%F>!A

rA (,PX"onenS=%:>!(,PX"onehS=%F>=%F>!(,PX"onemS=%F>=%F>!A

rAVA!

It1s obviously much easier to retrieve m.group(A"onemA! instead of having to remember to retrieve group F.


https://docs.python.org/3/library/imaplib.html#module-imaplib


https://docs.python.org/3/library/imaplib.html#module-imaplib



The syntax for bac'references in an expression such as (...!? refers to the number of the group. There1s naturally a

variant that uses the group name instead of the number. This is another Python extension4 (,Pname! indicates

that the contents of the group called name should again be matched at the current point. The regular expression for

finding doubled words (?b?w:!?s:? can also be written as (,PXwordS?b?w:!?s:(,Pword!4

SSS

$$$ p re.compile(rA(,PXwordS?b?w:!?s:(,Pword!A!

$$$ p.search(AParis in the the springA!.group(!

Athe theA

+oo&ahead Assertions

Cnother "ero%width assertion is the loo'ahead assertion. Loo'ahead assertions are available in both positive and

negative form and loo' li'e this4

(,...!

Positive loo'ahead assertion. This succeeds if the contained regular expression represented here by ...

successfully matches at the current location and fails otherwise. Hut once the contained expression has been

tried the matching engine doesn1t advance at all$ the rest of the pattern is tried right where the assertion started.

(,`...!

egative loo'ahead assertion. This is the opposite of the positive assertion$ it succeeds if the contained

expression doesnt match at the current position in the string.

To ma'e this concrete let1s loo' at a case where a loo'ahead is useful. onsider a simple pattern to match a

filename and split it apart into a base name and an extension separated by a .. /or example in news.rc news is

the base name and rc is the filename1s extension.

The pattern to match this is )uite simple4

.9=.>.98

otice that the . needs to be treated specially because it1s a metacharacter so it1s inside a character class to only

match that specific character. Clso notice the trailing 8$ this is added to ensure that all the rest of the string must be

included in the extension. This regular expression

matches foo.bar and autoexec.bat and sendmail.cf andprinters.conf. ow consider complicating the problem a bit$

what if you want to match filenames where the extension is not bat, 3ome incorrect attempts4

.9=.>=7b>.98

The first attempt above tries to exclude bat by re)uiring that the first character of the extension is not a b. This is

wrong because the pattern also doesn1t match foo.bar.

.9=.>(=7b>..@.=7a>.@..=7t>!8

The expression gets messier when you try to patch up the first solution by re)uiring one of the following cases to

match4 the first character of the extension isn1t b$ the second character isn1t a$ or the third character isn1t t. This

accepts foo.bar and reects autoexec.bat but it re)uires a three%letter extension and won1t accept a filename with a

two%letter extension such as sendmail.cf. 2e1ll complicate the pattern again in an effort to fix it.

.9=.>(=7b>.,.,@.=7a>,.,@..,=7t>,!8

In the third attempt the second and third letters are all made optional in order to allow matching extensions shorter than three characters such as sendmail.cf.



The pattern1s getting really complicated now which ma'es it hard to read and understand. 2orse if the problem

changes and you want to exclude both bat and exe as extensions the pattern would get even more complicated

and confusing.

C negative loo'ahead cuts through all this confusion4

.9=.>(,`bat8!.98 The negative loo'ahead means4 if the expression bat doesn1t match at this point try the rest of the

pattern$ if bat8 does match the whole pattern will fail. The trailing 8 is re)uired to ensure that something

li'e sample.batch where the extension only starts with bat will be allowed.

Excluding another filename extension is now easy$ simply add it as an alternative inside the assertion. The

following pattern excludes filenames that end in either bat or exe4

.9=.>(,`bat8@exe8!.98

!odif5ing Strings

#p to this point we1ve simply performed searches against a static string. Regular expressions are also commonly

used to modify strings in various ways using the following pattern methods4


split(! 3plit the string into a list splitting it wherever the RE matches

sub(! /ind all substrings where the RE matches and replace them with a different string

subn(! +oes the same thing as sub(! but returns the new string and the number of replacements

Splitting Strings

The split(! method of a pattern splits a string apart wherever the RE matches returning a list of the pieces. It1s

similar to the split(! method of strings but provides much more generality in the delimiters that you

can split by$ string split(! only supports splitting by whitespace or by a fixed string. Cs you1d expect there1s a

module%levelre.split(! function too.

.split(string = maxsplit!" >!

3plit string by the matches of the regular expression. If capturing parentheses are used in the RE then their

contents will also be returned as part of the resulting list. If maxsplit is non"ero at most maxsplit splits are

performed.

ou can limit the number of splits made by passing a value for max split . 2hen maxsplit is non"ero at

most maxsplit splits will be made and the remainder of the string is returned as the final element of the list. In the

following example the delimiter is any se)uence of non%alphanumeric characters.

SSS

$$$ p re.compile(rA?2:A!

$$$ p.split(AThis is a test short and sweet of split(!.A!

=AThisA AisA AaA AtestA AshortA AandA AsweetA AofA AsplitA AA>

$$$ p.split(AThis is a test short and sweet of split(!.A J!

=AThisA AisA AaA Atest short and sweet of split(!.A>

https://docs.python.org/3/library/re.html#re.split






3ometimes you1re not only interested in what the text between delimiters is but also need to 'now what the

delimiter was. If capturing parentheses are used in the RE then their values are also returned as part of the list.

ompare the following calls4

SSS

$$$ p re.compile(rA?2:A!

$$$ pK re.compile(rA(?2:!A!

$$$ p.split(AThis... is a test.A!

=AThisA AisA AaA AtestA AA>

$$$ pK.split(AThis... is a test.A!

=AThisA A... A AisA A A AaA A A AtestA A.A AA>

The module%level function re. split (! adds the RE to be used as the first argument but is otherwise the same.

SSS

$$$ re.split(A=?2>:A A2ords words words.A!

=A2ordsA AwordsA AwordsA AA>

$$$ re.split(A(=?2>:!A A2ords words words.A!

=A2ordsA A A AwordsA A A AwordsA A.A AA>

$$$ re.split(A=?2>:A A2ords words words.A !

=A2ordsA Awords words.A>

Search and Replace

Cnother common tas' is to find all the matches for a pattern and replace them with a different string.The sub(! method ta'es a replacement value which can be either a string or a function and the string to be

processed.

.sub(replacement string = count!" >!

Returns the string obtained by replacing the leftmost non%overlapping occurrences of the RE in string by the

replacement replacement . If the pattern isn1t found string is returned unchanged.

The optional argument count is the maximum number of pattern occurrences to be replaced$ count must be a non%

negative integer. The default value of means to replace all occurrences.

6ere1s a simple example of using the sub(! method. It replaces colour names with the word colour4

SSS

$$$ p re.compile( A(blue@white@red!A !

$$$ p.sub( AcolourA Ablue soc's and red shoesA!Acolour soc's and colour shoesA

$$$ p.sub( AcolourA Ablue soc's and red shoesA count!

Acolour soc's and red shoesA

The subn(! method does the same wor' but returns a K%tuple containing the new string value and the number of

replacements that were performed4

SSS

$$$ p re.compile( A(blue@white@red!A !

$$$ p.subn( AcolourA Ablue soc's and red shoesA!

(Acolour soc's and colour shoesA K!







$$$ p.subn( AcolourA Ano colours at allA!

(Ano colours at allA !

Empty matches are replaced only when they1re not adacent to a previous match.

SSS

$$$ p re.compile(Ax9A!

$$$ p.sub(A%A AabxdA!

A%a%b%d%A

If replacement is a string any bac'slash escapes in it are processed. That is ?n is converted to a single newline

character ?r is converted to a carriage return and so forth. #n'nown escapes such as ?^ are left alone.

Hac'references such as ?N are replaced with the substring matched by the corresponding group in the RE. This

lets you incorporate portions of the original text in the resulting replacement string.

This example matches the word section followed by a string enclosed in ; < and changes section to subsection4

SSS

$$$ p re.compile(Asection; ( =7<>9 ! <A re.ZERH03E!

$$$ p.sub(rAsubsection;?<AAsection;/irst< section;second<A!

Asubsection;/irst< subsection;second<A

There1s also a syntax for referring to named groups as defined by the (,PXnameS...! syntax. ?gXnameS will use the

substring matched by the group named name and?gXnumberS uses the corresponding group number. ?gXKS is

therefore e)uivalent to ?K but isn1t ambiguous in a replacement string such as ?gXKS. (?K would be interpreted as

a reference to group K not a reference to group K followed by the literal character AA.! The following substitutions

are all e)uivalent but use all three variations of the replacement string.

SSS

$$$ p re.compile(Asection; (,PXnameS =7<>9 ! <A re.ZERH03E!

$$$ p.sub(rAsubsection;?<AAsection;/irst<A!

Asubsection;/irst<A

$$$ p.sub(rAsubsection;?gXS<A Asection;/irst<A!

Asubsection;/irst<A

$$$ p.sub(rAsubsection;?gXnameS<A Asection;/irst<A!

Asubsection;/irst<A

replacement can also be a function which gives you even more control. If replacement is a function the function iscalled for every non%overlapping occurrence of pattern. 0n each call the function is passed a match

object argument for the match and can use this information to compute the desired replacement string and return it.

In the following example the replacement function translates decimals into hexadecimal4

SSS

$$ def hexrepl(match!4

))) VReturn the hex string for a decimal numberV

))) value int(match.group(!!

))) return hex(value!

)))

$$$ p re.compile(rA?d:A!









$$$ p.sub(hexrepl Aall NBMF for printing MFBK for user code.A!

Aall xffdK for printing xc for user code.A

2hen using the module%level re.sub(! function the pattern is passed as the first argument. The pattern may be

provided as an obect or as a string$ if you need to specify regular expression flags you must either use a pattern

obect as the first parameter or use embedded modifiers in the pattern string e.g. sub(V(,

i!b:V VxV VbbbbHHHHV! returns Ax xA.

"ommon roblems

Regular expressions are a powerful tool for some applications but in some ways their behaviour isn1t intuitive and

at times they don1t behave the way you may expect them to. This section will point out some of the most common

pitfalls.

#se String !ethods

3ometimes using the re module is a mista'e. If you1re matching a fixed string or a single character class

andyou1re not using any re features such as the IU0REC3E flag then the full power of regular expressions may

not be re)uired. 3trings have several methods for performing operations with fixed strings and they1re usually much

faster because the implementation is a single small loop that1s been optimi"ed for the purpose instead of the

large more generali"ed regular expression engine.

0ne example might be replacing a single fixed string with another one$ for example you might

replace word with deed. re.sub(! seems li'e the function to use for this but consider the replace(! method. ote

that replace(! will also replace word inside words turning swordfish into sdeedfish but the naive RE word would

have done that too. (To avoid performing the substitution on parts of words the pattern would have to be ?bword?b

in order to re)uire that word have a word boundary on either side. This ta'es the ob beyond replace(!Os abilities.!

Cnother common tas' is deleting every occurrence of a single character from a string or replacing it with another

single character. ou might do this with something li'ere.sub(A?nA A A 3! but translate(! is capable of doing both

tas's and will be faster than any regular expression operation can be.

In short before turning to the re module consider whether your problem can be solved with a faster and

simpler string method.

match(! versus search(!The match(! function only chec's if the RE matches at the beginning of the string while search(! will scan forward

through the string for a match. It1s important to 'eep this distinction in mind. Remember match(! will only report a

successful match which will start at $ if the match wouldn1t start at "ero match(! will not report it.

SSS

$$$ print(re.match(AsuperA AsuperstitionA!.span(!!

( B!

$$$ print(re.match(AsuperA AinsuperableA!!

one

0n the other hand search(! will scan forward through the string reporting the first match it finds.

SSS












$$$ print(re.search(AsuperA AsuperstitionA!.span(!!

( B!

$$$ print(re.search(AsuperA AinsuperableA!.span(!!

(K _!

3ometimes you1ll be tempted to 'eep using re.match(! and ust add .9 to the front of your RE. Resist this temptation

and use re.search(! instead. The regular expression compiler does some analysis of REs in order to speed up theprocess of loo'ing for a match. 0ne such analysis figures out what the first character of a match must be$ for

example a pattern starting with row must match starting with a AA. The analysis lets the engine )uic'ly scan

through the string loo'ing for the starting character only trying the full match if a AA is found.

Cdding .9 defeats this optimi"ation re)uiring scanning to the end of the string and then bac'trac'ing to find a match

for the rest of the RE. #se re.search(! instead.

.reed5 ,ersus /on*.reed5

2hen repeating a regular expression as in a9 the resulting action is to consume as much of the pattern as

possible. This fact often bites you when you1re trying to match a pair of balanced delimiters such as the angle

brac'ets surrounding an 6T5L tag. The naive pattern for matching a single 6T5L tag doesn1t wor' because of the

greedy nature of .9.

SSS

$$$ s AXhtmlSXheadSXtitleSTitleXtitleSA

$$$ len(s!

JK$$$ print(re.match(AX.9SA s!.span(!!

( JK!

$$$ print(re.match(AX.9SA s!.group(!!

XhtmlSXheadSXtitleSTitleXtitleS

The RE matches the AXA in XhtmlS and the .9 consumes the rest of the string. There1s still more left in the RE

though and the S can1t match at the end of the string so the regular expression engine has to bac'trac' character

by character until it finds a match for the S. The final match extends from the AXA in XhtmlS to the ASA inXtitleS which

isn1t what you want.

In this case the solution is to use the non%greedy )ualifiers 9, :, ,, or ;mn<, which match as little text as

possible. In the above example the ASA is tried immediately after the first AXA matches and when it fails the engine

advances a character at a time retrying the ASA at every step. This produces ust the right result4

SSS

$$$ print(re.match(AX.9,SA s!.group(!!

XhtmlS

(ote that parsing 6T5L or &5L with regular expressions is painful. uic'%and%dirty patterns will handle common

cases but 6T5L and &5L have special cases that will brea' the obvious regular expression$ by the time you1ve

written a regular expression that handles all of the possible cases the patterns will be very complicated. #se an

6T5L or &5L parser module for such tas's.!

#sing re)2ER%OSE











yHy now you1ve probably noticed that regular expressions are a very compact notation but they1re not terribly

readable. REs of moderate complexity can become lengthy collections of bac'slashes parentheses and

metacharacters ma'ing them difficult to read and understand.

/or such REs specifying the re.ZERH03E flag when compiling the regular expression can be helpful because it

allows you to format the regular expression more clearly.

The re.ZERH03E flag has several effects. 2hitespace in the regular expression that isnt inside a character class

is ignored. This means that an expression such as dog @cat is e)uivalent to the less readable dog@cat but =a b> will

still match the characters AaA AbA or a space. In addition you can also put comments inside a RE$ comments extend

from a ] character to the next newline. 2hen used with triple%)uoted strings this enables REs to be formatted

more neatly4

pat re.compile(rVVV

?s9 ] 3'ip leading whitespace

(,PXheaderS=74>:! ] 6eader name

?s9 4 ] 2hitespace and a colon (,PXvalueS.9,! ] The headerAs value %% 9, used to

] lose the following trailing whitespace

?s98 ] Trailing whitespace to end%of%line

VVV re.ZERH03E!

This is far more readable than4

pat re.compile(rV?s9(,PXheaderS=74>:!?s94(,PXvalueS.9,!?s98V!

-eedbac&

Regular expressions are a complicated topic. +id this document help you understand them, 2ere there parts that

were unclear or Problems you encountered that weren1t covered here, If so please send suggestions for

improvements to the author.

The most complete boo' on regular expressions is almost certainly effrey /riedl1s 5astering Regular Expressions

published by 01Reilly. #nfortunately it exclusively concentrates on Perl and ava1s flavours of regular expressions

and doesn1t contain any Python material at all so it won1t be useful as a reference for programming in Python. (The

first edition covered Python1s now%removed regex module which won1t help you much.! onsider chec'ing it out

from your library.

regular expression howt tutorial of the re

Documents