python: regular expressions

Python: Regular Expressions

http://www.flickr.com/photos/iamthestig2/3925864142/sizes/l/in/photostream/

Patterns (Regular Expressions)

Patterns are a very useful technique for processing textual data. A pattern defines a set of strings. The fundamental operation is set-membership. Given a string S, we can ask if S is a member of the set defined by some pattern P.


Case Study : Hugs and Kisses

A fixed pattern is one with no variability. A hugs and kisses pattern is an example. The hugs-and-kisses pattern:

XOXO

Case Study : MPAA Ratings

There are 5 MPAA ratings: G PG PG-13 R NC-17

The MPAA rating pattern: G|PG|PG-13|R|NC-17
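A minimal sketch of the set-membership question for the two patterns so far, assuming Python's standard `re` module and its `re.fullmatch` function (which asks whether the whole string belongs to the set). The helper name `is_rating` is made up for illustration.

```python
import re

HUGS_AND_KISSES = r"XOXO"           # fixed pattern: exactly one member
MPAA = r"G|PG|PG-13|R|NC-17"        # alternation: five members

def is_rating(s):
    """Return True if s is exactly one of the five MPAA ratings."""
    return re.fullmatch(MPAA, s) is not None

print(re.fullmatch(HUGS_AND_KISSES, "XOXO") is not None)   # True
for candidate in ["G", "PG-13", "NC-17", "X", "PG-14"]:
    print(candidate, is_rating(candidate))
```

`re.fullmatch` is used rather than `re.match` because `re.match` only checks a prefix, so "PG-14" would be accepted on the strength of its leading "PG".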

Case Study : SSN

A social security number can be understood as any 3 digits followed by a dash (-) followed by any 2 digits followed by a dash followed by any 4 digits.

The language of regular expressions

Here is an inductive definition of the syntax of the basic elements of a regular expression. Any single character is a regular expression. If A and B are both regular expressions, then so are:

    AB : this represents A followed by B; concatenation
    A|B : this represents A or B; the vertical bar is special; alternation
    (A) : this represents a group; the parens are special

Examples

Do the following match the regex '(c|h)a?rt*'?
    hart
    cat
    car
    chart
    chaarrtt
    hrtttt

Do the following match the regex '(x|y)*'?
    x
    xy
    xxyxyyx
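One way to check answers to the questions above is to let Python decide. This sketch assumes "match" means the whole string is in the set, so it uses `re.fullmatch`; the helper `matching` is a hypothetical name.

```python
import re

def matching(pattern, candidates):
    """Return the candidates that the pattern matches in full."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

print(matching(r"(c|h)a?rt*", ["hart", "cat", "car", "chart", "chaarrtt", "hrtttt"]))
# ['hart', 'car', 'hrtttt']

print(matching(r"(x|y)*", ["x", "xy", "xxyxyyx"]))
# ['x', 'xy', 'xxyxyyx']
```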

Case Study : SSN

A social security number can be understood as any 3 digits followed by a dash (-) followed by any 2 digits followed by a dash followed by any 4 digits.

Repetition

Patterns often include a notion of repetition. Notations are introduced to control repetition.
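The slide's own list of notations did not survive, so the following is a sketch of the repetition operators that Python's `re` module provides (`*`, `+`, `?`, `{m}`, `{m,n}`), not a reconstruction of the original slide. The patterns and strings are made-up examples.

```python
import re

examples = [
    (r"ab*", ["a", "ab", "abbb"]),         # *     : zero or more b's
    (r"ab+", ["ab", "abbb"]),              # +     : one or more b's
    (r"ab?", ["a", "ab"]),                 # ?     : zero or one b
    (r"ab{2}", ["abb"]),                   # {m}   : exactly m b's
    (r"ab{1,3}", ["ab", "abb", "abbb"]),   # {m,n} : between m and n b's
]

for pattern, strings in examples:
    for s in strings:
        print(pattern, repr(s), re.fullmatch(pattern, s) is not None)

# A repetition operator applies to the element just before it,
# so "ab+" requires at least one b.
print(re.fullmatch(r"ab+", "a"))   # None
```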

Case Study : SSN

We can simplify our SSN pattern
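A sketch of what the simplification might look like, assuming the unsimplified pattern spells out each digit position as an alternation (the original slide's pattern is not preserved here). The names `DIGIT`, `verbose`, and `compact` are illustrative.

```python
import re

DIGIT = "(0|1|2|3|4|5|6|7|8|9)"

# Without repetition: one copy of DIGIT per digit position.
verbose = DIGIT * 3 + "-" + DIGIT * 2 + "-" + DIGIT * 4

# With repetition: say how many times each group repeats.
compact = DIGIT + "{3}-" + DIGIT + "{2}-" + DIGIT + "{4}"

for s in ["123-45-6789", "123-456-789"]:
    print(s, bool(re.fullmatch(verbose, s)), bool(re.fullmatch(compact, s)))
```

Both patterns define the same set of strings; the second is simply shorter to write and read.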

Case Study : Hugs and Kisses Two

Consider a hugs and kisses pattern that includes any string composed of pairs of XO's:
    XO
    XOXO
    XOXOXO
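One possible pattern for this case study is a group repeated one or more times. The sketch below assumes at least one XO pair is required; the name `HUGS` is illustrative.

```python
import re

HUGS = r"(XO)+"   # the group XO, repeated one or more times

for s in ["XO", "XOXO", "XOXOXO", "XOX", ""]:
    print(repr(s), re.fullmatch(HUGS, s) is not None)
```

The parentheses matter: `XO+` would mean an X followed by one or more O's, which is a different set of strings.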

Write the following patterns:
    A binary string that is odd
    A binary string that contains at least 3 consecutive 1's
    A binary string that contains no more than 3 consecutive 1's
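Here is one possible set of answers, offered as a sketch rather than the only solution. It assumes "odd" means the last bit is 1, and it checks each pattern by brute force against a plain Python definition for every binary string of length 1 to 8. Note that the third pattern, as written, would also accept the empty string.

```python
import re
from itertools import product

ODD = r"(0|1)*1"                  # any bits, then a final 1
AT_LEAST_3 = r"(0|1)*111(0|1)*"   # "111" appears somewhere
AT_MOST_3 = r"(1{0,3}0)*1{0,3}"   # every run of 1's has length 3 or less

def matches(pattern, s):
    """Return True if the whole string s is in the set defined by pattern."""
    return re.fullmatch(pattern, s) is not None

# Compare each regex with an equivalent plain-Python test.
for n in range(1, 9):
    for bits in product("01", repeat=n):
        s = "".join(bits)
        assert matches(ODD, s) == (int(s, 2) % 2 == 1)
        assert matches(AT_LEAST_3, s) == ("111" in s)
        assert matches(AT_MOST_3, s) == ("1111" not in s)

print("all three patterns agree with the plain-Python checks")
```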

Character Classes

A character class is a pattern that concisely defines a set of characters.

The term digit, for example, names a character class since a digit is defined as the set of the 10 characters 0, 1, 2, 3, 4, 5, 6, 7, 8, and 9.

Additional pattern writing rules allow us to define our own character classes and also provide several pre-defined commonly-used character classes.

Regular Expressions and Character classes

Square brackets denote a set of characters. Set members are listed explicitly.
    [abc] will match an 'a', 'b', or 'c'
    [uwl] will match a 'u', 'w', or 'l'

Special characters are not special in character classes.
    [.a*] will match a '.', 'a', or '*'

There are, however, two special characters that are still special in character classes. (The backslash also keeps its escaping role inside a class.)
    - : denotes a range meaning left-through-right
    ^ : occurs at the beginning and denotes logical set negation

Examples
    [a-c] will match an a or b or c and nothing else
    [a-z] will match any lowercase alphabetic symbol
    [a-zA-Z0-9] will match any alphanumeric symbol
    [^a] will match anything but lowercase a
    [^0-9] will match anything but a digit character
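The rules above can be tried out with `re.findall`, which returns every character of a string that a class matches. This is a small sketch; the sample string is made up.

```python
import re

text = "Room 42-B, floor 3 (a*b.c)"

digits = re.findall(r"[0-9]", text)             # a range
uppercase = re.findall(r"[A-Z]", text)          # another range
literals = re.findall(r"[.a*]", text)           # . and * are not special here
others = re.findall(r"[^a-zA-Z0-9 ]", text)     # negation: not alphanumeric, not space

print(digits)      # ['4', '2', '3']
print(uppercase)   # ['R', 'B']
print(literals)    # ['a', '*', '.']
print(others)      # ['-', ',', '(', '*', '.', ')']
```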

Examples

Do the following match the regex '[a-z][0-9]*'?
    abc
    1z93
    a-9

Do the following match the regex '[0-9]*[^02468]'?
    03
    999
    354

Give a regex for social security numbers:
    [0-9]{3}-[0-9]{2}-[0-9]{4}
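Whether a string "matches" depends on whether the pattern must cover the whole string or only its beginning. The sketch below runs the example strings through both `re.match` (prefix) and `re.fullmatch` (whole string); the helper `check` and the sample SSN are illustrative.

```python
import re

def check(pattern, s):
    """Return (prefix matches, whole string matches) for s."""
    return (re.match(pattern, s) is not None,
            re.fullmatch(pattern, s) is not None)

for s in ["abc", "1z93", "a-9"]:
    print(s, check(r"[a-z][0-9]*", s))

for s in ["03", "999", "354"]:
    print(s, check(r"[0-9]*[^02468]", s))

SSN = r"[0-9]{3}-[0-9]{2}-[0-9]{4}"
print(check(SSN, "123-45-6789"))   # (True, True)
```

For example, "abc" matches `[a-z][0-9]*` as a prefix (the single letter "a") but not as a whole string.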

Predefined classes

Some character classes are common and have shorthand definitions:
    \d : matches any decimal digit; equivalent to [0-9]
    \D : matches any non-digit character; equivalent to [^0-9]
    \w : matches any 'word' character; equivalent to [a-zA-Z0-9_]
    \W : matches any non-word character; equivalent to [^a-zA-Z0-9_]
    \s : matches any whitespace character (space, tab, newline); equivalent to [ \t\n\r\f\v]
    \S : matches any character that is not whitespace; equivalent to [^ \t\n\r\f\v]

These equivalences hold for ASCII text; by default, Python 3 string patterns also accept Unicode digits, letters, and whitespace.

Give a regex for social security numbers:
    \d{3}-\d{2}-\d{4}
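A short sketch of the shorthand classes in use, with a made-up sample line. Raw strings (`r"..."`) are used so that Python passes the backslashes through to the `re` module unchanged.

```python
import re

line = "SSN 123-45-6789 on file_2"

numbers = re.findall(r"\d+", line)    # runs of digits
words = re.findall(r"\w+", line)      # runs of word characters (note the underscore)
spaces = re.findall(r"\s", line)      # individual whitespace characters
ssn = re.search(r"\d{3}-\d{2}-\d{4}", line).group()

print(numbers)   # ['123', '45', '6789', '2']
print(words)     # ['SSN', '123', '45', '6789', 'on', 'file_2']
print(spaces)    # [' ', ' ', ' ']
print(ssn)       # 123-45-6789
```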

Predefined classes

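There are two 'positional' matches:
    $ : matches at the end of a string or just before a newline at the end of the string; in MULTILINE mode it also matches before every newline
    ^ : matches at the start of a string; in MULTILINE mode it also matches right after every newline

What do the following mean?
    ^.*s$
    ^\s.*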
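A sketch for exploring the two patterns above. It reads `^.*s$` as "a string that ends in s" and `^\s.*` as "a string that starts with whitespace"; the sample strings are made up, and `re.MULTILINE` is the standard flag that makes `^` and `$` work line by line.

```python
import re

ENDS_WITH_S = r"^.*s$"
STARTS_WITH_SPACE = r"^\s.*"

print(re.search(ENDS_WITH_S, "regular expressions") is not None)   # True
print(re.search(ENDS_WITH_S, "regular expression") is not None)    # False
print(re.search(STARTS_WITH_SPACE, "  indented") is not None)      # True
print(re.search(STARTS_WITH_SPACE, "flush left") is not None)      # False

# With MULTILINE, the anchors apply to every line of the text.
text = "first lines\nsecond line\nthird lines"
per_line = re.findall(ENDS_WITH_S, text, re.MULTILINE)
print(per_line)   # ['first lines', 'third lines']
```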

Finding patterns in text (Examples)

Sample text: Mary had a little lamb. And everywhere that Mary went, the lamb was sure to go.

'a'

'Mary'

Sample text: Special characters must be escaped.*

'.*'

'\.\*'

Patterns applied to the first sample text:

'^Mary'

'Mary$'

'.a'

'[a-z]a'

'[^a-z]a'

'to|go|the'

http://www.duke.edu/~dgraham/ETM/LearningtoUseRegularExpressions.html
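The highlighting from the original slides is not available in text form, so here is a sketch that prints what each pattern finds in the sample texts, using `re.findall` and `re.search` from the standard library.

```python
import re

text = ("Mary had a little lamb. And everywhere that Mary went, "
        "the lamb was sure to go.")

patterns = ["a", "Mary", "^Mary", "Mary$", ".a", "[a-z]a", "[^a-z]a", "to|go|the"]
for pattern in patterns:
    print(repr(pattern), re.findall(pattern, text))

escaped_text = "Special characters must be escaped.*"
# Unescaped, '.*' means "any characters", so it matches the whole string.
print(re.search(r".*", escaped_text).group())
# Escaped, it matches only the literal two characters '.*'.
print(re.search(r"\.\*", escaped_text).group())
```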

Matching and Searching

Regular expressions are in the "re" module.
    match(pattern, text): determines whether the pattern matches at the beginning of the text. Returns either None or a Match object.
    search(pattern, text): determines whether the pattern occurs anywhere in the text. Returns either None or a Match object.
    findall(pattern, text): returns a list of all non-overlapping substrings of the text that match the pattern (if the pattern contains groups, the groups are returned instead)
    finditer(pattern, text): returns an iterator over Match objects, one for each match

>>> import re
>>> re.match("c", "abcdef")   # No match
>>> re.match("a", "abcdef")   # Match
>>> re.search("c", "abcdef")  # Match
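The session above covers `match` and `search`; the following sketch does the same for `findall` and `finditer`, using a made-up sample sentence.

```python
import re

text = "Mary had a little lamb."

# findall returns the matching substrings themselves.
pairs = re.findall(r"[a-z]a", text)
print(pairs)       # ['ha', 'la']

# finditer yields Match objects, which also carry positions.
positions = [m.start() for m in re.finditer(r"a", text)]
print(positions)   # [1, 6, 9, 19]
```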

Match Objects

Match objects support the following methods:
    start(): returns the index of the start of the match
    end(): returns the index of the end of the match
    groups(): returns a tuple of the group matches
    group(n): returns the nth group match. If n = 0, returns the entire match.

>>> m = re.match(r"(\w+) (\w+)", "Lazy hands make for poverty,")
>>> m.group(0)     # The entire match
'Lazy hands'
>>> m.group(1)     # The first parenthesized subgroup.
'Lazy'
>>> m.group(2)     # The second parenthesized subgroup.
'hands'
>>> m.group(1, 2)  # Multiple arguments give us a tuple.
('Lazy', 'hands')
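The session above exercises `group`; this short sketch covers the remaining methods (`start`, `end`, `groups`) on the same match. Passing a group number to `start` is a standard `re` feature, shown here as an extra.

```python
import re

m = re.match(r"(\w+) (\w+)", "Lazy hands make for poverty,")

print(m.start())    # 0   index where the match begins
print(m.end())      # 10  index just past the end of the match
print(m.groups())   # ('Lazy', 'hands')
print(m.start(2))   # 5   start() also accepts a group number
```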

Example: Phone Numbers

Consider creating a regular expression to match phone numbers. The phone numbers can take on the following forms:
    800-555-1212
    800 555 1212
    800.555.1212
    (800) 555-1212
    1-800-555-1212
    800-555-1212-1234
    800-555-1212x1234
    800-555-1212 ext. 1234
    1-(800) 555.1212 #1234


It is good to define a test for your code prior to writing the code.

Consider testing our pattern against the examples on the previous slide.

import re

def test(regex):
    # Sample phone numbers to try the pattern against
    tests = ['800-555-1212', '800 555 1212', '800.555.1212',
             '(800) 555-1212', '800-555-1212', '800-555-1212-1234',
             '800-555-1212x1234', '800-555-1212 ext. 1234',
             '1-(800) 555.1212 #1234']

    # re.match only checks whether the pattern matches at the start
    for number in tests:
        print(number + ": ", re.match(regex, number))


The previous phone numbers had only four components:
    area code
    trunk (first three digits)
    rest (next four digits)
    extension (last digits; between 1 and 4 digits long)

Consider defining these parts with regular expressions:
    area code would be \d{3}
    trunk would be \d{3}
    rest would be \d{4}
    extension would be \d{1,4}


Consider the following for phone numbers:
    area code-trunk-rest-extension
    \d{3}-\d{3}-\d{4}-\d{1,4}

>>> test(r'\d{3}-\d{3}-\d{4}-\d{1,4}')
800-555-1212: None
800 555 1212: None
800.555.1212: None
(800) 555-1212: None
800-555-1212: None
800-555-1212-1234: <_sre.SRE_Match object at 0x0000000002C06920>
800-555-1212x1234: None
800-555-1212 ext. 1234: None
1-(800) 555.1212 #1234: None


How to modify our regex to say that extensions are optional?
    \d{3}-\d{3}-\d{4}(-\d{1,4})?

>>> test(r'\d{3}-\d{3}-\d{4}(-\d{1,4})?')
800-555-1212: <_sre.SRE_Match object at 0x0000000002C4D198>
800 555 1212: None
800.555.1212: None
(800) 555-1212: None
800-555-1212: <_sre.SRE_Match object at 0x0000000002C4D198>
800-555-1212-1234: <_sre.SRE_Match object at 0x0000000002C4D198>
800-555-1212x1234: <_sre.SRE_Match object at 0x0000000002C4D198>
800-555-1212 ext. 1234: <_sre.SRE_Match object at 0x0000000002C4D198>
1-(800) 555.1212 #1234: None
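It may look surprising that '800-555-1212x1234' matches a pattern with no 'x' in it. The sketch below suggests why: `re.match` only requires the pattern to match at the start of the string, so the optional group matches nothing and the trailing text is ignored. `re.fullmatch` is used for contrast; the name `OPTIONAL_EXT` is illustrative.

```python
import re

OPTIONAL_EXT = r'\d{3}-\d{3}-\d{4}(-\d{1,4})?'

for number in ['800-555-1212-1234', '800-555-1212x1234']:
    prefix = re.match(OPTIONAL_EXT, number)      # anchored at the start only
    whole = re.fullmatch(OPTIONAL_EXT, number)   # must cover the whole string
    print(number,
          '| matched text:', prefix.group(0),
          '| extension group:', prefix.group(1),
          '| whole string matches:', whole is not None)
```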


How to handle different separators? Consider the following test cases especially:
    800-555-1212
    800 555 1212
    800.555.1212
    (800) 555-1212
    800-555-1212x1234
    800-555-1212 ext. 1234
    1-(800) 555.1212 #1234

Let's say that a separator is optional and consists of any number of non-digit characters.


How to modify our regex to deal with separators?
    before: \d{3}-\d{3}-\d{4}(-\d{1,4})?
    after:  \d{3}\D*\d{3}\D*\d{4}\D*(\d{1,4})?

>>> test(r'\d{3}\D*\d{3}\D*\d{4}\D*(\d{1,4})?')
800-555-1212: <_sre.SRE_Match object at 0x0000000002C4D198>
800 555 1212: <_sre.SRE_Match object at 0x0000000002C4D198>
800.555.1212: <_sre.SRE_Match object at 0x0000000002C4D198>
(800) 555-1212: None
800-555-1212: <_sre.SRE_Match object at 0x0000000002C4D198>
800-555-1212-1234: <_sre.SRE_Match object at 0x0000000002C4D198>
800-555-1212x1234: <_sre.SRE_Match object at 0x0000000002C4D198>
800-555-1212 ext. 1234: <_sre.SRE_Match object at 0x0000000002C4D198>
1-(800) 555.1212 #1234: None


How to get the four parts from sub-groups? Let's first modify our test routine.

def test(regex):
    tests = ['800-555-1212', '800 555 1212', '800.555.1212',
             '(800) 555-1212', '800-555-1212', '800-555-1212-1234',
             '800-555-1212x1234', '800-555-1212 ext. 1234',
             '1-(800) 555.1212 #1234']

    # Print the captured groups when the pattern matches, None otherwise
    for number in tests:
        match = re.match(regex, number)
        if match:
            print(number + ': ', match.groups())
        else:
            print(number + ': ', None)


How to get the four parts from sub-groups?
    (\d{3})\D*(\d{3})\D*(\d{4})\D*(\d{1,4})?

>>> test(r'(\d{3})\D*(\d{3})\D*(\d{4})\D*(\d{1,4})?')
800-555-1212: ('800', '555', '1212', None)
800 555 1212: ('800', '555', '1212', None)
800.555.1212: ('800', '555', '1212', None)
(800) 555-1212: None
800-555-1212: ('800', '555', '1212', None)
800-555-1212-1234: ('800', '555', '1212', '1234')
800-555-1212x1234: ('800', '555', '1212', '1234')
800-555-1212 ext. 1234: ('800', '555', '1212', '1234')
1-(800) 555.1212 #1234: None
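Two test cases still report None: the ones that begin with '(' or with a leading '1-'. The following is one possible way to cover them, offered as a sketch and not as the original author's solution. It assumes that leading non-digits and an optional leading 1 may be skipped, and it uses a non-capturing group `(?:...)`, which has not been introduced so far. The names `PHONE` and `parse_phone` are hypothetical.

```python
import re

# ^\D*        skip leading non-digits such as '('
# (?:1\D*)?   optionally skip a leading 1 and its separator (not captured)
# $           require that nothing is left over after the extension
PHONE = re.compile(r'^\D*(?:1\D*)?(\d{3})\D*(\d{3})\D*(\d{4})\D*(\d{1,4})?$')

def parse_phone(text):
    """Return (area code, trunk, rest, extension), or None if no match."""
    m = PHONE.match(text)
    return m.groups() if m else None

for number in ['(800) 555-1212', '1-800-555-1212', '1-(800) 555.1212 #1234']:
    print(number + ': ', parse_phone(number))
```

Anchoring with `$` is a design choice: it rejects strings with unexpected trailing digits, at the cost of also rejecting phone numbers embedded in longer text.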

Splitting

Consider reading words from a text file. In the past we have split lines on whitespace. This does not split thoroughly: punctuation stays attached to the words.

Consider a text file having punctuation symbols and separators:

    The words of the Teacher, son of David, king in Jerusalem:
    "Meaningless! Meaningless!" says the Teacher.

Splitting on whitespace:

infile = open(filename, "r")
for line in infile:
    for word in line.split():
        print(word)

Splitting on runs of non-word characters (this can produce empty strings at the start or end of a line):

infile = open(filename, "r")
for line in infile:
    for word in re.split(r"\W+", line):
        print(word)
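A sketch comparing the two ways of splitting on one line of the sample text, without needing a file. It also shows the empty string that `re.split` leaves behind when a line ends in punctuation, and `re.findall(r"\w+", ...)` as an alternative that avoids it.

```python
import re

line = "The words of the Teacher, son of David, king in Jerusalem:"

by_whitespace = line.split()
by_nonword = re.split(r"\W+", line)
words = re.findall(r"\w+", line)

print(by_whitespace)   # punctuation stays attached: 'Teacher,', 'David,', 'Jerusalem:'
print(by_nonword)      # punctuation is gone, but the list ends with ''
print(words)           # just the words
```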
