Python: Regular Expressions
Patterns (Regular Expressions)
Patterns are a very useful technique for processing textual data. A pattern defines a set of strings. The fundamental operation is set-membership. Given a string S, we can ask if S is a
member of the set defined by some pattern P.
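The set-membership question can be sketched in Python as follows. The helper name is_member is an assumption made for illustration, and re.fullmatch is used so that the whole string, not just a prefix, must belong to the set.

```python
import re

def is_member(pattern, s):
    """Return True if the whole string s is in the set defined by pattern."""
    # is_member is a hypothetical helper, not part of the re module.
    return re.fullmatch(pattern, s) is not None

print(is_member("abc", "abc"))    # True
print(is_member("abc", "abcd"))   # False: "abcd" is not in the set {"abc"}
```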
Patterns (Regular Expressions)
Case Study : Hugs and Kisses
A fixed pattern is one with no variability. A hugs and kisses pattern is an example. The hugs-and-kisses pattern:
XOXO
Case Study : MPAA Ratings
There are 5 MPAA ratings: G, PG, PG-13, R, and NC-17.
The MPAA rating pattern: G|PG|PG-13|R|NC-17
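As a rough check, both case-study patterns can be tried in Python. This is a sketch that assumes "matching" means the whole string must fit the pattern, which is what re.fullmatch tests; the helper name is_member is made up for the example.

```python
import re

HUGS = "XOXO"
MPAA = "G|PG|PG-13|R|NC-17"

def is_member(pattern, s):
    # fullmatch succeeds only if the entire string fits the pattern.
    return re.fullmatch(pattern, s) is not None

print("XOXO", is_member(HUGS, "XOXO"))
for candidate in ["G", "PG-13", "NC-17", "PG-14", "X"]:
    print(candidate, is_member(MPAA, candidate))
```

The first three ratings are members of the set; PG-14 and X are not.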
Case Study : SSN
A social security number can be understood as any 3 digits followed by a dash (-) followed by any 2 digits followed by a dash followed by any 4 digits.
The language of regular expressions
Here is an inductive definition of the syntax of the basic elements of a regular expression. Any single character is a regular expression. If A and B are both regular expressions, then so are:
AB : this represents A followed by B; concatenation
A|B : this represents A or B; the vertical bar is special; alternation
(A) : this represents a group; the parens are special
Examples
Do the following match the regex '(c|h)a?rt*'? hart, cat, car, chart, chaarrtt, hrtttt
Do the following match the regex '(x|y)*'? x, xy, xxyxyyx
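One way to check the answers is to run the candidates through Python. The sketch below assumes that "match" means the whole string must fit the pattern (re.fullmatch), with * read as "zero or more" and ? as "optional"; the helper name check is invented for the example.

```python
import re

def check(pattern, candidates):
    """Map each candidate to True/False depending on a whole-string match."""
    return {s: re.fullmatch(pattern, s) is not None for s in candidates}

first = check(r"(c|h)a?rt*", ["hart", "cat", "car", "chart", "chaarrtt", "hrtttt"])
second = check(r"(x|y)*", ["x", "xy", "xxyxyyx"])

for s, ok in {**first, **second}.items():
    print(s, ok)
```

Under that reading, hart, car, and hrtttt match the first pattern, and all three strings match the second.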
Repetition
Patterns often include a notion of repetition. Notations such as *, +, ?, and {m,n} are introduced to control how many times the preceding element may repeat.
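The usual repetition notations in Python's re module can be sketched as below. The sample strings are made up, and whole-string matching with re.fullmatch is assumed.

```python
import re

def matches(pattern, s):
    # Hypothetical helper: True if the whole string fits the pattern.
    return re.fullmatch(pattern, s) is not None

results = {
    "ab* on a":         matches(r"ab*", "a"),          # * : zero or more b
    "ab+ on a":         matches(r"ab+", "a"),          # + : one or more b
    "ab+ on abbb":      matches(r"ab+", "abbb"),
    "ab? on abb":       matches(r"ab?", "abb"),        # ? : zero or one b
    "ab{3} on abbb":    matches(r"ab{3}", "abbb"),     # {3} : exactly three b
    "ab{1,3} on abbbb": matches(r"ab{1,3}", "abbbb"),  # {1,3} : one to three b
}
for label, ok in results.items():
    print(label, ok)
```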
Case Study : SSN
We can simplify our SSN pattern by using repetition counts instead of writing the digit pattern out once for every digit.
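A sketch of the simplification, assuming the unsimplified pattern spells each digit out as an alternation of the ten digit characters. The names DIGIT, LONG_SSN, and SHORT_SSN are invented here, and Python string repetition is used only to avoid typing the long form by hand.

```python
import re

DIGIT = "(0|1|2|3|4|5|6|7|8|9)"

# Long form: the digit pattern is concatenated once per digit.
LONG_SSN = DIGIT * 3 + "-" + DIGIT * 2 + "-" + DIGIT * 4
# Simplified form: the same digit pattern with repetition counts.
SHORT_SSN = DIGIT + "{3}-" + DIGIT + "{2}-" + DIGIT + "{4}"

for s in ["123-45-6789", "123-456-789", "12-345-6789"]:
    print(s, bool(re.fullmatch(LONG_SSN, s)), bool(re.fullmatch(SHORT_SSN, s)))
```

Both forms accept and reject the same sample strings; only the first sample is a well-formed SSN.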
Case Study : Hugs and Kisses Two
Consider a hugs and kisses pattern that includes any string composed of one or more XO pairs: XO, XOXO, XOXOXO, and so on.
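One possible pattern for this case is (XO)+, that is, the group XO repeated one or more times. The sketch below tries it against a few strings, assuming whole-string matching.

```python
import re

HUGS_TWO = r"(XO)+"

samples = ["XO", "XOXO", "XOXOXO", "", "XOX", "OX"]
accepted = [s for s in samples if re.fullmatch(HUGS_TWO, s)]
print(accepted)   # ['XO', 'XOXO', 'XOXOXO']
```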
Write the following patterns:
A binary string that is odd
A binary string that contains at least 3 consecutive 1's
A binary string that contains no more than 3 consecutive 1's
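One possible set of answers, offered as a sketch rather than the intended solution. It assumes "odd" means the string ends in 1 and uses only alternation, grouping, * and ?. The brute-force loop compares each pattern with a plain string check on every binary string of length 1 to 8.

```python
import re
from itertools import product

ODD = r"(0|1)*1"
AT_LEAST_THREE_ONES = r"(0|1)*111(0|1)*"
# Every 0 is preceded by at most three 1's, then at most three trailing 1's.
AT_MOST_THREE_ONES = r"(0|10|110|1110)*(1|11|111)?"

def matches(pattern, s):
    return re.fullmatch(pattern, s) is not None

def all_binary_strings(max_len):
    for n in range(1, max_len + 1):
        for bits in product("01", repeat=n):
            yield "".join(bits)

mismatches = []
for s in all_binary_strings(8):
    if matches(ODD, s) != s.endswith("1"):
        mismatches.append(("odd", s))
    if matches(AT_LEAST_THREE_ONES, s) != ("111" in s):
        mismatches.append(("at least 3", s))
    if matches(AT_MOST_THREE_ONES, s) != ("1111" not in s):
        mismatches.append(("at most 3", s))

print("mismatches:", mismatches)
```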
Character Classes
A character class is a pattern that concisely defines a set of characters.
The term digit, for example, names a character class since a digit is defined as the set of the 10 characters 0, 1, 2, 3, 4, 5, 6, 7, 8, and 9.
Additional pattern writing rules allow us to define our own character classes and also provide several pre-defined commonly-used character classes.
Regular Expressions and Character classes
Square brackets denote a set of characters. Set members are listed explicitly.
[abc] will match an 'a', 'b', or 'c'
[uwl] will match a 'u', 'w', or 'l'
Most special characters are not special in character classes.
[.a*] will match a '.', 'a', or '*'
There are, however, two characters that have a special meaning inside character classes.
- : denotes a range meaning left-through-right
^ : occurs at the beginning and denotes logical set negation
Examples
[a-c] will match an a or b or c and nothing else
[a-z] will match any lowercase alphabetic symbol
[a-zA-Z0-9] will match any alphanumeric symbol
[^a] will match anything but lowercase a
[^0-9] will match anything but a digit character
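These classes can be tried out with re.findall, which lists every character in a string that belongs to the class. The sample strings are made up for the sketch.

```python
import re

listed = re.findall(r"[abc]", "a1b2c3d4")          # explicit members
literal = re.findall(r"[.a*]", "a.b*c")            # . and * are literal here
ranged = re.findall(r"[a-c]", "abcdef")            # range a through c
negated = re.findall(r"[^0-9]", "a1b2")            # anything but a digit
alnum = re.findall(r"[a-zA-Z0-9]", "x-Y_7!")       # alphanumeric symbols

print(listed, literal, ranged, negated, alnum)
```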
Examples
Do the following match the regex '[a-z][0-9]*'? abc, 1z93, a-9
Do the following match the regex '[0-9]*[^02468]'? 03, 999, 354
Give a regex for social security numbers: [0-9]{3}-[0-9]{2}-[0-9]{4}
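The answers depend on what "match" means. The sketch below assumes the strict reading, where the whole string must fit the pattern (re.fullmatch). Under a prefix reading, as with re.match, 'abc' and 'a-9' would succeed because their first character alone fits '[a-z][0-9]*'.

```python
import re

def check(pattern, candidates):
    # Hypothetical helper: whole-string match for each candidate.
    return {s: re.fullmatch(pattern, s) is not None for s in candidates}

letters_digits = check(r"[a-z][0-9]*", ["abc", "1z93", "a-9", "a93"])
not_even_end = check(r"[0-9]*[^02468]", ["03", "999", "354"])
ssn = check(r"[0-9]{3}-[0-9]{2}-[0-9]{4}", ["123-45-6789", "123-456-789"])

print(letters_digits)
print(not_even_end)
print(ssn)
```

The string a93 is an extra sample added to show a string that does fit the first pattern as a whole.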
Predefined classes
Some character classes are common and have shorthand definitions (the equivalences below hold for ASCII text):
\d : matches any decimal digit; equivalent to [0-9]
\D : matches any non-digit character; equivalent to [^0-9]
\w : matches any 'word' character; equivalent to [a-zA-Z0-9_]
\W : matches any non-word character; equivalent to [^a-zA-Z0-9_]
\s : matches any whitespace character (space, tab, newline); equivalent to [ \t\n\r\f\v]
\S : matches any character that is not whitespace; equivalent to [^ \t\n\r\f\v]
Give a regex for social security numbers: \d{3}-\d{2}-\d{4}
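A short sketch of the shorthand classes in use, with a made-up sample string. The last two lines illustrate a Python 3 detail: on str patterns \d also accepts non-ASCII decimal digits unless the re.ASCII flag is given.

```python
import re

text = "Room 42,\tbuilding_7!"

digits = re.findall(r"\d", text)     # single digit characters
words = re.findall(r"\w+", text)     # runs of word characters (underscore included)
spaces = re.findall(r"\s", text)     # whitespace characters
ssn_ok = re.fullmatch(r"\d{3}-\d{2}-\d{4}", "123-45-6789") is not None

# U+0663 is ARABIC-INDIC DIGIT THREE.
unicode_digit = re.fullmatch(r"\d", "\u0663") is not None
ascii_only = re.fullmatch(r"\d", "\u0663", re.ASCII) is not None

print(digits, words, spaces, ssn_ok, unicode_digit, ascii_only)
```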
Predefined classes
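There are two 'positional' matches:
$ : matches the end of a string (or just before a newline at the end of the string); in multiline mode it also matches before each newline
^ : matches the start of a string; in multiline mode it also matches right after each newline
What do the following mean? ^.*s$ and ^\s.*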
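The difference between the default and multiline behavior of ^ and $ can be sketched with a made-up two-line string. The last two lines try the two quiz patterns with re.search.

```python
import re

text = "first line\nsecond line"

start_default = re.findall(r"^\w+", text)
start_multi = re.findall(r"^\w+", text, re.MULTILINE)
end_default = re.findall(r"\w+$", text)
end_multi = re.findall(r"\w+$", text, re.MULTILINE)

print(start_default, start_multi)   # ['first'] ['first', 'second']
print(end_default, end_multi)       # ['line'] ['line', 'line']

ends_in_s = re.search(r"^.*s$", "regular expressions") is not None
starts_with_space = re.search(r"^\s.*", "not indented") is not None
print(ends_in_s, starts_with_space)  # True False
```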
Finding patterns in text (Examples)
'a'
Mary had a little lamb. And everywhere that Mary went, the lamb was sure to go.
'Mary'
Mary had a little lamb. And everywhere that Mary went, the lamb was sure to go.
'.*'
Special characters must be escaped.*
'\.\*'
Special characters must be escaped.*
http://www.duke.edu/~dgraham/ETM/LearningtoUseRegularExpressions.html
Finding patterns in text (Examples)
'^Mary'
Mary had a little lamb. And everywhere that Mary went, the lamb was sure to go.
'Mary$'
Mary had a little lamb. And everywhere that Mary went, the lamb was sure to go.
'.a'
Mary had a little lamb. And everywhere that Mary went, the lamb was sure to go.
'[a-z]a'
Mary had a little lamb. And everywhere that Mary went, the lamb was sure to go.
'[^a-z]a'
Mary had a little lamb. And everywhere that Mary went, the lamb was sure to go.
'to|go|the'
Mary had a little lamb. And everywhere that Mary went, the lamb was sure to go.
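The parts of the text that each pattern picks out can be listed with re.findall. This is a sketch; the text is typed with a space between "Mary" and "went", which is assumed to be the intended wording.

```python
import re

text = ("Mary had a little lamb. And everywhere that Mary went, "
        "the lamb was sure to go.")
escaped = "Special characters must be escaped.*"

patterns = [r"Mary", r"^Mary", r"Mary$", r".a", r"[a-z]a", r"[^a-z]a", r"to|go|the"]
found = {p: re.findall(p, text) for p in patterns}
for p, hits in found.items():
    print(repr(p), hits)

# '.*' greedily takes the whole line; '\.\*' finds only the literal ".*".
whole_line = re.search(r".*", escaped).group(0)
literal_only = re.findall(r"\.\*", escaped)
print(repr(whole_line), literal_only)
```

Note that 'Mary$' finds nothing, because the text ends with "go." rather than "Mary".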
Matching and Searching
Regular expressions are in the "re" module.
match(pattern, text): determines whether the pattern matches at the beginning of the text. Returns either None or a Match object.
search(pattern, text): determines whether the pattern occurs anywhere in the text. Returns either None or a Match object.
findall(pattern, text): returns a list of all non-overlapping substrings of the text that match the pattern
finditer(pattern, text): returns an iterator of Match objects, one for each non-overlapping match
>>> import re
>>> re.match("c", "abcdef")  # No match
>>> re.match("a", "abcdef")  # Match
>>> re.search("c", "abcdef")  # Match
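The four functions can be compared side by side on one made-up string, as a quick sketch:

```python
import re

text = "abcabc"

at_start = re.match("c", text)                              # None: text does not start with c
first_hit = re.search("c", text)                            # Match object for the first c
all_hits = re.findall("c", text)                            # list of matching substrings
positions = [m.start() for m in re.finditer("c", text)]     # one Match object per hit

print(at_start, first_hit.start(), all_hits, positions)
# None 2 ['c', 'c'] [2, 5]
```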
Match Objects
Match objects support the following methods:
start(): returns the index of the start of the match
end(): returns the index just past the end of the match
groups(): returns a tuple of the group matches
group(n): returns the nth group match. If n = 0, returns the entire match.
>>> m = re.match(r"(\w+) (\w+)", "Lazy hands make for poverty,")
>>> m.group(0)  # The entire match
'Lazy hands'
>>> m.group(1)  # The first parenthesized subgroup.
'Lazy'
>>> m.group(2)  # The second parenthesized subgroup.
'hands'
>>> m.group(1, 2)  # Multiple arguments give us a tuple.
('Lazy', 'hands')
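The position methods can be sketched on the same example. start() and end() also accept a group number, which gives the position of that subgroup.

```python
import re

m = re.match(r"(\w+) (\w+)", "Lazy hands make for poverty,")

whole = (m.start(), m.end())        # position of the entire match
second = (m.start(2), m.end(2))     # position of the second subgroup
parts = m.groups()                  # all subgroups as a tuple

print(whole, second, parts)
# (0, 10) (5, 10) ('Lazy', 'hands')
```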
Example: Phone Numbers
Consider creating a regular expression to match phone numbers. The phone numbers can take on the following forms:
800-555-1212
800 555 1212
800.555.1212
(800) 555-1212
1-800-555-1212
800-555-1212-1234
800-555-1212x1234
800-555-1212 ext. 1234
1-(800) 555.1212 #1234
Example: Phone Numbers
It is good to define a test for your code prior to writing the code.
Consider testing our pattern against the examples on the previous slide.
import re

def test(regex):
    # Try the pattern against each sample and print what re.match returns.
    tests = ['800-555-1212', '800 555 1212', '800.555.1212',
             '(800) 555-1212', '800-555-1212', '800-555-1212-1234',
             '800-555-1212x1234', '800-555-1212 ext. 1234',
             '1-(800) 555.1212 #1234']
    for number in tests:
        print(number + ": ", re.match(regex, number))
Example: Phone Numbers
The previous phone numbers had only four components:
area code
trunk (first three digits)
rest (next 4 digits)
extension (last digits; may be between 1 and 4 in length)
Consider defining these parts with regular expressions:
area code would be \d{3}
trunk would be \d{3}
rest would be \d{4}
extension would be \d{1,4}
Example: Phone Numbers
Consider the following for phone numbers: area code-trunk-rest-extension
\d{3}-\d{3}-\d{4}-\d{1,4}
>>> test(r'\d{3}-\d{3}-\d{4}-\d{1,4}')
800-555-1212: None
800 555 1212: None
800.555.1212: None
(800) 555-1212: None
800-555-1212: None
800-555-1212-1234: <_sre.SRE_Match object at 0x0000000002C06920>
800-555-1212x1234: None
800-555-1212 ext. 1234: None
1-(800) 555.1212 #1234: None
Example: Phone Numbers
How to modify our regex to say that extensions are optional?
\d{3}-\d{3}-\d{4}(-\d{1,4})?
>>> test(r'\d{3}-\d{3}-\d{4}(-\d{1,4})?')
800-555-1212: <_sre.SRE_Match object at 0x0000000002C4D198>
800 555 1212: None
800.555.1212: None
(800) 555-1212: None
800-555-1212: <_sre.SRE_Match object at 0x0000000002C4D198>
800-555-1212-1234: <_sre.SRE_Match object at 0x0000000002C4D198>
800-555-1212x1234: <_sre.SRE_Match object at 0x0000000002C4D198>
800-555-1212 ext. 1234: <_sre.SRE_Match object at 0x0000000002C4D198>
1-(800) 555.1212 #1234: None
Example: Phone Numbers
How to handle different separators? Consider the following test cases especially:
800-555-1212
800 555 1212
800.555.1212
(800) 555-1212
800-555-1212x1234
800-555-1212 ext. 1234
1-(800) 555.1212 #1234
Let's say that a separator is any number of non-digit characters, possibly none: \D*
Example: Phone Numbers
How to modify our regex to deal with separators?
Before: \d{3}-\d{3}-\d{4}(-\d{1,4})?
After: \d{3}\D*\d{3}\D*\d{4}\D*(\d{1,4})?
>>> test(r'\d{3}\D*\d{3}\D*\d{4}\D*(\d{1,4})?')
800-555-1212: <_sre.SRE_Match object at 0x0000000002C4D198>
800 555 1212: <_sre.SRE_Match object at 0x0000000002C4D198>
800.555.1212: <_sre.SRE_Match object at 0x0000000002C4D198>
(800) 555-1212: None
800-555-1212: <_sre.SRE_Match object at 0x0000000002C4D198>
800-555-1212-1234: <_sre.SRE_Match object at 0x0000000002C4D198>
800-555-1212x1234: <_sre.SRE_Match object at 0x0000000002C4D198>
800-555-1212 ext. 1234: <_sre.SRE_Match object at 0x0000000002C4D198>
1-(800) 555.1212 #1234: None
Example: Phone Numbers
How to get the four parts from sub-groups? Let's first modify our test routine.
def test(regex):
    tests = ['800-555-1212', '800 555 1212', '800.555.1212',
             '(800) 555-1212', '800-555-1212', '800-555-1212-1234',
             '800-555-1212x1234', '800-555-1212 ext. 1234',
             '1-(800) 555.1212 #1234']
    for test in tests:
        match = re.match(regex, test)
        if match:
            print(test + ': ', match.groups())
        else:
            print(test + ': ', None)
Example: Phone Numbers
How to get the four parts from sub-groups?
(\d{3})\D*(\d{3})\D*(\d{4})\D*(\d{1,4})?
>>> test(r'(\d{3})\D*(\d{3})\D*(\d{4})\D*(\d{1,4})?')
800-555-1212: ('800', '555', '1212', None)
800 555 1212: ('800', '555', '1212', None)
800.555.1212: ('800', '555', '1212', None)
(800) 555-1212: None
800-555-1212: ('800', '555', '1212', None)
800-555-1212-1234: ('800', '555', '1212', '1234')
800-555-1212x1234: ('800', '555', '1212', '1234')
800-555-1212 ext. 1234: ('800', '555', '1212', '1234')
1-(800) 555.1212 #1234: None
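The pattern above still rejects the two samples that begin with "(" or with a leading "1". One possible extension, which is not from the slides, allows an optional leading 1 and leading non-digits, and uses re.fullmatch so that trailing junk is rejected. The names PHONE and parse_phone are hypothetical, and (?:...) is a non-capturing group, so the four captured parts stay the same.

```python
import re

# Assumed extension: optional "1" prefix, optional leading non-digits,
# then the same four captured parts as before.
PHONE = r'(?:1\D*)?\D*(\d{3})\D*(\d{3})\D*(\d{4})\D*(\d{1,4})?'

def parse_phone(text):
    """Return (area code, trunk, rest, extension) or None if text does not fit."""
    match = re.fullmatch(PHONE, text)
    return match.groups() if match else None

for sample in ['(800) 555-1212', '1-800-555-1212',
               '1-(800) 555.1212 #1234', '800-555-121']:
    print(sample + ':', parse_phone(sample))
```

The last sample has only nine digits and is rejected.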
Splitting
Consider reading words from a text file. In the past we have split lines on whitespace. This is not a thorough splitting.
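Consider a text file having punctuation symbols and separators:
The words of the Teacher, son of David, king in Jerusalem:
"Meaningless! Meaningless!" says the Teacher.
Splitting on whitespace leaves the punctuation attached to the words:
# filename is a placeholder for the path of the text file
with open(filename, "r") as file:
    for line in file:
        for word in line.split():
            print(word)
Splitting on runs of non-word characters removes the punctuation:
import re
with open(filename, "r") as file:
    for line in file:
        for word in re.split(r"\W+", line):
            print(word)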
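The two ways of splitting can be compared on a single line held in a string, so no file is needed for the sketch. One detail worth noticing: when the line starts or ends with punctuation, re.split returns empty strings at the edges, which the last line filters out.

```python
import re

line = '"Meaningless! Meaningless!" says the Teacher.'

by_whitespace = line.split()
by_nonword = re.split(r"\W+", line)
words = [w for w in by_nonword if w]   # drop the empty edge strings

print(by_whitespace)
print(by_nonword)
print(words)
```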