python: regular expressions


Upload: edwin-copeland

Post on 16-Dec-2015


Page 1: Python: Regular Expressions

Python: Regular Expressions

http://www.flickr.com/photos/iamthestig2/3925864142/sizes/l/in/photostream/

Page 2: Python: Regular Expressions

Patterns (Regular Expressions)

Patterns are a very useful technique for processing textual data. A pattern defines a set of strings. The fundamental operation is set membership: given a string S, we can ask whether S is a member of the set defined by some pattern P.
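
The set-membership question can be sketched in Python roughly as follows. The helper name is_member is hypothetical (it does not come from the slides), and re.fullmatch is used on the assumption that "membership" means the whole string must match the pattern.

```python
import re

def is_member(pattern: str, s: str) -> bool:
    """Return True if the entire string s belongs to the set defined by pattern."""
    # re.fullmatch succeeds only when the pattern accounts for every character of s.
    return re.fullmatch(pattern, s) is not None

print(is_member("XOXO", "XOXO"))   # True
print(is_member("XOXO", "XOXOX"))  # False
```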

Page 3: Python: Regular Expressions

Patterns (Regular Expressions)

Page 4: Python: Regular Expressions

Case Study : Hugs and Kisses

A fixed pattern is one with no variability. A hugs and kisses pattern is an example. The hugs-and-kisses pattern:

XOXO

Page 5: Python: Regular Expressions

Case Study : MPAA Ratings

There are 5 MPAA ratings: G, PG, PG-13, R, NC-17

The MPAA rating pattern: G|PG|PG-13|R|NC-17
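
A minimal sketch of testing strings against the MPAA pattern with Python's re module. The names MPAA and is_rating are illustrative assumptions. The last line shows one thing worth knowing about alternation in Python: alternatives are tried left to right, so a prefix match on "PG-13" stops at "PG", while a whole-string match still succeeds.

```python
import re

MPAA = r"G|PG|PG-13|R|NC-17"

def is_rating(s: str) -> bool:
    """True if s is exactly one of the five ratings."""
    return re.fullmatch(MPAA, s) is not None

print(all(is_rating(r) for r in ["G", "PG", "PG-13", "R", "NC-17"]))  # True
print(is_rating("PG-14"))                                             # False
# re.match only anchors at the start and takes the first alternative that works.
print(re.match(MPAA, "PG-13").group())                                # PG
```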

Page 6: Python: Regular Expressions

Case Study : SSN

A social security number can be understood as any 3 digits followed by a dash (-) followed by any 2 digits followed by a dash followed by any 4 digits.

Page 7: Python: Regular Expressions

The language of regular expressions

Here is an inductive definition of the syntax of the basic elements of a regular expression. Any single character is a regular expression. If A and B are both regular expressions, then so are:

AB : this represents A followed by B; concatenation
A|B : this represents A or B; the vertical bar is special; alternation
(A) : this represents a group; the parens are special

Page 8: Python: Regular Expressions

Examples

Do the following match the regex '(c|h)a?rt*'? hart, cat, car, chart, chaarrtt, hrtttt

Do the following match the regex '(x|y)*'? x, xy, xxyxyyx
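
One way to check these by machine, assuming "match" means the entire string must match. The helper name matches is hypothetical.

```python
import re

def matches(pattern, candidates):
    """Map each candidate to whether the whole string matches pattern."""
    return {s: re.fullmatch(pattern, s) is not None for s in candidates}

print(matches(r"(c|h)a?rt*", ["hart", "cat", "car", "chart", "chaarrtt", "hrtttt"]))
# {'hart': True, 'cat': False, 'car': True, 'chart': False, 'chaarrtt': False, 'hrtttt': True}
print(matches(r"(x|y)*", ["x", "xy", "xxyxyyx"]))
# {'x': True, 'xy': True, 'xxyxyyx': True}
```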

Page 9: Python: Regular Expressions

Case Study : SSN

A social security number can be understood as any 3 digits followed by a dash (-) followed by any 2 digits followed by a dash followed by any 4 digits.

Page 10: Python: Regular Expressions

Repetition

Patterns often include a notion of repetition. Notations are introduced to control repetition.
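
The slide does not list the notations themselves. As a reference sketch, these are the standard repetition operators in Python's re module: * (zero or more), + (one or more), ? (zero or one), {n} (exactly n), and {m,n} (between m and n). The example strings and the helper name whole are made up for illustration.

```python
import re

def whole(pattern: str, s: str) -> bool:
    """True if the entire string s matches pattern."""
    return re.fullmatch(pattern, s) is not None

print(whole(r"ab*", "a"))        # True:  * allows zero b's
print(whole(r"ab+", "a"))        # False: + needs at least one b
print(whole(r"ab?", "abb"))      # False: ? allows at most one b
print(whole(r"a{3}", "aaa"))     # True:  exactly three a's
print(whole(r"a{2,3}", "aaaa"))  # False: four a's is more than the maximum of three
```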

Page 11: Python: Regular Expressions

Case Study : SSN

We can simplify our SSN pattern

Page 12: Python: Regular Expressions

Case Study : Hugs and Kisses Two

Consider a hugs and kisses pattern that includes any string composed of one or more XO pairs:
XO
XOXO
XOXOXO

Write the following patterns:
A binary string that represents an odd number
A binary string that contains at least 3 consecutive 1's
A binary string that contains no more than 3 consecutive 1's
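
One possible set of answers, offered as a sketch and not as the slides' official solutions. The constant names and the helper full are hypothetical, and every pattern is assumed to apply to the whole string.

```python
import re

HUGS_AND_KISSES = r"(XO)+"                   # one or more XO pairs
ODD_BINARY = r"(0|1)*1"                      # an odd binary number ends in 1
AT_LEAST_THREE_ONES = r"(0|1)*111(0|1)*"     # 111 appears somewhere
AT_MOST_THREE_ONES = r"(1{0,3}0)*1{0,3}"     # every run of 1's has length <= 3

def full(pattern: str, s: str) -> bool:
    """True if the entire string s matches pattern."""
    return re.fullmatch(pattern, s) is not None

print(full(HUGS_AND_KISSES, "XOXOXO"))       # True
print(full(ODD_BINARY, "1011"))              # True
print(full(AT_LEAST_THREE_ONES, "0101"))     # False
print(full(AT_MOST_THREE_ONES, "0111101"))   # False
```

The last pattern reads as "zero or more blocks of up to three 1's followed by a 0, then up to three trailing 1's", which is why no run of four 1's can slip through.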

Page 13: Python: Regular Expressions

Character Classes

A character class is a pattern that concisely defines a set of characters.

The term digit, for example, names a character class since a digit is defined as the set of the 10 characters 0, 1, 2, 3, 4, 5, 6, 7, 8, and 9.

Additional pattern writing rules allow us to define our own character classes and also provide several pre-defined commonly-used character classes.

Page 14: Python: Regular Expressions

Regular Expressions and Character classes

Square brackets denote a set of characters. Set members are listed explicitly.
[abc] will match an 'a', 'b', or 'c'
[uwl] will match a 'u', 'w', or 'l'

Most special characters are not special in character classes.
[.a*] will match a '.', 'a', or '*'

There are, however, two special characters that are still special in character classes. (The backslash also keeps its escaping role.)
- : denotes a range meaning left-through-right
^ : occurs at the beginning and denotes logical set negation

Examples
[a-c] will match an a or b or c and nothing else
[a-z] will match any lowercase alphabetic symbol
[a-zA-Z0-9] will match any alphanumeric symbol
[^a] will match anything but lowercase a
[^0-9] will match anything but a digit character
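
A short sketch of these classes in action, using re.findall to list every character a class picks out. The sample strings are made up for illustration.

```python
import re

text = "a.b*c-Z9"

print(re.findall(r"[.a*]", text))     # ['a', '.', '*']   . and * are literal here
print(re.findall(r"[a-c]", text))     # ['a', 'b', 'c']   range a through c
print(re.findall(r"[^a-z]", text))    # ['.', '*', '-', 'Z', '9']   negated class
print(re.findall(r"[^0-9]", "r2d2"))  # ['r', 'd']
```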

Page 15: Python: Regular Expressions

Examples

Do the following match the regex '[a-z][0-9]*'? abc, 1z93, a-9

Do the following match the regex '[0-9]*[^02468]'? 03, 999, 354

Give a regex for social security numbers: [0-9]{3}-[0-9]{2}-[0-9]{4}
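
The answer depends on what "match" means. The sketch below reports, for each candidate, whether the whole string matches and which prefix (if any) matches from the start. The helper name check is hypothetical, and the sample SSN is made up.

```python
import re

def check(pattern, candidates):
    """Return {candidate: (whole_string_matches, matched_prefix_or_None)}."""
    results = {}
    for s in candidates:
        whole = re.fullmatch(pattern, s) is not None
        m = re.match(pattern, s)  # anchors at the start only
        results[s] = (whole, m.group() if m else None)
    return results

print(check(r"[a-z][0-9]*", ["abc", "1z93", "a-9"]))
# {'abc': (False, 'a'), '1z93': (False, None), 'a-9': (False, 'a')}
print(check(r"[0-9]*[^02468]", ["03", "999", "354"]))
# {'03': (True, '03'), '999': (True, '999'), '354': (False, '35')}
print(check(r"[0-9]{3}-[0-9]{2}-[0-9]{4}", ["123-45-6789"]))
# {'123-45-6789': (True, '123-45-6789')}
```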

Page 16: Python: Regular Expressions

Predefined classes

Some character classes are common and have shorthand definitions:
\d : matches any decimal digit; equivalent to [0-9]
\D : matches any non-digit character; equivalent to [^0-9]
\w : matches any 'word' character; equivalent to [a-zA-Z0-9_]
\W : matches any non-word character; equivalent to [^a-zA-Z0-9_]
\s : matches any whitespace character (space, tab, newline); equivalent to [ \t\n\r\f\v]
\S : matches any character that is not whitespace; equivalent to [^ \t\n\r\f\v]

These equivalences hold for ASCII text; on Unicode strings Python's classes also match non-ASCII digits, letters, and whitespace.

Give a regex for social security numbers \d{3}-\d{2}-\d{4}
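
A brief sketch of the shorthand classes on a made-up sample line. Note that the underscore counts as a word character in Python's re module.

```python
import re

line = "ID 42:\tuser_name"

print(re.findall(r"\d", line))    # ['4', '2']
print(re.findall(r"\w+", line))   # ['ID', '42', 'user_name']
print(re.findall(r"\s", line))    # [' ', '\t']
print(re.findall(r"\W", line))    # [' ', ':', '\t']
print(re.fullmatch(r"\d{3}-\d{2}-\d{4}", "123-45-6789") is not None)  # True
```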

Page 17: Python: Regular Expressions

Predefined classes

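There are two 'positional' matches:
$ : matches the end of a string or just before a newline at the end of the string
^ : matches the start of a string

In multiline mode (the re.MULTILINE flag), $ also matches before each newline and ^ also matches right after each newline.

What do the following mean?
^.*s$
^\s.*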
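
A sketch of how the two patterns behave on a made-up three-line string. It assumes Python's re module, where ^ and $ apply to every line only when the re.MULTILINE flag is passed.

```python
import re

text = "cats\n dogs\nbird"

# ^.*s$ : a line (or whole string) that ends in 's'
print(re.findall(r"^.*s$", text, re.MULTILINE))   # ['cats', ' dogs']
# ^\s.* : a line (or whole string) that begins with a whitespace character
print(re.findall(r"^\s.*", text, re.MULTILINE))   # [' dogs']
# Without MULTILINE, ^ and $ refer to the string as a whole, so nothing matches here.
print(re.search(r"^.*s$", text))                  # None
# $ still matches just before a single trailing newline.
print(re.search(r"^.*s$", "cats\n").group())      # cats
```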

Page 18: Python: Regular Expressions

Finding patterns in text (Examples)

'a'

Mary had a little lamb. And everywhere that Mary went, the lamb was sure to go.

'Mary'

Mary had a little lamb. And everywhere that Mary went, the lamb was sure to go.

'.*'

Special characters must be escaped.*

'\.\*'

Special characters must be escaped.*

http://www.duke.edu/~dgraham/ETM/LearningtoUseRegularExpressions.html
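
The difference between the unescaped and escaped patterns can be seen with a short sketch. re.escape is a standard-library helper that adds the backslashes for you.

```python
import re

s = "Special characters must be escaped.*"

# Unescaped: '.' means any character and '*' means repetition, so the whole line matches.
print(re.search(r".*", s).group() == s)   # True
# Escaped: matches only the literal two characters '.*' at the end.
print(re.search(r"\.\*", s).group())      # .*
print(re.search(r"\.\*", s).span())       # (34, 36)
# re.escape builds the escaped pattern from a literal string.
print(re.escape(".*"))                    # \.\*
```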

Page 19: Python: Regular Expressions

Finding patterns in text (Examples)

'^Mary'

Mary had a little lamb. And everywhere that Mary went, the lamb was sure to go.

'Mary$'

Mary had a little lamb. And everywhere that Mary went, the lamb was sure to go.

'.a'

Mary had a little lamb. And everywhere that Mary went, the lamb was sure to go.

'[a-z]a'

Mary had a little lamb. And everywhere that Mary went, the lamb was sure to go.

'[^a-z]a'

Mary had a little lamb. And everywhere that Mary went, the lamb was sure to go.

'to|go|the'

Mary had a little lamb. And everywhere that Mary went, the lamb was sure to go.

http://www.duke.edu/~dgraham/ETM/LearningtoUseRegularExpressions.html
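
To see which parts of the sentence each pattern picks out, a sketch along these lines may help. The helper name find_hits is hypothetical; it uses re.finditer to report each match together with its starting index.

```python
import re

text = ("Mary had a little lamb. And everywhere that Mary went, "
        "the lamb was sure to go.")

def find_hits(pattern: str, s: str):
    """Return a list of (start_index, matched_text) for every match of pattern in s."""
    return [(m.start(), m.group()) for m in re.finditer(pattern, s)]

for pattern in [r"Mary", r"^Mary", r"Mary$", r"[^a-z]a", r"to|go|the"]:
    print(f"{pattern!r:12} -> {find_hits(pattern, text)}")
```

For this sentence, 'Mary' is found twice, '^Mary' only at the start, and 'Mary$' not at all because the text ends with "go." and not with "Mary".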

Page 20: Python: Regular Expressions

Matching and Searching

Regular expressions are in the "re" module.

re.match(pattern, text): determines whether the pattern matches at the beginning of the text. Returns either None or a Match object.

re.search(pattern, text): determines whether the pattern occurs anywhere in the text. Returns either None or a Match object.

re.findall(pattern, text): returns a list of all non-overlapping substrings of the text that match the pattern (if the pattern contains groups, the groups are returned instead).

re.finditer(pattern, text): returns an iterator of Match objects, one for each non-overlapping match.

>>> import re
>>> re.match("c", "abcdef")   # No match
>>> re.match("a", "abcdef")   # Match
>>> re.search("c", "abcdef")  # Match
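
A compact sketch contrasting the four functions on one made-up string:

```python
import re

text = "abcdef abc"

print(re.match("c", text))                           # None: 'c' is not at the start
print(re.search("c", text).span())                   # (2, 3): first 'c' anywhere
print(re.findall("[abc]+", text))                    # ['abc', 'abc']: list of strings
print([m.span() for m in re.finditer("abc", text)])  # [(0, 3), (7, 10)]: Match objects
```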

Page 21: Python: Regular Expressions

Match Objects

Match objects support the following methods:
start(): returns the index of the start of the match
end(): returns the index just past the end of the match
groups(): returns a tuple of the group matches
group(n): returns the nth group match. If n = 0, returns the entire match.

>>> m = re.match(r"(\w+) (\w+)", "Lazy hands make for poverty,")
>>> m.group(0)       # The entire match
'Lazy hands'
>>> m.group(1)       # The first parenthesized subgroup.
'Lazy'
>>> m.group(2)       # The second parenthesized subgroup.
'hands'
>>> m.group(1, 2)    # Multiple arguments give us a tuple.
('Lazy', 'hands')
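
The positional methods can be sketched on the same example. start() and end() also accept a group number, which is standard re behavior.

```python
import re

sentence = "Lazy hands make for poverty,"
m = re.match(r"(\w+) (\w+)", sentence)

print(m.start(), m.end())            # 0 10
print(sentence[m.start():m.end()])   # Lazy hands  (end() is exclusive, like a slice)
print(m.groups())                    # ('Lazy', 'hands')
print(m.start(2), m.end(2))          # 5 10  (position of the second group)
```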

Page 22: Python: Regular Expressions

Example: Phone Numbers

Consider creating a regular expression to match phone numbers. The phone numbers can take on the following forms:
800-555-1212
800 555 1212
800.555.1212
(800) 555-1212
1-800-555-1212
800-555-1212-1234
800-555-1212x1234
800-555-1212 ext. 1234
1-(800) 555.1212 #1234

Page 23: Python: Regular Expressions

Example: Phone Numbers

It is good to define a test for your code prior to writing the code.

Consider testing our pattern against the examples on the previous slide.

import re

def test(regex):
    tests = ['800-555-1212', '800 555 1212', '800.555.1212',
             '(800) 555-1212', '800-555-1212', '800-555-1212-1234',
             '800-555-1212x1234', '800-555-1212 ext. 1234',
             '1-(800) 555.1212 #1234']

    # Print each test number next to the result of matching it from the start.
    for number in tests:
        print(number + ": ", re.match(regex, number))

Page 24: Python: Regular Expressions

Example: Phone Numbers

The previous phone numbers had only four components:
area code
trunk (first three digits)
rest (next 4 digits)
extension (last digits; may be between 1 and 4 in length)

Consider defining these parts with regular expressions:
area code would be \d{3}
trunk would be \d{3}
rest would be \d{4}
extension would be \d{1,4}

Page 25: Python: Regular Expressions

Example: Phone Numbers

Consider the following for phone numbers:
area code-trunk-rest-extension
\d{3}-\d{3}-\d{4}-\d{1,4}

>>> test(r'\d{3}-\d{3}-\d{4}-\d{1,4}')

800-555-1212: None
800 555 1212: None
800.555.1212: None
(800) 555-1212: None
800-555-1212: None
800-555-1212-1234: <_sre.SRE_Match object at 0x0000000002C06920>
800-555-1212x1234: None
800-555-1212 ext. 1234: None
1-(800) 555.1212 #1234: None

Page 26: Python: Regular Expressions

Example: Phone Numbers

How to modify our regex to say that extensions are optional?
\d{3}-\d{3}-\d{4}(-\d{1,4})?

>>> test(r'\d{3}-\d{3}-\d{4}(-\d{1,4})?')

800-555-1212: <_sre.SRE_Match object at 0x0000000002C4D198>
800 555 1212: None
800.555.1212: None
(800) 555-1212: None
800-555-1212: <_sre.SRE_Match object at 0x0000000002C4D198>
800-555-1212-1234: <_sre.SRE_Match object at 0x0000000002C4D198>
800-555-1212x1234: <_sre.SRE_Match object at 0x0000000002C4D198>
800-555-1212 ext. 1234: <_sre.SRE_Match object at 0x0000000002C4D198>
1-(800) 555.1212 #1234: None
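
Note that '800-555-1212x1234' and '800-555-1212 ext. 1234' report a match even though the pattern knows nothing about 'x' or 'ext.'. That happens because re.match only requires a match at the beginning of the text. The sketch below makes the matched prefix visible; the helper name prefix_and_whole is hypothetical.

```python
import re

PATTERN = r"\d{3}-\d{3}-\d{4}(-\d{1,4})?"

def prefix_and_whole(number: str):
    """Return (text matched from the start or None, whether the whole string matches)."""
    m = re.match(PATTERN, number)
    return (m.group() if m else None, re.fullmatch(PATTERN, number) is not None)

print(prefix_and_whole("800-555-1212-1234"))       # ('800-555-1212-1234', True)
print(prefix_and_whole("800-555-1212x1234"))       # ('800-555-1212', False)
print(prefix_and_whole("800-555-1212 ext. 1234"))  # ('800-555-1212', False)
```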

Page 27: Python: Regular Expressions

Example: Phone Numbers

How to handle different separators? Consider the following test cases especially:
800-555-1212
800 555 1212
800.555.1212
(800) 555-1212
800-555-1212x1234
800-555-1212 ext. 1234
1-(800) 555.1212 #1234

Let's say that a separator is any number (possibly zero) of non-digit characters.

Page 28: Python: Regular Expressions

Example: Phone Numbers

How to modify our regex to deal with separators?
\d{3}-\d{3}-\d{4}(-\d{1,4})?
\d{3}\D*\d{3}\D*\d{4}\D*(\d{1,4})?

>>> test(r'\d{3}\D*\d{3}\D*\d{4}\D*(\d{1,4})?')

800-555-1212: <_sre.SRE_Match object at 0x0000000002C4D198>
800 555 1212: <_sre.SRE_Match object at 0x0000000002C4D198>
800.555.1212: <_sre.SRE_Match object at 0x0000000002C4D198>
(800) 555-1212: None
800-555-1212: <_sre.SRE_Match object at 0x0000000002C4D198>
800-555-1212-1234: <_sre.SRE_Match object at 0x0000000002C4D198>
800-555-1212x1234: <_sre.SRE_Match object at 0x0000000002C4D198>
800-555-1212 ext. 1234: <_sre.SRE_Match object at 0x0000000002C4D198>
1-(800) 555.1212 #1234: None

Page 29: Python: Regular Expressions

Example: Phone Numbers

How to get the four parts from sub-groups? Let's first modify our test routine.

def test(regex):
    tests = ['800-555-1212', '800 555 1212', '800.555.1212',
             '(800) 555-1212', '800-555-1212', '800-555-1212-1234',
             '800-555-1212x1234', '800-555-1212 ext. 1234',
             '1-(800) 555.1212 #1234']

    # Print the captured sub-groups for each number, or None when there is no match.
    for number in tests:
        match = re.match(regex, number)
        if match:
            print(number + ': ', match.groups())
        else:
            print(number + ': ', None)

Page 30: Python: Regular Expressions

Example: Phone Numbers

How to get the four parts from sub-groups?
(\d{3})\D*(\d{3})\D*(\d{4})\D*(\d{1,4})?

>>> test(r'(\d{3})\D*(\d{3})\D*(\d{4})\D*(\d{1,4})?')

800-555-1212: ('800', '555', '1212', None)
800 555 1212: ('800', '555', '1212', None)
800.555.1212: ('800', '555', '1212', None)
(800) 555-1212: None
800-555-1212: ('800', '555', '1212', None)
800-555-1212-1234: ('800', '555', '1212', '1234')
800-555-1212x1234: ('800', '555', '1212', '1234')
800-555-1212 ext. 1234: ('800', '555', '1212', '1234')
1-(800) 555.1212 #1234: None
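
Two cases still fail because the text starts with '(' or '1-' and re.match insists on matching at the very beginning. One possible way forward, which the slides do not show, is to use re.search so the pattern may begin anywhere in the text. The names PHONE and parse_phone are hypothetical.

```python
import re

PHONE = re.compile(r"(\d{3})\D*(\d{3})\D*(\d{4})\D*(\d{1,4})?")

def parse_phone(text: str):
    """Return (area code, trunk, rest, extension) or None.

    Uses search, so leading characters such as '(' or '1-' are skipped.
    """
    m = PHONE.search(text)
    return m.groups() if m else None

print(parse_phone("(800) 555-1212"))          # ('800', '555', '1212', None)
print(parse_phone("1-(800) 555.1212 #1234"))  # ('800', '555', '1212', '1234')
print(parse_phone("800-555-1212"))            # ('800', '555', '1212', None)
```

The trade-off is that search is more permissive: it will also pull a number out of surrounding junk, which may or may not be what an application wants.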

Page 31: Python: Regular Expressions

Splitting

Consider reading words from a text file. In the past we have split lines on whitespace. This is not a thorough splitting.

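Consider a text file having punctuation symbols and separators:

The words of the Teacher, son of David, king in Jerusalem:
"Meaningless! Meaningless!" says the Teacher.

import re

file = open(filename, "r")   # filename: path to the text file

# Splitting on whitespace leaves punctuation attached to the words.
for line in file:
    for word in line.split():
        print(word)

file.seek(0)   # rewind before reading the file a second time

# Splitting on runs of non-word characters drops the punctuation.
for line in file:
    for word in re.split(r"\W+", line):
        print(word)

file.close()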
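
A sketch comparing the two splitting approaches on one line of the sample text. One detail to be aware of: re.split produces empty strings when the line begins or ends with a separator, so they are filtered out here. re.findall(r"\w+", ...) is shown as an alternative that avoids the empty strings altogether.

```python
import re

line = '"Meaningless! Meaningless!" says the Teacher.'

print(line.split())
# ['"Meaningless!', 'Meaningless!"', 'says', 'the', 'Teacher.']
print(re.split(r"\W+", line))
# ['', 'Meaningless', 'Meaningless', 'says', 'the', 'Teacher', '']
print([word for word in re.split(r"\W+", line) if word])
# ['Meaningless', 'Meaningless', 'says', 'the', 'Teacher']
print(re.findall(r"\w+", line))
# ['Meaningless', 'Meaningless', 'says', 'the', 'Teacher']
```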