1
Python & Pattern Matching with Regular Expressions (REs)
OPIM 101
File: PythonREs.ppt
2
Foresight
• Pattern matching
– Literal
– With metacharacters
• Regular expressions (REs)
• Using REs in Python
3
Consider: dir by Itself
D:\athomepc\day\idt>dir
 Volume in drive D has no label
 Volume Serial Number is 3E4B-1609
 Directory of D:\athomepc\day\idt
.              <DIR>        01-01-02   8:16a .
..             <DIR>        01-01-02   8:16a ..
SPRING~1 PDF       180,072  01-01-02   8:17a spring02idtfront.pdf
SPRING~2 PDF       241,542  01-01-02   8:19a spring02idtpartI.pdf
SPRING~3 PDF     1,246,514  01-01-02   8:20a spring02idtpartII.pdf
SPRING~4 PDF     2,517,343  01-01-02   8:22a spring02idtpartIII.pdf
SPRING~5 PDF     3,469,138  01-01-02   8:24a spring02idtpartIV.pdf
CASE1-~1 DOC        35,328  01-01-02   8:42a case1-python.doc
LECTUR~1 PPT        78,336  01-01-02   9:45a lecture01fall01.ppt
PYTHON~1 PPT        34,816  01-01-02   9:46a Python_Intro.ppt
PYTHON~2 PPT        37,376  01-01-02   9:46a Python_Structures.ppt
LECTUR~2 PPT       154,112  01-01-02  11:51a lecture01spring02.ppt
PYTHON~3 PPT        34,816  01-01-02  11:52a PythonREs.ppt
        11 file(s)      8,029,393 bytes
         2 dir(s)       1,209.06 MB free
D:\athomepc\day\idt>
4
Now: dir with a Literal Search
D:\athomepc\day\idt>dir case1-python.doc
 Volume in drive D has no label
 Volume Serial Number is 3E4B-1609
 Directory of D:\athomepc\day\idt
CASE1-~1 DOC        35,328  01-01-02   8:42a case1-python.doc
         1 file(s)         35,328 bytes
         0 dir(s)       1,209.06 MB free
D:\athomepc\day\idt>
5
Now: dir with “*”
D:\athomepc\day\idt>dir *.doc
 Volume in drive D has no label
 Volume Serial Number is 3E4B-1609
 Directory of D:\athomepc\day\idt
CASE1-~1 DOC        35,328  01-01-02   8:42a case1-python.doc
         1 file(s)         35,328 bytes
         0 dir(s)       1,209.06 MB free
D:\athomepc\day\idt>
6
Literal vs. Pattern Searches
• dir myfile.doc
– Searches literally, for an exact match with “myfile.doc”
• dir my*.doc
– Does a pattern search. Matches any file whose name begins with “my”, followed by 0 or more characters of any kind, followed by “.doc”
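The same literal-versus-pattern distinction can be tried out in Python. This is only a sketch: the file names are made up, and it uses the standard library's fnmatch module (shell-style wildcards) as a stand-in for dir, which has its own rules, such as ignoring case.

```python
import fnmatch

# Made-up file names, for illustration only.
names = ["myfile.doc", "mynotes.doc", "my.doc", "yourfile.doc", "myfile.txt"]

# Literal search: only an exact name matches.
literal_hits = [n for n in names if n == "myfile.doc"]

# Pattern search: "*" stands for zero or more characters of any kind.
pattern_hits = [n for n in names if fnmatch.fnmatchcase(n, "my*.doc")]

print(literal_hits)
print(pattern_hits)
```

Note that “my.doc” is among the pattern hits, because “*” is allowed to match zero characters.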
7
MetaCharacters
• dir treats “*” as a metacharacter: a character not taken literally, but as an instruction to match a certain kind of pattern (here: anything)
• The dir metacharacter scheme is very useful
8
On Beyond *
• ...and also very primitive and limited
• A step up: grep in Unix & Linux; support for RE searches in some text editors, e.g., TextPad (www.textpad.com)
• Regular expressions (REs) use a richer language and larger set of metacharacters, giving us a very powerful capability to extract information (patterns) from text
9
Python’s RE Metacharacters
• Here’s the complete list:
. ^ $ * + ? { } [ ] \ | ( )
• No use memorizing. We’ll learn by examples.
• A natural question: But what if I want to search for a pattern that contains what Python’s RE counts as metacharacters?
– Be just a little patient
10
Load Python’s re Module
>>> import re
>>> teststring = "Television is public anomie number 1."
>>> teststring
'Television is public anomie number 1.'
>>> len(teststring)
37
>>> match = re.search('anomie', teststring)
>>> match is None
False
>>> match.span()
(21, 27)
>>> teststring[21:27]
'anomie'
>>>
11
Now a Nonliteral Match
>>> match = re.search('Television', teststring)
>>> match is None
False
>>> match = re.search('television', teststring)
>>> match is None
True
>>> match = re.search('[tT]elevision', teststring)
>>> match.span()
(0, 10)
>>> teststring
'Television is public anomie number 1.'
>>>
12
Square Bracket Notation: [...]
• “[tT]” means “any one of the characters ‘t’ or ‘T’.”
• [...] is called a character class
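• Examples:
– [abc], [a-z], [A-Z]
– [^tT] not t and not T. A ^ right after the [ negates the class. ([^t^T] behaves the same here, but its second ^ is a literal caret that is also excluded.)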
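A quick way to see what a character class matches is re.findall, which returns every non-overlapping match. The sample text below is made up; this is a sketch for experimenting, not part of the original slides.

```python
import re

text = "Cab 42, taxi 7"   # made-up sample text

# [abc]: any one of the characters a, b, or c
abc_hits = re.findall(r"[abc]", text)

# [A-Z]: any one uppercase letter
upper_hits = re.findall(r"[A-Z]", text)

# [a-z]+: a run of one or more lowercase letters
lower_runs = re.findall(r"[a-z]+", text)

# [^a-zA-Z ]: any one character that is NOT a letter and NOT a space
other_hits = re.findall(r"[^a-zA-Z ]", text)

print(abc_hits, upper_hits, lower_runs, other_hits)
```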
13
Not Example ^
>>> teststring
'Television is public anomie number 1.'
>>> match = re.search('[^t^T][a-z]+', teststring)
>>> match.span()
(1, 10)
>>> teststring[1:10]
'elevision'
>>>
Note: + means “one or more of the previous”
* means “zero or more”
? means “zero or one”
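The three quantifiers can be compared side by side. In this sketch the sample words are made up, and re.fullmatch (which requires the whole string to match) is used so that the difference between the quantifiers is easy to see.

```python
import re

samples = ["ct", "cat", "caat", "caaat"]   # made-up test words

# a+ : one or more 'a' between c and t
plus_hits = [s for s in samples if re.fullmatch(r"ca+t", s)]

# a* : zero or more 'a' between c and t
star_hits = [s for s in samples if re.fullmatch(r"ca*t", s)]

# a? : zero or one 'a' between c and t
opt_hits = [s for s in samples if re.fullmatch(r"ca?t", s)]

print(plus_hits)
print(star_hits)
print(opt_hits)
```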
14
'\s\w+\.' and '\s(\w+)\.'
>>> teststring
'Television is public anomie number 1.'
>>> match = re.search(r'\s\w+\.', teststring)
>>> match.span()
(34, 37)
>>> teststring[34:37]
' 1.'
>>> match = re.search(r'\s(\w+)\.', teststring)
>>> match.span(0)
(34, 37)
>>> match.span(1)
(35, 36)
>>> teststring[35:36]
'1'
>>>
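A small sketch of how group numbers line up with parentheses. The sample sentence is made up, and \d (any digit) is an escape not covered in the slides. group(0) is the whole match; group(1) and group(2) are the parenthesized parts, counted from the left.

```python
import re

text = "Flight 714 departs at 10:55."   # made-up sample text

# Two pairs of parentheses, so two numbered groups.
match = re.search(r"(\d+):(\d+)", text)

whole = match.group(0)      # the entire match
hours = match.group(1)      # first parenthesized group
minutes = match.group(2)    # second parenthesized group
print(whole, hours, minutes)

# span(n) gives the slice positions of group n in the original string.
start, end = match.span(1)
print(text[start:end] == hours)
```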
15
[.] == \.
• Inside [...] most metacharacters are taken literally
– So, [.] == \.
• Note (again): [...] is called a character class
>>> match = re.search(r'\s(\w+)[.]', teststring)
>>> match.span()
(34, 37)
>>>
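This is the answer to the earlier question about searching for text that contains metacharacters. The sketch below uses a made-up string and compares an unescaped dot with the two escaped forms. It also shows re.escape, a function in the re module that adds the backslashes for you.

```python
import re

price = "Total: 3.50 (was 3x50)"   # made-up sample text

# Unescaped "." is a metacharacter: it matches any character.
loose = re.findall(r"3.50", price)

# Escaped with a backslash, or placed in a character class,
# "." matches only a literal period.
backslash = re.findall(r"3\.50", price)
in_class = re.findall(r"3[.]50", price)

# re.escape builds the escaped pattern from a literal string.
escaped_pattern = re.escape("3.50")
auto = re.findall(escaped_pattern, price)

print(loose, backslash, in_class, auto)
```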
16
Avoiding Greed ?
>>> newstring = '<div align="center">'
>>> newstring = newstring+'<i class="smaller">'
>>> newstring = newstring+'(As of 10:55 AM on 12/20/01)'
>>> newstring = newstring+'</i></div><br>'
>>> newstring
'<div align="center"><i class="smaller">(As of 10:55 AM on 12/20/01)</i></div><br>'
>>> match = re.search('<.+>', newstring)
>>> match.span()
(0, 81)
>>> match = re.search('<.+?>', newstring)
>>> match.group()
'<div align="center">'
>>>
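The difference shows up even more clearly with re.findall, which collects every match. This is a sketch on a made-up snippet of markup: the greedy .+ grabs as much as it can, while the non-greedy .+? stops at the first “>” it reaches.

```python
import re

html = "<b>bold</b> and <i>italic</i>"   # made-up sample markup

# Greedy: .+ runs to the LAST ">" in the string, so there is one big match.
greedy = re.findall(r"<.+>", html)

# Non-greedy: .+? stops at the FIRST ">" each time, so each tag is a match.
lazy = re.findall(r"<.+?>", html)

print(greedy)
print(lazy)
```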
17
More on Not Being Greedy
>>> match = re.search(r'<(\w).+?>(.+)</(\1)', newstring)
>>> match.groups()
('d', '<i class="smaller">(As of 10:55 AM on 12/20/01)</i>', 'd')
>>> match = re.search(r'<(\w).+?>([^<]+)</(\1)', newstring)
>>> match.groups()
('i', '(As of 10:55 AM on 12/20/01)', 'i')
>>>
\1 is called a backreference. It matches the same text that group 1 matched.
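Another common use of a backreference is spotting accidentally doubled words. This is a sketch on a made-up sentence; \b (a word boundary) is an escape not covered in the slides. Here \1 must repeat exactly what group 1 captured.

```python
import re

text = "It was the the best of of times"   # made-up sample text

# (\w+) captures a word; \1 then demands the same word again.
doubled = re.findall(r"\b(\w+) \1\b", text)

# In a replacement string, \1 inserts what group 1 captured,
# so each doubled word collapses to a single copy.
fixed = re.sub(r"\b(\w+) \1\b", r"\1", text)

print(doubled)
print(fixed)
```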
18
Concluding
• REs are a very powerful tool, very often very useful
• The language notation is compact and a bit hard to read
• Practice, study the examples, don’t worry about memorization.
19
Advice on Scripting
• Scripting, and programming in general, is a process
• Successful scripts don’t spring into existence whole
– Scripts are built in small increments
• Attend to:
– Decomposition
– Stories
– Testing
20
Advice on Scripting
• Decomposition
– Solve big problems by decomposing them into small problems and solving them
• Stories
– Scripting/programming as a form of literature
– Use comments with code to tell a clear story about what the code is or should be doing
• Testing
– Everything, whole and part, often, varying inputs
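The three pieces of advice can be sketched in one small script. Everything here is hypothetical (the function names find_time, find_date and find_timestamp, and the patterns); it is one possible way to pull the time and date out of a string like the “As of” line used earlier, not a prescribed solution.

```python
import re

# Decomposition: two small problems (time, date), then one that combines them.

def find_time(text):
    """Return the first H:MM or HH:MM time in text, or None if there is none."""
    match = re.search(r"\b\d{1,2}:\d{2}\b", text)
    return match.group(0) if match else None

def find_date(text):
    """Return the first M/D/YY style date in text, or None if there is none."""
    match = re.search(r"\b\d{1,2}/\d{1,2}/\d{2}\b", text)
    return match.group(0) if match else None

def find_timestamp(text):
    """Story: a timestamp is simply the time and the date, found separately."""
    return find_time(text), find_date(text)

# Testing: the parts and the whole, with varying inputs.
sample = "(As of 10:55 AM on 12/20/01)"
assert find_time(sample) == "10:55"
assert find_date(sample) == "12/20/01"
assert find_timestamp(sample) == ("10:55", "12/20/01")
assert find_time("no time here") is None
assert find_timestamp("") == (None, None)
print("all tests passed")
```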
21
Readings
• IDT book, chapter 8, “Text and Pattern Processing”
• Further information (but beyond the scope of 101)
– The Python online documentation on the re module
– “Regular Expression HOWTO” by A.M. Kuchling at http://py-howto.sourceforge.net/ and also at http://py-howto.sourceforge.net/regex/regex.html