Regular expressions 4 Day 9 - 9/15/14 LING 3820 & 6820 Natural Language Processing Harry Howard Tulane University


Regular expressions 4

Day 9 - 9/15/14

LING 3820 & 6820

Natural Language Processing

Harry Howard

Tulane University


Course organization

15-Sept-2014, NLP, Prof. Howard, Tulane University


http://www.tulane.edu/~howard/LING3820/

The syllabus is under construction.

http://www.tulane.edu/~howard/CompCultEN/


Review

The quiz was the review.


4.3.4. Summary table

meta-character   matches              name          notes
a|b              a or b               disjunction
(ab)             a and b              grouping      re.findall() only outputs what is in (); use (?:ab) to output the whole pattern
[ab]             a or b               range         [a-z] lowercase, [A-Z] uppercase, [0-9] digits
[^a]             all but a            negation
a{m,n}           from m to n of a     repetition    no space after the comma
a{n}             a number n of a      repetition
^a               a at start of S
a$               a at end of S
a+               one or more of a                   a+? lazy +
a*               zero or more of a    Kleene star   a*? lazy *
a?               with or without a    optionality   a?? lazy ?
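The greedy/lazy and grouping rows of the table can be tried out with a short sketch like the one below. The strings run and pairs are made up for illustration and are not part of the course materials.

```python
import re

# Greedy vs. lazy repetition on a made-up string.
run = "caaat"
greedy = re.findall(r'a+', run)        # a+ takes as many a's as it can
lazy = re.findall(r'a+?', run)         # a+? stops after the first a each time
bounded = re.findall(r'a{2,3}', run)   # from two to three a's, as many as possible

# Capturing vs. non-capturing groups on another made-up string.
pairs = "abc abd"
captured = re.findall(r'(ab)c', pairs)   # findall outputs only what is in ()
whole = re.findall(r'(?:ab)c', pairs)    # (?:ab) groups without capturing

print(greedy)    # ['aaa']
print(lazy)      # ['a', 'a', 'a']
print(bounded)   # ['aaa']
print(captured)  # ['ab']
print(whole)     # ['abc']
```

Note how the lazy version returns three separate matches where the greedy version returns one.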


§4. Regular expressions 4

There is a bit more to say.


Open Spyder


Sample string

>>> import re
>>> S = '''This above all: to thine own self be true,
... And it must follow, as the night the day,
... Thou canst not then be false to any man.'''


4.4. Character classes

class   abbreviates        name              notes
\w      [a-zA-Z0-9_]       alphanumeric      it's really alphanumeric and underscore, but we are lazy
\W      [^a-zA-Z0-9_]                        not alphanumeric
\d      [0-9]              digit
\D      [^0-9]                               not a digit
\s      [ \t\v\n\r\f]      whitespace
\S      [^ \t\v\n\r\f]                       not whitespace
\t                         horizontal tab
\v                         vertical tab
\n                         newline
\r                         carriage return
\f                         form-feed
\b                         word boundary
\B                                           not a word boundary
\A      ^                                    start of S
\Z      $                                    end of S
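As a rough illustration of the table, the sketch below applies a few of the classes to a made-up string (sample and two_lines are not from the course materials). The last two lines show one case where \A and ^ differ: with the re.MULTILINE flag, ^ also matches after each newline, while \A still matches only at the very start.

```python
import re

# A made-up string with a tab and a newline in it.
sample = "Room 42,\tfloor 3\n"

digits = re.findall(r'\d', sample)        # every single digit
spaces = re.findall(r'\s', sample)        # every whitespace character
chunks = re.findall(r'\S+', sample)       # runs of non-whitespace
words = re.findall(r'\b\w+\b', sample)    # runs of \w between word boundaries

# ^ versus \A when the MULTILINE flag is on.
two_lines = "one\ntwo"
caret = re.findall(r'^\w+', two_lines, re.MULTILINE)
start_only = re.findall(r'\A\w+', two_lines, re.MULTILINE)

print(digits)      # ['4', '2', '3']
print(spaces)      # [' ', '\t', ' ', '\n']
print(chunks)      # ['Room', '42,', 'floor', '3']
print(words)       # ['Room', '42', 'floor', '3']
print(caret)       # ['one', 'two']
print(start_only)  # ['one']
```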


4.4.2. Raw string notation with r''

Python processes a regular expression like any other string before the re module ever sees it. This can lead to unexpected results with class meta-characters, because the backslash that they incorporate is also used by Python for its own escape sequences.

For instance, we just met a class meta-character \b, which marks the edge of a word. It will be extremely useful for us, but it happens to overlap with Python's own backspace character, \b.


Raw text

The way to resolve this ambiguity is to prefix an r to the string that holds the regular expression. The r marks it as raw text, so Python leaves its backslashes alone instead of treating them as escape sequences. The previous example is augmented with the raw text notation below:

>>> re.findall(r'\b\w\w\b', S)
['to', 'be', 'it', 'as', 'be', 'to']
>>> re.findall(r'\b\w{2}\b', S)
['to', 'be', 'it', 'as', 'be', 'to']
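To see what the r prefix changes, it may help to compare the same pattern with and without it. This is only a small sketch; the shortened string phrase is an assumption made for the example.

```python
import re

phrase = "to thine own self be true"

# Without r, Python turns \b into a single backspace character
# before the re module ever sees the pattern.
plain_length = len('\b')     # one character: backspace
raw_length = len(r'\b')      # two characters: backslash and b

without_raw = re.findall('\bto\b', phrase)   # looks for backspace + 'to' + backspace
with_raw = re.findall(r'\bto\b', phrase)     # looks for 'to' between word boundaries

print(plain_length, raw_length)  # 1 2
print(without_raw)               # []
print(with_raw)                  # ['to']
```

The pattern without r finds nothing because the text contains no backspace characters.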


More raw text

As a further illustration, what do you think are the non-alphanumeric characters in the Shakespeare text?

>>> re.findall(r'\W', S)
[' ', ' ', ':', ' ', ' ', ' ', ' ', ' ', ' ', ',', '\n', ' ', ' ', ' ', ',', ' ', ' ', ' ', ' ', ' ', ',', '\n', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', '.']
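A long list of spaces is hard to read, so one possible way to summarize it is to count each character. The sketch below uses collections.Counter from the standard library, which the slides themselves do not use; it is offered only as a convenience.

```python
import re
from collections import Counter

S = '''This above all: to thine own self be true,
And it must follow, as the night the day,
Thou canst not then be false to any man.'''

# Count how often each non-alphanumeric character occurs.
counts = Counter(re.findall(r'\W', S))
total = sum(counts.values())

print(counts[' '])   # 24 spaces
print(counts[','])   # 3 commas
print(counts['\n'])  # 2 newlines
print(total)         # 31 matches in all
```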


Practice

4.3.5. Further practice of variable-length matching

4.6. Further practice (practice with answers on a different page)


§5. Lists 1

There is a bit more to say.


Introduction

In working with re.findall(), you have seen many instances of a collection of strings held within square brackets, such as the one below:

>>> S = '''This above all: to thine own self be true,

... And it must follow, as the night the day,

... Thou canst not then be false to any man.'''

>>> re.findall(r'\b[a-zA-Z]{4}\b', S)

['This', 'self', 'true', 'must', 'Thou', 'then']


Definition of list

A list in Python is a sequence of objects delimited by square brackets, []. The objects are separated by commas. Consider this sentence from Shakespeare’s A Midsummer Night’s Dream represented as a list:

>>> L = ['Love', 'looks', 'not', 'with', 'the', 'eyes', ',', 'but', 'with', 'the', 'mind', '.']
>>> type(L)
>>> type(L[0])

L is a list of strings. You may think that a string is also a list of characters, and you would be correct for ordinary English, but in pythonic English, the word ‘list’ refers exclusively to a sequence of objects delimited by square brackets.
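The distinction between a list and a string can be checked directly. The sketch below uses type(...).__name__ so that the printed names do not depend on the Python version; the variable names other than L are made up for the example.

```python
L = ['Love', 'looks', 'not', 'with', 'the', 'eyes', ',',
     'but', 'with', 'the', 'mind', '.']

kind_of_L = type(L).__name__         # the container is a list
kind_of_item = type(L[0]).__name__   # each object in it is a string

word = L[0]
word_is_list = isinstance(word, list)   # a string is not a list
letters = list(word)                    # but it can be converted into one

print(kind_of_L, kind_of_item)  # list str
print(word_is_list)             # False
print(letters)                  # ['L', 'o', 'v', 'e']
```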


An example with numerical objects

>>> i = 2
>>> type(i)
>>> I = [0,1,i,3]
>>> type(I)
>>> type(I[0])
>>> n = 2.3
>>> type(n)
>>> N = [2.0,2.1,2.2,n]
>>> type(N)
>>> type(N[0])
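For reference, the types in this example can be collected in one go. This is a minimal sketch; it prints type names rather than the full type objects, whose display differs between Python 2 and Python 3.

```python
i = 2
I = [0, 1, i, 3]
n = 2.3
N = [2.0, 2.1, 2.2, n]

# The name of the type of each object, in the order used on the slide.
names = [type(x).__name__ for x in (i, I, I[0], n, N, N[0])]

print(names)  # ['int', 'list', 'int', 'float', 'list', 'float']
```

A list is a list whatever it holds; only the type of its members changes.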


Most of the string methods work just as well on lists

>>> len(L)
>>> sorted(L)
>>> set(L)
>>> sorted(set(L))
>>> len(sorted(set(L)))
>>> L+'!'          # raises TypeError: a list can only be concatenated with another list
>>> len(L+'!')     # raises the same TypeError
>>> L*2
>>> len(L*2)
>>> L.count('the')
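The word "most" matters here: concatenating a list with a string fails. The sketch below runs the operations from this slide and catches the error so that the script keeps going; the variable names other than L are made up for the example.

```python
L = ['Love', 'looks', 'not', 'with', 'the', 'eyes', ',',
     'but', 'with', 'the', 'mind', '.']

length = len(L)                # number of objects in the list
unique_sorted = sorted(set(L)) # duplicates removed, then sorted
doubled = len(L * 2)           # repetition works as it does for strings
the_count = L.count('the')

# A list cannot be concatenated with a string, only with another list.
try:
    L + '!'
    concat_failed = False
except TypeError:
    concat_failed = True

fixed = L + ['!']              # wrap the string in a list first

print(length, len(unique_sorted), doubled, the_count)  # 12 10 24 2
print(concat_failed)                                   # True
print(fixed[-1])                                       # !
```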


String methods work on lists, cont.

>>> L.count('Love')
>>> L.count('love')
>>> L.index('with')
>>> L.rindex('with')   # raises AttributeError: lists have no rindex() method
>>> L[2:]
>>> L[:2]
>>> L[-2:]
>>> L[:-2]
>>> L[2:-2]
>>> L[-2:2]            # empty list, because the start lies after the end
>>> L[:]
>>> L[:-1]+['!']
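Two of these lines behave differently from their string counterparts, which the sketch below makes explicit. The workaround for finding the last occurrence is only one possible approach and is not taken from the slides.

```python
L = ['Love', 'looks', 'not', 'with', 'the', 'eyes', ',',
     'but', 'with', 'the', 'mind', '.']

upper_count = L.count('Love')    # counting is case-sensitive
lower_count = L.count('love')
first_with = L.index('with')     # position of the first 'with'

# Lists have index() but no rindex(), unlike strings.
has_rindex = hasattr(L, 'rindex')

# One possible workaround: search the reversed list and convert the position.
last_with = len(L) - 1 - L[::-1].index('with')

empty = L[-2:2]                  # start lies after end, so nothing is selected
swapped = L[:-1] + ['!']         # replace the final '.' with '!'

print(upper_count, lower_count)  # 1 0
print(first_with, last_with)     # 3 8
print(has_rindex)                # False
print(empty)                     # []
print(swapped[-2:])              # ['mind', '!']
```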


Q1

MIN 5.0 AVG 9.5 MAX 10.0


Next time

More on lists
