SESSION 2.5 WHARTON SUMMER TECH CAMP Regex Data Acquisition

Upload: nerita

Post on 22-Jan-2016



Page 1: Session 2.5 Wharton Summer Tech Camp

SESSION 2.5 WHARTON SUMMER TECH CAMP

Regex

Data Acquisition

Page 2: Session 2.5 Wharton Summer Tech Camp

Agenda

1: REGEX INTRO

2: DATA ACQUISITION

Page 3: Session 2.5 Wharton Summer Tech Camp

Regular Expression

Page 4: Session 2.5 Wharton Summer Tech Camp

What is Regular Expression (RE) ?

• RE or REGEX is a way to describe string patterns to computers

• Basically, an advanced “Find” and “Find and Replace”

• Originated from theoretical computer science
  • For the interested: “Formal Language Theory”, “Chomsky hierarchy”, “Automata theory”
  • Theory that guides programming language design

• Popularized by Perl, ubiquitous in Unix

• Almost all programming languages support REGEX, and the implementations are mostly the same

Page 5: Session 2.5 Wharton Summer Tech Camp

What is Regular Expression (RE)?

• Given a text T, an RE matches the parts of T that fit the pattern the RE describes
  • RE(T) = Subset_of_matched(T)

• Then you can do whatever you wish with the matched part

• A regular expression can be complicated and can consist of multiple patterns
  • You can match multiple patterns at the same time

• With the matched part of T, you can process it or substitute part of it with something else you wish
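As a rough illustration of "match, then do something with the match", here is a minimal Python sketch. The sample text and the email-like pattern are made up for this example, and the pattern is deliberately simplistic rather than a real email validator.

```python
import re

# Hypothetical text T, invented for this example.
text = "Contact: alice@example.com, bob@example.org"

# A deliberately simple pattern: word characters, "@", word characters,
# a literal dot, word characters. Not a full email validator.
pattern = r"\w+@\w+\.\w+"

# RE(T): the parts of T that the pattern matches.
matched = re.findall(pattern, text)

# Substitute every matched part with something else.
redacted = re.sub(pattern, "<email>", text)

print(matched)
print(redacted)
```

re.findall collects the matched parts and re.sub replaces them, which corresponds to the advanced "Find" and "Find and Replace" described earlier.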

Page 6: Session 2.5 Wharton Summer Tech Camp

“Oh it’s just a text searching tool, so what?”

Page 7: Session 2.5 Wharton Summer Tech Camp

Well, Google is a text search tool, albeit for different purposes.

The power comes from the fact that by learning regex, you are essentially learning to represent complex text patterns to computers efficiently.

The data may be too big, or the task too tedious, for humans to go through by hand

Learn their language and tell computers what to do!

Page 8: Session 2.5 Wharton Summer Tech Camp

True (paraphrased) quotes from some doctoral students and faculty members before I introduced them to REGEX

“I despise aggregating data from the AMT – it took me a week to go through them all”

“[Grunt noise]. I had to filter out IP addresses from surveys by hand and it took me forever”

“I have this data with many different ways of representing the same variables and need to do “fuzzy” matching but don’t know a good way to do this”

Page 9: Session 2.5 Wharton Summer Tech Camp
Page 10: Session 2.5 Wharton Summer Tech Camp

Reasons to use regex

1. Regular expressions are very useful for data cleaning and aggregation.

2. Very useful in basic web scraping.

3. Text data is everywhere and “If you take ‘text’ in the widest possible sense, perhaps 90% of what you do is 90% text processing” (Programming Perl book).

4. Once you learn regex, you can use it in any language since they are similarly implemented.

5. Learning regex is one of the first steps in learning NLP (natural language processing).

6. You are learning a language of the machines.

Page 11: Session 2.5 Wharton Summer Tech Camp

Usage Examples

• You get an output from Amazon Mechanical Turk (or Qualtrics) and need to extract and aggregate data and make it usable by R or Stata

• You can check survey outcomes for quality control at a massive scale, for example to see whether participants are paying attention. A related use in web development is checking that input has the correct format (e.g., password requirements).

• You want to scrape simple information from a website for your project

• One simple algorithm in NLP is matching and counting words. Regex can do that.

• You want to obtain email addresses for your evil spamming purposes. You can do that but don’t.

• Etc. Many possibilities for increase in productivity
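The word matching and counting example from the list above can be sketched in a few lines of Python. The survey answer is invented for illustration, and treating a "word" as a run of \w characters is an assumption that will not suit every NLP task.

```python
import re
from collections import Counter

# Hypothetical free-text survey answer, invented for this example.
survey_answer = "The data is big. The data is messy, and the DATA is text."

# \w+ matches runs of letters, digits and underscores. Lowercasing first
# means "The", "the" and "DATA" are counted together with their variants.
words = re.findall(r"\w+", survey_answer.lower())
counts = Counter(words)

print(counts.most_common(3))
```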

Page 12: Session 2.5 Wharton Summer Tech Camp

But it takes some time to master

You will need to practice with a cheat sheet next to you.

Literally, this is a language (“regular language”) you are learning.

Just like any language, this one has vocabularies and grammars to learn.

Page 13: Session 2.5 Wharton Summer Tech Camp

Tools to practice REGEX

• There are great tools to practice regex

• Website
  • http://gskinner.com/RegExr/

• If you have a Mac
  • http://reggyapp.com/ Reggy

• If you have Windows
  • http://www.regexbuddy.com/ RegexBuddy

Page 14: Session 2.5 Wharton Summer Tech Camp

Basics of REGEX

• Can represent strings literally or symbolically
  • Literal representations are not powerful but convenient for small tasks

• Symbolic representation is the workhorse
  • There are a few concepts you need to learn to use this representation

• There are also many special characters with special meanings in REGEX, e.g., . ^ $ * + ? { } [ ] \ | ( )

• Cheat sheet: http://cloud.github.com/downloads/tartley/python-regex-cheatsheet/cheatsheet.pdf

Page 15: Session 2.5 Wharton Summer Tech Camp

Literal Matching

• Match strings literally.

• String = “I am a string”

• RE = “string”

• Matched string = “string”

That’s it
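In Python, the literal match on this slide might look like the sketch below. It uses re.search from the standard library; the variable names are only illustrative.

```python
import re

string = "I am a string"

# A literal pattern: every character stands for itself.
match = re.search(r"string", string)

print(match.group())  # the matched text
print(match.span())   # (start, end) positions of the match in the string
```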

Page 16: Session 2.5 Wharton Summer Tech Camp

Literal Matching & Quantifiers

• Symbolic matching has many special characters to learn.

• Quantifier is one concept

• + means match whatever comes before it (here, the single character a) 1 or more times
  • "ba" matches only "ba"
  • "ba+" matches ba, baa, baaa, etc.

• ? means match whatever comes before it 0 or 1 time
  • "ba?" matches b or ba

• * means match whatever comes before it 0 or more times
  • "ba*" matches b or ba or baa or baaa and so on
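One way to check these quantifier rules yourself is the small sketch below. It uses re.fullmatch so that a pattern only counts as matching when it covers the whole candidate string; the candidate list and the helper name full_matches are made up for this example.

```python
import re

candidates = ["b", "ba", "baa", "baaa"]

def full_matches(pattern):
    """Return the candidates that the pattern matches in their entirety."""
    return [c for c in candidates if re.fullmatch(pattern, c)]

plus = full_matches(r"ba+")      # "b" followed by one or more "a"
optional = full_matches(r"ba?")  # "b" followed by zero or one "a"
star = full_matches(r"ba*")      # "b" followed by zero or more "a"

print(plus)
print(optional)
print(star)
```

Note that the quantifier applies only to the character right before it, so the "b" is always required exactly once.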

Page 17: Session 2.5 Wharton Summer Tech Camp

More Quantifiers

• {start,end} means match whatever comes before it at least “start” and at most “end” times
  • "ba{1,3}" matches ba, baa, baaa
  • "ba{2,}" matches baa, baaa, baaaa and so on
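The brace quantifiers can be tried out the same way. This is a sketch with an invented candidate list; the exact-count form {2} is not on the slide and is included only as an extra illustration.

```python
import re

candidates = ["b", "ba", "baa", "baaa", "baaaa"]

# Between one and three "a" characters after the "b".
one_to_three = [c for c in candidates if re.fullmatch(r"ba{1,3}", c)]

# Two or more "a" characters (no upper bound).
two_or_more = [c for c in candidates if re.fullmatch(r"ba{2,}", c)]

# Exactly two "a" characters (extra form, not shown on the slide).
exactly_two = [c for c in candidates if re.fullmatch(r"ba{2}", c)]

print(one_to_three, two_or_more, exactly_two)
```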

Page 18: Session 2.5 Wharton Summer Tech Camp

Special Meta characters

• As you’ve seen, some characters have special meanings
  • . ^ $ * + ? { } [ ] \ | ( )

• . means any one character except the newline character \n

• ^ dictates that the regex pattern should only be matched if it occurs at the beginning
  • String = “the book” RE = “book” YES RE = “^book” NO

• $ is similar to ^ but for the ending

• [] defines a set of characters, and a hyphen inside it gives a range: [0-9] means any one character from 0 to 9

• () is used for grouping
  • Used to group patterns
  • Can be used to remember (capture) a certain part of the match

• | is used as “OR”: (5|4) matches 5 or 4

• \ is the special character to rule them all: it escapes any special meta character so that it is matched as is. \. matches an actual period .

• [^stuff] means match any one character that is not listed in “stuff”: [^9] matches any character but 9
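The meta characters above can be exercised with a short Python sketch. All sample strings other than "the book" are invented for illustration.

```python
import re

s = "the book"

anywhere = re.search(r"book", s)   # found: "book" occurs in the string
at_start = re.search(r"^book", s)  # None: "book" is not at the beginning
at_end = re.search(r"book$", s)    # found: "book" is at the end

digits = re.findall(r"[0-9]", "a1b22")        # each single digit
not_nine = re.findall(r"[^9]", "1929")        # every character except 9
five_or_four = re.findall(r"(5|4)", "3456")   # 5 or 4

dot_any = re.findall(r"a.c", "abc a.c")       # . matches any one character
dot_literal = re.findall(r"a\.c", "abc a.c")  # \. matches only a real period

# Parentheses remember (capture) parts of the match.
remembered = re.search(r"(\w+)@(\w+)", "user@host").groups()

print(at_start, digits, not_nine, five_or_four, dot_any, dot_literal, remembered)
```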

Page 19: Session 2.5 Wharton Summer Tech Camp

Hey Jude

Hey Jude, don't make it bad

Take a sad song and make it better

Remember to let her under your skin

Then you'll begin to make it

(better ){6}, oh

(na ){7}, (na ){4}, Hey Jude
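The lyric notation above combines grouping with a quantifier. As a small sketch of why the parentheses matter, compare a grouped and an ungrouped pattern against a made-up chorus string:

```python
import re

chorus = "na " * 7  # "na na na na na na na "

# With a group, {7} repeats the whole "na " unit seven times.
grouped = re.fullmatch(r"(na ){7}", chorus) is not None

# Without a group, {7} applies only to the space: "na" plus seven spaces.
ungrouped = re.fullmatch(r"na {7}", chorus) is not None

print(grouped, ungrouped)
```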

Page 20: Session 2.5 Wharton Summer Tech Camp

Special Vocabulary Shortcuts

• Some vocabularies are so common that shortcuts were made

• \d matches any digit [0-9]

• \w matches any alphanumeric character plus underscore [a-zA-Z0-9_]

• \s matches white space: spaces, tabs, newlines, etc. [ \t\n]
  • notice the space at the beginning

• \W matches any character that is not alphanumeric or underscore [^a-zA-Z0-9_]

• \S guess?

• \D again?
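A quick way to see the shortcuts in action is the sketch below. The sample line is invented, and note that in Python 3 the shortcuts \d, \w and \s also cover non-ASCII digits, letters and whitespace for text strings unless the re.ASCII flag is given.

```python
import re

# Hypothetical line mixing words, numbers, a tab and a newline.
line = "Room 101,\tfloor 3\n"

numbers = re.findall(r"\d+", line)       # runs of digits
words = re.findall(r"\w+", line)         # runs of word characters
fields = re.split(r"\s+", line.strip())  # split on any run of white space
non_word = re.findall(r"\W", line)       # everything \w does not match

print(numbers, words, fields, non_word)
```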

Page 21: Session 2.5 Wharton Summer Tech Camp

Flags

• Flags change the way regex works

• i: ignore case

• s: changes the way . works. Usually . matches anything except the newline \n; this flag makes . match everything

• m: multiline. Changes the way ^ and $ work with newlines. Usually ^ and $ match only at the start or end of the whole string, but this flag makes them match at the start and end of each line.
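In Python's re module these three flags are spelled re.IGNORECASE, re.DOTALL and re.MULTILINE (short forms re.I, re.S, re.M). The sketch below compares each pattern with and without its flag on a made-up two-line string.

```python
import re

text = "first line\nSecond line"

# i: ignore case
case_default = re.search(r"second", text)
case_ignored = re.search(r"second", text, re.IGNORECASE)

# s: let . match the newline as well
dot_default = re.search(r"line.Second", text)
dot_all = re.search(r"line.Second", text, re.DOTALL)

# m: let ^ match at the start of every line, not only the whole string
starts_default = re.findall(r"^\w+", text)
starts_multiline = re.findall(r"^\w+", text, re.MULTILINE)

print(case_default, dot_default, starts_default, starts_multiline)
```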

Page 22: Session 2.5 Wharton Summer Tech Camp

REGEX in python

• Python library re
  • import re

• The function used is

re.search(pattern, string, flags=0)
Scan through string looking for a location where the regular expression pattern produces a match, and return a corresponding match object. Return None if no position in the string matches the pattern.

• pattern: specifies what is to be matched

• string: the actual string to match against

• flags: options that change the way regex works; for example, the "i" flag (re.IGNORECASE or re.I in Python) says ignore case

Page 23: Session 2.5 Wharton Summer Tech Camp

REGEX in python

re.search(pattern, string, flags=0)

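re.findall(pattern, string, flags=0)

• pattern: always wrap the pattern in r"" for Python. r"" says to interpret everything between the quotes as a raw string, so backslashes reach the regex engine unchanged. This is particular to Python because of the way it interprets backslash escapes in ordinary strings.

import re

s = "This is an example string"

matchedobject = re.search(r"This", s)  # returns a match object: "This" is found

matchedobject = re.search(r"this", s)  # returns None: matching is case-sensitive by default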
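To make the difference between the two functions concrete, here is a small sketch using the same example string. re.search returns only the first match (or None), while re.findall returns a list of every non-overlapping match; the variable names are illustrative.

```python
import re

s = "This is an example string"

first = re.search(r"is", s)      # first match only (inside "This")
every = re.findall(r"is", s)     # all matches, as a list of strings
missing = re.search(r"this", s)  # None, because of the lowercase "t"

# Always check for None before using a match object.
if first is not None:
    print(first.group(), first.start())

# The ignore-case flag makes the lowercase pattern succeed.
relaxed = re.search(r"this", s, re.IGNORECASE)
print(every, missing, relaxed.group())
```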

Page 24: Session 2.5 Wharton Summer Tech Camp

Regex is easy to learn but hard to master

Example of complex regex

The regex in the next slide is taken from

http://ex-parrot.com/~pdw/Mail-RFC822-Address.html

It validates email addresses against the RFC 822 grammar, which is now obsolete. It was not written by hand; it was produced by combining a set of simpler regexes.

Page 25: Session 2.5 Wharton Summer Tech Camp

(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t] )+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?: \r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:( ?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\0 31]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\ ](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+ (?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?: (?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z |(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n) ?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\ r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n) ?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t] )*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])* )(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t] )+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*) *:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+ |\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r \n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?: \r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t ]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031 ]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\]( ?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(? 
:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(? :\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*)|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(? :(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)? [ \t]))*"(?:(?:\r\n)?[ \t])*)*:(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]| \\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<> @,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|" (?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t] )*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\ ".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(? :[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[ \]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000- \031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|( ?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,; :\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([ ^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\" .\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\ ]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\ [\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\ r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\] |\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \0 00-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\ .|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@, ;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(? 
:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])* (?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\". \[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[ ^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\] ]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*)(?:,\s*( ?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\ ".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:( ?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[ \["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t ])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t ])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(? :\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+| \Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?: [^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\ ]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n) ?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\[" ()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n) ?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<> @,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@, ;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t] )*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\ ".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)? (?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\". 
\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?: \r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\[ "()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t]) *))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t]) +|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\ .(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z |(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:( ?:\r\n)?[ \t])*))*)?;\s*)

NO+!+

Page 26: Session 2.5 Wharton Summer Tech Camp

Lab

• Try some REGEX tutorials
  • http://regexone.com/
  • http://www.regexlab.com/
  • http://www.regular-expressions.info/tutorial.html

• The scripts I uploaded

• Play around with the regex tool

• 5-10 minutes

Page 27: Session 2.5 Wharton Summer Tech Camp

Fire up the REGEX.py

Page 28: Session 2.5 Wharton Summer Tech Camp

Next Session Preview

Page 29: Session 2.5 Wharton Summer Tech Camp

THE BIGGEST concern for doctoral students doing empirical work (year 2-4) excluding the quals/prelims

“WHERE AND HOW DO I GET DATA?!”

Mr. Data: “I believe what you are experiencing is frustration”

Page 30: Session 2.5 Wharton Summer Tech Camp

Data sources

1. Companies

2. Wharton Organizations

3. Scraping Web

4. APIs: application programming interface

Page 31: Session 2.5 Wharton Summer Tech Camp

We are going to use the following for the next session

• Download WGET and make sure it works
  • You may already have wget if you use a Mac (in terminal, type wget)
  • http://www.gnu.org/software/wget/

• Get Firefox Developer’s Toolbox

• Data acquisition (Wharton, Company, Scraping, API)

Page 32: Session 2.5 Wharton Summer Tech Camp

REGEX-FU Contest with small prizes!