MIT Big Data Explorers - presentation by Daniel Burseth

AN END-TO-END DEMONSTRATION OF GENERATING, CLEANING, AND VISUALIZING A “MESSY” DATA SET Daniel Burseth Co-president MIT Big Data Explorers [email protected] @dmbnyc Github: dburseth


Post on 26-Jun-2015


DESCRIPTION

Presentation by Daniel Burseth at the MIT Big Data Explorers "Crash Course" on 9/20/2014: "An end-to-end demonstration of generating, cleaning, and visualizing a 'messy' data set". http://www.mitbigdataexplorers.com/

TRANSCRIPT

Page 1: MIT Big Data Explorers - presentation by Daniel Burseth

AN END-TO-END DEMONSTRATION OF GENERATING, CLEANING, AND VISUALIZING A “MESSY” DATA SET Daniel Burseth Co-president MIT Big Data Explorers [email protected] @dmbnyc Github: dburseth

Page 2: MIT Big Data Explorers - presentation by Daniel Burseth

WHAT’S THE MOTIVATION? Acronyms abound

Tremendous complexity

Use building blocks not code

Page 3: MIT Big Data Explorers - presentation by Daniel Burseth

CLEAN DATA IS A LUXURY This is easy

EPPM of 10 requires 500 professionals

Page 4: MIT Big Data Explorers - presentation by Daniel Burseth

BUT WHAT ABOUT INFORMATION THAT ISN’T NICELY STRUCTURED AND DOESN’T HAVE AN API?

Page 5: MIT Big Data Explorers - presentation by Daniel Burseth

ANOTHER AREA THAT DOESN’T GET MUCH AIR TIME….

http://www.nytimes.com/2014/08/18/technology/for-big-data-scientists-hurdle-to-insights-is-janitor-work.html?emc=eta1&_r=0

Data preparation and cleansing:

• Missing

• Duplicative

• Conventions (dates, time, geographies)

• Spacing

• Can we measure data cleanliness?

• What’s our Pareto point?
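One rough way to put a number on "cleanliness" is to count the problems listed above. The sketch below is only an illustration using Python's standard library: the sample rows, the column names, and the `cleanliness_report` helper are hypothetical and are not part of the workshop material.

```python
import csv
import io

# Hypothetical tab-separated sample standing in for scraped rows.
SAMPLE = (
    "title\tprice\tcity\n"
    "$1500 / 2br - Sunny apt\t1500\tCambridge\n"
    "$1500 / 2br - Sunny apt\t1500\tCambridge\n"
    "$900 / 1br - Studio\t\t Somerville\n"
)


def cleanliness_report(rows):
    """Count rows that are exact duplicates, have blank cells, or have stray spacing."""
    seen = set()
    report = {"rows": len(rows), "duplicate": 0, "missing": 0, "padded": 0}
    for row in rows:
        values = tuple((v or "") for v in row.values())
        if values in seen:
            report["duplicate"] += 1
        seen.add(values)
        if any(v == "" for v in values):
            report["missing"] += 1
        if any(v != v.strip() for v in values):
            report["padded"] += 1
    return report


rows = list(csv.DictReader(io.StringIO(SAMPLE), delimiter="\t"))
report = cleanliness_report(rows)
print(report)
```

Tracking these counts before and after each cleaning pass gives a simple way to judge when further effort stops paying off.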

Page 6: MIT Big Data Explorers - presentation by Daniel Burseth

LOGIN TO YOUR AWS INSTANCE AWS -> EC2

Launch instance: ami-c6b61fae (US-EAST)

Instance type m3.medium

Connect

You should see some software on the desktop

Page 7: MIT Big Data Explorers - presentation by Daniel Burseth

AGENDA

Scrape all of Craigslist’s Boston apartment listings using WebHarvy

Examine, clean, and prepare the data set using OpenRefine

Map our data and apply filters using Tableau

……all without writing a single line of code.

Page 8: MIT Big Data Explorers - presentation by Daniel Burseth

DOWNLOAD MY SLIDES AT SHOUTKEY.COM/EFFIGY

Page 9: MIT Big Data Explorers - presentation by Daniel Burseth

WEBHARVY A hyper-intelligent utility to scrape website data.

SysNucleus, makers of USBTrace

Heavy duty alternatives: Scrapy (scrapy.org), Beautiful Soup
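For a sense of what the code-based route involves, here is a minimal sketch. To stay self-contained it uses Python's built-in `html.parser` rather than Scrapy or Beautiful Soup, and it parses a made-up HTML snippet instead of a live page. The `listing` class name, the URLs, and the `ListingParser` class are assumptions for illustration only.

```python
from html.parser import HTMLParser

# Hypothetical markup; real listing pages will be structured differently.
SAMPLE_HTML = """
<ul>
  <li><a class="listing" href="/apt/1.html">$1500 / 2br - Sunny apt (Cambridge)</a></li>
  <li><a class="listing" href="/apt/2.html">$2100 / 3br - Near T (Somerville)</a></li>
  <li><a href="/about">About</a></li>
</ul>
"""


class ListingParser(HTMLParser):
    """Collect (text, url) pairs from <a class="listing"> elements."""

    def __init__(self):
        super().__init__()
        self.listings = []
        self._href = None   # URL of the listing link currently open, if any
        self._text = []     # text fragments seen inside that link

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("class") == "listing":
            self._href = attrs.get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.listings.append(("".join(self._text).strip(), self._href))
            self._href = None


parser = ListingParser()
parser.feed(SAMPLE_HTML)
listings = parser.listings
print(listings)
```

This mirrors the "capture text" and "capture URL" clicks in a point-and-click scraper: one pattern selects the links, and the same two fields are pulled from each.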

Page 10: MIT Big Data Explorers - presentation by Daniel Burseth

GO TO HTTP://SHOUTKEY.COM/WIRE

1. Start Config

2. Click on Hungry Mother – capture text

3. Click on Hungry Mother – capture URL

4. Click on Kendall Square/MIT – capture text

5. Click last review – capture text

CLEAR

6. Mine -> Scrape a list of similar links

7. Click on Hungry Mother

Page 11: MIT Big Data Explorers - presentation by Daniel Burseth

WE’VE NOW DRILLED INTO THE TOP LINK Let’s start collecting information in the first sub-page.

Page 12: MIT Big Data Explorers - presentation by Daniel Burseth

THIS CAPTURED THE FIRST PAGE, BUT WHAT IF WE WANT MORE? Edit Clear

Navigate into a sub-page

Start Config

Set as Next Page Link

Page 13: MIT Big Data Explorers - presentation by Daniel Burseth

OTHER BELLS AND WHISTLES Scheduler

Input keywords

Pause Inject (word of caution: scraping often violates TOS. Potentially not viable for apps or commercial purposes!)

TRY VISITING CRAIGSLIST IN AWS BTW!!

Proxy

Database export

Page 14: MIT Big Data Explorers - presentation by Daniel Burseth

20K ROWS OF MESS!

Download Craigslist Boston from http://shoutkey.com/glorify

Look at our data: open Boston Dirty.csv (20k rows of mess!)

Time to CLEAN: Launch GOOGLE-REFINE.EXE

Within MOZILLA, navigate to http://127.0.0.1:3333/

Create Project -> This Computer -> Browse

Parse by tab

Create Project

Page 15: MIT Big Data Explorers - presentation by Daniel Burseth

REMOVE DUPLICATES

1. First, sort your column.

2. Then, invoke "Re-order rows permanently" in the "Sort" dropdown menu that appears on top of the middle of the data table.

3. Then invoke Edit cells and Blank down on the Title column.

4. Then on that column, invoke menu Facet > Custom facets and Facet by blank.

5. Select true in that facet, and invoke Remove matching rows in the left most "all" dropdown menu.

6. Remove the facet.
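The logic behind these steps (sort, blank out values that repeat the row above, drop the blanked rows) can be sketched in a few lines of Python. This is an illustration of the idea, not what Refine runs internally; the `remove_duplicate_titles` helper and the sample rows are hypothetical.

```python
def remove_duplicate_titles(rows, key="Title"):
    """Sort rows by `key`, then keep only the first row of each run of equal values."""
    ordered = sorted(rows, key=lambda row: row[key])  # steps 1-2: sort permanently
    kept = []
    previous = object()  # sentinel that never equals a real title
    for row in ordered:
        # Steps 3-5: a row whose title repeats the one above it would be
        # blanked down and then removed, so it is simply skipped here.
        if row[key] != previous:
            kept.append(row)
        previous = row[key]
    return kept


# Hypothetical sample rows.
rows = [
    {"Title": "$2100 / 3br - Near T (Somerville)"},
    {"Title": "$1500 / 2br - Sunny apt (Cambridge)"},
    {"Title": "$2100 / 3br - Near T (Somerville)"},
]
unique_rows = remove_duplicate_titles(rows)
print(unique_rows)
```

Sorting first matters: comparing each row only with the row above it catches every duplicate only when equal titles sit next to each other.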

Page 16: MIT Big Data Explorers - presentation by Daniel Burseth

DUPLICATE “TITLE”

Page 17: MIT Big Data Explorers - presentation by Daniel Burseth

“TITLE” CONTAINS KEY INFO, LET’S PARSE IT

Page 18: MIT Big Data Explorers - presentation by Daniel Burseth

MORE CHANGES TO “TITLE”

Page 19: MIT Big Data Explorers - presentation by Daniel Burseth

TITLE REMAINS MESSY

Then run the “To Number” transform again
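The slide images for the parsing steps are not part of this transcript, so the following is only a guess at the kind of transformation involved. It assumes titles start with a dollar amount such as "$1,500 / 2br - ..."; the `parse_price` helper and that title format are assumptions, not the transforms shown in the session.

```python
import re


def parse_price(title):
    """Return the first dollar amount in `title` as an int, or None if there is none."""
    match = re.search(r"\$\s*(\d[\d,]*)", title)
    if match is None:
        return None
    # Remove thousands separators before converting, much as a text cleanup
    # step would precede a "To Number" transform.
    return int(match.group(1).replace(",", ""))


print(parse_price("$1,500 / 2br - Sunny apt (Cambridge)"))
print(parse_price("Room available near MIT"))
```

Titles with no recognizable price come back as `None`, which keeps them visible for manual review instead of silently turning into zero.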

Page 20: MIT Big Data Explorers - presentation by Daniel Burseth

LET’S EXTRACT LOCATION

Page 21: MIT Big Data Explorers - presentation by Daniel Burseth

REMOVE TRAILING PAREN
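As a rough code equivalent of extracting the location and dropping the trailing parenthesis, the sketch below assumes the location is the last parenthesized chunk of the title, for example "... (Cambridge)". That format and the `extract_location` helper are assumptions for illustration.

```python
import re


def extract_location(title):
    """Return the text inside the final pair of parentheses, or None if absent."""
    match = re.search(r"\(([^()]*)\)\s*$", title)
    if match is None:
        return None
    # Capturing only the inside of the parentheses drops both parens at once.
    return match.group(1).strip()


print(extract_location("$1500 / 2br - Sunny apt (Cambridge)"))
```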

Page 22: MIT Big Data Explorers - presentation by Daniel Burseth

NOW THE FUN PART: CLUSTERING

Page 23: MIT Big Data Explorers - presentation by Daniel Burseth

SWITCH THE METHOD: NEAREST NEIGHBOR

Increment the radius to 7 and make judgment calls along the way.

Change the Distance Function and do the same thing
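To make the radius and distance-function settings more concrete, here is a small sketch of nearest-neighbor style clustering using Levenshtein edit distance. It is a simplified illustration, not Refine's implementation; the `cluster` helper and the sample city spellings are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance: minimum single-character inserts, deletes, or substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete from a
                               current[j - 1] + 1,       # insert into a
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


def cluster(values, radius):
    """Group values that lie within `radius` edits of any member of a group."""
    clusters = []
    for value in values:
        matches = [c for c in clusters
                   if any(levenshtein(value.lower(), m.lower()) <= radius for m in c)]
        merged = [value]
        for c in matches:  # a value may bridge several existing clusters
            merged.extend(c)
            clusters.remove(c)
        clusters.append(merged)
    return clusters


cities = ["Cambridge", "Cambrige", "cambridge", "Somerville", "Sommerville"]
clusters = cluster(cities, radius=1)
print(clusters)
```

A larger radius merges more spellings but also raises the odds of merging two genuinely different places, which is why each proposed cluster still needs a judgment call.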

Page 24: MIT Big Data Explorers - presentation by Daniel Burseth

TRIM WHITESPACE ON OUR CITY DATA

Page 25: MIT Big Data Explorers - presentation by Daniel Burseth

ADD “,MA” TO OUR CITY DATA
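The whitespace trim and the “,MA” suffix amount to a one-line transform each. A minimal sketch, with a hypothetical `normalize_city` helper that also avoids adding the suffix twice:

```python
def normalize_city(city, suffix=",MA"):
    """Trim surrounding whitespace and append the state suffix if it is missing."""
    city = city.strip()
    if city and not city.upper().endswith(suffix.upper()):
        city += suffix
    return city


print(normalize_city("  Cambridge "))
```

Adding the state helps a mapping tool resolve ambiguous names, since a city name like Cambridge exists in more than one place.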

Page 26: MIT Big Data Explorers - presentation by Daniel Burseth

LET’S PLOT OUR VALUES Looks like we have SOME really expensive real estate. Data errors????
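One simple way to surface suspicious prices without a plot is to compare each value with the median. The factor of 10, the sample prices, and the `flag_suspicious_prices` helper below are arbitrary assumptions for illustration, not thresholds used in the session.

```python
from statistics import median


def flag_suspicious_prices(prices, factor=10):
    """Return prices that are non-positive or more than `factor` times the median."""
    typical = median(prices)
    return [p for p in prices if p <= 0 or p > factor * typical]


# Hypothetical monthly rents, including one likely sale price and one zero.
prices = [1500, 1800, 2100, 2400, 250000, 0]
suspicious = flag_suspicious_prices(prices)
print(suspicious)
```

The median is used rather than the mean because a single extreme listing would drag the mean upward and hide itself.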

Page 27: MIT Big Data Explorers - presentation by Daniel Burseth

EXPORT OUR DATA AND LEAVE REFINE

Boston Clean.csv

Page 28: MIT Big Data Explorers - presentation by Daniel Burseth

WELCOME TO TABLEAU Load Boston clean.csv

“Go to Worksheet”

Page 29: MIT Big Data Explorers - presentation by Daniel Burseth

DRAG CITY TO THE BLACK BOX

Great “semantic” example. Tableau understands that this text translates to a lat/long

Page 30: MIT Big Data Explorers - presentation by Daniel Burseth

TABLEAU ALERTS TO UNPLOTTED POINTS Look on the map in the lower right corner

Let’s “Filter Data”

Page 31: MIT Big Data Explorers - presentation by Daniel Burseth

SIZE AND LABEL OUR DATA Under “Measures”, drag “Price” onto size in “Marks”

Change sum(Price) to avg(Price)

Drag Price into Filters, change it to max(Price), and select an “At Most” value

Right click on the filter and show “Quick Filter”

Drag “City” onto “Label”

Menu Map -> Map Options

Click on a node for info and drill down potential
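The aggregation behind this view can be sketched in code: average the price per city, and keep only cities whose highest price is at most a chosen cap. The `average_price_by_city` helper, the cap, and the sample rows are hypothetical, and this reading of the max(Price) filter as a per-city test is an assumption.

```python
from collections import defaultdict


def average_price_by_city(rows, max_price):
    """Average price per city, keeping cities whose maximum price is <= max_price."""
    prices = defaultdict(list)
    for city, price in rows:
        prices[city].append(price)
    return {city: sum(values) / len(values)
            for city, values in prices.items()
            if max(values) <= max_price}


# Hypothetical cleaned rows of (city, price).
rows = [
    ("Cambridge,MA", 1500),
    ("Cambridge,MA", 2500),
    ("Boston,MA", 3000),
    ("Boston,MA", 900000),
]
averages = average_price_by_city(rows, max_price=5000)
print(averages)
```

Using the average rather than the sum keeps marker size from simply reflecting how many listings a city has.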

Page 32: MIT Big Data Explorers - presentation by Daniel Burseth

VISUALIZATION IS A HUGE TOPIC!

Page 33: MIT Big Data Explorers - presentation by Daniel Burseth

RECAP

1. Explored various webpage structures and scraped them

2. Exported the data to Refine

3. Parsed columns to extract critical price and location information

4. Used clustering algorithms to merge related geographies

5. Applied filters to identify errant prices

6. Exported the data to Tableau

7. Completed a real cursory mapping visualization

Page 34: MIT Big Data Explorers - presentation by Daniel Burseth

WHAT’S YOUR BUSINESS IDEA? Please come talk to me

Page 35: MIT Big Data Explorers - presentation by Daniel Burseth

QUESTIONS? THANK YOU! GITHUB: DMBNYC [email protected]