MIT Big Data Explorers - Presentation by Daniel Burseth

Post on 26-Jun-2015


DESCRIPTION

Presentation by Daniel Burseth at the MIT Big Data Explorers "Crash Course" on 9/20/2014: "An end-to-end demonstration of generating, cleaning, and visualizing a 'messy' data set." http://www.mitbigdataexplorers.com/

TRANSCRIPT

AN END-TO-END DEMONSTRATION OF GENERATING, CLEANING, AND VISUALIZING A “MESSY” DATA SET

Daniel Burseth
Co-president, MIT Big Data Explorers
dburseth@mit.edu
@dmbnyc
GitHub: dburseth

WHAT’S THE MOTIVATION?

Acronyms abound

Tremendous complexity

Use building blocks, not code

CLEAN DATA IS A LUXURY

This is easy

EPPM of 10 requires 500 professionals

BUT WHAT ABOUT INFORMATION THAT ISN’T NICELY STRUCTURED AND DOESN’T HAVE AN API?

ANOTHER AREA THAT DOESN’T GET MUCH AIR TIME….

http://www.nytimes.com/2014/08/18/technology/for-big-data-scientists-hurdle-to-insights-is-janitor-work.html?emc=eta1&_r=0

Data preparation and cleansing:
• Missing values
• Duplicates
• Conventions (dates, times, geographies)
• Spacing
• Can we measure data cleanliness?
• What’s our Pareto point?
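One way to make "data cleanliness" measurable, as the bullets above ask: compute missing-value and duplicate-row rates per dataset. A minimal pandas sketch; the column names and toy rows are illustrative, not from the deck's Craigslist data.

```python
# Hypothetical sketch: quantifying data cleanliness with pandas.
# The tiny DataFrame below stands in for a scraped listings file.
import pandas as pd

df = pd.DataFrame({
    "title": ["2BR Cambridge", "2BR Cambridge", None, "Studio  Boston "],
    "price": ["$1,800", "$1,800", "2500", "call"],
})

missing_rate = df.isna().mean()    # fraction of missing cells, per column
dup_rate = df.duplicated().mean()  # fraction of fully duplicated rows

print(missing_rate["title"])  # 0.25 (1 of 4 titles missing)
print(dup_rate)               # 0.25 (1 of 4 rows is an exact repeat)
```

Tracking these two rates before and after cleaning gives a rough answer to the "Pareto point" question: stop when another pass no longer moves the numbers.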

LOG IN TO YOUR AWS INSTANCE

AWS -> EC2

Launch instance: ami-c6b61fae (US-EAST)

Instance type m3.medium

Connect

You should see some software on the desktop

AGENDA

Scrape all of Craigslist’s Boston apartment listings using WebHarvy

Examine, clean, and prepare the data set using OpenRefine

Map our data and apply filters using Tableau

…all without writing a single line of code.

DOWNLOAD MY SLIDES AT SHOUTKEY.COM/EFFIGY

WEBHARVY

A hyper-intelligent utility to scrape website data.

SysNucleus, makers of USBTrace

Heavy-duty alternatives: Scrapy (scrapy.org), Beautiful Soup
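For readers curious what the code-based alternatives look like, here is a stdlib-only Python sketch of the step WebHarvy automates: pulling listing titles and links out of a page (Beautiful Soup would make this shorter). The HTML fragment and the `result-title` class are stand-ins, not real Craigslist markup.

```python
# Toy scraper using only the standard library's html.parser:
# collect (title, href) pairs from anchors with class "result-title".
from html.parser import HTMLParser

class ListingParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.listings = []   # collected (title, href) pairs
        self._href = None    # href of the anchor currently open, if any
        self._text = []      # text fragments inside that anchor

    def handle_starttag(self, tag, attrs):
        if tag == "a" and ("class", "result-title") in attrs:
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.listings.append(("".join(self._text), self._href))
            self._href = None

parser = ListingParser()
parser.feed('<a class="result-title" href="/apt/1">2BR near Kendall</a>'
            '<a class="result-title" href="/apt/2">Studio in Allston</a>')
print(parser.listings)  # [('2BR near Kendall', '/apt/1'), ('Studio in Allston', '/apt/2')]
```

A real run would fetch a live page first; as the deck warns later, check the site's TOS before scraping.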

GO TO HTTP://SHOUTKEY.COM/WIRE

1. Start Config

2. Click on Hungry Mother – capture text

3. Click on Hungry Mother – capture URL

4. Click on Kendall Square/MIT – capture text

5. Click the last review – capture text

CLEAR

6. Mine -> Scrape a list of similar links

7. Click on Hungry Mother

WE’VE NOW DRILLED INTO THE TOP LINK

Let’s start collecting information in the first sub-page.

THIS CAPTURED THE FIRST PAGE, BUT WHAT IF WE WANT MORE?

Edit Clear

Navigate into a sub-page

Start Config

Set as Next Page Link

OTHER BELLS AND WHISTLES

Scheduler

Input keywords

Pause / Inject (a word of caution: scraping often violates a site’s TOS, and is potentially not viable for apps or commercial purposes!)

TRY VISITING CRAIGSLIST IN AWS BTW!!

Proxy

Database export

20K ROWS OF MESS!

Download Craigslist Boston from http://shoutkey.com/glorify

Look at our data: open Boston Dirty.csv (20k rows of mess!)

Time to CLEAN: Launch GOOGLE-REFINE.EXE

Within MOZILLA, navigate to http://127.0.0.1:3333/

Create Project -> This Computer -> Browse

Parse by tab

Create Project

REMOVE DUPLICATES

1. First, sort the Title column.
2. Invoke "Re-order rows permanently" in the "Sort" dropdown menu that appears at the top of the data table.
3. Invoke Edit cells > Blank down on the Title column.
4. On that column, invoke Facet > Custom facets > Facet by blank.
5. Select true in that facet, and invoke Remove matching rows in the leftmost "All" dropdown menu.
6. Remove the facet.
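The six Refine steps above boil down to "sort, then keep only the first row for each Title". A stdlib-only Python equivalent, with made-up example rows (the Title/Price columns mirror the deck's data, the values do not):

```python
# De-duplicate listing rows by Title, keeping the first occurrence,
# mirroring Refine's sort / blank-down / facet-by-blank / remove sequence.
rows = [
    {"Title": "2BR Cambridge $1800", "Price": "$1,800"},
    {"Title": "2BR Cambridge $1800", "Price": "$1,800"},  # exact repeat
    {"Title": "Studio Allston $1650", "Price": "$1,650"},
]

seen = set()
deduped = []
for row in sorted(rows, key=lambda r: r["Title"]):  # sort, as Refine requires
    if row["Title"] not in seen:                    # keep first of each Title
        seen.add(row["Title"])
        deduped.append(row)

print(len(deduped))  # 2
```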

DUPLICATE “TITLE”

“TITLE” CONTAINS KEY INFO, LET’S PARSE IT

MORE CHANGES TO “TITLE”

TITLE REMAINS MESSY

Then run the “To Number” transform again

LET’S EXTRACT LOCATION

REMOVE TRAILING PAREN
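The Title transforms above (To Number on the price, extracting the location, stripping the trailing paren) can be expressed as two regexes. The title format below is an assumption based on typical Craigslist listings, not taken from the deck's actual data:

```python
# Parse a Craigslist-style title into a numeric price and a location,
# stripping the trailing ")" that the split otherwise leaves behind.
import re

title = "Spacious 2BR near T - $1800 (Cambridge)"

price_m = re.search(r"\$([\d,]+)", title)        # "$1,800"-style prices
price = int(price_m.group(1).replace(",", "")) if price_m else None

loc_m = re.search(r"\(([^)]*)\)\s*$", title)     # trailing "(Location)"
location = loc_m.group(1).strip() if loc_m else None

print(price, location)  # 1800 Cambridge
```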

NOW THE FUN PART: CLUSTERING

SWITCH THE METHOD: NEAREST NEIGHBOR

Increment the radius to 7 and make judgment calls along the way.

Change the Distance Function and do the same thing
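Refine's nearest-neighbor clustering groups strings whose edit distance falls within the chosen radius; changing the distance function changes which pairs land inside it. A toy stdlib sketch of the same idea, using `difflib` similarity as a stand-in distance function (Refine offers Levenshtein and PPM) and a threshold in place of the radius:

```python
# Greedy nearest-neighbor clustering of city-name variants:
# a string joins the first cluster whose representative it is "close" to.
from difflib import SequenceMatcher

cities = ["Cambridge", "cambridge ", "Camridge", "Somerville"]

def close(a, b, threshold=0.8):
    # Similarity ratio after normalizing case and whitespace;
    # the 0.8 threshold plays the role of Refine's radius knob.
    return SequenceMatcher(None, a.lower().strip(),
                           b.lower().strip()).ratio() >= threshold

clusters = []
for city in cities:
    for cluster in clusters:
        if close(city, cluster[0]):
            cluster.append(city)
            break
    else:
        clusters.append([city])

print(len(clusters))  # 2: the Cambridge variants merge, Somerville stands alone
```

As in Refine, the threshold is a judgment call: too tight and typos like "Camridge" survive, too loose and distinct cities merge.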

TRIM WHITESPACE ON OUR CITY DATA

ADD “,MA” TO OUR CITY DATA
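The two transforms above (trim whitespace, then append ", MA" so the city names geocode cleanly) are one line of Python; the sample city names are illustrative:

```python
# Trim stray whitespace and append the state so "Cambridge" becomes
# "Cambridge, MA" -- the form Tableau can geocode unambiguously.
cities = ["  Cambridge ", "Somerville", "Allston  "]
cleaned = [c.strip() + ", MA" for c in cities]
print(cleaned)  # ['Cambridge, MA', 'Somerville, MA', 'Allston, MA']
```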

LET’S PLOT OUR VALUES

Looks like we have SOME really expensive real estate. Data errors?

EXPORT OUR DATA AND LEAVE REFINE

Boston Clean.csv

WELCOME TO TABLEAU

Load Boston Clean.csv

“Go to Worksheet”

DRAG CITY TO THE BLACK BOX

Great “semantic” example. Tableau understands that this text translates to a lat/long

TABLEAU ALERTS US TO UNPLOTTED POINTS

Look in the lower-right corner of the map

Let’s “Filter Data”

SIZE AND LABEL OUR DATA

Under “Measures”, drag “Price” onto Size in “Marks”

Change sum(Price) to avg(Price)

Drag “Price” into Filters, change it to max(Price), and select an “At Most” condition

Right click on the filter and show “Quick Filter”

Drag “City” onto “Label”

Menu Map -> Map Options

Click on a node for info and drill down potential

VISUALIZATION IS A HUGE TOPIC!

RECAP

1. Explored various webpage structures and scraped them
2. Exported the data to Refine
3. Parsed columns to extract critical price and location information
4. Used clustering algorithms to merge related geographies
5. Applied filters to identify errant prices
6. Exported the data to Tableau
7. Completed a cursory mapping visualization

WHAT’S YOUR BUSINESS IDEA? Please come talk to me

QUESTIONS? THANK YOU!

GitHub: dmbnyc
dburseth@mit.edu
