intro to open refine

INTRODUCTION TO DATA CLEANING, PRESENTED BY MILENA MARIN [email protected]; @milena_iul

Upload: school-of-data

Post on 16-Jul-2015


TRANSCRIPT

Page 1: Intro to open refine

INTRODUCTION TO DATA CLEANING

PRESENTED BY

MILENA MARIN

[email protected]; @milena_iul

Page 2: Intro to open refine

Are we ready?

http://openrefine.org/

bit.ly/messydata

Page 3: Intro to open refine

What is messy data?

● In groups of 2 or 3, take 10-15 minutes to explore the data.

● Write down on Post-it notes any errors you find in the data: anything that makes your data “messy”.

● Example: numeric values appear in different formats (text, numbers, etc.).

Page 4: Intro to open refine

Explore your data

• How many columns and rows are there?
Tip: use CTRL (CMD on Mac) + the arrow keys to jump to the edges of your data.

• Understand your column headers (variables).

• What values do these variables take?
Tip: apply a filter.

• What types of data are there?
Tip: numbers, text, dates, etc.

• What are the maximum and minimum values?
Tip: sort your values in ascending or descending order.
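The same exploration questions can also be asked in code. The following is a minimal Python sketch, not part of the original slides: the sample rows and the column names (initiative, city, amount) are made up for illustration and stand in for the workshop dataset.

```python
import csv
import io

# Hypothetical sample standing in for the workshop dataset.
SAMPLE = """initiative,city,amount
Prop A,NY,100
Prop B,N.Y.,250
Prop A,Boston,75
"""

def summarize(csv_text):
    """Return row/column counts, distinct values per column, and numeric min/max."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    columns = list(rows[0].keys()) if rows else []
    summary = {"rows": len(rows), "columns": len(columns), "fields": {}}
    for col in columns:
        values = [row[col] for row in rows]
        info = {"distinct": sorted(set(values))}
        try:
            numbers = [float(v) for v in values]
            info["min"], info["max"] = min(numbers), max(numbers)
        except ValueError:
            pass  # column holds non-numeric text, so min/max is skipped
        summary["fields"][col] = info
    return summary

summary = summarize(SAMPLE)
print(summary["rows"], summary["columns"])
print(summary["fields"]["city"]["distinct"])
```

Listing the distinct values per column is the code equivalent of applying a filter: spelling variants such as NY and N.Y. show up side by side.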

Page 5: Intro to open refine

Data is messy when…

● Spelling errors (example: the city NY is also spelled N.Y. and N.Y)

● White spaces at the beginning and end of a word

● Dates formatted differently (example: 01/10/2013; 10.2013; October 2013; 01.10.2013 12:00:34)

● Numbers formatted as text (example: £100 can be a number formatted as currency or a string of text)

○ Hint: in a spreadsheet, numbers are right-aligned by default and text is left-aligned

● Missing values

● 2 or more variables in the same column
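Several of these problems can be repaired with small normalisation functions. The sketch below is an illustration in Python rather than the method used in the workshop; the function names are hypothetical, and the date layouts assume day-first dates (01/10/2013 read as 1 October 2013), which matches the slide's "October 2013" examples but is still an assumption.

```python
import re
from datetime import datetime

# Date layouts taken from the slide's examples (day-first is an assumption).
DATE_FORMATS = ["%d/%m/%Y", "%d.%m.%Y %H:%M:%S", "%m.%Y", "%B %Y"]

def clean_text(value):
    """Trim leading/trailing white space and collapse internal runs of spaces."""
    return re.sub(r"\s+", " ", value).strip()

def parse_amount(value):
    """Turn a string such as '£100' or '1,250.50' into a float, or None if unparseable."""
    digits = re.sub(r"[^0-9.\-]", "", value)
    try:
        return float(digits)
    except ValueError:
        return None  # empty or malformed value: treat as missing

def parse_date(value):
    """Try each known layout in turn; return None when nothing matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt)
        except ValueError:
            continue
    return None

print(clean_text("  New   York "))
print(parse_amount("£100"))
print(parse_date("01.10.2013 12:00:34"))
```

Returning None for values that cannot be parsed keeps missing data visible instead of silently turning it into zero or an empty string.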

Page 6: Intro to open refine

What is Open Refine?

● Open-source tool for cleaning and preparing messy data for analysis

● Runs locally but in a web browser

● Formerly a Google product, now an open source project

● I wouldn’t leave home without it!

Page 7: Intro to open refine

Feature                     Microsoft Excel   Open Refine
Sorting                     X                 X
Removal of white space      X                 X
Splitting columns           X                 X
Convert JSON                                  X
Text faceting                                 X
HTTP requests                                 X
Geocoding                                     X
Reconciliation to API                         X
Regex matching                                X
Record of transformation                      X

Page 8: Intro to open refine

Sorting

Page 9: Intro to open refine

Remove white spaces

Page 10: Intro to open refine

Split columns
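For readers who want to see the idea outside the tool, here is a minimal Python sketch of splitting one column into two. It is not how Open Refine implements the operation; the function name split_column and the "location" example column are hypothetical.

```python
def split_column(rows, column, separator, new_names):
    """Split rows[column] on separator into new_names; pad missing parts with ''."""
    result = []
    for row in rows:
        # Split at most len(new_names) - 1 times so extra separators stay in the last part.
        parts = [p.strip() for p in row[column].split(separator, len(new_names) - 1)]
        parts += [""] * (len(new_names) - len(parts))
        new_row = {k: v for k, v in row.items() if k != column}
        new_row.update(zip(new_names, parts))
        result.append(new_row)
    return result

# Made-up rows where two variables share one column.
rows = [{"location": "Boston, MA"}, {"location": "Chicago"}]
split_rows = split_column(rows, "location", ",", ["city", "state"])
print(split_rows)
```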

Page 11: Intro to open refine

Rename columns

Page 12: Intro to open refine

Correct formats

Page 13: Intro to open refine

Cluster

Page 14: Intro to open refine

Cluster
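Clustering groups values that are probably different spellings of the same thing. Open Refine's key-collision clustering compares a normalised "fingerprint" of each value; the Python sketch below is a simplified approximation of that idea, not Open Refine's own code, and the function names are chosen here for illustration.

```python
import string
import unicodedata
from collections import defaultdict

def fingerprint(value):
    """Build a normalised key: ASCII only, lowercase, no punctuation, sorted unique tokens."""
    ascii_value = (
        unicodedata.normalize("NFKD", value).encode("ascii", "ignore").decode("ascii")
    )
    cleaned = ascii_value.strip().lower().translate(
        str.maketrans("", "", string.punctuation)
    )
    tokens = sorted(set(cleaned.split()))
    return " ".join(tokens)

def cluster(values):
    """Group values sharing a fingerprint; keep only groups with more than one spelling."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return [group for group in groups.values() if len(set(group)) > 1]

clusters = cluster(["NY", "N.Y.", "N.Y", " ny ", "Boston"])
print(clusters)
```

Each resulting group is a candidate for merging into a single spelling; a person should still review the groups, because a shared fingerprint does not guarantee the values mean the same thing.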

Page 15: Intro to open refine

Export

Page 16: Intro to open refine

Clean Data!

http://bit.ly/clean_data

Page 17: Intro to open refine

Practice

● What are the top 5 initiatives that received the largest contributions?

● What about the smallest contributions?

● What is the average contribution?

● Which initiative receives the most contributions? What about the fewest contributions?

● Which party receives the most contributions?

● In which cities are the Democrats receiving more contributions than the Republicans?
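Once the data is clean, questions like these reduce to grouping, summing and counting. The Python sketch below shows one possible shape for that analysis; the rows and the field names (initiative, party, amount) are made up for illustration and may not match the real dataset, and "largest contributions" is read here as the largest total amount per initiative.

```python
from collections import Counter, defaultdict
from statistics import mean

# Made-up rows for illustration only; the real column names may differ.
rows = [
    {"initiative": "A", "party": "Democrat", "amount": 100.0},
    {"initiative": "B", "party": "Republican", "amount": 250.0},
    {"initiative": "A", "party": "Democrat", "amount": 50.0},
]

def top_initiatives(rows, n=5):
    """Return up to n (initiative, total amount) pairs, largest total first."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["initiative"]] += row["amount"]
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:n]

def average_contribution(rows):
    """Mean amount across all contributions."""
    return mean(row["amount"] for row in rows)

def contributions_per(rows, field):
    """Count how many contributions each value of the given field received."""
    return Counter(row[field] for row in rows)

print(top_initiatives(rows))
print(average_contribution(rows))
print(contributions_per(rows, "party").most_common(1))
```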