20130206 open refine

Post on 07-May-2015

DESCRIPTION

A 10-minute presentation of OpenRefine (formerly Google Refine) for the Toronto Data Science Group.

TRANSCRIPT

2013-02-06 Toronto Data Science Group

We are surrounded by data

We are surrounded by MESSY data

- Multiple standards and formats

  - Structured vs. unstructured

  - Field naming and format vary

- Human error (misspellings, errors, etc.)

- Non-normalized inputs (free-text entries, the "other" option)

- Incomplete data (laziness)

- ...

Lack of:

- Time

- Skills

- Software

OpenRefine is the:

- Swiss Army knife for data manipulation!

- Glue step between your IT systems

What's OpenRefine? (formerly Google Refine, formerly Gridworks)

- A cross-platform web application that runs locally

- A community-based project hosted on GitHub

- Available in two distributions, with multiple extensions

- Something between a spreadsheet and SQL

Three use cases

1. Data cleaning

2. ETL (Extract, Transform, Load) prototyping

3. Data extension (reconciliation & linked data)

#1 Data Cleaning

Graphical interface

Facet option

Cluster similar records

Supports three languages:

- GREL, Jython, Clojure

- Plus regular expressions (regex)

Facet example
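
The original slide showed a screenshot that is not preserved here. As a rough illustration of what a text facet computes, the sketch below counts how often each distinct value occurs in one column. It is plain Python, not OpenRefine code; the `text_facet` helper and the sample `rows` are made up for this example.

```python
from collections import Counter


def text_facet(rows, column):
    """Count how often each distinct value appears in one column."""
    counts = Counter(row.get(column, "") for row in rows)
    # Most frequent values first; ties are broken alphabetically.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical sample data with the kind of inconsistencies a facet exposes.
rows = [
    {"city": "Toronto"},
    {"city": "toronto"},
    {"city": "Toronto "},
    {"city": "Montreal"},
    {"city": "Toronto"},
]

facet = text_facet(rows, "city")
for value, count in facet:
    print(f"{value!r}: {count}")
```

Listing the values this way makes variants such as `'toronto'` and `'Toronto '` (trailing space) stand out, which is the point of faceting before cleaning.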

Cluster example
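
The screenshot from this slide is also missing. The sketch below shows one way similar records can be grouped, modeled on the "fingerprint" key-collision idea: normalize each value to a key, then group values that share a key. It is a simplified approximation in Python under that assumption, not OpenRefine's actual implementation, and the sample names are invented.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(value):
    """Build a normalized key so that near-duplicate strings collide."""
    # Decompose accented characters and drop the combining marks.
    text = unicodedata.normalize("NFKD", value)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Trim, lowercase, and remove ASCII punctuation.
    text = text.strip().lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Split into tokens, de-duplicate, sort, and rejoin.
    return " ".join(sorted(set(text.split())))


def cluster(values):
    """Group values sharing a fingerprint; keep only groups with 2+ variants."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return [sorted(group) for group in groups.values() if len(group) > 1]


# Hypothetical sample data.
names = ["Acme Inc.", "acme inc", "Inc. Acme", "Café Olé", "Cafe Ole", "Widget Co"]
clusters = cluster(names)
for group in clusters:
    print(group)
```

Each printed group is a candidate for merging into a single canonical spelling; a human still decides whether the merge is correct.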

#2 ETL Prototyping (Extract, Transform, Load)

Transform

- Understand your data

- Test the transformations that need to be done

- Undo / redo

- Export the transformations in JSON format

- Automate using the Python or Ruby extension

Extract & Load

Supports:

- Tabular formats (CSV, XLS)

- Hierarchical formats (XML, JSON)

History and JSON export
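
The screenshot for this slide is not preserved. To give a feel for what an exported history can look like and how a script might read it, here is a small Python sketch. The JSON below is an illustrative, trimmed-down sample written for this example; the exact fields in a real export may differ, so treat the field names as assumptions.

```python
import json

# Illustrative sample of an exported operation history (assumed shape).
history_json = """
[
  {
    "op": "core/text-transform",
    "description": "Text transform on cells in column city using expression value.trim()",
    "columnName": "city",
    "expression": "value.trim()"
  },
  {
    "op": "core/column-rename",
    "description": "Rename column city to City",
    "oldColumnName": "city",
    "newColumnName": "City"
  }
]
"""

operations = json.loads(history_json)

# Summarize each step so the transformation can be reviewed or documented.
summary = [(step["op"], step["description"]) for step in operations]
for number, (op, description) in enumerate(summary, start=1):
    print(f"{number}. {op}: {description}")
```

Because the history is plain JSON, it can be kept under version control and reviewed like any other part of an ETL specification.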

#3 Extend Your Data (reconciliation & linked data)

- Cross-reference OpenRefine projects (like VLOOKUP)

- Fetch URLs and call web services (APIs)

Reconcile against:

- RDF files and local SPARQL endpoints

- Online databases
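
The VLOOKUP-style cross between projects mentioned on this slide can be pictured as a keyed lookup from one table into another. The sketch below is plain Python, not OpenRefine code; `cross_lookup` and the two sample tables are hypothetical names chosen for this example.

```python
def cross_lookup(rows, key, other_rows, other_key, wanted):
    """Add column `wanted` to each row by matching `key` against `other_key`."""
    # Index the other table by its key; keep only the first match per key,
    # which mirrors how a spreadsheet VLOOKUP behaves.
    index = {}
    for other in other_rows:
        index.setdefault(other[other_key], other)

    result = []
    for row in rows:
        match = index.get(row[key])
        # Rows without a match get None instead of raising an error.
        result.append({**row, wanted: match[wanted] if match else None})
    return result


# Hypothetical sample "projects".
orders = [
    {"order": 1, "customer_id": "c1"},
    {"order": 2, "customer_id": "c9"},
]
customers = [
    {"id": "c1", "name": "Acme"},
    {"id": "c2", "name": "Globex"},
]

extended = cross_lookup(orders, "customer_id", customers, "id", "name")
for row in extended:
    print(row)
```

Unmatched rows are kept with an empty value rather than dropped, so they can be faceted and fixed afterwards.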

Reconciliation example
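
The screenshot for this slide is missing. As a hedged sketch of what reconciliation involves on the wire, the Python below builds a batch of queries in the style used by reconciliation services: a JSON object of keyed queries, sent form-encoded in a `queries` parameter. No request is sent; the `build_reconcile_body` helper and the `"City"` type identifier are assumptions made for this example, and a real service defines its own type identifiers.

```python
import json
from urllib.parse import parse_qs, urlencode


def build_reconcile_body(names, type_id=None, limit=3):
    """Build a form-encoded body holding one reconciliation query per name."""
    queries = {}
    for position, name in enumerate(names):
        query = {"query": name, "limit": limit}
        if type_id is not None:
            # Restrict candidates to one entity type (hypothetical identifier).
            query["type"] = type_id
        queries[f"q{position}"] = query
    return urlencode({"queries": json.dumps(queries)})


body = build_reconcile_body(["Toronto", "Montreal"], type_id="City")

# Decode the body again to inspect what would be sent to a service.
decoded = json.loads(parse_qs(body)["queries"][0])
print(decoded["q0"])
```

The service would answer with scored candidate matches for each key (`q0`, `q1`, ...), which the user then accepts or rejects.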

OpenRefine

http://openrefine.org

@OpenRefine

Martin Magdinier

martin.magdinier@gmail.com

@magdmartin

Thanks!
