20130206 open refine

Upload: martin-magdinier

Post on 07-May-2015

DESCRIPTION

Presentation of OpenRefine (formerly Google Refine) for the Toronto Data Science Group.

TRANSCRIPT

Page 1: 20130206  open refine

2013-02-06 Toronto Data Science Group

1

We are surrounded by data

Page 2: 20130206  open refine

We are surrounded by MESSY data

- Multiple standards and formats

Structured vs. unstructured

Field naming and formats vary ...

- Human error (misspellings, errors, etc.)

- Non-normalized inputs (free-text entries, the "other" option)

- Incomplete data (laziness)

....

Page 3: 20130206  open refine

Lack of

Time

Skills

» Software

Page 4: 20130206  open refine

OpenRefine: the

- Swiss army knife for data manipulation!

- glue step between your IT systems

Page 5: 20130206  open refine

What's OpenRefine? (formerly Google Refine, formerly Gridworks)

- A cross-platform web application that runs locally

- A community-based project hosted on GitHub

- With two distributions and multiple extensions

- Something between a spreadsheet and SQL

Page 6: 20130206  open refine

Three use cases

1. Data Cleaning

2. ETL (Extract, Transform, Load) Prototyping

3. Data extension (reconciliation & linked data)

Page 7: 20130206  open refine

#1 Data Cleaning

Graphical interface

Facet option

Cluster similar records

Supports three languages:

- GREL, Jython, Clojure

+ regex

Page 8: 20130206  open refine

Facet example
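
The screenshot from this slide did not survive in the transcript. As a rough illustration of what a text facet computes, the sketch below counts the distinct values in one column so that inconsistent spellings stand out. The sample rows and the `text_facet` helper are hypothetical, and this is plain Python rather than OpenRefine's own implementation.

```python
from collections import Counter


def text_facet(rows, column):
    """Count how often each distinct value appears in one column."""
    counts = Counter(row.get(column, "") for row in rows)
    # Sort by count (descending), then by value, for a stable display order
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical sample data with inconsistent spellings
rows = [
    {"city": "Toronto"},
    {"city": "toronto"},
    {"city": "Toronto"},
    {"city": "Montreal"},
    {"city": "Toronto "},
]

facet = text_facet(rows, "city")
for value, count in facet:
    # repr() makes trailing spaces visible
    print(f"{value!r}: {count}")
```

Three different spellings of the same city show up as three separate facet values, which is the kind of mess the clustering step is meant to resolve.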

Page 9: 20130206  open refine

Cluster example
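
The screenshot from this slide is also missing. The sketch below shows one way similar records can be grouped: a key-collision approach, loosely modeled on the "fingerprint" idea, where values that normalize to the same key land in the same cluster. The normalization steps, the helper names, and the sample values are assumptions for illustration, not OpenRefine's actual code.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(value):
    """Normalize a value: trim, lowercase, strip accents and punctuation,
    then sort the unique tokens."""
    text = value.strip().lower()
    # Decompose accented characters and drop the combining marks
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Remove ASCII punctuation
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = sorted(set(text.split()))
    return " ".join(tokens)


def cluster(values):
    """Group values that share a fingerprint; keep groups with 2+ variants."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return [sorted(group) for group in groups.values() if len(group) > 1]


# Hypothetical messy values
values = [
    "Toronto Data Science",
    "toronto data science ",
    "Science, Data Toronto",
    "Data Mining",
]

clusters = cluster(values)
for group in clusters:
    print(group)
```

Key collision is cheap and catches differences in case, spacing, punctuation, and word order. It does not catch misspellings, which need a distance-based method instead.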

Page 10: 20130206  open refine

#2 ETL Prototyping (Extract, Transform, Load)

Transform

- Understand your data

- Test the transformations that need to be done

- Undo / Redo

- Export transformations in JSON format

- Automate using the Python or Ruby extension

Extract & Load

Supports:

- tabular (CSV, XLS)

- hierarchical (XML, JSON)

Page 11: 20130206  open refine

History and JSON export
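
The screenshot for this slide is not in the transcript. As a sketch of how an exported history might be inspected before it is replayed, the code below assumes the export is a JSON array of operation objects, each with an `op` and a `description` field. The sample JSON is illustrative only: it is not taken from the talk, and the remaining field names are assumptions.

```python
import json

# Illustrative operation history; the exact fields are an assumption,
# not copied from the talk or from a real export.
exported_history = """
[
  {"op": "core/text-transform",
   "description": "Trim whitespace in column city",
   "columnName": "city",
   "expression": "value.trim()"},
  {"op": "core/column-rename",
   "description": "Rename column city to City",
   "oldColumnName": "city",
   "newColumnName": "City"}
]
"""

operations = json.loads(exported_history)

# Reduce each step to (operation id, human-readable description)
summary = [(step["op"], step["description"]) for step in operations]
for number, (op, description) in enumerate(summary, start=1):
    print(f"{number}. {op}: {description}")
```

Because the history is plain JSON, it can be kept under version control and reviewed like any other script.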

Page 12: 20130206  open refine

#3 Extend your Data (reconciliation & linked data)

- Cross-reference between OpenRefine projects (like VLOOKUP)

- Fetch URL and call web services (API)

Reconcile against

- RDF files & local SPARQL endpoints

- Online databases
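
The VLOOKUP-style cross between two projects mentioned on this slide can be sketched in plain Python as a keyed lookup from one table into another. The `cross` helper, the column names, and the sample rows are hypothetical and only illustrate the idea; inside OpenRefine this would be done with an expression rather than with this code.

```python
def cross(rows, key_column, lookup_rows, lookup_key, lookup_value, new_column):
    """Add new_column to each row by matching key_column against another table."""
    # Index the second table once so each lookup is a dictionary access
    index = {r[lookup_key]: r[lookup_value] for r in lookup_rows}
    # Unmatched keys get None instead of raising an error
    return [{**row, new_column: index.get(row[key_column])} for row in rows]


# Hypothetical "projects": orders to extend, and a country reference table
orders = [
    {"order": 1, "country_code": "CA"},
    {"order": 2, "country_code": "FR"},
    {"order": 3, "country_code": "XX"},
]
countries = [
    {"code": "CA", "name": "Canada"},
    {"code": "FR", "name": "France"},
]

extended = cross(orders, "country_code", countries, "code", "name", "country_name")
for row in extended:
    print(row)
```

Rows whose key has no match keep a `None` value, which makes the gaps easy to isolate with a facet afterwards.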

Page 13: 20130206  open refine

Reconciliation example
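
The screenshot for this slide is missing as well. Reconciliation matches messy cell values against a reference set of known entities. The sketch below is a local stand-in that uses fuzzy string matching from Python's standard `difflib` module; it does not call a reconciliation service, and the `reconcile` helper, the cutoff, and the sample inputs are assumptions for illustration.

```python
import difflib

# Reference names, taken from the product names mentioned in this talk
reference = ["OpenRefine", "Google Refine", "Gridworks"]


def reconcile(value, candidates, cutoff=0.6):
    """Return the closest candidate for value, or None if nothing is close enough."""
    # Compare in lowercase, but return the candidate in its original spelling
    lowered = {candidate.lower(): candidate for candidate in candidates}
    matches = difflib.get_close_matches(
        value.strip().lower(), list(lowered), n=1, cutoff=cutoff
    )
    return lowered[matches[0]] if matches else None


# Hypothetical messy inputs
for messy in ["open refine", "gridwork", "zzzz"]:
    print(f"{messy!r} -> {reconcile(messy, reference)!r}")
```

A real reconciliation step would typically return several scored candidates per cell and let the user confirm the match, rather than silently picking the top one.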

Page 14: 20130206  open refine

OpenRefine

http://openrefine.org

@OpenRefine

Martin Magdinier

[email protected]

@magdmartin

Thanks!