20130206 open refine
DESCRIPTION
10 presentation of OpenRefine (formerly Google Refine) for the Toronto Data Science Group.
TRANSCRIPT
2013-02-06 Toronto Data Science Group
1
We are surrounded by data
2
We are surrounded by MESSY data
- Multiple standards and formats
  - Structured vs. unstructured
  - Field naming and formats vary ...
- Human error (misspellings, typos, etc.)
- Non-normalized inputs (free-text entries, the "other" option)
- Incomplete data (laziness)
- ...
3
Lack of:
- Time
- Skills
- Software
4
OpenRefine, the:
- Swiss army knife for data manipulation!
- Glue step between your IT systems
5
What's OpenRefine (formerly Google Refine, formerly Gridworks)?
- A cross-platform web application that runs locally
- A community-based project hosted on GitHub
- With two distributions and multiple extensions
- Something between a spreadsheet and SQL
6
Three use cases
1. Data Cleaning
2. ETL (Extract Transform Load) Prototyping
3. Data extension (reconciliation & linked data)
7
#1 Data Cleaning
Graphical interface
Facet option
Cluster similar records
Supports three languages:
- GREL, Jython, Clojure
- plus regular expressions
8
Facet example
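A text facet can be thought of as a count of each distinct value in a column, which makes inconsistent spellings and blanks easy to spot. The sketch below illustrates that idea in plain Python; the function name `text_facet` and the sample rows are hypothetical, and this is not OpenRefine's own implementation.

```python
from collections import Counter


def text_facet(rows, column):
    """Count how often each distinct value appears in one column."""
    counts = Counter(row.get(column, "") for row in rows)
    # Most frequent values first; ties are broken alphabetically.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical sample data with a case variant and a blank cell.
rows = [
    {"city": "Toronto"},
    {"city": "toronto"},
    {"city": "Toronto"},
    {"city": "Montreal"},
    {"city": ""},
]

facet = text_facet(rows, "city")
for value, count in facet:
    print(f"{value or '(blank)'}: {count}")
```

Seeing "Toronto" and "toronto" listed as separate values is the cue to merge them.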
9
Cluster example
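Clustering groups values that probably refer to the same thing. One common approach is key collision: normalize each value into a key and group the values that share a key. The sketch below is a simplified fingerprint-style key written for illustration; the exact normalization steps are assumptions and may differ from what OpenRefine does.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(value):
    """Build a normalized key: trimmed, lowercased, unaccented,
    punctuation-free, with unique tokens in sorted order."""
    text = value.strip().lower()
    # Decompose accented characters, then drop the combining marks.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Remove ASCII punctuation.
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(sorted(set(text.split())))


def cluster(values):
    """Group values by key; keep groups with more than one distinct spelling."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return [group for group in groups.values() if len(set(group)) > 1]


# Hypothetical sample values.
values = ["Tom Cruise", "Cruise, Tom", "tom cruise ", "Brad Pitt",
          "Café Noir", "cafe noir"]
clusters = cluster(values)
for group in clusters:
    print(group)
```

Each printed group is a candidate for merging into a single canonical value.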
10
#2 ETL Prototyping (Extract - Transform - Load)
Transform
- Understand your data
- Test the transformations that need to be done
- Undo / Redo
- Export transformation in JSON format
- Automate using the Python or Ruby extension
Extract & Load
Supports:
- tabular (csv, xls)
- hierarchical (xml, json)
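To illustrate the difference between tabular and hierarchical inputs, the sketch below loads both into the same flat list of rows using only the Python standard library. The helper names (`rows_from_csv`, `rows_from_json`, `flatten`) and the dotted column naming are assumptions made for this example, not OpenRefine's importer.

```python
import csv
import io
import json


def rows_from_csv(text):
    """Tabular input: each line is already one row."""
    return list(csv.DictReader(io.StringIO(text)))


def flatten(obj, prefix=""):
    """Turn nested objects into flat columns such as 'address.city'."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=name + "."))
        else:
            flat[name] = value
    return flat


def rows_from_json(text, record_path):
    """Hierarchical input: walk to the list of records, then flatten each."""
    data = json.loads(text)
    for key in record_path:
        data = data[key]
    return [flatten(record) for record in data]


# Hypothetical sample inputs.
csv_rows = rows_from_csv("name,city\nAnn,Toronto\n")
json_rows = rows_from_json(
    '{"people": [{"name": "Ann", "address": {"city": "Toronto"}}]}',
    record_path=["people"],
)
print(csv_rows)
print(json_rows)
```

With hierarchical data the extra decision is which node counts as a record; here that choice is the `record_path` argument.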
11
History and JSON export
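The exported history is a JSON list of operations, which is what makes a prototyped transformation repeatable. The sketch below replays such a list on plain Python rows. The operation and field names are written from memory of the export format and should be checked against a real export; the `EXPRESSIONS` table is a hypothetical stand-in for the GREL engine and covers only the two expressions used here.

```python
import json

# Assumed shape of an exported history (verify against a real export).
history_json = """
[
  {"op": "core/text-transform", "columnName": "city",
   "expression": "value.trim()"},
  {"op": "core/text-transform", "columnName": "city",
   "expression": "value.toLowercase()"},
  {"op": "core/column-rename", "oldColumnName": "city",
   "newColumnName": "city_clean"}
]
"""

# Toy mapping from expression text to a Python function.
EXPRESSIONS = {
    "value.trim()": str.strip,
    "value.toLowercase()": str.lower,
}


def apply_history(rows, operations):
    """Replay the operations in order on a copy of the rows."""
    rows = [dict(row) for row in rows]
    for op in operations:
        if op["op"] == "core/text-transform":
            func = EXPRESSIONS[op["expression"]]
            for row in rows:
                row[op["columnName"]] = func(row[op["columnName"]])
        elif op["op"] == "core/column-rename":
            for row in rows:
                row[op["newColumnName"]] = row.pop(op["oldColumnName"])
        else:
            raise ValueError(f"unsupported operation: {op['op']}")
    return rows


cleaned = apply_history([{"city": "  Toronto "}], json.loads(history_json))
print(cleaned)
```

Because the steps live in data rather than in clicks, the same list can be applied to the next file that arrives in the same shape.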
12
#3 Extend your Data (reconciliation & linked data)
- Cross-reference between OpenRefine projects (like a spreadsheet VLOOKUP)
- Fetch URL and call web services (API)
Reconcile against:
- RDF files & local SPARQL endpoints
- Online databases
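The cross between projects mentioned above behaves like a lookup: for each row, find the row in another table with the same key and pull one of its columns across. A minimal sketch of that idea in plain Python follows; the function name `cross` and its parameters are hypothetical and chosen only for this example.

```python
def cross(rows, key_column, other_rows, other_key, other_value, new_column):
    """Add new_column to each row by looking up key_column in other_rows."""
    lookup = {}
    for other in other_rows:
        # Like VLOOKUP, the first matching row wins.
        lookup.setdefault(other[other_key], other[other_value])
    # Rows without a match get None in the new column.
    return [{**row, new_column: lookup.get(row[key_column])} for row in rows]


# Hypothetical sample tables standing in for two projects.
orders = [{"customer_id": "c1"}, {"customer_id": "c2"}, {"customer_id": "c9"}]
customers = [
    {"id": "c1", "name": "Ann"},
    {"id": "c2", "name": "Bob"},
]

joined = cross(orders, "customer_id", customers, "id", "name", "customer_name")
for row in joined:
    print(row)
```

Unmatched keys such as "c9" stay in the output with an empty value, which makes gaps in the reference table visible.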
13
Reconciliation example
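Reconciliation matches a free-text value against a reference list and returns the best candidate with a score, so that a messy name can be tied to a stable identifier. The sketch below uses `difflib.SequenceMatcher` from the Python standard library as a stand-in scorer; real reconciliation services use their own matching logic, and the identifiers, names and 0.8 threshold here are assumptions for illustration.

```python
from difflib import SequenceMatcher


def reconcile(name, candidates, threshold=0.8):
    """Return the best-scoring candidate and whether it clears the threshold."""
    if not candidates:
        return None
    scored = [
        (SequenceMatcher(None, name.lower(), c["name"].lower()).ratio(), c)
        for c in candidates
    ]
    score, best = max(scored, key=lambda pair: pair[0])
    return {
        "id": best["id"],
        "name": best["name"],
        "score": score,
        "match": score >= threshold,
    }


# Hypothetical reference list with made-up identifiers.
reference = [
    {"id": "city/1", "name": "Toronto"},
    {"id": "city/2", "name": "Montreal"},
]

for raw in ["Tornto", "Vancouver"]:
    print(raw, "->", reconcile(raw, reference))
```

A misspelling such as "Tornto" should clear the threshold and pick up the identifier of "Toronto", while a value with no close candidate is flagged for manual review instead of being matched automatically.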
14
OpenRefine
http://openrefine.org
@OpenRefine
Martin Magdinier
@magdmartin
Thanks!