Page 1: Towards Automated Data Wrangling · Data wrangling is the process of going from "raw" data to "usable data”. James Geddes, Principal Data Scientist, The Alan Turing Institute 05/09/2017

The Alan Turing Institute · 05/09/2017 · Towards Automating Data Wrangling – Curation of Example Datasets

Towards Automated Data Wrangling
Curation of Example Datasets

May Yong
Research Software Engineer

The Alan Turing Institute

Page 2

Data wrangling is the process of going from "raw" data to "usable data".

James Geddes, Principal Data Scientist, The Alan Turing Institute

Transparency and reproducibility are essential for the data wrangling process.

Data wrangling tools should reflect this.

Page 3

Wrangling challenges

• Obtaining, or inferring a data dictionary
• Data integration
• Record linkage
• Spelling and format variability
• Reformatting the structure of the data
• Handling missing data
• Anomaly detection

“Improving the Data Analytics Process”, Turing Institute workshop, 18-21 July 2016

Datasets containing wrangling tasks

Surgery Outcomes, Web browsing history, Neonatal ICU, Rainfall, UK E-Petitions, Cybersecurity

Task: find raw data and document the wrangling required to bring it to the stage where it can be analyzed.

Page 4

Obtaining, or inferring a data dictionary

Understanding the meaning of individual data, fields and tables or other complex structures

Web browsing history, Neonatal ICU, Rainfall, UK E-Petitions, Cybersecurity
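
As a rough illustration of inferring a data dictionary, the sketch below guesses a type for each column of a small CSV sample. The column names, the sample values, and the helpers `infer_type` and `infer_data_dictionary` are hypothetical and are not taken from any of the workshop datasets. A real data dictionary would also need units and meanings, which cannot be recovered from values alone.

```python
import csv
import io
from datetime import datetime

# Hypothetical sample; not drawn from any of the workshop datasets.
SAMPLE_CSV = """station_id,reading_date,rainfall_mm,comment
S001,2016-03-25,4.2,ok
S002,2016-03-26,0,gauge replaced
S003,2016-03-27,,ok
"""


def infer_type(values):
    """Guess the narrowest type that fits every non-empty value."""
    present = [v for v in values if v != ""]
    if not present:
        return "unknown"
    parsers = (
        ("integer", int),
        ("float", float),
        ("date", lambda v: datetime.strptime(v, "%Y-%m-%d")),
    )
    for name, parse in parsers:
        try:
            for v in present:
                parse(v)
            return name
        except ValueError:
            continue  # At least one value did not fit; try the next type.
    return "string"


def infer_data_dictionary(text):
    """Return {column name: inferred type} for CSV text with a header row."""
    rows = list(csv.DictReader(io.StringIO(text)))
    columns = list(rows[0].keys()) if rows else []
    return {col: infer_type([row[col] for row in rows]) for col in columns}


data_dictionary = infer_data_dictionary(SAMPLE_CSV)
for column, kind in data_dictionary.items():
    print(f"{column}: {kind}")
```

Empty cells are ignored when guessing the type, so a column with gaps can still be reported as numeric.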

Page 5

Data Integration

Combining data from multiple sources that is conceptually “about the same thing”.

Rainfall, UK E-Petitions, Cybersecurity
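
A minimal sketch of data integration, assuming both sources share an exact key: two lists of records are outer-joined so that records found in only one source are kept. The field names, the area labels, the numbers, and the `integrate` helper are all invented for illustration.

```python
# Hypothetical records from two sources that describe the same areas.
signatures = [
    {"area": "Area A", "signatures": 120},
    {"area": "Area C", "signatures": 95},
]
electorate = [
    {"area": "Area A", "electorate": 8000},
    {"area": "Area B", "electorate": 9000},
]


def integrate(left, right, key):
    """Outer-join two lists of dicts on `key`, keeping unmatched records."""
    merged = {}
    for record in left + right:
        # Records with the same key value are folded into one combined dict.
        merged.setdefault(record[key], {}).update(record)
    return [merged[k] for k in sorted(merged)]


combined = integrate(signatures, electorate, "area")
for row in combined:
    print(row)
```

Real sources rarely share a clean key, which is where record linkage and format normalisation come in.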

Page 6

Record linkage

Recognising that two distinct pieces of information in the data do in fact concern the same entity
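
One simple way to approach record linkage is to normalise names and compare them with a string similarity score. The sketch below uses `difflib.SequenceMatcher` from the Python standard library. The names, the 0.85 threshold, and the helper functions are assumptions made for illustration, not a method described in the slides.

```python
from difflib import SequenceMatcher


def normalise(name):
    """Lower-case, replace punctuation with spaces, and sort the tokens."""
    cleaned = "".join(
        ch if ch.isalnum() or ch.isspace() else " " for ch in name.lower()
    )
    return " ".join(sorted(cleaned.split()))


def similarity(a, b):
    """Similarity ratio in [0, 1] between two normalised names."""
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()


def link_records(left, right, threshold=0.85):
    """Pair each left name with its best right match at or above the threshold."""
    links = []
    for a in left:
        best = max(right, key=lambda b: similarity(a, b))
        if similarity(a, best) >= threshold:
            links.append((a, best))
    return links


# Hypothetical names, invented for illustration.
source_a = ["Smith, John", "Alice Jones"]
source_b = ["John Smith", "Alicia Jonas", "Bob Brown"]
links = link_records(source_a, source_b)
print(links)
```

Sorting the tokens makes "Smith, John" and "John Smith" identical after normalisation. The threshold trades false links against missed links and would need tuning on real data.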

Coping with spelling and format variability

Recovering the value of a datum from its representation (e.g., recognising the string “25 Mar 16” as the ISO 8601 date 2016-03-25).
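
The date example can be sketched with `datetime.strptime`, trying a list of candidate formats in turn. The format list and the `to_iso_date` helper are assumptions. The sketch also assumes day-first dates and an English locale for month names.

```python
from datetime import datetime

# Candidate formats, tried in order. This list is an assumption and treats
# ambiguous numeric dates as day-first.
CANDIDATE_FORMATS = ["%d %b %y", "%d %B %Y", "%d/%m/%Y", "%Y-%m-%d"]


def to_iso_date(text):
    """Return the ISO 8601 date for `text`, or None if no format matches."""
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue  # This format did not fit; try the next one.
    return None


print(to_iso_date("25 Mar 16"))  # 2016-03-25
```

Returning `None` rather than raising keeps unparseable values visible, so they can be reviewed instead of silently dropped.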

Page 7

Reformat data structure

Switching from “wide” to “tall” format, normalising/de-normalising relational datasets

Neonatal ICU, Rainfall, UK E-Petitions

Day 1 Day 2
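
A minimal sketch of switching from “wide” to “tall” format, assuming one identifier column and one column per observation. The table contents and the `wide_to_tall` helper are hypothetical.

```python
# Hypothetical "wide" table: one row per station, one column per day.
wide = [
    {"station": "S001", "day_1": 4.2, "day_2": 0.0},
    {"station": "S002", "day_1": 1.5, "day_2": 3.1},
]


def wide_to_tall(rows, id_field, variable_name="variable", value_name="value"):
    """Emit one row per (identifier, column) pair, as in an unpivot."""
    tall = []
    for row in rows:
        for column, value in row.items():
            if column == id_field:
                continue  # The identifier is repeated on every output row.
            tall.append(
                {id_field: row[id_field], variable_name: column, value_name: value}
            )
    return tall


tall = wide_to_tall(wide, "station", variable_name="day", value_name="rainfall_mm")
for row in tall:
    print(row)
```

Each wide row with two day columns becomes two tall rows, so the four measurements end up in a single value column.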

Page 8

Missing Data

Neonatal ICU, Rainfall, UK E-Petitions

Missing data sources in ‘Neonatal’

Missing weather stations in ‘Rainfall’

Missing postcodes in ‘UK E-Petitions’

Day 1 Day 2
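
A first step in handling missing data is usually to measure how much is missing. The sketch below counts missing values per field, treating a few common placeholder strings as missing. The placeholder set, the records, and the helper names are assumptions made for illustration.

```python
# Strings treated as "missing". This set is an assumption and varies by dataset.
MISSING_TOKENS = {"", "na", "n/a", "null", "none", "-"}


def is_missing(value):
    """True for None or for strings that match a known placeholder."""
    if value is None:
        return True
    return isinstance(value, str) and value.strip().lower() in MISSING_TOKENS


def missing_counts(rows, fields):
    """Count missing values per field; an absent key also counts as missing."""
    counts = {field: 0 for field in fields}
    for row in rows:
        for field in fields:
            if is_missing(row.get(field)):
                counts[field] += 1
    return counts


# Hypothetical records, invented for illustration.
records = [
    {"signature_id": "1", "postcode": "AB1 2CD"},
    {"signature_id": "2", "postcode": ""},
    {"signature_id": "3", "postcode": "N/A"},
    {"signature_id": "4"},
]
report = missing_counts(records, ["signature_id", "postcode"])
print(report)
```

Counting absent keys as missing matters when whole sources or fields are absent, not just individual values.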

Page 9

Anomaly Detection

Rainfall, UK E-Petitions, Cybersecurity

Day 1 Day 2
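
The slides do not say which anomaly detection method was used. As one possible sketch, the code below flags values with a large modified z-score based on the median absolute deviation (MAD), which is less sensitive to the outliers themselves than a mean and standard deviation. The readings, the 3.5 threshold, and the `mad_outliers` helper are assumptions.

```python
from statistics import median


def mad_outliers(values, threshold=3.5):
    """Return (index, value) pairs whose modified z-score exceeds the threshold."""
    centre = median(values)
    mad = median(abs(v - centre) for v in values)
    if mad == 0:
        return []  # No spread to measure against, so nothing is flagged.
    # 0.6745 rescales the MAD so the score is comparable to a standard
    # z-score when the data are roughly normal.
    return [
        (i, v)
        for i, v in enumerate(values)
        if abs(0.6745 * (v - centre) / mad) > threshold
    ]


# Hypothetical daily rainfall readings in millimetres, invented for illustration.
readings = [2.1, 0.0, 3.4, 1.8, 250.0, 2.6, 0.4]
outliers = mad_outliers(readings)
print(outliers)
```

A flagged value is only a candidate: a reading of 250.0 could be a sensor fault, a unit mix-up, or a genuine extreme event, and deciding which needs domain knowledge.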

Page 10

https://alan-turing-institute.github.io/wrangling-tests/

Page 11

aida-dwt-petitions/code/Wrangling tasks for UK Petitions data.ipynb

Page 12

turing.ac.uk
@turinginst
