The Alan Turing Institute
05/09/2017
Towards Automating Data Wrangling – Curation of Example Datasets
Towards Automated Data Wrangling
Curation of Example Datasets
May Yong, Research Software Engineer
The Alan Turing Institute
Data wrangling is the process of going from “raw” data to “usable data”.
James Geddes, Principal Data Scientist, The Alan Turing Institute
Transparency and reproducibility are essential to the data wrangling process.
Data wrangling tools should reflect this.
Wrangling challenges
• Obtaining, or inferring a data dictionary
• Data integration
• Record linkage
• Spelling and format variability
• Reformatting the structure of the data
• Handling missing data
• Anomaly detection
“Improving the Data Analytics Process”, Turing Institute workshop, 18-21 July 2016
Datasets containing wrangling tasks
• Surgery Outcomes
• Web browsing history
• Neonatal ICU
• Rainfall
• UK E-Petitions
• Cybersecurity
Task: To find raw data and document the wrangling required to bring it to a stage where it can be analysed.
Obtaining, or inferring a data dictionary
Understanding the meaning of individual data, fields and tables or other complex structures
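One small part of inferring a data dictionary can be sketched as guessing a type for each column from its values. This is only an illustrative sketch, not the method used in the project: the function names, the candidate types and the sample CSV (with made-up column names and values) are all assumptions.

```python
import csv
import io
from datetime import date

def infer_type(values):
    """Return the first candidate type that parses every non-empty value."""
    present = [v for v in values if v and v.strip()]
    if not present:
        return "empty"
    # Candidate parsers, tried from most to least specific (an assumed ordering).
    for name, parse in (("integer", int), ("float", float), ("date", date.fromisoformat)):
        try:
            for v in present:
                parse(v)
            return name
        except ValueError:
            continue
    return "string"

def infer_dictionary(csv_text):
    """Map each column name of a CSV document to an inferred type name."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return {col: infer_type([row[col] for row in rows]) for col in rows[0]}

# Hypothetical sample; the last row has an empty rain_mm field.
sample = "station,date,rain_mm\nA1,2016-03-25,4.2\nB7,2016-03-26,\n"
print(infer_dictionary(sample))
```

Types are only the mechanical half of a data dictionary; the meaning of a field usually still has to come from documentation or a domain expert.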
Datasets: Web browsing history, Neonatal ICU, Rainfall, UK E-Petitions, Cybersecurity
Data Integration
Combining data from multiple sources that is conceptually “about the same thing.”
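As a rough illustration of integration, the sketch below merges two sources that describe the same entities under a shared key. The field names and figures are invented for the example, and the `integrate` helper is hypothetical rather than part of any tool mentioned in the talk.

```python
def integrate(left, right, key):
    """Outer-join two lists of dicts on `key`, merging the fields of both sources."""
    merged = {}
    for source in (left, right):
        for record in source:
            merged.setdefault(record[key], {}).update(record)
    # Sort by key so the output order does not depend on input order.
    return [merged[k] for k in sorted(merged)]

# Made-up records from two hypothetical sources about the same constituencies.
signatures = [
    {"constituency": "Cambridge", "signatures": 120},
    {"constituency": "Leeds East", "signatures": 45},
]
population = [
    {"constituency": "Cambridge", "population": 105000},
    {"constituency": "York Central", "population": 98000},
]

combined = integrate(signatures, population, "constituency")
for record in combined:
    print(record)
```

Entities present in only one source are kept with the fields that source provides, which makes the gaps between sources visible instead of silently dropping rows.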
Datasets: Rainfall, UK E-Petitions, Cybersecurity
Record linkage
Recognising that two distinct pieces of information in the data do in fact concern the same entity
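A minimal sketch of record linkage, assuming that the two pieces of information are free-text names and that string similarity is a good enough signal. It uses `difflib.SequenceMatcher` from the Python standard library; the normalisation rules, the 0.85 threshold and the example names are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher

def normalise(name):
    """Lower-case, drop full stops and commas, and collapse whitespace."""
    return " ".join(name.lower().replace(".", " ").replace(",", " ").split())

def similarity(a, b):
    """Similarity ratio in [0, 1] between two normalised names."""
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()

def link(records_a, records_b, threshold=0.85):
    """Pair each record in A with its most similar record in B, if similar enough."""
    links = []
    for a in records_a:
        best = max(records_b, key=lambda b: similarity(a, b))
        if similarity(a, best) >= threshold:
            links.append((a, best))
    return links

# Hypothetical names as they might appear in two different sources.
source_a = ["St. Mary's Hospital", "Royal Infirmary"]
source_b = ["st marys hospital", "City Clinic"]
print(link(source_a, source_b))
```

Comparing every pair is quadratic in the number of records, so real linkage pipelines usually restrict comparisons to candidate blocks first; that step is left out here.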
Coping with spelling and format variability
Recovering the value of a datum from its representation (e.g. recognising the string “25 Mar 16” as the ISO 8601 date 2016-03-25).
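The date example can be sketched with the standard library by trying a short list of candidate formats in turn. The list of formats is an assumption, and `%b` matches month abbreviations in the current locale, so this sketch assumes an English locale.

```python
from datetime import datetime

# Candidate input formats, tried in order (an assumed, non-exhaustive list).
CANDIDATE_FORMATS = ("%d %b %y", "%d/%m/%Y", "%Y-%m-%d")

def to_iso_date(text):
    """Return the ISO 8601 date for `text`, or None if no candidate format matches."""
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None

print(to_iso_date("25 Mar 16"))  # 2016-03-25
```

Returning `None` for unrecognised strings keeps the failure explicit, so values that could not be recovered can be reported rather than guessed.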
Reformat data structure
Switching from “wide” to “tall” format, normalising/de-normalising relational datasets
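The switch from “wide” to “tall” can be sketched in plain Python: each non-identifier column of a wide row becomes its own row in the tall output. The function, the column names and the values below are hypothetical and chosen only to show the shape change.

```python
def wide_to_tall(rows, id_field, variable_name="variable", value_name="value"):
    """Turn one row per entity (wide) into one row per entity and variable (tall)."""
    tall = []
    for row in rows:
        for column, value in row.items():
            if column == id_field:
                continue  # the identifier is repeated on every tall row instead
            tall.append({id_field: row[id_field], variable_name: column, value_name: value})
    return tall

# Made-up wide table: one row per station, one column per day.
wide = [
    {"station": "A1", "mon": 4.2, "tue": 0.0},
    {"station": "B7", "mon": 1.1, "tue": 3.5},
]

tall = wide_to_tall(wide, "station", variable_name="day", value_name="rain_mm")
for record in tall:
    print(record)
```

The tall form has one observation per row, which tends to be easier to filter, group and join than a table whose column names carry data.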
Datasets: Neonatal ICU, Rainfall, UK E-Petitions
Day 1 Day 2
Missing Data
Datasets: Neonatal ICU, Rainfall, UK E-Petitions
Missing data sources in ‘Neonatal’
Missing weather stations in ‘Rainfall’
Missing postcodes in ‘UK E-Petitions’
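A first step in handling cases like these is often just counting what is missing. The sketch below reports the number of missing values per field, treating absent keys and a few common placeholder strings as missing. The placeholder list and the sample records (including the postcode) are made up for illustration.

```python
# Strings treated as "missing" after trimming and lower-casing (an assumed list).
MISSING_MARKERS = {"", "na", "n/a", "null", "none", "-"}

def is_missing(value):
    """True if the value is absent or is one of the placeholder strings."""
    return value is None or str(value).strip().lower() in MISSING_MARKERS

def missing_report(rows):
    """Count missing values per field across a list of dict records."""
    fields = sorted({field for row in rows for field in row})
    return {field: sum(is_missing(row.get(field)) for row in rows) for field in fields}

# Hypothetical records; the postcode is empty, a placeholder, or absent in three of them.
records = [
    {"name": "A", "postcode": "AB1 2CD"},
    {"name": "B", "postcode": ""},
    {"name": "C", "postcode": "N/A"},
    {"name": "D"},
]
print(missing_report(records))
```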
Day 1 Day 2
Anomaly Detection
Datasets: Rainfall, UK E-Petitions, Cybersecurity
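The slide does not say which anomaly detection technique was used. As one simple possibility, the sketch below flags numeric values with a large modified z-score based on the median absolute deviation (MAD). The 3.5 cut-off is a commonly quoted default, and the readings are invented.

```python
from statistics import median

def mad_outliers(values, threshold=3.5):
    """Return the values whose modified z-score (based on the MAD) exceeds the threshold."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # more than half the values are identical; the score is undefined
    # 0.6745 scales the MAD so that the score is comparable to a standard z-score.
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]

# Made-up readings with one implausibly large value.
readings = [2.0, 2.5, 3.0, 2.2, 2.8, 250.0]
print(mad_outliers(readings))
```

The median and MAD are used instead of the mean and standard deviation because a single extreme value barely moves them, so the outlier does not mask itself.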
Day 1 Day 2
https://alan-turing-institute.github.io/wrangling-tests/
aida-dwt-petitions/code/Wrangling tasks for UK Petitions data.ipynb
The Alan Turing Institute
turing.ac.uk
@turinginst