class website cx4242: data cleaning - visualization · class website cx4242: data cleaning mahdi...
TRANSCRIPT
![Page 1: Class Website CX4242: Data Cleaning - Visualization · Class Website CX4242: Data Cleaning Mahdi Roozbahani Lecturer, Computational Science and Engineering, Georgia Tech. Data Cleaning](https://reader036.vdocuments.net/reader036/viewer/2022071107/5fe216f39fda9b5fe20c16b6/html5/thumbnails/1.jpg)
Class Website
CX4242:
Data CleaningMahdi Roozbahani
Lecturer, Computational Science and
Engineering, Georgia Tech
![Page 2: Class Website CX4242: Data Cleaning - Visualization · Class Website CX4242: Data Cleaning Mahdi Roozbahani Lecturer, Computational Science and Engineering, Georgia Tech. Data Cleaning](https://reader036.vdocuments.net/reader036/viewer/2022071107/5fe216f39fda9b5fe20c16b6/html5/thumbnails/2.jpg)
Data CleaningHow dirty is real data?
![Page 3: Class Website CX4242: Data Cleaning - Visualization · Class Website CX4242: Data Cleaning Mahdi Roozbahani Lecturer, Computational Science and Engineering, Georgia Tech. Data Cleaning](https://reader036.vdocuments.net/reader036/viewer/2022071107/5fe216f39fda9b5fe20c16b6/html5/thumbnails/3.jpg)
Examples
• Jan 19, 2016
• January 19, 16
• 1/19/16
• 2006-01-19
• 19/1/16
3
How dirty is real data?
http://blogs.verdantis.com/wp-content/uploads/2015/02/Data-cleansing.jpg
![Page 4: Class Website CX4242: Data Cleaning - Visualization · Class Website CX4242: Data Cleaning Mahdi Roozbahani Lecturer, Computational Science and Engineering, Georgia Tech. Data Cleaning](https://reader036.vdocuments.net/reader036/viewer/2022071107/5fe216f39fda9b5fe20c16b6/html5/thumbnails/4.jpg)
4
How dirty is real data?
Discuss with you neighbors (group of 2-3)
60 seconds
Comes up with 5+ kinds of “data dirtiness”
![Page 5: Class Website CX4242: Data Cleaning - Visualization · Class Website CX4242: Data Cleaning Mahdi Roozbahani Lecturer, Computational Science and Engineering, Georgia Tech. Data Cleaning](https://reader036.vdocuments.net/reader036/viewer/2022071107/5fe216f39fda9b5fe20c16b6/html5/thumbnails/5.jpg)
• Missing or corrupted (NaN, null)
• Numbers stored as string (“1232”)
• Different units
• Spelling/typos
• Different string encodings
• Outliers (due to data recording)
• geocoding, timezone offsets (missing +, -)
• Duplicate data
• Fake data (malicious)
• Sql injection
• Different software version generating slightly different formats
• Cap locks
• Semi-colons
• Structure (json objects)
• Invisible characters
• Different delimiters
• Indentation
5
How dirty is real data?
![Page 6: Class Website CX4242: Data Cleaning - Visualization · Class Website CX4242: Data Cleaning Mahdi Roozbahani Lecturer, Computational Science and Engineering, Georgia Tech. Data Cleaning](https://reader036.vdocuments.net/reader036/viewer/2022071107/5fe216f39fda9b5fe20c16b6/html5/thumbnails/6.jpg)
Importance of Data Cleaning
![Page 7: Class Website CX4242: Data Cleaning - Visualization · Class Website CX4242: Data Cleaning Mahdi Roozbahani Lecturer, Computational Science and Engineering, Georgia Tech. Data Cleaning](https://reader036.vdocuments.net/reader036/viewer/2022071107/5fe216f39fda9b5fe20c16b6/html5/thumbnails/7.jpg)
“80%” Time Spent on Data Preparation
Cleaning Big Data: Most Time-Consuming, Least
Enjoyable Data Science Task, Survey Says [Forbes]http://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-
consuming-least-enjoyable-data-science-task-survey-says/#73bf5b137f75
13
![Page 8: Class Website CX4242: Data Cleaning - Visualization · Class Website CX4242: Data Cleaning Mahdi Roozbahani Lecturer, Computational Science and Engineering, Georgia Tech. Data Cleaning](https://reader036.vdocuments.net/reader036/viewer/2022071107/5fe216f39fda9b5fe20c16b6/html5/thumbnails/8.jpg)
Data Janitor
![Page 9: Class Website CX4242: Data Cleaning - Visualization · Class Website CX4242: Data Cleaning Mahdi Roozbahani Lecturer, Computational Science and Engineering, Georgia Tech. Data Cleaning](https://reader036.vdocuments.net/reader036/viewer/2022071107/5fe216f39fda9b5fe20c16b6/html5/thumbnails/9.jpg)
Writing “Clean Code”
• Be careful with trailing whitespaces
• Indent code (spaces vs tabs) following
coding practices in your team/companyhttps://google.github.io/styleguide/javaguide.html#s4.2-block-indentation
17
http://codeimpossible.com/2012/04/02/trailing-whitespace-is-evil-don-t-commit-evil-into-your-repo/
http://www.businessinsider.com/tabs-vs-spaces-from-silicon-valley-2016-5
…there’s no way I'm going to be with someone who uses spaces over tabs…
Trailing whitespace is evil. Don't commit evil into your repo.
![Page 10: Class Website CX4242: Data Cleaning - Visualization · Class Website CX4242: Data Cleaning Mahdi Roozbahani Lecturer, Computational Science and Engineering, Georgia Tech. Data Cleaning](https://reader036.vdocuments.net/reader036/viewer/2022071107/5fe216f39fda9b5fe20c16b6/html5/thumbnails/10.jpg)
18
Both available free for GT students on
http://safaribooksonline.com/
![Page 11: Class Website CX4242: Data Cleaning - Visualization · Class Website CX4242: Data Cleaning Mahdi Roozbahani Lecturer, Computational Science and Engineering, Georgia Tech. Data Cleaning](https://reader036.vdocuments.net/reader036/viewer/2022071107/5fe216f39fda9b5fe20c16b6/html5/thumbnails/11.jpg)
Data CleanersWatch videos
• Data Wrangler (research at Stanford)
• Open Refine (previously Google Refine)
Write down
• Examples of data dirtiness
• Tool’s features demo-ed (or that you like)
Will collectively summarize similarities and
differences afterwards
Open Refine: http://openrefine.org
Data Wrangler: http://vis.stanford.edu/wrangler/
19
![Page 12: Class Website CX4242: Data Cleaning - Visualization · Class Website CX4242: Data Cleaning Mahdi Roozbahani Lecturer, Computational Science and Engineering, Georgia Tech. Data Cleaning](https://reader036.vdocuments.net/reader036/viewer/2022071107/5fe216f39fda9b5fe20c16b6/html5/thumbnails/12.jpg)
![Page 13: Class Website CX4242: Data Cleaning - Visualization · Class Website CX4242: Data Cleaning Mahdi Roozbahani Lecturer, Computational Science and Engineering, Georgia Tech. Data Cleaning](https://reader036.vdocuments.net/reader036/viewer/2022071107/5fe216f39fda9b5fe20c16b6/html5/thumbnails/13.jpg)
![Page 14: Class Website CX4242: Data Cleaning - Visualization · Class Website CX4242: Data Cleaning Mahdi Roozbahani Lecturer, Computational Science and Engineering, Georgia Tech. Data Cleaning](https://reader036.vdocuments.net/reader036/viewer/2022071107/5fe216f39fda9b5fe20c16b6/html5/thumbnails/14.jpg)
What can Open Refine and Wrangler do?
• [w,o] undo, redo
• [o,w] history of data
• [o] transform data (e.g., take log)
• [w] data editing/highlighting/interaction may be easier
• [o] clustering
• [w] transpose/pivot
• [w] fill in missing data
• [w] suggestions + preview
O = Open Refine
W = Data wrangler 22
![Page 15: Class Website CX4242: Data Cleaning - Visualization · Class Website CX4242: Data Cleaning Mahdi Roozbahani Lecturer, Computational Science and Engineering, Georgia Tech. Data Cleaning](https://reader036.vdocuments.net/reader036/viewer/2022071107/5fe216f39fda9b5fe20c16b6/html5/thumbnails/15.jpg)
!The videos only show
some of the tools’ features.
Try them out.
Open Refine: http://openrefine.org
Data Wrangler: http://vis.stanford.edu/wrangler/
37