data preparation of data science
TRANSCRIPT
Data Preparation for Data Science
Casey Stella (@casey_stella)
2016
Casey Stella@casey_stella (Hortonworks) Data Preparation for Data Science 2016
Table of Contents
Preliminaries
Demo
Questions
Introduction
Hi, I’m Casey Stella!
Garbage In ⟹ Garbage Out
“80% of the work in any data project is in cleaning the data.”
— DJ Patil in Data Jujitsu
Data Cleansing ⟹ Data Understanding
There are two ways to understand your data:
• Syntactic Understanding
• Semantic Understanding
If you hope to get anything out of your data, you have to have a handle on both.
Syntactic Understanding: True Types
A true type is a label applied to data points x_i such that the x_i are mutually comparable.
• Schema type != true data type
• A specific column can have many different types
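As a rough illustration of the point above, the following Python sketch labels each raw value in a column with a coarse true type and counts the labels. The function names, the label set, and the `%Y-%m-%d` date format are assumptions made for this example and are not taken from the talk.

```python
from collections import Counter
from datetime import datetime


def true_type(value):
    """Return a coarse, hypothetical true-type label for one raw string value."""
    text = value.strip()
    if text == "":
        return "missing"
    try:
        int(text)
        return "integer"
    except ValueError:
        pass
    try:
        float(text)
        return "float"
    except ValueError:
        pass
    try:
        # Assumed date format; real data usually needs several candidates.
        datetime.strptime(text, "%Y-%m-%d")
        return "date"
    except ValueError:
        pass
    return "string"


def true_type_counts(column):
    """Count how many values in a column fall under each true-type label."""
    return Counter(true_type(v) for v in column)


# A column whose schema type is "string" but which mixes several true types.
column = ["42", "3.14", "2016-01-05", "N/A", "", "17"]
print(true_type_counts(column))
```

Values that share a label can be compared with each other, while comparing across labels (for example an integer against "N/A") is what the schema type alone would not warn you about.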
Syntactic Understanding: Density
Data density is an indication of how data is clumped together.
• For numerical data, distributions and statistical characteristics are informative.
• For non-numeric data, counts and distinct counts of a canonical representation are extremely useful.
Canonical representations are representations which give you an idea at a glance of the data format:
• Replacing digits with the character ‘d’
• Stripping whitespace
• Normalizing punctuation
Data density is an assumption underlying any conclusions drawn from your data.
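A minimal Python sketch of the canonical-representation idea for non-numeric data follows. The helper names and the sample values are hypothetical, and "normalizing punctuation" is interpreted here as mapping a few Unicode dash and quote variants to their ASCII forms, which is only one possible reading.

```python
import re
from collections import Counter

# Assumed punctuation normalization: dash and quote variants to ASCII.
PUNCT_MAP = str.maketrans({
    "\u2013": "-",   # en dash
    "\u2014": "-",   # em dash
    "\u2018": "'",   # left single quote
    "\u2019": "'",   # right single quote
    "\u201c": '"',   # left double quote
    "\u201d": '"',   # right double quote
})


def canonicalize(value):
    """Reduce a raw value to a canonical form that exposes its format."""
    text = re.sub(r"\s+", "", value)      # strip whitespace
    text = text.translate(PUNCT_MAP)      # normalize punctuation
    text = re.sub(r"[0-9]", "d", text)    # replace digits with 'd'
    return text


def canonical_density(values):
    """Count occurrences of each canonical form in a column."""
    return Counter(canonicalize(v) for v in values)


# Hypothetical phone-number column with inconsistent formatting.
samples = ["555-1234", "555 - 9876", "555\u20130000", "(555) 123-4567", "n/a"]
density = canonical_density(samples)
print("counts:", density.most_common())
print("distinct canonical forms:", len(density))
```

The counts show which formats dominate the column, and the distinct count shows how many formats there are to handle before the values can be treated as one true type.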
Syntactic Understanding: Density over Time
$\frac{\Delta \text{Density}}{\Delta t}$ is how data clumps change over time.
This kind of analysis can show
• Problems in the data pipeline
• Whether the assumptions of your analysis are violated
$\frac{\Delta \text{Density}}{\Delta t} \implies$
• Automation
• Outlier Alerting
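One way to automate this kind of check is sketched below in Python: compute the distribution of values (for example canonical forms) in each time window, measure how far it moved from the previous window, and flag large moves. The use of total variation distance, the 0.3 threshold, the function names, and the sample windows are all assumptions chosen for illustration; the talk does not prescribe a specific measure.

```python
from collections import Counter


def distribution(values):
    """Turn a list of values into a dict of relative frequencies."""
    counts = Counter(values)
    total = sum(counts.values())
    return {key: count / total for key, count in counts.items()}


def total_variation(p, q):
    """Total variation distance between two discrete distributions (0 to 1)."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def density_alerts(windows, threshold=0.3):
    """Return (label, shift) for each window whose density moved more than threshold."""
    alerts = []
    previous = None
    for label, values in windows:
        current = distribution(values)
        if previous is not None:
            shift = total_variation(previous, current)
            if shift > threshold:
                alerts.append((label, shift))
        previous = current
    return alerts


# Hypothetical monthly windows of canonical forms for one column.
windows = [
    ("2016-01", ["ddd-dddd"] * 9 + ["n/a"] * 1),
    ("2016-02", ["ddd-dddd"] * 8 + ["n/a"] * 2),
    ("2016-03", ["ddd-dddd"] * 2 + ["n/a"] * 8),
]
alerts = density_alerts(windows)
for label, shift in alerts:
    print(f"density shift in {label}: {shift:.2f}")
```

In this made-up data the share of "n/a" jumps in the third window, which is the kind of change that may point to a pipeline problem or to an analysis assumption that no longer holds.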
Semantic Understanding: “Do what I mean, not what I say”
Semantic understanding is understanding based on how the data is used rather than how it is stored.
• Equivalences found through semantic understanding are often context sensitive.
• May come from humans (e.g. domain experience and ontologies)
• May come from machine learning (e.g. analyzing usage patterns to find synonyms)
Semantic understanding does not imply SkyNet
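To make the context-sensitivity point concrete, here is a small Python sketch of a human-curated equivalence table keyed by context. The contexts, terms, and the `semantic_key` helper are entirely hypothetical; in practice such a table might come from domain experts, an ontology, or a learned synonym model.

```python
# Hypothetical, hand-curated equivalences; the same token maps to
# different meanings depending on the context it is used in.
EQUIVALENCES = {
    "cardiology": {
        "mi": "myocardial infarction",
        "heart attack": "myocardial infarction",
    },
    "addresses": {
        "mi": "michigan",
    },
}


def semantic_key(term, context):
    """Map a raw term to its agreed meaning within a given context."""
    cleaned = term.strip().lower()
    # Unknown contexts or terms fall back to the cleaned term itself.
    return EQUIVALENCES.get(context, {}).get(cleaned, cleaned)


print(semantic_key("MI", "cardiology"))
print(semantic_key("Heart Attack", "cardiology"))
print(semantic_key("MI", "addresses"))
```

Two values that are stored differently ("MI", "Heart Attack") become equal once the context is known, while the same stored value ("MI") splits into different meanings across contexts.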
DEMO
Implications for Team Structure
To be successful,
• Your data science teams have to be integrally involved in the data transformation and understanding.
• Your data science teams have to be willing to get their hands dirty.
• Your data science teams have to be allowed to get their hands dirty.
• Your data science teams need software engineering chops.
Questions
Thanks for your attention! Questions?
• Code & scripts for this talk available on my github presentation page.¹
• Find me at http://caseystella.com
• Twitter handle: @casey_stella
• Email address: [email protected]
¹ http://github.com/cestella/presentations/