Introduction to Data Pre-processing and Cleaning
Post on 13-Jan-2017
Data Preparation and Cleaning
February 22, 2016
Matteo Manca matteo.manca@eurecat.org
Matteo Manca
Researcher @ Eurecat (Social Media group), BCN
PhD @ Cagliari, Italy
Research interests:
• social media mining
• social network analysis
• computational social science
• data science
Contacts: matteo.manca@eurecat.org
https://mattemanca.wordpress.com
Chapter index
Big data
Topics
• Topic 1: Big Data Economy
• Topic 2: Environment
• Topic 3: Data Exploration
• Topic 4: Data Ingestion & Storage
• Topic 5: Data Preparation - Cleaning
• Topic 6: Distributed Systems (Hadoop)
• Topic 7: Distributed Analytics (PIG)
Data Preparation - Cleaning
• Why are we interested in data preparation and cleaning?
• Introduction to data pre-processing and cleaning (main concepts and main steps)
• Best practices
• Data pre-processing and cleaning in R: step-by-step tutorial
Why are we interested in data pre-processing and cleaning? Let's analyse our data!
1. Average test score?
2. Most common year?
3. % of male and female?
Raw data
Why are we interested in data pre-processing and cleaning?
Raw data
• Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
• Noisy: containing errors or outliers
• Inconsistent: containing discrepancies in codes or names
• Data analysts spend much, if not most, of their time preparing the data before doing the analysis
• A commonly cited rule of thumb is that about 80% of data mining and analysis work is really data preparation.
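The three kinds of raw-data problems listed above can be checked for before any analysis starts. The following is a minimal sketch in Python using only the standard library (the tutorial part of this deck uses R). The records, the field names, and the plausible year range are all invented for illustration and are not the data shown on the slide.

```python
from collections import Counter

# Hypothetical raw records (invented for illustration, not the slide's data).
raw = [
    {"score": "78", "year": "2014", "gender": "M"},
    {"score": "",   "year": "2014", "gender": "male"},    # incomplete: no score
    {"score": "85", "year": "2104", "gender": "F"},       # noisy: implausible year
    {"score": "91", "year": "2015", "gender": "Female"},  # inconsistent coding
]

def audit(records):
    """Count incomplete and noisy records and list the gender codes in use."""
    incomplete = sum(
        1 for r in records if any(v.strip() == "" for v in r.values())
    )
    # Assumed plausibility range for the year; adjust to the data at hand.
    noisy = sum(1 for r in records if not 1900 <= int(r["year"]) <= 2016)
    # More distinct codes than real categories signals inconsistent naming.
    gender_codes = Counter(r["gender"] for r in records)
    return {
        "incomplete": incomplete,
        "noisy": noisy,
        "gender_codes": dict(gender_codes),
    }

report = audit(raw)
print(report)
```

With data in this state, none of the three questions (average score, most common year, share of male and female) can be answered reliably until the flagged records are dealt with.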
Data Pre-processing and Cleaning
Process of transforming raw data into consistent data that can be analyzed.
Consistent data is the stage at which the data is ready for analysis.
Main steps:
• Handle missing values (ignore the tuple, fill in the missing value with the mean/mode value, predict it, etc.)
• Identify or remove outliers
• Resolve inconsistencies
• Data transformation: normalization and aggregation
[Diagram: raw data -> data pre-processing and cleaning -> consistent data]
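The main steps above can be sketched on a toy column of test scores. This is only one possible reading of the steps, written in Python with the standard library (the deck's own tutorial uses R). The scores, the valid range 0 to 100, the code mapping, and all function names are assumptions made for the example. Out-of-range values are treated as missing before the mean is computed, so that an outlier does not distort the fill value.

```python
from statistics import mean

# Hypothetical test scores; None marks a missing value, 990.0 is a typo-like outlier.
scores = [78.0, None, 85.0, 91.0, 990.0, 70.0]

def mark_outliers_missing(values, low, high):
    """Identify outliers: values outside the assumed valid range become None."""
    return [v if v is None or low <= v <= high else None for v in values]

def fill_missing_with_mean(values):
    """Handle missing values: replace None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    m = mean(observed)
    return [m if v is None else v for v in values]

def harmonize(code):
    """Resolve inconsistencies: map different spellings to a single code."""
    mapping = {"m": "M", "male": "M", "f": "F", "female": "F"}
    return mapping[code.strip().lower()]

def min_max_normalize(values):
    """Data transformation: rescale values linearly to the interval [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

cleaned = fill_missing_with_mean(mark_outliers_missing(scores, 0.0, 100.0))
normalized = min_max_normalize(cleaned)
print(cleaned)
print(harmonize(" Male "))
```

Ignoring the tuple or predicting the missing value, both mentioned in the list, would be equally valid alternatives to the mean fill used here.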
Data Pre-processing and Cleaning
Consistent Data
• Each variable you measure should be in one column
• Each different observation (record) should be in a different row
• If we are working with different kinds of variables, there should be a separate data frame for each kind, linked to one another (for example, through a shared key column)
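As an illustration of these three rules, the sketch below reshapes a hypothetical "wide" table, where each year is its own column, into one row per observation, and then links it to a second table through a shared key. It is written in Python with the standard library (the deck's tutorial uses R); the table contents, the `student_id` key, and the helper name `to_long` are invented for the example.

```python
# Hypothetical "wide" table: one row per student, one column per year.
# Here the variable "year" is hidden in the column names.
wide = [
    {"student_id": 1, "2014": 78, "2015": 82},
    {"student_id": 2, "2014": 85, "2015": 91},
]

def to_long(rows, id_key, var_name, value_name):
    """Turn every non-id column into its own (id, variable, value) row."""
    long_rows = []
    for row in rows:
        for column, value in row.items():
            if column != id_key:
                long_rows.append(
                    {id_key: row[id_key], var_name: column, value_name: value}
                )
    return long_rows

# One variable per column (student_id, year, score), one observation per row.
scores = to_long(wide, "student_id", "year", "score")

# A second table for a different kind of variable, linked through student_id.
students = {1: {"gender": "F"}, 2: {"gender": "M"}}
joined = [{**s, **students[s["student_id"]]} for s in scores]

for row in joined:
    print(row)
```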
Data Pre-processing and Cleaning
Best practices
• Pipeline: an explicit "recipe" used to go from step i to step i+1 (all steps should be recorded)
• A code book that describes each variable and its values in the tidy dataset
• Make variable names human-readable
• Save your clean / consistent data to files to avoid repeating the pre-processing and data cleaning each time (one file per data frame / table)
• Markdown (.md) files are usually used for this documentation (https://en.wikipedia.org/wiki/Markdown)
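One way to follow the pipeline and save-to-file practices is to keep the recipe as an explicit, ordered list of steps, record what each step did, and write the result to one file per table. The sketch below assumes this design and uses Python's standard library (the deck's tutorial uses R). The step functions, column names, sample rows, and output file name are all hypothetical.

```python
import csv
import os
import tempfile

def strip_whitespace(rows):
    """Step 1: trim surrounding whitespace from every value."""
    return [{k: v.strip() for k, v in r.items()} for r in rows]

def drop_incomplete(rows):
    """Step 2: drop rows that have an empty value."""
    return [r for r in rows if all(v != "" for v in r.values())]

# The "recipe": an explicit, ordered list of steps.
PIPELINE = [strip_whitespace, drop_incomplete]

def run_pipeline(rows, steps):
    """Apply each step in order and record what it did."""
    log = []
    for step in steps:
        rows = step(rows)
        log.append(f"{step.__name__}: {len(rows)} rows")
    return rows, log

def save_csv(rows, path):
    """Save one clean table to one CSV file."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)

# Hypothetical raw rows with human-readable column names.
raw = [
    {"test_score": " 78 ", "exam_year": "2014"},
    {"test_score": "", "exam_year": "2015"},
]
clean, log = run_pipeline(raw, PIPELINE)
out_path = os.path.join(tempfile.gettempdir(), "scores_clean.csv")
save_csv(clean, out_path)
print(log)
```

The recorded log, together with a description of each column, is the kind of material that would go into the code book.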
Data Pre-processing and Cleaning in R
R is a free software environment for statistical computing and graphics (https://www.r-project.org).
RStudio is a user interface for R (https://www.rstudio.com).
Questions?
Data Pre-processing and Cleaning
© 2015, Barcelona Technology School, www.barcelonatechnologyschoo.com
References
1. https://cran.r-project.org/doc/contrib/de_Jonge+van_der_Loo-Introduction_to_data_cleaning_with_R.pdf
2. https://www.coursera.org/learn/data-cleaning
3. https://www.coursera.org/learn/r-programming
4. http://www.r-bloggers.com