CSE217 INTRODUCTION TO DATA SCIENCE
Spring 2019
Marion Neumann
LECTURE 2: EXPLORATORY DATA ANALYSIS
RECAP: WHAT IS DATA SCIENCE?
…solving problems with data…
• problem: scientific, social, or business problem
• data: collect & understand data, clean & format data
• solution: use data to create a solution (data analysis and/or machine learning)
WHERE DOES DATA COME FROM?
• Internal Sources
  • business-centric data in organizational databases recording day-to-day operations
  • scientific or experimental data
• Existing External Sources → data is available for free or a fee
  • public government databases, stock market data, Yelp reviews
  • usually (somewhat) pre-processed
• Collect your own data → beyond the scope of this course
• Online Data → typically raw data
  • from APIs (e.g. Google Maps API, Facebook API, Twitter API)
  • web scraping: extracting data, using software, scripts, or by hand, from what is displayed on a page or what is contained in the HTML file

Caution: not all data that is accessible is good to be used!
• Are you violating their terms of service?
• Are there privacy concerns for the website and their clients?
• Do they have an API or fee that you are bypassing?
• Are they willing to share this data?
VARIABLES AND DATA TYPES
Example: https://www.zillow.com
• Types of Variables
  • numeric: ordered; continuous or discrete
  • categorical: no order
  • binary: categorical with 2 categories (e.g. yes/no)
• Data Types
  • integer: discrete, categorical, binary
  • Boolean: binary
  • floating point: continuous
  • string: formatted text, categorical, free-form text
  • compound data types: lists, dictionaries, arrays
  → we prefer numeric data types (arrays)
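As a rough illustration of how these variable types map onto Python data types, the sketch below describes one hypothetical housing record. All field names and values are made up for illustration and are not taken from Zillow.

```python
# One hypothetical housing record; all field names and values are illustrative.
listing = {
    "bedrooms": 3,                                # integer: discrete numeric
    "price": 249900.0,                            # floating point: continuous numeric
    "has_garage": True,                           # Boolean: binary
    "home_type": "condo",                         # string: categorical (no order)
    "description": "Sunny unit near the park.",   # string: free-form text
    "room_sizes_sqft": [120, 150, 200],           # compound data type: list
}

# Print each field together with the Python type that stores it.
for name, value in listing.items():
    print(f"{name:16} {type(value).__name__:6} {value!r}")
```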
DATA(SET) REPRESENTATION
• Tables (csv, xlsx etc.)
  • two-dimensional representation
  • rows represent data records
  • each column represents one type of measurement
• Structured Data (json, xml etc.)
  • complex and multi-tiered dictionary
• Semi-structured Data (.txt)
  • flat text representation with known structure
  • data can be easily parsed
• Unstructured Data (.txt)
  • prose text
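To make the difference between a table and structured data concrete, here is a minimal sketch using Python's standard `csv` and `json` modules. The tiny datasets are hypothetical and only serve to show the two shapes.

```python
import csv
import io
import json

# Table (csv): each row is a data record, each column one type of measurement.
table_text = "name,age\nAda,21\nBen,23\n"
rows = list(csv.DictReader(io.StringIO(table_text)))

# Structured data (json): a nested, multi-tiered dictionary.
json_text = '{"course": "CSE217", "students": [{"name": "Ada", "age": 21}]}'
record = json.loads(json_text)

print(rows[0]["age"])                # csv values are read as strings
print(record["students"][0]["age"])  # json keeps the number type
```

Note that values read from a csv file arrive as strings and usually need converting before analysis.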
DATA IS (ALWAYS) MESSY
• Common issues with data:
  • missing values: how do we fill in?
  • wrong values: how can we detect and correct?
  • messy format/representation
• Example: number of produce deliveries over a weekend

Common causes of messiness:
• variables/features are stored in both rows and columns
• multiple features are stored in one column
• multiple types of experimental units stored in same table
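The first cause of messiness can be sketched in a few lines. The table below is a hypothetical version of the produce-delivery example (the values are invented): the variable "day" is hidden in the column names, and the sketch reshapes it so that each row holds exactly one observation.

```python
# Hypothetical messy table: the variable "day" is spread across column names.
wide = [
    {"produce": "apples", "Sat": 3, "Sun": 5},
    {"produce": "pears", "Sat": 2, "Sun": None},  # None marks a missing value
]

# Reshape so that each row is one observation: (produce, day, deliveries).
tidy = [
    {"produce": row["produce"], "day": day, "deliveries": row[day]}
    for row in wide
    for day in ("Sat", "Sun")
]

for record in tidy:
    print(record)
```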
DATA (PRE-)PROCESSING
Goal: bring data into a format we can use for analysis (and/or machine learning)
→ use a format that is good for Python (e.g. 2d arrays)
→ recall from last lecture: data points vs features/variables

• Data Parsing and Formatting
• Data Profiling → assess data amount and quality
• Data Cleaning
• Data Engineering (more later in this course…)
  • detect outliers
  • feature engineering
  • data augmentation

Together, these steps are often called data wrangling.
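The first three steps can be sketched on a tiny example. The data below is hypothetical, and filling a gap with the column mean is only one simple cleaning option among several, not necessarily the one used in this course.

```python
import csv
import io
import statistics

# Hypothetical raw data; the price of the second data point is missing.
raw = "sqft,price\n900,150\n1200,\n1500,260\n"

# Parsing and formatting: one row per data point, one column per feature.
reader = csv.reader(io.StringIO(raw))
header = next(reader)
data = [[float(cell) if cell else None for cell in row] for row in reader]

# Profiling: how many values are missing per feature?
missing = {name: sum(row[j] is None for row in data) for j, name in enumerate(header)}
print(len(data), "data points, missing values:", missing)

# Cleaning (one simple option): fill missing values with the column mean.
for j in range(len(header)):
    known = [row[j] for row in data if row[j] is not None]
    fill = statistics.mean(known)
    for row in data:
        if row[j] is None:
            row[j] = fill

print(data)
```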
DATA ≠ DATA
• Two kinds of data: population vs. sample
• What are problems with sample data?
A population is the entire set of objects or events under study. A population can be hypothetical ("all students") or concrete (all students in this class).
A sample is a (representative) subset of the objects or events under study.
→ needed because it is often impossible or intractable to obtain or use population data.
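One problem with sample data is that a statistic computed on a sample only approximates the population value and changes with the draw. The sketch below assumes a made-up population of ages and a simple random sample; the seed is fixed only to make the run repeatable.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: ages of 10,000 people.
population = [random.randint(18, 90) for _ in range(10_000)]

# A simple random sample of 100 people, drawn without replacement.
sample = random.sample(population, k=100)

print("population mean:", statistics.mean(population))
print("sample mean:    ", statistics.mean(sample))
```

Changing the seed or the sample size shows how much the sample mean can drift from the population mean.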
EXPLORATORY DATA ANALYSIS (EDA)
Different ways of exploring data:
• explore each individual variable in the dataset
  • summary statistics
  • spread
  • distribution
• assess interactions between variables (or between individual variables and the target)
  • correlation, analysis of variance (ANOVA)
• explore data across many dimensions (more later in this course…)
  • clustering
  • dimensionality reduction (e.g. principal component analysis (PCA))
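As one example of assessing the interaction between two variables, the sketch below computes the Pearson correlation coefficient by hand. The helper name `pearson` and the study-hours data are hypothetical and chosen only for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

hours = [1, 2, 3, 4, 5]        # hypothetical study hours
score = [52, 60, 71, 80, 88]   # hypothetical exam scores
print(round(pearson(hours, score), 3))
```

A value close to 1 indicates a strong positive linear relationship, a value close to -1 a strong negative one, and a value near 0 no linear relationship.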
SUMMARY STATISTICS
• (sample) mean
• (sample) median
• Example: Ages: 17, 19, 21, 22, 23, 23, 23, 38
  What is the median age? What is the mean/average age?
• mean vs median
  • which one is easier/more efficient to compute?

Caution: the mean is sensitive to outliers!
Caution: consider practicality (efficiency) of implementation!
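The age example can be worked through with a few lines of Python. The mean is the sum of the values divided by their count, which takes a single pass over the data; the median is the middle value of the sorted data (or the average of the two middle values), which this sketch obtains through `statistics.median`, a function that sorts internally. The second half replaces the largest age with an invented extreme value to show the outlier sensitivity of the mean.

```python
import statistics

ages = [17, 19, 21, 22, 23, 23, 23, 38]

mean_age = sum(ages) / len(ages)        # one pass over the data
median_age = statistics.median(ages)    # sorts the data internally

print(mean_age)    # 23.25
print(median_age)  # 22.5

# Replacing the largest value with an extreme outlier moves the mean a lot,
# while the median stays the same.
ages_outlier = ages[:-1] + [380]
mean_outlier = sum(ages_outlier) / len(ages_outlier)
median_outlier = statistics.median(ages_outlier)

print(mean_outlier)    # 66.0
print(median_outlier)  # 22.5
```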
SUMMARY STATISTICS
• mode = value that occurs most often
  • useful for categorical variables
  → visualize with a bar plot

Source: DSFS, Ch3
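A minimal sketch of finding the mode of a categorical variable with `collections.Counter`. The fruit data is hypothetical, and the text bars only stand in for a real bar plot.

```python
from collections import Counter

# Hypothetical categorical data: favorite fruit of ten people.
fruits = ["apple", "pear", "apple", "kiwi", "apple",
          "pear", "kiwi", "apple", "pear", "apple"]

counts = Counter(fruits)
mode, frequency = counts.most_common(1)[0]
print(mode, frequency)

# Text version of a bar plot: one bar per category.
for category, count in counts.most_common():
    print(f"{category:6} {'#' * count}")
```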
MEASURES OF SPREAD
• range = max value – min value
• variance
  • Caution: does not have the same unit as x_i
• standard deviation
Why is measuring the spread important?
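The three measures can be computed for the ages from the summary statistics example as sketched below. This assumes the sample convention with n - 1 in the denominator, which is what `statistics.variance` and `statistics.stdev` use; a population variance would divide by n instead.

```python
import math
import statistics

ages = [17, 19, 21, 22, 23, 23, 23, 38]

value_range = max(ages) - min(ages)    # in years
variance = statistics.variance(ages)   # sample variance, in years squared
std_dev = statistics.stdev(ages)       # square root of the variance, in years again

print(value_range)
print(variance)
print(std_dev)
```

Because the variance averages squared deviations, its unit is the square of the data's unit; taking the square root brings the standard deviation back to the original unit.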
DATA VISUALIZATION
• Can summary statistics and measures of spread tell us everything?
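One way to see that the answer is no: the two hand-made datasets below (hypothetical values, chosen for illustration) share the same mean, median, and variance, yet look completely different once their distributions are drawn. The text histogram is only a stand-in for a real plot.

```python
import statistics
from collections import Counter

# Two hand-made datasets with identical summary statistics.
two_clusters = [1, 1, 1, 1, 9, 9, 9, 9]
two_outliers = [-3, 5, 5, 5, 5, 5, 5, 13]

for name, data in [("two_clusters", two_clusters), ("two_outliers", two_outliers)]:
    print(name, statistics.mean(data), statistics.median(data), statistics.variance(data))
    # Crude text histogram: one row per distinct value.
    for value, count in sorted(Counter(data).items()):
        print(f"  {value:3} {'#' * count}")
```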
TYPES OF VISUALIZATION
• distribution
  → how does a variable distribute over a range of possible values
• relationship
  → how do the values of multiple variables in the dataset relate
• comparison
  → how do trends in multiple variables or datasets compare
• composition
  → how does the dataset break down into subgroups
VISUALIZE DISTRIBUTION
• histogram
Caution: Trends in histograms are sensitive to the number of bins.

Source: PDSH, p245
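The effect of the number of bins can be sketched without any plotting library. The helper `bin_counts` below is a hypothetical stand-in for what a histogram function computes with equal-width bins; it assumes the values are not all identical. It is applied to the ages from the summary statistics example.

```python
def bin_counts(values, num_bins):
    """Count how many values fall into each of num_bins equal-width bins."""
    low, high = min(values), max(values)
    width = (high - low) / num_bins
    counts = [0] * num_bins
    for v in values:
        # The maximum value belongs to the last bin, hence the min(...).
        index = min(int((v - low) / width), num_bins - 1)
        counts[index] += 1
    return counts

ages = [17, 19, 21, 22, 23, 23, 23, 38]
print(bin_counts(ages, 3))  # [7, 0, 1]
print(bin_counts(ages, 7))  # [2, 2, 3, 0, 0, 0, 1]
```

With 3 bins almost everything lands in one bar; with 7 bins the cluster around 23 and the gap before 38 become visible.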
VISUALIZE RELATIONSHIP
• scatter plot
  • distribution of two variables
  • relationship between two variables

Sources: DSFS, Ch3; PDSH, p233
VISUALIZE COMPARISONS
• multiple histograms
  • visualize how different variables compare (or how a variable differs over specific groups)
→ we can also use box plots to compare different variables
VISUALIZE COMPOSITION/COMPARISON
• box plots
  • compare different variables → cf. Lab1
  • compare a quantitative variable across groups
  → highlights the range, quartiles, median and outliers

This plot illustrates composition, since it looks at classes/categories of one variable.

Source: Lab1
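The numbers a box plot highlights can be computed directly, as sketched below for the ages from the summary statistics example. The sketch assumes the common convention that points more than 1.5 times the interquartile range beyond the quartiles count as outliers; quartile definitions vary slightly between tools, so a plotting library may report somewhat different quartiles.

```python
import statistics

ages = [17, 19, 21, 22, 23, 23, 23, 38]

# Quartiles via the standard library (requires Python 3.8 or newer).
q1, median, q3 = statistics.quantiles(ages, n=4)
iqr = q3 - q1

# Assumed convention: points beyond 1.5 * IQR from the quartiles are outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [a for a in ages if a < lower_fence or a > upper_fence]

print(min(ages), q1, median, q3, max(ages))
print(outliers)
```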
VISUALIZE COMPOSITION
• pie chart
• stacked area graph
  → visualize trend over time!
ACTIVITY 2
• TASK 1: What do the following plots produced in Lab1 visualize?
• TASK 2: Which of the following visualizations are good/proper visualizations and which do you think are problematic (and why)?
Caution: Not all visualizations are good visualizations.
MORE DIMENSIONS
• How about the relationship between 3 variables?
→ 3D is not always better
CATEGORICAL VARIABLES
• use color coding for categorical variables
Data visualization can help figure out what we need to predict class labels!

[Figure: scatter plot with axes petal_length and sepal_length, color-coded by class]
SUMMARY & READING
• EDA process
  • (pre-)process data
  • summarize data
  • present/visualize distribution and relationships
• EDA goals
  • develop/find hypothesis/question(s) to be investigated
  • use data to answer the question(s)

Reading:
• DSFS
  • Ch3: Visualizing Data (matplotlib, bar/line charts, scatter plots)
• PDSH
  • Ch4: Visualization with Matplotlib
    • plotting with matplotlib (p217-221)
    • scatter plots (p233-237)
    • histograms (p245-247)