Data Mining Lectures Lecture 3: EDA and Visualization Padhraic Smyth, UC Irvine ICS 278: Data Mining Lecture 3: Exploratory Data Analysis and Visualization


ICS 278: Data Mining

Lecture 3: Exploratory Data Analysis and Visualization


Lecture 3

• Finish up material from Lecture 2

• Homework due this Thursday

• Discuss projects in some detail

• Exploratory Data Analysis and Visualization
– Reading: Chapter 3 in the text


Exploratory Data Analysis (EDA)

• get a general sense of the data
• interactive and visual
– (cleverly/creatively) exploit human visual power to see patterns
  • 3 to 5 dimensions (e.g. spatial, color, time, sound)
– e.g. plot raw data/statistics, reduce dimensions as needed
• data-driven (model-free)
• especially useful in early stages of data mining
– detect outliers (e.g. assess data quality)
– test assumptions (e.g. normal distributions?)
– identify useful raw data & transforms (e.g. log(x))

• http://www.itl.nist.gov/div898/handbook/eda/eda.htm


Summary Statistics

• not visual
• sample statistics of data X
– mean: $\mu = \sum_i X_i / n$ { minimizes $\sum_i (X_i - \mu)^2$ }
– mode: most common value in X
– median: X = sort(X), median = $X_{n/2}$ (half below, half above)
– quartiles of sorted X: Q1 value = $X_{0.25n}$, Q3 value = $X_{0.75n}$
  • interquartile range: value(Q3) - value(Q1)
  • range: max(X) - min(X) = $X_n - X_1$
– variance: $\sigma^2 = \sum_i (X_i - \mu)^2 / n$
– skewness: $\sum_i (X_i - \mu)^3 / [\sum_i (X_i - \mu)^2]^{3/2}$
  • zero if symmetric; right-skewed more common (e.g. us … Gates)
– number of distinct values for a variable (see unique.m in MATLAB)
– Note: all of these are estimates based on the sample at hand; they may differ from the “true” values (e.g., median age in US).
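The statistics on this slide can be sketched in a few lines of Python (used here instead of the lecture's MATLAB). The function name `summary_statistics` and the sample values are illustrative assumptions. The quartiles use the interpolation rule of Python's `statistics.quantiles`, which may differ slightly from the index convention on the slide, and the skewness follows the slide's formula as written.

```python
import statistics

def summary_statistics(xs):
    """Sample statistics from the slide, for a non-empty list of numbers."""
    n = len(xs)
    s = sorted(xs)
    mu = sum(s) / n
    sq = sum((x - mu) ** 2 for x in s)   # sum of squared deviations
    cu = sum((x - mu) ** 3 for x in s)   # sum of cubed deviations
    # Assumption: "inclusive" interpolation; other quartile conventions exist.
    q1, q2, q3 = statistics.quantiles(s, n=4, method="inclusive")
    return {
        "mean": mu,
        "mode": statistics.mode(s),
        "median": q2,
        "q1": q1,
        "q3": q3,
        "iqr": q3 - q1,
        "range": s[-1] - s[0],
        "variance": sq / n,                         # divides by n, as on the slide
        "skewness": cu / sq ** 1.5 if sq > 0 else 0.0,
        "distinct": len(set(s)),
    }

# Hypothetical sample with one large value, so it is right-skewed
stats = summary_statistics([1, 2, 2, 3, 4, 5, 18])
for name, value in stats.items():
    print(name, value)
```

The single large value pulls the mean (5.0) well above the median (3.0), which is the usual signature of right skew.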


Exploratory Data Analysis

Tools for Displaying Single Variables


Histogram

• Most common form: split the data range into equal-sized bins, then for each bin count the number of points from the data set that fall into it.
– Vertical axis: frequency (i.e., counts for each bin)
– Horizontal axis: response variable

• The histogram graphically shows the following:
1. center (i.e., the location) of the data;
2. spread (i.e., the scale) of the data;
3. skewness of the data;
4. presence of outliers; and
5. presence of multiple modes in the data.

These features can provide useful information about the proper distributional model for the data.
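The equal-width binning described on this slide can be sketched as follows, in Python rather than the lecture's MATLAB. The function name `histogram_counts` and the sample values are illustrative assumptions, and placing the maximum value in the last bin is one common convention, not something the slide specifies.

```python
def histogram_counts(xs, k):
    """Count how many values fall into each of k equal-width bins over [min, max]."""
    lo, hi = min(xs), max(xs)
    if hi == lo:                       # all values identical: a single occupied bin
        return [len(xs)] + [0] * (k - 1), lo, 0.0
    width = (hi - lo) / k
    counts = [0] * k
    for x in xs:
        b = min(int((x - lo) / width), k - 1)   # the maximum goes in the last bin
        counts[b] += 1
    return counts, lo, width

# Hypothetical sample: six small values and one far-away value
counts, lo, width = histogram_counts([1, 2, 2, 3, 4, 5, 18], 4)
print(counts, lo, width)
```

With four bins the far-away value sits alone in the last bin, separated from the rest by two empty bins, which is how an outlier typically shows up in a histogram.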


Issues with Histograms

• For small data sets, histograms can be misleading. Small changes in the data or to the bucket boundaries can result in very different histograms.

• For large data sets, histograms can be quite effective at illustrating general properties of the distribution.

• example

• Can smooth histogram using a variety of techniques
– E.g., kernel density estimation (pages 59-61 in text)

• Histograms effectively only work with 1 variable at a time
– Difficult to extend to 2 dimensions, not possible for >2
– So histograms tell us nothing about the relationships among variables
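As a rough illustration of the smoothing idea, here is a minimal Gaussian kernel density estimate in Python. The Gaussian kernel, the bandwidth `h = 0.5`, the sample values, and the name `gaussian_kde` are all assumptions made for this sketch; the text may present the method with different choices.

```python
import math

def gaussian_kde(xs, h):
    """Return f(x) = (1 / (n h)) * sum_i K((x - x_i) / h), with K the standard normal density."""
    n = len(xs)
    norm = 1.0 / (n * h * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in xs)

    return density

data = [1.0, 1.5, 2.0, 2.2, 6.0]      # hypothetical sample
f = gaussian_kde(data, h=0.5)         # h is the bandwidth (smoothing width)

# Riemann-sum check that the estimate integrates to about 1 on a wide grid
step = 0.01
area = sum(f(-5.0 + i * step) * step for i in range(1600))
print(round(area, 3), f(1.8), f(4.0))
```

Unlike a histogram, the estimate does not depend on bin boundaries, but it does depend on the bandwidth: a larger `h` gives a smoother curve and a smaller `h` a spikier one.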


Histogram Example

classical bell-shaped, symmetric histogram with most of the frequency counts bunched in the middle and with the counts dying off out in the tails. From a physical science/engineering point of view, the Normal/Gaussian distribution often occurs in nature (due in part to the central limit theorem).


ZipCode Data: Population

[Figure: three histograms of zip-code population: K = 50 bins over the full range; K = 500 bins over the full range; K = 50 bins for populations below 5000]


ZipCode Data: Population

• MATLAB code:
  X = zipcode_data(:,2);     % second column from zipcode array
  histogram(X, 50)           % histogram of X with 50 bins
  histogram(X, 500)          % 500 bins
  index = X < 5000;          % identify X values lower than 5000
  histogram(X(index), 50)    % now plot just these X values


Histogram Detecting Outlier (Missing Data)

blood pressure = 0 ?


Right Skewness Example: Credit Card Usage

similarly right-skewed are power law distributions ($P_i \propto 1/i^a$, where $a \geq 1$)

e.g. for a = 1 we have “Zipf’s law” for word frequencies in text
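A quick Python sketch shows how heavy the head of such a distribution is. The function name `power_law_probabilities` and the choice of 1000 ranked items are illustrative assumptions; the probabilities are normalized so that they sum to 1.

```python
def power_law_probabilities(n_items, a=1.0):
    """Normalized P_i proportional to 1 / i**a for ranks i = 1..n_items."""
    weights = [1.0 / i ** a for i in range(1, n_items + 1)]
    total = sum(weights)
    return [w / total for w in weights]

p = power_law_probabilities(1000, a=1.0)   # a = 1 is the Zipf case
top10_share = sum(p[:10])                  # probability mass of the 10 top-ranked items
print(round(top10_share, 3))
```

With a = 1 and 1000 items, the ten most frequent items carry about 39% of the total probability, which is why histograms of such data look so strongly right-skewed.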


Box (and Whisker) Plots: Pima Indians Data

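[Figure: box plots for the healthy and diabetic groups]
– box spans Q1 to Q3 (height Q3 - Q1) and contains the middle 50% of the data
– line inside the box: Q2 (median)
– whiskers extend up to 1.5 x (Q3 - Q1) beyond the box (or shorter, if no data lie that far above Q3)
– all data outside the whiskers are plotted individually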
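The quantities a box plot draws can be computed directly. The Python sketch below assumes the "inclusive" quartile rule of `statistics.quantiles` and applies the 1.5 x (Q3 - Q1) rule symmetrically above and below the box; the function name `box_plot_summary` and the sample values are illustrative.

```python
import statistics

def box_plot_summary(xs):
    """Quartiles, whisker ends, and points beyond the whiskers for one variable."""
    s = sorted(xs)
    q1, q2, q3 = statistics.quantiles(s, n=4, method="inclusive")
    reach = 1.5 * (q3 - q1)                       # maximum whisker length
    inside = [x for x in s if q1 - reach <= x <= q3 + reach]
    return {
        "q1": q1,
        "median": q2,
        "q3": q3,
        "whisker_low": inside[0],                 # whiskers stop at the last data point in reach
        "whisker_high": inside[-1],
        "outliers": [x for x in s if x < q1 - reach or x > q3 + reach],
    }

box = box_plot_summary([1, 2, 2, 3, 4, 5, 18])    # hypothetical sample
print(box)
```

Here the upper whisker stops at 5, the largest value within reach of the box, and 18 is plotted on its own as an outlying point.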


Time Series Example 1

annual fees introduced in UK (many users cut back to 1 credit card)


Time Series Example 2

[Figure annotations]
– steady growth trend
– New Year bumps
– summer peaks
– summer bifurcations in air travel (favor early/late)


Time-Series Example 3

Scotland experiment: does milk in kids' diet lead to better health?

20,000 kids: 5k given raw milk, 5k given pasteurized milk, 10k control (no supplement)

[Figure: mean weight vs. mean age for the 10k control group]

Would expect a smooth weight growth plot.

The plot visually reveals an unexpected pattern (steps), not apparent from the raw data table.

Possible explanations:
– Do kids grow less early in the year than later?
– There are no steps in the height plots, so why would height grow uniformly but weight in spurts?
– Kids were weighed in clothes: is summer garb lighter than winter garb?


Exploratory Data Analysis

Tools for Displaying Pairs of Variables


A simple data set

Data
X: 10.00  8.00 13.00  9.00 11.00 14.00  6.00  4.00 12.00  7.00  5.00
Y:  8.04  6.95  7.58  8.81  8.33  9.96  7.24  4.26 10.84  4.82  5.68

Anscombe, Francis (1973), Graphs in Statistical Analysis, The American Statistician, 27(1), pp. 17-21.


Summary Statistics

N = 11
Mean of X = 9.0
Mean of Y = 7.5
Intercept = 3
Slope = 0.5
Residual standard deviation = 1.237
Correlation = 0.816



3 more data sets

X2 Y2 X3 Y3 X4 Y4

10.00 9.14 10.00 7.46 8.00 6.58

8.00 8.14 8.00 6.77 8.00 5.76

13.00 8.74 13.00 12.74 8.00 7.71

9.00 8.77 9.00 7.11 8.00 8.84

11.00 9.26 11.00 7.81 8.00 8.47

14.00 8.10 14.00 8.84 8.00 7.04

6.00 6.13 6.00 6.08 8.00 5.25

4.00 3.10 4.00 5.39 19.00 12.50

12.00 9.13 12.00 8.15 8.00 5.56

7.00 7.26 7.00 6.42 8.00 7.91

5.00 4.74 5.00 5.73 8.00 6.89


Summary Statistics

Summary Statistics of Data Set 2

N = 11
Mean of X = 9.0
Mean of Y = 7.5
Intercept = 3
Slope = 0.5
Residual standard deviation = 1.237
Correlation = 0.816


Summary Statistics of Data Set 3

N = 11
Mean of X = 9.0
Mean of Y = 7.5
Intercept = 3
Slope = 0.5
Residual standard deviation = 1.237
Correlation = 0.816

Summary Statistics of Data Set 4

N = 11
Mean of X = 9.0
Mean of Y = 7.5
Intercept = 3
Slope = 0.5
Residual standard deviation = 1.237
Correlation = 0.816
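The matching summaries can be checked directly from the four data sets listed in the lecture. The Python sketch below fits an ordinary least-squares line to each; the function name `fit_summary` is illustrative, and the residual standard deviation is assumed to use n - 2 degrees of freedom.

```python
import math

def fit_summary(xs, ys):
    """Means, least-squares line, residual standard deviation, and correlation."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    sse = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    return {
        "n": n,
        "mean_x": mx,
        "mean_y": my,
        "intercept": intercept,
        "slope": slope,
        "resid_sd": math.sqrt(sse / (n - 2)),   # assumption: n - 2 degrees of freedom
        "corr": sxy / math.sqrt(sxx * syy),
    }

x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]   # X is shared by data sets 1 to 3
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
datasets = [
    (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
]
summaries = [fit_summary(xs, ys) for xs, ys in datasets]
for k, s in enumerate(summaries, start=1):
    print(k, {name: round(value, 3) for name, value in s.items()})
```

All four rows agree to about two decimal places, so none of these numbers can tell the data sets apart; only a plot can.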


Graphs reveal the mystery!


Displaying high-dimensional data

• multiple bivariate graphs
– scatter plot matrix
– trellis plot

• Icon plots
– star graph
– Chernoff’s faces

• Parallel coordinates


2D: Scatter Plots

• standard tool for displaying relationship between two variables

• A scatter plot is a plot of the values of Y versus the corresponding values of X:
– Vertical axis: variable Y, usually the response variable
– Horizontal axis: variable X, the variable we suspect may be related

• Scatter plots can provide answers to the following questions:
1. Are variables X and Y related?
2. Are variables X and Y linearly related?
3. Are variables X and Y non-linearly related?
4. Does the variation in Y change depending on X?
5. Are there outliers?


Scatter Plot: No relationship


Scatter Plot: Linear relationship


Scatter Plot: Quadratic relationship


Scatter plot: Homoscedastic

Variation of Y Does Not Depend on X


Scatter plot: Heteroscedastic

variation in Y differs depending on the value of X
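A numeric counterpart to eyeballing the scatter plot is to compare the spread of Y across slices of X. The Python sketch below does this on synthetic data; the data-generating rule (noise that grows with x), the five bins, and the name `spread_by_x_bin` are assumptions made for the illustration. It also assumes Y has no trend in X; with a trend, one would look at residuals from a fitted curve instead.

```python
import random
import statistics

def spread_by_x_bin(xs, ys, k):
    """Standard deviation of y within each of k equal-width bins of x (None if a bin has < 2 points)."""
    lo, hi = min(xs), max(xs)
    width = (hi - lo) / k            # assumes x is not constant
    groups = [[] for _ in range(k)]
    for x, y in zip(xs, ys):
        groups[min(int((x - lo) / width), k - 1)].append(y)
    return [statistics.pstdev(g) if len(g) > 1 else None for g in groups]

# Synthetic heteroscedastic data: constant mean, noise level grows with x
rng = random.Random(0)
xs = [rng.uniform(0.0, 10.0) for _ in range(2000)]
ys = [2.0 + rng.gauss(0.0, 0.2 + 0.3 * x) for x in xs]
spreads = spread_by_x_bin(xs, ys, 5)
print([round(s, 2) for s in spreads])
```

For homoscedastic data the five numbers would be roughly equal; here they rise steadily from the first bin to the last.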


2D Scatter Plots

• standard tool to display relation between 2 variables
– e.g. y-axis = response, x-axis = suspected indicator

• useful to answer:
– x, y related?
  • no
  • linearly
  • nonlinearly
– variance(y) depend on x?
– outliers present?

• MATLAB:
– plot(X(1,:), X(2,:), '.');

credit card repayment: low-low, high-high


[Figure: scatter plot of median per capita income and median household income]


Problems with Scatter Plots of Large Data

96,000 bank loan applicants

appears: later apps older; reality: downward slope (more apps, more variance)

scatter plot degrades into black smudge ...


Contour Plots Can Help

(same 96,000 bank loan apps as before)

[Figure: contour plot, with annotations "recall: unimodal, skewed"]

shows that the change in variance(y) with x is indeed due to horizontal skew in the density


Problems with Scatter Plots of Large Data

# weeks credit card buys gas vs. groceries

(10,000 customers) actual correlation (0.48) higher than appears (overprinting)

also demands explanation
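One common remedy for overprinting is to count points per grid cell and display the counts (as shading or contours) instead of drawing every point. The Python sketch below only computes the counts; the function name `grid_counts`, the 2-by-2 grid, and the six sample points are illustrative assumptions.

```python
def grid_counts(xs, ys, kx, ky):
    """Counts of points in a kx-by-ky grid of equal-width cells (rows index y, columns index x)."""
    x_lo, x_hi = min(xs), max(xs)
    y_lo, y_hi = min(ys), max(ys)
    wx = (x_hi - x_lo) / kx or 1.0     # fall back to width 1 if all x are equal
    wy = (y_hi - y_lo) / ky or 1.0
    grid = [[0] * kx for _ in range(ky)]
    for x, y in zip(xs, ys):
        col = min(int((x - x_lo) / wx), kx - 1)
        row = min(int((y - y_lo) / wy), ky - 1)
        grid[row][col] += 1
    return grid

# Six points, four of which would print on top of each other in a scatter plot
xs = [0, 0, 0, 0, 1, 2]
ys = [0, 0, 0, 0, 1, 2]
grid = grid_counts(xs, ys, 2, 2)
print(grid)
```

A scatter plot of these points shows only three dots, while the grid records that four of the six observations sit in the same cell.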


Exploratory Data Analysis

Tools for Displaying Pairs of Variables


Scatter Plot Matrix


Trellis Plot

[Figure: trellis plot with panels arranged by age (Younger, Older) and sex (Male, Female)]


Star Plots: Using Icons to Encode Information

• Each star represents a single observation. Star plots are used to examine the relative values for a single data point

• The star plot consists of a sequence of equi-angular spokes, called radii, with each spoke representing one of the variables.

• Useful for small data sets with up to 10 or so variables

• Limitations?
– Small data sets, small dimensions
– Ordering of variables may affect perception

Variables in the example figure:
1 Price
2 Mileage (MPG)
3 1978 Repair Record (1 = Worst, 5 = Best)
4 1977 Repair Record (1 = Worst, 5 = Best)
5 Headroom
6 Rear Seat Room
7 Trunk Space
8 Weight
9 Length
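The geometry of a star plot is simple enough to sketch in Python. The function below places spoke j at angle 2πj/d and scales its length by the variable's value relative to that variable's range over the data set. The name `star_vertices`, the rescaling to [0, 1], and the four-variable example values are assumptions for illustration, not the lecture's own procedure.

```python
import math

def star_vertices(values, lows, highs):
    """(x, y) end point of each spoke for one observation.

    Spoke j points at angle 2*pi*j/d; its length is variable j rescaled to [0, 1]
    using that variable's low and high over the whole data set.
    """
    d = len(values)
    points = []
    for j, (v, lo, hi) in enumerate(zip(values, lows, highs)):
        r = (v - lo) / (hi - lo) if hi > lo else 0.5   # constant variable: mid-length spoke
        angle = 2.0 * math.pi * j / d
        points.append((r * math.cos(angle), r * math.sin(angle)))
    return points

# Hypothetical observation with four variables
pts = star_vertices([4.0, 10.0, 0.0, 5.0], lows=[0, 0, 0, 0], highs=[8, 10, 10, 10])
print([(round(x, 3), round(y, 3)) for x, y in pts])
```

Joining the end points in order gives the star outline. Reordering the variables changes the shape, which is the perception issue noted in the limitations.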


Chernoff’s Faces

• described by ten facial characteristic parameters: head eccentricity, eye eccentricity, pupil size, eyebrow slant, nose size, mouth shape, eye spacing, eye size, mouth length and degree of mouth opening

• Chernoff faces applet

• more icon plots

• Limitations:
– Similar to star plots


Parallel Coordinates

[Figure: parallel coordinates plot (epileptic seizure data again)]
– dimensions (possibly all d of them!), often (re)ordered to better distinguish among interesting subsets of the n total cases
– interactive “brushing” is useful for seeing such distinctions
– 1 (of n) cases: this case is a “brushed” one, with a darker line, to stand out from the n-1 other cases
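To make the construction concrete, the Python sketch below turns each case into the polyline a parallel coordinates plot would draw: one vertex per dimension, with each dimension rescaled to [0, 1] so that axes with different units share a common height. The name `parallel_coordinates`, the min-max rescaling, and the three sample rows are illustrative assumptions.

```python
def parallel_coordinates(rows):
    """Rescale each column to [0, 1]; row i becomes the polyline [(0, v0), (1, v1), ...]."""
    d = len(rows[0])
    lows = [min(r[j] for r in rows) for j in range(d)]
    highs = [max(r[j] for r in rows) for j in range(d)]
    lines = []
    for r in rows:
        line = []
        for j in range(d):
            span = highs[j] - lows[j]
            height = (r[j] - lows[j]) / span if span else 0.5   # constant column: mid-height
            line.append((j, height))
        lines.append(line)
    return lines

# Three hypothetical cases measured on three variables with very different scales
lines = parallel_coordinates([[1, 100, 0.2], [3, 300, 0.1], [2, 100, 0.4]])
for line in lines:
    print(line)
```

Brushing then amounts to drawing one of these polylines in a darker color than the rest, and reordering the dimensions amounts to permuting the columns before calling the function.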


“Grand Tour”

• scatter plot matrix only multi-bivariate
• can achieve richer multivariate visualization by:
– rotate direction of projection over all d (not just pick two)
– user control over spin
– random projection (“Grand Tour”)

• e.g. XGOBI visualization package (available on the Web)
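The random-projection idea can be sketched in Python as drawing a random 2-D plane in d dimensions and projecting every case onto it. This is only a single frame, under stated assumptions: a real Grand Tour moves smoothly through a sequence of such planes, and nothing here reflects how XGOBI itself is implemented. The names `random_plane` and `project`, the seed, and the sample rows are illustrative.

```python
import math
import random

def random_plane(d, rng):
    """Two orthonormal directions in d >= 2 dimensions (Gram-Schmidt on Gaussian draws)."""
    u = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm_u = math.sqrt(sum(c * c for c in u))
    u = [c / norm_u for c in u]
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    dot = sum(a * b for a, b in zip(u, v))
    v = [b - dot * a for a, b in zip(u, v)]       # remove the component along u
    norm_v = math.sqrt(sum(c * c for c in v))
    v = [c / norm_v for c in v]
    return u, v

def project(rows, u, v):
    """2-D coordinates of each d-dimensional row in the plane spanned by u and v."""
    return [
        (sum(a * b for a, b in zip(r, u)), sum(a * b for a, b in zip(r, v)))
        for r in rows
    ]

rng = random.Random(0)
u, v = random_plane(4, rng)
points = project([[1, 0, 0, 0], [0, 1, 0, 0], [1, 2, 3, 4]], u, v)
print(points)
```

Choosing u and v as two coordinate axes recovers one panel of the scatter plot matrix, so the random plane is a strict generalization of picking two variables.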


Summary

• EDA and Visualization
– Can be very useful for
  • data checking
  • getting a general sense of individual or pairs of variables
– But…
  • do not necessarily reveal structure in high dimensions

• Reading: Chapter 3