Why We Use Exploratory Data Analysis


Upload: trynt

Post on 07-Feb-2016


TRANSCRIPT

Page 1: WHY WE USE EXPLORATORY DATA ANALYSIS

WHY WE USE EXPLORATORY DATA ANALYSIS

[Flowchart]

DATA: do the data come from a normal distribution?

  YES: estimates based on the normal distribution

  NO: why?

    kurtosis, skewness: transformations, quantile (robust) estimates

    outliers, extremes: can we remove them? YES / NO; quantile (robust) estimates, transformations

Page 2: WHY WE USE EXPLORATORY DATA ANALYSIS


METHODS OF EDA

Graphical:

dot plot

box plot

notched box plot

QQ plot

histogram

density plots

Tests:

tests of normality

minimal sample size

Page 3: WHY WE USE EXPLORATORY DATA ANALYSIS


DOT PLOT

Page 4: WHY WE USE EXPLORATORY DATA ANALYSIS

BOX PLOT

[Figure: box plot drawn along a number axis, with the following parts labelled]

lower quartile

upper quartile

inner fence, outer fence

interquartile range (H)

median
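
The labelled parts of the box plot can be computed directly from a sample. The following is a minimal sketch, assuming the common Tukey convention of inner fences at 1.5 H and outer fences at 3 H beyond the quartiles; the slides do not state the multipliers or the quartile method, so both are assumptions, and the function name `box_plot_summary` and the sample values are made up for illustration.

```python
import statistics

def box_plot_summary(data, inner_k=1.5, outer_k=3.0):
    """Quartiles, interquartile range H and fences (multipliers are assumed)."""
    q1, med, q3 = statistics.quantiles(data, n=4, method="inclusive")
    h = q3 - q1  # interquartile range, labelled H on the slide
    return {
        "lower_quartile": q1,
        "median": med,
        "upper_quartile": q3,
        "H": h,
        "inner_fences": (q1 - inner_k * h, q3 + inner_k * h),
        "outer_fences": (q1 - outer_k * h, q3 + outer_k * h),
    }

sample = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # made-up values
summary = box_plot_summary(sample)
```

Points outside the inner fences are usually treated as suspected outliers and points outside the outer fences as extremes.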

Page 5: WHY WE USE EXPLORATORY DATA ANALYSIS

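NOTCHED BOX PLOT

interval estimate of median

$$I_{D,H} = M \pm \frac{1.57\,R_F}{\sqrt{n}}$$

where $M$ is the median, $R_F$ the interquartile range, $n$ the sample size, and $I_D$, $I_H$ the lower and upper limits of the interval.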
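
The notch limits can be sketched in a few lines, assuming the usual notch rule M ± 1.57·R_F/√n with R_F taken as the interquartile range. The function name `median_interval`, the quartile method and the sample values are assumptions made for illustration.

```python
import math
import statistics

def median_interval(data):
    """Lower and upper notch limits: M -/+ 1.57 * R_F / sqrt(n)."""
    n = len(data)
    m = statistics.median(data)
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    r_f = q3 - q1  # interquartile range
    half_width = 1.57 * r_f / math.sqrt(n)
    return m - half_width, m + half_width

sample = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # made-up values
lower, upper = median_interval(sample)
```

When the notches of two box plots do not overlap, this is commonly read as an informal sign that the two medians differ.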

Page 6: WHY WE USE EXPLORATORY DATA ANALYSIS

Q-Q PLOT

X: theoretical quantiles of analysed distribution

Y: sample quantiles

[Figure labels: ideal coincidence of sample values and theoretical distribution; measured values]
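
The coordinates of a normal Q-Q plot can be sketched as follows. The plotting positions (i - 0.5)/n are one common choice and an assumption here, since the slides do not say which positions were used; the function name `normal_qq_points` and the sample values are hypothetical.

```python
from statistics import NormalDist

def normal_qq_points(data):
    """Pairs (theoretical normal quantile, sample quantile) for a Q-Q plot."""
    ordered = sorted(data)
    n = len(ordered)
    # Plotting positions (i - 0.5) / n are an assumed convention.
    probabilities = [(i - 0.5) / n for i in range(1, n + 1)]
    theoretical = [NormalDist().inv_cdf(p) for p in probabilities]
    return list(zip(theoretical, ordered))

sample = [41.0, 35.5, 52.3, 47.1, 38.9]  # made-up values
points = normal_qq_points(sample)
```

If the data follow the analysed distribution, the points lie close to a straight line.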

Page 7: WHY WE USE EXPLORATORY DATA ANALYSIS

Q-Q PLOT

[Figure: normal Q-Q plot; x-axis: observed value (25 to 65); y-axis: expected normal value (-3 to 3)]

Page 8: WHY WE USE EXPLORATORY DATA ANALYSIS

Q-Q PLOT

[Figure: normal Q-Q plot; x-axis: observed value (-20 to 120); y-axis: expected normal value (-3 to 3)]

Page 9: WHY WE USE EXPLORATORY DATA ANALYSIS

Q-Q plot

right sided – skewed to left

left sided – skewed to right

platykurtic ("flat")

leptokurtic ("steep")
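
These shapes can also be expressed numerically. A minimal sketch using moment coefficients: positive skewness indicates a longer right tail and negative skewness a longer left tail, while kurtosis below the normal value of 3 is platykurtic and above 3 is leptokurtic. The function name `shape_coefficients` and the sample values are made up for illustration.

```python
def shape_coefficients(data):
    """Moment-based skewness and kurtosis (normal distribution: 0 and 3)."""
    n = len(data)
    mean = sum(data) / n
    # Central moments of order 2, 3 and 4.
    m2, m3, m4 = (sum((x - mean) ** k for x in data) / n for k in (2, 3, 4))
    return m3 / m2 ** 1.5, m4 / m2 ** 2

symmetric = [1, 2, 3, 4, 5]          # made-up values
long_right_tail = [1, 1, 1, 2, 10]   # made-up values

skew_sym, kurt_sym = shape_coefficients(symmetric)
skew_right, kurt_right = shape_coefficients(long_right_tail)
```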


Page 12: WHY WE USE EXPLORATORY DATA ANALYSIS

HISTOGRAM

[Figure: histogram of the variable TLOUSTKY (Sheet1); x-axis: TLOUSTKY (20 to 70); y-axis: frequency (0 to 30)]

Page 13: WHY WE USE EXPLORATORY DATA ANALYSIS

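HISTOGRAM

correct width of interval, set by the number of classes $L$ for a sample of size $n$:

$$L = \operatorname{int}\left[2.46\,(n-1)^{0.4}\right]$$

or

$$L = \operatorname{int}\left(2\sqrt{n}\right)$$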
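
Both rules are easy to evaluate. The sketch below assumes the formulas read L = int[2.46 (n - 1)^0.4] and L = int(2 √n), and that the interval width is the data range divided by L; the function names `class_counts` and `class_width` are hypothetical.

```python
import math

def class_counts(n):
    """Number of histogram classes L from the two rules, for sample size n."""
    return int(2.46 * (n - 1) ** 0.4), int(2 * math.sqrt(n))

def class_width(data, n_classes):
    """Interval width when the data range is split into n_classes classes."""
    return (max(data) - min(data)) / n_classes

counts_for_100 = class_counts(100)
width = class_width([20, 70], 10)  # made-up range of 20 to 70, 10 classes
```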

Page 14: WHY WE USE EXPLORATORY DATA ANALYSIS

HISTOGRAM – kernel density function

[Figure: kernel density estimate of the variable TLOUSTKY (Sheet1); x-axis: TLOUSTKY (10 to 80); y-axis: density (0.000 to 0.060)]
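
A kernel density estimate replaces the histogram bars with a sum of smooth bumps, one per observation. The following is a minimal sketch with a Gaussian kernel; the slides do not say which kernel or bandwidth was used, so the kernel, Silverman's rule of thumb for the bandwidth, the function names and the sample values are all assumptions.

```python
import math
import statistics

def silverman_bandwidth(data):
    """Rule-of-thumb bandwidth (Silverman); an assumption, not from the slides."""
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    spread = min(statistics.stdev(data), (q3 - q1) / 1.34)
    return 0.9 * spread * len(data) ** -0.2

def gaussian_kde(data, bandwidth):
    """Return a function that evaluates the Gaussian kernel density estimate."""
    scale = 1.0 / (len(data) * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        return scale * sum(
            math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data
        )

    return density

thicknesses = [22, 27, 31, 33, 35, 38, 41, 44, 52, 63]  # made-up values
h = silverman_bandwidth(thicknesses)
density = gaussian_kde(thicknesses, h)
```

A smaller bandwidth follows the data more closely, and a larger one gives a smoother curve.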

Page 15: WHY WE USE EXPLORATORY DATA ANALYSIS

TRANSFORMATION

Aim of transformation:

reduction of variance

better level of symmetry (normality) of data

Transformation function:

non-linear function

monotonic function

Page 16: WHY WE USE EXPLORATORY DATA ANALYSIS

TRANSFORMATION – basic concept

[Figure: transformed data (-0.4 to 0.8) plotted against original data (tree-ring widths in mm, 0 to 3.5); marked are the mean of the original data and the transformed mean with its projection back to the original data set]
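
The idea of projecting the transformed mean back to the original scale can be sketched as follows. The logarithm is used as the transformation only as an example, since the figure does not say which function it shows; the function name `retransformed_mean` and the widths are made up.

```python
import math
import statistics

def retransformed_mean(data, forward=math.log, inverse=math.exp):
    """Mean computed on the transformed scale, projected back with the inverse."""
    transformed = [forward(x) for x in data]
    return inverse(statistics.fmean(transformed))

widths_mm = [0.5, 1.0, 2.0]  # made-up tree-ring widths in mm
plain_mean = statistics.fmean(widths_mm)
back_projected = retransformed_mean(widths_mm)
```

For the logarithm the back-projected value is the geometric mean, which lies below the arithmetic mean whenever the values are not all equal.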

Page 17: WHY WE USE EXPLORATORY DATA ANALYSIS

TRANSFORMATION – logarithmic transformation

$x \rightarrow \ln x$

[Figure: two histograms of counts; C2 spans about 0 to 800, C7 spans about 3 to 7]

Page 18: WHY WE USE EXPLORATORY DATA ANALYSIS

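TRANSFORMATION – power transformation

$$g(x) = \begin{cases} x^{\lambda} & \text{for } \lambda > 0 \\ \ln x & \text{for } \lambda = 0 \\ -x^{\lambda} & \text{for } \lambda < 0 \end{cases}$$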
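
A minimal sketch of the simple power transformation. It assumes the order-preserving convention x^λ for λ > 0, ln x for λ = 0 and -x^λ for λ < 0; the exact form of the negative branch on the slide is an assumption, and the function name `power_transform` is hypothetical.

```python
import math

def power_transform(x, lam):
    """Simple power transformation for positive x (negative branch assumed)."""
    if x <= 0:
        raise ValueError("the transformation is defined for positive values only")
    if lam > 0:
        return x ** lam
    if lam == 0:
        return math.log(x)
    return -(x ** lam)  # the minus sign keeps the function increasing

values = [0.5, 1.0, 2.0, 4.0]  # made-up values
by_lambda = {lam: [power_transform(x, lam) for x in values] for lam in (-1, 0, 0.5)}
```

With this convention every branch is monotonically increasing, so the ordering of the data is preserved for any λ.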

Page 19: WHY WE USE EXPLORATORY DATA ANALYSIS

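TRANSFORMATION – Box-Cox

$$g(x) = \begin{cases} \dfrac{x^{\lambda} - 1}{\lambda} & \text{for } \lambda \neq 0 \\ \ln x & \text{for } \lambda = 0 \end{cases}$$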
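
The Box-Cox transformation in its standard form, (x^λ - 1)/λ for λ ≠ 0 and ln x for λ = 0, can be sketched as below. The function name `box_cox` is chosen here for illustration.

```python
import math

def box_cox(x, lam):
    """Box-Cox transformation of a positive value x with parameter lam."""
    if x <= 0:
        raise ValueError("Box-Cox is defined for positive values only")
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1) / lam

near_zero = box_cox(5.0, 1e-6)  # approaches ln(5) as lam goes to 0
at_zero = box_cox(5.0, 0)
```

Subtracting 1 and dividing by λ makes the family continuous in λ, so the logarithm appears as the limiting case at λ = 0.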

Page 20: WHY WE USE EXPLORATORY DATA ANALYSIS


TRANSFORMATION – Box-Cox

Page 21: WHY WE USE EXPLORATORY DATA ANALYSIS

TRANSFORMATION – estimate of optimal λ

[Figure: logarithm of the likelihood function for various values of λ; marked are the optimal λ, the interval estimate of parameter λ, the value λ = 1.00 and the level maxLF – 0.5 · quantile χ²]

λ = 1 is not included in the interval estimate of λ. It means that the transformation will probably be successful.
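
The procedure can be sketched with a grid search over λ. This assumes the Box-Cox family, the usual profile log-likelihood ln L(λ) = -(n/2) ln s²(λ) + (λ - 1) Σ ln x, and a 95 % interval made of all λ whose log-likelihood is at least maxLF - 0.5 · χ² quantile with 1 degree of freedom. The function names, the grid and the synthetic data are assumptions made for illustration.

```python
import math
from statistics import NormalDist

def box_cox(x, lam):
    return math.log(x) if lam == 0 else (x ** lam - 1) / lam

def log_likelihood(data, lam):
    """Profile log-likelihood of lam (additive constants dropped)."""
    n = len(data)
    y = [box_cox(x, lam) for x in data]
    mean_y = sum(y) / n
    var_y = sum((v - mean_y) ** 2 for v in y) / n
    return -0.5 * n * math.log(var_y) + (lam - 1) * sum(math.log(x) for x in data)

def lambda_interval(data, grid, alpha=0.05):
    """Optimal lam on the grid and the limits of its interval estimate."""
    profile = [(log_likelihood(data, lam), lam) for lam in grid]
    max_lf, best = max(profile)
    # A chi-square quantile with 1 d.f. equals the squared normal quantile.
    chi2_quantile = NormalDist().inv_cdf(1 - alpha / 2) ** 2
    cutoff = max_lf - 0.5 * chi2_quantile
    inside = [lam for lf, lam in profile if lf >= cutoff]
    return best, min(inside), max(inside)

# Synthetic, log-normal-shaped data (exponentials of normal quantiles).
data = [math.exp(NormalDist().inv_cdf((i - 0.5) / 50)) for i in range(1, 51)]
grid = [i / 100 for i in range(-200, 201)]
best, low, high = lambda_interval(data, grid)
one_is_excluded = not (low <= 1 <= high)
```

For this synthetic sample the optimal λ lies near 0 (the logarithm) and the interval does not contain 1, which is the situation the slide describes as a probably successful transformation.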