analysis of data session 1

8/8/2019 Analysis of Data Session 1


Prof. Arnab K Laha - Analysis of Data

(PGP-X)

Analysis of Data

Session I



Summarizing Data

Raw data is often voluminous and difficult to handle

Decision makers want a few numbers to summarize the entire data

Summarization leads to loss of information but can help focus on key aspects of the dataset



Five Number Summary of Data

(Min, Q1, Median, Q3, Max) is called the five number summary of data

25% of the observations are below Q1 and 75% of the observations are above Q1.

Q1 is called the first quartile.

50% of the observations are below the Median and 50% are above the Median.

75% of the observations are below Q3 and 25% are above Q3.

Q3 is called the third quartile.

Each of the segments Min-Q1, Q1-Med, Med-Q3 and Q3-Max contains 25% of the data.
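
The five number summary can be sketched in a few lines of Python. This is only an illustration: the sample values are hypothetical, and the quartiles come from the standard library's `statistics.quantiles` with `method="inclusive"`, which is one of several quartile conventions in use and may differ slightly from the one used in class.

```python
import statistics

def five_number_summary(data):
    """Return (Min, Q1, Median, Q3, Max) for a list of numbers."""
    # quantiles(..., n=4) returns the three cut points Q1, Median, Q3.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return min(data), q1, median, q3, max(data)

# Hypothetical sample, not taken from the slides.
sample = [12, 15, 17, 20, 22, 25, 28, 31, 40]
print(five_number_summary(sample))
```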



Box Plot

[Figure: Boxplot of C1, drawn on a vertical axis running from 0 to 160, with Max, Q3, Median, Q1 and Min marked]



Identifying Outliers

The interquartile range (IQR) is defined as IQR = Q3 - Q1

An observation (x) is a soft or possible outlier if x > Q3 + 1.5 IQR or x < Q1 - 1.5 IQR

An observation (x) is a hard or confirmed outlier if x > Q3 + 3 IQR or x < Q1 - 3 IQR

Note: All hard outliers are also soft outliers but not vice versa.
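
The two outlier rules can be sketched as follows. The sample data are hypothetical, and the quartiles again use `statistics.quantiles` with `method="inclusive"`, which is an assumption about the quartile convention rather than something fixed by the slides.

```python
import statistics

def classify_outliers(data):
    """Return (soft_outliers, hard_outliers) using the 1.5 IQR and 3 IQR rules."""
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    # Soft (possible) outliers lie beyond Q1 - 1.5 IQR or Q3 + 1.5 IQR.
    soft = [x for x in data if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]
    # Hard (confirmed) outliers lie beyond Q1 - 3 IQR or Q3 + 3 IQR.
    hard = [x for x in data if x < q1 - 3 * iqr or x > q3 + 3 * iqr]
    return soft, hard

# Hypothetical sample, not taken from the slides.
sample = [12, 15, 17, 20, 22, 25, 28, 31, 40, 70, 95]
soft, hard = classify_outliers(sample)
print("soft:", soft)
print("hard:", hard)
```

Every value in the hard list also appears in the soft list, which matches the note above.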



Dealing with outliers (if present)

Accommodative Approach: Use methods which are resistant to the presence of outliers (robust methods).

Example: 5% Trimmed Mean. The 5% trimmed mean is computed as follows:

Arrange the data in increasing order, then delete the lower 5% of the observations and also the upper 5% of the observations. Compute the simple average of the remaining observations.

Deletion Approach: Delete the outliers and work with the remaining data set.
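
A minimal sketch of the trimmed mean is given below. The function name, the hypothetical sample, and the choice to round the number of deleted observations down when 5% of n is not a whole number are all assumptions made for illustration.

```python
def trimmed_mean(data, percent=5):
    """Average the data after dropping the lowest and highest `percent`% of values."""
    ordered = sorted(data)
    # Number of observations to delete from each end (rounded down; an assumption).
    k = len(ordered) * percent // 100
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

# Hypothetical sample: the values 1..19 plus one extreme observation.
sample = list(range(1, 20)) + [1000]
print("ordinary mean:", sum(sample) / len(sample))
print("5% trimmed mean:", trimmed_mean(sample))
```

With 20 observations, one value is removed from each end, so the extreme observation no longer affects the average.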



Two number summary of data

Often data sets are summarized by giving only two numbers:

- a measure of central tendency and

- a measure of spread (around the measure of central tendency)



Mean and Standard Deviation

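Arithmetic Mean:

$$\bar{x} = \frac{x_1 + \dots + x_n}{n}$$

Standard Deviation:

$$s = \sqrt{\frac{(x_1 - \bar{x})^2 + \dots + (x_n - \bar{x})^2}{n}} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} x_i^2 - \bar{x}^2}$$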



Mean and Standard Deviation

Note that the Standard Deviation (SD) = 0 only when all the observations are equal.

The higher the SD, the higher the spread around the mean value.

A lower SD indicates better reliability of the mean value in representing the dataset.
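
The mean and SD can be computed directly from their definitions, as in the sketch below. It assumes the SD divides the sum of squared deviations by n. Many texts, and Python's `statistics.stdev`, divide by n - 1 instead. The sample values are hypothetical.

```python
import math

def mean(data):
    """Arithmetic mean: sum of the observations divided by their number."""
    return sum(data) / len(data)

def std_dev(data):
    """Standard deviation with divisor n (an assumption; some texts use n - 1)."""
    m = mean(data)
    return math.sqrt(sum((x - m) ** 2 for x in data) / len(data))

# Hypothetical sample, not taken from the slides.
sample = [2, 4, 4, 4, 5, 5, 7, 9]
print("mean:", mean(sample))
print("SD:", std_dev(sample))
print("SD of equal observations:", std_dev([3, 3, 3]))
```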



Chebyshev's Inequality

In most situations of common occurrence, Chebyshev's inequality asserts that the proportion of observations outside the interval (mean - t SD, mean + t SD) is at most 1/t^2.

Using Chebyshev's inequality, the proportion of observations outside

i) (mean - 2 SD, mean + 2 SD) is at most 0.25

ii) (mean - 3 SD, mean + 3 SD) is at most 0.11

iii) (mean - 4 SD, mean + 4 SD) is at most 0.06
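
The bound can be checked numerically on any data set. The sketch below uses a hypothetical sample and the SD with divisor n (`statistics.pstdev`). Both choices are assumptions made for illustration.

```python
import statistics

def proportion_outside(data, t):
    """Proportion of observations outside (mean - t SD, mean + t SD)."""
    m = statistics.mean(data)
    sd = statistics.pstdev(data)  # SD with divisor n
    if sd == 0:
        return 0.0  # all observations equal: nothing lies away from the mean
    outside = sum(1 for x in data if abs(x - m) >= t * sd)
    return outside / len(data)

# Hypothetical sample: the values 1..19 plus one extreme observation.
sample = list(range(1, 20)) + [1000]
for t in (2, 3, 4):
    print(t, proportion_outside(sample, t), "bound:", 1 / t ** 2)
```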



Impact of Outliers

Both the mean and the SD are quite sensitive to the presence of outliers

If mean and SD are proposed to be used for summarizing a data set, it is better to delete the outliers first and then proceed to calculate the mean and SD



Median and MAD

An alternative to using mean and SD as a summary of the data is to use Median and MAD

MAD is the acronym for Median Absolute Deviation about Median

If med is the median of the dataset x1, ..., xn, MAD is the median of the set of numbers {|x1 - med|, |x2 - med|, ..., |xn - med|}

Usually 1.4826 x MAD is used as a measure of spread

The Median and MAD are both far less sensitive to the presence of outliers than mean and SD. No deletion of data is required if these are used.
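
A short sketch of the MAD computation follows, using a hypothetical sample with one extreme value. The factor 1.4826 is the usual scaling that makes the MAD comparable to the SD for normally distributed data.

```python
import statistics

def mad(data):
    """Median of the absolute deviations of the data from their median."""
    med = statistics.median(data)
    return statistics.median([abs(x - med) for x in data])

# Hypothetical sample with one extreme observation.
sample = [1, 2, 3, 4, 100]
print("median:", statistics.median(sample))
print("MAD:", mad(sample))
print("1.4826 x MAD:", 1.4826 * mad(sample))
print("SD for comparison:", statistics.pstdev(sample))
```

The extreme value inflates the SD but leaves the median and MAD almost untouched.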



Empirical Cumulative Distribution Function

Let {x_1, ..., x_n} be the given data set. The Empirical Cumulative Distribution Function (ECDF) is defined as

$$F_n(t) = \frac{\#\{\text{observations} \le t\}}{n}$$

$F_n(t) - F_n(s)$ = proportion of observations greater than s but less than or equal to t.

The smallest number Q1 satisfying $F_n(Q1) \ge 0.25$ is called the first quartile.

The smallest number $\tilde{m}$ satisfying $F_n(\tilde{m}) \ge 0.5$ is called the median.

The smallest number Q3 satisfying $F_n(Q3) \ge 0.75$ is called the third quartile.
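
The ECDF and the quartiles defined through it can be sketched as below. The function names and the sample are hypothetical. The quantile function simply returns the smallest data value whose ECDF reaches the requested level, following the "smallest number" definition.

```python
def ecdf(data, t):
    """F_n(t): proportion of observations less than or equal to t."""
    return sum(1 for x in data if x <= t) / len(data)

def ecdf_quantile(data, p):
    """Smallest number q with F_n(q) >= p (p = 0.25, 0.5, 0.75 give Q1, median, Q3)."""
    ordered = sorted(data)
    n = len(ordered)
    for i, x in enumerate(ordered, start=1):
        if i / n >= p:  # at least i observations are <= x
            return x

# Hypothetical sample, not taken from the slides.
sample = [3, 1, 4, 1, 5, 9, 2, 6]
print("F_n(4):", ecdf(sample, 4))
print("F_n(4) - F_n(2):", ecdf(sample, 4) - ecdf(sample, 2))
print("Q1, median, Q3:", [ecdf_quantile(sample, p) for p in (0.25, 0.5, 0.75)])
```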



Frequency Distribution

Often, particularly for large data sets, it is advantageous to summarize data using a frequency distribution.

The entire range (Range = Max - Min) of the data is divided into a few disjoint classes, each of which is an interval.

A frequency distribution gives the list of the classes along with the number of observations in each class (called the frequency of the class).
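
Building a frequency distribution amounts to counting the observations that fall in each class. The sketch below uses hypothetical data and class edges. It assumes each class includes its lower edge and excludes its upper edge, except the last class, which also includes the maximum. The slides do not fix this convention.

```python
def frequency_distribution(data, edges):
    """Count observations in the classes [e0, e1), [e1, e2), ..., [e_{k-1}, e_k]."""
    counts = [0] * (len(edges) - 1)
    last = len(counts) - 1
    for x in data:
        for i in range(len(counts)):
            in_class = edges[i] <= x < edges[i + 1]
            # The final class is closed on the right so that Max is counted.
            if in_class or (i == last and x == edges[-1]):
                counts[i] += 1
                break
    return counts

# Hypothetical data and class edges, not taken from the slides.
data = [1, 2, 2, 3, 5, 6, 7, 8, 9, 10]
edges = [1, 4, 7, 10]
counts = frequency_distribution(data, edges)
for lo, hi, c in zip(edges, edges[1:], counts):
    print(f"{lo} - {hi}: {c}")
```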



Example of Frequency Distribution

Duration of Pregnancy (days)    Cases (frequency)
256 - 258                        76
259 - 265                       121
266 - 272                       334
273 - 279                       348
280 - 286                       205
287 - 289                        10
Total                          1094

From: Bhat & Khustagi, Singapore Med J 2006; 47(12)



Relative Frequency and Frequency Density

Relative Frequency of a class is the frequency of the class divided by the total number of observations.

Frequency Density of a class is the relative frequency of the class divided by the class width.



Example

Duration of Pregnancy (days)    Frequency    Relative Frequency    Frequency Density
255 - 258.5                        76            0.07                  0.020
258.5 - 265.5                     121            0.11                  0.016
265.5 - 272.5                     334            0.31                  0.044
272.5 - 279.5                     348            0.32                  0.045
279.5 - 286.5                     205            0.19                  0.027
286.5 - 289                        10            0.01                  0.004
Total                            1094
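
Relative frequency and frequency density follow mechanically from the class frequencies and class boundaries. The sketch below uses the frequencies and boundaries of the pregnancy duration example. The variable names are illustrative only.

```python
# Class frequencies and boundaries from the pregnancy duration example.
frequencies = [76, 121, 334, 348, 205, 10]
edges = [255, 258.5, 265.5, 272.5, 279.5, 286.5, 289]

total = sum(frequencies)
widths = [hi - lo for lo, hi in zip(edges, edges[1:])]

# Relative frequency = class frequency / total number of observations.
relative = [f / total for f in frequencies]
# Frequency density = relative frequency / class width.
density = [r / w for r, w in zip(relative, widths)]

for lo, hi, f, r, d in zip(edges, edges[1:], frequencies, relative, density):
    print(f"{lo} - {hi}: freq={f}, rel={r:.2f}, density={d:.3f}")
```

Because density is relative frequency per unit of class width, the areas of the histogram bars (density times width) add up to 1.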



Histogram

[Figure: Histogram of frequency density (vertical axis) against duration of pregnancy (horizontal axis), with classes 255 - 258.5, 258.5 - 265.5, 265.5 - 272.5, 272.5 - 279.5, 279.5 - 286.5 and 286.5 - 289]



Histogram: How many classes?

For construction of the histogram it is important to decide on the number of classes or, equivalently, the class width.

The shape of the histogram heavily depends on the choice of the number of classes / class width.

Two popular approaches are:

a) Sturges' rule

b) Freedman-Diaconis rule

Both these rules usually (but not always) give similar results if the number of observations is less than 200.

The Freedman-Diaconis rule is the preferred / better rule for determining the class width of a histogram.



Sturges Rule

The number of classes (k) = [1 + log2 n] + 1, where [1 + log2 n] is the greatest integer less than or equal to 1 + log2 n

In other words, choose k such that

2k-3n k = 8

n = 83 => k = 9

The class width (h) is computed as Range divided by the number of classes.

h = (Max - Min) / k



Freedman-Diaconis Rule

The class width is given by

$$h = \frac{2\,\mathrm{IQR}}{n^{1/3}}$$

The number of classes is

$$k = \left[\frac{\text{Range}}{h}\right] + 1$$

where [.] is the greatest integer function.



Example

For the pregnancy duration data set we have n = 1094, Range = 289 - 255 = 34

Q1 = 267.1, Q3 = 278.3, IQR = 11.2

The Freedman-Diaconis rule gives the class width to be 2.18

The number of class intervals is therefore 16.

The class intervals are 255 - 257.18, 257.18 - 259.36, ..., 285.52 - 287.7 and 287.7 - 289

(note the last interval has shorter length)
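
The Freedman-Diaconis calculation for this example can be reproduced with a short sketch. The function name is hypothetical, and the inputs are the rounded values quoted above, so the computed width (about 2.17) agrees with the quoted 2.18 only up to rounding. The number of classes comes out as 16 either way.

```python
import math

def freedman_diaconis(n, iqr, data_range):
    """Return (class width h, number of classes k) under the Freedman-Diaconis rule."""
    h = 2 * iqr / n ** (1 / 3)          # h = 2 IQR / n^(1/3)
    k = math.floor(data_range / h) + 1  # k = [Range / h] + 1
    return h, k

# Values quoted for the pregnancy duration data set.
h, k = freedman_diaconis(n=1094, iqr=278.3 - 267.1, data_range=289 - 255)
print("class width:", round(h, 2))
print("number of classes:", k)
```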



Example: Stock Returns

55 daily returns

Five number summary:

Min = -8.213, Q1 = -1.413, Median = -0.1364

Q3 = 0.817, Max = 4.071
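
The outlier rules from earlier in the session can be applied directly to this five number summary. The sketch below is only an illustration built from the quoted values. It shows that the minimum lies below Q1 - 3 IQR, so at least one return is a hard outlier, while the maximum stays inside the soft upper limit.

```python
# Five number summary of the 55 daily returns, as quoted above.
minimum, q1, median, q3, maximum = -8.213, -1.413, -0.1364, 0.817, 4.071

iqr = q3 - q1
soft_low, soft_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
hard_low, hard_high = q1 - 3 * iqr, q3 + 3 * iqr

print("IQR:", round(iqr, 3))
print("soft limits:", round(soft_low, 3), round(soft_high, 3))
print("hard limits:", round(hard_low, 3), round(hard_high, 3))
print("minimum is a hard outlier:", minimum < hard_low)
print("maximum is a soft outlier:", maximum > soft_high)
```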



Histogram using Sturges' rule

[Figure: Histogram of tm; horizontal axis tm from -10 to 5, vertical axis Frequency from 0 to 20]



Histogram using Freedman-Diaconis rule

[Figure: Histogram of tm; horizontal axis tm from -8 to 4, vertical axis Frequency from 0 to 14]
