analysis of data session 1
TRANSCRIPT
-
8/8/2019 Analysis of Data Session 1
1/26
Prof. Arnab K Laha - Analysis of Data
(PGP-X)
Analysis of Data
Session I
-
Summarizing Data
Raw data is often voluminous and difficult to handle
Decision makers want a few numbers to summarize the entire data
Summarization leads to loss of information but can help focus on key aspects of the dataset
-
Five Number Summary of Data
(Min, Q1, Median, Q3, Max) is called the five number summary of data
25% of the observations are below Q1 and 75% of the observations are above Q1.
Q1 is called the first quartile.
50% of the observations are below Median and 50% are above Median.
75% of the observations are below Q3 and 25% are above Q3.
Q3 is called the third quartile.
Each of the segments Min-Q1, Q1-Med, Med-Q3 and Q3-Max contains 25% of the data.
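As a minimal sketch, the five number summary can be computed with Python's standard `statistics` module. The data values are made up for illustration, and the quartiles use the module's "inclusive" interpolation method, which is only one of several quartile conventions in use.

```python
import statistics

def five_number_summary(data):
    """Return (Min, Q1, Median, Q3, Max) for a list of numbers."""
    # quantiles(..., n=4) returns the three cut points Q1, Median, Q3.
    # method="inclusive" is an assumed convention; other conventions
    # can give slightly different quartiles on small data sets.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return (min(data), q1, median, q3, max(data))

# Hypothetical data set, not taken from the slides.
sample = [12, 15, 17, 20, 22, 25, 28, 31, 95]
summary = five_number_summary(sample)
print(summary)
```

For this sample the summary is (12, 17.0, 22.0, 28.0, 95): the single large value moves only the Max, not the quartiles.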
-
Box Plot
[Figure: Boxplot of C1, vertical axis from 0 to 160, with Max, Q3, Median, Q1 and Min marked on the plot]
-
Identifying Outliers
The interquartile range (IQR) is defined as IQR = Q3 - Q1
An observation (x) is a soft or possible outlier if x > Q3 + 1.5 IQR or x < Q1 - 1.5 IQR
An observation (x) is a hard or confirmed outlier if x > Q3 + 3 IQR or x < Q1 - 3 IQR
Note: All hard outliers are also soft outliers but not vice versa.
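The two rules can be sketched as a small Python function. The function name and the labels it returns are illustrative choices. The quartiles, Min and Max are those quoted in the stock returns example in these slides.

```python
def classify_outlier(x, q1, q3):
    """Label x as 'hard', 'soft' or 'none' using the IQR rules."""
    iqr = q3 - q1
    # Hard (confirmed) outliers lie more than 3 IQR beyond a quartile.
    if x > q3 + 3 * iqr or x < q1 - 3 * iqr:
        return "hard"
    # Soft (possible) outliers lie more than 1.5 IQR beyond a quartile.
    if x > q3 + 1.5 * iqr or x < q1 - 1.5 * iqr:
        return "soft"
    return "none"

# Quartiles from the stock returns example in these slides.
q1, q3 = -1.413, 0.817
print(classify_outlier(-8.213, q1, q3))  # the reported Min
print(classify_outlier(4.071, q1, q3))   # the reported Max
```

Because every hard outlier is also a soft outlier, the function reports the stronger label first. With these quartiles the reported Min lies below Q1 - 3 IQR = -8.103 and is flagged as hard, while the reported Max stays below Q3 + 1.5 IQR = 4.162.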
-
Dealing with outliers (if present)
Accommodative Approach: Use methods which are resistant to the presence of outliers (Robust methods)
Example: 5% Trimmed Mean
5% Trimmed Mean is computed as follows: Arrange the data in increasing order and then delete the lower 5% of the observations and also the upper 5% of the observations. Compute the simple average of the remaining observations.
Deletion Approach: Delete the outliers and work with the remaining data set.
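The 5% trimmed mean described above can be sketched in a few lines of Python. The data are hypothetical, and rounding down the number of deleted observations when 5% of n is not a whole number is an assumption, since the slide does not say how to handle that case.

```python
def trimmed_mean(data, percent=5):
    """Mean after dropping the lowest and highest `percent` % of values."""
    ordered = sorted(data)
    n = len(ordered)
    # Number of observations to drop from each end; rounding down when
    # percent% of n is not a whole number is an assumed convention.
    k = n * percent // 100
    kept = ordered[k:n - k]
    return sum(kept) / len(kept)

# Hypothetical data: the values 1..19 plus one extreme value.
values = list(range(1, 20)) + [1000]
plain_mean = sum(values) / len(values)
print(plain_mean, trimmed_mean(values))
```

With 20 observations one value is dropped from each end, so the extreme value 1000 no longer enters the average: the plain mean is 59.5 while the trimmed mean is 10.5.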
-
Two number summary of data
Often data sets are summarized by giving only two numbers:
- a measure of central tendency and
- a measure of spread (around the measure of central tendency)
-
Mean and Standard Deviation
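Arithmetic Mean: $\bar{x} = \dfrac{x_1 + x_2 + \dots + x_n}{n}$
Standard Deviation: $s = \sqrt{\dfrac{(x_1 - \bar{x})^2 + \dots + (x_n - \bar{x})^2}{n}} = \sqrt{\dfrac{1}{n}\sum_{i=1}^{n} x_i^2 - \bar{x}^2}$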
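A minimal Python sketch of the two formulas follows. It assumes the divisor n in the standard deviation (some texts divide by n - 1 instead) and uses made-up data. The second function checks that the shortcut form, the average of the squares minus the squared mean, gives the same value.

```python
import math

def mean_and_sd(data):
    """Return (mean, SD) using the divisor n in the SD."""
    n = len(data)
    mean = sum(data) / n
    # Definition: square root of the average squared deviation.
    sd = math.sqrt(sum((x - mean) ** 2 for x in data) / n)
    return mean, sd

def sd_shortcut(data):
    """SD via the average of squares minus the squared mean."""
    n = len(data)
    mean = sum(data) / n
    return math.sqrt(sum(x * x for x in data) / n - mean ** 2)

# Hypothetical data chosen so the answers are round numbers.
values = [2, 4, 4, 4, 5, 5, 7, 9]
mean, sd = mean_and_sd(values)
print(mean, sd, sd_shortcut(values))
```

For these values the mean is 5 and both forms give an SD of 2.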
-
Mean and Standard Deviation
Note that Standard Deviation (SD) = 0 only when all the observations are equal.
The higher the SD, the greater the spread around the mean value
Lower SD indicates better reliability of the mean value in representing the dataset.
-
Chebyshev's Inequality
In most situations of common occurrence, Chebyshev's inequality asserts that the proportion of observations outside the interval (mean - t SD, mean + t SD) is at most 1/t^2
Using Chebyshev's inequality, the proportion of observations outside
i) (mean - 2 SD, mean + 2 SD) is at most 0.25
ii) (mean - 3 SD, mean + 3 SD) is at most 0.11
iii) (mean - 4 SD, mean + 4 SD) is at most 0.06
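The bound can be checked numerically. The sketch below uses a hypothetical data set and the SD with divisor n (an assumption), counts the observations outside the open interval, and compares the proportion with 1/t^2.

```python
import statistics

def proportion_outside(data, t):
    """Share of observations outside (mean - t*SD, mean + t*SD)."""
    mean = statistics.fmean(data)
    sd = statistics.pstdev(data)  # SD with divisor n
    # The interval is open, so boundary points count as outside.
    outside = [x for x in data if abs(x - mean) >= t * sd]
    return len(outside) / len(data)

# Hypothetical data set with one large value.
values = [2, 4, 4, 4, 5, 5, 7, 9, 30]
for t in (2, 3, 4):
    print(t, proportion_outside(values, t), "bound:", round(1 / t**2, 2))
```

The printed bounds 0.25, 0.11 and 0.06 are 1/4, 1/9 and 1/16 rounded to two decimals. The observed proportions are usually well below them, because the bound has to hold for every possible data set.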
-
Impact of Outliers
Both the mean and the SD are quite sensitive to the presence of outliers
If mean and SD are proposed to be used for summarizing a data set, it is better to delete the outliers first and then proceed to calculate the mean and SD
-
Median and MAD
An alternative to using mean and SD as a summary of the data is to use Median and MAD
MAD is the acronym for Median Absolute Deviation about Median
If med is the median of the dataset x1, ..., xn, MAD is the median of the set of numbers {|x1 - med|, |x2 - med|, ..., |xn - med|}
Usually 1.4826 × MAD is used as a measure of spread
The Median and MAD are both far less sensitive to the presence of outliers than the mean and SD. No deletion of data is required if these are used.
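A short Python sketch of the MAD, using two hypothetical data sets that differ only in their largest value, illustrates the point about outliers. The SD with divisor n is printed alongside for comparison.

```python
import statistics

def mad(data):
    """Median absolute deviation about the median."""
    med = statistics.median(data)
    return statistics.median([abs(x - med) for x in data])

# Hypothetical data: the two lists differ only in the last value.
clean = [12, 15, 17, 20, 22, 25, 28, 31, 32]
with_outlier = [12, 15, 17, 20, 22, 25, 28, 31, 95]

for values in (clean, with_outlier):
    spread = 1.4826 * mad(values)
    print(statistics.median(values), mad(values), round(spread, 4),
          round(statistics.pstdev(values), 2))
```

In both lists the median is 22 and the MAD is 6, so 1.4826 × MAD is unchanged by the extreme value, whereas the SD more than triples.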
-
Empirical Cumulative Distribution Function
Let $\{x_1, \dots, x_n\}$ be the given data set. The Empirical Cumulative Distribution Function (ECDF) $F_n$ is defined as
$F_n(t) = \dfrac{\#\{\text{observations} \le t\}}{n}$
$F_n(t) - F_n(s)$ = proportion of observations greater than $s$ but less than or equal to $t$.
The smallest number Q1 satisfying $F_n(Q1) \ge 0.25$ is called the first quartile.
The smallest number $\tilde{m}$ satisfying $F_n(\tilde{m}) \ge 0.5$ is called the median.
The smallest number Q3 satisfying $F_n(Q3) \ge 0.75$ is called the third quartile.
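The ECDF and the quartiles defined from it can be sketched directly from the definitions. The function names and data are illustrative. The search runs over the sorted observations because the smallest number at which the ECDF reaches a given level is always one of the data values.

```python
def ecdf(data, t):
    """F_n(t): proportion of observations less than or equal to t."""
    return sum(1 for x in data if x <= t) / len(data)

def ecdf_quantile(data, p):
    """Smallest number q with F_n(q) >= p, for 0 < p <= 1."""
    if not 0 < p <= 1:
        raise ValueError("p must lie in (0, 1]")
    for x in sorted(data):
        if ecdf(data, x) >= p:
            return x

# Hypothetical data set.
values = [12, 15, 17, 20, 22, 25, 28, 31, 95]
print(ecdf(values, 20))
print(ecdf(values, 28) - ecdf(values, 15))  # share in (15, 28]
print([ecdf_quantile(values, p) for p in (0.25, 0.5, 0.75)])
```

For this data set the first quartile, median and third quartile under this definition are 17, 22 and 28.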
-
Frequency Distribution
Often, particularly for large data sets, it is advantageous to summarize data using a frequency distribution.
The entire range (Range = Max - Min) of the data is divided into a few disjoint classes, each of which is an interval
A frequency distribution gives the list of the classes along with the number of observations in each class (called the frequency of the class)
-
Example of Frequency Distribution
Duration of Pregnancy (days)    Cases (frequency)
256 - 258                         76
259 - 265                        121
266 - 272                        334
273 - 279                        348
280 - 286                        205
287 - 289                         10
Total                           1094
From: Bhat & Khustagi: Singapore Med J 2006; 47(12)
-
Relative Frequency and Frequency Density
Relative Frequency of a class is the frequency of the class divided by the total number of observations
Frequency Density of a class is the Relative Frequency of the class divided by the class width.
-
Example
Duration of Pregnancy (days)    Frequency    Relative Frequency    Frequency Density
255 - 258.5                       76         0.07                  0.028
258.5 - 265.5                    121         0.11                  0.044
265.5 - 272.5                    334         0.31                  0.122
272.5 - 279.5                    348         0.32                  0.127
279.5 - 286.5                    205         0.19                  0.075
286.5 - 289                       10         0.01                  0.004
Total                           1094
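As a sketch, the two definitions can be applied to the class boundaries and frequencies of this example. The rounded relative frequencies agree with the table. The densities here divide by each class's own width (3.5, 7 or 2.5 days), so apart from the last class they come out smaller than the tabulated densities, which appear to correspond to a common divisor of 2.5. Treat any comparison with the table with that in mind.

```python
# Class boundaries and frequencies from the pregnancy duration table.
boundaries = [255, 258.5, 265.5, 272.5, 279.5, 286.5, 289]
frequencies = [76, 121, 334, 348, 205, 10]

total = sum(frequencies)
rows = []
for lower, upper, freq in zip(boundaries, boundaries[1:], frequencies):
    width = upper - lower
    relative = freq / total      # relative frequency
    density = relative / width   # frequency density
    rows.append((lower, upper, freq, relative, density))

for lower, upper, freq, relative, density in rows:
    print(f"{lower} - {upper}: {freq:4d} {relative:.2f} {density:.4f}")

# With this definition the bar areas (density x width) add up to 1.
total_area = sum(d * (u - l) for l, u, _, _, d in rows)
print(total_area)
```

Dividing by the class width is what makes a histogram with unequal classes comparable across bars: the area of each bar, not its height, equals the relative frequency of the class.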
-
Histogram
[Figure: histogram of Duration of Pregnancy over the classes 255 - 258.5, 258.5 - 265.5, 265.5 - 272.5, 272.5 - 279.5, 279.5 - 286.5 and 286.5 - 289; vertical axis: Frequency Density, 0.000 to 0.140]
-
Histogram: How many classes?
For construction of the histogram it is important to decide on the number of classes or, equivalently, the class width
The shape of the histogram depends heavily on the choice of the number of classes / class width.
Two popular approaches are:
a) Sturges' rule
b) Freedman-Diaconis rule
Both these rules usually (but not always) give similar results if the number of observations is less than 200.
The Freedman-Diaconis rule is the preferred / better rule for determining the class width of a histogram.
-
Sturges Rule
The number of classes (k) = [1 + log_2 n] + 1, where [1 + log_2 n] is the greatest integer less than or equal to 1 + log_2 n
In other words, choose k such that
2^(k-3) ... n ... k = 8
n = 83 => k = 9
The class width (h) is computed as Range divided by the number of classes.
h = (Max - Min) / k
-
Freedman-Diaconis Rule
The class width is given by $h = \dfrac{2\,\mathrm{IQR}}{n^{1/3}}$
The number of classes $k = \left[\dfrac{\mathrm{Range}}{h}\right] + 1$, where $[\,\cdot\,]$ is the greatest integer function.
-
Example
For the pregnancy duration data set we have n = 1094, Range = 289 - 255 = 34
Q1 = 267.1, Q3 = 278.3, IQR = 11.2
The Freedman-Diaconis rule gives the class width to be 2.18
The number of class intervals is therefore 16.
The class intervals are 255 - 257.18, 257.18 - 259.36, ..., 285.52 - 287.7 and 287.7 - 289 (note the last interval has shorter length)
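The arithmetic of this example can be reproduced with a short Python sketch of the Freedman-Diaconis rule (class width 2 IQR / n^(1/3), number of classes [Range / h] + 1). The function name is illustrative. From the rounded quartiles the width comes out at about 2.17, close to the 2.18 quoted above, and the number of classes is 16 either way.

```python
import math

def freedman_diaconis(n, iqr, data_range):
    """Return (class width h, number of classes k)."""
    h = 2 * iqr / n ** (1 / 3)
    # [Range / h] + 1, with [.] the greatest integer function.
    k = math.floor(data_range / h) + 1
    return h, k

# Figures quoted for the pregnancy duration data set.
n, q1, q3 = 1094, 267.1, 278.3
h, k = freedman_diaconis(n, q3 - q1, 289 - 255)
print(round(h, 3), k)

# Class boundaries using the width 2.18 quoted on the slide;
# the last class is cut short at the maximum, 289.
edges = [round(255 + i * 2.18, 2) for i in range(k)] + [289]
print(edges[:3], edges[-3:])
```

The first 15 classes have width 2.18 and end at 287.7. The sixteenth class, 287.7 - 289, is shorter because it stops at the maximum.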
-
Example: Stock Returns
55 daily returns
Five number summary:
Min = -8.213, Q1 = -1.413, Median = -0.1364
Q3 = 0.817, Max = 4.071
-
Histogram using Sturges' rule
[Figure: Histogram of tm; horizontal axis: tm, -10 to 5; vertical axis: Frequency, 0 to 20]
-
Histogram using Freedman-Diaconis rule
[Figure: Histogram of tm; horizontal axis: tm, -8 to 4; vertical axis: Frequency, 0 to 14]