Summarizing Data
Statistics
[Diagram: probability vs. statistics; labels: probability, statistics, sampling, inference]
Distribution?
Distribution:
A mathematical way to represent the diversity
of characteristics of a group.
The group may be a sample or a population.
• population distribution
• distribution of a sample

distribution of a sample    population distribution
realistic                   imaginary
data                        theory (model)
Statistics starts from data.
Data are clues to the truth and tell us about it.
Data are not just sets of numbers.
The 1st principle of statistics:
The sample is not the same as the population,
but the sample represents the population
sufficiently well.
sample ≈ population
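A small simulation can illustrate the principle. This is only a sketch: the population below is hypothetical (simulated numbers, not data from these slides), and the names population and smp are made up for the example.

```r
# Hypothetical population: 10,000 simulated values (not from the slides)
set.seed(1)
population <- rnorm(10000, mean = 50, sd = 10)

# Draw a simple random sample of 100 values without replacement
smp <- sample(population, size = 100)

# The sample summaries are not identical to the population summaries,
# but they are usually close
c(population = mean(population), sample = mean(smp))
c(population = sd(population), sample = sd(smp))
```

Rerunning with a different seed gives a different sample, but the sample mean and standard deviation should stay near the population values.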
Woodwork & Datawork
Woodwork
• From the forest
• Making timber
• Inspecting the wood grain
• Cutting
• Structuring
• Finishing
Datawork
• From the real world
• Collecting data
• Exploring data
• Reducing data
• Modeling
• Evaluating
Craft & Endeavor
Tools & Skills
Statistical tools
• Paper, pencil & calculator
• Spreadsheet software (Excel)
• Minitab, SPSS, SAS, R
• DBMS (Access, Oracle, …)
• C/C++, Java, Python, …
You need skill to use these tools.
You also need craft and experience.
However, the more important point in
datawork is trying to gain perspective
on the data at hand.
There is no standard recipe for good datawork.
Think, think, and think!
That's the only way.
Datawork is not magic. It's hard work.
Salagadoola mechicka boola, bibbidi-bobbidi-boo --
Grain of data?
Seeing the grain of data ≈ Exploratory Data Analysis
Exploratory Data Analysis (EDA)
The step that checks the basic properties of
the data using basic statistical
methods.
Through EDA, we aim to develop insight into
the data as a first step toward more specific
analysis.
Basic Statistical Methods
Qualitative variable
• frequency table
• cross-tabulation (contingency table)
• bar chart, pie chart, …
Quantitative scale
• (cumulative) frequency distribution
• histogram
• dot-plot
• stem & leaf diagram
• scatter plot
• box plot, …
Example Data
Credit_Card_Bank: p22 of SVV
• 12 variables and 100 observations
• Many types of 'offer' to cardholders
• Goal: find the type of 'offer' that increases cardholders' usage the most.
Variables:
[1] "Offer.Status" (Categorical)
[2] "Charges.Aug.2008" (Quantitative)
[3] "Charges.Sept.2008" (Quantitative)
[4] "Charges.Oct.2008" (Quantitative)
[5] "Marketing.Segment" (Categorical)
[6] "Industry.Segment" (Categorical)
[7] "Spendlift.After.Promotion" (Quantitative)
[8] "Pre.Promotion.Avg.Spend" (Quantitative)
[9] "Post.Promotion.Avg.Spend" (Quantitative)
[10] "Retail.Customer" (Yes, No)
[11] "Enrolled.in.Program" (Yes, No)
[12] "Spendlift.Positive" (Yes, No)
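Variables used below: oct08, mseg, iseg; loct08 = log(oct08)
# List the data files in the folder and build their full paths
data.svv <- dir("c:/temp/text")
dfile.svv <- paste("c:/temp/text/", data.svv, sep="")
# Read the first file (tab-separated, with a header row);
# stringsAsFactors=TRUE keeps mseg and iseg as factors (the default before R 4.0)
dsv <- read.table(dfile.svv[1], header=TRUE, sep="\t", stringsAsFactors=TRUE)
names(dsv)
# Column 4: October 2008 charges; take logs and keep only the positive charges
oct08 <- dsv[,4]; loct08 <- log(oct08); xoct08 <- loct08[oct08>0]
# Columns 5 and 6: marketing segment and industry segment
mseg <- dsv[,5]; iseg <- dsv[,6]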
[1] -Inf 6.21 3.96 3.84 6.96 6.95 7.89 7.35 3.97 5.97
[11] 5.50 8.00 6.30 3.13 4.58 8.89 3.81 7.00 8.37 7.85
[21] 7.42 6.86 8.45 6.12 5.62 8.21 6.91 6.87 7.15 5.46
[31] 6.71 6.12 -Inf 7.68 9.08 5.91 3.42 6.12 8.05 7.03
[41] 6.02 2.51 7.20 3.29 7.44 5.88 6.33 6.24 4.33 5.93
[51] 5.25 7.85 8.76 7.15 7.95 7.13 -Inf 7.13 8.11 8.05
[61] 9.11 5.56 8.24 -Inf 7.47 6.70 7.52 6.53 8.33 4.63
[71] 6.80 5.72 7.54 3.48 7.57 8.42 8.16 4.67 7.16 5.61
[81] -Inf 10.42 8.73 4.85 -Inf 6.63 5.48 4.89 8.35 4.65
[91] 5.56 7.39 3.11 3.90 5.72 7.10 -Inf 7.58 8.15 6.30
log(oct08): log(0) = -Inf
Rounded to the 2nd decimal place:
round(loct08,2)
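A minimal sketch of why the -Inf values appear and how they can be dropped. The charges vector is hypothetical (it is not taken from Credit_Card_Bank); it only mimics cardholders with zero charges.

```r
# Hypothetical monthly charges; two cardholders spent nothing
charges <- c(0, 500, 52.5, 0, 1050)

# log(0) is -Inf in R, so zero charges become -Inf after the transform
log_charges <- log(charges)
round(log_charges, 2)

# Count the -Inf cases and keep only the logs of positive charges
n_inf <- sum(is.infinite(log_charges))
positive <- log_charges[charges > 0]
```

Filtering on charges > 0 is the same idea as xoct08 <- loct08[oct08>0] in the slides.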
[1] 2.51 3.11 3.13 3.29 3.42 3.48 3.81 3.84 3.90 3.96
[11] 3.97 4.33 4.58 4.63 4.65 4.67 4.85 4.89 5.25 5.46
[21] 5.48 5.50 5.56 5.56 5.61 5.62 5.72 5.72 5.88 5.91
[31] 5.93 5.97 6.02 6.12 6.12 6.12 6.21 6.24 6.30 6.30
[41] 6.33 6.53 6.63 6.70 6.71 6.80 6.86 6.87 6.91 6.95
[51] 6.96 7.00 7.03 7.10 7.13 7.13 7.15 7.15 7.16 7.20
[61] 7.35 7.39 7.42 7.44 7.47 7.52 7.54 7.57 7.58 7.68
[71] 7.85 7.85 7.89 7.95 8.00 8.05 8.05 8.11 8.15 8.16
[81] 8.21 8.24 8.33 8.35 8.37 8.42 8.45 8.73 8.76 8.89
[91] 9.08 9.11 10.42
Sorted values of log(oct08), after deleting 7 cases of -Inf:
round(sort(xoct08),2)
iseg
 [1] B B A T A T B A A T B T T B B B B B T B R A T A A
[26] R B B R T T T A A B T B A R B B A T B B R T T A A
[51] B A B B T A A T B A B A B R B A A R A T B T T B R
[76] T A T A A B B B T R T T R T B A A A A A A B T A T
Levels: A B R T
The meaning of the levels is not known.
mseg
 [1] M L L M B A L A M H M L A M M B L B H L
[21] H B L H H M A B H L A H A B L H L B A A
[41] A H A L L H L A B A A B A B B A M A B L
[61] L B B H B A B A B L B A H L M L L M A B
[81] A L L M H A H H L A H L B A H A L L L H
Levels: L < B < M < A < H
L: low, B: below medium, M: medium, A: above medium, H: high
# Abbreviate the level names (stored in alphabetical order)
levels(mseg) <- c("M","H","L","A","B")
# Reorder the levels from low to high; ordered=TRUE gives the "L < B < ..." display
mseg <- factor(mseg, levels=c("L","B","M","A","H"), ordered=TRUE)
mseg
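The relabel-then-reorder steps can be tried on a toy vector. A sketch, assuming the original level names are the five spender labels that appear in the frequency table; the vector seg is hypothetical.

```r
# Hypothetical values using the five spender labels
seg <- c("Average Spender", "High Spender", "Low Spender",
         "Med High Spender", "Med Low Spender", "Low Spender")

# factor() stores the levels in alphabetical order
f <- factor(seg)
levels(f)

# Step 1: abbreviate the levels, position by position (alphabetical order)
levels(f) <- c("M", "H", "L", "A", "B")

# Step 2: reorder the levels from low to high and mark the factor as ordered
f <- factor(f, levels = c("L", "B", "M", "A", "H"), ordered = TRUE)
f

# With an ordered factor, comparisons follow the level order
f[1] < f[2]   # M < H
```

The abbreviations in step 1 must follow the alphabetical order of the original names, otherwise the labels end up attached to the wrong groups.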
[Figure: Histogram of loct08; x-axis: loct08 (2 to 10), y-axis: Frequency (0 to 20)]
hist(xoct08,col="grey")
Stem and leaf display:
leaf unit = 0.1
stem(xoct08)
   2 | 5
   3 | 11345889
   4 | 003667789
   5 | 3555666677999
   6 | 0011122333567789999
   7 | 000111122244445556678999
   8 | 0001222234444789
   9 | 11
  10 | 4
In "2 | 5", 2 is a stem and 5 is a leaf; it reads as 2.5.

leaf unit = 1
stem(10*xoct08)
   2 | 5
   3 | 11345889
   4 | 003667789
   5 | 3555666677999
   6 | 0011122333567789999
   7 | 000111122244445556678999
   8 | 0001222234444789
   9 | 11
  10 | 4
Here "2 | 5" reads as 25.
5 number summary of log(oct08):
summary(xoct08)
  Min.      Q1  Median      Q3    Max.
 2.509   5.563   6.864   7.682  10.420
IQR = 2.119
Quartiles: Q1, Q2, Q3
Q1: value ranked at 25% from the lowest
Q2: value ranked at 50% from the lowest
Q3: value ranked at 75% from the lowest
IQR (Inter-Quartile Range) = Q3 - Q1
Median = Q2
How to take Q1, Q2, Q3
Q1: c = 0.25*(n+1)
Q2: c = 0.5*(n+1)
Q3: c = 0.75*(n+1)
If c is an integer, take the c-th ranked value x[c].
If c is not an integer, take (x[c-] + x[c+])/2.
c-: the largest integer less than c
c+: the smallest integer greater than c
[1] 2.51 3.11 3.13 3.29 3.42 3.48 3.81 3.84 3.90 3.96
[11] 3.97 4.33 4.58 4.63 4.65 4.67 4.85 4.89 5.25 5.46
[21] 5.48 5.50 5.56 5.56 5.61 5.62 5.72 5.72 5.88 5.91
[31] 5.93 5.97 6.02 6.12 6.12 6.12 6.21 6.24 6.30 6.30
[41] 6.33 6.53 6.63 6.70 6.71 6.80 6.86 6.87 6.91 6.95
[51] 6.96 7.00 7.03 7.10 7.13 7.13 7.15 7.15 7.16 7.20
[61] 7.35 7.39 7.42 7.44 7.47 7.52 7.54 7.57 7.58 7.68
[71] 7.85 7.85 7.89 7.95 8.00 8.05 8.05 8.11 8.15 8.16
[81] 8.21 8.24 8.33 8.35 8.37 8.42 8.45 8.73 8.76 8.89
[91] 9.08 9.11 10.42
Sorted values of log(oct08), after deleting 7 cases of -Inf.
n = 93: 0.25*94 = 23.5, 0.5*94 = 47, 0.75*94 = 70.5
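The hand rule can be written as a short function. This is a sketch: quartile_by_rank is a made-up helper name, not a built-in R function, and it uses the rounded values listed above, so the results agree with the slides only to two decimals.

```r
# The 93 sorted values of log(oct08), rounded to 2 decimals (from the listing)
x <- c(2.51, 3.11, 3.13, 3.29, 3.42, 3.48, 3.81, 3.84, 3.90, 3.96,
       3.97, 4.33, 4.58, 4.63, 4.65, 4.67, 4.85, 4.89, 5.25, 5.46,
       5.48, 5.50, 5.56, 5.56, 5.61, 5.62, 5.72, 5.72, 5.88, 5.91,
       5.93, 5.97, 6.02, 6.12, 6.12, 6.12, 6.21, 6.24, 6.30, 6.30,
       6.33, 6.53, 6.63, 6.70, 6.71, 6.80, 6.86, 6.87, 6.91, 6.95,
       6.96, 7.00, 7.03, 7.10, 7.13, 7.13, 7.15, 7.15, 7.16, 7.20,
       7.35, 7.39, 7.42, 7.44, 7.47, 7.52, 7.54, 7.57, 7.58, 7.68,
       7.85, 7.85, 7.89, 7.95, 8.00, 8.05, 8.05, 8.11, 8.15, 8.16,
       8.21, 8.24, 8.33, 8.35, 8.37, 8.42, 8.45, 8.73, 8.76, 8.89,
       9.08, 9.11, 10.42)

# Hypothetical helper implementing the rule c = p*(n+1):
# integer c      -> the c-th ranked value
# non-integer c  -> the average of the two neighbouring ranked values
# (no handling of c < 1 or c > n; fine for quartiles of a sample this size)
quartile_by_rank <- function(x, p) {
  x <- sort(x)
  pos <- p * (length(x) + 1)
  if (pos == floor(pos)) {
    x[pos]
  } else {
    (x[floor(pos)] + x[ceiling(pos)]) / 2
  }
}

q1 <- quartile_by_rank(x, 0.25)  # c = 23.5 -> (x[23] + x[24]) / 2 = 5.56
q2 <- quartile_by_rank(x, 0.50)  # c = 47   -> x[47] = 6.86
q3 <- quartile_by_rank(x, 0.75)  # c = 70.5 -> (x[70] + x[71]) / 2 = 7.765
```

The Q3 from this rule (7.765) differs from the 7.682 reported by summary(xoct08), because R's default quantile rule is not the same as this hand rule. Small differences of this kind are expected.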
[Figure: Dot plot of loct08; axis from 2 to 12]
[Figure: Box plot of oct08; axis from 0 to 30000]
[Figure: Box plot of log(oct08); axis from 4 to 10]
boxplot(xoct08)
boxplot(oct08)
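[Figure: anatomy of a box plot. The box runs from Q1 to Q3 with Q2 marked inside; its length is the IQR. The whiskers end at min(non-outlier) and max(non-outlier), reaching at most 1.5 IQR from the box. Points beyond the whiskers are marked * as mild outliers or extreme outliers.]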
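A short sketch of the 1.5 IQR rule applied to the five number summary of log(oct08). It assumes the usual convention that a value is flagged as an outlier when it lies more than 1.5 IQR below Q1 or above Q3; the variable names are made up for the example.

```r
# Five number summary values of log(oct08), as reported by summary(xoct08)
x_min <- 2.509
q1    <- 5.563
q3    <- 7.682
x_max <- 10.420

# IQR and the 1.5 IQR limits (assumed convention: Q1 - 1.5*IQR, Q3 + 1.5*IQR)
iqr <- q3 - q1                  # 2.119
lower_limit <- q1 - 1.5 * iqr   # 2.3845
upper_limit <- q3 + 1.5 * iqr   # 10.8605

# Is any value outside the limits? Checking min and max is enough.
any_outlier <- x_min < lower_limit || x_max > upper_limit
any_outlier                     # FALSE
```

Under this convention the box plot of log(oct08) shows no outliers, because both the minimum and the maximum fall inside the limits.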
Frequency table
                   freq   rel. freq   cum. freq   cum. rel. freq
Low Spender          26        0.26          26             0.26
Med Low Spender      20        0.20          46             0.46
Average Spender      11        0.11          57             0.57
Med High Spender     25        0.25          82             0.82
High Spender         18        0.18         100             1.00
------------------------------------------------------------
Total               100        1.00
table(mseg)
table(mseg)/length(mseg)
cumsum(table(mseg))
cumsum(table(mseg))/length(mseg)
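The four columns of the frequency table can be rebuilt from the counts alone. A sketch that types in the counts from the table instead of reading the data file; the names counts and freq_table are made up for the example.

```r
# Counts per marketing segment, copied from the frequency table
counts <- c("Low Spender" = 26, "Med Low Spender" = 20,
            "Average Spender" = 11, "Med High Spender" = 25,
            "High Spender" = 18)

# Frequency, relative frequency, and their cumulative versions
freq_table <- data.frame(
  freq         = counts,
  rel_freq     = counts / sum(counts),
  cum_freq     = cumsum(counts),
  cum_rel_freq = cumsum(counts) / sum(counts)
)
freq_table
```

The cumulative columns only make sense because the segments are listed in order from low to high.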
[Figure: Bar chart of log(oct08); classes: (2,3], (3,4], (4,5], (5,6], (6,7], (7,8], (8,9], (9,10], (10,11]; y-axis: frequency (0 to 20)]
Histogram & Bar chart
Histogram: for quantitative variables, connected bars
Bar chart: for categorical variables, disconnected bars
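A sketch of how a quantitative variable turns into classes such as (2,3] and (3,4]. The vector x holds a few values picked from the sorted listing of log(oct08), so the counts are for illustration only.

```r
# A few values picked from the sorted log(oct08) listing (illustration only)
x <- c(2.51, 3.11, 3.96, 4.33, 5.25, 5.56, 6.12, 6.86, 7.15, 8.05, 9.11, 10.42)

# cut() turns the quantitative values into categorical classes (2,3], ..., (10,11]
classes <- cut(x, breaks = 2:11)
class_counts <- table(classes)
class_counts

# hist() with the same breaks counts the same values per class;
# plot = FALSE returns the counts without drawing
hist_counts <- hist(x, breaks = 2:11, plot = FALSE)$counts

# barplot(class_counts) would draw disconnected bars,
# hist(x, breaks = 2:11) would draw connected bars
```

Both approaches count the same values per class; the difference lies in whether the variable is treated as categorical or quantitative when it is drawn.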
Contingency table of mseg and iseg (rows: mseg, columns: iseg)
          A    B    R    T   Total
L         5   13    0    8      26
B        11    8    0    1      20
M         2    4    2    3      11
A         8    7    2    8      25
H         5    0    6    7      18
Total    31   32   10   27     100
table(mseg,iseg)
apply(table(mseg,iseg),1,sum)
apply(table(mseg,iseg),2,sum)
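The margins of the contingency table can be checked without the data file. A sketch that types in the cell counts from the table; the names counts and tab are made up for the example.

```r
# Cell counts copied from the contingency table (rows: mseg, columns: iseg)
counts <- matrix(c( 5, 13, 0, 8,
                   11,  8, 0, 1,
                    2,  4, 2, 3,
                    8,  7, 2, 8,
                    5,  0, 6, 7),
                 nrow = 5, byrow = TRUE,
                 dimnames = list(mseg = c("L", "B", "M", "A", "H"),
                                 iseg = c("A", "B", "R", "T")))
tab <- as.table(counts)

# Row totals, column totals, and the table with both margins added
rowSums(tab)
colSums(tab)
addmargins(tab)

# Row proportions: the distribution of iseg within each mseg level
round(prop.table(tab, margin = 1), 2)
```

rowSums() and colSums() give the same totals as the apply(..., 1, sum) and apply(..., 2, sum) calls in the slides.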
[Figure: Pie chart of iseg; slices: A = 31, B = 32, R = 10, T = 27]
pie(table(iseg),col=c("red","light green","green","blue"))
[Figure: Segmented bar chart of (mseg, iseg), serial; x-axis: iseg (A, B, R, T), y-axis: frequency (0 to 30)]
barplot(table(mseg,iseg),col=c("red","light green","green","blue","purple"))
[Figure: Segmented bar chart of (mseg, iseg), parallel; x-axis: iseg (A, B, R, T), y-axis: frequency (0 to 12)]
barplot(table(mseg,iseg),col=c("red","light green","green","blue","purple"),beside=TRUE)
[Figure: Mosaic Plot; horizontal: iseg (A, B, R, T), vertical: mseg (L, B, M, A, H)]
mosaicplot(~iseg+mseg,col=rainbow(5))
[Figure: Box plot of log(oct08) by mseg; x-axis: mseg (L, B, M, A, H), y-axis from 4 to 10]
boxplot(loct08[oct08>0]~mseg[oct08>0])
A B C D E F
10 11 0 3 3 11
7 17 1 5 5 9
20 21 7 12 3 15
14 11 2 6 5 22
14 16 3 4 3 15
12 14 1 3 6 16
10 17 2 5 1 13
23 17 1 5 1 10
17 19 3 5 3 26
20 21 0 5 2 26
14 7 1 2 6 24
13 13 4 4 4 13
[Figure: Box plot of the InsectSprays data; x-axis: Type of spray (A to F), y-axis: Insect count (0 to 25)]
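A sketch of how the box plot by spray type can be produced. It assumes the table above is R's built-in InsectSprays data set (columns count and spray); the slides do not show the call that was used.

```r
# InsectSprays ships with R: 72 rows, columns count and spray (A to F)
data(InsectSprays)

# Median insect count per type of spray
med <- tapply(InsectSprays$count, InsectSprays$spray, median)
med

# Box plot statistics per spray; plot = FALSE returns the numbers only.
# Drop plot = FALSE to draw the figure.
bp <- boxplot(count ~ spray, data = InsectSprays, plot = FALSE)
bp$names        # group labels
bp$stats[3, ]   # the middle row holds the medians
```

Sprays C, D, and E have clearly lower medians than A, B, and F, which is what the side-by-side boxes are meant to reveal.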
Thank you !!