Summarizing Data

Upload: nyx

Post on 28-Jan-2016


Page 1: Summarizing Data

Summarizing Data

Page 2: Summarizing Data

Statistics

Page 3: Summarizing Data

statistics

probability

probability vs. statistics

sampling

inference

Page 4: Summarizing Data

Distribution ?

Page 5: Summarizing Data

Distribution:

A mathematical way to represent the diversity of characteristics of a group.

The group may be a sample or a population.

• population distribution
• distribution of a sample

Page 6: Summarizing Data

Distribution of a sample: realistic, comes from data

Population distribution: imaginary, comes from theory (model)

statistics

Page 7: Summarizing Data
Page 8: Summarizing Data

Statistics starts from data.

Page 9: Summarizing Data

Data are clues to the truth, and they tell us something about it.

Data are not just sets of numbers.

Page 10: Summarizing Data

The first principle of statistics:

The sample is not the same as the population, but it represents the population sufficiently well.

Page 11: Summarizing Data

Page 12: Summarizing Data

Datawork

Page 13: Summarizing Data

Woodwork & Datawork

Woodwork:
• From forest
• Making timber
• Inspecting wood grain
• Cutting
• Structuring
• Finishing

Datawork:
• From real world
• Data collecting
• Exploring data
• Reducing data
• Modeling
• Evaluating

Page 14: Summarizing Data

Craft & Endeavor

Page 15: Summarizing Data

Tools & Skills

Page 16: Summarizing Data

Statistical tools

• Paper, pencil & calculator
• Spreadsheet software (Excel)
• Minitab, SPSS, SAS, R
• DBMS (Access, Oracle, …)
• C/C++, Java, Python, …

You need skill to use these.

Page 17: Summarizing Data

You also need craft and experience.

However, the more important point in datawork is trying to gain perspective on the data at hand.

Page 18: Summarizing Data

There is no standard recipe for good datawork.

Think, think, and think!

That’s the only way.

Page 19: Summarizing Data
Page 20: Summarizing Data

Datawork is not magic. It's a hard job.

Salagadoola mechicka boola, bibbidi-bobbidi-boo

Page 22: Summarizing Data

Grain of data ?

Page 23: Summarizing Data

Seeing the grain of data ≈ Exploratory Data Analysis

Page 24: Summarizing Data

Exploratory Data Analysis (EDA)

The step of checking the basic properties of the data, using basic statistical methods.

Through EDA, we aim to develop insight into the data, as a first step toward more specific analysis.

Page 25: Summarizing Data

Basic Statistical Methods: qualitative variables

• frequency table
• crosstabulation (contingency table)
• bar chart, pie chart, …

Page 26: Summarizing Data

Basic Statistical Methods: quantitative variables

• (cumulative) frequency distribution
• histogram
• dot plot
• stem & leaf diagram
• scatter plot
• box plot, …

Page 27: Summarizing Data

Example Data

Credit_Card_Bank: p22 of SVV

• 12 variables and 100 observations
• Many types of 'offer' to cardholders
• Goal: find the type of 'offer' that increases cardholders' usage the most.

Page 28: Summarizing Data

 [1] "Offer.Status" (Categorical)
 [2] "Charges.Aug.2008" (Quantitative)
 [3] "Charges.Sept.2008" (Quantitative)
 [4] "Charges.Oct.2008" (Quantitative)
 [5] "Marketing.Segment" (Categorical)
 [6] "Industry.Segment" (Categorical)
 [7] "Spendlift.After.Promotion" (Quantitative)
 [8] "Pre.Promotion.Avg.Spend" (Quantitative)
 [9] "Post.Promotion.Avg.Spend" (Quantitative)
[10] "Retail.Customer" (Yes, No)
[11] "Enrolled.in.Program" (Yes, No)
[12] "Spendlift.Positive" (Yes, No)

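Variables used below: oct08, mseg, iseg, and loct08 = log(oct08)

# Read the first tab-separated file found in c:/temp/text.
# stringsAsFactors=TRUE keeps the text columns as factors (needed since R 4.0).
data.svv <- dir("c:/temp/text")
dfile.svv <- paste("c:/temp/text/", data.svv, sep="")
dsv <- read.table(dfile.svv[1], header=TRUE, sep="\t", stringsAsFactors=TRUE)
names(dsv)

# October charges, their logs, and the finite logs only (charges > 0)
oct08 <- dsv[,4]; loct08 <- log(oct08); xoct08 <- loct08[oct08>0]

# Marketing and industry segments
mseg <- dsv[,5]; iseg <- dsv[,6]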

Page 29: Summarizing Data

[1] -Inf 6.21 3.96 3.84 6.96 6.95 7.89 7.35 3.97 5.97

[11] 5.50 8.00 6.30 3.13 4.58 8.89 3.81 7.00 8.37 7.85

[21] 7.42 6.86 8.45 6.12 5.62 8.21 6.91 6.87 7.15 5.46

[31] 6.71 6.12 -Inf 7.68 9.08 5.91 3.42 6.12 8.05 7.03

[41] 6.02 2.51 7.20 3.29 7.44 5.88 6.33 6.24 4.33 5.93

[51] 5.25 7.85 8.76 7.15 7.95 7.13 -Inf 7.13 8.11 8.05

[61] 9.11 5.56 8.24 -Inf 7.47 6.70 7.52 6.53 8.33 4.63

[71] 6.80 5.72 7.54 3.48 7.57 8.42 8.16 4.67 7.16 5.61

[81] -Inf 10.42 8.73 4.85 -Inf 6.63 5.48 4.89 8.35 4.65

[91] 5.56 7.39 3.11 3.90 5.72 7.10 -Inf 7.58 8.15 6.30

log(oct08), rounded to 2 decimal places. Note that log(0) = -Inf.

round(loct08,2)
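A minimal sketch of why the -Inf entries appear and how they can be dropped, assuming a short hypothetical vector `charges` in place of the real oct08 column:

```r
# Hypothetical monthly charges; two cardholders spent nothing.
charges <- c(0, 498.5, 52.4, 0, 1054.0)

log_charges <- log(charges)   # log(0) evaluates to -Inf in R
keep <- charges > 0           # same filter the slides use to build xoct08
x <- log_charges[keep]        # finite log values only

sum(!keep)                    # number of cases dropped
all(is.finite(x))             # TRUE once the zero charges are removed
```

Filtering on the original scale (`charges > 0`) rather than on the logged values keeps the reason for each dropped case explicit.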

Page 30: Summarizing Data

[1] 2.51 3.11 3.13 3.29 3.42 3.48 3.81 3.84 3.90 3.96

[11] 3.97 4.33 4.58 4.63 4.65 4.67 4.85 4.89 5.25 5.46

[21] 5.48 5.50 5.56 5.56 5.61 5.62 5.72 5.72 5.88 5.91

[31] 5.93 5.97 6.02 6.12 6.12 6.12 6.21 6.24 6.30 6.30

[41] 6.33 6.53 6.63 6.70 6.71 6.80 6.86 6.87 6.91 6.95

[51] 6.96 7.00 7.03 7.10 7.13 7.13 7.15 7.15 7.16 7.20

[61] 7.35 7.39 7.42 7.44 7.47 7.52 7.54 7.57 7.58 7.68

[71] 7.85 7.85 7.89 7.95 8.00 8.05 8.05 8.11 8.15 8.16

[81] 8.21 8.24 8.33 8.35 8.37 8.42 8.45 8.73 8.76 8.89

[91] 9.08 9.11 10.42

Sorted values of log(oct08), after deleting the 7 cases of -Inf:

round(sort(xoct08),2)

Page 31: Summarizing Data

iseg

 [1] B B A T A T B A A T B T T B B B B B T B R A T A A
[26] R B B R T T T A A B T B A R B B A T B B R T T A A
[51] B A B B T A A T B A B A B R B A A R A T B T T B R
[76] T A T A A B B B T R T T R T B A A A A A A B T A T

Levels: A B R T

The meaning of the levels is not known.

Page 32: Summarizing Data

mseg

 [1] M L L M B A L A M H M L A M M B L B H L
[21] H B L H H M A B H L A H A B L H L B A A
[41] A H A L L H L A B A A B A B B A M A B L
[61] L B B H B A B A B L B A H L M L L M A B
[81] A L L M H A H H L A H L B A H A L L L H
Levels: L < B < M < A < H

L: low, B: below medium, M: medium, A: above medium, H: high

# Rename the alphabetically sorted levels, then put them in their natural order.
# ordered=TRUE is what produces the "L < B < M < A < H" display above.
levels(mseg)<-c("M","H","L","A","B")
mseg<-factor(mseg, levels=c("L","B","M","A","H"), ordered=TRUE)
mseg
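The relabelling and reordering can also be done in one step. The sketch below assumes a short hypothetical vector `raw` holding the full segment labels that appear in the frequency table later in the deck; `mseg_demo` is an illustrative name, not the deck's variable.

```r
# Hypothetical raw labels, one per cardholder.
raw <- c("Low Spender", "High Spender", "Average Spender",
         "Med Low Spender", "Med High Spender", "Low Spender")

# levels = gives the natural order, labels = the one-letter codes,
# ordered = TRUE makes comparisons such as < meaningful.
mseg_demo <- factor(
  raw,
  levels  = c("Low Spender", "Med Low Spender", "Average Spender",
              "Med High Spender", "High Spender"),
  labels  = c("L", "B", "M", "A", "H"),
  ordered = TRUE
)

mseg_demo
mseg_demo[1] < mseg_demo[2]   # L < H
```

Spelling out `levels` and `labels` together avoids relying on the alphabetical order of the original labels.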

Page 33: Summarizing Data

[Figure: Histogram of loct08; x-axis: loct08 (2 to 10), y-axis: Frequency (0 to 20)]

hist(xoct08,col="grey")

Page 34: Summarizing Data

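Stem and leaf display (leaf unit = 0.1):

   2 | 5
   3 | 11345889
   4 | 003667789
   5 | 3555666677999
   6 | 0011122333567789999
   7 | 000111122244445556678999
   8 | 0001222234444789
   9 | 11
  10 | 4

The number left of the bar is a stem and each digit right of it is a leaf. For example, stem 2 with leaf 5 stands for 2.5.

stem(xoct08)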

Page 35: Summarizing Data

Stem and leaf display (leaf unit = 1):

   2 | 5
   3 | 11345889
   4 | 003667789
   5 | 3555666677999
   6 | 0011122333567789999
   7 | 000111122244445556678999
   8 | 0001222234444789
   9 | 11
  10 | 4

Here stem 2 with leaf 5 stands for 25.

stem(10*xoct08)

Page 36: Summarizing Data

5-number summary of log(oct08):

 Min.     Q1   Median     Q3    Max.
2.509  5.563    6.864  7.682  10.420

IQR = 2.119

summary(xoct08)

Page 37: Summarizing Data

Quartiles: Q1, Q2, Q3

Q1: the value ranked 25% of the way up from the lowest

Q2: the value ranked 50% of the way up from the lowest

Q3: the value ranked 75% of the way up from the lowest

IQR (Inter-Quartile Range) = Q3 - Q1

Median = Q2

Page 38: Summarizing Data

How to compute Q1, Q2, Q3

Q1: c = 0.25*(n+1)

Q2: c = 0.5*(n+1)

Q3: c = 0.75*(n+1)

If c is an integer, take the c-th ranked value, x[c].

If c is not an integer, take (x[c-] + x[c+])/2, where

c- : the largest integer below c

c+ : the smallest integer above c
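The rule above can be sketched as a small function. `rank_quantile` is a hypothetical helper name, not a base R function, and the two input vectors are made-up toy data chosen so that one gives integer ranks and the other does not.

```r
# Quartile by the (n + 1) rule: rank = p * (n + 1);
# integer rank -> that ranked value, otherwise the mean of the two neighbours.
rank_quantile <- function(x, p) {
  x <- sort(x)
  pos <- p * (length(x) + 1)
  lo <- floor(pos)
  hi <- ceiling(pos)
  stopifnot(lo >= 1, hi <= length(x))   # the rule needs both neighbours to exist
  if (lo == hi) x[pos] else (x[lo] + x[hi]) / 2
}

probs <- c(0.25, 0.5, 0.75)
odd  <- c(13, 1, 7, 3, 11, 5, 9)   # n = 7: ranks 2, 4, 6
even <- c(2, 4, 6, 8, 10, 12)      # n = 6: ranks 1.75, 3.5, 5.25

q_odd  <- sapply(probs, function(p) rank_quantile(odd, p))
q_even <- sapply(probs, function(p) rank_quantile(even, p))
q_odd    # 3 7 11
q_even   # 3 7 11
```

Note that this rule always averages the two neighbouring values, whatever the fractional part of the rank, so it is not the same as linear interpolation in general.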

Page 39: Summarizing Data

[1] 2.51 3.11 3.13 3.29 3.42 3.48 3.81 3.84 3.90 3.96

[11] 3.97 4.33 4.58 4.63 4.65 4.67 4.85 4.89 5.25 5.46

[21] 5.48 5.50 5.56 5.56 5.61 5.62 5.72 5.72 5.88 5.91

[31] 5.93 5.97 6.02 6.12 6.12 6.12 6.21 6.24 6.30 6.30

[41] 6.33 6.53 6.63 6.70 6.71 6.80 6.86 6.87 6.91 6.95

[51] 6.96 7.00 7.03 7.10 7.13 7.13 7.15 7.15 7.16 7.20

[61] 7.35 7.39 7.42 7.44 7.47 7.52 7.54 7.57 7.58 7.68

[71] 7.85 7.85 7.89 7.95 8.00 8.05 8.05 8.11 8.15 8.16

[81] 8.21 8.24 8.33 8.35 8.37 8.42 8.45 8.73 8.76 8.89

[91] 9.08 9.11 10.42

Sorted values of log(oct08), after deleting the 7 cases of -Inf.

n = 93, 0.25*94 = 23.5, 0.5*94 = 47, 0.75*94 = 70.5
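As a worked check, the sketch below applies these ranks to the 93 sorted values listed above (typed in as printed, so rounded to 2 decimals). It relies on the fact that `quantile(type = 6)` also works from the rank p*(n+1), and that for ranks ending in .5 its interpolation equals the average of the two neighbours. R's default (`type = 7`, which `summary()` uses) works from a different rank, which would explain why the 5-number summary reports Q3 = 7.682 rather than the hand-rule value.

```r
# Sorted log(oct08) values as printed on the slide (rounded to 2 decimals).
x <- c(2.51, 3.11, 3.13, 3.29, 3.42, 3.48, 3.81, 3.84, 3.90, 3.96,
       3.97, 4.33, 4.58, 4.63, 4.65, 4.67, 4.85, 4.89, 5.25, 5.46,
       5.48, 5.50, 5.56, 5.56, 5.61, 5.62, 5.72, 5.72, 5.88, 5.91,
       5.93, 5.97, 6.02, 6.12, 6.12, 6.12, 6.21, 6.24, 6.30, 6.30,
       6.33, 6.53, 6.63, 6.70, 6.71, 6.80, 6.86, 6.87, 6.91, 6.95,
       6.96, 7.00, 7.03, 7.10, 7.13, 7.13, 7.15, 7.15, 7.16, 7.20,
       7.35, 7.39, 7.42, 7.44, 7.47, 7.52, 7.54, 7.57, 7.58, 7.68,
       7.85, 7.85, 7.89, 7.95, 8.00, 8.05, 8.05, 8.11, 8.15, 8.16,
       8.21, 8.24, 8.33, 8.35, 8.37, 8.42, 8.45, 8.73, 8.76, 8.89,
       9.08, 9.11, 10.42)
stopifnot(length(x) == 93)

probs <- c(0.25, 0.5, 0.75)
hand_rule <- quantile(x, probs, type = 6)   # ranks 23.5, 47, 70.5
r_default <- quantile(x, probs)             # type 7: ranks 24, 47, 70

hand_rule   # Q1 = (x[23] + x[24]) / 2, Q2 = x[47], Q3 = (x[70] + x[71]) / 2
r_default   # Q1 = x[24], Q2 = x[47], Q3 = x[70]
```

The two conventions agree on Q1 and the median here and differ only for Q3, because x[70] and x[71] are far apart (7.68 and 7.85).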

Page 40: Summarizing Data

[Figure: Dot plot of loct08; axis from 2 to 12]

Page 41: Summarizing Data

[Figure: Box plot of oct08 (axis 0 to 30000) and box plot of log(oct08) (axis 4 to 10)]

boxplot(xoct08)
boxplot(oct08)

Page 42: Summarizing Data

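[Figure: Anatomy of a box plot. The box runs from Q1 to Q3 with Q2 marked inside, and its length is the IQR. The whiskers end at the smallest and largest non-outliers, at most 1.5 IQR from the box. Points beyond the whiskers (*) are mild outliers or, farther out, extreme outliers.]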
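A minimal sketch of the outlier rule behind the box plot, on hypothetical data. The 1.5 IQR limit comes from the slide; the 3 IQR cut-off separating mild from extreme outliers is a common convention assumed here, not something the slide states.

```r
# Hypothetical data with one mild and one extreme high value.
x <- c(1:9, 18, 50)

q <- quantile(x, c(0.25, 0.75))   # R's default quantile rule
iqr <- unname(q[2] - q[1])

inner <- unname(c(q[1] - 1.5 * iqr, q[2] + 1.5 * iqr))  # whisker limits
outer <- unname(c(q[1] - 3 * iqr,   q[2] + 3 * iqr))    # assumed "extreme" limits

extreme  <- x[x < outer[1] | x > outer[2]]
mild     <- setdiff(x[x < inner[1] | x > inner[2]], extreme)
whiskers <- range(x[x >= inner[1] & x <= inner[2]])     # min and max non-outlier

list(iqr = iqr, whiskers = whiskers, mild = mild, extreme = extreme)
```

`boxplot()` itself computes the box from Tukey's hinges rather than `quantile()`, so its limits can differ slightly from this sketch on other data.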

Page 43: Summarizing Data

Frequency table

                   freq  %freq  cum. freq  %cum. freq
Low Spender          26   0.26         26        0.26
Med Low Spender      20   0.20         46        0.46
Average Spender      11   0.11         57        0.57
Med High Spender     25   0.25         82        0.82
High Spender         18   0.18        100        1.00
------------------------------------------------------------
Total               100   1.00

table(mseg)
table(mseg)/length(mseg)
cumsum(table(mseg))
cumsum(table(mseg))/length(mseg)
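The four columns can be assembled into one data frame. The sketch below rebuilds a stand-in for mseg from the counts in the table above, since the raw file is not available here; `mseg_demo` and `freq_table` are illustrative names.

```r
# Rebuild a factor with the same counts as the slide's table.
lev <- c("Low Spender", "Med Low Spender", "Average Spender",
         "Med High Spender", "High Spender")
mseg_demo <- factor(rep(lev, times = c(26, 20, 11, 25, 18)), levels = lev)

freq <- as.vector(table(mseg_demo))
n <- length(mseg_demo)

freq_table <- data.frame(
  freq         = freq,
  pct_freq     = freq / n,
  cum_freq     = cumsum(freq),
  pct_cum_freq = cumsum(freq) / n,
  row.names    = lev
)
freq_table
```

Cumulative columns only make sense because the levels are listed in their natural low-to-high order.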

Page 44: Summarizing Data

[Figure: Bar chart of log(oct08), binned into (2,3], (3,4], (4,5], (5,6], (6,7], (7,8], (8,9], (9,10], (10,11]; y-axis from 0 to 20]

Page 45: Summarizing Data

Histogram & Bar chart

Histogram: for quantitative variables; adjacent bars touch

Bar chart: for categorical variables; bars are separated
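The link between the two displays can be sketched on a few hypothetical values: cutting a quantitative variable into intervals turns it into a categorical one, and the counts of those categories are the heights a histogram with the same breaks would draw.

```r
# Hypothetical quantitative values and unit-width bins.
x <- c(2.5, 3.1, 3.9, 4.0, 6.2)
breaks <- 2:7

bins <- cut(x, breaks = breaks)    # factor with levels (2,3], (3,4], ...
bar_counts <- table(bins)          # what barplot(bar_counts) would show

# Same counts from hist(), without drawing the plot.
hist_counts <- hist(x, breaks = breaks, plot = FALSE)$counts

bar_counts
hist_counts
```

Both functions use right-closed intervals by default, which is why 4.0 falls in (3,4] in either case.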

Page 46: Summarizing Data

Contingency table of mseg (rows) and iseg (columns)

          A    B    R    T   Total
L         5   13    0    8      26
B        11    8    0    1      20
M         2    4    2    3      11
A         8    7    2    8      25
H         5    0    6    7      18
Total    31   32   10   27     100

table(mseg,iseg)
apply(table(mseg,iseg),1,sum)
apply(table(mseg,iseg),2,sum)
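As a sketch, the same table with its margins can be produced in one call with `addmargins()`. The counts are typed in from the table above because the raw data file is not available here; `counts` and `tab` are illustrative names.

```r
# Cell counts from the slide, rows = mseg, columns = iseg.
counts <- matrix(
  c( 5, 13, 0, 8,
    11,  8, 0, 1,
     2,  4, 2, 3,
     8,  7, 2, 8,
     5,  0, 6, 7),
  nrow = 5, byrow = TRUE,
  dimnames = list(mseg = c("L", "B", "M", "A", "H"),
                  iseg = c("A", "B", "R", "T"))
)
tab <- as.table(counts)

addmargins(tab)                         # adds the row and column totals
round(prop.table(tab, margin = 1), 2)   # row proportions within each mseg level
```

Row proportions make the segments comparable even though their sizes differ (26 cardholders in L, only 11 in M).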

Page 47: Summarizing Data

[Figure: Pie chart of iseg; A = 31, B = 32, R = 10, T = 27]

pie(table(iseg),col=c("red","light green","green","blue"))

Page 48: Summarizing Data

[Figure: Segmented bar chart of (mseg, iseg), serial (stacked); x-axis: iseg levels A, B, R, T; y-axis from 0 to 30]

barplot(table(mseg,iseg),col=c("red","light green","green","blue","purple"))

Page 49: Summarizing Data

[Figure: Segmented bar chart of (mseg, iseg), parallel (side by side); x-axis: iseg levels A, B, R, T; y-axis from 0 to 12]

barplot(table(mseg,iseg),col=c("red","light green","green","blue","purple"),beside=TRUE)

Page 50: Summarizing Data

[Figure: Mosaic plot; horizontal axis: iseg (A, B, R, T), vertical axis: mseg (L, B, M, A, H)]

mosaicplot(~iseg+mseg,col=rainbow(5))

Page 51: Summarizing Data

[Figure: Box plot of log(oct08) by mseg; x-axis: L, B, M, A, H; y-axis from 4 to 10]

boxplot(loct08[oct08>0]~mseg[oct08>0])

Page 52: Summarizing Data

A B C D E F

10 11 0 3 3 11

7 17 1 5 5 9

20 21 7 12 3 15

14 11 2 6 5 22

14 16 3 4 3 15

12 14 1 3 6 16

10 17 2 5 1 13

23 17 1 5 1 10

17 19 3 5 3 26

20 21 0 5 2 26

14 7 1 2 6 24

13 13 4 4 4 13

Page 53: Summarizing Data

[Figure: InsectSprays data; x-axis: Type of spray (A to F), y-axis: Insect count (0 to 25)]
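The table on the previous slide matches R's built-in InsectSprays data set (72 counts, 12 per spray). A sketch of how a figure like this one might be drawn, assuming it is the usual box plot of count by spray type:

```r
# InsectSprays ships with R (package "datasets"): columns count and spray.
str(InsectSprays)

# One box per spray; the axis labels follow the slide.
bp <- boxplot(count ~ spray, data = InsectSprays,
              xlab = "Type of spray", ylab = "Insect count",
              main = "InsectSprays data")

# Median count per spray, to read alongside the plot.
tapply(InsectSprays$count, InsectSprays$spray, median)
```

`boxplot()` returns its summary statistics invisibly, so `bp$stats` and `bp$out` can be inspected after plotting.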

Page 54: Summarizing Data

Thank you !!