Data
Quantitative Qualitative
Quantitative – a variable that can be measured numerically
Qualitative – a variable that cannot assume a numerical value but can be classified into two or more nonnumeric categories
Data at the nominal level of measurement are qualitative only. Data at this level are categorized using names, labels, or qualities. No mathematical computations can be made at this level.
Data at the ordinal level of measurement are qualitative or quantitative. Data at this level can be arranged in order, or ranked, but differences between data entries are not meaningful.
Levels of measurement
Data at the ratio level of measurement are similar to data at the interval level, with the added property that a zero entry is an inherent zero. A ratio of two data values can be formed so that one data value can be meaningfully expressed as a multiple of another.
Levels of measurement (continued)
Data at the interval level of measurement can be ordered, and meaningful differences between data entries can be calculated. At the interval level, a zero entry simply represents a position on a scale; the entry is not an inherent zero.
Level     Put data in categories  Arrange data in order  Subtract data values  Determine if one data value is a multiple of another
Nominal   Yes                     No                     No                    No
Ordinal   Yes                     Yes                    No                    No
Interval  Yes                     Yes                    Yes                   No
Ratio     Yes                     Yes                    Yes                   Yes
Descriptive statistics
Central tendency, Variability, Shape (Histograms)
Central tendency > Mean
Wikipedia: Temperature in the mouth (oral) is about 37.0 °C (98.6 °F)
Rank Company Revenues Profits
1 Royal_Dutch_Shell 484,489 30,918
2 Exxon_Mobil 452,926 41,060
3 Wal-Mart_Stores 446,950 15,699
4 BP 386,463 25,700
5 Sinopec_Group 375,214 9,453
6 China_National_Petroleum 352,338 16,317
7 State_Grid 259,142 5,678
8 Chevron 245,621 26,895
9 ConocoPhillips 237,272 12,436
10 Toyota_Motor 235,364 3,591
11 Total 231,580 17,069
12 Volkswagen 221,551 21,426
13 Japan_Post_Holdings 211,019 5,939
14 Glencore_International 186,152 4,048
15 Gazprom 157,831 44,460
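As a rough illustration of the mean, the sketch below averages the Revenues and Profits columns of the 15 companies listed above. The variable names are illustrative only and are not part of the slides.

```python
from statistics import mean

# Revenues and profits of the 15 companies in the table above (same order).
revenues = [484489, 452926, 446950, 386463, 375214, 352338, 259142, 245621,
            237272, 235364, 231580, 221551, 211019, 186152, 157831]
profits = [30918, 41060, 15699, 25700, 9453, 16317, 5678, 26895,
           12436, 3591, 17069, 21426, 5939, 4048, 44460]

# The mean is the sum of the values divided by how many values there are.
mean_revenue = sum(revenues) / len(revenues)
mean_profit = mean(profits)  # statistics.mean performs the same calculation

print(f"Mean revenue: {mean_revenue:.1f}")
print(f"Mean profit:  {mean_profit:.1f}")
```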
Central tendency > Median
Rank Company Revenues Profits
43 Hon_Hai_Precision_Industry 117,514 2,777
44 Banco_Santander 117,408 7,440
45 EXOR_Group 117,297 701
46 Bank_of_America_Corp. 115,074 1,446
47 Siemens 113,349 8,562
48 Assicurazioni_Generali 112,628 1,190
49 Lukoil 111,433 10,357
50 Verizon_Communications 110,875 2,404
51 J.P._Morgan_Chase_&_Co. 110,838 18,976
52 Enel 110,560 5,768
53 HSBC_Holdings 110,141 16,797
54 Industrial_&_Commercial_Bank_of_China 109,040 32,214
55 Apple 108,249 25,922
56 CVS_Caremark 107,750 3,461
57 International_Business_Machines 106,916 15,855
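A minimal sketch of the median, using revenues from the transcript. It assumes that ranks 50 and 51 are the two middle entries of the full revenue list; under that assumption their average agrees with the median reported on the revenue histogram later in the transcript. For comparison it also takes the median and mean of the 15 largest revenues, where the few very large values pull the mean above the median.

```python
from statistics import mean, median

# Revenues of the 15 largest companies (ranks 1 to 15 in the transcript).
top15_revenues = [484489, 452926, 446950, 386463, 375214, 352338, 259142,
                  245621, 237272, 235364, 231580, 221551, 211019, 186152,
                  157831]

# Odd number of values: the median is the single middle value once sorted.
median_top15 = median(top15_revenues)
mean_top15 = mean(top15_revenues)

# Even number of values: the median is the average of the two middle values.
# Assumption: ranks 50 and 51 form the middle pair of the full list.
rank_50, rank_51 = 110875, 110838
median_middle_pair = median([rank_50, rank_51])

print(f"Top 15: median = {median_top15}, mean = {mean_top15:.1f}")
print(f"Average of ranks 50 and 51 = {median_middle_pair}")
```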
Central tendency > Mode
Best football player
Cristiano Ronaldo
                Goal      Pass      G+P
N               68        68        68
Mean            1,000000  0,220588  1,220588
Median          1,000000  0,000000  1,000000
Mode            0,000000  0,000000  1,000000
Mode frequency  26        56        23
Min             0,00      0,00      0,00
Max             3,000000  2,000000  4,000000
Variance        0,985075  0,264047  1,249122
Std.Dev.        0,992509  0,513855  1,117641
Lionel Messi
                Goal      Pass      G+P
N               73        73        73
Mean            1,123288  0,397260  1,520548
Median          1,000000  0,000000  1,000000
Mode            0,000000  0,000000  0,000000
Mode frequency  31        51        24
Min             0,00      0,00      0,00
Max             5,000000  3,000000  5,000000
Variance        1,554033  0,464992  2,030822
Std.Dev.        1,246609  0,681904  1,425069
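The "Mode" and "Mode frequency" rows can be reproduced with a simple count. The sketch below uses a short, invented list of goals per match (the per-match data behind the slides are not in the transcript), so the numbers are illustrative only.

```python
from collections import Counter

# Hypothetical goals-per-match values, invented purely for illustration.
goals_per_match = [0, 1, 0, 2, 1, 0, 3, 1, 0, 0]

# Count how often each value occurs; the mode is the most frequent value.
counts = Counter(goals_per_match)
mode_value, mode_frequency = counts.most_common(1)[0]

print(f"Mode: {mode_value}, mode frequency: {mode_frequency}")
```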
Descriptive statistics
Central tendency, Variability, Shape (Histograms)
Variability > Range
Body Temperature
Variability > Deviations from the Mean
Variability > Variance and Std.Dev.
So what?
1. Shows how values deviate (in general) from the mean
2. Comparative measure for samples (populations)
3. Critical element in understanding the concept of the Normal Distribution
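To make the chain from deviations to variance to standard deviation concrete, here is a small sketch. The temperatures are invented for illustration, and the sample (n - 1) form of the variance is an assumption, since the slide formulas did not survive in the transcript.

```python
from math import sqrt

# Hypothetical body temperatures in °C, invented for illustration.
temps = [36.4, 36.8, 37.0, 37.1, 37.2]

n = len(temps)
mean_temp = sum(temps) / n

# Deviation of each value from the mean; these always sum to (about) zero.
deviations = [t - mean_temp for t in temps]

# Squaring removes the signs; dividing by n - 1 gives the sample variance.
sum_of_squares = sum(d * d for d in deviations)
variance = sum_of_squares / (n - 1)

# The standard deviation is the square root of the variance,
# which brings the measure back to the original units (°C).
std_dev = sqrt(variance)

print(f"Mean = {mean_temp:.2f}, variance = {variance:.3f}, SD = {std_dev:.3f}")
```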
Body temperature…again
Variability > Z-score
Test Score
Test                                Mean  Standard Deviation  Your Score
Math                                82    6                   80
Verbal                              75    3                   75
Science                             60    5                   70
Logic                               70    7                   77
Foreign Language Ability (max 250)  120   15                  90
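A z-score expresses a value as the number of standard deviations it lies above or below the mean, z = (score - mean) / SD, which makes scores on different scales comparable. The sketch below applies this to the test score table; the dictionary layout is just one convenient way to hold the numbers.

```python
# (test mean, standard deviation, your score) for each test in the table.
tests = {
    "Math": (82, 6, 80),
    "Verbal": (75, 3, 75),
    "Science": (60, 5, 70),
    "Logic": (70, 7, 77),
    "Foreign Language Ability": (120, 15, 90),
}

# z = (score - mean) / standard deviation
z_scores = {
    name: (score - test_mean) / sd
    for name, (test_mean, sd, score) in tests.items()
}

# List the tests from best to worst relative performance.
for name, z in sorted(z_scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:26s} z = {z:+.2f}")
```

Ranked this way, Science is the strongest result relative to the other test takers even though its raw score is the lowest.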
Box & Whisker Plot
[Figure: Box & Whisker Plot of Goal_messi and Goal_ronaldo; legend: Mean, Mean±SD, Mean±1,96*SD]
Box & Whisker Plot
[Figure: Box & Whisker Plot of Goal_messi and Goal_ronaldo; legend: Mean, Mean±SD, Min-Max, Outliers, Extremes]
Variability > n vs. (n-1)
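The difference between dividing by n and by n - 1 can be seen on a small data set. This sketch uses the eight scores of the 8-hours-sleep group that appear in the independent t-test slide later in the transcript, and relies on Python's `statistics.pvariance` (divides by n) and `statistics.variance` (divides by n - 1).

```python
from statistics import pvariance, variance

# Scores of the 8-hours-sleep group from the independent t-test slide.
scores = [5, 7, 5, 3, 5, 3, 3, 9]

# Sum of squared deviations from the mean is 32 for these scores.
population_variance = pvariance(scores)  # 32 / 8: treats the data as a whole population
sample_variance = variance(scores)       # 32 / 7: estimates the variance of a larger population

print(f"Divide by n:     {population_variance:.3f}")
print(f"Divide by n - 1: {sample_variance:.3f}")
```

The n - 1 result, about 4.571, is the value under the square root on the t-test slide.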
Histograms
[Figure: histogram of Body Temperature; x-axis: Temperature (35.5 to 38.5), y-axis: Frequency]
Histograms
[Figure: Histogram: HR, Men; x-axis: category boundaries (50 to 100), y-axis: No. of obs.]
[Figure: Histogram: HR, Women; x-axis: category boundaries (50 to 100), y-axis: No. of obs.]
Histograms: Skewness
[Figure: Histogram: Revenues of Top 500 Companies; x-axis: category boundaries (0 to 5E5), y-axis: No. of obs.]
Mean: 136319,2; Median: 110856,5; Min: 76024,00; Max: 484489,0; SD: 82165,07; Skew: 2,699004
Wine Tasting
Descriptive Statistics
Wine      N   Mean  Median  Mode      Mode Fr.  Min  Max   Std.Dev.  Skew
RedTruck  30  5,50  5,50    Multiple  5         1,0  10,0  2,25      -0,00
WoopWoop  30  5,50  5,50    Multiple  3         1,0  10,0  2,92      -0,00
HobNob    30  5,03  5,00    6,00      7         1,0  10,0  2,00      0,36
FourPlay  30  5,96  6,00    5,00      7         1,0  10,0  2,00      -0,36
Wine Tasting
[Figure: Histogram: RedTruck, with expected normal curve; K-S d=,08773, p> .20; Lilliefors p> .20]
[Figure: Histogram: WoopWoop, with expected normal curve; K-S d=,10393, p> .20; Lilliefors p> .20]
[Figure: Histogram: HobNob, with expected normal curve; K-S d=,14847, p> .20; Lilliefors p<,10]
[Figure: Histogram: FourPlay, with expected normal curve; K-S d=,14847, p> .20; Lilliefors p<,10]
Normal distribution
The normal (or Gaussian) distribution is a continuous probability distribution that has a bell-shaped probability density function, known as the Gaussian function or informally as the bell curve.
[Figure: Histogram: RedTruck, with expected normal curve; K-S d=,08773, p> .20; Lilliefors p> .20]
The 68-95-99.7 rule, or three-sigma rule, states that for a normal distribution nearly all values lie within 3 standard deviations of the mean: about 68% within 1, about 95% within 2, and about 99.7% within 3.
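The three percentages can be checked numerically from the normal cumulative distribution function. A minimal sketch using `statistics.NormalDist` from the Python standard library:

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

# Probability of falling within k standard deviations of the mean:
# P(-k < Z < k) = cdf(k) - cdf(-k)
coverage = {
    k: standard_normal.cdf(k) - standard_normal.cdf(-k)
    for k in (1, 2, 3)
}

for k, share in coverage.items():
    print(f"Within {k} SD: {share:.4f}")
```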
Chebyshev's inequality
No more than 1/k^2 of the distribution’s values can be more than k standard deviations away from the mean.
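Chebyshev's bound holds for any distribution with a finite variance, so it is much looser than what a normal distribution actually delivers. The sketch below compares the two; the function names are illustrative.

```python
from statistics import NormalDist


def chebyshev_bound(k: float) -> float:
    """Largest possible share of values more than k SDs from the mean."""
    return 1 / k ** 2


def normal_tail(k: float) -> float:
    """Share of a normal distribution more than k SDs from the mean."""
    return 2 * (1 - NormalDist().cdf(k))


for k in (2, 3, 4):
    print(f"k = {k}: Chebyshev <= {chebyshev_bound(k):.4f}, "
          f"normal = {normal_tail(k):.4f}")
```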
The bean machine, by Francis Galton
Parameter and statistic
4 fundamental concepts
1. Random Sampling
2. Sampling Error
3. The Sampling Distribution of Sample Means
4. The Central Limit Theorem
The Sampling Distribution of Sample Means
Random Sampling
• every unit or case in the population has an equal chance of being selected
• selection of any single case or unit doesn’t affect the selection of any other unit or case
• the cases or units are selected in such a way that all combinations are possible
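One way to draw such a sample in practice is sketched below, under assumptions: the population is a hypothetical list of 110 unit IDs, and the sample size of 31 is arbitrary. `random.Random.sample` draws without replacement, giving every unit the same chance and every combination of 31 units the same probability.

```python
import random

# Fixed seed so that rerunning the sketch gives the same sample.
rng = random.Random(42)

# Hypothetical population: ID numbers 1..110 standing in for real units.
population = list(range(1, 111))

# Simple random sample without replacement.
sample = rng.sample(population, k=31)

print(sorted(sample))
```

If selections must be fully independent of one another, sampling with replacement (`rng.choices(population, k=31)`) is the alternative.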
Sampling Error
        Valid N  Mean      Std.Dev.
Pulse1  110      75,91818  13,45474
[Figure: Histogram: Pulse1; x-axis: category boundaries (30 to 150), y-axis: No. of obs.]
Sampling Error
Valid N  Mean      Std.Dev.
31       76,45161  10,59509
[Figure: Histogram: Pulse1, with expected normal curve; K-S d=,14818, p> .20; Lilliefors p<,10]
Valid N  Mean      Std.Dev.
37       77,97297  16,40374
[Figure: Histogram: Pulse1, with expected normal curve; K-S d=,15064, p> .20; Lilliefors p<,05]
Valid N  Mean      Std.Dev.
69       74,15942  10,87445
[Figure: Histogram: Pulse1, with expected normal curve; K-S d=,05831, p> .20; Lilliefors p> .20]
                  Valid N  Mean      Std.Dev.
Pulse population  110      75,91818  13,45474
Sampling Error
The difference between the population and the sample
PROBLEM!
We typically don’t know the population parameters.
Standard error is an estimate of the amount of sampling error:
SE = SD / SQRT(N)
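Both ideas can be put side by side with the Pulse numbers above. This sketch reads the three sample summaries as N = 31, 37 and 69; it computes the actual sampling error (sample mean minus population mean, available here only because the population is known) and the standard error estimated from each sample alone.

```python
from math import sqrt

population_mean = 75.91818  # Pulse population, N = 110

# (n, sample mean, sample SD) for the three samples in the transcript.
samples = [
    (31, 76.45161, 10.59509),
    (37, 77.97297, 16.40374),
    (69, 74.15942, 10.87445),
]

results = []
for n, sample_mean, sample_sd in samples:
    sampling_error = sample_mean - population_mean  # needs the population mean
    standard_error = sample_sd / sqrt(n)            # SE = SD / SQRT(N)
    results.append((n, sampling_error, standard_error))
    print(f"n = {n}: sampling error = {sampling_error:+.3f}, "
          f"SE = {standard_error:.3f}")
```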
The Sampling Distribution of Sample Means
Central Limit Theorem
If repeated random samples of size n are taken from a population with a mean mu and a standard deviation s, the sampling distribution of sample means will have a mean equal to mu and a standard error equal to
Sx = s / SQRT(n).
Moreover, as n increases, the sampling distribution will approach a normal distribution.
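A small simulation can make the theorem tangible. The sketch below is only an illustration under assumed settings: the population is exponential with rate 1 (strongly skewed, with mean 1 and standard deviation 1), the sample size is 30, and 2000 samples are drawn.

```python
import random
from math import sqrt
from statistics import mean, stdev

rng = random.Random(0)  # fixed seed for reproducibility

mu, sigma = 1.0, 1.0    # mean and SD of an exponential population with rate 1
n = 30                  # size of each sample
num_samples = 2000      # number of repeated samples

# Draw repeated random samples and record the mean of each one.
sample_means = [
    sum(rng.expovariate(1.0) for _ in range(n)) / n
    for _ in range(num_samples)
]

mean_of_means = mean(sample_means)
sd_of_means = stdev(sample_means)

print(f"Mean of sample means: {mean_of_means:.3f} (theory: {mu:.3f})")
print(f"SD of sample means:   {sd_of_means:.3f} (theory: {sigma / sqrt(n):.3f})")
```

Plotting `sample_means` as a histogram would show a roughly bell-shaped curve even though the underlying population is far from normal.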
Confidence Interval for the mean
CI = Sample Mean ± E
Confidence Interval for the mean. Known s
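When the population standard deviation is known, the margin of error E is a normal critical value times the standard error. The sketch below assumes a 95% confidence level and reuses the Pulse numbers from the sampling error slides, treating the population standard deviation (13,45474) as known and using the sample of 31.

```python
from math import sqrt
from statistics import NormalDist

sigma = 13.45474        # population SD, treated as known
n = 31                  # sample size
sample_mean = 76.45161  # mean of the sample
confidence = 0.95       # assumed confidence level

# Two-sided critical value: leaves (1 - confidence) / 2 in each tail.
z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)

# E = z * sigma / sqrt(n); CI = sample mean ± E
margin = z * sigma / sqrt(n)
ci = (sample_mean - margin, sample_mean + margin)

print(f"z = {z:.3f}, E = {margin:.3f}")
print(f"95% CI: ({ci[0]:.2f}, {ci[1]:.2f})")
```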
Confidence Interval for the mean. UNknown s
You don’t know the value of the population standard deviation
Link between statistics and beer
William Sealy Gosset
t-distributions (Student’s distributions)
Confidence Interval for the mean. UNknown s
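With the population standard deviation unknown, the sample standard deviation takes its place and the critical value comes from a t-distribution with n - 1 degrees of freedom. This sketch reuses the Pulse sample of 31; the critical value 2.042 is hard-coded from a standard t table (two-sided 95%, 30 degrees of freedom) because the Python standard library has no t-distribution.

```python
from math import sqrt

n = 31
sample_mean = 76.45161
sample_sd = 10.59509

# Two-sided 95% critical value for n - 1 = 30 degrees of freedom,
# taken from a standard t table (assumption: 95% confidence level).
t_critical = 2.042

# E = t * s / sqrt(n); CI = sample mean ± E
margin = t_critical * sample_sd / sqrt(n)
ci = (sample_mean - margin, sample_mean + margin)

print(f"E = {margin:.3f}")
print(f"95% CI: ({ci[0]:.2f}, {ci[1]:.2f})")
```

The t value (2.042) is larger than the normal value (1.96), reflecting the extra uncertainty from estimating the standard deviation.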
Confidence Interval. Changing confidence level
Confidence Interval. Changing n
Confidence Interval for proportion
59% of Russians favor Vladimir Putin
Russian Public Opinion Research Center (2012)
The initiative Russian opinion poll was conducted on August 24-25, 2012. 1600 respondents were interviewed.
59% of Russians favor Vladimir Putin, n = 1600
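For a proportion, the usual large-sample interval is p ± z * sqrt(p(1 - p) / n). The sketch below applies it to the poll figures, assuming a 95% confidence level and treating the 1600 respondents as a simple random sample (the transcript does not describe the sampling design).

```python
from math import sqrt
from statistics import NormalDist

p_hat = 0.59  # sample proportion
n = 1600      # number of respondents

z = NormalDist().inv_cdf(0.975)  # two-sided 95% critical value

# Standard error of a proportion and the resulting margin of error.
standard_error = sqrt(p_hat * (1 - p_hat) / n)
margin = z * standard_error
ci = (p_hat - margin, p_hat + margin)

print(f"Margin of error: {margin:.3f}")
print(f"95% CI: ({ci[0]:.3f}, {ci[1]:.3f})")
```

Under these assumptions the margin of error is roughly 2.4 percentage points.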
Polio vaccine
In the first half of the 20th century there were approximately 20000 cases of polio per year in the USA
In 1952, there were 58000 cases
In 1952, the first effective polio vaccine was developed by Dr. Jonas Salk
Polio vaccine
Rate (per 100,000)
• Treatment: 28
• Control: 71
Statistical hypothesis testing
“Critical tests of this kind may be called tests of significance, and when such tests are available we may discover whether a second sample is or is not significantly different from the first”
A statistical hypothesis test is a method of making decisions using data, whether from a controlled experiment or an observational study (not controlled).
In statistics, a result is called statistically significant if it is unlikely to have occurred by chance alone, according to a pre-determined threshold probability, the significance level.
Null hypothesis (H0) - default position
NO effect, NO relationship, NO changes, NOTHING happened
Alternative hypothesis (H1)
Null hypothesis (H0): “the defendant is not guilty”
vs
Alternative hypothesis (H1): “the defendant is guilty”
           H0 is true      H1 is true
Retain H0  Right decision  Type II Error
Reject H0  Type I Error    Right decision
Mathias Rust is a German aviator known for his illegal landing on May 28, 1987, near Red Square in Moscow. An amateur pilot, he flew from Finland to Moscow, being tracked several times by Soviet air defense and interceptors. The Soviet fighters never received permission to shoot him down, and several times he was mistaken for a friendly aircraft.
NHST
Single Sample Test
Population: μ = 193.80 and σ = 31.55
Sample: 50 flextime workers, sample mean = 202.94
How likely is it that we would have obtained a sample mean of 202.94 from a population having a mean of 193.80 and a standard deviation of 31.55?
Where does our sample mean of 202.94 fall in relation to all other sample means in a sampling distribution of sample means?
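The question can be answered with a single-sample z test: locate the sample mean on the sampling distribution of sample means, whose standard error is the population standard deviation divided by SQRT(n). This is a sketch; the two-tailed p-value and the 0.05 threshold are assumptions, since the transcript does not state the alternative hypothesis or the significance level.

```python
from math import sqrt
from statistics import NormalDist

population_mean = 193.80
population_sd = 31.55
n = 50
sample_mean = 202.94

# Standard error of the mean and the z statistic for the sample mean.
standard_error = population_sd / sqrt(n)
z = (sample_mean - population_mean) / standard_error

# Two-tailed p-value: chance of a sample mean at least this far from
# the population mean, in either direction, if H0 is true.
p_two_tailed = 2 * (1 - NormalDist().cdf(abs(z)))

alpha = 0.05  # assumed significance level
print(f"z = {z:.3f}, two-tailed p = {p_two_tailed:.4f}")
print("Reject H0" if p_two_tailed < alpha else "Fail to reject H0")
```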
Phrases
Null hypothesis (H0): I fail to reject the null hypothesis
vs
Alternative hypothesis (H1): I reject the null hypothesis, with the knowledge that 5 times out of 100 I could have committed a Type I error.
Hypothesis Testing With Two Samples
KEEP OUT! next slide can destroy your mind
Dependent/paired samples Independent
Single sample
Dependent t-test
Independent t-test: Cognitive Ability Test
8 hours sleep group (X) 5 7 5 3 5 3 3 9
4 hours sleep group (Y) 8 1 4 6 6 4 1 2
x  (x-Mx)^2  y  (y-My)^2
5  0         8  16
7  4         1  9
5  0         4  0
3  4         6  4
5  0         6  4
3  4         4  0
3  4         1  9
9  16        2  4
Σx = 40, Σ(x-Mx)^2 = 32, Σy = 32, Σ(y-My)^2 = 46
Mx = 5, s = sqrt(4,571)
My = 4, s = sqrt(6,571)
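The sums above feed directly into an independent t-test. The sketch below assumes the pooled-variance (equal variances) form of the test, since the slide's formula did not survive in the transcript. The critical value 2.145 (two-sided 5%, 14 degrees of freedom) is hard-coded from a standard t table.

```python
from math import sqrt
from statistics import mean, variance

x = [5, 7, 5, 3, 5, 3, 3, 9]  # 8 hours sleep group
y = [8, 1, 4, 6, 6, 4, 1, 2]  # 4 hours sleep group

n_x, n_y = len(x), len(y)
var_x, var_y = variance(x), variance(y)  # 32/7 and 46/7 (n - 1 in the divisor)

# Pooled variance: weighted average of the two sample variances.
df = n_x + n_y - 2
pooled_variance = ((n_x - 1) * var_x + (n_y - 1) * var_y) / df

# Standard error of the difference between the two means.
se_difference = sqrt(pooled_variance * (1 / n_x + 1 / n_y))
t_statistic = (mean(x) - mean(y)) / se_difference

t_critical = 2.145  # two-sided 5% value for 14 df, from a standard t table

print(f"t = {t_statistic:.3f}, df = {df}")
print("Reject H0" if abs(t_statistic) > t_critical else "Fail to reject H0")
```

With these scores the one-point difference in means is small relative to its standard error, so the sketch fails to reject the null hypothesis.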
Test for proportion
A research center claims that less than 50% of U.S. adults have accessed the Internet over a wireless network with a laptop computer. In a random sample of 100 adults, 39% say they have accessed the Internet over a wireless network with a laptop computer. At the chosen level of significance, is there enough evidence to support the researcher’s claim?
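A sketch of the corresponding z test for a proportion. The claim "less than 50%" is taken as the alternative hypothesis, which makes this a left-tailed test. The significance level is not legible in the transcript, so the code reports the p-value and compares it with two commonly used levels purely as an illustration.

```python
from math import sqrt
from statistics import NormalDist

p0 = 0.50     # proportion under the null hypothesis (H0: p >= 0.50)
p_hat = 0.39  # sample proportion
n = 100       # sample size

# Under H0 the standard error uses the hypothesized proportion p0.
standard_error = sqrt(p0 * (1 - p0) / n)
z = (p_hat - p0) / standard_error

# Left-tailed p-value, because the claim is p < 0.50.
p_value = NormalDist().cdf(z)

print(f"z = {z:.2f}, left-tailed p = {p_value:.4f}")
for alpha in (0.05, 0.01):  # illustrative levels, not taken from the slide
    decision = "reject H0" if p_value < alpha else "fail to reject H0"
    print(f"alpha = {alpha}: {decision}")
```

The conclusion depends on the level chosen: the result is significant at 0.05 but not at 0.01.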
Test for variance
One-tailed and two-tailed tests
A car manufacturer states that mean carbon monoxide (CO) emissions for a given car model do not exceed 30 parts per million (ppm). A sample of 33 cars coming off the assembly line is taken to determine whether the car model meets the emission standard with 95% confidence. The sample statistics are: sample mean = 28.0 ppm and s = 5.3 ppm.
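A sketch of a one-tailed t test for this example. The direction is an assumption, because the hypotheses are not stated in the transcript: here H0 is "mean emissions are at least 30 ppm" and H1 is "mean emissions are below 30 ppm", so only the lower tail counts. The critical value 1.694 (one-tailed 5%, 32 degrees of freedom) is hard-coded from a standard t table.

```python
from math import sqrt

mu0 = 30.0          # emission limit in ppm
sample_mean = 28.0  # ppm
sample_sd = 5.3     # ppm
n = 33

standard_error = sample_sd / sqrt(n)
t_statistic = (sample_mean - mu0) / standard_error
df = n - 1

# One-tailed 5% critical value for 32 df, from a standard t table.
t_critical = 1.694

# Lower-tailed test: reject H0 only if t falls far enough below zero.
reject_h0 = t_statistic < -t_critical

print(f"t = {t_statistic:.3f}, df = {df}")
print("Reject H0: mean is below 30 ppm" if reject_h0 else "Fail to reject H0")
```

A two-tailed test at the same level would use a larger critical value (about 2.04 for 32 degrees of freedom), so choosing the tail before looking at the data matters.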