Data: Quantitative and Qualitative

Post on 26-Dec-2015

Data

Quantitative Qualitative

Quantitative – a variable that can be measured numerically

Qualitative – a variable that cannot assume a numerical value but can be classified into two or more nonnumeric categories

Data at the nominal level of measurement are qualitative only. Data at this level are categorized using names, labels, or qualities. No mathematical computations can be made at this level.

Data at the ordinal level of measurement are qualitative or quantitative. Data at this level can be arranged in order, or ranked, but differences between data entries are not meaningful.

Levels of measurement

Data at the ratio level of measurement are similar to data at the interval level, with the added property that a zero entry is an inherent zero. A ratio of two data values can be formed so that one data value can be meaningfully expressed as a multiple of another.

Levels of measurement (continued)

Data at the interval level of measurement can be ordered, and meaningful differences between data entries can be calculated. At the interval level, a zero entry simply represents a position on a scale; the entry is not an inherent zero.

Level      Put data in categories   Arrange data in order   Subtract data values   Determine if one data value is a multiple of another
Nominal    Yes                      No                      No                     No
Ordinal    Yes                      Yes                     No                     No
Interval   Yes                      Yes                     Yes                    No
Ratio      Yes                      Yes                     Yes                    Yes

Descriptive statistics

Central tendency

Variability

Shape (Histograms)

Central tendency > Mean

Wikipedia: Temperature in the mouth (oral) is about 37.0 °C (98.6 °F)

Rank Company Revenues Profits

1 Royal_Dutch_Shell 484,489 30,918

2 Exxon_Mobil 452,926 41,060

3 Wal-Mart_Stores 446,950 15,699

4 BP 386,463 25,700

5 Sinopec_Group 375,214 9,453

6 China_National_Petroleum 352,338 16,317

7 State_Grid 259,142 5,678

8 Chevron 245,621 26,895

9 ConocoPhillips 237,272 12,436

10 Toyota_Motor 235,364 3,591

11 Total 231,580 17,069

12 Volkswagen 221,551 21,426

13 Japan_Post_Holdings 211,019 5,939

14 Glencore_International 186,152 4,048

15 Gazprom 157,831 44,460

Central tendency > Median

Rank Company Revenues Profits
43 Hon_Hai_Precision_Industry 117,514 2,777
44 Banco_Santander 117,408 7,440
45 EXOR_Group 117,297 701
46 Bank_of_America_Corp. 115,074 1,446
47 Siemens 113,349 8,562
48 Assicurazioni_Generali 112,628 1,190
49 Lukoil 111,433 10,357
50 Verizon_Communications 110,875 2,404
51 J.P._Morgan_Chase_&_Co. 110,838 18,976
52 Enel 110,560 5,768
53 HSBC_Holdings 110,141 16,797
54 Industrial_&_Commercial_Bank_of_China 109,040 32,214
55 Apple 108,249 25,922
56 CVS_Caremark 107,750 3,461
57 International_Business_Machines 106,916 15,855
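As a rough illustration of mean versus median, the sketch below uses Python's standard `statistics` module on the revenue figures printed in the tables. The variable names are made up for this example. The midpoint of ranks 50 and 51 happens to equal the median reported in the revenue histogram summary (110856,5), which would be consistent with a list of 100 companies, although the slides do not say so explicitly.

```python
import statistics

# Revenues of ranks 1-15, copied from the table on the Mean slide.
top15_revenues = [484489, 452926, 446950, 386463, 375214, 352338, 259142,
                  245621, 237272, 235364, 231580, 221551, 211019, 186152, 157831]

mean_top15 = statistics.mean(top15_revenues)
# Odd number of values: the median is the middle (8th) value after sorting.
median_top15 = statistics.median(top15_revenues)

# Revenues of ranks 50 and 51, copied from the table above.
# Even number of values: the median is the average of the two middle values.
median_middle_pair = statistics.median([110875, 110838])

print(f"mean of top 15:   {mean_top15:.1f}")
print(f"median of top 15: {median_top15}")
print(f"midpoint of ranks 50 and 51: {median_middle_pair}")
```

The mean of the top 15 is pulled upward by the largest revenues, while the median only depends on the value in the middle of the sorted list.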

Central tendency > Mode

Best football player

Cristiano Ronaldo

                 Goal       Pass       G+P
N                68         68         68
Mean             1,000000   0,220588   1,220588
Median           1,000000   0,000000   1,000000
Mode             0,000000   0,000000   1,000000
Mode frequency   26         56         23
Min              0,00       0,00       0,00
Max              3,000000   2,000000   4,000000
Variance         0,985075   0,264047   1,249122
Std.Dev.         0,992509   0,513855   1,117641

Lionel Messi

                 Goal       Pass       G+P
N                73         73         73
Mean             1,123288   0,397260   1,520548
Median           1,000000   0,000000   1,000000
Mode             0,000000   0,000000   0,000000
Mode frequency   31         51         24
Min              0,00       0,00       0,00
Max              5,000000   3,000000   5,000000
Variance         1,554033   0,464992   2,030822
Std.Dev.         1,246609   0,681904   1,425069

Variability > Range

Body Temperature

Variability > Deviations from the Mean

Variability > Variance and Std.Dev.

So what?

1. Shows how values deviate (in general) from the mean

2. Comparative measure for samples (populations)

3. Critical element in understanding the concept of the normal distribution

Body temperature…again

Variability > Z-score

Test Score

Test                                 Mean   Standard Deviation   Your Score
Math                                 82     6                    80
Verbal                               75     3                    75
Science                              60     5                    70
Logic                                70     7                    77
Foreign Language Ability (max 250)   120    15                   90
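The test score table can be turned into z-scores with z = (score - mean) / standard deviation. The sketch below is only an illustration; the function and variable names are invented for this example.

```python
# (test, test mean, standard deviation, your score), copied from the table.
scores = [
    ("Math", 82, 6, 80),
    ("Verbal", 75, 3, 75),
    ("Science", 60, 5, 70),
    ("Logic", 70, 7, 77),
    ("Foreign Language Ability", 120, 15, 90),
]


def z_score(value, mean, sd):
    """Number of standard deviations that value lies above (+) or below (-) the mean."""
    return (value - mean) / sd


z_by_test = {name: z_score(score, mean, sd) for name, mean, sd, score in scores}

for name, z in z_by_test.items():
    print(f"{name}: z = {z:+.2f}")

# The test with the largest z-score is the best result relative to the other test takers.
best_test = max(z_by_test, key=z_by_test.get)
print("Best relative result:", best_test)
```

Because z-scores are unit-free, results on tests with different scales (for example a maximum of 250 points) become directly comparable.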

Box & Whisker Plot

[Figure: box-and-whisker plot of Goal_messi and Goal_ronaldo showing Mean, Mean±SD, and Mean±1,96*SD]

[Figure: box-and-whisker plot of Goal_messi and Goal_ronaldo showing Mean, Mean±SD, Min-Max, outliers, and extremes]

Variability > n vs. (n-1)
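A minimal sketch of the difference between dividing by n and by (n-1), using a small made-up data set (the numbers are hypothetical and not taken from the slides). Dividing the sum of squared deviations by n gives the population variance; dividing by (n-1) gives the usual sample variance.

```python
import math
import statistics

# Hypothetical data, chosen only to keep the arithmetic easy to follow.
data = [4, 8, 6, 5, 3, 7]
n = len(data)

mean = sum(data) / n
deviations = [x - mean for x in data]          # deviations from the mean sum to zero
sum_of_squares = sum(d * d for d in deviations)

var_population = sum_of_squares / n            # divide by n
var_sample = sum_of_squares / (n - 1)          # divide by (n - 1)

print(f"mean = {mean}")
print(f"variance with n:     {var_population:.4f}, SD = {math.sqrt(var_population):.4f}")
print(f"variance with (n-1): {var_sample:.4f}, SD = {math.sqrt(var_sample):.4f}")

# The standard library offers both versions directly.
assert math.isclose(var_population, statistics.pvariance(data))
assert math.isclose(var_sample, statistics.variance(data))
```

The (n-1) version is always slightly larger, and the gap shrinks as n grows.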

Histograms

[Figure: histogram of body temperature; x-axis: Temperature (35.5 to 38.5), y-axis: Frequency]

[Figure: histogram of HR, men; x-axis: HR (50 to 100), y-axis: No. of obs.]

[Figure: histogram of HR, women; x-axis: HR (50 to 100), y-axis: No. of obs.]

Histograms: Skewness

[Figure: histogram of revenues of Top 500 companies; x-axis: revenues (0 to 5E5), y-axis: No. of obs.]

Mean: 136319,2; Median: 110856,5; Min: 76024,00; Max: 484489,0; SD: 82165,07; Skew: 2,699004

Wine Tasting

Descriptive Statistics

           N    Mean   Median   Mode       Mode Fr.   Min   Max    Std.Dev.   Skew
RedTruck   30   5,50   5,50     Multiple   5          1,0   10,0   2,25       -0,00
WoopWoop   30   5,50   5,50     Multiple   3          1,0   10,0   2,92       -0,00
HobNob     30   5,03   5,00     6,00       7          1,0   10,0   2,00       0,36
FourPlay   30   5,96   6,00     5,00       7          1,0   10,0   2,00       -0,36

Wine Tasting

[Figure: histogram of RedTruck with expected normal curve; K-S d=,08773, p> .20; Lilliefors p> .20]

[Figure: histogram of WoopWoop with expected normal curve; K-S d=,10393, p> .20; Lilliefors p> .20]

[Figure: histogram of HobNob with expected normal curve; K-S d=,14847, p> .20; Lilliefors p<,10]

[Figure: histogram of FourPlay with expected normal curve; K-S d=,14847, p> .20; Lilliefors p<,10]

Normal distribution

The normal (or Gaussian) distribution is a continuous probability distribution that has a bell-shaped probability density function, known as the Gaussian function or informally as the bell curve.

The 68-95-99.7 rule, or three-sigma rule, states that for a normal distribution about 68%, 95%, and 99.7% of values lie within 1, 2, and 3 standard deviations of the mean, so nearly all values lie within 3 standard deviations.
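The percentages in the rule can be checked numerically. The sketch below uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later); the helper name `share_within` is made up for this example.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)


def share_within(k):
    """Share of a normal distribution lying within k standard deviations of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)


shares = {k: share_within(k) for k in (1, 2, 3)}

for k, share in shares.items():
    print(f"within {k} SD: {share:.4%}")
```

The result does not depend on the particular mean and standard deviation, because any normal variable can be standardized to the standard normal.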

Chebyshev's inequality

No more than 1/k^2 of the distribution’s values can be more than k standard deviations away from the mean.
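Unlike the three-sigma rule, Chebyshev's bound holds for any distribution, including skewed ones. The sketch below compares the bound 1/k^2 with the observed share of values beyond k standard deviations in a made-up, right-skewed data set (the numbers are hypothetical). The population standard deviation is used so that the inequality applies exactly to the data set itself.

```python
import statistics


def chebyshev_bound(k):
    """Upper bound on the share of values more than k standard deviations from the mean."""
    if k <= 0:
        raise ValueError("k must be positive")
    return min(1.0, 1.0 / k ** 2)


def share_beyond(data, k):
    """Observed share of values more than k population standard deviations from the mean."""
    mean = statistics.fmean(data)
    sd = statistics.pstdev(data)
    return sum(abs(x - mean) > k * sd for x in data) / len(data)


# Hypothetical right-skewed data.
skewed = [1, 1, 1, 2, 2, 2, 3, 3, 4, 5, 6, 8, 12, 20, 40]

for k in (1.5, 2, 3):
    print(f"k = {k}: observed {share_beyond(skewed, k):.3f} <= bound {chebyshev_bound(k):.3f}")
```

For a normal distribution the observed shares are far below the bound, which is why the bound is mainly useful when the shape of the distribution is unknown.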

The bean machine, by Francis Galton

Parameter and statistic

4 fundamental concepts

1. Random Sampling

2. Sampling Error

3. The Sampling Distribution of Sample Means

4. The Central Limit Theorem

The Sampling Distribution of Sample Means

Random Sampling

• every unit or case in the population has an equal chance of being selected

• selection of any single case or unit doesn’t affect the selection of any other unit or case

• the cases or units are selected in such a way that all combinations are possible

Sampling Error

Valid N Mean Std.Dev.
Pulse1 110 75,91818 13,45474

[Figure: histogram of Pulse1; x-axis: Pulse1 (30 to 150), y-axis: No. of obs.]

Valid N Mean Std.Dev.
31 76,45161 10,59509

[Figure: histogram of Pulse1 with expected normal curve; K-S d=,14818, p> .20; Lilliefors p<,10]

Valid N Mean Std.Dev.
37 77,97297 16,40374

[Figure: histogram of Pulse1 with expected normal curve; K-S d=,15064, p> .20; Lilliefors p<,05]

[Figure: histogram of Pulse1 with expected normal curve; K-S d=,05831, p> .20; Lilliefors p> .20]

Valid N Mean Std.Dev.
69 74,15942 10,87445

Valid N Mean Std.Dev.
Pulse population 110 75,91818 13,45474

Sampling Error

The difference between the population and the sample

PROBLEM!

We typically don’t know the population parameters.

Standard error is an estimate of the amount of sampling error

SE = SD / SQRT(N)
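A small sketch of the standard error formula, applied to the sample sizes and standard deviations reported in the Pulse1 tables of the Sampling Error slides. The function and variable names are made up for this example.

```python
import math


def standard_error(sd, n):
    """SE = SD / SQRT(N): estimated sampling error of the sample mean."""
    return sd / math.sqrt(n)


# Sample size -> sample standard deviation, copied from the Pulse1 sample tables.
pulse_samples = {31: 10.59509, 37: 16.40374, 69: 10.87445}

se_by_n = {n: standard_error(sd, n) for n, sd in pulse_samples.items()}

for n, se in se_by_n.items():
    print(f"N = {n}: SE = {se:.3f}")
```

For a similar standard deviation, the larger sample (N = 69) has a smaller standard error than the smaller one (N = 31).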

The Sampling Distribution of Sample Means

Central Limit Theorem

If repeated random samples of size n are taken from a population with a mean mu and a standard deviation sigma, the sampling distribution of sample means will have a mean equal to mu and a standard error equal to sigma / SQRT(n).

Moreover, as n increases, the sampling distribution will approach a normal distribution.
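The theorem can be illustrated with a small simulation. The sketch below is an assumption-laden example, not part of the slides: it draws repeated samples from an exponential population (strongly skewed, with mean 1 and standard deviation 1), using a fixed seed so the run is repeatable.

```python
import math
import random
import statistics

rng = random.Random(42)      # fixed seed, chosen arbitrarily

POP_MEAN = 1.0               # exponential population with rate 1: mean 1
POP_SD = 1.0                 # and standard deviation 1
n = 30                       # size of each sample
num_samples = 5000           # number of repeated samples

# Mean of each repeated random sample of size n.
sample_means = [
    statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
    for _ in range(num_samples)
]

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.stdev(sample_means)
theoretical_se = POP_SD / math.sqrt(n)

print(f"mean of sample means: {mean_of_means:.3f} (population mean {POP_MEAN})")
print(f"SD of sample means:   {sd_of_means:.3f} (sigma / SQRT(n) = {theoretical_se:.3f})")
```

Although single observations are far from normal, a histogram of `sample_means` would look roughly bell-shaped, and increasingly so as n grows.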

Confidence Interval for the mean

CI = Sample Mean ± E

Confidence Interval for the mean. Known sigma
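The formula on this slide did not survive extraction. A common form, assumed here, is E = z * sigma / SQRT(n), where z is the standard normal critical value for the chosen confidence level. The sketch applies it to the Pulse1 sample of 31 (mean 76,45161), treating the standard deviation of all 110 observations (13,45474) as the known sigma; that pairing is an illustration, not something the slides state.

```python
import math
from statistics import NormalDist


def ci_mean_known_sigma(sample_mean, sigma, n, confidence=0.95):
    """Confidence interval: sample mean ± z * sigma / SQRT(n)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)   # about 1.96 for 95%
    margin = z * sigma / math.sqrt(n)
    return sample_mean - margin, sample_mean + margin


low, high = ci_mean_known_sigma(sample_mean=76.45161, sigma=13.45474, n=31)
margin_of_error = (high - low) / 2

print(f"95% CI: ({low:.2f}, {high:.2f}), E = {margin_of_error:.2f}")
```

In this example the interval contains 75,91818, the mean of all 110 pulse measurements.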

Confidence Interval for the mean. Unknown sigma

You don’t know the value of the population standard deviation

Link between statistics and beer

William Sealy Gosset

t-distributions (Student’s distributions)

Confidence Interval for the mean. Unknown sigma
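When sigma is unknown, the sample standard deviation replaces it and the critical value comes from a t-distribution with n - 1 degrees of freedom. The slide formula is not legible in the source, so the form E = t * s / SQRT(n) is assumed here. The Python standard library has no t quantile function, so the critical value below is hard-coded from a standard t-table (two-sided 95%, 30 degrees of freedom).

```python
import math

# Two-sided 95% critical value of Student's t with 30 degrees of freedom,
# taken from a standard t-table (hard-coded assumption, not computed here).
T_CRIT_95_DF30 = 2.042


def ci_mean_unknown_sigma(sample_mean, sample_sd, n, t_crit):
    """Confidence interval: sample mean ± t * s / SQRT(n)."""
    margin = t_crit * sample_sd / math.sqrt(n)
    return sample_mean - margin, sample_mean + margin


# Pulse1 sample from the Sampling Error slides: N = 31, mean 76,45161, SD 10,59509.
low_t, high_t = ci_mean_unknown_sigma(76.45161, 10.59509, 31, T_CRIT_95_DF30)
margin_t = (high_t - low_t) / 2

print(f"95% CI: ({low_t:.2f}, {high_t:.2f}), E = {margin_t:.2f}")
```

The t critical value (2.042) is larger than the normal one (about 1.96), which widens the interval to reflect the extra uncertainty from estimating sigma.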

Confidence Interval. Changing confidence level

Confidence Interval. Changing n

Confidence Interval for proportion

59% of Russians favor Vladimir Putin

Russian Public Opinion Research Center (2012)

The initiative Russian opinion poll was conducted on August 24-25, 2012. 1600 respondents were interviewed.

Confidence Interval for proportion

59% of Russians favor Vladimir Putin, n = 1600
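A sketch of the interval for the poll result, assuming the usual normal-approximation formula p ± z * SQRT(p(1-p)/n) and a 95% confidence level; the slides do not show which formula or level was used.

```python
import math
from statistics import NormalDist


def ci_proportion(p_hat, n, confidence=0.95):
    """Normal-approximation confidence interval for a proportion."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin


low_p, high_p = ci_proportion(p_hat=0.59, n=1600)

print(f"95% CI for the proportion: ({low_p:.3f}, {high_p:.3f})")
```

With 1600 respondents the margin of error is roughly 2.4 percentage points, assuming simple random sampling.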

Polio vaccine

In the first half of the 20th century there were approximately 20000 cases of polio per year in the USA

In 1952, there were 58000 cases

In 1952, the first effective polio vaccine was developed by Dr. Jonas Salk

Rate (per 100,000)

• Treatment: 28

• Control: 71

Statistical hypothesis testing

Critical tests of this kind may be called tests of significance, and when such tests are available we may discover whether a second sample is or is not significantly different from the first

A statistical hypothesis test is a method of making decisions using data, whether from a controlled experiment or an observational study (not controlled).

In statistics, a result is called statistically significant if it is unlikely to have occurred by chance alone, according to a pre-determined threshold probability, the significance level.

Null hypothesis - default position

NO effect

NO relationship

NO changes

NOTHING happened

H0

Null hypothesis H0: “the defendant is not guilty”

vs

Alternative hypothesis H1: “the defendant is guilty”

            H0 is true       H1 is true
Retain H0   Right decision   Type II Error
Reject H0   Type I Error     Right decision

Mathias Rust is a German aviator known for his illegal landing on May 28, 1987, near Red Square in Moscow. An amateur pilot, he flew from Finland to Moscow, being tracked several times by Soviet air defense and interceptors. The Soviet fighters never received permission to shoot him down, and several times he was mistaken for a friendly aircraft.

NHST

Single Sample Test

Population: mu = 193.80 and sigma = 31.55

Sample: 50 flextime workers, mean = 202.94

How likely is it that we would have obtained a sample mean of 202.94 from a population having a mean of 193.80 and a standard deviation of 31.55?

Where does our sample mean of 202.94 fall in relation to all other sample means in a sampling distribution of sample means?
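One way to answer the two questions is a z-test: locate the sample mean in the sampling distribution of sample means, whose standard error is sigma / SQRT(n). The sketch below assumes a two-tailed test; the slides do not say which tail or significance level was used.

```python
import math
from statistics import NormalDist

pop_mean, pop_sd = 193.80, 31.55      # population values from the slide
n, sample_mean = 50, 202.94           # sample of 50 flextime workers

standard_error = pop_sd / math.sqrt(n)
z = (sample_mean - pop_mean) / standard_error

# Probability of a sample mean at least this far from the population mean
# in either direction, if the null hypothesis is true (two-tailed assumption).
p_two_tailed = 2 * (1 - NormalDist().cdf(abs(z)))

print(f"SE = {standard_error:.3f}, z = {z:.3f}, two-tailed p = {p_two_tailed:.4f}")
```

The sample mean lies about two standard errors above the population mean, so at a 0.05 level this two-tailed sketch would reject the null hypothesis.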

Phrases

I fail to reject the null hypothesis

I reject the null hypothesis, with the knowledge that 5 times out of 100 I could have committed a Type I error.

Hypothesis Testing With Two Samples

KEEP OUT! next slide can destroy your mind

Dependent/paired samples Independent

Single sample

Dependent t-test

Independent t-test: Cognitive Ability Test

8 hours sleep group (X) 5 7 5 3 5 3 3 9

4 hours sleep group (Y) 8 1 4 6 6 4 1 2

x   (x-Mx)^2   y   (y-My)^2
5   0          8   16
7   4          1   9
5   0          4   0
3   4          6   4
5   0          6   4
3   4          4   0
3   4          1   9
9   16         2   4

Sum(x) = 40, Sum((x-Mx)^2) = 32, Sum(y) = 32, Sum((y-My)^2) = 46

Mx = 5, s = sqrt(4,571)

My = 4, s = sqrt(6,571)
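The table values can be reproduced and carried through to a t statistic. The sketch below assumes the pooled-variance (Student) form of the independent t-test; the slides do not show which variant was used. The critical value is hard-coded from a standard t-table (two-sided 95%, 14 degrees of freedom).

```python
import math
import statistics

sleep_8h = [5, 7, 5, 3, 5, 3, 3, 9]   # group X
sleep_4h = [8, 1, 4, 6, 6, 4, 1, 2]   # group Y

n_x, n_y = len(sleep_8h), len(sleep_4h)
mean_x, mean_y = statistics.mean(sleep_8h), statistics.mean(sleep_4h)
var_x = statistics.variance(sleep_8h)   # 32 / 7, about 4.571
var_y = statistics.variance(sleep_4h)   # 46 / 7, about 6.571

# Pooled variance and standard error of the difference between means.
df = n_x + n_y - 2
pooled_var = ((n_x - 1) * var_x + (n_y - 1) * var_y) / df
se_diff = math.sqrt(pooled_var * (1 / n_x + 1 / n_y))
t_stat = (mean_x - mean_y) / se_diff

T_CRIT_95_DF14 = 2.145   # two-sided 95% critical value from a t-table (assumption)

print(f"Mx = {mean_x}, My = {mean_y}, t({df}) = {t_stat:.3f}")
print("reject H0" if abs(t_stat) > T_CRIT_95_DF14 else "fail to reject H0")
```

Under these assumptions the difference of one point between the groups is well within what sampling error alone could produce.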

Test for proportion

A research center claims that less than 50% of U.S. adults have accessed the Internet over a wireless network with a laptop computer. In a random sample of 100 adults, 39% say they have accessed the Internet over a wireless network with a laptop computer. At a given significance level, is there enough evidence to support the research center’s claim?
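A sketch of the corresponding left-tailed z-test for a proportion, with H0: p = 0.50 and H1: p < 0.50. The significance level is not legible in the source, so the `alpha` values used below are assumptions for illustration only.

```python
import math
from statistics import NormalDist

p0 = 0.50        # proportion under the null hypothesis
p_hat = 0.39     # sample proportion
n = 100          # sample size

# The standard error uses p0, because the test assumes H0 is true.
se0 = math.sqrt(p0 * (1 - p0) / n)
z = (p_hat - p0) / se0

# Left-tailed test: the claim is "less than 50%".
p_value = NormalDist().cdf(z)


def reject_h0(alpha):
    """True if H0 is rejected at significance level alpha (alpha is an assumption)."""
    return p_value < alpha


print(f"z = {z:.2f}, p-value = {p_value:.4f}")
print("alpha = 0.05:", "reject H0" if reject_h0(0.05) else "fail to reject H0")
print("alpha = 0.01:", "reject H0" if reject_h0(0.01) else "fail to reject H0")
```

The conclusion depends on the significance level: the p-value lies between 0.01 and 0.05, so the two example levels lead to different decisions.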

Test for variance

One-tailed and two-tailed tests

A car manufacturer states that mean carbon monoxide (CO) emissions for a given car model do not exceed 30 parts per million (ppm). A sample of 33 cars coming off the assembly line is taken to determine if the car model meets the emission standards with 95% confidence. The sample statistics are: sample mean = 28.0 ppm and s = 5.3 ppm.
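A sketch of a one-tailed, one-sample t-test for this example. The direction of the hypotheses is an assumption made here (H0: mu = 30, H1: mu < 30, so that rejecting H0 gives evidence the limit is met); the slides do not state them. The critical value is hard-coded from a standard t-table (one-tailed 0.05, 32 degrees of freedom).

```python
import math

mu0 = 30.0         # emission limit in ppm
n = 33             # number of cars sampled
sample_mean = 28.0
sample_sd = 5.3

standard_error = sample_sd / math.sqrt(n)
t_stat = (sample_mean - mu0) / standard_error
df = n - 1

# One-tailed critical value for alpha = 0.05 and 32 degrees of freedom,
# taken from a standard t-table (hard-coded assumption, not computed here).
T_CRIT_ONE_TAILED_05_DF32 = 1.694

# Left-tailed test: reject H0 when t falls below the negative critical value.
reject_h0 = t_stat < -T_CRIT_ONE_TAILED_05_DF32

print(f"t({df}) = {t_stat:.3f}, critical value = -{T_CRIT_ONE_TAILED_05_DF32}")
print("reject H0" if reject_h0 else "fail to reject H0")
```

A two-tailed test at the same level would split the 0.05 between both tails and use a larger critical value, which is the practical difference between the two kinds of test.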
