descriptive statistics

77
1 Descriptive Statistics 2-1 Overview 2-2 Summarizing Data with Frequency Tables 2-3 Pictures of Data 2-4 Measures of Center 2-5 Measures of Variation 2-6 Measures of Position 2-7 Exploratory Data Analysis (EDA)

Upload: burak-mizrak

Post on 06-May-2015

1.117 views

Category:

Technology


4 download

TRANSCRIPT

Page 1: Descriptive statistics

1

Descriptive Statistics

2-1 Overview

2-2 Summarizing Data with Frequency Tables

2-3 Pictures of Data

2-4 Measures of Center

2-5 Measures of Variation

2-6 Measures of Position

2-7 Exploratory Data Analysis (EDA)

Page 2: Descriptive statistics

2

Descriptive Statistics

summarize or describe the important characteristics of a known set of population data

Inferential Statistics

use sample data to make inferences (or generalizations) about a population

2 -1 Overview

Page 3: Descriptive statistics

3

1. Center: A representative or average value that indicates where the middle of the data set is located

2. Variation: A measure of the amount that the values vary among themselves

3. Distribution: The nature or shape of the distribution of data (such as bell-shaped, uniform, or skewed)

4. Outliers: Sample values that lie very far away from the vast majority of other sample values

5. Time: Changing characteristics of the data over time

Important Characteristics of Data

Page 4: Descriptive statistics

4

Frequency Table

lists classes (or categories) of values,

along with frequencies (or counts) of

the number of values that fall into each

class

2-2 Summarizing Data With Frequency Tables

Page 5: Descriptive statistics

5

Qwerty Keyboard Word RatingsTable 2-1

2 2 5 1 2 6 3 3 4 2

4 0 5 7 7 5 6 6 8 10

7 2 2 10 5 8 2 5 4 2

6 2 6 1 7 2 7 2 3 8

1 5 2 5 2 14 2 2 6 3

1 7

Page 6: Descriptive statistics

6

Frequency Table of Qwerty Word RatingsTable 2-3

0 - 2 20

3 - 5 14

6 - 8 15

9 - 11 2

12 - 14 1

Rating Frequency

Page 7: Descriptive statistics

7

Frequency Table

Definitions

Page 8: Descriptive statistics

8

Lower Class Limits

Lower ClassLimits

0 - 2 20

3 - 5 14

6 - 8 15

9 - 11 2

12 - 14 1

Rating Frequency

are the smallest numbers that can actually belong to different classes

Page 9: Descriptive statistics

9

Upper Class Limits

Upper ClassLimits

0 - 2 20

3 - 5 14

6 - 8 15

9 - 11 2

12 - 14 1

Rating Frequency

are the largest numbers that can actually belong to different classes

Page 10: Descriptive statistics

10

are the numbers used to separate classes, but without the gaps created by class limits

Class Boundaries

Page 11: Descriptive statistics

11

number separating classes

Class Boundaries

0 - 2 20

3 - 5 14

6 - 8 15

9 - 11 2

12 - 14 1

Rating Frequency- 0.5

2.5

5.5

8.5

11.5

14.5

Page 12: Descriptive statistics

12

Class Boundaries

ClassBoundaries

0 - 2 20

3 - 5 14

6 - 8 15

9 - 11 2

12 - 14 1

Rating Frequency- 0.5

2.5

5.5

8.5

11.5

14.5

number separating classes

Page 13: Descriptive statistics

13

midpoints of the classes

Class Midpoints

Page 14: Descriptive statistics

14

midpoints of the classes

Class Midpoints

ClassMidpoints

0 - 1 2 20

3 - 4 5 14

6 - 7 8 15

9 - 10 11 2

12 - 13 14 1

Rating Frequency

Page 15: Descriptive statistics

15

is the difference between two consecutive lower class limits or two consecutive class boundaries

Class Width

Page 16: Descriptive statistics

16

Class Width

Class Width

3 0 - 2 20

3 3 - 5 14

3 6 - 8 15

3 9 - 11 2

3 12 - 14 1

Rating Frequency

is the difference between two consecutive lower class limits or two consecutive class boundaries

Page 17: Descriptive statistics

17

1. Be sure that the classes are mutually exclusive.

2. Include all classes, even if the frequency is zero.

3. Try to use the same width for all classes.

4. Select convenient numbers for class limits.

5. Use between 5 and 20 classes.

6. The sum of the class frequencies must equal the number of original data values.

Guidelines For Frequency Tables

Page 18: Descriptive statistics

18

3. Select for the first lower limit either the lowest score or a convenient value slightly less than the lowest score.

4. Add the class width to the starting point to get the second lower class limit, add the width to the second lower limit to get the

third, and so on.

5. List the lower class limits in a vertical column and enter the upper class limits.

6. Represent each score by a tally mark in the appropriate class.

Total tally marks to find the total frequency for each class.

Constructing A Frequency Table1. Decide on the number of classes .

2. Determine the class width by dividing the range by the number of classes (range = highest score - lowest score) and round

up.class width round up of

range

number of classes

Page 19: Descriptive statistics

19

Page 20: Descriptive statistics

20

Relative Frequency Table

relative frequency =class frequency

sum of all frequencies

Page 21: Descriptive statistics

21

Relative Frequency Table

0 - 2 20

3 - 5 14

6 - 8 15

9 - 11 2

12 - 14 1

Rating Frequency

0 - 2 38.5%

3 - 5 26.9%

6 - 8 28.8%

9 - 11 3.8%

12 - 14 1.9%

RatingRelative

Frequency

20/52 = 38.5%

14/52 = 26.9%

etc.

Total frequency = 52

Page 22: Descriptive statistics

22

Cumulative Frequency Table

CumulativeFrequencies

0 - 2 20

3 - 5 14

6 - 8 15

9 - 11 2

12 - 14 1

Rating Frequency

Less than 3 20

Less than 6 34

Less than 9 49

Less than 12 51

Less than 15 52

RatingCumulativeFrequency

Page 23: Descriptive statistics

23

Frequency Tables

0 - 2 20

3 - 5 14

6 - 8 15

9 - 11 2

12 - 14 1

Rating Frequency

0 - 2 38.5%

3 - 5 26.9%

6 - 8 28.8%

9 - 11 3.8%

12 - 14 1.9%

RatingRelative

Frequency

Less than 3 20

Less than 6 34

Less than 9 49

Less than 12 51

Less than 15 52

RatingCumulative Frequency

Page 24: Descriptive statistics

24

a value at the

center or middle of a data set

Measures of Center

Page 25: Descriptive statistics

25

Mean

(Arithmetic Mean)

AVERAGE

the number obtained by adding the values and dividing the total by the number of values

Definitions

Page 26: Descriptive statistics

26

Notation

denotes the addition of a set of values

x is the variable usually used to represent the individual data values

n represents the number of data values in a sample

N represents the number of data values in a population

Page 27: Descriptive statistics

27

Notationis pronounced ‘x-bar’ and denotes the mean of a set of sample values

x =n

xx

Page 28: Descriptive statistics

28

Notation

µ is pronounced ‘mu’ and denotes the mean of all values in a population

is pronounced ‘x-bar’ and denotes the mean of a set of sample values

Calculators can calculate the mean of data

x =n

xx

Nµ =

x

Page 29: Descriptive statistics

29

Definitions Median

the middle value when the original data values are arranged in order of increasing (or decreasing) magnitude

Page 30: Descriptive statistics

30

Definitions Median

the middle value when the original data values are arranged in order of increasing (or decreasing) magnitude

often denoted by x (pronounced ‘x-tilde’)~

Page 31: Descriptive statistics

31

Definitions Median

the middle value when the original data values are arranged in order of increasing (or decreasing) magnitude

often denoted by x (pronounced ‘x-tilde’)

is not affected by an extreme value

~

Page 32: Descriptive statistics

32

6.72 3.46 3.60 6.44

3.46 3.60 6.44 6.72 no exact middle -- shared by two numbers

3.60 + 6.44

2

(even number of values)

MEDIAN is 5.02

Page 33: Descriptive statistics

33

6.72 3.46 3.60 6.44 26.70

3.46 3.60 6.44 6.72 26.70

(in order - odd number of values)

exact middle MEDIAN is 6.44

6.72 3.46 3.60 6.44

3.46 3.60 6.44 6.72 no exact middle -- shared by two numbers

3.60 + 6.44

2

(even number of values)

MEDIAN is 5.02

Page 34: Descriptive statistics

34

Definitions Mode

the score that occurs most frequently

Bimodal

Multimodal

No Mode

denoted by M

the only measure of central tendency that can be used with nominal data

Page 35: Descriptive statistics

35

a. 5 5 5 3 1 5 1 4 3 5

b. 1 2 2 2 3 4 5 6 6 6 7 9

c. 1 2 3 6 7 8 9 10

Examples

Mode is 5

Bimodal - 2 and 6

No Mode

Page 36: Descriptive statistics

36

Midrange

the value midway between the highest and lowest values in the original data set

Definitions

Page 37: Descriptive statistics

37

Midrange

the value midway between the highest and lowest values in the original data set

Definitions

Midrange =highest score + lowest score

2

Page 38: Descriptive statistics

38

SymmetricData is symmetric if the left half of its histogram is roughly a mirror of its right half.

SkewedData is skewed if it is not symmetric and if it extends more to one side

than the other.

Definitions

Page 39: Descriptive statistics

39

Skewness

Mode = Mean = Median

SYMMETRIC

Page 40: Descriptive statistics

40

Skewness

Mode = Mean = Median

SKEWED LEFT(negatively)

SYMMETRIC

Mean Mode Median

Page 41: Descriptive statistics

41

Skewness

Mode = Mean = Median

SKEWED LEFT(negatively)

SYMMETRIC

Mean Mode Median

SKEWED RIGHT(positively)

Mean Mode Median

Page 42: Descriptive statistics

42

Waiting Times of Bank Customers at Different Banks

in minutes

Jefferson Valley Bank

Bank of Providence

6.5

4.2

6.6

5.4

6.7

5.8

6.8

6.2

7.1

6.7

7.3

7.7

7.4

7.7

7.7

8.5

7.7

9.3

7.7

10.0

Page 43: Descriptive statistics

43

Jefferson Valley Bank

Bank of Providence

6.5

4.2

6.6

5.4

6.7

5.8

6.8

6.2

7.1

6.7

7.3

7.7

7.4

7.7

7.7

8.5

7.7

9.3

7.7

10.0

Jefferson Valley Bank

7.15

7.20

7.7

7.10

Bank of Providence

7.15

7.20

7.7

7.10

Mean

Median

Mode

Midrange

Waiting Times of Bank Customers at Different Banks

in minutes

Page 44: Descriptive statistics

44

Dotplots of Waiting Times

Page 45: Descriptive statistics

45

Measures of Variation

Page 46: Descriptive statistics

46

Measures of Variation

Range

valuehighest lowest

value

Page 47: Descriptive statistics

47

a measure of variation of the scores about the mean

(average deviation from the mean)

Measures of Variation

Standard Deviation

Page 48: Descriptive statistics

48

Sample Standard Deviation Formula

calculators can compute the sample standard deviation of data

(x - x)2

n - 1S =

Page 49: Descriptive statistics

49

Sample Standard Deviation Shortcut Formula

n (n - 1)

s =n (x2) - (x)2

calculators can compute the sample standard deviation of data

Page 50: Descriptive statistics

50

x - x

Mean Absolute Deviation Formula

n

Page 51: Descriptive statistics

51

Population Standard Deviation

calculators can compute the population standard deviation

of data

2 (x - µ)

N =

Page 52: Descriptive statistics

52

Measures of Variation

Variance

standard deviation squared

Page 53: Descriptive statistics

53

Measures of Variation

Variance

standard deviation squared

s

2

2

}

use square keyon calculatorNotation

Page 54: Descriptive statistics

54

SampleVariance

PopulationVariance

Variance

(x - x )2

n - 1s2 =

(x - µ)2

N 2 =

Page 55: Descriptive statistics

55

Estimation of Standard DeviationRange Rule of Thumb

x - 2s x x + 2s

Range 4sor

(minimumusual value)

(maximum usual value)

Page 56: Descriptive statistics

56

Estimation of Standard DeviationRange Rule of Thumb

x - 2s x x + 2s

Range 4sor

(minimumusual value)

(maximum usual value)

Range

4s =

highest value - lowest value

4

Page 57: Descriptive statistics

57

x

The Empirical Rule(applies to bell-shaped distributions)FIGURE 2-15

Page 58: Descriptive statistics

58

x - s x x + s

68% within1 standard deviation

34% 34%

The Empirical Rule(applies to bell-shaped distributions)FIGURE 2-15

Page 59: Descriptive statistics

59

x - 2s x - s x x + 2sx + s

68% within1 standard deviation

34% 34%

95% within 2 standard deviations

The Empirical Rule(applies to bell-shaped distributions)

13.5% 13.5%

FIGURE 2-15

Page 60: Descriptive statistics

60

x - 3s x - 2s x - s x x + 2s x + 3sx + s

68% within1 standard deviation

34% 34%

95% within 2 standard deviations

99.7% of data are within 3 standard deviations of the mean

The Empirical Rule(applies to bell-shaped distributions)

0.1% 0.1%

2.4% 2.4%

13.5% 13.5%

FIGURE 2-15

Page 61: Descriptive statistics

61

Chebyshev’s Theorem applies to distributions of any shape.

the proportion (or fraction) of any set of data   lying within K standard deviations of the mean   is always at least 1 - 1/K2 , where K is any   positive number greater than 1.

at least 3/4 (75%) of all values lie within 2  standard deviations of the mean.

at least 8/9 (89%) of all values lie within 3  standard deviations of the mean.

Page 62: Descriptive statistics

62

Measures of Variation Summary

For typical data sets, it is unusual for a score to differ from the mean by more than 2 or 3 standard deviations.

Page 63: Descriptive statistics

63

z Score (or standard score)

the number of standard deviations that a given value x is above or

below the mean

Measures of Position

Page 64: Descriptive statistics

64

Sample

z = x - xs

Population

z = x - µ

Measures of Positionz score

Page 65: Descriptive statistics

65

- 3 - 2 - 1 0 1 2 3

Z

Unusual Values

Unusual Values

OrdinaryValues

Interpreting Z Scores

FIGURE 2-16

Page 66: Descriptive statistics

66

Measures of Position

Quartiles, Deciles,

Percentiles

Page 67: Descriptive statistics

67

Q1, Q2, Q3 divides ranked scores into four equal parts

Quartiles

25% 25% 25% 25%

Q3Q2Q1(minimum) (maximum)

(median)

Page 68: Descriptive statistics

68

D1, D2, D3, D4, D5, D6, D7, D8, D9

divides ranked data into ten equal parts

Deciles

10% 10% 10% 10% 10% 10% 10% 10% 10% 10%

D1 D2 D3 D4 D5 D6 D7 D8 D9

Page 69: Descriptive statistics

69

99 Percentiles

Percentiles

Page 70: Descriptive statistics

70

Quartiles, Deciles, Percentiles

Fractiles

(Quantiles)partitions data into approximately equal parts

Page 71: Descriptive statistics

71

Exploratory Data Analysis

the process of using statistical tools (such as graphs, measures of center, and measures of variation) to investigate the data sets in order to understand their important characteristics

Page 72: Descriptive statistics

72

Outliers

a value located very far away from almost all of the other values

an extreme value

can have a dramatic effect on the mean, standard deviation, and on the scale of the histogram so that the true nature of the distribution is totally obscured

Page 73: Descriptive statistics

73

Boxplots(Box-and-Whisker Diagram)

Reveals the: center of the data spread of the data distribution of the data presence of outliers

Excellent for comparing two or more data sets

Page 74: Descriptive statistics

74

Boxplots

5 - number summary

Minimum

first quartile Q1

Median (Q2)

third quartile Q3

Maximum

Page 75: Descriptive statistics

75

Boxplots

Boxplot of Qwerty Word Ratings

2 4 6 8 10 12 14

02 4 6

14

0

Page 76: Descriptive statistics

76

Bell-Shaped Skewed

Boxplots

Uniform

Page 77: Descriptive statistics

77

Exploring Measures of center: mean, median, and mode

Measures of variation: Standard deviation and range

Measures of spread and relative location:    minimum values, maximum value, and quartiles

Unusual values: outliers

Distribution: histograms, stem-leaf plots, and boxplots