Descriptive Statistics Descriptive Statistics Department of Statistics Stat 101 Lecture 5 and 6, 2012 Lecture Slides Descriptive Statistics

Descriptive Statistics

Department of Statistics

Stat 101

Lecture 5 and 6, 2012

Lecture Slides Descriptive Statistics

Outline

1 Descriptive Statistics
    Measure of Location
    Measure of Variability or Dispersion
    Measure of Distribution shape
    Relative Location
    Detecting Outliers

Measure of Location

The mean measures the central location of the data. The sample mean is denoted by $\bar{x}$, whereas the population mean is denoted by $\mu$.

Mathematically, the sample mean is given by
\[ \bar{x} = \frac{\sum x_i}{n}, \]
where $n$ is the number of observations.

The population mean is given by
\[ \mu = \frac{\sum x_i}{N}, \]
where $N$ is the population size.

The mean of a frequency distribution: if the numbers $x_1, x_2, x_3, \ldots, x_k$ occur with frequencies $f_1, f_2, f_3, \ldots, f_k$, respectively, their mean is given by
\[ \bar{x} = \frac{\sum f_i x_i}{\sum f_i}. \]

For grouped data, the $x_i$ are the class marks.
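The frequency-distribution formula can be sketched in a few lines of Python. The function name `frequency_mean` and the class marks and frequencies are illustrative assumptions, not data from the lecture.

```python
def frequency_mean(values, freqs):
    """Mean of a frequency distribution: sum(f_i * x_i) / sum(f_i)."""
    if len(values) != len(freqs):
        raise ValueError("values and freqs must have the same length")
    total = sum(freqs)
    if total <= 0:
        raise ValueError("total frequency must be positive")
    return sum(f * x for x, f in zip(values, freqs)) / total


# Illustrative grouped data (assumed): class marks and their frequencies.
class_marks = [5, 15, 25, 35]
frequencies = [2, 3, 4, 1]

grouped_mean = frequency_mean(class_marks, frequencies)
print(grouped_mean)
```

With all frequencies equal to 1 the function reduces to the ordinary sample mean.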

Assumed Mean: If $M$ is any guessed or assumed mean and $d_i = x_i - M$ $(i = 1, 2, 3, \ldots, k)$, then
\[ \bar{x} = M + \frac{\sum f_i d_i}{\sum f_i}. \]

For grouped data, if all the class intervals of a grouped frequency distribution have equal size $c$, then
\[ \bar{x} = M + c \, \frac{\sum f_i u_i}{\sum f_i}, \quad \text{where } u_i = \frac{x_i - M}{c}. \]

Geometric Mean. . .
Mathematically, the geometric mean is given by
\[ G = \sqrt[n]{x_1 x_2 x_3 \cdots x_n} = \sqrt[n]{\prod_{i=1}^{n} x_i}. \]
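As a rough check of the two formulas above, the sketch below computes a mean through an assumed mean $M$ and a geometric mean. The helper names and the numbers are hypothetical; the geometric mean is computed through logarithms, which is one common way to avoid overflow and is a choice made here, not something the slides prescribe.

```python
import math


def assumed_mean(values, freqs, guess):
    """x_bar = M + sum(f_i * d_i) / sum(f_i), with d_i = x_i - M."""
    total = sum(freqs)
    return guess + sum(f * (x - guess) for x, f in zip(values, freqs)) / total


def geometric_mean(values):
    """n-th root of the product of n positive values."""
    if any(v <= 0 for v in values):
        raise ValueError("geometric mean needs positive values")
    # exp(mean of logs) equals the n-th root of the product.
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Illustrative data (assumed): class marks and frequencies.
marks = [5, 15, 25, 35]
freqs = [2, 3, 4, 1]

shortcut = assumed_mean(marks, freqs, guess=15)  # any guess M gives the same mean
g = geometric_mean([2, 8])                       # square root of 16
print(shortcut, g)
```

Whatever value is guessed for $M$, the result is the same mean; a guess close to the true mean only keeps the deviations small for hand calculation.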

The harmonic mean is another way to calculate the average of a set of positive numbers.
It is the number of elements divided by the sum of the reciprocals of the elements.
For positive values that are not all equal, the harmonic mean is always the lowest of the three means (harmonic < geometric < arithmetic).
Mathematically, the harmonic mean is given by
\[ \text{Harmonic Mean} = \frac{N}{\frac{1}{a_1} + \frac{1}{a_2} + \frac{1}{a_3} + \cdots + \frac{1}{a_N}}, \]
where
$X = a_1, a_2, a_3, \ldots, a_N$ are the individual scores and
$N$ is the sample size (number of scores).
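A small Python sketch of the formula, compared against the arithmetic and geometric means from the standard `statistics` module. The two scores are illustrative assumptions.

```python
import statistics


def harmonic_mean(values):
    """N divided by the sum of the reciprocals of N positive values."""
    if any(v <= 0 for v in values):
        raise ValueError("harmonic mean needs positive values")
    return len(values) / sum(1 / v for v in values)


scores = [40, 60]  # illustrative values (assumed)

h = harmonic_mean(scores)
g = statistics.geometric_mean(scores)
a = statistics.mean(scores)
print(h, g, a)  # the three means appear in increasing order
```

The standard library also provides `statistics.harmonic_mean`, which should agree with the hand-written version here.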

The median is another measure of central tendency. The median is the middle value when the data are arranged in ascending order (smallest to largest).

    Arrange the data in ascending order (smallest value to largest value).
    For an odd number of observations, the median is the middle value.
    For an even number of observations, the median is the average of the two middle values.

Mode. . .
The mode is the value that occurs with the greatest frequency.
When more than one mode exists, the data are described as bimodal (two modes) or multimodal (more than two modes).
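The median and mode rules can be tried out with Python's standard `statistics` module. The data values are illustrative assumptions.

```python
import statistics

data = [7, 3, 9, 3, 5, 9]           # illustrative values (assumed)

ordered = sorted(data)              # ascending order, as the rule requires
med = statistics.median(data)       # even n: average of the two middle values
modes = statistics.multimode(data)  # every value tied for the highest frequency

print(ordered, med, modes)
```

Here two values share the highest frequency, so this small data set is bimodal.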

A percentile provides information about how the data are spread over the interval from the smallest value to the largest.
The pth percentile is a value such that at least p percent of the observations are less than or equal to this value and at least (100 - p) percent of the observations are greater than or equal to this value.
Calculating the pth percentile:

    Arrange the data in ascending order (smallest value to the largest value).
    Compute an index $i$:
    \[ i = \left(\frac{p}{100}\right) \times n, \]
    where $p$ is the percentile of interest and $n$ is the number of observations.
    If $i$ is not an integer, round it up to the next integer greater than $i$; the pth percentile is the value in that position.
    If $i$ is an integer, the pth percentile is the average of the values in positions $i$ and $i + 1$.

Background Problem

Let us determine the 85th percentile for the starting salary data.
Step 1: Arrange the data in ascending order.

3310 3355 3450 3480 3490 3520 3540 3550 3730 3925

Step 2: Compute the index
\[ i = \left(\frac{p}{100}\right) \times n = \left(\frac{85}{100}\right) \times 10 = 8.5 \]
Because $i$ is not an integer, round up. The position of the 85th percentile is the next integer greater than 8.5, the 9th position.
The 85th percentile position (9) corresponds to the salary 3730.
Also, finding the 60th percentile position returns $i = 6$, which is an integer. Hence the 60th percentile is the average of the values in positions 6 and 7:
\[ \frac{3520 + 3540}{2} = 3530 \]
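The index rule and the two results above can be checked with a short Python sketch. The function name `percentile` is an assumption for illustration; note that statistical libraries often use different interpolation rules and may return slightly different values.

```python
import math


def percentile(data, p):
    """p-th percentile by the index rule i = (p / 100) * n, for 0 < p < 100."""
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    ordered = sorted(data)
    i = p * len(ordered) / 100
    if i != int(i):
        # Not an integer: round up. Positions are 1-based, lists are 0-based.
        return ordered[math.ceil(i) - 1]
    i = int(i)
    # Integer: average the values in positions i and i + 1.
    return (ordered[i - 1] + ordered[i]) / 2


salaries = [3310, 3355, 3450, 3480, 3490, 3520, 3540, 3550, 3730, 3925]

p85 = percentile(salaries, 85)  # i = 8.5, so take the 9th position
p60 = percentile(salaries, 60)  # i = 6, so average positions 6 and 7
print(p85, p60)
```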

Location of Quartiles

Assignment

University   Endowment     University   Endowment
Legon        36.6          Central      10.1
Knust        22.9          VVU           6.7
Uds           7.2          IPS          17.2

Amounts are in billions of dollars.

    What is the mean endowment for these universities?
    What is the median endowment?
    What is the mode endowment?
    Compute the first and third quartiles.

Measure of Variability

The simplest measure of variability is the range.
Range = Largest value - Smallest value.
The interquartile range is a measure of variability that overcomes the dependency on extreme values.
This measure of variability is the difference between the third quartile $Q_3$ and the first quartile $Q_1$.
The interquartile range is the range for the middle 50 percent of the data.
\[ \text{IQR} = Q_3 - Q_1 \]
Variance is a measure of variability that utilizes all the data.
The variance is based on the difference between the value of each observation and the mean.
The difference between the observed value and its mean, $(x_i - \bar{x})$ or $(x_i - \mu)$, is called a deviation about the mean.

Mathematically, the population variance is given by
\[ \sigma^2 = \frac{\sum (x_i - \mu)^2}{N} \]
and the sample variance by
\[ s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}. \]

An alternative formula for the computation of the sample variance is
\[ s^2 = \frac{\sum x_i^2 - n\bar{x}^2}{n - 1}. \]

The standard deviation is defined as the positive square root of the variance.
Sample standard deviation: $s = \sqrt{s^2}$
Population standard deviation: $\sigma = \sqrt{\sigma^2}$

Coefficient of Variation: This descriptive statistic indicates how large the standard deviation is relative to the mean. It is usually expressed as a percentage.
\[ \text{Coefficient of Variation} = \left( \frac{\text{SD}}{\text{mean}} \times 100 \right)\% \]
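The following Python sketch evaluates both sample-variance formulas, the standard deviation and the coefficient of variation on one small data set, and cross-checks against the standard `statistics` module. The five sample values are illustrative assumptions.

```python
import math
import statistics

sample = [46, 54, 42, 46, 32]  # illustrative values (assumed)
n = len(sample)
mean = sum(sample) / n

# Definition: squared deviations about the mean, divided by n - 1.
s2 = sum((x - mean) ** 2 for x in sample) / (n - 1)

# Alternative formula: (sum of x_i^2 - n * mean^2) / (n - 1).
s2_alt = (sum(x * x for x in sample) - n * mean ** 2) / (n - 1)

s = math.sqrt(s2)               # sample standard deviation
cv = s / mean * 100             # coefficient of variation, in percent

# Treating the same five values as a whole population divides by N instead.
sigma2 = statistics.pvariance(sample)

print(s2, s2_alt, s, round(cv, 2), sigma2)
```

Both formulas give the same sample variance; the alternative form only saves work when computing by hand.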

Applications

The coefficient of variation (CV) is used to compare the variability of sets of data measured in different units.
For example, we may wish to know, for a certain population, whether body masses measured in kilograms are more variable than heights measured in centimeters.
The CV is also used to compare the variability of sets of data measured in the same units but whose means are quite different.
. . . Example 2.40, page 65

Distribution Shape

An important numerical measure of the shape of a distribution is called skewness.
The formula for the skewness of sample data is
\[ \text{Skewness} = \frac{n}{(n-1)(n-2)} \sum \left( \frac{x_i - \bar{x}}{s} \right)^3 \]

For data skewed to the left, the skewness is negative; for data skewed to the right, the skewness is positive.
If the data are symmetric, the skewness is zero.
For a symmetric distribution, the mean is equal to the median.
The mean is usually greater than the median for positively skewed data.
The mean is usually less than the median for negatively skewed data.
The median provides the preferred measure of location when the data are highly skewed.
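A minimal Python sketch of the sample skewness formula follows. The function name and the three small data sets are illustrative assumptions, chosen so that one is symmetric, one has a long right tail and one has a long left tail.

```python
import statistics


def sample_skewness(data):
    """n / ((n - 1)(n - 2)) * sum(((x_i - mean) / s) ** 3)."""
    n = len(data)
    if n < 3:
        raise ValueError("skewness needs at least three observations")
    mean = statistics.mean(data)
    s = statistics.stdev(data)  # sample standard deviation (n - 1 divisor)
    if s == 0:
        raise ValueError("skewness is undefined when all values are equal")
    return n / ((n - 1) * (n - 2)) * sum(((x - mean) / s) ** 3 for x in data)


symmetric = [1, 2, 3, 4, 5]      # illustrative values (assumed)
right_tail = [1, 2, 3, 4, 20]    # one large value pulls the mean up
left_tail = [-14, 2, 3, 4, 5]    # one small value pulls the mean down

print(sample_skewness(symmetric))
print(sample_skewness(right_tail), statistics.mean(right_tail), statistics.median(right_tail))
print(sample_skewness(left_tail), statistics.mean(left_tail), statistics.median(left_tail))
```

In the right-tailed set the mean exceeds the median, and in the left-tailed set it falls below it, matching the statements above.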

Background Problem

In January 2009, 18 men and 18 women entered the 19-24 age class.
Finish times in minutes are as follows (Naples Daily News, January 19, 2009). Times are shown in order of finish.

Men       Women     Men       Women
65.30     122.62    131.80    136.75
66.52     109.03    109.05    138
66.85     111.22    110.23    139
70.87     111.65    112.90    147.18
87.18     111.93    113.52    147.35
96.45     114.38    120.95    147.50
98.52     118.33    127.98    153.88
100.52    121.25    128.40    154.83
108.18    122.48    130.90    189.28

1 What is the median time for men and women runners? Compare men and women runners based on their median times.

2 Provide a five-number summary for both the men and the women.

3 Are there outliers in either group?

4 Show the box plots for the groups. Did men or women have the most variation in finish times? Explain.

Relative Location

A measure of relative location helps to determine how far a particular value is from the mean.
By using the mean and standard deviation, it is easy to determine the relative location of any observation.
z-score:
\[ z_i = \frac{x_i - \bar{x}}{s}, \]
where
$z_i$ = the z-score for $x_i$,
$\bar{x}$ = the sample mean,
$s$ = the sample standard deviation.
The z-score is often called the standardized value.
The z-score $z_i$ can be interpreted as the number of standard deviations $x_i$ is from the mean $\bar{x}$.
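A brief Python sketch of the z-score computation, using the standard `statistics` module. The sample values are illustrative assumptions chosen so that the mean is 44 and the sample standard deviation is 8.

```python
import statistics


def z_scores(data):
    """Standardized values z_i = (x_i - mean) / s for a sample."""
    mean = statistics.mean(data)
    s = statistics.stdev(data)  # sample standard deviation (n - 1 divisor)
    return [(x - mean) / s for x in data]


sample = [46, 54, 42, 46, 32]  # illustrative values (assumed)
z = z_scores(sample)
print(z)
```

A z-score of 1.25 for the value 54 means it lies 1.25 standard deviations above the mean; the value 32 lies 1.5 standard deviations below it.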

Detecting Outliers

Chebyshev's Theorem enables us to make statements about the proportion of the data values that must be within a specified number of standard deviations of the mean.
Chebyshev's Theorem. . .
At least $\left(1 - \frac{1}{z^2}\right)$ of the data values must be within $z$ standard deviations of the mean, where $z$ is any value greater than 1.
At least 0.75, or 75%, of the data values must be within $z = 2$ standard deviations of the mean.
At least 0.89, or 89%, of the data values must be within $z = 3$ standard deviations of the mean.
At least 0.94, or 94%, of the data values must be within $z = 4$ standard deviations of the mean.
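The three bounds quoted above follow directly from the formula, as this small Python sketch shows. The function name `chebyshev_bound` is an assumption for illustration.

```python
def chebyshev_bound(z):
    """Minimum proportion of values within z standard deviations of the mean."""
    if z <= 1:
        raise ValueError("Chebyshev's theorem requires z > 1")
    return 1 - 1 / z ** 2


for z in (2, 3, 4):
    print(z, chebyshev_bound(z))
```

The bound also works for non-integer z, for example z = 2.5, and it holds for any data set regardless of the shape of the distribution.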

For the test scores between 58 and 82,

Empirical Rule. . .
For data having a bell-shaped distribution:
Approximately 68% of the data values will be within one standard deviation of the mean.
Approximately 95% of the data values will be within two standard deviations of the mean.
Almost all of the data values will be within three standard deviations of the mean.
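If "bell-shaped" is modeled as an exact normal distribution (an assumption made here for illustration), the percentages in the empirical rule can be reproduced with `statistics.NormalDist` from the Python standard library.

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1


def within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)


for k in (1, 2, 3):
    print(k, round(within(k), 4))
```

Real data are only approximately bell-shaped, so observed proportions will differ somewhat from these values.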

Sometimes a data set will have one or more observations with unusually large or unusually small values.
These extreme values are called outliers.
Experienced statisticians take steps to identify outliers and then review each one carefully.
An outlier may be a data value that has been incorrectly recorded. If so, it can be corrected before further analysis.
An outlier may also be from an observation that was incorrectly included in the data set; if so, it can be removed.
Finally, an outlier may be an unusual data value that has been recorded correctly and belongs in the data set. In such cases it should remain.

Standardized values (z-scores) can be used to identify outliers. We recommend treating any data value with a z-score less than −3 or greater than +3 as an outlier.
Such data values can then be reviewed for accuracy and to determine whether they belong in the data set.
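A minimal Python sketch of the z-score rule follows. The function name and the twenty readings are illustrative assumptions. With the sample standard deviation, a single value can only reach a z-score beyond 3 in absolute value when the sample has at least 11 observations, so the rule is not informative for very small data sets.

```python
import statistics


def zscore_outliers(data, threshold=3.0):
    """Values whose z-score is below -threshold or above +threshold."""
    mean = statistics.mean(data)
    s = statistics.stdev(data)  # sample standard deviation (n - 1 divisor)
    return [x for x in data if abs((x - mean) / s) > threshold]


# Illustrative readings (assumed): nineteen values near 12 and one extreme value.
readings = [12, 11, 13, 12, 10, 11, 12, 13, 11, 12,
            10, 12, 11, 13, 12, 11, 12, 10, 13, 40]

flagged = zscore_outliers(readings)
print(flagged)
```

Flagged values are candidates for review, not automatic deletions.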

Background Problem

Annual sales, in millions of dollars, for 21 pharmaceutical companies follow.

 8408    1374    1872    8879    2459   11413
  608   14138    6452    1850    2818    1356
10498    7478    4019    4341     739    2127
 3653    5794    8305

    Provide a five-number summary.
    Compute the lower and upper limits.
    Do the data contain any outliers?
    Suppose a mistake (a transposition) was made during data entry and the largest sales value, 14138, was entered as 41138 million. Would the method of detecting outliers above identify this problem and allow for correction of the data entry error?
    Show a box plot.
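One way to approach the first questions is sketched below in Python. It reuses the percentile index rule from the Measure of Location slides for the quartiles and takes the "lower and upper limits" to be $Q_1 - 1.5 \times \text{IQR}$ and $Q_3 + 1.5 \times \text{IQR}$; that 1.5 multiplier is the usual box-plot convention and is an assumption here, since the slides do not state it. All function names are hypothetical.

```python
import math


def percentile(data, p):
    """p-th percentile by the index rule i = (p / 100) * n, for 0 < p < 100."""
    ordered = sorted(data)
    i = p * len(ordered) / 100
    if i != int(i):
        return ordered[math.ceil(i) - 1]      # round up to the next position
    i = int(i)
    return (ordered[i - 1] + ordered[i]) / 2  # average positions i and i + 1


def five_number_summary(data):
    """Smallest value, Q1, median, Q3, largest value."""
    return (min(data), percentile(data, 25), percentile(data, 50),
            percentile(data, 75), max(data))


def iqr_limits(data, k=1.5):
    """Lower and upper limits Q1 - k*IQR and Q3 + k*IQR (k = 1.5 is assumed)."""
    q1, q3 = percentile(data, 25), percentile(data, 75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


sales = [8408, 1374, 1872, 8879, 2459, 11413, 608, 14138, 6452, 1850, 2818,
         1356, 10498, 7478, 4019, 4341, 739, 2127, 3653, 5794, 8305]

summary = five_number_summary(sales)
lower, upper = iqr_limits(sales)
outliers = [x for x in sales if x < lower or x > upper]

# The transposed entry 41138 in place of 14138.
mistyped = [41138 if x == 14138 else x for x in sales]
low2, up2 = iqr_limits(mistyped)
flagged = [x for x in mistyped if x < low2 or x > up2]

print(summary, (lower, upper), outliers, flagged)
```

Because the quartiles do not depend on the largest value, the limits are unchanged by the transposition, and the mistyped value falls outside them.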

Background Problem

1 The results of a national survey showed that, on average, adults sleep 6.9 hours per night. Suppose that the standard deviation is 1.2 hours.

    1 Use Chebyshev's theorem to calculate the percentage of individuals who sleep between 4.5 and 9.3 hours.

    2 Use Chebyshev's theorem to calculate the percentage of individuals who sleep between 3.9 and 9.9 hours.

    3 Assume that the number of hours of sleep follows a bell-shaped distribution. Use the empirical rule to calculate the percentage of individuals who sleep between 4.5 and 9.3 hours per day. How does this result compare to the value that you obtained using Chebyshev's theorem in part (1)?

Background Problem

Below are some appliance brands and their ratings on a scale from 1 to 5, with 5 being best.

Brand        Rating     Brand      Rating
Sony         4.00       Acatel     4.67
Ericson      4.12       Hp         2.14
Nokia        3.82       Acer       4.09
Hisonic      4.00       Toshiba    4.17
Hisense      4.56       apple      4.88
Panasonic    4.32       Dell       4.26
Blackberry   4.33       Binatone   2.32
Philips      4.50       PSB        4.50
Sharp        4.64       Infinity   4.17
HTC          4.20       Bose       2.17

    Compute the mean and the median.
    Compute the first and the third quartiles.
    Compute the standard deviation.
    The skewness of the data is -1.67. Comment on the shape of the distribution.
    What are the z-scores associated with Hisense and Sony?
    Do the data contain any outliers? Explain.
