
Engineering Mathematics 4

David Ramsey

Room: B2-026
e-mail: [email protected]
website: www.ul.ie/ramsey

January 20, 2010

1 / 94


Course Outline

1. Data Collection and Descriptive Statistics.

2. Probability Theory

3. Statistical Inference


Course Texts

1. Engineering Mathematics 4 - Ref. No. 5301 (includes exercise lists, statistical tables and past papers; will be available from the print room).

2. Montgomery - Applied Statistics and Probability for Engineers.

3. Stuart - An introduction to statistical analysis for business and industry: a problem solving approach.

4. Montgomery, Runger, Fari - Engineering Statistics.

Similar books are available in section 519 of the library.


1 Data Collection and Descriptive Statistics

Populations of objects and individuals show variation with respect to various traits (e.g. height, political preferences, the working life of a light bulb).

It is impractical to observe all the members of the population. In order to describe the distribution of a trait in the population, we select a sample.

On the basis of the sample we gain information on the population as a whole.


1.1 Types of Variables

1. Qualitative variables: These are normally categorical (non-numerical) variables. We distinguish between two types of qualitative variables:

a) nominal: these variables are not naturally ordered in any way (e.g. i. department - mechanical engineering, mathematics, economics; ii. industrial sector).

b) ordinal: there is a natural order for such categorisations, e.g. with respect to smoking, people may be categorised as 1: non-smokers, 2: light smokers and 3: heavy smokers. It can be seen that the higher the category number, the more an individual smokes.

Exam grades are ordinal variables.


2. Quantitative Variables

These are variables which naturally take numerical values (e.g. age, height, number of children). Such variables can be measured or counted.

As before we distinguish between two types of quantitative variables.

a) Discrete variables: these are variables that take values from a set that can be listed (most commonly integer values, i.e. they are variables that can be counted). For example, number of children, the results of die rolls.


b) Continuous variables

These are variables that can take values in a given range to any number of decimal places (such variables are measured, normally according to some unit, e.g. height, age, weight).

It should be noted that such variables are only measured to a given accuracy (e.g. height is measured to the nearest centimetre, age is normally given to the nearest year).

If a discrete random variable takes many values (e.g. the population of a town), then for practical purposes it is treated as a continuous variable.


1.2 Collection of data

Since it is impractical to survey all the individuals in a population, we need to base our analysis on a sample.

A population is the entire collection of individuals or objects we wish to describe.

A sample is the subset of the population chosen for data collection.

A sampling frame is the “list” used to choose the sample.

A unit is a member of the population.

A variable is any trait that varies from unit to unit.


Collection of data

For example, suppose we wish to investigate the political preferences of Irish voters (our variable of interest).

The population is made up of people eligible to vote in Irish elections.

A unit is any eligible voter.

The lists of eligible voters for each constituency may be used as the sampling frame.


Collection of data

A list of addresses in various constituencies may also be used as the sampling frame. However, there is no one-to-one correspondence between addresses and eligible voters.

The sample is the set of people asked for their political preferences.

The variable of interest is the party an individual wishes to vote for (other socio-demographic variables may be collected, such as age, occupation, education).

The sample size is the number of individuals in a sample and is denoted by n.


1.2.1 Parameters and Statistics

A parameter is an unknown number describing a population. For example, it may be that 9% of the population of eligible voters wish to vote for the Green Party (we do not, however, observe this population proportion).

A statistic is a number describing a sample. For example, 8% of a sample may wish to vote for the Green Party. This is the sample proportion.

Statistics may be used to describe a population, but they only estimate the real parameters of the population.


Parameters and Statistics - Precision of Statistics

Naturally, the statistics from a sample will show some variation around the appropriate parameters,

e.g. 9% of the population wish to vote for the Green Party, but only 8% in the sample.

The greater the sample size, the more precise the results. Suppose we take a large number of samples of size n: the larger n, the less variable the sample proportion from the various samples, i.e. the smaller the variance of the sample proportion.
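The effect of sample size on precision can be illustrated numerically. The sketch below is not part of the course notes: it assumes simple random sampling and uses the standard binomial result that the sample proportion has standard deviation sqrt(p(1 - p)/n), with p = 0.09 taken from the Green Party example. The function name `proportion_sd` is purely illustrative.

```python
import math


def proportion_sd(p: float, n: int) -> float:
    """Standard deviation of a sample proportion.

    Assumption (not from the notes): simple random sampling, so that
    sd = sqrt(p * (1 - p) / n).
    """
    return math.sqrt(p * (1 - p) / n)


p = 0.09  # population proportion from the Green Party example
for n in (100, 1000, 10000):
    print(f"n = {n:5d}: sd of sample proportion = {proportion_sd(p, n):.4f}")
```

Each tenfold increase in n reduces the standard deviation by a factor of sqrt(10), so the sample proportions from repeated samples cluster more tightly around the population proportion.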


Parameters and Statistics - Bias

However, there may be intrinsic bias from two possible sources:

a) Sampling bias - when a sample is chosen in a way such that some members of the population are more likely to be chosen than others. e.g. Suppose that the Labour party is most popular in Dublin. If we used samples of voters from Dublin to estimate support in the whole of Ireland, we would systematically overestimate the support of the Labour party.

b) Non-sampling bias - this results from mistakes in data entry and/or how interviewees react to being questioned. For example, Fianna Fail supporters may be more likely to hide their preference than other individuals. In this case, we would systematically underestimate the support of Fianna Fail.


Non-Sampling Bias

Other sources of bias may be:

1. Lack of anonymity.

2. The wording of a question.

3. The desire to give an answer that would please the interviewer (e.g. surveys may systematically overestimate the willingness of individuals to pay extra for environmentally friendly goods).


Precision and Bias

It should be noted that bias is a characteristic of the way in which data are collected, not of a single sample.

Increasing the sample size will improve the precision of an estimate, but will not affect the bias.


1.2.2 Random Sampling

Sampling is said to be random if each individual has an equal probability of being chosen to be in the sample and this probability is not affected by who else is chosen to be in the sample.

An estimate of a parameter is unbiased if there is no systematic tendency to under- or over-estimate the parameter.

Random sampling does not ensure that the estimates of parameters are unbiased. However, any bias that remains does not result from the sampling procedure (see non-sampling bias above).


Another Example of Sampling Bias - Estimation of the Population Mean

e.g. Suppose the population of interest is the Irish population as a whole and the variable of interest is height.

Suppose I base my estimate of the mean height of the population on the mean height of a sample of students. Since students tend to be on average taller than the population as a whole, I will systematically overestimate the mean height in the population.

That is to say, if I consider many samples of students of, say, size 100, a large majority of such samples would give me an overestimate of the mean height of the population as a whole.


1.3 Descriptive Statistics - 1.3.1 Qualitative (Categorical) Data

We may describe qualitative data using

a) Frequency tables.

b) Bar charts.

c) Pie charts.

n denotes the total number of observations (the sample size).


Frequency tables

Frequency tables display how many observations fall into each category (the frequency column), as well as the relative frequency of each category (the proportion of observations falling into each category).

Let $n_i$ denote the number of observations in category i. The relative frequency of category i is $f_i$, where

$$f_i = \frac{n_i}{n}.$$

Multiplying by 100, we obtain the relative frequency as a percentage.

If there are missing data we may also give the relative frequencies in terms of the actual number of observations, $n'$, i.e.

$$f'_i = \frac{n_i}{n'}.$$


Frequency tables

For example, 200 students were asked which of the following bands they preferred: Franz Ferdinand, Radiohead or Coldplay. The answers may be presented in the following frequency table:

| Band | Frequency | Relative Frequency (%) |
| --- | --- | --- |
| Coldplay | 62 | 62 × 100/200 = 31 |
| Franz F. | 66 | 66 × 100/200 = 33 |
| Radiohead | 72 | 72 × 100/200 = 36 |


Bar chart

In a bar chart the height of a bar represents the relative frequency of a given category (or the number of observations in that category).


Pie chart

The size of a slice in a pie chart represents the relative frequency of a category. Hence, the angle made by the slice representing category i is given (in degrees) by $\alpha_i$, where

$$\alpha_i = 360 f_i = \frac{360 n_i}{n}$$

(i.e. we multiply the relative frequency by the number of degrees in a full revolution).

Suppose the relative frequency of observations in group i is given in percentage terms, denoted $p_i$. Since 1% of the observations in the sample corresponds to an angle of 3.6 degrees,

$$\alpha_i = 3.6 p_i.$$


Pie chart


Example 1.1

Suppose 1000 randomly chosen voters are asked which party they are going to vote for in the forthcoming election. The results are as given below:

Fianna Fail: 360
Fine Gael: 270
Green Party: 100
Labour: 90
Progressive Democrats: 70
No Answer: 110


Example 1.1

The frequency table is as follows:

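| Party | Frequency | Rel. Freq. (%) | Rel. Freq. (%) of non-missing data |
| --- | --- | --- | --- |
| Fianna Fail | 360 | 36 | 360 × 100/890 = 40.45 |
| Fine Gael | 270 | 27 | 270 × 100/890 = 30.34 |
| Green Party | 100 | 10 | 100 × 100/890 = 11.24 |
| Labour | 90 | 9 | 90 × 100/890 = 10.11 |
| Prog. Dems. | 70 | 7 | 70 × 100/890 = 7.87 |
| No answer | 110 | 11 | - |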
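The two relative-frequency columns can be reproduced with a few lines of Python. This is only an illustrative sketch using the counts from Example 1.1; the variable names (`counts`, `n_valid`) are not from the notes.

```python
# Counts from Example 1.1; "No answer" is treated as missing data.
counts = {
    "Fianna Fail": 360,
    "Fine Gael": 270,
    "Green Party": 100,
    "Labour": 90,
    "Prog. Dems.": 70,
    "No answer": 110,
}

n = sum(counts.values())           # all respondents (n)
n_valid = n - counts["No answer"]  # respondents who gave an answer (n')

for party, n_i in counts.items():
    pct = 100 * n_i / n
    if party == "No answer":
        print(f"{party:12s} {n_i:4d} {pct:6.2f}      -")
    else:
        pct_valid = 100 * n_i / n_valid
        print(f"{party:12s} {n_i:4d} {pct:6.2f} {pct_valid:6.2f}")
```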


Example 1.1

The bar chart illustrating just the non-missing data is given by


Example 1.1

The pie chart for the sample as a whole (i.e. including those who gave no answer) is derived as follows:

The angle made by the slice representing the support of Fianna Fail is given by 36 × 3.6 = 129.6°.

The angle made by the slice representing the support of Fine Gael is given by 27 × 3.6 = 97.2°, etc.
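The remaining angles follow in the same way. As a minimal sketch (the dictionary names are illustrative, the percentages are those of the frequency table in Example 1.1), the full set of angles can be computed and checked to add up to a full revolution:

```python
# Relative frequencies (%) for the whole sample, including "No answer".
percentages = {
    "Fianna Fail": 36,
    "Fine Gael": 27,
    "Green Party": 10,
    "Labour": 9,
    "Prog. Dems.": 7,
    "No answer": 11,
}

# 1% of the sample corresponds to 3.6 degrees.
angles = {party: 3.6 * p for party, p in percentages.items()}

for party, angle in angles.items():
    print(f"{party:12s} {angle:6.1f} degrees")
print(f"Total: {sum(angles.values()):.1f} degrees")
```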


Example 1.1

The pie chart is as follows:


1.3.2 Graphical Presentation of Quantitative Data

Discrete data can be presented in the form of frequency tables and/or bar charts (as above).

The distribution of continuous data can be presented using

a) A histogram.

b) Its empirical distribution function (also referred to as an ogive).


Histograms for continuous variables

In order to draw a histogram for a continuous variable, we need to categorise the data into intervals of equal length. The end points of these intervals should be round numbers.

The number of categories used should be approximately $\sqrt{n}$ (normally between 5 and 20 categories are used).

For example, if we have 30 observations then we should use about $\sqrt{30} \approx 5.5$ categories. Hence, 5 and 6 are sensible choices for the number of categories.

Let k be the number of categories.


Histograms

In order to choose the length of each interval, L, we use

$$L \approx \frac{x_{\max} - x_{\min}}{k} = \frac{r}{k},$$

where $x_{\max}$ is the smallest “round” number larger than all the observations and $x_{\min}$ is the largest “round” number smaller than all the observations. The difference between these numbers is an estimate of the range of the data, denoted r.

If necessary, L is rounded upwards, so that the intervals are of “nice” length and the whole range of the data is covered.


Histograms

The intervals used are

$$[x_{\min}, x_{\min} + L],\ (x_{\min} + L, x_{\min} + 2L],\ \ldots,\ (x_{\max} - L, x_{\max}].$$

In general the lower end-point of an interval is assumed not to belong to that interval (to avoid a number belonging to two classes).


Histograms

A histogram is very similar to a bar chart. The height of the block corresponding to an interval is the relative frequency of observations in that interval.

Thus, the height of a block is the number of observations in that interval divided by the total number of observations.


The Ogive

Suppose we have a reasonably large amount of data. In order to draw the empirical distribution function (ogive), we first split the data into categories as when drawing a histogram.

Let $x_0, x_1, x_2, \ldots, x_k$ be the endpoints of the intervals formed (k denotes the number of intervals). That is to say, the i-th interval is $(x_{i-1}, x_i]$.


The Ogive

The ogive is a graph of the cumulative relative frequency.

The cumulative relative frequency at an endpoint $x_i$ is the proportion of observations less than $x_i$.

Note that no observations are smaller than the lower endpoint of the first interval ($x_0$), i.e. at $x_0$ the cumulative relative frequency is 0.

The cumulative relative frequency at the upper endpoint of an interval can be calculated by adding the relative frequency of observations in that interval to the cumulative relative frequency at the lower endpoint of that interval (the upper endpoint of the previous interval).


The Ogive

Hence, we can calculate the cumulative relative frequency at the endpoints of each interval.

We then draw a scatter plot for the k + 1 values of the cumulative relative frequency at the endpoints of each interval.

The X-coordinate of each of these k + 1 points is an endpoint of one of the intervals and the Y-coordinate is the cumulative relative frequency at that endpoint.

To draw the ogive we connect each of these points to the next using a straight line.

Note that the height of the ogive at the final endpoint is by definition 1.


Example 1.2

We observe the height of 20 individuals (in cm). The data are given below:

172, 165, 188, 162, 178, 183, 171, 158, 174, 184,
167, 175, 192, 170, 179, 187, 163, 156, 178, 182.

Draw a histogram and ogive representing these data.


Example 1.2

We first consider the histogram.

First we choose the number of classes and the corresponding intervals.

$\sqrt{20} \approx 4.5$, thus we should choose 4 or 5 intervals.


Example 1.2

The tallest individual is 192cm tall and the shortest 156cm.

200cm is the smallest round number larger than all the observations and 150cm is the largest round number smaller than all the observations. Thus, we take the range to be 50.

To calculate the length of the intervals, we use

$$L = \frac{r}{k}.$$

Taking k to be 4, L = 12.5. Taking k = 5, L = 10 (a “nicer” length).

Hence, it seems reasonable to use 5 intervals of length 10, starting at 150.
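The choice of k and L for this example can be checked with a short script. It is a sketch only: the round bounds 150 and 200 are entered by hand (as in the text) rather than derived automatically, and the helper name `interval_length` is illustrative.

```python
import math


def interval_length(r: float, k: int) -> float:
    """Interval length L = r / k for k intervals covering a range r."""
    return r / k


n = 20                   # number of observations in Example 1.2
x_min, x_max = 150, 200  # round numbers bracketing the data (chosen by eye)
r = x_max - x_min        # estimate of the range

print(f"sqrt(n) = {math.sqrt(n):.2f}")
for k in (4, 5):
    print(f"k = {k}: L = {interval_length(r, k)}")
```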


Example 1.2

If we assume that the upper endpoint of an interval belongs to that interval, then we have the intervals [150, 160], (160, 170], (170, 180], (180, 190], (190, 200].

Now we count how many observations fall into each interval and hence the relative frequency of observations in each interval.


Example 1.2

| Height (x) | No. of Observations | Rel. Frequency |
| --- | --- | --- |
| 150 ≤ x ≤ 160 | 2 | 2/20 = 0.1 |
| 160 < x ≤ 170 | 5 | 5/20 = 0.25 |
| 170 < x ≤ 180 | 7 | 7/20 = 0.35 |
| 180 < x ≤ 190 | 5 | 5/20 = 0.25 |
| 190 < x ≤ 200 | 1 | 1/20 = 0.05 |
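The counting step can be sketched in Python as follows. The function `count_in_intervals` is a hypothetical helper written for this illustration; it uses the convention of Example 1.2 that the upper endpoint belongs to an interval and that the first interval also includes its lower endpoint.

```python
heights = [172, 165, 188, 162, 178, 183, 171, 158, 174, 184,
           167, 175, 192, 170, 179, 187, 163, 156, 178, 182]
endpoints = [150, 160, 170, 180, 190, 200]


def count_in_intervals(data, endpoints):
    """Count observations in [e0, e1], (e1, e2], ..., (e_{k-1}, e_k]."""
    counts = [0] * (len(endpoints) - 1)
    for x in data:
        for i in range(len(counts)):
            lower, upper = endpoints[i], endpoints[i + 1]
            # Only the first interval is closed on the left.
            inside = (lower <= x <= upper) if i == 0 else (lower < x <= upper)
            if inside:
                counts[i] += 1
                break
    return counts


counts = count_in_intervals(heights, endpoints)
rel_freq = [c / len(heights) for c in counts]
print(counts)
print(rel_freq)
```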


Example 1.2

The histogram is given below:


Interpretation of the histogram of a continuous variable

A histogram is an estimator of the density function of a variable (see the chapter on the distribution of random variables in Section 2).

The distribution of height seems to be reasonably symmetrical around 175cm.


Example 1.2 - The Ogive

From the frequency table, the cumulative relative frequencies at the endpoints of the intervals are given by:

| Endpoint | Cum. Rel. Freq. |
| --- | --- |
| 150 | 0.0 |
| 160 | 0.1 |
| 170 | 0.1 + 0.25 = 0.35 |
| 180 | 0.35 + 0.35 = 0.7 |
| 190 | 0.7 + 0.25 = 0.95 |
| 200 | 0.95 + 0.05 = 1 |
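The running totals in this table can be generated with `itertools.accumulate`. In this sketch the counts are accumulated first and divided by n afterwards, a small design choice that avoids floating-point round-off from repeatedly adding decimals; the variable names are illustrative.

```python
from itertools import accumulate

endpoints = [150, 160, 170, 180, 190, 200]
counts = [2, 5, 7, 5, 1]  # observations per interval (Example 1.2)
n = sum(counts)

# Cumulative counts at each endpoint, starting from 0 at the first endpoint.
cumulative_counts = [0] + list(accumulate(counts))
crf = [c / n for c in cumulative_counts]

for x, value in zip(endpoints, crf):
    print(x, value)
```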

The Ogive is given on the next slide


The Ogive

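[Figure: ogive for the height data. x-axis: height (150 to 200 cm); y-axis: cumulative relative frequency (0 to 1). Straight-line segments join the points (150, 0), (160, 0.1), (170, 0.35), (180, 0.7), (190, 0.95) and (200, 1).]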


The Ogive

The graph of the ogive can be used to estimate percentiles of a distribution.

By definition α% of observations are less than the α-percentile.

For example, the 60-percentile of height may be estimated by drawing a horizontal line from 0.6 on the y-axis until it hits the cumulative distribution function.

The 60-percentile is the value on the x-axis directly below this point of intersection. This is illustrated on the next slide.


Estimation of percentiles

[Figure: ogive for the height data, used to read off the 60-percentile. x-axis: height (150 to 200 cm); y-axis: cumulative relative frequency (0 to 1).]


The Ogive

The 60-percentile of height in this example is approximately 178cm.

The value of the cumulative relative frequency at the upper endpoint of the final interval is by definition 1.


1.3.3 Symmetry and Skewness of Distributions

From a histogram we may infer whether the distribution of a random variable is symmetric or not.

The histogram of height shows that the distribution is reasonably symmetric (even if the distribution of height in the population were symmetric, we would normally observe some small deviation from symmetry in the histogram as we observe only a sample).


Right-Skewed distributions

A distribution is said to be right-skewed if there are observations a long way to the right of the “centre” of the distribution, but not a long way to the left.

For example, the distribution of the age of students is right-skewed. Most students will be around 20 years of age. None will be much younger, but there are some mature students.

The distribution of wages will also be right-skewed.


A right-skewed distribution


Left-skewed distributions

A distribution is said to be left-skewed if there are observations a long way to the left of the “centre” of the distribution, but not a long way to the right.

For example, the weight of participants in the Oxford-Cambridge boat race will have a left-skewed distribution.

This is due to the fact that the majority of participants will be heavy rowers, while a minority will be very light coxes.


A left-skewed distribution


1.4 Numerical Methods of Describing Quantitative Data

We consider two types of measure:

1. Measures of centrality - give information regarding the location of the centre of the distribution (the mean, median).

2. Measures of variability (dispersion) - give information regarding the level of variation (the range, variance, standard deviation, interquartile range).


1.4.1 Measures of centrality

1. The Sample Mean, $\bar{x}$.

Suppose we have a sample of n observations. The mean is given by the sum of the observations divided by the number of observations:

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i,$$

where $x_i$ is the value of the i-th observation.


The Population Mean

$\mu$ denotes the population mean. If there are N units in the population, then

$$\mu = \frac{\sum_{i=1}^{N} x_i}{N},$$

where $x_i$ is the value of the trait for individual i in the population.

$\mu$ is normally unknown. The sample mean $\bar{x}$ (a statistic) is an estimator of the population mean $\mu$ (a parameter).


2. The sample median $Q_2$

In order to calculate the sample median, we first order the observations from the smallest to the largest. The order statistic $x_{(i)}$ is the i-th smallest observation in a sample (i.e. $x_{(1)}$ is the smallest observation and $x_{(n)}$ is the largest observation).

The notation for the median comes from the fact that the median is the second quartile (see quartiles in the section on measures of dispersion).


The sample median $Q_2$

If n is odd, then the median is the observation which appears in the centre of the ordered list of observations. Hence,

$$Q_2 = x_{(0.5[n+1])}.$$

If n is even, then the median is the average of the two observations which appear in the centre of the ordered list of observations. Hence,

$$Q_2 = 0.5[x_{(0.5n)} + x_{(0.5n+1)}].$$

One half of the observations are smaller than the median and one half are greater.


The sample median

One advantage of the median as a measure of centrality is that it is less sensitive to extreme observations (which may be errors) than the mean. When the distribution is skewed, it is preferable to use the median as a measure of centrality.

e.g. the median wage rather than the average wage should be used as a measure of what the “average man on the street” earns.

The distribution of wages is right-skewed and the small proportion of people who earn very high wages will have a significant effect on the mean. The mean is greater than the median.

For left-skewed distributions the mean is less than the median.


The sample median

When we have a large number of observations, the 50-percentile can be used to approximate the median. This can be read from the ogive.

Using the ogive of height considered earlier, the median height is approximately 175cm.

A more accurate method of approximating the median in such cases is considered in the section on grouped data.


1.4.2 Measures of Dispersion - 1. The Range

The range is defined to be the largest observation minus the smallest observation.

Since the range is only based on 2 observations, it conveys little information and is sensitive to extreme values (errors).


2. The sample variance $s^2$

The sample variance is a measure of the average squared distance from the mean.

The formula for the sample variance is given by

$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2.$$

$s^2 \geq 0$ and $s^2 = 0$ if and only if all the observations are equal to each other.


3. The sample standard deviation s

The sample standard deviation is given by the square root of the variance.

It (and hence the sample variance) can be calculated on a scientific calculator by using the $\sigma_{n-1}$ or $s_{n-1}$ function as appropriate.

In simple terms, the standard deviation is a measure of the average distance of an observation from the mean. It cannot be greater than the maximum deviation from the mean (apart from a factor of at most $\sqrt{n/(n-1)}$, which arises from dividing by $n-1$ rather than $n$).


4. The interquartile range

The i-th quartile, $Q_i$, is taken to be the value such that i quarters of the observations are less than $Q_i$. Thus, $Q_2$ is the sample median.

If $\frac{n+1}{4}$ is an integer, then the lower quartile $Q_1$ is given by

$$Q_1 = x_{\left(\frac{n+1}{4}\right)}.$$

Otherwise, if a is the integer part of $\frac{n+1}{4}$ [this is obtained by simply removing everything after the decimal point], then

$$Q_1 = 0.5[x_{(a)} + x_{(a+1)}].$$


The interquartile range

If $\frac{3n+3}{4}$ is an integer, then the upper quartile $Q_3$ is given by

$$Q_3 = x_{\left(\frac{3n+3}{4}\right)}.$$

Otherwise, if b is the integer part of $\frac{3n+3}{4}$, then

$$Q_3 = 0.5[x_{(b)} + x_{(b+1)}].$$


The interquartile range

When there is a large amount of data, the quartiles $Q_1$, $Q_2$, $Q_3$ can be calculated from the ogive as the 25-th, 50-th and 75-th percentiles, respectively.

From the ogive, the lower and upper quartiles for the height data given previously are approximately 166cm and 182cm, respectively.

The interquartile range (IQR) is the difference between the upper and lower quartiles,

$$IQR = Q_3 - Q_1,$$

i.e. for the height data approximately 16cm.
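Reading a percentile off the ogive amounts to linear interpolation between two neighbouring endpoints. The following sketch makes that explicit for the height data of Example 1.2; the function name `ogive_percentile` is hypothetical, and the interpolation rule is simply the straight-line construction of the ogive described earlier.

```python
def ogive_percentile(endpoints, crf, alpha):
    """Estimate the alpha-percentile (0 < alpha <= 100) from an ogive.

    endpoints: interval endpoints x_0 < x_1 < ... < x_k
    crf:       cumulative relative frequency at each endpoint (0 at x_0, 1 at x_k)
    """
    if not 0 < alpha <= 100:
        raise ValueError("alpha must lie in (0, 100]")
    target = alpha / 100
    for i in range(1, len(endpoints)):
        if crf[i] >= target:
            x0, x1 = endpoints[i - 1], endpoints[i]
            y0, y1 = crf[i - 1], crf[i]
            # Straight line between (x0, y0) and (x1, y1), solved for y = target.
            return x0 + (target - y0) / (y1 - y0) * (x1 - x0)
    raise ValueError("crf must reach 1 at the final endpoint")


endpoints = [150, 160, 170, 180, 190, 200]
crf = [0.0, 0.1, 0.35, 0.7, 0.95, 1.0]

q1 = ogive_percentile(endpoints, crf, 25)
q3 = ogive_percentile(endpoints, crf, 75)
print(f"Q1 = {q1:.1f}, Q3 = {q3:.1f}, IQR = {q3 - q1:.1f}")
```

For these data the interpolated quartiles agree with the graphical readings of about 166cm and 182cm.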


The interquartile range

The quartiles can be used to display the distribution of a trait in the form of a box plot.

The central line represents the median, the ends of the box represent the lower and upper quartiles.

Points outside the whiskers represent outliers.

Box plots can be used to investigate whether a distribution is skewed (see following diagrams).


A symmetric distribution

The ends of the boxes and the whiskers are symmetrically distributed about the median.


A right-skewed distribution

There are several outliers much greater than the median, but none much smaller than the median. The upper endpoint of the whisker is much further from the median than the lower endpoint.


A left-skewed distribution

There are several outliers much smaller than the median, but none much greater than the median. The upper endpoint of the whisker is much closer to the median than the lower endpoint.


Choice of the measure of dispersion

The units of all the measures used so far (except for the variance) are the same units as those used for the measurement of observations. The units of variance are the square of the units of measurement.

For example, if we observe velocity in metres per second, the variance is measured in metres squared per second squared. For this reason the standard deviation is generally preferred to the variance as a measure of dispersion.

If a distribution is skewed then the interquartile range is a more reliable measure of the dispersion of a random variable than the standard deviation.


Comparison of the dispersion of two variables

Sometimes we wish to compare the dispersion of two variables.

In cases where different units are used to measure the two variables or the means of two variables are very different, it may be useful to use a measure of dispersion which does not depend on the units in which it is measured.

The coefficient of variation C.V. does not depend on the units of measurement. It is the standard deviation divided by the sample mean:

$$\text{C.V.} = \frac{s}{\bar{x}}.$$


Example 1.3 - The sample mean

Calculate the measures of centrality and dispersion defined above for the following data.

6, 9, 12, 9, 8, 10

There are 6 items of data, hence

$$\bar{x} = \frac{\sum_{i=1}^{6} x_i}{6} = \frac{6 + 9 + 12 + 9 + 8 + 10}{6} = 9.$$


Example 1.3 - The sample median

In order to calculate the median, we first order the data. If an observation occurs k times, then it must appear k times in the list of ordered data.

The ordered list of data is 6, 8, 9, 9, 10, 12.

Since there is an even number of data (n = 6), the median is the average of the two observations in the middle of this ordered list. Hence,

$$Q_2 = 0.5[x_{(n/2)} + x_{(1 + n/2)}] = 0.5[x_{(3)} + x_{(4)}] = \frac{9 + 9}{2} = 9.$$


Example 1.3 - The range

The range is the difference between the largest and the smallest observations:

Range = 12 − 6 = 6.


Example 1.3 - The variance and standard deviation

The variance is given by

$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{(6-9)^2 + (9-9)^2 + (12-9)^2 + (9-9)^2 + (8-9)^2 + (10-9)^2}{5} = 4.$$

The standard deviation is given by $s = \sqrt{s^2} = 2$.
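The hand calculations of Example 1.3 so far can be cross-checked with Python's standard `statistics` module, whose `variance` and `stdev` functions use the divisor n - 1, matching the definition of $s^2$ used here. This is only a verification sketch; the variable names are illustrative.

```python
import statistics

data = [6, 9, 12, 9, 8, 10]

mean = statistics.mean(data)
median = statistics.median(data)      # average of the two middle values for even n
data_range = max(data) - min(data)
variance = statistics.variance(data)  # sample variance (divisor n - 1)
std_dev = statistics.stdev(data)      # square root of the sample variance

print(f"mean = {mean}, median = {median}, range = {data_range}")
print(f"variance = {variance}, standard deviation = {std_dev}")
```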


Example 1.3 - The interquartile range

In order to calculate the interquartile range, we first calculate the lower and upper quartiles. n = 6, hence $\frac{n+1}{4} = 1.75$. The integer part of this number is 1. Hence, the lower quartile is

$$Q_1 = 0.5[x_{(1)} + x_{(2)}] = 0.5(6 + 8) = 7.$$

Similarly, $\frac{3n+3}{4} = 5.25$. The integer part of this number is 5. Hence, the upper quartile is

$$Q_3 = 0.5[x_{(5)} + x_{(6)}] = 0.5(10 + 12) = 11.$$

Hence, IQR = 11 − 7 = 4.
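Statistical software uses several different quartile conventions, so library functions may not reproduce these values. The sketch below therefore implements the rule from these notes directly; the function name `quartile` is hypothetical, and the restriction to n ≥ 3 is an assumption added so that the required order statistics always exist.

```python
def quartile(data, which):
    """Lower (which=1) or upper (which=3) quartile using the rule in the notes.

    Position = which * (n + 1) / 4. If it is an integer, take that order
    statistic; otherwise average the order statistics at the integer part
    of the position and the next one.
    """
    if which not in (1, 3):
        raise ValueError("which must be 1 or 3")
    xs = sorted(data)
    n = len(xs)
    if n < 3:
        raise ValueError("this sketch assumes at least 3 observations")
    a, remainder = divmod(which * (n + 1), 4)
    if remainder == 0:
        return xs[a - 1]  # x_(a); Python lists are 0-based
    return 0.5 * (xs[a - 1] + xs[a])  # 0.5 * [x_(a) + x_(a+1)]


data = [6, 9, 12, 9, 8, 10]
q1 = quartile(data, 1)
q3 = quartile(data, 3)
print(f"Q1 = {q1}, Q3 = {q3}, IQR = {q3 - q1}")
```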


Example 1.3 - The coefficient of variation

$$\text{C.V.} = \frac{s}{\bar{x}} = \frac{2}{9}.$$

Suppose a variable is by definition positive, e.g. height, weight.

A coefficient of variation above 1 is considered to be very large (such variation may occur in the case of wages when wage inequality is high).

With regard to the physical traits of people, values for the coefficient of variation of around 0.1 to 0.3 are common (in humans the coefficient of variation of height is around 0.1, the coefficient of variation for weight is somewhat bigger).


1.5 Measures of Location and Dispersion for Grouped Data - a) Discrete Random Variables

A die was rolled 100 times and the following data were obtained:

| Result | No. of observations |
| --- | --- |
| 1 | 15 |
| 2 | 18 |
| 3 | 20 |
| 4 | 14 |
| 5 | 15 |
| 6 | 18 |


Grouped discrete data

Suppose the possible results are $\{x_1, x_2, \ldots, x_k\}$ and the result $x_i$ occurs $f_i$ times.

The total number of observations is

$$n = \sum_{i=1}^{k} f_i.$$

The sum of the observations is given by

$$\sum_{i=1}^{k} x_i f_i.$$


Grouped discrete data

It follows that the sample mean is given by

$$\bar{x} = \frac{\sum_{i=1}^{k} f_i x_i}{n}.$$

The variance of the observations is given by

$$s^2 = \frac{1}{n-1} \sum_{i=1}^{k} f_i (x_i - \bar{x})^2.$$


Grouped discrete data

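The following table is useful in calculating the sample mean:

| $x_i$ | $f_i$ | $f_i x_i$ |
| --- | --- | --- |
| 1 | 15 | 15 |
| 2 | 18 | 36 |
| 3 | 20 | 60 |
| 4 | 14 | 56 |
| 5 | 15 | 75 |
| 6 | 18 | 108 |
| $\sum$ | 100 | 350 |

Hence, the sample mean is $\bar{x} = \frac{350}{100} = 3.5$.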


Grouped discrete data

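Once the mean has been calculated, we can add two columns for $(x_i - \bar{x})^2$ and $f_i (x_i - \bar{x})^2$:

| $x_i$ | $f_i$ | $f_i x_i$ | $(x_i - \bar{x})^2$ | $f_i (x_i - \bar{x})^2$ |
| --- | --- | --- | --- | --- |
| 1 | 15 | 15 | $2.5^2$ | $15 \times 2.5^2 = 93.75$ |
| 2 | 18 | 36 | $1.5^2$ | $18 \times 1.5^2 = 40.5$ |
| 3 | 20 | 60 | $0.5^2$ | $20 \times 0.5^2 = 5$ |
| 4 | 14 | 56 | $0.5^2$ | $14 \times 0.5^2 = 3.5$ |
| 5 | 15 | 75 | $1.5^2$ | $15 \times 1.5^2 = 33.75$ |
| 6 | 18 | 108 | $2.5^2$ | $18 \times 2.5^2 = 112.5$ |
| $\sum$ | 100 | 350 | | 289 |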


Grouped discrete data

The sample variance is given by

$$s^2 = \frac{1}{n-1} \sum_{i=1}^{k} f_i (x_i - \bar{x})^2 = \frac{289}{99} = 2.92.$$
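The table-based calculation translates directly into code. The following is a minimal sketch for the die data, with illustrative variable names; it applies the grouped-data formulas for the mean and variance given above.

```python
values = [1, 2, 3, 4, 5, 6]        # possible results x_i
freqs = [15, 18, 20, 14, 15, 18]   # number of times each result occurred, f_i

n = sum(freqs)
mean = sum(f * x for x, f in zip(values, freqs)) / n

# Sum of f_i * (x_i - mean)^2, the last column of the table.
sum_sq = sum(f * (x - mean) ** 2 for x, f in zip(values, freqs))
variance = sum_sq / (n - 1)

print(f"n = {n}, mean = {mean}")
print(f"sum of squares = {sum_sq}, variance = {variance:.2f}")
```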


Calculation of the sample median for grouped discrete data

In this case we know the exact values of the observations and hence we can order the data. In this way we can calculate the median.

Since there are 100 observations, the median is

$$Q_2 = 0.5[x_{(50)} + x_{(51)}].$$


Calculation of the sample median for grouped discrete data

The 15 smallest observations are equal to 1, i.e.

$$x_{(1)} = x_{(2)} = \ldots = x_{(15)} = 1.$$

The next 18 smallest observations are equal to 2, i.e.

$$x_{(16)} = x_{(17)} = \ldots = x_{(33)} = 2.$$

The next 20 smallest observations are all equal to 3, i.e.

$$x_{(34)} = x_{(35)} = \ldots = x_{(53)} = 3.$$


Calculation of the sample median for grouped discrete data

It follows that $x_{(50)} = x_{(51)} = 3$.

Hence, $Q_2 = 0.5[x_{(50)} + x_{(51)}] = 3$.
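The same reasoning (walking through the cumulative counts until the required order statistic is reached) can be sketched in Python. The helper names `order_statistic` and `grouped_median` are hypothetical, and the frequency table is assumed to be sorted by value.

```python
from itertools import accumulate


def order_statistic(values, freqs, i):
    """Return the i-th smallest observation (1-based) from a frequency table."""
    for value, cumulative in zip(values, accumulate(freqs)):
        if i <= cumulative:
            return value
    raise IndexError("i exceeds the number of observations")


def grouped_median(values, freqs):
    """Median of grouped discrete data, using the odd/even rule for Q2."""
    n = sum(freqs)
    if n % 2 == 1:
        return order_statistic(values, freqs, (n + 1) // 2)
    lower = order_statistic(values, freqs, n // 2)
    upper = order_statistic(values, freqs, n // 2 + 1)
    return 0.5 * (lower + upper)


values = [1, 2, 3, 4, 5, 6]
freqs = [15, 18, 20, 14, 15, 18]
print(grouped_median(values, freqs))
```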


1.5 Measures of Location and Dispersion for Grouped Data - b) Continuous Random Variables

In such cases we have data grouped into intervals. Let $x_i$ be the centre of the i-th interval and $f_i$ the number of observations in the i-th interval.

The approach to calculating the sample mean and variance is the same as in the case of discrete data. In order to carry out the calculations, we assume that each observation is in the middle of the appropriate interval.


Example 1.4

Consider the grouped data from Example 1.2

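| Height (x) | $x_i$ | $f_i$ | $f_i x_i$ |
| --- | --- | --- | --- |
| 150 ≤ x ≤ 160 | 155 | 2 | 310 |
| 160 < x ≤ 170 | 165 | 5 | 825 |
| 170 < x ≤ 180 | 175 | 7 | 1225 |
| 180 < x ≤ 190 | 185 | 5 | 925 |
| 190 < x ≤ 200 | 195 | 1 | 195 |
| $\sum$ | - | 20 | 3480 |

Thus, the sample mean is $\bar{x} = \frac{3480}{20} = 174$.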


Example 1.4

Now we can add the remaining 2 columns of the table.

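| $x_i$ | $f_i$ | $f_i x_i$ | $(x_i - \bar{x})^2$ | $f_i (x_i - \bar{x})^2$ |
| --- | --- | --- | --- | --- |
| 155 | 2 | 310 | $19^2$ | $2 \times 19^2 = 722$ |
| 165 | 5 | 825 | $9^2$ | $5 \times 9^2 = 405$ |
| 175 | 7 | 1225 | 1 | 7 |
| 185 | 5 | 925 | $11^2$ | $5 \times 11^2 = 605$ |
| 195 | 1 | 195 | $21^2$ | $21^2 = 441$ |
| $\sum$ | 20 | 3480 | - | 2180 |

The variance is $\frac{2180}{19} = 114.74$.

The standard deviation is $\sqrt{114.74} = 10.71$.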
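For grouped continuous data the only extra step is replacing each interval by its midpoint. The sketch below does this for the height data; as in the text, it assumes that every observation lies at the centre of its interval, so the results are approximations to the statistics of the raw data. Variable names are illustrative.

```python
import math

intervals = [(150, 160), (160, 170), (170, 180), (180, 190), (190, 200)]
freqs = [2, 5, 7, 5, 1]

# Assumption from the notes: each observation sits at the centre of its interval.
midpoints = [(lower + upper) / 2 for lower, upper in intervals]

n = sum(freqs)
mean = sum(f * x for x, f in zip(midpoints, freqs)) / n
sum_sq = sum(f * (x - mean) ** 2 for x, f in zip(midpoints, freqs))
variance = sum_sq / (n - 1)
std_dev = math.sqrt(variance)

print(f"mean = {mean}, variance = {variance:.2f}, sd = {std_dev:.2f}")
```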


Estimating the median for grouped continuous data

Now we consider a more accurate method of estimating the median than the graphical method presented earlier using the ogive.

The first step is to find the interval in which the median lies.

This is the interval in which the cumulative relative frequency (c.r.f.) reaches 0.5.

Since c.r.f.(170) = 0.35 < 0.5 and c.r.f.(180) = 0.7 > 0.5, the median lies in the interval (170, 180).


Estimating the median for grouped continuous data

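The median can now be estimated using geometry. Consider the graph of the ogive on the interval (170, 180). We know the height of the ogive at 170 and 180 and know that at the median the height must be 0.5. We can construct the following pair of similar triangles.

[Figure: the section of the ogive between heights 170 and 180 (x-axis: height; y-axis: c.r.f.), rising from 0.35 to 0.7. The large triangle has height $h_2 = 0.35$; the small triangle, which ends at the median $Q_2 = 170 + y$ where the c.r.f. is 0.5, has height $h_1 = 0.15$.]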


Estimating the median for grouped continuous data

The height of the large triangle is $h_2 = 0.35$ (the proportion of observations in that interval).

The length of the large triangle is the length of the interval, d = 10.

The cumulative relative frequency (c.r.f.) at the median is 0.5. The height of the small triangle is 0.5 minus the c.r.f. at the lower endpoint. Thus, $h_1 = 0.15$. The median is equal to y + 170, where y is the length of the small triangle (170 is the lower endpoint of the interval).


Estimating the median for grouped continuous data

Since the triangles are similar, the ratio of the height to the length is constant, i.e.

$$\frac{h_2}{d} = \frac{h_1}{y}.$$

Hence,

$$y = \frac{h_1 d}{h_2} = \frac{0.15 \times 10}{0.35} \approx 4.3.$$

It follows that the median is approximately

$$Q_2 = 170 + 4.3 = 174.3.$$
