statistics 1 - notes

7/27/2019 Statistics 1 - Notes


Edexcel Notes S1


Liverpool F.C.

Statistics 1

Mathematical Model

A mathematical model is a simplification of a real world problem.

1. A real world problem is observed.
2. A mathematical model is thought up.
3. The model is used to make predictions: "What happens if...?"
4. Real world data is collected.
5. Predicted results are obtained.
6. These are compared with statistical tests.
7. Models are refined as required and then it's back to stage 3.

Advantages of using mathematical models are:

- They simplify a real world problem.
- They improve our understanding of a real world problem.
- They are quicker and cheaper.
- They can be used to predict future outcomes.

Disadvantages of using mathematical models are:

- They only give a partial description of the real problem.
- They only work for a restricted range of values.

Stem and Leaf

One of the simplest ways of ordering data and presenting data is to place it in a stem and leaf diagram.

For example, take the following data:

Person        1    2    3    4    5    6    7    8    9    10   11
Weight (lb)   166  164  143  189  191  178  165  159  189  191  167
Height (cm)   161  160  160  199  167  178  169  174  172  178  176

Unordered Stem and Leaf          Ordered Stem and Leaf
Height in cm                     Height in cm
3 | 4 represents 34 cm.          3 | 4 represents 34 cm.

15 |                             15 |
16 | 1 0 0 7 9                   16 | 0 0 1 7 9
17 | 8 4 2 8 6                   17 | 2 4 6 8 8
18 |                             18 |
19 | 9                           19 | 9
http://www.thestudentroom.co.uk/member.php?u=307086



As you can see from the key, the | divides tens from units. Stem and leaf diagrams can also be drawn back to back, if you have two sets of data to display.

Using the data above:

Weight in pounds                 Height in cm
4 | 3 represents 34 lb.          3 | 4 represents 34 cm.

        3 | 14 |
        9 | 15 |
  7 6 5 4 | 16 | 0 0 1 7 9
        8 | 17 | 2 4 6 8 8
      9 9 | 18 |
      1 1 | 19 | 9

Stem and leaf diagrams can give us an indication of the distribution. In this example, the weights are spread much more widely than the heights. If we were comparing something like scores on two exams, we could compare the medians.

Frequency Tables

Amount (x)      Frequency (f)
0 ≤ x < 20      5
20 ≤ x < 40     9
40 ≤ x < 60     20
60 ≤ x < 80     25
80 ≤ x < 100    9

Cumulative Frequency

One way we can interpret the data is by working out the cumulative frequency. This simply means adding up the frequencies as you go along. Cumulative frequency is plotted against the upper class boundary. From the above example, we get:

Amount (x)      Frequency (f)   Upper class boundary   Cumulative frequency
0 ≤ x < 20      5               20                     5
20 ≤ x < 40     9               40                     14 (5+9)
40 ≤ x < 60     20              60                     34 (5+9+20)
60 ≤ x < 80     25              80                     59 (5+9+20+25)
80 ≤ x < 100    9               100                    68 (5+9+20+25+9)
Total           68

To check you're right for the cumulative frequency, you can add the frequency column. Or the question

will probably say something like, "a survey of 68 people..." and that's an even easier check.
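The running total described above can be sketched in a few lines of Python. This is only an illustration using the table's own figures; the variable names are made up for the example.

```python
from itertools import accumulate

# Class intervals (lower, upper) and frequencies from the table above.
classes = [(0, 20), (20, 40), (40, 60), (60, 80), (80, 100)]
frequencies = [5, 9, 20, 25, 9]

# Running total of the frequencies.
cumulative = list(accumulate(frequencies))

# Points to plot: (upper class boundary, cumulative frequency).
points = [(upper, cf) for (_, upper), cf in zip(classes, cumulative)]

# Check: the last cumulative frequency equals the total frequency.
assert cumulative[-1] == sum(frequencies)
print(points)
```

The final assertion is the same check the notes suggest: the last cumulative frequency should match the total of the frequency column.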

When we have our cumulative frequency column, we can draw a cumulative frequency curve.



Using this, we can also create a box plot. This is deduced by looking at the quartiles up the y-axis and

finding the corresponding x-values:

Box plots are useful because they tell you lots of information: the median, the spread of the data (including the IQR), whether there are any outliers, and whether the data is symmetrical, positively skewed or negatively skewed.

The IQR is a measure of spread.

IQR = Q3 - Q1

Outliers are extreme values. They are usually represented as a cross:



They can be either too low or too high and are usually worked out by the equations:

Q1 - 1.5 x (IQR) (Anything less than this figure will be an outlier).

Q3 + 1.5 x (IQR) (Anything greater than this figure will be an outlier).

The exam question will always state how to work out the outliers though, so this is one thing you don't

have to worry about remembering (just as long as you know how to use the formula).

When you've identified the outliers, where does the end of the box plot occur? You can either use the next highest/lowest data value that is not an outlier, or use the boundary value worked out from the formula.

Linear Interpolation

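To work out the median, find the $\frac{n+1}{2}$th value.

For Q1 work out the $\frac{n+1}{4}$th value and for Q3 find the $\frac{3(n+1)}{4}$th value.

Percentiles (such as P12) mean a percentage of the way through the cumulative frequency (CF). To work out P12, for example, work out the $\frac{12(n+1)}{100}$th value.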

For a grouped frequency, it can be difficult to calculate the median and quartiles. There is a way of

estimating an answer, however, and this is called linear interpolation.

Time (sec)        Frequency   Cumulative Frequency   Class width
0 ≤ x < 10        0           0                      10
10 ≤ x < 15       8           8                      5
15 ≤ x < 17.5     3           11                     2.5
17.5 ≤ x < 20     7           18                     2.5
20 ≤ x < 24       12          30                     4

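The first step is to find the $\frac{n+1}{2}$th value. In this example n = 30, so it is the 15.5th value, which lies in the class 17.5 ≤ x < 20.

We take away 11 (the cumulative frequency up to the start of that class) and then divide by 7 (the frequency of the row the cumulative 15.5 is found in).

Next we times by 2.5 (the class width of the row 15.5 is found in).

Finally add on 17.5 (the lower class boundary of the row 15.5 is found in) and the answer appears, 19.1.

$\text{median} = 17.5 + \frac{15.5 - 11}{7} \times 2.5 = 19.1 \text{ (3 s.f.)}$

The only difference for the percentiles and other quartiles is replacing the position 15.5 by the position of whatever you want to find.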
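The interpolation steps above can be written as one small function. This is a minimal sketch, not a standard library routine; the function name and parameter names are invented for the illustration.

```python
def interpolate(position, lower_boundary, cf_before, frequency, class_width):
    """Estimate the value at a cumulative position inside one class."""
    return lower_boundary + (position - cf_before) / frequency * class_width

# Median of the table above: position 15.5 lies in the class 17.5 <= x < 20,
# which has frequency 7, class width 2.5 and cumulative frequency 11 before it.
median = interpolate(15.5, 17.5, 11, 7, 2.5)
print(round(median, 1))  # 19.1
```

For a quartile or percentile, only the `position` argument and the row it falls in change.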

Mean from frequency table

It's easy enough to work out the mean from ungrouped data, just the simple formula:

$\bar{x} = \frac{\sum x}{n}$

(In other words, add them all up and divide by the number that there is.)

Time (sec)   Frequency (f)
0 - 9        0
10 - 14      8
15 - 17      3
18 - 20      7
21 - 24      12

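For a grouped frequency table, you'll need to work out the mid-point of each class to use as the x variable.

$\text{Midpoint} = \frac{\text{lower class boundary} + \text{upper class boundary}}{2}$

The formula is:

$\bar{x} = \frac{\sum fx}{\sum f}$

Therefore, once you have the midpoint, you need to multiply f and x: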

Time (sec)   Frequency (f)   Midpoint (x)   fx
0 - 9        0               4.75           0
10 - 14      8               12             96
15 - 17      3               16             48
18 - 20      7               19             133
21 - 24      12              23             276

Add the fx column and then divide by the total of the Frequency column to find the mean:

$\bar{x} = \frac{\sum fx}{\sum f} = \frac{553}{30} = 18.4 \text{ (3 s.f.)}$

Standard Deviation

For an ordinary set of data, the standard deviation is found by the following:

$\sigma = \sqrt{\frac{\sum x^2}{n} - \bar{x}^2}$

(Variance is the same formula, but without the square root.)

For a frequency table, or grouped frequency table, though, again we have a slightly different formula:

$\sigma = \sqrt{\frac{\sum fx^2}{\sum f} - \bar{x}^2}$

Taking the above as an example, we need to add an fx² column. Be careful with this. Notice only the x is squared: the column is f × x², not (fx)².



Time (sec)   Frequency (f)   Midpoint (x)   fx     fx²
0 - 9        0               4.75           0      0
10 - 14      8               12             96     1152
15 - 17      3               16             48     768
18 - 20      7               19             133    2527
21 - 24      12              23             276    6348

Now add up the fx² and f columns, and write in the mean squared:

$\sigma = \sqrt{\frac{10795}{30} - \left(\frac{553}{30}\right)^2}$

Stick all that in your calculator and you'll get the answer: 4.48 (3 s.f.)
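As a check on the arithmetic, the mean and standard deviation can be reproduced from the table's midpoint and frequency columns. This sketch takes the midpoints exactly as listed in the table and uses made-up variable names.

```python
from math import sqrt

# Midpoints (x) and frequencies (f) exactly as listed in the table above.
midpoints = [4.75, 12, 16, 19, 23]
frequencies = [0, 8, 3, 7, 12]

sum_f = sum(frequencies)
sum_fx = sum(f * x for f, x in zip(frequencies, midpoints))
sum_fx2 = sum(f * x ** 2 for f, x in zip(frequencies, midpoints))

mean = sum_fx / sum_f
variance = sum_fx2 / sum_f - mean ** 2   # mean of the squares minus square of the mean
std_dev = sqrt(variance)

print(sum_f, sum_fx, sum_fx2)
print(round(mean, 1), round(std_dev, 2))
```

The three sums correspond to the column totals you would write at the bottom of the table.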

Coding

When the numbers are too large to be reasonably worked with, there is an option for finding the mean. We can use coding. This replaces x (the midpoint) with y (connected to x by a formula, which makes it a smaller number).

Use the code $y = \frac{x - 45.5}{10}$ to calculate the mean and standard deviation of the following frequency table:

x      Frequency (f)
15.5   8
25.5   12
35.5   15
45.5   16
55.5   11
65.5   6
75.5   2

We need to add the code column, work out y, and then add columns for fy and fy² rather than fx and fx²:

x       Frequency (f)   y     fy     fy²
15.5    8               -3    -24    72
25.5    12              -2    -24    48
35.5    15              -1    -15    15
45.5    16              0     0      0
55.5    11              1     11     11
65.5    6               2     12     24
75.5    2               3     6      18
Total   70                    -34    188

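Next, work out the mean of y, using the formula:

$\bar{y} = \frac{\sum fy}{\sum f} = \frac{-34}{70} = -0.486 \text{ (3 s.f.)}$

We think back to the original code:

$y = \frac{x - 45.5}{10}$

If we replace y with $\bar{y}$ here, we can replace x with $\bar{x}$:

$\bar{y} = \frac{\bar{x} - 45.5}{10}$

Add the numbers, and rearrange to make $\bar{x}$ the subject of the formula.

$\bar{x} = 10 \times (-0.486) + 45.5 = 40.6 \text{ (3 s.f.)}$ and that's your answer!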

For standard deviation it's almost the same. Now, if we think of the dispersion, adding and subtracting won't affect the standard deviation. Dividing and multiplying will, however. So to decode the standard deviation we only undo the division by 10:

$\sigma_x = 10 \sigma_y$
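The coding example can be checked numerically. The sketch below assumes the code is y = (x - 45.5) / 10, which is what the y column of the table implies; the variable names are invented for the illustration.

```python
from math import sqrt

x_values = [15.5, 25.5, 35.5, 45.5, 55.5, 65.5, 75.5]
frequencies = [8, 12, 15, 16, 11, 6, 2]

# Assumed code, inferred from the y column: y = (x - 45.5) / 10.
y_values = [(x - 45.5) / 10 for x in x_values]

n = sum(frequencies)
mean_y = sum(f * y for f, y in zip(frequencies, y_values)) / n
var_y = sum(f * y ** 2 for f, y in zip(frequencies, y_values)) / n - mean_y ** 2
sd_y = sqrt(var_y)

# Decode: x = 10y + 45.5, so the mean is scaled and shifted,
# while the standard deviation is only scaled.
mean_x = 10 * mean_y + 45.5
sd_x = 10 * sd_y
print(round(mean_y, 3), round(mean_x, 1), round(sd_x, 1))
```

Subtracting 45.5 moves every value by the same amount, so it leaves the spread unchanged; only the factor of 10 has to be undone for the standard deviation.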

Histograms

Histograms are used for representing data that is continuous and is summarised in a grouped frequency distribution.

There are no gaps between the bars. The area of each bar is proportional to the frequency.

Example:

The heights of twenty children (to the nearest cm) were recorded in the following frequency table. Draw a histogram to represent the data.

Height     Frequency (f)
120-124    1
125-129    5
130-134    7
135-139    4
140-149    3

There are two columns that we need to add: the class width and the frequency density.



Class width is the width of each group. Be careful to work it out from the lower class boundary and the upper class boundary. For example, 120-124 is actually 119.5 to 124.5, and so the class width is 5.

$\text{Frequency density} = \frac{\text{frequency}}{\text{class width}}$

Height     Frequency (f)   Class Width   Frequency Density
120-124    1               5             0.2
125-129    5               5             1
130-134    7               5             1.4
135-139    4               5             0.8
140-149    3               10            0.3

When we have these values, we plot the lower class and upper class boundaries on the x axis and the

frequency density on the y axis.
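The two extra columns can be generated from the recorded class limits. This is a sketch under the assumption stated above, that heights recorded to the nearest cm have class boundaries half a unit either side of the limits.

```python
# Recorded class limits (to the nearest cm) and their frequencies.
classes = [(120, 124), (125, 129), (130, 134), (135, 139), (140, 149)]
frequencies = [1, 5, 7, 4, 3]

rows = []
for (low, high), f in zip(classes, frequencies):
    # Class boundaries sit half a unit outside the recorded limits.
    lower_boundary, upper_boundary = low - 0.5, high + 0.5
    width = upper_boundary - lower_boundary
    rows.append((lower_boundary, upper_boundary, width, f / width))

for row in rows:
    print(row)  # (lower boundary, upper boundary, class width, frequency density)
```

Each bar is drawn from its lower boundary to its upper boundary, with the frequency density as its height, so its area equals the frequency.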

Skewness

From the histogram above, we see a slight positive skew: most of the values are bunched towards the lower end, with a longer tail towards the higher values. A distribution can be positively skewed, negatively skewed or symmetrical, and there are three tests to differentiate between them:



Positive Skew           Symmetrical             Negative Skew
Mean > Median > Mode    Mean = Median = Mode    Mean < Median < Mode
Q2 - Q1 < Q3 - Q2       Q2 - Q1 = Q3 - Q2       Q2 - Q1 > Q3 - Q2

Correlation

Correlation is a measure of the relationship between two variables. When we have two sets of data, we can draw a scatter diagram to see if there is any correlation between them.

Data: The marks of 10 candidates in Maths and Physics are shown below:

Candidate 1 2 3 4 5 6 7 8 9 10

Physics (x) 18 20 30 40 46 54 60 80 88 92

Maths (y) 42 54 60 54 62 68 80 66 80 100

From the data, we can plot the x values corresponding to the y values. The only difference is that we

don't join the crosses with a line:

We can already see that it's positively correlated. A way to test this is to divide the graph into four

quadrants, and then look at where the majority of the points lie:



If most points lie in the 1st and 3rd quadrants, we have a positive correlation.

If most points lie in the 2nd and 4th quadrants, we have a negative correlation.

If points lie in all four quadrants randomly, we have no correlation.

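However, just looking at the scatter diagram is a bit inaccurate. It's much better to calculate the strength of the correlation. There's a formula for this called the PMCC (Product Moment Correlation Coefficient).

$r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}$

How to calculate Sxy, Sxx and Syy:

$S_{xy} = \sum xy - \frac{\sum x \sum y}{n}$
$S_{xx} = \sum x^2 - \frac{(\sum x)^2}{n}$
$S_{yy} = \sum y^2 - \frac{(\sum y)^2}{n}$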



From the above information, we complete the following table:

x          y          x²            y²            xy
18         42         324           1764          756
20         54         400           2916          1080
30         60         900           3600          1800
40         54         1600          2916          2160
46         62         2116          3844          2852
54         68         2916          4624          3672
60         80         3600          6400          4800
80         66         6400          4356          5280
88         80         7744          6400          7040
92         100        8464          10000         9200
Σx = 528   Σy = 666   Σx² = 34464   Σy² = 46820   Σxy = 38640

If you're lucky the question will already give you these figures, and all you'll be asked to do is use them.

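Now using the PMCC formula:

$S_{xy} = 38640 - \frac{528 \times 666}{10} = 3475.2$
$S_{xx} = 34464 - \frac{528^2}{10} = 6585.6$
$S_{yy} = 46820 - \frac{666^2}{10} = 2464.4$

$r = \frac{3475.2}{\sqrt{6585.6 \times 2464.4}} = 0.863 \text{ (3 s.f.)}$

PMCC works so that -1 ≤ r ≤ 1, with -1 being perfect negative correlation, 0 being no correlation and +1 being perfect positive correlation. 0.863 is strong positive.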

Even if we code the data, the PMCC remains the same.
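The PMCC for the marks data can be reproduced directly from the raw values. This is a sketch using the standard Sxy, Sxx and Syy summary formulas with made-up variable names.

```python
from math import sqrt

physics = [18, 20, 30, 40, 46, 54, 60, 80, 88, 92]   # x
maths = [42, 54, 60, 54, 62, 68, 80, 66, 80, 100]    # y
n = len(physics)

sum_x, sum_y = sum(physics), sum(maths)
s_xy = sum(x * y for x, y in zip(physics, maths)) - sum_x * sum_y / n
s_xx = sum(x * x for x in physics) - sum_x ** 2 / n
s_yy = sum(y * y for y in maths) - sum_y ** 2 / n

r = s_xy / sqrt(s_xx * s_yy)
print(round(s_xy, 1), round(s_xx, 1), round(s_yy, 1), round(r, 3))
```

Replacing every x by a coded value such as (x - 50) / 2 changes Sxy and Sxx, but the changes cancel in the ratio, which is why r stays the same under coding.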

Least squares regression line

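The regression line of y on x has the form y = a + bx, where:

$b = \frac{S_{xy}}{S_{xx}}$ and $a = \bar{y} - b\bar{x}$

We can work out b easily enough from the data above:

$b = \frac{38640 - \frac{528 \times 666}{10}}{34464 - \frac{528^2}{10}} = \frac{3475.2}{6585.6} = 0.528 \text{ (3 s.f.)}$

$\bar{y} = \frac{666}{10} = 66.6$

$\bar{x} = \frac{528}{10} = 52.8$

$a = 66.6 - 0.528 \times 52.8 = 38.7 \text{ (3 s.f.)}$

So the regression line is y = 38.7 + 0.528x.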

If the question asks you to draw the regression line, an easy way is to plot the mean point $(\bar{x}, \bar{y})$ on the scatter diagram, and then draw the line from the y-axis intercept (0, a) through this point. The mean point always lies on the line.

If the data is coded, we need to uncode when finding the mean.

An independent (explanatory) variable is one that is set independently of the other variable. (Plotted on the x-axis.)

A dependent (response) variable is one whose values are determined by the values of the independent variable. (Plotted on the y-axis.)

Interpolation is when you estimate the value of a dependent variable within the range of the data.

Extrapolation is when you estimate a value outside the range of the data. Values estimated by

extrapolation can be unreliable.
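The regression calculation and the idea of interpolation can be sketched together. The summary totals are those of the Physics (x) and Maths (y) marks; the `predict` helper is a hypothetical name for the illustration.

```python
# Summary statistics for the Physics (x) and Maths (y) marks.
n = 10
sum_x, sum_y = 528, 666
sum_x2, sum_xy = 34464, 38640

s_xy = sum_xy - sum_x * sum_y / n
s_xx = sum_x2 - sum_x ** 2 / n

b = s_xy / s_xx                    # gradient
a = sum_y / n - b * sum_x / n      # intercept, from a = y_bar - b * x_bar

def predict(x):
    """Estimate y from x using the regression line y = a + bx."""
    return a + b * x

print(round(a, 1), round(b, 3))
print(round(predict(50), 1))    # interpolation: 50 is inside the data range 18 to 92
print(round(predict(52.8), 1))  # the mean point (52.8, 66.6) lies on the line
```

Calling `predict` with an x value outside 18 to 92 would be extrapolation, so the result should be treated with caution.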

Probability

If A is an event, the probability of it occurring is the number of ways A can occur, divided by the size of the sample space (total number of outcomes, S).

$p(A) = \frac{n(A)}{n(S)}$

Probability is always 0 ≤ p ≤ 1.

If you have a probability, p(A), the probability of not getting A is written as p(A'). To find p(A'), we merely take p(A) away from 1: p(A') = 1 - p(A).

A ∩ B means A "intersection" B: all elements that are in both A and B. We can see this on a Venn diagram:



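A ∪ B means A "union" B: all elements that are in A or in B (or in both). On a Venn diagram this is:

Addition Rule

This is the addition rule for finding p(A ∪ B):

$p(A \cup B) = p(A) + p(B) - p(A \cap B)$

We can rearrange this to get:

$p(A \cap B) = p(A) + p(B) - p(A \cup B)$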

Example:

There are 15 books on a bookshelf. 10 of these are fiction, 4 of which are hard-back. 6, in total, are hard-back and the remaining 9 are paperback.

Find the probability that a hard-back fiction book is chosen at random.

First stage is to draw a Venn diagram and write in all the numbers:



We're looking for p(H ∩ F), so where is it both H and F? Where the two circles overlap, so 4/15.

Find the probability that a hardback is chosen but is not fiction.

We want p(H ∩ F'), which is 2/15.
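The bookshelf probabilities, and the addition rule, can be checked with exact fractions. This is an illustrative sketch; the variable names are not from the notes.

```python
from fractions import Fraction

total = 15
fiction = 10
hardback = 6
hardback_and_fiction = 4

p_h_and_f = Fraction(hardback_and_fiction, total)
p_h_not_f = Fraction(hardback - hardback_and_fiction, total)

# Addition rule: p(H or F) = p(H) + p(F) - p(H and F).
p_h_or_f = Fraction(hardback, total) + Fraction(fiction, total) - p_h_and_f
print(p_h_and_f, p_h_not_f, p_h_or_f)  # 4/15 2/15 4/5
```

The union comes to 12/15, which matches counting the regions of the Venn diagram: 2 hard-back only, 4 in the overlap and 6 fiction only.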

Conditional Probability

This occurs when the probability of A is conditional upon B having already occurred. Given B, find the

probability of A. It's written out as p(A|B).

We use tree diagrams to solve conditional probability.

Example:

A bag contains 6 red and 4 blue balls. 2 balls are picked at random and retained.

Find the probability that both balls are red.

First, draw out a tree diagram.

We want p(R ∩ R), so we just follow the tree diagram along:

6/10 x 5/9 = 30/90 = 1/3.

Find the probability that the balls are different colours.

We want p(R ∩ B) and p(B ∩ R). Multiply along both branches and then add these together:

p(R ∩ B) = 6/10 x 4/9 = 24/90

p(B ∩ R) = 4/10 x 6/9 = 24/90

24/90 + 24/90 = 48/90 = 8/15.

Find the probability that the second ball is red, given the first is blue.

We want p(R|B), so we use the formula:

$p(R|B) = \frac{p(R \cap B)}{p(B)}$

= 24/90 ÷ 4/10

= 2/3.
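The tree diagram calculations can be reproduced with exact fractions. This sketch assumes, as in the example, that the balls are not replaced; the variable names are invented.

```python
from fractions import Fraction

red, blue = 6, 4
total = red + blue

# Multiply along each branch; the second pick is from one fewer ball.
p_rr = Fraction(red, total) * Fraction(red - 1, total - 1)
p_rb = Fraction(red, total) * Fraction(blue, total - 1)
p_br = Fraction(blue, total) * Fraction(red, total - 1)

# Different colours: add the two branches that give one of each.
p_different = p_rb + p_br

# Conditional probability: p(second red | first blue) = p(blue then red) / p(first blue).
p_red_given_blue = p_br / Fraction(blue, total)
print(p_rr, p_different, p_red_given_blue)  # 1/3 8/15 2/3
```

The conditional result 2/3 is simply 6/9, the probability written on the second-stage branch after a blue ball has been removed.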

Independent Events

Independent events are the opposite of conditional: the outcome of one event doesn't affect the probability of the next. For example, if balls are taken from a bag and replaced, the probability of a red ball is the same no matter how many times you pick from the bag.

This means:

$p(A|B) = p(A)$ and so $p(A \cap B) = p(A) \times p(B)$

If two events are mutually exclusive, they cannot occur at the same time and p(A ∩ B) is 0.

This means that:

$p(A \cup B) = p(A) + p(B)$

Sample Space Diagram

Example: A die is thrown twice and the scores obtained are added together. Find the probability that the total score is 6.

                      Second Throw
                 1    2    3    4    5    6
First Throw  1   2    3    4    5    6    7
             2   3    4    5    6    7    8
             3   4    5    6    7    8    9
             4   5    6    7    8    9    10
             5   6    7    8    9    10   11
             6   7    8    9    10   11   12

There are 36 equally likely outcomes.

5 of the outcomes result in a total of 6.

p(total = 6) = 5/36
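The sample space for two throws of a die can be enumerated to confirm the count. This is a short illustrative sketch with made-up variable names.

```python
from fractions import Fraction
from itertools import product

# All ordered pairs (first throw, second throw).
outcomes = list(product(range(1, 7), repeat=2))

# Outcomes whose total score is 6.
favourable = [pair for pair in outcomes if sum(pair) == 6]

probability = Fraction(len(favourable), len(outcomes))
print(len(outcomes), favourable, probability)
```

Changing the target total in the filter gives the probability of any other score from the same sample space.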

Discrete Random Variables

A discrete random variable is a variable that takes separate numerical values, each with its own probability, such as the "number on a fair die".

The probability for a discrete random variable is written as P(X = x).

Example: A tetrahedral die has the numbers 1, 2, 3, 4 on its faces. The die is biased in such a way that:

P(X = x) = k for x = 1, 2, 3

P(X = x) = 3k for x = 4

If we draw this out in a probability distribution table we get:

x           1    2    3    4
P(X = x)    k    k    k    3k

All the probabilities added together = 1.

k(1 + 1 + 1 + 3) = 1

6k = 1

k = 1/6

Therefore, we can write out the probability distribution:

x           1      2      3      4
P(X = x)    1/6    1/6    1/6    1/2

We can also find the cumulative distribution, F(x):

x    P(X = x)    F(x)
1    1/6         1/6
2    1/6         1/3
3    1/6         1/2
4    1/2         1

The cumulative probability always reaches 1 at the largest value.

P(X ≤ 2) means the probability of getting an X value less than or equal to 2. We add up the probabilities we have, and so, in the above example, P(X ≤ 2) = 1/6 + 1/6 = 1/3.

F(x) means P(X ≤ x), so F(2) = 1/3.

If a question asks you something like F(3.5), in our example 3.5 doesn't exist. Therefore, we do F(3) instead, which would be 1/2.
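The biased tetrahedral die can be worked through with exact fractions. In this sketch the unknown constant is called `k` and the probabilities are taken to be k, k, k and 3k, which is an assumption read off from the worked sum (1 + 1 + 1 + 3) above; the function name `F` mirrors the notation F(x).

```python
from fractions import Fraction
from itertools import accumulate

# Probabilities in units of the unknown constant k (assumed: k, k, k, 3k).
weights = {1: 1, 2: 1, 3: 1, 4: 3}

# The probabilities must sum to 1, which fixes k.
k = Fraction(1, sum(weights.values()))
pmf = {x: w * k for x, w in weights.items()}

# Cumulative distribution at each listed value of x.
cdf = dict(zip(pmf, accumulate(pmf.values())))

def F(value):
    """P(X <= value): add the probabilities of every x up to the value."""
    return sum(p for x, p in pmf.items() if x <= value)

print(k, cdf[2], F(3.5))
```

Because `F` adds every probability at or below its argument, F(3.5) automatically gives the same answer as F(3).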

Mean and Variance

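Finding the mean and variance is almost identical to finding the mean and variance from a frequency table.

The formula for mean:

$E(X) = \sum x P(X = x)$

For Variance, we have the formula:

$\text{Var}(X) = E(X^2) - [E(X)]^2$

To find $E(X^2)$:

$E(X^2) = \sum x^2 P(X = x)$

Example:

X is a discrete random variable with the following distribution.

x        P(X = x)    xP(X = x)    x²P(X = x)
0        0.4         0            0
1        0.5         0.5          0.5
2        0.1         0.2          0.4
Total                0.7          0.9

Therefore,

$E(X) = 0.7$ and $\text{Var}(X) = 0.9 - 0.7^2 = 0.41$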

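Suppose Y is the random variable given by coding Y = 3X - 2 for the above table. The table would now look like this:

y        P(Y = y)    yP(Y = y)    y²P(Y = y)
-2       0.4         -0.8         1.6
1        0.5         0.5          0.5
4        0.1         0.4          1.6
Total                0.1          3.7

$E(Y) = 0.1$ and $\text{Var}(Y) = 3.7 - 0.1^2 = 3.69$

Remember the code:

$Y = 3X - 2$

To decode back:

$E(X) = \frac{E(Y) + 2}{3} = 0.7$ and $\text{Var}(X) = \frac{\text{Var}(Y)}{3^2} = 0.41$

In general:

$E(aX + b) = aE(X) + b$
$\text{Var}(aX + b) = a^2 \text{Var}(X)$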
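The effect of coding on the mean and variance can be checked with exact fractions. This sketch assumes the code is Y = 3X - 2, which is what the coded values -2, 1 and 4 imply; the helper names `expectation` and `variance` are invented for the illustration.

```python
from fractions import Fraction

# Distribution of X, written as exact fractions to avoid rounding error.
pmf_x = {0: Fraction(4, 10), 1: Fraction(5, 10), 2: Fraction(1, 10)}

def expectation(dist):
    """E(X): sum of value times probability."""
    return sum(x * p for x, p in dist.items())

def variance(dist):
    """Var(X) = E(X^2) - [E(X)]^2."""
    mean = expectation(dist)
    return sum(x ** 2 * p for x, p in dist.items()) - mean ** 2

# Assumed code Y = 3X - 2: same probabilities, transformed values.
pmf_y = {3 * x - 2: p for x, p in pmf_x.items()}

print(expectation(pmf_x), variance(pmf_x))
print(expectation(pmf_y), variance(pmf_y))
```

The second line of output is 3 times the mean minus 2, and 9 times the variance, in line with the general rules for E(aX + b) and Var(aX + b).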

A discrete uniform distribution is where each value of the random variable has the same probability. For example, when X is the score on a fair 6-sided die, each probability would be 1/6.

For a discrete uniform distribution over the values 1, 2, 3, ..., n:

$P(X = x) = \frac{1}{n}$, $E(X) = \frac{n + 1}{2}$, $\text{Var}(X) = \frac{(n + 1)(n - 1)}{12}$

Example: A tetrahedral die has its faces numbered 1, 2, 3 and 4. X is the score obtained when the die is rolled.

X therefore has a discrete uniform distribution with n = 4.

$E(X) = \frac{4 + 1}{2} = 2.5$

The Normal Distribution

- Symmetrical about the mean.
- Total area under the curve = 1.
- Probabilities correspond to the area.
- A continuous distribution (therefore there is no difference between P(X < a) and P(X ≤ a)).
- 68% of the distribution lies within 1 standard deviation of the mean.
- 95% of the distribution lies within 2 standard deviations of the mean.
- 99.7% of the distribution lies within 3 standard deviations of the mean.

Examples:

- The masses of new born babies.
- IQ of school students.
- Hand span of adult females.
- Height of plants growing in a field.



Working out Probabilities using tables.


- If P(Z < a) is greater than 0.5 then a will be > 0.
- If P(Z < a) is less than 0.5 then a will be < 0.
- If P(Z > a) is less than 0.5 then a will be > 0.
- If P(Z > a) is more than 0.5 then a will be < 0.
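In the exam these probabilities come from printed tables, but the same values can be sketched in Python. The function `phi` below is a hypothetical helper that computes P(Z < z) from the error function in the standard `math` module; it is an illustration rather than the method used in the notes.

```python
from math import erf, sqrt

def phi(z):
    """P(Z < z) for the standard normal distribution Z ~ N(0, 1)."""
    return 0.5 * (1 + erf(z / sqrt(2)))

# Proportion of the distribution within 1, 2 and 3 standard deviations.
for k in (1, 2, 3):
    print(k, round(phi(k) - phi(-k), 4))

# Symmetry: P(Z > a) = 1 - P(Z < a) = P(Z < -a).
a = 1.96
print(round(phi(a), 4), round(1 - phi(a), 4), round(phi(-a), 4))
```

Since `phi(0)` is exactly 0.5, any probability P(Z < a) above 0.5 must come from a positive a, which is the first of the sign rules above.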



Standardizing

If $X \sim N(\mu, \sigma^2)$ and $Z = \frac{X - \mu}{\sigma}$ then $Z \sim N(0, 1)$.

Example: If find

The first step is to standardize:

Working Backwards

Example: If ,find the value of if .

To find x, we start by finding the standardised value such that .

From tables we see that .

We therefore need to find the value that standardises to make by rearranging the formula.

Examination style question: A machine is designed to fill jars of coffee so that the contents, X, follow a normal distribution with mean μ grams and standard deviation σ grams.

If and , find μ and σ correct to 3 significant figures.


Firstly: μ + 1.96σ

Secondly, we are told that:

μ - 1.75σ

The two equations are:

μ + 1.96σ

μ - 1.75σ

Subtract to eliminate μ:

This gives

So the solutions to 3 s.f. are and g.
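The elimination step can be sketched as a small function. The numerical right-hand sides of the two equations are missing from these notes, so the values 210 and 190 below are purely hypothetical placeholders, and `solve_mean_sd` is a made-up helper name; only the z-values 1.96 and -1.75 come from the working above.

```python
def solve_mean_sd(x1, z1, x2, z2):
    """Solve mu + z1*sigma = x1 and mu + z2*sigma = x2 for (mu, sigma)."""
    sigma = (x1 - x2) / (z1 - z2)   # subtracting the equations eliminates mu
    mu = x1 - z1 * sigma            # substitute sigma back into the first equation
    return mu, sigma

# Hypothetical right-hand sides (NOT from the notes), with the z-values above.
mu, sigma = solve_mean_sd(210.0, 1.96, 190.0, -1.75)
print(round(mu, 1), round(sigma, 2))
```

With the real figures from the question in place of the placeholders, the same two lines of algebra give μ and σ.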
