statistics 1 - notes
Post on 14-Apr-2018
7/27/2019 Statistics 1 - Notes
Edexcel Notes S1
Liverpool F.C.
Statistics 1
Mathematical Model
A mathematical model is a simplification of a real world problem.
1. A real world problem is observed.
2. A mathematical model is thought up.
3. The model is used to make predictions, "What happens if...?"
4. Real world data is collected.
5. Predicted results are obtained.
6. These are compared with statistical tests.
7. Models are refined as required and then it's back to stage 3...
Advantages of using mathematical models are:
- They simplify a real world problem.
- They improve our understanding of a real world problem.
- They are quicker and cheaper.
- They can be used to predict future outcomes.
Disadvantages of using mathematical models are:
- They only give a partial description of the real problem.
- They only work for a restricted range of values.
Stem and Leaf
One of the simplest ways of ordering data and presenting data is to place it in a stem and leaf diagram.
For example, take the following data:
Person 1 2 3 4 5 6 7 8 9 10 11
Weight (lb) 166 164 143 189 191 178 165 159 189 191 176
Height (cm) 161 160 160 199 167 178 169 174 172 178 167
Unordered Stem and Leaf          Ordered Stem and Leaf
Height in cm                     Height in cm
3 | 4 represents 34 cm.          3 | 4 represents 34 cm.

15 |                             15 |
16 | 1 0 0 7 9                   16 | 0 0 1 7 9
17 | 8 4 2 8 6                   17 | 2 4 6 8 8
18 |                             18 |
19 | 9                           19 | 9
As you can see from the key, the | divides tens from units. Stem and leaf diagrams can also be drawn back to back, if
you have two sets of data to display.
Using the data above:
Weight in pounds: 4 | 3 represents 34 lb.
Height in cm: 3 | 4 represents 34 cm.

Weight (lb) |    | Height (cm)
          3 | 14 |
          9 | 15 |
    7 6 5 4 | 16 | 0 0 1 7 9
          8 | 17 | 2 4 6 8 8
        9 9 | 18 |
        1 1 | 19 | 9
Stem and leaf diagrams can give us an indication of the distribution. There is a much wider spread for weight, in
this example, than for height. If we were comparing something like scores on two exams, we could compare
the medians.
Frequency Tables
Amount (x)       Frequency (f)
0 ≤ x < 20       5
20 ≤ x < 40      9
40 ≤ x < 60      20
60 ≤ x < 80      25
80 ≤ x < 100     9
Cumulative Frequency
One way we can interpret the data is by working out the cumulative frequency. This simply means add
the frequency as you go along. Cumulative frequency is plotted against the upper class boundary. From
the above example, we get:
Amount (x)       Frequency (f)   Upper class boundary   Cumulative frequency
0 ≤ x < 20       5               20                     5
20 ≤ x < 40      9               40                     14 (5+9)
40 ≤ x < 60      20              60                     34 (5+9+20)
60 ≤ x < 80      25              80                     59 (5+9+20+25)
80 ≤ x < 100     9               100                    68 (5+9+20+25+9)
Total            68
To check you're right for the cumulative frequency, you can add the frequency column. Or the question
will probably say something like, "a survey of 68 people..." and that's an even easier check.
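The running total described above can be sketched in a few lines of Python. This is only an illustration using the table from these notes; the variable names are made up for the example.

```python
from itertools import accumulate

# Class intervals (lower, upper) and frequencies from the table above.
classes = [(0, 20), (20, 40), (40, 60), (60, 80), (80, 100)]
frequencies = [5, 9, 20, 25, 9]

# Adding the frequencies as you go along gives the cumulative frequency.
cumulative = list(accumulate(frequencies))

# Each cumulative frequency is plotted against its upper class boundary.
points = [(upper, cf) for (_, upper), cf in zip(classes, cumulative)]
```

The last cumulative frequency should match the total of the frequency column, which is the same check suggested above.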
When we have our cumulative frequency column, we can draw a cumulative frequency curve.
Using this, we can also create a box plot. This is deduced by looking at the quartiles up the y-axis and
finding the corresponding x-values:
Box plots are useful because they show a lot of information at a glance: the median, the spread
(the IQR), any outliers, and whether the data is symmetrical, positively skewed or negatively skewed.
The IQR is a measure of spread.
IQR = Q3 - Q1
Outliers are extreme values. They are usually represented as a cross:
They can be either too low or too high and are usually worked out using these boundaries:
Q1 - 1.5 x (IQR) (Anything less than this figure will be an outlier)
Q3 + 1.5 x (IQR) (Anything greater than this figure will be an outlier)
The exam question will always state how to work out the outliers though, so this is one thing you don't
have to worry about remembering (just as long as you know how to use the formula).
When you've identified the outliers, where does the end of the box plot occur? You can either use
the next highest/lowest data value that is not an outlier, or use the boundary value worked out from the formula.
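As a rough sketch of the 1.5 x IQR rule above, the boundaries can be computed like this. The quartiles and data values are hypothetical, chosen only to show the arithmetic, and the function names are invented for this example.

```python
def outlier_limits(q1, q3):
    """Return the (lower, upper) boundaries from the 1.5 x IQR rule."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def find_outliers(values, q1, q3):
    """Return the values that fall outside the two boundaries."""
    low, high = outlier_limits(q1, q3)
    return [v for v in values if v < low or v > high]


# Hypothetical quartiles and data, not taken from the notes.
limits = outlier_limits(20, 30)
outliers = find_outliers([4, 18, 25, 33, 50], 20, 30)
```

Remember that an exam question may state a different rule, in which case the multiplier 1.5 would need changing.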
Linear Interpolation
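To work out the median, find the $\frac{n+1}{2}$th value.
For Q1 work out the $\frac{n+1}{4}$th value and for Q3 find the $\frac{3(n+1)}{4}$th value.
Percentiles (e.g. P12) mean a percentage of the way through the CF. To work out P12 for example, work out the $\frac{12(n+1)}{100}$th value.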
For a grouped frequency, it can be difficult to calculate the median and quartiles. There is a way of
estimating an answer, however, and this is called linear interpolation.
Time (sec)        Frequency   Cumulative Frequency   Class width
0 ≤ x < 10        0           0                      10
10 ≤ x < 15       8           8                      5
15 ≤ x < 17.5     3           11                     2.5
17.5 ≤ x < 20     7           18                     2.5
20 ≤ x < 24       12          30                     4
The first step is to find the $\frac{n+1}{2}$th value. In this example, n = 30, so it is 15.5.
We take away 11 (the cumulative frequency before the row 15.5 is found in) and then divide it by 7 (the frequency of the row the cumulative 15.5 is found in).
Next we multiply by 2.5 (the class width of the row 15.5 is found in).
Finally add on 17.5 (the lower class boundary of the row 15.5 is found in) and the answer appears, 19.1:
$17.5 + \frac{15.5 - 11}{7} \times 2.5 = 19.1$ (3 s.f.)
The only difference for the percentiles and other quartiles is replacing $\frac{n+1}{2}$ by the position of whatever you want to find.
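The interpolation steps above can be sketched as a small Python function. It is a minimal illustration, assuming the position (n + 1)/2 that gives the 15.5 used in the worked example; the function name and data layout are invented here.

```python
def interpolate(target, classes):
    """Estimate the data value at cumulative position `target`.

    classes: list of (lower_boundary, upper_boundary, frequency) in order.
    """
    cumulative = 0
    for lower, upper, freq in classes:
        if freq > 0 and cumulative + freq >= target:
            # Fraction of the way through this class, scaled by the class width.
            return lower + (target - cumulative) / freq * (upper - lower)
        cumulative += freq
    raise ValueError("target is beyond the total frequency")


# Time table from the notes: (lower boundary, upper boundary, frequency).
times = [(0, 10, 0), (10, 15, 8), (15, 17.5, 3), (17.5, 20, 7), (20, 24, 12)]
n = sum(freq for _, _, freq in times)
median = interpolate((n + 1) / 2, times)
```

Passing a different position, for example a quartile position, reuses the same function unchanged.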
Mean from frequency table
It's easy enough to work out the mean from ordinary (ungrouped) data, using just the simple formula:
$\bar{x} = \frac{\sum x}{n}$
(In other words, add them all up and divide by the number that there is.)
Time (sec)   Frequency (f)
0 - 9        0
10 - 14      8
15 - 17      3
18 - 20      7
21 - 24      12
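For a grouped frequency table, you'll need to work out the mid-point of each class to use as the x variable.
Midpoint = (lower class value + upper class value) / 2
The formula for the mean is:
$\bar{x} = \frac{\sum fx}{\sum f}$
Therefore, once you have the midpoint, you need to multiply f and x: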
Time (sec)   Frequency (f)   Midpoint (x)   fx
0 - 9        0               4.75           0
10 - 14      8               12             96
15 - 17      3               16             48
18 - 20      7               19             133
21 - 24      12              23             276
Add up the fx column and then divide by the total of the Frequency column to find the mean:
$\bar{x} = \frac{553}{30} = 18.4$ (3 s.f.)
Standard Deviation
For an ordinary set of data, the standard deviation is found by the following:
$\sigma = \sqrt{\frac{\sum x^2}{n} - \left(\frac{\sum x}{n}\right)^2}$
(Variance is the same formula, but without the square root).
For a frequency table, or grouped frequency table, though, again we have a slightly different formula:
$\sigma = \sqrt{\frac{\sum fx^2}{\sum f} - \left(\frac{\sum fx}{\sum f}\right)^2}$
Taking the above as an example, we need to add an fx² column. Be careful with this. Notice only the x
is squared: it is f times x², not (fx)².
Time (sec)   Frequency (f)   Midpoint (x)   fx     fx²
0 - 9        0               4.75           0      0
10 - 14      8               12             96     1152
15 - 17      3               16             48     768
18 - 20      7               19             133    2527
21 - 24      12              23             276    6348
Now add up the fx² and f columns, and write in the mean squared:
$\sigma = \sqrt{\frac{10795}{30} - \left(\frac{553}{30}\right)^2}$
Stick all that in your calculator and you'll get the answer: 4.48 (3 s.f.)
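The column totals and the final calculation can be checked with a short Python sketch. It uses the midpoints exactly as tabulated in these notes; the variable names are made up for the example.

```python
from math import sqrt

# (midpoint x, frequency f) pairs as tabulated in the notes above.
table = [(4.75, 0), (12, 8), (16, 3), (19, 7), (23, 12)]

total_f = sum(f for _, f in table)            # total of the f column
sum_fx = sum(f * x for x, f in table)         # total of the fx column
sum_fx2 = sum(f * x**2 for x, f in table)     # total of the fx² column

mean = sum_fx / total_f
variance = sum_fx2 / total_f - mean**2        # mean of the squares minus the mean squared
std_dev = sqrt(variance)
```

Note that each row contributes f times x squared, which is the point the notes warn about.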
Coding
When the numbers are too large to be reasonably worked with, there is an option for finding the mean. We can use coding. This replaces x (the midpoint) with y (connected by a formula, which makes it a
smaller number).
Use the code $y = \frac{x - 45.5}{10}$ to calculate the mean and standard deviation of the following frequency table:
x       Frequency (f)
15.5    8
25.5    12
35.5    15
45.5    16
55.5    11
65.5    6
75.5    2
We need to add the code column and work out y, and then add columns for fy and fy² rather than
fx and fx²:
x       Frequency (f)   y     fy     fy²
15.5    8               -3    -24    72
25.5    12              -2    -24    48
35.5    15              -1    -15    15
45.5    16              0     0      0
55.5    11              1     11     11
65.5    6               2     12     24
75.5    2               3     6      18
Next, work out the mean of y, using the formula $\bar{y} = \frac{\sum fy}{\sum f}$:
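$\bar{y} = \frac{-34}{70} = -0.486$ (3 s.f.)
We think back to the original code:
$y = \frac{x - 45.5}{10}$
If we replace y with $\bar{y}$ here, we can replace x with $\bar{x}$:
$\bar{y} = \frac{\bar{x} - 45.5}{10}$
Put in the numbers, and rearrange to make $\bar{x}$ the subject of the formula:
$\bar{x} = 10\bar{y} + 45.5$
$\bar{x}$ = 40.6 (3 s.f.) and that's your answer!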
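For standard deviation it's exactly the same process: work out the standard deviation of y, then decode. Now, if we think of the dispersion, adding and subtracting
won't affect the standard deviation. Dividing and multiplying will, however. With the code $y = \frac{x - 45.5}{10}$, the standard deviation of x is 10 times the standard deviation of y.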
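A short Python sketch of the whole coding calculation follows. It assumes the code y = (x - 45.5)/10, which is the code implied by the y column of the table (45.5 maps to 0 and each step of 10 maps to 1); the names are invented for the example.

```python
from math import sqrt

# (x, f) pairs from the frequency table above.
table = [(15.5, 8), (25.5, 12), (35.5, 15), (45.5, 16),
         (55.5, 11), (65.5, 6), (75.5, 2)]


def code(x):
    """Assumed code: subtract 45.5, then divide by 10."""
    return (x - 45.5) / 10


total_f = sum(f for _, f in table)
sum_fy = sum(f * code(x) for x, f in table)
sum_fy2 = sum(f * code(x) ** 2 for x, f in table)

mean_y = sum_fy / total_f
sd_y = sqrt(sum_fy2 / total_f - mean_y**2)

# Decoding: the mean undoes both steps, the standard deviation only the division.
mean_x = 10 * mean_y + 45.5
sd_x = 10 * sd_y
```

The subtraction of 45.5 shifts every value equally, so it changes the mean but leaves the spread alone, which is why only the factor of 10 appears when decoding the standard deviation.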
Histograms
Histograms are used for representing data that is continuous and is summarized in a grouped
frequency distribution.
There are no gaps between the bars. The area of the bar is proportional to the frequency.
Example:
The height of twenty children (to the nearest cm) was recorded in the following frequency table. Draw a
histogram to represent the data.
Height     Frequency (f)
120-124    1
125-129    5
130-134    7
135-139    4
140-149    3
There are two columns that we need to add: the class width and the frequency density.
Class width is the width of each group. Be careful to work it out from the lower class
boundary and the upper class boundary. For example, because heights are recorded to the nearest cm, 120-124 is actually 119.5 to 124.5, and so the class
width is 124.5 - 119.5 = 5.
Frequency density = frequency ÷ class width
Height     Frequency (f)   Class Width   Frequency Density
120-124    1               5             0.2
125-129    5               5             1
130-134    7               5             1.4
135-139    4               5             0.8
140-149    3               10            0.3
When we have these values, we plot the lower class and upper class boundaries on the x axis and the
frequency density on the y axis.
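The two extra columns can be reproduced with a short Python sketch. It assumes, as the notes do, that heights are recorded to the nearest cm so each class extends 0.5 either side; the names are made up for the example.

```python
# (lowest recorded height, highest recorded height, frequency) for each class.
groups = [(120, 124, 1), (125, 129, 5), (130, 134, 7), (135, 139, 4), (140, 149, 3)]

rows = []
for low, high, freq in groups:
    lower_boundary = low - 0.5      # heights are to the nearest cm
    upper_boundary = high + 0.5
    width = upper_boundary - lower_boundary
    density = freq / width          # frequency density = frequency / class width
    rows.append((lower_boundary, upper_boundary, width, density))

# Area of each bar (width x density) recovers the frequency.
total_area = sum(width * density for _, _, width, density in rows)
```

The total area equals the number of children, which is a quick check that the bar areas are proportional to the frequencies.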
Skewness
From the histogram above, we see a slight positive skew: most of the values are bunched towards the lower end,
with a longer tail towards the higher values. There are three types of skew, positive, negative and symmetrical (no skew), and
there are three tests to differentiate between them:
Positive Skew           Symmetrical             Negative Skew
Mean > Median > Mode    Mean = Median = Mode    Mean < Median < Mode
Q2 - Q1 < Q3 - Q2       Q2 - Q1 = Q3 - Q2       Q2 - Q1 > Q3 - Q2
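The quartile test in the table can be sketched as a small Python function. The function name and the example quartiles are hypothetical and only illustrate the comparison.

```python
def quartile_skew(q1, q2, q3):
    """Classify skew by comparing Q2 - Q1 with Q3 - Q2."""
    lower_gap = q2 - q1
    upper_gap = q3 - q2
    if lower_gap < upper_gap:
        return "positive skew"
    if lower_gap > upper_gap:
        return "negative skew"
    return "symmetrical"


# Hypothetical quartiles: the upper gap (8) is larger than the lower gap (4).
example = quartile_skew(10, 14, 22)
```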
Correlation
Correlation is a measure of the relationship between two or more variables. When we have two sets of data,
we can draw a scatter diagram to see if there is any correlation between them.
Data: The marks of 10 candidates in Maths and Physics are shown below:
Candidate 1 2 3 4 5 6 7 8 9 10
Physics (x) 18 20 30 40 46 54 60 80 88 92
Maths (y) 42 54 60 54 62 68 80 66 80 100
From the data, we can plot the x values corresponding to the y values. The only difference is that we
don't join the crosses with a line:
We can already see that it's positively correlated. A way to test this is to divide the graph into four
quadrants, and then look at where the majority of the points lie:
If most points lie in the 1st and 3rd quadrants, we have a positive correlation.
If most points lie in the 2nd and 4th quadrants, we have a negative correlation.
If points lie in all four quadrants randomly, we have no correlation.
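However, just looking at the scatter diagram is a bit inaccurate. It's much better to calculate the
strength of the correlation. There's a formula for this called the PMCC (Product Moment Correlation Coefficient):
$r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}$
How to calculate Sxy, Sxx and Syy:
$S_{xy} = \sum xy - \frac{\sum x \sum y}{n}$
$S_{xx} = \sum x^2 - \frac{(\sum x)^2}{n}$
$S_{yy} = \sum y^2 - \frac{(\sum y)^2}{n}$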
From the above information, we complete the following table:
x     y      x²      y²       xy
18    42     324     1764     756
20    54     400     2916     1080
30    60     900     3600     1800
40    54     1600    2916     2160
46    62     2116    3844     2852
54    68     2916    4624     3672
60    80     3600    6400     4800
80    66     6400    4356     5280
88    80     7744    6400     7040
92    100    8464    10000    9200
Σx = 528   Σy = 666   Σx² = 34464   Σy² = 46820   Σxy = 38640
If you're lucky the question will already give you these figures, and all you'll be asked to do is use them.
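Now using the PMCC formula, with n = 10:
$S_{xy} = 38640 - \frac{528 \times 666}{10} = 3475.2$
$S_{xx} = 34464 - \frac{528^2}{10} = 6585.6$
$S_{yy} = 46820 - \frac{666^2}{10} = 2464.4$
$r = \frac{3475.2}{\sqrt{6585.6 \times 2464.4}} = 0.863$ (3 s.f.)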
PMCC works so that -1 ≤ r ≤ 1, with -1 being perfect negative correlation, 0 being no correlation and +1
being perfect positive correlation. Our value of 0.863 is strong positive correlation.
Even if we code the data, the PMCC remains the same.
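The PMCC for the Physics and Maths marks can be checked with a short Python sketch. It uses the summary-total formulas for Sxy, Sxx and Syy from the Edexcel formula booklet; the variable names are made up for the example.

```python
from math import sqrt

physics = [18, 20, 30, 40, 46, 54, 60, 80, 88, 92]   # x
maths = [42, 54, 60, 54, 62, 68, 80, 66, 80, 100]    # y
n = len(physics)

# Column totals, as in the table above.
sum_x, sum_y = sum(physics), sum(maths)
sum_xy = sum(x * y for x, y in zip(physics, maths))
sum_x2 = sum(x * x for x in physics)
sum_y2 = sum(y * y for y in maths)

s_xy = sum_xy - sum_x * sum_y / n
s_xx = sum_x2 - sum_x**2 / n
s_yy = sum_y2 - sum_y**2 / n

r = s_xy / sqrt(s_xx * s_yy)
```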
Least squares regression line
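The regression line of y on x has the equation y = a + bx, where $b = \frac{S_{xy}}{S_{xx}}$ and $a = \bar{y} - b\bar{x}$.
We can work out b easily enough from the data above:
$b = \frac{3475.2}{6585.6} = 0.528$ (3 s.f.)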
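$\bar{y} = \frac{666}{10} = 66.6$
$\bar{x} = \frac{528}{10} = 52.8$
$a = \bar{y} - b\bar{x} = 66.6 - 0.528 \times 52.8 = 38.7$ (3 s.f.)
So the regression line is y = 38.7 + 0.528x.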
If the question asks you to draw the regression line, an easy way is to plot the mean point ($\bar{x}$, $\bar{y}$) on the
scatter diagram, and then draw the line from the intercept on the y-axis through this point. The mean point always
lies on the line.
If the data is coded, we need to decode when finding the mean.
An independent (explanatory) variable is one that is set independently of the other variable. (Plotted
on the x axis).
A dependent (response) variable is one whose values are determined by the values of the independent
variable. (Plotted on the y axis).
Interpolation is when you estimate the value of a dependent variable within the range of the data.
Extrapolation is when you estimate a value outside the range of the data. Values estimated by
extrapolation can be unreliable.
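A minimal Python sketch of the regression line for the Physics and Maths marks is given below. It uses b = Sxy/Sxx and a = mean of y minus b times mean of x; the `predict` helper and the `within_range` check are invented here to illustrate interpolation against extrapolation.

```python
physics = [18, 20, 30, 40, 46, 54, 60, 80, 88, 92]   # x (independent)
maths = [42, 54, 60, 54, 62, 68, 80, 66, 80, 100]    # y (dependent)
n = len(physics)

mean_x = sum(physics) / n
mean_y = sum(maths) / n

s_xy = sum(x * y for x, y in zip(physics, maths)) - sum(physics) * sum(maths) / n
s_xx = sum(x * x for x in physics) - sum(physics) ** 2 / n

b = s_xy / s_xx            # gradient
a = mean_y - b * mean_x    # intercept, so the line passes through the mean point


def predict(x):
    """Estimate y from the regression line y = a + bx."""
    return a + b * x


def within_range(x):
    """True if x lies inside the observed data (interpolation, not extrapolation)."""
    return min(physics) <= x <= max(physics)
```

An estimate for a Physics mark of 50 would be interpolation, while one for a mark of 5 would be extrapolation and so less reliable.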
Probability
If A is an event, the probability of it occurring is the number of ways A can occur, divided by the size of the sample space (total number of outcomes, S).
P(A) = n(A) / n(S)
Probability is always 0 ≤ p ≤ 1.
If you have a probability, p(A), the probability of not getting A is written as: p(A'). We can say that to find
p(A'), we merely take p(A) away from 1.
A ∩ B - this means A "intersection" B - all elements that are in A and in B. We can see this on a Venn
diagram:
A ∪ B means A "union" B - all elements that are in A or in B. On a Venn diagram this is:
Addition Rule
The addition rule for finding P(A ∪ B):
P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
We can rearrange this to get:
P(A ∩ B) = P(A) + P(B) - P(A ∪ B)
Example:
There are 15 books on a bookshelf. 10 of these are fiction, 4 of which are hard-back. 6, in total, are hard-
back and the remaining 9 are paper back.
Find the probability that a hard-back fiction book is chosen at random.
First stage is to draw a Venn diagram and write in all the numbers:
We're looking for p(H ∩ F), so where is it both H and F? Where the two circles overlap, so 4/15.
Find the probability that a hardback is chosen but is not fiction.
We want p(H ∩ F'), which is 2/15.
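The bookshelf numbers can be checked against the addition rule with exact fractions. This is a small sketch; the variable names are made up for the example.

```python
from fractions import Fraction

total = 15
p_H = Fraction(6, total)          # hardback
p_F = Fraction(10, total)         # fiction
p_H_and_F = Fraction(4, total)    # hardback and fiction (the overlap)

# Addition rule: P(H or F) = P(H) + P(F) - P(H and F)
p_H_or_F = p_H + p_F - p_H_and_F

# Hardback but not fiction: the part of H outside the overlap.
p_H_not_F = p_H - p_H_and_F
```

Subtracting the overlap once stops the 4 hardback fiction books from being counted twice.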
Conditional Probability
This occurs when the probability of A is conditional upon B having already occurred. Given B, find the
probability of A. It's written out as p(A|B).
We use tree diagrams to solve conditional probability.
Example:
A bag contains 6 red and 4 blue balls. 2 balls are picked at random and retained.
Find the probability that both balls are red.
First, draw out a tree diagram.
We want p(R ∩ R), red followed by red, so we just follow the tree diagram along:
6/10 x 5/9 = 30/90 = 1/3.
Find the probability that the balls are different colours.
We want p(R ∩ B) and p(B ∩ R). Multiply along each branch and then add these together:
p(R ∩ B) = 6/10 x 4/9 = 24/90
p(B ∩ R) = 4/10 x 6/9 = 24/90
24/90 + 24/90 = 48/90 = 8/15.
Find the probability that the second ball is red, given the first is blue.
We want p(R|B), so we use the formula:
p(R|B) = p(R ∩ B) / p(B)
= 24/90 ÷ 4/10
= 2/3.
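The three tree diagram answers can be reproduced with exact fractions. This is only a sketch of the bag example, with variable names invented here.

```python
from fractions import Fraction

red, blue = 6, 4
total = red + blue

# Balls are retained, so the second pick is out of one fewer ball.
p_RR = Fraction(red, total) * Fraction(red - 1, total - 1)
p_RB = Fraction(red, total) * Fraction(blue, total - 1)
p_BR = Fraction(blue, total) * Fraction(red, total - 1)

p_different = p_RB + p_BR

# P(second is red | first is blue) = P(blue then red) / P(first is blue)
p_R_given_B = p_BR / Fraction(blue, total)
```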
Independent Events
Independent events are ones where the outcome of one doesn't affect the probability of the next, unlike the conditional case above. For example,
if balls are taken from a bag and replaced, the probability of a red ball is the same no matter how many
times you pick from the bag.
This means:
P(A ∩ B) = P(A) × P(B)
If events are mutually exclusive, they cannot occur at the same time and p(A ∩ B) is 0.
This means that:
P(A ∪ B) = P(A) + P(B)
Sample Space Diagram
Example : A dice is thrown twice and the scores obtained are added together. Find the probability that
the total score is 6.
There are 36 equally likely outcomes.
5 of the outcomes result in a total of 6.
So P(total = 6) = 5/36.

First Throw
  6 |  7  8  9 10 11 12
  5 |  6  7  8  9 10 11
  4 |  5  6  7  8  9 10
  3 |  4  5  6  7  8  9
  2 |  3  4  5  6  7  8
  1 |  2  3  4  5  6  7
    +-------------------
       1  2  3  4  5  6
       Second Throw

Discrete Random Variables
A discrete random variable takes separate numerical values, each with its own probability, such as the "number on a fair die".
The probability that a discrete random variable X takes the value x is written as P(X = x).
Example: A tetrahedral die has the numbers 1, 2, 3, 4 on its faces. The die is biased in such a way that:
P(X = x) = k     for x = 1, 2, 3
P(X = x) = 3k    for x = 4
If we draw this out in a probability distribution table we get:
x    P(X = x)
1    k
2    k
3    k
4    3k
All the probabilities added together = 1.
k(1 + 1 + 1 + 3) = 1
6k = 1
k = 1/6
Therefore, we can write out the probability distribution:
x    P(X = x)
1    1/6
2    1/6
3    1/6
4    3/6
We can also find the cumulative distribution, the F(x):
x    P(X = x)   F(x)
1    1/6        1/6
2    1/6        2/6
3    1/6        3/6
4    3/6        1
The cumulative probability always reaches 1 at the largest value.
P(X ≤ 2) means the probability of getting an X value less than or equal to 2. We add up the probabilities
we have, and so, in the above example, P(X ≤ 2) = 1/6 + 1/6 = 2/6 = 1/3.
F(x) means P(X ≤ x), so F(2) = 1/3.
If a question asks you something like F(3.5), in our example X can't take the value 3.5. Therefore, we do F(3)
instead, which would be 3/6 = 1/2.
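The biased tetrahedral die can be sketched in Python with exact fractions. The constant is written as k here, which is an assumed label for the unknown in the example, and the function name `F` simply mirrors the F(x) notation.

```python
from fractions import Fraction

# P(X = x) is k for x = 1, 2, 3 and 3k for x = 4, stored as multiples of k.
multiples = {1: 1, 2: 1, 3: 1, 4: 3}

# All the probabilities add up to 1, so k = 1 / (sum of the multiples).
k = Fraction(1, sum(multiples.values()))
distribution = {x: m * k for x, m in multiples.items()}


def F(value):
    """Cumulative distribution function: P(X <= value)."""
    return sum(p for x, p in distribution.items() if x <= value)
```

Calling `F(3.5)` gives the same result as `F(3)`, because no probability sits between 3 and 4.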
Mean and Variance
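Finding the mean and variance of a discrete random variable is almost identical to finding them from a frequency table.
The formula for mean:
$E(X) = \sum x P(X = x)$
For Variance, we have the formula:
$\mathrm{Var}(X) = E(X^2) - [E(X)]^2$
To find $E(X^2)$:
$E(X^2) = \sum x^2 P(X = x)$
Example:
X is a discrete random variable with the following distribution.
x        P(X = x)   xP(X = x)   x²P(X = x)
0        0.4        0           0
1        0.5        0.5         0.5
2        0.1        0.2         0.4
Total               0.7         0.9
Therefore, E(X) = 0.7 and Var(X) = 0.9 - 0.7² = 0.41.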
Suppose Y is the random variable given by coding X with Y = 3X - 2 for the above table. The table would now
look like this:
y        P(Y = y)   yP(Y = y)   y²P(Y = y)
-2       0.4        -0.8        1.6
1        0.5        0.5         0.5
4        0.1        0.4         1.6
Total               0.1         3.7
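Remember the code: Y = 3X - 2, which gives E(Y) = 0.1 and Var(Y) = 3.7 - 0.1² = 3.69.
To decode back: E(X) = (E(Y) + 2) / 3 = 0.7 and Var(X) = Var(Y) / 3² = 3.69 / 9 = 0.41.
In general: E(aX + b) = aE(X) + b and Var(aX + b) = a²Var(X).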
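The example and its coded version can be checked with a short Python sketch. It assumes the code Y = 3X - 2, which is what turns the values 0, 1, 2 into -2, 1, 4; the function names are invented for the example.

```python
# Probability distribution of X from the example above.
distribution = {0: 0.4, 1: 0.5, 2: 0.1}


def expectation(dist):
    """E(X): sum of each value times its probability."""
    return sum(x * p for x, p in dist.items())


def variance(dist):
    """Var(X) = E(X squared) minus E(X) squared."""
    mean = expectation(dist)
    mean_of_squares = sum(x**2 * p for x, p in dist.items())
    return mean_of_squares - mean**2


# Code each value with Y = 3X - 2, keeping the same probabilities.
coded = {3 * x - 2: p for x, p in distribution.items()}
```

Comparing `variance(coded)` with `9 * variance(distribution)` shows the a² rule, since the subtraction of 2 has no effect on the spread.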
A Discrete Uniform distribution is where each value of the random variable has the same probability. For example,
when X is the score on a fair 6-sided die, each probability would be 1/6.
For a Discrete Uniform distribution over the values 1, 2, 3, ..., n:
$E(X) = \frac{n + 1}{2}$ and $\mathrm{Var}(X) = \frac{(n + 1)(n - 1)}{12}$
Example: A tetrahedral dice has its faces numbered 1, 2, 3 and 4. X is the score obtained when the dice
is rolled.
X therefore has a discrete uniform distribution, with P(X = x) = 1/4 for x = 1, 2, 3, 4.
E(X) = (4 + 1) / 2 = 2.5
The Normal Distribution
- Symmetrical about the mean.
- Total area under the curve = 1.
- Probabilities correspond to the area.
- A continuous distribution (therefore there is no difference between P(X < a) and P(X ≤ a)).
- 68% of the distribution lies within 1 standard deviation of the mean.
- 95% of the distribution lies within 2 standard deviations of the mean.
- 99.7% of the distribution lies within 3 standard deviations of the mean.
Examples:
- The masses of newborn babies.
- IQ of school students.
- Hand span of adult females.
- Height of plants growing in a field.
Working out Probabilities using tables.
If P(Z < a) is greater than 0.5, then a will be > 0.
If P(Z < a) is less than 0.5, then a will be < 0.
If P(Z > a) is less than 0.5, then a will be > 0.
If P(Z > a) is more than 0.5, then a will be < 0.
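The sign rules above can be illustrated with Python's standard library in place of printed tables. This is only a sketch: `statistics.NormalDist` (Python 3.8 or later) stands in for the tables, and the values 1.0 and 0.975 are arbitrary choices for the illustration.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal: mean 0, standard deviation 1

# P(Z < a) for a positive and a negative value of a.
p_below_positive = Z.cdf(1.0)     # a > 0, so the probability is above 0.5
p_below_negative = Z.cdf(-1.0)    # a < 0, so the probability is below 0.5

# P(Z > a) is found by taking P(Z < a) away from 1.
p_above_positive = 1 - Z.cdf(1.0)

# Working the other way: the value a for which P(Z < a) = 0.975.
a = Z.inv_cdf(0.975)
```

By symmetry, `p_below_negative` and `p_above_positive` are the same area.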
Standardizing
If X ~ N(μ, σ²) and Z ~ N(0, 1) then:
$Z = \frac{X - \mu}{\sigma}$
Example: If find
The first step is to standardize:
Working Backwards
Example: If ,find the value of if .
To findx, we start by finding the standardised value such that .
From tables we see that .
We therefore need to find the value that standardises to make by rearranging the formula.
Examination style question: A machine is designed to fill jars of coffee so that the contents, X, follow a
normal distribution with mean μ grams and standard deviation σ grams.
If and , find μ and σ correct to 3 significant figures.
Firstly : + 1.96
Secondly, we are told that :
- 1.75
The two equations are:
+ 1.96
- 1.75
Subtract to eliminate μ:
This gives
So the solutions to 3sf are and g.
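The elimination step can be sketched in Python for a pair of equations of the form μ + 1.96σ = (upper value) and μ - 1.75σ = (lower value). The right-hand sides 210 and 190 below are hypothetical, chosen only for illustration, and are not the values from the question.

```python
# Hypothetical right-hand sides, chosen only for illustration.
upper_value = 210.0   # assumed: mu + 1.96 * sigma = 210
lower_value = 190.0   # assumed: mu - 1.75 * sigma = 190

z_upper, z_lower = 1.96, -1.75   # standardised values from the question

# Subtracting the second equation from the first eliminates mu:
# (1.96 + 1.75) * sigma = upper_value - lower_value
sigma = (upper_value - lower_value) / (z_upper - z_lower)

# Substitute sigma back into the first equation to find mu.
mu = upper_value - z_upper * sigma
```

Substituting both results back into the second equation is a quick way to check the arithmetic.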