Centre for Learning and Academic Development (CLAD)

Technology Skills Development Team

Using Statistics in MS Excel

www.intranet.birmingham.ac.uk/itskills

Using Statistics in MS Excel (XL2105)


Author: Sonia Lee Cooke

(The course is substantially based on a previous course

developed and presented by Duncan Greenhill, Barbara Hallam

and Dr Graham Hendry).

Version: 1.0, October 2012

© 2009 The University of Birmingham

All rights reserved; no part of this publication may be photocopied, recorded or otherwise reproduced, stored in a retrieval system or transmitted in any form by any electrical or mechanical means without permission of the copyright holder.

Trademarks: Microsoft Windows is a registered trademark of Microsoft Corporation. All brand names and product names used in this handbook are trademarks, registered trademarks, or trade names of their respective holders.


Contents

ABOUT THE WORKBOOK
HOW TO DO SOMETHING
  ABOUT EXCEL

FORMATTING CELLS
  CUSTOM FORMATS
    Cells with text or symbols
    Number formats
    Leaving space
    Conditions in Custom Formats
    Styles

ARRAY FORMULAE
    Array constants

CHARTS
  CREATING A CHART
    The elements of a chart
  CHART LOCATION
  SAVE THE CHART FORMATTING AND LAYOUT AS A TEMPLATE
  APPLYING A CHART TEMPLATE TO AN EXISTING CHART
  MOVING OR DELETING A CHART TEMPLATE
  ADDING ERROR BARS TO A CHART
  ADDING MORE SERIES TO A CHART
    Missing data points
  ADDING A SECOND Y AXIS
  ADDING A SECOND X AXIS
  COMBINATION CHARTS
  X-Y SCATTER CHARTS
  ADD A TRENDLINE TO A CHART

STATISTICS WITH EXCEL
  DESCRIPTIVE STATISTICS
    Conditional formatting for extreme values
  EXCEL ADD-INS
  PRODUCING HISTOGRAMS
    Dynamic histograms

LEAST SQUARES REGRESSION
    Calculating linear regression coefficients
    Calculating best fit values
    Calculating r²
    Reduced Major Axis regression
  MULTIPLE REGRESSION
    Calculating polynomial regression coefficients
  CONFIDENCE INTERVALS
  FORMULAE ACROSS WORKSHEETS

EXCEL RESOURCES
  WEBSITES
  NEWSGROUPS
  SEARCH ENGINES

APPENDIX A – CUSTOM FORMATTING CODES
  NUMBER CODES
  TEXT CODES
  DATE CODES
  TIME CODES

APPENDIX B – EXCEL FUNCTIONS
  STATISTICAL FUNCTIONS
  CURVE FITTING FUNCTIONS
  DISTRIBUTION FUNCTIONS
  SIGNIFICANCE TEST FUNCTIONS


About the workbook

The workbook is designed as a reference for you to use after the course has

finished. The workbook is yours to take away with you so feel free to make

any notes you need in the workbook itself.

The workbook is divided into sections, with each section explaining a

particular feature of Excel or how to do a particular task. Sections that take

you through a particular procedure step-by-step look like this:

How to do something

Do this first.

Then do this.

Then do this to finish.

There are also a number of text boxes to watch out for throughout the

workbook. These will help you to get the most out of Excel.

Tip

The thumbs-up symbol in the margin indicates a tip. These tips will help you

work more effectively.

Danger!

The skull and crossbones picture in the margin indicates common mistakes or

pitfalls to be avoided.


About Excel

The handbook introduces some of the analytical and presentational features

of Excel, as well as looking at some of the more powerful formatting

features.

Excel is powerful enough to be used for the analysis of data from many

research projects. There are a number of add-ins, both from Microsoft and

other companies, which can extend its capabilities. Many of the dedicated

statistical and graphing packages can read the Excel files if features are

required that aren’t provided in Excel.

Formatting cells

Custom formats

Formatting of numbers for statistical data can sometimes be critical.

Suppose we have some data from analysing the concentration of copper in

sediment samples. Half of the results might be accurate to two decimal

places, but the other half might have been produced using a different

analysis technique and might only be accurate to one decimal place. When

the cells are displayed the decimal points won’t line up. We can’t simply

format all the cells to one decimal place since we would lose precision in

half the data. Neither can we format all the data to two decimal places

since that would imply half of the data is more accurate than it actually is.

So how do we make the cells line up? The answer is to use custom

formats.

When a number is typed into a cell, Excel initially stores it with the

General format, unless it recognises it as a date, currency, or as a

percentage.

The General number format shows numbers with more than eleven digits

in scientific format, such as 1.23E+11.

The custom format can contain three sections for numbers and an

additional section for text. The sections are separated by semicolons.

Positive; negative; zero; text

An example is shown below.

##0.00 ; [RED](##0.00) ; 0.00 ; "format for text"

If we miss out any sections we still need to add in the semicolons. For

example, if we wanted to enter a format for positive and zero numbers but

not for negative numbers we would have two semicolons. Excel would

recognise that the negative section was empty and that the second set of

formatting codes applied to zeros.
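The way the sections are chosen can be sketched in Python. This is only an illustration of the positive; negative; zero; text rule described above, not Excel's own implementation. The function name pick_section is hypothetical, and the sketch assumes that no semicolons appear inside quoted text in the format.

```python
def pick_section(format_code, value):
    """Return the section of a custom format that would apply to value.

    Hypothetical helper for illustration only. It splits naively on
    semicolons, so it assumes none appear inside quoted text.
    """
    sections = [part.strip() for part in format_code.split(";")]

    if isinstance(value, str):
        # Text uses the fourth section when one is supplied; otherwise we
        # assume the text is shown unchanged, written here as "@".
        return sections[3] if len(sections) > 3 else "@"

    if len(sections) == 1:
        return sections[0]              # one section covers every number
    if value < 0:
        return sections[1]              # may be empty if the section was skipped
    if value == 0 and len(sections) > 2:
        return sections[2]
    return sections[0]                  # positives (and zeros when no zero section)


example = '##0.00 ; [RED](##0.00) ; 0.00 ; "format for text"'
print(pick_section(example, 12.5))      # positive section
print(pick_section(example, -5))        # negative section
print(pick_section(example, 0))         # zero section
print(pick_section("0.0;;0", 0))        # negative section left empty
```

In the last call the negative section is empty, so the second set of codes after the two semicolons is the one used for zeros.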

To create a custom format:

Select a cell or range of cells.


Click on the Home tab, in the Cells group, click on the Format button

and select Format Cells… Alternatively, press Ctrl 1. The Format Cells dialogue box appears:

The Number tab is selected by default; under Category: click on

Custom in the list of categories.

Delete the word General from the Type: text box and enter the codes

for the format you want.

Click on the OK button.

The codes that we can enter to create a custom format can be found in the

Excel help system.

Why have I got all these formats?

Every time we modify an existing custom format, the original format stays in the

list. Make sure you delete any that you don’t need.

If you accidentally delete a custom format you need, all the cells that use the

format will lose the custom formatting. For example, if we have a custom format

to show 21 as 21°C and we delete the custom format then the cell will revert to

displaying 21.

Cells with text or symbols

If we are recording some temperature data we might want to include the

units. For example, we might want a temperature to be displayed as 21°C


or 294K. If we just type 294K into a cell, Excel will interpret it as text and

we won’t be able to do any calculations with the value. The way we would

solve the problem is by creating a custom format consisting of the number

codes and any other text we need. To display text in a cell (with or without

numbers) we enclose the text in quotes. Single characters can also be

‘escaped’ i.e. treated literally by putting a backslash \ before the character.

Some characters display without needing a backslash. These are: $ - + / ( )

{ } : ! ^ & ~ = < > ' and the space character. If the cell is going to contain

text, we use a code of @ to represent the cell content, e.g. "My name is "@.

Number formats

When we create custom formats for numbers we put in placeholders to

contain the digits. A # displays only significant digits and will not display

zeros that are not significant. A 0 will display zeros that are not significant

if the number has fewer digits than the format.

If we want to have the thousands separated by commas we can include the

commas in the format.

Leaving space

The Excel help system tells us that we can use a ? in the format to leave

space. However, that method works best if we are using a monospaced

font, i.e. each character is the same width, and most fonts installed on

computers do not have equal width characters. We can get around the

problem by using the underscore character. The underscore character

leaves space equal to the width of the character that follows it. For

example, _m will leave blank space that is the width of an m, while _i will

leave space the width of an i. Neither code will put an m or i on screen.

I’m adding up time. Can I show a total more than 24 hours?

If we are using any of the standard time formats and add up times we can run

into problems if the total is more than 24 hours. For example, a timesheet would

add up to 35 hours for a normal working week, but the standard time formats

would display this as 11:00 i.e. the whole day has been discarded. We can get

around the problem with a custom format of [hh]:mm which will display elapsed

hours.
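The difference between a clock-style display and an elapsed-hours display can be sketched in Python. The two function names are hypothetical, and the sketch assumes five working days of seven hours each, which gives the 35-hour week mentioned above.

```python
from datetime import timedelta


def clock_display(total):
    """Mimic a standard hh:mm time format: whole days are discarded."""
    minutes = int(total.total_seconds() // 60)
    hours, mins = divmod(minutes, 60)
    return f"{hours % 24:02d}:{mins:02d}"


def elapsed_display(total):
    """Mimic the [hh]:mm custom format: hours keep counting past 24."""
    minutes = int(total.total_seconds() // 60)
    hours, mins = divmod(minutes, 60)
    return f"{hours:02d}:{mins:02d}"


# Assumed timesheet: five days of seven hours each.
week = sum([timedelta(hours=7)] * 5, timedelta())

print(clock_display(week))    # the whole day is dropped
print(elapsed_display(week))  # the full total is kept
```

The stored total is the same in both cases; only the way it is displayed changes.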

I’ve lost my data

If we don’t put in the number or text codes then Excel will only display the text

entered in quotes, for example, °C instead of 21°C.

So what does my cell contain?

It contains the value you typed in. Think of it like typing in a number and then

using the formatting to make it display with a pound sign. The cell still only

contains the number, but the formatting is making it display differently.


Conditions in Custom Formats

If we want the custom formats to be conditional on some value then we

can include conditions in the format. For each of the number sections the

order would be:

[condition][colour]codes

Examples would be:

[Green][>100]###.00;[<0][Red]###.00;0.00

This would give numbers larger than 100 in green, negative numbers in red

and zeros as 0.00.

Caution!

The conditions can override the ‘logic’ of the sections, which are positive;

negative; zero; text. For example:

[Green][<-20]###.00 ; [>100][Red]###.00;0.00

would mean any value below -20 would be green, any number above 100 would be

red, while –10 would meet neither condition and so would have no colour applied to it.
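A rough Python sketch of how conditional sections are tried in order may help here. It is an illustration only: the function colour_for and the list-of-pairs representation are hypothetical, and it models just the colour choice, not the number codes.

```python
import operator
import re

# Comparison operators that may appear inside a [condition].
_OPS = {
    "<": operator.lt, "<=": operator.le,
    ">": operator.gt, ">=": operator.ge,
    "=": operator.eq, "<>": operator.ne,
}


def colour_for(value, sections):
    """Return the colour of the first section whose condition value meets.

    sections is a list of (condition, colour) pairs; a condition of None
    stands for a final catch-all section with no condition.
    """
    for condition, colour in sections:
        if condition is None:
            return colour
        op, number = re.fullmatch(
            r"(<=|>=|<>|<|>|=)(-?\d+(?:\.\d+)?)", condition
        ).groups()
        if _OPS[op](value, float(number)):
            return colour
    return None


# The format from the Caution box: [Green][<-20]###.00 ; [>100][Red]###.00;0.00
caution = [("<-20", "Green"), (">100", "Red"), (None, None)]

for v in (-25, -20, -10, 150):
    print(v, colour_for(v, caution))
```

Note that -20 itself does not satisfy the condition <-20, so in this sketch only values strictly below -20 come out green.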

What colours can I use?

The help system tells us we can have any one of eight colours. These are

black, blue, cyan, green, magenta, red, white and yellow. However, we can get

more. If we look at the patterns tab in the Format Cells dialogue box we see a

grid of colours. Imagine them numbered from 1 2 3 … across the first row and 9

10 11 … across the second down to 56 in the bottom right-hand corner. If we

type [color n] Excel will use that colour for that section. We need to use the

American spelling for Excel to understand what we want.

Styles

Format painter works well but it has its disadvantages. Suppose we have

formatted a cell to have a red background and size 14 text and then used

format painter to copy that format to 40 other cells. If we now change our

mind and decide we want a blue background and size 16 text we will have

to use format painter again. A better solution would be to use styles. Styles

are a way of giving a nickname to the formatting applied to a cell. We can

use styles to rapidly apply formatting to a cell, and we can change

formatting throughout a spreadsheet by modifying the style.

To create a cell style

Select the cell that has the formatting to be copied.


Click on the Home tab, in the Styles group, click on the Cell Styles

button and select New Cell Styles… at the bottom of the list. The Styles dialogue box appears:

Type the name of the style in the Style name: box.

Click on the Format... button and select the required format from the

Format Cells dialogue box and click on the OK button to return to the

Styles dialogue box.

Click on the OK button on the Styles dialogue box

To apply a cell style:

Select the cells you want to apply the style to.

Click on the Cell Styles button, the style is displayed under Custom at the top of the list, select it.

To remove a cell style

Select the cells where the style has been applied.

Click on the Cell Styles button.

Under Good, Bad, and Neutral, click Normal, or right-click on the style and select Delete to delete it from the list.

To modify a cell style

Click on the Home tab, in the Styles group, click on Cell Styles,

right-click the Cell Style name and select Modify…


Array formulae

Array formulae are one of the most powerful and least understood features

of Excel. An array formula can give us back a single result or multiple

results from complex calculations, without us having to enter the

intermediate steps into the worksheet. We can also use arrays inside a

normal formula. These are called array constants. Some of the functions in

Excel require us to use an array formula because the function either uses an

array or produces an array as a result. An example would be the matrix

functions.

What’s an array?

Think of an array as being like a grid. An array formula uses groups of cells

instead of single cells as the source data for its calculation.

Suppose we have the following worksheet:

We can select cells C1 to C4 and type a formula =A1:A4+B1:B4. If we

press enter we’ll get the answer 6 in C1 only. If we press

CTRL SHIFT ENTER then we will enter four formulas at the same time and

we’ll have the answers 2, 9, 6 and 8 in C1, C2, C3 and C4 respectively.

Using an array formula has allowed us to create a calculation that gives us

more than one answer. Excel has done this by ‘looping’ through the array,

i.e. Excel first adds A1 and B1 and puts the answer in C1, then Excel adds

A2 and B2 and puts the answer into C2, then Excel adds A3 and B3 and

puts the answer into C3 and finally adds A4 and B4 and puts the answer in

C4.
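The 'looping' described above can be sketched in Python. The worksheet itself is not reproduced here, so the values in column_a and column_b are hypothetical, chosen only so that the sums match the answers 2, 9, 6 and 8 quoted in the text.

```python
# Hypothetical contents of A1:A4 and B1:B4 (the original worksheet is not shown).
column_a = [1, 4, 2, 5]
column_b = [1, 5, 4, 3]

# Pair up A1 with B1, A2 with B2, and so on, adding each pair in turn.
column_c = [a + b for a, b in zip(column_a, column_b)]

print(column_c)
```

Each element of column_c corresponds to one cell of the result range C1:C4.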

To create an array formula for a single result:

Click in a single cell.

Type the formula as normal.

Press and hold down both the Ctrl and the Shift keys, and then press the ENTER key.

To create an array formula for a multiple result:

Select a range of cells.

Type the formula as normal.


Press and hold down both the Ctrl and the Shift keys, and then press the ENTER key.

Editing arrays

When we have created an array formula with results in multiple cells we can’t

edit a single cell of the array. We have to select the whole range, edit the

formula, and then press Ctrl Shift ENTER to re-create the array formula.

Where did the { } brackets come from?

Excel has placed the { } brackets around the formula when we pressed

Ctrl Shift ENTER. We can’t type them ourselves as part of the formula and then

press ENTER; we have to press Ctrl Shift ENTER.

Array constants

An array constant takes the place of a constant in a normal formula. For

example, the ‘Large’ function in Excel returns the nth largest number from

a range. The syntax is LARGE(array,k) where k is the place that we want

i.e. 1 for the largest, 2 for the second largest and so on. The formula:

=LARGE(A1:C20,1)

would give us the largest number within the block of cells from A1 to C20.

An array constant lets us find more than one number. For example, we

might want the three largest numbers, or the 1st, 3rd and 5th numbers in the

list. We can do this by entering a formula such as:

=LARGE(A1:C20,{1,2,3}) or

=LARGE(A1:C20,{1;3;5})

To enter a formula with an array constant

Select a range of cells. The number of cells selected should be the

same as the number of answers required. The selection can be

vertical or horizontal.

Type the formula and type the { } brackets around the array constant

part. If the range selected is horizontal separate the array constants

with commas. If the range selected is vertical separate the array

constants with semi-colons.

Press Ctrl Shift ENTER.


My answers are identical

If we have entered a formula with an array constant to give us multiple answers

we can sometimes find all the answers are the same when we can see quite

clearly from the data that they shouldn’t be. An array constant varies according

to whether we want a horizontal or vertical result. If we want results horizontally

the array constants need to be separated with commas, e.g. {1,2,3}, but if we

want results displayed vertically then we need to separate the array constants

with semi-colons e.g. {1;3;5}.
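What LARGE does with an array constant can be sketched in Python. This is an illustration rather than Excel's implementation: the helper names large and large_many and the sample data are hypothetical.

```python
def large(values, k):
    """Return the k-th largest value, where k=1 means the largest."""
    ordered = sorted(values, reverse=True)
    return ordered[k - 1]


def large_many(values, ks):
    """Apply large() once for each position in ks, like an array constant."""
    return [large(values, k) for k in ks]


# Hypothetical data standing in for the block of cells A1:C20.
data = [12, 7, 31, 18, 25, 4]

print(large(data, 1))                # the single largest value
print(large_many(data, [1, 2, 3]))   # the three largest values
print(large_many(data, [1, 3, 5]))   # the 1st, 3rd and 5th largest values
```

The sketch returns one answer per position requested, which is why the range selected in Excel needs the same number of cells as there are values in the array constant.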

Charts

Excel lets us create charts very easily. We can create a variety of different

types of charts such as bar charts, pie charts, x-y scatter charts and many

others. Once the chart has been created we can select various elements

within them and modify or format them to suit the needs of that particular

chart.

Creating a chart

To create a chart:

Select the data in the worksheet.

Click on the Insert tab, in the Charts group, choose the chart type

you want.

By default the chart is embedded into the worksheet but you can

change the location (see chart location below). When you click on the

embedded chart in the worksheet, Chart Tools is displayed at the

top right of the ribbon, together with three additional tabs: Design,

Layout, and Format.

You can format different elements of the chart by clicking on the Format tab and selecting the required option from the Current Selection and Shape Styles groups.

You can use the Design tab to change the chart type and chart styles, create a chart template and edit the x axis labels.


You can use the Layout tab to add various elements to the chart such

as chart titles, axes, plot area, trendline and error bars, and to reset the chart

back to the default style.

The elements of a chart

A chart will consist of various elements and it can be useful to know what

Excel calls them, for example, the help system might tell us to select a

certain part of a chart.

Chart location

When you create a chart, by default the chart is placed within the current

worksheet. When the chart is within the worksheet it is sometimes called

an embedded chart. You can click on it and drag it to a different location on

the worksheet. You can also choose a different location on a completely

separate worksheet. If you have a chart on a completely separate

worksheet it can sometimes be more useful because it gives you more

room to work.

I have an embedded chart. Can I print it on a separate page?

Yes. If the chart is not selected then when you print you get what you can see

on screen – the data and the chart together. If you click on the chart to select it

you will see small dots in each corner and the middle of each side, within

double-borders around the chart. If you print with the chart selected the chart

will be printed on a separate page by itself, scaled up to fit the size of the paper.

To change chart location:

Click on the chart to select it.


Click on the Design tab, in the Location group, click on the Move Chart button.

The Move Chart dialogue box appears:

Choose the location and click on the OK button.

Save the Chart Formatting and Layout as a Template

There are a number of standard formats to pick from when creating a chart.

We can change to a different format later, but quite often we find the

standard formats are not quite what we want. If we are creating a series of

charts then it can be quite long-winded to modify them all by hand. What

we could do instead is to create our own custom format, store it and then

apply it to all the other charts as well.

To save the chart as a template:

Create a chart and format as necessary.

Click on the chart to select it.

Under Chart Tools, click on the Design tab and, in the Type group, click

on Save as Template. The Save Chart Template dialogue box

appears. Make sure the Charts folder is selected and, in the File

name: text box (at the bottom left of the Save Chart Template

dialogue box), enter a name for the Chart Template.

Click on the OK button.


Applying a Chart Template to an existing chart

To apply a chart template to an existing chart:

Click on the chart to select it.

Click on the Insert tab, in the Charts group, click on the Charts dialogue box launcher button.

The Change Chart Type dialogue box appears.

Click on Templates in the left pane.

Pick the required chart template from the list on the right.

Click on the OK button.

Moving or Deleting a Chart Template

To move or delete a chart template from the chart template folder:

Click on the Insert tab, in the Charts group, click on the Charts dialogue

box launcher (the Chart Type dialogue box appears).

Click on the Manage Templates… button at the bottom of the Chart

Type dialogue box, then do one of the following:

- To move the chart template from the Charts folder, to another

folder, drag it to the folder where you want to store it.

- To delete the chart template from the folder, right-click it, and

then click Delete.

Adding error bars to a chart

You can create a chart with a number of data series and then add error bars

to the data. The error bars can show a variation of a fixed amount, a

percentage, one or more standard deviations, the standard error or a

custom amount. The data for the custom variation is

normally typed into a range of cells on the worksheet. Standard error is

calculated by dividing the standard deviation by √n, where n is the number

of data points. Additionally, we can choose to show error bars above the

data point, below the data point or both.
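The standard error calculation described above can be sketched in Python. The function name and the readings are hypothetical, and the sketch assumes the sample standard deviation (the n - 1 form) is the one intended.

```python
import math
import statistics


def standard_error(values):
    """Standard deviation divided by the square root of the number of points."""
    n = len(values)
    # Assumption: sample standard deviation (n - 1 denominator).
    return statistics.stdev(values) / math.sqrt(n)


# Hypothetical data points for one series.
readings = [2, 4, 4, 4, 5, 5, 7, 9]

print(round(standard_error(readings), 4))
```

A value calculated like this for each series could be typed into a range of cells and used as the custom amount for the error bars.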


To apply error bars:

Click on the data series; make sure they are all selected.

Under Chart Tools, click on the Layout tab. In the Analysis group,

click on the arrowhead to the right of Error Bars and select More Error Bars Options… The Format Error Bars dialogue box appears:

Under Display you can choose plus, minus, both or none.

Under Error Amount select the type of error bar required. If custom is chosen, either type the references in or click on the Specify Value button. The Custom Error Bars dialogue box appears:

Click on the button with the red arrow and drag on the worksheet to select a range.


Click on the OK button on the Custom Error Bars dialogue box and click on the Close button on Format Error Bars dialogue box.

I can’t enter my custom error bars

If you type in the range for the error bars manually you may get an error when

you try and press Enter. This is because Excel needs the exact reference,

including any worksheet names. The full reference should be something like

Sheet2!b3:b19, with an exclamation mark after the worksheet name.

Alternatively, you can select from the worksheet directly as shown above.

Adding more series to a Chart

Each column or line is called a data series, since it is constructed from a

series of data points. We can create a chart using a number of series by

selecting the data before we create the chart. We might also be in the

situation where we have an existing chart and we want to add another data

series to it.

To add another series:

Select the cells containing the new values.

Copy using the Copy icon, in the Clipboard group, on the Home tab

or press Ctrl C.

Click on the chart to select it.

Click on the Home tab, in the Clipboard group, click on Paste and

select Paste Special… The Paste Special dialogue box appears:

Ensure New series is selected and click on the OK button to insert the

new series onto the chart.


Missing data points

Suppose we have a series of data points plotted as a line. If there is a value

missing, i.e. a blank cell in the range, then Excel will leave a gap

and our data series will show as two lines instead of one, although both

line fragments will have the same formatting. We can trick Excel into

including the unknown value as part of the series when plotting the data.

To include missing data points:

Click on the blank cell where the value should be.

Type in #N/A.

Hiding the #N/A

Typing in the #N/A code will make Excel treat the data as a continuous series

for plotting, but doesn’t look particularly good when printing the worksheet. We

can work around this problem by formatting the cell so that the text is the same

colour as the background.

Adding a second Y axis

We may have two data series that are difficult to plot on the same chart.

We might have one data series with values that vary between 1 and 50,

while a second series has values up to 4500. If we plot them together then

we won’t be able to see the detail on the first series because it will be very

close to the x-axis.

We could do two different charts but that makes it awkward to compare the

two series. An alternative solution would be to plot one of the data series

on a second y-axis.

To add a second y axis to a chart

Right-click on one of the data points in a data series.

Select Format Data Series… the Format Data Series dialogue box appears:

Click on Series Options and select Secondary Axis.

Click on the Close button.

I can’t add a secondary axis

We can’t add a secondary axis, either x or y, unless we have at least two data

series on the chart. Another common problem is that we can’t access the

secondary axis area on the Format Data Series dialogue box. We need to set

one of our data series to have a secondary axis first, before we can access that

area on the Format Data Series dialogue box.

Adding a second x axis

Although rarer, there is sometimes a need to have two x axes as part of the

same graph. This may be useful in an xy (scatter) chart or bubble chart.

To add a second x axis

Click a chart that displays a secondary vertical axis.

The Chart Tools appear at the top right of the screen, adding the

Design, Layout, and Format tabs.

Click on the Layout tab, in the Axes group, click Axes.

Point to Secondary Horizontal Axis and select the required option. If

you select More Secondary Horizontal Axis Options… the Format

Axis dialogue box will open and you can select the display option that

you want.

Click on the Close button.

Combination charts

If we have more than one data series on a chart then we can include more

than one chart type. For example, we could have one series plotted as a set

of columns and one plotted as a line.

To create a combination chart:

Create or select a chart with more than one data series.

Select one of the series. (We are going to apply a line chart to our

existing chart)

Click on the Insert tab, in the Charts group, click Line and choose the

style you want.

I can’t pick the series

If there are a lot of elements on a graph it can be difficult to select the one that

we want, particularly if there are a number of series and regression lines. We

can select the chart and then use the left, right, up and down arrows on the

keyboard to cycle through all the elements on the chart.

Alternatively, select the chart, click on the Layout tab, in the Current Selection

group, click on the Chart Elements arrowhead and pick a series.

X-Y Scatter charts

Many types of experimental data involve sets of numbers plotted against

each other. Depending on what options we pick, Excel can join the points

with jagged or smooth lines, or leave them as separate points. We can then

choose to add a trendline which will be a best-fit to the data. A trendline is

more normally called a regression line. We can tell Excel to try and fit the

regression line using a linear, polynomial, logarithmic, power or

exponential relationship.

Add a Trendline to a Chart

To insert a trendline:

Click on the data series in the chart to select them.

Click on the Layout tab, in the Analysis group, click on Trendline

and select More Trendline Options… Alternatively, right-click on the data series and select Add Trendline… The Format Trendline dialogue box appears:

Choose the type, and click on the Close button.

What do the options in the Format Trendline dialogue box do?

The options can be used to name the trendline by using the Trendline Name

area, extend the line forwards or backwards by using the Forecast area, and

also to display the equation and R² value. More detailed instructions are in

the section on least squares regression.

Statistics with Excel

Excel has many built-in functions, from simple mathematical calculations

like average and sum, trigonometric functions like tan and cos, to complex

financial and engineering functions. A partial list of Excel functions is

provided in Appendix B. It is advisable to check the formulae used

internally by the functions to ensure that they will give correct answers for

the data used. The Excel help system will sometimes give the formula used

for the statistic functions and the knowledge base at the Microsoft web site

can be searched for known issues with particular functions.

Descriptive statistics

Descriptive statistics are those that describe the characteristics of the data.

We will be looking at mean, trimmed mean, standard deviation, median,

maximum, minimum, skew and kurtosis.

Mean

The mean is the arithmetic average of the data.

Trimmed mean

The mean may be distorted by values that are untypical or wrong. A

trimmed mean calculates the average by excluding the most extreme data

in pairs. For example, we can calculate the average with the highest and

lowest values removed, or with the two highest and two lowest values

removed. The trimmed mean always removes the data values in pairs.

Standard deviation

The standard deviation is a measure of how much the data is dispersed from the mean. For a normal distribution, about 68% of the values lie within 1 standard deviation of the mean, about 95% lie within 2 standard deviations, and about 99.7% lie within 3 standard deviations.

Median

The median is the middle value in an ordered set of data points, or the

average of the two middle values if there is an even number of data points.

Maximum and minimum

The largest and smallest values in a data set.

Skew

A measure of how asymmetric the distribution is. If the chart extends

further to the left of the mean than it does to the right, then the distribution

of the data has negative skewness. If the chart extends further to the right

of the mean than it does to the left then the distribution has positive

skewness.

Kurtosis

Kurtosis is a measure of how ‘peaked’ a distribution is. On the scale used by Excel, the normal distribution has a kurtosis value of zero, and is sometimes referred to as being mesokurtic. A negative number indicates the data is less peaked than the normal distribution, which is sometimes called platykurtic. A positive number indicates the data is more peaked than the normal distribution; the term leptokurtic is sometimes used in this circumstance.

There are different ways of calculating the kurtosis statistic. If you are

comparing your calculations to published values, try to ensure that the

same statistical formula is being used for both data sets.
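The descriptive statistics above can be sketched outside Excel to check a worksheet result. The short Python example below is only an illustration: the data values and the function names are made up for this sketch, and the skew and kurtosis formulas are the small-sample versions that the Excel help describes for SKEW and KURT, so other packages may give different numbers.

```python
import statistics


def trimmed_mean(values, pairs=1):
    """Mean after dropping the `pairs` lowest and `pairs` highest values."""
    ordered = sorted(values)
    if 2 * pairs >= len(ordered):
        raise ValueError("not enough data left after trimming")
    kept = ordered[pairs:len(ordered) - pairs]
    return sum(kept) / len(kept)


def sample_skew(values):
    """Sample skewness: n/((n-1)(n-2)) * sum(((x - mean)/s)^3)."""
    n = len(values)
    mean = statistics.mean(values)
    s = statistics.stdev(values)  # sample standard deviation (n - 1 divisor)
    return n / ((n - 1) * (n - 2)) * sum(((x - mean) / s) ** 3 for x in values)


def sample_excess_kurtosis(values):
    """Sample kurtosis on the scale where a normal distribution scores zero."""
    n = len(values)
    mean = statistics.mean(values)
    s = statistics.stdev(values)
    fourth = sum(((x - mean) / s) ** 4 for x in values)
    return (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * fourth
            - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))


data = [2, 4, 4, 5, 6, 7, 30]  # hypothetical data; 30 is an untypical value

print("mean:", statistics.mean(data))
print("trimmed mean (one pair removed):", trimmed_mean(data, pairs=1))
print("median:", statistics.median(data))
print("standard deviation:", statistics.stdev(data))
print("max, min:", max(data), min(data))
print("skew:", sample_skew(data))
print("kurtosis:", sample_excess_kurtosis(data))
```

In this sketch the single large value pulls the mean well above the median, while the trimmed mean stays close to the bulk of the data, and the skew comes out positive.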

Conditional formatting for extreme values

We can use conditional formatting to ‘flag up’ when particular data values

are outside certain limits. There are two types of conditional formatting –

the conditional formatting we apply when we set up a custom format, and

the conditional formatting we apply to cells through Conditional

Formatting in the Styles group on the Home tab. It is this second option

we use to highlight extreme data points, and we can enter conditions.

The conditions in custom formats only change how the data is displayed.

It may cause it to have a certain number of digits or display in a certain

colour, but it can’t, for example, change the background colour of the cell,

or put a border around it. Conditional formatting for cells allows us to do

just that. When we create conditional formats for cells we can include a

subset of the formatting available from the Format Cells window. In

conditional formatting we can choose from border, pattern and some

options from font.

We can set up to sixty-four different conditions each with its own

formatting. This means that including the unchanged look we can have one

or more sets of formatting applied to a particular cell.

The conditions can be set based on the cell value or on the result of a

formula. The formula must reduce to a true (yes) or false (no) answer. For

example, we can’t use =SUM(b3:b7) since that would simply add up the

cells, but we could use =SUM(b3:b7)>=0 i.e. is the sum at least zero?

To create conditional formatting:

Click on the Home tab, in the Styles group, click on Conditional Formatting. You can select a preset option from the drop-down list, or choose New Rule… or Manage Rules…

Select Manage Rules… The Conditional Formatting Rules Manager dialogue box appears. Click on the New Rule… button;

the New Formatting Rule dialogue box appears.

Set a condition based on the cell value or on the result of a formula.

Click on the Format… button, choose a format and click on the OK

button to close the New Formatting Rule dialogue box and return to

the Conditional Formatting Rules Manager dialogue box.

Click on the New Rule… button to add another condition and format.

Click on OK.

To use formulae within conditional formatting:

Highlight the data range you want to apply the conditional formatting

to.

Click on the Home tab, in the Styles group, click on Conditional

Formatting.

Select New Rule… The New Formatting Rule dialogue box appears:

Select a Rule type from the list – e.g. if you choose Use a formula to

determine which cells to format, you can enter a formula that

evaluates to either a true or false result.

Choose the format to be applied, by clicking on the Format... button,

the Format Cells dialogue box appears.

Select the format you want and click on the OK button.

Click on the OK button.

An example will help to make this clearer:

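Suppose we have selected cells B4 to B24 and enter this formula:

=ABS(B4-AVERAGE($B$4:$B$24))>=STDEV($B$4:$B$24)*3

The range is written with absolute references ($B$4:$B$24) so that every cell in the selection is compared with the mean and standard deviation of the same range. The reference to B4 is left relative so that it adjusts to each cell in turn.

The B4-AVERAGE($B$4:$B$24) part takes the data value and subtracts the average to give us a number. The number shows how far the data point is from the mean. If B4 is smaller than the mean the number will be negative, so the ABS function converts it to a positive number.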

The left hand side of the formula is calculating a positive number indicating how far the data point is from the mean. This number is then compared to three times the standard deviation. In plain language, the formula is asking: is the data point at least three times the standard deviation away from the mean? The answer is either yes (true) or no (false). If the answer is true the format will be applied. (As shown in the New Formatting Rule dialogue box below)
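The test made by the formula above can be sketched in a few lines of Python. This is only an illustration under assumed data: the values and the function name flag_extreme are hypothetical, and statistics.stdev is used because, like STDEV, it treats the data as a sample.

```python
import statistics


def flag_extreme(values, n_sd=3):
    """True for each value at least n_sd sample standard deviations from the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [abs(x - mean) >= n_sd * sd for x in values]


# Hypothetical column of 21 values, standing in for cells B4 to B24.
values = [9, 10, 11] * 6 + [10, 10] + [100]

flags = flag_extreme(values)
for value, flagged in zip(values, flags):
    if flagged:
        print(value, "would be formatted")
```

Note that the mean and standard deviation are worked out once, from the whole range, and every value is compared against them. This is what the worksheet formula does when it is applied to each cell in the selection.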

You could also use the option – Format only values that are above

or below average and select the appropriate number of standard

deviations below or above the mean.

Ensure you select the range of cells you want to apply the condition to.

Let’s say we want to see data point(s) that are two standard deviations

below the mean for the selected range.

On the New Formatting Rule dialogue box, under Format values that are: click on the arrowhead and select 2 std dev below

Click on the Format… button and select a format – e.g. you could select a fill colour to highlight all the cells that are 2 std dev below the mean.

Click on the OK button to apply the formatting rule to the data range.

To manage the rules:

Click on the Home tab, in the Styles group, click on Conditional

Formatting.

Select Manage Rules… the Conditional Formatting Rules Manager dialogue box appears:

The Conditional Formatting Rules Manager allows you to create,

edit, delete and view all your conditional formatting rules.

To delete or edit conditional formatting:

Click on the Home tab, in the Styles group, click on Conditional

Formatting.

Select Manage Rules…, the Conditional Formatting Rules

Manager dialogue box appears.

Select the rule you want to change and click on the Edit Rule… or

Delete Rule button.

Click on the OK button.

Formula order!

The order in which the rules are evaluated can be very important. Rules are evaluated in the order in which they are listed in the Conditional Formatting Rules Manager, from top to bottom, and each new rule is added to the top of the list, so the rules end up in the reverse of the order in which they were created.

For example, we can set three conditions that test if the data is one, two or three standard deviations away from the mean. For this to work, we first need to test for data that is more than three standard deviations away, then two, and then one. If the rules are not listed in this order, use the Move Up and Move Down arrows in the Rules Manager to rearrange them.

Suppose the data point is between two and three standard deviations away

from the mean. If we test for three standard deviations first the answer will be

false, the format won’t be applied and condition two (more than two standard

deviations) will be tested. This time the answer is true and the format is applied.

If we do the formulae in the opposite order and test for one standard deviation

first the formula will calculate to true, and the format will be applied even though

the data point is more than two standard deviations away. Condition two to test

for more than two standard deviations away will never be reached.
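The effect of rule order can be illustrated with a small Python sketch. It models only the behaviour described above, where the first condition that is true decides the format; the rule lists, colour names and function name are hypothetical.

```python
def first_matching_format(distance_in_sd, rules):
    """Return the format of the first rule whose test is true, or None."""
    for threshold, fmt in rules:
        if abs(distance_in_sd) >= threshold:
            return fmt
    return None


# Each rule is (number of standard deviations, format to apply).
strictest_first = [(3, "red"), (2, "orange"), (1, "yellow")]
loosest_first = [(1, "yellow"), (2, "orange"), (3, "red")]

# A data point 2.5 standard deviations away from the mean.
print(first_matching_format(2.5, strictest_first))  # orange
print(first_matching_format(2.5, loosest_first))    # yellow
```

With the strictest test first the point gets the format intended for two standard deviations. With the loosest test first, the one standard deviation rule always wins and the other two rules are never reached.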

Excel add-ins

The standard capabilities of Excel can be extended by using add-ins. Some

of the add-ins are produced by Microsoft, but there are also many other

add-ins produced by other companies and individuals, for example the

Chart Tools add-in mentioned in the section on resizing charts.

To activate or deactivate add-ins:

Click on the Office button, top left of the screen, click on Excel

Options button, and the Excel Options window opens.

Select Add-Ins and click on the Go… button, to the right of Manage: Excel Add-ins

The Add-Ins dialogue box appears:

Click to add ticks to activate an Add-in, or remove the ticks to

deactivate the Add-in.

Click on the OK button. Notice that a new Analysis group is added to the

right of the Data tab, displaying Data Analysis and Solver.

Producing Histograms

A histogram is not the same as a column graph. If we create a column

graph then the individual data points will be plotted. What we want with

the histogram is to plot a summary of our data, for example the frequency

with which a particular value or range of values occurs. The categories into

which we summarise the data are called bins.

To create a histogram:

Check that the Analysis group appears on the Data tab. If it does not, you will need to add it to the Ribbon in order to access the histogram tool; see the section above on Excel add-ins.

Enter the bin values into some cells.

Click on the Data tab, in the Analysis group, click on Data Analysis,

the Data Analysis dialogue box appears:

Under Analysis Tools, scroll down and click on Histogram and click

on the OK button, the Histogram dialogue box appears:

Click in the Input Range: box and either type the cell references or select them from the worksheet by clicking on the Collapse/Expand

button to the right of the box.

Click in the Bin Range: box and either type the cell references or

select them from the worksheet.

The default location for the results is on a new worksheet. Click in the

New Worksheet Ply: box and type the name for the new worksheet.

Alternatively, click on the Output Range: option, click in the white box

to the right and either type a cell reference or click on a single cell on

the worksheet. This single cell is the top left-hand corner of the output.

Tick Chart Output and click on the OK button to generate the

histogram on the worksheet.

The disadvantage with the histogram produced by the Data Analysis option

is that the graph won’t change if the data it was based on subsequently

changes; the graph is not dynamic.

What bin ranges can I use?

We can use whatever bin ranges we need to use, with the advantage that they

don’t have to be equal sizes. We could leave the bin range blank and Excel will

choose equal-sized bins for us, but there is no guarantee of getting sensibly-

sized bins.

I don’t have Data Analysis on my Ribbon

In order to use the ‘Data Analysis…’ functions on the Ribbon, we need to have

the Analysis ToolPak add-in installed. See the section on Excel add-ins

for instructions on how to do this.

Dynamic histograms

We can use one of Excel’s built-in functions to create a dynamic

histogram. The function is the ‘Frequency’ function, which is an array

function.

To create a dynamic histogram:

In a blank area of the worksheet enter the values for the bins.

Select a range of cells for the output to go into. The number of cells

selected should be one more than the number of bins.

Type the formula ‘=Frequency(data_range,bin_range)’.

Press CTRL SHIFT ENTER.

Plot the bins and frequency results as a column chart.
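To see why the output range needs one more cell than the number of bins, the counting rule can be sketched in Python. This is an illustration rather than Excel's own code: it assumes the bin values are in ascending order, that each bin counts the values up to and including its bin value, and that the extra cell counts everything above the last bin. The data and the function name are hypothetical.

```python
def frequency(data, bins):
    """Count values per bin; the final count is for values above the last bin."""
    edges = sorted(bins)
    counts = [0] * (len(edges) + 1)
    for x in data:
        for i, edge in enumerate(edges):
            if x <= edge:
                counts[i] += 1
                break
        else:
            counts[-1] += 1  # larger than the last bin value
    return counts


data = [1, 2, 2, 3, 5, 7, 8, 10, 12]  # hypothetical measurements
bins = [2, 5, 10]

print(frequency(data, bins))  # [3, 2, 3, 1]
```

Three bins give four counts: up to 2, above 2 and up to 5, above 5 and up to 10, and above 10.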

Least squares regression

Least squares regression is a way of fitting a best-fit line to a set of data

points and deriving an equation that describes the line. The line is

described by the equation:

y = mx + c

Where m is the slope of the line and c is the intercept with the y-axis.

The data points consist of an independent variable (our known data, plotted

on the x-axis) and a dependent variable (our measured data, plotted on the

y-axis). Least squares regression has a number of assumptions. Firstly, we

assume that there are no errors in our known data (x values) and that all the

errors are in our measured data (y values). Secondly, we assume that there

is a linear relationship between the known values and the measured values.

We also assume that the residuals are normally distributed with a mean of

zero and that the variance of the errors is constant for all values of X. By

residuals or errors, we mean the vertical distance between the best fit line

and the data point.

As with all statistical methods, we would need to test that the assumptions

are met before we can draw any valid conclusions from the data.

To add a regression line to a data series:

Right click on one of the data points in the series.

Click Add Trendline…

Under Trend/Regression Type, select Linear.

Click to tick Display equation on chart and Display R-squared

value on chart.

Click on the Close button.

Should I set an intercept?

You should be very careful about specifying an intercept through zero, since

that forces the line through that point, and may result in a much poorer fit.

Calculating linear regression coefficients

We can also calculate the line parameters directly by using the formula

‘Linest’. The LINEST function has the syntax:

=LINEST(y values,x values,const,stats)

The x and y values are the known values. Const can be set to either true or

false. If it is set to true Linest will calculate an intercept. If it is set to false,

no constant will be produced and the line will be forced through zero. The

stats option controls whether we only get the slope and intercept or

whether we get some additional statistics.

The Linest function is an array function and we need to select multiple

cells and use CTRL SHIFT ENTER to enter the formula. An example, with the y values in C4:C20 and the x values in B4:B20, would

be:

=LINEST(C4:C20,B4:B20,true,true)

To calculate linear regression coefficients:

Select two cells in a row or ten cells (two columns, five rows) if the

extra statistics are required.

Type in: =LINEST(

Select or type the cell references for the known y values and the

known x values.

Enter True to calculate a constant.

Type True to generate extra statistics or false to just produce the

slope and constant.

Close the brackets and press CTRL-SHIFT-ENTER.

Without the additional statistics we get two numbers, which are:

slope (m)                    0.040138    14.30746    constant (c)

With the additional statistics we get ten numbers:

slope (m)                    0.040138    14.30746    constant (c)
standard error of m          0.003231    0.579939    standard error of c
r-squared                    0.900797    1.144838    standard error of y
F statistic                  154.3656    17          degrees of freedom
regression sum of squares    202.3199    22.28112    residual sum of squares

The function only produces numbers. The descriptions alongside them have been added for clarity.
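The ten statistics are related to each other, and it can help to see how they are derived. The Python sketch below computes the same quantities for a straight-line fit with a constant, using the standard least squares formulas. It is an illustration, not Excel's internal method, and the data values, function name and dictionary keys are hypothetical.

```python
import math


def linest_stats(xs, ys):
    """Slope, constant and the extra regression statistics for y = m*x + c."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

    m = sxy / sxx
    c = mean_y - m * mean_x

    ss_resid = sum((y - (m * x + c)) ** 2 for x, y in zip(xs, ys))
    ss_total = sum((y - mean_y) ** 2 for y in ys)
    ss_reg = ss_total - ss_resid
    df = n - 2  # two parameters (m and c) were estimated

    se_y = math.sqrt(ss_resid / df)
    return {
        "slope": m,
        "constant": c,
        "se_slope": se_y / math.sqrt(sxx),
        "se_constant": se_y * math.sqrt(1 / n + mean_x ** 2 / sxx),
        "r_squared": ss_reg / ss_total,
        "se_y": se_y,
        "F": ss_reg / (ss_resid / df),
        "df": df,
        "ss_reg": ss_reg,
        "ss_resid": ss_resid,
    }


xs = [1, 2, 3, 4, 5]             # hypothetical known x values
ys = [2.1, 3.9, 6.2, 7.8, 10.1]  # hypothetical measured y values

stats = linest_stats(xs, ys)
for name, value in stats.items():
    print(name, value)
```

The same relationships can be checked against the Linest output shown above: dividing the regression sum of squares by the residual sum of squares over the degrees of freedom, 202.3199 / (22.28112 / 17), gives approximately 154.37, the F statistic.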

Linest warning

There are known problems with Linest. It can produce meaningless or incorrect

statistics, and it may have problems with some datasets or those containing

large numbers. Adding a trendline to a chart will produce a better result.

Calculating best fit values

The Trend function can be used to generate a set of y values using the

known x values and the best fit line. The syntax of the trend function is:

=Trend(y values, known x, new x, const)

The x and y values are the known values, and the new x values are

additional values for which we want to generate more y values. Const, as

in Linest, will force the best fit line through zero if set to false.

Calculating r²

The r² statistic gives a measure of how well the fitted line fits the data. The statistic ranges from 1.0 for a perfect correlation, to 0 for a set of data with no correlation between the values.

Reduced Major Axis regression

Suppose we are looking at the relationship between the weight and length

of a particular organism. Both weight and length are measured quantities

and there will be errors when we take the measurements, but least squares

regression assumes that the errors are only in the y (dependent) variable.

This means that if we use least squares we may find that our results are not

valid. The solution is to use reduced major axis regression, which takes into

account errors in both variables. We can calculate the equation of a straight

line y = mx + c with RMA by using:

$m = s_y / s_x$, where $s_y$ and $s_x$ are the standard deviations of the y and x variables, and

$c = \bar{y} - m\bar{x}$, where $\bar{x}$ and $\bar{y}$ are the means of the x and y data.

If there is a power relationship between the variables, as there often is with

biological growth data, then a linear equation can be produced by taking

logs.
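The two RMA formulas above are simple enough to sketch in Python. The example is an illustration with made-up data and a hypothetical function name. It follows the formulas exactly as given, so it assumes the two variables increase together; for a decreasing relationship the slope would need a negative sign.

```python
import statistics


def rma_line(xs, ys):
    """Slope and constant of the reduced major axis line y = m*x + c."""
    # Slope is the ratio of the standard deviations (assumes a positive trend).
    m = statistics.stdev(ys) / statistics.stdev(xs)
    # The line passes through the point (mean of x, mean of y).
    c = statistics.mean(ys) - m * statistics.mean(xs)
    return m, c


lengths = [1, 2, 3, 4]  # hypothetical x values
weights = [2, 4, 6, 8]  # hypothetical y values

m, c = rma_line(lengths, weights)
print("slope:", m)
print("constant:", c)
```

In a worksheet the same calculation could be made with the STDEV and AVERAGE functions listed in Appendix B.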

Multiple regression

If we are trying to predict a child’s height from their age and weight then

we have two sets of x values (age and weight) and one set of y values

(height). This type of situation is called multiple regression.

Because we have extra sets of values we would need to select extra cells

when we are creating array formulae such as Linest or Trend. For example,

for the age, weight and height example we would select three cells in a row

rather than two and the result would be:

Slope (m) for x2    Slope (m) for x1    Intercept (c)
-6.91524            2.795071            11.62541

Likewise, if we wanted the additional statistics we would select three

columns and five rows.

Row    Column 1    Column 2    Column 3    Contents (left to right)
1      -6.9152     2.7951      11.6254     slope (m) for x2, slope (m) for x1, constant (c)
2      38.0062     2.4435      24.8179     standard error of m for x2, standard error of m for x1, standard error of c
3      0.9501      2.7634      #N/A        r-squared, standard error of y
4      9.5107      1.0000      #N/A        F statistic, degrees of freedom
5      145.2513    7.6362      #N/A        regression sum of squares, residual sum of squares

The headings and descriptions have been added for clarity.

Calculating polynomial regression coefficients

We can add a polynomial trendline to a chart but quite often we don’t get a

very good fit because we can only go up to order six, i.e. we have an

expression starting with an x⁶ term.

We can use Trend to generate fitted y values using a polynomial of an

order greater than six, and then use Linest to calculate the coefficients.

We do this by entering formulae in columns to generate the x values to the

power we require. So the first column would contain x, the second x², the

third x³ and so on. The Trend function would then be used with our known

y values and our new x value columns as the known x values. This would

calculate the new, fitted y values.

We can then use Linest to calculate the coefficients with our known y

values, and the new x value columns.
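The idea of treating each power of x as a separate column can be sketched in Python. The example below builds one column per power and solves the least squares equations directly. It is a minimal illustration under stated assumptions: the data and function names are hypothetical, the method (normal equations with Gaussian elimination) is chosen for brevity and is not necessarily what Excel uses, and it returns the constant first, whereas Linest lists the highest power first and the constant last.

```python
def solve_linear_system(a, b):
    """Solve a * coef = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= factor * aug[col][k]
    coef = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(aug[r][k] * coef[k] for k in range(r + 1, n))
        coef[r] = (aug[r][n] - known) / aug[r][r]
    return coef


def polynomial_fit(xs, ys, order):
    """Least squares coefficients [c, b1, b2, ...] for y = c + b1*x + b2*x^2 + ..."""
    # One column per power of x (power 0 is the column of ones for the constant).
    columns = [[x ** p for x in xs] for p in range(order + 1)]
    # Normal equations: (X^T X) coef = X^T y
    xtx = [[sum(u * v for u, v in zip(ci, cj)) for cj in columns] for ci in columns]
    xty = [sum(u * y for u, y in zip(ci, ys)) for ci in columns]
    return solve_linear_system(xtx, xty)


def fitted_values(xs, coef):
    """Fitted y values from the coefficients, as Trend would generate."""
    return [sum(c * x ** p for p, c in enumerate(coef)) for x in xs]


xs = [0, 1, 2, 3, 4, 5]        # hypothetical known x values
ys = [1, 6, 17, 34, 57, 86]    # generated from y = 1 + 2x + 3x^2

coef = polynomial_fit(xs, ys, order=2)
print(coef)
print(fitted_values(xs, coef))
```

This simple approach loses accuracy quickly as the order rises, because the columns of high powers become very large and very similar to one another.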

Linest accuracy

Linest can deviate from an accurate least squares fit for polynomials of order 3

or higher, i.e. anything containing an x³ or higher power. Linest can be very inaccurate

for higher order polynomials. If we want accurate answers rather than just

initially exploring the data, then a dedicated statistical package should be used.

Confidence intervals

We can calculate a confidence interval for the mean of a set of data using Student’s t distribution. To calculate the confidence interval we need to know the number of observations, the average, and the standard deviation. The first step is to calculate the t statistic.

In order to work out the t statistic we need to know the probability of the result occurring by chance and the degrees of freedom. If we want a 95% confidence level then the probability of a chance result is 5%. The degrees of freedom is one less than the number of observations. The t statistic is calculated using the TINV function. This function has the syntax:

=TINV(probability, degrees of freedom)

We then calculate the half-width by multiplying the t statistic by the standard error, which is the standard deviation divided by the square root of the number of observations. This gives us the distance either side of the sample mean within which the true mean is likely to lie, at the confidence level chosen. To find the upper confidence limit we add the half-width to the mean and to find the lower confidence limit we subtract the half-width from the mean.
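The steps above can be sketched in Python as follows. The standard library has no equivalent of TINV, so this illustration takes the t statistic as an input: the value 2.262 is what =TINV(0.05, 9) gives to three decimal places for ten observations at the 95% level. The data and the function name are hypothetical.

```python
import math
import statistics


def confidence_interval(values, t_value):
    """Lower and upper confidence limits for the mean, given the t statistic."""
    n = len(values)
    mean = statistics.mean(values)
    standard_error = statistics.stdev(values) / math.sqrt(n)
    half_width = t_value * standard_error
    return mean - half_width, mean + half_width


observations = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]  # hypothetical data, n = 10
t_value = 2.262  # from =TINV(0.05, 9): 5% probability, 10 - 1 degrees of freedom

lower, upper = confidence_interval(observations, t_value)
print("lower limit:", lower)
print("upper limit:", upper)
```

The interval is symmetric about the sample mean, and it narrows as the number of observations increases because the standard error falls.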

Formulae across worksheets

Sometimes we may need to enter a formula that refers to a cell on another

worksheet, or even a separate workbook (Excel file).

When we enter a formula to refer to a separate worksheet the worksheet

name is typed in with a ! between the sheet name and the cell reference.

For example, the following reference refers to cell C9 on the Arrays

worksheet:

=Arrays!C9

When we want a formula to refer to a separate workbook we need to type

the workbook name (including the .xlsx extension) in square brackets and

surround the workbook and worksheet names with single quotes. For

example to refer to cell C3 on the Arrays worksheet in the Stats workbook

we would use:

='[Stats.xlsx]Arrays'!C3

Excel Resources

There are a number of online resources available to help.

Websites

www.j-walk.com John Walkenbach’s site, useful for

links, tips and example worksheets.

www.bmsltd.ie Stephen Bullen’s site, with many

examples of programming and

charting techniques.

www.cpearson.com Chip Pearson’s site.

Microsoft has a support site at http://support.microsoft.com, which has a

section dedicated to Microsoft Office and the different versions of Excel.

Newsgroups

Newsgroups can also be a useful source of information. Microsoft hosts a

number of newsgroups about Excel on its own server. For this to work, our

newsgroup reader needs to be set to use msnews.microsoft.com. Useful

newsgroups include:

Microsoft.public.excel

Microsoft.public.excel.worksheetfunctions

Microsoft.public.excel.programming

Microsoft.public.excel.charting

Search engines

Searching using some keywords related to the problem is often productive.

For example, typing in “Excel” and a function name into a search engine

can often produce useful help or identify known problems.

Appendix A – Custom formatting codes

Number codes

0 Digit placeholder. This code adds zeros to fill

the format.

# Digit placeholder. This code does not display

extra zeros.

? Digit placeholder. This code leaves a space

for insignificant zeros but doesn’t display

them.

. (full stop) Inserts a decimal point.

% Percentage.

, (comma) Thousands separator.

E+ E- e+ e- Scientific notation.

Text codes

$ - + / ( ) Literal characters displayed in the number.

For any other characters enclose them in

quotes or place a backslash before them.

\ This code displays the following character.

“text” This code displays the quoted text.

* This character repeats the next character to

fill the column width.

_ (underscore) This code leaves space equal to the width of

the next character.

@ This code is the text placeholder.

Date codes

m Month as a number without a leading zero (1

– 12).

mm Month as a number with a leading zero (01 –

12).

mmm Month as an abbreviation (Jan – Dec).

mmmm Full month (January, February, etc).

d Day without a leading zero (1 – 31).

dd Day with a leading zero (01 – 31).

ddd Weekday as an abbreviation (Sun, Mon, etc).

dddd Full day (Monday, Tuesday, etc).

yy Year as a two digit number.

yyyy Year as a four digit number.

Time codes

h Hours as a number without a leading zero (0

– 23).

hh Hours as a number with a leading zero (00 –

23).

m Minutes as a number without a leading zero

(0 – 59).

mm Minutes as a number with a leading zero (00

– 59).

s Seconds as a number without a leading zero

(0 – 59).

ss Seconds as a number with a leading zero (00

– 59).

AM/PM am/pm Time based on the twelve-hour clock.

[code] Elapsed time.

Appendix B – Excel functions

Some of the Excel functions that can be used for analysing data are

listed in this appendix. Further help can be found by starting the Excel

help system and searching by typing the function name in the keyword

box on the index page.

Statistical functions

Avedev The function works out the deviations of the data from the mean of the

data and then calculates the average of the deviations.

Average This function calculates the average of a group of cells.

Confidence This function calculates a confidence interval based on the standard

deviation and size of the sample.

Correl The Correl function calculates the correlation coefficient between two

sets of data.

Count Counts the number of number values in a group of cells. The function

ignores text and logical values.

Counta Counts the number of values in a group of cells. This function includes

any text or logical values in the count.

Fisher The Fisher transformation produces a normally distributed result from

skewed data. It is typically used to transform correlation coefficients

before testing for significance.

Fisherinv This function produces the inverse of the Fisher transformation.

Frequency This function is an array function and produces the frequency

distribution of a list of values.

Kurt The function calculates the kurtosis of a set of data which indicates how

peaked or flat the distribution is compared to the normal distribution.

Large This function produces the nth largest number from a data set.

Max The result of this function is the largest number within a set of data.

Median The result of this function is the middle value of a data set, or the

average of the middle two values if there is an even number of data

points.

Min This function produces the smallest number within a set of data.

Mode The result of this function is the most common data value within the

data set.

Rank This function produces a number. The number would be the position of

the number if the list were sorted in ascending or descending order.

Small This function produces the nth smallest value from a data set.

Standardize This function produces a normalised number from a distribution with a

particular mean and standard deviation.

Stdev The function estimates the standard deviation of the population from a

sample.

StdevP The function calculates the standard deviation of the entire population.

Trimmean This function calculates the mean of a data set after excluding a

percentage of the most extreme data points.

Var The function estimates the variance of the population from a sample.

VarP This function calculates the variance of the data, treating it as the whole

population.

Curve fitting functions

Forecast This function calculates a value from existing values. It is used to

extrapolate from data with a linear fit.

Growth This function calculates a value from existing values. It is used to

extrapolate from data with an exponential fit.

Intercept The function calculates where a linear fit to the data would cross the y axis.

Linest This function is used in an array formula to calculate the slope and

intercept using a linear fit to the data. Additional statistical information

can also be produced.

Logest This function is used in an array formula to calculate the parameters of an exponential curve (y = b*m^x) fitted to the data. Additional statistical information can also be produced.

Rsq The Rsq function calculates the goodness of fit statistic (r²) between series of data.

Slope This function calculates the slope of a best fit line.

Trend This function produces values for a best fit line, assuming a linear

relationship for the data.
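
A short sketch of how the curve fitting functions might be entered is given below. The layout is an assumption for illustration: x values in A2:A11, y values in B2:B11, and 12 as the new x value to predict for. Note that the y range is given first. The text to the right of each formula is a description and is not typed into the cell.

```excel
=SLOPE(B2:B11, A2:A11)           slope of the best fit straight line
=INTERCEPT(B2:B11, A2:A11)       where that line crosses the y axis
=RSQ(B2:B11, A2:A11)             r squared for the two series
=FORECAST(12, B2:B11, A2:A11)    predicted y at x = 12, linear fit
=TREND(B2:B11, A2:A11, 12)       predicted y at x = 12, linear fit
=GROWTH(B2:B11, A2:A11, 12)      predicted y at x = 12, exponential fit
=LINEST(B2:B11, A2:A11)          slope and intercept (array formula)
=LOGEST(B2:B11, A2:A11)          m and b in y = b*m^x (array formula)
```

For Linest and Logest, select two cells side by side, type the formula and press Ctrl+Shift+Enter. The first cell receives the slope (or m) and the second the intercept (or b).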

Distribution functions

Binomdist This function produces the probability of an individual term in the binomial distribution. An option allows us to calculate the cumulative probability instead.

Chidist The answer from this function is the one-tailed probability of the chi-squared distribution. This function is often used in hypothesis testing.

Chiinv This function returns the inverse of the chi-squared distribution.

Fdist The function produces the F probability distribution.

Finv The function produces the inverse of the F probability distribution.

Negbinomdist The function produces the negative binomial distribution.

Normdist This function produces the normal distribution using a specified mean and standard deviation. A final argument lets us choose between the probability density and the cumulative distribution.

Norminv This function produces the inverse of the normal distribution.

Normsdist The result from this function is the standard normal distribution, i.e. a

normal distribution with a mean of zero and a standard deviation of one.

Normsinv The function is the inverse of the standard normal distribution.

Poisson The function produces the Poisson distribution.

Tdist The function produces a probability using Student’s t-distribution.

Tinv This function produces the t-value for a given two-tailed probability and degrees of freedom.

Weibull The function produces the Weibull distribution.
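
The formulas below sketch how a few of the distribution functions might be used. All the numbers are illustrative values chosen for this example, not taken from the course data. The text to the right of each formula is a description and is not typed into the cell.

```excel
=BINOMDIST(3, 10, 0.5, FALSE)    probability of exactly 3 successes in 10 trials
=BINOMDIST(3, 10, 0.5, TRUE)     probability of 3 or fewer successes in 10 trials
=NORMDIST(110, 100, 15, TRUE)    cumulative probability up to 110, mean 100, sd 15
=NORMSDIST(0)                    cumulative standard normal probability at z = 0
=NORMSINV(0.975)                 z value with 97.5% of the distribution below it
=CHIDIST(3.84, 1)                right-tail probability for chi-squared 3.84, 1 df
=CHIINV(0.05, 1)                 chi-squared value with 5% in the right tail, 1 df
=TDIST(2, 20, 2)                 two-tailed probability for t = 2 with 20 df
=TINV(0.05, 20)                  two-tailed 5% critical value of t with 20 df
```

As a check, the first two formulas return 0.1171875 and 0.171875, Normsdist(0) returns 0.5, Normsinv(0.975) returns about 1.96, Chiinv(0.05, 1) returns about 3.84 and Tinv(0.05, 20) returns about 2.086.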

Significance test functions

Chitest The function returns the probability from the chi-squared distribution for the test statistic, with the appropriate degrees of freedom. It is often used to compare observed results against those expected under a null hypothesis using discrete (count) data.

Ftest The F test is designed to test if the variances of two populations are equal. The function returns the two-tailed probability that the variances of the two data sets are not significantly different.

Ttest The function returns the probability associated with a Student’s t-test.

Ztest The function returns the one-tailed probability value of a z test, which compares the mean of the data set with a hypothesised population mean.
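
A minimal sketch of the significance test functions follows. The ranges are assumptions for illustration only: two samples in A2:A11 and B2:B11, observed counts in D2:E3, expected counts in G2:H3, and 50 as a hypothesised population mean. The text to the right of each formula is a description and is not typed into the cell.

```excel
=TTEST(A2:A11, B2:B11, 2, 1)     two-tailed, paired samples
=TTEST(A2:A11, B2:B11, 2, 2)     two-tailed, two samples with equal variance
=TTEST(A2:A11, B2:B11, 2, 3)     two-tailed, two samples with unequal variance
=FTEST(A2:A11, B2:B11)           two-tailed probability for equality of variances
=CHITEST(D2:E3, G2:H3)           probability for observed against expected counts
=ZTEST(A2:A11, 50)               one-tailed probability against a mean of 50
```

Each of these formulas returns a probability rather than the test statistic itself. A small value, for example below 0.05, is usually taken as evidence against the null hypothesis.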