
  • 7/30/2019 2985277 Statistics Using Excel 2007


    Statistics using Excel 2007

    Tsagris Michail

    MSc in Statistics

    Email: [email protected]

    Athens and Nottingham 2012


    Statistics Using Excel 2007 http://statlink.tripod.com


    Statistics Using Excel 2007 Tsagris Michail


    1.1 Introduction
    2.1 Data Analysis ToolPak
    2.2 Descriptive Statistics
    2.3 Z-test for two samples
    2.4 t-test for two samples assuming unequal variances
    2.5 t-test for two samples assuming equal variances
    2.6 F-test for the equality of variances
    2.7 Paired t-test for two samples
    2.8 Ranks, Percentiles, Sampling, Random Numbers Generation
    2.9 Covariance, Correlation, Linear Regression
    2.10 One-way Analysis of Variance
    2.11 Two-way Analysis of Variance with replication
    2.12 Two-way Analysis of Variance without replication
    3.1 Statistical Functions
    3.2 Spearman's (non-parametric) correlation coefficient
    3.3 Wilcoxon Signed Rank Test for a Median
    3.4 Wilcoxon Signed Rank Test with Paired Data
    4.1 The Solver add-in


    1.1 Introduction

    One of the reasons these notes were written was to help students, and not only students,
    perform some statistical analyses without having to use statistical software such as
    Splus, SPSS or Minitab. It is not reasonable to expect Excel to offer as many analysis
    options as a statistical package, but it does reasonably well nonetheless. The areas
    covered by these notes are: descriptive statistics, the z-test for two samples, the
    t-test for two samples assuming equal or unequal variances, the paired t-test for two
    samples, the F-test for the equality of variances of two samples, ranks and percentiles,
    sampling (random and periodic, or systematic), random number generation, Pearson's
    correlation coefficient, covariance, linear regression, one-way ANOVA, two-way ANOVA
    with and without replication, and the moving average. We will also demonstrate the use
    of non-parametric statistics in Excel for some of the previously mentioned techniques.
    Furthermore, the results provided by Excel are compared informally with those of SPSS
    and some other packages to check for discrepancies. Note that these notes do not contain
    the theory underlying the techniques used; they show how to cope with statistics using
    Excel.

    The first edition appeared in May 2008. In this second edition (July 2012) we have
    added the Solver add-in. It allows us to perform linear numerical optimization
    (maximization or minimization) with or without linear constraints. It also makes it
    possible to solve a system of equations, again with or without linear constraints. I am
    grateful to Vasilis Vrysagotis (teaching fellow at the Technological Educational
    Institute of Chalkis, [email protected]) for his contribution.

    If you find any mistakes, disagree with something stated here, or want to ask anything
    else, please send me an e-mail. For more statistical resources the reader is referred
    to statlink.tripod.com.


    2.1 Data Analysis ToolPak

    If Excel does not offer you options for statistical analyses, you can add them very
    easily. Click on the Microsoft Office logo at the top left and a list will appear. From
    that list select Excel Options and picture 2 will appear on the screen.

    Picture 1

    In picture 2 click on Go at the bottom and the dialogue box of picture 3 will appear.
    There select Analysis ToolPak. Excel will run this command for a couple of seconds, and
    if you then select the Data tab you will see the option Data Analysis added to the list.


    Picture 2

    Picture 3

    2.2 Descriptive Statistics

    The data used in most of the examples are taken from an SPSS file and refer to car
    measurements (cars.sav). We simply copied and pasted the data into an Excel worksheet.


    The path is always the same: click on Data in the toolbar and from there choose Data
    Analysis. The dialogue box of picture 4 appears on the screen. We select Descriptive
    Statistics and click OK, which leads to the dialogue box of picture 5. In the Input
    Range box we specified the data, ranging from cell 1 to cell 406, all in one column. If
    the first row contained a label, we could declare it by ticking the corresponding
    option. We also ticked two of the last four options (Summary statistics, Confidence
    Level for Mean). As you can see, the default value for the confidence level is the
    usual 95%. The results produced by Excel are provided in table 1.

    Picture 4

    Picture 5


    Column1

    Mean                       194.0418719
    Standard Error             5.221297644
    Median                     148.5
    Mode                       97
    Standard Deviation         105.2062324
    Sample Variance            11068.35133
    Kurtosis                   -0.79094723
    Skewness                   0.692125308
    Range                      451
    Minimum                    4
    Maximum                    455
    Sum                        78781
    Count                      406
    Confidence Level(95.0%)    10.26422853

    Table 1: Descriptive Statistics

    The results are pretty much the same as those of SPSS, as they should be. There are
    only some very slight rounding differences, which are of no importance. The number of
    observations is 406, as we expected. If there are missing values, the value of Count
    will be less than the number of rows we selected. The sample variances differ slightly,
    but this is really not a problem. SPSS calculates a 95% confidence interval for the
    true mean, whereas Excel provides only the quantity used to calculate it. The
    construction of this interval is straightforward: subtract this quantity from the mean
    to get the lower limit and add it to the mean to get the upper limit of the 95%
    confidence interval.
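    The construction of the interval can be sketched in a few lines of Python, using the
    two values reported in table 1. The variable names are ours and purely illustrative;
    the quantity Excel reports is, as far as we know, the t critical value multiplied by
    the standard error of the mean.

```python
# Values copied from Table 1 (Excel's Descriptive Statistics output).
mean = 194.0418719
half_width = 10.26422853  # the "Confidence Level(95.0%)" entry

# Lower and upper limits of the 95% confidence interval for the mean.
lower = mean - half_width
upper = mean + half_width

print(f"95% confidence interval for the mean: ({lower:.4f}, {upper:.4f})")
```

    In Excel itself the same limits can be obtained by subtracting and adding the two
    cells of the output table.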

    2.3 Z-test for two samples

    The statistical packages known to the writer do not offer the z-test for two
    independent samples. Its results are pretty much the same as those of the t-test for
    two independent samples. The difference between the two tests is that, apart from the
    normality assumption, the z-test assumes that we know the true variances of the two
    populations. We used data generated from two normal distributions with mean equal to
    zero for both populations but different variances. Due to the limited options offered
    by Excel we cannot test the normality hypothesis of the data (this problem is also met
    in the later cases). Following the previously mentioned path and selecting the z-test
    for two samples from the dialogue box of picture 4, the dialogue box of picture 6
    appears on the screen. The first column contains the data of the first sample, of size
    100, while the second column contains the second sample, of size 80.


    We set the hypothesized mean difference to zero and filled the variance boxes with the
    specified variances. We generated two samples of sizes 100 and 80 in Splus with
    variances 1 and 9 respectively. The estimated variances were different from those
    specified, as they should be. The value of the z-statistic, the critical values and the
    p-values for the one-sided and two-sided tests are provided. The results are the same
    as the ones generated by Splus. With a z-statistic of about -1.146, both p-values are
    well above 0.05 (roughly 0.13 one-sided and 0.25 two-sided), so the mean difference of
    the two populations from which the data were drawn is not statistically significant at
    an alpha equal to 0.05. This agrees with the fact that both populations have mean zero.

    Picture 6

    z-Test: Two Sample for Means

                                    Variable 1      Variable 2
    Mean                            0.087931684     0.489001042
    Known Variance                  1               9
    Observations                    100             80
    Hypothesized Mean Difference    0
    z                               -1.145912454

    P(Z
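    The z-statistic in the output above can be reproduced from the reported means, the
    known variances and the sample sizes. The following is a minimal sketch in Python
    (variable names are ours); the p-values are computed from the standard normal
    distribution and are our own calculation, not copied from the Excel output.

```python
from math import sqrt
from statistics import NormalDist

# Summary values taken from the z-test output above.
mean1, mean2 = 0.087931684, 0.489001042
var1, var2 = 1.0, 9.0          # known population variances
n1, n2 = 100, 80
hypothesized_diff = 0.0

# z statistic for the difference of two means with known variances.
z = (mean1 - mean2 - hypothesized_diff) / sqrt(var1 / n1 + var2 / n2)

# One-sided and two-sided p-values from the standard normal distribution.
p_one_tail = NormalDist().cdf(-abs(z))
p_two_tail = 2 * p_one_tail

print(f"z = {z:.6f}, one-tail p = {p_one_tail:.4f}, two-tail p = {p_two_tail:.4f}")
```

    The two-sided p-value comes out at about 0.25.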


    2.4 t-test for two samples assuming unequal variances

    Theory states that when the variances of the two independent populations are not known
    (which is usually the case) we have to estimate them. The use of the t-test is
    suggested in this case (but the normality hypothesis still has to be met unless the
    sample size is large). There are two approaches: one in which we assume the variances
    to be equal and one in which we cannot assume that. We will deal with the latter case
    now. We used the same data set as before, since we know that the variances cannot be
    assumed to be equal. We will see the test of the equality of two variances later.
    Selecting the t-test assuming unequal variances from the dialogue box of picture 4, the
    dialogue box of picture 7 appears on the screen. The results generated by SPSS are the
    same except for some rounding differences. If you forget to set the hypothesized mean
    difference equal to 0, Excel will use this value by default.

    Picture 7

    t-Test: Two-Sample Assuming Unequal Variances

                                    Variable 1      Variable 2
    Mean                            0.087931684     0.489001042
    Variance                        1.060838855     7.436269015
    Observations                    100             80
    Hypothesized Mean Difference    0
    df                              97
    t Stat                          -1.246291953

    P(T
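    For readers who want to see where the t Stat and df entries come from, here is a short
    Python sketch of the unequal-variances (Welch) calculation from the summary values
    above. The variable names are ours, and the remark that Excel rounds the degrees of
    freedom to the nearest integer is our understanding of the tool, not something stated
    in the output.

```python
from math import sqrt

# Summary values taken from the t-test output above.
mean1, mean2 = 0.087931684, 0.489001042
var1, var2 = 1.060838855, 7.436269015   # estimated sample variances
n1, n2 = 100, 80

# Squared standard errors of the two sample means.
se1_sq, se2_sq = var1 / n1, var2 / n2

# Welch t statistic (hypothesized mean difference equal to 0).
t_stat = (mean1 - mean2) / sqrt(se1_sq + se2_sq)

# Welch-Satterthwaite approximation of the degrees of freedom.
df = (se1_sq + se2_sq) ** 2 / (se1_sq ** 2 / (n1 - 1) + se2_sq ** 2 / (n2 - 1))

print(f"t = {t_stat:.6f}, df = {df:.2f} (rounded: {round(df)})")
```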


    2.5 t-test for two samples assuming equal variances

    We will perform the same test assuming that the variances are equal. Following the
    usual path, the dialogue box for this test is that of picture 8.

    Picture 8

    The results are the same as the ones provided by SPSS. What is worth noting is that the
    degrees of freedom (df) in this case are equal to 178, whereas in the previous case
    they were equal to 97. The t-statistic is also slightly different. The reason is that
    different formulae are used in the two cases.

    t-Test: Two-Sample Assuming Equal Variances

                                    Variable 1      Variable 2
    Mean                            0.087931684     0.489001042
    Variance                        1.060838855     7.436269015
    Observations                    100             80
    Pooled Variance                 3.890383701
    Hypothesized Mean Difference    0
    df                              178
    t Stat                          -1.355601392

    P(T
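    The pooled variance, the degrees of freedom and the t Stat of the equal-variances test
    can be checked in the same way. This is a minimal Python sketch under the usual
    pooled-variance formula; the variable names are ours.

```python
from math import sqrt

# Summary values taken from the equal-variances t-test output above.
mean1, mean2 = 0.087931684, 0.489001042
var1, var2 = 1.060838855, 7.436269015
n1, n2 = 100, 80

# Degrees of freedom and pooled variance estimate.
df = n1 + n2 - 2
pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / df

# t statistic with the pooled variance (hypothesized difference equal to 0).
t_stat = (mean1 - mean2) / sqrt(pooled_var * (1 / n1 + 1 / n2))

print(f"pooled variance = {pooled_var:.6f}, df = {df}, t = {t_stat:.6f}")
```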


    2.6 F-test for the equality of variances

    We will now see how to test the hypothesis of the equality of variances. The dialogue
    box of picture 9 appears in the usual way by selecting the F-test from the dialogue box
    of picture 4. The results are the same as the ones provided by Splus. The p-value is
    practically equal to zero, indicating that there is evidence to reject the assumption
    of equality of the variances of the two populations at an alpha equal to 0.05.

    Picture 9

    F-Test Two-Sample for Variances

                    Variable 1      Variable 2
    Mean            0.087931684     0.489001042
    Variance        1.060838855     7.436269015
    Observations    100             80
    df              99              79
    F               0.142657407

    P(F
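    The F value in the output is simply the ratio of the two sample variances, with n - 1
    degrees of freedom for each sample. A tiny Python check (variable names are ours; the
    p-value itself needs the F distribution, which the Python standard library does not
    provide, so it is not computed here):

```python
# Summary values taken from the F-test output above.
var1, var2 = 1.060838855, 7.436269015
n1, n2 = 100, 80

# F statistic: variance of the first sample over variance of the second.
f_stat = var1 / var2

# Numerator and denominator degrees of freedom.
df1, df2 = n1 - 1, n2 - 1

print(f"F = {f_stat:.9f} with ({df1}, {df2}) degrees of freedom")
```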


    Picture 10

    t-Test: Paired Two Sample for Means

                                    Variable 1      Variable 2
    Mean                            0.096713532     0.489001042
    Variance                        1.152236542     7.436269015
    Observations                    80              80
    Pearson Correlation             -0.039932005
    Hypothesized Mean Difference    0
    df                              79
    t Stat                          -1.181296791

    P(T


    Picture 11

    The dialogue box of the sampling option is the one in picture 12. Two sampling schemes
    are available: systematic (periodic) and random sampling. In the first case you insert
    a number (the period), let's say 5, which means that the first value of the sample will
    be the one in the 5th row and the remaining values will be those of the 10th, the 15th,
    the 20th rows and so on. With the random sampling method you state the sample size and
    Excel does the rest. If you specify a number in the second option of the sampling
    method, say 30, then a sample of size 30 will be selected from the column specified in
    the first box.
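    The two sampling schemes can be illustrated with a short Python sketch. The function
    names and the data are hypothetical. We assume here that random sampling is done with
    replacement, which is how Excel's tool is generally described; this is an assumption,
    not something shown in the dialogue box.

```python
import random


def periodic_sample(values, period):
    """Systematic sampling: take rows period, 2*period, 3*period, ... (1-based)."""
    return values[period - 1::period]


def random_sample(values, size, seed=None):
    """Random sampling with replacement (assumed to mimic Excel's Sampling tool)."""
    rng = random.Random(seed)
    return [rng.choice(values) for _ in range(size)]


# Hypothetical column of 20 values standing in for a worksheet column.
column = list(range(1, 21))

print(periodic_sample(column, 5))        # rows 5, 10, 15 and 20
print(random_sample(column, 30, seed=1)) # 30 values drawn at random
```

    Because the draws are made with replacement, the requested sample size (30) may exceed
    the number of rows in the column (20).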

    Picture 12

    If you are interested in a random sample from a known distribution, then random number
    generation is the option you want to use. Unfortunately, not many distributions are
    offered. The dialogue box of this option is shown in picture 13. In the Number of
    Variables box you select how many samples you want to be drawn from the specified
    distribution. The white box below is used to define the sample size. The distributions
    offered are Uniform, Normal, Bernoulli, Binomial and Poisson. Two more options are also
    available. Different distributions require different parameters to be defined. The
    random seed is an option used to give the sampling algorithm a starting value, but it
    can be left blank as well. If we specify a number, say 1234, then the next time we
    generate a sample with the same random seed we will get the same sample.
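    The role of the random seed can be demonstrated outside Excel as well. The sketch below
    uses Python's own generator, so the numbers will not match those produced by Excel; it
    only illustrates that the same seed reproduces the same sample. The function name is
    hypothetical.

```python
import random


def normal_sample(size, mean=0.0, sd=1.0, seed=None):
    """Draw `size` values from a normal distribution, optionally with a fixed seed."""
    rng = random.Random(seed)
    return [rng.gauss(mean, sd) for _ in range(size)]


first = normal_sample(5, seed=1234)
second = normal_sample(5, seed=1234)   # same seed, so the same sample
other = normal_sample(5, seed=4321)    # different seed, so a different sample

print(first == second)
print(first == other)
```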

    Picture 13

    2.9 Covariance, Correlation, Linear Regression

    The covariance and correlation of two variables (two columns containing data) are very
    easy to calculate. The dialogue boxes of correlation and covariance are the same. For
    the correlation matrix we select Correlation from the dialogue box of picture 4. We now
    present the dialogue box of covariance.

    Picture 11


                Column 1    Column 2
    Column 1    11041.09
    Column 2    3639.343    1480.239

    Table 7: Covariance matrix

    The above table is called the variance-covariance table since it contains both of these
    measures. The first cell (11041.09) is the variance of the first column and the last
    cell (1480.239) is the variance of the second column. The remaining cell (3639.343) is
    the covariance of the two columns. The upper-right cell is left blank because the
    matrix is symmetric: its value would be the same covariance (the diagonal elements are
    the variances and the off-diagonal elements are the covariances).
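    A small Python sketch may help to read this table. To our knowledge Excel's Covariance
    tool divides by n (population formula) rather than by n - 1. Assuming the first column
    is the same 406-value variable summarised in table 1, this would explain why the
    variance shown here (11041.09) is slightly smaller than the Sample Variance of table 1
    (11068.35133). The function name and the small data set are hypothetical.

```python
def covariance(x, y, population=True):
    """Covariance of two equally long lists; divide by n or by n - 1."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return cross / (n if population else n - 1)


# Hypothetical data, only to show the two divisors.
x = [1, 2, 3, 4]
y = [2, 4, 6, 8]
print(covariance(x, y))                    # population covariance
print(covariance(x, y, population=False))  # sample covariance
print(covariance(x, x))                    # variance of x (diagonal element)

# Converting the sample variance of table 1 to the population version.
sample_variance, n = 11068.35133, 406
population_variance = sample_variance * (n - 1) / n
print(round(population_variance, 2))
```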

    The dialogue box of the linear regression option is presented in picture 12 (different
    normal data were used in the regression analysis). We fill the white boxes with the
    columns that hold the Y and X values. The X values can span more than one column (i.e.
    variable). Note that if a value is missing in any column the regression will not be
    calculated, because this function requires all columns to have the same number of
    values. The second column had 6 missing values, so we deleted these rows from both
    columns before running the regression.

    We select the Confidence Level option. We also select Line Fit Plots and Normal
    Probability Plots. The option Constant is Zero is left unticked, since we want the
    constant to be in the regression line regardless of its statistical significance. After
    pressing OK, the result appears in table 8.

    Picture 12


    Table 8: Analysis of variance table

    The Multiple R is the Pearson correlation coefficient, whereas the R Square is called
    the coefficient of determination and measures the fit of the model: it shows the
    proportion of the variability of the data explained by the linear model. The model is
    Y=-62.726+2.459*X. The Adjusted R Square is the coefficient of determination adjusted
    for the degrees of freedom of the model; this is a penalty on the coefficient. The
    p-value of the constant (1.36E-18) provides evidence that the constant is statistically
    significant, so it should stay in the model. Had it not been significant, we could run
    the regression again with Constant is Zero ticked. The results are the same as those
    generated by SPSS except for some slight rounding differences. The disadvantage of
    Excel is that it offers no normality test. The two plots constructed by Excel are also
    presented.

    SUMMARY OUTPUT

    Regression Statistics
    Multiple R           0.897
    R Square             0.805
    Adjusted R Square    0.804
    Standard Error       46.712
    Observations         400.000

    ANOVA
                  df     SS             MS          F           Significance F
    Regression    1      3579101.213    3579101     1640.289    3.0302E-143
    Residual      398    868433.6646    2181.994
    Total         399    4447534.877

                    Coefficients  Standard Error  t Stat     P-value   Lower 95%  Upper 95%  Lower 95.0%  Upper 95.0%
    Intercept       -62.726       6.779           -9.25292   1.36E-18  -76.053    -49.3984   -76.0527     -49.39
    X Variable 1    2.459         0.061           40.50048   3E-143    2.339      2.5780     2.3393       2.57
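    Several entries of the summary output are linked to one another, and these links can be
    verified from the sums of squares alone. The following Python sketch recomputes the R
    Square, the Adjusted R Square, the F statistic and the Standard Error from the ANOVA
    part of the table, and uses the fitted line for a prediction. The variable and function
    names are ours, and the X value used for the prediction is arbitrary.

```python
from math import sqrt

# Sums of squares and sizes taken from the ANOVA part of the output above.
ss_regression = 3579101.213
ss_residual = 868433.6646
n_obs = 400          # observations
k = 1                # number of X variables

ss_total = ss_regression + ss_residual
df_residual = n_obs - k - 1

# Coefficient of determination and its adjusted version.
r_squared = ss_regression / ss_total
adj_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / df_residual

# F statistic and the standard error of the regression.
ms_residual = ss_residual / df_residual
f_stat = (ss_regression / k) / ms_residual
std_error = sqrt(ms_residual)


def predict(x):
    """Fitted line reported in the text: Y = -62.726 + 2.459 * X."""
    return -62.726 + 2.459 * x


print(f"R Square = {r_squared:.3f}, Adjusted R Square = {adj_r_squared:.3f}")
print(f"F = {f_stat:.3f}, Standard Error = {std_error:.3f}")
print(f"Predicted Y at X = 100: {predict(100):.3f}")
```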


    Figure 1: Scatter plot of X and Y-Predicted Y

    Figure 2: Normal Probability Plot

    The first figure is a scatter plot of the data: the X values versus the Y values and
    the predicted Y values. The linear relation between the two variables is obvious from
    the graph; recall that the correlation coefficient had a high value. The Normal
    Probability Plot is used to check the normality of the residuals graphically. If the
    residuals follow the normal distribution, the graph should be a straight line.
    Unfortunately, the eye is often not the best judge. The Kolmogorov-Smirnov test
    conducted in SPSS provided evidence to support the normality hypothesis of the
    residuals.

    Excel also produced the residuals and the predicted values in the same sheet. We shall
    construct a scatter plot of these two sets of values in order to check (graphically)
    the assumption of homoscedasticity (i.e. constant variance of the residuals). If the
    assumption of homoscedasticity holds true, then we should see all the values within a
    band of constant width. We see that almost all values fall between -150 and 150, except
    for two values that lie outside these limits. Nevertheless, there is some evidence of a
    pattern, which suggests that the residuals do not exhibit constant variance. If we are
    not certain about the validity of the assumption, we can transform the Y values using a
    log transformation and run the regression using the transformed Y values.

    Figure 3: Residuals versus predicted values

    2.10 One-way Analysis of Variance

    The one-way analysis of variance is just the generalization of the two independent
    samples t-test. The assumptions that must be met for the results to be valid are more
    or less the same as in the linear regression case; in fact, analysis of variance and
    linear regression are two equivalent techniques. Excel produces the analysis of
    variance table but offers no options to check the assumptions of the model. The
    dialogue box of the one-way analysis of variance is shown in picture 13. As in the
    t-test cases, the values of the dependent variable are entered in Excel in different
    columns according to the level of the factor. In our example the factor has three
    levels, therefore we have three columns. After defining the range of data in the
    dialogue box of picture 13, we click OK and the results follow.


    Picture 13

    Anova: Single Factor

    SUMMARY
    Groups      Count    Sum     Average     Variance
    Column 1    15       2566    171.0667    1102.21
    Column 2    15       1736    115.7333    2022.781
    Column 3    15       1787    119.1333    2667.838

    ANOVA
    Source of Variation    SS          df    MS          F           P-value     F crit
    Between Groups         28852.04    2     14426.02    7.470973    0.001676    3.219942
    Within Groups          81099.6     42    1930.943
    Total                  109951.6    44

    Table 9: The one-way analysis of variance
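    The ANOVA part of table 9 can be rebuilt from the SUMMARY part alone, which is a useful
    check on how the table is put together. Below is a minimal Python sketch using the
    counts, sums and variances of the three groups; the variable names are ours.

```python
# (count, sum, variance) of each group, copied from the SUMMARY part of table 9.
groups = [
    (15, 2566, 1102.21),
    (15, 1736, 2022.781),
    (15, 1787, 2667.838),
]

n_total = sum(count for count, _, _ in groups)
grand_mean = sum(total for _, total, _ in groups) / n_total

# Between-groups sum of squares: group means around the grand mean.
ss_between = sum(count * (total / count - grand_mean) ** 2
                 for count, total, _ in groups)

# Within-groups sum of squares: (n - 1) times each group variance.
ss_within = sum((count - 1) * variance for count, _, variance in groups)

df_between = len(groups) - 1
df_within = n_total - len(groups)

# F statistic: ratio of the two mean squares.
f_stat = (ss_between / df_between) / (ss_within / df_within)

print(f"SS between = {ss_between:.2f}, SS within = {ss_within:.2f}")
print(f"df = ({df_between}, {df_within}), F = {f_stat:.6f}")
```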

    The results generated by SPSS are very close to the results shown above. There is some
    difference in the sums of squares, but it is of small importance. The mean square
    values (MS) are very close to one another. Yet by no means can we assume that the above
    results are valid, since Excel does not offer options for checking the assumptions.


    2.11 Two-way Analysis of Variance with replication

    In the previous section we saw the case in which one factor affects the dependent
    variable. Now we will see what happens when two factors affect the dependent variable.
    This is called the factorial design with two factors, or two-way analysis of variance.
    At first, we must enter the data in the correct way. The proper way of data entry
    follows (the data refer to the car measurements). As you can see, we have three columns
    of data representing the three levels of one factor, and the first column contains only
    three words, S1, S2 and S3. This first column states the three levels of the second
    factor. We used S1, S2 and S3 to mark the rows that hold the sample of each combination
    of the two factors. In other words, the first combination of the two factors occupies
    the cells from B2 to B8. This means that each combination of factors has 7
    measurements.

    Picture 14


    From the dialogue box of picture 4 we select Anova: Two-Factor With Replication, and
    the dialogue box that appears is shown in picture 15.

    Picture 15

    We filled the two white boxes with the Input Range and the Rows per sample. The alpha
    is at its usual value, equal to 0.05. After pressing OK the results are presented
    below. The results generated by SPSS are the same. At the bottom of table 10 there are
    three p-values: two for the two factors and one for the interaction. The row factor is
    denoted as Sample in Excel.

    A limitation of this analysis when performed in Excel is that the sample sizes in each
    combination of columns and rows (the two factors) must be equal. In other words, the
    design has to be balanced: the same number of values everywhere.


    Anova: Two-Factor With Replication

    SUMMARY     C1          C2          C3          Total

    S1
    Count       7           7           7           21
    Sum         2464        1153        1014        4631
    Average     352         164.7143    144.8571    220.5238
    Variance    4053.667    1068.905    2601.143    11461.26

    S2
    Count       7           7           7           21
    Sum         2502        1238        1516        5256
    Average     357.4286    176.8571    216.5714    250.2857
    Variance    11419.62    1414.143    12192.95    13810.81

    S3
    Count       7           7           7           21
    Sum         2472        1120        669         4261
    Average     353.1429    160         95.57143    202.9048
    Variance    5162.143    1441.667    1878.286    15121.09

    Total
    Count       21          21          21
    Sum         7438        3511        3199
    Average     354.1905    167.1905    152.3333
    Variance    6196.362    1230.362    7593.233

    ANOVA
    Source of Variation    SS          df    MS          F           P-value     F crit
    Sample                 24088.1     2     12044.05    2.628906    0.081377    3.168246
    Columns                531552.3    2     265776.1    58.0121     3.56E-14    3.168246
    Interaction            28915.9     4     7228.976    1.5779      0.19342     2.542918
    Within                 247395.1    54    4581.392
    Total                  831951.4    62

    Table 10: The two-way analysis of variance with replication
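    Each F value in table 10 is the mean square of the corresponding source divided by the
    Within mean square. The short Python sketch below reproduces the three F values and
    checks that the sums of squares add up to the total; the dictionary layout and the
    names are ours.

```python
# (sum of squares, degrees of freedom) copied from the ANOVA part of table 10.
sources = {
    "Sample": (24088.1, 2),
    "Columns": (531552.3, 2),
    "Interaction": (28915.9, 4),
}
ss_within, df_within = 247395.1, 54

# Mean square of the error term (Within).
ms_within = ss_within / df_within

# F ratio of every source: its mean square over the Within mean square.
f_ratios = {name: (ss / df) / ms_within for name, (ss, df) in sources.items()}

# The sums of squares and degrees of freedom must add up to the Total row.
ss_total = sum(ss for ss, _ in sources.values()) + ss_within
df_total = sum(df for _, df in sources.values()) + df_within

for name, f_value in f_ratios.items():
    print(f"{name}: F = {f_value:.4f}")
print(f"Total SS = {ss_total:.1f}, total df = {df_total}")
```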

    2.12 Two-way Analysis of Variance without replication

    We will now see another case of the two-way ANOVA, in which each combination of factors
    has only one measurement. In this case we need not enter the data as in the previous
    case, in which the labels were necessary. We will use only the first three rows of the
    data.


    We still have two factors, but now each combination contains one measurement. From the
    dialogue box of picture 4 we select Anova: Two-Factor Without Replication, and the
    dialogue box that appears is shown in picture 16. The only thing we did was to define
    the Input Range and press OK. The results are presented in table 11. What is necessary
    for this analysis is that no interaction is present. The results are the same as the
    ones provided by SPSS, so we conclude once again that Excel works fine with statistical
    analysis. The disadvantage of Excel is once again that it provides no formulas for
    examining the residuals in the case of analysis of variance.

    Picture 16

    Anova: Two-Factor Without Replication

    SUMMARY    Count    Sum     Average     Variance
    S1         3        550     183.3333    11542.33
    S2         3        762     254         28863
    S3         3        539     179.6667    31700.33

    C1         3        1107    369         4483
    C2         3        520     173.3333    1808.333
    C3         3        224     74.66667    3754.333

    ANOVA
    Source of Variation    SS          df    MS          F           P-value     F crit
    Rows                   10532.67    2     5266.333    2.203794    0.226348    6.944272
    Columns                134652.7    2     67326.33    28.17394    0.004393    6.944272
    Error                  9558.667    4     2389.667
    Total                  154744      8

    Table 11: The two-way analysis of variance without replication
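    To show how a table of this kind is built from the raw data, here is a Python sketch of
    the two-way analysis of variance without replication. The raw data behind table 11 are
    not listed in these notes, so the 3 by 3 table below is hypothetical and the function
    name is ours; the sketch only illustrates the decomposition of the total sum of squares
    into rows, columns and error.

```python
def two_way_anova_no_replication(table):
    """ANOVA quantities for a rows-by-columns table with one value per cell."""
    r, c = len(table), len(table[0])
    grand = sum(sum(row) for row in table) / (r * c)
    row_means = [sum(row) / c for row in table]
    col_means = [sum(row[j] for row in table) / r for j in range(c)]

    ss_rows = c * sum((m - grand) ** 2 for m in row_means)
    ss_cols = r * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for row in table for v in row)
    ss_error = ss_total - ss_rows - ss_cols   # what rows and columns do not explain

    df_rows, df_cols = r - 1, c - 1
    df_error = df_rows * df_cols
    ms_error = ss_error / df_error
    return {
        "ss_rows": ss_rows, "ss_cols": ss_cols,
        "ss_error": ss_error, "ss_total": ss_total,
        "df_error": df_error,
        "f_rows": (ss_rows / df_rows) / ms_error,
        "f_cols": (ss_cols / df_cols) / ms_error,
    }


# Hypothetical data: three levels of each factor, one measurement per cell.
data = [
    [1, 2, 3],
    [2, 3, 4],
    [3, 4, 8],
]
result = two_way_anova_no_replication(data)
for name, value in result.items():
    print(f"{name}: {value:.4f}")
```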


    3.1 Statistical Functions

    Before showing how to find statistical measures using the statistical functions
    available in Excel under the Insert Function option, let us see which these are.

    AVEDEV calculates the average of the absolute deviations of the data from their mean.

    AVERAGE is the mean value of all data points.

    AVERAGEA calculates the mean allowing for text values of FALSE (evaluated as 0) and TRUE (evaluated as 1).

    BETADIST calculates the cumulative beta probability density function.

    BETAINV calculates the inverse of the cumulative beta probability density function.

    BINOMDIST determines the probability that a set number of true/false trials, where each trial has a consistent chance of generating a true or false result, will result in exactly a specified number of successes (for example, the probability that exactly four out of eight coin flips will end up heads).

    CHIDIST calculates the one-tailed probability of the chi-squared distribution.

    CHIINV calculates the inverse of the one-tailed probability of the chi-squared distribution.

    CHITEST calculates the result of the test for independence: the p-value from the chi-squared distribution for the test statistic and the appropriate degrees of freedom.

    CONFIDENCE returns a value you can use to construct a confidence interval for a population mean.

    CORREL returns the correlation coefficient between two data sets.

    COVAR calculates the covariance of two data sets. Mathematically, it is the product of the correlation coefficient and the standard deviations of the two data sets.

    CRITBINOM determines when the number of failures in a series of true/false trials exceeds a criterion (for example, more than 5 percent of light bulbs in a production run fail to light).


    DEVSQ calculates the sum of squares of deviations of data points from theirsample mean. The derivation of standard deviation is very straightforward, simply

    dividing by the sample size or by the sample size decreased by one to get the

    unbiased estimator of the true standard deviation.

    EXPODIST returns the exponential distribution

    FDIST calculates the F probability distribution (degree of diversity) for two datasets.

    FINV returns the inverse of the F probability distribution.

    FISHER calculates the Fisher transformation.

    FISHERINV returns the inverse of the Fisher transformation.

    FORECAST calculates a future value along a linear trend based on an existingtime series of values.

    FREQUENCY calculates how often values occur within a range of values andthen returns a vertical array of numbers having one or more elements than

    Bins_array.

    FTEST returns the result of the one-tailed test that the variances of two data setsare not significantly different.

    GAMMADIST calculates the gamma distribution.

    GAMMAINV returns the inverse of the gamma distribution.

    GAMMALN calculates the natural logarithm of the gamma distribution.

    GEOMEAN calculates the geometric mean.

    GROWTH predicts the exponential growth of a data series.

    HARMEAN calculates the harmonic mean.

    HYPGEOMDIST returns the probability of selecting an exact number of a single type of item from a mixed set of objects. For example, a jar holds 20 marbles, 6 of which are red. If you choose three marbles, what is the probability you will pick exactly one red marble?

    INTERCEPT calculates the point at which a line will intersect the y-axis.


    KURT calculates the kurtosis of a data set.

    LARGE returns the k-th largest value in a data set.

    LINEST generates a line that best fits a data set by generating a two-dimensional array of values to describe the line.

    LOGEST generates a curve that best fits a data set by generating a two-dimensional array of values to describe the curve.

    LOGINV returns the inverse of the lognormal cumulative distribution: the value below which a given probability of the distribution lies.

    LOGNORMDIST returns the cumulative lognormal distribution: the probability that a lognormally distributed value is at most a given value.

    MAX returns the largest value in a data set (ignores logical values and text).

    MAXA returns the largest value in a set of data (does not ignore logical values and text).

    MEDIAN returns the median of a data set.

    MIN returns the smallest value in a data set (ignores logical values and text).

    MINA returns the smallest value in a data set (does not ignore logical values and text).

    MODE returns the most frequently occurring value in an array or range of data.

    NEGBINOMDIST returns the probability that there will be a given number of failures before a given number of successes in a binomial distribution.

    NORMDIST returns the normal distribution (density or cumulative probability) for a given mean and standard deviation.

    NORMINV returns the inverse of the normal cumulative distribution: the value below which a random value selected from the distribution falls with the given probability.

    NORMSDIST returns the standard normal cumulative distribution, with a mean of 0 and a standard deviation of 1.

    NORMSINV returns the inverse of the standard normal cumulative distribution: the value below which a random value selected from the standard normal distribution falls with the given probability.


    PEARSON returns a value that reflects the strength of the linear relationship between two data sets.

    PERCENTILE returns the k-th percentile of values in a range.

    PERCENTRANK returns the rank of a value in a data set as a percentage of the data set.

    PERMUT calculates the number of permutations for a given number of objects that can be selected from the total objects.

    POISSON returns the probability of a number of events happening, given the Poisson distribution of events.

    PROB calculates the probability that values in a range are between two limits or equal to a lower limit.

    QUARTILE returns the quartile of a data set.

    RANK calculates the rank of a number in a list of numbers: its size relative to other values in the list.

    RSQ calculates the square of the Pearson correlation coefficient (also known as the coefficient of determination in the case of linear regression).

    SKEW returns the skewness of a data set (the degree of asymmetry of a distribution around its mean).

    SLOPE returns the slope of a line.

    SMALL returns the k-th smallest value in a data set.

    STANDARDIZE calculates the normalized value of a number (the value minus the mean, divided by the standard deviation).

    STDEV estimates the standard deviation of a numerical data set based on a sample of the data.

    STDEVA estimates the standard deviation of a data set (which can include text and true/false values) based on a sample of the data.

    STDEVP calculates the standard deviation of a numerical data set, treating the data as the entire population.

    STDEVPA calculates the standard deviation of a data set (which can include text and true/false values), treating the data as the entire population.


    STEYX returns the standard error of the predicted y value for each x value in the regression.

    TDIST returns the Student's t distribution.

    TINV returns a t value based on a stated probability and degrees of freedom.

    TREND returns values along a trend line.

    TRIMMEAN calculates the mean of a data set having excluded a percentage of the upper and lower values.

    TTEST returns the probability associated with a Student's t-test.

    VAR estimates the variance of a data sample.

    VARA estimates the variance of a data set (which can include text and true/false values) based on a sample of the data.

    VARP calculates the variance of a data population.

    VARPA calculates the variance of a data population, which can include text and true/false values.

    WEIBULL calculates the Weibull distribution.

    ZTEST returns the one-tailed p-value of a z-test.
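
    The marble question given under HYPGEOMDIST can be worked out by hand from the hypergeometric formula. The sketch below is only an illustration in Python, not part of Excel; the helper name hypergeom_pmf and its argument names are assumptions made for this example. In Excel the same quantity would be requested with HYPGEOMDIST(1, 3, 6, 20).

```python
from math import comb

def hypergeom_pmf(k, n, K, N):
    """Probability of exactly k marked items in a sample of n drawn
    without replacement from N items, K of which are marked."""
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)

# Jar of 20 marbles, 6 of them red; draw 3 and ask for exactly 1 red.
p = hypergeom_pmf(k=1, n=3, K=6, N=20)
print(round(p, 4))  # 6 * 91 / 1140
```

    The three binomial coefficients are 6, 91 and 1140, so the probability is 546/1140, a little under one half.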

    3.2 Spearman's (non-parametric) correlation coefficient

    Spearman's correlation coefficient is the non-parametric alternative to Pearson's correlation coefficient. It is Pearson's correlation coefficient based upon the ranks of the values rather than the values themselves. In section 2.8 we showed how to calculate the ranks for a range of values. We now repeat the same procedure, inserting both columns. Calculating the ranks will generate the following table in Excel:


    Point  Column1  Rank  Percent    Point  Column2  Rank  Percent
        9      455     1   94.70%        9      225     1   94.70%
       20      455     1   94.70%       20      225     1   94.70%
        7      454     3   89.40%        7      220     3   89.40%
        8      440     4   84.20%        8      215     4   84.20%
        6      429     5   78.90%        6      198     5   78.90%
       19      400     6   73.60%       10      190     6   73.60%
       10      390     7   68.40%       14      175     7   63.10%
       14      383     8   57.80%       15      175     7   63.10%
       16      383     8   57.80%       16      170     9   57.80%
       15      360    10   52.60%        2      165    10   47.30%
       13      351    11   47.30%       12      165    10   47.30%
        2      350    12   36.80%       17      160    12   42.10%
       12      350    12   36.80%       13      153    13   36.80%
       17      340    14   31.50%        3      150    14   21.00%
        3      318    15   26.30%        4      150    14   21.00%
        1      307    16   21.00%       19      150    14   21.00%
        4      304    17   15.70%        5      140    17   10.50%
        5      302    18    5.20%       18      140    17   10.50%
       18      302    18    5.20%        1      130    19    5.20%
       11      133    20    0.00%       11      115    20    0.00%

    Table 12: Ranks and Percentiles

    Column1 contains the values, Rank contains the ranks of the values, Percent contains the cumulative percentage of the values (the size of each value relative to the others) and the first column (Point) indicates the original row of each value. In the above table, Excel has sorted the values according to their ranks, so we have to sort the data with respect to the Point column to restore the original order. We repeat these actions for the second set of data and then calculate the correlation coefficient of the ranks of the values.

    Attention must be paid to the sequence of the steps described. The ranks of the values must be calculated separately for each data set and the sorting needs to be done before calculating the correlation coefficient. That is, we take the results for each variable separately and sort them with respect to the column Point. Then we place the Rank columns of the two variables side by side and calculate the Pearson correlation coefficient for these two columns.

    For the data used in this example, Spearman's correlation coefficient calculated this way is equal to 0.9143, whereas the correlation calculated using SPSS is equal to 0.909. The reason for this difference is the way SPSS deals with values that have the same rank: it assigns to all tied values the average of their ranks. That is, if three values are equal (so their ranks are the same, as for the tied values in table 12), SPSS assigns to each of these three values the average of their ranks (Excel does not do this).
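
    The effect of the two tie conventions can be sketched outside Excel. The Python below is only an illustration: the function names are assumptions of this example, and the data are the values of Table 12 put back into Point order. The first ranking gives tied values the same (top) rank, as in the Rank column of Table 12, and the second gives them the average of the ranks they span, which is the convention described above for SPSS. How closely the printed values match 0.9143 and 0.909 depends on the transcription of the table being faithful.

```python
import math

# Values from Table 12, restored to Point order 1..20.
column1 = [307, 350, 318, 304, 302, 429, 454, 440, 455, 390,
           133, 350, 351, 383, 360, 383, 340, 302, 400, 455]
column2 = [130, 165, 150, 150, 140, 198, 220, 215, 225, 190,
           115, 165, 153, 175, 175, 170, 160, 140, 150, 225]

def competition_ranks(values):
    """Descending ranks; tied values all receive the top rank of the tie."""
    return [1 + sum(v > x for v in values) for x in values]

def average_ranks(values):
    """Descending ranks; tied values receive the average of the ranks they span."""
    return [1 + sum(v > x for v in values)
            + (sum(v == x for v in values) - 1) / 2
            for x in values]

def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    s_ab = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    s_aa = sum((x - mean_a) ** 2 for x in a)
    s_bb = sum((y - mean_b) ** 2 for y in b)
    return s_ab / math.sqrt(s_aa * s_bb)

rho_competition = pearson(competition_ranks(column1), competition_ranks(column2))
rho_average = pearson(average_ranks(column1), average_ranks(column2))
print(rho_competition, rho_average)
```

    When a data set has no ties the two rankings coincide, so the two coefficients differ only because of tied values.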


    3.3 Wilcoxon Signed Rank Test for a Median

    We will now see how to conduct the Wilcoxon signed rank test for a median. This test is based upon the ranks of the values and it is the non-parametric alternative to the one sample t-test (when the normality assumption is not satisfied). Assume that we are interested in testing the hypothesis that the median of the population from which the sample comes is equal to a specific value. We will use the same data set as before and test whether the median is equal to 320. The median of the data set is 318. This test requires some steps that must be done carefully.

    Step 1: Subtract all the values from the given median (i.e. 320-Xi, i=1,2,...,n, where n=sample size).

    Step 2: In a new column calculate the absolute values of these differences.

    Step 3: Calculate the ranks of the absolute values.

    Step 4: Using the logical function IF, assign 1 if the differences in the second column are positive and -1 if they are negative.

    Step 5: Multiply the 4th and the 5th columns to get the ranks with a sign (plus/minus).

    Step 6: Define a last column to be the squared ranks.

    Table 13 summarizes all of the above. All of the tedious work is complete; the rest is mere detail. In cases where there are values with the same ranks (i.e. ties) we use this formula for the test:

    Sum(Ranks)/(Square Root of the Sum of squared ranks).

    We calculate the sum of the 6th column and the square root of the sum of the 7th column. Finally, we divide the first sum by the square root of the second sum to get the test statistic. In this example, the sum of the signed ranks is equal to 289, the sum of squared ranks is equal to 117363 and its square root is equal to 342.5828. The test statistic is 289 divided by 342.5828, which is equal to 0.8436. SPSS provides a slightly different test statistic due to the different handling of the tied ranks and the use of a different test statistic.

    There is also another way to calculate a test statistic, and that is by taking the sum of the positive ranks. Both Minitab and SPSS calculate another type of test statistic, which is based on either the positive or the negative ranks. It is worth mentioning that the second formula is better used when there are no tied ranks. Irrespective of the test statistic used, the conclusion about rejecting the null hypothesis will be the same. Using the second formula the result is 1401, whereas Minitab provides a result of 1231.5. As for the result of the test (reject the null hypothesis or not), one must look at the tables for the one-sample Wilcoxon signed rank test. The fact that Excel does not offer options for calculating the probabilities used in the non-parametric tests, in conjunction with the tedious work, makes it less popular for this use.


    Values    m-Xi   Absolute   Ranks of   Positive or   Ranks   Squared
    (Xi)             (m-Xi)     absolute   negative      Ri      ranks
                                values     differences           Ri^2
       307      13        13         64             1       64      4096
       350     -30        30         47            -1      -47      2209
       318       2         2         67             1       67      4489
       304      16        16         60             1       60      3600
       302      18        18         56             1       56      3136
       429    -109       109         19            -1      -19       361
       454    -134       134         13            -1      -13       169
       440    -120       120         17            -1      -17       289
       455    -135       135         11            -1      -11       121
       390     -70        70         32            -1      -32      1024
       350     -30        30         47            -1      -47      2209
       351     -31        31         43            -1      -43      1849
       383     -63        63         37            -1      -37      1369
       360     -40        40         41            -1      -41      1681
       383     -63        63         37            -1      -37      1369

    Table 13: Procedure of the Wilcoxon Signed Rank Test
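
    Steps 1 to 6 and the formula Sum(Ranks)/(Square Root of the Sum of squared ranks) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the worksheet itself: the function names are made up for this example, ranks are assigned in ascending order of absolute difference with tied values sharing their average rank (the usual textbook convention, whereas the RANK output shown in Table 13 is in descending order), and values equal to the hypothesised median are dropped. The last lines simply redo the arithmetic quoted in the text.

```python
import math

def average_ranks(values):
    """Ascending ranks (smallest value gets rank 1); ties share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of equal values starting at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def signed_rank_statistic(values, m):
    """Sum of signed ranks divided by the square root of the sum of squared ranks."""
    diffs = [m - x for x in values if x != m]        # step 1 (zero differences dropped)
    abs_diffs = [abs(d) for d in diffs]              # step 2
    ranks = average_ranks(abs_diffs)                 # step 3
    signs = [1 if d > 0 else -1 for d in diffs]      # step 4
    signed = [s * r for s, r in zip(signs, ranks)]   # step 5
    squared = [r * r for r in signed]                # step 6
    return sum(signed) / math.sqrt(sum(squared))

# Arithmetic check of the figures quoted in the text.
reported = 289 / math.sqrt(117363)
print(round(reported, 4))
```

    A call such as signed_rank_statistic(data, 320) would then play the role of the worksheet columns, with data standing for the list of sample values.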

    3.4 Wilcoxon Signed Rank Test with Paired Data

    When we have two samples which cannot be assumed to be independent (e.g. the weight of people before and after a diet) and we are interested in testing the hypothesis that the two medians are equal against the alternative that they are not, the Wilcoxon signed rank test with paired data is appropriate. This is the non-parametric alternative to the paired samples t-test.

    The procedure is the same as in the one sample case. We only have to add another column representing the values of the second sample, so table 13 would have 8 columns instead of 7 and the third column would be the differences between the values of the two data sets. The formulae for the test statistics are the same as before and the results will differ from those of SPSS because Excel (in contrast to SPSS) does not adjust the ranks for ties. The tables of the critical values for this test must be available in order to decide whether or not to reject the null hypothesis.


    Values   Values   Values   Absolute      Ranks of      Positive or   Ranks   Squared
    Y        X        Y-X      differences   absolute      negative      Ri      ranks
                                             differences   differences           Ri^2
       307      225       82            82             8             1       8        64
       350      250      100           100             6             1       6        36
       318      250       68            68            11             1      11       121
       304      232       72            72             9             1       9        81
       302      350      -48            48            13            -1     -13       169
       429      400       29            29            14             1      14       196
       454      351      103           103             5             1       5        25
       440      318      122           122             3             1       3         9
       455      383       72            72             9             1       9        81
       390      400      -10            10            15            -1     -15       225
       350      400      -50            50            12            -1     -12       144
       351      258       93            93             7             1       7        49
       383      140      243           243             1             1       1         1
       360      250      110           110             4             1       4        16
       383      250      133           133             2             1       2         4

    Table 14: Procedure of the Wilcoxon Signed Rank Test with Paired Data
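
    As a worked illustration, the formula of section 3.3 can be applied to the signed ranks exactly as they are printed in Table 14. The short Python sketch below only redoes that arithmetic; the variable names are assumptions of this example and the resulting value is not stated in the original text. It takes the ranks as given in the table and does not re-rank the differences.

```python
import math

# Signed ranks (column "Ranks Ri") and squared ranks as printed in Table 14.
signed_ranks = [8, 6, 11, 9, -13, 14, 5, 3, 9, -15, -12, 7, 1, 4, 2]
squared_ranks = [64, 36, 121, 81, 169, 196, 25, 9, 81, 225, 144, 49, 1, 16, 4]

rank_sum = sum(signed_ranks)        # numerator: sum of the signed ranks
square_sum = sum(squared_ranks)     # sum of the squared ranks
statistic = rank_sum / math.sqrt(square_sum)

print(rank_sum, square_sum, round(statistic, 3))
```

    The sum of the signed ranks is 39 and the sum of the squared ranks is 1221, which gives a statistic of roughly 1.116.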

    4.1 The Solver add-in

    Let us suppose we want to maximize the following linear bivariate function

    f(X,Y)=400X+300Y

    under some linear constraints

    I. 4X+2Y≤300

    II. X≤70

    III. 2X+4Y≤240

    The way to do it in Excel 2007 is simple. First we go to picture 3 and select the option Solver add-in. Then, similarly to the data analysis path, we click on Data in the toolbar and from there choose Solver.


    Picture 17

    The dialogue box of picture 17 will appear. As we can see, Excel offers the possibility of maximization, minimization and search for the values of X and Y which satisfy a condition, such as that the function is equal to some specific value (not only 0). Furthermore, we have the option to perform these three tasks with the inclusion of constraints, either in the form of equalities or inequalities. We will use inequalities in this example. By pressing the button Add the dialogue of picture 18 will appear.

    Picture 18

    We enter the cell which describes the first constraint and the cell which holds its maximum value. We repeat this task until all constraints are entered, pressing Add after each one. In case we have no constraints, we do not have to come here. When we enter the final constraint we can either press OK, or press Add first and then OK. In the second case a message will appear (picture 19) preventing us from continuing. We will press Cancel and we will still go to picture 20.

    Picture 19


    Picture 20

    Then we choose Solve and the message of picture 21 will appear. We press OK and the message disappears. The solution will appear in Excel.

    Picture 21

    X                 60
    Y                 30
    function value    33000

    constraints       value    limit
    4X+2Y             300      300
    X                 60       70
    2X+4Y             240      240

    Table 15: Results of the optimization
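
    The result of Table 15 can be cross-checked without Solver. The Python sketch below is only an illustration and is not how Solver works internally: it assumes the three constraints are "less than or equal" inequalities, as the limits in Table 15 suggest, and it relies on the fact that a bounded linear problem attains its maximum at a corner of the feasible region, so it simply intersects the constraint lines pairwise and keeps the best feasible corner. All names in the code are made up for this example.

```python
from itertools import combinations

# Each constraint is a*X + b*Y <= c (section 4.1, constraints I to III).
constraints = [(4, 2, 300), (1, 0, 70), (2, 4, 240)]

def objective(x, y):
    return 400 * x + 300 * y

def intersection(c1, c2):
    """Intersection of the boundary lines of two constraints (Cramer's rule)."""
    a1, b1, r1 = c1
    a2, b2, r2 = c2
    det = a1 * b2 - a2 * b1
    if det == 0:
        return None  # parallel lines, no single intersection point
    return ((r1 * b2 - r2 * b1) / det, (a1 * r2 - a2 * r1) / det)

def feasible(x, y, tol=1e-9):
    return all(a * x + b * y <= c + tol for a, b, c in constraints)

corners = []
for c1, c2 in combinations(constraints, 2):
    point = intersection(c1, c2)
    if point is not None and feasible(*point):
        corners.append(point)

best = max(corners, key=lambda p: objective(*p))
print(best, objective(*best))
```

    The feasible corners are (70, 10) and (60, 30); the second gives the larger function value, 33000, in agreement with Table 15.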