The Kolmogorov-Smirnov Test for Normality


The purpose of this paper is to explain the Kolmogorov-Smirnov test for normality and to suggest more powerful tests as substitutes for the conventional K-S test.


Queen Mary, University of London
MTH731U Computational Statistics
Idlir Shkurti, Student ID: 120192308

    1.0 Introduction

Suppose we have n observations $x_1, \dots, x_n$ coming from independent and identically distributed random variables $X_1, \dots, X_n$ with a common cumulative distribution function (cdf) $F$. If we wish to test the hypothesis $H_0: F = F_0$ against $H_1: F \neq F_0$, where $F_0$ is the cdf of a known continuous distribution, the Kolmogorov-Smirnov statistic is an appropriate and valid way of doing so. The Kolmogorov-Smirnov statistic is defined in the following way:

$$D_n = \max_{i \in \{1, \dots, n\}} \left| F_n(x_i) - F_0(x_i) \right| \qquad (1)$$

where $F_n(x)$ is the ecdf obtained from the observations $x_1, \dots, x_n$. In other words, the KS statistic is the maximum absolute distance between the graph of the ecdf and the cdf of the known distribution from which we are testing whether our data come.
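For reference, base R's ks.test implements this conventional test against a fully specified distribution; the sample below is illustrative:

set.seed(1)
x <- rnorm(50)                          # data to test
ks.test(x, "pnorm", mean = 0, sd = 1)   # KS test against a fully specified N(0, 1)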

Another way of defining the test statistic when we are testing for normality of the data is the following. Suppose we have a sample which consists of n observations ordered such that $x_1 \le x_2 \le \dots \le x_n$. The ecdf of the sample is the step function whose value at each $x_i$ jumps from $\frac{i-1}{n}$ to $\frac{i}{n}$. If $F_0$ is the cdf of a normal distribution with mean $\mu$ and standard deviation $\sigma$, then the KS statistic is given by:

$$D_n = \max_{1 \le i \le n} \left\{ \frac{i}{n} - \Phi\!\left(\frac{x_i - \mu}{\sigma}\right),\; \Phi\!\left(\frac{x_i - \mu}{\sigma}\right) - \frac{i-1}{n} \right\} \qquad (2)$$

where $\Phi$ is the cumulative distribution function of the standard normal distribution.
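As an illustration, a minimal R sketch of equation (2) follows; the function name ks_stat is ours, and the mean and standard deviation are assumed known here:

# Minimal sketch of equation (2): the KS statistic for a sample x against a
# normal distribution with known mean mu and standard deviation sigma.
ks_stat <- function(x, mu, sigma) {
  n <- length(x)
  z <- pnorm(sort(x), mean = mu, sd = sigma)   # Phi((x_(i) - mu) / sigma)
  i <- seq_len(n)
  max(i / n - z, z - (i - 1) / n)              # two-sided maximum over all i
}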

A more conventional way of testing a similar hypothesis is the chi-square test; however, the Kolmogorov-Smirnov statistic has clear advantages, since it can be used with small samples and is overall more powerful.

In his 1967 paper, Lilliefors provided a means of using a modified version of the Kolmogorov-Smirnov statistic to test whether a set of observations come from a continuous distribution $F_0(x)$ when certain parameters of the distribution must be estimated, so that the distribution is not completely specified. If the conventional Kolmogorov-Smirnov test is used in this case, the results will be conservative, hence compromising the power of the test. In his paper Lilliefors presents a method of testing whether a set of observations come from a normal population with unknown mean and variance. In order to apply the test we must first fix what the continuous distribution $F_0(x)$ is. Since the mean and the variance are unknown, we use the estimators $\bar{x}$ and $s^2$, the sample mean and the sample variance of the given observations, in their place. Hence we assume that $F_0(x)$ is the cdf of a normal distribution with mean $\bar{x}$ and variance $s^2$. Once these values are calculated, the Kolmogorov-Smirnov statistic is computed exactly as above with $F_0(x) = \Phi\!\left(\frac{x - \bar{x}}{s}\right)$.
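Under the same assumptions as the sketch above, the Lilliefors version simply plugs in the sample estimates (the function name is again ours):

# Lilliefors statistic: equation (2) with the sample mean and standard
# deviation substituted for the unknown parameters.
lilliefors_stat <- function(x) ks_stat(x, mu = mean(x), sigma = sd(x))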


    2.0 Adjusted Critical Values

Once the test statistic is obtained, we cannot use the critical values from the standard Kolmogorov-Smirnov tables to draw a conclusion about the test because, as mentioned before, the results would be conservative. In his paper, Lilliefors calculates new critical values for this specific test using a Monte Carlo calculation. This was done by drawing 1000 random normally distributed samples for each of several values of n and thus estimating the distribution of $D_n$. The first R code in appendix 1 is used to obtain a table similar to Table 1 of Lilliefors' paper. The output table is given below.

Table 1: Monte Carlo Critical Values of Dn

Critical values estimated from a Monte Carlo calculation using 1000 samples for each sample size N. Any value of the Kolmogorov-Smirnov test statistic greater than the corresponding critical value at a given significance level leads to a rejection of the null hypothesis of normality at that level.

Sample size N    Level of significance for $D_n = \max_{i} |F_n(x_i) - F_0(x_i)|$
                 0.20      0.15      0.10      0.05      0.01
4                0.300     0.318     0.342     0.378     0.410
5                0.290     0.303     0.323     0.350     0.403
6                0.272     0.282     0.297     0.321     0.373
7                0.251     0.263     0.277     0.307     0.351
8                0.234     0.246     0.259     0.280     0.332
9                0.227     0.237     0.254     0.274     0.313
10               0.215     0.225     0.243     0.265     0.307
11               0.209     0.217     0.228     0.248     0.290
12               0.204     0.212     0.224     0.245     0.279
13               0.196     0.204     0.216     0.231     0.263
14               0.185     0.196     0.205     0.222     0.256
15               0.178     0.185     0.197     0.215     0.257
16               0.176     0.183     0.195     0.210     0.251
17               0.170     0.179     0.190     0.206     0.234
18               0.166     0.173     0.184     0.200     0.236
19               0.161     0.170     0.178     0.195     0.220
20               0.158     0.165     0.174     0.189     0.220
25               0.143     0.150     0.160     0.171     0.195
30               0.131     0.136     0.143     0.155     0.185
Over 30          0.723/√N  0.760/√N  0.814/√N  0.880/√N  1.022/√N

The output in Table 1 gives the estimated critical values at the 0.20, 0.15, 0.10, 0.05 and 0.01 significance levels for sample sizes from n = 4 to n = 30. For any sample of size N greater than 30, the critical value is obtained by taking the critical value at n = 40, multiplying it by √40, and dividing the product by √N; this gives the constants in the "Over 30" row. For example, at the 0.05 level a sample of size N = 100 uses 0.880/√100 = 0.088.

Comparing this table to the standard Kolmogorov-Smirnov tables, we can see that the critical values at the 0.01 significance level here are slightly smaller than the standard-table critical values at the 0.20 significance level. Hence if we used the standard Kolmogorov-Smirnov critical values we would obtain very conservative results: the actual significance level would be much lower than the one claimed by the test. An important question which needs to be addressed at this stage is: is the modified KS test still more powerful than the chi-squared test?

3.0 Power of the Test

One of the merits of this specialised Kolmogorov-Smirnov test for normality is that it can still be used for small sample sizes, unlike the chi-squared test. Kac, Kiefer and Wolfowitz (1955) showed that it is asymptotically more powerful than the chi-squared test. However, we want to know whether the test is also suitable for relatively small sample sizes. In the Lilliefors paper a small investigation was made to compare the powers of the two tests: 500 samples of size 20 were drawn from distributions such as the normal, chi-square with 3 d.f., Student's t with 3 d.f., exponential and uniform. The probabilities of rejecting the null hypothesis of normality using the Kolmogorov-Smirnov statistic and the chi-square statistic were found and compared. Note that in this comparison Lilliefors also used Monte Carlo critical values for the chi-squared test, rather than the standard chi-squared points, to avoid a high probability of type I error.

From Table 2 in the Lilliefors paper we can see that the probabilities of type I error are satisfactory and relatively similar for both tests. However, the Kolmogorov-Smirnov test is much more powerful than the chi-squared test: the probabilities of correctly rejecting the hypothesis of normality are significantly greater when using the $D$ statistic than when using the chi-squared statistic, for every non-normal underlying distribution considered, at both the 5% and 10% significance levels. This is another advantage of using the specialised test for normality.

However, this only tells us that the specialised Kolmogorov-Smirnov test for normality is superior to the chi-squared test. The power of this test is still far from ideal, particularly when the observations come from a uniform distribution, as the same table shows. The probability of correctly rejecting the null hypothesis of normality when the 20 observations come from a uniform distribution is 12% at a 5% significance level and 22% at a 10% significance level. Hence this test is not ideal for certain distributions.

Table 3 from Lilliefors' paper gives a calculation similar to that of Table 2. It gives the probabilities of rejecting the hypothesis of normality for the same underlying distributions as Table 2, but now using 500 samples of size 10 rather than 20. The test used in this table is the Kolmogorov-Smirnov test with the adjusted critical values from Table 1 at α = 0.05 and α = 0.10.

We can see from this table that the probabilities of type I error are still satisfactory; however, the power of the test decreases further now that the sample size has dropped. At a 5% significance level, when the observations come from a uniform distribution, the test correctly rejects the null hypothesis of normality only 7% of the time, and only 13% of the time when α = 0.10.

Hence another important factor affecting the power of the test is the size of the sample: the greater the sample size, the more powerful the test. The code in appendix 2 generates the proportion of samples which correctly reject the null hypothesis of normality at a 10% level of significance, out of 500 samples, each of size N = 100, coming from a uniform distribution. The output obtained is 0.774, which means that the null hypothesis of normality is correctly rejected for 77.4% of the 500 samples. This is much higher than the corresponding figures in Tables 2 and 3 of Lilliefors' paper. The following table, obtained using R, is equivalent to Table 3 of Lilliefors' paper, the only difference being that the samples are now of size 100 rather than 10.

Table 2: Probability of rejecting the hypothesis of normality when the sample size is 100

Kolmogorov-Smirnov test using critical values from Table 1.

Underlying distribution    α = 0.05   α = 0.10
Normal                     0.050      0.098
Chi-square, 3 d.f.         0.990      0.998
Student's t, 3 d.f.        0.730      0.842
Exponential                1.000      1.000
Uniform                    0.578      0.774

Comparing this table to Table 3, or even Table 2, of Lilliefors' paper, we can clearly observe dramatic increases in the probabilities of correctly rejecting the null hypothesis of normality for the bottom four distributions, particularly for the exponential distribution. The power of the test has increased; hence the sample size is a very important factor in the power of the test. The probabilities of type I error remain as predicted.
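A minimal sketch of the simulation behind Table 2 is given below, reusing the lilliefors_stat function sketched earlier; the uniform case at the 0.10 level is shown, with the large-sample critical value 0.814/√N taken from Table 1:

# Power sketch: proportion of 500 uniform samples of size 100 rejected at the
# 0.10 level, using the "Over 30" critical value 0.814 / sqrt(N) from Table 1.
N <- 100
crit <- 0.814 / sqrt(N)
mean(replicate(500, lilliefors_stat(runif(N)) > crit))   # roughly 0.774 in the text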

    4.0 Outliers

One problem with the use of the sample mean and sample variance as estimates of the mean and variance of the null distribution is that they are sensitive to outliers in the data sample. Since the test is directly affected by the choice of mean and variance, this can lead to errors. Type I error is a bigger problem when outliers are present, particularly when the sample size is small: the smaller the sample size, the greater the effect of an outlier on the sample estimates. The third R code (appendix 3) uses a code similar to the one used to obtain the values for Table 2 (appendix 2). 500 samples of size 10 are drawn; 9 observations in each sample come from a standard normal distribution, whilst the remaining observation is an outlier. If we include the outlier as a correct observation when estimating the sample mean and the sample variance, then the null hypothesis of normality is rejected for a large proportion of the 500 samples (30.2% in one run). This means that the probability of type I error is much higher than the level of significance. However, if we install and load the outliers package in R and use the command rm.outlier to locate the outlier and replace it by the mean of the remaining observations, the proportion of samples rejecting the hypothesis of normality returns to the nominal level. A sketch of this experiment is given below.
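The following is a hedged sketch of the appendix 3 experiment; the outlier value (5) and the fixed critical value 0.265 (Table 1, n = 10, α = 0.05) are our assumptions, and lilliefors_stat is the function sketched earlier:

library(outliers)                           # provides rm.outlier()

reject_rate <- function(clean_outlier = FALSE) {
  rejected <- replicate(500, {
    x <- c(rnorm(9), 5)                     # nine clean observations plus one outlier
    if (clean_outlier)
      x <- rm.outlier(x, fill = TRUE)       # replace outlier by mean of the rest
    lilliefors_stat(x) > 0.265              # critical value: Table 1, n = 10, 0.05 level
  })
  mean(rejected)
}

reject_rate(FALSE)   # inflated type I error (around 0.30 in the text's run)
reject_rate(TRUE)    # back near the nominal 5% level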

In appendix 4, a code was used to observe the effect of a single outlier on the value of the test statistic $D_n$ for samples of size 10 to 101. Just as expected, the effect of the outlier decreases as the sample size increases. The code in the appendix draws 500 samples for each sample size from 10 to 101, with one outlier in each sample, and finds the average value of the KS test statistic for each sample size. The graph below shows the plotted values of the average KS statistic against the corresponding sample sizes.

    Graph 1: The average KS statistic for different sample sizes
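A sketch of the appendix 4 computation, with the outlier value again assumed to be 5, is:

# Average Lilliefors statistic over 500 samples for each sample size 10..101,
# each sample containing a single assumed outlier at 5.
sizes <- 10:101
avgD <- sapply(sizes, function(n)
  mean(replicate(500, lilliefors_stat(c(rnorm(n - 1), 5)))))
plot(sizes, avgD, type = "l",
     xlab = "Sample size", ylab = "Average KS statistic")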

4.1 Modified Lilliefors Test

One problem with the Lilliefors test is that the mean and the variance are obtained from fixed sample estimates. These sample estimates, particularly for relatively small sample sizes, are sensitive to outliers. In their 2008 paper, Drezner, Turel and Zerom introduced a modified version of the Lilliefors test which they believed to be superior to the conventional procedure. They also use equation (2) to find the test statistic; however, they do not regard fixed sample estimators such as the sample mean and the sample variance as appropriate estimates of the mean and variance of the random sample. In contrast to the traditional KS test introduced by Lilliefors, in which the data are compared against a normal distribution with fixed parameters, this method tries to find a normal distribution which is more appropriate for the data sample than the fixed-parameter distribution. The traditional KS statistic is obtained by using the sample mean $\bar{x}$ and the sample variance $s^2$ as the estimates of the mean and variance in (2). The modified test introduced in their paper instead uses an algorithm to obtain values $\mu^*$ and $\sigma^*$ which minimize the value of the test statistic $D_n$. The critical values needed to reach a decision about the outcome of the test were also calculated differently from Lilliefors', in order to match the purpose of this test.

4.1.1 Algorithm

When using the Lilliefors test, equation (2) depends on the choice of $\mu$ and $\sigma^2$; hence we denote the test statistic by $D_n(\mu, \sigma^2)$. For the conventional Lilliefors test this statistic is simply $D_n(\bar{x}, s^2)$. The test statistic for the modified test is $D_n(\mu^*, \sigma^*)$, where $(\mu^*, \sigma^*)$ is the vector solution of the minimization problem

$$\min_{\mu, \sigma^2} \left\{ D_n(\mu, \sigma^2) \right\} \qquad (3)$$

We can write equation (2) as the following inequalities, which must hold for every $i$:

$$D_n(\mu, \sigma^2) \ge \frac{i}{n} - \Phi\!\left(\frac{x_i - \mu}{\sigma}\right) \qquad (4)$$

$$D_n(\mu, \sigma^2) \ge \Phi\!\left(\frac{x_i - \mu}{\sigma}\right) - \frac{i-1}{n} \qquad (5)$$

The solution to problem (3) is the smallest possible value of $D_n(\mu, \sigma^2)$ which satisfies (4) for $nD_n < i$ and (5) for $i < n(1 - D_n) + 1$; the values of $\mu$ and $\sigma^2$ for which this is achieved form the vector $(\mu^*, \sigma^*)$.
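The authors' own algorithm exploits the structure of (4) and (5); as a rough illustration only, the minimization in (3) can be sketched with a general-purpose optimiser starting from the sample estimates. This is our substitute for the authors' procedure, not their method:

# Rough sketch of the modified statistic: minimise D_n(mu, sigma^2) of
# equation (2) over the normal parameters with a generic optimiser.
# A log-sd parameterisation keeps sigma positive.
modified_stat <- function(x) {
  obj <- function(p) ks_stat(x, mu = p[1], sigma = exp(p[2]))
  optim(c(mean(x), log(sd(x))), obj)$value   # Nelder-Mead by default
}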

The benefit of using this modified version of the Lilliefors test is that it accounts for possible outliers in the data. When using the standard KS test, outliers in the data can significantly affect the values of the sample estimates, which leads to errors in the test. The modified method takes possible outliers into account. When compared to the standard KS test, the modified version was much more powerful for most distributions, particularly when the data came from a uniform distribution.

    5.0 Conclusion

The Kolmogorov-Smirnov test provides a good method of testing whether a sample comes from a completely known distribution with cdf $F_0$. When $F_0$ is not completely known we can test for normality using Lilliefors' test, which is simply a modified version of the KS test in which the sample mean and the sample variance are used as the mean and the variance of the unknown distribution. Lilliefors introduced new critical values in his paper, obtained from a Monte Carlo calculation with 1000 samples for each sample size. Once the Monte Carlo critical values were obtained, it was noticeable that for each sample size the standard critical values from the Kolmogorov-Smirnov tables were much higher than the ones obtained in his paper. Hence using the standard critical values would result in a very conservative test and hence a loss of power.

This version of the KS test was still more powerful than the chi-square test, as shown in Table 2 of reference 1, where the probabilities of correctly rejecting the null hypothesis were significantly higher for the KS test than for the chi-square test. The Lilliefors test can also be used for small sample sizes, just like the standard KS test, which is another advantage over the chi-square test.

Outliers strongly affect the sample estimates, particularly for small sample sizes, which can lead to errors. This was shown in Graph 1, where the average values of the KS statistic from 500 samples of each sample size were calculated and plotted. The values of the KS statistic clearly decrease as the sample size increases, which means that an outlier is much more likely to lead to a rejection of the null hypothesis when the sample size is small. This is also shown in appendix 3, where the inclusion of an outlier in the data sample increases the proportion of type I errors.

Drezner, Turel and Zerom, in their 2008 paper, introduced a modified version of the Lilliefors test which chooses estimates of the mean and variance so that the test statistic is minimized. This method proved to be more powerful than the Lilliefors test for many underlying distributions, though not for the t distribution: when the data came from a t distribution, the modified method was less powerful than Lilliefors' test.


    6.0 Appendices

6.1 Appendix 1

The original console listing is truncated in this transcript; the code below is a reconstructed sketch of the Monte Carlo procedure described in section 2.0, reusing the variable names that survive in the fragments (N, N1, Dn, CT.2, ..., CT.01, S, Ecdf).

# Monte Carlo estimation of the critical values in Table 1.
N <- 30                                          # largest tabulated sample size
N1 <- 1000                                       # Monte Carlo samples per sample size
CT.2 <- CT.15 <- CT.1 <- CT.05 <- CT.01 <- c()   # empty vectors for the critical values
for (i in 4:N) {
  Dn <- c()
  for (j in 1:N1) {
    S <- sort(rnorm(i))                          # standard normal sample of size i
    Ecdf <- (1:i) / i                            # ecdf at the ordered observations
    m <- pnorm(S, mean(S), sd(S))                # F0 with estimated mean and sd
    Dn[j] <- max(Ecdf - m, m - (Ecdf - 1/i))     # KS statistic of equation (2)
  }
  CT.2[i]  <- quantile(Dn, 0.80)                 # 0.20 significance level
  CT.15[i] <- quantile(Dn, 0.85)                 # 0.15
  CT.1[i]  <- quantile(Dn, 0.90)                 # 0.10
  CT.05[i] <- quantile(Dn, 0.95)                 # 0.05
  CT.01[i] <- quantile(Dn, 0.99)                 # 0.01
}
round(cbind(CT.2, CT.15, CT.1, CT.05, CT.01)[4:N, ], 3)   # body of Table 1