
- Basics of Statistics.
- Applications of Statistics in solving industrial dilemmas.
- Basics of Operations Research.
- Applications of OR in solving industrial dilemmas.
- Some more industry examples…

Basics of Statistics

• Types of data

• Normal distribution

• Sample size selection

• Hypothesis testing

• Introduction to statistical tests: z-test and t-test.

Types of data

• Nominal data/Categorical data

In these kinds of data, observations are given a particular name: a person is observed to be 'male' or 'female', a drug is named as generic or brand, and so on. Nominal data cannot be measured or ordered but can be counted. These data are considered categorical, but the order of the categories is meaningless. Data that consist of two classes, like male/female or dead/alive, are called binomial data; data that consist of more than two classes, like tablet/capsule/syrup, are known as multinomial data. Data of these types are usually presented in the form of contingency tables, like 2 × 2 tables.

• Ordinal data

Ordinal data is also a type of categorical data, but here the categories are ordered logically. These data can be ranked in order of magnitude: one can say definitely that one measurement is equal to, less than, or greater than another. Most of the scores and scales used in research fall under ordinal data, for example, rating scores/scales for the color, taste, smell, or ease of application of products.

• Interval data

Interval data has a meaningful order and also has the quality that equal intervals between measurements represent equal changes in the quantity being measured. However, these data have no natural zero. An example is the Celsius scale of temperature: because there is no natural zero, we cannot say that 70°C is double 35°C. On an interval scale, the zero point can be chosen arbitrarily. IQ test scores are also interval data, as they have no natural zero.

• Ratio data

Ratio data has all the qualities of interval data (natural order, equal intervals) plus a natural zero point. This type of data is used most frequently. Examples of ratio data are height, weight, length, etc. With this type of data, it can be said meaningfully that 10 m of length is double 5 m. This ratio holds true regardless of the scale in which the object is measured (e.g., meters or yards), because of the presence of a natural zero.

Do Our Data Follow the Normal Distribution?

This is the second prerequisite for selecting an appropriate statistical test. If you know the type of data (nominal, ordinal, interval, ratio) and the distribution of the data (normal or not normal), selecting a statistical test is easy. There is no need to check the distribution in the case of ordinal and nominal data; the distribution should only be checked for ratio and interval data. If your data follow the normal distribution, a parametric statistical test should be used; nonparametric tests should only be used when the normal distribution is not followed.

There are various methods for checking the normal distribution: plotting a histogram, plotting a box-and-whisker plot, plotting a Q-Q plot, measuring skewness and kurtosis, and using a formal statistical test for normality (Kolmogorov-Smirnov test, Shapiro-Wilk test, etc.). Formal tests like Kolmogorov-Smirnov and Shapiro-Wilk are used frequently to check the distribution of data. All these tests are based on the null hypothesis that the data are taken from a population which follows the normal distribution. The P value is determined to see the alpha error: if the P value is less than 0.05, the data do not follow the normal distribution, and a nonparametric test should be used on such data. With a small sample size, the chances of a non-normal distribution are increased.
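To make this concrete, here is a minimal Python sketch of such a normality check using scipy's Shapiro-Wilk test; the simulated data set, the seed, and the 0.05 cutoff are illustrative assumptions, not from the original slides.

```python
# Hypothetical normality check before choosing a parametric or nonparametric test.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=50, scale=5, size=40)  # stand-in for ratio/interval data

stat, p = stats.shapiro(sample)  # H0: data come from a normal distribution
if p < 0.05:
    print(f"p = {p:.3f}: normality rejected, prefer a nonparametric test")
else:
    print(f"p = {p:.3f}: no evidence against normality, a parametric test is fine")
```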

Which statistical test to use when… (R = ratio, I = interval, N = nominal)

• Comparison of 1 group to a hypothetical value:
- N: chi-square test or binomial test.
- R, I: one-sample t-test or one-sample z-test.

• Comparison of 2 groups (paired groups/unpaired groups):
- R, I, paired: paired t-test / paired z-test.
- R, I, unpaired: unpaired t-test / unpaired z-test.
- N: Fisher's exact test, or chi-square test for large samples.

• Comparison of more than 2 groups:
- R, I: one-way ANOVA (unpaired) or repeated-measures ANOVA (paired).
- N: chi-square test.

There are two types of chi-square tests: the goodness-of-fit test (does a coin tossed 100 times turn up heads 50 times and tails 50 times?) and the test of independence (is there a relationship between gender and a perfect SAT score?).
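As a sketch, both variants are one call in Python with scipy; the counts below are invented for illustration.

```python
from scipy import stats

# Goodness of fit: does a coin tossed 100 times depart from 50 heads / 50 tails?
chi2, p = stats.chisquare(f_obs=[58, 42], f_exp=[50, 50])
print(f"goodness of fit: chi2 = {chi2:.2f}, p = {p:.3f}")

# Test of independence: gender vs. perfect SAT score, as a 2x2 contingency table.
observed = [[20, 480],   # hypothetical counts: [perfect, not perfect] per gender
            [15, 485]]
chi2, p, dof, expected = stats.chi2_contingency(observed)
print(f"independence: chi2 = {chi2:.2f}, dof = {dof}, p = {p:.3f}")
```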

Statistical Hypothesis Testing

Statistical hypothesis testing is used to determine whether an experiment provides enough evidence to reject a proposition. Suppose you want to study the effect of smoking on the occurrence of lung cancer. If you take a small group, it may happen that no correlation appears at all: you may find many smokers with healthy lungs and many non-smokers with lung cancer. However, this can happen just by chance, and it may not hold in the overall population. To remove this element of chance and increase the reliability of our conclusions, we use statistical hypothesis testing. First, you assume a hypothesis that smoking and lung cancer are unrelated. This is called the 'null hypothesis', and it is central to any statistical hypothesis testing. You should then choose a distribution for the experimental group. The normal distribution is one of the most common distributions encountered in nature, but it can differ in special cases.

Confidence Interval in Statistics

In statistics, a confidence interval expresses the amount of error that is allowed for in the statistical data and analysis: it gives a range that is likely to contain the true value. Typical confidence levels are 0.95 (95%), 0.99, and 0.999; for a two-tailed interval at the 95% level, 0.975 of the distribution lies below each bound. Suppose a survey shows that 34% of the people vote for Candidate A. The confidence that this result is accurate for the whole group can never be 100%; for that, the survey would need to cover the entire group. Therefore, at, say, a 95% confidence level, the result might be reported as 30-38%. If you want a higher confidence level, say 99%, then the interval widens, say to 28-40%. In normal statistical analysis, the confidence interval tells us the reliability of the sample mean as an estimate of the population mean.
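A minimal sketch of the survey example, using the usual normal-approximation interval for a proportion; the sample size n = 1000 is an assumption, since the original does not state one.

```python
import math
from scipy import stats

p_hat, n = 0.34, 1000  # 34% support; n is assumed for illustration
for conf in (0.95, 0.99):
    z = stats.norm.ppf(1 - (1 - conf) / 2)          # 1.96 and 2.576
    half = z * math.sqrt(p_hat * (1 - p_hat) / n)   # margin of error
    print(f"{conf:.0%} CI: {p_hat - half:.3f} to {p_hat + half:.3f}")
# The 99% interval is wider than the 95% one: more confidence, less precision.
```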

Frequency Distribution

A frequency distribution is a curve that gives the frequency of occurrence of a particular data point in an experiment. It is usually the limit of a histogram of frequencies when the number of data points is very large and the results can be treated as varying continuously instead of taking on discrete values. For example, consider a fair coin that is tossed four times. We want to derive the frequency distribution for the number of heads that can occur. The different possibilities through which these heads might occur are summarized in the table below. The frequency distribution is easy to see: on average, if the number of trials is very high, then out of every 16 runs of four coin flips, 1 will end up with 0 heads, 4 will end up with 1 head, 6 will end up with 2 heads, 4 will end up with 3 heads, and 1 will end up with all 4 heads. This of course assumes that the coin used for the experiment is fair, with an equal probability of a head and a tail on any given flip. In the above case, the coin is flipped only 4 times; if the coin is tossed many more times, say 100 times, and the frequency distribution is drawn, its shape will be very close to a normal probability distribution.

No. of Heads | Value of the four coin flips | Number of ways

0 | T-T-T-T | 1

1 | H-T-T-T, T-H-T-T, T-T-H-T, T-T-T-H | 4

2 | H-H-T-T, H-T-H-T, H-T-T-H, T-H-H-T, T-H-T-H, T-T-H-H | 6

3 | T-H-H-H, H-T-H-H, H-H-T-H, H-H-H-T | 4

4 | H-H-H-H | 1
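The table is exactly the binomial distribution with n = 4 and p = 0.5, which a short sketch can reproduce:

```python
from scipy import stats

# P(k heads in 4 fair flips) = C(4, k) / 16; multiplying by 16 recovers the table.
for k in range(5):
    prob = stats.binom.pmf(k, n=4, p=0.5)
    print(f"{k} heads: {round(prob * 16)} of 16 ways, probability {prob:.4f}")
```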

For sufficiently large values of the sample size, it can be mathematically shown through the central limit theorem that the distribution is approximately a normal distribution. In such a case, the 95% confidence level occurs at an interval of 1.96 times the standard deviation.

The normal probability distribution, also called the Gaussian distribution, refers to a family of distributions that are bell shaped. These are symmetric in nature and peak at the mean, with the probability distribution decreasing smoothly on either side of the mean, as shown in the figure below.

[Figure: bell-shaped normal curves; μ = mean of the population, σ = standard deviation]

• Normal distribution occurs very frequently in statistics, economics, natural and social sciences and can be used to approximate many distributions occurring in nature and in the manmade world.

• For example, the height of all people of a particular race, the length of all dogs of a particular breed, IQ, memory and reading skills of people in a general population and income distribution in an economy all approximately follow the normal probability distribution shaped like a bell curve.

• The theory of normal distribution also finds use in advanced sciences like astronomy, photonics and quantum mechanics.

• The normal distribution can be characterized by the mean and standard deviation. The mean determines where the peak occurs, which is at 0 in our figure for all the curves. The standard deviation is a measure of the spread of the normal probability distribution, which can be seen as differing widths of the bell curves in our figure.

• The Formula

• The mean is generally represented by μ and the standard deviation by σ. For a perfect normal distribution, the mean, median and mode are all equal. The normal distribution function can be written in terms of the mean and standard deviation as follows:

f(x) = (1 / (σ√(2π))) · e^(−(x − μ)² / (2σ²))
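As a quick check of the formula, the value it gives can be compared with scipy's built-in density (the point x = 1.5 and the standard parameters are arbitrary choices):

```python
import math
from scipy import stats

mu, sigma, x = 0.0, 1.0, 1.5
by_formula = math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))
by_scipy = stats.norm.pdf(x, loc=mu, scale=sigma)
print(by_formula, by_scipy)  # both ~0.1295; the peak of the curve is at x = mu
```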

Hypothesis Testing

The null and alternative hypothesis

The null hypothesis is essentially the "devil's advocate" position. That is, it assumes that whatever you are trying to prove did not happen (hint: it usually states that something equals zero). For example, the two different teaching methods did not result in different exam performances (i.e., zero difference). Another example might be that there is no relationship between anxiety and athletic performance (i.e., the slope is zero). The alternative hypothesis states the opposite and is usually the hypothesis you are trying to prove (e.g., the two different teaching methods did result in different exam performances). Initially, you can state these hypotheses in more general terms (e.g., using words like "effect" and "relationship"), as in the teaching methods example.

Some more concepts…

• Standard deviation.
• Type I error and Type II error.
• Concept of degrees of freedom.
• Statistically significant results.
• One-tailed and two-tailed tests.

Variance is a measure of how spread out a data set is.

The standard deviation is the square root of the variance.
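For example (the data set below is made up), Python's standard library computes both directly:

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]      # illustrative sample
var = statistics.variance(data)       # sample variance, n - 1 in the denominator
sd = statistics.stdev(data)           # standard deviation = sqrt(variance)
print(var, sd)
```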

…Concept of degrees of freedom

Degrees of freedom = sample size − 1 (for a one-sample test).

The tables below take a fixed total of 300 and compare dividing by n with dividing by n − 1; the difference shrinks rapidly as the sample size grows.

t-score (n up to 30):

n | n-1 | 300/n | 300/(n-1) | diff
4 | 3 | 75.0 | 100.0 | 25.0
5 | 4 | 60.0 | 75.0 | 15.0
6 | 5 | 50.0 | 60.0 | 10.0
7 | 6 | 42.9 | 50.0 | 7.1
8 | 7 | 37.5 | 42.9 | 5.4
9 | 8 | 33.3 | 37.5 | 4.2
10 | 9 | 30.0 | 33.3 | 3.3
11 | 10 | 27.3 | 30.0 | 2.7
12 | 11 | 25.0 | 27.3 | 2.3
13 | 12 | 23.1 | 25.0 | 1.9
14 | 13 | 21.4 | 23.1 | 1.6
15 | 14 | 20.0 | 21.4 | 1.4
16 | 15 | 18.8 | 20.0 | 1.3
17 | 16 | 17.6 | 18.8 | 1.1
18 | 17 | 16.7 | 17.6 | 1.0
19 | 18 | 15.8 | 16.7 | 0.9
20 | 19 | 15.0 | 15.8 | 0.8
21 | 20 | 14.3 | 15.0 | 0.7
22 | 21 | 13.6 | 14.3 | 0.6
23 | 22 | 13.0 | 13.6 | 0.6
24 | 23 | 12.5 | 13.0 | 0.5
25 | 24 | 12.0 | 12.5 | 0.5
26 | 25 | 11.5 | 12.0 | 0.5
27 | 26 | 11.1 | 11.5 | 0.4
28 | 27 | 10.7 | 11.1 | 0.4
29 | 28 | 10.3 | 10.7 | 0.4
30 | 29 | 10.0 | 10.3 | 0.3

z-score (n above 30):

n | n-1 | 300/n | 300/(n-1) | diff
31 | 30 | 9.7 | 10.0 | 0.3
32 | 31 | 9.4 | 9.7 | 0.3
33 | 32 | 9.1 | 9.4 | 0.3
34 | 33 | 8.8 | 9.1 | 0.3
35 | 34 | 8.6 | 8.8 | 0.3
36 | 35 | 8.3 | 8.6 | 0.2
37 | 36 | 8.1 | 8.3 | 0.2
38 | 37 | 7.9 | 8.1 | 0.2
39 | 38 | 7.7 | 7.9 | 0.2
40 | 39 | 7.5 | 7.7 | 0.2
41 | 40 | 7.3 | 7.5 | 0.2
42 | 41 | 7.1 | 7.3 | 0.2
43 | 42 | 7.0 | 7.1 | 0.2
44 | 43 | 6.8 | 7.0 | 0.2
45 | 44 | 6.7 | 6.8 | 0.2
46 | 45 | 6.5 | 6.7 | 0.1
47 | 46 | 6.4 | 6.5 | 0.1
48 | 47 | 6.3 | 6.4 | 0.1
49 | 48 | 6.1 | 6.3 | 0.1
50 | 49 | 6.0 | 6.1 | 0.1
51 | 50 | 5.9 | 6.0 | 0.1
52 | 51 | 5.8 | 5.9 | 0.1
53 | 52 | 5.7 | 5.8 | 0.1
54 | 53 | 5.6 | 5.7 | 0.1
55 | 54 | 5.5 | 5.6 | 0.1
56 | 55 | 5.4 | 5.5 | 0.1
57 | 56 | 5.3 | 5.4 | 0.1

Concept of degrees of freedom

Degrees of freedom reflect the number of independent pieces of information available, and together with the number of variables and samples in the experiment they help determine whether a particular null hypothesis can be rejected. For example, while a sample of 50 students might not be large enough to obtain significant information, the same results from a study of 500 samples may be judged valid.

Decision

Truth | Reject H0 | Don't reject H0
H0 (true) | Type I error | Right decision
H1 (H0 false) | Right decision | Type II error

In a hypothesis test, a type I error occurs when the null hypothesis is rejected when it is in fact true; that is, H0 is wrongly rejected.

For example, in a clinical trial of a new drug, the null hypothesis might be that the new drug is no better, on average, than the current drug; i.e.

H0: there is no difference between the two drugs on average.

A type I error would occur if we concluded that the two drugs produced different effects when in fact there was no difference between them. The table above gives a summary of the possible results of any hypothesis test.

A type I error is often considered to be more serious, and therefore more important to avoid, than a type II error. The hypothesis test procedure is therefore adjusted so that there is a guaranteed 'low' probability of rejecting the null hypothesis wrongly; this probability is never 0. The probability of a type I error can be precisely computed as:

P(type I error) = significance level = alpha

The exact probability of a type II error is generally unknown. If we do not reject the null hypothesis, it may still be false (a type II error), as the sample may not be big enough to identify the falseness of the null hypothesis (especially if the truth is very close to the hypothesis).

For any given set of data, type I and type II errors are inversely related: the smaller the risk of one, the higher the risk of the other.

A type I error can also be referred to as an error of the first kind.

Outcome 1: We reject the null hypothesis when it is false: good.

Outcome 2: We reject the null hypothesis when in reality it is true: Type I error.

Outcome 3: We retain the null hypothesis when in reality it is false: Type II error.

Outcome 4: We retain the null hypothesis when it is true: good.
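The link between alpha and the Type I error rate can be seen by simulation: if H0 is true and we test at α = 0.05 many times, about 5% of the tests reject it wrongly. The population parameters and seed below are arbitrary.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
alpha, trials, n = 0.05, 10_000, 20
false_rejections = 0
for _ in range(trials):
    sample = rng.normal(loc=100, scale=15, size=n)   # H0 (mu = 100) is true
    _, p = stats.ttest_1samp(sample, popmean=100)
    if p < alpha:
        false_rejections += 1                        # a Type I error
print(f"Type I error rate ~ {false_rejections / trials:.3f} (alpha = {alpha})")
```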

Statistically Significant Results

For most disciplines, the researcher looks for a significance level of 0.05, meaning there is only a 5% probability of observing such results by chance when the null hypothesis is true. For some scientific disciplines, the required level is 0.01: only a 1% probability that the observed patterns occurred due to chance or error. Whatever the level, the significance level determines whether the null or alternative hypothesis is rejected, a crucial part of hypothesis testing.

The structure of hypothesis testing

1. Define the research hypothesis for the study.

2. Explain how you are going to operationalize (that is, measure or operationally define) what you are studying and set out the variables to be studied.

3. Set out the null and alternative hypotheses (or more than one hypothesis; in other words, a number of hypotheses).

4. Set the significance level.

5. Make a one- or two-tailed prediction.

6. Determine whether the distribution that you are studying is normal (this has implications for the types of statistical tests that you can run on your data).

7. Select an appropriate statistical test based on the variables you have defined and whether the distribution is normal or not.

8. Run the statistical tests on your data and interpret the output.

9. Reject or fail to reject the null hypothesis.

t-test and z-test

• One-sample t-tests are used to compare a sample mean with the known population mean.

• Two-sample t-tests, on the other hand, are used to compare either independent samples or dependent samples. [They suit a limited sample size (n < 30), as long as the variables are approximately normally distributed and the variation of scores in the two groups is not reliably different.]

• The t-test is also useful when you do not know the population's standard deviation. If the standard deviation is known, it is best to use another type of statistical test, the z-test.

• Z-tests are often applied in large samples (n > 30).

Difference between z-test, F-test, and t-test

A z-test is used for testing the mean of a population versus a standard, or comparing the means of two populations, with large (n ≥ 30) samples, whether you know the population standard deviation or not. It is also used for testing the proportion of some characteristic versus a standard proportion, or comparing the proportions of two populations. Example: comparing the average engineering salaries of men versus women. Example: comparing the fraction of defectives from 2 production lines.

A t-test is used for testing the mean of one population against a standard, or comparing the means of two populations, if you do not know the population's standard deviation and you have a limited sample (n < 30). If you know the population's standard deviation, you may use a z-test. Example: measuring the average diameter of shafts from a certain machine when you have a small sample.

An F-test is used to compare 2 populations' variances. The samples can be any size. It is the basis of ANOVA. Example: comparing the variability of bolt diameters from two machines.

A matched pair test is used to compare the means before and after something is done to the samples. A t-test is often used because the samples are often small; a z-test is used when the samples are large. The variable is the difference between the before and after measurements. Example: the average weight of subjects before and after following a diet for 6 weeks.
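One way to see the t-test's extra caution for small samples is to compare critical values: the t critical value exceeds the z critical value but converges to it as n grows past about 30. A short sketch:

```python
from scipy import stats

z_crit = stats.norm.ppf(0.975)                 # 1.96 for a two-tailed 5% test
for n in (5, 10, 30, 100):
    t_crit = stats.t.ppf(0.975, df=n - 1)      # widens the rejection threshold
    print(f"n = {n:>3}: t critical = {t_crit:.3f} vs z critical = {z_crit:.3f}")
```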

Symbol | Name | Meaning | Example

x̄ | sample mean | average / arithmetic mean | x̄ = (2+5+9)/3 = 5.333

s² | sample variance | estimator of the population variance from a sample | s² = 4

s | sample standard deviation | estimator of the population standard deviation from a sample | s = 2

σ² | population variance | variance of the population values | σ² = 4

σX | population standard deviation | standard deviation of the random variable X | σX = 2

Degrees of freedom:

• For the z-test, degrees of freedom are not required, since z-scores of 1.96 and 2.58 are used for 5% and 1%, respectively.

• For the equal-variance two-sample t-test: df = (n1 + n2) − 2. (The unequal-variance Welch t-test uses an adjusted df.)

• For the paired-sample t-test: df = number of pairs − 1.

Degrees of Freedom: Definition. The number of degrees of freedom generally refers to the number of independent observations in a sample minus the number of population parameters that must be estimated from the sample data.


1.96 is the number of standard deviations away from the mean that bounds the central 95% of a normal distribution. The t-distribution approaches the normal distribution at large sample sizes.

[Figure: the t-distribution]

Statistical tests

t-test

One-sample t-test: population and sample data are compared.

Two-sample t-test, matched/paired samples: e.g., before and after means.

Two-sample t-test, independent samples (random): two sample means are compared.

z-test

One-sample z-test: population and sample data are compared.

One-sample z-test for proportions.

Two-sample z-test, matched/paired samples: e.g., before and after means.

Two-sample z-test, independent samples (random): two sample means are compared.

Two-sample z-test for proportions, paired: e.g., before and after proportions.

Two-sample z-test for proportions, independent samples (random): two sample proportions are compared.

1. Define the null and alternative hypotheses.
2. State the alpha.
3. Calculate the degrees of freedom.
4. State the decision rule.
5. Calculate the test statistic.
6. State the results.
7. State the conclusion.

In the population, the average IQ is 100. A team of scientists wants to test a new medication to see if it has either a positive or negative effect on intelligence, or no effect at all. A sample of 30 participants who have taken the medication has a mean IQ of 140 with a standard deviation of 20. Did the medication affect intelligence? Alpha is 0.05.

One sample t-test

1. Define the hypotheses: H0: μ = 100; H1: μ ≠ 100.

2. State the alpha: α = 0.05.

3. Calculate the degrees of freedom: n − 1 = 30 − 1 = 29.

4. State the decision rule: for 29 df at α = 0.05 (two-tailed), the critical t is 2.045; if |t| > 2.045, reject the null hypothesis.

5. Calculate the test statistic: t = (140 − 100) / (20/√30) ≈ 10.95.

6. State the results: 10.95 > 2.045, so the null hypothesis is rejected.

7. State the conclusion: the medication significantly affected intelligence.
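Here is a minimal sketch of the same test in Python; the numbers come from the example above.

```python
import math
from scipy import stats

n, xbar, s, mu0, alpha = 30, 140.0, 20.0, 100.0, 0.05

t = (xbar - mu0) / (s / math.sqrt(n))      # test statistic
df = n - 1                                  # degrees of freedom
t_crit = stats.t.ppf(1 - alpha / 2, df)     # two-tailed critical value, ~2.045
p = 2 * stats.t.sf(abs(t), df)              # two-tailed p-value

print(f"t = {t:.2f}, critical t = {t_crit:.3f}, p = {p:.2e}")
# t ~ 10.95 > 2.045, so H0 (no effect on IQ) is rejected.
```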

Independent sample t-test

Here we compare 2 independent samples.

A retailer wants to compare the sales of his two stores, to see if they performed differently last quarter. Store A had 25 products with an average sale of 70K/month and a standard deviation of 15K. Store B had 20 products with an average sale of 74K/month and a standard deviation of 25K. Did these 2 stores perform differently? Use a 5% significance level.

1. Define the null and alternative hypotheses.
2. State the alpha.
3. Calculate the degrees of freedom.
4. State the decision rule.
5. Calculate the test statistic.
6. State the results.
7. State the conclusion.

1. Define the hypotheses: H0: μ(Store A) = μ(Store B); H1: μ(Store A) ≠ μ(Store B).

2. State the alpha: α = 0.05.

3. Calculate the degrees of freedom: (n1 + n2) − 2 = (25 + 20) − 2 = 43. The remaining steps are sketched in code below.
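scipy can run the remaining steps directly from the summary statistics; a sketch with the store numbers above:

```python
from scipy import stats

# Store A: n=25, mean=70K, sd=15K; Store B: n=20, mean=74K, sd=25K.
t, p = stats.ttest_ind_from_stats(mean1=70, std1=15, nobs1=25,
                                  mean2=74, std2=25, nobs2=20,
                                  equal_var=True)
print(f"t = {t:.3f}, p = {p:.3f}")
# |t| ~ 0.67 is below the critical value (~2.02 for 43 df), so at the 5% level
# the two stores did not perform significantly differently.
```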

Two sample z-test

Are the two populations different?

σ1² = 1.5, n1 = 60, x̄1 = 18
σ2² = 2, n2 = 60, x̄2 = 19

State the alpha: α = 0.05. Decision rule: if z is less than -1.96 or greater than 1.96, reject the null hypothesis.

Result (see the sketch below): z ≈ -4.14, which falls outside ±1.96, so the null hypothesis is rejected. There was a significant difference in the effectiveness between the medication group and the placebo group.
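The z statistic for this example works out as follows (a direct transcription of the test into Python):

```python
import math
from scipy import stats

var1, n1, x1 = 1.5, 60, 18.0   # population variance, size, sample mean (group 1)
var2, n2, x2 = 2.0, 60, 19.0   # same for group 2

z = (x1 - x2) / math.sqrt(var1 / n1 + var2 / n2)  # known-variance z statistic
p = 2 * stats.norm.sf(abs(z))                      # two-tailed p-value
print(f"z = {z:.2f}, p = {p:.2e}")
# z ~ -4.14 lies outside ±1.96, so the null hypothesis is rejected.
```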

The Paracetamol content in a certain brand of pain-killer is normally distributed with a mean of 205 and a standard deviation of 40.

1. What is the probability that a tablet has Paracetamol content greater than 195?
2. What is the probability that a tablet has Paracetamol content greater than 205?
3. What Paracetamol content level corresponds to the 75th percentile?
4. What is the probability that a tablet has Paracetamol content between 195 and 215?
5. If a sample of 100 tablets is selected, what is the probability that the mean Paracetamol content level is greater than 200?

z = (observation − mean) / standard deviation
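All five questions reduce to evaluations of this normal distribution; a sketch with scipy:

```python
import math
from scipy import stats

mu, sigma = 205, 40
X = stats.norm(loc=mu, scale=sigma)

print(X.sf(195))                # 1. P(X > 195) ~ 0.599
print(X.sf(205))                # 2. P(X > 205) = 0.5
print(X.ppf(0.75))              # 3. 75th percentile ~ 232
print(X.cdf(215) - X.cdf(195))  # 4. P(195 < X < 215) ~ 0.197
# 5. The mean of n = 100 tablets has standard error sigma / sqrt(n) = 4.
print(stats.norm(mu, sigma / math.sqrt(100)).sf(200))  # P(mean > 200) ~ 0.894
```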

Real-life application of z-scores occurs in usability testing.

[Figure: z-score comparison of 2004 vs. 2014 data, α = 0.05]