LSSG Black Belt Training: Hypothesis Testing
2
Introduction
Always about a population parameter
Attempt to prove (or disprove) some assumption
Setup:
Alternate hypothesis: what you wish to prove. Example: a change in Y after the LSS project.
Null hypothesis: assume the opposite of what is to be proven. The null is always stated as an equality. Example: Y after the project is the same as before.
3
The test
1. Take a sample and compute the statistic of interest (e.g., a standardized mean customer satisfaction score).
2. How likely is it that if the null were true, you would get such a statistic? (the p-value)
How likely is it that the sample would show (by random chance) the difference that we see after the LSS project, if in fact there was no improvement?
3. If it is very unlikely, then the null must be false, and hence the alternate is proven beyond reasonable doubt.
4. If it is quite likely, then the null may be true, so there is not enough evidence to discard it in favor of the alternate.
4
Types of Errors
If you reject the null (assume the alternate is proven):
- Null is really true: Type I error (believe in improvement when none occurred)
- Null is really false: good decision

If you do not reject the null (evidence for the alternate is not strong enough):
- Null is really true: good decision
- Null is really false: Type II error (cannot show improvement when in fact it occurred)
5
The Testing Process
1. Set up Hypotheses (Null and Alternate )
2. Pick a significance level (alpha)
3. Compute critical value (based on alpha)
4. Compute test statistic (from sample)
5. Conclude: if |test statistic| > |critical value|, then the alternate hypothesis is proven at the alpha level of significance.
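A minimal sketch of these five steps in Python with SciPy (an assumption; the course itself uses Minitab), using the gap-measurement numbers from the one-sample z-test slide later in this deck:

```python
from scipy import stats

alpha = 0.05                               # step 2: significance level
crit = stats.norm.ppf(1 - alpha / 2)       # step 3: two-sided critical value (about 1.96)

# step 4: test statistic from the sample (mean 17, target 15, sigma 4, n 25)
sample_mean, mu0, sigma, n = 17.0, 15.0, 4.0, 25
z = (sample_mean - mu0) / (sigma / n ** 0.5)

# step 5: conclude
if abs(z) > crit:
    print(f"Reject the null at alpha={alpha}: |z|={abs(z):.2f} > {crit:.2f}")
else:
    print(f"Do not reject the null: |z|={abs(z):.2f} <= {crit:.2f}")
```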
6
Hypothesis Testing Roadmap
Attribute data:
- Chi-square / contingency tables
- Correlation: same tests as Non-Normal

Continuous data (Normal):
- Means: Z-tests, t-tests, ANOVA
- Variance: F-test, Bartlett's
- Correlation: Correlation, Regression

Continuous data (Non-Normal):
- Medians: Sign Test, Wilcoxon, Kruskal-Wallis, Mood's, Friedman's
- Variance: Levene's
- Correlation
7
Parametric Tests
Use parametric tests when:
1. The data are normally distributed
2. The variances of populations (if more than one is sampled from) are equal
3. The data are at least interval scaled
8
One sample z - test
A gap between two parts should be 15 microns. A sample of 25 measurements shows a mean of 17 microns. Test whether this is significantly different from 15, assuming the population standard deviation is known to be 4.
One-Sample Z
Test of mu = 15 vs not = 15
The assumed standard deviation = 4

 N     Mean  SE Mean              95% CI     Z      P
25  17.0000   0.8000  (15.4320, 18.5680)  2.50  0.012
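The same result can be reproduced outside Minitab. A minimal sketch in Python with SciPy (an assumption) that recovers the Z, p-value, and confidence interval shown above:

```python
from scipy import stats

n, xbar, mu0, sigma = 25, 17.0, 15.0, 4.0
se = sigma / n ** 0.5                           # 0.8000
z = (xbar - mu0) / se                           # 2.50
p = 2 * stats.norm.sf(abs(z))                   # two-sided p, about 0.012
z_crit = stats.norm.ppf(0.975)
ci = (xbar - z_crit * se, xbar + z_crit * se)   # about (15.43, 18.57)
print(z, p, ci)
```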
9
Z-test for proportions
You wish to test the hypothesis that more than two-thirds (66%) of people in the population prefer your brand over Brand X. Of the 200 customers surveyed, 140 say they prefer your brand. Is this statistically significant?
Test and CI for One Proportion
Test of p = 0.66 vs p > 0.66
Sample    X    N  Sample p  95% Lower Bound  Z-Value  P-Value
     1  140  200  0.700000         0.646701     1.19    0.116
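Minitab's normal-approximation version of this one-proportion test can be sketched as follows (Python with SciPy assumed; an exact binomial test would give a slightly different p-value):

```python
from scipy import stats

x, n, p0 = 140, 200, 0.66
phat = x / n                                              # 0.70
z = (phat - p0) / (p0 * (1 - p0) / n) ** 0.5              # about 1.19
p_value = stats.norm.sf(z)                                # one-sided p, about 0.116
lower = phat - stats.norm.ppf(0.95) * (phat * (1 - phat) / n) ** 0.5  # about 0.6467
print(z, p_value, lower)
```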
10
One sample t-test
Error Reduction (%):
10, 12, 9, 8, 7, 12, 14, 13, 15, 16, 18, 12, 18, 19, 20, 17, 15
The data show reductions in percentage of errors made by claims processors after undergoing a training course. The target was a reduction of at least 13%. Was it achieved?
11
One Sample t-test – Minitab results
One-Sample T: Error Reduction
Test of mu = 13 vs > 13
Variable          N     Mean   StDev  SE Mean  95% Lower Bound     T      P
Error Reduction  17  13.8235  3.9248   0.9519          12.1616  0.87  0.200
The p-value of 0.20 indicates that the reduction in errors could not be proven to be greater than 13%: because 0.20 exceeds alpha (0.05), there is not enough evidence to rule out a mean reduction of 13% or less.
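A minimal sketch of the same one-sided, one-sample t-test in Python with SciPy (assumed available; the `alternative` argument requires SciPy 1.6 or later):

```python
from scipy import stats

error_reduction = [10, 12, 9, 8, 7, 12, 14, 13, 15, 16, 18,
                   12, 18, 19, 20, 17, 15]
t, p = stats.ttest_1samp(error_reduction, popmean=13, alternative='greater')
print(round(t, 2), round(p, 3))   # t about 0.87, one-sided p about 0.20
```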
12
Two Sample t-test
 M    F
10   15
12   16
 9   18
 8   12
 7   18
12   19
14   20
13   17
     15
You realize that though the overall reduction is not proven to be more than 13%, there seems to be a difference between how men and women react to the training. You separate the 17 observations by gender, and wish to test whether there is in fact a significant difference between genders in error reduction.
13
Two Sample t-test
Two-sample T for Error Reduction M vs Error Reduction F
             N   Mean  StDev  SE Mean
Error Red M  8  10.63   2.50     0.89
Error Red F  9  16.67   2.45     0.82

Difference = mu (Error Red M) - mu (Error Red F)
Estimate for difference: -6.04167
95% CI for difference: (-8.60489, -3.47844)
T-Test of difference = 0 (vs not =): T-Value = -5.02  P-Value = 0.000  DF = 15
Both use Pooled StDev = 2.4749
A test for equal variances shows that the variances of the two samples are not significantly different, so a pooled two-sample t-test may be conducted. The results are shown above: the p-value indicates a significant difference between the genders in their error reduction due to the training.
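A sketch of the same pooled-variance two-sample t-test in Python with SciPy (assumed):

```python
from scipy import stats

men   = [10, 12, 9, 8, 7, 12, 14, 13]
women = [15, 16, 18, 12, 18, 19, 20, 17, 15]
t, p = stats.ttest_ind(men, women, equal_var=True)   # pooled standard deviation
print(round(t, 2), p)   # t about -5.02, two-sided p well below 0.05
```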
14
Chi-squared test of independence
For tabulated count data.
Two types of glass sheets are manufactured, and the defects found on 111 sheets are tabulated based on the type of glass (Type A and Type B) and the location of the defect on each sheet (Zone 1 and Zone 2).
You wish to test whether the two variables (Type of glass and Location of error on the glass) are statistically independent of each other.
15
Chi Square Test Results
Tabulated statistics: Glass Type, Location
Rows: Glass Type Columns: Location
         Zone 1   Zone 2     All
Type A       29       23      52
          29.98    22.02   52.00
Type B       35       24      59
          34.02    24.98   59.00
All          64       47     111
          64.00    47.00  111.00

Cell Contents:  Count
                Expected count
Pearson Chi-Square = 0.143, DF = 1, P-Value = 0.705
Likelihood Ratio Chi-Square = 0.143, DF = 1, P-Value = 0.705
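A sketch of the same test in Python with SciPy (assumed); `correction=False` turns off the Yates continuity correction so the statistic matches Minitab's Pearson chi-square:

```python
from scipy import stats

observed = [[29, 23],   # Type A: Zone 1, Zone 2
            [35, 24]]   # Type B: Zone 1, Zone 2
chi2, p, dof, expected = stats.chi2_contingency(observed, correction=False)
print(round(chi2, 3), dof, round(p, 3))   # about 0.143, 1, 0.705
```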
16
Assignment
From the book “Doing Data Analysis with Minitab 14” by Robert Carver:
1. Pages 138 – 142: Choose any 3 of the datasets mentioned on those pages and answer the related questions. [1-sample means]
2. Pages 148 -151: Choose any 3 of the datasets mentioned on those pages and answer the related questions. [1-sample proportions]
3. Pages 165 -168: Choose any 3 of the datasets mentioned on those pages and answer the related questions. [2-sample tests]
17
Basics of ANOVA
Analysis of Variance, or ANOVA, is a technique used to test the hypothesis that there is a difference between the means of two or more populations. It is used in regression, in the analysis of factorial experiment designs, and in Gauge R&R studies.
The basic premise of ANOVA is that differences in the means of 2 or more groups can be seen by partitioning the Sum of Squares. Sum of Squares (SS) is simply the sum of the squared deviations of the observations from their means. Consider the following example with two groups. The measurements show the thumb lengths in centimeters of two types of primates.
Total variation (SS) is 28, of which only 4 (2+2) is within the two groups. Thus 24 of the 28 is due to the differences between the groups. This partitioning of SS into ‘between’ and ‘within’ is used to test the hypothesis that the groups are in fact different from each other.
See www.statsoft.com for more details.
Obs.   Type A   Type B
1           2        6
2           3        7
3           4        8
Mean        3        7
SS          2        2

Overall: Mean = 5, SS = 28
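The partitioning described above can be checked directly. A minimal sketch in Python with NumPy (assumed):

```python
import numpy as np

type_a = np.array([2, 3, 4])
type_b = np.array([6, 7, 8])
all_obs = np.concatenate([type_a, type_b])

ss_total   = ((all_obs - all_obs.mean()) ** 2).sum()            # 28
ss_within  = ((type_a - type_a.mean()) ** 2).sum() + \
             ((type_b - type_b.mean()) ** 2).sum()              # 2 + 2 = 4
ss_between = ss_total - ss_within                               # 24
print(ss_total, ss_within, ss_between)
```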
18
Results of ANOVA
One-way ANOVA: Type A, Type B
Source DF SS MS F P
Factor 1 24.00 24.00 24.00 0.008
Error 4 4.00 1.00
Total 5 28.00
___________________________________
S = 1 R-Sq = 85.71% R-Sq(adj) = 82.14%
The results of running an ANOVA on the sample data from the previous slide are shown here. The hypothesis test computes the F-value as the ratio of MS 'Between' to MS 'Within'. The greater the value of F, the greater the likelihood that there is in fact a difference between the groups. Looking it up in an F-distribution table shows a p-value of 0.008, indicating a 99.2% confidence that the difference is real (exists in the population, not just in the sample).
Minitab: Stat/ANOVA/One-Way (unstacked)
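The same one-way ANOVA can be reproduced outside Minitab; a minimal sketch in Python with SciPy (assumed):

```python
from scipy import stats

f, p = stats.f_oneway([2, 3, 4], [6, 7, 8])
print(round(f, 2), round(p, 3))   # F = 24.0, p about 0.008
```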
19
Two-Way ANOVA
Two-way ANOVA: Strength versus Temp, Speed
Source        DF       SS       MS      F      P
Temp           1   3.5208   3.5208   5.45  0.048
Speed          1  20.0208  20.0208  31.00  0.001
Interaction    1  58.5208  58.5208  90.61  0.000
Error          8   5.1667   0.6458
Total         11  87.2292
S = 0.8036 R-Sq = 94.08% R-Sq(adj) = 91.86%
Strength   Temp   Speed
20.0       Low    Slow
22.0       Low    Slow
21.5       Low    Slow
23.0       Low    Fast
24.0       Low    Fast
22.0       Low    Fast
25.0       High   Slow
24.0       High   Slow
24.5       High   Slow
17.0       High   Fast
18.0       High   Fast
17.5       High   Fast
Is the strength of steel produced different for different temperatures to which it is heated and the speed with which it is cooled? Here 2 factors (speed and temp) are varied at 2 levels each, and strengths of 3 parts produced at each combination are measured as the response variable.
The results show significant main effects as well as an interaction effect.
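To reproduce this analysis outside Minitab, one option is a linear model with an interaction term. A sketch using pandas and statsmodels (both assumed available); for this balanced design it should give the same SS, F, and p values:

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

df = pd.DataFrame({
    "Strength": [20.0, 22.0, 21.5, 23.0, 24.0, 22.0,
                 25.0, 24.0, 24.5, 17.0, 18.0, 17.5],
    "Temp":  ["Low"] * 6 + ["High"] * 6,
    "Speed": ["Slow"] * 3 + ["Fast"] * 3 + ["Slow"] * 3 + ["Fast"] * 3,
})

# Main effects plus the Temp x Speed interaction
model = ols("Strength ~ C(Temp) * C(Speed)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))
```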
20
Two-Way ANOVA
[Figure: Boxplot of Strength by Temp, Speed. Y-axis: Strength (16 to 25); x-axis: Temp (Low, High) with Speed (Slow, Fast) within each level.]
The box plots give an indication of the interaction effect. The effect of speed on the response is different for different levels of temperature. Thus, there is an interaction effect between temperature and speed.
21
Assignment
From the book “Doing Data Analysis with Minitab 14” by Robert Carver:
1. Pages 192 – 194: Choose any 3 of the datasets mentioned on those pages and answer the related questions. [1-way ANOVA]
2. Pages 204 – 206: Choose any 3 of the datasets mentioned on those pages and answer the related questions. [2-way ANOVA]
22
DOE Overview
A designed experiment involves controlling specific inputs (factors) at various levels (typically 2 levels, such as "High" and "Low" settings) to observe the resulting change in the output, and analyzing the data to determine the significance and relative importance of the factors.
The simplest case would be to vary a single factor, say temperature, while baking cookies. Keeping all else constant, we can set temperature at 350 degrees and 400 degrees, and make several batches at those two temperatures, and measure the output desired – in this case it could be a rating by experts of crispiness of the cookies on a scale of 0-10.
23
Full Factorial Designs
A 2^k factorial design implies that there are k factors at 2 levels each. The case described on the previous slide with only one factor is the simplest. Having two factors at 2 levels would give us four combinations. Three factors yield 8 combinations, 4 would yield 16, and so forth.
The following table shows the full factorial (all 8 combinations) design for 3 factors – temperature, baking time, and amount of butter,
each at two levels – HIGH and LOW.
Temp Time Butter
Low Low Low
High Low Low
Low High Low
High High Low
Low Low High
High Low High
Low High High
High High High
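For illustration, the eight runs in this table can be generated programmatically; a minimal sketch in Python (Minitab itself builds designs via Stat > DOE > Create Factorial Design). The run order differs from the table above:

```python
from itertools import product

factors = {"Temp":   ["Low", "High"],
           "Time":   ["Low", "High"],
           "Butter": ["Low", "High"]}

# All 2**3 = 8 combinations of the three two-level factors
for run in product(*factors.values()):
    print(dict(zip(factors.keys(), run)))
```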
24
Fractional Factorials
The previous example would require 8 different setups to bake the cookies. For each setup, one could bake several batches, say 4 batches, to get a measure of the internal variation. In practice, as the number of factors tested grows, it is difficult to even create all the setups needed, much less have replications within a setup.
An alternative is to use fractional factorial designs, typically a ½ or ¼. As the name suggests, a ½ factorial design with 3 factors would only require 4 of the 8 combinations to be tested. This entails some loss of resolution, usually a confounding of interaction terms, which may be of no interest to the experimenter, and can be sacrificed.
Temp Time Butter
High High High
Low Low High
Low High Low
High Low Low
Minitab: Stat/DOE/Create Factorial Design/Display Factorial Designs
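The four runs above form the half fraction obtained by setting Butter equal to the Temp x Time interaction in coded units (defining relation I = Temp*Time*Butter, so Butter is confounded with that interaction). A minimal sketch, assuming -1 codes Low and +1 codes High:

```python
from itertools import product

levels = {-1: "Low", 1: "High"}
for temp, time in product([-1, 1], repeat=2):
    butter = temp * time          # generator: Butter = Temp x Time
    print(levels[temp], levels[time], levels[butter])
```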
25
Running the Experiment – Outcome Values
Once the settings to be used are determined, we can run the experiment and measure the values of the outcome variable. This table shows the values of the outcome variable “Crisp”, showing the crispiness index for the cookies, for each of the 8 settings of the full factorial experiment.
Temp Time Butter Crisp
Low Low Low 7
High Low Low 10
Low High Low 7
High High Low 5
Low Low High 4
High Low High 9
Low High High 8
High High High 8
26
Analysis of Data
Analyzing the data in Minitab for the main effects and ignoring interaction
terms, we get the following output:
Factorial Fit: Crispiness versus Temp, Time, Butter
Estimated Effects and Coefficients for Crispiness (coded units)
Term       Effect    Coef  SE Coef      T      P
Constant           7.2500   0.3750  19.33  0.000
Temp       3.0000  1.5000   0.3750   4.00  0.016
Time       0.5000  0.2500   0.3750   0.67  0.541
Butter     1.5000  0.7500   0.3750   2.00  0.116
S = 1.06066 R-Sq = 83.64% R-Sq(adj) = 71.36%
Analysis of Variance for Crispiness (coded units)
Source          DF  Seq SS  Adj SS  Adj MS     F      P
Main Effects     3  23.000  23.000   7.667  6.81  0.047
Residual Error   4   4.500   4.500   1.125
Total            7  27.500
Estimated Coefficients for Crispiness using data in uncoded units
Term      Coef
Constant  7.25000
Temp      1.50000
Time      0.250000
Butter    0.750000
Note that only the temperature is significant (p-value lower than 0.05). The effect of temperature is 3.00, which means that if temp. is set at HIGH, crispiness will increase by 3.00 units on average, compared to the LOW setting.
Minitab: Stat/DOE/Create Factorial Design/Analyze Factorial Design
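As a rough illustration of what the Effect column means, a main effect in a two-level design is the mean response at the HIGH setting minus the mean response at the LOW setting. A minimal sketch (the function and variable names are illustrative; Minitab's regression-based fit also supplies the standard errors and p-values shown above):

```python
import numpy as np

def main_effect(levels, response):
    """Mean response at High minus mean response at Low for a two-level factor."""
    levels = np.asarray(levels)
    response = np.asarray(response, dtype=float)
    return response[levels == "High"].mean() - response[levels == "Low"].mean()

# Hypothetical usage with a factor column and the measured crispiness scores:
# main_effect(temp_settings, crisp_scores)
```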
27
Assignment
From the book “Doing Data Analysis with Minitab 14” by Robert Carver:
1. Pages 309 – 310: Answer any 4 of the 7 questions on those pages. [DOE]
28
Hypothesis Testing Roadmap
Attribute data:
- Chi-square / contingency tables
- Correlation: same tests as Non-Normal

Continuous data (Normal):
- Means: Z-tests, t-tests, ANOVA
- Variance: F-test, Bartlett's
- Correlation: Correlation, Regression

Continuous data (Non-Normal):
- Medians: Sign Test, Wilcoxon, Kruskal-Wallis, Mood's, Friedman's
- Variance: Levene's
- Correlation
29
Non-Parametric Tests
Use non-parametric tests:
1. When data are obviously non-normal
2. When the sample is too small for the central limit theorem to lead to normality of averages
3. When the distribution is not known
4. When the data are nominal or ordinal scaled
Remember that even non-parametric tests have some assumptions about the data.
30
The sign test
The story:
A patient sign-in process at a hospital is being evaluated, and the time lapse between arrival and seeing a physician is recorded for a random sample of patients. You believe that currently the median time is over 20 minutes, and wish to test the hypothesis.
31
The sign test – data
Data for the test, Process Time (minutes):
5, 7, 15, 30, 32, 35, 62, 75, 80, 85, 95, 100
[Figure: Histogram of Process Time. X-axis: Process Time (0 to 100); y-axis: Frequency (0 to 3).]
The histogram of the data shows that it is non-normal, and the sample size is too small for the central limit theorem to apply. The data are at least ordinal in nature (here they are ratio scaled), satisfying the assumption of the sign test.
32
Sign test - analysis
Since the hypothesis is that the median is greater than 20, the test compares each value to 20. Those that are smaller get a negative sign, those that are larger than 20 get a positive one. The sign test then computes the probability that the number of negatives and positives observed would come about through random chance if the null were true (that the median time is 20 minutes).
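Under the null hypothesis, the number of values above 20 follows a Binomial(n, 0.5) distribution, so the p-value is a binomial tail probability. A minimal sketch in Python with SciPy (assumed; `binomtest` requires SciPy 1.7 or later):

```python
from scipy import stats

times = [5, 7, 15, 30, 32, 35, 62, 75, 80, 85, 95, 100]
above = sum(t > 20 for t in times)    # 9
n     = sum(t != 20 for t in times)   # 12 (values equal to 20 would be dropped)
result = stats.binomtest(above, n=n, p=0.5, alternative='greater')
print(round(result.pvalue, 4))        # about 0.073
```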
33
Sign Test in Minitab - Results
Sign Test for Median: Process Time
Sign test of median = 20.00 versus > 20.00
N Below Equal Above P Median
Process Time 12 3 0 9 0.0730 48.50
In these data, there are 9 observations above 20 and 3 below. Even if the population median were in fact not greater than 20, a split this lopsided would occur by chance with probability 0.0730. Thus, there is insufficient evidence to prove the hypothesis (to reject the null) at the 5% level, but enough if you are willing to take a 7.3% risk.
34
The sign test – other applications
The sign test can also be used for testing the value of the median difference between paired samples, as illustrated in the following link. The difference between values in a paired sample can be treated as a single sample, so any 1-sample hypothesis test can be applied. In such a case, the assumption is that the pairs are independent of each other.
The equivalent parametric tests for the sign test are the 1-sample z test and the 1-sample t-test.
http://davidmlane.com/hyperstat/B135165.html
35
Wilcoxon Signed-Rank Test
A test is conducted where customers are asked to rate two products based on various criteria, and come up with a score between 0 and 100 for each. The tester’s goal is to check whether product A, the new version, is perceived to be superior to product B. The null hypothesis would be that they are equal to each other.
36
Wilcoxon Signed-Rank Test
The data
Prod A   Prod B   Diff
55       50        5
60       62       -2
77       70        7
82       78        4
99       90        9
92       95       -3
86       90       -4
84       80        4
90       86        4
72       71        1
The measures are rankings by people, so the data are not necessarily interval scaled, and certainly not ratio scaled. Thus a paired-sample t-test is not appropriate. A non-parametric equivalent is the Wilcoxon Signed-Rank Test.
This is similar to the sign test, but more sensitive.
37
Wilcoxon Signed-Rank Test
Unlike the sign test, which only looks at whether something is larger or smaller, this test uses the magnitudes of the differences, rank orders them, and then applies the sign of each difference and computes the sum of those ranks. This statistic (called W) has a sampling distribution that is approximately normal.
For details on the technique, see the link below.Assumptions are:1. Data are at least ordinal in nature2. The pairs are independent of each other3. Dependent variable is continuous in nature
http://faculty.vassar.edu/lowry/ch12a.html
38
Wilcoxon test in Minitab - Results
Wilcoxon Signed Rank Test: Diff
Test of median = 0.000000 versus median > 0.000000
        N   N for Test   Wilcoxon Statistic       P   Estimated Median
Diff   10           10                 44.5   0.046              2.500
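A rough equivalent in Python with SciPy (assumed). SciPy's handling of ties and the continuity correction can make the p-value differ slightly from Minitab's 0.046:

```python
from scipy import stats

diff = [5, -2, 7, 4, 9, -3, -4, 4, 4, 1]
stat, p = stats.wilcoxon(diff, alternative='greater')
print(stat, round(p, 3))   # rank sum of positive differences = 44.5, p near 0.05
```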
39
Mann-Whitney Test – Two Samples
The Story:
Customers were asked to rate a service in the past, and 10 people did so. After some improvements were made, data were collected again, with a new random set of customers. Twelve people responded this time.
There is no pairing or matching of data, since the samples of customers for the old and the new processes are different.
http://faculty.vassar.edu/lowry/ch11a.html
40
Mann-Whitney Test – Two Samples
Old   New
60     85
70     85
85     90
78     94
90     90
68     70
35     75
80     90
80     90
75    100
       95
       90
Note that the assumptions of a 2-sample t-test are violated because the data are not interval scaled, and may not be normally distributed.
The Mann-Whitney Test is the non-parametric alternative to the 2-sample t-test.
41
Mann-Whitney Test – Two Samples
The Mann-Whitney test rank orders all the data, with both columns combined into one. The ranks are then separated by group so the raw data is now converted to ranks. The sum of the ranks for each column is computed.
The sums of ranks are expected to be in proportion to the sample sizes, if there is no difference between the groups. Based on this premise, the actual sum is compared to the expected sum, and the statistic is tested for significance.
See details with another example on this link from Vassar Univ. :
http://faculty.vassar.edu/lowry/ch11a.html
42
Mann-Whitney Test in Minitab - Results
Mann-Whitney Test and CI: Old, New
      N  Median
Old  10   76.50
New  12   90.00

Point estimate for ETA1-ETA2 is -14.00
95.6 Percent CI for ETA1-ETA2 is (-22.00, -5.00)
W = 72.5
Test of ETA1 = ETA2 vs ETA1 < ETA2 is significant at 0.0028
The test is significant at 0.0025 (adjusted for ties)
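A rough equivalent in Python with SciPy (assumed). SciPy reports the U statistic rather than Minitab's rank-sum W, but the one-sided p-value should be close to the adjusted value above:

```python
from scipy import stats

old = [60, 70, 85, 78, 90, 68, 35, 80, 80, 75]
new = [85, 85, 90, 94, 90, 70, 75, 90, 90, 100, 95, 90]
u, p = stats.mannwhitneyu(old, new, alternative='less')
print(u, round(p, 4))
```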
43
Kruskal-Wallis Test – 3 or more samples
Here the data would be similar to the Mann-Whitney test, except for having more than 2 samples. For parametric data, one would conduct an ANOVA to test for differences between 3 or more populations. The Kruskal-Wallis test is thus a non-parametric equivalent of ANOVA.
44
Kruskal-Wallis Test – Data
Adults   Teens   Children
7        9       3
5        9       4
6        8       3
4        5       5
2        9       10
6        10      2
5        7
         8

Stacked for Minitab (Rating, Factor): (7, Adults), (5, Adults), (6, Adults), (4, Adults), (2, Adults), (6, Adults), (5, Adults), (9, Teens), (9, Teens), (8, Teens), (5, Teens), (9, Teens), (10, Teens), (7, Teens), (8, Teens), (3, Children), (4, Children), (3, Children), (5, Children), (10, Children), (2, Children)
The data show ratings of some product by three different groups. The same data are shown stacked (Rating, Factor) above, which is the layout needed to perform the test in Minitab.
45
Kruskal-Wallis Test
The Kruskal-Wallis test proceeds very similarly to the Mann-Whitney test. The data are all ranked from low to high values, and the ranks then separated by group. For each group, the ranks are summed and averaged.
Each group average is compared to the overall average, and the deviation measured, weighted by the number of observations in each group. If the groups were identical, the deviations from the grand mean would be a small number (not 0, as one might intuitively think) that can be computed.
The actual difference is compared to the expected one (H statistic computed) to complete the test. See the link below for details of the computation, if interested.
http://faculty.vassar.edu/lowry/ch14a.html
46
Kruskal-Wallis Test Minitab Results
Kruskal-Wallis Test: Rating versus Factor
Kruskal-Wallis Test on Rating
Factor     N  Median  Ave Rank      Z
Adults     7   5.000       8.6  -1.23
Children   6   3.500       7.2  -1.79
Teens      8   8.500      15.9   2.86
Overall   21              11.0

H = 8.37  DF = 2  P = 0.015
H = 8.48  DF = 2  P = 0.014 (adjusted for ties)
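A sketch of the same test in Python with SciPy (assumed); `scipy.stats.kruskal` applies the tie correction, so it should match the adjusted line above:

```python
from scipy import stats

adults   = [7, 5, 6, 4, 2, 6, 5]
teens    = [9, 9, 8, 5, 9, 10, 7, 8]
children = [3, 4, 3, 5, 10, 2]
h, p = stats.kruskal(adults, teens, children)
print(round(h, 2), round(p, 3))   # H about 8.48, p about 0.014 (tie-adjusted)
```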
47
Mood’s Median Test
Mood median test for Rating
Chi-Square = 10.52   DF = 2   P = 0.005

Factor     N<=   N>   Median   Q3-Q1
Adults       6    1     5.00    2.00
Children     5    1     3.50    3.50
Teens        1    7     8.50    1.75

Overall median = 6.00

(Minitab also displays individual 95.0% confidence intervals for each group's median.)
The Mood’s median test is an alternative to Kruskal-Wallis. It is generally more robust against violations of assumptions, but less powerful.
48
Friedman’s Test
Friedman’s Test is the non-parametric equivalent of a randomized block design in ANOVA. In other words, there are 3 or more groups, but each row of values across the groups is matched.
The story
A person’s performance is rated in a normal state, rated again after introducing noise in the environment, and finally with the introduction of classical music in the background. This is done for a sample of 7 employees.
49
Friedman’s Test – Data
Normal Noise Music
7 5 8
8 4 8
6 6 8
9 5 8
5 5 7
7 4 9
8 4 9
Stacked for Minitab (Perform, Group, Block): (7, Normal, 1), (8, Normal, 2), (6, Normal, 3), (9, Normal, 4), (5, Normal, 5), (7, Normal, 6), (8, Normal, 7), (5, Noise, 1), (4, Noise, 2), (6, Noise, 3), (5, Noise, 4), (5, Noise, 5), (4, Noise, 6), (4, Noise, 7), (8, Music, 1), (8, Music, 2), (8, Music, 3), (8, Music, 4), (7, Music, 5), (9, Music, 6), (9, Music, 7)
The data show the ratings of performance by person in each of 3 conditions. The same data are stacked (Perform, Group, Block) above for doing the test in Minitab. Each person represents a block of data, since the 3 numbers for that person are related.
50
Friedman’s Test - Analysis
Friedman’s test also ranks the ratings, but this time the ranking is done internally within each row – the three scores for each person are ranked 1, 2, and 3. These ranks are then summed and averaged.
If the groups are identical, then one would expect no difference in the sum or mean of rankings for each group. In other words, if the conditions did not affect the performance rating, the rankings would either be the same, or vary randomly across people to yield equal sums.
The sums are compared to this expectation to test the hypothesis. See the following link for more details.
http://faculty.vassar.edu/lowry/ch15a.html
51
Friedman’s Test in Minitab – Results.
Friedman Test: Perform versus Group blocked by Block
S = 9.50   DF = 2   P = 0.009
S = 10.64  DF = 2   P = 0.005 (adjusted for ties)

Group    N  Est Median  Sum of Ranks
Music    7       8.000          19.5
Noise    7       4.667           8.0
Normal   7       7.333          14.5
Grand median = 6.667
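A sketch of the same test in Python with SciPy (assumed); `scipy.stats.friedmanchisquare` adjusts for ties, so the statistic should be close to the tie-adjusted S above:

```python
from scipy import stats

# Each list holds one condition, ordered by person (block) 1 through 7
normal = [7, 8, 6, 9, 5, 7, 8]
noise  = [5, 4, 6, 5, 5, 4, 4]
music  = [8, 8, 8, 8, 7, 9, 9]
s, p = stats.friedmanchisquare(normal, noise, music)
print(round(s, 2), round(p, 3))   # about 10.6, p about 0.005
```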