Prepare for STAT170 exam
Post on 21-Apr-2015
Basic assumptions about you
Many elementary concepts have been skipped. At this stage, it is assumed that you know them well. In particular, you MUST know how to do HATPC for each of the 8 hypothesis tests. Only important things, or those that connect several topics, are elaborated here. You have ABSOLUTELY NO hope of passing STAT170 if you do not know the 8 HATPCs. This PP file will NOT push you from F to P. Its contents will only help students at P or above, given the presumed basic knowledge.
Binding things together
Review of:
5 types of graphics
5 types of research questions
8 statistical tests
8 or MORE types of reports
Displaying Data: 5 types of graphics
Combination of variable(s): Graphic
One categorical (Lecture 2, 11): bar chart or pie chart
One numerical (Lecture 2, 7): histogram or stem-and-leaf plot
Two categorical (Lecture 2, 11, 12): clustered bar chart
Two numerical (Lectures 2, 9 & 10): scatter plot
One categorical and one numerical (Lecture 2, 8): comparative box plots
5 types of graphics
STAT170 is restricted to only 5 types of combinations of variables, 5 different types of graphics, and 5 possible research questions. The most important step is correctly identifying the types of variables: NUMERICAL vs CATEGORICAL. Surprisingly, many students have difficulty with this very first step. Identifying the variables correctly (or wrongly) leads you to the correct (or wrong):
Type of graphic
Research question, and
Statistical test.
How to comment on graphics:
1. Comments on a single bar chart (seldom asked)
The comment depends on whether the variable is ordinal or nominal.
Ordinal: comment as for a histogram.
Nominal: comment on which categories have the highest and lowest frequencies.
[Figure: bar chart of diet (meat, vegetarian, vegan); frequency axis 0 to 400]
For a nominal variable such as diet, a comment like "Skewed to the right" doesn't make any sense!
2. Comments on a single histogram (or stem-and-leaf plot)
  1. Comment on shape (skewed left/right, normal)
  2. Range from xxxx to xxxx
  3. Majority (high frequencies) of data about xxxx
  4. Comment on outliers (if present)
  5. Comment on any unusual features (if present)
[Figure: histogram; x-axis 0 to 30, frequency axis 0 to 500]
Example:
U-shaped, with high frequencies near both ends and the lowest frequencies near the centre
U-shaped, but slightly skewed left
Range from 0 to 12
[Figure: histogram; x-axis 0 to 12, frequency axis 0 to 100; axis label: Assessment]
3. Comments on comparative boxplots
Compare medians
Compare spread (IQR)
Compare outliers (even when there are no outliers, say "no outliers")
[Figure: comparative boxplots by Class]
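The two numerical summaries behind a boxplot comment, median and IQR, can be sketched in Python as below. The class marks are hypothetical, the helper name `summarise` is ours, and the quartile convention used by `statistics.quantiles` may differ slightly from the one taught in lectures, so treat the IQR values as illustrative only.

```python
from statistics import median, quantiles

def summarise(values):
    """Return the median and IQR (Q3 - Q1) used when comparing boxplots."""
    q1, _, q3 = quantiles(values, n=4)  # three cut points: Q1, Q2, Q3
    return {"median": median(values), "IQR": q3 - q1}

# Hypothetical marks for two classes (not from the slides).
class_a = [55, 60, 62, 65, 70, 71, 90]
class_b = [40, 58, 66, 72, 75, 80, 85]

for name, marks in (("Class A", class_a), ("Class B", class_b)):
    print(name, summarise(marks))
```

A comment would then compare the two medians, compare the two IQRs, and state whether either class has outliers.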
4. Comments on scatter plot
Linear or curved?
Positive or negative slope?
Amount of scatter (big or small)?
Outliers, if any
Residuals: symmetric on both sides of the line / normal? Constant SD?
[Figures: example scatter plots; axis labels recovered: Birth Rate; age marriage, husband age; UAI, GPA]
5. Comments on clustered bar charts
Compare the shapes of the clusters, NOT the sizes.
Shapes (not sizes) similar: the 2 variables are independent (i.e. have no association), since the percentages are the same.
Shapes (not sizes) not similar: the 2 variables are not independent (i.e. have an association), because the percentages are not the same.
Comments on clustered bar charts: explanation
Never compare the actual frequencies (sizes). Only compare percentages or proportions (shapes).
Example 1 (clusters similar in shape, although different in size): since the proportions are almost the same, i.e. about 1/3 and 2/3 for smokers and non-smokers, smoking status is independent of Activity Level (no association).
Example 2 (clusters different in shape, although the same size): since the percentages of smokers and non-smokers are obviously different for males and females, there is an association between smoking status and gender.
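The "compare percentages, not counts" rule can be sketched in a few lines of Python. The counts below are hypothetical, chosen only so that the clusters differ in size but share the 1/3 to 2/3 shape; the function name `cluster_percentages` is ours.

```python
def cluster_percentages(counts):
    """Convert each cluster's raw counts into percentages of that cluster's total."""
    result = {}
    for cluster, cells in counts.items():
        total = sum(cells.values())
        result[cluster] = {k: round(100 * v / total, 1) for k, v in cells.items()}
    return result

# Hypothetical counts: the "high" cluster is three times the size of "low".
activity = {
    "low":  {"smoker": 10, "non-smoker": 20},
    "high": {"smoker": 30, "non-smoker": 60},
}

shapes = cluster_percentages(activity)
print(shapes)
# Same percentages in every cluster => same shape => no association.
print(shapes["low"] == shapes["high"])
```

The raw counts differ in every cell, yet the percentages match, which is why the comment must be based on shape rather than size.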
The 8 hypothesis tests in STAT170
Clustered bar chart (categorical data): chi-square test of association + OR (odds ratio)
Determining numerical vs. categorical
You only need to be able to distinguish between numerical and categorical. There is no need to classify further into continuous or discrete (= integer), or into nominal or ordinal. If you cannot distinguish between nominal and ordinal, you'll only lose a few marks in Q.1. But what about numerical vs categorical? See next slide.
Comparative boxplots: 2-sample t-test
Scatter plot: t-test of slope
Histogram: 1-sample z-test or t-test
Bar chart (one categorical variable): z-test of proportion or chi-square test of proportions
Note: 7 tests above + paired t-test + OR = 8 tests in STAT170
Example: Numerical vs Categorical
Age: age in years. Numerical (continuous). Histogram / stem-and-leaf => z-test or t-test.
Age: 0-12 child (1), 13-18 teenager (2), > 18 adult (3). Categorical (ordinal). Bar chart / pie chart => chi-square test of proportions (GOF test).
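To see why the same name, Age, can be either type, here is a minimal Python sketch that recodes age in years into the three codes from the example. The function name `age_group` and the sample ages are hypothetical.

```python
def age_group(age_years):
    """Recode a numerical age into the ordinal codes used in the example."""
    if age_years <= 12:
        return 1  # child (0-12)
    if age_years <= 18:
        return 2  # teenager (13-18)
    return 3      # adult (> 18)

ages = [4, 15, 18, 19, 42]               # hypothetical ages in years: numerical
groups = [age_group(a) for a in ages]    # same people, now category codes
print(groups)  # [1, 2, 2, 3, 3]
```

The codes 1, 2 and 3 are only labels for categories. Taking their mean would be meaningless, which is why the recoded variable calls for a bar chart and a test of proportions rather than a histogram and a test of the mean.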
No one can help you
How many such mistakes can you afford to make in the exam? 3 such mistakes => you'll fail STAT170. You have absolutely no hope of passing STAT170 if you cannot distinguish between numerical and categorical variables, since the whole philosophy of STAT170 is based on classifying variables as categorical or numerical. (This is unlike other 1st-year stat courses in other universities.)
A mistake will cost you at least 6 marks in HATPC, plus other marks in subsequent parts of the question. The key is to look at the definition, not the meaning we use in daily language. Read the question! The results are unchanged if we use the names ABC or XYZ instead of AGE.
Absolute bottom line:
1. HOW MANY variables?
2. Are the variables numerical or categorical?
Answering these 2 questions correctly will lead you to one of the 5 cases, and very nearly to the correct test. The HATPC is then, hopefully, bookwork.
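The two questions above amount to a lookup. The Python sketch below encodes the 5 cases as they are named in these slides; the names `CASES` and `choose_case` are ours, and the paired t-test is deliberately left out because it does not fit the 5 cases.

```python
# (number of categorical, number of numerical) -> (graphic, test) from the slides.
CASES = {
    (1, 0): ("bar chart or pie chart",
             "z-test of proportion or chi-square test of proportions (GOF)"),
    (0, 1): ("histogram or stem-and-leaf plot", "1-sample z-test or t-test"),
    (2, 0): ("clustered bar chart", "chi-square test of association (or odds ratio)"),
    (0, 2): ("scatter plot", "regression: t-test of slope"),
    (1, 1): ("comparative box plots", "2-sample t-test"),
}

def choose_case(variable_types):
    """Map a list like ["categorical", "numerical"] to (graphic, test)."""
    n_cat = variable_types.count("categorical")
    n_num = variable_types.count("numerical")
    if n_cat + n_num != len(variable_types):
        raise ValueError("each variable must be 'categorical' or 'numerical'")
    if (n_cat, n_num) not in CASES:
        raise ValueError("no STAT170 case for this combination of variables")
    return CASES[(n_cat, n_num)]

print(choose_case(["categorical", "numerical"]))
```

Miscounting the variables, or mislabelling one of them, changes the key and therefore the graphic and the test, which is the point the slide is making.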
How students fail?
But many students already have trouble with the first question: how many variables are there?
For example, how many variables are there here? 3 or 1?
[Figure: bar chart titled "Who do you find it easier to make friends with?"; response categories: same sex, opposite sex, either; frequency axis 0 to 400]
Think of the survey. How many questions? 3 or 1? How many columns do you need to store the data? 3 or 1? You are doomed if you choose 3 variables. In fact, there is no test in STAT170 that involves 3 variables.
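The "how many columns?" check can be made concrete with a short Python sketch. The responses below are hypothetical; only the three category names come from the chart.

```python
from collections import Counter

# One survey question => ONE column of data (hypothetical responses).
responses = ["same sex", "opposite sex", "either", "either", "same sex", "either"]

n_variables = 1                    # one column => one variable
frequencies = Counter(responses)   # these counts are the heights of the bars
n_categories = len(frequencies)    # three categories, NOT three variables

print(n_variables, n_categories, dict(frequencies))
```

The three bars are three values of a single categorical variable, so the chart falls under the "one categorical" case.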
How students fail?
             Male   Female
Smoker          4        5
Non-smoker     11        8
Getting a pass in STAT170
You need to be able to do ALL of the following:
1. Count how many variables there are
2. Identify the variables as numerical or categorical
3. Do ALL 8 hypothesis tests
You will fail STAT170 if you cannot do just ONE of them! (In fact, if you can do ALL of them well, a Cr is guaranteed.)
Another example: How many variables are there? 1, 2 or 4? You are doomed if you choose 4 variables.
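Applying the same column-counting check to the smoker by male/female table: the Python sketch below expands the four cell counts back into raw data, one row per person. The variable names are ours; the counts are those in the table.

```python
from collections import Counter

# Cell counts from the table: (smoking status, sex) -> frequency.
table = {
    ("smoker", "male"): 4,
    ("smoker", "female"): 5,
    ("non-smoker", "male"): 11,
    ("non-smoker", "female"): 8,
}

# Expand to raw data: one row per person, one entry per column.
rows = [cell for cell, count in table.items() for _ in range(count)]

n_people = len(rows)         # total of the four cells
n_variables = len(rows[0])   # columns needed per person

print(n_people, n_variables)
# Tallying the raw rows recovers the original four cells.
print(Counter(rows) == Counter(table))
```

Each person needs only two recorded values, smoking status and sex, so the table has 2 categorical variables; the 4 cells are combinations of their categories.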
How to determine the appropriate test
Variable(s) | Graphics | Research question (e.g.) | Answering the research Q: formal stat test
One categorical | Bar chart, pie chart | Is the proportion of smokers equal to 0.3? Are the proportions of meat-eaters, vegetarians & vegans equal to 0.8, 0.15 & 0.05? | z-test of proportion (Lect 7), 2 categories only; chi-sq test of proportions (GOF) (Lect 11), 2 or more categories
One numerical | Histogram, stem-and-leaf, boxplot | Is the mean equal to ...? | z- and t-tests of mean (Lect 7)
Two categorical | Clustered bar chart | Is there an association between ... and ...? | Chi-sq test of association (Lect 11, 12) or odds ratio
Two numerical | Scatter plot | Is there a relation between ... and ...? | Regression analysis: test of slope (Lect 9, 10)
One categorical (binary) & one numerical | Comparative boxplots | Is there a difference in heights between males and females? | 2-sample t-test (Lect 8)
Note:
1. The paired t-test doesn't fit any of the 5 cases above; perhaps it fits best in the 2nd case (one-sample t-test).
2. 7 tests above + paired t-test = 8 hypothesis tests in STAT170
Beware of the paired t-test
The paired t-test may be mistaken for:
2-sample t-test
Regression
Read the given Research Question.
If you see "relation" or "predict" => regression.
If you see "difference" => 2-sample t or paired t. Then think!
E.g. weight loss program? Y1 = weight before, Y2 = weight after.
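One way to see why the paired t-test sits closest to the one-sample case: it is a one-sample t-test applied to the within-person differences. The Python sketch below computes only the test statistic and degrees of freedom; the weights are hypothetical and the function name `paired_t` is ours.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(before, after):
    """t statistic and degrees of freedom for a paired t-test on (after - before)."""
    diffs = [a - b for b, a in zip(before, after)]  # one difference per person
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / sqrt(n))      # one-sample t on the differences
    return t, n - 1

# Hypothetical weights (kg) for five people before and after a program.
wt_before = [82.0, 95.5, 77.0, 101.0, 88.5]
wt_after = [80.0, 93.0, 77.5, 97.0, 86.0]

t_stat, df = paired_t(wt_before, wt_after)
print(round(t_stat, 2), df)
```

The two columns describe the same people measured twice, so they collapse into a single column of differences. A 2-sample t-test would wrongly treat them as two independent groups.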
How to determine the appropriate test
Method 1: The ONLY SURE way to determine the correct test is to identify the variable types correctly!
Method 2: IF you cannot do Method 1, then you may look for keywords in the research question. But be warned: it is NOT 100% fool-proof.
100% association => Chi-sq test of association
certain relation, predict => Regression (with t-test on slope)
difference => 2-sample t-test, or paired t-test
Proportion (singular!), percentage => Z-test of proportion
Proportions (plural), percentages => Chi-sq test of proportions (GoF)
mean, average => One