stats ap review


Upload: ludovica-paparo

Post on 12-Nov-2014


Page 1: Stats AP Review

AP Exam Review Unit 1 Exploring and Understanding Data

Chapter 1/2

1. Characteristics recorded about each individual are called variables.

a. Categorical: sex, race, ethnicity

b. Quantitative: income, weight, height (Note: quantitative variables have units!!)

Chapter 3

2. Frequency tables – Used to record the counts for each category.

Relative frequency tables – Used to record percentages for each category.

3. Bar charts are used to graph categorical data; the height of each bar represents the frequency of its category.

a. All bars are the same width (Area Principle), with spaces between them to indicate that they are freestanding

b. **Note: It is important to label all graphs, including title, scale and axes!!!

4. Pie Charts use relative proportion (percentage) of a categorical variable

a. “Pie slice” represents the percent of the category of the whole group represented by the entire circle

b. **Note: It is harder to compare category size, but easier to make comparisons against “typical” percentages

5. A contingency table shows us how individuals are distributed along each variable, based on the other variable.

a. A marginal distribution is the distribution of one variable alone, read from the margins (total counts) of the contingency table.

b. A conditional distribution is the distribution of one variable for only those individuals satisfying some condition on the other variable; it is read from the inside of the table.

c. Variables are independent when the distribution of one variable in a contingency table is the same for all categories of the other variable.

1. Independent: when we look at the conditional distributions of the table, the distributions are similar.

2. Dependent: when we look at the conditional distributions of the table, the distributions are different.

6. A segmented bar graph is an alternative to a pie chart that divides bars instead of circles.

a. Each bar is treated as a “whole” (100%) and is divided proportionally into segments corresponding to percentages in each group.

b. A great visual display for seeing if distributions are alike or different in order to decide on independence.
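The marginal and conditional distributions described above can be sketched in a few lines of Python. This is only an illustration: the counts, the group and category names, and the 0.05 tolerance used to call two distributions "similar" are all made up for the example.

```python
# Hypothetical counts: rows are one categorical variable, columns the other.
table = {
    "Group A": {"Low": 4, "Medium": 6, "High": 10},
    "Group B": {"Low": 8, "Medium": 12, "High": 20},
}

def marginal(table):
    """Marginal distribution of the row variable: row totals / grand total."""
    grand = sum(sum(row.values()) for row in table.values())
    return {r: sum(row.values()) / grand for r, row in table.items()}

def conditional(table):
    """Conditional distribution of the column variable within each row."""
    return {
        r: {c: n / sum(row.values()) for c, n in row.items()}
        for r, row in table.items()
    }

def looks_independent(table, tol=0.05):
    """Rough check: every row's conditional distribution is within tol
    of the first row's. The tolerance is an arbitrary assumption."""
    dists = list(conditional(table).values())
    first = dists[0]
    return all(abs(d[c] - first[c]) <= tol for d in dists for c in first)

print(marginal(table))
print(conditional(table))
print(looks_independent(table))
```

Each row of the conditional output is what one bar of a segmented bar graph would show.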

Chapter 4

7. A frequency table gives bins and the counts in each bin for the distribution of a quantitative variable.

a. A histogram is one graphical display of the distribution of a quantitative variable; it plots the bin counts as the heights of bars (like a bar chart with no spaces).

1. A histogram plots counts, whereas a relative frequency histogram shows percentages.

2. There should be no gaps in your x-axis, since the scale should just “slice” the range into equal-sized bins.

3. Make sure your graph is well labeled, scaled, and follows the area principle!

8. A stem-and-leaf display is used for smaller data sets, is easily created by hand, and preserves the individual data values.

a. Gives a quick visual of the data’s distribution, showing the same information as a histogram

b. Split stem plots are also used for larger datasets or if more detail is needed

9. A dotplot is a simple display for univariate data, where a dot is placed for each case in the data.

10. When describing distributions, make sure to CUSS and BS!!!

C – Center

U – Unusual Features (outliers and gaps)

S – Shape (unimodal, bimodal, uniform; skewed left (histogram tail to the left) or skewed right (histogram tail to the right))

S – Spread

B – Be

S – Specific (use context and comparative language!!!)

11. Timeplots are used to show how the data behave over time.

Chapter 5

12. Measures of Center:

a. Midrange is the “middle of the range,” found by taking (min + max)/2; easy to find, but very sensitive to outliers

b. Median is the middle of the distribution: the halfway point of the data in numerical order, in the same units as the data

1. resistant to outliers

c. Mean is the arithmetic average; it is best used for symmetric distributions, since it is extremely sensitive to outliers.

d. Mean vs median: the mean is the balance point and the median is the halfway point

1. Mean < Median: skewed left distributions

2. Mean = Median: symmetric distributions

3. Mean > Median: skewed right distributions

13. Measures of Spread:

a. Range is the distance from the minimum to the maximum value (max - min): a SINGLE number that is easy to find, but VERY sensitive to outliers.

b. Interquartile range (IQR) is the range of the middle 50% of the data

1. Split the data at the median and find the median of each half. IQR = Q3 - Q1

2. Ignores the extreme data values (possible outliers), so it gives a better indication of spread

3. The quartiles are Q1 = 25th percentile, Q2 (median) = 50th percentile and Q3 = 75th percentile

c. Standard deviation (s) and variance (s²) are measures of variation for symmetric data; both are non-resistant to outliers

1. They take into account how far each data value is from the mean (the variation)

a. You must take the square root of the variance to find the standard deviation; difficult to find by hand

b. The standard deviation is our ruler for comparing data values with different units, so we make sure we are using the same scale to look at how the values deviate from the mean.

14. Five Number Summary: Min – Q1 – Q2 (median) – Q3 – Max

a. A boxplot is a graphical representation of the 5-number summary

1. Regular (non-modified) – shows the 5-number summary only

2. Modified – shows the 5-number summary as well as outliers.

In general, all of our boxplots are modified

b. Test for outliers

1. Low outlier: any value < Q1 - 1.5(IQR), the lower fence

2. High outlier: any value > Q3 + 1.5(IQR), the upper fence
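As a quick check on hand calculations, the five-number summary and the outlier fences can be sketched in Python. The data values are hypothetical. The quartiles follow the "median of each half" method from item 13, leaving the median out of both halves when the count is odd; that is an assumption, and other software may use a slightly different quartile rule.

```python
from statistics import median

def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) using the median-of-halves method."""
    xs = sorted(data)
    n = len(xs)
    half = n // 2
    lower = xs[:half]             # values below the median position
    upper = xs[half + (n % 2):]   # values above it (median skipped when n is odd)
    return xs[0], median(lower), median(xs), median(upper), xs[-1]

def outliers(data):
    """Values beyond the fences Q1 - 1.5(IQR) and Q3 + 1.5(IQR)."""
    _, q1, _, q3, _ = five_number_summary(data)
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return [x for x in data if x < low_fence or x > high_fence]

values = [2, 4, 5, 6, 7, 8, 9, 10, 30]  # hypothetical data
print(five_number_summary(values))
print(outliers(values))
```

A modified boxplot would draw the whiskers only to the most extreme values inside the fences and plot anything returned by `outliers` as a separate point.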

15. Which to use?

a. Symmetric graph – use the mean and standard deviation

b. Skewed graph – use the median and IQR (or the 5-number summary)

16. An ogive is a cumulative frequency chart: separate your data into classes and tally each frequency

a. Determine the relative frequency in each class

b. Graph each cumulative frequency at the end of each class

c. Remember “Old Faithful Worksheet”

Chapter 6

17. Standardized scores, aka “z-scores,” use z as common notation; z always means the value has been standardized: z = (value - mean) / standard deviation

a. Have NO UNITS so they are all comparable regardless of original units

18. Shifting data by adding or subtracting the same number changes the measures of location (mean, median, quartiles, minimum, and maximum) by that number.

a. The spread (IQR, range, and standard deviation) does NOT change.

19. Rescaling data by multiplying or dividing changes both the measures of location and the measures of spread.

20. Converting data to z-scores shifts the mean (to 0) and rescales the standard deviation (to 1).

a. Standardizing does not change the shape of the distribution!

b. Gives us an indication of how unusual a value is by telling us how many standard deviations the value is from the mean.

1. negative z-score tells us that the data value is below the mean

2. positive z-score tells us that the data value is above the mean

3. 0 tells us a data value is right at the mean.
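Items 18 through 20 are easy to verify numerically. The sketch below uses hypothetical weights in grams, shifts them, rescales them to ounces, and standardizes them; the data values and variable names are made up for illustration.

```python
from statistics import mean, stdev

def z_scores(data):
    """Standardize: subtract the mean, then divide by the standard deviation."""
    m, s = mean(data), stdev(data)
    return [(x - m) / s for x in data]

grams = [2800, 3100, 3400, 3700, 4000]   # hypothetical data
shifted = [x + 100 for x in grams]       # shifting: add the same number to every value
ounces = [x / 28.35 for x in grams]      # rescaling: divide every value by 28.35

print(mean(grams), stdev(grams))         # original center and spread
print(mean(shifted), stdev(shifted))     # center moves by 100, spread unchanged
print(mean(ounces), stdev(ounces))       # center and spread both divided by 28.35

z = z_scores(grams)
print(z)                                 # no units; mean 0 and standard deviation 1
```

Standardizing the ounces gives the same z-scores as standardizing the grams, which is why z-scores can be compared regardless of the original units.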

21. Normal Model: when a distribution is “bell-shaped” (symmetric and unimodal), it may be modeled by the Normal distribution.

a. Parameters are values that represent the entire population: mean μ, standard deviation σ

b. Statistics are denoted by Latin letters and represent our sample: mean x̄, standard deviation s

22. Empirical Rule (the 68-95-99.7 rule): in a Normal model, about 68% of values fall within 1 standard deviation of the mean, about 95% within 2, and about 99.7% within 3.

23. To find the proportion between z-values for the standard Normal distribution:

2nd: Distribution: 2: normalcdf(     normalcdf(low z, high z)

To find the cutoff z-value for a given percentage:

2nd: Distribution: 3: invNorm(     invNorm(proportion to left of z)
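If no calculator is handy, the same two calculations can be sketched with Python's standard library (`statistics.NormalDist`, available in Python 3.8 and later). The function names `normalcdf` and `inv_norm` below are just wrappers named after the calculator commands; they are not part of any library.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)  # the standard Normal model

def normalcdf(low_z, high_z):
    """Proportion of the standard Normal model between two z-values."""
    return std_normal.cdf(high_z) - std_normal.cdf(low_z)

def inv_norm(proportion_left):
    """z-value with the given proportion of the model to its left."""
    return std_normal.inv_cdf(proportion_left)

print(round(normalcdf(-1, 1), 4))   # close to the 68% of the Empirical Rule
print(round(normalcdf(-2, 2), 4))   # close to 95%
print(round(inv_norm(0.75), 4))     # z-value of the 75th percentile
```

For a non-standard model, convert the data value to a z-score first, or build the model directly with `NormalDist(mu=..., sigma=...)`.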

24. To determine if a dataset follows a Normal model, either:

a. Look at a histogram or stem plot and check for non-Normal features (like gaps, outliers, and skewness)

b. Compare your actual data to the Empirical Rule

c. Look at a Normality Plot and check if the plot approximates a diagonal straight line


Unit 1 AP Review Name

Each entry gives the military time of birth and the birthweight in grams. In the original layout the rows are grouped by Night, Day, and Evening Shift.

GIRLS                BOYS                 BOYS (continued)
Time    Grams        Time    Grams        Time    Grams
0005    3837         0118    3554         1514    3783
0104    3334         0155    3838         1631    3345
0405    2208         0257    3625         1657    3034
0407    1745         0422    2846         1807    3300
0814    2576         0431    3166         1854    3428
0909    3208         0708    3520         1909    4162
1049    3746         0735    3380         1947    3630
1053    3523         0812    3294         1949    3406
1406    3430         1035    3521         1951    3402
1407    3480         1133    2902         2037    3736
1433    3116         1209    2635         2051    3370
1446    3428         1256    3920         2104    2121
1742    2184         1305    3690         2123    3150
1825    2383
2010    3500
2217    3866
2327    3542
2355    3278

1. There has long been an “old wives tale” about the time of day that babies are born. Is there any merit to this? Create a contingency table for the gender of the baby versus the time of day the baby was born. (Note: The times in the raw data are in military time!!!!)

          Night Shift      Day Shift        Evening Shift    Total
          (2300 – 0700)    (0700 – 1500)    (1500 – 2300)
Female
Male
Total

2. Is there a relationship between the gender of the baby and the time of day the baby is born? Create a segmented bar graph below, remembering to clearly label the graph. Then answer the question.

3. The birthweights listed are in grams. Find the following values for EACH gender. For EACH gender, test for outliers. Remember to show all WORK!!!!

Values       GIRLS     BOYS
Mean
Median
Std Dev
Minimum
Q1
Q3
Maximum
Range
IQR

Girls Outliers?

Boys Outliers?

4. In the space below, create side-by-side boxplots (showing any outliers) for the birthweights of the babies by gender, remembering to label and scale the graph!!

5. Compare the distributions displayed above, remembering to use comparative language (CUSS & BS).

6. The birthweights listed are in grams. Typically, in the US, we report birthweights in pounds and ounces. The conversion factors you need are: 16 ounces = 1 pound and 1 ounce = 28.35 grams. Use these conversion factors to convert the summary statistics for YOUR gender.

Mean         Minimum     Maximum
Median       Q1          Range
Std Dev      Q3          IQR

7. Assuming the birth weights for YOUR gender can be modeled by a Normal distribution, sketch the Normal model and label it based on the 68-95-99.7 rule.

8. Based on this model, what percent of babies for YOUR gender weigh below 8 pounds?

9. Based on this model, what weight would represent the 75th percentile of birthweights for YOUR gender?

AP Exam Review Unit 2 Exploring Relationships Between Variables

Chapter 7

1. A scatter plot is a visual way to show the association between two quantitative variables

a. In order to describe a scatter plot, look for:

1. Form: Is the data linear or curved?

2. Outliers: Are there any points that appear not to fit the data set?

3. Direction: Is there a positive, negative or no association between the variables?

4. Strength: How much scatter is apparent in the plot? The closer the points are to a straight line, the closer |r| is to 1

b. Explanatory variable = predictor; goes on the x-axis

c. Response variable = the variable that responds to the explanatory variable; goes on the y-axis

d. If the relationship between the variables is unclear, it does not matter which one we identify as explanatory or response variable.

2. Correlation coefficient (r) gives us a numerical measurement of the strength of the linear relationship between the explanatory and response variables.

a. Direction: positive “r” indicates a positive association and negative “r” indicates a negative association

b. Strength: Values close to 0 indicate weak relationships; as r gets closer to 1 or -1, the relationship is stronger, and values of exactly 1 or -1 indicate a perfect line and perfect correlation.

c. When to use correlation:

1. Quantitative variables – r cannot be applied to categorical data! Make sure you understand your variables

2. Linear data – r can always be calculated, but correlation only measures the strength of linear relationships, so watch for curvature!

3. Outliers – r is calculated using z-scores (mean and st. dev), so it is non-resistant to outliers!

d. Properties of correlation

1. Sign gives the direction of association

2. Correlation is always between -1 and +1

3. Flipping x and y does NOT affect r, and changing units on x or y does not affect r

4. NO units!! It has been standardized

5. Measures a LINEAR relationship only and is non-resistant to outliers
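Because r is built from z-scores, the properties above can be checked directly. The sketch below computes r as the sum of the products of the z-scores divided by n - 1; the data values are hypothetical.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """r = sum of (z_x * z_y) / (n - 1), using sample standard deviations."""
    n = len(xs)
    mx, my = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)
    return sum(((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys)) / (n - 1)

x = [1, 2, 3, 4, 5]               # hypothetical explanatory values
y = [2.1, 3.9, 6.2, 7.8, 10.1]    # hypothetical response values

r = correlation(x, y)
print(r)                                       # strong positive association
print(correlation(y, x))                       # flipping x and y: same r
print(correlation([v * 2.54 for v in x], y))   # changing units on x: same r
print(correlation(x, [-v for v in y]))         # reversing direction flips the sign
```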

Chapter 8

3. The Least Squares Regression Line (LSRL) is the line of best fit. It may not hit any of the data points, so we find the line that comes closer to the points than any other line.

a. The LSRL is the line that minimizes the sum of the squared errors (the errors are called residuals).

b. A residual is the difference between the observed and predicted values for y (the dependent variable)

c. A “hat” over a variable indicates a predicted value.

1. The LSRL always contains the point (x-bar, y-bar). When writing the equation, use the variable names, not x and y!

2. Note: the LSRL is non-resistant to outliers

d. Slope of the LSRL

1. Interpreting the slope:

a. Slope is defined as the predicted amount of change in y as x increases by 1.

b. Moving any number of standard deviations in x moves r times that number of standard deviations in y.

e. Intercept of the LSRL

1. Interpreting the intercept:

a. The intercept is the value of y-hat when x = 0

b. The intercept is not always appropriate for interpretation!!!
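A minimal sketch of fitting the LSRL by hand follows. It uses the standard least-squares formulas, slope b = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)² and intercept a = ȳ - b·x̄; these formulas are not written out in the notes above, and the hours and score values are hypothetical.

```python
from statistics import mean

def lsrl(xs, ys):
    """Return (a, b) for the least-squares line y-hat = a + b*x."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sxy / sxx        # slope: predicted change in y per 1-unit increase in x
    a = my - b * mx      # intercept: forces the line through (x-bar, y-bar)
    return a, b

hours = [1, 2, 3, 4, 5]        # hypothetical explanatory variable
score = [52, 55, 61, 64, 68]   # hypothetical response variable

a, b = lsrl(hours, score)
print(f"predicted score = {a:.1f} + {b:.1f}(hours)")
```

In context, the slope here would read "for each additional hour, the predicted score increases by about 4.1 points," and the intercept is the predicted score at 0 hours, which may or may not be meaningful.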

4. Residuals are the errors that occur since the line doesn’t go through each point.

a. To calculate a residual, subtract the predicted value from the actual value: Residual = y - ŷ

1. A positive residual is above the line, where the model underestimated the value.

2. A negative residual is below the line, where the model overestimated the value.

5. A residual plot is a scatter plot of the explanatory variable vs. the residuals

a. Residuals are very useful to us because we can determine how well a line fits the data by examining its errors.

b. A good residual plot has a random scatter with no patterns. This indicates our model is appropriate.

c. A bad residual plot shows patterns like curves, v’s, etc. This indicates that we should look for a better model.

6. Coefficient of Determination (R²)

a. We have been using correlation to gauge the strength of the linear relationship, but to get a better feel for the data, we can square the r value.

b. R² is the ratio of the explained variation of the response variable to the total variation.

c. Squared correlation, r², gives the fraction (or percent) of the data’s variation that is accounted for by the linear model.

d. The remaining fraction (1 - r²) is the amount of the original variance that is left in the residuals (errors).
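Residuals and R² can be computed together, as sketched below. The example computes R² as 1 minus (variation left in the residuals / total variation), which matches item 6d; the data are hypothetical and the helper names are made up for illustration.

```python
from statistics import mean

def lsrl(xs, ys):
    """Return (a, b) for the least-squares line y-hat = a + b*x."""
    mx, my = mean(xs), mean(ys)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b

def residuals(xs, ys):
    """Residual = observed y minus predicted y-hat, for each point."""
    a, b = lsrl(xs, ys)
    return [y - (a + b * x) for x, y in zip(xs, ys)]

def r_squared(xs, ys):
    """Fraction of the variation in y accounted for by the linear model."""
    my = mean(ys)
    sse = sum(e ** 2 for e in residuals(xs, ys))   # variation left in residuals
    sst = sum((y - my) ** 2 for y in ys)           # total variation in y
    return 1 - sse / sst

hours = [1, 2, 3, 4, 5]        # hypothetical data
score = [52, 55, 61, 64, 68]

res = residuals(hours, score)
print([round(e, 2) for e in res])   # positive: point above the line; negative: below
print(round(r_squared(hours, score), 4))
```

Plotting `hours` against `res` gives the residual plot; for a least-squares line the residuals always sum to zero, so look for patterns rather than size alone.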

Chapter 9

7. Extrapolation occurs when we use a linear model to make a prediction for an x-value that is far away from the x-values in the data.

a. Extrapolations assume that past trends will continue into the far future.

b. If you must extrapolate into the future, at least don’t believe that the predictions will come true!

8. Outliers in regression

a. Outlying points can strongly influence a regression. Even one point can dominate the regression analysis.

b. Types of outliers in regression:

1. y-outlier: a point that is extraordinary in its y-value

2. x-outlier: a point that is extraordinary in its x-value

a. Be wary of x-outliers as they can have high leverage.

b. Especially watch what happens when an x-outlier “lines up” with the rest of the points

3. model outlier: a point that deviates from the regression line

a. Whenever you notice a model outlier, you should fit the line to the other points alone and then compare the resulting regression model in order to understand how the outlying point affected the model.

Note: This does NOT mean to just delete any point that doesn’t fit your line! You must examine and analyze all deviations! Oftentimes, this analysis tells us more than the original model did.

c. Influential points are points that highly influence the slope of the regression line and the correlation coefficient

9. Lurking variables are variables that do not show up as part of the model but do affect the way the variables in the model appear to be related. No matter how strong the association is, no matter how large the r (or r-squared) value, no matter how straight the line is, you CANNOT say that one variable CAUSES the other!

Chapter 10

10. Non-Linear Regression:

Model Name     x-axis    y-axis    Equation
Linear         x         y         ŷ = a + bx
Exponential    x         log(y)    log ŷ = a + bx
Logarithmic    log(x)    y         ŷ = a + b(log x)
Power          log(x)    log(y)    log ŷ = a + b(log x)
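As one illustration of the table, here is a sketch of the exponential row: take log(y), fit a straight line to (x, log y), then undo the log to predict. The data are hypothetical (a quantity that exactly doubles each year), and base-10 logs are an assumption; natural logs work the same way as long as the matching inverse is used.

```python
import math
from statistics import mean

def fit_line(xs, ys):
    """Return (a, b) for the least-squares line y-hat = a + b*x."""
    mx, my = mean(xs), mean(ys)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b

years = [0, 1, 2, 3, 4]                  # hypothetical x-values
count = [100, 200, 400, 800, 1600]       # hypothetical y-values, doubling each year

# Exponential model: fit log(y-hat) = a + b*x on the re-expressed data.
a, b = fit_line(years, [math.log10(c) for c in count])

def predict(x):
    """Undo the log to get a prediction in the original units."""
    return 10 ** (a + b * x)

print(round(a, 4), round(b, 4))
print(round(predict(5)))
```

After re-expressing, always check the residual plot of the transformed data; the straight-line model should show random scatter there.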

Unit 2 AP Review Name _________________________

Sarah’s parents are concerned that she seems short for her age. Their pediatrician has the following record of Sarah’s height:

Age (months)    36    48    51    54    57    60
Height (cm)     88    90    91    93    94    95

1. Make a scatterplot of these data and describe the overall pattern of the data in context.

2. Find the mean and standard deviation of both variables.

3. Find the correlation coefficient and explain what it means.

4. What is the equation of the least-squares regression line?

5. Compute the slope and explain what the slope means in context.

6. Compute the y-intercept and explain what the y-intercept means in context.

7. Find the coefficient of determination and explain what it means.


8. According to the regression line, how much does Sarah grow each month on the average?

9. For each age given, compute the predicted height using the least-squares line.

10. Make the residual plot and determine if the LSRL is a good fit. Why?

11. Use Sarah’s predicted height at 60 months and determine the residual.

12. Was it a positive or negative residual? Did the pediatrician underestimate or overestimate? How do we know?

13. How tall would you predict Sarah’s height to be at 62 months?

AP Exam Review Unit 3 Gathering Data

Chapter 11

1. Random:

a. Don’t be “haphazard.” Random outcomes have a lot of structure, especially in the long run.

b. Makes the sample as representative as possible.

c. Randomizing protects us from the influences of all the features of our population, even ones that we may not have thought about.

d. It makes sure that on the average, the sample looks like the rest of the population.

e. Not only does randomizing protect us from bias, it also makes it possible for us to draw inferences about the population when we see only a sample.

2. Simulations:

a. Identify the component to be repeated.

b. Explain how you will model the outcome.

c. Explain how you will simulate the trial (the sequence of events that we are pretending will take place).

d. State clearly what the response variable is.

e. Run several trials.

f. Analyze the response variable.

g. State your conclusion (in context of the problem).
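The simulation steps above can be sketched in code. The scenario is hypothetical and not from these notes: each cereal box contains one of 3 equally likely prizes, and we want to know how many boxes it takes, on average, to collect all 3.

```python
import random

def one_trial(rng, n_prizes=3):
    """One trial: open boxes until every prize has been seen.
    Component: opening one box. Outcome model: a random integer 1..n_prizes.
    Response variable: the number of boxes opened."""
    seen = set()
    boxes = 0
    while len(seen) < n_prizes:
        seen.add(rng.randint(1, n_prizes))
        boxes += 1
    return boxes

def simulate(trials=10_000, seed=1):
    """Run many trials and analyze the response variable (its mean)."""
    rng = random.Random(seed)
    results = [one_trial(rng) for _ in range(trials)]
    return sum(results) / trials

print(simulate())   # conclusion in context: average number of boxes needed
```

On the exam a simulation is usually run with a random digit table and only a handful of trials; the code just shows the same steps carried out many more times.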

Chapter 12

3. Population (parameter) is the entire group we are interested in

a. The parameter’s value is rarely known, and our goal is usually to estimate it.

Sample (statistic) is a smaller group that is selected from the population.

a. Its goal is to estimate the parameter.

b. The proportion of the population that you’ve sampled doesn’t really matter (unless you have a really small population).

c. It’s the sample size itself that makes the difference.

d. Size of population does not matter, size of sample DOES

                      Population    Sample
Mean                  μ             x̄
Standard deviation    σ             s

4. Polls:

a. Opinion polls are polls run by organizations such as the Gallup Poll, which are extremely diligent in their selection so that the sample represents the population.

b. Straw polls are polls that gather information in a very poor way (such as those on websites or in magazines)

5. A biased sample is one that does not represent the population in some important way

a. There is usually no way to fix a biased sample

b. There is usually no way to salvage useful information from it.

6. Problems with a census:

a. Difficult and impractical to complete

b. Populations shift in their demographics

c. Too complex in terms of time and budget

d. If using destructive sampling, you would destroy the population

7. Sample Types:

a. Simple Random Sample (SRS) is a sample in which every person AND every combination of people has an equal chance of being selected.

1. To get an SRS: define your population of interest (where the sample will come from), assign numbers to each of the subjects, and use a random number table to select the sample.

b. Stratified random sampling is more complicated than an SRS and involves splitting the population into subgroups. More useful when you think certain characteristics may be an influence in the data. (SOME from ALL)

1. To get a stratified sample: define your population of interest, split your population into homogeneous groups, called strata, use an SRS within each stratum to determine who is sampled, and combine the results from each stratum.

c. Cluster sampling also involves splitting the population into subgroups and is more useful when you think all subgroups are pretty similar and each will adequately represent population variability. (ALL from SOME)

1. To get a cluster sample: define your population of interest, split your population into heterogeneous groups, called clusters, use an SRS to determine which cluster(s) to select, and combine the results from each selected cluster.

d. Systematic sampling involves selecting every nth object. This is more useful when you believe that the order of the list is not associated in any way with the responses sought.

1. To get a systematic sample: define your population of interest, determine a starting place using a random number table, and from your starting place, sample every nth object on the list.

e. A multistage sample is a sample made up of more than one sample type.
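The four sampling methods above can be compared side by side in a short sketch. Python's random number generator stands in for the random number table, and the roster, strata, and cluster names are hypothetical.

```python
import random

def srs(population, n, rng):
    """Simple random sample: every group of n subjects is equally likely."""
    return rng.sample(population, n)

def stratified(strata, n_per_stratum, rng):
    """SOME from ALL: take an SRS inside every stratum, then combine."""
    return [p for group in strata.values() for p in rng.sample(group, n_per_stratum)]

def cluster(clusters, n_clusters, rng):
    """ALL from SOME: pick whole clusters at random, keep everyone in them."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [p for name in chosen for p in clusters[name]]

def systematic(population, step, rng):
    """Random starting place, then every step-th subject on the list."""
    start = rng.randrange(step)
    return population[start::step]

rng = random.Random(0)   # fixed seed only so the example is repeatable
roster = [f"student{i:02d}" for i in range(1, 41)]               # hypothetical list
groups = {"juniors": roster[:20], "seniors": roster[20:]}        # hypothetical subgroups

print(srs(roster, 5, rng))
print(stratified(groups, 3, rng))
print(len(cluster(groups, 1, rng)))
print(systematic(roster, 5, rng))
```

The same dictionary is used for both strata and clusters only to keep the example short; in practice strata should be homogeneous groups and clusters should each look like the whole population.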

8. How NOT to sample:

a. Voluntary response samples: a large group is invited to respond, and those who actually do respond are counted. But many of these respondents will probably have strong opinions or motivations.

b. Convenience samples: we simply include the individuals who are readily available.

9. Problems to watch for:

a. Undercoverage: bias in which some portion of the population is not sampled at all or has a smaller representation in the sample than it has in the actual population.

b. Nonresponse bias: someone who is chosen for the sample cannot be contacted or refuses to cooperate. The problem is that those who don’t respond may differ from those who do respond.

c. Response bias: anything in the survey that influences the responses, such as wanting to please the interviewer, not wanting to answer personal or legal questions, etc.

d. Wording of the questions in a survey can also influence the responses. Asking a question with a leading statement is a good way to bias the responses.

Chapter 13

10. An observational study is a study where the researcher simply observes the subjects, recording the choices made and the outcomes; it is based on data in which no manipulation of factors has been employed.

a. Types of observational studies:

1. A prospective study is an observational study in which subjects are followed to observe future outcomes.

2. A retrospective study is an observational study in which subjects’ previous conditions are determined.

Note: neither prospective nor retrospective studies can show cause-and-effect relationships.

11. An experiment is a study that manipulates a variable to create treatments, then imposes these treatments on the subjects in order to record and compare the responses. Used if we are trying to prove causation.

a. Elements of an experiment:

1. Factors: an experiment identifies at least one explanatory variable (a factor) to manipulate and at least one response variable to measure.

2. Levels are the specific values that the experimenter chooses for a factor

3. Treatments are combinations of specific levels from all the factors

a. Once we’ve decided what to do to our subjects, we need to decide WHO gets what treatment.

b. To have any hope of drawing a fair conclusion:

1. we must assign our participants to the treatments in a random manner

2. we may not assign based on participant (or proctor) choice

b. The four major principles of experimental design:

1. Control sources of variation other than the factors being tested by making conditions as similar as possible for all treatment groups

2. Randomization

a. Allows us to equalize the effects of unknown or uncontrollable sources of variation.

b. It does not eliminate sources of variation, but spreads them out among the treatments and reduces bias

3. Replication

a. We should repeat the experiment, applying treatments to a number of subjects. (Replication within study)

b. We should be able to repeat the study, following the same design, and produce similar results (Replication of study)

4. Blocking **Not essential, but useful

a. When groups of subjects are similar, it is often a good idea to gather them into blocks.

b. By blocking, we isolate the variability due to differences between blocks so we can see differences due to the treatments more clearly.

d. Types of Experimental Design

1. Completely randomized design: all experimental units have an equal chance of receiving any treatment. (Similar to an SRS.)

2. Randomized block design: the randomization occurs only within the blocks. (Similar to stratified sampling.) If we feel that a certain characteristic of our experimental units may influence the response, we can isolate the variability due to these differences by blocking our units into groups of similar characteristics and running the experiment separately within each block.

e. Statistically significant: whenever we use random chance, we will get different samples and different responses.

1. Can an observed difference be attributed only to the fact that we used a different random group, or is the difference much bigger than what we would have expected by chance alone? If it is much bigger, we call it statistically significant.

f. A diagram of an experiment can often help in thinking about the details of the experiment. However, a diagram is just a basic outline, not a complete description!
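The two designs in part d differ only in where the randomization happens, which a short sketch makes concrete. The unit numbers, block labels, and treatment names below are hypothetical, and shuffling a list stands in for drawing names or using a random number table.

```python
import random

def completely_randomized(units, treatments, rng):
    """Shuffle all units, then deal them out to the treatments in turn,
    so group sizes are as equal as possible."""
    shuffled = units[:]
    rng.shuffle(shuffled)
    return {u: treatments[i % len(treatments)] for i, u in enumerate(shuffled)}

def randomized_block(blocks, treatments, rng):
    """Randomize separately inside each block of similar units."""
    assignment = {}
    for members in blocks.values():
        assignment.update(completely_randomized(members, treatments, rng))
    return assignment

rng = random.Random(0)                      # fixed seed only for repeatability
treatments = ["treatment", "placebo"]       # hypothetical treatment names
units = list(range(1, 13))                  # 12 hypothetical volunteers
blocks = {"block 1": [1, 2], "block 2": [3, 4], "block 3": [5, 6]}

print(completely_randomized(units, treatments, rng))
print(randomized_block(blocks, treatments, rng))
```

With blocks of size 2 and two treatments, every block ends up with one unit on each treatment, so differences between blocks cannot be mistaken for a treatment effect.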

12. Blinding: there are two main classes of individuals who can affect the outcome of an experiment: those who could influence the results and those who evaluate the results.

a. Single-blind: every individual in one of these classes is blinded in the experiment.

b. Double-blind: everyone in both classes is blinded in the experiment.

13. A placebo is a “fake” treatment that looks just like the treatments being tested.

a. Some of the improvement seen with a treatment, even an effective treatment, can be due to the simple act of treating.

b. To separate these two effects, we can use a control treatment that mimics the treatment itself.

c. A placebo is the best way to blind subjects from knowing whether they are receiving the treatment or not. (Placebo controlled)

14. Matching is used to reduce variation in much the same way as blocking. Subjects are paired because they are similar in ways not under study.

a. Matching can be with another experimental unit or with yourself (as in before/after studies)

15. Confounding variables are two variables (explanatory or lurking) whose effects on the response variable cannot be distinguished from each other.


AP Exam Unit 3 Review Worksheet Name

1. Suppose you were asked to help design a survey of adult city residents in order to estimate the proportion that would support a sales tax increase. The plan is to use a stratified random sample, and three stratification schemes have been proposed.

Scheme #1: Stratify adult residents into four strata based on the first letter of their last name (A-G, H-N, O-T, U-Z)

Scheme #2: Stratify adult residents into three strata: college students, non-students who work full-time and non-students who do not work full-time.

Scheme #3: Stratify adult residents into five strata by randomly assigning residents into one of five strata

Which of the three stratification schemes would be best in this situation? Explain why you chose that scheme.

2. A statistics teacher wants to know how her students feel about an introductory statistics course. She decides to administer a survey to a random sample of students taking the course. She has several sampling plans to choose from. Name the sampling strategy in each.

a. There are four ranks of students taking the class: freshmen, sophomores, juniors and seniors. Randomly select 15 students from each class rank.

b. Randomly select a class rank (freshmen, sophomores, juniors and seniors) and survey every student in that class rank.

c. Each student has a nine-digit student number. Randomly choose 60 numbers.

d. Using the class roster, select every fifth student from the list.

3. Listed below are the names of 20 students who are juniors. Use the random numbers listed below to select five of them to be in your sample. Clearly explain your method.

Adam      Chris     Dave      Deirdre    Don
Ellen     Eric      Joan      Jonathan   Judi
Joy       Kenny     Laura     Mary       Paul
Peter     Rachel    Robert    Sara       Stacey

39634 62349 74088 65564 16879 19713 39153 69459 17986 24537
14595 35050 40469 27478 44526 67331 93365 54526 22356 93208

Method:


4. Is vitamin C helpful in preventing the common cold? We wish to conduct an experiment to examine if 1000 mg of vitamin C per day can reduce the incidence of colds. Suppose that all 120 of the AP Statistics students at a high school have volunteered for the study.

a. Design a completely randomized experiment for this situation.

b. Would we need to block or blind this experiment? Explain.

5. Researchers who are studying a new shampoo formula plan to compare the condition of hair for people who use the new formula with the condition of hair for people who use the current formula. Twelve volunteers are available to participate in this study. Information on these volunteers (numbered 1-12) is shown in the table.

a. These researchers want to conduct an experiment involving the two formulas (new and current) of shampoo. They believe that the condition of hair changes with age but not gender. Because the researchers want the size of the blocks in an experiment to be equal to the number of treatments, they will use blocks of size 2 in their experiment. Identify the volunteers (by number) that would be included in each of the six blocks and give the criteria you need to form the blocks.

b. Other researchers believe that hair condition differs with both age and gender. These researchers will also use blocks of size 2 in their experiment. Identify the volunteers (by number) that would be included in each of the six blocks and give the criteria you need to form the blocks.

c. The researchers in part b decide to select three of the six blocks to receive the new formula and three to receive the current formula. Is this an appropriate way to assign treatments? If so, describe a method for selecting the three blocks to receive the new formula. If not, describe an appropriate method for assigning treatments.

AP Exam Review Unit 4 ProbabilityChapter 14 1. Random Outcomes is a situation in which we know what outcomes could happen, but we don’t know which

particular outcome did or will happen.

Volunteer

Gender

Age

1 Male 21 2 Female 20 3 Male 47 4 Female 60 5 Female 62 6 Male 61 7 Male 58 8 Female 44 9 Male 4410 Female 2411 Male 2312 Female 46

123456

123456


   a. In the long run, random outcomes settle down in a way that is actually consistent and predictable.
2. Probability is the measure of the likelihood that a given event will occur, expressed as a number between 0 and 1 that describes long-run behavior.
   a. Trial: each attempt that generates an outcome.
   b. Outcome: whatever happens in each trial.
   c. Sample space: the set of individual outcomes that are possible in a trial.
   d. Event: a combination of outcomes in a trial.
   e. Conditions of probability:
      1. Probabilities must be between 0 and 1, inclusive.
      2. A probability of zero indicates impossibility of the event occurring.
      3. A probability of one indicates certainty of the event occurring. (In the long run... of course!)

   f. Probability Rules:
      1. For any event A, 0 ≤ P(A) ≤ 1.
      2. The probability of the set of all possible outcomes of a trial must be 1: P(S) = 1.
      3. Sample Space (S): the set of all possible outcomes.
      4. Complement Rule: The set of outcomes that are not in the event A is called the complement of A, denoted A^C. The probability of an event occurring is 1 minus the probability that it doesn’t occur:
         P(A) = 1 – P(A^C)
      5. Addition Rule: For two disjoint (mutually exclusive) events A and B, the probability that one or the other occurs is the sum of the probabilities of the two events. (Disjoint or mutually exclusive means that the two events cannot occur at the same time.)
         P(A or B) = P(A) + P(B), provided that A and B are disjoint.
      6. Multiplication Rule: For two independent events A and B, the probability that both A and B occur is the product of the probabilities of the two events.
         P(A and B) = P(A) x P(B), provided that A and B are independent.
3. Independence: trials must not be related to one another in order for us to make statements about the long-run behavior of random phenomena. (The outcome of one trial doesn’t influence or change the outcome of another.)
4. Law of Large Numbers: the long-run relative frequency of repeated independent events gets closer and closer to the true relative frequency as the number of trials increases.
   a. Misinterpretation of the law: Many people believe that random phenomena are supposed to compensate for whatever happened in the past. They do not.
5. Common Errors:

   a. Don’t add probabilities of events if they’re not disjoint!! For example, to find the probability of owning a car or a house, you cannot just add the probabilities of owning a car and owning a house, because those two events are not disjoint. Many people own both!
   b. Don’t multiply probabilities of events if they’re not independent!! For example, to find the probability of being absent and today being Friday, you cannot just multiply these probabilities, because those two events are not independent. Knowing that it is Friday changes the probability of being absent.
   c. Don’t confuse disjoint and independent. Disjoint events CANNOT be independent!! If A and B are disjoint, knowing that A occurred tells you that B did not.
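The basic rules and the Law of Large Numbers above can be sketched in a few lines of Python. This is only an illustration: the fair-die numbers, the function names, and the seed are assumptions made for the example, not part of the review.

```python
import random

def complement(p_a):
    """Complement Rule: P(A^C) = 1 - P(A)."""
    return 1 - p_a

def add_disjoint(p_a, p_b):
    """Addition Rule: valid ONLY when A and B are disjoint."""
    return p_a + p_b

def multiply_independent(p_a, p_b):
    """Multiplication Rule: valid ONLY when A and B are independent."""
    return p_a * p_b

def relative_frequency(n_trials, p=0.5, seed=1):
    """Simulate n_trials independent trials and return the share of successes."""
    rng = random.Random(seed)  # fixed seed (an assumption) so the run is repeatable
    hits = sum(1 for _ in range(n_trials) if rng.random() < p)
    return hits / n_trials

# Hypothetical example: a fair six-sided die.
p_one = 1 / 6
p_one_or_two = add_disjoint(p_one, 1 / 6)          # one roll cannot be both 1 and 2
p_not_one = complement(p_one)
p_two_ones = multiply_independent(p_one, p_one)    # two separate rolls are independent

# Law of Large Numbers: the relative frequency settles near p as trials increase.
for n in (10, 1_000, 100_000):
    print(n, relative_frequency(n))
```

With more trials the printed relative frequency should drift toward 0.5, but nothing "compensates" for early streaks; they are simply swamped by later trials.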


Chapter 15
6. Finding Probability:
   a. When the k possible outcomes are equally likely, each has a probability of 1/k.
   b. For any event A that is made up of equally likely outcomes:
      P(A) = (number of outcomes in A) / (total number of possible outcomes)
   c. General Addition Rule: P(A or B) = P(A) + P(B) – P(A and B). Use this when the events are not disjoint; just adding P(A) and P(B) would double count the probability of both A and B occurring.
   d. Conditional probability: takes into account a given condition. It is written as P(A|B) and read as “the probability of A given that B has occurred.”
   e. General Multiplication Rule: P(A and B) = P(A) x P(B|A). Use this when the events are not independent.
      Note: There’s nothing special about which one we write as A or B, so this rule can also be stated as:
      P(A and B) = P(B) x P(A|B)
7. Formal Independence: Events A and B are independent whenever P(B|A) = P(B).
   a. Independence of two events means the outcome of one event does not influence the probability of the other.
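As a rough sketch of how the general rules fit together, the snippet below rearranges the General Multiplication Rule to get a conditional probability and then applies the formal independence check. The probabilities are made-up illustration values and the function names are hypothetical.

```python
import math

def p_or(p_a, p_b, p_both):
    """General Addition Rule: P(A or B) = P(A) + P(B) - P(A and B)."""
    return p_a + p_b - p_both

def p_given(p_both, p_condition):
    """Conditional probability from the General Multiplication Rule:
    P(B|A) = P(A and B) / P(A)."""
    return p_both / p_condition

def independent(p_a, p_b, p_both):
    """Formal independence: A and B are independent when P(B|A) = P(B)."""
    return math.isclose(p_given(p_both, p_a), p_b)

# Hypothetical values: P(A) = 0.5, P(B) = 0.4
either = p_or(0.5, 0.4, 0.2)            # P(A and B) = 0.2
b_given_a = p_given(0.2, 0.5)           # P(B|A)
print(either, b_given_a)
print(independent(0.5, 0.4, 0.2))       # P(B|A) equals P(B)
print(independent(0.5, 0.4, 0.3))       # P(B|A) = 0.6, which differs from P(B)
```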

8. Replacement: Sampling without replacement means that once an individual is drawn, it doesn’t go back into the pool.
   a. We often sample without replacement, which doesn’t matter too much when dealing with a large population.
   b. When drawing from a small population, we need to take note and adjust probabilities accordingly.

9. Tree diagrams: Help us think through conditional probabilities by showing sequences of events as paths that look like branches of a tree.

Chapter 16
10. Random variable: assumes any of several different values as a result of some random event.
   a. Random variables are denoted by capital letters, such as X.
   b. A particular value of the random variable is denoted with a lowercase letter, x.
   c. Types of random variables:
      1. Discrete: a random variable that can take one of a finite number of distinct outcomes.
      2. Continuous: a random variable that can take any numeric value within a range of values.
11. Probability model: consists of a collection of all the possible values and the probabilities that they occur.
12. Expected value, E(X) (center): the value we expect a random variable to take on, on average. It can also be notated as μ (for population mean).
   a. To find the expected value of a random variable, sum the products of each possible value and the probability that it occurs:
      E(X) = μ = Σ x·P(x)
13. Variance for a random variable is Var(X) = σ² = Σ (x – μ)²·P(x)
14. Standard deviation for a random variable is SD(X) = σ = √Var(X)

Note: Whenever we report a center, we also need to report a spread. It’s not enough to know where the distribution is centered; we also need to know the variability.
Remember: To find E(X) and σ, put x in L1 and P(x) in L2, then STAT: CALC: 1-Var Stats L1, L2
E(X) = x̄ and σ = σx
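The calculator steps above amount to the following computation, shown here as a minimal sketch. The probability model (values 0, 1, 2) is a made-up example, and the two lists play the role of L1 and L2.

```python
import math

def expected_value(values, probs):
    """E(X) = sum of x * P(x)."""
    return sum(x * p for x, p in zip(values, probs))

def variance(values, probs):
    """Var(X) = sum of (x - mu)^2 * P(x)."""
    mu = expected_value(values, probs)
    return sum((x - mu) ** 2 * p for x, p in zip(values, probs))

def std_dev(values, probs):
    """SD(X) = square root of the variance."""
    return math.sqrt(variance(values, probs))

# Hypothetical probability model (the probabilities must sum to 1).
values = [0, 1, 2]            # like L1
probs = [0.25, 0.50, 0.25]    # like L2

print(expected_value(values, probs))  # center
print(std_dev(values, probs))         # spread
```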


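15. Combining Random Variables:

   E(X ± Y) = E(X) ± E(Y)
   Var(X ± Y) = Var(X) + Var(Y), provided that X and Y are independent
   SD(X ± Y) = √(Var(X) + Var(Y))
   (You always add the variances, never the standard deviations, even when you subtract the variables.)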
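A small sketch of the combining rules, assuming X and Y are independent. The means and standard deviations are invented for illustration, and `combine` is a hypothetical helper name.

```python
import math

def combine(mean_x, sd_x, mean_y, sd_y, sign=+1):
    """Mean and SD of X + Y (sign=+1) or X - Y (sign=-1) for INDEPENDENT X, Y.

    Means add or subtract; variances always add.
    """
    mean = mean_x + sign * mean_y
    sd = math.sqrt(sd_x ** 2 + sd_y ** 2)
    return mean, sd

# Hypothetical values: X has mean 10, SD 3; Y has mean 4, SD 4.
total = combine(10, 3, 4, 4, sign=+1)
difference = combine(10, 3, 4, 4, sign=-1)
print(total, difference)   # the SD is the same for the sum and the difference
```

Note that the SD here is 5, not 3 + 4 = 7 and not 3 - 4; that is the point of adding variances rather than standard deviations.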

Chapter 17
16. Bernoulli trials: We have Bernoulli trials (for example, when we run simulations) if:
   a. There are two possible outcomes (success and failure).
   b. The probability of success, p, is constant.
   c. The trials are independent.
      1. 10% condition: Sampling without replacement technically violates independence. If that assumption is violated, it is still okay to proceed as long as the sample is smaller than 10% of the population.

17. Geometric Probability: the model for a random variable that counts the number of Bernoulli trials until the first success. It is completely specified by one parameter, p, the probability of success. Denoted as Geom(p).
   p = probability of success
   q = 1 – p = probability of failure
   X = number of trials until the first success occurs
   P(X = x) = q^(x-1) p
   a. The expected value is E(X) = 1/p. The standard deviation is σ = √(q/p²).
   b. On the TI, geometcdf(p, x) returns the total probability for all values in the interval [1, x], that is, P(X ≤ x).
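For checking calculator output by hand, the geometric formulas can be sketched as below. The function names mirror the TI commands only for readability; they are plain Python written for this example, and p = 0.2 is an arbitrary illustration value.

```python
import math

def geometpdf(p, x):
    """P(X = x): first success on trial x, i.e. q^(x-1) * p."""
    return (1 - p) ** (x - 1) * p

def geometcdf(p, x):
    """P(X <= x): total probability over trials 1 through x."""
    return sum(geometpdf(p, k) for k in range(1, x + 1))

def geom_mean_sd(p):
    """Expected value 1/p and standard deviation sqrt(q / p^2)."""
    q = 1 - p
    return 1 / p, math.sqrt(q / p ** 2)

p = 0.2  # hypothetical probability of success
print(geometpdf(p, 3))       # first success exactly on trial 3
print(geometcdf(p, 3))       # first success within 3 trials
print(1 - geometcdf(p, 3))   # more than 3 trials needed
print(geom_mean_sd(p))
```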

18. Binomial Probability: the model for a random variable that counts the number of successes in a fixed number of Bernoulli trials. There are two parameters: n = number of trials and p = probability of success. Denoted as Binom(n, p).
   P(X = x) = nCx p^x q^(n-x)
   a. The expected value is E(X) = np. The standard deviation is σ = √(npq).
   b. On the TI, binomcdf(n, p, x) returns the total probability for the binomial in the interval [0, x], that is, P(X ≤ x).
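The binomial formulas can be sketched the same way. Again the names only imitate the TI commands, and n = 4, p = 0.5 are arbitrary illustration values.

```python
from math import comb, sqrt, isclose

def binompdf(n, p, x):
    """P(X = x) = nCx * p^x * q^(n-x)."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

def binomcdf(n, p, x):
    """P(X <= x): total probability for 0 through x successes."""
    return sum(binompdf(n, p, k) for k in range(0, x + 1))

def binom_mean_sd(n, p):
    """Expected value np and standard deviation sqrt(npq)."""
    return n * p, sqrt(n * p * (1 - p))

n, p = 4, 0.5  # hypothetical: 4 trials, 50% chance of success
print(binompdf(n, p, 2))       # exactly 2 successes
print(binomcdf(n, p, 1))       # at most 1 success
print(1 - binomcdf(n, p, 1))   # at least 2 successes
print(binom_mean_sd(n, p))
```

"At least" questions are handled with the complement, as in the third print line, because the cdf always starts counting from 0.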


AP Exam Unit 4 Review Worksheet Name

1. We can hire a cab from one of three firms: X, Y, and Z. Of the hirings, 40% are X, 50% are Y, and 10% are Z. For the cabs hired from X, 9% are late; the corresponding percentages for Y and Z are 6% and 20%, respectively. Calculate the probability that the next cab hired:

a. will be from X and will not arrive late

b. will arrive late

c. given that a call is made for a cab and it arrives late, find the probability that it came from Y.

2. In a group of 100 people: 40 own a cat, 25 own a dog, and 15 own a cat and dog. Find the probability that a person chosen at random:

a. owns a cat or dog

b. owns a dog or a cat, but not both

c. owns a dog, given that he owns a cat

d. does not own a cat, given that he owns a dog

3. You play tennis regularly with a friend, and from past experience you believe that the outcome of each match is independent. For any given match you have a probability of 0.6 of winning. What is the probability that you win the next two matches?

4. One thousand students at a city high school were classified according to both GPA and whether they consistently skipped classes.

                 GPA < 2.0   2.0-3.0   > 3.0   Total
Skipped a lot       80          25        5
Skipped little     175         450      265
Total

a. What is the probability that a student has a GPA between 2 and 3?

b. What is the probability that a student has a GPA under 2.0 and has skipped class a lot?

c. What is the probability that a student has a GPA under 2.0 given that he skipped class a lot?

d. What is the probability that a student has a GPA over 3.0 and has skipped class a lot?

e. Are “GPA between 2.0-3.0” and “skipped class little” independent? Why or why not?


5. Two events A and B are such that: P(A) = .6, P(B) = .3, and P(A|B) = .8. Find the probability that:

a. both events occur

b. only one of the two events occurs

6. Selected boxes of a breakfast cereal contain a prize. Suppose that 5% of the boxes contain the prize and the other 95% contain the message “Sorry try again”. A consumer determined to find a prize decides to continue to buy boxes of cereal until a prize is found. Consider the random variable x, where x = number of boxes purchased to get a prize.

a. What kind of probability distribution is this?

b. What is the probability that at most 2 boxes must be purchased?

c. What is the probability that exactly four boxes must be purchased?

d. What is the probability that more than four boxes must be purchased?

7. A consumer organization estimates that 29% of new cars have a cosmetic defect such as a scratch or a dent when they are delivered to car dealers. This same organization believes that 7% have a functional defect – something that does not work properly – and that 2% of new cars have both kinds of problems. If you have a functional defect on a new car, what’s the probability it also has a cosmetic defect?

8. A company’s human resources officer reports a breakdown of employees by job type and gender, shown in the table. What’s the probability that a worker selected at random is female, if the person is a supervisor?

              Male   Female
Management      7       6
Supervision     8      12
Production     45      72

9. To play a game, you must pay $5 for each play. There is a 10% chance you will win $100, a 40% chance you will win $50, and a 50% chance you will win only $25. What are the mean and standard deviation of your winnings?


10. Safety engineers must determine whether industrial workers can operate a machine’s emergency shutoff device. Among a group of test subjects, 66% were successful with their left hands, 82% with their right hands, and 51% with both hands. What percent of these workers could not operate the switch with either hand?

11. Neurological research has shown that in about 80% of people language abilities reside in the brain’s left side. Another 10% display right-brain language centers, and the remaining 10% have 2-sided language control. Assume that a freshman comp class contains 25 randomly selected people. What’s the probability that no more than 15 of them have left-brain language control?

12. Since the stock market began in 1872, stock prices have risen in about 73% of the years. Assuming that market performance is independent from year to year, what’s the probability that the market will rise during at least 1 of the next 5 years?

13. The Centers for Disease Control say that about 30% of high school students smoke tobacco (down from a high of 38% in 1997). Suppose you randomly select high school students to survey them on their attitudes toward scenes of smoking in the movies. What’s the probability that there are no more than 3 smokers among the 15 people that you chose?


AP Exam Review Unit 5 Inference with Proportions
Chapter 18
1. Sampling Distribution: Even though we depend on sampling distribution models, we never actually get to see them.
   a. Sampling distribution models are important because they act as a bridge from the real world of data to the imaginary world of the statistic.
   b. They enable us to say something about the population when all we have is data from the real world.
2. Sampling distribution of p̂: the model that occurs when we imagine taking sample after sample of the same size from the same population and making a histogram of the sample proportions. Provided that the sampled values are independent and the sample size is large enough, the sampling distribution of p̂ is modeled by a Normal model with:
   a. mean: μ(p̂) = p
      p = true proportion of successes in the population
      p̂ = proportion of successes in the sample
   b. standard deviation: SD(p̂) = √(pq/n), where q = 1 – p
   c. Conditions for proportions that MUST be true before using the Normal model:
      1. 10% condition: If the sample was not drawn with replacement, then the sample size, n, must be less than 10% of the population.
      2. Random Sampling condition: The sample must be an SRS or representative of the population.
      3. Success/failure condition: The sample size has to be big enough so that both np and nq are at least 10.
   d. If conditions are met, we know the sampling distribution follows the Normal model.
      1. If it does, all of the previous information we have about Normal models holds true, such as the 68-95-99.7 Rule and finding probabilities from z-scores.
3. When we get to means in Unit 6, the success/failure condition changes to a normality condition:
   a. Normality condition:
      1. Either the original population must be Normal, or
      2. the Central Limit Theorem (CLT) will assure a Normal-like distribution when our sample size exceeds 30.
   Mean: μ(x̄) = μ     Standard deviation: SD(x̄) = σ/√n
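As an illustration of the Normal model for p̂, the sketch below computes the model's mean and standard deviation, checks the success/failure condition, and finds one probability. It uses Python's standard `statistics.NormalDist`; the values p = 0.5 and n = 100 are invented, and the helper names are hypothetical.

```python
from math import sqrt, isclose
from statistics import NormalDist

def phat_model(p, n):
    """Mean and SD of the sampling distribution of p-hat: p and sqrt(pq/n)."""
    return p, sqrt(p * (1 - p) / n)

def success_failure_ok(p, n):
    """Success/failure condition: np and nq both at least 10."""
    return n * p >= 10 and n * (1 - p) >= 10

# Hypothetical population proportion and sample size.
p, n = 0.5, 100
mean, sd = phat_model(p, n)
model = NormalDist(mean, sd)

# Probability that a sample proportion comes out above 0.60 (z = 2 here).
prob_above_60 = 1 - model.cdf(0.60)
print(mean, sd, success_failure_ok(p, n), prob_above_60)
```

The 10% and randomization conditions cannot be checked by code; they depend on how the sample was actually collected.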

Chapter 19
4. Confidence interval: the idea that our guess will be a little bit off.
   a. p̂ = sample proportion
   b. center is p
   c. standard deviation is SD(p̂) = √(pq/n)
   d. Standard error of p̂: Since we don’t know p, we can’t find the true standard deviation of the sampling distribution model, so we find the standard error:
      SE(p̂) = √(p̂q̂/n)
   e. 95% confidence interval: if we reach out 2 SEs in either direction of p̂, we can be 95% confident that this interval captures the true proportion.

5. Margin of Error (ME): the extent of the interval on either side of p̂.
   a. The formula for a confidence interval is: estimate ± margin of error
   b. ME = z* x SE(p̂). The more confident we want to be, the larger our ME needs to be.
   c. In order to cut the ME in half, you have to quadruple the sample size.

6. Most commonly chosen confidence levels and critical values (z*):
   90% = 1.645
   95% = 1.96
   99% = 2.576
   To find z*, use the t table: look at the confidence levels in the bottom row and find the critical value right above it in the ∞ row.


So what does 90% = 1.645 mean? For a 90% confidence interval, the critical value is 1.645 because, for a Normal model, 90% of the values are within 1.645 standard deviations of the mean.
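The critical values can be recovered from the standard Normal model directly, as in this short sketch using Python's `statistics.NormalDist` (the function name `z_star` is just a label chosen here).

```python
from statistics import NormalDist

def z_star(confidence):
    """Critical value z* that leaves (1 - confidence)/2 in each tail."""
    tail = (1 - confidence) / 2
    return NormalDist().inv_cdf(1 - tail)

for level in (0.90, 0.95, 0.99):
    print(level, round(z_star(level), 3))
```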

7. Confidence level (typically 90%, 95%, or 99%) tells us the proportion of intervals that *in the long run* will capture our true population parameter by using this method.

8. When conditions are met, we may proceed with a one-proportion z-interval: p̂ ± z* x √(p̂q̂/n)

9. Procedure for Confidence Intervals (CI):
   P – define your Parameter
   A – state your Assumptions/conditions
   N – Name your interval
   I – find your Interval
   C – write your Conclusion in Context:
       “I am ___% confident that the true proportion of ______ lies between ____ and ____.”
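The "I" step of PANIC for a one-proportion z-interval can be sketched as follows. The sample (60 successes out of 100) is made up, the function name is hypothetical, and the conditions in the "A" step still have to be checked separately.

```python
from math import sqrt, isclose
from statistics import NormalDist

def one_prop_z_interval(successes, n, confidence=0.95):
    """Return (low, high) for p-hat +/- z* x sqrt(p-hat x q-hat / n)."""
    p_hat = successes / n
    se = sqrt(p_hat * (1 - p_hat) / n)               # standard error of p-hat
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    me = z * se                                      # margin of error
    return p_hat - me, p_hat + me

# Hypothetical sample: 60 successes in 100 trials.
low, high = one_prop_z_interval(60, 100)
print(round(low, 3), round(high, 3))

# Quadrupling the sample size (same p-hat) cuts the margin of error in half.
low4, high4 = one_prop_z_interval(240, 400)
print((high - low) / (high4 - low4))
```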

Chapter 20
10. Hypothesis test: a proposal of a model for the world. If the data are consistent with that model, we have no reason to disbelieve the hypothesis. (Note: this does NOT mean the data PROVE the hypothesis.)
   a. If the facts are inconsistent with the model, we need to make a choice as to whether they are inconsistent enough to disbelieve the model.
   b. We begin by assuming that a hypothesis is true. Next we consider whether the data are consistent with the hypothesis. If they are, all we can do is retain the hypothesis we started with. If they are not, then we ask whether they are unlikely beyond a reasonable doubt.
   c. In Statistics, we can quantify our doubt by finding the probability that data like ours could occur based on our hypothesized model.
      1. Retain (fail to reject): the results seem consistent with what we would expect from natural sampling variability.
      2. Reject: the probability of seeing results like our data is really low.

11. Writing the Hypotheses
   a. Null hypothesis (Ho): specifies a population parameter of interest and proposes a value for it. (Nothing different, nothing going on.)
   b. Alternate hypothesis (Ha): contains the values of the parameter we accept if we reject the null. (There is something different, something did go on.)

12. Look at the evidence: In order to evaluate the evidence, we want to compare our data to what we would expect given that Ho is true.
   a. We can do this by finding out how far our data are from the mean of the hypothesized model.
   b. We can then make a decision by asking how likely it is to get results like we did if the null hypothesis were true.

13. Procedure for Hypothesis Test:
   P – Parameter of Interest: p indicates the true proportion of ______
   H – State your Hypotheses: Ho: p = ____
       Ha: p < ____ or p > ____ or p ≠ ____
   A – Assumptions/Conditions:
       10%: the sample is less than 10% of our population
       SRS (or representative sample)
       At least 10 expected successes and at least 10 expected failures
   N – Name your procedure: If the conditions are met, we will continue with _________________.
   T – Calculate a Test Statistic: For a 1-proportion z-test, the test statistic is calculated by:
       z = (p̂ – p0) / √(p0 q0 / n)
   O – Obtain a P-Value: The p-value tells us the probability that the observed test statistic (or an even more extreme value) could occur if the null is true.
   M – Make a decision: Written in terms of the null hypothesis. Based on the p-value, we either retain (fail to reject) or reject the Ho.
   S – State conclusion in context! Written in terms of the alternate hypothesis. We have enough evidence (or not enough evidence) to conclude that ___________.
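The T and O steps for a 1-proportion z-test can be sketched like this. The data (60 successes in 100 trials against Ho: p = 0.5) are invented, and the `alternative` argument is a naming choice made for this example.

```python
from math import sqrt
from statistics import NormalDist

def one_prop_z_test(successes, n, p0, alternative="two-sided"):
    """Return (z, p_value) for a 1-proportion z-test of Ho: p = p0.

    alternative is "less", "greater", or "two-sided" (matches Ha).
    """
    p_hat = successes / n
    sd = sqrt(p0 * (1 - p0) / n)     # uses the hypothesized p0, not p-hat
    z = (p_hat - p0) / sd
    std_normal = NormalDist()
    if alternative == "less":
        p_value = std_normal.cdf(z)
    elif alternative == "greater":
        p_value = 1 - std_normal.cdf(z)
    else:
        p_value = 2 * (1 - std_normal.cdf(abs(z)))   # both tails
    return z, p_value

# Hypothetical data: 60 successes in 100 trials, Ho: p = 0.5
z, p_two = one_prop_z_test(60, 100, 0.5)
_, p_upper = one_prop_z_test(60, 100, 0.5, alternative="greater")
print(z, p_two, p_upper)
```

The two-sided p-value is twice the one-sided value, which is why the direction of Ha has to be decided before looking at the data.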

14. Two-tailed test: The test we run if we are testing “not equal to” or “differs” (≠). It doesn’t matter if the new result is below or above the null value. The P-value is in both tails of the distribution.
    One-tailed test: The test we run if we are concerned about one direction only (< or >). The P-value is in only one tail of the distribution.

15. Finding the p-value
   a. If your conditions are met, you can use the Normal model to find a p-value.
   b. Using your test statistic, determine the probability that you will have a value at least as extreme as the one you got.
   c. It will help if you sketch your model!
   (Remember: The smaller the P-value, the more evidence we have against the null hypothesis.)

16. How to make a decision:
   a. The p-value tells us the probability of getting a test statistic as extreme as the one we got if the null hypothesis is true.
   b. The p-value we use to reject the null hypothesis differs from situation to situation.
   c. If possible, we like to include a confidence interval for the parameter of interest when we do a hypothesis test.

Chapter 21
17. Statistically significant: when the p-value is small, it tells us that our result is a rare occurrence (given Ho).
   a. If the data are “rare enough”, we just don’t think it could have happened by chance alone.
   b. We declare our results to be statistically significant when we reject the null: since our data did happen, we believe that something other than Ho is true.

18. Alpha level: a threshold or cutoff value we set to define a “rare” occurrence.
   a. If the p-value falls below the alpha level, we reject the Ho.
   b. The alpha level is also called the “significance level of the test”. Common alpha levels are 0.01, 0.05, and 0.10.

19. P-value: provides more information than just simply declaring “REJECT” or “RETAIN”; it gives a numerical value to the strength of the evidence against the null hypothesis.
   a. Before calculators/computers, statisticians used the “critical value” method to make decisions.
      1. The “critical value” is the z* from the table.
      2. It is the threshold/cutoff value separating the Rejection Region from the Non-Rejection Region.
   b. P-values are easy to find now and provide more information for your decision!

20. Making Errors: Even when we have lots of data, when we make a decision in a hypothesis test, it’s possible that we will make an error.
   Ho: I will be better off if I take no action.
   Ha: I will be better off if I take action.
   a. Type I Error: the null hypothesis is true, but we reject it.
      1. Corresponds to taking action when you would have been better off not doing so. The probability of a Type I error is controlled by our alpha level!
   b. Type II Error: the null hypothesis is false, but we retain it.
      1. Corresponds to taking no action when you would have been better off taking action.
      2. The probability of a Type II error is beta (β), but it’s difficult to calculate because we don’t know how false Ho is.
   c. Power of the test: our ability to detect a false null hypothesis (power = 1 – β). Corresponds to taking action when you should have.
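To see how alpha, beta, and power relate, here is a rough sketch for an upper-tail 1-proportion z-test under the Normal model. It has to assume a specific "true" proportion, which is exactly what we do not know in practice; all the numbers and the function name are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def power_upper_tail(p0, p_true, n, alpha=0.05):
    """Approximate power of the test Ho: p = p0 vs Ha: p > p0.

    Assumes the Normal model holds and that p_true is the real proportion.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    # Reject Ho when p-hat lands above this cutoff.
    cutoff = p0 + z_alpha * sqrt(p0 * (1 - p0) / n)
    sd_true = sqrt(p_true * (1 - p_true) / n)
    # Power = chance that p-hat exceeds the cutoff when p really is p_true.
    return 1 - NormalDist(p_true, sd_true).cdf(cutoff)

power_small = power_upper_tail(0.5, 0.6, 100)
power_large = power_upper_tail(0.5, 0.6, 400)
beta_small = 1 - power_small            # Type II error probability
print(power_small, beta_small, power_large)

# If Ho is actually true (p_true = p0), "rejecting" is a Type I error,
# and its probability is just alpha.
print(power_upper_tail(0.5, 0.5, 100))
```

In this sketch a larger sample raises the power (lowers β) without changing alpha.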

Chapter 22


21. Two Sample Proportions: used when we want to compare two groups to see how they differ, whether a treatment is better than a placebo control, or whether this year’s results are better than last year’s.
   a. Assumptions/Conditions
      1. EACH sample is less than 10% of ITS population.
      2. Each sample is random or representative.
      3. Successes and failures for EACH sample are at least 10.
      4. Samples are independent of each other. There should be no relationship between the groups.
   b. When the conditions are met, we can proceed with a two-proportion z-interval:
      (p̂1 – p̂2) ± z* x √(p̂1q̂1/n1 + p̂2q̂2/n2)
   c. Hypothesis Test:
      P  Identify the parameter for the two groups.
      H  Null hypothesis – “No difference/change”  Ho: p1 = p2 or p1 – p2 = 0
         Alternate hypothesis – Ha: p1 ___ p2 or p1 – p2 ___ 0 (fill in <, >, or ≠)
      A  The assumptions/conditions remain the same.
      N  When the conditions are met, we can proceed with a two-proportion z-test.
      T  z = (p̂1 – p̂2) / √(p̂q̂(1/n1 + 1/n2)), where p̂ = (x1 + x2)/(n1 + n2) is the pooled proportion and q̂ = 1 – p̂
      O  Obtain the p-value.
      M  Make a decision about the null hypothesis.
      S  State your conclusion about the alternate hypothesis in context and make a comparison of higher/lower.
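A sketch of the two-proportion z-test, assuming the pooled standard error (the usual choice when Ho says p1 = p2). The counts are invented and the function name is hypothetical; a two-sided alternative is used for the p-value.

```python
from math import sqrt
from statistics import NormalDist

def two_prop_z_test(x1, n1, x2, n2):
    """Return (z, two-sided p_value) for Ho: p1 = p2 using the pooled proportion."""
    p1_hat, p2_hat = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)       # combined successes / combined sample size
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1_hat - p2_hat) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical counts: 60 of 100 in group 1, 40 of 100 in group 2.
z, p_value = two_prop_z_test(60, 100, 40, 100)
print(z, p_value)
```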

AP Exam Review Unit 6/7 Inference with Means, Chi-Squared, and Linear Regression
Chapter 23
1. Now that we know how to create confidence intervals and test hypotheses about proportions, it’d be nice to be able to do the same for means.
2. We will base our confidence interval and hypothesis test on the sampling distribution of sample means.
   a. The mean of the model is μ; the standard deviation of x̄ is σ/√n.

3. Central Limit Theorem states that for large samples, the sampling distribution will be close to the Normal model.
   a. For small samples we use a new model, the t-distribution. Because we use s instead of σ, a z-interval would be too narrow; to make it wider we use t instead of z.
   b. We must allow for some extra variation.
   c. If you know σ, use the z-distribution, but that rarely happens.

4. t-distribution: a mound-shaped distribution with mean 0 and a spread that depends on a parameter called degrees of freedom, df = n – 1.
   a. Degrees of freedom are the degrees that a parameter is allowed to vary, or the number of sample values that can vary.
   b. There is a different t-distribution for each number of degrees of freedom.
   c. The greater the df, the smaller the spread.
   d. The spread of any t-distribution is greater than that of the standard Normal distribution.


5. To use the model, we must meet certain conditions:
   a. 10% condition
   b. Randomization or representative
   c. Normality (instead of S/F)
      1. For small samples, we have to graph the data to check for outliers/skewness.
      2. For large samples, the Central Limit Theorem applies, so if n > 30, the Normal model applies.

6. If the conditions are met, we will proceed with a one-sample t-interval for means. (Don’t forget to report your df!!)
   estimate ± margin of error:  x̄ ± t* x s/√n, with df = n – 1

7. Finding Sample Size: We still use ME, but since we don’t know n, we don’t know t*, so use the infinity line (z*).

8. Hypothesis Test
   P  State the parameter for the mean, μ.
   H  Null hypothesis is always the statement of no change: Ho: μ = _____
      Alternate hypothesis is then the statement of change: Ha: μ < __ or μ > __ or μ ≠ __
   A  Conditions/assumptions remain the same.
   N  If our conditions/assumptions are met, we will proceed with a 1-sample t-test for means. Again, don’t forget to report your df!!!
   T  t = (x̄ – μ0) / (s/√n), with df = n – 1
   O  Obtain the p-value.
   M  Make a decision to reject or retain the null hypothesis.
   S  State a conclusion based on evidence for or against the alternate hypothesis.
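The T step for a 1-sample t-test can be sketched with Python's standard library. Only the test statistic and df are computed here; the standard library has no t-distribution, so the p-value would still come from the t table or a calculator. The data set and function name are made up.

```python
from math import sqrt
from statistics import mean, stdev

def one_sample_t(data, mu0):
    """Return (t, df) for Ho: mu = mu0, with t = (x-bar - mu0) / (s / sqrt(n))."""
    n = len(data)
    x_bar = mean(data)
    s = stdev(data)          # sample standard deviation (divides by n - 1)
    se = s / sqrt(n)         # standard error of the sample mean
    return (x_bar - mu0) / se, n - 1

# Hypothetical sample of 8 measurements, testing Ho: mu = 4
data = [2, 4, 4, 4, 5, 5, 7, 9]
t, df = one_sample_t(data, 4)
print(t, df)
```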

Chapter 24
9. Comparing means is not very different from comparing two proportions.
10. Two sample means PANIC
   P  The population model parameter of interest is the difference between the two means, μ1 – μ2. The sampling distribution model is centered at μ1 – μ2 with standard deviation √(σ1²/n1 + σ2²/n2), estimated by the standard error SE(x̄1 – x̄2) = √(s1²/n1 + s2²/n2).
   A  The conditions/assumptions for 2 sample means are:
      a. 10% condition (both samples)
      b. Randomization (both samples)
      c. Nearly Normal (both samples) ... graph if data is given!!!
      d. Independent Groups
   N  If the conditions are met, we will proceed with a 2-sample t-interval for means. Don’t forget to report the df!!! (Given on your calculator)
   I  (x̄1 – x̄2) ± t* x √(s1²/n1 + s2²/n2)
   C  State conclusion in context with a comparison of the difference in the two means (higher/lower).

11. Two sample means PHANTOMS
   P  The population model parameter of interest is the difference between the two means, μ1 – μ2.
   H  Null hypothesis – “No difference/change”  Ho: μ1 = μ2 or μ1 – μ2 = 0
      Alternate hypothesis – “a change”  Ha: μ1 ___ μ2 or μ1 – μ2 ___ 0 (fill in blank with <, >, or ≠)
   A  Assumptions/conditions remain the same. Remember to define what groups 1 and 2 are!
   N  If conditions/assumptions are met, we will proceed with a 2-sample t-test for means. As always, don’t forget the df!!!
   T  t = ((x̄1 – x̄2) – 0) / √(s1²/n1 + s2²/n2)
   O  Obtain a p-value.
   M  Make a decision to reject or retain the null hypothesis.
   S  State a conclusion based on evidence for or against the alternate hypothesis in context with a comparison of the difference in the two means (higher/lower).
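A sketch of the 2-sample t statistic follows. The df line uses the Welch-Satterthwaite approximation, which is an assumption about what the calculator reports rather than something stated in this review; the data and function name are invented.

```python
from math import sqrt
from statistics import mean, variance

def two_sample_t(data1, data2):
    """Return (t, df) for Ho: mu1 - mu2 = 0 with unpooled standard error."""
    n1, n2 = len(data1), len(data2)
    v1 = variance(data1) / n1        # s1^2 / n1
    v2 = variance(data2) / n2        # s2^2 / n2
    se = sqrt(v1 + v2)
    t = (mean(data1) - mean(data2)) / se
    # Welch-Satterthwaite approximation for the degrees of freedom (assumed).
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df

# Hypothetical tiny samples, just to show the arithmetic.
group1 = [5, 7, 9]
group2 = [1, 3, 5]
t, df = two_sample_t(group1, group2)
print(t, df)
```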


Chapter 25
12. Matched Pair means: data are paired when:
   a. observations are collected in pairs, or
   b. observations in one group are naturally related to observations in the other group.
   c. Paired data arise in a number of ways:
      1. Perhaps the most common is to compare subjects with themselves before and after a treatment.

13. Why is paired data special?
   a. The independent-groups assumption is violated, but that means we get to do a better analysis because we can focus only on how the data have changed.
   b. If you know the data are paired, you can (and must!) take advantage of it.
   c. Once we know the data are paired, we can examine the pairwise differences.
   d. Because it is the differences we care about, we treat them as if they were the data and ignore the original two sets of data.

14. Looking at the difference:
   a. Now that we have only one set of data to consider, we can return to the one-sample t-test or interval.
   b. Mechanically, a matched pairs test or interval is just a one-sample t-test or t-interval for the mean of the pairwise differences.

15. Matched Pairs means PANIC
   P  The population model parameter of interest is the mean of the paired differences, μd.
   A  Conditions/assumptions for a matched pairs interval:
      Sample < 10% of the population
      Randomization (either random treatments, random order, or random selection)
      Data must be paired!!!
      Normal (check the differences... not the original data!)
   N  If the conditions are met, we will proceed with a matched pairs t-interval. d represents the difference for each pair. Don’t forget to report the df = n – 1, where n = the number of pairs.
   I  d̄ ± t* x s_d/√n
   C  We are ___% confident that the true average (context) for the (1st group) is between ___ and ___ (higher/lower) than the (2nd group).

16. Matched Pairs PHANTOMS
   P  The population model parameter of interest is the mean of the paired differences, μd.
   H  Null hypothesis – “No difference/change”  Ho: μd = 0
      Alternate hypothesis – “difference/change”  Ha: μd ____ 0 (fill in blank with <, >, or ≠)  *Note: think of what d means!
   A  Conditions/assumptions remain the same! Remember to define what “d” stands for!
   N  If conditions/assumptions are met, we will proceed with a matched pairs t-test. Don’t forget the df!!!
   T  t = (d̄ – 0) / (s_d/√n)
   O  Obtain a p-value.
   M  Make a decision to reject or retain the null hypothesis.
   S  State a conclusion based on evidence for or against the alternate hypothesis in context with a comparison of the difference in the paired data (higher/lower).
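Since a matched pairs test is a one-sample t-test on the differences, the computation can be sketched as below. The before/after scores are invented, and the direction of subtraction (after minus before) is a choice you must define for your own "d".

```python
from math import sqrt
from statistics import mean, stdev

def matched_pairs_t(before, after):
    """Return (t, df) for Ho: mu_d = 0, where d = after - before for each pair."""
    diffs = [a - b for b, a in zip(before, after)]   # one difference per pair
    n = len(diffs)
    d_bar = mean(diffs)
    s_d = stdev(diffs)
    return d_bar / (s_d / sqrt(n)), n - 1

# Hypothetical paired data: the same 4 subjects measured twice.
before = [10, 12, 9, 11]
after = [12, 15, 10, 14]
t, df = matched_pairs_t(before, after)
print(t, df)
```

Notice that the original two lists are only used to form the differences; everything after that ignores them, as item 13d describes.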

Unit 7
Chapter 26
17. Chi-Squared Distribution: a family of curves, defined by degrees of freedom.
   a. The Chi-Square test is a non-parametric test of statistical significance.
   b. It does not compare the data to a specific population parameter, but to another distribution.

18. Chi-Squared Goodness of Fit Test (GOF)
   a. Hypothesis test that asks, when dealing with categorical data, whether the data fit what we expected.
   b. PHANTOMS for GOF: Data is categorical, so we are looking at df = k – 1, where k is the number of categories.
      PH  Null hypothesis – no difference/change  Ho: The data fit the expected distribution.
          Alternate hypothesis – difference/change  Ha: The data don’t fit our expectations.
      A   Assumptions stay the same for all χ² tests:
          The data must be in counts (not percentages or amounts).
          Randomization or representative (of course!)
          Expect at least 5 individuals in each cell (and you MUST show/tell the expected values!).
      N   If our conditions are met, we can proceed with a Chi-Square Goodness of Fit Test with df = k – 1.
      T   χ² = Σ (Obs – Exp)² / Exp
      O   Obtain the p-value.
      M   Make a decision to reject or retain.
          a. GOF tests are likely performed by people who have a theory of what the proportions should be.
          b. Unfortunately, the only null hypothesis available for a GOF test is that the theory is true.
          c. As we know, the hypothesis testing procedure allows us only to reject or fail to reject the null.
      S   State a conclusion in context about the alternate hypothesis.

19. Chi-Squared Homogeneity Test
   a. Hypothesis test comparing the distribution of counts for two or more groups on the same categorical variable.
   b. Data are gathered from two or more populations; we are interested in whether those populations have similar or different distributions.
   c. A test of homogeneity is actually the generalization of the two-proportion z-test.
   d. Differences from and similarities to GOF:
      1. The statistic that we calculate for this test is identical to the chi-square goodness-of-fit statistic.
      2. We ask whether choices have changed among different populations rather than comparing to a model.
      3. Expected counts are found directly from the data, and we have different degrees of freedom.
   e. PHANTOMS for Homogeneity:
      PH  Null hypothesis – no difference/change  Ho: The distribution of choices is the same among populations.
          Alternate hypothesis – difference/change  Ha: The distribution of choices is different among populations.
          This data is organized using a table. The df = (#rows – 1)(#columns – 1).
      A   Assumptions stay the same for all χ² tests. Expected counts: you will use and show a matrix.
      N   If our conditions are met, we can proceed with a Chi-Square Homogeneity Test with df = (R – 1)(C – 1).
      T   χ² = Σ (Obs – Exp)² / Exp
      O   Obtain the p-value.
      M   Make a decision to reject or retain.
      S   State a conclusion in context about the alternate hypothesis.
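The chi-square statistic for both a goodness-of-fit test and a two-way table can be sketched as follows. For the two-way table, each expected count is taken as (row total x column total) / grand total, which is the standard calculation behind the expected matrix; the counts and function names here are made up. The p-value would still come from a χ² table or calculator.

```python
def chi_square_stat(observed, expected):
    """Sum of (Obs - Exp)^2 / Exp over all cells."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def expected_counts(table):
    """Expected matrix for a two-way table of counts (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    return [[r * c / grand for c in col_totals] for r in row_totals]

def chi_square_two_way(table):
    """Return (chi-square statistic, df) with df = (R - 1)(C - 1)."""
    exp = expected_counts(table)
    stat = sum(chi_square_stat(o_row, e_row) for o_row, e_row in zip(table, exp))
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df

# Hypothetical GOF data: 100 observations, theory says 4 equally likely categories.
observed = [18, 22, 20, 40]
gof_stat = chi_square_stat(observed, [25, 25, 25, 25])
gof_df = len(observed) - 1          # df = k - 1

# Hypothetical two-way table: 2 populations (rows) by 2 choices (columns).
table = [[30, 20],
         [20, 30]]
stat, df = chi_square_two_way(table)
print(gof_stat, gof_df, expected_counts(table), stat, df)
```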


Chapter 26 (con’t)
20. Chi-Squared Independence Test
a. Data is gathered from one population and categorized based on two variables; we ask whether the variables affect each other.
b. A test of whether the two categorical variables are independent examines the distribution of counts for one group of individuals classified according to both variables in a contingency table.
c. A chi-square test of independence uses the same calculation as a test of homogeneity.
d. PHANTOMS for Independence:
PH Null hypothesis (no difference/change) Ho: There is no relationship between variable 1 and variable 2.
Alternate hypothesis (difference/change) Ha: There is a relationship between variable 1 and variable 2.
This data is organized using a table. The df = (#rows - 1)(#columns - 1)
A Assumptions stay the same for all χ² tests:
Expected: you will use and show a matrix
N If our conditions are met, we can proceed with a Chi-Square Independence Test with df = (R-1)(C-1)
T Test statistic: χ² = Σ (Obs - Exp)² / Exp
O Obtain the p-value
M Make a decision to reject or retain
S State a conclusion in context about the alternate hypothesis

21. Homogeneity vs Independence
Homogeneity and Independence are virtually identical.
a. The difference is:
1. What you are trying to figure out (same distributions vs. a relationship between variables)
2. How the data is gathered:
Homogeneity uses a stratified sample (two populations) and asks each person one question (variable).
Independence uses an SRS (one population) and classifies each person based on two variables.

Chapter 27
22. In regression, we want to model the relationship between two quantitative variables, the predictor and the response.
23. We imagine an idealized regression line, which assumes the means of the distributions of the response variable fall along the line while individual values are scattered around it.
24. We write this line with Greek letters and consider the coefficients to be parameters:
β0 = intercept β1 = slope
Corresponding to our fitted line of ŷ = b0 + b1x, we write μy = β0 + β1x.

25. PHANTOMS for Inference Regression
PH Null hypothesis (no difference/change) Ho: β1 = 0
Alternate hypothesis (difference/change) Ha: β1 ___ 0
As always, don’t forget to define your variables!
A Assumptions: for all inference methods, we must check some conditions
Linearity: check the scatterplot to see if the trend is linear
Independence: check to see if residuals are randomly scattered
Normality: check to see if the histogram of residuals is symmetric with no outliers
Equal Variance: check to see if residuals have a uniform spread
N If our conditions are met, we can proceed with a t-test for the slope of the regression line with df = n - 2
T Test statistic: t = (b1 - 0) / SE(b1)
O Obtain the p-value
M Make a decision to reject or retain
S State a conclusion in context about the alternate hypothesis

27. PANIC for Inference Regression
A Conditions/assumptions remain the same.
N If conditions are met, we can proceed with a t-interval for the slope of the line with df = n - 2
I Interval: b1 ± t* × SE(b1)
C State conclusion in context


AP Exam Unit 5/7 Inference Review Name
You may work the PANIC and PHANTOMS all the way, if you would like. I would just go through naming the parameter, checking assumptions and naming the test.

1. A random sample of 415 potential voters was interviewed 3 weeks before the start of a state-wide campaign for governor; 223 of the 415 said they favored the new candidate over the incumbent. However, the new candidate made several unfortunate remarks one week before the election. Subsequently, a new random sample of 630 potential voters showed that 317 voters favored the new candidate.
Do these data support the conclusion that there was a decrease in voter support for the new candidate after the unfortunate remarks were made? Give appropriate statistical evidence to support your answer.
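One way to check the arithmetic for problem 1 is a two-proportion z-test with a pooled standard error, sketched below. The helper `normal_cdf` and the variable names are illustrative choices, and the sketch covers only the test statistic and p-value, not the hypotheses, conditions, or conclusion in context.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative probability, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Counts from problem 1.
x_before, n_before = 223, 415
x_after, n_after = 317, 630

p_before = x_before / n_before
p_after = x_after / n_after

# Under Ho (no change), pool both samples to estimate the common proportion.
p_pooled = (x_before + x_after) / (n_before + n_after)
se = math.sqrt(p_pooled * (1 - p_pooled) * (1 / n_before + 1 / n_after))

z = (p_before - p_after) / se
p_value = 1 - normal_cdf(z)  # one-sided: Ha says support decreased

print(f"z = {z:.2f}, p-value = {p_value:.3f}")
```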

2. A large university provides housing for 10 percent of its graduate students to live on campus. The university’s housing office thinks that the percentage of graduate students looking for housing on campus may be more than 10 percent. The housing office decides to survey a random sample of graduate students, and 62 of the 481 respondents say that they are looking for housing on campus.
a. On the basis of the survey data, would you recommend that the housing office consider increasing the amount of housing on campus available to graduate students? Give appropriate evidence to support your recommendation.
b. In addition to the 481 graduate students who responded to the survey, there were 19 who did not respond. If these 19 had responded, is it possible that your recommendation would have changed? Explain.
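The numbers in problem 2 can be checked with a one-proportion z-test, and part (b) can be explored by re-running the test under the two extreme cases for the 19 nonrespondents. This is a sketch under those assumptions; `one_prop_z` is a made-up helper name.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative probability, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def one_prop_z(successes, n, p0):
    """z statistic and upper-tail p-value for Ho: p = p0 vs Ha: p > p0."""
    p_hat = successes / n
    se = math.sqrt(p0 * (1 - p0) / n)  # standard error uses the null value p0
    z = (p_hat - p0) / se
    return z, 1 - normal_cdf(z)

p0 = 0.10

# Part (a): the 481 students who responded.
z_obs, p_obs = one_prop_z(62, 481, p0)

# Part (b): extreme cases if all 500 sampled students had responded.
z_none, p_none = one_prop_z(62, 500, p0)       # none of the 19 want housing
z_all, p_all = one_prop_z(62 + 19, 500, p0)    # all 19 want housing

print(f"observed: z = {z_obs:.2f}, p = {p_obs:.4f}")
print(f"none of the 19: z = {z_none:.2f}, p = {p_none:.4f}")
print(f"all of the 19: z = {z_all:.2f}, p = {p_all:.6f}")
```

Comparing the two extreme p-values with your significance level shows whether the nonrespondents could have changed the decision.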


3. The Colorado Rocky Mountain Rescue Service wishes to study the behavior of lost hikers. If more were known about the direction in which lost hikers tend to walk, then more effective search strategies could be devised. Two hundred hikers selected at random from those applying for hiking permits are asked whether they would head uphill, downhill, or remain in the same place if they became lost while hiking. Each hiker in the sample was also classified according to whether he or she was an experienced or novice hiker. The resulting data are summarized in the following table.

                         Direction
              Uphill   Downhill   Remain in Same Place
Novice          20        50              50
Experienced     10        30              40

Do these data provide convincing evidence of an association between the level of hiking expertise and the direction the hiker would head if lost? Give appropriate statistical evidence to support your conclusion.
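For the hiker table, the chi-square statistic can be computed directly from the observed counts. The sketch below is one way to check the arithmetic; it relies on the fact that for df = 2 (and only df = 2) the upper-tail area of the chi-square distribution has the closed form exp(-χ²/2), so no statistics library is needed. Variable names are illustrative.

```python
import math

# Observed counts from problem 3 (columns: uphill, downhill, remain).
observed = [
    [20, 50, 50],  # Novice
    [10, 30, 40],  # Experienced
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count = (row total * column total) / grand total.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Chi-square statistic: sum of (Obs - Exp)^2 / Exp over all cells.
chi2 = sum(
    (obs - exp) ** 2 / exp
    for obs_row, exp_row in zip(observed, expected)
    for obs, exp in zip(obs_row, exp_row)
)

df = (len(observed) - 1) * (len(observed[0]) - 1)

# Closed form that is valid only because df = 2 here.
p_value = math.exp(-chi2 / 2)

print(expected)
print(f"chi-square = {chi2:.3f}, df = {df}, p-value = {p_value:.3f}")
```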

4. Baby walkers are seats hanging from frames that allow babies to sit upright with their legs dangling and feet touching the floor. Walkers have wheels on their legs that allow the infant to propel the walker around the house long before he or she can walk or even crawl. Typically, babies use walkers between the ages of 4 months and 11 months.
Because most walkers have tray tables in front that block babies’ views of their feet, child psychologists have begun to question whether walkers affect infants’ cognitive development. One study compared mental skills of a random sample of those who used walkers with a random sample of those who never used walkers. Mental skill scores averaged 113 for 54 babies who used walkers (standard deviation of 12) and 123 for 55 babies who did not use walkers (standard deviation of 15).
a. Is there evidence that the mean mental skill score of babies who use walkers is different from the mean mental skill score of babies who do not use walkers? Explain your answer.


b. Suppose that a study using this design found a statistically significant result. Would it be reasonable to conclude that using a walker causes a change in mean mental skill score? Explain your answer.
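For part (a) of problem 4, the two-sample t statistic can be computed from the summary statistics alone. The sketch below assumes the unpooled (Welch) version of the test, which is what a calculator's 2-sample t-test uses when pooling is turned off; the variable names are illustrative.

```python
import math

# Summary statistics from problem 4.
mean_walker, sd_walker, n_walker = 113, 12, 54
mean_none, sd_none, n_none = 123, 15, 55

# Each group's contribution to the variance of the difference in sample means.
v_walker = sd_walker ** 2 / n_walker
v_none = sd_none ** 2 / n_none

se = math.sqrt(v_walker + v_none)
t = (mean_walker - mean_none) / se

# Welch-Satterthwaite approximation for the degrees of freedom.
df = (v_walker + v_none) ** 2 / (
    v_walker ** 2 / (n_walker - 1) + v_none ** 2 / (n_none - 1)
)

print(f"t = {t:.2f}, df = {df:.1f}")
```

The p-value still has to come from a t-table or calculator using this t and df.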


5. A growing number of employers are trying to hold down the costs that they pay for medical insurance for their employees. As part of this effort, many medical insurance companies are now requiring clients to use generic brand medicines when filling prescriptions. An independent consumer advocacy group wanted to determine if there was a difference, in milligrams, in the amount of active ingredient between a certain “name” brand drug and its generic counterpart. Pharmacies may store drugs under different conditions. Therefore, the consumer group randomly selected ten different pharmacies in a large city and filled two prescriptions at each of these pharmacies, one for the “name” brand and the other for the generic brand of the drug. The consumer group’s laboratory then tested a randomly selected pill from each prescription to determine the amount of active ingredient in the pill. The results are given in the following table.

ACTIVE INGREDIENT (in milligrams)
Pharmacy         1     2     3     4     5     6     7     8     9     10
Name brand      245   244   240   250   243   246   246   246   247   250
Generic brand   246   240   235   237   243   239   241   238   238   234

Based on these results, what should the consumer group’s laboratory report about the difference in the active ingredient in the two brands of pills? Give appropriate statistical evidence to support your response.
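Because each pharmacy supplies both a name-brand and a generic pill, the data in problem 5 are paired, so the analysis works on the ten differences. A minimal sketch of the paired t statistic, with illustrative variable names:

```python
import math
import statistics

# Data from problem 5, one entry per pharmacy.
name_brand = [245, 244, 240, 250, 243, 246, 246, 246, 247, 250]
generic = [246, 240, 235, 237, 243, 239, 241, 238, 238, 234]

# Paired design: analyze the differences (name brand minus generic).
diffs = [n - g for n, g in zip(name_brand, generic)]

mean_diff = statistics.mean(diffs)
sd_diff = statistics.stdev(diffs)          # sample standard deviation
se = sd_diff / math.sqrt(len(diffs))

t = (mean_diff - 0) / se                   # Ho: mean difference = 0
df = len(diffs) - 1

print(diffs)
print(f"mean difference = {mean_diff}, t = {t:.2f}, df = {df}")
```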


6. A study was conducted to determine where moose are found in a region containing a large burned area. A map of the study area was partitioned into the following four habitat types.
1. Inside burned area, not near edge of the burned area
2. Inside burned area, near edge of the burned area
3. Outside burned area, near edge of the burned area
4. Outside burned area, not near edge of the burned area
The figure shows these four habitat types. Note: Figure not drawn to scale
The proportion of total acreage in each of the habitat types was determined for the study area. Using an aerial survey, moose locations were observed and classified into one of the four habitat types. Results are given in the table below.

Habitat type   Proportion of total acreage   Number of moose observed
1              0.340                         25
2              0.101                         22
3              0.104                         30
4              0.455                         40
Total          1.000                         117

a. The researchers who are conducting the study expect the number of moose observed in a habitat type to be proportional to the amount of acreage of that type of habitat. Are the data consistent with this expectation? Conduct an appropriate statistical test to support your conclusion. Assume the conditions for inference are met.
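For the moose data, the goodness-of-fit statistic compares the observed counts with counts expected from the acreage proportions. The sketch below computes only the expected counts, the χ² statistic, and df; the p-value would still come from a table or calculator. Variable names are illustrative.

```python
# Data from problem 6, in habitat-type order 1 to 4.
proportions = [0.340, 0.101, 0.104, 0.455]
observed = [25, 22, 30, 40]

total = sum(observed)

# Under Ho, moose are distributed in proportion to acreage.
expected = [total * p for p in proportions]

# Chi-square statistic: sum of (Obs - Exp)^2 / Exp over the categories.
chi2 = sum((obs - exp) ** 2 / exp for obs, exp in zip(observed, expected))

df = len(observed) - 1  # number of categories minus 1

print([round(e, 3) for e in expected])
print(f"chi-square = {chi2:.2f}, df = {df}")
```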



7. The statistics department at a large university is trying to determine if it is possible to predict whether an applicant will successfully complete the Ph.D. program or will leave before completing the program. The department is considering whether GPA (grade point average) in undergraduate statistics and mathematics courses (a measure of performance) and mean number of credit hours per semester (a measure of workload) would be helpful measures. To gather data, a random sample of 20 entering students from the past 5 years is taken. The data are given below.

Successfully Completed Ph.D. Program
Student        A     B     C     D     E     F     G     H     I     J     K     L     M
GPA            3.8   3.5   4.0   3.9   2.9   3.5   3.5   4.0   3.9   3.0   3.4   3.7   3.6
Credit hours   12.7  13.1  12.5  13.0  15.0  14.7  14.5  12.0  13.1  15.3  14.6  12.5  14.0

Did Not Complete Ph.D. Program
Student        N     O     P     Q     R     S     T
GPA            3.6   2.9   3.1   3.5   3.9   3.6   3.3
Credit hours   11.1  14.5  14.0  10.9  11.5  12.1  12.0

The regression output below resulted from fitting a line to the data in each group. The residual plot (not shown) indicated no unusual patterns, and assumptions necessary for inference were judged to be reasonable.

Successfully Completed Ph.D. Program
Predictor   Coef      StDev    T       P
Constant    23.514    1.684    13.95   0.000    S = 0.5658
GPA         -2.7555   0.4668   -5.90   0.000    R-Sq = 76.0%

Did Not Complete Ph.D. Program
Predictor   Coef      StDev    T       P
Constant    24.200    3.474    6.97    0.001    S = 0.8408
GPA         -3.485    1.013    -3.44   0.018    R-Sq = 70.3%

b. For students who successfully completed the Ph.D. program, is there a significant relationship between GPA and mean number of credit hours per semester? Give a statistical justification to support your response.
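The T column in the regression printout is just the coefficient divided by its standard error, and the degrees of freedom are n - 2. A short sketch that rebuilds the t statistic for the slope in the "Successfully Completed" group from the printed values (variable names are illustrative):

```python
# Values read from the "Successfully Completed Ph.D. Program" output.
slope = -2.7555      # Coef for GPA
se_slope = 0.4668    # StDev for GPA
n = 13               # students A through M

# t-test for the slope, Ho: beta1 = 0.
t = (slope - 0) / se_slope
df = n - 2

print(f"t = {t:.2f}, df = {df}")  # t matches the -5.90 in the printout
```

The P column (0.000) is the two-sided p-value for this t with 11 degrees of freedom.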


8. A survey given to a random sample of students at a university included a question about which of two well-known comedy shows, S or F, students preferred. The students were asked the question, “Do you prefer S or F?” The responses are shown below.

Preference
S      F      Total
185    139    324

a. Based on the results of this survey, construct and interpret a 95% confidence interval for the proportion of students in the population who would respond S to the question, “Do you prefer S or F?”

b. What is the meaning of “95% confidence” in part (a)?

c. In a follow-up survey, a separate group of randomly selected students was asked “Do you prefer F or S?” The responses are shown below.

Preference
S      F      Total
68     88     156

Based on these two surveys, is there evidence that the stated preference depends on the order in which the comedy shows were listed in the survey question? Justify your answer.
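For part (a) of problem 8, the interval is a one-proportion z-interval. The sketch below uses the standard critical value z* = 1.96 for 95% confidence; the variable names are illustrative, and the interpretation in context is still up to you.

```python
import math

# First survey from problem 8: 185 of 324 students answered S.
successes, n = 185, 324
p_hat = successes / n

z_star = 1.96  # critical value for 95% confidence
se = math.sqrt(p_hat * (1 - p_hat) / n)
margin = z_star * se

lower, upper = p_hat - margin, p_hat + margin

print(f"p-hat = {p_hat:.3f}, interval = ({lower:.3f}, {upper:.3f})")
```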


9. In September 1990, each student in a random sample of 200 biology majors at a large university was asked how many lab classes he or she was enrolled in. The sample results are shown below.

1990 sample (x̄ = 1.83, s = 1.29)
Number of Lab Classes   Number of Students
0                       28
1                       62
2                       58
3                       28
4                       16
5                       8
Total                   200

To determine whether the distribution has changed over the past 10 years, a similar survey was conducted in Sept. 2000 by selecting a random sample of 200 biology majors. Results from the year 2000 sample are below.

2000 sample (x̄ = 1.93, s = 1.37)
Number of Lab Classes   Number of Students
0                       20
1                       72
2                       60
3                       10
4                       26
5                       12
Total                   200

Do the data provide evidence that the distribution of the number of lab classes taken by biology majors was different in 2000 than in 1990? Perform an appropriate statistical test using α = 0.10.


10. A pharmaceutical company has developed a new drug to reduce cholesterol. A regulatory agency will recommend the new drug for use if there is convincing evidence that the mean reduction in cholesterol level after one month of use is more than 20 mg/dl, because a mean reduction of this magnitude would be greater than the mean reduction for the current most widely used drug.
The pharmaceutical company collected data by giving the new drug to a random sample of 50 people from the population of people with high cholesterol. The reduction in cholesterol level after one month of use was recorded for each individual in the sample, resulting in a sample mean reduction and standard deviation of 24 mg/dl and 15 mg/dl, respectively.
a. The regulatory agency decides to use an interval estimate for the population mean reduction in cholesterol level for the new drug. Provide this 95% confidence interval and interpret it.

b. Because the 95% confidence interval includes 20, the regulatory agency is not convinced that the new drug is better than the current best-seller. The pharmaceutical company tested the following hypotheses.

H0: μ = 20 Ha: μ > 20, where μ = population mean reduction in cholesterol level for the new drug.
The test procedure resulted in a t-value of 1.89 and a p-value of 0.033. Because the p-value was less than 0.05, the company believes that there is convincing evidence that the mean reduction in cholesterol level for the new drug is more than 20. Explain why the confidence interval and hypothesis test led to different conclusions.
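The two calculations in problem 10 can be placed side by side. In the sketch below, the critical value t* ≈ 2.01 for df = 49 is taken from a t-table and should be treated as an assumption (a table that only lists df = 40 or 50 gives a slightly different value); the variable names are illustrative.

```python
import math

# Summary statistics from problem 10.
x_bar, s, n = 24, 15, 50
mu_0 = 20

se = s / math.sqrt(n)
df = n - 1

# Part (a): two-sided 95% t-interval.
t_star = 2.01  # assumed table value for df = 49, 95% confidence
lower, upper = x_bar - t_star * se, x_bar + t_star * se

# Part (b): one-sided test statistic for Ho: mu = 20 vs Ha: mu > 20.
t_stat = (x_bar - mu_0) / se

print(f"95% interval: ({lower:.2f}, {upper:.2f})")
print(f"t = {t_stat:.2f}, df = {df}")
```

The interval puts 2.5% in each tail, while the one-sided test at the 0.05 level puts all 5% in the upper tail, so a t statistic between the two cutoffs can be significant for the test even though 20 lies inside the interval.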


Simple Things To Do To Improve Your AP Exam Scores

1. Read the problem carefully, and make sure that you understand the question that is asked. Then answer the question(s)!!!
Suggestion: Circle or highlight key words and phrases. That will help you focus on exactly what the question is asking.
Suggestion: When you finish writing your answer, re-read the question to make sure you haven’t forgotten something important.

2. Write your answers completely but concisely. Don’t feel like you need to fill up the white space provided for your answer. Nail it and move on.
Suggestion: Long, rambling paragraphs suggest that the test taker is using a shotgun approach to cover up a gap in knowledge.

3. Don’t provide parallel solutions. If multiple solutions are provided, the worst or most outrageous solution will be the one that is graded.
Suggestion: If you see two paths, pick the one that you think is most likely to be correct, and discard the other one.

4. Beware of careless use of language. Even if your calculations are correct, weak communication can cost you points.
Suggestion: Distinguish between sample and population; data and model; lurking and confounding variables; r and r², etc. Know what technical terms mean, and use these terms correctly.

5. A computation or calculator routine will rarely provide a complete response. Be able to write simple English and/or mathematical sentences that convey understanding.
Suggestion: Practice writing narratives for past homework problems, and have them critiqued by your teacher or a fellow student.

6. Know the steps for performing inference:
Hypotheses
Assumptions/conditions
Identify test (confidence interval) and calculate correctly
Conclusions in context
Suggestion: Learn the different forms for hypotheses, memorize conditions/assumptions for various inference procedures, and practice solving inference problems.

7. Understand strengths and weaknesses of different experimental designs.
Suggestion: Study examples of completely randomized, paired, matched pairs and blocked designs.

8. Remember that a simulation can always be used to answer a probability question, as long as it is correct and you explain it adequately.
Suggestion: Practice setting up and running simulations on your TI.

9. Be able to interpret generic computer output.
Suggestion: Practice reconstructing the LSRL equation from a regression analysis printout. Identify and interpret the other numbers.

BEST WISHES AND GOOD LUCK TO ALL OF YOU ON THE EXAM MRS. GARRISON