ch. 1 notes 2018-2019 (blank) - ms....

AP Statistics – Ch. 1 Notes

Exploring Data

Statistics is the art and science of learning from data. This may include:

• Designing appropriate tools to collect data.

• Organizing data in a meaningful way.

o Displaying data with appropriate graphs.

o Summarizing data with numbers.

• Using data to draw conclusions and make predictions.

Data are information in context.

Individuals are the objects described by a set of data. They may be people, animals, or things.

A variable is any attribute that can take different values for different individuals.

A categorical (or qualitative) variable assigns labels that place each individual into a particular group or category.

A quantitative variable takes number values that are quantities (counts or measurements) for which it makes sense to find an average. Not every variable with a number value is quantitative!

Examples:

The distribution of a variable shows what values the variable takes and how often it takes each value. Distributions are summarized in tables and displayed in graphs.

How to Explore Data

Begin by examining each variable by itself. Then move on to study relationships among the variables.

Start with a graph or graphs. Then add numerical summaries.

Example: The following table shows information about several popular cell phone models.

Phone | Operating System | Screen Size (inches) | Internal Storage (GB) | Expandable Storage | Rear Camera (megapixels) | Battery Life, Talk Time (hours)
Apple iPhone 6S Plus | iOS 9 | 5.5 | 16 | No | 12 | 24
Apple iPhone 6s | iOS 9 | 4.7 | 16 | No | 12 | 14
Apple iPhone 6 | iOS 8 | 4.7 | 16 | No | 8 | 14
BlackBerry DTEK 50 | Android 6.0 | 5.2 | 16 | Yes | 13 | 17
BlackBerry Priv | Android 5.1 | 5.4 | 32 | Yes | 18 | 24
BlackBerry Leap | BlackBerry 10 | 5.0 | 16 | Yes | 8 | 25
LG X Skin | Android 6.0 | 5.0 | 16 | Yes | 8 | 7
LG G5 SE | Android 6.0 | 5.3 | 32 | Yes | 16 | 20
LG G5 | Android 6.0 | 5.3 | 32 | Yes | 16 | 20
Microsoft Lumia 650 | Windows 10 | 5.0 | 16 | Yes | 8 | 13
Microsoft Lumia 950 | Windows 10 | 5.2 | 32 | Yes | 20 | 13
Microsoft Lumia 950 XL | Windows 10 | 5.7 | 32 | Yes | 20 | 19
Samsung Galaxy Note 7 | Android 6.0 | 5.7 | 64 | Yes | 12 | 24
Samsung Galaxy On 7 Pro | Android 6.0 | 5.5 | 16 | Yes | 13 | 11
Samsung Galaxy S7 Edge | Android 6.0 | 5.5 | 32 | Yes | 12 | 33


a) Who/what are the individuals in this data set?

b) What variables are measured? Identify each as categorical or quantitative. In what units were the quantitative variables measured?

c) Give the distributions of the following for the data set: screen size, internal storage, and presence of expandable memory.
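As a rough illustration of what "give the distribution" means in part (c), the sketch below tallies the internal storage values from the phone table. The variable names are made up for this example, and Python's `collections.Counter` is just one convenient way to count; the same tally can be done by hand.

```python
from collections import Counter

# Internal storage (GB) for the 15 phones, copied in table order.
internal_storage_gb = [16, 16, 16, 16, 32, 16, 16, 32, 32, 16, 32, 32, 64, 16, 32]

# A distribution lists each value the variable takes and how often it occurs.
storage_counts = Counter(internal_storage_gb)
n_phones = len(internal_storage_gb)

for value in sorted(storage_counts):
    count = storage_counts[value]
    print(f"{value} GB: {count} phones ({count / n_phones:.1%})")
```

The same approach should work for screen size or expandable storage by swapping in that column's values.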

Analyzing Categorical Data

The values of a categorical variable are labels for the categories, such as “male” or “female”. The distribution of a categorical variable gives the categories and either the count or proportion of individuals who fall into each category.

Proportion: The fraction of the total that possesses a certain attribute. Proportions can be expressed as fractions, decimals, or percentages.

Frequency: The number (count) of individuals in each category.

Relative Frequency: The proportion of individuals in each category.

Often, we organize categorical data into either a frequency table or a relative frequency table. (These are sometimes called frequency distributions and relative frequency distributions.)

Example: The following is a frequency table showing the distribution of responses to the question, “How do you eat corn on the cob?” Find the relative frequency distribution.

How do you eat corn on the cob? | Frequency | Relative Frequency
In rows | 28 |
In circles | 4 |
Bite wherever | 5 |
I don’t eat corn on the cob | 2 |
Cut the corn off the cob | 2 |
Total | |
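One way to check a relative frequency table is to compute it directly from the counts. The sketch below does this for the corn data; the dictionary and variable names are assumptions chosen for the example.

```python
# Frequencies from the corn-on-the-cob table.
corn_counts = {
    "In rows": 28,
    "In circles": 4,
    "Bite wherever": 5,
    "I don't eat corn on the cob": 2,
    "Cut the corn off the cob": 2,
}

# Relative frequency = category count divided by the total count.
total_responses = sum(corn_counts.values())
relative_freq = {
    category: count / total_responses for category, count in corn_counts.items()
}

for category, proportion in relative_freq.items():
    print(f"{category}: {proportion:.3f}")
print(f"Total: {total_responses} responses")
```

The relative frequencies should add to 1 (or 100%), apart from rounding.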


Categorical data are often displayed using bar graphs and pie charts.

A bar graph shows each category as a bar. The heights of the bars correspond to the frequencies or relative frequencies of the categories.

A pie chart shows each category as a sector or “slice” of a circle or “pie”. The areas of the slices are proportional to the category frequencies or relative frequencies.

A segmented bar graph displays the distribution of a categorical variable as a single bar divided into segments. The height of each segment corresponds to the proportion of individuals in the category it represents. Segmented bar graphs use relative frequencies on the vertical axis.

Bar Graph Procedure:

1. Draw and label the axes. Put the name of the categorical variable under the horizontal axis. To the left of the vertical axis, indicate whether the graph shows the frequency (count) or relative frequency (proportion) of individuals in each category.

2. “Scale” the axes. Write the names of the categories at equally spaced intervals under the horizontal axis. On the vertical axis, start at 0, and place tick marks at equal intervals until you exceed the highest frequency or relative frequency of any category.

3. Draw bars above the category names. Make sure the bars are equal in width and leave gaps between them. The height of each bar should correspond to the frequency or relative frequency of the individuals in that category.

Pie Chart Procedure:

1. Draw a circle to represent the entire data set.

2. Calculate the size of the central angle for each “slice”: $\text{slice size} = 360^\circ \cdot (\text{relative frequency of category})$

3. Divide the circle into slices with the appropriate central angles. Use a protractor (or computer) to do this.

4. Label the slices appropriately!
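Step 2 of the pie chart procedure can be sketched in a few lines. The function name `central_angles` is hypothetical, and the corn-on-the-cob counts are reused only as sample input.

```python
def central_angles(counts):
    """Return the central angle in degrees for each category's slice."""
    total = sum(counts.values())
    # slice size = 360 degrees times the category's relative frequency
    return {category: 360 * count / total for category, count in counts.items()}

corn_counts = {
    "In rows": 28,
    "In circles": 4,
    "Bite wherever": 5,
    "I don't eat corn on the cob": 2,
    "Cut the corn off the cob": 2,
}

angles = central_angles(corn_counts)
for category, angle in angles.items():
    print(f"{category}: {angle:.1f} degrees")
```

A quick sanity check is that the angles add to 360 degrees, which is only possible when the categories make up one whole.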

Example: Draw a well-labeled bar graph and a well-labeled segmented bar graph of the corn data from the previous example.


Bar graphs can be used in more situations than pie charts and segmented bar graphs! Pie charts and segmented bar graphs can only be used when the data include all parts of a single whole!

o Bar graphs can compare proportions of different groups who share some trait. For example, what proportions of sophomores, juniors, and seniors approve of Bingham’s parking policy? A pie chart or segmented bar graph couldn’t show this, because these proportions are not parts of the same whole.

o Bar graphs can compare proportions in cases where individuals might fall into multiple categories. For example, what percent of students like pizza, what percent like spaghetti, and what percent like pancakes? Students could easily fall into multiple categories, so the percentages would add up to more than 100%. This data couldn’t be displayed on a pie chart or segmented bar graph, but could still be displayed on a bar graph.

o Bar graphs can be used in cases where information is missing. For example, we might know what category some of the individuals fall into, but not others. To display this kind of data in a pie chart or segmented bar graph, it would be necessary to add an “other” category.

Deceptive Graphs:

• Watch out for graphs in which the width changes in addition to the height. The eye responds to area, so this makes the graph misleading. This happens a lot in pictographs.

• Watch out for graphs where the axes don’t start at zero (and/or are missing).


[Figure: 3D pie chart titled “Perception of 3D Pie Charts” with slices labeled Cool, Confusing, Misleading, and Unreadable; slice labels read 30%, 20%, 20%, and 30%.]

• Watch out for unequally-spaced intervals.

• Watch out for pie charts or segmented bar graphs where the percentages don’t add to 100%. This is a tip-off that they don’t represent all the parts of a single whole.

• Watch out for 3D graphs or graphs set at an angle. This distorts the data.


A two-way table (or contingency table) summarizes the relationship between two categorical variables for some group of individuals. The rows represent values of one variable and the columns represent values of the other variable.

A marginal relative frequency gives the percent or proportion of individuals that have a specific value for one categorical variable (ignoring the information about the other variable). It is calculated using the information in a margin of the table and dividing by the overall total number of individuals.

A marginal distribution gives the marginal relative frequencies for each of the values of a categorical variable.

Example: AP Statistics students were categorized according to their gender and how they like their bacon cooked. The results are given below. Calculate the marginal distribution of bacon preferences. Draw a graph of the results. Describe what you see.

We can also answer questions involving both categorical variables. A joint relative frequency is an “and” relative frequency. It gives the proportion of individuals that fall in a specific category of one variable and a specific category of another variable. Joint relative frequencies are proportions of the overall total.

Example: What proportion of the students in the sample are males and like their bacon extra crispy?

Example: What percent of students in the sample are females who don’t eat bacon?

To examine the relationships between variables, we need to calculate some well-chosen proportions from the counts in the table. A conditional relative frequency gives the proportion of individuals with a specific value of one categorical variable among individuals who share a specific value of another categorical variable (the condition).

Example: What percent of the females in the sample like their bacon a little limp?

Example: What proportion of the people who like their bacon crispy are male?

Question: Is either of the above conditional relative frequencies misleading? Why?

Bacon Preference | Female | Male | Total
A Little Limp | 6 | 4 | 10
Crispy | 8 | 8 | 16
Extra Crispy | 4 | 3 | 7
Don’t Eat Bacon | 7 | 1 | 8
Total | 25 | 16 | 41

(The Female and Male columns are the values of the variable Gender.)
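The three kinds of relative frequency differ only in what gets divided by what. The sketch below works one of each from the bacon table; the nested-dictionary layout and variable names are assumptions made for illustration.

```python
# Counts from the two-way table: bacon preference by gender.
table = {
    "A Little Limp": {"Female": 6, "Male": 4},
    "Crispy": {"Female": 8, "Male": 8},
    "Extra Crispy": {"Female": 4, "Male": 3},
    "Don't Eat Bacon": {"Female": 7, "Male": 1},
}

grand_total = sum(sum(row.values()) for row in table.values())

# Marginal: row total divided by the overall total.
marginal_bacon = {
    pref: sum(row.values()) / grand_total for pref, row in table.items()
}

# Joint ("and"): one cell divided by the overall total.
joint_male_extra_crispy = table["Extra Crispy"]["Male"] / grand_total

# Conditional: one cell divided by the total for the condition (females here).
female_total = sum(row["Female"] for row in table.values())
cond_limp_given_female = table["A Little Limp"]["Female"] / female_total

print(marginal_bacon)
print(f"Male and extra crispy: {joint_male_extra_crispy:.3f}")
print(f"A little limp, among females: {cond_limp_given_female:.3f}")
```

Note that the denominator changes from 41 (the overall total) to 25 (the number of females) once a condition is imposed.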


A conditional distribution gives the conditional relative frequencies for each of the values of a categorical variable among individuals with a specific value of another categorical variable.

Example: Using the data above, calculate the conditional distribution of bacon preference for each gender. (This means figure out what proportion of girls like their bacon each way and what proportion of boys like their bacon each way.)

To compare the conditional distributions of a categorical variable, we use side-by-side bar graphs (or comparative bar graphs). These display the distribution of a categorical variable for each value of another categorical variable. The bars are grouped together based on the values of one of the categorical variables and multiple distributions are placed side by side. Color-coding or keys are often used.

There is an association (or relationship) between two variables if knowing the value of one variable helps us predict the value of the other. If knowing the value of one variable does not help us predict the value of the other, then there is no association between the variables.

• If the distribution of one variable is very different for different values of the other variable, then there is an association between the variables.

• If the distribution of one variable is very similar for different values of the other variable, then there isn’t an association between the variables.

Do not use the word correlation when you mean association. Correlation has a very specific meaning in statistics, which we will talk about later in the year.

Example: Draw a side-by-side bar graph comparing the bacon preferences of males and females. Use relative frequencies for the vertical axis. Then draw a segmented bar graph for each gender. Describe what you see. Does there appear to be an association between gender and bacon preference? Explain.
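The heights needed for a side-by-side bar graph are the conditional distributions of bacon preference within each gender. A minimal sketch follows, assuming the counts from the two-way table are stored in a nested dictionary; the function name `conditional_distribution` is hypothetical.

```python
table = {
    "A Little Limp": {"Female": 6, "Male": 4},
    "Crispy": {"Female": 8, "Male": 8},
    "Extra Crispy": {"Female": 4, "Male": 3},
    "Don't Eat Bacon": {"Female": 7, "Male": 1},
}

def conditional_distribution(table, gender):
    """Proportion of each bacon preference among students of one gender."""
    column_total = sum(row[gender] for row in table.values())
    return {pref: row[gender] / column_total for pref, row in table.items()}

female_dist = conditional_distribution(table, "Female")
male_dist = conditional_distribution(table, "Male")

# Print the two distributions side by side, one row per preference.
for pref in table:
    print(f"{pref:16s} Female: {female_dist[pref]:.3f}   Male: {male_dist[pref]:.3f}")
```

Comparing the two columns row by row is one way to judge whether the distributions look similar or different before drawing the graph.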


Displaying Quantitative Data with Graphs

One of the most common parts of a statistical problem is finding an appropriate way to display data. Quantitative data can’t be displayed the same way as categorical data (bar graphs and pie charts don’t work). The most common ways to display quantitative data are dotplots, stemplots, histograms, and boxplots.

How to Examine the Distribution of a Quantitative Variable

• Describe the overall pattern of a distribution by describing its shape, center, and variation.

• Point out any outliers (unusually small or unusually large data values).

• Always put your descriptions in context!

Describing Shape:

• How many peaks does the distribution have? Don’t count minor ups and downs, only major peaks. Ask yourself if there are distinct groups of individuals visible in the graph.

o Unimodal: One peak (group).

o Bimodal: Two peaks (groups).

o Multimodal: Three or more peaks (groups).

• If there are any major gaps between groups, describe their locations.

• Is the distribution approximately symmetric or skewed?

o If the right and left sides of the graph are close to mirror images of each other, describe the distribution as “approximately symmetric.” Always use the words “approximately” or “roughly”, because in real life, distributions of data are almost never perfectly symmetric.

o If the right side of the graph is much longer than the left side (tail to the right), describe the distribution as “skewed to the right” or “skewed to positive values” or “positively skewed.”

o If the left side of the graph is much longer than the right side (tail to the left), describe the distribution as “skewed to the left” or “skewed to negative values” or “negatively skewed.”

Describing Center: Use the median (middle value) or the mean (average).

Describing Variation: Use the range, interquartile range, or standard deviation, or say something like, “The [values in context] vary from a low of _____ to a high of _____.”


Dotplots:

1. Draw a horizontal line, label it with the name of the quantitative variable and the units of measurement, and place tick marks at equal intervals.

2. Locate each value in the data set along the measurement scale and represent it by a dot above the line. If there are two or more observations with the same value, stack the dots vertically. Try to make all the dots the same size and space them out equally as you stack them.

To compare two distributions, stack the dotplots on top of each other, using the same scales. Make sure to label the two groups being compared.

Example: Below is a dotplot of the hair lengths of 41 AP Statistics students. Describe the distribution of hair length.

Here are parallel dotplots showing the hair lengths of the students sorted by gender. Compare the distributions of hair length for the male and female students.


Stemplots (or Stem-and-Leaf Plots):

Each number in the data set is broken into two pieces—a stem and a leaf. The stem is the first part of the number and consists of the beginning digits. The leaf is the last part of the number and consists of the final digit(s).

1. Choose stems (one or more of the leading digits) that divide the data into a reasonable number of groups (at least 5, but not too many). List possible stem values (not just those that actually appear in the data set—don’t skip stems) in a vertical column. Draw a vertical line to the right of the stems.

2. The next digit(s) after the stem become(s) the leaf. List the leaf for every observation to the right of the corresponding stem.

3. Include a key explaining what the stems and leaves represent, e.g., “2 | 5 represents 2.5 seconds”.

It is common to round and/or truncate (leave off) the remaining digits. For example, in a stemplot of annual salary, we might represent $35,360 as 35 | 3, 35 | 4, or as 3 | 5, depending on our data set.

If necessary, consider using split stems. Write each stem more than once, and assign the lower group of leaves to the first stem and the higher group of leaves to the next. For example, put the leaves 0-4 with the first stem and the leaves 5-9 with the second. If you do this, be sure that each stem is assigned an equal number of possible leaf digits (two stems, with five possible leaves each; or five stems, with two possible leaves each).

To compare two groups, make a back-to-back stemplot. Use the same set of stems and write the leaves for one group to the right and for the other group to the left. Be sure to label each side to indicate which group is being represented.

Example: The data below show the number of pairs of shoes owned by male and female AP Statistics students. Make a back-to-back stemplot of the data using split stems. Comment on the main differences between the two data sets.

Female: 6 7 7 8 9 9 10 10 10 10 11 12 15 15 20 20 20 20 21 22 25 30 30 64 87

Male: 3 4 4 4 4 4 5 6 6 7 8 8 10 12 12 12
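The split-stem idea can be sketched in code for one group at a time. The function `split_stemplot` below is a hypothetical helper that assumes whole-number data, uses the tens digit as the stem, and splits each stem into leaves 0-4 and 5-9; it prints a one-sided stemplot for the male data rather than the full back-to-back plot.

```python
from collections import defaultdict

def split_stemplot(values):
    """Return stemplot rows, with each tens-digit stem split into leaves 0-4 and 5-9."""
    rows = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        half = 0 if leaf < 5 else 1  # 0 = leaves 0-4, 1 = leaves 5-9
        rows[(stem, half)].append(leaf)

    lines = []
    stem, half = min(rows)
    last = max(rows)
    # Walk through every stem in order so that empty stems are not skipped.
    while (stem, half) <= last:
        leaves = "".join(str(d) for d in rows.get((stem, half), []))
        lines.append(f"{stem} | {leaves}")
        stem, half = (stem, 1) if half == 0 else (stem + 1, 0)
    return lines

male_shoes = [3, 4, 4, 4, 4, 4, 5, 6, 6, 7, 8, 8, 10, 12, 12, 12]
male_lines = split_stemplot(male_shoes)
print("\n".join(male_lines))
print("Key: 1 | 2 represents 12 pairs of shoes")
```

Running the same function on the female data produces several empty stems between 30 and 64, which is how gaps show up in a stemplot.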


Histograms:

1. Divide the range of the data into intervals of equal width. The intervals are called “bins.” The low value in each bin is included in the bin, but the high value is not. For example, the bins might be 0 to < 3, 3 to < 6, 6 to < 9, etc.

If the data are discrete (the observations take only whole number values) and are tightly packed, the bins are usually centered at the integer values with a width of one unit, so the rectangle for 1 is centered at 1 (0.5 to < 1.5), the rectangle for 2 is centered at 2 (1.5 to < 2.5), etc.

There are no set-in-stone rules for how many bins to use (5 to 10 is a common number), but it may be a good idea to see what the graph looks like with different width bins. It can change quite a bit!

2. Find the frequency (count) or relative frequency (proportion) of individuals in each interval. Put values that fall on a boundary in the interval containing larger values.

3. Label and scale your axes. Place equally spaced tick marks at the boundaries of each interval along the horizontal axis (or in the middle of each interval if the data are discrete). Use either frequency (count) or relative frequency (proportion) on the vertical axis.

4. Draw a rectangle for each interval. Make the bars equal width and leave no gaps between them. The height should correspond to the frequency or relative frequency of individuals in that interval.

Histograms and bar graphs are different!

o Bar graphs are used for categorical data. Histograms are used for quantitative data.

o The bars in bar graphs can be rearranged because the order of the categories shouldn’t matter. The bars in histograms can’t be rearranged because intervals must be in numerical order.

o The bars in bar graphs are generally unconnected. The bars in histograms are connected.

Example: The following data give the average points scored per game (PTSG) for the 30 NBA teams in the 2017-2018 regular season. Draw two relative frequency histograms using different bin widths. Describe the distribution.

98.8 99.3 102.3 102.7 102.9 103.4 103.4 103.4 103.8 103.9

104.0 104.1 104.5 105.6 105.6 106.5 106.6 106.6 107.9 108.1

108.2 109.0 109.5 109.8 110.0 110.9 111.7 111.7 112.4 113.5
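To see how bin width changes the picture, the bin counts can be tallied for two different widths. This is a sketch only: the helper `bin_counts`, the starting value of 98, and the widths of 2 and 4 points are choices made for illustration, not the only reasonable ones.

```python
import math

ppg = [
    98.8, 99.3, 102.3, 102.7, 102.9, 103.4, 103.4, 103.4, 103.8, 103.9,
    104.0, 104.1, 104.5, 105.6, 105.6, 106.5, 106.6, 106.6, 107.9, 108.1,
    108.2, 109.0, 109.5, 109.8, 110.0, 110.9, 111.7, 111.7, 112.4, 113.5,
]

def bin_counts(values, start, width, n_bins):
    """Count values in bins [start, start + width), [start + width, ...), and so on.

    A value on a boundary goes in the bin containing larger values.
    Assumes every value lies between start and start + width * n_bins.
    """
    counts = [0] * n_bins
    for v in values:
        counts[math.floor((v - start) / width)] += 1
    return counts

counts_width_2 = bin_counts(ppg, start=98, width=2, n_bins=8)
counts_width_4 = bin_counts(ppg, start=98, width=4, n_bins=4)

for label, counts in [("width 2", counts_width_2), ("width 4", counts_width_4)]:
    relative = [round(c / len(ppg), 3) for c in counts]
    print(label, counts, relative)
```

With the narrower bins an empty interval appears between 100 and 102, which the wider bins hide.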


Describing Quantitative Data with Numbers

Population: The entire collection of individuals or objects that you want to learn about.

Sample: A part of the population that is selected for study.

Resistant Measure: A measure that is not influenced very much by strong skewness or extreme values.

Measures of Center: The most common measures of center are the mean and the median.

Mean: The sum of the values divided by the number of observations.

• If the n observations in a sample are $x_1, x_2, \ldots, x_n$, the mean is

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum x_i}{n}$$

• The mean can be thought of as the “average” value, the “fair share” value, or the “balance point” of a distribution.

• The mean is not a resistant measure. It is very sensitive to outliers and skewness.

The mean of a sample is abbreviated $\bar{x}$ (pronounced “x-bar”) and the mean of a population is abbreviated $\mu_x$ (the Greek letter mu, pronounced “myoo”). They are both calculated the same way. The distinction will be important later in the year. If the problem doesn’t specify whether the data represent a population or a sample, assume you are dealing with a sample and use $\bar{x}$.

Median (M): The midpoint of a distribution. Half of the observations are smaller than the median and half of the values are larger than the median. To find the median:

1. Put the n observations in order from smallest to largest.

2. If the number of observations, n, is odd, the median is the middle observation of the ordered list.

3. If the number of observations, n, is even, the median is the average (mean) of the two middle observations in the ordered list.
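The three steps for finding the median translate almost directly into code. The function name `median_by_hand` and the two small data sets are made up for this sketch.

```python
def median_by_hand(observations):
    """Find the median by following the three steps above."""
    ordered = sorted(observations)      # Step 1: order from smallest to largest
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:                      # Step 2: odd n, take the middle observation
        return ordered[middle]
    # Step 3: even n, average the two middle observations
    return (ordered[middle - 1] + ordered[middle]) / 2

odd_example = [7, 3, 9]        # hypothetical data with an odd count
even_example = [4, 1, 8, 6]    # hypothetical data with an even count

print(median_by_hand(odd_example))
print(median_by_hand(even_example))
```

For the even-sized list the two middle values are 4 and 6, so the median falls between them and need not be an actual data value.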

• The median can be thought of as the “typical” value of a variable.

• The median is a resistant measure. It is not changed greatly by strong skewness or outliers.

Comparing the Mean and the Median: The mean and median of a roughly symmetric distribution are close together. If the distribution is exactly symmetric, they are equal. However, outliers and other extreme values drag the mean toward them without having much effect on the median. As a result, in skewed distributions, the mean will be farther out in the long tail than the median.


Example: Here are the amounts of fat (in grams) in McDonald’s beef sandwiches. Make a stemplot of the distribution and comment on its shape. Then calculate the mean and the median amount of fat.

Sandwich | Fat (g) | Sandwich | Fat (g)
Hamburger | 9 | Big N’ Tasty | 24
Cheeseburger | 12 | Big N’ Tasty with Cheese | 28
Double Cheeseburger | 23 | McRib | 26
McDouble | 19 | Mac Snack Wrap | 19
Quarter Pounder | 19 | Angus Bacon & Cheese | 39
Quarter Pounder with Cheese | 26 | Angus Deluxe | 39
Double Quarter Pounder with Cheese | 42 | Angus Mushroom & Swiss | 40
Big Mac | 29 | |

Example: Forty students were enrolled in a statistical reasoning course at a California college. The instructor made course materials, grades, and lecture notes available to students on a class web site, and course management software kept track of how often each student accessed any of these web pages. One month after the course began, the instructor requested a report of how many times each student had accessed a class web page. The 40 observations are below. Wasn’t it nice of me to put them in order?

0 0 0 0 0 0 3 4 4 4 5 5 7 7 8 8 8 12 12 13 13 13 14 14 16 18 19 19 20 20 21 22 23 26 36 36 37 42 84 331 (not a typo)

Here is a dotplot of the data. Describe the distribution. Based on the graph, do you expect the mean or the median to be higher? Calculate the mean and the median to see if you were right. Which measure would be the best choice to describe center in this situation?

[Figure: dotplot of Number of Visits to Class Website, horizontal axis marked from 0 to 300 in steps of 50.]
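For the class website visit counts, the effect of the extreme value 331 on the mean can be checked numerically. The sketch below uses Python's standard `statistics` module; the variable names are assumptions for the example.

```python
import statistics

visits = [
    0, 0, 0, 0, 0, 0, 3, 4, 4, 4, 5, 5, 7, 7, 8, 8, 8, 12, 12, 13,
    13, 13, 14, 14, 16, 18, 19, 19, 20, 20, 21, 22, 23, 26, 36, 36, 37, 42, 84, 331,
]

mean_visits = statistics.mean(visits)
median_visits = statistics.median(visits)
print(f"mean = {mean_visits:.1f}, median = {median_visits}")

# Recompute without the largest value to see which measure is resistant.
without_max = visits[:-1]
print(f"without 331: mean = {statistics.mean(without_max):.1f}, "
      f"median = {statistics.median(without_max)}")
```

Dropping the single largest observation moves the mean by several visits while the median stays put, which is what "resistant" means in practice.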


Measures of Variability: Numbers that describe how spread out the data are. The most common are the range, the interquartile range, and the standard deviation.

Range: The difference between the maximum and minimum values.

Standard Deviation: The most common measure of spread is the standard deviation. It measures the “typical” or “average” distance of the observations from the mean.

Example: Each of these distributions has a mean of 5. Rank the standard deviations from lowest to highest. Explain your answer.

The formula for standard deviation is slightly different depending on whether you have all the data for the entire population or are dealing with a sample from the population.

For a Sample:

If the n observations in a sample are $x_1, x_2, \ldots, x_n$ and the mean is $\bar{x}$, the standard deviation is given by:

$$s_x = \sqrt{\frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1}} = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$$

The sample standard deviation is abbreviated $s_x$.

Variance: The square of the standard deviation is called the variance, abbreviated $s_x^2$.

For the Population:

The standard deviation of a population of size N with mean $\mu$ and observations $x_1, x_2, \ldots, x_N$ is given by:

$$\sigma_x = \sqrt{\frac{(x_1 - \mu)^2 + (x_2 - \mu)^2 + \cdots + (x_N - \mu)^2}{N}} = \sqrt{\frac{\sum (x_i - \mu)^2}{N}}$$

The population standard deviation is abbreviated $\sigma_x$ (the Greek letter sigma). The population variance is abbreviated $\sigma_x^2$.

The reason that we divide by $n - 1$ in a sample is complicated. We’ll discuss it later in the year.

Always use $s_x$ rather than $\sigma_x$ unless you know that the data represent the entire population, which is rare!
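To see how much the choice of divisor matters, the two formulas can be compared on the same numbers. The data set below is hypothetical, and `statistics.stdev` (divides by n − 1) and `statistics.pstdev` (divides by N) are the standard-library functions for the sample and population versions.

```python
import statistics

# Hypothetical data set of 8 observations with mean 5.
data = [2, 4, 4, 4, 5, 5, 7, 9]

sample_sd = statistics.stdev(data)        # divides the squared deviations by n - 1
population_sd = statistics.pstdev(data)   # divides the squared deviations by N

print(f"s_x     = {sample_sd:.3f}")
print(f"sigma_x = {population_sd:.3f}")
```

Because n − 1 is smaller than N, the sample standard deviation always comes out at least as large as the population version for the same data, and the gap shrinks as the number of observations grows.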

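[Figure: three dotplots for the example on ranking standard deviations, each with a horizontal axis marked from 0 to 10 in steps of 2.]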


Calculating the standard deviation by hand:

1. Calculate the mean, $\bar{x}$.

2. Find the distance of each observation from the mean (the deviations).

3. Square each of these distances to eliminate negative numbers.

4. “Average” the squared distances by adding them together and dividing by $n - 1$. This gives the variance, $s_x^2$.

5. Take the square root of the variance to get the standard deviation, $s_x$.

6. Interpret your result. The standard deviation is the “average” or typical distance of the observations from the mean.

Example: The table below shows the sugar content in several types of candy bar. Find the mean and standard deviation of the data. Interpret your result in context.

Properties of the Standard Deviation

• The standard deviation measures variation around the mean. It should only be used when the mean is chosen as the measure of center.

• The standard deviation is always greater than or equal to zero. If there is no variability (all observations have the same value), the standard deviation is zero. Larger standard deviations indicate greater variation from the mean.

• The standard deviation has the same units of measurement as the original observations. This is one reason we usually interpret the standard deviation and not the variance.

• The standard deviation is not a resistant measure. A few outliers can change its value dramatically.

Candy Bar | Sugar (grams), $x_i$ | Deviations, $x_i - \bar{x}$ | Squared Deviations, $(x_i - \bar{x})^2$
Hershey’s Milk Chocolate | 31 | |
Kit Kat | 22 | |
York Peppermint Pattie | 25 | |
Reese’s Peanut Butter Cups | 25 | |
Snickers | 30 | |
Milky Way | 35 | |
Twix | 27 | |
3 Musketeers | 40 | |
Mr. Goodbar | 22 | |
Baby Ruth | 33 | |
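The by-hand steps for the standard deviation can be mirrored line by line in code, which is a handy way to check the table columns. This sketch uses the candy bar sugar data and treats it as a sample (dividing by n − 1); the variable names are assumptions for the example.

```python
import math

sugar_grams = {
    "Hershey's Milk Chocolate": 31, "Kit Kat": 22, "York Peppermint Pattie": 25,
    "Reese's Peanut Butter Cups": 25, "Snickers": 30, "Milky Way": 35,
    "Twix": 27, "3 Musketeers": 40, "Mr. Goodbar": 22, "Baby Ruth": 33,
}

values = list(sugar_grams.values())
n = len(values)

mean = sum(values) / n                          # Step 1: the mean
deviations = [x - mean for x in values]         # Step 2: deviations from the mean
squared = [d ** 2 for d in deviations]          # Step 3: squared deviations
variance = sum(squared) / (n - 1)               # Step 4: divide the total by n - 1
std_dev = math.sqrt(variance)                   # Step 5: square root of the variance

for (name, x), d, sq in zip(sugar_grams.items(), deviations, squared):
    print(f"{name:28s} {x:3d} {d:6.1f} {sq:7.1f}")
print(f"mean = {mean:.1f} g, variance = {variance:.2f}, s_x = {std_dev:.2f} g")
```

Step 6, the interpretation in context, still has to be written in words: the result is the typical distance of a candy bar's sugar content from the mean.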


Interquartile Range (IQR): First, calculate the quartiles:

1. Arrange the data in increasing order and locate the median, M. (The median is sometimes called the second quartile, or Q2).

2. The first quartile (Q1) is the median of all the observations lower than the median.

3. The third quartile (Q3) is the median of all the observations higher than the median.

The interquartile range is calculated as follows: IQR = Q3 – Q1

The IQR is the range of the middle 50% of the data.

The range and interquartile range are numbers! Don’t say “The range is 5 to 30.” In that case, the range would be 25.

The IQR is not a location! It doesn’t make sense to say an observation is “in the IQR”.

1.5 × IQR Rule for Outliers: An observation is an outlier if it falls more than 1.5 × IQR above the third quartile or below the first quartile.

Always check for outliers and examine them closely! They may be errors, or they may tell you something important about your data that you need to pay attention to. Don’t ignore them.

Boxplots (or Box and Whisker Plots):

1. Find the Five-Number Summary: Minimum Q1 M Q3 Maximum

2. Check for outliers. You must always show this step.

• Calculate the IQR.

• Find $Q_1 - (1.5 \times IQR)$ and $Q_3 + (1.5 \times IQR)$.

• If you have any data points outside these thresholds, they are outliers.

3. Draw the boxplot:

• Draw a central box from Q1 to Q3.

• Draw a vertical line in the box to mark the median.

• Draw the “whiskers”: lines extending from the box out to the smallest and largest observations that are not outliers.

• Mark outliers with dots in the appropriate locations.

Each section of a boxplot contains about 25% of the data.

• The lower quartile is higher than about 25% of the data.

• The median (or second quartile) is higher than about 50% of the data.

• The upper quartile is higher than about 75% of the data.

Boxplots are useful for comparing the center and spread of distributions, but you have to be careful with them. They can mask important information about the shape of a distribution. For instance, you can’t tell from a boxplot if a distribution has multiple peaks or gaps.


Example: The data below show the number of text messages sent by a random sample of students in a day. Draw parallel boxplots of the number of texts sent for male and female students. You must show how you determined whether there are outliers. Compare the distributions. What conclusions can you draw about the texting habits of males and females?

Male 3 6 6 10 10 12 17 24 25 40 45 87 111

Female 7 11 20 26 38 52 59 79 90 156
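The five-number summary and the 1.5 × IQR check can be sketched as follows for the texting data. The helper names are hypothetical, and the quartiles follow the rule in these notes (the median of the observations below or above the median), which may differ from the default quartile method in some software.

```python
def median(ordered):
    """Median of a list that is already sorted."""
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 == 1 else (ordered[mid - 1] + ordered[mid]) / 2

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]                                        # below the median
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]  # above the median
    return ordered[0], median(lower), median(ordered), median(upper), ordered[-1]

def outliers(values):
    """Observations beyond Q1 - 1.5*IQR or Q3 + 1.5*IQR."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in sorted(values) if v < low_fence or v > high_fence]

male_texts = [3, 6, 6, 10, 10, 12, 17, 24, 25, 40, 45, 87, 111]
female_texts = [7, 11, 20, 26, 38, 52, 59, 79, 90, 156]

for label, data in [("Male", male_texts), ("Female", female_texts)]:
    print(label, five_number_summary(data), "outliers:", outliers(data))
```

When drawing the boxplot, each whisker stops at the most extreme observation that is not flagged as an outlier, and flagged values are plotted as separate dots.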

Choosing Measures of Center and Spread:

• Use the median and IQR for describing a skewed distribution or a distribution with strong outliers.

• Use the mean and standard deviation for describing reasonably symmetric distributions without outliers.

ALWAYS GRAPH YOUR DATA! Numerical measures of center and spread report specific facts about a distribution, but don’t give information about its entire shape. You may miss something important if you don’t graph the data.
