Chapter One: Exploring Data

Section 1.1: Analyzing Categorical Data

Upload: brady-miller

Post on 01-Jan-2016

DESCRIPTION

Chapter One: Exploring Data. Section 1.1: Analyzing Categorical Data. Distribution of a categorical variable. Frequency table. Relative frequency table.

TRANSCRIPT

Page 1: Chapter One exploring data

CHAPTER ONE: EXPLORING DATA

SECTION 1.1: ANALYZING CATEGORICAL DATA

Page 2: Chapter One exploring data

DISTRIBUTION OF A CATEGORICAL VARIABLE

FREQUENCY TABLE

Format                    Count of Stations
Adult Contemporary        1,556
Adult Standards           1,196
Contemporary Hit          569
Country                   2,066
News/Talk/Information     2,179
Oldies                    1,060
Religious                 2,014
Rock                      869
Spanish Language          750
Other Formats             1,579
Total                     13,838

RELATIVE FREQUENCY TABLE

Format                    Percent of Stations
Adult Contemporary        11.2
Adult Standards           8.6
Contemporary Hit          4.1
Country                   14.9
News/Talk/Information     15.7
Oldies                    7.7
Religious                 14.6
Rock                      6.3
Spanish Language          5.4
Other Formats             11.4
Total                     99.9

Page 3: Chapter One exploring data

In this case, the individuals are the radio stations and the variable being measured is the kind of programming that each station broadcasts. The table on the left, which we call a frequency table, displays the counts of stations in each format category. On the right, we see a relative frequency table of the data that shows the percents of stations in each format category.
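As a quick illustration, here is a minimal Python sketch of how the relative frequency table could be built from the counts. The variable names (`counts`, `relative`) are our own and are not part of the original slides.

```python
# Station counts copied from the frequency table.
counts = {
    "Adult Contemporary": 1556,
    "Adult Standards": 1196,
    "Contemporary Hit": 569,
    "Country": 2066,
    "News/Talk/Information": 2179,
    "Oldies": 1060,
    "Religious": 2014,
    "Rock": 869,
    "Spanish Language": 750,
    "Other Formats": 1579,
}

total = sum(counts.values())  # total number of stations

# Relative frequency = category count / total count, written as a percent
# and rounded to the nearest tenth, as in the slides.
relative = {fmt: round(100 * n / total, 1) for fmt, n in counts.items()}

# Print count and percent side by side for each format.
for fmt, pct in relative.items():
    print(f"{fmt:<24}{counts[fmt]:>7,}{pct:>7.1f}%")
```

Each percent is simply the count for that format divided by the total number of stations.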

Page 5: Chapter One exploring data

It’s a good idea to check data for consistency. The counts should add to 13,838, the total number of stations. They do. The percents should add to 100%. In fact, they add to 99.9%. What happened? Each percent is rounded to the nearest tenth. The exact percents would add to 100, but the rounded percents only come close. This is roundoff error. Roundoff errors don’t point to mistakes in our work, just to the effect of rounding off results.
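The consistency check described here can be sketched in a few lines of Python. This is only an illustration under our own naming (`exact`, `rounded`); it compares the sum of the exact percents with the sum of the percents after each one is rounded to the nearest tenth.

```python
# Station counts in the same order as the frequency table.
counts = [1556, 1196, 569, 2066, 2179, 1060, 2014, 869, 750, 1579]
total = sum(counts)

# Exact percents, and the same percents rounded to the nearest tenth.
exact = [100 * n / total for n in counts]
rounded = [round(p, 1) for p in exact]

exact_sum = sum(exact)
# Round the final sum too, so floating-point noise does not hide the result.
rounded_sum = round(sum(rounded), 1)

print(f"sum of exact percents:   {exact_sum:.1f}")
print(f"sum of rounded percents: {rounded_sum:.1f}")
```

The gap between the two sums comes entirely from rounding each entry before adding.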

Page 6: Chapter One exploring data

PIE CHARTS

[Pie chart of radio station formats, with slices labeled Adult Contemporary, Adult Standards, Contemporary Hit, Country, News/Talk, Oldies, Religious, Rock, Spanish, and Other]

Page 9: Chapter One exploring data

• Pie charts are best for emphasizing each category’s relation to the whole.

• Bar graphs are also called bar charts.

• Bar graphs are also more flexible than pie charts. Both graphs can display the distribution of a categorical variable, but a bar graph can also compare any set of quantities that are measured in the same units.

Page 10: Chapter One exploring data

BAR CHARTS

[Bar graph titled "Radio Station Formats", with one bar for each format: Adult Contemporary, Adult Standards, Contemporary Hit, Country, News/Talk, Oldies, Religious, Rock, Spanish, and Other]

Page 11: Chapter One exploring data

If I were to give you a list of several age groups and the percent of people in each age group who own an iPod, which do you think would be better for displaying the data: a pie chart or a bar graph?

Page 12: Chapter One exploring data

A bar graph, because the percents will not add up to a whole. Each age group is a separate set of data, and we are comparing those groups.

Page 13: Chapter One exploring data

Bar graphs can be misleading in two ways:

1. If the bars do not all have the same width, the proportions will be misleading.

2. If the vertical scale does not start at zero, the proportions seen when comparing bars can also be misleading.
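To see how much a vertical scale that does not start at zero can distort a comparison, here is a small Python sketch. The function `apparent_ratio` and the baseline of 14 are hypothetical choices made only for illustration; the two percents come from the radio station table.

```python
def apparent_ratio(a, b, baseline=0.0):
    """Ratio of the drawn bar heights for values a and b
    when the vertical axis starts at `baseline`."""
    return (a - baseline) / (b - baseline)

# Percents of stations from the relative frequency table.
news_talk, country = 15.7, 14.9

# Axis starting at zero: bar heights are proportional to the values.
honest = apparent_ratio(news_talk, country)

# Axis starting at 14 (a hypothetical truncated scale): heights exaggerate the gap.
truncated = apparent_ratio(news_talk, country, baseline=14.0)

print(f"axis from 0:  News/Talk bar is {honest:.2f} times as tall as Country")
print(f"axis from 14: News/Talk bar is {truncated:.2f} times as tall as Country")
```

With the truncated scale, the News/Talk bar looks almost twice as tall as the Country bar even though the two percents differ by less than one point.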

Page 14: Chapter One exploring data

WHAT HAPPENS WHEN WE HAVE TWO CATEGORICAL VARIABLES?

A sample of 200 children was asked which superpower they would most like to have, and each child’s gender was also recorded. Let’s look at the results.

Page 15: Chapter One exploring data

Superpower       Female   Male   Total
Invisibility     17       13     30
Superstrength    3        17     20
Telepathy        39       5      44
Fly              36       18     54
Freeze Time      20       32     52
Total            115      85     200

This is a two-way table because it describes two categorical variables, gender and superpower preference. Superpower is the row variable because each row in the table describes a different superpower the kids chose. Gender is the column variable. The entries in the table are the counts of individuals in each preference-by-gender class.

Page 19: Chapter One exploring data

The distributions of preference alone and gender alone are called marginal distributions because they appear at the right and bottom margins of the two-way table.

The marginal distribution of one of the categorical variables in a two-way table of counts is the distribution of values of that variable among all individuals described by the table.

Now, if we want to display the marginal distribution as percents, we divide each row total by the table total. For example, for Invisibility:

row total / table total = 30 / 200 = 0.15 = 15%

Now let’s convert the whole marginal distribution into percents.

Page 21: Chapter One exploring data

Superpower       Percent of Total
Invisibility     15%
Superstrength    10%
Telepathy        22%
Fly              27%
Freeze Time      26%
Total            100%
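The marginal distribution above can be reproduced with a short Python sketch. The nested dictionary layout and the names `row_totals`, `column_totals`, and `marginal` are our own assumptions about how one might store the two-way table.

```python
# Two-way table of counts: superpower (rows) by gender (columns).
table = {
    "Invisibility":  {"Female": 17, "Male": 13},
    "Superstrength": {"Female": 3,  "Male": 17},
    "Telepathy":     {"Female": 39, "Male": 5},
    "Fly":           {"Female": 36, "Male": 18},
    "Freeze Time":   {"Female": 20, "Male": 32},
}

# Row totals sit in the right margin; column totals sit in the bottom margin.
row_totals = {power: sum(by_gender.values()) for power, by_gender in table.items()}
column_totals = {
    gender: sum(row[gender] for row in table.values())
    for gender in ("Female", "Male")
}
table_total = sum(row_totals.values())

# Marginal distribution of superpower: row total / table total, as a percent.
marginal = {power: round(100 * t / table_total) for power, t in row_totals.items()}

for power, pct in marginal.items():
    print(f"{power:<14}{pct:>4}%")
```

The same division applied to `column_totals` would give the marginal distribution of gender.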

Page 23: Chapter One exploring data

Now, if we were to convert all the counts in the Female column to percents of that column’s total, we would have the conditional distribution of preference among girls.

A conditional distribution of a variable describes the values of that variable among individuals who have a specific value of another variable. There is a separate conditional distribution for each value of the other variable.
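As a sketch of the definition, the conditional distribution of preference among girls can be computed by dividing each count in the Female column by the Female column total rather than by the table total. The names `female_counts` and `conditional_female` are illustrative only.

```python
# Counts from the Female column of the two-way table.
female_counts = {
    "Invisibility": 17,
    "Superstrength": 3,
    "Telepathy": 39,
    "Fly": 36,
    "Freeze Time": 20,
}

female_total = sum(female_counts.values())  # number of girls in the sample

# Condition on gender = Female: divide by the column total, not the table total.
conditional_female = {
    power: round(100 * n / female_total) for power, n in female_counts.items()
}

for power, pct in conditional_female.items():
    print(f"{power:<14}{pct:>4}%")
```

Repeating the same steps with the Male column would give the second conditional distribution.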

Page 28: Chapter One exploring data

ORGANIZING A STATISTICAL PROBLEM

Although no single strategy will work on every problem, here is a four-step process that can be helpful to follow:

State: What’s the question that you’re trying to answer?

Plan: How will you go about answering the question? What statistical techniques does this problem call for?

Do: Make graphs and carry out needed calculations.

Conclude: Give your practical conclusion in the setting of the real-world problem.

Page 29: Chapter One exploring data

Based on the survey data, can we conclude that boys and girls differ in their preference of superpower? Let’s use the four-step process to support our answer with evidence.

Page 33: Chapter One exploring data

State: What is the relationship between gender and the answer to the question “What superpower would you prefer?”

Plan: We suspect that gender might influence a child’s opinion about superpowers. So we will compare the conditional distributions of responses for females alone and for males alone.

Do: Here is a table and side-by-side bar graph comparing the opinions of males and females. We will use percents instead of counts since the numbers of females and males are different.

Superpower       % of Females   % of Males
Invisibility     15%            15%
Superstrength    3%             20%
Telepathy        34%            6%
Fly              31%            21%
Freeze Time      17%            38%
Total            100%           100%
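The Do step can be sketched in Python as follows. The helper `conditional_distribution` is a hypothetical function of our own, and the rows of `#` characters are only a rough text stand-in for the side-by-side bar graph on the slide.

```python
# Two-way table of counts: superpower (rows) by gender (columns).
table = {
    "Invisibility":  {"Female": 17, "Male": 13},
    "Superstrength": {"Female": 3,  "Male": 17},
    "Telepathy":     {"Female": 39, "Male": 5},
    "Fly":           {"Female": 36, "Male": 18},
    "Freeze Time":   {"Female": 20, "Male": 32},
}

def conditional_distribution(table, group):
    """Percent of each row category among the individuals in one column group."""
    group_total = sum(row[group] for row in table.values())
    return {cat: 100 * row[group] / group_total for cat, row in table.items()}

females = conditional_distribution(table, "Female")
males = conditional_distribution(table, "Male")

# Text version of a side-by-side bar graph: one '#' per two percentage points.
for power in table:
    print(f"{power:<14} F {females[power]:5.1f}% {'#' * round(females[power] / 2)}")
    print(f"{'':<14} M {males[power]:5.1f}% {'#' * round(males[power] / 2)}")
```

Because each distribution is computed from its own column total, the two groups can be compared fairly even though there are 115 girls and only 85 boys.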

Page 36: Chapter One exploring data

Conclude: Based on the sample data, females were much more likely to choose telepathy than males, while males were much more likely to choose superstrength or freeze time than females. Females were somewhat more likely to choose flying (31% versus 21%), and the two groups were equally likely to choose invisibility (15% each).

We say that there is an association between two variables if specific values of one variable tend to occur in common with specific values of the other.

So, since females were more likely than males to choose telepathy, there is an association between the variables gender and superpower choice.
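One informal way to look for an association is to subtract the male conditional percent from the female conditional percent for each superpower. The sketch below does this in Python; the names `gaps` and `largest` are our own, and the size of a gap is only a descriptive summary, not a formal test.

```python
powers = ["Invisibility", "Superstrength", "Telepathy", "Fly", "Freeze Time"]
female = [17, 3, 39, 36, 20]   # counts from the Female column
male = [13, 17, 5, 18, 32]     # counts from the Male column

female_total, male_total = sum(female), sum(male)

# Gap in percentage points: % of females minus % of males for each superpower.
gaps = {
    p: 100 * f / female_total - 100 * m / male_total
    for p, f, m in zip(powers, female, male)
}

# The superpower with the largest gap in either direction.
largest = max(gaps, key=lambda p: abs(gaps[p]))

for p, gap in gaps.items():
    print(f"{p:<14}{gap:+6.1f} points")
print("largest gap:", largest)
```

If there were no association at all, every gap would be close to zero, as it is for invisibility.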

Page 39: Chapter One exploring data

SUMMARY

• The distribution of a categorical variable lists the categories and gives the count (frequency table) or percent (relative frequency table) of individuals that fall in each category.

• Pie charts and bar graphs display the distribution of a categorical variable. Bar graphs can also compare any set of quantities measured in the same units. When examining any graph, ask yourself, “What do I see?”

• A two-way table of counts organizes data about two categorical variables. Two-way tables are often used to summarize large amounts of information by grouping outcomes into categories.

Page 42: Chapter One exploring data

SUMMARY

• The row totals and column totals in a two-way table give the marginal distributions of the two individual variables. It is clearer to present these distributions as percents of the table total. Marginal distributions tell us nothing about the relationship between the variables.

• There are two sets of conditional distributions for a two-way table: the distributions of the row variable for each value of the column variable, and the distributions of the column variable for each value of the row variable.

• A statistical problem has a real-world setting. You can organize many problems using the four steps: state, plan, do, and conclude.

Page 43: Chapter One exploring data

SUMMARY

• To describe the association between the row and column variables, compare an appropriate set of conditional distributions. Remember that even a strong association between two categorical variables can be influenced by other variables lurking in the background.
