stat 22000 lecture slides exploring categorical datayibi/teaching/stat220/... · 1st 122 203325 2nd...

STAT 22000 Lecture SlidesExploring Categorical Data

Yibi HuangDepartment of StatisticsUniversity of Chicago

Outline

This set of slides cover Section 1.7 in the text.

• Ways to summarize of a single categorical variable• Frequency tables• Barplots, pie charts

• Ways to summarize of relationships between two categoricalvariables• two-way contingency tables• segmented barplots, standardized segmented barplots,

mosaic plot

1

Bar Graphs and Pie Charts

Graphs for Categorical Variables

A categorical variable is summarized by a table showing the countor the percentage of cases in each category, and is often displayedby a bar plot or a pie chart.

Ex: Passengers on Titanic

Class Freq Percent1st 325 14.8%2nd 285 12.9%3rd 706 32.1%

Crew 885 40.2%Total 2201 100%

1st 2nd 3rd Crew

Class

Num

ber

of P

eopl

e

020

040

060

080

0

1st

2nd

3rd

Crew 2

Bar plots

A bar plot is a common way to display a single categorical variable.A bar plot where proportions instead of frequencies are shown iscalled a relative frequency bar plot.

1st 2nd 3rd Crew

Class

Num

ber

of P

eopl

e

020

040

060

080

0

1st 2nd 3rd Crew

Class

Per

cent

age

of P

eopl

e

010

2030

40

3

How are Bar Plots Different From Histograms?

• Bar plots are used for displaying distributions of categoricalvariables, while histograms are used for numerical variables.

• The horizontal axis in a histogram is a number line, hence theorder of the bars cannot be changed, while in a bar plot thecategories can be listed in any order (though some orderingsmake more sense than others, especially for ordinalvariables.)

4

Why We Recommend Bar Plots Over Pie Charts?

In a pie chart, the areas of slices represents the percentages ofcategories. However, it is generally more difficult to compare groupsizes in a pie chart than in a bar plot, especially when categorieshave nearly identical counts or proportions

1st

2nd

3rd

Crew

Without looking at the counts,can you tell which class havefewest people from the pie?

5

Why We Recommend Bar Plots Over Pie Charts?

It’s much easier to make a wrong pie chart than a wrong bar plot.In a pie chart, the categories must make up a whole. There is nosuch restriction for a bar plot.

6

Another Wrong Pie Chart

http:// www.youtube.com/ watch?v=-rbyhj8uTT8

7

http://www.youtube.com/watch?v=-rbyhj8uTT8

Two-Way Contingency Tables

Two-Way Contingency Tables

A table that summarizes data for two categorical variables is calleda contingency table.

E.g., breakdown of people on Titanic by class and survival status

Died Survived Total

Class

1st 122 203 3252nd 167 118 2853rd 528 178 706Crew 673 212 885Sum 1490 711 2201

The marginal totals give the distributions of the two variables, e.g.,

• overall, 1490 died and 711 survived• there were 325, 285, and 706 passengers in the 1st, 2nd and

3rd classes, and 885 crew members8

Overall Proportions

Dividing the cell counts in a contingency table by the overall total, we getthe proportions of observations in the combinations of the two variables.

SurvivedNo Yes Total

Class

1st 122/2201 ≈ 0.06 203/2201 ≈ 0.09 325/2201 ≈ 0.152nd 167/2201 ≈ 0.08 118/2201 ≈ 0.05 285/2201 ≈ 0.133rd 528/2201 ≈ 0.24 178/2201 ≈ 0.08 706/2201 ≈ 0.32Crew 673/2201 ≈ 0.31 212/2201 ≈ 0.10 885/2201 ≈ 0.40Sum 1490/2201 ≈ 0.68 711/2201 ≈ 0.32 1

e.g., of people on Titanic

• 122/2201 ≈ 6% were in the 1st class and died in the disaster

• 212/2201 ≈ 10% were survived crew members

Note the marginal totals give the distributions of the two variables, e.g.,

• Overall, 711/2201 ≈ 32% of the people survived9

Row Proportions

The row proportions (cell counts divided by the corresponding rowtotals) give the proportion of people survived in the four classes.


Class

1st 122/325 ≈ 0.38 203/325 ≈ 0.62 12nd 167/285 ≈ 0.59 118/285 ≈ 0.41 13rd 528/706 ≈ 0.75 178/706 ≈ 0.25 1Crew 673/885 ≈ 0.76 212/885 ≈ 0.24 1

e.g.,

• 203/325 ≈ 62% of people in the 1st class survived.

• 178/706 ≈ 25% of people in the 3rd class survived.

10

Column Proportions

The column proportions (dividing cell counts by the correspondingcolumn totals) give the proportion of people survived in each of thefour classes.

SurvivedNo Yes

Class

1st 122/1490 ≈ 0.08 203/711 ≈ 0.292nd 167/1490 ≈ 0.11 118/711 ≈ 0.173rd 528/1490 ≈ 0.35 178/711 ≈ 0.25Crew 673/1490 ≈ 0.45 212/711 ≈ 0.30Sum 1 1

• Among those who survived, 203/711 ≈ 29% were in the 1stclass.

• Among those who died, 673/1490 ≈ 45% were crew members

11

Independence of Two Categorical Variables

If the row proportions do not change from row to row, we say thetwo categorical variables are independent. Otherwise, we say theyare associated.

E.g., if the survival rates do not change from class to class, we say‘survival’ is independent of ‘class’. In the Titanic data, the survivalof passengers is associated with the class they were in becausethe survival rates differ substantially from class to class.

We can also define two categorical variables to be independent ifthe column proportions do not vary from column to column sincethe two conditions are equivalent (why?)

12

Exercise

The table below shows the breakdown of cases of injuries in theU.S in a certain year. by circumstance and gender1. Counts are inmillions.

CircumstanceGender Work Home Other Total

Male 8.0 9.8 17.8 35.6Female 1.3 11.6 12.9 25.8

Total 9.3 21.4 30.7 61.4

• What proportion of injury cases occurred at work?9.3/61.4 ≈ 0.15

• What proportion of injury cases occurred at work and onwomen? 1.3/61.4 ≈ 0.02

1Source: Vital and Health Statistics published by the National Center for Health Statistics

13

Practise (Cont’d)

CircumstanceGender Work Home Other Total

Male 8.0 9.8 17.8 35.6Female 1.3 11.6 12.9 25.8

Total 9.3 21.4 30.7 61.4

• Among all injury cases occurred on women, what proportionoccurred at work? 1.3/25.8 ≈ 0.05

• Among all injury cases occurred at work, what proportionoccurred on women? 1.3/9.3 ≈ 0.14

• Is the circumstance of injury cases independent of the genderof the victims? No, only 5% of injury cases on womenoccurred at work, compared with 8.0/36.5 ≈ 22% of cases onmen occurred at work.

14

Segmented Bar and Mosaic Plots

Segmented Bar Plots

SurvivedClass No Yes Total1st 122 203 3252nd 167 118 2853rd 528 178 706Crew 673 212 885Sum 1490 711 2201

Class

Fre

q

0

200

400

600

800

1st 2nd 3rd Crew

SurvivedNoYes

15

Standardized Segmented Bar Plots


Class

1st 0.38 0.62 12nd 0.59 0.41 13rd 0.75 0.25 1Crew 0.76 0.24 1

Class

0.0

0.2

0.4

0.6

0.8

1.0

1st 2nd 3rd Crew

SurvivedNoYes

Standardized segmented bar plots are convenient for comparingrow proportions, and determining whether the two variables areindependent.

However, the information of row totals is lost after standardization.

16

Mosaic Plots

• bar widths = row totals• segment lengths within a bar = row proportions

Class

Sur

vive

d

1st 2nd 3rd Crew

No

Yes

segment area = (barwidth) × (segment length)

= row total × (row proportion)

= row total ×cell countrow total

= cell count 17

Exercise 1.68 Raise Taxes on the Rich or the Poor

The mosaic plot below shows the relationship between politicalparty affiliation and views on whether it’s better to raise taxes onthe rich or on the poor for a random sample of registered voterstaken nationally in 2015.

Democrat Republican Indep / Other

Raise taxes on the rich

Raise taxes on the poorNot sure

18




Which political party identification is least common in the sample,Democrats, Republicans, or Indep/Other?

Ans: Indep/Other.

19




Based on this sample, which political party identification had thehighest percentage supported raising taxes on the rich? Which hadthe lowest?

Ans: Democrats the highest, Republicans the lowest.

20




What percentage of Democrats (in this sample) supported raisingtaxes on the rich?

(a) below 25%(b) between 25% and 50%(c) between 50% and 75%(d) over 75%

21




In this sample, which of the following groups contains the greatestnumber of subjects?

(a) Democrats who supported raising taxes on the rich.(b) Democrats who supported raising taxes on the poor.(c) Republicans who supported raising taxes on the rich.(d) Republicans who supported raising taxes on the poor.

22




Based on the mosaic plot, do views on raising taxes and politicalaffiliation appear to be independent?

23

Instead of looking at survival rates in the four classes, we can alsolook at the breakdown of the four classes among those whosurvived and among those who died.

Survived

Fre

q

0

500

1000

1500

No Yes

Survived

0.0

0.2

0.4

0.6

0.8

1.0

No Yes

Class1st2nd3rdCrew

Survived

Cla

ss

No Yes1st

2nd

3rd

Crew

24

Ways to Inspect Relationships Between Variables

• numerical v.s. numerical• scatterplots

• categorical v.s. categorical• contingency tables• segmented barplots, standardized segmented barplots,

mosaic plot

• categorical v.s. numerical• side-by-side boxplots• histograms by group on the same horizontal axis

25

Example (Diamonds)

Mosaic plot: Carat Weight v.s. Quality of Cut

Carat

Qua

lity

of C

ut

0.4

0.41

0.42

0.43

0.44

0.45

0.46

0.47

0.48

0.49 0.

5

0.51

0.52

0.53

0.54

0.55

0.56

0.57

0.58

0.59 0.

60.

610.

620.

630.

640.

650.

660.

670.

680.

69 0.7

0.71

0.72

0.73

0.74

0.75

0.76

0.77

0.78

0.79

Fair

Good

Very Good

Premium

Ideal

26

Example (Diamonds)

Carat

Qua

lity

of C

ut

0.8

0.81

0.82

0.83

0.84

0.85

0.86

0.87

0.88

0.89 0.

9

0.91

0.92

0.93

0.94

0.95

0.96

0.97

0.98

0.99 1

1.01

1.02

1.03

1.04

1.05

1.06

1.07

1.08

1.09 1.

11.

111.

121.

131.

141.

151.

161.

171.

181.

19 1.2

1.21

1.22

1.23

1.24

Fair

Good

Very Good

Premium

Ideal

27

Example (Diamonds)

Carat

Qua

lity

of C

ut

1.25

1.26

1.27

1.28

1.29 1.

31.

311.

321.

331.

341.

351.

361.

371.

381.

39 1.4

1.41

1.42

1.43

1.44

1.45

1.46

1.47

1.48

1.49 1.

5

1.51

1.52

1.53

1.54

1.55

1.56

1.57

1.58

1.59 1.

61.

611.

621.

631.

641.

651.

661.

671.

681.

69 1.7

1.71

1.72

1.73

1.74

1.75

1.76

1.77

1.78

1.79

FairGood

Very Good

Premium

Ideal

28

Example (Diamonds)

Carat

Qua

lity

of C

ut

1.8

1.81

1.82

1.83

1.84

1.85

1.86

1.87

1.88

1.89 1.

91.

911.

921.

931.

941.

951.

961.

971.

981.

99 2

2.01

2.02

2.03

2.04

2.05

2.06

2.07

2.08

2.09 2.

12.

112.

122.

132.

142.

152.

162.

172.

182.

19 2.2

2.21

2.22

2.23

2.24

2.25

2.26

2.27

2.28

2.29

Fair

Good

Very Good

Premium

Ideal

29

Example (Diamonds)

From the mosaic plots, we can see the proportion of low-quality cutdiamonds increases substantially whenever the carat weight ofdiamonds reaches those benchmarks (0.5, 0.7, 0.9, 1, 1.2, 1.5,2,. . . ). Diamonds with carat weights right above those benchmarksgenerally have better quality of cut then those just at thosebenchmarks.

Possible reasons:

Diamond cutters would want to get the heaviest diamond out of arough stone whenever possible. They might increase the depth ofdiamonds to increase the carat weight, but result in a loss ofbrilliance due to light leakage.

30

stat 22000 lecture slides exploring categorical datayibi/teaching/stat220/... · 1st 122 203325 2nd...

Documents