stat 22000 lecture slides exploring categorical datayibi/teaching/stat220/... · 1st 122 203325 2nd...
TRANSCRIPT
STAT 22000 Lecture SlidesExploring Categorical Data
Yibi HuangDepartment of StatisticsUniversity of Chicago
Outline
This set of slides cover Section 1.7 in the text.
• Ways to summarize of a single categorical variable• Frequency tables• Barplots, pie charts
• Ways to summarize of relationships between two categoricalvariables• two-way contingency tables• segmented barplots, standardized segmented barplots,
mosaic plot
1
Bar Graphs and Pie Charts
Graphs for Categorical Variables
A categorical variable is summarized by a table showing the countor the percentage of cases in each category, and is often displayedby a bar plot or a pie chart.
Ex: Passengers on Titanic
Class Freq Percent1st 325 14.8%2nd 285 12.9%3rd 706 32.1%
Crew 885 40.2%Total 2201 100%
1st 2nd 3rd Crew
Class
Num
ber
of P
eopl
e
020
040
060
080
0
1st
2nd
3rd
Crew 2
Bar plots
A bar plot is a common way to display a single categorical variable.A bar plot where proportions instead of frequencies are shown iscalled a relative frequency bar plot.
1st 2nd 3rd Crew
Class
Num
ber
of P
eopl
e
020
040
060
080
0
1st 2nd 3rd Crew
Class
Per
cent
age
of P
eopl
e
010
2030
40
3
How are Bar Plots Different From Histograms?
• Bar plots are used for displaying distributions of categoricalvariables, while histograms are used for numerical variables.
• The horizontal axis in a histogram is a number line, hence theorder of the bars cannot be changed, while in a bar plot thecategories can be listed in any order (though some orderingsmake more sense than others, especially for ordinalvariables.)
4
Why We Recommend Bar Plots Over Pie Charts?
In a pie chart, the areas of slices represents the percentages ofcategories. However, it is generally more difficult to compare groupsizes in a pie chart than in a bar plot, especially when categorieshave nearly identical counts or proportions
1st
2nd
3rd
Crew
Without looking at the counts,can you tell which class havefewest people from the pie?
5
Why We Recommend Bar Plots Over Pie Charts?
It’s much easier to make a wrong pie chart than a wrong bar plot.In a pie chart, the categories must make up a whole. There is nosuch restriction for a bar plot.
6
Another Wrong Pie Chart
http:// www.youtube.com/ watch?v=-rbyhj8uTT8
7
Two-Way Contingency Tables
Two-Way Contingency Tables
A table that summarizes data for two categorical variables is calleda contingency table.
E.g., breakdown of people on Titanic by class and survival status
Died Survived Total
Class
1st 122 203 3252nd 167 118 2853rd 528 178 706Crew 673 212 885Sum 1490 711 2201
The marginal totals give the distributions of the two variables, e.g.,
• overall, 1490 died and 711 survived• there were 325, 285, and 706 passengers in the 1st, 2nd and
3rd classes, and 885 crew members8
Overall Proportions
Dividing the cell counts in a contingency table by the overall total, we getthe proportions of observations in the combinations of the two variables.
SurvivedNo Yes Total
Class
1st 122/2201 ≈ 0.06 203/2201 ≈ 0.09 325/2201 ≈ 0.152nd 167/2201 ≈ 0.08 118/2201 ≈ 0.05 285/2201 ≈ 0.133rd 528/2201 ≈ 0.24 178/2201 ≈ 0.08 706/2201 ≈ 0.32Crew 673/2201 ≈ 0.31 212/2201 ≈ 0.10 885/2201 ≈ 0.40Sum 1490/2201 ≈ 0.68 711/2201 ≈ 0.32 1
e.g., of people on Titanic
• 122/2201 ≈ 6% were in the 1st class and died in the disaster
• 212/2201 ≈ 10% were survived crew members
Note the marginal totals give the distributions of the two variables, e.g.,
• Overall, 711/2201 ≈ 32% of the people survived9
Row Proportions
The row proportions (cell counts divided by the corresponding rowtotals) give the proportion of people survived in the four classes.
SurvivedNo Yes Total
Class
1st 122/325 ≈ 0.38 203/325 ≈ 0.62 12nd 167/285 ≈ 0.59 118/285 ≈ 0.41 13rd 528/706 ≈ 0.75 178/706 ≈ 0.25 1Crew 673/885 ≈ 0.76 212/885 ≈ 0.24 1
e.g.,
• 203/325 ≈ 62% of people in the 1st class survived.
• 178/706 ≈ 25% of people in the 3rd class survived.
10
Column Proportions
The column proportions (dividing cell counts by the correspondingcolumn totals) give the proportion of people survived in each of thefour classes.
SurvivedNo Yes
Class
1st 122/1490 ≈ 0.08 203/711 ≈ 0.292nd 167/1490 ≈ 0.11 118/711 ≈ 0.173rd 528/1490 ≈ 0.35 178/711 ≈ 0.25Crew 673/1490 ≈ 0.45 212/711 ≈ 0.30Sum 1 1
• Among those who survived, 203/711 ≈ 29% were in the 1stclass.
• Among those who died, 673/1490 ≈ 45% were crew members
11
Independence of Two Categorical Variables
If the row proportions do not change from row to row, we say thetwo categorical variables are independent. Otherwise, we say theyare associated.
E.g., if the survival rates do not change from class to class, we say‘survival’ is independent of ‘class’. In the Titanic data, the survivalof passengers is associated with the class they were in becausethe survival rates differ substantially from class to class.
We can also define two categorical variables to be independent ifthe column proportions do not vary from column to column sincethe two conditions are equivalent (why?)
12
Exercise
The table below shows the breakdown of cases of injuries in theU.S in a certain year. by circumstance and gender1. Counts are inmillions.
CircumstanceGender Work Home Other Total
Male 8.0 9.8 17.8 35.6Female 1.3 11.6 12.9 25.8
Total 9.3 21.4 30.7 61.4
• What proportion of injury cases occurred at work?9.3/61.4 ≈ 0.15
• What proportion of injury cases occurred at work and onwomen? 1.3/61.4 ≈ 0.02
1Source: Vital and Health Statistics published by the National Center for Health Statistics
13
Practise (Cont’d)
CircumstanceGender Work Home Other Total
Male 8.0 9.8 17.8 35.6Female 1.3 11.6 12.9 25.8
Total 9.3 21.4 30.7 61.4
• Among all injury cases occurred on women, what proportionoccurred at work? 1.3/25.8 ≈ 0.05
• Among all injury cases occurred at work, what proportionoccurred on women? 1.3/9.3 ≈ 0.14
• Is the circumstance of injury cases independent of the genderof the victims? No, only 5% of injury cases on womenoccurred at work, compared with 8.0/36.5 ≈ 22% of cases onmen occurred at work.
14
Segmented Bar and Mosaic Plots
Segmented Bar Plots
SurvivedClass No Yes Total1st 122 203 3252nd 167 118 2853rd 528 178 706Crew 673 212 885Sum 1490 711 2201
Class
Fre
q
0
200
400
600
800
1st 2nd 3rd Crew
SurvivedNoYes
15
Standardized Segmented Bar Plots
SurvivedNo Yes Total
Class
1st 0.38 0.62 12nd 0.59 0.41 13rd 0.75 0.25 1Crew 0.76 0.24 1
Class
0.0
0.2
0.4
0.6
0.8
1.0
1st 2nd 3rd Crew
SurvivedNoYes
Standardized segmented bar plots are convenient for comparingrow proportions, and determining whether the two variables areindependent.
However, the information of row totals is lost after standardization.
16
Mosaic Plots
• bar widths = row totals• segment lengths within a bar = row proportions
Class
Sur
vive
d
1st 2nd 3rd Crew
No
Yes
segment area = (barwidth) × (segment length)
= row total × (row proportion)
= row total ×cell countrow total
= cell count 17
Exercise 1.68 Raise Taxes on the Rich or the Poor
The mosaic plot below shows the relationship between politicalparty affiliation and views on whether it’s better to raise taxes onthe rich or on the poor for a random sample of registered voterstaken nationally in 2015.
Democrat Republican Indep / Other
Raise taxes on the rich
Raise taxes on the poorNot sure
18
Democrat Republican Indep / Other
Raise taxes on the rich
Raise taxes on the poorNot sure
Which political party identification is least common in the sample,Democrats, Republicans, or Indep/Other?
Ans: Indep/Other.
19
Democrat Republican Indep / Other
Raise taxes on the rich
Raise taxes on the poorNot sure
Based on this sample, which political party identification had thehighest percentage supported raising taxes on the rich? Which hadthe lowest?
Ans: Democrats the highest, Republicans the lowest.
20
Democrat Republican Indep / Other
Raise taxes on the rich
Raise taxes on the poorNot sure
What percentage of Democrats (in this sample) supported raisingtaxes on the rich?
(a) below 25%(b) between 25% and 50%(c) between 50% and 75%(d) over 75%
21
Democrat Republican Indep / Other
Raise taxes on the rich
Raise taxes on the poorNot sure
In this sample, which of the following groups contains the greatestnumber of subjects?
(a) Democrats who supported raising taxes on the rich.(b) Democrats who supported raising taxes on the poor.(c) Republicans who supported raising taxes on the rich.(d) Republicans who supported raising taxes on the poor.
22
Democrat Republican Indep / Other
Raise taxes on the rich
Raise taxes on the poorNot sure
Based on the mosaic plot, do views on raising taxes and politicalaffiliation appear to be independent?
23
Instead of looking at survival rates in the four classes, we can alsolook at the breakdown of the four classes among those whosurvived and among those who died.
Survived
Fre
q
0
500
1000
1500
No Yes
Survived
0.0
0.2
0.4
0.6
0.8
1.0
No Yes
Class1st2nd3rdCrew
Survived
Cla
ss
No Yes1st
2nd
3rd
Crew
24
Ways to Inspect Relationships Between Variables
• numerical v.s. numerical• scatterplots
• categorical v.s. categorical• contingency tables• segmented barplots, standardized segmented barplots,
mosaic plot
• categorical v.s. numerical• side-by-side boxplots• histograms by group on the same horizontal axis
25
Example (Diamonds)
Mosaic plot: Carat Weight v.s. Quality of Cut
Carat
Qua
lity
of C
ut
0.4
0.41
0.42
0.43
0.44
0.45
0.46
0.47
0.48
0.49 0.
5
0.51
0.52
0.53
0.54
0.55
0.56
0.57
0.58
0.59 0.
60.
610.
620.
630.
640.
650.
660.
670.
680.
69 0.7
0.71
0.72
0.73
0.74
0.75
0.76
0.77
0.78
0.79
Fair
Good
Very Good
Premium
Ideal
26
Example (Diamonds)
Carat
Qua
lity
of C
ut
0.8
0.81
0.82
0.83
0.84
0.85
0.86
0.87
0.88
0.89 0.
9
0.91
0.92
0.93
0.94
0.95
0.96
0.97
0.98
0.99 1
1.01
1.02
1.03
1.04
1.05
1.06
1.07
1.08
1.09 1.
11.
111.
121.
131.
141.
151.
161.
171.
181.
19 1.2
1.21
1.22
1.23
1.24
Fair
Good
Very Good
Premium
Ideal
27
Example (Diamonds)
Carat
Qua
lity
of C
ut
1.25
1.26
1.27
1.28
1.29 1.
31.
311.
321.
331.
341.
351.
361.
371.
381.
39 1.4
1.41
1.42
1.43
1.44
1.45
1.46
1.47
1.48
1.49 1.
5
1.51
1.52
1.53
1.54
1.55
1.56
1.57
1.58
1.59 1.
61.
611.
621.
631.
641.
651.
661.
671.
681.
69 1.7
1.71
1.72
1.73
1.74
1.75
1.76
1.77
1.78
1.79
FairGood
Very Good
Premium
Ideal
28
Example (Diamonds)
Carat
Qua
lity
of C
ut
1.8
1.81
1.82
1.83
1.84
1.85
1.86
1.87
1.88
1.89 1.
91.
911.
921.
931.
941.
951.
961.
971.
981.
99 2
2.01
2.02
2.03
2.04
2.05
2.06
2.07
2.08
2.09 2.
12.
112.
122.
132.
142.
152.
162.
172.
182.
19 2.2
2.21
2.22
2.23
2.24
2.25
2.26
2.27
2.28
2.29
Fair
Good
Very Good
Premium
Ideal
29
Example (Diamonds)
From the mosaic plots, we can see the proportion of low-quality cutdiamonds increases substantially whenever the carat weight ofdiamonds reaches those benchmarks (0.5, 0.7, 0.9, 1, 1.2, 1.5,2,. . . ). Diamonds with carat weights right above those benchmarksgenerally have better quality of cut then those just at thosebenchmarks.
Possible reasons:
Diamond cutters would want to get the heaviest diamond out of arough stone whenever possible. They might increase the depth ofdiamonds to increase the carat weight, but result in a loss ofbrilliance due to light leakage.
30