coding lab: visualizing data with ggplot2
TRANSCRIPT
![Page 1: Coding Lab: Visualizing data with ggplot2](https://reader034.vdocuments.net/reader034/viewer/2022052105/6286f25e6a33313de41cb997/html5/thumbnails/1.jpg)
Coding Lab: Visualizing data with ggplot2
Ari Anisfeld
Summer 2020
1 / 36
![Page 2: Coding Lab: Visualizing data with ggplot2](https://reader034.vdocuments.net/reader034/viewer/2022052105/6286f25e6a33313de41cb997/html5/thumbnails/2.jpg)
How to use ggplot
I How to map data to aesthetics with aes() (and what thatmeans)
I How to visualize the mappings with geomsI How to get more out of your data by using multiple aestheticsI How to use facets to add dimensionality
There are whole books on how to use ggplot. This is a quickintroduction!
2 / 36
![Page 3: Coding Lab: Visualizing data with ggplot2](https://reader034.vdocuments.net/reader034/viewer/2022052105/6286f25e6a33313de41cb997/html5/thumbnails/3.jpg)
Understanding ggplot()By itself, ggplot() tells R to prepare to make a plot.texas_annual_sales <-texas_housing_data %>%group_by(year) %>%summarize(total_volume = sum(volume, na.rm = TRUE))
ggplot(data = texas_annual_sales)
3 / 36
![Page 4: Coding Lab: Visualizing data with ggplot2](https://reader034.vdocuments.net/reader034/viewer/2022052105/6286f25e6a33313de41cb997/html5/thumbnails/4.jpg)
Adding a mappingAdding mapping = aes() says how the data will map to“aesthetics”.
I e.g. tell R to make x-axis year and y-axis total_volume.I Each row of the data has (year, total_volume).
I R will map that to the coordinate pair (x,y) .I Look at the data before moving on!
ggplot(data = texas_annual_sales,mapping = aes(x = year, y = total_volume))
4e+10
5e+10
6e+10
7e+10
8e+10
2000 2005 2010 2015year
tota
l_vo
lum
e
4 / 36
![Page 5: Coding Lab: Visualizing data with ggplot2](https://reader034.vdocuments.net/reader034/viewer/2022052105/6286f25e6a33313de41cb997/html5/thumbnails/5.jpg)
Visualizing the mapping with a geomgeom_<name> tells R what type of visualization to produce.
Here we see points.
I Each row of the data has (year, total_volume).I R will map that to the coordinate pair (x,y).
ggplot(data = texas_annual_sales,mapping = aes(x = year, y = total_volume)) +
geom_point()
4e+10
5e+10
6e+10
7e+10
8e+10
2000 2005 2010 2015year
tota
l_vo
lum
e
5 / 36
![Page 6: Coding Lab: Visualizing data with ggplot2](https://reader034.vdocuments.net/reader034/viewer/2022052105/6286f25e6a33313de41cb997/html5/thumbnails/6.jpg)
Visualizing the mapping with a geomHere we see bars.
I Each row of the data has (year, total_volume).I R will map that to the coordinate pair (x,y)
ggplot(data = texas_annual_sales,mapping = aes(x = year, y = total_volume)) +
geom_col()
0e+00
2e+10
4e+10
6e+10
8e+10
2000 2005 2010 2015year
tota
l_vo
lum
e
6 / 36
![Page 7: Coding Lab: Visualizing data with ggplot2](https://reader034.vdocuments.net/reader034/viewer/2022052105/6286f25e6a33313de41cb997/html5/thumbnails/7.jpg)
Visualizing the mapping with a geom
Here we see a line connecting each (x,y) pair.ggplot(data = texas_annual_sales,
mapping = aes(x = year, y = total_volume)) +geom_line()
4e+10
5e+10
6e+10
7e+10
8e+10
2000 2005 2010 2015year
tota
l_vo
lum
e
7 / 36
![Page 8: Coding Lab: Visualizing data with ggplot2](https://reader034.vdocuments.net/reader034/viewer/2022052105/6286f25e6a33313de41cb997/html5/thumbnails/8.jpg)
Visualizing the mapping with a geomHere we see a smooth line. R does a statistical transformation!
I Now R doesn’t visualize the mapping (year, total_volume) to each (x,y)pair
I Instead it fits a model to the (x,y) and then plots the “smooth” line
ggplot(data = texas_annual_sales,mapping = aes(x = year, y = total_volume)) +
geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
2.5e+10
5.0e+10
7.5e+10
2000 2005 2010 2015year
tota
l_vo
lum
e
8 / 36
![Page 9: Coding Lab: Visualizing data with ggplot2](https://reader034.vdocuments.net/reader034/viewer/2022052105/6286f25e6a33313de41cb997/html5/thumbnails/9.jpg)
Visualizing the mapping with a geom
We can overlay several geom.ggplot(data = texas_annual_sales,
mapping = aes(x = year, y = total_volume)) +geom_smooth() +geom_point()
2.5e+10
5.0e+10
7.5e+10
2000 2005 2010 2015year
tota
l_vo
lum
e
9 / 36
![Page 10: Coding Lab: Visualizing data with ggplot2](https://reader034.vdocuments.net/reader034/viewer/2022052105/6286f25e6a33313de41cb997/html5/thumbnails/10.jpg)
Visualizing the mapping with a geom
I We saw that we can visualize a relationship between twovariables mapping data to x and y
I The data can be visualized with different geoms that can becomposed (+) together.
I We can even calculate new variables with statistics and plotthose on the fly.
Next: Now we’ll look at aesthetics that go beyond x and y axes.
10 / 36
![Page 11: Coding Lab: Visualizing data with ggplot2](https://reader034.vdocuments.net/reader034/viewer/2022052105/6286f25e6a33313de41cb997/html5/thumbnails/11.jpg)
Using aesthetics to explore data.We’ll use midwest data and start with only mapping to x and y
midwest %>%ggplot(aes(x = percollege,
y = percbelowpoverty)) +geom_point()
0
10
20
30
40
50
10 20 30 40 50percollege
perc
belo
wpo
vert
y
11 / 36
![Page 12: Coding Lab: Visualizing data with ggplot2](https://reader034.vdocuments.net/reader034/viewer/2022052105/6286f25e6a33313de41cb997/html5/thumbnails/12.jpg)
Using aesthetics to explore data.I color maps data to the color of points or lines.
I Each state is assigned a color.I This works with discrete data and continuous data.
midwest %>%ggplot(aes(x = percollege,
y = percbelowpoverty,color = state)) +
geom_point()
0
10
20
30
40
50
10 20 30 40 50percollege
perc
belo
wpo
vert
y state
IL
IN
MI
OH
WI
12 / 36
![Page 13: Coding Lab: Visualizing data with ggplot2](https://reader034.vdocuments.net/reader034/viewer/2022052105/6286f25e6a33313de41cb997/html5/thumbnails/13.jpg)
Using aesthetics to explore data.I shape maps data to the shape of points.
I Each state is assigned a shape.I This works with discrete data only.
midwest %>%ggplot(aes(x = percollege,
y = percbelowpoverty,shape = state)) +
geom_point()
0
10
20
30
40
50
10 20 30 40 50percollege
perc
belo
wpo
vert
y state
IL
IN
MI
OH
WI
13 / 36
![Page 14: Coding Lab: Visualizing data with ggplot2](https://reader034.vdocuments.net/reader034/viewer/2022052105/6286f25e6a33313de41cb997/html5/thumbnails/14.jpg)
Using aesthetics to explore data.I alpha maps data to the transparency of points.
I Here we map the percentage of people within a known povertystatus to alpha
midwest %>%ggplot(aes(x = percollege,
y = percbelowpoverty,alpha = poptotal)) +
geom_point()
0
10
20
30
40
50
10 20 30 40 50percollege
perc
belo
wpo
vert
y poptotal
1e+06
2e+06
3e+06
4e+06
5e+06
14 / 36
![Page 15: Coding Lab: Visualizing data with ggplot2](https://reader034.vdocuments.net/reader034/viewer/2022052105/6286f25e6a33313de41cb997/html5/thumbnails/15.jpg)
Using aesthetics to explore data.I size maps data to the size of points and width of lines.
I Here we map the percentage of people within a known povertystatus to size
midwest %>%ggplot(aes(x = percollege,
y = percbelowpoverty,size = poptotal)) +
geom_point()
0
10
20
30
40
50
10 20 30 40 50percollege
perc
belo
wpo
vert
y poptotal
1e+06
2e+06
3e+06
4e+06
5e+06
15 / 36
![Page 16: Coding Lab: Visualizing data with ggplot2](https://reader034.vdocuments.net/reader034/viewer/2022052105/6286f25e6a33313de41cb997/html5/thumbnails/16.jpg)
Using aesthetics to explore data.
We can combine any and all aesthetics, and even map the same variable tomultiple aestheticsmidwest %>%
ggplot(aes(x = percollege,y = percbelowpoverty,alpha = percpovertyknown,size = poptotal,color = state))+
geom_point()
16 / 36
![Page 17: Coding Lab: Visualizing data with ggplot2](https://reader034.vdocuments.net/reader034/viewer/2022052105/6286f25e6a33313de41cb997/html5/thumbnails/17.jpg)
Using aesthetics to explore data.
0
10
20
30
40
50
10 20 30 40 50percollege
perc
belo
wpo
vert
y
percpovertyknown
85
90
95
state
IL
17 / 36
![Page 18: Coding Lab: Visualizing data with ggplot2](https://reader034.vdocuments.net/reader034/viewer/2022052105/6286f25e6a33313de41cb997/html5/thumbnails/18.jpg)
Using aesthetics to explore data
Different geoms have specific aesthetics that go with them.
I use ? to see which aesthetics a geom accepts (e.g?geom_point)
I the bold aesthetics are required.I the ggplot cheatsheet shows all the geoms with their associated
aesthetics
18 / 36
![Page 19: Coding Lab: Visualizing data with ggplot2](https://reader034.vdocuments.net/reader034/viewer/2022052105/6286f25e6a33313de41cb997/html5/thumbnails/19.jpg)
FacetsFacets provide an additional tool to explore multidimensional datamidwest %>%
ggplot(aes(x = log(poptotal),y = percbelowpoverty)) +
geom_point() +facet_wrap(vars(state))
OH WI
IL IN MI
8 10 12 14 8 10 12 14
8 10 12 140
1020304050
01020304050
log(poptotal)
perc
belo
wpo
vert
y
19 / 36
![Page 20: Coding Lab: Visualizing data with ggplot2](https://reader034.vdocuments.net/reader034/viewer/2022052105/6286f25e6a33313de41cb997/html5/thumbnails/20.jpg)
discrete vs continuous dataaes discrete continuous
limited number of classes unlimited number of classesusually chr or lgl numeric
x, y yes yescolor, fill yes yesshape yes (6 or fewer categories) nosize, alpha not advised yesfacet yes not advised
Here, discrete and continuous have different meaning than in math
I For ggplot meaning is more fluid.I If you do group_by with the var and there are fewer than 6 to
10 groups, discrete visualizations can workI If your “discrete” data is numeric, as.character() or
as_factor() to enforce the decision.20 / 36
![Page 21: Coding Lab: Visualizing data with ggplot2](https://reader034.vdocuments.net/reader034/viewer/2022052105/6286f25e6a33313de41cb997/html5/thumbnails/21.jpg)
color can be continuousmidwest %>%
ggplot(aes(x = percollege,y = percbelowpoverty,color = percpovertyknown)) +
geom_point()
0
10
20
30
40
50
10 20 30 40 50percollege
perc
belo
wpo
vert
y
85
90
95
percpovertyknown
21 / 36
![Page 22: Coding Lab: Visualizing data with ggplot2](https://reader034.vdocuments.net/reader034/viewer/2022052105/6286f25e6a33313de41cb997/html5/thumbnails/22.jpg)
shape does not play well with many categoriesI Will only map to 6 categories, the rest become NA.I We can override this behavior and get up to 25 distinct shapes
midwest %>%ggplot(aes(x = percollege,
y = percbelowpoverty,shape = county)) +
geom_point() +# legend off, otherwise it overwhelmstheme(legend.position = "none")
0
10
20
30
40
50
10 20 30 40 50percollege
perc
belo
wpo
vert
y
22 / 36
![Page 23: Coding Lab: Visualizing data with ggplot2](https://reader034.vdocuments.net/reader034/viewer/2022052105/6286f25e6a33313de41cb997/html5/thumbnails/23.jpg)
alpha and size can be misleading with discrete datamidwest %>%
ggplot(aes(x = percollege,y = percbelowpoverty,alpha = state)) +
geom_point()
## Warning: Using alpha for a discrete variable is not advised.
0
10
20
30
40
50
10 20 30 40 50percollege
perc
belo
wpo
vert
y state
IL
IN
MI
OH
WI
23 / 36
![Page 24: Coding Lab: Visualizing data with ggplot2](https://reader034.vdocuments.net/reader034/viewer/2022052105/6286f25e6a33313de41cb997/html5/thumbnails/24.jpg)
Adding vertical linestexas_annual_sales %>%
ggplot(aes(x = year, y = total_volume)) +geom_point() +geom_vline(aes(xintercept = 2007),
linetype = "dotted")
4e+10
5e+10
6e+10
7e+10
8e+10
2000 2005 2010 2015year
tota
l_vo
lum
e
I add horizontal lines with geom_hline()I add any linear fit using geom_abline() by providing a slope
and intercept.24 / 36
![Page 25: Coding Lab: Visualizing data with ggplot2](https://reader034.vdocuments.net/reader034/viewer/2022052105/6286f25e6a33313de41cb997/html5/thumbnails/25.jpg)
Key take aways
I ggplot starts by mapping data to “aesthetics”.I e.g. What data shows up on x and y axes and how color,
size and shape appear on the plot.I We need to be aware of ‘continuous’ vs. ‘discrete’ variables.
I Then, we use geoms to create a visualization based on themapping.
I Again we need to be aware of ‘continuous’ vs. ‘discrete’variables.
I Making quick plots helps us understand data and makes usaware of data issues
Resources: R for Data Science chap. 3 (r4ds.had.co.nz); RStudio’sggplot cheatsheet.
25 / 36
![Page 26: Coding Lab: Visualizing data with ggplot2](https://reader034.vdocuments.net/reader034/viewer/2022052105/6286f25e6a33313de41cb997/html5/thumbnails/26.jpg)
Appendix: Some graphs you made along the way
26 / 36
![Page 27: Coding Lab: Visualizing data with ggplot2](https://reader034.vdocuments.net/reader034/viewer/2022052105/6286f25e6a33313de41cb997/html5/thumbnails/27.jpg)
lab 0: a map
geom_path is like geom_line, but connects (x, y) pairs in theorder they appear in the data set.storms %>%
group_by(name, year) %>%filter(max(category) == 5) %>%
ggplot(aes(x = long, y = lat, color = name)) +geom_path() +borders("world") +coord_quickmap(xlim = c(-130, -60), ylim = c(20, 50))
27 / 36
![Page 28: Coding Lab: Visualizing data with ggplot2](https://reader034.vdocuments.net/reader034/viewer/2022052105/6286f25e6a33313de41cb997/html5/thumbnails/28.jpg)
lab 0: a map
20
30
40
50
−120 −100 −80 −60long
lat
Dean
Emily
Felix
Gilbert
Hugo
Isabel
Ivan
Katrina
Mitch
28 / 36
![Page 29: Coding Lab: Visualizing data with ggplot2](https://reader034.vdocuments.net/reader034/viewer/2022052105/6286f25e6a33313de41cb997/html5/thumbnails/29.jpg)
lab 1: a line plot
french_data <-wid_data %>%filter(type == "Net personal wealth",
country == "France") %>%mutate(perc_national_wealth = value * 100)
french_data %>%ggplot(aes(y = perc_national_wealth,
x = year,color = percentile)) +
geom_line()
29 / 36
![Page 30: Coding Lab: Visualizing data with ggplot2](https://reader034.vdocuments.net/reader034/viewer/2022052105/6286f25e6a33313de41cb997/html5/thumbnails/30.jpg)
lab 1: a line plot
0
25
50
75
1900 1925 1950 1975 2000year
perc
_nat
iona
l_w
ealth
percentile
p0p50
p50p90
p90p100
p99p100
30 / 36
![Page 31: Coding Lab: Visualizing data with ggplot2](https://reader034.vdocuments.net/reader034/viewer/2022052105/6286f25e6a33313de41cb997/html5/thumbnails/31.jpg)
lab 2: distributions
I geom_density() only requires an x asthetic and it calculatesthe distribution to plot.
I We can set the aesthetics manually, independent of data fornicer graphs.chi_sq_samples <-tibble(x = c(rchisq(100000, 2),
rchisq(100000, 3),rchisq(100000, 4)),
df = rep(c("2", "3", "4"), each = 1e5))
chi_sq_samples %>%ggplot(aes(x = x, fill = df)) +geom_density( alpha = .5) +labs(fill = "df", x = "sample")
31 / 36
![Page 32: Coding Lab: Visualizing data with ggplot2](https://reader034.vdocuments.net/reader034/viewer/2022052105/6286f25e6a33313de41cb997/html5/thumbnails/32.jpg)
lab 2: distributions
0.0
0.1
0.2
0.3
0.4
0 10 20 30sample
dens
ity
df
2
3
4
32 / 36
![Page 33: Coding Lab: Visualizing data with ggplot2](https://reader034.vdocuments.net/reader034/viewer/2022052105/6286f25e6a33313de41cb997/html5/thumbnails/33.jpg)
lab 4: grouped bar graphs
I position = "dodge2" tells R to put bars next to each other,rather than stacked on top of each other.
I Notice we use fill and not color because we’re “filling” anarea.mean_share_per_country %>%
ggplot(aes(y = country,x = mean_share,fill = percentile)) +
geom_col(position = "dodge2") +labs(x = "Mean share of national wealth",
y = "",fill = "Wealth\npercentile")
33 / 36
![Page 34: Coding Lab: Visualizing data with ggplot2](https://reader034.vdocuments.net/reader034/viewer/2022052105/6286f25e6a33313de41cb997/html5/thumbnails/34.jpg)
lab 4: grouped bar graphs
ChinaFrance
IndiaKorea
RussiaS Africa
UKUSA
0.00 0.25 0.50 0.75Mean share of national wealth
Wealthpercentile
p90p100
p99p100
34 / 36
![Page 35: Coding Lab: Visualizing data with ggplot2](https://reader034.vdocuments.net/reader034/viewer/2022052105/6286f25e6a33313de41cb997/html5/thumbnails/35.jpg)
lab 4: faceted bar graph
I Notice that we manipulate our data to the right specificationbefore making this graph
I Using facet_wrap we get a distinct graph for each timeperiod.
mean_share_per_country_with_time %>%ggplot(aes(x = country,
y = mean_share,fill = percentile)) +
geom_col(position = "dodge2") +facet_wrap(vars(time_period))
35 / 36
![Page 36: Coding Lab: Visualizing data with ggplot2](https://reader034.vdocuments.net/reader034/viewer/2022052105/6286f25e6a33313de41cb997/html5/thumbnails/36.jpg)
lab 4: faceted bar graph
1980 to 1999 2000 to present
1959 and earlier 1960 to 1979
ChinaIndiaUSA ChinaIndiaUSA
0.00.20.40.60.8
0.00.20.40.60.8
country
mea
n_sh
are
percentile
p90p100
p99p100
36 / 36