ggplot2 and grammar of graphics - data - data + science and violin plot.pdf · ggplot2 and grammar...

44
ggplot2 and Grammar of Graphics Andrew Zieffler

Upload: lyque

Post on 31-Mar-2018

244 views

Category:

Documents


2 download

TRANSCRIPT

Page 1: ggplot2 and Grammar of Graphics - Data - Data + Science and violin plot.pdf · ggplot2 and Grammar of Graphics ... • violin (geometric object)-position ... Violin plots of Dollars

ggplot2 and Grammar of Graphics

Andrew Zieffler

Page 2: ggplot2 and Grammar of Graphics - Data - Data + Science and violin plot.pdf · ggplot2 and Grammar of Graphics ... • violin (geometric object)-position ... Violin plots of Dollars

Plotting Data

• Plots of individual data points and aggregates

• Serve different purposes in exploratory and confirmatory analysis

• Use ggplot2 package

Page 3: ggplot2 and Grammar of Graphics - Data - Data + Science and violin plot.pdf · ggplot2 and Grammar of Graphics ... • violin (geometric object)-position ... Violin plots of Dollars

Household per capita expenditures for 7 regions of Vietnam

●●

●● ●

0

1000

2000

3000

Central Coast Central Highland Mekong Delta North Coast Northern Uplands Red River Delta South EastRegion

Dol

lars

Page 4: ggplot2 and Grammar of Graphics - Data - Data + Science and violin plot.pdf · ggplot2 and Grammar of Graphics ... • violin (geometric object)-position ... Violin plots of Dollars

• How would you create this?

Page 5: ggplot2 and Grammar of Graphics - Data - Data + Science and violin plot.pdf · ggplot2 and Grammar of Graphics ... • violin (geometric object)-position ... Violin plots of Dollars

• data

- source (“World Map”)

• violin (geometric object)

- position (x = Region, y = Dollars)

- color (Region)

• median (statistical summary)

- position ( x = Region, y = median(dollars) )

- geom ( point )

- color ( black )

- fill ( white )

- shape( diamond )

- size( small )

Page 6: ggplot2 and Grammar of Graphics - Data - Data + Science and violin plot.pdf · ggplot2 and Grammar of Graphics ... • violin (geometric object)-position ... Violin plots of Dollars

• mean (statistical summary)

- position ( x = Region, y = mean(dollars) )

- geom ( point )

- color ( black )

- fill ( black )

- shape( circle )

- size( small )

Page 7: ggplot2 and Grammar of Graphics - Data - Data + Science and violin plot.pdf · ggplot2 and Grammar of Graphics ... • violin (geometric object)-position ... Violin plots of Dollars

• Two guides that help reader understand graphic

• axis representing one-dimensional x-position

- dimension (x)

- label (“Region”)

• axis representing one-dimensional y-position

- dimension (y)

- label (“Dollars”)

Page 8: ggplot2 and Grammar of Graphics - Data - Data + Science and violin plot.pdf · ggplot2 and Grammar of Graphics ... • violin (geometric object)-position ... Violin plots of Dollars

Install ggplot2 Package

> install.packages( "ggplot2", dependencies = TRUE )

From the R command line...

‣Click the Packages tab. ‣Click Install Packages. ‣Enter ggplot2 in the text box.‣Click Install.

...or using point-and-click

Page 9: ggplot2 and Grammar of Graphics - Data - Data + Science and violin plot.pdf · ggplot2 and Grammar of Graphics ... • violin (geometric object)-position ... Violin plots of Dollars

# Load the ggplot2 library

> library(ggplot2)

The library() function loads the package so that the functions in the package are accessible. Libraries need to be loaded every R session.

# Load the vlss data

> vlss <- read.csv("http://www.tc.umn.edu/~zief0002/Data/VLSS.csv")

Page 10: ggplot2 and Grammar of Graphics - Data - Data + Science and violin plot.pdf · ggplot2 and Grammar of Graphics ... • violin (geometric object)-position ... Violin plots of Dollars

Plots are Built by Layering Components• Plots are built by layering graphical components

• The components are summed together to form the plot

• First component of first layer is ggplot()

• Contains reference to the source data (data frame) and global aesthetic mappings

Page 11: ggplot2 and Grammar of Graphics - Data - Data + Science and violin plot.pdf · ggplot2 and Grammar of Graphics ... • violin (geometric object)-position ... Violin plots of Dollars

> ggplot(data = vlss, aes(x = Region, y = Dollars, color = Region))

The aes= argument sets the aesthetic mapping.

The data= argument indicates the source data frame.

Page 12: ggplot2 and Grammar of Graphics - Data - Data + Science and violin plot.pdf · ggplot2 and Grammar of Graphics ... • violin (geometric object)-position ... Violin plots of Dollars

Aesthetic Mappings• Aesthetic mappings define how graphical elements are

visually perceived when these elements vary

• Define x-dimension (predictor), y-dimension (response), size, color, fill, groupings, etc.

• Each aesthetic can be mapped to a variable or set to a constant value

• Aesthetics can be set globally (in ggplot() function) or locally (in a specific layer)

• Specified with aes()

Page 13: ggplot2 and Grammar of Graphics - Data - Data + Science and violin plot.pdf · ggplot2 and Grammar of Graphics ... • violin (geometric object)-position ... Violin plots of Dollars

> ggplot(data = vlss, aes(x = Region, y = Dollars, color = Region) ) + geom_violin() + geom_jitter(alpha = 0.3)

Global Aesthetic Mappings

Aesthetic mappings inside the ggplot() layer are applied to every layer.

0

1000

2000

3000

Central Coast Central Highland Mekong Delta North Coast Northern Uplands Red River Delta South EastRegion

Dol

lars

Page 14: ggplot2 and Grammar of Graphics - Data - Data + Science and violin plot.pdf · ggplot2 and Grammar of Graphics ... • violin (geometric object)-position ... Violin plots of Dollars

Local Aesthetic Mappings

Aesthetic mappings inside a particular layer are only applied to that layer.

> ggplot(data = vlss, aes(x = Region, y = Dollars) ) + geom_violin( aes(color = Region) ) + geom_jitter(alpha = 0.3)

0

1000

2000

3000

Central Coast Central Highland Mekong Delta North Coast Northern Uplands Red River Delta South EastRegion

Dol

lars

Page 15: ggplot2 and Grammar of Graphics - Data - Data + Science and violin plot.pdf · ggplot2 and Grammar of Graphics ... • violin (geometric object)-position ... Violin plots of Dollars

Components of Plotting in ggplot2

• Geometric objects

• Statistical transformations

• Scales

• Coordinate systems

• Faceting

• Position adjustments

Page 16: ggplot2 and Grammar of Graphics - Data - Data + Science and violin plot.pdf · ggplot2 and Grammar of Graphics ... • violin (geometric object)-position ... Violin plots of Dollars

Geometric Objects

• Features drawn on plot (e.g., lines, points)

• Specified using prefix geom_ and suffix that names feature to be plotted

• Points specified with geom_point()

• Lines specified with geom_line()

• Violins specified with geom_violin()

Page 17: ggplot2 and Grammar of Graphics - Data - Data + Science and violin plot.pdf · ggplot2 and Grammar of Graphics ... • violin (geometric object)-position ... Violin plots of Dollars

> ggplot(data = vlss, aes(x = Region, y = Dollars, color = Region)) + geom_violin()

The + adds another layer.

The geom_violin() function adds the geometric object of violins using the global data and aesthetic mapping.

Page 18: ggplot2 and Grammar of Graphics - Data - Data + Science and violin plot.pdf · ggplot2 and Grammar of Graphics ... • violin (geometric object)-position ... Violin plots of Dollars

> ggplot(data = vlss, aes(x = Region, y = Dollars, color = Region)) + geom_violin(color = "black", fill = "steelblue")

Notice the quotation marks...color names are strings.

The color= argument sets the color for the outline in this layer.

Aesthetic mappings that are fixed to a particular value (do not vary), rather, do not need to be enclosed in the aes() function.

The fill= argument sets the fill color for this layer.

Note also that the local commands override the global commands

Page 19: ggplot2 and Grammar of Graphics - Data - Data + Science and violin plot.pdf · ggplot2 and Grammar of Graphics ... • violin (geometric object)-position ... Violin plots of Dollars

Statistical Transformations

• Used for plotting statistics/summaries

• e.g., mean of the response at fixed levels of the predictor

• Specified using prefix stat_ and suffix that names desired transformation

• Means, medians, and other summary statistics specified with stat_summary()

• Regression models specified with stat_smooth()

Page 20: ggplot2 and Grammar of Graphics - Data - Data + Science and violin plot.pdf · ggplot2 and Grammar of Graphics ... • violin (geometric object)-position ... Violin plots of Dollars

> ggplot(data = vlss, aes(x = Region, y = Dollars, color = Region)) + geom_violin() + stat_summary(fun.y = mean, geom = "point")

fun.y= takes the function to be applied to the y-dimension for each value of x

geom="point" places a point at each computed summary.

How can we change the color, plotting character, or size (or a combination) of the mean points?

Page 21: ggplot2 and Grammar of Graphics - Data - Data + Science and violin plot.pdf · ggplot2 and Grammar of Graphics ... • violin (geometric object)-position ... Violin plots of Dollars

> ggplot(data = vlss, aes(x = Region, y = Dollars, color = Region)) + geom_violin() + stat_summary(

fun.y = mean, geom = "point", color = "black", fill = "black", pch = 21, size = 2)

The pch= argument sets the plotting character.

The size= argument sets the point size. The default size is 4.

Page 22: ggplot2 and Grammar of Graphics - Data - Data + Science and violin plot.pdf · ggplot2 and Grammar of Graphics ... • violin (geometric object)-position ... Violin plots of Dollars

> ggplot(data = vlss, aes(x = Region, y = Dollars, color = Region)) + stat_summary(

fun.y = mean, geom = "point", color = "black", fill = "black", pch = 21, size = 2) +

geom_violin()

> ggplot(data = vlss, aes(x = Region, y = Dollars, color = Region)) + geom_violin() + stat_summary(

fun.y = mean, geom = "point", color = "black", fill = "black", pch = 21, size = 2)

Order of the layers matters.

Page 23: ggplot2 and Grammar of Graphics - Data - Data + Science and violin plot.pdf · ggplot2 and Grammar of Graphics ... • violin (geometric object)-position ... Violin plots of Dollars

> ggplot(data = vlss, aes(x = Region, y = Dollars, color = Region)) + geom_violin() + stat_summary(fun.y = mean, geom = "point", color = "black",

fill = "black", pch = 21, size = 2) +stat_summary(fun.y = median, geom = "point", color = "black", fill = "white", pch = 23, size = 2)

Add the median household per capita expenditures as a small, white diamond

Page 24: ggplot2 and Grammar of Graphics - Data - Data + Science and violin plot.pdf · ggplot2 and Grammar of Graphics ... • violin (geometric object)-position ... Violin plots of Dollars

Faceting

• Creates separate graphic for each subgroup of subjects

• facet_wrap() displays the graphics conditioned on a single predictor

• facet_grid() displays the graphics conditioned on multiple predictors

Page 25: ggplot2 and Grammar of Graphics - Data - Data + Science and violin plot.pdf · ggplot2 and Grammar of Graphics ... • violin (geometric object)-position ... Violin plots of Dollars

● ● ● ● ●●

●●

● ●

0 1

0

1000

2000

3000

Central CoastCentral HighlandMekong DeltaNorth CoastNorthern UplandsRed River DeltaSouth East Central CoastCentral HighlandMekong DeltaNorth CoastNorthern UplandsRed River DeltaSouth EastRegion

Dol

lars

The violin plots of Dollars by Region are conditioned on Urbanicity.

Rural Households Urban Households

Page 26: ggplot2 and Grammar of Graphics - Data - Data + Science and violin plot.pdf · ggplot2 and Grammar of Graphics ... • violin (geometric object)-position ... Violin plots of Dollars

~ sets the conditioning predictor

> ggplot(data = vlss, aes(x = Region, y = Dollars, color = Region)) + geom_violin() + stat_summary(fun.y = mean, geom = "point", color = "black",

fill = "black", pch = 21, size = 2) +stat_summary(fun.y = median, geom = "point", color = "black", fill = "white", pch = 23, size = 2) +facet_wrap( ~ Urban )

Page 27: ggplot2 and Grammar of Graphics - Data - Data + Science and violin plot.pdf · ggplot2 and Grammar of Graphics ... • violin (geometric object)-position ... Violin plots of Dollars

nrow= (and/or ncol=) sets the number of rows or columns in the plotting area

> ggplot(data = vlss, aes(x = Region, y = Dollars, color = Region)) + geom_violin() + stat_summary(fun.y = mean, geom = "point", color = "black",

fill = "black", pch = 21, size = 2) +stat_summary(fun.y = median, geom = "point", color = "black", fill = "white", pch = 23, size = 2) +facet_wrap( ~ Urban, nrow = 2)

● ● ● ● ● ●●

● ● ●●

●●

0

10

1000

2000

3000

0

1000

2000

3000

Central Coast Central Highland Mekong Delta North Coast Northern Uplands Red River Delta South EastRegion

Dol

lars

Page 28: ggplot2 and Grammar of Graphics - Data - Data + Science and violin plot.pdf · ggplot2 and Grammar of Graphics ... • violin (geometric object)-position ... Violin plots of Dollars

> ggplot(data = vlss, aes(x = Urban, y = Dollars, color = Urban)) + geom_violin() + stat_summary(fun.y = mean, geom = "point", color = "black",

fill = "black", pch = 21, size = 2) +stat_summary(fun.y = median, geom = "point", color = "black", fill = "white", pch = 23, size = 2) +facet_wrap(~ Region)

Violin plots of Dollars by Urbanicity conditioned on Region.

Page 29: ggplot2 and Grammar of Graphics - Data - Data + Science and violin plot.pdf · ggplot2 and Grammar of Graphics ... • violin (geometric object)-position ... Violin plots of Dollars

●●

● ●●

●●

●● ●

Central Coast Central Highland Mekong Delta

North Coast Northern Uplands Red River Delta

South East

0

1000

2000

3000

0

1000

2000

3000

0

1000

2000

3000

0.00 0.25 0.50 0.75 1.00Urban

Dol

lars

0

Urban

Urban is an integer and is being treated as such.

Page 30: ggplot2 and Grammar of Graphics - Data - Data + Science and violin plot.pdf · ggplot2 and Grammar of Graphics ... • violin (geometric object)-position ... Violin plots of Dollars

> vlss$Urban <- factor(vlss$Urban)

> ggplot(data = vlss, aes(x = Urban, y = Dollars, color = Urban)) + geom_violin() + stat_summary(fun.y = mean, geom = "point", color = "black",

fill = "black", pch = 21, size = 2) +stat_summary(fun.y = median, geom = "point", color = "black", fill = "white", pch = 23, size = 2) +facet_wrap(~ Region)

Coerce Urban into a factor.

●●

● ●●

●●

●● ●

Central Coast Central Highland Mekong Delta

North Coast Northern Uplands Red River Delta

South East

0

1000

2000

3000

0

1000

2000

3000

0

1000

2000

3000

0 1Urban

Dol

lars Urban

01

Page 31: ggplot2 and Grammar of Graphics - Data - Data + Science and violin plot.pdf · ggplot2 and Grammar of Graphics ... • violin (geometric object)-position ... Violin plots of Dollars

Better Labels

> vlss$Urban <- factor(vlss$Urban, labels = c("Rural", "Urban"))

> ggplot(data = vlss, aes(x = Urban, y = Dollars, color = Urban)) + geom_violin() + stat_summary(fun.y = mean, geom = "point", color = "black",

fill = "black", pch = 21, size = 2) +stat_summary(fun.y = median, geom = "point", color = "black", fill = "white", pch = 23, size = 2) +facet_wrap(~ Region)

Page 32: ggplot2 and Grammar of Graphics - Data - Data + Science and violin plot.pdf · ggplot2 and Grammar of Graphics ... • violin (geometric object)-position ... Violin plots of Dollars

> ggplot(data = vlss, aes(x = Urban, y = Dollars, color = Urban)) + geom_violin() + … + scale_color_manual( values = c("#E69F00", "#56B4E9") )

Fine-Tuning the Color

scale_color_manual() allows you to manually set the attributes associated with the color aesthetic.

The values= argument sets the color values for each level of the factor.

RGB or hex values can be used in values= argument of scale_fill_manual() or scale_color_manual().

Page 33: ggplot2 and Grammar of Graphics - Data - Data + Science and violin plot.pdf · ggplot2 and Grammar of Graphics ... • violin (geometric object)-position ... Violin plots of Dollars

●●

● ●●

●●

●● ●

Central Coast Central Highland Mekong Delta

North Coast Northern Uplands Red River Delta

South East

0

1000

2000

3000

0

1000

2000

3000

0

1000

2000

3000

Rural UrbanUrban

Dol

lars Urban

RuralUrban

scale_fill_manual() can be used to manually set the colors when the fill= argument is used.

Page 34: ggplot2 and Grammar of Graphics - Data - Data + Science and violin plot.pdf · ggplot2 and Grammar of Graphics ... • violin (geometric object)-position ... Violin plots of Dollars

Pre-selected Color Palettes

Fill Scale Color Scale Description

scale_fill_hue() scale_color_hue()Colors evenly spaced around the color wheel

scale_fill_grey() scale_color_grey() Grey scale palette

scale_fill_brewer() scale_color_brewer() ColorBrewer palettes

Page 35: ggplot2 and Grammar of Graphics - Data - Data + Science and violin plot.pdf · ggplot2 and Grammar of Graphics ... • violin (geometric object)-position ... Violin plots of Dollars

0

500

1000

1500

2000

2500

3000

●●●●●

●●●

●●

●●

●●

●●

●●●

●●●

●●

●●

●●

● ●●

●●●

●●●

●●●●●●●●●

●●

●●●●●●

●●

●●●

●●

●●

●●

●●●

●●●●●●●●●

●●●●●

●●●●

●●●●

●●●●●

●●●●●●●●

●●●

●●

●●●●●●●●●●●●●●●●

●●●●●●●● ●

●●

●●

●●●●

●●●

●●

●●●●

●●

●●

●●●●●

●●●

●●●●

●●

●●

Central Coast Central Highland Mekong Delta North Coast Northern Uplands Red River Delta South EastRegion

Dol

lars

# Default palette

> ggplot(data = vlss, aes(x = Region, y = Dollars, fill = Urban)) + geom_boxplot() + scale_fill_hue(guide = FALSE)

The guide=FALSE argument removes the legend.

Page 36: ggplot2 and Grammar of Graphics - Data - Data + Science and violin plot.pdf · ggplot2 and Grammar of Graphics ... • violin (geometric object)-position ... Violin plots of Dollars

# Grey scale palette

> ggplot(data = vlss, aes(x = Region, y = Dollars, fill = Urban)) + geom_boxplot() + scale_fill_grey(guide = FALSE)

0

500

1000

1500

2000

2500

3000

●●●●●

●●●

●●

●●

●●

●●

●●●

●●●

●●

●●

●●

● ●●

●●●

●●●

●●●●●●●●●

●●

●●●●●●

●●

●●●

●●

●●

●●

●●●

●●●●●●●●●

●●●●●

●●●●

●●●●

●●●●●

●●●●●●●●

●●●

●●

●●●●●●●●●●●●●●●●

●●●●●●●● ●

●●

●●

●●●●

●●●

●●

●●●●

●●

●●

●●●●●

●●●

●●●●

●●

●●

Central Coast Central Highland Mekong Delta North Coast Northern Uplands Red River Delta South EastRegion

Dol

lars

Page 37: ggplot2 and Grammar of Graphics - Data - Data + Science and violin plot.pdf · ggplot2 and Grammar of Graphics ... • violin (geometric object)-position ... Violin plots of Dollars

0

500

1000

1500

2000

2500

3000

●●●●●

●●●

●●

●●

●●

●●

●●●

●●●

●●

●●

●●

● ●●

●●●

●●●

●●●●●●●●●

●●

●●●●●●

●●

●●●

●●

●●

●●

●●●

●●●●●●●●●

●●●●●

●●●●

●●●●

●●●●●

●●●●●●●●

●●●

●●

●●●●●●●●●●●●●●●●

●●●●●●●● ●

●●

●●

●●●●

●●●

●●

●●●●

●●

●●

●●●●●

●●●

●●●●

●●

●●

Central Coast Central Highland Mekong Delta North Coast Northern Uplands Red River Delta South EastRegion

Dol

lars

# ColorBrewer palette (http://colorbrewer2.org)

> ggplot(data = vlss, aes(x = Region, y = Dollars, fill = factor(Urban))) + geom_boxplot() + scale_fill_brewer(guide = FALSE, palette = "Set2")

The palette= sets the pre-specified color palette.

Page 38: ggplot2 and Grammar of Graphics - Data - Data + Science and violin plot.pdf · ggplot2 and Grammar of Graphics ... • violin (geometric object)-position ... Violin plots of Dollars

Accent

Dark2

Paired

Pastel1

Pastel2

Set1

Set2

Set3

Qualitative Color Palettes

Page 39: ggplot2 and Grammar of Graphics - Data - Data + Science and violin plot.pdf · ggplot2 and Grammar of Graphics ... • violin (geometric object)-position ... Violin plots of Dollars

Palettes for Color-Blindness

• About eight percent of males and one-half percent of females have some form of color vision deficiency (good chance that someone in your audience will be one of these people)

• Many different forms of color blindness

• Most common forms of color vision deficiency.

• Color and grey-scale palettes have been developed for many of these

Page 40: ggplot2 and Grammar of Graphics - Data - Data + Science and violin plot.pdf · ggplot2 and Grammar of Graphics ... • violin (geometric object)-position ... Violin plots of Dollars

Palettes for Color-Blindness

• http://jfly.iam.u-tokyo.ac.jp/color/

• Lumley, T. (2006). Color-coding and color blindness in statistical graphics. Statistical computing and graphics newsletter. http://www.amstat-online.org/sections/graphics/newsletter/Volumes/v172.pdf

Page 41: ggplot2 and Grammar of Graphics - Data - Data + Science and violin plot.pdf · ggplot2 and Grammar of Graphics ... • violin (geometric object)-position ... Violin plots of Dollars

# Setting the labels on the x- and y-axis

> ggplot(data = vlss, aes(x = Region, y = Dollars, fill = Urban)) + geom_boxplot() + scale_fill_brewer(palette = "Set2") + ylab("Expenditures per Capita (in U.S. dollars)")

Fine-Tuning the Axes

Labels take strings as their input.

xlab() can be used to change the label on the x-axis, and ylab() is used to change the label on the y-axis.

Page 42: ggplot2 and Grammar of Graphics - Data - Data + Science and violin plot.pdf · ggplot2 and Grammar of Graphics ... • violin (geometric object)-position ... Violin plots of Dollars

The first value is the minimum.

xlim() and ylim() are used to set the limits on the x-axis and y-axis respectively.

The second value is the maximum.

# Setting the limits on the x- and y-axis

> ggplot(data = vlss, aes(x = Region, y = Dollars, fill = Urban)) + geom_boxplot() + scale_fill_brewer(palette = "Set2") + ylab("Expenditures per Capita (in U.S. dollars)") + ylim(c(0, 5000))

Page 43: ggplot2 and Grammar of Graphics - Data - Data + Science and violin plot.pdf · ggplot2 and Grammar of Graphics ... • violin (geometric object)-position ... Violin plots of Dollars

> ggplot(data = vlss, aes(x = Region, y = Dollars, fill = Urban)) + geom_boxplot() + scale_fill_brewer(palette = "Set2") + ylab("Expenditures per Capita (in U.S. dollars)") + theme(legend.position = "none")

Fine-Tuning Other Elements of the Plot

Use the theme() function and indicate the elements and changes via the function's arguments. See http://docs.ggplot2.org/0.9.2.1/theme.html

Another way to remove the legend using theme().

Page 44: ggplot2 and Grammar of Graphics - Data - Data + Science and violin plot.pdf · ggplot2 and Grammar of Graphics ... • violin (geometric object)-position ... Violin plots of Dollars

> ggplot(data = vlss, aes(x = Region, y = Dollars, fill = Urban)) + geom_boxplot() + scale_fill_brewer(palette = "Set2") + ylab("Expenditures per Capita (in U.S. dollars)") + theme_bw()

theme_bw() is a pre-built theme. It changes the colors of many of the elements (major and minor grid lines, background color, etc.) from the default settings.

You can build your own themes and use them.• http://drunks-and-lampposts.com/2012/10/02/clegg-vs-pleb-an-xkcd-esque-chart/• https://github.com/jrnold/ggthemes