data scientist agill@mango-solutions - londonr · available in ggplot2 •use aesthetics (colour,...

Post on 29-Sep-2020

Angela Castillo-Gill

Data Scientist

agill@mango-solutions.com

@acastillogill

• Produce standard graphics and understand the range of visualisations available in ggplot2

• Use aesthetics (colour, shape, size) to add information to a visualisation

• Create analytical visualisations by groups (small multiples)

• Use R functionality to export high resolution graphics

install.packages("tidyverse")  # run once to install

library(tidyverse)

github.com/rfordatascience/tidytuesday

https://ig.ft.com/sites/visual-history-of-womens-tennis/

bit.ly/GrandSlams

grand_slams <- read_csv("grand_slams.csv")

ggplot(data = grand_slams,
       mapping = aes(
         x = tournament_date,
         y = rolling_win_count)) +
  geom_point()

• ggplot(): function used to create the skeleton structure of a graphic

• data = grand_slams: define the data that will be the basis of our plot

• mapping: how do the variables in our data relate to aesthetics in the plot?

• aes(): use the aes helper function to define how variables relate to plot elements

• geom_point(): define the type of plot by adding geom functions as layers

• ggplot() creates a plot skeleton when we define:

1. what data we want to use

2. how to map variables in the data to aesthetics in the plot

• A geom function defines the type of plot we want to create

• Geoms are added as layers with "+"

• We can include multiple elements by continuing to add them as layers
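
As a minimal sketch of the skeleton-plus-layers idea, the following uses the mpg data that ships with ggplot2 rather than the grand slam data. The variables displ and hwy and the reference line at 30 are arbitrary choices for illustration.

```r
library(ggplot2)

# Skeleton only: data and aesthetic mappings are defined, but there is no geom,
# so printing this object would show empty axes
skeleton <- ggplot(data = mpg,
                   mapping = aes(x = displ, y = hwy))

# Each geom added with "+" becomes one more layer on the same skeleton
layered <- skeleton +
  geom_point() +
  geom_hline(yintercept = 30, colour = "red")

length(skeleton$layers)  # number of layers before any geom is added
length(layered$layers)   # number of layers after adding two geoms
```

Because the skeleton is an ordinary R object, it can be stored once and reused with different geoms.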

www.rstudio.com/resources/cheatsheets/

ggplot(data = grand_slams,
       mapping = aes(
         x = tournament_date,
         y = rolling_win_count)) +
  geom_point() +
  geom_hline(aes(yintercept = 10),
             colour = "red")

• Using the mpg data (in the ggplot2 package) create a scatter plot of city miles per gallon against highway miles per gallon

• Add a smooth line to this plot

• Can you figure out how to change this to use linear regression as the smoothing method?

• An aesthetic is anything that defines the look and feel of the plot:

– x, y, z etc

– colour, fill

– shape, linetype

– size (inc. line size)

– alpha (aka. opacity)
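
A short sketch of the two ways an aesthetic can be used, again on the built-in mpg data (an assumption made so the example runs without the grand slam file). Inside aes() an aesthetic is mapped to a variable; outside aes() it is fixed to a single value.

```r
library(ggplot2)

# Mapped aesthetics: colour and size vary with variables in the data,
# so they go inside aes() and get a legend
p_mapped <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(colour = class, size = cyl), alpha = 0.6)

# Fixed aesthetics: one constant value for every point,
# so they are set outside aes() and get no legend
p_fixed <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(colour = "steelblue", shape = 17, alpha = 0.6)
```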

more_than_10_wins <-
  grand_slams %>%
  group_by(name) %>%
  filter(any(rolling_win_count > 10))

ggplot(data = more_than_10_wins,
       mapping = aes(
         x = tournament_date,
         y = rolling_win_count)) +
  geom_point(aes(colour = name))

• Using the scatter plot of cty against hwy, colour the points by drv (whether front, rear or 4 wheel drive).

• How does the plot differ if you define the colour in the ggplot function as opposed to the geom_point layer?

www.rstudio.com/resources/cheatsheets/

• Counting is done automatically for us from the raw data

ggplot(data = more_than_10_wins,
       aes(x = name)) +
  geom_bar()

• Change bar colour with fill

ggplot(data = more_than_10_wins,
       aes(x = name)) +
  geom_bar(aes(fill = grand_slam))

• Create from pre-counted data with stat = "identity"

• Side by side categories with position = "dodge"
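
The two options can be sketched on a small pre-counted table. The data frame wins below is hypothetical and invented purely for illustration; it is not taken from the grand slam data.

```r
library(ggplot2)

# Hypothetical pre-counted data: one row per bar, counts already computed
wins <- data.frame(
  player  = c("A", "A", "B", "B"),
  surface = c("clay", "grass", "clay", "grass"),
  titles  = c(3, 5, 6, 2)
)

# stat = "identity" uses the y values as bar heights instead of counting rows;
# position = "dodge" places the fill categories side by side
p_dodge <- ggplot(wins, aes(x = player, y = titles, fill = surface)) +
  geom_bar(stat = "identity", position = "dodge")

# geom_col() is shorthand for geom_bar(stat = "identity")
p_col <- ggplot(wins, aes(x = player, y = titles, fill = surface)) +
  geom_col(position = "dodge")
```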

• Use geom_path() to join points in the order they appear in the data, geom_line() to join them in order of x-axis value
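
A minimal sketch of the difference, using a hypothetical three-row data frame whose rows are deliberately not sorted by x:

```r
library(ggplot2)

# Hypothetical data: the rows are not in x order
pts <- data.frame(x = c(1, 3, 2), y = c(1, 2, 3))

# geom_path() connects the points in row order: (1,1) -> (3,2) -> (2,3)
p_path <- ggplot(pts, aes(x = x, y = y)) + geom_path()

# geom_line() connects the points in order of x: (1,1) -> (2,3) -> (3,2)
p_line <- ggplot(pts, aes(x = x, y = y)) + geom_line()
```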

more_than_10_wins <- more_than_10_wins %>%
  group_by(name) %>%
  mutate(
    first_win = min(tournament_date),
    days_since_first = as.numeric(
      tournament_date - first_win))

ggplot(data = more_than_10_wins,
       mapping =
         aes(x = days_since_first,
             y = rolling_win_count)) +
  geom_line()

• For individual lines for each of some group we need to define the group

ggplot(data = more_than_10_wins,
       mapping = aes(
         x = days_since_first,
         y = rolling_win_count)) +
  geom_line(aes(group = name))

• Create a bar chart of the number of cars in each class

• Update the plot so that you can compare the years (remember you will need to use fill, and the variable should be a factor)

• Can you update your plot so that bars for each year appear side by side?

• Create multiple plots that can be easily compared

• In ggplot2 this is faceting

• We can either facet into a grid structure (facet_grid) or a wrapped table of panels (facet_wrap)

• Most appropriate depends on the data

ggplot(data = more_than_10_wins,
       mapping =
         aes(x = days_since_first,
             y = rolling_win_count)) +
  geom_line(aes(colour = name)) +
  facet_grid(rows = vars(gender))

• rows = vars(gender): saying we want these variables to have one row for each category

• Could also use cols, or both rows and cols

• Use the vars helper function to define the variables in the data

ggplot(data = more_than_10_wins,
       mapping =
         aes(x = days_since_first,
             y = rolling_win_count)) +
  geom_line() +
  facet_wrap(vars(name))
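
The slides mention that facet_grid can take cols, or both rows and cols. A small sketch of that, next to facet_wrap, on the built-in mpg data; the choice of drv, year and manufacturer as faceting variables is an assumption for illustration only.

```r
library(ggplot2)

base_plot <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

# Grid: one row of panels per drv value, one column per year
p_grid <- base_plot +
  facet_grid(rows = vars(drv), cols = vars(year))

# Wrap: one panel per manufacturer, laid out across 5 columns
p_wrap <- base_plot +
  facet_wrap(vars(manufacturer), ncol = 5)
```

A grid tends to suit two variables with few categories each, while wrapping tends to suit a single variable with many categories.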

• Create a scatter plot of cty against hwy as previously.

• Create a faceted version of this plot, splitting by class. Try both the facet_grid and facet_wrap functions. Which is more suitable for this graphic?

• Set the labels

• Consider the scales

• Think about the theme

bbc.github.io/rcookbook/

• Use the labs function to set:

– x, y axis labels

– legend titles (colour, shape, size etc.)

– title, subtitle

– caption
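
A sketch of labs() covering each item in the list above, using the built-in mpg data. The label wording is made up for illustration.

```r
library(ggplot2)

p_labelled <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(colour = drv)) +
  labs(
    x = "Engine displacement (litres)",    # x axis label
    y = "Highway miles per gallon",        # y axis label
    colour = "Drive type",                 # legend title for the colour aesthetic
    title = "Fuel economy by engine size",
    subtitle = "Larger engines tend to use more fuel",
    caption = "Source: mpg data set in ggplot2"
  )
```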

ggplot(data = more_than_10_wins,
       mapping = aes(
         x = days_since_first,
         y = rolling_win_count)) +
  geom_line(aes(colour = name)) +
  facet_grid(rows = vars(gender)) +
  labs(x = "Number of Days Since First Title",
       y = "Total Number of Grand Slam Titles",
       colour = "Player") +
  scale_colour_viridis_d() +
  theme_bw()

• The scale_* family can set:

– Exact choice of colours

– Break points of axes

– Labels on legends and axes

– Much more!

• Some ready-made scale functions exist to help (e.g. scale_colour_viridis_d)
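
A sketch of setting axis breaks, exact colours and legend labels with scale functions, on the built-in mpg data. The colour choices and label text are arbitrary assumptions.

```r
library(ggplot2)

p_scaled <- ggplot(mpg, aes(x = displ, y = hwy, colour = drv)) +
  geom_point() +
  # Break points of the x axis
  scale_x_continuous(breaks = seq(2, 7, by = 1)) +
  # Exact colours and legend labels for the three drv categories
  scale_colour_manual(
    name = "Drive type",
    breaks = c("4", "f", "r"),
    values = c("4" = "darkorange", "f" = "steelblue", "r" = "forestgreen"),
    labels = c("Four-wheel", "Front-wheel", "Rear-wheel")
  )
```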

• The theme function sets:

– backgrounds & borders

– grid lines

– axis text rotation

– legend position

– title positions

– Over 80 graphic elements!

• A selection of ready-made themes is available (e.g. theme_bw)
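
A sketch that starts from a ready-made theme and then adjusts a few individual elements with theme(), on the built-in mpg data. The particular settings are illustrative choices, not recommendations from the talk.

```r
library(ggplot2)

p_themed <- ggplot(mpg, aes(x = manufacturer, y = hwy, colour = drv)) +
  geom_point() +
  labs(title = "Highway mileage by manufacturer") +
  theme_minimal() +                                     # ready-made starting point
  theme(
    legend.position = "bottom",                         # legend position
    axis.text.x = element_text(angle = 45, hjust = 1),  # axis text rotation
    panel.grid.minor = element_blank(),                 # remove minor grid lines
    plot.title = element_text(hjust = 0.5)              # centre the title
  )
```

The call to theme() comes after theme_minimal() because a complete theme added later would overwrite the individual settings.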

• Use the function that reflects the file type (png, jpeg, pdf, etc.)

• Control:

– Width & Height

– Quality (resolution)

• Need to control graphics devices

png(filename = "TotalWinsByPlayer.png",
    width = 600,
    height = 350,
    res = 100)
# Code to create plot goes here
dev.off()

• png(): open a connection to a new graphics device (place to send your plot)

• Code to create plot: create the plot!

• dev.off(): close the connection to the plot
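
A runnable sketch of the open, draw, close sequence at a higher resolution, using the built-in mpg data and a temporary file. The file names, the 6 x 3.5 inch size and the 300 pixels-per-inch setting are assumptions for illustration. The ggsave() call at the end is ggplot2's own wrapper around the same steps, shown as an alternative rather than as the approach used in the talk.

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

out_file <- file.path(tempdir(), "example_plot.png")

# Open the device: size in inches and res in pixels per inch,
# which gives an image of 1800 x 1050 pixels
png(filename = out_file, width = 6, height = 3.5, units = "in", res = 300)

# Draw the plot; print() is needed when the code runs inside a function,
# loop, or sourced script, where plots are not printed automatically
print(p)

# Close the device so the file is written to disk
dev.off()

# Alternative: ggsave() opens the device, draws the plot and closes the device
ggsave_file <- file.path(tempdir(), "example_plot_ggsave.png")
ggsave(filename = ggsave_file, plot = p, width = 6, height = 3.5, dpi = 300)
```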

• Using one of the graphics you have created today, set the labels to be appropriate for the graphic

• Export a png of your graphic

Cheat Sheet

www.rstudio.com/resources/cheatsheets

Practice Data

github.com/rfordatascience/tidytuesday

In Production Example

bbc.github.io/rcookbook/

Aimee Gott

agott@mango-solutions.com
