Learn to use dplyr (Feb 2015 Philly R User Meetup)
Learn to use dplyr package
Fan Li @ Philly R User Meetup (R<-Gang)
Demo: http://rpubs.com/lifan/phillyweather
Source code: https://github.com/lifan0127/meetup_dplyr_talk
What is dplyr
A package developed by Hadley Wickham to help
transform tabular data.
● Unified, intuitive syntax
● Fast implementation in C++
● Supports various data backends (data frames, relational databases, etc.)
install.packages("dplyr") # Version 0.4
Basic Operators (Verbs)
select(data, col.1, col.2, ...): select existing variables (columns)
filter(data, condition.1, condition.2, ...): filter table by conditions
arrange(data, col.1, col.2, ...): sort table by variables or other expressions
mutate(data, newcol = ...): create new variables
group_by() + summarize(): summarize data per group
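As a minimal sketch of the verbs above, the following uses R's built-in mtcars data set (an assumption for illustration; it is not the data used in the talk). The intermediate variable names are hypothetical.

```r
library(dplyr)

# select(): keep only three existing columns
small <- select(mtcars, mpg, cyl, wt)

# filter(): keep rows where every condition holds
efficient <- filter(small, mpg > 25, cyl == 4)

# arrange(): sort rows; desc() gives descending order
sorted <- arrange(efficient, desc(mpg))

# mutate(): add a new column computed from existing ones
with.ratio <- mutate(sorted, mpg.per.wt = mpg / wt)

# group_by() + summarize(): one summary row per group
by.cyl <- summarize(group_by(mtcars, cyl), mean.mpg = mean(mpg))

print(with.ratio)
print(by.cyl)
```

Each verb takes a data frame as its first argument and returns a new data frame, which is what makes them easy to chain.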
For a graphic explanation, see Garrett Grolemund’s talk.
Other Helper Functions
transmute
tally
top_n
summarize_each
sample_n/sample_frac
distinct
rename
slice
n_distinct
first/last/nth
Type ?function_name in R to see how to use each function.
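A few of these helpers can be sketched on the built-in mtcars data set (an assumption for illustration, not the talk's data). Only helpers whose basic behavior has stayed stable across dplyr versions are shown; the variable names are hypothetical.

```r
library(dplyr)

# rename(): new.name = old.name
renamed <- rename(mtcars, weight = wt)

# slice(): pick rows by position
top3 <- slice(mtcars, 1:3)

# transmute(): like mutate(), but keeps only the new columns
ratio <- transmute(mtcars, mpg.per.wt = mpg / wt)

# tally(): count rows per group (result column is named n)
counts <- tally(group_by(mtcars, cyl))

# n_distinct(), first(), last(), nth() work on plain vectors
n.cyl <- n_distinct(mtcars$cyl)
first.mpg <- first(mtcars$mpg)
last.mpg <- last(mtcars$mpg)
second.mpg <- nth(mtcars$mpg, 2)

print(counts)
```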
%>% Pipe Operator
data %>% function() = function(data, ...)
foo() %>% bar() = bar(foo())
Very useful for converting nested calls into a
more logical chained expression.
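To make the equivalence concrete, here is a small sketch comparing a nested call with its piped form, again using the built-in mtcars data set as a stand-in (an assumption for illustration).

```r
library(dplyr)

# Nested form: read from the inside out
nested <- arrange(filter(select(mtcars, mpg, cyl), cyl == 4), desc(mpg))

# Piped form: read top to bottom, one step per line
piped <- mtcars %>%
  select(mpg, cyl) %>%
  filter(cyl == 4) %>%
  arrange(desc(mpg))

# Both forms run the same operations in the same order
print(identical(nested, piped))
```

The piped version performs the same three steps, but the order of the code matches the order of execution.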
Example
feb.snow2 <- weather %>%
  select(Year, Month, Day, Snow) %>%          # Step 1. Select relevant variables
  filter(Year >= 1885, Month == 2) %>%        # Step 2. Filter by year and month
  group_by(Year) %>%                          # Step 3. Group by year
  summarize(
    Snow.Sum = sum(Snow, na.rm = TRUE)) %>%   # Step 4. Summarize monthly snowfall
  arrange(-Snow.Sum, -Year)                   # Step 5. Sort table by monthly snowfall/year
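The example above needs the talk's weather table. To try the same pipeline without it, the sketch below builds a tiny, made-up data frame with the same column names (the values are hypothetical and are not real Philadelphia observations).

```r
library(dplyr)

# Hypothetical toy data standing in for the real weather table
weather <- data.frame(
  Year  = c(1884, 1899, 1899, 1899, 2000, 2000, 2000),
  Month = c(2,    2,    2,    3,    2,    2,    1),
  Day   = c(1,    12,   13,   1,    5,    6,    20),
  Snow  = c(5,    10.5, NA,   2,    1,    3,    4)
)

feb.snow <- weather %>%
  select(Year, Month, Day, Snow) %>%               # keep relevant columns
  filter(Year >= 1885, Month == 2) %>%             # February, 1885 onward
  group_by(Year) %>%                               # one group per year
  summarize(Snow.Sum = sum(Snow, na.rm = TRUE)) %>% # total snow, ignoring NA
  arrange(-Snow.Sum, -Year)                        # snowiest years first

print(feb.snow)
```

With this toy input, the 1884 row and the non-February rows are filtered out, and the missing value is skipped by na.rm = TRUE, leaving one row each for 1899 and 2000.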
Demo with Philly weather data (1872-2001)
Performance Benefit
C++ implementation
Lazy evaluation
Avoid accidental, expensive operations
Usually much faster than base R. Otherwise, it will tell you
with a progress bar.
Data Backends
Supports the three most popular open source
databases (SQLite, MySQL and PostgreSQL), and
Google's BigQuery.
http://cran.r-project.org/web/packages/dplyr/vignettes/databases.html
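A rough sketch of the database workflow, using an in-memory SQLite database and the built-in mtcars data set. Note the assumptions: this follows the interface of current dplyr releases, which need the DBI, RSQLite and dbplyr packages installed; dplyr 0.4, the version covered in the talk, connected through src_sqlite() instead. The table and variable names are made up for illustration.

```r
library(dplyr)

# Open a throwaway in-memory SQLite database (assumes DBI, RSQLite, dbplyr are installed)
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

# Copy a local data frame into the database and get a remote table reference
cars.db <- copy_to(con, mtcars, name = "mtcars")

# Build a query with the usual verbs; nothing is computed yet (lazy evaluation)
query <- cars.db %>%
  filter(cyl == 4) %>%
  group_by(gear) %>%
  summarize(mean.mpg = mean(mpg, na.rm = TRUE))

# Inspect the SQL that the verbs were translated into
show_query(query)

# collect() runs the query and pulls the result into a local data frame
result <- collect(query)
print(result)

DBI::dbDisconnect(con)
```

The same verbs are used as for a local data frame; the work is translated to SQL and runs in the database only when the result is requested.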
Resources
Reference manual:
dplyr.pdf
Vignettes:
Data frames
Databases
Hybrid evaluation
Introduction to dplyr
Adding a new SQL backend
Non-standard evaluation
Two-table verbs
Window functions and grouped mutate/filter
Cheatsheet by RStudio
Date | Meeting | Title | Speaker | Link
20150122 | | Advanced Data Manipulation | Mike McCann | [Slide]
20150121 | Berkeley Institute for Data Science | Pipelines for Data Analysis | Hadley Wickham | [Video]
20150114 | RStudio Webinar | Data Wrangling with R | Garrett Grolemund | [Slide][Video][Data]
20150113 | Upstate Data Analytics | | Wallace Campbell | [Video][Data]
20141202 | Sheffield R Users Group | how to find help online, data manipulation with plyr and dplyr | Mathew Hall | [Slide]
20141126 | Budapest BI | Introduction to the dplyr R package | Romain Francois | [Slide]
20141111 | LA R users group | Benchmarking dplyr and data.table (with biggish data) | Szilard Pafka | [Slide][Data]
20141025 | ACM DataScience Camp | Data Manipulation Using R | Ram Narasimhan | [Slide][Video]
20141022 | | Becoming a data ninja with dplyr | Devin Pastoor | [Slide]
20141007 | Davis R Users' Group | dplyr: Data manipulation in R made easy | Michael A. Levy | [Slide][Video]
20140825 | RStudio Webinar | Hands-on dplyr tutorial for faster data manipulation in R | | [Slide][Video]
20140701 | USER2014 | dplyr: a grammar of data manipulation | Hadley Wickham | [Video]
20140630 | USER2014 | Data manipulation with dplyr | Hadley Wickham | [Slide][Video][Data]
20140214 | Stanford HCI Group | Expressing yourself in R | Hadley Wickham | [Slide][Data]
See updated list at: https://github.com/lifan0127/meetup_dplyr_talk
“Hadley Ecosystem”
Visualization
ggplot2, ggmap, ggvis
Data Wrangling
reshape, plyr, dplyr, tidyr
Web
rvest, httr, xml2
Other tools
stringr, lubridate, haven
https://github.com/hadley (GitHub Repo)
http://adv-r.had.co.nz/ (Advanced R Book)
http://r-pkgs.had.co.nz/ (R Packages Book)