Data Manipulation
Golden Rules of Writing Fast R
Vectorize your operations as much as possible
Minimize the number of vectors to operate on
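A minimal sketch of the first rule, using a small made-up vector x (not from the slides). The loop and the vectorized expression compute the same squares, but the vectorized form passes the whole vector to R in a single operation.

```r
# Hypothetical data for illustration only
x <- c(1, 2, 3, 4, 5)

# Loop version: handles one element at a time
squares_loop <- numeric(length(x))
for (i in seq_along(x)) {
  squares_loop[i] <- x[i]^2
}

# Vectorized version: one operation on the whole vector
squares_vec <- x^2
```

On vectors this small the difference is negligible; the gap tends to grow with the length of the vector.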
The Apply Family
apply, lapply, sapply, vapply, mapply, tapply
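As a rough orientation, the sketch below calls each member of the family once. The inputs m and scores are hypothetical, chosen only to keep the results easy to check by hand.

```r
# Hypothetical inputs for illustration only
m <- matrix(1:6, nrow = 2)                    # 2 x 3 matrix, filled column by column
scores <- list(a = c(1, 2, 3), b = c(4, 5))

row_sums   <- apply(m, 1, sum)                # over rows of a matrix (MARGIN = 1)
means_list <- lapply(scores, mean)            # always returns a list
means_vec  <- sapply(scores, mean)            # simplifies to a named vector when it can
means_safe <- vapply(scores, mean, numeric(1))  # like sapply, with a declared result type
pair_sums  <- mapply(function(a, b) a + b, 1:3, 4:6)  # walks several vectors in parallel
by_group   <- tapply(c(19, 24, 19), c("CS", "ORIE", "CS"), mean)  # one result per group
```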
The Data Pipeline
Question: What are some ways in which data can be “messy”?
Data Manipulation Tools
1. R base functions
2. plyr package
3. dplyr package
4. sqldf - for SQL users
5. data.table package
Manipulation Tool #1: Filtering
What it does: Grabs a subset of the rows in a data frame with a condition.
Name     Age  Major
Amit     19   Computer Science
Dae Won  24   ORIE
Chase    19   Information Science
Jared    19   Computer Science
Kenta    20   Computer Science
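A minimal sketch of filtering on the table above, in base R. The data frame name students and the choice of condition are assumptions made for illustration.

```r
# The example table from the slide
students <- data.frame(
  Name  = c("Amit", "Dae Won", "Chase", "Jared", "Kenta"),
  Age   = c(19, 24, 19, 19, 20),
  Major = c("Computer Science", "ORIE", "Information Science",
            "Computer Science", "Computer Science"),
  stringsAsFactors = FALSE
)

# Keep only the rows where the condition is TRUE; all columns are kept
cs_students <- students[students$Major == "Computer Science", ]

# Equivalent with dplyr, if it is installed:
# cs_students <- dplyr::filter(students, Major == "Computer Science")
```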
Manipulation Tool #2: Subsetting
What it does: Grabs a subset of the columns in a data frame.
Name     Age  Major
Amit     19   Computer Science
Dae Won  24   ORIE
Chase    19   Information Science
Jared    19   Computer Science
Kenta    20   Computer Science
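A minimal sketch of subsetting columns in base R. The data frame name students and the choice of columns are assumptions made for illustration.

```r
# The example table from the slide
students <- data.frame(
  Name  = c("Amit", "Dae Won", "Chase", "Jared", "Kenta"),
  Age   = c(19, 24, 19, 19, 20),
  Major = c("Computer Science", "ORIE", "Information Science",
            "Computer Science", "Computer Science"),
  stringsAsFactors = FALSE
)

# Keep only the named columns; all rows are kept
name_major <- students[, c("Name", "Major")]

# Equivalent with dplyr, if it is installed:
# name_major <- dplyr::select(students, Name, Major)
```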
Manipulation Tool #3: Combining
What it does: Joins together two data frames, either row-wise or column-wise.
How to do it: Use R's rbind (row-wise) and cbind (column-wise) functions.
First data frame:
Name
Amit
Dae Won
Chase
Jared
Kenta
Second data frame:
Age  Major
19   Computer Science
24   ORIE
19   Information Science
19   Computer Science
20   Computer Science
Combined column-wise:
Name     Age  Major
Amit     19   Computer Science
Dae Won  24   ORIE
Chase    19   Information Science
Jared    19   Computer Science
Kenta    20   Computer Science
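The tables above can be reproduced with a short sketch. The variable names (names_df, details_df, and so on) are hypothetical; rbind is shown by splitting the combined table into two pieces and stacking them again.

```r
# The two pieces from the slide
names_df <- data.frame(
  Name = c("Amit", "Dae Won", "Chase", "Jared", "Kenta"),
  stringsAsFactors = FALSE
)
details_df <- data.frame(
  Age   = c(19, 24, 19, 19, 20),
  Major = c("Computer Science", "ORIE", "Information Science",
            "Computer Science", "Computer Science"),
  stringsAsFactors = FALSE
)

# Column-wise: rows are matched by position, so both pieces need the same row order
students <- cbind(names_df, details_df)

# Row-wise: both pieces need the same columns
top       <- students[1:2, ]
bottom    <- students[3:5, ]
restacked <- rbind(top, bottom)
```

Note that cbind pairs rows purely by position; it does not check that the rows refer to the same person.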
Manipulation Tool #4: Joining
What it does: Joins together two data frames, combining rows that have the same value for a column.
How to do it: Use dplyr’s join functions (such as inner_join and left_join) or the merge function in base R.
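A minimal sketch of a join in base R. The lookup table abbrevs is a hypothetical second data frame, built from the major abbreviations used later in the deck; the two tables are matched on their shared Major column.

```r
students <- data.frame(
  Name  = c("Amit", "Dae Won", "Chase", "Jared", "Kenta"),
  Major = c("Computer Science", "ORIE", "Information Science",
            "Computer Science", "Computer Science"),
  stringsAsFactors = FALSE
)

# Hypothetical lookup table: one row per major
abbrevs <- data.frame(
  Major  = c("Computer Science", "ORIE", "Information Science"),
  Abbrev = c("CS", "OR", "IS"),
  stringsAsFactors = FALSE
)

# Rows are paired wherever the Major values are equal
joined <- merge(students, abbrevs, by = "Major")

# Equivalent with dplyr, if it is installed:
# joined <- dplyr::inner_join(students, abbrevs, by = "Major")
```

By default merge sorts the result by the key column, so the row order may differ from the original students table.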
Manipulation Tool #5: Summarizing
What it does: Computes aggregate data about the data frame, such as the number of elements with a given property.
How to do it: Use dplyr’s summarize function (usually with group_by) or the aggregate function in base R.
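A minimal sketch of summarizing in base R, reusing the students table from the earlier slides. The choice of summaries (a count and a mean age per major) is an assumption made for illustration.

```r
students <- data.frame(
  Name  = c("Amit", "Dae Won", "Chase", "Jared", "Kenta"),
  Age   = c(19, 24, 19, 19, 20),
  Major = c("Computer Science", "ORIE", "Information Science",
            "Computer Science", "Computer Science"),
  stringsAsFactors = FALSE
)

# Number of students in each major
counts <- table(students$Major)

# Mean age within each major
mean_age <- aggregate(Age ~ Major, data = students, FUN = mean)

# Equivalent with dplyr, if it is installed:
# dplyr::summarize(dplyr::group_by(students, Major),
#                  n = dplyr::n(), mean_age = mean(Age))
```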
tidyr Library
What it does: Reshapes a data.frame between wide and long formats, and splits or merges columns.
How to do it: gather( ), spread( ), unite( ), separate( )
Column Manipulation 1: gather( ) and spread( )
gather(data, Year, GPA, Freshman:Junior) turns the wide table
Name      Freshman  Sophomore  Junior
Hermione  4.2       4.3        4.1
into the long table
Name      Year       GPA
Hermione  Freshman   4.2
Hermione  Sophomore  4.3
Hermione  Junior     4.1
spread(data, Year, GPA) turns the long table back into the wide one.
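The round trip can be sketched as follows, assuming tidyr is installed. The variable names wide, long and wide_again are hypothetical. Recent tidyr versions also offer pivot_longer() and pivot_wider() for the same job; gather() and spread() are used here to match the slides.

```r
library(tidyr)

# The wide table from the slide
wide <- data.frame(
  Name = "Hermione", Freshman = 4.2, Sophomore = 4.3, Junior = 4.1,
  stringsAsFactors = FALSE
)

# Wide to long: the column names become values of Year, the cells become GPA
long <- gather(wide, Year, GPA, Freshman:Junior)

# Long to wide: each distinct Year becomes its own column again
# (the new columns may come back in a different order than the original)
wide_again <- spread(long, Year, GPA)
```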
Column Manipulation 2: separate( ) and unite( )
separate(data, TA, c("Name", "Major"), sep = ": ") splits the single column
TA
Amit: CS
Dae Won: OR
Chase: IS
Jared: CS
Kenta: CS
into two columns
Name     Major
Amit     CS
Dae Won  OR
Chase    IS
Jared    CS
Kenta    CS
unite(data, "TA", c(Name, Major), sep = ": ") merges the two columns back into one.
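A minimal sketch of the same round trip, assuming tidyr is installed. The variable names tas, split_tas and joined_tas are hypothetical.

```r
library(tidyr)

# The single-column table from the slide
tas <- data.frame(
  TA = c("Amit: CS", "Dae Won: OR", "Chase: IS", "Jared: CS", "Kenta: CS"),
  stringsAsFactors = FALSE
)

# Split TA into Name and Major at the ": " separator
split_tas <- separate(tas, TA, c("Name", "Major"), sep = ": ")

# Paste Name and Major back together into a single TA column
joined_tas <- unite(split_tas, "TA", c(Name, Major), sep = ": ")
```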
Using the Force
Technique #1: Binning
What it does: Makes continuous data categorical by lumping ranges of data into discrete “levels.”
How to do it: Create a new categorical variable. Assign each category using logical vectors on the numerical column.
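A minimal sketch of binning with logical vectors, using the ages from the students table. The cutoff of 20 and the level names are assumptions made for illustration. Base R's cut() function is shown as an alternative that produces the same bins.

```r
ages <- c(19, 24, 19, 19, 20)

# Logical vectors pick out the rows that belong to each level
age_group <- character(length(ages))
age_group[ages < 20]  <- "Under 20"
age_group[ages >= 20] <- "20 and over"
age_group <- factor(age_group, levels = c("Under 20", "20 and over"))

# Alternative: cut() with left-closed intervals [-Inf, 20) and [20, Inf)
age_group_cut <- cut(ages, breaks = c(-Inf, 20, Inf), right = FALSE,
                     labels = c("Under 20", "20 and over"))
```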
Technique #2: Normalizing
Common transformations: square root, log
Others include cube root, reciprocal, square, cube...
What it does: Reshapes skewed data so that its distribution is closer to a bell curve (Gaussian).
How to do it: Apply a function, or transformation, to each column (dplyr mutate).
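A minimal sketch of applying transformations to a column. The data frame df and its value column are hypothetical, with made-up right-skewed numbers; which transformation works best depends on the data.

```r
# Hypothetical right-skewed values, for illustration only
df <- data.frame(value = c(1, 10, 100, 1000, 10000))

# Add transformed copies of the column
df$log_value  <- log10(df$value)
df$sqrt_value <- sqrt(df$value)

# Equivalent with dplyr, if it is installed:
# df <- dplyr::mutate(df, log_value = log10(value), sqrt_value = sqrt(value))
```

In this example the log transformation spreads the values evenly, while the raw values are bunched near zero with one large outlier.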
Technique #3: Standardizing
What it does: Rescales the data so that it has mean 0 and variance 1.
How to do it: Subtract the mean from each data point and divide by the standard deviation. Use the mean and sd functions.
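A minimal sketch of the recipe above, using the ages from the students table. Base R's scale() function is included as a cross-check, since with its default arguments it performs the same centering and scaling.

```r
ages <- c(19, 24, 19, 19, 20)

# Subtract the mean, then divide by the standard deviation
z <- (ages - mean(ages)) / sd(ages)

# scale() returns a one-column matrix, so convert it back to a plain vector
z_scale <- as.numeric(scale(ages))
```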
Technique #4: Ordering
What it does: Converts categorical data that is inherently ordered into a numerical scale.
How to do it: Assign values using logical vectors.
students$Age[which(students$Year == "SENIOR")] <- 22
students$Age[which(students$Year == "JUNIOR")] <- 21
# etc.
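A self-contained variant of the same idea, sketched with a named lookup vector so that every category is mapped in one step. The three-row students table is hypothetical, and the year-to-age values follow the snippet above.

```r
# Hypothetical data for illustration only
students <- data.frame(
  Year = c("JUNIOR", "SENIOR", "JUNIOR"),
  stringsAsFactors = FALSE
)

# One entry per category: name = category, value = number assigned to it
year_to_age <- c(JUNIOR = 21, SENIOR = 22)
students$Age <- unname(year_to_age[students$Year])

# An ordered factor keeps the ordering explicit; as.integer() gives the rank
students$YearRank <- as.integer(
  factor(students$Year, levels = c("JUNIOR", "SENIOR"), ordered = TRUE)
)
```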
Technique #5: Dummy Variables
What it does: Creates a binary variable for each category in a categorical variable.
How to do it: Create a new variable named after the category. Assign 1 to the variable when the categorical variable takes on the category.
students$Senior <- 0
students$Senior[which(students$Year == "SENIOR")] <- 1
# etc.
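To avoid repeating that pattern by hand for every category, one option is a loop over the distinct values. This is only a sketch: the three-row students table is hypothetical, and each new column is named after the raw category value.

```r
# Hypothetical data for illustration only
students <- data.frame(
  Year = c("SENIOR", "JUNIOR", "SENIOR"),
  stringsAsFactors = FALSE
)

# One 0/1 column per category, named after the category
for (level in unique(students$Year)) {
  students[[level]] <- as.integer(students$Year == level)
}
```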
Coming Up
Your assignment: Assignment 2
Next week: Visualizing, animating, and presenting data
See you then!