data manipulation · manipulation tool #4: joining what it does: joins together two data frames,...

21
Data Manipulation

Upload: others

Post on 25-Jun-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Data Manipulation · Manipulation Tool #4: Joining What it does: Joins together two data frames, combining rows that have the same value for a column. How to do it: Use dplyr’s

Data Manipulation

Page 2: Data Manipulation · Manipulation Tool #4: Joining What it does: Joins together two data frames, combining rows that have the same value for a column. How to do it: Use dplyr’s

$$ Golden Rules of Writing Fast R $$

Vectorize your operations as much as

possible

Minimize the number of vectors to operate

on

Page 3: Data Manipulation · Manipulation Tool #4: Joining What it does: Joins together two data frames, combining rows that have the same value for a column. How to do it: Use dplyr’s

mapply

vapply

lapply

sapply

tapply

The Apply Family

apply

Page 5: Data Manipulation · Manipulation Tool #4: Joining What it does: Joins together two data frames, combining rows that have the same value for a column. How to do it: Use dplyr’s

Question: What are some ways in

which data can be “messy”?

Page 6: Data Manipulation · Manipulation Tool #4: Joining What it does: Joins together two data frames, combining rows that have the same value for a column. How to do it: Use dplyr’s

Data Manipulation Tools

1. R base functions2. plyr package3. dplyr package4. sqldf - for SQL users5. Data.table package

Source

Page 7: Data Manipulation · Manipulation Tool #4: Joining What it does: Joins together two data frames, combining rows that have the same value for a column. How to do it: Use dplyr’s

Manipulation Tool #1: FilteringWhat it does: Grabs a subset of the rows in a data frame with a condition

Name Age Major

Amit 19 Computer Science

Dae Won 24 ORIE

Chase 19 Information Science

Jared 19 Computer Science

Kenta 20 Computer Science

Page 8: Data Manipulation · Manipulation Tool #4: Joining What it does: Joins together two data frames, combining rows that have the same value for a column. How to do it: Use dplyr’s

Manipulation Tool #2: SubsettingWhat it does: Grabs a subset of the columns in a data frame.

Name Age Major

Amit 19 Computer Science

Dae Won 24 ORIE

Chase 19 Information Science

Jared 19 Computer Science

Kenta 20 Computer Science

Page 9: Data Manipulation · Manipulation Tool #4: Joining What it does: Joins together two data frames, combining rows that have the same value for a column. How to do it: Use dplyr’s

Manipulation Tool #3: CombiningWhat it does: Joins together two data frames, either row-wise or column-wise.How to do it: cbind and rbind operators in R.

Name

Amit

Dae Won

Chase

Jared

Kenta

Age Major

19 Computer Science

24 ORIE

19 Information Science

19 Computer Science

20 Computer Science

Name Age Major

Amit 19 Computer Science

Dae Won 24 ORIE

Chase 19 Information Science

Jared 19 Computer Science

Kenta 20 Computer Science

Page 10: Data Manipulation · Manipulation Tool #4: Joining What it does: Joins together two data frames, combining rows that have the same value for a column. How to do it: Use dplyr’s

Manipulation Tool #4: JoiningWhat it does: Joins together two data frames, combining rows that have the same value for a column.

How to do it: Use dplyr’s join method or the merge and cbind operators in R.

Page 11: Data Manipulation · Manipulation Tool #4: Joining What it does: Joins together two data frames, combining rows that have the same value for a column. How to do it: Use dplyr’s

Manipulation Tool #5: SummarizingWhat it does: Computes aggregate data about the data frame, such as the number of elements with a given property.

How to do it: Use dplyr’s summarize and aggregate methods.

Page 12: Data Manipulation · Manipulation Tool #4: Joining What it does: Joins together two data frames, combining rows that have the same value for a column. How to do it: Use dplyr’s

tidyr Library

What it does: Convert Wide data.frame to Longer data.frame

How to do it: gather( ), spread( ), unite( ), separate( )

Page 13: Data Manipulation · Manipulation Tool #4: Joining What it does: Joins together two data frames, combining rows that have the same value for a column. How to do it: Use dplyr’s

Column Manipulation 1

spread( )

gather( )

gather(data, Year, GPA, Freshman:Junior)

Name Freshman Sophomore Junior

Hermione 4.2 4.3 4.1

Name Year GPA

Hermione Freshman 4.2

Hermione Sophomore 4.3

Hermione Junior 4.1

spread(data, Year, GPA)

Page 14: Data Manipulation · Manipulation Tool #4: Joining What it does: Joins together two data frames, combining rows that have the same value for a column. How to do it: Use dplyr’s

Column Manipulation 2

separate( )

unite( )

separate(data, TA, c("Name", "Major"), sep = ": ")

TA

Amit: CS

Dae Won: OR

Chase: IS

Jared: CS

Kenta: CS

unite(data, “TA”, c(Name, Major), sep = ": ")

Name Major

Amit CS

Dae Won OR

Chase IS

Jared CS

Kenta CS

Page 16: Data Manipulation · Manipulation Tool #4: Joining What it does: Joins together two data frames, combining rows that have the same value for a column. How to do it: Use dplyr’s

Technique #1: Binning

Source

What it does: Makes continuous data categorical by lumping ranges of data into discrete “levels.”

How to do it: Create a new categorical variable. Assign each category using logical vectors on the numerical column.

Page 17: Data Manipulation · Manipulation Tool #4: Joining What it does: Joins together two data frames, combining rows that have the same value for a column. How to do it: Use dplyr’s

Technique #2: Normalizing

Source

Square root transformation Log transformation

Others include cubic root, reciprocal, square, cube... Source

What it does: Turns the data into a bell curve, or Gaussian, shape.

How to do it: Apply a function, or transformation, to each column (dplyr mutate).

Page 18: Data Manipulation · Manipulation Tool #4: Joining What it does: Joins together two data frames, combining rows that have the same value for a column. How to do it: Use dplyr’s

Technique #3: Standardizing

Source

What it does: Centers the data with mean 0 and variance 1.

How to do it: Subtract the mean from each data point and divide by the standard deviation. Use the mean and sd functions.

Page 19: Data Manipulation · Manipulation Tool #4: Joining What it does: Joins together two data frames, combining rows that have the same value for a column. How to do it: Use dplyr’s

Technique #4: Ordering

Source

What it does: Converts categorical data that is inherently ordered into a numerical scale.

How to do it: Assign values using logical vectors.

students$Age[which(students$Year==”SENIOR”)] <- 22students$Age[which(students$Year==”JUNIOR”)] <- 21# etc.

Page 20: Data Manipulation · Manipulation Tool #4: Joining What it does: Joins together two data frames, combining rows that have the same value for a column. How to do it: Use dplyr’s

Technique #5: Dummy Variables

Source

What it does: Creates a binary variable for each category in a categorical variable.

How to do it: Create a new variable named after the category. Assign 1 to the variable when the categorical variable takes on the category.

students$Senior <- 0students$Senior[which(students$Year==”SENIOR”)] <- 1# etc.

Page 21: Data Manipulation · Manipulation Tool #4: Joining What it does: Joins together two data frames, combining rows that have the same value for a column. How to do it: Use dplyr’s

Coming UpYour assignment: Assignment 2

Next week: Visualizing, animating, and presenting data

See you then!