Hadley Wickham Stat405 ddply case study Tuesday, 5 October 2010

Page 1: 13 case-study

Hadley Wickham

Stat405 ddply case study

Tuesday, 5 October 2010

Page 2: 13 case-study

1. Homework

2. Project

3. Case study: gender trends

   1. Focus on smaller subset

   2. Develop summary statistic

   3. Classify names

Page 3: 13 case-study

Homework

Explain your code!

Comments should explain why, not what.

Check your indenting - if it’s not indented correctly, it’s very hard to read.

Page 4: 13 case-study

# Really bad:
# Set x equal to ten.
x <- 10

# Bad:
# Figure out if all windows are bars
allbars <- all(windows %in% c("B", "BB", "BBB"))

# Better:
# all() / any() combination used to prevent errors in the
# case of three DDs.

# Better:
# Check to see if DD will create a triple
# if (length(unique(windows)) == 2)

Page 5: 13 case-study

# Best (but still not perfect):

## DD wild 4 cases and subcases
#### 1c) 3 DD's
#### 2c) 2 DD's
#### 2c) 2 DD's
#### the prize is quadrupled
#### 3c) 1 DD
#### prize doubled
## 3c.1) 1 DD and 2 of a kind
## 3c.2) 1 DD for any bars
## 3c.3) 1 DD for Cherries
#### 4c) NO DD's
## 4c.1) Just any bar
## 4c.2) Just cherries

Page 6: 13 case-study

Project

Page 7: 13 case-study

Tips from last year

Proofread - far too many projects had obvious mistakes.

Include a section on the data, giving a quick plain-English run-down of what you did to the data. Only the appendix should contain technical details.

Presentation matters - you should be proud of your work, so take a little time to put it in a nice wrapper.

Page 8: 13 case-study

Easy ways to lose points

Overplotting

Code style violations

Forgetting about the denominator of a ratio

Page 9: 13 case-study

Team Assessment

Your individual grades will be weighted by effort.

Each team member should turn in a (confidential) team evaluation sheet. Don’t forget to assess yourself.

Page 10: 13 case-study

Case study

Page 11: 13 case-study

Questions

For names that are used for both boys and girls, how has usage changed?

Can we use names that clearly have the incorrect sex to estimate error rates over time?

Page 12: 13 case-study

Getting started

options(stringsAsFactors = FALSE)
library(plyr)
library(ggplot2)

bnames <- read.csv("baby-names2.csv.bz2")

Page 14: 13 case-study

First task

Too many names (~7000): we need to identify a smaller subset (~100) likely to be interesting.

Outside of class we would look at more names, but starting with a subset for easier exploration is a good idea.

For this task, what attributes of a name are likely to be useful?

Page 15: 13 case-study

Your turn

For each name, calculate the total proportion of boys, the total proportion of girls, the number of years the name was in the top 1000 as a girl's name, and the number of years the name was in the top 1000 as a boy's name.

Hint: Start with a single name and figure out how to solve the problem.

Hint: Use summarise.
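Following the first hint, one possible way to start is to solve the problem for a single name on a tiny data frame. This is only a sketch in base R: the column names (year, name, sex, prop) are taken from the code on the later slides, while the names NameA and NameB and all the values are made up for illustration.

```r
# Toy data in the same long format as bnames (all values are invented).
toy <- data.frame(
  year = c(1900, 1900, 1901, 1901, 1902, 1900),
  name = c("NameA", "NameA", "NameA", "NameA", "NameA", "NameB"),
  sex  = c("boy", "girl", "boy", "girl", "girl", "boy"),
  prop = c(0.002, 0.001, 0.002, 0.002, 0.003, 0.050),
  stringsAsFactors = FALSE
)

# Step 1: pick one name and work out the four summaries by hand.
one <- subset(toy, name == "NameA")

one_summary <- data.frame(
  name    = "NameA",
  boys    = sum(one$prop[one$sex == "boy"]),   # total proportion as a boy's name
  boys_n  = sum(one$sex == "boy"),             # rows (= years) listed as a boy's name
  girls   = sum(one$prop[one$sex == "girl"]),  # total proportion as a girl's name
  girls_n = sum(one$sex == "girl")             # rows (= years) listed as a girl's name
)
print(one_summary)
```

Counting rows only equals counting years if each name appears at most once per year and sex, which is assumed here. Once the expressions work for one name, the same expressions can be applied to every name.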

Page 16: 13 case-study

times <- ddply(bnames, "name", summarise,
  boys = sum(prop[sex == "boy"]),
  boys_n = sum(sex == "boy"),
  girls = sum(prop[sex == "girl"]),
  girls_n = sum(sex == "girl"),
  .progress = "text")  # progress bar: useful for slow operations

# But this is rather painful

Page 17: 13 case-study

# For this task, the data is much easier to work with
# if we put sex in columns instead of rows. We'll learn
# more about reshaping in a couple of weeks
# install.packages("reshape2")
library(reshape2)
bnames2 <- dcast(bnames, year + name ~ sex, value.var = "prop")

# No information unless we have both boys and
# girls for that name in that year
both <- subset(bnames2, !is.na(boy) & !is.na(girl))
dim(both)
head(both)
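To see what this reshaping step produces, here is a small base-R sketch on made-up data. It uses merge() as a stand-in for dcast(), so it illustrates the resulting shape and is not the method used on the slide; the names NameA and NameB and all values are invented.

```r
# Long format: one row per year, name and sex (values are invented).
toy <- data.frame(
  year = c(1900, 1900, 1900, 1901, 1901),
  name = c("NameA", "NameA", "NameB", "NameA", "NameB"),
  sex  = c("boy", "girl", "boy", "girl", "boy"),
  prop = c(0.002, 0.001, 0.050, 0.0015, 0.048),
  stringsAsFactors = FALSE
)

# Split by sex and rename prop so each sex gets its own column.
boys  <- subset(toy, sex == "boy",  select = c(year, name, prop))
girls <- subset(toy, sex == "girl", select = c(year, name, prop))
names(boys)[3]  <- "boy"
names(girls)[3] <- "girl"

# Full outer join: a missing sex for a year/name pair becomes NA.
wide <- merge(boys, girls, by = c("year", "name"), all = TRUE)

# Keep only the year/name pairs recorded for both sexes.
both <- subset(wide, !is.na(boy) & !is.na(girl))
print(wide)
print(both)
```

In this toy data only NameA in 1900 has a value for both sexes, so it is the only row that survives the subset.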

Page 18: 13 case-study

Your turn

Summarise each name with the number of years it has made the list for both boys and girls, and the average proportion of babies given that name.

Which names would you include for further investigation?

Page 19: 13 case-study

both_sum <- ddply(both, "name", summarise,
  years = length(name),
  avg_usage = mean(boy + girl) / 2)

# No point in looking at names that only appear once
both_sum <- subset(both_sum, years > 1)

qplot(years, avg_usage, data = both_sum)
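As a rough check of what the per-name summary computes, the sketch below repeats the calculation in base R with split() and sapply() instead of ddply(). The data frame, the names NameA and NameB, and every value are made up for illustration.

```r
# Wide data: one row per year and name, one column per sex (invented values).
both <- data.frame(
  year = c(1900, 1901, 1902, 1900),
  name = c("NameA", "NameA", "NameA", "NameB"),
  boy  = c(0.002, 0.002, 0.001, 0.004),
  girl = c(0.001, 0.002, 0.003, 0.002)
)

# One piece per name, then one summary row per piece.
pieces <- split(both, both$name)
both_sum <- data.frame(
  name      = names(pieces),
  years     = sapply(pieces, nrow),                                 # years with both sexes
  avg_usage = sapply(pieces, function(d) mean(d$boy + d$girl) / 2), # mean of the two proportions
  row.names = NULL
)

# Drop names that appear in only one year.
both_sum <- subset(both_sum, years > 1)
print(both_sum)
```

Here mean(boy + girl) / 2 averages the boy and girl proportions within each year and then across years, so it stays on the same scale as a single proportion.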

Page 20: 13 case-study

# Now save our selections

selected_names <- subset(both_sum, years > 20 & avg_usage > 0.005)$name

selected <- subset(both, name %in% selected_names)

nrow(selected) / nrow(both)

Page 21: 13 case-study

Your turn

Explore how the gender assignment of these names has changed over time.

What is a good summary to use to compare boy popularity to girl popularity?

Page 22: 13 case-study

qplot(year, boy - girl, data = selected, geom = "line", group = name)
qplot(year, abs(boy - girl), data = selected, geom = "line",
  group = name, colour = sign(boy - girl))

qplot(year, boy / girl, data = selected, geom = "line", group = name)
qplot(year, log10(boy / girl), data = selected, geom = "line", group = name)

selected$lratio <- with(selected, log10(boy / girl))
qplot(lratio, name, data = selected)
qplot(lratio, reorder(name, lratio), data = selected)
qplot(abs(lratio), reorder(name, lratio), data = selected)
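One way to see why the log ratio is a convenient summary is to compare the candidate measures on a few made-up proportions. The numbers below are invented; the point is only that log10(boy / girl) treats "four times more boys" and "four times more girls" symmetrically around zero.

```r
# Three invented year/name cases: mostly boys, balanced, mostly girls.
boy  <- c(0.004, 0.002, 0.001)
girl <- c(0.001, 0.002, 0.004)

diff_raw <- boy - girl         # size depends on how popular the name is overall
ratio    <- boy / girl         # roughly 4, 1, 0.25: lopsided around 1
lratio   <- log10(boy / girl)  # equal size, opposite sign for mirrored cases

print(data.frame(boy, girl, diff_raw, ratio, lratio))
```

A name with lratio near 0 is used about equally for both sexes, while each unit of lratio corresponds to a tenfold imbalance.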

Page 23: 13 case-study

Your turn

Compute the mean and range of lratio for each name.

Plot and come up with cutoffs that you think separate the two groups.

Page 24: 13 case-study

rng <- ddply(selected, "name", summarise,
  diff = diff(range(lratio, na.rm = TRUE)),
  mean = mean(lratio, na.rm = TRUE))

qplot(diff, abs(mean), data = rng)
qplot(diff, abs(mean), data = rng, geom = "text", label = name)

rng$dual <- abs(rng$mean) < 2
arrange(rng, mean, dual)

selected <- join(selected, rng[c("name", "dual")])
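The classification step can be traced on a tiny example. The sketch below is a base-R approximation that uses split(), sapply() and merge() in place of ddply() and join(); the names NameA and NameB and the lratio values are invented, and the cutoff of 2 is the one used above.

```r
# Invented log ratios for two hypothetical names over three years.
selected <- data.frame(
  name   = c("NameA", "NameA", "NameA", "NameB", "NameB", "NameB"),
  year   = c(1900, 1950, 2000, 1900, 1950, 2000),
  lratio = c(1.0, 0.0, -1.5, 2.5, 2.0, 2.4)
)

# Per-name range width and mean of the log ratio.
pieces <- split(selected$lratio, selected$name)
rng <- data.frame(
  name = names(pieces),
  diff = sapply(pieces, function(x) diff(range(x, na.rm = TRUE))),
  mean = sapply(pieces, function(x) mean(x, na.rm = TRUE)),
  row.names = NULL
)

# A name counts as dual-use when its mean imbalance is under 100 to 1.
rng$dual <- abs(rng$mean) < 2

# Attach the flag to every yearly row of the matching name.
selected <- merge(selected, rng[c("name", "dual")], by = "name")
print(rng)
```

In this toy data NameA swings from mostly boys to mostly girls and is flagged as dual, while NameB stays heavily one-sided and is not.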

Page 25: 13 case-study

qplot(year, lratio, data = selected, geom = "line", group = name) + facet_wrap(~ dual)

qplot(year, lratio, data = subset(selected, dual), geom = "line") + facet_wrap(~ name)

qplot(year, boy / (boy + girl), data = subset(selected, dual), geom = "line") + facet_wrap(~ name)
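The last plot switches from the log ratio to the share of boys, boy / (boy + girl). The two scales carry the same information: if r = boy / girl, then the share is r / (1 + r). A minimal sketch of that conversion, using arbitrary illustrative lratio values:

```r
# Arbitrary log ratios, from 100:1 girls to 100:1 boys.
lratio <- c(-2, -1, 0, 1, 2)

ratio     <- 10^lratio           # back from log10 to boy / girl
share_boy <- ratio / (1 + ratio) # same as boy / (boy + girl)

print(data.frame(lratio, ratio, share_boy))
```

On the share scale a balanced name sits at 0.5, and an lratio of plus or minus 2 for a single year corresponds to a share of about 0.99 or 0.01. Note that the dual cutoff above is applied to the per-name mean of lratio, not to individual years.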

Page 26: 13 case-study

Next time

Now that we’ve separated the two groups, we’ll explore each in more detail.
