13 case-study

Hadley Wickham

Stat405: ddply case study

Tuesday, 5 October 2010

1. Homework

2. Project

3. Case study: gender trends

1. Focus on smaller subset

2. Develop summary statistic

3. Classify names


Explain your code!

Comments should explain why not what

Check your indenting - if it’s not indented correctly, it’s very hard to read

Homework


# Really bad:
# Set x equal to ten.
x <- 10

# Bad:
# Figure out if all windows are bars
allbars <- all(windows %in% c("B", "BB", "BBB"))

# Better:
# all() / any() combination used to prevent errors in the
# case of three DDs.

# Better:
# Check to see if DD will create a triple
# if (length(unique(windows)) == 2)


# Best (but still not perfect):

## DD wild 4 cases and subcases ##
## 1c) 3 DD's ##
## 2c) 2 DD's ##
## 2c) 2 DD's ##
## the prize is quadrupled ##
## 3c) 1 DD ##
## prize doubled
## 3c.1) 1 DD and 2 of a kind
## 3c.2) 1 DD for any bars
## 3c.3) 1 DD for Cherries ##
## 4c) NO DD's
## 4c.1) Just any bar
## 4c.2) Just cherries


Project


Tips from last year

Proof read - far too many projects with obvious mistakes.

Include a section on the data, giving a quick plain-English run-down of what you did to the data. Only the appendix should contain technical details.

Presentation matters - you should be proud of your work, so take a little time to put it in a nice wrapper.


Easy ways to lose points

Overplotting

Code style violations

Forgetting about the denominator of a ratio


Team Assessment

Your individual grades will be weighted by effort.

Each team member should turn in a (confidential) team evaluation sheet. Don’t forget to assess yourself.


Case study


For names that are used for both boys and girls, how has usage changed?

Can we use names that clearly have the incorrect sex to estimate error rates over time?

Questions


Getting started

options(stringsAsFactors = FALSE)
library(plyr)
library(ggplot2)

bnames <- read.csv("baby-names2.csv.bz2")


First task

Too many names (~7000): need to identify smaller subset (~100) likely to be interesting.

Outside of class, would look at more, but starting with a subset for easier exploration is a good idea.


For this task, what attributes of a name are likely to be useful?


Your turn

For each name, calculate the total proportion of boys, the total proportion of girls, the number of years the name was in the top 1000 as a girl's name, and the number of years the name was in the top 1000 as a boy's name.

Hint: Start with a single name and figure out how to solve the problem.

Hint: Use summarise.
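The first hint can be sketched on a single name before scaling up to every name. The data frame `toy` below is a made-up stand-in for `bnames` (the name "Jessie" and its values are invented for illustration), and the sketch uses base R only so it runs without plyr.

```r
# Toy stand-in for bnames; the values are invented for illustration.
toy <- data.frame(
  year = c(1900, 1900, 1901, 1901, 1902),
  name = c("Jessie", "Jessie", "Jessie", "Jessie", "Jessie"),
  sex  = c("boy", "girl", "boy", "girl", "girl"),
  prop = c(0.002, 0.003, 0.001, 0.004, 0.005),
  stringsAsFactors = FALSE
)

# Step 1: pull out a single name.
one <- subset(toy, name == "Jessie")

# Step 2: compute the four summaries for that one name.
one_summary <- with(one, data.frame(
  boys    = sum(prop[sex == "boy"]),   # total proportion as a boy's name
  boys_n  = sum(sex == "boy"),         # number of years listed for boys
  girls   = sum(prop[sex == "girl"]),  # total proportion as a girl's name
  girls_n = sum(sex == "girl")         # number of years listed for girls
))
one_summary
```

Once the expressions work for one name, the same four expressions can be handed to summarise so that they are evaluated once per name.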


times <- ddply(bnames, "name", summarise,
  boys = sum(prop[sex == "boy"]),
  boys_n = sum(sex == "boy"),
  girls = sum(prop[sex == "girl"]),
  girls_n = sum(sex == "girl"),
  .progress = "text")

# But this is rather painful

.progress = "text" shows a progress bar: useful for slow operations


# For this task, data much easier to work with
# if put sex in columns instead of rows. We'll learn
# more about reshaping in a couple of weeks
# install.packages("reshape2")
library(reshape2)
# Note: current reshape2 spells the argument value.var
bnames2 <- dcast(bnames, year + name ~ sex, value.var = "prop")

# No information unless we have both boys and
# girls for that name in that year
both <- subset(bnames2, !is.na(boy) & !is.na(girl))
dim(both)
head(both)
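To make "sex in columns instead of rows" concrete, here is a small base-R sketch of the same reshaping idea using merge() on a made-up data frame. It is only an illustration of the resulting shape, not how the slides do it (they use dcast from reshape2); `toy`, `wide` and `both_toy` are hypothetical names.

```r
# Toy long-format data: one row per year, name and sex (values invented).
toy <- data.frame(
  year = c(1900, 1900, 1900, 1901),
  name = c("Jessie", "Jessie", "Mary", "Jessie"),
  sex  = c("boy", "girl", "girl", "boy"),
  prop = c(0.002, 0.003, 0.050, 0.001),
  stringsAsFactors = FALSE
)

# Split by sex and rename prop to the column it will become.
boys  <- subset(toy, sex == "boy",  select = c(year, name, prop))
girls <- subset(toy, sex == "girl", select = c(year, name, prop))
names(boys)[3]  <- "boy"
names(girls)[3] <- "girl"

# all = TRUE keeps name-years seen for only one sex, with NA for the other.
wide <- merge(boys, girls, by = c("year", "name"), all = TRUE)

# Only rows with both a boy and a girl proportion are informative here.
both_toy <- subset(wide, !is.na(boy) & !is.na(girl))
both_toy
```

In this toy data only Jessie in 1900 has a proportion for both sexes, so it is the only row that survives the filter.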


Summarise each name with the number of years it has made the list for both boys and girls, and the average proportion of babies given that name.

Which names would you include for further investigation?

Your turn


both_sum <- ddply(both, "name", summarise,
  years = length(name),
  avg_usage = mean(boy + girl) / 2)

# No point at looking at names that only appear once
both_sum <- subset(both_sum, years > 1)

qplot(years, avg_usage, data = both_sum)


# Now save our selections

selected_names <- subset(both_sum, years > 20 & avg_usage > 0.005)$name

selected <- subset(both, name %in% selected_names)

nrow(selected) / nrow(both)


Explore how the gender assignment of these names has changed over time.

What is a good summary to use to compare boy popularity to girl popularity?

Your turn


qplot(year, boy - girl, data = selected, geom = "line", group = name)
qplot(year, abs(boy - girl), data = selected, geom = "line", group = name, colour = sign(boy - girl))

qplot(year, boy / girl, data = selected, geom = "line", group = name)
qplot(year, log10(boy / girl), data = selected, geom = "line", group = name)

selected$lratio <- with(selected, log10(boy / girl))
qplot(lratio, name, data = selected)
qplot(lratio, reorder(name, lratio), data = selected)
qplot(abs(lratio), reorder(name, lratio), data = selected)
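One likely reason the log ratio is a convenient summary is that it treats the two sexes symmetrically: swapping boy and girl only flips the sign. The short sketch below compares the three candidate summaries on made-up proportions (the values are invented for illustration).

```r
# Three made-up name-years: mostly boys, mostly girls, evenly split.
boy  <- c(0.008, 0.002, 0.004)
girl <- c(0.002, 0.008, 0.004)

difference <- boy - girl           # depends on how popular the name is overall
ratio      <- boy / girl           # 4, 0.25 and 1: not symmetric around 1
lratio     <- log10(boy / girl)    # about 0.602, -0.602 and 0: symmetric around 0

data.frame(boy, girl, difference, ratio, lratio)
```

The raw ratio squeezes girl-dominated names into the interval between 0 and 1 while boy-dominated names spread out above 1; on the log scale both directions get equal room.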


Your turn

Compute the mean and range of lratio for each name.

Plot and come up with cutoffs that you think separate the two groups.


rng <- ddply(selected, "name", summarise,
  diff = diff(range(lratio, na.rm = TRUE)),
  mean = mean(lratio, na.rm = TRUE))

qplot(diff, abs(mean), data = rng)
qplot(diff, abs(mean), data = rng, geom = "text", label = name)

rng$dual <- abs(rng$mean) < 2
arrange(rng, mean, dual)

selected <- join(selected, rng[c("name", "dual")])
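The per-name summary and cutoff can also be checked by hand on a tiny example. This is a base-R sketch using tapply() on invented data; the names "A" and "B" are hypothetical, and the cutoff of 2 simply mirrors the one used in the slides.

```r
# Toy lratio values for two hypothetical names.
toy <- data.frame(
  name   = c("A", "A", "A", "B", "B", "B"),
  lratio = c(-0.5, 0.0, 0.5, 2.5, 3.0, 3.5),
  stringsAsFactors = FALSE
)

# Mean and range width of lratio for each name.
m <- tapply(toy$lratio, toy$name, mean)
d <- tapply(toy$lratio, toy$name, function(x) diff(range(x)))

rng_toy <- data.frame(
  name = names(m),
  diff = as.vector(d),
  mean = as.vector(m),
  stringsAsFactors = FALSE
)

# A name counts as dual-use when its mean log ratio is close to zero.
rng_toy$dual <- abs(rng_toy$mean) < 2
rng_toy
```

Here "A" hovers around an even split and is flagged as dual, while "B" is used for boys hundreds of times more often than for girls and is not.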


qplot(year, lratio, data = selected, geom = "line", group = name) + facet_wrap(~ dual)

qplot(year, lratio, data = subset(selected, dual), geom = "line") + facet_wrap(~ name)

qplot(year, boy / (boy + girl), data = subset(selected, dual), geom = "line") + facet_wrap(~ name)


Now that we’ve separated the two groups, we’ll explore each in more detail.

Next time

