13 case-study
TRANSCRIPT
![Page 1: 13 case-study](https://reader034.vdocuments.net/reader034/viewer/2022052307/5552caf8b4c90581158b4e15/html5/thumbnails/1.jpg)
Hadley Wickham
Stat405
ddply case study
Tuesday, 5 October 2010
![Page 2: 13 case-study](https://reader034.vdocuments.net/reader034/viewer/2022052307/5552caf8b4c90581158b4e15/html5/thumbnails/2.jpg)
1. Homework
2. Project
3. Case study: gender trends
   1. Focus on smaller subset
   2. Develop summary statistic
   3. Classify names
![Page 3: 13 case-study](https://reader034.vdocuments.net/reader034/viewer/2022052307/5552caf8b4c90581158b4e15/html5/thumbnails/3.jpg)
Homework
Explain your code!
Comments should explain why, not what.
Check your indenting: if it’s not indented correctly, it’s very hard to read.
![Page 4: 13 case-study](https://reader034.vdocuments.net/reader034/viewer/2022052307/5552caf8b4c90581158b4e15/html5/thumbnails/4.jpg)
# Really bad:
# Set x equal to ten.
x <- 10

# Bad:
# Figure out if all windows are bars
allbars <- all(windows %in% c("B", "BB", "BBB"))

# Better:
# all() / any() combination used to prevent errors in the
# case of three DDs.

# Better:
# Check to see if DD will create a triple
# if (length(unique(windows)) == 2)
![Page 5: 13 case-study](https://reader034.vdocuments.net/reader034/viewer/2022052307/5552caf8b4c90581158b4e15/html5/thumbnails/5.jpg)
# Best (but still not perfect):
## DD wild 4 cases and subcases ##
## 1c) 3 DD's ##
## 2c) 2 DD's ##
## 2c) 2 DD's ##
## the prize is quadrupled ##
## 3c) 1 DD ##
## prize doubled
## 3c.1) 1 DD and 2 of a kind
## 3c.2) 1 DD for any bars
## 3c.3) 1 DD for Cherries ##
## 4c) NO DD's
## 4c.1) Just any bar
## 4c.2) Just cherries
![Page 6: 13 case-study](https://reader034.vdocuments.net/reader034/viewer/2022052307/5552caf8b4c90581158b4e15/html5/thumbnails/6.jpg)
Project
![Page 7: 13 case-study](https://reader034.vdocuments.net/reader034/viewer/2022052307/5552caf8b4c90581158b4e15/html5/thumbnails/7.jpg)
Tips from last year
Proofread: far too many projects had obvious mistakes.
Include a section on the data, giving a quick plain-English run-down of what you did to the data. Only the appendix should contain technical details.
Presentation matters - you should be proud of your work, so take a little time to put it in a nice wrapper.
![Page 8: 13 case-study](https://reader034.vdocuments.net/reader034/viewer/2022052307/5552caf8b4c90581158b4e15/html5/thumbnails/8.jpg)
Easy ways to lose points
Overplotting
Code style violations
Forgetting about the denominator of a ratio
![Page 9: 13 case-study](https://reader034.vdocuments.net/reader034/viewer/2022052307/5552caf8b4c90581158b4e15/html5/thumbnails/9.jpg)
Team Assessment
Your individual grades will be weighted by effort.
Each team member should turn in a (confidential) team evaluation sheet. Don’t forget to assess yourself.
![Page 10: 13 case-study](https://reader034.vdocuments.net/reader034/viewer/2022052307/5552caf8b4c90581158b4e15/html5/thumbnails/10.jpg)
Case study
![Page 11: 13 case-study](https://reader034.vdocuments.net/reader034/viewer/2022052307/5552caf8b4c90581158b4e15/html5/thumbnails/11.jpg)
Questions
For names that are used for both boys and girls, how has usage changed?
Can we use names that clearly have the incorrect sex to estimate error rates over time?
![Page 12: 13 case-study](https://reader034.vdocuments.net/reader034/viewer/2022052307/5552caf8b4c90581158b4e15/html5/thumbnails/12.jpg)
Getting started
options(stringsAsFactors = FALSE)
library(plyr)
library(ggplot2)
bnames <- read.csv("baby-names2.csv.bz2")
![Page 13: 13 case-study](https://reader034.vdocuments.net/reader034/viewer/2022052307/5552caf8b4c90581158b4e15/html5/thumbnails/13.jpg)
First task
Too many names (~7000): we need to identify a smaller subset (~100) likely to be interesting.
Outside of class we would look at more names, but starting with a subset makes exploration easier.
For this task, what attributes of a name are likely to be useful?
![Page 15: 13 case-study](https://reader034.vdocuments.net/reader034/viewer/2022052307/5552caf8b4c90581158b4e15/html5/thumbnails/15.jpg)
Your turn
For each name, calculate the total proportion of boys, the total proportion of girls, the number of years the name was in the top 1000 as a girl's name, and the number of years the name was in the top 1000 as a boy's name.
Hint: Start with a single name and figure out how to solve the problem.
Hint: Use summarise.
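The first hint can be sketched as follows, assuming bnames has the columns year, name, sex and prop that the later code relies on. The toy data frame and its values are invented for illustration and do not come from baby-names2.csv.bz2; base R is used here so the sketch runs without plyr.

```r
# Toy stand-in for bnames; all values are invented for illustration.
toy <- data.frame(
  year = c(1990, 1990, 1991, 1991, 1991),
  name = c("Jordan", "Jordan", "Jordan", "Jordan", "Mary"),
  sex  = c("boy", "girl", "boy", "girl", "girl"),
  prop = c(0.004, 0.001, 0.005, 0.002, 0.010),
  stringsAsFactors = FALSE
)

# Step 1: solve the problem for a single name.
one <- subset(toy, name == "Jordan")

one_summary <- data.frame(
  boys    = sum(one$prop[one$sex == "boy"]),   # total proportion as a boy's name
  boys_n  = sum(one$sex == "boy"),             # years on the boys' list
  girls   = sum(one$prop[one$sex == "girl"]),  # total proportion as a girl's name
  girls_n = sum(one$sex == "girl")             # years on the girls' list
)
one_summary
```

Once this works for one name, the same four expressions can be handed to ddply() with summarise so the calculation is repeated for every name.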
![Page 16: 13 case-study](https://reader034.vdocuments.net/reader034/viewer/2022052307/5552caf8b4c90581158b4e15/html5/thumbnails/16.jpg)
times <- ddply(bnames, "name", summarise,
  boys = sum(prop[sex == "boy"]),
  boys_n = sum(sex == "boy"),
  girls = sum(prop[sex == "girl"]),
  girls_n = sum(sex == "girl"),
  .progress = "text")  # .progress: useful for slow operations
# But this is rather painful
![Page 17: 13 case-study](https://reader034.vdocuments.net/reader034/viewer/2022052307/5552caf8b4c90581158b4e15/html5/thumbnails/17.jpg)
# For this task, the data is much easier to work with
# if we put sex in columns instead of rows. We'll learn
# more about reshaping in a couple of weeks
# install.packages("reshape2")
library(reshape2)
bnames2 <- dcast(bnames, year + name ~ sex, value.var = "prop")

# No information unless we have both boys and
# girls for that name in that year
both <- subset(bnames2, !is.na(boy) & !is.na(girl))
dim(both)
head(both)
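To see what the reshaping step produces, here is a small base-R sketch of the same long-to-wide idea. It swaps in merge() for dcast() so it runs without reshape2; the toy data and the names boys, girls, wide and both_toy are invented for illustration.

```r
# Toy long-format data (invented values): one row per year, name and sex.
toy <- data.frame(
  year = c(1990, 1990, 1990, 1991),
  name = c("Jordan", "Jordan", "Mary", "Jordan"),
  sex  = c("boy", "girl", "girl", "boy"),
  prop = c(0.004, 0.001, 0.010, 0.005),
  stringsAsFactors = FALSE
)

# Split by sex and rename prop so that each sex gets its own column.
boys  <- subset(toy, sex == "boy",  select = c(year, name, prop))
girls <- subset(toy, sex == "girl", select = c(year, name, prop))
names(boys)[3]  <- "boy"
names(girls)[3] <- "girl"

# all = TRUE keeps year/name pairs seen for only one sex and fills
# the missing side with NA, which is what the wide table looks like.
wide <- merge(boys, girls, by = c("year", "name"), all = TRUE)

# Keep only the rows where the name appears for both sexes in that year.
both_toy <- subset(wide, !is.na(boy) & !is.na(girl))
both_toy
```

In this toy data only Jordan in 1990 has both a boy and a girl entry, so it is the only row that survives the filter.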
![Page 18: 13 case-study](https://reader034.vdocuments.net/reader034/viewer/2022052307/5552caf8b4c90581158b4e15/html5/thumbnails/18.jpg)
Your turn
Summarise each name with the number of years it has made the list for both boys and girls, and the average proportion of babies given that name.
Which names would you include for further investigation?
![Page 19: 13 case-study](https://reader034.vdocuments.net/reader034/viewer/2022052307/5552caf8b4c90581158b4e15/html5/thumbnails/19.jpg)
both_sum <- ddply(both, "name", summarise,
  years = length(name),
  avg_usage = mean(boy + girl) / 2)

# No point in looking at names that only appear once
both_sum <- subset(both_sum, years > 1)
qplot(years, avg_usage, data = both_sum)
![Page 20: 13 case-study](https://reader034.vdocuments.net/reader034/viewer/2022052307/5552caf8b4c90581158b4e15/html5/thumbnails/20.jpg)
# Now save our selections
selected_names <- subset(both_sum, years > 20 & avg_usage > 0.005)$name
selected <- subset(both, name %in% selected_names)
nrow(selected) / nrow(both)
![Page 21: 13 case-study](https://reader034.vdocuments.net/reader034/viewer/2022052307/5552caf8b4c90581158b4e15/html5/thumbnails/21.jpg)
Your turn
Explore how the gender assignment of these names has changed over time.
What is a good summary to use to compare boy popularity to girl popularity?
![Page 22: 13 case-study](https://reader034.vdocuments.net/reader034/viewer/2022052307/5552caf8b4c90581158b4e15/html5/thumbnails/22.jpg)
qplot(year, boy - girl, data = selected, geom = "line", group = name)
qplot(year, abs(boy - girl), data = selected, geom = "line", group = name, colour = sign(boy - girl))

qplot(year, boy / girl, data = selected, geom = "line", group = name)
qplot(year, log10(boy / girl), data = selected, geom = "line", group = name)

selected$lratio <- with(selected, log10(boy / girl))
qplot(lratio, name, data = selected)
qplot(lratio, reorder(name, lratio), data = selected)
qplot(abs(lratio), reorder(name, lratio), data = selected)
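A small worked example may help show why the log ratio is a convenient summary here. The proportions below are invented: the same skew is applied once towards boys and once towards girls.

```r
# Invented proportions: skewed 10:1 towards boys, balanced,
# and skewed 10:1 towards girls.
boy  <- c(0.010, 0.005, 0.001)
girl <- c(0.001, 0.005, 0.010)

diff_raw <- boy - girl         # size depends on how popular the name is overall
ratio    <- boy / girl         # 10, 1, 0.1: not symmetric around 1
lratio   <- log10(boy / girl)  # 1, 0, -1: symmetric around 0

data.frame(boy, girl, diff_raw, ratio, lratio)
```

With lratio, a name that is ten times more common for boys and a name that is ten times more common for girls sit the same distance from zero, which is what makes abs(lratio) a sensible measure of how strongly a name leans either way.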
![Page 23: 13 case-study](https://reader034.vdocuments.net/reader034/viewer/2022052307/5552caf8b4c90581158b4e15/html5/thumbnails/23.jpg)
Your turn
Compute the mean and range of lratio for each name.
Plot and come up with cutoffs that you think separate the two groups.
![Page 24: 13 case-study](https://reader034.vdocuments.net/reader034/viewer/2022052307/5552caf8b4c90581158b4e15/html5/thumbnails/24.jpg)
rng <- ddply(selected, "name", summarise,
  diff = diff(range(lratio, na.rm = TRUE)),
  mean = mean(lratio, na.rm = TRUE))

qplot(diff, abs(mean), data = rng)
qplot(diff, abs(mean), data = rng, geom = "text", label = name)

rng$dual <- abs(rng$mean) < 2
arrange(rng, mean, dual)

selected <- join(selected, rng[c("name", "dual")])
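The per-name summary and the cutoff can be checked by hand on a tiny example. This is a base-R sketch with invented data; summarise_name is a hypothetical helper written for this illustration, and only the |mean| < 2 cutoff is taken from the slide.

```r
# Toy data (invented): lratio over three years for two hypothetical names.
toy <- data.frame(
  name   = rep(c("A", "B"), each = 3),
  lratio = c(-0.5, 0.0, 0.5, 2.5, 3.0, 3.5),
  stringsAsFactors = FALSE
)

# Hypothetical helper: the same two summaries as the ddply() call,
# computed for the rows of a single name.
summarise_name <- function(df) {
  data.frame(
    name = df$name[1],
    diff = diff(range(df$lratio, na.rm = TRUE)),  # max minus min
    mean = mean(df$lratio, na.rm = TRUE),
    stringsAsFactors = FALSE
  )
}

# Split by name, summarise each piece, and stack the results.
rng_toy <- do.call(rbind, lapply(split(toy, toy$name), summarise_name))

# Cutoff from the slide: names with |mean| < 2 are treated as dual-gender.
rng_toy$dual <- abs(rng_toy$mean) < 2
rng_toy
```

Name A swings around zero and is flagged as dual, while name B stays far on one side and is not, even though both have the same range.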
![Page 25: 13 case-study](https://reader034.vdocuments.net/reader034/viewer/2022052307/5552caf8b4c90581158b4e15/html5/thumbnails/25.jpg)
qplot(year, lratio, data = selected, geom = "line", group = name) + facet_wrap(~ dual)
qplot(year, lratio, data = subset(selected, dual), geom = "line") + facet_wrap(~ name)
qplot(year, boy / (boy + girl), data = subset(selected, dual), geom = "line") + facet_wrap(~ name)
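The last plot swaps lratio for the share of babies with that name who are boys. The two carry the same information, as this sketch with invented proportions suggests; share and share_from_lratio are names made up for the illustration.

```r
# Invented proportions for one hypothetical name in three years.
boy  <- c(0.010, 0.005, 0.001)
girl <- c(0.001, 0.005, 0.010)

lratio <- log10(boy / girl)
share  <- boy / (boy + girl)   # fraction of babies with this name who are boys

# Undo the log to get the ratio back, then convert ratio to share.
ratio <- 10^lratio
share_from_lratio <- ratio / (1 + ratio)

all.equal(share, share_from_lratio)
```

The share is bounded between 0 and 1 and reads directly as a proportion, which can make the faceted plot easier to interpret than the log scale.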
![Page 26: 13 case-study](https://reader034.vdocuments.net/reader034/viewer/2022052307/5552caf8b4c90581158b4e15/html5/thumbnails/26.jpg)
Next time
Now that we’ve separated the two groups, we’ll explore each in more detail.