03 Modelling


Page 1: 03 Modelling

Hadley Wickham

plyr

Modelling large data

Tuesday, 7 July 2009

Page 2: 03 Modelling

1. Strategy for analysing large data.

2. Introduction to the Texas housing data.

3. What’s happening in Houston?

4. Using models as tools

5. Using models in their own right


Page 3: 03 Modelling

Large data strategy

Start with a single unit, and identify interesting patterns.

Summarise patterns with a model.

Apply model to all units.

Look for units that don’t fit the pattern.

Summarise with a single model.


Page 4: 03 Modelling

Texas housing data

For each metropolitan area (45) in Texas, for each month from 2000 to 2009 (112):

Number of houses listed and sold

Total value of houses, and average sale price

Average time on market

CC BY http://www.flickr.com/photos/imagesbywestfall/3510831277/

Page 5: 03 Modelling

Strategy

Start with a single city (Houston).

Explore patterns & fit models.

Apply models to all cities.
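A minimal sketch (not on the slides) of the setup this strategy assumes, using the file name from slide 20 and the column names that appear in later code:

# Packages used throughout the deck, plus the data read in on slide 20.
# Column names (city, date, year, month, sales, listings, avgprice,
# onmarket) are inferred from the code on later slides.
library(plyr)
library(ggplot2)
tx <- read.csv("tx-house-sales.csv")

# Start with a single city.
houston <- subset(tx, city == "Houston")
str(houston)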


Page 6: 03 Modelling

[Figure: avgprice vs. date for Houston, 2000 to 2008]

Page 7: 03 Modelling

[Figure: sales vs. date for Houston, 2000 to 2008]

Page 8: 03 Modelling

[Figure: onmarket vs. date for Houston, 2000 to 2008]

Page 9: 03 Modelling

Seasonal trends

Make it much harder to see the long-term trend. How can we remove the seasonal trend?

(Many sophisticated techniques from time series, but what’s the simplest thing that might work?)


Page 10: 03 Modelling

[Figure: avgprice vs. month for Houston]

Page 11: 03 Modelling

What does the following function do?

deseas <- function(var, month) {
  resid(lm(var ~ factor(month))) +
    mean(var, na.rm = TRUE)
}

How could you use it in conjunction with transform to deseasonalise the data? What if you wanted to deseasonalise every city?

Challenge


Page 12: 03 Modelling

houston <- transform(houston,
  avgprice_ds = deseas(avgprice, month),
  listings_ds = deseas(listings, month),
  sales_ds = deseas(sales, month),
  onmarket_ds = deseas(onmarket, month)
)

qplot(month, sales_ds, data = houston, geom = "line", group = year) + avg


Page 13: 03 Modelling

[Figure: avgprice_ds vs. month for Houston]

Page 14: 03 Modelling

Models as tools

Here we’re using the linear model as a tool - we don’t care about the coefficients or the standard errors, just using it to get rid of a striking pattern.

Tukey described this approach as residuals and reiteration: by removing a striking pattern we can see more subtle patterns.


Page 15: 03 Modelling

[Figure: avgprice_ds vs. date for Houston]

Page 16: 03 Modelling

[Figure: sales_ds vs. date for Houston]

Page 17: 03 Modelling

[Figure: onmarket_ds vs. date for Houston]

Page 18: 03 Modelling

Summary

Most variables seem to be a combination of a strong seasonal pattern plus a weaker long-term trend.

How do these patterns hold up for the rest of Texas? We’ll focus on sales.


Page 19: 03 Modelling

[Figure: sales vs. date, one line per city]

Page 20: 03 Modelling

tx <- read.csv("tx-house-sales.csv")
qplot(date, sales, data = tx, geom = "line", group = city)

tx <- ddply(tx, "city", transform, sales_ds = deseas(sales, month))
qplot(date, sales_ds, data = tx, geom = "line", group = city)


Page 21: 03 Modelling

[Figure: sales_ds vs. date, one line per city]

Page 22: 03 Modelling

It works, but...

It doesn’t give us any insight into the similarity of the patterns across multiple cities. Are the trends the same or different?

So instead of throwing the models away and just using the residuals, let’s keep the models and explore them in more depth.


Page 23: 03 Modelling

Two new tools

dlply: takes a data frame, splits it up in the same way as ddply, applies a function to each piece and combines the results into a list

ldply: takes a list, splits it up into elements, applies a function to each piece and then combines the results into a data frame

dlply + ldply = ddply
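A toy sketch (my own example, not from the slides) of what that identity means in practice:

# dlply: data frame in, list out (one element per group).
# ldply: list in, data frame out (split labels are restored automatically).
# Running one after the other gives the same result as a single ddply call.
library(plyr)
df <- data.frame(g = rep(c("a", "b"), each = 3), x = 1:6)

pieces <- dlply(df, "g", function(piece) data.frame(mean_x = mean(piece$x)))
ldply(pieces, identity)
ddply(df, "g", function(piece) data.frame(mean_x = mean(piece$x)))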


Page 24: 03 Modelling

models <- dlply(tx, "city", function(df) lm(sales ~ factor(month), data = df))

models[[1]]
coef(models[[1]])

ldply(models, coef)


Page 25: 03 Modelling

Notice we didn’t have to do anything to have the coefficients labelled correctly.

Behind the scenes plyr records the labels used for the split step, and ensures they are preserved across multiple plyr calls.

Labelling


Page 26: 03 Modelling

What are some problems with this model? How could you fix them?

Is the format of the coefficients optimal?

Turn to the person next to you and discuss for 2 minutes.

Back to the model


Page 27: 03 Modelling

qplot(date, log10(sales), data = tx, geom = "line", group = city)

models2 <- dlply(tx, "city", function(df) lm(log10(sales) ~ factor(month), data = df))

coef2 <- ldply(models2, function(mod) {
  data.frame(
    month = 1:12,
    effect = c(0, coef(mod)[-1]),
    intercept = coef(mod)[1]
  )
})


Page 28: 03 Modelling

qplot(date, log10(sales), data = tx, geom = "line", group = city)

models2 <- dlply(tx, "city", function(df) lm(log10(sales) ~ factor(month), data = df))

coef2 <- ldply(models2, function(mod) {
  data.frame(
    month = 1:12,
    effect = c(0, coef(mod)[-1]),
    intercept = coef(mod)[1]
  )
})

Log-transforming sales makes the coefficients comparable (as ratios).

Putting the coefficients in rows means they can be plotted more easily.
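Why prepending 0 works (my gloss, not stated on the slide): lm() uses treatment contrasts by default, so the first month is absorbed into the intercept and the remaining coefficients are offsets relative to it. A toy sketch:

# Toy data only: month 1 gets no coefficient of its own under treatment
# contrasts, so assigning it an effect of exactly 0 puts all months on the
# same relative-to-month-1 scale.
toy <- data.frame(month = rep(1:3, each = 4), y = rnorm(12))
mod <- lm(y ~ factor(month), data = toy)
coef(mod)            # (Intercept), factor(month)2, factor(month)3
c(0, coef(mod)[-1])  # effects for months 1, 2 and 3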


Page 29: 03 Modelling

[Figure: effect vs. month, one line per city]

qplot(month, effect, data = coef2, group = city, geom = "line")

Page 30: 03 Modelling

[Figure: 10^effect vs. month, one line per city]

qplot(month, 10 ^ effect, data = coef2, group = city, geom = "line")

Page 31: 03 Modelling

[Figure: 10^effect vs. month, one panel per city (45 Texas cities, Abilene to Wichita Falls)]

qplot(month, 10 ^ effect, data = coef2, geom = "line") + facet_wrap(~ city)

Page 32: 03 Modelling

What should we do next?

What do you think?

You have 30 seconds to come up with (at least) one idea.


Page 33: 03 Modelling

My ideas

Fit a single model, log(sales) ~ city * factor(month), and look at residuals

Fit individual models, log(sales) ~ factor(month) + ns(date, 3), and look for cities that don't fit


Page 34: 03 Modelling

# One approach - fit a single model

mod <- lm(log10(sales) ~ city + factor(month), data = tx)

tx$sales2 <- 10 ^ resid(mod)
qplot(date, sales2, data = tx, geom = "line", group = city)
last_plot() + facet_wrap(~ city)


Page 35: 03 Modelling

[Figure: sales2 vs. date, one line per city]

qplot(date, sales2, data = tx, geom = "line", group = city)

Page 36: 03 Modelling

[Figure: sales2 vs. date, one panel per city]

last_plot() + facet_wrap(~ city)

Page 37: 03 Modelling

# Another approach: the essence of most cities is a seasonal term plus a
# long-term smooth trend. We could fit this model to each city, and then
# look for models which don't fit well.

library(splines)
models3 <- dlply(tx, "city", function(df) {
  lm(log10(sales) ~ factor(month) + ns(date, 3), data = df)
})

# Extract r-squared from each model
rsq <- function(mod) c(rsq = summary(mod)$r.squared)
quality <- ldply(models3, rsq)
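A quick numeric check (not on the slide) before plotting the fit quality:

# Sort cities by how well the seasonal + smooth-trend model fits them,
# and list the ones below the 0.7 cut-off used on slide 40.
quality[order(quality$rsq), ]
subset(quality, rsq < 0.7)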


Page 38: 03 Modelling

[Figure: rsq by city, cities in alphabetical order]

qplot(rsq, city, data = quality)

Page 39: 03 Modelling

[Figure: rsq by city, cities reordered by rsq]

qplot(rsq, reorder(city, rsq), data = quality)

How are the good fits different from the bad fits?


Page 40: 03 Modelling

quality$poor <- quality$rsq < 0.7
tx2 <- merge(tx, quality, by = "city")

mfit <- ldply(models3, function(mod) {
  data.frame(
    resid = resid(mod),
    pred = predict(mod)
  )
})
tx2 <- cbind(tx2, mfit[, -1])

Can you think of any potential problems with this line?
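One possible answer (my reading, not given on the slide): cbind() pastes columns side by side, so it silently assumes that mfit's rows come back in exactly the same order as tx2's and that no rows were dropped (for example because of missing values) when the models were fitted. A more defensive sketch joins on explicit keys instead:

# Hedged alternative: recompute residuals and predictions per city so every
# row keeps its own city and date, then merge on those keys rather than
# relying on row order. na.exclude keeps resid()/predict() the same length
# as the input data frame even if some rows have missing values.
mfit <- ddply(tx, "city", function(df) {
  mod <- lm(log10(sales) ~ factor(month) + ns(date, 3),
            data = df, na.action = na.exclude)
  data.frame(date = df$date, resid = resid(mod), pred = predict(mod))
})
tx2 <- merge(tx2, mfit, by = c("city", "date"))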


Page 41: 03 Modelling

[Figure: raw data. log10(sales) vs. date, one panel per city]

Page 42: 03 Modelling

[Figure: predictions. pred vs. date, one panel per city]

Page 43: 03 Modelling

[Figure: residuals. resid vs. date, one panel per city]

Page 44: 03 Modelling

Conclusions

A simple (and relatively small) example, but it shows how collections of models can be useful for gaining understanding.

Each attempt illustrated something new about the data.

plyr made it easy to create and summarise collections of models, so we didn't have to worry about the mechanics.
