
Lab 3: Logistic regression models

In this lab, we will apply logistic regression models to United States (US) presidential election data sets. The main purpose is to predict the outcome of the presidential election in each state based on election polls and historical election results. We will use the election polls collected in 2008 and 2012, together with the true election outcomes of 2008, to predict the election outcomes of 2012. It might be more interesting to predict the outcomes of the 2016 presidential election; however, due to limited data availability and the current stage of that election, we will not consider the 2016 election in today's lab.

The US presidential election is held every four years on the Tuesday after the first Monday in November. The 2016 presidential election is scheduled for Nov 8, 2016; the 2008 and 2012 elections were held, respectively, on Nov 4, 2008 and Nov 6, 2012. The President of the US is not elected directly by popular vote. Instead, the President is elected by electors who are selected by popular vote on a state-by-state basis; these electors then cast direct votes for the President. In all states except Maine and Nebraska, electors are selected on a "winner-take-all" basis: all of a state's electoral votes go to the presidential candidate who wins the most popular votes in that state. For simplicity, we will assume all states use the "winner-take-all" principle in this lab. The number of electors in each state equals the number of members of Congress from that state. Currently, there are a total of 538 electors, including 435 House representatives, 100 senators, and 3 electors from the District of Columbia. A presidential candidate who receives an absolute majority of the electoral votes (at least 270) is elected President.

For simplicity, our data analysis only considers the two major political parties: Democratic (Dem) and Republican (Rep). The interest is in predicting which party (Dem or Rep) will win the most votes in each state. Because the chance that a third party (other than Dem and Rep) receives an electoral vote is very small, this simplification is reasonable.

Prediction of the outcomes of presidential election campaigns is of great interest to many people. In the past, predictions were typically made by political analysts and pundits based on their personal experience, intuition and preferences. However, in recent decades, statistical methods have been widely used in predicting election results. In 2012, statistician Nate Silver correctly predicted the outcome in every state, after successfully calling the outcomes in 49 of the 50 states in 2008. In today's lab, we will compare his method (a simplified version) to our method built on logistic regression models.

Data sets

The following data sets are available for our data analysis:

1) Polling data from the 2008 US presidential election (2008-polls.csv);
2) Election results from the 2008 US presidential election (2008-results.csv);
3) Polling data from the 2012 US presidential election (2012-polls.csv);
4) Election results from the 2012 US presidential election (2012-results.csv).

Data sets 1) and 2) will be used for training; that is, they will be used to build the logistic regression models. Data set 3) will be used for prediction, and data set 4) is provided for validation, so we can check whether our predictions are correct.

Both polling data sets 1) and 3) contain five columns. The first column is the state abbreviation (SA). The second and third columns are, respectively, the percentages of votes for the Democratic and Republican candidates. The fourth column is the date the poll was conducted, and the last column is the name of the polling institution (pollster).

Election polls

Our prediction will be based on election polls. An election poll is a survey that samples a small portion of voters and asks about their voting plans. If the survey is conducted appropriately, the sampled voters should be representative of the voting population at large. However, it is very challenging to obtain a good representative group because a good sampling strategy needs to consider many factors (e.g., sampling time, locations, methods). Therefore, a poll's prediction could be biased, and prediction accuracy could be improved by combining multiple polls.

There are many possible factors affecting the prediction accuracy of election polls. Based on the available data sets, we consider the following three factors.


1. Sampling time. If the sampling time is far ahead of the election date, the accuracy could be worse than for polls conducted closer to the election date. Because many events could change voters' opinions about the presidential candidates, the longer the lag, the more likely voters are to change their voting plans.

2. Pollsters. Systematic biases can occur if a flawed sampling method is used. For example, if a pollster only collects samples over the Internet, the sample will be biased because it only includes people with Internet access. Each pollster uses a different method for sampling voters, and some sampling schemes could be better than others. Therefore, it is very likely that some pollsters' predictions are more reliable than others, and we should not give equal weight to every poll.

3. State edges. The state edge is the difference between the Democratic and Republican popular vote percentages (based on the polls) in that state. For instance, if the Democratic candidate receives 55% of the vote and the Republican candidate receives 45%, then the Democratic edge is 10 percentage points. Because of sampling errors, if the state edge is small, the prediction accuracy of a poll is more likely to be affected by sampling error; if the state edge is large, the prediction is less likely to be affected.

Silver’s approach

The Nate Silver’s algorithm is described in detail at the FiveThirtyEight blog

(http://fivethirtyeight.blogs.nytimes.com/methodology/?_r=0). The key idea of

his algorithm is to smooth (average) different polls’ results using a weighted

average. Silver’s algorithm gives weight to each pollster according to its prediction

accuracy in the previous elections. More biased pollsters will receive less weight.

In the following, we briefly describe the general structure of Silver's algorithm.

1. Calculate the average error of each pollster's predictions in previous elections; ranking pollsters by this error gives the pollster's rank. A smaller rank indicates a more accurate pollster.

2. Transform each rank into a weight. In this lab, we simply set the weight to one over the square of the rank (see the small illustration after this list). In Silver's algorithm, a number of additional factors are considered in computing the weight, but we lack that information in the available data sets.

3. For each state, compute a weighted average of the predictions made by the pollsters. The weighted average predicts the winner in that state.
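To make step 2 concrete, here is a minimal illustration with hypothetical ranks (not taken from the data): the top-ranked pollster keeps full weight, while lower-ranked pollsters are down-weighted quadratically.

ranks <- c(1, 2, 3, 4)      ## hypothetical pollster ranks, for illustration only
weights <- 1/ranks^2        ## 1.0000 0.2500 0.1111 0.0625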

In this lab, we will compare our method based on logistic regression models with Silver's approach in predicting the presidential election winner in each state. To this end, please answer the following questions.

Q1. Read the data sets "2008-polls.csv", "2012-polls.csv" and "2008-results.csv" into R.

To simplify our data analysis, let us focus on subsets of these data sets. We will select the subsets based on pollsters, because not all pollsters conducted polls in every state. For our data analysis, first select the pollsters that conducted at least five polls, and then obtain all the polling data collected by those pollsters. Use R to find the pollsters that conducted at least five polls in both the 2008 and 2012 polling data sets 1) and 3). Then create subsets of the 2008 and 2012 polling data sets that were collected by the selected pollsters.

Answer: To read the data sets into R, we use the following R code

setwd("…") ## Change the directory where you saved the data sets

polls2008<-read.csv(file="2008-polls.csv",header=TRUE)

polls2012<-read.csv(file="2012-polls.csv",header=TRUE)

results2008<-read.csv(file="2008-results.csv",header=TRUE)

Because the data sets are stored in csv files, we use read.csv to read them. To select the pollsters that conducted at least five polls, we first create frequency tables of the pollsters in the 2008 and 2012 polling data sets. Then we find the pollsters that conducted at least five polls in both years. The following R code can be used to select the desired pollsters.

pollsters20085<-table(polls2008$Pollster)[table(polls2008$Pollster)>=5]
pollsters20125<-table(polls2012$Pollster)[table(polls2012$Pollster)>=5]
subset1<-names(pollsters20085)[names(pollsters20085)%in%names(pollsters20125)]
pollers<-names(pollsters20125)[names(pollsters20125)%in%subset1]
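Equivalently, the common pollsters can be obtained in one step with intersect(); a minimal sketch that yields the same set of pollsters (possibly in a different order):

pollers <- intersect(names(pollsters20085), names(pollsters20125))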

Finally, we create the subsets of the 2008 and 2012 data sets collected by the selected pollsters using the following R code:


subsamplesID2008<-polls2008[,5]%in%pollers

polls2008sub<-polls2008[subsamplesID2008,]

subsamplesID2012<-polls2012[,5]%in%pollers

polls2012sub<-polls2012[subsamplesID2012,]

Q2. For the purpose of performing logistic regression, we need to define three new variables using the data sets created in Q1.

First, define a binary response variable (Resp) indicating whether the prediction given by each poll is correct. If the prediction is correct, Resp is 1; otherwise it is 0. To check whether the prediction given by each poll is correct, first find the predicted winner for each state and then compare it with the actual winner in the data set "2008-results.csv".

Second, define the state edges following the definition given above.

Finally, compute the number of days between the sampling time (polling date) and the 2008 presidential election date (the lag time). The 2008 presidential election date is Nov 4, 2008.

Combine the variables defined above (Resp, state edge and lag time) with the state names and pollsters into a new data set.

Answer: We first define the response variable based on the 2008 polling data and the true election results. The following R code can be used for this purpose.

winers2008<-(results2008[,2]-results2008[,3]>0)+0   ## 1 if the Democratic candidate actually won the state
StateID2008<-results2008[,1]
Allresponses<-NULL
for (sid in 1:51)
{
  polls2008substate<-polls2008sub[polls2008sub$State==StateID2008[sid],]
  pollwiners2008state<-(polls2008substate[,2]-polls2008substate[,3]>0)+0  ## 1 if the poll put Dem ahead
  pollwinersIND<-(pollwiners2008state==winers2008[sid])+0                 ## 1 if the poll called the state correctly
  Allresponses<-c(Allresponses,pollwinersIND)
}
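Note that this loop assembles Allresponses state by state, so pairing it with the rows of polls2008sub below assumes the polls file is ordered by state in the same way as results2008. A vectorized alternative (a sketch using match(), which aligns each poll with its state's actual winner regardless of row order) is:

actualwinner <- winers2008[match(polls2008sub$State, StateID2008)]  ## actual winner for each poll's state
pollwinner   <- (polls2008sub[,2] - polls2008sub[,3] > 0) + 0       ## the poll's predicted winner
Allresponses <- (pollwinner == actualwinner) + 0                    ## 1 if the poll was correct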

Then we define the state edges and the lag time using the following R code. Please note how the new variable lag time is computed. Finally, we combine the newly defined variables into a new data set.


margins<-abs(polls2008sub[,2]-polls2008sub[,3])
lagtime<-rep(0,dim(polls2008sub)[1])
electiondate2008<-c("Nov 04 2008")
for (i in 1:dim(polls2008sub)[1])
{
  lagtime[i]<-as.Date(electiondate2008, format="%b %d %Y")-as.Date(as.character(polls2008sub[i,4]), format="%b %d %Y")
}
dataset2008<-cbind(Allresponses,as.character(polls2008sub[,1]),margins,lagtime,as.character(polls2008sub[,5]))
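Since as.Date() is vectorized, the lag-time loop above can optionally be replaced by a single statement (a sketch producing the same lagtime vector):

lagtime <- as.numeric(as.Date(electiondate2008, format="%b %d %Y") -
                      as.Date(as.character(polls2008sub[,4]), format="%b %d %Y"))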

Q3. In the data set created in Q2, you might find that the responses (Resp) of some states are all equal to 1. For these states, the prediction is relatively easy. Therefore, we will focus on the states that are relatively difficult to predict. Please select the states whose responses (Resp) contain at least one 0, and then find the corresponding subsets of the polling data for those states.

Answer: We find the states that have at least one 0 and put them into a list. Then we select the corresponding subset using the following R code:

stateslist<-unique(dataset2008[which(dataset2008[,1]=="0"),2])

subdataset2008<-dataset2008[dataset2008[,2]%in%stateslist,]
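A quick way to verify the selection (a sketch): cross-tabulating state by Resp in the reduced data set should show at least one 0 for every remaining state.

table(subdataset2008[,2], subdataset2008[,1])   ## counts of Resp=0 and Resp=1 per retained state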

Q4. Now we fit a logistic regression model using the data set created in Q3. In the model, use Resp as the binary response variable, with SA and Pollster as categorical predictors, together with the other two predictors defined in Q2: lag time and state edge. Based on the fitted model, which predictors are significantly associated with Resp? Please also conduct a hypothesis test to examine whether the categorical variable SA is significant.

Answer: To fit the logistic regression model in R, we first define the following variables based on the data set created in Q3. Since we treat SA and Pollster as categorical predictors, we define them as factors for the logistic regression model. To this end, we use the following R code:

resp<-as.integer(subdataset2008[,1])

statesFAC<-as.factor(subdataset2008[,2])

margins<-as.double(subdataset2008[,3])

lagtime<-as.double(subdataset2008[,4])

pollersFAC<-as.factor(subdataset2008[,5])


Then we fit a logistic regression model using SA, state edges, lag time and the pollsters as predictors. The following R code is used.

logitreg<-glm(resp~statesFAC+margins+lagtime+pollersFAC,family="binomial")
summary(logitreg)

The output of the logistic regression model is given below:

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept) 0.477331 0.602829 0.792 0.428466

statesFACFL -1.647375 0.547285 -3.010 0.002612 **

statesFACGA 1.354619 1.157192 1.171 0.241756

statesFACIN -3.359969 0.926903 -3.625 0.000289 ***

statesFACMA 2.064918 1.233361 1.674 0.094087 .

statesFACMI -0.109506 0.733656 -0.149 0.881348

statesFACMN 1.576421 0.909859 1.733 0.083167 .

statesFACMO -0.427149 0.614079 -0.696 0.486683

statesFACMT 1.572071 1.165776 1.349 0.177491

statesFACNC -2.289227 0.641511 -3.568 0.000359 ***

statesFACND 0.582515 1.411664 0.413 0.679867

statesFACNH 0.608812 0.770412 0.790 0.429386

statesFACNJ 0.562342 0.953698 0.590 0.555429

statesFACNM 0.115791 0.722887 0.160 0.872741

statesFACNV -0.782439 0.620767 -1.260 0.207511

statesFACNY 1.106166 1.220608 0.906 0.364808

statesFACOH -1.456813 0.554890 -2.625 0.008655 **

statesFACOR 2.466634 1.227231 2.010 0.044440 *

statesFACPA 0.999567 0.706504 1.415 0.157125

statesFACVA -0.764514 0.578862 -1.321 0.186595

statesFACWA 2.049390 1.222229 1.677 0.093589 .

statesFACWI 1.724056 0.952639 1.810 0.070332 .

statesFACWV 0.176470 1.192351 0.148 0.882341

margins 0.243394 0.038387 6.341 2.29e-10 ***

lagtime -0.010550 0.001722 -6.128 8.89e-10 ***

pollersFACEPICMRA 1.884727 1.341388 1.405 0.160004

pollersFACInsiderAdvantage 0.831820 0.586503 1.418 0.156112

pollersFACMaristColl 1.899700 1.201596 1.581 0.113883

pollersFACMasonDixon 0.368782 0.590033 0.625 0.531958

pollersFACMuhlenbergColl -0.107470 1.516623 -0.071 0.943508

pollersFACQuinnipiacU 1.742448 0.629726 2.767 0.005658 **

pollersFACRasmussen 0.273553 0.451894 0.605 0.544948

pollersFACSienaColl 15.026258 542.747543 0.028 0.977913

pollersFACSuffolkU 1.166058 0.920064 1.267 0.205024

pollersFACSurveyUSA 0.831435 0.518039 1.605 0.108501

pollersFACUofCincinnati 0.399582 1.113652 0.359 0.719742

pollersFACUofNewHampshire -1.361725 1.333940 -1.021 0.307335

pollersFACZogby 0.501113 0.745531 0.672 0.501484

---


Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Note that in the above R output, SA and Pollster are treated as categorical variables. The baseline level for SA is "CO" (Colorado), and the baseline level for Pollster is "ARG". Based on the output, the state edge (margins) and the lag time are statistically significant in affecting the success probability of the response (the prediction accuracy). Among the levels of the categorical variable SA, the states "FL", "IN", "NC", "OH" and "OR" are statistically significant at the nominal level 0.05. This means these five states differ from the baseline state "CO" in terms of prediction accuracy, holding the other predictors fixed. Among the levels of the categorical variable Pollster, only "QuinnipiacU" is statistically different from the baseline pollster "ARG"; all other pollsters are not. This suggests that polls conducted by "QuinnipiacU" were more accurate than those by "ARG" in predicting the election outcome.
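The significant terms discussed above can also be pulled out of the summary programmatically; a minimal sketch:

coefs <- summary(logitreg)$coefficients
coefs[coefs[, "Pr(>|z|)"] < 0.05, ]   ## rows of the coefficient table with p-value below 0.05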

To check whether the categorical variable "SA" is significant, we use a likelihood ratio test comparing the model with the predictor "SA" to the model without "SA".

logitreg1<-glm(resp~margins+lagtime+pollersFAC,family="binomial")

anova(logitreg1,logitreg, test="Chisq")

The following output provides the analysis of deviance table and the corresponding p-value for assessing the significance of the variable "SA".

Analysis of Deviance Table

Model 1: resp ~ margins + lagtime + pollersFAC

Model 2: resp ~ statesFAC + margins + lagtime + pollersFAC

Resid. Df Resid. Dev Df Deviance Pr(>Chi)

1 647 620.09

2 625 492.68 22 127.41 < 2.2e-16 ***

The 22 degrees of freedom correspond to the 22 non-baseline state levels. Because the likelihood ratio test has a p-value much smaller than 0.05, we conclude that the variable "SA" is significant.


Q5. Refit the logistic regression model in Q4 without the categorical variable SA. Compare this model with the model fitted in Q4: which one is better?

Answer: Deleting the categorical variable SA, we refit the model using the following R code:

logitreg1<-glm(resp~margins+lagtime+pollersFAC,family="binomial")

To compare it with the model fitted in Q4, we use AIC and BIC to select an appropriate model. AIC and BIC are more appropriate for model selection because these information criteria penalize model complexity. The likelihood ratio test conducted in Q4 shows that the model with "SA" fits the data better than the model without "SA", but that is purely a goodness-of-fit comparison. The following table presents the AIC and BIC values for both models.

                     AIC        BIC
Model without SA     652.0945   724.0429
Model with SA        568.6820   739.5594

Based on the above table, the model with SA has the smaller AIC, so if AIC is used we should choose the model with SA. However, if BIC is used, the model without SA is better. The conclusions given by AIC and BIC differ because AIC puts a smaller penalty on the number of free parameters than BIC does. This also explains why the model selected by AIC is typically bigger (contains more predictors) than the model selected by BIC.
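The values in the table can be reproduced directly from the two fitted objects; a minimal sketch:

AIC(logitreg, logitreg1)   ## AIC for the models with and without SA
BIC(logitreg, logitreg1)   ## BIC for the same two models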

Q6. For prediction, we need to define the new variables state edge and lag time for the 2012 polling data set. The definitions of these new variables are the same as in Q2. For computing the lag time, note that the 2012 presidential election date is Nov 6, 2012. Then create a new data set containing these two new variables for the polls conducted by the pollsters selected in Q1 in the states selected in Q3.

Based on the logistic regression models fitted in Q4 and Q5, predict the mean of the response variable (Resp) for the data set just created. The mean of Resp is the probability that Resp=1 (the success probability). Please predict the success probability of each poll for the following states: FL, MI, MO and CO.

Answer: We define the new variables for prediction using the 2012 polling data set. These variables are defined in the same way as in Q2. The following R code is used:

pollwiners2012<-(polls2012sub[,2]-polls2012sub[,3]>0)+0
margins2012<-abs(polls2012sub[,2]-polls2012sub[,3])
lagtime2012<-rep(0,dim(polls2012sub)[1])
electiondate2012<-c("Nov 06 2012")
for (i in 1:dim(polls2012sub)[1])
{
  lagtime2012[i]<-as.Date(electiondate2012, format="%b %d %Y")-as.Date(as.character(polls2012sub[i,4]), format="%b %d %Y")
}
dataset2012<-cbind(pollwiners2012,as.character(polls2012sub[,1]),margins2012,lagtime2012,as.character(polls2012sub[,5]))

For our analysis, we focus on the states in the list created in Q3.

subdataset2012<-dataset2012[dataset2012[,2]%in%stateslist,]

Using these new variables, it is easy to obtain predictions for each poll with the predict() function in R. The predicted success probabilities of the polls in MI using the model from Q4 can be computed with the following R code:

margins2012<-as.double(subdataset2012[,3])
lagtime2012<-as.double(subdataset2012[,4])
pollersFAC2012<-as.factor(subdataset2012[,5])
NOpolls<-sum(subdataset2012[,2]=="MI")
locations<-which(subdataset2012[,2]=="MI")
MIPredictresults<-cbind(as.double(subdataset2012[locations,1]),rep(0,NOpolls))
counts<-0
for (i in locations)
{
  counts<-counts+1
  MIdatapoints<-data.frame(statesFAC="MI", margins=margins2012[i], lagtime=lagtime2012[i], pollersFAC=pollersFAC2012[i])
  MIPredictresults[counts,2]<-predict(logitreg, MIdatapoints, type="response")
}


The predictions for the polls in MI using the model from Q5 are computed as follows:

NOpolls<-sum(subdataset2012[,2]=="MI")
locations<-which(subdataset2012[,2]=="MI")
MIPredictresults1<-cbind(as.double(subdataset2012[locations,1]),rep(0,NOpolls))
counts<-0
for (i in locations)
{
  counts<-counts+1
  MIdatapoints<-data.frame(margins=margins2012[i], lagtime=lagtime2012[i], pollersFAC=pollersFAC2012[i])
  MIPredictresults1[counts,2]<-predict(logitreg1, MIdatapoints, type="response")
}

In the output below, the first column is the predicted winner based on each poll (1 if the poll put Dem ahead, 0 otherwise). The predicted probabilities using the model from Q4 are given in the second column, and the predicted probabilities using the model from Q5 are given in the third column.

[1,] 0 0.7050161 0.7150713

[2,] 1 0.9836286 0.9829498

[3,] 1 0.7184638 0.8035997

[4,] 1 0.8614575 0.8714864

[5,] 1 0.7653980 0.8954679

[6,] 1 0.7277984 0.7007603

[7,] 1 0.9357792 0.9352594

[8,] 1 0.9741315 0.9605871

[9,] 1 0.9343612 0.9006867

[10,] 1 0.9041542 0.8784930

[11,] 1 0.7424675 0.7674955

[12,] 1 0.9802705 0.9943562

[13,] 1 0.8880373 0.8321297

[14,] 1 0.9707425 0.9578605

[15,] 1 0.8588295 0.7600386

[16,] 1 0.9554800 0.9480250

Similar code can be used to predict the success probabilities of the polls in the states "FL", "MO" and "CO".
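As an aside, the state-by-state loops can be avoided: predict() accepts the whole 2012 design at once, so all success probabilities can be computed in one pass and then split by state. A minimal sketch (assuming the variables defined above; this is not how the lab's code below proceeds):

newdata2012 <- data.frame(statesFAC  = subdataset2012[,2],
                          margins    = as.double(subdataset2012[,3]),
                          lagtime    = as.double(subdataset2012[,4]),
                          pollersFAC = subdataset2012[,5])
allpredQ4 <- predict(logitreg,  newdata2012, type="response")   ## model with SA
allpredQ5 <- predict(logitreg1, newdata2012, type="response")   ## model without SA
MIpredQ4  <- allpredQ4[subdataset2012[,2] == "MI"]              ## e.g. Michigan polls only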

The R code for the “FL” state is given in the following. The first part applies the

model in Q4

margins2012<-as.double(subdataset2012[,3])
lagtime2012<-as.double(subdataset2012[,4])
pollersFAC2012<-as.factor(subdataset2012[,5])
NOpolls<-sum(subdataset2012[,2]=="FL")
locations<-which(subdataset2012[,2]=="FL")
FLPredictresults<-cbind(as.double(subdataset2012[locations,1]),rep(0,NOpolls))
counts<-0
for (i in locations)
{
  counts<-counts+1
  FLdatapoints<-data.frame(statesFAC="FL", margins=margins2012[i], lagtime=lagtime2012[i], pollersFAC=pollersFAC2012[i])
  FLPredictresults[counts,2]<-predict(logitreg, FLdatapoints, type="response")
}

The second part applies the model from Q5:

NOpolls<-sum(subdataset2012[,2]=="FL")
locations<-which(subdataset2012[,2]=="FL")
FLPredictresults1<-cbind(as.double(subdataset2012[locations,1]),rep(0,NOpolls))
counts<-0
for (i in locations)
{
  counts<-counts+1
  FLdatapoints<-data.frame(margins=margins2012[i], lagtime=lagtime2012[i], pollersFAC=pollersFAC2012[i])
  FLPredictresults1[counts,2]<-predict(logitreg1, FLdatapoints, type="response")
}

The prediction results are given below. The first column is the predicted winner based on each poll, the second column is the success probability predicted by the model from Q4, and the third column is the success probability predicted by the model from Q5.

[1,] 0 0.56214658 0.8172779

[2,] 0 0.14006329 0.5900694

[3,] 0 0.23539652 0.4892338

[4,] 0 0.21664557 0.4612514

[5,] 0 0.13079384 0.4661476

[6,] 0 0.05549671 0.3457921

[7,] 0 0.64968573 0.8476961

[8,] 1 0.53303376 0.7561092

[9,] 0 0.07828385 0.2883600

[10,] 0 0.06236959 0.2514030

[11,] 0 0.12796709 0.3417972

[12,] 0 0.64710916 0.8262847

[13,] 1 0.51461589 0.7485297

[14,] 1 0.06433675 0.3162441


[15,] 1 0.58318467 0.7768037

[16,] 1 0.14151955 0.3723813

[17,] 1 0.15701236 0.4587324

[18,] 0 0.32868218 0.6065016

[19,] 0 0.52997354 0.7448883

[20,] 1 0.04629752 0.2763060

[21,] 1 0.46001315 0.4928207

[22,] 1 0.64411350 0.8359995

[23,] 1 0.42661942 0.5886155

[24,] 0 0.39076333 0.5313525

[25,] 0 0.44675510 0.4479178

[26,] 0 0.31911545 0.5337634

[27,] 0 0.45085822 0.6782808

[28,] 0 0.70216913 0.8345176

[29,] 1 0.42993929 0.7249804

[30,] 1 0.47746333 0.8461812

[31,] 1 0.50577197 0.7283998

[32,] 1 0.27092546 0.5018527

[33,] 1 0.66321631 0.8422744

[34,] 1 0.67279456 0.7316409

[35,] 1 0.37520930 0.3721240

[36,] 1 0.25649807 0.4712261

[37,] 0 0.36904939 0.5639717

[38,] 1 0.42084846 0.8986539

[39,] 1 0.47564231 0.8111579

[40,] 1 0.79508390 0.9341599

[41,] 1 0.62016858 0.7645409

[42,] 1 0.76200376 0.8903070

[43,] 1 0.39465029 0.7093355

[44,] 1 0.72876273 0.8704343

[45,] 1 0.90964579 0.9562761

The R code for the “MO” state is given in the following. We first apply the model

in Q4.

margins2012<-as.double(subdataset2012[,3])
lagtime2012<-as.double(subdataset2012[,4])
pollersFAC2012<-as.factor(subdataset2012[,5])
NOpolls<-sum(subdataset2012[,2]=="MO")
locations<-which(subdataset2012[,2]=="MO")
MOPredictresults<-cbind(as.double(subdataset2012[locations,1]),rep(0,NOpolls))
counts<-0
for (i in locations)
{
  counts<-counts+1
  MOdatapoints<-data.frame(statesFAC="MO", margins=margins2012[i], lagtime=lagtime2012[i], pollersFAC=pollersFAC2012[i])
  MOPredictresults[counts,2]<-predict(logitreg, MOdatapoints, type="response")
}

Then we apply the model from Q5 to predict the success probabilities:

NOpolls<-sum(subdataset2012[,2]=="MO")
locations<-which(subdataset2012[,2]=="MO")
MOPredictresults1<-cbind(as.double(subdataset2012[locations,1]),rep(0,NOpolls))
counts<-0
for (i in locations)
{
  counts<-counts+1
  MOdatapoints<-data.frame(margins=margins2012[i], lagtime=lagtime2012[i], pollersFAC=pollersFAC2012[i])
  MOPredictresults1[counts,2]<-predict(logitreg1, MOdatapoints, type="response")
}

The results are summarized below. The first column is the predicted winner based on each poll, the second column is the success probability predicted by the model from Q4, and the third column is the success probability predicted by the model from Q5.

[1,] 0 0.5035039 0.7200187

[2,] 0 0.9694242 0.9710735

[3,] 0 0.6044276 0.7044485

[4,] 0 0.8194090 0.8628442

[5,] 0 0.7910887 0.8080985

[6,] 0 0.9278236 0.8966974

[7,] 0 0.9421374 0.9413403

[8,] 0 0.5542213 0.4922224

[9,] 0 0.6769271 0.7092202

[10,] 0 0.2520589 0.3617698

[11,] 0 0.6137583 0.5711565

[12,] 0 0.6647826 0.6007542

[13,] 1 0.4416067 0.4014089

The R code and results for the state "CO" are given below. The first part uses the larger model from Q4:

margins2012<-as.double(subdataset2012[,3])
lagtime2012<-as.double(subdataset2012[,4])
pollersFAC2012<-as.factor(subdataset2012[,5])
NOpolls<-sum(subdataset2012[,2]=="CO")
locations<-which(subdataset2012[,2]=="CO")
COPredictresults<-cbind(as.double(subdataset2012[locations,1]),rep(0,NOpolls))
counts<-0
for (i in locations)
{
  counts<-counts+1
  COdatapoints<-data.frame(statesFAC="CO", margins=margins2012[i], lagtime=lagtime2012[i], pollersFAC=pollersFAC2012[i])
  COPredictresults[counts,2]<-predict(logitreg, COdatapoints, type="response")
}

The following part performs the prediction for the state "CO" using the simple logistic regression model in Q5.

NOpolls<-sum(subdataset2012[,2]=="CO")
locations<-which(subdataset2012[,2]=="CO")
COPredictresults1<-cbind(as.double(subdataset2012[locations,1]),rep(0,NOpolls))
counts<-0
for (i in locations)
{
  counts<-counts+1
  COdatapoints<-data.frame(margins=margins2012[i], lagtime=lagtime2012[i], pollersFAC=pollersFAC2012[i])
  COPredictresults1[counts,2]<-predict(logitreg1, COdatapoints, type="response")
}

The predicted probabilities are summarized below. The first column is the predicted winner based on each poll, the second column is the success probability predicted by the model from Q4, and the third column is the success probability predicted by the model from Q5.

[1,] 0 0.2966687 0.2437847

[2,] 0 0.6704433 0.5091158

[3,] 0 0.9217368 0.8403067

[4,] 1 0.7045756 0.7054296

[5,] 0 0.7585907 0.6682250

[6,] 0 0.8257308 0.6908284

[7,] 1 0.8497009 0.6723346

[8,] 1 0.7255040 0.5371906

[9,] 0 0.4452990 0.3148480

[10,] 0 0.8973189 0.7094159

[11,] 0 0.7802836 0.5773134

[12,] 0 0.6515317 0.4903710

[13,] 0 0.8016550 0.6377274

[14,] 1 0.8738779 0.6823733

[15,] 0 0.9037745 0.8141867

[16,] 1 0.5948107 0.4947829

[17,] 1 0.6632452 0.4669786

[18,] 1 0.9559369 0.9366182

[19,] 1 0.9081582 0.9219822


Q7. In this question, we will predict the winner of each state (FL, MI, MO and CO) using the predictions obtained in Q6. To be concrete, define the winner indicator as 1 (WIND=1) if the Democratic candidate is the winner, and 0 otherwise. From Q6, we know the probability that a poll made a correct prediction of the winner (i.e., Resp=1). Note that Resp=1 exactly when the WIND implied by the polling data agrees with the WIND based on the actual election outcome. We therefore use the average probability that WIND=1 to predict the probability that Dem wins the state, and the average probability that WIND=0 to predict the probability that Rep wins. The average is taken across the predicted probabilities of all polls conducted in that state. Please carry out the prediction using both the Q4 and Q5 models. Comparing your predictions with the actual election results in the data file "2012-results.csv", what are your conclusions about the accuracy of your predictions?

Answer: The predicted probabilities for MI using the model from Q4 are computed by the following R code:

MIprobDemwin<-MIPredictresults[,1]*MIPredictresults[,2]+(1-MIPredictresults[,1])*(1-MIPredictresults[,2])
MImeanProbDemwin<-mean(MIprobDemwin)
MIprobGopwin<-(1-MIPredictresults[,1])*MIPredictresults[,2]+MIPredictresults[,1]*(1-MIPredictresults[,2])
MImeanProbGopwin<-mean(MIprobGopwin)

The predicted probabilities for MI using the model from Q5 are computed by the following R code:

MIprobDemwin1<-MIPredictresults1[,1]*MIPredictresults1[,2]+(1-MIPredictresults1[,1])*(1-MIPredictresults1[,2])
MImeanProbDemwin1<-mean(MIprobDemwin1)
MIprobGopwin1<-(1-MIPredictresults1[,1])*MIPredictresults1[,2]+MIPredictresults1[,1]*(1-MIPredictresults1[,2])
MImeanProbGopwin1<-mean(MIprobGopwin1)

Similarly, the predicted probabilities for the "FL" state using the model from Q4 are given by:

FLprobDemwin<-FLPredictresults[,1]*FLPredictresults[,2]+(1-FLPredictresults[,1])*(1-FLPredictresults[,2])
FLmeanProbDemwin<-mean(FLprobDemwin)
FLprobGopwin<-(1-FLPredictresults[,1])*FLPredictresults[,2]+FLPredictresults[,1]*(1-FLPredictresults[,2])
FLmeanProbGopwin<-mean(FLprobGopwin)

For model Q5, we use the following R code

FLprobDemwin1<-FLPredictresults1[,1]*FLPredictresults1[,2]+(1-FLPredictresults1[,1])*(1-FLPredictresults1[,2])
FLmeanProbDemwin1<-mean(FLprobDemwin1)
FLprobGopwin1<-(1-FLPredictresults1[,1])*FLPredictresults1[,2]+FLPredictresults1[,1]*(1-FLPredictresults1[,2])
FLmeanProbGopwin1<-mean(FLprobGopwin1)

Since the R code for the states MO and CO is similar to that for the two states above, we omit the details here; the code is included in the file "R-code-for-lab-3.txt" on the class website.
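The per-state computation above repeats the same two lines; it can optionally be wrapped in a small helper (a sketch, equivalent to the code above):

## pred: two-column matrix (poll's WIND, predicted success probability)
winprob <- function(pred) {
  pDem <- pred[,1]*pred[,2] + (1-pred[,1])*(1-pred[,2])   ## P(Dem wins) implied by each poll
  c(Dem = mean(pDem), Rep = 1 - mean(pDem))
}
## e.g. winprob(MIPredictresults); winprob(FLPredictresults1)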

We summarize the prediction results in the table below.

       Model Q4                  Model Q5                  Actual
       Dem     Rep     Winner    Dem     Rep     Winner    Winner
MI     0.843   0.156   Dem       0.842   0.157   Dem       Dem
FL     0.553   0.447   Dem       0.577   0.422   Dem       Dem
MO     0.317   0.683   Rep       0.289   0.711   Rep       Rep
CO     0.491   0.509   Rep       0.522   0.478   Dem       Dem

Based on the above table, the predictions based on the Q5 model are all correct, while the predictions based on the Q4 model are not: the prediction for CO is wrong under the Q4 model but correct under the Q5 model.

Q8. Please construct 95% prediction intervals for the average probabilities predicted in Q7.

Answer: The method for constructing the 95% prediction intervals was introduced in one of the notes sent by email. Briefly, the code below uses a normal (delta-method) approximation: the variance of an average predicted probability is estimated as G'VG, where V is the estimated covariance matrix of the fitted coefficients (vcov) and G is the average of the gradients of the per-poll probabilities with respect to the coefficients; the interval is the estimate plus or minus 1.96 standard errors. The R code for constructing the prediction intervals for the average probabilities using the Q4 model is given below for the four states "MI", "FL", "MO" and "CO".

## Deritive() returns the derivative of the logistic mean function,
## exp(x'beta)/(1+exp(x'beta))^2, evaluated at a row x of the design matrix;
## it is used to build the gradient for the delta-method variance below.
Deritive<-function(x,beta)
{
  deri0<-exp(x%*%beta)
  deri1<-deri0/((1+deri0)^2)
  return(deri1)
}

## Prediction intervals for Michigan
locations<-which(subdataset2012[,2]=="MI")
sub.MI2012<-subdataset2012[locations,]
loc2008<-which(subdataset2008[,2]=="MI")
SApart<-model.matrix(logitreg)[loc2008[1],c(1:23)]
ModMatQ4<-NULL
for (i in 1:dim(sub.MI2012)[1])
{
  pollerloc2008<-which(subdataset2008[,5]==sub.MI2012[i,5])
  PollersIND<-model.matrix(logitreg)[pollerloc2008[1],c(26:38)]
  ModMatQ4<-rbind(ModMatQ4,c(SApart,as.numeric(sub.MI2012[i,3:4]),PollersIND))
}
Ghat1<-apply(ModMatQ4, 1, Deritive, beta=coef(logitreg))
Ghat2<-ModMatQ4*Ghat1
Ghat3<-Ghat2*((-1)^(1+as.numeric(sub.MI2012[,1])))
Ghat<-colMeans(Ghat3)
Varphat<-t(Ghat)%*%vcov(logitreg)%*%Ghat
MIPredIntQ4Dem<-c(MImeanProbDemwin-qnorm(0.975)*sqrt(Varphat),MImeanProbDemwin+qnorm(0.975)*sqrt(Varphat))
Ghat3rep<-Ghat2*((-1)^(as.numeric(sub.MI2012[,1])))
Ghatrep<-colMeans(Ghat3rep)
Varphatrep<-t(Ghatrep)%*%vcov(logitreg)%*%Ghatrep
MIPredIntQ4Rep<-c(MImeanProbGopwin-qnorm(0.975)*sqrt(Varphatrep),MImeanProbGopwin+qnorm(0.975)*sqrt(Varphatrep))

## Prediction intervals for Florida
locations<-which(subdataset2012[,2]=="FL")
sub.FL2012<-subdataset2012[locations,]
loc2008<-which(subdataset2008[,2]=="FL")
SApart<-model.matrix(logitreg)[loc2008[1],c(1:23)]
ModMatQ4<-NULL
for (i in 1:dim(sub.FL2012)[1])
{
  pollerloc2008<-which(subdataset2008[,5]==sub.FL2012[i,5])
  PollersIND<-model.matrix(logitreg)[pollerloc2008[1],c(26:38)]
  ModMatQ4<-rbind(ModMatQ4,c(SApart,as.numeric(sub.FL2012[i,3:4]),PollersIND))
}
Ghat1<-apply(ModMatQ4, 1, Deritive, beta=coef(logitreg))
Ghat2<-ModMatQ4*Ghat1
Ghat3<-Ghat2*((-1)^(1+as.numeric(sub.FL2012[,1])))
Ghat<-colMeans(Ghat3)
Varphat<-t(Ghat)%*%vcov(logitreg)%*%Ghat
FLPredIntQ4Dem<-c(FLmeanProbDemwin-qnorm(0.975)*sqrt(Varphat),FLmeanProbDemwin+qnorm(0.975)*sqrt(Varphat))
Ghat3rep<-Ghat2*((-1)^(as.numeric(sub.FL2012[,1])))
Ghatrep<-colMeans(Ghat3rep)
Varphatrep<-t(Ghatrep)%*%vcov(logitreg)%*%Ghatrep
FLPredIntQ4Rep<-c(FLmeanProbGopwin-qnorm(0.975)*sqrt(Varphatrep),FLmeanProbGopwin+qnorm(0.975)*sqrt(Varphatrep))

## Prediction intervals for Missouri
locations<-which(subdataset2012[,2]=="MO")
sub.MO2012<-subdataset2012[locations,]
loc2008<-which(subdataset2008[,2]=="MO")
SApart<-model.matrix(logitreg)[loc2008[1],c(1:23)]
ModMatQ4<-NULL
for (i in 1:dim(sub.MO2012)[1])
{
  pollerloc2008<-which(subdataset2008[,5]==sub.MO2012[i,5])
  PollersIND<-model.matrix(logitreg)[pollerloc2008[1],c(26:38)]
  ModMatQ4<-rbind(ModMatQ4,c(SApart,as.numeric(sub.MO2012[i,3:4]),PollersIND))
}
Ghat1<-apply(ModMatQ4, 1, Deritive, beta=coef(logitreg))
Ghat2<-ModMatQ4*Ghat1
Ghat3<-Ghat2*((-1)^(1+as.numeric(sub.MO2012[,1])))
Ghat<-colMeans(Ghat3)
Varphat<-t(Ghat)%*%vcov(logitreg)%*%Ghat
MOPredIntQ4Dem<-c(MOmeanProbDemwin-qnorm(0.975)*sqrt(Varphat),MOmeanProbDemwin+qnorm(0.975)*sqrt(Varphat))
Ghat3rep<-Ghat2*((-1)^(as.numeric(sub.MO2012[,1])))
Ghatrep<-colMeans(Ghat3rep)
Varphatrep<-t(Ghatrep)%*%vcov(logitreg)%*%Ghatrep
MOPredIntQ4Rep<-c(MOmeanProbGopwin-qnorm(0.975)*sqrt(Varphatrep),MOmeanProbGopwin+qnorm(0.975)*sqrt(Varphatrep))

## Prediction intervals for Colorado
locations<-which(subdataset2012[,2]=="CO")
sub.CO2012<-subdataset2012[locations,]
loc2008<-which(subdataset2008[,2]=="CO")
SApart<-model.matrix(logitreg)[loc2008[1],c(1:23)]
ModMatQ4<-NULL
for (i in 1:dim(sub.CO2012)[1])
{
  pollerloc2008<-which(subdataset2008[,5]==sub.CO2012[i,5])
  PollersIND<-model.matrix(logitreg)[pollerloc2008[1],c(26:38)]
  ModMatQ4<-rbind(ModMatQ4,c(SApart,as.numeric(sub.CO2012[i,3:4]),PollersIND))
}
Ghat1<-apply(ModMatQ4, 1, Deritive, beta=coef(logitreg))
Ghat2<-ModMatQ4*Ghat1
Ghat3<-Ghat2*((-1)^(1+as.numeric(sub.CO2012[,1])))
Ghat<-colMeans(Ghat3)
Varphat<-t(Ghat)%*%vcov(logitreg)%*%Ghat
COPredIntQ4Dem<-c(COmeanProbDemwin-qnorm(0.975)*sqrt(Varphat),COmeanProbDemwin+qnorm(0.975)*sqrt(Varphat))
Ghat3rep<-Ghat2*((-1)^(as.numeric(sub.CO2012[,1])))
Ghatrep<-colMeans(Ghat3rep)
Varphatrep<-t(Ghatrep)%*%vcov(logitreg)%*%Ghatrep
COPredIntQ4Rep<-c(COmeanProbGopwin-qnorm(0.975)*sqrt(Varphatrep),COmeanProbGopwin+qnorm(0.975)*sqrt(Varphatrep))

The following table summarizes the prediction intervals for all the states (MI, FL,

MO and CO) based on model Q4.

       Dem     95% Prediction Interval    Rep     95% Prediction Interval
MI     0.843   (0.754, 0.933)             0.156   (0.067, 0.246)
FL     0.553   (0.481, 0.624)             0.447   (0.375, 0.519)
MO     0.317   (0.200, 0.434)             0.683   (0.566, 0.800)
CO     0.491   (0.454, 0.527)             0.509   (0.473, 0.546)

Based on the above table, the prediction intervals for MI and MO do not include 0.5, which suggests the predictions for MI and MO made in Q7 are reliable. The prediction intervals for CO and FL do include 0.5, which suggests the predictions made using the Q4 model are not very reliable for these two states.

The R code for constructing the prediction intervals based on the Q5 model is given below:

## Prediction intervals for Michigan
locations<-which(subdataset2012[,2]=="MI")
sub.MI2012<-subdataset2012[locations,]
ModMatQ5<-NULL
for (i in 1:dim(sub.MI2012)[1])
{
  pollerloc2008<-which(subdataset2008[,5]==sub.MI2012[i,5])
  PollersIND<-model.matrix(logitreg1)[pollerloc2008[1],c(4:16)]
  ModMatQ5<-rbind(ModMatQ5,c(1,as.numeric(sub.MI2012[i,3:4]),PollersIND))
}
Ghat1<-apply(ModMatQ5, 1, Deritive, beta=coef(logitreg1))
Ghat2<-ModMatQ5*Ghat1
Ghat3<-Ghat2*((-1)^(1+as.numeric(sub.MI2012[,1])))
Ghat<-colMeans(Ghat3)
Varphat<-t(Ghat)%*%vcov(logitreg1)%*%Ghat
MIPredIntQ5Dem<-c(MImeanProbDemwin1-qnorm(0.975)*sqrt(Varphat),MImeanProbDemwin1+qnorm(0.975)*sqrt(Varphat))
Ghat3rep<-Ghat2*((-1)^(as.numeric(sub.MI2012[,1])))
Ghatrep<-colMeans(Ghat3rep)
Varphatrep<-t(Ghatrep)%*%vcov(logitreg1)%*%Ghatrep
MIPredIntQ5Rep<-c(MImeanProbGopwin1-qnorm(0.975)*sqrt(Varphatrep),MImeanProbGopwin1+qnorm(0.975)*sqrt(Varphatrep))

## Prediction intervals for Florida
locations<-which(subdataset2012[,2]=="FL")
sub.FL2012<-subdataset2012[locations,]
ModMatQ5<-NULL
for (i in 1:dim(sub.FL2012)[1])
{
  pollerloc2008<-which(subdataset2008[,5]==sub.FL2012[i,5])
  PollersIND<-model.matrix(logitreg1)[pollerloc2008[1],c(4:16)]
  ModMatQ5<-rbind(ModMatQ5,c(1,as.numeric(sub.FL2012[i,3:4]),PollersIND))
}
Ghat1<-apply(ModMatQ5, 1, Deritive, beta=coef(logitreg1))
Ghat2<-ModMatQ5*Ghat1
Ghat3<-Ghat2*((-1)^(1+as.numeric(sub.FL2012[,1])))
Ghat<-colMeans(Ghat3)
Varphat<-t(Ghat)%*%vcov(logitreg1)%*%Ghat
FLPredIntQ5Dem<-c(FLmeanProbDemwin1-qnorm(0.975)*sqrt(Varphat),FLmeanProbDemwin1+qnorm(0.975)*sqrt(Varphat))
Ghat3rep<-Ghat2*((-1)^(as.numeric(sub.FL2012[,1])))
Ghatrep<-colMeans(Ghat3rep)
Varphatrep<-t(Ghatrep)%*%vcov(logitreg1)%*%Ghatrep
FLPredIntQ5Rep<-c(FLmeanProbGopwin1-qnorm(0.975)*sqrt(Varphatrep),FLmeanProbGopwin1+qnorm(0.975)*sqrt(Varphatrep))

## Prediction intervals for Missouri
locations<-which(subdataset2012[,2]=="MO")
sub.MO2012<-subdataset2012[locations,]
ModMatQ5<-NULL
for (i in 1:dim(sub.MO2012)[1])
{
  pollerloc2008<-which(subdataset2008[,5]==sub.MO2012[i,5])
  PollersIND<-model.matrix(logitreg1)[pollerloc2008[1],c(4:16)]
  ModMatQ5<-rbind(ModMatQ5,c(1,as.numeric(sub.MO2012[i,3:4]),PollersIND))
}
Ghat1<-apply(ModMatQ5, 1, Deritive, beta=coef(logitreg1))
Ghat2<-ModMatQ5*Ghat1
Ghat3<-Ghat2*((-1)^(1+as.numeric(sub.MO2012[,1])))
Ghat<-colMeans(Ghat3)
Varphat<-t(Ghat)%*%vcov(logitreg1)%*%Ghat
MOPredIntQ5Dem<-c(MOmeanProbDemwin1-qnorm(0.975)*sqrt(Varphat),MOmeanProbDemwin1+qnorm(0.975)*sqrt(Varphat))
Ghat3rep<-Ghat2*((-1)^(as.numeric(sub.MO2012[,1])))
Ghatrep<-colMeans(Ghat3rep)
Varphatrep<-t(Ghatrep)%*%vcov(logitreg1)%*%Ghatrep
MOPredIntQ5Rep<-c(MOmeanProbGopwin1-qnorm(0.975)*sqrt(Varphatrep),MOmeanProbGopwin1+qnorm(0.975)*sqrt(Varphatrep))

## Prediction intervals for Colorado
locations<-which(subdataset2012[,2]=="CO")
sub.CO2012<-subdataset2012[locations,]
ModMatQ5<-NULL
for (i in 1:dim(sub.CO2012)[1])
{
  pollerloc2008<-which(subdataset2008[,5]==sub.CO2012[i,5])
  PollersIND<-model.matrix(logitreg1)[pollerloc2008[1],c(4:16)]
  ModMatQ5<-rbind(ModMatQ5,c(1,as.numeric(sub.CO2012[i,3:4]),PollersIND))
}
Ghat1<-apply(ModMatQ5, 1, Deritive, beta=coef(logitreg1))
Ghat2<-ModMatQ5*Ghat1
Ghat3<-Ghat2*((-1)^(1+as.numeric(sub.CO2012[,1])))
Ghat<-colMeans(Ghat3)
Varphat<-t(Ghat)%*%vcov(logitreg1)%*%Ghat
COPredIntQ5Dem<-c(COmeanProbDemwin1-qnorm(0.975)*sqrt(Varphat),COmeanProbDemwin1+qnorm(0.975)*sqrt(Varphat))
Ghat3rep<-Ghat2*((-1)^(as.numeric(sub.CO2012[,1])))
Ghatrep<-colMeans(Ghat3rep)
Varphatrep<-t(Ghatrep)%*%vcov(logitreg1)%*%Ghatrep
COPredIntQ5Rep<-c(COmeanProbGopwin1-qnorm(0.975)*sqrt(Varphatrep),COmeanProbGopwin1+qnorm(0.975)*sqrt(Varphatrep))


The following table summarizes the prediction intervals for all the states (MI, FL,

MO and CO) based on model Q5.

       Dem     95% Prediction Interval    Rep     95% Prediction Interval
MI     0.842   (0.781, 0.902)             0.157   (0.097, 0.218)
FL     0.577   (0.545, 0.611)             0.422   (0.389, 0.455)
MO     0.289   (0.253, 0.326)             0.711   (0.674, 0.747)
CO     0.522   (0.499, 0.545)             0.478   (0.454, 0.500)

Based on the above table, and similar to the prediction interval table for Q4, the prediction intervals for MI, FL and MO do not include 0.5, but the prediction interval for CO does. This suggests that the predictions for MI, FL and MO made in Q7 are reliable, whereas the prediction for CO is not very reliable. However, compared with the prediction intervals for CO given by the Q4 model, the intervals from the Q5 model are narrower, which might suggest that the prediction from the Q5 model is more reliable than that from the Q4 model.

Q9. Finally, implement Silver's approach on the data sets created in Q3 and Q6 to predict the winners of the states considered in Q6 (namely FL, MI, MO and CO). Please compare the accuracy of the predictions from Silver's approach and from our approach.

Answer: Following the algorithm described at the beginning of the lab, we can implement it step by step using the following R code:

## Step 1: Compute average errors
statedgesbypolls<-polls2008sub[,2]-polls2008sub[,3]
subSEbypolls<-statedgesbypolls[dataset2008[,2]%in%stateslist]
subSEwithstates<-cbind(subdataset2008[,c(2,5)],subSEbypolls)
trueSE<-results2008[,2]-results2008[,3]
trueSEexpand<-rep(0,length(subSEbypolls))
for (i in 1:length(subSEbypolls))
{
  loc<-which(results2008[,1]==subSEwithstates[i,1])
  trueSEexpand[i]<-trueSE[loc]
}
Errors<-abs(subSEbypolls-trueSEexpand)    ## each poll's absolute error in the state edge
subSEs<-cbind(subSEwithstates,trueSEexpand,Errors)
Errorbypollers<-tapply(Errors,subSEs[,2],mean)


## Step 2: Compute weights
rankPollers<-rank(Errorbypollers)
weights<-1/(rankPollers^2)
## Step 3: Compute weighted averages
poll2012<-polls2012sub[dataset2012[,2]%in%stateslist,]
poll2012weights<-rep(0,dim(poll2012)[1])
for (i in 1:length(poll2012weights))
{
  locwei<-which(names(weights)==poll2012[i,5])
  poll2012weights[i]<-weights[locwei]
}
DemPollsWei<-poll2012[,2]*poll2012weights
RepPollsWei<-poll2012[,3]*poll2012weights
DemAveNA<-tapply(DemPollsWei,poll2012[,1],sum)/tapply(poll2012weights,poll2012[,1],sum)
DemAve<-DemAveNA[!is.na(DemAveNA)]
RepAveNA<-tapply(RepPollsWei,poll2012[,1],sum)/tapply(poll2012weights,poll2012[,1],sum)
RepAve<-RepAveNA[!is.na(RepAveNA)]
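The weighted averages for the four states of interest can then be pulled out and compared directly; a minimal sketch (the elements of DemAve and RepAve are named by state abbreviation):

states4 <- c("CO", "FL", "MI", "MO")
rbind(Dem = round(DemAve[states4], 5),
      Rep = round(RepAve[states4], 5),
      Prediction = ifelse(DemAve[states4] > RepAve[states4], "Dem", "Rep"))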

The following results are the weighted averages for the four states ("MI", "FL", "MO" and "CO") computed by Silver's approach:

              CO         FL         MI         MO
Dem          47.72712   47.08507   49.35732   42.89962
Rep          47.06949   46.62246   41.43978   50.45573
Prediction   Dem        Dem        Dem        Rep

Based on the weighted averages in the above table, Nate Silver's approach predicts all four results correctly. Therefore, the accuracy of Silver's approach is the same as that of our method based on the Q5 model, and better than that based on the Q4 model.