imputation techniques for market research datasets with missing values

Imputation Techniques for Market Research Datasets with Missing Values: Multivariate Adaptive Regression Splines, Linear Probability Models and Logit Analyses

Ingo Bentrott

School of Marketing

University of Technology, Sydney

“Vinod Shetty, of Mumbai, secretary of the newly formed Young Professionals Collective, said staff were subject to so much abuse that thousands of its workers were quitting in despair. The problem has become so bad that remaining workers are being forced to extend their shifts to 12 to 13 hours a day to fill the gaps.

Although a call centre worker in India earns about $70 a week- twice as much as most professionals in a nation suffering chronic underemployment- up to 60 per cent leave their jobs each year.”

Abusive callers targeted in Indian mutiny

By Nick O’Malley, March 18, 2006

(insert graph and logistic regression)

If you run a logistic regression for BUY using the data on the left, you will get a response like the graphic on the right.

This is due to the well-known issue of Listwise Deletion (LD): any case with a missing value is dropped before estimation.
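To make the effect of listwise deletion concrete, here is a minimal Python sketch. The records and their values are invented for illustration and are not the data from the slide. Only the variable names (AGE, INCOME, USER_RATING, BUY) echo the presentation.

```python
# Hypothetical survey records; None marks a missing answer.
rows = [
    {"AGE": 34, "INCOME": 52000, "USER_RATING": 4, "BUY": 1},
    {"AGE": 41, "INCOME": None, "USER_RATING": 5, "BUY": 1},
    {"AGE": None, "INCOME": 38000, "USER_RATING": 2, "BUY": 0},
    {"AGE": 29, "INCOME": 45000, "USER_RATING": None, "BUY": 1},
    {"AGE": 55, "INCOME": 61000, "USER_RATING": 3, "BUY": 0},
    {"AGE": 47, "INCOME": None, "USER_RATING": None, "BUY": 0},
]


def listwise_delete(records):
    """Keep only records that have no missing value in any field."""
    return [r for r in records if all(v is not None for v in r.values())]


complete = listwise_delete(rows)
share_lost = 1 - len(complete) / len(rows)
print(f"{len(complete)} of {len(rows)} rows survive; {share_lost:.0%} lost")
```

Even though only five of the 24 cells are missing, most of the rows are discarded, which is why a model estimated on the surviving cases can look very different from what the full data would suggest.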

1. Introduction

There are two types of non-response: complete non-response, where the person does not participate in the survey at all, and item non-response, where a survey is only partially completed.

Coleman (1991) mentioned that rates of non-response have remained constant, but Jarvis (2002) says the rates are increasing when you control for answering machines.

Respondents have grown to ‘strongly dislike’ phone surveys.

◦ The primary concern is privacy, which has been made worse by well-publicized breaches in security (Jarvis 2002)

In essence, whenever your dataset has missing values, you are forced to address them somehow.

◦ Delete or impute

1. Introduction

Missing data can be of three types:

◦ Missing Completely at Random (MCAR): Missings are unrelated to the value of x or any other variable

◦ Missing at Random (MAR): Missing is not a function of x when ‘controlled for other variable effects.’

◦ Non-ignorable missing: Missing is caused by an unmeasured variable
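The three mechanisms can be illustrated with a small simulation. This is only a sketch under assumed settings: the variables `z` (always observed), `u` (unmeasured) and `x` (subject to missingness), and the missingness probabilities, are all invented for illustration.

```python
import random


def simulate(mechanism, n=10000, seed=0):
    """Return (mean of all x, mean of observed x) under one mechanism."""
    rng = random.Random(seed)
    all_x, observed = [], []
    for _ in range(n):
        z = rng.gauss(0, 1)          # fully observed covariate
        u = rng.gauss(0, 1)          # unmeasured variable
        x = z + u + rng.gauss(0, 1)  # variable that may go missing
        if mechanism == "MCAR":
            p_missing = 0.3                    # same chance for everyone
        elif mechanism == "MAR":
            p_missing = 0.6 if z > 0 else 0.1  # depends only on observed z
        else:  # non-ignorable
            p_missing = 0.6 if u > 0 else 0.1  # depends on unmeasured u
        all_x.append(x)
        if rng.random() >= p_missing:
            observed.append(x)
    return sum(all_x) / n, sum(observed) / len(observed)


for mechanism in ("MCAR", "MAR", "non-ignorable"):
    full_mean, observed_mean = simulate(mechanism)
    print(f"{mechanism:14s} full={full_mean:+.3f} observed={observed_mean:+.3f}")
```

Under MCAR the observed mean stays close to the full mean. Under MAR and non-ignorable missingness the observed mean drifts away from it. The difference is that in the MAR case the drift can in principle be corrected using `z`, while in the non-ignorable case the driver `u` is not in the data.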

1. Introduction- Types of Missing Data

Most current discrete choice studies use stated preference designs

◦ These create orthogonal Xs

This is a way to reduce the number of respondents by getting as much data as possible out of fewer respondents.

Discrete choice studies based on Random Utility Theory (RUT) can give you excellent willingness-to-pay (WTP) estimates

◦ Complete cases are necessary for low-variance estimation

If data is collected by the same survey instrument, it is likely to have the same missing pattern across the Xs (Howell, 1998).

Revealed Preference (RP) data usually has multicollinearity issues and the use of missing data indicators will exacerbate this issue.

1. Introduction- Discrete Choice Survey

(insert graph)

In our earlier example, most multiple imputation techniques would still have problems imputing a value for USER RATING above.

If the only variables that can be used are AGE, INCOME and POST CODE, the imputed values would be a linear combination of these.

2. Review of Literature

Many statistics packages use Listwise Deletion (LD) by default when estimating a discrete choice model.

◦ In SEM models, the variance-covariance matrix only uses valid data for estimation

LD leads to selection bias and estimates with reduced efficiency.

If the data is MCAR, the only penalty is a loss of power.

Multiple Imputation (MI) takes multiple imputes of the same data point and averages the results

◦ MI is a main-effects-only model; CART/MARS use interactions, so we may not need multiple imputes

“Hot Deck” imputation (Little and Rubin, 1987) is a technique in which missing values are filled with values from similar cases (similar to surrogates in CART)
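The hot deck idea can be sketched in a few lines of Python. This is a nearest-neighbour variant chosen for brevity, not necessarily the procedure described by Little and Rubin. The function name `hot_deck`, the records, and the distance measure (summed absolute differences) are assumptions made for illustration.

```python
def hot_deck(records, target, match_on):
    """Fill missing `target` values with the value of the closest donor."""
    donors = [r for r in records if r[target] is not None]
    filled = []
    for r in records:
        if r[target] is None:
            # Closest donor by summed absolute difference on the matching
            # fields. Unscaled, so large-valued fields dominate the distance.
            donor = min(
                donors,
                key=lambda d: sum(abs(d[k] - r[k]) for k in match_on),
            )
            r = {**r, target: donor[target]}
        filled.append(r)
    return filled


# Hypothetical respondents; INCOME is in thousands, None marks a missing value.
people = [
    {"AGE": 25, "INCOME": 40, "USER_RATING": 2},
    {"AGE": 52, "INCOME": 90, "USER_RATING": 5},
    {"AGE": 27, "INCOME": 42, "USER_RATING": None},  # resembles the first record
    {"AGE": 50, "INCOME": 85, "USER_RATING": None},  # resembles the second record
]
completed = hot_deck(people, "USER_RATING", ["AGE", "INCOME"])
print([p["USER_RATING"] for p in completed])
```

Each missing rating is borrowed from the respondent who looks most alike on AGE and INCOME, so imputed values are always values that actually occur in the data.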

2. Review of Literature- what is out there

Expectation Maximization (EM) has been successfully applied to missing data, but standard errors must be obtained using auxiliary methods.

◦ Missing values are imputed during EM

FIML and ML methods assume multivariate normality

◦ These techniques are best when there are a few, distinct patterns of missing data (Little, Schnabel, Baumert, 2000).

If the data is MAR and not MCAR, all the above techniques will be biased

◦ Since MAR implies another ‘observed’ explanatory variable is affecting the missing, interactions in CART/MARS can pick this up.

2. Review of Literature- what is out there

Most missing data tends to act in combination (Borgoni and Berrington, 2004)

We should not try to “break” the multivariate nature of the data.

◦ CART uses surrogates, so even though we impute data one variable at a time, the structure will be preserved.

Most imputation techniques assume multivariate normality.

Imputation sometimes assumes the data is MCAR, but if the data has a high degree of interaction and is non-monotonic, CART, by its nature, will perform better on data that is MAR.

The EM algorithm has been shown to work well, but it handles missing values only during estimation

◦ The CART technique can fill the dataset for later analysis.

2. Review of Literature- Why use CART?

If the data is high-dimensional and sparse, the univariate nature of CART will handle this better than Multiple Imputation using regression.

Trees are also less sensitive to outliers and misspecified models.

Although a multiple-iteration tree, which takes multiple draws from CART’s conditional distribution, is shown to be better in Monte Carlo studies (Borgoni and Berrington, 2004), the results are within a standard error of the “one shot” variable-at-a-time CART imputation technique.

◦ One shot has some added variability (like other techniques), but standard errors may be underestimated.

◦ Extra information gathered from imputation may offset the extra variability.

If the data is MCAR, a simple Pearson chi-square test of observed versus expected values can be used to validate the imputed values.
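As a rough illustration of this check, the sketch below compares the category counts of imputed values against the counts expected from the distribution among respondents who answered. All numbers are hypothetical. The 5% critical value of 5.991 for 2 degrees of freedom is the standard chi-square table value.

```python
def pearson_chi_square(observed_counts, expected_counts):
    """Pearson statistic: sum over categories of (O - E)^2 / E."""
    return sum(
        (o - e) ** 2 / e for o, e in zip(observed_counts, expected_counts)
    )


# Hypothetical category shares among respondents who answered the question.
answered_share = [0.5, 0.3, 0.2]
# Hypothetical counts of imputed values falling in each category.
imputed_counts = [46, 33, 21]

n_imputed = sum(imputed_counts)
expected_counts = [share * n_imputed for share in answered_share]
statistic = pearson_chi_square(imputed_counts, expected_counts)

CRITICAL_5PCT_DF2 = 5.991  # chi-square table value, df = 3 categories - 1
consistent = statistic < CRITICAL_5PCT_DF2
print(f"chi-square = {statistic:.2f}, consistent with MCAR: {consistent}")
```

Under MCAR the missing cases should look like the observed ones, so a statistic below the critical value gives no evidence that the imputed values are distributed differently from the observed values.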

2. Review of Literature- Why use CART?

(insert table of Descriptive Statistics)

The diagnostic, binary-valued variable investigated is whether the patient shows signs of diabetes according to World Health Organization criteria (i.e., if the 2 hour post-load plasma glucose was at least 200 mg/dl at any survey examination or if found during routine medical care). The population lives near Phoenix, Arizona, USA.

3. Data- Pima Indians Diabetes Data Base

(insert table of Descriptive Statistics)

This is a dataset with information about renters and homeowners. The dataset is a good mixture of categorical and continuous variables with a lot of missing data.

3. Data- Homeowner

This survey is aimed at gathering some information about your preferences for athletic shoes. More specifically, the product in question is an athletic shoe that is to be used primarily for playing a sport (or several sports). For example, the shoes could be used for playing basketball, tennis, running, hiking, and so on.

Since the questions asked are from a balanced stated preference (SP) design, missing values occur only in the demographic questions.

3. Data- Shoes

(insert table on Descriptive Statistics)

3. Data- Shoes

This presentation looks at 5 different modeling techniques on the 3 datasets mentioned previously.

Model 1. The first model was a simple logistic regression using all variables

◦ No transformations

◦ Listwise deletion was used for missing values

Model 2. A MARS model was then run with main effects only and all model defaults

◦ Since the response is binary, this is a Linear Probability Model (LPM)

Model 3. Mean imputation was used in a logit model

Model 4. MARS basis functions were then put into a logistic regression to recover standard errors and eliminate the need for weighted least squares in the LPM
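For reference, the difference between Models 2 and 4 can be written out as follows. The notation is an assumption introduced here, not taken from the slides: $h_m$ are main-effects MARS hinge basis functions with knots $t_m$ on predictor $x_{j(m)}$, and $\beta$, $\gamma$ are the coefficients of the two fits.

```latex
% MARS hinge basis function (one of the two mirrored forms)
h_m(x) = \max\bigl(0,\; x_{j(m)} - t_m\bigr)
\quad\text{or}\quad
h_m(x) = \max\bigl(0,\; t_m - x_{j(m)}\bigr)

% Model 2: linear probability model on the basis functions
P(y = 1 \mid x) = \beta_0 + \sum_{m=1}^{M} \beta_m\, h_m(x)

% Model 4: logit on the same basis functions
P(y = 1 \mid x) =
  \frac{1}{1 + \exp\!\Bigl(-\bigl(\gamma_0 + \sum_{m=1}^{M} \gamma_m\, h_m(x)\bigr)\Bigr)}
```

The LPM can produce fitted values outside $[0, 1]$ and has heteroskedastic errors, which is where the need for weighted least squares comes from. The logit form keeps fitted probabilities inside $(0, 1)$.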

4. Methodology

Step 1. Sort the variables with missing values from fewest to most missing values

Step 2. Starting with the least missing variable, partition the data into one data set with that variable’s missing values and one data set with complete cases

Step 3. Estimate a tree on the complete cases with the least missing variable as the target

Step 4. Score the data set with missing values using the tree from step 3

Step 5. Repeat for the next affected variable until all data is filled
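The five steps can be sketched in plain Python as below. This is a simplified stand-in, not the CART software used in the presentation: the tiny regression tree (`fit_tree`), the function names, and the data are all hypothetical, and CART's surrogate splits are replaced by the simpler assumption that only columns that are already complete are used as predictors.

```python
def _sse(values):
    """Sum of squared deviations from the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def fit_tree(rows, target, features, depth=2, min_leaf=2):
    """Tiny regression tree as nested dicts; leaves predict the mean."""
    values = [r[target] for r in rows]
    leaf = {"predict": sum(values) / len(values)}
    if depth == 0 or len(rows) < 2 * min_leaf:
        return leaf
    best = None
    for f in features:
        for cut in sorted({r[f] for r in rows})[1:]:
            left = [r for r in rows if r[f] < cut]
            right = [r for r in rows if r[f] >= cut]
            if len(left) < min_leaf or len(right) < min_leaf:
                continue
            sse = _sse([r[target] for r in left]) + _sse([r[target] for r in right])
            if best is None or sse < best[0]:
                best = (sse, f, cut, left, right)
    if best is None:
        return leaf
    _, f, cut, left, right = best
    return {
        "feature": f,
        "cut": cut,
        "left": fit_tree(left, target, features, depth - 1, min_leaf),
        "right": fit_tree(right, target, features, depth - 1, min_leaf),
    }


def predict(tree, row):
    """Walk down the tree until a leaf is reached."""
    while "predict" not in tree:
        side = "left" if row[tree["feature"]] < tree["cut"] else "right"
        tree = tree[side]
    return tree["predict"]


def sequential_tree_impute(rows):
    rows = [dict(r) for r in rows]  # work on copies, leave the input untouched
    cols = list(rows[0])
    n_missing = {c: sum(r[c] is None for r in rows) for c in cols}
    complete_cols = [c for c in cols if n_missing[c] == 0]
    # Step 1: order affected variables from fewest to most missing values.
    todo = sorted((c for c in cols if n_missing[c] > 0), key=n_missing.get)
    for target in todo:  # Step 5: repeat for each affected variable
        # Step 2: split into rows with and without a value for the target.
        known = [r for r in rows if r[target] is not None]
        unknown = [r for r in rows if r[target] is None]
        # Step 3: fit a tree for the target on the rows where it is known.
        tree = fit_tree(known, target, complete_cols)
        # Step 4: score the rows where the target is missing.
        for r in unknown:
            r[target] = predict(tree, r)
        complete_cols.append(target)  # now usable as a predictor
    return rows


# Hypothetical data; INCOME is in thousands, None marks a missing value.
data = [
    {"AGE": 22, "INCOME": 30, "USER_RATING": 2},
    {"AGE": 25, "INCOME": 32, "USER_RATING": 2},
    {"AGE": 28, "INCOME": None, "USER_RATING": None},
    {"AGE": 51, "INCOME": 80, "USER_RATING": 5},
    {"AGE": 55, "INCOME": 84, "USER_RATING": 5},
    {"AGE": 58, "INCOME": 82, "USER_RATING": None},
]
filled = sequential_tree_impute(data)
```

Because each filled variable joins the predictor set for the next one, the order in step 1 matters: the variables with the most missing values are imputed last, when the most information is available.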

4. Methodology- Fifth Model: CART Multiple Imputation

(insert graph)

Regression by logit will yield a different shape than a linear probability model

Some cases will be classified differently using the same basis functions from MARS

4. Methodology

(insert table)

5. Results- Predictive Accuracy % Correct

The data on Shoe buyers is “real” in that it was an SP study that was deployed

The orthogonal design forces trade-offs and controls for interactions

The Pima Indian and Home Owner datasets are well known and have well-defined patterns amongst the Xs

If the buyers are the class of interest, a CART/MARS imputation is clearly preferred

5. What explains the discrepancy?

CART and MARS will perform better on mixed data types and should be the preferred imputation modeling technique

◦ Possible CART/MARS/Logit technique to capture all possible non-monotonics

Web-based surveys allow us to see when people quit the survey.

We can investigate whether the person looked at all questions and refused some.

◦ In mail surveys, this is impossible.

◦ The web will expand our missing data categories, as a complete survey means someone viewed and answered all the questions (Bosnjak and Tuten, 2001).

Paying survey respondents still works best for reducing non-response.

◦ CART can be used with ROC/lift charts to see what the optimal amount of payment per completed survey is.

◦ Many companies would be willing to pay for this completeness (Coleman, 1991).

6. Conclusions

7. References
