Survey Design and Analysis

Torben Schubert, December 12th, 2012, CIRCLE, Lund

NORSI course on ‘Survey of Quantitative Research’

Survey Design
◦ Cluster analysis
◦ Latent factors

Hypothesis testing using Community Innovation Survey data
◦ Limited dependent variables
◦ Application using STATA

Outline

Survey Design

Yesterday, you had an introduction to linear regression analysis.

OLS is one of the most powerful tools for testing hypotheses.

But hypothesis testing is not the only task in quantitative empirical research.

Sometimes we might not even have a clear idea about the structures in the data set. We may find it difficult to develop sensible hypotheses.

Introduction

Sometimes we encounter measurement problems that make it difficult to discern what the theoretical meaning of a variable or a set of variables actually is.

What can we do then?

Introduction

Good empirical research should follow these steps:
◦ Build a theory about a certain phenomenon (e.g. by literature review, or by squeezing your brain)
◦ Delineate expectations about empirical relationships (often called hypotheses)
◦ Collect the data that is necessary to measure your relationships
◦ Use a sensible technique to determine whether your hypotheses hold.

The ideal way to good results

This ideal process is often obstructed:
◦ We might get access to a rich dataset that we have not compiled ourselves and which we therefore do not fully understand.
◦ We might have a complex measurement construct in mind, but we are not sure whether our variables really measure it.

Problems

If you are unsure about the information contained in your dataset, do not underestimate the power of descriptive statistics.

Means by groups or correlations can greatly improve your understanding of the data.

Take your time to investigate an unknown dataset.

Some suggestions
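The suggestions above can be tried out in a few lines. This is a minimal sketch on STATA's built-in auto data set (an assumption for illustration; any data set with a grouping variable works the same way).

```stata
* Explore an unfamiliar dataset before testing anything
sysuse auto, clear
describe                       // variable names, storage types, labels
summarize price mpg weight     // overall mean, spread, and range
* means by group: domestic vs. foreign cars
tabstat price mpg weight, by(foreign) statistics(mean sd n)
correlate price mpg weight     // pairwise correlations
```

Looking at group means and correlations first often shows which variables move together before any model is specified.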

Cluster analysis

What is a cluster? Loosely defined:

Data can be considered clustered if
◦ observations belonging to the same cluster are alike.
◦ observations belonging to different clusters differ.

Cluster analysis

Cluster analysis assumes that observations (e.g. firms) belong to a given number of different clusters that are inherently different from each other.

Technically, you search for multivariate similarity between observations given a set of characteristics.

E.g. you could think that firms differ by age, size, and innovativeness.

Cluster analysis

A clustering method then sorts the firms that are most similar to each other into a given number of clusters.

A multitude of techniques exist, but most of the common ones are rather descriptive and leave many arbitrary choices to the researcher:
◦ Which variables to include?
◦ How many clusters to go for?
◦ Which method to use?

Cluster analysis

Cluster Analysis

[Figure: four scatter plots of x[,2] against x[,1]]

and not all data are clustered…

An example in STATA based on the auto data set. The command structure is

cluster subcommand varlist, options

Type the following:

sysuse auto
cluster wardslinkage rep78 length price if !missing(rep78) & !missing(length) & !missing(price), measure(correlation)
cluster dendrogram

Cluster analysis

The dendrogram looks like this and shows at which similarity level observations and subgroups are merged into clusters.

The number of clusters is arbitrary, but 3 is probably not a bad choice.

Cluster analysis

[Figure: Dendrogram for _clus_1 cluster analysis; vertical axis: correlation similarity measure, ranging from .996 to 1]
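The choice of the number of clusters can be supported by a stopping rule. The sketch below is one possible way, not the procedure used on the slides: it reruns the same clustering under the name ward (a name chosen here for illustration) and reports the Calinski-Harabasz pseudo-F for 2 to 6 groups.

```stata
sysuse auto, clear
* keep complete cases only, as in the example above
drop if missing(rep78, length, price)
cluster wardslinkage rep78 length price, measure(correlation) name(ward)
* pseudo-F by number of groups; larger values indicate more distinct clusters
cluster stop ward, rule(calinski) groups(2/6)
```

The rule is only a heuristic. It does not remove the arbitrariness, but it makes the choice easier to defend.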

Then, in order to generate a grouping variable, type

cluster generate cutvar = groups(3)

To generate summary statistics by group, type

bysort cutvar: sum rep78 length price if !missing(cutvar)

Cluster analysis

Cluster analysis

-> cutvar = 1

Variable   Obs   Mean       Std. Dev.   Min    Max
rep78      24    3.041667   1.082636    1      5
length     24    189.4583   16.66284    163    221
price      24    4210.5     516.8955    3291   5379

-> cutvar = 2

Variable   Obs   Mean       Std. Dev.   Min    Max
rep78      30    3.633333   .9278575    2      5
length     30    181.6      25.47155    142    222
price      30    5319.367   874.2762    3895   7827

-> cutvar = 3

Variable   Obs   Mean       Std. Dev.   Min    Max
rep78      15    3.533333   .8338094    2      5
length     15    199.8      21.74922    156    233
price      15    10896.27   2667.294    6850   15906

Cluster analysis is a nice data-mining tool that is useful when you have no idea of what is going on.
◦ I would not recommend using it in a scientific paper, because of its exploratory character.
◦ It might assist you in earlier stages of research.

Note that statistically more advanced methods are available in other packages such as R (keyword: model-based clustering).

Cluster analysis

Latent factors

Theory is often framed in terms of unmeasurable concepts.

This happens often in management research, sociology, and psychology.

Suppose you hypothesize that teacher quality increases student performance.

How to measure teacher quality? You might consider asking a battery of questions about a set of quality dimensions (Is he well prepared? Does he react to students' questions? ...)

Latent factors

The first question to ask is whether there really is a unidimensional thing called teacher quality.

You can use factor analysis for this. For any given set of variables, factor analysis determines the underlying (latent) constructs. Type in the following:

use http://www.ats.ucla.edu/stat/stata/output/m255, clear

factor item13-item24, ipf factor(3)

Factor analysis

General rule: use as many factors as there are eigenvalues greater than one.

In this case there is only one: good news!

Factor analysis

Factor analysis/correlation            Number of obs    = 1365
Method: iterated principal factors     Retained factors = 3
Rotation: (unrotated)                  Number of params = 33

Factor     Eigenvalue   Difference   Proportion   Cumulative
Factor1     5.85150     5.04464       0.8336      0.8336
Factor2     0.80687     0.44540       0.1149      0.9485
Factor3     0.36146     0.23001       0.0515      1.0000
Factor4     0.13146     0.07619       0.0187      1.0187
Factor5     0.05527     0.02362       0.0079      1.0266
Factor6     0.03164     0.02946       0.0045      1.0311
Factor7     0.00218     0.00658       0.0003      1.0314
Factor8    -0.00440     0.01466      -0.0006      1.0308
Factor9    -0.01906     0.02688      -0.0027      1.0281
Factor10   -0.04594     0.01440      -0.0065      1.0215
Factor11   -0.06035     0.03050      -0.0086      1.0129
Factor12   -0.09084     .            -0.0129      1.0000

LR test: independent vs. saturated: chi2(66) = 8683.10 Prob>chi2 = 0.0000

Factor loadings (pattern matrix) and unique variances

Variable   Factor1   Factor2   Factor3   Uniqueness
item13     0.7134    -0.3987    0.0923   0.3236
item14     0.7032    -0.3391    0.0978   0.3810
item15     0.7212    -0.2450    0.1057   0.4086
item16     0.6478    -0.1890    0.1114   0.5322
item17     0.7831    -0.0734    0.0667   0.3770
item18     0.7395     0.3448    0.1129   0.3216
item19     0.6165     0.4159    0.1551   0.4228
item20     0.5501     0.2392    0.0932   0.6315
item21     0.7317     0.1168    0.0007   0.4509
item22     0.6128     0.2609   -0.0228   0.5559
item23     0.8194    -0.0262   -0.3454   0.2086
item24     0.6952     0.0183   -0.3873   0.3665

Another commonly used measure is Cronbach's Alpha, which increases with the average correlation among a given set of variables and with the number of items.

This should be large (at least 0.65). Type in

alpha item13-item24

Cronbach's Alpha

Test scale = mean(unstandardized items)

Average interitem covariance: .386608
Number of items in the scale: 12
Scale reliability coefficient: 0.9125
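For reference, the scale reliability coefficient can be written out as follows. The symbols are notation introduced here, not taken from the slides: $k$ is the number of items, $\bar{c}$ the average inter-item covariance, $\bar{v}$ the average item variance, and $\bar{r}$ the average inter-item correlation.

```latex
% unstandardized items (the case reported above, with k = 12 and \bar{c} = .386608)
\alpha = \frac{k\,\bar{c}}{\bar{v} + (k-1)\,\bar{c}}

% standardized items
\alpha_{\text{std}} = \frac{k\,\bar{r}}{1 + (k-1)\,\bar{r}}
```

Both versions rise with the average association between the items and with the number of items, which is why a long battery of questions can reach a high Alpha even when the correlations are moderate.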

Hypothesis testing using Community

Innovation Survey data

Community Innovation Survey: harmonized survey of innovation behavior in the European Union and Norway

Repeated cross-section data with a lot of information about innovation inputs, outputs, firm characteristics, markets, …

We can analyse this data with the tools we were equipped with yesterday:
◦ T-tests about differences in means
◦ OLS to test more complicated hypotheses

But many variables do not easily lend themselves to OLS because of their nature…

Introduction

Limited Dependent Variables

Limited dependent variables (LDV)
◦ Types of LDV
◦ Implications for OLS

Estimation Methods
◦ Maximum Likelihood Estimation
◦ The need for marginal effects
◦ Probit and Logit Models
◦ Multinomial Models
◦ Count data
◦ Tobit Models

Overview

What do we estimate by regression? Suppose we have the regression equation:

$y = x\beta + u$

We are typically interested in the coefficients/parameters $\beta$.

But what is their meaning? A commonly heard suggestion:

◦ They measure how the explained variable changes when the explaining variables change by one unit…

Introductory Reminder

This is imprecise. But why? Look at the formula again:

$y = x\beta + u$

The error $u$ obstructs this direct relationship between the explained variable on the one hand and the coefficients and explaining variables on the other.

We solve that by focusing on expectations:

$E(y \mid x) = x\beta$

The coefficient now has the following meaning:

A coefficient measures how the expected value of the explained variable changes when the corresponding explaining variable changes by one unit:

$\beta_k = \frac{\partial E(y \mid x)}{\partial x_k}$

Some Theory

Basic definition: An LDV is any dependent (also: explained, left-hand-side) variable in a regression that cannot take on every value on the real axis.

Examples
◦ Indicator variables: e.g. employed (y/n)
◦ Count variables: # patents
◦ Strictly positive variables: amount of alcohol consumed per week
◦ Multinomial response variables: preferred leisure time activities (bowling, reading, meeting friends)

LDV - Types

Suppose we intend to explain the employment status of persons.

A convenient coding is 1: employed and 0: unemployed.

Technically we could run a linear regression of the following form:

$empl = x\beta + u$

yielding estimates $\hat{\beta}$.

LDV – Implications for OLS

But consider the estimated expectation of $empl$:

$\hat{E}(empl \mid x) = \widehat{empl} = x\hat{\beta}$

Since $\hat{\beta}$ is fixed and there are no restrictions on $x$, the predicted values may well lie outside the theoretical boundaries of 0 and 1. This is an implication of the linearity of OLS.

We impose a linear model with no restrictions on an expected value that should be bounded between 0 and 1.

We need to find a non-linear model for the expected value.
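The problem is easy to check for yourself. The following sketch uses the indicator foreign from the built-in auto data set as a stand-in for the employment indicator (an assumption, since no employment data accompanies the slides) and counts fitted values outside the unit interval.

```stata
sysuse auto, clear
* linear probability model: OLS on a 0/1 dependent variable
regress foreign weight mpg
predict phat                        // linear prediction x*b
summarize phat                      // compare min and max with 0 and 1
count if phat < 0 | phat > 1        // fitted "probabilities" outside [0,1]
```

Any non-zero count means the linear model predicts probabilities that cannot exist.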

Suppose you want to explain income, and the data is censored at an upper threshold (e.g. a top category of 100,000€ p.m. and above).

What happens if you use OLS after dropping the highest category (truncation) or after replacing the censored values with 100,000 (censoring)?

Obviously, there is a downward bias in this case.

OLS gives inconsistent results.
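A small simulation makes the bias visible. Everything in this sketch is hypothetical: the data is generated with a true slope of 2 and then top-coded at 2, and the variable names, seed, and sample size are chosen here for illustration only.

```stata
clear
set seed 12345
set obs 1000
generate x = rnormal()
generate ystar = 1 + 2*x + rnormal()    // uncensored outcome, true slope = 2
generate y = min(ystar, 2)              // top-coded (censored) at 2
regress y x                             // OLS on the censored variable
tobit y x, ul(2)                        // model that accounts for the upper limit
```

Comparing the two slope estimates with the true value of 2 shows how censoring pulls the OLS estimate towards zero.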


OLS doesn't work in these situations. Common practice therefore:

◦ Confirm that the explained variable is not an LDV (profits), or at least roughly not an LDV (height of a person)

◦ If the variable is an LDV in some sense, use other methods that implement appropriate non-linear models for the expected value.

What are these methods?

Estimation methods: ML

Fortunately, the Maximum Likelihood approach offers a flexible solution to a large class of such problems (developed by Fisher in the early 20th century).

It follows several steps:
◦ Choose an appropriate statistical model for your data.
◦ Based on this model, express the likelihood of observing your sample as a function of the parameters.
◦ Maximize this likelihood over the parameters. The solution to this problem is the ML estimate.
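In generic notation the three steps can be sketched as follows. The symbols are introduced here for illustration: $f$ is the density or probability function implied by the chosen model, $\theta$ collects its parameters, and the observations $i = 1, \dots, n$ are assumed to be independent.

```latex
% Steps 1 and 2: the model f gives the likelihood of the whole sample
L(\theta) = \prod_{i=1}^{n} f(y_i \mid x_i; \theta)

% Step 3: maximize, in practice using the log-likelihood
\hat{\theta}_{ML} = \arg\max_{\theta} \, \log L(\theta)
                  = \arg\max_{\theta} \sum_{i=1}^{n} \log f(y_i \mid x_i; \theta)
```

Taking logs turns the product into a sum, which is numerically more stable and does not change the maximizer.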


What about the size of the effects? We are always interested in how the dependent variable changes when one of the independent variables changes:

$\frac{\partial E(y \mid x)}{\partial x_j}$

Unfortunately, because the expected value is now non-linear, the coefficients are no longer identical to the marginal effects.

Marginal effects and meaning of coefficients

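In the Probit model, for example, we can show that the marginal effect is:

$\frac{\partial E(y \mid x)}{\partial x_j} = \frac{\partial \Phi(x\beta)}{\partial x_j} = \phi(x\beta)\,\beta_j$

where $\phi$ is the standard normal density, so that $\phi(x\beta) > 0$.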

Implications:
◦ In the Probit model the coefficient does not coincide with the marginal effect.
◦ Nonetheless, it gives the correct direction. This holds for many ML methods but not for all.

Always, and I seriously mean always, report marginal effects instead of raw coefficients when using ML. (STATA can do that easily.)


Practice in STATA

Whenever we encounter an indicator variable (0/1) as the dependent variable, we should think of a correct probability model.

Examples:
◦ Unemployed vs. employed
◦ Non-patenting company vs. patenting company
◦ …

Several models are usable, but the most common are:
◦ Logit model and probit model
◦ Practically, there is no large difference between the two when we focus on marginal effects

The Probit and the Logit Model

They are easy to invoke in STATA using the probit or logit command:

probit depvar indepvars, options
logit depvar indepvars, options

For example, if you have a patent indicator pat, the innovation expenditures innoexp, and the size of the company empl, the command looks like this:

probit pat innoexp empl


The marginal effects are computed by typing the following command directly after a probit/logit regression:

mfx, predict(p)

Observe that this command always refers to the last regression.
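From STATA 11 onward, the margins command offers the same kind of output. The sketch below is an illustration on the built-in auto data set, with foreign standing in for the patent indicator (the variables pat, innoexp and empl from the example above are not available here).

```stata
sysuse auto, clear
probit foreign weight mpg
* marginal effects evaluated at the sample means of the regressors
margins, dydx(*) atmeans
* average marginal effects over all observations
margins, dydx(*)
```

The two variants usually give similar but not identical numbers, so it is worth stating in a paper which one is reported.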


Suppose there are many buying alternatives for a product (e.g. Android smartphone, iPhone, Windows smartphone) and you would like to know how customers' characteristics affect their buying decision.

In this case, there are 4 categories:
◦ no SP
◦ Android SP
◦ iPhone
◦ Windows SP

Multinomial models

Differs from probit/logit because there are more than two categories.

Two widely used models:
◦ Multinomial logit
◦ Multinomial probit

Here there is a difference: multinomial probit is more flexible, but its computation is usually not feasible with more than four or five categories.

Multinomial models

STATA commands are mprobit and mlogit:

mprobit depvar indepvars, options
mlogit depvar indepvars, options

For example, if you have a variable sp giving consumer-level data on SP choice, inc being the income, and age the age, the command would be

mprobit sp inc age

Multinomial models

Note: coefficients and marginal effects do not even necessarily have the same direction.

You must calculate marginal effects using (we have four categories, each has its own marginal effects)

mfx, predict(p outcome(1))
mfx, predict(p outcome(2))
mfx, predict(p outcome(3))
mfx, predict(p outcome(4))

Note: If the data is ordered (e.g. Likert scale) you can use an ordered probit (oprobit, with the same syntax).
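The outcome-by-outcome logic can also be written with margins. This sketch only illustrates the syntax: it uses rep78 from the built-in auto data set, restricted to the values 3, 4 and 5, as a stand-in for a three-category choice variable, which is an assumption made purely to have runnable data.

```stata
sysuse auto, clear
* three outcome categories (values 3, 4, 5), treated as unordered here
mlogit rep78 price mpg if rep78 >= 3
* one set of marginal effects per outcome category
margins, dydx(*) predict(outcome(3))
margins, dydx(*) predict(outcome(4))
margins, dydx(*) predict(outcome(5))
```

For each regressor the marginal effects across the outcomes add up to zero, because the predicted probabilities must always sum to one.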

Multinomial models

When data takes on non-negative integer values that have a clear numeric meaning (e.g. patents) you should use count data models.

What is the difference to a Likert scale? Sensible methods are

◦ Poisson regression
◦ Negative binomial regression

Negative binomial is more flexible, because it allows the variance to exceed the mean (overdispersion), without imposing a considerable computational penalty.

Count data

The STATA command is

nbreg depvar indepvars, options

Suppose the number of patents is stored in numpat, size is given by empl, and innovation expenditures by innoexp. The command could be

nbreg numpat innoexp empl

For marginal effects it is enough to type

mfx
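To see what the extra flexibility buys, the sketch below simulates overdispersed counts and fits both models. The data is entirely hypothetical: numpat and innoexp only reuse the names from the example above, and the seed, sample size, and parameter values are chosen here for illustration.

```stata
clear
set seed 2012
set obs 500
generate innoexp = runiform()
generate nu = rgamma(2, 0.5)                          // unobserved heterogeneity with mean 1
generate numpat = rpoisson(nu*exp(0.5 + innoexp))     // overdispersed counts
poisson numpat innoexp
nbreg numpat innoexp                                  // also reports an LR test of alpha = 0
margins, dydx(innoexp)                                // average marginal effect on the expected count
```

The likelihood-ratio test of alpha = 0 at the bottom of the nbreg output indicates whether the Poisson restriction is rejected.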

Count data

Censored or quasi-censored dependent variables are those that
◦ are principally continuous
◦ cannot take on all values on the real axis
◦ have mass points at their censoring limit

Examples:
◦ Innovation expenditures
◦ Turnover
◦ R&D-intensity (R&D expenditures divided by turnover)

Data can be single- or double-censored

Censored data

The correct model to use is the Tobit model, which can be invoked by the following command:

tobit depvar indepvars, ll() ul()

where the options ll() and ul() set the lower and upper limits, respectively.

Suppose you want to explain the share of employees with tertiary education in % (shtert) by the size (empl) of the company. You could use

tobit shtert empl, ll(0) ul(100)

Censored data

The marginal effects follow using

mfx, predict(e(0,100))

In the case of data that is censored only at zero and otherwise unrestricted, you would technically want to write something like this:

mfx, predict(e(0,infinity))

Censored data

But STATA has no number called infinity. A missing value (.) as the upper limit is read as plus infinity, so you can type mfx, predict(e(0,.)). Alternatively, you could simply use a number larger than your sample maximum.

Or you can use the following sequence (suppose that your dependent variable is turnover)

summarize turnover
local maxturn=r(max)
mfx, predict(e(0,`maxturn'))
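With margins, the same predictions are available, and it helps to be explicit about which expected value is meant. The sketch follows the style of the standard tobit example on the built-in auto data set (mpg censored from below at 17), which is an assumption standing in for the turnover data.

```stata
sysuse auto, clear
generate wgt = weight/1000
tobit mpg wgt, ll(17)
* effect on E(mpg | mpg > 17): expected value conditional on not being censored
margins, dydx(wgt) predict(e(17,.))
* effect on the expected value of the censored outcome itself
margins, dydx(wgt) predict(ystar(17,.))
```

The two effects answer different questions, so the choice between e() and ystar() should follow from the research question rather than from convenience.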

Censored data

ML methods are computationally intensive and purely numerical.

Sometimes the standard algorithm does not converge (mprobit is not unlikely to produce this outcome).

What to do then?
◦ Under no circumstances report results when convergence was not achieved.
◦ You can try the difficult option.
◦ You can use the maximize options.
◦ You can provide different starting values.

But nothing is guaranteed to help.
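The remedies above look as follows in practice. The sketch uses a simple probit on the built-in auto data set as a stand-in, because it converges quickly; the same options can be appended to mprobit or any other ML command. The matrix name b0 is chosen here for illustration.

```stata
sysuse auto, clear
* difficult option plus a maximize option capping the number of iterations
probit foreign weight mpg, difficult iterate(100)
* e(converged) equals 1 only if the optimizer declared convergence
display e(converged)
assert e(converged) == 1
* reuse the estimates as starting values for a further run
matrix b0 = e(b)
probit foreign weight mpg, from(b0)
```

Checking e(converged) in a do-file guards against reporting estimates from a run that silently stopped at the iteration limit.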

A word of caution

◦ Check whether your variable is an LDV.
◦ If not, use OLS.
◦ If yes, determine the type of LDV characteristic.
◦ Choose an appropriate model (there are many more than those discussed today).
◦ If you use LDV methods, it is safest to report marginal effects instead of raw coefficients.

Summary

Sounds complicated, and in fact it can be so.

But an easy example is the Probit model for indicator explained variables.

Suppose there is an unobservable (latent) variable $y^*$ taking on any value and an observable indicator $y$ taking on only a value of 0 or 1.

Back-Up

Both are linked as follows:

$y^* = x\beta + u, \quad u \sim N(0,1), \quad y = 1(y^* > 0)$

As always, we would like to estimate the expected value of $y$:

$E(y \mid x) = 1 \cdot P(y = 1 \mid x) + 0 \cdot P(y = 0 \mid x) = P(y = 1 \mid x)$

Both are linked as follows:

$y^* = x\beta + u, \quad u \sim N(0,1), \quad y = 1(y^* > 0)$

In order to form the likelihood function we have to find the probabilities that $y$ equals zero and one as a function of the parameters:

$P(y = 1 \mid x) = P(y^* > 0 \mid x) = P(u > -x\beta) = P(u < x\beta) = \Phi(x\beta)$

The probability that the indicator is zero is simply

$P(y = 0 \mid x) = 1 - P(y = 1 \mid x) = 1 - \Phi(x\beta)$

The probability of observing a generic observation then is:

$\Phi(x_i\beta)^{y_i} \, (1 - \Phi(x_i\beta))^{1 - y_i}$

Because of independence between the observations, the likelihood giving the probability of observing the whole sample is:

$L(\beta) = \prod_{i=1}^{n} \Phi(x_i\beta)^{y_i} \, (1 - \Phi(x_i\beta))^{1 - y_i}$

Or, for computational reasons, in log form:

$l(\beta) = \sum_{i=1}^{n} \left[ y_i \log \Phi(x_i\beta) + (1 - y_i) \log(1 - \Phi(x_i\beta)) \right]$