Statistical Analysis of Experimental Data
Gerhard Riener
Düsseldorf Institute for Competition Economics
September 24, 2014
Gerhard Riener (DICE) Lecture Statistical Analysis September 24, 2014 1 / 37
Outline
Data Analysis and Experiments
Data Analysis and Experiments
I Good experimental design makes for clean data analysis
I Knowing which statistical techniques you will use for the analysis helps to plan your design
I Choose the statistical approach that best fits your needs (graphs, tests, confidence intervals, regressions)
I Think about what kind of data you can collect, to get the cleanest possible test of your hypothesis
I Compute the sample size necessary to meaningfully test your hypotheses
Data types and sample sizes
Let Y ∈ Z be a random variable, where Z is the set of possible outcomes for Y.
I if Z = {0, 1} we say that the outcome is binary, e.g. offers in the UG can be either accepted (1) or rejected (0)
I if instead Z = {h, h+1, h+2, ..., k−2, k−1, k} or Z = [h, k], we say that the outcome is multi-valued, e.g. offers in our UG were integer numbers between 0 and 10.
The set Z is known and should not be determined based on the data. Data are given by a sample of n independent draws from Y, denoted y1, y2, y3, ..., yn.
Null and alternative hypothesis
Make inference about the true distribution PY. Test the null hypothesis

H0 : PY ∈ W (1)

against the alternative

H1 : PY ∉ W (2)

where W is given. The null hypothesis has to be constructed as the complement of the desired statement or research objective. So if the aim is to provide evidence that some treatment makes a difference, then the null hypothesis is that there is no treatment effect.
One-sided and two-sided hypothesis
I two-sided when one tests for equality, e.g.:
I H0: the probability that the offer is rejected is equal to 50%
I H1: the probability that the offer is rejected is different from 50%
I one-sided when one tests for inequalities, e.g.:
I H0: the probability that the offer is rejected is greater than or equal to 50%
I H1: the probability that the offer is rejected is smaller than 50%
One-sided hypotheses should only be tested if there is a good story behind them.
Size of a test and the p-value
Hypothesis testing is about making statements based on noisy data. True statements cannot be guaranteed; the objective is to make statements that are true up to some pre-specified maximal error.
The size (or type I error probability) α of a test is the maximal probability of wrongly rejecting the null hypothesis, fixed before gathering the data. So this is the maximal probability of rejecting the null hypothesis when it is true.
The p-value associated with a given test and a given data set is the smallest value of α at which the null hypothesis would be rejected for the given data.
Size of a test, and the p-value
Type II error β
For a given subset of the alternative hypotheses, β is the maximal probability of wrongly failing to reject the null hypothesis when the true data-generating process belongs to this subset.
Power (1 − β) of a test
The probability of correctly rejecting the null hypothesis for a given sample size.
Conventions are in general much more rigid with respect to α than with respect to β.
The common reliance on statistical significance as the sole criterion leads to an excessive number of false positives (Maniadis, Tufano and List, AER, 2014).
Maniadis, Tufano and List, AER 2014
How do we assess whether an effect truly exists?
I n: number of scientific associations examined
I π: fraction of true associations
I α: typical significance level
I β: typical type II error probability (so 1 − β is the typical power)
What is the probability that a declared research finding is true? The Post-Study Probability (PSP).
MTL ctd
I of the n associations, π · n will be true and consequently (1 − π) · n false
I (1 − β) · π · n will be declared true
I α · (1 − π) · n will be declared true although they are false
I the PSP is the number of true associations declared true, divided by the number of all associations declared true

PSP = (1 − β) · π / [(1 − β) · π + α · (1 − π)] (3)
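Equation (3) is easy to check numerically; here is a minimal sketch in Python, where the values π = 0.1, α = 0.05 and β = 0.2 are illustrative choices, not numbers taken from the paper:

```python
def psp(pi, alpha, beta):
    """Post-Study Probability, eq. (3): share of associations
    declared true that are actually true."""
    true_hits = (1 - beta) * pi      # true associations declared true
    false_hits = alpha * (1 - pi)    # false associations declared true
    return true_hits / (true_hits + false_hits)

# illustrative values: 10% true associations, size 5%, power 80%
print(psp(0.1, 0.05, 0.2))  # ≈ 0.64: only about two thirds of "findings" are true
```

Even with conventional size and power, the PSP falls well below 1 whenever the fraction of true associations π is small.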
MTL: Researcher competition
Suppose k independent researchers work simultaneously on each of the n associations.
I (1 − β^k): probability that at least one of the k researchers will declare a true association as true
I 1 − (1 − α)^k: probability that a false relationship is declared true by at least one researcher
I [1 − (1 − α)^k] · (1 − π) · n associations will be declared true although they are false

PSPComp = (1 − β^k) · π / [(1 − β^k) · π + (1 − (1 − α)^k) · (1 − π)] (4)

Decreasing in k if 1 − β > α, which is usually the case.
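The comparative static in k can be verified directly; a sketch with the same illustrative parameter values as before:

```python
def psp_comp(pi, alpha, beta, k):
    """PSP with k competing research teams, eq. (4)."""
    hit = 1 - beta ** k                  # at least one team detects a true effect
    false_alarm = 1 - (1 - alpha) ** k   # at least one team gets a false positive
    return hit * pi / (hit * pi + false_alarm * (1 - pi))

# with k = 1 this reduces to the basic PSP; it falls as k grows
print(psp_comp(0.1, 0.05, 0.2, 1))  # ≈ 0.64
print(psp_comp(0.1, 0.05, 0.2, 5))  # noticeably lower
```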
MTL: Research bias
Researchers tend to interpret their results in a "favourable" way, which leads to a bias u: the probability that a positive research finding is announced although it should not have been.
I without bias, a fraction β · π · n of the true associations would be declared false because of noise
I with bias, the total number of associations declared true changes to (1 − β) · π · n + u · β · π · n
I additionally, out of the false associations, α · (1 − π) · n + u · (1 − α) · (1 − π) · n will be declared true

PSPBias = [(1 − β) · π + β · π · u] / [(1 − β) · π + β · π · u + (α + (1 − α) · u) · (1 − π)] (5)

Decreasing in u if 1 − β > α, which is usually the case.
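Equation (5) can be sketched the same way; u = 0 recovers the basic PSP of eq. (3) (parameter values again illustrative):

```python
def psp_bias(pi, alpha, beta, u):
    """PSP with research bias u, eq. (5); u = 0 recovers eq. (3)."""
    declared_true = (1 - beta) * pi + u * beta * pi
    false_declared = (alpha + (1 - alpha) * u) * (1 - pi)
    return declared_true / (declared_true + false_declared)

print(psp_bias(0.1, 0.05, 0.2, 0.0))  # ≈ 0.64, the unbiased benchmark
print(psp_bias(0.1, 0.05, 0.2, 0.3))  # lower once bias creeps in
```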
Accepting the null?
Null hypotheses are rejected or not rejected, never accepted.
Not being able to reject the null hypothesis can have many reasons:
I the null hypothesis is true
I the alternative hypothesis is true, but the test was not able to discover this
The power of a test quantifies the evidence from not being able to reject the null hypothesis, conditional on
I the sample size
I a statement about the alternatives.
Whether the sample is sufficiently large and the test sufficiently powerful depends on where the underlying distribution lies in H1. If it lies close to H0 then it will be very difficult to detect.
Qualitative analysis
Graphs and summary statistics are the best devices for getting familiar with the data. They can help
I the experimenter to detect outliers and anomalies
I the experimenter to spot unexpected regularities
I the reader to get acquainted with the data
Well-chosen descriptive statistics and graphs can give the reader (or the experimenter!) the impression that the main questions are answered. Do we really need any formal statistical tests?
Qualitative analysis: an example
Quantitative analysis
Research questions in experimental economics:
I Does treatment X affect outcome Y?
I Is outcome Y better predicted by model M1 or M2?
How do we answer these questions? For the first question, you could compare the average choices across treatments, or the average observed choices with the expected values predicted by the two theories.
An observed difference might be due to experimental error, and statistical techniques evaluate that possibility. Formal statistical tests tell you whether differences in observed outcomes across treatments are due to chance alone (sampling error) rather than to the treatments (differences in the underlying population distributions), and whether one model really explains the data better than another, or just got lucky.
Parametric vs. non-parametric statistics
I If the error distribution is assumed to belong to a known family, we are doing parametric statistics, which includes tests such as the t-test, the F-test, ...
I With fewer distributional assumptions (only independence), tests are less powerful when the parametric assumptions are in fact true. This approach is known as non-parametric statistics; we will cover some essential tools it provides.
Tests and confidence intervals
A confidence interval is a range of values designed to contain the true parameter of interest with a minimal probability 1 − α, called the coverage. E.g. a 95% confidence interval has a coverage of 95%.
I Confidence interval estimation provides a convenient alternative to significance testing in most situations.
I It represents a more flexible approach, as one does not have to specify a null hypothesis a priori, but only a parameter of interest.
I This approach does not raise the problem of how to derive implications when the null is not rejected. Moreover, confidence intervals provide information about how strong the evidence is when the null is rejected.
Regressions
A simple way of testing whether treatment effects are significant is to include a dummy variable for each treatment but the baseline in the list of explanatory variables X:

yi = Xiβ + ui

When the coefficient estimate (e.g. from OLS) for a dummy is significant, you can conclude that the treatment actually did affect the observed performance y, with respect to the baseline treatment.
Regressions can often be a useful complement to statistical tests, for example to:
I control for the effect of nuisance factors (subjects' individual characteristics, session fixed effects, etc.)
I study trends across periods in repeated games
I investigate the presence of linear/non-linear effects of some treatment variables on the subjects' behavior
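Why the dummy coefficient measures the treatment effect can be seen in a small simulation; this is a sketch with made-up data (not the lecture's dataset), computing OLS by hand for a single dummy regressor:

```python
import random

random.seed(1)
# hypothetical outcomes: baseline mean 3.0, treatment mean 3.5
baseline  = [random.gauss(3.0, 1.0) for _ in range(50)]
treatment = [random.gauss(3.5, 1.0) for _ in range(50)]
y = baseline + treatment
d = [0] * 50 + [1] * 50   # treatment dummy

# OLS for y_i = b0 + b1*d_i + u_i
n = len(y)
dbar, ybar = sum(d) / n, sum(y) / n
b1 = sum((di - dbar) * (yi - ybar) for di, yi in zip(d, y)) / \
     sum((di - dbar) ** 2 for di in d)
b0 = ybar - b1 * dbar
# with a single dummy, b1 is exactly the difference in group means,
# and b0 is the baseline group mean
```

So testing the dummy's significance is testing the difference in means between treatment and baseline.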
Regressions, Stata code
Basic code that regresses offers on the treatment dummy:

reg offer communicate

Interaction terms: often, and especially in factorial designs, we would like to evaluate the difference between the cells:

reg offer communicate##female

Heteroskedasticity-robust standard errors:

reg accept communicate, robust

Attention
Always use robust standard errors with binary outcome variables, because the standard errors of the linear probability model suffer from heteroskedasticity.
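What the robust option does can be reproduced by hand; a sketch with simulated binary data, applying the textbook HC1 (White) formula to a single-regressor linear probability model:

```python
import random
from math import sqrt

random.seed(2)
d = [0] * 100 + [1] * 100                                  # treatment dummy
y = [1 if random.random() < (0.6 if di else 0.3) else 0 for di in d]

n, k = len(y), 2
dbar, ybar = sum(d) / n, sum(y) / n
sdd = sum((di - dbar) ** 2 for di in d)
b1 = sum((di - dbar) * (yi - ybar) for di, yi in zip(d, y)) / sdd
b0 = ybar - b1 * dbar
resid = [yi - b0 - b1 * di for di, yi in zip(d, y)]

# classical SE assumes one common error variance ...
s2 = sum(e * e for e in resid) / (n - k)
se_classical = sqrt(s2 / sdd)
# ... while HC1 weights by the squared residual of each observation
meat = sum((di - dbar) ** 2 * e * e for di, e in zip(d, resid))
se_robust = sqrt(meat / sdd ** 2 * n / (n - k))
```

In the linear probability model the error variance depends on the regressors by construction, which is why the robust version is the right default.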
Interpretation of the interaction term
. tab2 AA female if consultant, sum(completed)

-> tabulation of AA by female if consultant

Means, Standard Deviations and Frequencies of Completed

Affirmativ |        Female
e action   |      Male     Female |     Total
-----------+----------------------+----------
         0 | .35555556  .21126761 | .29192547
           | .48136303  .41111323 | .45606677
           |        90         71 |       161
-----------+----------------------+----------
         1 | .31343284  .35384615 | .33333333
           | .46738976  .48188332 | .47320035
           |        67         65 |       132
-----------+----------------------+----------
     Total | .33757962  .27941176 |  .3105802
           |  .4743976  .45036901 | .46352285
           |       157        136 |       293
Interpretation of the interaction term
. reg completed AA##female if consultant, robust

Linear regression                        Number of obs =     293
                                         F(  3,   289) =    1.79
                                         Prob > F      =  0.1495
                                         R-squared     =  0.0160
                                         Root MSE      =  .46218

------------------------------------------------------------------------------
             |               Robust
   completed |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        1.AA |  -.0421227   .0764034    -0.55   0.582    -.1925004    .1082549
             |
      female |
     Female  |  -.1442879   .0704317    -2.05   0.041    -.2829121   -.0056638
             |
   AA#female |
    1#Female |   .1847013   .1085501     1.70   0.090    -.0289477    .3983503
             |
       _cons |   .3555556   .0508054     7.00   0.000      .25556     .4555511
------------------------------------------------------------------------------
The binomial test (Siegel and Castellan, 1988)
Statement: most of the offers = 3 (O3) are accepted.
I H0: the probability that O3 is rejected is ≥ 50%.
I H1: the probability that O3 is rejected is < 50%.
I If I can reject H0, I can provide statistical support to my statement.
In our dataset, 8 out of 20 subjects rejected the offer. I can run a binomial test with size α = 0.05 to check whether I can reject H0 at the 5% significance level.
I First approach: look at the table which reports the one-tailed probabilities associated with the occurrence of at most x successes out of N trials. The probability of observing no more than 8 successes under H0 is 0.252 > 0.05, so we cannot reject the null at the 5% significance level.
I Second approach: with Stata, which returns a p-value of 0.251722 for the one-sided test:

. bitesti 20 8 0.5
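The same exact one-sided p-value can be computed directly from the binomial distribution; a pure-Python sketch:

```python
from math import comb

def binom_cdf(x, n, p=0.5):
    """P(X <= x) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(x + 1))

# probability of at most 8 rejections out of 20 under H0 (p = 0.5)
print(round(binom_cdf(8, 20), 6))  # 0.251722, matching bitesti
```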
Confidence intervals
Can we say more? We can say that, with probability 90%, the true probability that O3 is rejected lies in the interval [p, p̄], where:
I p is the min. value of p for which I cannot reject H0: Py ≤ p vs. H1: Py > p
I p̄ is the max. value of p for which I cannot reject H0: Py ≥ p vs. H1: Py < p

Stata code

cii 20 8, exact level(90)

                               -- Binomial Exact --
Variable |    Obs    Mean   Std. Err.   [90% Conf. Interval]
---------+---------------------------------------------------
         |     20      .4    .1095445    .2170686    .6064151
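This exact (Clopper-Pearson) interval can be recomputed by inverting the binomial test with a bisection search; a pure-Python sketch:

```python
from math import comb

def binom_cdf(x, n, p):
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(x + 1))

def clopper_pearson(x, n, level=0.90, tol=1e-9):
    """Exact binomial confidence interval for x successes in n trials."""
    a = (1 - level) / 2
    # lower bound: the p at which P(X >= x | p) equals a
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if 1 - binom_cdf(x - 1, n, mid) < a else (lo, mid)
    lower = lo
    # upper bound: the p at which P(X <= x | p) equals a
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if binom_cdf(x, n, mid) > a else (lo, mid)
    return lower, lo

print(clopper_pearson(8, 20))  # ≈ (0.2171, 0.6064), matching cii
```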
Chi-squared test (Siegel and Castellan, 1988)
Suppose now we want to test whether males are more likely to accept O3 than females. We can use a chi-squared test. The test statistic is based on the observed and expected frequencies.
Degrees of freedom: (# rows − 1)(# cols − 1) = 1. Look into the table for the critical values of the test statistic, for a one-tailed test with size 5%.
Chi-squared test—ctd.
Stata code
. tabulate male accept3, chi2

           |       accept3
      male |         0          1 |     Total
-----------+----------------------+----------
         0 |         2          2 |         4
         1 |         6         10 |        16
-----------+----------------------+----------
     Total |         8         12 |        20

          Pearson chi2(1) =   0.2083   Pr = 0.648
CAVEAT: the chi-squared test should not be used when min_{i,j} E_{ij} < 5 or when N < 20.
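The statistic and p-value above can be recomputed from the table by hand; a sketch (for 1 degree of freedom the chi-squared survival function equals erfc(sqrt(x/2))):

```python
from math import erfc, sqrt

def chi2_2x2(a, b, c, d):
    """Pearson chi-squared statistic and p-value (df = 1) for a 2x2 table."""
    n = a + b + c + d
    rows, cols = (a + b, c + d), (a + c, b + d)
    stat = 0.0
    for i, obs_row in enumerate(((a, b), (c, d))):
        for j, obs in enumerate(obs_row):
            expected = rows[i] * cols[j] / n
            stat += (obs - expected) ** 2 / expected
    return stat, erfc(sqrt(stat / 2))

print(chi2_2x2(2, 2, 6, 10))  # ≈ (0.2083, 0.648), matching Stata
```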
Fisher exact test (Siegel and Castellan, 1988)
Generally preferred to the chi-squared test for small samples. This testis based on the probability of observing a particular set of frequenciesin a 2 × 2 table, when the marginal totals are regarded as fixed.
The probability of observing this outcome under H0 is 0.381. But whatis the probability of observing a more extreme outcome?
Fisher exact test
Consider now the most extreme outcome, in which the difference inacceptance rates between males and females is maximized.
The probability of observing the most extreme outcome under H0 is 0.0144. Hence, the probability of observing the outcome we had, or a more extreme one, under H0, is 0.381 + 0.139 + 0.014 ≈ 0.535. We cannot reject the null that males and females are equally likely to reject O3, based on our observations.
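The one-sided tail can be accumulated directly from the hypergeometric probabilities with all margins fixed; a pure-Python sketch using the 2 × 2 table from our data ([[2, 2], [6, 10]]):

```python
from math import comb

def fisher_one_sided(a, b, c, d):
    """P(a table at least as extreme as observed, pushing cell a upward)
    with all margins fixed (hypergeometric probabilities)."""
    r1, r2, c1, n = a + b, c + d, a + c, a + b + c + d
    p = 0.0
    for x in range(a, min(r1, c1) + 1):
        if 0 <= c1 - x <= r2:
            p += comb(r1, x) * comb(r2, c1 - x) / comb(n, c1)
    return p

print(round(fisher_one_sided(2, 2, 6, 10), 3))  # 0.535, matching Stata below
```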
Fisher exact test—contd.
Stata code
. tabulate male accept3, exact

           |       accept3
      male |         0          1 |     Total
-----------+----------------------+----------
         0 |         2          2 |         4
         1 |         6         10 |        16
-----------+----------------------+----------
     Total |         8         12 |        20

           Fisher's exact =  1.000
   1-sided Fisher's exact =  0.535
Z-test
According to Schlag (2011), the z-test is generally more powerful than the Fisher exact test, hence it would be preferable. However, it is less commonly used.
The test statistic φ(s1, s2) depends on the numbers of successes s1 and s2 in the two samples of sizes n1 and n2, with n1 + n2 = N:

φ(s1, s2) = ((s2/n2) − (s1/n1)) / sqrt( ((s1 + s2)/N) · (1 − (s1 + s2)/N) · (1/n1 + 1/n2) ) = 0.4564 (6)

To get the corresponding p-value, look at the table for the standard normal distribution → 0.3228.
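Equation (6) is straightforward to evaluate; a sketch with the sample counts from our data (s1 = 2 of n1 = 4, s2 = 10 of n2 = 16):

```python
from math import sqrt, erf

def prop_z(s1, n1, s2, n2):
    """Two-sample z statistic for proportions, pooled variance, eq. (6)."""
    p_pool = (s1 + s2) / (n1 + n2)
    z = (s2 / n2 - s1 / n1) / sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    p_one_sided = 0.5 * (1 - erf(abs(z) / sqrt(2)))  # 1 - Phi(|z|)
    return z, p_one_sided

print(prop_z(2, 4, 10, 16))  # z ≈ 0.4564, one-sided p ≈ 0.3240
```

The exact normal computation gives 0.3240; the 0.3228 above comes from reading a table at z ≈ 0.46.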
Z-test—ctd.
Stata code
. prtest accept3, by(male)

Two-sample test of proportions            0: Number of obs =        4
                                          1: Number of obs =       16
--------------------------------------------------------------------------
Variable |      Mean   Std. Err.      z    P>|z|    [95% Conf. Interval]
---------+----------------------------------------------------------------
       0 |        .5        .25                     .010009     .989991
       1 |      .625   .1210307                    .3877841    .8622159
---------+----------------------------------------------------------------
    diff |     -.125   .2777561                   -.6693919    .4193919
         | under Ho:   .2738613   -0.46   0.648
--------------------------------------------------------------------------
        diff = prop(0) - prop(1)                         z =  -0.4564
    Ho: diff = 0

    Ha: diff < 0             Ha: diff != 0             Ha: diff > 0
 Pr(Z < z) = 0.3240     Pr(|Z| > |z|) = 0.6481     Pr(Z > z) = 0.6760
Wilcoxon Mann-Whitney test (S&C, 1988)
Suppose now we want to check whether the distribution of male offers is different from the distribution of female offers. Males offered on average 3.0625, and females offered on average 3.
The Wilcoxon-Mann-Whitney test can be used to test the one-sided null hypothesis that PY1 first-order stochastically dominates PY2.
Consider the set of female and of male offers. Combine both samples and assign ranks 1 to N without regard to which population each value came from.
I If several sample values are equal (tied), assign to each the average of their possible ranks.
I The test statistic is the sum of the ranks assigned to the values from the smaller sample.
Wilcoxon Mann-Whitney test–ctd.
Gender m m m m f m f m m f m m m m m m m f m m
Offer 0 0 1 1 1 3 3 3 3 3 3 4 4 4 4 4 5 5 5 5
Rank 1.5 1.5 4 4 4 8.5 8.5 8.5 8.5 8.5 8.5 14 14 14 14 14 18.5 18.5 18.5 18.5
Sum of the ranks for females: 39.5. Sum of the ranks for males: 170.5.
From the Wilcoxon rank-sum table, we can see that the critical values for the test statistic are 21 and 63, for a two-sided test with size 5%. Hence, we cannot reject the null that the two distributions are equal at the 5% level.
Note: the Stata version of the test is not exact; it is based on the normal approximation, which works only for large enough samples (largest sample > 20).
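The tied ranks and rank sums above can be recomputed mechanically; a sketch over the genders and offers copied from the table:

```python
def rank_sum(values, groups, target):
    """Sum of mid-ranks (ties averaged) for observations in the target group."""
    n = len(values)
    ranks = [0.0] * n
    order = sorted(range(n), key=lambda i: values[i])
    i = 0
    while i < n:
        j = i
        while j < n and values[order[j]] == values[order[i]]:
            j += 1
        for k in range(i, j):                  # ranks i+1..j share the
            ranks[order[k]] = (i + j + 1) / 2  # average (i+1+j)/2
        i = j
    return sum(r for g, r in zip(groups, ranks) if g == target)

gender = list("mmmmfmfmmfmmmmmmmfmm")
offer  = [0, 0, 1, 1, 1, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 5, 5, 5, 5]
print(rank_sum(offer, gender, "f"))  # 39.5
print(rank_sum(offer, gender, "m"))  # 170.5
```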
Wilcoxon Mann-Whitney test—cont’d
Stata code
. ranksum offer, by(male)

Two-sample Wilcoxon rank-sum (Mann-Whitney) test

        male |      obs    rank sum    expected
-------------+---------------------------------
           0 |        4        39.5          42
           1 |       16       170.5         168
-------------+---------------------------------
    combined |       20         210         210

unadjusted variance      112.00
adjustment for ties       -5.89
                     ----------
adjusted variance        106.11

Ho: offer(male==0) = offer(male==1)
             z =  -0.243
    Prob > |z| =   0.8082
Wilcoxon signed-rank test (S&C, 1988)
Suppose now that we had repeated the UG for two periods. We want to see whether the distribution of offers made in the second period is different from the one for the first period.
The Wilcoxon signed-rank test reduces each matched pair to a single observation by considering the difference between the paired observations. To find the test statistic,
I drop the observations where no difference emerges
I rank the remaining data by absolute value.
The Wilcoxon signed-rank test statistic is the sum of the ranks of observations that originally had a positive sign.
Wilcoxon signed-rank test—ctd.
OfferP1      0   0   1   1   1   3   3   3   3   3   3   4   4   4   4   4   5   5   5   5
OfferP2      0   0   2   1   0   3   4   3   4   1   3   4   4   5   4   4   4   5   5   5
Difference   0   0   1   0  -1   0   1   0   1  -2   0   0   0   1   0   0  -1   0   0   0
Rank         -   -  3.5  -  3.5  -  3.5  -  3.5  7   -   -   -  3.5  -   -  3.5  -   -   -

N = 7, W = 14.
If the sample size is sufficiently large (N > 25) one can use the normal approximation (which is what Stata does). Otherwise, it is better to use the tables.
The table reports that the critical value for the test statistic, for a two-sided test with size 5%, is equal to 2. Hence, we cannot reject the null hypothesis at the 5% level.
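The reduction to signed ranks can be scripted the same way; a sketch over the paired offers above:

```python
def signed_rank(x, y):
    """N of non-zero differences and W = sum of ranks of positive ones."""
    d = [b - a for a, b in zip(x, y) if b != a]   # drop zero differences
    order = sorted(range(len(d)), key=lambda i: abs(d[i]))
    ranks = [0.0] * len(d)
    i = 0
    while i < len(d):
        j = i
        while j < len(d) and abs(d[order[j]]) == abs(d[order[i]]):
            j += 1
        for k in range(i, j):                    # tied |d| share the
            ranks[order[k]] = (i + j + 1) / 2    # average rank
        i = j
    W = sum(r for r, v in zip(ranks, d) if v > 0)
    return len(d), W

p1 = [0, 0, 1, 1, 1, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 5, 5, 5, 5]
p2 = [0, 0, 2, 1, 0, 3, 4, 3, 4, 1, 3, 4, 4, 5, 4, 4, 4, 5, 5, 5]
print(signed_rank(p1, p2))  # (7, 14.0)
```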
Wilcoxon signed-rank test ctd.
Stata code
. signrank offer=offer2

        sign |      obs   sum ranks    expected
-------------+---------------------------------
    positive |        3          53        59.5
    negative |        4          66        59.5
        zero |       13          91          91
-------------+---------------------------------
         all |       20         210         210

unadjusted variance      717.50
adjustment for ties       -4.38
adjustment for zeros    -204.75
                     ----------
adjusted variance        508.38

Ho: offer = offer2
             z =  -0.288
    Prob > |z| =   0.7731
The independent trial
One of the crucial assumptions behind the tests we have seen so far is that the observations are independent.
What is a trial in experimental economics?
I the action of a single player in each period?
I the average group action in each period?
I or the session average (e.g. the mean over all subjects and all periods)?
There is no easy answer to this question and no real consensus among experimental economists.
I If you are testing whether agents imitate or apply best reply, you should use individual data, and adjust the sample size in light of the observed correlation.
I For checking the statistical significance of a treatment, it is best to do the tests on run averages.
The optimal sample size (List et al. 2011)
The number of subjects one needs to involve in an experiment depends:
I on the number of treatments and on the randomization scheme
I on the type of tests one wishes to use
I on the size of these tests
I on the power of the tests, which in turn depends on the alternative hypothesis chosen, and
I on the minimum detectable effect size
The effect size is the magnitude of the treatment effect that theexperimenter wants to detect.
The optimal sample size, ctd
List et al. (Experimental Economics, 2011) propose a formula to calculate the optimal sample size for experiments
I that have a dichotomous treatment,
I where the outcome is continuous,
I and where a t-test will be used to determine differences in means between the treatment and control group.
The optimal sample size, ctd
Assumptions
I outcomes in treatments t = 0, 1 are distributed according to Yit | Xi ∼ N(µt, σt²)
I δ is the minimum average treatment effect that the experiment will be able to detect at a given significance level and power
I null hypothesis: the population treatment and control outcomes are equal, H0: µ0 = µ1
I alternative hypothesis: the population treatment and control outcomes are different, H1: µ0 ≠ µ1
The optimal sample size, ctd
If the variances of the outcomes are equal, σ0 = σ1 = σ, then the optimal sample size per cell is:

n0* = n1* = n* = 2 (tα/2 + tβ)² (σ/δ)²
The optimal sample size, ctd
Example
Significance level α = 0.05, power 1 − β = 0.80. From standard normal tables we get tα/2 = 1.96 and tβ = 0.84.
Thus, to detect a one-half standard deviation change in the outcome variable (σ/δ = 2), one would need about n* = 64 observations in each treatment cell:
n* = 2 (1.96 + 0.84)² · 2² ≈ 63
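The formula is easy to evaluate; a sketch with the example's numbers (σ/δ = 2 corresponds to the half-standard-deviation effect):

```python
def optimal_n(sigma, delta, t_alpha2=1.96, t_beta=0.84):
    """Per-cell sample size n* = 2 (t_{a/2} + t_b)^2 (sigma/delta)^2."""
    return 2 * (t_alpha2 + t_beta) ** 2 * (sigma / delta) ** 2

print(optimal_n(1.0, 0.5))  # ≈ 62.7, so about 63-64 subjects per cell
```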
The optimal sample size, ctd
Optimal sample sizes
I increase proportionally with the variance of outcomes,
I increase non-linearly with the significance level and the power,
I decrease proportionally with the square of the minimum detectable effect.
I The relative distribution of subjects across treatment and control is proportional to the standard deviations of the respective outcomes.
In experimental economics, discussion of optimal sample arrangement is rare. Reasons:
I the effect size and variance are both unknown and difficult to guess
I the analyst might be involved in multiple hypothesis testing
I the status quo is powerful!