Statistical Analysis of Experimental Data
Gerhard Riener
Düsseldorf Institute for Competition Economics
September 24, 2014
Gerhard Riener (DICE) Lecture Statistical Analysis September 24, 2014 1 / 37
Outline
Data Analysis and Experiments
Data Analysis and Experiments
I Good experimental design makes for clean data analysis
I Knowing which statistical techniques you will use for the analysis helps to plan your design
I Choose the statistical approach that best fits your needs (graphs, tests, confidence intervals, regressions)
I Think about what kind of data you can collect, to get the cleanest possible test of your hypothesis
I Compute the sample size necessary to meaningfully test your hypotheses
Data types and sample sizes
Let Y ∈ Z be a random variable, where Z is the set of possible outcomes for Y.
I if Z = {0, 1} we say that the outcome is binary, e.g. offers in the UG can be either accepted (1) or rejected (0)
I if instead Z = {h, h+1, h+2, ..., k−2, k−1, k} or Z = [h, k], we say that the outcome is multi-valued, e.g. offers in our UG were integer numbers between 0 and 10.
The set Z is known and should not be determined based on the data. Data are given by a sample of n independent draws from Y, denoted y1, y2, y3, ..., yn.
Null and alternative hypothesis
Make inference about the true distribution PY. Test the null hypothesis

H0 : PY ∈ W (1)

against the alternative

H1 : PY ∉ W (2)

where W is given. The null hypothesis has to be constructed as the complement of the desired statement or research objective. So if the aim is to provide evidence that some treatment makes a difference, then the null hypothesis is that there is no treatment effect.
One-sided and two-sided hypothesis
I two-sided when one tests for equality, e.g.:
I H0: the probability that the offer is rejected is equal to 50%
I H1: the probability that the offer is rejected is different from 50%
I one-sided when one tests for inequalities, e.g.:
I H0: the probability that the offer is rejected is greater than or equal to 50%
I H1: the probability that the offer is rejected is smaller than 50%
One-sided hypotheses should only be tested if there is a good story behind them.
Size of a test and the p-value
Hypothesis testing is about making statements based on noisy data. True statements cannot be guaranteed; the objective is to make statements that are true up to some pre-specified maximal error.
The size (or type I error probability) α of a test is the maximal probability of wrongly rejecting the null hypothesis, fixed before gathering the data. So this is the maximal probability of rejecting the null hypothesis when it is true.
The p-value associated with a given test and a given data set is the smallest value of α at which the null hypothesis would be rejected for the given data.
Size of a test, and the p-value
Type II error β
For a given subset of the alternative hypotheses, β is the maximal probability of wrongly failing to reject the null hypothesis when the true data-generating process belongs to this subset.
Power (1 − β) of a test
The probability of correctly rejecting the null hypothesis for a given sample size.
Conventions are in general much more rigid with respect to α than with respect to β.
The common reliance on statistical significance as the sole criterion leads to an excessive number of false positives (Maniadis, Tufano and List, AER, 2014).
Maniadis, Tufano and List, AER 2014
How do we assess whether an effect truly exists?
I n: number of scientific associations examined
I π: fraction of true associations
I α: typical significance level
I β: typical type II error probability (so 1 − β is the typical power)
What is the probability that a declared research finding is true? The Post-Study Probability (PSP).
MTL ctd
I of the n associations, π · n will be true and consequently (1 − π) · n false
I (1 − β) · π · n will be declared true
I α · (1 − π) · n will be declared true although they are false
I the PSP is the number of true associations declared true, divided by the number of all associations declared true

PSP = (1 − β) · π / [(1 − β) · π + α · (1 − π)] (3)
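Equation (3) is easy to check numerically; here is a minimal sketch in Python, where the values π = 0.1, α = 0.05 and β = 0.2 are illustrative choices, not numbers taken from the paper:

```python
def psp(pi, alpha, beta):
    """Post-Study Probability, eq. (3): share of associations
    declared true that are actually true."""
    true_hits = (1 - beta) * pi      # true associations declared true
    false_hits = alpha * (1 - pi)    # false associations declared true
    return true_hits / (true_hits + false_hits)

# illustrative values: 10% true associations, size 5%, power 80%
print(psp(0.1, 0.05, 0.2))  # ≈ 0.64: only about two thirds of "findings" are true
```

Even with conventional size and power, the PSP falls well below 1 whenever the fraction of true associations π is small.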
MTL: Researcher competition
Suppose k independent researchers work simultaneously on each of the n associations.
I (1 − β^k): probability that at least one of the k researchers will declare a true association as true
I 1 − (1 − α)^k: probability that a false relationship is declared true by at least one researcher
I [1 − (1 − α)^k] · (1 − π) · n associations will be declared true although they are false

PSPComp = (1 − β^k) · π / [(1 − β^k) · π + (1 − (1 − α)^k) · (1 − π)] (4)

Decreasing in k if 1 − β > α, which is usually the case.
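The comparative static in k can be verified directly; a sketch with the same illustrative parameter values as before:

```python
def psp_comp(pi, alpha, beta, k):
    """PSP with k competing research teams, eq. (4)."""
    hit = 1 - beta ** k                  # at least one team detects a true effect
    false_alarm = 1 - (1 - alpha) ** k   # at least one team gets a false positive
    return hit * pi / (hit * pi + false_alarm * (1 - pi))

# with k = 1 this reduces to the basic PSP; it falls as k grows
print(psp_comp(0.1, 0.05, 0.2, 1))  # ≈ 0.64
print(psp_comp(0.1, 0.05, 0.2, 5))  # noticeably lower
```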
MTL: Research bias
Researchers tend to interpret their results in a "favourable" way, which leads to a bias u: the probability that a positive research finding is announced although it should not have been.
I without bias, a fraction β · π · n of the true associations would be declared false because of noise
I with bias, the total number of associations declared true changes to (1 − β) · π · n + u · β · π · n
I additionally, out of the false associations, α · (1 − π) · n + u · (1 − α) · (1 − π) · n will be declared true

PSPBias = [(1 − β) · π + β · π · u] / [(1 − β) · π + β · π · u + (α + (1 − α) · u) · (1 − π)] (5)

Decreasing in u if 1 − β > α, which is usually the case.
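Equation (5) can be sketched the same way; u = 0 recovers the basic PSP of eq. (3) (parameter values again illustrative):

```python
def psp_bias(pi, alpha, beta, u):
    """PSP with research bias u, eq. (5); u = 0 recovers eq. (3)."""
    declared_true = (1 - beta) * pi + u * beta * pi
    false_declared = (alpha + (1 - alpha) * u) * (1 - pi)
    return declared_true / (declared_true + false_declared)

print(psp_bias(0.1, 0.05, 0.2, 0.0))  # ≈ 0.64, the unbiased benchmark
print(psp_bias(0.1, 0.05, 0.2, 0.3))  # lower once bias creeps in
```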
Accepting the null?
Null hypotheses are rejected or not rejected, never accepted.
Not being able to reject the null hypothesis can have many reasons:
I the null hypothesis is true
I the alternative hypothesis is true, but the test was not able to discover this
The power of a test quantifies the evidence from not being able to reject the null hypothesis, conditional on
I the sample size
I a statement about the alternatives.
Whether the sample is sufficiently large and the test sufficiently powerful depends on where the underlying distribution lies in H1. If it lies close to H0 then it will be very difficult to detect.
Qualitative analysis
Graphs and summary statistics are the best devices for getting familiar with the data. They can help
I the experimenter to detect outliers and anomalies
I the experimenter to spot unexpected regularities
I the reader to get acquainted with the data
Well-chosen descriptive statistics and graphs can give the reader (or the experimenter!) the impression that the main questions are answered. Do we really need any formal statistical tests?
Qualitative analysis: an example
Quantitative analysis
Research questions in experimental economics:
I Does treatment X affect outcome Y?
I Is outcome Y better predicted by model M1 or M2?
How do we answer these questions? For the first question, you could compare the average choices across treatments, or the average observed choices with the expected values predicted by the two theories.
An observed difference might be due to experimental error, and statistical techniques evaluate that possibility. Formal statistical tests tell you whether differences in observed outcomes across treatments are due to chance alone (sampling error) rather than to the treatments (differences in the underlying population distributions), and whether one model really explains the data better than another, or just got lucky.
Parametric vs. non-parametric statistics
I If the error distribution is assumed to belong to a known family, we are doing parametric statistics, which includes tests such as the t-test, the F-test, ...
I With fewer distributional assumptions (only independence), tests are less powerful when the parametric assumptions are in fact true. This approach is known as non-parametric statistics; we will cover some essential tools it provides.
Tests and confidence intervals
A confidence interval is a range of values designed to contain the true parameter of interest with a minimal probability 1 − α, called the coverage. E.g. a 95% confidence interval has a coverage of 95%.
I Confidence interval estimation provides a convenient alternative to significance testing in most situations.
I It represents a more flexible approach, as one does not have to specify a null hypothesis a priori, but only a parameter of interest.
I This approach does not raise the problem of how to derive implications when the null is not rejected. Moreover, confidence intervals provide information about how strong the evidence is when the null is rejected.
Regressions
A simple way of testing whether treatment effects are significant is to include a dummy variable for each treatment but the baseline in the list of explanatory variables X:

yi = Xiβ + ui

When the coefficient estimate (e.g. from OLS) for a dummy is significant, you can conclude that the treatment actually did affect the observed performance y, with respect to the baseline treatment.
Regressions can often be a useful complement to statistical tests, for example to:
I control for the effect of nuisance factors (subjects' individual characteristics, session fixed effects, etc.)
I study trends across periods in repeated games
I investigate the presence of linear/non-linear effects of some treatment variables on the subjects' behavior
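Why the dummy coefficient measures the treatment effect can be seen in a small simulation; this is a sketch with made-up data (not the lecture's dataset), computing OLS by hand for a single dummy regressor:

```python
import random

random.seed(1)
# hypothetical outcomes: baseline mean 3.0, treatment mean 3.5
baseline  = [random.gauss(3.0, 1.0) for _ in range(50)]
treatment = [random.gauss(3.5, 1.0) for _ in range(50)]
y = baseline + treatment
d = [0] * 50 + [1] * 50   # treatment dummy

# OLS for y_i = b0 + b1*d_i + u_i
n = len(y)
dbar, ybar = sum(d) / n, sum(y) / n
b1 = sum((di - dbar) * (yi - ybar) for di, yi in zip(d, y)) / \
     sum((di - dbar) ** 2 for di in d)
b0 = ybar - b1 * dbar
# with a single dummy, b1 is exactly the difference in group means,
# and b0 is the baseline group mean
```

So testing the dummy's significance is testing the difference in means between treatment and baseline.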
Regressions, Stata code
Basic code that regresses offers on the treatment dummy:

reg offer communicate

Interaction terms: often, and especially in factorial designs, we would like to evaluate the difference between the cells:

reg offer communicate##female

Heteroskedasticity-robust standard errors:

reg accept communicate, robust

Attention
Always use robust standard errors with binary outcome variables, because the standard errors of the linear probability model suffer from heteroskedasticity.
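What the robust option does can be reproduced by hand; a sketch with simulated binary data, applying the textbook HC1 (White) formula to a single-regressor linear probability model:

```python
import random
from math import sqrt

random.seed(2)
d = [0] * 100 + [1] * 100                                  # treatment dummy
y = [1 if random.random() < (0.6 if di else 0.3) else 0 for di in d]

n, k = len(y), 2
dbar, ybar = sum(d) / n, sum(y) / n
sdd = sum((di - dbar) ** 2 for di in d)
b1 = sum((di - dbar) * (yi - ybar) for di, yi in zip(d, y)) / sdd
b0 = ybar - b1 * dbar
resid = [yi - b0 - b1 * di for di, yi in zip(d, y)]

# classical SE assumes one common error variance ...
s2 = sum(e * e for e in resid) / (n - k)
se_classical = sqrt(s2 / sdd)
# ... while HC1 weights by the squared residual of each observation
meat = sum((di - dbar) ** 2 * e * e for di, e in zip(d, resid))
se_robust = sqrt(meat / sdd ** 2 * n / (n - k))
```

In the linear probability model the error variance depends on the regressors by construction, which is why the robust version is the right default.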
Interpretation of the interaction term
. tab2 AA female if consultant, sum(completed)

-> tabulation of AA by female if consultant

Means, Standard Deviations and Frequencies of Completed

Affirmativ |        Female
e action   |      Male     Female |     Total
-----------+----------------------+----------
         0 | .35555556  .21126761 | .29192547
           | .48136303  .41111323 | .45606677
           |        90         71 |       161
-----------+----------------------+----------
         1 | .31343284  .35384615 | .33333333
           | .46738976  .48188332 | .47320035
           |        67         65 |       132
-----------+----------------------+----------
     Total | .33757962  .27941176 |  .3105802
           |  .4743976  .45036901 | .46352285
           |       157        136 |       293
Interpretation of the interaction term
. reg completed AA##female if consultant, robust

Linear regression                        Number of obs =     293
                                         F(  3,   289) =    1.79
                                         Prob > F      =  0.1495
                                         R-squared     =  0.0160
                                         Root MSE      =  .46218

------------------------------------------------------------------------------
             |               Robust
   completed |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        1.AA |  -.0421227   .0764034    -0.55   0.582    -.1925004    .1082549
             |
      female |
     Female  |  -.1442879   .0704317    -2.05   0.041    -.2829121   -.0056638
             |
   AA#female |
    1#Female |   .1847013   .1085501     1.70   0.090    -.0289477    .3983503
             |
       _cons |   .3555556   .0508054     7.00   0.000      .25556     .4555511
------------------------------------------------------------------------------
The binomial test (Siegel and Castellan, 1988)
Statement: most of the offers = 3 (O3) are accepted.
I H0: the probability that O3 is rejected is ≥ 50%.
I H1: the probability that O3 is rejected is < 50%.
I If I can reject H0, I can provide statistical support to my statement.
In our dataset, 8 out of 20 subjects rejected the offer. I can run a binomial test with size α = 0.05 to check whether I can reject H0 at the 5% significance level.
I First approach: look at the table which reports the one-tailed probabilities associated with the occurrence of at most x successes out of N trials. The probability of observing no more than 8 successes under H0 is 0.252 > 0.05, so we cannot reject the null at the 5% significance level.
I Second approach: with Stata, which returns a p-value of 0.251722 for the one-sided test:

. bitesti 20 8 0.5
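The same exact one-sided p-value can be computed directly from the binomial distribution; a pure-Python sketch:

```python
from math import comb

def binom_cdf(x, n, p=0.5):
    """P(X <= x) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(x + 1))

# probability of at most 8 rejections out of 20 under H0 (p = 0.5)
print(round(binom_cdf(8, 20), 6))  # 0.251722, matching bitesti
```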
Confidence intervals
Can we say more? We can say that, with probability 90%, the true probability that O3 is rejected lies in the interval [p, p̄], where:
I p is the min. value of p for which I cannot reject H0: Py ≤ p vs. H1: Py > p
I p̄ is the max. value of p for which I cannot reject H0: Py ≥ p vs. H1: Py < p

Stata code

cii 20 8, exact level(90)

                               -- Binomial Exact --
Variable |    Obs    Mean   Std. Err.   [90% Conf. Interval]
---------+---------------------------------------------------
         |     20      .4    .1095445    .2170686    .6064151
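This exact (Clopper-Pearson) interval can be recomputed by inverting the binomial test with a bisection search; a pure-Python sketch:

```python
from math import comb

def binom_cdf(x, n, p):
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(x + 1))

def clopper_pearson(x, n, level=0.90, tol=1e-9):
    """Exact binomial confidence interval for x successes in n trials."""
    a = (1 - level) / 2
    # lower bound: the p at which P(X >= x | p) equals a
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if 1 - binom_cdf(x - 1, n, mid) < a else (lo, mid)
    lower = lo
    # upper bound: the p at which P(X <= x | p) equals a
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if binom_cdf(x, n, mid) > a else (lo, mid)
    return lower, lo

print(clopper_pearson(8, 20))  # ≈ (0.2171, 0.6064), matching cii
```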
Chi-squared test (Siegel and Castellan, 1988)
Suppose now we want to test whether males are more likely to accept O3 than females. We can use a chi-squared test. The test statistic is based on the observed and expected frequencies.
Degrees of freedom: (# rows − 1)(# cols − 1) = 1. Look into the table for the critical values of the test statistic, for a one-tailed test with size 5%.
Chi-squared test—ctd.
Stata code
. tabulate male accept3, chi2

           |       accept3
      male |         0          1 |     Total
-----------+----------------------+----------
         0 |         2          2 |         4
         1 |         6         10 |        16
-----------+----------------------+----------
     Total |         8         12 |        20

          Pearson chi2(1) =   0.2083   Pr = 0.648
CAVEAT: the chi-squared test should not be used when min_{i,j} E_{ij} < 5 or when N < 20.
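The statistic and p-value above can be recomputed from the table by hand; a sketch (for 1 degree of freedom the chi-squared survival function equals erfc(sqrt(x/2))):

```python
from math import erfc, sqrt

def chi2_2x2(a, b, c, d):
    """Pearson chi-squared statistic and p-value (df = 1) for a 2x2 table."""
    n = a + b + c + d
    rows, cols = (a + b, c + d), (a + c, b + d)
    stat = 0.0
    for i, obs_row in enumerate(((a, b), (c, d))):
        for j, obs in enumerate(obs_row):
            expected = rows[i] * cols[j] / n
            stat += (obs - expected) ** 2 / expected
    return stat, erfc(sqrt(stat / 2))

print(chi2_2x2(2, 2, 6, 10))  # ≈ (0.2083, 0.648), matching Stata
```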
Fisher exact test (Siegel and Castellan, 1988)
Generally preferred to the chi-squared test for small samples. This testis based on the probability of observing a particular set of frequenciesin a 2 × 2 table, when the marginal totals are regarded as fixed.
The probability of observing this outcome under H0 is 0.381. But whatis the probability of observing a more extreme outcome?
Fisher exact test
Consider now the most extreme outcome, in which the difference inacceptance rates between males and females is maximized.
The probability of observing the most extreme outcome under H0 is 0.0144. Hence, the probability of observing the outcome we had, or a more extreme one, under H0, is 0.381 + 0.139 + 0.014 ≈ 0.535. We cannot reject the null that males and females are equally likely to reject O3, based on our observations.
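The one-sided tail can be accumulated directly from the hypergeometric probabilities with all margins fixed; a pure-Python sketch using the 2 × 2 table from our data ([[2, 2], [6, 10]]):

```python
from math import comb

def fisher_one_sided(a, b, c, d):
    """P(a table at least as extreme as observed, pushing cell a upward)
    with all margins fixed (hypergeometric probabilities)."""
    r1, r2, c1, n = a + b, c + d, a + c, a + b + c + d
    p = 0.0
    for x in range(a, min(r1, c1) + 1):
        if 0 <= c1 - x <= r2:
            p += comb(r1, x) * comb(r2, c1 - x) / comb(n, c1)
    return p

print(round(fisher_one_sided(2, 2, 6, 10), 3))  # 0.535, matching Stata below
```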
Fisher exact test—contd.
Stata code
. tabulate male accept3, exact

           |       accept3
      male |         0          1 |     Total
-----------+----------------------+----------
         0 |         2          2 |         4
         1 |         6         10 |        16
-----------+----------------------+----------
     Total |         8         12 |        20

           Fisher's exact =  1.000
   1-sided Fisher's exact =  0.535
Z-test
According to Schlag (2011), the z-test is generally more powerful than the Fisher exact test, hence it would be preferable. However, it is less commonly used.
The test statistic φ(s1, s2) depends on the numbers of successes s1 and s2 in the two samples of sizes n1 and n2, with n1 + n2 = N:

φ(s1, s2) = ((s2/n2) − (s1/n1)) / sqrt( ((s1 + s2)/N) · (1 − (s1 + s2)/N) · (1/n1 + 1/n2) ) = 0.4564 (6)

To get the corresponding p-value, look at the table for the standard normal distribution → 0.3228.
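Equation (6) is straightforward to evaluate; a sketch with the sample counts from our data (s1 = 2 of n1 = 4, s2 = 10 of n2 = 16):

```python
from math import sqrt, erf

def prop_z(s1, n1, s2, n2):
    """Two-sample z statistic for proportions, pooled variance, eq. (6)."""
    p_pool = (s1 + s2) / (n1 + n2)
    z = (s2 / n2 - s1 / n1) / sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    p_one_sided = 0.5 * (1 - erf(abs(z) / sqrt(2)))  # 1 - Phi(|z|)
    return z, p_one_sided

print(prop_z(2, 4, 10, 16))  # z ≈ 0.4564, one-sided p ≈ 0.3240
```

The exact normal computation gives 0.3240; the 0.3228 above comes from reading a table at z ≈ 0.46.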
Z-test—ctd.
Stata code
. prtest accept3, by(male)

Two-sample test of proportions            0: Number of obs =        4
                                          1: Number of obs =       16
--------------------------------------------------------------------------
Variable |      Mean   Std. Err.      z    P>|z|    [95% Conf. Interval]
---------+----------------------------------------------------------------
       0 |        .5        .25                     .010009     .989991
       1 |      .625   .1210307                    .3877841    .8622159
---------+----------------------------------------------------------------
    diff |     -.125   .2777561                   -.6693919    .4193919
         | under Ho:   .2738613   -0.46   0.648
--------------------------------------------------------------------------
        diff = prop(0) - prop(1)                         z =  -0.4564
    Ho: diff = 0

    Ha: diff < 0             Ha: diff != 0             Ha: diff > 0
 Pr(Z < z) = 0.3240     Pr(|Z| > |z|) = 0.6481     Pr(Z > z) = 0.6760
Wilcoxon Mann-Whitney test (S&C, 1988)
Suppose now we want to check whether the distribution of male offers is different from the distribution of female offers. Males offered on average 3.0625, and females offered on average 3.
The Wilcoxon-Mann-Whitney test can be used to test the one-sided null hypothesis that PY1 first-order stochastically dominates PY2.
Consider the set of female and of male offers. Combine both samples and assign ranks 1 to N without regard to which population each value came from.
I If several sample values are equal (tied), assign to each the average of their possible ranks.
I The test statistic is the sum of the ranks assigned to the values from the smaller sample.
Wilcoxon Mann-Whitney test–ctd.
Gender m m m m f m f m m f m m m m m m m f m m
Offer 0 0 1 1 1 3 3 3 3 3 3 4 4 4 4 4 5 5 5 5
Rank 1.5 1.5 4 4 4 8.5 8.5 8.5 8.5 8.5 8.5 14 14 14 14 14 18.5 18.5 18.5 18.5
Sum of the ranks for females: 39.5. Sum of the ranks for males: 170.5.
From the Wilcoxon rank-sum table, we can see that the critical values for the test statistic are 21 and 63, for a two-sided test with size 5%. Hence, we cannot reject the null that the two distributions are equal at the 5% level.
Note: the Stata version of the test is not exact; it is based on the normal approximation, which works only for large enough samples (largest sample > 20).
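The tied ranks and rank sums above can be recomputed mechanically; a sketch over the genders and offers copied from the table:

```python
def rank_sum(values, groups, target):
    """Sum of mid-ranks (ties averaged) for observations in the target group."""
    n = len(values)
    ranks = [0.0] * n
    order = sorted(range(n), key=lambda i: values[i])
    i = 0
    while i < n:
        j = i
        while j < n and values[order[j]] == values[order[i]]:
            j += 1
        for k in range(i, j):                  # ranks i+1..j share the
            ranks[order[k]] = (i + j + 1) / 2  # average (i+1+j)/2
        i = j
    return sum(r for g, r in zip(groups, ranks) if g == target)

gender = list("mmmmfmfmmfmmmmmmmfmm")
offer  = [0, 0, 1, 1, 1, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 5, 5, 5, 5]
print(rank_sum(offer, gender, "f"))  # 39.5
print(rank_sum(offer, gender, "m"))  # 170.5
```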
Wilcoxon Mann-Whitney test—cont’d
Stata code
. ranksum offer, by(male)

Two-sample Wilcoxon rank-sum (Mann-Whitney) test

        male |      obs    rank sum    expected
-------------+---------------------------------
           0 |        4        39.5          42
           1 |       16       170.5         168
-------------+---------------------------------
    combined |       20         210         210

unadjusted variance      112.00
adjustment for ties       -5.89
                     ----------
adjusted variance        106.11

Ho: offer(male==0) = offer(male==1)
             z =  -0.243
    Prob > |z| =   0.8082
Wilcoxon signed-rank test (S&C, 1988)
Suppose now that we had repeated the UG for two periods. We want to see whether the distribution of offers made in the second period is different from the one for the first period.
The Wilcoxon signed-rank test reduces each matched pair to a single observation by considering the difference between the paired observations. To find the test statistic,
I drop the observations where no difference emerges
I rank the remaining data by absolute value.
The Wilcoxon signed-rank test statistic is the sum of the ranks of observations that originally had a positive sign.
Wilcoxon signed-rank test—ctd.
OfferP1      0   0   1   1   1   3   3   3   3   3   3   4   4   4   4   4   5   5   5   5
OfferP2      0   0   2   1   0   3   4   3   4   1   3   4   4   5   4   4   4   5   5   5
Difference   0   0   1   0  -1   0   1   0   1  -2   0   0   0   1   0   0  -1   0   0   0
Rank         -   -  3.5  -  3.5  -  3.5  -  3.5  7   -   -   -  3.5  -   -  3.5  -   -   -

N = 7, W = 14.
If the sample size is sufficiently large (N > 25) one can use the normal approximation (which is what Stata does). Otherwise, it is better to use the tables.
The table reports that the critical value for the test statistic, for a two-sided test with size 5%, is equal to 2. Hence, we cannot reject the null hypothesis at the 5% level.
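The reduction to signed ranks can be scripted the same way; a sketch over the paired offers above:

```python
def signed_rank(x, y):
    """N of non-zero differences and W = sum of ranks of positive ones."""
    d = [b - a for a, b in zip(x, y) if b != a]   # drop zero differences
    order = sorted(range(len(d)), key=lambda i: abs(d[i]))
    ranks = [0.0] * len(d)
    i = 0
    while i < len(d):
        j = i
        while j < len(d) and abs(d[order[j]]) == abs(d[order[i]]):
            j += 1
        for k in range(i, j):                    # tied |d| share the
            ranks[order[k]] = (i + j + 1) / 2    # average rank
        i = j
    W = sum(r for r, v in zip(ranks, d) if v > 0)
    return len(d), W

p1 = [0, 0, 1, 1, 1, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 5, 5, 5, 5]
p2 = [0, 0, 2, 1, 0, 3, 4, 3, 4, 1, 3, 4, 4, 5, 4, 4, 4, 5, 5, 5]
print(signed_rank(p1, p2))  # (7, 14.0)
```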
Wilcoxon signed-rank test ctd.
Stata code
. signrank offer=offer2

        sign |      obs   sum ranks    expected
-------------+---------------------------------
    positive |        3          53        59.5
    negative |        4          66        59.5
        zero |       13          91          91
-------------+---------------------------------
         all |       20         210         210

unadjusted variance      717.50
adjustment for ties       -4.38
adjustment for zeros    -204.75
                     ----------
adjusted variance        508.38

Ho: offer = offer2
             z =  -0.288
    Prob > |z| =   0.7731
The independent trial
One of the crucial assumptions behind the tests we have seen so far is that the observations are independent.
What is a trial in experimental economics?
I the action of a single player in each period?
I the average group action in each period?
I or the session average (e.g. the mean over all subjects and all periods)?
There is no easy answer to this question and no real consensus among experimental economists.
I If you are testing whether agents imitate or apply best reply, you should use individual data, and adjust the sample size in light of the observed correlation.
I For checking the statistical significance of a treatment, it is best to do the tests on run averages.
The optimal sample size (List et al. 2011)
The number of subjects one needs to involve in an experiment depends:
I on the number of treatments and on the randomization scheme
I on the type of tests one wishes to use
I on the size of these tests
I on the power of the tests, which in turn depends on the alternative hypothesis chosen, and
I on the minimum detectable effect size
The effect size is the magnitude of the treatment effect that theexperimenter wants to detect.
The optimal sample size, ctd
List et al. (Experimental Economics, 2011) propose a formula to calculate the optimal sample size for experiments
I that have a dichotomous treatment,
I where the outcome is continuous,
I and where a t-test will be used to determine differences in means between the treatment and control group.
The optimal sample size, ctd
Assumptions
I outcomes in treatments t = 0, 1 are distributed according to Yit | Xi ∼ N(µt, σt²)
I δ is the minimum average treatment effect that the experiment will be able to detect at a given significance level and power
I null hypothesis: the population treatment and control outcomes are equal, H0: µ0 = µ1
I alternative hypothesis: the population treatment and control outcomes are different, H1: µ0 ≠ µ1
The optimal sample size, ctd
If the variances of the outcomes are equal, σ0 = σ1 = σ, then the optimal sample size per cell is:

n0* = n1* = n* = 2 (tα/2 + tβ)² (σ/δ)²
The optimal sample size, ctd
Example
Significance level α = 0.05, power 1 − β = 0.80. From standard normal tables we get tα/2 = 1.96 and tβ = 0.84.
Thus, to detect a one-half standard deviation change in the outcome variable (σ/δ = 2), one would need about n* = 64 observations in each treatment cell:
n* = 2 (1.96 + 0.84)² · 2² ≈ 63
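The formula is easy to evaluate; a sketch with the example's numbers (σ/δ = 2 corresponds to the half-standard-deviation effect):

```python
def optimal_n(sigma, delta, t_alpha2=1.96, t_beta=0.84):
    """Per-cell sample size n* = 2 (t_{a/2} + t_b)^2 (sigma/delta)^2."""
    return 2 * (t_alpha2 + t_beta) ** 2 * (sigma / delta) ** 2

print(optimal_n(1.0, 0.5))  # ≈ 62.7, so about 63-64 subjects per cell
```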
The optimal sample size, ctd
Optimal sample sizes
I increase proportionally with the variance of outcomes,
I increase non-linearly with the significance level and the power,
I decrease proportionally with the square of the minimum detectable effect.
I The relative distribution of subjects across treatment and control is proportional to the standard deviations of the respective outcomes.
In experimental economics, discussion of optimal sample arrangement is rare. Reasons:
I the effect size and variance are both unknown and difficult to guess
I the analyst might be involved in multiple hypothesis testing
I the status quo is powerful!