
Effect Size Analysis

Emanuel Giger
Department of Informatics, University of Zurich
giger@ifi.uzh.ch

Harald C. Gall
Department of Informatics, University of Zurich
gall@ifi.uzh.ch

Abstract—When we seek insight into collected data we are most often forced to limit our measurements to a portion of all individuals that could hypothetically be considered for observation. Nevertheless, as researchers, we want to draw more general conclusions that are valid beyond the restricted subset we are currently analyzing. Statistical significance testing is a fundamental pattern of data analysis that helps us to infer conclusions about the entire set of possible individuals from a subset. However, the outcome of such tests depends on several factors. Software engineering experiments often address similar research questions but vary with respect to those factors, for example, they operate on different sizes or measurements. Hence, the use of statistical significance alone to interpret findings across studies is insufficient. This paper describes how significance testing can be extended by an analysis of the magnitude, i.e., the effect size, of an observation, allowing the results of different studies to be abstracted and compared.

I. INTRODUCTION

In virtually all experiments, we consider it not feasible, if not impossible, to measure our variables of interest on the entire set of all potentially interesting individuals. This (physical) collection of all individuals that matter in the context of a given study is commonly referred to as the population [1]. For example, to ensure a timely allocation of developer resources one could measure whether there is an increase in the fix–time needed for this year's defects. This can be difficult since most of the defects are yet to be fixed or detected—some might actually remain undetected. Even if all defects were detected and fixed, e.g., towards the end of the year, some defects might never be entered into the defect database [2]. As another example, interviewing the entire developer staff might not be possible if the company cannot afford to spend such an amount of labour on intensive data analysis.

Under such circumstances, we focus the analysis on only a (small) subset of the entire population, i.e., the sample set [1]. Given the examples above, we might only include the defects found during the first six months in our data; or we only interview the developers of a single unit. Moreover, in practice, most software engineering experiments do not use a subset of a single system, but instead consider a collection of several software systems to be a representative sample for the entire software system population.

Nevertheless, research results must provide some degree of generalizability in order to be useful for situations other than the study itself. Statistical significance testing (SST) tells whether findings and knowledge drawn from samples are most likely due to chance or can be safely transferred to the entire population—given a pre–determined risk to err. In the latter case, the result is denoted as statistically significant and considered to be a property of the population.¹

A drawback of SST is that a particular observed value can be significant, i.e., not due to chance, in one study but not in others. Moreover, even an arbitrarily small value can still be significant if the underlying sample is large enough. Just as significance does not imply causality, it does not allow any conclusion regarding the (practical) importance of the size of an observation either. Researchers often misleadingly use significance as a surrogate for (practical) relevance when results are presented [3]; they inherently attach some large and meaningful effect to significant results. Therefore, a dimension is required that discusses observations not only in terms of significance but also in terms of size. Such a quantity should, in addition, abstract the findings of different studies and facilitate their comparison. This is of particular relevance in the field of software engineering, where experiments differ (widely) in many aspects and identical replications are difficult [4]. In the remainder of this article, we describe how to provide such a discussion about the size of an observation in addition to its significance.

II. PROBLEM STATEMENT

The aforementioned problems of SST arise because significance depends mainly on three factors:

• Significance Criterion: A pre–determined risk to err
• Magnitude of the effect: The size of the observed effect
• Size of the sample from which findings are drawn

The first factor is less of a problem as there exist de–facto standard criteria that are consistently applied throughout empirical work, i.e., an α–level of 0.05 or 0.01, respectively. The size of an effect is usually the subject of interest itself in a study. The sample size varies the most when analyzing software engineering data, e.g., different systems exhibit a different number of files, defects, or developers, and differ in age. Moreover, even if intended, replicating an existing study in an exact manner is often not possible for practical reasons, for instance, when the original study was carried out on proprietary software data.

Typically we want to evaluate whether a difference between two observed variables in our samples is significant and, hence, most likely a true property of the entire population.

¹ How to effectively select an appropriate sample and the correct test procedure is another vital and skillful subject of data analysis but is beyond the scope of this article.

978-1-4673-6296-2/13/$31.00 © 2013 IEEE. DAPSE 2013, San Francisco, CA, USA


[Figure 1 omitted: a plot of significance probability (y-axis, 0 to 0.7) against sample size in % (x-axis, 25 to 100).]
Fig. 1. Significance probabilities, i.e., p-values, for each random subset. The red line denotes the significance criterion α = 0.05.

We employ this common task with data from the Eclipse project to illustrate the issues of SST and then discuss possible solutions. In detail, we want to find out whether there is a difference in the total lines of code (LOC) per file from release 2.0 to 3.0.² We might be interested in answering this question since LOC can be an indicator for defects, as shown by previous studies, e.g., [5], [6], [7]. If we see a large, significant increase in LOC, we could staff up software testing efforts in release 3.0 because we expect more defects.

The average LOC per file is 118 in release 2.0 and 123 in release 3.0, resulting in a difference of 5 lines of code. To test whether this observation is due to chance we performed a non-parametric test. The test yielded that the difference is significant at α = 0.05. We repeated this procedure with random subsets of 75%, 50%, and 25% of all files of releases 2.0 and 3.0. In other words, we (artificially) reduced the sample size compared to the entire dataset.³ Figure 1 clearly shows how the results of SST depend on the sample size. As expected, the significance probabilities, i.e., the probability to err, increase with decreasing sample size—exceeding the critical level of 0.05 already in the case of the 75% subset. In other words, sample size is inherently related to the power of SST [1].
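The subsampling procedure above can be reproduced with a few lines of code. The following sketch is illustrative only: the paper does not name the specific non-parametric test, so a Mann–Whitney U test is assumed here, and loc_r20/loc_r30 are synthetic placeholder arrays standing in for the per-file LOC of the two Eclipse releases rather than the original dataset.

```python
# Illustrative sketch of the subsampling experiment (not the authors' script).
# Assumptions: the non-parametric test is a two-sided Mann-Whitney U test, and
# loc_r20 / loc_r30 are synthetic stand-ins for the per-file LOC of Eclipse 2.0/3.0.
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(1)
loc_r20 = rng.lognormal(mean=4.00, sigma=1.0, size=6000)  # placeholder sample, release 2.0
loc_r30 = rng.lognormal(mean=4.05, sigma=1.0, size=7000)  # placeholder sample, release 3.0

for fraction in (1.00, 0.75, 0.50, 0.25):
    # Artificially shrink the sample, as in the 75%/50%/25% subsets of the example.
    sub20 = rng.choice(loc_r20, size=int(fraction * len(loc_r20)), replace=False)
    sub30 = rng.choice(loc_r30, size=int(fraction * len(loc_r30)), replace=False)
    _, p = mannwhitneyu(sub20, sub30, alternative="two-sided")
    print(f"{int(fraction * 100):3d}% of the files: p = {p:.4f}")
```

With such a setup the p-value tends to grow as the subset shrinks, which is the pattern Figure 1 reports for the real Eclipse data.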

The next question is how to interpret the results. Are those additional 5 lines on average—although significant—worth additional testing resources or costly refactoring measures to reduce LOC per file? Even when consulting previous studies we might be left inconclusive: in one study an additional 200 LOC per file is reported to be significant as well, although such a difference is an order of magnitude larger; in another study we read that even 2 additional lines are significant; in contrast, a third study states that 10 additional lines are not significant. Even worse, we might find studies that measure LOC differently, e.g., they ignore source code comments, or interpret file size in terms of number of statements per file rather than text lines. How can we relate the results of our example to those of other studies that answer the same question but differ, for instance, with respect to measurements or scale?

² For our example we benefit from the dataset presented in [5].
³ Note that by using random subset samples the observed difference becomes a random variable itself. Nevertheless, the differences among the subsets showed little variation in the example and, hence, are adequate for our rather illustrative case.

Although simplified, the previous example highlights the difficulties of using solely significance to discuss findings, in particular when referring to existing studies with varying settings.

III. SOLUTION: EFFECT SIZE ANALYSIS

A standardized discussion of the effect size of an observation can facilitate the interpretation of findings and alleviate some of the issues associated with SST, especially when including results from other experiments. Cohen's d is an established effect size measure proposed in the literature for this purpose [8], [9]. It expresses the difference between two means in terms of standard deviation units [10] and can be calculated from samples as follows:⁴

d = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{(\sigma_1^2 + \sigma_2^2)/2}}    (1)

X̄1 and σ1 are the mean and the standard deviation of the first sample group, i.e., the LOC of the files of release 2.0, and X̄2 and σ2 are the mean and the standard deviation of the second group, i.e., the LOC of the files of release 3.0, respectively. In other words, d measures how much the two distributions "overlap", based on the pooled standard deviation.
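As a concrete illustration, the following minimal sketch computes d as in Equation 1; the function name and the commented usage are illustrative and not part of the original study, and the sample standard deviations are used in place of σ1 and σ2.

```python
# Minimal sketch of Equation 1: Cohen's d from two samples, i.e., the mean
# difference divided by the pooled standard deviation sqrt((s1^2 + s2^2) / 2).
import numpy as np

def cohens_d(x1, x2):
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    pooled_sd = np.sqrt((x1.var(ddof=1) + x2.var(ddof=1)) / 2.0)
    return (x1.mean() - x2.mean()) / pooled_sd

# Applied to the placeholder LOC samples from the previous sketch; the
# magnitude |d| is what is compared across studies.
# print(cohens_d(loc_r20, loc_r30))
```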

Therefore, by using d we can interpret a difference independent of the scale and the type of measurements of a particular study. This property greatly facilitates the comparison of differences among individual studies when many of them operate on different scales and measurements.

A. Interpretation of Cohen’s d

If we plug the Eclipse numbers of our example into Equation 1 we get d ≈ 0.03. Similarly to SST (where we get a p-value), we are still left with a number. However, there is an important difference: in contrast to significance, d is the same for the entire dataset as well as for the 75%, 50%, and 25% random sets, i.e., it does not depend on the sample size. Hence, d is a standardized number which we can compare to effect sizes (when also expressed in d) of other studies without detailed knowledge of their measurement scales. For instance, in their survey [11], Kampenes et al. found an average effect size of 0.6 over all examined software engineering experiments. Therefore, with an effect size of only 0.03, our difference of LOC per file between release 2.0 and 3.0 is small when put into the perspective of that survey. This additional information in the context of statistical decision making may thereby protect us from inaccurate interpretations of significance alone. We could rethink costly actions in response to such a comparatively small effect size.

⁴ Cohen's d is sometimes used as a surrogate term for the original d [10] as well as for its derivatives Hedges' g and Glass's Δ.
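The claim that d does not depend on the sample size can be made tangible by extending the earlier illustrative sketch: the loop below recomputes both the p-value and d on the same random subsets, reusing the placeholder arrays and the cohens_d helper introduced above (again synthetic stand-ins, not the original Eclipse data).

```python
# Continuation of the earlier sketches: p-value vs. Cohen's d on shrinking subsets.
# Expectation: p drifts upward as the sample shrinks, while the estimate of d
# stays roughly constant (it varies only slightly, as noted in footnote 3).
from scipy.stats import mannwhitneyu

for fraction in (1.00, 0.75, 0.50, 0.25):
    sub20 = rng.choice(loc_r20, size=int(fraction * len(loc_r20)), replace=False)
    sub30 = rng.choice(loc_r30, size=int(fraction * len(loc_r30)), replace=False)
    _, p = mannwhitneyu(sub20, sub30, alternative="two-sided")
    print(f"{int(fraction * 100):3d}%: p = {p:.4f}, d = {cohens_d(sub20, sub30):+.3f}")
```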

B. Discussion

In other research fields, the over-reliance on statistical significance testing is strongly debated (and criticized) [12]. Moreover, in their "official" manual on how to write and publish results, the American Psychological Association (APA) recommends always including some kind of effect size measure in the results section [13]. In fact, even more p-value supplements are suggested in order to enhance the interpretation of results, e.g., power analysis and confidence intervals for means and effect sizes [12].

However, using such reference numbers can be a delicate task since effect sizes, in contrast to statistical significance, are rarely discussed in software engineering studies [11] as well as in some other fields, e.g., educational research [14]. Therefore, most work compares effect sizes to Cohen's original guidelines [15]: d = 0.2 denotes a small, d = 0.5 a medium, and d = 0.8 a large effect, respectively. Again, with respect to this convention the effect size of LOC per file in the example is very small. Cohen proposed the convention based on his own observations in behavioral science. Hence, its interpretation in the context of software engineering requires caution. The reason the convention is cited so frequently is the absence of other guidelines. Nevertheless, the comparison itself is still valid as it is independent of scale and measurements.
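For readers who want to apply the convention mechanically, a hypothetical helper such as the following maps |d| onto Cohen's labels; the cut-offs are taken from [15], while the function itself is merely illustrative and, as argued above, should be applied to software engineering data with caution.

```python
# Hypothetical helper applying Cohen's conventional thresholds [15]:
# d = 0.2 small, d = 0.5 medium, d = 0.8 large. The labels are an
# orientation from behavioral science, not a software engineering standard.
def cohen_label(d: float) -> str:
    magnitude = abs(d)
    if magnitude >= 0.8:
        return "large"
    if magnitude >= 0.5:
        return "medium"
    if magnitude >= 0.2:
        return "small"
    return "below the 'small' threshold"

print(cohen_label(0.03))  # the Eclipse LOC example -> below the 'small' threshold
print(cohen_label(0.6))   # average reported by Kampenes et al. [11] -> medium
```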

IV. CLOSING REMARKS

Analyzing data in terms of statistical significance, i.e., determining whether findings drawn from samples are likely due to chance or not, is a standard and relatively straightforward procedure. Clear guidelines exist, for instance, on which test is appropriate for given distributions. However, the outcome of significance testing depends on factors that are inherent to the underlying study, e.g., sample size or scale of measurements. There are limitations when using significance to interpret the magnitude of an observed effect, especially across different studies. Consequently, to understand the findings of individual studies as a whole, a standardized measure of effect size is required. An established and popular measure is Cohen's d. Despite their benefits, effect size measures, along with a consistent reporting of power and confidence intervals, are still not common practice in software data analytics. Therefore, important questions such as "what is a practically relevant effect size?" remain unanswered [11]. Nevertheless, discussing results by means of significance and effect size adds a dimension that abstracts from the details of different studies, and hence enables meta-analysis of software engineering experiments.

REFERENCES

[1] S. Dowdy, S. Weardon, and D. Chilko, Statistics for Research, 3rd ed. Wiley-Interscience, 2004.
[2] C. Bird, A. Bachmann, E. Aune, J. Duffy, A. Bernstein, V. Filkov, and P. Devanbu, "Fair and balanced?: bias in bug-fix datasets," in Proc. European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering, 2009, pp. 121–130.
[3] H. Cooper, L. Hedges, and J. Valentine, The Handbook of Research Synthesis and Meta-Analysis, 2nd ed. Russell Sage Foundation, 2009.
[4] N. Juristo and S. Vegas, "Using differences among replications of software engineering experiments to gain knowledge," in Proc. International Symposium on Empirical Software Engineering and Measurement, 2009, pp. 356–366.
[5] T. Zimmermann, R. Premraj, and A. Zeller, "Predicting defects for Eclipse," in Proc. International Workshop on Predictor Models in Software Engineering, 2007, pp. 9–16.
[6] N. Nagappan and T. Ball, "Static analysis tools as early indicators of pre-release defect density," in Proc. International Conference on Software Engineering, 2005, pp. 580–586.
[7] T. Menzies, J. Greenwald, and A. Frank, "Data mining static code attributes to learn defect predictors," IEEE Trans. Softw. Eng., vol. 33, no. 1, pp. 2–13, 2007.
[8] L. Wilkinson, "Statistical methods in psychology journals: guidelines and explanations," American Psychologist, vol. 54, pp. 594–604, 1999.
[9] J. Miller, "Applying meta-analytical procedures to software engineering experiments," Journal of Systems and Software, vol. 54, no. 1, pp. 29–39, 2000.
[10] J. Cohen, Statistical Power Analysis for the Behavioral Sciences, 2nd ed. Lawrence Erlbaum Assoc., 1988.
[11] V. B. Kampenes, T. Dybå, J. E. Hannay, and D. I. K. Sjøberg, "Systematic review: A systematic review of effect size in software engineering experiments," Inf. Softw. Technol., vol. 49, no. 11-12, pp. 1073–1086, 2007.
[12] A. Fritz, T. Scherndl, and A. Kuehberger, "A comprehensive review of reporting practices in psychological journals: Are effect sizes really enough?" Theory & Psychology, vol. 23, no. 1, pp. 98–122, 2013.
[13] American Psychological Association, Publication Manual of the American Psychological Association, 6th ed. American Psychological Association, 2009.
[14] H. J. Keselman, C. J. Huberty, L. M. Lix, S. Olejnik, R. A. Cribbie, B. Donahue, R. K. Kowalchuk, L. L. Lowman, M. D. Petoskey, J. C. Keselman, and J. R. Levin, "Statistical practices of educational researchers: An analysis of their ANOVA, MANOVA, and ANCOVA analyses," Review of Educational Research, vol. 68, no. 3, pp. 350–386, 1998.
[15] J. Cohen, "A power primer," Psychological Bulletin, vol. 112, pp. 155–159, 1992.
