
Discovering Causal Knowledge by Design

Marc E. Maier, Matthew J. Rattigan, David D. Jensen
Knowledge Discovery Laboratory
Department of Computer Science
University of Massachusetts Amherst
{maier, rattigan, jensen}@cs.umass.edu

Abstract

Causal knowledge is frequently pursued by researchers in many fields, such as medicine, economics, and social science, yet very little research in knowledge discovery focuses on discovering causal knowledge. Those researchers rely on a set of methods, called experimental and quasi-experimental designs, that exploit the ontological structure of the world to limit the set of possible statistical models that can produce observed correlations among variables. As a result, designs are powerful techniques for drawing conclusions about cause-and-effect relationships. However, designs are almost never used explicitly by knowledge discovery algorithms. In this work, we provide explicit evidence that designs have the potential to be highly useful as part of algorithms to discover causal knowledge. We first formalize the basic elements of experimental and quasi-experimental designs to characterize a design search space. We then quantify the range and diversity of designs that can be applied to examine the central questions associated with a large and complex domain (Wikipedia). Finally, we show that explicit consideration of designs can substantially improve the accuracy of causal inference and increase the statistical power of algorithms for learning the structure of graphical models.

    1. Introduction

The strongest type of knowledge we can discover is causal. Causal knowledge is actionable, as opposed to associational knowledge, which is only predictive. The existence of a causal dependence between two variables, x and y, implies that manipulating x will result in a change in y. While a necessary precondition for causality, statistical association between x and y merely implies that knowledge of x can help predict the value of y. Thus, causal knowledge is commonly pursued in fields such as medicine, economics, biology, and social science, where true understanding of behavior and knowledge of how to effect changes are necessary.

Causal discovery is strictly a more difficult task than non-causal discovery because it requires inferring a superset of the conditions of non-causal discovery. The most challenging of these conditions is to eliminate the effects of all potential common causes, whether observed or latent. As a result, very little research in knowledge discovery focuses on causal discovery, and discovering causal knowledge is left as a manual activity for researchers in fields that require explicit determination of causal knowledge. Those researchers rarely use the algorithms developed within the knowledge discovery community; instead, they much more often employ a set of formal frameworks and guidelines for gathering and analyzing data that has been developed over the past eighty years. Some of these approaches—collectively called experimental designs—help structure the planning and conduct of prospective experiments. The rest of these approaches—called quasi-experimental designs—help structure the analysis of retrospective studies of observational data. Such designs can alleviate the challenges of causal knowledge discovery by increasing statistical power and eliminating common causes.

In work published last year, Jensen conjectured that incorporating specific consideration of designs could improve algorithms for machine learning and knowledge discovery [10], and Jensen et al. later showed that quasi-experimental designs could be identified automatically [11]. In this paper, we formalize designs with respect to relational databases and adapt them to knowledge discovery in order to characterize the search space of designs. The formalization of a search space of designs presents an unexploited opportunity to develop algorithms that automate the identification and application of designs for causal discovery. We show that the design space is large for relational domains, and that relational data enables the expression of designs and facilitates their applicability to causal discovery.

Figure 1. Wikipedia is a complex domain with various entities, relationships, and attributes. The schema includes the entities PAGE, USER, PROJECT, and CATEGORY, connected by the relationships ADOPT, EDIT, JOIN, PROTECT, ASSIGN, and LINK.

We also demonstrate how designs can improve causal knowledge discovery by eliminating common causes (both latent and observed), reducing the space of alternative causal models, and increasing statistical power.

    1.1. Example

Consider the problem of understanding the operation of Wikipedia, a peer-produced general knowledge encyclopedia [21]. Wikipedia articles, or pages, are produced collectively by thousands of volunteer users. Pages are created and modified by users, and users often organize themselves into groups called projects, each of which covers a general topic. Within a project, individual pages are assessed by editors for "importance" (how central the page is to the project theme) and "quality" (a project-independent objective evaluation of key criteria). See Figure 1 for a more complete relational data schema describing the major entities and relations that make up Wikipedia.

Wikipedia exemplifies many common aspects of modern data sets. It is made up of heterogeneous entities connected by relations. Many data elements record temporally varying characteristics; Wikipedia records the precise moment of all user actions, page edits, etc., so any associations can easily be examined with respect to temporal order. Finally, the data exhibit sufficient complexity that they allow examination of a remarkably wide variety of questions.

For example, one of the most persistent claims about Wikipedia is that its reputability stems from the large number of users that collaborate to write each article [12]. We call this the "many-eyes hypothesis"—the more users that revise an article, the higher the quality of that article. If we knew that this claim were actually causal, then we could theoretically increase the quality of an article by asking more users to participate in revisions. However, to actually determine that there exists a causal dependence between the number of users editing an article and its quality, we must eliminate other plausible alternative models that could explain the observed correlation. In other words, we must account for all potential common causes, which can be very challenging. Fortunately, the data available on Wikipedia make it possible to evaluate this claim. In fact, the data allow the use of a number of different designs, each eliminating different potential threats to a valid causal conclusion.

A naive approach to this question would be to simply examine a large number of pages at a given point in time and estimate the correlation between the number of editors and the quality of the page. This design tests the assumptions of the graphical model shown in Figure 2, and given this design, the variables are highly correlated. A chi-square test yields χ² = 101.83 (n = 189; DOF = 12; p = 2.44 × 10⁻¹⁶), and approximately 66% of the variance of page quality would be attributed to the number of editors. This approach is quite similar to those conducted by many algorithms in knowledge discovery and data mining—it identifies a statistical association between two variables, but it does little to identify cause and effect. The observed correlation could stem from a common cause, such as general topic. Pages on topics of high interest to Wikipedians may be edited by a disproportionately large number of users, and that interest could also drive editors to exert special care when editing, thereby improving quality.
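
As a rough illustration, the naive design amounts to a single test of association over a flat table of pages. The sketch below assumes a hypothetical pages table with a discretized editor count and an ordinal quality column; the file and column names are assumptions, not the authors' actual data or code.

```python
# Hypothetical sketch of the naive design: one chi-square test of association
# between (binned) editor count and quality over a flat table of pages.
import pandas as pd
from scipy.stats import chi2_contingency

pages = pd.read_csv("pages.csv")          # one row per sampled article (assumed)
table = pd.crosstab(pages["editor_bin"],  # discretized number of editors
                    pages["quality"])     # ordinal quality assessment
chi2, p, dof, _ = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.3g}")
```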

We could remove this potential common cause by using an alternative design. Since projects govern pages that are thematically similar, we can use page-project relations to factor out the influence of subject matter. This more complex design helps to differentiate between the graphical model shown in Figure 2 and the model in Figure 3a. When we use project links to arrange pages into groups (called "blocks" in the language of experimental and quasi-experimental design), we find that the average correlation between editor count and page quality has decreased. A Cochran-Mantel-Haenszel test [1] yields M² = 82.33 (n = 189; DOF = 12; p = 1.48 × 10⁻¹²). Although lower, this value is still highly significant, and roughly 53.4% of the variance would now be attributed to the number of editors. The effect size has dropped, but it is still significant.
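
The blocked design can be sketched in a similar way by stratifying on project. The snippet below combines per-project chi-square statistics as a simple stand-in for the generalized Cochran-Mantel-Haenszel test reported above; column names are again assumptions.

```python
# Simplified stand-in for the blocked analysis: compute a chi-square statistic
# within each project block and combine across blocks. This is not the
# generalized CMH statistic used in the paper, only an illustration of
# stratifying on a relational neighbor (the adopting project).
import pandas as pd
from scipy.stats import chi2_contingency, chi2

pages = pd.read_csv("pages.csv")   # assumed columns: project_id, editor_bin, quality
total_stat, total_dof = 0.0, 0
for _, block in pages.groupby("project_id"):
    table = pd.crosstab(block["editor_bin"], block["quality"])
    if table.shape[0] > 1 and table.shape[1] > 1:   # skip degenerate blocks
        stat, _, dof, _ = chi2_contingency(table)
        total_stat += stat
        total_dof += dof
p = chi2.sf(total_stat, total_dof)
print(f"combined chi2 = {total_stat:.2f}, dof = {total_dof}, p = {p:.3g}")
```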

Figure 2. A simple graphical model can describe the dependence between the number of editors and the quality of an article, but it does not account for common causes.

Figure 3. (a) A more complex graphical model incorporates common causes (potentially latent) due to project; (b) an even more complex graphical model adds measured variables of article importance and age to account for all plausible common causes of number of editors and quality.

Moreover, using this approach allows a stronger claim regarding the source of the association because we have plausibly factored out at least one potential (unmeasured) common cause. The ability to factor out multiple variables, observed or latent, is a highly valuable benefit of this type of design. This design is easily found by exploiting the relational structure of the data, yet it is unexploited in current knowledge discovery algorithms.

There are still other potential common causes that could account for the observed correlation between editors and quality. For example, the importance or age of a page could influence both the number of edits and the page quality. Fortunately, we can build on the design above to factor out the potentially confounding influence of importance and age. Both of these effects can be mathematically modeled, allowing us to statistically control for their effects. When we do so, the correlation between editors and quality drops to M² = 29.13 (n = 189; DOF = 12; p = 0.00377), and the effect size also drops to 18.9% of the variance. After ruling out several plausible common causes of variation, we now have much stronger evidence that the relationship between editor count and page quality is indeed causal, and that the "many-eyes hypothesis" is valid.

    1.2. Central concepts

The example in the previous section illustrates many important concepts. First, learning causal knowledge is very useful and can have greater utility than associational knowledge. If many eyes cause page quality, then Wikipedia administrators seeking to improve article quality could try encouraging a diversity of editors within single pages. However, if we only know that the number of editors and page quality covary, then various alternative causal models could explain the correlation, each implying a different action. Thus, causal knowledge is highly useful in domains we would like to control, and, in settings that are not amenable to change, true understanding of behavior can be derived only under causal interpretations.

Second, causal knowledge discovery is a difficult task. To infer a causal dependence between two variables x and y, three conditions must be met:

1. Association — x and y must be statistically correlated.

2. Direction — The direction of causality is known.

3. No common causes — All possible common causes of x and y have been accounted for.

Clearly, causal discovery requires inferring a superset of the conditions of standard knowledge discovery. Finding statistical associations is the focus of most current knowledge discovery algorithms. The second condition is typically achieved with temporal precedence and is straightforward given some notion of time within the data. The third condition of eliminating the effects of all possible common causes is challenging.

One approach is to statistically model all possible common cause variables. In fact, structure learning algorithms that learn probabilistic models of a set of variables, including propositional algorithms (e.g., Bayesian networks [8]) and relational algorithms (e.g., Probabilistic Relational Models [6]), follow this approach. They determine structure by finding dependencies among the variables through statistical control of restricted sets of parent variables. However, even with a highly accurate model, this approach succumbs to various problems, such as the existence of latent, unmeasured variables and low statistical power.

Third, designs are useful for causal knowledge discovery. Designs are mostly employed by researchers in other fields to alleviate the challenges of discovering causal knowledge, and the techniques used in the previous section emphasize their utility. The first design used in the example is similar to those implicitly encoded in many knowledge discovery algorithms. It can certainly identify a statistical association between two variables, but it does not address common causes. It is important to note that causal conclusions are only valid up to the assumptions, or causal constraints, that we make. If we assume that there are no common causes between the number of editors and page quality, then this design, upon detecting a significant association, could conclude causation. The other designs in the example incorporate additional elements of blocking and statistical control that help to account for more potential common causes.

Finally, the space of designs is large. The example highlights that, for a given analytical task, several potential designs apply, and we are interested in identifying and selecting among those designs. Additionally, it is the relational structure of the data that enables many designs. For example, the project-page relations in Wikipedia allow the blocking design that controls for all variables attributed to projects. By utilizing the complex structure of our data, different designs are able to rule out different sets of competing hypotheses that can explain an observed correlation.

In the sections that follow, we formalize the concept of designs for knowledge discovery and relate them to experimental and quasi-experimental designs found in other empirical sciences. We demonstrate that, for typical data sets such as Wikipedia, there exist many valid designs for each given task, and that selecting among these designs provides higher analytical power than consistently selecting a single simple design.

    2. Designs

The full range of experimental and quasi-experimental designs is unfamiliar to many researchers in knowledge discovery. As a result, almost no knowledge discovery algorithms incorporate explicit consideration of experimental and quasi-experimental designs. A system might consistently make use of one hard-coded design, and human investigators sometimes explicitly select a design before submitting data to a knowledge discovery algorithm, but no system known to us automatically and dynamically selects among multiple potential designs in an effort to maximize the statistical power or identify causal relationships. The algorithm developed by Jensen et al. was the first system to explicitly identify one type of quasi-experimental design, but this work was primarily proof-of-concept and very simplistic [11]. Given the widespread use of designs in data analyses directed by human analysts and the absence of designs from current knowledge discovery algorithms, developing algorithms that automate the identification and application of designs appears to be a great opportunity for causal knowledge discovery.

Our goal in this section is to adapt and translate the literature of experimental and quasi-experimental design into a body of ideas that complement existing concepts familiar to researchers and practitioners in the knowledge discovery community. We formally define designs in terms of relational databases and the relational algebra. Given the natural composition of the relational algebra, this formalization of designs clearly characterizes a large search space for designs and stresses the potential, and need, for automating the process of identifying applicable designs.

    2.1. Designs defined

Given a relational database D characterized by a schema S = {R1, ..., Rk} with attributes A(Ri) defined over each relation Ri, a design is a function from S to a result table R, such that R specifies the data necessary to conduct a hypothesis test that determines the dependence between some treatment variable x ∈ A(Ri) and some outcome variable y ∈ A(Rj). For example, the naive design that tests the many-eyes hypothesis for Wikipedia specifies both the treatment and outcome variables as attributes of the Page relation. The aim of a design is to construct a situation in which the outcome of a single hypothesis test will determine, with high probability, whether a causal dependence holds between two specified variables. That is, designs are formulated in such a way that a valid conclusion about statistical association will correctly determine causal dependence.

The design itself does not specify the hypothesis test, only the data necessary to conduct such a test. An analyst (or algorithm) must choose both a design and an accompanying statistical test. The statistical test depends on the levels of measurement for the variables (i.e., nominal, ordinal, or continuous) and the type of design. For example, the design that blocks for associated projects must group the instances and use a hypothesis test that can handle stratified data (e.g., the Cochran-Mantel-Haenszel test). As we are primarily concerned with formally defining designs, we leave the choice of hypothesis tests as a separate issue.

As defined above, a design can be formulated given knowledge about the entities, relationships, and attributes in a domain, such as those in the schema for Wikipedia shown in Figure 1. A design tests the dependence between two variables, the treatment and outcome, of a particular class of units. Units denote some experimental subject (e.g., entity, group of entities, relationship) during a specific period of time and correspond to a single row in the result table. Treatments are potential causes and can be any measurable item that affects a unit, including an attribute value (of a specific entity or an aggregate) or a link to another entity. Formally, treatments can be defined in the relational algebra as

π_treatment(R_Base ⋈ ··· ⋈ R_Treatment)

where R_Base is the base table of the unit, and the treatment variable is projected following a series of joins (i.e., a path in the relational schema) to the underlying table that defines the treatment variable. Outcomes are potential effects and can be defined in the same manner as treatments.
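
In a dataframe setting, the same idea is a chain of joins from the unit's base table followed by a projection. The sketch below uses table and key names suggested by Figure 1 (PAGE, ADOPT, PROJECT, page_id, project_id), but these are assumptions about how the data might be stored, not the authors' implementation.

```python
# A pandas sketch of pi_treatment(R_Base ⋈ ... ⋈ R_Treatment): reach the
# treatment variable by joining along a schema path from the base table of
# units, then project it. All names are illustrative.
import pandas as pd

page = pd.read_csv("page.csv")        # base table of units
adopt = pd.read_csv("adopt.csv")      # PAGE-PROJECT relationship
project = pd.read_csv("project.csv")

# treatment stored on the base table itself: number of editors per page
treatment = page[["page_id", "editor_count"]]

# a treatment reached through joins: the project (if any) that adopted the page
path = page.merge(adopt, on="page_id").merge(project, on="project_id")
adoption = path[["page_id", "project_id"]]
```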

    2.2. Design elements

We discuss different designs in terms of three main elements: (1) sampling; (2) blocking; and (3) statistical control. Each of these elements can be formally defined within the relational algebra, and they can be naturally composed to form a vast design space. We believe this space of designs has great potential as a search space that can be automatically explored, and this is an avenue of research we are actively pursuing. In this work, we focus on defining the search space and, in later sections, demonstrating the wide applicability of designs to relational domains and the large benefit associated with using designs to discover causal knowledge.

Sampling is the process of refining units, treatments, or outcomes, and it is formally defined as

σ_φ(R_Base ⋈ ··· ⋈ R_Sample)

where φ is a propositional formula specifying the conditions necessary to be included in the sample.
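
A minimal sketch of the sampling operator, under the same assumed table layout as above: here φ requires that a page has been assessed for quality and is adopted by at least one project, as in the many-eyes example.

```python
# Sampling, sigma_phi(R_Base ⋈ ... ⋈ R_Sample): keep only the units that
# satisfy the condition phi. File and column names are assumptions.
import pandas as pd

page = pd.read_csv("page.csv")
adopt = pd.read_csv("adopt.csv")

sampled = (page.merge(adopt[["page_id"]], on="page_id")   # phi: has an adopting project
               .loc[lambda df: df["quality"].notna()]     # phi: quality is measured
               .drop_duplicates("page_id"))
```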

In the KD community, data sampling is usually only employed as a computational necessity. We tend to sample data to speed up algorithms or calculations with unacceptable runtimes, or to clean data in hopes of reducing noise. Otherwise, KD practitioners are hesitant to "throw away" data by sampling; by analyzing all the data that are available, we hope to maximize the sample sizes involved in our calculations and increase power. However, selecting subsets of the data to analyze can improve the accuracy of causal inference, yield an increase in statistical power, or help identify context-sensitive dependencies. For example, twin studies are able to draw conclusions about the population at large by looking at a small fraction of all available individuals [3]. Of course, the accuracy gained through sampling may come at the expense of generalizability. If twins are not representative of the population at large, the conclusions we draw from their study will not be widely applicable.

Blocking is a data grouping strategy, a way of organizing units to improve models and increase power. Blocking serves similar purposes to statistical control: reducing variability and pruning the space of alternative causal models. However, blocking simultaneously controls for the influence of entire classes of variables as opposed to that of individual ones. The goal of blocking is to organize experimental units into groups such that variability within each group, or block, is reduced. Blocks contain units with varying treatments and outcomes while homogenizing confounding factors that make detecting the relationship between treatment and outcome more difficult. These blocks are then modeled individually, and the results are combined into a hypothesis test to make a determination about the population at large.

Relational data sets lend themselves to the creation of blocks, as link structure is often correlated with attribute values. One main type of blocking is entity blocking, which can be defined as

π_blockId(R_Base ⋈ ··· ⋈ R_Block)

in which we identify entities that are related to a particular base unit through a series of joins in the schema. Units with common blocking entity identifiers can then be grouped together. In the example above, we blocked Wikipedia articles according to their related projects. By blocking in this manner, we control for any possible confounding aspect of projects on page quality. This allows us to control for all project attributes simultaneously, whether they are measured or unmeasured. This ability to factor out the effects of latent variables is one of the key strengths of blocking over more traditional approaches to statistical control of individual variables.
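
A sketch of entity blocking under the same assumed layout: the project identifier reached through the ADOPT relationship becomes the block id, and each block is analyzed separately before the per-block results are combined.

```python
# Entity blocking, pi_blockId(R_Base ⋈ ... ⋈ R_Block): reach a blocking entity
# through joins and group units by its identifier. Names are illustrative.
import pandas as pd

page = pd.read_csv("page.csv")
adopt = pd.read_csv("adopt.csv")

blocked = page.merge(adopt[["page_id", "project_id"]], on="page_id")
for project_id, block in blocked.groupby("project_id"):
    pass  # analyze each block separately, then combine per-block results
```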

In addition to blocking by entities, we can block instances of the same experimental unit over time:

π_id(σ_τ1(R_Base) ∪ σ_τ2(R_Base))

The same base unit, at different times τi, is unioned with itself. This allows analysts to control for all aspects of an entity, except those that are influenced by the passage of time itself, so-called maturation effects [4, 17]. In a sense, a unit at an instant in time t1 is a "twin" of the same unit at some time t2. Temporal blocking is often implicit in relational learning; any time we examine multiple measurements of the same attribute on an entity, we are blocking their associated instances temporally.
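
Temporal blocking can be sketched as pairing observations of the same unit at two times; a hypothetical observation table with one row per (page, period) is assumed below.

```python
# Temporal blocking, pi_id(sigma_tau1(R_Base) ∪ sigma_tau2(R_Base)): the same
# unit observed at two times is paired with itself by its identifier.
# The observation table and its columns are assumptions.
import pandas as pd

obs = pd.read_csv("page_observations.csv")   # columns: page_id, period, edits

before = obs[obs["period"] == "before"].set_index("page_id")["edits"]
after = obs[obs["period"] == "after"].set_index("page_id")["edits"]
pairs = pd.concat({"before": before, "after": after}, axis=1).dropna()
# each row of `pairs` is a "twin": one unit blocked with itself over time
```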

Finally, in addition to sampling and blocking, we can account for common causes using statistical control:

π_control(R_Base ⋈ ··· ⋈ R_Control)

The design can specify additional variables to control for, retrieving them in a similar manner as the treatment and outcome variables. This class of methods models the effects of potential common causes so that their effects can be removed mathematically. If there can be any claim that concepts from quasi-experimental design are used in existing knowledge discovery algorithms, then they would be based on the use of statistical control. For example, work by Pearl [14], Spirtes, Scheines, and Glymour [18], and others has explored this approach within the context of graphical models.
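
One common way to realize statistical control is to include the control variables in a regression model of the outcome, so that the coefficient on the treatment reflects its association with the controls held fixed. The sketch below treats the ordinal quality score as numeric purely to keep the illustration short; the model, file, and column names are assumptions, not the authors' analysis.

```python
# Statistical control, pi_control(R_Base ⋈ ... ⋈ R_Control), realized here as
# a regression that includes the control variables alongside the treatment.
import pandas as pd
import statsmodels.formula.api as smf

pages = pd.read_csv("pages.csv")   # assumed columns: quality, editor_count, age, importance
model = smf.ols("quality ~ editor_count + age + importance", data=pages).fit()
print(model.params["editor_count"], model.pvalues["editor_count"])
```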

The design testing the many-eyes hypothesis that combines blocking on projects and controlling for importance and age can be expressed as follows:

π_editorcount(PAGE) ⋈_pageid π_quality(PAGE) ⋈_pageid π_projectid(PAGE ⋈ ADOPT ⋈ PROJECT) ⋈_pageid π_age(PAGE) ⋈_pageid π_importance(PAGE ⋈ ADOPT)
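
Read as a query, this expression assembles one row per page containing the treatment (editor count), the outcome (quality), the blocking identifier (project), and the control variables (age and importance). A pandas sketch under assumed table names:

```python
# The combined design as a chain of merges keyed on page_id. Importance is
# assumed to live on the ADOPT relationship (it is a project-specific
# assessment); all table and column names are illustrative.
import pandas as pd

page = pd.read_csv("page.csv")
adopt = pd.read_csv("adopt.csv")

result = (page[["page_id", "editor_count", "quality", "age"]]
          .merge(adopt[["page_id", "project_id", "importance"]], on="page_id"))
# one row per unit: treatment, outcome, block id, and control variables
```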

While the precise details of the definitions of design in the literature vary [16, 17, 19], the description above is largely compatible with most other definitions of design.

Entity and temporal blocking may appear to be fundamentally different techniques, but they are actually quite similar when we consider that both operations group units. When we include multiple observations of the same entities within an experiment, we're including multiple instances of the same entity and blocking them together. Furthermore, if we consider these instances to be connected by "temporal" links, then relational and temporal blocking amount to the same technique: using link structure to group instances together to reduce variability and eliminate common causes.

One distinction between our categorization of designs and those found in the social science literature involves how we define treatment. Traditional experimental descriptions separate units into distinct treatment and control groups. In our context, treatments can take on many possible values; they are not merely binary designations. Also, most discussions of experimental design in the social science literature include detailed discussions of different "threats to validity" that accompany designs or design elements (e.g., measurement error, maturation, resentful demoralization among subjects). While we do not discuss specific threats here, all internal validity concerns can be viewed as possible alternative models that explain the data and invalidate the favored hypothesis.

    3. Applicability of designs

In order to determine whether designs are a profitable direction for enhancing knowledge discovery algorithms, we must assess how often designs actually apply to typical domains targeted by knowledge discovery researchers. To examine this question, we take an in-depth look at Wikipedia. Although the Wikipedia schema (as described in Section 1.1) is not overly complex, it can spawn dozens of causal questions with hundreds of applicable designs. There are four main entity types (articles, categories, projects, and users) and six types of relationships. Wikipedia is, however, quite large, with over 2 million pages, 8 million users, and close to 300 million edits. Furthermore, although Wikipedia has been the subject of several recent studies [9, 20, 22], we know very little about how it functions, especially from a causal standpoint. These aspects make Wikipedia an ideal candidate for studying the applicability and utility of designs.

We wanted to produce a representative list of tasks of real concern that could then be examined in the context of design. We surveyed a group of ten people, each with a bachelor's or master's degree in computer science, to obtain a sample of interesting causal questions in the Wikipedia domain. Respondents were given a simple list of attributes (see Table 1) and asked to indicate ten pairs of treatments and outcomes they found compelling for study. Treatments and outcomes were presented in one of five random orders, to eliminate biases associated with presentation order.

Table 1. Complete list of survey variables.

Entity | Attributes
Page   | Adopted by Project, Age, Assessment, Editors, Edits, Featured, Importance, Length, Notice, Number of Links, Protected, Quality, Views
User   | Role, Edits, Membership in Project
Edit   | Size, Vandalism, Minor, Reverted

Figure 4. For each treatment and outcome variable pair, the degree of shade indicates how frequently that particular task was selected by survey respondents. The row and column marginals indicate how often a particular treatment or outcome variable was selected, respectively.

The group generated a list of 99 causal discovery tasks (one respondent provided only 9 tasks), 71 of which were unique. A graphical representation of the survey results can be seen in Figure 4 (unless otherwise noted, variables correspond to page entities). For each pair of treatment and outcome variables, the shading in the corresponding square indicates the number of respondents who highlighted that particular relationship in his/her response. Page quality and vandalism were the most popular outcome variables selected, while page views and project inclusion were deemed interesting treatments.

In Section 2.2, we discussed the advantages of using blocking designs to eliminate common causes and reduce variability. Our ability to employ a blocking design does not rely on the particular treatment or outcome variables; rather, it is based on the structure of the schema surrounding the entity type associated with the outcome variable.

Figure 5. The frequency with which selected treatment and outcome variables appear on particular entity types.

Figure 6. The number of tasks that apply to a given method of entity or temporal blocking.

Figure 5 shows the frequency with which different entity types (page, user, edit) occur as the underlying entity for either the treatment or outcome variables. Page was the most frequently selected underlying entity for both treatment (76 out of 99 tasks) and outcome (69 out of 99 tasks) variables. This is probably due to the disproportionately large number of variables associated with pages as opposed to users or edits. We manually identified the different design choices, and Figure 6 depicts the applicability of the different relational and temporal blocking schemes for the tasks identified in our survey. The overwhelming majority of tasks identified were amenable to at least one blocking design.

Below, we consider the design options for three example tasks generated from the survey in detail. Our goal here is to demonstrate the breadth of applicable designs for each task rather than focus on selecting the "best" design to evaluate the relationship in question. The first two tasks correspond to the most cited treatment-outcome pairs in the group (each was identified by four respondents). The third example is one of several that were chosen by three respondents. These three tasks are representative of the entire list of tasks, as each employs a different outcome entity type.

Example 1: Page adoption by a project → Page quality
This task concerns the dependence between articles being included in Wikipedia projects and their quality. In Wikipedia, quality is assessed as an ordinal variable with 7 levels, ranging from Stub to Featured Article. The simplest design would involve only the treatment and outcome variables, thereby ignoring all other related entities, but this design would not be robust in terms of factoring out potential common causes of treatment and outcome. For instance, certain categories of pages might be highly likely to be adopted by projects as well as of uncommon quality; blocking instances by category could eliminate this interpretation of the data from the hypothesis space. Similarly, common cause threats from users, linked pages, and additional projects can be addressed via blocking, yielding 32 different designs. In addition, a temporal design can be applied to the pages themselves, examining the page quality longitudinally before and after it is added to a project. Finally, any subset of the several page entity attributes (e.g., importance, age, length) could be statistically controlled. All together, there are dozens of designs that apply to this task.

Example 2: Edit vandalism → Edit reversion
This task involves the relationship between two variables on edits; specifically, whether edits categorized as vandalism are reverted. While the association between these variables may appear trivial, establishing the strength of a generalizable causal relationship would be of great interest to Wikipedia administrators. In terms of blocking choices, edits are made by users to particular articles, so designs blocking on pages and users apply. In addition, the influence of page category could also be a confounding factor, as could project inclusion. Unlike the previous example, however, not all combinations of blocking choices are compatible within the same design—for example, it is meaningless to block on page as well as project, since page blocks would subsume project blocks. In total, there are ten different combinations of factors to use to group edit instances into blocks via relations. Since edits are instantaneous, they cannot be blocked temporally.

Example 3: User project membership → User edit count
This task considers whether user membership in a Wikipedia project has an effect on the number of edits they make on various articles. Possible alternative hypotheses that could explain a correlation include the influence of page content or project theme. Both of these can be ruled out by blocking user instances on those entities. Furthermore, temporal blocking can be employed to control for the inherent variability of the edit behavior of the users themselves.

As we demonstrate by these example tasks, the applicable design choices for even simple causal questions are numerous and potentially overwhelming to investigate manually. A knowledge discovery algorithm that could automatically search the space of possible designs would be beneficial, which is the primary motivation for performing this assessment.

    4. Utility of designs

In Section 2, we described the different elements of experimental and quasi-experimental design in a knowledge discovery context. In Section 3, we demonstrated how various designs can be used to examine interesting causal questions in a typical rich data set. Now we examine the key advantages of sampling, blocking, and statistical control through the use of two examples taken from the Wikipedia domain. Where possible, we quantify the effects of each design element, and we support these findings through additional simulations.

The two example tasks from Wikipedia can be summarized with the following questions:

1. Do "many eyes" cause page quality? That is, does the number of unique contributors to an article affect its quality? Recall this is the example used in Section 1.1. For this task, the units are Wikipedia articles, the treatment variable is the number of distinct users that have edited the article, and the outcome is its quality. To analyze this question, we randomly sampled 189 Wikipedia articles from within ten different projects.

2. Does exposure of an article cause a change in editing behavior? For this question, we examine the effect of being featured on Wikipedia's main page as "Today's featured article" on the number of edits made to that article. In this setting, the units are also Wikipedia articles, the treatment is being featured on Wikipedia's front page (which changes daily), and the outcome is the number of edits measured over different time periods. For this question, we randomly selected 97 Wikipedia articles that were featured in 2008.

    4.1. Sampling

Sampling, the first element of design we discussed, can be utilized in designs for both examples. Sampling has two primary benefits: enabling the creation of blocks and reducing variability. To analyze the many-eyes hypothesis, we sample pages that have associated projects and have been assessed for quality. Not only does this ensure that the outcome variable (quality) is measured, but it also enables a more complex design to block on projects. In this case, the sampling procedure reduces the size of the data by 72% because only a fraction of articles have been assessed. In the second example, we sample only articles that have "featured article" quality. Only pages designated with this status may be chosen as "Today's featured article," and only 1 out of every 1,130 articles actually has featured article quality. This is equivalent to more than a 99.9% reduction in the number of articles studied.

Figure 7. Featured articles have an immediate and vast increase in the number of edits when they appear on Wikipedia's front page (left). If we remove this edit spike, a clear downward trend in the number of edits is visible (right).

Additionally, sampling can be used to reduce variability, which may change the detected effect size or increase power. Suppose we hypothesize that users prepare an article for an upcoming request to be "Today's featured article" by frequently editing the page in the weeks leading up to the event. However, after the article's exposure, the number of edits decreases since it no longer requires preparation for widespread exposure. We can test this by performing a one-tailed t-test on the number of edits in the month before and the month after being featured, but we will incorrectly reject this hypothesis regardless of whether we use a paired or unpaired t-test. On closer inspection, it is clear that being featured leads to an immediate increase in edits (see Figure 7). Therefore, if we sample the number of edits in the month before and month after, but remove the edits during the week surrounding being featured, then the effect size completely changes. In fact, the sign of the effect changes from an average increase of 128 edits due to being featured to an average decrease of 21 edits due to being featured.
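
A sketch of this before/after comparison, assuming a hypothetical table of weekly edit counts indexed by week offset relative to the featuring event; the exact exclusion window and all column names are assumptions, and the one-tailed p-value is obtained by halving the two-sided value in the hypothesized direction.

```python
# Paired comparison of edits in the month before vs. after being featured,
# excluding the featured week itself (the exclusion window is an assumption).
import pandas as pd
from scipy.stats import ttest_rel

edits = pd.read_csv("weekly_edits.csv")   # assumed columns: page_id, week_offset, n_edits
window = edits[(edits["week_offset"].abs() <= 4) & (edits["week_offset"] != 0)]
before = window[window["week_offset"] < 0].groupby("page_id")["n_edits"].sum()
after = window[window["week_offset"] > 0].groupby("page_id")["n_edits"].sum()
paired = pd.concat({"before": before, "after": after}, axis=1).fillna(0)
stat, p_two_sided = ttest_rel(paired["before"], paired["after"])
print(stat, p_two_sided / 2)   # one-tailed p-value for "more edits before than after"
```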

    4.2. Blocking

Blocking designs can relax the causal sufficiency assumption by eliminating entire classes of potential common causes, including both measured and latent variables. In the many-eyes task, we block pages by project in order to control for potential common causes, as discussed in Section 1.1. Without blocking, there is a strong correlation between the number of editors and page quality. Using a simple chi-square test, we obtain a value of χ² = 101.83 (n = 189; DOF = 12; p = 2.44 × 10⁻¹⁶). After blocking on project, the Cochran-Mantel-Haenszel test achieves a value of M² = 82.33 (n = 189; DOF = 12; p = 1.48 × 10⁻¹²), indicating that blocking on project explains some of the originally observed correlation between the number of editors and page quality. By blocking, we have also reduced the hypothesis space by eliminating any potential alternative causal models in which variables on project solely control for correlation between number of editors and quality. As a result, we can make a stronger claim about the source of the remaining association.

Figure 8. (a) Looking at article edits in the month before and after an article becomes featured, a design that employs blocking can increase statistical power by reducing variance. (b) As the strength of effect increases between two variables x and y, a design blocking on common entities can detect correlation with increasingly higher power as opposed to a design without blocking.

In the featured article example, we block temporally on page by comparing the number of edits in the month before and after being featured (and ignoring the surrounding weeks of the featured event). In this case, not only do we factor out potential common causes residing on the page itself, but we also increase the power since the within-block variability is less than the between-block variability. This is the second main benefit of blocking: blocking can be used to decrease variability and increase the power to detect correlation. In the featured article example, we use a paired t-test, which increases the power of the test. In Figure 8a, we see that blocking provides the equivalent of approximately 10 additional data instances for the same level of power (e.g., 80% power). However, beyond power, we have again reduced the space of alternative models by eliminating potential common causes attributed to the page itself.

We can use simulation to examine the phenomenon of blocking to increase power in more detail. We generate data from a bipartite model that mirrors the structure of articles and projects in Wikipedia. Each page entity has two discrete variables x and y, and there is a latent variable z on the project entity; x and y covary by a given strength, but z also contributes to the value of y. In Figure 8b, we illustrate the power gained from blocking by project as a function of the strength of the dependence between x and y. As we increase the strength of effect between x and y, the power of the blocking design increases dramatically; without blocking, even a strong dependence goes undetected.
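
A minimal simulation in the spirit of Figure 8b. For brevity it uses a continuous outcome, a binary treatment, and two pages per project so that blocking reduces to a paired test; these are simplifying assumptions, not the authors' exact simulation.

```python
# Estimate the power to detect the x -> y dependence with and without
# blocking on project, when a latent project-level variable z also drives y.
import numpy as np
from scipy.stats import ttest_ind, ttest_rel

rng = np.random.default_rng(0)

def detects_effect(effect, block, n_projects=50, alpha=0.05):
    z = rng.normal(0.0, 2.0, n_projects).repeat(2)     # latent project effect (2 pages each)
    x = rng.integers(0, 2, n_projects * 2)              # binary treatment per page
    y = effect * x + z + rng.normal(0.0, 1.0, x.size)   # outcome
    if block:
        x2, y2 = x.reshape(-1, 2), y.reshape(-1, 2)
        mixed = x2.sum(axis=1) == 1                      # blocks with one treated, one not
        treated = np.where(x2[mixed, 0] == 1, y2[mixed, 0], y2[mixed, 1])
        control = np.where(x2[mixed, 0] == 1, y2[mixed, 1], y2[mixed, 0])
        p = ttest_rel(treated, control).pvalue           # paired test within blocks
    else:
        p = ttest_ind(y[x == 1], y[x == 0]).pvalue       # pooled test, z left uncontrolled
    return p < alpha

for effect in (0.2, 0.5, 1.0):
    for block in (False, True):
        power = np.mean([detects_effect(effect, block) for _ in range(500)])
        print(f"effect={effect:.1f} blocking={block}: power ≈ {power:.2f}")
```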

    4.3. Statistical control

Statistical control is an element of design applicable to all tasks. Like blocking, it can be used to eliminate potential common causes of measured variables and reduce variability. In the many-eyes task, we assume that the age and importance of an article are common causes of the number of distinct editors and page quality. As a result, it is necessary to factor out the additional source of variation due to these two variables. When we block on project and additionally condition on age and importance, the Cochran-Mantel-Haenszel test still indicates a highly significant correlation between the number of distinct editors and page quality (M² = 29.1286; n = 189; DOF = 12; p = 0.00377). Again, by controlling for additional factors we reduce the space of alternative models. Given the assumption that there are no remaining common causes, we can conclude that the many-eyes hypothesis holds true for Wikipedia: to achieve high-quality articles, we should encourage more volunteers to contribute to articles.
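
One way to sketch the combination of blocking and statistical control in a single model is to include the project block as a categorical term and age and importance as covariates. This treats quality as numeric and is only an approximation of the conditioned CMH analysis reported above; the column names are assumptions.

```python
# Blocking (project as a categorical fixed effect) plus statistical control
# (age, importance) in one regression. An illustrative approximation, not the
# authors' conditioned Cochran-Mantel-Haenszel analysis.
import pandas as pd
import statsmodels.formula.api as smf

pages = pd.read_csv("pages.csv")   # assumed: quality, editor_count, age, importance, project_id
model = smf.ols("quality ~ editor_count + age + importance + C(project_id)",
                data=pages).fit()
print(model.params["editor_count"], model.pvalues["editor_count"])
```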

As shown in Section 3, large numbers of different designs can be formulated for most causal questions of interest. In this section, we have shown that the benefits of applying any individual design can be large, and that those benefits vary substantially depending on which design is employed. This suggests an unexploited opportunity for knowledge discovery algorithms to identify, evaluate, and apply designs to facilitate causal discovery.

    5. Related work

The ideas behind experimental and quasi-experimental design date back decades [4, 5], but they continue to be an active area of research [16, 17, 19] as new types of designs and their associated analyses are formulated and classified. Most traditional quasi-experimental designs fit nicely into the framework of sampling, blocking, and statistical control. For example, the examination of editor count and page quality was an expanded "posttest only with nonequivalent control group" design, while the temporally-blocked design examining page exposure and edit frequency in Wikipedia was an example of an "untreated control group with dependent pretest and posttest samples using switching replications."

The concept of blocking as a means of controlling variance has been a widely used technique in statistical analysis [19]. The utilization of relational structure to block by entire entities rather than attributes is a generalization of the classic twin design. For more than a century, researchers have relied on twin data to control for whole classes of (often unmeasurable) attributes related to family environment and heredity [3].

The idea of using designs to maximize the utility of data collection is also studied extensively in the combinatorics community [2]. Many of these ideas are not as directly applicable to our work, as we work with "found" observational data rather than designing studies to be tabulated. Furthermore, they focus almost exclusively on binary (or at the very least, discrete) treatments and outcomes and are unsuitable for the types of continuous data sets studied in machine learning.

A small but active research community in computer science, statistics, and philosophy has focused on moving from correlational studies to causal ones, utilizing Bayes nets with causal semantics [14, 18]. This research is providing a solid theoretical foundation for causal analysis, but it focuses almost entirely on propositional data sets that don't explicitly represent relations or time. Extending these ideas to relational-temporal domains is a promising area of future investigation.

Several branches of the social sciences utilize statistical models based on the hierarchical nature of many data sets to infer causality; these include hierarchical linear models [15], structural equation modeling [13], and multi-level modeling [7]. However, each of these approaches consistently, and implicitly, relies on one specific design or set of designs. They do not define a space of designs or choose among the applicable designs.

    6. Conclusions

In this paper, we have formally defined designs with respect to relational databases and the relational algebra, which naturally compose to form a searchable space of designs. Given a formally defined search space, it is clear that we can develop algorithms to identify and evaluate designs. We have shown that large numbers of design alternatives exist for a representative sample of causal questions, which suggests that algorithms could significantly aid investigators engaged in causal discovery. Additionally, we have presented quantitative evidence that designs employing sampling, blocking, and statistical control can limit the set of possible causal models and improve the statistical power of causal discovery. The utility of designs for causal discovery indicates that algorithms identifying such designs would be very beneficial. We believe this is a major area of future research, and we are actively pursuing algorithms to search the design space and automatically apply designs to discover causal knowledge.

    7. Acknowledgments

We thank Cynthia Loiselle for her assistance in preparing this manuscript. This material is based on research sponsored by the Air Force Research Laboratory and the Intelligence Advanced Research Projects Activity (IARPA), under agreement number FA8750-07-2-0158. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the Air Force Research Laboratory and the Intelligence Advanced Research Projects Activity (IARPA), or the U.S. Government.

    References

[1] A. Agresti. Categorical Data Analysis. Wiley, New York, 1990.
[2] R. Bailey. Association Schemes: Designed Experiments, Algebra and Combinatorics. Cambridge University Press, 2004.
[3] D. Boomsma, A. Busjahn, and L. Peltonen. Classical twin studies and beyond. Nature Reviews Genetics, 3:872–882, November 2002.
[4] D. Campbell and J. Stanley. Experimental and Quasi-Experimental Designs for Research. Rand McNally, Chicago, IL, 1966.
[5] R. A. Fisher. The Design of Experiments. Oliver and Boyd, Edinburgh, 1935.
[6] N. Friedman, L. Getoor, D. Koller, and A. Pfeffer. Learning probabilistic relational models. In Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence, pages 1300–1309, August 1999.
[7] H. Goldstein. Multilevel Statistical Models. Arnold, London, 1995.
[8] D. Heckerman, D. Geiger, and D. M. Chickering. Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 20(3):197–243, 1995.
[9] M. Hu, E.-P. Lim, A. Sun, H. W. Lauw, and B.-Q. Vuong. Measuring article quality in Wikipedia: Models and evaluation. In Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management, pages 243–252, 2007.
[10] D. Jensen. Beyond prediction: Directions for probabilistic and relational learning. In Inductive Logic Programming, Seventeenth International Conference, Revised Selected Papers, number 4894, pages 4–21. Springer, Berlin, Corvallis, OR, 2008.
[11] D. Jensen, A. Fast, B. Taylor, and M. Maier. Automatic identification of quasi-experimental designs for discovering causal knowledge. In Proceedings of the Fourteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 372–380, 2008.
[12] A. Kittur and R. E. Kraut. Harnessing the wisdom of crowds in Wikipedia: Quality through coordination. In Proceedings of the ACM 2008 Conference on Computer Supported Cooperative Work, pages 37–46, 2008.
[13] R. Kline. Principles and Practice of Structural Equation Modeling. The Guilford Press, 2005.
[14] J. Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, New York, NY, 2000.
[15] S. Raudenbush and A. Bryk. Hierarchical Linear Models: Applications and Data Analysis Methods. Sage Publications, Thousand Oaks, CA, 2002.
[16] P. Rosenbaum. Observational Studies. Springer, 2002.
[17] W. R. Shadish, T. D. Cook, and D. T. Campbell. Experimental and Quasi-Experimental Designs for Generalized Causal Inference. Houghton Mifflin, Boston, MA, 2002.
[18] P. Spirtes, C. Glymour, and R. Scheines. Causation, Prediction and Search. MIT Press, Cambridge, MA, 2nd edition, 2000.
[19] W. M. Trochim. The Research Methods Knowledge Base, 2nd Edition. www.socialresearchmethods.net/kb/, October 2006.
[20] B.-Q. Vuong, E.-P. Lim, A. Sun, M.-T. Le, and H. W. Lauw. On ranking controversies in Wikipedia: Models and evaluation. In Proceedings of the First International Conference on Web Search and Web Data Mining, pages 171–182, 2008.
[21] Wikipedia. Wikipedia, The Free Encyclopedia. en.wikipedia.org/wiki/Main Page, 2009.
[22] D. Wilkinson and B. Huberman. Assessing the value of cooperation in Wikipedia. First Monday, 12(4), April 2007.