Understanding Wikipedia. Niki Kittur, nkittur@cs.cmu.edu

Post on 29-Dec-2015

TRANSCRIPT

<p>Slide 1</p> <p>Understanding Wikipedia</p> <p>Niki Kittur, nkittur@cs.cmu.edu</p> <p>Slowing growth</p> <p>Since 2007, slowing growth. Why? Fewer new topics to write about; growing resistance to new contributions.</p> <p>Proportion of reverted edits (by editor class)</p> <p>Number of active editors per month</p> <p>Suh, Convertino, Chi, &amp; Pirolli, 2009</p> <p>Wisdom of crowds poll</p> <p>What proportion of Wikipedia (in words) is made up of articles?</p> <p>0-25% | 25-50% | 50-75% | 75-100%</p> <p>As measured in bytes</p> <p>Wisdom of crowds poll</p> <p>So while a casual visitor to Wikipedia might only look at the main article page, there's much more going on beneath the surface. The pages we saw in the last example (the article talk, user talk, and procedural pages) enable conflict resolution, coordination, and the setting of policies and procedures. These pages make up more than half of the size of Wikipedia (in characters). So Wikipedia isn't just an online encyclopedia, but an organization with its own complex laws and precedents, or a sophisticated engine for coordinating and discussing content.</p> <p>Article</p> <p>But first let me take you on a brief tour of Wikipedia. This is the Music of Italy article. Articles like these are the only thing that most visitors to Wikipedia see. However, for every article in Wikipedia there is a corresponding </p> <p>Discussion</p> <p>talk or discussion page, in which editors coordinate changes to the article and resolve conflicts. At the top of the page you can see a set of templates, which include things like the quality of the article. Here Music of Italy has been rated GA-class, or a Good Article, by the community. If we scroll down, </p> <p>Discussion</p> <p>we see a table of contents for the discussions that happened on this page.</p> <p>Edit history</p> <p>Each article also has an edit history. Here you see the earliest edit history for Music of Italy, and most of the edits are made by an editor named TUF-KAT. 
</p> <p>Edit history</p> <p>You can also see evidence of anonymous users, some of whose edits were reverted due to vandalism.</p> <p>Policies + Procedures</p> <p>How does it work? Wisdom of crowds: many independent judgments. "With enough eyeballs all bugs are shallow." More contributors -&gt; more information, fewer errors, less bias.</p> <p>Wilkinson &amp; Huberman, 2007. Examined featured articles vs. non-featured articles; controlling for PageRank (i.e., popularity); featured articles = more edits, more editors; more work, more people =&gt; better outcomes.</p> <p>Edits | Editors</p> <p>Now, we're not the first to look at the wisdom of crowds in Wikipedia. In a very nice study, Dan Wilkinson and Bernardo Huberman examined what made featured articles different from non-featured articles. Featured articles are the highest quality articles in Wikipedia, and have gone through a stringent peer review process. So, controlling for PageRank, or popularity, for any given PageRank featured articles (in red) have more edits [point] and more editors [point] than non-featured articles. This seems to support the idea that having more work or more people involved leads to better outcomes.</p> <p>Difficulties with generalizing results. Cross-sectional analysis; reverse causation: articles which become featured may subsequently attract more people. Coarse quality metrics; fewer than 2000 out of &gt;2,000,000 articles are featured. What about coordination?</p> <p>However, there are some problems with generalizing from these results. Since this was a cross-sectional analysis, it leaves open the possibility of reverse causation: that is, articles which become featured may subsequently attract more people, rather than more people causing an article to become featured. 
Also, they used quite coarse quality metrics; very few articles in Wikipedia are featured, and they go through an especially stringent peer review process that might not be representative.</p> <p>Coordination costs. Increasing contributors incurs process losses (Boehm, 1981; Steiner, 1972). Diminishing returns with added people (Hill, 1982; Sheppard, 1993): super-linear increase in communication pairs; linear increase in added work. In the extreme, costs may exceed benefits to quality (Brooks, 1975). The more you can support coordination, the more benefits from adding people.</p> <p>"Adding manpower to a late software project makes it later" (Brooks, 1975)</p> <p>Coordination has costs, and in coordination-intensive tasks increasing the number of contributors leads to process losses, meaning the effectiveness of the group is lower than what the members could ideally produce. This can lead to diminishing returns with added people. For example, when you add a person to a group you have a super-linear increase in the number of pairs of people who can communicate, but only a linear increase in the added work. This means that in the extreme, the coordination costs that result from adding a person may exceed the benefits that person brings. As Fred Brooks famously said in the domain of software engineering: "Adding manpower to a late software project makes it later." This suggests that the more you can support coordination, the greater the benefits you should see from adding people.</p> <p>Research question: To what degree are editors in Wikipedia working independently versus coordinating?</p> <p>So the first question we'll look at here is: </p> <p>Research infrastructure. Analyzed entire history of Wikipedia: every edit to every article. Large dataset (as of 2008): 10+ million pages; 200+ million revisions; 2.5+ Tb. Used distributed processing: Hadoop distributed filesystem; Map/reduce to process data in parallel; reduce time for analysis from weeks to hours.</p> <p>To answer this question we looked at the entire history of Wikipedia. 
As I mentioned, this is a pretty large dataset, so for a number of the studies I'm going to talk about we processed these data using distributed processing, first on a cluster of about a dozen machines, and more recently we've been given access to Yahoo's multi-thousand-core computing cloud. What this means is that we can reduce the time for doing analyses from weeks or even months of computing time down to hours.</p> <p>[The data I'm going to present is from the June 2006 dump, which included more than 4 million pages and 58 million revisions.]</p> <p>Types of work</p> <p>Direct work: editing articles. Indirect work: user talk, creating policy. Maintenance work: reverts, vandalism.</p> <p>Using this infrastructure we broke down three different kinds of work. Direct work to the article pages is immediately consumable by visiting browsers. Indirect work includes users talking to each other and creating policy and procedure pages, and this is where most of the coordination happens. Finally, maintenance work is work that rolls back an article into a previous state, which is done to revert edits a person disagrees with or to combat vandalism.</p> <p>Point out that coordination and conflict are in the indirect and maintenance work.</p> <p>Less direct work: decrease in proportion of edits to article page</p> <p>70%</p> <p>Over time we see a decrease in the proportion of direct work to article pages, from nearly 100% at Wikipedia's inception down to about 70%.</p> <p>More indirect work: increase in proportion of edits to user talk</p> <p>8%</p> <p>More indirect work: increase in proportion of edits to user talk; increase in proportion of edits to policy pages</p> <p>11%</p> <p>More maintenance work: increase in proportion of edits that are reverts</p> <p>7%</p> <p>More wasted work: increase in proportion of edits that are reverts; increase in proportion of edits reverting vandalism</p> <p>1-2%</p> <p>But vandalism still only accounts for about 1% of all edits.</p> <p>Global level. Coordination costs are growing: less direct work (articles); more indirect work 
(article talk, user, procedure); more maintenance work (reverts, vandalism).</p> <p>Kittur, Suh, Pendleton, &amp; Chi, 2007</p> <p>So globally it looks like coordination costs are growing in Wikipedia. There's less direct work to articles, and more indirect and maintenance work. This is summarized in this graph, in which I've blown up the top area to show the trends more clearly.</p> <p>This suggests that to understand the success of Wikipedia, we need to understand the role of coordination.</p> <p>Say why the costs are growing and how it is different from other tasks. Wikipedia is high coordination work.</p> <p>Research question: How does coordination impact quality?</p> <p>So the first question we'll look at here is: </p> <p>Coordination types. Explicit coordination: direct communication among editors planning and discussing article. Implicit coordination: division of labor and workgroup structure; concentrating work in core group of editors.</p> <p>Leavitt, 1951; March &amp; Simon, 1958; Malone, 1987; Rouse et al., 1992; Thompson, 1967</p> <p>There has been a lot of work in the groups and organizations literature that can help us predict how coordination might affect quality. One major distinction in the literature is between explicit and implicit coordination. Explicit coordination refers to people communicating directly with each other and verbally planning and building consensus around a project. In Wikipedia, this communication primarily takes place on the discussion pages. 
Taking the Music of Italy discussion page we looked at before, we can see that editors are communicating about issues such as planning how the article should move forward, arguing about coverage, and discussing how to improve readability.</p> <p>Explicit coordination: Music of Italy</p> <p>planning</p> <p>Explicit coordination: Music of Italy</p> <p>coverage</p> <p>Explicit coordination: Music of Italy</p> <p>readability</p> <p>Coordination types. Explicit coordination: direct communication among editors planning and discussing article. Implicit coordination: division of labor and workgroup structure; concentrating work in core group of editors.</p> <p>Leavitt, 1951; March &amp; Simon, 1958; Malone, 1987; Rouse et al., 1992; Thompson, 1967</p> <p>Implicit coordination, on the other hand, is often based on the division of labor and the structure of the workgroup, for example, by concentrating the work in a core group of editors who enable other more peripheral members to effectively contribute. Now, what's nice about implicit coordination is that, if it's successful, it can avoid the overhead of people communicating with each other and thus should scale up better than explicit coordination.</p> <p>Implicit coordination: Music of Italy</p> <p>Let's look at a couple of examples of implicit coordination in the Music of Italy article. This is a map of the article over time, showing how TUF-KAT (in yellow), Jeffmatt (in red), and other more peripheral editors (in green) contributed to the article.</p> <p>Implicit coordination: Music of Italy</p> <p>TUF-KAT: Set scope and structure</p> <p>TUF-KAT, as we saw before, did a lot of work early in the article setting the scope and creating a structure to which...</p> <p>Implicit coordination: Music of Italy</p> <p>Filling in by many contributors</p> <p>...to which many other users, shown in green, could effectively fill in. 
</p> <p>Implicit coordination: Music of Italy</p> <p>Restructuring by Jeffmatt</p> <p>At the end of this period the article required a reorganization to integrate that content, which Jeffmatt (in red) stepped up to do. Now he didn't do this in a vacuum, but by reducing the number of people involved the task could be done while minimizing the overhead that would occur if it were distributed across a larger group. These examples suggest that implicit coordination can be especially important when there are many people involved.</p> <p>Research question: What factors lead to improved quality? More contributors. Explicit coordination: number of communication edits. Implicit coordination: concentration among editors.</p> <p>So in this study we're going to look at what factors lead to improved article quality. And we're going to look at not just adding more editors or more work, but also explicit and implicit coordination, and how they scale with increasing editors.</p> <p>Language shift from implicit coordination to concentration</p> <p>Measuring concentration. If an article has 100 edits and 10 editors, it could have: 10 editors making 10 edits each (Gini = 0), or 1 editor making 90 edits (Gini ~ 1). Measure concentration with Gini coefficient.</p> <p>Have actual formula slide. Herfindahl index look up.</p> <p>Measuring quality. Wikipedia 1.0 quality assessment scale: over 900,000 assessments; 6 classes of quality, 
from Stub up to Featured; top 3 classes require increasingly rigorous peer review. Validated community assessments with non-expert judges (r = .54***).</p> <p>For our measure of quality, we turned to Wikipedia's quality assessment drive, which has made over 900k assessments of articles into six classes of quality. We then validated these community assessments by taking a sample of articles and asking non-experts to rate their quality, and we found a significant correlation between their judgments and the community assessments.</p> <p>Analysis</p> <p>[have labels all the way] [say it is a 6 month period of time, looking at what happens between assessments]</p> <p>Essentially, our analysis models this process. 
Given some starting characteristics of an article, such as its age and quality, and some characteristics of what happened to it in the interval, we want to predict what the change in quality will be.</p> <p>Point out that communication = implicit</p> <p>Longitudinal analysis. Lagged multiple regression: predict change in quality at time t+1 given quality at time t and other predictors. Advantages of predicting changes: stronger causal claims; control for unobserved qualities, e.g., topic, importance. Control for selection biases: same factors can influence both likelihood of evaluation and assessed quality (e.g., # edits); used Heckman 2-step selection model. Kittur &amp; Kraut, 2008</p> <p>Ok, so in slightly more rigorous terms what we're doing is a longitudinal analysis using lagged m...</p>
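<p>As a rough illustration of the "Measuring concentration" slide earlier in the talk, here is a minimal sketch of a Gini coefficient computed over per-editor edit counts. The talk does not give the formula it used, so this assumes the standard discrete (uncorrected) Gini formula. The function name <code>gini</code> and the way the remaining 10 edits are split across 9 editors in the second example are assumptions made only for illustration.</p>

```python
def gini(edit_counts):
    """Gini coefficient of per-editor edit counts (0 = perfectly even).

    Assumption: standard discrete formula over sorted counts,
    G = 2 * sum(rank_i * x_i) / (n * sum(x)) - (n + 1) / n.
    """
    counts = sorted(edit_counts)
    n = len(counts)
    total = sum(counts)
    if n == 0 or total == 0:
        return 0.0  # no editors or no edits: treat as no concentration
    weighted = sum(rank * x for rank, x in enumerate(counts, start=1))
    return 2.0 * weighted / (n * total) - (n + 1.0) / n


# 100 edits, 10 editors, evenly spread: 10 edits each.
even = gini([10] * 10)

# 100 edits, 10 editors, one editor with 90 edits.
# The split of the other 10 edits is hypothetical.
skewed = gini([90, 2, 1, 1, 1, 1, 1, 1, 1, 1])

print(round(even, 3), round(skewed, 3))
```

<p>With this formula the even case comes out at 0 and the concentrated case at about 0.81. The uncorrected coefficient cannot exceed (n - 1) / n, which is 0.9 for 10 editors, so "Gini ~ 1" on the slide should be read as "close to the maximum" rather than literally 1.</p>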