Understanding Wikipedia, Niki Kittur, nkittur@cs.cmu.edu


Post on 29-Dec-2015


TRANSCRIPT

Slide 1

Understanding Wikipedia

Niki Kittur
nkittur@cs.cmu.edu

Slowing growth

Since 2007, slowing growth

Why?
- Fewer new topics to write about
- Growing resistance to new contributions

Proportion reverted edits (by editor class)

Number of active editors per month

Suh, Convertino, Chi, & Pirolli, 2009

Wisdom of crowds poll

What proportion of Wikipedia (in words) is made up of articles?

0-25% | 25-50% | 50-75% | 75-100%

As measured in bytes

Wisdom of crowds poll

So while a casual visitor to Wikipedia might only look at the main article page, there's much more going on beneath the surface. The pages we saw in the last example (the article talk, user talk, and procedural pages) enable conflict resolution, coordination, and the setting of policies and procedures. These pages make up more than half of the size of Wikipedia (in characters). So Wikipedia isn't just an online encyclopedia, but an organization with its own complex laws and precedents, or a sophisticated engine for coordinating and discussing content.

Article

But first let me take you on a brief tour of Wikipedia. This is the Music of Italy article. Articles like these are the only thing that most visitors to Wikipedia see. However, for every article in Wikipedia there is a corresponding

Discussion

talk or discussion page, in which editors coordinate changes to the article and resolve conflicts. At the top of the page you can see a bunch of templates, which include things like the quality of the article. Here Music of Italy has been rated GA-class, or a Good Article, by the community. If we scroll down,

Discussion

we see a table of contents for the discussions that happened on this page.

Edit history

Each article also has an edit history. Here you see the earliest edit history for Music of Italy, and most of the edits are made by an editor named TUF-KAT.

Edit history

You can also see evidence of anonymous users, some of whose edits were reverted due to vandalism.

Policies + Procedures

How does it work?
- Wisdom of crowds: many independent judgments
- "With enough eyeballs all bugs are shallow"
- More contributors ->
  - more information
  - fewer errors
  - less bias

Wilkinson & Huberman, 2007
- Examined featured articles vs. non-featured articles
- Controlling for PageRank (i.e., popularity)
- Featured articles = more edits, more editors
- More work, more people => better outcomes

[Figure panels: Edits; Editors]

Now, we're not the first to look at the wisdom of crowds in Wikipedia. In a very nice study, Dan Wilkinson and Bernardo Huberman examined what made featured articles different from non-featured articles. Featured articles are the highest quality articles in Wikipedia, and have gone through a stringent peer review process. So, controlling for PageRank, or popularity, for any given PageRank, featured articles (in red) have more edits [point] and more editors [point] than non-featured articles. This seems to support the idea that having more work or more people involved leads to better outcomes.

Difficulties with generalizing results
- Cross-sectional analysis
  - Reverse causation: articles which become featured may subsequently attract more people
- Coarse quality metrics
  - Fewer than 2000 out of >2,000,000 articles are featured
- What about coordination?

However, there are some problems with generalizing from these results. Since this was a cross-sectional analysis, it leaves open the possibility of reverse causation: that is, articles which become featured may subsequently attract more people, rather than more people causing an article to become featured. Also, they used quite coarse quality metrics; very few articles in Wikipedia are featured, and they go through an especially stringent peer review process that might not be representative.

Coordination costs
- Increasing contributors incurs process losses (Boehm, 1981; Steiner, 1972)
- Diminishing returns with added people (Hill, 1982; Sheppard, 1993)
  - Super-linear increase in communication pairs
  - Linear increase in added work
- In the extreme, costs may exceed benefits to quality (Brooks, 1975)
- The more you can support coordination, the more benefits from adding people
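The contrast between super-linear growth in communication pairs and linear growth in added work can be illustrated with a small calculation. This is only a sketch under simplifying assumptions that are not from the talk: every pair of contributors counts as one potential communication channel, and each contributor adds exactly one unit of work. The function names are made up for the illustration.

```python
def communication_pairs(n: int) -> int:
    """Number of distinct pairs among n contributors: n * (n - 1) / 2."""
    return n * (n - 1) // 2


def added_pairs(n: int) -> int:
    """New pairs created when a group of n contributors gains one more person."""
    return communication_pairs(n + 1) - communication_pairs(n)


if __name__ == "__main__":
    # Assumption: each new contributor adds one unit of work (linear growth),
    # while the number of potential communication pairs grows quadratically.
    for n in (2, 5, 10, 50, 100):
        print(
            f"{n:>3} contributors: {communication_pairs(n):>5} pairs; "
            f"one more person adds {added_pairs(n)} pairs and 1 unit of work"
        )
```

Under these assumptions the n-th newcomer brings one unit of work but n new pairs to keep in sync, which is one way to read the point that costs may eventually exceed benefits.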

"Adding manpower to a late software project makes it later" (Brooks, 1975)

Adding people brings coordination costs: in coordination-intensive tasks, increasing the number of contributors leads to process losses, meaning the effectiveness of the group is lower than what the members could ideally produce. This can lead to diminishing returns with added people. For example, when you add a person to a group you have a super-linear increase in the number of pairs of people who can communicate, but only a linear increase in the added work. This means that in the extreme, the coordination costs that result from adding a person may exceed the benefits that person brings. As Fred Brooks famously said in the domain of software engineering: "Adding manpower to a late software project makes it later." This suggests that the more you can support coordination, the greater the benefits you should see from adding people.

Research question

To what degree are editors in Wikipedia working independently versus coordinating?

So the first question we'll look at here is:

Research infrastructure
- Analyzed entire history of Wikipedia
  - Every edit to every article
- Large dataset (as of 2008)
  - 10+ million pages
  - 200+ million revisions
  - 2.5+ TB
- Used distributed processing
  - Hadoop distributed filesystem
  - Map/reduce to process data in parallel
  - Reduce time for analysis from weeks to hours

To answer this question we looked at the entire history of Wikipedia. As I mentioned, this is a pretty large dataset, so for a number of the studies I'm going to talk about we processed these data using distributed processing, first on a cluster of about a dozen machines, and more recently we've been given access to Yahoo's multi-thousand core computing cloud. What this means is that we can reduce the time for doing analyses from weeks or even months of computing time down to hours.

[The data I'm going to present is from the June 2006 dump, which included more than 4 million pages and 58 million revisions.]
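To give a feel for the map/reduce style of processing mentioned above, here is a minimal single-machine sketch in plain Python. It is not the Hadoop API and not the pipeline used in the study: the record layout, the sample revisions, the editor names, and the function names are all hypothetical. Each revision is mapped to a (month, namespace) count, chunks are reduced independently as parallel workers would do, and the partial results are merged.

```python
from collections import Counter
from functools import reduce

# Namespace prefixes treated as non-article pages (a simplification).
NAMESPACES = ("Talk", "User", "User talk", "Wikipedia", "Wikipedia talk")

# Hypothetical revision records: (page title, editor, ISO date).
REVISIONS = [
    ("Music of Italy", "editor_a", "2004-01-05"),
    ("Talk:Music of Italy", "editor_a", "2004-01-06"),
    ("User talk:editor_b", "editor_a", "2004-02-01"),
    ("Music of Italy", "editor_b", "2004-02-03"),
]


def namespace_of(title):
    """Return the namespace prefix of a title, or 'Article' if it has none."""
    prefix, sep, _rest = title.partition(":")
    return prefix if sep and prefix in NAMESPACES else "Article"


def map_revision(revision):
    """Map step: emit a count of 1 keyed by (month, namespace)."""
    title, _editor, date = revision
    return Counter({(date[:7], namespace_of(title)): 1})


def merge_counts(left, right):
    """Reduce step: add two partial count tables together."""
    return left + right


def count_edits(revisions, n_workers=2):
    """Split the input into chunks, reduce each chunk, then merge the partials."""
    chunks = [revisions[i::n_workers] for i in range(n_workers)]
    partials = [
        reduce(merge_counts, map(map_revision, chunk), Counter())
        for chunk in chunks
    ]
    return reduce(merge_counts, partials, Counter())


if __name__ == "__main__":
    for key, count in sorted(count_edits(REVISIONS).items()):
        print(key, count)
```

Because the reduce step is just addition of count tables, the chunks can be processed in any order and on any number of workers, which is what makes this kind of analysis easy to spread across a cluster.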

Types of work

- Direct work: editing articles
- Indirect work: user talk, creating policy
- Maintenance work: reverts, vandalism

Using this infrastructure we broke the work down into three different kinds. Direct work to the article pages is immediately consumable by visiting browsers. Indirect work includes users talking to each other and creating policy and procedure pages, and this is where most of the coordination happens. Finally, maintenance work is work that rolls back an article into a previous state, which is done to revert edits a person disagrees with or to combat vandalism.
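One possible way to operationalize these three kinds of work is sketched below. The rules are assumptions made for illustration, not the classification actually used in the study: an edit counts as maintenance if it restores a page to content that was seen earlier (detected by hashing the page text), as indirect if the page title carries a talk, user, or policy namespace prefix, and as direct otherwise. The sample history is hypothetical.

```python
import hashlib
from collections import Counter

# Assumed namespace prefixes for coordination-related (indirect) pages.
INDIRECT_PREFIXES = ("Talk:", "User:", "User talk:", "Wikipedia:", "Wikipedia talk:")


def classify(history):
    """Label each (title, text) edit, given in chronological order."""
    seen = {}  # page title -> set of content hashes seen so far
    labels = []
    for title, text in history:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        hashes = seen.setdefault(title, set())
        if digest in hashes:
            # The page returns to an earlier state: treat the edit as a revert.
            labels.append("maintenance")
        elif title.startswith(INDIRECT_PREFIXES):
            labels.append("indirect")
        else:
            labels.append("direct")
        hashes.add(digest)
    return labels


def proportions(labels):
    """Share of edits falling into each kind of work."""
    counts = Counter(labels)
    return {kind: counts[kind] / len(labels) for kind in counts}


if __name__ == "__main__":
    # Hypothetical edit history for illustration only.
    history = [
        ("Music of Italy", "version 1"),
        ("Music of Italy", "version 2"),
        ("Music of Italy", "vandalized text"),
        ("Music of Italy", "version 2"),  # restores an earlier state
        ("Talk:Music of Italy", "Plan for the article"),
    ]
    print(proportions(classify(history)))
```

Note that in this sketch the vandalizing edit itself is counted as direct work; only the edit that restores the earlier state is counted as maintenance.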

Point out that coordination and conflict are in the indirect and maintenance work.

Less direct work
- Decrease in proportion of edits to article page

70%

Over time we see a decrease in the proportion of direct work to article pages, from nearly 100% at Wikipedia's inception down to about 70%.

More indirect work
- Increase in proportion of edits to user talk

8%

More indirect work
- Increase in proportion of edits to user talk
- Increase in proportion of edits to policy pages

11%

More maintenance work
- Increase in proportion of edits that are reverts

7%

More wasted work
- Increase in proportion of edits that are reverts
- Increase in proportion of edits reverting vandalism

1-2%

But vandalism still only accounts for about 1% of all edits.

Global level

Coordination costs are growing
- Less direct work (articles)
- More indirect work (article talk, user, procedure)
- More maintenance work (reverts, vandalism)

Kittur, Suh, Pendleton, & Chi, 2007

So globally it looks like coordination costs are growing in Wikipedia. There's less direct work to articles, and more indirect and maintenance work. This is summarized in this graph, in which I've blown up the top area to show the trends more clearly.

This suggests that to understand the success of Wikipedia, we need to understand the role of coordination.

Say why the costs are growing and how it is different from other tasks. Wikipedia is high coordination work.

Research question

How does coordination impact quality?

So the first question we'll look at here is:

Coordination types
- Explicit coordination
  - Direct communication among editors planning and discussing article
- Implicit coordination
  - Division of labor and workgroup structure
  - Concentrating work in core group of editors

Leavitt, 1951; March & Simon, 1958; Malone, 1987; Rouse et al., 1992; Thompson, 1967

There has been a lot of work in the groups and organizations literature that can help us predict how coordination might affect quality. One major distinction in the literature is between explicit and implicit coordination. Explicit coordination refers to people communicating directly with each other and verbally planning and building consensus around a project. In Wikipedia, this communication primarily takes place on the discussion pages. Taking the Music of Italy discussion page we looked at before, we can see that editors are communicating about issues such as planning how the article should move forward, arguing about coverage, and discussing how to improve readability.

Explicit coordination: Music of Italy

planning

Explicit coordination: Music of Italy

coverage

Explicit coordination: Music of Italy

readability

Coordination types
- Explicit coordination
  - Direct communication among editors planning and discussing article
- Implicit coordination
  - Division of labor and workgroup structure
  - Concentrating work in core group of editors

Leavitt, 1951; March & Simon, 1958; Malone, 1987; Rouse et al., 1992; Thompson, 1967

Implicit coordination, on the other hand, is often based on the division of labor and the structure of
