Wikipedia (DBpedia): Crowdsourced Data Curation

Upload: edward-curry

Post on 26-Jan-2015

DESCRIPTION

Wikipedia is an open-source encyclopedia, built collaboratively by a large community of web editors. The success of Wikipedia as one of the most important sources of information available today still challenges existing models of content creation. Although Wikipedia’s contributors rarely use the term ‘curation’, digital curation is the central activity of Wikipedia editors, who are responsible for information quality standards. MediaWiki, the wiki platform behind Wikipedia, is already widely used as a collaborative environment inside organizations. Investigating the collaboration dynamics behind Wikipedia highlights important features and good practices that can be applied to different organizations. Our analysis focuses on the curation perspective and covers two important dimensions: social organization, and artifacts, tools & processes for cooperative work coordination. These are key enablers that support the creation of high-quality information products in Wikipedia’s decentralized environment.

TRANSCRIPT

Page 1: Wikipedia (DBpedia): Crowdsourced Data Curation

Copyright 2010 Digital Enterprise Research Institute. All rights reserved.

Digital Enterprise Research Institute www.deri.ie

Wikipedia (DBpedia): Crowdsourced Data Curation

Edward Curry, Andre Freitas, Seán O'Riain

[email protected]
http://www.deri.org/
http://www.EdwardCurry.org/

Page 2: Wikipedia (DBpedia): Crowdsourced Data Curation

Speaker Profile

Research Scientist at the Digital Enterprise Research Institute (DERI)
Leading international web science research organization

Researching how the web of data is changing the way businesses work and interact with information
Projects include studies of enterprise linked data, community-based data curation, semantic data analytics, and semantic search
Investigating utilization within the pharmaceutical, oil & gas, financial, advertising, media, manufacturing, health care, ICT, and automotive industries

Invited speaker at the 2010 MIT Sloan CIO Symposium to an audience of more than 600 CIOs

Page 3: Wikipedia (DBpedia): Crowdsourced Data Curation

Overview

Curation Background
– The Business Need for Curated Data
– What is Data Curation?
– Data Quality and Curation
– How to Curate Data

Wikipedia (DBpedia) Case Study

Best Practices from Case Study Learning 

Page 4: Wikipedia (DBpedia): Crowdsourced Data Curation

The Business Need

Working with incomplete, inaccurate, or wrong information can have disastrous consequences

Knowledge workers need:
Access to the right information
Confidence in that information

Page 5: Wikipedia (DBpedia): Crowdsourced Data Curation

The Problems with Data

Flawed Data
Affects 25% of critical data in the world’s top companies (Gartner)

Data Quality
Recent banking crisis (Economist, Dec ’09)
Inaccurate figures made it difficult to manage operations (investments exposure and risk)
– “asset are defined differently in different programs”
– “numbers did not always add up”
– “departments do not trust each other’s figures”
– “figures … not worth the pixels they were made of”

Page 6: Wikipedia (DBpedia): Crowdsourced Data Curation

What is Data Curation?

Digital Curation
Selection, preservation, maintenance, collection, and archiving of digital assets

Data Curation
Active management of data over its life-cycle

Data Curators
Ensure data is trustworthy, discoverable, accessible, reusable, and fit for use
– Museum cataloguers of the Internet age

Page 7: Wikipedia (DBpedia): Crowdsourced Data Curation

What is Data Curation?

Data Governance
Convergence of data quality, data management, business process management, and risk management

Data Curation is a complementary activity
Part of overall data governance strategy for organization

Data Curator = Data Steward ??
Overlapping terms between communities

Page 8: Wikipedia (DBpedia): Crowdsourced Data Curation

Data Quality and Curation

What is Data Quality?
Desirable characteristics for information resource
Described as a series of quality dimensions
– Discoverability, Accessibility, Timeliness, Completeness, Interpretation, Accuracy, Consistency, Provenance & Reputation

Data curation can be used to improve these quality dimensions

Page 9: Wikipedia (DBpedia): Crowdsourced Data Curation

Data Quality and Curation

Discoverability & Accessibility
Curate to streamline search by storing and classifying in appropriate and consistent manner

Accuracy
Curate to ensure data correctly represents the “real-world” values it models

Consistency
Curate to ensure data created and maintained using standardized definitions, calculations, terms, and identifiers
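
The consistency point above can be illustrated with a small sketch. It assumes a hand-made controlled vocabulary (`STANDARD_TERMS`) and a helper named `standardize`; both are hypothetical and not taken from the slides. Values that match the vocabulary are rewritten to the standard term, and anything unrecognized is set aside for a human curator.

```python
# Hypothetical controlled vocabulary: variant spelling -> standard term.
STANDARD_TERMS = {
    "usa": "United States",
    "u.s.": "United States",
    "united states": "United States",
    "uk": "United Kingdom",
    "united kingdom": "United Kingdom",
}


def standardize(values):
    """Return (standardized values, values that need a curator's attention)."""
    clean, unknown = [], []
    for value in values:
        key = value.strip().lower()
        if key in STANDARD_TERMS:
            clean.append(STANDARD_TERMS[key])
        else:
            # Not in the vocabulary: leave the decision to a human curator.
            unknown.append(value)
    return clean, unknown
```

Calling `standardize(["USA", " uk ", "Eire"])` would standardize the first two values and return `"Eire"` in the list for manual review.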

Page 10: Wikipedia (DBpedia): Crowdsourced Data Curation

Data Quality and Curation

Provenance & Reputation
Curate to track source of data and determine reputation
Curate to include the objectivity of the source/producer
– Is the information unbiased, unprejudiced, and impartial?
– Or does it come from a reputable but partisan source?

Other dimensions discussed in chapter

Page 11: Wikipedia (DBpedia): Crowdsourced Data Curation

How to Curate Data

Data Curation is a large field with sophisticated techniques and processes

Section provides high-level overview on:
Should you curate data?
Types of Curation
Setting up a curation process

Additional detail and references available in book chapter

Page 12: Wikipedia (DBpedia): Crowdsourced Data Curation

Should You Curate Data?

Curation can have multiple motivations
Improving accessibility, quality, consistency,…

Will the data benefit from curation?
Identify business case
Determine if potential return supports investment

Not all enterprise data should be curated
Suits knowledge-centric data rather than transactional operations data

Page 13: Wikipedia (DBpedia): Crowdsourced Data Curation

Types of Data Curation

Multiple approaches to curate data, no single correct way

Who?
– Individual Curators
– Curation Departments
– Community-based Curation

How?
– Manual Curation
– (Semi-)Automated
– Sheer Curation

Page 14: Wikipedia (DBpedia): Crowdsourced Data Curation

Types of Data Curation – Who?

Individual Data Curators
Suitable for infrequently changing small quantity of data
– (<1,000 records)
– Minimal curation effort (minutes per record)

Page 15: Wikipedia (DBpedia): Crowdsourced Data Curation

Types of Data Curation – Who?

Curation Departments
Curation experts working with subject matter experts to curate data within formal process
– Can deal with large curation effort (1,000s of records)

Limitations
Scalability: Can struggle with large quantities of dynamic data (>million records)
Availability: Post-hoc nature creates delay in curated data availability

Page 16: Wikipedia (DBpedia): Crowdsourced Data Curation

Types of Data Curation – Who?

Community-Based Data Curation
Decentralized approach to data curation
Crowd-sourcing the curation process
– Leverages community of users to curate data
Wisdom of the community (crowd)
Can scale to millions of records
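
The three "Who?" slides give rough size guidance that can be read as a simple decision rule. The sketch below is one possible reading, not a rule stated by the authors: the function name `suggest_curation_model` and the exact cut-offs (1,000 and 1,000,000 records) are assumptions drawn from the approximate figures on the slides.

```python
def suggest_curation_model(record_count: int, changes_frequently: bool) -> str:
    """Suggest who should curate, using the rough sizes quoted on the slides."""
    # Small, infrequently changing data: a single curator can cope.
    if record_count < 1_000 and not changes_frequently:
        return "individual curator"
    # Very large and dynamic data: departments struggle, communities can scale.
    if record_count > 1_000_000 and changes_frequently:
        return "community-based curation"
    # Everything in between: a curation department with a formal process.
    return "curation department"
```

In practice the choice would also depend on the business case and the available expertise, which a record count cannot capture.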

Page 17: Wikipedia (DBpedia): Crowdsourced Data Curation

Types of Data Curation – How?

Manual Curation
Curators directly manipulate data
Can tie users up with low-value-add activities

(Semi-)Automated Curation
Algorithms can (semi-)automate curation activities such as data cleansing, record de-duplication, and classification
Can be supervised or approved by human curators
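
A minimal sketch of semi-automated curation, assuming the task is duplicate detection: an algorithm proposes groups of likely duplicates, and a human curator approves or rejects each merge. The helper names (`normalize`, `propose_duplicates`, `curate`) and the matching rule (case and whitespace only) are illustrative assumptions; real systems use far richer matching.

```python
from collections import defaultdict
from typing import Callable


def normalize(name: str) -> str:
    """Lower-case and collapse whitespace so trivially different spellings match."""
    return " ".join(name.lower().split())


def propose_duplicates(records: list[str]) -> list[list[str]]:
    """Automated step: group records whose normalized form is identical."""
    groups: dict[str, list[str]] = defaultdict(list)
    for record in records:
        groups[normalize(record)].append(record)
    return [group for group in groups.values() if len(group) > 1]


def curate(records: list[str], approve: Callable[[list[str]], bool]) -> list[str]:
    """Human step: merge a proposed group only if the curator approves it."""
    result = list(records)
    for group in propose_duplicates(records):
        if approve(group):
            # Keep the first spelling and drop the other members of the group.
            for duplicate in group[1:]:
                result.remove(duplicate)
    return result
```

Here `approve` stands in for the human curator; passing `lambda group: True` accepts every proposal, while a real workflow would show each group to a person.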

Page 18: Wikipedia (DBpedia): Crowdsourced Data Curation

Types of Data Curation – How?

Sheer curation, or Curation at Source
Curation activities integrated in normal workflow of those creating and managing data
Can be as simple as vetting or “rating” the results of a curation algorithm
Results can be available immediately

Blended Approaches: Best of Both
Sheer curation + post hoc curation department
Allows immediate access to curated data
Ensures quality control with expert curation

Page 19: Wikipedia (DBpedia): Crowdsourced Data Curation

Setting up a Curation Process

5 steps to set up a curation process:

1 - Identify what data you need to curate

2 - Identify who will curate the data

3 - Define the curation workflow

4 - Identify appropriate data-in & data-out formats

5 - Identify the artifacts, tools, and processes needed to support the curation process

Page 20: Wikipedia (DBpedia): Crowdsourced Data Curation

Wikipedia

The World’s Largest Open Digital Curation Community

Page 21: Wikipedia (DBpedia): Crowdsourced Data Curation

Wikipedia

Open-source encyclopedia
Collaboratively built by large community
Challenges existing models of content creation

More than 19,000,000 articles
270+ languages, 3,200,000+ articles in English
More than 157,000 active contributors

Studies show accuracy and stylistic formality are equivalent to resources developed in expert-based closed communities, i.e. Columbia and Britannica encyclopedias

Page 22: Wikipedia (DBpedia): Crowdsourced Data Curation

Wikipedia

MediaWiki
Wiki platform behind Wikipedia
– Widespread and popular technology
Wikis can also support data curation
– Lowers entry barriers for collaborative data curation

Widely used inside organizations
Intellipedia covering 16 U.S. Intelligence agencies
Wiki Proteins, curated Protein data for knowledge discovery and annotation

Page 23: Wikipedia (DBpedia): Crowdsourced Data Curation

Wikipedia

Decentralized environment supports creation of high quality information with:
Social organization
Artifacts, tools & processes for cooperative work coordination

Wikipedia collaboration dynamics highlight good practices

Page 24: Wikipedia (DBpedia): Crowdsourced Data Curation

Wikipedia – Social Organization

Any user can edit its contents
Without prior registration

Does not lead to a chaotic scenario
In practice a highly scalable approach for high quality content creation on the Web

Relies on a simple but highly effective way to coordinate its curation process

Curation is the activity of Wikipedia admins
Responsibility for information quality standards

Page 25: Wikipedia (DBpedia): Crowdsourced Data Curation

Wikipedia – Social Organization

Four main types of accounts:

Anonymous users
– Identified by their associated IP address

Registered users
– Users with an account in the Wikipedia website

Administrators/Editors
– Registered users with additional permissions in the system
– Access to curation tools

Bots
– Programs that perform repetitive tasks

Page 26: Wikipedia (DBpedia): Crowdsourced Data Curation

Wikipedia – Social Organization

Page 27: Wikipedia (DBpedia): Crowdsourced Data Curation

Wikipedia – Social Organization

Incentives
Improvement of one’s reputation
Sense of efficacy
– Contributing effectively to a meaningful project

Over time the focus of editors typically changes
– From curators of a few articles in specific topics
– To more global curation perspective
– Enforcing quality assessment of Wikipedia as a whole

Page 28: Wikipedia (DBpedia): Crowdsourced Data Curation

Wikipedia – Artifacts, Tools & Processes

Wiki Article Editor (Tool)
WYSIWYG or markup text editor

Talk Pages (Tool)
Public arena for discussions around Wikipedia resources

Watchlists (Tool)
Helps curators to actively monitor the integrity and quality of resources they contribute to

Permission Mechanisms (Tool)
Users with administrator status can perform critical actions such as removing pages and granting administrative permissions to new users

Page 29: Wikipedia (DBpedia): Crowdsourced Data Curation

Wikipedia – Artifacts, Tools & Processes

Automated Edition (Tool)
Bots are automated or semi-automated tools that perform repetitive tasks over content

Page History and Restore (Tool)
Historical trail of changes to a Wikipedia Resource

Guidelines, Policies & Templates (Artifact)
Defines curation guidelines for editors to assess article quality

Dispute Resolution (Process)
Dispute mechanism between editors over the article contents

Article Edition, Deletion, Merging, Redirection, Transwiking, Archival (Process)
Describe the curation actions over Wikipedia resources
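
The "Page History and Restore" idea from this slide can be sketched in a few lines. This is a toy model under stated assumptions, not how MediaWiki stores revisions: the class name `PageHistory` and its methods are hypothetical. The point it shows is that a restore adds a new revision rather than rewriting the trail, so every change stays visible.

```python
class PageHistory:
    """Keeps every saved revision of a page so any earlier one can be restored."""

    def __init__(self, initial_text: str) -> None:
        self._revisions = [initial_text]

    @property
    def current(self) -> str:
        """Text of the latest revision."""
        return self._revisions[-1]

    def edit(self, new_text: str) -> None:
        """Save a new revision on top of the existing trail."""
        self._revisions.append(new_text)

    def restore(self, revision_index: int) -> None:
        # Restoring appends a copy of the old revision, so the trail is never rewritten.
        self._revisions.append(self._revisions[revision_index])

    def history(self) -> list[str]:
        """Return a copy of the full trail, oldest revision first."""
        return list(self._revisions)
```

For example, after an unwanted edit a curator could call `restore(0)` to bring back the first revision while the unwanted edit remains in the history.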

Page 30: Wikipedia (DBpedia): Crowdsourced Data Curation

Wikipedia – DBpedia

DBpedia Knowledge base
Inherits massive volume of curated Wikipedia data
Built using information from infobox properties
Indirectly uses wiki as data curation platform

DBpedia provides direct access to data
3.4 million entities and 1 billion RDF triples
Comprehensive data infrastructure
– Concept URIs, definitions, and basic types
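
To make "RDF triples" concrete, the sketch below models a handful of subject-predicate-object statements as plain Python tuples and matches them against a pattern. The identifiers (`ex:Dublin`, `ex:country`, and so on) are made-up examples, not values copied from DBpedia, and `match` is a hypothetical helper; real access to DBpedia would normally go through its published data sets or query endpoint.

```python
# Illustrative triples only; the identifiers below are made-up examples,
# not values copied from DBpedia.
triples = [
    ("ex:Dublin", "rdf:type", "ex:City"),
    ("ex:Dublin", "ex:country", "ex:Ireland"),
    ("ex:Galway", "rdf:type", "ex:City"),
    ("ex:Galway", "ex:country", "ex:Ireland"),
    ("ex:Ireland", "rdf:type", "ex:Country"),
]


def match(pattern, data):
    """Return triples matching a (subject, predicate, object) pattern; None is a wildcard."""
    return [t for t in data if all(p is None or p == v for p, v in zip(pattern, t))]


# All subjects that are typed as a city.
cities = [s for s, _, _ in match((None, "rdf:type", "ex:City"), triples)]
```

Each infobox property on a wiki page can be thought of as contributing one such statement about the page's subject.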

Page 31: Wikipedia (DBpedia): Crowdsourced Data Curation

Page 32: Wikipedia (DBpedia): Crowdsourced Data Curation

Wikipedia – DBpedia

Page 33: Wikipedia (DBpedia): Crowdsourced Data Curation

Overview

Curation Background
– The Business Need for Curated Data
– What is Data Curation?
– Data Quality and Curation
– How to Curate Data

Wikipedia (DBpedia) Case Study

Best Practices from Case Study Learning 

Page 34: Wikipedia (DBpedia): Crowdsourced Data Curation

Best Practices from Case Study Learning

Social Best Practices
Participation
Engagement
Incentives
Community Governance Models

Technical Best Practices
Data Representation
Human- and Automated Curation
Track Provenance

Page 35: Wikipedia (DBpedia): Crowdsourced Data Curation

Social Best Practices

Participation
Stakeholder involvement for data producers and consumers must occur early in project
– Provides insight into basic questions of what they want to do, for whom, and what it will provide
White papers are an effective means to present these ideas and solicit opinion from community
– Can be used to establish informal ‘social contract’ for community

Page 36: Wikipedia (DBpedia): Crowdsourced Data Curation

Social Best Practices

Engagement
Outreach activities essential for promotion and feedback
Typical contributor-to-consumer ratios of less than 5%
Social communication and networking forums are useful
– Majority of community may not communicate using these media
– Communication by email still remains important

Page 37: Wikipedia (DBpedia): Crowdsourced Data Curation

Social Best Practices

Incentives
Sheer curation needs line of sight from data curating activity to tangible exploitation benefits
Lack of awareness of value proposition will slow emergence of collaborative contributions
Recognizing contributing curators through a formal feedback mechanism
– Reinforces contribution culture
– Directly increases output quality

Page 38: Wikipedia (DBpedia): Crowdsourced Data Curation

Social Best Practices

Community Governance Models
Effective governance structure is vital to ensure success of community
Internal communities and consortia perform well when they leverage traditional corporate and democratic governance models
Open communities need to engage the community within the governance process
– Follow less orthodox approaches using meritocratic and autocratic principles

Page 39: Wikipedia (DBpedia): Crowdsourced Data Curation

Technical Best Practices

Data Representation
Must be robust and standardized to encourage community usage and tools development
Support for legacy data formats and ability to translate data forward to support new technology and standards

Human & Automated Curation
Balancing will improve data quality
Automated curation should always defer to, and never override, human curation edits
– Automate validating data deposition and entry
– Target community at focused curation tasks
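
The rule that automated curation defers to human edits can be expressed as a small merge step. The sketch assumes each record is a dictionary and that the system tracks which fields a human has edited; the function name `apply_bot_edits` and the field names in the usage note are hypothetical.

```python
def apply_bot_edits(record: dict, human_edited: set, bot_edits: dict) -> dict:
    """Apply automated edits, skipping any field a human curator has already edited."""
    updated = dict(record)
    for field, value in bot_edits.items():
        if field in human_edited:
            continue  # human curation wins; the bot's suggestion is dropped
        updated[field] = value
    return updated
```

With `record = {"label": "Widget", "category": "tools"}`, `human_edited = {"category"}` and `bot_edits = {"category": "hardware", "status": "checked"}`, only `status` would be added; the human-curated `category` stays untouched.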

Page 40: Wikipedia (DBpedia): Crowdsourced Data Curation

Technical Best Practices

Track Provenance
All curation activities should be recorded and maintained as part of data provenance effort
– Especially where human curators are involved
Users can have different perspectives of provenance
– A scientist may need to evaluate the fine-grained experiment description behind the data
– For a business analyst the ‘brand’ of data provider can be sufficient for determining quality
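
One way to picture provenance tracking is an append-only log of curation events that can be read at two levels of detail, matching the two perspectives on this slide. The class and field names below (`CurationEvent`, `ProvenanceLog`, `source`, and so on) are illustrative assumptions rather than a schema from the talk.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class CurationEvent:
    curator: str   # who made the change (a person or a bot)
    action: str    # e.g. "edit", "merge", "delete"
    target: str    # identifier of the curated record
    source: str    # where the new value came from
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


class ProvenanceLog:
    """Append-only record of curation activity."""

    def __init__(self) -> None:
        self._events: list[CurationEvent] = []

    def record(self, event: CurationEvent) -> None:
        self._events.append(event)

    def trail(self, target: str) -> list[CurationEvent]:
        """Fine-grained view: every event for one record, oldest first."""
        return [e for e in self._events if e.target == target]

    def sources(self, target: str) -> set[str]:
        """Coarse view: just the providers ('brands') behind a record."""
        return {e.source for e in self.trail(target)}
```

A scientist might inspect `trail(...)` event by event, while a business analyst might only look at `sources(...)`.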

Page 41: Wikipedia (DBpedia): Crowdsourced Data Curation

Conclusions

Data curation can ensure the quality of data and its fitness for use

Pre-competitive data can be shared without conferring a commercial advantage

Pre-competitive data communities
Common curation tasks carried out once in public domain
Reduces cost, increases quantity and quality

Page 42: Wikipedia (DBpedia): Crowdsourced Data Curation

Acknowledgements

Collaborators
Andre Freitas & Seán O'Riain

Insight from Thought Leaders
Evan Sandhaus (Semantic Technologist), Rob Larson (Vice President Product Development and Management), and Gregg Fenton (Director Emerging Platforms) from the New York Times
Krista Thomas (Vice President, Marketing & Communications), Tom Tague (OpenCalais initiative Lead) from Thomson Reuters
Antony Williams (VP of Strategic Development) from ChemSpider
Helen Berman (Director), John Westbrook (Product Development) from the Protein Data Bank
Nick Lynch (Architect with AstraZeneca) from the Pistoia Alliance.

The work presented has been funded by Science Foundation Ireland under Grant No. SFI/08/CE/I1380 (Lion-2).

Page 43: Wikipedia (DBpedia): Crowdsourced Data Curation

Further Information

The Role of Community-Driven Data Curation for Enterprises
Edward Curry, Andre Freitas, & Seán O'Riain

In David Wood (ed.), Linking Enterprise Data, Springer, 2010.

Available Free at:

http://3roundstones.com/led_book/led-curry-et-al.html