Data Curation at the New York Times
DESCRIPTION
The New York Times is the largest metropolitan newspaper and the third-largest newspaper in the United States. The Times website, nytimes.com, is ranked as the most popular newspaper website in the United States and is an important source of advertising revenue for the company. The NYT has a rich history of curating its articles, and its 100-year-old curated repository has ultimately defined its participation as one of the first players in the emerging Web of Data. Data curation is a process that can ensure the quality of data and its fitness for use. Traditional approaches to curation are struggling with increased data volumes and near-real-time demands for curated data. In response, curation teams have turned to community crowd-sourcing and semi-automated metadata tools for assistance. E. Curry, A. Freitas, and S. O’Riáin, “The Role of Community-Driven Data Curation for Enterprises,” in Linking Enterprise Data, D. Wood, Ed. Boston, MA: Springer US, 2010, pp. 25-47.
TRANSCRIPT
Copyright 2010 Digital Enterprise Research Institute. All rights reserved.
Digital Enterprise Research Institute www.deri.ie
Data Curation at the New York Times
Edward Curry, Andre Freitas, Seán O'Riain
[email protected]
http://www.deri.org/
http://www.EdwardCurry.org/
Speaker Profile
Research Scientist at the Digital Enterprise Research Institute (DERI)
  Leading international web science research organization
Researching how the web of data is changing the way businesses work and interact with information
  Projects include studies of enterprise linked data, community-based data curation, semantic data analytics, and semantic search
  Investigate utilization within the pharmaceutical, oil & gas, financial, advertising, media, manufacturing, health care, ICT, and automotive industries
Invited speaker at the 2010 MIT Sloan CIO Symposium to an audience of more than 600 CIOs
Overview
Curation Background
  The Business Need for Curated Data
  What is Data Curation?
  Data Quality and Curation
  How to Curate Data
New York Times Case Study
Best Practices from Case Study Learning
The Business Need
Working with incomplete, inaccurate, or wrong information can have disastrous consequences
Knowledge workers need:
  Access to the right information
  Confidence in that information
The Problems with Data
Flawed Data
  Affects 25% of critical data in the world’s top companies (Gartner)
Data Quality
  Recent banking crisis (Economist Dec’09)
  Inaccurate figures made it difficult to manage operations (investments exposure and risk)
    – “asset are defined differently in different programs”
    – “numbers did not always add up”
    – “departments do not trust each other’s figures”
    – “figures … not worth the pixels they were made of”
What is Data Curation?
Digital Curation
  Selection, preservation, maintenance, collection, and archiving of digital assets
Data Curation
  Active management of data over its life-cycle
Data Curators
  Ensure data is trustworthy, discoverable, accessible, reusable, and fit for use
    – Museum cataloguers of the Internet age
What is Data Curation?
Data Governance
  Convergence of data quality, data management, business process management, and risk management
Data Curation is a complementary activity
  Part of the overall data governance strategy for an organization
Data Curator = Data Steward ??
  Overlapping terms between communities
Data Quality and Curation
What is Data Quality?
  Desirable characteristics for an information resource
  Described as a series of quality dimensions
    – Discoverability, Accessibility, Timeliness, Completeness, Interpretation, Accuracy, Consistency, Provenance & Reputation
Data curation can be used to improve these quality dimensions
Data Quality and Curation
Discoverability & Accessibility
  Curate to streamline search by storing and classifying in an appropriate and consistent manner
Accuracy
  Curate to ensure data correctly represents the “real-world” values it models
Consistency
  Curate to ensure data is created and maintained using standardized definitions, calculations, terms, and identifiers
Data Quality and Curation
Provenance & Reputation
  Curate to track source of data and determine reputation
  Curate to include the objectivity of the source/producer
    – Is the information unbiased, unprejudiced, and impartial?
    – Or does it come from a reputable but partisan source?
Other dimensions discussed in chapter
How to Curate Data
Data Curation is a large field with sophisticated techniques and processes
Section provides high-level overview on:
  Should you curate data?
  Types of Curation
  Setting up a curation process
Additional detail and references available in book chapter
Should You Curate Data?
Curation can have multiple motivations
  Improving accessibility, quality, consistency,…
Will the data benefit from curation?
  Identify business case
  Determine if potential return supports investment
Not all enterprise data should be curated
  Suits knowledge-centric data rather than transactional operations data
Types of Data Curation
Multiple approaches to curate data, no single correct way
Who?
  – Individual Curators
  – Curation Departments
  – Community-based Curation
How?
  – Manual Curation
  – (Semi-)Automated
  – Sheer Curation
Types of Data Curation – Who?
Individual Data Curators
  Suitable for infrequently changing small quantity of data
    – (<1,000 records)
    – Minimal curation effort (minutes per record)
Types of Data Curation – Who?
Curation Departments
  Curation experts working with subject matter experts to curate data within a formal process
    – Can deal with large curation effort (1,000s of records)
Limitations
  Scalability: Can struggle with large quantities of dynamic data (>million records)
  Availability: Post-hoc nature creates delay in curated data availability
Types of Data Curation – Who?
Community-Based Data Curation
  Decentralized approach to data curation
  Crowd-sourcing the curation process
    – Leverages community of users to curate data
  Wisdom of the community (crowd)
  Can scale to millions of records
Types of Data Curation – How?
Manual Curation
  Curators directly manipulate data
  Can tie users up with low-value-add activities
(Semi-)Automated Curation
  Algorithms can (semi-)automate curation activities such as data cleansing, record de-duplication, and classification
  Can be supervised or approved by human curators
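A minimal Python sketch of the semi-automated approach described above, under stated assumptions: an algorithm proposes likely duplicate records and a human curator approves or rejects each proposal. The record fields, the name-normalization rule, and all function names are illustrative and do not come from the talk.

```python
from itertools import combinations


def normalize(name):
    """Lower-case a name and drop punctuation so near-identical names compare equal."""
    kept = "".join(ch for ch in name.lower() if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())


def propose_duplicates(records):
    """Automated step: return pairs of record ids whose normalized names match."""
    proposals = []
    for a, b in combinations(records, 2):
        if normalize(a["name"]) == normalize(b["name"]):
            proposals.append((a["id"], b["id"]))
    return proposals


def apply_approved(records, proposals, approve):
    """Human step: drop the second record of a pair only if approve() says yes."""
    dropped = {b for a, b in proposals if approve(a, b)}
    return [r for r in records if r["id"] not in dropped]


# Hypothetical sample data.
records = [
    {"id": 1, "name": "New York Times Co."},
    {"id": 2, "name": "new york times co"},
    {"id": 3, "name": "Thomson Reuters"},
]
proposals = propose_duplicates(records)
# Stand-in for a curator's decision; a real tool would ask a person.
curated = apply_approved(records, proposals, approve=lambda a, b: True)
```

Here the algorithm only proposes; nothing is merged unless the `approve` callback, standing in for the human curator, accepts the pair.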
Types of Data Curation – How?
Sheer curation, or Curation at Source
  Curation activities integrated in normal workflow of those creating and managing data
  Can be as simple as vetting or “rating” the results of a curation algorithm
  Results can be available immediately
Blended Approaches: Best of Both
  Sheer curation + post hoc curation department
  Allows immediate access to curated data
  Ensures quality control with expert curation
Setting up a Curation Process
5 steps to set up a curation process:
1 - Identify what data you need to curate
2 - Identify who will curate the data
3 - Define the curation workflow
4 - Identify appropriate data-in & data-out formats
5 - Identify the artifacts, tools, and processes needed to support the curation process
The New York Times
100 Years of Expert Data Curation
The New York Times
Largest metropolitan and third largest newspaper in the United States
nytimes.com
  Most popular newspaper website in US
100 year old curated repository defining its participation in the emerging Web of Data
The New York Times
Data curation dates back to 1913
  Publisher/owner Adolph S. Ochs decided to provide a set of additions to the newspaper
New York Times Index
  Organized catalog of article titles and summaries
    – Containing issue, date and column of article
    – Categorized by subject and names
    – Introduced on quarterly then annual basis
Transitory content of newspaper became important source of searchable historical data
  Often used to settle historical debates
The New York Times
Index Department was created in 1913
  Curation and cataloguing of NYT resources
    – Since 1851 NYT had low quality index for internal use
Developed a comprehensive catalog using a controlled vocabulary
  Covering subjects, personal names, organizations, geographic locations and titles of creative works (books, movies, etc), linked to articles and their summaries
Current Index Dept. has ~15 people
The New York Times
Challenges with consistently and accurately classifying news articles over time
  Keywords expressing subjects may show some variance due to cultural or legal constraints
  Identities of some entities, such as organizations and places, changed over time
Controlled vocabulary grew to hundreds of thousands of categories
  Adding complexity to classification process
The New York Times
Increased importance of Web drove need to improve categorization of online content
Curation carried out by Index Department
  Library-time (days to weeks)
  Print edition can handle next-day index
Not suitable for real-time online publishing
  nytimes.com needed a same-day index
The New York Times
Introduced two-stage curation process
  Editorial staff perform best-effort semi-automated sheer curation at point of online publication
    – Several hundred journalists
  Index Department follows up with long-term accurate classification and archiving
Benefits:
  Non-expert journalist curators provide instant accessibility to online users
  Index Department provides long-term high-quality curation in a “trust but verify” approach
NYT Curation Workflow
1 - Curation starts with article getting out of the newsroom
2 - Member of editorial staff submits article to web-based, rule-based information extraction system (SAS Teragram)
3 - Teragram uses linguistic extraction rules based on subset of Index Dept’s controlled vocabulary
4 - Teragram suggests tags based on the Index vocabulary that can potentially describe the content of article
5 - Editorial staff member selects terms that best describe the contents and inserts new tags if necessary
6 - Reviewed by the taxonomy managers with feedback to editorial staff on classification process
7 - Article is published online at nytimes.com
8 - At later stage article receives second level curation by Index Dept.: additional Index tags and a summary
9 - Article is submitted to NYT Index
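The two-stage workflow above can be sketched in Python as follows. This is an illustration under stated assumptions, not a description of how SAS Teragram or the NYT systems work: the vocabulary entries, the substring-matching rule, and all class and function names are hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical slice of a controlled vocabulary: tag -> trigger phrases.
VOCABULARY = {
    "Banking and Financial Institutions": ["bank", "lender"],
    "Newspapers": ["newspaper", "newsroom"],
}


@dataclass
class Article:
    text: str
    tags: list = field(default_factory=list)
    summary: str = ""
    stage: str = "draft"  # draft -> published -> indexed


def suggest_tags(text, vocabulary=VOCABULARY):
    """Rule-based step: suggest every tag with a trigger phrase in the text."""
    lowered = text.lower()
    return [tag for tag, phrases in vocabulary.items()
            if any(p in lowered for p in phrases)]


def editorial_curation(article, selected, new_tags=()):
    """Stage 1: the editor keeps the suggestions they chose and may add new tags."""
    suggested = suggest_tags(article.text)
    article.tags = [t for t in suggested if t in selected] + list(new_tags)
    article.stage = "published"
    return article


def index_curation(article, extra_tags, summary):
    """Stage 2: expert curators later add further tags and a summary."""
    article.tags += [t for t in extra_tags if t not in article.tags]
    article.summary = summary
    article.stage = "indexed"
    return article


article = Article("The bank told the newsroom it would repay the lender.")
editorial_curation(article, selected={"Newspapers"}, new_tags=["Loans"])
index_curation(article,
               extra_tags=["Banking and Financial Institutions"],
               summary="Bank agrees to repay lender.")
```

The article is usable with its stage-1 tags as soon as it is published; the second pass adds to them later.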
The New York Times
Early adopter of Linked Open Data (June ‘09)
The New York Times
Linked Open Data @ data.nytimes.com
  Subset of 10,000 tags from index vocabulary
  Dataset of people, organizations & locations
    – Complemented by search services to consume data about articles, movies, best sellers, Congress votes, real estate,…
Benefits
  Improves traffic by third party data usage
  Lowers development cost of new applications for different verticals inside the website
    – E.g. movies, travel, sports, books
Best Practices from Case Study Learning
Social Best Practices
  Participation
  Engagement
  Incentives
  Community Governance Models
Technical Best Practices
  Data Representation
  Human and Automated Curation
  Track Provenance
Social Best Practices
Participation
  Stakeholder involvement for data producers and consumers must occur early in project
    – Provides insight into basic questions of what they want to do, for whom, and what it will provide
  White papers are an effective means to present these ideas and solicit opinion from community
    – Can be used to establish informal ‘social contract’ for community
Social Best Practices
Engagement
  Outreach activities essential for promotion and feedback
  Typically fewer than 5% of consumers become contributors
  Social communication and networking forums are useful
    – Majority of community may not communicate using these media
    – Communication by email still remains important
Social Best Practices
Incentives
  Sheer curation needs line of sight from data curating activity to tangible exploitation benefits
  Lack of awareness of value proposition will slow emergence of collaborative contributions
  Recognizing contributing curators through a formal feedback mechanism
    – Reinforces contribution culture
    – Directly increases output quality
Social Best Practices
Community Governance Models
  Effective governance structure is vital to ensure success of community
  Internal communities and consortia perform well when they leverage traditional corporate and democratic governance models
  Open communities need to engage the community within the governance process
    – Follow less orthodox approaches using meritocratic and autocratic principles
Technical Best Practices
Data Representation
  Must be robust and standardized to encourage community usage and tools development
  Support for legacy data formats and ability to translate data forward to support new technology and standards
Human & Automated Curation
  Balancing will improve data quality
  Automated curation should always defer to, and never override, human curation edits
    – Automate validating data deposition and entry
    – Target community at focused curation tasks
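One way to read the "defer to, and never override" rule is as a merge policy: automated suggestions fill in only the fields that no human curator has touched. The Python sketch below illustrates that policy; the function name and the record fields are hypothetical.

```python
def merge_edits(record, automated, human):
    """Apply automated suggestions only to fields no human curator has edited."""
    merged = dict(record)
    for field_name, value in automated.items():
        if field_name not in human:
            merged[field_name] = value
    # Human edits are applied last so they always win.
    merged.update(human)
    return merged


# Hypothetical sample data.
record = {"name": "NY Times", "city": "new york"}
automated = {"name": "New York Times Company", "city": "New York"}
human = {"name": "The New York Times"}
result = merge_edits(record, automated, human)
```

In this run the automated fix to `city` is accepted, while the automated value for `name` is discarded in favour of the human edit.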
Technical Best Practices
Track Provenance
  All curation activities should be recorded and maintained as part of data provenance effort
    – Especially where human curators are involved
  Users can have different perspectives of provenance
    – A scientist may need to evaluate the fine-grained experiment description behind the data
    – For a business analyst the ‘brand’ of data provider can be sufficient for determining quality
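Recording every curation activity can be as simple as an append-only log kept alongside the data. The Python sketch below assumes a small in-memory store; the class names, fields, and sample agents are hypothetical and only illustrate the idea of logging who changed what, and when.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class ProvenanceEvent:
    record_id: str
    field_name: str
    old_value: str
    new_value: str
    agent: str        # who made the change
    agent_type: str   # "human" or "automated"
    timestamp: str    # ISO 8601, UTC


class CuratedStore:
    def __init__(self):
        self.records = {}
        self.log = []  # append-only history of curation activity

    def edit(self, record_id, field_name, new_value, agent, agent_type):
        """Apply an edit and record it in the provenance log."""
        record = self.records.setdefault(record_id, {})
        event = ProvenanceEvent(
            record_id, field_name, record.get(field_name, ""), new_value,
            agent, agent_type, datetime.now(timezone.utc).isoformat(),
        )
        record[field_name] = new_value
        self.log.append(event)
        return event

    def history(self, record_id):
        """Return every logged event for one record, oldest first."""
        return [e for e in self.log if e.record_id == record_id]


store = CuratedStore()
store.edit("a1", "tag", "Banks", agent="tagger-v1", agent_type="automated")
store.edit("a1", "tag", "Banking", agent="j.doe", agent_type="human")
```

A fine-grained view (the full `history`) and a coarse one (just the `agent` of the latest event) can both be derived from the same log, matching the different provenance perspectives noted above.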
Conclusions
Data curation can ensure the quality of data and its fitness for use
Pre-competitive data can be shared without conferring a commercial advantage
Pre-competitive data communities
  Common curation tasks carried out once in public domain
  Reduces cost, increases quantity and quality
Acknowledgements
Collaborators
  Andre Freitas & Seán O'Riain
Insight from Thought Leaders
  Evan Sandhaus (Semantic Technologist), Rob Larson (Vice President Product Development and Management), and Gregg Fenton (Director Emerging Platforms) from the New York Times
  Krista Thomas (Vice President, Marketing & Communications), Tom Tague (OpenCalais initiative Lead) from Thomson Reuters
  Antony Williams (VP of Strategic Development) from ChemSpider
  Helen Berman (Director), John Westbrook (Product Development) from the Protein Data Bank
  Nick Lynch (Architect with AstraZeneca) from the Pistoia Alliance.
The work presented has been funded by Science Foundation Ireland under Grant No. SFI/08/CE/I1380 (Lion-2).
Further Information
The Role of Community-Driven Data Curation for Enterprises
Edward Curry, Andre Freitas, & Seán O'Riain
In David Wood (ed.), Linking Enterprise Data, Springer, 2010.
Available Free at:
http://3roundstones.com/led_book/led-curry-et-al.html