Crowdsourcing Approaches to Big Data Curation - Rio Big Data Meetup


Post on 23-Jul-2015





  • Crowdsourcing Approaches to Big Data Curation

    Edward Curry, Insight Centre for Data Analytics,

    University College Dublin

  • Take Home

    Algorithms Humans Better Data Data

  • Talk Overview

    Part I: Motivation
    Part II: Data Quality and Data Curation
    Part III: Crowdsourcing
    Part IV: Case Studies on Crowdsourced Data Curation
    Part V: Setting up a Crowdsourced Data Curation Process
    Part VI: Linked Open Data Example
    Part VII: Future Research Challenges

  • PART I

  • BIG Big Data Public Private Forum


    Overall objective

    Bringing the necessary stakeholders into a self-sustainable, industry-led initiative that will greatly contribute to enhancing EU competitiveness by taking full advantage of Big Data technologies.

    Work at technical, business and policy levels, shaping the future through the positioning of IIM and Big Data specifically in Horizon 2020.


  • BIG Big Data Public Private Forum


    Industry Driven Sectorial Forums
      - Health
      - Public Sector
      - Finance & Insurance
      - Telco, Media & Entertainment
      - Manufacturing, Retail, Energy,

    Needs / Offerings

    Technical Working Groups along the Value Chain
      - Data Acquisition: structured data, unstructured data, event processing, sensor networks, protocols, real-time, data streams, multimodality
      - Data Analysis: stream mining, semantic analysis, machine learning, information extraction, Linked Data, data discovery, "whole world" semantics, ecosystems, community data analysis, cross-sectorial data analysis
      - Data Curation: data quality, trust / provenance, annotation, data validation, human-data interaction, top-down/bottom-up, community / crowd, human computation, curation at scale, incentivisation, automation, interoperability
      - Data Storage: in-memory DBs, NoSQL DBs, NewSQL DBs, cloud storage, query interfaces, scalability and performance, data models, consistency, availability, partition-tolerance, security and privacy, standardization
      - Data Usage: decision support, prediction, in-use analytics, simulation, exploration, visualisation, modeling, control, domain-specific usage

  • BIG Big Data Public Private Forum


  • BIG Big Data Public Private Forum


    The Data Landscape

    Key Trends
      - Lower usability barrier for data tools
      - Blended human and algorithmic data processing for coping with data quality
      - Leveraging large communities (crowds)
      - Need for semantic, standardized data representation
      - Significant increase in use of new data models (i.e. graph) for expressivity and flexibility

    Observations
      - Much of (Big Data) technology is evolving in an evolutionary way
      - But business process change must be revolutionary
      - Data variety and verifiability are key opportunities
      - Long tail of data variety is a major shift in the data landscape

    Biggest Blockers
      - Lack of business-driven Big Data strategies
      - Need for format and data storage technology standards
      - Data exchange between companies, institutions, individuals, etc.
      - Regulations & markets for data access
      - Human resources: lack of skilled data scientists

    Technical White Papers available on:

  • The Internet of Everything: Connecting the Unconnected

  • Earth Science Systems of Systems

  • Citizen Sensors

    "humans as citizens on the ubiquitous Web, acting as sensors and sharing their observations and views"

    Sheth, A. (2009). Citizen sensing, social signals, and enriching human experience. Internet Computing, IEEE, 13(4), 87-92.

    Air Pollution

  • Citizens as Sensors


  • The Problems with Data

    Knowledge workers need:
      - Access to the right data
      - Confidence in that data

    Flawed data affects 25% of critical data in the world's top companies

    Data quality's role in the recent financial crisis:
      - "Assets are defined differently in different programs"
      - "Numbers did not always add up"
      - "Departments do not trust each other's figures"
      - "Figures not worth the pixels they were made of"

  • What is Data Quality?

    Desirable characteristics for an information resource, described as a series of quality dimensions:
      - Discoverability & Accessibility: storing and classifying in an appropriate and consistent manner
      - Accuracy: correctly represents the real-world values it models
      - Consistency: created and maintained using standardized definitions, calculations, terms, and identifiers
      - Provenance & Reputation: track source & determine reputation
        - Includes the objectivity of the source/producer
        - Is the information unbiased, unprejudiced, and impartial?
        - Or does it come from a reputable but partisan source?

    Wang, R. and D. Strong, Beyond Accuracy: What Data Quality Means to Data Consumers. Journal of Management Information Systems, 1996. 12(4): p. 5-33.

  • Data Quality


    [Figure: product records from Source A and Source B, e.g. "APNR iPod Nano Red 150", "APNS iPod Nano Silver 160" and "iPod Nano IPN890 150". Questions raised: Schema difference? Value conflicts? Entity duplication? Roles shown: Data Developer, Data Steward, Business Users. Axis: Technical / Domain.]


  • What is Data Curation?

    - Digital Curation
      - Selection, preservation, maintenance, collection, and archiving of digital assets
    - Data Curation
      - Active management of data over its life-cycle
    - Data Curators
      - Ensure data is trustworthy, discoverable, accessible, reusable, and fit for use
      - "Museum cataloguers of the Internet age"

  • Related Activities

    - Data Governance / Master Data Management
      - Convergence of data quality, data management, business process management, and risk management
      - Part of the overall data governance strategy for an organization
    - Data Curator = Data Steward

    n DO

  • Types of Data Curation

    - Multiple approaches to curate data, no single correct way
      - Who?
        - Individual Curators
        - Curation Departments
        - Community-based Curation
      - How?
        - Manual Curation
        - (Semi-)Automated
        - Sheer Curation

  • Types of Data Curation Who?

    n Individual Data Curators Suitable for infrequently changing small quantity

    of data (

  • Types of Data Curation Who?

    - Curation Departments
      - Curation experts working with subject matter experts to curate data within a formal process
      - Can deal with large curation effort (000s of records)
    - Limitations
      - Scalability: can struggle with large quantities of dynamic data (>million records)
      - Availability: post-hoc nature creates delay in curated data availability

  • Types of Data Curation - Who?

    - Community-Based Data Curation
      - Decentralized approach to data curation
      - Crowd-sourcing the curation process
      - Leverages community of users to curate data
      - Wisdom of the community (crowd)
      - Can scale to millions of records

  • Types of Data Curation How?

    - Manual Curation
      - Curators directly manipulate data
      - Can tie users up with low-value-add activities
    - (Semi-)Automated Curation
      - Algorithms can (semi-)automate curation activities such as data cleansing, record de-duplication and classification
      - Can be supervised or approved by human curators
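    As a rough illustration of semi-automated curation with human approval, the sketch below scores candidate duplicate records and splits them into automatic merges and review tasks for a curator. The thresholds, record IDs and product names are hypothetical, and string similarity from the standard library stands in for whatever matching algorithm a real system would use.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical thresholds, chosen only for illustration.
AUTO_MERGE = 0.95    # at or above this, the algorithm merges without asking
NEEDS_REVIEW = 0.70  # between the two thresholds, a human curator decides


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def triage_duplicates(records):
    """Split candidate duplicate pairs into auto-merges and human review tasks."""
    auto, review = [], []
    for (id_a, name_a), (id_b, name_b) in combinations(records.items(), 2):
        score = similarity(name_a, name_b)
        if score >= AUTO_MERGE:
            auto.append((id_a, id_b, score))
        elif score >= NEEDS_REVIEW:
            review.append((id_a, id_b, score))
    return auto, review


# Hypothetical product records keyed by made-up IDs.
products = {
    "A1": "iPod Nano Red",
    "B7": "ipod nano red",
    "B9": "iPod Nano Silver",
    "C3": "Kindle Paperwhite",
}
auto, review = triage_duplicates(products)
print("merge automatically:", [pair[:2] for pair in auto])
print("send to a curator:", [pair[:2] for pair in review])
```

    Only the uncertain middle band reaches a person, which is one way to keep curators away from low-value work while still letting them approve the algorithm's borderline decisions.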

  • Types of Data Curation How?

    - Sheer curation, or Curation at Source
      - Curation activities integrated in normal workflow of those creating and managing data
      - Can be as simple as vetting or rating the results of a curation algorithm
      - Results can be available immediately
    - Blended Approaches: Best of Both
      - Sheer curation + post hoc curation department
      - Allows immediate access to curated data
      - Ensures quality control with expert curation

  • Data Quality

    [Figure: Data Curation Example. Steps shown: Profile Sources, Define Mappings, Cleanse, Enrich, De-duplicate, Define Rules. Output: Curated Data. Roles shown: Data Developer, Data Curator, Data Governance, Business Users. Data: Product Data.]

  • Data Curation

    - Pros
      - Can create a single version of truth
      - Standardized information creation and management
      - Improves data quality
    - Cons
      - Significant upfront costs and efforts
      - Participation limited to few (mostly) technical experts
      - Difficult to scale for large data sources
        - Extended Enterprise e.g. partner, data vendors
      - Small % of data under management (i.e. CRM, Product, )

  • The New York Times

    100 Years of Expert Data Curation

  • The New York Times

    - Largest metropolitan and third largest newspaper in the United States
    - Most popular newspaper website in US
    - 100 year old curated repository defining its participation in the emerging Web of Data

  • The New York Times

    - Data curation dates back to 1913
      - Publisher/owner Adolph S. Ochs decided to provide a set of additions to the newspaper
    - New York Times Index
      - Organized catalog of article titles and summaries
        - Containing issue, date and column of article
        - Categorized by subject and names
        - Introduced on a quarterly, then annual, basis
    - Transitory content of newspaper became important source of searchable historical data
      - Often used to settle historical debates

  • The New York Times

    - Index Department was created in 1913
      - Curation and cataloguing of NYT resources
      - Since 1851 NYT had a low-quality index for internal use
    - Developed a comprehensive catalog using a controlled vocabulary
      - Covering subjects, personal names, organizations, geographic locations and titles of creative works (books, movies, etc.), linked to articles and their summaries
    - Current Index Dept. has ~15 people

  • The New York Times

    - Challenges with consistently and accurately classifying news articles over time
      - Keywords expressing subjects may show some variance due to cultural or legal constraints
      - Identities of some entities, such as organizations and places, changed over time
    - Controlled vocabulary grew to hundreds of thousands of categories
      - Adding complexity to classification process

  • The New York Times

    - Increased importance of Web drove need to improve categorization of online content
    - Curation carried out by Index Department
      - Library-time (days to weeks)
      - Print edition can handle next-day index
    - Not suitable for real-time online publishing, which needed a same-day index

  • The New York Times

    - Introduced a two-stage curation process
      - Editorial staff perform best-effort semi-automated sheer curation at the point of online publication
        - Several hundred journalists
      - Index Department follows up with long-term accurate classification and archiving
    - Benefits:
      - Non-expert journalist curators provide instant accessibility to online users
      - Index Department provides long-term high-quality curation in a "trust but verify" approach

  • NYT Curation Workflow

    1. Curation starts with the article getting out of the newsroom
    2. Member of editorial staff submits the article to a web-based, rule-based information extraction system (SAS Teragram)
    3. Teragram uses linguistic extraction rules based on a subset of the Index Dept's controlled vocabulary
    4. Teragram suggests tags based on the Index vocabulary that can potentially describe the content of the article
    5. Editorial staff member selects the terms that best describe the contents and inserts new tags if necessary
    6. Reviewed by the taxonomy managers, with feedback to editorial staff on the classification process
    7. Article is published online at
    8. At a later stage the article receives second-level curation by the Index Dept.: additional Index tags and a summary
    9. Article is submitted to the NYT Index
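    The first stage of this workflow (machine-suggested tags, then an editor's selection) can be sketched in a few lines. This is not how SAS Teragram works internally; it is a minimal stand-in that matches trigger phrases from a tiny, hypothetical controlled vocabulary, and every tag, phrase and function name below is an assumption made for illustration.

```python
import re

# Hypothetical slice of a controlled vocabulary: tag -> trigger phrases.
VOCABULARY = {
    "Elections": ["election", "ballot", "vote"],
    "Real Estate": ["housing market", "mortgage"],
    "Movies": ["film", "box office"],
}


def suggest_tags(text, vocabulary=VOCABULARY):
    """Return the vocabulary tags whose trigger phrases occur in the text."""
    lowered = text.lower()
    suggestions = []
    for tag, phrases in vocabulary.items():
        patterns = (r"\b" + re.escape(phrase) + r"\b" for phrase in phrases)
        if any(re.search(pattern, lowered) for pattern in patterns):
            suggestions.append(tag)
    return suggestions


def editor_review(suggested, accepted, added=()):
    """The editor keeps a subset of the suggestions and may insert new tags."""
    kept = [tag for tag in suggested if tag in accepted]
    return kept + [tag for tag in added if tag not in kept]


article = "Early ballot counts suggest the election will hinge on the housing market."
suggested = suggest_tags(article)
final_tags = editor_review(suggested, accepted={"Elections"}, added=["Economy"])
print(suggested)   # machine suggestions shown to the editor
print(final_tags)  # tags published with the article
```

    The second-stage pass by the Index Department would then revisit `final_tags` later, outside the publishing deadline.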

  • The New York Times

    n Early adopter of Linked Open Data (June 09)

  • The New York Times

    - Linked Open Data @
      - Subset of 10,000 tags from index vocabulary
      - Dataset of people, organizations & locations
      - Complemented by search services to consume data about articles, movies, best sellers, Congress votes, real estate,
    - Benefits
      - Improves traffic through third-party data usage
      - Lowers development cost of new applications for different verticals inside the website, e.g. movies, travel, sports, books


  • Introduction to Crowdsourcing

    - Coordinating a crowd (a large group of workers) to do micro-work (small tasks) that solves problems (that computers or a single user can't)
    - A collection of mechanisms and associated methodologies for scaling and directing crowd activities to achieve goals
    - Related Areas
      - Collective Intelligence
      - Social Computing
      - Human Computation
      - Data Mining

    A. J. Quinn and B. B. Bederson, Human computation: a survey and taxonomy of a growing field, in Proceedings of the 2011 Annual Conference on Human Factors in Computing Systems, 2011, pp. 1403-1412.

  • When Computers Were Human

    - Maskelyne 1760
      - Used human computers to create an almanac of moon positions
        - Used for shipping/navigation
      - Quality assurance
        - Do calculations twice
        - Compare to third verifier

    D. A. Grier, When Computers Were Human, vol. 13. Princeton University Press, 2005.

  • When Computers Were Human

  • Human vs Machine Affordances

    Human
      - Visual perception
      - Visuospatial thinking
      - Audiolinguistic ability
      - Sociocultural
      - Creativity
      - Domain knowledge

    Machine
      - Large-scale data
      - Collecting and storing large amounts of data
      - Efficient data movement
      - Bias-free analysis

    R. J. Crouser and R. Chang, An affordance-based framework for human computation and human-computer collaboration, IEEE Trans. Vis. Comput. Graph., vol. 18, pp. 2859-2868, 2012.

  • When to Crowdsource a Task?

    n Computers cannot do the task

    n Single person cannot do the task

    n Work can be split into smaller tasks

  • Platforms and Marketplaces

  • Types of Crowds

    - Internal corporate communities
      - Taps potential of internal workforce
      - Curate competitive enterprise data that will remain internal to the company
        - May not always be the case, e.g. product technical support and marketing data
    - External communities
      - Public crowd-sourcing marketplaces
      - Pre-competitive communities

  • Generic Architecture


    Platform/Marketplace (Publish Task, Task Management)






  • Mturk Workflow


  • Crowdsourced Data Curation


    [Figure: Crowdsourced Data Curation. Inputs: Web of Data, Textual Content. Clean Data is produced by combining two kinds of processing.]

    DQ Rules & Algorithms
      - Entity Linking
      - Data Fusion
      - Relation Extraction

    Human Computation
      - Relevance Judgment
      - Data Verification
      - Disambiguation

    Internal Community (Programmers, Managers)
      - Domain Knowledge
      - High Quality Responses
      - Trustable

    External Crowd
      - High Availability
      - Large Scale
      - Expertise Variety

  • Examples of CDM Tasks

    - Understanding customer sentiment for the launch of a new product around the world
    - Implemented 24/7 sentiment analysis system with workers from around the world
    - 90% accuracy in 95% on content

    - Categorize millions of products in eBay's catalog with accurate and complete attributes
    - Combine the crowd with machine learning to create an affordable and flexible catalog quality system

  • Examples of CDM Tasks

    - Natural Language Processing
      - Dialect Identification, Spelling Correction, Machine Translation, Word Similarity
    - Computer Vision
      - Image Similarity, Image Annotation/Analysis
    - Classification
      - Data attributes, Improving taxonomy, search results
    - Verification
      - Entity consolidation, de-duplicate, cross-check, validate
    - Enrichment
      - Judgments, annotation

  • Wikipedia

    - Collaboratively built by a large community
      - More than 19,000,000 articles, 270+ languages, 3,200,000+ articles in English
      - More than 157,000 active contributors
    - Accuracy and stylistic formality are equivalent to expert-based resources
      - i.e. Columbia and Britannica encyclopedias
    - MediaWiki
      - Software behind Wikipedia
      - Widely used inside organizations
      - Intellipedia: 16 U.S. intelligence agencies
      - Wiki Proteins: curated protein data for knowledge discovery

  • Wikipedia Social Organization

    - Any user can edit its contents
      - Without prior registration
    - Does not lead to a chaotic scenario
      - In practice a highly scalable approach for high quality content creation on the Web
    - Relies on a simple but highly effective way to coordinate its curation process
    - Curation is an activity of Wikipedia admins
      - Responsibility for information quality standards

  • Wikipedia Social Organization

  • DBPedia Knowledge base

    - DBpedia provides direct access to data
      - Indirectly uses the wiki as a data curation platform
      - Inherits massive volume of curated Wikipedia data
        - 3.4 million entities and 1 billion RDF triples
      - Comprehensive data infrastructure
        - Concept URIs
        - Definitions
        - Basic types

  • Wikipedia - DBPedia

  • n Collaborative knowledge base maintained by community of web users

    n Users create entity types and their meta-data according to guidelines

    n Requires administrative approvals for schema changes by end users

  • Audio Tagging - Tag a Tune

  • Image Tagging - Peekaboom

  • Protein Folding -

  • ReCaptcha

    - OCR
      - ~1% error rate
      - 20%-30% for 18th and 19th century books
    - 40 million ReCAPTCHAs every day (2008)
      - Fixing 40,000 books a day


  • Core Design Questions

    - What: Goal
    - Why: Incentives
    - Who: Workers
    - How: Process

    Malone, T. W., Laubacher, R., & Dellarocas, C. N. Harnessing crowds: Mapping the genome of collective intelligence. MIT Sloan Research Paper 4732-09, (2009).

  • 1) Who is doing it? (Workers)

    - Hierarchy (Assignment)
      - Someone in authority assigns a particular person or group of people to perform the task
      - Within the Enterprise (i.e. individuals, specialised departments)
      - Within a structured community (i.e. pre-competitive community)
    - Crowd (Choice)
      - Anyone in a large group who chooses to do so
      - Internal or External Crowds

  • 2) Why are they doing it? (Incentives)

    - Motivation
      - Money ($$)
      - Glory (reputation/prestige)
      - Love (altruism, socialize, enjoyment)
      - Unintended by-product (e.g. re-Captcha, captured in workflow)
      - Self-serving resources (e.g. Wikipedia, product/customer data)
      - Part of their job description (e.g. data curation as part of role)
    - Determine pay and time for each task
      - Marketplace: delicate balance
        - Money does not improve quality but can increase participation
      - Internal Hierarchy: engineering opportunities for recognition
        - Performance review, prizes for top contributors, badges, leaderboards, etc.

  • Effect of Payment on Quality

    - Cost does not affect quality
    - Similar results for bigger tasks [Ariely et al, 2009]

    Mason, W. A., & Watts, D. J. (2009). Financial incentives and the performance of crowds. Proceedings of the Human Computation Workshop. Paris: ACM, June 28, 2009.

    [Panos Ipeirotis. WWW2011 tutorial]

  • 3) What is being done? (Goal)

    3.1 Identify the Data
      - Newly created data and/or legacy data?
      - How is new data created?
        - Do users create the data, or is it imported from an external source?
      - How frequently is new data created/updated?
      - What quantity of data is created?
      - How much legacy data exists?
      - Is it stored within a single source, or scattered across multiple sources?

  • 3) What is being done? (Goal)

    3.2 Identify the Tasks
      - Creation Tasks
        - Create / Generate
        - Find
        - Improve / Edit / Fix
      - Decision (Vote) Tasks
        - Accept / Reject
        - Thumbs Up / Thumbs Down
        - Vote for Best

  • 4) How is it being done? (How)

    - Identify the workflow
      - Tasks integrated in normal workflow of those creating and managing data
      - Simple as vetting or rating results of algorithm
    - Identify the platform
      - Internal/Community collaboration platforms
      - Public crowdsourcing platform
        - Consider the availability of appropriate workers (i.e. experts)
    - Identify the Algorithm
      - Data quality
      - Image recognition
      - etc.

  • Pull Routing

    - Workers seek tasks and assign them to themselves
      - Search and discovery of tasks supported by the platform
      - Task recommendation
      - Peer routing


    Tasks Select



    Search & Browse Interface


  • Push Routing

    - System assigns tasks to workers based on:
      - Past performance
      - Expertise
      - Cost
      - Latency








    Task Interface
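    One way to picture push routing is as a scoring function over the four criteria listed on this slide. The sketch below is only an assumption about how they might be combined: the weights, the `Worker` fields and the normalisation limits are all made up for illustration, and a real platform would tune or learn them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Worker:
    name: str
    accuracy: float    # past performance: fraction of previously correct answers
    skills: frozenset  # expertise
    cost: float        # price asked per task
    latency: float     # typical response time in minutes


def score(worker, required_skills, max_cost=1.0, max_latency=60.0):
    """Combine the four routing criteria into one number (weights are hypothetical)."""
    if required_skills:
        expertise = len(worker.skills & required_skills) / len(required_skills)
    else:
        expertise = 1.0
    cheapness = 1.0 - min(worker.cost / max_cost, 1.0)
    speed = 1.0 - min(worker.latency / max_latency, 1.0)
    return 0.4 * worker.accuracy + 0.3 * expertise + 0.15 * cheapness + 0.15 * speed


def push_route(required_skills, workers):
    """Return the worker the system would assign the task to."""
    return max(workers, key=lambda w: score(w, required_skills))


workers = [
    Worker("ana", accuracy=0.95, skills=frozenset({"rdf", "geo"}), cost=0.50, latency=30),
    Worker("ben", accuracy=0.80, skills=frozenset({"geo"}), cost=0.10, latency=5),
    Worker("cho", accuracy=0.90, skills=frozenset({"nlp"}), cost=0.20, latency=10),
]
chosen = push_route(frozenset({"rdf", "geo"}), workers)
print(chosen.name)
```

    Changing the weights shifts the trade-off: raising the cost and latency weights would favour the cheaper, faster worker over the more accurate expert.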



  • Managing Task Quality Assurance

    - Redundancy: Quorum Votes
      - Replicate the task (e.g. 3 times)
      - Use majority voting to determine right value (% agreement)
      - Weighted majority vote
    - Gold Data / Honey Pots
      - Inject trap questions to test quality
      - Worker fatigue check (habit of saying no all the time)
    - Estimation of Worker Quality
      - Redundancy plus gold data
    - Qualification Test
      - Use test tasks to determine a user's ability for such tasks
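    The redundancy and gold-data ideas above can be sketched together: a plain majority vote, a worker-accuracy estimate from trap questions, and a vote weighted by that estimate. The task IDs, worker names and the neutral 0.5 default for workers who saw no gold questions are assumptions for illustration only.

```python
from collections import Counter, defaultdict


def majority_vote(answers):
    """Return (winning answer, agreement fraction) for one replicated task.

    Ties go to the answer that was seen first.
    """
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / len(answers)


def worker_accuracy(gold, responses):
    """Estimate each worker's accuracy from gold (trap) questions.

    gold: {task_id: correct answer}
    responses: {worker: {task_id: answer}}
    """
    accuracy = {}
    for worker, given in responses.items():
        graded = [given[t] == truth for t, truth in gold.items() if t in given]
        # Assumption: a worker with no gold answers gets a neutral 0.5.
        accuracy[worker] = sum(graded) / len(graded) if graded else 0.5
    return accuracy


def weighted_vote(task_id, responses, accuracy):
    """Majority vote where each answer counts with its worker's estimated accuracy."""
    totals = defaultdict(float)
    for worker, given in responses.items():
        if task_id in given:
            totals[given[task_id]] += accuracy[worker]
    return max(totals, key=totals.get)


gold = {"g1": "yes", "g2": "no"}
responses = {
    "w1": {"g1": "yes", "g2": "no", "t1": "same"},
    "w2": {"g1": "no", "g2": "no", "t1": "different"},
    "w3": {"g1": "no", "g2": "yes", "t1": "different"},
}
accuracy = worker_accuracy(gold, responses)
plain = majority_vote([r["t1"] for r in responses.values()])
weighted = weighted_vote("t1", responses, accuracy)
print(accuracy, plain, weighted)
```

    In this toy data the plain vote and the weighted vote disagree, because the two workers who agree with each other did poorly on the gold questions.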



  • Linked Open Data (LOD)

    - Expose and interlink datasets on the Web
    - Using URIs to identify things in your data
    - Using a graph representation (RDF) to describe URIs
    - Vision: The Web as a huge graph database

    Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch.

  • Linked Data Example

    Multiple Identifiers

    Identity resolution links

  • Identity Resolution in LOD



    owl:sameAs

    Consumer

    Multiple identifiers for the Galway entity in the Linked Open Data Cloud

    Different sources of identity resolution links

    Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch.
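    A consumer that receives `owl:sameAs` links from several sources typically needs to group the identifiers that refer to the same entity. The sketch below does this with a union-find structure over plain Python tuples. Apart from the DBpedia-style URI, the identifiers are placeholders under `example.org`, and no RDF library is assumed.

```python
def same_as_clusters(links):
    """Group URIs connected by owl:sameAs links into identity clusters.

    links: iterable of (uri_a, uri_b) pairs, each asserting uri_a owl:sameAs uri_b.
    Returns a list of sets, one per resolved entity.
    """
    parent = {}

    def find(uri):
        parent.setdefault(uri, uri)
        while parent[uri] != uri:
            parent[uri] = parent[parent[uri]]  # path halving keeps trees shallow
            uri = parent[uri]
        return uri

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for uri in list(parent):
        clusters.setdefault(find(uri), set()).add(uri)
    return list(clusters.values())


# Placeholder identifiers; only the DBpedia-style URI follows a real naming pattern.
links = [
    ("http://dbpedia.org/resource/Galway", "http://example.org/geo/galway"),
    ("http://example.org/geo/galway", "http://example.org/stats/city/42"),
    ("http://example.org/people/1", "http://example.org/staff/curry"),
]
clusters = same_as_clusters(links)
for cluster in clusters:
    print(sorted(cluster))
```

    Because `owl:sameAs` is treated as transitive here, one wrong link merges two entities, which is why candidate links are worth routing to people for verification.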

  • LOD Application Architecture

    Utility Module

    Feedback Module

    Consolidation Module


    Matching Dependencies

    Ranked Feedback Tasks

    Data Improvement

    Candidate Links

    Tom Heath and Christian Bizer (2011). Linked Data: Evolving the Web into a Global Data Space (1st edition), 1-136. Morgan & Claypool.



  • Future Research Directions

    - Incentives and social engagement
      - Better recognition of the data curation role
      - Understanding of social engagement mechanisms
    - Economic Models
      - Pre-competitive and public-private partnerships
    - Curation at Scale
      - Evolution of human computation and crowdsourcing
      - Instrumenting popular apps for data curation
      - General-purpose data curation pipelines
      - Human-data interaction

  • Future Research Directions

    - Spatial Crowdsourcing
      - Matching tasks with workers at right time and location
      - Balancing workload among workers
      - Tasks at remote locations
      - Chaining tasks in same vicinity
      - Preserving worker privacy
    - Interoperability
      - Finding semantic similarity of tasks across systems
      - Defining and measuring worker capability across heterogeneous systems
      - Enabling routing middleware for multiple systems
      - Compatibility of reputation systems
      - Defining standards for task exchange

  • Heterogeneous Crowds

    n Multiple requesters, tasks, workers, platform


    Collaborative Data Curation

    Tasks Workers

    Cyber Physical Social System


  • SLUA Ontology





    User Task

    offers earns

    includes performs

    requires possesses

    Location Skill Knowledge Ability Availability

    Reputation Money Fun Altruism Learning



    U. ul Hassan, S. O'Riain, E. Curry, SLUA: Towards Semantic Linking of Users with Actions in Crowdsourcing, in International Workshop on Crowdsourcing the Semantic Web, 2013.

  • Future Research Directions

    - Task Routing
      - Optimizing task completion, quality, and latency
      - Inferring worker preferences, skills, and knowledge
      - Balancing exploration-exploitation trade-off between inference and optimization
      - Cold-start problem for new workers or tasks
      - Ensuring worker satisfaction via load balancing & rewards
    - Human-Computer Interaction
      - Reducing search friction through good browsing interfaces
      - Presenting requisite information, nothing more
      - Choosing the level of task granularity for complex tasks
      - Ensuring worker engagement
      - Designing games with a purpose to crowdsource with fun
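    The exploration-exploitation trade-off named under Task Routing can be made concrete with an epsilon-greedy router: most tasks go to the worker with the best observed accuracy, while a small fraction go to a random worker so that estimates keep improving. This is a generic bandit sketch, not a method from the talk; the class name, the epsilon value and the rule of trying unseen workers first (a crude answer to cold start) are assumptions.

```python
import random


class EpsilonGreedyRouter:
    """Route tasks to workers while learning their accuracy from feedback."""

    def __init__(self, workers, epsilon=0.1, seed=None):
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.attempts = {w: 0 for w in workers}
        self.successes = {w: 0 for w in workers}

    def estimate(self, worker):
        """Observed fraction of correct answers (0.0 before any feedback)."""
        n = self.attempts[worker]
        return self.successes[worker] / n if n else 0.0

    def choose(self):
        workers = list(self.attempts)
        untried = [w for w in workers if self.attempts[w] == 0]
        if untried:
            return untried[0]                    # cold start: try new workers once
        if self.rng.random() < self.epsilon:
            return self.rng.choice(workers)      # explore
        return max(workers, key=self.estimate)   # exploit

    def record(self, worker, correct):
        """Feed back whether the worker's answer turned out to be correct."""
        self.attempts[worker] += 1
        self.successes[worker] += int(correct)


# Deterministic walk-through with exploration switched off (epsilon=0).
router = EpsilonGreedyRouter(["a", "b"], epsilon=0.0)
first = router.choose()
router.record(first, correct=False)
second = router.choose()
router.record(second, correct=True)
third = router.choose()
print(first, second, third)
```

    With a non-zero epsilon the router occasionally re-tests workers with weak records, trading some short-term quality for better estimates.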

  • Summary

    Algorithms Humans Better Data Data

  • Selected References

    - Big Data & Data Quality
      - S. Lavalle, E. Lesser, R. Shockley, M. S. Hopkins, and N. Kruschwitz, Big Data, Analytics and the Path from Insights to Value, MIT Sloan Management Review, vol. 52, no. 2, pp. 21-32, 2011.
      - A. Haug and J. S. Arlbjørn, Barriers to master data quality, Journal of Enterprise Information Management, vol. 24, no. 3, pp. 288-303, 2011.
      - R. Silvola, O. Jaaskelainen, H. Kropsu-Vehkapera, and H. Haapasalo, Managing one master data: challenges and preconditions, Industrial Management & Data Systems, vol. 111, no. 1, pp. 146-162, 2011.
      - E. Curry, S. Hasan, and S. O'Riain, Enterprise Energy Management using a Linked Dataspace for Energy Intelligence, in Second IFIP Conference on Sustainable Internet and ICT for Sustainability, 2012.
      - D. Loshin, Master Data Management. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2008.
      - Pipino, L. L., Lee, Y. W., & Wang, R. Y. (2002). Data quality assessment. Communications of the ACM, 45(4), 211-2
      - Batini, C., Cappiello, C., Francalanci, C., & Maurino, A. (2009). Methodologies for data quality assessment and improvement. ACM Computing Surveys (CSUR), 41(3), 16.
      - B. Otto and A. Reichert, Organizing Master Data Management: Findings from an Expert Survey, in Proceedings of the 2010 ACM Symposium on Applied Computing - SAC '10, 2010, pp. 106-110.
      - Wang, R. and D. Strong, Beyond Accuracy: What Data Quality Means to Data Consumers. Journal of Management Information Systems, 1996. 12(4): p. 5-33.
      - Ul Hassan, U., O'Riain, S., and Curry, E. 2012. Leveraging Matching Dependencies for Guided User Feedback in Linked Data Applications, in 9th International Workshop on Information Integration on the Web (IIWeb2012), Scottsdale, Arizona: ACM.

  • Selected References

    - Collective Intelligence, Crowdsourcing & Human Computation
      - Malone, Thomas W., Robert Laubacher, and Chrysanthos Dellarocas. "Harnessing Crowds: Mapping the Genome of Collective Intelligence." (2009).
      - A. Doan, R. Ramakrishnan, and A. Y. Halevy, Crowdsourcing systems on the World-Wide Web, Communications of the ACM, vol. 54, no. 4, p. 86, Apr. 2011.
      - A. J. Quinn and B. B. Bederson, Human computation: a survey and taxonomy of a growing field, in Proceedings of the 2011 Annual Conference on Human Factors in Computing Systems, 2011, pp. 1403-1412.
      - Mason, W. A., & Watts, D. J. (2009). Financial incentives and the performance of crowds. Proceedings of the Human Computation Workshop. Paris: ACM, June 28, 2009.
      - E. Law and L. von Ahn, Human Computation, Synthesis Lectures on Artificial Intelligence and Machine Learning, vol. 5, no. 3, pp. 1-121, Jun. 2011.
      - M. J. Franklin, D. Kossmann, T. Kraska, S. Ramesh, and R. Xin, CrowdDB: Answering Queries with Crowdsourcing, in Proceedings of the 2011 international conference on Management of data - SIGMOD '11, 2011, p. 61.
      - P. Wichmann, A. Borek, R. Kern, P. Woodall, A. K. Parlikad, and G. Satzger, Exploring the Crowd as Enabler of Better Information Quality, in Proceedings of the 16th International Conference on Information Quality, 2011, pp. 302-312.
      - Winter A. Mason, Duncan J. Watts: Financial incentives and the "performance of crowds". SIGKDD Explorations (SIGKDD) 11(2):100-108 (2009)
      - Panos Ipeirotis. Managing Crowdsourced Human Computation, WWW2011 Tutorial
      - O. Alonso & M. Lease. Crowdsourcing 101: Putting the WSDM of Crowds to Work for You, WSDM Hong Kong 2011.
      - D. A. Grier, When Computers Were Human, vol. 13. Princeton University Press, 2005.
      - Ul Hassan, U., & Curry, E. (2013, October). A capability requirements approach for predicting worker performance in crowdsourcing. In Collaborative Computing: Networking, Applications and Worksharing (Collaboratecom), 2013 9th International Conference on (pp. 429-437). IEEE.

  • Selected References

    - Collaborative Data Management
      - E. Curry, A. Freitas, and S. O'Riain, The Role of Community-Driven Data Curation for Enterprises, in Linking Enterprise Data, D. Wood, Ed. Boston, MA: Springer US, 2010, pp. 25-47.
      - Ul Hassan, U., O'Riain, S., and Curry, E. 2012. Towards Expertise Modelling for Routing Data Cleaning Tasks within a Community of Knowledge Workers, in 17th International Conference on Information Quality (ICIQ 2012), Paris, France.
      - Ul Hassan, U., O'Riain, S., and Curry, E. 2013. Effects of Expertise Assessment on the Quality of Task Routing in Human Computation, in 2nd International Workshop on Social Media for Crowdsourcing and Human Computation, Paris, France.
      - Stonebraker, M., Bruckner, D., Ilyas, I. F., Beskales, G., Cherniack, M., Zdonik, S. B., ... & Xu, S. (2013). Data Curation at Scale: The Data Tamer System. In CIDR.
      - Parameswaran, A. G., Park, H., Garcia-Molina, H., Polyzotis, N., & Widom, J. (2012, October). Deco: declarative crowdsourcing. In Proceedings of the 21st ACM international conference on Information and knowledge management (pp. 1203-1212). ACM.
      - Parameswaran, A., Boyd, S., Garcia-Molina, H., Gupta, A., Polyzotis, N., & Widom, J. (2014). Optimal crowd-powered rating and filtering algorithms. Proceedings Very Large Data Bases (VLDB).
      - Marcus, A., Wu, E., Karger, D., Madden, S., & Miller, R. (2011). Human-powered sorts and joins. Proceedings of the VLDB Endowment, 5(1), 13-24.
      - Guo, S., Parameswaran, A., & Garcia-Molina, H. (2012, May). So who won?: dynamic max discovery with the crowd. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data (pp. 385-396). ACM.
      - Davidson, S. B., Khanna, S., Milo, T., & Roy, S. (2013, March). Using the crowd for top-k and group-by queries. In Proceedings of the 16th International Conference on Database Theory (pp. 225-236). ACM.
      - Chai, X., Vuong, B. Q., Doan, A., & Naughton, J. F. (2009, June). Efficiently incorporating user feedback into information extraction and integration programs. In Proceedings of the 2009 ACM SIGMOD International Conference on Management of data (pp. 87-100). ACM.

  • Selected References

    - Spatial Crowdsourcing
      - Kazemi, L., & Shahabi, C. (2012, November). Geocrowd: enabling query answering with spatial crowdsourcing. In Proceedings of the 20th International Conference on Advances in Geographic Information Systems (pp. 189-198). ACM.
      - Benouaret, K., Valliyur-Ramalingam, R., & Charoy, F. (2013). CrowdSC: Building Smart Cities with Large Scale Citizen Participation. IEEE Internet Computing, 1.
      - Musthag, M., & Ganesan, D. (2013, April). Labor dynamics in a mobile micro-task market. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (pp. 641-650). ACM.
      - Deng, Dingxiong, Cyrus Shahabi, and Ugur Demiryurek. "Maximizing the number of worker's self-selected tasks in spatial crowdsourcing." Proceedings of the 21st ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems. ACM, 2013.
      - To, H., Ghinita, G., & Shahabi, C. (2014). A Framework for Protecting Worker Location Privacy in Spatial Crowdsourcing. Proceedings of the VLDB Endowment, 7(10).
      - Goncalves, J., Ferreira, D., Hosio, S., Liu, Y., Rogstadius, J., Kukka, H., & Kostakos, V. (2013, September). Crowdsourcing on the spot: altruistic use of public displays, feasibility, performance, and behaviours. In Proceedings of the 2013 ACM international joint conference on Pervasive and ubiquitous computing (pp. 753-762). ACM.
      - Cardone, G., Foschini, L., Bellavista, P., Corradi, A., Borcea, C., Talasila, M., & Curtmola, R. (2013). Fostering participaction in smart cities: a geo-social crowdsensing platform. Communications Magazine, IEEE, 51(6).

  • Books

    - Surowiecki, J. (2005). The wisdom of crowds. Random House LLC.
    - Batini, C., & Scannapieco, M. (2006). Data quality: concepts, methodologies and techniques. Springer.
    - Michelucci, P. (2013). Handbook of human computation. Springer.
    - Law, E., & Ahn, L. V. (2011). Human computation. Synthesis Lectures on Artificial Intelligence and Machine Learning, 5(3), 1-121.
    - Heath, T., & Bizer, C. (2011). Linked data: Evolving the web into a global data space. Synthesis lectures on the semantic web: theory and technology, 1(1), 1-136.
    - Grier, D. A. (2013). When computers were human. Princeton University Press.
    - Easley, D., & Kleinberg, J. Networks, Crowds, and Markets. Cambridge
    - Sheth, A., & Thirunarayan, K. (2012). Semantics Empowered Web 3.0: Managing Enterprise, Social, Sensor, and Cloud-based Data and Services for Advanced Applications. Synthesis Lectures on Data Management, 4(6), 1-175.

  • Tutorials

    - Human Computation and Crowdsourcing
    - Human-Powered Data Management
    - Crowdsourcing Applications and Platforms: A Data Management Perspective
    - Human Computation: Core Research Questions and State of the Art
    - Crowdsourcing & Machine Learning
    - Data quality and data cleaning: an overview

  • Datasets

    - TREC Crowdsourcing Track
    - 2010 Crowdsourced Web Relevance Judgments Data
    - Statistical QUality Assurance Robustness Evaluation Data
    - Crowdsourcing at Scale 2013
    - USEWOD - Usage Analysis and the Web of Data
    - NAACL 2010 Workshop



  • Credits

    Special thanks to Umair ul Hassan for his assistance with the Tutorial