Crowdsourcing Approaches to Big Data Curation - Rio Big Data Meetup


Posted on 23-Jul-2015





<ul><li><p>Crowdsourcing Approaches to Big Data Curation </p><p>Edward Curry Insight Centre for Data Analytics, </p><p>University College Dublin </p></li><li><p>Take Home </p><p>Algorithms + Humans = Better Data </p></li><li><p>Talk Overview </p><p> Part I: Motivation Part II: Data Quality and Data Curation Part III: Crowdsourcing Part IV: Case Studies on Crowdsourced Data Curation </p><p> Part V: Setting up a Crowdsourced Data Curation Process </p><p> Part VI: Linked Open Data Example Part VII: Future Research Challenges </p></li><li><p>PART I </p></li><li><p>BIG Big Data Public Private Forum </p><p>THE BIG PROJECT </p><p>Overall objective </p><p>Bringing the necessary stakeholders into a self-sustainable, industry-led initiative that will contribute greatly to </p><p>enhancing EU competitiveness by taking full advantage of Big Data technologies. </p><p>Working at technical, business and policy levels, shaping the future through the positioning of IIM and Big Data </p><p>specifically in Horizon 2020. 
</p></li><li><p>SITUATING BIG DATA IN INDUSTRY </p><p> Health Public Sector Finance &amp; Insurance </p><p>Telco, Media &amp; Entertainment </p><p>Manufacturing, Retail, Energy, </p><p>Transport </p><p>Needs Offerings </p><p>Value Chain </p><p>Technical Working Groups </p><p>Industry Driven Sectorial Forums </p><p>Data Acquisition </p><p>Data Analysis </p><p>Data Curation </p><p>Data Storage </p><p>Data Usage </p><p> Structured data Unstructured data Event processing Sensor networks Protocols Real-time Data streams Multimodality </p><p> Stream mining Semantic analysis Machine learning Information extraction </p><p> Linked Data Data discovery Whole world semantics </p><p> Ecosystems Community data analysis </p><p> Cross-sectorial data analysis </p><p> Data Quality Trust / Provenance Annotation Data validation Human-Data Interaction </p><p> Top-down/Bottom-up Community / Crowd Human Computation Curation at scale Incentivisation Automation Interoperability </p><p> In-Memory DBs NoSQL DBs NewSQL DBs Cloud storage Query Interfaces Scalability and Performance </p><p> Data Models Consistency, Availability, Partition-tolerance </p><p> Security and Privacy Standardization </p><p> Decision support Prediction In-use analytics Simulation Exploration Visualisation Modeling Control Domain-specific usage </p></li><li><p>SUBJECT MATTER EXPERT INTERVIEWS </p></li><li><p>KEY INSIGHTS </p><p>Key Trends Lower usability barrier for data tools Blended human and algorithmic data processing for coping with </p><p>data quality Leveraging large communities (crowds) Need for semantic standardized data representation Significant increase in use of new data models (i.e. 
graph) </p><p>(expressivity and flexibility) </p><p> Much of (Big Data) technology is evolving in an evolutionary way </p><p> But business process change must be revolutionary </p><p> Data variety and verifiability are key opportunities </p><p> The long tail of data variety is a major shift in the data landscape </p><p>The Data Landscape Lack of business-driven Big </p><p>Data strategies Need for format and data </p><p>storage technology standards Data exchange between </p><p>companies, institutions, individuals, etc. </p><p> Regulations &amp; markets for data access </p><p> Human resources: lack of skilled data scientists </p><p>Biggest Blockers </p><p>Technical White Papers available on: </p></li><li><p>The Internet of Everything: Connecting the Unconnected </p></li><li><p>Earth Science Systems of Systems </p></li><li><p>Citizen Sensors </p><p>humans as citizens on the ubiquitous Web, acting as sensors and sharing their observations and views </p><p> Sheth, A. (2009). Citizen sensing, social signals, and enriching human experience. Internet Computing, IEEE, 13(4), 87-92. </p><p>Air Pollution </p></li><li><p>Citizens as Sensors </p></li><li><p>DATA QUALITY AND DATA CURATION PART II </p></li><li><p>The Problems with Data </p><p>Knowledge Workers need: Access to the right data Confidence in that data </p><p>Flawed data affects 25% of critical data in the world's top companies </p><p>Data quality's role in the recent financial crisis: Assets are defined differently </p><p>in different programs </p><p> Numbers did not always add up </p><p> Departments do not trust each other's figures </p><p> Figures not worth the pixels they were made of </p></li><li><p>What is Data Quality? 
</p><p>Desirable characteristics of an information resource, described as a series of quality dimensions: • Discoverability &amp; Accessibility: stored and classified in an </p><p>appropriate and consistent manner </p><p>• Accuracy: correctly represents the real-world values it models • Consistency: created and maintained using standardized </p><p>definitions, calculations, terms, and identifiers </p><p>• Provenance &amp; Reputation: track the source &amp; determine its reputation Includes the objectivity of the source/producer Is the information unbiased, unprejudiced, and impartial? Or does it come from a reputable but partisan source? </p><p>Wang, R. and D. Strong, Beyond Accuracy: What Data Quality Means to Data Consumers. Journal of Management Information Systems, 1996. 12(4): p. 5-33. </p></li><li><p>Data Quality </p><p>Example: two sources describe the same products differently. Source A: (APNR, iPod Nano, Red, 150), (APNS, iPod Nano, Silver, 160); Source B: (iPod Nano, IPN890, 150). Integrating them raises schema differences, value conflicts, and entity duplication questions, spanning the technical domain (data developer) and the business domain (data steward, business users). </p></li><li><p>What is Data Curation? 
</p><p>• Digital Curation Selection, preservation, maintenance, collection, </p><p>and archiving of digital assets </p><p>• Data Curation Active management of data over its life-cycle </p><p>• Data Curators Ensure data is trustworthy, discoverable, accessible, </p><p>reusable, and fit for use The museum cataloguers of the Internet age </p></li><li><p>Related Activities </p><p>• Data Governance / Master Data Management Convergence of data quality, data management, </p><p>business process management, and risk management </p><p> Part of the overall data governance strategy for an organization </p><p>• Data Curator = Data Steward </p></li><li><p>Types of Data Curation </p><p>• Multiple approaches to curating data, no single correct way Who? </p><p> Individual Curators Curation Departments Community-based Curation </p><p> How? Manual Curation (Semi-)Automated Sheer Curation </p></li><li><p>Types of Data Curation  Who? </p><p>• Individual Data Curators Suitable for an infrequently changing, small quantity </p><p>of data (</p></li><li><p>Types of Data Curation  Who? </p><p>• Curation Departments Curation experts working with subject matter </p><p>experts to curate data within a formal process Can deal with a large curation effort (000s of records) </p><p>• Limitations Scalability: can struggle with large quantities of </p><p>dynamic data (&gt;million records) </p><p> Availability: post-hoc nature creates a delay in curated data availability </p></li><li><p>Types of Data Curation  Who? </p><p>• Community-Based Data Curation Decentralized approach to data curation Crowd-sourcing the curation process </p><p> Leverages a community of users to curate data Wisdom of the community (crowd) Can scale to millions of records </p></li><li><p>Types of Data Curation  How? 
</p><p>• Manual Curation Curators directly manipulate data Can tie users up in low-value-add activities </p><p>• (Semi-)Automated Curation Algorithms can (semi-)automate curation </p><p>activities such as data cleansing, record de-duplication and classification </p><p> Can be supervised or approved by human curators </p></li><li><p>Types of Data Curation  How? </p><p>• Sheer curation, or Curation at Source Curation activities integrated into the normal workflow </p><p>of those creating and managing data </p><p> Can be as simple as vetting or rating the results of a curation algorithm </p><p> Results can be available immediately </p><p>• Blended Approaches: Best of Both Sheer curation + a post-hoc curation department Allows immediate access to curated data Ensures quality control with expert curation </p></li><li><p>Data Quality </p><p>Data Curation Example </p><p>Profile Sources </p><p>Define Mappings </p><p>Cleanse Enrich </p><p>De-duplicate Define Rules </p><p>Curated Data </p><p>Data Developer </p><p>Data Curator </p><p>Data Governance </p><p>Business Users </p><p>Applications </p><p>Product Data </p></li><li><p>Data Curation </p><p>• Pros Can create a single version of truth Standardized information creation and management Improves data quality </p><p>• Cons Significant upfront costs and effort Participation limited to a few (mostly) technical experts Difficult to scale to large data sources </p><p> Extended Enterprise e.g. partners, data vendors Small % of data under management (i.e. CRM, Product, ) </p></li><li><p>The New York Times </p><p>100 Years of Expert Data Curation </p></li><li><p>The New York Times </p><p>• Largest metropolitan and third-largest newspaper in the United States </p><p>• Most popular newspaper </p><p>website in the US </p><p>• 100-year-old curated repository defining its participation in the emerging Web of Data </p></li><li><p>The New York Times </p><p>• Data curation dates back to 1913 Publisher/owner Adolph S. 
Ochs decided to </p><p>provide a set of additions to the newspaper </p><p>• New York Times Index An organized catalog of article titles and summaries </p><p> Containing the issue, date and column of each article Categorized by subjects and names Introduced on a quarterly, then annual, basis </p><p>• The transitory content of the newspaper became an important source of searchable historical data Often used to settle historical debates </p></li><li><p>The New York Times </p><p>• Index Department was created in 1913 Curation and cataloguing of NYT resources </p><p> Since 1851 the NYT had a low-quality index for internal use </p><p>• Developed a comprehensive catalog using a controlled vocabulary Covering subjects, personal names, </p><p>organizations, geographic locations and titles of creative works (books, movies, etc.), linked to articles and their summaries </p><p>• The current Index Dept. has ~15 people </p></li><li><p>The New York Times </p><p>• Challenges with consistently and accurately classifying news articles over time Keywords expressing subjects may show some </p><p>variance due to cultural or legal constraints </p><p> The identities of some entities, such as organizations and places, changed over time </p><p>• The controlled vocabulary grew to hundreds of thousands of categories Adding complexity to the classification process </p></li><li><p>The New York Times </p><p>• The increased importance of the Web drove the need to improve categorization of online content </p><p>• Curation carried out by the Index Department Library-time (days to weeks) The print edition can handle a next-day index </p><p>• Not suitable for real-time online publishing, which needed a same-day index </p></li><li><p>The New York Times </p><p>• Introduced a two-stage curation process Editorial staff performed best-effort semi-automated sheer curation at the point of online publication. 
Several hundred journalists </p><p> The Index Department follows up with long-term, accurate classification and archiving </p><p>• Benefits: Non-expert journalist curators provide instant </p><p>accessibility to online users </p><p> The Index Department provides long-term, high-quality curation in a trust but verify approach </p></li><li><p>NYT Curation Workflow </p><p> Curation starts with the article getting out of the newsroom </p></li><li><p>NYT Curation Workflow </p><p> A member of the editorial staff submits the article to a web-based, rule-based information extraction system (SAS Teragram) </p></li><li><p>NYT Curation Workflow </p><p> Teragram uses linguistic extraction rules based on a subset of the Index Dept.'s controlled vocabulary </p></li><li><p>NYT Curation Workflow </p><p> Teragram suggests tags based on the Index vocabulary that can potentially describe the content of the article </p></li><li><p>NYT Curation Workflow </p><p> The editorial staff member selects the terms that best describe the contents and inserts new tags if necessary </p></li><li><p>NYT Curation Workflow </p><p> Reviewed by the taxonomy managers, with feedback to editorial staff on the classification process </p></li><li><p>NYT Curation Workflow </p><p> The article is published online at </p></li><li><p>NYT Curation Workflow </p><p> At a later stage the article receives second-level curation by the Index Dept.  additional Index tags and a summary </p></li><li><p>NYT Curation Workflow </p><p> The article is submitted to the NYT Index </p></li><li><p>The New York Times </p><p>• Early adopter of Linked Open Data (June 2009) </p></li><li><p>The New York Times </p><p>• Linked Open Data @ A subset of 10,000 tags from the index vocabulary A dataset of people, organizations &amp; locations </p><p> Complemented by search services to consume data about articles, movies, best sellers, Congress votes, real estate, </p><p>• Benefits Improves traffic through third-party data usage Lowers the development cost of new applications </p><p>for different verticals inside the website E.g. 
movies, travel, sports, books </p></li><li><p>CROWDSOURCING PART III </p></li><li><p>Introduction to Crowdsourcing </p><p>• Coordinating a crowd (a large group of workers) to do micro-work (small tasks) that solves problems (that computers or a single user cannot) </p><p>• A collection of mechanisms and associated methodologies for scaling and directing crowd activities to achieve goals </p><p>• Related Areas Collective Intelligence Social Computing Human Computation Data Mining </p><p> A. J. Quinn and B. B. Bederson, Human computation: a survey and taxonomy of a growing field, in Proceedings of the 2011 Annual Conference on Human Factors in Computing Systems, 2011, pp. 1403-1412. </p></li><li><p>When Computers Were Human </p><p>• Maskelyne 1760 Used human computers </p><p>to create an almanac of moon positions </p><p>Used for shipping/navigation </p><p>Quality assurance Do calculations twice Compare with a third verifier </p><p>D. A. Grier, When Computers Were Human, vol. 13. Princeton University Press, 2005. </p></li><li><p>When Computers Were Human </p></li><li><p>Human Visual perception Visuospatial thinking Audiolinguistic ability Sociocultural </p><p>awareness </p><p>Creativity Domain knowledge </p><p>Machine Large-scale data </p><p>manipulation </p><p>Collecting and storing large amounts of data </p><p>Efficient data movement Bias-free analysis </p><p>Human vs Machine Affordances </p><p>R. J. Crouser and R. Chang, An affordance-based framework for human computation and human-computer collaboration, IEEE Trans. Vis. Comput. Graph., vol. 18, pp. 2859-2868, 2012. </p></li><li><p>When to Crowdsource a Task? 
</p><p>• Computers cannot do the task </p><p>• A single person cannot do the task </p><p>• The work can be split into smaller tasks </p></li><li><p>Platforms and Marketplaces </p></li><li><p>Types of Crowds </p><p>• Internal corporate communities Taps the potential of the internal workforce Curate competitive enterprise data that will </p><p>remain internal to the company May not always be the case e.g. product technical </p><p>support and marketing data </p><p>• External communities Public crowd-sourcing marketplaces Pre-competitive communities </p></li><li><p>Generic Architecture </p><p>Workers </p><p>Platform/Marketplace (Publish Task, Task Management) </p><p>Requestors </p></li><li><p>MTurk Workflow </p></li><li><p>CASE STUDIES ON CROWDSOURCED DATA CURATION PART IV </p></li><li><p>Crowdsourced Data Curation </p><p>DQ Rules &amp; Algorithms </p><p> Entity Linking Data Fusion </p><p>Relation Extraction </p><p>Human Computation </p><p> Relevance Judgment </p><p>Data Verification Disambiguation </p><p>Clean Data Internal Community - Domain Knowledge - High-Quality Responses - Trustable </p><p>Web of Data </p><p>Databases </p><p>Textual Content </p><p>Programmers Managers </p><p>External Crowd - High Availability - Large Scale - Expertise Variety </p></li><li><p>Examples of CDM Tasks </p><p>• Understanding customer sentiment for the launch of a new product around the world. </p><p>• Implemented a 24/7 sentiment analysis system with workers from around the world. </p><p>• 90% accuracy on 95% of content </p><p>• Categorize millions of products in eBay's catalog with accurate and complete attributes </p><p>• Combine the crowd with machine learning to create an affordable and flexible catalog-quality system </p></li><li><p>Examples of CDM Tasks </p><p>• Natural Language Processing Dialect Identification, Spelling Correction, Machine </p><p>Translation, Word Similarity </p><p>• Computer Vision Image Similarity, Image An...</p></li></ul>
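The quality-assurance pattern that recurs through these slides (Maskelyne's human computers did each calculation twice and compared against a third verifier; crowdsourcing platforms likewise assign the same micro-task to several workers) can be sketched as majority-vote aggregation of redundant judgments. This is a minimal illustration rather than code from any of the case studies; the record identifiers and labels below are invented for the example.

```python
# Minimal sketch of crowdsourced data verification with redundant
# judgments: each record is labelled by several workers, and conflicts
# are settled by majority vote. All names here are illustrative.
from collections import Counter

def majority_vote(judgments):
    """Return the majority label, or None when no label wins outright."""
    label, count = Counter(judgments).most_common(1)[0]
    # Require agreement from more than half the workers; otherwise the
    # record is escalated to an expert curator (signalled by None).
    return label if count > len(judgments) / 2 else None

def aggregate(task_results):
    """Map each record id to its majority label (None = escalate)."""
    return {rec_id: majority_vote(js) for rec_id, js in task_results.items()}

# Three workers judge whether pairs of product records are duplicates.
results = aggregate({
    "APNR-vs-IPN890": ["duplicate", "duplicate", "distinct"],
    "APNS-vs-IPN890": ["distinct", "duplicate", "duplicate"],
    "APNR-vs-APNS":   ["distinct", "unsure", "duplicate"],
})
```

The `None` escalation path mirrors the blended approaches described above: the crowd provides fast, scalable answers, while disputed records flow to expert curators in a trust but verify arrangement.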