Human Computation for Big Data

Gianluca Demartini, eXascale Infolab, University of Fribourg, Switzerland (gianlucademartini.net, exascale.info). CUSO Seminar on Big Data, May 23, 2014, Fribourg.

Upload: exascale-infolab

Post on 26-Jan-2015


DESCRIPTION

Over the last few years we have observed the emergence of hybrid human-machine information systems that can both scale over large amounts of data and maintain the high-quality data processing intrinsic to human intelligence. In this talk I will focus on the use of human intelligence at scale, by means of crowdsourcing, to deal with Big Data problems. We will look specifically at how to deal with the variety in data by means of Human Computation while still being able to operate on a large data volume. First, I will introduce the area of micro-task crowdsourcing and give an overview of the research challenges that need to be tackled to enable large-scale hybrid human-machine information systems. Next, I will provide examples of such hybrid systems for entity linking and disambiguation that use crowdsourcing and a graph of linked entities as a background corpus. I will describe how keyword query understanding can be crowdsourced to build search engines that can answer rare, complex queries. Finally, I will present new techniques that improve the quality of crowdsourced information system components by means of push crowdsourcing.

TRANSCRIPT

Page 1: Human Computation for Big Data

Human Computation for Big Data

Gianluca Demartini
eXascale Infolab

University of Fribourg, Switzerland

gianlucademartini.net
exascale.info

CUSO Seminar on Big Data – May 23, 2014 – Fribourg

Page 2: Human Computation for Big Data

Gianluca Demartini

• M.Sc. at University of Udine, Italy
• Ph.D. at University of Hannover, Germany
  – Entity Retrieval
• Worked for UC Berkeley (on Crowdsourcing), Yahoo! Research (Spain), L3S Research Center (Germany)
• Post-doc at the eXascale Infolab, Uni Fribourg, Switzerland
• Lecturer for Social Computing in Fribourg
• Tutorial on Entity Search at ECIR 2012, on Crowdsourcing at ESWC 2013 and ISWC 2013
• Research Interests
  – Information Retrieval, Semantic Web, Human Computation

Page 3: Human Computation for Big Data

Web of Data

• Freebase
  – Acquired by Google in July 2010
  – Knowledge Graph launched in May 2012
• Schema.org
  – Driven by major search engine companies
  – Machine-readable annotations of Web pages
• Linked Open Data
  – 31 billion triples, Sept. 2011
• Volume and Variety

Page 4: Human Computation for Big Data


Linked Open Data

Z. Kaoudi and I. Manolescu, ICDE seminar 2013

Page 5: Human Computation for Big Data

LOD data is an enormous graph

• Subject – Predicate – Object
  – Barack Obama – marriedTo – Michelle Obama
• Specific scalable DB systems exist

[Figure: example graph with entity nodes e1, e2, e3, e4 and predicate edges p1, p2, p3]
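To make the subject, predicate, object model on this slide concrete, here is a minimal in-memory sketch. The `TripleStore` class and its method names are assumptions made for illustration; real LOD stores are dedicated, scalable database systems and look nothing like this toy.

```python
from collections import defaultdict


class TripleStore:
    """Toy store of (subject, predicate, object) triples, indexed both ways."""

    def __init__(self):
        self._by_subject = defaultdict(set)  # subject -> {(predicate, object)}
        self._by_object = defaultdict(set)   # object  -> {(subject, predicate)}

    def add(self, subject, predicate, obj):
        self._by_subject[subject].add((predicate, obj))
        self._by_object[obj].add((subject, predicate))

    def objects(self, subject, predicate):
        """Follow an edge forward: all o such that (subject, predicate, o) exists."""
        return {o for p, o in self._by_subject.get(subject, ()) if p == predicate}

    def subjects(self, predicate, obj):
        """Follow an edge backward: all s such that (s, predicate, obj) exists."""
        return {s for s, p in self._by_object.get(obj, ()) if p == predicate}


store = TripleStore()
store.add("Barack Obama", "marriedTo", "Michelle Obama")
print(store.objects("Barack Obama", "marriedTo"))
```

Indexing each triple by both subject and object keeps forward and backward edge lookups cheap, which is the basic idea behind graph traversal over triples.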

Page 6: Human Computation for Big Data

I will talk about

• Micro-task Crowdsourcing
• Hybrid Human-Machine systems
• Entity Linking/Disambiguation
  – On the Web using crowdsourcing
• Improving Crowdsourcing Platform Quality
  – Pushing tasks to workers
• Research directions
  – Crowdsourced Query Understanding
  – Transactive Search

Page 7: Human Computation for Big Data

Crowdsourcing

• Exploit human intelligence to solve
  – Tasks simple for humans, complex for machines
  – With a large number of humans (the Crowd)
  – Small problems: micro-tasks (Amazon MTurk)
• Examples
  – Wikipedia, Image tagging
• Incentives
  – Financial, fun, visibility

Page 8: Human Computation for Big Data

Case-Study: Amazon MTurk

• Micro-task crowdsourcing marketplace
• On-demand, scalable, real-time workforce
• Different crowd motivation (not just money)
• Online since 2005 (still in “beta”)
• Currently the most popular platform
• Developer’s API as well as GUI

Page 9: Human Computation for Big Data


Amazon MTurk

Page 10: Human Computation for Big Data


A Task on MTurk

Page 11: Human Computation for Big Data

Amazon MTurk Workflow

• Requesters create tasks (HITs)
• Workers preview, accept, submit HITs
• Requesters approve, download results
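The three workflow steps can be read as a small state machine. The sketch below is only a model of that lifecycle: `Hit`, `HitState`, and the method names are hypothetical and do not correspond to the actual MTurk API.

```python
from enum import Enum


class HitState(Enum):
    CREATED = "created"      # requester published the task
    ACCEPTED = "accepted"    # a worker took the task
    SUBMITTED = "submitted"  # the worker handed in an answer
    APPROVED = "approved"    # the requester approved the answer


# Legal transitions, following the workflow steps on the slide.
NEXT_STATE = {
    HitState.CREATED: HitState.ACCEPTED,
    HitState.ACCEPTED: HitState.SUBMITTED,
    HitState.SUBMITTED: HitState.APPROVED,
}


class Hit:
    """Hypothetical model of one HIT; not an MTurk client."""

    def __init__(self, question):
        self.question = question
        self.state = HitState.CREATED
        self.answer = None

    def _advance(self, target):
        if NEXT_STATE.get(self.state) is not target:
            raise ValueError(f"cannot go from {self.state.name} to {target.name}")
        self.state = target

    def accept(self):
        self._advance(HitState.ACCEPTED)

    def submit(self, answer):
        self._advance(HitState.SUBMITTED)
        self.answer = answer

    def approve(self):
        self._advance(HitState.APPROVED)
        return self.answer


hit = Hit("Does attribute paper match attribute author?")
hit.accept()
hit.submit("No")
print(hit.approve())
```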

Page 12: Human Computation for Big Data


Example: Hybrid Image Search

Yan, Kumar, Ganesan, CrowdSearch: Exploiting Crowds for Accurate Real-time Image Search on Mobile Phones, Mobisys 2010.

Page 13: Human Computation for Big Data

Example: Hybrid Data Integration

paper             conf
Data integration  VLDB-01
Data mining       SIGMOD-02

title         author  email   venue
OLAP          Mike    mike@a  ICDE-02
Social media  Jane    jane@b  PODS-05

Generate plausible matches
– paper = title, paper = author, paper = email, paper = venue
– conf = title, conf = author, conf = email, conf = venue

Ask users to verify

Does attribute paper match attribute author?
Yes / No / Not sure

McCann, Shen, Doan: Matching Schemas in Online Communities. ICDE, 2008
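A rough sketch of the two steps on this slide: enumerate every plausible attribute match, then let users verify each one. Aggregating answers by majority vote is an assumption made here for illustration, not necessarily the method of the cited paper, and the votes in the example are made up.

```python
from collections import Counter
from itertools import product


def candidate_matches(source_attrs, target_attrs):
    """Every pairing of a source attribute with a target attribute."""
    return list(product(source_attrs, target_attrs))


def verify(votes):
    """Majority vote over 'yes' / 'no' answers; 'not sure' is ignored.

    Returns True, False, or None when the vote is tied or empty.
    """
    counts = Counter(v for v in votes if v in ("yes", "no"))
    if counts["yes"] > counts["no"]:
        return True
    if counts["no"] > counts["yes"]:
        return False
    return None


pairs = candidate_matches(["paper", "conf"], ["title", "author", "email", "venue"])
print(len(pairs))                          # 2 x 4 candidate matches
print(verify(["no", "no", "not sure"]))    # hypothetical answers for paper = author
```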

Page 14: Human Computation for Big Data

Hybrid Systems: Key Issues

• The role of machine (i.e., algorithm) and humans
  – use only humans? both? who’s doing what?
• Quality control
• Payment
• Optimization: What to crowdsource
• Scalability: How much to crowdsource

Page 15: Human Computation for Big Data

Entity Linking/Disambiguation

Page 16: Human Computation for Big Data

[Figure: linked-entity graph with nodes http://dbpedia.org/resource/Facebook, http://dbpedia.org/resource/Instagram, fbase:Instagram, Google, and Android; edge label owl:sameAs]

HTML:

<p>Facebook is not waiting for its initial public offering to make its first big purchase.</p>
<p>In its largest acquisition to date, the social network has purchased Instagram, the popular photo-sharing application, for about $1 billion in cash and stock, the company said Monday.</p>

RDFa enrichment:

<p><span about="http://dbpedia.org/resource/Facebook"><cite property="rdfs:label">Facebook</cite> is not waiting for its initial public offering to make its first big purchase.</span></p>
<p><span about="http://dbpedia.org/resource/Instagram">In its largest acquisition to date, the social network has purchased <cite property="rdfs:label">Instagram</cite>, the popular photo-sharing application, for about $1 billion in cash and stock, the company said Monday.</span></p>
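The RDFa enrichment on this slide can be approximated in a few lines, assuming the mention-to-URI links are already known. Unlike the slide, which wraps the whole sentence in the `span`, this simplified variant wraps only the first occurrence of each mention. The `annotate` function is a hypothetical helper, not part of ZenCrowd.

```python
import re
from html import escape


def annotate(text, links):
    """Wrap the first occurrence of each linked mention in RDFa markup.

    `links` maps a surface form (e.g. "Facebook") to an entity URI.
    """
    if not links:
        return escape(text)
    # Longest labels first so that overlapping labels prefer the longer one.
    labels = sorted(links, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(label) for label in labels))
    seen, parts, pos = set(), [], 0
    for match in pattern.finditer(text):
        parts.append(escape(text[pos:match.start()]))
        label = match.group(0)
        if label in seen:
            parts.append(escape(label))  # later mentions stay plain text
        else:
            seen.add(label)
            parts.append(
                f'<span about="{escape(links[label])}">'
                f'<cite property="rdfs:label">{escape(label)}</cite></span>'
            )
        pos = match.end()
    parts.append(escape(text[pos:]))
    return "".join(parts)


links = {
    "Facebook": "http://dbpedia.org/resource/Facebook",
    "Instagram": "http://dbpedia.org/resource/Instagram",
}
print(annotate("Facebook has purchased Instagram.", links))
```

Doing the replacement in a single pass over the original text avoids accidentally re-matching labels inside URIs that were inserted earlier.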

Page 17: Human Computation for Big Data

ZenCrowd

• Combine both algorithmic and manual linking
• Automate manual linking via crowdsourcing
• Dynamically assess human workers with a probabilistic reasoning framework

[Figure: diagram labels: Crowd, Algorithms, Machines]

Page 18: Human Computation for Big Data


ZenCrowd Architecture

Gianluca Demartini, Djellel Eddine Difallah, and Philippe Cudré-Mauroux. ZenCrowd: Leveraging Probabilistic Reasoning and Crowdsourcing Techniques for Large-Scale Entity Linking. In: 21st International Conference on World Wide Web (WWW 2012).

Page 19: Human Computation for Big Data

Entity Factor Graphs

• Graph components
  – Workers, links, clicks
  – Prior probabilities
  – Link Factors
  – Constraints
• Probabilistic Inference
  – Select all links with posterior prob > τ

[Figure: factor graph for 2 workers, 6 clicks, 3 candidate links, showing link priors, worker priors, observed variables, link factors, SameAs constraints, and dataset unicity constraints]
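The selection rule "posterior prob > τ" can be illustrated with a much simpler model than the factor graph on this slide. The sketch assumes each worker clicks independently and is correct with a known probability (their reliability), and it ignores the SameAs and unicity constraints entirely. It is a naive Bayes stand-in, not ZenCrowd's inference, and all names and numbers are made up.

```python
def link_posterior(prior, votes):
    """Posterior that a candidate link is valid.

    `prior` is the link prior; `votes` is a list of
    (worker_reliability, says_valid) pairs, assumed independent.
    """
    p_valid = prior
    p_invalid = 1.0 - prior
    for reliability, says_valid in votes:
        # A reliable worker says "valid" often when the link is valid,
        # and rarely when it is not.
        p_valid *= reliability if says_valid else 1.0 - reliability
        p_invalid *= (1.0 - reliability) if says_valid else reliability
    total = p_valid + p_invalid
    return p_valid / total if total > 0 else prior


def select_links(candidates, tau=0.5):
    """Keep every link whose posterior exceeds the threshold tau."""
    return [
        link
        for link, (prior, votes) in candidates.items()
        if link_posterior(prior, votes) > tau
    ]


candidates = {
    "link_a": (0.5, [(0.8, True), (0.8, True)]),    # two workers agree it is valid
    "link_b": (0.3, [(0.8, False), (0.8, False)]),  # two workers reject it
}
print(select_links(candidates, tau=0.5))
```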

Page 20: Human Computation for Big Data

Experimental Evaluation

• Datasets
  – 25 news articles from
    • CNN.com (Global news)
    • NYTimes.com (Global news)
    • Washington-post.com (US local news)
    • Timesofindia.indiatimes.com (India news)
    • Swissinfo.com (Switzerland local news)
  – 40M entities (Freebase, DBPedia, Geonames, NYT)

Page 21: Human Computation for Big Data


Worker Selection

Page 22: Human Computation for Big Data

Lessons Learnt

• Crowdsourcing + Prob reasoning works!
• But
  – Different worker communities perform differently
  – Many low quality workers
  – Completion time may vary (based on reward)
• Need to find the right workers for your task (see WWW13 paper)

Page 23: Human Computation for Big Data

ZenCrowd Summary

• ZenCrowd: Probabilistic reasoning over automatic and crowdsourcing methods for entity linking
• Standard crowdsourcing improves 6% over automatic
• 4% - 35% improvement over standard crowdsourcing
• 14% average improvement over automatic approaches

http://exascale.info/zencrowd/

Page 24: Human Computation for Big Data

Blocking for Instance Matching

• Find the instances about the same real-world entity within two datasets
• Avoid comparison of all possible pairs
  – Step 1: cluster similar items using a cheap similarity measure
  – Step 2: n*n comparison within the clusters with an expensive measure
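A minimal sketch of the two blocking steps, under stated assumptions: the cheap measure is a three-letter prefix key and the expensive one is `difflib.SequenceMatcher`. Both are stand-ins chosen for illustration; the talk does not say which measures were actually used.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import product


def cheap_key(name):
    """Step 1 (cheap): bucket items by their lowercased three-letter prefix."""
    return name.lower()[:3]


def expensive_similarity(a, b):
    """Step 2 (expensive): character-level similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def blocked_matches(left, right, threshold=0.7):
    """Compare only pairs that share a block instead of all left x right pairs."""
    blocks = defaultdict(lambda: ([], []))
    for item in left:
        blocks[cheap_key(item)][0].append(item)
    for item in right:
        blocks[cheap_key(item)][1].append(item)

    matches = []
    for lefts, rights in blocks.values():
        for a, b in product(lefts, rights):
            if expensive_similarity(a, b) >= threshold:
                matches.append((a, b))
    return matches


dataset_a = ["Barack Obama", "Rome"]
dataset_b = ["barack h. obama", "Roma", "Paris"]
print(blocked_matches(dataset_a, dataset_b))
```

Here only two of the six possible pairs are ever scored with the expensive measure; the saving grows quickly with dataset size, at the cost of missing matches that the cheap key puts in different blocks.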

Page 25: Human Computation for Big Data

Three-stage blocking with the Crowd for Data Integration

• 1. Cheap clustering/inverted index selection of candidates
• 2. Expensive similarity measure
• 3. Crowdsource low confidence matches

Gianluca Demartini, Djellel Eddine Difallah, and Philippe Cudré-Mauroux. Large-Scale Linked Data Integration Using Probabilistic Reasoning and Crowdsourcing. In: VLDB Journal, Volume 22, Issue 5 (2013), Page 665-687, Special issue on Structured, Social and Crowd-sourced Data on the Web. October 2013.
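The third stage can be pictured as a triage over the similarity scores produced by the second stage. The two thresholds below are hypothetical values chosen for illustration; the paper's actual decision rule is probabilistic and is not reproduced here.

```python
def triage(scored_pairs, accept_at=0.9, reject_below=0.5):
    """Split scored candidate pairs into automatic matches and crowd tasks.

    Pairs scoring at least `accept_at` are accepted automatically, pairs below
    `reject_below` are dropped, and the uncertain middle goes to the crowd.
    """
    accepted, for_crowd = [], []
    for pair, score in scored_pairs:
        if score >= accept_at:
            accepted.append(pair)
        elif score >= reject_below:
            for_crowd.append(pair)
    return accepted, for_crowd


scored = [
    (("Rome", "Roma"), 0.75),      # uncertain: ask the crowd
    (("Paris", "Paris"), 1.0),     # confident: accept automatically
    (("Rome", "Paris"), 0.2),      # confident non-match: drop
]
accepted, for_crowd = triage(scored)
print(accepted, for_crowd)
```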

Page 26: Human Computation for Big Data


Improving Crowdsourcing Platforms

Page 27: Human Computation for Big Data

Pull (Traditional) Crowdsourcing

• In MTurk HITs are published on the market
• The first worker willing to do it can take it
• Pro: Fast
• Con: Not necessarily optimal / not the best worker for the task

Page 28: Human Computation for Big Data

Push Crowdsourcing

• Pick-A-Crowd: A system architecture that uses Task-to-Worker matching:
  – The worker’s social profile
  – The task context
• Workers can provide higher quality answers on tasks they relate to

Djellel Eddine Difallah, Gianluca Demartini, and Philippe Cudré-Mauroux. Pick-A-Crowd: Tell Me What You Like, and I'll Tell You What to Do. In: 22nd International Conference on World Wide Web (WWW 2013), Rio de Janeiro, Brazil, May 2013.

Page 29: Human Computation for Big Data

Matching Models – Expert Finding

• Build an inverted index on the pages’ titles and description
• Use the title/description of the tasks as a keyword query on the inverted index and get a subset of pages
• Rank the workers by the number of liked pages in the subset
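The three steps of this matching model can be sketched as follows. Tokenization, the page texts, and the worker likes are all made-up illustrations, and any page sharing at least one token with the task is treated as a hit, which is a simplification of a real keyword query.

```python
import re
from collections import defaultdict


def tokenize(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def build_index(pages):
    """Inverted index: token -> ids of pages whose title/description contain it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for token in tokenize(text):
            index[token].add(page_id)
    return index


def rank_workers(task_text, index, likes):
    """Rank workers by how many of their liked pages match the task text."""
    matching_pages = set()
    for token in tokenize(task_text):
        matching_pages |= index.get(token, set())
    scores = {worker: len(liked & matching_pages) for worker, liked in likes.items()}
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [(worker, score) for worker, score in ranked if score > 0]


pages = {
    "p1": "Soccer World Cup news",
    "p2": "Italian cooking recipes",
    "p3": "FC Barcelona soccer club",
}
likes = {"alice": {"p1", "p3"}, "bob": {"p2"}, "carol": {"p3"}}
index = build_index(pages)
print(rank_workers("Identify the soccer player in this photo", index, likes))
```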

Page 30: Human Computation for Big Data


Pick-A-Crowd

Page 31: Human Computation for Big Data

Discussion

• Pull vs. Push methodologies in Crowdsourcing
• Pick-A-Crowd system architecture with Task-to-Worker recommendation
• Experimental comparison with AMT shows a consistent quality improvement

“Workers Know what they Like”

www.openturk.com

Page 32: Human Computation for Big Data

OpenTurk

• Yet another platform? Build on top of MTurk!
• Chrome Extension for push / notification
• 400+ users
• http://bit.ly/openturk-extension
• Open source: https://github.com/openturk/extension

Page 33: Human Computation for Big Data

CrowdQ: Crowdsourced Query Understanding

Page 34: Human Computation for Big Data


birthdate of the mayor of the capital city of italy

Page 35: Human Computation for Big Data


capital city of italy

Page 36: Human Computation for Big Data


mayor of rome

Page 37: Human Computation for Big Data


birthdate of ignazio marino

Page 38: Human Computation for Big Data

Motivation

• Web Search Engines can answer simple factual queries directly on the result page
• Users with complex information needs are often unsatisfied
• Purely automatic techniques are not enough
• We want to solve it with Crowdsourcing!

Page 39: Human Computation for Big Data

CrowdQ

• CrowdQ is the first system that uses crowdsourcing to
  – Understand the intended meaning
  – Build a structured query template
  – Answer the query over Linked Open Data

Gianluca Demartini, Beth Trushkowsky, Tim Kraska, and Michael Franklin. CrowdQ: Crowdsourced Query Understanding. In: 6th Biennial Conference on Innovative Data Systems Research (CIDR 2013).

Page 40: Human Computation for Big Data

Hybrid Human-Machine Pipeline

Q = birthdate of actors of forrest gump

• Query annotation: Noun (birthdate), Noun (actors), Named entity (forrest gump)
• Verification: Is forrest gump this entity in the query?
• Entity Relations: Which is the relation between: actors and forrest gump? Answer: starring
• Schema element: Starring <dbpedia-owl:starring>
• Verification: Is the relation between
  – Indiana Jones – Harrison Ford
  – Back to the Future – Michael J. Fox
  of the same type as Forrest Gump – actors?

Page 41: Human Computation for Big Data

Structured query generation

Q = birthdate of actors of forrest gump (forrest gump annotated as MOVIE)

SELECT ?y ?x
WHERE { ?y <dbpedia-owl:birthdate> ?x .
        ?z <dbpedia-owl:starring> ?y .
        ?z <rdfs:label> 'Forrest Gump' }

Results from BTC09: [figure not preserved in transcript]
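Once the crowd has verified the query structure, the query on this slide behaves like a template with a single MOVIE slot. The sketch below fills that slot with plain string formatting. The function names are hypothetical, and the predicate spellings are copied from the slide rather than checked against the DBpedia ontology.

```python
def escape_literal(value):
    """Escape backslashes and single quotes for a single-quoted SPARQL literal."""
    return value.replace("\\", "\\\\").replace("'", "\\'")


# Template with one slot, {movie}; doubled braces are literal braces in str.format.
TEMPLATE = (
    "SELECT ?y ?x\n"
    "WHERE {{ ?y <dbpedia-owl:birthdate> ?x .\n"
    "        ?z <dbpedia-owl:starring> ?y .\n"
    "        ?z <rdfs:label> '{movie}' }}"
)


def birthdates_of_actors(movie_label):
    """Instantiate the template for a given movie label."""
    return TEMPLATE.format(movie=escape_literal(movie_label))


print(birthdates_of_actors("Forrest Gump"))
```

The same template can then answer "birthdate of actors of X" for any other movie without asking the crowd again, which is what makes the crowdsourcing cost worthwhile.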

Page 42: Human Computation for Big Data


Transactive Search

Page 43: Human Computation for Big Data

Transactive Search

• What if the data to answer your query is not stored on any digital support?
• What if the data is just in people's minds?
• Big Data No Data

Page 44: Human Computation for Big Data

Transactive Search

• Search using Transactive (group) Memories
• “Who attended the WWW 2014 conference?”
• Machines: Harvest the Web + Data Mining
• Crowd: Search Twitter, look at event pictures
• Transactive Memories: Remember who I met

Michele Catasta, Alberto Tonon, Djellel Eddine Difallah, Gianluca Demartini, Karl Aberer, and Philippe Cudré-Mauroux. Hippocampus: Answering Memory Queries using Transactive Search. In: 23rd International Conference on World Wide Web (WWW 2014), Web Science Track. Seoul, South Korea, April 2014.

Page 45: Human Computation for Big Data


Transactive Search (2)

Page 46: Human Computation for Big Data


Transactive Search (3)

Page 47: Human Computation for Big Data

Discussion

• Sometimes data is not on the Web
• The right group of people can still answer
  – Collaboratively
  – Using Transactive Search
  – Better than machines or anonymous crowds
• Open challenges
  – Incentives
  – Repeatability
  – SNA

Page 48: Human Computation for Big Data

Research Directions for Micro-task Crowdsourcing

Page 49: Human Computation for Big Data

State of Micro-task Crowdsourcing

• Platform side
  – Pull platforms
  – Batch processing
• Worker side
  – Work flexibility
  – Anonymity
• Requester side
  – Web/API

Page 50: Human Computation for Big Data

The Future for Requesters

• Push Platforms
  – RecSys, User Modeling, Trust
• Mobile Access
• Quality and Time guarantees
• Worker API (enable novel worker UI)

Page 51: Human Computation for Big Data

The Future of the Worker side

• Reputation system for workers
• More than financial incentives
• Recognize worker potential (badges)
  – Paid for their expertise
• Train less skilled workers (tutoring system)

Aniket Kittur et al. The Future of Crowd Work. CSCW 2013.

Page 52: Human Computation for Big Data

Crowdsourcing Ethics

• People work full-time as crowd workers
• Chinese crowdsourcing platform with 5.5M workers
• Pros
  – Help developing countries
  – Provide cash fast to people == short-term satisfaction
  – Job Flexibility
• Cons
  – No job security
  – No social security
  – Long term satisfaction? Career plans?

Dagstuhl Seminar on “Crowdsourcing: From Theory to Practice and Long-Term Perspectives”, September 2013.

Page 53: Human Computation for Big Data

Conclusions

• Structured Data makes the Web better
• It’s growing fast
  – Large volume
  – Large heterogeneity
• Crowds can help in understanding data semantics
• Hybrid human-machine systems (ZenCrowd)
• Research opportunities:
  – Exploit Human Intelligence at Scale (CrowdQ)
  – Pick the right crowd (Pick-A-Crowd, Transactive Search)

gianlucademartini.net