ZenCrowd: Leveraging Probabilistic Reasoning and Crowdsourcing Techniques for Large-Scale Entity Linking
DESCRIPTION
Conference talk at WWW 2012, April 2012, Lyon, France
TRANSCRIPT
ZenCrowd: Leveraging Probabilistic Reasoning and Crowdsourcing Techniques for Large-Scale Entity Linking
Gianluca Demartini, Djellel Eddine Difallah, and Philippe Cudré-Mauroux
eXascale Infolab, University of Fribourg, Switzerland
Motivation
• Linked Open Data (LOD)
• Linking entities from text to the LOD cloud
  – Rich snippets
  – Entity-centric search
• Linking
  – Algorithmic
  – Manual (NYT articles)
[Figure: HTML+RDFa pages connected to the LOD Cloud]
Example: RDFa enrichment

HTML:
<p>Facebook is not waiting for its initial public offering to make its first big purchase.</p><p>In its largest acquisition to date, the social network has purchased Instagram, the popular photo-sharing application, for about $1 billion in cash and stock, the company said Monday.</p>

HTML + RDFa:
<p><span about="http://dbpedia.org/resource/Facebook"><cite property="rdfs:label">Facebook</cite> is not waiting for its initial public offering to make its first big purchase.</span></p><p><span about="http://dbpedia.org/resource/Instagram">In its largest acquisition to date, the social network has purchased <cite property="rdfs:label">Instagram</cite>, the popular photo-sharing application, for about $1 billion in cash and stock, the company said Monday.</span></p>

[Figure: the annotated mentions point to http://dbpedia.org/resource/Facebook and http://dbpedia.org/resource/Instagram, with owl:sameAs edges connecting the Instagram entity to its equivalents in other LOD datasets]
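As a rough sketch of how such markup could be produced, the snippet below wraps each linked mention in the <span about=...>/<cite property="rdfs:label"> pattern from the example above; the mention-to-URI dictionary and the one-sentence-at-a-time processing are illustrative assumptions, not the authors' pipeline.

```python
# Illustrative RDFa enrichment: wrap linked mentions in RDFa markup.
# The `links` map stands in for the output of an entity-linking step.
links = {
    "Facebook": "http://dbpedia.org/resource/Facebook",
    "Instagram": "http://dbpedia.org/resource/Instagram",
}

def enrich(sentence):
    """Annotate a sentence with RDFa for the first linked mention found."""
    for mention, uri in links.items():
        if mention in sentence:
            marked = sentence.replace(
                mention, f'<cite property="rdfs:label">{mention}</cite>', 1)
            return f'<span about="{uri}">{marked}</span>'
    return sentence

print(enrich("Facebook is not waiting for its initial public offering."))
```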
Crowdsourcing
• Exploit human intelligence to solve
  – Tasks that are simple for humans, complex for machines
  – With a large number of humans (the Crowd)
  – Small problems: micro-tasks (Amazon MTurk)
• Examples
  – Wikipedia, image tagging
• Incentives
  – Financial, fun, visibility
ZenCrowd
• Combine both algorithmic and manual linking
• Automate manual linking via crowdsourcing
• Dynamically assess human workers with a probabilistic reasoning framework
[Figure: ZenCrowd draws on both the Crowd (humans) and algorithmic matchers (machines)]
Outline
• Related approaches
• ZenCrowd: system architecture
• Probabilistic model to combine automatic matching and crowdsourcing results
• Experimental evaluation
Related Approaches
• Entity Linking
• Ad-hoc Object Retrieval
  – Keyword queries
  – Looking for a specific entity
  – IR indexing and ranking over RDF data
• Crowdsourcing
  – Training set, tagging, annotation
  – IR evaluation
  – CrowdDB
ZenCrowd Architecture
[Architecture figure. Components: input HTML pages; ZenCrowd entity extractors producing micro matching tasks; algorithmic matchers; an LOD index ("get entity") built over the LOD Open Data Cloud; a micro-task manager dispatching tasks to a crowdsourcing platform; workers' decisions; a probabilistic network and decision engine; output HTML+RDFa pages.]
Algorithmic Matching
• Inverted index over LOD entities
  – DBpedia, Freebase, Geonames, NYT
• TF-IDF (IR ranking function)
• Top-ranked URIs linked to entities in documents
• Threshold on the ranking function, or top N
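As a rough illustration of this matcher, the sketch below builds a TF-IDF-weighted inverted index over a toy set of entity labels and returns the top-N URIs above a score threshold; the tokenization, the two-entity collection, and the summed-score ranking are simplifying assumptions, not the production indexer over 40M entities.

```python
# Illustrative TF-IDF inverted index over entity labels.
import math
from collections import Counter, defaultdict

entities = {  # hypothetical tiny label collection
    "http://dbpedia.org/resource/Facebook": "Facebook social network company",
    "http://dbpedia.org/resource/Instagram": "Instagram photo sharing application",
}

index = defaultdict(dict)  # term -> {entity URI -> term frequency}
for uri, label in entities.items():
    for term, tf in Counter(label.lower().split()).items():
        index[term][uri] = tf

def match(mention, top_n=5, threshold=0.0):
    """Score candidate URIs by summed TF-IDF; keep the top N above a threshold."""
    n_docs = len(entities)
    scores = Counter()
    for term in mention.lower().split():
        postings = index.get(term, {})
        if not postings:
            continue
        idf = math.log(n_docs / len(postings))
        for uri, tf in postings.items():
            scores[uri] += tf * idf
    return [(uri, s) for uri, s in scores.most_common(top_n) if s > threshold]

print(match("Instagram"))  # -> [('http://dbpedia.org/resource/Instagram', ...)]
```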
Entity Factor Graphs
• Graph components
  – Workers, links, clicks
  – Prior probabilities
  – Link factors
  – Constraints
• Probabilistic inference
  – Select all links with posterior probability > τ
[Factor graph figure: 2 workers, 6 clicks, 3 candidate links. Worker variables w1, w2 with worker priors pw1(), pw2(); link variables l1, l2, l3 with link priors pl1()–pl3(); observed click variables c11–c23; link factors lf1()–lf3(); a SameAs constraint sa1-2(); a dataset unicity constraint u2-3().]
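For intuition, here is a heavily simplified sketch of the inference step: if the SameAs and unicity constraint factors are dropped and workers are assumed to click independently, the posterior for a single link reduces to a naive-Bayes update over worker clicks. The reliabilities, prior, and threshold values below are illustrative numbers, not the paper's.

```python
# Simplified single-link posterior (no SameAs / unicity factors).
def link_posterior(link_prior, votes, reliability):
    """votes: {worker_id: True if the worker clicked this link};
    reliability: {worker_id: P(worker answers correctly)}."""
    p_good, p_bad = link_prior, 1.0 - link_prior
    for w, clicked in votes.items():
        r = reliability[w]
        p_good *= r if clicked else (1.0 - r)
        p_bad *= (1.0 - r) if clicked else r
    return p_good / (p_good + p_bad)

tau = 0.5  # decision threshold on the posterior
votes = {"w1": True, "w2": False}
reliability = {"w1": 0.9, "w2": 0.6}
p = link_posterior(0.5, votes, reliability)
print(p, "-> select" if p > tau else "-> reject")  # ~0.857 -> select
```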
Entity Factor Graphs
• Training phase
  – Initialize worker priors with k matches on known answers
• Updating worker priors
  – Use link decisions as new observations
  – Compute new worker probabilities
• Identify (and discard) unreliable workers
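A minimal sketch of the prior-update idea, assuming a simple running accuracy estimate with Laplace smoothing; the actual framework recomputes worker probabilities jointly inside the factor graph, and the class and constants here are illustrative.

```python
# Illustrative running estimate of a worker's reliability.
class Worker:
    def __init__(self, k_correct=0, k_total=0):
        # initialized during training on k tasks with known answers
        self.correct, self.total = k_correct, k_total

    def observe(self, agreed_with_decision):
        """Treat each accepted link decision as a new observation."""
        self.correct += 1 if agreed_with_decision else 0
        self.total += 1

    def prior(self):
        return (self.correct + 1) / (self.total + 2)  # Laplace smoothing

w = Worker(k_correct=4, k_total=5)   # 4/5 on the known-answer questions
w.observe(agreed_with_decision=True)
print(w.prior())                      # updated reliability estimate: 0.75
```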
Experimental Evaluation
• Datasets
  – 25 news articles from:
    • CNN.com (global news)
    • NYTimes.com (global news)
    • WashingtonPost.com (US local news)
    • TimesOfIndia.indiatimes.com (Indian local news)
    • Swissinfo.com (Swiss local news)
  – 40M entities (Freebase, DBpedia, Geonames, NYT)
• 80 workers on Amazon MTurk, $0.01 per entity
• Standard evaluation measures: Precision, Recall, Accuracy
• Gold standard: editorial selection
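For concreteness, a minimal sketch of the three measures for one document, assuming the system's link decisions, the editorial gold standard, and the candidate pool are all sets of (mention, URI) pairs; the function and set-based framing are illustrative, not from the paper.

```python
# Hypothetical helper: Precision, Recall, Accuracy against a gold standard.
def precision_recall_accuracy(selected, gold, candidates):
    tp = len(selected & gold)               # correctly selected links
    fp = len(selected - gold)               # selected but wrong
    fn = len(gold - selected)               # missed links
    tn = len(candidates - selected - gold)  # correctly rejected candidates
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(candidates) if candidates else 0.0
    return precision, recall, accuracy
```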
The micro-task
Experimental Evaluation
• Automatic approach: precision/recall trade-off
0"0.1"0.2"0.3"0.4"0.5"0.6"0.7"0.8"0.9"1"
1" 2" 3" 4" 5"
Precision)/)Re
call)
Top)N)Results)
P"R"
Experimental Evaluation
• Entity linking with crowdsourcing and an agreement vote (at least 2 out of 5 workers must select the same URI)
• Top-1 precision: 0.70
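A minimal sketch of this agreement-vote baseline, assuming each task shows the same candidate URIs to 5 workers; the `min_votes=2` threshold matches the value reported below, while the per-worker selection structure is an illustrative assumption.

```python
# Illustrative agreement vote: accept a URI with at least `min_votes` votes.
from collections import Counter

def agreement_vote(worker_selections, min_votes=2):
    """worker_selections: list of URI sets, one per worker."""
    votes = Counter(uri for sel in worker_selections for uri in sel)
    return {uri for uri, n in votes.items() if n >= min_votes}

picks = [{"dbp:Instagram"}, {"dbp:Instagram", "dbp:Facebook"},
         {"dbp:Facebook"}, set(), {"dbp:Instagram"}]
print(agreement_vote(picks))  # {'dbp:Instagram', 'dbp:Facebook'}
```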
0"0.1"0.2"0.3"0.4"0.5"0.6"0.7"0.8"0.9"1"
0" 0.1"0.2"0.3"0.4"0.5"0.6"0.7"0.8"0.9"
Precision)/)Re
call)
Matching)Probability)Threshold)
P"R"
0"0.1"0.2"0.3"0.4"0.5"0.6"0.7"0.8"0.9"1"
1" 2" 3" 4" 5"
Precision)/)Re
call)
Top)N)Results)
P"R"
Figure 5: Performance results (Precision, Recall) forthe automatic approach.
The URIs with at least 2 votes are selected as valid links (we tried various thresholds and manually picked 2 in the end, since it leads to the highest precision scores while keeping good recall values for our experiments). We report on the performance of this crowdsourcing technique in Table 2. The values are averaged over all linkable entities for different document types and worker communities.
Table 2: Performance results for crowdsourcing with agreement vote over linkable entities.

              US Workers           Indian Workers
              P     R     A        P     R     A
  GL News     0.79  0.85  0.77     0.60  0.80  0.60
  US News     0.52  0.61  0.54     0.50  0.74  0.47
  IN News     0.62  0.76  0.65     0.64  0.86  0.63
  SW News     0.69  0.82  0.69     0.50  0.69  0.56
  All News    0.74  0.82  0.73     0.57  0.78  0.59
The first question we examine is whether there is a difference in reliability between the various populations of workers. In Figure 6 we show the performance for tasks performed by workers located in the USA and in India (each point corresponds to the average precision and recall over all entities in one document). On average, we observe that tasks performed by workers located in the USA lead to higher precision values. As we can see in Table 2, Indian workers obtain higher precision and recall on local Indian news as compared to US workers. The biggest difference in terms of accuracy between the two communities can be observed on the global interest news.
0"0.1"0.2"0.3"0.4"0.5"0.6"0.7"0.8"0.9"1"
0" 0.2" 0.4" 0.6" 0.8" 1"
Recall&
Precision&
US"India"
Figure 6: Per document task e↵ectiveness.
A second question we examine is how the textual context given for an entity influences worker performance. In Figure 7, we compare the tasks for which only the entity label is given to those for which a context consisting of all the sentences containing the entity is shown to the worker (snippets). Surprisingly, we could not observe a significant difference in effectiveness caused by the different textual contexts given to the workers. Thus, we focus on only one type of context for the remaining experiments (we always give the snippet context).
0"0.1"0.2"0.3"0.4"0.5"0.6"0.7"0.8"0.9"1"
1" 2" 3" 4" 5" 6" 7" 8" 9" 10"
Precision)
Document)
Simple"
Snippet"
Figure 7: Crowdsourcing results with two di↵erenttextual contexts
Entity Linking with ZenCrowd. We now focus on the performance of the probabilistic inference network as proposed in this paper. We consider the method described in Section 4, with an initial training phase consisting of 5 entities, and a second, continuous training phase consisting of 5% of the other entities being offered to the workers (i.e., the workers are given a task whose solution is known by the system every 20 tasks on average).

In order to reduce the number of tasks having little influence on the final results, a simple technique of blacklisting bad workers is used. A bad worker (who can be considered a spammer) is a worker who randomly and rapidly clicks on the links, hence generating noise in our system. In our experiments, we consider that 3 consecutive bad answers in the training phase is enough to identify the worker as a spammer and to blacklist him/her. We report the average results of ZenCrowd when exploiting the training phase, constraints, and blacklisting in Table 3. As we can observe, precision and accuracy values are higher in all cases when compared to the agreement vote approach.
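A minimal sketch of this blacklisting rule, assuming the system tracks each worker's streak of wrong answers on known-answer (training) tasks; the streak-tracking structures are illustrative, only the 3-consecutive-errors threshold comes from the text above.

```python
# Illustrative spammer blacklisting: 3 consecutive wrong training answers.
from collections import defaultdict

MAX_BAD_STREAK = 3
bad_streak = defaultdict(int)
blacklist = set()

def record_training_answer(worker_id, is_correct):
    """Update a worker's streak on a known-answer task; blacklist spammers."""
    if is_correct:
        bad_streak[worker_id] = 0
    else:
        bad_streak[worker_id] += 1
        if bad_streak[worker_id] >= MAX_BAD_STREAK:
            blacklist.add(worker_id)

for answer in [False, False, False]:
    record_training_answer("w42", answer)
print("w42" in blacklist)  # True
```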
Table 3: Performance results for crowdsourcing with ZenCrowd over linkable entities.

              US Workers           Indian Workers
              P     R     A        P     R     A
  GL News     0.84  0.87  0.90     0.67  0.64  0.78
  US News     0.64  0.68  0.78     0.55  0.63  0.71
  IN News     0.84  0.82  0.89     0.75  0.77  0.80
  SW News     0.72  0.80  0.85     0.61  0.62  0.73
  All News    0.80  0.81  0.88     0.64  0.62  0.76
Finally, we compare ZenCrowd to the state-of-the-art crowdsourcing approach (using the optimal agreement vote) and to our best automatic approach on a per-task basis in Figure 8. The comparison is given for each document in the test collection. We observe that in most cases the human intelligence contribution improves the precision of the automatic approach. We also observe that ZenCrowd dominates the other approaches in most cases.
Experimental Evaluation
• Worker community and textual context
[Figures 6 and 7, repeated: recall vs. precision per document for US vs. India workers; precision per document for simple vs. snippet contexts]
Experimental Evaluation
• Worker selection
[Figures: worker precision vs. number of tasks completed (US and Indian workers, with the top US worker highlighted), and precision vs. top-K workers]
Experimental Evaluation
• Entity linking with ZenCrowd
  – Training with the first 5 entities + 5% afterwards
  – 3 consecutive bad answers lead to blacklisting
0"0.1"0.2"0.3"0.4"0.5"0.6"0.7"0.8"0.9"1"
0" 0.1"0.2"0.3"0.4"0.5"0.6"0.7"0.8"0.9"
Precision)/)Re
call)
Matching)Probability)Threshold)
P"R"
0"0.1"0.2"0.3"0.4"0.5"0.6"0.7"0.8"0.9"1"
1" 2" 3" 4" 5"
Precision)/)Re
call)
Top)N)Results)
P"R"
Figure 5: Performance results (Precision, Recall) forthe automatic approach.
the URIs with at least 2 votes are selected as valid links(we tried various thresholds and manually picked 2 in theend since it leads to the highest precision scores while keep-ing good recall values for our experiments). We report onthe performance of this crowdsourcing technique in Table 2.The values are averaged over all linkable entities for di↵erentdocument types and worker communities.
Table 2: Performance results for crowdsourcing withagreement vote over linkable entities.
US Workers Indian WorkersP R A P R A
GL News 0.79 0.85 0.77 0.60 0.80 0.60US News 0.52 0.61 0.54 0.50 0.74 0.47IN News 0.62 0.76 0.65 0.64 0.86 0.63SW News 0.69 0.82 0.69 0.50 0.69 0.56All News 0.74 0.82 0.73 0.57 0.78 0.59
The first question we examine is whether there is a di↵er-ence in reliability between the various populations of work-ers. In Figure 6 we show the performance for tasks per-formed by workers located in USA and India (each pointcorresponds to the average precision and recall over all en-tities in one document). On average, we observe that tasksperformed by workers located in the USA lead to higherprecision values. As we can see in Table 2, Indian workersobtain higher precision and recall on local Indian news ascompared to US workers. The biggest di↵erence in terms ofaccuracy between the two communities can be observed onthe global interest news.
0"0.1"0.2"0.3"0.4"0.5"0.6"0.7"0.8"0.9"1"
0" 0.2" 0.4" 0.6" 0.8" 1"
Recall&
Precision&
US"India"
Figure 6: Per document task e↵ectiveness.
A second question we examine is how the textual contextgiven for an entity influences the worker performance. InFigure 7, we compare the tasks for which only the entitylabel is given to those for which a context consisting of all
the sentences containing the entity are shown to the worker(snippets). Surprisingly, we could not observe a significantdi↵erence in e↵ectiveness caused by the di↵erent textual con-texts given to the workers. Thus, we focus on only one typeof context for the remaining experiments (we always givethe snippet context).
0"0.1"0.2"0.3"0.4"0.5"0.6"0.7"0.8"0.9"1"
1" 2" 3" 4" 5" 6" 7" 8" 9" 10"
Precision)
Document)
Simple"
Snippet"
Figure 7: Crowdsourcing results with two di↵erenttextual contexts
Entity Linking with ZenCrowd.We now focus on the performance of the probabilistic in-
ference network as proposed in this paper. We consider themethod described in Section 4, with an initial training phaseconsisting of 5 entities, and a second, continuous trainingphase, consisting of 5% of the other entities being o↵ered tothe workers (i.e., the workers are given a task whose solutionis known by the system every 20 tasks on average).In order to reduce the number of tasks having little influ-
ence in the final results, a simple technique of blacklistingof bad workers is used. A bad worker (who can be consid-ered as a spammer) is a worker who randomly and rapidlyclicks on the links, hence generating noise in our system.In our experiments, we consider that 3 consecutive bad an-swers in the training phase is enough to identify the workeras a spammer and to blacklist him/her. We report the aver-age results of ZenCrowd when exploiting the training phase,constraints, and blacklisting in Table 3. As we can observe,precision and accuracy values are higher in all cases whencompared to the agreement vote approach.
Table 3: Performance results for crowdsourcing withZenCrowd over linkable entities.
US Workers Indian WorkersP R A P R A
GL News 0.84 0.87 0.90 0.67 0.64 0.78US News 0.64 0.68 0.78 0.55 0.63 0.71IN News 0.84 0.82 0.89 0.75 0.77 0.80SW News 0.72 0.80 0.85 0.61 0.62 0.73All News 0.80 0.81 0.88 0.64 0.62 0.76
Finally, we compare ZenCrowd to the state of the artcrowdsourcing approach (using the optimal agreement vote)and our best automatic approach on a per-task basis in Fig-ure 8. The comparison is given for each document in thetest collection. We observe that in most cases the humanintelligence contribution improves the precision of the auto-matic approach. We also observe that ZenCrowd dominates
Comparing 3 matching techniques
• ZenCrowd best for 75% of documents
0"
0.1"
0.2"
0.3"
0.4"
0.5"
0.6"
0.7"
0.8"
0.9"
1"
1" 2" 3" 4" 5" 6" 7" 8" 9" 10"11"12"13"14"15"16"17"18"19"20"21"22"23"24"25"
Precision)
)
Document)
Agr."Vote"
ZenCrowd"
Top"1"
Simple Crowdsourcing ZenCrowd AutomaFc
Lessons Learnt
• Crowdsourcing + probabilistic reasoning works!
• But:
  – Different worker communities perform differently
  – Many low-quality workers
  – No differences with different textual contexts
  – Completion time may vary (based on reward)
Conclusions
• ZenCrowd: probabilistic reasoning over automatic and crowdsourcing methods for entity linking
• Standard crowdsourcing improves 6% over the automatic approach
• ZenCrowd improves 4%–35% over standard crowdsourcing
• ZenCrowd improves 14% on average over automatic approaches
• Next steps
  – Long-term worker behavior analysis
  – More efficient and effective linking via LOD dataset pre-selection
http://diuf.unifr.ch/xi/zencrowd/