On Leveraging Crowdsourcing Techniques for Schema Matching Networks
![Page 1: On Leveraging Crowdsourcing Techniques for Schema Matching Networks](https://reader033.vdocuments.net/reader033/viewer/2022060110/555ded1dd8b42a192c8b5a61/html5/thumbnails/1.jpg)
Nguyen Quoc Viet Hung, Nguyen Thanh Tam, Karl Aberer
École Polytechnique Fédérale de Lausanne, Switzerland
Zoltán Miklós
Université de Rennes 1, IRISA, France
DASFAA 2013, Part II, LNCS 7826, pp. 139 – 154, 2013
![Page 2: On Leveraging Crowdsourcing Techniques for Schema Matching Networks](https://reader033.vdocuments.net/reader033/viewer/2022060110/555ded1dd8b42a192c8b5a61/html5/thumbnails/2.jpg)
Database schema matching is an active research field:
- Surveys: [1], [2]
- Applications: data transformation, data migration, data alignment, …
- Automatic matching tools: COMA++, AMC, OpenII, Falcon, …
Schema matching is the task of establishing correspondences that connect related attributes in two (independently developed) database schemas.
[Figure: example correspondences between the attributes of two schemas SA and SB (BirthName, BirthDate, Address).]
[1] Rahm, E. et al. “A Survey of Approaches to Automatic Schema Matching”. VLDB Journal, 2001
[2] Bernstein, P.A. et al. “Generic Schema Matching, Ten Years Later”. PVLDB, 2011
![Page 3: On Leveraging Crowdsourcing Techniques for Schema Matching Networks](https://reader033.vdocuments.net/reader033/viewer/2022060110/555ded1dd8b42a192c8b5a61/html5/thumbnails/3.jpg)
Automatic schema matchers will (sometimes) fail to identify the correct correspondences.
There is a need for post-matching reconciliation through human input. This effort is the « real cost » in the company.
Schemas do not appear alone; they are part of a matching network.
The network-level consistency constraints are very important for business users.
![Page 4: On Leveraging Crowdsourcing Techniques for Schema Matching Networks](https://reader033.vdocuments.net/reader033/viewer/2022060110/555ded1dd8b42a192c8b5a61/html5/thumbnails/4.jpg)
Real‐world scenario: a repository of schemas in the same domain
Schema matching network: connect schemas by pair‐wise matchings
Network‐level consistency constraints
Automatic tools produce incorrect correspondences, which need validation by humans.
![Page 5: On Leveraging Crowdsourcing Techniques for Schema Matching Networks](https://reader033.vdocuments.net/reader033/viewer/2022060110/555ded1dd8b42a192c8b5a61/html5/thumbnails/5.jpg)
![Page 6: On Leveraging Crowdsourcing Techniques for Schema Matching Networks](https://reader033.vdocuments.net/reader033/viewer/2022060110/555ded1dd8b42a192c8b5a61/html5/thumbnails/6.jpg)
![Page 7: On Leveraging Crowdsourcing Techniques for Schema Matching Networks](https://reader033.vdocuments.net/reader033/viewer/2022060110/555ded1dd8b42a192c8b5a61/html5/thumbnails/7.jpg)
- DASFAA’2013, BDA’2013: On Leveraging Crowdsourcing Techniques for Schema Matching Networks
- ER’2013: Minimizing Human Effort in Reconciling Match Networks
- CoopIS’2013: Collaborative Schema Matching Reconciliation
- ICDE’2014: Pay-as-you-go Reconciliation in Schema Matching Networks
![Page 8: On Leveraging Crowdsourcing Techniques for Schema Matching Networks](https://reader033.vdocuments.net/reader033/viewer/2022060110/555ded1dd8b42a192c8b5a61/html5/thumbnails/8.jpg)
“Crowdsourcing is the practice of obtaining needed services, ideas, or content by soliciting contributions from a large group of people, and especially from an online community, rather than from traditional employees or suppliers.” (Wikipedia)
Our context: employ many workers (users) to validate the same correspondences and combine their answers.
Surveys: [1], [2]
A wide range of applications (e.g. CrowdSearch) have been developed on top of more than 70 crowdsourcing platforms (e.g. Amazon Mechanical Turk).
Our contribution:
- Define network-level constraints in schema matching networks
- Design questions for workers to validate correspondences
- Leverage network-level constraints to reduce user effort
[1] E. Law et al. “Human Computation”. Morgan & Claypool Publishers, 2011
[2] A. Doan et al. “Crowdsourcing systems on the World Wide Web”. CACM, 2011
![Page 9: On Leveraging Crowdsourcing Techniques for Schema Matching Networks](https://reader033.vdocuments.net/reader033/viewer/2022060110/555ded1dd8b42a192c8b5a61/html5/thumbnails/9.jpg)
![Page 10: On Leveraging Crowdsourcing Techniques for Schema Matching Networks](https://reader033.vdocuments.net/reader033/viewer/2022060110/555ded1dd8b42a192c8b5a61/html5/thumbnails/10.jpg)
![Page 11: On Leveraging Crowdsourcing Techniques for Schema Matching Networks](https://reader033.vdocuments.net/reader033/viewer/2022060110/555ded1dd8b42a192c8b5a61/html5/thumbnails/11.jpg)
Three elements of a question:
- Asking object: a correspondence
- Possible choices: a simple YES/NO question
- Support information: alternatives, constraint satisfactions, constraint violations
![Page 12: On Leveraging Crowdsourcing Techniques for Schema Matching Networks](https://reader033.vdocuments.net/reader033/viewer/2022060110/555ded1dd8b42a192c8b5a61/html5/thumbnails/12.jpg)
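Answer aggregation: user feedback and user quality feed a probabilistic model (*), which estimates Pr(C) and computes a pair <a, e>: the aggregated answer and its error rate.

User feedback:

| User | Question | Answer |
|---|---|---|
| U1 | C | Yes |
| U2 | C | Yes |
| U3 | C | No |

User quality:

| User | Reliability |
|---|---|
| U1 | r1 |
| U2 | r2 |
| U3 | r3 |

where r1 = Pr(C = true | U1 = yes) = Pr(C = false | U1 = no).

Result:

| Corr | Aggregation | Error Rate |
|---|---|---|
| C | True | 0.19 |

(*) Majority Voting, Expectation Maximization, … See the full paper for details.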
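
The aggregation step can be sketched as follows. This is only an illustration, not the model from the full paper: it assumes independent workers and a uniform prior on C, so that the reliability r_i defined above can be used directly as a likelihood. The function name `aggregate` and the reliability values 0.8 are hypothetical; the slide does not give the values of r1, r2, r3.

```python
from math import prod

def aggregate(answers, reliabilities):
    """Combine yes/no answers about one correspondence C.

    answers[i] is True for "yes" and False for "no"; reliabilities[i] is r_i.
    Assumes independent workers and a uniform prior on C (assumptions of this sketch).
    Returns (aggregated_answer, error_rate).
    """
    # Weight of "C is true": workers who said yes were right, those who said no were wrong.
    weight_true = prod(r if a else 1.0 - r for a, r in zip(answers, reliabilities))
    # Weight of "C is false": the roles are swapped.
    weight_false = prod(1.0 - r if a else r for a, r in zip(answers, reliabilities))
    p_true = weight_true / (weight_true + weight_false)
    decision = p_true >= 0.5
    error_rate = 1.0 - max(p_true, 1.0 - p_true)
    return decision, error_rate

# Hypothetical reliabilities; U1 and U2 answer yes, U3 answers no.
decision, error = aggregate([True, True, False], [0.8, 0.8, 0.8])
print(decision, round(error, 3))
```

With equal reliabilities, one "yes" and one "no" cancel out, so the result depends only on the remaining "yes".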
![Page 13: On Leveraging Crowdsourcing Techniques for Schema Matching Networks](https://reader033.vdocuments.net/reader033/viewer/2022060110/555ded1dd8b42a192c8b5a61/html5/thumbnails/13.jpg)
To achieve higher accuracy, we need more answers: a cost-accuracy trade-off.
[Figure: plot for worker reliability r = 0.6, with the goal marked.]
Solution: leverage constraints to reduce the error rate.
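
The cost-accuracy trade-off can be made concrete with plain majority voting, one of the aggregation techniques named earlier. The sketch below assumes independent workers who all share the reliability r = 0.6 and an odd number of answers. The function names are hypothetical, and the error threshold of 0.1 is borrowed from the experimental settings of the talk.

```python
from math import comb

def majority_vote_error(n, r):
    """Probability that a majority of n independent workers (n odd) is wrong,
    when each worker answers correctly with probability r."""
    # The majority is wrong when at most n // 2 workers are correct.
    return sum(comb(n, k) * r**k * (1.0 - r) ** (n - k) for k in range(n // 2 + 1))

def answers_needed(r, threshold, max_n=999):
    """Smallest odd n whose majority-vote error is below threshold (None if not found)."""
    for n in range(1, max_n + 1, 2):
        if majority_vote_error(n, r) < threshold:
            return n
    return None

for n in (1, 3, 5, 11, 21):
    print(n, round(majority_vote_error(n, 0.6), 3))
print("answers needed for error < 0.1:", answers_needed(0.6, 0.1))
```

The error rate falls slowly as answers are added, which is why lowering it by other means (here, constraints) saves cost.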
![Page 14: On Leveraging Crowdsourcing Techniques for Schema Matching Networks](https://reader033.vdocuments.net/reader033/viewer/2022060110/555ded1dd8b42a192c8b5a61/html5/thumbnails/14.jpg)
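Idea: correspondences support each other if they satisfy a constraint.
1-1 constraint: ONE source attribute matches only ONE target attribute.
[Figure: attribute a of schema S with two candidate correspondences, ab1 and ab2, to attributes b1 and b2 of schema T.]
Pr(ab1 = true) = 0.8, Pr(ab2 = false) = 0.6

| ab1 | ab2 | Prob | 1-1 constraint |
|---|---|---|---|
| T | T | 0.32 | not satisfied |
| T | F | 0.48 | satisfied |
| F | T | 0.08 | satisfied |
| F | F | 0.12 | satisfied |

By independence, each joint probability is a product, e.g. Pr(ab1 = T, ab2 = F) = 0.8 × 0.6 = 0.48.

$$\Pr(ab_2 = \text{false} \mid \text{1-1 constraint}) = \frac{0.48 + 0.12}{0.48 + 0.08 + 0.12} \approx 0.88$$

| Setting | Corr | Aggregation | Error Rate |
|---|---|---|---|
| Without constraint | ab2 | False | 0.4 (*) |
| With constraint | ab2 | False | 0.12 (**) |

0.4 > 0.12: the constraint lowers the error rate.
(*) Error Rate = 1 – Pr(ab2 = false)
(**) Error Rate = 1 – Pr(ab2 = false | 1-1 constraint)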
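
The 1-1 constraint computation can be reproduced by enumerating all truth assignments and conditioning on the ones that satisfy the constraint. This is a minimal sketch that assumes independent correspondences, as the slide does. The function `conditional_on_constraint` and its parameters are hypothetical names introduced for illustration.

```python
from itertools import product

def conditional_on_constraint(marginals, satisfies, query):
    """Pr(query | constraint) by enumerating all truth assignments.

    marginals: dict name -> Pr(correspondence is true), assumed independent.
    satisfies: function(assignment dict) -> bool, the hard constraint.
    query: function(assignment dict) -> bool, the event of interest.
    """
    names = list(marginals)
    numerator = denominator = 0.0
    for values in product([True, False], repeat=len(names)):
        assignment = dict(zip(names, values))
        prob = 1.0
        for name, value in assignment.items():
            prob *= marginals[name] if value else 1.0 - marginals[name]
        if satisfies(assignment):
            denominator += prob
            if query(assignment):
                numerator += prob
    return numerator / denominator

# Values from the slide: Pr(ab1 = true) = 0.8, Pr(ab2 = false) = 0.6.
marginals = {"ab1": 0.8, "ab2": 0.4}

def one_to_one(a):
    # Attribute a may match at most one target attribute.
    return not (a["ab1"] and a["ab2"])

p = conditional_on_constraint(marginals, one_to_one, lambda a: not a["ab2"])
print(round(1.0 - p, 2))  # error rate for ab2 = false with the constraint
```

Exhaustive enumeration is exponential in the number of correspondences, so it only suits small examples like this one.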
![Page 15: On Leveraging Crowdsourcing Techniques for Schema Matching Networks](https://reader033.vdocuments.net/reader033/viewer/2022060110/555ded1dd8b42a192c8b5a61/html5/thumbnails/15.jpg)
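Circle constraint: a sequence of correspondences creates a closed circle.
Δ: probability of compensating errors along the circle (*)
[Figure: schemas S1, S2, S3 with attributes a, b, c, connected in a closed circle by the correspondences ab, bc, ac.]
Pr(ab = T) = 0.8, Pr(ac = T) = 0.8, Pr(bc = T) = 0.8

| ab | bc | ac | Prob | Pr(circle constraint satisfied) |
|---|---|---|---|---|
| T | T | T | 0.512 | 1.0 |
| T | T | F | 0.128 | 0.0 |
| T | F | T | 0.128 | 0.0 |
| T | F | F | 0.032 | Δ |
| F | T | T | 0.128 | 0.0 |
| F | T | F | 0.032 | Δ |
| F | F | T | 0.032 | Δ |
| F | F | F | 0.008 | Δ |

By independence, each joint probability is a product, e.g. Pr(T, T, T) = 0.8 × 0.8 × 0.8 = 0.512.

$$\Pr(ab = T \mid \text{circle constraint}) = \frac{0.512 + \Delta \cdot 0.032}{0.512 + 3 \cdot \Delta \cdot 0.032 + \Delta \cdot 0.008}$$

with Δ = … (value missing in this transcript).

| Setting | Corr | Aggregation | Error Rate |
|---|---|---|---|
| Without constraint | ab | True | 0.2 (**) |
| With constraint | ab | True | 0.027 (***) |

0.2 > 0.027: the constraint lowers the error rate.
(*) Cudré-Mauroux et al. “Probabilistic message passing in peer data management systems”. ICDE 2006.
(**) Error Rate = 1 – Pr(ab = T)
(***) Error Rate = 1 – Pr(ab = T | circle constraint)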
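
The circle-constraint computation can be sketched the same way, by weighting each truth assignment with the probability that the circle still closes. The value of Δ is not legible in this transcript; the sketch uses Δ = 0.2 purely as an assumption, chosen because it reproduces the stated error rate of 0.027. The function name `circle_posterior` is hypothetical.

```python
from itertools import product

def circle_posterior(p_true, delta):
    """Pr(ab = T | circle constraint) for three correspondences ab, bc, ac.

    p_true: marginal probability that each correspondence is correct (independent).
    delta: probability that two or more errors compensate along the circle.
    """
    numerator = denominator = 0.0
    for ab, bc, ac in product([True, False], repeat=3):
        prob = 1.0
        for value in (ab, bc, ac):
            prob *= p_true if value else 1.0 - p_true
        errors = [ab, bc, ac].count(False)
        # No error: the circle closes. Exactly one error: it cannot close.
        # Two or three errors: they compensate with probability delta.
        closes = 1.0 if errors == 0 else 0.0 if errors == 1 else delta
        denominator += prob * closes
        if ab:
            numerator += prob * closes
    return numerator / denominator

DELTA = 0.2  # assumption: the value is not given in the transcript
print(round(1.0 - circle_posterior(0.8, DELTA), 3))  # error rate with the constraint
```

Smaller values of Δ make the constraint more informative: with Δ = 0, only the all-correct assignment survives.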
![Page 16: On Leveraging Crowdsourcing Techniques for Schema Matching Networks](https://reader033.vdocuments.net/reader033/viewer/2022060110/555ded1dd8b42a192c8b5a61/html5/thumbnails/16.jpg)
Settings:
- Real-world schemas.
- Ground truth is used to simulate users/workers.
- Error threshold = 0.1: make a decision when the error rate < 0.1; otherwise, continue to ask users.
- Metric: Cost = …
Observation: Cost (with constraints) < Cost (without constraints)
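
The stopping rule in the settings can be sketched as a small simulation. This is an illustration under stated assumptions, not the experimental code of the paper: workers are simulated from the ground truth with a single reliability value, answers are combined by a Bayesian update from a uniform prior, and constraints are not modelled. The function name, the reliability 0.6, and the number of simulated correspondences are hypothetical.

```python
import random

def ask_until_confident(truth, reliability, threshold, rng, max_answers=10_000):
    """Simulate workers from the ground truth until the error rate drops below threshold.

    Returns (decision, error_rate, number_of_answers).
    """
    p_true = 0.5  # uniform prior, an assumption of this sketch
    error_rate = 0.5
    for count in range(1, max_answers + 1):
        # A simulated worker reports the ground truth with probability `reliability`.
        says_yes = truth if rng.random() < reliability else not truth
        like_true = reliability if says_yes else 1.0 - reliability
        like_false = 1.0 - reliability if says_yes else reliability
        p_true = p_true * like_true / (p_true * like_true + (1.0 - p_true) * like_false)
        error_rate = min(p_true, 1.0 - p_true)
        if error_rate < threshold:
            return p_true >= 0.5, error_rate, count
    return p_true >= 0.5, error_rate, max_answers

rng = random.Random(0)
costs = [ask_until_confident(True, 0.6, 0.1, rng)[2] for _ in range(1000)]
print("average answers per correspondence:", sum(costs) / len(costs))
```

Any technique that lowers the error rate per answer, such as the constraints discussed earlier, would let this loop stop sooner.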
![Page 17: On Leveraging Crowdsourcing Techniques for Schema Matching Networks](https://reader033.vdocuments.net/reader033/viewer/2022060110/555ded1dd8b42a192c8b5a61/html5/thumbnails/17.jpg)
We model a crowdsourcing process for schema matching networks and address two optimization goals: minimize monetary cost and maximize accuracy (minimize error rate).
We design a variety of questions with different support information.
We leverage consistency constraints to reduce the error rate and, in turn, the monetary cost.
![Page 18: On Leveraging Crowdsourcing Techniques for Schema Matching Networks](https://reader033.vdocuments.net/reader033/viewer/2022060110/555ded1dd8b42a192c8b5a61/html5/thumbnails/18.jpg)