Indexing Techniques for Scalable Record Linkage and Deduplication
Pradeeban Kathiravelu
INESC-ID Lisboa
Instituto Superior Tecnico, Universidade de Lisboa
Lisbon, Portugal
Data Quality – Presentation 3
April 14, 2015.
Pradeeban Kathiravelu (IST-ULisboa) Record Linkage 1 / 18
Introduction
Matching.
Approach known as:
Data or Record Linkage.
Data or Field Matching.
The Merge/Purge Problem.
Data sets too large to fit in main memory.
Corrupted incoming data requiring complex matching tests.
Accuracy is more important than missing data.
Matching Records
{Data|Record} Linkage | {Data|Field} Matching
Motivation
Linked Data
Improving data quality and integrity.
Allowing re-use of existing data sources.
Reducing costs and efforts in data acquisition.
Multiple Domains
Fraud and crime detection.
Pervasive health systems.
Enterprise business systems.
Indexing in Record Linkage
Record Linkage
Record Linkage Approaches
Blocking
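Records with similar values are grouped into the same block.
Blocking key: determines which block a record falls into.
Trade-off in block size: false negatives (small blocks) vs. comparison cost (large blocks).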
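The blocking idea above can be sketched in a few lines of Python. This is a minimal illustration, not Febrl's implementation: the record fields and the `blocking_key` function (surname prefix plus postcode) are assumptions made up for the example.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record):
    """Hypothetical blocking key: first three surname letters plus postcode."""
    return record["surname"][:3].lower() + record["postcode"]


def candidate_pairs(records):
    """Group records by blocking key value and pair records within each block."""
    blocks = defaultdict(list)
    for rec_id, record in records.items():
        blocks[blocking_key(record)].append(rec_id)
    pairs = set()
    for ids in blocks.values():
        # Only records that share a block are compared with each other.
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Made-up toy records for illustration only.
records = {
    1: {"surname": "Smith", "postcode": "2600"},
    2: {"surname": "Smithe", "postcode": "2600"},
    3: {"surname": "Smyth", "postcode": "2600"},
    4: {"surname": "Jones", "postcode": "2611"},
}
pairs = candidate_pairs(records)
```

With this key, only records 1 and 2 share a block, so one of the six possible pairs is compared. Record 3 ("Smyth") lands in a different block, which is the kind of false negative a narrow blocking key produces.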
Blocking Keys
No. of true matches in the candidate record pairs ⇑.
Total No. of candidate pairs ⇓.
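The two goals above are commonly quantified as pairs completeness (share of true matches kept) and reduction ratio (share of comparisons avoided), the measure names used in Christen's survey. The sketch below assumes deduplication of a single data set and uses made-up toy pairs; the function names are illustrative.

```python
def pairs_completeness(candidates, true_matches):
    """Fraction of true matching pairs that survive blocking (higher is better)."""
    if not true_matches:
        return 1.0
    return len(candidates & true_matches) / len(true_matches)


def reduction_ratio(candidates, num_records):
    """Fraction of all possible record pairs removed by blocking (higher is better)."""
    total_pairs = num_records * (num_records - 1) // 2
    if total_pairs == 0:
        return 0.0
    return 1.0 - len(candidates) / total_pairs


# Hypothetical toy data: four records, two true matches, one found by blocking.
candidates = {(1, 2)}
true_matches = {(1, 2), (1, 3)}
pc = pairs_completeness(candidates, true_matches)
rr = reduction_ratio(candidates, num_records=4)
```

A good blocking key pushes both values towards 1; in practice raising one tends to lower the other.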
Research Avenues
Scaling to large data sets.
While keeping a high linkage quality.
Development of techniques that can learn optimal blocking key definitions.
Manual ⇒ Supervised machine learning based approaches.
Machine learning approaches leveraging:
Predicate-based formulations of learnable blocking functions.
The sequential covering algorithm, which discovers disjunctive sets of rules.
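To make the sequential covering idea concrete, here is a simplified greedy sketch: it repeatedly picks the blocking predicate that covers the most still-uncovered matching pairs, and the learned scheme is the disjunction of the chosen predicates. It is not the exact algorithm from the literature; the predicates, the `max_negative_rate` threshold, and the training pairs are all assumptions for illustration.

```python
def same_surname_prefix(a, b):
    """Hypothetical predicate: first three surname letters agree."""
    return a["surname"][:3].lower() == b["surname"][:3].lower()


def same_postcode(a, b):
    """Hypothetical predicate: postcodes agree exactly."""
    return a["postcode"] == b["postcode"]


def learn_blocking_scheme(predicates, matches, non_matches, max_negative_rate=0.5):
    """Greedily select predicates until every labelled match is covered."""
    uncovered = list(matches)
    remaining = dict(predicates)
    chosen = []
    while uncovered and remaining:
        best_name, best_score = None, 0
        for name, pred in remaining.items():
            # Skip predicates that also fire on too many known non-matches.
            negatives = sum(pred(a, b) for a, b in non_matches)
            if negatives / max(len(non_matches), 1) > max_negative_rate:
                continue
            score = sum(pred(a, b) for a, b in uncovered)
            if score > best_score:
                best_name, best_score = name, score
        if best_name is None:
            break  # no admissible predicate covers any remaining match
        pred = remaining.pop(best_name)
        chosen.append(best_name)
        uncovered = [(a, b) for a, b in uncovered if not pred(a, b)]
    return chosen


def rec(surname, postcode):
    return {"surname": surname, "postcode": postcode}


# Made-up labelled training pairs.
matches = [
    (rec("Smith", "2600"), rec("Smyth", "2600")),
    (rec("Jones", "2611"), rec("Jones", "2617")),
    (rec("Brown", "2000"), rec("Brown", "2000")),
]
non_matches = [
    (rec("Smith", "2600"), rec("Jones", "2611")),
    (rec("Brown", "2000"), rec("Green", "2000")),
]
predicates = {
    "same_surname_prefix": same_surname_prefix,
    "same_postcode": same_postcode,
}
scheme = learn_blocking_scheme(predicates, matches, non_matches)
```

On this toy data neither predicate covers all three matches alone, so the sketch selects both, giving the disjunctive rule "same surname prefix OR same postcode".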
Evaluation
Evaluation Framework
Febrl (Freely Extensible Biomedical Record Linkage).
Developed in Python - https://sourceforge.net/projects/febrl/
Data standardisation (segmentation and cleaning).
Probabilistic record linkage ("fuzzy" matching).
Data Sets
SecondString Toolkit.
Developed in Java - http://secondstring.sourceforge.net/
Approximate string-matching techniques.
Census, bibliographic, restaurant, and CD records.
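As an illustration of the approximate ("fuzzy") string matching mentioned above, the sketch below computes a Levenshtein edit distance and normalises it to a similarity score. It is written in Python for consistency with Febrl and does not reproduce the API of SecondString or Febrl; the function names are assumptions.

```python
def edit_distance(s, t):
    """Levenshtein distance, computed with a rolling row of the DP table."""
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        current = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            current.append(min(
                previous[j] + 1,         # delete a character from s
                current[j - 1] + 1,      # insert a character into s
                previous[j - 1] + cost,  # substitute (or keep) a character
            ))
        previous = current
    return previous[-1]


def similarity(s, t):
    """Normalise the edit distance to a similarity in [0, 1]."""
    longest = max(len(s), len(t))
    return 1.0 if longest == 0 else 1.0 - edit_distance(s, t) / longest
```

For example, `similarity("smith", "smyth")` is based on a single substitution over five characters, so two such surnames can be treated as a likely match when the score exceeds a chosen threshold.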
Indexing Techniques
Sorted-Neighborhood
Sorted-Neighborhood method
Partition the data.
Sort the partitions before the matching, using the most important BKV (blocking key value).
Corrupted keys?
Approach:
Create Keys
Sort Data
Merge
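The three steps above (create keys, sort data, merge) can be sketched as a sliding window over the key-sorted records. This is a minimal illustration under stated assumptions, not the authors' implementation: the `sorting_key` definition, the record fields, and the window size are made up for the example.

```python
def sorting_key(record):
    """Hypothetical key: surname prefix, first initial, then postcode."""
    return (record["surname"][:3] + record["given"][:1] + record["postcode"]).lower()


def sorted_neighborhood(records, window=3):
    """Return candidate pairs from a sliding window over the key-sorted records."""
    # Steps 1 and 2: create a key per record and sort record ids by that key.
    ordered = sorted(records, key=lambda rec_id: sorting_key(records[rec_id]))
    # Step 3 (merge): compare each record with the next (window - 1) records.
    pairs = set()
    for i, rec_id in enumerate(ordered):
        for other in ordered[i + 1:i + window]:
            pairs.add(tuple(sorted((rec_id, other))))
    return pairs


# Made-up toy records for illustration only.
records = {
    1: {"surname": "Smith", "given": "John", "postcode": "2600"},
    2: {"surname": "Smyth", "given": "John", "postcode": "2600"},
    3: {"surname": "Jones", "given": "Mary", "postcode": "2611"},
    4: {"surname": "Smith", "given": "Jon", "postcode": "2600"},
    5: {"surname": "Brown", "given": "Al", "postcode": "2000"},
}
pairs = sorted_neighborhood(records, window=3)
```

Unlike exact-key blocking, records with slightly different keys ("smi..." and "smy...") can still fall inside the same window. A key corrupted in its leading characters sorts far away, though, which is the concern the "Corrupted keys?" bullet raises.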
Case
Equational Theory
Accuracy of Sorted-Neighborhood method
Clustering Methods vs. SNM
Memory-based database (13751 records)
Multiple Processors (1 million records; width = 10)
Time Performance
References
Christen, P. (2012). A survey of indexing techniques for scalable record linkage and deduplication. Knowledge and Data Engineering, IEEE Transactions on, 24(9), 1537-1555.
Hernandez, M. A., & Stolfo, S. J. (1995, June). The merge/purge problem for large databases. In ACM SIGMOD Record (Vol. 24, No. 2, pp. 127-138). ACM.
Thank you!