Indexing Techniques for Scalable Record Linkage and Deduplication



Page 1: Indexing Techniques for Scalable Record Linkage and Deduplication

Indexing Techniques for Scalable Record Linkage and Deduplication

Pradeeban Kathiravelu

INESC-ID Lisboa

Instituto Superior Técnico, Universidade de Lisboa

Lisbon, Portugal

Data Quality – Presentation 3

April 14, 2015

Pradeeban Kathiravelu (IST-ULisboa) Record Linkage 1 / 18

Page 2: Indexing Techniques for Scalable Record Linkage and Deduplication

Introduction

Introduction

Matching.

Approach known as:

Data or Record Linkage.

Data or Field Matching.

The Merge/Purge Problem.

Data sets too large to fit in main memory.

Corrupted incoming data requires complex matching tests.

Accuracy is more important than missing data.


Page 3: Indexing Techniques for Scalable Record Linkage and Deduplication

Introduction

Matching Records

{Data|Record} Linkage | {Data|Field} Matching

Page 4: Indexing Techniques for Scalable Record Linkage and Deduplication

Introduction

Motivation

Linked Data

Improving data quality and integrity.

Allowing re-use of existing data sources.

Reducing costs and efforts in data acquisition.

Multiple Domains

Fraud and crime detection.

Pervasive health systems.

Enterprise business systems.


Page 5: Indexing Techniques for Scalable Record Linkage and Deduplication

Introduction

Indexing in Record Linkage


Page 6: Indexing Techniques for Scalable Record Linkage and Deduplication

Record Linkage

Record Linkage Approaches

Blocking

Records with similar values are placed in the same block.

Blocking key.

Block-size trade-off: false negatives (missed matches) vs. comparison cost.

Blocking Keys

Number of true matches among the candidate record pairs ⇑ (maximize).

Total number of candidate pairs ⇓ (minimize).
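
The blocking idea and the two goals above can be illustrated with a minimal sketch. It is not taken from the slides: the record fields, the key definition (surname prefix plus postcode), and the sample data are all hypothetical. The two measures computed at the end are the ones usually called pairs completeness and reduction ratio in the record linkage literature.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record):
    # Hypothetical key: first three letters of the surname plus the postcode.
    return record["surname"][:3].lower() + record["postcode"]


def candidate_pairs(records):
    # Group record ids by blocking key value.
    blocks = defaultdict(list)
    for rec_id, record in records.items():
        blocks[blocking_key(record)].append(rec_id)
    # Only records that share a block are compared with each other.
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Made-up sample data; records 1, 2 and 3 are assumed to be the same person.
records = {
    1: {"surname": "Smith", "postcode": "2600"},
    2: {"surname": "Smithe", "postcode": "2600"},
    3: {"surname": "Smyth", "postcode": "2600"},
    4: {"surname": "Jones", "postcode": "2601"},
}
true_matches = {(1, 2), (1, 3), (2, 3)}

pairs = candidate_pairs(records)
all_pairs = len(records) * (len(records) - 1) // 2

# Share of true matches that survive blocking (should go up).
pairs_completeness = len(pairs & true_matches) / len(true_matches)
# Share of comparisons avoided by blocking (fewer candidate pairs is better).
reduction_ratio = 1 - len(pairs) / all_pairs

print(pairs, pairs_completeness, reduction_ratio)
```

With this key, the misspelled surname "Smyth" lands in a different block, so two of the three true matches are lost. This is the false-negative side of the trade-off; a looser key would recover them at the price of more candidate pairs.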


Page 7: Indexing Techniques for Scalable Record Linkage and Deduplication

Record Linkage

Research Avenues

Scaling to large data sets.

While keeping a high linkage quality.

Development of techniques that can learn optimal blocking key definitions.

Manual ⇒ supervised machine learning based approaches.

Machine learning approaches leveraging:

Predicate-based formulations of learnable blocking functions.

The sequential covering algorithm, which discovers disjunctive sets of rules.


Page 8: Indexing Techniques for Scalable Record Linkage and Deduplication

Evaluation

Evaluation

Evaluation Framework

Febrl (Freely Extensible Biomedical Record Linkage).

Developed in Python - https://sourceforge.net/projects/febrl/

Data standardisation (segmentation and cleaning).

Probabilistic record linkage ("fuzzy" matching).

Data Sets

SecondString Toolkit.

Developed in Java - http://secondstring.sourceforge.net/

Approximate string-matching techniques.

Census, bibliographic, restaurant, and CD records.


Page 9: Indexing Techniques for Scalable Record Linkage and Deduplication

Evaluation

Indexing Techniques


Page 10: Indexing Techniques for Scalable Record Linkage and Deduplication

Sorted-Neighborhood

Sorted-Neighborhood method

Partition the data.

Sort the partitions before the matching, using the most important BKV (blocking key value).

Corrupted keys?

Approach:

Create Keys

Sort Data

Merge
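
The three steps above can be sketched as follows. This is a minimal illustration rather than the implementation evaluated in the slides: the record fields, the sorting key (surname prefix plus first-name initial), and the sample data are hypothetical, and the merge step is reduced to generating the candidate pairs that fall inside a sliding window of fixed width.

```python
def sorting_key(record):
    # Step 1 (create keys). Hypothetical key: surname prefix + first-name initial.
    return (record["surname"][:3] + record["first"][:1]).lower()


def sorted_neighborhood_pairs(records, window=3):
    # Step 2 (sort data): order record ids by key; ties are broken by id.
    ordered = sorted(records, key=lambda rec_id: (sorting_key(records[rec_id]), rec_id))
    # Step 3 (merge): slide a window of the given width over the sorted list and
    # pair each record with the window - 1 records that follow it.
    pairs = set()
    for i, left in enumerate(ordered):
        for right in ordered[i + 1 : i + window]:
            pairs.add((min(left, right), max(left, right)))
    return pairs


# Made-up sample data.
records = {
    1: {"first": "Anna", "surname": "Smith"},
    2: {"first": "Ana", "surname": "Smith"},
    3: {"first": "Bob", "surname": "Smyth"},
    4: {"first": "Carl", "surname": "Jones"},
    5: {"first": "Dana", "surname": "Brown"},
}

narrow = sorted_neighborhood_pairs(records, window=2)
wide = sorted_neighborhood_pairs(records, window=3)
print(sorted(narrow))
print(sorted(wide))
```

A wider window produces more candidate pairs and so catches records whose keys sort a few positions apart, for example because of a corrupted key, at the cost of more comparisons. Each candidate pair would still need to be checked by the actual matching rules.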


Page 11: Indexing Techniques for Scalable Record Linkage and Deduplication

Sorted-Neighborhood

Case


Page 12: Indexing Techniques for Scalable Record Linkage and Deduplication

Sorted-Neighborhood

Equational Theory


Page 13: Indexing Techniques for Scalable Record Linkage and Deduplication

Sorted-Neighborhood

Accuracy of Sorted-Neighborhood method


Page 14: Indexing Techniques for Scalable Record Linkage and Deduplication

Sorted-Neighborhood

Clustering Methods vs. SNM


Page 15: Indexing Techniques for Scalable Record Linkage and Deduplication

Sorted-Neighborhood

Memory-based database (13751 records)


Page 16: Indexing Techniques for Scalable Record Linkage and Deduplication

Sorted-Neighborhood

Multiple Processors (1 million records; width = 10)


Page 17: Indexing Techniques for Scalable Record Linkage and Deduplication

Sorted-Neighborhood

Time Performance


Page 18: Indexing Techniques for Scalable Record Linkage and Deduplication

Sorted-Neighborhood

References

Christen, P. (2012). A survey of indexing techniques for scalable record linkage and deduplication. Knowledge and Data Engineering, IEEE Transactions on, 24(9), 1537-1555.

Hernandez, M. A., & Stolfo, S. J. (1995, June). The merge/purge problem for large databases. In ACM SIGMOD Record (Vol. 24, No. 2, pp. 127-138). ACM.

Thank you!
