Indexing Techniques for Scalable Record Linkage and Deduplication

Pradeeban Kathiravelu

INESC-ID Lisboa

Instituto Superior Tecnico, Universidade de Lisboa

Lisbon, Portugal

Data Quality – Presentation 3

April 14, 2015.

Pradeeban Kathiravelu (IST-ULisboa) Record Linkage 1 / 18

Introduction

Matching.

Approach known as:

Data or Record Linkage.

Data or Field Matching.

The Merge/Purge Problem.

Data sets too large to fit in main memory.

Corrupted incoming data requires complex matching tests.

Accuracy matters more than missing data.


Matching Records

{Data|Record} Linkage | {Data|Field} Matching

Motivation

Linked Data

Improving data quality and integrity.

Allowing re-use of existing data sources.

Reducing costs and efforts in data acquisition.

Multiple Domains

Fraud and crime detection.

Pervasive health systems.

Enterprise business systems.


Indexing in Record Linkage


Record Linkage

Record Linkage Approaches

Blocking

Records with similar values are grouped into the same block.

Blocking key.

Block-size trade-off: small blocks risk more false negatives (missed matches); large blocks increase the comparison cost.

Blocking Keys

Number of true matches among the candidate record pairs ⇑.

Total number of candidate pairs ⇓.
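The two goals above can be made concrete with a small sketch. The blocking key (first letter of the surname plus the postcode), the sample records, and the function names are all hypothetical and chosen only for illustration. Pairs completeness and reduction ratio are the usual measures for the two goals: the share of true matches kept, and the share of comparisons avoided.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record):
    # Hypothetical blocking key: first letter of the surname + postcode.
    return record["surname"][:1].lower() + record["postcode"]


def candidate_pairs(records, key_func):
    # Group record ids by blocking key value, then pair ids within each block.
    blocks = defaultdict(list)
    for rec_id, record in records.items():
        blocks[key_func(record)].append(rec_id)
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def pairs_completeness(candidates, true_matches):
    # Fraction of the true matches that survive blocking (higher is better).
    return len(candidates & true_matches) / len(true_matches)


def reduction_ratio(candidates, n_records):
    # Fraction of all possible pairs that blocking avoids (higher is better).
    total_pairs = n_records * (n_records - 1) // 2
    return 1 - len(candidates) / total_pairs


# Made-up sample data for illustration only.
records = {
    1: {"surname": "Smith", "postcode": "2600"},
    2: {"surname": "Smyth", "postcode": "2600"},
    3: {"surname": "Jones", "postcode": "2600"},
    4: {"surname": "Smith", "postcode": "2914"},
    5: {"surname": "Jonas", "postcode": "2600"},
}
true_matches = {(1, 2), (3, 5), (1, 4)}

pairs = candidate_pairs(records, blocking_key)
pc = pairs_completeness(pairs, true_matches)
rr = reduction_ratio(pairs, len(records))
print(sorted(pairs), round(pc, 2), round(rr, 2))
```

In this toy data the key misses the true match (1, 4) because the postcodes differ, which is the false-negative side of the trade-off.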


Research Avenues

Scaling to large data sets.

While keeping a high linkage quality.

Development of techniques that can learn optimal blocking key definitions.

Manual ⇒ Supervised machine learning based approaches.

Machine learning approaches leveraging:

Predicate-based formulations of learnable blocking functions.

The sequential covering algorithm, which discovers disjunctive sets of rules.
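As a rough illustration of learning a disjunctive blocking rule from labelled pairs, the sketch below uses a simplified greedy covering loop. It is not the published algorithm: the predicates, the `max_noise` threshold, the sample records, and all names are assumptions made for this example. Each predicate maps a record to a key value, and a pair is "covered" when both records yield the same value.

```python
def covered(pairs, records, predicate):
    # Pairs whose two records receive the same value from the predicate.
    return {(a, b) for a, b in pairs if predicate(records[a]) == predicate(records[b])}


def learn_blocking_rule(records, matches, non_matches, predicates, max_noise=0):
    # Greedily pick predicates until every labelled match is covered.
    # A predicate is skipped if it covers more than max_noise non-matches.
    uncovered = set(matches)
    remaining = dict(predicates)
    rule = []  # disjunction: a pair is a candidate if ANY chosen predicate agrees
    while uncovered and remaining:
        best_name, best_gain = None, set()
        for name, predicate in remaining.items():
            if len(covered(non_matches, records, predicate)) > max_noise:
                continue
            gain = covered(uncovered, records, predicate)
            if len(gain) > len(best_gain):
                best_name, best_gain = name, gain
        if best_name is None:
            break  # no usable predicate covers any remaining match
        rule.append(best_name)
        uncovered -= best_gain
        del remaining[best_name]
    return rule, uncovered


# Made-up training data for illustration only.
people = {
    1: {"surname": "smith", "given": "john", "postcode": "2600"},
    2: {"surname": "smyth", "given": "john", "postcode": "2600"},
    3: {"surname": "jones", "given": "mary", "postcode": "2914"},
    4: {"surname": "jones", "given": "marie", "postcode": "2914"},
    5: {"surname": "brown", "given": "john", "postcode": "2600"},
}
labelled_matches = {(1, 2), (3, 4)}
labelled_non_matches = {(1, 5), (2, 5), (1, 3)}
candidate_predicates = {
    "same_postcode": lambda r: r["postcode"],
    "same_surname": lambda r: r["surname"],
    "same_given_prefix3": lambda r: r["given"][:3],
    "same_surname2_and_given": lambda r: (r["surname"][:2], r["given"]),
}

learned_rule, missed = learn_blocking_rule(
    people, labelled_matches, labelled_non_matches, candidate_predicates
)
print(learned_rule, missed)
```

Here `same_postcode` and `same_given_prefix3` are rejected because they also group labelled non-matches, so the learned disjunction combines the two more selective predicates.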


Evaluation

Evaluation Framework

Febrl (Freely Extensible Biomedical Record Linkage).

Developed in Python - https://sourceforge.net/projects/febrl/

Data standardisation (segmentation and cleaning).

Probabilistic record linkage ("fuzzy" matching).

Data Sets

SecondString Toolkit.

Developed in Java - http://secondstring.sourceforge.net/

Approximate string-matching techniques.

Census, bibliographic, restaurant, and CD records.
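To give a feel for approximate string matching, here is a minimal Python sketch of one well-known measure, the Levenshtein edit distance, with a normalised similarity derived from it. It does not use or reproduce the SecondString API; the function names are hypothetical.

```python
def edit_distance(s, t):
    # Levenshtein distance via dynamic programming, keeping one row at a time.
    previous = list(range(len(t) + 1))
    for i, char_s in enumerate(s, start=1):
        current = [i]
        for j, char_t in enumerate(t, start=1):
            cost = 0 if char_s == char_t else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def similarity(s, t):
    # Normalise the distance to [0, 1]; 1.0 means identical strings.
    longest = max(len(s), len(t))
    return 1.0 if longest == 0 else 1 - edit_distance(s, t) / longest


print(edit_distance("smith", "smyth"), similarity("smith", "smyth"))
```

A record pair would typically be compared field by field with a measure like this, with the resulting scores fed to the match decision.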


Indexing Techniques


Sorted-Neighborhood

Sorted-Neighborhood method

Partition the data.

Sort the partitions before the matching, using the most important blocking key value (BKV).

Corrupted keys?

Approach:

Create Keys

Sort Data

Merge
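The three steps above can be sketched as follows. This is a minimal single-pass illustration, assuming a hypothetical sorting key (first three letters of the surname plus the first letter of the given name) and made-up records; the window width is a parameter.

```python
def make_key(record):
    # Step 1 (create keys): hypothetical key built from two fields.
    return (record["surname"][:3] + record["given"][:1]).lower()


def sorted_neighborhood_pairs(records, key_func, window=3):
    # Step 2 (sort data): order record ids by key, ties broken by id.
    ordered = sorted(records, key=lambda rec_id: (key_func(records[rec_id]), rec_id))
    # Step 3 (merge): slide a fixed-size window over the sorted list and
    # pair each record with the next (window - 1) records.
    pairs = set()
    for i, rec_id in enumerate(ordered):
        for other in ordered[i + 1 : i + window]:
            pairs.add((min(rec_id, other), max(rec_id, other)))
    return pairs


# Made-up sample data for illustration only.
customers = {
    1: {"surname": "Smith", "given": "John"},
    2: {"surname": "Smyth", "given": "John"},
    3: {"surname": "Smith", "given": "Jon"},
    4: {"surname": "Jones", "given": "Mary"},
    5: {"surname": "Smithe", "given": "J"},
}

narrow = sorted_neighborhood_pairs(customers, make_key, window=2)
wide = sorted_neighborhood_pairs(customers, make_key, window=3)
print(sorted(narrow))
print(sorted(wide))
```

A wider window catches more pairs whose keys sort close together but not adjacent, at the cost of more comparisons. A key corrupted in its first characters can still sort a record far from its true match, which is the "corrupted keys" concern raised on the slide.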


Case


Equational Theory


Accuracy of Sorted-Neighborhood method


Clustering Methods vs. SNM


Memory-based database (13751 records)


Multiple Processors (1 million records; width = 10)


Time Performance


References

Christen, P. (2012). A survey of indexing techniques for scalable record linkage and deduplication. Knowledge and Data Engineering, IEEE Transactions on, 24(9), 1537-1555.

Hernandez, M. A., & Stolfo, S. J. (1995, June). The merge/purge problem for large databases. In ACM SIGMOD Record (Vol. 24, No. 2, pp. 127-138). ACM.

Thank you!

