Record Linkage in a Distributed Environment

Huang Yipeng
Wing group meeting, 11 March 2011

Introduction

Record Linkage

Determining whether pairs of personal records refer to the same entity.

E.g. distinguishing between data belonging to <Yipeng, author of this presentation> and <Yipeng, son of PM Lee>.

The Distributed Environment

Why?
◦ Dealing with large data
◦ Limitation of blocking

Advantages
◦ Parallel computation
◦ Data source flexibility
◦ Complementary to blocking methods

Pairwise comparison cost: O(nC2)

[Figure: records grouped by first name (Amanda, Beverley, Katherine), with Amanda repeated many times]

The Distributed Environment

MapReduce
◦ Distributed environment for large data sets

Hadoop
◦ Open source implementation
◦ Convenient model for scaling Record Linkage
◦ Protects users from system level concerns

Research Problem

Disconnect between the generic parallel framework and the specific Record Linkage problem.

The goal: tailor Hadoop for Record Linkage tasks.

Outline

Introduction
Related Work
Methodology
Evaluation
Conclusion

Related Work

Record Linkage Literature
◦ Blocking techniques

Parallel Record Linkage Literature
◦ P-Febrl (P Christen 2003)
◦ P-Swoosh (H Kawai 2006)
◦ Parallel Linkage (H Kim 2007)

Hadoop Literature
◦ Evaluation Metrics
◦ Pairwise comparisons (T Elsayed 2008)



Methodology

MapReduce Workflow

[Figure: MapReduce workflow diagram; labelled component: Partitioner]

Implementation

Map
Purpose:
◦ Parallelism
◦ Data manipulation
◦ Blocking
Reads lines of input and outputs <key, value> pairs.

Reduce
Purpose:
◦ Parallelism
◦ Record Linkage ops
Records with the same <key> are handled in the same Reduce() call.
Outputs linkage results.
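The Map and Reduce roles above can be illustrated with a small single-process simulation. This is only a sketch and not the Hadoop job from the talk: the comma-separated line layout, the choice of first name as the blocking key, and the helper names `map_record`, `reduce_block`, `is_match` and `run` are all assumptions made for illustration, and the matching rule is a toy exact-field comparison.

```python
from collections import defaultdict
from itertools import combinations

def map_record(line):
    """Map: parse one input line and emit a <key, value> pair.
    Assumption: field 0 is a record id and field 1 (first name) is the blocking key."""
    fields = [f.strip() for f in line.split(",")]
    rec_id, first_name = fields[0], fields[1]
    return first_name, (rec_id, fields[1:])

def is_match(a, b, threshold=0.8):
    """Toy comparison (assumption): fraction of positions with identical field values."""
    same = sum(1 for x, y in zip(a, b) if x == y)
    return same / max(len(a), len(b)) >= threshold

def reduce_block(key, records):
    """Reduce: compare every pair of records that share the same key."""
    links = []
    for (id_a, a), (id_b, b) in combinations(records, 2):
        if is_match(a, b):
            links.append((id_a, id_b))
    return links

def run(lines):
    """Group map output by key (standing in for the shuffle), then reduce each group."""
    groups = defaultdict(list)
    for line in lines:
        key, value = map_record(line)
        groups[key].append(value)
    results = []
    for key, records in groups.items():
        results.extend(reduce_block(key, records))
    return results
```

Because only records sharing a key meet in the same reduce call, the key chosen in the map step acts as the blocking step.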

Hash Partitioner

Default implementation: Hash(Key) mod N
Good for uniform data but not for skewed distributions

[Figure: Reduce task list for Job x, by node]

Name      Distribution  Comparisons
joshua    5000          12497500
emiily    48            1128
jack      35            595
thomas    33            528
lachlan   32            496
benjamin  31            465

5416986 comparisons

210 comparisons
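A minimal sketch of why Hash(Key) mod N struggles with skew, using the name counts from the table above. `zlib.crc32` is only a deterministic stand-in for the real hash function, and the function names are hypothetical. Workload is measured in pairwise comparisons, n(n-1)/2 per name, rather than in records.

```python
import zlib

def hash_partition(key, num_nodes):
    """Hash(Key) mod N, with crc32 as a stand-in hash (assumption)."""
    return zlib.crc32(key.encode("utf-8")) % num_nodes

def comparisons(n):
    """Pairwise comparisons within a block of n records: n choose 2."""
    return n * (n - 1) // 2

def node_workloads(name_counts, num_nodes):
    """Total comparison workload that lands on each node under hash partitioning."""
    load = {node: 0 for node in range(num_nodes)}
    for name, count in name_counts.items():
        load[hash_partition(name, num_nodes)] += comparisons(count)
    return load

name_counts = {
    "joshua": 5000, "emiily": 48, "jack": 35,
    "thomas": 33, "lachlan": 32, "benjamin": 31,
}
workloads = node_workloads(name_counts, num_nodes=4)
```

Whichever node receives "joshua" carries at least 12497500 comparisons, while the other five names together contribute only 3212, so adding nodes does not shorten the reduce phase.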

Record Linkage Partitioner

Goal: have all nodes finish the reduce phase at the same time.
Attain a better runtime while retaining the same level of accuracy.

Domain principles

Counting pairwise comparisons gives a more accurate picture of the true computational workload.

The distribution of names tends to follow a power law in many countries (D Zanette 2001), (S Miyazima 2000).

Record Linkage Workflow

Round 1: Range partition based on comparison workload
Round 2: Merge lost comparisons from Round 1
Round 3: Remove cross duplicates

Round 1

[Figure: Input → Map Phase → Distribution]

1. Calculate the average comparison workload over N nodes.
2. Check if a record will exceed the average. If yes, divide by the minimum number of nodes needed to drop below it.
3. Assign records to nodes and update the average comparison workload to reflect lost comparisons, if any.
4. Recurse until the comparison load can be evenly distributed among nodes.
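One possible reading of the four steps above, sketched in Python. The slides do not give the exact procedure, so several things here are assumptions: blocks are represented only by their record counts, an oversized block is split into near-equal sub-blocks, and the final assignment uses a greedy "largest block to least-loaded node" rule as a stand-in for the range partitioning. The names `balance` and `split_evenly` are hypothetical.

```python
import heapq

def comparisons(n):
    """Pairwise comparisons within a block of n records: n choose 2."""
    return n * (n - 1) // 2

def split_evenly(n, k):
    """Split n records into k sub-blocks whose sizes differ by at most one."""
    base, extra = divmod(n, k)
    return [base + 1] * extra + [base] * (k - extra)

def balance(block_sizes, num_nodes):
    """Return (per-node comparison loads, comparisons lost by splitting)."""
    blocks = list(block_sizes)
    lost = 0
    while True:
        # Step 1: average comparison workload over the nodes.
        avg = sum(comparisons(n) for n in blocks) / num_nodes
        # Step 2: find blocks whose workload exceeds the average.
        oversized = [n for n in blocks if comparisons(n) > avg]
        if not oversized:
            break
        n = max(oversized)
        k = 2  # smallest number of sub-blocks that drops each below the average
        while comparisons(-(-n // k)) > avg:  # -(-n // k) is ceil(n / k)
            k += 1
        parts = split_evenly(n, k)
        # Step 3: cross sub-block comparisons are lost and deferred to Round 2.
        lost += comparisons(n) - sum(comparisons(p) for p in parts)
        blocks.remove(n)
        blocks.extend(parts)
        # Step 4: loop again with the updated average.
    # Assumed assignment rule: largest block goes to the least-loaded node.
    heap = [(0, node) for node in range(num_nodes)]
    heapq.heapify(heap)
    for n in sorted(blocks, reverse=True):
        load, node = heapq.heappop(heap)
        heapq.heappush(heap, (load + comparisons(n), node))
    loads = [0] * num_nodes
    for load, node in heap:
        loads[node] = load
    return loads, lost
```

For block sizes 5000, 48, 35, 33, 32 and 31 on 2 nodes, this sketch splits the 5000-record block into two halves of 2500, which defers 2500 × 2500 = 6250000 cross comparisons to a later round.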

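Round 2

[Figure: List X is split into sub-lists A and B. In the comparison matrix over A and B, the A-A and B-B cells are labelled R1 and the A-B cell is labelled R2.]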

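Round 2

[Figure: comparison matrices. With sub-lists A and B, the A-B cell is Job 1. With sub-lists A, B and C, the A-B cell is Job 1, the A-C cell is Job 2 and the B-C cell is Job 3.]

1. Only acts on lost comparisons.
2. Because the input is indistinct, a 3rd round of deduplication may be needed.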
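The lost comparisons are the cross pairs between sub-lists that were separated in Round 1. A minimal sketch of enumerating them as one job per pair of sub-lists follows; the function name `round2_jobs` and the dictionary layout are assumptions for illustration, not the talk's implementation.

```python
from itertools import combinations, product

def round2_jobs(sub_lists):
    """Build one job per unordered pair of sub-lists.
    sub_lists maps a sub-list name to its record ids; each job holds
    every cross pair between the two sub-lists it covers."""
    jobs = {}
    for (name_a, recs_a), (name_b, recs_b) in combinations(sub_lists.items(), 2):
        jobs[(name_a, name_b)] = list(product(recs_a, recs_b))
    return jobs

jobs = round2_jobs({"A": [1, 2], "B": [3], "C": [4, 5]})
job_names = list(jobs)
```

With three sub-lists the jobs come out in the order A-B, A-C, B-C, matching Job 1 to Job 3 in the matrix. A record can be linked in more than one job, which is one way to see why a further deduplication round may be needed.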


Evaluation

Performance Metrics

Performance evaluation in absolute runtime, speedup & scaleup on a shared cluster.
◦ “It’s what users care about”
◦ Representative of real operations
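For reference, speedup and scaleup are commonly defined as ratios of runtimes. The sketch below uses those common definitions, which the slides do not spell out, so treat the exact form as an assumption. The parameter names are illustrative.

```python
def speedup(time_small_cluster, time_large_cluster):
    """Same data set on both clusters; a value of k on k times the nodes is linear speedup."""
    return time_small_cluster / time_large_cluster

def scaleup(time_small_job_small_cluster, time_large_job_large_cluster):
    """Data and nodes grow by the same factor; 1.0 means runtime stayed constant."""
    return time_small_job_small_cluster / time_large_job_large_cluster
```

Absolute runtime is the raw measurement; the two ratios show how well adding nodes pays off.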


Input Records

10 million records, 0.9 million original, 0.1 million duplicate, up to 9 duplicates per record, 1 modification per field, 1 modification per record, duplicates follow Poisson distribution.

<rec-359705-org, talyor, swift, 5, canterbury crescent, , cooks hill, 4122, , 19090518, 38, 07 34366927, 6174819, 9>

Data sets

Synthetic data produced with the Febrl data generator
◦ Artificially skewed distribution

[Figure: chart of Comparisons]

Name      Distribution  Comparisons
joshua    50            1225
emiily    48            1128
jack      35            595
thomas    33            528
lachlan   32            496
benjamin  31            465

Utilization

[Figure: utilization chart for Node 1 and Node 2; legend: Idle, Computation]

Utilization

[Figure: utilization chart for Node 1, Node 2 and Node 3; legend: Idle, Computation]

Utilization

[Figure: utilization chart for A, B and C; legend: Idle, Computation]

Utilization

[Figure: utilization chart for A, B and C; legend: Idle, Redistributed Computation, Original Computation]

Round 2

[Figure: comparison matrix over sub-lists A, B and C with jobs J1 to J6 and one cell marked “?”]

Node Utilization 50-100%

Results so far…

Configuration                            Default Workflow  RL Workflow
2 nodes, 5000 records, 2433 duplicates   71.5 secs         75 secs
2 nodes, 7000 records, 4814 duplicates   >10 mins          196.8 secs

Results so far…

RL Workflow runtime
◦ Similar to Hash-based runtime on small datasets
◦ Better as the size of the dataset grows

Conclusion

Parallelism is a step in the right direction for record linkage
◦ Complementary to existing approaches

Hadoop can be tailored for Record Linkage tasks
◦ “Record Linkage” Partitioner / Workflow is just one example of possible improvements
