Record Linkage in a Distributed Environment

Huang Yipeng, Wing group meeting, 11 March 2011


Page 1: Record Linkage in a Distributed Environment

Record Linkage in a Distributed Environment

Huang Yipeng
Wing group meeting, 11 March 2011

Page 2: Record Linkage in a Distributed Environment

Introduction

Record Linkage
Determining if pairs of personal records refer to the same entity

E.g. distinguishing between data belonging to… <Yipeng, author of this presentation> and <Yipeng, son of PM Lee>

Page 3: Record Linkage in a Distributed Environment

Introduction

The Distributed Environment

Why?
◦ Dealing with large data
◦ Limitation of blocking

Advantages
◦ Parallel computation
◦ Data source flexibility
◦ Complementary to blocking methods

O(nC2) pairwise comparisons, i.e. n(n-1)/2 for n records

[Figure: sample names Amanda, Beverley and Katherine, with Amanda repeated many times]
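The O(nC2) cost and the limitation of blocking can be illustrated with a small count. This is a minimal sketch, not code from the talk: the function names are hypothetical and the block sizes (six records named Amanda, one Beverley, one Katherine) are illustrative values loosely based on the names in the figure.

```python
from math import comb

def all_pairs(n):
    """Comparisons needed when every record is compared with every other."""
    return comb(n, 2)

def blocked_pairs(block_sizes):
    """Comparisons needed when only records in the same block are compared."""
    return sum(comb(size, 2) for size in block_sizes)

# Illustrative blocks: 6 x Amanda, 1 x Beverley, 1 x Katherine.
blocks = [6, 1, 1]
print(all_pairs(sum(blocks)))   # every pair among 8 records
print(blocked_pairs(blocks))    # pairs left after blocking on the name
```

Blocking removes the comparisons across names, but one dominant block still carries most of the remaining work, which is the limitation the slide points at.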

Page 4: Record Linkage in a Distributed Environment

Introduction

The Distributed Environment

MapReduce
◦ Distributed environment for large data sets

Hadoop
◦ Open source implementation
◦ Convenient model for scaling Record Linkage
◦ Protects users from system-level concerns

Page 5: Record Linkage in a Distributed Environment

Introduction

Research Problem
Disconnect between the generic parallel framework and the specific Record Linkage problem

The goal
Tailor Hadoop for Record Linkage tasks

Page 6: Record Linkage in a Distributed Environment

Outline
◦ Introduction
◦ Related Work
◦ Methodology
◦ Evaluation
◦ Conclusion

Page 7: Record Linkage in a Distributed Environment

Related Work

Record Linkage Literature
◦ Blocking techniques

Parallel Record Linkage Literature
◦ P-Febrl (P Christen 2003)
◦ P-Swoosh (H Kawai 2006)
◦ Parallel Linkage (H Kim 2007)

Hadoop Literature
◦ Evaluation Metrics
◦ Pairwise comparisons (T Elsayed 2008)

Page 8: Record Linkage in a Distributed Environment

Outline
◦ Introduction
◦ Related Work
◦ Methodology
◦ Evaluation
◦ Conclusion

Page 9: Record Linkage in a Distributed Environment

Methodology

MapReduce Workflow

[Figure: MapReduce workflow, with the Partitioner stage labelled]

Page 10: Record Linkage in a Distributed Environment

Methodology

Implementation

Map
Purpose:
◦ Parallelism
◦ Data manipulation
◦ Blocking
Reads lines of input and outputs <key, value> pairs.

Reduce
Purpose:
◦ Parallelism
◦ Record Linkage ops
Records with the same <key> are processed in the same Reduce().

Output: linkage results
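The Map and Reduce roles described on this slide can be sketched in plain Python. This is a single-process illustration under stated assumptions, not the Hadoop job from the talk: the function names, the choice of the second field as blocking key, and the `is_match` comparison are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def map_record(line):
    """Map: parse one input line and emit a <key, value> pair."""
    fields = [f.strip() for f in line.split(",")]
    key = fields[1]  # assumption: block on the second field (a name)
    return key, fields

def reduce_block(records, is_match):
    """Reduce: compare every pair of records that share a key."""
    return [(a[0], b[0]) for a, b in combinations(records, 2) if is_match(a, b)]

def run(lines, is_match):
    """Group map output by key (stand-in for the shuffle), then reduce each group."""
    groups = defaultdict(list)
    for line in lines:
        key, value = map_record(line)
        groups[key].append(value)
    return {key: reduce_block(recs, is_match) for key, recs in groups.items()}

lines = [
    "r1, amanda, lee, 1990",
    "r2, amanda, lee, 1990",
    "r3, amanda, tan, 1985",
    "r4, beverley, ng, 1970",
]
# Hypothetical matcher: records match when all fields after the key agree.
print(run(lines, lambda a, b: a[2:] == b[2:]))
```

Blocking happens in the map step through the choice of key, so each reduce call only sees candidate pairs from one block.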

Page 11: Record Linkage in a Distributed Environment

Methodology

Hash Partitioner
Default implementation: Hash(Key) mod N
Good for uniformly distributed data, but not for skewed distributions

[Figure: reduce task list for Job x, by node]

Name       Distribution   Comparisons
joshua     5000           12497500
emiily     48             1128
jack       35             595
thomas     33             528
lachlan    32             496
benjamin   31             465

5416986 comparisons

210 comparisons
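The effect of Hash(Key) mod N on a skewed distribution can be checked with the name counts from the table. This is a rough sketch rather than Hadoop code: `crc32` stands in for the key's hash function, the function names are hypothetical, and the node count of 4 is an arbitrary choice for illustration.

```python
from math import comb
from zlib import crc32

def hash_partition(key, n_nodes):
    """Default-style partitioning: hash of the key modulo the number of nodes."""
    return crc32(key.encode("utf-8")) % n_nodes

def node_loads(group_sizes, n_nodes):
    """Pairwise comparisons each node must perform under hash partitioning."""
    loads = {node: 0 for node in range(n_nodes)}
    for key, size in group_sizes.items():
        loads[hash_partition(key, n_nodes)] += comb(size, 2)
    return loads

# Name counts taken from the table on this slide.
sizes = {"joshua": 5000, "emiily": 48, "jack": 35,
         "thomas": 33, "lachlan": 32, "benjamin": 31}
for node, load in node_loads(sizes, 4).items():
    print(node, load)
```

Whichever node receives "joshua" carries at least 12497500 comparisons, because a key is never split across nodes, so the other nodes finish early and sit idle.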

Page 12: Record Linkage in a Distributed Environment

Methodology

Record Linkage Partitioner
Goal: have all nodes finish the reduce phase at the same time
Attain a better runtime while retaining the same level of accuracy

Page 13: Record Linkage in a Distributed Environment

Methodology

Domain principles
◦ Counting pairwise comparisons gives a more accurate picture of the true computational workload
◦ The distribution of names tends to follow a power law in many countries (D Zanette 2001), (S Miyazima 2000)
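The first principle can be written out as a formula. The symbols are assumptions introduced here for illustration: $n_k$ is the number of records sharing key $k$ and $W$ is the total comparison workload. The worked values come from the hash partitioner table.

```latex
W = \sum_{k} \binom{n_k}{2} = \sum_{k} \frac{n_k (n_k - 1)}{2}

% Worked values from the hash partitioner table:
% n_{joshua} = 5000  =>  5000 \cdot 4999 / 2 = 12497500 comparisons
% n_{emiily} = 48    =>  48 \cdot 47 / 2     = 1128 comparisons
```

By record count "joshua" is roughly 104 times larger than "emiily", but by comparison count it is roughly 11,000 times larger, which is why record counts understate the skew.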

Page 14: Record Linkage in a Distributed Environment

Methodology

Record Linkage Workflow
Round 1: Range partition based on comparison workload
Round 2: Merge lost comparisons from Round 1
Round 3: Remove cross duplicates

Page 15: Record Linkage in a Distributed Environment

Methodology

Round 1

[Figure: Input, Map Phase, Distribution]

1. Calculate the average comparison workload over N nodes.
2. Check whether a record will exceed the average. If yes, divide it by the minimum number of nodes needed to drop below the average.
3. Assign records to nodes and update the average comparison workload to reflect lost comparisons, if any.
4. Recurse until the comparison load can be evenly distributed among nodes.
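One pass of steps 1 to 3 can be sketched as follows. This is a simplified illustration, not the partitioner from the talk. It assumes that "record" means a group of records sharing a key, that groups are split into near-equal parts, and that parts go to the least-loaded node. The recursion in step 4 is omitted and all names are hypothetical.

```python
from math import ceil, comb

def min_parts(size, limit):
    """Smallest number of near-equal parts whose largest part needs <= limit comparisons."""
    parts = 1
    while parts < size and comb(ceil(size / parts), 2) > limit:
        parts += 1
    return parts

def plan_round1(group_sizes, n_nodes):
    """Return (per-node assignments, per-node loads, lost comparisons)."""
    # Step 1: average comparison workload over the nodes.
    avg = sum(comb(s, 2) for s in group_sizes.values()) / n_nodes
    pieces, lost = [], 0
    for key, size in group_sizes.items():
        # Step 2: split a group only if it exceeds the average.
        parts = min_parts(size, avg) if comb(size, 2) > avg else 1
        base, extra = divmod(size, parts)
        part_sizes = [base + (1 if i < extra else 0) for i in range(parts)]
        pieces += [(key, s) for s in part_sizes]
        # Comparisons between parts of one group are deferred ("lost").
        lost += comb(size, 2) - sum(comb(s, 2) for s in part_sizes)
    # Step 3: give the heaviest remaining piece to the least-loaded node.
    loads = [0] * n_nodes
    nodes = [[] for _ in range(n_nodes)]
    for key, s in sorted(pieces, key=lambda p: comb(p[1], 2), reverse=True):
        i = loads.index(min(loads))
        nodes[i].append((key, s))
        loads[i] += comb(s, 2)
    return nodes, loads, lost

sizes = {"joshua": 5000, "emiily": 48, "jack": 35,
         "thomas": 33, "lachlan": 32, "benjamin": 31}
nodes, loads, lost = plan_round1(sizes, 2)
print(loads, lost)
```

With two nodes, "joshua" is split into two parts of 2500 records, and the 2500 × 2500 comparisons between those parts are the lost comparisons left for Round 2.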

Page 16: Record Linkage in a Distributed Environment

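Methodology

Round 2

[Figure: List X is split into parts A and B. Comparisons within A and within B run in Round 1 (R1); comparisons between A and B run in Round 2 (R2).]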

Page 17: Record Linkage in a Distributed Environment

Methodology

Round 2

Two parts:
      A       B
A
B     Job 1

Three parts:
      A       B       C
A
B     Job 1
C     Job 2   Job 3

1. Only acts on lost comparisons
2. Because input is indistinct, a 3rd round of deduplication may be needed.
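The job matrices above amount to one job per unordered pair of parts. A minimal sketch of that enumeration follows. The function names and the `is_match` comparison are hypothetical, and the job numbering follows the order that `itertools.combinations` yields, which may differ from the numbering on the slide.

```python
from itertools import combinations, product

def cross_jobs(part_names):
    """One Round 2 job per unordered pair of parts of a split group."""
    return list(combinations(part_names, 2))

def run_job(part_a, part_b, is_match):
    """Compare every record in one part with every record in the other."""
    return [(a, b) for a, b in product(part_a, part_b) if is_match(a, b)]

print(cross_jobs(["A", "B"]))        # a single job for two parts
print(cross_jobs(["A", "B", "C"]))   # three jobs for three parts

# Illustrative parts of one split group; equality stands in for a real matcher.
part_a = ["amanda lee", "amanda tan"]
part_b = ["amanda lee", "amanda ng"]
print(run_job(part_a, part_b, lambda a, b: a == b))
```

Because only pairs that straddle two parts are compared, Round 2 performs exactly the comparisons that Round 1 skipped.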

Page 18: Record Linkage in a Distributed Environment

Outline
◦ Introduction
◦ Related Work
◦ Methodology
◦ Evaluation
◦ Conclusion

Page 19: Record Linkage in a Distributed Environment

Evaluation

Performance Metrics
Performance evaluation in absolute runtime, speedup & scaleup on a shared cluster.
◦ “It’s what users care about”
◦ Representative of real operations
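Speedup and scaleup are not defined on the slide. The sketch below uses the usual definitions from the parallel database literature, which is an assumption about what the talk intends. The function names are hypothetical and the runtimes are made-up illustrative values, not results from the talk.

```python
def speedup(t_small_system, t_large_system):
    """Same problem on both systems; ideal value equals the growth in nodes."""
    return t_small_system / t_large_system

def scaleup(t_small_problem_small_system, t_large_problem_large_system):
    """Problem and system grow together; ideal value is 1.0."""
    return t_small_problem_small_system / t_large_problem_large_system

# Illustrative runtimes in seconds.
print(speedup(100.0, 25.0))    # e.g. going from 1 node to 4 nodes
print(scaleup(100.0, 100.0))   # 4x the data on 4x the nodes
```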

Page 20: Record Linkage in a Distributed Environment

Methodology

Input Records

10 million records, 0.9 million original, 0.1 million duplicate, up to 9 duplicates per record, 1 modification per field, 1 modification per record, duplicates follow Poisson distribution.

<rec-359705-org, talyor, swift, 5, canterbury crescent, , cooks hill, 4122, , 19090518, 38, 07 34366927, 6174819, 9>

Page 21: Record Linkage in a Distributed Environment

Methodology

Data sets
Synthetic data produced with the Febrl data generator
◦ Artificially skewed distribution

[Figure: chart of comparisons per name; Comparisons axis from 0 to 1400]

Name       Distribution   Comparisons
joshua     50             1225
emiily     48             1128
jack       35             595
thomas     33             528
lachlan    32             496
benjamin   31             465

Page 22: Record Linkage in a Distributed Environment

Evaluation

Utilization

[Figure: bar chart of idle vs. computation time for Node 1 and Node 2; axis from 0 to 12]

Page 23: Record Linkage in a Distributed Environment

Evaluation

Utilization

[Figure: bar chart of idle vs. computation time for Node 1, Node 2 and Node 3; axis from 0 to 12]

Page 24: Record Linkage in a Distributed Environment

Evaluation

Utilization

[Figure: bar chart of idle vs. computation time for Node 1, Node 2 and Node 3, with bars labelled A, B and C; axis from 0 to 12]

Page 25: Record Linkage in a Distributed Environment

Evaluation

Utilization

[Figure: bar chart of idle, redistributed computation and original computation time for Node 1, Node 2 and Node 3, with bars labelled A, B and C; axis from 0 to 12]

Page 26: Record Linkage in a Distributed Environment

Round 2

[Figure: comparison matrix over parts A, B and C, with jobs J1 to J6]

Node Utilization 50-100%

Page 27: Record Linkage in a Distributed Environment

Evaluation

Results so far…

Setting                                   Default Workflow   RL Workflow
2 nodes, 5000 records, 2433 duplicates    71.5 secs          75 secs
2 nodes, 7000 records, 4814 duplicates    >10 mins           196.8 secs

Page 28: Record Linkage in a Distributed Environment

Evaluation

Results so far…

RL Workflow runtime
◦ Similar to Hash-based runtime on small datasets
◦ Better as the size of the dataset grows

Page 29: Record Linkage in a Distributed Environment

Conclusion

Parallelism is a step in the right direction for record linkage
◦ Complementary to existing approaches

Hadoop can be tailored for Record Linkage tasks
◦ “Record Linkage” Partitioner / Workflow is just one example of possible improvements