G-Link: A Probabilistic Record Linkage System

Antoine Chevrette, System Engineering Division, Statistics Canada


DESCRIPTION

May 2011 Person Validation and Entity Resolution (PVER) Conference. Presenter: Antoine Chevrette, System Engineering Division, Statistics Canada

TRANSCRIPT

Page 1: G-Link_Probablistic Record Linkage System_PVER Conf_May2011

G-Link: A Probabilistic Record Linkage System

Antoine Chevrette

System Engineering Division

Statistics Canada

Page 2: G-Link_Probablistic Record Linkage System_PVER Conf_May2011

Agenda

• Background: early days of record linkage

• Motivation for building G-Link

• G-Link design objectives

• System overview

• Software installation

• What’s in the future?

23-04-102 Statistics Canada • Statistique Canada

Page 3: G-Link_Probablistic Record Linkage System_PVER Conf_May2011

Theory of Record Linkage

Ivan Fellegi & Alan Sunter

• “A Theory for Record Linkage” (1969)

Still widely regarded as both pivotal and definitive

Implemented in Statistics Canada’s linkage software
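As a rough illustration of the Fellegi-Sunter approach, the sketch below computes an agreement and a disagreement weight for each field from its m-probability (chance the field agrees on a true match) and u-probability (chance it agrees on a non-match), then sums the weights over the fields of a record pair. The field names and probability values are invented for illustration and are not taken from G-Link.

```python
import math

def field_weights(m: float, u: float) -> tuple[float, float]:
    """Return (agreement_weight, disagreement_weight) in log base 2."""
    agree = math.log2(m / u)
    disagree = math.log2((1 - m) / (1 - u))
    return agree, disagree

# Hypothetical (m, u) probabilities per field; illustrative values only.
FIELDS = {
    "surname": (0.95, 0.01),
    "birth_year": (0.90, 0.05),
}

def pair_weight(agreements: dict[str, bool]) -> float:
    """Sum the field weights for one record pair.

    `agreements` maps each field name to True (values agree) or False.
    """
    total = 0.0
    for field, agrees in agreements.items():
        m, u = FIELDS[field]
        agree_w, disagree_w = field_weights(m, u)
        total += agree_w if agrees else disagree_w
    return total
```

A pair that agrees on every field gets a large positive total weight, and a pair that disagrees on every field gets a negative one.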


Page 4: G-Link_Probablistic Record Linkage System_PVER Conf_May2011

Linkage Systems at Statistics Canada

Ted Hill (SDD) and Martha Fair (Health) produced the “Generalized Iterative Record Linkage System” (GIRLS):

• First released as a mainframe-only product (GIRLS) ca. 1980

• Re-engineered for Unix servers ca. 1990 (renamed GRLS)

Larger linkages became practical over time

Functionality and ease of use encouraged wider application


Page 5: G-Link_Probablistic Record Linkage System_PVER Conf_May2011

Why Replace GRLS?

GRLS is fully functional and very popular, but:

• Requires the use of a Unix-based server

• Requires connection with the Oracle DBMS

Potential applications saw the architecture as a barrier

GRLS software was aging and required significant updates


Page 6: G-Link_Probablistic Record Linkage System_PVER Conf_May2011

G-Link Design Objectives

• Operable on all Windows desktops

• Available for both Windows & Unix servers

• No third-party software dependencies

• No additional licensing fees

• Full GRLS work-alike functionality

• Processing speed comparable to GRLS

• Extensible

• Easy to use


Page 7: G-Link_Probablistic Record Linkage System_PVER Conf_May2011

G-LINK Overview

G-LINK introduction through:

• Menu options

• The following screens:

Project creation
Data importation
Data analysis
Pairs creation
Index creation
Rules creation
Pairs weight distribution graph
Pairs review
Group creation and mapping
Data exportation
Batch functionality

• Installation instructions

Page 8: G-Link_Probablistic Record Linkage System_PVER Conf_May2011

G-LINK Overview

Page 9: G-Link_Probablistic Record Linkage System_PVER Conf_May2011

Project Creation

External or internal linkage

Internal: e.g. find duplicate records in an address file.

External: e.g. link a cancer database with a death database.

Information taken from a configuration file (for server mode only)

Project protected by a username and password
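The difference between internal and external linkage can be sketched in terms of which record pairs are candidates for comparison. This is a minimal illustration with hypothetical function names, not G-Link's implementation; a real system would restrict the pairs further rather than enumerate all of them.

```python
from itertools import combinations, product

def internal_pairs(records):
    """Internal linkage: compare a file with itself (duplicate detection).

    Each unordered pair of distinct records appears exactly once.
    """
    return list(combinations(records, 2))

def external_pairs(file_a, file_b):
    """External linkage: compare every record of file A with every record of file B."""
    return list(product(file_a, file_b))
```

For a file of n records, internal linkage yields n(n-1)/2 candidate pairs; two files of sizes n and m yield n×m.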

Page 10: G-Link_Probablistic Record Linkage System_PVER Conf_May2011

Data Importation

You can see the first 100 observations from the SAS file

Once the importation is complete, you can create derived columns based on NYSIIS and Soundex

Definitions for the columns to import
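To show what a Soundex-derived column contains, here is a minimal sketch of the standard American Soundex code. It is an illustration only; G-Link's own Soundex and NYSIIS implementations may differ in details, and NYSIIS is not shown.

```python
# Map each consonant to its Soundex digit; vowels, H, W and Y have no digit.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit

def soundex(name: str) -> str:
    """Return the 4-character American Soundex code of a name."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0]
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = CODES.get(ch, "")
        if digit and digit != prev:
            code += digit
        if ch not in "HW":  # H and W do not separate letters with the same code
            prev = digit
    return (code + "000")[:4]  # pad with zeros or truncate to 4 characters
```

Names that sound alike, such as "Robert" and "Rupert", receive the same code, so the derived column can be used to compare names despite spelling variations.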

Page 11: G-Link_Probablistic Record Linkage System_PVER Conf_May2011

Data Analysis

Obtain the frequency of each field value
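A frequency count of this kind can be sketched in a few lines. The record layout (a list of dictionaries) and the function name are assumptions made for the example, not G-Link's data model.

```python
from collections import Counter

def value_frequencies(records, field):
    """Count how often each value of `field` occurs in a list of dict records."""
    return Counter(rec[field] for rec in records)

# Hypothetical sample data.
sample = [
    {"surname": "SMITH"},
    {"surname": "TREMBLAY"},
    {"surname": "SMITH"},
]
freq = value_frequencies(sample, "surname")
```

Here `freq.most_common(1)` lists the most frequent surname with its count. Such frequencies help show which values are common and therefore less discriminating.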

Page 12: G-Link_Probablistic Record Linkage System_PVER Conf_May2011

Pairs Creation

Create pairs interactively

Experienced users can directly create SQL statements
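As a hedged sketch of creating pairs with an SQL statement, the example below joins two files on a shared field so that only records with the same birth year are paired. SQLite stands in for G-Link's database here, and the table and column names are invented; G-Link's actual schema and SQL dialect are not shown in the slides.

```python
import sqlite3

def create_pairs(rows_a, rows_b):
    """Pair records from two files that share the same birth year.

    rows_a and rows_b are lists of (id, surname, birth_year) tuples.
    """
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE file_a (id INTEGER, surname TEXT, birth_year INTEGER)")
    con.execute("CREATE TABLE file_b (id INTEGER, surname TEXT, birth_year INTEGER)")
    con.executemany("INSERT INTO file_a VALUES (?, ?, ?)", rows_a)
    con.executemany("INSERT INTO file_b VALUES (?, ?, ?)", rows_b)
    pairs = con.execute(
        "SELECT a.id, b.id FROM file_a AS a "
        "JOIN file_b AS b ON a.birth_year = b.birth_year "
        "ORDER BY a.id, b.id"
    ).fetchall()
    con.close()
    return pairs
```

Restricting pairs with a join condition keeps the number of comparisons far below the full cross product of the two files.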

Page 13: G-Link_Probablistic Record Linkage System_PVER Conf_May2011

Rule Creation

3-level character rule
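The slide shows the rule only as a screenshot, so the following is an assumed interpretation of a three-level character rule: full agreement, partial agreement on a leading prefix, and disagreement. The outcome labels, the prefix length and the function name are hypothetical, and missing values are not treated specially.

```python
def three_level_character_rule(a: str, b: str, prefix_len: int = 4) -> str:
    """Compare two character fields and return one of three outcome levels."""
    a, b = a.strip().upper(), b.strip().upper()
    if a == b:
        return "agree"      # level 1: identical values
    if a[:prefix_len] == b[:prefix_len]:
        return "partial"    # level 2: same leading characters
    return "disagree"       # level 3: no agreement
```

Each outcome level would then carry its own weight when the pair is scored.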

Page 14: G-Link_Probablistic Record Linkage System_PVER Conf_May2011

Rule Creation

3-level character matrix rule

Page 15: G-Link_Probablistic Record Linkage System_PVER Conf_May2011

Rule Creation

2-level date rule

Page 16: G-Link_Probablistic Record Linkage System_PVER Conf_May2011

Rule Creation

Numerical condition rule
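One plausible reading of a numerical condition rule is a comparison of two numeric fields against a tolerance. The sketch below is an assumption for illustration; the outcome labels, the tolerance parameter and the function name are not taken from G-Link.

```python
def numeric_condition_rule(a: float, b: float, tolerance: float = 1.0) -> str:
    """Compare two numeric fields using an absolute-difference condition."""
    diff = abs(a - b)
    if diff == 0:
        return "agree"
    if diff <= tolerance:
        return "close"      # within the allowed tolerance
    return "disagree"
```

For example, two birth years that differ by one would be rated "close" with the default tolerance.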

Page 17: G-Link_Probablistic Record Linkage System_PVER Conf_May2011

User Rules

Type must be custom

Outcome set by the user (used in the user rule psql)

Include fields from your input tables

Page 18: G-Link_Probablistic Record Linkage System_PVER Conf_May2011

Pairs weight distribution graph

You can choose the range selection

Minimum and maximum weight + the threshold values
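The role of the threshold values can be sketched as a three-way classification of pairs by total weight. The labels and the handling of values exactly at a threshold are assumptions for this example, not G-Link's documented behaviour.

```python
def classify_pair(weight: float, lower: float, upper: float) -> str:
    """Classify a pair by comparing its total weight with two thresholds."""
    if lower > upper:
        raise ValueError("lower threshold must not exceed upper threshold")
    if weight >= upper:
        return "link"
    if weight <= lower:
        return "non-link"
    return "possible"       # between the thresholds: candidate for review
```

Pairs that fall between the two thresholds are the ones that typically need review.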

Page 19: G-Link_Probablistic Record Linkage System_PVER Conf_May2011

Pairs revision

Special criteria in order to revise groups of pairs

Rules outcome level

Manual update

Page 20: G-Link_Probablistic Record Linkage System_PVER Conf_May2011

Group creation and mapping

Mapping screen

Group creation screen
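Group creation can be pictured as collecting records connected by accepted links into the same group. The sketch below uses a union-find structure to do this; it illustrates the general technique and is not a description of how G-Link builds its groups.

```python
def build_groups(linked_pairs):
    """Group record ids so that linked records end up in the same group.

    linked_pairs is an iterable of (id_a, id_b) tuples for accepted links.
    Returns a sorted list of sorted groups.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in linked_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for record in list(parent):
        groups.setdefault(find(record), set()).add(record)
    return sorted(sorted(group) for group in groups.values())
```

With links (1, 2), (2, 3) and (4, 5), records 1, 2 and 3 form one group and records 4 and 5 another, even though 1 and 3 were never linked directly.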

Page 21: G-Link_Probablistic Record Linkage System_PVER Conf_May2011

Data Exportation

Export to flat files or SAS files

Page 22: G-Link_Probablistic Record Linkage System_PVER Conf_May2011

Batch

Set a G-Link project as batch. Run from the command line, embedded script with time execution.

Page 23: G-Link_Probablistic Record Linkage System_PVER Conf_May2011

How to install G-LINK

G-LINK is installed using an .exe file on a Windows machine.

G-LINK can be installed locally or in server mode

• You should use the client-server mode when:

Performance is important (option of using multiple CPUs)

Data confidentiality is required

[Diagram: Interface, Logical, and Processing (DBMS) layers, shown for Local and Server modes]

Page 24: G-Link_Probablistic Record Linkage System_PVER Conf_May2011

G-Link: The Future?

Product will continue to evolve:

• Faster processing

• Enhanced pre-processing and post-processing

• Enhanced fuzzy matching

Possibility of “record-at-a-time” linkages:

• For interactive applications (capture, un-duplication)

• Potential for embedded processing


Page 25: G-Link_Probablistic Record Linkage System_PVER Conf_May2011


Contact: