G-Link: A Probabilistic Record Linkage System (PVER Conference, May 2011)

Post on 05-Dec-2014

DESCRIPTION

May 2011 Personal Validation and Entity Resolution Conference. Presenter: Antoine Chevrette, System Engineering Division, Statistics Canada

TRANSCRIPT

G-Link

A Probabilistic Record Linkage System

Antoine Chevrette

System Engineering Division

Statistics Canada

2

Agenda

• Background: early days of record linkage

• Motivation for building G-Link

• G-Link design objectives

• System overview

• Software installation

• What’s in the future?

23-04-102 Statistics Canada • Statistique Canada

3

Theory of Record Linkage

Ivan Fellegi & Alan Sunter

• “A Theory for Record Linkage” (1969)

Still widely regarded as both pivotal and definitive

Implemented in Statistics Canada’s linkage software
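
The slide does not spell out the Fellegi-Sunter scoring, so the following is only a minimal sketch of the standard formulation: each compared field adds log2(m/u) to a pair's weight when it agrees and log2((1-m)/(1-u)) when it disagrees, where m is the probability of agreement among true matches and u the probability of agreement among non-matches. The field names and the m/u values are illustrative assumptions, not G-Link settings.

```python
import math

def agreement_weight(m: float, u: float) -> float:
    """Weight contributed by a field that agrees: log2(m / u)."""
    return math.log2(m / u)

def disagreement_weight(m: float, u: float) -> float:
    """Weight contributed by a field that disagrees: log2((1 - m) / (1 - u))."""
    return math.log2((1 - m) / (1 - u))

def pair_weight(agreements: dict, probabilities: dict) -> float:
    """Sum the per-field weights for one candidate pair.

    agreements maps field name -> True/False (did the field agree?).
    probabilities maps field name -> (m, u).
    """
    total = 0.0
    for field, agrees in agreements.items():
        m, u = probabilities[field]
        total += agreement_weight(m, u) if agrees else disagreement_weight(m, u)
    return total

# Hypothetical m/u values, chosen only for illustration.
PROBS = {"surname": (0.95, 0.01), "birth_year": (0.90, 0.05)}

# Surname agrees (large positive weight), birth year disagrees (negative weight).
weight = pair_weight({"surname": True, "birth_year": False}, PROBS)
```

A field that agrees far more often among true matches than among random pairs (m much larger than u) earns a large positive weight, which is why rare, reliable identifiers dominate the total.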

4

Linkage Systems at Statistics Canada

Ted Hill (SDD) and Martha Fair (Health) produced the “Generalized Iterative Record Linkage System” (GIRLS):

• First released as a mainframe-only product (GIRLS) ca. 1980

• Re-engineered for Unix servers ca. 1990 (renamed GRLS)

Larger linkages became practical over time

Functionality and ease of use encouraged wider application

5

Why Replace GRLS?

GRLS is fully functional and very popular, but:

• Requires the use of a Unix-based server

• Requires connection with the Oracle DBMS

Potential applications saw the architecture as a barrier

GRLS software was aging and required significant updates

6

G-Link Design Objectives

• Operable on all Windows desktops

• Available for both Windows & Unix servers

• No third-party software dependencies

• No additional licensing fees

• Full GRLS work-alike functionality

• Processing speed comparable to GRLS

• Extensible

• Easy to use

7

G-LINK Overview

G-LINK introduction through:

• Menu options

• The following screens:

Project creation

Data importation

Data analysis

Pairs creation

Index creation

Rules creation

Pairs weight distribution graph

Pairs review

Group creation and mapping

Data exportation

Batch functionality

• Installation instructions

8

G-LINK Overview

9

Project Creation

External or internal linkage

Internal: e.g. find duplicate records in an address file.

External: e.g. link a cancer database with a death database.

Information taken from a configuration file (for server mode only)

Project protected by a username and password

10

Data Importation

You can see the first 100 observations from the SAS file

Once the importation is complete, you can create derived columns based on NYSIIS and Soundex

Definitions for the columns to import
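
To illustrate what a Soundex-derived column contains, here is a minimal sketch of the common American Soundex coding (first letter plus three digits). The slides do not say which Soundex variant G-Link uses, so treat this as an assumption; NYSIIS follows a different, longer rule set and is not shown. The function name `soundex` is ours.

```python
# Letter-to-digit table for American Soundex.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit

def soundex(name: str) -> str:
    """Return the four-character Soundex code of a name ('' if no letters)."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not break a run of identical codes
        code = CODES.get(ch, "")  # vowels and Y have no code
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code after it is kept
    return (result + "000")[:4]  # pad with zeros, keep four characters

# A derived column is then just the code of an existing column.
surnames = ["Robert", "Rupert", "Ashcraft"]
derived = [soundex(s) for s in surnames]
```

Names that sound alike, such as "Robert" and "Rupert", receive the same code, so the derived column can serve as a comparison field or as a key for bringing candidate pairs together despite spelling differences.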

11

Data Analysis

Obtain the frequency of each field value
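
As a rough illustration of this kind of analysis, the sketch below counts how often each value of a field occurs. The record layout and field names are hypothetical, and this is not G-Link's own implementation.

```python
from collections import Counter

def field_frequencies(records, field):
    """Count how often each value of `field` occurs (missing values count as None)."""
    return Counter(record.get(field) for record in records)

# Hypothetical input records.
records = [
    {"surname": "SMITH", "birth_year": 1950},
    {"surname": "SMITH", "birth_year": 1962},
    {"surname": "TREMBLAY", "birth_year": 1950},
    {"birth_year": 1971},  # surname missing
]

surname_counts = field_frequencies(records, "surname")
most_common_surname = surname_counts.most_common(1)
```

Frequency tables like this show which values are common, which are rare, and how many are missing, all of which matter when deciding how much weight an agreement on that field deserves.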

12

Pairs Creation

Create pairs interactively

Experienced users can directly create SQL statements
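
The slides do not show the SQL itself, so the following is only a sketch of what a pair-creating statement could look like: a join that keeps the pairs of records sharing a key value. It uses Python's built-in sqlite3 module; the table names, column names, and the choice of birth year as the key are all assumptions, not G-Link's schema or DBMS.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE cancer (id INTEGER PRIMARY KEY, surname TEXT, birth_year INTEGER);
CREATE TABLE death  (id INTEGER PRIMARY KEY, surname TEXT, birth_year INTEGER);
INSERT INTO cancer VALUES (1, 'TREMBLAY', 1950), (2, 'ROY', 1962);
INSERT INTO death  VALUES (10, 'TREMBLAY', 1950), (11, 'TREMBLEY', 1950),
                          (12, 'ROY', 1971);
""")

# Hypothetical pair-creation statement: only records with the same
# birth year are paired, instead of comparing every record with every other.
PAIRS_SQL = """
SELECT c.id AS cancer_id, d.id AS death_id
FROM cancer AS c
JOIN death AS d ON c.birth_year = d.birth_year
ORDER BY c.id, d.id
"""

pairs = conn.execute(PAIRS_SQL).fetchall()
conn.close()
```

Restricting pairs to records that share a key keeps the number of comparisons manageable, at the cost of missing true matches whose key values differ (here, the two ROY records).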

13

Rule Creation

3-level character rule
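
The slide shows only a screenshot, so the meaning of the three levels is not stated. One plausible reading, sketched below purely as an assumption, is a rule with three outcomes: exact agreement, partial agreement, and disagreement. The function name, the prefix-based partial test, and the handling of missing values are all hypothetical.

```python
def three_level_rule(a: str, b: str, prefix_len: int = 4) -> int:
    """Compare two character fields and return an outcome level.

    1 = exact agreement
    2 = partial agreement (first `prefix_len` characters agree)
    3 = disagreement (also used here when either value is missing)
    """
    a, b = a.strip().upper(), b.strip().upper()
    if not a or not b:
        return 3  # assumption: a missing value is treated as disagreement
    if a == b:
        return 1
    if a[:prefix_len] == b[:prefix_len]:
        return 2
    return 3

outcomes = [
    three_level_rule("Tremblay", "TREMBLAY"),
    three_level_rule("Tremblay", "Trembley"),
    three_level_rule("Roy", "Smith"),
]
```

Each outcome level would then carry its own weight, so a near match on a surname contributes less than an exact match but more than a clear disagreement.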

14

Rule Creation

3-level character matrix rule

15

Rule Creation

2-level date rule

16

Rule Creation

Numerical condition rule

17

User Rules

Type must be custom

Outcome set by users (used in the user rule psql)

Include fields from your input tables

18

Pairs Weight Distribution Graph

You can choose the range selection

Minimum and maximum weight + the threshold values
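
In the Fellegi-Sunter approach, two thresholds split the weight range into links, possible links that need review, and non-links. The slide does not state how G-Link treats pairs that fall exactly on a threshold, so the boundary conventions and labels in this sketch are assumptions.

```python
def classify(weight: float, lower: float, upper: float) -> str:
    """Classify a pair by its total weight against two thresholds."""
    if lower > upper:
        raise ValueError("lower threshold must not exceed upper threshold")
    if weight >= upper:
        return "link"
    if weight <= lower:
        return "non-link"
    return "possible"  # between the thresholds: candidate for manual review

# Hypothetical pair weights and threshold values.
weights = [12.0, 5.0, -3.0]
labels = [classify(w, lower=0.0, upper=10.0) for w in weights]
```

Moving the thresholds apart sends more pairs to manual review; moving them together automates more decisions at the risk of more errors.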

19

Pairs Review

Special criteria for reviewing groups of pairs

Rules outcome level

Manual update

20

Group Creation and Mapping

Mapping screen

Group creation screen

21

Data Exportation

Export to flat files or SAS files

22

Batch

Set a G-Link project as batch. Run from the command line, or embedded in a script with timed execution.

23

How to Install G-LINK

G-LINK is installed using an .exe file on a Windows machine.

G-LINK can be installed locally or in server mode.

• You should use the client-server mode when:

Performance is important (option of using multiple CPUs)

Data confidentiality is required

[Diagram: Interface, Logical, and Processing (DBMS) components, shown for Local and Server installations]

24

G-Link: The Future?

Product will continue to evolve:

• Faster processing

• Enhanced pre-processing and post-processing

• Enhanced fuzzy matching

Possibility of “record-at-a-time” linkages:

• For interactive applications (capture, de-duplication)

• Potential for embedded processing

25

Contact:
