De-duplication of Bibliographic Records

Tsach Moshkovitz, Development Team Leader

Olybris, Ex Libris Seminar 2005

Kos, April 2005

Overview

De-duplication is required whenever new records are introduced to the database.

De-duplication streamlines the process of loading old or new records as much as possible.


Overview (cont)

The main process steps are:

1) matching
2) finding merge direction
3) merging


Overview (cont)

Matching involves searching for similar database records that have specific given parameters.


Overview (cont)

Finding the merge direction involves identifying the preferred record (in some processes the direction is implied). The non-preferred record is then merged into the preferred record.


Overview (cont)

Merging involves blending a new record with a similar record in the database, according to a control table or a configuration table.


De-duplication vs. Union Catalog

The Union Catalog is a sophisticated mechanism for supporting integration of disparate libraries into a single environment.

The Union Catalog is based on Match and Merge algorithms.

ALEPH’s Union Catalog is not a unified catalog, in the sense that an actual merge does not take place in ALEPH’s Union Catalog database.


De-duplication vs. Union Catalog (cont)

Union Catalog Match

An Equivalence table is created to map each record to a set of equivalent records. It is totally acceptable to find more than one match in the Union Catalog.

The set of equivalent records also contains the preferred record.



Union Catalog Merge

User View uses an on-the-fly merge to construct a virtual single record that is built from a group of equivalents.

Merge product is not saved in the database.



ALEPH De-duplication:

Match – “Rigid”: search for similar records.

Merge – Similar records are combined in the database. All substantial fields should remain.

Union:

Match – “Loose”: search for equivalent records.

Merge – Virtual merge for display. Fields can be added or omitted from the merged display.



ALEPH De-duplication:

Preferred – Performed only between two records (incoming and matched).

Union:

Preferred – Performed on the entire equivalent set.


Sources for New Records

The main sources for new records are:

Cataloging GUI – A user catalogs a new record in the database.

Load Servers (such as OCLC server) – The server receives records from a search client (e.g., CatMe).

Batch Loading of resource files – Records are loaded in the database from an input file.


Sources for New Records (cont)

The different sources for new records allow different levels of user intervention during the process:

Cataloging GUI – Maximum user control during the process.

Loading Servers – Limited user control during the process.

Loading Resource Files – No user control during the process.



In general, the more a staff user is involved, the less rigid the match needs to be.

Q. How can the staff user be involved in batch loading a record?

A. A given input file can be split into three files (zero matched records, single matched records and multi-matched records) before the actual batch load.



In the Cataloging GUI, it is common for a new record not to contain enough information for a rigid match (e.g., fast cataloging).


Match and Merge

Match & Merge Setup

Versions 15.2 and later feature a unified mechanism for Match and Merge.

The mechanism is similar to the check/fix doc mechanism, and defines the programs to be executed for different procedure codes (contexts).

The functionality also exists in Versions 14.x, but there it is configured separately for cataloging, load servers, and batch jobs.


Match and Merge (cont)

Match & Merge Setup

Just as in other Configuration tables, the Match/Preferred/Merge Conf tables feature the following three columns:

1) Section code (the context, e.g., OCLC)
2) Program(s) to execute
3) Program-specific parameters


Match

Example, tab_match:

!!!!!-!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!-!!!!!!!!!!!!!
OCLC match_doc_uid I-ISBN
OCLC match_doc_acc tab_match_acc

CAT match_doc_uid I-ISBN
CAT match_doc_acc tab_match_acc

YBP match_doc_uid I-ISBN
YBP match_doc_acc tab_match_acc
YBP match_doc_gen TYPE=IND...
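As an illustration of how such a table groups programs by section code, here is a minimal Python sketch. It is not ALEPH's own loader: the function name parse_conf_table and the whitespace-based splitting are assumptions made for the example (the real tables are fixed-width), and only rows that appear in full above are used.

```python
from collections import defaultdict

# Sample rows copied from the tab_match example; lines starting with "!" are rulers.
SAMPLE_TAB_MATCH = """\
!!!!!-!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!-!!!!!!!!!!!!!
OCLC match_doc_uid I-ISBN
OCLC match_doc_acc tab_match_acc
CAT match_doc_uid I-ISBN
CAT match_doc_acc tab_match_acc
"""


def parse_conf_table(text):
    """Group (program, parameters) pairs by section code, keeping row order.

    Hypothetical helper: splits on whitespace instead of fixed column widths.
    """
    table = defaultdict(list)
    for line in text.splitlines():
        if not line.strip() or line.startswith("!"):
            continue  # skip blank lines, comments and column rulers
        parts = line.split(None, 2)
        if len(parts) < 2:
            continue  # a row needs at least a section code and a program
        params = parts[2].strip() if len(parts) > 2 else ""
        table[parts[0]].append((parts[1], params))
    return dict(table)


programs = parse_conf_table(SAMPLE_TAB_MATCH)
for program, params in programs["OCLC"]:
    print(program, params)
```

Each context (OCLC, CAT) ends up with its own ordered list of match programs, which mirrors the three-column layout described above.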


Match (cont)

The procedure name is defined elsewhere, depending on the module. For example, the OCLC server match procedure is defined in tab_oclc col. 9.

! 1 2 3 4 5 6 7 8 9

!!!!-!!!-!!!!!-!!!!!-!!!!!-!-!-!!!!!!!!!!-!!!!!

7545 BIB USM01 1 N Y OCLC OCLC

7545 AUT USM10 Y OCLC OCLC


Match Algorithms

The more common Match algorithms (programs) are:

match_doc_uid
match_doc_acc
match_doc_gen
match_doc_script


Match Algorithms (cont)

The match_doc_uid is based on the direct index (Z11).

The parameters column in the match table should contain either the index name (prefix I) or the tag code (prefix T).



Important notes for match_doc_uid:

Even if a direct index is used, the program might return several matches.

The program matches the normalized (filing) text, rather than the original text. (Normalization is different for different indexes.)
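The two notes above can be illustrated with a small Python sketch. The normalization rule shown (keep digits and X only) is a stand-in invented for the example, not ALEPH's actual filing routine, and the dictionary is only a toy substitute for the Z11 direct index.

```python
def normalize_isbn(value):
    """Hypothetical filing form: keep digits and 'X', drop everything else."""
    return "".join(ch for ch in value.upper() if ch.isdigit() or ch == "X")


# Toy direct index: normalized key -> system numbers of database records.
direct_index = {}


def index_record(sys_no, isbn):
    """Register a database record under its normalized ISBN."""
    direct_index.setdefault(normalize_isbn(isbn), []).append(sys_no)


def match_by_uid(isbn):
    """Return every system number whose normalized key equals the incoming one."""
    return list(direct_index.get(normalize_isbn(isbn), []))


index_record("000000101", "0-19-852663-6")
index_record("000000202", "0198526636")      # same ISBN, different punctuation
index_record("000000303", "0-306-40615-2")

matches = match_by_uid("0 19 852663 6")
print(matches)
```

Because the comparison uses the normalized text, two database records catalogued with differently punctuated ISBNs both match the incoming record, so a direct-index match can still return more than one hit.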



The match_doc_acc program is based on a headings (ACC) index.

The third column is a table name that lists the record tags that should be checked against the headings index.

Example (tab_match_acc):

! 1
!!!!!
245##
240##



Important notes for match_doc_acc:

If relevant tags (listed in the given table) are associated with several headings (e.g., TTL, NTL), then all headings of that type will be checked.

If an incoming record introduces a new heading, consecutive incoming records may not be able to use that heading immediately.



The match_doc_gen program is based on the headings (ACC) index, the direct index (Z11) and the direct system number.

The third column tells the program which index type to use (heading, direct, or sys) and which tag code or index code to extract from the incoming record.



Example:

!!!!!-!!!!!!!!!!!!!!-!!!!!!!!!!!!!!!!!!>
ISSN match_doc_gen TYPE=IND,TAG=022,SUBFIELD=a,CODE=ISSN

These parameters instruct the program to use the Z11 ISSN index and to match on tag 022 $$a of the incoming record.
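To make the parameter string concrete, the following Python sketch splits it into key/value pairs and uses them to pull the match value from an incoming record. The record representation (a list of tag and subfield-dictionary pairs) and both function names are assumptions for illustration only.

```python
def parse_params(spec):
    """Split a 'KEY=value,KEY=value' parameter string into a dictionary."""
    params = {}
    for item in spec.split(","):
        key, _, value = item.partition("=")
        params[key.strip()] = value.strip()
    return params


def extract_match_value(record, params):
    """Return the first value of the configured tag/subfield, or None.

    The record layout (list of (tag, {subfield: value})) is hypothetical.
    """
    for tag, subfields in record:
        if tag == params["TAG"] and params["SUBFIELD"] in subfields:
            return subfields[params["SUBFIELD"]]
    return None


params = parse_params("TYPE=IND,TAG=022,SUBFIELD=a,CODE=ISSN")

incoming = [
    ("245", {"a": "Journal of Examples"}),
    ("022", {"a": "1234-5679"}),
]
value = extract_match_value(incoming, params)
print(params["CODE"], value)
```

The extracted value would then be looked up in the index named by CODE; that lookup is left out because it depends on the index implementation.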



The match_doc_gen is also used when the ALEPH system number is already catalogued in the incoming record. In this case, the index-type is SYS.



The match_doc_script program is based on a command script that allows a user defined logical flow for matching. Example:

!1 2 3 4 5
!!-!!!!!!!!!!!!!!!!!!!!-!!!!-!!!!!!!!!!-!!!!!!!!!!!!!!!>
00 1 goto 02
00 0 stop

01 match_doc_gen 50- skip TYPE=ACC,TAG=245..
01 10+ goto 03

02 match_doc_gen 5+ goto 03 TYPE=ACC,TAG=001
02 0+ skip

03 match_doc_gen 5+ skip TYPE=ACC,TAG=001
03 1 stop


Finding Preferred Record

The tab_preferred table, located in $data_tab, defines a preferred_doc program to be executed, per context.

Example:

!!!!!!!!!!-!!!!!!!!!!!!!!!!!!!!!!-!!!!!!!!!>
OCLC preferred_doc_aleph1 weights_table1
RLIN preferred_doc_aleph1 weights_table2

In this example, the preferred_doc_aleph1 program must decide which of the two given documents is preferred, using the specified weights table.


Finding Preferred Record (cont)

A weights table might look like this:

! 1 2 3 4 5
!!!!!-!!!!!!-!!!!!!!!!!-!!!!!!!!!!!!!!!!!!!!!!!!!-!!!
LDR F05-01 EQUAL d -10
LDR F17-01 NOT-EQUAL 1,2,3,4,5,7,8,u,z 010
LDR F17-01 EQUAL 1 009
110## PRESENT 001
505## PRESENT 050

The document with the higher accumulative weight is the preferred one.
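The weighting can be sketched in Python as follows. This is an illustrative reading of the table, not the preferred_doc_aleph1 implementation: it assumes that "F05-01" means "one character starting at position 05 of the field", that a weight is added whenever its condition holds, and that a record can be modeled as a dictionary of tag to text. All function names are hypothetical.

```python
# Rules transcribed from the sample weights table:
# (tag, position spec, operator, comma-separated values, weight)
RULES = [
    ("LDR", "F05-01", "EQUAL", "d", -10),
    ("LDR", "F17-01", "NOT-EQUAL", "1,2,3,4,5,7,8,u,z", 10),
    ("LDR", "F17-01", "EQUAL", "1", 9),
    ("110", "", "PRESENT", "", 1),
    ("505", "", "PRESENT", "", 50),
]


def fixed_slice(text, spec):
    """Assumed reading of 'F05-01': start at offset 5, take 1 character."""
    start, length = spec[1:].split("-")
    return text[int(start):int(start) + int(length)]


def score(record):
    """Sum the weights of all rules whose condition holds for the record."""
    total = 0
    for tag, pos, op, values, weight in RULES:
        if op == "PRESENT":
            hit = tag in record
        else:
            listed = fixed_slice(record.get(tag, ""), pos) in values.split(",")
            hit = listed if op == "EQUAL" else not listed
        if hit:
            total += weight
    return total


def leader(status, level):
    """Build a 24-character leader with the given status (05) and level (17)."""
    return "00000" + status + "am a22" + "00000" + level + "a 4500"


incoming = {"LDR": leader("n", " "), "245": "A title", "505": "Contents note"}
existing = {"LDR": leader("n", "1"), "245": "A title", "110": "A corporate body"}

preferred = max((incoming, existing), key=score)
print(score(incoming), score(existing))
```

In this toy case the incoming record scores higher because of its encoding level and its 505 field, so it would be treated as the preferred record.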


Merge

The tab_merge table, located in $data_tab, defines a merge program to be executed, per procedure.

Example:

!!!!!!!!!!-!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!-!!!!!!>
OVERLAY-01 merge_doc_overlay 02
OVERLAY-02 merge_doc_overlay 02
OVERLAY-03 merge_doc_overlay 03
OVERLAY-04 merge_doc_overlay 04

OCLC merge_doc_overlay 01


Merge (cont)

The procedure name is defined elsewhere, depending on the module. For example, the OCLC server merge procedure is defined in tab_oclc col. 8.

! 1 2 3 4 5 6 7 8 9

!!!!-!!!-!!!!!-!!!!!-!!!!!-!-!-!!!!!!!!!!-!!!!!

7545 BIB USM01 1 N Y OCLC OCLC1

7545 AUT USM10 Y OCLC OCLC2


Merge Algorithms

The more common merge programs are:

merge_doc_overlay
merge_doc_replace
merge_doc_adv_overlay


Merge Algorithms (cont)

The merge_doc_replace program replaces the contents of an original record with the contents of a new record, while retaining the CAT fields from both records.


Merge Algorithms (cont)

The merge_doc_overlay program overlays the record according to the specifications defined in tab_merge_overlay.

The tab_doc_overlay was named tab_doc_merge in Version 14.x

The table may define multiple merge sets by using col. 1. Column 4 of the tab_merge table contains the merge set, performed when the routine (e.g., OVERLAY-01) is selected.


Merge Algorithms – tab_doc_overlay

Example:

!1 2 3 4
!!-!-!-!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

01 1 Y #####
01 1 N 051##
01 1 N 245##
01 2 Y 245##
01 2 Y 650##
01 2 N 001
01 2 Y 008
01 2 Y LDR



Merge Algorithms

The merge_doc_adv_overlay works essentially like merge_doc_overlay, with the following two differences:

1. Before merge is finished, tab_preferred is called with a context parameter hard-coded to AD-OVERLAY.

2. The tab_merge_adv_overlay table offers somewhat richer functionality: its conditions may be sensitive both to a tag’s existence and to its value.


Cataloging

The Cataloging GUI uses check_doc and fix_doc to implement match and merge, respectively:

When uploading a record, check_doc (CATALOG-INSERT) is executed. The program check_doc_match is used as an entry to tab_match.

When using copy/paste of a whole record, fix_doc (MERGE) is executed. The program fix_doc_merge is used as an entry to tab_merge.


Cataloging – Match


Cataloging – Merge

Original Record (1)


Cataloging – Merge (cont)

Copied Record (2)


Cataloging – Merge (cont)

Merged Record


OCLC Server

[Diagram: OCLC, OCLC Client (CatMe), OCLC Server (Match & Merge), ALEPH Database. The client searches OCLC.]


OCLC Server (cont)

1. Search in OCLC.

2. Send a record from the OCLC client to the OCLC server.

3. Check if there is a similar record in the ALEPH database (tab_match).

4. If no matching record is found, the record is added to the ALEPH database.

5. If a single matching record is found, both records are merged (tab_merge), and the merged record is added to the ALEPH database.

6. If multiple matches are found, an error is reported to the OCLC client.
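The server-side decision described above (no match, single match, multiple matches) can be sketched in Python. This is only an outline under stated assumptions: the database is a plain dictionary keyed by system number, and find_matches and merge are hypothetical callables standing in for whatever tab_match and tab_merge configure.

```python
def handle_incoming(record, database, find_matches, merge):
    """Add, merge, or reject an incoming record based on the number of matches."""
    matches = find_matches(record, database)
    if not matches:
        sys_no = f"{len(database) + 1:09d}"   # toy system-number allocation
        database[sys_no] = record
        return ("added", sys_no)
    if len(matches) == 1:
        sys_no = matches[0]
        database[sys_no] = merge(database[sys_no], record)
        return ("merged", sys_no)
    return ("error", "multiple matches: " + ", ".join(matches))


def by_isbn(record, database):
    """Hypothetical matcher: exact ISBN equality."""
    return [k for k, v in sorted(database.items()) if v["isbn"] == record["isbn"]]


def simple_merge(existing, incoming):
    """Hypothetical merge: incoming values overwrite, other fields are kept."""
    return {**existing, **incoming}


database = {
    "000000001": {"isbn": "111", "title": "A"},
    "000000002": {"isbn": "222", "title": "B"},
    "000000003": {"isbn": "222", "title": "B, second copy"},
}

added = handle_incoming({"isbn": "333", "title": "C"}, database, by_isbn, simple_merge)
merged = handle_incoming({"isbn": "111", "note": "x"}, database, by_isbn, simple_merge)
failed = handle_incoming({"isbn": "222", "title": "B"}, database, by_isbn, simple_merge)
print(added, merged, failed)
```

The multiple-match branch deliberately changes nothing in the database, which corresponds to the error being reported back to the client.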


OCLC Server – tab_oclc

! 1 2 3 4 5 6 7 8 9
!!!!-!!!-!!!!!-!!!!!-!!!!!-!-!-!!!!!!!!!!-!!!!!!!!!!
7545 BIB USM01 1 N Y OCLC OCLC
7545 AUT USM10 Y OCLC OCLC

tab_match
!!!!!!!!!!-!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!-!!!!!!!!!!!!
OCLC match_doc_uid I-ISBN

tab_preferred
!!!!!!!!!!-!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!-!!!!!!!!!!>
OCLC preferred_doc_aleph1 weights_tab1

tab_merge
!!!!!!!!!!-!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!-!!!!!!!
OCLC merge_doc_overlay 01


Resource File

Input file – ALEPH sequential format

Run p_manage_36. This function splits an input file of documents into three output files, according to user defined matching criteria.

Run p_manage_38. This function runs a merge routine on the second output of p_manage_36, and the records in the database.


Resource File (cont)

The first output file of p_manage_36 is loaded with p_manage_18, using NEW.

The output file of p_manage_38 is loaded with p_manage_18, using REPLACE.


Resource File – p_manage_36

Output File 1: Contains records that do not match any record in the database.

The records in this file are given a new sequential number (starting from 000000001).


Resource File – p_manage_36 (cont)

Output File 2: Contains records for which a unique single match was found in the database.

The records in this file are given new system numbers that match the system number of the matched record.


Resource File – p_manage_36 (cont)

Output File 3: Contains records for which more than one match was found in the database.

The records in this output file are given a new sequential number.
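The three-way split can be sketched in Python as follows. This is not the p_manage_36 code: find_matches is a hypothetical callable returning the system numbers of matching database records, and the in-memory lists stand in for the three output files.

```python
def split_by_match_count(records, find_matches):
    """Partition records into no-match, single-match and multi-match lists.

    Each output entry is (system number, record). Unmatched and multi-matched
    records get a fresh sequential number per file; single matches take the
    system number of the matched database record.
    """
    no_match, single_match, multi_match = [], [], []
    for record in records:
        matches = find_matches(record)
        if not matches:
            no_match.append((f"{len(no_match) + 1:09d}", record))
        elif len(matches) == 1:
            single_match.append((matches[0], record))
        else:
            multi_match.append((f"{len(multi_match) + 1:09d}", record))
    return no_match, single_match, multi_match


# Toy index standing in for the database lookup (assumption for the example).
toy_index = {"222": ["000000105"], "333": ["000000210", "000000377"]}
incoming = [{"isbn": "111"}, {"isbn": "222"}, {"isbn": "333"}, {"isbn": "444"}]

file1, file2, file3 = split_by_match_count(
    incoming, lambda rec: toy_index.get(rec["isbn"], [])
)
print(len(file1), len(file2), len(file3))
```

Keeping the multi-match records in their own list is what lets a staff user review them before any load takes place.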


Resource File – p_manage_38

The input file is the second output file of p_manage_36 (single-match records). For each record in the input file:

1. The record with the same system number is read from the database.

2. The preferred record is determined using tab_preferred.

3. The non-preferred record is merged into the preferred record using tab_merge.

4. The merged record is written to the output file.
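The four steps can be outlined in Python. This sketch is not the p_manage_38 code: the callables is_incoming_preferred and merge are hypothetical stand-ins for whatever tab_preferred and tab_merge configure, and records are modeled as plain dictionaries.

```python
def merge_single_matches(input_records, database, is_incoming_preferred, merge):
    """Run the read / prefer / merge / write loop over single-match records."""
    output = []
    for sys_no, incoming in input_records:
        existing = database[sys_no]                      # step 1: read from database
        if is_incoming_preferred(incoming, existing):    # step 2: pick preferred
            preferred, other = incoming, existing
        else:
            preferred, other = existing, incoming
        merged = merge(preferred, other)                 # step 3: merge into preferred
        output.append((sys_no, merged))                  # step 4: write to output
    return output


def prefer_more_fields(incoming, existing):
    """Hypothetical preference rule: the record with more fields wins."""
    return len(incoming) > len(existing)


def fill_from_other(preferred, other):
    """Hypothetical merge: keep preferred values, add fields only the other has."""
    return {**other, **preferred}


database = {"000000105": {"245": "Old title", "650": "Subject"}}
single_matches = [
    ("000000105", {"245": "New title", "100": "Author", "505": "Contents"}),
]

result = merge_single_matches(
    single_matches, database, prefer_more_fields, fill_from_other
)
print(result)
```

The database itself is left unchanged here, since in the workflow above the merged output is loaded in a separate step.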
