De-duplication of Bibliographic Records

Tsach Moshkovitz, Development Team Leader
Olybris, Ex Libris Seminar 2005
Kos, April 2005

Overview

De-Duplication is a required procedure whenever new records are introduced to the database.

De-duplication streamlines the process of loading old or new records as much as possible.

Overview (cont)

The main process steps are:
1) Matching
2) Finding the merge direction
3) Merging

Overview (cont)

Matching involves searching for similar database records that have specific given parameters.

Overview (cont)

Finding the merge direction involves identifying the preferred record (which the process sometimes implies). The non-preferred record is merged into the preferred record.

Overview (cont)

Merging involves blending a new record with a similar record in the database, according to a control table or a configuration table.

De-duplication vs. Union Catalog

The Union Catalog is a sophisticated mechanism for supporting integration of disparate libraries into a single environment.

The Union Catalog is based on Match and Merge algorithms.

ALEPH’s Union Catalog is not a unified catalog, in the sense that an actual Merge does not take place in ALEPH’s Union Catalog database.

De-duplication vs. Union Catalog (cont)

Union Catalog Match

An Equivalence table is created to map each record to a set of equivalent records.

It is totally acceptable to find more than one match in the Union Catalog.

The set of equivalent records also contains the preferred record.

De-duplication vs. Union Catalog (cont)

Union Catalog Merge

User View uses an on-the-fly merge to construct a virtual single record that is built from a group of equivalents.

The merge product is not saved in the database.

De-duplication vs. Union Catalog (cont)

ALEPH De-duplication:

Match - “Rigid”: searches for similar records.

Merge - Similar records are combined in the database. All substantial fields should remain.

Union:

Match - “Loose”: searches for equivalent records.

Merge - Virtual merge for display. Fields can be added to or omitted from the merged display.

De-duplication vs. Union Catalog (cont)

ALEPH De-duplication:

Preferred – Performed only between two records (incoming and matched).

Union:

Preferred – Performed on the entire equivalent set.

Sources for New Records

The main sources for new records are:

Cataloging GUI – A user catalogs a new record in the database.

Load Servers (such as OCLC server) – The server receives records from a search client (e.g., CatMe).

Batch Loading of resource files – Records are loaded in the database from an input file.

Sources for New Records (cont)

The different sources for new records allow different levels of user intervention during the process:

Cataloging GUI – Maximum user control during the process.
Loading Servers – Limited user control during the process.
Loading Resource Files – No user control during the process.

Sources for New Records (cont)

In general, the more a staff user is involved, the less rigid the match needs to be.

Q. How can the staff user be involved in batch loading a record?

A. A given input file can be split into three files (zero matched records, single matched records and multi-matched records) before the actual batch load.

Sources for New Records (cont)

In the Cataloging GUI, it is common for a new record not to contain enough information for a rigid match (e.g., fast cataloging).

Match and Merge

Match & Merge Setup

Version 15.2 and later feature a unified mechanism for Match and Merge.

The mechanism is similar to the check/fix doc mechanism, and defines the programs to be executed for different procedure codes (contexts).

The functionality is also available in Versions 14.x, but there it is configured separately for cataloging, load servers, and batch jobs.

Match and Merge (cont)

Match & Merge Setup

Just as in other Configuration tables, the Match/Preferred/Merge Conf tables feature the following three columns:

1) Section code (the context, e.g., OCLC)
2) Program(s) to execute
3) Program-specific parameters

Match

Example, tab_match:

!!!!!-!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!-!!!!!!!!!!!!!
OCLC  match_doc_uid                  I-ISBN
OCLC  match_doc_acc                  tab_match_acc

CAT   match_doc_uid                  I-ISBN
CAT   match_doc_acc                  tab_match_acc

YBP   match_doc_uid                  I-ISBN
YBP   match_doc_acc                  tab_match_acc
YBP   match_doc_gen                  TYPE=IND...
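As a rough illustration of how such a table maps a section code to an ordered list of match programs, the rows above can be read into a lookup structure. This is only a sketch: it assumes whitespace-separated columns and treats lines starting with "!" as header lines, whereas real ALEPH tables are fixed-width. The function name load_match_table is hypothetical, and nothing here says how ALEPH combines the results of several programs.

```python
from collections import defaultdict

# A few rows copied from the tab_match example above.
TAB_MATCH = """\
!!!!!-!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!-!!!!!!!!!!!!!
OCLC  match_doc_uid                  I-ISBN
OCLC  match_doc_acc                  tab_match_acc
CAT   match_doc_uid                  I-ISBN
CAT   match_doc_acc                  tab_match_acc
"""

def load_match_table(text):
    """Return {section code: [(program, parameters), ...]} in table order."""
    table = defaultdict(list)
    for line in text.splitlines():
        if not line.strip() or line.startswith("!"):
            continue  # skip blank lines and "!" header lines
        section, program, *rest = line.split(None, 2)
        table[section].append((program, rest[0].strip() if rest else ""))
    return dict(table)

programs = load_match_table(TAB_MATCH)
print(programs["OCLC"])
```

Looking up programs["CAT"] would give the programs configured for the cataloging context in the same way.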

Match (cont)

The procedure name is defined elsewhere, depending on the module. For example, the OCLC server match procedure is defined in tab_oclc col. 9.

! 1 2 3 4 5 6 7 8 9

!!!!-!!!-!!!!!-!!!!!-!!!!!-!-!-!!!!!!!!!!-!!!!!

7545 BIB USM01 1 N Y OCLC OCLC

7545 AUT USM10 Y OCLC OCLC

Match Algorithms

The more common Match algorithms (programs) are:

match_doc_uid
match_doc_acc
match_doc_gen
match_doc_script

Match Algorithms (cont)

The match_doc_uid is based on the direct index (Z11).

The parameters column in the match table should contain either the index name (prefix I) or the tag code (prefix T).

Match Algorithms (cont)

Important notes for match_doc_uid:

Even if a direct index is used, the program might return several matches.

The program matches the normalized (filing) text, rather than the original text. (Normalization is different for different indexes.)
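Both notes can be illustrated with a toy direct index. The sketch below is not ALEPH code: the normalize function is a made-up filing routine (keep letters and digits, upper-case them), and the index is a plain dictionary standing in for Z11. It shows how two records whose original text differs can share one normalized key, so that a single lookup returns several matches.

```python
def normalize(text):
    """Hypothetical filing routine: keep letters and digits, upper-case them."""
    return "".join(ch for ch in text if ch.isalnum()).upper()

# Toy direct index: normalized key -> system numbers of records with that key.
direct_index = {}

def index_record(sys_no, value):
    direct_index.setdefault(normalize(value), []).append(sys_no)

def match_uid(value):
    """Look up the normalized form of the incoming value."""
    return direct_index.get(normalize(value), [])

index_record("000000101", "0-306-40615-2")
index_record("000000202", "0306406152")

matches = match_uid("0 306 40615 2")
print(matches)  # both records share the normalized key 0306406152
```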

Match Algorithms (cont)

The match_doc_acc program is based on a headings (ACC) index.

The third column is a table name that lists the record tags that should be checked against the headings index.

Example (tab_match_acc):

! 1
!!!!!
245##
240##

Match Algorithms (cont)

Important notes for match_doc_acc:

If relevant tags (listed in the given table) are associated with several headings (e.g., TTL, NTL), then all headings of that type will be checked.

If an incoming record introduces a new heading, consecutive incoming records may not be able to use that heading immediately.

Match Algorithms (cont)

The match_doc_gen program is based on the headings (ACC) index, the direct index (Z11) and the direct system number.

The third column tells the program which index type to use (heading, direct, or sys) and which tag code or index code to extract from the incoming record.

Match Algorithms (cont)

Example:

!!!!!-!!!!!!!!!!!!!!-!!!!!!!!!!!!!!!!!!>
ISSN  match_doc_gen  TYPE=IND,TAG=022,SUBFIELD=a,CODE=ISSN

The parameters instruct the program to use the Z11 ISSN index and to match on tag 022 $$a of the incoming record.
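To make the parameter column concrete, the sketch below splits the example string into key/value pairs and uses them to pick the value to look up from an incoming record. It is an illustration only: parse_params and the dictionary layout of the incoming record are assumptions, not part of ALEPH.

```python
def parse_params(spec):
    """Split 'KEY=value,KEY=value,...' into a dict of strings."""
    return dict(item.split("=", 1) for item in spec.split(","))

params = parse_params("TYPE=IND,TAG=022,SUBFIELD=a,CODE=ISSN")

# Hypothetical incoming record: tag -> {subfield code: value}.
incoming = {"022": {"a": "0028-0836"}}

# The value that would be searched in the index named by CODE.
lookup_value = incoming[params["TAG"]][params["SUBFIELD"]]
print(params["TYPE"], params["CODE"], lookup_value)
```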

Match Algorithms (cont)

The match_doc_gen is also used when the ALEPH system number is already catalogued in the incoming record. In this case, the index-type is SYS.

Match Algorithms (cont)

The match_doc_script program is based on a command script that allows a user defined logical flow for matching. Example:

!1 2 3 4 5
!!-!!!!!!!!!!!!!!!!!!!!-!!!!-!!!!!!!!!!-!!!!!!!!!!!!!!!>
00                      1    goto 02
00                      0    stop
01 match_doc_gen        50-  skip       TYPE=ACC,TAG=245..
01                      10+  goto 03
02 match_doc_gen        5+   goto 03    TYPE=ACC,TAG=001
02                      0+   skip
03 match_doc_gen        5+   skip       TYPE=ACC,TAG=001
03                      1    stop

Finding Preferred Record

The tab_preferred table, located in $data_tab, defines a preferred_doc program to be executed, per context.

Example:

!!!!!!!!!!-!!!!!!!!!!!!!!!!!!!!!!-!!!!!!!!!>
OCLC       preferred_doc_aleph1   weights_table1
RLIN       preferred_doc_aleph1   weights_table2

In this example, preferred_doc_aleph1 must decide which of the two given documents is preferred, using the assigned weights table.

Finding Preferred Record (cont)

A weights table might look like this:

! 1 2 3 4 5
!!!!!-!!!!!!-!!!!!!!!!!-!!!!!!!!!!!!!!!!!!!!!!!!!-!!!
LDR   F05-01 EQUAL      d                         -10
LDR   F17-01 NOT-EQUAL  1,2,3,4,5,7,8,u,z         010
LDR   F17-01 EQUAL      1                         009
110##        PRESENT                              001
505##        PRESENT                              050

The document with the higher cumulative weight is the preferred one.
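The scoring idea can be sketched as follows. This is a simplified model under stated assumptions, not the preferred_doc_aleph1 implementation: it reads "F05-01" as "start at zero-based offset 05, length 01", treats a rule's weight as added whenever its condition holds, ignores indicators in tag patterns such as 110##, and breaks ties in favor of the first document. The two sample records are invented.

```python
# Rules copied from the example table: (tag, position, operator, values, weight).
RULES = [
    ("LDR", "F05-01", "EQUAL",     "d",                 -10),
    ("LDR", "F17-01", "NOT-EQUAL", "1,2,3,4,5,7,8,u,z",  10),
    ("LDR", "F17-01", "EQUAL",     "1",                   9),
    ("110", "",       "PRESENT",   "",                    1),
    ("505", "",       "PRESENT",   "",                   50),
]

def slice_of(value, position):
    """Assumed reading of 'F05-01': zero-based offset 05, length 01."""
    start, length = int(position[1:3]), int(position[4:6])
    return value[start:start + length]

def weight(record):
    """Sum the weights of all rules whose condition holds for the record."""
    total = 0
    for tag, position, operator, values, points in RULES:
        if operator == "PRESENT":
            hit = tag in record
        else:
            listed = slice_of(record.get(tag, ""), position) in values.split(",")
            hit = listed if operator == "EQUAL" else not listed
        if hit:
            total += points
    return total

def preferred(doc_a, doc_b):
    """Return the document with the higher cumulative weight (ties favor doc_a)."""
    return doc_a if weight(doc_a) >= weight(doc_b) else doc_b

# Invented sample records: tag -> field content.
incoming = {"LDR": "00000nam a2200000 a 4500", "110": "Example Corp.", "505": "Contents"}
existing = {"LDR": "00000dam a22000001a 4500"}

print(weight(incoming), weight(existing))
```

With these sample records the incoming document collects the weights for the second, fourth and fifth rules, while the existing one collects the first and third, so the incoming document would be preferred.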

Merge

The tab_merge table, located in $data_tab, defines a merge program to be executed, per procedure.

Example:

!!!!!!!!!!-!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!-!!!!!!>
OVERLAY-01 merge_doc_overlay              02
OVERLAY-02 merge_doc_overlay              02
OVERLAY-03 merge_doc_overlay              03
OVERLAY-04 merge_doc_overlay              04

OCLC       merge_doc_overlay              01

Merge (cont)

The procedure name is defined elsewhere, depending on the module. For example, the OCLC server merge procedure is defined in tab_oclc col. 8.

! 1 2 3 4 5 6 7 8 9

!!!!-!!!-!!!!!-!!!!!-!!!!!-!-!-!!!!!!!!!!-!!!!!

7545 BIB USM01 1 N Y OCLC OCLC1

7545 AUT USM10 Y OCLC OCLC2

Merge Algorithms

The more common merge programs are:

merge_doc_overlay
merge_doc_replace
merge_doc_adv_overlay

Merge Algorithms (cont)

The merge_doc_replace program replaces the contents of an original record with the contents of a new record, while retaining the CAT fields from both records.

Merge Algorithms (cont)

The merge_doc_overlay program overlays the record according to the specifications defined in tab_merge_overlay.

The tab_doc_overlay was named tab_doc_merge in Version 14.x.

The table may define multiple merge sets by using col. 1. Column 4 of the tab_merge table contains the merge set, which is performed when the routine (e.g., OVERLAY-01) is selected.

Merge Algorithms – tab_doc_overlay

Example:

!1 2 3 4
!!-!-!-!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
01 1 Y #####
01 1 N 051##
01 1 N 245##
01 2 Y 245##
01 2 Y 650##
01 2 N 001
01 2 Y 008
01 2 Y LDR
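One plausible reading of these rows can be sketched in code. The slides do not spell out the column semantics, so everything below is an assumption: column 2 is taken as the record (1 = original, 2 = copied), column 3 as retain (Y) or drop (N), "#" as a wildcard character, and the last matching rule for a field is taken to win. Field ordering and non-repeatable fields present in both records (such as LDR) are not modeled, and the sample records are invented.

```python
RULES = [  # (merge set, record, retain, tag pattern), copied from the example
    ("01", "1", "Y", "#####"),
    ("01", "1", "N", "051##"),
    ("01", "1", "N", "245##"),
    ("01", "2", "Y", "245##"),
    ("01", "2", "Y", "650##"),
    ("01", "2", "N", "001"),
    ("01", "2", "Y", "008"),
    ("01", "2", "Y", "LDR"),
]

def tag_matches(pattern, tag):
    """'#' matches any character; comparison stops at the shorter string."""
    return all(p in ("#", t) for p, t in zip(pattern, tag))

def retained(merge_set, record_no, tag):
    keep = False  # assumption: fields with no matching rule are dropped
    for rule_set, rule_record, flag, pattern in RULES:
        if (rule_set, rule_record) == (merge_set, record_no) and tag_matches(pattern, tag):
            keep = flag == "Y"  # assumption: the last matching rule wins
    return keep

def overlay(original, copied, merge_set="01"):
    """Keep the retained fields of record 1, then those of record 2."""
    merged = [(t, v) for t, v in original if retained(merge_set, "1", t)]
    merged += [(t, v) for t, v in copied if retained(merge_set, "2", t)]
    return merged

# Invented records: (tag plus indicators, content).
original = [("001", "000012345"), ("05100", "Old call number"),
            ("24510", "Old title"), ("500  ", "Local note")]
copied = [("001", "ocm00000001"), ("24510", "New title"), ("650 0", "Subject")]

merged = overlay(original, copied)
for tag, value in merged:
    print(tag, value)
```

Under these assumptions the merged record keeps the original 001 and 500 fields and takes the 245 and 650 fields from the copied record.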

Merge Algorithms – tab_doc_overlay

tab_merge

Merge Algorithms

The merge_doc_adv_overlay works essentially like merge_doc_overlay, with the following two differences:

1. Before merge is finished, tab_preferred is called with a context parameter hard-coded to AD-OVERLAY.

2. The tab_merge_adv_overlay has slightly better functionality, and the conditions may be sensitive both to the tag’s existence and its value.

Cataloging

The Cataloging GUI uses check_doc and fix_doc to implement match and merge, respectively:

When uploading a record, check_doc (CATALOG-INSERT) is executed. The program check_doc_match is used as an entry to tab_match.

When using copy/paste of a whole record, fix_doc (MERGE) is executed. The program fix_doc_merge is used as an entry to tab_merge.

Cataloging – Match

Cataloging – Merge

Original Record (1)

Cataloging – Merge (cont)

Copied Record (2)

Cataloging – Merge (cont)

Merged Record

OCLC Server

[Diagram: OCLC, the OCLC Client (CatMe), the OCLC Server, and the ALEPH Database, with the steps Search and Match & Merge.]

OCLC Server (cont)

Search in OCLC.

Send a record from the OCLC client to the OCLC server.

Check if there is a similar record in the ALEPH database (tab_match).

If no matching record is found, the record is added to the ALEPH database.

If a single matching record is found, both records are merged (tab_merge), and the merged record is added to the ALEPH database.

If multiple matches are found, an error is reported to the OCLC client.
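The zero, single and multiple match cases above can be summarized in a short sketch. It is not the OCLC server code: the database is a plain list, and find_by_isbn and merge_fields are hypothetical stand-ins for the routines configured in tab_match and tab_merge. Choosing the preferred record is left out.

```python
def handle_incoming(record, find_matches, merge, database):
    """Apply the zero / single / multiple match rule to one incoming record."""
    matches = find_matches(record, database)
    if not matches:
        database.append(record)            # no match: add as a new record
        return "added"
    if len(matches) == 1:
        index = matches[0]                 # single match: merge and store
        database[index] = merge(database[index], record)
        return "merged"
    return "error: multiple matches"       # reported back to the client

def find_by_isbn(record, database):
    """Stand-in match routine: positions of records with the same ISBN."""
    return [i for i, doc in enumerate(database) if doc["isbn"] == record["isbn"]]

def merge_fields(existing, incoming):
    """Stand-in merge routine: incoming fields overwrite existing ones."""
    return {**existing, **incoming}

db = [
    {"isbn": "111", "title": "A"},
    {"isbn": "222", "title": "B"},
    {"isbn": "222", "title": "B, second copy"},
]

results = [
    handle_incoming({"isbn": "333", "title": "C"}, find_by_isbn, merge_fields, db),
    handle_incoming({"isbn": "111", "note": "x"}, find_by_isbn, merge_fields, db),
    handle_incoming({"isbn": "222", "title": "?"}, find_by_isbn, merge_fields, db),
]
print(results)
```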

OCLC Server

tab_oclc

! 1 2 3 4 5 6 7 8 9
!!!!-!!!-!!!!!-!!!!!-!!!!!-!-!-!!!!!!!!!!-!!!!!!!!!!
7545 BIB USM01 1 N Y OCLC OCLC
7545 AUT USM10 Y OCLC OCLC

tab_match

!!!!!!!!!!-!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!-!!!!!!!!!!!!
OCLC       match_doc_uid                  I-ISBN

tab_preferred

!!!!!!!!!!-!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!-!!!!!!!!!!>
OCLC       preferred_doc_aleph1           weights_tab1

tab_merge

!!!!!!!!!!-!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!-!!!!!!!
OCLC       merge_doc_overlay              01

Resource File

Input file – ALEPH sequential format

Run p_manage_36. This function splits an input file of documents into three output files, according to user defined matching criteria.

Run p_manage_38. This function runs a merge routine on the second output of p_manage_36, and the records in the database.

Resource File (cont)

The first output of p_manage_36 is loaded with p_manage_18, using NEW.

The output of p_manage_38 is loaded with p_manage_18, using REPLACE.

Resource File – p_manage_36

Output File 1:

Contains records that do not match any record in the database.

The records in this file are given a new sequential number (starting from 000000001).

Resource File – p_manage_36 (cont)

Output File 2:

Contains records for which a unique single match was found in the database.

The records in this file are given new system numbers that match the system number of the matched record.

Resource File – p_manage_36 (cont)

Output File 3:

Contains records for which more than one match was found.

The records in this output file are given a new sequential number.
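The three-way split can be pictured with a small sketch. It only models the routing by match count; the renumbering of records described for each output file is not modeled, and split_by_match_count and the toy index are hypothetical names, not part of p_manage_36.

```python
def split_by_match_count(records, find_matches):
    """Return (no_match, single_match, multi_match) lists of records."""
    outputs = ([], [], [])
    for record in records:
        hits = find_matches(record)
        # 0 hits -> first list, 1 hit -> second, 2 or more -> third.
        outputs[min(len(hits), 2)].append(record)
    return outputs

# Toy index standing in for the database: ISBN -> matching system numbers.
index = {"111": ["000000007"], "222": ["000000008", "000000009"]}
records = [{"isbn": "111"}, {"isbn": "222"}, {"isbn": "333"}]

no_match, single_match, multi_match = split_by_match_count(
    records, lambda record: index.get(record["isbn"], [])
)
print(len(no_match), len(single_match), len(multi_match))
```

Splitting first gives staff a chance to review the multi-match file by hand before anything is loaded.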

Resource File – p_manage_38

The input file is the second output file of p_manage_36 (single match records).

For each record in the input file:

1. The matching record is read from the database.

2. The preferred record is determined using tab_preferred.

3. The non-preferred record is merged into the preferred record using tab_merge.

4. The merged record is written to the output file.