Matching names in parallel T. Hickey Access 2006 2006 October

Page 1: Matching names in parallel T. Hickey Access 2006 2006 October

Matching names in parallel

T. Hickey

Access 2006, October 2006

Page 2

Virtual International Authority File

Link national authority records
Build on their authority work
Move towards universal bibliographic control
• Allow national or regional variations in authorized forms to co-exist
• Support needs for variations in preferred language, script, and spelling
• 10 million WorldCat records in non-English metadata

Page 3

Joint VIAF Project

Page 4

Matching Variations

In the LCNAF and PND authority files:
• Same name, same person
• Same name, different people
• Different names, same person
• Missing person in one file

Page 5

Two Different People – One Name

Adams, Mike
• PND: a golfer
• LCNAF: author of a Beatles collector's guide

Same name, different people

Page 6

One Person – Two Names

• LCNAF: Morel, Pierre
• PND: Morellus, Petrus

Same person, different names

Page 7

Enhancing the Authorities

[Diagram: boxes labeled Bibliographic Record, Derived Authority, Authority Record, and Enhanced Authority]

Page 8

Strong Matching Attributes

A work (title) in common
Common control numbers (ISBN, ISSN, or LCCN)
Exact birth and death year
Joint authors
Name as subject

Page 9

Weaker Attributes

Only one of birth/death date(s) (allows some variation)
Subject area of works (two levels)
Format (books, films, musical scores, etc.)
Language
Publisher
Partial title match
Date of publication
Country
Role (author, illustrator, composer, etc.)
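One way the strong and weaker attributes could be combined is a weighted score per candidate pair. The slides name the attributes but not how they are scored, so the weights, the threshold, and the attribute keys below are all hypothetical.

```python
# Hypothetical weights: the slides list the attributes, not their scores.
STRONG = {
    "title_in_common": 5.0,
    "control_number": 5.0,
    "exact_birth_death": 5.0,
    "joint_author": 4.0,
    "name_as_subject": 4.0,
}
WEAK = {
    "one_date": 1.5,
    "subject_area": 1.0,
    "format": 0.5,
    "language": 0.5,
    "publisher": 1.0,
    "partial_title": 1.5,
    "publication_date": 0.5,
    "country": 0.5,
    "role": 0.5,
}
WEIGHTS = {**STRONG, **WEAK}


def score_pair(shared):
    """Sum the weights of the attributes two records were found to share."""
    unknown = set(shared) - set(WEIGHTS)
    if unknown:
        raise ValueError(f"unknown attributes: {sorted(unknown)}")
    return sum(WEIGHTS[name] for name in shared)


def is_match(shared, threshold=5.0):
    """Accept a pair once its evidence reaches the (assumed) threshold."""
    return score_pair(shared) >= threshold
```

With these made-up numbers a single strong attribute is enough to accept a pair, while weak attributes only count when several of them agree.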

Page 10

Computing it

Standard approach
• Generate keys and data
• Load information into a database
• Index it
• Extract fields needed

Map/Reduce approach
• Split the database up
• Run parallel jobs
• Bring information together via map/reduce
• Assemble information in stages

Page 11

Map/Reduce

Two stages
• Map
  • Read in source file (e.g. MARC-21)
  • Write out key + data
• Reduce
  • Read in array of data for each unique key
  • Write out key + data
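The two stages can be sketched in a few lines of single-process Python. This is only an illustration of the key + data contract, not the OCLC code: `map_reduce`, `name_mapper`, and `id_list_reducer` are hypothetical names, and plain dicts stand in for parsed MARC-21 records.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Run mapper over each record, group the emitted data by key, reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, data in mapper(record):       # map: write out key + data
            groups[key].append(data)
    output = []
    for key in sorted(groups):                 # reduce: one call per unique key
        output.extend(reducer(key, groups[key]))
    return output


def name_mapper(record):
    # A plain dict stands in for a parsed MARC-21 record (assumption).
    yield record["name"].lower(), record["id"]


def id_list_reducer(key, values):
    # Receives the array of data collected for one key.
    yield key, sorted(values)


records = [
    {"id": "a1", "name": "Adams, Mike"},
    {"id": "b7", "name": "adams, mike"},
    {"id": "a2", "name": "Morel, Pierre"},
]
result = map_reduce(records, name_mapper, id_list_reducer)
```

In a parallel setting the grouping step is what the framework distributes; the mapper and reducer stay the same.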

Page 12

Overview of MapReduce

Source: Dean & Ghemawat (Google)

Page 13

Our Implementation

Written in Python
Uses ssh and XML-RPC for control and communication
Map/Reduce seems to add ~10% overhead
Ran an earlier implementation on a 48-CPU cluster
Current VIAF cluster is a 12-CPU cluster on 4 nodes
Running Linux and 64-bit Python
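To give a feel for XML-RPC as a control channel, here is a minimal sketch using today's Python 3 standard library (`xmlrpc.server` and `xmlrpc.client`). The slides do not describe the actual protocol, so the `count_lines` task and both helper functions are hypothetical, and the worker runs on localhost rather than on a separate node started over ssh.

```python
import threading
import xmlrpc.client
from xmlrpc.server import SimpleXMLRPCServer


def count_lines(text):
    """Stand-in for a unit of work a node might run on request (hypothetical)."""
    return len(text.splitlines())


def start_worker(host="127.0.0.1", port=0):
    """Start a worker in a background thread; port 0 lets the OS pick a free port."""
    server = SimpleXMLRPCServer((host, port), logRequests=False)
    server.register_function(count_lines, "count_lines")
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def run_remote(server, text):
    """Call the worker as a controller would, through an XML-RPC proxy."""
    host, port = server.server_address
    with xmlrpc.client.ServerProxy(f"http://{host}:{port}/") as proxy:
        return proxy.count_lines(text)
```

A controller would hold one proxy per node and hand each a share of the input; call `server.shutdown()` and `server.server_close()` when the worker is no longer needed.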

Page 14

VIAF Matching Code

17 modules
1,100 lines of code
Plus
• 600 lines of configuration
• 2,755 lines of tables embedded in code

Page 15

VIAF Data Flow

[Diagram. Sources: LC Authority, LC Catalog, PND Authority, PND Catalog, each feeding an Extract Data step. Processing steps, with the key: data label shown next to each where one is legible:]
• get changed Ids (changed authority ids)
• build buckets (surname: forename, date)
• eliminate forename, date conflicts from buckets (potential pairs)
• build compare data (id: tag, data)
• build name:id map (name: id)
• map authorities (authority id: bib id)
• identify compare data (pair id: [bib/auth] id)
• select compare data (pair id: compare data)
• compare (pair id: scores)
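The bucket steps in the diagram can be sketched as follows. The record layout (`source`, `id`, `surname`, `forename`, `birth`) and the conflict rule are assumptions made for illustration; the slides only give the bucket key (surname) and the data used to prune it (forename, date).

```python
from collections import defaultdict
from itertools import combinations


def build_buckets(records):
    """Group records by normalized surname (the bucket key)."""
    buckets = defaultdict(list)
    for rec in records:
        buckets[rec["surname"].strip().lower()].append(rec)
    return buckets


def conflicts(a, b):
    """True when both records carry a forename or birth year and the values differ."""
    for field in ("forename", "birth"):
        if a.get(field) and b.get(field) and a[field] != b[field]:
            return True
    return False


def potential_pairs(records):
    """Return id pairs from different files that share a bucket and do not conflict."""
    pairs = []
    for bucket in build_buckets(records).values():
        for a, b in combinations(bucket, 2):
            if a["source"] != b["source"] and not conflicts(a, b):
                pairs.append((a["id"], b["id"]))
    return pairs


# Made-up records, not taken from LCNAF or PND.
records = [
    {"source": "LCNAF", "id": "lc1", "surname": "Example", "forename": "Ann", "birth": 1950},
    {"source": "PND", "id": "pnd1", "surname": "Example", "forename": "Ann", "birth": None},
    {"source": "PND", "id": "pnd2", "surname": "Example", "forename": "Ann", "birth": 1971},
    {"source": "PND", "id": "pnd3", "surname": "Example", "forename": "Bob", "birth": 1950},
]
pairs = potential_pairs(records)
```

Only the pair with no forename or date conflict survives; the surviving pairs are what the later compare step would score.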

Page 16

WorldCat Identities

Bring together all of WorldCat’s information about people
• Name(s)
• Works by and about
• Subjects
• Dates
• Fiction/non-fiction
• Roles
• Co-authors

Add links
• Wikipedia
• Authority files

Page 17

Sample Identity

[Screenshot of a sample identity page]

Page 18

Statistics

Nearly 19 million different ‘identities’ in WorldCat
80 million (nominally) controlled headings

The WorldCat Identity code is ~800 lines of Python in 4 modules (plus XSLT, CSS, etc.)

Page 19

Identities Data Flow

[Diagram: Stages 1 to 4. Labels shown: WorldCat, Authorities, NameInfo, Citations, Cover Art, FRBR, Audience, Wikipedia, Identities]

Page 20

Identities Stage 1: Extract Data From WorldCat

Input: WorldCat (MARC-21)
Map output:
• NameKey <nameInfo>
• WorkID <citation>
Reduce output:
• WorkID <best citation>
• NameKey <cumulative nameInfo>
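A sketch of what the two Stage 1 reducers might look like. The slides do not say how the best citation is chosen or what nameInfo contains, so "most fields filled in" and the `form` field are assumptions, as are the function names.

```python
def reduce_work(work_id, citations):
    """Keep one citation per work; 'best' is assumed to mean most fields filled in."""
    best = max(citations, key=lambda c: sum(1 for v in c.values() if v))
    return work_id, best


def reduce_name(name_key, name_infos):
    """Fold the per-record nameInfo dicts for one name into a cumulative summary."""
    total = {"records": 0, "forms": set()}
    for info in name_infos:
        total["records"] += 1
        total["forms"].add(info["form"])   # 'form' is a hypothetical field
    return name_key, total
```

Both reducers follow the same shape as the generic reduce step: they read the array of data for one key and write out a single key + data pair.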

Page 21

Identities Stage 2: Extract Data From Authorities

Input: NACO Authorities file (MARC-21)
Map output:
• NameKey <authorityInfo>
• XTos
• XFroms
Reduce output:
• NameKey <authorityInfo, symmetric xrefs>
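Making cross-references symmetric can be sketched as below, reading XTos and XFroms as references to and from a name (an interpretation, since the slides do not define them). The function name and the `(from_key, to_key)` input shape are hypothetical.

```python
from collections import defaultdict


def symmetric_xrefs(links):
    """Given (from_key, to_key) references, return each key's links in both directions."""
    xrefs = defaultdict(set)
    for source, target in links:
        xrefs[source].add(target)
        xrefs[target].add(source)   # add the reverse link so the relation is symmetric
    return {key: sorted(values) for key, values in xrefs.items()}
```

For example, `symmetric_xrefs([("a", "b"), ("c", "b")])` gives `b` links to both `a` and `c`, even though neither reference started at `b`.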

Page 22

Identities Stage 3: Connect Citations with Names

Input: Stage 1 output
• WorkID <by/about citations>
• NameKey <nameInfo>
Map output:
• NameKey <nameInfo>
• NameKey <topCitations>
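The re-keying in this stage, from WorkID to NameKey, might look like the sketch below. The citation layout (`by`, `about`, `holdings`) and ranking by holdings count are assumptions; the slides only say that top citations are kept per name.

```python
from collections import defaultdict


def rekey_citations(works):
    """Turn WorkID -> citation into NameKey -> (relation, work_id, holdings) entries."""
    for work_id, citation in works.items():
        for relation in ("by", "about"):
            for name_key in citation.get(relation, []):
                yield name_key, (relation, work_id, citation["holdings"])


def top_citations(pairs, limit=2):
    """Group entries by NameKey and keep the `limit` most widely held works per name."""
    grouped = defaultdict(list)
    for name_key, entry in pairs:
        grouped[name_key].append(entry)
    return {
        name: sorted(entries, key=lambda e: e[2], reverse=True)[:limit]
        for name, entries in grouped.items()
    }


# Made-up works and holdings counts.
works = {
    "w1": {"by": ["n1"], "about": [], "holdings": 10},
    "w2": {"by": ["n1"], "about": ["n2"], "holdings": 50},
    "w3": {"by": ["n1"], "holdings": 30},
}
top = top_citations(rekey_citations(works))
```

The same work can land under several names, once as a work by a person and again as a work about someone else.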

Page 23

Identities Stage 4: Create Identities

Input:
• Authority info from stage 2
• Merged name info from stage 3
• Merged citations from stage 3
Map output:
• Pass through
Reduce output:
• Pnkey <Identity Record>
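Because the map step passes everything through, the work in this stage is in the reduce. A minimal sketch, assuming each value arrives tagged with its kind; the tags and the record layout are hypothetical.

```python
def create_identity(pnkey, values):
    """Fold every value that arrived under one key into a single identity record."""
    identity = {"pnkey": pnkey, "authority": None, "names": [], "citations": []}
    for kind, payload in values:
        if kind == "authority":
            identity["authority"] = payload
        elif kind == "nameInfo":
            identity["names"].append(payload)
        elif kind == "citation":
            identity["citations"].append(payload)
        else:
            raise ValueError(f"unexpected value type: {kind}")
    return identity
```

A name with no authority record still produces an identity; its `authority` field simply stays empty.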

Page 24

Schedules

Identities
• Up this year?

VIAF
• Reload, rematch this year
• Public service up early 2007

Page 25

Conclusions

Our merged files (e.g. WorldCat) are really quite large
More processing power opens up new ways of manipulating and looking at our data
Parallel processing is the only way to obtain the cycles needed
Map-Reduce is an attractive way to do parallel processing
• Forces decomposition
• Scales well
• Opens up new possibilities

Page 26

Thank you

T. Hickey
VIAF.org
http://errol.oclc.org/laf/n82-54463.html

Access 2006, October 2006