TRANSCRIPT
Lorcan Dempsey (with contributions from colleagues)
VP Research and Chief Strategist
Library of Congress, 15 June 2004
OCLC: some development and research directions in the areas of metadata management and knowledge organization.
Presented to Library of Congress cataloging managers retreat.
Topics
• Framework for WorldCat directions
• Metadata management and knowledge organization
• Working with web services
• Making data work harder
• Some research, some production
• Open WorldCat
Framework for WorldCat directions
Collections grid (figure: axes are stewardship and uniqueness, each running between low and high)
• Books, Journals: Newspapers; Gov. docs; CD, DVD; Maps; Scores
• Special collections, Archives: Rare books; Local history materials; Archives & Manuscripts; Theses & dissertations
• Research and learning materials: ePrints/tech reports; Learning objects; Courseware; E-portfolios; Research data
• Untransferred records
• Freely-accessible web resources
WorldCat – the what?
WorldCat: - Grow - Version - Improve
• Easier to use (FRBR)
• Microcontent
• Evaluative content
Add special collections & institutional content to WorldCat: dissertations,
cultural heritage collections, Eprints, learning objects
The Open Web: both surface and acquire WorldCat content
WorldCat – the how? (diagram; labels as extracted)
• Prospects: OAI Repositories, CONTENTdm, DSpace, ILS, TOCs, Cover Art
• Verification: TEI, DC, M21, EAD, METS, MPEG21
• Validation: validating to a standard such as AACR2
• Collection Metadata Creation: Administrative Metadata, Service Description, Content Description
• Conversion: DC -> M21, HTML -> DC, M21 -> XML
• Enhancement Services: Metadata Capture, Auto-Dewey, Authorities, Set Holdings
• Users: Academic Cataloger, Public Reference Librarian, Public Lib. Patron, Web Surfer
• Selection: WorldCat Collection Development Policy
• Transmission: OAI, FTP, Tapes, Crawl
• Composite Services: FirstSearch, Group Cat., Connexion, ILL/RM, Collection Management, Google/Yahoo/Amazon
• Micro Service (four boxes)
• Publish: WorldCat, Open WorldCat
• Access
• Enhance
• Ref. Stays with Items
Research in these areas
Some issues
• Metadata variety
– Encoding, element sets, values/content
– Provenance
• Metadata manipulation
– Validation, identification
– Enhancement, augmentation
– Relation, FRBR, deduplication
– Transformation
• Schematization and web services
– Make data available in forms that allow machine services to be flexibly built on top of them
– Everything is a service
Open WorldCat
• Facilitate the rendezvous of users and library services on the web
• Surface the library where the users are
• Help release the value of library services in the working and learning lives of their users.
Open WorldCat Architecture
Aggregators
Schemas and Vocabularies
Profiles and Relationships
Content Owner
Portals
Metadata
Distribution, Search, Display
Access
Google, Yahoo and Book Vendors Organization and Presentation
OCLC organizes WorldCat content in a model suitable for harvesting, anticipating the unique aspects of various portals.
OCLC uses a host of authentication and authorization tools to progressively match content to rights.
OCLC developed geo-locator services to match users to extensive FirstSearch WorldCat institution and user profiles.
WorldCat: additional collections can be added to the Worldcatlibraries domain.
OCLC will use tools such as xISBN and FRBR models to organize WorldCat public views suitable for low-precision access.
Current partners
• Book vendors and bibliographies: ABE Books, ABAA, Alibris, HCBIB, BookPage
• Search engines (pilot with 2M records exposed as web pages for harvesting): Google, Yahoo!
Click in presentation mode to go through to examples
Try a search for: A history of caricature and grotesque in literature and art
Google and Yahoo! timeline
8/14/03: Google contract signed
9/19/03: Google given go-ahead to harvest records
10/22/03: Google harvests 150,000 records
Dec. ’03: Records begin to appear in Google; 800 inbound links logged (search-site-originating [SSO])
Jan. ’04: 32,000 inbound links logged (SSO)
Mar. ’04: 109,000 inbound links logged (SSO)
5/21/04: Yahoo contract signed
5/28/04: Yahoo harvests records
May ’04: 725,000 inbound links logged (SSO)
6/6/04: Yahoo completes indexing of 2 million WC records
Traffic
Search Engine History (chart: full record displays per month)
Dec     800
Jan     32,064
Feb     42,659
Mar     108,971
Apr     315,988
May     725,545
Jun*    2,452,521
* Projected for June.
Off Click Dispersion (pie chart)
Categories: Full Text, ILL Request Form, Library Information Page, Library's Map Page, netLibrary, OPAC Links, OpenURL Resolver
Values as extracted from the chart: 17%, 1%, 7%, 4%, 0%, 69%, 2%
Metadata management and knowledge organization
Research activities
• Structures
– FRBR
– VIAF
BT
– FAST
– Vocabulary encoding and mappings
• Services
– xISBN
– Metadata transformation services
– Terminology services
– Authority services
– Automatic classification and cataloging (Eprints UK, web harvesting)
FRBR
• OR Work-set algorithm
• Work-based view incorporated into WorldCat in FirstSearch in late 2004
• FictionFinder
– 2.6+ million fiction records from WorldCat, clustered by OCLC’s FRBR algorithm
– Make greater use of data (genres, settings, imaginary characters, etc.)
• Participate in ongoing FRBR refinement
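A rough illustration of work-level clustering follows. It is a minimal sketch, not OCLC's work-set algorithm: it assumes that a work key can be approximated by a normalized author/title pair, and the names `normalize`, `work_key` and `cluster_into_work_sets` are hypothetical.

```python
import re
import unicodedata
from collections import defaultdict


def normalize(text: str) -> str:
    """Lowercase, strip accents and punctuation, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", no_accents.lower())
    return " ".join(cleaned.split())


def work_key(record: dict) -> str:
    """Hypothetical work key: normalized author plus normalized title."""
    return normalize(record.get("author", "")) + "/" + normalize(record.get("title", ""))


def cluster_into_work_sets(records: list) -> dict:
    """Group bibliographic records that share the same work key."""
    work_sets = defaultdict(list)
    for record in records:
        work_sets[work_key(record)].append(record["id"])
    return dict(work_sets)


# Invented sample records; punctuation and case differ between records 1 and 2.
RECORDS = [
    {"id": 1, "author": "Austen, Jane", "title": "Pride and Prejudice"},
    {"id": 2, "author": "Austen, Jane.", "title": "Pride and prejudice."},
    {"id": 3, "author": "Austen, Jane", "title": "Emma"},
]
WORK_SETS = cluster_into_work_sets(RECORDS)
```

A production approach would also need authority-controlled names and handling for translations and uniform titles, which this sketch leaves out.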
Click in presentation mode to go through to FictionFinder
FAST
Vocabulary mappings
Services
• Web services
– Computer-to-computer applications over the web
• Unplug and play
– Unbundling monolithic applications and making functionality available in more modular ways
• Reuse and sharing
– Of services!
• Release the value in a web environment of the historical library investment in vocabularies and structures
xISBN
• An experimental web service
– Leverages FRBRization work
– Give it an ISBN, it returns all related ISBNs
– Based on WorldCat
– Designed for machine-to-machine data exchange
• Examples:
– Check user ILL requests against all editions/versions in OPAC
– Find library’s editions when user finds any edition/version of item on Amazon
– Check OPAC for all editions during selection/acquisitions/gift book processing
– …
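The first example above (checking an ILL request against all editions in the OPAC) might look like the sketch below. It does not call the xISBN service or reproduce its interface: a small in-memory table stands in for it, and the ISBN strings, the `ISBN_TO_WORK` table and the function names are all invented for illustration.

```python
# Hypothetical stand-in for the service: ISBN -> work-set identifier.
# The ISBN strings are made up and are not real, valid ISBNs.
ISBN_TO_WORK = {
    "1111111111": "work-a",
    "2222222222": "work-a",
    "3333333333": "work-a",
    "4444444444": "work-b",
}


def related_isbns(isbn: str, isbn_to_work: dict) -> set:
    """Return every ISBN in the same work set, including the one given."""
    work = isbn_to_work.get(isbn)
    if work is None:
        return {isbn}  # unknown ISBN: it is only related to itself
    return {other for other, w in isbn_to_work.items() if w == work}


def held_editions(requested_isbn: str, opac_isbns: set, isbn_to_work: dict) -> set:
    """Editions of the requested work that the library already holds."""
    return related_isbns(requested_isbn, isbn_to_work) & set(opac_isbns)


# A user requests one edition via ILL; the OPAC holds a different edition.
OPAC = {"2222222222", "9999999999"}
HELD = held_editions("1111111111", OPAC, ISBN_TO_WORK)
```

If `HELD` is non-empty, the request could be redirected to the local copy instead of going out as an ILL request.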
xISBN
Click cover to search amazon.co.uk
Click cover to search Seattle Public Library
Install FRBR Bookmarklets in your browser to see xISBN working. See the Bookmarklets page at www.oclc.org/research/researchworks/
Metadata schema transformations
• Metadata Schema Transformation Services
– Evaluate approaches to crosswalking metadata
– Prototype transformation environments
• The XSLT “short path”
– Supports lightweight XML processing
– Designed for public access
– Deliverables: OAI repository of METS-captured xwalks [NEW]
• The “long path” option
– Designed for high-fidelity translations
– May be public or proprietary
– Deliverables: Toolkit; expertise in non-MARC formats
1. File of records in format X
2. STRUCTURAL TRANSFORM: transform to intermediate form; SEMANTIC TRANSLATION: translate input semantics to CORE
3. CORE
4. SEMANTIC TRANSLATION: translate CORE to output semantics; STRUCTURAL TRANSFORM: transform to output format Y
5. File of records in format Y
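The flow through a common CORE can be sketched as four small functions: a structural transform in, a semantic translation to CORE, a semantic translation out, and a structural transform out. This is a minimal sketch under stated assumptions: format X is a flat Dublin Core style XML record, format Y is a simple tagged text display (not real MARC 21), and the two mapping tables are illustrative only, not a complete or official crosswalk.

```python
import xml.etree.ElementTree as ET

# Illustrative semantic maps (assumptions, not an official crosswalk).
X_TO_CORE = {"title": "title", "creator": "name", "date": "date"}
CORE_TO_Y = {"title": ("245", "a"), "name": ("720", "a"), "date": ("260", "c")}


def parse_format_x(xml_text: str) -> list:
    """Structural transform: XML record -> (element name, value) pairs."""
    root = ET.fromstring(xml_text)
    return [(child.tag.split("}", 1)[-1], (child.text or "").strip()) for child in root]


def to_core(pairs: list) -> list:
    """Semantic translation: input element names -> CORE names."""
    return [(X_TO_CORE[name], value) for name, value in pairs if name in X_TO_CORE]


def from_core(core: list) -> list:
    """Semantic translation: CORE names -> output (tag, subfield, value)."""
    return [(*CORE_TO_Y[name], value) for name, value in core if name in CORE_TO_Y]


def write_format_y(fields: list) -> str:
    """Structural transform: tagged fields -> one text line per field."""
    return "\n".join(f"{tag} ${code} {value}" for tag, code, value in fields)


def crosswalk(xml_text: str) -> str:
    return write_format_y(from_core(to_core(parse_format_x(xml_text))))


SAMPLE = (
    '<record xmlns:dc="http://purl.org/dc/elements/1.1/">'
    "<dc:title>Sample title</dc:title>"
    "<dc:creator>Sample, Author</dc:creator>"
    "<dc:date>2004</dc:date>"
    "</record>"
)
OUTPUT = crosswalk(SAMPLE)
```

Keeping the semantic maps separate from the structural code means a new format needs only one map to and from CORE, rather than a pairwise map to every other format.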
A crosswalk as a METS record
• Describe the crosswalk object in the METS header.
• Assemble and identify six objects in the METS structural map:
– The source metadata schema
– The target metadata schema
– The crosswalk
– Human-readable and executable versions of each
• Associate metadata for each file in the METS Descriptive Metadata Section.
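A skeleton of such a record could be assembled as below. This is a sketch only: it uses standard METS element names (`metsHdr`, `dmdSec`, `fileSec`, `structMap`), but the file identifiers, labels and `example.org` locations are hypothetical, the descriptive metadata is a placeholder, and the result has not been validated against the METS schema.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"
ET.register_namespace("mets", METS_NS)
ET.register_namespace("xlink", XLINK_NS)


def m(name: str) -> str:
    """Qualify an element name with the METS namespace."""
    return f"{{{METS_NS}}}{name}"


def build_crosswalk_mets(label: str, files: list) -> ET.Element:
    """files holds (file_id, description, href) for each of the six objects."""
    root = ET.Element(m("mets"), {"LABEL": label})
    # Header describing the crosswalk object as a whole.
    ET.SubElement(root, m("metsHdr"), {"RECORDSTATUS": "draft"})
    # One descriptive metadata section per file (placeholder content).
    for file_id, description, _ in files:
        dmd = ET.SubElement(root, m("dmdSec"), {"ID": f"dmd-{file_id}"})
        wrap = ET.SubElement(dmd, m("mdWrap"), {"MDTYPE": "OTHER"})
        ET.SubElement(ET.SubElement(wrap, m("xmlData")), "description").text = description
    # File section: where each object lives, linked to its metadata.
    group = ET.SubElement(ET.SubElement(root, m("fileSec")), m("fileGrp"))
    for file_id, _, href in files:
        file_el = ET.SubElement(group, m("file"), {"ID": file_id, "DMDID": f"dmd-{file_id}"})
        ET.SubElement(file_el, m("FLocat"), {"LOCTYPE": "URL", f"{{{XLINK_NS}}}href": href})
    # Structural map: identifies the six objects and points at their files.
    top = ET.SubElement(ET.SubElement(root, m("structMap")), m("div"), {"TYPE": "crosswalk"})
    for file_id, description, _ in files:
        part = ET.SubElement(top, m("div"), {"LABEL": description})
        ET.SubElement(part, m("fptr"), {"FILEID": file_id})
    return root


# Hypothetical objects: human-readable and executable versions of each.
FILES = [
    ("src-doc", "Source schema, human-readable", "http://example.org/source.html"),
    ("src-xsd", "Source schema, executable", "http://example.org/source.xsd"),
    ("tgt-doc", "Target schema, human-readable", "http://example.org/target.html"),
    ("tgt-xsd", "Target schema, executable", "http://example.org/target.xsd"),
    ("xw-doc", "Crosswalk, human-readable", "http://example.org/crosswalk.html"),
    ("xw-xsl", "Crosswalk, executable", "http://example.org/crosswalk.xsl"),
]
RECORD = build_crosswalk_mets("Hypothetical X-to-Y crosswalk", FILES)
XML_TEXT = ET.tostring(RECORD, encoding="unicode")
```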
Crosswalk METS record in OAI repository
What the METS encoding solves
• The semantic and syntactic information required for interpreting and executing a crosswalk is collected into a single object.
• The repository is searchable by humans and automated processes.
• Services can be built on top of it.
• It encourages the development and standardization of crosswalks.
These outcomes are possible because every component in the system is a standard.
Terminology Services
• Terminology services are web services for knowledge organization schemes (KOS)
– e.g., authority files, subject heading systems, thesauri, taxonomies, and classification schemes
• A web service that provides mappings from a term in one vocabulary to one or more terms in another vocabulary is an example of a terminology service
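The mapping example can be pictured as a lookup from a term in one vocabulary to terms in another. The sketch below is a local, in-memory illustration only: the vocabularies `vocab-x` and `vocab-y`, the terms, and the `map_term` function are hypothetical, and a real terminology service would expose this over the web.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Term:
    vocabulary: str
    label: str


# Invented inter-vocabulary mappings for illustration.
MAPPINGS = {
    Term("vocab-x", "Cats"): [
        Term("vocab-y", "Felines"),
        Term("vocab-y", "Domestic cats"),
        Term("vocab-z", "Cat"),
    ],
}


def map_term(term: Term, target_vocabulary: str, mappings: dict) -> list:
    """Return the terms in target_vocabulary that the given term maps to."""
    return [t for t in mappings.get(term, []) if t.vocabulary == target_vocabulary]


RESULT = map_term(Term("vocab-x", "Cats"), "vocab-y", MAPPINGS)
```

One source term may map to zero, one, or several target terms, so the function returns a list in every case.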
Current Situation
• A plethora of vocabularies
• Many encoding formats
• Few inter-vocabulary connections
• Identifiers inadequate
– Unavailable
– Temporary
– Inconsistent
Terminology services system framework
• Schema transformation:
– MARC XML
– SKOS
– Zthes
• Record enhancement:
– Inter-vocabulary mappings
– Persistent identifiers (info:uri)
• Access:
– Human-readable: browse interface (ERRoLs); search/retrieve records (SRU/W); switch between schema-specific views (XSLT)
– m2m: publishing (OAI); search/retrieve records (SRU/W); info:uri resolution (OpenURL)
• Open standards:
– MARC 21
– XML/XSLT/XPath
– SKOS
– Zthes
– SRU/SRW
– OAI
– info:uri
– OpenURL
• Open source software:
– OCLC OAICat
– OCLC SRU/SRW server
– OCLC ERRoL J2EE webapp
• Open content:
– GSAFD, others…
• Open access
• Web services-oriented
Schema Transformation
• MARC XML
– Authority Format & Classification Format
• SKOS
– Simple Knowledge Organization System
• Zthes
– Z39.50 Profile for Thesaurus Navigation.5
– Based on Z39.19 (NISO Thesaurus Standard)
Vocabulary Processing (diagram; labels as extracted)
• Vocabulary X
• Conversion from most formats: Z39.19; wordlists in PDF, etc.
• Initial conversion to MARC XML: Authorities format, or Classification format
• Data enhancement. Add: provenance (MARC Org. Codes); persistent identifiers (info:kos). Optionally, add: inter-vocabulary mappings
• Schema transformation to Zthes and SKOS: concepts & terms; persistent identifiers (info:kos)
• Vocabulary Y
Info:kos
• Info:uri
– provides a mechanism for the registration of public namespaces that are used for the identification of information assets
• The kos identifier
– provides a mechanism for identifying knowledge organization schemes and the concepts used in those schemes. It has two elements: scheme and concept
New services environment (diagram; labels as extracted)
• server (info:uri resolver)
• http://errol.oclc.org [OpenURL base URL]
• http://errol.oclc.org/xyz.search [SRU-to-HTML gateway]
• http://errol.oclc.org/xyz.html [HTML interface]
• http://errol.oclc.org/xyz.rss [RSS feed]
• http://errol.oclc.org/xyz.sru [SRU gateway]
• http://errol.oclc.org/xyz.srw2oai [OAI gateway]
• http://alcme.oclc.org/srw/ [SRW request]
• server: [SRW/SRU response] as DC, SKOS, Zthes
• [ERRoLs server stylesheets applied]
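As an illustration of the machine-to-machine side, an SRU search is an ordinary URL with a few standard parameters. The sketch below only builds such a URL and sends nothing over the network. The parameter names (`operation`, `version`, `query`, `maximumRecords`) are standard SRU 1.1; the use of the `xyz.sru` address as the base, the CQL index name, and the `sru_search_url` helper are assumptions for illustration.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def sru_search_url(base_url: str, cql_query: str, max_records: int = 10) -> str:
    """Build an SRU searchRetrieve URL (no request is sent)."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.1",
        "query": cql_query,  # CQL; index names depend on the server
        "maximumRecords": str(max_records),
    }
    return base_url + "?" + urlencode(params)


# "xyz" is the placeholder used on the slide; the index name is assumed.
URL = sru_search_url("http://errol.oclc.org/xyz.sru", 'dc.title = "caricature"', 5)
PARSED = parse_qs(urlsplit(URL).query)
```

Because the request is just a URL, the same search can be issued from a browser, a script, or another service.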
Name authority lookup
• Interactive
• As a web service
Lorcan Dempsey
• An example: authority control service invoked from within DSpace
Click in presentation mode.
Working with web services
Making data work harder
Data mining
• Research
• Production
– Collection analysis service in development phase
– Leverages WorldCat data in interactive mode: compare my collection to my peers; compare my collection to my neighbors; profile my collection by subject, by age, … etc.
Collection
• Change creates demand for better data.
• Growing interest in knowing more about:
– Characteristics
– Gaps and overlaps
– Use
• Tuning collections based on data.
• Focus collection spending where it creates most value.
Some projects
• Characteristics of collections
– WorldCat
– CIC
• Compare ILL, circulation and holdings data.
• Last copy: what is irreplaceable?
• ARL Global Resources.
– Exploring coverage of overseas titles in ARL libraries.
• Depends on consistency, coverage, currency
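Overlap, gap and last-copy questions reduce to set operations once holdings are keyed consistently. The sketch below is illustrative only: the library names, title identifiers and the `compare` and `last_copies` functions are invented, and real analysis would work from WorldCat holdings data.

```python
from collections import Counter


def compare(mine: set, peer: set) -> dict:
    """Overlap and gaps between my collection and a peer's."""
    return {
        "overlap": mine & peer,
        "unique_to_me": mine - peer,
        "gaps": peer - mine,  # titles the peer holds that I do not
    }


def last_copies(holdings: dict) -> dict:
    """Titles held by exactly one library, mapped to that library."""
    counts = Counter(title for titles in holdings.values() for title in titles)
    return {
        title: library
        for library, titles in holdings.items()
        for title in titles
        if counts[title] == 1
    }


# Invented holdings keyed by library; values are title identifiers.
HOLDINGS = {
    "library-a": {"t1", "t2", "t3"},
    "library-b": {"t2", "t3", "t4"},
    "library-c": {"t3"},
}
COMPARISON = compare(HOLDINGS["library-a"], HOLDINGS["library-b"])
LAST = last_copies(HOLDINGS)
```

The results are only as good as the identifiers: inconsistent or stale holdings records would show up here as false gaps or false last copies.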
Comparing CIC Collection Profiles
Audience level
Profiles of ‘Letters’ & ‘Forge’ Example (bar chart: percent of holdings, 0% to 80%, by library type: ARL, Academic, Public, School; series: Letters of …, Forge of Liberty)
Forge Letters
0.81 0.65
Topics
• Framework for WorldCat directions
• Metadata management and knowledge organization
• Working with web services
• Making data work harder
• Some research, some production
• Open WorldCat
Thoughts
• Machines will do more work
– Consistency becomes more important
• Variety
• Low precision
– Make data work
The pattern is new …
The knowledge imposes a pattern and falsifies
For the pattern is new in every moment
Further information
Thanks to colleagues in OCLC Research for contributions to this presentation. Further information about OCLC Research projects can be found at http://www.oclc.org/research/
Thanks to colleagues in OCLC Collection Management Services for contributions to this presentation. Further information about Open WorldCat at http://www.oclc.org/worldcat/pilot/