The Open Archives Initiative Protocol for Metadata Harvesting and the IMLS Digital Collections & Content Project at the University of Illinois

Timothy W. Cole ([email protected])
Mathematics Librarian & Professor of Library Administration
University of Illinois at Urbana-Champaign

Friday 12 November 2004
MCN 2004, Minneapolis, MN

http://imlsdcc.grainger.uiuc.edu/Cole_MCN2004_OAI.ppt



OAI-PMH & The IMLS DCC Project, MCN 2004, 12 November 2004


The Digital Information Landscape

The information landscape can be seen as a contour map in which there are mountains, hillocks, valleys, plains and plateaus…. A specialized collection of particular importance is like a sharp peak. Upon a plateau there might be undulations representing strengths and weaknesses…. The landscape is, however, multidimensional. Where one scholar may see a peak another may see a trough. The task is to devise mapping conventions which enable scholars to read the map of the landscape fruitfully, at the appropriate level of generality or specificity.

Michael Heaney (2000), “An Analytical Model of Collections and their Catalogues.”


Users & Uses of Digital Libraries

From the Bibusages study (French National Library):
  Digital Libraries are used in conjunction with Web search engines, generalist portals, commercial sites
  Mix of intensive & casual users
  DL users skew somewhat older, higher degree level than average French Internet user population
  DL users seeking answer for specific information need; most time spent discovering, viewing, & downloading documents

“Digital Libraries … are now attracting a new type of public, bringing about new, unique and original ways for reading and understanding texts.”
Houssem Assadi, et al. “Users & Uses of Online Digital Libraries in France,” ECDL 2003


Managing Digital Collections & Content

How do mandates translate & change in digital world?
  Content & collections as virtual ‘information landscapes’
  New users, uses, & metrics
  Increased emphasis on interoperability & sharing

New models for sharing & resource discovery
  Harvesting – e.g., OAI-PMH
  Federated searching – e.g., Z39.50 / ZNG, DiGIR, ...

New emphasis on ‘shareable’ metadata
  Reconciling different descriptive metadata practices
  New metrics for metadata quality (for interoperability)


IMLS Digital Library Forum (2001)

Framework of Guidance for Building Good Digital Collections
http://www.niso.org/framework/forumframework.html

Stresses reusability, persistence, interoperability, verification, and documentation of digital collections & content

Accompanying report included recommendations encouraging:
  Creation of an IMLS Collection Registry
  Implementation of the Open Archives Initiative Protocol for Metadata Harvesting by IMLS projects creating digital content
  Development of infrastructure to facilitate interoperability between IMLS projects and initiatives like NSDL


IMLS DCC Project Overview

Collection description & prototype registry for IMLS National Leadership Grant projects with associated digital content
  Enhance discoverability of collections & content
  Provide alternative view of one output of IMLS NLG program

Prototype item level metadata repository via OAI-PMH
  Demonstrate potential of metadata for interoperability
  Serve as testbed for IMLS projects interested in OAI-PMH
  Facilitate reuse of information resources paid for by IMLS

Research question:
How can resource developers best represent collections and items to meet the needs of service providers and end users?


IMLS Grantees – A Diverse Community

Mix of library, museum, and archive traditions
Wide variation in technical skills, technology infrastructure & information management policy
Diverse perspectives on intellectual property; use and presentation of metadata & primary resources
Diverse embedded knowledge structures

Results in wide variability in:
  Metadata formats
  Content resource types
  Controlled vocabularies
  Descriptive metadata practices


Broad Categories of Institutions Represented in Collection Registry

[Pie chart: Institutions in IMLS Collection Registry by Category (349 institutions from 134 collections / 92 NLG projects)]

Libraries: 41%
Museums: 36%
Archives: 3%
Specimen Holding: 3%
Other: 17%


Detailed Institution Types Represented in Collection Registry

[Bar chart: Types of Institutions in IMLS Collection Registry (349 institutions from 134 collections from 92 NLG projects). X axis: Institution Types. Y axis: Numbers (0 to 80).]

Institution types: Acad. Lib.; Historical Soc.; Public Lib.; Other; History Mus.; General Mus.; Other HigherEd; Spec. Mus.; State Lib.; Research Lib./Archives; Art Museum; K-12 School; Lib. Cons.; Nat.His. Mus.; Science Mus.; Bot. Garden / Herbarium; Spec. library; Historic Site; Arboretum; Museum Lib.; Private Lib.; School Lib.; State Mus.

Bar values: 69, 58, 48, 39, 24, 15, 13, 12, 12, 11, 10, 8, 6, 6, 4, 3, 3, 2, 1, 1, 1, 1, 1


Broad Categories of Institutions Represented in Metadata Repository

[Pie chart: Institutions Represented in Metadata Repository (136 institutions--27 harvested collections/193,677 metadata records)]

Libraries: 42%
Museums: 37%
Archives: 4%
Specimen Holding: 4%
Other: 13%


Detailed Institution Types Represented in Metadata Repository

[Bar chart: Types of Institutions Represented in Item Level Metadata Repository (136 institutions -- 27 harvested collections/193,677 metadata records). X axis: Institution Types. Y axis: Number of Institutions (0 to 40).]

Bar values: 34, 16, 16, 14, 8, 6, 6, 6, 5, 4, 4, 3, 3, 2, 2, 2, 1, 1, 1, 1, 1; remaining bars 0


Metadata Formats

[Horizontal bar chart: Metadata Formats in Use. X axis: Number of Respondents (0 to 60). For each of Dublin Core, EAD, MARC, TEI, VRA Core, Other Metadata Standard, and Locally Developed Metadata, three bars: only, in combination, and total.]

Data labels: 0; 0; 10 (13%); 38 (49%); 48 (62%); 1 (1%); 11 (14%); 12 (16%); 4 (5%); 24 (31%); 28 (36%); 16 (21%); 16 (21%); 21 (27%); 8 (10%); 14 (18%); 10 (13%); 4 (5%); 2 (3%); 2 (3%)


Types of Resources

[Horizontal bar chart: Type of Material in Digital Collection. X axis: Number of Respondents (0 to 90). Y axis: Type of Material.]

Images: 8 (9%) only; 71 (80%) in combination; 79 (89%) total
Text: 3 (3%) only; 69 (78%) in combination; 72 (81%) total
Sound: 2 (2%) only; 24 (27%) in combination; 26 (29%) total
Interactive Resource: 1 (1%) only; 15 (17%) in combination; 16 (18%) total
Moving Image: 1 (1%) only; 13 (15%) in combination; 14 (16%) total
Other: 0 only; 12 (13%) in combination; 12 (13%) total


Controlled Vocabularies

Top three controlled vocabularies used, by element (% of respondents who identified a controlled vocabulary)

Subject: LCSH (50%); LC TGM I (19%); AAT (13%)
Format: LC TGM II (7%); AAT (7%); MIME types (4%)
Type: DCMI Type (8%); LC TGM II (7%); AACR2 (7%)
Personal names: LC Name Authority File (47%)
Geographic names: LC Name Authority File (18%); LCSH (15%); Getty Thesaurus of Geographic Names (10%)


Descriptive Practice

Different traditions regarding
  Inclusion of interpretive information
  Granularity of description
  Presentation of information resources

Shared problems / issues
  How to provide context & collection description
  What exactly to describe
  Which metadata scheme(s) to use


Illustration – Coverlets (1 of 2)

Description: Digital image of a single-sized cotton coverlet for a bed with embroidered butterfly design. Handmade by Anna F. Ginsberg Hayutin.

Source: Materials: cotton and embroidery floss. Dimensions: 71 in. x 86 in. Markings: top right hand corner has 1 1/2 in. x 1/2 in. label cut outs at upper left and right hand side for head board; fabric is woven in a variation of a rib weave; color each of yellow and gray; hand-embroidered cotton butterflies and flowers from two shades of each color of embroidery floss - blue, pink, green and purple and single top 20 in. bordered with blue and black cotton embroidery thread; stitches used for embroidery: running stitch, chain stitch, French knot and back stitches; selvage edges left unfinished; lower edges turned under and finished with large gray running stitches made with embroidery floss.

Format: Epson Expression 836 XL Scanner with Adobe Photoshop version 5.5; 300 dpi; 21-53K bytes. Available via the World Wide Web.

Coverage: —

Date Created: 2001-09-19 09:45:18; Updated: 20011107162451; Created: 2001-04-05; Created: 1912-1920?

Type: Image


Illustration – Coverlets (2 of 2)

Description: Materials: Textile--Multi, Pigment—Dye; Manufacturing Process: Weaving--Hand, Spinning, Dyeing, Hand-loomed blue wool and white linen coverlet, worked in overshot weave in plain geometric variant of a checkerboard pattern. Coverlet is constructed from finely spun, indigo-dyed wool and undyed linen, woven with considerable skill. Although the pattern is simpler, the overall craftsmanship is higher than 1934.01.0094A. - D. Schrishuhn, 11/19/99 This coverlet is an example of early "overshot" weaving construction, probably dating to the 1820's and is not attributable to any particular weaver. -- Georgette Meredith, 10/9/1973

Source: —

Format: 228 x 169 x 1.2 cm (1,629 g)

Coverage: Euro-American; America, North; United States; Indiana? Illinois?

Date: Early 19th c. CE

Type: cultural; physical object; original


OAI Protocol for Metadata Harvesting

‘Harvesting’ approach to interoperability at metadata level

Divides world into Metadata Providers & Service Providers

Builds on HTTP, XML, & Community Metadata Standards


Metadata Harvesting Model


How OAI-PMH Works

OAI “VERBS”:
  Identify
  ListMetadataFormats
  ListSets
  ListIdentifiers
  ListRecords
  GetRecord

[Diagram: a HARVESTER (OAI Service Provider) sends an HTTP Request (OAI Verb) to a REPOSITORY (OAI Metadata Provider), which returns an HTTP Response (Valid XML).]
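The request/response exchange can be sketched in a few lines of Python. This is only an illustration, not the project's harvester: the base URL `http://repository.example.org/oai`, the sample record, and the helper names `build_request_url` and `parse_list_records` are all hypothetical. The six verbs and the XML namespaces are those of OAI-PMH 2.0 and unqualified Dublin Core. No network call is made; the response is a hand-written sample.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# The six OAI-PMH verbs listed on the slide.
OAI_VERBS = {"Identify", "ListMetadataFormats", "ListSets",
             "ListIdentifiers", "ListRecords", "GetRecord"}

# XML namespaces used by OAI-PMH 2.0 responses and unqualified Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def build_request_url(base_url, verb, arguments=None):
    """Build the HTTP GET URL a harvester would send (hypothetical helper)."""
    if verb not in OAI_VERBS:
        raise ValueError(f"unknown OAI-PMH verb: {verb}")
    query = urlencode({"verb": verb, **(arguments or {})})
    return f"{base_url}?{query}"


# Hand-written sample of the valid XML a repository might return.
# The identifier, datestamp, and title are invented for illustration.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2004-11-12T12:00:00Z</responseDate>
  <request verb="ListRecords" metadataPrefix="oai_dc">http://repository.example.org/oai</request>
  <ListRecords>
    <record>
      <header>
        <identifier>oai:repository.example.org:coverlet-1</identifier>
        <datestamp>2001-11-07</datestamp>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Embroidered coverlet</dc:title>
          <dc:type>Image</dc:type>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def parse_list_records(xml_text):
    """Extract identifier, datestamp, and titles from a ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for record in root.iterfind("oai:ListRecords/oai:record", NS):
        header = record.find("oai:header", NS)
        records.append({
            "identifier": header.findtext("oai:identifier", namespaces=NS),
            "datestamp": header.findtext("oai:datestamp", namespaces=NS),
            "titles": [e.text for e in record.iterfind(".//dc:title", NS)],
        })
    return records


if __name__ == "__main__":
    print(build_request_url("http://repository.example.org/oai",
                            "ListRecords", {"metadataPrefix": "oai_dc"}))
    print(parse_list_records(SAMPLE_RESPONSE))
```

Arguments are passed as a dictionary rather than keyword arguments because one of the protocol's argument names, `from`, is a reserved word in Python.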


Why OAI-PMH for IMLS DCC Project

Offers low technical barrier options; primary cost is metadata
  e.g., OAI-PMH itself, OAI Static Repository, mod_oai

Is a cross-domain, non-proprietary approach to interoperability

Already used by NSDL, OAIster, etc.

Seen as a way to bring content to attention of wider audience
  37% of visits to State Library of New South Wales image collection via PictureAustralia (an OAI-PMH based portal)

Facilitates metadata & metadata services research
  What makes for good ‘shareable’ metadata?
  Contrast & compare metadata designs & workflows
  Explore normalization, enhancement, aggregated searching issues


OAI-PMH Issues

Harvesting vs. federated
  Harvested metadata aggregation always out of date, but
  Federated real-time performance dependent on weakest link
  Sorting, ranking, & de-duping easier with harvesting model

Potential scale issues
  Largest OAI-PMH provider serves 4 million records
  Largest OAI-PMH service provider < 10 million records

Integration into existing metadata workflow requires some investment – cost-to-benefit ratio still unclear

Practical metadata sharing issues:
  Persistent identifiers, date stamps, proper application of protocol
  Metadata quality, consistency, context, cross-walking, ...
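Why persistent identifiers and date stamps matter to a harvester can be shown with a minimal sketch. It assumes each harvested record is a dictionary with `identifier` and `datestamp` keys, and that datestamps are UTC strings of one consistent granularity, so plain string comparison orders them correctly. The function names and sample records are hypothetical.

```python
def merge_harvest(aggregate, harvested):
    """Fold newly harvested records into the aggregate, keyed by identifier.

    A record replaces the stored copy only when its datestamp is newer,
    so re-harvesting the same item never creates a duplicate.
    """
    for record in harvested:
        current = aggregate.get(record["identifier"])
        if current is None or record["datestamp"] > current["datestamp"]:
            aggregate[record["identifier"]] = record
    return aggregate


def latest_datestamp(aggregate):
    """Return the newest datestamp seen so far, or None if nothing is stored.

    One simple choice is to send this value as the 'from' argument of the
    next ListRecords request so only changed records are transferred.
    """
    return max((r["datestamp"] for r in aggregate.values()), default=None)


if __name__ == "__main__":
    store = {}
    merge_harvest(store, [
        {"identifier": "oai:a.example.org:1", "datestamp": "2004-01-05"},
        {"identifier": "oai:b.example.org:1", "datestamp": "2004-02-01"},
    ])
    # A later harvest returns an updated copy of the first record.
    merge_harvest(store, [
        {"identifier": "oai:a.example.org:1", "datestamp": "2004-03-10"},
    ])
    print(len(store), latest_datestamp(store))
```

If a provider changes identifiers between harvests, or fails to update datestamps when records change, this kind of merge silently produces duplicates or stale records, which is the practical risk the slide points at.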


Federated Searching Model


Alternative Approaches for Interoperability

Federated search models
  Library: NISO Z39.50
  Specimen / Natural History: DiGIR
  More homogeneous metadata schemes, query rules

Collaborative, sometimes proprietary project portals
  RLG Cultural Materials
  ArtStor
  GBIF, MaNIS, ...

Generally higher technical threshold; rely on higher level of metadata homogeneity & compliance


OAI-PMH as Complement to Other Approaches

OAI-PMH provides a lowest-common-denominator approach to sharing & interoperability
  Insufficient for some high-level, domain-specific applications,
  But useful for sharing across more heterogeneous communities & allowing participation with less technology

Portals can exploit combination of approaches
  OAI-PMH metadata harvesters can normalize & augment metadata before sharing on with domain-specific federated search portals
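As one small illustration of what normalizing harvested metadata could involve, the sketch below maps a few of the date strings shown on the coverlet slides to a single ISO 8601 form. It is an assumption about what such a step might look like, not the project's method. The function name `normalize_date` and the list of accepted formats are hypothetical, and values it cannot parse are left for human review rather than guessed at.

```python
from datetime import datetime

# Date layouts seen in the coverlet examples; the list is illustrative only.
KNOWN_FORMATS = (
    "%Y-%m-%d %H:%M:%S",  # e.g. 2001-09-19 09:45:18
    "%Y%m%d%H%M%S",       # e.g. 20011107162451
    "%Y-%m-%d",           # e.g. 2001-04-05
)


def normalize_date(value):
    """Return the date as YYYY-MM-DD, or None when the value is not recognized.

    Free-text dates such as 'Early 19th c. CE' or '1912-1920?' are
    deliberately not interpreted here.
    """
    text = value.strip()
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None


if __name__ == "__main__":
    for raw in ("2001-09-19 09:45:18", "20011107162451",
                "2001-04-05", "1912-1920?", "Early 19th c. CE"):
        print(repr(raw), "->", normalize_date(raw))
```

Returning `None` instead of a best guess keeps the original value available for display while still letting the aggregator sort and filter on the subset of dates it could interpret.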


IMLS DCC Collection Registry (alpha)

Features:
  Searchable
  Browseable
  An entry point for item-level searching


IMLS DCC Metadata Repository (alpha)

Currently Harvesting:
  27 Collections
  193,677 Records

Ongoing analysis of metadata
  Documenting practices
  Potential for normalization
  Implications for interface & search engine design


More Information

This presentation: http://imlsdcc.grainger.uiuc.edu/Cole_MCN2004_OAI.ppt
Project Website: http://imlsdcc.grainger.uiuc.edu/
Project PI: Tim Cole, [email protected]
Project Coordinator: Sarah Shreeves, [email protected]

OAI-PMH resources: http://www.openarchives.org/
Online OAI-PMH tutorial: http://www.oaforum.org/tutorial/
DLF OAI-PMH & shareable metadata best practices (under development): http://oai-best.comm.nsdl.org/