Bitter Harvest Metadata Harvesting Issues, Problems, and Possible Solutions Roy Tennant California Digital Library

Page 1: Bitter Harvest Metadata Harvesting Issues, Problems, and Possible Solutions Roy Tennant California Digital Library

Bitter Harvest
Metadata Harvesting Issues, Problems, and Possible Solutions

Roy Tennant
California Digital Library


Outline

Brief Harvesting Overview
Harvesting Problems
Steps to a Fruitful Harvest
A Harvesting Service Model
Indexing and Interfaces
What’s Next?


Open Archives Initiative

Open Archives Initiative: “develops and promotes interoperability standards that aim to facilitate the efficient dissemination of content”
Huh? Let’s just say it’s an effort to help people find stuff
Protocol for Metadata Harvesting (OAI-PMH) specifies how repositories can expose their metadata for others to harvest
Well over 500 repositories world-wide support the protocol
OAIster.org has indexed 3.5 million items from those repositories


OAI-PMH

Data providers (DP): those with the stuff
Service providers (SP): those who harvest metadata and provide aggregation and search services
OAI-PMH verbs:
    Identify
    ListIdentifiers
    ListMetadataFormats
    ListSets
    ListRecords
    GetRecord
Software for both DPs and SPs readily available
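
Each of the six verbs is sent to a data provider as an ordinary HTTP request with a `verb` argument. The sketch below only builds the request URL and does not contact any server. The base URL `https://repository.example.org/oai`, the set name `maps`, and the helper name `build_request` are illustrative assumptions, not taken from the slides.

```python
from urllib.parse import urlencode

# The six OAI-PMH verbs listed on the slide.
OAI_VERBS = {
    "Identify", "ListIdentifiers", "ListMetadataFormats",
    "ListSets", "ListRecords", "GetRecord",
}

def build_request(base_url, verb, **params):
    """Return the GET URL for one OAI-PMH request (hypothetical helper)."""
    if verb not in OAI_VERBS:
        raise ValueError(f"not an OAI-PMH verb: {verb}")
    # 'from' is a reserved word in Python, so callers pass 'from_'
    # and it is renamed to the protocol's 'from' argument here.
    if "from_" in params:
        params["from"] = params.pop("from_")
    return f"{base_url}?{urlencode({'verb': verb, **params})}"

# Assumed example repository and set name, for illustration only.
url = build_request(
    "https://repository.example.org/oai",
    "ListRecords",
    metadataPrefix="oai_dc",
    set="maps",
)
print(url)
```

An SP would fetch this URL with any HTTP client; `metadataPrefix=oai_dc` asks for the simple Dublin Core format that every DP is required to support.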


www.oaforum.org/tutorial/


OAI Architecture

Source: Open Archives Forum Tutorial


gita.grainger.uiuc.edu/registry/


errol.oclc.org


Harvesting Problems

Sets
Metadata Formats
Metadata Artifacts
Granularity
Metadata Variances


Sets

Records are harvested in clumps, called “sets”, created by DPs
No guidelines exist for defining sets
Examples:
    Collection
    Organizational structure
    Format (but is a page image an image? See example)


Metadata Formats

Only required format is simple Dublin Core, although any format can be made available in addition
Few DPs surface richer metadata
Simple DC is simply too simple!
Example (artifact vs. surrogate dates)


Metadata Artifacts

“unintended, unwanted aberrations”
Sample causes:
    Idiosyncratic local practices
    Anachronisms
    HTML code
Examples:
    Circa = string of dates for searching purposes
    [electronic resource]
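
Two of the artifacts named above, stray HTML code and the “[electronic resource]” qualifier, can be stripped mechanically. This is a minimal sketch rather than a complete cleaner: the function name `clean_value` and the sample strings are assumptions, and the tag pattern is deliberately naive.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")  # naive match for HTML tags
GMD_RE = re.compile(r"\s*\[electronic resource\]", re.IGNORECASE)

def clean_value(value):
    """Remove HTML residue and the '[electronic resource]' qualifier."""
    value = html.unescape(value)      # turn &amp; into &, &lt;i&gt; into <i>
    value = TAG_RE.sub("", value)     # drop tags, keep their text content
    value = GMD_RE.sub("", value)     # drop the cataloging qualifier
    return " ".join(value.split())    # collapse leftover whitespace

print(clean_value("<i>Map of Yosemite</i> [electronic resource]"))
```

Artifacts that come from idiosyncratic local practice generally need per-repository rules; a generic pass like this one only catches the patterns that recur everywhere.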


Granularity

Record Granularity: what is an “object”?
    A book, or each individual page?
    Examples: CDL, Univ. of Michigan
Metadata Granularity:
    Multiple values in one field
    Example: Univ. of Washington
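
When several values are packed into one field, the SP has to split them before indexing. The sketch below assumes a semicolon delimiter; the slides do not say which delimiter any particular repository uses, so `split_values` and its default are hypothetical.

```python
import re

def split_values(value, delimiters=";"):
    """Split one packed field into separate, trimmed values."""
    pattern = "[" + re.escape(delimiters) + "]"
    return [part.strip() for part in re.split(pattern, value) if part.strip()]

print(split_values("Fishing; Salmon ; Columbia River"))
```

Passing `delimiters=";|"` would also split on a pipe character, which is useful when repositories differ in how they pack values.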


Metadata Variances

Subject terminology differences
Disparities in recording the same metadata
    Example: date variances
Mapping oddities or mistakes
    Examples: 1) format into description, 2) description into subject


Steps to a Fruitful Harvest

Needs Assessment (it’s the user, stupid)
DP Identification and Communication
Metadata Capture
Metadata Analysis
Metadata Subsetting
Metadata Normalization
Metadata Enrichment
Indexing
Interface (it’s still the user, stupid)


Needs Assessment

What are you trying to accomplish?
What will your users want to be able to do?
What metadata will you need, and what procedures will you need to set up to enable these activities?
Which repositories have what you want?
Is what they have (e.g., sets, metadata) usable as is, or ?


DP Identification & Communication

Identification:
    Use UIUC directory of DPs to identify potential sources
Communication:
    Not required to tell them you are harvesting, but may help establish a good relationship
    May want to request that they surface a richer metadata format and/or provide a different set


Metadata Capture

Sample questions to answer:
    Individual sets, or all?
    Richer metadata formats available?
    How frequently to reharvest?
    Start from scratch each time or update?

Many software options
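
Whichever harvester is used, the capture step comes down to parsing ListRecords responses and following the resumption token until the list is complete. The sketch below parses one response held in a string, so it runs without a network connection. The sample XML, the identifier `oai:example.org:1`, the token `page-2`, and the function name `parse_list_records` are made up for illustration.

```python
import xml.etree.ElementTree as ET

NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}
DC = "{http://purl.org/dc/elements/1.1/}"  # Dublin Core element namespace

def parse_list_records(xml_text):
    """Return (records, resumption_token) for one ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        header = rec.find("oai:header", NS)
        record = {
            "identifier": header.findtext("oai:identifier", default="", namespaces=NS),
            "datestamp": header.findtext("oai:datestamp", default="", namespaces=NS),
            "deleted": header.get("status") == "deleted",
            "fields": {},
        }
        metadata = rec.find("oai:metadata", NS)
        if metadata is not None:  # deleted records carry no metadata
            for el in metadata.iter():
                if el.tag.startswith(DC) and el.text and el.text.strip():
                    name = el.tag[len(DC):]
                    record["fields"].setdefault(name, []).append(el.text.strip())
        records.append(record)
    token = root.findtext(".//oai:resumptionToken", default="", namespaces=NS).strip()
    return records, token or None

# Made-up response for illustration.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:example.org:1</identifier>
        <datestamp>2004-03-01</datestamp>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Yosemite Valley</dc:title>
          <dc:date>ca. 1900</dc:date>
        </oai_dc:dc>
      </metadata>
    </record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

records, token = parse_list_records(SAMPLE)
print(records[0]["fields"], token)
```

For the “scratch or update” question: an incremental harvest would store the latest datestamp seen and send it as the protocol's `from` argument on the next run, whereas a full reharvest simply omits it.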


+-----------------------------------------+
| Harvester Sample Configurator           |
+-----------------------------------------+
| Version 1.1 :: July 2002                |
| Hussein Suleman <[email protected]>     |
| Digital Library Research Laboratory     |
| www.dlib.vt.edu :: Virginia Tech        |
+-----------------------------------------+

Defaults/previous values are in brackets - press <enter> to accept those
enter "&delete" to erase a default value
enter "&continue" to skip further questions and use all defaults
press <ctrl>-c to escape at any time (new values will be lost)

Press <enter> to continue

[ARCHIVES]
Add all the archives that should be harvested

Current list of archives:
No archives currently defined !

Select from: [A]dd [D]one
Enter your choice [D] : a{return}

[ARCHIVE IDENTIFIER]
You need a unique name by which to refer to the archive you
will harvest metadata from
Examples: nsdl-380602, VTETD

Archive identifier [] : nsdl-380602{return}

Virginia Tech Perl Harvester


Metadata Analysis

Finding out what you have (and don’t have)

    Encoding practices
    Gap analysis (e.g., missing fields)
    Mistakes (e.g., mapping errors)

Software can help
    Commercial software like Spotfire
    In-house or open source software tools
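
An in-house analysis tool can start very small. The sketch below counts how often each element is used, reports how many records lack a required field, and computes the share of all values accounted for by the most-used elements. The record layout (a dict of field name to list of values) and the function names are assumptions made for this example.

```python
from collections import Counter

def element_usage(records):
    """Count how many values each element carries across all records."""
    counts = Counter()
    for fields in records:
        for name, values in fields.items():
            counts[name] += len(values)
    return counts

def missing_fields(records, required):
    """Gap analysis: for each required field, how many records lack it."""
    return {name: sum(1 for fields in records if not fields.get(name))
            for name in required}

def top_share(counts, n):
    """Fraction of all values that belong to the n most-used elements."""
    total = sum(counts.values())
    top = sum(count for _, count in counts.most_common(n))
    return top / total if total else 0.0

# Made-up harvested records for illustration.
harvested = [
    {"title": ["A"], "date": ["1900"], "subject": ["x", "y"]},
    {"title": ["B"]},
]
usage = element_usage(harvested)
print(usage, missing_fields(harvested, ["title", "date"]), top_share(usage, 2))
```

Mapping errors are harder to detect automatically, but a listing of the most frequent values per field often exposes them, for example format terms showing up under description.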


Source: 2002 Master’s Thesis, Jewel Hope Ward, UNC Chapel Hill

Five elements are used 71% of the time


Metadata Analysis Model


Metadata Subsetting

DP sets are unlikely to serve all SP uses well
SPs will need the ability to subset harvested metadata
Example: prototype subsetting tool
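
Subsetting on the SP side can be as simple as filtering harvested records by field content. This sketch is not the prototype tool mentioned above; `subset`, the record layout, and the sample data are assumptions.

```python
def subset(records, **criteria):
    """Keep records in which every named field has a value containing the text."""
    def matches(fields):
        for name, needle in criteria.items():
            values = fields.get(name, [])
            if not any(needle.lower() in value.lower() for value in values):
                return False
        return True
    return [fields for fields in records if matches(fields)]

# Made-up harvested records for illustration.
collection = [
    {"title": ["Yosemite Valley"], "type": ["image"]},
    {"title": ["Annual report"], "type": ["text"]},
]
images = subset(collection, type="image")
print(images)
```

Because the criteria operate on the metadata itself rather than on DP-defined sets, an SP can carve out the subset its users need regardless of how each repository grouped its records.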


A Subsetting Model


Metadata Normalization

Normalizing: to reduce to a standard or normal state
Prototype date normalization service screen
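
Dates are a natural first target for normalization. The sketch below reduces free-text dates to a start and end year so they can be searched as a range. It is a simplification, not the prototype service shown on the slide: it only looks for four-digit years, and the name `normalize_date` is an assumption.

```python
import re

# A four-digit run that is not part of a longer number.
YEAR_RE = re.compile(r"(?<!\d)(\d{4})(?!\d)")

def normalize_date(raw):
    """Return (start_year, end_year), or None when no year is found."""
    years = [int(year) for year in YEAR_RE.findall(raw)]
    if not years:
        return None
    return min(years), max(years)

for raw in ["ca. 1900", "1895-1902", "[1923?]", "undated"]:
    print(raw, "->", normalize_date(raw))
```

The original string should be kept for display; the normalized range is only there to make searching and sorting consistent across repositories.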


Metadata Enrichment

Adding fields or values may be useful or required, for example:

Metadata provider information
Geographic coverage
Subject terms mapped to a different thesaurus
Authority control record
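
Two of these enrichments, provider information and mapped subject terms, can be sketched as a function that returns an enriched copy of a record. The field names `provider` and `subject_mapped`, the provider name, and the one-entry subject map are hypothetical.

```python
import copy

def enrich(fields, provider, subject_map=None):
    """Return a copy of the record with provider and mapped subjects added."""
    enriched = copy.deepcopy(fields)  # leave the harvested record untouched
    enriched["provider"] = [provider]
    if subject_map:
        mapped = [subject_map[term] for term in enriched.get("subject", [])
                  if term in subject_map]
        if mapped:
            enriched["subject_mapped"] = mapped
    return enriched

# Made-up record and thesaurus mapping for illustration.
original = {"title": ["X"], "subject": ["Autos", "Trees"]}
result = enrich(original, "Example Digital Library", {"Autos": "Automobiles"})
print(result)
```

Writing mapped terms to a separate field keeps the DP's own subject terms intact, so the enrichment can be redone if the target thesaurus changes.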


A Harvesting Service Model


Indexing

Pick your favorite database/indexing software:

    MySQL
    SWISH-E

May need to specifically set up a method to search across the entire record
May need different fields for indexing than for display
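
Both points can be illustrated with one table that holds a display field and a separate field concatenating the whole record for searching. The sketch uses Python's built-in `sqlite3` as a stand-in so it runs anywhere; it is not MySQL or SWISH-E, and the table layout and function names are assumptions.

```python
import sqlite3

def build_index(records):
    """records: list of (identifier, fields) pairs; returns an in-memory DB."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE records ("
        "identifier TEXT PRIMARY KEY, display_title TEXT, full_text TEXT)"
    )
    for identifier, fields in records:
        title = "; ".join(fields.get("title", []))  # shown to the user
        # Every value of every field, lowercased, for whole-record search.
        full_text = " ".join(v for values in fields.values() for v in values).lower()
        db.execute("INSERT INTO records VALUES (?, ?, ?)",
                   (identifier, title, full_text))
    db.commit()
    return db

def search(db, term):
    """Substring search across the entire record."""
    rows = db.execute(
        "SELECT identifier, display_title FROM records "
        "WHERE full_text LIKE ? ORDER BY identifier",
        (f"%{term.lower()}%",),
    )
    return rows.fetchall()

# Made-up records for illustration.
db = build_index([
    ("oai:example.org:1", {"title": ["Yosemite Valley"], "subject": ["Waterfalls"]}),
    ("oai:example.org:2", {"title": ["Annual report"], "creator": ["Sierra Club"]}),
])
print(search(db, "waterfalls"))
```

A production index would use proper full-text indexing rather than `LIKE`, but the split between an index field and a display field carries over unchanged.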


Interface

Software interface (API) for other applications:

    SRU/SRW?
    Arbitrary Web Services schema?

User interface


What’s Next?

Further protocol development
Services layered on top of OAI-PMH
Shared software tools
Best practices for both DPs and SPs


oai-best.comm.nsdl.org