Improving Metadata Quality: Strategies and Services Diane I. Hillmann Naomi Dushay Jon Phipps National Science Digital Library

Page 1: Improving Metadata Quality:  Strategies and Services

Improving Metadata Quality: Strategies and Services

Diane I. Hillmann

Naomi Dushay

Jon Phipps

National Science Digital Library

Page 2: Improving Metadata Quality:  Strategies and Services

Introduction

• Useful services depend on good metadata, but most metadata is not very good

• Human created metadata is expensive

• Automated crawling strategies are limited by:

– Accessibility barriers (rights issues, technical issues)

– Variable results with crawling technologies for non-text

• Best metadata does not rely solely on information contained within the resource itself

– Ex.: Controlled vocabularies, descriptions, links

Page 3: Improving Metadata Quality:  Strategies and Services

The NSDL Environment

• Functions to some extent as a metadata aggregator

– Simple, two-level hierarchy (Collections & items)

– Based on OAI-PMH harvest model

– Each harvested item associated with a collection

• Collection records managed via internal system that also drives automated harvest/ingest processes

– Harvested records split into elements for storage and reassembled for output
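
A minimal sketch of the harvest request behind the OAI-PMH model named above. The `ListRecords` verb and its `metadataPrefix`, `set`, `from`, and `resumptionToken` arguments come from the OAI-PMH protocol; the function name and the example base URL are assumptions for illustration, not part of the NSDL system.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None,
                     from_date=None, resumption_token=None):
    """Build an OAI-PMH ListRecords request URL (hypothetical helper)."""
    if resumption_token:
        # resumptionToken is an exclusive argument: it replaces all others.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if set_spec:
            params["set"] = set_spec
        if from_date:
            # Restricts the harvest to records changed since this date,
            # which is what makes incremental re-harvest possible.
            params["from"] = from_date
    return base_url + "?" + urlencode(params)


# Example: incremental harvest from a placeholder provider URL.
first_page = list_records_url("http://example.org/oai", from_date="2004-05-15")
next_page = list_records_url("http://example.org/oai", resumption_token="abc")
```

Each harvested item would then be associated with the collection whose record supplied the base URL.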

Page 4: Improving Metadata Quality:  Strategies and Services

Why Transform Metadata at All?

• Four categories of problems limit metadata usefulness:

– Missing data: elements not present

– Incorrect data: values not conforming to proper usage

– Confusing data: embedded HTML tags, improper separation of multiple elements, etc.

– Insufficient data: no indication of controlled vocabularies, formats, etc.

Page 5: Improving Metadata Quality:  Strategies and Services

Transforming Metadata “Safely”

• Enhance original data with no risk of degradation

• Provide a low-cost, scalable way to improve the quality and predictability of data

– Remove “noise”: empty elements, useless values

– Detect and identify controlled vocabularies: DCMIType and IMT values

– Normalize presentation: clean up values, remove double XML encodings, extra whitespace, etc.
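
A rough sketch of what the three "safe" steps above could look like on a list of (element, value) pairs. This is an illustration under stated assumptions, not the NSDL implementation: the noise list, the function name, and the scheme labels are hypothetical, and values are assumed to be already XML-parsed, so any entity still present is treated as double encoding.

```python
import html
import re

# The twelve terms of the DCMI Type Vocabulary.
DCMI_TYPES = {
    "Collection", "Dataset", "Event", "Image", "InteractiveResource",
    "MovingImage", "PhysicalObject", "Service", "Software", "Sound",
    "StillImage", "Text",
}
# Looks like an Internet Media Type: registered top-level type, "/", subtype.
IMT_PATTERN = re.compile(
    r"^(application|audio|image|message|model|multipart|text|video)/[\w.+-]+$"
)
# Hypothetical list of "useless" values; a real list would be curated.
NOISE_VALUES = {"", "n/a", "none", "unknown"}


def safe_transform(statements):
    """Return (element, value, scheme) triples; scheme is None if undetected."""
    cleaned = []
    for element, value in statements:
        # Normalize presentation: undo double encoding, collapse whitespace.
        value = html.unescape(value)
        value = re.sub(r"\s+", " ", value).strip()
        # Remove noise: drop empty elements and useless values.
        if value.lower() in NOISE_VALUES:
            continue
        # Detect controlled vocabularies without changing the value itself.
        scheme = None
        if element == "dc:type" and value in DCMI_TYPES:
            scheme = "dct:DCMIType"
        elif element == "dc:format" and IMT_PATTERN.match(value):
            scheme = "dct:IMT"
        cleaned.append((element, value, scheme))
    return cleaned


result = safe_transform([
    ("dc:title", "  Surface   Chemistry &amp; Physics "),
    ("dc:description", "N/A"),
    ("dc:type", "Text"),
    ("dc:format", "text/html"),
    ("dc:type", "web page"),
])
```

Note that the unrecognized value "web page" is kept untouched: a safe transform only adds a scheme label when it is certain, and otherwise leaves the original statement alone.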

Page 6: Improving Metadata Quality:  Strategies and Services

Beyond “Safe Transforms”

• Managing each "record" separately made automated maintenance and enhancement difficult

• Many sources of data required more tailored quality improvement

• Distinction between improvements to the metadata expression and additional information about the resource itself

• Potential to make the knowledge and expertise of NSDL data managers available to downstream consumers of the data

Page 7: Improving Metadata Quality:  Strategies and Services

From Records to Elements

• Metadata record -- “a series of statements about resources” which can be aggregated to build a more complete profile of a resource

• Statements come with source information, and links to details about services and harvests
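
One way to picture the shift from records to statements, as a minimal sketch. The class and field names are hypothetical; the IDs and values are taken from the sample record shown later in this presentation.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Statement:
    """One metadata statement about a resource, with its provenance."""
    resource: str
    element: str
    value: str
    source_service_id: str  # which service made the statement
    source_record_id: str   # which harvested record it came from
    scheme: Optional[str] = None


def build_profile(statements, resource):
    """Aggregate all statements about one resource, grouped by element."""
    profile = defaultdict(list)
    for statement in statements:
        if statement.resource == resource:
            profile[statement.element].append(statement)
    return dict(profile)


url = "http://www.chem.qmw.ac.uk/surfaces/scc/"
statements = [
    Statement(url, "dc:language", "eng-GB", "316878", "332518"),
    Statement(url, "dc:language", "en-GB", "8957432", "993251", "dct:RFC3066"),
    Statement("http://example.org/other", "dc:title", "Other", "316878", "1"),
]
profile = build_profile(statements, url)
```

Both language statements survive side by side, each still tied to the service and record that produced it.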

Page 8: Improving Metadata Quality:  Strategies and Services

NSDL Metadata Repository

[Diagram: sources feeding the NSDL Metadata Repository and the record it exposes; connections labeled OAI]

• Provider A, orig metadata: <dc:title>, <dc:identifier>, <dc:creator>, <dc:type>

• iVia Enhancement Service, iVia enhancements: <dc:subject GEM>, <dc:subject LCSH>, <dc:subject LCC>

• ENC Enhancement Service, ENC enhancements: <dct:audience>, <dct:educationLevel>

• NSDL Safe Transforms, safe xform enhancements: <dc:identifier URI>, <dc:type DCMIType>

• NSDL normalized/augmented record: <dc:title source=A>, <dc:creator source=A>, <dc:subject GEM source=iVia>, <dc:subject LCSH source=iVia>, <dc:subject LCC source=iVia>, <dc:identifier URI source=MR>, <dc:type DCMIType source=MR>, <dct:audience source=ENC>, <dct:educationLevel source=ENC>

Page 9: Improving Metadata Quality:  Strategies and Services

Exposing Quality Information

• Metadata statements vary in quality, and may be subjective

• Quality of statements can be determined to a great extent by knowledge of the source, and knowledge of the methodology used to create the statement

• Detailed provenance itself is a good indicator of quality metadata

Page 10: Improving Metadata Quality:  Strategies and Services

Exposing Data to Downstream Users

• Two major issues:

– Linking statements to particular harvested source records (including the datestamp of the harvest)

– Linking records to the services that provided them (including descriptions of those services and the methods used to create the metadata)

• Required the creation and exposure of service records and a service vocabulary to categorize them

Page 20: Improving Metadata Quality:  Strategies and Services

<record>
  <metadata>
    <nsdl_dcard_m >
      …
      <dc:identifier sourceRecordID="332518" sourceServiceID="316878">http://www.chem.qmw.ac.uk/surfaces/scc/</dc:identifier>
      <dc:identifier sourceRecordID="993251" sourceServiceID="8957432" xsi:type="dct:URI">http://www.chem.qmw.ac.uk/surfaces/scc/</dc:identifier>
      <dc:language sourceRecordID="332518" sourceServiceID="316878">eng-GB</dc:language>
      <dc:language sourceRecordID="993251" sourceServiceID="8957432" xsi:type="dct:RFC3066">en-GB</dc:language>
      …
    </nsdl_dcard_m >
  </metadata>
</record>

Page 21: Improving Metadata Quality:  Strategies and Services

<about>
  <sourceRecords>
    <sourceRecord recordID="332518" sourceServiceID="316878">
      <datestamp>2002-11-11</datestamp>
      <identifier>http://nsdl.org/mr/oai:nsdl.org:316878:oai:asdlib.org:asdl001709</identifier>
    </sourceRecord>
    <sourceRecord recordID="993251" sourceServiceID="8957432">
      <datestamp>2004-15-05T05:11:00Z</datestamp>
      <identifier>http://nsdl.org/mr/oai:nsdl.org:nsdl.service:993251</identifier>
    </sourceRecord>
    …
  </sourceRecords>
</about>

Page 22: Improving Metadata Quality:  Strategies and Services

<about>
  <sourceServices>
    <sourceService serviceID="316878">
      <dc:title>Analytical Sciences Digital Library (ASDL)</dc:title>
      <dc:description>The ASDL is an electronic library that collects, catalogs and links web-based information or discovery material an...</dc:description>
      <dc:type xsi:type="nsdl:serviceType">Collection</dc:type>
      <serviceDescription xsi:type="nsdl:xml">http://nsdl.org/mr/xml/316878</serviceDescription>
    </sourceService>
    <sourceService serviceID="8957432">
      <dc:title>NSDL Metadata Normalization Service</dc:title>
      <dc:description>The NSDL Metadata Normalization Service provides the spices that help to create delicious sausage from metadata chicken lips, feathers...</dc:description>
      …
    </sourceService>
  </sourceServices>
</about>
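
A sketch of how a downstream consumer might use these links, resolving each statement's sourceServiceID to the title of the service that made it. The sample XML is a trimmed, hypothetical variant of the slides' fragments: the metadata and service sections are combined under one record and the dc namespace declaration is added so the fragment parses on its own.

```python
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"

# Assumed, simplified record built from the fragments shown above.
SAMPLE = """
<record xmlns:dc="http://purl.org/dc/elements/1.1/">
  <metadata>
    <dc:language sourceRecordID="332518" sourceServiceID="316878">eng-GB</dc:language>
    <dc:language sourceRecordID="993251" sourceServiceID="8957432">en-GB</dc:language>
  </metadata>
  <about>
    <sourceServices>
      <sourceService serviceID="316878">
        <dc:title>Analytical Sciences Digital Library (ASDL)</dc:title>
      </sourceService>
      <sourceService serviceID="8957432">
        <dc:title>NSDL Metadata Normalization Service</dc:title>
      </sourceService>
    </sourceServices>
  </about>
</record>
"""


def statements_with_sources(xml_text):
    """Return (element, value, service title) for each metadata statement."""
    root = ET.fromstring(xml_text.strip())
    # Map serviceID -> human-readable service title.
    titles = {
        service.get("serviceID"): service.findtext(f"{{{DC}}}title", default="").strip()
        for service in root.iter("sourceService")
    }
    rows = []
    for element in root.find("metadata"):
        local_name = element.tag.split("}")[-1]  # drop the namespace URI
        value = (element.text or "").strip()
        rows.append((local_name, value, titles.get(element.get("sourceServiceID"))))
    return rows


rows = statements_with_sources(SAMPLE)
```

With the service title attached, a consumer can decide, for example, to prefer the normalized "en-GB" over the provider's original "eng-GB".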

Page 23: Improving Metadata Quality:  Strategies and Services

<about xmlns:dc="http://purl.org/dc/elements/1.1/">
  <collectionMembership>
    <collection collectionID="316878">
      <dc:title>Analytical Sciences Digital Library (ASDL)</dc:title>
      <dc:description>The ASDL is an electronic library that collects, catalogs and links web-based information or discovery material an...</dc:description>
      <dc:identifier xsi:type="URI">http://www.asdlib.org/</dc:identifier>
      <dc:identifier>oai:nsdl.org:nsdl.nsdl:00229</dc:identifier>
    </collection>
    <collection collectionID="4718">
      <dc:title>ENC Online: The best selection of K-12 mathematics and science curriculum resources on the Internet!</dc:title>
      …
    </collection>
  </collectionMembership>
</about>

Page 24: Improving Metadata Quality:  Strategies and Services

Service Provision Model: iVia

• A variety of metadata generation services

– Crawling to determine what resources are part of a “collection”

– Metadata creation for each resource

– Augmenting metadata, adding subjects, classification, format

Page 25: Improving Metadata Quality:  Strategies and Services

iVia Service Issues

• Human review of results is essential

– Error handling and blacklisting

Page 26: Improving Metadata Quality:  Strategies and Services

iVia Service Issues

• Human review of results is essential

– Log review

Page 27: Improving Metadata Quality:  Strategies and Services

iVia Service Challenge

• Repeatable crawls

– Storing and reusing the crawl parameters

– Repeating the crawl on a schedule

– Incremental updates of the iVia data

– Editor notification of crawl completion

– Initiation of incremental reharvest

Page 28: Improving Metadata Quality:  Strategies and Services

Metadata Quality Services

• Metadata generation & augmentation

• Metadata transformation (“safe” and “collection specific”)

• Equivalence

• Crosswalking (schema and vocabulary)

• Persistence/archiving

• Annotation

• Metadata improvement and rating

Page 29: Improving Metadata Quality:  Strategies and Services

“Conducting” Service Interactions

• Order, timing, and response important

• Passive and active interactions; human and automated triggers

• Parameters for each interaction stored

• Supporting “freshness” and automated updating

Page 30: Improving Metadata Quality:  Strategies and Services

Typical Service Orchestration

Introducing Lenny…

• Editor initiates an iVia Guided Crawl

• Editor reviews results, blacklists

• Editor notifies Lenny that crawl is complete

• Lenny initiates OAI harvest and ingest

• Lenny notified of ingest success

Page 31: Improving Metadata Quality:  Strategies and Services

Typical Service Orchestration

• Lenny initiates Safe Transform Service

• Service notifies Lenny that it’s done

• Lenny initiates OAI harvest and ingest

• Lenny notified of ingest success

Page 32: Improving Metadata Quality:  Strategies and Services

Typical Service Orchestration

• Lenny initiates Collection-Specific Transform Service

• Service notifies Lenny that it’s done

• Lenny initiates OAI harvest and ingest

• Lenny notified of ingest success

• Lenny rests
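
The orchestration walked through on these three slides follows one repeating pattern: a service finishes, then Lenny harvests and ingests its output before starting the next one. A minimal sketch of that loop, assuming hypothetical callables for the services and for harvest/ingest (the real Lenny's interfaces are not shown in the slides):

```python
def orchestrate(services, run_service, harvest_and_ingest):
    """Run each service in order, harvesting its output before the next.

    run_service(name) is assumed to return True once the service reports
    completion (for the guided crawl, once the editor notifies Lenny).
    harvest_and_ingest(name) is assumed to return True on ingest success.
    """
    log = []
    for name in services:
        if not run_service(name):
            log.append(f"{name}: failed")
            break  # order matters: later services depend on earlier output
        log.append(f"{name}: done")
        if not harvest_and_ingest(name):
            log.append(f"{name}: ingest failed")
            break
        log.append(f"{name}: ingested")
    return log


PIPELINE = [
    "iVia Guided Crawl",
    "Safe Transform Service",
    "Collection-Specific Transform Service",
]

# Stub callables that always succeed, standing in for the real services.
log = orchestrate(PIPELINE, lambda name: True, lambda name: True)
```

Stopping at the first failure reflects the point that order, timing, and response all matter: a transform should not run against records that were never ingested.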

Page 33: Improving Metadata Quality:  Strategies and Services

The Who and Where of Services

• Many of the services we describe are useful to most metadata aggregators

• No aggregator can afford to create many single purpose services closely coupled with a single aggregator

• Shared, open services can provide a useful basis for improved metadata for all

Page 34: Improving Metadata Quality:  Strategies and Services

Conclusions

• New role for “metadata aggregators”: providing enhanced metadata for other services to re-use

– Integrating fragmentary metadata created by automated services

– Improving metadata in standard ways

– Exposing all relevant data in ways that allow consumers to evaluate quality and usefulness

Page 35: Improving Metadata Quality:  Strategies and Services

Conclusions

“This model of service provision holds much potential in an environment where persistent metadata quality issues threaten to overwhelm aggregators hoping to build services on top of harvested metadata. No single aggregator can fill in the quality gaps alone, but if metadata services are built to interoperate with a variety of aggregators using low barrier protocols like OAI-PMH, many can benefit from the work, freeing resources for new service development.”