Post on 04-Feb-2016

Improving Metadata Quality: Strategies and Services

Diane I. Hillmann

Naomi Dushay

Jon Phipps

National Science Digital Library

Introduction

• Useful services depend on good metadata, but most metadata is not very good

• Human-created metadata is expensive

• Automated crawling strategies are limited by:

– Accessibility barriers (rights issues, technical issues)

– Variable results with crawling technologies for non-text

• The best metadata does not rely solely on information contained within the resource itself

– Ex.: Controlled vocabularies, descriptions, links

The NSDL Environment

• Functions to some extent as a metadata aggregator

– Simple, two-level hierarchy (collections & items)

– Based on OAI-PMH harvest model

– Each harvested item associated with a collection

• Collection records managed via internal system that also drives automated harvest/ingest processes

– Harvested records split into elements for storage and reassembled for output
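
A minimal sketch of what one harvest request in this model might look like. The `verb`, `metadataPrefix`, `from`, and `set` arguments are standard OAI-PMH request parameters; the base URL, the function name, and the date are placeholders chosen for illustration, not NSDL's actual configuration.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", from_date=None, set_spec=None):
    """Build an OAI-PMH ListRecords request URL for one provider."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        # Selective harvesting: only records changed since this date.
        params["from"] = from_date
    if set_spec:
        params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


# Hypothetical provider endpoint and harvest date.
url = list_records_url("http://example.org/oai", from_date="2004-05-15")
print(url)
```

Passing the date of the previous harvest as `from_date` is one way to keep repeated harvests incremental.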

Why Transform Metadata at All?

• Four categories of problems limit metadata usefulness:

– Missing data: elements not present

– Incorrect data: values not conforming to proper usage

– Confusing data: embedded HTML tags, improper separation of multiple elements, etc.

– Insufficient data: no indication of controlled vocabularies, formats, etc.
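
The four categories can be illustrated with a small checker. This is only a sketch: the statement layout (dicts with `element`, `value`, and an optional `scheme`), the required-element list, and the specific tests for each category are assumptions made for the example, not rules taken from NSDL.

```python
import re

REQUIRED = ("title", "identifier")  # assumed minimum element set
DATE_RE = re.compile(r"^\d{4}(-\d{2}(-\d{2})?)?$")  # assumed "proper usage" for dates
HTML_TAG_RE = re.compile(r"</?[A-Za-z][^>]*>")
DCMI_TYPES = {
    "Collection", "Dataset", "Event", "Image", "InteractiveResource",
    "MovingImage", "PhysicalObject", "Service", "Software", "Sound",
    "StillImage", "Text",
}


def diagnose(statements):
    """Return (category, element) pairs for the problems found in one record."""
    problems = []
    present = {s["element"] for s in statements if s["value"].strip()}
    for name in REQUIRED:
        if name not in present:
            problems.append(("missing", name))
    for s in statements:
        element, value = s["element"], s["value"].strip()
        if element == "date" and value and not DATE_RE.match(value):
            problems.append(("incorrect", element))
        # Embedded markup, or several subjects packed into one element.
        if HTML_TAG_RE.search(value) or (element == "subject" and ";" in value):
            problems.append(("confusing", element))
        # A controlled value is used but its vocabulary is not declared.
        if element == "type" and value in DCMI_TYPES and not s.get("scheme"):
            problems.append(("insufficient", element))
    return problems


# Hypothetical record exhibiting all four kinds of problem.
record = [
    {"element": "title", "value": "<b>Surface Chemistry</b>"},
    {"element": "date", "value": "11/11/2002"},
    {"element": "subject", "value": "chemistry; surfaces"},
    {"element": "type", "value": "Text"},
]
report = diagnose(record)
```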

Transforming Metadata “Safely”

• Enhance original data with no risk of degradation

• Provide a low-cost, scalable way to improve the quality and predictability of data

– Remove “noise”: empty elements, useless values

– Detect and identify controlled vocabularies: DCMIType and IMT values

– Normalize presentation: clean up values, remove double XML encodings, extra whitespace, etc.
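
The three steps above could be sketched roughly as follows. The statement layout, the list of "useless" values, the scheme labels, and the function names are assumptions for illustration; the real NSDL transforms are not described in enough detail here to reproduce. The input list is left unmodified, in keeping with the "no risk of degradation" goal.

```python
import html
import re

USELESS = {"", "n/a", "none", "unknown"}  # assumed noise values
DCMI_TYPES = {
    "Collection", "Dataset", "Event", "Image", "InteractiveResource",
    "MovingImage", "PhysicalObject", "Service", "Software", "Sound",
    "StillImage", "Text",
}
# Rough shape of an Internet Media Type such as "text/html".
IMT_RE = re.compile(
    r"^(application|audio|image|message|model|multipart|text|video)/[\w.+-]+$",
    re.IGNORECASE,
)


def clean(value):
    """Undo one leftover layer of XML escaping and collapse whitespace."""
    return " ".join(html.unescape(value).split())


def safe_transform(statements):
    """Return cleaned copies of the statements; the originals are not touched."""
    out = []
    for s in statements:
        value = clean(s["value"])
        if value.lower() in USELESS:
            continue  # remove noise
        new = dict(s, value=value)
        if s["element"] == "type" and value in DCMI_TYPES:
            new["scheme"] = "dct:DCMIType"
        elif s["element"] == "format" and IMT_RE.match(value):
            new["scheme"] = "dct:IMT"
        out.append(new)
    return out


# Hypothetical input, as it might look after one round of XML parsing.
sample = [
    {"element": "title", "value": "  Surfaces   &amp; Interfaces "},
    {"element": "description", "value": "N/A"},
    {"element": "type", "value": "Text"},
    {"element": "format", "value": "text/html"},
]
result = safe_transform(sample)
```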

Beyond “Safe Transforms”

• Managing each "record" separately made automated maintenance and enhancement difficult

• Many sources of data required more tailored quality improvement

• Distinction between improvements to the metadata expression and additional information about the resource itself

• Potential to make the knowledge and expertise of NSDL data managers available to downstream consumers of the data

From Records to Elements

• Metadata record -- “a series of statements about resources” which can be aggregated to build a more complete profile of a resource

• Statements come with source information, and links to details about services and harvests

NSDL Metadata Repository

[Diagram: metadata flows into and out of the NSDL Metadata Repository over OAI.]

• Provider A supplies the original metadata: <dc:title>, <dc:identifier>, <dc:creator>, <dc:type>

• iVia Enhancement Service contributes: <dc:subject GEM>, <dc:subject LCSH>, <dc:subject LCC>

• ENC Enhancement Service contributes: <dct:audience>, <dct:educationLevel>

• NSDL Safe Transforms contribute: <dc:identifier URI>, <dc:type DCMIType>

• The NSDL normalized/augmented record labels every statement with its source: <dc:title source=A>, <dc:creator source=A>, <dc:subject GEM source=iVia>, <dc:subject LCSH source=iVia>, <dc:subject LCC source=iVia>, <dct:audience source=ENC>, <dct:educationLevel source=ENC>, <dc:identifier URI source=MR>, <dc:type DCMIType source=MR>
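
One way to picture the shift from records to elements is a flat list of source-tagged statements that is regrouped into a profile on output. The sketch below assumes a simple `Statement` type and uses the source labels from the diagram (A, iVia, ENC, MR); all of the values are invented placeholders.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Statement:
    element: str
    value: str
    source: str                   # who made the statement
    scheme: Optional[str] = None  # controlled vocabulary, if known


def build_profile(statements):
    """Group statements by element while keeping each one's source."""
    profile = defaultdict(list)
    for s in statements:
        profile[s.element].append(s)
    return dict(profile)


# Placeholder values; only the element names and sources follow the diagram.
statements = [
    Statement("dc:title", "Example title", source="A"),
    Statement("dc:creator", "Example creator", source="A"),
    Statement("dc:subject", "Example GEM subject", source="iVia", scheme="GEM"),
    Statement("dc:subject", "Example LCSH subject", source="iVia", scheme="LCSH"),
    Statement("dct:audience", "Example audience", source="ENC"),
    Statement("dc:type", "Text", source="MR", scheme="DCMIType"),
]
profile = build_profile(statements)
sources = sorted({s.source for s in statements})
```

Because every statement carries its own source, a consumer can keep or discard contributions from a particular service without touching the rest of the profile.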

Exposing Quality Information

• Metadata statements vary in quality, and may be subjective

• Quality of statements can be determined to a great extent by knowledge of the source, and knowledge of the methodology used to create the statement

• Detailed provenance itself is a good indicator of quality metadata

Exposing Data to Downstream Users

• Two major issues:

– Linking statements to particular harvested source records (including the datestamp of the harvest)

– Linking records to the services that provided them (including descriptions of those services and the methods used to create the metadata)

• Required the creation and exposure of service records and a service vocabulary to categorize them

<record>
  <metadata>
    <nsdl_dcard_m>
      …
      <dc:identifier sourceRecordID="332518" sourceServiceID="316878">http://www.chem.qmw.ac.uk/surfaces/scc/</dc:identifier>
      <dc:identifier sourceRecordID="993251" sourceServiceID="8957432" xsi:type="dct:URI">http://www.chem.qmw.ac.uk/surfaces/scc/</dc:identifier>
      <dc:language sourceRecordID="332518" sourceServiceID="316878">eng-GB</dc:language>
      <dc:language sourceRecordID="993251" sourceServiceID="8957432" xsi:type="dct:RFC3066">en-GB</dc:language>
      …
    </nsdl_dcard_m>
  </metadata>
</record>

<about>
  <sourceRecords>
    <sourceRecord recordID="332518" sourceServiceID="316878">
      <datestamp>2002-11-11</datestamp>
      <identifier>http://nsdl.org/mr/oai:nsdl.org:316878:oai:asdlib.org:asdl001709</identifier>
    </sourceRecord>
    <sourceRecord recordID="993251" sourceServiceID="8957432">
      <datestamp>2004-15-05T05:11:00Z</datestamp>
      <identifier>http://nsdl.org/mr/oai:nsdl.org:nsdl.service:993251</identifier>
    </sourceRecord>
    …
  </sourceRecords>
</about>

<about>
  <sourceServices>
    <sourceService serviceID="316878">
      <dc:title>Analytical Sciences Digital Library (ASDL)</dc:title>
      <dc:description>The ASDL is an electronic library that collects, catalogs and links web-based information or discovery material an...</dc:description>
      <dc:type xsi:type="nsdl:serviceType">Collection</dc:type>
      <serviceDescription xsi:type="nsdl:xml">http://nsdl.org/mr/xml/316878</serviceDescription>
    </sourceService>
    <sourceService serviceID="8957432">
      <dc:title>NSDL Metadata Normalization Service</dc:title>
      <dc:description>The NSDL Metadata Normalization Service provides the spices that help to create delicious sausage from metadata chicken lips, feathers...</dc:description>
      …
    </sourceService>
  </sourceServices>
</about>

<about xmlns:dc="http://purl.org/dc/elements/1.1/">
  <collectionMembership>
    <collection collectionID="316878">
      <dc:title>Analytical Sciences Digital Library (ASDL)</dc:title>
      <dc:description>The ASDL is an electronic library that collects, catalogs and links web-based information or discovery material an...</dc:description>
      <dc:identifier xsi:type="URI">http://www.asdlib.org/</dc:identifier>
      <dc:identifier>oai:nsdl.org:nsdl.nsdl:00229</dc:identifier>
    </collection>
    <collection collectionID="4718">
      <dc:title>ENC Online: The best selection of K-12 mathematics and science curriculum resources on the Internet!</dc:title>
      …
    </collection>
  </collectionMembership>
</about>
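
A downstream consumer might resolve each statement's `sourceServiceID` to the service that produced it along these lines. This is a sketch under assumptions: the sample is a cut-down version of the excerpts above with the `dc` namespace declared on the root and `<about>` nested inside `<record>`, which the slides do not show, and the function name is invented.

```python
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"

# Reduced sample; the namespace declaration and nesting are assumptions.
SAMPLE = """
<record xmlns:dc="http://purl.org/dc/elements/1.1/">
  <metadata>
    <dc:language sourceRecordID="332518" sourceServiceID="316878">eng-GB</dc:language>
    <dc:language sourceRecordID="993251" sourceServiceID="8957432">en-GB</dc:language>
  </metadata>
  <about>
    <sourceServices>
      <sourceService serviceID="316878">
        <dc:title>Analytical Sciences Digital Library (ASDL)</dc:title>
      </sourceService>
      <sourceService serviceID="8957432">
        <dc:title>NSDL Metadata Normalization Service</dc:title>
      </sourceService>
    </sourceServices>
  </about>
</record>
"""


def statements_with_services(xml_text):
    """Return (element, value, service title) for each metadata statement."""
    root = ET.fromstring(xml_text.strip())
    # Index service titles by serviceID.
    services = {
        svc.get("serviceID"): svc.findtext(f"{{{DC}}}title", default="").strip()
        for svc in root.iter("sourceService")
    }
    rows = []
    for el in root.find("metadata"):
        local_name = el.tag.split("}")[1]  # drop the "{namespace}" prefix
        service = services.get(el.get("sourceServiceID"), "unknown service")
        rows.append((local_name, (el.text or "").strip(), service))
    return rows


rows = statements_with_services(SAMPLE)
```

With this mapping a consumer could, for example, prefer the normalized `en-GB` value because it comes from the normalization service rather than the original provider.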

Service Provision Model: iVia

• A variety of metadata generation services

– Crawling to determine what resources are part of a “collection”

– Metadata creation for each resource

– Augmenting metadata, adding subjects, classification, format

iVia Service Issues

• Human review of results is essential

– Error handling and blacklisting

– Log review

iVia Service Challenge

• Repeatable crawls

– Storing and reusing the crawl parameters

– Repeating the crawl on a schedule

– Incremental updates of the iVia data

– Editor notification of crawl completion

– Initiation of incremental reharvest
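
Storing crawl parameters so a crawl can be repeated on a schedule might look something like the sketch below. Every field name here (`seed_url`, `max_depth`, `interval_days`, and so on) is hypothetical; iVia's real crawl parameters are not given in this presentation.

```python
import json
from dataclasses import asdict, dataclass
from datetime import date, timedelta


@dataclass
class CrawlConfig:
    # All fields are illustrative assumptions, not iVia's actual settings.
    collection_id: str
    seed_url: str
    max_depth: int
    interval_days: int
    last_run: str  # ISO date of the previous crawl


def save(config):
    """Serialize the parameters so the same crawl can be rerun later."""
    return json.dumps(asdict(config))


def load(text):
    return CrawlConfig(**json.loads(text))


def next_run(config):
    return date.fromisoformat(config.last_run) + timedelta(days=config.interval_days)


def is_due(config, today):
    return today >= next_run(config)


cfg = CrawlConfig("316878", "http://www.asdlib.org/", 3, 30, "2004-05-15")
restored = load(save(cfg))
```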

Metadata Quality Services

• Metadata generation & augmentation

• Metadata transformation (“safe” and “collection specific”)

• Equivalence

• Crosswalking (schema and vocabulary)

• Persistence/archiving

• Annotation

• Metadata improvement and rating

“Conducting” Service Interactions

• Order, timing, and response important

• Passive and active interactions; human and automated triggers

• Parameters for each interaction stored

• Supporting “freshness” and automated updating

Typical Service Orchestration

Introducing Lenny…

• Editor initiates an iVia Guided Crawl

• Editor reviews results, blacklists

• Editor notifies Lenny that crawl is complete

• Lenny initiates OAI harvest and ingest

• Lenny notified of ingest success

Typical Service Orchestration

• Lenny initiates Safe Transform Service

• Service notifies Lenny that it’s done

• Lenny initiates OAI harvest and ingest

• Lenny notified of ingest success

Typical Service Orchestration

• Lenny initiates Collection-Specific Transform Service

• Service notifies Lenny that it’s done

• Lenny initiates OAI harvest and ingest

• Lenny notified of ingest success

• Lenny rests
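
The sequence Lenny follows after the editor reports a finished crawl can be sketched as a fixed pipeline. This is a simplification made for illustration: the real interactions are notifications between separate services, whereas here each service is a plain function that returns True on success, and the class and step names are invented.

```python
# Steps Lenny runs once the editor reports that the crawl is complete.
PIPELINE = [
    "harvest_and_ingest",
    "safe_transform",
    "harvest_and_ingest",
    "collection_specific_transform",
    "harvest_and_ingest",
]


class Lenny:
    def __init__(self, services):
        self.services = services  # step name -> callable(collection_id) -> bool
        self.log = []

    def crawl_complete(self, collection_id):
        """Run each step in order; stop at the first failure."""
        for step in PIPELINE:
            ok = bool(self.services[step](collection_id))
            self.log.append((step, ok))
            if not ok:
                return False  # later steps depend on this one
        self.log.append(("rest", True))
        return True


# Stub services that always succeed, standing in for the real ones.
calls = []


def stub(name):
    def run(collection_id):
        calls.append((name, collection_id))
        return True
    return run


lenny = Lenny({name: stub(name) for name in set(PIPELINE)})
finished = lenny.crawl_complete("316878")
```

Stopping at the first failed step reflects the slides' point that order and response matter: a transform should not run against an ingest that did not succeed.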

The Who and Where of Services

• Many of the services we describe are useful to most metadata aggregators

• No aggregator can afford to build many single-purpose services that are closely coupled to itself alone

• Shared, open services can provide a useful basis for improved metadata for all

Conclusions

• New role for “metadata aggregators”: providing enhanced metadata for other services to re-use

– Integrating fragmentary metadata created by automated services

– Improving metadata in standard ways

– Exposing all relevant data in ways that allow consumers to evaluate quality and usefulness

Conclusions

“This model of service provision holds much potential in an environment where persistent metadata quality issues threaten to overwhelm aggregators hoping to build services on top of harvested metadata. No single aggregator can fill in the quality gaps alone, but if metadata services are built to interoperate with a variety of aggregators using low barrier protocols like OAI-PMH, many can benefit from the work, freeing resources for new service development.”
