collaborate, automate, prepare, prioritize: creating metadata for legacy research data

22
Creating Metadata for Legacy Research Data Collaborate, Automate, Prepare, Prioritize Stacy Konkiel IU Libraries Inna Kouper Data to Insight Center, IU Jennifer A. Liss IU Libraries Juliet L. Hardesty IU Libraries

Upload: jennifer-liss

Post on 08-May-2015

158 views

Category:

Education


0 download

DESCRIPTION

Data curation projects frequently deal with data that were not created for the purposes of long- term preservation and re-use. How can curation of such legacy data be improved by supplying necessary metadata? In this report, we address this and other questions by creating robust metadata for twenty legacy research datasets. We report on the metrics of creating domain- specific metadata and propose a four-prong framework of metadata creation for legacy research data. Our findings indicate that there is a steep learning curve in encoding metadata using the FGDC content standard for digital geospatial metadata. Our project demonstrates that when data curators are handed research data “as is,” they may be successful in incorporating such data into a data sharing environment. We found that data curators can be successful in creating descriptive metadata and enhancing discoverability via subject analysis. However, curators must be aware of the limitations in applying structural and administrative metadata for legacy data.

TRANSCRIPT

Page 1: Collaborate, Automate, Prepare, Prioritize: Creating Metadata for Legacy Research Data

Creating Metadata for Legacy Research Data

Collaborate, Automate, Prepare, Prioritize

Stacy KonkielIU Libraries

Inna KouperData to Insight Center, IU

Jennifer A. LissIU Libraries

Juliet L. HardestyIU Libraries

Page 2: Collaborate, Automate, Prepare, Prioritize: Creating Metadata for Legacy Research Data

Data Management as “Grand Challenge”

& Metadata^

Page 3: Collaborate, Automate, Prepare, Prioritize: Creating Metadata for Legacy Research Data

SEAD is funded by the National Science Foundation under Cooperative Agreement #OCI0940824

Page 4: Collaborate, Automate, Prepare, Prioritize: Creating Metadata for Legacy Research Data

Active

Curation

Repository

(ACR)

-- metadata

harvest

-- annotation

-- web tools

SEAD VIVO-- social

networking -- links data sets and

researchers

SEAD Virtual Archive (SVA)

-- manage sustainability science window to multiple IRs

IU Scholar Works IR

publish associate

discover

UIUC IDEALSIR

UMich Deep Blue IR

ingest

Page 5: Collaborate, Automate, Prepare, Prioritize: Creating Metadata for Legacy Research Data

Investigation

How can the curation of legacy data be improved by supplying necessary metadata?

How much time and effort is required to supply domain-specific metadata?

Page 6: Collaborate, Automate, Prepare, Prioritize: Creating Metadata for Legacy Research Data

Goals

• Enable discovery of research data

• Communicate experiences with metadata creation for legacy dataset to community

• Begin conversation about metadata practices for legacy data

Page 7: Collaborate, Automate, Prepare, Prioritize: Creating Metadata for Legacy Research Data

Methodology

• 20 NCED legacy datasets

• Federal Geographic Data Committee (FGDC) Content Standard for Digital Geospatial Metadata

Page 8: Collaborate, Automate, Prepare, Prioritize: Creating Metadata for Legacy Research Data

Methodology

• 4 encoders, each assigned 5 datasets

• Datasets ranged greatly in size and composition

• 0.01–664 GB

• 1–140,000 files

Page 9: Collaborate, Automate, Prepare, Prioritize: Creating Metadata for Legacy Research Data

Methodology

• Phase I

Standalone XML files using basic NCED-provided information & “Googleable” facts

• Phase II

Extensive research re: processes by which datasets were created and used

Page 10: Collaborate, Automate, Prepare, Prioritize: Creating Metadata for Legacy Research Data

Findings–Phase I

Dataset 1 Dataset 2 Dataset 3 Dataset 4 Dataset 50:000:280:571:261:552:242:523:213:504:19

Metadata creation time Phase I (h:mm)

Encoder 1 Encoder 2 Encoder 3 Encoder 4

Page 11: Collaborate, Automate, Prepare, Prioritize: Creating Metadata for Legacy Research Data

Findings–Phase I

Successes:

• Supplied many mandatory elements

• Thesauri & Controlled Vocabularies

Challenges:

• Time-intensive startup

• Lacking geospatial information

Page 12: Collaborate, Automate, Prepare, Prioritize: Creating Metadata for Legacy Research Data
Page 13: Collaborate, Automate, Prepare, Prioritize: Creating Metadata for Legacy Research Data

Findings–Phase II

Successes:

• Enhanced 10 metadata fields

Challenges:

• Accessing and processing the datasets (size, complexity)

Page 14: Collaborate, Automate, Prepare, Prioritize: Creating Metadata for Legacy Research Data

Observations

Though the information that we found may enhance opportunities for the discovery of legacy research data, the available information was unlikely to be sufficient to support the tasks of preservation, reproducibility, and re-use.

Page 15: Collaborate, Automate, Prepare, Prioritize: Creating Metadata for Legacy Research Data

Observations• FGDC is insufficient for dealing with

legacy research data

• Data curators without domain expertise can be successful in creating some types of metadata

• Structural and administrative metadata is difficult to curate without help of researchers

Page 16: Collaborate, Automate, Prepare, Prioritize: Creating Metadata for Legacy Research Data

Proposal: The CAPP Framework

• Labor• Datasets• Types of

metadata• User needs

• Choice of metadata standards

• Instructions / manuals

• Workflows / software

• Licensing and contact information

• File format identification

• Provenance• Native

environment• Entity extraction

• Subject specialists• Librarians• Researchers• Tool developers

Collaborate Automate

PrioritizePrepare

Page 17: Collaborate, Automate, Prepare, Prioritize: Creating Metadata for Legacy Research Data

Collaborate

• Subject specialists

• Librarians

• Researchers

• Tool developers

Page 18: Collaborate, Automate, Prepare, Prioritize: Creating Metadata for Legacy Research Data

Automate

• File format identification

• Provenance

• Native environment

• Entity extraction

Page 19: Collaborate, Automate, Prepare, Prioritize: Creating Metadata for Legacy Research Data

Prepare

• Choice of metadata standards

• Licensing and contact information

• Instructions and manuals

• Workflows and software

Page 20: Collaborate, Automate, Prepare, Prioritize: Creating Metadata for Legacy Research Data

Prioritize

• Labor

• Datasets

• Types of metadata

• User needs

Page 21: Collaborate, Automate, Prepare, Prioritize: Creating Metadata for Legacy Research Data

Future Work

Benchmark:

• Effectiveness of tools and workflows

• Collaborations and relationships

• Domains/interdisciplinarity

Page 22: Collaborate, Automate, Prepare, Prioritize: Creating Metadata for Legacy Research Data

Thank you!

Jennifer A. Liss

Metadata/Cataloging Librarian

[email protected]

http://sead-data.net