10-15-13 “Metadata and Repository Services for Research Data Curation” Presentation Slides

DESCRIPTION

“Hot Topics: The DuraSpace Community Webinar Series,” Series Six: “Research Data in Repositories,” curated by David Minor, Research Data Curation Program, UC San Diego Library. Webinar 2: “Metadata and Repository Services for Research Data Curation,” presented by Declan Fleming, Chief Technology Strategist; Arwen Hutt, Metadata Librarian; and Matt Critchlow, Manager of Development and Web Services, UC San Diego Library.

TRANSCRIPT

Page 1: 10-15-13 “Metadata and Repository Services for Research Data Curation” Presentation Slides

October 15, 2013 Hot Topics: DuraSpace Community Webinar Series

Hot Topics: The DuraSpace Community Webinar Series

Series Six: “Research Data in Repositories”

Curated by David Minor

Page 2:

October 15, 2013 Hot Topics: DuraSpace Community Webinar Series

Webinar 2: Metadata & Repository Services for Research Data Curation

Presented by:

Declan Fleming, Chief Technology Strategist, UC San Diego Library

Matt Critchlow, Manager of Development and Web Services, UC San Diego Library

Arwen Hutt, Metadata Librarian, UC San Diego Library

Page 3:

Hot Topics Web Seminar Series: Research Data in Repositories

The UC San Diego Experience

Second Webinar: Metadata and Repository Services for Research Data Curation

Page 4:

General Series Intro

• First webinar: Intro and Framing: UC San Diego decisions and planning

• Second Webinar: Deep dive into technology and metadata

• Third Webinar: The perspective from researchers, next steps

Page 5:

Your esteemed presenters …

First webinar:

David Minor – Program Director, Research Data Curation

Declan Fleming – Chief Technology Strategist

Second webinar:

Declan Fleming – Chief Technology Strategist

Arwen Hutt – Metadata Librarian

Matt Critchlow – Manager of Development and Web Services

Third webinar:

Dick Norris – Professor, Scripps Institution of Oceanography

Rick Wagner – Data Scientist at San Diego Supercomputer Center

Page 6:

Today we will …

• Discuss real-world researcher interaction

• Document how metadata and files combine to make digital objects

• Describe the DAMS data model and how it supports complex research objects

• Detail the technology driving the DAMS

• Point to the future

Page 7:

Working with Researchers: Pilots

• The Brain Observatory

• NSF OpenTopography Facility

• Levantine Archaeology Laboratory

• Scripps Institution of Oceanography Geological Collections

• The Laboratory for Computational Astrophysics

Page 8:

Working with Researchers: Process

• Introductory meeting

• Metadata point person

• Ongoing discussions

• One-on-one work

Iterative, collaborative, customized, experimental…pilot!

Page 9:

Working with Researchers: Data management

• Collocation

• Clean up

• Identifiers

• Metadata

Page 10:

Working with Researchers: What is an object?

• What are the boundaries of a discrete set or subset of data? What is required to make the data intelligible, usable and reusable?

• What needs to be preserved?

• What do they want to display and/or share?

• What do they want to be able to refer to or cite?

Page 11:

Working with Researchers: What is an object?

[Diagram: possible object boundaries, e.g. Brain or Slice, Site or Artifact, etc.]

Page 12:

Working with Researchers: Takeaways

They are the subject experts

There are many broad-level similarities

But there is no such thing as one size fits all

Page 13:

We want a new data model…

• One that is flexible and accommodates disparate metadata from a variety of sources

• While promoting consistency within the data store

• One that supports relationships within and between objects

• One that is more community engaged, both sharing vocabularies and technology and utilizing others' shared vocabularies and technologies

• One that supports improved management of objects and metadata

Page 14:

DAMS Data Model Development Process

• Five people, in a room, 16 hours a week for 4 months

• Worked through existing data, use case scenarios and known data requirements; investigated known ontologies, etc.

• Lots and lots and lots of discussion

• Utilizes MADS (Metadata Authority Description Schema)

• Results = a data dictionary and an OWL ontology

• Living document

Page 15:

DAMS Data Model: Flexibility

• The data model provides enough flexibility that we can accommodate a wide variety of data within the schema

– Vocabularies

– Use of “types” or “display labels” to distinguish specific subtypes of a data field

– Flexible structures and relationships

– Extensible

Page 16:

DAMS Data Model: Consistency

• But enough consistency that searching and display rules do not need to be customized for each individual collection of material

– Rules can be applied at the level of the broader concept

• As well as establishing the organizational structure necessary for maintaining consistency over time

– Evaluation and approval of modifications

Page 17:

DAMS Data Model: Relationships

• It allows us to create a number of different relationships

– Collections and sub-collections

– Collections and objects

– Objects and components (complex hierarchical objects)

– Other related resources internal or external to the DAMS

[Figure: complex object example]
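
The object/component relationship can be pictured as a small tree. The sketch below is illustrative only: the class names `DamsObject` and `Component`, the helper `walk`, and the example titles and file names are hypothetical and are not taken from the DAMS ontology.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Component:
    """A node inside a complex object; it may hold files and child components."""
    title: str
    files: List[str] = field(default_factory=list)
    children: List["Component"] = field(default_factory=list)


@dataclass
class DamsObject:
    """Hypothetical top-level object that belongs to one collection."""
    title: str
    collection: str
    components: List[Component] = field(default_factory=list)


def walk(components: List[Component], depth: int = 1) -> Iterator[Tuple[int, Component]]:
    """Yield (depth, component) pairs, parents before their children."""
    for comp in components:
        yield depth, comp
        yield from walk(comp.children, depth + 1)


# Invented example: one object, one slice, two images derived from the slice.
brain = DamsObject(
    title="Brain specimen",
    collection="Brain Observatory pilot",
    components=[
        Component(
            title="Slice 1",
            files=["slice-1.tif"],
            children=[
                Component(title="Stained image", files=["slice-1-stain.tif"]),
                Component(title="Thumbnail", files=["slice-1-thumb.jpg"]),
            ],
        )
    ],
)

outline = [(depth, comp.title) for depth, comp in walk(brain.components)]
for depth, title in outline:
    print("  " * depth + title)
```

Keeping components nested under their parent lets display and citation work at whichever level the researcher treats as "the object".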

Page 18:

DAMS Data Model: Vocabularies

• Allow management of local & community vocabularies

– Vocabulary terms as entities

– Ability to encode authority data (vocabulary source, value URI, etc.) as well as sameAs relationships between the same term expressed in multiple sources

– Ability to update authority records as community vocabularies become more formalized
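
As a rough illustration of terms as entities with sameAs links, the sketch below stores authority data on each term and follows sameAs links to collect equivalent URIs. The names `Term`, `link` and `equivalents`, the example label, and the `example.org` URIs are all hypothetical; the actual model is expressed in RDF, which this plain-Python version only approximates.

```python
from dataclasses import dataclass
from typing import Dict, Set


@dataclass(frozen=True)
class Term:
    """A vocabulary term treated as an entity carrying its own authority data."""
    label: str
    source: str      # vocabulary the term comes from
    value_uri: str   # URI identifying the term within that vocabulary


def link(same_as: Dict[str, Set[str]], a: str, b: str) -> None:
    """Record a sameAs relationship in both directions."""
    same_as.setdefault(a, set()).add(b)
    same_as.setdefault(b, set()).add(a)


def equivalents(start: str, same_as: Dict[str, Set[str]]) -> Set[str]:
    """Return every URI reachable from `start` through sameAs links."""
    seen = {start}
    stack = [start]
    while stack:
        uri = stack.pop()
        for other in same_as.get(uri, set()):
            if other not in seen:
                seen.add(other)
                stack.append(other)
    return seen


# Placeholder terms: the same concept in a local list and a community vocabulary.
local = Term("Foraminifera", "local", "http://example.org/local/term/1")
community = Term("Foraminifera", "community", "http://example.org/community/term/42")

same_as: Dict[str, Set[str]] = {}
link(same_as, local.value_uri, community.value_uri)

matches = equivalents(local.value_uri, same_as)
```

Because the link is stored separately from the term, a local term can later be pointed at a community vocabulary without rewriting the objects that use it.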

Page 19:

DAMS Data Model: Management

• One that supports improved management of objects and metadata

– Authority management of vocabulary terms

– Event metadata!

Page 20:

DAMS Architecture

Page 21:

Preservation: Chronopolis

Current DAMS Process

1. Create BagIt bags for all objects

2. Host via HTTP(S)

3. Bags are retrieved and ingested into Chronopolis

DAMS4 Process

1. Create BagIt bags for Δ (changed) objects using Event metadata

2. Host via HTTP(S) or enqueue on messaging queue for ingestion
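
The DAMS4 idea of bagging only changed objects might look roughly like the sketch below. It is not the DAMS implementation: the event shape `(object id, timestamp)`, the functions `changed_since` and `write_bag`, the payload file name, the choice of SHA-256, and the `0.97` version string are all assumptions made for illustration.

```python
import hashlib
import tempfile
from datetime import datetime
from pathlib import Path
from typing import Dict, Iterable, List, Set, Tuple

Event = Tuple[str, datetime]  # hypothetical shape: (object id, time of event)


def changed_since(events: Iterable[Event], last_run: datetime) -> Set[str]:
    """Return ids of objects with at least one event after the last run."""
    return {obj_id for obj_id, when in events if when > last_run}


def write_bag(bag_dir: Path, payload: Dict[str, bytes]) -> None:
    """Write a minimal bag: bagit.txt, a data/ directory and a payload manifest."""
    data_dir = bag_dir / "data"
    data_dir.mkdir(parents=True)
    manifest_lines: List[str] = []
    for name, content in sorted(payload.items()):
        (data_dir / name).write_bytes(content)
        digest = hashlib.sha256(content).hexdigest()
        manifest_lines.append(f"{digest}  data/{name}")
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )
    (bag_dir / "manifest-sha256.txt").write_text(
        "\n".join(manifest_lines) + "\n", encoding="utf-8"
    )


# Invented event log: only obj-b and obj-c changed after the last bagging run.
last_run = datetime(2013, 10, 1)
events = [
    ("obj-a", datetime(2013, 9, 20)),
    ("obj-b", datetime(2013, 10, 5)),
    ("obj-b", datetime(2013, 10, 7)),
    ("obj-c", datetime(2013, 10, 9)),
]
delta = changed_since(events, last_run)

bag_root = Path(tempfile.mkdtemp())
for obj_id in sorted(delta):
    # Placeholder payload; a real bag would carry the object's files and metadata.
    write_bag(bag_root / obj_id, {"metadata.xml": b"<rdf/>"})

bag_names = sorted(p.name for p in bag_root.iterdir())
```

The saving comes from `changed_since`: unchanged objects such as `obj-a` are never re-bagged or re-transferred.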

Page 22:

Storage

Page 23:

Storage: EMC Isilon 72NL

Storage For Library Collections

1 cluster of 5 Nodes

1 Node = 36 x 2TB Drives

Total Current Usable Storage of 320TB

OneFS 7.0.2.1

Page 24:

Storage: OpenStack

Storage For Research Data Collections

Testing:

• Performance versus Local Storage

• Large Files (up to 1TB)

– Segmenting files > 5GB

– Lexical order bug fix: 1,10,2 -> 0001,0002,…0010

• Rackspace CloudFiles API vs. OpenStack REST API

Testing Notes: https://libraries.ucsd.edu/blogs/dams/openstack-testing-notes/
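
The lexical order fix can be shown in a few lines. When segment names are sorted as strings, unpadded numbers come back as 1, 10, 2; zero-padding makes string order match numeric order. The function `segment_names`, the object name `bigfile` and the four-digit width are illustrative assumptions, not the DAMS code.

```python
from typing import List


def segment_names(object_name: str, count: int, width: int = 4) -> List[str]:
    """Build zero-padded segment names so string order equals numeric order."""
    return [f"{object_name}/{i:0{width}d}" for i in range(1, count + 1)]


# Unpadded names sort lexically, which puts segment 10 before segment 2.
unpadded = [f"bigfile/{i}" for i in range(1, 11)]
print(sorted(unpadded)[:3])   # ['bigfile/1', 'bigfile/10', 'bigfile/2']

# Padded names are already in the order a lexical sort produces.
padded = segment_names("bigfile", 10)
print(sorted(padded) == padded)   # True
```

A fixed width caps the number of segments (9,999 with four digits), so the width should be chosen with the largest expected file in mind.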

Page 25:

DAMS Repository

Page 26:

DAMS Repository

Core Repository Application: Create, Read, Update, Delete (CRUD)

Uses: Jena, ActiveMQ, JHOVE, Apache Tika, FFmpeg, ImageMagick

Manages:

• Metadata Triplestore

• Storage

• Solr

Page 27:

DAMS Repository: Metadata Triplestore

Page 28:

DAMS Repository: Metadata Triplestore

Triplestore was: AllegroGraph

Triplestore is: PostgreSQL DB + Jena

• Schema: (ID), Parent, Subject, Predicate, Object

Jena Usage:

• Core/RDF API – Parsing, loading, updating, serializing RDF

• ARQ API – SPARQL queries
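
A relational table with the listed columns could be sketched as follows. This is a guess at the shape, not the DAMS schema: SQLite stands in for PostgreSQL so the example runs without a server, the column types are assumed, and `parent` is interpreted here as the object that owns each triple (an assumption the slide does not confirm). The predicates and values are placeholders.

```python
import sqlite3

# In-memory SQLite stands in for PostgreSQL in this sketch.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE triples (
        id        INTEGER PRIMARY KEY,
        parent    TEXT NOT NULL,   -- assumed: the object that owns the triple
        subject   TEXT NOT NULL,
        predicate TEXT NOT NULL,
        object    TEXT NOT NULL
    )
    """
)

# Placeholder triples: one object with a single component.
rows = [
    ("obj1", "obj1", "dc:title", "Sediment core photographs"),
    ("obj1", "obj1", "ex:hasComponent", "obj1/1"),
    ("obj1", "obj1/1", "dc:title", "Core section 1"),
]
conn.executemany(
    "INSERT INTO triples (parent, subject, predicate, object) VALUES (?, ?, ?, ?)",
    rows,
)


def triples_for(parent: str) -> list:
    """Fetch every triple owned by one object, components included."""
    cur = conn.execute(
        "SELECT subject, predicate, object FROM triples "
        "WHERE parent = ? ORDER BY id",
        (parent,),
    )
    return cur.fetchall()


titles = [obj for _subj, pred, obj in triples_for("obj1") if pred == "dc:title"]
```

Under this reading, the `parent` column lets an object and all of its components be loaded with a single indexed lookup before the rows are handed to an RDF library for parsing or serialization.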

Page 29:

DAMS Repository: REST API

Page 30:

Hydra Framework

Source: https://wiki.duraspace.org/display/hydra/Technical+Framework+and+its+Parts

Page 31:

DAMS Repository: Fedora API-ish

Page 32:

Fedora API – Next PID

Page 33:

Fedora API – Next PID

Page 34:

DAMS Manager

Page 35:

DAMS Manager

Java application using Spring MVC framework

• Collection Management

– Metadata Ingest and Export

– File Ingest

– Derivative Generation

– Solr indexing by Collection

• Administrative Reporting and Statistics

Page 36:

DAMS Hydra Head

Page 37:

DAMS Hydra Head

Page 38:

DAMS Hydra Head: Blacklight

Page 39:

RDF in Hydra

Page 40:

RDF in Hydra: (Read) Nested Attributes

Page 41:

RDF in Hydra: (Create) Nested Attributes

Page 42:

DAMS Hydra Head: Complex Objects

Page 43:

Next Steps

Beta Release: Late October

Production Release: January

Future:

• Sufia/Curate Integration for administrative functionality

• Additional Linked Data Integration and Crosswalks

– Schema.org, OpenURL, Dublin Core, ResourceSync

• Fedora4

Page 44:

More Information

DAMS Overview: https://github.com/ucsdlib/dams/wiki/DAMS-Manual

DAMS Hydra Head: https://github.com/ucsdlib/damspas

DAMS Ontology: https://github.com/ucsdlib/dams/tree/master/ontology

DAMS REST API: https://github.com/ucsdlib/dams/wiki/REST-API

Hot Topics Series 3: Get a Head on the Repository with Hydra: http://duraspace.org/hot-topics

Hydra Technical Overview: https://wiki.duraspace.org/display/hydra/Technical+Framework+and+its+Parts

OneFS Technical Overview: http://www.emc.com/collateral/hardware/white-papers/h10719-isilon-onefs-technical-overview-wp.pdf

Isilon Overview: http://www.emc.com/collateral/software/data-sheet/h10541-ds-isilon-platform.pdf

Page 45:

Coming Up Next

Final Webinar (October 31)

The researcher perspective from two of our pilot participants

Dick Norris – Professor, Scripps Institution of Oceanography

Rick Wagner – Data Scientist at San Diego Supercomputer Center

Page 46:

Questions?

Thanks!

Declan Fleming @declan | [email protected]

Arwen Hutt @arwenh | [email protected]

Matt Critchlow @mattcritchlow | [email protected]