10-31-13 “Researcher Perspectives of Data Curation” Presentation Slides

Hot Topics Web Seminar Series: Research Data in Repositories The UC San Diego Experience Third Webinar: The Researcher Perspective

Upload: duraspace

Post on 22-Apr-2015


DESCRIPTION

“Hot Topics: The DuraSpace Community Webinar Series, Series Six: Research Data in Repositories.” Curated by David Minor, Research Data Curation Program, UC San Diego Library. Webinar 3: “Researcher Perspectives of Data Curation.” Presented by David Minor, Research Data Curation Program, UC San Diego Library; Dick Norris, Professor, Scripps Institution of Oceanography; and Rick Wagner, Data Scientist, San Diego Supercomputer Center.

TRANSCRIPT

Page 1: 10-31-13 “Researcher Perspectives of Data Curation” Presentation Slides

Hot Topics Web Seminar Series: Research Data in Repositories

The UC San Diego Experience Third Webinar: The Researcher Perspective

Page 2:

Reminder: General Series Info

• First webinar: Intro and Framing: UC San Diego decisions and planning

• Second Webinar: Deep dive into technology and metadata

• Third Webinar: The perspective from researchers, next steps

Page 3:

Reminder: General Series Info

Slides and presentations from previous webinars are available for download! http://www.duraspace.org/hot-topics

Page 4:

Your esteemed presenters …

First webinar:
• David Minor - Program Director, Research Data Curation
• Declan Fleming - Chief Technology Strategist

Second webinar:
• Declan Fleming - Chief Technology Strategist
• Arwen Hutt - Metadata Librarian
• Matt Critchlow - Manager of Development and Web Services

Third webinar:
• David Minor - Program Director, Research Data Curation
• Dick Norris - Professor, Scripps Institution of Oceanography
• Rick Wagner - Data Scientist at San Diego Supercomputer Center

Page 5:

Today we will …

Discuss how researchers have approached curation and data management

Page 6:

Reminder: UCSD Research Data Curation Pilots

• The Brain Observatory
• NSF OpenTopography Facility
• Levantine Archaeology Laboratory
• Scripps Institution of Oceanography Geological Collections
• The Laboratory for Computational Astrophysics

Page 8:

Richard Norris

Professor at Scripps Institution of Oceanography

Page 9:

Rick Wagner

High Performance Computing Manager at the San Diego Supercomputer Center

Ph.D. Candidate within The Laboratory for Computational Astrophysics

Page 10:

SIO Geological Collections

Curator: Dick Norris
Collections Manager: Alexandra Hangsterfer

Part of the International Marine and Lacustrine Geological Collections

With collections at Columbia, Oregon State, Woods Hole, USGS and more

Page 11:

Our Collection: Sediment cores and rocks recovered from the oceans & long-lived lakes

Reef sediment-Panama

Salton Sea-CA

Page 12:

How we get them…

Mostly by Sea (Ship, Cruise, Leg)

But also by Land (Country, Locality, Lat/Long)

Page 13:

Collection events

Deploying a Dredge to collect seafloor rocks

Recovering a Gravity Core to collect seafloor sediments

Page 14:

A collection event is an Object and includes:
• Specimen(s) (Latitude/Longitude)
• Ship name and cruise number
• Text descriptions
• Thin-sections
• Images, field notes, publications
• Location in the repository
• International Geological Sample Number
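The list above can be read as a record definition. The sketch below is only an illustration of that idea, not the schema used by the collection or the library: every field name and every example value is hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class CollectionEvent:
    """One sampling event treated as a single object (field names are hypothetical)."""
    ship: str
    cruise: str
    latitude: float                 # decimal degrees, north positive
    longitude: float                # decimal degrees, east positive
    igsn: Optional[str] = None      # sample number (IGSN), if one has been registered
    description: str = ""
    repository_location: str = ""
    specimens: List[str] = field(default_factory=list)
    thin_sections: List[str] = field(default_factory=list)
    attachments: List[str] = field(default_factory=list)  # images, field notes, publications

    def __post_init__(self) -> None:
        # Reject coordinates that cannot be placed on a map.
        if not -90.0 <= self.latitude <= 90.0:
            raise ValueError(f"latitude out of range: {self.latitude}")
        if not -180.0 <= self.longitude <= 180.0:
            raise ValueError(f"longitude out of range: {self.longitude}")


# Made-up values, for illustration only.
event = CollectionEvent(
    ship="Example Ship",
    cruise="EX-01",
    latitude=9.35,
    longitude=-79.9,
    description="Reef sediment gravity core",
    specimens=["core section 1", "core section 2"],
)
```

Keeping the coordinates mandatory and everything else optional mirrors the slide's emphasis on geo-referencing; a real schema might draw that line differently.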

Page 15:

The Sediment Core Collection

Archive and Working halves of ~7000 cores from the world’s oceans

Typically 3-5 sections/core + core photos, chemical data and sampling history

The IODP Core collection, Bremen, Germany

Page 16:

The Marine Rock collection…
• ~4000 dredge sites worldwide
• In an 8000 sq ft building
• Volcanic rocks, manganese nodules, reef rock

Page 17:

Our data resides with NGDC…
• NOAA’s National Geophysical Data Center
• And IGSNs with Lamont’s SESAR

Page 18:

NGDC searches on ships, repositories, sampling systems, and locations

But no keyword search, automated data input, ways to link associated data, returns on nearest search terms, sampling history, etc….

Page 19:

What the Community Wants

• A unified National geo-referenced system
• Exploratory search by nearest word and map-based system
• Links to associated data types (images, text, data, references…)
• All data types linked by IGSNs
• Data entry through web forms with publication by curators

Page 20:

What we did with RCI
• Identified one type of object
  – Based on sampling events
  – Ship-Cruise-Sampling device-Sample number
  – Geo-referenced
  – Includes associated materials: text description, images, chemical data, references, records of sampling event, sampling records, storage location
• NGDC records imported into UC Library system
• Records searchable by any word in a record
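As a rough illustration of "searchable by any word in a record", the sketch below matches a query against every field of a record. It is not how the UC Library system implements search; the record layout, identifiers, and function names are all made up for the example.

```python
import re
from typing import Dict, Iterable, List, Set

Record = Dict[str, str]


def tokens(text: str) -> Set[str]:
    """Split text into lowercase alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def search(records: Iterable[Record], query: str) -> List[Record]:
    """Return records in which every query word occurs in at least one field."""
    wanted = tokens(query)
    hits = []
    for record in records:
        words: Set[str] = set()
        for value in record.values():
            words |= tokens(value)
        if wanted <= words:  # all query words are present somewhere in the record
            hits.append(record)
    return hits


# Hypothetical records using a Ship-Cruise-Device-Sample style identifier.
records = [
    {"id": "SHIP1-CR01-DREDGE-003", "device": "dredge", "description": "manganese nodules"},
    {"id": "SHIP2-CR07-GRAVITY-012", "device": "gravity core", "description": "reef sediment"},
]
```

With these records, `search(records, "reef sediment")` returns only the second entry, and a query whose words are split across different records returns nothing. An empty query matches every record, which a real interface would probably want to reject.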

Page 21:

What’s next?
• NSF-sponsored SESAR (System for Earth Sample Registration)
  – Created the International GeoSample Number
  – http://www.geosamples.org/
• NSF-sponsored workshop:
  – Digital Environment for Sample Curation (June 2013)
  – http://www.geosamples.org/news/descwebinarmaterials
• NSF “EarthCube” initiative

Page 22:

CyberInfrastructure needs (from DESC)

• Offline data entry at sea or in the field
• DESC should respect data moratoriums (typically 2 years, if collected with NSF grants)
• Automated release to public at close of moratorium
• Secure login-based data serving for project scientists
• Flexible search and access for users to view public archive (view by location name, type, bounding region) and associated data
• Flexible sample request submission
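Two of the needs above, automated release at the close of a moratorium and viewing the public archive by bounding region, come down to simple date and coordinate checks. The sketch below assumes a fixed two-year moratorium counted from the collection date; the `Sample` type and all function names are hypothetical, and nothing here describes an existing DESC system.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Tuple


@dataclass
class Sample:
    name: str
    latitude: float    # decimal degrees, north positive
    longitude: float   # decimal degrees, east positive
    collected: date


def release_date(collected: date, years: int = 2) -> date:
    """Date on which the moratorium closes (Feb 29 falls back to Feb 28)."""
    try:
        return collected.replace(year=collected.year + years)
    except ValueError:
        # Collected on Feb 29 and the target year is not a leap year.
        return collected.replace(year=collected.year + years, day=28)


def is_public(sample: Sample, today: date, years: int = 2) -> bool:
    """True once the moratorium has closed."""
    return today >= release_date(sample.collected, years)


def in_region(sample: Sample, south: float, north: float, west: float, east: float) -> bool:
    """True if the sample lies inside the box (boxes crossing the antimeridian are not handled)."""
    return south <= sample.latitude <= north and west <= sample.longitude <= east


def public_in_region(samples: List[Sample], today: date,
                     box: Tuple[float, float, float, float]) -> List[Sample]:
    """Samples that are both out of moratorium and inside (south, north, west, east)."""
    south, north, west, east = box
    return [s for s in samples
            if is_public(s, today) and in_region(s, south, north, west, east)]
```

Computing visibility at query time, rather than flipping a stored flag, is one way to make the release "automated": no scheduled job has to run when a moratorium ends.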

Page 23:

More cyberInfrastructure needs

• Display stored datasets and images hosted on other servers (as in other repositories)

• Connections with Standard Visualization Tools Such as Corelyzer, Correlator, PSICAT, CoreRef, GMT, GeoMapApp

• Sampling database should be easily accessible by researchers to submit requests

• Automatically updated by repository (personnel) to reflect samples sent to the researchers

• Way of entering historical sampling information

Page 24:

These are general issues for Natural History Collections

• Most museums have similar issues to us
  – Geo-Referenced collections
  – Mix of physical specimens, images, text descriptions, sampling data, and affiliated data files
  – Many have home-grown databases that are not interoperable with other museums

Fish from the SIO Marine Vertebrates Collection

Page 25:

Natural History Collections
• Need controlled vocabularies but flexibility to search on variants
  – Since nobody agrees on common vocabularies
• Value in cross-referencing to related collections
  – Such as samples (geology, biology, water) collected on a cruise with ship track, sea floor maps…
  – Presently working on “Rolling deck to Repository” NSF project
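One common way to combine a controlled vocabulary with searching on variants is a lookup table that maps each known variant to a preferred term. The sketch below assumes that approach; the vocabulary entries are invented for illustration and are not taken from any collection's actual term list.

```python
from typing import Dict, List, Optional

# Hypothetical controlled vocabulary: preferred term -> known variants.
VOCABULARY: Dict[str, List[str]] = {
    "manganese nodule": ["manganese nodules", "mn nodule", "ferromanganese nodule"],
    "gravity core": ["gravity cores", "gravity corer"],
}


def clean(term: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(term.lower().split())


def build_lookup(vocabulary: Dict[str, List[str]]) -> Dict[str, str]:
    """Map every preferred term and every variant to its preferred term."""
    lookup: Dict[str, str] = {}
    for preferred, variants in vocabulary.items():
        lookup[clean(preferred)] = preferred
        for variant in variants:
            lookup[clean(variant)] = preferred
    return lookup


def preferred_term(term: str, lookup: Dict[str, str]) -> Optional[str]:
    """Return the preferred term for a search word, or None if it is unknown."""
    return lookup.get(clean(term))


LOOKUP = build_lookup(VOCABULARY)
```

Records are indexed under the preferred term while users keep typing whichever variant they know, so a search for "Mn nodule" and one for "manganese nodules" resolve to the same entry.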

Page 26:

LCA

PI: Mike Norman

Current and Past Students: Many

Page 27:

Image credit: NASA, IoA, A. Fabian et al.

Research group focusing on numerical modeling of complex astrophysical processes: cosmology, galaxy formation, turbulence, radiation hydrodynamics, magneto-hydrodynamics, …

Page 28:

Our simulations are large, based on the current definition of “large” (we grow with the technology). Typical results are 1-100 TB.

Page 29:

This work is costly in terms of both the computer time and human effort, and we see a benefit to the science community in sharing. (Citations are nice, too.)

http://bit.ly/sB30f1 http://bit.ly/IzTVV2 http://bit.ly/IE4iFd http://bit.ly/HFYLQJ

Page 30:

Prior Sharing Efforts

Participation in the Virtual Observatory
• Standards for simulation metadata, search, and retrieval
• An odd fit beside the “pure” astronomy projects and data centers
• But, it meant we weren’t starting from scratch in terms of describing our data

Started the curation effort very curious about how much of this previous work would translate to library space

Also wanted stable platform for data hosting (e.g., not a closet server)

Page 31:

Curation Process

Several steps:
• Choosing the pilot dataset
• Cleaning up simulation cruft
• Identifying related publications
• Adding historical documents (proposals, reports, etc.)
• Organize various data groups
• Simulations are a collection of datasets from various points in time, needed a description for each type of digital object in each dataset
• Bundle, checksum, and handoff

Decided near the end to replicate the metadata record to a second site as test of its portability

Image credit: By E.gordienko (Own work) [CC-BY-SA-3.0 (http://creativecommons.org/licenses/by-sa/3.0) or GFDL (http://www.gnu.org/copyleft/fdl.html)], via Wikimedia Commons
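The last step, "bundle, checksum, and handoff", could look roughly like the sketch below. The slides do not say which archive format or checksum was used; gzip-compressed tar and SHA-256 are assumptions here, and the function names and the manifest layout are hypothetical.

```python
import hashlib
import tarfile
from pathlib import Path


def bundle(source_dir: Path, archive_path: Path) -> Path:
    """Pack a dataset directory into a gzip-compressed tar archive.

    The archive should be written outside source_dir so it does not include itself.
    """
    with tarfile.open(str(archive_path), "w:gz") as tar:
        tar.add(str(source_dir), arcname=source_dir.name)
    return archive_path


def sha256sum(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so multi-terabyte bundles never sit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(archive_path: Path) -> Path:
    """Write '<hash>  <file name>' next to the archive, to be handed off with it."""
    manifest = archive_path.with_name(archive_path.name + ".sha256")
    manifest.write_text(f"{sha256sum(archive_path)}  {archive_path.name}\n")
    return manifest


def verify(archive_path: Path, manifest: Path) -> bool:
    """Recompute the hash on the receiving side and compare it with the manifest."""
    expected = manifest.read_text().split()[0]
    return sha256sum(archive_path) == expected
```

The sender would call `bundle` and `write_manifest`, transfer both files, and the receiver would call `verify` before ingesting, so that a corrupted transfer is caught at handoff rather than years later.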

Page 32:

http://bit.ly/17yTc1n

Page 33:

Final result:
• Datasets from a high-resolution cosmology simulation held at UCSD
• Viewable both at UCSD, and via the Online Archive of California
• Raw simulation data and various analysis results accessible over HTTP

Page 34:

Some thoughts:
• When it comes to metadata formats, libraries are like any other science domain and speak their own language
• If you have a highly-specialized domain-specific metadata dialect or language, you may need an additional discovery service
• If not, it’s a good starting point
• We’re working on repeating this process on our own for another simulation

Page 35:

Next steps at UC San Diego

Move from pilot services to a scalable series of processes. Work with additional researchers in same domains. Work with new domains. Broaden lifecycle management mindset on campus.

Page 36:

Questions?

Rick Wagner - [email protected]
Richard Norris - [email protected]
David Minor - [email protected]

http://www.duraspace.org/hot-topics