10-31-13 “researcher perspectives of data curation” presentation slides
DESCRIPTION
“Hot Topics: The DuraSpace Community Webinar Series,” Series Six: “Research Data in Repositories,” curated by David Minor, Research Data Curation Program, UC San Diego Library. Webinar 3: “Researcher Perspectives of Data Curation,” presented by David Minor, Research Data Curation Program, UC San Diego Library; Dick Norris, Professor, Scripps Institution of Oceanography; and Rick Wagner, Data Scientist, San Diego Supercomputer Center.
TRANSCRIPT
Hot Topics Web Seminar Series: Research Data in Repositories
The UC San Diego Experience
Third Webinar: The Researcher Perspective
Reminder: General Series Info
• First webinar: Intro and Framing: UC San Diego decisions and planning
• Second Webinar: Deep dive into technology and metadata
• Third Webinar: The perspective from researchers, next steps
Reminder: General Series Info
Slides and presentations from previous webinars are available for download! http://www.duraspace.org/hot-topics
Your esteemed presenters …
First webinar:
• David Minor – Program Director, Research Data Curation
• Declan Fleming – Chief Technology Strategist
Second webinar:
• Declan Fleming – Chief Technology Strategist
• Arwen Hutt – Metadata Librarian
• Matt Critchlow – Manager of Development and Web Services
Third webinar:
• David Minor – Program Director, Research Data Curation
• Dick Norris – Professor, Scripps Institution of Oceanography
• Rick Wagner – Data Scientist at San Diego Supercomputer Center
Today we will …
Discuss how researchers have approached curation and data management
Reminder: UCSD Research Data Curation Pilots
• The Brain Observatory
• NSF OpenTopography Facility
• Levantine Archaeology Laboratory
• Scripps Institution of Oceanography Geological Collections
• The Laboratory for Computational Astrophysics
Richard Norris
Professor at Scripps Institution of Oceanography
Rick Wagner
High Performance Computing Manager at the San Diego Supercomputer Center
Ph.D. Candidate within The Laboratory for Computational Astrophysics
SIO Geological Collections
Curator: Dick Norris
Collections Manager: Alexandra Hangsterfer
Part of the International Marine and Lacustrine Geological Collections, with collections at Columbia, Oregon State, Woods Hole, USGS and more
Our Collection: Sediment cores and rocks recovered from the oceans and long-lived lakes
Reef sediment-Panama
Salton Sea-CA
How we get them…
Mostly by Sea (Ship, Cruise, Leg)
But also by Land (Country, Locality, Lat/Long)
Collection events
Deploying a dredge to collect seafloor rocks
Recovering a gravity core to collect seafloor sediments
A collection event is an Object and includes:
• Specimen(s) (Latitude/Longitude)
• Ship name and cruise number
• Text descriptions
• Thin-sections
• Images, field notes, publications
• Location in the repository
• International Geological Sample Number
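The "collection event as an object" idea above can be sketched as a small record type. This is only an illustration, not the schema used by the collection or by the UC San Diego Library: every field name and the sample values are hypothetical, chosen to mirror the items on the slide.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class CollectionEvent:
    """One sampling event treated as a single object (hypothetical field names)."""
    igsn: str                      # sample number assigned by a registry such as SESAR
    ship: str
    cruise: str
    latitude: float                # decimal degrees, north positive
    longitude: float               # decimal degrees, east positive
    description: str = ""
    specimens: List[str] = field(default_factory=list)
    thin_sections: List[str] = field(default_factory=list)
    attachments: List[str] = field(default_factory=list)  # images, field notes, publications
    repository_location: Optional[str] = None

    def __post_init__(self):
        # Reject coordinates that cannot be a real position.
        if not -90.0 <= self.latitude <= 90.0:
            raise ValueError(f"latitude out of range: {self.latitude}")
        if not -180.0 <= self.longitude <= 180.0:
            raise ValueError(f"longitude out of range: {self.longitude}")


# Placeholder values only; not a real record from the collection.
example = CollectionEvent(
    igsn="EXAMPLE0001",
    ship="Example Ship",
    cruise="EX01",
    latitude=9.0,
    longitude=-79.5,
    description="Reef sediment core (illustrative)",
)
```

Keeping everything attached to the event, rather than to individual specimens, matches the slide: one object carries the specimens, their position, and all associated materials.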
The Sediment Core Collection
Archive and Working halves of ~7000 cores from the world’s oceans
Typically 3-5 sections/core + core photos, chemical data and sampling history
The IODP Core collection, Bremen, Germany
The Marine Rock collection…
• ~4000 dredge sites worldwide
• In an 8000 sq ft building
• Volcanic rocks, manganese nodules, reef rock
Our data resides with NGDC…
• NOAA’s National Geophysical Data Center
• And IGSNs with Lamont’s SESAR
NGDC searches on ships, repositories, sampling systems, and locations
But no keyword search, automated data input, ways to link associated data, returns on nearest search terms, sampling history, etc….
What the Community Wants
• A unified National geo-referenced system
• Exploratory search by nearest word and map-based system
• Links to associated data types (images, text, data, references…)
• All data types linked by IGSNs
• Data entry through web forms with publication by curators
What we did with RCI
• Identified one type of object
– Based on sampling events
– Ship-Cruise-Sampling device-Sample number
– Geo-referenced
– Includes associated materials: text description, images, chemical data, references, records of sampling event, sampling records, storage location
• NGDC records imported into UC Library system
• Records searchable by any word in a record
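"Searchable by any word in a record" can be pictured as an inverted index built over every field of every record. The sketch below is an assumption about how such a search might work in general, not a description of the UC Library system; the function names and the sample records are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(records):
    """Map every word found in any field to the ids of the records containing it."""
    index = defaultdict(set)
    for rec_id, rec in records.items():
        for value in rec.values():
            for token in tokenize(str(value)):
                index[token].add(rec_id)
    return index


def search(index, query):
    """Return ids of records that contain every word of the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


# Illustrative records only.
records = {
    "rec1": {"device": "Gravity core", "note": "seafloor sediment"},
    "rec2": {"device": "Dredge", "note": "volcanic rock and manganese nodules"},
}
index = build_index(records)
hits = search(index, "gravity sediment")
```

Because the index ignores which field a word came from, a query can match a device name, a description, or a cruise label without the user knowing the record layout.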
What’s next?
• NSF-sponsored SESAR (System for Earth Sample Registration)
– Created the International GeoSample Number
– http://www.geosamples.org/
• NSF-sponsored workshop:
– Digital Environment for Sample Curation (June 2013)
– http://www.geosamples.org/news/descwebinarmaterials
• NSF “EarthCube” initiative
CyberInfrastructure needs (from DESC)
• Offline data entry at sea or in the field
• DESC should respect data moratoriums (typically 2 years, if collected with NSF grants)
• Automated release to public at close of moratorium
• Secure login-based data serving for project scientists
• Flexible search and access for users to view public archive (view by location name, type, bounding region) and associated data
• Flexible sample request submission
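The moratorium and automated-release items above come down to a date calculation. A minimal sketch, assuming the moratorium is counted in whole years from the collection date (the slide says only "typically 2 years"); the function names are hypothetical.

```python
from datetime import date


def release_date(collected, moratorium_years=2):
    """Date on which a record may become public, counted from the collection date."""
    try:
        return collected.replace(year=collected.year + moratorium_years)
    except ValueError:
        # Collected on Feb 29 and the release year is not a leap year: use Feb 28.
        return collected.replace(year=collected.year + moratorium_years, day=28)


def is_public(collected, today, moratorium_years=2):
    """True once the moratorium has closed."""
    return today >= release_date(collected, moratorium_years)


# Illustrative dates only.
collected_on = date(2013, 6, 1)
opens_on = release_date(collected_on)
```

An automated release job could run `is_public` over all embargoed records each day and publish those that return true, while login-based access continues to serve project scientists in the meantime.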
More cyberInfrastructure needs
• Display stored datasets and images hosted on other servers (as in other repositories)
• Connections with standard visualization tools such as Corelyzer, Correlator, PSICAT, CoreRef, GMT, GeoMapApp
• Sampling database should be easily accessible by researchers to submit requests
• Automatically updated by repository (personnel) to reflect samples sent to the researchers
• Way of entering historical sampling information
These are general issues for Natural History Collections
• Most museums have similar issues to us
– Geo-Referenced collections
– Mix of physical specimens, images, text descriptions, sampling data, and affiliated data files
– Many have home-grown databases that are not interoperable with other museums
Fish from the SIO Marine Vertebrates Collection
Natural History Collections
• Need controlled vocabularies but flexibility to search on variants
– Since nobody agrees on common vocabularies
• Value in cross-referencing to related collections
– Such as samples (geology, biology, water) collected on a cruise with ship track, sea floor maps…
– Presently working on “Rolling deck to Repository” NSF project
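One way to reconcile a controlled vocabulary with flexible searching on variants is to map every known variant to a preferred term before matching. The sketch below illustrates that idea only; the vocabulary entries and function names are hypothetical, not terms from any collection's actual vocabulary.

```python
# Hypothetical controlled vocabulary: preferred term -> known variants.
VOCAB = {
    "manganese nodule": ["mn nodule", "manganese nodules"],
    "gravity core": ["gravity corer", "gravity cores"],
}


def normalize(term):
    """Lowercase, treat hyphens as spaces, and collapse repeated whitespace."""
    return " ".join(term.lower().replace("-", " ").split())


def build_lookup(vocab):
    """Map each preferred term and each of its variants to the preferred term."""
    lookup = {}
    for preferred, variants in vocab.items():
        lookup[normalize(preferred)] = preferred
        for variant in variants:
            lookup[normalize(variant)] = preferred
    return lookup


def to_preferred(term, lookup):
    """Return the preferred term for a user-supplied term, or None if unknown."""
    return lookup.get(normalize(term))


lookup = build_lookup(VOCAB)
```

For example, `to_preferred("Mn-Nodule", lookup)` resolves to the preferred term `"manganese nodule"`, so records are stored under one term while users can search with whichever variant they know.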
LCA
PI: Mike Norman
Current and Past Students: Many
Image credit: NASA, IoA, A. Fabian et al.
Research group focusing on numerical modeling of complex astrophysical processes: cosmology, galaxy formation, turbulence, radiation hydrodynamics, magneto-hydrodynamics, …
Our simulations are large, based on the current definition of “large” (we grow with the technology). Typical results are 1-100 TB.
This work is costly in terms of both the computer time and human effort, and we see a benefit to the science community in sharing. (Citations are nice, too.)
http://bit.ly/sB30f1 http://bit.ly/IzTVV2 http://bit.ly/IE4iFd http://bit.ly/HFYLQJ
Prior Sharing Efforts
Participation in the Virtual Observatory
• Standards for simulation metadata, search, and retrieval
• An odd fit beside the “pure” astronomy projects and data centers
• But, it meant we weren’t starting from scratch in terms of describing our data
Started the curation effort very curious about how much of this previous work would translate to library space
Also wanted a stable platform for data hosting (e.g., not a closet server)
Curation Process
Several steps:
• Choosing the pilot dataset
• Cleaning up simulation cruft
• Identifying related publications
• Adding historical documents (proposals, reports, etc.)
• Organize various data groups
• Simulations are a collection of datasets from various points in time, needed a description for each type of digital object in each dataset
• Bundle, checksum, and handoff
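The "bundle, checksum, and handoff" step can be illustrated with a checksum manifest that the receiving side re-verifies after transfer. This is a generic sketch using SHA-256 from the Python standard library; the slides do not say which checksum algorithm or tooling was actually used, and the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so that very large files never sit fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 hex digest."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_manifest(root, manifest):
    """Return the relative paths that are missing or whose checksum has changed."""
    root = Path(root)
    failures = []
    for rel, expected in manifest.items():
        target = root / rel
        if not target.is_file() or sha256_of(target) != expected:
            failures.append(rel)
    return failures
```

The sender would run `build_manifest` on the bundle before handoff and ship the manifest alongside the data; the receiver runs `verify_manifest` on arrival, and an empty list means every file arrived intact. Chunked hashing matters here because typical simulation results are described as 1-100 TB.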
By E.gordienko (Own work) [CC-BY-SA-3.0 (http://creativecommons.org/licenses/by-sa/3.0) or GFDL (http://www.gnu.org/copyleft/fdl.html)], via Wikimedia Commons
Decided near the end to replicate the metadata record to a second site as a test of its portability
http://bit.ly/17yTc1n
Final result:
• Datasets from a high-resolution cosmology simulation held at UCSD
• Viewable both at UCSD, and via the Online Archive of California
• Raw simulation data and various analysis results accessible over HTTP
Some thoughts:
• When it comes to metadata formats, libraries are like any other science domain and speak their own language
• If you have a highly-specialized domain-specific metadata dialect or language, you may need an additional discovery service
• If not, it’s a good starting point
• We’re working on repeating this process on our own for another simulation
Next steps at UC San Diego
• Move from pilot services to a scalable series of processes.
• Work with additional researchers in same domains.
• Work with new domains.
• Broaden lifecycle management mindset on campus.
Questions?
Rick Wagner
Richard Norris
David Minor
http://www.duraspace.org/hot-topics