dataone_cobb_hubbub2012_20120924_v05
DESCRIPTION
An interoperable data repositories case study: DataONE, presented at the HUBbub 2012 conference
TRANSCRIPT
![Page 1: DataONE_cobb_hubbub2012_20120924_v05](https://reader034.vdocuments.net/reader034/viewer/2022051816/54628327b4af9f491c8b46fa/html5/thumbnails/1.jpg)
DataONE: An interoperable data repositories case study
John W. Cobb
R&D Staff and DataONE Leadership Team Member
Oak Ridge National Laboratory
HUBbub 2012, the HUBzero conference
Indianapolis, IN
24 September 2012
![Page 2: DataONE_cobb_hubbub2012_20120924_v05](https://reader034.vdocuments.net/reader034/viewer/2022051816/54628327b4af9f491c8b46fa/html5/thumbnails/2.jpg)
2
• Authorship: This talk represents the work of the entire DataONE extended team.
• It especially draws upon slide material from
• Bill Michener, UNM (esp. the recent DataONE AHM, Sept. 18, 2012)
• Amber Budden – DataONE Ass. Dir. for CE
• DataONE is an NSF-supported project (OCI-0830944)
Acknowledgment:
![Page 3: DataONE_cobb_hubbub2012_20120924_v05](https://reader034.vdocuments.net/reader034/viewer/2022051816/54628327b4af9f491c8b46fa/html5/thumbnails/3.jpg)
3
• A personal view (apologies for a possibly misinformed speaker)
• HUB roots (history and pre-history)
• PUNCH: web portal for running tools (DOI: 10.1109/40.846308)
• -> NanoHUB: Application orchestration environment
• + RAPPTURE: Rapid application porting and development
• + Frameless VNC windows – seamless hosted environment on clients!
• + Rich collaborative environment and rich user experience !! (“wishlist”)
• Repurpose: HUBzero -> hubs explode (e.g. NEESHub, a critical advantage for the largest research award in Purdue history)
• Now (and in the recent past), a turn to Hub + Data integration. Some successes already
• Opportunity: Richer interactions between HUBs and multiple data repositories
• Perhaps, for example: Enable multi-project collaboration within PURR?
• Or: Integrate NEES DBs with SCEC simulations and IRIS waveforms?
Hubs and data repositories
![Page 4: DataONE_cobb_hubbub2012_20120924_v05](https://reader034.vdocuments.net/reader034/viewer/2022051816/54628327b4af9f491c8b46fa/html5/thumbnails/4.jpg)
4
• HUB + database exists
• HUB + external data repository access use case.
• But ... what if?
• Access multiple (possibly external) repositories from within a HUB environment?
• Access multiple external repositories with similar data? Say, aggregate all data from state hydrologists? Cf. driNET http://drinet.hubzero.org
• Integrate disparate data sets for new and novel analysis.
Recall Noshir Contractor’s comments this morning: teaming and interdisciplinary work has increased impact (Wuchty, Jones, Uzzi)
• Enable reproducible analysis and synthesis via an automated workflow to create synthetic data products
• Programmatic access
• More integration (more than just raw search terms à la Google)
• …
• What do you want to discover today? (to paraphrase Microsoft)
Multiple data repository access?
![Page 5: DataONE_cobb_hubbub2012_20120924_v05](https://reader034.vdocuments.net/reader034/viewer/2022051816/54628327b4af9f491c8b46fa/html5/thumbnails/5.jpg)
5
• DataONE is a project to address these issues
• Build (assemble/aggregate) data repository interoperability
• Advance the state of the practice in data lifecycle management
• Planning
• Deposition
• Metadata generation
• Semantic integration
• Workflow and provenance
• Analysis
• Synthesis
• Focus on a broad science area
• Deploy a working CI and grow it
• DataONE – Data Observation Network for Earth
DataONE motivation
Data lifecycle (figure): Plan, Collect, Assure, Describe, Preserve, Discover, Integrate, Analyze
![Page 6: DataONE_cobb_hubbub2012_20120924_v05](https://reader034.vdocuments.net/reader034/viewer/2022051816/54628327b4af9f491c8b46fa/html5/thumbnails/6.jpg)
6
![Page 7: DataONE_cobb_hubbub2012_20120924_v05](https://reader034.vdocuments.net/reader034/viewer/2022051816/54628327b4af9f491c8b46fa/html5/thumbnails/7.jpg)
7
Pressing issues for the digital data lifecycle
Data lifecycle (figure): Plan, Collect, Assure, Describe, Preserve, Discover, Integrate, Analyze
![Page 8: DataONE_cobb_hubbub2012_20120924_v05](https://reader034.vdocuments.net/reader034/viewer/2022051816/54628327b4af9f491c8b46fa/html5/thumbnails/8.jpg)
8
Figure (adapted from CENR-OSTP). Axis labels: Decreasing Spatial Coverage; Increasing Process Knowledge. Tiers: Remote sensing; Intensive science sites and experiments; Extensive science sites; Volunteer & education networks
Multiple data sources – mutually reinforcing
![Page 9: DataONE_cobb_hubbub2012_20120924_v05](https://reader034.vdocuments.net/reader034/viewer/2022051816/54628327b4af9f491c8b46fa/html5/thumbnails/9.jpg)
9
Scattered data sources “finding the needle in the haystack”
Data are massively dispersed
• Ecological field stations and research centers (100s)
• Natural history museums and biocollection facilities (100s)
• Agency data collections (100s to 1000s)
• Individual scientists (1000s to 10,000s to 100,000s)
![Page 10: DataONE_cobb_hubbub2012_20120924_v05](https://reader034.vdocuments.net/reader034/viewer/2022051816/54628327b4af9f491c8b46fa/html5/thumbnails/10.jpg)
10
Data Preservation and Planning
✔ ?
![Page 11: DataONE_cobb_hubbub2012_20120924_v05](https://reader034.vdocuments.net/reader034/viewer/2022051816/54628327b4af9f491c8b46fa/html5/thumbnails/11.jpg)
11
Preservation: Poor data practice “data entropy”
Figure (Michener et al. 1997): information content (vertical axis) versus time (horizontal axis). Labels: time of publication; specific details; general details; accident; retirement or career change; death.
![Page 12: DataONE_cobb_hubbub2012_20120924_v05](https://reader034.vdocuments.net/reader034/viewer/2022051816/54628327b4af9f491c8b46fa/html5/thumbnails/12.jpg)
12
Preservation: Data longevity

| Study | Resource Type | Resource Half-life |
| --- | --- | --- |
| Rumsey (2002) | Legal Citations | 1.4 years |
| Harter and Kim (1996) | Scholarly Article Citations | 1.5 years |
| Koehler (1999 and 2002) | Random Web Pages | 2.0 years |
| Spinellis (2003) | Computer Science Citations | 4.0 years |
| Markwell and Brooks (2002) | Biological Science Education Resources | 4.6 years |
| Nelson and Allen (2002) | Digital Library Object | 24.5 years |

Koehler, W. (2004) Information Research 9(2): 174.
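To make the half-lives listed above more concrete, here is a minimal sketch that assumes simple exponential decay. That decay model is an assumption of this example, not something stated on the slide, and the cited studies may define half-life differently. The function name `fraction_surviving` is hypothetical.

```python
def fraction_surviving(years, half_life_years):
    """Fraction of resources still reachable after `years`.

    Assumes exponential decay with the given half-life (illustrative only).
    """
    if half_life_years <= 0:
        raise ValueError("half_life_years must be positive")
    return 0.5 ** (years / half_life_years)


# Half-lives in years, as listed on the slide.
HALF_LIVES = {
    "Legal citations": 1.4,
    "Scholarly article citations": 1.5,
    "Random web pages": 2.0,
    "Computer science citations": 4.0,
    "Biological science education resources": 4.6,
    "Digital library objects": 24.5,
}

if __name__ == "__main__":
    # Compare how much of each resource type would remain after a decade.
    for resource, half_life in HALF_LIVES.items():
        share = fraction_surviving(10, half_life)
        print(f"{resource}: {share:.1%} remain after 10 years")
```

Under this assumption, a resource type with a 2.0-year half-life is down to one quarter of its original population after four years, which is the gap that managed repositories are meant to close.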
![Page 13: DataONE_cobb_hubbub2012_20120924_v05](https://reader034.vdocuments.net/reader034/viewer/2022051816/54628327b4af9f491c8b46fa/html5/thumbnails/13.jpg)
13
The Long Tail of Orphan Data
Figure (B. Heidorn): volume (vertical axis) versus rank frequency of datatype (horizontal axis). Labels: specialized repositories (e.g. GenBank, PDB); orphan data.
“Most of the bytes are at the high end, but most of the datasets are at the low end” – Jim Gray
The Ultra-violet divergence
The Infrared Catastrophe
![Page 14: DataONE_cobb_hubbub2012_20120924_v05](https://reader034.vdocuments.net/reader034/viewer/2022051816/54628327b4af9f491c8b46fa/html5/thumbnails/14.jpg)
14
Data deluge and interoperability “the flood of increasingly heterogeneous data”
Data are heterogeneous
• Syntax (format)
• Schema (model)
• Semantics (meaning)
Jones et al. 2007
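The three kinds of heterogeneity can be illustrated with a toy example. The sketch below uses two made-up records (all field names, site codes, and values are hypothetical) that differ in syntax (CSV versus JSON), schema (different field names), and semantics (Fahrenheit versus Celsius), and maps both onto one common form.

```python
import csv
import io
import json

# Two hypothetical records describing the same kind of observation.
CSV_TEXT = "site,temp_f\nA1,68.0\n"                   # CSV, field temp_f, degrees Fahrenheit
JSON_TEXT = '{"station": "B2", "temperature": 20.0}'  # JSON, field temperature, degrees Celsius


def from_csv(text):
    """Parse the CSV source and convert it to the common form (site, temp_c)."""
    rows = csv.DictReader(io.StringIO(text))
    return [
        {"site": row["site"], "temp_c": (float(row["temp_f"]) - 32.0) * 5.0 / 9.0}
        for row in rows
    ]


def from_json(text):
    """Parse the JSON source and rename its fields to the common form."""
    record = json.loads(text)
    return [{"site": record["station"], "temp_c": float(record["temperature"])}]


def harmonize():
    """Combine both sources into one list of records with a shared schema."""
    return from_csv(CSV_TEXT) + from_json(JSON_TEXT)
```

Each adapter handles one source's syntax, schema, and unit convention, so adding a third source means writing one more adapter rather than changing the analysis code.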
![Page 15: DataONE_cobb_hubbub2012_20120924_v05](https://reader034.vdocuments.net/reader034/viewer/2022051816/54628327b4af9f491c8b46fa/html5/thumbnails/15.jpg)
15
Metadata universe (multi-verse)
Source: Jenn Riley, Indiana U. Digital Librarian http://www.dlib.indiana.edu/~jenlrile/metadatamap/ via John Kunze, Cal. Dig. Lib.
• There are a multitude of metadata standards
• Discipline and sub-discipline specific
• Each with different terms and context
![Page 16: DataONE_cobb_hubbub2012_20120924_v05](https://reader034.vdocuments.net/reader034/viewer/2022051816/54628327b4af9f491c8b46fa/html5/thumbnails/16.jpg)
16
Each dot is its own standard !
“…billions and billions of worlds …” – Carl Sagan
![Page 17: DataONE_cobb_hubbub2012_20120924_v05](https://reader034.vdocuments.net/reader034/viewer/2022051816/54628327b4af9f491c8b46fa/html5/thumbnails/17.jpg)
17
• Hard-core cyberinfrastructure (CI)
• CI Member Node (MN) data repositories
• Coordinating Node (CN) global metadata repos
• Simple but powerful REST API/SPI for universal access
• Investigator Toolkit (ITK) software tools to allow access to the data repository collective via familiar access idioms
• Cultural and wetware issues
• Educational materials
• Best practices
• Workshops and tutorials
• Surveys and assessments
• Scientist, policymaker, citizen engagement
• Collaboration, governance, and sustainability
DataONE CI architectural elements
Member Nodes: Service Interfaces; Bridge to non-DataONE Member Node services; Data Repository
Coordinating Nodes: Object Store; Index; Coordination Layer (Identifiers, Preservation, Catalog, Monitor); Service Interfaces (Resolution, Discovery, Replication, Registration)
Investigator Toolkit: Client Libraries (Java, Python, Command Line); Web Interface; Data Management; Analysis, Visualization
http://mule1.dataone.org/ArchitectureDocs-current/
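As a rough illustration of what REST-style access to a Coordinating Node could look like from a client, the sketch below only builds request URLs and does not contact any server. The base URL, the `v2` version segment, and the `query/solr` and `object` paths are assumptions for this example and should be checked against the architecture documentation; the helper names `search_url` and `object_url` are hypothetical.

```python
from urllib.parse import quote, urlencode

# Assumed base URL for a Coordinating Node; verify against the DataONE docs.
CN_BASE = "https://cn.dataone.org/cn/v2"


def search_url(text, rows=10):
    """Build a URL for a text search of the Coordinating Node metadata index."""
    params = {"q": text, "rows": rows, "wt": "json"}
    return f"{CN_BASE}/query/solr/?{urlencode(params)}"


def object_url(pid):
    """Build a URL for retrieving the object with identifier `pid`."""
    # Identifiers may contain '/' or ':' (as DOIs do), so encode every
    # reserved character instead of leaving '/' as a path separator.
    return f"{CN_BASE}/object/{quote(pid, safe='')}"
```

Keeping URL construction separate from the network call makes the access pattern easy to test offline; a real client would more likely use the Investigator Toolkit client libraries than hand-built URLs.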
![Page 18: DataONE_cobb_hubbub2012_20120924_v05](https://reader034.vdocuments.net/reader034/viewer/2022051816/54628327b4af9f491c8b46fa/html5/thumbnails/18.jpg)
18
A User’s View
![Page 19: DataONE_cobb_hubbub2012_20120924_v05](https://reader034.vdocuments.net/reader034/viewer/2022051816/54628327b4af9f491c8b46fa/html5/thumbnails/19.jpg)
19
Key Cyberinfrastructure Elements
• Unique identifiers
• Search and deliver
• Replication
• Federated identity
Usable by People and their Agents
![Page 20: DataONE_cobb_hubbub2012_20120924_v05](https://reader034.vdocuments.net/reader034/viewer/2022051816/54628327b4af9f491c8b46fa/html5/thumbnails/20.jpg)
20
Supporting the data lifecycle
UCSB Node
UNM Node
ORC Node
The data lifecycle:
1. Deposition/acquisition/ingest
2. Curation and metadata management
3. Protection, including privacy
4. Discovery, access, use, and dissemination
5. Interoperability, standards, and integration
6. Evaluation, analysis, and visualization
![Page 21: DataONE_cobb_hubbub2012_20120924_v05](https://reader034.vdocuments.net/reader034/viewer/2022051816/54628327b4af9f491c8b46fa/html5/thumbnails/21.jpg)
21
Three major components for a flexible, scalable, sustainable network
Member Nodes
• diverse institutions
• serve local community
• provide resources for managing their data
• retain copies of data
Coordinating Nodes
• retain complete metadata catalog
• indexing for search
• network-wide services
• ensure content availability (preservation)
• replication services
Investigator Toolkit
DataONE Supports Data Preservation
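The division of labor between Member Nodes and Coordinating Nodes can be sketched with a toy in-memory model. This is an illustration under simplifying assumptions, not DataONE's actual replication protocol: the class names, the `register` and `replicate` methods, and the node names in the usage note are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class MemberNode:
    """A repository that stores copies of data objects."""
    name: str
    objects: dict = field(default_factory=dict)  # identifier -> content


@dataclass
class CoordinatingNode:
    """Keeps a catalog of which member nodes hold which objects."""
    members: list
    catalog: dict = field(default_factory=dict)  # identifier -> set of node names

    def register(self, pid, data, origin):
        """Store a new object on its originating member node and catalog it."""
        origin.objects[pid] = data
        self.catalog[pid] = {origin.name}

    def replicate(self, pid, copies=2):
        """Copy `pid` to further member nodes until `copies` replicas exist."""
        holders = self.catalog[pid]
        source = next(m for m in self.members if m.name in holders)
        for node in self.members:
            if len(holders) >= copies:
                break
            if node.name not in holders:
                node.objects[pid] = source.objects[pid]
                holders.add(node.name)
        return sorted(holders)
```

For example, with three member nodes named `mn-a`, `mn-b`, and `mn-c`, registering an object on `mn-a` and calling `replicate(pid, copies=2)` leaves copies on `mn-a` and `mn-b`, while the catalog records both locations so the object stays discoverable if one node goes offline.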
![Page 22: DataONE_cobb_hubbub2012_20120924_v05](https://reader034.vdocuments.net/reader034/viewer/2022051816/54628327b4af9f491c8b46fa/html5/thumbnails/22.jpg)
22
• Enables integration of multiple geographically diverse and metadata-diverse repositories
• Presents collective search results across multiple repositories
• Provides a unified API/SPI for search and programmatic interface http://mule1.dataone.org/ArchitectureDocs-current/
• DataONE content has unique identifiers (DOIs) for referenceable/citable data objects
• Supports both large datasets and the long tail
DataONE satisfies arch requirements
![Page 23: DataONE_cobb_hubbub2012_20120924_v05](https://reader034.vdocuments.net/reader034/viewer/2022051816/54628327b4af9f491c8b46fa/html5/thumbnails/23.jpg)
23
• Enables new analysis and synthesis efforts by integrating tasks across repositories
• Provides means for data replication and a basis for repositories to build “data wills” or “data trust” plans
• Provides a platform to develop advanced interoperable workflow tools and semantic integration tools
DataONE spurs innovation
![Page 24: DataONE_cobb_hubbub2012_20120924_v05](https://reader034.vdocuments.net/reader034/viewer/2022051816/54628327b4af9f491c8b46fa/html5/thumbnails/24.jpg)
24
DataONE: current state/recent progress
![Page 25: DataONE_cobb_hubbub2012_20120924_v05](https://reader034.vdocuments.net/reader034/viewer/2022051816/54628327b4af9f491c8b46fa/html5/thumbnails/25.jpg)
25
DataONE: Supporting Scientific Data Preservation, Discovery, and Innovation
Current Member Nodes: Coming Soon: Current Tools:
Tools Coming Soon: Queensland University of Technology
![Page 26: DataONE_cobb_hubbub2012_20120924_v05](https://reader034.vdocuments.net/reader034/viewer/2022051816/54628327b4af9f491c8b46fa/html5/thumbnails/26.jpg)
26
Data Management Planning Tool
https://dmp.cdlib.org/
![Page 27: DataONE_cobb_hubbub2012_20120924_v05](https://reader034.vdocuments.net/reader034/viewer/2022051816/54628327b4af9f491c8b46fa/html5/thumbnails/27.jpg)
27
![Page 28: DataONE_cobb_hubbub2012_20120924_v05](https://reader034.vdocuments.net/reader034/viewer/2022051816/54628327b4af9f491c8b46fa/html5/thumbnails/28.jpg)
28
![Page 29: DataONE_cobb_hubbub2012_20120924_v05](https://reader034.vdocuments.net/reader034/viewer/2022051816/54628327b4af9f491c8b46fa/html5/thumbnails/29.jpg)
29
Plans per template (as of June 2012)
http://dmptool.org
Bar chart: approximate number of plans per template (vertical axis, 0 to 400). Bar values in descending order: 339, 287, 197, 159, 133, 133, 124, 101, 71, 65, 60, 46, 37, 36, 34, 17, 15, 6.
Templates of greatest interest to the DataONE community in red; 2,302 unique users to date
![Page 30: DataONE_cobb_hubbub2012_20120924_v05](https://reader034.vdocuments.net/reader034/viewer/2022051816/54628327b4af9f491c8b46fa/html5/thumbnails/30.jpg)
30
✔ Check for best practices
✔ Create metadata
✔ Connect to ONEShare
Data & Metadata (EML)
![Page 31: DataONE_cobb_hubbub2012_20120924_v05](https://reader034.vdocuments.net/reader034/viewer/2022051816/54628327b4af9f491c8b46fa/html5/thumbnails/31.jpg)
31
2. Data Discovery
![Page 32: DataONE_cobb_hubbub2012_20120924_v05](https://reader034.vdocuments.net/reader034/viewer/2022051816/54628327b4af9f491c8b46fa/html5/thumbnails/32.jpg)
32
The DataONE Federation
![Page 33: DataONE_cobb_hubbub2012_20120924_v05](https://reader034.vdocuments.net/reader034/viewer/2022051816/54628327b4af9f491c8b46fa/html5/thumbnails/33.jpg)
33
NASA collectors DAAC Users (UWG)
DataONE Users
ORNL DAAC as a DataONE Member Node
Investigator Toolkit
![Page 34: DataONE_cobb_hubbub2012_20120924_v05](https://reader034.vdocuments.net/reader034/viewer/2022051816/54628327b4af9f491c8b46fa/html5/thumbnails/34.jpg)
34
https://cn.dataone.org/onemercury/
![Page 35: DataONE_cobb_hubbub2012_20120924_v05](https://reader034.vdocuments.net/reader034/viewer/2022051816/54628327b4af9f491c8b46fa/html5/thumbnails/35.jpg)
35
![Page 36: DataONE_cobb_hubbub2012_20120924_v05](https://reader034.vdocuments.net/reader034/viewer/2022051816/54628327b4af9f491c8b46fa/html5/thumbnails/36.jpg)
36
![Page 37: DataONE_cobb_hubbub2012_20120924_v05](https://reader034.vdocuments.net/reader034/viewer/2022051816/54628327b4af9f491c8b46fa/html5/thumbnails/37.jpg)
37
Kepler
DMP-Tool
Investigator Toolkit Support
Data lifecycle (figure): Plan, Collect, Assure, Describe, Preserve, Discover, Integrate, Analyze
![Page 38: DataONE_cobb_hubbub2012_20120924_v05](https://reader034.vdocuments.net/reader034/viewer/2022051816/54628327b4af9f491c8b46fa/html5/thumbnails/38.jpg)
38
Spatio-Temporal Exploratory Model identifies factors affecting patterns of migration
Diverse bird observations and environmental data from 300,00 locations in the US integrated and analyzed using High Performance Computing resources
Land Cover
Meteorology
MODIS – Remote sensing data
• Examine patterns of migration
• Infer how climate change may affect bird migration
Model results: Occurrence of Indigo Bunting (2008); panels for Jan, Apr, Jun, Sep, Dec
Exploration, Visualization, and Analysis
![Page 39: DataONE_cobb_hubbub2012_20120924_v05](https://reader034.vdocuments.net/reader034/viewer/2022051816/54628327b4af9f491c8b46fa/html5/thumbnails/39.jpg)
39
Public Participation in Scientific Research Conference: 4-5 August 2012 in Portland, Oregon, USA, prior to the Ecological Society of America meeting (6-10 Aug.): http://www.birds.cornell.edu/citscitoolkit/conference/2012
![Page 40: DataONE_cobb_hubbub2012_20120924_v05](https://reader034.vdocuments.net/reader034/viewer/2022051816/54628327b4af9f491c8b46fa/html5/thumbnails/40.jpg)
40
User Assessments
Timeline: Year 1, Year 2, Year 3, Year 4, Year 5
Scientists: BL, FU
Librarians: BL, FU
Policy Makers: BL, FU
Educators: BL, FU
Library Policies: BL, FU
![Page 41: DataONE_cobb_hubbub2012_20120924_v05](https://reader034.vdocuments.net/reader034/viewer/2022051816/54628327b4af9f491c8b46fa/html5/thumbnails/41.jpg)
41
What standard do you currently use?
Bar chart by metadata language. Categories as extracted: DIF, DwC, DC, EML, FGDC, Open GIS, ISO, My Lab, none. Values as extracted: 12, 21, 26, 95, 95, 96, 97, 266, 676.
![Page 42: DataONE_cobb_hubbub2012_20120924_v05](https://reader034.vdocuments.net/reader034/viewer/2022051816/54628327b4af9f491c8b46fa/html5/thumbnails/42.jpg)
42
Many are interested in sharing data
Percent agree (axis 0% to 100%); values and statements paired in the order extracted:
• 41%: Willing to place all of my data into a central data repository with no restrictions
• 76%: Appropriate to create new datasets from shared data
• 78%: Willing to place at least some of my data into a central data repository with no restrictions
• 81%: Willing to share data across a broad group of researchers
![Page 43: DataONE_cobb_hubbub2012_20120924_v05](https://reader034.vdocuments.net/reader034/viewer/2022051816/54628327b4af9f491c8b46fa/html5/thumbnails/43.jpg)
43
User Matrix
Columns: Data Service; Investigator ToolKit; Data Management Planning; Best Practices; Tools Database; Training Curricula
Rows: Scientist; Data Librarians; Ecological Modeler; Resource Manager
![Page 44: DataONE_cobb_hubbub2012_20120924_v05](https://reader034.vdocuments.net/reader034/viewer/2022051816/54628327b4af9f491c8b46fa/html5/thumbnails/44.jpg)
44
Community Engagement
![Page 45: DataONE_cobb_hubbub2012_20120924_v05](https://reader034.vdocuments.net/reader034/viewer/2022051816/54628327b4af9f491c8b46fa/html5/thumbnails/45.jpg)
45
Best Practices and Software Tools
![Page 46: DataONE_cobb_hubbub2012_20120924_v05](https://reader034.vdocuments.net/reader034/viewer/2022051816/54628327b4af9f491c8b46fa/html5/thumbnails/46.jpg)
46
![Page 47: DataONE_cobb_hubbub2012_20120924_v05](https://reader034.vdocuments.net/reader034/viewer/2022051816/54628327b4af9f491c8b46fa/html5/thumbnails/47.jpg)
47
• Member node growth
• Number of member nodes
• Increase the number and size of data sets
• Sustainably
• In terms of resource needs from MNs
• In terms of resource demands on DataONE
• New Investigator Toolkit tools (strategically)
• An increasing number of science use cases with more breakthrough science
• Also, re-purposing DataONE CI outside of Bio/Eco/Env areas in strategic collaborative partnerships
DataONE: Next steps
![Page 48: DataONE_cobb_hubbub2012_20120924_v05](https://reader034.vdocuments.net/reader034/viewer/2022051816/54628327b4af9f491c8b46fa/html5/thumbnails/48.jpg)
48
Ack: DataONE Team and Sponsors
• Bertram Ludaescher
• Deborah McGuinness
• Jeff Horsburgh
• Robert Sandusky
• Peter Honeyman
• Carole Goble
• Cliff Duke
• Donald Hobern
• Ewa Deelman
• Amber Budden, Roger Dahl, Rebecca Koskela, Bill Michener, Robert Nahf, Skye Roseboom, Mark Servilla
• Patricia Cruse, John Kunze
• Dave Vieglais
• Paul Allen, Rick Bonney, Steve Kelling
• Stephanie Hampton, Chris Jones, Matt Jones, Ben Leinfelder, Andrew Pippin
• Suzie Allard, Nick Dexter, Kimberly Douglass, Carol Tenopir, Robert Waltz, Bruce Wilson
• John Cobb, Bob Cook, Ranjeet Devarakonda, Giri Palanisamy, Line Pouchard
• Sky Bristol, Mike Frame, Richard Huffine, Viv Hutchison, Jeff Morisette, Jake Weltzin, Lisa Zolly
• David DeRoure
• Ryan Scherle, Todd Vision
LEON LEVY FOUNDATION
• Randy Butler
![Page 49: DataONE_cobb_hubbub2012_20120924_v05](https://reader034.vdocuments.net/reader034/viewer/2022051816/54628327b4af9f491c8b46fa/html5/thumbnails/49.jpg)
49
Questions?
Contact Points
John W. Cobb, Ph.D.
Oak Ridge National Lab
[email protected]
865.576.5439
http://www.dataone.org/
http://docs.dataone.org