SEAD DataNet and Sustainability Science
Robert H. McDonald
Deputy Director/Associate Dean
Data to Insight Center/IU Libraries
SC12 | Salt Lake City, UT
November 12, 2012
http://[email protected]
1. NSF DataNet Overview
2. SEAD Overview
3. SEAD Active/Social Curation
4. SEAD Virtual Archive Repository

SEAD DataNet and Sustainability Science
http://slidesha.re/TAk3ht
http://www.sead-data.net
@SEADdatanet
2 SEAD DataNet Home

SEAD TEAMS
Michigan: Margaret Hedstrom (PI), Marietta Van Buhler, Karen Woollams, George Alter (ICPSR), Bryan Beecher (ICPSR)
Indiana: Beth Plale (Co-PI), Katy Börner, Robert H. McDonald, Robert Light, Kavitha Chandrasekar, Stacy Kowalczyk, Inna Kouper, Robert Ping, Ryan Cobine
Rensselaer: James Myers (Co-PI), Ram Prasanna Govind Krishnan, Lindsay Todd
Illinois: Praveen Kumar (Co-PI), Terry McLaren (NCSA), Rob Kooper (NCSA), Luigi Marini (NCSA)

NSF DataNet Program
Motivation: “… one of the major challenges of this scientific generation: how to develop the new methods, management structures and technologies to manage the diversity, size, and complexity of current and future data sets and data streams.”
Response: DataNet creates “a set of exemplar national and global data research infrastructure organizations” to address this challenge.

Current NSF DataNet Projects
• SEAD: http://sead-data.net
• DataONE: http://www.dataone.org
• DataNet Federation Consortium: http://datafed.org
• Terra Populus: https://www.pop.umn.edu/terra_pop

SEAD’s Approach
SEAD Partners - http://sead-data.net
• Contribute infrastructure to the DataNet vision that supports data access, sharing, reuse, and preservation for the long tail
• Develop a data access and preservation environment that supports the research, technical, and economic requirements for data management in the long tail
• Enable Active and Social Curation
• Utilize emerging preservation and access infrastructures

Long Tail Data Challenges
[Figure: scale labeled "Bytes per day" with levels ExaBytes, PetaBytes, TeraBytes, GigaBytes; annotation at the low end: "Many smaller datasets…"]

CI for the Long Tail
What is the “long tail” of scientific research and why does it matter?
• Diverse set of researchers, questions, data, and methodologies, etc.
• Diverse set of requirements for instrumentation, data collection, models, analysis, etc.
• Little standardization, no common denominator
• Most researchers and most research dollars go to researchers in the long tail
• The long tail is underserved by current CI

Long Tail Example: Sustainability Research
Many dimensions, many coordinate systems, many scales, many data collection and analysis tools, many formats, a long tail of providers and users, …

SEAD 18-Month Pilot Phase
Domain Engagement:
• National Center for Earth-Surface Dynamics (NCED), Illinois River Basin Observatory
• Requirements, Use Cases, Prioritization of Data Types and Services
Active and Social Curation:
• Pilot Active Content Repository, VIVO deployments
• Exemplar services for Data Ingest, Discovery, Re-use, Curation (Tupelo/Medici)
CI for Long-term Access (Virtual Archive):
• Data model, protocol design/development
• Pilot Federated Repository infrastructure
Education, Outreach, and Training:
• Post-doc mentoring
• Web site, training materials, meetings, workshops, …
Project Oversight:
• Management, reporting, committees
• Business model development

NCED Collection Access
NCED collections in SEAD-ACR
• (20 Top-level Collections, 454K files, 2.25M objects, 1.6 TB data)
• NCED Repository Interface
• Support for hierarchy
• Support for collection annotation
• View/add NCED/domain specific Terms
• New Large Server with Virtual Machine ACR instances
• Ingest tools and procedures
• csv2rdf4LOD
• Archiving, Citation, DOI assignment, …
NCED users can (with an account) go from web page to previews and downloads (w/o cart), can add annotations, can browse, search by text (any fields and content), tags, etc.

SEAD notions of defined Data Phases
Phases of the data lifecycle acknowledge and accommodate the difference between public data and data a researcher is still working on.
Research Data Phase: the data set is a research data collection, owned by an individual and under their control.
• Data need not be licensed at this time because it is not ready for broader release
• Data need not have permanent IDs because it is still work in progress
• Corresponds to first existence in Active Curation Repository
Published Phase: the owner of the research data collection determines that the dataset is ready for publication.
• License terms set
• Persistent ID
• Made available as part of public profile in VIVO
• Activated by user-controlled publish event
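The two phases can be read as a small state machine with a single owner-triggered transition. The following is a minimal sketch under that reading; the names `Phase`, `DataCollection`, and `publish`, and the sample license and identifier strings, are hypothetical and are not taken from SEAD's software.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Phase(Enum):
    RESEARCH = "research"    # owner-controlled, still work in progress
    PUBLISHED = "published"  # licensed, persistently identified, public


@dataclass
class DataCollection:
    owner: str
    phase: Phase = Phase.RESEARCH
    license_terms: Optional[str] = None  # not required in the research phase
    persistent_id: Optional[str] = None  # not required in the research phase

    def publish(self, actor: str, license_terms: str, persistent_id: str) -> None:
        """User-controlled publish event: only the owner may trigger it."""
        if actor != self.owner:
            raise PermissionError("only the owner can publish this collection")
        if self.phase is Phase.PUBLISHED:
            raise ValueError("collection is already published")
        if not license_terms or not persistent_id:
            raise ValueError("publication needs license terms and a persistent ID")
        self.license_terms = license_terms
        self.persistent_id = persistent_id
        self.phase = Phase.PUBLISHED


# Hypothetical values, for illustration only.
collection = DataCollection(owner="researcher-1")
collection.publish("researcher-1", license_terms="CC-BY-4.0",
                   persistent_id="doi:10.0000/example")
print(collection.phase)
```

Keeping the license and identifier optional until `publish` is called mirrors the slide's point that neither is needed while the data is still in the research phase.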

SEAD Active/Social Curation Repository

ACR Bulk Ingest Process
[Diagram labels: Data; Metadata; TWC: csv2rdf4lod; .ttl output file; ACR Ingest (global ID, filepath, file; incremental ingest, restart, verify); SEAD ACR Instance; Extractors/Preview On/Off; DROID Analysis]
Configuration:
• Headers to Standard Vocabularies
• Content Mapping to identifiers
• Additional Inference possible
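The two configuration ideas on this slide, mapping headers to standard vocabularies and mapping content to identifiers, can be illustrated with a short sketch that turns CSV metadata into Turtle. It does not reproduce csv2rdf4lod's configuration format or output; the mapping table, the `http://example.org/dataset/` namespace, and the function names are assumptions made for illustration.

```python
import csv
import io

# Assumed mapping from CSV headers to standard vocabulary terms (Dublin Core).
HEADER_TO_PREDICATE = {
    "title": "dcterms:title",
    "creator": "dcterms:creator",
    "format": "dcterms:format",
}

BASE = "http://example.org/dataset/"  # hypothetical identifier namespace


def escape_literal(value: str) -> str:
    """Escape backslashes, double quotes, and newlines for a Turtle string."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def csv_to_turtle(csv_text: str, id_column: str = "id") -> str:
    """Emit one Turtle subject per CSV row, one triple per mapped header."""
    lines = ["@prefix dcterms: <http://purl.org/dc/terms/> .", ""]
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = f"<{BASE}{row[id_column]}>"  # content mapped to an identifier
        pairs = [
            f'{HEADER_TO_PREDICATE[header]} "{escape_literal(value)}"'
            for header, value in row.items()
            if header in HEADER_TO_PREDICATE and value
        ]
        if pairs:
            lines.append(f"{subject} " + " ;\n    ".join(pairs) + " .")
    return "\n".join(lines)


# Invented sample row, for illustration only.
sample = "id,title,creator\n42,Basin survey,A. Researcher\n"
print(csv_to_turtle(sample))
```

Headers that have no entry in the mapping table are skipped rather than guessed, so unmapped columns never produce triples with made-up predicates.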

SEAD/NCED Data Social Network

NCED Data Social Network in SEAD-VIVO
Mary Power, NCED PI and Professor, University of California
William Dietrich, NCED PI and Professor, University of California
Collin Bode, NCED Data Technician
NCED Social Network Connections Based on Data Authorship
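Connections based on data authorship can be derived by linking every pair of people who appear as authors of the same data set. The sketch below shows one way to do that; it is not how SEAD-VIVO computes its network, and the function name and the sample data are invented for illustration.

```python
from collections import Counter
from itertools import combinations


def coauthor_edges(dataset_authors):
    """Count how many data sets each pair of people authored together."""
    edges = Counter()
    for authors in dataset_authors.values():
        # sorted(set(...)) gives each undirected pair one canonical form
        for a, b in combinations(sorted(set(authors)), 2):
            edges[(a, b)] += 1
    return edges


# Invented sample data, for illustration only.
datasets = {
    "dataset-1": ["person-a", "person-b"],
    "dataset-2": ["person-a", "person-b", "person-c"],
    "dataset-3": ["person-c"],
}

for (a, b), weight in sorted(coauthor_edges(datasets).items()):
    print(f"{a} -- {b} (shared data sets: {weight})")
```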

Angelo Basic GIS Coverage Data Set

SEAD Data Set Publishing Workflow
NCED Data Set Ingested to ACR
• Data content used within ACR
• Researcher Profile Established in VIVO
NCED Data Set Ingested to VA
• Data Set ready to publish
NCED Data Set Deposited with IR
• DataCite minted DOI attached to finalized Data Set
NCED Data Set Published to VIVO
• DOI Resolution to designated IR

Published NCED Data Set in IR (IU ScholarWorks)

SEAD Virtual Archive

Virtual Archive Features
Usability consistent with research user expectations
• Additional metadata fields for scientific datasets
• Ability to preview data when ingesting it
Repository tracking: tracking member Institutional Repositories (IRs) and their stored content
• Not just a link to the repository, but an extensive cataloging tool (metadata and other additional information)
• Allows users to search for data in a particular IR or across all IRs
Low cost replication: cloud based storage for reliability
• Proof of concept uses Amazon S3 to maintain a copy of files and collections. Amazon Glacier is low-cost, secure and durable, and optimized for cold storage. Other solutions exist.
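The repository-tracking feature amounts to a catalog that records which IR holds each item and can be searched either within one IR or across all of them. A minimal in-memory sketch of that idea follows; `CatalogEntry`, `search`, and the repository names are hypothetical and do not describe the Virtual Archive's actual data model.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    repository: str  # which member IR holds the data
    title: str
    metadata: dict = field(default_factory=dict)


def search(catalog, text, repository=None):
    """Return entries whose title or metadata values contain `text`.

    With `repository=None` the search runs over all tracked IRs.
    """
    needle = text.lower()
    hits = []
    for entry in catalog:
        if repository is not None and entry.repository != repository:
            continue
        haystack = [entry.title, *map(str, entry.metadata.values())]
        if any(needle in value.lower() for value in haystack):
            hits.append(entry)
    return hits


# Invented sample catalog, for illustration only.
catalog = [
    CatalogEntry("ir-one", "River basin survey", {"keyword": "sediment"}),
    CatalogEntry("ir-two", "Delta imagery", {"keyword": "sediment"}),
]
print([e.title for e in search(catalog, "sediment")])
print([e.title for e in search(catalog, "sediment", repository="ir-two")])
```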

Component Interactions: Virtual Archive and ACR
Data Set Uploaded to ACR
Data Set Ingested to Virtual Archive
Data Set Deposited with Institutional Repository
Data Set Published to VIVO

ACR – VA Interaction Protocol
[Sequence diagram; vertical axis: Time]
Participants: Researcher, Curator
Components: Active Curation Repository, ACR UI, SPARQL Endpoint, Virtual Archive, VA UI, SWORD Endpoint, Query Endpoint
Messages shown:
• Mark Data For Publication (and Accept Licensing Terms)
• Ingest Data To VA
• (SPARQL) Query Metadata
• Return Metadata
• Curator Request for Preview
• Curator Preview
• Metadata update and View
• User Queries VA for DOI
• Query DOI Metadata
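The "(SPARQL) Query Metadata" message can be pictured as an HTTP request to the SPARQL endpoint. The slide does not give the request format, so the sketch below assumes the endpoint accepts a standard SPARQL Protocol GET request with a `query` parameter; the endpoint URL, the data set URI, and the function names are hypothetical.

```python
from urllib.parse import urlencode

ENDPOINT = "http://repository.example.org/sparql"  # hypothetical endpoint URL


def metadata_query(dataset_uri: str) -> str:
    """Build a SPARQL query asking for every property of one data set."""
    if any(c in dataset_uri for c in '<>" \n'):
        raise ValueError("data set URI contains characters not allowed in an IRI")
    return f"SELECT ?property ?value WHERE {{ <{dataset_uri}> ?property ?value }}"


def request_url(dataset_uri: str) -> str:
    """Encode the query as a GET request using the `query` parameter."""
    return ENDPOINT + "?" + urlencode({"query": metadata_query(dataset_uri)})


print(request_url("http://example.org/dataset/42"))
```

Rejecting angle brackets, quotes, and whitespace in the URI keeps a malformed identifier from changing the structure of the query.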

Virtual Archive Workflow
[Workflow diagram; steps shown: Preview Data; Ready to Publish; Upload Data to VA; Run Virus Checking; File Characterization; Deposit to IR; Mint DOI; Index Metadata; Accept Repository Agreement in ACR; Index Scientific Metadata; Version Data; IR Matchmaker; Large Dataset Policy Decision]
Legend: To be completed by March 2013
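A workflow like this can be sketched as an ordered list of steps that each read and update a shared record. The example below covers only three steps, in an assumed order, and uses a size and SHA-256 checksum as a simple stand-in for file characterization; the function names and record fields are hypothetical and do not describe the Virtual Archive implementation.

```python
import hashlib


def check_agreement(ctx):
    """Refuse to continue unless the repository agreement was accepted."""
    if not ctx.get("agreement_accepted"):
        raise RuntimeError("repository agreement not accepted")


def characterize(ctx):
    """Stand-in for file characterization: record size and SHA-256 checksum."""
    data = ctx["content"]
    ctx["size_bytes"] = len(data)
    ctx["sha256"] = hashlib.sha256(data).hexdigest()


def index_metadata(ctx):
    """Build the record that a metadata index would store."""
    ctx["index_record"] = {"title": ctx["title"], "sha256": ctx["sha256"]}


STEPS = [check_agreement, characterize, index_metadata]  # assumed order


def run_workflow(ctx):
    """Run each step in order, recording completed steps for traceability."""
    ctx["completed"] = []
    for step in STEPS:
        step(ctx)
        ctx["completed"].append(step.__name__)
    return ctx


# Invented sample input, for illustration only.
result = run_workflow(
    {"title": "Basin survey", "content": b"abc", "agreement_accepted": True}
)
print(result["completed"], result["size_bytes"])
```

Because each step raises on failure, the `completed` list shows exactly how far a data set got before the workflow stopped.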

Key Questions for SEAD Prototype
• What could SEAD capture when?
• How can SEAD provide direct value to data producers, users, and curators?
• How can web 2.0/3.0 and social computing lower barriers and reduce/realign costs?

Towards A Shared Data Future
Source: EU HLEG Report on Data Deluge: Riding the Wave, pg 31, 2010
[Figure: layered data infrastructure; side labels: Trust, Data Curation]
Data Generators, Users
• User functionalities, data capture & transfer, virtual research environments
Community Support Services
• Data discovery & navigation, workflow generation, annotation, interpretability
Common Data Services
• Persistent storage, identification, authenticity (provenance), workflow execution, data mining

Data Interoperability and SEAD
• NSF OCI: DataNet and INTEROP now DIBBs
• EUDAT
• Research Data Alliance
• IETF Research Data Identifier BOF
• NCED Data Network

Acknowledgements
SEAD is funded by the National Science Foundation under cooperative agreement #OCI0940824
• For more on SEAD go to: http://sead-data.net
• Follow us on Twitter @SEADdatanet