Transcript
Page 1: SEAD Datanet and Sustainability Science

SEAD Datanet and Sustainability Science

Robert H. McDonald

Deputy Director/Associate Dean

Data to Insight Center/IU Libraries

SC12 | Salt Lake City, UT

November 12, 2012

http://www.sead-data.net

@SEADdatanet

1. NSF DataNet Overview
2. SEAD Overview
3. SEAD Active/Social Curation
4. SEAD Virtual Archive Repository

Page 2: SEAD Datanet and Sustainability Science

SEAD DataNet and Sustainability Science

http://slidesha.re/TAk3ht

http://www.sead-data.net

@SEADdatanet

2 SEAD DataNet Home

Page 3: SEAD Datanet and Sustainability Science

SEAD TEAMS

Michigan: Margaret Hedstrom (PI), Marietta Van Buhler, Karen Woollams, George Alter (ICPSR), Bryan Beecher (ICPSR)

Indiana: Beth Plale (Co-PI), Katy Börner, Robert H. McDonald, Robert Light, Kavitha Chandrasekar, Stacy Kowalczyk, Inna Kouper, Robert Ping, Ryan Cobine

Rensselaer: James Myers (Co-PI), Ram Prasanna Govind Krishnan, Lindsay Todd

Illinois: Praveen Kumar (Co-PI), Terry McLaren (NCSA), Rob Kooper (NCSA), Luigi Marini (NCSA)

Page 4: SEAD Datanet and Sustainability Science

NSF DataNet Program

Motivation: “… one of the major challenges of this scientific generation: how to develop the new methods, management structures and technologies to manage the diversity, size, and complexity of current and future data sets and data streams.”

Response: DataNet creates “a set of exemplar national and global data research infrastructure organizations” to address this challenge.


Page 5: SEAD Datanet and Sustainability Science

Current NSF DataNet Projects

• SEAD: http://sead-data.net

• DataONE: http://www.dataone.org

• DataNet Federation Consortium: http://datafed.org

• Terra Populus: https://www.pop.umn.edu/terra_pop

Page 6: SEAD Datanet and Sustainability Science

SEAD’s Approach

SEAD Partners - http://sead-data.net

• Contribute infrastructure to the DataNet vision that supports data access, sharing, reuse, and preservation for the long tail

• Develop a data access and preservation environment that supports the research, technical, and economic requirements for data management in the long tail

• Enable Active and Social Curation

• Utilize emerging preservation and access infrastructures

Page 7: SEAD Datanet and Sustainability Science

Long Tail Data Challenges

[Figure: data volume in bytes per day, with scale labels GigaBytes, TeraBytes, PetaBytes, ExaBytes; annotation: "Many smaller datasets…"]

Page 8: SEAD Datanet and Sustainability Science

CI for the Long Tail

What is the “long tail” of scientific research and why does it matter?

• Diverse set of researchers, questions, data, and methodologies, etc.

• Diverse set of requirements for instrumentation, data collection, models, analysis, etc.

• Little standardization, no common denominator

• Most researchers and most research dollars go to researchers in the long tail

• The long tail is underserved by current CI

Page 9: SEAD Datanet and Sustainability Science

Long Tail Example: Sustainability Research

Many dimensions, many coordinate systems, many scales, many data collection and analysis tools, many formats, a long tail of providers and users, …

Page 10: SEAD Datanet and Sustainability Science

SEAD 18-Month Pilot Phase

Domain Engagement:

• National Center for Earth-Surface Dynamics (NCED), Illinois River Basin Observatory

• Requirements, Use Cases, Prioritization of Data Types and Services

Active and Social Curation:

• Pilot Active Content Repository, VIVO deployments

• Exemplar services for Data Ingest, Discovery, Re-use, Curation (Tupelo/Medici)

CI for Long-term Access (Virtual Archive):

• Data model, protocol design/development

• Pilot Federated Repository infrastructure

Education, Outreach, and Training:

• Post-doc mentoring

• Web site, training materials, meetings, workshops, …

Project Oversight:

• Management, reporting, committees

• Business model development

Page 11: SEAD Datanet and Sustainability Science

NCED Collection Access

NCED collections in SEAD-ACR (20 Top-level Collections, 454K files, 2.25M objects, 1.6 TB data)

• NCED Repository Interface

• Support for hierarchy

• Support for collection annotation

• View/add NCED/domain specific Terms

• New Large Server with Virtual Machine ACR instances

• Ingest tools and procedures

• csv2rdf4LOD

• Archiving, Citation, DOI assignment, …

With an account, NCED users can go from a web page to previews and downloads (without a cart), add annotations, browse, and search by text (any fields and content), tags, etc.

Page 12: SEAD Datanet and Sustainability Science

SEAD notions of defined Data Phases

Phases of the data lifecycle acknowledge and accommodate the difference between public data and data a researcher is still working on.

Research Data Phase: the data set is a research data collection, owned by an individual and under their control.

• Data need not be licensed at this time because it is not ready for broader release

• Data need not have permanent IDs because it is still a work in progress

• Corresponds to first existence in the Active Curation Repository

Published Phase: the owner of the research data collection determines that the dataset is ready for publication.

• License terms set

• Persistent ID

• Made available as part of public profile in VIVO

• Activated by user-controlled publish event
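The two data phases can be read as a small state model. The following is a minimal sketch under stated assumptions, not SEAD's implementation: the class name `DataCollection`, its fields, and the example identifier are all hypothetical and chosen only to illustrate that license terms and a persistent ID are optional in the research phase and required by the user-controlled publish event.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Phase(Enum):
    RESEARCH = "research"
    PUBLISHED = "published"


@dataclass
class DataCollection:
    """Hypothetical model of a collection moving through the two phases."""
    owner: str
    title: str
    phase: Phase = Phase.RESEARCH
    license_terms: Optional[str] = None   # not required in the research phase
    persistent_id: Optional[str] = None   # not required in the research phase

    def publish(self, license_terms: str, persistent_id: str) -> None:
        """User-controlled publish event: requires license terms and a persistent ID."""
        if self.phase is Phase.PUBLISHED:
            raise ValueError("collection is already published")
        if not license_terms or not persistent_id:
            raise ValueError("publishing requires license terms and a persistent ID")
        self.license_terms = license_terms
        self.persistent_id = persistent_id
        self.phase = Phase.PUBLISHED


# Example values are placeholders, not real identifiers.
collection = DataCollection(owner="example-researcher", title="Example field survey")
collection.publish(license_terms="example-license", persistent_id="doi:10.0000/example")
```

Keeping the license and ID as optional fields until `publish` is called mirrors the slide's point that work-in-progress data should not be forced through publication requirements.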

Page 13: SEAD Datanet and Sustainability Science

SEAD Active/Social Curation Repository

Page 14: SEAD Datanet and Sustainability Science

Page 15: SEAD Datanet and Sustainability Science

ACR Bulk Ingest Process

[Diagram; recoverable labels:]

• Inputs: Data; Metadata

• TWC: csv2rdf4lod, producing a .ttl output file

• Configuration: Headers to Standard Vocabularies; Content Mapping to identifiers; Additional Inference possible

• ACR Ingest: global ID, filepath, file; incremental ingest, restart, verify

• SEAD ACR Instance: Extractors/Preview On/Off; DROID Analysis
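The idea of mapping CSV headers to standard vocabularies and emitting a .ttl file can be illustrated with a short sketch. This is not csv2rdf4lod and does not use its configuration format; the header mapping, the function name `csv_to_turtle`, and the `http://example.org/` base URI are assumptions made for illustration. Only the Dublin Core terms namespace is a real vocabulary, and the sketch assumes that values in the ID column are safe to place in a URI.

```python
import csv
import io

# Hypothetical mapping from CSV headers to vocabulary terms (an assumption,
# standing in for the "Headers to Standard Vocabularies" configuration).
HEADER_TO_PREDICATE = {
    "title": "dcterms:title",
    "creator": "dcterms:creator",
}

PREFIXES = "@prefix dcterms: <http://purl.org/dc/terms/> .\n"


def escape_literal(value: str) -> str:
    """Escape characters that are not allowed raw in a short Turtle string literal."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def csv_to_turtle(csv_text: str, id_column: str, base_uri: str) -> str:
    """Emit one Turtle statement block per CSV row, using the header mapping."""
    blocks = [PREFIXES]
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = f"<{base_uri}{row[id_column]}>"
        pairs = [
            f'{predicate} "{escape_literal(row[header])}"'
            for header, predicate in HEADER_TO_PREDICATE.items()
            if row.get(header)
        ]
        if pairs:
            blocks.append(f"{subject} " + " ;\n    ".join(pairs) + " .\n")
    return "\n".join(blocks)


sample = "id,title,creator\nds1,Example survey,Example Researcher\n"
turtle = csv_to_turtle(sample, id_column="id", base_uri="http://example.org/dataset/")
print(turtle)
```

Columns with no entry in the mapping are skipped, which keeps unmapped metadata out of the output instead of guessing a predicate for it.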

Page 16: SEAD Datanet and Sustainability Science

Page 17: SEAD Datanet and Sustainability Science

Page 18: SEAD Datanet and Sustainability Science

Page 19: SEAD Datanet and Sustainability Science

Page 20: SEAD Datanet and Sustainability Science

Page 21: SEAD Datanet and Sustainability Science

Page 22: SEAD Datanet and Sustainability Science

Page 23: SEAD Datanet and Sustainability Science

Page 24: SEAD Datanet and Sustainability Science

Page 25: SEAD Datanet and Sustainability Science

Page 26: SEAD Datanet and Sustainability Science

Page 27: SEAD Datanet and Sustainability Science

SEAD/NCED Data Social Network


Page 28: SEAD Datanet and Sustainability Science

NCED Data Social Network in SEAD-VIVO

• Mary Power, NCED PI and Professor, University of California

• William Dietrich, NCED PI and Professor, University of California

• Collin Bode, NCED Data Technician

NCED Social Network Connections Based on Data Authorship

Page 29: SEAD Datanet and Sustainability Science

Angelo Basic GIS Coverage Data Set


Page 30: SEAD Datanet and Sustainability Science

SEAD Data Set Publishing Workflow

• Data content used within ACR

• Researcher Profile Established in VIVO

NCED Data Set Ingested to ACR

• Data Set ready to publish

NCED Data Set Ingested to VA

• DataCite minted DOI attached to finalized Data Set

NCED Data Set Deposited with IR

• DOI Resolution to designated IR

NCED Data Set Published to VIVO

Page 31: SEAD Datanet and Sustainability Science

Published NCED Data Set in IR (IU ScholarWorks)


Page 32: SEAD Datanet and Sustainability Science

SEAD Virtual Archive


Page 33: SEAD Datanet and Sustainability Science

Virtual Archive Features

Usability consistent with research user expectations

• Additional metadata fields for scientific datasets

• Ability to ingest data with data previewing

Repository tracking: tracking member Institutional Repositories (IRs) and their stored content

• Not just a link to the repository, but an extensive cataloging tool (metadata and other additional information)

• Allows users to search for data in a particular IR or across all IRs

Low cost replication: cloud based storage for reliability

• Proof of concept uses Amazon S3 to maintain a copy of files and collections. Amazon Glacier is low-cost, secure, and durable, and is optimized for cold storage. Other solutions exist.

Page 34: SEAD Datanet and Sustainability Science

Virtual Archive Features


Page 35: SEAD Datanet and Sustainability Science

Virtual Archive Features


Page 36: SEAD Datanet and Sustainability Science

Component Interactions: Virtual Archive and ACR

Data Set Uploaded to ACR

Data Set Ingested to Virtual Archive

Data Set Deposited with Institutional Repository

Data Set Published to VIVO

Page 37: SEAD Datanet and Sustainability Science

ACR – VA Interaction Protocol

[Sequence diagram with a time axis; recoverable labels:]

• Actors: Researcher, Curator

• Components: ACR UI, Active Curation Repository, SPARQL Endpoint, VA UI, Virtual Archive, SWORD Endpoint, Query Endpoint

• Messages: Mark Data For Publication (and Accept Licensing Terms); Curator Request for Preview; (SPARQL) Query Metadata; Return Metadata; Curator Preview; Ingest Data To VA; Metadata update and View; User Queries VA for DOI; Query; DOI Metadata
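The "(SPARQL) Query Metadata" step can be sketched as a query that asks an endpoint for every property and value attached to one dataset. This is only an illustration: the function name `build_metadata_query` and the example dataset URI are hypothetical, and the slides do not say which query or which predicates SEAD actually uses, so the sketch stays generic and names no vocabulary.

```python
def build_metadata_query(dataset_uri: str) -> str:
    """Return a SPARQL SELECT listing every property/value pair of one dataset."""
    # Reject characters that may not appear inside a SPARQL IRI reference,
    # so the URI cannot break out of the angle brackets.
    if any(ch in dataset_uri for ch in '<>" {}|\\^`'):
        raise ValueError("dataset_uri contains characters not allowed in an IRI reference")
    return (
        "SELECT ?property ?value\n"
        "WHERE {\n"
        f"  <{dataset_uri}> ?property ?value .\n"
        "}"
    )


# The URI below is a placeholder, not a real SEAD identifier.
query = build_metadata_query("http://example.org/dataset/ds1")
print(query)
```

The resulting string would be sent to whatever SPARQL endpoint the repository exposes; how the request is transported and authenticated is outside this sketch.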

Page 38: SEAD Datanet and Sustainability Science

Virtual Archive Workflow

[Workflow diagram; recoverable step labels:]

• Preview Data

• Ready to Publish

• Upload Data to VA

• Run Virus Checking

• File Characterization

• Deposit to IR

• Mint DOI

• Index Metadata

• Accept Repository Agreement in ACR

• Index Scientific Metadata

• Version Data

• IR Matchmaker

• Large Dataset Policy Decision

To be completed by March 2013

Page 39: SEAD Datanet and Sustainability Science

Key Questions for SEAD Prototype

• What could SEAD capture when?

• How can SEAD provide direct value to data producers, users, and curators?

• How can web 2.0/3.0 and social computing lower barriers and reduce/realign costs?

Page 40: SEAD Datanet and Sustainability Science

Towards A Shared Data Future

Source: EU HLEG Report on Data Deluge: Riding the Wave, pg 31, 2010

[Figure; recoverable labels:]

• Data Generators, Users: User functionalities, data capture & transfer, virtual research environments

• Community Support Services: Data discovery & navigation, workflow generation, annotation, interpretability

• Common Data Services: Persistent storage, identification, authenticity (provenance), workflow execution, data mining

• Side labels: Trust; Data Curation

Page 41: SEAD Datanet and Sustainability Science

Data Interoperability and SEAD

• NSF OCI: DataNet and INTEROP now DIBBs

• EUDAT

• Research Data Alliance

• IETF Research Data Identifier BOF

• NCED Data Network

Page 42: SEAD Datanet and Sustainability Science

Acknowledgements

SEAD is funded by the National Science Foundation under cooperative agreement #OCI0940824

• For more on SEAD go to: http://sead-data.net

• Follow us on Twitter @SEADdatanet