Proposed DataONE TeraGrid Joint Initiative John Cobb, TeraGrid, and DataONE Presentation to TeraGrid Quarterly Management Meeting August 31, 2010 Seattle, WA


TRANSCRIPT

Page 1: Proposed DataONE TeraGrid  Joint Initiative

Proposed DataONE TeraGrid Joint Initiative

John Cobb, TeraGrid, and DataONE

Presentation to TeraGrid Quarterly Management Meeting

August 31, 2010, Seattle, WA

Page 2: Proposed DataONE TeraGrid  Joint Initiative

DataONE objectives

• Develop a distributed cyberinfrastructure architecture to enable the long-term preservation of digital data: support the data life cycle
• Engage the scientific community to move forward concepts of
– Digital data archives of scholarly data
– Best practices for digital data preservation
– Journal publishers’ efforts for digital data repositories (e.g., Dryad)
• Enable new science via data synthesis
• Develop a long-term sustainability strategy (decades long)
– Architecture
– Technology future-proofing
– Arrangements/MOUs
• Focus on ecological, biological, and environmental science areas

Page 3: Proposed DataONE TeraGrid  Joint Initiative

What shapes DataONE?

• Challenges associated with climate variability
• Community needs good data
• Good data
– builds good science
– makes possible wise management
– enables sound decisions
• Good data needs
– good technical infrastructure
– sound organization
– community engagement (you)

Page 4: Proposed DataONE TeraGrid  Joint Initiative

Architecture to support the data lifecycle

[Figure: nodes labeled UCSB Node, UNM Node, and ORC Node]

The data lifecycle:
1. Deposition/acquisition/ingest
2. Curation and metadata management
3. Protection, including privacy
4. Discovery, access, use, and dissemination
5. Interoperability, standards, and integration
6. Evaluation, analysis, and visualization

Page 5: Proposed DataONE TeraGrid  Joint Initiative

DataONE – Building new global CI

Additional prospective member nodes under discussion

Page 6: Proposed DataONE TeraGrid  Joint Initiative

The character of a member node

• Source of data
• Participant in a larger collective
• (Usually) provides new and interesting data sets (watersheds, satellite remote observations, citizen science data collections, environmental observations, geographical diversity, specific diversity, discipline diversity)
• Supports DataONE Member Node (MN) software stack
• May contribute storage to support replicas of other member nodes
• May differ in scale
– My data
– University library digital services arm
– Associated data repositories for journals
– DOI infrastructure
– Project-specific data collections
– Agency-specific programs for data management
– National-scale cyberinfrastructure providers (e.g., TG)

Page 7: Proposed DataONE TeraGrid  Joint Initiative

The metadata challenge: “the flood of increasingly heterogeneous data”

• Data are heterogeneous
– Syntax (format)
– Schema (model)
– Semantics (meaning)

Jones et al. 2007

DataONE focus: Synthesize data sets with disparate metadata to provide new scientific insights
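The three kinds of heterogeneity can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the two sources, their field names (`lat`, `temp_c`, `Latitude`, `TempF`), and the common target record are hypothetical and do not come from DataONE or from Jones et al. One source differs from the other in syntax (CSV versus JSON), in schema (field names), and in semantics (Celsius versus Fahrenheit).

```python
import csv
import io
import json


def from_csv(text):
    """Hypothetical source A: CSV syntax, fields 'lat' and 'temp_c' (Celsius)."""
    row = next(csv.DictReader(io.StringIO(text)))
    return {
        "latitude": float(row["lat"]),
        "temperature_c": float(row["temp_c"]),
    }


def from_json(text):
    """Hypothetical source B: JSON syntax, fields 'Latitude' and 'TempF' (Fahrenheit)."""
    doc = json.loads(text)
    return {
        "latitude": float(doc["Latitude"]),
        # Semantic step: convert Fahrenheit to Celsius so both sources agree.
        "temperature_c": round((float(doc["TempF"]) - 32) * 5 / 9, 2),
    }


# Two records describing the same observation in different forms.
record_a = from_csv("lat,temp_c\n35.0,100.0\n")
record_b = from_json('{"Latitude": 35.0, "TempF": 212.0}')
```

Once both records share one syntax, schema, and unit convention, they can be compared or merged directly, which is the precondition for the synthesis the slide describes.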

Page 8: Proposed DataONE TeraGrid  Joint Initiative

DataONE Member Node Operations

• Minimal set of operations to enable a distributed archive
– Minimal to enable wide deployment in heterogeneous environments
– Does not include some operations that are Coordinating Node only
• That set = {C,R,U,D}
– Create
– Replicate
– Update
– Delete
• Implementation
– Pilot now (operational and operational)
– Evaluation of pilot started
– V.1 deployment planned for next year
• Deployed platforms
– Python
– R
– Mercury
– …
• Note the meaning of “platform”
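The four member node operations can be sketched as a small in-memory Python class. This is only an illustration under stated assumptions: the class name `MemberNode`, its method signatures, and the identifier `example:1` are hypothetical and are not the DataONE MN software stack or its API.

```python
from dataclasses import dataclass, field


@dataclass
class MemberNode:
    """Hypothetical in-memory member node holding objects by identifier."""
    name: str
    objects: dict = field(default_factory=dict)  # identifier -> bytes

    def create(self, pid, data):
        """Store a new object; refuse to overwrite an existing identifier."""
        if pid in self.objects:
            raise KeyError(f"{pid} already exists on {self.name}")
        self.objects[pid] = data

    def replicate(self, pid, target):
        """Copy one object to another node, leaving the original in place."""
        target.objects[pid] = self.objects[pid]

    def update(self, pid, data):
        """Replace the content of an object that already exists."""
        if pid not in self.objects:
            raise KeyError(f"{pid} not found on {self.name}")
        self.objects[pid] = data

    def delete(self, pid):
        """Remove an object from this node only; replicas elsewhere survive."""
        del self.objects[pid]


# Usage: create on one node, replicate to a second, then delete the original.
ucsb = MemberNode("UCSB")
unm = MemberNode("UNM")
ucsb.create("example:1", b"site,temp\n")
ucsb.replicate("example:1", unm)
ucsb.delete("example:1")
```

Keeping the interface this small is what would let very different hosts (a single researcher's machine, a library service, a national provider) implement it.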

Page 9: Proposed DataONE TeraGrid  Joint Initiative

Coordinating Nodes

• Contain full metadata catalog of member node data collections
• Direct certain operations
– Replication direction
– Location tracking
– Ingestion
– Assisted by deployed platforms. Example: Mercury leads to automatic ingest capability for NASA DAAC (MODIS data)
• CN locations also have MN instances, which provides some “free energy” for replication
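A coordinating node's bookkeeping role (metadata catalog, location tracking, and directing replication) can be sketched as follows. The class `CoordinatingNode`, its methods, and the simple "pick nodes that do not yet hold a copy" policy are assumptions chosen for illustration; the slides do not describe the actual replication policy.

```python
from dataclasses import dataclass, field


@dataclass
class CoordinatingNode:
    """Hypothetical coordinating node: tracks metadata and replica locations."""
    catalog: dict = field(default_factory=dict)    # identifier -> metadata dict
    locations: dict = field(default_factory=dict)  # identifier -> set of node names

    def register(self, pid, metadata, node_name):
        """Record metadata for an object and the member node that holds it."""
        self.catalog[pid] = metadata
        self.locations.setdefault(pid, set()).add(node_name)

    def plan_replication(self, pid, member_nodes, copies=2):
        """Return names of nodes that should receive a replica to reach `copies`."""
        holders = self.locations[pid]
        needed = max(0, copies - len(holders))
        candidates = sorted(n for n in member_nodes if n not in holders)
        return candidates[:needed]

    def record_replica(self, pid, node_name):
        """Note that a replica now exists on another member node."""
        self.locations[pid].add(node_name)


cn = CoordinatingNode()
cn.register("example:1", {"title": "Stream temperatures"}, "UCSB")
targets = cn.plan_replication("example:1", ["UCSB", "UNM", "ORC"], copies=3)
for name in targets:
    cn.record_replica("example:1", name)
```

In this sketch the coordinating node never stores the data itself; it only decides where copies should go and remembers where they are.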

Page 10: Proposed DataONE TeraGrid  Joint Initiative

Service layer model of data/knowledge services (analogy with OSI)

• Platters
• Controllers
• Hardware redundancy
• I/O bandwidth provisioning
• Connections
• File systems
• AAAA
• Federated identity
• Wide area data distribution
– Block level
– Xnodes
– File level
• Metadata generation (automatically?)
• Metadata harmonization
• Replication, decoherent, survivable copies
• Workflow mediated data operations
• Semantics and ontology

Page 11: Proposed DataONE TeraGrid  Joint Initiative

Natural TG and DataONE interaction

• TG emphasizes left column
• DataONE emphasizes right column (for areas of interest)
• DataONE MN collective resembles part of old TeraGrid collections mission
• DataONE includes a large community engagement component, with the hope of generating sufficient interest for collected communities to sustain interest (cf. the well-attended data best practices tutorial at the 2010 Ecological Society of America meeting)

Page 12: Proposed DataONE TeraGrid  Joint Initiative

Proposed interaction

• For DataONE: TeraGrid RPs (XD SPs) as Member Nodes
• For TeraGrid: DataONE as a data-oriented Science Gateway
• Requirements:
– For DataONE:
  • Participate in TG activities
    – Science Gateway efforts
    – Some of TG’s distributed data efforts
    – Some of TG outreach
  • Request data allocations
– TeraGrid RPs:
  • Deploy DataONE MN services
  • Make MN services available as REST services (advertised SW IIS)
– Both:
  • Interact
  • Investigate “new opportunities”

Page 13: Proposed DataONE TeraGrid  Joint Initiative

What about XD?

• TeraGrid is “Pre-XD”
• Does XD have a data archive mission?
– Yes (as far as I know now)
– All things Digital, but eXtreme

The goal of this solicitation is to encourage innovation in the design and implementation of an effective, efficient, increasingly virtualized approach to the provision of high-end digital services – extreme digital services - while ensuring that the infrastructure continues to deliver high-quality access for the many researchers and educators that use it in their work.

• Conclusion: work with current TeraGrid and plan to manage a smooth transition to XD (DataONE will need to be capable of this pivot if it hopes to have decades-long stewardship)
• Go ahead and get started now

Page 14: Proposed DataONE TeraGrid  Joint Initiative

Sustainability

• DataONE is called to create an environment for “decades long” sustainability, technically and economically
• No project has more than a 5-year horizon (not even NASA archives)
• DataNets must “figure this out”
• Solution: plan to manage change
• Recognize the underlying forces. Science wants data preservation
• “Someone will provide” (more detail needed here)

Page 15: Proposed DataONE TeraGrid  Joint Initiative

What is the value add?

• Helps TG and DataONE meet their respective goals
– Providing cyberinfrastructure for NSF-funded research
– Providing curation and life cycle support for digital data archives
• Diminishes DataONE’s need to provision large amounts of low-level data resources internally: partner instead of reinvent
• Reiterates TeraGrid/XD mission to provide tier 2 (and tier 1) resources for storage

Page 16: Proposed DataONE TeraGrid  Joint Initiative

Next steps/action items

• Commission a combined TG+D1 WG
– Goals
  • Develop TG RPs as DataONE member nodes
– Action items
  • DataONE All Hands Meeting, Nov. 2-5, Tamaya, NM
  • Initiate DataONE SGW
  • Initial TG allocation
  • Deploy pilot MN stack on TG resources
  • Demonstrate CN-orchestrated replication to TG MNs (exercise the CRUD services)
– Composition
  • TeraGrid
    – Chris Jordan, TG AD for Data
    – Nancy Wilkins-Diehr, TG AD for SGW
    – Dan Katz, TG Dir. of Science
    – Others?
  • DataONE
    – Dave Vieglais, DataONE AD for CI
    – John Cobb, Dist Storage WG lead
    – Bruce Wilson, DataONE core cyberinfrastructure team (CCIT)
    – Others

Page 17: Proposed DataONE TeraGrid  Joint Initiative

Where are future opportunities?

• MN replication can be viewed as data placement. Thus DataONE can be a data staging method for large-scale computations on TG/XD
• Metadata harmonization can imply moderate to large regular computations (“daily farm fresh” data sets may require daily data/computation workflows)
• “Noodle out” how to support the NSF data management plan requirement, perhaps together
• Ability to integrate with MREs as a ready data management solution
• Ability to integrate with similar simulation efforts (much more data intensive)

Page 18: Proposed DataONE TeraGrid  Joint Initiative

Discussion/Questions?


Page 19: Proposed DataONE TeraGrid  Joint Initiative

Post-discussion action items

• Smaller team continues discussions (Cobb, Jordan, Katz, Wilkins-Diehr, Vieglais, Wilson, Jones)
• Bundle pilot MN SW for TG MN deployment
• Identify MN listening ports for services
• Initiate Security WG
• Initiate Gateway project
• Define RPs willing to deploy these services
• DataONE to write TG allocation request
– Gateway services
– Replicated Data Service
• Continue larger discussion, particularly as larger needs come down the line
• Explore mutual line-of-business opportunities
• Separately: continue to investigate economic sustainability of large-scale storage needs