datanet federation consortiumdatafed.org/dev/wp-content/uploads/2012/04/3-dfc... · 4/3/2012 ·...
TRANSCRIPT
Engagement and Prototype
User Requirements, User Groups, Prototype
Helen Tibbo, Reagan Moore, Arcot Rajasekar UNC-CH
DataNet Federation Consortium
National Science Foundation Cooperative Agreement: OCI-0940841
Please credit the DataNet Federation Consortium when referencing this information.
Topics
• User requirements
  – User surveys
  – Assessment: Use cases
  – Policies and standards
• User Groups
  – DFC funded community interactions
  – Extended community
• Prototype
  – Architecture
  – Deliverables
  – Federation
User Requirements
Lesson Learned: There are three levels of requirements:
1. Infrastructure interoperability
  – Survey of science and engineering technology
  – Track technology evolution
2. Domain management
  – Governance policies (within a project)
  – Federation policies (with other projects)
3. Researcher features
  – Specific capabilities to improve productivity
  – User interfaces
Methods
• Survey (Rajasekar)
– Identify technology interoperability requirements
• Interviews (Tibbo)
– Identify consortia governance, workflows, provenance requirements
– Identify researcher needs
Survey: Hydrology Use Case – Automated Analyses in Hydrology
• Integrated Water Model
  – Automation of VIC Workflows
    o Capture of Provenance & Process Information (Identify Lineage & Acknowledgements)
    o Provision for Unique Data Identification Signature
    o Impose Restrictions & Apply Transformations
    o Capture & Propagate Caveats & Error Corrections
    o Provisions for Failure Recovery, Debugging & Explanations
    o Re-execute Models for Reproducible Science
  – Extension to RHESSys Workflow
    o Integration with GRASS Methods
    o Identify and Apply SLAs
• Heterogeneous Data Access
  – CUAHSI HIS Data
  – Climate Data (NCDC)
  – Satellite Data (NASA)
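The provenance and reproducibility requirements above can be illustrated with a minimal Python sketch. It is not VIC, RHESSys, or iRODS code: the step functions, the record fields, and the `run_with_provenance` helper are all hypothetical, chosen only to show how lineage could be logged per step and how a re-execution could be checked against the first run.

```python
import hashlib
import json


def digest(value) -> str:
    """Stable fingerprint of a JSON-serializable value."""
    return hashlib.sha256(json.dumps(value, sort_keys=True).encode()).hexdigest()


def drop_missing(rows):
    # Hypothetical cleaning step: discard records with no measurement.
    return [r for r in rows if r["precip_mm"] is not None]


def total_precip(rows):
    # Hypothetical analysis step: sum the remaining measurements.
    return {"total_mm": sum(r["precip_mm"] for r in rows)}


def run_with_provenance(steps, data):
    """Run steps in order and log input/output fingerprints for each one."""
    log = []
    for step in steps:
        output = step(data)
        log.append({"step": step.__name__,
                    "input": digest(data),
                    "output": digest(output)})
        data = output
    return data, log


# Made-up observations, used only to exercise the sketch.
observations = [
    {"day": 1, "precip_mm": 4},
    {"day": 2, "precip_mm": None},
    {"day": 3, "precip_mm": 6},
]
workflow = [drop_missing, total_precip]
result, log = run_with_provenance(workflow, observations)

# Re-execute and compare: identical results and identical lineage logs.
rerun_result, rerun_log = run_with_provenance(workflow, observations)
reproducible = (result == rerun_result) and (log == rerun_log)
```

Because each log entry stores the fingerprint of its input and output, the output of one step can be matched to the input of the next, which is one simple way to express lineage.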
Survey: Engineering Use Case – Information-based Engineering
• Integrated Infrastructure for Engineering Models
  – From Silos (Projects/Teams) to Clouds
  – Format Registry for Design Models
  – Format Verification Services
  – Model Conversion Services
  – Model Metadata Extraction & Discovery Services
  – Distributed Model Data Access
    o OOI Sensor Data
    o CIBER-U CAD data
    o iConnect Civil Infrastructure (Bridges) System
• Take Design Models from NSF Projects to Education
  – Integrate DFC Platform with CIBER-U
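A format registry with a verification service, as listed above, could look roughly like the following Python sketch. The registry layout and the `verify_format` function are assumptions made for illustration, not the DFC services. The two signatures used are the standard opening tokens of ISO 10303-21 (STEP) exchange files and ASCII STL files; a real service would check far more than the first bytes.

```python
# Hypothetical registry: file extension -> format name and leading signature.
FORMAT_REGISTRY = {
    "step": {"name": "ISO 10303-21 (STEP)", "magic": b"ISO-10303-21;"},
    "stl": {"name": "ASCII STL", "magic": b"solid"},
}


def verify_format(filename: str, head: bytes) -> str:
    """Compare the claimed format (by extension) with the file's first bytes.

    Returns "valid", "mismatch", or "unknown" (extension not registered).
    """
    ext = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    entry = FORMAT_REGISTRY.get(ext)
    if entry is None:
        return "unknown"
    # Leading whitespace is tolerated before the signature.
    return "valid" if head.lstrip().startswith(entry["magic"]) else "mismatch"


checks = {
    "bracket.step": verify_format("bracket.step", b"ISO-10303-21;\nHEADER;"),
    "bracket.stl": verify_format("bracket.stl", b"ISO-10303-21;\nHEADER;"),
    "bracket.xyz": verify_format("bracket.xyz", b"anything"),
}
```

Keeping the signatures in a data table rather than in code means new design formats can be registered without changing the verification logic.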
Survey: Marine Use Case
• Long-term Access to Marine Data Streams
  – Replicate & Archive Data Streams at NCDC
  – Capture & Propagate Data Provenance, Errors & Corrections
  – Capture Metadata & Enable Long-term Discovery
  – Provision for Replay of Archived Data Streams
  – Services for Runtime Stream Format Conversions
  – Impose Restrictions & Apply Transformations
• Federate with Hydrology and Climate Data
  – Applications of Hydrology Models with Marine Data
  – Enable Integration of Ocean and Hydrology Modeling
  – Enable Integration of Ocean and Climate Modeling
• Expose Marine Engineering Design to Educational Reuse
  – Provision Access to Sensor/Platform Design Data
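The replay and runtime format-conversion requirements above can be sketched in a few lines of Python. Everything here is hypothetical (the `Sample` fields, the `StreamArchive` class, the sensor name, and the CSV layout); it only illustrates archiving a stream once and replaying a time window in a format chosen at read time.

```python
from dataclasses import dataclass
from typing import Callable, Iterator, List


@dataclass(frozen=True)
class Sample:
    t: int        # timestamp, in arbitrary integer units
    sensor: str   # sensor identifier (made up for this sketch)
    value: float  # measured value


class StreamArchive:
    """Stores samples as they arrive and replays any time window later."""

    def __init__(self) -> None:
        self._samples: List[Sample] = []

    def append(self, sample: Sample) -> None:
        self._samples.append(sample)

    def replay(self, start: int, end: int,
               convert: Callable[[Sample], object] = lambda s: s) -> Iterator[object]:
        # Replay in time order, converting each sample on the way out.
        for sample in sorted(self._samples, key=lambda s: s.t):
            if start <= sample.t < end:
                yield convert(sample)


def to_csv(sample: Sample) -> str:
    # Runtime format conversion: one CSV line per sample.
    return f"{sample.t},{sample.sensor},{sample.value}"


archive = StreamArchive()
for s in (Sample(20, "ctd1", 3.6), Sample(10, "ctd1", 3.5), Sample(30, "ctd1", 3.7)):
    archive.append(s)

window = list(archive.replay(10, 30, to_csv))
```

Storing the samples in one canonical form and converting only on replay keeps the archive independent of whatever formats later consumers need.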
GeoScience Domain Requirements: NSF EarthCube Interoperability Testbed
Architecture Layer: Technologies
• Collaboration environment: UNC-CH integrated Rule Oriented Data System (iRODS)
• Models: UC Boulder Community Surface Dynamics Modeling System, UNC RHESSys
• Data grids: GMU geospatial data grid, iRODS, DataONE OneDrive
• Workflows: iRODS, NCSA Cyberintegrator, UCSD Kepler, GMU BPELPower
• Policies: iRODS, NCSA Cyberintegrator
• Web services: OGC Sensor Web Enablement standard (SWE), WHOI observation assessment (SWE), NCSA Semantic Geostreaming toolkit (SWE, W3C), GMU Geospatial client, Colorado State University NextGen Network Enabled Weather
• Web analysis services: GMU GeOnAS, DataONE Oner
• Web visualization services: GRASS, NOAA Environmental Information Service
• Network and security protocols: iRODS (Grid Security Infrastructure, Kerberos, Shibboleth, Reliable Blast UDP, parallel TCP/IP)
• Repositories: NOAA CLASS, NASA Echo, GEOSS ClearingHouse
• Catalog: GMU GI-Cat, CUAHSI
• Federation: iRODS, CLASS
Policies and Standards: Central to the DFC
[Figure: Policies and Standards at the center, with the labels below arranged around it]
Groups:
• Standards Groups
• International Projects
• Advisory Committee
• Science & Engineering Domains
• Sustainability and Institutions
• Facilities and Operations
• Technology and Research
• Education and Outreach
Policy areas:
• Policies for automating data management
• Policies for publication & federation
• Policies for IPR & citations
• Policies for provenance & sustainability
• Policies for collaboration and reuse
• Policies for technology migration
• Policies for metadata extraction
• Policies for analysis and workflow
• Policies for change management
• Domain-centric policies
• Policies for authentication & authorization
• Policies for archiving, staging & caching
• Policies for replication & synchronization
• Policies for retention & disposition
• Policies for deletion & redaction
• Policies for trust
• Policies for curation & preservation
Policies and Standards COP Methodology
[Figure: methodology diagram; labels as recovered from the slide]
Requirements Inputs
• Standards Community
• Domain Scientists & Engineers
• Advisory Committees
• Peer Initiatives
Expertise and Interactions
• Graduate Digital Curation Program
• Professional Institutes
• Int’l. DigCCurr Conferences
• System Resources & Experience
• ISO WG for Repository Audit & Certification
• Partnerships w/ NARA, JISC, DCC, Glasgow
• SAA Leadership US, EU
• Grant Reviews
Requirements Transformed to Policies for Testing, Evaluation & Iteration
Outcomes and Deliverables
• Ingest, Management & Storage
• Educational Reuse
• Reuse by Scientists & Others
• Educators
• Novel, Integrated Analysis
• Sustainable Repositories
Multiple Methodologies to Elicit Work Practices and Curation Needs
• Review of the literature.
• Collaborate with other DataNet groups and international projects.
• Interviews, surveys, and content analysis of documentation to produce Curation Profiles extending work of Purdue & UIUC (collaborators on CDCG) and the DCC SCARP project.
• Use cases and data requirements observed, solicited, and transformed into curation requirements across the lifecycle.
• Start with targeted communities and limited functions; iterate out to other communities and across lifecycle functions over the life of the project.
• Test policy efficacy in communities; iterate.
5/3/2012 12
User Groups
DFC Funded Collaborations
• Ocean Observatories Initiative (OOI): John Orcutt – UCSD
• Hydrology-GeoScience: Jon Goodall – South Carolina
• Cyber-Infrastructure-Based Engineering Repositories for Undergraduates (CIBER-U): William Regli – Drexel
Broader Impact: Federal Agencies
• NASA Center for Climate Simulation
  – Developed Virtual Climate Data Server based on iRODS data grid
• National Climatic Data Center
  – Installed two iRODS data grids to manage access to climate data records
• National Nuclear Security Administration
  – Assisting Product Realization Digital Enterprise (PRIDE) Program (NA 122) in representation, ingest and curation of engineering records on ‘at risk’ media and digital CAD artifacts; contributing to requirements modernization initiatives in model-based enterprise
Broader Impact: Collaborative International Development
• EUDAT
  – Memorandum of Understanding on interoperable systems
• ARCS (Australian Research Collaboration Service)
  – WebDAV user interface
• France National Institute for Nuclear Physics and Particle Physics
  – Monitoring system
• Academia Sinica
  – Multi-lingual support, Storage Resource Manager
• Sustainable Heritage Access through Multivalent Archiving
  – Cheshire3 text processing
• Sanger Wellcome Trust
  – Genomics data grid
• CoopeUs
  – Ocean Sciences data sharing including EU programs and OOI
Broader Impact: Vendor Relationships
• Data Direct Networks
  – SFA10KE storage controller integrates iRODS policy-based data management within the storage controller; demonstrated at SC’11
  – Enables policy-controlled, storage-based processing
• Distributed Bio
  – Security and high-performance extensions
• RENCI
  – Enterprise version of iRODS (E-iRODS) for DFC production system
Prototype
• Architecture
• Deliverables
• Federation
Zen of DFC Architecture
Architecture design: Highly extensible, scalable, modular virtualization environment
• Based on three basic goals:
  o Organize distributed data into a shareable collection
  o Virtualize the collection instead of the storage systems
  o Make it easy to customize at all levels
Our Model
• Take a peer-to-peer client-server architecture
  – Enables distributed cloud management
• Add a Virtualization Framework to Manage and Abstract Namespaces
  – Provides logical independence from physical attributes
  – Enables abstraction for Authentication, Authorization and Identification (AAI)
• Integrate metadata support
  – Eases publishing and discovery of data and services
• Interleave with Policies
  – Empowers service-level customizability
• Expose a Published Protocol
  – Both at the front and the back end
  – Eases multi-language client interfacing and adding new services
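The idea of virtualizing the collection rather than the storage systems can be illustrated with a small Python sketch. The `LogicalCollection` class, the site labels, and the paths are hypothetical and are not the iRODS API; the point is only that clients address data by logical name while the mapping to physical replicas can change underneath.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class Replica:
    """One physical copy of a data object (hypothetical structure)."""
    resource: str       # storage resource label
    physical_path: str  # location inside that resource


@dataclass
class LogicalCollection:
    """Maps logical names to replicas, hiding where the bytes live."""
    catalog: Dict[str, List[Replica]] = field(default_factory=dict)

    def register(self, logical_name: str, replica: Replica) -> None:
        # A logical name may point at several physical replicas.
        self.catalog.setdefault(logical_name, []).append(replica)

    def resolve(self, logical_name: str) -> List[Replica]:
        # Clients ask by logical name only.
        return list(self.catalog.get(logical_name, []))

    def migrate(self, logical_name: str, old_resource: str, new: Replica) -> None:
        # Moving data changes the mapping, not the name users see.
        self.catalog[logical_name] = [
            new if r.resource == old_resource else r
            for r in self.catalog[logical_name]
        ]


collection = LogicalCollection()
name = "/dfc/hydro/run1.nc"
collection.register(name, Replica("siteA", "/vault/0001"))
collection.register(name, Replica("siteB", "/archive/ab/0001"))
collection.migrate(name, "siteA", Replica("siteC", "/disk9/0001"))
resources = [r.resource for r in collection.resolve(name)]
```

After the migration the logical name is unchanged, which is the "logical independence from physical attributes" the slide refers to.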
DFC Platform Architecture
• Build the DFC Platform on Proven Technology: iRODS
  – Stable releases and in production in multiple projects
  – Scalable, Extensible and Modular
  – Community-oriented & Open Source
  – Integrates data, metadata from different resources
  – Built-in federation & server-side computation facilities
  – Established software practices: SVN, Bugzilla, Gforge, irod-chat, Wiki, Doxygen, Continuous testing, Installation scripts, RPM, etc.
• DFC User requirements translated into
  – Integration of new data resources
  – Wrapping new functions and procedures as micro-services
  – Creation of new rules/workflows to perform transformations or analysis
  – Implementation/Integration of new client interfaces
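Wrapping functions as micro-services and chaining them into rules can be sketched as follows. This is plain Python, not the iRODS rule language or micro-service API: the `RuleEngine` class, the `on_ingest` event name, and the two example micro-services are assumptions made for illustration.

```python
import hashlib
from typing import Callable, Dict, List

# In this sketch a micro-service is a function that takes and returns a context dict.
MicroService = Callable[[dict], dict]


def compute_checksum(ctx: dict) -> dict:
    # Hypothetical micro-service: fingerprint the ingested bytes.
    ctx["checksum"] = hashlib.sha256(ctx["data"]).hexdigest()
    return ctx


def tag_format(ctx: dict) -> dict:
    # Hypothetical micro-service: toy format detection by file extension only.
    ctx["format"] = ctx["name"].rsplit(".", 1)[-1].lower()
    return ctx


class RuleEngine:
    """Maps an event name to an ordered chain of micro-services."""

    def __init__(self) -> None:
        self.rules: Dict[str, List[MicroService]] = {}

    def add_rule(self, event: str, chain: List[MicroService]) -> None:
        self.rules[event] = chain

    def fire(self, event: str, ctx: dict) -> dict:
        # Each micro-service receives the context produced by the previous one.
        for service in self.rules.get(event, []):
            ctx = service(ctx)
        return ctx


engine = RuleEngine()
engine.add_rule("on_ingest", [compute_checksum, tag_format])
result = engine.fire("on_ingest", {"name": "bridge_model.STEP", "data": b"demo"})
```

Because the policy is data (an event mapped to a chain), a new requirement can often be met by registering another chain rather than changing the engine.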
iRODS Architecture
[Figure: iRODS architecture in three tiers (Clients, Servers, Resources), with clients reaching servers through the iRODS Protocol]
• Clients: Windows Browser; iCommands (Command Line); WebDAV on iPod; iRODS Rich Web Client; HDF Viewer for iRODS (Visualization of HDF5 File)
• Servers: iCAT, iRES, iXMS, iSEC; Metadata Database, Message Queue, Schedule & Compute Queue
• Resources: Storage & Compute Resources (File Systems, Archives, Databases, Sensor Systems, Clusters, …)
Foundational Ideas
• Policy-based data management
  – Essential for supporting scalable data-driven science
• Sustainability
  – Underpins the stages of the data life cycle through repurposing of collections
• Extensibility
  – Essential for incorporating new technologies and new research domains
• Federation
  – Mechanism for building collaboration environments and implementing long-term sustainability
• Enabling Reproducible Science
  – Support researchers by managing data, workflows, and the collaboration environment, and by sharing data and workflows
DFC Prototype Realization
• Build on iRODS Data Grid Software
  – Support for heterogeneous resource access, multiple data movement protocols, integrated handling for system and descriptive metadata, provenance management, seamless federation capability, full-fledged data management capability, rule-based policy execution, extensibility through micro-services, orchestration of internal and external workflows
• Extend with Domain-specific Software & Resource Access
  – Hydrology: support for programs, functions, services, multiple data collection access for hydrology workflows
  – Marine: support for sensor data preservation (snapshots) and replay, access to marine memoizing functions, federated access to national repositories
  – Engineering: support for format registry, format verifications, model conversions and integration with CIBER-U repositories and tools
DFC Federation Technology
• Data grid – sustainable & extensible policy-oriented data management
  – Build shared name spaces
  – Provide distributed data management functions
  – Enforce administration and usage through policies
• Federated data grids
  – Cross-register users and resources across data grids
• Soft links
  – Register data from an external data management system, accessed through its protocol
• Workflow integration
  – Register workflows into data grid for storage-side procedures
  – Integrate data management workflows with external workflows
  – Gather provenance as workflow is executed
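Cross-registering users between data grids can be sketched in Python as below. The `DataGrid` class, zone names, users, and paths are all hypothetical, and real federation involves authentication between servers that this toy model ignores; it only shows that a grid must know a remote identity before access can be granted to it.

```python
class DataGrid:
    """Toy data grid (zone) with local users, cross-registered remote users, and read ACLs."""

    def __init__(self, zone: str) -> None:
        self.zone = zone
        self.known_users = set()  # (user, home_zone) pairs this grid recognizes
        self.read_acl = {}        # logical path -> set of (user, home_zone)

    def add_local_user(self, user: str) -> None:
        self.known_users.add((user, self.zone))

    def cross_register(self, user: str, home_zone: str) -> None:
        # Federation step: record a user whose identity is managed by another grid.
        self.known_users.add((user, home_zone))

    def grant_read(self, path: str, user: str, home_zone: str) -> None:
        if (user, home_zone) not in self.known_users:
            raise KeyError(f"{user}#{home_zone} is not registered in {self.zone}")
        self.read_acl.setdefault(path, set()).add((user, home_zone))

    def can_read(self, path: str, user: str, home_zone: str) -> bool:
        return (user, home_zone) in self.read_acl.get(path, set())


hydro = DataGrid("hydroZone")
hydro.add_local_user("carol")
hydro.cross_register("alice", "marineZone")
hydro.grant_read("/hydroZone/models/run1", "alice", "marineZone")

alice_ok = hydro.can_read("/hydroZone/models/run1", "alice", "marineZone")
bob_ok = hydro.can_read("/hydroZone/models/run1", "bob", "marineZone")
```

Each grid keeps its own access control, so federation here extends who can be named in a policy without handing administration to the other grid.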
Sequence of Technology Activities
1. Support Applications in Collaborator Communities
  – Automate analyses
2. Facilitate Cross-Domain Applications
  – Support workflow execution across domains
3. Establish end-to-end data life cycle management
  – Support preservation of reference collections
DFC Federation
[Figure: DFC Snow-flake Federation, with labels Federation Hub, AdHoc Inter-Domain Federation, and Federation to External Grids]
DFC Federation Grid Status (Phase-1)
Grids: DFC-HUB, DFC-ENGG, DFC-HYDRO, DFC-MARINE
• DFC-Marine
  – Administration: UCSD
  – Metadata: UCSD
  – Data Resc: UCSD
  – Replica Resc: RENCI
  – Ingestion Resc: Oregon
  – Ingestion Resc: Rutgers
  – Workflow Resc: ALL
  – Rule Engine: UCSD
  – Message Hub: UCSD
• DFC-Hydrology
  – Administration: RENCI
  – Metadata: RENCI
  – Data Resc: USC
  – Data Resc: NCDC
  – Replica Resc: RENCI
  – Workflow Resc: ALL
  – Rule Engine: RENCI
  – Message Hub: RENCI
• DFC-Engineering
  – Administration: Drexel
  – Metadata: Drexel
  – Data Resc: Drexel
  – Replica Resc: RENCI
  – Workflow Resc: ALL
  – Rule Engine: RENCI
  – Message Hub: Drexel
• DFC-Federation Hub
  – Administration: RENCI
  – Metadata: RENCI
  – Data Resc: RENCI
  – Replica Resc: ITS-UNC
  – Workflow Resc: ALL
  – Rule Engine: RENCI
  – Message Hub: RENCI
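The Phase-1 grid status above is easier to query when written as data. The Python sketch below copies the role assignments from the slide into a dictionary; the helper names `grids_where` and `offsite_replica` are hypothetical, and "off-site" here simply means the replica resource is at a different site label than every data resource.

```python
# Role assignments transcribed from the Phase-1 status listing.
GRID_STATUS = {
    "DFC-Marine": {
        "Administration": ["UCSD"], "Metadata": ["UCSD"], "Data Resc": ["UCSD"],
        "Replica Resc": ["RENCI"], "Ingestion Resc": ["Oregon", "Rutgers"],
        "Workflow Resc": ["ALL"], "Rule Engine": ["UCSD"], "Message Hub": ["UCSD"],
    },
    "DFC-Hydrology": {
        "Administration": ["RENCI"], "Metadata": ["RENCI"],
        "Data Resc": ["USC", "NCDC"], "Replica Resc": ["RENCI"],
        "Workflow Resc": ["ALL"], "Rule Engine": ["RENCI"], "Message Hub": ["RENCI"],
    },
    "DFC-Engineering": {
        "Administration": ["Drexel"], "Metadata": ["Drexel"], "Data Resc": ["Drexel"],
        "Replica Resc": ["RENCI"], "Workflow Resc": ["ALL"],
        "Rule Engine": ["RENCI"], "Message Hub": ["Drexel"],
    },
    "DFC-Federation Hub": {
        "Administration": ["RENCI"], "Metadata": ["RENCI"], "Data Resc": ["RENCI"],
        "Replica Resc": ["ITS-UNC"], "Workflow Resc": ["ALL"],
        "Rule Engine": ["RENCI"], "Message Hub": ["RENCI"],
    },
}


def grids_where(role: str, site: str) -> list:
    """Names of grids in which `site` fills `role`, sorted alphabetically."""
    return sorted(g for g, roles in GRID_STATUS.items() if site in roles.get(role, []))


def offsite_replica(grid: str) -> bool:
    """True when no replica resource shares a site label with a data resource."""
    roles = GRID_STATUS[grid]
    return not set(roles["Replica Resc"]) & set(roles["Data Resc"])


rule_engines_at_renci = grids_where("Rule Engine", "RENCI")
all_offsite = all(offsite_replica(g) for g in GRID_STATUS)
```

With the listing in this form, questions such as which grids depend on a given site for a given role reduce to one-line lookups.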
Federation of Federations
[Figure: federation of federations. Recovered labels: DFC Federation (DFC-HUB, DFC-ENGG, DFC-HYDRO, DFC-MARINE); RENCI Federation; RENCI-VO; NARA-RENCI; TDLC; ASGC; TIP-DUKE; TACC; TeraGrid; CoopeUs; DFC-Learning; DFC-Sociology; DFC-Biology; NCDC; EU-DAT]
Iterative Software Development
• Identify Requirement
  – Working closely with S&E partner
  – Small, semantically well-defined functionality
• Design with iRODS Framework
  – With feedback from S&E partner
  – High-level design of Resource Drivers, micro-services, rules or client integration
• Construct and Perform Unit & Integration Testing
  – Technology team with some liaison with S&E partner
  – Using iRODS Coding, Testing & Documentation Practices
• Apply & Tune
  – Working closely with S&E partner
• Transition into Production Release
Technology Success Metrics (Apr 2013)
• Support Applications in S&E Partner Communities
  – Hydrology: Federate access to data from CUAHSI, NCDC, NASA and other resources; Automate VIC Workflow & Show Reproducibility of Results
  – Engineering: Integrate format registry, model conversion, format verification & metadata extraction services; CIBER-U access to DFC data
  – Marine: Facilitate sensor data preservation (snapshots) into DFC (possibly at NCDC); Wrap OOI memoizing functions to provide real-time access to marine sensor data and services
DFC Sustainability Metrics
• Create community resources for science & engineering
  – Facility for Reference Collections
  – Facility for Preserving “At Risk” and “One of a kind” Data
  – Facility for “provenance-supported” data
  – Facility to discover, access & apply cross-domain data
• Create research environment for collaborations
  – Enable Reuse & Repurposing of Data Collections: add more micro-services & workflows
  – Provision Value-added Services for Industry-related Capabilities: compartmentalized privacy & security policies
  – Extensible, Modular, Scalable, Technology-agnostic & Policy-oriented full data-life-cycle services for Academia & Research
Questions?
DFC: A Services Oriented Architecture
RENCI-DFC Network Connectivity