integrated Rule Oriented Data System
Tutorial: iRODS Capabilities
Outline
Introduction to iRODS capabilities
Data-driven science and full Data Life Cycle
Policy-based Management of Distributed Data
Scaling: petabytes, 100s of millions of files
Enabling unified sharable "virtual" collections
Enabling data grids (sharing), digital libraries (publishing), persistent archives (preservation)
Unified Data Space: Interoperate via Federation
Introduction to iRODS Capabilities
Data Driven Science
• Enable new science through collaborative research on shared data collections
  • Management of entire scientific data life cycle from data analysis pipelines to long-term sustainability of reference collections
• Implement national scale data cyber-infrastructure
  • Federation of exemplar data management technologies in exemplar research initiatives
• Creation of production data management systems
  • Proven technology implemented in extant data grids
• Integrate “live” research data collections into education initiatives
  • Policy-based data management across distributed data
Project
Shared Collection
Processing Pipeline
Digital Library
Reference Collection
Federation
Data Life Cycle
Data are Inherently Distributed
• Distributed sources
  • Projects span multiple institutions, nations
• Distributed analysis platforms
  • Grid computing
• Distributed data storage
  • Minimize risk of data loss, optimize access
• Distributed users
  • Caching of data near user
• Multiple stages of data life cycle
  • Data repurposing for use in broader context
Cloud Storage
Institutional Repositories
Federal Repositories
Carolina Digital Repository
Texas Digital Library
National Climatic Data Center
National Optical Astronomy Observatory
Data Processing Pipelines
Preservation Environment
Ocean Observatories Initiative
NARA Transcontinental Persistent Archive Prototype
Carolina Digital Repository
Large Synoptic Survey Telescope
Digital Library
Texas Digital Library
French National Library
Data Grid
Teragrid
Temporal Dynamics of Learning Center
Australian Research Collaboration Service
Taiwan National Archive
Data Life Cycle
Project Collection: Private, Local Policy
Data Grid: Shared, Distribution Policy
Digital Library: Published, Description Policy
Data Processing Pipeline: Analyzed, Service Policy
Reference Collection: Preserved, Representation Policy
Federation: Sustained, Re-purposing Policy
Each stage adds new policies for a broader community
Virtualize the stages of data life cycle through evolution of policies
Interoperability across data life cycle representations
Each stage of the data life cycle re-purposes the original collection
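The stage, state, and policy triples on this slide can be written down as a small table in code. The sketch below is illustrative only: it assumes the policies accumulate in the order the stages are listed, and the function name `policies_in_force` is hypothetical, not part of iRODS.

```python
# (stage, collection state, policy added at that stage), in slide order.
LIFE_CYCLE = [
    ("Project Collection", "Private", "Local Policy"),
    ("Data Grid", "Shared", "Distribution Policy"),
    ("Digital Library", "Published", "Description Policy"),
    ("Data Processing Pipeline", "Analyzed", "Service Policy"),
    ("Reference Collection", "Preserved", "Representation Policy"),
    ("Federation", "Sustained", "Re-purposing Policy"),
]


def policies_in_force(stage_name):
    """Return the policies accumulated up to and including the named stage.

    Assumption: each stage keeps the earlier policies and adds its own.
    """
    policies = []
    for stage, _state, policy in LIFE_CYCLE:
        policies.append(policy)
        if stage == stage_name:
            return policies
    raise KeyError(stage_name)


if __name__ == "__main__":
    print(policies_in_force("Digital Library"))
```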
Tracing the Data Life Cycle
• Collection Creation using a Data Grid
  • Data manipulation / Data ingestion
• Processing Pipelines
  • Pipeline processing / Environment administration
• Data Grid
  • Policy display / Micro-service display / State information display / Replication
• Digital Library
  • Access / Containers / Metadata browsing / Visualization
• Preservation Environment
  • Validation / Audit / Federation / Deep Archive / SHAMAN
Goal - Generic Infrastructure
• Manage all stages of the data life cycle
  • Data organization
  • Data processing pipelines
  • Collection creation
  • Data sharing
  • Data publication
  • Data preservation
• Create reference collection against which future information and knowledge are compared
  • Each stage uses similar storage, arrangement, description, and access mechanisms
Concept Roadmap
• Purpose - reason a collection is assembled
• Properties - attributes needed to ensure the purpose
• Policies - enforce and maintain required properties
• Procedures - computer functions to implement Policies
• State information - results of applying procedures (iCAT)
• Assessment criteria - validate that state information conforms to desired purpose
• Federation - interoperate with shared logical name spaces
• These are the required elements for data life cycle virtualization
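The roadmap elements can be tied together in a few lines of code. This is a minimal sketch under stated assumptions, not iRODS code: the `Collection` class, the `on_ingest` event name, and the choice of a checksum as the required property are all hypothetical, and a plain dictionary stands in for the iCAT.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Collection:
    purpose: str                                # reason the collection is assembled
    required_properties: list                   # attributes needed to ensure the purpose
    state: dict = field(default_factory=dict)   # state information (stand-in for iCAT)


def checksum_procedure(collection, name, data):
    """Procedure: a function whose result is recorded as state information."""
    attrs = collection.state.setdefault(name, {})
    attrs["checksum"] = hashlib.sha256(data).hexdigest()


# Policy: which procedures run at which event to maintain a property.
POLICIES = {"on_ingest": [checksum_procedure]}


def ingest(collection, name, data):
    """Apply every procedure that the ingest policy names."""
    for procedure in POLICIES["on_ingest"]:
        procedure(collection, name, data)


def assess(collection):
    """Assessment criterion: every file's state holds each required property."""
    return all(
        prop in attrs
        for attrs in collection.state.values()
        for prop in collection.required_properties
    )


if __name__ == "__main__":
    c = Collection("reference collection", ["checksum"])
    ingest(c, "survey.dat", b"example bytes")
    print(assess(c))
```

Adding a property that no procedure maintains makes `assess` fail, which is the point of keeping assessment separate from the procedures.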
Policy-based Management
• Each data life cycle stage is driven by extensions of management policies to address broader user communities
  • Data arrangement <--> Project policies
  • Data analysis <--> Processing pipeline standards
  • Data sharing <--> Research collaborations
  • Data publication <--> Discipline standards
  • Data preservation <--> Reference collection
• Reference collections need to be preserved and interpretable by future generations, the most stringent standard
  • Data grids - integrated Rule Oriented Data System
iRODS - Policy-based Management
• Turn Policies into computer-actionable Rules
• Compose Rules by chaining Micro-services
• Manage state information (in iCAT metadata catalog) as attributes on namespaces:
  • Files / collections / users / resources / rules
• Validate assessment criteria
  • Queries on state information, parsing audit trails
• Automate administrative functions
  • Enable scaling to today's massive collections
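A rule built by chaining micro-services can be sketched in plain Python. This is not the iRODS rule language. The `Rule` class and the two micro-service functions are hypothetical stand-ins that show a condition, a chain, recorded state, and an audit trail.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Context = Dict[str, object]


def choose_resource(ctx: Context) -> None:
    """Hypothetical micro-service: pick a storage resource for the file."""
    ctx["resource"] = "demoResc"


def record_state(ctx: Context) -> None:
    """Hypothetical micro-service: save state information as attributes."""
    ctx["state"] = {"size": ctx["size"], "resource": ctx["resource"]}


@dataclass
class Rule:
    name: str
    condition: Callable[[Context], bool]
    chain: List[Callable[[Context], None]]

    def apply(self, ctx: Context, audit: List[str]) -> bool:
        """Run the chained micro-services in order if the condition holds."""
        if not self.condition(ctx):
            audit.append(f"skip {self.name}")
            return False
        for micro_service in self.chain:
            micro_service(ctx)
            audit.append(f"{self.name}: {micro_service.__name__}")
        return True


# A rule that fires only for non-empty files.
PUT_RULE = Rule(
    name="on_put",
    condition=lambda ctx: ctx["size"] > 0,
    chain=[choose_resource, record_state],
)

if __name__ == "__main__":
    trail: List[str] = []
    PUT_RULE.apply({"path": "irm.c", "size": 2048}, trail)
    print(trail)
```

The audit list is what an assessment step would later parse to confirm that the policy was applied.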
Overview of iRODS Architecture
User w/Client: Can Search, Access, Add and Manage Data & Metadata
Access distributed data with Web-based Browser or iRODS GUI or Command Line clients.
iRODS Data System
iRODS Data Server: Disk, Tape, etc.
iRODS Metadata Catalog: Track information
iRODS Rule Engine: Tracks Policies
iput ../src/irm.c - Checks 10 Policy hooks when file put into iRODS
brick14:10900:ApplyRule#116:: acChkHostAccessControl
brick14:10900:GotRule#117:: acChkHostAccessControl
brick14:10900:ApplyRule#118:: acSetPublicUserPolicy
brick14:10900:GotRule#119:: acSetPublicUserPolicy
brick14:10900:ApplyRule#120:: acAclPolicy
brick14:10900:GotRule#121:: acAclPolicy
brick14:10900:ApplyRule#122:: acSetRescSchemeForCreate
brick14:10900:GotRule#123:: acSetRescSchemeForCreate
brick14:10900:execMicroSrvc#124:: msiSetDefaultResc(demoResc,null)
brick14:10900:ApplyRule#125:: acRescQuotaPolicy
brick14:10900:GotRule#126:: acRescQuotaPolicy
brick14:10900:execMicroSrvc#127:: msiSetRescQuotaPolicy(off)
brick14:10900:ApplyRule#128:: acSetVaultPathPolicy
brick14:10900:GotRule#129:: acSetVaultPathPolicy
brick14:10900:execMicroSrvc#130:: msiSetGraftPathScheme(no,1)
brick14:10900:ApplyRule#131:: acPreProcForModifyDataObjMeta
brick14:10900:GotRule#132:: acPreProcForModifyDataObjMeta
brick14:10900:ApplyRule#133:: acPostProcForModifyDataObjMeta
brick14:10900:GotRule#134:: acPostProcForModifyDataObjMeta
brick14:10900:ApplyRule#135:: acPostProcForCreate
brick14:10900:GotRule#136:: acPostProcForCreate
brick14:10900:ApplyRule#137:: acPostProcForPut
brick14:10900:GotRule#138:: acPostProcForPut
brick14:10900:GotRule#139:: acPostProcForPut
brick14:10900:GotRule#140:: acPostProcForPut
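The count of policy hooks can be checked by parsing the trace. The sketch below copies the trace entries from the slide, one per line, and assumes the first two colon-separated fields are a host name and a process identifier. The `summarize` helper is hypothetical and not an iRODS tool.

```python
import re

TRACE = """\
brick14:10900:ApplyRule#116:: acChkHostAccessControl
brick14:10900:GotRule#117:: acChkHostAccessControl
brick14:10900:ApplyRule#118:: acSetPublicUserPolicy
brick14:10900:GotRule#119:: acSetPublicUserPolicy
brick14:10900:ApplyRule#120:: acAclPolicy
brick14:10900:GotRule#121:: acAclPolicy
brick14:10900:ApplyRule#122:: acSetRescSchemeForCreate
brick14:10900:GotRule#123:: acSetRescSchemeForCreate
brick14:10900:execMicroSrvc#124:: msiSetDefaultResc(demoResc,null)
brick14:10900:ApplyRule#125:: acRescQuotaPolicy
brick14:10900:GotRule#126:: acRescQuotaPolicy
brick14:10900:execMicroSrvc#127:: msiSetRescQuotaPolicy(off)
brick14:10900:ApplyRule#128:: acSetVaultPathPolicy
brick14:10900:GotRule#129:: acSetVaultPathPolicy
brick14:10900:execMicroSrvc#130:: msiSetGraftPathScheme(no,1)
brick14:10900:ApplyRule#131:: acPreProcForModifyDataObjMeta
brick14:10900:GotRule#132:: acPreProcForModifyDataObjMeta
brick14:10900:ApplyRule#133:: acPostProcForModifyDataObjMeta
brick14:10900:GotRule#134:: acPostProcForModifyDataObjMeta
brick14:10900:ApplyRule#135:: acPostProcForCreate
brick14:10900:GotRule#136:: acPostProcForCreate
brick14:10900:ApplyRule#137:: acPostProcForPut
brick14:10900:GotRule#138:: acPostProcForPut
brick14:10900:GotRule#139:: acPostProcForPut
brick14:10900:GotRule#140:: acPostProcForPut
"""

# Assumed layout: host:pid:event#sequence:: name[(arguments)]
ENTRY = re.compile(
    r"^(?P<host>[^:\s]+):(?P<pid>\d+):(?P<event>\w+)#(?P<seq>\d+)::\s*(?P<name>\w+)",
    re.MULTILINE,
)


def summarize(trace):
    """Return (distinct rules applied, micro-services executed), in trace order."""
    hooks, micro_services = [], []
    for entry in ENTRY.finditer(trace):
        if entry["event"] == "ApplyRule" and entry["name"] not in hooks:
            hooks.append(entry["name"])
        elif entry["event"] == "execMicroSrvc":
            micro_services.append(entry["name"])
    return hooks, micro_services


if __name__ == "__main__":
    hooks, micro_services = summarize(TRACE)
    print(len(hooks), "policy hooks;", len(micro_services), "micro-services")
```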
Scale of iRODS Data Grid
• Number of files
  • Desktop to 10s to 100s of millions of files
• Size of data
  • Desktop to 100s of terabytes to petabytes of data
• Number of policy enforcement points
  • 64 actions define when policies are checked
• System state information
  • 112 metadata attributes of system information per file
• Number of functions
  • 185 composable iRODS Micro-services
• Number of storage systems that are linked
  • Desktop to 10s to 100 storage resources
• Number of data grids that can interoperate
  • Federation of 10s of data grids
iRODS Shows Unified “Virtual Collection”
User With Client: Views & Manages Data
User Sees Single “Virtual Collection”
My Data: Disk, Tape, Database, Filesystem, etc.
Project Data: Disk, Tape, Database, Filesystem, etc.
Reference Data: Remote Disk, Tape, Filesystem, etc.
The iRODS Data System can install in a “layer” over existing or new data, letting you view, manage, and share part or all of diverse data in a unified Collection.
Organize Distributed Data into a Sharable "Virtual" Collection
• Project repository
  • MotifNet - manage collection of analysis products
• Institutional repository
  • Carolina Digital Repository for UNC collections
• Regional collaboration
  • RENCI Data Grid linking resources across North Carolina
• National collaboration
  • NSF Temporal Dynamics of Learning Center
  • Australian Research Collaboration Service
• National Library
  • French National Library
• National Archive
  • NARA Transcontinental Persistent Archive Prototype, Taiwan
• International collaboration
  • BaBar High Energy Physics (SLAC-IN2P3)
  • National Optical Astronomy Observatory (Chile-US)
Infrastructure Independence
• Manage properties of the collection independently of the choice of technology
  • Access, authentication, authorization, description, location, distribution, replication, integrity, retention
• Enforce policies globally at all storage locations
  • Rule Engine resident at each storage site
• Apply procedures at each remote storage site
  • Chain encapsulated operations into workflows
• Infrastructure independence enables evolution to new technology without interruption
  • Integrate new access methods, new storage systems, new network protocols, new authentication systems
Data Virtualization
Data Grid
Access Interface
• Map from actions requested by access method to standard set of iRODS Micro-services.
Standard Micro-services
• Map standard Micro-services to standard operations.
Standard Operations
• Map the operations to protocol supported by operating system.
Storage Protocol
Storage System
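The three mappings in this diagram can be sketched as chained lookup tables. Every name in the tables below (actions, micro-services, operations, storage systems) is a hypothetical placeholder chosen for illustration, and none is an actual iRODS identifier.

```python
# Layer 1: action requested by an access method -> standard micro-services.
ACTION_TO_MICRO_SERVICES = {
    "upload": ["store_bytes", "register_replica"],
    "download": ["fetch_bytes"],
}

# Layer 2: micro-service -> standard operations.
MICRO_SERVICE_TO_OPERATIONS = {
    "store_bytes": ["open", "write", "close"],
    "register_replica": ["stat"],
    "fetch_bytes": ["open", "read", "close"],
}

# Layer 3: standard operation -> call in the protocol of one storage system.
STORAGE_PROTOCOLS = {
    "unix_fs": {
        "open": "open()", "read": "read()", "write": "write()",
        "close": "close()", "stat": "stat()",
    },
    "tape_archive": {
        "open": "stage", "read": "get", "write": "put",
        "close": "release", "stat": "query",
    },
}


def plan(action, storage):
    """Translate one client action into protocol calls for a storage system."""
    protocol = STORAGE_PROTOCOLS[storage]
    return [
        protocol[operation]
        for micro_service in ACTION_TO_MICRO_SERVICES[action]
        for operation in MICRO_SERVICE_TO_OPERATIONS[micro_service]
    ]


if __name__ == "__main__":
    print(plan("download", "unix_fs"))
    print(plan("upload", "tape_archive"))
```

Supporting a new storage system means adding one entry to the last table; the two upper layers stay unchanged.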
Data Grid Security
• Manage global name spaces for:
  • {users, files, storage}
• Assign access controls as constraints imposed between two logical name spaces
  • Access controls remain invariant as files are moved within the data grid
  • Controls on: Files / Storage systems / Metadata
• Authenticate each user access
  • PKI, Kerberos, challenge-response, Shibboleth
  • Use internal or external identity management system
• Authorize all operations
  • ACLs (Access Control Lists) on users and groups
  • Separate condition for execution of each Rule
  • Internal approval flags (e.g. IRB) within a Rule
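Access controls stay invariant under file movement because they are keyed on logical names rather than physical locations. The toy model below is a sketch under that assumption. The `DataGrid` class, its methods, and the paths are hypothetical and do not reflect the iRODS API.

```python
class DataGrid:
    """Toy model: ACLs relate the logical user and logical file name spaces."""

    def __init__(self):
        self.location = {}  # logical file name -> storage resource
        self.acl = {}       # (user name, logical file name) -> set of permissions

    def register(self, owner, logical_name, resource):
        """Record where a file is stored and give its owner full access."""
        self.location[logical_name] = resource
        self.acl[(owner, logical_name)] = {"read", "write", "own"}

    def grant(self, user, logical_name, permission):
        self.acl.setdefault((user, logical_name), set()).add(permission)

    def move(self, logical_name, new_resource):
        # Only the physical location changes; the ACL table is not touched.
        self.location[logical_name] = new_resource

    def is_allowed(self, user, logical_name, permission):
        return permission in self.acl.get((user, logical_name), set())


if __name__ == "__main__":
    grid = DataGrid()
    grid.register("alice", "/zone/home/alice/run1.dat", "disk_resource")
    grid.grant("bob", "/zone/home/alice/run1.dat", "read")
    grid.move("/zone/home/alice/run1.dat", "tape_resource")
    print(grid.is_allowed("bob", "/zone/home/alice/run1.dat", "read"))
```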
NOAO Zone Architecture
Archive
Telescope Telescope
Ocean Observatories Initiative
Sensors
Cloud Computing
External Repositories
Cloud Storage Cache
Message Bus
Aggregate sensor data in cache
SuperComputer
Event Detection
Remote locations
Simulations
Digital Library
Archive
Clients
Remote Users
iRODS Data Grid
Multiple Protocols
Large-scale workflows from real-time data to steerable instruments, digital library.
Access: Data Grid Clients
Clients listed by API as: Client (Developer)
Browser
• DCAPE (UNC)
• iExplore (DICE-Bing Zhu)
• JUX (IN2P3)
• Peta Web browser (PetaShare)
Digital Library
• Akubra/iRODS (DICE)
• Dspace (MIT)
• Fedora on Fuse (IN2P3)
• Fedora/iRODS module (DICE)
• Islandora (DICE)
File System
• Davis - Webdav (ARCS)
• Dropbox / iDrop (DICE-Mike Conway)
• FUSE (IN2P3, DICE)
• FUSE optimization (PetaShare)
• OpenDAP (ARCS)
• PetaFS (Fuse) (Petashare - LSU)
• Petashell (Parrot) (PetaShare)
Grid
• GridFTP - Griffin (ARCS)
• Jsaga (IN2P3)
• Parrot (Notre Dame-Doug Thain)
• Saga (KEK)
I/O Libraries
• PHP - DICE (DICE-Bing Zhu)
• C API (DICE-Mike Wan)
• C I/O library (DICE-Wayne Schroeder)
• Jargon (DICE-Mike Conway)
• Pyrods - Python (SHAMAN-Jerome Fusillier)
Portal
• iDrop (DICE-Mike Conway)
• EnginFrame (NICE / RENCI)
Tools
• Archive tools-NOAO (NOAO)
• Big Board visualization (RENCI)
• File-format-identifier (GA Tech)
• icommands (DICE)
• Pcommands (PetaShare)
• Resource Monitoring (IN2P3)
• Sync-package (Academia Sinica)
• URSpace (Teldap - Academia Sinica)
Web Service
• VOSpace (NVOA)
• Shibboleth (King's College)
Workflows
• Kepler (DICE)
• Stork (LSU)
• Taverna (RENCI)
iRODS Distributed Data Management
Towards a Unified Data Space
• Sharing data across Space
  • Organize data as a shared "virtual" Collection
  • Define unifying properties for the Collection
• Sharing data across Time
  • Preservation is communication with the future
  • Preservation validates communication from the past
• Managing full Data Life Cycle
  • Evolution of the Policies that govern a data Collection at each stage of the life cycle
  • From data creation, to collection, to publication, to reference collection, to analysis pipeline
Intellectual Property
• Given generic infrastructure, intellectual property resides in the Policies and procedures that manage the Collection
  • Consistency of the Policies
  • Capabilities of the procedures
  • Automation of internal Policy assessment
  • Validation of desired Collection properties
  • Automation of administrative tasks
• Interacting with DataDirect Networks, HP, IBM, Microsoft on commercial application of open source technology.
Societal Impact
• Many communities are assembling digital holdings that represent an emerging consensus:
  • Common meaning associated with the data
  • Common interpretation of the data
  • Common data manipulation mechanisms
• The development of a consensus is described as
  • Socialization of Collections
  • An example is Trans-border Urban Planning
Federation of Collections
Social consensus for sharing data, policies, methods, practice
• Each community controls their own collection Policies
  • Policies enforced at each storage location
• Explicit computer-actionable rules control type of federation interactions
  • e.g. peer-to-peer, central archive, master-slave data distribution, chained data grids, deep archives
Interoperability mechanisms support technology integration
• Community specific clients
• Bulk data export / import
• Cross registration of data
• Structured information resource drivers
Data Grid Federation
• Motivation
  • Improve performance, scalability, and independence
• To initiate Federation, each Data Grid administrator establishes trust and creates a remote user
  • iadmin mkzone B remote Host:Port
  • iadmin mkuser rods#B rodsuser
• Use cases
  • Chained data grids - National Optical Astronomy Observatory
  • Master-slave data grids - NIH BIRN
  • Central archive - UK e-Science
  • Deep archive - NARA TPAP
  • Replication - NSF Teragrid
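Because each administrator runs the same pair of commands with the other zone's details, the symmetry can be shown with a small helper. The sketch below only builds the two `iadmin` command lines quoted on this slide and does not execute them. The zone names, host names, and port number are placeholders, and `federation_commands` is a hypothetical helper.

```python
def federation_commands(remote_zone, remote_host, remote_port, remote_user="rods"):
    """Build the two command lines one administrator runs for a remote zone.

    Mirrors the slide: `iadmin mkzone B remote Host:Port` and
    `iadmin mkuser rods#B rodsuser`. Nothing is executed here.
    """
    return [
        ["iadmin", "mkzone", remote_zone, "remote", f"{remote_host}:{remote_port}"],
        ["iadmin", "mkuser", f"{remote_user}#{remote_zone}", "rodsuser"],
    ]


if __name__ == "__main__":
    # Placeholder zones and hosts; the port is an assumed example value.
    zones = {"A": ("a.example.org", 1247), "B": ("b.example.org", 1247)}
    for local, remote in (("A", "B"), ("B", "A")):
        host, port = zones[remote]
        print(f"# run by the administrator of zone {local}")
        for argv in federation_commands(remote, host, port):
            print(" ".join(argv))
```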
Accessing Data in Federated iRODS
Two federated iRODS data grids
iRODS/ICAT system at University of North Carolina at Chapel Hill (renci zone)
iRODS/ICAT system at University of Texas at Austin (tacc zone)
Federated irodsUser (use iRODS clients)
“Finds the data”
“With access permissions”
“Gets data to user”
Federated irodsUsers can upload, download, replicate, share, manage & track access to some or all data (depending on access permissions) in either zone.
Development Team
• DICE team
  • Arcot Rajasekar - iRODS Development Lead
  • Mike Wan - iRODS Chief Architect
  • Wayne Schroeder - iRODS Product Mgr., Sr. Developer
  • Bing Zhu - Fedora, Windows
  • Mike Conway - Java (Jargon)
  • Paul Tooby - Documentation, Foundation
  • Sheau-Yen Chen - Data Grid Administration
  • Reagan Moore - PI
• Preservation
  • Richard Marciano - Preservation Development Lead
  • Chien-Yi Hou - Preservation Micro-services
  • Antoine de Torcy - Preservation Micro-services
Foundation
• Data Intensive Cyber Environments Foundation
• Nonprofit open source software development
• Promotes use of iRODS technology
• Supports standards efforts, intellectual property
• Coordinates international development efforts
  • IN2P3 - quota and monitoring system
  • King’s College London - Shibboleth
  • Australian Research Collaboration Service - WebDAV
  • Academia Sinica - SRM interface
• More information: http://diceresearch.org
iRODS Wiki
• More information…
  • http://irods.diceresearch.org
  • Descriptions, tutorials, documentation
  • Publications / presentations
  • Download of iRODS open source software
  • Performance tests
  • irods-chat page