data grids and data management

40
Data Grids and Data Data Grids and Data Management Management Storage Resource Broker Storage Resource Broker Reagan W. Moore Reagan W. Moore [email protected] http://www.sdsc.edu/srb http://www.sdsc.edu/srb

Upload: brody

Post on 05-Jan-2016

54 views

Category:

Documents


5 download

DESCRIPTION

Data Grids and Data Management. Storage Resource Broker. Reagan W. Moore [email protected] http://www.sdsc.edu/srb. Topics. Data management evolution Shared collections Digital Libraries Persistent Archives Building shared collections Project level / National level / International - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Data Grids and Data Management

Data Grids and Data ManagementData Grids and Data Management

Storage Resource BrokerStorage Resource Broker

Reagan W. MooreReagan W. Moore

[email protected]

http://www.sdsc.edu/srbhttp://www.sdsc.edu/srb

Page 2: Data Grids and Data Management

TopicsTopics

• Data management evolution• Shared collections• Digital Libraries• Persistent Archives

• Building shared collections• Project level / National level / International

• Demonstration of shared collections• Access to collections at SDSC

Page 3: Data Grids and Data Management

Date

ProjectGBs of data

stored

1000’s of files

GBs of data

stored

1000’s of files

Users with

ACLs

GBs of data

stored

1000’s of files

Users with

ACLsData Grid

NSF / NVO 17,800 5,139 51,380 8,690 80 100,990 13,217 100

NSF / NPACI 1,972 1,083 17,578 4,694 380 34,830 7,239 380

Hayden 6,800 41 7,201 113 178 8,013 161 227

Pzone 438 31 812 47 49 23,099 13,287 68

NSF / LDAS-SALK 239 1 4,562 16 66 115,178 146 67

NSF / SLAC-JCSG 514 77 4,317 563 47 17,095 1,775 55

NSF / TeraGrid 80,354 685 2,962 202,226 4,443 3,267

NIH / BIRN 5,416 3,366 148 16,288 15,306 361

Digital Library

NSF / LTER 158 3 233 6 35 236 34 36

NSF / Portal 33 5 1,745 48 384 2,620 53 460

NIH / AfCS 27 4 462 49 21 733 94 21

NSF / SIO Explorer 19 1 1,734 601 27 2,605 1,121 27

NSF / SCEC 15,246 1,737 52 167,140 3,471 73

Persistent Archive

NARA 7 2 63 81 58 2,916 2,004 58

NSF / NSDL 2,785 20,054 119 5,653 50,600 136

UCSD Libraries 127 202 29 190 208 29

NHPRC / PAT 1,338 519 28

TOTAL 28 TB 6 mil 194 TB 40 mil 4,635 701 TB 113 mil 5,393

5/17/02 6/30/04 5/5/06

Page 4: Data Grids and Data Management

Data GridData Grid

Using a Data Grid – Using a Data Grid – in Abstractin Abstract

Ask for d

ata

•User asks for data from the data grid

Data d

elivere

d

•The data is found and returned•Where & how details are hidden

Page 5: Data Grids and Data Management

Using a Data Grid - Using a Data Grid - DetailsDetails

Storage Resource Broker

•Data request goes to SRB Server

Storage Resource Broker

Metadata Catalog

DB

•Server looks up data in catalog

•Catalog tells which SRB server has data

•1st server asks 2nd for data

•The data is found and returned

•User asks for data

Page 6: Data Grids and Data Management

Using a Data Grid - Using a Data Grid - DetailsDetails

SRB

MCAT

DB

SRB

SRB

SRB

SRB SRB

•Data Grid has arbitrary number of servers•Heterogeneity is hidden from users

Page 7: Data Grids and Data Management

Shared CollectionsShared Collections

• Data grids support the creation of shared collections that may be distributed across multiple institutions, sites, and storage systems.

• Digital libraries publish data, and provide services for discovery and display

• Persistent archives preserve data, managing the migration to new technology

Page 8: Data Grids and Data Management

Date

ProjectGBs of data

stored

1000’s of files

GBs of data

stored

1000’s of files

Users with

ACLs

GBs of data

stored

1000’s of files

Users with

ACLsData Grid

NSF / NVO 17,800 5,139 51,380 8,690 80 93,252 11,189 100

NSF / NPACI 1,972 1,083 17,578 4,694 380 34,452 7,235 380

Hayden 6,800 41 7,201 113 178 8,013 161 227

Pzone 438 31 812 47 49 19,674 10,627 68

NSF / LDAS-SALK 239 1 4,562 16 66 104,494 131 67

NSF / SLAC-JCSG 514 77 4,317 563 47 15,703 1,666 55

NSF / TeraGrid 80,354 685 2,962 195,012 4,071 3,267

NIH / BIRN 5,416 3,366 148 13,597 13,329 351

Digital Library

NSF / LTER 158 3 233 6 35 236 34 36

NSF / Portal 33 5 1,745 48 384 2,620 53 460

NIH / AfCS 27 4 462 49 21 733 94 21

NSF / SIO Explorer 19 1 1,734 601 27 2,452 1,068 27

NSF / SCEC 15,246 1,737 52 153,159 3,229 73

Persistent Archive

NARA 7 2 63 81 58 2,703 1,906 58

NSF / NSDL 2,785 20,054 119 5,205 50,586 136

UCSD Libraries 127 202 29 190 208 29

NHPRC / PAT 101 474 28

TOTAL 28 TB 6 mil 194 TB 40 mil 4,635 655 TB 106 mil 5,383

5/17/02 6/30/04 1/3/06

Page 9: Data Grids and Data Management

Shared CollectionsShared Collections

• Purpose of SRB data grid is to enable the creation of a collection that is shared between academic institutions• Register digital entity into the shared collection• Assign owner, access controls• Assign descriptive, provenance metadata• Manage state information

• Audit trails, versions, replicas, backups, locks• Size, checksum, validation date, synchronization date, …

• Manage interactions with storage systems• Unix file systems, Windows file systems, tape archives, …

• Manage interactions with preferred access mechanisms• Web browser, Java, WSDL, C library, …

Page 10: Data Grids and Data Management

SRBserver

SRB agent

SRBserver

Federated Server ArchitectureFederated Server Architecture

MCAT

Read Application

SRB agent

1

2

34

6

5

Logical NameOr

Attribute Condition

1.Logical-to-Physical mapping2.Identification of Replicas3.Access & Audit Control

Peer-to-peer

Brokering

Server(s) SpawningData

Access

Parallel Data Access

R1R2

5/6

Page 11: Data Grids and Data Management

Generic InfrastructureGeneric Infrastructure

• Digital libraries now build upon data grids to manage distributed collections• DSpace digital library - MIT and Hewlitt Packard• Fedora digitial library - Cornell University and University

of Virginia

• Persistent archives build upon data grids to manage technology evolution• NARA research prototype persistent archive• California Digital Library - Digital Preservation Repository• NSF National Science Digital Library persistent archive

Page 12: Data Grids and Data Management

National Science Digital LibraryNational Science Digital Library

• URLs for educational material for all grade levels registered into repository at Cornell

• SDSC crawls the URLs, registers the web pages into a SRB data grid, builds a persistent archive• 750,000 URLs• 13 million web pages• About 3 TBs of data

Page 13: Data Grids and Data Management
Page 14: Data Grids and Data Management

Southern California Earthquake CenterSouthern California Earthquake Center

SCEC Community

Library

Select Receiver (Lat/Lon)

OutputTime HistorySeismograms

Select ScenarioFault Model

Source Model

•Intuitive User Interface–Pull-Down Query Menus –Graphical Selection of Source Model–Clickable LA Basin Map (Olsen)–Seismogram/History extraction (Olsen)

•Access SCEC Digital Library

–Data stored in a data grid–Annotated by modelers–Standard naming convention–Automated extraction of selected data and metadata–Management of visualizations

SCEC Digital LibrarySCEC Digital Library

Page 15: Data Grids and Data Management

Terashake Data HandlingTerashake Data Handling

• Simulate 7.7 magnitude earthquake on San Andreas fault• 50 Terabytes in a simulation• Move 10 Terabytes per day

• Post-Processing of wave field• Movies of seismic wave propagation• Seismogram formatting for interactive on-

line analysis• Velocity magnitude• Displacement vector field• Cumulative peak maps• Statistics used in visualizations• Register derived data products into

SCEC digital library

Page 16: Data Grids and Data Management

HumidityClimateEcologicalWirelessOceanography

Wind SpeedClimateEcologicalWirelessOceanography

SeismicGeophysics

ROADNet Sensor Network Data Integration

Fire startRain start

Frank Vernon - UCSD/SIOFrank Vernon - UCSD/SIO

Page 17: Data Grids and Data Management

NARA Persistent ArchiveNARA Persistent Archive

NARA U Md SDSC

MCAT MCAT MCAT

Original data at NARA, data replicated to U Md & SDSC

Replicated copyat U Md for improvedaccess, load balancingand disaster recovery

Active archive atSDSC, user access

Demonstrate preservation environment • Authenticity• Integrity• Management of technology evolution• Mitigation of risk of data loss

• Replication of data• Federation of catalogs

• Management of preservation metadata• Scalability

• Types of data collections• Size of data collections

Federation of Three Independent Data Grids

Page 18: Data Grids and Data Management

Worldwide University Network Data GridWorldwide University Network Data Grid

• SDSC• Manchester• Southampton• White Rose• NCSA• U. Bergen

• A functioning, general purpose international Data Grid for academic collaborations

Manchester-SDSC mirror

Page 19: Data Grids and Data Management

WUNGrid CollectionsWUNGrid Collections

• BioSimGrid• Molecular structure collaborations

• White Rose Grid • Distributed Aircraft Maintenance Environment

• Medieval Studies• Music Grid• e-Print collections

• DSpace

• Astronomy

Page 20: Data Grids and Data Management

HASS GridHASS Grid

• HASS Grid - Kevin Franklin (UCHRI, CITRIS, SDSC)• Installing 2-5 TBs per site - Cyberbricks

• UC Berkeley, UCLA, UCI, UCSD/SDSC• UCSB, UCSF, UCR

• Federating data grid with WUNgrid

• House research collections in Humanities, Arts and Social Sciences• Cultural representation• Medieval manuscripts

Page 21: Data Grids and Data Management

KEK Data GridKEK Data Grid

• Japan• Taiwan• South Korea• Australia• Poland• US

• A functioning, general purpose international Data Grid for high-energy physics

Manchester-SDSC mirror

Page 22: Data Grids and Data Management

BaBar High-energy PhysicsBaBar High-energy Physics

• Stanford Linear Accelerator

• Lyon, France• Rome, Italy• San Diego• RAL, UK

• A functioning international Data Grid for high-energy physics

Manchester-SDSC mirror

Moved over 100 TBs of dataMoved over 100 TBs of data

Page 23: Data Grids and Data Management

Astronomy Data GridAstronomy Data Grid

• Chile• Tucson, Arizona• NCSA, Illinois

• A functioning international Data Grid for Astronomy Manchester-SDSC mirror

Moved over 400,000 imagesMoved over 400,000 images

Page 24: Data Grids and Data Management

International Institutions (2005)International Institutions (2005)Project InstitutionData mangement project British Antarctic Survey, UKeMinerals Cambridge e-Science Center, UKSickkids Hospital in Toronto CanadaWelsh e-Science Centre Cardiff University, UKAustralian Partnership for Advanced Computing Data Grid Victoria, AustraliaTrinity College High Performance Computing (HPC-Europa) Trinity College, IrelandNational Environment Research Council United KingdomCenter for Advanced Studies, Research, and Development ItalyLIACS(Leiden Inst. Of Comp. Sci) Leiden University,The NetherlandsPhysics Labs University of Bristol, UKMonash E-Research Grid Monash University, AustraliaComputational Modelling University of Queensland, AustraliaSchool Computing University of Leeds, UKDept. of Computer Science University of Liverpool, UKBelfast e-Science Centre Queen's University, UKWorldwide Universities Network University of Manchester, UKLarge Hadron Collider Computing Grid University of Oxford, UKWhite Rose Grid University of Sheffield. UKProtein structure prediction Taiwan University, Taiwan

SRB distributed to 86 US institutions and 84 international institutions in last 2 yearsSRB distributed to 86 US institutions and 84 international institutions in last 2 years

Page 25: Data Grids and Data Management

Unix Shell

NT Browser,Kepler Actors

http,Portlet,WSDL,

OAI-PMH)

DSpace,OpenDAP,GridFTP,Fedora

Archives - Tape,Sam-QFS, DMF,

HPSS, ADSM,UniTree, ADS

Databases -DB2, Oracle,

Sybase, Postgres, mySQL, Informix

File SystemsUnix, NT,Mac OSX

Application

ORB

Storage Repository AbstractionDatabase Abstraction

Databases -DB2, Oracle, Sybase,

Postgres, mySQL,Informix

CLibrary,

Java

Logical Name Space

LatencyManagement

DataTransport

MetadataTransport

Consistency & Metadata Management / Authorization, Authentication, Audit

Linux I/OC++

DLL /Python,

Perl, Windows

Federation Management

Storage Resource Broker 3.3.1Storage Resource Broker 3.3.1

Page 26: Data Grids and Data Management

SRB ObjectivesSRB Objectives

• Automate all aspects of data discovery, access, management, analysis, preservation• Security paramount• Distributed data

• Provide distributed data support for• Data sharing - data grids• Data publication - digital libraries• Data preservation - persistent archives• Data collections - Real time sensor data

Page 27: Data Grids and Data Management

SRB DevelopersSRB DevelopersReagan Moore Reagan Moore - PI- PIMichael Wan Michael Wan - SRB Architect- SRB ArchitectArcot Rajasekar Arcot Rajasekar - SRB Manager- SRB ManagerWayne Schroeder Wayne Schroeder - SRB Productization- SRB ProductizationCharlie CowartCharlie Cowart - inQ- inQLucas Gilbert Lucas Gilbert - Jargon- JargonBing Zhu Bing Zhu - Perl, Python, Windows- Perl, Python, WindowsAntoine de Torcy Antoine de Torcy - mySRB web browser- mySRB web browserSheau-Yen Chen Sheau-Yen Chen - SRB Administration- SRB AdministrationGeorge KremenekGeorge Kremenek - SRB Collections- SRB CollectionsArun Jagatheesan Arun Jagatheesan - Matrix workflow- Matrix workflowMarcio Faerman Marcio Faerman - SCEC Application- SCEC ApplicationSifang Lu Sifang Lu - ROADnet Application- ROADnet ApplicationRichard Marciano Richard Marciano - SALT persistent archives- SALT persistent archives

Contributors from UK e-Science, Academia Sinica, Ohio State University, Aerospace Corporation, …

75 FTE-years of support75 FTE-years of supportAbout 300,000 lines of CAbout 300,000 lines of C

Page 28: Data Grids and Data Management

HistoryHistory• 1995 - DARPA Massive Data Analysis Systems• 1997 - DARPA/USPTO Distributed Object Computation Testbed• 1998 - NSF National Partnership for Advanced Computational Infrastructure

• 1998 - DOE Accelerated Strategic Computing Initiative data grid• 1999 - NARA persistent archive• 2000 - NASA Information Power Grid• 2001 - NLM Digital Embryo digital library• 2001 - DOE Particle Physics data grid• 2001 - NSF Grid Physics Network data grid• 2001 - NSF National Virtual Observatory data grid• 2002 - NSF National Science Digital Library persistent archive• 2003 - NSF Southern California Earthquake Center digital library• 2003 - NIH Biomedical Informatics Research Network data grid• 2003 - NSF Real-time Observatories, Applications, and Data management Network

• 2004 - NSF ITR, Constraint based data systems• 2005 - LC Digital Preservation Lifecycle Management• 2005 - LC National Digital Information Infrastructure and Preservation program

Page 29: Data Grids and Data Management

DevelopmentDevelopment

• SRB 1.1.8 - December 15, 2000• Basic distributed data management system• Metadata Catalog

• SRB 2.0 - February 18, 2003• Parallel I/O support• Bulk operations

• SRB 3.0 - August 30, 2003• Federation of data grids

• SRB 3.4 - October 31, 2005• Feature requests (extensible schema)

Page 30: Data Grids and Data Management

Separation of Access Method Separation of Access Method from Storage Protocolsfrom Storage Protocols

Storage SystemStorage System

Storage ProtocolStorage Protocol

Access MethodAccess Method

Access OperationsAccess Operations

Data GridData Grid

Map from the Map from the

operations used byoperations used by

the access methodthe access method

to a standard set ofto a standard set of

operations used to operations used to

interact with theinteract with the

storage systemstorage system

Storage OperationsStorage Operations

Page 31: Data Grids and Data Management

Data Grid OperationsData Grid Operations• File access

• Open, close, read, write, seek, stat, synch, …• Audit, versions, pinning, checksums, synchronize, …• Parallel I/O and firewall interactions• Versions, backups, replicas

• Latency management• Bulk operations

• Register, load, unload, delete, …

• Remote procedures• HDFv5, data filtering, file parsing, replicate, aggregate

• Metadata management• SQL generation, schema extension, XML import and export,

browsing, queries, • GGF, “Operations for Access, Management, and Transport at Remote

Sites”

Page 32: Data Grids and Data Management

Examples of ExtensibilityExamples of Extensibility• Storage Repository Driver evolution

• Initially supported Unix file system• Added archival access - UniTree, HPSS• Added FTP/HTTP• Added database blob access• Added database table interface• Added Windows file system• Added project archives - Dcache, Castor, ADS• Added Object Ring Buffer, Datascope• Added GridFTP version 3.3

• Database management evolution• Postgres• DB2• Oracle• Informix• Sybase• mySQL (most difficult port - no locks, no views, limited SQL)

Page 33: Data Grids and Data Management

Examples of ExtensibilityExamples of Extensibility• The 3 fundamental APIs are C library, shell commands,

Java• Other access mechanisms are ported on top of these interfaces

• API evolution• Initial access through C library, Unix shell command• Added iNQ Windows browser (C++ library)• Added mySRB Web browser (C library and shell commands)• Added Java (Jargon)• Added Perl/Python load libraries (shell command)• Added WSDL (Java)• Added OAI-PMH, OpenDAP, DSpace digital library (Java)• Added Kepler actors for dataflow access (Java)• Added GridFTP version 3.3 (C library)• Added Fedora

Page 34: Data Grids and Data Management

Logical Name SpacesLogical Name Spaces

Storage Repository

• Storage location

• User name

• File name

• File context (creation date,…)

• Access constraints

Data Access Methods (C library, Unix, Web Browser)

Data access directly between

application and storage

repository using names

required by the local

repository

Page 35: Data Grids and Data Management

Logical Name SpacesLogical Name Spaces

Storage Repository

• Storage location

• User name

• File name

• File context (creation date,…)

• Access constraints

Data Grid

• Logical resource name space

• Logical user name space

• Logical file name space

• Logical context (metadata)

• Control/consistency constraints

Data Collection

Data Access Methods (C library, Unix, Web Browser)

Data is organized as a shared collection

Page 36: Data Grids and Data Management

Federation Between Data GridsFederation Between Data Grids

Data Grid

• Logical resource name space

• Logical user name space

• Logical file name space

• Logical context (metadata)

• Control/consistency constraints

Data Collection B

Data Access Methods (Web Browser, DSpace, OAI-PMH)

Data Grid

• Logical resource name space

• Logical user name space

• Logical file name space

• Logical context (metadata)

• Control/consistency constraints

Data Collection A

Access controls and consistency constraints on cross registration of digital entities

Page 37: Data Grids and Data Management

Types of Risk Types of Risk

• Media failure• Replicate data onto multiple media

• Vendor specific systemic errors• Replicate data onto multiple vendor products

• Operational error• Replicate data onto a second administrative domain

• Natural disaster• Replicate data to a geographically remote site

• Malicious user• Replicate data to a deep archive

Page 38: Data Grids and Data Management

How Many ReplicasHow Many Replicas

• Three sites minimize risk• Primary site

• Supports interactive user access to data

• Secondary site• Supports interactive user access when first site is

down• Provides 2nd media copy, located at a remote site,

uses different vendor product, independent administrative procedures

• Deep archive• Provides 3rd media copy, staging environment for

data ingestion, no user access

Page 39: Data Grids and Data Management

Deep ArchiveDeep Archive

Z2Z2 Z1Z1Z3Z3

Z2:D2:U2Z2:D2:U2

RegisterRegister

Z3:D3:U3Z3:D3:U3

RegisterRegister

PullPull PullPull

FirewallFirewall

Server initiated I/OServer initiated I/O

DeepDeep

ArchiveArchive

StagingStaging

ZoneZone

Remote ZoneRemote Zone

No access byNo access by

Remote zonesRemote zones

PVNPVN

Page 40: Data Grids and Data Management

For More InformationFor More Information

Reagan W. Moore

San Diego Supercomputer Center

[email protected]

http://www.sdsc.edu/srb/