NCAR
NCAR Data and Grid Efforts: The Earth System Grid & The Community Data Portal
Don Middleton
NCAR Scientific Computing Division
CAS2003
September 11, 2003

The Earth System Grid
- U.S. DOE SciDAC funded R&D effort: a “Collaboratory Pilot Project”
- Build an “Earth System Grid” that enables management, discovery, distributed access, processing, & analysis of distributed terascale climate research data
- Build upon Globus Toolkit and DataGrid technologies and deploy
- Potential broad application to other areas
http://www.earthsystemgrid.org

ESG Team
- ANL
  – Ian Foster (PI)
  – Veronika Nefedova
  – (John Bresnahan)
  – (Bill Allcock)
- LBNL
  – Arie Shoshani
  – Alex Sim
- ORNL
  – David Bernholdt
  – Kasidit Chanchio
  – Line Pouchard
- LLNL/PCMDI
  – Bob Drach
  – Dean Williams (PI)
- USC/ISI
  – Anne Chervenak
  – Carl Kesselman
  – (Laura Pearlman)
- NCAR
  – David Brown
  – Luca Cinquini
  – Peter Fox
  – Jose Garcia
  – Don Middleton (PI)
  – Gary Strand

Baseline Numbers
- T42 CCSM (current, 280km)
  – 7.5GB/yr, 100 years -> .75TB
- T85 CCSM (140km)
  – 29GB/yr, 100 years -> 2.9TB
- T170 CCSM (70km)
  – 110GB/yr, 100 years -> 11TB
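The century totals quoted for each resolution are the annual output multiplied by the run length. A minimal sketch of that arithmetic, assuming decimal units (1 TB = 1000 GB), which is what the slide's rounding suggests; the function and variable names are illustrative:

```python
# Annual model output per resolution, in GB per simulated year (from the slide).
ANNUAL_OUTPUT_GB = {"T42": 7.5, "T85": 29.0, "T170": 110.0}

GB_PER_TB = 1000.0  # assumption: decimal units, matching the slide's rounding


def century_volume_tb(gb_per_year, years=100):
    """Return the total output in TB for a run of `years` simulated years."""
    return gb_per_year * years / GB_PER_TB


volumes = {name: century_volume_tb(rate) for name, rate in ANNUAL_OUTPUT_GB.items()}
for name, tb in volumes.items():
    print(f"{name}: {tb:.2f} TB per 100 simulated years")
```

Changing `years` or adding another resolution to the dictionary gives the corresponding totals for other run configurations.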

Capacity-related Improvements
- Increased turnaround, model development, ensemble of runs
  – Increase by a factor of 10, linear data
- Current T42 CCSM
  – 7.5GB/yr, 100 years -> .75TB * 10 = 7.5TB

Capability-related Improvements
- Spatial Resolution: T42 -> T85 -> T170
  – Increase by factor of ~ 10-20, linear data
- Temporal Resolution: Study diurnal cycle, 3 hour data
  – Increase by factor of ~ 4, linear data
CCM3 at T170 (70km)

Capability-related Improvements
- Quality: Improved boundary layer, clouds, convection, ocean physics, land model, river runoff, sea ice
  – Increase by another factor of 2-3, data flat
- Scope: Atmospheric chemistry (sulfates, ozone…), biogeochemistry (carbon cycle, ecosystem dynamics), Middle Atmosphere Model…
  – Increase by another factor of 10+, linear data

Model Improvement Wishlist
Grand Total:
Increase compute by a factor of O(1000-10000)
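One way to see where a grand total of this size can come from is to multiply the per-category factors quoted on the capacity and capability slides. The sketch below does that under two assumptions that are not stated in the slides: that the factors compound multiplicatively, and that every one of them (including the temporal-resolution factor) applies to compute.

```python
from math import prod

# (low, high) multipliers as quoted on the preceding slides.
# Treating all of them as compute multipliers is an assumption.
FACTORS = {
    "capacity (turnaround, ensembles)": (10, 10),
    "spatial resolution (T42 -> T170)": (10, 20),
    "temporal resolution (3 hour data)": (4, 4),
    "quality (improved physics)": (2, 3),
    "scope (chemistry, biogeochemistry)": (10, 10),  # slide says "10+"
}

low = prod(lo for lo, _ in FACTORS.values())
high = prod(hi for _, hi in FACTORS.values())
print(f"combined factor: {low} to {high}")
```

Under these assumptions the product is on the order of 10^4, the upper end of the range quoted on the slide; dropping any single category lowers it accordingly.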

Longer-term Missions - Observation of Key Earth System Interactions
Terra, Aura, Aqua, Landsat 7
Exploratory - Explore Specific Earth System Processes and Parameters and Demonstrate Technologies
GRACE, PICASSO, Cloudsat, QuikScat, EO-1, ICEsat, Jason-1, SRTM, VCL, Triana
We Will Examine Practically Every Aspect of the Earth System from Space in This Decade
Courtesy of Tim Killeen, NCAR

ESG Scenario
- End 2002: 1.2 million files comprising ~75TB of data at NCAR, ORNL, LANL, NERSC, and PCMDI
- End 2007: As much as 3 PB (3,000 TB) of data (!)
- Current practice is already broken – the future will be even worse if something isn’t done…
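To put the two figures in perspective, the projected growth can be converted to an average yearly rate. A small sketch, assuming steady compound growth over the five years between the two dates (the slide gives only the endpoints):

```python
START_TB = 75.0    # end of 2002, from the slide
END_TB = 3000.0    # end of 2007, upper estimate from the slide
YEARS = 2007 - 2002

growth = END_TB / START_TB          # overall growth factor
annual = growth ** (1.0 / YEARS)    # assumed constant yearly factor

print(f"overall growth: {growth:.0f}x over {YEARS} years")
print(f"average yearly growth: {annual:.2f}x")
```

In other words, the archive would have to slightly more than double every year to reach the upper estimate.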

ESG Scenario (cont.)
- Data
  – Different formats are converted to netCDF
  – netCDF is not standardized to the CF model
  – Different sites require knowledge of different methods of access
- Metadata
  – Most kept in online files separate from data and unsearchable unless one is “in the know”
  – Some kept in people’s brains
- Access control
  – Manual
  – Not formalized
- Data requests
  – Beginnings of a formal process (e.g., the PCMDI model)
  – Beginnings of web portals
  – Far too much done by hand
  – Logging nearly non-existent

ESG: Challenges
- Enabling the simulation and data management team
- Enabling the core research community in analyzing and visualizing results
- Enabling broad multidisciplinary communities to access simulation results
We need integrated scientific work environments that enable smooth WORKFLOW for knowledge development: computation, collaboration & collaboratories, data management, access, distribution, analysis, and visualization.

ESG: Strategies
- Move data a minimal amount, keep it close to computational point of origin when possible
  – Data access protocols, distributed analysis
- When we must move data, do it fast and with a minimum amount of human intervention
  – Storage Resource Management, fast networks
- Keep track of what we have, particularly what’s on deep storage
  – Metadata and Replica Catalogs
- Harness a federation of sites, web portals
  – Globus Toolkit -> The Earth System Grid -> The UltraDataGrid

Storage/Data Management
[Diagram labels: Client (Selection, Control, Monitoring); HRM; two Servers, each with an HRM and a Tera/Peta-scale Archive; Tools for reliable staging, transport, and replication]

HRM aka “DataMover”
- Running well across DOE/HPSS systems
- New component built that abstracts NCAR Mass Storage System
- Defining next generation of requirements with climate production group
- First “real” usage
“The bottom line is that it now works fine and is over 100 times faster than what I was doing before. As important as two orders of magnitude increase in throughput is, more importantly I can see a path that will essentially reduce my own time spent on file transfers to zero in the development of the climate model database” – Mike Wehner, LBNL

OPeNDAP
An Open Source Project for a Network Data Access Protocol
(originally DODS, the Distributed Oceanographic Data System)

OPeNDAP-g
- Transparency
- Performance
- Security
- Authorization
- (Processing)
[Diagram, three configurations:
 Typical Application: Application, netCDF lib, Data (local)
 OPeNDAP via http: Application, OPeNDAP Client, OpenDAP Server, Data (remote)
 OPeNDAP via Grid: Application, ESG client, ESG Server (ESG + DODS), Big Data (remote)
 Other labels: Distributed Application, Distributed Data Access Services, data]
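For the "OPeNDAP via http" path, a client asks the server for a dataset's structure, attributes, or a subset of its data by appending a suffix and an optional constraint expression to the dataset URL. The sketch below only builds such DAP2-style URLs; the helper name, the server, the file, and the variable are all hypothetical, and it does not cover the Grid-enabled (OPeNDAP-g) path.

```python
def dap_url(base_url, response="dods", variable=None, slices=()):
    """Build a DAP2-style request URL.

    base_url : dataset URL on the server (hypothetical in the example below)
    response : "dds" (structure), "das" (attributes) or "dods" (binary data)
    variable : optional name of the variable to subset
    slices   : (start, stride, stop) index triples, one per dimension
    """
    if response not in ("dds", "das", "dods"):
        raise ValueError(f"unsupported response type: {response}")
    url = f"{base_url}.{response}"
    if variable is not None:
        # Hyperslab syntax: one [start:stride:stop] group per dimension.
        hyperslab = "".join(f"[{a}:{s}:{b}]" for a, s, b in slices)
        url += f"?{variable}{hyperslab}"
    return url


# Hypothetical server, file and variable names.
example = dap_url(
    "http://example.org/dods/ccsm/run1.nc",
    response="dods",
    variable="TS",
    slices=[(0, 1, 11), (0, 1, 63), (0, 1, 127)],
)
print(example)
```

Only the requested hyperslab crosses the network, which fits the strategy of moving data a minimal amount. Real clients usually hide this URL construction behind the netCDF API.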

ESG: NcML Core Schema
- For XML encoding of metadata (and data) of any generic netCDF file
- Objects: netCDF, dimension, variable, attribute
- Beta version reference implementation as Java Library (http://www.scd.ucar.edu/vets/luca/netcdf/extract_metadata.htm)
[Schema diagram labels: netCDF (nc:netCDFType) with nc:dimension, nc:variable, nc:attribute; nc:VariableType with nc:attribute, nc:values]
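The four object types listed for the core schema map naturally onto nested XML elements. The following is an illustrative sketch using Python's standard library rather than the Java reference implementation; the element names follow the object list on the slide, while the attribute names, the file name, and all values are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Element names follow the slide's object list; attribute names and values
# are assumptions for illustration, not taken from the schema itself.
root = ET.Element("netcdf", {"location": "example.nc"})  # hypothetical file
ET.SubElement(root, "dimension", {"name": "time", "length": "12"})
ET.SubElement(root, "dimension", {"name": "lat", "length": "64"})
ET.SubElement(root, "attribute", {"name": "Conventions", "value": "CF-1.0"})

var = ET.SubElement(
    root, "variable", {"name": "TS", "type": "float", "shape": "time lat"}
)
ET.SubElement(var, "attribute", {"name": "units", "value": "K"})

ncml_text = ET.tostring(root, encoding="unicode")
print(ncml_text)

# Reading it back: map each variable name to its list of dimensions.
parsed = ET.fromstring(ncml_text)
shapes = {v.get("name"): v.get("shape").split() for v in parsed.findall("variable")}
print(shapes)
```

Because the description is plain XML, it can be stored, searched, and transformed without opening the underlying netCDF file.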

[ESG metadata schema diagram; classes and their fields with cardinalities:]
- Object: [1] id
- Activity: [0,1] name, [0,1] description, [0,1] rights, [0,n] date type=, [0,n] note, [0,n] participant role=, [0,n] reference uri=
- Investigation
- Project: [0,n] topic type=, [0,1] funding
- Ensemble
- Campaign
- Simulation: [0,n] simulationInput type=, [0,n] simulationHardware
- Observation
- Experiment
- Analysis
- Dataset: [0,1] type, [0,1] conventions, [0,n] date type=, [0,n] format type= uri=, [0,1] timeCoverage, [0,1] spaceCoverage
- Person: [0,1] firstName, [0,1] lastName, [0,1] contact
- Institution: [0,1] name, [0,1] type, [0,1] contact
- Service: [0,1] name, [0,1] description
Relationship labels: isA, isPartOf, hasParent, hasChild, hasSibling, generatedBy, worksFor, participant role=, serviceId
LEGEND: Class, AbstractClass, inheritance, association
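The cardinalities in the diagram translate directly into optional and repeated fields. A partial sketch for a few of the classes, assuming that Activity and Dataset inherit from Object and that Project inherits from Activity (the flattened diagram does not make every inheritance edge recoverable); the Python class and field names are illustrative only.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class MetadataObject:
    """Corresponds to "Object" in the diagram."""
    id: str  # [1] id: exactly one, required


@dataclass
class Activity(MetadataObject):  # assumed: Activity isA Object
    name: Optional[str] = None         # [0,1]
    description: Optional[str] = None  # [0,1]
    rights: Optional[str] = None       # [0,1]
    notes: List[str] = field(default_factory=list)  # [0,n]


@dataclass
class Project(Activity):  # assumed: Project isA Activity
    topics: List[str] = field(default_factory=list)  # [0,n]
    funding: Optional[str] = None                     # [0,1]


@dataclass
class Dataset(MetadataObject):  # assumed: Dataset isA Object
    type: Optional[str] = None
    conventions: Optional[str] = None
    formats: List[str] = field(default_factory=list)
    time_coverage: Optional[str] = None
    space_coverage: Optional[str] = None
    generated_by: Optional[str] = None  # id of an Activity ("generatedBy")


# Hypothetical records.
project = Project(id="proj-1", name="Example project", topics=["climate"])
dataset = Dataset(id="ds-1", conventions="CF", generated_by=project.id)
print(project)
print(dataset)
```

A [0,1] field becomes an optional value and a [0,n] field becomes a list that may be empty, so a record with only an id is still valid.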

ESG Metadata Progress
- Co-developed NcML with Unidata
  – CF conventions in progress, almost done
- Developed & evaluated a prototype metadata system
- Finalized an initial schema for PCM/CCSM
  – Address interoperability with federal standards and NASA/GCMD via the generation of DIF/FGDC/ISO
  – Address interoperability with digital libraries via the creation of Dublin Core
- Testing relational and native XML databases, and OGSA-DAI
- Exploratory work for first-generation ontology
- Authoring of discovery metadata in progress

ESG Web Portal Demonstration

ESG Topology (CAS 2003)
[Diagram. Sites: LBNL, ISI, LLNL, NCAR, ORNL, ANL. Components: ESG web portal (Tomcat/Struts); OGSA-DAI with MySQL RDBMS (query); MyProxy (authenticate); CAS; GRAM gatekeeper (submit, execute); LAS server (visualize); gridFTP servers; RLS instances (cross-update); HRMs in front of MSS, HPSS, and disk; disk cache]

Collaborations & Relationships
- CCSM Data Management Group
- The Globus Project
- Other SciDAC Projects: Climate, Security & Policy for Group Collaboration, Scientific Data Management ISIC, & High-performance DataGrid Toolkit
- OPeNDAP/DODS (multi-agency)
- NSF National Science Digital Libraries Program (UCAR & Unidata THREDDS Project)
- U.K. e-Science and British Atmospheric Data Center
- NOAA NOMADS and CEOS-grid
- Earth Science Portal group (multi-agency, intnl.)

Immediate Directions
- Broaden usage of DataMover and refine
- Continue building metadata catalogs
- Revisit overall security model and consider simplified approaches
- Redesign and implement user interface
- Alpha version of OPeNDAPg
  – Test and evaluate with three client applications (ncview, CDAT, & NCL)
- Develop automation for data publishing (GT3)
- Deploy for IPCC runs

The Community Data Portal (CDP)
- Provide a common portal to NCAR, UCAR, and university data
- Provide cyberinfrastructure that dramatically lowers the cost of sharing data (there is large interest in this)
- Directly couple to simulation and data analysis systems
- Begin capturing rich metadata and catalog our scientific experiments for the world
- MSS -> A petascale Mass Knowledge System
- Federate internationally (ESG, THREDDS, U.K. e-Science, NOMADS, PRISM, GEON, etc.)
“The dataportal has changed my life…” Ben Kirtman, COLA

A Quick Tour of the CDP
‘ing Our Data

Community Data Portal Metadata Software
[Architecture diagram; components and labels:]
- Metadata sources: THREDDS catalogs, ESG metadata, DC metadata, NcML metadata, other metadata
- THREDDS catalog parser application (parses)
- Relational DB (MySQL): shreds XML doc into tables
- XML native DB (Xindice): stores full XML doc
- Search & Discovery web application: simple query (SQL); future advanced query (XPath, XQuery)
- Results: list of triplets (dataset id, metadata schema, metadata URL)
- XML viewer web application: schema-specific stylesheets
- THREDDS catalogs browser web application
Other edge labels: reference, displays, links to, uses
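The search path in this architecture is a simple SQL query against the relational store that returns (dataset id, metadata schema, metadata URL) triplets, which the viewer then follows. A minimal sketch of that idea, using an in-memory SQLite database as a stand-in for MySQL; the table layout, column names, and sample rows are assumptions, not the CDP's actual schema.

```python
import sqlite3

# In-memory SQLite stands in for the MySQL database named on the slide.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE metadata (dataset_id TEXT, schema_name TEXT, url TEXT, title TEXT)"
)
# Hypothetical rows: one dataset may be described under several schemas.
rows = [
    ("ds-001", "THREDDS", "http://example.org/md/ds-001.thredds.xml", "CCSM control run"),
    ("ds-001", "DC", "http://example.org/md/ds-001.dc.xml", "CCSM control run"),
    ("ds-002", "NcML", "http://example.org/md/ds-002.ncml", "Ocean reanalysis"),
]
conn.executemany("INSERT INTO metadata VALUES (?, ?, ?, ?)", rows)


def search(keyword):
    """Return (dataset id, metadata schema, metadata URL) triplets
    for records whose title contains the keyword."""
    cur = conn.execute(
        "SELECT dataset_id, schema_name, url FROM metadata "
        "WHERE title LIKE ? ORDER BY dataset_id, schema_name",
        (f"%{keyword}%",),
    )
    return cur.fetchall()


triplets = search("CCSM")
for dataset_id, schema_name, url in triplets:
    print(dataset_id, schema_name, url)
```

Returning the schema name alongside the URL lets a viewer choose the matching stylesheet for each record.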

Data->Knowledge
Mass Storage System (1.3PB)
Petascale Knowledge Repository
Establish new paradigms for managing and accessing scientific data based on semantic organization.

Closing Thoughts
- Building an environment for the long-term
  – Difficult, expensive, and time-consuming
  – Requires longer-term projects
- Team-building is a critical process
  – Collaboration technologies really help
- Managing all the collaborations is a challenge
  – But extremely valuable
- Good progress, first real usage

Managing Expectations…

Links
- Earth System Grid
  – www.earthsystemgrid.org
- Community Data Portal
  – dataportal.ucar.edu

END