National Aeronautics and Space Administration
Jet Propulsion Laboratory
California Institute of Technology
Pasadena, California
Architecting Scientific Data Systems in the 21st Century
Dan Crichton
Principal Computer Scientist
Program Manager, Data Systems and Technology
NASA Jet Propulsion Laboratory
Architecting the “End-to-End” Science Data System
• Focus on
– science data generation
– data capture, end-to-end
– access to science data by the community
• Multiple scientific domains
– Earth science
– Planetary science
– Biomedical research
• Applied technology research
– SW/Sys architectures
– Product lines
– Emerging technologies
Challenges in Science Data Systems
• A major challenge is organizing the wealth of science data, which requires both standards and data engineering/curation
– Search and access depend on good curation
– Community support is critical to capture and curate the data in a manner that is useful to the community
• Usability of data continues to be a big challenge
– Planetary science requires that ALL science data and/or science data pipelines be peer reviewed prior to release of data
– Standard formats are critical
• Data sharing continues to be a challenge
– Policies at the grant level coupled with standard data management plans are helping
• Computation and storage, historically major concerns, are now commodity services
– Google, Microsoft Research, Yahoo! and Amazon are trying to provide services to e-science in the form of “Cloud Computing”
National Research Council: Committee on Data Management and Computation
• CODMAC (1980s) identified seven core principles:
– Scientific involvement;
– Scientific oversight;
– Data availability, including usable formats, ancillary data, timely distribution, validated data, and documentation;
– Proper facilities;
– Structured, transportable, adequately documented software;
– Data storage in permanent and retrievable form; and
– Adequate data system funding.
• The CODMAC has led to national efforts to organize scientific results in partnership with the science community (particularly physical science)
• What does CODMAC mean in the 21st Century?
The “e-science” Trend…
• Highly distributed, multi-organizational systems
– Systems are moving towards loosely coupled systems or federations in order to solve science problems which span center and institutional environments
• Sharing of data and services which allow for the discovery, access, and transformation of data
– Systems are moving towards publishing of services and data in order to address data- and computationally-intensive problems
– Infrastructures are being built to handle future demand
• Address complex modeling, inter-disciplinary science and decision support needs
– Need a dynamic environment where data and services can be used quickly as the building blocks for constructing predictive models and answering critical science questions
JPL e-science Examples

Planetary Data System: Distributed Planetary Science Archive
• Small Bodies Node: University of Maryland, College Park, MD
• Planetary Plasma Interactions Node: University of California Los Angeles, Los Angeles, CA
• Geosciences Node: Washington University, St. Louis, MO
• Imaging Node: JPL and USGS, Pasadena, CA and Flagstaff, AZ
• THEMIS Data Node: Arizona State University, Tempe, AZ
• Central Node: Jet Propulsion Laboratory, Pasadena, CA
• Navigation Ancillary Information Node: Jet Propulsion Laboratory, Pasadena, CA
• Rings Node: Ames Research Center, Moffett Field, CA
• Atmospheres Node: New Mexico State University, Las Cruces, NM

National Data Sharing Infrastructure Supporting Collaboration in Biomedical Research for EDRN
• University of Michigan (CEC)
• Moffitt Cancer Center, Tampa (BDL)
• Creighton University (CEC)
• UT Health Science Center, San Antonio (CEC)
• University of Colorado (CEC)
• Fred Hutchinson Cancer Research Center, Seattle (DMCC)
• University of Pittsburgh (CEC)

Planetary Science Data System (4X)
• Highly diverse (40 years of science data from NASA and int’l missions)
• Geographically distributed; moving int’l
• New centers plugging in (i.e., data nodes)
• Multi-center data system infrastructure
• Heterogeneous nodes with common interfaces
• Integrated based on enterprise-wide data standards
• Sits on top of COTS-based middleware

EDRN Cancer Research (8X)
• Highly diverse (30+ centers performing parallel studies using different instruments)
• Geographically distributed
• New centers plugging in (i.e., data nodes)
• Multi-center data system infrastructure
• Heterogeneous sites with common interfaces allowing access to distributed portals
• Integrated based on common data standards
• Secure (e.g., encryption, authentication, authorization)
Architectural drivers in science data systems
• Increasing data volumes requiring new approaches for data production, validation, processing, discovery and data transfer/distribution (e.g., scalability relative to available resources)
• Increased emphasis on usability of the data (e.g., discovery, access and analysis)
• Increasing diversity of data sets and complexity of integrating across missions/experiments (e.g., common information model for describing the data)
• Increasing distribution of coordinated processing and operations (e.g., federation)
• Increased pressure to reduce cost of supporting new missions
• Increasing desire for PIs to have integrated tool sets to work with data products within their own environments (e.g., perform their own generation and distribution)
[Figure: Planetary Science Archive, archive volume growth. Accumulated volume in TB (axis 0 to 90) by year, 1990 to 2008.]
Architectural Focus
• Consistent distributed capabilities
– Resource discovery (data, metadata, services, etc.), unified repository access, simple transformations, bulk transfer of multiple products, and unified catalog access
– Move towards an era of “grid-ing” loosely coupled science systems
• Develop on-demand, shared services (e.g., processing, translation, etc.)
• Deploy high-throughput data movement mechanisms
• Move capability up the mission pipeline
• Reduce local software solutions that do not scale
– Increasing importance of developing an “enterprise” approach with common services
• Build value-added services and capabilities on top of the infrastructure
Object Oriented Data Technology*
• Started in 1998 as a research and development task funded at JPL by the Office of Space Science to address
• Application of Information Technology to Space Science
• Provide an infrastructure for distributed data management
• Research methods for interoperability, knowledge management and knowledge discovery
• Develop software frameworks for data management to reuse software, manage risk, reduce cost and leverage IT experience
• OODT initial focus
– Data archiving: Manage heterogeneous data products and resources in a distributed, metadata-driven environment
– Data location and discovery: Locate data products across multiple archives, catalogs and data systems
– Data retrieval: Retrieve diverse data products from distributed data sources and integrate
[Figure: Object Oriented Data Technology framework. Components shown: OODT/Science Web Tools, Archive Client, Profile XML Data, Navigation Service, Query Service, Product Service, Profile Service, Archive Service, Bridge to External Services, Data System 1, Data System 2, Other Service 1, Other Service 2.]
* 2003 NASA Software of the Year Runner Up
Architectural Principles*
• Separate the technology and the information architecture
• Encapsulate the messaging layer to support different messaging implementations
• Encapsulate individual data systems to hide uniqueness
• Provide data system location independence
• Require that communication between distributed systems use metadata
• Define a model for describing systems and their resources
• Provide scalability in linking both number of nodes and size of data sets
• Allow systems using different data dictionaries and metadata implementations to be integrated
• Leverage existing software, where possible (e.g., open source, etc)
* Crichton, D., Hughes, J. S., Hyon, J., Kelly, S. “Science Search and Retrieval using XML”, Proceedings of the 2nd National Conference on Scientific and Technical Data, National Academy of Science, Washington DC, 2000.
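Several of the architectural principles on this slide (encapsulate each data system, communicate through metadata, tolerate different data dictionaries) can be illustrated with a minimal Python sketch. It is not OODT code: the `DataSystemAdapter` class, the `federated_query` function, and every keyword name are hypothetical, chosen only to show one way a common query might be translated into each system's local vocabulary.

```python
from dataclasses import dataclass, field


@dataclass
class DataSystemAdapter:
    """Hypothetical wrapper that hides one data system's local vocabulary."""
    name: str
    # Maps a common (shared) keyword to this system's local keyword.
    dictionary: dict
    # Local records, keyed by local keywords.
    records: list = field(default_factory=list)

    def query(self, common_criteria):
        """Translate common keywords to local ones, then match records."""
        local = {}
        for key, value in common_criteria.items():
            if key not in self.dictionary:
                return []  # this system cannot answer the query
            local[self.dictionary[key]] = value
        matches = [r for r in self.records
                   if all(r.get(k) == v for k, v in local.items())]
        # Report results in the common vocabulary, tagged with their source.
        reverse = {v: k for k, v in self.dictionary.items()}
        return [{"source": self.name,
                 **{reverse.get(k, k): v for k, v in r.items()}}
                for r in matches]


def federated_query(adapters, common_criteria):
    """Send one metadata query to every adapter and merge the results."""
    results = []
    for adapter in adapters:
        results.extend(adapter.query(common_criteria))
    return results


# Two made-up nodes that describe the same concepts with different keywords.
node_a = DataSystemAdapter(
    name="node_a",
    dictionary={"target": "TARGET_NAME", "instrument": "INSTRUMENT_ID"},
    records=[{"TARGET_NAME": "MARS", "INSTRUMENT_ID": "THEMIS"},
             {"TARGET_NAME": "SATURN", "INSTRUMENT_ID": "ISS"}],
)
node_b = DataSystemAdapter(
    name="node_b",
    dictionary={"target": "body", "instrument": "sensor"},
    records=[{"body": "MARS", "sensor": "GRS"}],
)

hits = federated_query([node_a, node_b], {"target": "MARS"})
```

The caller never sees `TARGET_NAME` or `body`; it asks for `target` and gets one record from each node back in the shared vocabulary, which is the location and dictionary independence the principles call for.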
Distributed Architecture
1. Science data tools and applications use “APIs” to connect to a virtual data repository
2. Middleware creates the data grid infrastructure connecting distributed heterogeneous systems and data
3. Repositories for storing and retrieving many types of data
[Figure: Visualization Tools, Analysis Tools and Web Search Tools connect through the OODT API to the OODT Reusable Data Grid Framework, which connects through the OODT API to Mission, Biomedical and Engineering Data Repositories.]
Software Implementation
• OODT is Open Source
• Developed using open source software (i.e., Java/J2EE and XML)
• Implemented reusable, extensible Java-based software components
– Core software for building and connecting data management systems
• Provided messaging as a “plug-in” component that can be replaced independently of the other core components. Messaging components include:
– CORBA, Java RMI, JXTA, Web Services, etc.
– REST seems to have prevailed
• Provided client APIs in Java, C++, HTTP, Python, IDL
• Simple installation on a variety of platforms (Windows, Unix, Mac OS X, etc.)
• Used international data architecture standards
– ISO/IEC 11179: Specification and Standardization of Data Elements
– Dublin Core Metadata Initiative
– W3C’s Resource Description Framework (RDF) from the Semantic Web community
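The idea of messaging as a replaceable “plug-in” can be sketched in a few lines of Python. This is an illustration under stated assumptions, not OODT's actual Java design: the `Transport` interface, both transport classes, `ProductClient`, and the `"product"` service name are all hypothetical.

```python
from abc import ABC, abstractmethod
import json


class Transport(ABC):
    """Hypothetical messaging plug-in interface."""

    @abstractmethod
    def send(self, service, payload):
        """Deliver a request dict to a named service and return a reply dict."""


class InProcessTransport(Transport):
    """Calls handlers directly; stands in for a local binding."""

    def __init__(self, handlers):
        self.handlers = handlers

    def send(self, service, payload):
        return self.handlers[service](payload)


class JsonTransport(Transport):
    """Round-trips messages through JSON, as a REST-style transport might."""

    def __init__(self, handlers):
        self.handlers = handlers

    def send(self, service, payload):
        wire = json.dumps(payload)                  # request as it would travel
        reply = self.handlers[service](json.loads(wire))
        return json.loads(json.dumps(reply))        # reply as it would travel


class ProductClient:
    """Core component: depends only on the Transport interface."""

    def __init__(self, transport):
        self.transport = transport

    def get_product(self, product_id):
        return self.transport.send("product", {"id": product_id})


def product_handler(request):
    # Made-up service logic; a real product service would look up data.
    return {"id": request["id"], "status": "found"}


handlers = {"product": product_handler}
local_reply = ProductClient(InProcessTransport(handlers)).get_product("P-001")
json_reply = ProductClient(JsonTransport(handlers)).get_product("P-001")
```

`ProductClient` is unchanged when the transport is swapped, which is the property that would let one messaging technology be replaced by another without touching the core components.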
Characteristics of Informatics in Space Science
• Often unique, one-of-a-kind missions
– Can drive technological changes
• Instruments are competed and developed by academic, industry and industrial partners
– Highly distributed acquisition and processing across partner organizations
– Highly diverse data sets given heterogeneity of the instruments and the targets (i.e., solar system)
• Missions are required to share science data results with the research community, requiring:
– Common domain information model used to drive system implementations
– Expert scientific help to the user community on using the data
– Peer review of data results to ensure quality
– Distribution of data to the community
• Planetary science data from NASA (and some international) missions are deposited into the Planetary Data System
Distributed Space Architecture
[Figure: end-to-end flow of information objects among Spacecraft and Scientific Instruments (spacecraft / lander, relay satellite), Data Acquisition and Command, Mission Operations, Instrument / Sensor Operations, Science Data Processing, Science Data Archive, Data Analysis and Modeling, the Science Team and the External Science Community. Labeled objects: Primitive Information Object, Simple Information Object, Telemetry Information Package, Science Information Package, Instrument Planning Information Object, Planning Information Object, Science Products (Information Objects).]
• Common Meta Models for Describing Space Information Objects
• Common Data Dictionary end-to-end
Planetary Science Data Standards
• JPL has led and managed development of the planetary science data standards for NASA and the international community
– ESA, ISRO, JAXA, etc. are leveraging planetary science data standards
– A diverse model used across the community that unifies data systems
• Core “information” model that has been used to describe every type of data from NASA’s planetary exploration missions and instruments
– ~4000 different types of data
[Figure: a PDS Image Label (ODL) describes an image; the label corresponds to a PDS Image Class (object-oriented).]
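To make the relationship between an ODL label and an object-oriented image class concrete, here is a minimal Python sketch. It assumes a heavily simplified label: only flat `KEYWORD = VALUE` lines and a single level of `OBJECT` / `END_OBJECT` nesting. Real labels also use quoted and multi-line values, units, and pointers, none of which are handled here. The `Image` class and `parse_label` function are hypothetical names, and the label text is made up.

```python
from dataclasses import dataclass

# A made-up, simplified label in the KEYWORD = VALUE style.
LABEL = """\
PDS_VERSION_ID = PDS3
OBJECT = IMAGE
  LINES = 1024
  LINE_SAMPLES = 512
  SAMPLE_BITS = 8
  SAMPLE_TYPE = UNSIGNED_INTEGER
END_OBJECT = IMAGE
END
"""


@dataclass
class Image:
    """Hypothetical object view of the IMAGE object in a label."""
    lines: int
    line_samples: int
    sample_bits: int
    sample_type: str

    @property
    def size_bytes(self):
        # Size of the raster the label describes.
        return self.lines * self.line_samples * self.sample_bits // 8


def parse_label(text):
    """Parse 'KEY = VALUE' lines, with one level of OBJECT nesting."""
    root = {}
    current = root
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line == "END":
            break
        key, _, value = (part.strip() for part in line.partition("="))
        if key == "OBJECT":
            current = root.setdefault(value, {})   # start a nested object
        elif key == "END_OBJECT":
            current = root                         # back to the top level
        else:
            current[key] = value
    return root


label = parse_label(LABEL)
fields = label["IMAGE"]
image = Image(
    lines=int(fields["LINES"]),
    line_samples=int(fields["LINE_SAMPLES"]),
    sample_bits=int(fields["SAMPLE_BITS"]),
    sample_type=fields["SAMPLE_TYPE"],
)
```

The label stays the archival description, while the class gives tools a typed view of it, for example `image.size_bytes` for the raster size implied by the label.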
2001 Mars Odyssey: A paradigm change
• Pre-Oct 2002, no unified view across distributed operational planetary science data repositories
– Science data distributed across the country
– Science data distributed on physical media
• Planetary data archive increasing from 4 TB in 2001 to 100 TB in 2009
– Traditional distribution infeasible due to cost and system constraints
– Mars Odyssey could not be distributed using the traditional method
• Current work with the OODT Data Grid Framework has provided the technology for NASA’s planetary data management infrastructure to
– Support online distribution of science data to planetary scientists
– Enable interoperability between nine institutions
– Support real-time access to data products
– Provide uniform software interfaces to all Mars Odyssey data, allowing scientists and developers to link in their own tools
– Operational October 1, 2002
• Moving to multi-terabyte online data movement in 2009
[Image: 2001 Mars Odyssey]
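The archive figures on this slide (4 TB in 2001, 100 TB in 2009) imply a steep growth rate. The short Python calculation below works it out, assuming smooth compound growth between the two endpoints; that assumption is ours, since the slide gives only the two values.

```python
import math

start_tb, end_tb = 4.0, 100.0       # archive volumes quoted on the slide
start_year, end_year = 2001, 2009

years = end_year - start_year
# Constant yearly multiplier that turns 4 TB into 100 TB over 8 years.
growth_factor = (end_tb / start_tb) ** (1.0 / years)
annual_growth = growth_factor - 1.0
doubling_time_years = math.log(2.0) / math.log(growth_factor)

print(f"annual growth: {annual_growth:.1%}")
print(f"doubling time: {doubling_time_years:.2f} years")
```

Under that assumption the archive grows by roughly half each year and doubles in under two years, which helps explain why distribution on physical media stopped being practical.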
Explosion of Data in Biomedical Research
• “To thrive, the field that links biologists and their data urgently needs structure, recognition and support. The exponential growth in the amount of biological data means that revolutionary measures are needed for data management, analysis and accessibility. Online databases have become important avenues for publishing biological data.” – Nature Magazine, September 2008
• The capture and sharing of data to support collaborative research is leading to new opportunities to examine data in many sciences
– NASA routinely releases “data analysis programs” to analyze and process existing data
• EDRN has become a leader in building informatics technologies and constructing databases for cancer research. The tools and technologies are now ready for wider use!
[Image: EDRN Data Repositories]
Bioinformatics: National Cancer Institute Early Detection Research Network (EDRN)
• Initiated in 2000, renewed in 2005
• 100+ researchers (both members and associated members)
• ~40+ research institutions
• Mission of EDRN
– Discover, develop and validate biomarkers for cancer detection, diagnosis and risk assessment
– Conduct correlative studies/trials to validate biomarkers as indicators of early cancer, pre-invasive cancer, risk, or as surrogate endpoints
– Develop quality assurance programs for biomarker testing and evaluation
– Forge public-private partnerships
• Leverage experience building distributed planetary science data systems for biomedicine
EDRN Knowledge Environment
• EDRN has been a pioneer in the use of informatics technologies to support biomarker research
• EDRN has developed a comprehensive infrastructure to support biomarker data management across EDRN’s distributed cancer centers
– Twelve institutions are sharing data
– Same architectural framework as planetary science
• It supports capture and access to a diverse set of information and results
– Biomarkers
– Proteomics
– Biospecimens
– Various technologies and data products (image, micro-satellite, …)
– Study Management
[Figure: data and computers interconnected to form a virtual database of integrated cancer resources: specimens, images, assays, biomarkers, etc.]
EDRN’s Ontology Model
• EDRN has developed a high-level ontology model for biomarker research which provides standards for the capture of biomarker information across the enterprise
• Specific models are derived from this high-level model
– Model of biospecimens
– Model for each class of science data
• EDRN is specifically focusing on a granular model for annotating biomarkers, studies and scientific results
• EDRN has a set of EDRN Common Data Elements, which is used to provide standard data elements and values for the capture and exchange of data
[Figures: EDRN Biomarker Ontology Model; EDRN CDE Tools]
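The role of Common Data Elements, standard element names with standard permitted values, can be sketched as a small validation step applied before data is exchanged. The Python below is illustrative only: the `CommonDataElement` class, the `validate` function, and the two example elements with their values are invented for this sketch and are not the actual EDRN CDE definitions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CommonDataElement:
    """Hypothetical CDE: a named element with a closed set of values."""
    name: str
    definition: str
    permissible_values: frozenset


# Illustrative elements only; not the actual EDRN CDE definitions.
CDES = {
    cde.name: cde
    for cde in [
        CommonDataElement("SPECIMEN_TYPE", "Kind of biospecimen collected",
                          frozenset({"SERUM", "PLASMA", "TISSUE"})),
        CommonDataElement("STORAGE_STATE", "How the specimen is stored",
                          frozenset({"FROZEN", "FIXED"})),
    ]
}


def validate(record):
    """Return a list of problems found in one submitted record."""
    problems = []
    for name, value in record.items():
        cde = CDES.get(name)
        if cde is None:
            problems.append(f"{name}: not a known data element")
        elif value not in cde.permissible_values:
            problems.append(f"{name}: '{value}' is not a permissible value")
    for name in CDES:
        if name not in record:
            problems.append(f"{name}: missing")
    return problems


good = validate({"SPECIMEN_TYPE": "SERUM", "STORAGE_STATE": "FROZEN"})
bad = validate({"SPECIMEN_TYPE": "blood", "SITE": "A"})
```

A record that passes this kind of check uses the same names and values at every site, which is what makes records from different centers comparable.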
Earth Science Distributed Process Mgmt
• Leveraged the OODT software framework for constructing ground data systems for earth science missions
– Used OODT Catalog and Archive Service software
• Constructed “workflows”
– Execution of “processors” based on a set of rules
• Provided “lights out” operations
• Multiple missions
– SeaWinds
– QuikSCAT
– Orbiting Carbon Observatory (OCO)
– NP Sounder PEATE
– SMAP
[Figure: SeaWinds on ADEOS II (launched Dec 2002). Components: Spacecraft & Ancillary Files, Pre-Processors (PP), Science Level Processors (LP), Science Analysis and Quality Reporting (SA), Engineering Analysis (EA), Instrument Commands, File Transfer (FX), Product Delivery (PM), Science Products Released to PO.DAAC, User Interface (Process Monitoring & Control, Instrument Commanding, Data Verification), Data Management and Automatic Process Control (PM) using OODT.]
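The notion of a workflow as “execution of processors based on a set of rules” can be sketched as a tiny rule engine in Python. This is a sketch under assumptions rather than the OODT workflow implementation: `run_workflow` and all product names are hypothetical, and only the stage abbreviations (PP, LP, SA) come from the slide's diagram.

```python
def run_workflow(rules, available):
    """Fire each rule once, as soon as all of its inputs are available.

    rules: list of (processor_name, required_inputs, produced_output)
    available: names of products already on hand
    Returns (processors in the order they ran, processors still blocked).
    """
    available = set(available)
    pending = list(rules)
    order = []
    progressed = True
    while pending and progressed:
        progressed = False
        for rule in list(pending):
            name, inputs, output = rule
            if set(inputs) <= available:
                order.append(name)        # "execute" the processor
                available.add(output)     # its product may trigger other rules
                pending.remove(rule)
                progressed = True
    return order, [name for name, _, _ in pending]


# Stage names follow the diagram labels; the product names are made up.
RULES = [
    ("SA", ["level_product"], "quality_report"),
    ("LP", ["preprocessed"], "level_product"),
    ("PP", ["spacecraft_file", "ancillary_file"], "preprocessed"),
]

ran, blocked = run_workflow(RULES, {"spacecraft_file", "ancillary_file"})
stalled, waiting = run_workflow(RULES, {"spacecraft_file"})
```

Because each processor is triggered by the arrival of its inputs rather than by an operator, the chain runs unattended once the files land, which is the sense of “lights out” operations. When an input is missing, nothing runs and the blocked processors are reported.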
Supporting Climate Research
• Earth Observing System Data and Information System (EOSDIS) serves the data needs of NASA’s Earth scientists
• Two major legacies are left
– Archiving of the explosion in observational data in Distributed Active Archive Centers (DAACs)
  • Request-driven retrieval from the archive is time consuming
– Adoption of Hierarchical Data Format (HDF) for data files
  • Defined by and unique to each instrument but not necessarily consistent between instruments
• What are the next steps to accelerating use of an ever-increasing observational data collection?
– What data are available?
– What is the information content?
– How should it be interpreted in climate modeling research?
EOSDIS DAACs: Earth Observing System Data and Information System Distributed Active Archive Centers
[Figure: Cumulative Volume of L2+ Products at All DAACs. Cumulative volume in TB (axis 0 to 4,000) by fiscal year, FY00 to FY14.]
Current Data System
• System serves static data products. User must find, move, and manipulate all data him/herself.
• User must change spatial and temporal resolutions to match.
• User must understand instrument observation strategies and subtleties to interpret.
Climate Data eXchange (CDX)
• Develop an architecture that enables sharing of climate model output and NASA observational data
– Develop an architectural model that evaluates trade space of model
• Provide extensive server-side computational services
– Increase performance
– Subsetting, reformatting, re-gridding
• Deliver an “open source” toolkit
• Connect NASA and DOE
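Two of the server-side services named above, subsetting and re-gridding, can be sketched in plain Python. This is a minimal illustration, not CDX code: `subset` and `regrid_block_mean` are hypothetical names, the field values are made up, and the re-gridding is a simple unweighted block average. A production service on a latitude/longitude grid would normally weight by cell area and handle missing values.

```python
def subset(grid, row_range, col_range):
    """Return the part of a 2-D grid inside half-open index ranges."""
    r0, r1 = row_range
    c0, c1 = col_range
    return [row[c0:c1] for row in grid[r0:r1]]


def regrid_block_mean(grid, factor):
    """Coarsen a 2-D grid by averaging non-overlapping factor x factor blocks.

    Assumes both dimensions are exact multiples of `factor`.
    """
    rows, cols = len(grid), len(grid[0])
    if rows % factor or cols % factor:
        raise ValueError("grid dimensions must be multiples of factor")
    coarse = []
    for r in range(0, rows, factor):
        coarse_row = []
        for c in range(0, cols, factor):
            block = [grid[r + i][c + j]
                     for i in range(factor) for j in range(factor)]
            coarse_row.append(sum(block) / len(block))
        coarse.append(coarse_row)
    return coarse


# A made-up 4 x 4 field, e.g. values on a regular latitude/longitude grid.
field = [
    [1.0, 2.0, 3.0, 4.0],
    [5.0, 6.0, 7.0, 8.0],
    [9.0, 10.0, 11.0, 12.0],
    [13.0, 14.0, 15.0, 16.0],
]

region = subset(field, (0, 2), (2, 4))      # upper-right 2 x 2 corner
coarse = regrid_block_mean(field, 2)        # 4 x 4 -> 2 x 2
```

Running these steps next to the archive means only `region` or `coarse` has to travel to the user rather than the full field, which is one way server-side services could increase performance.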
Combining Instrument Data to enable Climate Research: AIRS and MLS
Combining AIRS and MLS requires:
– Rectifying horizontal, vertical and temporal mismatch
– Assessing and correcting for the instruments’ scene-specific error characteristics (see left diagram)
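As a small illustration of the temporal part of this mismatch, the Python sketch below pairs each observation from one instrument with the nearest-in-time observation from another, and drops pairs that are too far apart. It is a sketch under assumptions: `match_nearest_in_time`, the sample times, and the 15-second tolerance are all invented. Horizontal and vertical matching, and the scene-specific error correction, need additional steps not shown here.

```python
from bisect import bisect_left


def match_nearest_in_time(times_a, times_b, max_gap):
    """Pair each time in A with the nearest time in B, if close enough.

    times_b must be sorted ascending. Returns (index_a, index_b) pairs.
    """
    pairs = []
    for i, t in enumerate(times_a):
        pos = bisect_left(times_b, t)
        # The nearest neighbour is either just before or just after `pos`.
        candidates = [j for j in (pos - 1, pos) if 0 <= j < len(times_b)]
        if not candidates:
            continue
        j = min(candidates, key=lambda k: abs(times_b[k] - t))
        if abs(times_b[j] - t) <= max_gap:
            pairs.append((i, j))
    return pairs


# Made-up observation times, in seconds, for two instruments.
instrument_a = [0.0, 60.0, 120.0, 600.0]
instrument_b = [5.0, 70.0, 118.0, 300.0]

pairs = match_nearest_in_time(instrument_a, instrument_b, max_gap=15.0)
```

The last observation of the first instrument has no partner within the tolerance and is left unmatched, which is the kind of decision a combined product has to make explicit.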
Climate Data Exchange
[Figure: Key Questions to be Answered; Specific Tools (H2O, CO2, …)]
Summary
• Software is critical to supporting collaborative research in science
– Virtual organizations
– Transparent access to data
– End-to-end environments
• Software architecture is critical to
– Reducing cost of building science data systems
– Building virtual organizations
– Constructing software product lines
– Driving standards
• Science is still learning how to best leverage technology in a collaborative discovery environment, but significant progress is being made!
THANK YOU…
Dan Crichton
– [email protected]
– +1 818 354 9155