
Page 1

National Aeronautics and Space Administration
Jet Propulsion Laboratory
California Institute of Technology
Pasadena, California

Architecting Scientific Data Systems in the 21st Century
Dan Crichton
Principal Computer Scientist
Program Manager, Data Systems and Technology
NASA Jet Propulsion Laboratory

Page 2

Architecting the “End-to-End” Science Data System

• Focus on
  – science data generation
  – data capture, end-to-end
  – access to science data by the community

• Multiple scientific domains
  – Earth science
  – Planetary science
  – Biomedical research

• Applied technology research
  – SW/Sys architectures
  – Product lines
  – Emerging technologies

Page 3

Challenges in Science Data Systems

• A major challenge is organizing the wealth of science data, which requires both standards and data engineering/curation
  – Search and access depend on good curation
  – Community support is critical to capturing and curating the data in a manner that is useful to the community

• Usability of data continues to be a big challenge
  – Planetary science requires that ALL science data and/or science data pipelines be peer reviewed prior to release of data
  – Standard formats are critical

• Data sharing continues to be a challenge
  – Policies at the grant level, coupled with standard data management plans, are helping

• Computation and storage, historically major concerns, are now commodity services
  – Google, Microsoft Research, Yahoo! and Amazon are trying to provide services to e-science in the form of “cloud computing”

Page 4

National Research Council: Committee on Data Management and Computation

• CODMAC (1980s) identified seven core principles:
  – Scientific involvement;
  – Scientific oversight;
  – Data availability, including usable formats, ancillary data, timely distribution, validated data, and documentation;
  – Proper facilities;
  – Structured, transportable, adequately documented software;
  – Data storage in permanent and retrievable form; and
  – Adequate data system funding.

• CODMAC has led to national efforts to organize scientific results in partnership with the science community (particularly physical science)

• What does CODMAC mean in the 21st century?

Page 5

The “e-science” Trend…

• Highly distributed, multi-organizational systems
  – Systems are moving towards loosely coupled systems or federations in order to solve science problems which span center and institutional environments

• Sharing of data and services which allow for the discovery, access, and transformation of data
  – Systems are moving towards publishing of services and data in order to address data-intensive and computationally intensive problems
  – Infrastructures are being built to handle future demand

• Address complex modeling, interdisciplinary science and decision support needs
  – Need a dynamic environment where data and services can be used quickly as the building blocks for constructing predictive models and answering critical science questions

Page 6

JPL e-science Examples

Planetary Data System: Distributed Planetary Science Archive
• Small Bodies Node: University of Maryland, College Park, MD
• Planetary Plasma Interactions Node: University of California Los Angeles, Los Angeles, CA
• Geosciences Node: Washington University, St. Louis, MO
• Imaging Node: JPL and USGS, Pasadena, CA and Flagstaff, AZ
• THEMIS Data Node: Arizona State University, Tempe, AZ
• Central Node: Jet Propulsion Laboratory, Pasadena, CA
• Navigation Ancillary Information Node: Jet Propulsion Laboratory, Pasadena, CA
• Rings Node: Ames Research Center, Moffett Field, CA
• Atmospheres Node: New Mexico State University, Las Cruces, NM

National Data Sharing Infrastructure Supporting Collaboration in Biomedical Research for EDRN
• University of Michigan (CEC)
• Moffitt Cancer Center, Tampa (BDL)
• Creighton University (CEC)
• UT Health Science Center, San Antonio (CEC)
• University of Colorado (CEC)
• Fred Hutchinson Cancer Research Center, Seattle (DMCC)
• University of Pittsburgh (CEC)

Planetary Science Data System (4X)
• Highly diverse (40 years of science data from NASA and Int’l missions)
• Geographically distributed; moving int’l
• New centers plugging in (i.e. data nodes)
• Multi-center data system infrastructure
• Heterogeneous nodes with common interfaces
• Integrated based on enterprise-wide data standards
• Sits on top of COTS-based middleware

EDRN Cancer Research (8X)
• Highly diverse (30+ centers performing parallel studies using different instruments)
• Geographically distributed
• New centers plugging in (i.e. data nodes)
• Multi-center data system infrastructure
• Heterogeneous sites with common interfaces allowing access to distributed portals
• Integrated based on common data standards
• Secure (e.g. encryption, authentication, authorization)

Page 7

Architectural drivers in science data systems

• Increasing data volumes requiring new approaches for data production, validation, processing, discovery and data transfer/distribution (E.g., scalability relative to available resources)

• Increased emphasis on usability of the data (E.g., discovery, access and analysis)

• Increasing diversity of data sets and complexity for integrating across missions/experiments (E.g., common information model for describing the data)

• Increasing distribution of coordinated processing and operations (E.g., federation)

• Increased pressure to reduce cost of supporting new missions

• Increasing desire for PIs to have integrated tool sets to work with data products with their own environments (E.g. perform their own generation and distribution)

[Figure: Archive Volume Growth, Planetary Science Archive. Accumulated volume in TB (axis 0 to 90) by year, 1990 to 2008.]

Page 8

Architectural Focus

• Consistent distributed capabilities
  – Resource discovery (data, metadata, services, etc.), unified repository access, simple transformations, bulk transfer of multiple products, and unified catalog access
  – Move towards an era of “grid-ing” loosely coupled science systems

• Develop on-demand, shared services
  – Processing
  – Translation

• Deploy high-throughput data movement mechanisms

• Move capability up the mission pipeline

• Reduce local software solutions that do not scale
  – Increasing importance of developing an “enterprise” approach with common services

• Build value-added services and capabilities on top of the infrastructure

Page 9

Object Oriented Data Technology*

• Started in 1998 as a research and development task funded at JPL by the Office of Space Science to address
  – Application of information technology to space science
  – Providing an infrastructure for distributed data management
  – Researching methods for interoperability, knowledge management and knowledge discovery
  – Developing software frameworks for data management to reuse software, manage risk, reduce cost and leverage IT experience

• OODT initial focus
  – Data archiving: manage heterogeneous data products and resources in a distributed, metadata-driven environment
  – Data location and discovery: locate data products across multiple archives, catalogs and data systems
  – Data retrieval: retrieve diverse data products from distributed data sources and integrate
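The locate-then-retrieve pattern in the OODT initial focus can be illustrated with a small sketch. This is not OODT's API: `Profile`, `ProfileServer`, `ProductServer` and `find_and_fetch` are hypothetical names, the product identifiers are made up, and each "node" is an in-memory stand-in for a remote archive.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    """Metadata describing one data product and the node that holds it (hypothetical)."""
    product_id: str
    node: str
    metadata: tuple  # (key, value) pairs


class ProfileServer:
    """Answers metadata queries against a single node's catalog."""

    def __init__(self, profiles):
        self._profiles = list(profiles)

    def query(self, **criteria):
        wanted = set(criteria.items())
        return [p for p in self._profiles if wanted <= set(p.metadata)]


class ProductServer:
    """Returns product bytes from a single node's repository."""

    def __init__(self, products):
        self._products = dict(products)

    def retrieve(self, product_id):
        return self._products[product_id]


def find_and_fetch(profile_servers, product_servers, **criteria):
    """Query every node's profiles, then fetch each match from the node that holds it."""
    results = {}
    for server in profile_servers:
        for profile in server.query(**criteria):
            node = product_servers[profile.node]
            results[profile.product_id] = node.retrieve(profile.product_id)
    return results


# Two toy nodes with made-up products.
imaging_profiles = ProfileServer([
    Profile("img-001", "imaging", (("target", "MARS"), ("type", "image"))),
    Profile("img-002", "imaging", (("target", "MOON"), ("type", "image"))),
])
geo_profiles = ProfileServer([
    Profile("geo-001", "geosciences", (("target", "MARS"), ("type", "spectrum"))),
])
product_servers = {
    "imaging": ProductServer({"img-001": b"mars image", "img-002": b"moon image"}),
    "geosciences": ProductServer({"geo-001": b"mars spectrum"}),
}

mars_products = find_and_fetch([imaging_profiles, geo_profiles], product_servers, target="MARS")
print(sorted(mars_products))
```

The caller asks one question in terms of metadata and never needs to know which node holds which product.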

[Figure: Object Oriented Data Technology Framework. Components shown: OODT/Science Web Tools, Archive Client, Navigation Service, Query Service, Product Service, Profile Service, Archive Service, Profile XML Data, Data System 1, Data System 2, Other Service 1, Other Service 2, Bridge to External Services.]

* 2003 NASA Software of the Year Runner Up

Page 10

Architectural Principles*

• Separate the technology and the information architecture

• Encapsulate the messaging layer to support different messaging implementations

• Encapsulate individual data systems to hide uniqueness

• Provide data system location independence

• Require that communication between distributed systems use metadata

• Define a model for describing systems and their resources

• Provide scalability in linking both number of nodes and size of data sets

• Allow systems using different data dictionaries and metadata implementations to be integrated

• Leverage existing software, where possible (e.g., open source, etc)
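One of the principles above, integrating systems that use different data dictionaries, can be sketched by translating each system's local keywords into shared terms at the system boundary. The system names, keywords and the `to_common` helper below are all hypothetical and chosen only for illustration.

```python
# Hypothetical per-system mappings from local keywords to shared dictionary terms.
LOCAL_TO_COMMON = {
    "system_a": {"TARGET_NAME": "target", "START_TIME": "start_time"},
    "system_b": {"body": "target", "obs_begin": "start_time"},
}


def to_common(system, record):
    """Rename a record's keys into the shared dictionary; unmapped keys are dropped."""
    mapping = LOCAL_TO_COMMON[system]
    return {mapping[key]: value for key, value in record.items() if key in mapping}


record_a = to_common(
    "system_a",
    {"TARGET_NAME": "MARS", "START_TIME": "2002-10-01", "FILTER": "RED"},
)
record_b = to_common("system_b", {"body": "MARS", "obs_begin": "2002-10-01"})

# Once translated, records from both systems can be compared or queried uniformly.
print(record_a == record_b)
```

Keeping the mapping as data rather than code means a new system can be integrated by adding one dictionary entry, without touching the systems already connected.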

* Crichton, D., Hughes, J. S., Hyon, J., Kelly, S. “Science Search and Retrieval using XML”, Proceedings of the 2nd National Conference on Scientific and Technical Data, National Academy of Science, Washington DC, 2000.

Page 11

Distributed Architecture

[Figure: three-layer data grid. Visualization tools, analysis tools and web search tools sit above the OODT Reusable Data Grid Framework, which connects through OODT APIs to mission, biomedical and engineering data repositories.]

1. Science data tools and applications use “APIs” to connect to a virtual data repository
2. Middleware creates the data grid infrastructure connecting distributed heterogeneous systems and data
3. Repositories for storing and retrieving many types of data

Page 12

Software Implementation

• OODT is open source

• Developed using open source software (i.e. Java/J2EE and XML)

• Implemented reusable, extensible Java-based software components
  – Core software for building and connecting data management systems

• Provided messaging as a “plug-in” component that can be replaced independently of the other core components. Messaging components include:
  – CORBA, Java RMI, JXTA, Web Services, etc.
  – REST seems to have prevailed

• Provided client APIs in Java, C++, HTTP, Python, IDL

• Simple installation on a variety of platforms (Windows, Unix, Mac OS X, etc.)

• Used international data architecture standards
  – ISO/IEC 11179: Specification and Standardization of Data Elements
  – Dublin Core Metadata Initiative
  – W3C’s Resource Description Framework (RDF) from the Semantic Web community
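The idea of messaging as a replaceable plug-in can be sketched as an interface that client code depends on, with interchangeable implementations behind it. OODT itself is Java-based; the Python below is only an illustration, and `Transport`, `InProcessTransport`, `JsonTransport`, `CatalogClient` and the toy catalog are hypothetical names, not part of OODT.

```python
import json
from abc import ABC, abstractmethod


class Transport(ABC):
    """Hypothetical messaging plug-in: carries a request dict, returns a response dict."""

    @abstractmethod
    def send(self, request: dict) -> dict:
        ...


class InProcessTransport(Transport):
    """Calls the handler directly, as if client and server shared a process."""

    def __init__(self, handler):
        self._handler = handler

    def send(self, request):
        return self._handler(request)


class JsonTransport(Transport):
    """Serializes to JSON text and back, standing in for a wire protocol."""

    def __init__(self, handler):
        self._handler = handler

    def send(self, request):
        wire_request = json.dumps(request)
        wire_response = json.dumps(self._handler(json.loads(wire_request)))
        return json.loads(wire_response)


def catalog_handler(request):
    """Toy server-side handler with a made-up catalog."""
    catalog = {"MARS": ["img-001", "geo-001"]}
    return {"products": catalog.get(request["target"], [])}


class CatalogClient:
    """Client code depends only on the Transport interface."""

    def __init__(self, transport: Transport):
        self._transport = transport

    def products_for(self, target):
        return self._transport.send({"target": target})["products"]


for transport in (InProcessTransport(catalog_handler), JsonTransport(catalog_handler)):
    print(type(transport).__name__, CatalogClient(transport).products_for("MARS"))
```

Because `CatalogClient` never names a concrete transport, swapping one messaging mechanism for another changes a single constructor argument.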

Page 13

Characteristics of Informatics in Space Science

• Often unique, one-of-a-kind missions
  – Can drive technological changes

• Instruments are competed and developed by academic, industry and industrial partners
  – Highly distributed acquisition and processing across partner organizations
  – Highly diverse data sets given heterogeneity of the instruments and the targets (i.e. solar system)

• Missions are required to share science data results with the research community, requiring:
  – Common domain information model used to drive system implementations
  – Expert scientific help to the user community on using the data
  – Peer review of data results to ensure quality
  – Distribution of data to the community

• Planetary science data from NASA (and some international) missions is deposited into the Planetary Data System

Page 14

Distributed Space Architecture

[Figure: distributed space architecture. Elements shown: Spacecraft and Scientific Instruments (spacecraft / lander, relay satellite), Data Acquisition and Command, Mission Operations, Instrument / Sensor Operations, Science Data Processing, Science Data Archive, Data Analysis and Modeling, Science Team, External Science Community. Information exchanged: Primitive Information Objects, Simple Information Object, Telemetry Information Package, Science Information Packages, Instrument Planning Information Object, Planning Information Object, Science Products (Information Objects).]

• Common meta models for describing space information objects
• Common data dictionary end-to-end

Page 15

Planetary Science Data Standards

• JPL has led and managed development of the planetary science data standards for NASA and the international community
  – ESA, ISRO, JAXA, etc. are leveraging planetary science data standards
  – A diverse model used across the community that unifies data systems

• Core “information” model that has been used to describe every type of data from NASA’s planetary exploration missions and instruments
  – ~4000 different types of data

[Figure: a PDS Image Label (ODL) describes an image; the same structure is shown as a PDS Image Class (object-oriented).]
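The relationship between a label and an object-oriented class can be sketched as follows. The label text is a made-up, heavily simplified example in the `KEYWORD = VALUE` style of ODL; real PDS labels carry many more keywords and richer syntax. The `Image` class and `parse_image_label` function are hypothetical and not part of any PDS tool.

```python
from dataclasses import dataclass

# Made-up, simplified label text; values are for illustration only.
SAMPLE_LABEL = """\
OBJECT = IMAGE
  LINES = 1024
  LINE_SAMPLES = 512
  SAMPLE_BITS = 8
END_OBJECT = IMAGE
END
"""


@dataclass
class Image:
    """Object-oriented view of the image described by the label."""
    lines: int
    line_samples: int
    sample_bits: int

    def size_in_bytes(self) -> int:
        return self.lines * self.line_samples * self.sample_bits // 8


def parse_image_label(text: str) -> Image:
    """Collect the keywords inside the IMAGE object and build an Image from them."""
    fields = {}
    inside_image = False
    for raw in text.splitlines():
        line = raw.strip()
        if line == "END":
            break
        if "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        if key == "OBJECT" and value == "IMAGE":
            inside_image = True
        elif key == "END_OBJECT" and value == "IMAGE":
            inside_image = False
        elif inside_image:
            fields[key] = value
    return Image(
        lines=int(fields["LINES"]),
        line_samples=int(fields["LINE_SAMPLES"]),
        sample_bits=int(fields["SAMPLE_BITS"]),
    )


image = parse_image_label(SAMPLE_LABEL)
print(image, image.size_in_bytes())
```

The label stays a plain-text description that travels with the data, while tools work with the typed object built from it.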

Page 16

2001 Mars Odyssey: A paradigm change

• Pre-Oct 2002, no unified view across distributed operational planetary science data repositories
  – Science data distributed across the country
  – Science data distributed on physical media

• Planetary data archive increasing from 4 TB in 2001 to 100 TB in 2009
  – Traditional distribution infeasible due to cost and system constraints
  – Mars Odyssey could not be distributed using the traditional method

• Current work with the OODT Data Grid Framework has provided the technology for NASA’s planetary data management infrastructure to
  – Support online distribution of science data to planetary scientists
  – Enable interoperability between nine institutions
  – Support real-time access to data products
  – Provide uniform software interfaces to all Mars Odyssey data, allowing scientists and developers to link in their own tools
  – Operational October 1, 2002

• Moving to multi-terabyte online data movement in 2009

[Figure: 2001 Mars Odyssey]
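The growth from 4 TB in 2001 to 100 TB in 2009 can be turned into an average annual rate with a short calculation. The sketch assumes smooth compound growth between the two quoted points, which the slide does not state; the variable names are illustrative.

```python
import math

start_tb, end_tb = 4.0, 100.0  # archive size in 2001 and 2009, as quoted above
years = 2009 - 2001

# Assumption: smooth compound growth between the two quoted points.
annual_growth = (end_tb / start_tb) ** (1 / years) - 1
doubling_time_years = math.log(2) / math.log(1 + annual_growth)

print(f"annual growth: {annual_growth:.1%}")
print(f"doubling time: {doubling_time_years:.2f} years")
```

Under that assumption the archive grows by roughly 50% per year and doubles about every 1.7 years, which helps explain why distribution on physical media stopped being practical.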

Page 17

Explosion of Data in Biomedical Research


• “To thrive, the field that links biologists and their data urgently needs structure, recognition and support. The exponential growth in the amount of biological data means that revolutionary measures are needed for data management, analysis and accessibility. Online databases have become important avenues for publishing biological data.” – Nature Magazine, September 2008

• The capture and sharing of data to support collaborative research is leading to new opportunities to examine data in many sciences

– NASA routinely releases “data analysis programs” to analyze and process existing data

• EDRN has become a leader in building informatics technologies and constructing databases for cancer research. The tools and technologies are now ready for wider use!

[Figure: EDRN Data Repositories]

Page 18

Bioinformatics: National Cancer Institute Early Detection Research Network (EDRN)

• Initiated in 2000, renewed in 2005

• 100+ researchers (both members and associated members)

• ~40+ research institutions

• Mission of EDRN
  – Discover, develop and validate biomarkers for cancer detection, diagnosis and risk assessment
  – Conduct correlative studies/trials to validate biomarkers as indicators of early cancer, pre-invasive cancer, risk, or as surrogate endpoints
  – Develop quality assurance programs for biomarker testing and evaluation
  – Forge public-private partnerships

• Leverage experience building distributed planetary science data systems for biomedicine

Page 19

EDRN Knowledge Environment

• EDRN has been a pioneer in the use of informatics technologies to support biomarker research

• EDRN has developed a comprehensive infrastructure to support biomarker data management across EDRN’s distributed cancer centers
  – Twelve institutions are sharing data
  – Same architectural framework as planetary science

• It supports capture of and access to a diverse set of information and results
  – Biomarkers
  – Proteomics
  – Biospecimens
  – Various technologies and data products (image, micro-satellite, …)
  – Study management

[Figure: data and computers interconnected to form a virtual database of integrated cancer resources (specimens, images, assays, biomarkers, etc.)]

Page 20

EDRN’s Ontology Model

• EDRN has developed a high-level ontology model for biomarker research which provides standards for the capture of biomarker information across the enterprise

• Specific models are derived from this high-level model
  – Model of biospecimens
  – Model for each class of science data

• EDRN is specifically focusing on a granular model for annotating biomarkers, studies and scientific results

• EDRN has a set of EDRN Common Data Elements (CDEs) which is used to provide standard data elements and values for the capture and exchange of data

[Figures: EDRN Biomarker Ontology Model; EDRN CDE Tools]
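The role of common data elements in capture and exchange can be sketched as a check that every record uses only agreed element names and permitted values. The element names, permitted values and `validate` function below are hypothetical and are not taken from the actual EDRN CDEs.

```python
# Hypothetical common data elements: name -> (type, permitted values or None).
COMMON_DATA_ELEMENTS = {
    "specimen_type": (str, {"serum", "plasma", "tissue"}),
    "age_at_collection": (int, None),
}


def validate(record):
    """Return a list of problems found when checking a record against the CDEs."""
    problems = []
    for name, (expected_type, permitted) in COMMON_DATA_ELEMENTS.items():
        if name not in record:
            problems.append(f"missing element: {name}")
            continue
        value = record[name]
        if not isinstance(value, expected_type):
            problems.append(f"{name}: expected {expected_type.__name__}")
        elif permitted is not None and value not in permitted:
            problems.append(f"{name}: value {value!r} not permitted")
    for name in record:
        if name not in COMMON_DATA_ELEMENTS:
            problems.append(f"unknown element: {name}")
    return problems


good = {"specimen_type": "serum", "age_at_collection": 54}
bad = {"specimen_type": "saliva", "site": "A"}

print(validate(good))
print(validate(bad))
```

Records that pass the same check at every site can be pooled without per-site translation, which is the practical benefit of agreeing on elements and values up front.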

Page 21

Earth Science Distributed Process Mgmt

• Leveraged OODT software framework for constructing ground data systems for earth science missions
  – Used OODT Catalog and Archive Service software

• Constructed “workflows”
  – Execution of “processors” based on a set of rules

• Provided “lights out” operations

• Multiple missions
  – SeaWinds
  – QuikSCAT
  – Orbiting Carbon Observatory (OCO)
  – NPP Sounder PEATE
  – SMAP
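The idea of executing “processors” based on a set of rules can be sketched as a loop that fires any processor whose inputs exist and whose output is still missing. This is an illustration only, not the OODT workflow implementation; the processor functions, the `RULES` table and `run_workflow` are hypothetical.

```python
def pre_process(files):
    """Stand-in pre-processor: turns raw input into a level-0 product."""
    return {**files, "level0": f"L0({files['raw']})"}


def level_process(files):
    """Stand-in science processor: turns level 0 into level 1."""
    return {**files, "level1": f"L1({files['level0']})"}


def quality_report(files):
    """Stand-in analysis step: records that the level-1 product was checked."""
    return {**files, "report": f"QA({files['level1']})"}


# Each rule: (processor, inputs that must exist, output it produces).
# The rules are deliberately listed out of order; the data decide what runs.
RULES = [
    (quality_report, {"level1"}, "report"),
    (level_process, {"level0"}, "level1"),
    (pre_process, {"raw"}, "level0"),
]


def run_workflow(files):
    """Fire any rule whose inputs exist and whose output is missing, until none apply."""
    executed = []
    progress = True
    while progress:
        progress = False
        for processor, needs, produces in RULES:
            if needs.issubset(files) and produces not in files:
                files = processor(files)
                executed.append(processor.__name__)
                progress = True
    return files, executed


products, order = run_workflow({"raw": "telemetry.dat"})
print(order)
print(products["report"])
```

Because execution is driven by which products exist rather than by a fixed script, the loop can run unattended as new files arrive, which is the sense of “lights out” operations.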

[Figure: SeaWinds on ADEOS II (launched Dec 2002) ground data system. Components shown: Spacecraft & Ancillary Files, Instrument Commands, File Transfer (FX), Pre-Processors (PP), Science Level Processors (LP), Science Analysis and Quality Reporting (SA), Engineering Analysis (EA), Product Delivery (PM), User Interface (process monitoring and control, instrument commanding, data verification), and Data Management and Automatic Process Control (PM) using OODT. Science products are released to PO.DAAC.]

Page 22

Supporting Climate Research

• Earth Observing System Data and Information System (EOSDIS) serves NASA earth scientists’ data needs

• Two major legacies are left
  – Archiving of explosion in observational data in Distributed Active Archive Centers (DAACs)
    • Request-driven retrieval from archive is time consuming
  – Adoption of Hierarchical Data Format (HDF) for data files
    • Defined by and unique to each instrument but not necessarily consistent between instruments

• What are the next steps to accelerating use of an ever-increasing observational data collection?
  – What data are available?
  – What is the information content?
  – How should it be interpreted in climate modeling research?

Page 23

EOSDIS DAACs
Earth Observing System Data and Information System Distributed Active Archive Centers

Page 24

EOSDIS DAACs
Earth Observing System Data and Information System Distributed Active Archive Centers

[Figure: Cumulative Volume of L2+ Products at All DAACs. Cumulative volume in TB (axis 0 to 4,000) by fiscal year, FY00 to FY14.]

Page 25

Current Data System

• System serves static data products. User must find, move, and manipulate all data him/herself.

• User must change spatial and temporal resolutions to match.

• User must understand instrument observation strategies and subtleties to interpret.

Page 26

Climate Data eXchange (CDX)

• Develop an architecture that enables sharing of climate model output and NASA observational data
  – Develop an architectural model that evaluates trade space of model

• Provide extensive server-side computational services
  – Increase performance
  – Subsetting, reformatting, re-gridding

• Deliver an “open source” toolkit

• Connect NASA and DOE
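Two of the server-side services named above, subsetting and re-gridding, can be sketched on a small grid. This is a toy illustration in plain Python with made-up values, not CDX code: `subset` and `regrid_by_averaging` are hypothetical names, and block averaging is only one of several possible re-gridding methods.

```python
def subset(grid, lats, lons, lat_range, lon_range):
    """Keep only the rows and columns whose coordinates fall inside the ranges."""
    rows = [i for i, lat in enumerate(lats) if lat_range[0] <= lat <= lat_range[1]]
    cols = [j for j, lon in enumerate(lons) if lon_range[0] <= lon <= lon_range[1]]
    values = [[grid[i][j] for j in cols] for i in rows]
    return values, [lats[i] for i in rows], [lons[j] for j in cols]


def regrid_by_averaging(grid, factor):
    """Coarsen a grid by averaging non-overlapping factor x factor blocks."""
    n_rows, n_cols = len(grid), len(grid[0])
    if n_rows % factor or n_cols % factor:
        raise ValueError("grid dimensions must be multiples of the factor")
    coarse = []
    for i in range(0, n_rows, factor):
        row = []
        for j in range(0, n_cols, factor):
            block = [grid[i + di][j + dj] for di in range(factor) for dj in range(factor)]
            row.append(sum(block) / len(block))
        coarse.append(row)
    return coarse


# Toy 4 x 4 field on a 10-degree grid (made-up values).
lats = [0, 10, 20, 30]
lons = [100, 110, 120, 130]
field = [
    [1.0, 2.0, 3.0, 4.0],
    [5.0, 6.0, 7.0, 8.0],
    [9.0, 10.0, 11.0, 12.0],
    [13.0, 14.0, 15.0, 16.0],
]

region, region_lats, region_lons = subset(field, lats, lons, (0, 10), (100, 130))
coarse = regrid_by_averaging(field, 2)
print(region)
print(coarse)
```

Running these steps next to the archive means only the reduced result crosses the network, which is the performance argument for server-side services.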

Page 27

Combining Instrument Data to enable Climate Research: AIRS and MLS

Combining AIRS and MLS requires:

– Rectifying horizontal, vertical and temporal mismatch

– Assessing and correcting for the instruments’ scene-specific error characteristics (see left diagram)

Page 28

Climate Data Exchange

[Figure: Climate Data Exchange. Labels shown: Key Questions to be Answered; Specific Tools (H2O, CO2, …)]

Page 29

Summary

• Software is critical to supporting collaborative research in science
  – Virtual organizations
  – Transparent access to data
  – End-to-end environments

• Software architecture is critical to
  – Reducing cost of building science data systems
  – Building virtual organizations
  – Constructing software product lines
  – Driving standards

• Science is still learning how to best leverage technology in a collaborative discovery environment, but significant progress is being made!

Page 30

THANK YOU…

Dan Crichton
– [email protected]
– +1 818 354 9155