andrew hart nasa jet propulsion laboratory david kale whittier vpicu, children’s hospital la

Distributed, Modular Grid Software for Management and Exploration of Data in Patient-Centric Healthcare IT

Andrew Hart, NASA Jet Propulsion Laboratory

David Kale, Whittier VPICU, Children’s Hospital LA

Heather Kincaid, NASA Jet Propulsion Laboratory

Agenda

Health Care Data Challenges for Large-scale Research

Intro to Object Oriented Data Technology (OODT)

Applications of OODT in distributed scientific data systems

- NASA’s Planetary Data System

- NCI’s Early Detection Research Network

- Whittier Virtual Pediatric Intensive Care Unit (VPICU)

OODT as Open Source

Learning More & Keeping in Touch

Health care research

Increasingly collaborative

Increasingly geographically distributed

Scale, Complexity, Cost drive cooperation

Opportunities for discovery emerge through larger data sets

Growing need for technology to support “virtual organizations” carrying out distributed scientific research

OODT – What Is It?

“A data grid software infrastructure for constructing large-scale, distributed data-intensive systems”

Reference Architecture

Software Product Line

Reusable Components

Common Patterns

[Figure: Object Oriented Data Technology Framework. Labeled components: OODT/Science Web Tools, Archive Client, Navigation Service, Query Service, Product Service, Profile Service, Archive Service, Profile XML Data, Data System 1, Data System 2, Other Service 1, Other Service 2, Bridge to External Services]


A Brief History of OODT

Funded out of NASA’s Office of Space Science in 1998

Funded to address critical software engineering challenges affecting the design of mission science data systems

Designed, implemented, and refined over the past 7 years across multiple scientific domains:

- Planetary Science

- Earth Science

- Cancer Research

- Space Physics

- Modeling and Simulation

- Pediatric Intensive Care

Runner-up for NASA Software of the Year in 2003

Principles behind OODT

Division of Labor: Avoid making one component the workhorse; keep components configurable

Technology Independence: Guard against unexpected changes in the technology landscape

Metadata as a first-class citizen: Descriptions of resources come in handy

Separation of software and data models: Allow each to evolve independently

Modular, domain-agnostic: Pick and choose from adaptable components with defined interfaces

OODT Core Framework Services

Archive Service: Ingest data + metadata, processing algorithms, workflow support

Profile Service: Deliver metadata from an underlying data store

Product Service: Deliver data from an underlying data store

Query Service: Manage sets of profile servers

Data Grid Service: Interfaces and tools for connecting distributed resources over the web
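The division of labor among these services can be illustrated with a minimal Python sketch. It is not OODT's API: the class names, method names, and sample records below are all hypothetical, chosen only to show a query service fanning a metadata query out to several profile servers while a separate product server returns the data itself.

```python
from dataclasses import dataclass, field


@dataclass
class ProfileServer:
    """Answers metadata questions about one underlying data store."""
    name: str
    # Hypothetical store layout: resource id -> metadata dictionary.
    profiles: dict = field(default_factory=dict)

    def find(self, key, value):
        """Return ids of resources whose metadata has key == value."""
        return [rid for rid, meta in self.profiles.items()
                if meta.get(key) == value]


@dataclass
class ProductServer:
    """Hands back the data itself for a given resource id."""
    name: str
    products: dict = field(default_factory=dict)

    def retrieve(self, rid):
        return self.products[rid]


class QueryService:
    """Manages a set of profile servers and sends each query to all of them."""

    def __init__(self, profile_servers):
        self.profile_servers = list(profile_servers)

    def query(self, key, value):
        hits = []
        for server in self.profile_servers:
            for rid in server.find(key, value):
                hits.append((server.name, rid))
        return hits


# Illustrative records only; none of this is real PDS or EDRN data.
node_a = ProfileServer("node-a", {"img-1": {"target": "Mars"},
                                  "img-2": {"target": "Saturn"}})
node_b = ProfileServer("node-b", {"spec-9": {"target": "Mars"}})
products_a = ProductServer("node-a", {"img-1": b"...bytes...",
                                      "img-2": b"...bytes..."})

service = QueryService([node_a, node_b])
matches = service.query("target", "Mars")
print(matches)
```

Keeping metadata lookup (profiles) apart from data delivery (products) means a client can discover what exists across many sites before moving any large data.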






Applications of OODT: PDS

Planetary Data System

- National Aeronautics and Space Administration

- http://pds.nasa.gov

NASA Planetary Data System

Official NASA archive for all planetary data

9 nodes with data located at discipline sites

All missions must add their data (required as part of the mission Announcement of Opportunity)

Prior to October 2002, no ability to find and share data between PDS nodes

Planetary Data System: Distributed Planetary Science Archive

- Small Bodies Node: University of Maryland, College Park, MD

- Planetary Plasma Interactions Node: University of California Los Angeles, Los Angeles, CA

- Geosciences Node: Washington University, St. Louis, MO

- Imaging Node: JPL and USGS, Pasadena, CA and Flagstaff, AZ

- THEMIS Data Node: Arizona State University, Tempe, AZ

- Central Node: Jet Propulsion Laboratory, Pasadena, CA

- Navigation Ancillary Information Node: Jet Propulsion Laboratory, Pasadena, CA

- Rings Node: Ames Research Center, Moffett Field, CA

- Atmospheres Node: New Mexico State University, Las Cruces, NM

PDS Data Key Challenges

Challenges to building a science data system for the PDS:

NASA often flies unique, one-of-a-kind missions

A static infrastructure won’t work: Nodes and models change

Data stored at PDS nodes differs dramatically in structure

Missions are required to share science data results with the research community

PDS Data Architecture

Distributed data system environment with federated governance

Each site maintains its own database and infrastructure

Common domain information model (regularly updated) used to drive system implementations

- Ontology and Common Data Elements (based on ISO/IEC 11179)

Common query interface to distributed services

- Implemented with OODT Query Handlers

Software services that wrap existing data systems to share data

- Implemented with OODT Product & Profile servers

Publishing of data products to a common portal

- Implemented using Resource Description Framework (RDF)
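To make the RDF publishing step concrete, here is a minimal sketch that serializes one data product's metadata as N-Triples, one of the standard RDF syntaxes, using only the Python standard library. It makes no claim about how PDS generates its RDF: the `example.org` subject URI, the sample values, and the choice of Dublin Core predicates are assumptions made for illustration.

```python
def escape_literal(text):
    """Escape the characters that must not appear raw in an N-Triples literal."""
    return (text.replace("\\", "\\\\")
                .replace('"', '\\"')
                .replace("\n", "\\n")
                .replace("\r", "\\r"))


def to_ntriples(subject_uri, properties):
    """Serialize one resource as N-Triples lines.

    properties maps a predicate URI to a plain string value.
    Predicates are emitted in sorted order so the output is stable.
    """
    lines = []
    for predicate, value in sorted(properties.items()):
        lines.append(
            f'<{subject_uri}> <{predicate}> "{escape_literal(value)}" .')
    return "\n".join(lines)


# Dublin Core terms namespace; the product description itself is made up.
DC = "http://purl.org/dc/terms/"
product = {
    DC + "title": 'Example "Mars" image set',
    DC + "publisher": "Imaging Node",
}

doc = to_ntriples("http://example.org/products/img-1", product)
print(doc)
```

A portal that harvests such statements from every node can index them together without needing to know each node's internal storage format.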

PDS Architecture Decomposition

Applications of OODT: EDRN Early Detection Research Network

- Division of Cancer Prevention, National Cancer Institute

- http://cancer.gov/edrn

EDRN Overview

Focus: investigator-initiated, collaborative research on molecular, genetic and other biomarkers for cancer detection and risk assessment

Funded since 2000 by the Division of Cancer Prevention in the National Cancer Institute (NCI)

40+ geographically distributed centers performing parallel, complementary studies

Strong emphasis on the role of informatics

EDRN Participants

Biomarker Development Laboratories: Responsible for the development and characterization of new biomarkers or the refinement of existing biomarkers.

Biomarker Reference Laboratories: Serve as a Network resource for clinical and laboratory validation of biomarkers, which includes technological development, quality control, refinement, and high throughput.

Clinical Epidemiology and Validation Centers: Conduct clinical and epidemiological research regarding the clinical application of biomarkers.

Data Management and Coordinating Center: Coordinate EDRN research activities, provide logistic support, conduct statistical and computational research for data analysis, and analyze data for validation.

OODT and EDRN

OODT’s success led to interagency agreements with both NIH and NCI, resulting in:

EDRN Informatics Center: Supports EDRN's efforts through the development of software systems for information management. Located at NASA Jet Propulsion Laboratory, Pasadena, CA.

- Principal Investigator: Dan Crichton, JPL

EDRN Data

EDRN collects, generates, analyzes, and stores a wide variety of different data, including:

- Specimen Inventories: Map specimens collected (blood, sputum, etc.) to patient characteristics

- Studies and Publications: Information about studies conducted in the EDRN as well as published results (publications, outputs)

- Biomarkers: Information about indicators of early disease

- Science Data: Outputs of experiments on specimens, regarding biomarkers, driven by particular studies and protocols

EDRN Data Flow

Moving beyond the local laboratory

Scalability, interoperability

Case Study: ERNE

ERNE: EDRN Resource Network Exchange

Challenge: Overcome differences in local schema to develop a national distributed specimen information infrastructure

All sites run different software and follow their own procedures

Rely on a common information model for distributed querying, and provide site-specific mappings at each participant
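The idea of a common information model with site-specific mappings can be sketched as follows. This is an illustration under stated assumptions, not ERNE's implementation: the element names, local column names, and value codes are all invented, and a real deployment would draw them from the agreed common data elements.

```python
# Hypothetical mappings from common-model elements to each site's local
# column names and value codes. None of these names come from ERNE.
SITE_MAPPINGS = {
    "site_a": {
        "fields": {"specimen_type": "SPEC_TYP", "patient_sex": "GENDER"},
        "values": {"specimen_type": {"blood": "BLD", "sputum": "SPT"},
                   "patient_sex": {"female": "F", "male": "M"}},
    },
    "site_b": {
        "fields": {"specimen_type": "sample_kind", "patient_sex": "sex_code"},
        "values": {"specimen_type": {"blood": "1", "sputum": "2"},
                   "patient_sex": {"female": "2", "male": "1"}},
    },
}


def translate_query(common_query, mapping):
    """Rewrite a {common_element: common_value} query into one site's terms."""
    local = {}
    for element, value in common_query.items():
        if element not in mapping["fields"]:
            raise KeyError(f"site has no mapping for element {element!r}")
        local_field = mapping["fields"][element]
        # Fall back to the common value when the site defines no recoding.
        local_value = mapping["values"].get(element, {}).get(value, value)
        local[local_field] = local_value
    return local


# One query expressed once in the common model...
query = {"specimen_type": "blood", "patient_sex": "female"}

# ...and translated separately for every participating site.
translated = {site: translate_query(query, mapping)
              for site, mapping in SITE_MAPPINGS.items()}
print(translated)
```

Because each mapping lives with its site, a participant can change local software or schema and only its own mapping has to be updated.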

ERNE Architecture

Connecting Research

Designing the EDRN informatics architecture as a collection of well-defined components via OODT has simplified the process of building interfaces to non-EDRN systems

Wrappers can be built to link non-EDRN systems

Translators can be developed to deal with different semantic architectures

caBIG

- ERNE/caTissue Wrapper

EDRN-Canary Collaboration

- A cloud computing effort that shares raw science data via Amazon S3 between EDRN and the Canary group, which uses software from GenoLogics Life Sciences

EDRN Knowledge Environment

Building a Semantic Bioinformatics Grid for the EDRN

Lessons From EDRN

Architecture and a vision have been critical

- Technology hasn’t been as critical

- Keep it simple

Science support has been critical

- Getting buy-in and participation from domain experts is key

Incremental development and deployment

- Starting with a few sites was very helpful in understanding the issues

- We had both development sites and observer sites initially

The IRB process has been a big schedule driver

Distributed architecture can be a challenge

- Not all sites are up to maintaining the implementation

- Loosely coupled architecture with simple interfaces helped

Applications of OODT: VPICU

Whittier Virtual Pediatric Intensive Care Unit

- Childrens Hospital Los Angeles

- http://picu.net

Collaboration among 85 multidisciplinary pediatric intensive care units across the U.S.


Collaboration with VPICU

Laura P. and Leland K. Whittier Virtual Pediatric Intensive Care Unit (VPICU), founded in 1998 by clinicians at CHLA

Leverage advances in technology to:

- Improve patient care

- Educate practitioners

- Conduct research

- Reduce cost of providing care

VPICU Research Data

Real Health Care Data Set

- Massive, grows continuously

- Heterogeneous formats, types, etc.

- Incomplete, proprietary descriptions

- Fragmented across stores, organizational boundaries

- Incomplete, inconsistent

- Highly restricted (legal, privacy, ethical considerations)

Ideal Research Data Set

- Manageable size, static

- Homogeneous

- Complete, standardized descriptions and annotations

- Available as single unit

- Complete, consistent

- Minimal usage restrictions

Secondary use of observational clinical (EHR, monitor, annotations) data

VPICU Project Areas

Data extraction and management: Take data from proprietary stores, make it accessible

Transformation of data into knowledge: Process (and re-process) the data to extract insight

Data-driven decision support: Develop tools that learn continuously from the data

Distributed data-sharing over a national network: Enable research on scales previously impossible while maintaining security, privacy, compliance

Principles behind VPICU

Decouple from (proprietary) vendor databases

Integrate disparate data sources into a single model

Dynamically (re)generate research database(s)

- we don’t know for sure what queries will be most useful at the outset

Provide web services for multi-faceted access to the data to enable discovery & analysis

Support federation among multiple PICU sites
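The principle of dynamically (re)generating research databases can be sketched with Python's built-in `sqlite3` module. This is only an illustration of the idea, assuming normalized observations are already available as plain records; the table layout, field names, and sample values are hypothetical and do not describe the VPICU schema.

```python
import sqlite3


def build_research_db(observations, names):
    """Create a fresh in-memory database holding only the requested variables."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE obs (subject TEXT, time TEXT, name TEXT, value REAL)")
    rows = [(o["subject"], o["time"], o["name"], o["value"])
            for o in observations if o["name"] in names]
    db.executemany("INSERT INTO obs VALUES (?, ?, ?, ?)", rows)
    db.commit()
    return db


# Made-up normalized observations standing in for the integrated model.
observations = [
    {"subject": "s1", "time": "2010-03-01T07:59", "name": "hr", "value": 132.0},
    {"subject": "s1", "time": "2010-03-01T08:00", "name": "lactate", "value": 2.1},
    {"subject": "s2", "time": "2010-03-01T09:30", "name": "lactate", "value": 1.4},
]

# First research question: lactate only.
lactate_db = build_research_db(observations, {"lactate"})
mean_lactate = lactate_db.execute("SELECT AVG(value) FROM obs").fetchone()[0]

# A new question arrives later: regenerate with more variables instead of
# migrating the old database.
vitals_db = build_research_db(observations, {"hr", "lactate"})
row_count = vitals_db.execute("SELECT COUNT(*) FROM obs").fetchone()[0]

print(mean_lactate, row_count)
```

Treating each research database as disposable output of the integrated model is what lets the useful queries be discovered over time rather than fixed at the outset.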

“Algorithm” for VPICU Data System

1. Develop a common Domain Ontology to describe the information space

2. Develop compute services that support extraction of data from existing databases

3. Identify mechanisms to integrate information objects from disparate repositories and map them to the common domain ontology

4. Construct a set of online research databases to enable data mining and analysis

5. Deploy a “data grid” infrastructure of hardware & software to facilitate utilization of the data environment at CHLA and beyond (external entities and applications)

6. Deploy a set of compute services to support data mining and analysis

7. Develop an architectural plan and roadmap for scaling and integrating other PICUs

VPICU Architecture

File-based storage

Original data sources/stores at backend

- Proprietary schema

- Hardware that we don’t “own” or control

- Production systems (very load-sensitive)

- Legacy technologies (sometimes)

- Unreliable (can’t guarantee always available)

Includes:

- Hospital-wide commercial EHR system(s)

- Homegrown critical care database

- Specialized clinical applications

- Raw bedside monitor data

[Figure labels: Proprietary data sources (EHR, Homegrown, Clinical apps, Monitor data)]

Regular extraction of new data

VPICU-controlled resources (our hardware and software)

- Transform to VPICU schema

- Link data belonging to same patient

- May contain PHI

- Must be highly secure

Data at this stage is normalized, stored in a format suitable for ingestion into any number of research databases
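The transform-and-link stage described above can be sketched in a few lines of Python. Everything specific here is an assumption made for illustration: the source row layouts, the field names, and the crosswalk that ties each source's local identifier to a single patient key are hypothetical and are not the VPICU schema.

```python
from collections import defaultdict

# Hypothetical crosswalk from (source, local id) to one shared patient key.
CROSSWALK = {
    ("ehr", "MRN-001"): "p1",
    ("monitor", "bed7-2010-03-01"): "p1",
    ("ehr", "MRN-002"): "p2",
}


def normalize_ehr(row):
    """Map an EHR lab row onto the common observation shape."""
    return {"patient": CROSSWALK[("ehr", row["mrn"])],
            "time": row["collected_at"],
            "name": row["test"].lower(),
            "value": float(row["result"]),
            "source": "ehr"}


def normalize_monitor(row):
    """Map a bedside-monitor sample onto the common observation shape."""
    return {"patient": CROSSWALK[("monitor", row["session"])],
            "time": row["ts"],
            "name": row["signal"].lower(),
            "value": float(row["val"]),
            "source": "monitor"}


def link_by_patient(observations):
    """Group observations per patient and order each group by time."""
    timelines = defaultdict(list)
    for obs in observations:
        timelines[obs["patient"]].append(obs)
    for events in timelines.values():
        # ISO 8601 timestamps of equal precision sort correctly as strings.
        events.sort(key=lambda o: o["time"])
    return dict(timelines)


ehr_rows = [
    {"mrn": "MRN-001", "collected_at": "2010-03-01T08:00",
     "test": "Lactate", "result": "2.1"},
    {"mrn": "MRN-002", "collected_at": "2010-03-01T09:30",
     "test": "Lactate", "result": "1.4"},
]
monitor_rows = [
    {"session": "bed7-2010-03-01", "ts": "2010-03-01T07:59",
     "signal": "HR", "val": "132"},
]

observations = ([normalize_ehr(r) for r in ehr_rows]
                + [normalize_monitor(r) for r in monitor_rows])
timelines = link_by_patient(observations)
print(timelines["p1"])
```

One adapter per source keeps proprietary layouts at the edge, so everything downstream works against a single observation shape.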

Research databases

- Application-specific

- Optimized

- Contain de-identified or anonymized data

- VPICU ontology, schema

- Access via configurable web services
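As a rough illustration of what "de-identified" can mean before records reach a research database, the sketch below drops direct identifiers, replaces the patient identifier with a keyed hash, and shifts timestamps by a per-patient offset so that intervals within one patient's record are preserved. It is a teaching sketch under stated assumptions, not the VPICU procedure and not a compliance-grade method: the secret key, the field names, and the list of direct identifiers are hypothetical.

```python
import hashlib
import hmac
from datetime import datetime, timedelta

# Hypothetical secret; it would be stored apart from the research database.
SECRET = b"replace-with-a-protected-key"
# Illustrative only; a real list of direct identifiers is much longer.
DIRECT_IDENTIFIERS = {"name", "mrn", "address"}


def pseudonym(patient_id):
    """Stable keyed hash, so one patient always maps to the same subject id."""
    digest = hmac.new(SECRET, patient_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def day_offset(patient_id, max_days=365):
    """Per-patient shift of 1..max_days days, derived from the keyed hash."""
    digest = hmac.new(SECRET, b"shift:" + patient_id.encode("utf-8"),
                      hashlib.sha256).digest()
    return int.from_bytes(digest[:4], "big") % max_days + 1


def deidentify(record):
    """Return a copy without direct identifiers and with shifted time."""
    shift = timedelta(days=day_offset(record["mrn"]))
    out = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    out["subject"] = pseudonym(record["mrn"])
    out["time"] = (datetime.fromisoformat(record["time"]) - shift).isoformat()
    return out


first = deidentify({"mrn": "MRN-001", "name": "Example Patient",
                    "time": "2010-03-01T08:00", "hr": 132})
second = deidentify({"mrn": "MRN-001", "name": "Example Patient",
                     "time": "2010-03-02T08:00", "hr": 120})
print(first)
print(second)
```

Both records keep the same subject id and remain exactly one day apart, which is what time-series analyses need even though the original identifiers and dates are gone.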

What are “research databases?”

Designed for specific research questions, analytical techniques

Need not always be relational or databases at all

Available via web interfaces and software services

- Researcher using R can connect directly through R bindings

Examples:

- Relational database for traditional retrospective studies

- Search engine over free text clinical notes, etc.

- Patient/patient comparison, retrieval (find patient like this one)

- Data-backed patient simulator for “testing” interventions


OODT and the VPICU Data System

1. Develop an Information Model (Ontology) to describe the domain

2. Develop compute services that support extraction of data from existing CHLA databases (OODT Query Handlers)

3. Identify mechanisms to integrate information objects from disparate repositories and map them to the common domain ontology (OODT CAS crawler, catalog services)

4. Construct a set of online research databases to enable data mining and analysis (OODT Catalog and Archive Services)

5. Deploy a “data grid” infrastructure of hardware & software to facilitate utilization of the data environment at CHLA and beyond (external entities and applications) (OODT Data Grid Services)

6. Deploy a set of compute services to support data mining and analysis

7. Develop an architectural plan and roadmap for scaling and integrating other PICUs

OODT as Open Source

Jan 2010: OODT accepted as a podling in the Apache Software Foundation (ASF) Incubator

First NASA software licensed and incubating within the ASF

Learn more and track our progress at:

- http://incubator.apache.org/projects/oodt.html

Join the mailing list:

- [email protected]

Chat on IRC:

- #oodt on irc.freenode.net

Acknowledgements

Jet Propulsion Laboratory: Dan Crichton, Chris Mattmann, Sean Kelly, Steve Hughes, Amy Braverman, Thuy Tran

National Cancer Institute: Sudhir Srivastava, Christos Patriotis, Don Johnsey

Fred Hutchinson Cancer Research Center: Mark Thornquist, Ziding Feng, Jackie Dalhgren, Suzanna Reid

Children’s Hospital Los Angeles: Randall Wetzel, Robinder Khemani, Paul Vee, Jeff Terry, Robert Kaptan, Doug Hallam
