National Cancer Institute

The Cancer Biomedical Informatics Grid™ (caBIG™)

2006 CODATA Conference, Beijing, China

Mary Jo Deering, Ph.D.

Director, Informatics Dissemination

NCI Center for Bioinformatics

Cancer Biomedical Informatics Grid™ (caBIG™)

• Common, widely distributed infrastructure permits research community to focus on innovation

• Shared vocabulary, data elements, data models facilitate information exchange

• Collection of interoperable applications developed to common standards

• Raw published cancer research data is available for mining and integration

• caBIG™ infrastructure and tools are widely applicable outside cancer

• caBIG™ components may be used by anyone

caBIG™ principles

• Open source

• Open access

• Open development

• Federated

caBIG™’s Informatics Core

caBIG™ Operational Structure

• caBIG™ Architecture Workspace

• caBIG™ Vocabularies and Common Data Elements Workspace

• Strategic Planning Workspace

• Training Workspace

• Integrative Cancer Research Workspace

• In Vivo Imaging Workspace

• Tissue Banks & Pathology Tools Workspace

• Data Sharing & Intellectual Capital Workspace

• Clinical Trials Mgmt Systems Workspace

2006 Clinical Trial Tools Development Activities

• caAERS

• Patient Study Calendar

• Lab Data Hub

• Making other CTMS systems caBIG compatible

Clinical Research IT Infrastructure

[Diagram. Labels: Clinical Systems (Labs, EMR, Tissue, etc.); Patient Health Record; De-identification Services; Translation Service; HL7 transactional database; Clinical Research Information Exchange; Clinical Trials; Clinical Data Mgmt; EDC; Adverse Events; Participant Registry; Research Data Warehouse; Lifecycle Management; External Reporting (FDA, Sponsor, NCI, other); HL7/CAM SDK. Interface labels: HL7 v2.x and other; HL7 v3; HL7 v3 and Janus.]

Integrative Cancer Research

• Microarray Repositories

• Data Analysis & Statistics

• Informatics for Proteomics

• Genome Annotation

• Pathways Tools

• Translational Tools

• Population Sciences and Cancer Control

Tissue Banks and Pathology Tools

• caTISSUE Core (WU) – Core specimen handling and tracking functions

• caTISSUE Clinical Annotation Engine (UPMC) - Annotation of specimens with clinical data

• caTIES (UPMC) - Text extraction and de-identification of surgical pathology reports

caTISSUE Core: Register Specimen Group

caIMAGE – Cancer Images Database

• caIMAGE allows researchers to submit and retrieve images and annotations.

• Images are streamed for efficient access.

• Researchers can search images by tissue, diagnosis, and experiment information.

• caIMAGE uses common terminology from the NCI Enterprise Vocabulary Services (EVS).

caBIG™ Compatibility

• caBIG™ is all about interoperability

– Key is to create tools for sharing information

• Extensible infrastructure

– Expandable and modular software to plug into existing systems so current development efforts are not wasted

• Ensures partnerships

– Encourages relationships among academia, government, and industry

• Evolving

– Compatibility guidelines are being translated into certification procedures

• Compatibility Guidelines at https://cabig.nci.nih.gov/guidelines_documentation

Interoperability

ability of a system to access and use the parts or equipment of another system

• Syntactic interoperability

• Semantic interoperability

caCORE

• Bioinformatics Objects

• Enterprise Vocabulary

• Common Data Elements

• Security

Professional Documentation

caCORE Software Development Kit Components

• UML Modeling Tool (any with XMI export)

• Semantic Connector (concept binding utility)

• UML Loader (model registration in caDSR)

• Codegen (middleware code generator)

• Security Adaptor (Common Security Module)

• caCORE SDK generates a caBIG-Silver compliant system

Grid Technology in caBIG™

• What is a ‘Grid’?

– “A Grid is a system that coordinates resources that are not subject to centralized control using standard, open, general-purpose protocols and interfaces to deliver nontrivial qualities of service.” Ian Foster, Grid Today, July 20, 2002

• Grid Technology supplies two useful components to a network of computers:

– Advertising: Inform the network about the capabilities of new systems

– Discovery: Allow users to find resources that meet their needs.

• The caGrid project is the ‘Grid in caBIG™’: the actual infrastructure that data and analytical services will use to interoperate.

• The current caGrid is version 0.5; caGrid 1.0 is expected in December.

• The combination of data and analytical service nodes in caBIG™ produced a design that uses a variety of standard Grid technologies, including the Globus Toolkit, OGSA-DAI, DQP, and GRAM, among others.

Test bed Infrastructure

caGrid 0.5 Test Bed

• caBIG™ infrastructure and tools are widely applicable outside cancer

• caBIG™ components may be used by anyone

Contact Information

Mary Jo Deering, Ph.D.

Director for Informatics Dissemination

NCI Center for Bioinformatics

National Cancer Institute

National Institutes of Health, USDHHS

6116 Executive Blvd. - #403

Rockville, MD 20852

(o) 301-496-3458

(f) 301-480-4222

Additional Background and Detail

• The following slides were not included in the presentation.

Current caBIG™ community

• NCI-designated Cancer Centers (50)

– Academic Centers (integrated into broader biomedical infrastructure)

– Stand-alone (community leaders)

– Community outreach

• NCI Divisions and Programs

• National Institutes of Health

• Other Government Agencies

• Industry

• International Groups

– Standards development organizations

– U.K.’s National Cancer Research Institute

• ~900 active participants

Four Domain Workspaces and two Cross Cutting Workspaces have been launched

DOMAIN WORKSPACE 1: Clinical Trial Management Systems

Addresses the need for consistent, open and comprehensive tools for clinical trials management.

DOMAIN WORKSPACE 2: Integrative Cancer Research

Provides tools and systems to enable integration and sharing of information.

DOMAIN WORKSPACE 3: Tissue Banks & Pathology Tools

Provides for the integration, development, and implementation of tissue and pathology tools.

DOMAIN WORKSPACE 4: Imaging

Provides for the sharing and analysis of in vivo imaging data.

CROSS CUTTING WORKSPACE 1: Vocabularies & Common Data Elements

Responsible for evaluating, developing, and integrating systems for vocabulary and ontology content, standards, and software systems for content delivery.

CROSS CUTTING WORKSPACE 2: Architecture

Develops architectural standards and the architecture necessary for other workspaces.

Strategic Level Workspaces

Strategic Planning

Assists in identifying strategic priorities for the development and evolution of the caBIG™ effort.

Training

Develops strategies for providing training in the use of caBIG™-developed resources, including online tutorials, workshops, and training programs.

Data Sharing and Intellectual Capital

Addresses issues related to the sharing of data, applications and infrastructure both within the consortium and in the larger cancer research community.

http://rembrandt.nci.nih.gov

REMBRANDT: Building a robust translational research framework for brain tumor studies

REpository of Molecular BRAin Neoplasia DaTa

Rembrandt Knowledgebase

[Diagram. Labels: Expression array data; Clinical data; SNP array data; Proteomics data; caIntegrator DataMart; caBIG Analytic Tools; Better understanding; Better treatments.]

caBIG™ Compatibility Guidelines

• The caBIG™ compatibility guidelines are designed to ensure that systems designed in a federated environment are still interoperable on the caBIG™ Grid, both syntactically and semantically

• Since achieving interoperability is a process, caBIG™ recognizes four levels of compatibility, starting from Legacy (not interoperable) through Bronze, Silver and Gold (fully interoperable)

• caBIG™ compatibility is all about interfaces rather than the scientific content of the system
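
The four levels named above form an ordered scale, which can be sketched as an ordered enumeration. This is only an illustration: the numeric values and the `meets` helper are assumptions made for this example and are not part of the caBIG™ guidelines.

```python
from enum import IntEnum


class CompatibilityLevel(IntEnum):
    """The four levels named on the slide, ordered from least to most interoperable.

    The integer values are arbitrary and exist only to give the levels an order.
    """
    LEGACY = 0   # not interoperable
    BRONZE = 1
    SILVER = 2
    GOLD = 3     # fully interoperable


def meets(level: CompatibilityLevel, required: CompatibilityLevel) -> bool:
    """Return True when a system's level is at or above the required level."""
    return level >= required


# A Silver system satisfies a Bronze requirement; a Legacy system does not.
print(meets(CompatibilityLevel.SILVER, CompatibilityLevel.BRONZE))
print(meets(CompatibilityLevel.LEGACY, CompatibilityLevel.BRONZE))
```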

caBIG Compatibility Guidelines

[Table not recoverable; surviving labels: one SYNTACTIC and three SEMANTIC.]

Common Data Elements

• What do all those data classes and attributes actually mean, anyway?

• Data descriptors or “semantic metadata” required

• Computable, commonly structured, reusable units of metadata are “Common Data Elements” or CDEs.

• NCI uses the ISO/IEC 11179 standard for metadata structure and registration

• Semantics all drawn from Enterprise Vocabulary Services resources

Cancer Data Standards Repository (caDSR)

• The basic caDSR unit of metadata describing a datum is a Common Data Element (CDE)

• Enterprise-class system for storing metadata, with APIs that give runtime access to both metadata and semantics

• Implements the ISO 11179 standard, a flexible model for describing arbitrary metadata

• Used to describe metadata associated with clinical case report forms and UML Models
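
A much-simplified sketch of how a CDE might be modeled follows. It relies on the ISO/IEC 11179 idea that a data element pairs a data element concept (an object class plus a property) with a value domain. The class names, the `public_id` value, and the specimen example are hypothetical and do not reflect the actual caDSR schema or API.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class DataElementConcept:
    """What is being described: an object class and one of its properties."""
    object_class: str
    property_name: str


@dataclass(frozen=True)
class ValueDomain:
    """How values are represented; permissible_values=None means non-enumerated."""
    datatype: str
    permissible_values: Optional[Tuple[str, ...]] = None

    def accepts(self, value: str) -> bool:
        return self.permissible_values is None or value in self.permissible_values


@dataclass(frozen=True)
class CommonDataElement:
    """A data element: a concept paired with a value domain."""
    public_id: str  # hypothetical identifier, not a real caDSR ID
    concept: DataElementConcept
    value_domain: ValueDomain

    def validate(self, value: str) -> bool:
        """Check a datum against the element's value domain."""
        return self.value_domain.accepts(value)


# Hypothetical CDE for illustration only.
specimen_type = CommonDataElement(
    public_id="EXAMPLE-0001",
    concept=DataElementConcept(object_class="Specimen", property_name="Type"),
    value_domain=ValueDomain(datatype="string",
                             permissible_values=("Tissue", "Fluid", "Cell")),
)

print(specimen_type.validate("Tissue"))
print(specimen_type.validate("Rock"))
```

Keeping the concept separate from the value domain lets the same meaning ("Specimen Type") be reused with different representations.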

Enterprise Vocabulary Services

• Controlled vocabulary resources for caCORE and the cancer research community

• Vocabulary Products and Services

– NCI Thesaurus

– NCI Metathesaurus

– External vocabularies

• NCI Thesaurus - controlled vocabulary source for metadata

– Has excellent coverage of cancer terminology

– Expands based on needs for additional terminology

– Based on concepts rather than terms

– Each concept has a unique identifier (CUI) with definitions and synonyms
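
The difference between a concept-based and a term-based vocabulary can be sketched as follows: several terms resolve to one concept, and the concept's identifier carries the meaning. This is a minimal in-memory illustration, not the EVS API. The `Thesaurus` class, the code `X0001`, and the placeholder definition are all invented for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Concept:
    code: str             # unique concept identifier (made up in this example)
    preferred_name: str
    definition: str
    synonyms: List[str] = field(default_factory=list)


class Thesaurus:
    """Toy concept-based vocabulary: many terms map to one concept."""

    def __init__(self) -> None:
        self._by_code: Dict[str, Concept] = {}
        self._term_index: Dict[str, str] = {}

    def add(self, concept: Concept) -> None:
        self._by_code[concept.code] = concept
        # Index the preferred name and every synonym, case-insensitively.
        for term in [concept.preferred_name, *concept.synonyms]:
            self._term_index[term.casefold()] = concept.code

    def lookup(self, term: str) -> Optional[Concept]:
        """Resolve any known term to its concept, or None if unknown."""
        code = self._term_index.get(term.casefold())
        return self._by_code.get(code) if code is not None else None


thesaurus = Thesaurus()
thesaurus.add(Concept(
    code="X0001",  # hypothetical code, not a real NCI Thesaurus identifier
    preferred_name="Glioblastoma",
    definition="Placeholder definition for illustration.",
    synonyms=["Glioblastoma Multiforme", "GBM"],
))

# Different terms resolve to the same concept.
print(thesaurus.lookup("gbm").code)
print(thesaurus.lookup("Glioblastoma Multiforme").code)
```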

Data Standards in caBIG™

• The V/CDE workspace is responsible for facilitating the development and ratification of Data Standards for caBIG™

• Data Standards can be Vocabularies or Common Data Elements (CDEs) with their associated controlled terminology

• A caBIG™ Data Standard is, in effect, a ‘pre-approved’ mechanism for semantically modeling an attribute or series of attributes in a data object. Ideally, having a standard available shortens development time for other projects that need to present such data

• Whenever possible, caBIG™ adopts standards that are derived from other standards bodies (HL7, ISO, USPS, UPU, W3C, etc.) and in general use within our community

• In the last year, the V/CDE workspace has developed a consensus-driven mechanism for approving Data Standards and applied it to an increasing number of CDEs

caCORE Architecture

[Diagram. Column labels: Clients, Middleware, Data. Other labels: Java Applications; HTTP Clients; SOAP Clients; Perl Clients; Web Application Server; Interfaces (Java, SOAP, XML); API; Domain Objects (Gene, Disease, Agent, etc.); Data Access Objects; Authorization; Biomedical Data; Enterprise Vocabulary; Common Data Elements.]

Use cases for caGrid

• Advertisement

– Service Provider composes service metadata describing the service and publishes it to the grid.

• Discovery

– Researcher (or application developer) specifies search criteria describing a service of interest

– The researcher submits the discovery request to a discovery service, which identifies a list of services matching the criteria, and returns the list.

• Invocation

– Researcher (or application developer) instantiates the grid service and accesses its resources
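
The three use cases above can be sketched with a small in-memory registry. This illustrates only the advertise, discover, and invoke pattern and is not the caGrid API. The `DiscoveryService` class, the metadata keys (`kind`, `content`), and the lambda endpoints are assumptions made for this example; the service names are borrowed from the caGrid 0.5 service list.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ServiceAdvertisement:
    """Metadata a provider publishes so that others can find its service."""
    name: str
    metadata: Dict[str, str]
    invoke: Callable[[str], str]  # stand-in for a call to a remote grid service


class DiscoveryService:
    """In-memory stand-in for a grid index that stores advertisements."""

    def __init__(self) -> None:
        self._advertisements: List[ServiceAdvertisement] = []

    def advertise(self, ad: ServiceAdvertisement) -> None:
        """Advertisement: a provider publishes its service metadata."""
        self._advertisements.append(ad)

    def discover(self, **criteria: str) -> List[ServiceAdvertisement]:
        """Discovery: return the services whose metadata match every criterion."""
        return [
            ad for ad in self._advertisements
            if all(ad.metadata.get(key) == value for key, value in criteria.items())
        ]


grid = DiscoveryService()
grid.advertise(ServiceAdvertisement(
    name="caArray",
    metadata={"kind": "data", "content": "microarray"},
    invoke=lambda query: f"caArray results for {query!r}",
))
grid.advertise(ServiceAdvertisement(
    name="RProteomics",
    metadata={"kind": "analytical", "content": "proteomics"},
    invoke=lambda query: f"RProteomics analysis of {query!r}",
))

# Invocation: call each service that matched the search criteria.
matches = grid.discover(kind="data")
for service in matches:
    print(service.name, "->", service.invoke("example query"))
```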

caGrid 0.5 Services

• Data Services

– caBIO: Gene-centric bioinformatics objects

• NCICB-Rockville, MD

– caArray: MAGE-OM compliant microarray repository

• NCICB-Rockville, MD

• Lombardi Cancer Center-Georgetown, DC

– gridPIR: Protein Information Resource

• Lombardi Cancer Center-Georgetown, DC

– caTIES: Text Information Extraction System for pathology reports

• UPMC-Pittsburgh, PA

– SNP500: Polymorphism database with population frequencies

• NCI Core Genotyping Facility-Gaithersburg, MD

– caMOD II: Cancer Model Organism Database

• NCI Mouse Models of Human Cancer Consortium (MMHCC)

• Analytical Service

– RProteomics: Statistical analysis of proteomics data

• Duke-Durham, NC

caGrid Service-Oriented Architecture

[Diagram: OGSA Compliant - Service Oriented Architecture. Component labels: Transport; Grid Communication Protocol; Service Description; Service; Workflow; Service Registry; Security; Metadata Management; Resource Management; Functions Management; ID Resolution; Schema Management. Technology labels: GSI; CAS; myProxy; Globus; OGSA-DAI; GRAM; Globus Toolkit; BPEL; Mobius; caCORE.]

Enabling Technology

• The NCI provides freely available enabling technology for caBIG™ compatibility

• These technologies are distributed under a ‘non-viral’ open source license.

• caCORE

– Enterprise Vocabulary Services (EVS)

– Cancer Data Standards Repository (caDSR)

• caCORE Software Development Kit

– When the complete process is followed, the outcome is a caBIG™ ‘Silver’ compliant data system.

How can my research benefit from caBIG™ Tools?

• Everything developed by the program is open source and freely available

• Training is available at https://cabig.nci.nih.gov/training

• The latest versions of all the software developed as part of the project can be obtained from the caBIG™ project gforge site:

– http://gforge.nci.nih.gov

caBIG™: Getting Involved

• To get involved with caBIG™:

– Track caBIG™ activities on the NCI’s caBIG™ website, https://cabig.nci.nih.gov/

– Attend the caBIG™ Annual Meeting, February 5-7, 2007, Wardman Park Marriott, Washington, DC

– Learn about the existing bioinformatics infrastructure, caCORE, at https://ncicb.nci.nih.gov/core

– Download currently available caBIG™ tools from the caBIG™ website at https://cabig.nci.nih.gov/inventory

– Sign up for the caBIG™ mailing list at http://list.nih.gov/archives/cabig_announce.html

• Please visit the main caBIG™ website for more information: https://cabig.nci.nih.gov/
