EGEE’09, 21-25 September 2009, Barcelona, Spain

Bioinformatics GRID and HPC challenges in Biomedicine and Biosciences

Milanesi Luciano

National Research Council, Institute of Biomedical Technologies, Milan, Italy
[email protected]

Outline

Introduction

Bioinformatics data integration and data analysis

Challenges in designing new portals, with the SysBio-Gateway as an example:

Cell Cycle Database

G2S Breast Cancer Database

Nervous System

Kinweb

ProCMD

TMA Rep

Conclusion and Acknowledgments

ICT and Genomics

• A key development in the computational world has been the arrival of de novo design algorithms that use all the spatial information available within the target to design novel drugs.

• Coupling these algorithms with the rapidly growing body of information from structural genomics and with new ICT technologies (e.g. HPC, GRID, Web Services, bio-inspired networks, etc.) provides a powerful new possibility for exploring design against a broad spectrum of genomics targets, including more challenging techniques such as protein-protein interactions, docking, molecular dynamics, systems biology and gene networks.

Disease Network

Figure: a disease-resistant population compared with a disease-susceptible population.

Genotype all individuals for thousands of SNPs.

geneX: ATGAATTATAG (resistant), ATGTTTTATAG (susceptible)

Resistant people all have an ‘A’ at position 4 in geneX, while susceptible people have a ‘T’; this variation is called a SNP.
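
The genotype comparison on this slide can be illustrated with a minimal Python sketch. It assumes toy cohorts built by repeating the two geneX fragments shown on the slide; the function names are hypothetical and real genotyping data would be far larger and noisier.

```python
def allele_at(sequence: str, position: int) -> str:
    """Return the base at a 1-based position in a nucleotide sequence."""
    return sequence[position - 1]


def separates_groups(resistant, susceptible, position):
    """True if the two groups share no allele at the given 1-based position."""
    resistant_alleles = {allele_at(s, position) for s in resistant}
    susceptible_alleles = {allele_at(s, position) for s in susceptible}
    return resistant_alleles.isdisjoint(susceptible_alleles)


# Toy cohorts (assumption): each individual carries one of the slide's fragments.
resistant = ["ATGAATTATAG"] * 3
susceptible = ["ATGTTTTATAG"] * 3

print(allele_at(resistant[0], 4), allele_at(susceptible[0], 4))
print(separates_groups(resistant, susceptible, 4))
```

A position that separates the two groups completely, as position 4 does here, is a candidate marker; in practice the association would be assessed statistically across many individuals.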

Systems Biology for Health

Data Integration

Definition of data integration:

the process of combining information residing at different resources to provide the user with a unified view of these data, making it possible to achieve real knowledge.

Diagram labels: U.I., SOAP, Workflow, Jobs, HPC, Grid, Results

Data Integration

• Data integration is an essential task for achieving as complete a view of biological knowledge as possible in emerging fields such as bioinformatics and systems biology.

• In the systems biology field, the integration of biological knowledge spans different levels, such as genomics, transcriptomics, proteomics and network interactions.

• Such data integration is crucial to support the mathematical modelling and the computer simulation of biological pathways.

Data Integration

Data integration can be described at three levels of complexity:

• the first layer is the integration of information from heterogeneous resources, collecting data from different databases to allow a unified query schema;

• the second layer consists of identifying correlative associations across different datasets, generally with ontology support, to provide a comprehensive and coherent view of the same objects in light of different data sources;

• the third layer maps the information gained about interacting objects into networks and pathways that may be used as basic models for the underlying cellular systems.
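
As an illustration of the first layer only, the following Python sketch merges records from two sources into one unified view. The gene identifiers, field names and values are made up for the example, and the in-memory dictionaries stand in for the real heterogeneous databases.

```python
from collections import defaultdict

# Hypothetical records from two independent sources, keyed by a shared gene ID.
gene_source = {"geneA": {"chromosome": "1"}, "geneB": {"chromosome": "2"}}
pathway_source = {"geneA": {"pathways": ["pathway X"]}}


def unified_view(*sources):
    """Merge the per-source fields of every key into a single record."""
    merged = defaultdict(dict)
    for source in sources:
        for key, fields in source.items():
            merged[key].update(fields)
    return dict(merged)


view = unified_view(gene_source, pathway_source)
print(view["geneA"])
```

A single lookup on `view` now answers questions that would otherwise require one query per source, which is the purpose of a unified query schema.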

SysBio-Gateway Tools framework

The SysBio-Gateway tools framework is implemented for the following main purposes:

• Simulation of ordinary differential equations (ODEs) to solve the mathematical models.

• In silico estimation of parameter values to develop new cellular systems biology models.

• Visualization of protein structures and search by the correlated Connolly surfaces.

• Analysis of protein-protein interaction networks (search for the first neighborhood, search for the shortest path and common annotations).

• Modeling of protein mutants starting from Single Nucleotide Polymorphism data, based on the Modeller program.

• Image processing oriented to support tissue microarray analysis.
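
The two network searches named in the list above (first neighborhood and shortest path) can be sketched in Python as follows. This is a minimal illustration on a toy undirected network with placeholder protein names, using breadth-first search; it is not the SysBio-Gateway implementation.

```python
from collections import deque

# Toy undirected interaction network; protein names are placeholders.
interactions = [("A", "B"), ("B", "C"), ("C", "D"), ("A", "E")]


def build_graph(edges):
    """Build an adjacency map from a list of undirected interactions."""
    graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)
    return graph


def first_neighborhood(graph, protein):
    """Proteins that interact directly with the given protein."""
    return set(graph.get(protein, set()))


def shortest_path(graph, start, goal):
    """Breadth-first search; returns the list of nodes on a shortest path, or None."""
    if start not in graph or goal not in graph:
        return None
    previous = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbor in sorted(graph[node]):
            if neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)
    return None


graph = build_graph(interactions)
print(first_neighborhood(graph, "A"))
print(shortest_path(graph, "E", "D"))
```

Breadth-first search is sufficient here because the toy interactions are unweighted; weighted interaction scores would call for a different algorithm such as Dijkstra's.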

SysBio-Gateway Tools framework

The global optimization algorithm relies on an evolution strategy to resolve the uncertainty in parameter values:

• The system accepts as input an ODE-based model (formatted as plain text or encoded in the SBML standard) together with the experimental data, and outputs the best-fitting parameters.

• During the computation, candidate solutions from separate evolution processes are swapped every k iterations through a relational database.
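
The scheme described above can be sketched in Python under several stated assumptions: each evolution process is a simple (1+1) evolution strategy, the model is the one-parameter decay dx/dt = -k*x (whose analytic solution stands in for an ODE solver), and an in-memory SQLite table stands in for the relational database used to exchange candidates. None of these choices is confirmed by the slides; the names `fit`, `swap_every` and the table layout are hypothetical.

```python
import math
import random
import sqlite3


def model(k, times, x0=1.0):
    """Analytic solution of dx/dt = -k*x, standing in for an ODE solver."""
    return [x0 * math.exp(-k * t) for t in times]


def error(k, times, data):
    """Sum of squared differences between model output and data."""
    return sum((m - d) ** 2 for m, d in zip(model(k, times), data))


def fit(times, data, islands=4, iterations=200, swap_every=20, seed=0):
    rng = random.Random(seed)
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE best (island INTEGER PRIMARY KEY, k REAL, err REAL)")
    # Each island holds one candidate value of k.
    candidates = [rng.uniform(0.0, 5.0) for _ in range(islands)]
    sigma = 0.3  # fixed mutation step (assumption; real strategies adapt it)
    for step in range(1, iterations + 1):
        for i, parent in enumerate(candidates):
            child = abs(parent + rng.gauss(0.0, sigma))
            if error(child, times, data) < error(parent, times, data):
                candidates[i] = child
        if step % swap_every == 0:
            # Publish every island's candidate, then hand the global best
            # to the island that is currently worst.
            for i, k in enumerate(candidates):
                db.execute("INSERT OR REPLACE INTO best VALUES (?, ?, ?)",
                           (i, k, error(k, times, data)))
            (best_k,) = db.execute(
                "SELECT k FROM best ORDER BY err LIMIT 1").fetchone()
            worst = max(range(islands),
                        key=lambda i: error(candidates[i], times, data))
            candidates[worst] = best_k
    db.close()
    return min(candidates, key=lambda k: error(k, times, data))


# Synthetic "experimental" data generated with k = 1.5 (assumption).
times = [0.5 * i for i in range(1, 7)]
data = model(1.5, times)
print(fit(times, data))
```

Routing the exchange through a shared table means the separate processes never need to talk to each other directly, which suits jobs distributed over independent worker nodes.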

SysBio-Gateway Ontology

The databases developed in the frame of SysBio-Gateway are enriched by ontological integration.

• The ontologies used concern all levels of molecular biology, from genes to proteins to pathways, tissues and disease aspects.

• As an example, we used Gene Ontology (GO) for gene annotation and the KEGG Pathway Ontology (derived from the hierarchical organization of KEGG pathways) for biological networks.

• Ontology not only provides a commonly accepted vocabulary, which facilitates data sharing and information querying, but also increases the performance of statistical and analytical studies.

SysBio-Gateway

• SysBio-Gateway is an example of a unified framework which embraces a set of specialized biomedical databases and web resources oriented to different biological topics, such as:

• Biological processes (Cell Cycle)

• Pathologies (Breast Cancer)

• Organs (Brain and the Nervous System)

• Protein families (Protein kinases)

• Protein mutations (Protein C mutations)

• Tissues (Tissue microarray)

• SysBio-Gateway relies both on a common methodology of data integration and on a set of tools developed ad hoc to enable investigation from a systems biology perspective.

SysBio-Gateway

http://www.itb.cnr.it/sysbio-gateway

The Cell Cycle

Cell Cycle:

a repeated sequence of events which leads to the division of a mother cell into daughter cells

A biological process frequently studied in relation to tumour diseases

It is considered a valuable target for drug discovery in the context of cancer and neurodegenerative disease

Cell Cycle Database

• The “Cell Cycle Database” (CCDB) is a resource which collects useful information about genes and proteins involved in the cell cycle process, together with mathematical models of that process.

• The integrated information belongs to the following eukaryotic organisms:

• the budding yeast Saccharomyces cerevisiae

• Homo sapiens

• Yeast and human have been chosen because deep knowledge of their cell cycle machinery is available and an evolutionary conservation of the basic regulatory events of the cell cycle has been demonstrated.

CCDB: the web interface

• Cell Cycle Database is accessible through a web interface made up of HTML pages dynamically generated from PHP scripts.

• URL: http://www.itb.cnr.it/cellcycle/

CCDB: the integrative system

Resources provided as external links

Resources from which data are taken

CCDB Gene Report: main features

• The gene report collects information by integrating data from different tables.

• The gene report is linked to different external biological databases.

• The gene report is linked to the protein report.

CCDB Protein Report: main features

• The protein report collects information by integrating data from different tables.

• The protein report is linked to the different biological databases from which we take data.

• The protein report is linked to the gene report.

CCDB Model Report

• Users can find:

• the mathematical description of each model

• the kinetic and differential equations

• the related parameters (the rate constants and the initial protein concentrations) defined for the simulation analysis.

Mathematical section

Simulation section

2D plot: image exported as PNG using Gnuplot

Users can perform the simulation of a single ODE system describing a cell cycle model
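
To show what simulating a single ODE system involves, here is a minimal Python sketch using a fixed-step fourth-order Runge-Kutta integrator. The one-variable synthesis and degradation model, its rate constants and the function names are illustrative assumptions, not one of the cell cycle models stored in the database.

```python
import math


def rk4_step(f, t, y, h):
    """Advance the state y by one fourth-order Runge-Kutta step of size h."""
    k1 = f(t, y)
    k2 = f(t + h / 2, [yi + h / 2 * ki for yi, ki in zip(y, k1)])
    k3 = f(t + h / 2, [yi + h / 2 * ki for yi, ki in zip(y, k2)])
    k4 = f(t + h, [yi + h * ki for yi, ki in zip(y, k3)])
    return [yi + h / 6 * (a + 2 * b + 2 * c + d)
            for yi, a, b, c, d in zip(y, k1, k2, k3, k4)]


def simulate(f, y0, t_end, steps):
    """Integrate dy/dt = f(t, y) from t = 0 to t_end; returns (t, y) pairs."""
    h = t_end / steps
    t, y = 0.0, list(y0)
    trajectory = [(t, y)]
    for i in range(steps):
        y = rk4_step(f, t, y, h)
        t = (i + 1) * h
        trajectory.append((t, y))
    return trajectory


def protein_model(t, y, k_syn=2.0, k_deg=1.0):
    """Toy model: constant synthesis and first-order degradation of one protein."""
    (p,) = y
    return [k_syn - k_deg * p]


trajectory = simulate(protein_model, [0.0], t_end=10.0, steps=1000)
t_final, (p_final,) = trajectory[-1]
# Compare with the analytic solution (k_syn / k_deg) * (1 - exp(-k_deg * t)).
print(p_final, 2.0 * (1.0 - math.exp(-10.0)))
```

The resulting list of (time, state) pairs is the kind of data a plotting tool would turn into the 2D time-course plot.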

G2BCDB web interface

http://www.itb.cnr.it/breastcancer

Genes-to-Systems Breast Cancer Database

• The Genes-to-Systems Breast Cancer (G2SBC) Database is a bioinformatics resource that integrates information about the genes, transcripts and proteins which have been reported in the literature to be altered in breast cancer cells.

• The resource includes a section dedicated to mathematical models related to carcinogenesis, tumor growth and tumor response to treatments.

• This comprehensive resource is dedicated to molecular and systems biology of breast cancer, including both the building-blocks level (genes, transcripts and proteins) and the systems level (molecular and cellular systems).

G2BCDB: web interface and analysis tools

• Query system at molecular, systems and cellular level

• Ontology based query system

• Query system based on biochemical pathways and protein-protein interaction network

• Common annotations (in a gene set) search tool

• Mathematical model analysis and simulation (with a specific connection between the breast cancer genes, the Cell Cycle Database and the cell cycle related models)
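
The common annotations search listed above amounts to intersecting the annotation sets of the genes in a set. A minimal Python sketch follows, assuming a hypothetical gene-to-annotation table with placeholder gene names and free-text terms rather than real ontology identifiers.

```python
# Hypothetical annotation table; gene names and terms are placeholders.
annotations = {
    "geneA": {"cell cycle", "DNA repair", "kinase activity"},
    "geneB": {"cell cycle", "DNA repair"},
    "geneC": {"cell cycle", "apoptosis"},
}


def common_annotations(genes, table):
    """Return the annotation terms shared by every gene in the set."""
    term_sets = [table.get(gene, set()) for gene in genes]
    if not term_sets:
        return set()
    return set.intersection(*term_sets)


print(common_annotations(["geneA", "geneB"], annotations))
print(common_annotations(["geneA", "geneB", "geneC"], annotations))
```

A gene missing from the table contributes an empty set, so the result is empty; whether unknown genes should instead be ignored is a design decision the slides do not address.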

G2BCDB: test case

Example of analysis combining the use of G2SBC Database tools

Brain: what level?

At what level can a systems biology strategy be implemented?

Faugeras O. et al., 2007, Journal of Physiology

Nervous System Database

• The Nervous System Database (NSD) deals with the integration of data about the nervous system.

• Knowledge about biological components, such as genes and proteins, is collected together with information at the system level, such as protein networks and molecular pathways, that is relevant for a better exploration of neural systems.

• The annotations stored in NSD are associated with ontological terms: this solution provides a semantic layer that improves data storage, accessibility and sharing, and serves as an instrument to identify relations among biological components.

• The NSD system supplies information about gene functions, the processes in which the genes are involved and their spatial localization.

NEUROINFORMATICS

Description, matching and retrieval of 3D anatomical data

(Extended Reeb graphs and size functions)

NSD: the web interface

http://www.itb.cnr.it/gncdb

Kinweb kinase database

• Eukaryotic protein kinases (ePKs) constitute one of the largest recognized protein families represented in the human genome.

• The common name kinase is applied to enzymes that catalyze the transfer of the terminal phosphate group from ATP to a receptor substrate.

• All kinases catalyze essentially the same phosphoryl transfer reaction; however, they display remarkable diversity in their structures, substrate specificity, and the pathways in which they participate.

• ePKs are important players in virtually every signaling pathway involved in normal development and disease.

• The results of the human kinome analysis are collected in the KinWeb database, available for browsing and searching over the internet, where all results from the comparative analysis and the gene structure annotation are made available, alongside the domain information.

Kinweb: the web interface

http://www.itb.cnr.it/kinweb

Kinweb: Kinase Protein Analysis

ProCMD: a 3D web resource for protein C mutants

• Activated Protein C (ProC) is an anticoagulant plasma serine protease which also plays an important role in controlling inflammation and cell proliferation

• Structure prediction and computational analysis of the mutants have proven to be a valuable aid in understanding the molecular aspects of clinical thrombophilia

• ProCMD is a relational database which collects data on clinical, structural and functional properties of the protein C variants with a graphic interface to search and visualize the mutated residue in the protein structure

• The main purpose of this tool is to provide easy access to mutant structural data through the use of 3D interactive viewers (VRML and RasMol).

ProCMD: the web interface

http://www.itb.cnr.it/procmd

ProCMD: test case

• The ProCMD system allows the user to retrieve entries:

• by position in sequence of a mutated residue,

• by amino acid substitution,

• by keyword and by domain localization.

• The results appear in a mutations list linked with a dedicated ‘details page’ where the mutant is fully described.
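
Retrieval by position and by amino acid substitution can be sketched in Python as follows. The sketch assumes mutations are labelled in the form wild-type residue, position, mutant residue; G216D appears in these slides, while the other labels, the in-memory list and the function names are invented for illustration and do not reflect the ProCMD schema.

```python
import re
from typing import NamedTuple


class Mutation(NamedTuple):
    wild_type: str
    position: int
    mutant: str


def parse_mutation(label):
    """Split a label such as 'G216D' into wild type, 1-based position, mutant."""
    match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", label)
    if match is None:
        raise ValueError(f"not a substitution label: {label!r}")
    wild_type, position, mutant = match.groups()
    return Mutation(wild_type, int(position), mutant)


def search(labels, position=None, substitution=None):
    """Filter labels by residue position and/or (wild type, mutant) pair."""
    hits = []
    for label in labels:
        mutation = parse_mutation(label)
        if position is not None and mutation.position != position:
            continue
        if substitution is not None and (mutation.wild_type, mutation.mutant) != substitution:
            continue
        hits.append(label)
    return hits


# G216D is taken from the slides; the other two labels are hypothetical.
entries = ["G216D", "G216S", "A100V"]
print(search(entries, position=216))
print(search(entries, substitution=("G", "D")))
```

In a relational database the same filters would become conditions on indexed columns; keyword and domain searches would need additional fields that are not modelled here.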

ProCMD: test case

Details page for mutant G216D with 3D image gallery

Tissue Microarray - TMA Rep

• The Tissue MicroArray technique is becoming increasingly important in pathology for the validation of experimental data from transcriptomics analysis.

• We propose a web-oriented Tissue MicroArray system that supports researchers in managing bio-samples and that, through the use of ontologies, enables tissue sharing in order to promote TMA experiment design and the evaluation of results.

• This system provides ontological descriptions both for pre-analysis tissue images and for post-process image results, which is a crucial feature for promoting information exchange.

• Through this system, users associate an ontology-based description to each image uploaded into the database and also integrate results with the ontological descriptions of genes identified in each tissue.

TMA Platform Overview

Networks of resources

• The potential of new biological and biomedical technological platforms in connection with HPC and GRID technology will be particularly useful to deal with the increasing amount, complexity, and heterogeneity of biological and biomedical data.

• Bioinformatics applications for eHealth have become an ideal research area where computer scientists can apply and further develop new intelligent computation methods, in both experimental and theoretical cases.

Networking People

Data analysis specific for biomedical applications can allow the user to store and search genetics data, with direct access to the data files and application on GRID servers.

Conclusion

• Here we present an integrated solution for exploring part of the information gained in the life sciences, oriented to systems biology.

• The SysBio-Gateway integrated portal combines:

• a bioinformatics approach, i.e. data integration using a data warehouse approach

• application of tools for data analysis

• study of structural modifications, both for genomes and proteins

• a systems biology approach, that is, the study of protein-protein interaction networks

• molecular mathematical models

• pathological states from a systemic point of view.

Acknowledgments

People of the Institute for Biomedical Technologies-CNR: Luciano Milanesi, Roberta Alfieri, Ettore Mosca, Federica Viti, Pasqualina D'Ursi, Ivan Merelli

Chiara Bishop and John Hatton (Graphical Web Interface)

BioinfoGRID http://www.bioinfogrid.eu

EGEE http://www.eu-egee.org

FIRB-MIUR LITBIO: Laboratory for Interdisciplinary Technologies in Bioinformatics http://www.litbio.org

FIRB-MIUR ITALBIONET: Italian Bioinformatics Network