Integromics: a grid-enabled platform for integration of advanced bioinformatics tools and data

Luca Corradi (luca.corradi@unige.it), BIO-Lab, DIST, University of Genoa

Post on 24-Dec-2015

Integromics

• Cancer research goal: tailor treatment to the molecular profile of an individual patient's tumor

• Microarrays and other 'omic' technologies make it possible to study tens of thousands of genes simultaneously

• The tools and methodologies in use lack standardization and repeatability

• Need for an "integromic" platform to:
  – Develop integrative ('integromic') analyses of the data
  – Combine the tools available for genomics

Better results, higher quality of work

Focus on...

• How to exploit the backend gLite infrastructure and an HPC environment to integrate bioinformatics tools and data

• How a Grid Portal can:
  – integrate heterogeneous tools and data
  – simplify user interaction through customized web interfaces
  – increase usability and efficiency

• Case study: example of correlation between genomics data and clinical data through a combination of processing tools provided by the platform

The challenges

• Manage large volumes of bioinformatics data

• Deal with complex issues such as different formats, distributed locations, time-consuming tasks, and computational needs

• Integrate heterogeneous tools and platforms

• Speed up the analysis process through automated methodologies

• Improve efficiency and quality of work

• Make the system usable and accessible

Microarray technology

• Computation of gene expression values for thousands of genes at the same time

• Collection of microscopic DNA spots, representing single genes, arrayed on a solid surface by covalent attachment to chemically suitable matrices

• Estimation of the absolute value of gene expression

The use case

• Analyze large microarray datasets for breast cancer prognosis assessment

• Run several R/Bioconductor scripts

• Deploy a re-usable and reliable service

• Avoid errors, increase repeatability

• Create a processing pipeline where new algorithms and data analysis techniques can be tested

• Create a set of “atomic” components that can be combined into workflows

Data Analysis Tools

R/Bioconductor

• Free software environment for statistical computing and graphics

• Bioconductor is a collection of R packages for the bioinformatics community

• Active user community

dChip

• Free software for analysis and visualization of gene expression data

Affymetrix Power Tools (APT)

• Cross-platform command line programs that implement algorithms for analyzing Affymetrix GeneChip arrays

Parallel dChip execution

• Module 1
  – n jobs, each opening N/n files and normalizing them
  – Each job produces N/n CSV files (matching the input files)

• Module 2
  – m jobs, each opening all N CSV files and computing gene expression values for a certain group of genes
  – Each job produces one CSV file

• Module 3
  – One job opening the m expression files
  – It searches for differentially expressed genes and performs clustering of the results

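[Diagram: input files CEL 1 to CEL N are split into groups of N/n across jobs Mod1_1 to Mod1_n, which produce CSV 1 to CSV N; jobs Mod2_1 to Mod2_m read these and produce CSV 1 to CSV m; Mod3 reads the m CSV files.]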
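The three-module layout above can be sketched as a job plan. This is only an illustration of the fan-out/fan-in structure, not the portal's actual implementation: the function names, the job labels (`Mod1_1`, `Mod2_1`, `Mod3`), the output file names, and the idea of splitting genes into equal groups are all assumptions made for the example.

```python
import os
from typing import Dict, List, Sequence


def split_evenly(items: Sequence[str], parts: int) -> List[List[str]]:
    """Split items into `parts` contiguous chunks whose sizes differ by at most one."""
    if parts <= 0:
        raise ValueError("parts must be positive")
    base, extra = divmod(len(items), parts)
    chunks, start = [], 0
    for i in range(parts):
        size = base + (1 if i < extra else 0)
        chunks.append(list(items[start:start + size]))
        start += size
    return chunks


def plan_pipeline(cel_files: Sequence[str], gene_ids: Sequence[str],
                  n: int, m: int) -> Dict[str, object]:
    """Build a (hypothetical) job plan for the three modules."""
    # Module 1: n jobs, each normalizing about N/n CEL files
    # and writing one CSV per input file.
    module1 = []
    for i, chunk in enumerate(split_evenly(cel_files, n), start=1):
        outputs = [os.path.splitext(name)[0] + ".csv" for name in chunk]
        module1.append({"job": f"Mod1_{i}", "inputs": chunk, "outputs": outputs})

    # Module 2: m jobs, each reading all N CSV files but handling
    # only one group of genes, and writing a single CSV.
    all_csv = [out for job in module1 for out in job["outputs"]]
    module2 = [
        {"job": f"Mod2_{j}", "inputs": all_csv, "genes": genes,
         "output": f"expression_{j}.csv"}
        for j, genes in enumerate(split_evenly(gene_ids, m), start=1)
    ]

    # Module 3: one job that reads the m expression files.
    module3 = {"job": "Mod3", "inputs": [job["output"] for job in module2]}
    return {"module1": module1, "module2": module2, "module3": module3}


if __name__ == "__main__":
    plan = plan_pipeline([f"s{i}.CEL" for i in range(1, 11)],
                         ["g1", "g2", "g3", "g4", "g5", "g6"], n=3, m=2)
    for job in plan["module1"]:
        print(job["job"], len(job["inputs"]), "files")
```

Module 1 parallelizes over files and Module 2 over gene groups, so the two stages scale independently through n and m.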

Parallel APT execution

The service

• Analyze large microarray datasets for breast cancer prognosis assessment

• Concatenate phenodata and expression results

• Mix of custom and R programs

• Automatic analysis and plot creation
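Concatenating phenodata and expression results can be pictured as a join on a sample identifier. The sketch below is a minimal illustration under assumptions: the column names (`sample_id`, `er_status`, `GENE_A`) and the CSV layout are hypothetical, and the real service uses a mix of custom and R programs rather than this code.

```python
import csv
import io
from typing import Dict, List


def join_on_sample(phenodata_csv: str, expression_csv: str,
                   key: str = "sample_id") -> List[Dict[str, str]]:
    """Attach clinical (pheno) columns to each expression row sharing the same key."""
    # Index the phenodata rows by sample identifier.
    pheno = {row[key]: row for row in csv.DictReader(io.StringIO(phenodata_csv))}
    merged = []
    for row in csv.DictReader(io.StringIO(expression_csv)):
        clinical = pheno.get(row[key])
        if clinical is None:
            continue  # skip arrays that have no clinical annotation
        merged.append({**clinical, **row})
    return merged


if __name__ == "__main__":
    pheno_text = "sample_id,er_status\nS1,pos\nS2,neg\n"
    expr_text = "sample_id,GENE_A\nS1,7.2\nS3,5.1\n"
    print(join_on_sample(pheno_text, expr_text))
```

Only samples present in both tables survive the join; whether unmatched samples should be dropped or reported is a design choice the slides do not specify.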

The BioMedicalPortal

Based on EnginFrame, an industry-proven, production-grade grid portal (public and private, academic and industry customers worldwide)

BMPortal Architecture

[Architecture diagram. Labels: Grid users; non-Grid users; Client Apps; Web Service Interface; User Web Interface; BM Portal; EnginFrame APIs; WLM; gLite; Clusters (LSF, PBS, LL, etc.); Other Grids (NorduGrid, Globus, SRB, AliEn, etc.); AMGA Grid; AMGA local; other Grid DBs; GSAF; Secure Storage.]

• Based on the EnginFrame product from NICE srl

• The data management and secure storage layers are based on the GSAF / Secure Storage APIs

BioMedicalPortal services

• User management, authentication and authorization services

• Data management (extension to metadata support on GRID)

• Job submission (GRID, local, remote cluster) and monitoring

• Support for every programming and scripting language

• Plugin strategy for application integration

• Web services interface

• Workflow management system

• Lots of software and applications already integrated, etc.

gLite plugin & GWT

• Authentication and authorization using VOMS (a client-side applet is coming)

• Job submission and monitoring, result retrieval and visualization

• Preference settings (RB, CE, …)

• Traditional LFC-based data management

• New Google Web Toolkit interfaces for GSAF integration via Java API using VOMS credentials

Testbed architecture

1. The user submits and monitors work via a standard web browser

2. BMPortal checks input parameters and files, and submits a job to gLite

3. The RB matches the user requirements with the available resources on the Grid

4. The job starts

5. Results are written to the input file directory

6. Streaming output allows the progress of the job to be monitored

7. The job is done

8. The user can check the job status, exit results or messages

[Diagram labels: Users (Win, LX, UX, Mac); BMPortal; gLite UI; EF Server & Agent; Application; input files (primary, include); EGEE gLite infrastructure; local or remote cluster (LSF).]
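The submission steps in this testbed can be read as a small state machine. The sketch below is a simplified model for illustration only: the state names, the `Job` class, and the validation rule are assumptions and do not reproduce the actual gLite job states or the EnginFrame API.

```python
from enum import Enum
from typing import Dict, List


class JobState(Enum):
    REJECTED = "rejected"    # input check failed, nothing submitted
    SUBMITTED = "submitted"  # portal accepted the job and passed it on
    MATCHED = "matched"      # a resource was matched to the requirements
    RUNNING = "running"      # the job has started
    DONE = "done"            # the job has finished


# Allowed transitions of this simplified lifecycle.
ALLOWED = {
    JobState.REJECTED: set(),
    JobState.SUBMITTED: {JobState.MATCHED},
    JobState.MATCHED: {JobState.RUNNING},
    JobState.RUNNING: {JobState.DONE},
    JobState.DONE: set(),
}


class Job:
    def __init__(self, params: Dict[str, str], input_files: List[str]):
        # The portal checks input parameters and files before submitting.
        missing_value = any(v in (None, "") for v in params.values())
        if not input_files or missing_value:
            self.state = JobState.REJECTED
        else:
            self.state = JobState.SUBMITTED
        self.history = [self.state]
        self.output: List[str] = []  # streamed output lines

    def advance(self, new_state: JobState) -> None:
        if new_state not in ALLOWED[self.state]:
            raise ValueError(f"cannot go from {self.state.name} to {new_state.name}")
        self.state = new_state
        self.history.append(new_state)

    def stream(self, line: str) -> None:
        # Streaming output is only meaningful while the job runs.
        if self.state is not JobState.RUNNING:
            raise ValueError("job is not running")
        self.output.append(line)


if __name__ == "__main__":
    job = Job({"normalization": "default"}, ["sample1.CEL"])
    for state in (JobState.MATCHED, JobState.RUNNING):
        job.advance(state)
    job.stream("50% complete")
    job.advance(JobState.DONE)
    print([s.name for s in job.history], job.output)
```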

Analysis /1

• EnginFrame Grid portal interface (web access)

• Input data selection (Affy .CEL files, phenodata, gene list)

Analysis /2

• Services execution & monitoring

• Users can come back after coffee

Analysis /3

• Result visualization in portal spooler area (txt files, images, etc.)

Impact

• Aimed at biomedical researchers without specific computational skills

• The collaboration between molecular oncologists and software engineers allowed for the optimization of the system without losing flexibility

• Scales the size of processed data beyond the limits of currently available desktop personal computers

• Following the Software as a Service paradigm, users can focus on experimental design rather than infrastructure.

Atomic services

• Each processing step is an “atomic” service

• Services can be invoked one by one

• Currently, services are composed using EnginFrame portal features and LSF scheduler tools

• But…
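The idea of atomic services that can be invoked one by one or chained can be illustrated with plain function composition. This is a toy sketch under assumptions: the two services (`normalize`, `select_high`) and the `compose` helper are hypothetical stand-ins, not the portal's services or its EnginFrame/LSF composition mechanism.

```python
from typing import Any, Callable, Dict

# An atomic service takes a data record and returns a new one.
Service = Callable[[Dict[str, Any]], Dict[str, Any]]


def normalize(data: Dict[str, Any]) -> Dict[str, Any]:
    """Scale values so that the largest one becomes 1.0."""
    peak = max(data["values"])
    return {**data, "values": [v / peak for v in data["values"]]}


def select_high(data: Dict[str, Any]) -> Dict[str, Any]:
    """Keep only values at or above 0.5."""
    return {**data, "values": [v for v in data["values"] if v >= 0.5]}


def compose(*services: Service) -> Service:
    """Chain services so that each one consumes the previous one's output."""
    def workflow(data: Dict[str, Any]) -> Dict[str, Any]:
        for service in services:
            data = service(data)
        return data
    return workflow


if __name__ == "__main__":
    record = {"values": [2.0, 4.0, 1.0]}
    print(normalize(record))                        # one service on its own
    print(compose(normalize, select_high)(record))  # the same services as a workflow
```

Because every service has the same input and output shape, any of them can be run alone, reordered, or swapped for a new algorithm without touching the others.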

Current work (1)

• Visual and easy workflow (WF) monitoring

• Totally integrated with the EnginFrame job monitoring and data access

• Useful for very long-lasting workflows

• User-designed “virtual experiments”

Current work (2)

Integration of new algorithms (multi-chip quality control, cross-platform data integration, etc.)

Current work (3)

Ability to run different analyses in parallel
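As a rough picture of running different analyses on the same data at once, the sketch below uses Python's standard `concurrent.futures` thread pool. It is only an analogy: the helper name `run_in_parallel` and the two example analyses are assumptions, and on the platform the parallelism would come from Grid or cluster jobs rather than local threads.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean, median
from typing import Callable, Dict, Sequence


def run_in_parallel(analyses: Dict[str, Callable[[Sequence[float]], float]],
                    data: Sequence[float]) -> Dict[str, float]:
    """Submit every analysis on the same data and collect results by name."""
    with ThreadPoolExecutor(max_workers=max(1, len(analyses))) as pool:
        futures = {name: pool.submit(fn, data) for name, fn in analyses.items()}
        # result() blocks until the corresponding analysis has finished.
        return {name: future.result() for name, future in futures.items()}


if __name__ == "__main__":
    values = [1.0, 2.0, 6.0]
    print(run_in_parallel({"mean": mean, "median": median}, values))
```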

Acknowledgements

• Part of this work was developed within the Italian FIRB project LITBIO (Laboratory for Interdisciplinary Technologies in Bioinformatics).

• Thanks are due to Ulrich Pfeffer and his functional genomics group at IST (National Institute for Cancer Research) of Genoa, Italy for their support.

Thank you!