

INTRODUCTION

We connect, in a complete pipeline, an ontology-based environment for proteomics spectra management with a distributed complete validation platform for predictive analysis. We build on two existing software platforms (MS-Analyzer and BioDCV) and on emerging proteomics standards. In this set-up, BioDCV is accessed from the MS-Analyzer workflow as a service, thus providing a complete pipeline for proteomics data analysis. Predictive classification studies on MALDI-TOF data based on this pipeline are presented.

REFERENCES

[1] M. Cannataro, P. Guzzi, T. Mazza, G. Tradigo, P. Veltri. Using ontologies for preprocessing and mining spectra data on the Grid. FGCS, 2006, in press. http://dx.doi.org/10.1016/j.future.2006.04.011

[2] M. Cannataro, P.H. Guzzi, T. Mazza, G. Tradigo, P. Veltri. Preprocessing of Mass Spectrometry Proteomics Data on the Grid. IEEE CBMS 2005: 549-554.

[3] C. Furlanello, M. Serafini, S. Merler, and G. Jurman. Semi-supervised learning for molecular profiling. IEEE Transactions on Computational Biology and Bioinformatics, 2(2):110-118, 2005.

[4] A. Barla, B. Irler, S. Merler, G. Jurman, S. Paoli, and C. Furlanello. Proteome profiling without selection bias. IEEE CBMS 2006: 941-946.

[5] R. Tibshirani, T. Hastie, B. Narasimhan, S. Soltys, G. Shi, A. Kong, and Q. Le. Sample classification from protein mass spectrometry, by "peak probability contrasts". Bioinformatics, 20(17):3034-3044, 2004.

Workflows, ontologies and standards for unbiased prediction in high-throughput proteomics

Cannataro M*, Barla A**, Gallo A*, Paoli S**, Jurman G**, Merler S**, Veltri P*, Furlanello C**. *University Magna Graecia of Catanzaro, Italy, **ITC-irst, Trento, Italy

MGED 9, September 7-10, 2006, Seattle, WA, U.S.A.

DATASETS

D1. MALDI-TOF Ovarian Cancer Dataset, from www-stat.stanford.edu/~tibs/PPC/Rdist [5]

• 49 samples (24 diseased + 25 controls)
• Each raw sample has 56384 m/z measurements (892 KB)
• Each preprocessed sample has 564 m/z measurements (19 KB)
• Preprocessing:
  • Normalization
  • Binning
• Biomarker identification:
  • Baseline subtraction
  • Peak alignment – clustering
  • 67 features identified

D2. Lab calibration sample: MALDI-TOF, human serum

• 20 technical replicates: 10 control samples, 10 with 2 proteins
• 34671 measurements, 347 m/z after preprocessing
• Predictive discrimination with 7 peaks
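The reduction from 56384 to 564 measurements in D1, and from 34671 to 347 in D2, is consistent with binning at a fixed width of 100 consecutive points, although the poster states neither the bin width nor the aggregation rule. A minimal sketch under that assumption, using the mean intensity per bin (the function name `bin_spectrum` and the default width are illustrative only):

```python
from statistics import fmean


def bin_spectrum(intensities, width=100):
    """Average consecutive intensity values in fixed-width bins.

    The last bin may hold fewer than `width` points.
    Both the width and the use of the mean are assumptions.
    """
    if width <= 0:
        raise ValueError("width must be positive")
    return [
        fmean(intensities[start:start + width])
        for start in range(0, len(intensities), width)
    ]


# Synthetic flat spectra with the raw lengths reported for D1 and D2.
d1_binned = bin_spectrum([1.0] * 56384)
d2_binned = bin_spectrum([1.0] * 34671)
print(len(d1_binned), len(d2_binned))  # prints "564 347"
```

Under this reading the final bin of each spectrum is partial (84 points for D1, 71 for D2), which is why the counts round up.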

MS-ANALYZER

MS-Analyzer [1] is a platform for the integrated management and processing of proteomics spectra data. It supports the ontology-based design of "in silico" proteomics studies: ontologies are used to model software tools and spectra data, while workflows are used to model applications. MS-Analyzer uses a specialized spectra database and provides a set of preprocessing services:

• Interface to heterogeneous mass spectrometer formats such as MALDI-TOF, SELDI-TOF, and ICAT-based LC-MS/MS. Formats are unified into mzData, in compliance with the HUPO-PSI proteomics standardization initiative.

• Acquisition, storage, and management of MS data with the SpecDB database. Spectra are stored in their different stages (raw, pre-processed, prepared). Single spectra, multiple spectra, or portions of spectra can be queried (in-database preprocessing).

• Preprocessing of MS data (smoothing, baseline subtraction, normalization, binning, peak alignment), as well as spectra preparation for further data mining (spectra to ARFF conversion) [2].

• Sharing of experimental data, workflows, and knowledge.
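The spectra-to-ARFF preparation step can be pictured as follows. ARFF is the plain-text input format of the Weka toolkit: a header declaring one attribute per feature, followed by one comma-separated row per sample. The sketch below is not MS-Analyzer code; the function name, the `mz_` attribute prefix, and the toy values are assumptions made for illustration.

```python
def spectra_to_arff(relation, mz_values, samples, class_labels):
    """Serialize binned spectra as an ARFF document.

    samples: list of (intensities, label) pairs, one per spectrum.
    """
    lines = [f"@relation {relation}", ""]
    for mz in mz_values:
        lines.append(f"@attribute mz_{mz} numeric")
    lines.append("@attribute class {" + ",".join(class_labels) + "}")
    lines += ["", "@data"]
    for intensities, label in samples:
        if len(intensities) != len(mz_values):
            raise ValueError("each spectrum needs one intensity per m/z bin")
        if label not in class_labels:
            raise ValueError(f"unknown class label: {label}")
        # "g" formatting keeps at most six significant digits.
        row = [format(x, "g") for x in intensities] + [label]
        lines.append(",".join(row))
    return "\n".join(lines) + "\n"


# Hypothetical toy input: three m/z bins, two spectra.
arff_text = spectra_to_arff(
    relation="toy_spectra",
    mz_values=[9100, 9120, 9140],
    samples=[([120.0, 3400.5, 80.0], "diseased"),
             ([110.0, 900.0, 75.0], "control")],
    class_labels=["diseased", "control"],
)
print(arff_text)
```

Declaring the class as a nominal attribute, rather than a string, is what lets a downstream classifier treat it as the prediction target.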

[Figure: MS-Analyzer architecture. Ontology-based Workflow Designer (Ontology Assistant: browsing, querying; WF Editor: composition, browsing, selection, visualization; WF Schema: abstract and concrete WF); Workflow Scheduler (WF Translator, WF Scheduler, WF Monitor); Ontology manager and ontologies; Resource Discovery Services (UDDI/MDS, metadata, WSDL); networked Web Service (WS) groups for spectra management, visualization, preparation, and preprocessing; SpecDB APIs over RSR (raw spectra), PPSR (pre-processed spectra), and PSR (prepared spectra).]

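[Figure: Web Services integration. The Ontology-based Workflow Designer (M-WS) places data and metadata in the FTP repository and sends the repository URL and an email address to the BioDCV WS on the DMZ server (Apache, mod_python, ZSI module), which passes the job to the BioDCV WS front-end server.]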

BIODCV

The predictive modeling portion of the proposed system is provided by BioDCV, the ITC-irst platform for machine learning in high-throughput functional genomics. BioDCV fully supports complete validation in order to control selection bias effects. To handle the high data throughput, BioDCV uses E-RFE, an entropy-based acceleration of the SVM-RFE feature ranking procedure [3].

For proteomics, it includes methods for baseline subtraction, spectra alignment, peak clustering, and peak assignment that were adapted from existing R packages and chained to the complete validation system.

BioDCV is also a grid application and has been used in production within the EGEE Biomed VO [4].
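Complete validation means that feature ranking is repeated inside every resampling split, on training samples only, so that held-out samples never influence which features are chosen. The sketch below illustrates that idea and nothing more: it substitutes a mean-difference ranking and a nearest-centroid classifier for E-RFE and the SVM, and all function names, parameters, and the synthetic data are assumptions.

```python
import random
from statistics import fmean


def rank_features(X, y):
    """Rank feature indices by absolute difference of class means."""
    n_features = len(X[0])
    scores = []
    for j in range(n_features):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        scores.append(abs(fmean(pos) - fmean(neg)))
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)


def nearest_centroid_error(X_tr, y_tr, X_te, y_te, features):
    """Test error of a nearest-centroid classifier restricted to `features`."""
    centroids = {}
    for c in (0, 1):
        rows = [row for row, label in zip(X_tr, y_tr) if label == c]
        centroids[c] = [fmean([r[j] for r in rows]) for j in features]
    errors = 0
    for row, label in zip(X_te, y_te):
        dist = {c: sum((row[j] - m) ** 2 for j, m in zip(features, centroids[c]))
                for c in (0, 1)}
        errors += min(dist, key=dist.get) != label
    return errors / len(y_te)


def complete_validation(X, y, n_features, n_splits=50, test_fraction=0.25, seed=0):
    """Average test error over random splits, re-ranking features in each split."""
    rng = random.Random(seed)
    idx = list(range(len(y)))
    errors = []
    for _ in range(n_splits):
        rng.shuffle(idx)
        n_test = max(1, int(len(idx) * test_fraction))
        test, train = idx[:n_test], idx[n_test:]
        X_tr, y_tr = [X[i] for i in train], [y[i] for i in train]
        if len(set(y_tr)) < 2:
            continue  # skip degenerate splits with a class missing from training
        # Feature selection sees the training samples only.
        top = rank_features(X_tr, y_tr)[:n_features]
        errors.append(nearest_centroid_error(
            X_tr, y_tr, [X[i] for i in test], [y[i] for i in test], top))
    return fmean(errors)


# Synthetic data shaped like D1: 49 samples, 67 features, one informative feature.
rng = random.Random(1)
y = [1] * 24 + [0] * 25
X = [[(10.0 if label == 1 and j == 0 else 0.0) + rng.uniform(-1, 1)
      for j in range(67)] for label in y]
ate = complete_validation(X, y, n_features=1)
print(f"average test error with 1 feature: {ate:.3f}")
```

Ranking the features once on all 49 samples and then cross-validating only the classifier would be the biased alternative: the test samples would already have shaped the feature list.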

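[Figure: BioDCV processing chain. Feature extraction (within sample, across samples) is followed by complete validation; R scripts produce the visualizations (ATE, sample tracking); PHP scripts produce the biomarker lists and the HTML publication. Outputs: biomarker data and report.]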

ACKNOWLEDGMENTS

• ITC-irst: R. Flor, D. Albanese, B. Irler
• UniCZ: G. Cuda, M. Gaspari, P.H. Guzzi, T. Mazza

WEB SERVICES ARCHITECTURE

Three Internet Web Services integrate the two main system components remotely.

The BioDCV component is invoked from the MS-Analyzer workflow as a Web Service (biodcv-ws-client) in the UniCZ network: data and metadata are copied to an FTP repository, then the data URL and a notification email address are transmitted to the BioDCV Web Service (biodcv-ws) in a DMZ area of the ITC-irst network.

This service is run directly by Apache with mod_python and the Zolera SOAP Infrastructure (ZSI). The incoming data are transferred to the internal front-end server (server-cz-tn.py) within the firewalled area.

The front-end first launches the feature extraction module and then a full complete validation process using the BioDCV component. The system outputs are then formatted as graphs and tables by R and PHP scripts. The results are published by the front-end on the DMZ server, and the user is notified by email.
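On the client side, the exchange described in this section reduces to two steps: upload the data and metadata to the FTP repository, then pass the repository URL and a notification address to the service. The sketch below only assembles and checks that two-field message. It does not reproduce the actual SOAP interface of biodcv-ws; the `JobRequest` type, its field names, and the example host and address are hypothetical.

```python
from dataclasses import dataclass, asdict
from urllib.parse import urlparse


@dataclass(frozen=True)
class JobRequest:
    data_url: str       # location of the uploaded data and metadata
    notify_email: str   # address that receives the completion notice

    def __post_init__(self):
        parsed = urlparse(self.data_url)
        if parsed.scheme != "ftp" or not parsed.netloc:
            raise ValueError("data_url must be an ftp:// URL with a host")
        local, _, domain = self.notify_email.partition("@")
        if not local or "." not in domain:
            raise ValueError("notify_email does not look like an email address")


# Hypothetical example values; no network access is performed.
request = JobRequest(
    data_url="ftp://repository.example.org/jobs/d1/",
    notify_email="analyst@example.org",
)
print(asdict(request))
```

Passing a URL rather than the spectra themselves keeps the service call small; the bulk transfer happens separately when the service fetches the data from the repository.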

[Figure: ATE versus number of features (1, 5, 10, 15, 20, 30, 40, 50, 67); ATE axis from 10 to 40. Legend: error rate (tumour tissue), error rate (non-tumoural tissue), no-information error rate.]

[Figure: sample tracking, E(S) (0.0 to 1.0) versus n (1 to 50), one panel per sample. Panel titles: 1: S0 (26), 2: S1 (28), 3: S2 (27), 4: S3 (25), 5: S4 (26), 6: S5 (35), 7: S6 (19), 8: S7 (32), 9: S8 (31), 10: S9 (30), 11: S10 (24), 12: S11 (22), 13: S12 (22), 14: S13 (24), 15: S14 (20), 16: S15 (27), 17: S16 (24), 18: S17 (22), 19: S18 (26), 20: S19 (18), 21: S20 (27), 22: S21 (25), 23: S22 (19), 24: S23 (21), 25: S24 (23).]

The BioDCV system: EGEE BioMed VO

[Figure: data flow on the EGEE BioMed VO grid. Transfers are labelled grid-ftp and scp, with data sizes of 2-50 MB and 50-400 MB.]

Commands:
1. grid-url-copy/lcg-cp db from local to SE
2. edg-job-submit BioDCV.jdl
3. grid-url-copy/lcg-cp db from SE to local

[Figure: D2 mean spectra for groups A and B, each with its .95 Student bootstrap CI; m/z from 9100 to 9200, intensity from 0 to 4000; annotation at 9133,17 Da.]