Proteomics repositories integration using EUDAT resources

Page 1

European Life Sciences Infrastructure for Biological Information

www.elixir-europe.org

Rafael C Jimenez, ELIXIR CTO

25 September 2014

Page 2

Data submissions

[Diagram labels: Submissions; raw data; processed data; metadata; Data repository; Search; Integration]

Page 3

Overview of shotgun proteomics data production

Noble WS, MacCoss MJ (2012) Computational and Statistical Analysis of Protein Mass Spectrometry Data. PLoS Comput Biol 8(1): e1002296. doi:10.1371/journal.pcbi.1002296

http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1002296

[Figure labels: Raw data; Processed data; Metadata]

Peptide sequences

MKKKNIYSIRKLGVG

IASVTLGTLLISG

GVTPAANAAQHD

FYQVLNMPNLNADQ

RNGFIQSLK

DDPSQSANVKLN

Page 4

Data examples

Raw data, Processed data, Metadata

DNA, Human, Liver, Mitochondria, W. Smith

Peptide, Mouse, Heart, Nucleus, J. Heinz

LPISASHSSK…

TTGTTATCCG…

… … …

Page 5

Proteomics data in PRIDE

~85% raw data

Page 6

ProteomeXchange

• Framework to enable standard data submission and dissemination pipelines between the main existing proteomics resources.

[Diagram labels: ProteomeCentral; Metadata / Manuscript; Raw Data*; Results; Journals; UniProt/NeXtProt; Peptide Atlas; Other DBs; GPMDB; Receiving repositories: PASSEL (SRM data), PRIDE (MS/MS data). Legend: Researcher’s results, Reprocessed results, Raw data*, Metadata]

Vizcaíno et al., Nature Biotechnology, 2014

Page 7

PRIDE (PRoteomics IDEntifications) database

Mass spectrometry

Martens et al., Proteomics, 2005

Vizcaíno et al., NAR, 2013

Page 8

Origin: 152 USA, 108 Germany, 67 United Kingdom, 53 Switzerland, 48 Netherlands, 42 China, 42 Canada, 41 France, 36 Spain, 33 Belgium, 25 Australia, 23 Sweden, 17 Japan, 16 Denmark, 13 Norway, 12 Finland, 12 India, 12 Taiwan, 10 Italy, 9 Republic of Korea, 8 Austria, 8 Ireland, 8 Brazil, 7 Singapore, 5 Israel, 5 Russia, …

Type:

273 PRIDE complete

501 PRIDE partial

47 PeptideAtlas/PASSEL complete

Access:

38.3% PRIDE public

5.3% PASSEL public

56% PRIDE private

0.4% PASSEL private

Data volume:

Total: >40 TB

Number of all files: >120,000

PXD000320-324: ~ 5 TB

PXD000065: ~ 1.4TB

Top species studied by at least 8 datasets:

381 Homo sapiens

100 Mus musculus

31 Arabidopsis thaliana

26 Saccharomyces cerevisiae

16 Escherichia coli

14 Rattus norvegicus

12 Mycobacterium tuberculosis

11 Drosophila melanogaster

~ 215 species in total

Submissions/year:

2012: 102

2013: 527

2014: 192
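As a quick sanity check on the figures above, the submission counts by type and by year can be totalled and the access shares summed. The following is a minimal Python sketch; the variable names are illustrative, and the numbers are copied from the slide.

```python
# Figures copied from the slide; variable names are illustrative only.
submissions_by_type = {
    "PRIDE complete": 273,
    "PRIDE partial": 501,
    "PeptideAtlas/PASSEL complete": 47,
}
submissions_by_year = {2012: 102, 2013: 527, 2014: 192}
access_share_percent = {
    "PRIDE public": 38.3,
    "PASSEL public": 5.3,
    "PRIDE private": 56.0,
    "PASSEL private": 0.4,
}

# Total submissions, counted two ways.
total_by_type = sum(submissions_by_type.values())
total_by_year = sum(submissions_by_year.values())

# Access shares, rounded to one decimal to absorb floating-point noise.
total_share = round(sum(access_share_percent.values()), 1)

print(total_by_type, total_by_year, total_share)
```

Both totals come to 821 submissions, and the four access shares add up to 100%, so the breakdowns on the slide are mutually consistent.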

Page 9

Pilot evolution

• Use EUDAT

  • Replication of ELIXIR data in EUDAT data centers

  • Delegation of ELIXIR data in EUDAT data centers

• Adopt EUDAT

  • Replication of ELIXIR data in ELIXIR data centers using EUDAT technology

Page 10

Replication of ELIXIR data in EUDAT data centers

[Diagram labels: Central repository; Data storage centers; Metadata, Raw Data; Metadata, Results, Raw Data]

Page 11

Replication of ELIXIR data in EUDAT data centers

[Diagram labels: ProteomeCentral; Metadata / Manuscript; Raw Data*; Results; Journals; UniProt/NeXtProt; Peptide Atlas; Other DBs; GPMDB; Receiving repositories: PASSEL (SRM data), PRIDE (MS/MS data); Raw Data* (shown a second time). Legend: Researcher’s results, Reprocessed results, Raw data*, Metadata]

Vizcaíno et al., Nature Biotechnology, 2014

Page 12

Delegation of ELIXIR data in EUDAT data centers

[Diagram labels: Central repository; Data storage centers; Metadata, Raw Data; Metadata, Results, Raw Data]

Page 13

Delegation of ELIXIR data in EUDAT data centers

[Diagram labels: ProteomeCentral; Metadata / Manuscript; Raw Data*; Results; Journals; UniProt/NeXtProt; Peptide Atlas; Other DBs; GPMDB; Receiving repositories: PRIDE (MS/MS data), PASSEL (SRM data); Raw Data* (shown a second time). Legend: Researcher’s results, Reprocessed results, Raw data*, Metadata]

Vizcaíno et al., Nature Biotechnology, 2014

Page 14

Replication of ELIXIR data in ELIXIR data centers using EUDAT technology

[Diagram labels: National proteomics centers (Metadata, Results, Raw Data); Central repository (Metadata, Results, Raw Data)]

Page 15

Plans

[Diagram labels: National proteomics centers (Metadata, Results, Raw Data); Central repository (Metadata, Results, Raw Data); Data storage centers (Metadata, Raw Data)]

1.- ELIXIR replication

2.- EUDAT replication

Page 16

Plans

[Diagram labels: National proteomics centers (Metadata, Results, Raw Data); Central repository (Metadata, Results, Raw Data); Data storage centers (Metadata, Raw Data)]

3.- delegation

Page 17

ELIXIR Pilot action

Page 18

EUDAT services

Page 19

File sharing model

[Diagram labels: PRIDE; EMBL-EBI; CSC; BILS; Site B; Site C; ELIXIR; EUDAT CDI; B2SAFE (four instances)]

Page 20

Pilot – EUDAT adoption: ELIXIR replication

[Diagram labels: PRIDE; EMBL-EBI; CSC; BILS; Site B; Site C; ELIXIR; EUDAT CDI; B2SAFE (four instances); Central repository (Metadata, Results, Raw Data); National proteomics centers (Metadata, Results, Raw Data)]

Page 21

PIDs

[Diagram labels: ELIXIR community center; ELIXIR data center 1; EUDAT data center 1; CSC; PRIDE; BILS]

Page 22

Status

• BILS

  • Migrating from existing Swestore dCache to iRODS

  • Testing compatibility with B2SAFE

  • Latest iRODS not compatible with B2SAFE?

• PRIDE

  • iRODS service installed

  • B2SAFE module has been deployed at EMBL-EBI (PRIDE)

  • Test B2SAFE replication PRIDE -> CSC

  • DOI for datasets

  • PID for dataset files

  • Web service to associate datasets to dataset files
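The last three status items suggest a simple data model: one DOI per dataset, one PID per file, and a lookup that ties the two together. The sketch below is only an illustration under that assumption. The class and method names, the identifier strings, and the file names are all hypothetical and do not describe the actual PRIDE web service.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class DatasetRecord:
    """Hypothetical record: one DOI per dataset, one PID per file."""
    doi: str
    file_pids: Dict[str, str] = field(default_factory=dict)  # file name -> PID


class DatasetRegistry:
    """In-memory stand-in for a service that associates datasets with files."""

    def __init__(self) -> None:
        self._by_doi: Dict[str, DatasetRecord] = {}
        self._doi_by_pid: Dict[str, str] = {}

    def register_file(self, doi: str, file_name: str, pid: str) -> None:
        # Create the dataset record on first use, then attach the file PID.
        record = self._by_doi.setdefault(doi, DatasetRecord(doi))
        record.file_pids[file_name] = pid
        self._doi_by_pid[pid] = doi

    def files_for_dataset(self, doi: str) -> Dict[str, str]:
        # Return a copy so callers cannot mutate the registry by accident.
        return dict(self._by_doi[doi].file_pids)

    def dataset_for_file(self, pid: str) -> str:
        return self._doi_by_pid[pid]


registry = DatasetRegistry()
# Placeholder identifiers, not real DOIs or handles.
registry.register_file("doi:example/dataset-1", "run01.raw", "pid:example/0001")
registry.register_file("doi:example/dataset-1", "run02.raw", "pid:example/0002")

print(registry.files_for_dataset("doi:example/dataset-1"))
print(registry.dataset_for_file("pid:example/0002"))
```

Keeping both directions of the mapping lets a replica resolve a file PID back to its dataset without scanning every record.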

Page 23

Status

In progress

• Handle System Registration

• Test requests of EPIC/EUDAT identifiers

Open questions

• BILS local PIDs?

• Sync back from PRIDE to BILS for modifications/additions at PRIDE?

• Data push or pull model?

• Replication of processed data requires previous validation
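To make the push-or-pull question concrete, the sketch below contrasts the two models, with in-memory dictionaries standing in for a source site and a replica. It is a toy illustration only: the function names and file names are hypothetical, and no B2SAFE or iRODS calls are involved.

```python
from typing import Dict, List

Store = Dict[str, bytes]  # file name -> content; stand-in for a storage site


def push(source: Store, replica: Store, file_name: str) -> None:
    """Push model: the source initiates the copy when it has a new file."""
    replica[file_name] = source[file_name]


def pull(source: Store, replica: Store) -> List[str]:
    """Pull model: the replica lists the source and fetches what it lacks."""
    missing = sorted(name for name in source if name not in replica)
    for name in missing:
        replica[name] = source[name]
    return missing


source_site: Store = {"a.raw": b"A", "b.raw": b"B"}
replica_site: Store = {}

push(source_site, replica_site, "a.raw")   # the source decides when to send
fetched = pull(source_site, replica_site)  # the replica decides when to fetch
print(fetched)
```

In the push model the source has to know about every replica, whereas in the pull model the replica needs a way to list what the source holds.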

Page 24

Participants

EUDAT/CSC

• Jani Heikkinen

• Damien Lecarpentier

EMBL-EBI/systems

• Andy Jenkinson

• Steven Newhouse

EMBL-EBI/PRIDE

• Juan Antonio Vizcaíno

• Henning Hermjakob

BILS

• Mikael Borg

• Fredrik Levander

• Bengt Persson

ELIXIR Hub

• Rafael C Jimenez

Page 25

European Life Sciences Infrastructure for Biological Information

www.elixir-europe.org

Thank you for your attention

Page 26

Delegation of raw data

[Diagram labels: Submissions; processed data; metadata; PID; Data repository; Search; Integration]

Page 27

[Diagram, shown twice: National proteomics centers (Metadata, Results, Raw Data); Central repository (Metadata, Results, Raw Data); Data storage centers (Metadata, Raw Data)]