data replication · persistent identifiers registration the set up of the automated data transfer...

51
Data Replication Automated move and copy of data Giovanni Morelli [email protected] Claudio Cacciari [email protected] Giuseppe Fiameni [email protected] Johannes Reetz [email protected] EUDAT 2 nd Conference Rome, Oct. 28 th -30 th 2013 Data Staging Moving large amounts of data around, and moving it close to compute resources

Upload: others

Post on 22-Jul-2020

13 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Data Replication · Persistent Identifiers registration The set up of the automated data transfer between EPOS community and EUDAT touched all the aspects we mentioned so far: EPOS

Data ReplicationAutomated move and copy of data

Giovanni [email protected]

Claudio [email protected]

Giuseppe [email protected]

Johannes [email protected]

EUDAT 2nd ConferenceRome, Oct. 28th-30th 2013

Data StagingMoving large amounts of data around, and moving it close to

compute resources

Page 2: Data Replication · Persistent Identifiers registration The set up of the automated data transfer between EPOS community and EUDAT touched all the aspects we mentioned so far: EPOS

Safe Replication Outline– Principles– Data movement:

• Which kind of service?• Which kind of users?

– Flexibility– Different transfer strategies– Policies– Performances– Different federation strategies– PID and registered data

EUDAT 2nd Conference Rome, Oct. 28th-30th 2013 2

Page 3: Data Replication · Persistent Identifiers registration The set up of the automated data transfer between EPOS community and EUDAT touched all the aspects we mentioned so far: EPOS

Principles – where we want to be

1. Data deposited will be preserved in perpetuity

2. Data are best curated in their own communities

3. Access to data in the Collaborative Data Infrastructure (CDI) is free at the point of use

4. The Collaborative Data Infrastructure will not assert ownership of any data it holds

3EUDAT 2nd Conference Rome, Oct. 28th-30th 2013 3

Page 4: Data Replication · Persistent Identifiers registration The set up of the automated data transfer between EPOS community and EUDAT touched all the aspects we mentioned so far: EPOS

Data movement: staging or replication?

Data StagingSafe ReplicationDynamic replication to HPC workspace for processing

Data preservation and access optimization

Data

PID

4EUDAT 2nd Conference Rome, Oct. 28th-30th 2013 4

Page 5: Data Replication · Persistent Identifiers registration The set up of the automated data transfer between EPOS community and EUDAT touched all the aspects we mentioned so far: EPOS

ReplicationSafe Replication to enable communities easily createreplicas of their scientific datasets in multiple datacentres for improving data curation and accessibility

Safe Replication to enable communities easily createreplicas of their scientific datasets in multiple datacentres for improving data curation and accessibility

5EUDAT 2nd Conference Rome, Oct. 28th-30th 2013 5

data bit-stream preservation more optimal data curation better accessibility of data identification of data through Persistent Identifiers (PIDs)

Page 6: Data Replication · Persistent Identifiers registration The set up of the automated data transfer between EPOS community and EUDAT touched all the aspects we mentioned so far: EPOS

Persistent Identifiers (PID)

• EUDAT relies on the EPIC service to associate persistent identifier to digital objects (http://www.pidconsortium.eu).

• EPIC is an identifier system using the Handle infrastructure.

• Its focus is the registration of data in an early state of the scientific process, where lots of data is generated and has to become referable to collaborate with other scientific groups or communities, but it is still unclear, which small part of the data should be available for a long time period.

http://www.ands.org.au/services/pid-policy.html

EUDAT 2nd Conference Rome, Oct. 28th-30th 2013 6

Page 7: Data Replication · Persistent Identifiers registration The set up of the automated data transfer between EPOS community and EUDAT touched all the aspects we mentioned so far: EPOS

Wath users want

replicate my collection X to three data centresand store the collection safely for 10 years

EUDAT 2nd Conference Rome, Oct. 28th-30th 2013 7

Page 8: Data Replication · Persistent Identifiers registration The set up of the automated data transfer between EPOS community and EUDAT touched all the aspects we mentioned so far: EPOS

Are you talking about clouds?

I already do it every day in my cloud space !

EUDAT 2nd Conference Rome, Oct. 28th-30th 2013 8

Page 9: Data Replication · Persistent Identifiers registration The set up of the automated data transfer between EPOS community and EUDAT touched all the aspects we mentioned so far: EPOS

We are talking about a …

robust

safe

highly available

Replication Service

… which is not a personal cloud space

EUDAT 2nd Conference Rome, Oct. 28th-30th 2013 9

Page 10: Data Replication · Persistent Identifiers registration The set up of the automated data transfer between EPOS community and EUDAT touched all the aspects we mentioned so far: EPOS

What about trust?Can you find where your data are

physically stored on the cloud?

No, because clouds are opaque

Or who can access them?

While a Collaborative Data Infrastructure is transparent

EUDAT 2nd Conference Rome, Oct. 28th-30th 2013 10

Page 11: Data Replication · Persistent Identifiers registration The set up of the automated data transfer between EPOS community and EUDAT touched all the aspects we mentioned so far: EPOS

Community center

EUDAT centerCLARIN

ENES

VPH

Lifewatch

replicate my collection X to three data centres

CINECA

BSC

EPCC

EPOS

EUDAT 2nd Conference Rome, Oct. 28th-30th 2013 11

Page 12: Data Replication · Persistent Identifiers registration The set up of the automated data transfer between EPOS community and EUDAT touched all the aspects we mentioned so far: EPOS

Then is it a complex mechanism suited only forexpert data managers?

No.It is a flexible approach able to encompass

quite different usage scenarios.EUDAT 2nd Conference Rome, Oct. 28th-30th 2013 12

Page 13: Data Replication · Persistent Identifiers registration The set up of the automated data transfer between EPOS community and EUDAT touched all the aspects we mentioned so far: EPOS

Flexibility

EUDAT 2nd Conference Rome, Oct. 28th-30th 2013 13

Single researcher

TeamCommunityUpload and forget

Upload, addmetadata

share

Periodic transfers, quality checks …

Different strategies for different usage scenariosHorizontal flexibility

Vert

ical

flex

ibili

ty

rule

-orie

nted

repl

icat

ion

Page 14: Data Replication · Persistent Identifiers registration The set up of the automated data transfer between EPOS community and EUDAT touched all the aspects we mentioned so far: EPOS

iRODS• iRODS is a data management system, which

integrates a rule engine• The rules can be triggered automatically

based on specific events (for example a data set is moved to a particular directory)

• Invoked from remote via iRODS command lineclient or integrated into applications based on iRODS java (Jargon) and python (PyRods) API

EUDAT 2nd Conference Rome, Oct. 28th-30th 2013 14

Page 15: Data Replication · Persistent Identifiers registration The set up of the automated data transfer between EPOS community and EUDAT touched all the aspects we mentioned so far: EPOS

Transfer approachesAutonomous Planned

Data size

smallmedium

largehuge

EUDAT 2nd Conference Rome, Oct. 28th-30th 2013 15

Page 16: Data Replication · Persistent Identifiers registration The set up of the automated data transfer between EPOS community and EUDAT touched all the aspects we mentioned so far: EPOS

replicate my collection X to three data centresand store the collection safely for 10 years

Coming back …

Apparently a simple statement

But you need to plan it, then you need …

Policies !EUDAT 2nd Conference Rome, Oct. 28th-30th 2013 16

Page 17: Data Replication · Persistent Identifiers registration The set up of the automated data transfer between EPOS community and EUDAT touched all the aspects we mentioned so far: EPOS

EUDAT consortium is working on

Policies for the Collaborative Data Infrastructure management

We can call them “internal policies”

EUDAT 2nd Conference Rome, Oct. 28th-30th 2013 17

Page 18: Data Replication · Persistent Identifiers registration The set up of the automated data transfer between EPOS community and EUDAT touched all the aspects we mentioned so far: EPOS

How to manage software updates?

What about security bugs?

Which systems are monitored?

EUDAT 2nd Conference Rome, Oct. 28th-30th 2013 18

Page 19: Data Replication · Persistent Identifiers registration The set up of the automated data transfer between EPOS community and EUDAT touched all the aspects we mentioned so far: EPOS

EUDAT center

Each community has one or more doors to connectto the infrastructure

community center

Ingestionpoint

EUDAT 2nd Conference Rome, Oct. 28th-30th 2013 19

Page 20: Data Replication · Persistent Identifiers registration The set up of the automated data transfer between EPOS community and EUDAT touched all the aspects we mentioned so far: EPOS

replicate my collection X to three data centresand store the collection safely for 10 years …

Updating the sub-collection X1 weekly

And the sub-collection X2 hourly

And keeping on-line the datauploaded during the last six months

EUDAT 2nd Conference Rome, Oct. 28th-30th 2013 20

Page 21: Data Replication · Persistent Identifiers registration The set up of the automated data transfer between EPOS community and EUDAT touched all the aspects we mentioned so far: EPOS

So far so good, we have our infrastructure, our policies

Performances?

What else is missing?

EUDAT 2nd Conference Rome, Oct. 28th-30th 2013 21

Page 22: Data Replication · Persistent Identifiers registration The set up of the automated data transfer between EPOS community and EUDAT touched all the aspects we mentioned so far: EPOS

If you produce 1 TB of data per day and you want to store it remotely

And it takes one week to move the data

Then any policy is useless !

EUDAT 2nd Conference Rome, Oct. 28th-30th 2013 22

Page 23: Data Replication · Persistent Identifiers registration The set up of the automated data transfer between EPOS community and EUDAT touched all the aspects we mentioned so far: EPOS

High Performance Transfers

http iRODS gridFTP

speed

band

wid

th

Transfer tools

EUDAT 2nd Conference Rome, Oct. 28th-30th 2013 23

Page 24: Data Replication · Persistent Identifiers registration The set up of the automated data transfer between EPOS community and EUDAT touched all the aspects we mentioned so far: EPOS

Community repositoryData center

To join or not to joinUse: automated data movement client to serverJoin: automated data movement server to server

EUDAT 2nd Conference Rome, Oct. 28th-30th 201324

Page 25: Data Replication · Persistent Identifiers registration The set up of the automated data transfer between EPOS community and EUDAT touched all the aspects we mentioned so far: EPOS

EUDAT CDI Domain of registered data

Safe replication explained

Community repository

PIDPID

Data Center Store

Data Center Store

Data Center Store

EPICservice

EUDAT 2nd Conference Rome, Oct. 28th-30th 2013 25

Page 26: Data Replication · Persistent Identifiers registration The set up of the automated data transfer between EPOS community and EUDAT touched all the aspects we mentioned so far: EPOS

Registered data

domain of globally referenceable data

temporarydata

preservation

registration

referenceabledata

acquisitiongenerationdescription

data

dataenrichmentprocessingreductionanalysis

citablepublication

global registration

EUDAT 2nd Conference Rome, Oct. 28th-30th 2013 26

Page 27: Data Replication · Persistent Identifiers registration The set up of the automated data transfer between EPOS community and EUDAT touched all the aspects we mentioned so far: EPOS

Finally, all together

Preservation

Near realtime sync(ongoing)

Persistent Identifiersregistration

The set up of the automated data transfer between EPOS community and EUDAT touched all the aspects we mentioned so far:

EPOS joined the EUDAT CDIWe defined a specific policy with them

We tuned the transfer tools to achieve the best performance, butthe HP tool (GridFTP) was useless since the bottleneck was the

bandwidthSo we chose a more flexible tool like iRODS irsync protocol

In fact in order to achieve a hourly synchronization we exploitedchecksum sync and file age limit options.

EUDAT 2nd Conference Rome, Oct. 28th-30th 2013 27

Page 28: Data Replication · Persistent Identifiers registration The set up of the automated data transfer between EPOS community and EUDAT touched all the aspects we mentioned so far: EPOS

Data Staging Outline– Definition– Moving data around– Size/Performances– Transfer options– StageIN/StageOut– Data Staging Script– All services together

EUDAT 2nd Conference Rome, Oct. 28th-30th 2013 28

Page 29: Data Replication · Persistent Identifiers registration The set up of the automated data transfer between EPOS community and EUDAT touched all the aspects we mentioned so far: EPOS

Data movement: staging or replication?

Data StagingSafe ReplicationDynamic replication to HPC workspace for processing

Data preservation and access optimization

Data

PID

29EUDAT 2nd Conference Rome, Oct. 28th-30th 2013 29

Page 30: Data Replication · Persistent Identifiers registration The set up of the automated data transfer between EPOS community and EUDAT touched all the aspects we mentioned so far: EPOS

Staging

30EUDAT 2nd Conference Rome, Oct. 28th-30th 2013 30

Page 31: Data Replication · Persistent Identifiers registration The set up of the automated data transfer between EPOS community and EUDAT touched all the aspects we mentioned so far: EPOS

Moving large amounts of data around

Data sets can be large both in terms of

Numbers of objects Single object size

EUDAT 2nd Conference Rome, Oct. 28th-30th 2013 31

Page 32: Data Replication · Persistent Identifiers registration The set up of the automated data transfer between EPOS community and EUDAT touched all the aspects we mentioned so far: EPOS

All data are gray in the data staging

The data types are irrelevant

Except when they can affect the size or the number of the files

EUDAT 2nd Conference Rome, Oct. 28th-30th 2013 32

Page 33: Data Replication · Persistent Identifiers registration The set up of the automated data transfer between EPOS community and EUDAT touched all the aspects we mentioned so far: EPOS

When I should care about data types?

When you can compress them

Data transfer efficiency is crucial in data staging

EUDAT 2nd Conference Rome, Oct. 28th-30th 2013 33

Page 34: Data Replication · Persistent Identifiers registration The set up of the automated data transfer between EPOS community and EUDAT touched all the aspects we mentioned so far: EPOS

Data Throughput – Transfer Times

This table available at http://fasterdata.es.net

EUDAT 2nd Conference Rome, Oct. 28th-30th 2013 34

Page 35: Data Replication · Persistent Identifiers registration The set up of the automated data transfer between EPOS community and EUDAT touched all the aspects we mentioned so far: EPOS

Data staging is Just in Time

You do not plan it

You can pre-stage… sometimes

There are techniques to “improve the efficiency”

EUDAT 2nd Conference Rome, Oct. 28th-30th 2013 35

Page 36: Data Replication · Persistent Identifiers registration The set up of the automated data transfer between EPOS community and EUDAT touched all the aspects we mentioned so far: EPOS

Improve the efficiency

TCP tuning: refers to the proper configuration of buffers that correspond to TCP windowing

Pipelining (of commands): speeds up lots of tiny files by stuffing multiple commands into each login session back-to-back without

waiting for the first command's response

EUDAT 2nd Conference Rome, Oct. 28th-30th 2013 36

Page 37: Data Replication · Persistent Identifiers registration The set up of the automated data transfer between EPOS community and EUDAT touched all the aspects we mentioned so far: EPOS

Parallelizing: on wide-area links, using multiple TCPstreams in parallel (even between the same source anddestination) can improve aggregate bandwidth over using asingle TCP stream

EUDAT 2nd Conference Rome, Oct. 28th-30th 2013 37

Improve the efficiency…

Page 38: Data Replication · Persistent Identifiers registration The set up of the automated data transfer between EPOS community and EUDAT touched all the aspects we mentioned so far: EPOS

Striping: data may be striped or interleaved across multiple servers

EUDAT 2nd Conference Rome, Oct. 28th-30th 2013 38

…improve the efficiency

Page 39: Data Replication · Persistent Identifiers registration The set up of the automated data transfer between EPOS community and EUDAT touched all the aspects we mentioned so far: EPOS

Parallelism and TCP tuning are the keys• It is much easier to achieve a given performance

level with four parallel connections than with one• A good TCP tuning can improve drastically

performances

Latency interaction is critical• Wide area data transfers have much higher latency

than LAN transfers• Many tools and protocols assume a LAN• Examples: SCP/SFTP, HPSS mover protocol

High Performance Browser Networking by Ilya Grigorik

EUDAT 2nd Conference Rome, Oct. 28th-30th 2013 39

Page 40: Data Replication · Persistent Identifiers registration The set up of the automated data transfer between EPOS community and EUDAT touched all the aspects we mentioned so far: EPOS

But efficiency is not allEasiness of use

Which priority?Different scenarios, different needs

Possibility to control/limit the transfer throughput to avoid engulfing the network

Third-party transferHigh reliability

EUDAT 2nd Conference Rome, Oct. 28th-30th 2013 40

Page 41: Data Replication · Persistent Identifiers registration The set up of the automated data transfer between EPOS community and EUDAT touched all the aspects we mentioned so far: EPOS

Move the data

EUDAT 2nd Conference Rome, Oct. 28th-30th 2013 41

Page 42: Data Replication · Persistent Identifiers registration The set up of the automated data transfer between EPOS community and EUDAT touched all the aspects we mentioned so far: EPOS

How it works• Server side

– the data staging functionality is realized by extending the iRODS system with a GridFTPinterface (Data Storage Interface - DSI) so to permit the transfer of data through a reliable, high-performance protocol.

• Client side– any existing client, supporting the GridFTP protocol

can be employed – globus-url-copy, Globus On Line, UberFTP, gTransfer, etc.

• Users need a personal certificate (X.509) to fully exploit the service

EUDAT 2nd Conference Rome, Oct. 28th-30th 2013 42

Page 43: Data Replication · Persistent Identifiers registration The set up of the automated data transfer between EPOS community and EUDAT touched all the aspects we mentioned so far: EPOS

GridFTP: third Party Transfer

EUDAT 2nd Conference Rome, Oct. 28th-30th 2013 43

Page 44: Data Replication · Persistent Identifiers registration The set up of the automated data transfer between EPOS community and EUDAT touched all the aspects we mentioned so far: EPOS

Eudat Transfer optionsGlobus OnlineThird party transfer, web portal

Globus client toserver

iRODS specificclient to server

HTTP optionsunder discussion…

iRODS specificserver to server

Globus server to server

EUDAT 2nd Conference Rome, Oct. 28th-30th 2013 44

Page 45: Data Replication · Persistent Identifiers registration The set up of the automated data transfer between EPOS community and EUDAT touched all the aspects we mentioned so far: EPOS

EUDATLTA

PID

EUDAT SP site A Data Mining cluster

Step 1 Step 2 Step N

Workflow framework

Stage-out/stage-in

EUDAT 2nd Conference Rome, Oct. 28th-30th 2013 45

Page 46: Data Replication · Persistent Identifiers registration The set up of the automated data transfer between EPOS community and EUDAT touched all the aspects we mentioned so far: EPOS

Data Staging Script• A simple python modular staging script to help communities

integrate the data staging service within their exiting solutions

• Based on Globus Online API and iRODS rule mechanism for data selection

./datastager.py <what to do> -u <username> --ss <HPCmachine> --sd <UnixPath> -p <filename> --ds <EUDAT-NODE> --dd <iRODS-Path>

./datastager.py in taskid -u <USER> --ss <SRC_SITE> --sd <SRC_DIR> -p <PATH> \\--ds <DST_SITE> --dd <DST_DIR>

./datastager.py out pid --pid <PID> -u <USER> --ds <DST_SITE> --dd <DST_DIR>

EUDAT 2nd Conference Rome, Oct. 28th-30th 2013 46

Page 47: Data Replication · Persistent Identifiers registration The set up of the automated data transfer between EPOS community and EUDAT touched all the aspects we mentioned so far: EPOS

Data Staging Script: Stage-in

EUDAT 2nd Conference Rome, Oct. 28th-30th 2013 47

1. DSS contact GO via API (A)

2. GO contacts endpoints and

activate them (B)

3. DSS gives to GO the list of

files

4. A third party transfer is

executed between

endpoints (C)

5. User recive email when

transfer is finished

6. User can retrive PID for

furher processing (D,E)

PID

Page 48: Data Replication · Persistent Identifiers registration The set up of the automated data transfer between EPOS community and EUDAT touched all the aspects we mentioned so far: EPOS

Data Staging Script: Stage-out

EUDAT 2nd Conference Rome, Oct. 28th-30th 2013 48

Page 49: Data Replication · Persistent Identifiers registration The set up of the automated data transfer between EPOS community and EUDAT touched all the aspects we mentioned so far: EPOS

EUDAT service overview

Data StagingSafe Replication Simple Store

AAIMetadata Catalogue

Dynamic replication to HPC workspace for processing

Data preservation and access optimization

Researcher data store (simple upload, share and access)

Aggregated EUDAT metadata domain.Data inventory

Network of trust among authentication and authorization actors

MetadataData

Anchor foridentifica-tion and integrity

PID

PID

EUDAT 2nd Conference Rome, Oct. 28th-30th 2013 49

Page 50: Data Replication · Persistent Identifiers registration The set up of the automated data transfer between EPOS community and EUDAT touched all the aspects we mentioned so far: EPOS

Conclusions• The way to move data has to be enough flexible

to accomodate different transfer protocols, different access mechanisms.

• Flexibility means also that the transfer tools can be used as they are, with default parameters, foraverage performances, but also fine tuned byexperts for faster transfers.

• No solution fits all, so different services are provided

EUDAT 2nd Conference Rome, Oct. 28th-30th 2013 50

Page 51: Data Replication · Persistent Identifiers registration The set up of the automated data transfer between EPOS community and EUDAT touched all the aspects we mentioned so far: EPOS

Thank you!

http://eudat.eu/datastaging | [email protected]

http://eudat.eu/safe-replication | [email protected]

EUDAT 2nd Conference Rome, Oct. 28th-30th 2013 51