
XtreemFS - a distributed and replicated cloud file system

Michael Berlin, Zuse Institute Berlin
DESY Computing Seminar, 16.05.2011

Who we are

– Zuse Institute Berlin
  – operates the HLRN supercomputer (#63 and #64)
  – research in computer science and mathematics
– Parallel and Distributed Systems group
  – led by Prof. Alexander Reinefeld (Humboldt University)
  – distributed and failure-tolerant storage systems

Who we are (2)

Michael Berlin

– PhD student since 03/2011
– studied computer science at Humboldt-Universität zu Berlin
– Diplom thesis dealt with XtreemFS
– currently working on the XtreemFS client

Motivation

Problem: multiple copies of data

– Where are they?
– Are the copies complete?
– Are there different versions?

[Figure: the same data copied across internal cluster storage, external cluster storage, a local file server, and a PC, accessed by internal and external cluster nodes]

Motivation (2)

Problem: Different access interfaces

[Figure: each storage location is reached through a different interface: <parallel file system> for the external cluster storage, NFS/Samba for the local file server, VPN+?/SSHFS and SCP for the PC and the laptop on 3G/Wi-Fi]

Motivation (3)

XtreemFS goals:

Transparency

Availability

[Figure: XtreemFS spans the PC, the laptop on 3G/Wi-Fi, and the internal and external cluster nodes, presenting a single file system]

File Systems Landscape

[Figure: landscape chart of existing file systems]

Outline

1. XtreemFS Architecture

2. Client Interfaces

3. Read-Only Replication

4. Read-Write Replication

5. Metadata Replication

6. Customization through Policies

7. Security

8. Use Case: Mosgrid

9. Snapshots


XtreemFS Architecture (1)

Volume on a metadata server:

– provides a hierarchical namespace

File content on storage servers:

– accessed directly by clients

[Figure: a PC and internal cluster nodes accessing the internal cluster storage and a local file server]

XtreemFS Architecture (2)

Metadata and Replica Catalog (MRC):

– holds volumes

Object Storage Devices (OSDs):

– file content is split into objects
– objects can be striped across OSDs

Together this forms an object-based file system architecture.
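To make the striping concrete, here is a minimal sketch of how a byte offset maps to an object number and an OSD under round-robin striping. The object size, OSD count, and round-robin rule are illustrative assumptions, not the XtreemFS defaults.

```java
// Sketch: mapping a file offset to an object and an OSD under
// round-robin striping. All constants are illustrative assumptions.
public class StripingSketch {
    static final int OBJECT_SIZE = 128 * 1024; // assumed 128 KiB objects
    static final int NUM_OSDS = 3;             // OSDs in the stripe

    public static void main(String[] args) {
        long offset = 1_000_000L;                // byte offset into the file
        long objectNo = offset / OBJECT_SIZE;    // which object holds this byte
        long osdIndex = objectNo % NUM_OSDS;     // round-robin OSD assignment
        long objectOffset = offset % OBJECT_SIZE;
        System.out.printf("offset %d -> object %d on OSD %d (offset %d within object)%n",
                offset, objectNo, osdIndex, objectOffset);
    }
}
```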

Scalability

File I/O throughput:

– parallel I/O: scales with the number of OSDs

Storage capacity:

– OSDs can be added and removed
– OSDs may be used by multiple volumes

Metadata throughput:

– limited by the MRC hardware
– remedy: use many volumes spread over multiple MRCs


Accessing Components

Directory Service (DIR):

– central registry
– all servers (MRC, OSD) register there with their IDs
– provides:
  – the list of available volumes
  – the mapping from service ID to URL
  – the list of available OSDs
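As a minimal illustration of the registry idea, the sketch below models the DIR as an in-memory map from service ID to URL. The real DIR is a networked service; the IDs, URLs, and ports here are made-up example values.

```java
// Sketch of the DIR's role: services register under an ID, clients
// resolve IDs to URLs. A plain in-memory map, purely illustrative.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class DirSketch {
    private final Map<String, String> idToUrl = new ConcurrentHashMap<>();

    public void register(String serviceId, String url) { idToUrl.put(serviceId, url); }
    public String resolve(String serviceId) { return idToUrl.get(serviceId); }

    public static void main(String[] args) {
        DirSketch dir = new DirSketch();
        dir.register("mrc-1", "pbrpc://mrc1.example.org:32636"); // example values
        dir.register("osd-1", "pbrpc://osd1.example.org:32640");
        System.out.println(dir.resolve("osd-1")); // a client looks up the OSD's URL
    }
}
```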

Client Interfaces

– XtreemFS supports the POSIX interface and semantics
– mount.xtreemfs: FUSE-based; runs on Linux, FreeBSD, OS X, and Windows (Dokan)
– libxtreemfs for Java and C++
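As a hypothetical sketch of library-based access in the spirit of libxtreemfs (Java): every name below is invented for illustration and does NOT reflect the actual libxtreemfs API; a tiny in-memory mock stands in for the volume so the sketch runs on its own.

```java
// Hypothetical client-library sketch. Volume, write, read and the mock
// are all illustrative assumptions, not the real libxtreemfs API.
import java.util.HashMap;
import java.util.Map;

public class ClientInterfaceSketch {
    interface Volume {                         // assumed volume abstraction
        void write(String path, byte[] data);  // in XtreemFS, data goes directly to OSDs
        byte[] read(String path);
    }

    static class InMemoryVolume implements Volume {  // mock, not a real client
        private final Map<String, byte[]> files = new HashMap<>();
        public void write(String path, byte[] data) { files.put(path, data); }
        public byte[] read(String path) { return files.get(path); }
    }

    public static void main(String[] args) {
        Volume vol = new InMemoryVolume();     // real code would open a volume via the DIR/MRC
        vol.write("/results/run1.dat", "42".getBytes());
        System.out.println(new String(vol.read("/results/run1.dat")));
    }
}
```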

[Figure: the internal and external cluster nodes, the PC, and the laptop on 3G/Wi-Fi all mount XtreemFS via mount.xtreemfs]

Read-Only Replication

Requirement: the file must be marked read-only.

Replica types:

a. full replica: requires a complete copy
b. partial replica: fills itself on demand; instantly ready to use

[Figure: a file on the internal cluster storage replicated to the external cluster storage, close to the external cluster nodes]

Read-Only Replication (2)


Read-Only Replication (3)

– receiver-initiated transfer at the object level
– OSDs exchange object lists
– filling strategies: fetch objects
  – in order
  – rarest first
– prefetching is available
– on-close replication: automatic replica creation
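The rarest-first strategy can be sketched as follows: among the objects a partial replica still misses, fetch the one held by the fewest other replicas (the counts would come from the exchanged object lists). The data layout is an illustrative assumption.

```java
// Sketch of a "rarest first" filling strategy for a partial replica.
import java.util.*;

public class RarestFirstSketch {
    // availability maps object number -> how many replicas already hold it
    static int pickNext(Set<Integer> missing, Map<Integer, Integer> availability) {
        return missing.stream()
                .min(Comparator.comparingInt(obj -> availability.getOrDefault(obj, 0)))
                .orElseThrow(() -> new NoSuchElementException("replica is complete"));
    }

    public static void main(String[] args) {
        Set<Integer> missing = new HashSet<>(List.of(0, 1, 2));
        Map<Integer, Integer> availability = Map.of(0, 3, 1, 1, 2, 2);
        System.out.println("fetch object " + pickNext(missing, availability)); // -> 1 (rarest)
    }
}
```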

Read-Write Replication

Goals:

– availability
– data safety
– allow modifications

[Figure: important.cpp replicated on the internal cluster storage and the local file server, accessed from a PC]

Read-Write Replication (2)


Primary/Backup:

Read-Write Replication (3)

Primary/Backup:

1. Lease acquisition
   – at most one valid lease per file
   – revocation = lease timeout

Read-Write Replication (4)

Primary/Backup:

1. Lease acquisition (as above)
2. Data dissemination
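A minimal sketch of the lease rule on a single file: a lease names a holder and an expiry time, and acquisition succeeds only once the previous lease has timed out. This is a single-process illustration; Flease establishes the same property with majority voting over message passing, which is not modeled here, and the lease span is an assumed value.

```java
// Sketch: at most one valid lease per file; revocation = lease timeout.
public class LeaseSketch {
    static final long LEASE_SPAN_MS = 15_000; // assumed lease duration

    private String holder;        // current primary, or null
    private long expiresAtMillis; // lease end

    synchronized boolean tryAcquire(String replicaId, long nowMillis) {
        boolean free = holder == null || nowMillis >= expiresAtMillis;
        if (free) {               // grant: previous lease (if any) has timed out
            holder = replicaId;
            expiresAtMillis = nowMillis + LEASE_SPAN_MS;
        }
        return free || holder.equals(replicaId); // the holder may also renew
    }

    public static void main(String[] args) {
        LeaseSketch lease = new LeaseSketch();
        long t = System.currentTimeMillis();
        System.out.println(lease.tryAcquire("osd-1", t));                 // true: becomes primary
        System.out.println(lease.tryAcquire("osd-2", t + 1_000));         // false: lease still valid
        System.out.println(lease.tryAcquire("osd-2", t + LEASE_SPAN_MS)); // true: lease timed out
    }
}
```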

Read-Write Replication (5)

Lease acquisition:

– XtreemFS uses Flease
  – scalable
  – majority-based

Data dissemination:

– update strategies:
  – write all, read 1
  – write quorum, read quorum

[Figure: Flease compared with a central lock service]
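The correctness condition behind these update strategies is quorum intersection: with N replicas, any write set of size W and any read set of size R overlap exactly when W + R > N, so a read is guaranteed to see the latest acknowledged write. A minimal check:

```java
// Sketch of the quorum intersection rule behind the update strategies.
public class QuorumSketch {
    static boolean intersects(int n, int w, int r) {
        return w + r > n; // write and read sets must share at least one replica
    }

    public static void main(String[] args) {
        int n = 3;
        System.out.println(intersects(n, 3, 1)); // Write All, Read 1        -> true
        System.out.println(intersects(n, 2, 2)); // Write Quorum, Read Quorum -> true
        System.out.println(intersects(n, 1, 1)); // too weak                  -> false
    }
}
```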

Metadata Replication

– primary/backup replication
  – volume = database
  – the database is replicated transparently
  – leases are used to elect the primary
  – inserts/updates/deletes are replicated
– database = key/value store
  – own implementation: BabuDB
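A minimal sketch of the scheme, with in-memory maps standing in for BabuDB: the primary applies each insert/update/delete locally and forwards the same operation to the backups. Networking, leases, and failover are omitted; everything here is illustrative.

```java
// Sketch of primary/backup replication of key/value operations.
import java.util.*;

public class MetadataReplicationSketch {
    private final Map<String, String> store = new TreeMap<>(); // stands in for BabuDB
    private final List<MetadataReplicationSketch> backups = new ArrayList<>();

    void addBackup(MetadataReplicationSketch b) { backups.add(b); }

    void put(String key, String value) {        // insert/update
        store.put(key, value);
        for (MetadataReplicationSketch b : backups) b.store.put(key, value);
    }

    void delete(String key) {
        store.remove(key);
        for (MetadataReplicationSketch b : backups) b.store.remove(key);
    }

    public static void main(String[] args) {
        MetadataReplicationSketch primary = new MetadataReplicationSketch();
        MetadataReplicationSketch backup = new MetadataReplicationSketch();
        primary.addBackup(backup);
        primary.put("/volume1/file.txt", "inode=42"); // hypothetical metadata entry
        System.out.println(backup.store);             // the backup saw the same update
    }
}
```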

Customization through Policies

Example: which replica shall the client select?

This is determined by policies. Policies exist for:

– authentication
– authorization
– UID/GID mappings
– replica placement
– replica selection

[Figure: a client must choose between replicas on the internal and the external cluster storage]

Customization through Policies (2)

Replica placement/selection policies:

– filter / sort / group the replica list
– available default policies:
  – FQDN-based
  – datacenter map
  – Vivaldi (latency estimation)
– policies can be chained
– custom policies can be written in Java
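The filter/sort/chain idea can be sketched as below, with replicas reduced to hostnames and policies to list transformations. The interface and the FQDN rule are illustrative assumptions, not the actual XtreemFS policy API.

```java
// Sketch of chainable replica selection policies: each policy reorders
// or filters the replica list, and policies run one after the other.
import java.util.*;
import java.util.function.UnaryOperator;

public class PolicyChainSketch {
    static List<String> apply(List<UnaryOperator<List<String>>> chain, List<String> replicas) {
        List<String> result = replicas;
        for (UnaryOperator<List<String>> policy : chain) result = policy.apply(result);
        return result;
    }

    public static void main(String[] args) {
        String clientDomain = "int-cluster"; // assumed domain of the client
        // Sort policy: replicas whose FQDN shares the client's domain come first.
        UnaryOperator<List<String>> fqdnSort = list -> {
            List<String> sorted = new ArrayList<>(list);
            sorted.sort(Comparator.comparingInt(
                    (String host) -> host.endsWith(clientDomain) ? 0 : 1));
            return sorted;
        };
        List<String> replicas = List.of("osd1.ext-cluster", "osd1.int-cluster");
        System.out.println(apply(List.of(fqdnSort), replicas)); // int-cluster OSD first
    }
}
```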

[Figure: node1.ext-cluster calls open(); the MRC returns the replica list sorted so that osd1.ext-cluster is preferred over osd1.int-cluster]

Security

– X.509 certificate support for authentication
– SSL to encrypt communication

[Figure: a laptop on 3G/Wi-Fi runs mount.xtreemfs with a user certificate; the external cluster nodes run mount.xtreemfs with a host certificate]

Use case: Mosgrid

Mosgrid:

– eases running experiments in computational chemistry
– grid resources are used through a web portal
– the portal allows users to submit compute jobs and retrieve the results
– XtreemFS serves as the global data repository

Use case: Mosgrid (2)

[Figure: the user's browser talks to the web portal to submit jobs and retrieve results; a Unicore frontend runs the jobs on cluster nodes; input data and results live in XtreemFS, whose scope spans Berlin, Dresden, and Köln; the portal accesses it via libxtreemfs (Java), the cluster nodes via mount.xtreemfs with a host certificate, and the PC via mount.xtreemfs with a user certificate]

Snapshots

Backups are needed in case of:

– accidental deletion/modification
– virus infections

Snapshot: a stable image of the file system at a given point in time.

[Figure: unlink("important.cpp") accidentally deletes a file that existed on the internal cluster storage and the local file server]

Snapshots (2)

– MRC: creates a snapshot on request
– OSDs: copy-on-write
  – on modify: a new object version is created instead of overwriting
  – on delete: the object is only marked as deleted
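A minimal sketch of copy-on-write versioning for a single object: writes add timestamped versions, deletes add a marker, and a snapshot reads the latest version at or before its timestamp. The data structures are illustrative, not the OSD implementation.

```java
// Sketch of copy-on-write object versioning on an OSD.
import java.util.*;

public class CowSketch {
    // version timestamp -> object content; null content = deletion marker
    private final NavigableMap<Long, byte[]> versions = new TreeMap<>();

    void write(long timestamp, byte[] content) { versions.put(timestamp, content); }
    void delete(long timestamp) { versions.put(timestamp, null); } // only mark as deleted

    // Read the object as it existed at snapshot time t: latest version <= t.
    byte[] readAt(long t) {
        Map.Entry<Long, byte[]> e = versions.floorEntry(t);
        return e == null ? null : e.getValue();
    }

    public static void main(String[] args) {
        CowSketch obj = new CowSketch();
        obj.write(1, "v1".getBytes());
        obj.write(5, "v2".getBytes());
        obj.delete(9);
        System.out.println(new String(obj.readAt(6))); // snapshot at t=6 sees "v2"
        System.out.println(obj.readAt(10));            // after the delete marker: null
    }
}
```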

[Figure: timeline: snapshot() is taken at t0; a write("file.txt") produces version V1 at t1 and a second write produces V2 at t2]

Snapshots (3)

– there is no exact global time; clocks are only loosely synchronized
– assumption: a maximum clock drift of ε
– therefore: time span-based snapshots

[Figure: the same timeline with the uncertainty interval [t0 - ε, t0 + ε] around the snapshot; a write stamped inside this interval cannot be ordered unambiguously relative to the snapshot]
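The resulting rule fits in a few lines: a version stamped t is definitely before a snapshot taken at t0 only if t < t0 - ε, and definitely after it only if t > t0 + ε; everything in between is ambiguous and needs special handling. A sketch:

```java
// Sketch of the time-span rule under a maximum clock drift of epsilon.
public class SnapshotSpanSketch {
    enum Relation { BEFORE_SNAPSHOT, AMBIGUOUS, AFTER_SNAPSHOT }

    static Relation classify(long t, long t0, long epsilon) {
        if (t < t0 - epsilon) return Relation.BEFORE_SNAPSHOT; // safely included
        if (t > t0 + epsilon) return Relation.AFTER_SNAPSHOT;  // safely excluded
        return Relation.AMBIGUOUS;                             // within the drift window
    }

    public static void main(String[] args) {
        long t0 = 1_000, eps = 50;
        System.out.println(classify(900, t0, eps));   // BEFORE_SNAPSHOT
        System.out.println(classify(1_020, t0, eps)); // AMBIGUOUS
        System.out.println(classify(1_100, t0, eps)); // AFTER_SNAPSHOT
    }
}
```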

Snapshots (4)

– OSDs limit the number of versions: not a version on every write
– instead: close-to-open semantics
  – problem: clients send no explicit close
  – implicit close: a new version is created if the last write was at least X seconds ago
– cleanup tool: deletes versions that belong to no snapshot
– snapshots are also possible at the directory level
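A minimal sketch of the implicit-close rule: a write starts a new version only if the previous write is at least X seconds old, so bursts of writes collapse into a single version. The threshold X is an assumed value.

```java
// Sketch of implicit close: no explicit close arrives, so a sufficiently
// long write gap is treated as close-to-open and starts a new version.
public class ImplicitCloseSketch {
    static final long X_MILLIS = 30_000; // assumed idle threshold ("X seconds")

    private long lastWriteMillis = -X_MILLIS; // so the very first write starts a version
    private int versionCount = 0;

    void onWrite(long nowMillis) {
        if (nowMillis - lastWriteMillis >= X_MILLIS) versionCount++; // gap long enough
        lastWriteMillis = nowMillis;
    }

    public static void main(String[] args) {
        ImplicitCloseSketch osd = new ImplicitCloseSketch();
        osd.onWrite(0);      // first write: starts version 1
        osd.onWrite(1_000);  // 1 s later: same version
        osd.onWrite(40_000); // 39 s idle: starts version 2
        System.out.println(osd.versionCount + " versions"); // 2 versions
    }
}
```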

Future Research

– self-tuning
– quota support
– data de-duplication
– hierarchical storage management

XtreemFS Software

– open source: www.xtreemfs.org
– development:
  – 5 core developers at ZIB
  – integration tests for quality assurance
– community:
  – users and bug reporters
  – mailing list with 102 subscribers
– release 1.3: experimental support for read/write replication and snapshots

Thank You!

www.contrail-project.eu

The Contrail project is supported by funding under the Seventh Framework Programme of the European Commission: ICT, Internet of Services, Software and Virtualization. GA no.: FP7-ICT-257438.

References:

http://www.xtreemfs.org/publications.php