XtreemFS - a distributed and replicated cloud file system Michael Berlin Zuse Institute Berlin
DESY Computing Seminar, 16.05.2011
Who we are
– Zuse Institute Berlin
  – operates the HLRN supercomputer (#63+64)
  – research in Computer Science and Mathematics
– Parallel and Distributed Systems Group
  – led by Prof. Alexander Reinefeld (Humboldt University)
  – distributed and failure-tolerant storage systems
Who we are (2)
Michael Berlin
– PhD student since 03/2011
– studied computer science (Informatik) at Humboldt-Universität zu Berlin
– Diplom thesis dealt with XtreemFS
– currently working on the XtreemFS client
Motivation
Problem: multiple copies of data
– Where are they?
– Is each copy complete?
– Are there different versions?
[Figure: copies of the same data spread across a PC, a local file server, and the internal and external cluster storage used by the cluster nodes]
Motivation (2)
Problem: different access interfaces
[Figure: a PC, a laptop (via 3G/Wi-Fi), and external cluster nodes each reach the external cluster storage and the local file server through different interfaces: a parallel file system, NFS/Samba, VPN+?/SSHFS, and SCP]
Motivation (3)
XtreemFS goals:
– transparency
– availability
[Figure: PC, laptop (via 3G/Wi-Fi), and internal and external cluster nodes all access one XtreemFS installation]
Outline
1. XtreemFS Architecture
2. Client Interfaces
3. Read-Only Replication
4. Read-Write Replication
5. Metadata Replication
6. Customization through Policies
7. Security
8. Use Case: Mosgrid
9. Snapshots
XtreemFS Architecture (1)
A volume on a metadata server provides the hierarchical namespace.
File content lives on storage servers and is accessed directly by clients.
[Figure: a PC and internal cluster nodes accessing the local file server and the internal cluster storage]
XtreemFS Architecture (2)
Metadata and Replica Catalog (MRC):
– holds the volumes
Object Storage Devices (OSDs):
– file content is split into objects
– objects can be striped across OSDs
This is an object-based file system architecture.
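The object split and striping can be made concrete with a small sketch of how a byte offset maps to an object and an OSD. This is a minimal, hypothetical example; the 128 KiB stripe size and the round-robin (RAID0-style) placement are illustrative assumptions, not the exact XtreemFS striping policy:

```java
/** Sketch of RAID0-style striping: maps a byte offset to (object, OSD).
 *  Stripe size and round-robin placement are illustrative assumptions. */
public final class StripingSketch {
    private final int stripeSizeBytes; // size of one object, e.g. 128 KiB
    private final int numOsds;         // number of OSDs the file is striped across

    public StripingSketch(int stripeSizeBytes, int numOsds) {
        this.stripeSizeBytes = stripeSizeBytes;
        this.numOsds = numOsds;
    }

    /** Global object number containing the given file offset. */
    public long objectNumber(long offset) {
        return offset / stripeSizeBytes;
    }

    /** Index of the OSD storing that object (round-robin). */
    public int osdIndex(long offset) {
        return (int) (objectNumber(offset) % numOsds);
    }

    public static void main(String[] args) {
        StripingSketch s = new StripingSketch(128 * 1024, 4);
        long offset = 1_000_000L;
        System.out.printf("offset %d -> object %d on OSD %d%n",
                offset, s.objectNumber(offset), s.osdIndex(offset));
    }
}
```

Because consecutive objects land on different OSDs, a large sequential read can be served by all OSDs in parallel, which is what makes the I/O throughput scale (next slide).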
Scalability
File I/O throughput:
– parallel I/O scales with the number of OSDs
Storage capacity:
– OSDs can be added and removed
– OSDs may be used by multiple volumes
Metadata throughput:
– limited by the MRC hardware
– remedy: use many volumes spread over multiple MRCs
Accessing Components
Directory Service (DIR):
– central registry; all servers (MRC, OSD) register there with their ID
– provides:
  – the list of available volumes
  – the mapping from a service's ID to its URL
  – the list of available OSDs
Client Interfaces
XtreemFS supports the POSIX interface and semantics.
– mount.xtreemfs: uses FUSE; runs on Linux, FreeBSD, OS X, and Windows (via Dokan)
– libxtreemfs for Java and C++ (see the sketch below)
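For programmatic access, the typical call sequence is: open a volume, open a file, read or write, close. The sketch below shows that sequence only; all types and method names are hypothetical stand-ins, not the verified libxtreemfs Java API, and the volume and file names are made up:

```java
import java.io.IOException;

// Hypothetical stand-ins for a libxtreemfs-style API; these are NOT the
// real libxtreemfs classes, only a sketch of the usual call sequence.
interface FileHandle {
    int read(byte[] buf, long offset) throws IOException;
    void close() throws IOException;
}
interface Volume {
    FileHandle open(String path, boolean readOnly) throws IOException;
}
interface Client {
    Volume openVolume(String name) throws IOException;
    void shutdown();
}

public class ClientUsageSketch {
    /** Open a volume and a file, read the first block, and clean up. */
    static void printFirstBlock(Client client) throws IOException {
        Volume volume = client.openVolume("myVolume");          // assumed name
        FileHandle file = volume.open("/data/input.txt", true); // assumed path
        byte[] buf = new byte[64 * 1024];
        int n = file.read(buf, 0);
        System.out.printf("read %d bytes%n", n);
        file.close();
        client.shutdown();
    }
}
```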
[Figure: a PC, a laptop (via 3G/Wi-Fi), and internal and external cluster nodes each run mount.xtreemfs against the same XtreemFS installation]
Read-Only Replication
Requirement: the file must be marked read-only.
Replica types:
a. full replica: requires a complete copy
b. partial replica: fills itself on demand; instantly ready to use
[Figure: read-only replicas on the internal and external cluster storage, accessed by the external cluster nodes]
Read-Only Replication (3)
Receiver-initiated transfer at the object level: OSDs exchange object lists.
– filling strategies: fetch objects in order, or rarest-first (sketched below)
– prefetching is available
– on-close replication: automatic replica creation
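A sketch of how a rarest-first filling strategy could pick the next object to fetch from the object lists exchanged by the OSDs. The data layout and names are assumptions for illustration, not XtreemFS's actual data structures:

```java
import java.util.*;

public class RarestFirstSketch {
    /** objectLists.get(i) = object numbers that remote OSD i already holds;
     *  locallyMissing = object numbers this partial replica still lacks. */
    static OptionalLong nextObjectToFetch(List<Set<Long>> objectLists,
                                          Set<Long> locallyMissing) {
        Map<Long, Integer> holders = new HashMap<>();
        for (long obj : locallyMissing) holders.put(obj, 0);
        for (Set<Long> remote : objectLists)
            for (long obj : remote)
                holders.computeIfPresent(obj, (k, v) -> v + 1);
        // Among the missing objects that at least one OSD can serve,
        // pick the one with the fewest holders (the "rarest").
        return holders.entrySet().stream()
                .filter(e -> e.getValue() > 0)
                .min(Map.Entry.comparingByValue())
                .map(e -> OptionalLong.of(e.getKey()))
                .orElse(OptionalLong.empty());
    }

    public static void main(String[] args) {
        List<Set<Long>> lists = List.of(Set.of(0L, 1L, 2L), Set.of(1L, 2L), Set.of(2L));
        System.out.println(nextObjectToFetch(lists, new HashSet<>(Set.of(0L, 1L, 2L))));
        // OptionalLong[0]: object 0 is held by only one OSD, so fetch it first
    }
}
```

Fetching in order is the simpler strategy and suits streaming access; rarest-first spreads load when many partial replicas fill themselves at the same time.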
Read-Write Replication
Goals:
– availability
– data safety
– allow modifications
[Figure: important.cpp replicated on the local file server and the internal cluster storage, modified from the PC]
Read-Write Replication (3+4)
Primary/backup:
1. Lease acquisition
   – at most one valid lease per file
   – revocation = lease timeout
2. Data dissemination
Read-Write Replication (5)
Lease acquisition:
– XtreemFS uses Flease: scalable and majority-based
Data dissemination, update strategies:
– write all, read 1
– write quorum, read quorum (see the sketch below)
[Figure: Flease compared to a central lock service]
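The two strategies follow the usual quorum-intersection rule: with N replicas, a write quorum of size W and a read quorum of size R must satisfy W + R > N so that every read overlaps the latest write; "write all, read 1" is the special case W = N, R = 1. A minimal check, with illustrative values:

```java
/** Quorum-intersection check for update strategies on N replicas.
 *  W + R > N guarantees every read quorum overlaps the latest write quorum. */
public class QuorumSketch {
    static boolean quorumsIntersect(int n, int writeQuorum, int readQuorum) {
        return writeQuorum + readQuorum > n;
    }

    public static void main(String[] args) {
        int n = 5;
        System.out.println(quorumsIntersect(n, 5, 1)); // write all, read 1 -> true
        System.out.println(quorumsIntersect(n, 3, 3)); // majority quorums  -> true
        System.out.println(quorumsIntersect(n, 3, 2)); // too weak          -> false
    }
}
```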
Metadata Replication
Primary/backup replication:
– a volume is a database, so the database is replicated transparently
– leases are used to elect the primary
– insert/update/delete operations are replicated
The database is a key/value store; XtreemFS uses its own implementation, BabuDB.
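Replicating insert/update/delete can be pictured as the primary applying each operation to its local key/value store and forwarding the same operation to the backups. This sketch mirrors the idea only; it is not BabuDB's actual replication protocol, and all names are illustrative:

```java
import java.util.*;

public class KvReplicationSketch {
    record Op(String type, String key, String value) {} // "put" or "delete"

    static class Replica {
        final Map<String, String> store = new TreeMap<>();
        void apply(Op op) {
            if (op.type().equals("put")) store.put(op.key(), op.value());
            else store.remove(op.key());
        }
    }

    static class Primary extends Replica {
        final List<Replica> backups = new ArrayList<>();
        void replicate(Op op) {
            apply(op);                         // apply locally first
            backups.forEach(b -> b.apply(op)); // then forward to all backups
        }
    }

    public static void main(String[] args) {
        Primary primary = new Primary();
        primary.backups.add(new Replica());
        primary.replicate(new Op("put", "/volume/file.txt", "inode=42"));
        primary.replicate(new Op("delete", "/volume/file.txt", null));
        System.out.println(primary.store); // {} on primary and backup alike
    }
}
```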
Customization through Policies
Example: which replica shall the client select? This is determined by policies.
Policies exist for:
– authentication
– authorization
– UID/GID mappings
– replica placement
– replica selection
[Figure: a client unsure which replica to pick from the internal and external cluster storage]
Customization through Policies (2)
Replica placement/selection policies:
– filter / sort / group the replica list
– available default policies: FQDN-based, datacenter map, Vivaldi (latency estimation)
– policies can be chained
– custom policies are possible (in Java); see the sketch below
[Figure: node1.ext-cluster calls open(); the MRC returns the replica list sorted so that osd1.ext-cluster precedes osd1.int-cluster]
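What a custom policy "in Java" might look like is sketched below. The interface is an illustrative stand-in, not XtreemFS's real policy plugin interface:

```java
import java.util.Comparator;
import java.util.List;

/** Illustrative stand-in for a replica-selection policy plugin. */
interface ReplicaSelectionPolicy {
    /** Filters/sorts the replica list for one client; policies chain by
     *  applying them one after another. */
    List<String> apply(List<String> replicaUrls, String clientFqdn);
}

/** FQDN-based sorting: replicas sharing the longest domain suffix with
 *  the client come first. */
class FqdnPolicy implements ReplicaSelectionPolicy {
    @Override
    public List<String> apply(List<String> replicaUrls, String clientFqdn) {
        return replicaUrls.stream()
                .sorted(Comparator.comparingInt(
                        (String url) -> -commonSuffixLength(url, clientFqdn)))
                .toList();
    }

    static int commonSuffixLength(String a, String b) {
        int i = a.length(), j = b.length(), len = 0;
        while (i > 0 && j > 0 && a.charAt(--i) == b.charAt(--j)) len++;
        return len;
    }
}
```

For a client FQDN of node1.ext-cluster, osd1.ext-cluster shares the longer suffix and therefore sorts before osd1.int-cluster, matching the figure above.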
Security
– X.509 certificate support for authentication
– SSL to encrypt communication
[Figure: a laptop (via 3G/Wi-Fi) runs mount.xtreemfs with a user certificate; external cluster nodes run mount.xtreemfs with a host certificate]
Use case: Mosgrid
Mosgrid:
– eases running experiments in computational chemistry
– uses grid resources through a web portal
– the portal allows users to submit compute jobs and retrieve the results
XtreemFS serves as the global data repository.
Use case: Mosgrid (2)
[Figure: a browser talks to the web portal, which uses libxtreemfs (Java); the portal's Unicore frontend submits jobs to cluster nodes and retrieves results; input data and results live in XtreemFS, mounted with user certificates on PCs and host certificates on cluster nodes in Berlin, Dresden, and Köln]
Snapshots
Backups are needed in case of:
– accidental deletion or modification
– virus infections
A snapshot is a stable image of the file system at a given point in time.
[Figure: unlink("important.cpp") on the PC; copies of important.cpp on the local file server and the internal cluster storage]
Snapshots (2)
MRC: creates a snapshot when requested.
OSDs: copy-on-write
– on modify: create a new object version instead of overwriting
– on delete: only mark the object as deleted
[Timeline: snapshot() at t0; write("file.txt") creates version V1 at t1, a second write creates V2 at t2]
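A minimal sketch of copy-on-write versioning for a single object on an OSD (structure and names are assumptions): each write creates a new version keyed by its timestamp, a delete only records a tombstone, and a snapshot taken at time t sees the newest version at or before t:

```java
import java.util.*;

public class CowObjectSketch {
    record Version(long timestamp, byte[] data, boolean deleted) {}

    private final NavigableMap<Long, Version> versions = new TreeMap<>();

    /** On modify: add a new version instead of overwriting. */
    void write(long timestamp, byte[] data) {
        versions.put(timestamp, new Version(timestamp, data.clone(), false));
    }

    /** On delete: only mark as deleted (tombstone). */
    void delete(long timestamp) {
        versions.put(timestamp, new Version(timestamp, new byte[0], true));
    }

    /** Version visible to a snapshot taken at snapshotTime. */
    Optional<Version> versionAt(long snapshotTime) {
        Map.Entry<Long, Version> e = versions.floorEntry(snapshotTime);
        if (e == null || e.getValue().deleted()) return Optional.empty();
        return Optional.of(e.getValue());
    }
}
```

In the timeline above, versionAt(t0) sees nothing yet, while versionAt(t1) returns V1 and versionAt(t2) returns V2.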
Snapshots (3)
There is no exact global time: clocks are only loosely synchronized.
– assumption: maximum drift ε
– therefore: time span-based snapshots
[Timeline: a write("file.txt") whose timestamp falls into the interval [t0 − ε, t0 + ε] around snapshot() cannot be ordered unambiguously with respect to the snapshot]
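With a maximum drift of ε, a version's membership in a snapshot taken at t0 can only be decided outside the span [t0 − ε, t0 + ε]; inside it, the ordering is ambiguous and needs a tie-breaking rule. A sketch of that three-way classification (the enum and names are illustrative assumptions):

```java
/** Time span-based snapshot membership under loosely synchronized clocks. */
public class SnapshotSpanSketch {
    enum Membership { INCLUDED, EXCLUDED, AMBIGUOUS }

    static Membership classify(long versionTime, long snapshotTime, long epsilon) {
        if (versionTime < snapshotTime - epsilon) return Membership.INCLUDED; // certainly before
        if (versionTime > snapshotTime + epsilon) return Membership.EXCLUDED; // certainly after
        return Membership.AMBIGUOUS; // within [t0 - ε, t0 + ε]: order unknown
    }
}
```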
Snapshots (4)
OSDs limit the number of versions:
– not version-on-every-write; instead, close-to-open semantics
– problem: the client sends no explicit close
– implicit close: create a new version if the last write was at least X seconds ago (sketched below)
Cleanup tool: deletes versions that belong to no snapshot.
Snapshots on the directory level are also possible.
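The implicit-close heuristic can be sketched as follows; the threshold constant stands in for the slide's "X seconds" and its value is an assumption:

```java
/** Implicit close: without explicit close() calls, start a new object
 *  version only when the previous write is at least IDLE_SECONDS old. */
public class ImplicitCloseSketch {
    static final long IDLE_SECONDS = 30; // stands in for the slide's "X"

    private long lastWriteEpochSeconds = Long.MIN_VALUE; // no write seen yet

    /** Returns true if this write should start a new object version. */
    boolean onWrite(long nowEpochSeconds) {
        boolean newVersion = lastWriteEpochSeconds == Long.MIN_VALUE
                || nowEpochSeconds - lastWriteEpochSeconds >= IDLE_SECONDS;
        lastWriteEpochSeconds = nowEpochSeconds;
        return newVersion;
    }
}
```

This bounds the number of versions per file to roughly one per write burst instead of one per write.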
XtreemFS Software
Open source: www.xtreemfs.org
Development:
– 5 core developers at ZIB
– integration tests for quality assurance
Community:
– users and bug reporters
– mailing list with 102 subscribers
Release 1.3: experimental support for read/write replication and snapshots.
Thank You!
www.contrail-project.eu
The Contrail project is supported by funding under the Seventh Framework Programme of the European Commission: ICT, Internet of Services, Software and Virtualization. Grant agreement no.: FP7-ICT-257438.
References:
http://www.xtreemfs.org/publications.php