XtreemFS - A Cloud File System
TRANSCRIPT
-
XtreemFS - A Cloud File System
Michael Berlin, Zuse Institute Berlin
Contrail Summer School, Almere, 24.07.2012
Funded under: FP7 (Seventh Framework Programme)
Area: Internet of Services, Software & virtualization (ICT-2009.1.2)
Project reference: 257438
-
Motivation: Cloud Storage / Cloud File System
Cloud Storage Requirements
highly available
scalable
elastic: add and remove capacity
suitable for wide area networks
Support for legacy applications
POSIX-compatible file system required
Google for "cloud file system": www.XtreemFS.org
-
Outline
XtreemFS Architecture
Replication in XtreemFS
Read-Only File Replication
Read/Write File Replication
Custom Replica Placement and Selection
Metadata Replication
XtreemFS Use Cases
XtreemFS and OpenNebula
-
XtreemFS - A Cloud File System
History
2006 initial development in XtreemOS project
2010 further development in Contrail project
2012 August: Release 1.3.2
Features
Distributed File System
POSIX compatible
Replication
X.509 Certificates and SSL Support
Software
Open source: www.xtreemfs.org
Client software (C++) runs on Linux & OS X (Fuse), Windows (Dokan)
Server software (Java)
-
XtreemFS Architecture
Metadata and Replica Catalog (MRC):
stores metadata per volume
Object Storage Devices (OSDs):
directly accessed by clients
file content split into objects (see the sketch below)
Separation of Metadata and File Content:
object-based file system
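Since file content is split into objects that are striped across OSDs, every read or write at a byte offset is first mapped to an object number and the OSD holding it. A minimal sketch of that mapping, assuming simple round-robin striping and an illustrative 128 kB stripe size; the names are not from the XtreemFS code base:

```java
// Sketch: map a byte offset to an object number and an OSD index under
// round-robin striping. Class name, method names and the 128 kB stripe
// size are illustrative, not taken from the XtreemFS code base.
public class StripingSketch {
    static final int STRIPE_SIZE = 128 * 1024; // assumed bytes per object

    /** Number of the object that contains the given byte offset. */
    static long objectNumber(long offset) {
        return offset / STRIPE_SIZE;
    }

    /** Index of the OSD holding that object when objects are placed round-robin. */
    static int osdIndex(long offset, int numOsds) {
        return (int) (objectNumber(offset) % numOsds);
    }

    public static void main(String[] args) {
        long offset = 5L * 1024 * 1024; // a read at 5 MiB
        System.out.println("offset " + offset + " -> object " + objectNumber(offset)
                + " on OSD " + osdIndex(offset, 3));
    }
}
```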
-
Read-Only Replication (1)
Only for write-once files
File must be marked as read-only
done automatically after close()
Use Case: CDN
Replica Types:
1. Full replicas
complete copy, fills itself as fast as possible
2. Partial replicas: initially empty
on-demand fetching of missing objects (see the sketch below)
P2P-like efficient transfer between all replicas
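A partial replica starts out empty and fills itself only when objects are actually read. A minimal sketch of that on-demand behavior, assuming a local object cache and a remote-fetch interface; all class and method names are hypothetical:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of a partial replica: objects are fetched from other replicas
// only when first read. All class and method names are hypothetical.
public class PartialReplicaSketch {
    private final Map<Long, byte[]> localObjects = new HashMap<>();
    private final List<RemoteReplica> otherReplicas;

    public PartialReplicaSketch(List<RemoteReplica> otherReplicas) {
        this.otherReplicas = otherReplicas;
    }

    /** Serve an object, fetching it from another replica on a local miss. */
    public byte[] readObject(long objectNo) {
        return localObjects.computeIfAbsent(objectNo, no -> {
            for (RemoteReplica r : otherReplicas) {
                byte[] data = r.fetchObject(no); // may come from a full or partial replica
                if (data != null) {
                    return data;                 // cache locally for later reads
                }
            }
            throw new IllegalStateException("object " + no + " not found on any replica");
        });
    }

    /** Stand-in for the RPC used to fetch an object from another replica. */
    public interface RemoteReplica {
        byte[] fetchObject(long objectNo);
    }
}
```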
-
Read-Only Replication (2)
-
Read/Write Replication (1)
Primary/backup scheme
POSIX requires a total order of update operations, hence primary/backup
Primary fail-over?
Leases
grant access to a resource (here: the primary role) for a predefined period of time
Failover after timeout possible
Assumption: loosely synchronized clocks with a bounded maximum drift (sketched below)
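Because clocks are only loosely synchronized, the lease holder has to stop acting as primary a little before the nominal expiry, while the other replicas wait until a little after it, so both sides stay safe despite drift. A minimal sketch of these checks; the timeout and the drift bound are illustrative values, not XtreemFS defaults:

```java
// Sketch: a time-bounded lease for the primary role. The lease timeout
// and the drift bound (epsilon) are illustrative values.
public class LeaseSketch {
    static final long LEASE_TIMEOUT_MS = 15_000; // assumed lease validity period
    static final long MAX_DRIFT_MS = 500;        // assumed bound on clock drift

    final String holder;    // replica currently acting as primary
    final long expiresAtMs; // local wall-clock time at which the lease ends

    LeaseSketch(String holder, long grantedAtMs) {
        this.holder = holder;
        this.expiresAtMs = grantedAtMs + LEASE_TIMEOUT_MS;
    }

    /** The holder must stop acting as primary MAX_DRIFT_MS early. */
    boolean stillValidForHolder(long nowMs) {
        return nowMs < expiresAtMs - MAX_DRIFT_MS;
    }

    /** Other replicas may attempt a takeover only MAX_DRIFT_MS after the expiry. */
    boolean expiredForOthers(long nowMs) {
        return nowMs >= expiresAtMs + MAX_DRIFT_MS;
    }
}
```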
-
Read/Write Replication (2-4)
Replicated write() (see the sketch below):
1. Lease Acquisition
2. Data Dissemination
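Putting the two steps together, the primary first makes sure it holds the lease, then disseminates the update and acknowledges the write once a majority of the replicas (including itself) has applied it. A minimal sketch under those assumptions; the interfaces are hypothetical, not the XtreemFS RPC API:

```java
import java.util.List;

// Sketch of a replicated write(): 1. lease acquisition, 2. data dissemination.
// Interfaces and names are hypothetical.
public class ReplicatedWriteSketch {
    interface LeaseManager { boolean acquireOrRenewLease(String fileId); }
    interface Backup { boolean applyWrite(String fileId, long objectNo, byte[] data); }

    private final LeaseManager leases;
    private final List<Backup> backups;

    ReplicatedWriteSketch(LeaseManager leases, List<Backup> backups) {
        this.leases = leases;
        this.backups = backups;
    }

    /** Returns true once the write is stable on a majority of all replicas. */
    boolean write(String fileId, long objectNo, byte[] data) {
        // Step 1: lease acquisition - only the lease holder may order updates.
        if (!leases.acquireOrRenewLease(fileId)) {
            return false; // another replica is primary; the client must retry there
        }
        // Step 2: data dissemination - local copy plus acknowledgements from backups.
        int acks = 1; // the primary's own copy counts toward the majority
        for (Backup b : backups) {
            if (b.applyWrite(fileId, objectNo, data)) {
                acks++;
            }
        }
        int replicas = backups.size() + 1;
        return acks > replicas / 2; // acknowledge after a majority is updated
    }
}
```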
-
Read/Write Replication: Distributed Lease Acquisition with Flease
Flease
Failure tolerant: majority-based
Scalable: lease per file
Experiment:
ZooKeeper: 3 servers
Flease: 3 nodes (2 randomly selected)
(Graph: Flease vs. central lock service)
-
Read/Write Replication: Summary
High up-front costs (for first access to inactive file)
3+ round-trips
2 for Flease (lease acquisition)
1 for Replica Reset
plus further round-trips when fetching missing objects
Minimal cost for subsequent operations
Read: identical to non-replicated case
Write: latency increases by time to update majority of backups
Works at file-level: scales with # OSDs and # files
Flease: no I/O to stable storage needed for crash recovery
-
Custom Replica Placement and Selection
Policies
filter and sort available OSDs/replicas
evaluate client information (IP address/hostname, estimated latency)
create file on an OSD close to the client
access the closest replica
Available default policies:
Server ID
DNS
Datacenter Map
Vivaldi
Own policies possible (Java); see the sketch below
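The following sketch illustrates the idea of such a policy in Java: filter out OSDs without enough free capacity, then sort the remaining candidates by estimated latency to the client. The types and the free-space threshold are hypothetical; the actual XtreemFS policy interface differs:

```java
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

// Sketch of a custom placement/selection policy: keep only OSDs with enough
// free space and sort them by estimated latency to the client.
// Types and threshold are hypothetical; the real XtreemFS interface differs.
public class ClosestOsdPolicySketch {
    record Osd(String uuid, long freeBytes, double estimatedLatencyMs) {}

    static final long MIN_FREE_BYTES = 10L * 1024 * 1024 * 1024; // illustrative threshold

    /** Filter and sort the candidate OSDs for one client request. */
    static List<Osd> select(List<Osd> candidates) {
        return candidates.stream()
                .filter(osd -> osd.freeBytes() >= MIN_FREE_BYTES)
                .sorted(Comparator.comparingDouble(Osd::estimatedLatencyMs))
                .collect(Collectors.toList());
    }
}
```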
-
Replica Placement/Selection: Vivaldi Visualization
-
Metadata Replication
Replication at the database level: same approach as file R/W replication
Loosen consistency
allow stale reads
All services replicated
No single point of failure
-
XtreemFS Use Cases
Storage of VM images for IaaS solutions (OpenNebula, ...)
Storage-as-a-Service: Volumes per User
XtreemFS as HDFS replacement in Hadoop
XtreemFS in ConPaaS: storage on demand for other services
-
XtreemFS and OpenNebula (1)
Use Case: VM images in OpenNebula cluster
without a distributed file system: scp VM images to hosts
with a distributed file system: shared storage, available on all nodes
Support for live migration
Fault-tolerant storage of VM images
Resume VM on another node after a crash
Use XtreemFS Read/Write file replication
-
XtreemFS and OpenNebula (2)
VM deployment
Create copy (clone) of original VM image
Run cloned VM image at scheduled host
(Discard cloned image after VM shutdown)
Problems
1. cloning time-consuming
2. waste of space
3. increasing total boot time when starting multiple VMs (e.g., the ConPaaS image)
-
XtreemFS and OpenNebula: qcow2 + Replication
qcow2 VM image format allows snapshots
1. immutable backing file
2. mutable, initially empty snapshot file
Instead of cloning, snapshot the original VM image (< 1 second); see the sketch below
Use Read/Write replication for snapshot file
Problem left: run multiple VMs simultaneously
snapshot file: R/W replication scales with # OSDs and # files
backing file: bottleneck
use Read-Only Replication for the backing file
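One way to create such a snapshot is a qcow2 overlay that references the original image as its immutable backing file; this takes well under a second because no data is copied. A sketch of a deployment step doing this via qemu-img from Java; the paths are placeholders, and the assumptions that the deployment scripts call qemu-img directly and that the backing image itself is qcow2 are illustrative:

```java
import java.io.IOException;

// Sketch: create a qcow2 overlay ("snapshot file") on top of an immutable
// backing file instead of cloning the image. Paths are placeholders and the
// backing image is assumed to be qcow2 itself.
public class Qcow2SnapshotSketch {
    public static void main(String[] args) throws IOException, InterruptedException {
        String backingFile = "/xtreemfs/images/conpaas-base.qcow2"; // read-only replicated
        String overlayFile = "/xtreemfs/instances/vm-001.qcow2";    // read/write replicated

        Process p = new ProcessBuilder(
                "qemu-img", "create",
                "-f", "qcow2",
                "-b", backingFile,
                "-F", "qcow2",
                overlayFile)
                .inheritIO()
                .start();
        if (p.waitFor() != 0) {
            throw new IOException("qemu-img create failed");
        }
    }
}
```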
-
XtreemFS and OpenNebula: Benchmark (1)
OpenNebula Test Cluster: Frontend + 30 Worker nodes
Gigabit Ethernet (100 MB/s)
SATA disk (70 MB/s)
Setup
Frontend
MRC
OSD (has the ConPaaS VM image)
Each worker node
OSD
XtreemFS Fuse client
OpenNebula node
Replica Placement + Replica Selection: prefer local OSD/replica
-
XtreemFS and OpenNebula: Benchmark (2)
Setup                                        Total Boot Time
copy (1.6 GB image file)                     82 seconds (69 seconds for the copy)
qcow2, 1 VM                                  13.6 seconds
qcow2, 30 VMs                                20.8 seconds
qcow2, 30 VMs, 30 partial replicas           142.8 seconds
  - second run                               20.1 seconds
  - after second run                         17.5 seconds
+ Read/Write Replication on snapshot file    19.5 seconds
Notes:
few read()s on image, no bottleneck yet
replication: object granularity vs. small reads/writes
-
Future Research & Work
Deduplication
Improved Elasticity
Fault Tolerance
Optimize Storage Cost
Erasure Codes
Self-*
Client Cache
Less POSIX: replace MRC with a scalable service
-
Funded under: FP7 (Seventh Framework Programme)
Area: Internet of Services, Software & virtualization (ICT-2009.1.2)
Project reference: 257438
Total cost: 11.29 million euro
EU contribution: 8.3 million euro
Execution: from 2010-10-01 till 2013-09-30
Duration: 36 months
Contract type: Collaborative project (generic)