Pisa
DESCRIPTION
Pisa is a decentralized block storage distribution and replication framework with the specific goal of simplifying the development of storage back-end services in a distributed environment. The main characteristics of the project are message security, a self-organizing cluster and simple setup. Pisa is a subproject of the RestFS project, and the talk explains the experience we acquired while developing this subcomponent and the decisions taken in the design of the framework.

TRANSCRIPT
Pisa Block Distribution and Replication Framework
Fabrizio Manfredi Furuholmen, Federico Mosca
Buzzwords 2014
Agenda
Introduction: Overview, Problem, Common Pattern
Implementation: Data Placement, Data Consistency, Cluster Coordination, Data Transmission
Block Storage Devices
Pisa is a simple block data distribution and replication framework that runs on a wide range of nodes.
[Diagram: data blocks, addressed as key → data[hash], are transferred from existing nodes to new nodes joining the cluster]
Build a solution
What is it?
RestFS is a highly scalable, highly available network object storage.
Five pillars
Objects
• Separation between data and metadata
• Each element is marked with a revision
• Each element is marked with a hash

Cache
• Client side
• Callback/Notify
• Persistent

Transmission
• Parallel operation
• HTTP-like protocol
• Compression
• Transfer by difference

Distribution
• Resource discovery by DNS
• Data spread on a multi-node cluster
• Decentralized
• Independent clusters
• Data replication

Security
• Secure connection
• Client-side encryption
• Extended ACL
• Delegation/Federation
• Admin delegation
RestFS Key Words
Cell: collection of servers
Bucket: virtual container, hosted by one or more servers
Object: entity (file, dir, …) contained in a Bucket
Object
An Object is made of Data and Metadata.
Data: split into segments (Block 1, Block 2, …, Block n); each block carries its own hash and serial.
Metadata: attributes set by the user, Properties, ACL, Extended Properties.
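As a rough illustration of this layout, the sketch below models an object as user metadata plus a list of hashed, serial-numbered blocks; the class and field names (Block, RestFSObject, put) are hypothetical, not RestFS's actual API.

import hashlib
from dataclasses import dataclass, field

@dataclass
class Block:
    serial: int                     # position of the segment inside the object
    data: bytes
    hash: str = ""                  # per-block hash used for integrity and diff transfer

    def __post_init__(self):
        self.hash = hashlib.sha1(self.data).hexdigest()

@dataclass
class RestFSObject:
    attributes: dict = field(default_factory=dict)   # attributes set by the user
    properties: dict = field(default_factory=dict)
    acl: dict = field(default_factory=dict)
    blocks: list = field(default_factory=list)       # the data segments

    def put(self, payload: bytes, block_size: int = 4096):
        # split the payload into fixed-size segments, each with its own serial and hash
        self.blocks = [Block(i, payload[off:off + block_size])
                       for i, off in enumerate(range(0, len(payload), block_size))]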
Main Goal …
Storage as Lego Brick
The infrastructure has to be inexpensive, with high scalability and reliability.
Problems
Main Problem
Main Problem
CAP theorem
According to Brewer's CAP theorem, it is impossible for a distributed computer system to simultaneously provide all three of Consistency, Availability and Partition tolerance.
You cannot have all three at the same time and still get acceptable latency.
CAP
ACID (RDBMS)
• Atomic: everything in a transaction succeeds or the entire transaction is rolled back.
• Consistent: a transaction cannot leave the database in an inconsistent state.
• Isolated: transactions cannot interfere with each other.
• Durable: completed transactions persist, even when servers restart.
• Strong consistency for transactions is the highest priority
• Pessimistic
• Complex mechanisms

BASE (NoSQL)
• Basic Availability, Soft state, Eventual consistency
• Availability and scaling are the highest priorities
• Weak consistency
• Optimistic
• Best effort
• Simple and FAST
First of all …
“Think as a child…”
Second …
“There is always a failure waiting around the corner”
* Werner Vogels
Data Distribution
Replication
Data Placement
Data Consistency
Cluster Coordination
Data Transmission
Data Placement
Better distribution = partitioning
Parallel operation = parallel streams / multi-core
Data Distribution: DHT
Distributed Hash Table
Blocks are distributed across partitions
Partitions are identified by a hash prefix
Partitions are hosted on servers
[Diagram: the partition id is the prefix of the key hash (e.g. 0000010000); a partition table maps partition id → node id, and a node table maps node id → node, which holds the objects]
Data Distribution
Zero Hop Hash (consistent hashing)
• Partition location with 0 hops
• 1% capacity added, roughly 1% of the data moved

Node: zone, weight
Partition: fixed-size array list
• Position = key prefix
• Value = node id
Shuffle: avoids sequential allocation

part_list = array('H')                # partition id -> node id
part_key_shift = 32 - part_exp        # bits to drop from the 32-bit key prefix
part_count = 2 ** part_exp
part_id = unpack_from('>I', sha1(data).digest())[0] >> part_key_shift
shuffle(part_list)                    # avoid sequential allocation

Example node: ip = 10.1.0.1, zone = 1, weight = 3.0, class = 1
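Putting the fragments above together, here is a minimal, self-contained sketch of such a ring: partitions are assigned to nodes in proportion to their weight, shuffled, and looked up from the first 32 bits of the key hash. The node list, weights and bit widths are illustrative assumptions, not Pisa's actual configuration.

from array import array
from hashlib import sha1
from random import shuffle
from struct import unpack_from

part_exp = 16                          # 2**16 partitions
part_count = 2 ** part_exp
part_key_shift = 32 - part_exp

nodes = [{'id': 0, 'ip': '10.1.0.1', 'zone': 1, 'weight': 3.0},
         {'id': 1, 'ip': '10.1.0.2', 'zone': 2, 'weight': 1.0}]

# assign partitions to nodes proportionally to their weight, then shuffle
# to avoid long sequential runs owned by the same node
total = sum(n['weight'] for n in nodes)
part_list = array('H')
for n in nodes:
    part_list.extend([n['id']] * int(round(part_count * n['weight'] / total)))
del part_list[part_count:]             # trim any rounding overflow
shuffle(part_list)

def partition_for(key: bytes) -> int:
    # first 32 bits of the SHA-1 digest, shifted down to the partition prefix
    return unpack_from('>I', sha1(key).digest())[0] >> part_key_shift

owner = nodes[part_list[partition_for(b'some-block-key')]]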
Data Placement
Vnode based / client based
Replication
Data Distribution
Proximity based
http://highlyscalable.wordpress.com/2012/09/18/distributed-algorithms-in-nosql-databases/
node_ids = [master_node]
zones = [self.nodes[node_ids[0]]]
for replica in xrange(1, replicas):
    # walk the partition list until a node not already holding a replica is found
    while self.part_list[part_id] in node_ids:
        part_id += 1
        if part_id >= len(self.part_list):
            part_id = 0
    node_ids.append(self.part_list[part_id])
return [self.nodes[n] for n in node_ids]
[Table: partition → server]
Partition 1 will also be placed on nodes 2 and 3; the master node is always the first in the list.
Data Consistency
http://highlyscalable.wordpress.com/2012/09/18/distributed-algorithms-in-nosql-databases/
To avoid a full ACID implementation while still guaranteeing consistency, some solutions leave the ownership of the algorithm to the client.
Data Consistency
http://highlyscalable.wordpress.com/2012/09/18/distributed-algorithms-in-nosql-databases/
Tunable trade-offs for distribution and replication (N, R, W)
The read operation is implemented with a hash check.
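A small sketch of how such tunable (N, R, W) settings might be enforced, with the usual rule that R + W > N gives overlapping quorums; the helper functions and the (revision, value, digest) response shape are assumptions for illustration, not Pisa's actual API.

import hashlib

N, W, R = 3, 2, 2              # replicas, write quorum, read quorum
assert R + W > N               # overlapping quorums: a read always sees the latest write

def write_ok(acks: int, w: int = W) -> bool:
    # a write is acknowledged to the client once at least W replicas stored it
    return acks >= w

def read_value(responses, r: int = R):
    # responses: list of (revision, value, digest) tuples coming from the replicas
    if len(responses) < r:
        raise IOError("read quorum not reached")
    rev, value, digest = max(responses, key=lambda t: t[0])   # newest revision wins
    if hashlib.sha1(value).hexdigest() != digest:             # the "hash check" on read
        raise IOError("hash mismatch: block corrupted on replica")
    return value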
Cluster Coordination
Cluster communication
Table distribution (routing table)
Failure detection
Join / leave of nodes in the cluster
Cluster Coordination
Epidemic (Gossip)
epidemic: anybody can infect anyone else with equal probability
Spreads in O(log n) rounds
http://www.cis.cornell.edu/IAI/events/Gossip_Tutorial.pdf
Periodic anti-entropy exchanges among nodes ensure that they eventually converge, even if updates are lost.
Arbitrary pairs of replicas periodically establish contact and resolve all differences between their databases.
Hashes reduce the volume of data exchanged in the common case.
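A toy version of such an anti-entropy round between two replicas, exchanging per-key hashes first and shipping only the entries that differ; the dict-based stores and the conflict handling are simplifications, not Pisa's wire protocol.

import hashlib

def digests(store: dict) -> dict:
    # hash every value so replicas can compare cheaply before moving data
    return {k: hashlib.sha1(v).hexdigest() for k, v in store.items()}

def anti_entropy(a: dict, b: dict) -> None:
    da, db = digests(a), digests(b)
    for key in set(da) | set(db):
        if da.get(key) == db.get(key):
            continue                       # identical, nothing to transfer
        if key in a and key not in b:
            b[key] = a[key]                # push the missing entry
        elif key in b and key not in a:
            a[key] = b[key]
        # if both hold diverging values, a real system would resolve the
        # conflict via revisions or vector clocks; skipped in this sketch

n1 = {'blk1': b'aaa', 'blk2': b'bbb'}
n2 = {'blk1': b'aaa'}
anti_entropy(n1, n2)                       # afterwards n2 also holds blk2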
Cluster Coordination
Table items (routing table)
• Node table list
• Partition-to-node list

Bootstrap
• DNS name or IP at startup
• DNS lookup (SRV)
• Multicast

Transfer type
• Complete transfer
• Resync by diff (Merkle tree), see the sketch at the end of this slide
• Notification for a single change

• Join node
• Leave node
• Partition owner
[Tables: partition → server, node id → object, and segment hash ranges (1-100, 101-200, …)]
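The resync-by-diff idea above can be sketched as a two-level comparison: hash whole segment ranges first and descend into a range only when its hash differs, a flattened stand-in for a real Merkle tree. The store layout and range sizes are assumptions.

import hashlib

def range_hash(store: dict, keys) -> str:
    # one hash per range of segments, the coarse level of the tree
    h = hashlib.sha1()
    for k in keys:
        h.update(store.get(k, b''))
    return h.hexdigest()

def resync(src: dict, dst: dict, ranges) -> None:
    for keys in ranges:                          # e.g. segments 1-100, 101-200, ...
        if range_hash(src, keys) == range_hash(dst, keys):
            continue                             # whole range already in sync
        for k in keys:                           # only now compare segment by segment
            if k in src and dst.get(k) != src[k]:
                dst[k] = src[k]

ranges = [range(1, 101), range(101, 201)]
src = {i: b'x' for i in range(1, 201)}
dst = {i: b'x' for i in range(1, 150)}
resync(src, dst, ranges)                         # copies only the missing segments 150-200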
Cluster Coordination
Join of a new node Z (with existing nodes X and Y and a client):
1. Z bootstraps against node X; X notifies the cluster of the new node.
2. Z claims partition x; the partition table is updated (part x → Z).
3. The table change is propagated via gossip notification; node Y accepts it.
4. The client requests partition x from the old owner and gets back the new owner.
5. The client requests partition x from Z, which returns the data.
In case the data is not yet present on the new node, the new node acts as a proxy (lazy transfer).
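The lazy-transfer behaviour of the new owner could look roughly like this: serve from the local store when the block is already there, otherwise fetch it from the previous owner, cache it locally and return it. The class and the previous_owner.get() call are hypothetical.

class NewOwner:
    def __init__(self, local_store: dict, previous_owner):
        self.local = local_store            # key -> block data already migrated
        self.previous = previous_owner      # old owner, exposing get(key) (hypothetical)

    def get(self, key):
        if key in self.local:
            return self.local[key]          # block already transferred
        data = self.previous.get(key)       # act as a proxy toward the old owner
        if data is not None:
            self.local[key] = data          # lazily populate the local partition
        return data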
Transport Protocol
ZeroMQ and MessagePack (RPC)
Cluster Communications
Client Data transfer
Partition replication/Relocation
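To give an idea of the shape of this layer, here is a minimal request/reply exchange with pyzmq and msgpack; the endpoint, the 'get_block' opcode and the message fields are illustrative assumptions, not Pisa's actual RPC schema.

import msgpack
import zmq

ENDPOINT = "tcp://127.0.0.1:5555"          # illustrative address

def serve_once():
    # reply side: unpack one request, answer with a msgpack-encoded reply
    rep = zmq.Context.instance().socket(zmq.REP)
    rep.bind(ENDPOINT)
    request = msgpack.unpackb(rep.recv(), raw=False)
    reply = {'status': 'ok', 'op': request.get('op'), 'data': b'...'}
    rep.send(msgpack.packb(reply, use_bin_type=True))

def get_block(key: str):
    # request side: send an opcode plus arguments, wait for the reply
    req = zmq.Context.instance().socket(zmq.REQ)
    req.connect(ENDPOINT)
    req.send(msgpack.packb({'op': 'get_block', 'key': key}, use_bin_type=True))
    return msgpack.unpackb(req.recv(), raw=False)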
Status
Eeeemmm… not really perfect …
Next
http://www.cs.rutgers.edu/~pxk/417/notes/23-lookup.html
Chord
Space based / multi-dimensional
New data distribution model: Chord / cluster node
Vector clocks
Rebalance, partition handover (weight change)
Locking
WAN replication (async)
Config replication (pub/sub, event)
Server priority
…