Pisa
DESCRIPTION
Pisa is a decentralized block storage distribution and replication framework with the specific goal of simplifying the development of storage back-end services in a distributed environment. The main characteristics of the project are message security, a self-organizing cluster and simple setup. Pisa is a subproject of the RestFS project, and the talk explains the experience we acquired while developing this subcomponent and the decisions taken in the design of the framework.

TRANSCRIPT
Pisa Block Distribution and Replication Framework
Fabrizio Manfredi Furuholmen, Federico Mosca
Buzzwords 2014
Agenda
Introduction: Overview, Problem, Common Pattern
Implementation: Data Placement, Data Consistency, Cluster Coordination, Data Transmission
Block Storage Devices
Pisa is a simple block data distribution and replication framework that runs on a wide range of nodes.
[Diagram: data blocks, addressed as key → data[hash], are transferred from existing nodes to new nodes joining the cluster]
Build a solution
What is it?
RestFS is a highly scalable, highly available network object storage.
Five pillars
Objects
• Separation between data and metadata
• Each element is marked with a revision
• Each element is marked with a hash

Cache
• Client side
• Callback/Notify
• Persistent

Transmission
• Parallel operation
• HTTP-like protocol
• Compression
• Transfer by difference

Distribution
• Resource discovery by DNS
• Data spread on a multi-node cluster
• Decentralized
• Independent clusters
• Data replication

Security
• Secure connection
• Client-side encryption
• Extended ACL
• Delegation/Federation
• Admin delegation
RestFS Key Words
Cell: collection of servers
Bucket: virtual container, hosted by one or more servers
Object: entity (file, dir, …) contained in a Bucket
Object
An Object is made of Data and Metadata.
Data: split into segments (Block 1, Block 2, …, Block n); each block carries its own hash and serial.
Metadata: attributes set by the user, Properties, ACL, Extended Properties.
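As a rough illustration of this layout, the sketch below models an object as user metadata plus a list of hashed, serial-numbered blocks; the class and field names (Block, RestFSObject, put) are hypothetical, not RestFS's actual API.

import hashlib
from dataclasses import dataclass, field

@dataclass
class Block:
    serial: int                     # position of the segment inside the object
    data: bytes
    hash: str = ""                  # per-block hash used for integrity and diff transfer

    def __post_init__(self):
        self.hash = hashlib.sha1(self.data).hexdigest()

@dataclass
class RestFSObject:
    attributes: dict = field(default_factory=dict)   # attributes set by the user
    properties: dict = field(default_factory=dict)
    acl: dict = field(default_factory=dict)
    blocks: list = field(default_factory=list)       # the data segments

    def put(self, payload: bytes, block_size: int = 4096):
        # split the payload into fixed-size segments, each with its own serial and hash
        self.blocks = [Block(i, payload[off:off + block_size])
                       for i, off in enumerate(range(0, len(payload), block_size))]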
Main Goal …
Storage as Lego Brick
The infrastructure has to be inexpensive, with high scalability and reliability.
Problems
Main Problem
Main Problem
CAP theorem
According to Brewer's CAP theorem, it is impossible for a distributed computer system to simultaneously provide all three of Consistency, Availability and Partition tolerance.
You cannot have all three at the same time and still get acceptable latency.
CAP
ACID (RDBMS)
• Atomic: everything in a transaction succeeds or the entire transaction is rolled back.
• Consistent: a transaction cannot leave the database in an inconsistent state.
• Isolated: transactions cannot interfere with each other.
• Durable: completed transactions persist, even when servers restart.
• Strong consistency for transactions is the highest priority
• Pessimistic
• Complex mechanisms

BASE (NoSQL)
• Basic Availability, Soft state, Eventual consistency
• Availability and scaling are the highest priorities
• Weak consistency
• Optimistic
• Best effort
• Simple and FAST
First of all …
“Think as a child…”
Second …
“There is always a failure waiting around the corner”
* Werner Vogels
Data Distribution
Replication
Data Placement
Data Consistency
Cluster Coordination
Data Transmission
Data Placement
Better distribution = partitioning
Parallel operation = parallel streams / multi-core
Data Distribution: DHT
Distributed Hash Table
Blocks are distributed across partitions
Partitions are identified by a hash prefix
Partitions are hosted on servers
[Diagram: the partition id is the prefix of the key hash (e.g. 0000010000); a partition table maps partition id → node id, and a node table maps node id → node, which holds the objects]
Data Distribution
Zero Hop Hash (consistent hashing)
• Partition location with 0 hops
• 1% capacity added, roughly 1% of the data moved

Node: zone, weight
Partition: fixed-size array list
• Position = key prefix
• Value = node id
Shuffle: avoids sequential allocation

part_list = array('H')                # partition id -> node id
part_key_shift = 32 - part_exp        # bits to drop from the 32-bit key prefix
part_count = 2 ** part_exp
part_id = unpack_from('>I', sha1(data).digest())[0] >> part_key_shift
shuffle(part_list)                    # avoid sequential allocation

Example node: ip = 10.1.0.1, zone = 1, weight = 3.0, class = 1
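Putting the fragments above together, here is a minimal, self-contained sketch of such a ring: partitions are assigned to nodes in proportion to their weight, shuffled, and looked up from the first 32 bits of the key hash. The node list, weights and bit widths are illustrative assumptions, not Pisa's actual configuration.

from array import array
from hashlib import sha1
from random import shuffle
from struct import unpack_from

part_exp = 16                          # 2**16 partitions
part_count = 2 ** part_exp
part_key_shift = 32 - part_exp

nodes = [{'id': 0, 'ip': '10.1.0.1', 'zone': 1, 'weight': 3.0},
         {'id': 1, 'ip': '10.1.0.2', 'zone': 2, 'weight': 1.0}]

# assign partitions to nodes proportionally to their weight, then shuffle
# to avoid long sequential runs owned by the same node
total = sum(n['weight'] for n in nodes)
part_list = array('H')
for n in nodes:
    part_list.extend([n['id']] * int(round(part_count * n['weight'] / total)))
del part_list[part_count:]             # trim any rounding overflow
shuffle(part_list)

def partition_for(key: bytes) -> int:
    # first 32 bits of the SHA-1 digest, shifted down to the partition prefix
    return unpack_from('>I', sha1(key).digest())[0] >> part_key_shift

owner = nodes[part_list[partition_for(b'some-block-key')]]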
Data Placement
Vnode based / client based
Replication
Data Distribution
Proximity based
http://highlyscalable.wordpress.com/2012/09/18/distributed-algorithms-in-nosql-databases/
node_ids = [master_node]
zones = [self.nodes[node_ids[0]]]
for replica in xrange(1, replicas):
    # walk the partition list until a node not already holding a replica is found
    while self.part_list[part_id] in node_ids:
        part_id += 1
        if part_id >= len(self.part_list):
            part_id = 0
    node_ids.append(self.part_list[part_id])
return [self.nodes[n] for n in node_ids]
[Table: partition → server]
Partition 1 will also be placed on nodes 2 and 3; the master node is always the first in the list.
Data Consistency
http://highlyscalable.wordpress.com/2012/09/18/distributed-algorithms-in-nosql-databases/
To avoid a full ACID implementation while still guaranteeing consistency, some solutions leave the ownership of the algorithm to the client.
Data Consistency
http://highlyscalable.wordpress.com/2012/09/18/distributed-algorithms-in-nosql-databases/
Tunable trade-offs for distribution and replication (N, R, W)
The read operation is implemented with a hash check.
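A small sketch of how such tunable (N, R, W) settings might be enforced, with the usual rule that R + W > N gives overlapping quorums; the helper functions and the (revision, value, digest) response shape are assumptions for illustration, not Pisa's actual API.

import hashlib

N, W, R = 3, 2, 2              # replicas, write quorum, read quorum
assert R + W > N               # overlapping quorums: a read always sees the latest write

def write_ok(acks: int, w: int = W) -> bool:
    # a write is acknowledged to the client once at least W replicas stored it
    return acks >= w

def read_value(responses, r: int = R):
    # responses: list of (revision, value, digest) tuples coming from the replicas
    if len(responses) < r:
        raise IOError("read quorum not reached")
    rev, value, digest = max(responses, key=lambda t: t[0])   # newest revision wins
    if hashlib.sha1(value).hexdigest() != digest:             # the "hash check" on read
        raise IOError("hash mismatch: block corrupted on replica")
    return value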
Cluster Coordination
Cluster communication
Table distribution (routing table)
Failure detection
Join / leave of nodes in the cluster
Cluster Coordination
Epidemic (Gossip)
epidemic: anybody can infect anyone else with equal probability
Spreads in O(log n) rounds
http://www.cis.cornell.edu/IAI/events/Gossip_Tutorial.pdf
Periodic anti-entropy exchanges among nodes ensure that they eventually converge, even if updates are lost.
Arbitrary pairs of replicas periodically establish contact and resolve all differences between their databases.
Hashes reduce the volume of data exchanged in the common case.
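A toy version of such an anti-entropy round between two replicas, exchanging per-key hashes first and shipping only the entries that differ; the dict-based stores and the conflict handling are simplifications, not Pisa's wire protocol.

import hashlib

def digests(store: dict) -> dict:
    # hash every value so replicas can compare cheaply before moving data
    return {k: hashlib.sha1(v).hexdigest() for k, v in store.items()}

def anti_entropy(a: dict, b: dict) -> None:
    da, db = digests(a), digests(b)
    for key in set(da) | set(db):
        if da.get(key) == db.get(key):
            continue                       # identical, nothing to transfer
        if key in a and key not in b:
            b[key] = a[key]                # push the missing entry
        elif key in b and key not in a:
            a[key] = b[key]
        # if both hold diverging values, a real system would resolve the
        # conflict via revisions or vector clocks; skipped in this sketch

n1 = {'blk1': b'aaa', 'blk2': b'bbb'}
n2 = {'blk1': b'aaa'}
anti_entropy(n1, n2)                       # afterwards n2 also holds blk2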
Cluster Coordination
Table items (routing table)
• Node table list
• Partition-to-node list

Bootstrap
• DNS name or IP at startup
• DNS lookup (SRV)
• Multicast

Transfer type
• Complete transfer
• Resync by diff (Merkle tree), see the sketch at the end of this slide
• Notification for a single change

• Join node
• Leave node
• Partition owner
[Tables: partition → server, node id → object, and segment hash ranges (1-100, 101-200, …)]
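The resync-by-diff idea above can be sketched as a two-level comparison: hash whole segment ranges first and descend into a range only when its hash differs, a flattened stand-in for a real Merkle tree. The store layout and range sizes are assumptions.

import hashlib

def range_hash(store: dict, keys) -> str:
    # one hash per range of segments, the coarse level of the tree
    h = hashlib.sha1()
    for k in keys:
        h.update(store.get(k, b''))
    return h.hexdigest()

def resync(src: dict, dst: dict, ranges) -> None:
    for keys in ranges:                          # e.g. segments 1-100, 101-200, ...
        if range_hash(src, keys) == range_hash(dst, keys):
            continue                             # whole range already in sync
        for k in keys:                           # only now compare segment by segment
            if k in src and dst.get(k) != src[k]:
                dst[k] = src[k]

ranges = [range(1, 101), range(101, 201)]
src = {i: b'x' for i in range(1, 201)}
dst = {i: b'x' for i in range(1, 150)}
resync(src, dst, ranges)                         # copies only the missing segments 150-200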
Cluster Coordination
Join of a new node Z (with existing nodes X and Y and a client):
1. Z bootstraps against node X; X notifies the cluster of the new node.
2. Z claims partition x; the partition table is updated (part x → Z).
3. The table change is propagated via gossip notification; node Y accepts it.
4. The client requests partition x from the old owner and gets back the new owner.
5. The client requests partition x from Z, which returns the data.
In case the data is not yet present on the new node, the new node acts as a proxy (lazy transfer).
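The lazy-transfer behaviour of the new owner could look roughly like this: serve from the local store when the block is already there, otherwise fetch it from the previous owner, cache it locally and return it. The class and the previous_owner.get() call are hypothetical.

class NewOwner:
    def __init__(self, local_store: dict, previous_owner):
        self.local = local_store            # key -> block data already migrated
        self.previous = previous_owner      # old owner, exposing get(key) (hypothetical)

    def get(self, key):
        if key in self.local:
            return self.local[key]          # block already transferred
        data = self.previous.get(key)       # act as a proxy toward the old owner
        if data is not None:
            self.local[key] = data          # lazily populate the local partition
        return data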
Transport Protocol
ZeroMQ and MessagePack (RPC)
Cluster Communications
Client Data transfer
Partition replication/Relocation
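To give an idea of the shape of this layer, here is a minimal request/reply exchange with pyzmq and msgpack; the endpoint, the 'get_block' opcode and the message fields are illustrative assumptions, not Pisa's actual RPC schema.

import msgpack
import zmq

ENDPOINT = "tcp://127.0.0.1:5555"          # illustrative address

def serve_once():
    # reply side: unpack one request, answer with a msgpack-encoded reply
    rep = zmq.Context.instance().socket(zmq.REP)
    rep.bind(ENDPOINT)
    request = msgpack.unpackb(rep.recv(), raw=False)
    reply = {'status': 'ok', 'op': request.get('op'), 'data': b'...'}
    rep.send(msgpack.packb(reply, use_bin_type=True))

def get_block(key: str):
    # request side: send an opcode plus arguments, wait for the reply
    req = zmq.Context.instance().socket(zmq.REQ)
    req.connect(ENDPOINT)
    req.send(msgpack.packb({'op': 'get_block', 'key': key}, use_bin_type=True))
    return msgpack.unpackb(req.recv(), raw=False)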
Status
Eeeemmm… not really perfect …
Next
http://www.cs.rutgers.edu/~pxk/417/notes/23-lookup.html
Chord
Space based / multi-dimensional
New data distribution model: Chord / cluster node
Vector clocks
Rebalance, partition handover (weight change)
Locking
WAN replication (async)
Config replication (pub/sub, event)
Server priority
…