Distributed Storage System (May 2014)

DESCRIPTION
Architecture document for building a distributed storage system, with a working sample distributed storage system from the author.

TRANSCRIPT
DISTRIBUTED STORAGE SYSTEM
Mr. Dương Công Lợi
Company: VNG-Corp
Tel: +84989510016/+84908522017
CONTENTS
1. What is a distributed-computing system?
2. Principles of distributed database/storage systems
3. Distributed storage system paradigm
4. Canonical problems in distributed systems
5. Common solutions for canonical problems in distributed systems
6. UniversalDistributedStorage
7. Appendix
1. WHAT IS A DISTRIBUTED-COMPUTING SYSTEM?

Distributed computing is the process of solving a computational problem using a distributed system.

A distributed system is a computing system in which a number of components on multiple computers cooperate by communicating over a network to achieve a common goal.
DISTRIBUTED DATABASE/STORAGE SYSTEM

In a distributed database system, the database is stored on several computers.

A distributed database is a collection of multiple, logically interrelated databases distributed over a computer network.
DISTRIBUTED SYSTEM ADVANTAGES

Advantages
Avoids bottlenecks and single points of failure
More scalability
More availability

Routing models
Client routing: the client sends its request directly to the appropriate server to read/write data
Server routing: a server forwards the client's request to the appropriate server and returns the result to the client
* The two models above can be combined in one system
DISTRIBUTED STORAGE SYSTEM
Store some data {1,2,3,4,6,7,8} into 1 server
And store them into 3 distributed server
1,2,3,4,6,7,8
1,2,3 4,6
7,8
2. PRINCIPLES OF DISTRIBUTED DATABASE/STORAGE SYSTEMS

Shard each data key and store it on the appropriate server using a Distributed Hash Table (DHT).
The DHT's hash function must suit consistent hashing:
Uniform distribution of the generated values
Deterministic (consistent) output
Jenkins and Murmur are good choices; cryptographic hashes such as MD5 and SHA also work but are slower.
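A minimal sketch of hash-based sharding, assuming Python. MD5 is used only because it is in the standard library; the deck recommends the faster Murmur or Jenkins hashes. The `shard` function and its plain modulo placement are illustrative, not the deck's actual scheme — the ring construction on the following slides improves on modulo placement, which remaps most keys whenever the server count changes.

```python
import hashlib

def shard(key: str, num_servers: int) -> int:
    """Map a data key to a server index.

    MD5 stands in here for a faster non-cryptographic hash such as
    Murmur or Jenkins; the placement rule (hash mod server count) is
    the simplest possible, shown only to illustrate deterministic,
    uniformly distributed key-to-server mapping.
    """
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_servers
```

Because the hash is deterministic, every client computes the same server for the same key without any coordination.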
3. DISTRIBUTED STORAGE SYSTEM PARADIGM

Data Hashing/Addressing
Determines which server a piece of data is stored on

Data Replication
Stores data on multiple server nodes for higher availability and fault tolerance
DISTRIBUTED STORAGE SYSTEM ARCHITECTURE

Data Hashing/Addressing
Use the DHT to map each server (by server name) to a number, placing it on a circle called the key space
Use the DHT to address each data key and find the server that stores it:
successor(k) = ceiling(addressing(k))
successor(k): the server that stores k

[Diagram: key-space circle starting at 0, with server1, server2, and server3 placed on it]
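The ring lookup — successor(k) as the first server clockwise from a key's position — can be sketched as follows. This is a minimal sketch in Python; MD5 stands in for the Murmur/Jenkins hashes the deck recommends, and `KeySpaceRing` is an illustrative name, not part of the system described.

```python
import bisect
import hashlib

def ring_hash(name: str) -> int:
    # MD5 stands in for a faster hash such as Murmur.
    return int.from_bytes(hashlib.md5(name.encode("utf-8")).digest()[:8], "big")

class KeySpaceRing:
    """Servers and keys share one circular key space; a key belongs to
    the first server clockwise from its position, i.e. successor(k)."""

    def __init__(self, server_names):
        self.points = sorted((ring_hash(n), n) for n in server_names)

    def successor(self, key: str) -> str:
        k = ring_hash(key)
        idx = bisect.bisect_right(self.points, (k, ""))
        # Wrap around past the highest point back to the start of the circle.
        return self.points[idx % len(self.points)][1]
```

The payoff over plain modulo sharding: adding a server claims only one arc of the circle, so only the keys in that arc move while every other key keeps its owner.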
DISTRIBUTED STORAGE SYSTEM ARCHITECTURE

Addressing – Virtual Nodes
Each server node is mapped to several virtual node ids so that keys are distributed more evenly and load is balanced
Server1: n1, n4, n6
Server2: n2, n7
Server3: n3, n5, n8

[Diagram: the key-space circle starting at 0, with virtual nodes n1–n8 of the three servers interleaved around it]
DISTRIBUTED STORAGE SYSTEM ARCHITECTURE

Data Replication
Data item k1 is stored on server1 as master and on server2 as slave

[Diagram: the key-space circle showing k1's position relative to server1 and server2]
4. CANONICAL PROBLEMS IN DISTRIBUTED SYSTEMS

Distributed transactions: the ACID (Atomicity, Consistency, Isolation, Durability) requirements
Distributed data independence
Fault tolerance
Transparency
5. COMMON SOLUTIONS FOR CANONICAL PROBLEMS IN DISTRIBUTED SYSTEMS

Atomicity and consistency: the two-phase commit protocol
Distributed data independence: the consistent hashing algorithm
Fault tolerance: leader election, multi-master, and data replication
Transparency: server routing, so the client sees the distributed system as a single server
TWO-PHASE COMMIT PROTOCOL

What is it?
Two-phase commit is a transaction protocol designed for the complications that arise with distributed resource managers.
Two-phase commit technology is used for hotel and airline reservations, stock market transactions, banking applications, and credit card systems.
With a two-phase commit protocol, the distributed transaction manager employs a coordinator to manage the individual resource managers. The commit process proceeds as follows:
TWO-PHASE COMMIT PROTOCOL

Phase 1: Obtaining a Decision
Step 1: The coordinator asks all participants to prepare to commit transaction Ti.
Ci adds the record <prepare T> to the log and forces the log to stable storage (the log is a file that maintains a record of all changes to the database), then sends prepare T messages to all sites where T executed.
TWO-PHASE COMMIT PROTOCOL

Phase 1: Making a Decision
Step 2: Upon receiving the message, the transaction manager at each site determines whether it can commit the transaction.
If not:
add a record <no T> to the log and send an abort T message to Ci
If the transaction can be committed:
1) add the record <ready T> to the log
2) force all records for T to stable storage
3) send a ready T message to Ci
TWO-PHASE COMMIT PROTOCOL

Phase 2: Recording the Decision
Step 1: T can be committed if Ci received a ready T message from all the participating sites; otherwise T must be aborted.
Step 2: The coordinator adds a decision record, <commit T> or <abort T>, to the log and forces the record onto stable storage. Once the record is in stable storage, the decision cannot be revoked (even if failures occur).
Step 3: The coordinator sends a message to each participant informing it of the decision (commit or abort).
Step 4: Participants take the appropriate action locally.
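The two phases above can be sketched as follows. This is a minimal in-memory sketch in Python: the `Participant` class, its `prepare`/`finish` methods, and the list-based logs are illustrative stand-ins for real resource managers and forced writes to stable storage.

```python
class Participant:
    """Toy resource manager: votes in phase 1, obeys the decision in phase 2."""

    def __init__(self, can_commit=True):
        self.can_commit = can_commit
        self.log = []  # stands in for a log forced to stable storage

    def prepare(self, txn):
        if not self.can_commit:
            self.log.append(f"<no {txn}>")
            return False
        self.log.append(f"<ready {txn}>")  # logged before voting yes
        return True

    def finish(self, txn, decision):
        self.log.append(f"<{decision} {txn}>")

def two_phase_commit(participants, txn):
    coordinator_log = [f"<prepare {txn}>"]           # Phase 1: ask every site
    votes = [p.prepare(txn) for p in participants]   # collect all votes
    decision = "commit" if all(votes) else "abort"
    coordinator_log.append(f"<{decision} {txn}>")    # Phase 2: log the decision...
    for p in participants:                           # ...then announce it to all sites
        p.finish(txn, decision)
    return decision
```

A single "no" vote in phase 1 forces an abort everywhere, which is exactly the atomicity guarantee the deck lists 2PC for.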
TWO-PHASE COMMIT PROTOCOL

Costs and Limitations
If one database server is unavailable, none of the servers gets the updates.
This is correctable through network tuning and by building the data distribution correctly through database optimization techniques.
LEADER ELECTION

Some leader election algorithms that can be used: LCR (LeLann-Chang-Roberts), Peterson, HS (Hirschberg-Sinclair)
LEADER ELECTION
Bully Leader Election algorithm
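A minimal sketch of the bully rule from a single node's point of view, with the message exchange abstracted into `alive_ids`, the set of nodes that answer; the function name is illustrative. The original slide's diagram is not reproduced here.

```python
def bully_winner(node_id, alive_ids):
    """Bully election as seen from `node_id`: it challenges every node
    with a higher id; if none of them is alive it declares itself
    leader, otherwise the election is 'bullied' upward and the highest
    live id ends up winning."""
    higher_alive = [n for n in alive_ids if n > node_id]
    return node_id if not higher_alive else max(higher_alive)
```

The design rationale: ids impose a total order, so every node that runs the election independently converges on the same leader, the highest id still alive.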
MULTI MASTER

Multi-master replication
The problem of multi-master replication (e.g. keeping replicas consistent when the same data can be written on several masters)
MULTI MASTER

Solution: two candidate models:
Two-phase commit (always consistent)
Asynchronous data sync among multiple nodes
Stays active even if some nodes die
Faster than 2PC
MULTI MASTER

Asynchronous data sync
Data is stored on the main master (called the sub-leader), then posted to a queue to be synced to the other masters.
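The write-then-queue flow above can be sketched as follows. This is a minimal sketch in Python: `SubLeader`, `sync_step`, and the plain dicts standing in for peer masters are all illustrative names, not the system's actual API.

```python
from collections import deque

class SubLeader:
    """Main master in asynchronous multi-master replication: a write is
    acknowledged locally first, then replicated to the other masters
    from a queue by a background sync step."""

    def __init__(self, peers):
        self.store = {}
        self.peers = peers            # other masters, modeled as plain dicts
        self.sync_queue = deque()

    def write(self, key, value):
        self.store[key] = value       # acknowledged immediately
        self.sync_queue.append((key, value))

    def sync_step(self):
        """One pass of the background sync: drain the queue to all peers."""
        while self.sync_queue:
            key, value = self.sync_queue.popleft()
            for peer in self.peers:
                peer[key] = value
```

This shows the trade-off the previous slide names: the write returns before replication completes (faster than 2PC, and a dead peer does not block the write), at the cost of a window where replicas lag the sub-leader.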
MULTI MASTER

Asynchronous data sync

[Diagram: client requests req1 and req2; Server1 (leader) handles req1 and forwards req2 to Server2; writes are synced between the servers through a data queue]
UNIVERSALDISTRIBUTEDSTORAGE
a distributed storage system
6. UNIVERSALDISTRIBUTEDSTORAGE

UniversalDistributedStorage is a distributed storage system developed for:
Distributed transactions (ACID)
Distributed data independence
Fault tolerance
Leader election (decides when a server node joins or leaves)
Replication with multi-master replication
Transparency
UNIVERSALDISTRIBUTEDSTORAGE ARCHITECTURE

Overview
[Diagram: each server is a stack of three layers — Business Layer, Distributed Layer, Storage Layer — repeated across the servers of the cluster]
UNIVERSALDISTRIBUTEDSTORAGE ARCHITECTURE

Internal Overview
[Diagram: client request(s) enter the Business Layer; the Distributed Layer resolves dataLocate()/dataRemote() and either calls localData() on the Storage Layer or queues the request to a remote node; result(s) flow back to the client]
ARCHITECTURE OVERVIEW
UNIVERSALDISTRIBUTEDSTORAGE FEATURES

Data hashing/addressing
Uses the Murmur hash function
UNIVERSALDISTRIBUTEDSTORAGE FEATURES

Leader election
Uses the Bully leader election algorithm
UNIVERSALDISTRIBUTEDSTORAGE FEATURES

Multi-master replication
Uses asynchronous data sync among server nodes
UNIVERSALDISTRIBUTEDSTORAGE STATISTICS

System information:
3 machines, each with 8 GB RAM and a Core i5 at 3.2 GHz
LAN/WAN network
7 server instances running on the 3 machines above

Concurrent writes: 16,500,000 items in 3,680 s, rate ~4,480 req/sec (measured at the client)
Concurrent reads: 16,500,000 items in 1,458 s, rate ~11,320 req/sec (measured at the client)
* This is not the limit of the system; the limit is at the clients (this test used 3 client threads)
Q & A
Contact:
Duong Cong Loi
https://www.facebook.com/duongcong.loi
7. APPENDIX
APPENDIX - 001

How to join/leave server(s)
[Diagram: 1. a server sends a join/leave request; 2. the request is forwarded to the leader server; 3. the leader processes the join/leave; 4. the leader broadcasts the result to Server A, Server B, and Server C]
APPENDIX - 002

How to move data when server(s) join/leave
Determine which data must be moved
Move the data asynchronously with a background thread, and control the speed of the move
APPENDIX - 003

How to detect that the leader or a sub-leader has died
Easily detected by polling the connection
APPENDIX - 004

How to make multiple virtual nodes for one server
Easily generate multiple virtual nodes for one server by hashing the server name.
Example: to make 200 virtual nodes for server 'photoTokyo', use the hash values of photoTokyo1, photoTokyo2, …, photoTokyo200.
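The photoTokyo example can be sketched in a few lines of Python. MD5 stands in for the Murmur hash the system actually uses, and the function name is illustrative.

```python
import hashlib

def virtual_node_positions(server_name, count=200):
    """Ring positions for a server's virtual nodes, obtained by hashing
    server_name + "1" .. server_name + str(count), as in the
    photoTokyo example. MD5 stands in for Murmur here."""
    return [
        int.from_bytes(
            hashlib.md5(f"{server_name}{i}".encode("utf-8")).digest()[:8], "big"
        )
        for i in range((1), count + 1)
    ]
```

Each of the 200 positions claims its own small arc of the key space, which is what spreads one physical server's load evenly around the ring.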
APPENDIX - 005

For fast data moving
Use a Bloom filter to detect whether the hash value of a data key exists
Use a storage for all the data keys held by the local server
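A minimal Bloom filter sketch in Python showing why it speeds up the move: it never yields false negatives, so a miss proves a key is absent and can be skipped immediately, while rare false positives only cost an extra lookup in the full key store. The class name, bit-array size, and hash count are illustrative defaults, not the system's actual parameters.

```python
import hashlib

class BloomFilter:
    """Fixed-size bit array probed at `num_hashes` positions per key.
    `might_contain` can err only toward false positives."""

    def __init__(self, num_bits=8192, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, key):
        # Derive several independent positions by salting one hash.
        for i in range(self.num_hashes):
            digest = hashlib.md5(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:4], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )
```

Used together with the local key store the slide mentions: check the filter first, and consult the full store only when the filter says the key might exist.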
APPENDIX - 006

How to avoid network tuning problems
Use a client connection pool with a screening strategy up front; this avoids many connections hanging when one server calls another remotely over the network.