BlobSeer in the NoSQL World


2

BlobSeer: Architecture

- Clients: perform fine-grain accesses to BLOBs
- Providers: store the pages of the BLOB
- Provider manager: monitors the providers and favours data load balancing
- Metadata providers: store information about page locations
- Version manager: ensures concurrency control

[Architecture diagram: clients, providers, metadata providers, provider manager, version manager]

3

BlobSeer: What may be refined

- Hotspots / fault tolerance: the single version manager and the provider manager are fixed nodes, and both may become hotspots
- Load balancing: the metadata providers are fixed as well

4

BlobSeer: What I am thinking of

5

Background: a lightweight DHT (the term may not be correct)

- Uses consistent hashing to distribute keys, giving load balancing, fault tolerance and elasticity
- Lookup cost: O(1)
- Based on a gossip overlay (borrowed from the NoSQL world), or on the Kelips P2P prototype (which I have only just learned about)
- Given a key, a node knows the destination exactly in most cases
- Overhead: acceptable, cf. the NoSQL world (Facebook's Cassandra, Amazon's Dynamo, Voldemort)

I will try to solve the problems above by building BlobSeer on top of this DHT.
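To make the O(1) lookup concrete, here is a minimal consistent-hashing sketch in Python. It is an illustration under my own assumptions: the class name, the virtual-node count and the MD5-based hash are invented here and are not part of BlobSeer or of any particular overlay.

    import hashlib
    from bisect import bisect

    class ConsistentHashRing:
        """Minimal consistent-hash ring with virtual nodes (illustrative)."""

        def __init__(self, nodes=(), vnodes=64):
            self.vnodes = vnodes
            self._hashes = []   # sorted ring positions
            self._owner = {}    # ring position -> node name
            for node in nodes:
                self.add(node)

        @staticmethod
        def _hash(key):
            # any uniform hash works; MD5 keeps the sketch dependency-free
            return int(hashlib.md5(key.encode()).hexdigest(), 16)

        def add(self, node):
            # each node owns several ring points for smoother load balancing
            for i in range(self.vnodes):
                h = self._hash(f"{node}#{i}")
                self._owner[h] = node
                self._hashes.insert(bisect(self._hashes, h), h)

        def lookup(self, key):
            # the first ring point clockwise from hash(key) owns the key;
            # the lookup is local (no routing hops), as in a one-hop DHT
            h = self._hash(key)
            i = bisect(self._hashes, h)
            return self._owner[self._hashes[i % len(self._hashes)]]

    ring = ConsistentHashRing([f"node-{i}" for i in range(8)])
    print(ring.lookup("BLOB_42"))   # the node responsible for this key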

6

Distributed version managers

Distributed version managers: a two-level scheme that splits the BLOB_ID namespace

- DHT-based: fortunately, BLOBs are independent of each other, so hash(BLOB_ID) => ID of the version manager server
- Splitting the version-ID space per BLOB easily relies on DHT replication: hash(BLOB_ID) => {neighbouring version managers}
- Lookup cost: O(1), equal to BlobSeer's
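Continuing the ring sketch above, one hypothetical way to realize hash(BLOB_ID) => {neighbouring version managers}: the first node clockwise from the hash is the master, and the next distinct nodes form its replica set, mirroring the successor-list replication of Dynamo-style DHTs. The function name and the replica count are my assumptions.

    from bisect import bisect

    def version_managers(ring, blob_id, replicas=2):
        """Master = first node clockwise from hash(BLOB_ID);
        slaves = the next distinct nodes on the ring."""
        h = ring._hash(blob_id)
        i = bisect(ring._hashes, h) % len(ring._hashes)
        managers = []
        for _ in range(len(ring._hashes)):       # bounded walk around the ring
            node = ring._owner[ring._hashes[i]]
            if node not in managers:
                managers.append(node)
            if len(managers) == 1 + replicas:
                break
            i = (i + 1) % len(ring._hashes)
        return managers                          # [master, slave, slave]

    print(version_managers(ring, "BLOB_42"))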

7

- Concurrent writes/appends need to be serialized on the master: Blob.getlatest(), Blob.write(), Blob.append()
- Access to historical versions can go randomly to any of {master, slaves}: Blob.read(), Blob.getsize(); ask the master only when necessary
- The master periodically pushes (or the slaves pull) the serialized versions; version info is quite tiny
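A hypothetical client-side view of this rule, reusing version_managers() from the sketch above; the routing helper itself is my invention, while the operation names follow the slide:

    import random

    SERIALIZED = {"write", "append", "getlatest"}    # must go through the master

    def route(op, managers):
        master, slaves = managers[0], managers[1:]
        if op in SERIALIZED:
            return master                            # single serialization point
        return random.choice([master] + slaves)      # history reads: any replica

    vms = version_managers(ring, "BLOB_42")
    print(route("write", vms), route("read", vms))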

8

Eliminate the provider manager

- The provider manager keeps the cluster state to answer clients' requests; lookup costs O(1), but it becomes a hotspot as the number of clients and providers increases
- Instead, providers can learn about the system state themselves (what about load and load balancing?); lookup still costs O(1)
- Use the presented DHT overlay to propagate the providers' states: gossip-based (limited to cluster sizes around 1000, but that is still good), or a lightweight P2P overlay (e.g. Kelips)
- A client randomly asks any provider
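A toy sketch of the gossip dissemination this relies on; the view structure and the freshness rule are my assumptions (real gossip layers such as Cassandra's also do failure detection and anti-entropy):

    import random

    def gossip_round(views):
        """views: provider -> {provider: (heartbeat, load)}. In one round every
        provider merges states with a random peer, keeping the freshest entry."""
        providers = list(views)
        for p in providers:
            peer = random.choice([q for q in providers if q != p])
            merged = dict(views[p])
            for k, v in views[peer].items():
                if k not in merged or v[0] > merged[k][0]:
                    merged[k] = v                    # higher heartbeat wins
            views[p] = dict(merged)
            views[peer] = dict(merged)

    # each provider initially knows only its own load; a few rounds suffice
    views = {f"p{i}": {f"p{i}": (1, 0.1 * i)} for i in range(4)}
    for _ in range(5):
        gossip_round(views)
    print(views["p0"])                               # now sees all four providers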

9

However!!!

We will not always want to use consistent hashing

10

Architecture

- Version managers, metadata managers, providers, clients
- DHT with consistent hashing
- Distributed membership management: gossip-based
- ZooKeeper (similar to Google's Chubby): replication, fault tolerance, leader election

11

Access scenarios

- Reading: hash the BLOB ID to find its associated version manager, go down the metadata tree, then access the providers; each step costs O(1), equal to the current BlobSeer design
- Writing: the same as in BlobSeer, but without the provider manager
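Putting the earlier sketches together, a hypothetical trace of the read path; the key formats are invented for illustration:

    ring = ConsistentHashRing([f"node-{i}" for i in range(8)])

    # step 1: hash(BLOB_ID) -> version manager replica set, O(1)
    print("version managers:", version_managers(ring, "BLOB_42"))

    # steps 2-3: metadata tree nodes and pages are themselves DHT keys,
    # so locating each of them is another local O(1) ring lookup
    print("metadata root on:", ring.lookup("meta/BLOB_42/v7/root"))
    print("page 0 on:", ring.lookup("page/BLOB_42/v7/0"))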

12

Overview of the implementation

- Gossip-based DHT
- We need three hash namespaces: version managers, metadata providers, providers
- Elasticity: inherent if we use consistent hashing for the DHT (see the check after this list)
- Fault tolerance: DHT-based
- Load balancing: DHT-based
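A quick check of the elasticity claim, using the ring sketch from earlier: with consistent hashing, adding one node remaps only about 1/N of the keys.

    ring = ConsistentHashRing([f"node-{i}" for i in range(8)])
    keys = [f"page/BLOB_42/v7/{i}" for i in range(1000)]
    before = {k: ring.lookup(k) for k in keys}

    ring.add("node-8")                 # elastic growth: a ninth node joins
    moved = sum(before[k] != ring.lookup(k) for k in keys)
    print(f"{moved / len(keys):.0%} of keys remapped")   # roughly 1/9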

13

Advantages

- Still keeps the current nice features of BlobSeer
- Monolithic design: each node provides all capabilities, acting as a client, a version manager, a metadata manager and a provider; simpler and easier configuration/deployment (an autonomic feature?)
- Load balancing
- Fault tolerance
- Elasticity
- Compared to a NoSQL key/value store: efficient for a single key whose value is TBs in size (versioning, throughput)

14

Some more discussions

- If a client is outside the BlobSeer storage cloud, it randomly chooses one node to communicate with; that node acts as a proxy server (as in Cassandra)
- We may need only a small number of version managers and metadata managers, chosen by leader election (which can be based on Apache ZooKeeper; see the sketch below). If we fix them, we reduce the overhead at the DHT level
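A minimal leader-election sketch based on Apache ZooKeeper, assuming the third-party kazoo Python client and a reachable ensemble; the path and identifier are invented:

    from kazoo.client import KazooClient

    zk = KazooClient(hosts="127.0.0.1:2181")   # assumed local ZooKeeper ensemble
    zk.start()

    def serve_as_master_version_manager():
        # runs only while this node holds the leadership
        print("elected master version manager")

    # kazoo's Election recipe: blocks until leadership is won, then runs the callback
    election = zk.Election("/blobseer/vm-election", identifier="node-0")
    election.run(serve_as_master_version_manager)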

[Diagram: an external client contacts a random node in the BlobSeer cloud, which acts as its proxy]

15

BlobSeer in the NoSQL paradigm

Document stores

Column stores

16

{pages} distribution

- BlobSeer's approach: distribute {pages} over different providers; {pages} are mapped directly to the physical addresses of providers
- DHT's approach: the DHT is used only to know who has {pages}, not to route {pages}; it still gives DHT load balancing, DHT fault tolerance and O(1) lookup cost
- We must find a good way: should the {pages} of a single write be distributed over different providers? [YES or NO] Fortunately, page keys are picked by the client in BlobSeer (see the sketch below)
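Since the client picks the page keys, it can choose keys that hash to different ring positions, scattering the pages of one write across providers. A hypothetical sketch with an invented key format:

    def place_pages(ring, blob_id, version, n_pages):
        """Client-chosen keys like '<blob>/v<version>/<index>' hash far apart,
        so the pages of a single write spread over different providers."""
        return {i: ring.lookup(f"{blob_id}/v{version}/{i}") for i in range(n_pages)}

    ring = ConsistentHashRing([f"provider-{i}" for i in range(8)])
    print(place_pages(ring, "BLOB_42", 7, 4))    # pages land on several providers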

17

Eliminate the provider manager

- The provider manager keeps the cluster state to answer clients' requests; lookup costs O(1), but it becomes a hotspot as the number of clients and providers increases
- Providers can learn about the system state themselves; lookup still costs O(1); use the presented DHT overlay to propagate the providers' states: gossip-based (limited to cluster sizes around 1000, but that is still good), or a lightweight P2P overlay (e.g. Kelips)
- We still need a good way to distribute the {pages} of each separate write operation over the DHT

[Comparison: BlobSeer's approach vs. the DHT's approach]
