Data Replication in NoSQL Databases | Planet Cassandra (2014)


Data replication is the concept of having data within a system be geo-distributed, preferably through a non-interactive, reliable process. In traditional relational databases (RDBMS), implementing any sort of replication is a struggle because these systems were not developed with horizontal scaling in mind. Instead, they are typically backed up via a semi-manual process, under the assumption that live recovery won't often be needed, and even that assumption downplays the complexity of the setup. When dealing with today's globally distributed data, these older colocated replication concepts will not suffice when implemented at geographic scale.

Today's infrastructure requires systems that natively support active, real-time replication, achieved through transparent and simple configuration. The ability to dictate where and how your data is replicated via easily tunable settings, expressed through easily understood concepts, is what modern NoSQL databases strive to offer.

Apache Cassandra was built with native multi-data center replication in mind, yet it is one of the most overlooked in this regard because this level of infrastructure has been assimilated as tribal knowledge within the Cassandra community. For those new to Apache Cassandra, this page is meant to highlight the simple inner workings of how Cassandra excels at multi-data center replication by examining the problem at the single-node level first. It covers the fundamentals of Cassandra internals, multi-data center use cases, and a few caveats to keep in mind when expanding your cluster.

Contents:

- Handling Writes at the Node Level
  - Synchronous Actions of Mutations
  - Asynchronous Actions of Mutations
- Multi-Data Center Awareness at Little Extra Cost
- Real World Use Cases
  - Workload Separation
  - Live Backup Scenario
  - Live Recovery
  - Geographic Location Scenario
- Caveats of Multi Data Center Replication


[Illustration 1: http://planetcassandra.org/wp-content/uploads/2014/06/Illustration_1-3.png]

Apache Cassandra was designed with horizontal scaling in mind from its infancy, at a time just before Big Data became mainstream. This is important because up until that point in database history, horizontal scaling had typically been an afterthought. Some legacy databases claim to support scaling, yet require extra infrastructure to do so. Some newer databases even go as far as claiming that faults across distributed systems never occur.

Our hope is that those reading this have never had to deal with such hack-based infrastructures, and have never learned firsthand that faults in a distributed setting are very real: failures of minute probability will always occur at scale. However, pretending that the majority of you are new to all of this would be ignorance on our part. And if we exclaim that Cassandra has solved these issues without presenting how this actually happens, we may as well be selling snake oil.

That's why, before we continue any further, we'll cover how writes occur in Cassandra, to set the basis for understanding our claim that Cassandra is a highly available, fault-tolerant, distributed, redundant database.

Handling Writes at the Node Level

A mutation is any write request that hits the Cassandra servers. Write mutations, or writes, can be insert, update, or delete commands; Cassandra deals with all of them in the same fashion in order to elegantly handle mutations in a distributed setting. When a mutation is received, two types of actions happen: synchronous actions and asynchronous actions.

Synchronous Actions of Mutations

Synchronous actions, in this case, are typically actions that occur between receiving a request and issuing a response [1]. When requests come in, they are routed both to memory, for immediate access, and to disk, for recovery.

In memory, they are stored in memtables that exist per column family. There, rows can easily be overwritten or appended with new data, providing fast read access. But because memory clears when nodes are power cycled, all requests are also written to disk as a persistent backup.

When requests hit disk, they are stored in the commit log, an append-only file structure that accepts all requests as they are received. This append-only access pattern ensures that rotational media is utilized efficiently and that solid state drives are not subject to extensive disk wear.

For highly critical systems, you can even tune the commitlog_sync setting [http://www.datastax.com/documentation/cassandra/2.0/cassandra/configuration/configCassandra_yaml_r.html?scroll=reference_ds_qfg_n1r_1k__commitlog_sync] to withhold write acknowledgments until the data is synchronized to disk; in most cases, however, this is overkill. Writing at a consistency level of QUORUM or higher will handle most use cases adequately, while also providing lower-latency writes and reads, due to the level of available replicas at write time.
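As a concrete illustration, here is a minimal sketch of a QUORUM write issued through the DataStax Python driver; the contact point, keyspace, and table names are hypothetical.

    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    cluster = Cluster(["10.0.0.1"])           # any reachable node can coordinate
    session = cluster.connect("my_keyspace")  # hypothetical keyspace

    # The write is acknowledged once a quorum of replicas has stored it in
    # their memtables and commit logs; fsync timing is governed by commitlog_sync.
    insert = SimpleStatement(
        "INSERT INTO users (user_id, name) VALUES (%s, %s)",
        consistency_level=ConsistencyLevel.QUORUM,
    )
    session.execute(insert, (42, "alice"))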

If, and only if, the above tasks have been completed will the request receive a successful response to the write. This means that at response time the node not only has a copy of the newest value in memory, but also holds a hard copy in the event of power loss. In such an event, Cassandra will read all data from the commit log back into memory to restore itself to its previous state before being marked as UP to the rest of the cluster.

Asynchronous Actions of Mutations

Until this point, we've been talking about handling data at the local level, in memory and on disk, to ensure data integrity. The optionally asynchronous actions that occur on the coordinator nodes are where Cassandra gets its multi-data center leverage.

Before continuing, we'll define the importance and election of the coordinator node in Cassandra. Being a coordinator is a role taken on by a randomly selected peer for the life of each request. Because Cassandra doesn't have a master/slave architecture, all peers are capable of receiving requests, and it's during this phase that a peer becomes the coordinator and synchronously ensures the requested Consistency Level is met. Its asynchronous responsibility is to ensure that the additional replicas are also delivered any received mutations.

It's during this phase that mutations are bundled up and sent to the nodes currently accepting writes. In the case that nodes are not available, the coordinator leaves hints on the local machine, to be delivered once the unavailable machine comes back online.

This phase can occur asynchronously because Cassandra has the advantage of tunable consistency. If there are requests that require higher consistency, increasing the write and/or read consistency levels removes the reliance on these asynchronous mutations by requiring agreement from more nodes, but in most cases applications can be designed to accept, and appropriately handle, eventual consistency.
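To make "agreement from more nodes" concrete: Cassandra defines a quorum as floor(RF / 2) + 1, where RF is the sum of the replication factors being counted. A small sketch of the arithmetic, using the hypothetical three data center layout discussed later:

    def quorum(replicas):
        # Cassandra's quorum: a strict majority of the counted replicas.
        return replicas // 2 + 1

    rf = {"Cassandra": 3, "Analytics": 2, "Search": 1}

    print(quorum(sum(rf.values())))  # QUORUM across all DCs: 4 of 6 replicas
    print(quorum(rf["Cassandra"]))   # LOCAL_QUORUM in one DC: 2 of 3 replicas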


Multi-Data Center Awareness at Little Extra Cost

[Illustration 2: http://planetcassandra.org/wp-content/uploads/2014/06/Illustration_2-32.png]

Because Cassandra has been designed to work well at the peer-to-peer level with eventual consistency, we can take the above concepts and gain multi-data center awareness with very little additional logic. Most of the time, data centers remain in semi-isolation, transferring data back and forth while applications connect only to the closest (local, as opposed to remote) data center. This facilitates the workload separation, live backup, and geographic location scenarios covered in the next sections. However, occasionally an application may want to use a data center in a more active role, by using a Consistency Level of EACH_QUORUM [http://www.datastax.com/documentation/cassandra/2.0/cassandra/dml/dml_config_consistency_c.html].

When the EACH_QUORUM Consistency Level is used, the coordinator acts in roughly the same fashion as above, yet knowingly sends a request to a remote data center to have a remote coordinator node handle that request locally. Once that remote coordinator gets a LOCAL_QUORUM-level response, it responds back to the original coordinator, which resolves any conflicts, if applicable, before responding to the application request with an EACH_QUORUM-level response.
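In client code this is just a different consistency level on the statement. A minimal, hedged sketch with the Python driver (contact point and table names are hypothetical):

    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    session = Cluster(["10.0.0.1"]).connect("my_keyspace")

    # EACH_QUORUM blocks until a quorum of replicas in *every* data center
    # acknowledges the write, trading cross-data center latency for consistency.
    critical_write = SimpleStatement(
        "UPDATE accounts SET balance = %s WHERE account_id = %s",
        consistency_level=ConsistencyLevel.EACH_QUORUM,
    )
    session.execute(critical_write, (100, 7))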

Hopefully it has become apparent that Cassandra excels at being a distributed database at its core and, as such, allows for almost-free multi-data center awareness. Now that you understand how Cassandra's multi-data center approach works and why you can trust it, take a minute to realize how this affects your application when you move into new data centers: typically, it doesn't.

Cassandra houses all of this logic internally on each of the nodes in the cluster and, by doing so, allows applications to function independently of the Cassandra topology. The only caveat to watch out for is properly using Consistency Levels and Replication Strategies with scaling in mind. That's it!

When the time comes to implement the upcoming scenarios, double-check your Consistency Levels and Replication Strategies, bootstrap a few nodes into the new data centers, and properly set the new Replication Strategies. Then continue focusing on your application logic while Cassandra handles the high availability, without your worrying about data integrity or actively maintaining multiple data centers!
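As an example, growing a keyspace into a new data center amounts to one schema change plus a streaming rebuild. A hedged sketch (keyspace and data center names are hypothetical):

    from cassandra.cluster import Cluster

    session = Cluster(["10.0.0.1"]).connect()

    # Tell the keyspace to keep three replicas in each data center.
    session.execute("""
        ALTER KEYSPACE my_keyspace WITH replication = {
            'class': 'NetworkTopologyStrategy',
            'DC1': 3,
            'DC2': 3
        }
    """)
    # Then stream the pre-existing data to the new data center by running
    # `nodetool rebuild -- DC1` on each node in DC2.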


Real World Use Cases

Workload Separation

By implementing data centers as the divisions between varying workloads, DataStax Enterprise allows a natural distribution of data from real-time data centers to near-real-time Analytics and Search data centers.

Whenever a write comes in via a client application, it hits the main Cassandra data center and returns an acknowledgment at the current consistency level (typically less than EACH_QUORUM, to allow for high throughput and low latency). In parallel and asynchronously, these writes are sent off to the Analytics and Search data centers, based on the replication strategy for the active keyspaces. A typical replication strategy would look similar to {Cassandra: 3, Analytics: 2, Search: 1}, depending on use cases and throughput requirements.
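Expressed as a keyspace definition (CQL issued through the Python driver; the keyspace name is hypothetical, and the data center names must match what your snitch reports), that strategy looks like this:

    from cassandra.cluster import Cluster

    session = Cluster(["10.0.0.1"]).connect()

    # Three replicas in the real-time DC, two in Analytics, one in Search.
    session.execute("""
        CREATE KEYSPACE IF NOT EXISTS metrics WITH replication = {
            'class': 'NetworkTopologyStrategy',
            'Cassandra': 3,
            'Analytics': 2,
            'Search': 1
        }
    """)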

Once these asynchronous replica writes are received by the additional data centers, they undergo the normal write procedures and are assimilated into that data center. This way, any analytical jobs that are running can easily access this new data without a time-consuming ETL process. For DataStax Enterprise's Search nodes, these writes are introduced into the memtables and additional Search processes are triggered to incorporate the data.

Don't assume this can only be done on DataStax Enterprise because of proprietary code. DataStax Enterprise delivers a complete Analytics and Search package, but any Apache Cassandra installation that is separated into data centers can benefit from the same workflow. Simply set up one data center to be the real-time data center that's accessed by your application layer, and another data center to house the intensive workloads. The separate data center can then use near-real-time data to churn through computations in scratch column families before writing the results out to shared analytical column families. Once these results are written locally to the shared column families, they'll soon appear in the real-time data center, ready for live usage.

Live Backup Scenario

Some Cassandra use cases instead use different data centers as a live backup that can quickly be used as a fallback cluster. We will cover the most common such use case, on Amazon Web Services Elastic Compute Cloud (EC2), in the following example.

The goal: allow your application to have multiple fallback patterns across multiple consistencies and data centers.

A client application was created and currently sends requests to EC2's US-East-1 region at a Consistency Level of QUORUM. Later, another data center is added in EC2's US-West-1 region to serve as a live backup. The replication strategy can be a full live backup ({US-East-1: 3, US-West-1: 3}) or a smaller live backup ({US-East-1: 3, US-West-1: 2}) to save on cost and disk usage in this regional-outage scenario. All clients continue to write to the US-East-1 nodes by ensuring that the clients' load balancing pools are restricted to just those nodes, minimizing cross-data center latency; a sketch of such a pinned pool follows.
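A minimal sketch of a pinned connection pool with the Python driver; the contact points are hypothetical, and the data center name must match what your snitch reports:

    from cassandra.cluster import Cluster
    from cassandra.policies import DCAwareRoundRobinPolicy

    # Both the contact points and the routing policy point at US-East-1 only,
    # so no request crosses to US-West-1 during normal operation.
    cluster = Cluster(
        contact_points=["10.1.0.1", "10.1.0.2"],  # US-East-1 nodes
        load_balancing_policy=DCAwareRoundRobinPolicy(local_dc="us-east-1"),
    )
    session = cluster.connect("my_keyspace")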

To implement a better cascading fallback, the client's connection pool will initially be aware of only the nodes in the US-East-1 region. In the event of client errors, all requests are retried at a Consistency Level of QUORUM X times, then drop to a Consistency Level of ONE while escalating the appropriate notifications. If the requests are still unsuccessful, requests should begin contacting US-West-1, using a new connection pool consisting of nodes from the US-West-1 data center, first at a higher Consistency Level and ultimately dropping down to a Consistency Level of ONE.
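Here is one way that cascade could be sketched on top of the Python driver; the two sessions (built like the pinned pool above, one per region), the retry count, and the broad exception handling are illustrative assumptions rather than a prescribed implementation:

    from cassandra import ConsistencyLevel
    from cassandra.query import SimpleStatement

    def execute_with_fallback(east, west, query, params, retries=3):
        # Escalation order: local QUORUM, local ONE, then the backup region.
        plan = [
            (east, ConsistencyLevel.QUORUM),
            (east, ConsistencyLevel.ONE),   # notify operators at this point
            (west, ConsistencyLevel.QUORUM),
            (west, ConsistencyLevel.ONE),   # last resort
        ]
        for session, level in plan:
            statement = SimpleStatement(query, consistency_level=level)
            for _ in range(retries):
                try:
                    return session.execute(statement, params)
                except Exception:           # e.g. Unavailable, OperationTimedOut
                    continue
        raise RuntimeError("both data centers exhausted")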

In cases like this, natural events and other failures can be prevented from affecting your live applications.

Live Recovery

As long as the original data center is restored within gc_grace_seconds (10 days by default), perform a rolling repair (without the -pr option) on all of its nodes once they come back online. Doing so will ensure that all data is consistent and that all deletes have been properly applied.

If, however, the nodes will come back up and complete their repairs after gc_grace_seconds has elapsed, you will need to take the following steps to ensure that deleted records are not reinstated:

- ensure your applications are only contacting the backup data center
- remove all the offending nodes from the ring using nodetool removetoken
- clear all data from the failed region (data directories, commit logs, snapshots, and system logs)
- decrement all tokens in the failed region, if not using vnodes
- disable auto_bootstrap
- start up the entire failed region
- finally, run a rolling repair (without the -pr option) on all nodes in the recovered region

After these nodes are up to date, you can restart your applications and continue using your primary data center.

Geographic Location Scenario

There are certain use cases where data should be housed in different data centers depending on the user's location, in order to provide a more responsive exchange. The use case we will cover refers to data centers in different countries, but the same logic and procedures apply to data centers in different regions. The logic that defines which data center a user connects to resides at the application level.

The goal: have users connect to data centers based on geographic location, but ensure this data is available cluster-wide for backup, for analytics, and to account for users traveling across regions.

The required end result is for users in the US to contact one data center while UK users contact another, to lower end-user latency. An additional requirement is for both of these data centers to be part of the same cluster, to lower operational costs and facilitate cross-data center replication. This can be handled using the following rules (a routing sketch follows the list):


- When reading and writing from each data center, ensure that the client's load balancing policies prefer the local data center
- If doing reads at QUORUM, ensure that LOCAL_QUORUM is being used and not EACH_QUORUM, since the latency of cross-data center resolution will degrade the end user's experience
- Depending on how consistent you want your data centers to be, you may choose to run repair operations (without the -pr option) more frequently than the required once per gc_grace_seconds
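A sketch of region-aware routing at the application level; the contact points and the data center names "US" and "UK" are assumptions that must match what your snitch reports:

    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster
    from cassandra.policies import DCAwareRoundRobinPolicy
    from cassandra.query import SimpleStatement

    # Hypothetical per-region contact points and data center names.
    REGIONS = {
        "us": (["10.1.0.1", "10.1.0.2"], "US"),
        "uk": (["10.2.0.1", "10.2.0.2"], "UK"),
    }

    def session_for(region):
        contact_points, local_dc = REGIONS[region]
        cluster = Cluster(
            contact_points=contact_points,
            load_balancing_policy=DCAwareRoundRobinPolicy(local_dc=local_dc),
        )
        return cluster.connect("user_data")

    # LOCAL_QUORUM keeps the read inside the user's local data center.
    read = SimpleStatement(
        "SELECT * FROM profiles WHERE user_id = %s",
        consistency_level=ConsistencyLevel.LOCAL_QUORUM,
    )
    rows = session_for("uk").execute(read, (42,))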


When setting up clusters in this manner, you gain the following benefits:

- You ensure faster performance for each end user
- You gain multi-region live backups for free, just as mentioned in the section above. Granted, performance for requests across the US and UK will not be as fast, but your application does not have to come to a complete standstill in the event of catastrophic losses
- Users can travel across regions, and in the time taken to travel, the user's information should have finished replicating asynchronously across regions



Caveats of Multi Data Center Replication

Always use the NetworkTopologyStrategy when using, or planning to use, multiple data centers. SimpleStrategy will work across data centers, but has the following disadvantages:

- It does not provide any of the multi-data center features mentioned above
- It introduces latency on each write (depending on the distance and latency between data centers)

Defining one rack for the entire cluster is the simplest and most common implementation. Multiple racks should be avoided for the following reasons:

- Most users tend to ignore or forget the rack requirement that racks should be laid out in an alternating order, which allows data to be distributed safely and appropriately
- Many users do not use the rack information effectively, for example by defining one rack per node, or other non-beneficial configurations
- When using racks correctly, each rack should typically have the same number of nodes
- In a scenario that requires a cluster expansion while using racks, the expansion procedure can be tedious, since it typically involves several node moves and has to ensure that racks will distribute data correctly and evenly. At times when clusters need immediate expansion, racks should be the last thing to worry about

Notes

1. Responses can be issued without validation by using the non-default Consistency Level of ANY [http://www.datastax.com/documentation/cassandra/2.0/cassandra/dml/dml_config_consistency_c.html?scroll=concept_ds_umf_5xx_zj__about-write-consistency], but we ignore this anti-pattern for simplicity's sake.

Multi Data Center Replication in NoSQL Databases was created by Joaquin Casares [https://www.linkedin.com/profile/view?id=46810484], Software Engineer in Test at DataStax [http://www.DataStax.com].