Distributed RDBMS: Data Distribution Policy: Part 2 - Creating a Data Distribution Policy


October 2014


Data Distribution Policy: Part 2

Distributed RDBMSs provide many scalability, availability and performance advantages.

This presentation examines steps to create a customized data distribution policy for your RDBMS that best suits your application’s needs to provide maximum scalability.

We will discuss:
• The different approaches to data distribution
• How to create your own data distribution policy, whether you are scaling an existing application or creating a new app
• How ScaleBase can help you create your policy

Yarden Sibony


Why Is a Distributed Relational Database Good?

Distributed relational databases are a perfect match for Cloud computing models and distributed Cloud infrastructure.

They are the way forward for delivering web-scale applications while keeping ACID properties.

• Social apps
• Games
• Many concurrent users
• High transaction throughput
• Very large data volumes

What Is a Data Distribution Policy? – Recap

A data distribution policy describes the rules under which data is distributed across a distributed RDBMS (a virtual database made up of many database instances, or “shards”).

A good data distribution policy aims to:

1. Maintain full relational database integrity

2. Distribute workloads in an even and predictable manner

3. Minimize the number of joins across the array of database instances

4. Yield database scalability
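As a rough illustration of what such a rule can look like, the sketch below maps a distribution key to one of a fixed number of database instances. The names `NUM_SHARDS` and `shard_for` and the hash-modulo rule are assumptions made for this example; the presentation does not prescribe a particular routing function.

```python
import zlib

NUM_SHARDS = 4  # assumed number of database instances in the array


def shard_for(distribution_key) -> int:
    """Map a distribution key to a shard index in a stable, repeatable way."""
    # zlib.crc32 gives the same value on every run, unlike Python's built-in
    # hash(), which is randomized per process for strings.
    digest = zlib.crc32(str(distribution_key).encode("utf-8"))
    return digest % NUM_SHARDS


# Every row carrying the same key is routed to the same instance.
same_user_rows = [shard_for(7) for _ in range(3)]
print(same_user_rows)
```

Because the mapping depends only on the key, the routing is predictable: any component that knows the key can compute where the row lives.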

Two Broad Types of Data Distribution Policy

1. Arbitrary Distribution: Data is distributed across database instances without any consideration for or understanding of specific application requirements. Arbitrary distribution is often used by NoSQL database technologies.

2. Policy-Based Distribution: Data is distributed across database instances in a way that takes into account the application's requirements, data relationships, transaction flows, and how the application reads and writes the data.
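The difference can be sketched with a small experiment. The example below places twenty hypothetical orders belonging to one user in two ways: by each row's own primary key (standing in for arbitrary distribution) and by the owning user's id (standing in for policy-based distribution). The table layout, the `bucket` helper and the cluster size are all assumptions for illustration.

```python
import zlib

NUM_SHARDS = 4  # assumed cluster size


def bucket(value) -> int:
    """Stable hash of a value onto one of NUM_SHARDS instances."""
    return zlib.crc32(str(value).encode("utf-8")) % NUM_SHARDS


# Hypothetical rows: (order_id, user_id). All twenty orders belong to user 7.
orders = [(order_id, 7) for order_id in range(100, 120)]

# Arbitrary distribution: each row is placed by its own primary key,
# so one user's orders are scattered.
arbitrary_shards = {bucket(order_id) for order_id, _ in orders}

# Policy-based distribution: each row is placed by the user who owns it,
# so one user's orders stay together.
policy_shards = {bucket(user_id) for _, user_id in orders}

print(len(arbitrary_shards), len(policy_shards))
```

A query for all of user 7's orders has to visit every shard in `arbitrary_shards`, but only the single shard in `policy_shards`.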

Two Broad Types of Data Distribution Policy

Arbitrary Data Distribution Policy

Pros:
- Unsophisticated
- Predetermined (no forethought required)

Cons:
- No intelligence about business, schema, use cases
- Leads to excessive use of database nodes
- Leads to excessive use of network

Declarative Data Distribution Policy

Pros:
- Ensures that a specific transaction finds all the data it needs in one specific database
- Aligns with schema and DB structure
- Highly efficient and scalable
- Anticipates future requirements and growth assumptions

Cons:
- Requires forethought and analysis


Distributed Databases: NoSQL vs. DRDBMS

• NoSQL databases abandoned the relational model to get the scalability benefits of a distributed database. NoSQL and document-store databases can use arbitrary data distribution because their data model does not provide for joins, sequential integrity or ACID.

• However, today RDBMSs can achieve massive web scale and keep the time-tested relational database model, ACID and SQL if you use a declarative, policy-based data distribution approach.

• Academia has written about various types of distributed relational databases for decades. But today they are a reality. Declarative, policy-based data distribution is the way forward.

Two Distributed RDBMS Use Cases

There are two typical development and database scenarios in which relational databases can evolve into modern distributed relational databases:

1. Scaling an existing application

2. Designing scalability in a new application

Scaling an Existing Application: Key Observations and Measurements

Problem: A monolithic MySQL database is suffering from scalability issues:

• Inconsistent performance
• Inconsistent availability
• Transaction throughput bottlenecks

Solution: A distributed MySQL database that retains its relational principles by applying a declarative, policy-based data distribution process.

Scaling an Existing Application: Key Observations and Measurements

In today’s public, private and hybrid cloud world, which is built on distributed infrastructure, scaling up (getting bigger hardware) is a counterintuitive, temporary and expensive answer for an existing database that is reaching its scalability limits.

A good data distribution policy:

1. Transforms a monolithic single-instance MySQL database into a distributed MySQL database that retains its relational principles.

2. Aligns with the application's current database structure and commands. Related data within various tables is identified and grouped so that it stays localized in a single database instance.

3. Ensures “reads” and “writes” can be completed successfully using only data from within one database instance.

Determining Your Data Distribution Policy: Reads and Writes

Reads (Queries):
• Examine the data that is accessed in joins, sub-queries or unions to find what data ought to be kept together on one machine. This data usually comes from related tables that share the same foreign keys.

Writes (Transactions):
• Additions to the database need to be placed in the appropriate partitioned database instance (or shard) with their related data.

• A transaction is more efficient when it is contained within a single database cluster. This practice eliminates the need for a distributed transaction with 2-phase-commit.
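The point about writes can be sketched as a small planning step: given the distribution keys of the rows a transaction touches, work out how many shards are involved and whether a plain local commit is enough. The function names, the return labels and the hash-modulo routing are assumptions made for this illustration, not part of any specific product.

```python
import zlib

NUM_SHARDS = 4  # assumed cluster size


def shard_for(distribution_key) -> int:
    """Stable mapping from a distribution key to a shard index."""
    return zlib.crc32(str(distribution_key).encode("utf-8")) % NUM_SHARDS


def plan_commit(distribution_keys) -> str:
    """Return 'local' when all rows live on one shard, else 'two-phase'."""
    shards = {shard_for(key) for key in distribution_keys}
    return "local" if len(shards) <= 1 else "two-phase"


# A transaction that only touches rows owned by user 7 stays on one shard.
print(plan_commit([7, 7, 7]))
```

When related rows share a distribution key, every transaction over them takes the `local` path and no cross-shard coordination is needed.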

Distribution Example: Reads and Writes

Efficiency dictates that we want to ensure that data can be either read together, such as in queries, or written together, such as in transactions. “The data that plays together, should stay together.”

Reads (Queries):
• When identifying the ‘users’ in a database, the next step would involve identifying the ‘orders’ related to those ‘users’, then the ‘items’ related to the ‘orders’.

Writes (Transactions):
• An ‘order’ is made up of many ‘items’, which are consequently added to the same shard as the ‘order’.
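The users, orders and items example can be sketched with in-memory dictionaries standing in for the database instances. Everything here (the `shards` structure, `insert`, `place_order`, the sample rows) is hypothetical and only meant to show related rows landing on the same shard.

```python
import zlib
from collections import defaultdict

NUM_SHARDS = 4  # assumed cluster size

# shard index -> table name -> list of rows (a stand-in for real databases)
shards = defaultdict(lambda: defaultdict(list))


def shard_for(user_id) -> int:
    return zlib.crc32(str(user_id).encode("utf-8")) % NUM_SHARDS


def insert(table: str, row: dict, user_id: int) -> None:
    """Store a row on the shard chosen by the owning user's id."""
    shards[shard_for(user_id)][table].append(row)


def place_order(user_id: int, order_id: int, item_names: list) -> None:
    """Write an order and its items together, on the user's shard."""
    insert("orders", {"order_id": order_id, "user_id": user_id}, user_id)
    for name in item_names:
        insert("items", {"order_id": order_id, "name": name}, user_id)


insert("users", {"user_id": 7, "name": "Ada"}, user_id=7)
place_order(user_id=7, order_id=1001, item_names=["book", "pen"])

# The users -> orders -> items read is answered from a single shard.
home = shards[shard_for(7)]
order_ids = {o["order_id"] for o in home["orders"] if o["user_id"] == 7}
items = [i["name"] for i in home["items"] if i["order_id"] in order_ids]
print(items)
```

Both the write (`place_order`) and the read (the final lookup) touch only the shard that holds user 7.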

Scaling an Existing Application: Denormalization – Not Recommended

A distribution key is the field that determines which database instance a row is routed to. If a table does not contain the distribution key, the routing process can become very difficult.

• Denormalization adds the distribution key to the tables in which it is missing. However, this creates many additional problems along the way. It is not recommended.

• ScaleBase’s cascading key lookup solution removes the need for denormalization while resolving data placement issues efficiently.

http://www.scalebase.com/distributed-databases-and-cascading-tables-part-1/
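The general idea of resolving a missing distribution key through a parent table can be sketched as follows. This is an illustration of the concept only: the slides do not describe how ScaleBase implements cascading key lookup, and the `order_owner` mapping and the `route_order` and `route_item` functions are hypothetical names invented for this example.

```python
import zlib

NUM_SHARDS = 4  # assumed cluster size

# Hypothetical lookup kept by the routing layer: order_id -> user_id.
order_owner = {}


def shard_for(user_id) -> int:
    return zlib.crc32(str(user_id).encode("utf-8")) % NUM_SHARDS


def route_order(order: dict) -> int:
    """Orders carry the distribution key (user_id) directly."""
    order_owner[order["order_id"]] = order["user_id"]
    return shard_for(order["user_id"])


def route_item(item: dict) -> int:
    """Items have no user_id column; resolve it through the parent order."""
    user_id = order_owner[item["order_id"]]  # KeyError if the order is unknown
    return shard_for(user_id)


order_shard = route_order({"order_id": 1001, "user_id": 7})
item_shard = route_item({"order_id": 1001, "name": "book"})
print(order_shard == item_shard)
```

The `items` table keeps its normalized shape (no `user_id` column), yet each item still lands on the same shard as its order.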



Scaling an Existing Application: Null Columns

The fields that determine where to route the data and commands cannot be empty (i.e. null) or updated during the life of the row. To ensure this:

• Every piece of data must be “born” with a distribution key that it keeps for the course of its entire life.

• It is not enough to simply have the distribution key column in all tables; it needs to be populated, as part of the data in the table, as well.

• A row can be inserted into a table, updated many times and deleted.

• It is vital that every row is inserted into the database with a populated distribution key.

• If a row is inserted into the database with a ‘null’ shard key, it cannot be placed into the distributed database.
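These two rules (populated on insert, never changed afterwards) can be sketched as simple checks applied before a statement is routed. The function names, the `DistributionKeyError` exception and the `user_id` column are assumptions made for this example.

```python
class DistributionKeyError(ValueError):
    """Raised when a row would break the distribution key rules."""


def check_insert(row: dict, key_column: str = "user_id") -> None:
    """Reject rows whose distribution key is missing or null."""
    if row.get(key_column) is None:
        raise DistributionKeyError(f"{key_column} must be populated on insert")


def check_update(old_row: dict, changes: dict, key_column: str = "user_id") -> None:
    """Reject updates that would change the distribution key of an existing row."""
    if key_column in changes and changes[key_column] != old_row[key_column]:
        raise DistributionKeyError(f"{key_column} cannot change after insert")


check_insert({"order_id": 1001, "user_id": 7})                  # accepted
check_update({"order_id": 1001, "user_id": 7}, {"total": 30})   # accepted

try:
    check_insert({"order_id": 1002, "user_id": None})
except DistributionKeyError as exc:
    print(exc)
```

In a real schema the same intent is usually backed by a `NOT NULL` constraint on the key column; the checks above only show where the rule sits in the row's life cycle.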

Automating Data Distribution Analysis: ScaleBase’s Analysis Genie

If you want to add linear scalability to an existing MySQL database, you can use ScaleBase’s free SaaS tool, Analysis Genie.

• The Analysis Genie will help you define the best data distribution policy tailored to your application’s unique requirements.

• The results are based on a guided analysis of the nature of your data, data relationships, and the functional use of your data.

• You can iterate with different policies in a simulated environment to achieve the highest application / distributed database efficiency.

http://www.scalebase.com/products/analysis-genie/


Designing Scalability in a New Application

New web-facing apps have to anticipate millions of users, high transaction rates, and ever-larger data volumes.

• The same data distribution principles applied to existing applications also apply to new applications and databases.

• Data is stored and accessed together on the same database, whether it is for “reads” or “writes”.

Designing Scalability in a New Application (Continued)

When designing a data distribution policy, the distribution key should be selected according to how data will be distributed.

You can then denormalize, adding the distribution key to every table, or distribute by understanding the link between the tables within each shard from the beginning of the design process.

Designing Scalability in a New Application

When designing a database, ask yourself about the life cycle of the rows of your data.

• Were they born with a populated distribution key?

Designing your application so that this is taken care of avoids the unpleasant situation of null shard keys.

Massive Database Scalability With ScaleBase

Analysis tools are not appropriate for new applications as they do not have anything to track.

For this reason we’ve created a special guide:
• Building a New Application with Massive Database Scalability – Getting Started with ScaleBase

This document demonstrates how to build a new application that plans for massive database scalability right from the start.

It provides a walkthrough of how to create a simple, straightforward RDBMS data distribution policy.

http://www.scalebase.com/wp-content/uploads/2014/05/Building-a-New-App-with-Massive-Scalability-Getting-Started-with-ScaleBase.pdf




Additional Distributed RDBMS Resources

To develop a custom-made data distribution policy for your RDBMS and application, we also recommend the following resources:

• Four Table Types You Need To Know To Scale Your Relational Database
• Distributed Databases and Cascading Tables
• Discover your Application Scalability Score with ScaleBase Analysis Genie
• Optimizing Sharding Policies to Scale Out MySQL – Choosing the Best Data Distribution Policy (whitepaper)

http://www.scalebase.com/four-table-types-you-need-to-know-to-scale-your-relational-data/

http://www.scalebase.com/discover-your-application-scalability-score-with-scalebase-analysis-genie/

http://www.scalebase.com/resources/white-papers/





ScaleBase Software

• ScaleBase is a distributed database built on MySQL and optimized for the cloud. It deploys in minutes so your database can handle an unlimited number of users, very large volumes of data, and faster transactions.

• It dynamically optimizes workloads and availability by logically distributing data across public, private, and geo-distributed clouds.

http://www.scalebase.com/products/

ScaleBase Software

“What differentiates ScaleBase is its ability to add scalability without the need to migrate to new database architecture or make any changes to existing applications” - Matt Aslett, The 451 Group

“ScaleBase allows us to effectively scale, without downtime, and without having to rewrite our application.” - Sheeri Cabral, Mozilla

Try ScaleBase Today

ScaleBase software is available for free:
• ScaleBase Website
• Amazon Marketplace
• Rackspace Marketplace
• IBM Cloud Marketplace
• ScaleBase’s free online Analysis Genie service

An AWS Marketplace Guide and an AWS Getting Started Tutorial are available from the documentation section of the ScaleBase website.

Contact [email protected]

http://www.scalebase.com/software/

https://aws.amazon.com/marketplace/search/results/ref=dtl_navgno_search_box?page=1&searchTerms=scalebase

https://cloudtools.rackspace.com/apps/1041#!support

https://marketplace.ibmcloud.com/apps/1365?restoreSearch=true#!overview

http://www.scalebase.com/resources/documentation/

Data Distribution Policy: Parts 1 and 3

Data Distribution Policy Part 1:
• What a data distribution policy is
• The challenges faced when data is distributed via sharding
• What defines a good data distribution policy
• The best way to distribute data for your application and workload

Data Distribution Policy Part 3:
• Three stages of your data distribution policy’s lifecycle
• Adapting the distributed RDBMS to match application changes
• Ensuring that your distributed relational database is flexible and elastic enough to accommodate endless growth and change
