Distributed RDBMS: Data Distribution Policy: Part 2 - Creating a Data Distribution Policy
DESCRIPTION
Distributed RDBMSs provide many scalability, availability and performance advantages. This presentation examines the steps to create a customized data distribution policy for your RDBMS that best suits your application’s needs and provides maximum scalability. We will discuss:
1. The different approaches to data distribution
2. How to create your own data distribution policy, whether you are scaling an existing application or creating a new app
3. How ScaleBase can help you create your policy
TRANSCRIPT
Distributed RDBMS
Data Distribution Policy: Part 2
Creating a data distribution policy
October 2014
Why is a Distributed Relational Database Good?
Distributed relational databases are a perfect match for Cloud computing models and distributed Cloud infrastructure.
They are the way forward for delivering web-scale applications while keeping ACID properties.
• Social apps
• Games
• Many concurrent users
• High transaction throughput
• Very large data volumes
What Is a Data Distribution Policy? – Recap
A data distribution policy describes the rules under which data is distributed across a distributed RDBMS (a virtual database made up of many database instances, or “shards”).
A good data distribution policy aims to:
1. Maintain full relational database integrity
2. Distribute workloads in an even and predictable manner
3. Minimize the number of joins across the array of database instances
4. Yield database scalability
Two Broad Types of Data Distribution Policy
1. Arbitrary Distribution: Data is distributed across database instances without any consideration for or understanding of specific application requirements. Arbitrary distribution is often used by NoSQL database technologies.
2. Policy-Based Distribution: Data is distributed across database instances in a way that reflects the application's requirements, data relationships, transaction flows, and how the application reads and writes the data.
Two Broad Types of Data Distribution Policy
Arbitrary Data Distribution Policy
Pros:
• Unsophisticated
• Predetermined (no forethought required)
Cons:
• No intelligence about business, schema, use cases
• Leads to excessive use of database nodes
• Leads to excessive use of network
Declarative Data Distribution Policy
Pros:
• Ensures that a specific transaction finds all the data it needs in one specific database
• Aligns with schema and DB structure
• Highly efficient and scalable
• Anticipates future requirements and growth assumptions
Cons:
• Requires forethought and analysis
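As a rough illustration of the difference between the two approaches, the following Python sketch places hypothetical ‘users’ and ‘orders’ rows on four shards and counts how many orders end up on a different shard from their user. Everything in it is an assumption made for illustration (the shard count, the CRC32 hash, the table shapes, the function names); it is not how any particular product distributes data.

```python
import zlib

NUM_SHARDS = 4  # assumed shard count, for illustration only


def shard_of(key: str) -> int:
    """Map a key to a shard; CRC32 is just a stable stand-in hash."""
    return zlib.crc32(key.encode()) % NUM_SHARDS


# Hypothetical data: 100 users with 3 orders each.
orders = [
    {"order_id": user_id * 10 + n, "user_id": user_id}
    for user_id in range(100)
    for n in range(3)
]


def user_shard(user_id: int) -> int:
    """Users are always placed by their own id."""
    return shard_of(f"user:{user_id}")


def arbitrary_order_shard(order: dict) -> int:
    """Arbitrary distribution: place each order by its own id, ignoring relationships."""
    return shard_of(f"order:{order['order_id']}")


def policy_order_shard(order: dict) -> int:
    """Policy-based distribution: place each order with the user it belongs to."""
    return user_shard(order["user_id"])


def cross_shard_joins(place_order) -> int:
    """Count orders stored on a different shard from their user."""
    return sum(1 for o in orders if place_order(o) != user_shard(o["user_id"]))


print("arbitrary:", cross_shard_joins(arbitrary_order_shard))
print("policy-based:", cross_shard_joins(policy_order_shard))  # always 0
```

Under the policy-based placement, every user/order join can be answered by a single shard. Under the arbitrary placement, many of the same joins would have to cross the network.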
Distributed Databases: NoSQL vs. DRDBMS
• NoSQL databases abandoned the relational model to get the scalability benefits of a distributed database. NoSQL and document-store databases can use arbitrary data distribution because their data model does not provide for joins, sequential integrity or ACID.
• However, today RDBMSs can reach massive web scale and keep the time-tested relational database model, ACID and SQL if you use a declarative, policy-based data distribution approach.
• Academia has written about various types of distributed relational databases for decades, but today they are a reality. Declarative, policy-based data distribution is the way forward.
Two Distributed RDBMS Use Cases
There are two typical development and database scenarios in which relational databases can evolve into modern distributed relational databases:
1. Scaling an existing application
2. Designing scalability into a new application
Scaling an Existing Application: Key Observations and Measurements
Problem: A monolithic MySQL database is suffering from scalability issues:
• Inconsistent performance
• Inconsistent availability
• Transaction throughput bottlenecks
Solution: A distributed MySQL database that retains its relational principles by applying a declarative, policy-based data distribution process.
Scaling an Existing Application: Key Observations and Measurements
In today’s public, private and hybrid cloud world, which leverages distributed infrastructure, scaling up (getting bigger hardware) for an existing database that is reaching its scalability limits is a counterintuitive, temporary and expensive approach.
A good data distribution policy:
1. Transforms a monolithic single-instance MySQL database into a distributed MySQL database that retains its relational principles.
2. Aligns with the application's current database structure and commands. Related data within various tables is identified and grouped so that it stays localized in a single database instance.
3. Ensures “reads” and “writes” can be completed successfully using only data from within one database instance.
Determining your Data Distribution Policy: Reads and Writes
Reads (Queries):
• Examine the data that is accessed in joins, subqueries or unions to find what data ought to be kept together on one machine. This data usually comes from related tables that share the same foreign keys.
Writes (Transactions):
• Additions to the database need to be placed in the appropriate partitioned database instance (or shard) with their related data.
• A transaction is more efficient when it is contained within a single database cluster. This practice eliminates the need for a distributed transaction with two-phase commit.
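The read analysis described above can be sketched as a grouping problem: tables that appear together in joins, subqueries or unions belong in the same co-location group. The Python sketch below assumes the table pairs have already been extracted from a query log (the `observed_joins` list and the function name are hypothetical) and merges them with a small union-find. It is one possible way to do the grouping, not a description of any tool's algorithm.

```python
from collections import defaultdict


def colocation_groups(joined_pairs):
    """Group tables that are joined together, directly or transitively."""
    parent = {}

    def find(table):
        # Return the representative of the table's group (with path halving).
        parent.setdefault(table, table)
        while parent[table] != table:
            parent[table] = parent[parent[table]]
            table = parent[table]
        return table

    for left, right in joined_pairs:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            parent[root_left] = root_right  # merge the two groups

    groups = defaultdict(set)
    for table in list(parent):
        groups[find(table)].add(table)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical table pairs seen together in queries.
observed_joins = [
    ("users", "orders"),
    ("orders", "items"),
    ("articles", "comments"),
]

print(colocation_groups(observed_joins))
# [['articles', 'comments'], ['items', 'orders', 'users']]
```

Each resulting group is a candidate set of tables to keep on the same database instance and to route by a shared distribution key.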
Distribution Example: Reads and Writes
Reads (Queries):
• After identifying the ‘users’ in a database, the next step is to identify the ‘orders’ related to those ‘users’, then the ‘items’ related to those ‘orders’.
Writes (Transactions):
• An ‘order’ is made up of many ‘items’, which are consequently added to the same shard as the ‘order’.
Efficiency dictates that data should be placed so that it can be read together, such as in queries, or written together, such as in transactions. “The data that plays together, should stay together.”
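The ‘users’, ‘orders’ and ‘items’ example can be sketched in a few lines of Python. This is a toy in-memory model under stated assumptions (three shards, a CRC32 hash of the user id as the routing rule, hypothetical function and column names), intended only to show that both the write and the read touch a single shard.

```python
import zlib

NUM_SHARDS = 3  # assumed shard count

# Each shard is modelled as a dict of tables; each table is a list of rows.
shards = [{"users": [], "orders": [], "items": []} for _ in range(NUM_SHARDS)]


def shard_for_user(user_id: int) -> int:
    """Route everything that belongs to a user by that user's id."""
    return zlib.crc32(str(user_id).encode()) % NUM_SHARDS


def add_user(user_id: int, name: str) -> int:
    home = shard_for_user(user_id)
    shards[home]["users"].append({"user_id": user_id, "name": name})
    return home


def place_order(user_id: int, order_id: int, item_names: list) -> int:
    """Write an order and all of its items to the user's shard (one-shard write)."""
    home = shard_for_user(user_id)
    shards[home]["orders"].append({"order_id": order_id, "user_id": user_id})
    for name in item_names:
        shards[home]["items"].append({"order_id": order_id, "name": name})
    return home


def orders_with_items(user_id: int) -> dict:
    """Join a user's orders to their items using data from one shard only."""
    shard = shards[shard_for_user(user_id)]
    return {
        order["order_id"]: [
            item["name"] for item in shard["items"] if item["order_id"] == order["order_id"]
        ]
        for order in shard["orders"]
        if order["user_id"] == user_id
    }


add_user(42, "alice")
home = place_order(user_id=42, order_id=1, item_names=["book", "pen"])
print(orders_with_items(42))  # {1: ['book', 'pen']}
```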
Scaling an Existing Application: Denormalization – Not Recommended
A distribution key is the field according to which data is directed to a database instance. If a table does not contain the distribution key, routing its rows can become very difficult.
• Denormalization adds the distribution key to the tables in which it is missing. However, this creates many additional problems along the way, so it is not recommended.
• ScaleBase’s cascading key lookup solution removes the need for denormalization while efficiently resolving data placement issues.
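To make the idea of resolving a missing distribution key without denormalizing more concrete, here is a minimal Python sketch. It is not ScaleBase’s implementation; the routing metadata, the in-memory parent index and all names are assumptions for illustration. An ‘items’ row has no `user_id` column, so its distribution key is found by following its foreign key to the parent ‘orders’ row.

```python
import zlib

NUM_SHARDS = 4  # assumed shard count

# Hypothetical routing metadata.
# Tables that carry the distribution key directly, and the column holding it:
DIRECT_KEY = {"users": "user_id", "orders": "user_id"}
# Tables that do not: (foreign-key column, parent table) to follow instead.
PARENT_LINK = {"items": ("order_id", "orders")}

# Hypothetical lookup index: parent table -> primary key -> parent row.
rows_by_pk = {"orders": {}}


def resolve_distribution_key(table: str, row: dict):
    """Walk up the parent links until a table that holds the distribution key."""
    while table not in DIRECT_KEY:
        fk_column, parent_table = PARENT_LINK[table]
        row = rows_by_pk[parent_table][row[fk_column]]
        table = parent_table
    return row[DIRECT_KEY[table]]


def shard_of(key) -> int:
    return zlib.crc32(str(key).encode()) % NUM_SHARDS


order = {"order_id": 500, "user_id": 42}
rows_by_pk["orders"][order["order_id"]] = order

item = {"item_id": 1, "order_id": 500}  # note: no user_id column
print(resolve_distribution_key("items", item))  # 42
```

Because the item resolves to the same key as its order, `shard_of` sends both rows to the same shard, and the ‘items’ table keeps its original, normalized shape.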
Scaling an Existing Application: Null Columns
The fields that determine where to route the data and commands cannot be empty (i.e. null) or updated during the life of the row. To ensure this:
• Every piece of data must be “born” with a distribution key that it keeps for the course of its entire life.
• It is not enough to simply have the distribution key column in all tables; it needs to be populated, as part of the data in the table, as well.
• A row can be inserted into a table, updated many times and deleted, but its distribution key stays the same throughout.
• It is vital to insert every row into the database with a populated distribution key.
• If a row is inserted into the database with a ‘null’ shard key, it cannot be placed into the distributed database.
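The two rules above (never null, never updated) can be enforced with a small guard in front of every write. The Python sketch below is one hypothetical way to express them; the exception class, function names and column names are illustrative assumptions.

```python
class DistributionKeyError(ValueError):
    """Raised when a write would leave a row without a stable distribution key."""


def check_insert(row: dict, key_column: str) -> None:
    """Reject rows that are not 'born' with a populated distribution key."""
    if row.get(key_column) is None:
        raise DistributionKeyError(
            f"row has no value for {key_column!r}, so it cannot be routed to a shard"
        )


def check_update(old_row: dict, changes: dict, key_column: str) -> None:
    """Reject updates that would change the distribution key of an existing row."""
    if key_column in changes and changes[key_column] != old_row[key_column]:
        raise DistributionKeyError(
            f"{key_column!r} cannot change during the life of a row"
        )


existing = {"order_id": 1, "user_id": 42, "status": "new"}

check_insert(existing, "user_id")                          # accepted
check_update(existing, {"status": "shipped"}, "user_id")   # accepted

try:
    check_insert({"order_id": 2, "user_id": None}, "user_id")
except DistributionKeyError as error:
    print("rejected:", error)
```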
Automating Data Distribution Analysis: ScaleBase’s Analysis Genie
If you want to add linear scalability to an existing MySQL database, you can use ScaleBase’s free SaaS tool, Analysis Genie.
• The Analysis Genie will help you define the best data distribution policy tailored to your application’s unique requirements.
• The results are based on a guided analysis of the nature of your data, data relationships, and the functional use of your data.
• You can iterate with different policies in a simulated environment to achieve the highest application / distributed database efficiency.
Designing Scalability in a New Application
New web-facing apps have to anticipate millions of users, high transaction rates, and ever-larger data volumes.
• The same data distribution principles that apply to existing applications also apply to new applications and databases.
• Data that is accessed together is stored together on the same database, whether it is for “reads” or “writes”.
Designing Scalability in a New Application (Continued)
When designing a data distribution policy, the distribution key should be selected according to how data will be distributed.
You can then denormalize, adding the distribution key to every table, or distribute by understanding the links between the tables within each shard from the beginning of the design process.
Designing Scalability in a New Application
When designing a database, ask yourself about the life cycle of the rows of your data.
• Are they born with a populated distribution key?
Designing your application so that this is taken care of avoids the unpleasant situation of null shard keys.
Massive Database Scalability With ScaleBase
Analysis tools are not appropriate for new applications, as there is nothing yet for them to track.
For this reason we’ve created a special guide:
• Building a New Application with Massive Database Scalability – Getting Started with ScaleBase
This document demonstrates how to build a new application that plans for massive database scalability right from the start.
It provides a walkthrough of how to create a simple, straightforward RDBMS data distribution policy.
Additional Distributed RDBMS Resources
To develop a custom-made data distribution policy for your RDBMS and application, we also recommend the following resources:
• Four Table Types You Need To Know To Scale Your Relational Database
• Distributed Databases and Cascading Tables
• Discover your Application Scalability Score with ScaleBase Analysis Genie
• Optimizing Sharding Policies to Scale Out MySQL – Choosing the Best Data Distribution Policy (whitepaper)
ScaleBase Software
• ScaleBase is a distributed database built on MySQL and optimized for the cloud. It deploys in minutes so your database can handle an unlimited number of users, humongous volumes of data, and faster transactions.
• It dynamically optimizes workloads and availability by logically distributing data across public, private, and geo-distributed clouds.
ScaleBase Software
“What differentiates ScaleBase is its ability to add scalability without the need to migrate to new database architecture or make any changes to existing applications” - Matt Aslett, The 451 Group
“ScaleBase allows us to effectively scale, without downtime, and without having to rewrite our application.” - Sheeri Cabral, Mozilla
Try ScaleBase Today
ScaleBase software is available for free:
• ScaleBase Website
• Amazon Marketplace
• Rackspace Marketplace
• IBM Cloud Marketplace
• ScaleBase’s free online Analysis Genie service
An AWS Marketplace Guide and an AWS Getting Started Tutorial are available from the documentation section of the ScaleBase website.
Contact [email protected]
Data Distribution Policy: Part 1 and 3
Data Distribution Policy Part 1:
• What a data distribution policy is
• The challenges faced when data is distributed via sharding
• What defines a good data distribution policy
• The best way to distribute data for your application and workload
Data Distribution Policy Part 3:
• Three stages of your data distribution policy’s lifecycle
• Adapting the distributed RDBMS to match application changes
• Ensuring that your distributed relational database is flexible and elastic enough to accommodate endless growth and change