MySQL Scaling and High Availability Architectures

Jeremy Cole <[email protected]>
Eric Bergen <[email protected]>

TRANSCRIPT

Page 1: MySQL Scaling and High Availability Architectures

MySQL Scaling and High Availability Architectures

Jeremy Cole <[email protected]>

Eric Bergen <[email protected]>

Page 2: MySQL Scaling and High Availability Architectures

Who are we?
• Proven Scaling is a consulting company founded in 2006 by Eric and Jeremy, specializing in MySQL
• We primarily deal with architecture and design for large, scalable systems
• We also do training, DBA work, custom MySQL features, etc.
• Jeremy: optimization, architecture, performance
• Eric: operations, administration, monitoring

Page 3: MySQL Scaling and High Availability Architectures

Overview
• What’s the problem?
• Basic Tenets of Scaling and High Availability
• Lifetime of a Scalable System
• Approaches to Scaling
• Approaches to High Availability
• Tools and Components

Page 4: MySQL Scaling and High Availability Architectures

What’s the problem?
• Internet-age systems can grow (or be forced to choose between growth and death) very quickly
• No matter what you plan for or predict, users will always surprise you
• Mobs, err, valued users can be very annoying sometimes (e.g. “biggest group ever” logic)
• Users may have vastly different usage patterns
• Web 2.0™ (blechhhh!) sites have changed the world of scaling; it’s much harder now
• Everyone (your VCs included) expects you to be Web 2.0® compliant™

Page 5: MySQL Scaling and High Availability Architectures

Basic Tenets
• Don’t design scalable or highly available systems:
  - Using components you do not control or that have loose tolerances (e.g. DNS)
  - Using processes with potentially ugly side effects (e.g. code changes to add a new server) [Yes, configuration files are very often “code”]
• If a user doesn’t think/notice something is down, it’s not really “down”
• Eliminate (or limit) single points of failure -- if you have only one of any component, examine why
• Cache everything

Page 6: MySQL Scaling and High Availability Architectures

Lifetime of a Scalable System

Page 7: MySQL Scaling and High Availability Architectures

Newborn
• Shared hosting
• Might start worrying (a little bit) about query optimization at this point
• Don’t have much control over configuration
• Overall performance may be poor
• Traffic picks up, and performance is bad... What do we do about it?

Page 8: MySQL Scaling and High Availability Architectures

Toddler
• A single (dedicated) server for everything
• MySQL and Apache etc., competing for resources
• MySQL needs memory for caching data
• Apache (and especially PHP etc.) needs lots of memory for handling requests
• Memory contention will be the first major bottleneck

Page 9: MySQL Scaling and High Availability Architectures

Child
• Separate web servers and database server
• Usually go ahead and get multiple web servers now, since it’s easy
• Get a single database server, since it’s hard -- maybe better hardware?
• Now we need to do session management across web servers… hmm, we have this nice database…
• Other load same as before, but now with added network overhead
• Single database server becomes your biggest bottleneck

Page 10: MySQL Scaling and High Availability Architectures

Teenager
• “Simple” division of load by moving tables or processes
• Use replication to move reporting off production
• Move individual tables or databases to lighten load
• Use replication to move reads to slaves
• Modify code to know where everything is
• Still too many writes in some parts of the system
• Replication synchronization problems mean either annoying users or writing lots of code to work around the problem
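The read-to-slave split above can be sketched as a tiny query router. This is a minimal illustration, not a real driver: `QueryRouter` is a hypothetical helper, and the node names stand in for actual database connections.

```python
import random

class QueryRouter:
    """Route writes to the master and ordinary reads to a slave.

    Hypothetical sketch of read/write splitting; `master` and
    `slaves` are stand-in node names, not real connections.
    """

    WRITE_VERBS = {"INSERT", "UPDATE", "DELETE", "REPLACE", "ALTER"}

    def __init__(self, master, slaves):
        self.master = master
        self.slaves = slaves

    def route(self, sql, critical_read=False):
        verb = sql.lstrip().split(None, 1)[0].upper()
        # Writes -- and "critical" reads that must not see stale data,
        # since replication is asynchronous -- go to the master.
        if verb in self.WRITE_VERBS or critical_read:
            return self.master
        # Ordinary reads are spread across the slaves.
        return random.choice(self.slaves)
```

The `critical_read` flag is the escape hatch for the synchronization problem the slide mentions: any read that cannot tolerate replication delay is forced back to the master.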

Page 11: MySQL Scaling and High Availability Architectures

Late teens to 20s
• The “awkward” stage
• This is where many applications (and sometimes entire companies) die by making bad decisions
• Death can be slow (becoming irrelevant due to speed or lack of scalability) or quick (a massive meltdown losing user confidence)
• Managing the move from teenager into adulthood is often the first real project requiring specs and real processes to do it right
• Downtime at this point is hard to swallow due to the size of the userbase

Page 12: MySQL Scaling and High Availability Architectures

Adult
• Scalable system that can grow for a long time, generally based on data partitioning
• Most improvements now are incremental
• System is built to allow incremental improvements without downtime
• A lot has been learned from the successful transition to adulthood

Page 13: MySQL Scaling and High Availability Architectures

Data Partitioning: The only game in town

Page 14: MySQL Scaling and High Availability Architectures

What is partitioning?
• Distributing data on a record-by-record basis
• Usually a single basis for distributing records in each data set is chosen: a “partition key”
• An application may have multiple partition keys
• Each node has all related tables, but only a portion of the data

Page 15: MySQL Scaling and High Availability Architectures

Partitioning Models
• Fixed “hash key” partitioning
• Dynamic “directory” partitioning
• Partition by “group”
• Partition by “user”

Page 16: MySQL Scaling and High Availability Architectures

Partitioning Difficulties
• Inter-partition interactions are a lot more difficult
• Example: Partitioning by user, where do we store a message sent from one user to another? How about a friend list?
• Overall reporting becomes more difficult
• Example: Find the average number of friends a user has, by state…
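A report like “average friends per user, by state” that was one query on a single server becomes a scatter-gather job across partitions. A minimal sketch, assuming each partition’s rows are already fetched into plain dicts (the per-shard query layer is elided):

```python
from collections import defaultdict

def average_friends_by_state(partitions):
    """Scatter-gather aggregation across partitions.

    `partitions` is a list of per-shard result sets, each a list of
    {"state": ..., "friend_count": ...} rows -- stand-ins for what
    each node's query would return.
    """
    totals = defaultdict(lambda: [0, 0])  # state -> [friend_sum, user_count]
    # Scatter: accumulate partial sums from each partition.
    for rows in partitions:
        for user in rows:
            t = totals[user["state"]]
            t[0] += user["friend_count"]
            t[1] += 1
    # Gather: merge partials into final per-state averages.
    return {state: friend_sum / users
            for state, (friend_sum, users) in totals.items()}
```

The point is that the merge step now lives in application code (or an aggregation layer), and every partition must be queried even for a small report.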

Page 17: MySQL Scaling and High Availability Architectures

Partition by …
• Partitioning by user (or equivalent) allows for the most flexibility in most applications
• In many cases it may make sense to partition by group, if most (or all) interactions between users are within that group
• You could also get most of the same benefits of partitioning by group by partitioning by user with an affinity based on group

Page 18: MySQL Scaling and High Availability Architectures

Fixed Hash Key
• Divide the data into B buckets
• Divide the B buckets over M machines
• Example: Define 1024 user buckets 0..1023 based on (user_id % 1024) for 4 physical servers, so each server gets 256 of the buckets by range: 0-255, 256-511, 512-767, 768-1023
• Problem: Moving entire buckets means affecting 1/B of your users at a time in the best case… in simple implementations you may have to affect 1/M or 2/M of your users
• Problem: The bucket-to-machine mapping must be stored somewhere (usually in code) and updated atomically
• Problem: You have no control over which bucket (and thus machine) a given user is assigned to

Page 19: MySQL Scaling and High Availability Architectures

Dynamic Directory
• A “directory” server maintains a database of mappings between users and partitions
• A user is assigned (often randomly) to one partition and that mapping is stored
• Any user may be moved later by locking the user, moving their data, and updating their mapping in the directory
• Solution: Only single users are affected by any repartitioning that must be done
• Solution: Partitions may be rebalanced user-by-user at any time

Page 20: MySQL Scaling and High Availability Architectures

Custom Solutions

Page 21: MySQL Scaling and High Availability Architectures

Custom Solutions
• It’s very easy to implement simple hash key partitioning to get data distributed
• It’s much more difficult to be able to re-partition
• It’s difficult to grow

Page 22: MySQL Scaling and High Availability Architectures

Hibernate Shards

Page 23: MySQL Scaling and High Availability Architectures

Hibernate Shards
• Sort of a merge between fixed key partitioning and directory-based partitioning
• “Virtual Shards” abstract the mapping of objects to shards, but simplistically
• It’s still painful to repartition
• It doesn’t handle rebalancing at all currently
• It doesn’t handle aggregation at all

Page 24: MySQL Scaling and High Availability Architectures

HiveDB

Page 25: MySQL Scaling and High Availability Architectures

HiveDB Project
• HiveDB is an Open Source project to design and implement the entire “standard” partition-by-key MySQL system in Java
• Originally envisioned by Jeremy while working with several customers
• Implemented by Fortress Consulting and CafePress along with help and guidance from Proven Scaling
• Many companies have built somewhat similar systems, but nobody has really open sourced them

Page 26: MySQL Scaling and High Availability Architectures

Why HiveDB?
• Many solutions that exist only solve the easy part: storing and retrieving data across many machines
• Nobody really touches the hard part: being able to rebalance and move users on the fly

Page 27: MySQL Scaling and High Availability Architectures

Server Architecture
• Hive Metadata
  - Partition definition
• Directory
  - Partition Key -> Partition mapping
  - Secondary Key -> Partition Key mapping
• Hive Queen - makes management and rebalancing decisions
• Job Server (Quartz) - actually executes tasks
• Aggregation Layer (future)

Page 28: MySQL Scaling and High Availability Architectures

Client Architecture
• Client uses the Hive API to request a connection for a certain partition key
• Client uses those direct connections to do work
• The Hive API should be written in each development language as necessary

Page 29: MySQL Scaling and High Availability Architectures

High Availability

Page 30: MySQL Scaling and High Availability Architectures

Goals
• Avoid downtime due to failures
• No single point of failure
• Extremely fast failover
• No dependency on DNS changes
• No dependency on code changes
• Allow for painless, worry-free “casual failovers” to upgrade, change hardware, etc.
• Fail-back must be just as painless

Page 31: MySQL Scaling and High Availability Architectures

MySQL Replication

Page 32: MySQL Scaling and High Availability Architectures

Basics
• MySQL replication is master-slave, one-way, asynchronous replication
• “Master” keeps logs of all changes – called “binary logs” or “binlogs”
• “Slave” connects to the master through the normal MySQL protocol on TCP port 3306
• Slave requests binary logs from the last position
• Master sends binary logs up to the current time
• Master keeps sending binary logs in real-time

Page 33: MySQL Scaling and High Availability Architectures

More Basics
• Replication works with all table types and (mostly) all features
• Any “critical” reads -- ones that cannot be allowed to return stale data -- must be done on the master; replication is asynchronous, so there may be a delay at any time

Page 34: MySQL Scaling and High Availability Architectures

Typical Setup
• One master (single source of truth)
• Any number of slaves
• Slaves are used for reads only
• All writes go to the master
• There are many other possibilities…

Page 35: MySQL Scaling and High Availability Architectures

Replication Topologies

Page 36: MySQL Scaling and High Availability Architectures

Master with One Slave

[Diagram: Master -> Slave]

Page 37: MySQL Scaling and High Availability Architectures

Master with Many Slaves

[Diagram: Master -> five Slaves]

Page 38: MySQL Scaling and High Availability Architectures

Master with Relay Slave

[Diagram: Master -> Relay Slave -> Slave]

Page 39: MySQL Scaling and High Availability Architectures

Master with Relay and Many Slaves

[Diagram: Master -> Relay Slave -> five Slaves]

Page 40: MySQL Scaling and High Availability Architectures

Master with Many Relays

[Diagram: Master -> several Relay Slaves, each feeding its own set of Slaves]

Page 41: MySQL Scaling and High Availability Architectures

Dual Masters

[Diagram: Master <-> Master]

Page 42: MySQL Scaling and High Availability Architectures

Dual Masters with Slaves

[Diagram: Master <-> Master, each side feeding Slaves, including a Relay Slave with its own Slaves]

Page 43: MySQL Scaling and High Availability Architectures

Ring (Don’t Use)

[Diagram: three Masters replicating in a ring]

Page 44: MySQL Scaling and High Availability Architectures

High Availability Options

Page 45: MySQL Scaling and High Availability Architectures

Dual Master
• Two machines with independent storage configured as master and slave of each other
• Optionally: any number of slaves, for reads only
• Manual (scripted) or automatic (heartbeat-based) failover is possible

Page 46: MySQL Scaling and High Availability Architectures

Dual Master Pros
• Very simple configuration
• Simple to understand = simple to maintain
• Very similar to the basic master-slave configuration that many are familiar with
• Allows easy failover in either direction without reconfiguration or rebuilding
• Allows for easy and reliable failover for non-emergency situations: upgrades, schema changes, etc.
• Allows for quick failover in an emergency
• Can work between distant sites fairly easily

Page 47: MySQL Scaling and High Availability Architectures

Dual Master Cons
• Does not help scale writes (no, not at all)
• Limited to two sites; replication does not allow multiple masters, so three or more is not possible
• Replication is asynchronous and may get behind -- there is always a chance of data loss (albeit small)

Page 48: MySQL Scaling and High Availability Architectures

SAN
• Shared storage of a single set of disks by two MySQL servers, with a single copy of the data on a FibreChannel or IP/iSCSI SAN
• Automatic (heartbeat) failover by fencing and mounting the SAN on the other machine

Page 49: MySQL Scaling and High Availability Architectures

SAN Pros
• A single copy of the data means lower storage cost for extremely large databases
• No worries about replication getting behind
• SAN systems can achieve very high performance for the same or lower cost as two very large RAIDs

Page 50: MySQL Scaling and High Availability Architectures

SAN Cons
• A single copy of the data means corruption is possible, and could be very damaging
• For medium or small databases, cost can be prohibitive
• FibreChannel requires additional infrastructure often not present in typical MySQL systems; iSCSI can be very helpful in this regard
• Single copy of the data -- no schema change tricks are possible

Page 51: MySQL Scaling and High Availability Architectures

DRBD
• Block device-level replication between two machines with their own independent storage (mirrors of the same data)
• Automatic (heartbeat-based) failover by fencing and mounting the local copy of the filesystem is typical

Page 52: MySQL Scaling and High Availability Architectures

DRBD Pros
• Simple hardware and infrastructure using locally-attached RAID
• No expensive hardware or network

Page 53: MySQL Scaling and High Availability Architectures

DRBD Cons
• Complex configuration and maintenance
• May cause performance problems, especially if poorly configured
• Failure of, or problems with, the mirror can cause problems in production
• From the software perspective, there is still a single copy of the data, which may get corrupted
• Single copy of the data -- no schema change tricks are possible

Page 54: MySQL Scaling and High Availability Architectures

Putting It All Together

Page 55: MySQL Scaling and High Availability Architectures

Partitioning + HA
• No partitioning solutions really address HA… they treat the “shards” or “partitions” as single MySQL servers
• In reality, you would implement an HA solution for each partition
• There are many possibilities

Page 56: MySQL Scaling and High Availability Architectures

HiveDB + Dual Master
• We recommend HiveDB plus Dual Master for most installations
• While not technically perfect, and with a chance of data loss, administrative tasks are very simple
• Additionally, LVM for volume management gives the ability to take snapshot backups easily

Page 57: MySQL Scaling and High Availability Architectures

Any questions? Discussion!