nosql with mysql

NoSQL with MySQL

Yekesa KosuruDistinguished Architect, Nokia

Peter Zaitsev, Percona

Agenda• Part 1 : Problem statement• Part 2 : Background (What is, Why, ACID, CAP)• Part 3 : Building a Key-Value Store• Part 4 : NoSQL with MySQL

PART 1: PROBLEM STATEMENT

NoSQL : No Formal Definition• No Formal Definition

– Not Only SQL ?– Don’t use SQL ?– Limit expressive power ?

• Underlying issues ?– ACID (Strong Consistency ?, Isolation ?)– SQL– Something else ?

• Facts– Hardware will fail– Software will have bugs– Growing traffic, data volumes, unpredictable– No downtime or expose errors to users

Problem to solve• Key Value Store Problem Statement

– A consumer facing system with foll. characteristics :• Host large volumes of data & queries• Responsive with predictable performance• Always available (survives failures, software/hardware upgrades)• Low cost of ownership • Minimal administration

• Availability : How to service customers (24 X 7) despite failures, upgrades• Scalability : How to scale (capacity, transactions, costs…)• Predictability : Predictable responsiveness• Not suited for all applications

PART 2: BACKGROUND

ACID• Transactions simplify job of developers• Client/developer point of view• Atomicity: Bundled operations

– All of the operations (multiple updates, deletes, inserts) in the transaction will complete, or none will

• Consistency: The database is in a consistent state when the transaction begins and ends– Referential integrity, and other constraints– All reads must see latest data

• Isolation: The transaction will behave as if it is the only operation being performed upon the database. Serialized updates i.e locks

• Durability: Upon completion of the transaction, the operation will not be reversed.• ACID in single database is efficient

Issues Scaling DBMS• Databases extend ACID to span multiple nodes using 2PC• Shared disk & shared nothing architectures• Cost of 2PC (N nodes involved in transaction)

– Consistency & Isolation over 2PC makes it expensive– 3N+1 Message Complexity– N+1 Stable Writes – Locks – lock resources

• Availability is a pain point– 2PC is blocking protocol , uses single coordinator and not fault tolerant– Coordinator failure can block transaction indefinitely– Blocked transaction causes increased contention

• Rebalancing• Upgrades• Total Cost of Ownership

RequestCommit

PreparePreparePreparePrepare

PreparePreparePrepareCommit

PreparePreparePreparePrepared

CAP• Introduced by Prof. Brewer in year 2000• Consistency, Availability, (Network) Partition Tolerance

– Industry proved that we can guarantee two out of three• System point of view• Consistency : Defines rules for the apparent order and visibility of updates

– Variations on consistency– Eventual Consistency - eventually replicas will be updated – order of

updates may vary - all reads will eventually see it• Availability : All clients can read & write data to some replica, even in the

presence of failures• Partition-tolerance : operations will complete, even if network is

unavailable i.e disconnected network

PART 3: BUILDING KEY-VALUE STORE

Important Considerations• Functional

– Simple API ( Get, Put, Delete - like Hashtable)– Primary key lookups only (limit expressive power)

• No joins, No non-key lookups, No grouping functions, No constraints …– Data Model (Key-Value, Value = JSON)– Read-Your-Write Consistency

• SLA– Highly Available & Horizontally Scalable– Performance (low latencies & predictable)

• Uniform cost of queries• Operational

– Symmetric, Rolling Upgrades, Backups, Alerts, Monitoring• Total Cost of Ownership (TCO)

– Hardware, Software, Licenses …

Making it Available• Make N replicas of data• Use Quorum (R + W > N)• Overlap R & W to make it strongly consistent• Handle Faults (partition tolerance and self heal)

– Real faults • Machine down, NIC down …• Eventual Consistency (data)• Prefer A & P in CAP• If a node is down during write, read repair later – let write proceed

– Intermittent faults• X and Y are clients of Z; X can talk to Z but Y can’t• Server does not respond within timeout, stress scenarios …• Inconsistent cluster view can cause inconsistencies in data• If two concurrently writing clients have different views of cluster, resolve

inconsistency later - let write proceed

R + W > N2 + 2 > 3

N = Number of replicasR = Number of ReadsW = Number of Writes

Making it Scalable• Sharding (or partition data) works well for Capacity• Large number of logical shards distributed over few physical machines• How to shard

- Consistent Hashing- Hash(key) say “2”, move clockwise to find nearest node “B”

- Hash to create a uniform distribution- Allows for adding of machines

- Without redesign of hashing- Without massive movement of data

- Load balancing- If a machine can’t handle load, move shards to another machine

A,B,C = nodes1,2,3,4 = keys

Vector Clocks• Determine history of a key-value, think data versioning• Need to interpret causality of events, to identify missed updates & detect

inconsistencies• Timestamps are one solution, but rely on synchronized clocks and don't capture

causality• Vector clocks are an alternative method of capturing order in a distributed

system• Node (i) coordinating the write generates a clock at the time of write• Construct: ( nodeid : logical sequence)

– Node A receives first insert : A:1– Node B receives an update : A:2

• Clock: (A:1) is predecessor of (A:1, B:1)• Clock: (A:2) and (A:1, B:1) are concurrent• Reads use vector clocks to detect missed updates & inconsistencies

3 replicasEventual Consistency Example 3/2/2

ServerNode B

Application ServerNode A

Client

PUT

GET

V1 A:1WriteWrite

Read

PUT

PUT

GET

PUT

Write

ServerNode C

WriteV1 A:1

V1 A:1

V2 (A:1,B:1)V2 (A:1,B:1)

ReadV1 A:1

Read

Read Repair

NODE 1 DOWN

NODE 1 UP

Detect Missing Update

V2 (A:1,B:1)

V2 (A:1,B:1)

V2 (A:1,B:1)V2 (A:1,B:1)

OutageUpgrade

Missed update added

PART 4: NOSQL WITH MYSQL

NoSQL with MySQL• Key-Value store implemented on top of MySQL/ InnoDB• MySQL serves as persistence layer, Java code handles the rest• Atomicity limited to one operation (auto commit), Consistency & Isolation limited

to one machine (recall issues with 2PC)• Why MySQL ?

– Easy to install, tune and restore– Fast, Reliable, Proven– Performance :

• MVCC, row level locking• Buffer pool, insert buffering, direct IO

– Data Protection & Integrity :• Automatic crash recovery• Backup, Recovery (full and incremental)

– Leverage expertise across ACID and BASE systems

Implementation• InnoDB Engine (innodb_file_per_table)• Simple schema

– Auto increment PK– App Key, Value– Metadata– Hash of app key (index)

• Clustered Index and Secondary Index• KV use simple queries – no joins or grouping functions• Database is a shard (pick right number of shards)• Backup & Recovery

Tuning• Queries are so simple that tuning is focused on IOPS, Memory, TCP, Sockets,

Timeouts, Connection Pool, JDBC, JVM , Queuing etc – Memory to disk ratio– Connector/J

useServerPrepStmts=true useLocalSessionState = true

cachePrepStmts = truecacheResultSetMetadata = truetcpKeepAlive=truetcpTrafficClass=24autoReconnect=truemaxReconnects=...allowMultiQueries=trueprepStmtCacheSize=…blobSendChunkSize=…largeRowSizeThreshold=…locatorFetchBufferSize=…

• CentOS 5.5, XFS File System

Scalability Test (TPS/IOPS) EC2’s

0 20 40 60 80 100 1200

500

1000

1500

2000

2500

3000

3500

TPS

50% LAT

90% LAT

99% LAT

Server IOPS

TPS

Client Threads

Characteristics

m1.xlarge imageUser Dataset = 256GB (240M 1K values)Machines = 5Spindles=4 per boxXFS File System Buffer Pool=12GB/machineMemory:Disk=1:4Workload R/W=100/0Random Access PatternN/R/W=1/1/1Incremental load to 100 threads

Client TPS 50.00% 90.00% 99.00% 99.50% 99.90% Single ServerThreads LAT LAT LAT LAT LAT IOPS

1 114.4 8 16 24 28 52 262 226.88 8 16 26 30 63 484 437.37 9 17 29 34 63 965 600.1 8 14 27 32 64 10710 1061.45 9 18 35 44 77 19920 1730.69 10 23 52 66 118 30750 2775.85 12 40 98 123 202 482

100 3207.18 15 78 224 274 407 571

Know Your IOPS

Disk IOPS (random) 7.2k rpm SATA 9010k rpm SAS 14015k rpm SAS 180Value SSD 400Intel X25-M 1500Intel X25-E 5000DDR Drive x1 300000

Performance Test

1458 2393 3328 4263 5198 6132 7066 8000 8937 98630

0.1

0.2

0.3

0.4

0.5

0.6

50% 90% 99%

Characteristics

User Dataset = 900GBMachines = 9Spindles=8 (15K)/machine XFS File System (RAID0)Buffer Pool=16GB/machineMemory:Disk=1:6Cache=16%Workload R/W=100/0Recency Skewi.e Recent Access=80/20N/R/W=3/2/2Incremental load to 10K (TPS)

SECS

TPS

Tips By Peter ZaitsevPercona

Why XFS

File System Threads(1) Threads(16)

EXT3

RNDRD 243 582

RNDWR 219 218

Total 462 800

XFS

RNDRD 239 552

RNDWR 205 408

Total 444 960

MySQL for NoSQL Tune for Simple Queries Memory is the most important

innodb_buffer_pool_size Are we looking at intensive writes ?

innodb_log_file_size RAID with BBU XFS + O_DIRECT (avoid per inode locking)

Restrict Concurrency on database Connection pool size or innodb_thread_concurrency

More on Scalability Is contention problem

Use MySQL 5.5 or Percona Server Tune innodb_buffer_pool_instances

Index update contention Use Multiple Tables

Or MySQL Partitions if you can't Eliminate Query Parsing Overhead

Consider HandlerSocket NoSQL Interface to InnoDB Storage Engine

Work on Result Stability Ensure you're getting stable results

InnoDB Flushing can cause non uniform performance

[email protected]

THANK YOU

mailto:[email protected]

nosql with mysql

Documents

keyvalue storepart

building keyvalue store

data volumes

mysql nosql

nonkey lookups

hashtableprimary key

acid strong consistency

problem statementpart