cloudcon east presentation

Scaling with HiveDB

The Problem• Hundreds of

millions of products

• Millions of new products every week

• Accelerating growth

Scaling up is no longer an option.

Requirements

➡ OLTP optimized➡ Constant response time is more

important than low-latency➡ Expand capacity online➡ Ability to move records➡ No re-sharding

2 years agoClustered Non-relational Unstructured

Oracle RACMS SQL ServerMySQL Cluster

Teradata (OLAP)

HadoopMogileFS

S3

TodayClustered Non-relational Unstructured

Oracle RACMS SQL ServerMySQL Cluster

DB2Teradata (OLAP)

HiveDB

HypertableHBase

CouchDBSimpleDB

HadoopMogileFS

S3

System Storage Interface Partitioning Expansion Node Types Maturity

Oracle RAC Shared SQL TransparentNo

downtime Identical 7 years

MySQL Cluster Memory / Local SQL Transparent

Requires Restart Mixed 3 years

Hypertable Local HQL TransparentNo

downtime Mixed?

(released 2/08)

DB2 Local SQL Fixed HashDegraded Performan

ceIdentical

3 years (25 years

total)

HiveDB Local SQL Key-basedDirectory

No downtime Mixed 18 months

(+13 years!)

Operating a distributed system is hard.

Non-functional concerns quickly dominate functional concerns.

ReplicationFail overMonitoring

Our approach:minimize cleverness

Build atop solid, easy to manage technologies.

So we can get back to selling t-shirts.

MySQL is fast.

MySQL is well understood.

MySQL replication works.

MySQL is free.

The same for JAVA

The simplest thing

➡ HiveDB is a JDBC Gatekeeper for applications.

➡ Lightest integration➡ Smallest possible implementation

The eBay approach

➡ Pro: Internally consistent data➡ Con: Re-sharding of data required➡ Con: Re-sharding is a DBA

operation

Partition by keyA simple bucketed partitioning scheme expressed in code.

Partition by key

Directory

➡ No broadcasting➡ No re-partitioning➡ Easy to relocate records➡ Easy to add capacity

Disadvantages

➡ Intelligence moved from the database to the application (Queries have to be planned and indexed)

➡ Can’t join across partitions➡ NO OLAP! (We consider this a

benefit.)

Hibernate ShardsPartitioned Hibernate from Google

Why did we build this thing again?

Oh wait, you have to tell it where things are.

Benefits of Shards

➡ Unified data access layer➡ Result set aggregation across

partitions➡ Everyone in the JAVA-world knows

Hibernate.

Working on simplifying it even further.

But does it go?

Case Study: CafePress➡ Leader in User-generated

Commerce➡ More products than eBay

(>250,000,000)➡ 24/7/365

Case Study: Performance Requirements

➡ Thousands of queries/sec➡ 10 : 1 read/write ratio➡ Geographically distributed

Case Study:Test Environment

➡ Real schema ➡ Production-class hardware➡ Partial data (~40M records)

[email protected] Modified on April 09 2007

CafePress HiveDB 2007Performance Test Environment

JMeter (100s of threads)

web s

erv

ice

(hiv

edb)

Dell 1950 / 2x2 Xeon

Tomcat 5.5

Hardware LB

Directory Partition 0 Partition 1

client.jar

data

bases

(mysql)


Tomcat 5.5


16GB, 6x72GB 15k


16GB, 6x72GB 15k


16GB, 6x72GB 15k


Tomcat 5.5


16GB, 6x72GB 15k

JMeter (100s of threads)

client.jar


16GB, 6x72GB 15k

load g

enera

tors

48GB backplane non-

blocking gigabit switch

100MBit switch

com

mand &

contr

ol

JMeter (1 thread)

client.jar

Measurement

Workstation

JMeter (no threads)

client.jar

Test Controller

Workstation

Case Study:Performance Goals

➡ Large object reads: 1500/s

➡ Large object writes: 200/s

➡ Response time: 100ms

Case Study:Performance Results

➡ Large object reads: 1500/sActual result: 2250/s

➡ Large object writes: 200/sActual result: 300/s

➡ Response time: 100msActual result: 8ms

Case Study:Performance Results

➡ Max read throughputActual result: 4100/s(CPU limited in Java layer; MySQL <25% utilized)

Case Study:Results➡ Billions of queries served➡ Highest DB uptime at CafePress➡ Hundreds of millions of updates

performed

Show don’t tell!

High Availability

➡ We don’t specify a fail over strategy.

➡ We don’t specify a backup or replication strategy.

Fail Over Options

➡ Load balancer➡ MySQL Proxy➡ Flipper / MMM➡ HAProxy

Replication

➡ You should use MySQL replication.➡ It really works.➡ You can add in LVM snapshots for

even more disastrous disaster recovery.

The BottleneckThere is one problem with HiveDB. The directory is a scaling bottleneck.

There’s no reason it has to be a database.

MemcacheD

➡ Hive directory implemented atop memcached

Roadmap

➡ MemcacheD directory➡ Simplified Hibernate support➡ Management web service➡ Command-line tools➡ Framework Adapters

Call for developers!

➡ Grails (GORM)➡ ActiveRecord (JRuby on Rails)➡ Ambition (JRuby or web service)➡ iBatis➡ JPA➡ ... Postgres?

Contributing

➡ Post to the mailing list http://groups.google.com/group/hivedb-dev

➡ Comment on our sitehttp://www.hivedb.org

➡ Submit a patch / pull requestgit clone git://github.com/britt/hivedb.git

http://groups.google.com/group/hivedb-dev

http://groups.google.com/group/hivedb-dev

http://www.hivedb.org

http://www.hivedb.org

Photo Credits➡ http://www.flickr.com/photos/7362313@N07/1240245941/

sizes/o

➡ http://www.flickr.com/photos/99287245@N00/2229322675/sizes/o

➡ http://www.flickr.com/photos/51035555243@N01/26006138/

➡ http://www.flickr.com/photos/vgm8383/2176897085/sizes/o/

➡ http://t-shirts.cafepress.com/item/money-organic-cotton-tee/73439294

http://www.flickr.com/photos/7362313@N07/1240245941/sizes/o








http://www.flickr.com/photos/51035555243@N01/26006138/




http://www.flickr.com/photos/vgm8383/2176897085/sizes/o/




Blobject

➡ Gets around the problem of ALTER statements. No data set of this size can be transformed synchronously.

➡ The hive can contain multiple versions of a serialized record.

➡ Compression

Hadoop!

➡ Query and transform blobbed data asynchronously.

cloudcon east presentation

Technology

case study

2x2 xeondell

s response time

2x2 xeontomcat

2x2 xeon16gb

2x2 xeon dell

s of threads jmeter

hivedb dell