Which Freaking Database Should I Use? Andrew C. Oliver Open Software Integrators www.osintegrators.com @osintegrators

Upload: couchbase

Post on 20-Aug-2015


Which Freaking Database Should I Use?

Andrew C. Oliver Open Software Integrators

www.osintegrators.com @osintegrators

Andrew C. Oliver

•  Programming since I was about 8
•  Java since ~1997
•  Founded POI project (currently hosted at Apache) with Marc Johnson ~2000
    o  Former member Jakarta PMC
    o  Emeritus member of Apache Software Foundation
•  Joined JBoss ~2002
•  Former Board Member/current helper/lifetime member: Open Source Initiative (http://opensource.org)
•  Column in InfoWorld: http://www.infoworld.com/author-bios/andrew-oliver
    o  I make fanboys cry.

Open Software Integrators

•  Founded Nov 2007 by Andrew C. Oliver (me)
    o  in Durham, NC
•  Revenue and staff have at least doubled every year since 2009.
•  New office (2012) in Chicago, IL
    o  we're hiring mid to senior level as well as UI Developers (JQuery, Javascript, HTML, CSS)
    o  up to 25% travel, salary + bonus, 401k, health, etc.
    o  preferred: Java, Tomcat, JBoss, Hibernate, Spring, RDBMS, JQuery
    o  nice to have: Hadoop, Neo4j, Couchbase, Ruby, at least one Cloud platform

Overview

•  Why not just use the RDBMS for everything?
•  Operational vs Analytical
•  Key Value
•  Column Family
•  Document
•  Graph
•  Hadoop?
•  Convergence of "clustered filesystems" and "databases"
•  Conclusions

Why not "just use" the RDBMS for everything?

Before we begin...

•  Let's handle the elephant, or rather the teddy bears, in the room: http://highscalability.com/blog/2010/9/5/hilarious-video-relational-database-vs-nosql-fanbois.html/

The CAP theorem

RDBMS CAP characteristics

•  Great at consistency
•  Okay at availability
•  Not so great at partition tolerance...

Single process model

•  Lots of servers with many connections to few servers.

Many process model

•  Lots of servers with many connections to as many servers as we need.

Historical Scalability

•  10 MB disks were "big"
•  Scalability meant more disks, controllers and possibly CPUs
•  CPUs went from 4.77 MHz to 3.4 GHz
•  Disks went from 64kps@70ms to 6 Gb/s
•  Network speeds went from under 4 Mb to gigabit to bonded gigabit and beyond.
•  Disk speeds for a long time didn't keep up with CPU...

The Mathematical model

•  RDBMS is based on "Relational Algebra", which is just an extension of basic "set theory"
•  Not every problem is a set problem: "direct path" or "which thing contains this other thing which has this other thing" (foaf)
•  Sometimes relationships are as important as the data
•  Sometimes data is even simpler than the relational model but needs higher levels of availability, etc.
•  One size never really did fit all

Data Complexity


Datarrhea

•  Yes I've already registered that ;-)
•  The cheapness of storing data has yielded more demand
    o  economics predicted this
•  Moore's law ended while you slept
    o  Intel says next year (but when did CPU speeds last double?)
•  Massive parallelization is the most feasible way to get at it (counter trended with an explosion in disk speeds)

...but

•  If
    o  your data is tabular;
    o  fits cleanly in a relational model;
    o  you aren't having scalability issues;
    o  you don't have a large dataset; or
    o  a dataset/problem that lends itself to massive parallelization...
•  you can probably stick with your RDBMS for now
    o  ...and probably aren't at this conference anyhow.

JPA/RDBMS Tables Example

PersonID  Firstname  Lastname  CompanyID
2         Andy       Oliver    3

CompanyID  Name                       City    State
3          Open Software Integrators  Durham  NC

PhoneNumber   Type    PersonID
919.627.1236  google  2
919.321.0119  work    2
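Reading one person back out of the three tables above takes two joins. Here is a minimal sketch using Python's built-in sqlite3 module; the table names (person, company, phone) and snake_case column names are assumptions, since the slide shows only the column headings and rows.

```python
import sqlite3

# In-memory database; table and column names are assumed for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE company (company_id INTEGER PRIMARY KEY, name TEXT, city TEXT, state TEXT);
CREATE TABLE person  (person_id INTEGER PRIMARY KEY, firstname TEXT, lastname TEXT,
                      company_id INTEGER REFERENCES company(company_id));
CREATE TABLE phone   (phone_number TEXT, type TEXT,
                      person_id INTEGER REFERENCES person(person_id));
INSERT INTO company VALUES (3, 'Open Software Integrators', 'Durham', 'NC');
INSERT INTO person  VALUES (2, 'Andy', 'Oliver', 3);
INSERT INTO phone   VALUES ('919.627.1236', 'google', 2);
INSERT INTO phone   VALUES ('919.321.0119', 'work', 2);
""")

# One logical "person with phones" spans three tables, so we join them back together.
rows = conn.execute("""
SELECT p.firstname, p.lastname, c.name, ph.phone_number, ph.type
FROM person p
JOIN company c ON c.company_id = p.company_id
JOIN phone ph  ON ph.person_id = p.person_id
ORDER BY ph.type
""").fetchall()

for row in rows:
    print(row)
```

The query returns one row per phone number, with the person and company columns repeated on each row.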

Operational vs Analytical


•  One DB type is unlikely to be well suited for all of your problems.

•  The system doing "short and sweet" "lightweight" transactions is your operational system.

•  The system doing long running reports and generating charts and graphs and statistics is your analytical system.

•  There is also search. There are recommendation engines, etc.

Other types of databases

Key-Value Stores

•  Examples: Couchbase 1.8, Cassandra
    o  also: Gemfire, Infinispan (distributed caches)
•  Constant Time O(1) - Lookup by key
•  Good enough for "right now" stock quotes
•  Usually combined with an index for search, but the structure isn't inherently indexed.
•  Generally works well with Map Reduce.
•  Extremely scalable, easy to partition
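To make "lookup by key" and "easy to partition" concrete, here is a toy sketch in Python. It is not how Couchbase or Cassandra are implemented; the class name, the partition count and the "quote:XYZ" key format are all made up for illustration. Each key hashes to exactly one partition, so a lookup touches a single hash map.

```python
import hashlib


class PartitionedKVStore:
    """Toy key-value store: each key hashes to one of N in-memory partitions."""

    def __init__(self, partitions=4):
        self.partitions = [dict() for _ in range(partitions)]

    def _partition_for(self, key):
        # Use a stable hash so the same key always lands on the same partition.
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return self.partitions[int(digest, 16) % len(self.partitions)]

    def put(self, key, value):
        self._partition_for(key)[key] = value

    def get(self, key, default=None):
        # Average-case constant time: one hash, one dict lookup.
        return self._partition_for(key).get(key, default)


store = PartitionedKVStore()
store.put("quote:XYZ", 42.17)        # hypothetical "right now" stock quote
latest = store.get("quote:XYZ")
missing = store.get("quote:NOPE")
```

Note what is absent: there is no way to ask "which keys have a value above 40" without scanning every partition, which is why these stores are usually paired with a separate index for search.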

Column Family / Big Table

•  Many Key-Value stores support "column families"
    o  Cassandra
•  Some were designed this way
    o  HBase
•  Keys and values become composite
•  essentially a hashmap with a multi-dimensional array
    o  each column is a row of data
•  map-reduce friendly
•  Stock quote with time ranges

HBase Example

Row key                               First name  Last name  Company                    City    State  Phone number  Phone type
5bfbd4a0-d02a-11e1-9b23-0800200c9a66  Andy        Oliver     Open Software Integrators  Durham  NC     919-627-1236  google
7b2435c0-d02a-11e1-9b23-0800200c9a66  Andy        Oliver     Open Software Integrators  Durham  NC     919-321-0119  work
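The "hashmap with a multi-dimensional array" idea and the "stock quote with time ranges" use case can be sketched in plain Python. This is only a model of the data layout, not the HBase client API. The column family names ("person", "phone", "quote"), the "SYMBOL#timestamp" composite key format and the quote values are assumptions for illustration.

```python
from bisect import bisect_left

# row key -> column family -> column -> value (denormalized, one row per phone)
table = {
    "5bfbd4a0-d02a-11e1-9b23-0800200c9a66": {
        "person": {"firstname": "Andy", "lastname": "Oliver"},
        "phone": {"number": "919-627-1236", "type": "google"},
    },
}

# Time series: a composite row key "SYMBOL#timestamp" keeps a symbol's quotes adjacent.
quotes = {}


def put_quote(symbol, timestamp, price):
    quotes[f"{symbol}#{timestamp}"] = {"quote": {"price": price}}


def scan(symbol, start, end):
    """Return (row key, price) pairs with start <= timestamp < end, in key order."""
    keys = sorted(quotes)  # a real column-family store keeps rows sorted by key
    lo = bisect_left(keys, f"{symbol}#{start}")
    hi = bisect_left(keys, f"{symbol}#{end}")
    return [(k, quotes[k]["quote"]["price"]) for k in keys[lo:hi]]


put_quote("XYZ", "2012-07-17T09:30", 10.0)
put_quote("XYZ", "2012-07-17T09:31", 10.5)
put_quote("XYZ", "2012-07-17T09:32", 10.25)
put_quote("ABC", "2012-07-17T09:31", 99.0)

window = scan("XYZ", "2012-07-17T09:30", "2012-07-17T09:32")
```

Because the rows are ordered by key, a time range becomes a contiguous slice rather than a search, which is what makes this layout good for series data.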

Document databases

•  Many developers think these are the "holy grail" since they fit nicely with object-oriented programming.
•  Couchbase 2.0, CouchDB, MongoDB
•  JSON documents
•  One way to think of this is a Key-Value store that understands the values.
•  Not as map-reduce friendly; larger datasets require indexes.
•  Clear fit for REST services and as an operational store

•  JSON document:

    {
      "firstname" : "Andy",
      "lastname" : "Oliver",
      "company" : "Open Software Integrators",
      "location" : { "city" : "Durham", "state" : "NC" },
      "phone" : [
        { "number" : "123 456 7890", "type" : "mobile" },
        { "number" : "123 654 1234", "type" : "work" }
      ]
    }
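"A Key-Value store that understands the values" can be illustrated with the document above. In this rough Python sketch the store is a plain dict, and the document key "person::andy" and the find helper are hypothetical. A real document database would answer such a query from an index rather than a scan.

```python
import json

doc = json.loads("""
{
  "firstname": "Andy",
  "lastname": "Oliver",
  "company": "Open Software Integrators",
  "location": {"city": "Durham", "state": "NC"},
  "phone": [
    {"number": "123 456 7890", "type": "mobile"},
    {"number": "123 654 1234", "type": "work"}
  ]
}
""")

# Still a key-value store underneath: key -> document (the key is made up).
documents = {"person::andy": doc}


def find(documents, path, expected):
    """Return keys of documents whose value at the dotted `path` equals `expected`."""
    hits = []
    for key, value in documents.items():
        node = value
        for part in path.split("."):
            node = node.get(part) if isinstance(node, dict) else None
        if node == expected:
            hits.append(key)
    return hits


in_durham = find(documents, "location.city", "Durham")
work_numbers = [p["number"] for p in doc["phone"] if p["type"] == "work"]
```

The whole person, including location and phones, comes back in one read with no joins, which is why the model maps so easily onto objects.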

Graph Databases

•  Based on Graph Theory
•  Less about volume of the data and more about complexity
•  Many are transactional
    o  often the transactions are "more correct" than those offered by a relational database.
•  FOAF, direct path operations are easy
    o  very complicated/inefficient in RDBMS
•  Usually paired with an index for search
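Friend-of-a-friend and direct-path queries are short when the data is already a graph. The Python sketch below uses an adjacency map and breadth-first search; it is not Neo4j's API, and the people and "knows" edges are invented for illustration.

```python
from collections import deque

# Hypothetical "knows" graph; the names are illustrative only.
knows = {
    "andy": {"marc", "bea"},
    "marc": {"andy", "cho"},
    "bea": {"andy", "cho", "dev"},
    "cho": {"marc", "bea", "eli"},
    "dev": {"bea"},
    "eli": {"cho"},
}


def friends_of_friends(graph, person):
    """People exactly two hops away: friends of friends who are not already friends."""
    direct = graph[person]
    foaf = set()
    for friend in direct:
        foaf |= graph[friend]
    return foaf - direct - {person}


def shortest_path(graph, start, goal):
    """Breadth-first search; returns a shortest list of nodes from start to goal, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbour in sorted(graph[path[-1]]):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(path + [neighbour])
    return None


foaf = friends_of_friends(knows, "andy")
path = shortest_path(knows, "andy", "eli")
```

In an RDBMS each extra hop is typically another self-join on a relationship table, which is where the "complicated/inefficient" complaint comes from.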

Design: RDBMS vs Graph

Neo4j Graph Example

[Graph diagram. Nodes: (Firstname: Andrew, Lastname: Oliver); (Company: Open Software Integrators); (Phone Number: 919.627.1236, Type: googlevoice); (Phone Number: 919.321.0119, Type: work); (City: Durham, State: NC); (City: Chicago, State: IL). Relationship types: HAS, FOUNDED, WORKS FOR, LOCATED, RESIDES.]

Note the extra relationships and details here - graph databases are just fun and easy to understand.

Where does Hadoop fit?

•  NoSQL
•  Software Framework (lots of pieces/lots of choices):
    o  Pig - scripting language used to quickly write MapReduce code to handle unstructured sources
    o  Hive - facilitates structure for the data
    o  HCatalog - provides inter-operability between these internal systems
    o  HBase - Bigtable-type database
    o  HDFS - Hadoop file system
•  Excellent choice for data processing and data analysis
•  MapReduce
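For readers who have not seen MapReduce, the shape of a job can be sketched in a few lines of single-process Python. This is not the Hadoop API; the function names and the two input lines are made up. In Hadoop the map and reduce steps run in parallel across many machines and the shuffle moves data between them.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every record, yielding (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Collapse each key's list of values into one result."""
    return {key: reducer(key, values) for key, values in groups.items()}


def tokenize(line):
    # Mapper: emit (word, 1) for every word in the line.
    for word in line.split():
        yield word, 1


def total(word, counts):
    # Reducer: add up the ones emitted for this word.
    return sum(counts)


lines = ["which database should i use", "which database scales"]
counts = reduce_phase(shuffle(map_phase(lines, tokenize)), total)
```

Each mapper call sees one record and each reducer call sees one key, so neither needs the whole dataset in one place; that independence is what lets the work spread over a cluster.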

Convergence of Filesystems and Databases

•  Hadoop HDFS is...a distributed filesystem
•  So are Gluster, Ceph, GFS, etc.
•  Hadoop can use Ceph or Gluster in place of HDFS

Other Derivatives

•  Triplestores
    o  Apache Jena
•  OODBMS / ORDBMS
    o  Cache

Things you may consider

•  Persistence
    o  Asynch / Synch
•  Replication
•  Availability
•  Transactions / Consistency
•  "Locality"
•  Language
•  Resources
    o  http://en.wikipedia.org/wiki/Comparison_of_structured_storage_software
    o  http://sevenweeks.org/

Conclusions

•  RDBMS may not scale to your needs
•  Your data may not map efficiently to tables
•  Key Value Store - data by key, fast, scalable, can't handle complex data
•  Column Family/Big Table - fast, scalable, denormalized, map reduce, good for series, not efficient for complex data
•  Document - a good operational system, not your analytical, moderately scalable, matches OO
•  Graph - great for complex data, transactional, less scalable
•  Filesystems and "databases" are converging

Thank you for attending!

Andrew C. Oliver Open Software Integrators

www.osintegrators.com @osintegrators