Introduction to NOSQL And Cassandra @rantav @outbrain

Upload: chrisjaure, posted 22-Nov-2014

Introduction to NOSQL And Cassandra @rantav @outbrain

SQL is good
o Rich language
o Easy to use and integrate
o Rich toolset
o Many vendors

The promise: ACID
o Atomicity
o Consistency
o Isolation
o Durability

SQL Rules

BUT

HOWEVER...

The Challenge: Modern web apps
o Internet-scale data size
o High read-write rates
o Frequent schema changes
o "Social" apps, not banks: they don't need the same level of ACID

SCALING

Scaling Solutions - Replication

Scales Reads

Scaling Solutions - Sharding

Also scales writes

Brewer's CAP Theorem: of Consistency, Availability, and Partition tolerance, you can only choose two

CAP

Availability + Consistency (no Partition Tolerance)

Single master SQL server

Or - an array of SQLs

Consistency + Partition Tolerance (no Availability)

Availability + Partition Tolerance (no Consistency)

Consistency Levels
o Strong Consistency (RDBMS, local disk, RAM, ...)
o Weak Consistency: no guarantees
o Eventual Consistency (Cassandra, DNS, etc.)
o Causal consistency: A writes, then tells B "I wrote"; B then reads the new value
o Read-your-writes consistency (a special case of causal)
o Monotonic read consistency: A reads x; in future reads, A will never read older values of x
o Monotonic write consistency: writes by the same process are serialized

Existing NOSQL Solutions

Cassandra
o Developed at Facebook
o Follows the BigTable data model: column oriented
o Follows the Dynamo eventual consistency model
o Open-sourced at Apache
o Implemented in Java

N/R/W

CONSISTENCY DOWN TO EARTH

N - number of replicas (nodes) for any data item
W - number of nodes a write operation blocks on
R - number of nodes a read operation blocks on

N/R/W - Typical Values
o W=1 => block until the first node is written successfully
o W=N => block until all nodes are written successfully
o W=0 => async writes
o R=1 => block until the first node returns an answer
o R=N => block until all nodes return an answer
o R=0 => doesn't make sense
o QUORUM: R = N/2+1, W = N/2+1 => fully consistent, since R + W > N
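The quorum arithmetic above can be sketched in a few lines. The class and method names here are illustrative, not part of any Cassandra API:

```java
// Sketch of the N/R/W quorum arithmetic; illustrative names only.
public class QuorumMath {

    // Quorum size for N replicas: N/2 + 1 (integer division).
    static int quorum(int n) {
        return n / 2 + 1;
    }

    // A read is guaranteed to see the latest write when the read
    // set and write set must overlap, i.e. R + W > N.
    static boolean stronglyConsistent(int n, int r, int w) {
        return r + w > n;
    }

    public static void main(String[] args) {
        int n = 3;
        int q = quorum(n); // 2
        System.out.println("QUORUM for N=3: " + q);
        System.out.println("R=Q, W=Q: " + stronglyConsistent(n, q, q)); // true
        System.out.println("R=1, W=1: " + stronglyConsistent(n, 1, 1)); // false
    }
}
```

With quorum reads and quorum writes, any read set of size N/2+1 must intersect any write set of size N/2+1, which is why the slide labels it "fully consistent".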

Data Model - Forget SQL
Do you know SQL?

Data Model - Vocabulary
o Keyspace: like a namespace for unique keys
o Column Family: very much like a table, but not quite
o Key: a key that represents a row (of columns)
o Column: representation of a value, with: column name, value, timestamp
o Super Column: a column that holds a list of columns inside

Data Model - Columns

struct Column {
  1: binary name,
  2: binary value,
  3: i64 timestamp,
}

JSON-ish notation:

{
  "name": "emailAddress",
  "value": "[email protected]",
  "timestamp": 123456789
}
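The timestamp field is what drives conflict resolution: when replicas disagree about a column, the newer timestamp wins (last-write-wins). A minimal sketch; this Column class only mirrors the Thrift struct above and is not the generated Thrift code:

```java
public class ColumnReconcile {

    // Mirrors the Thrift struct above; illustrative, not generated code.
    static class Column {
        final String name;
        final String value;
        final long timestamp;

        Column(String name, String value, long timestamp) {
            this.name = name;
            this.value = value;
            this.timestamp = timestamp;
        }
    }

    // Last-write-wins: the column with the newer timestamp survives.
    static Column reconcile(Column a, Column b) {
        return a.timestamp >= b.timestamp ? a : b;
    }

    public static void main(String[] args) {
        Column older = new Column("emailAddress", "old@example.com", 123456789L);
        Column newer = new Column("emailAddress", "new@example.com", 123456999L);
        System.out.println(reconcile(older, newer).value); // new@example.com
    }
}
```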

Data Model - Column Family
o Similar to SQL tables
o Has many columns
o Has many rows

Data Model - Rows
o Primary key for objects
o All keys are arbitrary-length strings

{
  "Users": {
    "ran": {
      {"name": "emailAddress", "value": "[email protected]"},
      {"name": "webSite", "value": "http://bar.com"}
    },
    "f.rat": {
      {"name": "emailAddress", "value": "[email protected]"}
    }
  },
  "Stats": {
    "ran": {
      {"name": "visits", "value": "243"}
    }
  }
}

Data Model - Short Notation

Users:                              <- CF
  ran:                              <- ROW
    emailAddress: [email protected]  <- COLUMN
    webSite: http://bar.com         <- COLUMN
  f.rat:                            <- ROW
    emailAddress: [email protected]  <- COLUMN
Stats:                              <- CF
  ran:                              <- ROW
    visits: 243                     <- COLUMN
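The short notation maps naturally onto nested sorted maps: ColumnFamily -> row key -> column name -> value. A toy sketch (ignoring timestamps and keyspaces), with all names illustrative:

```java
import java.util.TreeMap;

// Toy in-memory model of the CF / row / column nesting above.
public class ToyKeyspace {

    // ColumnFamily -> row key -> column name -> value
    private final TreeMap<String, TreeMap<String, TreeMap<String, String>>> data =
            new TreeMap<>();

    void insert(String cf, String key, String column, String value) {
        data.computeIfAbsent(cf, c -> new TreeMap<>())
            .computeIfAbsent(key, k -> new TreeMap<>())
            .put(column, value);
    }

    String get(String cf, String key, String column) {
        TreeMap<String, String> row =
                data.getOrDefault(cf, new TreeMap<>()).get(key);
        return row == null ? null : row.get(column);
    }

    public static void main(String[] args) {
        ToyKeyspace ks = new ToyKeyspace();
        ks.insert("Users", "ran", "webSite", "http://bar.com");
        ks.insert("Stats", "ran", "visits", "243");
        System.out.println(ks.get("Stats", "ran", "visits"));   // 243
        System.out.println(ks.get("Users", "nobody", "x"));     // null
    }
}
```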

Data Model - Songs Example

Songs:
  Meir Ariel:
    Shir Keev: 6:13
    Tikva: 4:11
    Erol: 6:17
    Suetz: 5:30
    Dr Hitchakmut: 3:30
  Mashina:
    Rakevet Layla: 3:02
    Optikai: 5:40

Data Model - Super Columns

Songs:
  Meir Ariel:
    Shirey Hag:
      Shir Keev: 6:13
      Tikva: 4:11
      Erol: 6:17
    Vegluy Eynaim:
      Suetz: 5:30
      Dr Hitchakmut: 3:30
  Mashina: ...

Data Model - Super Columns Columns whose values are lists of columns

The API
o get, get_slice
o multiget, multiget_slice
o get_count
o get_range_slice, get_range_slices
o insert, remove
o batch_insert, batch_mutate

The True API

get(keyspace, key, column_path, consistency)
get_slice(ks, key, column_parent, predicate, consistency)
multiget(ks, keys, column_path, consistency)
multiget_slice(ks, keys, column_parent, predicate, consistency)
...

Consistency Model
o N - set per keyspace
o R - set per read request
o W - set per write request

Consistency Model
Cassandra defines:

enum ConsistencyLevel {
  ZERO = 0,
  ONE = 1,
  QUORUM = 2,
  DCQUORUM = 3,
  ALL = 5,
}

Java Code

TTransport tr = new TSocket("localhost", 9160);
TProtocol proto = new TBinaryProtocol(tr);
Cassandra.Client client = new Cassandra.Client(proto);
tr.open();
String keyUserId = "1";
long timestamp = System.currentTimeMillis();
client.insert("Keyspace1",
              keyUserId,
              new ColumnPath("Standard1", null, "name".getBytes("UTF-8")),
              "Chris Goffinet".getBytes("UTF-8"),
              timestamp,
              ConsistencyLevel.ONE);

Java Client - Hector
http://github.com/rantav/hector
o The de-facto Java client for Cassandra
o Encapsulates Thrift
o Adds JMX (monitoring)
o Connection pooling
o Failover
o Open-sourced at GitHub, with a growing community of developers and users

Java Client - Hector (cont.)

/**
 * Insert a new value keyed by key
 *
 * @param key   Key for the value
 * @param value the String value to insert
 */
public void insert(final String key, final String value) {
  Mutator m = createMutator(keyspaceOperator);
  m.insert(key, CF_NAME, createColumn(COLUMN_NAME, value));
}

Java Client - Hector (cont.)

/**
 * Get a string value.
 *
 * @return The string value; null if no value exists for the given key.
 */
public String get(final String key) throws HectorException {
  ColumnQuery q = createColumnQuery(keyspaceOperator, serializer, serializer);
  Result r = q.setKey(key).
      setName(COLUMN_NAME).
      setColumnFamily(CF_NAME).
      execute();
  HColumn c = r.get();
  return c == null ? null : c.getValue();
}

Extra
If you're not snoring yet...

Sorting
o Columns are sorted by their type: BytesType, UTF8Type, AsciiType, LongType, LexicalUUIDType, TimeUUIDType
o Rows are sorted by their Partitioner: RandomPartitioner, OrderPreservingPartitioner, CollatingOrderPreservingPartitioner

Thrift
o Cross-language protocol
o Compiles to: C++, Java, PHP, Ruby, Erlang, Perl, ...

struct UserProfile {
  1: i32 uid,
  2: string name,
  3: string blurb
}

service UserStorage {
  void store(1: UserProfile user),
  UserProfile retrieve(1: i32 uid)
}

Thrift
Generating sources:

thrift --gen java cassandra.thrift
thrift --gen py cassandra.thrift

Internals

Required Reading ;-)
o BigTable: http://labs.google.com/papers/bigtable.html
o Dynamo: http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html

From Dynamo:
o Symmetric p2p architecture
o Gossip-based discovery and error detection
o Distributed key-value store: pluggable partitioning, pluggable topology discovery
o Eventually consistent, tunable per operation

From BigTable
o Sparse, column-oriented array
o SSTable disk storage: append-only commit log, memtable (buffering and sorting), immutable sstable files, compactions
o High write performance

Architecture Layers
o Cluster management: messaging service, gossip, failure detection, cluster state, partitioner, replication
o Single host: commit log, memtable, SSTable, indexes, compaction
o Consistency: tombstones, hinted handoff, read repair, bootstrap
o Monitoring, admin tools

Gossip
o p2p
o Enables seamless node addition and rebalancing of keys
o Fast detection of nodes that go down
o Every node knows about all others; there is no master

Internals - Consistent Hashing
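The diagram for this slide is not reproduced here, so a minimal sketch of the idea: nodes sit at tokens on a ring, and a key belongs to the first node at or after its token, wrapping around. Real Cassandra partitioners hash keys to tokens; that step is elided, and all names are illustrative:

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Toy consistent-hashing ring: token -> node, sorted by token.
public class HashRing {

    private final TreeMap<Integer, String> ring = new TreeMap<>();

    void addNode(String node, int token) {
        ring.put(token, node);
    }

    // Owner of a key token: first node clockwise from the token,
    // wrapping to the lowest token when we fall off the end.
    String nodeFor(int keyToken) {
        SortedMap<Integer, String> tail = ring.tailMap(keyToken);
        return tail.isEmpty() ? ring.firstEntry().getValue()
                              : tail.get(tail.firstKey());
    }

    public static void main(String[] args) {
        HashRing r = new HashRing();
        r.addNode("A", 100);
        r.addNode("B", 200);
        r.addNode("C", 300);
        System.out.println(r.nodeFor(150)); // B
        System.out.println(r.nodeFor(350)); // A (wraps around)
    }
}
```

The appeal of the ring is that adding or removing a node only moves the keys adjacent to it, which is what makes the "seamless node addition" on the previous slide cheap.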

Memtables
o In-memory representation of recently written data
o When the table is full, it is sorted and then flushed to disk as an sstable

SSTables (Sorted Strings Tables)
o Immutable, on-disk
o Sorted by a string key
o In-memory index of elements
o Binary search (in memory) to find an element's location
o Bloom filter to reduce the number of unneeded binary searches

Write Path


Compactions

Write Properties
o No locks in the critical path
o Always available for writes, even if there are failures
o No reads, no seeks
o Fast
o Atomic within a ColumnFamily

Read Path

Reads

Read Properties
o Reads multiple SSTables
o Slower than writes (but still fast)
o Seeks can be mitigated with more RAM
o Uses probabilistic bloom filters to reduce lookups

Bloom Filters
o Space-efficient probabilistic data structure
o Tests whether an element is a member of a set
o Allows false positives, but never false negatives
o k hash functions
o Union and intersection are implemented as bitwise OR and AND
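A toy bloom filter along the lines described above. A real implementation would use k independent, well-distributed hash functions (e.g. murmur with different seeds); the simple derivation here is an assumption made for brevity:

```java
import java.util.BitSet;

// Toy bloom filter: k bit positions per key, set on add,
// all must be set for a (possible) member.
public class SimpleBloom {

    private final BitSet bits;
    private final int size;
    private final int k; // number of hash functions

    SimpleBloom(int size, int k) {
        this.bits = new BitSet(size);
        this.size = size;
        this.k = k;
    }

    // Derive the i-th hash from the key; crude but deterministic.
    private int hash(String key, int i) {
        return Math.floorMod(key.hashCode() + i * 0x9E3779B9, size);
    }

    void add(String key) {
        for (int i = 0; i < k; i++) bits.set(hash(key, i));
    }

    // May return a false positive, never a false negative.
    boolean mightContain(String key) {
        for (int i = 0; i < k; i++) {
            if (!bits.get(hash(key, i))) return false;
        }
        return true;
    }

    // Union of two filters (same size and k) is a bitwise OR,
    // which is exactly how compaction merges sstable filters.
    SimpleBloom union(SimpleBloom other) {
        SimpleBloom out = new SimpleBloom(size, k);
        out.bits.or(this.bits);
        out.bits.or(other.bits);
        return out;
    }

    public static void main(String[] args) {
        SimpleBloom f = new SimpleBloom(1024, 3);
        f.add("ran");
        System.out.println(f.mightContain("ran")); // true
    }
}
```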

Compactions
o Merge keys
o Combine columns
o Discard tombstones
o Merge bloom filters with a bitwise OR operation
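The merge step can be sketched as follows. Here a row is just a sorted map of key to (value, timestamp), a null value stands in for a tombstone, and gcBefore plays the role of the configurable tombstone GC delay from the Deletions slide; all names are illustrative:

```java
import java.util.Map;
import java.util.TreeMap;

// Sketch of compacting two sstable-like sorted maps into one.
public class CompactionSketch {

    static class Cell {
        final String value; // null marks a tombstone (deletion marker)
        final long timestamp;

        Cell(String value, long timestamp) {
            this.value = value;
            this.timestamp = timestamp;
        }
    }

    // For each key keep the cell with the newer timestamp, then
    // discard tombstones old enough to garbage-collect.
    static TreeMap<String, Cell> compact(Map<String, Cell> older,
                                         Map<String, Cell> newer,
                                         long gcBefore) {
        TreeMap<String, Cell> out = new TreeMap<>(older);
        for (Map.Entry<String, Cell> e : newer.entrySet()) {
            Cell prev = out.get(e.getKey());
            if (prev == null || e.getValue().timestamp >= prev.timestamp) {
                out.put(e.getKey(), e.getValue());
            }
        }
        out.entrySet().removeIf(e ->
                e.getValue().value == null && e.getValue().timestamp < gcBefore);
        return out;
    }

    public static void main(String[] args) {
        TreeMap<String, Cell> a = new TreeMap<>();
        a.put("ran", new Cell("243", 10));
        a.put("f.rat", new Cell("7", 10));
        TreeMap<String, Cell> b = new TreeMap<>();
        b.put("ran", new Cell("250", 20));  // newer value wins
        b.put("f.rat", new Cell(null, 20)); // tombstone

        TreeMap<String, Cell> merged = compact(a, b, 15);
        System.out.println(merged.get("ran").value);      // 250
        System.out.println(merged.containsKey("f.rat"));  // true (tombstone too new to GC)
    }
}
```

Note that a tombstone must survive compaction until the GC delay has passed, which is exactly the subtlety the Deletions slide describes.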

Large and Small compactions

Deletions
o A deletion marker (tombstone) is necessary to suppress data in older SSTables until compaction
o Read repair complicates things a little
o Eventual consistency complicates things more
o Solution: a configurable delay before tombstone GC, after which tombstones are not repaired

Extra: Long list of subjects
o SEDA
o Anti-entropy
o Hinted handoff
o Repair on read
o Timestamps -> vector clocks
o Consistent hashing
o Merkle trees

References
o http://horicky.blogspot.com/2009/11/nosql-patterns.html
o http://s3.amazonaws.com/AllThingsDistributed/sosp/amazon-dynamo-sosp2007.pdf
o http://labs.google.com/papers/bigtable.html
o https://nosqleast.com/2009/
o http://bret.appspot.com/entry/how-friendfeed-uses-mysql
o http://www.julianbrowne.com/article/viewer/brewers-cap-theorem
o http://www.allthingsdistributed.com/2008/12/eventually_consistent.html
o http://wiki.apache.org/cassandra/DataModel
o http://incubator.apache.org/thrift/