Being Closer to Cassandra, by Oleg Anastasyev. Talk at Cassandra Summit EU 2013

Oleg Anastasyev, lead platform developer, Odnoklassniki.ru: Being Closer to Cassandra

Upload: odnoklassnikiru

Post on 15-Jan-2015

DESCRIPTION

Odnoklassniki uses Cassandra for business data that doesn't fit into RAM. This data is typically fast-growing, frequently accessed by our users, and must always be available, because it constitutes our primary business as a social network. The way we use Cassandra is somewhat unusual: we don't use Thrift or the Netty-based native protocol to communicate with Cassandra nodes remotely. Instead, we co-locate Cassandra nodes in the same JVM with the business service logic, remotely exposing a business-level interface rather than generic data manipulation. This way, we avoid extra network roundtrips within a single business transaction and use internal calls to Cassandra classes to get information faster. It also lets us apply many small hacks to Cassandra's internals, yielding huge gains in efficiency and ease of distributed server development.

TRANSCRIPT

Page 1: Being closer to Cassandra by Oleg Anastasyev. Talk at Cassandra Summit EU 2013

Oleg Anastasyev, lead platform developer, Odnoklassniki.ru

Being Closer to Cassandra

Page 2:

#CASSANDRAEU

Top 10 of the world's social networks
40M DAU, 80M MAU, 7M peak
~300 000 www req/sec, 20 ms render latency
>240 Gbit out
>5 800 iron servers in 5 DCs
99.9% Java

* Odnoklassniki means "classmates" in English

Page 3:

Cassandra @ Odnoklassniki
* Since 2010
  - branched 0.6
  - aiming at: full operation on DC failure, scalability, ease of operations
* Now
  - 23 clusters
  - 418 nodes in total
  - 240 TB of stored data
  - survived several DC failures

Page 4:

Case #1. The fast

Page 5:

[Like! widget examples: "Like! 103 927" and "You and 103 927"]

Page 6:

Like! widget
* It's everywhere
  - On every page, by the dozen
  - On feeds (AKA timeline)
  - 3rd party websites elsewhere on the internet
* It's on everything
  - Pictures and albums
  - Videos
  - Posts and comments
  - 3rd party shared URLs

Page 7:

Like! widget
* High load
  - 1 000 000 reads/sec, 3 000 writes/sec
* Hard load profile
  - Read mostly
  - Long tail (40% of reads are random)
  - Sensitive to latency variations
  - 3 TB total dataset (9 TB with RF) and growing
  - ~60 billion likes for ~6 billion entities

Page 8:

Classic solution

SQL table:

    RefId:long    RefType:byte    UserId:long    Created
    9999999999    PICTURE(2)      11111111111    11:00

To render "You and 4256":

    SELECT TOP 1 WHERE RefId,RefType,UserId=?,?,?    (98% are NONE)
    SELECT COUNT (*) WHERE RefId,RefType=?,?    (80% are 0)
    SELECT TOP N * RefId,RefType=? WHERE IsFriend(?,UserId)

Query count annotations on the slide, in order: = N >= 1; = M > N; = N*140

Page 9:

Cassandra solution

    LikeByRef (
        refType byte,
        refId bigint,
        userId bigint,
        PRIMARY KEY ( (RefType,RefId), UserId)
    )

    LikeCount (
        refType byte,
        refId bigint,
        likers counter,
        PRIMARY KEY ( (RefType,RefId))
    )

So, to render "You and 4256":

    SELECT FROM LikeCount WHERE RefId,RefType=?,?    (80% are 0)
    SELECT * FROM LikeByRef WHERE RefId,RefType,UserId=?,?,?    (98% are NONE)

Annotation on the slide: = N*20%

Page 10:

>11 M iops

* Quick workaround?

    LikeByRef (
        refType byte,
        refId bigint,
        userId bigint,
        PRIMARY KEY ( (RefType,RefId, UserId) )
    )

    SELECT TOP N * RefId,RefType=? WHERE IsFriend(?,UserId)

  - Forces Order Preserving Partitioner (random does not scale)
  - Key range scans
  - More network overhead
  - Partition count >10x, dataset size >2x

Page 11:

By-column bloom filter
* What it does
  - Includes pairs of (PartKey, ColumnKey) in SSTable *-Filter.db
* The good
  - Eliminated 98% of reads
  - Fewer false positives
* The bad
  - They become too large: GC promotion failures... but fixable (CASSANDRA-2466)
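
The idea of a by-column filter can be illustrated with a minimal Java sketch. This is not the code from the Odnoklassniki fork or from Cassandra itself: the class name `ColumnBloomFilter`, the string keys, and the FNV-style hashing are assumptions chosen for brevity. It only shows the principle that the filter is populated with (partition key, column key) pairs, so a lookup for a column that was never written can be rejected without reading the SSTable.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

/** Illustrative Bloom filter keyed by (partition key, column key) pairs. Hypothetical, not Cassandra code. */
final class ColumnBloomFilter {
    private final BitSet bits;
    private final int size;       // number of bits in the filter
    private final int hashCount;  // number of probes per entry

    ColumnBloomFilter(int size, int hashCount) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    /** Records that this column of this partition was written. */
    void add(String partitionKey, String columnKey) {
        for (int index : probes(partitionKey, columnKey)) {
            bits.set(index);
        }
    }

    /** false means "definitely absent" (skip the read); true means "maybe present". */
    boolean mightContain(String partitionKey, String columnKey) {
        for (int index : probes(partitionKey, columnKey)) {
            if (!bits.get(index)) {
                return false;
            }
        }
        return true;
    }

    /** Derives hashCount bit positions from two hashes of the combined key (double hashing). */
    private int[] probes(String partitionKey, String columnKey) {
        byte[] data = (partitionKey + '\u0000' + columnKey).getBytes(StandardCharsets.UTF_8);
        int h1 = fnv1a(data, 0x811C9DC5);
        int h2 = fnv1a(data, 0x01000193) | 1; // force an odd step
        int[] result = new int[hashCount];
        for (int i = 0; i < hashCount; i++) {
            result[i] = Math.floorMod(h1 + i * h2, size);
        }
        return result;
    }

    /** FNV-1a style mixing with a caller-supplied seed. */
    private static int fnv1a(byte[] data, int seed) {
        int hash = seed;
        for (byte b : data) {
            hash ^= (b & 0xFF);
            hash *= 0x01000193;
        }
        return hash;
    }
}
```

With the like data above, the partition key would correspond to (RefType, RefId) and the column key to UserId, which is why most "did this user like it?" checks can be answered from memory.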

Page 12:

Are we there yet?
  - min 2 roundtrips per render (COUNT + RR)
  - Thrift is slow, especially with a lot of connections
  - EXISTS() is 200 Gbit/sec (140*8*1Mps*20%)

[Diagram: application servers (>400) call cassandra: 1. COUNT(), 2. EXISTS]

Page 13:

Co-locate!
  - one-nio remoting (faster than java nio)
  - topology aware clients

[Diagram: the odnoklassniki-like service and cassandra share one process, exposing a Remote Business Intf with get() : LikeSummary, backed by a Counters Cache and a Social Graph Cache]
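
A rough Java sketch of what a business-level remote interface could look like follows. Only `get()` and `LikeSummary` appear on the slide; the fields of `LikeSummary`, the `LikeService` interface name, the parameters, and the in-memory `InMemoryLikeService` stand-in (used here in place of co-located Cassandra storage) are assumptions for illustration.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Everything the widget needs to render, returned by one remote call. Fields are assumed. */
final class LikeSummary {
    final long count;
    final boolean likedByViewer;

    LikeSummary(long count, boolean likedByViewer) {
        this.count = count;
        this.likedByViewer = likedByViewer;
    }
}

/** Hypothetical business-level interface, exposed remotely instead of generic data access. */
interface LikeService {
    LikeSummary get(byte refType, long refId, long viewerId);
}

/** In-memory stand-in for a service running in the same JVM as the storage node. */
final class InMemoryLikeService implements LikeService {
    private final Map<String, Set<Long>> likersByRef = new HashMap<>();

    void like(byte refType, long refId, long userId) {
        likersByRef.computeIfAbsent(refType + ":" + refId, key -> new HashSet<>()).add(userId);
    }

    @Override
    public LikeSummary get(byte refType, long refId, long viewerId) {
        // Both lookups (count and "did the viewer like it") happen locally,
        // so the caller pays for a single network roundtrip.
        Set<Long> likers = likersByRef.getOrDefault(refType + ":" + refId, Collections.emptySet());
        return new LikeSummary(likers.size(), likers.contains(viewerId));
    }
}
```

The point of the design is that the count and the existence check, which were two remote queries in the previous slides, collapse into one call whose internals run next to the data.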

Page 14:

Co-location wins
* Fast TOP N friend likers query
  1. Take friends from graph cache
  2. Check it with memory bloom filter
  3. Read some until N friends found
* Custom caches
  - Tuned for application
* Custom data merge logic
  - ... so you can detect and resolve conflicts
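
The three steps of the TOP N friend likers query can be sketched in Java as below. The class `FriendLikers` and its parameters are hypothetical: the two predicates stand in for the in-memory bloom filter and for the real storage read, whose actual APIs the slides do not show.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.LongPredicate;

/** Sketch of the "TOP N friend likers" query; all names are illustrative assumptions. */
final class FriendLikers {
    /**
     * @param friends       the viewer's friend ids, as taken from the social graph cache (step 1)
     * @param maybeLiked    bloom filter check: false means the friend definitely did not like it (step 2)
     * @param confirmedLike authoritative read from storage, the expensive part (step 3)
     * @param n             how many friend likers the widget wants to show
     */
    static List<Long> topN(long[] friends, LongPredicate maybeLiked, LongPredicate confirmedLike, int n) {
        List<Long> found = new ArrayList<>();
        for (long friend : friends) {
            if (found.size() >= n) {
                break; // enough friends found, stop reading
            }
            if (!maybeLiked.test(friend)) {
                continue; // filtered out in memory, no storage read needed
            }
            if (confirmedLike.test(friend)) {
                found.add(friend);
            }
        }
        return found;
    }
}
```

In this sketch a friend rejected by the filter never costs a storage read, and the loop stops as soon as N likers are confirmed.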

Page 15:

Listen for mutations

    // Implement it
    interface StoreApplyListener {
        boolean preapply(String key, ColumnFamily data);
    }

* Register it between commit log replay and gossip
* RowMutation.apply(): extend original mutation
  + Replica, hints, ReadRepairs

    // and register with CFS
    store = Table.open(..)
            .getColumnFamilyStore(..);
    store.setListener(myListener);

Page 16:

Like! optimized counters

    LikeCount (
        refType byte,
        refId bigint,
        ip inet,
        counter int,
        PRIMARY KEY ( (RefType,RefId), ip)
    )

* Counters cache
  - Off heap (sun.misc.Unsafe)
  - Compact (30M in 1G RAM)
  - Read cached local node only
* Replicated cache state
  - cold replica cache problem
  - making (NOP) mutations
    fewer reads
  - long tail aware
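
One plausible reading of the LikeCount schema, with `ip` as a clustering column, is that each node maintains its own sub-counter and the displayed total is the sum over nodes. The slides do not spell this out, so the Java sketch below is an assumption-laden illustration of that reading, with a hypothetical class name `ShardedLikeCounter` and a plain map standing in for one partition.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative per-node counter shards for one (refType, refId) partition. Hypothetical. */
final class ShardedLikeCounter {
    // node ip -> that node's own sub-count (one "column" per node)
    private final Map<String, Integer> countByNode = new HashMap<>();

    /** Applies a like (+1) or unlike (-1) to the shard owned by the given node only. */
    void increment(String nodeIp, int delta) {
        countByNode.merge(nodeIp, delta, Integer::sum);
    }

    /** The value shown in the widget: the sum of all per-node shards. */
    int total() {
        int sum = 0;
        for (int count : countByNode.values()) {
            sum += count;
        }
        return sum;
    }
}
```

Under this assumption no two nodes ever write the same column, so concurrent increments on different nodes do not overwrite each other.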

Page 17:

Read latency variations
* CS read behavior
  1. Choose 1 node for data and N for digest
  2. Wait for data and digest
  3. Compare and return (or RR)
* Nodes suddenly slow down
  - SEDA hiccup, commit log rotation, sudden IO saturation, network hiccup or partition, page cache miss
* The bad
  - You have spikes
  - You have to wait (and time out)

Page 18:

Read latency leveling
* "Parallel" read handler
  1. Ask all replicas for data in parallel
  2. Wait for CL responses and return
* The good
  - Minimal latency response
  - Constant load when DC fails
* The (not so) bad
  - "Additional" work and traffic
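
The two steps of the "parallel" read handler can be sketched with `CompletableFuture` from the Java standard library. This is a simplified illustration, not the handler from the Odnoklassniki fork: the class `ParallelRead` is hypothetical, each future stands for a data request already sent to one replica, and timeouts, failures, and reconciliation of differing responses are left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

/** Sketch: ask all replicas at once and return after the first CL responses. Hypothetical. */
final class ParallelRead {
    /**
     * @param replicaReads     one in-flight data request per replica (step 1, already started)
     * @param consistencyLevel number of responses to wait for (step 2)
     * @return a future completed with the first consistencyLevel responses, in arrival order
     */
    static <T> CompletableFuture<List<T>> read(List<CompletableFuture<T>> replicaReads, int consistencyLevel) {
        CompletableFuture<List<T>> result = new CompletableFuture<>();
        List<T> responses = new ArrayList<>();
        for (CompletableFuture<T> replicaRead : replicaReads) {
            replicaRead.thenAccept(value -> {
                List<T> snapshot = null;
                synchronized (responses) {
                    if (responses.size() < consistencyLevel) {
                        responses.add(value);
                        if (responses.size() == consistencyLevel) {
                            snapshot = new ArrayList<>(responses);
                        }
                    }
                }
                if (snapshot != null) {
                    result.complete(snapshot); // slower replicas are simply ignored
                }
            });
        }
        return result;
    }
}
```

Because the caller never waits for one specific node, a single slow replica does not show up as a latency spike, at the price of every replica doing the full read.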

Page 19:

More tiny tricks
* On SSD io
  - Deadline IO elevator
  - 64k -> 4k read request size
* HintLog
  - Commit log for hints
  - Wait for all hints on startup
* Selective compaction
  - Compacts most read CFs more often

Page 20:

Case #2. The fat

Page 21:

* Messages in chats
  - Last page is accessed on open
  - long tail (80%) for rest
  - 150 billion, 100 TB in storage
  - Read mostly (120k reads/sec, 8k writes/sec)

Page 22:

Messages have structure
  - All chat's messages in single partition
  - Single blob for message data, to reduce overhead
* The bad
  - Conflicting modifications can happen (users, anti-spam, etc.)

    Message (
        chatId, msgId,
        created, type, userIndex, deletedBy, ...
        text
    )

    MessageCF (
        chatId, msgId,
        data blob,
        PRIMARY KEY ( chatId, msgId )
    )

Page 23:

LW conflict resolution

    Messages (
        chatId, msgId,
        version timestamp,
        data blob,
        PRIMARY KEY ( chatId, msgId, version )
    )

Two writers race on the same message:

    Writer A: get -> (version:ts1, data:d1); write( ts1, data2 ) = delete(version:ts1) + insert(version: ts2=now(), data2)
    Writer B: get -> (version:ts1, data:d1); write( ts1, data3 ) = delete(version:ts1) + insert(version: ts3=now(), data3)

Result: (ts2, data2) (ts3, data3)
  - merged on read
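
The scheme on this slide can be mimicked in a few lines of Java. The sketch below is an in-memory model, not the production code: `VersionedMessage` is a hypothetical class, a `TreeMap` stands in for the `version` clustering column of one (chatId, msgId) row, and the new version is passed in explicitly where the slide uses `now()`.

```java
import java.util.TreeMap;
import java.util.function.BinaryOperator;

/** In-memory model of one message row with a version clustering column. Hypothetical. */
final class VersionedMessage {
    // version timestamp -> data blob (a String here for readability)
    private final TreeMap<Long, String> versions = new TreeMap<>();

    void insert(long version, String data) {
        versions.put(version, data);
    }

    /** A write replaces the version the writer had read with a freshly stamped one. */
    void write(long readVersion, long newVersion, String data) {
        versions.remove(readVersion);    // delete(version: ts1)
        versions.put(newVersion, data);  // insert(version: now(), data)
    }

    /** More than one surviving version means two writers started from the same read. */
    boolean hasConflict() {
        return versions.size() > 1;
    }

    /** Merge on read: fold surviving versions, oldest first, with an application-supplied merge. */
    String read(BinaryOperator<String> merge) {
        return versions.values().stream().reduce(merge).orElse(null);
    }
}
```

A single writer leaves exactly one version behind; two racing writers both delete `ts1` and leave `ts2` and `ts3`, which is how the reader learns that a merge is needed.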

Page 24:

Specialized cache
* Again. Because we can
  - Off-heap (Unsafe)
  - Caches only freshest chat page
  - Saves its state to local (AKA system) CF: keys AND values, seq read, much faster startup
  - In-memory compression: 2x more memory almost free

Page 25:

Disk mgmt
* 4U, HDD x24, up to 4 TB/node
  - Size tiered compaction = 4 TB sstable file
  - RAID10? LCS?
* Split CF to 256 pieces
* The good
  - Smaller, more frequent memtable flushes
  - Same compaction work in smaller sets
  - Can distribute across disks

Page 26:

Disk allocation policies
* Default is
  - "Take disk with most free space"
* Some disks have
  - Too much read iops
* Generational policy
  - Each disk has same # of same gen files
  - works better for HDD
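
The generational policy can be pictured with the short Java sketch below. It is only an interpretation of the slide: the class `GenerationalDiskPolicy` is hypothetical, "generation" is treated as an opaque integer attached to each new file, and ties are broken by lowest disk index, which the slides do not specify.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch: keep the number of same-generation files equal across disks. Hypothetical. */
final class GenerationalDiskPolicy {
    // for each disk: generation -> number of files of that generation stored there
    private final List<Map<Integer, Integer>> filesPerDisk = new ArrayList<>();

    GenerationalDiskPolicy(int diskCount) {
        for (int i = 0; i < diskCount; i++) {
            filesPerDisk.add(new HashMap<>());
        }
    }

    /** Picks the disk with the fewest files of this generation and records the new file there. */
    int allocate(int generation) {
        int best = 0;
        for (int disk = 1; disk < filesPerDisk.size(); disk++) {
            if (count(disk, generation) < count(best, generation)) {
                best = disk;
            }
        }
        filesPerDisk.get(best).merge(generation, 1, Integer::sum);
        return best;
    }

    /** Number of files of the given generation currently on the given disk. */
    int count(int disk, int generation) {
        return filesPerDisk.get(disk).getOrDefault(generation, 0);
    }
}
```

Compared with "take the disk with most free space", this spreads files of the same generation, and therefore their read load, evenly over the spindles.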

Page 27:

Case #3. The ugly
feed my Frankenstein

Page 28:

* Chats overview
  - small dataset (230 GB)
  - has hot set, short tail (5%)
  - list reorders often
  - 130k reads/s, 21k writes/s

Page 29:

Conflicting updates
* List<Overview> is single blob
  .. or you'll have a lot of tombstones
* Lot of conflicts
  - updates of single column
* Need conflict detection
* Has merge algorithm

Page 30:

Vector clocks
* Voldemort
  - byte[] key -> byte[] value + VC
  - Coordination logic on clients
  - Pluggable storage engines
* Plugged
  - CS 0.6 SSTables persistence
  - Fronted by specialized cache (we love caches)
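
For readers unfamiliar with vector clocks, the following Java sketch shows how comparing two clocks detects conflicting updates. It is a generic textbook-style illustration with a hypothetical `VectorClock` class, not Voldemort's implementation or its API.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Minimal vector clock: one counter per node that has written the value. Hypothetical. */
final class VectorClock {
    enum Order { EQUAL, BEFORE, AFTER, CONCURRENT }

    private final Map<String, Long> counters = new HashMap<>();

    /** Returns a copy of this clock with the given node's counter advanced by one. */
    VectorClock incremented(String nodeId) {
        VectorClock next = new VectorClock();
        next.counters.putAll(counters);
        next.counters.merge(nodeId, 1L, Long::sum);
        return next;
    }

    /** CONCURRENT means neither version descends from the other, so a merge is required. */
    Order compare(VectorClock other) {
        boolean less = false;
        boolean greater = false;
        Set<String> nodes = new HashSet<>(counters.keySet());
        nodes.addAll(other.counters.keySet());
        for (String node : nodes) {
            long mine = counters.getOrDefault(node, 0L);
            long theirs = other.counters.getOrDefault(node, 0L);
            if (mine < theirs) {
                less = true;
            }
            if (mine > theirs) {
                greater = true;
            }
        }
        if (less && greater) {
            return Order.CONCURRENT;
        }
        if (less) {
            return Order.BEFORE;
        }
        if (greater) {
            return Order.AFTER;
        }
        return Order.EQUAL;
    }
}
```

Two updates made by different nodes from the same base version compare as `CONCURRENT`, which is the signal to run the merge algorithm mentioned on the previous slide.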

Page 31:

Performance
* 3 node cluster, RF = 3
  - Intel Xeon CPU E5506 2.13 GHz, RAM: 48 GB, 1x HDD, 1x SSD
* 8 byte key -> 1 KB value
* Results
  - 75k reads/sec, 15k writes/sec

Page 32:

Why Cassandra?
* Reusable distributed DB components
  - fast persistence, gossip, reliable async messaging, fail detectors, topology, seq scans, ...
* Has structure beyond byte[] key -> byte[] value
* Delivered promises
* Implemented in Java

Page 33:

THANK YOU

Oleg [email protected]/oa
@m0nstermind

github.com/odnoklassniki
  - one-nio: rmi faster than java nio with fast and compact automagic java serialization
  - shared-memory-cache: java Off-Heap cache using shared memory