Being Closer to Cassandra, by Oleg Anastasyev. Talk at Cassandra Summit EU 2013.

DESCRIPTION

Odnoklassniki uses Cassandra for business data that doesn't fit into RAM. This data is typically fast-growing and frequently accessed by our users, and it must be always available, because it constitutes our primary business as a social network. The way we use Cassandra is somewhat unusual: we don't use Thrift or the Netty-based native protocol to communicate with Cassandra nodes remotely. Instead, we co-locate Cassandra nodes in the same JVM as the business service logic, exposing not a generic data-manipulation interface but a business-level interface remotely. This way we avoid extra network round trips within a single business transaction and use internal calls to Cassandra classes to get information faster. It also lets us apply many small hacks to Cassandra's internals, yielding huge gains in efficiency and ease of distributed server development.

TRANSCRIPT
Oleg Anastasyev, lead platform developer, Odnoklassniki.ru
Being Closer to Cassandra
#CASSANDRAEU
Top 10 of the world's social networks: 40M DAU, 80M MAU, 7M peak
~300 000 www req/sec, 20 ms render latency
>240 Gbit out
>5 800 iron servers in 5 DCs, 99.9% Java
* Odnoklassniki means "classmates" in English
#CASSANDRAEU
Cassandra @ Odnoklassniki
* Since 2010
- branched 0.6
- aiming at: full operation on DC failure, scalability, ease of operations
* Now
- 23 clusters
- 418 nodes in total
- 240 TB of stored data
- survived several DC failures
#CASSANDRAEU
Case #1. The fast
#CASSANDRAEU
Like! 103 927 You and 103 927
#CASSANDRAEU
Like! widget
* It's everywhere
- On every page, dozens of them
- On feeds (AKA timelines)
- On 3rd-party websites elsewhere on the internet
* It's on everything
- Pictures and albums
- Videos
- Posts and comments
- 3rd-party shared URLs
#CASSANDRAEU
Like! widget
* High load
- 1 000 000 reads/sec, 3 000 writes/sec
* Hard load profile
- Read-mostly
- Long tail (40% of reads are random)
- Sensitive to latency variations
- 3 TB total dataset (9 TB with RF) and growing
- ~60 billion likes for ~6 billion entities
#CASSANDRAEU
Classic solution

SQL table:
RefId:long | RefType:byte | UserId:long | Created
9999999999 | PICTURE(2)   | 11111111111 | 11:00
= N >= 1     = M > N        = N*140

To render "You and 4256":
SELECT TOP 1 WHERE RefId,RefType,UserId=?,?,? (98% are NONE)
SELECT COUNT(*) WHERE RefId,RefType=?,? (80% are 0)
SELECT TOP N * RefId,RefType=? WHERE IsFriend(?,UserId)
#CASSANDRAEU
Cassandra solution

LikeByRef (
  refType byte, refId bigint, userId bigint,
  PRIMARY KEY ( (RefType, RefId), UserId )
)
LikeCount (
  refType byte, refId bigint, likers counter,
  PRIMARY KEY ( (RefType, RefId) )
)
= N*20%

So, to render "You and 4256" (sketched below):
SELECT FROM LikeCount WHERE RefId,RefType=?,? (80% are 0)
SELECT * FROM LikeByRef WHERE RefId,RefType,UserId=?,?,? (98% are NONE)
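To make the read path concrete, here is a minimal Java sketch of the widget render against this schema. It assumes the DataStax Java driver and a LikeWidgetRenderer helper of our own invention; the production code, as shown later in the talk, calls Cassandra classes in-process instead:

import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

// Sketch: render the Like! widget with the two lookups above.
public final class LikeWidgetRenderer {
    private final Session session; // assumed: an open driver session

    LikeWidgetRenderer(Session session) { this.session = session; }

    String render(byte refType, long refId, long viewerId) {
        // 1. Total likes: 80% of entities have none, so most renders stop here.
        Row count = session.execute(
            "SELECT likers FROM LikeCount WHERE refType=? AND refId=?",
            refType, refId).one();
        long likers = (count == null) ? 0 : count.getLong("likers");
        if (likers == 0) return "Like!";

        // 2. Did the viewer like it? 98% of the time the answer is no.
        boolean viewerLiked = session.execute(
            "SELECT userId FROM LikeByRef WHERE refType=? AND refId=? AND userId=?",
            refType, refId, viewerId).one() != null;

        return viewerLiked ? "You and " + (likers - 1) : String.valueOf(likers);
    }
}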
#CASSANDRAEU
>11 M iops

* Quick workaround?

LikeByRef (refType byte, refId bigint, userId bigint,
  PRIMARY KEY ( (RefType, RefId, UserId) )
)

SELECT TOP N * RefId,RefType=? WHERE IsFriend(?,UserId)

- Forces the order-preserving partitioner (random doesn't scale for range scans)
- Key range scans
- More network overhead
- Partition count >10x, dataset size >2x
#CASSANDRAEU
By-column bloom filter
* What it does
- Includes pairs of (PartKey, ColumnKey) in the SSTable *-Filter.db (sketched below)
* The good
- Eliminates 98% of reads
- Fewer false positives
* The bad
- The filters become too large: GC promotion failures... but fixable (CASSANDRA-2466)
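A minimal sketch of the idea in Java, using Guava's BloomFilter as a stand-in for the filter actually serialized into *-Filter.db:

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.nio.charset.StandardCharsets;

// Sketch: a bloom filter over (partition key, column key) pairs, so a
// point read like "did user U like entity E?" can skip an SSTable
// entirely when the pair was never written (98% of reads here).
public final class ColumnBloomFilter {
    private final BloomFilter<CharSequence> filter;

    ColumnBloomFilter(int expectedPairs, double falsePositiveRate) {
        this.filter = BloomFilter.create(
            Funnels.stringFunnel(StandardCharsets.UTF_8),
            expectedPairs, falsePositiveRate);
    }

    private static String pair(String partitionKey, String columnKey) {
        return partitionKey + '\0' + columnKey; // '\0' keeps pairs unambiguous
    }

    void add(String partitionKey, String columnKey) {
        filter.put(pair(partitionKey, columnKey));
    }

    // false => the column is definitely absent, skip this SSTable
    boolean mightContain(String partitionKey, String columnKey) {
        return filter.mightContain(pair(partitionKey, columnKey));
    }
}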
#CASSANDRAEU
Are we there yet?
- Min 2 round trips per render (COUNT + RR)
- Thrift is slow, especially with lots of connections
- EXISTS() alone is 200 Gbit/sec (140*8*1Mps*20%)

[diagram: >400 application servers calling the cassandra cluster: 1. COUNT(), 2. EXISTS()]
#CASSANDRAEU
Co-locate!
- one-nio remoting (faster than java nio)
- Topology-aware clients

[diagram: an odnoklassniki-like application node exposing a Remote Business Intf (get(): LikeSummary), with a Counters Cache and a Social Graph Cache, and cassandra co-located in the same JVM]
#CASSANDRAEU
Co-location wins
* Fast TOP N friend likers query (sketched below)
1. Take friends from the graph cache
2. Check them against an in-memory bloom filter
3. Read some until N friends are found
* Custom caches
- Tuned for the application
* Custom data merge logic
- ... so you can detect and resolve conflicts
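A sketch of that read path; SocialGraphCache, LikeBloomFilter and LikeStore are illustrative stand-ins for the components on the slide:

import java.util.ArrayList;
import java.util.List;

// Sketch: find up to N friends of the viewer who liked an entity,
// touching storage only for candidates that pass the bloom filter.
public final class FriendLikers {

    interface SocialGraphCache { long[] friendsOf(long userId); }
    interface LikeBloomFilter  { boolean mightContain(long refId, long userId); }
    interface LikeStore        { boolean hasLike(long refId, long userId); } // local C* read

    private final SocialGraphCache graph;
    private final LikeBloomFilter bloom;
    private final LikeStore store;

    FriendLikers(SocialGraphCache graph, LikeBloomFilter bloom, LikeStore store) {
        this.graph = graph; this.bloom = bloom; this.store = store;
    }

    List<Long> topFriendLikers(long viewerId, long refId, int n) {
        List<Long> result = new ArrayList<>(n);
        for (long friend : graph.friendsOf(viewerId)) {  // 1. friends from the graph cache
            if (!bloom.mightContain(refId, friend))      // 2. cheap in-memory check
                continue;
            if (store.hasLike(refId, friend))            // 3. confirm with a real read
                result.add(friend);
            if (result.size() == n) break;               //    ...until N are found
        }
        return result;
    }
}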
#CASSANDRAEU
Listen for mutations

// Implement it
interface StoreApplyListener {
    boolean preapply(String key, ColumnFamily data);
}

// ...and register it with the CFS
store = Table.open(..)
    .getColumnFamilyStore(..);
store.setListener(myListener);

* Register it between commit log replay and gossip startup (example below)
* Hooks RowMutation.apply(), extending the original mutation path:
  covers replica writes, hints, and ReadRepairs too
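For example, a listener that keeps a counters cache in step with every mutation applied on this replica might look like this. CountersCache is hypothetical, and the assumption that returning true lets Cassandra apply the mutation is ours:

// Sketch: react to every mutation reaching this replica, whatever its
// origin (client write, replicated write, hint replay, read repair).
class CounterCacheListener implements StoreApplyListener {
    interface CountersCache { void apply(String key, ColumnFamily data); } // hypothetical

    private final CountersCache cache;

    CounterCacheListener(CountersCache cache) { this.cache = cache; }

    @Override
    public boolean preapply(String key, ColumnFamily data) {
        cache.apply(key, data); // update the cached value before the write lands
        return true;            // assumption: true = proceed with the mutation
    }
}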
#CASSANDRAEU
Like! optimized counters

LikeCount (
  refType byte, refId bigint, ip inet, counter int,
  PRIMARY KEY ( (RefType, RefId), ip )
)

* Counters cache (sketched below)
- Off-heap (sun.misc.Unsafe)
- Compact (30M in 1 GB RAM)
- Reads served from the local node's cache only
* Replicated cache state
- Solves the cold replica cache problem
- By shipping (NOP) mutations
- Fewer reads, long-tail aware
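A much-simplified sketch of an Unsafe-backed counter table (fixed size, no collision handling; the production cache is more elaborate):

import java.lang.reflect.Field;
import sun.misc.Unsafe;

// Sketch: a compact off-heap array of (long key -> int counter) slots,
// ~12 bytes per slot, so tens of millions of counters fit in ~1 GB.
public final class OffHeapCounters {
    private static final Unsafe U = unsafe();
    private static final int SLOT = 12;          // 8-byte key + 4-byte count
    private final long base;
    private final int slots;

    OffHeapCounters(int slots) {
        this.slots = slots;
        this.base = U.allocateMemory((long) slots * SLOT);
        U.setMemory(base, (long) slots * SLOT, (byte) 0);
    }

    void increment(long key, int delta) {
        long a = slotFor(key);
        U.putLong(a, key);                       // evicts any previous occupant
        U.putInt(a + 8, U.getInt(a + 8) + delta);
    }

    int get(long key) {
        long a = slotFor(key);
        return U.getLong(a) == key ? U.getInt(a + 8) : 0;
    }

    private long slotFor(long key) {
        // Simplification: no probing; a real cache handles collisions/eviction.
        int idx = (int) ((key ^ (key >>> 32)) & 0x7fffffff) % slots;
        return base + (long) idx * SLOT;
    }

    private static Unsafe unsafe() {
        try {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            return (Unsafe) f.get(null);
        } catch (Exception e) { throw new AssertionError(e); }
    }
}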
#CASSANDRAEU
Read latency variations
* C* read behavior
1. Choose 1 node for data and N for digests
2. Wait for the data and the digests
3. Compare and return (or read-repair)
* Nodes suddenly slow down
- SEDA hiccups, commit log rotation, sudden IO saturation, network hiccups or partitions, page cache misses
* The bad
- You get latency spikes
- You have to wait (and time out)
#CASSANDRAEU
Read latency leveling
* "Parallel" read handler (sketched below)
1. Ask all replicas for data in parallel
2. Wait for CL responses and return
* The good
- Minimal-latency responses
- Constant load when a DC fails
* The (not so) bad
- "Additional" work and traffic
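A sketch of the parallel handler with plain java.util.concurrent primitives; Replica and mergeNewest are illustrative stand-ins:

import java.util.List;
import java.util.concurrent.*;

// Sketch: fire the same read at every replica and return as soon as
// CL (consistency level) responses arrive; stragglers are ignored.
public final class ParallelReadHandler {

    interface Replica { byte[] read(String key); } // stand-in for a replica endpoint

    private final ExecutorService pool = Executors.newCachedThreadPool();

    byte[] read(String key, List<Replica> replicas, int consistencyLevel)
            throws InterruptedException, ExecutionException {
        CompletionService<byte[]> cs = new ExecutorCompletionService<>(pool);
        for (Replica r : replicas)
            cs.submit(() -> r.read(key));            // 1. ask all replicas in parallel

        byte[] freshest = null;
        for (int i = 0; i < consistencyLevel; i++) { // 2. wait for CL responses only
            byte[] response = cs.take().get();
            freshest = mergeNewest(freshest, response);
        }
        return freshest;                             // a slow node no longer adds latency
    }

    private static byte[] mergeNewest(byte[] a, byte[] b) {
        return a == null ? b : a; // placeholder: real code compares timestamps / repairs
    }
}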
#CASSANDRAEU
More tiny tricks
* On SSD IO
- Deadline IO elevator
- 64k -> 4k read request size
* HintLog
- A commit log for hints
- Wait for all hints on startup
* Selective compaction
- Compacts the most-read CFs more often
#CASSANDRAEU
Case #2. The fat
#CASSANDRAEU
* Messages in chats
- The last page is accessed on open
- Long tail (80%) for the rest
- 150 billion messages, 100 TB in storage
- Read-mostly (120k reads/sec, 8k writes/sec)
#CASSANDRAEU
Messages have structure

Message (chatId, msgId,
  created, type, userIndex, deletedBy, ... text)

- All of a chat's messages live in a single partition
- A single blob per message, to reduce storage overhead (sketched below)
- The bad: conflicting modifications can happen (users, anti-spam, etc.)

MessageCF (chatId, msgId,
  data blob,
  PRIMARY KEY ( chatId, msgId )
)
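A sketch of the single-blob encoding; the exact field layout here is illustrative, the point is one column per message instead of one per field:

import java.io.*;
import java.nio.charset.StandardCharsets;

// Sketch: serialize all message fields into one blob so each message
// costs a single column, cutting per-column storage overhead.
public final class MessageBlob {

    static byte[] encode(long created, byte type, int userIndex,
                         long deletedBy, String text) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeLong(created);
        out.writeByte(type);
        out.writeInt(userIndex);
        out.writeLong(deletedBy);
        byte[] utf = text.getBytes(StandardCharsets.UTF_8);
        out.writeInt(utf.length);
        out.write(utf);
        return bytes.toByteArray();
    }

    static String decodeText(byte[] blob) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(blob));
        in.skipBytes(8 + 1 + 4 + 8);    // created, type, userIndex, deletedBy
        byte[] utf = new byte[in.readInt()];
        in.readFully(utf);
        return new String(utf, StandardCharsets.UTF_8);
    }
}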
#CASSANDRAEU
LW conflict resolution

Messages (chatId, msgId,
  version timestamp,
  data blob,
  PRIMARY KEY ( chatId, msgId, version )
)

Two clients race on the same message:
1. Both get (version:ts1, data:d1)
2. One calls write(ts1, data2): delete(version:ts1) + insert(version: ts2=now(), data2)
3. The other calls write(ts1, data3): delete(version:ts1) + insert(version: ts3=now(), data3)
4. Both (ts2, data2) and (ts3, data3) survive, and are merged on read (sketched below)
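A sketch of the scheme in Java; the last-writer-wins merge below is a placeholder for the real message-aware merge logic:

import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Sketch: every update deletes the version it read and inserts a new
// one stamped now(); concurrent writers leave several versions behind,
// which the reader detects and merges.
public final class VersionedMessages {

    static final class Version {
        final long ts; final byte[] data;
        Version(long ts, byte[] data) { this.ts = ts; this.data = data; }
    }

    // rows for one (chatId, msgId), keyed by version timestamp
    private final NavigableMap<Long, byte[]> versions = new TreeMap<>();

    synchronized void write(long readVersion, byte[] newData) {
        versions.remove(readVersion);                      // delete(version: ts1)
        versions.put(System.currentTimeMillis(), newData); // insert(version: now())
    }

    synchronized Version read() {
        if (versions.size() > 1) {
            // Conflict: several concurrent writes survived. Merge them
            // (here: last-writer-wins; real logic merges field by field).
            Map.Entry<Long, byte[]> winner = versions.lastEntry();
            versions.clear();
            versions.put(winner.getKey(), winner.getValue());
        }
        Map.Entry<Long, byte[]> e = versions.lastEntry();
        return e == null ? null : new Version(e.getKey(), e.getValue());
    }
}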
#CASSANDRAEU
Specialized cache
* Again. Because we can
- Off-heap (Unsafe)
- Caches only the freshest chat page
- Saves its state to a local (AKA system) CF: keys AND values, read back sequentially, for much faster startup
- In-memory compression: 2x more memory, almost free
#CASSANDRAEU
Disk mgmt
* 4U, 24x HDD, up to 4 TB/node
- Size-tiered compaction = one 4 TB sstable file
- RAID10? LCS?
* Split each CF into 256 pieces (see the sketch below)
* The good
- Smaller, more frequent memtable flushes
- The same compaction work, in smaller sets
- Pieces can be distributed across disks
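The split itself can be as simple as routing every partition key to one of 256 sub-CFs by a stable hash (names here are illustrative):

// Sketch: route a partition key to one of 256 sub-column-families so
// each piece gets its own memtable, sstables, and compaction.
public final class SplitColumnFamily {
    private static final int PIECES = 256;

    static int pieceOf(byte[] partitionKey) {
        int h = 1;
        for (byte b : partitionKey) h = 31 * h + b; // any stable hash works
        return (h & 0x7fffffff) % PIECES;
    }

    static String cfNameFor(String baseName, byte[] partitionKey) {
        return baseName + "_" + pieceOf(partitionKey); // e.g. "Messages_137"
    }
}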
#CASSANDRAEU
Disk allocation policies
* The default is
- "Take the disk with the most free space"
* So some disks get
- Too many read iops
* Generational policy (sketched below)
- Each disk holds the same # of same-generation files
- Works better for HDDs
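A sketch contrasting the two policies; the Disk interface is a stand-in:

import java.util.Comparator;
import java.util.List;

// Sketch: instead of "most free space", place a new sstable of a given
// compaction generation on the disk holding the fewest files of that
// generation, spreading read iops evenly across spindles.
public final class GenerationalDiskPolicy {

    interface Disk {
        long freeSpace();
        int filesOfGeneration(int gen);
    }

    static Disk defaultPolicy(List<Disk> disks) {
        return disks.stream()
                .max(Comparator.comparingLong(Disk::freeSpace))
                .get();
    }

    static Disk generational(List<Disk> disks, int generation) {
        return disks.stream()
                .min(Comparator.comparingInt(d -> d.filesOfGeneration(generation)))
                .get();
    }
}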
#CASSANDRAEU
Case #3. The ugly: feed my Frankenstein
#CASSANDRAEU
* Chats overview
- Small dataset (230 GB)
- Has a hot set, short tail (5%)
- The list reorders often
- 130k reads/s, 21k writes/s
#CASSANDRAEU
Conflicting updates
* List<Overview> is a single blob
- ... or you'll have a lot of tombstones
* Lots of conflicts
- Updates of a single column
* Need conflict detection
* Have a merge algorithm
#CASSANDRAEU
Vector clocks
* Voldemort
- byte[] key -> byte[] value + VC (a minimal VC sketch below)
- Coordination logic on clients
- Pluggable storage engines
* Plugged in
- C* 0.6 SSTables persistence
- Fronted by a specialized cache (we love caches)
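A minimal vector clock sketch showing the conflict test that triggers the merge algorithm:

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch: a vector clock mapping node id -> logical counter. Comparing
// two clocks tells us whether one update supersedes the other, or the
// two are concurrent and need the application's merge algorithm.
public final class VectorClock {
    private final Map<Short, Long> counters = new HashMap<>();

    void increment(short nodeId) {
        counters.merge(nodeId, 1L, Long::sum);
    }

    /** @return -1 if this < other, 1 if this > other, 0 if concurrent/equal */
    int compare(VectorClock other) {
        boolean less = false, greater = false;
        for (short node : union(other)) {
            long a = counters.getOrDefault(node, 0L);
            long b = other.counters.getOrDefault(node, 0L);
            if (a < b) less = true;
            if (a > b) greater = true;
        }
        if (less && greater) return 0; // concurrent: conflict, must merge
        if (less) return -1;
        if (greater) return 1;
        return 0;                      // identical
    }

    private Set<Short> union(VectorClock other) {
        Set<Short> all = new HashSet<>(counters.keySet());
        all.addAll(other.counters.keySet());
        return all;
    }
}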
#CASSANDRAEU
Performance
* 3-node cluster, RF = 3
- Intel Xeon CPU E5506 2.13 GHz, RAM: 48 GB, 1x HDD, 1x SSD
* 8-byte key -> 1 KB value
* Results
- 75k reads/sec, 15k writes/sec
#CASSANDRAEU
Why Cassandra?
* Reusable distributed DB components
- Fast persistence, gossip, reliable async messaging, failure detectors, topology, seq scans, ...
* Has structure
- Beyond byte[] key -> byte[] value
* Delivered on its promises
* Implemented in Java
#CASSANDRAEU CASSANDRASUMMITEU
THANK YOU
one-nio: RMI faster than java nio, with fast and compact automagic Java serialization
shared-memory-cache: Java off-heap cache using shared memory
Oleg Anastasyev, [email protected], /oa, @m0nstermind
github.com/odnoklassniki