Cassandra at Instagram (August 2013)
CASSANDRA AT INSTAGRAM
Rick Branson, Infrastructure Engineer (@rbranson)
SF Cassandra Meetup, August 29, 2013
Disqus HQ
September 2012: Redis fillin' up.
What sucks?
THE OBVIOUS: Memory is expensive.
LESS OBVIOUS: In-memory "degrades" poorly
• Flat namespace. What's in there?
• Heap fragmentation
• Single threaded
BGSAVE
The Data
• Boils down to centralized logging
• VERY high skew of writes to reads (1,000:1)
• Ever-growing data set
• Durability highly valued
• Dumb to store it in RAM, basically...
The Setup
• Cassandra 1.1
• 3 EC2 m1.xlarge (2-core, 15GB RAM)
• RAIDed ephemerals (1.6TB of SATA)
• RF=3
• 6GB Heap, 200MB NewSize
• HSHA
It worked. Mostly.
The horrible cool thing about Chef...
commit a1489a34d2aa69316b010146ab5254895f7b9141
Author: Rick Branson
Date: Thu Oct 18 20:05:16 2012 -0700
Follow the rules for Cassandra listen_address so I don't burn a whole day fixing my retarded mistake
commit 41c96f3243a902dd6af4ea29ef6097351a16494a
Author: Rick Branson
Date: Tue Oct 30 17:12:00 2012 -0700
Use 256k JVM stack size for C* -- fixes a bug that got integrated with 1.1.6 packaging + Java 1.6.0_u34+
November 2012: Doubled to 6 nodes.
18,000 connections. Spread those more evenly.
commit 3f2e4f2e5da6fe99d7f3fc13c0da09b464b3a9e0
Author: Rick Branson
Date: Wed Nov 21 09:50:21 2012 -0800
Drop key cache size on C*UA cluster: was causing heap issues, and apparently 1GB is _WAY_ outside of the normal range of operation for nodes of this size.
commit 5926aa5ce69d48e5f2bb7c0d0e86b411645bc786
Author: Rick Branson
Date: Mon Dec 24 12:41:13 2012 -0800
Lower memtable sizes on C* UA cluster to make more room for compression metadata / bloom filters on heap
1.2.1. It went well. Well... until...
commit 84982635d5c807840d625c22a8bd4407c1879eba
Author: Rick Branson
Date: Thu Jan 31 09:43:56 2013 -0800
Switch Cassandra from tokens to vnodes
commit e990acc5dc69468c8a96a848695fca56e79f8b83
Author: Rick Branson
Date: Sun Feb 10 20:26:32 2013 -0800
We aren't ready for vnodes yet guys
TAKEAWAY: Let enterprising, experienced operators who will submit patches take the first few bullets on brand-new major versions.
commit acb02daea57dca889c2aa45963754a271fa51566
Author: Rick Branson
Date: Sun Feb 10 20:36:34 2013 -0800
Doubled C* cluster
commit cc13a4c15ee0051bb7c4e3b13bd6ae56301ac670
Author: Rick Branson
Date: Thu Mar 14 16:23:18 2013 -0700
Subtract token from C*ua7 to replace the node
pycassa exceptions (last 6 months)
• 3.4TB
• vnode migration still pending
TAKEAWAY: Adopt a technology by understanding what it's best at and letting it do that first, then expand...
• Sharded master/slave Redis
• 32x68GB (m2.4xlarge)
• Space (memory) bound
• Resharding sucks
• Failover is manual, wakes us up at night
user_id: [ activity, activity, ...]
user_id: [ activity, activity, ...]
Thrift Serialized Activity
Bound the Size
user_id: [ activity1, activity2, ... activity100, activity101, ...]
LTRIM <user_id> 0 99
Undo
user_id: [ activity1, activity2, activity3, ...]
LREM <user_id> 0 <activity2>
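A minimal sketch of that Redis model, assuming the redis-py client and made-up helper names (the activity blob is whatever the Thrift serialization produces):

import redis

r = redis.StrictRedis()  # assumed redis-py client, default host/port

def add_activity(user_id, activity_blob):
    # Push the newest activity onto the head of the user's list,
    # then trim so only the most recent 100 entries survive.
    pipe = r.pipeline()
    pipe.lpush(user_id, activity_blob)
    pipe.ltrim(user_id, 0, 99)
    pipe.execute()

def undo_activity(user_id, activity_blob):
    # LREM removes entries by value, so undo needs the exact
    # serialized activity that was originally pushed.
    r.lrem(user_id, 0, activity_blob)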
C* data model
[Diagram: row key user_id; column names are TimeUUIDs (TimeUUID1, TimeUUID2, ... TimeUUID101); column values are the serialized <activity> blobs.]
Bound the Size
[Same diagram as above.]
get(<user_id>)
delete(<user_id>, columns=[<TimeUUID101>, <TimeUUID102>, <TimeUUID103>, ...])
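A rough pycassa sketch of that get-then-delete trim; the keyspace name and function name are made up, and the column family name is borrowed from the CQL that appears later in the deck:

import pycassa

pool = pycassa.ConnectionPool('instagram')  # hypothetical keyspace name
cf = pycassa.ColumnFamily(pool, 'InboxActivitiesByUserID')

def trim_feed(user_id, max_len=100):
    # Read the row newest-first, then delete every column past the cap by name.
    row = cf.get(user_id, column_count=1000, column_reversed=True)
    excess = list(row.keys())[max_len:]
    if excess:
        cf.remove(user_id, columns=excess)  # one column tombstone per deleted activity

As the next slides point out, each by-name delete leaves a column tombstone behind, which is what makes this version expensive.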
The great destroyer of systems shows up. Tombstones abound.
[Diagram: row user_id; columns TimeUUID1, TimeUUID2, ... hold <activity> values with write timestamps timestamp1, timestamp2, ...; the deleted column TimeUUID2 now carries a tombstone with timestamp2.]
Cassandra internally stores deletes as tombstones, which mark data for a given column as deleted at-or-before a timestamp.
Column Delete
The tombstone timestamp is >= the live column timestamp, so it will be hidden from queries and compacted away.
[Diagram: row user_id; columns TimeUUID1 ... TimeUUID101 hold <activity> values with write timestamps timestamp1 ... timestamp101.]
TimeUUID = timestamp
To avoid tombstones, exploit that the timestamp embedded in our TimeUUID (ordering) is the same as the column timestamp.
[Same diagram as above: row user_id with columns TimeUUID1 ... TimeUUID101 and timestamps timestamp1 ... timestamp101.]
delete(<user_id>, timestamp=<timestamp101>)
Row Delete: Cassandra can also store row tombstones, which delete all data from a row at-or-before the timestamp provided.
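A sketch of the tombstone-avoiding trim under the same assumptions as before, plus two more: that activities were written with a column timestamp derived from their TimeUUID (as the slide describes), and that pycassa's remove() accepts an explicit timestamp in microseconds:

import pycassa
from pycassa.util import convert_uuid_to_time

pool = pycassa.ConnectionPool('instagram')  # hypothetical keyspace name
cf = pycassa.ColumnFamily(pool, 'InboxActivitiesByUserID')

def trim_feed(user_id, max_len=100):
    # Read one column past the cap, newest first.
    row = cf.get(user_id, column_count=max_len + 1, column_reversed=True)
    if len(row) <= max_len:
        return
    cutoff_uuid = list(row.keys())[-1]  # the 101st-newest column
    # Its column timestamp equals the time embedded in the TimeUUID, so a single
    # row delete at that timestamp shadows it and everything older:
    # one row tombstone instead of N column tombstones.
    cutoff_micros = int(convert_uuid_to_time(cutoff_uuid) * 1e6)
    cf.remove(user_id, timestamp=cutoff_micros)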
Optimizes Reads
[Diagram: eight SSTables with max_ts = 100, 200, ... 800. One of them contains a row tombstone with timestamp 350; the SSTables whose max timestamp is below 350 hold only shadowed data and are safely ignored using in-memory metadata.]
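A toy illustration (not Cassandra code) of why that row tombstone optimizes reads: any SSTable whose newest data is older than the tombstone contains only shadowed columns and never needs to be opened.

# Per-SSTable max timestamps kept in memory, as in the diagram above.
sstable_max_ts = [100, 200, 300, 400, 500, 600, 700, 800]
row_tombstone_ts = 350

skipped = [ts for ts in sstable_max_ts if ts <= row_tombstone_ts]
consulted = [ts for ts in sstable_max_ts if ts > row_tombstone_ts]
print(skipped)    # [100, 200, 300] -> safely ignored
print(consulted)  # [400, 500, 600, 700, 800] -> still read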
~10% of actions are undos.
Undo Support
[Diagram: row user_id with columns TimeUUID1 ... TimeUUID101 holding <activity> values.]
get(<user_id>)
delete(<user_id>, columns=[<TimeUUID2>])
Simple Race Condition: The state of the row may have changed between these two operations.
💩
Diverging Replicas: Like
[Diagram: writer inserts B. Two replicas ack OK and now hold [A, B]; the write to the third replica fails, so it still holds only [A].]
Diverging Replicas: Undo Like
[Diagram: writer issues a read to find B; the replica that is missing B answers, returning only [A].]
Replica is missing B, so if a read is required to find B before deleting it, it's going to fail.
SuperColumn = Old/Busted
AntiColumn = New/Hotness
[Diagram: row user_id with composite columns (0, <TimeUUID>), (1, <TimeUUID>), (1, <TimeUUID>); the (0, ...) column holds an anti-column, the (1, ...) columns hold activities.]
"Anti-Column"Borrowing from the idea of Cassandra's by-name tombstones, Contains an MD5 hash of the activity data "value" it is marking as deleted.
[Same diagram as above.]
Composite Column: The first component is zero for anti-columns, splitting the row into two independent lists and ensuring the anti-columns always appear at the head.
Diverging Replicas: Solved (Like)
[Diagram: writer inserts B. Two replicas ack OK and hold [A, B]; the write to the third replica fails, so it holds only [A].]
Diverging Replicas: Solved (Undo Like)
[Diagram: writer inserts anti-column C and gets OK. Replicas now hold [A, B, C], [A, B, C], and [A, C].]
Instead of read-before-write, an anti-column is inserted to mark the activity as deleted.
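A sketch of that anti-column write path, assuming a CompositeType(IntegerType, TimeUUIDType) comparator and made-up function names; per the slides, the anti-column value is the MD5 of the activity blob it cancels:

import hashlib
import pycassa

pool = pycassa.ConnectionPool('instagram')  # hypothetical keyspace name
cf = pycassa.ColumnFamily(pool, 'InboxActivitiesByUserID')

ANTI, ACTIVITY = 0, 1  # first composite component; 0 sorts anti-columns to the head

def add_activity(user_id, time_uuid, activity_blob):
    cf.insert(user_id, {(ACTIVITY, time_uuid): activity_blob})

def undo_activity(user_id, time_uuid, activity_blob):
    # No read-before-write: just record an anti-column whose value is the
    # MD5 of the serialized activity it marks as deleted.
    cf.insert(user_id, {(ANTI, time_uuid): hashlib.md5(activity_blob).digest()})

def read_feed(user_id, count=200):
    row = cf.get(user_id, column_count=count)
    cancelled = {v for (kind, _), v in row.items() if kind == ANTI}
    return [v for (kind, _), v in row.items()
            if kind == ACTIVITY and hashlib.md5(v).digest() not in cancelled]

Because the undo only needs the activity blob the client already has, it works even when a replica never received the original insert.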
TAKEAWAY: Read-before-write is a smell. Try to model data as a log of user "intent" rather than manhandling the data into place.
• Keep 30% "buffer" for trims.
• Undo without read. (thumbsup)
• Large lists suck for this. (thumbsdown)
• CASSANDRA-5527
Built in two days. Experience paid off.
Reusability is key to rapid rollout. Great documentation eases concerns.
Initial Setup
• C* 1.2.3
• vnodes, LeveledCompactionStrategy
• 12 hi1.4xlarge (8-core, 60GB, 2TB SSD)
• 3 AZs, RF=3, CL W=TWO R=ONE
• 8G heap, 800M NewSize
Rollout (sketched below)
1. Dial up Double Writes
2. Test with "Shadow" Reads
3. Dial up "Real" Reads
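A minimal sketch of how such a dial-up rollout can be wired; the percentage knobs, stub functions, and names are all assumptions, since the deck only lists the three stages:

import random

DOUBLE_WRITE_PCT = 100  # stage 1: mirror every Redis write into Cassandra
SHADOW_READ_PCT = 10    # stage 2: read from C*, compare against Redis, discard
REAL_READ_PCT = 0       # stage 3: actually serve reads from Cassandra

def redis_write(user_id, activity): pass                         # stub
def redis_read(user_id): return ["activity-from-redis"]          # stub
def cassandra_write(user_id, activity): pass                     # stub
def cassandra_read(user_id): return ["activity-from-cassandra"]  # stub

def write_activity(user_id, activity):
    redis_write(user_id, activity)                # Redis stays the source of truth
    if random.random() * 100 < DOUBLE_WRITE_PCT:
        cassandra_write(user_id, activity)

def read_feed(user_id):
    primary = redis_read(user_id)
    if random.random() * 100 < SHADOW_READ_PCT:
        shadow = cassandra_read(user_id)
        if shadow != primary:
            print("shadow mismatch for", user_id)  # compare only, never serve
    if random.random() * 100 < REAL_READ_PCT:
        return cassandra_read(user_id)
    return primary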
commit 1c3d99a9e337f9383b093009dba074b8ade20768
Author: Rick Branson
Date: Mon May 6 14:58:54 2013 -0700
Bump C* inbox heap size 8G -> 10G, seeing heap pressure
Bootstrapping sucked because compacting 10,000 SSTables takes forever.
sstable_size_in_mb: 5 => 25
Monitor Consistency
$ nodetool netstats
Mode: NORMAL
Not sending any streams.
Not receiving any streams.
Read Repair Statistics:
Attempted: 3192520
Mismatch (Blocking): 0
Mismatch (Background): 11584
Pool Name     Active   Pending      Completed
Commands         n/a         0     1837765727
Responses        n/a         1     1750784545
UPDATE COLUMN FAMILY InboxActivitiesByUserID WITH read_repair_chance = 0.01;
99.63% consistent
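That figure presumably derives from the read-repair counters above:

attempted = 3192520
background_mismatch = 11584
print(1 - background_mismatch / float(attempted))  # ~0.996, i.e. roughly 99.6% consistent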
SSTable Size (again): Saw lots of GC pressure related to buffer garbage. Eventually they landed on a new default in 1.2.9+ (160MB).
sstable_size_in_mb: 25 => 128
[Chart: Fetch & Deserialize Time (measured from app), mean vs P90 in ms, trough-to-peak.]
Space used (live): 180114509324
Space used (total): 180444164726
Memtable Columns Count: 2315159
Memtable Data Size: 112197632
Memtable Switch Count: 1312
Read Count: 316192445
Read Latency: 1.982 ms.
Write Count: 1581610760
Write Latency: 0.031 ms.
Pending Tasks: 0
Bloom Filter False Positives: 481617
Bloom Filter False Ratio: 0.08558
Bloom Filter Space Used: 54723960
Compacted row minimum size: 25
Compacted row maximum size: 545791
Compacted row mean size: 3020
Peak Stats
• 20K 200-column slice reads/sec
• 30K 1-column mutations/sec
• 30% CPU utilization
• 48K clients
Exciting Future Things
• Python Native Protocol Driver
• Read CPU Consumption Work
• Mass CQL Adoption
• Triggers
• CAS (for limited use cases)
Next 6 Months...
• Node repair visibility & monitoring
• Objects & Associations Storage API on C* + memcache
• Migrate more from Redis
• New major use case
• Cassandra 2.0?
We're hiring!