ALTER TABLE WITH: Understanding Cassandra’s Table Options

A Little About Me

• Cassandra in production since 2010
• Infrastructure @ CrowdStrike
• Hundreds of terabytes in Cassandra
• Occasional code contributions
• Cassandra MVP
• Cassandra Day LA: 5 years of Hindsight
• Cassandra Summit 2015: DTCS is Broken (unofficial title)

An Introduction to CrowdStrike

We Are a Cybersecurity Technology Company

We Detect, Prevent And Respond To All Attack Types In Real Time, Protecting Organizations From Catastrophic Breaches

We Provide Next Generation Endpoint Protection, Threat Intelligence & Pre & Post IR Services

A Little About Tonight

• Cassandra write paths
• Cassandra read paths
• Knowing what Cassandra is doing helps you understand how to tune
• It’s not just about performance; it’s also about latencies, stability, and correctness
• Feel free to interrupt me! Ask questions before, during, after

Write Path, Simplified

• Writes first go to the commitlog

• Then, memtable

• Then, eventually flushed to sstables

• If RF > 1, the coordinator sends the mutation to the replicas

• Depending on CL, the coordinator waits until enough respond before reporting success to the client
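How many replicas "enough" means depends on the consistency level and the replication factor. The function below is a minimal sketch of that arithmetic, assuming a single datacenter; `required_acks` is a hypothetical helper name, not a Cassandra API.

```python
def required_acks(consistency_level: str, replication_factor: int) -> int:
    """Replica acknowledgements the coordinator waits for (single-DC sketch)."""
    cl = consistency_level.upper()
    fixed = {"ONE": 1, "TWO": 2, "THREE": 3}
    if cl in fixed:
        return fixed[cl]
    if cl == "QUORUM":
        # A quorum is a strict majority of the replicas.
        return replication_factor // 2 + 1
    if cl == "ALL":
        return replication_factor
    raise ValueError(f"unsupported consistency level in this sketch: {cl}")


if __name__ == "__main__":
    for cl in ("ONE", "QUORUM", "ALL"):
        print(cl, required_acks(cl, replication_factor=3))
```

With RF = 3, QUORUM waits for 2 replicas, so one slow or down replica does not fail the write.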


• Writes first go to the commitlog
- Append-only journal
- Replayed on node startup
- Purged once the node knows that all relevant data is written into sstables (nodetool flush)
- If you use spinning disks, the append-only model avoids seeks (as long as the commitlog is on its own partition)


• Then the memtable
- Effectively a write-back cache of rows as they’re written
- Once a row is written to the memtable, the mutation can be counted towards the CONSISTENCY LEVEL of the query
- Writes are batched in the memtable until it’s ready to flush


• Then flushed to sstables
- At specified thresholds ( memtable_(off)heap_space_in_mb * memtable_cleanup_threshold ), the memtable is flushed to disk
- Each sstable is written exactly one time and never changed once it’s written
- If a new write comes in for the same cell, it’s written to a new sstable with a new timestamp
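The three stages above can be modeled in a few lines. This is a toy sketch, not Cassandra's implementation: `ToyNode` and its `flush_threshold` (a count of distinct keys rather than a memory size) are assumptions made for illustration.

```python
class ToyNode:
    """Toy model of the write path: commitlog -> memtable -> immutable sstables."""

    def __init__(self, flush_threshold=3):
        self.commitlog = []   # append-only journal of every mutation
        self.memtable = {}    # key -> (value, timestamp), newest write wins
        self.sstables = []    # each entry is an immutable, sorted tuple
        self.flush_threshold = flush_threshold

    def write(self, key, value, timestamp):
        self.commitlog.append((key, value, timestamp))
        current = self.memtable.get(key)
        if current is None or timestamp >= current[1]:
            self.memtable[key] = (value, timestamp)
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        if not self.memtable:
            return
        # Written once, sorted by key, never modified afterwards.
        sstable = tuple(sorted((k, v, ts) for k, (v, ts) in self.memtable.items()))
        self.sstables.append(sstable)
        self.memtable = {}
        # The flushed data is durable, so the journal can be purged.
        self.commitlog.clear()


if __name__ == "__main__":
    node = ToyNode(flush_threshold=2)
    node.write("a", 1, timestamp=1)
    node.write("b", 2, timestamp=2)   # reaches the threshold and flushes
    node.write("a", 9, timestamp=3)   # update lands in a new memtable
    node.flush()
    print(node.sstables)
```

Note that the update to "a" ends up in a second sstable instead of changing the first one, which is why compaction is needed later.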

Table Option #1: Compaction Strategy

• If sstables are never rewritten, how do updates and deletions work? Compaction! Multiple sstables are merged together: duplicate cells are reconciled, and deleted data is purged (eventually)

• Each table specifies a compaction strategy. Cassandra ships with 3 by default:
- SizeTieredCompactionStrategy is the oldest, most mature, tuned for writes
- LeveledCompactionStrategy is tuned for read latency
- DateTieredCompactionStrategy is meant for time series, TTL heavy workloads


• SizeTieredCompactionStrategy
- Every time min_threshold (4) files of roughly the same size appear, combine them together
- Over time, older data naturally ends up in larger files


• SizeTieredCompactionStrategy Advantages
- Minimizes write amplification
- Very easy to reason about
- Simple algorithm, so unlikely to cause extra CPU/memory usage at flush time
- Flushing is important: complicated compaction strategies that block flushing can be bad (if the memtable fills before it flushes, Cassandra stops accepting writes)


• SizeTieredCompactionStrategy Disadvantages
- Deleted data from old files may not be compacted away for a very long time
- Frequently changed cells will live in many files, and must be merged on read
- Read queries may touch a number of files, which is SLOW
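The size-tiered idea can be sketched as bucketing sstables by similar size and compacting any bucket that reaches min_threshold. This is a simplified illustration, not the real selection logic: the function names are hypothetical, and the 0.5/1.5 bounds are borrowed from the STCS `bucket_low` / `bucket_high` option defaults.

```python
def stcs_buckets(sizes, bucket_low=0.5, bucket_high=1.5):
    """Group sstable sizes into buckets of roughly similar size."""
    buckets = []
    for size in sorted(sizes):
        for bucket in buckets:
            avg = sum(bucket) / len(bucket)
            if bucket_low * avg <= size <= bucket_high * avg:
                bucket.append(size)
                break
        else:
            # No existing bucket is close enough in size; start a new one.
            buckets.append([size])
    return buckets


def compaction_candidates(sizes, min_threshold=4):
    """Buckets holding at least min_threshold similar-sized sstables."""
    return [b for b in stcs_buckets(sizes) if len(b) >= min_threshold]


if __name__ == "__main__":
    # Sizes in MB: four small flushed files plus two older, larger files.
    print(compaction_candidates([10, 11, 9, 10, 100, 400]))
```

The two large files sit alone in their buckets, which illustrates the disadvantage above: data in big, old sstables can wait a long time before it is compacted again.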


• LeveledCompactionStrategy
- Spends extra effort compacting sstables to ensure that each row exists in at most one sstable per ‘level’
- Expected number of sstables touched per read: ~1.11
- Advantage: lower read latency
- Disadvantage: much more IO required
- Typically advantageous when you read much more than you write, are highly sensitive to read latency, or have rows that change over time (values updated, or values expire)
- Prefer STCS if you can’t spare the IO, rows are write-once, or you write far more than you read


• DateTieredCompactionStrategy
- Designed for time series, often TTL heavy workloads
- Assumes writes come in order
- Tries to group sstables by date
- Great in theory


• Takeaway: Choosing the right compaction strategy impacts not only latency but also IO/CPU, and can have a huge impact on disk space if you use TTLs

Read Path

1. Find the right server using the partition key and partitioner (probably Murmur3)
2. Find the sstables on disk that contain the row in question
3. Find the partition offset in the data files (use cache if possible, otherwise use the partition index data)
4. The data is then read from the appropriate file
5. Duplicate cells are merged with timestamp resolution (last write wins)
6. If CL > ONE, the coordinator checks multiple replicas, and repairs any that are incorrect
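Step 5 is the part that makes immutable sstables workable. A minimal sketch, assuming each sstable is modeled as a plain dict of key to (value, timestamp); `read_cell` is a hypothetical helper, not a Cassandra function.

```python
def read_cell(sstables, key):
    """Collect every version of a cell and keep the one with the newest timestamp."""
    versions = [sstable[key] for sstable in sstables if key in sstable]
    if not versions:
        return None
    value, _timestamp = max(versions, key=lambda cell: cell[1])
    return value


if __name__ == "__main__":
    sstables = [
        {"sensor-1": ("20.1C", 1)},   # older flush
        {"sensor-1": ("22.7C", 5)},   # newer flush with an updated value
        {"sensor-2": ("18.0C", 9)},
    ]
    print(read_cell(sstables, "sensor-1"))
```

Every sstable that holds a version has to be consulted, which is why the number of sstables per read matters so much for latency.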

Read Path

https://academy.datastax.com/demos/brief-introduction-apache-cassandra

Table Option #2: Bloom Filters

• Off-heap data structure that tells Cassandra that the row either “might” or “does not” exist in a given data file

• Probabilistic: bloom_filter_fp_chance
• Defaults to 0.01 on STCS, 0.1 on LCS (LCS already defragments, so false positives are less costly)
• Cost: RAM (offheap); 0.01 uses approximately 3x the memory as 0.1
• Tuning: adjust based on RAM available and number of sstables
- For slow disks or lots of sstables, lower fp chance to decrease disk IO
- If you’re memory starved and have few sstables on disk, raise the fp chance and use the RAM for page cache

• WITH bloom_filter_fp_chance=0.01
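To get a feel for the RAM cost, the textbook sizing formula for an optimally configured bloom filter is bits per key = -ln(p) / (ln 2)^2. The sketch below applies that idealized formula; it is not Cassandra's own sizing code, and the function names are hypothetical.

```python
import math


def bloom_bits_per_key(fp_chance: float) -> float:
    """Idealized bits per key for an optimal bloom filter at this false-positive rate."""
    return -math.log(fp_chance) / (math.log(2) ** 2)


def bloom_megabytes(num_keys: int, fp_chance: float) -> float:
    """Approximate filter size in MB for num_keys partition keys."""
    return num_keys * bloom_bits_per_key(fp_chance) / 8 / 1024 / 1024


if __name__ == "__main__":
    for fp in (0.1, 0.01, 0.001):
        print(fp, round(bloom_bits_per_key(fp), 2), "bits/key,",
              round(bloom_megabytes(100_000_000, fp), 1), "MB per 100M keys")
```

Under this idealized formula, each tenfold reduction in fp chance adds the same number of bits per key, so 0.01 costs about twice the bits of 0.1. Measured memory in a real cluster can differ from that ratio, because implementations round the filter parameters.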



Table Option #3: Key Cache

• There’s a row cache; don’t use it
• The key cache helps find the data in the sstable quickly
• If you set the key cache low, there’s a good chance the OS page cache will help, but key cache will be faster

• WITH caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}



Table Option #4: Partition Summary / Index

• Maps row key to offset in data file
• It’s not every row key; it’s a sorted sampling
• You can tune the sample parameters: max_index_interval, min_index_interval
• Cassandra will adapt the sample based on sstable read hotness: more frequently read sstables will get a denser index for more accurate locations on disk
• Again, primarily a RAM tradeoff: lower interval = more RAM = less IO

• WITH max_index_interval = 2048 AND min_index_interval = 128
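The RAM versus IO tradeoff of a sampled index can be shown with a toy model: keep every Nth entry of the sorted index in memory, binary-search the sample, then scan forward. This is a sketch under simplified assumptions (an in-memory list stands in for the on-disk index), and `build_summary` and `lookup` are hypothetical names.

```python
import bisect


def build_summary(sorted_index, interval):
    """Keep every interval-th (key, offset) entry of the full sorted index."""
    return sorted_index[::interval]


def lookup(sorted_index, summary, interval, key):
    """Return (offset, entries_scanned) for key, or (None, entries_scanned)."""
    sampled_keys = [k for k, _ in summary]
    # Find the last sampled key that is <= the key we want.
    slot = bisect.bisect_right(sampled_keys, key) - 1
    if slot < 0:
        return None, 0
    start = slot * interval
    scanned = 0
    for k, offset in sorted_index[start:start + interval]:
        scanned += 1
        if k == key:
            return offset, scanned
    return None, scanned


if __name__ == "__main__":
    index = [(f"key{i:04d}", i * 100) for i in range(1000)]
    for interval in (128, 8):
        summary = build_summary(index, interval)
        print(interval, len(summary), lookup(index, summary, interval, "key0300"))
```

A smaller interval keeps more sampled entries in memory but scans fewer index entries per lookup, which is the "lower interval = more RAM = less IO" tradeoff.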



Table Option #5: Compression

• The sstable is compressed chunk-by-chunk as it’s written (either during flush or compaction)
• Compression offsets are mapped like index offsets
• Larger chunks typically mean better compression ratios for most data sets
• Smaller chunks mean that if you do go to disk, you have less over-read
• Very literal tradeoff between disk IO and storage capacity: larger chunks = better ratios, but you may have to read larger chunks off the disk when it’s not cached in RAM
• Data size dependent: 64k read for 500 bytes of data may severely limit your read performance

• WITH compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
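The over-read in the 500-byte example is simple arithmetic. The sketch below assumes the row starts at a chunk boundary and counts uncompressed bytes that must be decompressed; `read_amplification` is a hypothetical helper.

```python
def read_amplification(row_bytes: int, chunk_length_kb: int) -> float:
    """Uncompressed bytes decompressed per byte of data actually wanted."""
    chunk_bytes = chunk_length_kb * 1024
    # Ceiling division: a read always touches whole chunks.
    chunks_needed = -(-row_bytes // chunk_bytes)
    return chunks_needed * chunk_bytes / row_bytes


if __name__ == "__main__":
    for chunk_kb in (64, 16, 4):
        print(chunk_kb, "KB chunks:", round(read_amplification(500, chunk_kb), 1), "x")
```

For 500-byte rows, moving from 64 KB to 4 KB chunks cuts the over-read by a factor of 16, at the cost of a worse compression ratio.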



Table Option #6: Correctness (CRC)

• Compressed tables have a checksum embedded in the compression data
• Cassandra can verify that checksum on decompression, IF you want

• WITH crc_check_chance = 1.0

• Uncompressed files have NO CORRECTNESS VALIDATION in the read path – if you have disk based bit rot, Cassandra won’t know unless you run manual sstable verify (nodetool verify)
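A probabilistic checksum check can be sketched as follows. This is an illustration only: zlib and CRC32 stand in for whatever compressor and checksum the sstable format actually uses, and `write_chunk` / `read_chunk` are hypothetical names.

```python
import random
import zlib


def write_chunk(data: bytes):
    """Compress a chunk and record a checksum of the compressed bytes."""
    compressed = zlib.compress(data)
    return compressed, zlib.crc32(compressed)


def read_chunk(compressed: bytes, stored_crc: int,
               crc_check_chance: float = 1.0, rng=random.random) -> bytes:
    """Verify the checksum with probability crc_check_chance, then decompress."""
    if rng() < crc_check_chance and zlib.crc32(compressed) != stored_crc:
        raise IOError("checksum mismatch: chunk is corrupt")
    return zlib.decompress(compressed)


if __name__ == "__main__":
    chunk, crc = write_chunk(b"sensor reading " * 100)
    print(len(read_chunk(chunk, crc)))
    damaged = bytes([chunk[0] ^ 0x01]) + chunk[1:]   # simulate bit rot
    try:
        read_chunk(damaged, crc, crc_check_chance=1.0)
    except IOError as err:
        print(err)
```

With a chance below 1.0, some reads skip verification, trading a little CPU for the risk of not noticing corruption on that read.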



Table Option #7: Clustering

• Each partition is written once in the file
• Values in the partition are sorted based on clustering order
• In CQL3 terms, this means clustering key values

• Because records are sorted when written, retrieving a range of clustering keys is incredibly fast (nearly free)

• Normal sort order is ascending! If you need descending, flip the order in the schema so the read path can do a single linear pass:

• WITH CLUSTERING ORDER BY (sensor_reading_timestamp DESC)
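Why sorted storage makes range reads cheap can be shown with a sorted list and binary search. This is a sketch with a Python list standing in for a partition; `range_slice` and `newest` are hypothetical helpers.

```python
import bisect


def range_slice(rows_asc, start, end):
    """rows_asc: (clustering_key, value) pairs sorted ascending; inclusive range."""
    keys = [k for k, _ in rows_asc]
    lo = bisect.bisect_left(keys, start)
    hi = bisect.bisect_right(keys, end)
    # One contiguous slice: no filtering or re-sorting needed.
    return rows_asc[lo:hi]


def newest(rows_desc, n):
    """With descending storage order, the newest n rows are simply the first n."""
    return rows_desc[:n]


if __name__ == "__main__":
    rows = [(ts, f"reading-{ts}") for ts in range(10)]
    print(range_slice(rows, 3, 5))
    print(newest(sorted(rows, reverse=True), 2))
```

Storing the partition in descending order turns "latest N readings" into a read from the front of the partition instead of a scan to its end.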



Table Option #8: Correctness (Read Repair)

• Depending on your consistency level, the coordinator will ask multiple replicas for the data
• One will return the data; others will return a digest
• If the digest doesn’t match the data, the coordinator will choose the value with the highest timestamp and make sure all replicas have that value. You cannot disable this type of read repair, except by querying with CL ONE

• If the digests do match for the replicas queried, but you’re using CL < ALL, you can have Cassandra read-repair that cell anyway, just in case:

• WITH dclocal_read_repair_chance = 0.01 AND read_repair_chance = 0.0
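The digest comparison and repair can be sketched with a dict of replicas. This is a toy model, not the coordinator's real protocol: `digest` and `coordinate_read` are hypothetical names, and MD5 is used only as an example hash.

```python
import hashlib


def digest(value, timestamp) -> str:
    """Short fingerprint a replica can send instead of the full data."""
    return hashlib.md5(f"{value}:{timestamp}".encode()).hexdigest()


def coordinate_read(replicas):
    """replicas: name -> (value, timestamp). Returns (value, names_repaired)."""
    names = list(replicas)
    data = replicas[names[0]]                      # first replica sends full data
    expected = digest(*data)
    mismatch = any(digest(*replicas[n]) != expected for n in names[1:])
    if not mismatch:
        return data[0], []
    # Digests disagree: the highest timestamp wins and stale replicas are updated.
    winner = max(replicas.values(), key=lambda cell: cell[1])
    stale = [n for n in names if replicas[n] != winner]
    for name in stale:
        replicas[name] = winner
    return winner[0], stale


if __name__ == "__main__":
    replicas = {"a": ("v1", 1), "b": ("v2", 2), "c": ("v1", 1)}
    print(coordinate_read(replicas))   # repairs the stale replicas
    print(coordinate_read(replicas))   # digests now match, nothing to repair
```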



Table Option #9: Avoiding Timeouts

• Typical Cassandra use cases have RF > 1
• You may ask for data from X nodes, where X < RF
• If one of those nodes is slow to respond (query load, compaction load, JVM GC), Cassandra can try other replicas before waiting for the full 10s timeout
• “Speculative Retry” is configurable based on logical time limits, like 99% latencies

• WITH speculative_retry = '99PERCENTILE'

• Watch out: speculative retry may violate LOCAL_ datacenter consistency levels (for now)
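The percentile-based trigger can be sketched as: track recent read latencies, and if the current request has been waiting longer than the chosen percentile, send the read to another replica. This is a simplified illustration using a nearest-rank percentile; `percentile` and `should_speculate` are hypothetical helpers.

```python
def percentile(samples, pct: int):
    """Nearest-rank percentile of samples (pct is an integer from 1 to 100)."""
    ordered = sorted(samples)
    # Integer ceiling of pct * len / 100, avoiding floating-point rounding.
    rank = max(1, -(-pct * len(ordered) // 100))
    return ordered[rank - 1]


def should_speculate(elapsed_ms, recent_latencies_ms, pct: int = 99) -> bool:
    """True once a request has waited longer than the pct-th percentile latency."""
    return elapsed_ms > percentile(recent_latencies_ms, pct)


if __name__ == "__main__":
    recent = list(range(1, 101))          # pretend latencies of 1..100 ms
    print(percentile(recent, 99))
    print(should_speculate(120, recent))  # slow replica: try another one
    print(should_speculate(50, recent))   # within normal range: keep waiting
```

Because the threshold follows observed latencies, the retry fires only for requests that are unusually slow for that table, not on a fixed timer.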

Lots of Options, Lots of Flexibility

• Choose compaction based on write / read PATTERNS
• Choose bloom filter FP chance based on read latency and memory available
• Enable the key cache, but probably not the row cache
• You can tune the index interval if you have really hot and really cold sstables
• Compression chunk size can control how much data you read off of the disk at a time, or how well your data compresses
• Compression gives you CRCs to guard against corruption, and you can tune whether or not they’re used
• SSTables are inherently sorted, use clustering order options as it fits your data
• Foreground read repair can’t be disabled, but background read repair can be used to help speed up ‘eventual’ consistency
• Speculative retry can help avoid timeouts and/or reduce your 99.9% latencies

That’s it!

• You can talk to me about Cassandra on Twitter ( @jjirsa )
• There’s an active Cassandra community in IRC: irc.freenode.net #cassandra

• CrowdStrike is hiring: www.crowdstrike.com/careers/

• Huge thanks to Datastax and Hulu for making this meetup happen!
