Always On: Building Highly Available Applications on Cassandra
TRANSCRIPT
Robbie Strickland
Who Am I?
Robbie Strickland
VP, Software Engineering
[email protected] @rs_atl
An IBM Business
• Contributor to C* community since 2010
• DataStax MVP 2014/15/16
• Author, Cassandra High Availability & Cassandra 3.x High Availability
• Founder, ATL Cassandra User Group
What is HA?
• Five nines – 99.999% uptime?
– Roughly 5 minutes of down time per year
• Even three nines – 99.9% – means roughly 9 hours per year…
– … or a full work day of down time!
• Can we do better?
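The uptime arithmetic above can be checked in a few lines (a minimal sketch of standard availability math, not code from the talk):

```python
# Allowed downtime per year at "N nines" of availability.
SECONDS_PER_YEAR = 365 * 24 * 3600

def downtime_seconds_per_year(nines: int) -> float:
    # e.g. nines=5 -> 99.999% availability -> 0.001% allowed downtime
    return SECONDS_PER_YEAR * 10 ** (-nines)

# Five nines is ~5.3 minutes per year; three nines is ~8.8 hours --
# roughly a full work day of down time.
print(round(downtime_seconds_per_year(5) / 60, 1))    # minutes at five nines
print(round(downtime_seconds_per_year(3) / 3600, 1))  # hours at three nines
```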
Cassandra + HA
• No SPOF
• Multi-DC replication
• Incremental backups
• Client-side failure handling
• Server-side failure handling
• Lots of JMX stats
HA by Design (it’s not an add-on)
• Properly designed topology
• Data model that respects C* architecture
• Application that handles failure
• Monitoring strategy with early warning
• DevOps mentality
Table Stakes
• NetworkTopologyStrategy
• GossipingPropertyFileSnitch
– Or [YourCloud]Snitch
• At least 5 nodes
• RF=3
• No load balancer
HA Topology
Consistency Basics
• Start with LOCAL_QUORUM reads & writes
– Balances performance & availability, and provides single-DC full consistency
– Experiment with eventual consistency (e.g. CL=ONE) in a controlled environment
• Avoid non-local CLs in multi-DC environments
– Otherwise it’s a crap shoot
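The overlap argument behind that advice can be sketched as follows (helper names are illustrative, not driver API):

```python
# LOCAL_QUORUM within a DC is floor(RF/2) + 1 replicas. With both reads and
# writes at quorum, the read and write replica sets must overlap (R + W > RF),
# which is what makes the local DC fully consistent.

def quorum(rf: int) -> int:
    return rf // 2 + 1

def overlapping(rf: int, read_replicas: int, write_replicas: int) -> bool:
    # True if any read is guaranteed to touch at least one replica
    # that acknowledged the latest write.
    return read_replicas + write_replicas > rf

rf = 3
print(quorum(rf))                               # 2
print(overlapping(rf, quorum(rf), quorum(rf)))  # True: quorum reads + writes
print(overlapping(rf, 1, 1))                    # False: CL=ONE is eventual
```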
Rack Failure
• Don’t put all your nodes in one rack!
• Use rack awareness
– Places replicas in different racks
• But don’t use RackInferringSnitch
Rack Awareness
[Diagram: replicas R1, R2, R3 spread across Rack A and Rack B]
GossipingPropertyFileSnitch – cassandra-rackdc.properties:
# rack A nodes
dc=dc1
rack=a
# rack B nodes
dc=dc1
rack=b
Rack Awareness (Cloud Edition)
[Diagram: replicas R1, R2, R3 spread across Availability Zone A and Availability Zone B]
[YourCloud]Snitch (it’s automagic!)
Data Center Replication
[Diagram: replication between dc=us-1 and dc=eu-1]
CREATE KEYSPACE myKeyspace
WITH REPLICATION = {
  'class': 'NetworkTopologyStrategy',
  'us-1': 3,
  'eu-1': 3
};
Multi-DC Consistency?
[Diagram: dc=us-1 and dc=eu-1, each receiving LOCAL_QUORUM reads & writes]
Each DC is fully consistent within itself… but the two DCs are only eventually consistent with each other.
Multi-DC Routing with LOCAL CL
[Diagram: each client app routes only to its local DC – us-1 or eu-1]
Multi-DC Routing with non-LOCAL CL
[Diagram: client apps route requests to both DCs, crossing the WAN]
Multi-DC Routing
• Use DCAwareRoundRobinPolicy wrapped by TokenAwarePolicy
– This is the default
– Prefers local DC – chosen based on host distance and seed list
– BUT this can fail for logical DCs that are physically co-located, or for improperly defined seed lists!
Multi-DC Routing
Pro tip:
val localDC = // get from config
val dcPolicy =
  new TokenAwarePolicy(
    DCAwareRoundRobinPolicy.builder()
      .withLocalDc(localDC)
      .build())
Be explicit!!
Handling DC Failure
• Make sure backup DC has sufficient capacity
– Don’t try to add capacity on the fly!
• Try to limit updates
– Avoids potential consistency issues on recovery
• Be careful with retry logic
– Isolate it to a single point in the stack
– Don’t DDoS yourself with retries!
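The retry advice above can be sketched as one bounded helper (names are illustrative; the jittered exponential backoff is an assumed, though common, policy):

```python
import random
import time

def with_retries(op, max_attempts=3, base_delay=0.1, sleep=time.sleep):
    """Run op(); on failure, back off exponentially (with jitter) and retry."""
    for attempt in range(1, max_attempts + 1):
        try:
            return op()
        except Exception:
            if attempt == max_attempts:
                raise  # give up: never retry forever against a struggling DC
            # Jittered exponential backoff caps the aggregate retry load.
            sleep(base_delay * 2 ** (attempt - 1) * random.uniform(0.5, 1.0))
```

Keeping every retry inside this one helper is the "single point in the stack" the slide calls for; injecting `sleep` also keeps the policy testable.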
Topology Lessons
• Leverage rack awareness
• Use LOCAL_QUORUM
– Full local consistency
– Eventual consistency across DCs
• Run incremental repairs to maintain inter-DC consistency
• Explicitly route local app to local C* DC
• Plan for DC failure
Data Modeling
Quick Primer
• C* is a distributed hash table
– Partition key (first field in PK declaration) determines placement in the cluster
– Efficient queries MUST know the key!
• Data for a given partition is naturally sorted based on clustering columns
• Column range scans are efficient
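A toy ring makes the placement point concrete (illustrative only: Cassandra uses Murmur3 tokens and vnodes, not MD5 and three fixed tokens):

```python
import hashlib
from bisect import bisect_right

# Three fixed tokens stand in for the cluster's token ring.
ring = [(2 ** 30, "node1"), (2 ** 60, "node2"), (2 ** 100, "node3")]
tokens = [t for t, _ in ring]

def token(partition_key: str) -> int:
    # Hash only the partition key -- clustering columns play no part in placement.
    return int(hashlib.md5(partition_key.encode()).hexdigest(), 16)

def owner(partition_key: str) -> str:
    # Next node clockwise from the key's token, wrapping around the ring.
    i = bisect_right(tokens, token(partition_key)) % len(ring)
    return ring[i][1]

# The same key always hashes to the same place -- which is why efficient
# queries MUST supply the partition key.
print(owner("sensor-42") == owner("sensor-42"))  # True
```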
Quick Primer
• All writes are immutable
– Deletes create tombstones
– Updates do not immediately purge old data
– Compaction has to sort all this out
Who Cares?
• Bad performance = application downtime & lost users
• Lagging compaction is an operations nightmare
• Some models & query patterns create serious availability problems
Do
• Choose a partition key that distributes evenly
• Model your data based on common read patterns
• Denormalize using collections & materialized views
• Use efficient single-partition range queries
Don’t
• Create hot spots in either data or traffic patterns
• Build a relational data model
• Create an application-side join
• Run multi-node queries
• Use batches to group unrelated writes
Problem Case #1
SELECT *
FROM contacts
WHERE id IN (1,3,5,7)
[Diagram: 6-node cluster; replicas for keys 1–8 spread across the nodes]
Must ask 4 out of 6 nodes in the cluster to satisfy quorum!
“Not enough replicas available for query at consistency LOCAL_QUORUM”
1, 3, and 5 all have sufficient replicas, yet the entire query fails because of 7
Solution #1
• Option 1: Be optimistic and run it anyway
– If it fails, you can fall back to option 2
• Option 2: Run parallel queries for each key
– Return the results that are available
– Fall back to CL ONE for failed keys
– Client token awareness means coordinator does less work
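Option 2 can be sketched as follows; `fetch(key, cl)` is a hypothetical stand-in for a single-partition driver query, not a real driver call:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_all(keys, fetch):
    """One single-partition query per key, in parallel, with a CL fallback."""
    def one(key):
        try:
            return key, fetch(key, "LOCAL_QUORUM")
        except Exception:
            try:
                return key, fetch(key, "ONE")  # degraded but available
            except Exception:
                return key, None               # return what we could get
    with ThreadPoolExecutor(max_workers=len(keys)) as pool:
        return dict(pool.map(one, keys))

def fake_fetch(key, cl):
    # Simulates key 7's replicas being unable to meet quorum.
    if key == 7 and cl == "LOCAL_QUORUM":
        raise RuntimeError("UnavailableException")
    return f"row-{key}@{cl}"

print(fetch_all([1, 3, 5, 7], fake_fetch)[7])  # row-7@ONE
```

The per-key split means keys 1, 3, and 5 return at full consistency while only 7 degrades, instead of the whole IN query failing.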
Problem Case #2
CREATE INDEX ON contacts(birth_year);

SELECT *
FROM contacts
WHERE birth_year=1975
[Diagram: 6-node cluster; the birth_year=1975 index entries (Jim, Sue, Sam, Tim) live alongside the source data on each node]
Index lives with the source data… so 5 nodes must be queried!
“Not enough replicas available for query at consistency LOCAL_QUORUM”
Solution #2
• Option 1: Build your own index
– App has to maintain the index
• Option 2: Use a materialized view
– Not available before 3.0
• Option 3: Run it anyway
– Ok for small amounts of data (think 10s to 100s of rows) that can live in memory
– Good for parallel analytics jobs (Spark, Hadoop, etc.)
Problem Case #3
CREATE TABLE sensor_readings (
  sensorID uuid,
  timestamp int,
  reading decimal,
  PRIMARY KEY (sensorID, timestamp)
) WITH CLUSTERING ORDER BY (timestamp DESC);
Problem Case #3
• Partition will grow unbounded
– i.e. it creates wide rows
• Unsustainable number of columns in each partition
• No way to archive off old data
Solution #3
CREATE TABLE sensor_readings (
  sensorID uuid,
  time_bucket int,
  timestamp int,
  reading decimal,
  PRIMARY KEY ((sensorID, time_bucket), timestamp)
) WITH CLUSTERING ORDER BY (timestamp DESC);
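On the application side, `time_bucket` might be derived like this (daily buckets are an assumed width; any bound that keeps partitions manageable works):

```python
SECONDS_PER_DAY = 86_400

def time_bucket(epoch_seconds: int) -> int:
    # All readings from one sensor on one day share a partition
    # (sensorID, time_bucket); the next day starts a fresh, bounded
    # partition, and whole old buckets can be archived or dropped.
    return epoch_seconds // SECONDS_PER_DAY

print(time_bucket(86_399), time_bucket(86_400))  # 0 1
```

Reads must now supply both sensorID and time_bucket, so queries spanning days issue one single-partition query per bucket.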
Monitoring
Monitoring Basics
• Enable remote JMX
• Connect a stats collector (jmxtrans, collectd, etc.)
• Use nodetool for quick single-node queries
• C* tells you pretty much everything via JMX
Thread Pools
• C* is a SEDA architecture
– Essentially message queues feeding thread pools
– nodetool tpstats
• Pending messages are bad:

Pool Name             Active  Pending   Completed  Blocked  All time blocked
CounterMutationStage  0       0         0          0        0
ReadStage             0       0         103        0        0
RequestResponseStage  0       0         0          0        0
MutationStage         0       13234794  0          0        0
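A collector-side check for that signal might look like this (a sketch; the parsing assumes the column layout shown above):

```python
def pending_pools(tpstats_output: str) -> dict:
    """Return {pool_name: pending} for every pool with a pending backlog."""
    flagged = {}
    for line in tpstats_output.strip().splitlines()[1:]:  # skip header row
        parts = line.split()
        # Pool rows are: name, active, pending, completed, blocked, ...
        if len(parts) >= 3 and parts[1].isdigit() and parts[2].isdigit():
            name, pending = parts[0], int(parts[2])
            if pending > 0:
                flagged[name] = pending
    return flagged

sample = """Pool Name Active Pending Completed Blocked All time blocked
CounterMutationStage 0 0 0 0 0
ReadStage 0 0 103 0 0
RequestResponseStage 0 0 0 0 0
MutationStage 0 13234794 0 0 0"""

print(pending_pools(sample))  # {'MutationStage': 13234794}
```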
Lagging Compaction
• Lagging compaction is the reason for many performance issues
• Reads can grind to a halt in the worst case
• Use nodetool tablestats/cfstats & compactionstats
Lagging Compaction
• Size-Tiered: watch for high SSTable counts:

Keyspace: my_keyspace
  Read Count: 11207
  Read Latency: 0.047931114482020164 ms.
  Write Count: 17598
  Write Latency: 0.053502954881236506 ms.
  Pending Flushes: 0
    Table: my_table
    SSTable count: 84
Lagging Compaction
• Leveled: watch for SSTables remaining in L0:

Keyspace: my_keyspace
  Read Count: 11207
  Read Latency: 0.047931114482020164 ms.
  Write Count: 17598
  Write Latency: 0.053502954881236506 ms.
  Pending Flushes: 0
    Table: my_table
    SSTable count: 70
    SSTables in each level: [50/4, 15/10, 5/100]

50 in L0 (should be 4)
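That L0 check can be automated with a small parser (a sketch; the `actual/target` line format is taken from the output above):

```python
import re

def l0_backlog(levels_line: str) -> int:
    """How far L0 is over its target SSTable count (0 means healthy)."""
    # "SSTables in each level: [50/4, 15/10, 5/100]" -> per-level
    # actual/target pairs; L0 is the first pair.
    pairs = re.findall(r"(\d+)/(\d+)", levels_line)
    actual, target = (int(n) for n in pairs[0])
    return max(0, actual - target)

print(l0_backlog("SSTables in each level: [50/4, 15/10, 5/100]"))  # 46
```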
Lagging Compaction Solution
• Triage:
– Check stats history to see if it’s a trend or a blip
– Increase compaction throughput using nodetool setcompactionthroughput
– Temporarily switch to Size-Tiered
• Do some digging:
– I/O problem?
– Add nodes?
Wide Rows / Hotspots
• Only takes one to wreak havoc
• It’s a data model problem
• Early detection is key!
• Watch partition max bytes
– Make sure it doesn’t grow unbounded
– … or become significantly larger than mean bytes
Wide Rows / Hotspots
• Use nodetool toppartitions to sample reads/writes and find the offending partition
• Take action early to avoid OOM issues with:
– Compaction
– Streaming
– Reads
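The max-vs-mean partition check from the previous slide can be sketched as follows (the stats dict and the 10x threshold are illustrative assumptions; the numbers come from nodetool tablestats' compacted partition maximum/mean bytes):

```python
def wide_partition_alerts(table_stats: dict, ratio_threshold: float = 10.0) -> list:
    """Flag tables whose max partition size far exceeds the mean."""
    alerts = []
    for table, (max_bytes, mean_bytes) in table_stats.items():
        if mean_bytes > 0 and max_bytes / mean_bytes >= ratio_threshold:
            alerts.append(table)
    return alerts

stats = {
    "contacts":        (50_000, 40_000),         # healthy: max ~ mean
    "sensor_readings": (2_000_000_000, 60_000),  # one partition grew unbounded
}
print(wide_partition_alerts(stats))  # ['sensor_readings']
```

Feeding this from the JMX stats collector gives the early warning the slide asks for, before compaction or reads hit OOM.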
For More Info…
(shameless book plug)
Thanks!
Robbie Strickland
[email protected] @rs_atl
An IBM Business