Always On: Building Highly Available Applications on Cassandra
TRANSCRIPT
Robbie Strickland
Who Am I?
Robbie Strickland
VP, Software Engineering
[email protected] @rs_atl
An IBM Business
• Contributor to C* community since 2010
• DataStax MVP 2014/15/16
• Author, Cassandra High Availability & Cassandra 3.x High Availability
• Founder, ATL Cassandra User Group
What is HA?
• Five nines – 99.999% uptime?
– Roughly 5 minutes of down time per year
• Even three nines – 99.9% – means roughly 9 hours per year…
– … or a full work day of down time!
• Can we do better?
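The uptime arithmetic above can be checked in a few lines (a minimal sketch of standard availability math, not code from the talk):

```python
# Allowed downtime per year at "N nines" of availability.
SECONDS_PER_YEAR = 365 * 24 * 3600

def downtime_seconds_per_year(nines: int) -> float:
    # e.g. nines=5 -> 99.999% availability -> 0.001% allowed downtime
    return SECONDS_PER_YEAR * 10 ** (-nines)

# Five nines is ~5.3 minutes per year; three nines is ~8.8 hours --
# roughly a full work day of down time.
print(round(downtime_seconds_per_year(5) / 60, 1))    # minutes at five nines
print(round(downtime_seconds_per_year(3) / 3600, 1))  # hours at three nines
```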
Cassandra + HA
• No SPOF
• Multi-DC replication
• Incremental backups
• Client-side failure handling
• Server-side failure handling
• Lots of JMX stats
HA by Design (it’s not an add-on)
• Properly designed topology
• Data model that respects C* architecture
• Application that handles failure
• Monitoring strategy with early warning
• DevOps mentality
Table Stakes
• NetworkTopologyStrategy
• GossipingPropertyFileSnitch
– Or [YourCloud]Snitch
• At least 5 nodes
• RF=3
• No load balancer
HA Topology
Consistency Basics
• Start with LOCAL_QUORUM reads & writes
– Balances performance & availability, and provides single-DC full consistency
– Experiment with eventual consistency (e.g. CL=ONE) in a controlled environment
• Avoid non-local CLs in multi-DC environments
– Otherwise it’s a crap shoot
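The overlap argument behind that advice can be sketched as follows (helper names are illustrative, not driver API):

```python
# LOCAL_QUORUM within a DC is floor(RF/2) + 1 replicas. With both reads and
# writes at quorum, the read and write replica sets must overlap (R + W > RF),
# which is what makes the local DC fully consistent.

def quorum(rf: int) -> int:
    return rf // 2 + 1

def overlapping(rf: int, read_replicas: int, write_replicas: int) -> bool:
    # True if any read is guaranteed to touch at least one replica
    # that acknowledged the latest write.
    return read_replicas + write_replicas > rf

rf = 3
print(quorum(rf))                               # 2
print(overlapping(rf, quorum(rf), quorum(rf)))  # True: quorum reads + writes
print(overlapping(rf, 1, 1))                    # False: CL=ONE is eventual
```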
Rack Failure
• Don’t put all your nodes in one rack!
• Use rack awareness
– Places replicas in different racks
• But don’t use RackInferringSnitch
Rack Awareness
[Diagram: replicas R1, R2, R3 spread across Rack A and Rack B]
GossipingPropertyFileSnitch – cassandra-rackdc.properties:
# rack A nodes
dc=dc1
rack=a
# rack B nodes
dc=dc1
rack=b
Rack Awareness (Cloud Edition)
[Diagram: replicas R1, R2, R3 spread across Availability Zone A and Availability Zone B]
[YourCloud]Snitch (it’s automagic!)
Data Center Replication
[Diagram: replication between dc=us-1 and dc=eu-1]
CREATE KEYSPACE myKeyspace
WITH REPLICATION = {
  'class': 'NetworkTopologyStrategy',
  'us-1': 3,
  'eu-1': 3
};
Multi-DC Consistency?
[Diagram: dc=us-1 and dc=eu-1, each receiving LOCAL_QUORUM reads & writes]
Each DC is fully consistent within itself… but the two DCs are only eventually consistent with each other.
Multi-DC Routing with LOCAL CL
[Diagram: each client app routes only to its local DC – us-1 or eu-1]
Multi-DC Routing with non-LOCAL CL
[Diagram: client apps route requests to both DCs, crossing the WAN]
Multi-DC Routing
• Use DCAwareRoundRobinPolicy wrapped by TokenAwarePolicy
– This is the default
– Prefers local DC – chosen based on host distance and seed list
– BUT this can fail for logical DCs that are physically co-located, or for improperly defined seed lists!
Multi-DC Routing
Pro tip:
val localDC = // get from config
val dcPolicy =
  new TokenAwarePolicy(
    DCAwareRoundRobinPolicy.builder()
      .withLocalDc(localDC)
      .build())
Be explicit!!
Handling DC Failure
• Make sure backup DC has sufficient capacity
– Don’t try to add capacity on the fly!
• Try to limit updates
– Avoids potential consistency issues on recovery
• Be careful with retry logic
– Isolate it to a single point in the stack
– Don’t DDoS yourself with retries!
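The retry advice above can be sketched as one bounded helper (names are illustrative; the jittered exponential backoff is an assumed, though common, policy):

```python
import random
import time

def with_retries(op, max_attempts=3, base_delay=0.1, sleep=time.sleep):
    """Run op(); on failure, back off exponentially (with jitter) and retry."""
    for attempt in range(1, max_attempts + 1):
        try:
            return op()
        except Exception:
            if attempt == max_attempts:
                raise  # give up: never retry forever against a struggling DC
            # Jittered exponential backoff caps the aggregate retry load.
            sleep(base_delay * 2 ** (attempt - 1) * random.uniform(0.5, 1.0))
```

Keeping every retry inside this one helper is the "single point in the stack" the slide calls for; injecting `sleep` also keeps the policy testable.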
Topology Lessons
• Leverage rack awareness
• Use LOCAL_QUORUM
– Full local consistency
– Eventual consistency across DCs
• Run incremental repairs to maintain inter-DC consistency
• Explicitly route local app to local C* DC
• Plan for DC failure
Data Modeling
Quick Primer
• C* is a distributed hash table
– Partition key (first field in PK declaration) determines placement in the cluster
– Efficient queries MUST know the key!
• Data for a given partition is naturally sorted based on clustering columns
• Column range scans are efficient
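A toy ring makes the placement point concrete (illustrative only: Cassandra uses Murmur3 tokens and vnodes, not MD5 and three fixed tokens):

```python
import hashlib
from bisect import bisect_right

# Three fixed tokens stand in for the cluster's token ring.
ring = [(2 ** 30, "node1"), (2 ** 60, "node2"), (2 ** 100, "node3")]
tokens = [t for t, _ in ring]

def token(partition_key: str) -> int:
    # Hash only the partition key -- clustering columns play no part in placement.
    return int(hashlib.md5(partition_key.encode()).hexdigest(), 16)

def owner(partition_key: str) -> str:
    # Next node clockwise from the key's token, wrapping around the ring.
    i = bisect_right(tokens, token(partition_key)) % len(ring)
    return ring[i][1]

# The same key always hashes to the same place -- which is why efficient
# queries MUST supply the partition key.
print(owner("sensor-42") == owner("sensor-42"))  # True
```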
Quick Primer
• All writes are immutable
– Deletes create tombstones
– Updates do not immediately purge old data
– Compaction has to sort all this out
Who Cares?
• Bad performance = application downtime & lost users
• Lagging compaction is an operations nightmare
• Some models & query patterns create serious availability problems
Do
• Choose a partition key that distributes evenly
• Model your data based on common read patterns
• Denormalize using collections & materialized views
• Use efficient single-partition range queries
Don’t
• Create hot spots in either data or traffic patterns
• Build a relational data model
• Create an application-side join
• Run multi-node queries
• Use batches to group unrelated writes
Problem Case #1
SELECT *
FROM contacts
WHERE id IN (1,3,5,7)
[Diagram: 6-node cluster; replicas for keys 1–8 spread across the nodes]
Must ask 4 out of 6 nodes in the cluster to satisfy quorum!
“Not enough replicas available for query at consistency LOCAL_QUORUM”
1, 3, and 5 all have sufficient replicas, yet the entire query fails because of 7
Solution #1
• Option 1: Be optimistic and run it anyway
– If it fails, you can fall back to option 2
• Option 2: Run parallel queries for each key
– Return the results that are available
– Fall back to CL ONE for failed keys
– Client token awareness means coordinator does less work
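Option 2 can be sketched as follows; `fetch(key, cl)` is a hypothetical stand-in for a single-partition driver query, not a real driver call:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_all(keys, fetch):
    """One single-partition query per key, in parallel, with a CL fallback."""
    def one(key):
        try:
            return key, fetch(key, "LOCAL_QUORUM")
        except Exception:
            try:
                return key, fetch(key, "ONE")  # degraded but available
            except Exception:
                return key, None               # return what we could get
    with ThreadPoolExecutor(max_workers=len(keys)) as pool:
        return dict(pool.map(one, keys))

def fake_fetch(key, cl):
    # Simulates key 7's replicas being unable to meet quorum.
    if key == 7 and cl == "LOCAL_QUORUM":
        raise RuntimeError("UnavailableException")
    return f"row-{key}@{cl}"

print(fetch_all([1, 3, 5, 7], fake_fetch)[7])  # row-7@ONE
```

The per-key split means keys 1, 3, and 5 return at full consistency while only 7 degrades, instead of the whole IN query failing.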
Problem Case #2
CREATE INDEX ON contacts(birth_year);

SELECT *
FROM contacts
WHERE birth_year=1975
[Diagram: 6-node cluster; the birth_year=1975 index entries (Jim, Sue, Sam, Tim) live alongside the source data on each node]
Index lives with the source data… so 5 nodes must be queried!
“Not enough replicas available for query at consistency LOCAL_QUORUM”
Solution #2
• Option 1: Build your own index
– App has to maintain the index
• Option 2: Use a materialized view
– Not available before 3.0
• Option 3: Run it anyway
– Ok for small amounts of data (think 10s to 100s of rows) that can live in memory
– Good for parallel analytics jobs (Spark, Hadoop, etc.)
Problem Case #3
CREATE TABLE sensor_readings (
  sensorID uuid,
  timestamp int,
  reading decimal,
  PRIMARY KEY (sensorID, timestamp)
) WITH CLUSTERING ORDER BY (timestamp DESC);
Problem Case #3
• Partition will grow unbounded
– i.e. it creates wide rows
• Unsustainable number of columns in each partition
• No way to archive off old data
Solution #3
CREATE TABLE sensor_readings (
  sensorID uuid,
  time_bucket int,
  timestamp int,
  reading decimal,
  PRIMARY KEY ((sensorID, time_bucket), timestamp)
) WITH CLUSTERING ORDER BY (timestamp DESC);
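On the application side, `time_bucket` might be derived like this (daily buckets are an assumed width; any bound that keeps partitions manageable works):

```python
SECONDS_PER_DAY = 86_400

def time_bucket(epoch_seconds: int) -> int:
    # All readings from one sensor on one day share a partition
    # (sensorID, time_bucket); the next day starts a fresh, bounded
    # partition, and whole old buckets can be archived or dropped.
    return epoch_seconds // SECONDS_PER_DAY

print(time_bucket(86_399), time_bucket(86_400))  # 0 1
```

Reads must now supply both sensorID and time_bucket, so queries spanning days issue one single-partition query per bucket.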
Monitoring
Monitoring Basics
• Enable remote JMX
• Connect a stats collector (jmxtrans, collectd, etc.)
• Use nodetool for quick single-node queries
• C* tells you pretty much everything via JMX
Thread Pools
• C* is a SEDA architecture
– Essentially message queues feeding thread pools
– nodetool tpstats
• Pending messages are bad:

Pool Name             Active  Pending   Completed  Blocked  All time blocked
CounterMutationStage  0       0         0          0        0
ReadStage             0       0         103        0        0
RequestResponseStage  0       0         0          0        0
MutationStage         0       13234794  0          0        0
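A collector-side check for that signal might look like this (a sketch; the parsing assumes the column layout shown above):

```python
def pending_pools(tpstats_output: str) -> dict:
    """Return {pool_name: pending} for every pool with a pending backlog."""
    flagged = {}
    for line in tpstats_output.strip().splitlines()[1:]:  # skip header row
        parts = line.split()
        # Pool rows are: name, active, pending, completed, blocked, ...
        if len(parts) >= 3 and parts[1].isdigit() and parts[2].isdigit():
            name, pending = parts[0], int(parts[2])
            if pending > 0:
                flagged[name] = pending
    return flagged

sample = """Pool Name Active Pending Completed Blocked All time blocked
CounterMutationStage 0 0 0 0 0
ReadStage 0 0 103 0 0
RequestResponseStage 0 0 0 0 0
MutationStage 0 13234794 0 0 0"""

print(pending_pools(sample))  # {'MutationStage': 13234794}
```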
Lagging Compaction
• Lagging compaction is the reason for many performance issues
• Reads can grind to a halt in the worst case
• Use nodetool tablestats/cfstats & compactionstats
Lagging Compaction
• Size-Tiered: watch for high SSTable counts:

Keyspace: my_keyspace
  Read Count: 11207
  Read Latency: 0.047931114482020164 ms.
  Write Count: 17598
  Write Latency: 0.053502954881236506 ms.
  Pending Flushes: 0
    Table: my_table
    SSTable count: 84
Lagging Compaction
• Leveled: watch for SSTables remaining in L0:

Keyspace: my_keyspace
  Read Count: 11207
  Read Latency: 0.047931114482020164 ms.
  Write Count: 17598
  Write Latency: 0.053502954881236506 ms.
  Pending Flushes: 0
    Table: my_table
    SSTable count: 70
    SSTables in each level: [50/4, 15/10, 5/100]

50 in L0 (should be 4)
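That L0 check can be automated with a small parser (a sketch; the `actual/target` line format is taken from the output above):

```python
import re

def l0_backlog(levels_line: str) -> int:
    """How far L0 is over its target SSTable count (0 means healthy)."""
    # "SSTables in each level: [50/4, 15/10, 5/100]" -> per-level
    # actual/target pairs; L0 is the first pair.
    pairs = re.findall(r"(\d+)/(\d+)", levels_line)
    actual, target = (int(n) for n in pairs[0])
    return max(0, actual - target)

print(l0_backlog("SSTables in each level: [50/4, 15/10, 5/100]"))  # 46
```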
Lagging Compaction Solution
• Triage:
– Check stats history to see if it’s a trend or a blip
– Increase compaction throughput using nodetool setcompactionthroughput
– Temporarily switch to Size-Tiered
• Do some digging:
– I/O problem?
– Add nodes?
Wide Rows / Hotspots
• Only takes one to wreak havoc
• It’s a data model problem
• Early detection is key!
• Watch partition max bytes
– Make sure it doesn’t grow unbounded
– … or become significantly larger than mean bytes
Wide Rows / Hotspots
• Use nodetool toppartitions to sample reads/writes and find the offending partition
• Take action early to avoid OOM issues with:
– Compaction
– Streaming
– Reads
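The max-vs-mean partition check from the previous slide can be sketched as follows (the stats dict and the 10x threshold are illustrative assumptions; the numbers come from nodetool tablestats' compacted partition maximum/mean bytes):

```python
def wide_partition_alerts(table_stats: dict, ratio_threshold: float = 10.0) -> list:
    """Flag tables whose max partition size far exceeds the mean."""
    alerts = []
    for table, (max_bytes, mean_bytes) in table_stats.items():
        if mean_bytes > 0 and max_bytes / mean_bytes >= ratio_threshold:
            alerts.append(table)
    return alerts

stats = {
    "contacts":        (50_000, 40_000),         # healthy: max ~ mean
    "sensor_readings": (2_000_000_000, 60_000),  # one partition grew unbounded
}
print(wide_partition_alerts(stats))  # ['sensor_readings']
```

Feeding this from the JMX stats collector gives the early warning the slide asks for, before compaction or reads hit OOM.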
For More Info…
(shameless book plug)
Thanks!
Robbie Strickland
[email protected] @rs_atl
An IBM Business