optimizing your cluster with coordinator nodes (eric lubow, simplereach) | cassandra summit 2016
TRANSCRIPT
![Page 1: Optimizing Your Cluster with Coordinator Nodes (Eric Lubow, SimpleReach) | Cassandra Summit 2016](https://reader033.vdocuments.net/reader033/viewer/2022051123/586f75c71a28ab10258b6189/html5/thumbnails/1.jpg)
SOMETHING SOMETHING COORDINATOR NODES
Cheating Our Way to Better Performance
![Page 2: Optimizing Your Cluster with Coordinator Nodes (Eric Lubow, SimpleReach) | Cassandra Summit 2016](https://reader033.vdocuments.net/reader033/viewer/2022051123/586f75c71a28ab10258b6189/html5/thumbnails/2.jpg)
JON HADDAD LEARN DATA MODELING BY EXAMPLE
THIS IS AWESOME!!
GO TO ROOM 210A NOW!
![Page 3: Optimizing Your Cluster with Coordinator Nodes (Eric Lubow, SimpleReach) | Cassandra Summit 2016](https://reader033.vdocuments.net/reader033/viewer/2022051123/586f75c71a28ab10258b6189/html5/thumbnails/3.jpg)
Eric Lubow @elubow #CassandraSummit
PERSONAL VANITY
๏ CTO of SimpleReach
๏ Co-Author of Practical Cassandra
๏ Skydiver, Mixed Martial Artist, Motorcyclist, Dog Dad (IG: @charliedognyc), NY Giants fan
![Page 4: Optimizing Your Cluster with Coordinator Nodes (Eric Lubow, SimpleReach) | Cassandra Summit 2016](https://reader033.vdocuments.net/reader033/viewer/2022051123/586f75c71a28ab10258b6189/html5/thumbnails/4.jpg)
Eric Lubow @elubow #CassandraSummit
SIMPLEREACH
๏ Help marketers organize
๏ Identify the best content
๏ Use engagement metrics
๏ Workflow solution
๏ Stream processing ingest
๏ Many metrics, time sliced
๏ Multiple data stores
![Page 5: Optimizing Your Cluster with Coordinator Nodes (Eric Lubow, SimpleReach) | Cassandra Summit 2016](https://reader033.vdocuments.net/reader033/viewer/2022051123/586f75c71a28ab10258b6189/html5/thumbnails/5.jpg)
Eric Lubow @elubow #CassandraSummit
CONCEPTS YOU SHOULD UNDERSTAND
1. Thick clients and thin clients
2. CPU utilization and load average
3. Database tuning may not have anything to do with the database
![Page 6: Optimizing Your Cluster with Coordinator Nodes (Eric Lubow, SimpleReach) | Cassandra Summit 2016](https://reader033.vdocuments.net/reader033/viewer/2022051123/586f75c71a28ab10258b6189/html5/thumbnails/6.jpg)
Eric Lubow @elubow #CassandraSummit
A fat client (also called heavy, rich, or thick client) is a computer (client) in client–server architecture or networks that typically provides rich functionality independent of the central server.
— Wikipedia
WHAT IS A FAT CLIENT?
![Page 7: Optimizing Your Cluster with Coordinator Nodes (Eric Lubow, SimpleReach) | Cassandra Summit 2016](https://reader033.vdocuments.net/reader033/viewer/2022051123/586f75c71a28ab10258b6189/html5/thumbnails/7.jpg)
Eric Lubow @elubow #CassandraSummit
• Thin clients typically canʼt operate without the “server”
• Thick clients try to do more locally compared to thin clients which try to do more remotely
• Thick clients require more resources, but fewer servers.
• Thin clients require fewer resources and more servers.
THIN CLIENTS AND THICK CLIENTS
![Page 8: Optimizing Your Cluster with Coordinator Nodes (Eric Lubow, SimpleReach) | Cassandra Summit 2016](https://reader033.vdocuments.net/reader033/viewer/2022051123/586f75c71a28ab10258b6189/html5/thumbnails/8.jpg)
Eric Lubow @elubow #CassandraSummit
• Fat clients have the Cassandra binary, but no data
• Data nodes are denser and more focused on storage
• For context, we can call them proxy nodes
• Proxy nodes are more compute heavy
• Fat clients only handles coordination
ONCE MORE, BUT WITH CASSANDRA
![Page 9: Optimizing Your Cluster with Coordinator Nodes (Eric Lubow, SimpleReach) | Cassandra Summit 2016](https://reader033.vdocuments.net/reader033/viewer/2022051123/586f75c71a28ab10258b6189/html5/thumbnails/9.jpg)
Eric Lubow @elubow #CassandraSummit
๏ Fat clients are effectively just changing settings
๏ < Cassandra 2.2 -Djoin_ring=false (hack)
๏ No data on the nodes, just coordination responsibility
๏ Intentionally sidestepping Cassandra homogenous nature in favor of performance
๏ Can be lots of room for adding proxy nodes without incurring additional performance loss from increasing the ring size
๏ Reduces per node work on the data nodes
WHAT’S REALLY GOING ON HERE?
![Page 10: Optimizing Your Cluster with Coordinator Nodes (Eric Lubow, SimpleReach) | Cassandra Summit 2016](https://reader033.vdocuments.net/reader033/viewer/2022051123/586f75c71a28ab10258b6189/html5/thumbnails/10.jpg)
Eric Lubow @elubow #CassandraSummit
NORMAL CASSANDRA SETUP
CASSANDRA CLUSTER APPLICATION TIER
C*
App 1
App 2
C* C* C*
C* C* C* C*
![Page 11: Optimizing Your Cluster with Coordinator Nodes (Eric Lubow, SimpleReach) | Cassandra Summit 2016](https://reader033.vdocuments.net/reader033/viewer/2022051123/586f75c71a28ab10258b6189/html5/thumbnails/11.jpg)
Eric Lubow @elubow #CassandraSummit
![Page 12: Optimizing Your Cluster with Coordinator Nodes (Eric Lubow, SimpleReach) | Cassandra Summit 2016](https://reader033.vdocuments.net/reader033/viewer/2022051123/586f75c71a28ab10258b6189/html5/thumbnails/12.jpg)
Eric Lubow @elubow #CassandraSummit
CASSANDRA PROXY TIER SETUP
CASSANDRA CLUSTER
App 1
App 2
PROXY TIER
Proxy
Proxy
Data Data
Data Data Data
Data Data
Data
APPLICATION TIER
Data Nodes: c3.4xlargeProxy Nodes: c3.4xlarge
![Page 13: Optimizing Your Cluster with Coordinator Nodes (Eric Lubow, SimpleReach) | Cassandra Summit 2016](https://reader033.vdocuments.net/reader033/viewer/2022051123/586f75c71a28ab10258b6189/html5/thumbnails/13.jpg)
Eric Lubow @elubow #CassandraSummit
๏ More compute power for token calculations
๏ More compute power for writing data
๏ More focused compute on coordination tasks
๏ Smarter allocation of instance types
๏ Cheaper hardware for proxy instances
TRADEOFFS
๏ More instance types to manage
๏ More infrastructure overhead
๏ Requires different monitoring
๏ High potential for nasty accident (forget to make proxy node)
![Page 14: Optimizing Your Cluster with Coordinator Nodes (Eric Lubow, SimpleReach) | Cassandra Summit 2016](https://reader033.vdocuments.net/reader033/viewer/2022051123/586f75c71a28ab10258b6189/html5/thumbnails/14.jpg)
Eric Lubow @elubow #CassandraSummit
Before
Before
After
After
AVERAGE CLUSTER CPU UTILIZATION
![Page 15: Optimizing Your Cluster with Coordinator Nodes (Eric Lubow, SimpleReach) | Cassandra Summit 2016](https://reader033.vdocuments.net/reader033/viewer/2022051123/586f75c71a28ab10258b6189/html5/thumbnails/15.jpg)
Eric Lubow @elubow #CassandraSummit
HOW DID WE DO?
![Page 16: Optimizing Your Cluster with Coordinator Nodes (Eric Lubow, SimpleReach) | Cassandra Summit 2016](https://reader033.vdocuments.net/reader033/viewer/2022051123/586f75c71a28ab10258b6189/html5/thumbnails/16.jpg)
Eric Lubow @elubow #CassandraSummit
๏ Why are we talking about CPU utilization and load average
๏ Terminology is important
๏ Understanding gains/losses is important
๏ Letʼs talk about CPU utilization and load average
HOW DO WE KNOW?
![Page 17: Optimizing Your Cluster with Coordinator Nodes (Eric Lubow, SimpleReach) | Cassandra Summit 2016](https://reader033.vdocuments.net/reader033/viewer/2022051123/586f75c71a28ab10258b6189/html5/thumbnails/17.jpg)
Eric Lubow @elubow #CassandraSummit
๏ Letʼs use a traffic analogy for load average
๏ Imagine you are the bridge operator of a single lane bridge (single CPU):
๏ 0.00 means there's no traffic on the bridge at all. In fact, between 0.00 and 1.00 means there's no backup, and an arriving car will just go right on.
๏ 1.00 means the bridge is exactly at capacity. All is still good, but if traffic gets a little heavier, things are going to slow down.
๏ over 1.00 means there's backup. How much? Well, 2.00 means that there are two lanes worth of cars total -- one lane's worth on the bridge, and one lane's worth waiting. 3.00 means there are three lane's worth total -- one lane's worth on the bridge, and two lanes' worth waiting. Etc.
๏ Best load average for a single CPU system is between 0.7 and 0.8 (headroom)
๏ Different for multi-core systems
WHAT IS LOAD AVERAGE?
![Page 18: Optimizing Your Cluster with Coordinator Nodes (Eric Lubow, SimpleReach) | Cassandra Summit 2016](https://reader033.vdocuments.net/reader033/viewer/2022051123/586f75c71a28ab10258b6189/html5/thumbnails/18.jpg)
Eric Lubow @elubow #CassandraSummit
๏ Each core on a CPU has itʼs own utilization graph
๏ CPU utilization isnʼt straight forward
๏ Assume you have a single core processor fixed at a frequency of 2.0 GHz. CPU utilization in this scenario is the percentage of time the processor spends doing work (as opposed to being idle). If this 2.0 GHz processor does 1 billion cycles worth of work in a second, it is 50% utilized for that second.
๏ Current multiple cores processors exist with dynamically changing frequencies, hardware multithreading, and shared caches all of which effect reporting.
๏ Resource sharing makes monitoring CPU utilization difficult
NOTES ABOUT CPU UTILIZATION
![Page 19: Optimizing Your Cluster with Coordinator Nodes (Eric Lubow, SimpleReach) | Cassandra Summit 2016](https://reader033.vdocuments.net/reader033/viewer/2022051123/586f75c71a28ab10258b6189/html5/thumbnails/19.jpg)
Eric Lubow @elubow #CassandraSummit
![Page 20: Optimizing Your Cluster with Coordinator Nodes (Eric Lubow, SimpleReach) | Cassandra Summit 2016](https://reader033.vdocuments.net/reader033/viewer/2022051123/586f75c71a28ab10258b6189/html5/thumbnails/20.jpg)
Eric Lubow @elubow #CassandraSummit
Before
Before
After
After
AVERAGE CLUSTER CPU UTILIZATION
![Page 21: Optimizing Your Cluster with Coordinator Nodes (Eric Lubow, SimpleReach) | Cassandra Summit 2016](https://reader033.vdocuments.net/reader033/viewer/2022051123/586f75c71a28ab10258b6189/html5/thumbnails/21.jpg)
Eric Lubow @elubow #CassandraSummit
๏ Batches work better when there is a coordinator dispatching batches to the correct data node without additional processing on the part of the data node
๏ One of the many downsides of vnodes is massive coordination requirements
๏ Removing coordination responsibilities from data nodes makes them more performant
๏ less context switching
๏ less network traffic/gossip/GC
๏ less CPU utilization
WHAT ACTUALLY HAPPENED 1/2
![Page 22: Optimizing Your Cluster with Coordinator Nodes (Eric Lubow, SimpleReach) | Cassandra Summit 2016](https://reader033.vdocuments.net/reader033/viewer/2022051123/586f75c71a28ab10258b6189/html5/thumbnails/22.jpg)
Eric Lubow @elubow #CassandraSummit
๏ 30 nodes * 256 tokens/node = 7,680 token ranges
๏ Queries go through a nearly 8,000 item list, slow, context switch, lots of GCable objects
๏ Considering just reads, at 30k requests per second, this is a significant reduction in work on a per query basis
๏ We are able to tune the JVMs differently
WHAT ACTUALLY HAPPENED 2/2
![Page 23: Optimizing Your Cluster with Coordinator Nodes (Eric Lubow, SimpleReach) | Cassandra Summit 2016](https://reader033.vdocuments.net/reader033/viewer/2022051123/586f75c71a28ab10258b6189/html5/thumbnails/23.jpg)
Eric Lubow @elubow #CassandraSummit
RESULTS
![Page 24: Optimizing Your Cluster with Coordinator Nodes (Eric Lubow, SimpleReach) | Cassandra Summit 2016](https://reader033.vdocuments.net/reader033/viewer/2022051123/586f75c71a28ab10258b6189/html5/thumbnails/24.jpg)
Eric Lubow @elubow #CassandraSummit
๏ Went from 72 nodes down to 30 nodes
๏ All data is now stored on AWS ST1 EBS volumes
๏ Works best for write heavy workloads
๏ Roughly 300% increase in available and burstable capacity
๏ Less footprint to watch over; fewer machines, more roles
RESULTS
![Page 25: Optimizing Your Cluster with Coordinator Nodes (Eric Lubow, SimpleReach) | Cassandra Summit 2016](https://reader033.vdocuments.net/reader033/viewer/2022051123/586f75c71a28ab10258b6189/html5/thumbnails/25.jpg)
Eric Lubow @elubow #CassandraSummit
๏ Command line option or cassandra.yaml option for coordinator only mode
๏ Code path short cuts for performance
๏ Specific JMX beans around query coordination
๏ Allow query mutation by coordinator nodes (Lua?)
FEATURE NOT HACK
![Page 26: Optimizing Your Cluster with Coordinator Nodes (Eric Lubow, SimpleReach) | Cassandra Summit 2016](https://reader033.vdocuments.net/reader033/viewer/2022051123/586f75c71a28ab10258b6189/html5/thumbnails/26.jpg)
Eric Lubow @elubow #CassandraSummit
WHAT DID I SAY?
๏ Fat clients can save you money
๏ Donʼt start out with complexity
๏ Know the basics
๏ Know what your baseline measurements are
๏ Monitor everything
๏ Sometimes database tuning doesnʼt require making changes to the database
![Page 27: Optimizing Your Cluster with Coordinator Nodes (Eric Lubow, SimpleReach) | Cassandra Summit 2016](https://reader033.vdocuments.net/reader033/viewer/2022051123/586f75c71a28ab10258b6189/html5/thumbnails/27.jpg)
Eric Lubow @elubow #CassandraSummit
QUESTIONS IN LIFE ARE GUARANTEED, ANSWERS AREN’T.
Eric Lubow
@elubow