Running Cassandra on Amazon EC2

@cassandralondon

Thanks

Reminder

Next meetup Wednesday 8th December

Jake Luciani will be giving a talk on "Lucandra" (a Cassandra backend for Lucene open source search software)

Quick intro to Cassandra

• Decentralized

• Fault-tolerant

• Tunable consistency

• Elasticity
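As a rough illustration of "tunable consistency": with N replicas per key, a read that consults R replicas is guaranteed to overlap a write acknowledged by W replicas whenever R + W > N. The helper names below are hypothetical and only sketch that arithmetic; they are not part of any Cassandra API.

```python
def quorum(n: int) -> int:
    """Smallest majority of n replicas (hypothetical helper)."""
    return n // 2 + 1


def reads_see_latest_write(n: int, r: int, w: int) -> bool:
    """True when any r replicas must share at least one replica with any w replicas."""
    return r + w > n


if __name__ == "__main__":
    n = 3
    print(quorum(n))                                          # majority for N = 3
    print(reads_see_latest_write(n, quorum(n), quorum(n)))    # quorum reads and writes
    print(reads_see_latest_write(n, 1, 1))                    # fastest, weakest setting
```

Lower R or W trades consistency for latency; higher values do the opposite.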

This talk

Why consider EC2?

What are the challenges of running Cassandra on EC2?

Is it a good idea?

Cassandra design decisions

Cassandra designed to run on many commodity servers

It is designed to deal with unreliable hardware and networks

Why consider EC2?

On demand instances

“frees you from the costs and complexities of planning, purchasing, and maintaining hardware and transforms what are commonly large fixed costs into much smaller variable costs”

http://aws.amazon.com/ec2/pricing/

Why consider EC2?

Multiple “Availability Zones” in multiple regions (US East, US West, Ireland and Singapore)

http://aws.amazon.com/ec2/

Writing to Cassandra

1. Write added to local log on target machine

2. Memtable updated

3. Memtable flushed to disk as data files (SSTable plus SSTable Index)

4. Eventually data files are compacted

http://wiki.apache.org/cassandra/ArchitectureOverview#Write_path
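The four write steps can be sketched as a toy, in-memory model. This is an assumption-laden illustration, not Cassandra's implementation: the class `ToyNode`, the `memtable_limit` threshold, and the use of plain dicts for SSTables are all invented here, and nothing is actually written to disk.

```python
class ToyNode:
    """Toy model of the write path; all names here are hypothetical."""

    def __init__(self, memtable_limit=2):
        self.commit_log = []      # step 1: append-only log of writes
        self.memtable = {}        # step 2: key -> (timestamp, value)
        self.sstables = []        # step 3: immutable flushed tables, oldest first
        self.memtable_limit = memtable_limit

    def write(self, key, value, timestamp):
        self.commit_log.append((key, value, timestamp))
        current = self.memtable.get(key)
        if current is None or timestamp >= current[0]:
            self.memtable[key] = (timestamp, value)
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # A flushed table is sorted by key and never modified afterwards.
        self.sstables.append(dict(sorted(self.memtable.items())))
        self.memtable = {}
        self.commit_log.clear()   # simplification: flushed writes no longer need the log

    def compact(self):
        # Step 4: merge all tables, keeping the newest version of each key.
        merged = {}
        for table in self.sstables:
            for key, (ts, value) in table.items():
                if key not in merged or ts > merged[key][0]:
                    merged[key] = (ts, value)
        self.sstables = [dict(sorted(merged.items()))]


if __name__ == "__main__":
    node = ToyNode(memtable_limit=2)
    node.write("a", "v1", 1)
    node.write("b", "v1", 2)      # second key triggers a flush
    node.write("a", "v2", 3)
    node.write("c", "v1", 4)      # second flush: "a" now lives in two tables
    node.compact()
    print(node.sstables)
```

Writes only append to the log and update memory, which is why the write path is less sensitive to disk seek performance than the read path.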

Reading from Cassandra

1. Read from any node

2. Partitioner

3. Wait for R responses

4. Wait for N – R responses in the background and perform read repair

http://wiki.apache.org/cassandra/ArchitectureOverview#Read_path
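Steps 3 and 4 above can be sketched as follows. This is a simplified, synchronous illustration under stated assumptions: replicas are plain dicts, "the first R to respond" is approximated by the first R replicas in the list, and the function name `read_with_repair` is hypothetical. A real coordinator does the repair work in the background.

```python
def read_with_repair(replicas, key, r):
    """replicas: list of dicts mapping key -> (timestamp, value)."""
    # Step 3: answer from R responses, returning the newest version among them.
    responses = [rep[key] for rep in replicas[:r] if key in rep]
    result = max(responses, key=lambda tv: tv[0]) if responses else None

    # Step 4: look at all N replicas and push the newest version to stale ones.
    versions = [rep[key] for rep in replicas if key in rep]
    if versions:
        newest = max(versions, key=lambda tv: tv[0])
        for rep in replicas:
            if rep.get(key) != newest:
                rep[key] = newest   # read repair
    return result


if __name__ == "__main__":
    replicas = [{"k": (1, "old")}, {"k": (2, "new")}, {}]
    print(read_with_repair(replicas, "k", r=2))
    print(replicas)   # every replica now holds the newest version
```

With a small R the caller may be handed a stale value, but the repair still brings the replicas back in line for later reads.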

Reading from Cassandra

Reads from multiple SSTables

The application use-case will affect performance and what the bottleneck is (totally random reads being worst case)
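To make the cost of reading from multiple SSTables concrete, here is a minimal sketch in which a single-key read has to check the memtable and every flushed table, then keep the newest version. The function name `read_key` and the dict-based tables are assumptions for illustration, and the sketch ignores any optimisations a real node uses to skip files.

```python
def read_key(memtable, sstables, key):
    """Return (value, tables_checked); each table maps key -> (timestamp, value)."""
    candidates = []
    tables_checked = 0
    for table in [memtable, *sstables]:
        tables_checked += 1
        if key in table:
            candidates.append(table[key])
    if not candidates:
        return None, tables_checked
    newest = max(candidates, key=lambda tv: tv[0])
    return newest[1], tables_checked


if __name__ == "__main__":
    memtable = {"c": (5, "c1")}
    sstables = [{"a": (1, "a1"), "b": (2, "b1")}, {"a": (3, "a2")}]
    print(read_key(memtable, sstables, "a"))
```

On disk, each extra table checked can mean an extra seek, which is why a totally random read pattern over many uncompacted files is the worst case.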

The challenges

Getting good enough I/O performance

Not a huge number of resources on the Internet (new and shiny)

Some minor setup and monitoring challenges (documentation is available)

EC2 I/O performance

Ephemeral or EBS; low, moderate or high I/O performance indicators

“other resources like the network and the disk subsystem are shared among instances… when a resource is under-utilized you will often be able to consume a higher share of that resource”

http://aws.amazon.com/ec2/instance-types/

EBS or ephemeral?

Jonathan Ellis recently on mailing list:

“we recommend using raid0 ephemeral disks on EC2 with L or XL instance sizes for better i/o performance.”

http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Cold-boot-performance-problems-tp5615829p5615889.html

http://www.coreyhulen.org/?p=326

EBS or ephemeral?

Amazon suggest EBS is better:

“Amazon EBS is particularly suited for applications that require a database, file system, or access to raw block level storage”

http://aws.amazon.com/ebs/

“The latency and throughput of Amazon EBS volumes is designed to be significantly better than the Amazon EC2 instance stores in nearly all cases. You can also attach multiple volumes to an instance and stripe across the volumes. This is one way to improve I/O rates, especially if your application performs a lot of random access across your data set.”

http://aws.amazon.com/ebs/

EC2 I/O benchmark

Throughput measured using dd

Seek measured using seeker.c

Software RAID uses mdadm

http://www.linuxinsight.com/how_fast_is_your_disk.html

http://en.wikipedia.org/wiki/Mdadm
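For readers who want to try a seek measurement themselves, the following is a minimal sketch in the spirit of a random-seek test: read small blocks at random offsets and report reads per second. It is not seeker.c, and the function name, block size, and read count are assumptions. It relies on `os.pread`, so it is Unix-only, and results on a regular file will be inflated by the page cache; a raw device (which needs root) gives more meaningful numbers.

```python
import os
import random
import time


def random_read_rate(path, block_size=512, reads=1000, seed=None):
    """Return random block reads per second against a file or block device."""
    rng = random.Random(seed)
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.lseek(fd, 0, os.SEEK_END)   # also works for block devices
        blocks = size // block_size
        if blocks == 0:
            raise ValueError("target is smaller than one block")
        start = time.perf_counter()
        for _ in range(reads):
            offset = rng.randrange(blocks) * block_size
            os.pread(fd, block_size, offset)
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
    return reads / elapsed


if __name__ == "__main__":
    import sys
    print(f"{random_read_rate(sys.argv[1]):.0f} reads/s")
```

Because EC2 performance is not consistent, any single run of a measurement like this should be repeated at different times before drawing conclusions.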

Which is better?

EBS has better throughput, ephemeral better for random seeks

Generic benchmarks aren’t great – depends on your use case

Warning: EC2 performance not consistent

EC2 Cassandra benchmark

Read and write TPS

Benchmarks carried out by Corey Hulen

http://www.coreyhulen.org/?p=326

Which is better?

Corey suggests:

“Raid 0 EBS drives are the way to go”

“We didn’t notice a difference above the normal EC2 fluctuations when testing for 2 vs 4 drives”

Conclusions

Cassandra will run acceptably on EC2, but real HW is better

It will depend on your use case – particularly the types of read that you do

Real HW may work out cheaper

Conclusions

Ephemeral I/O seems to be better than EBS, although EBS has other advantages (doesn’t disappear if you stop the node)

Again, it will depend on use case

Conclusions

Large nodes are the best bet

Small nodes have poor I/O

Extra large nodes are probably not worth it (better to have more nodes)

http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Nodes-dropping-out-of-cluster-due-to-GC-tp5128481p5131568.html

Questions?

Please leave feedback on meetup.com

Follow @cassandralondon on Twitter
