Running Cassandra on Amazon EC2

Post on 13-Dec-2014

DESCRIPTION

What are the challenges of running Apache Cassandra on Amazon EC2? Is it a good idea? In this presentation, we explore reasons for and against running the distributed database Cassandra on EC2. We look at the I/O performance of EC2 and

TRANSCRIPT

@cassandralondon

Thanks

Reminder

Next meetup Wednesday 8th December

Jake Luciani will be giving a talk on "Lucandra" (a Cassandra backend for Lucene open source search software)

Quick intro to Cassandra

• Decentralized

• Fault-tolerant

• Tunable consistency

• Elasticity

This talk

Why consider EC2?

What are the challenges of running Cassandra on EC2?

Is it a good idea?

Cassandra design decisions

Cassandra designed to run on many commodity servers

It is designed to deal with unreliable hardware and networks

Why consider EC2?

On demand instances

“frees you from the costs and complexities of planning, purchasing, and maintaining hardware and transforms what are commonly large fixed costs into much smaller variable costs”

http://aws.amazon.com/ec2/pricing/

Why consider EC2?

Multiple “Availability Zones” in multiple regions (US East, US West, Ireland and Singapore)

http://aws.amazon.com/ec2/

Writing to Cassandra

1. Write added to local log on target machine

2. Memtable updated

3. Memtable flushed to disk as data files (SSTable plus SSTable Index)

4. Eventually data files are compacted

http://wiki.apache.org/cassandra/ArchitectureOverview#Write_path
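The four write-path steps can be sketched as a toy model in Python. This is only an illustration, not Cassandra's actual implementation: the class `ToyNode`, its `memtable_limit` parameter and the in-memory lists standing in for the on-disk log and SSTables are all hypothetical.

```python
class ToyNode:
    """Toy model of the four write-path steps; not Cassandra's real code."""

    def __init__(self, memtable_limit=2):
        self.commit_log = []   # step 1: append-only log (sequential disk I/O)
        self.memtable = {}     # step 2: in-memory map of key -> value
        self.sstables = []     # step 3: immutable sorted tables, oldest first
        self.memtable_limit = memtable_limit  # hypothetical flush threshold

    def write(self, key, value):
        self.commit_log.append((key, value))  # step 1: log the write first
        self.memtable[key] = value            # step 2: update the memtable
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Step 3: write the memtable out as one sorted, immutable table.
        if self.memtable:
            self.sstables.append(sorted(self.memtable.items()))
            self.memtable = {}

    def compact(self):
        # Step 4: merge all tables into one; newer tables overwrite older ones.
        merged = {}
        for table in self.sstables:
            merged.update(table)
        self.sstables = [sorted(merged.items())] if merged else []


node = ToyNode(memtable_limit=2)
for key, value in [("a", 1), ("b", 2), ("a", 3), ("c", 4)]:
    node.write(key, value)
node.compact()
print(node.sstables)
```

In this sketch the log append, the flush and the compaction are the steps that would touch the disk, which is why disk I/O matters so much for the rest of the talk.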

IO

IO

IO

Reading from Cassandra

1. Read from any node

2. Partitioner

3. Wait for R responses

4. Wait for N – R responses in the background and perform read repair

http://wiki.apache.org/cassandra/ArchitectureOverview#Read_path
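A rough Python sketch of steps 3 and 4, assuming each of the N replicas is a plain dict mapping a key to a `(timestamp, value)` pair. The function name `read_with_repair` and this data layout are invented for illustration and are not Cassandra's API.

```python
def read_with_repair(replicas, key, r):
    """Answer from the first R replies, then repair stale replicas."""
    replies = [replica.get(key) for replica in replicas]

    # Step 3: the client's answer is the newest version among the first R replies.
    # Tuples compare by their first element, so max() picks the highest timestamp.
    answered = [reply for reply in replies[:r] if reply is not None]
    result = max(answered) if answered else None

    # Step 4: once the remaining N - R replies are in, push the newest
    # version seen anywhere to every replica that is missing or behind.
    seen = [reply for reply in replies if reply is not None]
    if seen:
        newest = max(seen)
        for replica in replicas:
            if replica.get(key) != newest:
                replica[key] = newest
    return result


replicas = [{"k": (1, "old")}, {"k": (2, "new")}, {"k": (1, "old")}]
print(read_with_repair(replicas, "k", r=2))
print(replicas)
```

With `r=1` the same call on fresh data would return the stale `(1, "old")` version while still repairing the replicas afterwards, which is one way to picture the consistency trade-off behind R.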

IO

IO

Reading from Cassandra

Reads from multiple SSTables

The application use-case will affect performance and what the bottleneck is (totally random reads being worst case)
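To see why reads can be I/O-heavy, here is a minimal Python sketch in which a key may live in several SSTables and the newest version wins. The function `read_key`, the dict-based tables and the `(timestamp, value)` layout are assumptions made for this example; each table lookup stands in for at least one disk seek.

```python
def read_key(memtable, sstables, key):
    """Return (newest version of key, number of SSTables consulted)."""
    candidates = []
    if key in memtable:
        candidates.append(memtable[key])   # in memory, no disk access

    sstables_touched = 0
    for table in sstables:
        sstables_touched += 1              # stands in for a disk seek
        if key in table:
            candidates.append(table[key])

    # Versions are (timestamp, value) tuples, so max() picks the newest.
    newest = max(candidates) if candidates else None
    return newest, sstables_touched


sstables = [{"u": (1, "a")}, {"u": (5, "b")}, {"v": (2, "c")}]
print(read_key({}, sstables, "u"))
```

In this toy model every read consults every table, so the cost grows with the number of SSTables that have not yet been compacted.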

IO

The challenges

Getting good enough I/O performance

Not a huge number of resources on the Internet (new and shiny)

Some minor setup and monitoring challenges (documentation is available)

EC2 I/O performance

Ephemeral or EBS; low, moderate or high I/O performance indicators

“other resources like the network and the disk subsystem are shared among instances… when a resource is under-utilized you will often be able to consume a higher share of that resource”

http://aws.amazon.com/ec2/instance-types/

EBS or ephemeral?

Jonathan Ellis recently on mailing list:

“we recommend using raid0 ephemeral disks on EC2 with L or XL instance sizes for better i/o performance.”

http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Cold-boot-performance-problems-tp5615829p5615889.html

http://www.coreyhulen.org/?p=326

EBS or ephemeral?

Amazon suggest EBS is better:

“Amazon EBS is particularly suited for applications that require a database, file system, or access to raw block level storage”

http://aws.amazon.com/ebs/

“The latency and throughput of Amazon EBS volumes is designed to be significantly better than the Amazon EC2 instance stores in nearly all cases. You can also attach multiple volumes to an instance and stripe across the volumes. This is one way to improve I/O rates, especially if your application performs a lot of random access across your data set.”

http://aws.amazon.com/ebs/

EC2 I/O benchmark

Throughput measured using dd

Seek measured using seeker.c

Software RAID uses mdadm

http://www.linuxinsight.com/how_fast_is_your_disk.html

http://en.wikipedia.org/wiki/Mdadm
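The two measurements (sequential throughput and random seeks) can be approximated with a short Python sketch. It is a stand-in for illustration only, not the dd or seeker.c runs used in the talk: the function names, the 8 MB scratch file and the 512-byte read size are arbitrary assumptions, and because the file has just been written the reads will mostly hit the OS page cache. A meaningful run would need a file or device much larger than RAM.

```python
import os
import random
import tempfile
import time


def sequential_read_mb_per_s(path, block_size=1 << 20):
    """Read the whole file front to back and report MB/s."""
    size = os.path.getsize(path)
    start = time.perf_counter()
    with open(path, "rb", buffering=0) as f:
        while f.read(block_size):
            pass
    elapsed = time.perf_counter() - start
    return size / (1 << 20) / max(elapsed, 1e-9)


def random_seeks_per_s(path, seeks=1000, block_size=512):
    """Read small blocks at random offsets and report seeks per second."""
    size = os.path.getsize(path)
    with open(path, "rb", buffering=0) as f:
        start = time.perf_counter()
        for _ in range(seeks):
            f.seek(random.randrange(0, size - block_size))
            f.read(block_size)
        elapsed = time.perf_counter() - start
    return seeks / max(elapsed, 1e-9)


# Create a small scratch file, measure it, then clean up.
with tempfile.NamedTemporaryFile(delete=False) as scratch:
    scratch.write(os.urandom(8 << 20))
    scratch_path = scratch.name
try:
    throughput = sequential_read_mb_per_s(scratch_path)
    seek_rate = random_seeks_per_s(scratch_path)
    print(f"sequential: {throughput:.1f} MB/s, random: {seek_rate:.0f} seeks/s")
finally:
    os.remove(scratch_path)
```

Running both measurements several times, at different times of day, gives a feel for how much the numbers fluctuate on shared hardware.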

Which is better?

EBS has better throughput, ephemeral better for random seeks

Generic benchmarks aren’t great – depends on your use case

Warning: EC2 performance not consistent

EC2 Cassandra benchmark

Read and write TPS

Benchmarks carried out by Corey Hulen

http://www.coreyhulen.org/?p=326

Which is better?

Corey suggests:

“Raid 0 EBS drives are the way to go”

“We didn’t notice a difference above the normal EC2 fluctuations when testing for 2 vs 4 drives”

Conclusions

Cassandra will run acceptably on EC2, but real HW is better

It will depend on your use case – particularly the types of read that you do

Real HW may work out cheaper

Conclusions

Ephemeral I/O seems to be better than EBS, although EBS has other advantages (doesn’t disappear if you stop the node)

Again, it will depend on use case

Conclusions

Large nodes are the best bet

Small nodes have poor I/O

Extra large nodes are probably not worth it (better to have more nodes)

http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Nodes-dropping-out-of-cluster-due-to-GC-tp5128481p5131568.html

Questions?

Please leave feedback on meetup.com

Follow @cassandralondon on Twitter
