Running Cassandra in AWS

Post on 06-May-2015

DESCRIPTION

For this upcoming meetup, we welcome Patrick Eaton, PhD, Systems Architect at Stackdriver, and Joey Imbasciano, Cloud Platform Engineer at Stackdriver.

What You'll Learn At This Meetup:
• Why Stackdriver chose Cassandra over other database offerings
• Stackdriver's data pipeline that feeds into Cassandra
• Operating Cassandra on AWS
• Stackdriver's approach to disaster recovery

Patrick and Joey will present their use of Apache Cassandra at Stackdriver, some lessons learned, and technical tips, with a Q&A to end the evening.

TRANSCRIPT

Running Cassandra in AWS

Patrick Eaton, PhD
patrick@stackdriver.com
@PatrickREaton

Joey Imbasciano
joey@stackdriver.com
@_joeyi

Stackdriver at a Glance

Stackdriver's hosted intelligent monitoring service helps SaaS companies innovate more by reducing the burden of day-to-day operations.
● Cloud-native and cloud-aware
● Designed for complex distributed applications
● Founded by cloud/infrastructure industry veterans (Microsoft, VMware, EMC, Endeca, Red Hat) with deep systems and DevOps expertise
● Team of ~25, based in downtown Boston

Intelligent Monitoring

Discover customer's cloud-hosted applications
● Infrastructure inventory
● Logical units, like groups/clusters
● Services, hosted and self-managed
● Elastic resources

Monitor
● Various data sources
○ Provider metrics
○ Host metrics
○ Custom metrics
○ Endpoints
○ Events
○ Health
● Rich visualizations

Analyze
● Integrate data sources
● Aggregate metrics
● Report utilization, cost, etc.
● Detect policy violations
● Recommend actions

Lambda Architecture

● Typical of modern architectures for online applications
● Formalized by Nathan Marz
● Composed of "batch", "speed", and "serving" layers
● Batch layer
○ Store of record
○ Compute arbitrary views
● Speed layer
○ Low-latency updates
○ Streaming algorithms
● Serving layer
○ Combine data from batch and speed layers to answer queries

[Diagram: Data, Speed, Batch, and Serving layers]
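To make the three layers concrete, here is a minimal in-memory sketch of the pattern. It is not Stackdriver's implementation: the class and function names are hypothetical, and a running sum per key stands in for whatever views a real system would compute.

```python
from collections import defaultdict


class BatchLayer:
    """Store of record: keeps every raw event and recomputes views from scratch."""

    def __init__(self):
        self.events = []   # append-only master data
        self.view = {}     # precomputed total per key

    def append(self, key, value):
        self.events.append((key, value))

    def recompute(self):
        totals = defaultdict(int)
        for key, value in self.events:
            totals[key] += value
        self.view = dict(totals)


class SpeedLayer:
    """Low-latency incremental view over events the batch view has not covered yet."""

    def __init__(self):
        self.recent = defaultdict(int)

    def update(self, key, value):
        self.recent[key] += value

    def reset(self):
        self.recent.clear()


def ingest(batch, speed, key, value):
    """Every event goes to both paths."""
    batch.append(key, value)
    speed.update(key, value)


def run_batch(batch, speed):
    """After a batch run the batch view is complete, so the speed view starts over."""
    batch.recompute()
    speed.reset()


def serve(batch, speed, key):
    """Serving layer: combine the batch view with the speed view to answer a query."""
    return batch.view.get(key, 0) + speed.recent.get(key, 0)
```

In this sketch a query is correct both before and after a batch run, because each event is counted in exactly one of the two views at any moment.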

Stackdriver Architecture

● Shares characteristics of lambda architecture
● Indexing (speed) path
○ Make "live" data available "pre-analysis"
● Analysis (batch) path
○ Compute aggregations
○ Create recommendations
● Query (serving) layer
○ Combine "live" and analyzed data to answer queries
○ May require on-the-fly analysis
● Alerting (speed) path (not discussed here)
○ Stream processing to detect policy-based anomalies

[Diagram: Data, Database, Indexing (Speed), Analysis (Batch), Alerting (Speed), Query (Serving), Notification (Serving)]

Database Options

● We chose Cassandra!
○ True P2P architecture
○ Good support for write-heavy workloads
○ Compatible data model for time series data
■ Column per metric type, timestamps as columns
● Why not MySQL?
○ Experience with operating large, sharded deployments
○ Relational data model not a good match
● Why not HBase?
○ Operational complexity - ZooKeeper, Hadoop, HDFS, ...
○ Special "Master" role
● Why not Dynamo?
○ Avoid vendor lock-in and high cost
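The "timestamps as columns" note describes a wide-row layout, where each data point becomes a column named by its timestamp. The sketch below models that idea in memory; it is not a Cassandra client. The row key of (resource, metric type) is an assumption about how the rows might be keyed, and `WideRowStore` is a hypothetical name.

```python
import bisect


class WideRowStore:
    """In-memory stand-in for a wide-row column family (not a Cassandra client)."""

    def __init__(self):
        # row key -> (sorted list of timestamps, {timestamp: value})
        self.rows = {}

    def write(self, row_key, timestamp, value):
        """Add a column named by its timestamp; columns stay sorted within the row."""
        timestamps, values = self.rows.setdefault(row_key, ([], {}))
        if timestamp not in values:
            bisect.insort(timestamps, timestamp)
        values[timestamp] = value

    def slice(self, row_key, start, end):
        """Return (timestamp, value) pairs with start <= timestamp < end."""
        timestamps, values = self.rows.get(row_key, ([], {}))
        lo = bisect.bisect_left(timestamps, start)
        hi = bisect.bisect_left(timestamps, end)
        return [(ts, values[ts]) for ts in timestamps[lo:hi]]
```

Because columns are kept in timestamp order, a time-range query is a contiguous slice of one row, which is why this layout suits time series reads.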

Stackdriver Architecture ++

● Archival pipeline stores all data
○ Very small surface area, battle-tested
○ Critical for disaster recovery
○ S3 considered durable enough
○ Replicated for availability
● Archive means Cassandra is "soft state"
● C* consolidates analysis and indexing results
● Properties of data in C*
○ Immutable data
○ Append-only
○ Read-1, write-1 consistency
● Scales out easily
○ Indexers, archivers, analyzers, query servers

[Diagram: Data, Archive, Index, Analyze, S3, Roll-ups, Analysis, Recs, Inventory, Data Series, Query, Cassandra]

Cassandra at Stackdriver: Cluster Configuration

● Version: DataStax Community Edition 1.2.10
● Replication Factor: 3
● Vnodes
● Murmur3Partitioner
● Ec2Snitch
○ Aids in request efficiency
○ Enables Cassandra to ensure replicas are in different Availability Zones
● phi_convict_threshold: 8 -> 12
○ Used to determine when nodes are down
○ AWS network can be spotty
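To give a feel for what raising phi_convict_threshold from 8 to 12 means, here is a simplified sketch of a phi-accrual calculation. It assumes heartbeat gaps follow an exponential distribution, which is a common simplification and not necessarily what this Cassandra version computes internally. The function names and the one-second mean interval in the usage note are assumptions for illustration.

```python
import math


def phi(seconds_since_last_heartbeat, mean_heartbeat_interval):
    """Suspicion level under an exponential inter-arrival model.

    P(next heartbeat arrives later than t) = exp(-t / mean), and
    phi = -log10(P), which simplifies to t / (mean * ln 10).
    """
    return seconds_since_last_heartbeat / (mean_heartbeat_interval * math.log(10))


def seconds_until_convicted(threshold, mean_heartbeat_interval):
    """Silence needed before phi reaches the conviction threshold."""
    return threshold * mean_heartbeat_interval * math.log(10)
```

Under these assumptions, with a mean heartbeat interval of one second, `seconds_until_convicted(8, 1.0)` is about 18.4 seconds and `seconds_until_convicted(12, 1.0)` is about 27.6 seconds. The higher threshold tolerates roughly 50% more silence before marking a node down, which fits the motivation of a spotty network.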

Cassandra Topology in AWS

[Diagram: nodes 1, 2, and 3 in us-east-1a, us-east-1b, and us-east-1c, respectively]

Where we started...

Keep it balanced!

[Diagram: nodes spread across us-east-1a, us-east-1b, and us-east-1c]

Where we are...
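One way to honor "Keep it balanced!" when growing a cluster is to place each new node in whichever Availability Zone currently has the fewest nodes. The sketch below shows that planning step only; it makes no AWS calls, and `plan_new_nodes` and `is_balanced` are hypothetical helpers, not part of the tooling described in the talk.

```python
from collections import Counter

ZONES = ["us-east-1a", "us-east-1b", "us-east-1c"]


def zone_load(node_zones, zones=ZONES):
    """Count nodes per zone, including zones that have no nodes yet."""
    load = Counter({zone: 0 for zone in zones})
    load.update(node_zones)
    return load


def plan_new_nodes(current_zones, count, zones=ZONES):
    """Pick a zone for each new node, always filling the emptiest zone first."""
    load = zone_load(current_zones, zones)
    plan = []
    for _ in range(count):
        target = min(zones, key=lambda zone: load[zone])
        plan.append(target)
        load[target] += 1
    return plan


def is_balanced(node_zones, zones=ZONES):
    """True when every zone holds the same number of nodes."""
    load = zone_load(node_zones, zones)
    return len({load[zone] for zone in zones}) == 1
```

With a replication factor of 3 and three zones, equal node counts per zone let each zone hold one replica of every range without any zone carrying more data than the others.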

Cassandra EC2 Node Configuration

● m1.xlarge
○ 4 cores
○ 15 GB RAM
○ 4 ephemeral disks available
● 4 disks in RAID-0 for data volume and commit log
○ ext4 - defaults,noatime
○ mdadm RAID-0
○ Compactions
○ Heavy read/write IO

Cassandra Automation and Operations

● Combination of Boto, Fabric, & Puppet
○ Boto for AWS API
○ Fabric + Puppet for bootstrapping
○ Fabric for operations
● One command to:
○ Launch a new cluster
○ Upsize a cluster
○ Replace a dead node
○ Remove existing nodes
○ List nodes in a cluster

Our (Internal) Slogan

Cassandra Backups using S3

● No Cassandra-powered backups
● Restore from S3
● Useful for major version upgrades

[Diagram: Data, S3, Bulk Loader, Map Reduce, Cassandra]

1. Data is archived when it is received
2. Bulk loader reads from S3
3. M/R re-analyzes data
4. Cassandra is repopulated
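The four steps above can be sketched end to end with in-memory stand-ins. This is only an illustration of the flow: a Python list plays the role of the S3 archive, a dict plays the role of the Cassandra cluster, and the per-minute average is a made-up roll-up, since the slides do not say what the real analysis computes. All function names are hypothetical.

```python
def archive(s3_bucket, raw_point):
    """Step 1: persist every data point as soon as it is received."""
    s3_bucket.append(raw_point)


def bulk_load(s3_bucket):
    """Step 2: stream the archived points back out of the archive."""
    yield from s3_bucket


def reanalyze(points, bucket_seconds=60):
    """Step 3: recompute roll-ups in a map/reduce style."""
    grouped = {}
    for series, timestamp, value in points:
        # Map: key each point by its series and time bucket.
        bucket = timestamp - timestamp % bucket_seconds
        grouped.setdefault((series, bucket), []).append(value)
    # Reduce: collapse each group to its average.
    return {key: sum(values) / len(values) for key, values in grouped.items()}


def repopulate(cluster, rollups):
    """Step 4: write the recomputed results into a fresh cluster."""
    cluster.update(rollups)


def restore(s3_bucket):
    """Rebuild an empty cluster purely from the archive."""
    cluster = {}
    repopulate(cluster, reanalyze(bulk_load(s3_bucket)))
    return cluster
```

Because the archive is append-only and the analysis is deterministic, running `restore` twice on the same archive yields the same cluster contents, which is what lets the database be treated as soft state.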

Disaster Recovery in the Wild

● October 23: Stackdriver suffered a total loss of our C* cluster
● Exhausted memory due to number of open file descriptors (see graph)
● We did not notice the problem until it was too late
● Nodes began crashing, which resulted in an inconsistent view of the ring
● Attempted to restart the cluster unsuccessfully for ~2 hours
● Provisioned new 36-node cluster in ~2 hours
● Directed "live" data to new cluster
● Started bulk restore operation from archive
● Full-fidelity data and aggregations
● No data loss, thanks to the archival pipeline
● See http://www.stackdriver.com/post-mortem-october-23-stackdriver-outage/
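Since the failure went unnoticed until it was too late, one simple safeguard is to alert when open file descriptors approach the process limit. The sketch below is a generic illustration, not the check Stackdriver put in place: the 80% threshold and the function names are assumptions, and `current_process_fd_usage` is Linux-only and inspects the calling process, not a Cassandra node.

```python
import os
import resource


def should_alert(open_fds, soft_limit, threshold=0.8):
    """True when descriptor usage reaches the given fraction of the soft limit."""
    if soft_limit <= 0:
        # An unlimited or unknown limit gives nothing to compare against.
        return False
    return open_fds / soft_limit >= threshold


def current_process_fd_usage():
    """Linux-only: count entries in /proc/self/fd and read RLIMIT_NOFILE."""
    open_fds = len(os.listdir("/proc/self/fd"))
    soft_limit, _hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)
    return open_fds, soft_limit
```

A monitoring agent could call `should_alert(*current_process_fd_usage())` on a schedule; for another process, the same counts would come from that process's own /proc entry.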

Cluster Restoration Process

[Diagram: UI and API clients, Gateway, Old Cluster, New Cluster, New Data, Historical Data, S3, Bulk Loader, Map Reduce]

Thank you!

Yes, we are hiring!

Patrick Eaton - patrick@stackdriver.com - @PatrickREaton
Joey Imbasciano - joey@stackdriver.com - @_joeyi
