Apache Cassandra: An Introduction

Apache Cassandra: A Brief History: Dive into the Dynamo whitepaper

Upload: shehaaz-saif

Post on 12-Apr-2017


Page 1: Apache cassandra  an introduction

Apache Cassandra: A Brief History: Dive into the Dynamo whitepaper

Page 2: Apache cassandra  an introduction

About me
@Shehaaz

I love hacking on wearable/IoT devices.

Page 3: Apache cassandra  an introduction

Topics Today
● History and Dynamo
● Time series data modeling
● Example App

Page 4: Apache cassandra  an introduction

History
● Peer-to-Peer (all nodes are EQUAL)
  ○ Centralized peer-to-peer networks
    ■ Node connects to a “Directory” server.
      ● e.g.: Napster
  ○ Unstructured networks
    ■ Nodes randomly connect to each other
      ● e.g.: Kazaa, Gossip
  ○ Structured networks
    ■ Nodes organized into a specific topology (consistent hashing)
      ● e.g.: Cassandra Ring

Page 5: Apache cassandra  an introduction

Napster: Centralized P2P

Page 6: Apache cassandra  an introduction

Road to Cassandra
● 1999: Napster and other “questionable” P2P services
● 2006: Google Bigtable
  ○ C* has similar data storage.
● 2007: Amazon Dynamo (Avinash Lakshman)
  ○ C* has a similar architecture.
● 2008: Facebook open sourced C* (Avinash Lakshman)

Page 7: Apache cassandra  an introduction

CAP Theorem
● Consistency
  ○ All nodes see the same data at the same time
● Availability
  ○ A guarantee that every request receives a response about whether it succeeded or failed
● Partition Tolerance
  ○ The system continues to operate despite arbitrary message loss or failure of part of the system

e.g.: Increasing availability (by raising the replication factor) reduces consistency. You can only have two out of the three!

Page 8: Apache cassandra  an introduction

Dynamo
The motivation:
● You must ALWAYS be able to add to your shopping cart! (High Availability)
● Conflict resolution is done at the application:
  ○ merge conflicting shopping carts.
● Primary Key access to data store (RDB limitations)
  ○ e.g.: best seller list, customer preferences, etc.

Page 9: Apache cassandra  an introduction

Dynamo Architecture
Key principles:
1. Incremental scalability
  ○ Add nodes without disrupting the system
2. Symmetry
  ○ Every node has the same responsibility
3. Decentralization
  ○ Peer-to-peer over centralized control
4. Heterogeneity
  ○ The work distribution must be proportional to the capabilities of the individual servers.

Page 10: Apache cassandra  an introduction

Distributed Hash Table
Data Organization
Distributed Hash Table (DHT) using Consistent Hashing:

The keys are mapped to form a ring. The output range of the hash function is treated as a fixed circular “ring” (i.e., the largest hash value wraps around to the smallest hash value).

Page 11: Apache cassandra  an introduction

Inserting data: High Level
Hash(RowKey) = 4500

Circle clockwise and insert in Node 5
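The clockwise walk can be sketched in a few lines of Python. This is only an illustration, not Cassandra's implementation: the 0 to 9999 token range, the eight evenly spaced node positions, and the helper name `owner_of` are all assumptions chosen so that the slide's value of 4500 lands on Node 5.

```python
import bisect

# Hypothetical ring: tokens run from 0 to 9999, eight nodes sit at 1000, 2000, ..., 8000.
RING = {1000 * i: f"Node {i}" for i in range(1, 9)}
POSITIONS = sorted(RING)


def owner_of(token):
    """Walk clockwise from `token` to the first node position at or after it."""
    index = bisect.bisect_left(POSITIONS, token)
    if index == len(POSITIONS):
        index = 0  # past the last node: wrap around to the start of the ring
    return RING[POSITIONS[index]]


print(owner_of(4500))  # Node 5
print(owner_of(9500))  # Node 1 (wrapped around the ring)
```

The second call shows the wrap-around: a token beyond the last node position belongs to the first node on the ring.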

Page 12: Apache cassandra  an introduction

Row Level Hashing?

Patient ID (Partition Key) | Event Time (Clustering Column)
1 | T:22:00:02, HR:71 | T:22:00:01, HR:72
2 | T:22:00:05, HR:90 | T:22:00:02, HR:95

Page 13: Apache cassandra  an introduction

Dynamo Architecture
Consistent Hashing
● Advantage:
  ○ Departure or arrival of a node only affects its immediate neighbors. Every node is in charge of the range between itself and the previous node on the ring.
  ○ Only about K/N keys need to be remapped when a node drops. K = #keys, N = #nodes
● Disadvantage:
  ○ ?
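The remapping claim can be checked with a small simulation. The sketch below makes several assumptions that are not in the deck: a 2^32 token space, MD5 used purely as a stable hash, four evenly spaced nodes, and made-up key names. It removes one node and counts how many keys change owner.

```python
import bisect
import hashlib

RING_SIZE = 2**32


def token(key):
    """Map a string key onto the ring (MD5 is used only as a stable hash)."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % RING_SIZE


def owner(positions, key):
    """Return the position of the first node clockwise from the key's token."""
    ordered = sorted(positions)
    index = bisect.bisect_left(ordered, token(key)) % len(ordered)
    return ordered[index]


nodes = [RING_SIZE // 4 * i for i in range(4)]  # four evenly spaced nodes
keys = [f"key-{i}" for i in range(10_000)]

before = {key: owner(nodes, key) for key in keys}
departed = nodes[1]
remaining = [position for position in nodes if position != departed]
after = {key: owner(remaining, key) for key in keys}

moved = [key for key in keys if before[key] != after[key]]
print(f"{len(moved)} of {len(keys)} keys changed owner")
```

Only the keys that lived on the departed node move, and they all land on its clockwise neighbor; with K = 10,000 and N = 4 that should be roughly a quarter of the keys.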

Page 14: Apache cassandra  an introduction

Dynamo Architecture

Page 15: Apache cassandra  an introduction

Dynamo Architecture
Consistent Hashing
● Disadvantage?

Page 16: Apache cassandra  an introduction

Dynamo Architecture
Consistent Hashing
● Disadvantage
  ○ Random node position assignment leads to non-uniform data and load distribution
  ○ Some nodes could simply suck: the basic algorithm ignores differences in what each node can handle

Page 17: Apache cassandra  an introduction

Disadvantage Diagram

Page 18: Apache cassandra  an introduction

Virtual nodes to the rescue!
● Instead of mapping a node to a single point in the ring, each node gets assigned to multiple locations in the ring... (what does that mean?)

Virtual Nodes!

Page 19: Apache cassandra  an introduction

Virtual Nodes
Three-node cluster with zero V-nodes

p = Position

Page 20: Apache cassandra  an introduction

Virtual Nodes

● V-Nodes look like nodes in the system
● A regular node can be responsible for more than one V-Node
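To make the idea concrete, here is a rough sketch of how V-nodes even out ownership of the ring. Everything specific in it is an assumption for illustration: the 2^32 token space, MD5 as the hash, the `server#index` labels used to derive V-node positions, and the figure of 256 V-nodes per server.

```python
import hashlib
from collections import Counter

RING_SIZE = 2**32


def token(label):
    """Place a label on the ring (MD5 is used only as a stable hash)."""
    return int(hashlib.md5(label.encode()).hexdigest(), 16) % RING_SIZE


def ring_shares(servers, vnodes_per_server):
    """Return the fraction of the ring that each physical server owns."""
    ring = sorted(
        (token(f"{server}#{i}"), server)
        for server in servers
        for i in range(vnodes_per_server)
    )
    owned = Counter()
    previous = ring[-1][0] - RING_SIZE  # the last token precedes the first one
    for position, server in ring:
        owned[server] += position - previous  # arc from predecessor to this token
        previous = position
    return {server: owned[server] / RING_SIZE for server in servers}


servers = ["server-1", "server-2", "server-3"]
print("1 position each:  ", ring_shares(servers, 1))
print("256 V-nodes each: ", ring_shares(servers, 256))
```

With a single random position per server the shares can be badly lopsided; with many V-nodes per server each share should settle close to one third.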

Page 21: Apache cassandra  an introduction

Virtual Nodes: Add Node
Adding a new node:
● This will evenly balance the data in the cluster. Server #4 will get data from all the servers.
  ○ How?
    ■ Server 4's V-nodes sit next to V-nodes of servers 1, 2 and 3

Page 22: Apache cassandra  an introduction

V-Nodes: Remove Node
When a node goes down, its data is evenly distributed.

When #1 went down, #2 and #3 took over the data.

If we didn’t have virtual nodes, #2 would have been overloaded.
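The takeover behavior can be simulated as well. The sketch below is an illustration under assumed parameters (a 2^32 token space, MD5 as the hash, hypothetical server and key names, 256 V-nodes per server): it takes server-1 out of a three-server ring and counts which servers inherit its keys.

```python
import bisect
import hashlib
from collections import Counter

RING_SIZE = 2**32


def token(label):
    """Place a label on the ring (MD5 is used only as a stable hash)."""
    return int(hashlib.md5(label.encode()).hexdigest(), 16) % RING_SIZE


def build_ring(servers, vnodes_per_server):
    """Sorted list of (position, server) pairs, one pair per V-node."""
    return sorted(
        (token(f"{server}#{i}"), server)
        for server in servers
        for i in range(vnodes_per_server)
    )


def assign(ring, keys):
    """Map every key to the server owning the first position clockwise from it."""
    positions = [position for position, _ in ring]
    return {
        key: ring[bisect.bisect_left(positions, token(key)) % len(ring)][1]
        for key in keys
    }


def takeover(vnodes_per_server, keys):
    """Count which servers inherit server-1's keys after it goes down."""
    servers = ["server-1", "server-2", "server-3"]
    before = assign(build_ring(servers, vnodes_per_server), keys)
    after = assign(build_ring(servers[1:], vnodes_per_server), keys)
    return Counter(after[key] for key in keys if before[key] == "server-1")


keys = [f"patient-{i}" for i in range(3000)]
print("no V-nodes: ", dict(takeover(1, keys)))
print("256 V-nodes:", dict(takeover(256, keys)))
```

Without V-nodes every orphaned key falls to the single clockwise neighbor; with V-nodes the load should be split between the two surviving servers.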

Page 23: Apache cassandra  an introduction

Replication
Why? To achieve high availability.
e.g.: Replication Factor: 3
Hash(KEY1) = 500
Node #1 is the coordinator node for values 0 to 999.
Its job is to replicate the data to TWO other nodes.
In modern C* it is the job of the node that received the write.

Page 24: Apache cassandra  an introduction

Replication
Server 1 copies the data to TWO other nodes clockwise to satisfy Replication Factor: 3

If 1 goes down, 2 will make sure to keep R.F. = 3
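A minimal sketch of this clockwise replica placement, using the numbers from the slides (Hash(KEY1) = 500, Node #1 covering 0 to 999, replication factor 3). The five-node ring, the exact node positions, and the helper name `replicas` are assumptions for illustration; real Cassandra replication strategies are more involved.

```python
import bisect


def replicas(ring, token, replication_factor=3):
    """Walk clockwise from `token` and collect `replication_factor` nodes."""
    positions = sorted(ring)
    start = bisect.bisect_left(positions, token)
    count = min(replication_factor, len(positions))
    return [ring[positions[(start + i) % len(positions)]] for i in range(count)]


# Hypothetical ring: Node 1 covers tokens 0-999, Node 2 covers 1000-1999, and so on.
ring = {1000 * i - 1: f"Node {i}" for i in range(1, 6)}

print(replicas(ring, 500))  # ['Node 1', 'Node 2', 'Node 3']

# If Node 1 goes down, the walk starts at Node 2 and still finds three replicas.
ring_without_node_1 = {pos: name for pos, name in ring.items() if name != "Node 1"}
print(replicas(ring_without_node_1, 500))  # ['Node 2', 'Node 3', 'Node 4']
```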

Page 25: Apache cassandra  an introduction

Example Application
● Patient in critical care. Needs a vital sign dashboard
● Arduino-based heart rate and SpO2 measuring device.
● Plot a pretty graph and gain insight from the data

Page 26: Apache cassandra  an introduction

Arduino + e-Health PCB

Page 27: Apache cassandra  an introduction

System Diagram

Page 28: Apache cassandra  an introduction

Setup GCloud C* cluster

Page 29: Apache cassandra  an introduction

Requirements.txt

Page 30: Apache cassandra  an introduction

Example Code

Page 31: Apache cassandra  an introduction

Create Tables

Page 32: Apache cassandra  an introduction

What’s Wrong?
1. We will eventually run out of columns. Cassandra allows 2 billion columns per row.

≈ 63.3 years at one reading per second
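The figure follows from simple arithmetic. The sketch below assumes one reading per second, which the deck does not state explicitly but which matches the per-second timestamps in its example rows; the result lands close to the slide's 63.3 years.

```python
MAX_COLUMNS_PER_ROW = 2_000_000_000  # the "2 billion" figure from the slide
READINGS_PER_SECOND = 1              # assumed sampling rate, not stated in the deck
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60

years_until_full = MAX_COLUMNS_PER_ROW / READINGS_PER_SECOND / SECONDS_PER_YEAR
print(f"A single patient row fills up after about {years_until_full:.1f} years")
```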

Page 33: Apache cassandra  an introduction

What’s Wrong?
2. RowKey hashing will create a hotspot in the cluster. (Remember Row Level Hashing?)

Page 34: Apache cassandra  an introduction

Data modeling in C*

Time Series data modeling.

Page 35: Apache cassandra  an introduction

Create Tables

A.K.A: Compound Row Key

Page 36: Apache cassandra  an introduction

Table

Patient ID, Date (Partition Key) | Event Time (Clustering Column)
1,2015-02-17 | T:22:00:01, HR:71 | T:22:00:00, HR:72
2,2015-02-17 | T:22:00:05, HR:90 | T:22:00:02, HR:95

Data is SORTED and stored sequentially on disk
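The effect of the compound row key can be imitated in plain Python, without a cluster. This is a toy model, not driver code: a dictionary stands in for the partitions, the key is (patient ID, date), and readings inside a partition are kept sorted by event time. The function names are hypothetical, and the model sorts ascending, whereas the slide shows newest first.

```python
import bisect
from collections import defaultdict
from datetime import date, datetime

# (patient_id, date) -> list of (event_time, heart_rate), kept sorted by event_time
partitions = defaultdict(list)


def insert(patient_id, event_time, heart_rate):
    """Add a reading to the partition for that patient and day."""
    key = (patient_id, event_time.date())
    bisect.insort(partitions[key], (event_time, heart_rate))


def readings(patient_id, day):
    """Return one patient's readings for one day, ordered by event time."""
    return partitions.get((patient_id, day), [])


insert(1, datetime(2015, 2, 17, 22, 0, 1), 71)
insert(1, datetime(2015, 2, 17, 22, 0, 0), 72)
insert(2, datetime(2015, 2, 17, 22, 0, 5), 90)
insert(2, datetime(2015, 2, 17, 22, 0, 2), 95)
insert(1, datetime(2015, 2, 18, 0, 0, 1), 70)  # next day: a new partition

for event_time, heart_rate in readings(1, date(2015, 2, 17)):
    print(event_time.time(), heart_rate)
```

Because the date is part of the key, each patient starts a fresh partition every day, so no single row grows without bound and one patient's data is spread over many partitions.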

Page 37: Apache cassandra  an introduction

Insert Data

Page 38: Apache cassandra  an introduction

Query Data

Page 40: Apache cassandra  an introduction

Resources
Amazon Dynamo paper: http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf

Cassandra High Availability by Robbie Strickland: http://www.amazon.com/gp/product/1783989122/ref=cm_cr_ryp_prd_ttl_sol_0