
Introduction to Apache Cassandra

Jesús Guzmán, Apache Cassandra Certified

#cassandra

Jesús Alberto Guzmán

Apache Cassandra Certified @Datum

Who am I?

• Cassandra Overview
• Cassandra Architecture
• Data Modeling
• DataStax Enterprise

Objectives

Big Data

NoSQL

About Cassandra

"Apache Cassandra is an open source, distributed, decentralized, elastically scalable, highly available, fault-tolerant, tuneably consistent, column-oriented database that bases its distribution design on Amazon's Dynamo and its data model on Google's Bigtable. Created at Facebook, it is now used at some of the most popular sites on the Web."

Cassandra: The Definitive Guide.

Apache Cassandra

Bigtable (data model) and Dynamo (distribution design)

Apache Cassandra

• Must always be available
• 100% uptime
• Must be easy to manage and maintain
• Linear scalability at lowest cost
• Big Data

World Has Changed - Modern Online Applications

• Operational (OLTP) Data Store
• Masterless - No single point of failure
• Always on
• Linear scale performance
• Fast response times
• Always-on reliability
• Data replication across multiple data centers and the cloud
• Large amounts of structured, semi-structured, and unstructured data

Why Cassandra?

Linear scalability

• Designed expecting failure
• Data partitioned among all nodes in the cluster
• Configurable data replication to ensure uptime
• Linear scalability (performance / storage)

Fault Tolerant

Data replication – Multi Data Center

Masterless

Cassandra Internals

• Keyspace
  • Identified by name
  • Contains tables ("column families")
  • Determines replication factor
• Table
  • Identified by name
  • Has rows
• Row
  • Contains columns (up to 2 billion!)
  • Can have a different number of columns
• Column
  • Identified by name
  • Has a data type

Basic Concepts

• Node: A single instance of Cassandra
• Rack: A logical grouping of nodes (optional)
• Data Center: A logical grouping of racks or nodes
• Cluster: A logical grouping of data centers (1 to N)

Architecture

• Required for each table
• Uniquely identifies a row
• Partition Key
  • Determines the node
  • Has one or more columns
• Clustering Key
  • Determines disk location (order) within the partition
  • Has zero or more columns
  • Binary search
  • Search by: >, >=, <=, <, =

Primary Key
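To make the two roles of the primary key concrete, here is a minimal in-memory sketch in Python. It is a toy model, not how Cassandra stores data: the partition key picks a bucket, and rows inside that bucket stay sorted by clustering key so that range comparisons can use binary search. The `Partition` class, the `insert` helper and the sample values are all illustrative assumptions.

```python
from bisect import bisect_left, insort


class Partition:
    """Rows of one partition, kept sorted by clustering key (toy model)."""

    def __init__(self):
        self.keys = []  # sorted clustering keys
        self.rows = {}  # clustering key -> row values

    def upsert(self, clustering_key, row):
        # Insert the key in sorted position the first time it is seen.
        if clustering_key not in self.rows:
            insort(self.keys, clustering_key)
        self.rows[clustering_key] = row

    def range_query(self, lower=None, upper=None):
        """Return rows with lower <= clustering key < upper, using binary search."""
        lo = 0 if lower is None else bisect_left(self.keys, lower)
        hi = len(self.keys) if upper is None else bisect_left(self.keys, upper)
        return [self.rows[k] for k in self.keys[lo:hi]]


table = {}  # partition key -> Partition


def insert(partition_key, clustering_key, row):
    table.setdefault(partition_key, Partition()).upsert(clustering_key, row)


# Hypothetical data: customer id as partition key, time of day as clustering key.
insert("A11", "09:24", {"amount": 252.50})
insert("A11", "12:45", {"amount": -300.00})
insert("A11", "09:25", {"amount": 63.49})
insert("B23", "08:12", {"amount": 500.00})

morning = table["A11"].range_query(lower="09:00", upper="12:00")
```

Rows arrive out of order but are read back in clustering order, which is why range predicates on clustering columns are cheap while predicates on other columns are not.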

Three Key Concepts

• Partitioning (data distribution)
• Replication (fault tolerance)
• Consistency (tunable for performance)

How Does It Work?

• Partitioner
  • Generates tokens
  • Determines data distribution
  • Partition keys are hashed into 64-bit tokens
  • Murmur3 is the default

Partitioning

Token ring diagram: Node 1, Node 2, Node 3 and Node 4 on a ring spanning -2^63 to +2^63
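As a rough illustration of what "hashing a partition key into a token" means, the Python sketch below maps a key to a signed 64-bit integer. It is only a stand-in: it uses BLAKE2b from the standard library rather than Murmur3, so the values will not match the tokens a real cluster computes. The function name `toy_token` is an assumption made for this example.

```python
import hashlib

MIN_TOKEN = -2**63
MAX_TOKEN = 2**63 - 1


def toy_token(partition_key: str) -> int:
    """Map a partition key to a signed 64-bit token.

    Stand-in only: BLAKE2b is used here because it ships with Python;
    Cassandra's default partitioner uses Murmur3 instead.
    """
    digest = hashlib.blake2b(partition_key.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, byteorder="big", signed=True)


tokens = {key: toy_token(key) for key in ["AB1", "AB2", "ZZ3"]}
```

The same key always produces the same token, and similar keys land far apart on the ring, which is what spreads data evenly across nodes.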

• Simplified Token Range: Integers from 0 -> 100

Data Partitioning Example

Token ring diagram: Node 1, Node 2, Node 3 and Node 4 on a ring marked 0/100, 25, 50 and 75

ID | NAME | DOB | HASH | NODE

AB1 | John Smith | 10/11/1972 | 17 | Node 2

AB2 | Bob Jones | 3/1/1964 | 79 | Node 1

ZZ3 | Mike West | 4/22/1968 | 14 | Node 2

WX2 | Sally Thompson | 10/15/1969 | 32 | Node 3

MNZ | Bill Wright | 6/6/1966 | 51 | Node 4
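The hash values and node assignments in this example can be reproduced with a small ring lookup. The Python sketch below assumes each node owns the token range that ends at its own token (Node 2 at 25, Node 3 at 50, Node 4 at 75, Node 1 at 100). That layout is inferred from the example rather than stated on the slide.

```python
from bisect import bisect_left

# Assumed layout: (token, node), where a node owns tokens up to and including its own.
RING = [(25, "Node 2"), (50, "Node 3"), (75, "Node 4"), (100, "Node 1")]
TOKENS = [token for token, _ in RING]


def owner(token: int) -> str:
    """Return the node whose range contains the token, wrapping around the ring."""
    index = bisect_left(TOKENS, token) % len(RING)
    return RING[index][1]


# Hash values taken from the example rows.
hashes = {"AB1": 17, "AB2": 79, "ZZ3": 14, "WX2": 32, "MNZ": 51}
placement = {row_id: owner(h) for row_id, h in hashes.items()}
```

Under that assumption the lookup places AB1 and ZZ3 on Node 2, AB2 on Node 1, WX2 on Node 3 and MNZ on Node 4, matching the example.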

• Provides fault tolerance
• Provides geographic distribution
• Copies of each partition are distributed to data centers
• Defined at the schema level (Replication Factor)

Replication

RF = 1 | RF = 2 | RF = 3

A123 | JOHN SMITH | 11234

A147 | BOB MARTIN | 32235

B212 | JEN JONES | 43323
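One simple way to picture replica placement is to start at the node that owns the token and keep walking clockwise until RF distinct nodes are collected. The Python sketch below does exactly that on a hypothetical four-node ring in a single data center. It ignores racks and data centers, so treat it as a simplified model rather than a real replication strategy.

```python
from bisect import bisect_left

# Hypothetical single-data-center ring: (token, node), sorted by token.
RING = [(25, "Node 2"), (50, "Node 3"), (75, "Node 4"), (100, "Node 1")]


def replicas_for(token, rf):
    """Return up to rf distinct nodes, walking clockwise from the token's owner."""
    tokens = [t for t, _ in RING]
    start = bisect_left(tokens, token) % len(RING)
    count = min(rf, len(RING))
    return [RING[(start + i) % len(RING)][1] for i in range(count)]


def is_available(token, rf, down_nodes):
    """Data stays reachable while at least one of its replicas is up."""
    return any(node not in down_nodes for node in replicas_for(token, rf))
```

With Node 2 down, a row with token 17 is unreachable at RF = 1 but still served by Node 3 at RF = 2, which is the trade-off between storage cost and fault tolerance that the replication factor controls.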

• Higher Replication Factor = Greater Fault Tolerance

Replication

RF = 1 | RF = 2 | RF = 3

UNAVAILABLE

• Assign a Replication Factor for each Data Center and schema

APP { Toronto: 3, San Francisco: 3, Dubai: 3, New York: 3 }

Replication

Map of data centers: San Francisco, New York, Dubai, Toronto
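In CQL, per-data-center replication factors are declared when the keyspace is created, using `NetworkTopologyStrategy`. The Python sketch below only builds the statement text from a dictionary; it does not connect to a cluster. The keyspace name `app` and the helper `create_keyspace_cql` are assumptions for illustration, and in a real cluster the data center names must match the names the cluster itself reports.

```python
def create_keyspace_cql(keyspace, rf_by_dc):
    """Build a CREATE KEYSPACE statement with one replication factor per data center."""
    options = ", ".join(f"'{dc}': {rf}" for dc, rf in rf_by_dc.items())
    return (
        f"CREATE KEYSPACE IF NOT EXISTS {keyspace} WITH replication = "
        f"{{'class': 'NetworkTopologyStrategy', {options}}};"
    )


statement = create_keyspace_cql(
    "app",
    {"Toronto": 3, "San Francisco": 3, "Dubai": 3, "New York": 3},
)
```

The resulting string can be pasted into cqlsh or passed to a driver's execute call.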

CAP Theorem


• It is the number of REPLICAS that need to respond for a request to be considered complete (reads and writes/updates)
• The Consistency Level can be set on every request (normally a default is used)

Consistency

DC 1 | DC 2

Some Consistency Levels

• ANY – writes only; a stored hint is enough
• ONE – one replica must respond
• QUORUM – a majority (more than half) of replicas must respond
• LOCAL_QUORUM – a majority of replicas in the local data center must respond
• ALL – all replicas must respond

Consistency Level

DC 1 (RF=3) | DC 2 (RF=3)
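The number of replicas each level waits for follows from the replication factors. The Python sketch below works it out for the two-data-center layout shown here (RF=3 in each), using the usual majority rule of floor(n / 2) + 1. ANY is left out because a stored hint can satisfy it without any replica responding. The function names are illustrative.

```python
def quorum(replica_count):
    """Smallest majority of replica_count replicas."""
    return replica_count // 2 + 1


def required_acks(level, rf_by_dc, local_dc):
    """Replica responses needed before a request is considered complete."""
    total = sum(rf_by_dc.values())
    if level == "ONE":
        return 1
    if level == "QUORUM":
        return quorum(total)  # majority across all data centers
    if level == "LOCAL_QUORUM":
        return quorum(rf_by_dc[local_dc])  # majority in the local data center only
    if level == "ALL":
        return total
    raise ValueError(f"unsupported consistency level: {level}")


rf = {"DC 1": 3, "DC 2": 3}
acks = {level: required_acks(level, rf, "DC 1")
        for level in ["ONE", "QUORUM", "LOCAL_QUORUM", "ALL"]}
```

With six replicas in total, QUORUM needs four responses and must cross data centers, while LOCAL_QUORUM needs only two from the local data center, which is why it is the common choice for multi-data-center deployments.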

Tunable Consistency

How it works in Cassandra

WRITING DATA

Diagram: CLIENT writes at CONSISTENCY LEVEL LOCAL_QUORUM to two data centers, each with RF=3

Tunable Consistency

How it works in Cassandra

READING DATA

Diagram: CLIENT reads at CONSISTENCY LEVEL ONE
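Whether a read is guaranteed to see the latest write depends on how the write and read levels combine: a read always meets at least one up-to-date replica when write responses plus read responses exceed the replication factor. The Python sketch below checks that rule by brute force for one data center, enumerating every possible set of replicas that could have answered. The function name is an assumption for this example.

```python
from itertools import combinations


def always_overlap(rf, write_acks, read_acks):
    """True if every possible write set shares a replica with every possible read set."""
    replicas = range(rf)
    return all(
        set(written) & set(read)
        for written in combinations(replicas, write_acks)
        for read in combinations(replicas, read_acks)
    )


# RF=3 in the local data center: LOCAL_QUORUM waits for 2 replicas, ONE waits for 1.
quorum_write_one_read = always_overlap(3, write_acks=2, read_acks=1)
quorum_write_quorum_read = always_overlap(3, write_acks=2, read_acks=2)
```

Writing at LOCAL_QUORUM and reading at ONE is fast but can return stale data; reading at LOCAL_QUORUM as well closes that gap at the cost of waiting for a second replica.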

Common:

• ONE
• LOCAL_QUORUM Reads / Writes
• Lightweight Transactions (LWT)
• Application Level Locking (ING*)

Consistency

DC 1 (RF=3) | DC 2 (RF=3)

• Operation = Write/Read

Operations

Write Path

Read Path

• Hints
  • Coordinator stores missed mutations for later replay
  • Hints time out after 3 hours
• Read Repair
  • Mismatched results at read trigger a repair for that partition
  • Read Repair Chance setting triggers validation of all replicas on a small percentage of reads
• Repair
  • Process run on Node / Keyspace to true up data
  • Can be run automatically via OpsCenter in DSE
  • Ensures tombstones are properly evicted during compaction

Anti-Entropy Mechanisms
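The hint mechanism can be pictured as a small per-replica queue kept by the coordinator. The Python sketch below is a toy model of that idea, assuming the 3-hour limit means the coordinator stops collecting hints once a replica has been down longer than the window. The `HintStore` class and its method names are hypothetical, and timestamps are plain seconds.

```python
class HintStore:
    """Toy coordinator-side store of writes that a down replica missed."""

    def __init__(self, window_seconds=3 * 3600):
        self.window = window_seconds
        self.down_since = {}  # replica -> time it was first seen down
        self.hints = {}       # replica -> list of missed mutations

    def record_miss(self, replica, mutation, now):
        """Store a hint unless the replica has been down longer than the window."""
        first_seen_down = self.down_since.setdefault(replica, now)
        if now - first_seen_down > self.window:
            return False  # too old: a full repair is needed instead
        self.hints.setdefault(replica, []).append(mutation)
        return True

    def replay(self, replica):
        """Return the stored mutations when the replica comes back, then forget them."""
        self.down_since.pop(replica, None)
        return self.hints.pop(replica, [])


store = HintStore()
stored_first = store.record_miss("Node 2", "mutation-1", now=0)
stored_second = store.record_miss("Node 2", "mutation-2", now=3600)
stored_late = store.record_miss("Node 2", "mutation-3", now=4 * 3600)
replayed = store.replay("Node 2")
```

Writes missed after the window are not covered by hints, which is why read repair and scheduled repair are still needed alongside them.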

Compaction


• Snapshots
  • By table, keyspace, node, cluster
  • Very fast
  • Based on hard links
• Do you need backups?
  • Data replication
  • Data across all nodes

Backups

Data Modeling


• Cassandra is not an RDBMS
• Distributed changes the rules
• OLTP (not Analytics / Search / ad hoc query)
• Rows are accessed by Partition Key
• De-normalization (No joins)
• Multiple query tables
• Use Solr for Search, Hadoop/Spark for Analytics

Data Modeling in Cassandra

• Cassandra Query Language (CQL) is a SQL-like query language for communicating with the Cassandra database.
• CQLSH
• No Joins
• JSON support
• Upserts
• TTL
• Timestamps

CQL
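Upserts, TTL and timestamps interact: a write to an existing key simply replaces the value, the write with the highest timestamp wins, and a value written with a TTL disappears once it expires. The Python sketch below models those three behaviours in memory. It is not CQL and not a driver API: `ToyTable` is a hypothetical class, timestamps and TTLs share one unit (seconds) for simplicity, and timestamp ties are not modeled.

```python
class ToyTable:
    """In-memory model of upsert, last-write-wins timestamps and TTL."""

    def __init__(self):
        self.cells = {}  # (row key, column) -> (value, timestamp, expires_at)

    def write(self, key, column, value, timestamp, ttl=None):
        current = self.cells.get((key, column))
        if current is not None and current[1] > timestamp:
            return  # a newer write already exists; keep it
        expires_at = None if ttl is None else timestamp + ttl
        self.cells[(key, column)] = (value, timestamp, expires_at)

    def read(self, key, column, now):
        cell = self.cells.get((key, column))
        if cell is None:
            return None
        value, _, expires_at = cell
        if expires_at is not None and now >= expires_at:
            return None  # the TTL has elapsed
        return value


t = ToyTable()
t.write("A11", "name", "JOHN SMITH", timestamp=10)
t.write("A11", "name", "J. SMITH", timestamp=20)      # upsert: no separate UPDATE needed
t.write("A11", "name", "STALE VALUE", timestamp=5)    # older timestamp is ignored
t.write("A11", "session", "abc", timestamp=20, ttl=60)
```

There is no read-before-write in this model, which mirrors why upserts are cheap: the newest timestamp decides, regardless of arrival order.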

Some datatypes


• Collections:
  • Set
  • List
  • Map
• User-defined types (UDT)
• Tuples

Interesting Datatypes

Table Example

Track customer transactions by type

DATE | CUST_ID | TYPE | TIME | CUST NAME | LOCATION | AMOUNT

Diagram labels: PARTITION KEY, CLUSTERING COLUMNS, PRIMARY KEY

Track customer transactions by type

DATE | CUST_ID | TYPE | TIME | CUST NAME | LOCATION | AMOUNT

10/15/14 | A11 | DEPOSIT | 09:24:33.55 | JOHN SMITH | 30132 | 252.50

10/15/14 | A11 | DEPOSIT | 09:25:53.21 | JOHN SMITH | 30132 | 63.49

10/15/14 | A11 | WITHDRAW | 12:45:22.23 | JOHN SMITH | 30060 | -300.00

10/15/14 | B23 | DEPOSIT | 08:12:22.32 | BOB BARKER | 94123 | 500.00

Table Example

Partition size considerations
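To reason about partition size, it helps to group the sample rows the way the table would. The Python sketch below assumes DATE and CUST_ID form the partition key and TYPE and TIME are the clustering columns. The slide labels suggest this split, but the extracted text does not confirm it, so treat the grouping as an assumption.

```python
from collections import defaultdict

# Sample rows: DATE, CUST_ID, TYPE, TIME, CUST NAME, LOCATION, AMOUNT
ROWS = [
    ("10/15/14", "A11", "WITHDRAW", "12:45:22.23", "JOHN SMITH", "30060", -300.00),
    ("10/15/14", "A11", "DEPOSIT", "09:25:53.21", "JOHN SMITH", "30132", 63.49),
    ("10/15/14", "B23", "DEPOSIT", "08:12:22.32", "BOB BARKER", "94123", 500.00),
    ("10/15/14", "A11", "DEPOSIT", "09:24:33.55", "JOHN SMITH", "30132", 252.50),
]


def build_partitions(rows):
    """Group rows by the assumed partition key and sort by the assumed clustering columns."""
    partitions = defaultdict(list)
    for row in rows:
        date, cust_id = row[0], row[1]
        partitions[(date, cust_id)].append(row)
    for rows_in_partition in partitions.values():
        rows_in_partition.sort(key=lambda r: (r[2], r[3]))  # TYPE, then TIME
    return dict(partitions)


partitions = build_partitions(ROWS)
sizes = {key: len(rows) for key, rows in partitions.items()}
```

Because the date is part of the assumed partition key, each customer gets a new partition every day, which keeps any single partition from growing without bound.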

• Defines transitions between models
  • Query-driven methodology
  • Formal analysis and validation
• Defines a scientific approach to data modeling
  • Modeling rules
  • Mapping patterns
  • Schema optimization techniques

Query-Driven Methodology

• ER diagram (Chen notation)
• Describes entities, relationships, roles, keys, cardinalities
• What is possible and what is not in existing or future data

Conceptual model

Queries

Simple Order Management (queries)

• Q1: Customers by Customer ID
• Q2: Customer by email
• Q3: Product by Product ID
• Q4: Product by Name
• Q5: Product by Category
• Q6: Order Details by Order ID
• Q7: Order Details by Customer / Date
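In a query-driven model each query typically gets its own table, and the application writes the same data to all of them. The Python sketch below illustrates this for Q1 and Q2 with two dictionaries standing in for two tables. The names `customers_by_id`, `customers_by_email` and `save_customer`, and the sample customer, are hypothetical.

```python
# One "table" per query, each keyed by the column that query filters on.
customers_by_id = {}     # serves Q1: customer by customer ID
customers_by_email = {}  # serves Q2: customer by email


def save_customer(customer_id, email, name):
    """Write the same customer to every query table (denormalization, no joins)."""
    row = {"customer_id": customer_id, "email": email, "name": name}
    customers_by_id[customer_id] = dict(row)
    customers_by_email[email] = dict(row)


save_customer("C1", "ada@example.com", "Ada Example")

q1_result = customers_by_id["C1"]
q2_result = customers_by_email["ada@example.com"]
```

The cost is duplicated storage and an extra write per table; the benefit is that every query becomes a single lookup by partition key.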

• Logical-level shows column names and properties
• Physical-level also shows the column data type

Chebotko Diagram Notation

Logical Model

Physical Model

Enterprise Version

DataStax

• Founded in April 2010
• Santa Clara, Austin, New York, London, Paris
• 480+ Employees
• ~40 Percent
• 600+ Customers

DataStax Enterprise

• Certified Production Cassandra
• Enterprise Security Options
• Integrated Search
• Integrated Analytics (Spark)
• DSE Graph
• Workload Segregation
• In Memory
• OpsCenter
• Management Services

• MDM: Customer 360, Product Catalog
• Personalization and Recommendation
• Internet of Things and Time Series
• Fraud Detection
• List Management
• Messaging
• Inventory Management
• Authentication

DSE Use Cases

• Visual, browser-based user interface.
• Installation, configuration, and administration tasks carried out in point-and-click fashion.
• Visually supports DataStax Automatic Management Services.

DataStax OpsCenter

Thank You Very Much
