introduction to apache cassandra

Introduction to Apache Cassandra Jesús Guzmán Apache Cassandra Certified

Upload: jesus-alberto-guzman-polanco

Post on 14-Apr-2017


TRANSCRIPT

Page 1: Introduction to Apache Cassandra

Introduction to Apache Cassandra

Jesús Guzmán, Apache Cassandra Certified

Page 2: Introduction to Apache Cassandra

#cassandra

Who am I?

Jesus Alberto Guzmán [email protected]

Apache Cassandra Certified @Datum

Page 3: Introduction to Apache Cassandra

Objectives

• Cassandra Overview
• Cassandra Architecture
• Data Modeling
• DataStax Enterprise

Page 4: Introduction to Apache Cassandra


Big Data

Page 5: Introduction to Apache Cassandra

NoSQL

Page 6: Introduction to Apache Cassandra


About Cassandra

Page 7: Introduction to Apache Cassandra

Apache Cassandra

"Apache Cassandra is an open source, distributed, decentralized, elastically scalable, highly available, fault-tolerant, tuneably consistent, column-oriented database that bases its distribution design on Amazon's Dynamo and its data model on Google's Bigtable. Created at Facebook, it is now used at some of the most popular sites on the Web."

Cassandra: The Definitive Guide.

Page 8: Introduction to Apache Cassandra

Apache Cassandra

[Diagram labels: BigTable, Dynamo]

Page 9: Introduction to Apache Cassandra

World Has Changed - Modern Online Applications

• Must always be available
• 100% uptime
• Must be easy to manage and maintain
• Linear scalability at lowest cost
• Big Data

Page 10: Introduction to Apache Cassandra

Why Cassandra?

• Operational (OLTP) Data Store
• Masterless - No single point of failure
• Always on
• Linear scale performance
• Fast response times
• Always on reliability
• Data replication across multiple data centers and the cloud
• Large amounts of structured, semi-structured, and unstructured data

Page 11: Introduction to Apache Cassandra

Linear scalability

• Designed expecting failure
• Data partitioned among all nodes in the cluster
• Configurable data replication to ensure uptime
• Linear scalability (performance / storage)

Page 12: Introduction to Apache Cassandra


Fault Tolerant

Page 13: Introduction to Apache Cassandra

Data replication – Multi Data Center

Page 14: Introduction to Apache Cassandra

Masterless

Page 15: Introduction to Apache Cassandra

Cassandra Internals

Page 16: Introduction to Apache Cassandra

Basic Concepts

• Keyspace
  • Identified by name
  • Contains tables ("column families")
  • Determines replication factor
• Table
  • Identified by name
  • Has rows
• Row
  • Contains columns (up to 2 billion!)
  • Can have different number of columns
• Column
  • Identified by name
  • Has data type

Page 17: Introduction to Apache Cassandra

Architecture

• Node: A single instance of Cassandra
• Rack: A logical grouping of nodes (optional)
• Data Center: A logical grouping of racks or nodes
• Cluster: A logical grouping of data centers (1 to N)

Page 18: Introduction to Apache Cassandra


Architecture

Page 19: Introduction to Apache Cassandra

Primary Key

• Required for each table
• Uniquely identifies row
• Partition Key
  • Determines node
  • Has one or more columns
• Clustering Key
  • Determines disk location (order)
  • Has zero or more columns
  • Binary search
  • Search by: >, >=, <=, <, =
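To make the two roles of the primary key concrete, here is a minimal Python sketch, not Cassandra's actual storage engine. It assumes a plain dictionary standing in for the cluster, with the partition key selecting a partition and the clustering key keeping rows ordered inside it, so that a range comparison becomes a binary search. The `Partition` class, `write` helper, and sensor data are hypothetical names invented for the illustration.

```python
from bisect import bisect_left, insort


class Partition:
    """Toy partition: rows kept in clustering-key order (hypothetical model)."""

    def __init__(self):
        self.keys = []  # clustering keys, always sorted
        self.rows = {}  # clustering key -> row value

    def upsert(self, clustering_key, row):
        if clustering_key not in self.rows:
            insort(self.keys, clustering_key)  # keep keys ordered on insert
        self.rows[clustering_key] = row

    def range_query(self, low, high):
        """Return rows with low <= clustering key < high via binary search."""
        start = bisect_left(self.keys, low)
        end = bisect_left(self.keys, high)
        return [(key, self.rows[key]) for key in self.keys[start:end]]


table = {}  # partition key -> Partition (stands in for "which node")


def write(partition_key, clustering_key, row):
    table.setdefault(partition_key, Partition()).upsert(clustering_key, row)


write("sensor-1", 30, "c")
write("sensor-1", 10, "a")
write("sensor-1", 20, "b")
write("sensor-2", 5, "x")

# Range search on the clustering key inside a single partition
print(table["sensor-1"].range_query(10, 30))
```

Rows arrive out of order but are read back sorted, which is why range predicates are cheap on the clustering key and not on arbitrary columns.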

Page 20: Introduction to Apache Cassandra

How Does It Work?

Three key concepts:

• Partitioning (data distribution)
• Replication (fault tolerance)
• Consistency (tunable for performance)

Page 21: Introduction to Apache Cassandra

Partitioning

• Partitioner
  • Generates tokens
  • Determines data distribution
  • Partition keys are hashed into 64-bit tokens
  • Murmur3 is the default

[Ring diagram: Node 1, Node 2, Node 3, Node 4; token range from -2^63 to +2^63]

Page 22: Introduction to Apache Cassandra

Data Partitioning Example

• Simplified token range: integers from 0 to 100

[Ring diagram: Node 1, Node 2, Node 3, Node 4; token marks at 0/100, 25, 50, 75]

ID | NAME | DOB | HASH | NODE
AB1 | John Smith | 10/11/1972 | 17 | Node 2
AB2 | Bob Jones | 3/1/1964 | 79 | Node 1
ZZ3 | Mike West | 4/22/1968 | 14 | Node 2
WX2 | Sally Thompson | 10/15/1969 | 32 | Node 3
MNZ | Bill Wright | 6/6/1966 | 51 | Node 4
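The lookup in this example can be sketched in a few lines of Python. The ring layout below is an assumption chosen to reproduce the hash-to-node assignments on the slide: each node is taken to own the token range ending at its own token (Node 2 at 25, Node 3 at 50, Node 4 at 75, Node 1 at 100). The hash values are the slide's simplified ones, not real Murmur3 output, and `owner` is a hypothetical helper name.

```python
from bisect import bisect_left

# Assumed ring: (token, node), each node owning the range that ends at its token
RING = [(25, "Node 2"), (50, "Node 3"), (75, "Node 4"), (100, "Node 1")]
TOKENS = [token for token, _ in RING]


def owner(token):
    """Find the first node whose token is >= the given token, wrapping around."""
    index = bisect_left(TOKENS, token) % len(RING)
    return RING[index][1]


# Simplified hashes taken from the slide's example rows
HASHES = {"AB1": 17, "AB2": 79, "ZZ3": 14, "WX2": 32, "MNZ": 51}

for row_id, token in HASHES.items():
    print(row_id, token, owner(token))
```

Because the tokens are kept sorted, finding the owner is a binary search over the ring and needs no central lookup service.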

Page 23: Introduction to Apache Cassandra

Replication

• Provides fault tolerance
• Provides geographic distribution
• Copies of each partition are distributed to data centers
• Defined on a schema level (Replication Factor)

[Diagram: RF = 1, RF = 2, RF = 3]

A123 | JOHN SMITH | 11234
A147 | BOB MARTIN | 32235
B212 | JEN JONES | 43323
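One simple way to picture how a replication factor turns into replica placement is to start at the node that owns the token and keep walking clockwise around the ring until RF distinct nodes are collected. The Python sketch below assumes a single data center and this simple clockwise strategy; the four-node ring with tokens 25, 50, 75, and 100 reuses the simplified 0 to 100 token range, and `replicas` is a hypothetical helper.

```python
from bisect import bisect_left

# Assumed single-data-center ring; tokens and node names are illustrative
TOKENS = [25, 50, 75, 100]
NODES = ["Node 2", "Node 3", "Node 4", "Node 1"]


def replicas(token, rf):
    """Owner of the token plus the next rf - 1 nodes clockwise on the ring."""
    if rf > len(NODES):
        raise ValueError("replication factor exceeds the number of nodes")
    first = bisect_left(TOKENS, token) % len(NODES)
    return [NODES[(first + step) % len(NODES)] for step in range(rf)]


for rf in (1, 2, 3):
    print(rf, replicas(79, rf))
```

With RF = 1 a single node failure makes the partition unreachable, while with RF = 3 two other nodes still hold a copy.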

Page 24: Introduction to Apache Cassandra

Replication

• Higher Replication Levels = Greater Fault Tolerance

[Diagram: RF = 1, RF = 2, RF = 3; label: UNAVAILABLE]

Page 25: Introduction to Apache Cassandra

Replication

• Assign Replication Factor for each Data Center and schema

APP { Toronto : 3, San Francisco : 3, Dubai : 3, New York : 3 }

[Map: San Francisco, New York, Dubai, Toronto]
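In CQL, a per-data-center replication factor is declared when the keyspace is created, using the `NetworkTopologyStrategy` replication class. The Python sketch below only builds the statement text for the APP example; it does not connect to a cluster. The helper name `create_keyspace_cql` is hypothetical, and the data center names are assumed to match whatever names the cluster itself reports.

```python
def create_keyspace_cql(keyspace, rf_by_dc):
    """Build a CREATE KEYSPACE statement with one replication factor per data center."""
    parts = ["'class': 'NetworkTopologyStrategy'"]
    parts += [f"'{dc}': {rf}" for dc, rf in rf_by_dc.items()]
    options = ", ".join(parts)
    return "CREATE KEYSPACE " + keyspace + " WITH replication = {" + options + "};"


# Replication factors from the slide's APP example
RF_BY_DC = {"Toronto": 3, "San Francisco": 3, "Dubai": 3, "New York": 3}

cql = create_keyspace_cql("app", RF_BY_DC)
total_copies = sum(RF_BY_DC.values())  # copies of each partition across all data centers

print(cql)
print(total_copies)
```

Under these settings every partition is stored twelve times in total, three copies in each of the four data centers.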

Page 26: Introduction to Apache Cassandra


CAP Theorem

Page 27: Introduction to Apache Cassandra

Consistency

• It is the number of replicas that need to respond for a request to be considered complete (reads and writes/updates)
• The consistency level can be set on every request (normally a default is used)

[Diagram: DC 1, DC 2]

Page 28: Introduction to Apache Cassandra

Consistency Level

Some consistency levels:

• ANY** (hints, writes only)
• ONE – one replica must respond
• QUORUM – a majority of replicas (more than half) must respond
• LOCAL_QUORUM – a majority of replicas in the local data center must respond
• ALL – all replicas must respond

[Diagram: DC 1 (RF=3), DC 2 (RF=3)]
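A quorum is a strict majority, computed as the replica count divided by two (rounded down) plus one. For the two data centers with RF=3 each, the number of replicas that must respond at each level works out as in the Python sketch below. The `quorum` helper and the dictionary layout are illustrative names only.

```python
def quorum(replica_count):
    """Smallest number of replicas that forms a strict majority."""
    return replica_count // 2 + 1


RF_BY_DC = {"DC 1": 3, "DC 2": 3}
TOTAL_REPLICAS = sum(RF_BY_DC.values())

required = {
    "ONE": 1,
    "QUORUM": quorum(TOTAL_REPLICAS),        # majority across all data centers
    "LOCAL_QUORUM": quorum(RF_BY_DC["DC 1"]),  # majority within the local data center
    "ALL": TOTAL_REPLICAS,
}

for level, count in required.items():
    print(level, count)
```

The gap between QUORUM (4 of 6) and LOCAL_QUORUM (2 of 3) is why LOCAL_QUORUM avoids waiting on a remote data center.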

Page 29: Introduction to Apache Cassandra

Tunable Consistency: How it works in Cassandra

WRITING DATA

[Diagram: CLIENT writing with CONSISTENCY LEVEL LOCAL_QUORUM; RF=3, RF=3]

Page 30: Introduction to Apache Cassandra

Tunable Consistency: How it works in Cassandra

READING DATA

[Diagram: CLIENT reading with CONSISTENCY LEVEL ONE]
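Whether a read is guaranteed to see the latest acknowledged write depends on whether the set of replicas that acknowledged the write must overlap the set consulted by the read. The Python sketch below checks this by brute force for one data center with RF=3. It is a simplification that ignores hints, read repair, and other data centers, and `always_overlap` is a hypothetical helper name.

```python
from itertools import combinations


def always_overlap(rf, write_acks, read_acks):
    """True if every possible write set shares at least one replica with every read set."""
    replicas = range(rf)
    return all(
        set(write_set) & set(read_set)
        for write_set in combinations(replicas, write_acks)
        for read_set in combinations(replicas, read_acks)
    )


# Write at LOCAL_QUORUM (2 of 3), read at ONE (1 of 3)
print(always_overlap(3, 2, 1))
# Write at LOCAL_QUORUM (2 of 3), read at LOCAL_QUORUM (2 of 3)
print(always_overlap(3, 2, 2))
```

The result matches the usual rule of thumb: the sets are guaranteed to overlap only when write replicas plus read replicas exceed the replication factor, so a LOCAL_QUORUM write followed by a ONE read can return stale data.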

Page 31: Introduction to Apache Cassandra

Consistency

Common:

• One
• Local_Quorum Reads / Writes
• Light Weight Transactions (LWT)
• Application Level Locking (ING*)

[Diagram: DC 1 (RF=3), DC 2 (RF=3)]

Page 32: Introduction to Apache Cassandra

Operations

• Operation = Write/Read

Page 33: Introduction to Apache Cassandra

Operations

• Operation = Write/Read

Page 34: Introduction to Apache Cassandra


Write Path

Page 35: Introduction to Apache Cassandra


Read Path

Page 36: Introduction to Apache Cassandra


Read Path

Page 37: Introduction to Apache Cassandra

Anti-Entropy Mechanisms

• Hints
  • Coordinator stores missed mutations for later replay
  • Time out after 3 hours
• Read Repair
  • Mismatched results at read trigger a repair for that partition
  • Read Repair Chance setting triggers validation of all replicas on a small percentage of reads
• Repair
  • Process run on a node / keyspace to bring replicas back in sync
  • Can be run automatically via OpsCenter in DSE
  • Ensures tombstones are properly evicted during compaction
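Read repair can be pictured as follows: the coordinator compares the replies from the replicas, keeps the value with the newest write timestamp, and sends that value back to any replica that returned something older. The Python sketch below assumes this last-write-wins reconciliation and models each reply as a `(timestamp, value)` pair. The `read_repair` function and the replica names are hypothetical.

```python
def read_repair(replies):
    """Pick the newest reply and list the replicas that need to be updated.

    replies maps a replica name to a (write_timestamp, value) pair.
    """
    newest = max(replies.values(), key=lambda cell: cell[0])
    stale = sorted(name for name, cell in replies.items() if cell != newest)
    repaired = {name: newest for name in replies}  # state after the repair writes
    return newest[1], stale, repaired


replies = {
    "replica-a": (100, "old address"),
    "replica-b": (200, "new address"),
    "replica-c": (200, "new address"),
}

value, stale, repaired = read_repair(replies)
print(value)
print(stale)
```

Hints and full repair aim at the same goal of converging replicas, but they work at write time and as a scheduled background process respectively, rather than on the read path.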

Page 38: Introduction to Apache Cassandra


Compaction

Page 39: Introduction to Apache Cassandra

Backups

• Snapshots
  • By table, keyspace, node, cluster
  • Very fast
  • Implemented as hard links
• Do you need backups?
  • Data replication
  • Data across all nodes

Page 40: Introduction to Apache Cassandra


Data Modeling

Page 41: Introduction to Apache Cassandra

Data Modeling in Cassandra

• Cassandra is not an RDBMS
• Distributed changes the rules
• OLTP (not Analytics / Search / ad hoc query)
• Rows are accessed by Partition Key
• De-normalization (No joins)
• Multiple query tables
• Use Solr for Search, Hadoop/Spark for Analytics

Page 42: Introduction to Apache Cassandra

CQL

• Cassandra Query Language (CQL) is a query language for the Cassandra database.
• A SQL-like query language for communicating with Cassandra
• CQLSH
• No Joins
• JSON support
• Upserts
• TTL
• Timestamps
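Two of these features, upserts and TTL, are easy to misread coming from SQL. An upsert means that a write to an existing primary key silently replaces the old value instead of failing, and a TTL means the written value stops being readable after a given number of seconds. The Python sketch below models both with an in-memory dictionary and an explicit clock passed in as `now`; the `ToyTable` class is hypothetical and far simpler than the real per-column bookkeeping.

```python
class ToyTable:
    """In-memory model of upsert and TTL semantics (illustration only)."""

    def __init__(self):
        self.rows = {}  # primary key -> (value, expiry time or None)

    def write(self, key, value, now, ttl=None):
        # INSERT and UPDATE behave alike here: the write replaces what exists
        expires_at = now + ttl if ttl is not None else None
        self.rows[key] = (value, expires_at)

    def read(self, key, now):
        entry = self.rows.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if expires_at is not None and now >= expires_at:
            return None  # the value has outlived its TTL
        return value


sessions = ToyTable()
sessions.write("user-1", "token-a", now=0)
sessions.write("user-1", "token-b", now=5, ttl=60)  # upsert: no error, value replaced

print(sessions.read("user-1", now=10))
print(sessions.read("user-1", now=65))
```

The first read returns the replaced value and the second returns nothing, because the write at time 5 expired at time 65.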

Page 43: Introduction to Apache Cassandra


Some datatypes

Page 44: Introduction to Apache Cassandra

Interesting Datatypes

• Collections:
  • Set
  • List
  • Map
• User defined types (UDT)
• Tuples

Page 45: Introduction to Apache Cassandra

Table Example

Track customer transactions by type

DATE | CUST_ID | TYPE | TIME | CUST NAME | LOCATION | AMOUNT

PRIMARY KEY = PARTITION KEY + CLUSTERING COLUMNS

Page 46: Introduction to Apache Cassandra

Table Example

Track customer transactions by type

DATE | CUST_ID | TYPE | TIME | CUST NAME | LOCATION | AMOUNT
10/15/14 | A11 | DEPOSIT | 09:24:33.55 | JOHN SMITH | 30132 | 252.50
10/15/14 | A11 | DEPOSIT | 09:25:53.21 | JOHN SMITH | 30132 | 63.49
10/15/14 | A11 | WITHDRAW | 12:45:22.23 | JOHN SMITH | 30060 | -300.00
10/15/14 | B23 | DEPOSIT | 08:12:22.32 | BOB BARKER | 94123 | 500.00

Partition size considerations
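Partition size follows directly from the choice of partition key: every row that shares the same partition key values lands in the same partition. The Python sketch below groups the four sample rows under the assumption that (DATE, CUST_ID) is the partition key and (TYPE, TIME) are the clustering columns. The slide does not state exactly where that boundary falls, so the split is a guess made for illustration, and `group_by_partition` is a hypothetical helper.

```python
from collections import defaultdict

# Sample rows from the slide: DATE, CUST_ID, TYPE, TIME, CUST NAME, LOCATION, AMOUNT
ROWS = [
    ("10/15/14", "A11", "DEPOSIT", "09:24:33.55", "JOHN SMITH", "30132", 252.50),
    ("10/15/14", "A11", "DEPOSIT", "09:25:53.21", "JOHN SMITH", "30132", 63.49),
    ("10/15/14", "A11", "WITHDRAW", "12:45:22.23", "JOHN SMITH", "30060", -300.00),
    ("10/15/14", "B23", "DEPOSIT", "08:12:22.32", "BOB BARKER", "94123", 500.00),
]


def group_by_partition(rows, partition_cols=2, clustering_cols=2):
    """Group rows by the leading partition columns, sorted by the clustering columns."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[:partition_cols]].append(row)
    end = partition_cols + clustering_cols
    for partition_rows in partitions.values():
        partition_rows.sort(key=lambda row: row[partition_cols:end])
    return dict(partitions)


partitions = group_by_partition(ROWS)
for key, partition_rows in partitions.items():
    print(key, len(partition_rows))
```

Using only DATE as the partition key would put every customer's transactions for a day into one partition, which is the kind of growth the size consideration is meant to catch.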

Page 47: Introduction to Apache Cassandra

Query-Driven Methodology

• Defines transitions between models
  • Query-driven methodology
  • Formal analysis and validation
• Defines a scientific approach to data modeling
  • Modeling rules
  • Mapping patterns
  • Schema optimization techniques

Page 48: Introduction to Apache Cassandra

Conceptual model

• ER diagram (Chen notation)
• Describes entities, relationships, roles, keys, cardinalities
• What is possible and what is not in existing or future data

Page 49: Introduction to Apache Cassandra

Queries: Simple Order Management

• Q1: Customers by Customer ID
• Q2: Customer by email
• Q3: Product by Product ID
• Q4: Product by Name
• Q5: Product By Category
• Q6: Order Details by Order ID
• Q7: Order Details by Customer / Date
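Because there are no joins and rows are reached through the partition key, each query in a list like this typically gets its own table, and the application writes the same data to all of them. The Python sketch below illustrates that idea for Q1 and Q2 with two dictionaries standing in for two tables. The table names, the `save_customer` helper, and the sample customer are all hypothetical.

```python
# One "table" per query, each keyed by what that query searches on
customers_by_id = {}     # serves Q1: customer by customer ID
customers_by_email = {}  # serves Q2: customer by email


def save_customer(customer_id, email, name):
    """Write the same customer into every query table (denormalization)."""
    record = {"customer_id": customer_id, "email": email, "name": name}
    customers_by_id[customer_id] = dict(record)
    customers_by_email[email] = dict(record)


save_customer("C1", "ana@example.com", "Ana")

print(customers_by_id["C1"]["name"])
print(customers_by_email["ana@example.com"]["customer_id"])
```

The cost of this design is duplicated storage and extra writes; the benefit is that every read is a single lookup by key.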

Page 50: Introduction to Apache Cassandra

Chebotko Diagram Notation

• Logical-level shows column names and properties
• Physical-level also shows the column data type

Page 51: Introduction to Apache Cassandra


Logical Model

Page 52: Introduction to Apache Cassandra


Physical Model

Page 53: Introduction to Apache Cassandra

Enterprise Version

Page 54: Introduction to Apache Cassandra

DataStax

• Founded in April 2010
• 480+ Employees
• ~40 Percent
• 600+ Customers
• Santa Clara, Austin, New York, London, Paris

Page 55: Introduction to Apache Cassandra

DataStax Enterprise

• Certified Production Cassandra
• Enterprise Security Options
• Integrated Search
• Integrated Analytics (Spark)
• DSE Graph
• Workload Segregation
• In Memory
• OpsCenter
• Management Services

Page 56: Introduction to Apache Cassandra

DSE Use Cases

• MDM: Customer 360, Product Catalog
• Personalization and Recommendation
• Internet of Things and Time Series
• Fraud Detection
• List Management
• Messaging
• Inventory Management
• Authentication

Page 57: Introduction to Apache Cassandra

DataStax OpsCenter

• Visual, browser-based user interface.
• Installation, configuration, and administration tasks carried out in point-and-click fashion.
• Visually supports DataStax Automatic Management Services.

Page 58: Introduction to Apache Cassandra

Thank you very much