Apache Cassandra for Timeseries- and Graph-Data
TRANSCRIPT
Apache Cassandra for Timeseries- and Graph-Data
Guido Schmutz
Guido Schmutz
• Working for Trivadis for more than 18 years
• Oracle ACE Director for Fusion Middleware and SOA
• Co-author of several books
• Consultant, Trainer and Software Architect for Java, Oracle, SOA and Big Data / Fast Data
• Member of Trivadis Architecture Board
• Technology Manager @ Trivadis
• More than 25 years of software development experience
• Contact: [email protected]
• Blog: http://guidoschmutz.wordpress.com
• Twitter: gschmutz
2
Agenda
1. Customer Use Case and Architecture
2. Cassandra Data Modeling
3. Cassandra for Timeseries Data
4. Cassandra for Graph Data
5. Summary
3
Customer Use Case and Architecture
4
Research Project @ Armasuisse W&T
W&T flagship project, standing for innovation & tech transfer

Building capabilities in the areas of:
• Social Media Intelligence (SOCMINT)
• Big Data Technologies & Architectures

Invest into new, innovative and not yet widely proven technologies:
• Batch analysis
• Real-time analysis
• NoSQL databases
• Text analysis (NLP)
• …
3 Phases: June 2013 – June 2015
5
SOCMINT Demonstrator – Time Dimension
Major data model: Time series (TS)
TS reflect user behaviors over time
Activities correlate with events
Anomaly detection
Event detection & prediction
6
SOCMINT Demonstrator – Social Dimension
User-user networks (social graphs)
Twitter: follower, retweet and mention graphs
Who is central in a social network?
Who has retweeted a given tweet to whom?
7
SOCMINT Demonstrator - “Lambda Architecture” for Big Data
[Diagram: Lambda Architecture – data sources (Social, RDBMS, Sensor, ERP, Logfiles, Mobile, Machine) feed through a messaging channel (Data Collection) into two layers: (Analytical) Batch Data Processing, where batch compute jobs run over a raw data reservoir into a batch result store, and (Analytical) Real-Time Data Processing, where stream/event processing writes to a real-time result store. A query engine over both result stores (computed information) serves the Data Access layer: reports, services, analytic tools and alerting tools. Legend distinguishes data in motion from data at rest.]
SOCMINT Demonstrator – Frameworks & Components in Use
[Diagram: the same Lambda Architecture, annotated with the frameworks and components used in each layer]
SOCMINT Demonstrator – Cassandra Cluster
6 node cluster based on Datastax Enterprise Edition (DSE)
Installed in a virtualized environment, but we control the placement on disk

We only keep 3 days of data
• use the TTL feature of Cassandra to automatically erase old data
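The retention policy above can be expressed directly in CQL. A minimal sketch (not from the slides; table and column names are illustrative) of building an INSERT statement whose rows expire automatically after 3 days:

```python
# Illustrative sketch: build a CQL INSERT that uses Cassandra's TTL feature
# so that rows expire automatically after 3 days.
THREE_DAYS_SECONDS = 3 * 24 * 60 * 60  # 259200

def insert_with_ttl(table, columns, ttl_seconds=THREE_DAYS_SECONDS):
    """Build a parameterized CQL INSERT statement with a TTL clause."""
    placeholders = ", ".join("?" for _ in columns)
    return (f"INSERT INTO {table} ({', '.join(columns)}) "
            f"VALUES ({placeholders}) USING TTL {ttl_seconds}")

stmt = insert_with_ttl("tweet", ["tweet_id", "username", "message"])
print(stmt)
# INSERT INTO tweet (tweet_id, username, message) VALUES (?, ?, ?) USING TTL 259200
```

With a per-row TTL there is no separate cleanup job: Cassandra tombstones expired cells during compaction.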
Cassandra supports both Timeseries and Connected-Data (Graph)
[Diagram: 6-node Cassandra ring: Node 1 … Node 6]
11
Cassandra Data Modeling
12
Cassandra Data Modeling
13
• Don’t think relational
• Denormalize, Denormalize, Denormalize …
• Rows are gigantic and sorted => one row is stored on one node
• Know your application/use cases => from query to model
• Index is not an afterthought anymore => “index” upfront
• Control the physical storage structure
Static Column Family – “Skinny Row”
14
CREATE TABLE skinny (
  rowkey text,
  c1 text,
  c2 text,
  c3 text,
  PRIMARY KEY (rowkey)
);

Grows up to billions of rows

rowkey-1: c1=value-c1, c2=value-c2, c3=value-c3
rowkey-2: c1=value-c1, c3=value-c3
rowkey-3: c1=value-c1, c2=value-c2, c3=value-c3

rowkey = Partition Key
Dynamic Column Family – “Wide Row”
15
CREATE TABLE wide (
  rowkey text,
  ckey text,
  c1 text,
  c2 text,
  PRIMARY KEY (rowkey, ckey)
) WITH CLUSTERING ORDER BY (ckey ASC);

Billions of rows, 1 … 2 billion columns per row

rowkey-1: ckey-1:c1, ckey-1:c2 | ckey-2:c1, ckey-2:c2 | ckey-3:c1, ckey-3:c2
rowkey-2: ckey-1:c1, ckey-1:c2 | ckey-2:c1, ckey-2:c2
rowkey-3: ckey-1:c1, ckey-1:c2 | ckey-2:c1, ckey-2:c2 | ckey-3:c1, ckey-3:c2

rowkey = Partition Key, ckey = Clustering Key
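The wide-row layout can be sketched as a dictionary of sorted maps. This is a simplified model (not from the slides): the partition key selects one node-local partition, and within it cells are kept ordered by clustering key, which is what makes range scans cheap:

```python
# Simplified model of a wide-row table: partition key -> sorted clustering keys.
from collections import defaultdict

class WideRowTable:
    def __init__(self):
        # partition key -> {clustering key -> {column: value}}
        self.partitions = defaultdict(dict)

    def insert(self, rowkey, ckey, **columns):
        self.partitions[rowkey][ckey] = columns

    def scan(self, rowkey, reverse=False):
        """Return the rows of one partition ordered by clustering key,
        regardless of insertion order (mirrors CLUSTERING ORDER BY)."""
        return sorted(self.partitions[rowkey].items(), reverse=reverse)

t = WideRowTable()
t.insert("rowkey-1", "ckey-2", c1="value-c1", c2="value-c2")
t.insert("rowkey-1", "ckey-1", c1="value-c1", c2="value-c2")
print(t.scan("rowkey-1"))  # ckey-1 first: clustering order, not insert order
```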
Cassandra for Timeseries Data
16
Know your application => From query to model
17
Show Timeline of Tweets
Show Timeseries on different levels of aggregation (resolution):
• Seconds
• Minutes
• Hours
Show Timeline: Provide Raw Data (Tweets)
18
CREATE TABLE tweet (
  tweet_id bigint,
  username text,
  message text,
  hashtags list<text>,
  latitude double,
  longitude double,
  …
  PRIMARY KEY (tweet_id)
);
• Skinny Row Table
• Holds the sensor raw data => Tweets
• Similar to a relational table
• Primary Key is the partition key

10000121: username=gschmutz, message=Getting ready for .., hashtags=[cassandra, nosql], latitude=0, longitude=0
20121223: username=DataStax, message=The Speed Factor .., hashtags=[BigData], latitude=0, longitude=0

tweet_id = Partition Key
Show Timeline: Provide Raw Data (Tweets)
19
INSERT INTO tweet (tweet_id, username, message, hashtags, latitude, longitude)
VALUES (10000121, 'gschmutz', 'Getting ready for my talk about using Cassandra for Timeseries and Graph Data', ['cassandra', 'nosql'], 0, 0);

SELECT tweet_id, username, hashtags, message FROM tweet WHERE tweet_id = 10000121;

 tweet_id | username | hashtags               | message
----------+----------+------------------------+----------------------------
 10000121 | gschmutz | ['cassandra', 'nosql'] | Getting ready for ...
Show Timeline: Provide Sequence of Events
20
CREATE TABLE tweet_timeline (
  sensor_id text,
  bucket_id text,
  time_id timestamp,
  tweet_id bigint,
  PRIMARY KEY ((sensor_id, bucket_id), time_id)
) WITH CLUSTERING ORDER BY (time_id DESC);

Wide Row Table

bucket_id creates buckets for columns, e.g. SECOND-2015-10-14

ABC-001:SECOND-2015-10-14 => 10:00:02: tweet_id=10000121
DEF-931:SECOND-2015-10-14 => 10:09:02: tweet_id=1003121343 | 09:12:09: tweet_id=1002111343 | 09:10:02: tweet_id=1001121343

(sensor_id, bucket_id) = Partition Key, time_id = Clustering Key
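The bucket_id that forms part of the partition key can be derived from the event timestamp. A sketch of this bucketing idea (the exact mapping per resolution is an assumption, based on the bucket names shown on the slides):

```python
# Illustrative sketch: map an event timestamp to a time bucket, so a partition
# holds a bounded time span and cannot grow without limit.
from datetime import datetime

def bucket_id(resolution, ts):
    """Derive a bucket_id such as 'SECOND-2015-10-14' or 'HOUR-2015-10'."""
    if resolution == "SECOND":   # per-second values, bucketed by day
        return f"SECOND-{ts:%Y-%m-%d}"
    if resolution == "HOUR":     # per-hour values, bucketed by month
        return f"HOUR-{ts:%Y-%m}"
    if resolution == "DAY":      # per-day values, bucketed by month
        return f"DAY-{ts:%Y-%m}"
    raise ValueError(f"unknown resolution: {resolution}")

ts = datetime(2015, 10, 14, 10, 50)
print(bucket_id("SECOND", ts))  # SECOND-2015-10-14
print(bucket_id("HOUR", ts))    # HOUR-2015-10
```

Because (sensor_id, bucket_id) together form the partition key, reading one day of one sensor touches exactly one partition.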
Show Timeline: Provide Sequence of Events
21
INSERT INTO tweet_timeline (sensor_id, bucket_id, time_id, tweet_id)
VALUES ('ABC-001', 'SECOND-2015-10-14', '2015-10-14 10:50:00', 10000121);

SELECT * FROM tweet_timeline
WHERE sensor_id = 'ABC-001' AND bucket_id = 'SECOND-2015-10-14'
AND time_id <= '2015-10-14 12:00:00';

 sensor_id | bucket_id         | time_id                  | tweet_id
-----------+-------------------+--------------------------+----------
 ABC-001   | SECOND-2015-10-14 | 2015-10-14 11:53:00+0000 | 10020334
 ABC-001   | SECOND-2015-10-14 | 2015-10-14 10:52:00+0000 | 10000334
 ABC-001   | SECOND-2015-10-14 | 2015-10-14 10:51:00+0000 | 10000127
 ABC-001   | SECOND-2015-10-14 | 2015-10-14 10:50:00+0000 | 10000121

Sorted by time_id
Show Timeseries: Provide list of counts
22
CREATE TABLE tweet_count (
  sensor_id text,
  bucket_id text,
  key text,
  time_id timestamp,
  count counter,
  PRIMARY KEY ((sensor_id, bucket_id), key, time_id)
) WITH CLUSTERING ORDER BY (key ASC, time_id DESC);

Wide Row Table

bucket_id creates buckets for columns:
• SECOND-2015-10-14
• HOUR-2015-10
• DAY-2015-10

ABC-001:HOUR-2015-10 => ALL:10:00: count=1’550 | ALL:09:00: count=2’299 | nosql:08:00: count=25
ABC-001:DAY-2015-10  => ALL:14-OCT: count=105’999 | ALL:13-OCT: count=120’344 | nosql:14-OCT: count=2’532

30d * 24h * n keys = n * 720 columns

(sensor_id, bucket_id) = Partition Key, (key, time_id) = Clustering Key
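The rollup logic behind this table can be sketched in a few lines. This is a simplified in-memory stand-in (not from the slides) for what Cassandra's counter columns do: each incoming tweet increments one counter for 'ALL' plus one per hashtag key, per time slot:

```python
# Illustrative sketch: pre-aggregated counters per (sensor, bucket, key, slot),
# mirroring the counter updates against the tweet_count table.
from collections import Counter
from datetime import datetime

counts = Counter()

def record_tweet(sensor_id, ts, hashtags):
    """Increment the hourly rollup for 'ALL' plus one rollup per hashtag."""
    bucket = f"HOUR-{ts:%Y-%m}"
    slot = ts.strftime("%Y-%m-%d %H:00")
    for key in ["ALL"] + hashtags:
        counts[(sensor_id, bucket, key, slot)] += 1

ts = datetime(2015, 10, 14, 10, 23)
record_tweet("ABC-001", ts, ["cassandra", "nosql"])
record_tweet("ABC-001", ts, ["nosql"])
print(counts[("ABC-001", "HOUR-2015-10", "ALL", "2015-10-14 10:00")])    # 2
print(counts[("ABC-001", "HOUR-2015-10", "nosql", "2015-10-14 10:00")])  # 2
```

Queries for a chart then read back a contiguous slice of one partition, already aggregated, instead of scanning raw tweets.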
Show Timeseries: Provide list of counts
23
UPDATE tweet_count SET count = count + 1
WHERE sensor_id = 'ABC-001' AND bucket_id = 'HOUR-2015-10'
AND key = 'ALL' AND time_id = '2015-10-14 10:00:00';

SELECT * FROM tweet_count
WHERE sensor_id = 'ABC-001' AND bucket_id = 'HOUR-2015-10'
AND key = 'ALL' AND time_id >= '2015-10-14 08:00:00';

 sensor_id | bucket_id    | key | time_id                  | count
-----------+--------------+-----+--------------------------+--------
 ABC-001   | HOUR-2015-10 | ALL | 2015-10-14 12:00:00+0000 | 100230
 ABC-001   | HOUR-2015-10 | ALL | 2015-10-14 11:00:00+0000 | 102230
 ABC-001   | HOUR-2015-10 | ALL | 2015-10-14 10:00:00+0000 | 105430
 ABC-001   | HOUR-2015-10 | ALL | 2015-10-14 09:00:00+0000 | 203240
 ABC-001   | HOUR-2015-10 | ALL | 2015-10-14 08:00:00+0000 | 132230
Processing Pipeline
Kafka provides reliable and efficient queuing

Storm processes (rollups, counts)

Cassandra stores at the same speed
Queuing => Processing => Storing

[Diagram: Twitter Sensors 1–3 feed Kafka; Storm processes the stream; Cassandra stores the results for the visualization applications]

24
Processing Pipeline – Stream-Processing with Apache Storm
25
Pre-processes the data before storing it in different Cassandra tables

Implemented in Java

Uses the DataStax Java driver for writing to Cassandra (similar to JDBC)
[Diagram: Storm topology – a KafkaSpout reads tweets from Kafka (example: "Who will win: Barca, Real, Juve or Bayern? … bit.ly/1yRsPmE #fcb #barca"), SentenceSplitter bolts emit the individual hashtags (fcb, barca, real, juve, bayern), and WordCounter bolts aggregate the counts (barca=2, bayern=1, real=1, juve=1, fcb=1) and issue the corresponding counter increments (INCR) against Cassandra]
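The splitter/counter topology above can be sketched as plain functions. In the real pipeline these stages are Storm bolts fed by a KafkaSpout; this simplified stand-in (not from the slides) only shows the data flow:

```python
# Illustrative sketch of the topology's two stages: split tweets into hashtags,
# then aggregate counts (the step that drives the INCR counter updates).
from collections import Counter

def split_hashtags(tweets):
    """SentenceSplitter stage: emit one hashtag per occurrence."""
    for tweet in tweets:
        for word in tweet.split():
            if word.startswith("#"):
                yield word.lstrip("#").lower()

def count_words(words):
    """WordCounter stage: aggregate the emitted hashtags."""
    return Counter(words)

tweets = ["Who will win? #fcb #barca", "Great game #barca"]
print(count_words(split_hashtags(tweets)))  # barca counted twice
```

In Storm, partitioning by hashtag (fields grouping) ensures all occurrences of one tag reach the same WordCounter instance, just as the same partition key routes all increments to the same Cassandra partition.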
Cassandra for Graph-Data
26
Using Cassandra for Social Dimension
27
Introduction to the Graph Model – Property Graph
Node / Vertex
• Represents entities
• Can contain properties (key-value pairs)

Relationship / Edge
• Lines between nodes
• May be directed or undirected

Properties
• Values about a node or relationship
• Allow adding semantics to relationships
[Diagram: example property graph – two users (id: 16134540, name: cloudera, location: Palo Alto; id: 18898576, name: gschmutz, location: Berne) connected by a follow edge; one user is the author of a Tweet (text: Join Big Data .., time: June 11 2015) which the other user retweets (time: June 11 2015); properties are key: value pairs on nodes and edges]
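The property-graph model can be captured in a few lines. A minimal sketch (not from the slides; class and field names are illustrative) with properties on both vertices and edges:

```python
# Minimal property-graph model: labeled vertices and directed, labeled edges,
# both carrying key-value properties.
class Vertex:
    def __init__(self, label, **props):
        self.label, self.props = label, props
        self.out_edges = []

class Edge:
    def __init__(self, label, source, target, **props):
        self.label, self.source, self.target, self.props = label, source, target, props
        source.out_edges.append(self)  # register on the source vertex

user1 = Vertex("User", name="cloudera", location="Palo Alto")
user2 = Vertex("User", name="gschmutz", location="Berne")
Edge("follow", user1, user2, time="June 11 2015")  # edge with a property

# Traverse: who does user1 follow?
followed = [e.target.props["name"] for e in user1.out_edges if e.label == "follow"]
print(followed)  # ['gschmutz']
```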
28
Titan:DB – Graph Database
Optimized to work against billions of nodes and edges
• Theoretical limit of 2^60 edges and half as many vertices
Works with several different distributed databases• Apache Cassandra, Apache HBase, Oracle BerkeleyDB and Amazon DynamoDB
Supports many concurrent users doing complex graph traversals simultaneously
Native integration with TinkerPop stack
Created by Thinkaurelius (http://thinkaurelius.com/), now part of DataStax
29
Titan:DB Architecture
30
Titan:DB – Schema and Data Modeling
A Titan graph has a schema comprised of the edge labels, property keys, and vertex labels

The schema can be either explicitly or implicitly defined

The schema can evolve over time without interrupting normal operations
mgmt = graph.openManagement()
person = mgmt.makeVertexLabel('person').make()
birthDate = mgmt.makePropertyKey('birthDate')
               .dataType(Long.class)
               .cardinality(Cardinality.SINGLE).make()
name = mgmt.makePropertyKey('name')
           .dataType(String.class)
           .cardinality(Cardinality.SET).make()
mgmt.commit()
SOCMINT Data Model
[Diagram: graph data model with vertex labels User, Post, Term and Place]

Vertices:
• User: #userId => userId (as String), name => screenName, language => lang, profileImageUrlHttps, location => location, time => createdAt, pageRank, lastUpdateTime
• Post: #postId => tweetId (as String), time => createdAt, targetIds => targetIds, language => lang, coordinate => latitude + longitude, lastUpdateTime
• Term: #value => hashtag or url value, type => “hashtag” or “url”, lastUpdateTime
• Place: #placeId => id (as String), street => street, name => fullName, country => country, type => placeType, url => placeUrl, lastUpdateTime

Edges: author (time, targetId), follow, retweet (time), retweetOf (time, targetId), reply (time), replyTo (time), mention (time), mentionOf (time), useHashtag, useUrl, placed (time)
32
TinkerPop 3 Stack
• TinkerPop is a framework composed of various interoperable components
• Vendor-independent (similar to JDBC for RDBMS)
• Core API defines Graph, Vertex, Edge, …
• Gremlin traversal language is a vendor-independent way to query (traverse) a graph
• Gremlin Server can be leveraged to allow over-the-wire communication with a TinkerPop-enabled graph system
http://tinkerpop.incubator.apache.org/
33
Gremlin – a graph query language
Imperative graph traversal language
• Sequence of “steps” of the computation
• Must understand the structure of the graph

[Diagram: follow graph over the users peter, paul, roger, ken, eva, bob and marc]

g.V(1).out('follow').out('follow').count()

or

g.V(1).repeat(out('follow')).times(2).count()
34
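What the two equivalent Gremlin queries compute – the number of users reachable in exactly two 'follow' hops – can be sketched over a plain adjacency list (a simplified stand-in, not from the slides; the edges chosen among the users are hypothetical):

```python
# Illustrative sketch of a two-hop 'follow' traversal over an adjacency list.
follows = {
    "peter": ["paul", "roger"],
    "paul": ["ken", "eva"],
    "roger": ["bob"],
}

def out(names, graph):
    """One traversal step: follow every outgoing edge. Duplicates are kept,
    mirroring Gremlin's bag semantics for out()."""
    return [target for name in names for target in graph.get(name, [])]

# g.V(peter).out('follow').out('follow').count()
two_hops = out(out(["peter"], follows), follows)
print(len(two_hops))  # 3: ken, eva and bob
```

repeat(out('follow')).times(2) expresses the same traversal declaratively: apply the out step twice before counting.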
Summary
35
Summary
36
Cassandra is an always-on database
Ability to collect and analyze massive volumes of data in sequence at extremely high velocity
Forget (some of) your existing database modeling skills
Cassandra is an excellent fit for time series data
Cassandra is no longer “just a” column family database => Multi-Model Database
• DSE Search
• JSON support
• DSE Graph
• DSE Timeseries
• Spark support
Summary - Know your domain
[Chart: data store types positioned by connectedness of data, from low to high: Key-Value Stores, Wide-Column Stores, Document Data Stores, Relational Databases, Graph Databases]
38