Apache Cassandra for Timeseries- and Graph-Data
TRANSCRIPT
Apache Cassandra for Timeseries- and Graph-Data
Guido Schmutz
Guido Schmutz
• Working for Trivadis for more than 18 years
• Oracle ACE Director for Fusion Middleware and SOA
• Co-author of several books
• Consultant, Trainer and Software Architect for Java, Oracle, SOA and Big Data / Fast Data
• Member of Trivadis Architecture Board
• Technology Manager @ Trivadis
• More than 25 years of software development experience
• Contact: [email protected]
• Blog: http://guidoschmutz.wordpress.com
• Twitter: gschmutz
2
Agenda
1. Customer Use Case and Architecture
2. Cassandra Data Modeling
3. Cassandra for Timeseries Data
4. Cassandra for Graph Data
5. Summary
3
Customer Use Case and Architecture
4
Research Project @ Armasuisse W&T
W&T flagship project, standing for innovation & tech transfer

Building capabilities in the areas of:
• Social Media Intelligence (SOCMINT)
• Big Data Technologies & Architectures

Invest into new, innovative and not yet widely proven technologies:
• Batch analysis
• Real-time analysis
• NoSQL databases
• Text analysis (NLP)
• …
3 Phases: June 2013 – June 2015
5
SOCMINT Demonstrator – Time Dimension
Major data model: Time series (TS)
TS reflect user behaviors over time
Activities correlate with events
Anomaly detection
Event detection & prediction
6
SOCMINT Demonstrator – Social Dimension
User-user networks (social graphs)
Twitter: follower, retweet and mention graphs
Who is central in a social network?
Who has retweeted a given tweet to whom?
7
SOCMINT Demonstrator - “Lambda Architecture” for Big Data
[Diagram: Lambda Architecture – data sources (Social, RDBMS, Sensor, ERP, Logfiles, Mobile, Machine) feed through a messaging channel (Data Collection) into two layers: (Analytical) Batch Data Processing, where batch compute jobs run over a raw data reservoir into a batch result store, and (Analytical) Real-Time Data Processing, where stream/event processing writes to a real-time result store. A query engine over both result stores (computed information) serves the Data Access layer: reports, services, analytic tools and alerting tools. Legend distinguishes data in motion from data at rest.]
SOCMINT Demonstrator – Frameworks & Components in Use
[Diagram: the same Lambda Architecture, annotated with the frameworks and components used in each layer]
SOCMINT Demonstrator – Cassandra Cluster
6 node cluster based on Datastax Enterprise Edition (DSE)
Installed in a virtualized environment, but we control the placement on disk

We only keep 3 days of data
• use the TTL feature of Cassandra to automatically erase old data
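The retention policy above can be expressed directly in CQL. A minimal sketch (not from the slides; table and column names are illustrative) of building an INSERT statement whose rows expire automatically after 3 days:

```python
# Illustrative sketch: build a CQL INSERT that uses Cassandra's TTL feature
# so that rows expire automatically after 3 days.
THREE_DAYS_SECONDS = 3 * 24 * 60 * 60  # 259200

def insert_with_ttl(table, columns, ttl_seconds=THREE_DAYS_SECONDS):
    """Build a parameterized CQL INSERT statement with a TTL clause."""
    placeholders = ", ".join("?" for _ in columns)
    return (f"INSERT INTO {table} ({', '.join(columns)}) "
            f"VALUES ({placeholders}) USING TTL {ttl_seconds}")

stmt = insert_with_ttl("tweet", ["tweet_id", "username", "message"])
print(stmt)
# INSERT INTO tweet (tweet_id, username, message) VALUES (?, ?, ?) USING TTL 259200
```

With a per-row TTL there is no separate cleanup job: Cassandra tombstones expired cells during compaction.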
Cassandra supports both Timeseries and Connected-Data (Graph)
[Diagram: 6-node Cassandra ring: Node 1 … Node 6]
11
Cassandra Data Modeling
12
Cassandra Data Modeling
13
• Don’t think relational
• Denormalize, Denormalize, Denormalize …
• Rows are gigantic and sorted => one row is stored on one node
• Know your application/use cases => from query to model
• Index is not an afterthought anymore => “index” upfront
• Control the physical storage structure
Static Column Family – “Skinny Row”
14
CREATE TABLE skinny (
  rowkey text,
  c1 text,
  c2 text,
  c3 text,
  PRIMARY KEY (rowkey)
);

Grows up to billions of rows

rowkey-1: c1=value-c1, c2=value-c2, c3=value-c3
rowkey-2: c1=value-c1, c3=value-c3
rowkey-3: c1=value-c1, c2=value-c2, c3=value-c3

rowkey = Partition Key
Dynamic Column Family – “Wide Row”
15
CREATE TABLE wide (
  rowkey text,
  ckey text,
  c1 text,
  c2 text,
  PRIMARY KEY (rowkey, ckey)
) WITH CLUSTERING ORDER BY (ckey ASC);

Billions of rows, 1 … 2 billion columns per row

rowkey-1: ckey-1:c1, ckey-1:c2 | ckey-2:c1, ckey-2:c2 | ckey-3:c1, ckey-3:c2
rowkey-2: ckey-1:c1, ckey-1:c2 | ckey-2:c1, ckey-2:c2
rowkey-3: ckey-1:c1, ckey-1:c2 | ckey-2:c1, ckey-2:c2 | ckey-3:c1, ckey-3:c2

rowkey = Partition Key, ckey = Clustering Key
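The wide-row layout can be sketched as a dictionary of sorted maps. This is a simplified model (not from the slides): the partition key selects one node-local partition, and within it cells are kept ordered by clustering key, which is what makes range scans cheap:

```python
# Simplified model of a wide-row table: partition key -> sorted clustering keys.
from collections import defaultdict

class WideRowTable:
    def __init__(self):
        # partition key -> {clustering key -> {column: value}}
        self.partitions = defaultdict(dict)

    def insert(self, rowkey, ckey, **columns):
        self.partitions[rowkey][ckey] = columns

    def scan(self, rowkey, reverse=False):
        """Return the rows of one partition ordered by clustering key,
        regardless of insertion order (mirrors CLUSTERING ORDER BY)."""
        return sorted(self.partitions[rowkey].items(), reverse=reverse)

t = WideRowTable()
t.insert("rowkey-1", "ckey-2", c1="value-c1", c2="value-c2")
t.insert("rowkey-1", "ckey-1", c1="value-c1", c2="value-c2")
print(t.scan("rowkey-1"))  # ckey-1 first: clustering order, not insert order
```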
Cassandra for Timeseries Data
16
Know your application => From query to model
17
Show Timeline of Tweets
Show Timeseries on different levels of aggregation (resolution):
• Seconds
• Minutes
• Hours
Show Timeline: Provide Raw Data (Tweets)
18
CREATE TABLE tweet (
  tweet_id bigint,
  username text,
  message text,
  hashtags list<text>,
  latitude double,
  longitude double,
  …
  PRIMARY KEY (tweet_id)
);
• Skinny Row Table
• Holds the sensor raw data => Tweets
• Similar to a relational table
• Primary Key is the partition key

10000121: username=gschmutz, message=Getting ready for .., hashtags=[cassandra, nosql], latitude=0, longitude=0
20121223: username=DataStax, message=The Speed Factor .., hashtags=[BigData], latitude=0, longitude=0

tweet_id = Partition Key
Show Timeline: Provide Raw Data (Tweets)
19
INSERT INTO tweet (tweet_id, username, message, hashtags, latitude, longitude)
VALUES (10000121, 'gschmutz', 'Getting ready for my talk about using Cassandra for Timeseries and Graph Data', ['cassandra', 'nosql'], 0, 0);

SELECT tweet_id, username, hashtags, message FROM tweet WHERE tweet_id = 10000121;

 tweet_id | username | hashtags               | message
----------+----------+------------------------+----------------------------
 10000121 | gschmutz | ['cassandra', 'nosql'] | Getting ready for ...
Show Timeline: Provide Sequence of Events
20
CREATE TABLE tweet_timeline (
  sensor_id text,
  bucket_id text,
  time_id timestamp,
  tweet_id bigint,
  PRIMARY KEY ((sensor_id, bucket_id), time_id)
) WITH CLUSTERING ORDER BY (time_id DESC);

Wide Row Table

bucket_id creates buckets for columns, e.g. SECOND-2015-10-14

ABC-001:SECOND-2015-10-14 => 10:00:02: tweet_id=10000121
DEF-931:SECOND-2015-10-14 => 10:09:02: tweet_id=1003121343 | 09:12:09: tweet_id=1002111343 | 09:10:02: tweet_id=1001121343

(sensor_id, bucket_id) = Partition Key, time_id = Clustering Key
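The bucket_id that forms part of the partition key can be derived from the event timestamp. A sketch of this bucketing idea (the exact mapping per resolution is an assumption, based on the bucket names shown on the slides):

```python
# Illustrative sketch: map an event timestamp to a time bucket, so a partition
# holds a bounded time span and cannot grow without limit.
from datetime import datetime

def bucket_id(resolution, ts):
    """Derive a bucket_id such as 'SECOND-2015-10-14' or 'HOUR-2015-10'."""
    if resolution == "SECOND":   # per-second values, bucketed by day
        return f"SECOND-{ts:%Y-%m-%d}"
    if resolution == "HOUR":     # per-hour values, bucketed by month
        return f"HOUR-{ts:%Y-%m}"
    if resolution == "DAY":      # per-day values, bucketed by month
        return f"DAY-{ts:%Y-%m}"
    raise ValueError(f"unknown resolution: {resolution}")

ts = datetime(2015, 10, 14, 10, 50)
print(bucket_id("SECOND", ts))  # SECOND-2015-10-14
print(bucket_id("HOUR", ts))    # HOUR-2015-10
```

Because (sensor_id, bucket_id) together form the partition key, reading one day of one sensor touches exactly one partition.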
Show Timeline: Provide Sequence of Events
21
INSERT INTO tweet_timeline (sensor_id, bucket_id, time_id, tweet_id)
VALUES ('ABC-001', 'SECOND-2015-10-14', '2015-10-14 10:50:00', 10000121);

SELECT * FROM tweet_timeline
WHERE sensor_id = 'ABC-001' AND bucket_id = 'SECOND-2015-10-14'
AND time_id <= '2015-10-14 12:00:00';

 sensor_id | bucket_id         | time_id                  | tweet_id
-----------+-------------------+--------------------------+----------
 ABC-001   | SECOND-2015-10-14 | 2015-10-14 11:53:00+0000 | 10020334
 ABC-001   | SECOND-2015-10-14 | 2015-10-14 10:52:00+0000 | 10000334
 ABC-001   | SECOND-2015-10-14 | 2015-10-14 10:51:00+0000 | 10000127
 ABC-001   | SECOND-2015-10-14 | 2015-10-14 10:50:00+0000 | 10000121

Sorted by time_id
Show Timeseries: Provide list of counts
22
CREATE TABLE tweet_count (
  sensor_id text,
  bucket_id text,
  key text,
  time_id timestamp,
  count counter,
  PRIMARY KEY ((sensor_id, bucket_id), key, time_id)
) WITH CLUSTERING ORDER BY (key ASC, time_id DESC);

Wide Row Table

bucket_id creates buckets for columns:
• SECOND-2015-10-14
• HOUR-2015-10
• DAY-2015-10

ABC-001:HOUR-2015-10 => ALL:10:00: count=1’550 | ALL:09:00: count=2’299 | nosql:08:00: count=25
ABC-001:DAY-2015-10  => ALL:14-OCT: count=105’999 | ALL:13-OCT: count=120’344 | nosql:14-OCT: count=2’532

30d * 24h * n keys = n * 720 columns

(sensor_id, bucket_id) = Partition Key, (key, time_id) = Clustering Key
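The rollup logic behind this table can be sketched in a few lines. This is a simplified in-memory stand-in (not from the slides) for what Cassandra's counter columns do: each incoming tweet increments one counter for 'ALL' plus one per hashtag key, per time slot:

```python
# Illustrative sketch: pre-aggregated counters per (sensor, bucket, key, slot),
# mirroring the counter updates against the tweet_count table.
from collections import Counter
from datetime import datetime

counts = Counter()

def record_tweet(sensor_id, ts, hashtags):
    """Increment the hourly rollup for 'ALL' plus one rollup per hashtag."""
    bucket = f"HOUR-{ts:%Y-%m}"
    slot = ts.strftime("%Y-%m-%d %H:00")
    for key in ["ALL"] + hashtags:
        counts[(sensor_id, bucket, key, slot)] += 1

ts = datetime(2015, 10, 14, 10, 23)
record_tweet("ABC-001", ts, ["cassandra", "nosql"])
record_tweet("ABC-001", ts, ["nosql"])
print(counts[("ABC-001", "HOUR-2015-10", "ALL", "2015-10-14 10:00")])    # 2
print(counts[("ABC-001", "HOUR-2015-10", "nosql", "2015-10-14 10:00")])  # 2
```

Queries for a chart then read back a contiguous slice of one partition, already aggregated, instead of scanning raw tweets.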
Show Timeseries: Provide list of counts
23
UPDATE tweet_count SET count = count + 1
WHERE sensor_id = 'ABC-001' AND bucket_id = 'HOUR-2015-10'
AND key = 'ALL' AND time_id = '2015-10-14 10:00:00';

SELECT * FROM tweet_count
WHERE sensor_id = 'ABC-001' AND bucket_id = 'HOUR-2015-10'
AND key = 'ALL' AND time_id >= '2015-10-14 08:00:00';

 sensor_id | bucket_id    | key | time_id                  | count
-----------+--------------+-----+--------------------------+--------
 ABC-001   | HOUR-2015-10 | ALL | 2015-10-14 12:00:00+0000 | 100230
 ABC-001   | HOUR-2015-10 | ALL | 2015-10-14 11:00:00+0000 | 102230
 ABC-001   | HOUR-2015-10 | ALL | 2015-10-14 10:00:00+0000 | 105430
 ABC-001   | HOUR-2015-10 | ALL | 2015-10-14 09:00:00+0000 | 203240
 ABC-001   | HOUR-2015-10 | ALL | 2015-10-14 08:00:00+0000 | 132230
Processing Pipeline
Kafka provides reliable and efficient queuing

Storm processes (rollups, counts)

Cassandra stores at the same speed
Queuing => Processing => Storing

[Diagram: Twitter Sensors 1–3 feed Kafka; Storm processes the stream; Cassandra stores the results for the visualization applications]

24
Processing Pipeline – Stream-Processing with Apache Storm
25
Pre-processes the data before storing it in different Cassandra tables

Implemented in Java

Uses the DataStax Java driver for writing to Cassandra (similar to JDBC)
[Diagram: Storm topology – a KafkaSpout reads tweets from Kafka (example: "Who will win: Barca, Real, Juve or Bayern? … bit.ly/1yRsPmE #fcb #barca"), SentenceSplitter bolts emit the individual hashtags (fcb, barca, real, juve, bayern), and WordCounter bolts aggregate the counts (barca=2, bayern=1, real=1, juve=1, fcb=1) and issue the corresponding counter increments (INCR) against Cassandra]
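The splitter/counter topology above can be sketched as plain functions. In the real pipeline these stages are Storm bolts fed by a KafkaSpout; this simplified stand-in (not from the slides) only shows the data flow:

```python
# Illustrative sketch of the topology's two stages: split tweets into hashtags,
# then aggregate counts (the step that drives the INCR counter updates).
from collections import Counter

def split_hashtags(tweets):
    """SentenceSplitter stage: emit one hashtag per occurrence."""
    for tweet in tweets:
        for word in tweet.split():
            if word.startswith("#"):
                yield word.lstrip("#").lower()

def count_words(words):
    """WordCounter stage: aggregate the emitted hashtags."""
    return Counter(words)

tweets = ["Who will win? #fcb #barca", "Great game #barca"]
print(count_words(split_hashtags(tweets)))  # barca counted twice
```

In Storm, partitioning by hashtag (fields grouping) ensures all occurrences of one tag reach the same WordCounter instance, just as the same partition key routes all increments to the same Cassandra partition.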
Cassandra for Graph-Data
26
Using Cassandra for Social Dimension
27
Introduction to the Graph Model – Property Graph
Node / Vertex
• Represents entities
• Can contain properties (key-value pairs)

Relationship / Edge
• Lines between nodes
• May be directed or undirected

Properties
• Values about a node or relationship
• Allow adding semantics to relationships
[Diagram: example property graph – two users (id: 16134540, name: cloudera, location: Palo Alto; id: 18898576, name: gschmutz, location: Berne) connected by a follow edge; one user is the author of a Tweet (text: Join Big Data .., time: June 11 2015) which the other user retweets (time: June 11 2015); properties are key: value pairs on nodes and edges]
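The property-graph model can be captured in a few lines. A minimal sketch (not from the slides; class and field names are illustrative) with properties on both vertices and edges:

```python
# Minimal property-graph model: labeled vertices and directed, labeled edges,
# both carrying key-value properties.
class Vertex:
    def __init__(self, label, **props):
        self.label, self.props = label, props
        self.out_edges = []

class Edge:
    def __init__(self, label, source, target, **props):
        self.label, self.source, self.target, self.props = label, source, target, props
        source.out_edges.append(self)  # register on the source vertex

user1 = Vertex("User", name="cloudera", location="Palo Alto")
user2 = Vertex("User", name="gschmutz", location="Berne")
Edge("follow", user1, user2, time="June 11 2015")  # edge with a property

# Traverse: who does user1 follow?
followed = [e.target.props["name"] for e in user1.out_edges if e.label == "follow"]
print(followed)  # ['gschmutz']
```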
28
Titan:DB – Graph Database
Optimized to work against billions of nodes and edges
• Theoretical limit of 2^60 edges and half as many vertices
Works with several different distributed databases• Apache Cassandra, Apache HBase, Oracle BerkeleyDB and Amazon DynamoDB
Supports many concurrent users doing complex graph traversals simultaneously
Native integration with TinkerPop stack
Created by Thinkaurelius (http://thinkaurelius.com/), now part of DataStax
29
Titan:DB Architecture
30
Titan:DB – Schema and Data Modeling
A Titan graph has a schema comprised of the edge labels, property keys, and vertex labels

The schema can be either explicitly or implicitly defined

The schema can evolve over time without interrupting normal operations
mgmt = graph.openManagement()
person = mgmt.makeVertexLabel('person').make()
birthDate = mgmt.makePropertyKey('birthDate')
               .dataType(Long.class)
               .cardinality(Cardinality.SINGLE).make()
name = mgmt.makePropertyKey('name')
           .dataType(String.class)
           .cardinality(Cardinality.SET).make()
mgmt.commit()
SOCMINT Data Model
[Diagram: graph data model with vertex labels User, Post, Term and Place]

Vertices:
• User: #userId => userId (as String), name => screenName, language => lang, profileImageUrlHttps, location => location, time => createdAt, pageRank, lastUpdateTime
• Post: #postId => tweetId (as String), time => createdAt, targetIds => targetIds, language => lang, coordinate => latitude + longitude, lastUpdateTime
• Term: #value => hashtag or url value, type => “hashtag” or “url”, lastUpdateTime
• Place: #placeId => id (as String), street => street, name => fullName, country => country, type => placeType, url => placeUrl, lastUpdateTime

Edges: author (time, targetId), follow, retweet (time), retweetOf (time, targetId), reply (time), replyTo (time), mention (time), mentionOf (time), useHashtag, useUrl, placed (time)
32
TinkerPop 3 Stack
• TinkerPop is a framework composed of various interoperable components
• Vendor-independent (similar to JDBC for RDBMS)
• Core API defines Graph, Vertex, Edge, …
• Gremlin traversal language is a vendor-independent way to query (traverse) a graph
• Gremlin Server can be leveraged to allow over-the-wire communication with a TinkerPop-enabled graph system
http://tinkerpop.incubator.apache.org/
33
Gremlin – a graph query language
Imperative graph traversal language
• Sequence of “steps” of the computation
• Must understand the structure of the graph

[Diagram: follow graph over the users peter, paul, roger, ken, eva, bob and marc]

g.V(1).out('follow').out('follow').count()

or

g.V(1).repeat(out('follow')).times(2).count()
34
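What the two equivalent Gremlin queries compute – the number of users reachable in exactly two 'follow' hops – can be sketched over a plain adjacency list (a simplified stand-in, not from the slides; the edges chosen among the users are hypothetical):

```python
# Illustrative sketch of a two-hop 'follow' traversal over an adjacency list.
follows = {
    "peter": ["paul", "roger"],
    "paul": ["ken", "eva"],
    "roger": ["bob"],
}

def out(names, graph):
    """One traversal step: follow every outgoing edge. Duplicates are kept,
    mirroring Gremlin's bag semantics for out()."""
    return [target for name in names for target in graph.get(name, [])]

# g.V(peter).out('follow').out('follow').count()
two_hops = out(out(["peter"], follows), follows)
print(len(two_hops))  # 3: ken, eva and bob
```

repeat(out('follow')).times(2) expresses the same traversal declaratively: apply the out step twice before counting.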
Summary
35
Summary
36
Cassandra is an always-on database
Ability to collect and analyze massive volumes of data in sequence at extremely high velocity
Forget (some of) your existing database modeling skills
Cassandra is an excellent fit for time series data
Cassandra is no longer “just a” column family database => Multi-Model Database
• DSE Search
• JSON support
• DSE Graph
• DSE Timeseries
• Spark support
Summary - Know your domain
[Chart: data store types positioned by connectedness of data, from low to high: Key-Value Stores, Wide-Column Stores, Document Data Stores, Relational Databases, Graph Databases]
38