all your base
Slides from Tim Moreton's talk on "Apache Cassandra and Why BASE is great for real-time analytics" from All Your Base. Nov 23, 2012.
Apache Cassandra and why BASE is great for real-time analytics
Tim Moreton
• Cassandra -- What makes it different?
• Who’s using it, and for what?
• DIY Real Time Analytics on Cassandra
• The Easy Option -- Acunu Analytics
BigTable data model, Dynamo distribution
Open sourced, 2008
Incubator, 2009
Top-Level, 2010
• Multi-master architecture: no SPOF
• Tunable consistency, multi-DC aware
• High performance, optimised for writes
• Atomic counters
Data model
user345: {chess: { lives: 2, score: 33 ...} ...
}
user345 [chess, lives]: 2
        [chess, score]: 44
user292 [go, lives]: 4
        [monop, avatar]: top_hat
        [monop, score]: 33
user188 [monop, score]: 13
• Row key: Rows arranged randomly around cluster. Load balanced, but no ordering. Put stuff to access sequentially within a row.
• Column key: Compound columns allow you to create multiple ordered ‘dictionaries’ in a row.
• Flexible schemas: “Columns” are really just cell identifiers. Rows can be VERY wide.
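The layout above can be mimicked in a few lines of Python. This is only a sketch of the idea, not Cassandra's API: the `Row` class and its `slice` method are hypothetical names, a plain dict stands in for the unordered row keys, and a sorted list stands in for the ordering Cassandra keeps between columns within a row.

```python
import bisect

class Row:
    """One wide row: compound column keys kept in sorted order (hypothetical helper)."""

    def __init__(self):
        self._keys = []    # sorted list of compound column keys (tuples)
        self._cells = {}   # column key -> value

    def put(self, column, value):
        if column not in self._cells:
            bisect.insort(self._keys, column)
        self._cells[column] = value

    def slice(self, prefix):
        """Return every cell whose compound key starts with `prefix`, in column order."""
        start = bisect.bisect_left(self._keys, prefix)
        out = []
        for key in self._keys[start:]:
            if key[:len(prefix)] != prefix:
                break
            out.append((key, self._cells[key]))
        return out

# Rows are found by key only; there is no ordering between rows.
table = {}
cells = [
    ("user345", ("chess", "lives"), 2),
    ("user345", ("chess", "score"), 44),
    ("user292", ("go", "lives"), 4),
    ("user292", ("monop", "avatar"), "top_hat"),
    ("user292", ("monop", "score"), 33),
    ("user188", ("monop", "score"), 13),
]
for row_key, column, value in cells:
    table.setdefault(row_key, Row()).put(column, value)

# One ordered 'dictionary' inside the row: everything under "monop" for user292.
monop = table["user292"].slice(("monop",))
```

Reading a whole ‘dictionary’ is a contiguous scan inside one row, which is why data that is read together is best placed in the same row.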
Tunable consistency: per operation
Write (#replicas): ONE, QUORUM, ALL
Read (#replicas): ONE, QUORUM, ALL
• Risk of replica failing, multiple values
• More likely to return out-of-date data
• Never going to say “ok” if a replica is down
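One way to reason about this trade-off: with N replicas, a read is guaranteed to touch at least one replica holding the latest acknowledged write when the number of replicas read (R) plus the number written (W) exceeds N. The sketch below encodes only that rule. The function names are illustrative and not part of any Cassandra client, and it ignores background repair mechanisms.

```python
def replicas_required(level, n):
    """Replicas that must respond for a consistency level, given n replicas."""
    if level == "ONE":
        return 1
    if level == "QUORUM":
        return n // 2 + 1
    if level == "ALL":
        return n
    raise ValueError(f"unknown consistency level: {level}")

def read_sees_latest_write(write_level, read_level, n):
    """True when the read set and the write set must overlap (R + W > N)."""
    w = replicas_required(write_level, n)
    r = replicas_required(read_level, n)
    return r + w > n

def tolerated_failures(level, n):
    """How many replicas can be down while the operation still succeeds."""
    return n - replicas_required(level, n)

# With 3 replicas: QUORUM/QUORUM overlaps, ONE/ONE does not.
quorum_ok = read_sees_latest_write("QUORUM", "QUORUM", 3)
one_ok = read_sees_latest_write("ONE", "ONE", 3)
```

ALL tolerates no failed replicas, ONE tolerates the most but may return stale data, and QUORUM sits between the two.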
Multi data center aware
DC 1: r1, r2
DC 2: r1, r2
Session Stores
• Read dominated
• Updates to existing items
• Probably fits in RAM
• Distribute for availability
• Want: Atomicity
Real Time Analytics
• Write dominated
• Updates very rare
• Read “results” mostly
• Distribute for availability, performance, capacity
• Want: Rich queries
Source: Twitter
An analytics app on Cassandra
eg: “show me the number of mentions of ‘Acunu’ per day, between May and November 2011, on Twitter”
Batch (Hadoop) approach would require processing ~30 billion tweets, or ~4.2 TB of data
http://blog.twitter.com/2011/03/numbers.html
Cassandra approach: For each tweet, increment a bunch of counters, such that answering a query is as easy as reading some counters
12:32:15 I like #trafficlights
12:33:43 Nobody expects...
12:33:49 I ate a #bee; woe is...
12:34:04 Man, @acunu rocks!
[1234, man] +1
[1234, acunu] +1
[1234, rock] +1
Key                 00:01  00:02  ...
[01/05/11, acunu]   3      5      ...
[02/05/11, acunu]   12     4      ...
...                 ...    ...
Row key is ‘big’ time bucket
Column key is ‘small’ time bucket
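The counter scheme above might be sketched as follows. This is an in-memory illustration under stated assumptions, not a Cassandra client: a nested dict stands in for a counter column family, the tokenizer is a plain regex with no stemming (so "rocks" is not folded into "rock" as on the slide), and `ingest` and `mentions_per_day` are hypothetical names.

```python
import re
from collections import defaultdict
from datetime import datetime

# counters[row_key][column_key] -> count
counters = defaultdict(lambda: defaultdict(int))

def ingest(timestamp, text):
    """Increment one counter per distinct term in the tweet."""
    day = timestamp.strftime("%d/%m/%y")    # 'big' time bucket, part of the row key
    minute = timestamp.strftime("%H:%M")    # 'small' time bucket, the column key
    for term in set(re.findall(r"[a-z0-9]+", text.lower())):
        counters[(day, term)][minute] += 1

def mentions_per_day(term, days):
    """Answer the query by reading counters only: one row per day."""
    return {day: sum(counters.get((day, term), {}).values()) for day in days}

ingest(datetime(2011, 5, 1, 12, 34, 4), "Man, @acunu rocks!")
ingest(datetime(2011, 5, 1, 12, 34, 50), "acunu acunu acunu")
ingest(datetime(2011, 5, 2, 0, 1, 0), "Hello @Acunu")
per_day = mentions_per_day("acunu", ["01/05/11", "02/05/11"])
```

The work is paid at write time, one increment per term, so the per-day query never has to rescan the tweets.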
Solution Con
Scalability $$$
Not real time
Spartan query semantics: complex, DIY solutions
High Velocity Event Streams
HTTP JSON, MQ, flume
As events are ingested:
■ Update real time views
■ Refresh dashboards
■ Preserve original event data
Dashboards and API deliver pre-computed results:
■ Roll-ups
■ Drilldowns
■ Trends
Provide definitions and real time views:
create table foo ( x long, y string, t time(hour, min), z path('/'));
create view select sum(x) from foo where y group by z;
create view select count from foo where x, t group by t;
Via the RESTful HTTP API, command line tools, or the UI query builder
Acunu Analytics
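The slides do not say how the `time(hour, min)` and `path('/')` column types are handled internally. One plausible reading, sketched here purely as an assumption, is that each value is expanded into every level of its hierarchy at ingest, so that a group-by at any level becomes a single lookup. All function and variable names below are hypothetical.

```python
from collections import defaultdict
from datetime import datetime

def path_prefixes(path, sep="/"):
    """Expand a path into every ancestor prefix, root first."""
    parts = [p for p in path.split(sep) if p]
    prefixes = [sep]
    for i in range(1, len(parts) + 1):
        prefixes.append(sep + sep.join(parts[:i]))
    return prefixes

def time_buckets(ts):
    """Return the hour and minute buckets a timestamp falls into."""
    return {
        "hour": ts.strftime("%Y-%m-%d %H:00"),
        "min": ts.strftime("%Y-%m-%d %H:%M"),
    }

# Stand-in for a view like "select sum(x) ... group by z":
# one running sum per path prefix.
sum_x_by_z = defaultdict(int)

def ingest(event):
    for prefix in path_prefixes(event["z"]):
        sum_x_by_z[prefix] += event["x"]

ingest({"x": 3, "z": "/shop/books/42"})
ingest({"x": 4, "z": "/shop/music"})
```

After these two events, the sum for `/shop` covers both while the sum for `/shop/books` covers only the first.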
Aggregates: count, count distinct (session), avg(duration)
Grouped by ...: day, geography, browser
Views before the event:
21:00 all→1345 :00→45 :01→62 :02→87 ...
22:00 all→3221 :00→22 :01→19 :02→104 ...
... ...
UK all→228 user01→1 user14→12 user99→7 ...
US all→354 user01→4 user04→8 user56→17 ...
...
UK, 22:00 all→1904 ...
∅ all→87314 UK→238 US→354 ...
Incoming event:
{cust_id: user01,
 session_id: 102,
 geography: UK,
 browser: IE,
 time: 22:02,
}
Views after the event:
21:00 all→1345 :00→45 :01→62 :02→87 ...
22:00 all→3222 :00→22 :01→19 :02→105 ...
... ...
UK all→229 user01→2 user14→12 user99→7 ...
US all→354 user01→4 user04→8 user56→17 ...
...
UK, 22:00 all→1905 ...
∅ all→87315 UK→239 US→354 ...
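The update shown on these slides, where one event bumps several pre-computed counters, could look roughly like this. It is a sketch only: the nested dict stands in for the stored views, the row keys are chosen to mirror the slide, and `ingest` is a hypothetical name. The seed values are the handful of counters from the slide that the event touches.

```python
from collections import defaultdict

# views[row_key][column_key] -> count
views = defaultdict(lambda: defaultdict(int))

# Seed with the 'before' values from the slide.
views["22:00"].update({"all": 3221, ":02": 104})
views["UK"].update({"all": 228, "user01": 1})
views[("UK", "22:00")]["all"] = 1904
views[()].update({"all": 87314, "UK": 238})   # () plays the role of the ∅ row

def ingest(event):
    hour, minute = event["time"].split(":")
    hour_key = f"{hour}:00"
    geo = event["geography"]
    # Time view: one row per hour, one column per minute.
    views[hour_key]["all"] += 1
    views[hour_key][f":{minute}"] += 1
    # Geography view: one row per country, one column per customer.
    views[geo]["all"] += 1
    views[geo][event["cust_id"]] += 1
    # Combined geography and hour view.
    views[(geo, hour_key)]["all"] += 1
    # Grand total row, broken down by geography.
    views[()]["all"] += 1
    views[()][geo] += 1

ingest({"cust_id": "user01", "session_id": 102,
        "geography": "UK", "browser": "IE", "time": "22:02"})
```

Each event costs a fixed, small number of increments, one set per defined view, regardless of how much data has already been ingested.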
Queries read straight from the views:
count(*) where time 21:00-22:00
where time 22:00-23:00, group by minute
where geography=UK, group all by user
count all
group all by geo
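Each of the queries above maps onto one row of the pre-computed views. A minimal sketch, assuming the views are held as a nested dict populated with the values visible on the slide (the elided "..." columns are simply absent here) and using the hypothetical helpers `count` and `group_by`:

```python
views = {
    "21:00": {"all": 1345, ":00": 45, ":01": 62, ":02": 87},
    "22:00": {"all": 3222, ":00": 22, ":01": 19, ":02": 105},
    "UK": {"all": 229, "user01": 2, "user14": 12, "user99": 7},
    "US": {"all": 354, "user01": 4, "user04": 8, "user56": 17},
    ("UK", "22:00"): {"all": 1905},
    (): {"all": 87315, "UK": 239, "US": 354},   # () plays the role of the ∅ row
}

def count(row_key):
    """Total for a row: a single cell read."""
    return views[row_key]["all"]

def group_by(row_key):
    """Breakdown for a row: one slice of its columns, minus the total."""
    return {col: n for col, n in views[row_key].items() if col != "all"}

hour_total = count("21:00")        # count(*) where time 21:00-22:00
per_minute = group_by("22:00")     # where time 22:00-23:00, group by minute
uk_by_user = group_by("UK")        # where geography=UK, group all by user
grand_total = count(())            # count all
by_geo = group_by(())              # group all by geo
```

None of these touches the original events; every answer is a single row read.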
HIERARCHICAL AGGREGATES
Automatic handling of paths, timestamps and geospatial queries
DRILLDOWN TO ORIGINAL EVENTS
Identify the root causes of aggregate results
TRENDING AND CORRELATION
Proactively identify deviation from baseline and breaks from trends
APPROXIMATE AGGREGATES
Fast probabilistic data structures for COUNT UNIQUE, TOP n to trade accuracy for performance, predictably
Accuracy vs. Performance: k
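The slides do not name the probabilistic structures used. As an illustration of the accuracy-for-performance trade, here is a K-minimum-values sketch for COUNT UNIQUE, a well-known technique chosen for this example and not taken from the talk. It keeps only the k smallest hash values seen; a larger k costs more memory and gives a tighter estimate. The class name and default k are assumptions.

```python
import hashlib
import heapq

class KMinValues:
    """Approximate distinct counter that keeps the k smallest hash values seen."""

    def __init__(self, k=256):
        self.k = k
        self._heap = []        # negated hashes, so the largest kept hash is at index 0
        self._members = set()  # hashes currently kept

    @staticmethod
    def _hash(item):
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") / 2**64   # roughly uniform in [0, 1)

    def add(self, item):
        h = self._hash(item)
        if h in self._members:
            return                      # duplicate of a kept value: no change
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._members.add(h)
        elif h < -self._heap[0]:
            evicted = -heapq.heappushpop(self._heap, -h)
            self._members.discard(evicted)
            self._members.add(h)

    def estimate(self):
        if len(self._heap) < self.k:
            return len(self._heap)      # fewer than k distinct hashes: exact count
        kth_smallest = -self._heap[0]
        return (self.k - 1) / kth_smallest

sessions = KMinValues(k=256)
for i in range(20000):
    sessions.add(f"session-{i % 10000}")   # 10000 distinct values, each seen twice
approx_unique = sessions.estimate()
```

Memory stays bounded by k however many events arrive, and the relative error shrinks roughly with the square root of k, which makes the trade predictable.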
Shameless plug
REAL-TIME BIG DATA ANALYTICS, POWERED BY NOSQL
■ Roll-up and transform cubes in real time
■ Leverage NoSQL for write-optimization, schema freedom, and horizontal scalability
CASSANDRA ENHANCED FOR HIGHER DENSITY, LOWER TCO
■ Enhanced Cassandra for higher density, better scalability, simpler management
■ ‘Single pane of glass’ management UI
STORAGE CRAFTED FOR BIG DATA
■ In-kernel storage engine designed and optimised for NoSQL databases
DASHBOARDS UI, JSON APIs
ACUNU ANALYTICS
ENHANCED CASSANDRA
CASTLE: STORAGE ENGINE
OPS UI
COMMODITY HW OR CLOUD
Apache, Apache Cassandra, Cassandra, Hadoop, and the eye and elephant logos are trademarks of the Apache Software Foundation.
THANK YOU
@timmoreton @acunu