all your base

Apache Cassandra and why BASE is great for real-time analytics Tim Moreton

Upload: acunu

Post on 24-Jan-2015

DESCRIPTION

Slides from Tim Moreton's talk on "Apache Cassandra and Why BASE is great for real-time analytics" from All Your Base. Nov 23, 2012.

TRANSCRIPT

Page 1: All Your Base

Apache Cassandra and why BASE is great for real-time analytics

Tim Moreton

Page 2: All Your Base

• Cassandra -- What makes it different?

• Who’s using it, and for what?

• DIY Real Time Analytics on Cassandra

• The Easy Option -- Acunu Analytics

Page 3: All Your Base

BigTable data model, Dynamo distribution

Open sourced, 2008

Incubator, 2009

Top-Level, 2010

Page 6: All Your Base

• Multi-master architecture: no SPOF

• Tunable consistency, multi-DC aware

• High performance, optimised for writes

• Atomic counters

Page 7: All Your Base

Data model

user345: {
  chess: { lives: 2, score: 33 ... }
  ...
}

user345   [chess, lives]: 2
          [chess, score]: 44
user292   [go, lives]: 4
          [monop, avatar]: top_hat
          [monop, score]: 33
user188   [monop, score]: 13

Row key: Rows arranged randomly around cluster. Load balanced, but no ordering. Put stuff to access sequentially within a row.

Page 9: All Your Base

Data model

Column key: Compound columns allow you to create multiple ordered ‘dictionaries’ in a row.

Page 10: All Your Base

Data model

Flexible schemas: “Columns” are really just cell identifiers. Rows can be VERY wide.
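The layout described on these slides can be mimicked in a few lines of Python. This is only a sketch and not Cassandra's API: `rows`, `put` and `slice_prefix` are hypothetical names, and a sorted in-memory list stands in for the ordering of columns within a row.

```python
import bisect

# Hypothetical in-memory stand-in for one column family:
# row key -> list of (compound column key, value) pairs, kept sorted by key.
rows = {}

def put(row_key, column_key, value):
    """Insert or overwrite one cell, keeping the row's columns sorted."""
    columns = rows.setdefault(row_key, [])
    keys = [k for k, _ in columns]
    i = bisect.bisect_left(keys, column_key)
    if i < len(columns) and columns[i][0] == column_key:
        columns[i] = (column_key, value)
    else:
        columns.insert(i, (column_key, value))

def slice_prefix(row_key, prefix):
    """Return the cells of one row whose compound key starts with prefix."""
    return [(k, v) for k, v in rows.get(row_key, []) if k[:len(prefix)] == prefix]

# Cells from the slide; rows need not share any columns (flexible schema).
put("user292", ("monop", "score"), 33)
put("user292", ("go", "lives"), 4)
put("user292", ("monop", "avatar"), "top_hat")
put("user345", ("chess", "lives"), 2)

# One ordered 'dictionary' inside the row: everything under "monop".
print(slice_prefix("user292", ("monop",)))
```

Because the columns of a row stay sorted, all cells sharing a prefix such as `monop` sit next to each other, which is what makes sequential access within a row cheap.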

Page 11: All Your Base

Tunable consistency: per operation

Write: ONE, QUORUM, ALL (in order of increasing #Replicas)

Read: ONE, QUORUM, ALL (in order of increasing #Replicas)

• Risk of replica failing, multiple values

• More likely to return out-of-date data

• Never going to say “ok” if a replica is down
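The trade-off between the three levels can be made concrete with a little arithmetic. The sketch below is an illustration, not driver code: the function names are hypothetical, it assumes a single data center with replication factor N, and it uses the usual quorum size of N // 2 + 1 together with the overlap rule R + W > N.

```python
def replicas_required(level, replication_factor):
    """Number of replicas that must acknowledge an operation at a given level."""
    if level == "ONE":
        return 1
    if level == "QUORUM":
        return replication_factor // 2 + 1
    if level == "ALL":
        return replication_factor
    raise ValueError(f"unknown consistency level: {level}")

def reads_see_latest_write(write_level, read_level, replication_factor):
    """True when every read set must overlap every write set (R + W > N)."""
    w = replicas_required(write_level, replication_factor)
    r = replicas_required(read_level, replication_factor)
    return r + w > replication_factor

def tolerated_failures(level, replication_factor):
    """How many replicas may be down while the operation can still succeed."""
    return replication_factor - replicas_required(level, replication_factor)

if __name__ == "__main__":
    n = 3
    for level in ("ONE", "QUORUM", "ALL"):
        print(level, replicas_required(level, n), tolerated_failures(level, n))
    print(reads_see_latest_write("ONE", "ONE", n))        # reads may be stale
    print(reads_see_latest_write("QUORUM", "QUORUM", n))  # read and write sets overlap
```

With three replicas, ALL tolerates no failed replica at all, while ONE for both reads and writes leaves room for a read to miss the latest write.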

Page 16: All Your Base

Multi data center aware

[Diagram: two data centers, DC 1 and DC 2, each holding replicas r1 and r2; row user345]

Page 18: All Your Base

Session Stores

• Read dominated
• Updates to existing items
• Probably fits in RAM
• Distribute for availability
• Want: Atomicity

Real Time Analytics

• Write dominated
• Updates very rare
• Read “results” mostly
• Distribute for availability, performance, capacity
• Want: Rich queries

Page 19: All Your Base

An analytics app on Cassandra

Source: Twitter

Page 20: All Your Base

e.g. “show me the number of mentions of ‘Acunu’ per day, between May and November 2011, on Twitter”

Batch (Hadoop) approach would require processing ~30 billion tweets, or ~4.2 TB of data.

http://blog.twitter.com/2011/03/numbers.html

Cassandra approach: For each tweet, increment a bunch of counters, such that answering a query is as easy as reading some counters.
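A minimal sketch of the counter approach, assuming a deliberately naive tokenizer and an in-memory `Counter` standing in for Cassandra counter columns. The names `mentions`, `ingest` and `mentions_per_day` are hypothetical and do not come from the talk.

```python
from collections import Counter
import re

# Hypothetical stand-in for a counter column family:
# (day, term) -> number of tweets mentioning the term on that day.
mentions = Counter()

def ingest(day, text):
    """Write path: bump one counter per distinct term in the tweet."""
    for term in set(re.findall(r"[a-z0-9]+", text.lower())):
        mentions[(day, term)] += 1

def mentions_per_day(term, days):
    """Read path: one counter lookup per day, no scan over the tweets."""
    return {day: mentions[(day, term)] for day in days}

ingest("2011-05-01", "Man, @acunu rocks!")
ingest("2011-05-01", "Acunu acunu ACUNU")
ingest("2011-05-02", "I like #trafficlights")

print(mentions_per_day("acunu", ["2011-05-01", "2011-05-02"]))
```

The work is moved to write time: each tweet costs a handful of increments, and the query cost depends on the number of days requested rather than on the number of tweets stored.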

Page 22: All Your Base

Analytics

12:32:15 I like #trafficlights
12:33:43 Nobody expects...
12:33:49 I ate a #bee; woe is...
12:34:04 Man, @acunu rocks!

[1234, man] +1
[1234, acunu] +1
[1234, rock] +1

Key                  00:01   00:02   ...
[01/05/11, acunu]    3       5       ...
[02/05/11, acunu]    12      4       ...
...                  ...     ...

Row key is ‘big’ time bucket

Column key is ‘small’ time bucket
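The two-level bucketing can be sketched as follows, assuming days as the ‘big’ bucket and minutes as the ‘small’ one, with dates formatted as on the slide. `table`, `record`, `per_minute` and `per_day` are hypothetical names, and nested dictionaries stand in for rows of counter columns.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical layout: row key = (day, term), column key = minute of the day.
table = defaultdict(lambda: defaultdict(int))

def record(timestamp, term):
    """Increment the counter in the day's row, under the minute's column."""
    row_key = (timestamp.strftime("%d/%m/%y"), term)
    column_key = timestamp.strftime("%H:%M")
    table[row_key][column_key] += 1

def per_minute(day, term, start, end):
    """Slice one row: counters whose column key lies in [start, end]."""
    row = table.get((day, term), {})
    return {minute: row[minute] for minute in sorted(row) if start <= minute <= end}

def per_day(day, term):
    """The total for a day is a sum over the columns of a single row."""
    return sum(table.get((day, term), {}).values())

# Reproduce the first row of the slide: 3 mentions at 00:01, 5 at 00:02.
for _ in range(3):
    record(datetime(2011, 5, 1, 0, 1), "acunu")
for _ in range(5):
    record(datetime(2011, 5, 1, 0, 2), "acunu")

print(per_day("01/05/11", "acunu"))
print(per_minute("01/05/11", "acunu", "00:00", "00:01"))
```

Putting the small bucket in the column key means a time range within one day is a contiguous slice of a single row.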

Page 24: All Your Base

Solution Con

• Scalability $$$

• Not real time

• Spartan query semantics: complex, DIY solutions

Page 25: All Your Base

Acunu Analytics

High Velocity Event Streams

HTTP JSON, MQ, flume

As events are ingested:
■ Update real time views
■ Refresh dashboards
■ Preserve original event data

Dashboards and API deliver pre-computed results:
■ Roll-ups
■ Drilldowns
■ Trends

Provide definitions and real time views:

create table foo (
  x long,
  y string,
  t time(hour, min),
  z path('/')
);
create view select sum(x) from foo where y group by z;
create view select count from foo where x, t group by t;

Via the RESTful HTTP API, command line tools, or the UI query builder

Page 26: All Your Base

Analytics

[Diagram: aggregates count, count distinct (session) and avg(duration), grouped by ... day, ... geography, ... browser]

Page 33: All Your Base

Analytics

21:00        all→1345   :00→45    :01→62     :02→87     ...
22:00        all→3221   :00→22    :01→19     :02→104    ...
...
UK           all→228    user01→1  user14→12  user99→7   ...
US           all→354    user01→4  user04→8   user56→17  ...
...
UK, 22:00    all→1904   ...
∅            all→87314  UK→238    US→354     ...

{
  cust_id: user01,
  session_id: 102,
  geography: UK,
  browser: IE,
  time: 22:02,
}

Page 34: All Your Base

Analytics

21:00        all→1345   :00→45    :01→62     :02→87     ...
22:00        all→3222   :00→22    :01→19     :02→105    ...
...
UK           all→229    user01→2  user14→12  user99→7   ...
US           all→354    user01→4  user04→8   user56→17  ...
...
UK, 22:00    all→1905   ...
∅            all→87315  UK→239    US→354     ...

{
  cust_id: user01,
  session_id: 102,
  geography: UK,
  browser: IE,
  time: 22:02,
}
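The change between the two counter tables can be reproduced with a short sketch. It assumes the views are bucketed by hour then minute, by geography then user, by geography and hour, and by a grand-total row; that reading is inferred from the row labels, and `views` and `ingest` are hypothetical names. Only the cells touched by the event are included.

```python
# Counter cells copied from the slide (state before the event arrives).
TOTAL = "∅"
views = {
    "22:00": {"all": 3221, ":02": 104},
    "UK": {"all": 228, "user01": 1},
    "UK, 22:00": {"all": 1904},
    TOTAL: {"all": 87314, "UK": 238},
}

def ingest(event):
    """Bump every pre-computed counter that the event contributes to."""
    hour, minute = event["time"].split(":")
    hour_row = f"{hour}:00"
    geo = event["geography"]
    updates = [
        (hour_row, "all"), (hour_row, f":{minute}"),  # by hour, then minute
        (geo, "all"), (geo, event["cust_id"]),        # by geography, then user
        (f"{geo}, {hour_row}", "all"),                # by geography and hour
        (TOTAL, "all"), (TOTAL, geo),                 # grand total, then geography
    ]
    for row, column in updates:
        cells = views.setdefault(row, {})
        cells[column] = cells.get(column, 0) + 1

ingest({"cust_id": "user01", "session_id": 102, "geography": "UK",
        "browser": "IE", "time": "22:02"})

print(views["22:00"], views["UK"], views["UK, 22:00"], views[TOTAL])
```

A single event fans out into seven increments here, which is the price paid at write time so that later reads are plain lookups.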

Page 36: All Your Base

Analytics

21:00        all→1345   :00→45    :01→62     :02→87     ...
22:00        all→3222   :00→22    :01→19     :02→105    ...
...
UK           all→229    user01→2  user14→12  user99→7   ...
US           all→354    user01→4  user04→8   user56→17  ...
...
UK, 22:00    all→1905   ...
∅            all→87315  UK→239    US→354     ...

Queries:

• count(*) where time 21:00-22:00

• where time 22:00-23:00, group by minute

• where geography=UK group all by user

• count all

• group all by geo
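Each of these queries can be read as a lookup into the counter table. The mapping of query to row used below is an interpretation of the slide, not something the transcript states, and `views`, `total` and `breakdown` are hypothetical names.

```python
# Counter rows as shown on the slide, after the event has been ingested.
TOTAL = "∅"
views = {
    "21:00": {"all": 1345, ":00": 45, ":01": 62, ":02": 87},
    "22:00": {"all": 3222, ":00": 22, ":01": 19, ":02": 105},
    "UK": {"all": 229, "user01": 2, "user14": 12, "user99": 7},
    "US": {"all": 354, "user01": 4, "user04": 8, "user56": 17},
    "UK, 22:00": {"all": 1905},
    TOTAL: {"all": 87315, "UK": 239, "US": 354},
}

def total(row):
    """A plain count is the 'all' cell of the matching row."""
    return views[row]["all"]

def breakdown(row):
    """A 'group by' is the remaining cells of the matching row."""
    return {key: value for key, value in views[row].items() if key != "all"}

# Assumed mapping from each query on the slide to the row that answers it.
answers = {
    "count(*) where time 21:00-22:00": total("21:00"),
    "where time 22:00-23:00, group by minute": breakdown("22:00"),
    "where geography=UK group all by user": breakdown("UK"),
    "count all": total(TOTAL),
    "group all by geo": breakdown(TOTAL),
}

for query, answer in answers.items():
    print(query, "->", answer)
```

No query touches the original events; each one reads a single row that was kept up to date during ingest.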

Page 46: All Your Base

APPROXIMATE AGGREGATES

Fast probabilistic data structures for COUNT UNIQUE, TOP n to trade accuracy for performance - predictably

[Plot: accuracy and performance against k]

DRILLDOWN TO ORIGINAL EVENTS

Identify the root causes of aggregate results

TRENDING AND CORRELATION

Proactively identify deviation from baseline and breaks from trends

HIERARCHICAL AGGREGATES

Automatic handling of paths, timestamps and geospatial queries
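The slides do not say which probabilistic structure is used for COUNT UNIQUE. As one illustration of trading accuracy for memory with a parameter k, here is a sketch of a K-minimum-values estimator, a technique chosen for this example only. The class and method names are hypothetical.

```python
import hashlib
import heapq

class KMinimumValues:
    """Approximate distinct counter that keeps only the k smallest hashes."""

    def __init__(self, k=256):
        self.k = k
        self._heap = []        # negated hashes, so the largest kept hash is on top
        self._members = set()  # hashes currently kept, used to skip duplicates

    @staticmethod
    def _hash(item):
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") / 2**64  # roughly uniform in [0, 1)

    def add(self, item):
        h = self._hash(item)
        if h in self._members:
            return
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._members.add(h)
        elif h < -self._heap[0]:
            # Replace the largest kept hash with the new, smaller one.
            evicted = -heapq.heappushpop(self._heap, -h)
            self._members.discard(evicted)
            self._members.add(h)

    def estimate(self):
        if len(self._heap) < self.k:
            return len(self._heap)  # fewer than k distinct items seen: exact
        return int((self.k - 1) / -self._heap[0])

sketch = KMinimumValues(k=256)
for i in range(10_000):
    sketch.add(f"session-{i}")
    sketch.add(f"session-{i}")  # repeated items do not change the estimate
print(sketch.estimate())
```

Memory is bounded by k regardless of how many items arrive, and the relative error shrinks roughly as one over the square root of k, so a larger k buys accuracy at the cost of space and time.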

Page 48: All Your Base

Shameless plug

REAL-TIME BIG DATA ANALYTICS, POWERED BY NOSQL

■ Roll-up and transform cubes in real time
■ Leverage NoSQL for write-optimization, schema freedom, and horizontal scalability

CASSANDRA ENHANCED FOR HIGHER DENSITY, LOWER TCO

■ Enhanced Cassandra for higher density, better scalability, simpler management
■ ‘Single pane of glass’ management UI

STORAGE CRAFTED FOR BIG DATA

■ In-kernel storage engine designed and optimised for NoSQL databases

[Diagram: product stack with DASHBOARDS UI, JSON APIs; ACUNU ANALYTICS; ENHANCED CASSANDRA; CASTLE: STORAGE ENGINE; OPS UI; COMMODITY HW OR CLOUD]

Page 49: All Your Base

Analytics

http://bit.ly/UBsdej

Page 50: All Your Base

Apache, Apache Cassandra, Cassandra, Hadoop, and the eye and elephant logos are trademarks of the Apache Software Foundation.

THANK YOU

@timmoreton @acunu