acunu analytics
Acunu Analytics: Simple, powerful, real-time
Andrew Byde, Principal Scientist
Tuesday, 27 March 2012
Making big data useful
time          page                           session id    duration
...           ...                            ...           ...
14:58:03.234  /index.html                    248.180.3.40  898
14:58:03.234  /csi/csi/council/freedom.html  248.180.3.40  1234
14:58:03.234  /docs/access/chapter8.txt      99.1.10.178   52
...           ...                            ...           ...
x billions
How do we turn this ...
into this...
or this...
or this...
• SQL + materialised views
... would be nice if it scaled
• Hadoop/Map-Reduce can do anything
Not real-time
Inefficient re-computation
(100TB on a 100 node cluster is > 3 hours)
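The "> 3 hours" figure can be sanity-checked with a back-of-envelope calculation. The sketch below assumes each node scans its share of the data at roughly 90 MB/s; that rate is an illustrative assumption, not a number from the talk, and `full_scan_hours` is a hypothetical helper.

```python
# Back-of-envelope scan time for one full re-computation pass.
# Only "100 TB", "100 nodes" and "> 3 hours" come from the slide;
# the per-node scan rate below is an assumed value.
TB = 10**12  # bytes (decimal terabyte)
MB = 10**6   # bytes (decimal megabyte)


def full_scan_hours(total_bytes, nodes, bytes_per_sec_per_node):
    """Hours for every node to read its equal share of the data once."""
    per_node_bytes = total_bytes / nodes
    return per_node_bytes / bytes_per_sec_per_node / 3600


hours = full_scan_hours(100 * TB, nodes=100, bytes_per_sec_per_node=90 * MB)
print(f"{hours:.2f} hours per full pass")
```

Under that assumption each node has to read 1 TB, so a single pass over the data already takes a little over three hours before any shuffle or reduce work is counted.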
• Cassandra counters are pretty cool
but the query semantics is spartan
=> DIY solutions
Acunu Analytics
• Simple, real-time, incremental analytics
• push processing into ingest phase
event → AA → counter updates → Cassandra
Acunu Analytics
• Event template, e.g.,
• specifies “blow-up” strategy according to supported queries
select : ["COUNT", "AVG(loadTime)"],
type : {
  time : [TIME(HOUR; MIN; SEC), ?, 0],
  page : PATH(/),
  loadTime : [LONG, 0, 0]
}
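One way to picture the "blow-up" is that each hierarchical field expands into one bucket per level, and the event contributes to every combination of buckets. The sketch below is an assumption about how TIME(HOUR; MIN; SEC) and PATH(/) might expand, not Acunu's implementation; the function names are hypothetical, and the `?` and `0` arguments in the template are not interpreted here.

```python
from datetime import datetime


def time_prefixes(ts):
    """Assumed expansion of TIME(HOUR; MIN; SEC): one bucket per granularity."""
    return [ts.strftime("%H:00"), ts.strftime("%H:%M"), ts.strftime("%H:%M:%S")]


def path_prefixes(path, sep="/"):
    """Assumed expansion of PATH(/): every ancestor of the path, root first."""
    parts = [p for p in path.split(sep) if p]
    return [sep] + [sep + sep.join(parts[:i]) for i in range(1, len(parts) + 1)]


def blow_up(ts, page):
    """All (time bucket, path prefix) pairs that one event would update."""
    return [(t, p) for t in time_prefixes(ts) for p in path_prefixes(page)]


# One log line from the opening slide: 14:58:03, /docs/access/chapter8.txt
for key in blow_up(datetime(2012, 3, 27, 14, 58, 3), "/docs/access/chapter8.txt"):
    print(key)
```

With three time levels and four path prefixes this single event touches twelve buckets, which is the cost paid at ingest so that queries at any of those levels become direct lookups.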
Acunu Analytics
21:00         all→1345   :00→45     :01→62     :02→87     ...
22:00         all→3221   :00→22     :01→19     :02→104    ...
...           ...
click         all→228    user01→1   user14→12  user99→7   ...
open          all→354    user01→4   user04→8   user56→17  ...
...
click, 22:00  all→1904   ...
∅             all→87314  click→238  open→354   ...
Incoming event: (22:02, “click”, user01)
type : { time : TIME(HOUR; MIN), category : STRING, user : STRING}
Acunu Analytics
After applying (22:02, “click”, user01):
21:00         all→1345   :00→45     :01→62     :02→87     ...
22:00         all→3222   :00→22     :01→19     :02→105    ...
...           ...
click         all→229    user01→2   user14→12  user99→7   ...
open          all→354    user01→4   user04→8   user56→17  ...
...
click, 22:00  all→1905   ...
∅             all→87315  click→239  open→354   ...
type : { time : TIME(HOUR; MIN), category : STRING, user : STRING}
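The update shown on this slide amounts to a handful of counter increments per event. The sketch below reproduces it with an in-memory dictionary standing in for Cassandra counter columns; the row and column layout is inferred from the slide, and `ingest` is a hypothetical name rather than part of any Acunu API.

```python
from collections import defaultdict

# row key -> column -> count, seeded with the slide's values before the event
# (only the cells that this event touches are included).
counters = defaultdict(lambda: defaultdict(int))
counters["22:00"].update({"all": 3221, ":02": 104})
counters["click"].update({"all": 228, "user01": 1})
counters["click, 22:00"].update({"all": 1904})
counters["∅"].update({"all": 87314, "click": 238})


def ingest(hour, minute, category, user):
    """Apply one event as a fixed set of counter increments (assumed layout)."""
    hour_key = f"{hour:02d}:00"
    updates = [
        (hour_key, "all"),                    # events in this hour
        (hour_key, f":{minute:02d}"),         # events in this minute of the hour
        (category, "all"),                    # events in this category
        (category, user),                     # events in this category by this user
        (f"{category}, {hour_key}", "all"),   # category and hour combined
        ("∅", "all"),                         # grand total
        ("∅", category),                      # total per category
    ]
    for row, column in updates:
        counters[row][column] += 1


ingest(22, 2, "click", "user01")
print(dict(counters["22:00"]), dict(counters["click"]))
```

Each increment is independent of the others, which is what makes the scheme a natural fit for counter columns: no read is needed before the write.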
Pre-assembled queries, e.g. ...
count all
group all by category
group all by user, where category=click
for 22:00-23:00, group by minute
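Because the counters are maintained at ingest, each of these queries reduces to reading a single row. The sketch below assumes the row layout from the counter table and uses an excerpt of its values after the click event; `count_all` and `group_by` are hypothetical helpers, not a documented query API.

```python
# Excerpt of the counter rows after the (22:02, "click", user01) event.
# Columns elided with "..." on the slide are simply left out here.
counters = {
    "22:00": {"all": 3222, ":00": 22, ":02": 105},
    "click": {"all": 229, "user01": 2, "user14": 12, "user99": 7},
    "∅": {"all": 87315, "click": 239, "open": 354},  # "open" is untouched by a click
}


def count_all():
    """count all: the grand total lives in one cell."""
    return counters["∅"]["all"]


def group_by(row_key):
    """Group within a row: every column except the 'all' total."""
    return {col: n for col, n in counters[row_key].items() if col != "all"}


print(count_all())         # count all
print(group_by("∅"))       # group all by category
print(group_by("click"))   # group all by user, where category=click
print(group_by("22:00"))   # for 22:00-23:00, group by minute
```

None of the four queries scans raw events; the work is proportional to the size of the answer rather than the size of the data set.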
Summary
• Simple, real-time, incremental analytics
• work done on ingest
• sum, count, distinct, avg, stddev, min-max etc
• time + hierarchy bucketing
• efficient ‘group’ semantics
• works with Apache Cassandra
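Aggregates such as avg and stddev can be kept incrementally by storing a few running totals per bucket. The sketch below shows one common approach (count, sum, and sum of squares); it is an illustration under that assumption, not a description of how Acunu Analytics stores them, and `RunningStats` is a hypothetical class. Distinct counts need a different structure and are not covered.

```python
import math


class RunningStats:
    """Incrementally maintained count, sum, min, max, avg and stddev."""

    def __init__(self):
        self.count = 0
        self.total = 0.0
        self.total_sq = 0.0
        self.minimum = math.inf
        self.maximum = -math.inf

    def add(self, x):
        """Fold one new value into the running totals."""
        self.count += 1
        self.total += x
        self.total_sq += x * x
        self.minimum = min(self.minimum, x)
        self.maximum = max(self.maximum, x)

    @property
    def avg(self):
        return self.total / self.count

    @property
    def stddev(self):
        """Population standard deviation from the running totals."""
        variance = self.total_sq / self.count - self.avg ** 2
        return math.sqrt(max(variance, 0.0))  # guard against rounding below zero


stats = RunningStats()
for duration in (898, 1234, 52):  # durations from the sample log lines
    stats.add(duration)
print(stats.count, stats.avg, stats.minimum, stats.maximum, stats.stddev)
```

The sum-of-squares form is chosen here because each total is a plain additive counter; it is less numerically stable than Welford's method when values are large and close together.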
Early Access Program
Example aggregates:
• count
• count distinct (session)
• avg(duration)
• count grouped by ... day, ... geography, ... browser