Scalable real-time processing techniques

Post on 01-Dec-2014


DESCRIPTION

A glance at a few scalable stream processing techniques.

TRANSCRIPT

Scalable real-time processing techniques

How to almost count
Lars Albertsson, Schibsted

“We promised to count live...

...but since you can’t do that, we used historical numbers and this cool math to extrapolate.”

?!?

Stream counting is simple

You already have the building blocks

Yet many wait for batch execution

Or go through estimation hoops


Accurate counting

● Straightforward, with some plumbing.
● Heavier than you need.

[Diagram: Servers → Bus → Bucketisers → Aggregator]

Now or later? Exact or rough?

Approximation now >> accurate later

Basic scenarios

● How many distinct items in the last x minutes?
● What are the top k items in the last x minutes?
● How many Ys in the last x minutes?

These base techniques are sufficient for implementing e.g. personalisation and recommendation algorithms.


Counting in context

● Look backward over different time windows, compare.
● Count for a small time quantum, keep history.
● Aggregate old windows.
● Monoid representations are desirable.
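The bullets above can be sketched as follows. This is a minimal illustration, not the talk's actual code: `WindowedCounter` and its methods are made-up names, and `collections.Counter` stands in for any counting structure whose merge is a monoid (associative, with an identity), which is what makes aggregating old windows cheap.

```python
from collections import Counter, deque

class WindowedCounter:
    """Keep one Counter per time quantum; a window's total is the
    monoid sum (Counter addition) of the quanta it covers."""
    def __init__(self, num_quanta):
        self.quanta = deque(maxlen=num_quanta)  # oldest quanta fall off

    def new_quantum(self):
        # Called when a new time quantum (e.g. one minute) starts.
        self.quanta.append(Counter())

    def add(self, item):
        self.quanta[-1][item] += 1

    def window_count(self, last_n):
        # Merge the last_n quanta. Counter addition is associative with
        # an identity, so partial sums of old windows can be pre-merged.
        total = Counter()
        for c in list(self.quanta)[-last_n:]:
            total += c
        return total

w = WindowedCounter(num_quanta=60)
w.new_quantum(); w.add("u2"); w.add("u2")
w.new_quantum(); w.add("u2"); w.add("gaga")
print(w.window_count(2))  # counts over the last two quanta
```

Because the merge is a monoid, comparing different window sizes is just merging a different number of quanta, with no re-reading of the stream.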


Cardinality - distinct stream count

● Naive: set of hashes. X bits per item.
● Naive 2: set approximation with Bloom filter + counter.
● Naive 3: hash to bitmap, count bits.
● Attempt 4: hash, bitmap, count + collision compensation. Linear Probabilistic Counter.
● Read papers… -> HyperLogLog counter

Cardinality - distinct stream count

[Chart comparing estimates. Source: Shakespeare, highscalability.com]
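"Attempt 4" above, linear probabilistic counting, is simple enough to sketch in a few lines. This is an illustrative implementation under my own naming, not the talk's code: hash each item to one bit of an m-bit map, then compensate for collisions with the standard estimate n ≈ -m * ln(fraction of zero bits).

```python
import math
import hashlib

class LinearCounter:
    """Linear probabilistic counting: hash each item into an m-slot
    bitmap, then correct for collisions from the zero-bit fraction."""
    def __init__(self, m=1 << 14):
        self.m = m
        self.bits = bytearray(m)  # one byte per slot, for simplicity

    def add(self, item):
        h = int.from_bytes(hashlib.blake2b(item.encode()).digest()[:8], "big")
        self.bits[h % self.m] = 1

    def cardinality(self):
        zeros = self.bits.count(0)
        if zeros == 0:
            return float("inf")  # bitmap saturated; m was too small
        return -self.m * math.log(zeros / self.m)

lc = LinearCounter()
for i in range(5000):
    lc.add(f"user-{i}")
for i in range(2500):            # duplicates must not inflate the estimate
    lc.add(f"user-{i}")
print(round(lc.cardinality()))   # close to 5000
```

The bitmap is fixed-size regardless of stream length, which is the whole point; HyperLogLog pushes the same idea much further, to a few kilobytes for billions of distinct items.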

Top K counting

U2 65, Gaga 46, Avicii 23, Eminem 21, Dolly 18
→ "Peps" plays: Peps is absent, assumed to have the lowest value (18), replaces Dolly as Peps 19
→ "Dolly" plays: Dolly is now absent, assumed to have the lowest value (19), re-enters as Dolly 20

● Keep k items, assume absentees have the lowest value
● Accurate at the top, overcounting at the bottom
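The keep-k-and-assume-the-lowest rule above is essentially the Space-Saving algorithm. A minimal sketch, with `TopK` and its fields being my own illustrative names:

```python
class TopK:
    """Keep only k counters. An unseen item evicts the current minimum
    and is assumed to have had that minimum's count, so counts near the
    bottom of the list may overestimate while the top stays accurate."""
    def __init__(self, k):
        self.k = k
        self.counts = {}

    def add(self, item):
        if item in self.counts:
            self.counts[item] += 1
        elif len(self.counts) < self.k:
            self.counts[item] = 1
        else:
            victim = min(self.counts, key=self.counts.get)
            floor = self.counts.pop(victim)
            self.counts[item] = floor + 1  # assume it had the lowest value

    def top(self):
        return sorted(self.counts.items(), key=lambda kv: -kv[1])

tk = TopK(k=3)
for item in ["u2"] * 5 + ["gaga"] * 3 + ["avicii"] * 2 + ["peps"]:
    tk.add(item)
print(tk.top())  # avicii evicted; peps assumed to start from avicii's 2
```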

Approx counting - Count-Min Sketch

● Compute n hashes for the key.
● Increment one counter per row; the column is hash mod width.
● Retrieve by min() over rows.

One increment touches one cell in each row:
3  7  20  3  11  6  3+1  4  1  1
3  8  6  2+1  17  13  1  0  4  5
12  7  6  14  2  0  2  3  6+1  7
3  2  12  8+1  10  2  7  2  11  2

Top K with Count-Min Sketch

U2 65, Gaga 46, Avicii 23, Eminem 21, Dolly 18
→ "Peps" plays: the CMS lookup says Peps ≈ 2, so it enters with 2 instead of the list minimum, evicting Dolly
→ "Dolly" plays: the CMS lookup says Dolly ≈ 19, so it re-enters as Dolly 19, evicting Peps

● Keep a Heavy Hitters list.
● Look up absentees in the CMS.
● The risk of overcounting is smaller and spread out.
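Combining the two structures can be sketched as below. This is self-contained and illustrative: `HeavyHitters` and `cms_cols` are my own names, and the salted-blake2b hashing is one arbitrary choice of per-row hash, not anything prescribed by the talk.

```python
import hashlib

def cms_cols(key, width, depth):
    # One (row, column) pair per sketch row, via per-row salted hashes.
    for row in range(depth):
        d = hashlib.blake2b(key.encode(), salt=bytes([row]) * 16).digest()
        yield row, int.from_bytes(d[:8], "big") % width

class HeavyHitters:
    """Top-k list backed by a Count-Min Sketch. Every event goes into
    the sketch; an item absent from the list competes with its CMS
    estimate rather than inheriting the list's minimum, so the
    overcount is smaller and spread out."""
    def __init__(self, k=5, width=1000, depth=4):
        self.k, self.width, self.depth = k, width, depth
        self.table = [[0] * width for _ in range(depth)]
        self.top = {}

    def add(self, item):
        for row, col in cms_cols(item, self.width, self.depth):
            self.table[row][col] += 1
        est = min(self.table[row][col]
                  for row, col in cms_cols(item, self.width, self.depth))
        if item in self.top or len(self.top) < self.k:
            self.top[item] = est
        elif est > min(self.top.values()):
            victim = min(self.top, key=self.top.get)
            del self.top[victim]
            self.top[item] = est

hh = HeavyHitters(k=3)
for item in ["u2"] * 65 + ["gaga"] * 46 + ["dolly"] * 18 + ["peps"]:
    hh.add(item)
print(hh.top)  # peps stays out: its CMS estimate (~1) is below dolly's 18
```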

Cubic CMS

● Decorate song with geo, age, etc. Pour into CMS.

● Keep heavy hitters per geo, age group.

One play (here by a Swedish user aged 31-40) increments every wildcard combination:

*:*:<U2> +1
SE:*:<U2> +1
*:31-40:<U2> +1
SE:31-40:<U2> +1
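Generating those wildcard keys is a one-liner over the cartesian product of each dimension's value and "*". A small sketch; `cube_keys` is a hypothetical helper name and the `geo:age:<item>` key layout simply mirrors the slide:

```python
from itertools import product

def cube_keys(item, dims):
    """Expand one decorated play into every wildcard combination of its
    dimension values; each resulting key gets a +1 in the sketch."""
    return [":".join(combo) + f":<{item}>"
            for combo in product(*[(d, "*") for d in dims])]

keys = cube_keys("U2", ["SE", "31-40"])
print(keys)  # four keys: SE/*, 31-40/* in all combinations
```

With d dimensions each event fans out into 2^d keys, which is why high-dimensionality cubes get expensive and the brute-force alternative below starts to look attractive.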

Machinery

O(10^4) messages/s per machine.

You probably only need one. If not, use Storm.

Read and write to pub/sub channel, e.g. Kafka or ZeroMQ.

Brute force alternative

Dump every single message into ElasticSearch.

Suitable for high dimensionality cubes.

Recommendations, you said?

● Collaborative filtering - similarity matrix

Items (rows) x Users (columns):

2 4 1 1 5 2
0 1 7 1 0 6
5 2 9 0 3 0
3 8 0 6 0 7

Shave the matrix

Flip the items x users matrix into coordinate-value pairs:

0,0 3
0,1 5
0,2 0
0,3 2
1,0 8
... ...

Sort by value:

2,1 9
1,0 8
2,2 7
5,0 7
5,2 6
... ...

Cut: keep only the top entries, zero out the rest:

0 0 0 0 0 0
0 0 7 0 0 6
0 0 9 0 0 0
0 8 0 0 0 7

Noise removed - fine for recommendations. Original matrix for comparison:

2 4 1 1 5 2
0 1 7 1 0 6
5 2 9 0 3 0
3 8 0 6 0 7
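The flip → sort → cut pipeline can be sketched in a few lines. This is my own minimal rendering of the slide's idea ("shave" and `keep` are my names), using the slide's matrix as input:

```python
def shave(matrix, keep):
    """Flip a dense matrix into ((row, col), value) triples, sort by
    value, and cut everything below the top `keep` entries, leaving a
    sparse matrix with only the strongest signals."""
    flipped = [((i, j), v) for i, row in enumerate(matrix)
               for j, v in enumerate(row)]
    top = sorted(flipped, key=lambda cv: -cv[1])[:keep]
    shaved = [[0] * len(matrix[0]) for _ in matrix]
    for (i, j), v in top:
        shaved[i][j] = v
    return shaved

dense = [[2, 4, 1, 1, 5, 2],
         [0, 1, 7, 1, 0, 6],
         [5, 2, 9, 0, 3, 0],
         [3, 8, 0, 6, 0, 7]]
print(shave(dense, keep=5))  # only the five largest values survive
```

In practice the flip, sort, and cut would each be distributed stream or batch steps rather than in-memory list operations, but the shape of the computation is the same.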

Want to work in this area?
lalle@schibsted.com
