Counting is Hard: Probabilistically Counting Views at Reddit Krishnan Chandra, Data Engineer

Post on 10-Jul-2020

Page 1: Probabilistically Counting Counting is Hard: Views at ... · HLL (Redis implementation) Max size = 12 KB 0.1% of the exact counting storage. Counting Architecture. Architecture Goals

Counting is Hard: Probabilistically Counting Views at Reddit

Krishnan Chandra, Data Engineer

Overview

● What is probabilistic counting?

● How did probabilistic counting help us scale?

● What issues did we face along the way?

What is Reddit?

Reddit is the front page of the internet

A social network with tens of thousands of communities built around whatever passions or interests you might have

It’s where people converse about the things that are most important to them

Reddit by the numbers

● Alexa Rank (US/World): 4th/7th
● MAU: 330M+
● Active Communities: 138K+
● Posts per month: 10.7M
● Screenviews per month: 14B

Counting Views

Why Count Views?

● Includes logged-out users
● Better measure of reach than votes
● Currently exposed to moderators and content creators

Cat Walking a Human

Cat Fist Bumping

Why is Counting Hard?

Product Requirements

● Counts are over the life of a post
● The same user should not count multiple times within a short time frame
● Should build in some protections against spamming/cheating (similar to votes)
● Should provide (near) real-time feedback

Exact vs. Approximate Counting

● Exact counting:
  ○ Requires storing state per user per post

● Approximate counting:
  ○ Requires much less state and storage
  ○ Provides an estimate of reach within a few percentage points of the exact number

HyperLogLog (And Friends)

● HyperLogLog (HLL)
  ○ Hash-based probabilistic algorithm published in 2007
  ○ Approximates set cardinality
  ○ Works well for large cardinalities, but not for small ones

● HyperLogLog++
  ○ Introduced by Google in 2013
  ○ Uses sparse and dense HLL representations
  ○ Switches over from the sparse to the dense HLL representation once needed

How does HLL work?

● Hash table consisting of m registers or buckets, each of width k bits
● Hash the input value, and split the hash value into 2 portions
● First portion (log2(m) bits) used to index to a register
● Second portion used to count the number of leading zeros and set the register value (the register keeps the largest value seen so far)

Assume: m=8 registers, k=3 bits

input hash: 1 1 1 0 0 0 1 1

● First 3 bits (1 1 1) select Register# 7
● Remaining bits (0 0 0 1 1) have 3 leading zeroes
● Record 3+1=4 into Register# 7

Registers r0 to r6: empty
Register r7: 1 0 0

Adapted from HyperLogLog - A Layman’s Overview
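
The worked example can be traced in a few lines of Python. This is a minimal sketch, assuming the hash is supplied as an 8-bit string and that a register keeps the largest value it has seen. The function name `update_register` is hypothetical and does not come from the talk.

```python
def update_register(registers, hash_bits, index_bits):
    """Apply one hashed value to the register array and return (index, rank)."""
    # First portion of the hash: selects the register.
    index = int(hash_bits[:index_bits], 2)
    # Second portion: count the leading zeros, then add 1.
    rest = hash_bits[index_bits:]
    leading_zeros = len(rest) - len(rest.lstrip("0"))
    rank = leading_zeros + 1
    # Assumption: keep the maximum, so smaller or repeated values never lower it.
    registers[index] = max(registers[index], rank)
    return index, rank


registers = [0] * 8  # m = 8 registers, so log2(8) = 3 index bits
index, rank = update_register(registers, "11100011", index_bits=3)
print(index, rank, format(registers[index], "03b"))  # 7 4 100
```

With k=3 bits per register the largest storable value is 7, which is enough for this toy hash because the second portion is only 5 bits long.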

Computing Cardinality

● Estimate of cardinality is computed by taking the harmonic mean of 2^(register value) across the m registers, scaled by m and a bias-correction constant
● Intuition: HLL is like flipping a coin!
● Longest run of heads gives an estimate of total number of flips
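
To make the estimate concrete, here is a toy HyperLogLog in Python. It is a sketch under stated assumptions, not the Redis implementation: the class name `TinyHLL` is hypothetical, SHA-1 is an arbitrary choice of hash, and the bias constant and small-range (linear counting) correction follow the 2007 HyperLogLog paper.

```python
import hashlib
import math


class TinyHLL:
    """Toy HyperLogLog with m = 2**p registers (illustration only)."""

    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, value):
        # 64-bit hash taken from SHA-1 (an arbitrary choice for this sketch).
        digest = hashlib.sha1(value.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        index = h >> (64 - self.p)              # first p bits pick the register
        rest = h & ((1 << (64 - self.p)) - 1)   # remaining 64 - p bits
        # Leading zeros in the remaining bits, plus one.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[index] = max(self.registers[index], rank)

    def count(self):
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)        # bias constant, valid for m >= 128
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros:
            # Small-range correction: fall back to linear counting.
            return m * math.log(m / zeros)
        return raw


hll = TinyHLL()
for i in range(50_000):
    hll.add(f"user-{i}")
print(round(hll.count()))  # close to 50,000, not exact
```

Re-adding a value that was already inserted leaves every register unchanged, which is why the estimate counts distinct values rather than total inserts.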

Counting Error

● HLL standard error
  ○ Number of registers/hash buckets m
  ○ Standard error = 1.04/sqrt(m)
  ○ Using Redis’s HLL implementation, standard error is 0.81%!
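
The 0.81% figure follows directly from the formula. The short check below assumes m = 16384, which is the register count Redis uses for each HLL; the function name is hypothetical.

```python
import math


def hll_standard_error(m):
    """Standard error of an HLL estimate with m registers."""
    return 1.04 / math.sqrt(m)


# Redis uses 16384 registers per HLL, and sqrt(16384) = 128.
print(f"{hll_standard_error(16384):.2%}")  # 0.81%
```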

Using HLL to Count Views

● 1 HLL per post
● HLL inserts are idempotent!
  ○ Allows reprocessing data if needed
● How to manage de-duping over short time window?
  ○ Store user + truncated timestamp as the value
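
One way to picture the "user + truncated timestamp" value is the sketch below. The 10-minute window, the `user:bucket` string format, and the function name `view_element` are all assumptions made for illustration; the talk does not state the window length or the encoding. With Redis, a value like this would be the element passed to `PFADD` for the post's HLL.

```python
def view_element(user_id, timestamp, window_seconds=600):
    """Build the value inserted into a post's HLL for one view."""
    # Truncate the timestamp to the start of its window so that repeat views
    # by the same user inside one window produce an identical element.
    bucket = int(timestamp) - int(timestamp) % window_seconds
    return f"{user_id}:{bucket}"


# Same user, two views 5 minutes apart inside one 10-minute window: one element.
print(view_element(42, 1_500_000_000) == view_element(42, 1_500_000_300))  # True
# A view in the next window yields a new element, so it counts again.
print(view_element(42, 1_500_000_000) == view_element(42, 1_500_000_600))  # False
```

Because identical elements never change an HLL, replaying the same events after a failure does not inflate the count.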

Space Usage

● Exact counting:
  ○ User id = 8 byte long
  ○ ~1.5M users * 8 bytes = 12 MB

● HLL (Redis implementation)
  ○ Max size = 12 KB
  ○ 0.1% of the exact counting storage
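
The arithmetic behind these numbers can be checked quickly. The constant names below are hypothetical; the values come from the slide, with 12 KB taken as 12 * 1024 bytes.

```python
USER_ID_BYTES = 8            # one 8-byte long per user id
USERS = 1_500_000            # ~1.5M users viewing one post
HLL_MAX_BYTES = 12 * 1024    # 12 KB maximum size of a Redis HLL

exact_bytes = USERS * USER_ID_BYTES
print(exact_bytes / 1_000_000)               # 12.0 (MB)
print(f"{HLL_MAX_BYTES / exact_bytes:.1%}")  # 0.1%
```

The 12 KB ceiling matches 16384 registers of 6 bits each: 16384 * 6 / 8 = 12288 bytes.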

Counting Architecture

Architecture Goals

1. Consume a stream of view events and filter out spam/bad events

2. For good events, insert into an HLL in real time

3. Allow clients to consume view counts in real time

[Architecture diagram; labeled components: App Servers, Server Side Events, Client Side Events, Anti-Spam, Counting]

Stream Processing Infrastructure

● Kafka
  ○ Main message bus for view events

● Redis
  ○ Used for storing state + HLLs
  ○ Intended as short term storage
  ○ Functions as a cache for Cassandra

● Cassandra
  ○ Used to store the final counts and HLLs in separate column families
  ○ Intended as long term storage

Counting Application (Part 1)

● Anti-Spam Consumer
  ○ Consumes the stream of views from Kafka
  ○ Basic rules engine backed by Redis
  ○ Consumer outputs a decision to a Kafka topic

Counting Application (Part 2)

● Counting Consumer
  ○ Consumes the decisions topic output by the anti-spam consumer
  ○ Creates/updates the HLL for the post in Redis
  ○ Writes both the count and the HLL out to Cassandra
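
The two consumers can be pictured as a small pipeline. The sketch below is purely illustrative: it uses in-memory stand-ins (a `deque` for each Kafka topic, a Python `set` in place of each HLL, a `dict` in place of Cassandra), and the event fields and the single spam rule are hypothetical. None of it reflects Reddit's actual code.

```python
from collections import defaultdict, deque

views_topic = deque()          # stand-in for the Kafka topic of raw view events
decisions_topic = deque()      # stand-in for the Kafka topic of decisions
post_hlls = defaultdict(set)   # stand-in for the per-post HLLs held in Redis
stored_counts = {}             # stand-in for the counts stored in Cassandra


def is_spam(event):
    # Hypothetical rule: reject events that carry no user id.
    return not event.get("user_id")


def anti_spam_consumer():
    # Read every view event and publish a decision for it.
    while views_topic:
        event = views_topic.popleft()
        decisions_topic.append({"event": event, "ok": not is_spam(event)})


def counting_consumer():
    # Count only the events that the anti-spam stage accepted.
    while decisions_topic:
        decision = decisions_topic.popleft()
        if not decision["ok"]:
            continue
        event = decision["event"]
        post_hlls[event["post_id"]].add(event["user_id"])
        stored_counts[event["post_id"]] = len(post_hlls[event["post_id"]])


views_topic.extend([
    {"post_id": "p1", "user_id": "alice"},
    {"post_id": "p1", "user_id": "alice"},  # duplicate view
    {"post_id": "p1", "user_id": None},     # rejected by the rule
    {"post_id": "p1", "user_id": "bob"},
])
anti_spam_consumer()
counting_consumer()
print(stored_counts)  # {'p1': 2}
```

Splitting the work across two topics lets the spam rules and the counting logic scale and fail independently.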

Scaling Challenges

Redis

● Problems
  ○ Rules engine is very memory heavy
  ○ HLL counting is very CPU-heavy
  ○ Rules engine data is generally time-bound with expiry
  ○ HLL data should be kept in Redis as long as possible to avoid reading from Cassandra

● Solutions
  ○ Separate Redis instances for the 2 parts of the application
  ○ Different instance types to reflect the different workloads
  ○ allkeys-lru eviction policy on HLLs, volatile-ttl eviction policy on the rules engine
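
As a rough illustration of the split, the sketch below maps each instance to its `maxmemory-policy` value. The role names `hll` and `rules` and the helper `apply_eviction_policy` are hypothetical; `client` is assumed to be any object with a `config_set(name, value)` method, such as a redis-py client.

```python
# Eviction policy per Redis instance, following the slide.
EVICTION_POLICY = {
    "hll": "allkeys-lru",     # evict least recently used keys, with or without a TTL
    "rules": "volatile-ttl",  # evict only keys that have a TTL, shortest TTL first
}


def apply_eviction_policy(client, role):
    """Set maxmemory-policy on one instance and return the policy applied."""
    policy = EVICTION_POLICY[role]
    client.config_set("maxmemory-policy", policy)
    return policy
```

Either policy only takes effect once the instance has a `maxmemory` limit configured. With allkeys-lru, HLLs for older posts are the first to be dropped from Redis, which fits its role as a cache in front of Cassandra.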

Cassandra

● Problems
  ○ 1 row per post - overwritten frequently
  ○ Read rate on page loads overwhelming the cluster
  ○ Issues with load when “catching up”
  ○ Storage grows forever with the number of posts!

● Solutions
  ○ Updates to the same row in Cassandra throttled to every 10 seconds
  ○ Read caching
  ○ Slow the update rate when catching up
  ○ More disk!
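
A per-row write throttle of this kind might look like the sketch below. The class name `RowWriteThrottle` and its interface are hypothetical; only the 10-second interval comes from the talk. The clock is injectable so the behavior can be shown without waiting.

```python
import time


class RowWriteThrottle:
    """Allow at most one write per key every `interval` seconds."""

    def __init__(self, interval=10.0, clock=time.monotonic):
        self.interval = interval
        self.clock = clock
        self.last_write = {}  # key -> time of the last allowed write

    def should_write(self, key):
        now = self.clock()
        last = self.last_write.get(key)
        if last is not None and now - last < self.interval:
            return False  # written too recently; skip this update
        self.last_write[key] = now
        return True


fake_now = [0.0]
throttle = RowWriteThrottle(interval=10.0, clock=lambda: fake_now[0])
print(throttle.should_write("post-1"))  # True: first write goes through
fake_now[0] = 4.0
print(throttle.should_write("post-1"))  # False: only 4 seconds since the last write
fake_now[0] = 10.0
print(throttle.should_write("post-1"))  # True: 10 seconds have passed
```

Skipping a write is safe here only under the assumption that the current HLL and count stay available in Redis, so the next allowed write carries the latest state.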

Observations

● Views on Reddit skew towards newer posts
  ○ Allows most views to be served by Redis
  ○ Keeps read rate on Cassandra very low

Takeaways

● Thanks to HLLs, counting views became much more efficient
  ○ Current storage usage is ~1TB for a full year of posts!

● Delivery was possible in a quarter with an engineering team of 3 (not always full time)

Thanks to our team!

● /u/gooeyblob - Cassandra + Backend

● /u/d3fect - Backend + API

● /u/powerlanguage - Product Management

Thanks!

Krishnan Chandra [email protected]/shrink_and_an_arch

PS: We’re hiring! http://reddit.com/jobs