
Counting is Hard: Probabilistically Counting Views at Reddit
Krishnan Chandra, Data Engineer

Overview

● What is probabilistic counting?

● How did probabilistic counting help us scale?

● What issues did we face along the way?

What is Reddit?

Reddit is the front page of the internet

A social network where there are tens of thousands of communities around whatever passions or interests you might have

It’s where people converse about the things that are most important to them

Reddit by the numbers

● Alexa Rank (US/World): 4th/7th
● MAU: 330M+
● Active Communities: 138K+
● Posts per month: 10.7M
● Screenviews per month: 14B

Counting Views

Why Count Views?

● Includes logged-out users
● Better measure of reach than votes
● Currently exposed to moderators and content creators

[Images: "Cat Walking a Human", "Cat Fist Bumping"]

Why is Counting Hard?

Product Requirements

● Counts are over the life of a post
● The same user should not count multiple times within a short time frame
● Should build in some protections against spamming/cheating (similar to votes)
● Should provide (near) real-time feedback

Exact vs. Approximate Counting

● Exact counting:
○ Requires storing state per user per post
● Approximate counting:
○ Requires much less state and storage
○ Provides an estimate of reach within a few percentage points of the exact number

HyperLogLog (And Friends)

● HyperLogLog (HLL)
○ Hash-based probabilistic algorithm published in 2007
○ Approximates set cardinality
○ Works well for large cardinalities, but not for small ones
● HyperLogLog++
○ Introduced by Google in 2013
○ Uses sparse and dense HLL representations
○ Switches from the sparse to the dense representation once needed

How does HLL work?

● Hash table consisting of m registers or buckets, each of width k bits
● Hash the input value, and split the hash value into 2 portions
● First portion (log2(m) bits) used to index into a register
● Second portion used to count the number of leading zeros and set the register value (keeping the maximum seen)

Assume: m = 8 registers, k = 3 bits

Input hash: 1 1 1 0 0 0 1 1

● First log2(8) = 3 bits → 111 → Register #7
● Remaining bits 00011 → 3 leading zeroes → record 3 + 1 = 4 into Register #7
● Registers r0–r6 stay 0; r7 now holds 1 0 0 (binary for 4)

Adapted from HyperLogLog - A Layman's Overview
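A minimal sketch of the register update shown above; the 8-bit hash width, m = 8, and all names are illustrative toys, not Reddit's or Redis's implementation:

# Minimal sketch of the HLL "add" step from the worked example.
m = 8                      # number of registers
index_bits = 3             # log2(m) bits used to pick a register
hash_bits = 8              # toy hash width for the worked example
registers = [0] * m

def hll_add(hash_value: int) -> None:
    """Update one register from an 8-bit hash value."""
    rest_bits = hash_bits - index_bits
    index = hash_value >> rest_bits              # first 3 bits -> register index
    rest = hash_value & ((1 << rest_bits) - 1)   # remaining 5 bits
    # Leading zeros + 1 (an all-zero remainder maps to rest_bits + 1).
    rank = rest_bits - rest.bit_length() + 1
    registers[index] = max(registers[index], rank)  # registers keep the maximum rank seen

hll_add(0b11100011)   # the hash from the slide
print(registers)      # register 7 holds 3 leading zeros + 1 = 4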

Computing Cardinality

● Estimate of cardinality is computed by taking the harmonic mean of the registers and raising 2 to that power
● Intuition: HLL is like flipping a coin!
● Largest run of heads gives an estimate of the total number of flips
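For reference, the raw (uncorrected) estimate from the 2007 HLL paper is alpha_m * m^2 / sum(2^-M_j) over the registers M_j. A minimal sketch, skipping the paper's small- and large-range corrections and using the large-m approximation of alpha_m:

import math

def hll_raw_estimate(registers: list[int]) -> float:
    """Uncorrected HyperLogLog estimate: alpha_m * m^2 / sum(2^-M_j).

    The real algorithm also applies small-range (linear counting) and
    large-range corrections, omitted here for brevity.
    """
    m = len(registers)
    # The paper uses fixed alpha constants for m in {16, 32, 64}; this is the
    # m >= 128 approximation, close enough for a sketch.
    alpha = 0.7213 / (1 + 1.079 / m)
    harmonic_sum = sum(2.0 ** -r for r in registers)
    return alpha * m * m / harmonic_sum

print(hll_raw_estimate([0, 0, 0, 0, 0, 0, 0, 4]))  # toy registers from the example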

Counting Error

● HLL standard error
○ Depends on the number of registers/hash buckets m
○ Standard error = 1.04/sqrt(m)
○ Using Redis's HLL implementation, standard error is 0.81%!
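The 0.81% figure follows from Redis's dense HLL encoding, which uses m = 16,384 registers (16,384 six-bit registers, about 12 KB):

import math

m = 16384                       # registers in Redis's dense HLL representation
std_error = 1.04 / math.sqrt(m)
print(f"{std_error:.4%}")       # 0.8125% -> the ~0.81% quoted above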

Using HLL to Count Views

● 1 HLL per post
● HLL inserts are idempotent!
○ Allows reprocessing data if needed
● How to manage de-duping over a short time window?
○ Store user + truncated timestamp as the value
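A sketch of that scheme with redis-py; the key name and the one-hour truncation window are illustrative assumptions, not Reddit's actual values:

import time
import redis

r = redis.Redis()

def record_view(post_id: str, user_id: str, window_seconds: int = 3600) -> None:
    """Insert a view into the post's HLL, de-duped per user per time window."""
    truncated_ts = int(time.time()) // window_seconds * window_seconds
    # Re-adding the same user within the window is a no-op: PFADD is idempotent.
    r.pfadd(f"views:{post_id}", f"{user_id}:{truncated_ts}")

def view_count(post_id: str) -> int:
    """Approximate number of unique viewers of the post."""
    return r.pfcount(f"views:{post_id}")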

Space Usage

● Exact counting:
○ User id = 8-byte long
○ ~1.5M users * 8 bytes = 12 MB
● HLL (Redis implementation):
○ Max size = 12 KB
○ 0.1% of the exact counting storage
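One way to see the gap for yourself is to load the same user ids into a plain Redis set and into an HLL and compare MEMORY USAGE; the key names are made up, and the set's real footprint will exceed the raw 12 MB because of per-member overhead:

import redis

r = redis.Redis()
r.delete("exact:demo", "hll:demo")

# Simulate ~1.5M unique viewers (use a smaller range if memory is tight).
for batch_start in range(0, 1_500_000, 10_000):
    users = [str(uid) for uid in range(batch_start, batch_start + 10_000)]
    r.sadd("exact:demo", *users)   # exact: one set member per user
    r.pfadd("hll:demo", *users)    # approximate: bounded at ~12 KB

print("set bytes:", r.memory_usage("exact:demo"))
print("hll bytes:", r.memory_usage("hll:demo"))
print("exact:", r.scard("exact:demo"), "approx:", r.pfcount("hll:demo"))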

Counting Architecture

Architecture Goals

1. Consume a stream of view events and filter out spam/bad events

2. For good events, insert into an HLL in real time

3. Allow clients to consume view counts in real time

[Architecture diagram: App Servers, Server Side Events, Client Side Events, Anti-Spam, Counting]

Stream Processing Infrastructure

● Kafka
○ Main message bus for view events
● Redis
○ Used for storing state + HLLs
○ Intended as short-term storage
○ Functions as a cache for Cassandra
● Cassandra
○ Used to store the final counts and HLLs in separate column families
○ Intended as long-term storage

Counting Application (Part 1)

● Anti-Spam Consumer
○ Consumes the stream of views from Kafka
○ Basic rules engine backed by Redis
○ Consumer outputs a decision to a Kafka topic
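A heavily simplified sketch of such a consumer using kafka-python and redis-py; the topic names, event fields, and the toy rate-limit rule are assumptions, not Reddit's actual rules engine:

import json
import redis
from kafka import KafkaConsumer, KafkaProducer

r = redis.Redis()
consumer = KafkaConsumer(
    "view_events",                                   # hypothetical topic name
    bootstrap_servers="localhost:9092",
    value_deserializer=json.loads,
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda obj: json.dumps(obj).encode(),
)

def is_valid(event: dict) -> bool:
    """Toy rule: allow at most 100 views per user per minute."""
    key = f"rate:{event['user_id']}"
    count = r.incr(key)
    if count == 1:
        r.expire(key, 60)            # rules-engine state is time-bound
    return count <= 100

for message in consumer:
    event = message.value
    decision = {"event": event, "valid": is_valid(event)}
    producer.send("view_decisions", decision)        # hypothetical decisions topic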

Counting Application (Part 2)

● Counting Consumer
○ Consumes the decisions topic output by the anti-spam consumer
○ Creates/updates the HLL for the post in Redis
○ Writes both the count and the HLL out to Cassandra
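A matching sketch of the counting consumer; the Cassandra keyspace, table schema, and topic name are made up for illustration:

import json
import redis
from kafka import KafkaConsumer
from cassandra.cluster import Cluster

r = redis.Redis()
session = Cluster(["localhost"]).connect("views")    # hypothetical keyspace
insert = session.prepare(
    "INSERT INTO post_views (post_id, view_count, hll) VALUES (?, ?, ?)"
)

consumer = KafkaConsumer(
    "view_decisions",                                # hypothetical decisions topic
    bootstrap_servers="localhost:9092",
    value_deserializer=json.loads,
)

for message in consumer:
    decision = message.value
    if not decision["valid"]:
        continue
    event = decision["event"]
    key = f"views:{event['post_id']}"
    r.pfadd(key, f"{event['user_id']}:{event['timestamp']}")
    # Persist both the count and the raw HLL bytes, so the HLL can be
    # reloaded into Redis later if the key is evicted.
    session.execute(insert, (event["post_id"], r.pfcount(key), r.get(key)))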

Scaling Challenges

Redis

● Problems
○ Rules engine is very memory-heavy
○ HLL counting is very CPU-heavy
○ Rules engine data is generally time-bound with expiry
○ HLL data should be kept in Redis as long as possible to avoid reading from Cassandra

● Solutions
○ Separate Redis instances for the 2 parts of the application
○ Different instance types to reflect the different workloads
○ allkeys-lru expiration on HLLs, volatile-ttl expiration on the rules engine
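Both eviction policies are standard Redis maxmemory-policy settings; a sketch of applying them to the two instances from Python (the host names are hypothetical):

import redis

# Instance holding the HLLs: evict least-recently-used keys when full, so
# hot (newer) posts stay cached and cold posts fall back to Cassandra.
hll_redis = redis.Redis(host="hll-redis.internal")          # hypothetical host
hll_redis.config_set("maxmemory-policy", "allkeys-lru")

# Instance backing the rules engine: its keys carry TTLs, so evict the
# entries closest to expiry first.
rules_redis = redis.Redis(host="rules-redis.internal")      # hypothetical host
rules_redis.config_set("maxmemory-policy", "volatile-ttl")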

Cassandra

● Problems
○ 1 row per post, overwritten frequently
○ Read rate on page loads overwhelming the cluster
○ Issues with load when "catching up"
○ Storage grows forever with the number of posts!

● Solutions
○ Updates to the same row in Cassandra throttled to every 10 seconds
○ Read caching
○ Slow the update rate when catching up
○ More disk!
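A sketch of the 10-second write throttle; keeping the last-write times in an in-process dict is an illustrative simplification of however the real consumer tracks this:

import time

WRITE_INTERVAL_SECONDS = 10
last_write: dict[str, float] = {}   # post_id -> last Cassandra write time

def should_write(post_id: str) -> bool:
    """Allow at most one Cassandra update per post every 10 seconds."""
    now = time.monotonic()
    if now - last_write.get(post_id, 0.0) >= WRITE_INTERVAL_SECONDS:
        last_write[post_id] = now
        return True
    return False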

● Views on Reddit skew towards newer posts
○ Allows most views to be served by Redis
○ Keeps read rate on Cassandra very low

Observations

● Thanks to HLLs, counting views became much more efficient
○ Current storage usage is ~1 TB for a full year of posts!
● Delivery was possible in a quarter with an engineering team of 3 (not always full time)

Takeaways

Thanks to our team!

● /u/gooeyblob - Cassandra + Backend

● /u/d3fect - Backend + API

● /u/powerlanguage - Product Management

Thanks!
Krishnan Chandra
krishnan@reddit.com
u/shrink_and_an_arch

PS: We're hiring! http://reddit.com/jobs
