Spark-driven audience counting by Boris Trofimov


Upload: javadayua

Post on 06-Jan-2017


TRANSCRIPT

Page 1: Spark-driven audience counting by Boris Trofimov

AUDIENCE COUNTING @ SCALE

Boris Trofimoff

Sigma Software & Collective

@b0ris_1

Page 2: Spark-driven audience counting by Boris Trofimov

AGENDA

1. About our customer
2. Motivation, or Why & What we are counting
3. Counting Fundamentals
4. Counting with Spark
5. Spark Road Notes

Page 3: Spark-driven audience counting by Boris Trofimov

Place where advertisers and end-users meet each other

Collective Media is a full stack cookie serving company

USERS

~1B active user profiles

MODELS

1000s of models built weekly

PREDICTIONS

100s of billions of predictions daily

MODELING AT SCALE

VOLUME

Petabytes of data used

VARIETY

Profiles, formats, screens

VELOCITY

100k+ requests per second

20 billion events per day

VERACITY

Robust measurements

Page 4: Spark-driven audience counting by Boris Trofimov

MOTIVATION

Page 5: Spark-driven audience counting by Boris Trofimov

HOW AUDIENCE IS CREATED

Page 6: Spark-driven audience counting by Boris Trofimov

IMPRESSION LOG

AD             SITE          COOKIE           IMPRESSIONS  CLICKS  SEGMENTS
bmw_X5         forbes.com    13e835610ff0d95  10           1       [a.m, b.rk, c.rh, d.sn, ...]
mercedes_2015  forbes.com    13e8360c8e1233d  5            0       [a.f, b.rk, c.hs, d.mr, ...]
nokia          gizmodo.com   13e3c97d526839c  8            0       [a.m, b.tk, c.hs, d.sn, ...]
apple_music    reddit.com    1357a253f00c0ac  3            1       [a.m, b.rk, d.sn, e.gh, ...]
nokia          cnn.com       13b23555294aced  2            1       [a.f, b.tk, c.rh, d.sn, ...]
apple_music    facebook.com  13e8333d16d723d  9            1       [a.m, d.sn, g.gh, s.hr, ...]

Page 7: Spark-driven audience counting by Boris Trofimov

SEGMENT EXAMPLES

SEGMENT  DESCRIPTION
a.m      Male
a.f      Female
b.tk     $75k-$100k annual income
b.rk     $100k-$150k annual income
c.hs     High School
c.rh     College
d.sn     Single
d.mr     Married

Page 8: Spark-driven audience counting by Boris Trofimov

WHAT WE CAN DO WITH DATA

What is the male/female ratio among people who have seen the bmw_X5 ad on forbes.com?

What is the income distribution of people who have seen the Apple Music ad?

How are Nokia clicks distributed across different education levels?
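As a point of reference, the questions above can be answered exactly on a small in-memory copy of the impression log. The sketch below is only an illustration: the `Impression` case class, the `ImpressionLog` object and its helper names are hypothetical, and the rows are the sample rows from the impression log slide (whose segment lists are truncated in the source).

```scala
// Hypothetical in-memory model of one impression-log row (not part of the original slides).
final case class Impression(
    ad: String,
    site: String,
    cookie: String,
    impressions: Int,
    clicks: Int,
    segments: Set[String])

object ImpressionLog {
  // Sample rows from the impression log slide; only the listed segments are included.
  val rows: Seq[Impression] = Seq(
    Impression("bmw_X5", "forbes.com", "13e835610ff0d95", 10, 1, Set("a.m", "b.rk", "c.rh", "d.sn")),
    Impression("mercedes_2015", "forbes.com", "13e8360c8e1233d", 5, 0, Set("a.f", "b.rk", "c.hs", "d.mr")),
    Impression("nokia", "gizmodo.com", "13e3c97d526839c", 8, 0, Set("a.m", "b.tk", "c.hs", "d.sn")),
    Impression("apple_music", "reddit.com", "1357a253f00c0ac", 3, 1, Set("a.m", "b.rk", "d.sn", "e.gh")),
    Impression("nokia", "cnn.com", "13b23555294aced", 2, 1, Set("a.f", "b.tk", "c.rh", "d.sn")),
    Impression("apple_music", "facebook.com", "13e8333d16d723d", 9, 1, Set("a.m", "d.sn", "g.gh", "s.hr"))
  )

  /** Exact number of distinct cookies that saw `ad` and belong to `segment`. */
  def distinctCookies(ad: String, segment: String): Int =
    rows.filter(r => r.ad == ad && r.segments.contains(segment)).map(_.cookie).distinct.size

  /** Total clicks on `ad`, broken down by each of the given segments. */
  def clicksBySegment(ad: String, segments: Seq[String]): Map[String, Int] =
    segments.map { s =>
      s -> rows.filter(r => r.ad == ad && r.segments.contains(s)).map(_.clicks).sum
    }.toMap
}
```

For example, `ImpressionLog.clicksBySegment("nokia", Seq("c.hs", "c.rh"))` gives the Nokia click distribution across the two education segments. This exact approach keeps every cookie in memory, which is what stops working at the data volumes described earlier.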

Page 9: Spark-driven audience counting by Boris Trofimov

BUILDING AUDIENCE PROFILE

Page 10: Spark-driven audience counting by Boris Trofimov

COUNTING

FUNDAMENTALS

Page 11: Spark-driven audience counting by Boris Trofimov

SQL?

SELECT count(distinct cookie_id)
FROM impressions
WHERE site = 'forbes.com' AND ad = 'bmw_X5' AND array_contains(segments, 'a.m')

Infinite combinations

Big Data => Big Latency for Hive, Impala and Druid

Page 12: Spark-driven audience counting by Boris Trofimov

CARDINALITY ESTIMATION ALGORITHMS

ACCURACY: For a fixed amount of memory, the algorithm should provide as accurate an estimate as possible. Especially for small cardinalities, the results should be near exact.

MEMORY EFFICIENCY: The algorithm should use the available memory efficiently and adapt its memory usage to the cardinality.

ESTIMATE LARGE CARDINALITIES: Multisets with cardinalities well beyond 1 billion occur on a daily basis, and it is important that such large cardinalities can be estimated with reasonable accuracy.

PRACTICALITY: The algorithm should be implementable and maintainable.

Page 13: Spark-driven audience counting by Boris Trofimov

HYPERLOGLOG AND OTHERS

Page 14: Spark-driven audience counting by Boris Trofimov

AUDIENCE CARDINALITY

APPROXIMATION WITH HYPERLOGLOG

Create Audience of people addressed by unique identifiers (cookies)

Create Audience “Hash Sum” file with fixed size regardless of audience size

Cardinalities of ~10^9 can be estimated with a typical accuracy of 2% using 1.5KB of memory.

[Diagram: Create Audience -> Create Hash (1.5KB)]
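The accuracy and memory figures quoted on this slide are consistent with the usual HyperLogLog standard-error bound. The slides do not state the sketch configuration, so the worked numbers below assume m = 2^11 registers of 6 bits each, which is one configuration that occupies exactly 1.5KB.

```latex
\sigma \approx \frac{1.04}{\sqrt{m}}, \qquad
m = 2^{11} = 2048 \;\Rightarrow\; \sigma \approx \frac{1.04}{45.25} \approx 2.3\%

\text{memory} = \frac{2048 \times 6\ \text{bits}}{8\ \text{bits per byte}} = 1536\ \text{bytes} = 1.5\ \text{KB}
```

The memory term does not depend on the audience size, which is why the "Hash Sum" file has a fixed size.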

Page 15: Spark-driven audience counting by Boris Trofimov

HYPERLOGLOG OPERATIONS

trait HyperLogLog {
  def add(cookieId: String): Unit

  // |A|
  def cardinality(): Long

  // |A ∪ B|
  def merge(other: HyperLogLog): HyperLogLog

  // |A ∩ B| = |A| + |B| - |A ∪ B|
  def intersect(other: HyperLogLog): Long
}

[Diagram: merge (∪) combines two 1.5KB sketches into one 1.5KB sketch; intersect (∩) is derived from the cardinalities | | of the two sketches and of their merge]
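To make the four operations concrete, here is a minimal, self-contained sketch of a HyperLogLog in Scala. It is not the implementation used in the talk: the class name `SimpleHll` is hypothetical, it hashes with `MurmurHash3.stringHash` from the Scala standard library (32 bits), it stores one byte per register for simplicity (so 2KB rather than 1.5KB for 2048 registers), and it omits the large-range correction, so it is only meant for cardinalities far below 2^32.

```scala
import scala.util.hashing.MurmurHash3

/** Minimal HyperLogLog sketch (illustrative only): m = 2^p one-byte registers, 32-bit hash. */
final class SimpleHll(val p: Int = 11) {
  require(p >= 7 && p <= 16, "p must be between 7 and 16")

  private val m = 1 << p
  private val registers = new Array[Byte](m)

  def add(cookieId: String): Unit = {
    val h = MurmurHash3.stringHash(cookieId)
    val idx = h >>> (32 - p)            // the first p bits select a register
    val rest = h << p                   // the remaining 32 - p bits, left-aligned
    // rank = 1-based position of the first 1-bit in the remaining bits
    val rank = math.min(Integer.numberOfLeadingZeros(rest) + 1, 32 - p + 1)
    if (rank > registers(idx)) registers(idx) = rank.toByte
  }

  // |A|
  def cardinality(): Long = {
    val alpha = 0.7213 / (1 + 1.079 / m)                      // bias constant, valid for m >= 128
    val raw = alpha * m * m / registers.map(r => math.pow(2.0, -r)).sum
    val zeros = registers.count(_ == 0)
    // small-range correction: fall back to linear counting while registers are still empty
    val estimate =
      if (raw <= 2.5 * m && zeros > 0) m * math.log(m.toDouble / zeros) else raw
    math.round(estimate)
  }

  // |A ∪ B|: register-wise maximum, the result has the same fixed size
  def merge(other: SimpleHll): SimpleHll = {
    require(other.p == p, "sketches must use the same precision")
    val out = new SimpleHll(p)
    for (i <- 0 until m) out.registers(i) = math.max(registers(i), other.registers(i)).toByte
    out
  }

  // |A ∩ B| = |A| + |B| - |A ∪ B| (inclusion-exclusion, clamped at zero)
  def intersect(other: SimpleHll): Long =
    math.max(0L, cardinality() + other.cardinality() - merge(other).cardinality())
}
```

Adding 10,000 distinct cookie ids to a `SimpleHll` and calling `cardinality()` should return a value close to 10,000. Note that `intersect` combines three estimates, so its relative error grows when the intersection is small compared with the two audiences.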

Page 16: Spark-driven audience counting by Boris Trofimov

COUNTING

WITH SPARK

Page 17: Spark-driven audience counting by Boris Trofimov

FROM COOKIES TO HYPERLOGLOG

AD      SITE        COOKIES HLL         IMPRESSIONS  CLICKS
bmw_X5  forbes.com  HyperLogLog@23sdg4  5468         35
bmw_X5  cnn.com     HyperLogLog@84jdg4  8943         29

SEGMENT      COOKIES HLL         IMPRESSIONS  CLICKS
Male         HyperLogLog@65xzx2  235468       335
$100k-$150k  HyperLogLog@12das1  569473       194
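The pre-aggregation behind these tables can be sketched as a group-by over the raw log: one row per (ad, site) key, with impressions and clicks summed and cookies collected into a sketch. In the illustration below all names (`LogRow`, `AdSiteStats`, `PreAggregation`) are hypothetical, and an exact `Set[String]` stands in for the HyperLogLog column so that the example stays self-contained.

```scala
// Hypothetical raw-log row (segments omitted for brevity).
final case class LogRow(ad: String, site: String, cookie: String, impressions: Long, clicks: Long)

// Hypothetical aggregate row; Set[String] is a stand-in for the fixed-size HyperLogLog sketch.
final case class AdSiteStats(cookies: Set[String], impressions: Long, clicks: Long) {
  /** Folds one raw row into the aggregate. */
  def +(r: LogRow): AdSiteStats =
    AdSiteStats(cookies + r.cookie, impressions + r.impressions, clicks + r.clicks)
}

object PreAggregation {
  private val empty = AdSiteStats(Set.empty, 0L, 0L)

  /** Collapses the raw log into one aggregate row per (ad, site) pair. */
  def byAdAndSite(rows: Seq[LogRow]): Map[(String, String), AdSiteStats] =
    rows.groupBy(r => (r.ad, r.site)).map { case (key, group) =>
      key -> group.foldLeft(empty)(_ + _)
    }
}
```

Replacing the set with a HyperLogLog keeps each aggregate row at a fixed size regardless of how many cookies it covers, at the cost of an approximate count.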

Page 18: Spark-driven audience counting by Boris Trofimov

DATA FRAMES

val adImpressions: DataFrame = sqlContext.parquetFile("/aa/${yy-mm-dd}/${hh}/audience")

adImpressions.printSchema()
// root
//  |-- ad: string (nullable = true)
//  |-- site: string (nullable = true)
//  |-- hll: binary (nullable = true)
//  |-- impressions: long (nullable = true)
//  |-- clicks: long (nullable = true)

val segmentImpressions: DataFrame = sqlContext.parquetFile("/aa/${yy-mm-dd}/${hh}/segments")

segmentImpressions.printSchema()
// root
//  |-- segment: string (nullable = true)
//  |-- hll: binary (nullable = true)
//  |-- impressions: long (nullable = true)
//  |-- clicks: long (nullable = true)

Page 19: Spark-driven audience counting by Boris Trofimov

LET’S COUNT SOMETHING

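import org.apache.spark.sql.functions._
import org.apache.spark.sql.HLLFunctions._

// Merge all HLL sketches of the BMW campaign into one; mergeHLL is used like a
// built-in aggregate such as sum(clicks). Decoding the resulting binary column
// into a HyperLogLog is elided here, as on the original slide.
val bmwCookies: HyperLogLog = adImpressions
  .filter(col("ad") === "bmw_X5")
  .select(mergeHLL(col("hll"))).first()

val educatedCookies: HyperLogLog = segmentImpressions
  .filter((col("segment") === "College") || (col("segment") === "High School"))
  .select(mergeHLL(col("hll"))).first()

// intersect already returns a count (Long); convert to Double to avoid integer division
val p = (bmwCookies intersect educatedCookies).toDouble / bmwCookies.cardinality()

What percentage of the BMW campaign audience has a college or high school education?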

Page 20: Spark-driven audience counting by Boris Trofimov

SPARK ROAD NOTES

Page 21: Spark-driven audience counting by Boris Trofimov

WRITING OWN SPARK

AGGREGATION FUNCTIONS

// Defines the function that will be applied to each row in an RDD partition.
case class MergeHLLPartition(child: Expression)
  extends AggregateExpression with trees.UnaryNode[Expression] { ... }

// Defines the function that will take the results from different partitions and merge them together.
case class MergeHLLMerge(child: Expression)
  extends AggregateExpression with trees.UnaryNode[Expression] { ... }

// Tells Spark how you want it to split your computation across the RDD.
case class MergeHLL(child: Expression)
  extends PartialAggregate with trees.UnaryNode[Expression] {

  override def asPartial: SplitEvaluation = {
    val partial = Alias(MergeHLLPartition(child), "PartialMergeHLL")()
    SplitEvaluation(
      MergeHLLMerge(partial.toAttribute),
      partial :: Nil)
  }
}

def mergeHLL(e: Column): Column = MergeHLL(e.expr)
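The two-phase split described above can be illustrated without Spark. In the sketch below a partition is modelled as a plain `Seq`, an HLL value as its raw register array, and merging as a register-wise maximum; the object and method names are hypothetical and do not correspond to Spark's internal API.

```scala
object PartialMerge {
  // Hypothetical stand-in for a serialized HyperLogLog: its raw register array.
  type Registers = Array[Byte]

  /** Register-wise maximum, i.e. the merge (union) of two sketches. */
  def mergeRegisters(a: Registers, b: Registers): Registers = {
    require(a.length == b.length, "sketches must have the same number of registers")
    Array.tabulate(a.length)(i => math.max(a(i), b(i)).toByte)
  }

  /** Phase 1: fold every row of one partition into a single partial sketch. */
  def mergePartition(rows: Seq[Registers], m: Int): Registers =
    rows.foldLeft(new Array[Byte](m))(mergeRegisters)

  /** Phase 2: combine the partial sketches produced by the individual partitions. */
  def mergePartials(partials: Seq[Registers], m: Int): Registers =
    partials.foldLeft(new Array[Byte](m))(mergeRegisters)

  /** Both phases together: per-partition merge first, then merge of the partial results. */
  def mergeAll(partitions: Seq[Seq[Registers]], m: Int): Registers =
    mergePartials(partitions.map(mergePartition(_, m)), m)
}
```

Because the register-wise maximum is associative and commutative, the result does not depend on how rows are distributed across partitions, which is what makes this aggregate safe to evaluate partially.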

Page 22: Spark-driven audience counting by Boris Trofimov

AGGREGATION FUNCTIONS

PROS & CONS

Pros:

Simple DSL; the functions look like native DataFrame functions

Works much faster than solving this problem with Scala transformations on top of RDD[Row]

Dramatic performance speed-up (10x) via mutable state control

Cons:

The UDF must be part of a private Spark package, so there is a risk that these interfaces will be changed or abandoned in the future

Page 23: Spark-driven audience counting by Boris Trofimov

SPARK AS IN-MEMORY SQL DATABASE

BATCH-DRIVEN APP --(change)--> LONG-RUNNING APP

Batch-driven app: Create SparkContext -> Run Calculations -> Show Result -> Destroy SparkContext

Long-running app: Create SparkContext -> Load Data -> Cache It In Memory -> Receive Request -> Run Calculations -> Show Result -> Destroy SparkContext

~500 GB (1 year history)

~40N occupied from ~200N cluster

Response time 1-2 seconds

Page 24: Spark-driven audience counting by Boris Trofimov

REFERENCES

http://eugenezhulenev.com/blog/2015/07/15/interactive-audience-analytics-with-spark-and-hyperloglog/

(Special thanks to Eugene Zhulenev for his brilliant blog post)

https://github.com/collectivemedia/spark-hyperloglog

http://research.google.com/pubs/pub40671.html

https://github.com/AdRoll/cantor

http://tech.adroll.com/blog/data/2013/07/10/hll-minhash.html

Page 25: Spark-driven audience counting by Boris Trofimov

THANK YOU!