How we built analytics from scratch (in seven easy steps)

Jodi Moran, Co-founder & CTO

Plumbee: social casino games

Plumbee’s growth

Oct 2011

• 3 founders & 3 founding employees

• 0 in data

March 2012

• Mirrorball Slots on Facebook launch

• 15 staff

• 0 in data

Dec 2012

• Mirrorball Slots on iOS beta launch

• 29 staff

• 4 in data

Today

• 1.2M MAU

• 250K DAU

• 39 staff

• 5 in data

“Build, measure, learn”

Timing and targeting of offers

Balancing of the virtual economy

Creation of engaging features

Cost-effective acquisition

Goals

Never say “we don’t have that data”

Breadth of data use

Depth of data use

Agile data use

Scalable foundation for the future

In the beginning…

Step #1:

Blank slate

No time

No bandwidth

No experience

3rd party analytics

Third-party analytics

• Low opportunity cost

• Full stack solution

• Lots of choices

• Get useful data to everyone fast

Step #2:

3rd party systems lack flexibility

Want to own the data

Don’t know what we want to know

Analytics is strategic

Collect everything

What is everything?

• State-changing calls from client to server

• Changes of state

• State-changing calls from client to third parties (Facebook)

Yes, this is a lot of data: 450m events (45 GB compressed) per day.

Using Amazon Web Services makes this possible.
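To put the daily volume in perspective, the quoted figures can be turned into per-second and per-event averages. This is only back-of-the-envelope arithmetic on the two numbers above; it assumes 1 GB = 10^9 bytes and a perfectly even load across the day, neither of which the slides state.

```python
# Back-of-the-envelope averages from the figures quoted on the slide.
EVENTS_PER_DAY = 450_000_000             # 450 million events per day
COMPRESSED_BYTES_PER_DAY = 45 * 10**9    # 45 GB compressed; 1 GB = 10^9 bytes is an assumption
SECONDS_PER_DAY = 24 * 60 * 60


def daily_volume_summary(events=EVENTS_PER_DAY, compressed_bytes=COMPRESSED_BYTES_PER_DAY):
    """Return average events per second and average compressed bytes per event."""
    return {
        "events_per_second": events / SECONDS_PER_DAY,
        "bytes_per_event": compressed_bytes / events,
    }


summary = daily_volume_summary()
print(f"{summary['events_per_second']:.0f} events per second on average")
print(f"{summary['bytes_per_event']:.0f} compressed bytes per event on average")
```

Real traffic is peakier than a daily average suggests, so these numbers are a floor for capacity planning rather than a target.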

Why we like it

No need:

– To test instrumentation

– To add instrumentation of new features

– To touch transactional databases

– To worry we won’t have the data

Easy and fast to implement

... but we still miss things.

Step #3:

Lots and lots of data

Need access

Data is unstructured

No time to build structure

Elastic MapReduce & Hive

The secret to success

The right analyst

Technical skills

Unstructured data

Data architecture

Step #4:

Only access via SQL

Lack of visibility

Want data to be everyday

Google Spreadsheets

Step #5:

Want to know what worked

Can’t separate factors

Want flexibility

In-house split testing

It’s easy to serve experiments…

• Server-side random assignment of users

• Second tier allows deep tests (bonus: canary deployments)

• Tool for configuration-only tests

• Test & variant pairs attached to every analytics event
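The first and last bullets can be sketched in a few lines. The slides only say that assignment is random and server-side; the hash-based bucketing below is one common way to get a stable pseudo-random assignment and is an assumption, not a description of Plumbee's implementation. The function names, the `experiments` field, and the `daily_bonus_size` test are all hypothetical.

```python
import hashlib


def assign_variant(user_id, test_name, variants):
    """Map a user to one variant of a test, stably across calls and servers.

    Hashing the test name together with the user id gives each test an
    independent split without storing any assignment state.
    """
    digest = hashlib.sha256(f"{test_name}:{user_id}".encode("utf-8")).hexdigest()
    return variants[int(digest, 16) % len(variants)]


def tag_event(event, user_id, active_tests):
    """Return a copy of the event with its (test, variant) pairs attached."""
    tagged = dict(event)
    tagged["experiments"] = {
        name: assign_variant(user_id, name, variants)
        for name, variants in active_tests.items()
    }
    return tagged


# Hypothetical test configuration and event.
ACTIVE_TESTS = {"daily_bonus_size": ["control", "double"]}
event = tag_event({"type": "spin", "user_id": "u123"}, "u123", ACTIVE_TESTS)
print(event["experiments"])
```

Because every event carries its test and variant pairs, any metric derived from the event stream can later be broken down by variant without a separate join against an assignment table.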

… but it’s hard to analyse experiments

Web analytics

Conversion rate

Binomial distribution

Simple tests

• Measuring variables that don’t satisfy “conversion rate” assumptions

• The need for an Overall Evaluation Criterion
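For reference, the "simple" web-analytics case looks roughly like the sketch below: a pooled two-proportion z-test, which relies on each user being an independent convert-or-not trial. This is a standard textbook test shown as an illustration, not the analysis used in the talk, and the sample counts are made up. It does not apply to metrics such as spend per user, which is the difficulty the slide points at.

```python
import math


def two_proportion_z_test(conversions_a, users_a, conversions_b, users_b):
    """Pooled two-proportion z-test; returns (z, two-sided p-value).

    Uses the normal approximation to the binomial, so it is only
    reasonable when both groups have plenty of converters and non-converters.
    """
    rate_a = conversions_a / users_a
    rate_b = conversions_b / users_b
    pooled = (conversions_a + conversions_b) / (users_a + users_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / users_a + 1 / users_b))
    if std_err == 0:
        raise ValueError("no variation in outcomes; the test is undefined")
    z = (rate_b - rate_a) / std_err
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p_value


# Made-up counts: 200 of 10,000 users convert in control, 260 of 10,000 in the variant.
z, p = two_proportion_z_test(200, 10_000, 260, 10_000)
print(f"z = {z:.2f}, two-sided p = {p:.4f}")
```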

Step #6:

All data processing is manual

This is getting expensive

And it takes a long time to run

Automation & optimization

(Basic) optimization

• Spot instances

• Output compression with Snappy

• Python streaming jobs

• There’s a lot more we could do…
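As an illustration of the Python streaming bullet, a Hadoop streaming job is a pair of programs that read lines on standard input and write tab-separated key/value lines on standard output. The sketch below counts events by type. It assumes one JSON object per line and a hypothetical `type` field; the slides confirm only that events are stored as JSON, not their schema.

```python
import json
from itertools import groupby


def map_events(lines):
    """Mapper: emit 'event_type<TAB>1' per JSON event line, skipping bad input."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except ValueError:
            continue  # malformed line: drop it rather than fail the whole job
        if isinstance(event, dict):
            yield f"{event.get('type', 'unknown')}\t1"


def reduce_counts(sorted_lines):
    """Reducer: sum the counts for each key; input must be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for key, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{key}\t{sum(int(count) for _, count in group)}"


# Local simulation of map -> sort -> reduce on a few made-up events.
sample = ['{"type": "spin"}', '{"type": "purchase"}', "not json", '{"type": "spin"}']
counts = list(reduce_counts(sorted(map_events(sample))))
print(counts)
```

In a real job each function would be wrapped in a small script that iterates over `sys.stdin` and prints each result, with Hadoop performing the sort between the two stages.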

Step #7:

Expensive Hive clusters

Queries take a long time to run

Hive functionality is limited

Relational data mart

Why Hive AND a traditional database?

15 GB of aggregates

20 TB total

Plumbee analytics today

Goals

Never say “we don’t have that data”

Breadth of data use

Depth of data use

Agile data use

Scalable foundation for the future

The results: average daily spenders

(Chart: average daily spenders, plotted by month)

But we have tons to do.

Engineering

• Replace our custom event aggregators with Flume

• Replace pull-based Hive & Python streaming jobs with Cascading + JVM-based languages

• Change event storage from JSON to Avro

• Better dashboards and tools

• Consider in-memory processing, e.g. Spark/Shark

• Toward “big data”

Analysis

• More “actionable”, less “interesting”

• Continuous optimization: split / multivariate testing, multi-armed bandit

• Better predictive models

• Clustering, segmentation, personalization

• Toward “data science”

Jodi Moran

@jodi_p_moran

jobs.plumbee.com

www.plumbee.com

apps.facebook.com/mirrorballslots

www.facebook.com/jodipmoran

www.linkedin.com/in/jmoran

Questions? Get in touch!
