How we built analytics from scratch (in seven easy steps)
Post on 21-Oct-2014
Jodi Moran, Co-founder & CTO
Plumbee: social casino games
Plumbee’s growth
Oct 2011
• 3 founders & 3 founding employees
• 0 in data
March 2012
• Mirrorball Slots on Facebook launch
• 15 staff
• 0 in data
Dec 2012
• Mirrorball Slots on iOS beta launch
• 29 staff
• 4 in data
Today
• 1.2M MAU
• 250K DAU
• 39 staff
• 5 in data
“Build, measure, learn”
Timing and targeting of offers
Balancing of the virtual economy
Creation of engaging features
Cost-effective acquisition
Goals
Never say “we don’t have that data”
Breadth of data use
Depth of data use
Agile data use
Scalable foundation for the future
In the beginning…
Step #1:
Blank slate
No time
No bandwidth
No experience
3rd party analytics
Third-party analytics
• Low opportunity cost
• Full stack solution
• Lots of choices
• Get useful data to everyone fast
Step #2:
3rd party systems lack flexibility
Want to own the data
Don’t know what we want to know
Analytics is strategic
Collect everything
What is everything?
• State-changing calls from client to server
• Changes of state
• State-changing calls from client to third parties (Facebook)
Yes, this is a lot of data: 450M events (45 GB compressed) per day.
Using Amazon Web Services makes this possible.
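As an illustration of the "collect everything" idea, the sketch below wraps a state-changing server call so that both the call and the resulting change of state are recorded as JSON events, with no per-feature instrumentation. It is a minimal sketch, not Plumbee's implementation: the `state_changing` decorator, the `spin` endpoint, the event fields, and the in-memory `EVENT_LOG` are all hypothetical names chosen for the example.

```python
import json
import time
from functools import wraps

# Hypothetical in-memory sink; a real system would ship events to durable storage.
EVENT_LOG = []

# Hypothetical game state: user id -> coin balance.
BALANCES = {}


def log_event(kind, name, payload):
    """Append one analytics event to the sink as a JSON line."""
    event = {"ts": time.time(), "kind": kind, "name": name, "payload": payload}
    EVENT_LOG.append(json.dumps(event, sort_keys=True))


def state_changing(func):
    """Record every call to a state-changing endpoint, plus the state it returns."""
    @wraps(func)
    def wrapper(user_id, **params):
        log_event("call", func.__name__, {"user_id": user_id, "params": params})
        result = func(user_id, **params)
        log_event("state_change", func.__name__, {"user_id": user_id, "result": result})
        return result
    return wrapper


@state_changing
def spin(user_id, bet):
    """Hypothetical endpoint: deduct a bet from the user's balance (default 1000)."""
    BALANCES[user_id] = BALANCES.get(user_id, 1000) - bet
    return {"balance": BALANCES[user_id]}


spin("u1", bet=50)
```

Because the capture happens at the call boundary, a new endpoint is covered as soon as it is decorated, which is one way to read the "no need to add instrumentation of new features" point.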
Why we like it
No need:
– To test instrumentation
– To add instrumentation of new features
– To touch transactional databases
– To worry we won’t have the data
Easy and fast to implement
... but we still miss things.
Step #3:
Lots and lots of data
Need access
Data is unstructured
No time to build structure
Elastic MapReduce & Hive
The secret to success
The right analyst
Technical skills
Unstructured data
Data architecture
Step #4:
Only access via SQL
Lack of visibility
Want data to be everyday
Google Spreadsheets
Step #5:
Want to know what worked
Can’t separate factors
Want flexibility
In-house split testing
It’s easy to serve experiments…
• Server-side random assignment of users
• Second tier allows deep tests (bonus: canary deployments)
• Tool for configuration-only tests
• Test & variant pairs attached to every analytics event
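The serving side described above can be sketched roughly as follows. The slide only says assignment is random and server-side; this example swaps in deterministic hashing of the test name and user id, a common way to get a stable, evenly spread assignment without storing it. The experiment names, variants, and event fields are hypothetical.

```python
import hashlib

# Hypothetical experiment configuration: test name -> list of variant names.
EXPERIMENTS = {
    "daily_bonus_size": ["control", "double"],
    "lobby_layout": ["control", "grid", "carousel"],
}


def assign_variant(user_id, test_name, variants):
    """Map a user to a variant by hashing (test, user); stable across requests."""
    digest = hashlib.sha256(f"{test_name}:{user_id}".encode("utf-8")).hexdigest()
    return variants[int(digest, 16) % len(variants)]


def experiment_tags(user_id):
    """Return the test & variant pairs for a user across all configured tests."""
    return {test: assign_variant(user_id, test, variants)
            for test, variants in EXPERIMENTS.items()}


def tag_event(event, user_id):
    """Return a copy of an analytics event with the user's assignments attached."""
    tagged = dict(event)
    tagged["experiments"] = experiment_tags(user_id)
    return tagged


event = tag_event({"name": "spin", "user_id": "u1"}, "u1")
```

Including the test name in the hash keeps assignments in different tests independent of each other, so one test's split does not line up with another's.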
… but it’s hard to analyse experiments
Web analytics
Conversion rate
Binomial distribution
Simple tests
• Measuring variables that don’t satisfy “conversion rate” assumptions
• The need for an Overall Evaluation Criterion
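For reference, the "simple test" of web analytics can be sketched as a two-proportion z-test, which relies on the normal approximation to the binomial distribution. This is a textbook formulation, not necessarily the test Plumbee used, and the counts in the example are made up.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both variants convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value: 2 * (1 - Phi(|z|)) equals erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Made-up counts: 200 of 10,000 users convert in A, 260 of 10,000 in B.
z, p_value = two_proportion_z_test(200, 10_000, 260, 10_000)
```

The test assumes each user contributes a single yes/no outcome. A metric such as spend per user is not binary, so it falls outside these assumptions, which is the difficulty the bullets above point to.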
Step #6:
All data processing is manual
This is getting expensive
And it takes a long time to run
Automation & optimization
(Basic) optimization
• Spot instances
• Output compression with Snappy
• Python streaming jobs
• There’s a lot more we could do…
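To make "Python streaming jobs" concrete, here is a minimal sketch in the Hadoop Streaming style: a mapper that emits tab-separated key/value lines and a reducer that sums counts over key-sorted input. It counts events by a hypothetical `name` field in JSON event lines; the field and the sample data are assumptions, and the map, sort, and reduce phases are simulated in one process so the example runs anywhere.

```python
import json
from itertools import groupby


def mapper(lines):
    """Emit 'event_name<TAB>1' for each JSON event line; skip malformed lines."""
    for line in lines:
        try:
            event = json.loads(line)
        except ValueError:
            continue
        if not isinstance(event, dict):
            continue
        yield f"{event.get('name', 'unknown')}\t1"


def reducer(sorted_lines):
    """Sum counts per key; input must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(count) for _, count in group)}"


def run_locally(lines):
    """Simulate map -> sort -> reduce in a single process."""
    return list(reducer(sorted(mapper(lines))))


sample = ['{"name": "spin"}', "not json", '{"name": "purchase"}', '{"name": "spin"}']
counts = run_locally(sample)
```

In an actual streaming job the mapper and reducer would be separate scripts reading `sys.stdin` and writing to standard output, with the framework doing the sort between them.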
Step #7:
Expensive Hive clusters
Queries take a long time to run
Hive functionality is limited
Relational data mart
Why Hive AND a traditional database?
15 GB of aggregates
20 TB total
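The split between the two stores can be illustrated with a small sketch: batch jobs reduce the raw events to compact aggregates, and those aggregates are loaded into a relational database where everyday queries are fast. The example uses SQLite purely so that it is self-contained; the slides do not say which database was used, and the table, columns, and figures are made up.

```python
import sqlite3

# Made-up daily aggregates, as a batch job over raw events might produce them:
# (day, platform, active_users, spenders)
DAILY_AGGREGATES = [
    ("2014-10-01", "facebook", 1000, 40),
    ("2014-10-01", "ios", 500, 35),
    ("2014-10-02", "facebook", 1100, 44),
    ("2014-10-02", "ios", 400, 16),
]


def build_mart(rows):
    """Load aggregate rows into an in-memory relational data mart."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE daily_summary ("
        "day TEXT, platform TEXT, active_users INTEGER, spenders INTEGER, "
        "PRIMARY KEY (day, platform))"
    )
    conn.executemany("INSERT INTO daily_summary VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return conn


def spender_rate_by_day(conn):
    """Share of active users who spent, per day, across all platforms."""
    query = (
        "SELECT day, 1.0 * SUM(spenders) / SUM(active_users) "
        "FROM daily_summary GROUP BY day ORDER BY day"
    )
    return conn.execute(query).fetchall()


mart = build_mart(DAILY_AGGREGATES)
rates = spender_rate_by_day(mart)
```

Queries like this touch only the small aggregate tables, so they return quickly, while questions that need event-level detail still go to the full data set.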
Plumbee analytics today
Goals
Never say “we don’t have that data”
Breadth of data use
Depth of data use
Agile data use
Scalable foundation for the future
The results: average daily spenders (chart, by month)
But we have tons to do.
Engineering
• Replace our custom event aggregators with Flume
• Replace pull-based Hive & Python streaming jobs with Cascading + JVM-based languages
• Change event storage from JSON to Avro
• Better dashboards and tools
• Consider in-memory processing, e.g. Spark/Shark
• Toward “big data”
Analysis
• More “actionable”, less “interesting”
• Continuous optimization: split / multivariate testing, multi-armed bandit
• Better predictive models
• Clustering, segmentation, personalization
• Toward “data science”
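As a pointer to what the multi-armed bandit item could look like, here is a minimal epsilon-greedy sketch, one of the simplest bandit strategies. It describes planned work only in general terms: the slides do not say which algorithm was intended, and the arm names and conversion rates are made up.

```python
import random


class EpsilonGreedy:
    """Explore a random arm with probability epsilon, else exploit the best so far."""

    def __init__(self, arms, epsilon=0.1, seed=None):
        self.arms = list(arms)
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.pulls = {arm: 0 for arm in self.arms}
        self.mean_reward = {arm: 0.0 for arm in self.arms}

    def choose(self):
        """Pick the next arm to serve."""
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.arms)
        return max(self.arms, key=lambda arm: self.mean_reward[arm])

    def update(self, arm, reward):
        """Fold one observed reward into the arm's running mean."""
        self.pulls[arm] += 1
        self.mean_reward[arm] += (reward - self.mean_reward[arm]) / self.pulls[arm]


def simulate(true_rates, rounds, seed=0):
    """Run the bandit against made-up conversion rates and return it."""
    bandit = EpsilonGreedy(true_rates, epsilon=0.1, seed=seed)
    outcomes = random.Random(seed + 1)
    for _ in range(rounds):
        arm = bandit.choose()
        reward = 1.0 if outcomes.random() < true_rates[arm] else 0.0
        bandit.update(arm, reward)
    return bandit


result = simulate({"offer_a": 0.05, "offer_b": 0.5}, rounds=2000)
```

Unlike a fixed split test, the bandit shifts traffic toward the better-performing arm while the experiment is still running, at the cost of a less clean comparison between arms.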
Questions? Get in touch!
Jodi Moran
[email protected]
@jodi_p_moran
www.facebook.com/jodipmoran
www.linkedin.com/in/jmoran
jobs.plumbee.com
www.plumbee.com
apps.facebook.com/mirrorballslots