piano rubyslava final
2 0 1 5 P I A N O . I O
Data Science @
Rubyslava
October 2015
Overview
R U B Y S L A V A
01 Intro
02 Data
03 Methods
04 Outputs
05 Challenges
P I A N O F R A M E W O R K
What Data Science?
P I A N O F R A M E W O R K
Why do we need analytics (data science)?
PAID USERS | REGISTERED | ENGAGED | CASUAL
1. What is my potential? (Potential Clients)
2. X has happened, why? Is Y a good idea? (Existing
Clients, Account managers)
3. Input into products
D A T A
What data do we collect?
1. Transactional Data
Subscriptions, Users
2. System settings
Products, Clients, Offers, Messaging
3. Clickstream Data
Pageviews, user agents
4. Conversion Data
Steps in a conversion funnel
Three storage stacks (slide labels: Piano, Press+, Tinypass):
• Pageviews: MongoDB; Transactions + System settings: PostgreSQL; Conversion Data: Google Analytics
• Pageviews: Amazon S3 (Cassandra); Transactions + System settings: Oracle; Conversion Data: Amazon S3 (Cassandra)
• Pageviews + Conversion Data: BigQuery (Cassandra); Transactions + System settings: MySQL
?
And then you wonder why 60 – 80 % of a Data Scientist’s job is cleaning and merging datasets…
D A T A
What are we even measuring?
[Chart: % Paying Users by # Devices (1 to 10+)]
1. Cookies …
… can be deleted
2. Fingerprinting …
… doesn’t work for mobile
3. IP address …
… shared networks, proxies, VPNs
4. Registration …
… Try convincing a publisher all their
readers should register
Ref: ‘Not your business’
Stats > Math
D A T A
It is important to understand the data generating process
We are looking at how much users read in order to estimate the ideal setting for a metered paywall. In this particular case we realized that the site refreshes itself after every 5 minutes of user inactivity, which inflates the pageview counts.
[Chart: % of Users reading articles by Readership Level (1+ to 20+); series: All PVs, Without Refreshed PVs]
Domain knowledge and observations
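One way to separate refreshed pageviews from real reading could look like the sketch below. It is not the method used in the talk: the `Pageview` fields, the 5-second tolerance, and the rule "same user, same URL, about 300 seconds after the previous pageview" are all assumptions made for illustration.

```python
from dataclasses import dataclass

REFRESH_SECONDS = 300    # assumed auto-refresh interval (5 minutes)
TOLERANCE_SECONDS = 5    # assumed slack around that interval


@dataclass
class Pageview:
    uid: str
    url: str
    t: int  # Unix timestamp in seconds


def drop_refreshed(pageviews):
    """Return pageviews with suspected auto-refresh reloads removed."""
    kept = []
    last_seen = {}  # uid -> previous pageview of that user (kept or dropped)
    for pv in sorted(pageviews, key=lambda p: (p.uid, p.t)):
        prev = last_seen.get(pv.uid)
        is_refresh = (
            prev is not None
            and prev.url == pv.url
            and abs((pv.t - prev.t) - REFRESH_SECONDS) <= TOLERANCE_SECONDS
        )
        if not is_refresh:
            kept.append(pv)
        last_seen[pv.uid] = pv
    return kept


# Hypothetical sample data
sample = [
    Pageview("u1", "/a", 0),
    Pageview("u1", "/a", 300),  # reload after 5 idle minutes
    Pageview("u1", "/a", 601),  # another reload
    Pageview("u1", "/b", 700),  # genuine click
    Pageview("u2", "/a", 10),
]
cleaned = drop_refreshed(sample)
print(len(sample), len(cleaned))
```

Comparing readership levels on `sample` and on `cleaned` would give the two series from the chart.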
D A T A
Being able to perform a calculation doesn’t mean you should
A session is defined as a stream of actions by one user, with less than 30 minutes between consecutive actions, tied together by consecutive referrers.
It can also be tracked via a session cookie.
create table pvs_with_visits
select
  -- new session when the user changes or the gap exceeds 1800 s (30 minutes)
  @sessionId := if(@prevUser = uid AND diff <= 1800, @sessionId, @sessionId + 1) as sessionId,
  @prevUser := uid AS uid,
  url, t, diff, rldiff, ref, dt, sid
from
  (select @sessionId := 0, @prevUser := '-') b
join
  (select
     -- seconds since the same user's previous pageview
     TIME_TO_SEC(if(@prevU = uid, TIMEDIFF(t, @prevD), '00:00')) as diff,
     if(@prevU = uid AND @prevrl != 1000, @prevrl - a_rl, 0) as rldiff,
     @prevU := uid as uid,
     @prevD := t as t,
     @prevrl := a_rl,
     url, ref, dt, sid
   from
     pageviews
   join
     (select @prev := 0, @prevU := '-') a
   order by
     uid, t) a;
Have mercy on your analysts
select *
from (
  SELECT
    *,
    FIRST_VALUE(referrerSegmentId) OVER (PARTITION BY uid, session_order order by datetime) AS session_ref,
    FIRST_VALUE(url_class) OVER (PARTITION BY uid, session_order order by datetime) AS session_start_class
  FROM (
    SELECT
      *,
      MAX(session_order) OVER (PARTITION BY uid) AS n_sessions_of_user
    FROM (
      SELECT
        *,
        ROW_NUMBER() OVER (PARTITION BY session_id ORDER BY datetime_string) AS pvs_order_in_visit
      FROM (
        SELECT
          *,
          CONCAT(uid, '-', CAST(session_order AS STRING)) AS session_id
        FROM (
          SELECT
            *,
            session_order_anc + 1 AS session_order
          FROM (
            SELECT
              *,
              SUM(new_event_boundary) OVER (PARTITION BY uid ORDER BY datetime_string) AS session_order_anc
            FROM (
              SELECT
                *,
                (datetime_ut - lag_1)/1000000/60 AS minutes_since_last_interval,
                CASE WHEN (datetime_ut - lag_1)/1000000 > 30 * 60 THEN 1 ELSE 0 END AS new_event_boundary
              FROM (
                SELECT
                  *,
                  LAG(datetime_ut) OVER (PARTITION BY uid ORDER BY datetime_string) AS lag_1,
                  MONTH(datetime_string) AS month
                FROM (
                  SELECT
                    *,
                    datetime AS datetime_string,
                    TIMESTAMP(datetime) AS datetime_ut
                  FROM
                    [tmp.pageviews_r])))))))))",
project = project;
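Stripped of SQL, the rule both queries implement is short. The sketch below is an illustration and not code from the talk: it assumes events arrive as `(uid, t)` pairs with `t` in seconds, and it leaves out the referrer condition from the session definition.

```python
from itertools import groupby

SESSION_GAP_SECONDS = 30 * 60  # a gap longer than 30 minutes starts a new session


def sessionize(events):
    """Assign a per-user session number to each (uid, t) event.

    Returns a list of (uid, t, session_order), ordered by user and time.
    """
    out = []
    for uid, group in groupby(sorted(events), key=lambda e: e[0]):
        session_order = 0
        prev_t = None
        for _, t in group:
            # The first event of a user, or a long gap, opens a new session
            if prev_t is None or t - prev_t > SESSION_GAP_SECONDS:
                session_order += 1
            out.append((uid, t, session_order))
            prev_t = t
    return out


# Hypothetical events: user "a" has a gap of 1801 s before the third pageview
events = [("a", 0), ("a", 1800), ("b", 5), ("a", 3601)]
sessions = sessionize(events)
print(sessions)
```

A session identifier like the one in the second query can then be built by joining `uid` and `session_order`.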
D A T A
Storage
• DO NOT keep information only in “Application Logic”
Period_Type_ID   Decoding
1                Day(s)
2                Week
3                Every 10 days
4                Every other week
5                Every 15 days
6                Every 20 days
7                Month(s)/30-day
8                Month(s) Actual
9                2 Months/60-day
10               2 Months Actual
11               3 Months/90-day
12               3 Months Actual
13               6 Months/180-day
14               6 Months Actual
15               Year(s) 365-days
Duration = Period_Type * Cycle_Count
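If the decoding is stored as data, an analyst can apply the duration formula directly. The following is a minimal sketch under stated assumptions: the dict stands in for a lookup table that would live in the database, the function name is hypothetical, and the "Actual" calendar-month types are left out because their length in days depends on the start date.

```python
# Period_Type_ID -> length of one period in days (fixed-length types only).
# The "Actual" types (8, 10, 12, 14) need calendar arithmetic and are omitted.
PERIOD_TYPE_DAYS = {
    1: 1,      # Day(s)
    2: 7,      # Week
    3: 10,     # Every 10 days
    4: 14,     # Every other week
    5: 15,     # Every 15 days
    6: 20,     # Every 20 days
    7: 30,     # Month(s)/30-day
    9: 60,     # 2 Months/60-day
    11: 90,    # 3 Months/90-day
    13: 180,   # 6 Months/180-day
    15: 365,   # Year(s) 365-days
}


def duration_days(period_type_id, cycle_count):
    """Duration = Period_Type * Cycle_Count, for fixed-length period types."""
    if period_type_id not in PERIOD_TYPE_DAYS:
        raise ValueError(f"no fixed day count for period type {period_type_id}")
    return PERIOD_TYPE_DAYS[period_type_id] * cycle_count


print(duration_days(7, 3))  # three 30-day months
```

Without the stored decoding, the value `7` in a subscriptions table means nothing to anyone who cannot read the application code.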
D A T A
Storage
• DO store history, either in the form of delta logs or record validity dates
• As is versus as is
• As is versus as was
• As was versus as was
Payment_id   Duration   Fee     Start Date   End Date
123456       366        39.00   1. 1. 2015   31. 1. 2015
Cancellation on 30. 6. 2015
Payment_id   Duration   Fee     Start Date   End Date
123456       180        19.18   1. 1. 2015   30. 6. 2015
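Record validity dates make the "as was" view recoverable after the cancellation overwrites the payment. The sketch below is hypothetical: the `valid_from`/`valid_to` fields and the `as_of` helper are assumptions, and the validity range of the first version is assumed to run from the start date to the cancellation date. The last line only checks that the fee after cancellation is consistent with prorating the original fee by days.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class PaymentVersion:
    payment_id: int
    duration_days: int
    fee: float
    valid_from: date           # day this version of the record became true
    valid_to: Optional[date]   # None means the version is still current


def as_of(history, payment_id, day):
    """Return the version of a payment that was valid on the given day."""
    for v in history:
        if (v.payment_id == payment_id
                and v.valid_from <= day
                and (v.valid_to is None or day < v.valid_to)):
            return v
    return None


# Both versions are kept instead of overwriting the first one
history = [
    PaymentVersion(123456, 366, 39.00, date(2015, 1, 1), date(2015, 6, 30)),
    PaymentVersion(123456, 180, 19.18, date(2015, 6, 30), None),
]

as_was = as_of(history, 123456, date(2015, 3, 1))   # before the cancellation
as_is = as_of(history, 123456, date(2015, 9, 1))    # after the cancellation
prorated_fee = round(39.00 * 180 / 366, 2)
print(as_was.fee, as_is.fee, prorated_fee)
```

With only the second row stored, a revenue report for March 2015 could not reproduce what was known at the time.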
D A T A
On Big Data
1.3 billion rows @ 25 columns: 3 months of US clickstream data, 150 GB gzipped ≈ 1.5 TB full
1.3 billion rows @ 6 columns: 3 months of US clickstream data, 252 GB full
614M rows @ 10 columns: 3 months of session data, 78.1 GB
340M rows @ 15 columns: 3 months of user data, 49.5 GB
3170 rows @ 31 columns: 1.3 MB
• Do we really have Big Data? (50 %)
• Do we need to work with it in Big Data form? (1 % – 5 %)
Your big data might be quite small…
M E T H O D S
If you can write a for loop you can do Data Science
What is a p-value?
One simple trick,
the statisticians hate it!
M E T H O D S
If you can write a for loop you can do Data Science
Is there a significant difference between average CPU usage?
Data Centre 1: 84%, 72%, 57%, 46%, 63%, 76%, 99%, 91%
Data Centre 2: 81%, 69%, 74%, 61%, 56%, 87%, 69%, 65%, 66%, 44%, 62%, 69%
Mean Data Centre 1: 73.5 %
Mean Data Centre 2: 66.9 %
Difference: 6.6 %
source: https://speakerdeck.com/jakevdp/statistics-for-hackers
M E T H O D S
If you can write a for loop you can do Data Science
What would statistics do?
$$t = \frac{73.5 - 66.9}{\sqrt{\frac{316.8}{8} + \frac{124.81}{12}}}$$
Is t > tcrit?
0.932 < 1.796, so the difference is not significant
source: https://speakerdeck.com/jakevdp/statistics-for-hackers
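The classical calculation can be reproduced in a few lines. This sketch assumes the statistic is Welch's t (unequal variances), which matches the reported value of 0.932. The eight Data Centre 1 and twelve Data Centre 2 readings are the ones from the CPU usage slide, and the critical value is taken from the slide rather than computed.

```python
from math import sqrt
from statistics import mean, variance

# CPU usage readings in percent, as shown on the slide
dc1 = [84, 72, 57, 46, 63, 76, 99, 91]
dc2 = [81, 69, 74, 61, 56, 87, 69, 65, 66, 44, 62, 69]


def welch_t(a, b):
    """Welch's t statistic: difference of means over its standard error."""
    standard_error = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error


T_CRIT = 1.796  # critical value quoted on the slide, not derived here

t = welch_t(dc1, dc2)
significant = t > T_CRIT
print(round(t, 3), significant)
```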
M E T H O D S
If you can write a for loop you can do Data Science
What can a for loop do?
1. Shuffle the observations between the 2 groups randomly
2. Compute means for each group
3. Compute the difference of the means
4. Repeat n times (n can be 10 000)
source: https://speakerdeck.com/jakevdp/statistics-for-hackers
An approach used in real scientific papers
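The four steps fit in one loop. The sketch below uses the CPU usage readings from the earlier slide. The function name and the fixed seed are illustrative choices, and the final step (counting how often a shuffled difference is at least as large as the observed one) is an assumption about how the simulated differences are then read.

```python
import random

# CPU usage readings in percent, as shown on the earlier slide
dc1 = [84, 72, 57, 46, 63, 76, 99, 91]
dc2 = [81, 69, 74, 61, 56, 87, 69, 65, 66, 44, 62, 69]


def mean_difference(a, b):
    return sum(a) / len(a) - sum(b) / len(b)


def permutation_p_value(a, b, n_iter=10_000, seed=0):
    """One-sided permutation test on the difference of group means."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    observed = mean_difference(a, b)
    pooled = list(a) + list(b)
    at_least_as_large = 0
    for _ in range(n_iter):                                       # step 4
        rng.shuffle(pooled)                                       # step 1
        diff = mean_difference(pooled[:len(a)], pooled[len(a):])  # steps 2 and 3
        if diff >= observed:
            at_least_as_large += 1
    return observed, at_least_as_large / n_iter


observed, p_value = permutation_p_value(dc1, dc2)
print(round(observed, 1), p_value)
```

A share well above 5 % leads to the same reading as the t-test: the difference is not significant.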
M E T H O D S
If you can write a for loop you can do Data Science
A more interesting application
1. Take the dataset you’ve built the model on
2. Shuffle Y values
3. Build the model from random data
In what % of cases did you build a model better than the original one?
Call:
glm(formula = DD_index ~ ., data = perm_data_dummies)

Deviance Residuals:
     Min        1Q    Median        3Q       Max
-0.71028  -0.07349   0.00514   0.08096   0.63189

Coefficients: (1 not defined because of singularities)
                          Estimate Std. Error t value Pr(>|t|)
(Intercept)              4.052e-01  7.716e-02   5.251 1.52e-07 ***
`authorAdam Molon`      -4.710e-01  5.098e-02  -9.239  < 2e-16 ***
`authorAlexandra Gibbs` -2.160e-02  1.546e-02  -1.397 0.162359
`authorAlex Crippen`     1.025e-01  1.647e-02   6.220 5.02e-10 ***
`authorAlex Rosenberg`   5.432e-02  1.495e-02   3.634 0.000279 ***
…
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for gaussian family taken to be 0.01748868)

    Null deviance: 3115.15  on 35716  degrees of freedom
Residual deviance:  608.73  on 34807  degrees of freedom
AIC: -42258

Number of Fisher Scoring iterations: 2
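The same shuffle idea applied to a model can be sketched as follows. To stay self-contained, this uses a one-variable least-squares fit scored by R² in place of the glm above, and synthetic data with a built-in linear relationship; both are assumptions for illustration, not the model or data from the talk.

```python
import random


def r_squared(xs, ys):
    """R^2 of a one-variable least-squares fit (squared correlation)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return (sxy * sxy) / (sxx * syy)


def share_of_better_random_models(xs, ys, n_iter=1000, seed=0):
    """Share of Y-shuffled fits that score at least as well as the original."""
    rng = random.Random(seed)
    original = r_squared(xs, ys)
    shuffled = list(ys)
    better = 0
    for _ in range(n_iter):
        rng.shuffle(shuffled)                      # break the X-Y relationship
        if r_squared(xs, shuffled) >= original:    # refit on random data
            better += 1
    return better / n_iter


# Synthetic data: y depends on x, plus Gaussian noise
data_rng = random.Random(1)
xs = list(range(30))
ys = [2 * x + data_rng.gauss(0, 3) for x in xs]

original_fit = r_squared(xs, ys)
share = share_of_better_random_models(xs, ys)
print(round(original_fit, 3), share)
```

If the random models beat the original one in a sizeable share of runs, the original fit is probably picking up noise.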
O U T P U T S
Health Diagnostics for Sites with a Paywall: Approach 1
Can we, without doing a deep dive, identify areas for improvement for any of our approx. 1200 (600) sites?
• Percentiles
• One-dimensional view
• Not everyone can be above average
$$\text{Stop Rate} = \frac{\text{Unique Visitors being Stopped}}{\text{Unique Visitors}}$$
[Chart: percentile distribution across sites (95% down to 5%), with a “You are here” marker]
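A percentile view of one KPI can be sketched as below. The site names and visitor counts are made up, and the percentile definition (share of sites at or below the value) is one of several possible conventions, chosen here as an assumption.

```python
def stop_rate(stopped_visitors, unique_visitors):
    """Stop Rate = unique visitors being stopped / unique visitors."""
    return stopped_visitors / unique_visitors


def percentile_rank(value, population):
    """Percentage of values in the population that are at or below `value`."""
    at_or_below = sum(1 for v in population if v <= value)
    return 100.0 * at_or_below / len(population)


# Hypothetical sites: (unique visitors being stopped, unique visitors)
site_counts = {
    "site_a": (120, 10_000),
    "site_b": (450, 15_000),
    "site_c": (90, 1_000),
    "site_d": (300, 20_000),
}

rates = {site: stop_rate(*counts) for site, counts in site_counts.items()}
you_are_here = percentile_rank(rates["site_b"], list(rates.values()))
print(rates["site_b"], you_are_here)
```

This is the one-dimensional view from the bullets: each KPI is ranked on its own, with no regard for how KPIs relate to each other.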
O U T P U T S
Health Diagnostics for Sites with a Paywall: Approach 2
Can we, without doing a deep dive, identify areas for improvement for any of our approx. 1200 (600) sites?
• Clustering (PAM -> Fuzzy)
• KPIs in relation to each other
• Easy to read (or so we thought)
• Too much variation
[Charts: Site 1, Site 2]
O U T P U T S
Health Diagnostics for Sites with a Paywall: Approach 3
Can we, without doing a deep dive, identify areas for improvement for any of our approx. 1200 (600) sites?
• Site similarity (Size, Age)
[Chart: Example of a site with good 3rd party benchmarks (Site 1)]
[Chart: Example of a “lonely” site (xyx)]
O U T P U T S
Health Diagnostics for Sites with a Paywall: Approach 3
Can we, without doing a deep dive, identify areas for improvement for any of our approx. 1200 (600) sites?
• Compare to:
• Similar Sites
• Sites of the same Publisher
• Worst Site
• Best Site
• Display multiple KPIs in different units in one chart
C H A L L E N G E S
Building Data Products
What if we didn’t have to analyze the data,
what if we could just say – [This] is
interesting?
[This] can be a combination of multiple
variables such as author, section, traffic from
device and anything else.
[This] sits in a hierarchy.
We want to know that [This] is interesting because of [This] alone, not because of [Parent(s) of This] or [Child of This].
C H A L L E N G E S
Building Data Products
If [This] is constructed from author, section, and traffic from device, the hierarchy in which [This] sits also includes author, section, and device individually, as well as all possible combinations of 2 variables.
If we assume an ever changing number of
variables [This] can be constructed for, in
order to construct a hierarchy of all possible
[This] elements, the following applies:
$$\#V(\text{This}) = \sum_{k=1}^{n} \frac{n!}{k!\,(n-k)!}$$
Variables   Queries
1           1
2           3
3           7
5           31
10          1,023
15          32,767
20          1,048,575
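The table can be checked by enumerating the hierarchy directly; the sum of binomial coefficients equals 2^n − 1. In this sketch the function names are hypothetical and the variable names are taken from the example on the slide.

```python
from itertools import combinations
from math import comb


def all_this_elements(variables):
    """Every non-empty combination of variables, i.e. every level of the hierarchy."""
    return [
        combo
        for k in range(1, len(variables) + 1)
        for combo in combinations(variables, k)
    ]


def query_count(n):
    """Number of [This] definitions for n variables: sum of C(n, k) for k = 1..n."""
    return sum(comb(n, k) for k in range(1, n + 1))


elements = all_this_elements(["author", "section", "device"])
print(len(elements))  # singles, pairs, and the full triple
for n in (1, 2, 3, 5, 10, 15, 20):
    print(n, query_count(n), 2 ** n - 1)
```

Each added variable doubles the number of queries (plus one), which is why an ever-changing set of variables is a challenge.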
C H A L L E N G E S
Building Data Products
Currently we are looking for interesting [This] in a very simple context. We define a [Segment], which can be any type of user (in our case a loyal user), and we measure how strong their preference for [This] is compared with the general preference for [This] in the whole population.
And the results are exciting. Sometimes [This] is clearly interesting because of one of its parents.
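One simple way to express "preference over the general preference" is a ratio of shares. This is an assumption about the measure, not the definition used in the talk, and the pageview counts below are invented.

```python
def preference_ratio(segment_hits, segment_total, population_hits, population_total):
    """Share of the segment's pageviews on [This] relative to the population's share.

    A value above 1 means the segment prefers [This] more than the population does.
    """
    segment_share = segment_hits / segment_total
    population_share = population_hits / population_total
    return segment_share / population_share


# Hypothetical counts: pageviews on one author's articles
loyal_ratio = preference_ratio(
    segment_hits=300, segment_total=1_000,          # loyal users
    population_hits=1_500, population_total=10_000  # everyone
)
print(loyal_ratio)
```

Computing the same ratio for each parent of [This] gives a first check on whether the effect belongs to [This] alone or is inherited from a parent.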
To be continued…
Thank you for your time!
Roman Gavuliak
Lead Data Scientist
@rgavuliak
P I A N O . I O