piano rubyslava final


2 0 1 5 P I A N O . I O

Data Science @ Rubyslava

October 2015


Overview

R U B Y S L A V A

01 Intro

02 Data

03 Methods

04 Outputs

05 Challenges


P I A N O F R A M E W O R K

What Data Science?


Why do we need analytics (data science)


[Funnel: PAID USERS, REGISTERED, ENGAGED, CASUAL]

1. What is my potential? (Potential Clients)

2. X has happened, why? Is Y a good idea? (Existing Clients, Account managers)

3. Input into products


D A T A

What data do we collect?


1. Transactional Data

Subscriptions, Users

2. System settings

Products, Clients, Offers, Messaging

3. Clickstream Data

Pageviews, user agents

4. Conversion Data

Steps in a conversion funnel

Pageviews: MongoDB

Transactions + System settings: PostgreSQL

Conversion Data: Google Analytics

Pageviews: Amazon S3 (Cassandra)

Transactions + System settings: Oracle

Conversion Data: Amazon S3 (Cassandra)

Pageviews + Conversion Data:

BigQuery (Cassandra)

Transactions + System settings: MySQL

?

And then you wonder why 60 to 80 % of a data scientist’s job is cleaning and merging datasets…

Piano

Press+

Tinypass

[Chart: % Paying Users by # Devices (1 to 10+)]

What are we even measuring?


1. Cookies …

… can be deleted

2. Fingerprinting …

… doesn’t work for mobile

3. IP address …

… shared networks, proxies, VPNs

4. Registration …

… Try convincing a publisher all their readers should register

Ref: ‘Not your business’

Stats > Math


It is important to understand the data generating process


We are looking at how much users read in order to estimate the ideal setting for a metered paywall. In this particular case we realized that the site refreshes after every 5 minutes of user inactivity, so refreshes are counted as pageviews.

[Chart: % of Users reading articles by Readership Level (1+ to 20+), comparing All PVs with Without Refreshed PVs]

Domain knowledge and observations


Being able to perform a calculation doesn’t mean you should


A session is defined as a stream of actions by one user, with delays of less than 30 minutes between consecutive actions, tied together by consecutive referrers.

It can also be tracked via a session cookie.

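-- MySQL user variables carry state from row to row (rows ordered by uid, t)
create table pvs_with_visits
select
  -- keep the session id while the user is the same and the gap is at most 1800 s
  @sessionId := if(@prevUser = uid AND diff <= 1800, @sessionId, @sessionId + 1) as sessionId,
  @prevUser := uid AS uid,
  url, t, diff, rldiff, ref, dt, sid
from
  (select @sessionId := 0, @prevUser := '-') b
join
  (select
     -- seconds since the same user's previous pageview
     TIME_TO_SEC(if(@prevU = uid, TIMEDIFF(t, @prevD), '00:00')) as diff,
     if(@prevU = uid AND @prevrl != 1000, @prevrl - a_rl, 0) as rldiff,
     @prevU := uid as uid,
     @prevD := t as t,
     @prevrl := a_rl,
     url, ref, dt, sid
   from pageviews
   join (select @prev := 0, @prevU := '-') a
   order by uid, t
  ) a;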

Have mercy on your analysts

select *
from (SELECT
  *,
  FIRST_VALUE(referrerSegmentId) OVER (PARTITION BY uid, session_order order by datetime) AS session_ref,
  FIRST_VALUE(url_class) OVER (PARTITION BY uid, session_order order by datetime) AS session_start_class
FROM (
  SELECT
    *,
    MAX(session_order) OVER (PARTITION BY uid) AS n_sessions_of_user
  FROM (
    SELECT
      *,
      ROW_NUMBER() OVER (PARTITION BY session_id ORDER BY datetime_string) AS pvs_order_in_visit
    FROM (
      SELECT
        *,
        CONCAT(uid, '-', CAST(session_order AS STRING)) AS session_id
      FROM (
        SELECT
          *,
          session_order_anc + 1 AS session_order
        FROM (
          SELECT
            *,
            SUM(new_event_boundary) OVER (PARTITION BY uid ORDER BY datetime_string) AS session_order_anc
          FROM (
            SELECT
              *,
              (datetime_ut - lag_1)/1000000/60 AS minutes_since_last_interval,
              CASE WHEN (datetime_ut - lag_1)/1000000 > 30 * 60 THEN 1 ELSE 0 END AS new_event_boundary
            FROM (
              SELECT
                *,
                LAG(datetime_ut) OVER (PARTITION BY uid ORDER BY datetime_string) AS lag_1,
                MONTH(datetime_string) AS month
              FROM (
                SELECT
                  *,
                  datetime AS datetime_string,
                  TIMESTAMP(datetime) AS datetime_ut
                FROM
                  [tmp.pageviews_r])))))))))",
project = project;
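The 30-minute session rule can also be sketched outside SQL. The following is a minimal Python sketch, not the queries from the slide: it assumes pageviews arrive as hypothetical (uid, unix_timestamp) pairs, applies only the time-gap part of the definition, and ignores the referrer condition.

```python
from itertools import groupby

SESSION_GAP_SECONDS = 30 * 60  # a gap longer than 30 minutes starts a new session


def sessionize(pageviews):
    """Assign a per-user session number to each pageview.

    pageviews: iterable of (uid, unix_timestamp) pairs (hypothetical input shape).
    Returns a list of (uid, unix_timestamp, session_order), sorted by uid and time.
    """
    result = []
    ordered = sorted(pageviews)  # sort by uid, then by timestamp
    for uid, events in groupby(ordered, key=lambda pv: pv[0]):
        session_order = 0
        previous_ts = None
        for _, ts in events:
            # The first pageview of a user, or a long gap, opens a new session.
            if previous_ts is None or ts - previous_ts > SESSION_GAP_SECONDS:
                session_order += 1
            result.append((uid, ts, session_order))
            previous_ts = ts
    return result


sessions = sessionize([("a", 0), ("a", 600), ("a", 3000), ("b", 100)])
```

The running counter plays the same role as the cumulative sum over new_event_boundary in the window-function query.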


Storage


• DO NOT keep information only in “Application Logic”

Period_Type_ID   Decoding
1                Day(s)
2                Week
3                Every 10 days
4                Every other week
5                Every 15 days
6                Every 20 days
7                Month(s)/30-day
8                Month(s) Actual
9                2 Months/60-day
10               2 Months Actual
11               3 Months/90-day
12               3 Months Actual
13               6 Months/180-day
14               6 Months Actual
15               Year(s) 365-days

Duration = Period_Type * Cycle_Count
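One way to keep such a decoding out of application logic is to store it as data next to the records. The sketch below is only an illustration under assumptions: the day counts for "Week" (7) and "Every other week" (14) are read off the labels, and the calendar-based "Actual" types are left out because they depend on the start date.

```python
# Fixed-length period types from the decoding table: ID -> days per cycle.
# The "Actual" calendar types (8, 10, 12, 14) are deliberately not covered.
PERIOD_DAYS = {
    1: 1,     # Day(s)
    2: 7,     # Week (assumed 7 days)
    3: 10,    # Every 10 days
    4: 14,    # Every other week (assumed 14 days)
    5: 15,    # Every 15 days
    6: 20,    # Every 20 days
    7: 30,    # Month(s)/30-day
    9: 60,    # 2 Months/60-day
    11: 90,   # 3 Months/90-day
    13: 180,  # 6 Months/180-day
    15: 365,  # Year(s) 365-days
}


def duration_days(period_type_id, cycle_count):
    """Duration = period length * cycle count, for fixed-length period types."""
    if period_type_id not in PERIOD_DAYS:
        raise ValueError(f"period type {period_type_id} needs calendar logic")
    return PERIOD_DAYS[period_type_id] * cycle_count


quarter = duration_days(7, 3)  # three 30-day months
```

With the mapping stored in a table, an analyst can join on Period_Type_ID instead of reverse-engineering the application code.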


Storage


• DO store history, either in the form of delta logs or record validity dates

• As is versus as is

• As is versus as was

• As was versus as was

Payment_id   Duration   Fee     Start Date    End Date
123456       366        39.00   1. 1. 2015    31. 1. 2015

Cancellation on 30. 6. 2015

Payment_id   Duration   Fee     Start Date    End Date
123456       180        19.18   1. 1. 2015    30. 6. 2015
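Record validity dates make "as was" questions answerable after a record has been overwritten. The sketch below is a hypothetical layout, not the actual schema: the field names valid_from and valid_to, and the choice of 30 June 2015 as the switch date, are assumptions for illustration.

```python
from datetime import date

# Every version of the payment record is kept, with the dates between which
# it was the current version. valid_to = None marks the live version.
payment_history = [
    {"payment_id": 123456, "duration": 366, "fee": 39.00,
     "valid_from": date(2015, 1, 1), "valid_to": date(2015, 6, 30)},
    {"payment_id": 123456, "duration": 180, "fee": 19.18,
     "valid_from": date(2015, 6, 30), "valid_to": None},
]


def as_of(history, payment_id, when):
    """Return the version of a payment that was current on the date `when`."""
    for row in history:
        if row["payment_id"] != payment_id:
            continue
        started = row["valid_from"] <= when
        not_ended = row["valid_to"] is None or when < row["valid_to"]
        if started and not_ended:
            return row
    return None


as_was = as_of(payment_history, 123456, date(2015, 3, 1))  # before the cancellation
as_is = as_of(payment_history, 123456, date(2015, 7, 1))   # after the cancellation
```

The fee in the second record is consistent with pro-rating the original fee: 39.00 * 180 / 366 rounds to 19.18.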


1.3 Billion rows @ 25 columns

3 months of US clickstream data

150 GB gzipped ≈ 1.5 TB full

1.3 Billion rows @ 6 columns

3 months of US clickstream data

252 GB full

614M rows @ 10 columns

3 months of session data

78.1 GB

340M rows @ 15 columns

3 months of user data

49.5 GB

3170 rows @ 31 columns

1.3 MB


On Big Data


• Do we really have Big Data? (50 %)

• Do we need to work with it in Big Data form? (1 % – 5 %)

Your big data might be quite small…


M E T H O D S

If you can write a for loop you can do Data Science


What is a p-value?

One simple trick,

the statisticians hate it!




Is there a significant difference between average CPU usage?

Data Centre 1   Data Centre 2
84%  72%        81%  69%
57%  46%        74%  61%
63%  76%        56%  87%
99%  91%        69%  65%
                66%  44%
                62%  69%

Mean Data Centre 1: 73.5 %

Mean Data Centre 2: 66.9 %

Difference: 6.6 %

source: https://speakerdeck.com/jakevdp/statistics-for-hackers




What would statistics do?

$$t = \frac{73.5 - 66.9}{\sqrt{\frac{316.88}{8} + \frac{124.81}{12}}} \approx 0.932$$

Significant if $t > t_{crit}$

0.932 < 1.796: the difference is not significant

source: https://speakerdeck.com/jakevdp/statistics-for-hackers
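The classical calculation can be checked in a few lines. This is a minimal sketch of Welch's t statistic using only the Python standard library; the two lists are the CPU readings from the data centre example, and the sample variance (dividing by n - 1) is assumed.

```python
from math import sqrt
from statistics import mean, variance

# CPU usage readings (in %) from the data centre example.
dc1 = [84, 72, 57, 46, 63, 76, 99, 91]
dc2 = [81, 69, 74, 61, 56, 87, 69, 65, 66, 44, 62, 69]


def welch_t(a, b):
    """Welch's t statistic: difference of means over its standard error."""
    standard_error = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error


t_statistic = welch_t(dc1, dc2)
```

The statistic comes out at roughly 0.93, below the critical value of 1.796 quoted above.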




What can for loop do?

1. Shuffle the observations between 2 groups randomly

2. Compute means for each group

3. Compute the differences of the means

4. Repeat n times (n can be 10 000)

An approach used in real scientific papers

source: https://speakerdeck.com/jakevdp/statistics-for-hackers
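The four steps translate almost directly into a loop. The sketch below is one possible Python version using the data centre readings; the final counting step (how often a shuffled difference is at least as large as the observed one) is not on the slide and is added here as the usual way to read off the result. The seed is fixed only to make the run repeatable.

```python
import random

dc1 = [84, 72, 57, 46, 63, 76, 99, 91]
dc2 = [81, 69, 74, 61, 56, 87, 69, 65, 66, 44, 62, 69]


def average(values):
    return sum(values) / len(values)


def permutation_test(a, b, n=10_000, seed=0):
    """Return the observed difference of means and the share of shuffles
    whose difference is at least as large (a one-sided p-value estimate)."""
    rng = random.Random(seed)
    observed = average(a) - average(b)
    pooled = list(a) + list(b)
    at_least_as_large = 0
    for _ in range(n):
        rng.shuffle(pooled)                      # 1. shuffle between the groups
        shuffled_a = pooled[:len(a)]
        shuffled_b = pooled[len(a):]
        diff = average(shuffled_a) - average(shuffled_b)  # 2. and 3.
        if diff >= observed:
            at_least_as_large += 1
    return observed, at_least_as_large / n       # 4. repeated n times


observed_diff, p_value = permutation_test(dc1, dc2)
```

If the returned share is well above the chosen significance level, the observed difference is the kind of gap random relabelling produces regularly.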




A more interesting application

1. Take the dataset you’ve built the model on

2. Shuffle Y values

3. Build the model from random data

In what % of cases did you build a model better than the original one?

Call:
glm(formula = DD_index ~ ., data = perm_data_dummies)

Deviance Residuals:
     Min        1Q    Median        3Q       Max
-0.71028  -0.07349   0.00514   0.08096   0.63189

Coefficients: (1 not defined because of singularities)
                          Estimate Std. Error t value Pr(>|t|)
(Intercept)              4.052e-01  7.716e-02   5.251 1.52e-07 ***
`authorAdam Molon`      -4.710e-01  5.098e-02  -9.239  < 2e-16 ***
`authorAlexandra Gibbs` -2.160e-02  1.546e-02  -1.397 0.162359
`authorAlex Crippen`     1.025e-01  1.647e-02   6.220 5.02e-10 ***
`authorAlex Rosenberg`   5.432e-02  1.495e-02   3.634 0.000279 ***
…
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for gaussian family taken to be 0.01748868)

    Null deviance: 3115.15  on 35716  degrees of freedom
Residual deviance:  608.73  on 34807  degrees of freedom
AIC: -42258

Number of Fisher Scoring iterations: 2
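The same shuffle idea can be sketched for a model. The example below is not the glm from the slide: it swaps in a one-variable least-squares fit scored by R², on hypothetical toy data, purely to show the mechanics of shuffling Y, refitting, and counting how often the random model does at least as well.

```python
import random


def r_squared(x, y):
    """R^2 of a one-variable least-squares fit y ~ a + b * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return (sxy * sxy) / (sxx * syy)


def share_of_better_random_models(x, y, n=1000, seed=0):
    """Shuffle y, refit, and return (original R^2, share of shuffles that
    reach at least the original R^2)."""
    rng = random.Random(seed)
    original = r_squared(x, y)
    shuffled = list(y)
    better = 0
    for _ in range(n):
        rng.shuffle(shuffled)          # break the link between x and y
        if r_squared(x, shuffled) >= original:
            better += 1
    return original, better / n


# Hypothetical toy data: y depends on x plus noise.
noise = random.Random(1)
x = [float(i) for i in range(30)]
y = [2.0 * xi + noise.gauss(0, 5) for xi in x]

original_r2, share = share_of_better_random_models(x, y)
```

A share close to zero suggests the original fit is unlikely to be an accident of the data; a large share suggests the model is fitting noise.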


O U T P U T S

Health Diagnostics for Sites with a Paywall: Approach 1


Can we, without doing a deep dive, identify areas for improvement for any of our approx. 1200 (600) sites?

• Percentiles

• One-dimensional view

• Not everyone can be above average

$$\text{Stop Rate} = \frac{\text{Unique Visitors being Stopped}}{\text{Unique Visitors}}$$

[Chart: Stop Rate (0% to 8%) at site percentiles from 95% down to 5%, with a “You are here” marker]
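The percentile view amounts to computing one KPI per site and locating a site within the distribution of all sites. A minimal sketch, assuming hypothetical stop rates for five sites and a simple "share of sites at or below this value" definition of percentile rank:

```python
def stop_rate(unique_visitors_stopped, unique_visitors):
    """Stop Rate = unique visitors being stopped / unique visitors."""
    return unique_visitors_stopped / unique_visitors


def percentile_rank(value, population):
    """Percentage of sites whose KPI is less than or equal to `value`."""
    at_or_below = sum(1 for v in population if v <= value)
    return 100.0 * at_or_below / len(population)


# Hypothetical stop rates of five sites.
all_site_rates = [0.01, 0.02, 0.03, 0.05, 0.08]

my_rate = stop_rate(30, 1000)
my_rank = percentile_rank(my_rate, all_site_rates)
```

This is the one-dimensional view from the slide: each KPI is ranked on its own, with no relation to the site's other KPIs.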


• Clustering (PAM -> Fuzzy)

• KPIs in relation to each other

• Easy to read (or so we thought)

• Too much variation

[Charts: Site 1, Site 2]


• Site similarity (Size, Age)

[Chart: Example of a site with good 3rd-party benchmarks (Site 1)]

[Chart: Example of a “lonely” site (xyx)]




Can we, without doing a deep dive, identify areas for improvement for any of our approx. 1200 (600) sites?

• Compare to:

• Similar Sites

• Sites of the same Publisher

• Worst Site

• Best Site

• Display multiple KPIs in different units in one chart


C H A L L E N G E S

Building Data Products


What if we didn’t have to analyze the data? What if we could just say: [This] is interesting?

[This] can be a combination of multiple variables such as author, section, traffic from device and anything else.

[This] sits in a hierarchy.

We want to know that [This] is interesting because of [This] alone, not because of [Parent(s) of This] or [Child of This].




If [This] is constructed from author, section and traffic from device, the hierarchy in which [This] sits also includes author, section and device individually, as well as all possible combinations of 2 variables.

If we assume an ever-changing number of variables that [This] can be constructed from, then to construct a hierarchy of all possible [This] elements the following applies:

$$\#V_{\text{This}} = \sum_{k=1}^{n} \frac{n!}{k! \, (n-k)!}$$

Variables   Queries
1           1
2           3
3           7
5           31
10          1,023
15          32,767
20          1,048,575
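The growth in the table can be reproduced by summing binomial coefficients, which equals 2^n - 1 (every non-empty subset of the variables). A short Python sketch; the variable names author, section and device come from the example above, and the function names are illustrative only:

```python
from itertools import combinations
from math import comb


def number_of_queries(n):
    """Sum over k = 1..n of n! / (k! * (n - k)!)."""
    return sum(comb(n, k) for k in range(1, n + 1))


def all_combinations(variables):
    """Every non-empty combination of variables, i.e. every level of the hierarchy."""
    return [
        combo
        for k in range(1, len(variables) + 1)
        for combo in combinations(variables, k)
    ]


levels = all_combinations(["author", "section", "device"])
queries_for_20 = number_of_queries(20)
```

Each added variable roughly doubles the number of queries, which is why the hierarchy becomes expensive so quickly.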




Currently we are looking for interesting [This] in a very simple context. We define a [Segment], which can be any type of user (in our case a loyal user), and we measure how strong their preference for [This] is relative to the general preference for [This] in the whole population.

And the results are exciting. Sometimes [This] is clearly interesting because of one of its parents.
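One simple way to express "preference over the general preference" is a ratio of shares. The sketch below is an assumption about how such a measure could look, not the measure actually used: it divides the segment's share of pageviews on [This] by the whole population's share, with made-up counts.

```python
def preference_ratio(segment_hits, segment_total, population_hits, population_total):
    """Share of the segment's pageviews on [This], divided by the share of the
    whole population's pageviews on [This]. A value above 1 means the segment
    prefers [This] more than the population does."""
    segment_share = segment_hits / segment_total
    population_share = population_hits / population_total
    return segment_share / population_share


# Hypothetical counts: loyal users spend 50 of 100 pageviews on [This],
# while the whole population spends 250 of 1000.
ratio = preference_ratio(50, 100, 250, 1000)
```

Computing the same ratio for the parents of [This] gives a way to see whether the preference is specific to [This] or inherited from a parent.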

To be continued…


Thank you for your time!

Roman Gavuliak

Lead Data Scientist

@rgavuliak
