data janitor 101


Upload: daniel-molnar

Post on 16-Apr-2017


Data Janitor 101
Daniel Molnar, Microsoft

Data Natives 2016

tl;dr
- KISS is the philosophy,
- take the long view, invest in durable knowledge,
- strive for fast and good enough,
- just because you can doesn't mean you should.

CAP #1
BUSINESS ANALYST

"... American MBA? ... if you don’t understand something it must be simple and only take five minutes."
Sean Murphy, PingThings

Don't
- unicorn my a**,
- hockey stick here for me,
- skip leg day.

Do
- make definitions,
- show direction,
- care about data quality,
- rule dashboards.

KPIs that matter
- DAU, WAU, MAU, LTV, churn,
- cohorts, segments, funnels,
- first hour, first day.
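
A minimal sketch of how the activity KPIs above might be computed from a raw event log. The `EVENTS` data, the trailing 1/7/30-day windows, and the churn definition (users active in the previous window but not in the current one) are illustrative assumptions, not definitions taken from the talk.

```python
from datetime import date, timedelta

# Hypothetical event log: (user_id, activity_date) pairs.
EVENTS = [
    ("u1", date(2016, 9, 10)),
    ("u4", date(2016, 9, 12)),
    ("u1", date(2016, 10, 1)),
    ("u2", date(2016, 10, 1)),
    ("u1", date(2016, 10, 20)),
    ("u3", date(2016, 10, 25)),
    ("u1", date(2016, 10, 26)),
    ("u2", date(2016, 10, 26)),
]


def active_users(events, end_day, window_days):
    """Distinct users with an event in the trailing window ending on end_day (inclusive)."""
    start_day = end_day - timedelta(days=window_days - 1)
    return {user for user, day in events if start_day <= day <= end_day}


def kpis(events, day):
    """DAU, WAU and MAU as counts of distinct users (assumed 1/7/30-day windows)."""
    return {
        "DAU": len(active_users(events, day, 1)),
        "WAU": len(active_users(events, day, 7)),
        "MAU": len(active_users(events, day, 30)),
    }


def churn_rate(events, day, window_days=30):
    """Share of the previous window's users who are absent from the current window."""
    current = active_users(events, day, window_days)
    previous = active_users(events, day - timedelta(days=window_days), window_days)
    if not previous:
        return 0.0
    return len(previous - current) / len(previous)


if __name__ == "__main__":
    today = date(2016, 10, 26)
    print(kpis(EVENTS, today))
    print("churn:", churn_rate(EVENTS, today))
```

Writing the definitions down as code is one way to "make definitions": the window lengths and the meaning of churn become explicit and reviewable instead of living in a dashboard setting.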

Approach
- KPIs must hurt (aka no feelgood metrics),
- you are what you measure,
- you can run in one direction,
- is it actionable (the Friday 1700 test).

Toolset
- Excel,
- SQL,
- Metabase.

Heroes of the day
Joel Spolsky: You Suck at Excel
Dan McKinley: Data Driven Products Now!

CAP #2
DATA ENGINEER

"Don't reinvent the flat tyre."
Alan Kay

Don't
- just Apache it,
- build a Hadoop JENGA (10x-235x slow),
- real-time it,
- stream it,
- overengineer it.

Do
- embrace dirty reality (entity recognition makes a data engineer),
- ETL, events and DWH,
- data quality (know your leakage),
- testing (yes, you can even unit test data).
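
One way to read "unit test data" is to treat each batch of rows like code under test and assert invariants on it. The sketch below is only an illustration: the `order_id` and `amount` fields, the sample rows, and the three rules (key present, key unique, amount non-negative) are hypothetical and not from the talk.

```python
def find_violations(rows):
    """Return human-readable data-quality violations for a batch of rows."""
    violations = []
    seen_ids = set()
    for i, row in enumerate(rows):
        order_id = row.get("order_id")
        if order_id is None:
            violations.append(f"row {i}: missing order_id")
        elif order_id in seen_ids:
            violations.append(f"row {i}: duplicate order_id {order_id}")
        else:
            seen_ids.add(order_id)

        amount = row.get("amount")
        if amount is None or amount < 0:
            violations.append(f"row {i}: invalid amount {amount!r}")
    return violations


# Hypothetical batches: one clean, one with three deliberate defects.
CLEAN = [{"order_id": 1, "amount": 9.5}, {"order_id": 2, "amount": 0.0}]
DIRTY = [
    {"order_id": 1, "amount": 9.5},
    {"order_id": 1, "amount": 3.0},     # duplicate key
    {"order_id": None, "amount": 2.0},  # missing key
    {"order_id": 4, "amount": -1.0},    # negative amount
]

if __name__ == "__main__":
    assert find_violations(CLEAN) == []
    for problem in find_violations(DIRTY):
        print(problem)
```

A check like this can run as a step in a batch ETL job, so a bad load fails loudly instead of leaking into the warehouse.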

Approach
- avoid GIGO,
- pedal to the metal, skip the overhead,
- know that big RAM is eating big data,
- use open source, pragmatic, cloud service agnostic tools.

Toolset
- UNIX (bash, make),
- Python,
- SQL,
- ETL in batch (mETL, night-shift),
- event tracking (Hamustro, logsanitizer, RPi?),
- DWH = MPP SQL (Azure DWH, Redshift, Vertica...).

Heroes of the day
James Mickens: Computers are a Sadness, I am the Cure
Dan McKinley: Choose Boring Technology
David Beazley: Discovering Python

CAP #3
DATA SCIENTIST

"Friends don’t let friends calculate p-values (without fully understanding them)."
Scott Weingart

Don't
- expect CSVs and produce models whatever it takes,
- expect that you have to explore the laws of Universe,
- forget about Occam's razor,
- A/B test (only if it REALLY REALLY makes sense).

Do
- user testing to define context (usertesting.com),
- talk to users via surveys,
- embed yourself in departments (personas),
- have common sense.

Approach
- you mostly tell what not to do,
- it's hard, but still the only way,
- persist when not finding anything or trivialities,
- kill the lurking causation.

A/B
- think twice about TCO,
- the world isn’t identically distributed,
- random variation will cheat you in small samples,
- most A/B test results are illusory,
- small data -> go Bayesian = less certainty.
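
The last two points can be made concrete with a small Beta-Binomial sketch. Everything here is an assumption for illustration: the uniform Beta(1, 1) prior, the conversion counts, the fixed seed, and the helper name `prob_b_beats_a`. It estimates the posterior probability that variant B converts better than variant A, once with 20 users per arm and once with 2000 users per arm at the same observed rates.

```python
import random


def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=20000, seed=42):
    """Monte Carlo estimate of P(rate_B > rate_A) under uniform Beta(1, 1) priors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # With a Beta(1, 1) prior, the posterior of a conversion rate is
        # Beta(1 + successes, 1 + failures).
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        if rate_b > rate_a:
            wins += 1
    return wins / draws


# Hypothetical results: 15% vs 25% conversion at two sample sizes.
small = prob_b_beats_a(3, 20, 5, 20)
large = prob_b_beats_a(300, 2000, 500, 2000)

if __name__ == "__main__":
    print(f"20 users per arm:   P(B > A) ~ {small:.2f}")
    print(f"2000 users per arm: P(B > A) ~ {large:.2f}")
```

With the small sample the posterior still leaves a substantial chance that B is no better than A, even though the observed rates look very different; with the large sample the same rates become close to certain. That is the "less certainty" the slide refers to: the Bayesian answer is honest about how little 20 users can tell you.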

Toolset
- SQL,
- Wizard,
- Python,
- R (only to anger CS peeps).

Heroes of the day
Evan Miller: Wizard Statistical Analyzer
Chris Stucchio: talks and posts on testing

CAP #4
Machine Learning

Don't
- need a PhD,
- develop new unique matrix algos, please,
- need more than Excel,
- give false hope.

Do
- deploy good enough fast,
- copy Kaggle (ensembles, random forest, XGBoost),
- feature engineer,
- build core data/feature (augment and enhance).

Approach
- the Mailchimp way (offline built model redeployed each quarter),
- hybrid approaches (domain expert, vanilla ML),
- you are a machine instructor,
- TensorFlow (logic to clients, handle models).

Toolset
- Excel,
- Wizard,
- BigML,
- Python.

Heroes of the day
John Foreman: Data Smart
Jeroen Janssens: Data Science at the Command Line

CAP #5
HEAD OF DATA

"In god we trust, everybody else bring data to the table."
W. Edwards Deming

Don't
- believe the hype,
- trust no-one, just benchmarks,
- let black box take over,
- expect hiring to be easy.

Do
- maintain data mythology,
- keep the view backwards straight,
- expect emotions,
- see the future.

Approach
- train to be the bearer of the bad news,
- laugh at endless growth without saturation,
- handle the cargo cult (inverse causality).

Marketing
- Google Analytics (sampling, off by 20%, no user granularity, no raw, 150k per year),
- CPA, FB CPA, mobile CPA, conversion, attribution,
- Net Promoter Score.

Heroes of the day
Dan Lyons: Disrupted
Venkatesh Rao: The Gervais Principle

Thank you!
@soobrosa

visuals: @xkcd, @DorsaAmir, ˙Cаvin 〄, thelearningcurvedotca, JD Hancock, Thomas Hawk, jonolist, Kalexanderson