oscon 2015

Measuring Big Data Understanding data by usage Charles Smith Big Data Platform Architecture - Netflix


Post on 17-Aug-2015



Measuring Big Data: Understanding data by usage

Charles Smith, Big Data Platform Architecture - Netflix

About Me

▪Netflix

- I joined Netflix in 2011

- I spend my time working to make big data easy and efficient

- Usually from the perspective of someone trying to use the platform

▪University of Florida

- Research in Information Retrieval

- How much information does a document have

What would you measure?

What do you want to know?

~20 PB of compressed data

~500 billion events a day

~18K data sets

~4200 nodes in our clusters

Our largest two datasets:

1.4 PB

1.2 PB

~11K Hive

~3K Pig

~2.5K Presto

Task Hour Cost = (cost of node)/(tasks per node) * sum(task duration ms)/(60*60*1000)
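As a rough illustration of the formula above, here is a minimal Python sketch. The function and parameter names are hypothetical, and it assumes "cost of node" means the cost of one node for one hour, which the slide does not state.

```python
MS_PER_HOUR = 60 * 60 * 1000


def task_hour_cost(node_cost_per_hour, tasks_per_node, task_durations_ms):
    """Cost of a job from the durations of its tasks.

    Assumption: node_cost_per_hour is the price of one node for one hour,
    and tasks_per_node is how many tasks a node runs concurrently.
    """
    cost_per_task_hour = node_cost_per_hour / tasks_per_node
    task_hours = sum(task_durations_ms) / MS_PER_HOUR
    return cost_per_task_hour * task_hours


# Made-up numbers: two tasks that together ran for two hours
# on nodes that each host 8 tasks.
example_cost = task_hour_cost(2.0, 8, [1_800_000, 5_400_000])
```

Dividing the node cost by the tasks per node gives the price of one task slot for an hour, so the result scales with the task time a job actually consumed rather than with wall-clock time.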

100 Jobs comprise 86% of the cost
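A concentration figure like this can be computed once every job has a cost. The sketch below is one possible way to do it, not the method used in the talk; the function name and the sample costs are invented for illustration.

```python
def top_n_cost_share(job_costs, n):
    """Fraction of total cost attributable to the n most expensive jobs.

    job_costs maps a job identifier to its cost (hypothetical input shape).
    """
    total = sum(job_costs.values())
    if total == 0:
        return 0.0
    top = sorted(job_costs.values(), reverse=True)[:n]
    return sum(top) / total


# Made-up costs: the two most expensive jobs account for 80 of 100 units.
sample_costs = {"job_a": 50, "job_b": 30, "job_c": 15, "job_d": 5}
share = top_n_cost_share(sample_costs, 2)
```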

What data is important?

Make people tell you the answer: tagging.

Manual data doesn’t stay current unless it needs to.

How do we actually use the data?

Parse the job (or ask the tool that parses it)

Charlotte

Presto

Sql Parser (Hive)

Sql Parser (Teradata)

Lipstick (Pig)

Metacat*

Dataset                         Distinct Queries
…                               2000
…                               1052
prodhive/dse/geo_country_d      1009
prodhive/dse/ttl_title_d        580
…                               565
…                               512
…                               466
…                               427
…                               395
…                               317

Dataset                         Queries
prodhive/dse/geo_country_d      11405
prodhive/dse/ttl_title_d        8194
…                               5928
…                               5451
…                               4849
…                               4654
…                               4334
…                               3620
…                               3046
…                               2823
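Both rankings (queries and distinct queries per dataset) can be derived from a log of parsed queries. The sketch below assumes a hypothetical log of (query text, input datasets) pairs and treats two queries as the same when their text is identical. The talk does not say how "distinct" was defined, so that rule is an assumption.

```python
from collections import defaultdict


def usage_counts(query_log):
    """Return {dataset: (total_queries, distinct_queries)}.

    query_log is an iterable of (query_text, datasets) pairs, where
    datasets lists the inputs a parser found for that query (assumed shape).
    """
    total = defaultdict(int)
    distinct = defaultdict(set)
    for query_text, datasets in query_log:
        # Count a dataset once per query even if it is referenced repeatedly.
        for dataset in set(datasets):
            total[dataset] += 1
            distinct[dataset].add(query_text)
    return {ds: (total[ds], len(distinct[ds])) for ds in total}


# Made-up log: the same query text runs twice, then a different query once.
sample_log = [
    ("select a", ["table_x", "table_y"]),
    ("select a", ["table_x", "table_y"]),
    ("select b", ["table_x"]),
]
counts = usage_counts(sample_log)
```

A scheduled job that reruns the same text raises the total count but not the distinct count, which is one way the two rankings can differ.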

Related To geo_country_d            Shared Queries
prodhive/dse/ttl_title_country_r    2277
…                                   1697
prodhive/dse/ttl_show_d             1540
prodhive/dse/ttl_season_d           1405
prodhive/dse/ttl_title_d            1392
…                                   926
…                                   817
…                                   743
prodhive/dse/ttl_season_country_r   638
…                                   628
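A "shared queries" count of this kind amounts to co-occurrence counting: for each query that reads the target dataset, count every other dataset it also reads. The sketch below is a minimal version under that assumption; the function name and the short dataset names are hypothetical.

```python
from collections import Counter


def related_datasets(query_inputs, target):
    """Rank datasets by how many queries read them together with target.

    query_inputs is an iterable of dataset lists, one list per query
    (assumed shape).
    """
    shared = Counter()
    for datasets in query_inputs:
        unique = set(datasets)
        if target in unique:
            shared.update(unique - {target})
    return shared.most_common()


# Made-up inputs for four queries.
sample_inputs = [
    ["geo", "title"],
    ["geo", "title", "show"],
    ["title", "show"],
    ["geo"],
]
ranking = related_datasets(sample_inputs, "geo")
```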

Datasets                        Input Jobs   Queries
prodhive/cdn/occ…               2016         66
teradata/gdw_stg_prod/seg…      1587         36
prodhive/dse/msg…               1527         14
prodhive/dse/msg…               1512         30
teradata/gdw_stg_prod/seg…      1043         50
teradata/gdw_stg_prod/cdn…      970          10
teradata/gdw_tbl_prod/seg…      903          1
prodhive/rpt/pbe…               811          11
prodhive/gps/gro…               904          137
prodhive/cdn/ttl…               631          39

Challenges

▪Knowing what questions you should try to answer.

▪Getting this data isn’t easy.

▪The data is noisy.

Thanks

▪Charles Smith – Big Data Platform Architecture - Netflix

▪@charles_s_smith