OSCON 2015

Posted on 17-Aug-2015

TRANSCRIPT

  1. Measuring Big Data: Understanding data by usage. Charles Smith, Big Data Platform Architecture, Netflix
  2. About Me. Netflix: I joined Netflix in 2011; I spend my time working to make big data easy and efficient, usually from the perspective of someone trying to use the platform. University of Florida: research in Information Retrieval (how much information does a document have?).
  3. What would you measure?
  4. What do you want to know?
  5. ~20 PB of compressed data; ~500 billion events a day; ~18K data sets; ~4200 nodes in our clusters
  6. Our largest two datasets: 1.4 PB and 1.2 PB
  7. ~11K Hive, ~3K Pig, ~2.5K Presto
  8. Task Hour Cost = (cost of node) / (tasks per node) * sum(task duration ms) / (60*60*1000)
  9. 100 jobs comprise 86% of the cost
  10. What data is important?
  11. Make people tell you the answer: tagging.
  12. Manual data doesn't stay current unless it needs to.
  13. How do we actually use the data?
  14. Parse the job (or ask the tool that parses it)
  15. Charlotte Presto Sql Parser (Hive) Sql Parser (Teradata) Lipstick (Pig) Metacat*
  16. Distinct queries per dataset: (name not shown) 2000; (name not shown) 1052; prodhive/dse/geo_country_d 1009; prodhive/dse/ttl_title_d 580; (names not shown) 565, 512, 466, 427, 395, 317
  17. Queries per dataset: prodhive/dse/geo_country_d 11405; prodhive/dse/ttl_title_d 8194; (names not shown) 5928, 5451, 4849, 4654, 4334, 3620, 3046, 2823
  18. Datasets related to geo_country_d, by shared queries: prodhive/dse/ttl_title_country_r 2277; (name not shown) 1697; prodhive/dse/ttl_show_d 1540; prodhive/dse/ttl_season_d 1405; prodhive/dse/ttl_title_d 1392; (names not shown) 926, 817, 743; prodhive/dse/ttl_season_country_r 638; (name not shown) 628
  19. Datasets (input jobs, queries): prodhive/cdn/occ (2016, 66); teradata/gdw_stg_prod/seg (1587, 36); prodhive/dse/msg (1527, 14); prodhive/dse/msg (1512, 30); teradata/gdw_stg_prod/seg (1043, 50); teradata/gdw_stg_prod/cdn (970, 10); teradata/gdw_tbl_prod/seg (903, 1); prodhive/rpt/pbe (811, 11); prodhive/gps/gro (904, 137); prodhive/cdn/ttl (631, 39)
  20. Challenges: Knowing what questions you should try to answer. Getting this data isn't easy. The data is noisy.
  21. Thanks. Charles Smith, Big Data Platform Architecture, Netflix, @charles_s_smith
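
The Task Hour Cost formula on slide 8 and the "top jobs" share on slide 9 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the reading of "cost of node" as an hourly figure, and all sample numbers are hypothetical and do not come from the talk.

```python
MS_PER_HOUR = 60 * 60 * 1000


def task_hour_cost(node_cost, tasks_per_node, task_durations_ms):
    """Slide 8: (cost of node)/(tasks per node) * sum(task duration ms)/(60*60*1000).

    Assumption: node_cost is the cost of one node for one hour.
    """
    cost_per_task_hour = node_cost / tasks_per_node
    task_hours = sum(task_durations_ms) / MS_PER_HOUR
    return cost_per_task_hour * task_hours


def top_n_cost_share(job_costs, n):
    """Fraction of total cost contributed by the n most expensive jobs."""
    total = sum(job_costs.values())
    if total == 0:
        return 0.0
    top = sorted(job_costs.values(), reverse=True)[:n]
    return sum(top) / total


if __name__ == "__main__":
    # Hypothetical job: three tasks of 30, 60 and 90 minutes on nodes that
    # run 10 tasks each and cost 1.0 (arbitrary currency unit) per hour.
    cost = task_hour_cost(1.0, 10, [1_800_000, 3_600_000, 5_400_000])
    print(f"task hour cost: {cost:.2f}")

    # Hypothetical per-job costs; share of the total owed to the top 2 jobs.
    share = top_n_cost_share({"job_a": 6.0, "job_b": 3.0, "job_c": 1.0}, 2)
    print(f"top-2 share: {share:.0%}")
```

Ranking jobs by this cost and taking the cumulative share of the top N is one way to arrive at a statement of the form "100 jobs comprise 86% of the cost".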
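
The usage tables on slides 16 to 18 (queries per dataset, distinct queries per dataset, and datasets that share queries with a given dataset) could be derived from parsed jobs roughly as follows. This is a sketch, not the tooling described in the talk: it assumes a parser has already reduced each executed query to a pair of query text and the datasets it reads, and the record layout, function names, and dataset names such as `db/a` are hypothetical.

```python
from collections import Counter


def usage_counts(parsed_queries):
    """Return (total, distinct) query counts per dataset.

    parsed_queries: iterable of (query_text, datasets) pairs, one per execution.
    """
    total = Counter()
    texts_by_dataset = {}
    for text, datasets in parsed_queries:
        for ds in set(datasets):  # count a dataset once per execution
            total[ds] += 1
            texts_by_dataset.setdefault(ds, set()).add(text)
    distinct = Counter({ds: len(texts) for ds, texts in texts_by_dataset.items()})
    return total, distinct


def shared_queries(parsed_queries, target):
    """Count, per other dataset, the executions that read it together with target."""
    shared = Counter()
    for _, datasets in parsed_queries:
        ds = set(datasets)
        if target in ds:
            shared.update(ds - {target})
    return shared


if __name__ == "__main__":
    # Hypothetical parsed log: "q1" ran twice, "q2" and "q3" once each.
    log = [
        ("q1", ["db/a", "db/b"]),
        ("q1", ["db/a", "db/b"]),
        ("q2", ["db/a", "db/c"]),
        ("q3", ["db/b"]),
    ]
    total, distinct = usage_counts(log)
    print(total.most_common())
    print(distinct.most_common())
    print(shared_queries(log, "db/a").most_common())
```

Keeping total and distinct counts separate matters because a single scheduled query that runs many times inflates the total without indicating broader use, which is one source of the noise mentioned on slide 20.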