Spark Streaming Resiliency (Bay Area Spark Meetup)

Post on 08-Aug-2015

TRANSCRIPT

  1. Bay Area Spark Meetup 05/19/2015 @
  2. Spark Streaming Resiliency. Prasanna Padmanabhan & Bharat Venkat, Personalization Infrastructure
  3. Agenda: Background; Use cases for Real Time Stream Processing; Motivations for Spark; Creating Chaos; Spark Streaming Primer; Deployment Setup; Injecting Chaos in Spark; Future
  4. Agenda: Background; Use cases for Real Time Stream Processing; Motivations for Spark; Creating Chaos; Spark Streaming Primer; Deployment Setup; Injecting Chaos in Spark; Future
  5. Netflix is a logging company
  6. that also happens to stream videos
  7. Scale at Netflix: 400 billion events per day; 8 million events/sec during peak; numerous types of events (UI events, play events, impression events, etc.)
  8. What do we do with it? Event logs are captured into Hadoop (EMR). Run ETL jobs using Hive/Presto to: provide input to pre-compute recommendations; user behavior analysis; data analysis and reporting
  9. Agenda: Background; Use cases for Real Time Stream Processing; Motivations for Spark; Creating Chaos; Spark Streaming Primer; Deployment Setup; Injecting Chaos in Spark; Future
  10. Use Cases for Stream Processing: recommendations based on collective real-time signals
  11. Use Cases for Stream Processing: faster identification of data anomalies and regressions. Example: bad iPhone push
  12. Agenda: Background; Use cases for Real Time Stream Processing; Motivations for Spark; Creating Chaos; Spark Streaming Primer; Deployment Setup; Injecting Chaos in Spark; Future
  13. Motivations for Spark: popular compute engine for batch processing; widely used for offline experimentation at Netflix; improves agility with interactive queries; Interactive Experimenters Notebook
  14. Motivations for Spark: single platform to build batch and real-time applications. Diagram labels: S3, Micro Services, Spark, Spark Streaming, Recommender Systems, Batch Data, Streaming Data
  15. Agenda: Background; Use cases for Real Time Stream Processing; Motivations for Spark; Creating Chaos; Spark Streaming Primer; Deployment Setup; Injecting Chaos in Spark; Future
  16. Challenges in Cloud: ephemeral resources; cannot rely on local state; no fixed IP
  17. Chaos Monkey Approach: simulate failures by randomly killing components; failures inevitably happen when least desired; lather, rinse, repeat!
  18. Can Spark Streaming survive Chaos Monkey?
  19. Agenda: Background; Use cases for Real Time Stream Processing; Motivations for Spark; Creating Chaos; Spark Streaming Primer; Deployment Setup; Injecting Chaos in Spark; Future
  20. Spark Components. Diagram labels: Spark Driver; Cluster Manager (Mesos, YARN, Standalone); Worker Nodes, each with an Executor running Tasks
  21. Spark Driver: main program, DAG scheduler. Diagram labels: Spark Driver; Cluster Manager (Mesos, YARN, Standalone); Worker Nodes, each with an Executor running Tasks
  22. Cluster Manager: resource allocation. Diagram labels: Spark Driver; Cluster Manager (Mesos, YARN, Standalone); Worker Nodes, each with an Executor running Tasks
  23. Spark Worker: runs the Worker process and monitors Executors. Diagram labels: Spark Driver; Cluster Manager (Mesos, YARN, Standalone); Worker Nodes, each with an Executor running Tasks
  24. How does streaming work? Data streams are processed in batches; each batch is processed in Spark; results are pushed out in batches
  25. Agenda: Background; Use cases for Real Time Stream Processing; Motivations for Spark; Creating Chaos; Spark Streaming Primer; Deployment Setup; Injecting Chaos in Spark; Future
  26. Application Details: process a subset of UI events from Kafka; compute aggregate metrics; publish metrics to Atlas; Spark 1.2.0
  27. Standalone Cluster Manager: provides resource management and resiliency; all in one package; built-in, easy to deploy; troubleshoot issues with a single team (Databricks)
  28. Deployment
  29. Agenda: Background; Use cases for Real Time Stream Processing; Motivations for Spark; Creating Chaos; Spark Streaming Primer; Deployment Setup; Injecting Chaos in Spark; Future
  30. Stream Resiliency: streaming application continues to run; partial data loss during failure is acceptable
  31. Driver Resiliency (Client Mode). Diagram labels: Master, Workers, Client, Driver. Command: ./spark-submit --deploy-mode client
  32. Driver Resiliency (Client Mode). Diagram labels: Master, Workers, Client, Driver
  33. Driver Resiliency (Client Mode). Diagram labels: Master, Workers, Client, Driver. Caption: Entire Application is killed
  34. Driver Resiliency (Cluster Mode) (with supervise). Diagram labels: Master, Workers, Client. Command: ./spark-submit --deploy-mode cluster --supervise
  35. Driver Resiliency (Cluster Mode) (with supervise). Diagram labels: Master, Workers, Client, Driver. Caption: Driver runs in the worker
  36. Driver Resiliency (Cluster Mode) (with supervise). Diagram labels: Master, Workers, Client, Driver
  37. Driver Resiliency (Cluster Mode) (with supervise). Diagram labels: Master, Workers, Client, Driver. Caption: Driver is started in a new worker
  38. Driver Resiliency (Cluster Mode) (with supervise). Diagram labels: Master, Workers, Client, Driver. Caption: Driver is started in a new worker
  39. Master Resiliency (Single Master). Diagram labels: Master, Workers, Client
  40. Master Resiliency (Single Master). Diagram labels: Master, Workers, Client. Caption: Entire Application is killed
  41. Master Resiliency (Multi Master). Diagram labels: Workers, Client, Standby Master, Active Master
  42. Master Resiliency (Multi Master). Diagram labels: Workers, Client, Standby Master, Active Master. Caption: No impact
  43. Master Resiliency (Multi Master). Diagram labels: Workers, Client, Standby Master, Active Master
  44. Master Resiliency (Multi Master). Diagram labels: Workers, Client, Standby Master, Active Master. Caption: Standby becomes Active
  45. Master Resiliency (Multi Master). Diagram labels: Workers, Client, Standby Master, Active Master. Caption: Standby becomes Active
  46. Worker Resiliency. Diagram labels: Master, Workers, Client, Driver, Executor. Caption: Executor runs as child process of Worker
  47. Worker Resiliency. Diagram labels: Master, Workers, Client, Driver, Executor
  48. Worker Resiliency. Diagram labels: Master, Workers, Client, Driver, Executor. Caption: Driver and Executor are also killed
  49. Worker Resiliency. Diagram labels: Master, Workers, Client, Driver, Executor. Caption: Worker is relaunched
  50. Worker Resiliency. Diagram labels: Master, Workers, Client, Driver, Executor. Captions: Driver and Executor are also killed; Worker is relaunched; Driver and Executor are also relaunched
  51. Worker Resiliency. Diagram labels: Master, Workers, Client, Driver, Executor. Captions: Driver and Executor are also killed; Worker is relaunched; Driver and Executor are also relaunched
  52. Executor Resiliency. Diagram labels: Master, Workers, Client, Driver, Executors
  53. Executor Resiliency. Diagram labels: Master, Workers, Client, Driver, Executor
  54. Executor Resiliency. Diagram labels: Master, Workers, Client, Driver, Executors. Caption: Executor is relaunched
  55. Executor Resiliency. Diagram labels: Master, Workers, Client, Driver, Executors. Captions: Executor is relaunched; tasks in flight are rescheduled
  56. Executor Resiliency. Diagram labels: Master, Workers, Client, Driver, Executors. Captions: Executor is relaunched; tasks in flight are rescheduled
  57. Resiliency Results
  58. Summary
  59. Agenda: Background; Use cases for Real Time Stream Processing; Motivations for Spark; Creating Chaos; Spark Streaming Primer; Deployment Setup; Injecting Chaos in Spark; Future
  60. Future: Lambda Architecture; operational enhancements; dynamic scaling; additional Spark instrumentation
  61. We are hiring! Senior Software Engineer - Personalization Infra: http://bit.ly/persinfra
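
The micro-batch model described on slide 24 (streams are cut into batches, each batch is processed, results are pushed out per batch) can be illustrated with a small sketch. This is plain Python, not Spark code and not the speakers' application: the names `micro_batches` and `count_by_type`, the `(timestamp, event_type)` event shape, and the fixed batch interval are all assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical event shape: (timestamp in seconds, event type).
Event = Tuple[float, str]


def micro_batches(events: Iterable[Event], batch_interval: float) -> Iterator[List[Event]]:
    """Cut a time-ordered event stream into fixed-interval batches.

    Intervals with no events still produce an (empty) batch, so every
    interval yields exactly one batch.
    """
    batch: List[Event] = []
    window_end = None
    for ts, kind in events:
        if window_end is None:
            # The first event opens the first window.
            window_end = ts + batch_interval
        # Close every window that ended before this event arrived.
        while ts >= window_end:
            yield batch
            batch = []
            window_end += batch_interval
        batch.append((ts, kind))
    if batch:
        # Flush the final, partially filled window.
        yield batch


def count_by_type(batch: List[Event]) -> Dict[str, int]:
    """Compute one aggregate metric per batch: event count by type."""
    return dict(Counter(kind for _, kind in batch))
```

Feeding a short list of timestamped events through `micro_batches` with a one-second interval and mapping `count_by_type` over the result yields one metrics dictionary per interval, which is roughly the shape of what a streaming job would publish after each batch.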