There Is No Spoon - Ran Leibman, Facebook - DevOpsDays Tel Aviv 2016


Post on 08-Jan-2017


  • Production Engineering (PE): There Is No Spoon

    Ran Leibman, Production Engineer

  • Agenda

    1. How Production Engineering was formed at Facebook
    2. What we do in Onavo
    3. How we moved to the PE model in Onavo
    4. Q & A

  • Facebook - Pre PE Days

  • SRO - Site Reliability Operations

    1. Keep the site up 24/7
    2. Follow the sun
    3. Capacity plans

  • Why was SRO not enough?

  • What are the alternatives?

  • NOC

  • The Production Engineering Model

    1. PEs are embedded within the software engineering teams
    2. Taking part in meetings
    3. Involved in roadmap plans
    4. Reviewing diffs
    5. Oncall - Software & Production Engineers

  • Onavo - Adopting The PE Model

  • Save, Measure & Protect your mobile data

    Protect user traffic using IPsec
    Protect against malicious sites
    Compress user traffic
    Control data leakage

  • Onavo - a bit of context

    1. Founded in 2010
    2. Classic startup Dev & Ops teams
       1. Dev - writes code
       2. Ops - keeps the infra up & running
    3. Acquired by Facebook in 2013

  • Making The Change - Step By Step

  • Step 1 - Go Sit Next To The Developers

  • Step 2 - Get The Colleagues Onboard

  • Step 3 - Get Your Tooling Ready

  • Have Good (short) Documentation - you don't want that Confused Travolta moment

    Document your alerts
    Links to dashboards
    Links to third party software docs
    Runbooks - how to debug in prod: log files, how to restart the service, getting stack traces & metrics
    Links to config management
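One way to keep this documentation checklist honest is to check each service's doc page against it automatically. The following is a minimal sketch, not something shown in the talk: it assumes each service doc is available as a dictionary of section name to text, and the section keys in `REQUIRED_SECTIONS` are hypothetical names chosen to mirror the slide.

```python
from typing import Dict, List

# Sections mirror the documentation slide; the key names are hypothetical.
REQUIRED_SECTIONS = [
    "alerts",
    "dashboards",
    "third_party_docs",
    "log_files",
    "restart",
    "stack_traces_and_metrics",
    "config_management",
]


def missing_sections(service_doc: Dict[str, str]) -> List[str]:
    """Return the required sections that are absent or empty in a service doc."""
    return [
        section
        for section in REQUIRED_SECTIONS
        if not service_doc.get(section, "").strip()
    ]


if __name__ == "__main__":
    doc = {"alerts": "See the alert wiki page", "restart": "   "}
    # Lists every section an oncall developer would not find in this doc.
    print(missing_sections(doc))
```

A check like this could run whenever a doc changes, so gaps are found before an oncall shift rather than during one.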

  • Dev Friendly Systems

  • Simple And Indicative Dashboards - avoid the graph porn

    1. Match the product KPIs
    2. Strong signal
    3. Intuitive titles
    4. Easy to spot anomalies
    5. Easy to find correlations

  • Step 4 - Review Your Alerts

  • Refactor Your Alerts as Needed - rm -rf /all/false/alarms*

    The first challenge is to make sure alerts are handled. To make that possible, every alert should:

    Indicate a real problem
    Be clear to understand - informative
    Be impactful
    Be actionable
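The four alert criteria can be turned into a simple review pass over existing alert definitions. This is a hedged sketch under stated assumptions, not tooling from the talk: the `Alert` fields, the `review_alert` helper, and the 20% false-positive threshold are all hypothetical and would need to be mapped onto whatever alerting system is actually in use.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Alert:
    """Hypothetical alert definition; field names are illustrative only."""
    name: str
    description: str = ""             # what is broken, in plain words
    impact: str = ""                  # who or what is affected
    runbook_url: str = ""             # where the oncall finds the next action
    false_positive_rate: float = 0.0  # share of recent firings that were not real


def review_alert(alert: Alert, max_false_positive_rate: float = 0.2) -> List[str]:
    """Return the criteria from the slide that this alert fails to meet."""
    problems = []
    if alert.false_positive_rate > max_false_positive_rate:
        problems.append("does not reliably indicate a real problem")
    if not alert.description.strip():
        problems.append("not clear: missing description")
    if not alert.impact.strip():
        problems.append("not impactful: missing impact statement")
    if not alert.runbook_url.strip():
        problems.append("not actionable: missing runbook link")
    return problems


if __name__ == "__main__":
    noisy = Alert(name="disk_space_warning", false_positive_rate=0.5)
    for problem in review_alert(noisy):
        print(f"{noisy.name}: {problem}")
```

Alerts that fail the first check are candidates for deletion; the rest are candidates for a rewrite.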

  • Step 5 - Train The Team - Get Them Ready

  • Train The Team - learning is easy, remembering is hard

    Wiki / Doc based makes it easier to remember
    Hands-on, Hands-on, Hands-on
    Pre-create a task pool (even if low impact)
    Give oncall use cases & examples
    Reusable

  • Step 6 - Oncall + Hand Holding

  • Shared Oncall - make yourself available and adjust as you go

    Short oncall cycles, 1-2 days
    Increase the period each cycle
    Oncall summaries
    Do oncall as well - set an example
    Preemptively check status with the current oncall
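The "short cycles that grow each cycle" idea can be sketched as a small schedule generator. This is an illustration only: the `build_rotation` function, its parameters, and the one-day growth step are assumptions, since the slide only says to start with 1-2 day shifts and lengthen them over time.

```python
from datetime import date, timedelta
from typing import List, Tuple


def build_rotation(
    people: List[str],
    start: date,
    cycles: int,
    first_shift_days: int = 1,
    growth_days: int = 1,
) -> List[Tuple[str, date, date]]:
    """Return (person, first_day, last_day) shifts.

    Every full pass through the team is one cycle, and each cycle uses a
    shift that is `growth_days` longer than the shift in the previous cycle.
    """
    schedule = []
    day = start
    for cycle in range(cycles):
        shift_days = first_shift_days + cycle * growth_days
        for person in people:
            last_day = day + timedelta(days=shift_days - 1)
            schedule.append((person, day, last_day))
            day = last_day + timedelta(days=1)
    return schedule


if __name__ == "__main__":
    # Names are placeholders; mix developers and PEs in the same rotation.
    for person, first, last in build_rotation(["dev_a", "pe_b"], date(2016, 1, 4), cycles=3):
        print(f"{person}: {first} to {last}")
```

Including a PE in the `people` list keeps the "do oncall as well" point from the slide visible in the schedule itself.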

  • The Steps

    Step 1 - Go Sit Next To The Developers
    Step 2 - Get The Colleagues Onboard
    Step 3 - Get Your Tooling Ready
    Step 4 - Review Your Alerts
    Step 5 - Train The Team
    Step 6 - Oncall + Hand Holding

  • Questions?

    Ran Leibman, Production Engineer