devops roadtrip minneapolis
TRANSCRIPT
JASON HAND | DevOps Evangelist
• Holds over 15 years of experience as a developer, system administrator, and support specialist
• Fully immersed in the world of agile development and the DevOps movement with Colorado tech startups
#DevOpsRoadTrip
A little about VictorOps…
VictorOps is the real-time incident management platform that combines the power of people and data to embolden DevOps pros to handle incidents as they occur.
“How Organizations Process Information”
Ron Westrum: A Typology of Organizational Cultures
The 2014 State of DevOps Report shows that in the context of IT, job satisfaction is the biggest predictor of profitability, market share, and productivity. The biggest predictor of job satisfaction, in turn, is how effectively organizations process information, as determined by a model created by sociologist Ron Westrum, shown below. 1
1: https://continuousdelivery.com/implementing/culture/
Words are how we think – stories are how we link.
- Christina Baldwin
Oral narrative is and for a long time has been the
chief basis of culture itself.
- John D. Niles
Stories from the road
Ordered
Obvious: Cause & Effect Obvious From Experience. Sense – Categorize – Respond
Complicated: Cause & Effect Requires Analysis. Sense – Analyze – Respond
Unordered
Complex: Cause & Effect Only Apparent in Hindsight. Probe – Sense – Respond
Chaotic: Cause & Effect Cannot Be Related. Act – Sense – Respond
Figure: Incident Management Maturity. Axes: Time To Repair (TTR) and Continuous Improvement Efforts. Levels: Reactive (chaotic), Tactical (obvious), Integrated (complicated), Strategic (complex). The checklist for each level follows.
Incident Management Maturity
Reactive (chaotic)
✓ No automation
✓ No operational stack awareness
✓ Poor collaboration between teams (Dev & Ops)
✓ Documentation not available
✓ No standardized communication
✓ High focus on consistent continuous learning
Tactical (obvious)
✓ Uses a NOC
✓ Some monitoring & alerting instrumentation
✓ Collaboration in crisis
✓ "Mission critical" processes are available
✓ Understood crisis communication protocols
✓ Remediation data available to IT Operations
Integrated (complicated)
✓ Team rotations, paging policies, role hunting
✓ Continuous improvement of key health indicators
✓ Technical collaboration across all incidents
✓ Docs up to date and easily accessible
✓ Consistent real-time communication practices
Strategic (complex)
✓ Automated docs and remediation
✓ Actionable Alerts with full context
✓ High collaboration among all teams
✓ Documentation part of remediation
✓ Targeted, proactive crisis comms
✓ High focus on continuous learning
“Six Trends Shape DevOps Adoption, Q1 2015” Forrester report
• The Foundation For Success Is In Place . . . Mostly
• Fear Of Failure Will Hamper Advancement
• Monitoring And Analytics Strategies Must Make A Big Leap Forward
• The Focus On Customer Experience Is Not Second Nature . . . Yet
• Change And Release Processes Are Not Delivering Business Needs
• You Must Prioritize And Focus Sourcing Strategies
Failure not seen as opportunity to learn
Source: “Six Trends Shape DevOps Adoption, Q1 2015”, Forrester report
© 2015 Forrester Research, Inc. Reproduction Prohibited 46
Single Source Of Truth Lacking In Many Orgs – 95% only most of the time or less
Source: April 15, 2015 “Six Trends That Will Shape DevOps Adoption”, Forrester report
Teams siloed throughout life cycle
Source: “Six Trends Shape DevOps Adoption, Q1 2015”, Forrester report
IT teams aren’t measured on customer experience goals.
Delays in Notifications Lead to Customers Finding the Problem First
Source: “Six Trends Shape DevOps Adoption, Q1 2015”, Forrester report
Bridget Kromhout | Pivotal - Cloud Foundry Principal Technologist
• Bridget Kromhout is a Principal Technologist for Cloud Foundry at Pivotal.
• After years as an operations engineer (most recently at DramaFever), she traded in oncall for more travel.
• A frequent speaker at tech conferences, she helps organize tech meetups at home in Minneapolis, serves on the program committee for Velocity, and acts as a global core organizer for devopsdays.
• She podcasts at Arrested DevOps, occasionally blogs at bridgetkromhout.com, and is active in a Twitterverse near you.
Bridget Kromhout
@bridgetkromhout
lives: Minneapolis, Minnesota
works: Pivotal
podcasts: Arrested DevOps
organizes: devopsdays
“…measuring value, throughput, and performance…
revenue rather than cost”
The Art of Monitoring (2016) James Turnbull
artofmonitoring.com
The Art of Monitoring (2016) James Turnbull
Monitoring containers
artofmonitoring.com
“Almost every task run under Borg contains a
built-in HTTP server that publishes information
about the health of the task and thousands of performance metrics”
Large-scale cluster management at Google with Borg - Verma et al. 2015
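The idea in the quote, a task that carries its own HTTP server for health and metrics, can be sketched with the Python standard library. This is a minimal illustration, not how Borg does it: the paths `/healthz` and `/metrics`, the counter names, and the helper names `render` and `start_health_server` are all assumptions made for this example.

```python
import json
import threading
import time
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

START_TIME = time.time()
# Hypothetical in-process counters; a real task would update these as it works.
METRICS = {"requests_total": 0, "errors_total": 0}


def render(path):
    """Return (status_code, body) for a request path."""
    if path == "/healthz":
        uptime = round(time.time() - START_TIME, 3)
        return 200, {"status": "ok", "uptime_seconds": uptime}
    if path == "/metrics":
        return 200, dict(METRICS)
    return 404, {"error": "not found"}


class HealthHandler(BaseHTTPRequestHandler):
    """Serves the health and metrics documents as JSON."""

    def do_GET(self):
        status, body = render(self.path)
        payload = json.dumps(body).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, format, *args):
        pass  # keep the example quiet


def start_health_server(port=0):
    """Start the server on a background thread; port 0 lets the OS pick a free port."""
    server = ThreadingHTTPServer(("127.0.0.1", port), HealthHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

A task would call `start_health_server()` once at startup and update `METRICS` as it runs; a monitoring system can then poll the two paths.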
The Art of Monitoring (2016) — James Turnbull
Monitoring Maturity Model
artofmonitoring.com
Image credit: Wikipedia
“Any organization that designs a system… will produce a design
whose structure is a copy of the organization's
communication structure.”
Mel Conway
Andy Domeier | SPS Commerce, Director System Operations
• Andy has been in Technology Operations leadership with SPS Commerce for the past 11 years.
• Andy spends many mental cycles collaborating to find effective patterns for monitoring and operating complex, changing systems.
• Andy also spends time on priority organization and alignment and on the organization of knowledge.
WHAT’S YOUR SYSTEM?
• #monoliths – Familiar, All or None, Less Agility
• #microservices – Complex, semi-isolated, Agile
© SPS COMMERCE 20
RESPOND AS A TEAM
• Hey Danielle, it looks like the site is acting up, and when looking around the only outlier I have found so far is a CPU spike on the DB. Can you help me investigate this a bit more?
#CHATOPS
• Share Screens & Visualize Data
• Display Alerts w/ Integrations
• Automatic History Retention
• Enables Collaboration for All
• And my Favorite…
TIPS FOR HEALTHY INCIDENT RESPONSE
• Make health data as transparent and central as possible
– Helps the Team “Know where the fire is”
• Share data in chat
– Use the metric from your tools
• “Be Transparent”
• Team Response Nurtures Team Follow Up
TIPS FOR HEALTHY INCIDENT RESPONSE
• Always tie things back to the customer
– Simple but often overlooked
– Opportunity to link the team to the business
Ben Overmyer | Star Tribune, Digital Manager, Operations
• Ben is the Digital Manager of Operations at the Minneapolis Star Tribune.
• He has over a decade of experience as a back end software engineer, two years of experience as a dedicated operations engineer, and great enthusiasm for the DevOps culture.
• Besides the Star Tribune, he’s worked for an eclectic mix of organizations, including the USGS, a game company in New Zealand, and a beauty products marketing company.
• When not hacking on servers, apps, or people, he acts as art director and author for a tabletop gaming company.
IN THE BEGINNING
▸ Forwarded phone line
▸ An on-call list maintained in a wiki
▸ Every week, manually change to the next person on the list
▸ …and overrides or substitutions?
EARLY MONITORING
▸ Zabbix monitoring set up for a handful of causes
▸ Zabbix alerts sent via email to a distribution list
▸ Sometimes no one would see these alerts until hours or, in rare cases, days later
THE PAIN POINTS
▸ Manual maintenance of the calling tree data
▸ Manual rotation of the support phone line forwarding
▸ Poor documentation of incident life cycles
▸ No sense of incident frequency beyond “this was a bad couple weeks”
▸ If the on-call person didn’t respond, there was no escalation process other than calling the head of Digital
ADOPTING VICTOROPS
▸ Automated rotations
▸ Multiple teams
▸ Automatic escalation processes
▸ Easy schedule overrides and changes
▸ APIs for programmatic incident interaction
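The weekly rotation, overrides, and escalation described on these slides can be sketched in a few lines. This is an illustration of the logic only, not how VictorOps implements it: the function names, the roster names, and the `head-of-digital` fallback label are assumptions for this example (the fallback mirrors the old practice of calling the head of Digital).

```python
from datetime import date


def on_call(roster, start, day, overrides=None):
    """Return who is on call on `day`.

    roster: names in rotation order, one week each, beginning on `start`.
    overrides: optional {date: name} substitutions that win over the rotation.
    """
    overrides = overrides or {}
    if day in overrides:
        return overrides[day]
    weeks_elapsed = (day - start).days // 7
    return roster[weeks_elapsed % len(roster)]


def escalation_chain(roster, start, day, overrides=None, fallback="head-of-digital"):
    """Primary first, then the rest of the roster in rotation order, then a final fallback."""
    primary = on_call(roster, start, day, overrides)
    scheduled = ((day - start).days // 7) % len(roster)
    # Walk the roster starting with the person after the scheduled slot.
    rotated = roster[scheduled + 1:] + roster[:scheduled + 1]
    chain = [primary] + [person for person in rotated if person != primary]
    return chain + [fallback]


roster = ["ana", "ben", "cy"]          # hypothetical team
start = date(2016, 1, 4)               # first day of the first shift
print(on_call(roster, start, date(2016, 1, 12)))           # second week of the rotation
print(escalation_chain(roster, start, date(2016, 1, 12)))  # who gets paged, in order
```

Keeping overrides as a separate lookup means a substitution never disturbs the underlying rotation, which is the property the wiki-based list lacked.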
THE NATURE OF ALERTS
▸ OK, we can set up programmatic alerts. Now what?
▸ Integrating Zabbix, New Relic, and CloudWatch
▸ Discovering alert floods
▸ Move to alerting on symptoms, not causes
▸ …but still monitoring causes
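One way to read "alert on symptoms, but still monitor causes" is as a routing rule: failing symptom checks page a human, while failing cause checks are only recorded for later diagnosis. The sketch below assumes each check is tagged with a `kind`; the class, field, and check names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Check:
    name: str
    kind: str      # "symptom" (user-visible) or "cause" (internal signal)
    failing: bool


def route(checks):
    """Split failing checks into ones that page and ones that are only recorded."""
    page = [c.name for c in checks if c.failing and c.kind == "symptom"]
    record = [c.name for c in checks if c.failing and c.kind == "cause"]
    return {"page": page, "record": record}


checks = [
    Check("homepage returns 5xx", "symptom", True),
    Check("db cpu above threshold", "cause", True),
    Check("disk usage above threshold", "cause", False),
]
print(route(checks))
```

Routing this way keeps cause data available during an incident without letting every internal threshold wake someone up, which is one source of alert floods.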
THE SPIDEY-SENSE FACTOR
▸ Humans are good at catching certain kinds of problems
▸ “This doesn’t feel right” and gaps in monitoring
▸ The evolution of the Sev incident system
THE STATUS SITE: MANUAL ALERTING FOR NON-TECH USERS
▸ Want to let certain non-tech users report Sev incidents
▸ Initially just a password-protected form
▸ Uses the VictorOps alert ingestion API for triggering alerts
▸ Uses the VictorOps public API for fetching information
▸ Each Sev alert is created with its own entity_id
▸ Lets admin users share status updates
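A form handler along these lines might build one alert body per report and give each its own entity_id. The sketch below is an assumption-heavy illustration rather than the Star Tribune's code: apart from `entity_id`, the field names (`message_type`, `state_message`, `reported_by`, `state_start_time`) should be checked against the VictorOps documentation, and the endpoint URL is a placeholder supplied by the caller. The request is prepared but never sent.

```python
import json
import urllib.request
import uuid
from datetime import datetime, timezone


def build_sev_alert(severity, summary, reporter):
    """Build a JSON-serializable alert body for one Sev incident.

    Apart from entity_id, the field names are assumptions for illustration.
    """
    return {
        "message_type": "CRITICAL",
        # A fresh id per report, so each Sev alert is tracked as its own incident.
        "entity_id": f"sev{severity}-{uuid.uuid4().hex}",
        "state_message": summary,
        "reported_by": reporter,
        "state_start_time": int(datetime.now(timezone.utc).timestamp()),
    }


def build_request(endpoint_url, alert):
    """Prepare (but do not send) a JSON POST to the caller-supplied endpoint."""
    return urllib.request.Request(
        endpoint_url,
        data=json.dumps(alert).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


alert = build_sev_alert(1, "Homepage is not loading", "newsroom")
request = build_request("https://example.invalid/alert", alert)  # placeholder URL
print(request.get_method(), alert["entity_id"])
```

Sending would be a call to `urllib.request.urlopen(request)` once the real endpoint and field names are confirmed.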
MONTHLY INCIDENT REPORTING
▸ Monthly reports include a list of all Sev incidents, when they started, when they ended, what the alert text was, and what the resolution was
▸ Combine automated and chat messages in VictorOps with data gathered from other sources
▸ Present this data as automatically as possible in the Status Site
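The monthly report described above (each Sev incident with its start, end, alert text, and resolution) could be assembled roughly as follows. The record type, field names, and line format are hypothetical; the slides do not say how the Status Site stores or renders this data.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class SevIncident:
    started: datetime
    ended: datetime
    alert_text: str
    resolution: str


def monthly_report(incidents, year, month):
    """Return report lines for incidents that started in the given month, oldest first."""
    selected = sorted(
        (i for i in incidents if i.started.year == year and i.started.month == month),
        key=lambda i: i.started,
    )
    lines = [f"Sev incidents for {year}-{month:02d}: {len(selected)}"]
    for i in selected:
        minutes = int((i.ended - i.started).total_seconds() // 60)
        lines.append(
            f"{i.started:%Y-%m-%d %H:%M} to {i.ended:%Y-%m-%d %H:%M} "
            f"({minutes} min) | {i.alert_text} | {i.resolution}"
        )
    return lines


incidents = [
    SevIncident(datetime(2016, 3, 2, 9, 0), datetime(2016, 3, 2, 9, 45),
                "Site down", "Restarted DB"),
    SevIncident(datetime(2016, 4, 1, 8, 0), datetime(2016, 4, 1, 8, 10),
                "Slow search", "Cleared cache"),
]
print("\n".join(monthly_report(incidents, 2016, 3)))
```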
NEXT STEPS
▸ Integration of summarized data collected from Datadog/CloudWatch/etc. into incident reporting
▸ Reports for users that shouldn’t have access to VictorOps
▸ Integration of the Status Site into Slack
Breakout Sessions
◻ ChatOps – Jason Hand
◻ Leveraging Data to Establish a Healthy Culture – Andy Domeier
◻ Monitoring and Microservices – Bridget Kromhout
◻ Blameless Culture – Heather Mickman
◻ Devs vs. Ops On-Call, How and Why to Get Started – Ben Overmyer
Heather Mickman | Target, Senior Director of Platform Engineering
• Heather Mickman is the Senior Director of Platform Engineering at Target and a DevOps enthusiast.
• Heather has 20+ years of IT experience in various roles and industries including retail, transportation, and high tech manufacturing.
• She is currently working on building the platforms used by software engineers at Target including a multi-provider cloud platform, API Gateway, telemetry tooling, data stores, and messaging.
• She has a passion for technology, building high performing teams, driving a culture of innovation, and having fun along the way. Heather lives in Minneapolis with her 2 sons and mini dachshund.
Incident Management Maturity
Axes: Time To Repair (TTR) and Continuous Improvement Efforts
Reactive (0 – 4) (chaotic)
✓ No automation
✓ No operational stack awareness
✓ Poor collaboration between teams (Dev & Ops)
✓ Documentation not available
✓ No standardized communication
✓ High focus on consistent continuous learning
Tactical (5 – 9) (obvious)
✓ Uses a NOC
✓ Some monitoring & alerting instrumentation
✓ Collaboration in crisis
✓ "Mission critical" processes are available
✓ Understood crisis communication protocols
✓ Remediation data available to IT Operations
Integrated (10 – 14) (complicated)
✓ Team rotations, paging policies, role hunting
✓ Continuous improvement of key health indicators
✓ Technical collaboration across all incidents
✓ Docs up to date and easily accessible
✓ Consistent real-time communication practices
Strategic (15 – 18) (complex)
✓ Automated docs and remediation
✓ Actionable Alerts with full context
✓ High collaboration among all teams
✓ Documentation part of remediation
✓ Targeted, proactive crisis comms
✓ High focus on continuous learning
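The score bands on the maturity figure (0 to 4, 5 to 9, 10 to 14, 15 to 18) can be turned into a simple lookup. The slides do not say how the score itself is calculated, so this sketch only maps an already-computed integer score to a level; the function and constant names are hypothetical.

```python
# Score bands as printed on the slide: (low, high, level), inclusive at both ends.
LEVELS = [
    (0, 4, "Reactive"),
    (5, 9, "Tactical"),
    (10, 14, "Integrated"),
    (15, 18, "Strategic"),
]


def maturity_level(score):
    """Map an integer score from 0 to 18 to its maturity level."""
    for low, high, name in LEVELS:
        if low <= score <= high:
            return name
    raise ValueError("score must be an integer from 0 to 18")


for score in (3, 9, 10, 18):
    print(score, maturity_level(score))
```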