devops roadtrip minneapolis


Upload: victorops

Post on 08-Apr-2017


JASON HAND | DevOps Evangelist

• Holds over 15 years of experience as a developer, system administrator, and support specialist

• Fully immersed in the world of agile development and the DevOps movement with Colorado tech startups

#DevOpsRoadTrip


A little about VictorOps…

VictorOps is the real-time incident management platform that combines the power of people and data to embolden DevOps pros to handle incidents as they occur.

#DevOpsRoadTrip

Why Are We Here?

Culture

“How Organizations Process Information”

Ron Westrum: A Typology of Organizational Cultures

The 2014 State of DevOps Report shows that in the context of IT, job satisfaction is the biggest predictor of profitability, market share, and productivity. The biggest predictor of job satisfaction, in turn, is how effectively organizations process information, as determined by a model created by sociologist Ron Westrum, shown below. [1]

1: https://continuousdelivery.com/implementing/culture/

Words are how we think – stories are how we link.

- Christina Baldwin

Oral narrative is and for a long time has been the chief basis of culture itself.

- John D. Niles

Stories from the road

Cynefin

Unordered: Complex, Chaotic

Ordered: Complicated, Obvious

Obvious: Cause & Effect Obvious From Experience (Sense – Categorize – Respond)

Complicated: Cause & Effect Requires Analysis (Sense – Analyze – Respond)

Complex: Cause & Effect Only Apparent in Hindsight (Probe – Sense – Respond)

Chaotic: Cause & Effect Cannot Be Related (Act – Sense – Respond)

The systems we engineer, maintain, and improve are

Complicated .. or ..

Known unknowns

The systems we engineer, maintain, and improve are

Complex

Unknown unknowns

What is the

Root Cause?

What are the..

Contributing Factors?

Identifying a “root cause” helps us to …

Put it back how it was

What we really want is to..

Continuously Improve

[Figure: Incident Management Maturity. Axes: Time To Repair (TTR) against Continuous Improvement Efforts. Stages: Reactive (chaotic), Tactical (obvious), Integrated (complicated), Strategic (complex).]

Incident Management Maturity

Reactive (chaotic)

✓ No automation

✓ No operational stack awareness

✓ Poor collaboration between teams (Dev & Ops)

✓ Documentation not available

✓ No standardized communication

✓ High focus on consistent continuous learning

Tactical (obvious)

✓ Uses a NOC

✓ Some monitoring & alerting instrumentation

✓ Collaboration in crisis

✓ "Mission critical" processes are available

✓ Understood crisis communication protocols

✓ Remediation data available to IT Operations

Integrated (complicated)

✓ Team rotations, paging policies, role hunting

✓ Continuous improvement of key health indicators

✓ Technical collaboration across all incidents

✓ Docs up to date and easily accessible

✓ Consistent real-time communication practices

Strategic (complex)

✓ Automated docs and remediation

✓ Actionable Alerts with full context

✓ High collaboration among all teams

✓ Documentation part of remediation

✓ Targeted, proactive crisis comms

✓ High focus on continuous learning

“Six Trends Shape DevOps Adoption, Q1 2015” Forrester report

• The Foundation For Success Is In Place . . . Mostly

• Fear Of Failure Will Hamper Advancement

• Monitoring And Analytics Strategies Must Make A Big Leap Forward

• The Focus On Customer Experience Is Not Second Nature . . . Yet

• Change And Release Processes Are Not Delivering Business Needs

• You Must Prioritize And Focus Sourcing Strategies

Automation

Awareness

Collaboration

Documentation

User Empathy

Learning

Learning

Failure not seen as opportunity to learn

Source: “Six Trends Shape DevOps Adoption, Q1 2015”, Forrester report

Awareness

http://blog.vmware.com

© 2015 Forrester Research, Inc. Reproduction Prohibited 46

Single Source Of Truth Lacking In Many Orgs – 95% only most of the time or less

Source: April 15, 2015 “Six Trends That Will Shape DevOps Adoption”, Forrester report

Collaboration

http://neolivemarketing.com/wp-content/uploads/2015/09/Collaboration.jpg

Teams siloed throughout life cycle

Source: “Six Trends Shape DevOps Adoption, Q1 2015”, Forrester report

User Empathy

https://open.buffer.com/wp-content/uploads/2015/12/empathy3.jpg


IT teams aren’t measured on customer experience goals.

Automation

http://thelifedesignproject.com/wp-content/uploads/2009/09/373881476_217d24ef6d.jpg

Delays in Notifications Lead to Customers Finding the Problem First

Source: “Six Trends Shape DevOps Adoption, Q1 2015”, Forrester report

Documentation

http://blog.vmware.com

Reduce MTTR

State of DevOps Report (2015) – by Puppet Labs

Automation

Awareness

Collaboration

Documentation

User Empathy

Learning

jhand.co/DRT_SF

Bridget Kromhout | Pivotal - Cloud Foundry Principal Technologist

• Bridget Kromhout is a Principal Technologist for Cloud Foundry at Pivotal.

• After years as an operations engineer (most recently at DramaFever), she traded in oncall for more travel.

• A frequent speaker at tech conferences, she helps organize tech meetups at home in Minneapolis, serves on the program committee for Velocity, and acts as a global core organizer for devopsdays.

• She podcasts at Arrested DevOps, occasionally blogs at bridgetkromhout.com, and is active in a Twitterverse near you.

#DevOpsRoadTrip

@bridgetkromhout

Monitoring

Bridget Kromhout

lives: Minneapolis, Minnesota

works: Pivotal

podcasts: Arrested DevOps

organizes: devopsdays

Traded oncall… …for more travel (Similar effect on sleep)

“…measuring value, throughput, and performance… revenue rather than cost”

The Art of Monitoring (2016) James Turnbull

artofmonitoring.com

Image credit: James Ernest

The Art of Monitoring (2016) James Turnbull

Monitoring containers

artofmonitoring.com

“Almost every task run under Borg contains a built-in HTTP server that publishes information about the health of the task and thousands of performance metrics”

Large-scale cluster management at Google with Borg - Verma et al. 2015
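The Borg quote describes a pattern that is easy to try in a small service: expose health and metrics over HTTP from inside the task itself. The following is a minimal sketch using only the Python standard library, not a description of how Borg does it. The paths `/healthz` and `/metrics`, the counter names, and the JSON shape are all assumptions made for illustration.

```python
import json
import threading
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

START_TIME = time.time()
# Hypothetical in-process counters; a real task would update these as it works.
METRICS = {"requests_served": 0, "errors": 0}


def render(path):
    """Return (status_code, body_dict) for a request path."""
    if path == "/healthz":
        uptime = round(time.time() - START_TIME, 1)
        return 200, {"status": "ok", "uptime_seconds": uptime}
    if path == "/metrics":
        return 200, dict(METRICS)
    return 404, {"error": "not found"}


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        status, body = render(self.path)
        payload = json.dumps(body).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, format, *args):
        pass  # keep the example quiet; a real service would log requests


def start_health_server(port=0):
    """Serve on a background thread; port 0 lets the OS pick a free port."""
    server = HTTPServer(("127.0.0.1", port), HealthHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Calling `start_health_server(8080)` from the task's startup code would let a monitoring system poll `http://127.0.0.1:8080/healthz`; `server.shutdown()` stops it.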

The Art of Monitoring (2016) — James Turnbull

Monitoring Maturity Model

artofmonitoring.com

Image credit: Wikipedia

“Any organization that designs a system… will produce a design whose structure is a copy of the organization's communication structure.”

Mel Conway

silos are for grain

three Friday mornings in Minneapolis

removed restored

Thank you!

Andy Domeier | SPS Commerce Director System Operations

• Andy has been in Technology Operations leadership with SPS Commerce for the past 11 years.

• Andy spends many mental cycles collaborating to find effective patterns for monitoring and operating complex, changing systems.

• Andy also spends time solving for priority organization and alignment and the organization of knowledge.

#DevOpsRoadTrip

HOW EFFECTIVE IS YOUR INCIDENT RESPONSE?

Andy Domeier

@ajdomie

AGENDA

• Styles of Incident Response

• Healthy Incident Response

• Tips & Tricks

© SPS COMMERCE

STYLE #1 - DENIAL

That’s not possible!

No Wai!

STYLE #2 - CONFUSED

Ummmm

Hmmmm

(crickets)

How is this Possible?

STYLE #3 - LAZY

It’s the Database

It’s the Network

Just Restart It

STYLE #4 - ANGRY

Why did you do that?

What did you change?

#!%& $#!@ #%$! &#!^ #$@

STYLE #5 - FIREDRILL

OMG WTF FML

“Buckshot”

LET’S GET REAL

HOW DO WE KNOW THERE IS A FIRE?

• Good way – Alarm

• Bad way – Humans

WHO PUTS THE FIRE OUT?

• If you catch it right away?

• If it’s out of control?

INCIDENT RESPONSE TEAM

WHAT’S YOUR SYSTEM?

• #monoliths – Familiar, All or None, Less Agility

• #microservices – Complex, semi-isolated, Agile

WHERE’S YOUR DATA?

• Monitoring Tools

– Base IT

– Logging

– APM

– Metrics

RESPOND IN ISOLATION

RESPOND AS A TEAM

• Hey Danielle, it looks like the site is acting up, and when looking around the only outlier I have found so far is a CPU spike on the DB. Can you help me investigate this a bit more?

#CHATOPS

• Share Screens & Visualize Data

• Display Alerts w/ Integrations

• Automatic History Retention

• Enables Collaboration for All

• And my Favorite…

#CHATOPS – CELEBRATE WITH GIFS

TIPS FOR HEALTHY INCIDENT RESPONSE

• Make health data as transparent and central as possible

– Helps the Team “Know where the fire is”

• Share data in chat

– Use the metric from your tools

• “Be Transparent”

• Team Response Nurtures Team Follow Up

• Always tie things back to the customer

– Simple but often overlooked

– Opportunity to link the team to the business
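The tip about sharing data in chat, using the metric from your tools, can be illustrated with a small helper that turns a reading into a chat message. This is only a sketch: the function name, the parameters, and the example service are hypothetical, and the `{"text": ...}` shape assumes a Slack-style incoming webhook, which may not match the chat tool in use.

```python
import json


def metric_message(service, metric, value, unit, threshold, link):
    """Build a chat payload that states the reading and links to the graph."""
    status = "above" if value > threshold else "within"
    text = (
        f"{service}: {metric} is {value}{unit} "
        f"({status} threshold of {threshold}{unit}). Graph: {link}"
    )
    return {"text": text}


# Example values are placeholders, not real hosts or dashboards.
payload = metric_message(
    "orders-db", "cpu", 93, "%", 80, "https://example.invalid/graph/1"
)
body = json.dumps(payload)  # what would be POSTed to the chat webhook
```

Posting the number and the graph link together keeps the whole team looking at the same data instead of relaying it by word of mouth.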

THANK YOU!

Andy Domeier

@ajdomie

Ben Overmyer | Star Tribune Digital Manager, Operations

• Ben is the Digital Manager of Operations at the Minneapolis Star Tribune.

• He has over a decade of experience as a back end software engineer, two years of experience as a dedicated operations engineer, and great enthusiasm for the DevOps culture.

• Besides the Star Tribune, he’s worked for an eclectic mix of organizations, including the USGS, a game company in New Zealand, and a beauty products marketing company.

• When not hacking on servers, apps, or people, he acts as art director and author for a tabletop gaming company.

#DevOpsRoadTrip

EVOLVING INCIDENT MANAGEMENT

STAR TRIBUNE DEVOPS

IN THE BEGINNING

▸ Forwarded phone line

▸ An on-call list maintained in a wiki

▸ Every week, manually change to the next person on the list

▸ …and overrides or substitutions?

EARLY MONITORING

▸ Zabbix monitoring set up for a handful of causes

▸ Zabbix alerts sent via email to a distribution list

▸ Sometimes no one would see these alerts until hours or, in rare cases, days later

THE PAIN POINTS

▸ Manual maintenance of the calling tree data

▸ Manual rotation of the support phone line forwarding

▸ Poor documentation of incident life cycles

▸ No sense of incident frequency beyond “this was a bad couple weeks”

▸ If the on-call person didn’t respond, there was no escalation process other than calling the head of Digital

PHASE 1: VICTOROPS

ADOPTING VICTOROPS

▸ Automated rotations

▸ Multiple teams

▸ Automatic escalation processes

▸ Easy schedule overrides and changes

▸ APIs for programmatic incident interaction
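To make the idea of automated rotations, overrides, and escalation concrete, here is a minimal sketch of the logic a scheduling tool has to implement. It is not how VictorOps works internally. The roster names, the start date, the weekly granularity, and the escalation order are all assumptions chosen for illustration.

```python
from datetime import date, timedelta

# Hypothetical roster and start date; the names are placeholders.
ROSTER = ["alice", "bob", "carol"]
ROTATION_START = date(2017, 1, 2)  # a Monday, assumed to begin the first shift

# Overrides keyed by the Monday that starts the affected week (assumed granularity).
OVERRIDES = {date(2017, 1, 16): "bob"}


def week_start_for(day, start=ROTATION_START):
    """Return the first day of the rotation week that contains `day`."""
    weeks = (day - start).days // 7
    return start + timedelta(days=7 * weeks)


def on_call(day, roster=ROSTER, overrides=OVERRIDES, start=ROTATION_START):
    """Return who is on call on `day`, honoring any override for that week."""
    week_start = week_start_for(day, start)
    if week_start in overrides:
        return overrides[week_start]
    weeks = (week_start - start).days // 7
    return roster[weeks % len(roster)]


def escalation_chain(day, roster=ROSTER, overrides=OVERRIDES, start=ROTATION_START):
    """Primary first, then the rest of the roster in order, as a simple escalation path."""
    primary = on_call(day, roster, overrides, start)
    return [primary] + [person for person in roster if person != primary]
```

With this in place nobody edits a wiki page each week: the on-call person is derived from the date, and a substitution is a single entry in `OVERRIDES`.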

THE NATURE OF ALERTS

▸ OK, we can set up programmatic alerts. Now what?

▸ Integrating Zabbix, New Relic, and CloudWatch

▸ Discovering alert floods

▸ Move to alerting on symptoms, not causes

▸ …but still monitoring causes
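One way to picture the two lessons above, alert floods and paging on symptoms while still monitoring causes, is the sketch below. It is an illustration under stated assumptions, not the Star Tribune's implementation: the `Alert` fields, the `kind` labels, and the five-minute window are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    entity_id: str    # what the alert is about, e.g. "site-5xx-rate"
    timestamp: float  # seconds since epoch
    kind: str         # "symptom" (user-visible) or "cause" (internal signal)
    message: str = ""


def route(alert):
    """Page a human for symptoms; causes are recorded for later diagnosis."""
    return "page" if alert.kind == "symptom" else "record"


def suppress_flood(alerts, window_seconds=300):
    """Keep the first alert per entity_id in each window; count suppressed repeats."""
    last_kept = {}   # entity_id -> timestamp of the last alert let through
    kept = []
    suppressed = {}  # entity_id -> number of repeats dropped
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        previous = last_kept.get(alert.entity_id)
        if previous is not None and alert.timestamp - previous < window_seconds:
            suppressed[alert.entity_id] = suppressed.get(alert.entity_id, 0) + 1
            continue
        last_kept[alert.entity_id] = alert.timestamp
        kept.append(alert)
    return kept, suppressed
```

A repeated CPU alert within five minutes is counted rather than re-sent, and only the user-visible symptom wakes someone up.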

PHASE 2: THE STATUS SITE

THE SPIDEY-SENSE FACTOR

▸ Humans are good at catching certain kinds of problems

▸ “This doesn’t feel right” and gaps in monitoring

▸ The evolution of the Sev incident system

THE STATUS SITE: MANUAL ALERTING FOR NON-TECH USERS

▸ Want to let certain non-tech users report Sev incidents

▸ Initially just a password-protected form

▸ Uses the VictorOps alert ingestion API for triggering alerts

▸ Uses the VictorOps public API for fetching information

▸ Each Sev alert is created with its own entity_id

▸ Lets admin users share status updates
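A form handler along these lines could build the alert that the status site sends. This is a hedged sketch, not the Star Tribune's code. The URL template and the field names (`message_type`, `entity_id`, `entity_display_name`, `state_message`) follow the VictorOps REST endpoint integration as commonly documented, but treat them as assumptions and check the current documentation; the function names, severity scheme, and keys are hypothetical. The request is built but not sent.

```python
import json
import urllib.request
import uuid

# Assumed URL shape; the real URL and keys come from the integration settings.
ALERT_URL_TEMPLATE = (
    "https://alert.victorops.com/integrations/generic/20131114/alert/"
    "{api_key}/{routing_key}"
)


def build_sev_alert(severity, summary, reporter):
    """Build one alert payload; every Sev report gets its own entity_id."""
    return {
        "message_type": "CRITICAL",
        "entity_id": f"sev{severity}-{uuid.uuid4()}",
        "entity_display_name": f"Sev {severity}: {summary}",
        "state_message": f"Reported via status site by {reporter}: {summary}",
    }


def build_request(api_key, routing_key, payload):
    """Prepare (but do not send) the POST request carrying the alert."""
    url = ALERT_URL_TEMPLATE.format(api_key=api_key, routing_key=routing_key)
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the result to `urllib.request.urlopen` would send it. A fresh `entity_id` per report keeps two unrelated Sev incidents from being merged into one.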

MONTHLY INCIDENT REPORTING

▸ Monthly reports include a list of all Sev incidents, when they started, when they ended, what the alert text was, and what the resolution was

▸ Combine automated and chat messages in VictorOps with data gathered from other sources

▸ Present this data as automatically as possible in the Status Site
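The monthly report described above could be assembled roughly as follows. This is a sketch with a hypothetical `Incident` record; the fields mirror the items listed on the slide (start, end, alert text, resolution), and the mean time to repair line is an addition for illustration rather than something the slide specifies.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Incident:
    entity_id: str
    started: datetime
    ended: datetime
    alert_text: str
    resolution: str


def duration_minutes(incident):
    """Whole minutes between the start and end of an incident."""
    return int((incident.ended - incident.started).total_seconds() // 60)


def monthly_report(incidents, year, month):
    """Return a plain-text report of the Sev incidents that started in the month."""
    rows = [i for i in incidents if (i.started.year, i.started.month) == (year, month)]
    rows.sort(key=lambda i: i.started)
    lines = [f"Sev incidents for {year}-{month:02d}: {len(rows)}"]
    for i in rows:
        lines.append(
            f"- {i.started:%Y-%m-%d %H:%M} to {i.ended:%Y-%m-%d %H:%M} "
            f"({duration_minutes(i)} min) | {i.alert_text} | {i.resolution}"
        )
    if rows:
        mean = sum(duration_minutes(i) for i in rows) / len(rows)
        lines.append(f"Mean time to repair: {mean:.0f} min")
    return "\n".join(lines)
```

The incident list itself would come from whatever source holds the data, for example records fetched through the VictorOps public API mentioned earlier.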

PHASE 3: EVOLUTION

NEXT STEPS

▸ Integration of summarized data collected from Datadog/CloudWatch/etc. into incident reporting

▸ Reports for users that shouldn’t have access to VictorOps

▸ Integration of the Status Site into Slack

▸ @bovermyer

▸ benovermyer.com

Q&A

BREAK TIME

#DevOpsRoadTrip

Breakout Sessions

◻ ChatOps - Jason Hand

◻ Leveraging Data to Establish a Healthy Culture - Andy Domeier

◻ Monitoring and Microservices – Bridget Kromhout

◻ Blameless Culture – Heather Mickman

◻ Devs vs. Ops On-Call, How and Why to Get started – Ben Overmyer

#DevOpsRoadTrip


Heather Mickman | Target Senior Director of Platform Engineering

• Heather Mickman is the Senior Director of Platform Engineering at Target and a DevOps enthusiast.

• Heather has 20+ years of IT experience in various roles and industries including retail, transportation, and high tech manufacturing.

• She is currently working on building the platforms used by software engineers at Target including a multi-provider cloud platform, API Gateway, telemetry tooling, data stores, and messaging.

• She has a passion for technology, building high performing teams, driving a culture of innovation, and having fun along the way. Heather lives in Minneapolis with her 2 sons and mini dachshund.

#DevOpsRoadTrip

Q&A

Automation

Awareness

Collaboration

Documentation

User Empathy

Learning

jhand.co/DRT_MSP


How do you Score?

[Figure: Incident Management Maturity. Axes: Time To Repair (TTR) against Continuous Improvement Efforts.]

Reactive (0 – 4) (chaotic)

Tactical (5 – 9) (obvious)

Integrated (10 – 14) (complicated)

Strategic (15 – 18) (complex)
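The score ranges on this slide can be turned into a small lookup. The stage names and ranges come from the slide; how the total is reached is not stated, so the sketch assumes a rating of 0 to 3 in each of the six categories named earlier (which gives the same maximum of 18). That scoring scheme is an assumption, not the speaker's method.

```python
# Stage boundaries as shown on the slide.
STAGES = [
    (0, 4, "Reactive (chaotic)"),
    (5, 9, "Tactical (obvious)"),
    (10, 14, "Integrated (complicated)"),
    (15, 18, "Strategic (complex)"),
]

# The six categories from the talk; rating each 0-3 is an assumed scheme.
CATEGORIES = [
    "Automation",
    "Awareness",
    "Collaboration",
    "Documentation",
    "User Empathy",
    "Learning",
]


def total_score(ratings):
    """Sum per-category ratings after checking every category is rated 0-3."""
    for category in CATEGORIES:
        if category not in ratings:
            raise ValueError(f"missing rating for {category}")
        if not 0 <= ratings[category] <= 3:
            raise ValueError(f"rating for {category} must be between 0 and 3")
    return sum(ratings[category] for category in CATEGORIES)


def maturity_stage(score):
    """Map a total score (0-18) to the maturity stage named on the slide."""
    for low, high, name in STAGES:
        if low <= score <= high:
            return name
    raise ValueError("score must be between 0 and 18")
```

For example, a team rating itself 2 in every category totals 12 and lands in the Integrated stage.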

RAFFLE TIME

#DevOpsRoadTrip

DENVER - SEATTLE - SAN FRANCISCO - MINNEAPOLIS - NEW YORK CITY