Incident Management with Workflows

Patrick Hoolboom September 22, 2016 Incident Management with Workflows © 2016 BROCADE COMMUNICATIONS SYSTEMS, INC. INTERNAL USE ONLY

Upload: patrick-hoolboom

Post on 17-Jan-2017


TRANSCRIPT

Page 1: Incident Management with Workflows

Patrick Hoolboom September 22, 2016

Incident Management with Workflows


Page 2: Incident Management with Workflows

What is a Workflow?


Page 3: Incident Management with Workflows

What Is a Workflow?

• A sequence of processes through which a piece of work passes from initiation to completion

• Process as Code

• Living Documentation

– Document your process in an easily human-readable, executable format
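The "process as code" idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Workflow` class, the step functions, and the `web301` host are hypothetical names, not the StackStorm or Mistral API. Each step is a plain function, the workflow is the ordered list of steps, and the list itself doubles as readable documentation of the process.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical illustration only; not the StackStorm or Mistral API.
Step = Callable[[dict], dict]


@dataclass
class Workflow:
    name: str
    steps: list[tuple[str, Step]]
    audit: list[str] = field(default_factory=list)

    def run(self, context: dict) -> dict:
        """Run each step in order, merging its output into the context."""
        for step_name, step in self.steps:
            context = {**context, **step(context)}
            self.audit.append(step_name)  # record what was done, in order
        return context


def open_ticket(ctx: dict) -> dict:
    return {"ticket": f"Investigate {ctx['host']}"}


def notify(ctx: dict) -> dict:
    return {"notified": True}


workflow = Workflow("example", [("open_ticket", open_ticket), ("notify", notify)])
result = workflow.run({"host": "web301"})
```

Reading the `steps` list top to bottom tells a newcomer what the process is, and `workflow.audit` records which steps actually ran.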


Page 4: Incident Management with Workflows

Event Driven Automation 2.0


FBAR (saving 1532 hours/day)

Naoru

Nurse

Winston (powered by StackStorm)

Azure Automation

Mistral workflow service

StackStorm automation platform

[Diagram: Observe, Orient, Decide, Act loop]

Page 5: Incident Management with Workflows

When to use a workflow


Page 6: Incident Management with Workflows

When to use a workflow

• Clearly defined process

• When multiple systems or services need to be touched

• Frequently performed tasks


Page 7: Incident Management with Workflows

Why Use Workflows?


Page 8: Incident Management with Workflows

Why Use Workflows?

• Consistency

– Trust that your automations will perform the same tasks every time for a given event

• Speed

– Reduce time to resolution for an incident

• Audit

– Creates a clear audit trail of what was done and when

• Connect Disparate Systems


Page 9: Incident Management with Workflows

Tools…


Page 10: Incident Management with Workflows

What Can Be Automated?

• Security checks

– On malware detection in a VM, isolate the network port on a switch

• Blue-green app deployment

– On Jenkins tests passed, bring up a new VM cluster, deploy and configure the app, set the load balancer to send a % of traffic to the new app, monitor, roll forward, or back out

• Networking

– On BGP peer goes down: collect troubleshooting data, post on Slack, and create a JIRA ticket

– On link aggregation member error, check load; if the capacity of the rest of the LAG bundle is enough, disable the link with the error

• Restart a down service

– On monitoring event, bounce a service

• OpenStack orphan VM clean-up

– On orphans detected, shut down, email owner, keep for a few days, delete

• NFV

– Nokia, AT&T, with Mistral and OpenStack

• OpenStack VM evacuation on hardware failures

– On host RAID failure, get list of impacted VMs, email VM owners, evacuate VMs, create JIRA ticket for hardware replacement

• Cassandra “node down” recovery

– Replace a node on alert

• Clean up disk space

– On monitoring event, clean up disk space
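The link aggregation item above contains a small decision: only disable the faulty member if the remaining links can carry the current load. A rough Python sketch of that check follows. The function name, the link names, the Gbps figures, and the 80% headroom factor are all assumptions made for illustration; the slide does not specify any threshold.

```python
def can_disable_link(link_capacities_gbps, failing_link, current_load_gbps, headroom=0.8):
    """Return True if the rest of the LAG bundle can absorb the current load.

    headroom is an assumed safety factor (not from the slides): only this
    fraction of the remaining capacity is treated as usable.
    """
    remaining = sum(
        capacity
        for name, capacity in link_capacities_gbps.items()
        if name != failing_link
    )
    return current_load_gbps <= remaining * headroom


# Hypothetical bundle of three 10 Gbps members; eth2 is reporting errors.
bundle = {"eth1": 10, "eth2": 10, "eth3": 10}
safe_at_12 = can_disable_link(bundle, "eth2", current_load_gbps=12)
safe_at_18 = can_disable_link(bundle, "eth2", current_load_gbps=18)
```

With 20 Gbps remaining and 80% of it treated as usable, a 12 Gbps load passes the check and an 18 Gbps load does not, so the second case would be left for a human.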


Page 11: Incident Management with Workflows

StackStorm


Page 12: Incident Management with Workflows

Architecture

[Architecture diagram]

Clients: Web GUI, CLI, ChatOps

Platform: REST API, Rules Engine (IFTTT.yml), Workflow Engine, KV Store, Sensor Containers, Action Runners, AMQP message bus, to Audit…

Plugins: Sensor Plugins (inbound integrations), Action Plugins (outbound integrations)

Master Content Repo
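The "IFTTT.yml" label on the rules engine suggests if-this-then-that rules: an inbound trigger, optional criteria, and an action to run. The Python below is a loose sketch of that matching step only. The `Rule` class, the `dispatch` function, and the trigger names are hypothetical; real StackStorm rules are YAML files, and this does not reproduce their schema.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rule:
    trigger: str                      # "if this": name of the inbound event
    criteria: Callable[[dict], bool]  # extra condition on the event payload
    action: Callable[[dict], str]     # "then that": what to run


def dispatch(rules, trigger, payload):
    """Run the action of every rule whose trigger and criteria match."""
    return [
        rule.action(payload)
        for rule in rules
        if rule.trigger == trigger and rule.criteria(payload)
    ]


# Hypothetical rules; actions return a description instead of doing work.
rules = [
    Rule(
        "monitoring.service_down",
        lambda p: p["service"] == "nginx",
        lambda p: f"restart {p['service']} on {p['host']}",
    ),
    Rule(
        "monitoring.low_disk",
        lambda p: True,
        lambda p: f"clean up disk on {p['host']}",
    ),
]

results = dispatch(rules, "monitoring.service_down", {"service": "nginx", "host": "web301"})
```

In the diagram's terms, a sensor would emit the trigger, the rules engine would do the matching, and an action runner (or the workflow engine) would execute the result.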

Page 13: Incident Management with Workflows

Workflow Design Patterns

● Diagnostic Workflows

● Remediation Workflows


Page 14: Incident Management with Workflows

Workflow Design Patterns

Diagnostic Workflows

• Troubleshooting and data gathering steps

• No remediations or changes to the system

• Good way to “get your feet wet” with workflows


Page 15: Incident Management with Workflows

Workflow Design Patterns

Remediation Workflows

• Fix the issue!

• Should be triggered after diagnostic workflows if applicable


Page 16: Incident Management with Workflows

Workflow Use Cases During an Incident

● Facilitated Troubleshooting

● Auto-Remediation


Page 17: Incident Management with Workflows

Workflow Use Cases

Facilitated Troubleshooting

• Useful if you don’t quite trust the automation

– Gain confidence in your workflows

• Faster Time to Resolution

• Consistent Data Collection

• Diagnostic workflow with notifications

– Send data to user via

• Email

• Chat

• Ticketing System
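A diagnostic workflow with notifications might look roughly like the Python below: run read-only checks, format one report, and send it to every channel. Everything here is a stand-in chosen for illustration. The check outputs are canned strings, and the email, chat, and ticket "notifiers" just append to a list rather than calling any real service.

```python
def run_diagnostics(host, checks):
    """Run read-only checks and collect their output; nothing is changed."""
    return {name: check(host) for name, check in checks.items()}


def format_report(host, findings):
    """Produce the same report layout every time, for consistent data."""
    lines = [f"Diagnostics for {host}:"]
    lines += [f"- {name}: {value}" for name, value in findings.items()]
    return "\n".join(lines)


def notify_all(report, notifiers):
    """Send the same report to every configured channel."""
    for send in notifiers:
        send(report)


# Hypothetical checks returning canned data instead of querying a host.
checks = {
    "disk_usage": lambda host: "91% used on /",
    "largest_dir": lambda host: "/var/log",
}

# Stand-in notifiers: each records what it would have sent.
outbox = []
notifiers = [
    lambda report: outbox.append(("email", report)),
    lambda report: outbox.append(("chat", report)),
    lambda report: outbox.append(("ticket", report)),
]

report = format_report("web301", run_diagnostics("web301", checks))
notify_all(report, notifiers)
```

Because the workflow only gathers and reports, an engineer stays in control of the fix while still getting the data faster and in the same shape each time.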


Page 18: Incident Management with Workflows

Workflow Use Cases

Auto-Remediation

• Trusted Automation

– Will make automated changes to the system

• Much Faster Time to Resolution

• Consistent Solutions

• Less Pager Fatigue


Page 19: Incident Management with Workflows

Example

● Low Disk Space Event


Page 20: Incident Management with Workflows

Automation Example


[Diagram components: Service, Monitoring, Automation, Incident Management, Engineer]

Event: “low disk on web301”

Web301 is “low disk”

Resolve known cases, fast. Is it /var/log? Clean up!

Unknown problem, need a human

Wake up, buddy. Something real is going on…
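The decision in this example can be sketched as a single function: if the low-disk cause is the known case (/var/log), clean it up automatically; otherwise hand off to a human. The function names and the `cleanup` and `page_engineer` callbacks are hypothetical placeholders, and how the largest directory is found is assumed to happen in an earlier diagnostic step.

```python
def is_known_case(largest_dir):
    """The only known case in this sketch: logs filling the disk."""
    return largest_dir == "/var/log" or largest_dir.startswith("/var/log/")


def handle_low_disk(host, largest_dir, cleanup, page_engineer):
    """Remediate the known case automatically; escalate anything else."""
    if is_known_case(largest_dir):
        cleanup(host, largest_dir)
        return "remediated"
    page_engineer(
        f"low disk on {host}: unknown cause, largest directory is {largest_dir}"
    )
    return "escalated"


# Stand-in callbacks that record calls instead of touching a real system.
cleaned, pages = [], []


def record_cleanup(host, path):
    cleaned.append((host, path))


outcome_known = handle_low_disk("web301", "/var/log", record_cleanup, pages.append)
outcome_unknown = handle_low_disk("web301", "/home/app/data", record_cleanup, pages.append)
```

Only the unknown case reaches `page_engineer`, which is the point of the slide: the engineer is woken up when something real is going on.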

Page 21: Incident Management with Workflows


Page 22: Incident Management with Workflows

Thank You!

● Email: [email protected]

● Twitter: @DoriftoShoes
