Patrick Hoolboom September 22, 2016
Incident Management with Workflows
© 2016 BROCADE COMMUNICATIONS SYSTEMS, INC. INTERNAL USE ONLY
What Is a Workflow?
• A sequence of processes through which a piece of work passes from initiation to completion
• Process as Code
• Living Documentation
– Document your process in an easily human readable, executable format
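A minimal sketch of "process as code", assuming a workflow is just an ordered list of steps sharing one context. The `Workflow` class and the three step functions are hypothetical names made up for illustration, not part of any tool named in this deck.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Workflow:
    """An ordered list of named steps that share one context dictionary."""
    name: str
    steps: List[Callable[[Dict], None]] = field(default_factory=list)

    def describe(self) -> List[str]:
        # The step names double as human-readable documentation of the process.
        return [step.__name__ for step in self.steps]

    def run(self, context: Dict) -> Dict:
        # Pass the piece of work through every step, from initiation to completion.
        for step in self.steps:
            step(context)
        return context


# Hypothetical steps, for illustration only.
def open_ticket(context):
    context["ticket"] = "opened"


def collect_logs(context):
    context["logs"] = ["collected"]


def close_ticket(context):
    context["ticket"] = "closed"


workflow = Workflow("handle_request", [open_ticket, collect_logs, close_ticket])
print(workflow.describe())
print(workflow.run({}))
```

The same object is both the documentation (`describe`) and the executable process (`run`), which is the point of the "living documentation" bullet.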
Event Driven Automation 2.0
[Diagram: event-driven automation tools: FBAR (saving 1532 hours/day), Naoru, Nurse, Winston (powered by StackStorm), Azure Automation, Mistral workflow service, StackStorm automation platform. Loop labels: Observe, Orient, Decide, Act.]
When to use a workflow
• Clearly defined process
• When multiple systems or services need to be touched
• Frequently performed tasks
Why Use Workflows?
• Consistency
– Trust that your automations will perform the same tasks every time for a given event
• Speed
– Reduce time to resolution for an incident
• Audit
– Creates a clear audit trail of what was done when
• Connect Disparate Systems
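The consistency and audit points can be sketched as a small runner that executes the same steps for a given event and records what was done when. This is an illustrative assumption about how such a trail might look; `run_with_audit`, `check_service`, and `restart_service` are hypothetical names.

```python
from datetime import datetime, timezone


# Hypothetical steps, for illustration only.
def check_service(event):
    return f"checked {event['service']}"


def restart_service(event):
    return f"restarted {event['service']}"


def run_with_audit(steps, event):
    """Run every step in order and record what was done and when."""
    audit_trail = []
    for step in steps:
        started_at = datetime.now(timezone.utc).isoformat()
        result = step(event)
        audit_trail.append(
            {"step": step.__name__, "started_at": started_at, "result": result}
        )
    return audit_trail


trail = run_with_audit([check_service, restart_service], {"service": "nginx"})
for entry in trail:
    print(entry["started_at"], entry["step"], entry["result"])
```

Running it twice for the same event yields the same steps and results; only the timestamps differ.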
Tools…
What Can Be Automated?
• Security checks
– On malware detection in a VM, isolate network port on a switch
• Blue-green app deployment
– On Jenkins tests passed, bring up new VM cluster, deploy and configure app, set load balancer to send % of traffic to new app, monitor, roll forward, or back out
• Networking
– On BGP peer goes down: collect troubleshooting data, post on Slack & create JIRA ticket
– On link aggregation member error, check load; if capacity of rest of LAG bundle is enough, disable link with error
• Restart a down service
– On monitoring event, bounce a service
• OpenStack orphan VM clean-up
– On orphans detected, shut down, email owner, keep for a few days, delete
• NFV:
– Nokia, AT&T, with Mistral and OpenStack
• OpenStack VM evacuation on hardware failures
– On host RAID failure, get list of impacted VMs, email VM owners, evacuate VMs, create JIRA ticket for hardware replacement.
• Cassandra “node down” recovery
– Replace a node on alert
• Clean up disk space
– On monitoring event, clean up disk space
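As one worked case from the list above, the link aggregation check could be sketched as follows. The function name, the Gbps units, and the rule "remaining capacity must cover current load" with no extra headroom are all assumptions for illustration; a real policy would likely keep a safety margin.

```python
def should_disable_link(member_capacity_gbps, failing_member, current_load_gbps):
    """Return True if the rest of the LAG bundle can carry the current load."""
    remaining_capacity = sum(
        capacity
        for member, capacity in member_capacity_gbps.items()
        if member != failing_member
    )
    return remaining_capacity >= current_load_gbps


# Hypothetical three-member bundle of 10 Gbps links.
bundle = {"eth1": 10, "eth2": 10, "eth3": 10}

print(should_disable_link(bundle, "eth2", current_load_gbps=15))
print(should_disable_link(bundle, "eth2", current_load_gbps=25))
```

With 20 Gbps left after removing eth2, a 15 Gbps load allows the link to be disabled and a 25 Gbps load does not.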
StackStorm
Architecture
[Architecture diagram.
Clients: Web GUI, CLI, ChatOps
Platform: REST API, Rules Engine (IFTTT.yml), Workflow Engine, KV Store, Sensor Containers, Action Runners, AMQP message bus, Master Content Repo, "to Audit…"
Plugins: Sensor Plugins (inbound integrations), Action Plugins (outbound integrations)]
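The if-this-then-that flow suggested by the diagram (a sensor emits a trigger, the rules engine matches it, an action runs) might be sketched as below. This is plain Python with hypothetical names and data shapes; it is not StackStorm's actual rule format or API.

```python
def criteria_match(criteria, payload):
    """A rule matches when every criteria key equals the payload value."""
    return all(payload.get(key) == expected for key, expected in criteria.items())


def dispatch(rules, actions, trigger_type, payload):
    """Run the action of every rule whose trigger and criteria match."""
    executed = []
    for rule in rules:
        if rule["trigger"] == trigger_type and criteria_match(rule["criteria"], payload):
            actions[rule["action"]](payload)
            executed.append(rule["name"])
    return executed


# Hypothetical rule and action registry, for illustration only.
action_log = []
actions = {"cleanup_disk": lambda payload: action_log.append(("cleanup_disk", payload["host"]))}
rules = [
    {
        "name": "low_disk_cleanup",
        "trigger": "monitoring.low_disk",
        "criteria": {"severity": "warning"},
        "action": "cleanup_disk",
    }
]

event = {"host": "web301", "severity": "warning"}
print(dispatch(rules, actions, "monitoring.low_disk", event))
print(action_log)
```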
Workflow Design Patterns
● Diagnostic Workflows
● Remediation Workflows
Workflow Design Patterns: Diagnostic Workflows
• Troubleshooting and data gathering steps
• No remediations or changes to the system
• Good way to “get your feet wet” with workflows
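A diagnostic step could look like the sketch below: it only reads system state and returns a report, changing nothing. The function name and report fields are assumptions for illustration; `shutil.disk_usage` and `platform.node` are standard-library calls.

```python
import platform
import shutil


def diagnose_disk(path="/"):
    """Gather disk usage for a path without modifying the system."""
    usage = shutil.disk_usage(path)
    # Guard against a zero-sized filesystem to avoid dividing by zero.
    percent_used = round(100 * usage.used / usage.total, 1) if usage.total else 0.0
    return {
        "host": platform.node(),
        "path": path,
        "total_bytes": usage.total,
        "used_bytes": usage.used,
        "free_bytes": usage.free,
        "percent_used": percent_used,
    }


report = diagnose_disk("/")
for key, value in report.items():
    print(f"{key}: {value}")
```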
Workflow Design Patterns: Remediation Workflows
• Fix the issue!
• Should be triggered after diagnostic workflows if applicable
Workflow Use Cases During an Incident
● Facilitated Troubleshooting
● Auto-Remediation
Workflow Use Cases: Facilitated Troubleshooting
• Useful if you don’t quite trust the automation
– Gain confidence in your workflows
• Faster Time to Resolution
• Consistent Data Collection
• Diagnostic workflow with notifications
– Send data to user via
  • Chat
  • Ticketing System
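Facilitated troubleshooting can be sketched as "format the diagnostic data once, send it to every channel". The `notify` function and the two senders are hypothetical stand-ins; a real setup would call the chat and ticketing integrations instead of appending to lists.

```python
def notify(report, senders):
    """Format a diagnostic report and deliver it through each sender."""
    # Sorting the keys keeps the message layout identical on every run.
    message = "\n".join(f"{key}: {value}" for key, value in sorted(report.items()))
    delivered = []
    for channel, send in senders.items():
        send(message)
        delivered.append(channel)
    return delivered


# Hypothetical senders that just capture the message, for illustration only.
chat_messages = []
ticket_comments = []
senders = {"chat": chat_messages.append, "ticketing": ticket_comments.append}

sample_report = {"host": "web301", "percent_used": 97.2, "path": "/var/log"}
print(notify(sample_report, senders))
print(chat_messages[0])
```

The engineer still decides what to do; the workflow only makes sure the same data shows up every time.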
Workflow Use Cases: Auto-Remediation
• Trusted Automation
– Will make automated changes to the system
• Much Faster Time to Resolution
• Consistent Solutions
• Less Pager Fatigue
Example
● Low Disk Space Event
Automation Example
[Diagram. Actors: Service, Monitoring, Automation, Incident Management, Engineer.
Event: “low disk on web301”
Web301 is “low disk”
Resolve known cases, fast. Is it /var/log? Clean up!
Unknown problem, need a human
Wake up, buddy. Something real is going on…]
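The decision in this example can be sketched as a single function: clean up the known case, escalate everything else. The function name, the event shape, and the injected `cleanup` and `page_engineer` callables are assumptions for illustration; how the offending directory is found is left out.

```python
def handle_low_disk(event, largest_dir, cleanup, page_engineer):
    """Resolve the known case automatically; hand unknown cases to a human."""
    if largest_dir == "/var/log":
        # Known case: log files filled the disk, so clean them up.
        cleanup(event["host"], largest_dir)
        return "remediated"
    # Unknown problem: wake up the engineer with what is known so far.
    page_engineer(f"{event['host']}: low disk, cause unknown ({largest_dir})")
    return "escalated"


# Hypothetical stand-ins that record calls instead of touching real systems.
cleaned = []
pages = []

event = {"host": "web301", "type": "low disk"}
print(handle_low_disk(event, "/var/log", lambda h, d: cleaned.append((h, d)), pages.append))
print(handle_low_disk(event, "/home", lambda h, d: cleaned.append((h, d)), pages.append))
```

Only the second call reaches a human, which is where the reduction in pager fatigue comes from.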
● Email: [email protected]
● Twitter: @DoriftoShoes
Thank You!