Site Reliability Engineering - USENIX
7/22/16 1
Greg Veith
Director, Microsoft Azure SRE
Site Reliability Engineering
They’re Alive!
Organizations Are Living Organisms
Evolution and Complexity
Azure Service Offerings
Scale
• % revenue from startups and ISVs
• k new Azure customer subscriptions/month
• Distinct Azure service offerings
• Datacenters
• 24 datacenter regions
• M messages per second processed by Azure IoT
Transformation
Learning Culture, Growth Mindset
Scaling Up Operational Models
Welcome To The Team!
SR-
North is…
Symptoms of Success
• Availability and reliability meet SLOs: defend customer trust
• Eliminate human touches to prod: toil elimination
• Speed up deployments: reduce inventory, ship fast, safely
All the above areas reinforce measurement. Reliability’s foundation.
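The first item above, availability meeting its SLO, comes down to simple arithmetic. Here is a minimal sketch in Python. It assumes a request-based availability measure and a hypothetical 99.9% target. Neither the target nor the function names come from the talk.

```python
def availability(good_requests: int, total_requests: int) -> float:
    """Fraction of requests served successfully (1.0 when there was no traffic)."""
    if total_requests == 0:
        return 1.0
    return good_requests / total_requests


def meets_slo(good_requests: int, total_requests: int,
              slo_target: float = 0.999) -> bool:
    """True when measured availability is at or above the SLO target.

    The 0.999 default is an illustrative assumption, not a figure from the slides.
    """
    return availability(good_requests, total_requests) >= slo_target


if __name__ == "__main__":
    # Hypothetical counts for one measurement window.
    print(meets_slo(good_requests=999_500, total_requests=1_000_000))
```

Measuring this ratio continuously is one way to make "defend customer trust" checkable rather than aspirational.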
3 Strategic Pillars
Prove the model - Apply Principles
Start SRE at Microsoft - Establish Principles
Accelerate and improve - Scale the Principles
SRE Engagement Types
• Services at Planetary Scale (Ops Transformation at Scale): SRE develops solutions to close operational gaps, fire suppressant, iterate toward transformation
• Newer Service Facing Rapid Growth (Growth and Maturation): SRE attaches to team, develops targeted improvements to prepare for growth, get on call
• Greenfield Services or Redesign (Design and Architecture): Operability and continuous innovation, design for scale from the beginning
Production Readiness
3 Strategic Pillars
Prove the model - Pilots - Apply Principles
Start SRE at Microsoft - Establish Principles
Accelerate and improve - Scale the Principles
Service Facing Rapid Growth: Azure IoT
Established Service at Planetary Scale: Azure Storage
3 Prong Strategy
Prove the model - Pilots - Apply Principles
Start SRE at Microsoft - Establish Principles
Accelerate and improve - Scale the Principles
Production Virtuous Cycle
Goal: Enable this loop to run as fast and often as possible while maintaining SLOs
Code
Test
Deploy
Monitor, Measure, Alert
Mitigate
Restore
Post Mortem
Learn
SRE
SRE Areas of Focus
• Metrics and Monitoring: Instrumentation, SLOs, alarms, insights → actions
• Infrastructure Engineering: Tooling, infra for global optima
• Release Engineering: Change management, deployment
• Incident Response: Enough said
• Common Infrastructure: Integrating existing best-in-class infra
• Capacity & Fleet Management: Buildout, decomm, fleet understanding and mgmt
Metrics and Monitoring
Incident Response
Critical Moves, Learnings
Build and protect the SRE brand
Manage the change
Meet teams where they are
Grab a Shovel (and build a backhoe)
Find the bright spots