![Page 1: Everything breaks All the time - JanWiersmajanwiersma.nl/wp-content/uploads/2018/05/VU-lecture.pdf · Everything fails…all the time… •Hardware will fail. •Software will have](https://reader033.vdocuments.net/reader033/viewer/2022060207/5f03f1b17e708231d40b8aaf/html5/thumbnails/1.jpg)
Everything breaks all the time
Intro
• Previous IT-related jobs:
  • Dutch Police – enterprise IT
  • OCOM group – hosting & cloud
  • SDL – Software as a Service
  • Spoken Communications – Call Center as a Service
History of Cloud backend
• Behind the website and apps
• Running into scale problems:
  • Cost
  • Complexity
  • Speed
Running at scale
• The ‘normal’ things didn’t work any more.
• New server hardware, networking, software
• New datacenters
• New economics
Running at scale
• Many layers, abstracted away
Large scale – everybody's problem now
• For cloud providers – millions of servers
• For cloud customers – millions of instances
Automation
• Large scale:
  • Deployment
  • Config
  • CI/CD
  • Capacity management
  • Monitoring
Everything fails… all the time…
• Hardware will fail.
• Software will have bugs.
• Force majeure will hit.
• Focus on restoring, not on preventing.
Figure: annualized failure rate (AFR) of hard disks
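The arithmetic behind an AFR figure can be sketched as follows. The fleet size and AFR value are illustrative assumptions, not numbers from the lecture, and the function name is hypothetical. The sketch only shows why a small per-disk failure rate becomes a daily event at large scale.

```python
def expected_failures_per_day(fleet_size: int, afr: float) -> float:
    """Expected number of disk failures per day for a fleet.

    afr is the annualized failure rate as a fraction (0.02 means 2% per year).
    Assumes failures are independent and spread evenly over the year.
    """
    if fleet_size < 0 or not 0.0 <= afr <= 1.0:
        raise ValueError("fleet_size must be >= 0 and afr between 0 and 1")
    return fleet_size * afr / 365.0


# Hypothetical numbers, chosen only for illustration.
FLEET_SIZE = 100_000  # disks in the fleet (assumption)
AFR = 0.02            # 2% annualized failure rate (assumption)

per_day = expected_failures_per_day(FLEET_SIZE, AFR)
print(f"Expected disk failures per day: {per_day:.1f}")
```

Under these assumed numbers the fleet loses roughly five disks every day, which is why the focus shifts from preventing failure to restoring quickly.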
Everything fails… all the time…
• Force majeure will hit: Amazon datacenter failure, 2010
Everything fails… all the time…
• Human error: Amazon S3 failure, 2017
Balancing cost and risk
- Cost goes up
- Complexity goes up
Outage!
• What happens?
  • Typical process (ITIL or similar)
  • Typical human behaviour
• What happens after?
  • Review process, retro or post mortem
  • Typical human behaviour
Retros
• When?
• Who?
• Structure:
  • Detailed timeline
  • Root cause
  • What went well
  • What could we have done better
  • Action items
• Automate the input (ChatOps)
• Importance of blameless retros
• Share retro output internally and externally
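The retro structure and the idea of automating its input from chat can be sketched as follows. This is a minimal sketch, not the tooling used in the lecture: the class names, the `timeline_from_chat` helper, and the `timestamp|author|message` export format are all hypothetical, and the sample messages are made up.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class TimelineEntry:
    timestamp: datetime
    author: str
    text: str


@dataclass
class Retro:
    """One retro record, mirroring the structure listed on the slide."""
    title: str
    timeline: list[TimelineEntry] = field(default_factory=list)
    root_cause: str = ""
    went_well: list[str] = field(default_factory=list)
    could_improve: list[str] = field(default_factory=list)
    action_items: list[str] = field(default_factory=list)


def timeline_from_chat(lines: list[str]) -> list[TimelineEntry]:
    """Parse 'ISO-timestamp|author|message' lines (a hypothetical export
    format) into a timeline sorted by time."""
    entries = []
    for line in lines:
        stamp, author, text = line.split("|", 2)
        entries.append(TimelineEntry(datetime.fromisoformat(stamp), author, text))
    return sorted(entries, key=lambda e: e.timestamp)


# Made-up chat export; roles instead of names keep the record blameless.
chat_export = [
    "2018-05-01T10:05:00|oncall-2|rolled back the last deploy",
    "2018-05-01T10:02:00|oncall-1|alert: API error rate rising",
]

retro = Retro(title="API outage (example)", timeline=timeline_from_chat(chat_export))
for entry in retro.timeline:
    print(entry.timestamp.isoformat(), entry.author, entry.text)
```

Filling the timeline from the chat log leaves the retro meeting free to discuss root cause, what went well, and action items, rather than reconstructing who did what and when.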
Materials / references
• Barroso, Clidaras, Hölzle: The Datacenter as a Computer. 2009/2013
• Mark Burgess: In Search of Certainty. 2013
• D. D. Woods: STELLA: Report from the SNAFUcatchers Workshop on Coping With Complexity. 2017