OpenStack - getting it all up magically - and when the magic fails
TRANSCRIPT
Magic of OpenStack..
Anshu Prateek
23/01/16 Footnote 2
OpenStack..
Getting it all up magically..
And..
When the magic fails :!
Me..
https://about.me/anshuprateek
large scale environments for the last seven years. Starting with Yahoo! search, where I spent the first four years - moment of truth - I have “touched” each and every machine in the search inventory!
Next stopover: Aerospike. A million TPS in less than a millisecond!
Reliance Jio – Cloud efforts
OpenStack..
Millions of AT&T wireless subscribers are connected to virtualized network
services ... based on OpenStack.
- John Donovan, Senior Executive Vice President of Technology and Network Operations, AT&T
Deployment
Dev setup – devstack
Prod –
Puppet
Fuel
Ansible
Deployment
It's like piloting a plane
Despite so much technology and automation it still needs “pilots”
Pilots are expensive and niche!
What we need is driverless cars!
A system which can set itself up
Fix the problems itself
Experience
Are we ready to trust driverless cars?
Would still need active monitoring and on-calls
A breakdown can only be “self-healed” so far
Google cars have done 1.2M miles
10K each week
Experience
Completely automated OpenStack setup
40-node labs
With known failure points
CI-CD enabled
1 build every 40 minutes
+ Manual builds by ~50 engineers
How?
A Puppet module
Which manages other Puppet modules
And uses itself.. as a module of itself..!
++++++++[>++++[>++>+++>+++>+<<<<-]>+>+>->>+[<]<-]>>.>---.+++++++..+++.>>.<-.<.+++.------.--------.>>+.>++.
Brainfuck
https://en.wikipedia.org/wiki/Brainfuck
(Nothing to do with this talk – I just wanted to say that out loud without having to beep myself :D )
Puppet module
https://github.com/JioCloud/puppet-rjil
Puppet – R – J – I – L
or
Puppet - Rijil
How it works
Masterless – standalone setup
Each node runs the same copy of the Puppet code
Depending on the node name, the appropriate includes are done
Hiera for more segregation
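A minimal Python sketch of the idea on this slide, not the actual puppet-rjil code: every node holds the same role table, derives its role from its own hostname, and then resolves settings through a Hiera-style "most specific layer first" lookup. The role names, class names and the node/role/common hierarchy are assumptions made up for illustration.

```python
import re

# Hypothetical role table: hostname prefix -> classes that node would include.
ROLE_INCLUDES = {
    "ctrl": ["keystone", "glance", "nova_controller"],
    "cp": ["nova_compute"],
    "st": ["ceph_osd"],
}


def role_of(hostname):
    """Derive the role from the node name, e.g. 'ctrl1.example.com' -> 'ctrl'."""
    short = hostname.split(".", 1)[0]
    return re.sub(r"\d+$", "", short)


def includes_for(hostname):
    """Every node runs this same code; only the hostname differs."""
    return ROLE_INCLUDES.get(role_of(hostname), [])


def hiera_lookup(key, hostname, layers):
    """Hiera-style priority lookup: node layer, then role layer, then common."""
    for scope in (hostname, role_of(hostname), "common"):
        if key in layers.get(scope, {}):
            return layers[scope][key]
    raise KeyError(key)
```

With this shape, adding a node means giving it a name that matches a role; no central master has to be told about it.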
How it works
Entry point – site.pp
CI-CD perspective – minimum of 13 nodes
The “nodes” can be VMs on top of OpenStack
Or Vagrant
Or Docker (WIP)
How it works
Consul for service discovery
Local build and packaging system for running our own code
Everything is packaged – no git clones!
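To make the service-discovery bullet concrete, here is a small Python sketch of how a node might turn a Consul catalog response into usable endpoints. It assumes the standard Consul HTTP catalog endpoint on the local agent's default port; the service name, node names and addresses in the sample payload are invented, and the talk does not say which Consul interface (HTTP or DNS) the setup actually uses.

```python
import json


def catalog_url(service, agent="http://127.0.0.1:8500"):
    """URL of Consul's catalog endpoint for one service (local agent assumed)."""
    return f"{agent}/v1/catalog/service/{service}"


def endpoints(catalog_json):
    """Turn a catalog response body into a list of (address, port) pairs."""
    result = []
    for entry in json.loads(catalog_json):
        # An empty ServiceAddress means the node's own address applies.
        host = entry.get("ServiceAddress") or entry["Address"]
        result.append((host, entry["ServicePort"]))
    return result


# Invented sample payload, shaped like a catalog response.
SAMPLE = json.dumps([
    {"Node": "ctrl1", "Address": "10.0.0.11", "ServiceName": "keystone",
     "ServiceAddress": "", "ServicePort": 5000},
    {"Node": "ctrl2", "Address": "10.0.0.12", "ServiceName": "keystone",
     "ServiceAddress": "10.0.0.99", "ServicePort": 5000},
])

print(catalog_url("keystone"))
print(endpoints(SAMPLE))
```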
Development model
No branches in central repo
Fork and branch and develop and test
PR
Review, test and merge
Similar to the OpenStack model
Substitutions
Storage – Ceph
Boot from volume only, to support migration
Network – OpenContrail
Consul for local DNS and service discovery and rudimentary alerts console
The magic
Please test this
Let me go for a coffee and a chat
Do some more chatting
~40 minutes – result
If pass → code merged
Acceptance test
Jorc trigger_update `date +%s`
Deployed!
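The flow on this slide can be sketched as a simple gate. This is an illustrative Python outline, not the project's pipeline: the four callables are hypothetical stand-ins for the real test, merge, acceptance and update steps, and it assumes the acceptance test runs after the merge, as the slide order suggests. The version passed to the update step is the Unix epoch in seconds, which is what `date +%s` prints.

```python
import time


def gate(change, run_tests, merge, acceptance, trigger_update, now=time.time):
    """Push one change through the gate and return the state it reached."""
    if not run_tests(change):
        return "rejected"            # tests failed: nothing is merged
    merge(change)
    if not acceptance():
        return "merged_not_deployed"
    version = int(now())             # epoch seconds, like `date +%s`
    trigger_update(version)
    return "deployed"
```

Using a timestamp as the version gives every deployment a monotonically increasing number without any bookkeeping.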
The beauty
Local check scripts
If a check fails → try to self heal
Works most of the time
Except you can't self-heal a bleeding wound :!
At times it may aggravate the problem
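A rough Python sketch of the check-then-heal loop described above, assuming each check and each heal action is a plain callable (both hypothetical here). Capping the number of heal attempts is one way to limit the "aggravate the problem" case: after the cap, the node stops trying and hands over to a human.

```python
def check_and_heal(check, heal, max_attempts=2):
    """Run a local check; on failure try a bounded number of heal actions.

    Returns 'ok', 'healed', or 'escalate' (a person needs to look).
    """
    if check():
        return "ok"
    for _ in range(max_attempts):
        heal()
        if check():
            return "healed"
    # Stop here instead of retrying forever and making things worse.
    return "escalate"
```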
The beast
Added a comma
Please test this
40 minutes
It failed!
But not my code
Can't merge yet :(
Wolves
Labs is broken
I know the fix :D
Oops, “please test this” fails (see point 1)
Can't merge
Argh – ergo labs is still broken :(
Yet another wolf..
Labs is broken due to network
I am a storage developer
Labs is broken
Can't merge code :(
:(
:(
Why did we do so?
Developer inertia for known issues
Prioritising features over fixes
This system ensures nothing gets merged till it's all fixed
Problems
Too many wasted man-hours till it's fixed
Too few people who could actually fix it :(
New attempt
Make it decentralised
Microservices
Everyone is a customer of everyone else
Outages become component-wise rather than labs-wise
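The difference between a labs-wide gate and a component-wise gate can be shown with a toy Python model. It assumes, as a simplification, that each component's tests depend only on the components the change touches; real components have cross-dependencies, so this is an idealised picture. The component names are examples.

```python
def can_merge_labwide(change_components, broken):
    """One shared lab: any broken component fails every test run."""
    return not broken


def can_merge_per_component(change_components, broken):
    """Per-component gate: only changes touching a broken component are blocked."""
    return not (set(change_components) & set(broken))


broken = {"network"}
# The storage developer from the earlier slide is blocked only in the shared lab.
print(can_merge_labwide(["storage"], broken))        # False
print(can_merge_per_component(["storage"], broken))  # True
```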
Little Red Riding Hood
https://github.com/JioCloud/puppet-rjil-keystone
Some teams are going with Ansible as well as other routes
Learnings
Big automated systems are fun
Scalable
But need coordination between teams (and components)
Avoid feedback-loop-blockers
Monolithic systems are a pain to work on
Kernel developers – we feel you
Learnings
CI-CD helps (and works!)
Self-healing needs to be done with a surgical knife (or a laser cut) rather than a butcher's knife
Band-aid at best – but could hold the dam!
Micro-services are better
Nimble and agile
Though tough to get the big-picture
Move fast – break often