infrastructure migration
DESCRIPTION
Slideshow from a (nonrecorded) talk I gave at the Columbus, Ohio LOPSA chapter.
TRANSCRIPT
Infrastructure Migrations
How many infrastructure migrations have I done? I’m not sure. I stopped counting around 5.
One of the benefits of working for a small company that’s growing quickly is that you get to experience a lot of new things...and moving production and office environments is one of them.
Thursday, August 2, 12
I am: Matt Simmons
• 10+ year sysadmin
• Small infrastructures
• 6+ infrastructure migrations
• http://www.standalone-sysadmin.com
You probably know this...
This is:
Infrastructure Migrations
10,000ft view
• Pre-Planning
• Execution
• Post-Mortem
Like most things, 90% of the work is planning.
The other 90% is lifting heavy things.
There’s another 10-25% reserved for figuring out what went wrong, and determining how to make it not happen again.
Considerations: Types of Migrations
• Build in parallel
• Move Infrastructure
• Hybrid
You really, really want to build in parallel. Sure it’s expensive, but it means much, much shorter periods of downtime.
Moving an infrastructure is hair-raising, because there are only a few million things that can go wrong.
And you don’t know scary until you’re driving a U-Haul full of servers across the Pennsylvania Turnpike in the middle of a rainstorm.
Most people will probably end up doing hybrid migrations, where you build some of the new infrastructure, then migrate some from the existing setup.
Watch out for things like IP addressing issues, and that you’ve made the correct assumptions about rack space and power requirements for the machines that are moving.
Considerations:
• Downtime Limits
• Uptime Requirements
• Service Window Length
You might have a maintenance window, where downtime is planned and doesn’t count against your SLAs. If your migration can fit within this, awesome (hint: it can’t.)
So you need to figure out what kind of downtime you can afford, and remember to schedule notices to your customers far enough in advance so that they aren’t taken by surprise.
Strangely enough, downtime limits and uptime requirements aren’t the same.
Figure out what your uptime requirements are according to your user base’s expectations, then figure out how much infrastructure needs to be running in order to meet them. Good luck.
Considerations:
Upstream Network Changes
I think I could do an entire presentation where I just list all of the problems that could happen when network providers screw things up.
Big ones to watch out for:
1. Is the test and turn-up date early enough so that inevitable failures don’t impact the go-live date?
2. Is the circuit exactly what you ordered, and is what you ordered exactly what you need?
3. Are cross-connects in the datacenter ordered, and is the datacenter networking team working with the provider?
Considerations:
(Wo)man Power
You can’t lift all of the things you own.
You need friends to come help you move, right? And you usually pay them beer and pizza for the effort.
Moving infrastructures is kind of like that, except “money” typically substitutes for beer and pizza, and you want to find people who are reasonably smart, because you probably don’t own anything in your apartment that costs as much as a high performance RAID array.
Figure out how many people you need, then add 20% to cover the stuff you didn’t think of.
Have another 10% at home ready to come in if the need arises.
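The headcount arithmetic above can be sketched in a few lines. This is only an illustration: the function names are made up, and it assumes both percentages are taken from your original estimate and rounded up, since you can't bring in a fraction of a person.

```python
def ceil_percent(count: int, percent: int) -> int:
    """Return `percent` percent of `count`, rounded up, using integer math only."""
    return (count * percent + 99) // 100


def staffing(base_headcount: int, buffer_pct: int = 20, standby_pct: int = 10) -> dict:
    """Hypothetical helper: turn a base estimate into on-site and standby headcounts.

    Assumption: both percentages apply to the base estimate, not to the padded total.
    """
    on_site = base_headcount + ceil_percent(base_headcount, buffer_pct)
    standby = ceil_percent(base_headcount, standby_pct)
    return {"on_site": on_site, "standby": standby}


print(staffing(10))
```

With an estimate of 10 people, that means 12 on site and 1 more at home waiting for a call.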
Considerations:
How can we parallelize the work?
If you have teams, having them all work independently but simultaneously is important, so try not to have one team waiting around on the result of another team. This is no different than removing bottlenecks from a computing infrastructure.
Thursday, August 2, 12
Establishing a Plan
Documentation shall set you free!
Build a checklist
• What needs to be done
• By whom?
• Where?
• In what order?
Every good plan includes a checklist
Build a checklist
• Off site prior
• On site prior
• On site during
• On site after
• Testing
• Signoff
Include all phases
Off site things before moves are usually slow processes or long-term changes that rely on TTLs or human interaction outside of your organization.
Build a checklist
Establish Dependencies
If item 23 relies on item 24 being done, then it’s probably in the wrong place...
Figuring out all of these dependencies is like untangling a knot. It’s slow, it’s difficult, and when you’re done, no one seems to be as appreciative of your hard work as you are.
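If the checklist lives in a file rather than on paper, a script can do some of the untangling. The sketch below is one possible approach, not something from the talk: it uses Python's standard graphlib module (3.9+) to put items in an order where every prerequisite comes first, and to complain about circular dependencies. The checklist items are invented for illustration.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical checklist: each item maps to the set of items that must be done first.
checklist = {
    "lower DNS TTLs": set(),
    "notify users": set(),
    "power down servers": {"lower DNS TTLs", "notify users"},
    "unrack servers": {"power down servers"},
    "load truck": {"unrack servers"},
}


def ordered_steps(dependencies: dict) -> list:
    """Return the items in an order where prerequisites precede their dependents."""
    try:
        return list(TopologicalSorter(dependencies).static_order())
    except CycleError as err:
        # err.args[1] holds the cycle that was detected.
        raise ValueError(f"circular dependency in checklist: {err.args[1]}") from err


for number, step in enumerate(ordered_steps(checklist), start=1):
    print(number, step)
```

Items with no dependency between them can come out in any relative order, so treat the numbering as one valid ordering rather than the only one.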
Build a checklist
Build in checkpoints
Checkpoints are a great place to stop all the teams at the same time and make sure that everyone’s on the same page.
Build a checklist
Include communication up-stream
Overcommunicate.
Keep your boss informed.
Keep your stakeholders informed.
If you have the kind of work environment where your users care, keep them informed.
Build a checklist
• Per team?
• Per location?
• Per person?
Multiple Checklists
If you’ve got multiple teams, you are likely to need multiple checklists.
Ditto if your locations are farther apart.
If each person’s tasks are complicated, give each person an individual checklist, too.
Build a checklist
Schedule Breaks
Breaks are SO important.
You can’t work for 8 hours without stopping to rest, physically or mentally. Put these into the schedule.
Change Management Techniques
Establish tests for complicated steps (or groups)
Would you build a new server then put it into production without testing it?
Of course not.
Build tests to see if your work so far is correct. It can be as simple as “at this point, LED 7, 8, and 9 should be green, and LED 10 should be amber”.
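A test like the LED example can be written down as data, so whoever is standing at the rack only has to record what they see. This is a rough sketch with hypothetical names; the expected states are the ones from the example above.

```python
# Expected state per LED number, taken from the example above.
EXPECTED_LEDS = {7: "green", 8: "green", 9: "green", 10: "amber"}


def failed_checks(observed: dict, expected: dict = EXPECTED_LEDS) -> dict:
    """Return {led: (expected_state, observed_state)} for every LED that is wrong.

    An LED missing from `observed` is reported with an observed state of None.
    """
    return {
        led: (want, observed.get(led))
        for led, want in expected.items()
        if observed.get(led) != want
    }


problems = failed_checks({7: "green", 8: "off", 9: "green", 10: "amber"})
print(problems or "checkpoint passed")
```

An empty result means the step passed and the team can move on to the next item.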
Change Management Techniques
Establish roll-back procedures
Things happen. Stuff doesn’t always go right.
Make sure your plan spells out when to roll back and what steps to take to do it.
Change Management Techniques
Establish failure guidelines
(What happens if...)
• ...a machine breaks?
• ...a router doesn’t boot?
• ...?
Failures are inevitable. Unhandled failures are unnecessary though.
Know how to tell if something has failed, and know what to do about it.
Identify Goods & Services to be Purchased
• Cables of specific lengths, connectors, label tape, velcro, rack shelves, etc.
• Servers, routers, firmware, licenses, etc.
• Circuits, bandwidth, accounts, etc.
These kinds of steps require a lot of planning, but more planning just makes the end result better.
Maintain Communications
• Cellphones
• (at least one per team)
• 2-way radios
• (for lack of cellular service)
• Probably not IP phones
Cell reception in datacenters is spotty. Using handheld 2-way radios is much more reliable.
Don’t rely on your IP phone infrastructure for critical communications during network outages.
Just don’t.
Find Warm Bodies
Figure out how many people you need.
Add 20% for good measure
Have 10% standing by
Establish Roles
• Zone
• Man to Man
• Point Guard
Zone: “Your job is to stay at this rack, pulling things out in the order prescribed by the checklist, and to load them on the cart once removed”
Man to Man: “Your job is to cart these servers to the truck, and once the number of servers in the truck matches the number prescribed by the checklist, to drive the truck to the new datacenter, and assist in loading the servers onto the cart for the next zone man”
Point Guard: “Your job is to act as the communications hub, the person to verify that check points happen on schedule, and that things are correct, as well as to finalize sign-off and hand-off once we’re done”
...and so on, as required by your migration.
Communicate the plan
Default to being too communicative
Have your point guard annoy people with the number of email updates.
Communicate the plan
Get clearance from the stake-holders
Before ever starting work, make sure that everyone is on board with the migration plan, and that everyone has agreed and signed off.
Communicate the plan
Alert users multiple times
• Well in advance
• A week before
• Immediately before
(so long term projects aren’t scheduled)
(so short-term pushes aren’t interrupted)
(so last minute issues don’t compound)
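To get those three notices onto a calendar, you can work backwards from the migration date. The sketch below is illustrative only: the slide doesn't say how far "well in advance" is, so the 30-day default is an assumption, as is sending the final notice on the day itself.

```python
from datetime import date, timedelta


def notice_dates(migration_day: date, well_in_advance_days: int = 30) -> list:
    """Return (label, send_date) pairs for the three user notices.

    Assumptions: "well in advance" defaults to 30 days, and
    "immediately before" means the day of the migration.
    """
    offsets = [
        ("well in advance", well_in_advance_days),
        ("a week before", 7),
        ("immediately before", 0),
    ]
    return [(label, migration_day - timedelta(days=days)) for label, days in offsets]


for label, send_on in notice_dates(date(2012, 8, 18)):
    print(f"{send_on.isoformat()}  {label}")
```

Adjust the lead time to match how far ahead your users plan their own projects.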
Communicate the plan
Give everyone the information they need
• Checklists
• Plan document
• Contact Information
...and has signed off on it
I actually got to the point where every person involved in the migration got a personalized envelope.
The contents were the checklist relevant to their job, the diagrams of what the rack looked like before, what the new racks were supposed to look like, and the contact information for all of the other team members.
Executing the plan
I love it when a plan comes together...
Executing the plan
Verify all goods were purchased
Doing inventory sucks, but not having enough ethernet cables that reach to the switch sucks more...
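Inventory hurts less when the counting is the only manual part. As a rough sketch (the item names and quantities are invented), you can compare what the plan requires against what actually arrived and list only the shortfall:

```python
from collections import Counter


def shortfall(required: dict, received: dict) -> dict:
    """Return {item: missing_quantity} for everything still short.

    Counter subtraction drops zero and negative counts, so surplus
    or unplanned items simply don't appear in the result.
    """
    return dict(Counter(required) - Counter(received))


# Hypothetical purchase list and delivery count.
required = {"cat6 3ft": 40, "velcro roll": 4, "rack shelf": 2}
received = {"cat6 3ft": 32, "velcro roll": 4, "rack shelf": 3}

print(shortfall(required, received))
```

Here the only gap is eight short patch cables, which is a lot easier to fix a week early than at 2am in the datacenter.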
Executing the plan
Clear personal schedules
“oh, that was this weekend? Crap, man, I’m sorry. I have to go drink beer with my other friends and have a good weekend. Maybe next time, brah”
Executing the plan
Complete off-site checklist items
Verify that everyone at both sites knows what’s happening, when, and is on board. Make sure the datacenter has people on hand to help who are capable of helping.
Executing the plan
Show up early
...because something won’t be right.
Executing the plan
Verify assigned roles
Ask for questions
...and ask each person.
Make sure that they know how to get ahold of you and the point guard.
Executing the plan
Step through the list
Executing the plan
Verify completeness with each team
Executing the plan
Perform on-site and off-site post-complete items
Executing the plan
Go have a beer.
Seriously, celebrate completing the task with the team. I didn’t always get to do this, and I’m still sorry about it today.
Executing the plan
Complete post-mortem according to schedule
During the next work-week, complete the post-mortem and identify what went wrong as well as what went right.
You can’t replicate success and eliminate failure unless you identify them.
Dealing with problems
Yes, you will have problems...
Dealing with problems
Problems are inevitable
(It’s not “if”, it’s “when”)
During my talk, I gave far more discussion on this topic than I’m going to give here.
Two big take-aways:
1) Problems are inevitable because they are a condition of the infrastructure, and they arise from its inherent complexity.
2) It’s not possible to eliminate all failures, but it’s desirable to minimize them, and to try to eliminate repeating the same failure by improving the process and design.
Read “The Field Guide to Understanding Human Error” by Sidney Dekker
http://amzn.to/QFpcqY
Dealing with problems
• Identify & Acknowledge the problem
• Don’t punish the reporter
• Follow the failure guidelines
• Roll-back if necessary & reschedule
Post-mortem
• What went wrong?
• Why?
• The ‘Five Whys’
• What went right?
• What have we learned?
Infrastructure Migrations
Thanks for your time.
I hope you were able to get something out of it.
If you have questions, feel free to contact me