GlueCon 2013 - Dark Architecture and How to Forklift Upgrade Your System - Dyn Inc.
DESCRIPTION
Dyn's CTO Cory von Wallenstein walks through how to evolve a system architecture for scale, performance and looser coupling, without putting the business at risk and while keeping tech team morale high, using a Dark Architecture approach.
TRANSCRIPT
Dark Architecture & How to Forklift Upgrade Your Infrastructure with Zero Downtime
Cory von Wallenstein
Chief Technology Officer, Dyn Inc.
@cvonwallenstein
@cvonwallenstein from @DynInc at #gluecon
But First, Who Is Dyn?
• Internet Infrastructure as a Service
– Managed DNS and Email Delivery
• 230 Global Employees (we bootstrapped to 170)
• Headquarters in Manchester, NH (offices in SFO & UK too)
• Raised first financing in Oct 2012: $38MM from NorthBridge
Problem We Are Trying To Solve
Inputs -> Black Magic (Your Current System Architecture) -> Outputs
Inputs -> Different Black Magic (Your New System Architecture) -> Outputs
• Scale: x10, x10^2, etc.
• Performance: (t2 - t0) <= (t1 - t0)
• Coupling: Tight -> Loose
Pragmatic Engineering over Unicorn Marketing
Why Things Get This Way
• Time to market reigns supreme
– MVP was very… minimum… on… everything
– Sooner is better than perfect
• Prototype to production to scale without architectural rigor
– Skillset for system engineering in high demand
• Seen more often in small teams who find product market fit faster than expected
– Inexperience, but we’ve all been there
Dark Architecture
• A way of thinking about, and a technical approach to, solving the scale/performance/coupling problem while enabling the business to succeed and keeping (some of) your hair
• We stand on the shoulders of giants
– Fowler, Amazon, Netflix, etc.
High Level of Dark Architecture
• Legacy approach: Flag Day Upgrade/Deploy
– Scope out 3 month upgrade to swap architecture A to B, turns into 6 months, don’t get to anything else, cross fingers on flag day, fight fires where broken, gain weight, lose hair, girlfriend breaks up with you, team quits, FML…
• Evolved approach: Fowler’s Blue/Green Deploy
– Two copies of system, load balancing to rapidly deploy new system version, rapidly fail back to legacy on failure (only one active at a time)
High Level of Dark Architecture
• Dark Architecture Approach
– Two copies of system, both active, send inputs for a workflow to both, compare outputs and throw one away (the one you threw the output away from is the “dark architecture”), log and inspect output differences, gain confidence in new system when differences go away, swap which output you throw away (effectively bringing the “dark” architecture “light”), achieve equilibrium on what workflows get processed by what system so your business has flexibility, high five everyone, onward and upward.
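The core loop described on this slide can be sketched in a few lines. This is a minimal illustration, not Dyn's implementation: `run_dark`, the `light` and `dark` callables, and the logger name are all hypothetical, and real outputs may need a smarter comparison than `!=`.

```python
import logging
from typing import Any, Callable

logger = logging.getLogger("dark_architecture")  # hypothetical logger name


def run_dark(
    workflow: str,
    inputs: Any,
    light: Callable[[Any], Any],
    dark: Callable[[Any], Any],
) -> Any:
    """Send the same inputs to both systems and return only the light output."""
    light_output = light(inputs)
    try:
        dark_output = dark(inputs)
    except Exception:
        # A failure in the dark system must never reach the caller.
        logger.exception("dark system failed for workflow %s", workflow)
        return light_output
    if dark_output != light_output:
        # Log the difference so it can be inspected later.
        logger.warning(
            "output mismatch for workflow %s: light=%r dark=%r",
            workflow, light_output, dark_output,
        )
    # The dark output is thrown away; only the light output is consumed.
    return light_output
```

Bringing the dark system "light" for a workflow then amounts to swapping which callable is passed as `light` and which as `dark`.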
Tangible Examples
• Scaling Global DNS Stats beyond 17 POPs
– MySQL to Cassandra, Log file rsync to agg counts
Tangible Examples
• Scaling Email Delivery beyond 1 billion/month
– Cron to daemon (2011), Perl to Node.js (now)
Dark Architecture Manifesto
1. Clear definition of success over ambiguity
– Likely scale/performance measured, may get blank stares on coupling
2. Continuously deliver value over months of no visible progress
3. Confidence in functional equivalence over scope creep
4. ^5’s over finger pointing
5. Plan for failure over cross fingers
Dark Architecture Manifesto
6. Customer impact over elegant system diagrams
7. System flows over system components
8. Operational confidence and familiarity over trial by fire
9. Having a ten item list over a nine item list
10. Architecture evolution over architecture revolution
Scope and Priority
• Prioritize a backlog of input/output workflows by amount of pain
– Don’t think on a system component level
  • “swap MySQL for Cassandra”
– Think on a system workflow level
  • “retrieve query logs and render *.example.com graphs”
– This exercise will force you to hone scope to exactly where the pain is so you can focus on delivering the solution to this pain first and save others for later.
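As a rough illustration of a workflow-level backlog, the sketch below orders workflows by a pain score. The `Workflow` type, the `prioritize` helper, the second backlog entry, and the scores themselves are assumptions made up for this example; how pain is actually measured is up to the team.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Workflow:
    description: str  # phrased as an input/output flow, not a component swap
    pain: int         # hypothetical score: higher means more painful today


def prioritize(backlog: List[Workflow]) -> List[Workflow]:
    """Return the backlog ordered from most painful to least painful."""
    return sorted(backlog, key=lambda w: w.pain, reverse=True)


backlog = [
    Workflow("export monthly usage report", pain=2),
    Workflow("retrieve query logs and render *.example.com graphs", pain=9),
]
ordered = prioritize(backlog)
```

The first entry of `ordered` is the workflow to migrate first; everything below it can wait.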
Legacy Approach
Legacy Approach: Week 0
Input
Legacy System
100% of functionality enabled
100% of functionality consumed
Output
Legacy Approach: Week 1
Input
Legacy System
100% of functionality enabled
100% of functionality consumed
Output
New System
0% of functionality enabled
0% of functionality consumed
Legacy Approach: Week 4
Input
Legacy System
100% of functionality enabled
100% of functionality consumed
Output
New System
25% of functionality enabled
0% of functionality consumed
Most people start with easy pieces under a misguided “crawl walk run” philosophy. Quick wins on easy stuff while saving hard problems for later rarely ends well.
Legacy Approach: Week 8
Input
Legacy System
100% of functionality enabled
100% of functionality consumed
Output
New System
35% of functionality enabled
0% of functionality consumed
Progress slows as harder problems are encountered
Legacy Approach: Week 12
Input
Legacy System
100% of functionality enabled
100% of functionality consumed
Output
New System
80% of functionality enabled
0% of functionality consumed
80% of projects spend 80% of their calendar time at 80% perceived completion. I’m 80% sure.
Legacy Approach: Week 24
Input
Legacy System
100% of functionality enabled
100% of functionality consumed
Output
New System
100% of functionality enabled
0% of functionality consumed
Other fires came up, things took longer than expected, you know… business. Morale has never been lower.
Legacy Approach: Flag Day!
Input
Legacy System
100% of functionality enabled
0% of functionality consumed
Output
New System
100% of functionality enabled
100% of functionality consumed
Dark Architecture Approach
Dark Architecture Approach: Week 0
Input
Legacy System
100% of functionality enabled
100% of functionality consumed
Output
Dark Architecture Approach: Week 1
Input
Legacy System
100% of functionality enabled
100% of functionality consumed
Output
New System
0% of functionality enabled
0% of functionality consumed
Dark Architecture Approach: Week 2
Input
Legacy System
100% of functionality enabled
100% of functionality consumed
Output
New System
0% of functionality enabled
0% of functionality consumed
Input
Output
No functionality yet, just dark architecture framework for two inputs and two outputs (throwing one output away)
Dark Architecture Approach: Week 3
Input
Legacy System
100% of functionality enabled
100% of functionality consumed
Output
New System
2% of functionality enabled
2% of functionality consumed (dark)
Input
Output
Throw one away, but log and inspect differences!
Dark Architecture Approach: Week 4
Input
Legacy System
100% of functionality enabled
98% of functionality consumed
Output
New System
2% of functionality enabled
2% of functionality consumed
Input
Output
Gain confidence operating with two equal outputs, switch which one is thrown away for that workflow. Goes horribly wrong? Switch back.
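The switch described here can be pictured as a per-workflow flag that decides whose output is consumed. The sketch below is one possible shape, assuming both systems are plain callables; `WorkflowSwitch` and its method names are hypothetical, and the comparison and logging of outputs is left out to keep the example short.

```python
from typing import Any, Callable, Dict

System = Callable[[Any], Any]


class WorkflowSwitch:
    """Tracks, per workflow, which system's output is consumed (the "light" one)."""

    def __init__(self, legacy: System, new: System) -> None:
        self._systems: Dict[str, System] = {"legacy": legacy, "new": new}
        self._light: Dict[str, str] = {}  # workflow -> "legacy" or "new"

    def light_system(self, workflow: str) -> str:
        # Workflows that were never switched stay on the legacy system.
        return self._light.get(workflow, "legacy")

    def bring_light(self, workflow: str) -> None:
        """Consume the new system's output for this workflow."""
        self._light[workflow] = "new"

    def switch_back(self, workflow: str) -> None:
        """Goes horribly wrong? Consume the legacy output again."""
        self._light[workflow] = "legacy"

    def handle(self, workflow: str, inputs: Any) -> Any:
        light = self.light_system(workflow)
        dark = "new" if light == "legacy" else "legacy"
        output = self._systems[light](inputs)
        try:
            self._systems[dark](inputs)  # dark output is thrown away
        except Exception:
            pass  # a dark failure must not affect the consumed output
        return output
```

Because the flag is kept per workflow, one workflow can run on the new system while every other workflow stays on the legacy one, and switching back is a single call.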
Dark Architecture Approach: Week 12
Input
Legacy System
100% of functionality enabled
80% of functionality consumed
Output
New System
20% of functionality enabled
20% of functionality consumed
Input
Output
Where do we stand at expected 3 months? Most painful 20% of problems resolved… now we have flexibility for what to do next.
Customer impact over elegant system diagrams
• Your customers are not paying you to have pretty whiteboards of elegant system architectures
• Your customers are paying you to make their pain go away. This gets priority.
• It’s OK to have different workflows handled by different systems to give your team agility
– Other priorities came up? System is stable.
– Have technical debt time? Continue arch migration
Parting Takeaways
• Manifesto is a preference, not a rule
• Think in flows not components
• Deliver most painful pieces first so when priorities change, you’re not left half complete.
• Process success >>> process name
• Be realistic. DA provides flexibility and frequent victories for morale and some value delivered sooner, but it won’t necessarily make a full architecture migration faster in calendar days.
Cory von Wallenstein
@cvonwallenstein
Questions?