devops days vancouver 2014 slides

Architecting Software as if Ops Matters. November 15th, 2014, DevOps Days Vancouver 2014. Alex Cruise, Director of Architecture, Metafor Software

Upload: alex-cruise

Post on 16-Jul-2015


TRANSCRIPT

Page 1: DevOps Days Vancouver 2014 Slides

Architecting Software as if Ops Matters

November 15th, 2014

DevOps Days Vancouver 2014

Alex Cruise

Director of Architecture, Metafor Software

Page 2: DevOps Days Vancouver 2014 Slides

Who am I? Background

● Self-taught programmer

● Worked at technology-intensive companies around Vancouver for over 20 years

– Mostly software companies

– Some of which are still in business!

– Two of which were acquired for decent money

● Layer 7/CA

● Subserveo/DST Systems

● I came to Computer Science as such (not the same as writing code for a living!) because I saw how it would help me solve problems I was already having, not because it seemed like it would be a good job when I was 18. :)

Page 3: DevOps Days Vancouver 2014 Slides

Who am I? Declaration of Biases

● I really like static typing.

● I really like functional programming.

● I have internalized large parts of the Java ecosystem over the years.

● As you might expect, I really like Scala.

– Five years in, I still like it!

● I have a tendency to go on Architecture Astronaut EVAs, but I'm working on it...

Page 4: DevOps Days Vancouver 2014 Slides

Outline

● General Pontification

● Some Hard Problems that we all need to face

– Scale Out

– Concurrency

– Fault tolerance

– Correctness

● Be proactive: Meta-solutions that can help

– Not solutions... Solution Construction Sets

● Be reactive:

– Find the gaps so you can mind them

Page 5: DevOps Days Vancouver 2014 Slides

Where We're At

“The rise of the DevOps movement has brought into welcome focus something that is often learned only through painful experience and expense: the success of a software product critically depends not only on its implementation, maintenance and enhancement, but also on how it's deployed and operated.”

-- me :)

Page 6: DevOps Days Vancouver 2014 Slides

Anna Karenina Principle

“Happy families are all alike; every unhappy family is unhappy in its own way”

—Leo Tolstoy

Page 7: DevOps Days Vancouver 2014 Slides

Anna Karenina Principle (of software architecture)

“Happy families are all alike; every unhappy family is unhappy in its own way”

—Leo Tolstoy

● The small systems all start to look the same after a while

Page 8: DevOps Days Vancouver 2014 Slides

Anna Karenina Principle (of software architecture)

“Happy families are all alike; every unhappy family is unhappy in its own way”

—Leo Tolstoy

● The small systems all start to look the same after a while

– if you squint

– when you're bitter enough

Page 9: DevOps Days Vancouver 2014 Slides

Anna Karenina Principle (of software architecture)

“Happy families are all alike; every unhappy family is unhappy in its own way”

—Leo Tolstoy

● The small systems all start to look the same after a while

– if you squint

– when you're bitter enough

● But large systems are UNIQUE and TERRIFYING.

Page 10: DevOps Days Vancouver 2014 Slides

Project Risk Factors

● Problem domain risk factors

– Some problem domains are inherently full of hairy yaks

– Some are not necessarily complex, but unfamiliar

● Tooling/infrastructure risk factors

– Some tools are immature or buggy

– Some are hard to use and/or unfamiliar

● Scale/distributed architecture is itself a huge risk factor

● Complex + unfamiliar problem domain + tooling problems + distributed architecture = PROJECT DEATH

● Minimize as many of these risk categories as you can at any given time

Page 11: DevOps Days Vancouver 2014 Slides

Hard Problem #1: Scale Out

● How does it cause problems?

– Complexity of the overall system increases

– Reasoning about the system's behaviour as a whole quickly becomes intractable

– Airplane Rule: All else being equal, on average, a twin-engined airplane has twice as many engine failures as a single-engined airplane.

● Hopefully, a greater proportion of these failures are survivable!

– With airplanes, uh, maybe not as much as we'd want.
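The Airplane Rule is easy to check with a little arithmetic. The sketch below assumes every server fails independently with the same probability per period; the value `P = 0.01` and the function names are made up for illustration. It shows both halves of the slide: more nodes means more failures to handle, but a far smaller chance of losing everything at once.

```python
def expected_failures(p_single: float, n: int) -> float:
    """Average number of failed components per period, out of n."""
    return n * p_single


def p_any_failure(p_single: float, n: int) -> float:
    """Probability that at least one of n independent components fails."""
    return 1.0 - (1.0 - p_single) ** n


def p_all_fail(p_single: float, n: int) -> float:
    """Probability that all n independent components fail in the same period."""
    return p_single ** n


P = 0.01  # assumed per-period failure probability of one server (illustrative only)

for n in (1, 2, 10, 100):
    print(
        f"n={n:3d}  expected failures={expected_failures(P, n):.2f}  "
        f"P(any fails)={p_any_failure(P, n):.4f}  "
        f"P(all fail)={p_all_fail(P, n):.3g}"
    )
```

Real failures are rarely independent (shared power, shared network, shared bugs), so the "all fail" figure should be read as an optimistic bound, not a promise.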

Page 12: DevOps Days Vancouver 2014 Slides

Hard Problem #1: Scale Out

● Why do we need it anyway?

– A single server beefy enough to run the whole thing is too expensive (assuming it's possible to buy one)

– Worse yet, a single server reliable enough to run the whole thing forever doesn't exist: we need redundancy to achieve high availability. Redundancy necessarily invites coordination problems.

– Ideally, we want to be able to start small and cheap, add capacity gradually, and avoid re-architecting the whole thing too often.

Page 13: DevOps Days Vancouver 2014 Slides

Hard Problem #1: Scale Out

● Some things we've tried (1):

– Try to simplify our life as developers, by making the distributed system “feel” local.

● CORBA, DCOM, EJB, Distributed transactions, SOAP, RPC, etc.

– What went wrong?

● See “Eight Fallacies of Distributed Computing”

● Leaky abstractions: SO MANY THINGS can fail, and for reasons that don't usually make it into API documentation.

● RPC-style request-response protocols frequently block the client thread (and often the server thread too!), limiting scalability.

Page 14: DevOps Days Vancouver 2014 Slides

Hard Problem #1: Scale Out

● Some things we've tried (2):

– Use lots of dumb, cheap front-end servers, and fewer smart, expensive back-end servers (e.g. database)

– What went wrong?

● As application state/logic increases in complexity, the backend servers quickly become a bottleneck, one that we know for a fact is REALLY HARD to scale out (e.g. CAP theorem)

Page 15: DevOps Days Vancouver 2014 Slides

Hard Problem #1: Scale Out

● Some things we've tried (3):

– Actors, message passing, location transparency

● What went wrong?

– Brain rewiring

– Requires rethinking of how systems are built

– Purely local interactions are more cumbersome

● What went right?

– Usefully narrow, unifying abstraction over local, point-to-point and clustered message passing

– Participants in message exchanges don't need to be aware of where their counterparts are deployed

– Topology decisions can be configured at runtime, decoupled from application logic

Page 16: DevOps Days Vancouver 2014 Slides

Hard Problem #2: Concurrency

● How does it cause problems?

– Note! Concurrency != Parallelism!

“Concurrency is like having a juggler juggle many balls. Regardless of how it seems, the juggler is only catching/throwing one ball at a time. Parallelism is having multiple jugglers juggle balls simultaneously.”

– Systems that are inherently nondeterministic are really hard to reason about.

– Concurrency bugs are hard to troubleshoot, often only showing up under load.
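The juggler analogy can be made concrete. The sketch below is one possible illustration using Python's `asyncio`: three "balls" are in flight concurrently, yet a single thread handles exactly one of them at any instant. The names `juggle` and `run_juggler` are invented for this example.

```python
import asyncio


async def juggle(ball: str, throws: int, log: list) -> None:
    """Handle one ball: record a throw, then hand control back to the juggler."""
    for i in range(throws):
        log.append((ball, i))   # only one coroutine runs at any instant
        await asyncio.sleep(0)  # yield so another ball can be handled


async def run_juggler() -> list:
    log: list = []
    # Three balls "in the air" at once, one event loop, one thread.
    await asyncio.gather(*(juggle(b, 3, log) for b in ("red", "green", "blue")))
    return log


log = asyncio.run(run_juggler())
print(log)
```

The log shows the three balls interleaved rather than handled one after another, which is concurrency without any parallelism. Parallelism would need several jugglers, i.e. several threads or processes actually running at the same time.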

Page 17: DevOps Days Vancouver 2014 Slides

Hard Problem #2: Concurrency

● Why do we need it anyway?

– Make best use of available system resources by doing lots of stuff at once

– Even if you're not using threads or shared memory, your application still has state changes that need to be triggered, validated, committed or rolled back, notified, etc.

– If you can use shared memory, you can avoid paying for IPC/network/codec round trips every time your state changes.

Page 18: DevOps Days Vancouver 2014 Slides

Hard Problem #2: Concurrency

● Some things we've tried (1)

– Let's use threads, and share mutable state!

– What went wrong?

● OW MY BRAIN

● Bugs galore!
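To give a feel for the discipline that shared mutable state demands, here is a minimal sketch of a counter shared between threads. `SafeCounter` and `hammer` are hypothetical names. The point is that every single access has to go through the same lock; a read-modify-write that skips the lock anywhere in the code base can silently lose updates, and usually only under load.

```python
import threading


class SafeCounter:
    """A counter that is only ever touched while holding its lock."""

    def __init__(self) -> None:
        self._value = 0
        self._lock = threading.Lock()

    def increment(self) -> None:
        with self._lock:       # read-modify-write must be atomic
            self._value += 1

    def value(self) -> int:
        with self._lock:       # reads take the lock too
            return self._value


def hammer(counter: SafeCounter, n_threads: int = 8, n_increments: int = 10_000) -> int:
    """Increment the counter from several threads and return the final value."""

    def work() -> None:
        for _ in range(n_increments):
            counter.increment()

    threads = [threading.Thread(target=work) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value()


total = hammer(SafeCounter())
print(total)  # 8 threads x 10,000 increments
```

Nothing in the language forces callers to use the lock, which is why this approach leans so heavily on programmer discipline.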

Page 19: DevOps Days Vancouver 2014 Slides

Hard Problem #2: Concurrency

● Some things we've tried (2)

– Just don't do it!

● What went wrong?

– Inefficient use of resources.

● All important application state transitions incur Network, IPC, Codec costs

– Having to wait for external coordination services (e.g. memcached, databases) limits scalability.

Page 20: DevOps Days Vancouver 2014 Slides

Hard Problem #2: Concurrency

● Some things we've tried (4)

– Actors

● What went wrong?

– We (well, some of us...) want our types back!

– Message passing is much less efficient than method invocation

● Thousands per second is fine

● Millions per second? Reconsider...

● What went right?

– Actors aren't concurrent individually

● Can safely have mutable state

– Decoupled from threads: you can have lots and lots alive at any time.

Page 21: DevOps Days Vancouver 2014 Slides

Hard Problem #2: Concurrency

● Some things we've tried (5):

– Immutable data, functional programming

● What went wrong?

– Requires some brain rewiring

– Some languages make it hard to use

● Get a better language ;)

– FP itself doesn't provide any concurrency, but...

● What went right?

– Declarative programming, really!

– Referentially transparent functions that are independent can safely be evaluated in parallel (in some cases automatically)

– Immutable data is always safe to share

Page 22: DevOps Days Vancouver 2014 Slides

Hard Problem #3: Fault Tolerance

● How does it cause problems?

– Error handling code gets all over everything, obscuring meaning, hurting readability and composability

Page 23: DevOps Days Vancouver 2014 Slides

Hard Problem #3: Fault Tolerance

● Why do we need it anyway?

– So we don't get woken up at 3AM quite so often.

● 'nuff said.

Page 24: DevOps Days Vancouver 2014 Slides

Hard Problem #3: Fault Tolerance

● Some things we've tried (1):

– Exceptions

● What went wrong?

– Hard to fit cleanly into static type systems

● e.g. checked exceptions vs. functions

– Non-local control transfers are confusing

Page 25: DevOps Days Vancouver 2014 Slides

Hard Problem #3: Fault Tolerance

● Some things we've tried (2):

– Multiple return values

● What went wrong?

– Verbosity

– It just seems really weird to me

● BUT... People seem to like Go plenty, and are shipping a lot of amazing software with it, so I'll discount my own opinion here.
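For readers who haven't seen the multiple-return-value style, the following sketch imitates it in Python by returning a `(value, error)` pair. The functions `parse_port` and `connect_url` are invented for illustration. It shows both the appeal (errors are ordinary values, control flow is local) and the verbosity: every call site needs its own check.

```python
from typing import Optional, Tuple


def parse_port(text: str) -> Tuple[int, Optional[str]]:
    """Return (port, None) on success or (0, error message) on failure."""
    if not (text.isascii() and text.isdigit()):
        return 0, f"not a number: {text!r}"
    port = int(text)
    if not 1 <= port <= 65535:
        return 0, f"out of range: {port}"
    return port, None


def connect_url(host: str, port_text: str) -> Tuple[str, Optional[str]]:
    """Build a URL, passing any error from parse_port up to the caller by hand."""
    port, err = parse_port(port_text)
    if err is not None:  # this check repeats at every call site
        return "", f"bad port: {err}"
    return f"tcp://{host}:{port}", None


print(connect_url("db", "5432"))
print(connect_url("db", "70000"))
```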

Page 26: DevOps Days Vancouver 2014 Slides

Hard Problem #3: Fault Tolerance

● Some things we've tried (3):

– Algebraic data types

● Option/Maybe

● Either/Validation

● Try

● What went wrong?

– Some brain rewiring required

– Tricky to retrofit onto existing code bases

● What went right?

– IMO a big win for new code; enables composability and functional abstraction

– Relatively easy to convert locally, write thin adaptations for traditional error handling systems
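As a rough illustration of the `Try` idea, here is a hand-rolled version in Python. It is not Scala's `Try` and not any library's API: `Success`, `Failure`, `attempt`, `parse_int` and `reciprocal` are all names made up for this sketch. The `attempt` helper is the kind of thin adapter the last bullet mentions, wrapping exception-throwing code so that failures become values that compose.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Success:
    value: Any

    def map(self, f: Callable[[Any], Any]):
        return attempt(lambda: f(self.value))  # exceptions in f become Failure

    def flat_map(self, f: Callable[[Any], Any]):
        return f(self.value)                   # f itself returns Success/Failure

    def get_or_else(self, default: Any) -> Any:
        return self.value


@dataclass(frozen=True)
class Failure:
    error: Exception

    def map(self, f):
        return self                            # skip the step, keep the error

    def flat_map(self, f):
        return self

    def get_or_else(self, default: Any) -> Any:
        return default


def attempt(thunk: Callable[[], Any]):
    """Adapter from exception-style code to Success/Failure values."""
    try:
        return Success(thunk())
    except Exception as exc:
        return Failure(exc)


def parse_int(text: str):
    return attempt(lambda: int(text))


def reciprocal(n: int):
    return attempt(lambda: 1 / n)


ok = parse_int("4").flat_map(reciprocal).map(lambda x: x * 100)
bad = parse_int("0").flat_map(reciprocal).map(lambda x: x * 100)
print(ok, bad)
```

The pipeline reads top to bottom with no error-handling code in between; the first failure short-circuits the rest and is still there to inspect at the end.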

Page 27: DevOps Days Vancouver 2014 Slides

Hard Problem #3: Fault Tolerance

● Some things we've tried (4):

– Actors

● again?

● What went wrong?

– Brain rewiring

● What went right?

– Truly impressive robustness achievements

● Erlang in Ericsson switches, 99.9999999% uptime

– Truly simple rules for how failures are dealt with

Page 28: DevOps Days Vancouver 2014 Slides

Hard Problem #4: Correctness

● How does it cause problems?

– Historically, many techniques for improving correctness have been:

● Hard to learn

● Bad for performance

● Overly reliant on programmer discipline

Page 29: DevOps Days Vancouver 2014 Slides

Hard Problem #4: Correctness

● Why do we need it anyway?

– What choice do we have? Bugs happen, let's try to avoid them.

Page 30: DevOps Days Vancouver 2014 Slides

Hard Problem #4: Correctness

● Some things we've tried (1):

– TDD

● What went wrong?

– Can be a big culture shift; requires programmer training/discipline

– Dynamic typing ;)

● What went right?

– Seems to be effective as long as everyone buys in

– Static typing ;)

Page 31: DevOps Days Vancouver 2014 Slides

Hard Problem #4: Correctness

● Some things we've tried (2):

– Static/formal program verification

● What went wrong?

– Tooling is usually unfamiliar, sometimes terrifying, occasionally expensive

– Very few useful real-world programs are verifiable without significant and difficult adaptation

● What went right?

– When it works, it works well

– Languages are moving in on this turf. :)

Page 32: DevOps Days Vancouver 2014 Slides

Digression: Abstraction

● Avoid premature abstraction!

● Overly abstract code is harder to understand, and understanding is really, really important

● BDUF isn't always wrong, and YAGNI isn't always right; the balance is somewhere in between, selected for your particular project

● Try to delay creating new abstractions until repetition becomes painful

● Use richer languages/libraries!

– Lots of nice abstractions already written for you, less temptation to roll your own.

● Use more concise languages/libraries!

– Smaller code ⇒ Less temptation to add abstraction to hide detail

Page 33: DevOps Days Vancouver 2014 Slides

Proactive Meta-Solutions Inc.

● Actors

– scaling, concurrency, fault tolerance

● Functional programming

– correctness (especially statically typed), concurrency

● Immutable data

– correctness, concurrency

● Configuration management

– Chef, Puppet, containerization

Page 34: DevOps Days Vancouver 2014 Slides

Actors

● Actors can:

– Create child actors

– Send messages to other actors

● not necessarily local

● not necessarily its own child

● not necessarily expecting a reply

● including references to self or other actors

– Receive messages, execute code based on their types and data

– Change their own behaviour in response to a message

– Ask to be notified when some other actor is stopped

Page 35: DevOps Days Vancouver 2014 Slides

Actors (2)

● Actors cannot:

– Process more than one message at a time

● Actors should not:

– Attempt to directly observe the state of other actors

– Send mutable messages
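The rules above fit in a few lines of code. The following is a toy actor, not Akka and not any real actor library: `CounterActor`, `tell` and the message shapes are invented for this sketch. It uses one thread per actor purely for brevity, whereas real actor runtimes multiplex many actors over a small thread pool. What it does show is a mailbox, one message processed at a time, private mutable state, immutable messages, and replies sent as messages instead of state being read from outside.

```python
import queue
import threading


class CounterActor:
    """Toy actor: private state, a mailbox, one message at a time."""

    _STOP = object()  # private sentinel used to shut the actor down

    def __init__(self) -> None:
        self._mailbox: queue.Queue = queue.Queue()
        self._count = 0  # only ever touched by the actor's own thread
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def tell(self, message: tuple) -> None:
        """Fire-and-forget send; messages are immutable tuples."""
        self._mailbox.put(message)

    def stop(self) -> None:
        self._mailbox.put(self._STOP)
        self._thread.join()

    def _run(self) -> None:
        while True:
            message = self._mailbox.get()  # strictly one message at a time
            if message is self._STOP:
                return
            self._receive(message)

    def _receive(self, message: tuple) -> None:
        kind, payload = message
        if kind == "add":
            self._count += payload      # no lock needed: single consumer
        elif kind == "get":
            payload.put(self._count)    # reply goes to the sender's own queue


def send_many(actor: CounterActor, n: int) -> None:
    for _ in range(n):
        actor.tell(("add", 1))


actor = CounterActor()
senders = [threading.Thread(target=send_many, args=(actor, 1000)) for _ in range(4)]
for t in senders:
    t.start()
for t in senders:
    t.join()

reply: queue.Queue = queue.Queue()
actor.tell(("get", reply))      # ask for the state by message
total = reply.get(timeout=5)
actor.stop()
print(total)
```

Four threads send concurrently, yet the counter needs no lock, because only the actor's own loop ever touches `_count`.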

Page 36: DevOps Days Vancouver 2014 Slides

Actors (3)

● Actors live in a tree. C-R-A-S-H-I-N-G

● When an actor crashes (e.g. due to an exception), it gets restarted automatically, by default.

● However, a parent actor can override the fault handling strategy that the system will apply to that actor's children, e.g.:

– Restart all children when any child has died

– Give up, escalating the failure to its own parent

– Give up and escalate if a child has crashed more than n times in m seconds

● Faults in the root actor are handled by the system

– By default, in Akka, the system will restart the root actor every time it crashes, forever. ))<>((
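The "more than n times in m seconds" strategy is simple enough to sketch. The code below is not Akka's API; `Supervisor`, `Escalate`, `make_divider` and the injectable `clock` are all assumptions made for this illustration, and the child is modelled as a plain callable. A crash replaces the child with a fresh instance, and too many crashes inside the time window hands the failure upward.

```python
import time
from collections import deque


class Escalate(Exception):
    """Raised when the supervisor gives up and hands the failure to its parent."""


class Supervisor:
    def __init__(self, make_child, max_restarts, within_seconds, clock=time.monotonic):
        self._make_child = make_child   # factory producing a fresh child
        self._max = max_restarts        # the "n" in the slide
        self._window = within_seconds   # the "m" in the slide
        self._clock = clock
        self._crashes = deque()         # timestamps of recent crashes
        self.child = make_child()
        self.restarts = 0

    def deliver(self, message):
        try:
            return self.child(message)
        except Exception as exc:
            self._on_crash(exc)
            return None

    def _on_crash(self, exc):
        now = self._clock()
        self._crashes.append(now)
        # Forget crashes that fell out of the time window.
        while self._crashes and now - self._crashes[0] > self._window:
            self._crashes.popleft()
        if len(self._crashes) > self._max:
            raise Escalate(f"{len(self._crashes)} crashes in {self._window}s") from exc
        self.child = self._make_child()  # restart: child state starts over
        self.restarts += 1


def make_divider():
    seen = []  # the child's private state, discarded on restart

    def receive(n):
        seen.append(n)
        return 100 // n  # crashes on 0

    return receive


fake_now = [0.0]  # fake clock keeps the example deterministic
sup = Supervisor(make_divider, max_restarts=2, within_seconds=10, clock=lambda: fake_now[0])
results = [sup.deliver(n) for n in (5, 0, 4, 0)]  # two crashes, two restarts

try:
    sup.deliver(0)  # third crash inside the window
    escalated = False
except Escalate:
    escalated = True
print(results, sup.restarts, escalated)
```

In a real supervision tree the `Escalate` would itself be caught by the parent's supervisor, which applies its own strategy in turn.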

Page 37: DevOps Days Vancouver 2014 Slides

Functional Programming

● Avoiding side effects makes it much, much easier to reason about code

● Higher-order functions and other FP language features make declarative programming possible, reduce code size

– Small code is best code!

● Referentially transparent (side-effect-free) functions that are independent can be run in parallel automatically

– not always worthwhile due to overhead

– a bit bleeding edge
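A small sketch of why purity makes parallel evaluation safe: because `checksum` below depends only on its argument and touches nothing else, handing the same calls to a pool of workers cannot change the answer. The function and input data are invented for the example. In CPython, threads will not actually speed up CPU-bound work like this because of the global interpreter lock, so a process pool would be the usual choice for real speedups; the example is about safety, and the overhead caveat from the slide applies either way.

```python
from concurrent.futures import ThreadPoolExecutor


def checksum(text: str) -> int:
    """Pure function: the result depends only on the argument, no side effects."""
    total = 0
    for ch in text:
        total = (total * 31 + ord(ch)) % 1_000_003
    return total


inputs = [f"record-{i}" for i in range(1000)]

# Declarative: say what to compute, not how to loop over it.
sequential = list(map(checksum, inputs))

# The same mapping over a worker pool; no locks or coordination required.
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel = list(pool.map(checksum, inputs))

print(sequential == parallel)
```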

Page 38: DevOps Days Vancouver 2014 Slides

Functional Programming (2)

● Not all FP languages are pure!

– Scala in particular places no restraint on the use of side effects, but displays a pervasive bias toward purity

● Static typing

– Not inherent in FP, but frequently found together

● Scala, Haskell, OCaml, F#

– Fewer things to test

● If it compiles, it might actually be correct!

– Many classes of runtime error are impossible

● Barring unsafe code, reflection hacks, compiler bugs, runtime bugs

Page 39: DevOps Days Vancouver 2014 Slides

Immutable Data

● Immutable data structures can safely be shared at any time

– Impossible to corrupt internal state

– No need for defensive copying

– No TOCTTOU bugs

● Not as inefficient as you might think

– Structural sharing
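Structural sharing is easiest to see with the simplest persistent structure, a singly linked list. In the sketch below (`Node`, `prepend` and `to_list` are names chosen for the example), two new lists are built from a common base without copying it, and an attempt to modify the shared part is rejected.

```python
from dataclasses import FrozenInstanceError, dataclass
from typing import Optional


@dataclass(frozen=True)
class Node:
    """Immutable cons cell: a value plus a reference to the rest of the list."""
    head: int
    tail: Optional["Node"] = None


def prepend(value: int, rest: Optional[Node]) -> Node:
    """Build a new list in O(1) that reuses `rest` instead of copying it."""
    return Node(value, rest)


def to_list(node: Optional[Node]) -> list:
    out = []
    while node is not None:
        out.append(node.head)
        node = node.tail
    return out


base = prepend(2, prepend(3, None))
a = prepend(1, base)  # 1 -> 2 -> 3
b = prepend(9, base)  # 9 -> 2 -> 3, sharing the same 2 -> 3 cells

try:
    base.head = 0     # mutation is rejected, so sharing stays safe
    mutated = True
except FrozenInstanceError:
    mutated = False

print(to_list(a), to_list(b), a.tail is b.tail, mutated)
```

Because nobody can change `base`, neither `a` nor `b` needs a defensive copy, and a value checked once cannot change before it is used.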

Page 40: DevOps Days Vancouver 2014 Slides

Configuration Management

● I don't really need to talk about this at a DevOps conference. :)

Page 41: DevOps Days Vancouver 2014 Slides

Be Reactive :(

● Being proactive is great, but we all make mistakes.

● Applications deployed to production and exposed to the public routinely experience unforeseen conditions

● Old man yells at cloud

– Noisy neighbours

– EBS behaving badly

● DDoS, exploits, patches, etc.

Page 42: DevOps Days Vancouver 2014 Slides

Be Reactive

● When things do go wrong—and they will, no matter how well you've built your application—you'll need visibility into how your systems and application components are feeling, and what they've been up to.

● Logging

– Components should log all interesting events

– Send logs to a centralized collection/analytics system (e.g. Logstash, ElasticSearch, Splunk, etc.)

● Metrics

– Generate system-level metrics (e.g. collectd)

– Generate application-level metrics (e.g. Coda Hale's metrics)

– Send metrics to a centralized analytics system (e.g. Graphite, ElasticSearch, Librato, Circonus, etc.)
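As a minimal sketch of what "log interesting events" and "generate application-level metrics" can look like inside a component, the example below uses only the Python standard library. The `orders` component, the event names, the metric names and the in-memory `stream` are all hypothetical; in a real deployment the handler would ship lines to the central collector and the counters would go to a metrics library or agent.

```python
import io
import json
import logging
from collections import Counter


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        event = {
            "ts": record.created,
            "level": record.levelname,
            "logger": record.name,
            "event": record.getMessage(),
        }
        event.update(getattr(record, "fields", {}))  # structured extras
        return json.dumps(event, sort_keys=True)


stream = io.StringIO()  # stand-in for a shipper to a central log system
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())
log = logging.getLogger("orders")  # hypothetical component name
log.setLevel(logging.INFO)
log.addHandler(handler)
log.propagate = False

metrics: Counter = Counter()  # stand-in for a real metrics registry


def handle_order(order_id: str, amount: int) -> bool:
    metrics["orders.received"] += 1
    fields = {"order_id": order_id, "amount": amount}
    if amount <= 0:
        metrics["orders.rejected"] += 1
        log.warning("order_rejected", extra={"fields": fields})
        return False
    metrics["orders.accepted"] += 1
    log.info("order_accepted", extra={"fields": fields})
    return True


handle_order("A-1", 30)
handle_order("A-2", 0)
events = [json.loads(line) for line in stream.getvalue().splitlines()]
print(events)
print(dict(metrics))
```

Using the same event names in the log lines and the counters (`order_accepted` and `orders.accepted`) makes it much easier to line the two up later.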

Page 43: DevOps Days Vancouver 2014 Slides

The Very Short Pitch

● Do you have anyone whose job it is to constantly watch the logs and metrics dashboards? Lucky.

● Can you relate your log events to your metrics?

● What constitutes normal behaviour?

– Is what's normal today the same as what was normal last week?

– Could you write it down?

– How would you teach it to a human?

– How would you describe it to monitoring/analytics software?
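One very simple way to describe "normal" to software is a rolling baseline: keep a window of recent samples and flag values that sit far from the recent mean. The sketch below is only that, a deliberately naive illustration with made-up names (`Baseline`, `observe`) and made-up thresholds; it is not a description of any particular monitoring product. Because the window slides, it also shows why what is normal today need not be what was normal last week.

```python
from collections import deque
from statistics import mean, pstdev


class Baseline:
    """Rolling notion of 'normal': mean and spread of the last `window` samples."""

    def __init__(self, window: int = 60, threshold: float = 3.0, warmup: int = 10):
        self._samples: deque = deque(maxlen=window)
        self._threshold = threshold  # how many standard deviations count as odd
        self._warmup = warmup        # samples needed before judging anything

    def observe(self, value: float) -> bool:
        """Return True if `value` looks abnormal, then add it to the window."""
        anomalous = False
        if len(self._samples) >= self._warmup:
            mu = mean(self._samples)
            sigma = pstdev(self._samples)
            if sigma > 0 and abs(value - mu) > self._threshold * sigma:
                anomalous = True
        self._samples.append(value)
        return anomalous


# Illustrative latencies in ms: steady around 100-104, then one spike.
latencies = [100 + (i % 5) for i in range(30)] + [400]
baseline = Baseline()
flags = [baseline.observe(v) for v in latencies]
print(flags[-1], any(flags[:-1]))
```

A real system has daily and weekly cycles, deploys and correlated metrics, which is exactly where a single mean and standard deviation stop being a good enough answer.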

Page 44: DevOps Days Vancouver 2014 Slides

Thanks!

● twitter.com/alexcruise

● [email protected]
