devops days vancouver 2014 slides

Architecting Software as if Ops Matters

November 15th, 2014
DevOps Days Vancouver 2014

Alex Cruise
Director of Architecture, Metafor Software

Who am I? Background

● Self-taught programmer
● Worked at technology-intensive companies around Vancouver for over 20 years
  – Mostly software companies
  – Some of which are still in business!
  – Two of which were acquired for decent money
    ● Layer 7/CA
    ● Subserveo/DST Systems

● I came to Computer Science as such (not the same as writing code for a living!) because I saw how it would help me solve problems I was already having, not because it seemed like it would be a good job when I was 18. :)

Who am I? Declaration of Biases

● I really like static typing.
● I really like functional programming.
● I have internalized large parts of the Java ecosystem over the years.
● As you might expect, I really like Scala.
  – Five years in, I still like it!
● I have a tendency to go on Architecture Astronaut EVAs, but I'm working on it...

Outline

● General Pontification
● Some Hard Problems that we all need to face
  – Scale Out
  – Concurrency
  – Fault tolerance
  – Correctness
● Be proactive: Meta-solutions that can help
  – Not solutions... Solution Construction Sets
● Be reactive:
  – Find the gaps so you can mind them

Where We're At

“The rise of the DevOps movement has brought into welcome focus something that is often learned only through painful experience and expense: the success of a software product critically depends not only on its implementation, maintenance and enhancement, but also on how it's deployed and operated.”

-- me :)

Anna Karenina Principle

“Happy families are all alike; every unhappy family is unhappy in its own way”

—Leo Tolstoy

Anna Karenina Principle (of software architecture)



● The small systems all start to look the same after a while
  – if you squint
  – when you're bitter enough
● But large systems are UNIQUE and TERRIFYING.

Project Risk Factors

● Problem domain risk factors
  – Some problem domains are inherently full of hairy yaks
  – Some are not necessarily complex, but unfamiliar
● Tooling/infrastructure risk factors
  – Some tools are immature or buggy
  – Some are hard to use and/or unfamiliar
● Scale/distributed architecture is itself a huge risk factor
● Complex + unfamiliar problem domain + tooling problems + distributed architecture = PROJECT DEATH
● Minimize as many of these risk categories as you can at any given time

Hard Problem #1: Scale Out

● How does it cause problems?
  – Complexity of the overall system increases
  – Reasoning about the system's behaviour as a whole quickly becomes intractable
  – Airplane Rule: All else being equal, on average, a twin-engined airplane has twice as many engine failures as a single-engined airplane.
    ● Hopefully, a greater proportion of these failures are survivable!
      – With airplanes, uh, maybe not as much as we'd want.
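
A rough worked example of the Airplane Rule, as a sketch only. It assumes every engine fails independently with the same per-flight probability; the value of `p` and the function names are made up for illustration.

```python
# Illustrative only: p is a made-up per-flight failure probability for one
# engine, and engines are assumed to fail independently of each other.
p = 0.001


def expected_failures(engines: int, p: float) -> float:
    """Expected number of engine failures per flight."""
    return engines * p


def prob_all_fail(engines: int, p: float) -> float:
    """Probability that every engine fails on the same flight."""
    return p ** engines


single = expected_failures(1, p)
twin = expected_failures(2, p)

print("failures, twin vs single:", twin / single)
print("chance of losing every engine, single:", prob_all_fail(1, p))
print("chance of losing every engine, twin:", prob_all_fail(2, p))
```

Under these assumptions, adding a second engine doubles how often something breaks, while making a total loss much rarer. Scale-out trades fewer fatal failures for more failures overall, and in real systems failures are often correlated, so the second number is optimistic.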


● Why do we need it anyway?
  – A single server beefy enough to run the whole thing is too expensive (assuming it's possible to buy one)
  – Worse yet, a single server reliable enough to run the whole thing forever doesn't exist: we need redundancy to achieve high availability. Redundancy necessarily invites coordination problems.
  – Ideally, we want to be able to start small and cheap, add capacity gradually, and avoid re-architecting the whole thing too often.


● Some things we've tried (1):
  – Try to simplify our life as developers, by making the distributed system “feel” local.
    ● CORBA, DCOM, EJB, Distributed transactions, SOAP, RPC, etc.
  – What went wrong?
    ● See “Eight Fallacies of Distributed Computing”
    ● Leaky abstractions: SO MANY THINGS can fail, and for reasons that don't usually make it into API documentation.
    ● RPC-style request-response protocols frequently block the client thread (and often the server thread too!), limiting scalability.


● Some things we've tried (2):
  – Use lots of dumb, cheap front-end servers, and fewer smart, expensive back-end servers (e.g. database)
  – What went wrong?
    ● As application state/logic increases in complexity, the backend servers quickly become a bottleneck, one that we know for a fact is REALLY HARD to scale out (e.g. CAP theorem)


● Some things we've tried (3):
  – Actors, message passing, location transparency
● What went wrong?
  – Brain rewiring
  – Requires rethinking of how systems are built
  – Purely local interactions are more cumbersome
● What went right?
  – Usefully narrow, unifying abstraction over local, point-to-point and clustered message passing
  – Participants in message exchanges don't need to be aware of where their counterparts are deployed
  – Topology decisions can be configured at runtime, decoupled from application logic

Hard Problem #2: Concurrency

● How does it cause problems?
  – Note! Concurrency != Parallelism!
    “Concurrency is like having a juggler juggle many balls. Regardless of how it seems, the juggler is only catching/throwing one ball at a time. Parallelism is having multiple jugglers juggle balls simultaneously.”
  – Systems that are inherently nondeterministic are really hard to reason about.
  – Concurrency bugs are hard to troubleshoot, often only showing up under load.


● Why do we need it anyway?
  – Make best use of available system resources by doing lots of stuff at once
  – Even if you're not using threads or shared memory, your application still has state changes that need to be triggered, validated, committed or rolled back, notified, etc.
  – If you can use shared memory, you can avoid paying for IPC/network/codec round trips every time your state changes.


● Some things we've tried (1)
  – Let's use threads, and share mutable state!
  – What went wrong?
    ● OW MY BRAIN
    ● Bugs galore!


● Some things we've tried (2)
  – Just don't do it!
● What went wrong?
  – Inefficient use of resources.
    ● All important application state transitions incur Network, IPC, Codec costs
  – Having to wait for external coordination services (e.g. memcached, databases) limits scalability.


● Some things we've tried (4)
  – Actors
● What went wrong?
  – We (well, some of us...) want our types back!
  – Message passing is much less efficient than method invocation
    ● Thousands per second is fine
    ● Millions per second? Reconsider...
● What went right?
  – Actors aren't concurrent individually
    ● Can safely have mutable state
  – Decoupled from threads: you can have lots and lots alive at any time.
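
To make "actors aren't concurrent individually" concrete, here is a minimal sketch in Python rather than Akka. `CounterActor`, `tell` and `run_demo` are hypothetical names invented for this example; no real actor library is being shown.

```python
import queue
import threading


class CounterActor:
    """A tiny actor: one mailbox, one worker, private mutable state."""

    def __init__(self) -> None:
        self._mailbox: queue.Queue = queue.Queue()
        self._count = 0  # mutable, but only ever touched by the worker thread
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def tell(self, message) -> None:
        """Fire-and-forget send; callers never touch the actor's state."""
        self._mailbox.put(message)

    def _run(self) -> None:
        while True:
            message = self._mailbox.get()  # exactly one message at a time
            if message == "stop":
                break
            if message == "increment":
                self._count += 1
            elif isinstance(message, tuple) and message[0] == "get":
                reply_to = message[1]  # a queue standing in for a reply address
                reply_to.put(self._count)

    def join(self) -> None:
        self._thread.join()


def run_demo(senders: int = 4, per_sender: int = 1000) -> int:
    """Hammer one actor from several threads and return its final count."""
    actor = CounterActor()

    def send_many() -> None:
        for _ in range(per_sender):
            actor.tell("increment")

    threads = [threading.Thread(target=send_many) for _ in range(senders)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    reply: queue.Queue = queue.Queue()
    actor.tell(("get", reply))
    total = reply.get(timeout=5)
    actor.tell("stop")
    actor.join()
    return total
```

Many threads send at once, yet `_count` needs no lock because only the mailbox loop ever touches it. This sketch spends one thread per actor for brevity; runtimes such as Akka instead schedule many actors over a small thread pool, which is what makes "lots and lots alive at any time" affordable.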


● Some things we've tried (5):
  – Immutable data, functional programming
● What went wrong?
  – Requires some brain rewiring
  – Some languages make it hard to use
    ● Get a better language ;)
  – FP itself doesn't provide any concurrency, but...
● What went right?
  – Declarative programming, really!
  – Referentially transparent functions that are independent can safely be evaluated in parallel (in some cases automatically)
  – Immutable data is always safe to share
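
A small sketch of why referential transparency makes parallel evaluation safe. `word_count`, `LINES` and the two `count_*` helpers are made up for this example; only the standard-library executor is real.

```python
from concurrent.futures import ThreadPoolExecutor


def word_count(line: str) -> int:
    """Pure: the result depends only on the argument, and nothing is mutated."""
    return len(line.split())


# An immutable tuple of immutable strings: safe to hand to any worker.
LINES = ("the quick brown fox", "jumps over", "the lazy dog")


def count_sequential(lines):
    return [word_count(line) for line in lines]


def count_parallel(lines):
    # Executor.map yields results in input order, so the answer cannot
    # depend on which worker happened to finish first.
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(word_count, lines))
```

Because `word_count` has no side effects, switching from the sequential to the parallel version changes nothing about the result. The point here is safety, not speed: in standard CPython, threads will not speed up CPU-bound work like this, and the overhead can easily outweigh the gain.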

Hard Problem #3: Fault Tolerance

● How does it cause problems?
  – Error handling code gets all over everything, obscuring meaning, hurting readability and composability


● Why do we need it anyway?
  – So we don't get woken up at 3AM quite so often.
    ● 'nuff said.


● Some things we've tried (1):
  – Exceptions
● What went wrong?
  – Hard to fit cleanly into static type systems
    ● e.g. checked exceptions vs. functions
  – Non-local control transfers are confusing


● Some things we've tried (2):
  – Multiple return values
● What went wrong?
  – Verbosity
  – It just seems really weird to me
    ● BUT... People seem to like Go plenty, and are shipping a lot of amazing software with it, so I'll discount my own opinion here.


● Some things we've tried (3):
  – Algebraic data types
    ● Option/Maybe
    ● Either/Validation
    ● Try
● What went wrong?
  – Some brain rewiring required
  – Tricky to retrofit onto existing code bases
● What went right?
  – IMO a big win for new code; enables composability and functional abstraction
  – Relatively easy to convert locally, write thin adaptations for traditional error handling systems
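
A rough sketch of the Either/Try idea, written in Python instead of Scala. `Ok`, `Err`, `and_then`, `attempt` and the port-validation functions are all hypothetical names for this example and do not come from any library.

```python
from dataclasses import dataclass
from typing import Generic, TypeVar

A = TypeVar("A")


@dataclass(frozen=True)
class Ok(Generic[A]):
    """The success case, carrying a value."""
    value: A


@dataclass(frozen=True)
class Err:
    """The failure case, carrying a description instead of raising."""
    reason: str


def and_then(result, f):
    """Apply f only if the previous step succeeded; otherwise pass the error on."""
    return f(result.value) if isinstance(result, Ok) else result


def parse_port(text: str):
    return Ok(int(text)) if text.isdecimal() else Err(f"not a number: {text!r}")


def check_range(port: int):
    return Ok(port) if 0 < port < 65536 else Err(f"out of range: {port}")


def validate(text: str):
    """Compose two fallible steps without any try/except at the call site."""
    return and_then(parse_port(text), check_range)


def attempt(f, *args):
    """Thin adapter in the spirit of Try: turn a raised exception into an Err."""
    try:
        return Ok(f(*args))
    except Exception as exc:
        return Err(str(exc))
```

`validate` shows the composability win: each step returns a value describing success or failure, and `and_then` threads them together. `attempt` is the kind of thin adaptation mentioned above for wrapping traditional exception-throwing code.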


● Some things we've tried (4):
  – Actors
    ● again?
● What went wrong?
  – Brain rewiring
● What went right?
  – Truly impressive robustness achievements
    ● Erlang in Ericsson switches, 99.9999999% uptime
  – Truly simple rules for how failures are dealt with

Hard Problem #4: Correctness

● How does it cause problems?
  – Historically, many techniques for improving correctness have been:
    ● Hard to learn
    ● Bad for performance
    ● Overly reliant on programmer discipline


● Why do we need it anyway?
  – What choice do we have? Bugs happen, let's try to avoid them.


● Some things we've tried (1):
  – TDD
● What went wrong?
  – Can be a big culture shift; requires programmer training/discipline
  – Dynamic typing ;)
● What went right?
  – Seems to be effective as long as everyone buys in
  – Static typing ;)


● Some things we've tried (2):
  – Static/formal program verification
● What went wrong?
  – Tooling is usually unfamiliar, sometimes terrifying, occasionally expensive
  – Very few useful real-world programs are verifiable without significant and difficult adaptation
● What went right?
  – When it works, it works well
  – Languages are moving in on this turf. :)

Digression: Abstraction

● Avoid premature abstraction!
● Overly abstract code is harder to understand, and understanding is really, really important
● BDUF isn't always wrong, and YAGNI isn't always right: the balance is somewhere in between, selected for your particular project
● Try to delay creating new abstractions until repetition becomes painful
● Use richer languages/libraries!
  – Lots of nice abstractions already written for you, less temptation to roll your own.
● Use more concise languages/libraries!
  – Smaller code ⇒ Less temptation to add abstraction to hide detail

Proactive Meta-Solutions Inc.

● Actors
  – scaling, concurrency, fault tolerance
● Functional programming
  – correctness (especially statically typed), concurrency
● Immutable data
  – correctness, concurrency
● Configuration management
  – Chef, Puppet, containerization

Actors

● Actors can:
  – Create child actors
  – Send messages to other actors
    ● not necessarily local
    ● not necessarily its own child
    ● not necessarily expecting a reply
    ● including references to self or other actors
  – Receive messages, execute code based on their types and data
  – Change their own behaviour in response to a message
  – Ask to be notified when some other actor is stopped

Actors (2)

● Actors cannot:
  – Process more than one message at a time
● Actors should not:
  – Attempt to directly observe the state of other actors
  – Send mutable messages

Actors (3)

● Actors live in a tree. C-R-A-S-H-I-N-G

● When an actor crashes (e.g. due to an exception), it gets restarted automatically, by default.
● However, a parent actor can override the fault handling strategy that the system will apply to that actor's children, e.g.:
  – Restart all children when any child has died
  – Give up, escalating the failure to its own parent
  – Give up and escalate if a child has crashed more than n times in m seconds
● Faults in the root actor are handled by the system
  – By default, in Akka, the system will restart the root actor every time it crashes, forever. ))<>((
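
The "more than n times in m seconds" rule can be sketched as a plain restart loop. This is not Akka's API: `supervise`, `Escalate` and `make_flaky` are hypothetical names, and a "restart" here is just calling the child function again rather than creating a fresh actor.

```python
import time
from collections import deque


class Escalate(Exception):
    """Raised when the supervisor gives up and hands the failure to its parent."""


def supervise(child, max_restarts: int, window_seconds: float, clock=time.monotonic):
    """Run child(); restart it on a crash, escalate on too many recent crashes.

    Escalates once the child has crashed more than max_restarts times
    within the last window_seconds.
    """
    crashes = deque()
    while True:
        try:
            return child()
        except Exception as exc:
            now = clock()
            crashes.append(now)
            # Forget crashes that have fallen out of the time window.
            while crashes and now - crashes[0] > window_seconds:
                crashes.popleft()
            if len(crashes) > max_restarts:
                raise Escalate(f"{len(crashes)} crashes in {window_seconds}s") from exc
            # Otherwise loop around: this is the "restart".


def make_flaky(failures: int):
    """Build a child that crashes the given number of times, then succeeds."""
    state = {"left": failures}

    def child():
        if state["left"] > 0:
            state["left"] -= 1
            raise RuntimeError("boom")
        return "done"

    return child
```

With `max_restarts=3`, a child that crashes twice is restarted and eventually returns, while one that crashes a fourth time inside the window raises `Escalate`. Note that the error handling lives in the supervisor, not in the child.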

Functional Programming

● Avoiding side effects makes it much, much easier to reason about code

● Higher-order functions and other FP language features make declarative programming possible, reduce code size
  – Small code is best code!
● Referentially transparent (side-effect-free) functions that are independent can be run in parallel automatically
  – not always worthwhile due to overhead
  – a bit bleeding edge

Functional Programming (2)

● Not all FP languages are pure!
  – Scala in particular places no restraint on the use of side effects, but displays a pervasive bias toward purity
● Static typing
  – Not inherent in FP, but frequently found together
    ● Scala, Haskell, OCaml, F#
  – Fewer things to test
    ● If it compiles, it might actually be correct!
  – Many classes of runtime error are impossible
    ● Barring unsafe code, reflection hacks, compiler bugs, runtime bugs

Immutable Data

● Immutable data structures can safely be shared at any time
  – Impossible to corrupt internal state
  – No need for defensive copying
  – No TOCTTOU bugs
● Not as inefficient as you might think
  – Structural sharing
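
Structural sharing in miniature: a sketch of an immutable singly linked list. `Node`, `prepend` and `to_list` are names invented for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Node:
    """One immutable cell of a singly linked list."""
    head: int
    tail: Optional["Node"] = None


def prepend(value: int, lst: Optional[Node]) -> Node:
    """Return a new list; the old one is reused as the tail, not copied."""
    return Node(value, lst)


def to_list(lst: Optional[Node]) -> list:
    """Walk the cells and collect their values into a regular Python list."""
    out = []
    while lst is not None:
        out.append(lst.head)
        lst = lst.tail
    return out


base = prepend(2, prepend(3, None))  # 2 -> 3
a = prepend(1, base)                 # 1 -> 2 -> 3
b = prepend(9, base)                 # 9 -> 2 -> 3

print(to_list(a), to_list(b), a.tail is b.tail)
```

Both `a` and `b` are complete lists, yet they share the same `base` cells in memory, so building each one costs a single new cell. That sharing is only safe because the cells are frozen: nobody can change `base` behind the other list's back.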

Configuration Management

● I don't really need to talk about this at a DevOps conference. :)

Be Reactive :(

● Being proactive is great, but we all make mistakes.

● Applications deployed to production and exposed to the public routinely experience unforeseen conditions

● Old man yells at cloud
  – Noisy neighbours
  – EBS behaving badly

● DDoS, exploits, patches, etc.

Be Reactive

● When things do go wrong—and they will, no matter how well you've built your application—you'll need visibility into how your systems and application components are feeling, and what they've been up to.

● Logging
  – Components should log all interesting events
  – Send logs to a centralized collection/analytics system (e.g. Logstash, ElasticSearch, Splunk, etc.)
● Metrics
  – Generate system-level metrics (e.g. collectd)
  – Generate application-level metrics (e.g. Coda Hale's metrics)
  – Send metrics to a centralized analytics system (e.g. Graphite, ElasticSearch, Librato, Circonus, etc.)
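
A minimal sketch of what "log interesting events" and "generate application-level metrics" might look like in code. `log_event`, `Counter`, the event name and the metric path are all hypothetical; the only external convention assumed is Graphite's plaintext line format of path, value and Unix timestamp. Actually shipping the lines to a collector is left out.

```python
import json
import logging
import time


def log_event(logger: logging.Logger, event: str, **fields) -> str:
    """Emit one JSON object per line so a log collector can parse it."""
    line = json.dumps({"event": event, **fields}, sort_keys=True)
    logger.info(line)
    return line


def graphite_line(path: str, value: float, timestamp: int) -> str:
    """Format one sample in Graphite's plaintext form: '<path> <value> <timestamp>'."""
    return f"{path} {value} {timestamp}\n"


class Counter:
    """A minimal in-process application metric."""

    def __init__(self, path: str) -> None:
        self.path = path
        self.value = 0

    def inc(self, n: int = 1) -> None:
        self.value += n

    def sample(self, now=None) -> str:
        """Render the current value as one line, using 'now' or the wall clock."""
        ts = int(time.time() if now is None else now)
        return graphite_line(self.path, self.value, ts)


orders = Counter("app.orders.placed")  # hypothetical metric path
orders.inc()
log_event(logging.getLogger("shop"), "order_placed", order_id=42)
```

Putting shared identifiers such as `order_id` in the structured log events is one way to make it easier later to relate what the logs say to what the metrics show.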

The Very Short Pitch

● Do you have anyone whose job it is to constantly watch the logs and metrics dashboards? Lucky.

● Can you relate your log events to your metrics?
● What constitutes normal behaviour?
  – Is what's normal today the same as what was normal last week?
  – Could you write it down?
  – How would you teach it to a human?
  – How would you describe it to monitoring/analytics software?

Thanks!

● twitter.com/alexcruise
● [email protected]
