7 lessons learned building high availability / performance systems - cm2015

@EdMcBane 7 lessons learned building HP/HA systems

Never gonna give you up

Never gonna let you down


Francesco Degrassi

Enthusiastic yet pragmatic Lean Software Developer.

Uppish and cynical nihilist from time to time.


Lean Software Development

Continuous Delivery - High availability - Scale-up

Security sensitive & high uncertainty domains


The challenge

● Primary European client

● Innovative service for the consumer market

● Non-trivial userbase (400K+ users)

● High request rate

● Low latency requirement (<< RTT)


What we built


Make your assumptions explicit

and keep testing them

Do not eat yellow snow

What did we learn?




#1 Make your assumptions explicit

and keep challenging them


Issues

● failure to properly estimate

● failure to reassess performance goals

● losing track of assumptions and implications




#2 Performance & Availability are not extra features


Challenges

● Support for required failover modes

● Support for required scale-out/scale-up modes

● Operability in general

○ and monitoring in particular

● Most important of all, avoiding complexity




#3 Keep things simple

and do not reinvent the wheel


Everything should be made as simple as possible, but not simpler

— Albert Einstein


LESS(1)                General Commands Manual                LESS(1)

NAME
    less - opposite of more

SYNOPSIS
    less -?
    less --help
    less -V
    less --version
    less [-[+]aABcCdeEfFgGiIJKLmMnNqQrRsSuUVwWX~]
         [-b space] [-h lines] [-j line] [-k keyfile]
         [-{oO} logfile] [-p pattern] [-P prompt] [-t tag]
         [-T tagsfile] [-x tab,...] [-y lines] [-[z] lines]
         [-# shift] [+[+]cmd] [--] [filename]...
    (See the OPTIONS section for alternate option syntax with long option names.)

DESCRIPTION
    Less is similar to more(1), but has many more features. Less does not have to read the entire input file before starting, so with large input files it starts up faster than text editors like vi(1). Less uses termcap (or terminfo on some systems), so it can run on

Manual page less(1) line 1 (press h for help or q to quit)


In our case...

● Everything was good with the single-core scenario


SO_REUSEPORT

For TCP, so_reuseport allows multiple listener sockets to be bound to the same port.

Received packets are distributed to multiple sockets bound to the same port using a 4-tuple hash.

With so_reuseport the distribution is uniform.
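To make the option concrete, here is a minimal sketch of several TCP listeners sharing one port. It assumes Linux 3.9 or later, where `socket.SO_REUSEPORT` is available; the helper names `reuseport_listener` and `open_listeners` are made up for this illustration and are not from the talk. In a real multi-core setup each worker process or thread would typically open its own listener.

```python
import socket
from typing import List


def reuseport_listener(port: int, host: str = "127.0.0.1") -> socket.socket:
    """Create a TCP listener that may share its port with other listeners."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # The option has to be set on every socket before bind().
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    sock.bind((host, port))
    sock.listen(128)
    return sock


def open_listeners(count: int, host: str = "127.0.0.1") -> List[socket.socket]:
    """Open `count` listeners that are all bound to the same port."""
    first = reuseport_listener(0, host)  # port 0: let the kernel pick a free port
    port = first.getsockname()[1]
    return [first] + [reuseport_listener(port, host) for _ in range(count - 1)]
```

Without the `setsockopt` call, the second `bind()` on the same port would fail with "Address already in use".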


Suggestions

● Prefer open source solutions

○ when things break, you want to be able to fix them

● Be skeptical

○ pick any software, chances are it is crap

○ +1 for open source, you can “peek under the hood”

● Do not use tools you do not fully understand

○ or as I’d rather say...




#4 Be wary of cargo-cult software engineering


TCP_TW_RECYCLE

Enable fast recycling TIME-WAIT sockets. Default value is 0. It should not be changed without advice/request of technical experts.

With it enabled, Linux will drop any segment from a remote host whose timestamp is not strictly bigger than the latest timestamp recorded for that host.

TCP_TW_RECYCLE + NAT = MADNESS

http://vincent.bernat.im/en/blog/2014-tcp-time-wait-state-linux.html
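A small check like the one below can flag hosts where the setting was switched on by a copied tuning recipe. This is only a sketch: the function names are hypothetical, and it assumes the setting is exposed at `/proc/sys/net/ipv4/tcp_tw_recycle`. The option was removed from Linux in 4.12, so on newer kernels the file is expected to be missing.

```python
from pathlib import Path
from typing import Optional

# Assumed location of the sysctl on kernels that still have it.
TW_RECYCLE = Path("/proc/sys/net/ipv4/tcp_tw_recycle")


def read_sysctl_flag(path: Path) -> Optional[int]:
    """Return the integer stored in a sysctl file, or None if the file is absent."""
    try:
        return int(path.read_text().split()[0])
    except FileNotFoundError:
        return None


def tw_recycle_warning(value: Optional[int]) -> Optional[str]:
    """Return a warning when the flag is enabled, None when it is off or absent."""
    if value is None or value == 0:
        return None
    return "tcp_tw_recycle is enabled: clients behind NAT may have connections dropped"


if __name__ == "__main__":
    print(tw_recycle_warning(read_sysctl_flag(TW_RECYCLE)) or "tcp_tw_recycle is off or absent")
```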









#5 High Availability is much more than just redundancy


Impact

Frequency

Time to recover
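One rough way to see why all three factors matter is to multiply them into an availability estimate. The model below is an assumption made for illustration, not something from the talk: failures are treated as independent and non-overlapping, impact is the fraction of users affected, and `FailureMode` and `availability` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Iterable

HOURS_PER_YEAR = 24 * 365


@dataclass(frozen=True)
class FailureMode:
    name: str
    failures_per_year: float  # frequency
    hours_to_recover: float   # time to recover
    users_affected: float     # impact, as a fraction between 0 and 1

    def unavailability(self) -> float:
        """Fraction of yearly user-hours lost to this failure mode."""
        lost = self.failures_per_year * self.hours_to_recover * self.users_affected
        return lost / HOURS_PER_YEAR


def availability(modes: Iterable[FailureMode]) -> float:
    """Availability as perceived by users under the simple additive model."""
    return 1.0 - sum(mode.unavailability() for mode in modes)
```

In this model, halving any one of the three factors halves the loss, so redundancy (which mostly acts on impact) is only one of three levers.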


Failure impact

● Redundant hardware

● Redundant software components

But there’s more!

● Graceful degradation

● Incremental rollouts


Failure frequency

But then also:

● proven technology

● high quality hardware

● automation (to avoid errors)


Time to recover

● Effective monitoring

○ realtime

○ reliable

○ understandable

○ thorough

○ meaningful

○ actionable

● Rollback / rollforward

● Automation (for speed)


Our response plan goes something like this...

AaaaaAAaaaah


...but be prepared to improvise

“Processes designed for ordinary times are not resilient in a crisis and need to be changed.”

Dave Snowden


Easier said than done

“No, improvising is wonderful. But, the thing is that you cannot improvise unless you know exactly what you're doing.”

Christopher Walken


Improvisation requires

● In-house expertise

● Lots and lots of experience

● Developers on call

● Practice (drills, e.g. chaos monkeys)


Also from Walken...

“At its best, life is completely unpredictable.”

“Everybody has to be a little lucky, I think.”

“I try not to worry about things I can't do anything about.”




#6 Embrace diversity




#7 Monitoring is essential

… and we can do way better


No one size fits all

● “Monitor everything”, like “100% test coverage”, is a nice slogan, nothing more.

● Each environment requires a slightly different solution

● Balance between data availability, cost and ability to keep it actionable


We are doing logging wrong

● Unstructured

● Inconsistent

● Poor defaults

● Complex, obscure components

● A huge waste of computing power
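As a contrast to the first two bullets, here is a minimal sketch of structured, consistent log output using only Python's standard `logging` module. The `JsonFormatter` class, the `make_logger` helper and the `fields` convention for extra data are assumptions of this example, not part of the talk or of any particular logging stack.

```python
import json
import logging
from typing import Optional, TextIO


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Structured data passed as extra={"fields": {...}}; the "fields"
        # key is a convention invented for this sketch.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload, sort_keys=True)


def make_logger(name: str, stream: Optional[TextIO] = None) -> logging.Logger:
    """Return a logger that writes JSON lines to `stream` (stderr by default)."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

A call such as `log.info("request served", extra={"fields": {"request_id": "r-1", "latency_ms": 3}})` then produces a single machine-parseable line, with the same keys for every component that uses the formatter.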


We need a complete overview

● Logs

● Metrics

● Alerts

● Together, coherent, cross-referenced

○ correlating different stores poses challenges


“Human beings, who are almost unique in having the ability to learn from the experience of others, are also remarkable for their apparent disinclination to do so.”

Douglas Adams


Thanks!

@[email protected]@optionfactory.net

http://www.optionfactory.net
