7 lessons learned building high availability / performance systems - cm2015

Upload: francesco-degrassi

Post on 28-Jan-2018

TRANSCRIPT

Page 1: 7 lessons learned building high availability / performance systems - CM2015

@EdMcBane 7 lessons learned building HP/HA systems

Never gonna give you up

Never gonna let you down

Page 2: 7 lessons learned building high availability / performance systems - CM2015

@EdMcBane

Francesco Degrassi

Enthusiastic yet pragmatic Lean Software Developer.

Uppish and cynical nihilist from time to time.

Page 3: 7 lessons learned building high availability / performance systems - CM2015

Lean Software Development

Continuous Delivery - High availability - Scale-up

Security sensitive & high uncertainty domains

Page 4: 7 lessons learned building high availability / performance systems - CM2015

The challenge

● Major European client

● Innovative service for the consumer market

● Non-trivial userbase (400K+ users)

● High request rate

● Low latency requirement (<< RTT)

Page 5: 7 lessons learned building high availability / performance systems - CM2015

What we built

Page 6: 7 lessons learned building high availability / performance systems - CM2015

What did we learn?

Make your assumptions explicit and keep testing them

Do not eat yellow snow

Page 7: 7 lessons learned building high availability / performance systems - CM2015

#1 Make your assumptions explicit and keep challenging them

Page 8: 7 lessons learned building high availability / performance systems - CM2015

Issues

● failure to properly estimate

● failure to reassess performance goals

● losing track of assumptions and implications

Page 9: 7 lessons learned building high availability / performance systems - CM2015

#2 Performance & Availability are not extra features

Page 10: 7 lessons learned building high availability / performance systems - CM2015

Page 11: 7 lessons learned building high availability / performance systems - CM2015

Challenges

● Support for required failover modes

● Support for required scale-out/scale-up modes

● Operability in general

○ and monitoring in particular

● Most important of all, avoiding complexity

Page 12: 7 lessons learned building high availability / performance systems - CM2015

#3 Keep things simple

and do not reinvent the wheel

Page 13: 7 lessons learned building high availability / performance systems - CM2015

Everything should be made as simple as possible, but not simpler

— Albert Einstein

Page 14: 7 lessons learned building high availability / performance systems - CM2015

Page 15: 7 lessons learned building high availability / performance systems - CM2015

LESS(1) General Commands Manual LESS(1)

NAME
    less - opposite of more

SYNOPSIS
    less -?
    less --help
    less -V
    less --version
    less [-[+]aABcCdeEfFgGiIJKLmMnNqQrRsSuUVwWX~]
         [-b space] [-h lines] [-j line] [-k keyfile]
         [-{oO} logfile] [-p pattern] [-P prompt] [-t tag]
         [-T tagsfile] [-x tab,...] [-y lines] [-[z] lines]
         [-# shift] [+[+]cmd] [--] [filename]...
    (See the OPTIONS section for alternate option syntax with long option names.)

DESCRIPTION
    LESS IS similar to MORE (1), but has many more features. Less does not have to read the entire input file before starting, so with large input files it starts up faster than text editors like vi (1). Less uses termcap (or terminfo on some systems), so it can run on

Manual page less(1) line 1 (press h for help or q to quit)

Page 16: 7 lessons learned building high availability / performance systems - CM2015

In our case...

● Everything was good with the single core scenario

Page 17: 7 lessons learned building high availability / performance systems - CM2015

SO_REUSEPORT

For TCP, so_reuseport allows multiple listener sockets to be bound to the same port.

Received packets are distributed to multiple sockets bound to the same port using a 4-tuple hash.

With so_reuseport the distribution is uniform.
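The behaviour described on this slide can be tried out with a small sketch. It assumes Linux and a Python build that exposes `socket.SO_REUSEPORT`; the helper names `reuseport_listener` and `open_listeners` are hypothetical and are not taken from the talk. It only shows that several listener sockets can share one port; it does not measure how connections are distributed.

```python
import socket


def reuseport_listener(port, host="127.0.0.1"):
    """Create a TCP listener that may share its port with other SO_REUSEPORT sockets."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # The option has to be set before bind(), on every socket sharing the port.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    sock.bind((host, port))
    sock.listen(128)
    return sock


def open_listeners(count):
    """Open `count` listeners on one port, e.g. one per worker process or core."""
    first = reuseport_listener(0)  # port 0: let the kernel choose a free port
    port = first.getsockname()[1]
    return [first] + [reuseport_listener(port) for _ in range(count - 1)]
```

In a real server each listener would normally live in its own worker process, with every worker calling `accept()` on its own socket rather than contending for a single shared one.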

Page 18: 7 lessons learned building high availability / performance systems - CM2015

Suggestions

● Prefer open source solutions

○ when things break, you want to be able to fix them

● Be skeptical

○ pick any software, chances are it is crap

○ +1 for open source, you can “peek under the hood”

● Do not use tools you do not fully understand

○ or as I’d rather say...

Page 19: 7 lessons learned building high availability / performance systems - CM2015

#4 Be wary of cargo-cult software engineering

Page 20: 7 lessons learned building high availability / performance systems - CM2015

Page 21: 7 lessons learned building high availability / performance systems - CM2015

TCP_TW_RECYCLE

Enable fast recycling TIME-WAIT sockets. Default value is 0. It should not be changed without advice/request of technical experts.

Linux will drop any segment from the remote host whose timestamp is not strictly bigger than the latest recorded timestamp

TCP_TW_RECYCLE + NAT = MADNESS
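Why the combination with NAT hurts can be illustrated with a toy model. This is not the kernel's code path, only a sketch of the rule quoted on the slide: the server remembers one "latest timestamp" per remote address and drops anything that is not strictly bigger. The class name, client names, IP address and timestamp values are all made up for illustration.

```python
class PerHostTimestampFilter:
    """Toy model: one 'latest timestamp' is remembered per remote IP."""

    def __init__(self):
        self.latest = {}  # remote IP -> latest accepted TCP timestamp

    def accept(self, remote_ip, timestamp):
        last = self.latest.get(remote_ip)
        if last is not None and timestamp <= last:
            return False  # not strictly bigger: the segment is dropped
        self.latest[remote_ip] = timestamp
        return True


def simulate_nat():
    """Two clients with independent timestamp clocks share one public IP."""
    server = PerHostTimestampFilter()
    nat_ip = "203.0.113.7"  # hypothetical address of the NAT gateway
    segments = [
        ("client_a", 500_000),
        ("client_b", 1_200),    # client_b's clock started later, so it is far behind
        ("client_a", 500_010),
        ("client_b", 1_210),
    ]
    return [(client, server.accept(nat_ip, ts)) for client, ts in segments]
```

In this model every segment from `client_b` is rejected, because the server cannot tell the two clients apart and compares `client_b`'s timestamps against `client_a`'s clock.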

Page 22: 7 lessons learned building high availability / performance systems - CM2015

Page 23: 7 lessons learned building high availability / performance systems - CM2015

#5 High Availability is much more than just redundancy

Page 24: 7 lessons learned building high availability / performance systems - CM2015

Impact

Frequency

Time to recover
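One way to see why all three factors matter is a back-of-the-envelope calculation. The multiplicative model below is an assumption made for illustration, not something stated in the talk: it treats user-visible downtime as failure frequency times time to recover times the fraction of users affected. Function names and figures are hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def expected_downtime_minutes(failures_per_year, minutes_to_recover, impact_fraction):
    """User-weighted downtime per year under a simple multiplicative model."""
    return failures_per_year * minutes_to_recover * impact_fraction


def availability(downtime_minutes):
    """Fraction of the year the service is up, given yearly downtime in minutes."""
    return 1.0 - downtime_minutes / MINUTES_PER_YEAR


# Same failure frequency and recovery time, but each failure now reaches
# a quarter of the users instead of all of them.
baseline = expected_downtime_minutes(12, 30, 1.0)
contained = expected_downtime_minutes(12, 30, 0.25)
```

Under this model, reducing any one of the three factors reduces downtime proportionally, which is why redundancy alone (which mostly lowers impact) is only part of the picture.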

Page 25: 7 lessons learned building high availability / performance systems - CM2015

Failure impact

● Redundant hardware

● Redundant software components

But there’s more!

● Graceful degradation

● Incremental rollouts
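Graceful degradation can be as simple as serving a cheaper answer when the preferred one is unavailable. The sketch below is a minimal illustration under that assumption; the function names and the recommendation scenario are hypothetical and do not describe the system presented in the talk.

```python
def with_fallback(primary, fallback):
    """Wrap `primary` so that `fallback` is served whenever `primary` raises."""
    def call(*args, **kwargs):
        try:
            return primary(*args, **kwargs)
        except Exception:
            # Degrade instead of failing the whole request.
            return fallback(*args, **kwargs)
    return call


def personalised_recommendations(user_id):
    # Stand-in for a backend that is currently down.
    raise TimeoutError("recommendation backend unavailable")


def popular_items(user_id):
    # Static, precomputed answer that needs no backend call.
    return ["top-1", "top-2", "top-3"]


recommendations = with_fallback(personalised_recommendations, popular_items)
```

Calling `recommendations("u42")` returns the static list while the backend is failing, so the user sees a poorer result rather than an error.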

Page 26: 7 lessons learned building high availability / performance systems - CM2015

Failure frequency

But then also:

● proven technology

● high quality hardware

● automation (to avoid errors)

Page 27: 7 lessons learned building high availability / performance systems - CM2015

Time to recover

● Effective monitoring

○ realtime

○ reliable

○ understandable

○ thorough

○ meaningful

○ actionable

● Rollback / rollforward

● Automation (for speed)

Page 28: 7 lessons learned building high availability / performance systems - CM2015

Our response plan goes something like this...

AaaaaAAaaaah

Page 29: 7 lessons learned building high availability / performance systems - CM2015

...but be prepared to improvise

“Processes designed for ordinary times are not resilient in a crisis and need to be changed.”

Dave Snowden

Page 30: 7 lessons learned building high availability / performance systems - CM2015

Easier said than done

“No, improvising is wonderful. But, the thing is that you cannot improvise unless you know exactly what you're doing.”

Christopher Walken

Page 31: 7 lessons learned building high availability / performance systems - CM2015

Improvisation requires

● In house expertise

● Lots and lots of experience

● Developers on call

● Practice (drills, e.g. chaos monkeys)

Page 32: 7 lessons learned building high availability / performance systems - CM2015

Also from Walken...

“At its best, life is completely unpredictable.”

“Everybody has to be a little lucky, I think.”

“I try not to worry about things I can't do anything about.”

Page 33: 7 lessons learned building high availability / performance systems - CM2015

#6 Embrace diversity

Page 34: 7 lessons learned building high availability / performance systems - CM2015

Page 35: 7 lessons learned building high availability / performance systems - CM2015

Page 36: 7 lessons learned building high availability / performance systems - CM2015

#7 Monitoring is essential

… and we can do way better

Page 37: 7 lessons learned building high availability / performance systems - CM2015

No one size fits all

● “Monitor everything”, like “100% test coverage”, is a nice slogan, nothing more.

● Each environment requires a slightly different solution

● Balance between data availability, cost and ability to keep it actionable

Page 38: 7 lessons learned building high availability / performance systems - CM2015

Page 39: 7 lessons learned building high availability / performance systems - CM2015

We are doing logging wrong

● Unstructured

● Inconsistent

● Poor defaults

● Complex, obscure components

● A huge waste of computing power
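As a contrast to the "unstructured" and "inconsistent" points above, here is a minimal sketch of structured logging using only Python's standard `logging` and `json` modules. The `JsonFormatter` class, the `make_logger` helper and the `fields` convention are assumptions chosen for illustration, not something the talk prescribes.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record):
        entry = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Structured fields passed via `extra={"fields": {...}}` (a local convention).
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry, sort_keys=True)


def make_logger(name, stream):
    """Build a logger that writes JSON lines to `stream`."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

A call such as `logger.info("request served", extra={"fields": {"request_id": "r-1", "latency_ms": 12}})` then produces a line that can be parsed and queried by field instead of by regular expression.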

Page 40: 7 lessons learned building high availability / performance systems - CM2015

We need a complete overview

● Logs

● Metrics

● Alerts

● Together, coherent, cross-referenced

○ correlating different stores poses challenges

Page 41: 7 lessons learned building high availability / performance systems - CM2015

Human beings, who are almost unique in having the ability to learn from the experience of others, are also remarkable for their apparent disinclination to do so.

Douglas Adams