high scalability toronto: meetup #2


Upload: scribblelive

Post on 22-Apr-2015


DESCRIPTION

Slides from the second meeting of the Toronto High Scalability Meetup @ http://www.meetup.com/toronto-high-scalability/
- Basics of High Scalability and High Availability
- Using a CDN to Achieve 99% Offload
- Caching at the Code Layer

TRANSCRIPT

Page 1: High Scalability Toronto: Meetup #2

Basics of scale and availability

High Scalability

Page 2: High Scalability Toronto: Meetup #2

Who am I?
• Jonathan Keebler, @keebler, keebler.net
• Built video player for all CTV properties
• Worked on news sites like CP24, CTV, TSN
• CTO, Founder of ScribbleLive
• Bootstrapped a high scalability startup
  – Credit card limit wasn’t that high, had to find cheap ways to handle the load of top tier news sites

Page 3: High Scalability Toronto: Meetup #2

Sample load test
17 x Windows Server 2008, 2 x Varnish, 4 x nginx, 1 x SQL Server 2008

Page 4: High Scalability Toronto: Meetup #2

Scalability vs Availability
• Often talked about separately
• Can’t have one without the other
• Let’s talk about the basic building blocks

Page 5: High Scalability Toronto: Meetup #2

Building blocks
• Content Distribution Network (CDN)
• Load-balancer
• Reverse proxy
• Caching server
• Origin server

Page 6: High Scalability Toronto: Meetup #2

Basic hosting structure


Page 7: High Scalability Toronto: Meetup #2

Basic hosting structure
• CDN: Akamai, CloudFront, EdgeCast
• Load-balancer: Amazon ELB, F5, HAProxy
• Reverse proxy: nginx
• Caching server: Varnish, Squid, aiCache
• Origin server: LAMP, ASP.NET, node.js

Page 8: High Scalability Toronto: Meetup #2

Basic hosting structure
+ Monitoring at every layer: CDN, load-balancer, reverse proxy, caching server, origin server

Page 9: High Scalability Toronto: Meetup #2

Monitor or die
• If you aren’t monitoring your stack, you have NO IDEA what’s going on
• Pingdom/WatchMouse/Gomez are not enough
  – They don’t help you when you’re trying to figure out what’s going wrong
  – You need actionable metrics

Page 10: High Scalability Toronto: Meetup #2

Monitor or die
• Outside monitoring, e.g. Pingdom, Gomez
  – DNS problems, localized problems, SLA
• Inside monitoring, e.g. New Relic, CloudWatch, Server Density
  – High latency, CPU spikes, memory crunch, peek-a-boo servers, rogue processes, SQL queries per second, SQL wait time, SQL locks, disk usage, disk IO performance, page file usage, network traffic, requests per second, active connections, timeouts, sleeping sockets,

Page 12: High Scalability Toronto: Meetup #2

Alerting
• Don’t send alerts to your email
  – Try to work with notifications coming in every second
• PagerDuty
• Don’t overdo it: too many alerts = alert fatigue

Page 13: High Scalability Toronto: Meetup #2

Basic hosting structure
• Now back to our servers...

Page 14: High Scalability Toronto: Meetup #2

Load-balancers
• Bandwidth limits on dedicated boxes are harder to work around
• F5s are great boxes, but have lousy live reporting = can get into trouble quickly
• Adding/removing servers sucks
• DNS load-balancing sucks for everyone

Page 15: High Scalability Toronto: Meetup #2

nginx
• Fantastic at handling a massive number of requests (low CPU, low memory)
• Easy to configure and change on-the-fly
• Gzip, modify headers, host names
• Proxy with error intercept
• No query string or IF-statement* support

Page 16: High Scalability Toronto: Meetup #2

Varnish
• Caching server but so much more
• Fantastic at handling a massive number of requests (low CPU, low memory)
• Easy to configure and change on-the-fly
• Protect your origin servers
• Deals with errors from origin servers

Page 17: High Scalability Toronto: Meetup #2

Origin servers
• Whatever tweaks you make will never help enough
  – e.g. if your disk IO is becoming a problem, it’s already too late to save you
• Keep them stock so you don’t blow your mind; easier to deploy
• Handle any query string hacking in Varnish

Page 18: High Scalability Toronto: Meetup #2

Databases
• No silver bullet
• Two options:
  – Shard (split your data between servers)
  – Cluster (many boxes working together as one)
• Shards commonly used today
  – Lots of work on code level, no incremental IDs
• Clusters have a single point of failure
  – Try upgrading one and tell me they don’t
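The "lots of work on code level, no incremental IDs" point can be sketched roughly as follows. This is a minimal illustration, not the speaker's implementation: the shard host names, `new_id` and `shard_for` are hypothetical, and hashing the key is only one of several possible routing schemes.

```python
import hashlib
import uuid

# Hypothetical shard hosts, for illustration only.
SHARDS = ["db0.internal", "db1.internal", "db2.internal"]


def new_id():
    """Generate an ID in the application, since no single database
    can hand out an auto-increment sequence across shards."""
    return uuid.uuid4().hex


def shard_for(key, shards=SHARDS):
    """Pick a shard from a stable hash of the key, so every app server
    routes the same key to the same database."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return shards[int.from_bytes(digest[:8], "big") % len(shards)]
```

With plain modulo routing like this, changing the number of shards remaps most keys, which is part of the code-level work the slide warns about.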

Page 20: High Scalability Toronto: Meetup #2

Content Distribution Networks

High Scalability

Page 21: High Scalability Toronto: Meetup #2

Basics
• Worldwide network of DNS load-balanced reverse proxies
• Not magic
• Can achieve 99% offload if you do it right
• Have to understand your requests

Page 22: High Scalability Toronto: Meetup #2

Market leaders
• Akamai: market leader, $$$, most options, yearly contracts, pay for GB + request headers
• CloudFront: built on AWS, cheaper, pay-as-you-go, fewer features, new features coming quickly, GB + pay-per-request
• EdgeCast (pay-as-you-go through GoGrid), CloudFlare (optimizer, security, easy!)

Page 23: High Scalability Toronto: Meetup #2

Tiered distribution
• More points-of-presence (POPs) = less caching if your traffic is global
• Need to put a layer of servers between POPs and your servers
• Sophisticated setups throttle requests
  – if 100 come in at the same time, only 1 gets through
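The "100 come in, only 1 gets through" behaviour is often called request coalescing. The sketch below shows the idea in-process with threads. It is an assumption about how such throttling could work, not a description of any CDN's internals, and `Coalescer` and `_Call` are hypothetical names.

```python
import threading


class _Call:
    """One in-flight origin fetch that other callers can wait on."""

    def __init__(self):
        self.done = threading.Event()
        self.waiters = 0
        self.result = None
        self.error = None


class Coalescer:
    """Let one caller per key reach the origin; the rest share its result."""

    def __init__(self, fetch):
        self._fetch = fetch          # function(key) -> response
        self._lock = threading.Lock()
        self._inflight = {}          # key -> _Call

    def get(self, key):
        with self._lock:
            call = self._inflight.get(key)
            leader = call is None
            if leader:
                call = _Call()
                self._inflight[key] = call
            else:
                call.waiters += 1
        if leader:
            try:
                call.result = self._fetch(key)
            except Exception as exc:
                call.error = exc     # followers see the same failure
            finally:
                with self._lock:
                    del self._inflight[key]
                call.done.set()
        else:
            call.done.wait()         # wait for the leader's answer
        if call.error is not None:
            raise call.error
        return call.result
```

If ten threads call `get("/feed")` while the first fetch is still running, the origin function runs once and all ten receive the same response.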

Page 24: High Scalability Toronto: Meetup #2

Cache keys
• Need to have the same query string to get a cached result
• Some CDNs can ignore params
  – important if you need a random number on the query string to prevent browser caching
• Cool options: case sensitive/insensitive, cache differently based on cookie, headers
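To make the cache-key idea concrete, here is a small sketch of how a caching layer might normalize a URL before using it as a key. The ignored parameter names (`rand`, `_`) and the function name `cache_key` are assumptions for illustration. Real CDNs expose this as configuration rather than code.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical cache-buster parameter names to drop from the key.
IGNORED_PARAMS = {"rand", "_"}


def cache_key(url, ignored=IGNORED_PARAMS, case_insensitive_path=False):
    """Build a key that is identical for requests that should share a cached result."""
    parts = urlsplit(url)
    params = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name not in ignored
    ]
    params.sort()  # a=1&b=2 and b=2&a=1 become the same key
    path = parts.path.lower() if case_insensitive_path else parts.path
    return urlunsplit((parts.scheme, parts.netloc.lower(), path, urlencode(params), ""))
```

Under these assumptions, `http://Example.com/feed?b=2&a=1&rand=123` and `http://example.com/feed?a=1&b=2` map to the same key, so the random number still defeats the browser cache without defeating the shared cache.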

Page 25: High Scalability Toronto: Meetup #2

Invalidations suck
• Trying to get a CDN to drop its cache is hard
  – takes a long time to reach all POPs
  – triggers thundering herd
  – takes out all caching for a bit
• Build the ability to change query strings at the code layer
  – e.g. add a version number to JS/CSS URLs. When you roll out, the new URLs break the cache
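One way to build that ability is a helper that stamps every asset URL with the current release version. The following is a sketch only: `RELEASE_VERSION`, the `v` parameter name and `versioned_url` are hypothetical choices, not something named in the talk.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical value; in practice this would come from the build or deploy step.
RELEASE_VERSION = "42"


def versioned_url(url, version=RELEASE_VERSION, param="v"):
    """Return the URL with ?v=<version> set, replacing any older version value."""
    parts = urlsplit(url)
    params = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name != param
    ]
    params.append((param, version))
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(params), parts.fragment)
    )
```

After a rollout bumps the version, every page references a URL the CDN has not seen before, so no invalidation request is needed.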

Page 26: High Scalability Toronto: Meetup #2

How long to cache for?
• As long as you need, but no longer
• Make sure you think about the error case, i.e. what if an error gets cached
  – Some CDNs let you set your own rules for that
  – Remember, invalidations suck
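When the origin controls caching through response headers, one possible rule for the error case is to give errors a much shorter lifetime than successful responses. The sketch below assumes that approach. The TTL values and the function name `cache_control_for` are illustrative, not recommendations from the talk.

```python
def cache_control_for(status, ok_ttl=300, error_ttl=5):
    """Pick a Cache-Control header value from the response status code.

    Assumption: errors are still cached briefly so a failing origin is not
    hammered, but they expire quickly without any invalidation request.
    """
    if 200 <= status < 400:
        return f"public, max-age={ok_ttl}"
    return f"public, max-age={error_ttl}"
```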

Page 27: High Scalability Toronto: Meetup #2

Thundering herds


Page 28: High Scalability Toronto: Meetup #2

Thundering herds
• When you roll out or have high latency, all your timeouts align
  – Origins get slammed at regular intervals by POPs
• Random TTLs are your friend
  – Just +/- a few minutes can be a big help
  – TIP: break into C in Varnish
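The random-TTL idea fits in a few lines. This is a sketch in Python rather than the inline C in Varnish that the tip refers to, and `jittered_ttl` is a hypothetical helper name.

```python
import random


def jittered_ttl(base_seconds, jitter_seconds, rng=random):
    """Return the base TTL shifted by a random amount within +/- jitter_seconds,
    so objects cached at the same moment do not all expire together."""
    return max(1, base_seconds + rng.randint(-jitter_seconds, jitter_seconds))
```

For example, `jittered_ttl(3600, 180)` yields a lifetime somewhere between 57 and 63 minutes, which spreads refreshes out instead of aligning them.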

Page 29: High Scalability Toronto: Meetup #2

Don’t build your own*
• You will never be as smart as Akamai/Amazon
• You will never be able to bring on new servers fast enough to scale
• Spend your time building awesome software
• Build your own caching layer for the POPs (and just in case, to protect your origin servers)

Page 30: High Scalability Toronto: Meetup #2

Discussion
• What CDN do you use?
• War stories

Page 31: High Scalability Toronto: Meetup #2

Caching in Code

High Scalability

Page 32: High Scalability Toronto: Meetup #2

Why do I need this?
• You can’t cache every request
• You can’t cache POST requests
• Protect the database!
• The longer you can go before you have to shard your database, the better

Page 33: High Scalability Toronto: Meetup #2

What is it?
• In-process, in-memory caching
• Static variables work great
  – TIP: .NET: static variables are scoped in the thread, WHY?!
• Custom memory stores
• Whatever you want, just not the disk

Page 34: High Scalability Toronto: Meetup #2

Isn’t that what Memcached is for?
• Memcached is in-memory BUT so is your database
  – Advantages of Memcached over your database:
    • Cheaper to replicate
    • Fast lookups...if your db sucks
  – Disadvantages:
    • Still has network latency, higher than db lookup (unless your db sucks)
    • IT’S NOT A DATABASE!

Page 35: High Scalability Toronto: Meetup #2

Getting started
• Think about your data + classes
• TTLs based on knowledge of your data
• Random TTLs (avoid the thundering herd)
• Use concurrent, thread-safe objects
• Wrap your code in try-catch
  – Caching isn’t worth breaking your site for
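Put together, the checklist above might look like the following minimal in-process cache. It is a sketch under assumptions, not the speaker's code: `LocalCache` and `cached_lookup` are hypothetical names, the TTL numbers are arbitrary, and a `None` result is treated as "not cached".

```python
import random
import threading
import time


class LocalCache:
    """Thread-safe in-memory store with a randomized TTL per entry."""

    def __init__(self, base_ttl=60.0, jitter=10.0, clock=time.monotonic):
        self._lock = threading.Lock()
        self._items = {}             # key -> (expires_at, value)
        self._base_ttl = base_ttl
        self._jitter = jitter
        self._clock = clock

    def get(self, key):
        with self._lock:
            entry = self._items.get(key)
        if entry is None or entry[0] <= self._clock():
            return None              # missing or expired
        return entry[1]

    def set(self, key, value):
        ttl = self._base_ttl + random.uniform(-self._jitter, self._jitter)
        with self._lock:
            self._items[key] = (self._clock() + ttl, value)


def cached_lookup(cache, key, load):
    """Serve from cache when possible; never let a cache failure break the request."""
    try:
        value = cache.get(key)
    except Exception:
        value = None                 # fall through to the real lookup
    if value is not None:
        return value
    value = load(key)
    try:
        cache.set(key, value)
    except Exception:
        pass                         # caching isn't worth breaking the site for
    return value
```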

Page 36: High Scalability Toronto: Meetup #2

Updating cache
• Use semaphores (that Comp Sci degree is finally going to come in handy)
• Semaphores should always unlock on their own
  – Your thread could die/timeout at any time. You don’t want to lock forever
• Use a separate thread for the lookup. Why should one user suffer?
• Using a datetime semaphore is usually the best
  – keep a time when the next update will take place
  – 1st thread to hit that time immediately adds some seconds to the time. Buys itself enough time to do the lookup
  – Any blocked thread gets cached data. DON’T LOCK
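The datetime semaphore described above could be sketched like this. It is one interpretation under stated assumptions: `RefreshingValue` is a hypothetical class, and it uses a small lock that is held only while comparing and bumping the timestamp, never during the lookup itself, so readers are not blocked on the database.

```python
import threading
import time


class RefreshingValue:
    """Cached value refreshed in the background by whichever thread claims the slot."""

    def __init__(self, load, ttl=30.0, grace=5.0, clock=time.monotonic):
        self._load = load
        self._ttl = ttl
        self._grace = grace          # seconds the claiming thread buys itself
        self._clock = clock
        self._guard = threading.Lock()
        self._value = load()
        self._next_update = clock() + ttl

    def _claim_refresh(self):
        """Return True for the first caller past the update time, False for the rest."""
        now = self._clock()
        with self._guard:
            if now < self._next_update:
                return False
            # Push the time forward. If the refresh thread dies, the claim
            # simply expires after `grace` seconds: it unlocks on its own.
            self._next_update = now + self._grace
            return True

    def _refresh(self):
        value = self._load()
        with self._guard:
            self._value = value
            self._next_update = self._clock() + self._ttl

    def get(self):
        if self._claim_refresh():
            # Separate thread for the lookup, so no user waits on it.
            threading.Thread(target=self._refresh, daemon=True).start()
        return self._value           # everyone gets the cached data immediately
```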

Page 37: High Scalability Toronto: Meetup #2

Populating cache for first time
• How do you prevent a thundering herd before the cache is populated?
• Ok, you may have to lock. But be smart about it.
• Are you sure your database can’t handle it?
• This is where other caching layers help: CDN throttling, Varnish throttling, Memcached, read-only databases

Page 38: High Scalability Toronto: Meetup #2

Garbage collection
• Keep counters for metrics, e.g. how many hits to the cached object, datetime of last request for that object
• Every X something, run your garbage collection
  – Use semaphores
  – Don’t get rid of the most used objects
• You are going to collide with running code
  – try-catch is your friend
• Don’t be afraid to dump the cache and start over
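A sweep along these lines could look like the sketch below. The names (`Entry`, `CountingCache`, `collect`) and the eviction rule (drop entries idle longer than `max_idle`, except the `keep_top` most-hit ones) are assumptions chosen to illustrate the bullets, not a prescribed policy.

```python
import threading
import time
from dataclasses import dataclass


@dataclass
class Entry:
    value: object
    hits: int = 0
    last_access: float = 0.0


class CountingCache:
    """In-memory cache that tracks hits and last access for its own cleanup."""

    def __init__(self, clock=time.monotonic):
        self._lock = threading.Lock()
        self._entries = {}
        self._clock = clock

    def put(self, key, value):
        with self._lock:
            self._entries[key] = Entry(value, 0, self._clock())

    def get(self, key):
        with self._lock:
            entry = self._entries.get(key)
            if entry is None:
                return None
            entry.hits += 1
            entry.last_access = self._clock()
            return entry.value

    def collect(self, max_idle, keep_top):
        """Drop idle entries, but never the keep_top most-used ones."""
        now = self._clock()
        with self._lock:
            by_hits = sorted(
                self._entries, key=lambda k: self._entries[k].hits, reverse=True
            )
            protected = set(by_hits[:keep_top])
            stale = [
                key
                for key, entry in self._entries.items()
                if key not in protected and now - entry.last_access > max_idle
            ]
            for key in stale:
                del self._entries[key]
        return stale
```

A timer or background thread would call `collect` every so often. Wrapping that call in try/except keeps a failed sweep from affecting request handling.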

Page 39: High Scalability Toronto: Meetup #2

Watch out for references
• If you are storing something in a cache object, you can save a lot of memory by passing a reference to the object
• Don’t forget about the reference
• Watch out for garbage collection trying to destroy it
• Updating cache operation might involve updating an existing object
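The reference pitfall is easy to reproduce. The snippet below is an illustrative sketch with made-up data: it shows that a cache holding a reference sees later mutations of the object, and that handing out a copy is one way to avoid that, at the cost of the memory the reference was saving.

```python
import copy

cache = {}

article = {"title": "Election night", "tags": ["news"]}
cache["article:1"] = article               # stores a reference, no copy is made

article["tags"].append("live")             # the caller mutates its own object...
shared_tags = cache["article:1"]["tags"]   # ...and the cached entry changed too

# Hand out a deep copy when callers must not be able to alter the cached object.
private = copy.deepcopy(cache["article:1"])
private["tags"].append("draft")            # does not touch the cached entry
```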

Page 40: High Scalability Toronto: Meetup #2

The curse
• More servers = more caches = less efficient
• Discipline: can’t throw more servers at the problem

Page 41: High Scalability Toronto: Meetup #2

Totally worth it!
Requests per minute to origin servers

Page 42: High Scalability Toronto: Meetup #2

Totally worth it!
CPU of 1 x SQL Server 2008 database

Page 43: High Scalability Toronto: Meetup #2

Discussion
• What do you use to cache at a code layer?
• War stories