high scalability toronto: meetup #2

Basics of scale and availability

High Scalability

Who am I?
• Jonathan Keebler @keebler keebler.net
• Built video player for all CTV properties
• Worked on news sites like CP24, CTV, TSN
• CTO, Founder of ScribbleLive
• Bootstrapped a high scalability startup
  – Credit card limit wasn’t that high, had to find cheap ways to handle the load of top tier news sites

Sample load test


17 x Windows Server 2008, 2 x Varnish, 4 x nginx, 1 x SQL Server 2008

Scalability vs Availability
• Often talked about separately
• Can’t have one without the other
• Let’s talk about the basic building blocks

Building blocks
• Content Distribution Network (CDN)
• Load-balancer
• Reverse proxy
• Caching server
• Origin server

Basic hosting structure


• CDN: Akamai, CloudFront, EdgeCast
• Load-balancer: Amazon ELB, F5, HAProxy
• Reverse proxy: nginx
• Caching server: Varnish, Squid, aiCache
• Origin server: LAMP, ASP.NET, node.js



+ Monitoring at every layer

Monitor or die
• If you aren’t monitoring your stack, you have NO IDEA what’s going on
• Pingdom/WatchMouse/Gomez not enough
  – Don’t help you when you’re trying to figure out what’s going wrong
  – You need actionable metrics

Monitor or die
• Outside monitoring e.g. Pingdom, Gomez
  – DNS problems, localized problems, SLA
• Inside monitoring e.g. New Relic, CloudWatch, Server Density
  – High latency, CPU spikes, memory crunch, peek-a-boo servers, rogue processes, SQL queries per second, SQL wait time, SQL locks, disk usage, disk IO performance, page file usage, network traffic, requests per second, active connections, timeouts, sleeping sockets

New Relic
• Dashboard

https://rpm.newrelic.com/accounts/37021/applications/122194%23end_user_data=visible

Alerting
• Don’t send alerts to your email
  – Try getting any work done with notifications coming in every second
• PagerDuty
• Don’t overdo it = alert fatigue

Basic hosting structure
• Now back to our servers...

Load-balancers
• Bandwidth limits on dedicated boxes harder to work around
• F5s are great boxes, but have lousy live reporting = can get into trouble quick
• Adding/removing servers sucks
• DNS load-balancing sucks for everyone

nginx
• Fantastic at handling massive number of requests (low CPU, low memory)
• Easy to configure and change on-the-fly
• Gzip, modify headers, host names
• Proxy with error intercept
• No query string or IF-statement* support

Varnish
• Caching server but so much more
• Fantastic at handling massive number of requests (low CPU, low memory)
• Easy to configure and change on-the-fly
• Protect your origin servers
• Deals with errors from origin servers

Origin servers
• Whatever tweaks you make will never help enough
  – e.g. If your disk IO is becoming a problem, it’s already too late to save you
• Keep them stock so you don’t blow your mind, easier to deploy
• Handle any query string hacking in Varnish

Databases
• No silver bullet
• Two options:
  – Shard (split your data between servers)
  – Cluster (many boxes working together as one)
• Shards commonly used today
  – Lots of work on code level, no incremental IDs
• Clusters have a single point of failure
  – Try upgrading one and tell me they don’t
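
A minimal sketch of the sharding option, assuming hash-based placement. The shard names and the helpers `shard_for` and `new_record_id` are hypothetical and not from the talk. It also hints at why incremental IDs stop working: every shard would hand out the same numbers, so the sketch uses UUIDs instead.

```python
import hashlib
import uuid

# Hypothetical shard names; in practice these would be connection strings.
SHARDS = ["db-shard-0", "db-shard-1", "db-shard-2", "db-shard-3"]


def shard_for(key: str, shards=SHARDS) -> str:
    """Map a record key to one shard using a stable hash of the key."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(shards)
    return shards[index]


def new_record_id() -> str:
    """Generate an ID that is unique across shards without coordination."""
    return uuid.uuid4().hex


record_id = new_record_id()
print(record_id, "->", shard_for(record_id))
```

With plain modulo placement, changing the number of shards moves most keys, which is part of the "lots of work on code level".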

Discussion
• What stack do you use?
• What database do you use?
• SQL vs NoSQL

http://highscalability.com/blog/2010/9/5/hilarious-video-relational-database-vs-nosql-fanbois.html

Content Distribution Networks


Basics
• Worldwide network of DNS load-balanced reverse proxies
• Not magic
• Can achieve 99% offload if you do it right
• Have to understand your requests

Market leaders
• Akamai: market leader, $$$, most options, yearly contracts, pay for GB + request headers
• CloudFront: built on AWS, cheaper, pay-as-you-go, fewer features, new features coming quickly, GB + pay-per-request
• EdgeCast (pay-as-you-go through GoGrid), CloudFlare (optimizer, security, easy!)

Tiered distribution
• More points-of-presence (POPs) = less caching if your traffic is global
• Need to put a layer of servers between POPs and your servers
• Sophisticated setups throttle requests
  – if 100 come in at same time, only 1 gets through
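
The "only 1 gets through" behaviour can be sketched as request coalescing. This is an illustration in Python, not how any particular CDN implements it; the `Coalescer` class and the `fetch` callback are hypothetical names.

```python
import threading


class Coalescer:
    """Let one caller per key hit the origin; concurrent callers share its result."""

    def __init__(self, fetch):
        self._fetch = fetch            # function(key) -> response, the slow origin call
        self._lock = threading.Lock()  # protects the in-flight table only
        self._in_flight = {}           # key -> (done event, result holder)

    def get(self, key):
        with self._lock:
            entry = self._in_flight.get(key)
            leader = entry is None
            if leader:
                entry = (threading.Event(), {})
                self._in_flight[key] = entry
        done, holder = entry
        if leader:
            try:
                holder["value"] = self._fetch(key)
            except Exception as exc:
                holder["error"] = exc  # followers see the same failure
            finally:
                with self._lock:
                    del self._in_flight[key]
                done.set()
        else:
            done.wait()                # followers wait for the leader's result
        if "error" in holder:
            raise holder["error"]
        return holder["value"]
```

If 100 threads call `get("/live/feed")` while the first call is still running, `fetch` runs once and all 100 receive the same response.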

Cache keys
• Need to have same query string to get cached result
• Some CDNs can ignore params
  – important if you need a random number on the query string to prevent browser caching
• Cool options: case sensitive/insensitive, cache differently based on cookie, headers
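
A rough sketch of how a cache key might be built when some params are ignored. The parameter names in `IGNORED_PARAMS` and the `cache_key` helper are hypothetical, and sorting the remaining params is an extra assumption (it treats `?a=1&b=2` and `?b=2&a=1` as the same request).

```python
from urllib.parse import urlsplit, parse_qsl, urlencode

# Hypothetical names of cache-busting parameters to leave out of the key.
IGNORED_PARAMS = {"rand", "_", "cachebuster"}


def cache_key(url: str, ignored=IGNORED_PARAMS, case_sensitive=True) -> str:
    """Build a cache key from host, path and the query params that matter."""
    parts = urlsplit(url)
    params = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
              if k not in ignored]
    params.sort()  # assumption: param order does not change the response
    path = parts.path if case_sensitive else parts.path.lower()
    key = parts.netloc.lower() + path
    query = urlencode(params)
    return key + "?" + query if query else key


print(cache_key("http://example.com/feed?id=7&rand=12345"))
```

Two requests that differ only in the random number now map to the same cached result.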

Invalidations suck
• Trying to get CDN to drop its cache is hard
  – takes a long time to reach all POPs
  – triggers thundering herd
  – takes out all caching for a bit
• Build the ability to change query strings at the code layer
  – e.g. add version number to JS/CSS URLs. When you rollout, breaks cache
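
One way to add a version number to JS/CSS URLs, as a sketch. `ASSET_VERSION`, the `v` parameter name and the `versioned` helper are assumptions for illustration; any value that changes on rollout would do.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

ASSET_VERSION = "42"  # hypothetical: bump on every rollout


def versioned(url: str, version: str = ASSET_VERSION) -> str:
    """Append (or replace) a v=<version> query param so a rollout changes the cache key."""
    parts = urlsplit(url)
    params = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
              if k != "v"]
    params.append(("v", version))
    return urlunsplit((parts.scheme, parts.netloc, parts.path,
                       urlencode(params), parts.fragment))


print(versioned("/static/app.js"))
print(versioned("https://cdn.example.com/site.css?theme=dark", "43"))
```

Old URLs simply stop being requested, so nothing has to be invalidated at the CDN.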

How long to cache for?
• As long as you need, but no longer
• Make sure you think about error case i.e. what if an error gets cached
  – Some CDNs let you set your own rules for that
  – Remember, invalidations suck
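
A sketch of what "your own rules" for the error case could look like. The `ttl_for` helper and every number in it are made-up values for illustration, not recommendations from the talk.

```python
def ttl_for(status: int) -> int:
    """Return a cache lifetime in seconds for a response with this HTTP status."""
    if 200 <= status < 300:
        return 300   # normal content: cache for a few minutes
    if status == 404:
        return 30    # missing pages: cache briefly in case the page appears
    if status >= 500:
        return 5     # origin errors: long enough to shield the origin, short enough to recover
    return 0         # everything else: do not cache


for status in (200, 404, 503):
    print(status, "->", ttl_for(status), "seconds")
```

A short error TTL means a cached error clears on its own, without an invalidation.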


Thundering herds
• When you rollout or have high latency, all your timeouts align
  – Origins get slammed at regular interval by POPs
• Random TTLs are your friend
  – Just +/- a few minutes can be a big help
  – TIP: break into C in Varnish
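
The random TTL idea in a few lines of Python, as a sketch; the talk's tip is to do this inside Varnish, whereas here the hypothetical `jittered_ttl` helper would feed whatever sets the cache lifetime.

```python
import random


def jittered_ttl(base_seconds: int, jitter_seconds: int = 120) -> int:
    """Return base_seconds plus or minus up to jitter_seconds, never below 1."""
    low = max(1, base_seconds - jitter_seconds)
    high = base_seconds + jitter_seconds
    return random.randint(low, high)


# Ten objects cached at the same moment now expire at different times.
print([jittered_ttl(600) for _ in range(10)])
```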

Don’t build your own*
• You will never be as smart as Akamai/Amazon
• You will never be able to bring on new servers fast enough to scale
• Spend your time building awesome software
• Build your own caching layer for the POPs (and just in case, to protect your origin servers)

Discussion
• What CDN do you use?
• War stories

Caching in Code


Why do I need this?
• You can’t cache every request
• You can’t cache POST requests
• Protect the database!
• The longer you can go before you have to shard your database, the better

What is it?
• In-process, in-memory caching
• Static variables work great
  – TIP: .NET: static variables are scoped in the thread, WHY?!
• Custom memory stores
• Whatever you want, just not the disk

Isn’t that what Memcached is for?
• Memcached is in-memory BUT so is your database
  – Advantages of Memcached over your database:
    • Cheaper to replicate
    • Fast lookups...if your db sucks
  – Disadvantages:
    • Still has network latency, higher than db lookup (unless your db sucks)
    • IT’S NOT A DATABASE!

Getting started
• Think about your data + classes
• TTLs based on knowledge of your data
• Random TTLs (avoid the thundering herd)
• Use concurrent, thread-safe objects
• Wrap your code in try-catch
  – Caching isn’t worth breaking your site for
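
A minimal in-process cache that follows the checklist above, sketched in Python rather than .NET. `TTLCache` and `cached_lookup` are hypothetical names; the sketch assumes `None` is never a legitimate cached value and leaves expired entries in place until they are overwritten.

```python
import random
import threading
import time


class TTLCache:
    """In-process cache: a dict guarded by a lock, with a randomized TTL per entry."""

    def __init__(self, ttl_seconds=60, jitter_seconds=10):
        self._ttl = ttl_seconds
        self._jitter = jitter_seconds
        self._lock = threading.Lock()
        self._items = {}  # key -> (expires_at, value)

    def get(self, key):
        with self._lock:
            item = self._items.get(key)
        if item is None or item[0] <= time.monotonic():
            return None  # missing or expired
        return item[1]

    def set(self, key, value):
        ttl = self._ttl + random.uniform(-self._jitter, self._jitter)
        with self._lock:
            self._items[key] = (time.monotonic() + ttl, value)


def cached_lookup(cache, key, load):
    """Return load(key), using the cache when possible; cache failures never break the request."""
    try:
        value = cache.get(key)
        if value is not None:
            return value
    except Exception:
        pass  # a broken cache should behave like a cache miss
    value = load(key)
    try:
        cache.set(key, value)
    except Exception:
        pass  # failing to store is not worth an error page
    return value
```

Errors raised by `load` itself still propagate; only the caching is optional.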

Updating cache
• Use semaphores (that Comp Sci degree is finally going to come in handy)
• Semaphores should always unlock on their own
  – Your thread could die/timeout at any time. You don’t want to lock forever
• Use a separate thread for the lookup. Why should one user suffer?
• Using a datetime semaphore is usually the best
  – keep a time when the next update will take place
  – 1st thread to hit that time, immediately adds some seconds to the time. Buys itself enough time to do lookup
  – Any blocked thread gets cached data. DON’T LOCK
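
One possible reading of the datetime semaphore, sketched in Python. `RefreshingValue` and its parameters are hypothetical. A tiny lock is held only while checking and bumping the timestamp, never during the lookup, so readers are not blocked by a slow refresh.

```python
import threading
import time


class RefreshingValue:
    """Cached value refreshed by whichever thread first notices it is due."""

    def __init__(self, load, ttl_seconds=30.0, grace_seconds=5.0):
        self._load = load                # slow lookup, e.g. a database query
        self._ttl = ttl_seconds
        self._grace = grace_seconds
        self._value = load()             # first fill happens up front
        self._next_update = time.monotonic() + ttl_seconds
        self._guard = threading.Lock()   # held only to read or bump the timestamp

    def _claim_refresh(self):
        """Return True for exactly one caller once the update time has passed."""
        now = time.monotonic()
        with self._guard:
            if now < self._next_update:
                return False
            # Push the deadline out so other threads keep serving cached data.
            # If this refresh dies, the claim reopens after the grace period.
            self._next_update = now + self._grace
            return True

    def _refresh(self):
        try:
            value = self._load()
        except Exception:
            return  # keep the stale value; another thread retries after the grace period
        self._value = value
        with self._guard:
            self._next_update = time.monotonic() + self._ttl

    def get(self):
        value = self._value  # every caller gets cached data immediately
        if self._claim_refresh():
            # Do the lookup off the request thread so no user waits for it.
            threading.Thread(target=self._refresh, daemon=True).start()
        return value
```

Because the claim expires on its own, a refresh thread that dies or times out cannot lock the value forever.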

Populating cache for first time
• How do you prevent thundering herd before cache?
• Ok, you may have to lock. But be smart about it.
• Are you sure your database can’t handle it?
• This is where other caching layers help: CDN throttling, Varnish throttling, Memcached, read-only databases

Garbage collection
• Keep counters for metrics e.g. how many hits to the cached object, datetime of last request for that object
• Every X something, run your garbage collection
  – Use semaphores
  – Don’t get rid of the most used objects
• You are going to collide with running code
  – try-catch is your friend
• Don’t be afraid to dump the cache and start over
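
A sketch of a cache that keeps hit counters and evicts the least-used entries. `CountingCache` and `max_items` are hypothetical; a plain lock stands in for the semaphore, and how often `collect` runs is left to the caller.

```python
import threading
import time


class CountingCache:
    """Cache that tracks hits and last request time per entry for garbage collection."""

    def __init__(self, max_items=1000):
        self._max = max_items
        self._lock = threading.Lock()
        self._items = {}  # key -> [value, hits, last_access]

    def get(self, key):
        with self._lock:
            item = self._items.get(key)
            if item is None:
                return None
            item[1] += 1
            item[2] = time.monotonic()
            return item[0]

    def set(self, key, value):
        with self._lock:
            self._items[key] = [value, 0, time.monotonic()]

    def collect(self):
        """Evict the least-used entries until the cache fits max_items; return evicted keys."""
        with self._lock:
            overflow = len(self._items) - self._max
            if overflow <= 0:
                return []
            # Fewest hits first; ties broken by the oldest last request.
            ranked = sorted(self._items,
                            key=lambda k: (self._items[k][1], self._items[k][2]))
            victims = ranked[:overflow]
            for key in victims:
                del self._items[key]
            return victims

    def clear(self):
        """Dump the whole cache and start over."""
        with self._lock:
            self._items.clear()
```

Calling `collect()` from a timer thread every few minutes would cover "every X something".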

Watch out for references
• If you are storing something in a cache object, you can save a lot of memory by passing reference to object
• Don’t forget about the reference
• Watch out for garbage collection trying to destroy it
• Updating cache operation might involve updating an existing object
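
A small Python illustration of the reference trade-off. The `put`, `get_shared` and `get_private` helpers are hypothetical. Handing out the cached reference is cheap, but any caller that edits the object edits the cache too.

```python
import copy

cache = {}


def put(key, obj):
    cache[key] = obj                  # stores a reference, not a copy


def get_shared(key):
    return cache[key]                 # cheap, but callers can mutate the cached object


def get_private(key):
    return copy.deepcopy(cache[key])  # costs memory and CPU, but is safe to modify


put("story", {"title": "Original"})

shared = get_shared("story")
shared["title"] = "Edited by a caller"
print(cache["story"]["title"])        # the cached object changed as well

private = get_private("story")
private["title"] = "Local change only"
print(cache["story"]["title"])        # unchanged by the private copy
```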

The curse
• More servers = more caches = less efficient
• Discipline: can’t throw more servers at the problem
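
A back-of-the-envelope sketch of the curse, under a simplifying assumption that is not from the talk: every server keeps its own in-process cache and refreshes every hot object once per TTL. The function name and the numbers are hypothetical.

```python
def origin_lookups_per_minute(servers: int, hot_objects: int, ttl_seconds: float) -> float:
    """Each server refreshes each hot object once per TTL, so lookups grow with server count."""
    return servers * hot_objects * (60.0 / ttl_seconds)


for servers in (1, 4, 16):
    lookups = origin_lookups_per_minute(servers, hot_objects=100, ttl_seconds=30)
    print(servers, "servers ->", lookups, "database lookups per minute")
```

Under that assumption the database load from cache refreshes grows linearly with the number of servers, even though user traffic has not changed.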

Totally worth it!
• Chart: requests per minute to origin servers
• Chart: CPU of 1 x SQL Server 2008 database

Discussion
• What do you use to cache at a code layer?
• War stories

Thank you!
• Jonathan Keebler
• @keebler
