Provisioning and Capacity Planning Workshop (Dogpatch Labs, September 2015)


Upload: brian-brazil

Post on 14-Apr-2017


TRANSCRIPT

Page 1: Provisioning and Capacity Planning Workshop (Dogpatch Labs, September 2015)

Scaling Workshop: Provisioning and Capacity Planning

Brian Brazil, Founder

Page 2

Who am I?
Engineer passionate about running software reliably in production.

● TCD CS Degree
● Google SRE for 7 years, working on high-scale reliable systems such as Adwords, Adsense, Ad Exchange, Billing, Database
● Boxever TL Systems & Infrastructure, applied processes and technology to allow the company to scale and reduce operational load
● Contributor to many open source projects, including Prometheus, Ansible, Python, Aurora and Zookeeper
● Founder of Robust Perception, making scalability and efficiency available to everyone

Page 3

Goals
At the end of the workshop you will be able to:

● Estimate how much spare capacity you have in less than 5 minutes
● Estimate how much runway that capacity provides
● Determine how many machines you need
● Spot common potential problems as you scale

This should set you up for your first 1-2 years, if not more

Page 4

Audience
This is an introductory workshop to teach you the basics.

Your company:

● Uses Unix in production
● Has a relatively simple setup/small number of machines
● Operations primarily performed by developers
● Performance has not been a primary consideration in your product

I’m also going to focus on webservices-type systems rather than offline processing or batch.

Page 5

Capacity

Page 6

Estimate your capacity in 3 easy steps!

1. Measure bottleneck resource at peak traffic
2. Divide to get fraction of limit
3. Multiply by peak traffic

Page 7

Estimate your capacity in 3 not so easy steps!

1. What’s your bottleneck? How do you measure it?
2. What’s your bottleneck’s limit?
3. What’s your peak traffic?

Page 8

Step 1: What’s the bottleneck?
The most common bottlenecks:

1. CPU
2. Disk I/O

Less common: network, disk space, external resources, quotas, hardcoded limits, contention/locking, memory, file descriptors, port numbers, humans

Page 9

Step 1: Where’s the bottleneck?
Look at CPU % and disk I/O utilisation on each type of machine.

If you have monitoring, use that.

Failing that:

sudo apt-get install sysstat

iostat -x 5

Page 10

Step 1: iostat

avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
           4.24   0.00     1.18     0.98    0.00  93.60

Device:  rrqm/s  wrqm/s   r/s    w/s  rkB/s   wkB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
sda        0.00    1.40  0.00   3.80   0.00   45.20     23.79      0.00   1.05     0.00     1.05   0.84   0.32
sdb        0.00    1.40  0.00  21.00   0.00  267.20     25.45      0.09   4.11     0.00     4.11   4.11   8.64
sdc        0.00    1.40  0.00  20.00   0.00  267.20     26.72      0.06   3.24     0.00     3.24   3.24   6.48
md0        0.00    0.00  0.00   2.00   0.00    8.00      8.00      0.00   0.00     0.00     0.00   0.00   0.00

The numbers you care about are %idle and %util.

%idle is the fraction of CPU not in use. %util is the fraction of disk I/O capacity in use; take the largest %util across your disks.

Page 11

Step 2: What’s the limit?
We now know the CPU and disk I/O usage on each machine at peak.

Which is the bottleneck though?

Need to know the limit. Rules of thumb:

● 80% limit for CPU
● 50% limit for disk I/O

Page 12

Step 2: Division
Find how full each CPU and disk is.

Say we had a disk 10% utilised, and a CPU 20% utilised (80% idle).

0.1/0.5 = 0.2 => Disk IO is at 20% of limit

0.2/0.8 = 0.25 => CPU is at 25% of limit

CPU is our bottleneck, with 25% of capacity used.
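The division on this slide can be sketched in a few lines of Python (the resource names and the limits dictionary are mine, for illustration):

```python
# Rule-of-thumb limits from the workshop: 80% for CPU, 50% for disk I/O.
LIMITS = {"cpu": 0.80, "disk_io": 0.50}

def fraction_of_limit(utilisation, resource):
    """How full a resource is, relative to its safe limit."""
    return utilisation / LIMITS[resource]

# The slide's example: disk 10% utilised, CPU 20% utilised (80% idle).
usage = {"cpu": 0.20, "disk_io": 0.10}
fractions = {r: fraction_of_limit(u, r) for r, u in usage.items()}
bottleneck = max(fractions, key=fractions.get)  # CPU, at 25% of its limit
```

Whichever resource is closest to its limit is the bottleneck; here CPU (0.20/0.80 = 0.25) beats disk (0.10/0.50 = 0.20).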

Page 13

Step 2: Utilisation Visualisation

Page 14

Step 3: Peak traffic
Now that we know how full our bottleneck is, we need to know how much capacity we have.

Figure out how much traffic you were handling around the time you measured CPU and disk utilisation.

You might do this via monitoring, by parsing logs, or, if you’re really stuck, with tcpdump.

Page 15

Step 3: The 2nd division
Let’s say our queries per second (qps) was 10 around peak.

Our CPU was our bottleneck, and about 25% of our limit.

10/0.25 = 40qps

So we can currently handle a maximum of around 40qps.
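The second division is the same idea in code (function name is mine):

```python
def capacity_qps(peak_qps, bottleneck_fraction):
    """Extrapolate linearly: peak traffic divided by how full the bottleneck is."""
    return peak_qps / bottleneck_fraction

# Slide example: 10qps at peak, CPU bottleneck at 25% of its limit.
max_qps = capacity_qps(10, 0.25)  # 10 / 0.25 = 40.0 qps
```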

Page 16

Step 3: Capacity Visualisation

Page 17

Now you can estimate your capacity in 3 easy steps!

1. Measure bottleneck resource at peak traffic
   ○ Use monitoring or iostat to see how close you are to the limit, say 20% full
2. Divide to get fraction of limit
   ○ With a limit of 80% for CPU, you’re 20/80 = 25% full
3. Multiply by peak traffic
   ○ Traffic was 10qps, so 10/0.25 = 40qps capacity
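The whole three-step recipe fits in one small function (a sketch; the names are mine):

```python
def estimate_capacity(peak_qps, utilisation, limit):
    """Steps 1-3 in one go: measure, divide by limit, scale peak traffic."""
    fraction = utilisation / limit   # step 2: fraction of the safe limit
    return peak_qps / fraction       # step 3: extrapolate peak traffic to capacity

cap = estimate_capacity(10, 0.20, 0.80)  # the running example: 40.0 qps
```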

Page 18

Runway

Page 19

How much runway do you have?
You now have a rough idea of how much capacity you have to spare.

In the example here, we’re using 10qps out of 40qps capacity.

How long will that 30qps last you?

The two main factors are new customers and organic growth.

Page 20

New Customers
New customers/partners are your main source of traffic growth.

Look at your traffic graphs around the time a new customer started using your system.

If the customer had say 1M users and you saw 10qps increased peak traffic, you can now predict how much traffic future customers will need.

Based on sales predictions, you can tell how much capacity you’ll need for new customers.

Page 21

Organic growth
Over time your existing customers/partners will use the system more and more: new employees are hired, they get new customers of their own, etc.

Look at your monitoring’s traffic graphs over a few months to see what the trend is like. Do your best to ignore the impact of launches.

Calculate your % growth month on month.

Starting out, it’s likely that organic growth will not be your main consideration.

Page 22

Calculating runway
Once again in the example here, we’re using 10qps out of 40qps capacity.

Each 1M user customer generates 10qps of additional traffic.

You also expect a negligible amount of organic growth.

This means you can handle 3M more users worth of new customers.

If you’re signing up one 1M user customer per month, that gives you 3 months.
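The runway arithmetic on this slide, sketched in Python (names are mine; organic growth is ignored, as the slide assumes):

```python
def runway_months(capacity, current, qps_per_customer, customers_per_month):
    """Months of runway from new-customer growth alone (organic growth ignored)."""
    spare = capacity - current                       # 40 - 10 = 30 qps spare
    customers_supported = spare / qps_per_customer   # 30 / 10 = 3 more customers
    return customers_supported / customers_per_month

months = runway_months(40, 10, 10, 1)  # slide example: 3.0 months
```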

Page 23

Provisioning

Page 24

Provisioning vs Capacity Planning

Capacity Planning:

In 6 months I will have 7 new customers, and need to be able to handle 100qps in total

Provisioning:

To handle 100qps I need X frontends and Y databases

Page 25

Provisioning: What can a machine handle?
Continuing our example, let’s say we had 4 machines, each reporting 20% CPU (25% of the 80% limit) while handling 10qps each.

The key metric is qps per machine.

10qps / 0.2 CPU utilisation = 50qps/machine at full CPU

We can only safely use 80% of the machine, so 50 * 0.8 = 40qps.

So we can handle 40qps per machine.
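The per-machine calculation above, as a short Python sketch (the function name and default limit are mine):

```python
def safe_qps_per_machine(observed_qps, cpu_utilisation, cpu_limit=0.80):
    """Scale observed per-machine traffic to full CPU, then back off to the safe limit."""
    at_full_cpu = observed_qps / cpu_utilisation  # 10 / 0.2 = 50 qps
    return at_full_cpu * cpu_limit                # 50 * 0.8 = 40 qps

per_machine = safe_qps_per_machine(10, 0.20)
```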

Page 26

Provisioning: How many machines do I need?
If we want to handle 100qps, we need 100/40 = 2.5 machines. So 3 machines.

For each type of machine, calculate the incoming external qps it can handle and how many you need.

Don’t fret about $10/month worth of cost, it’s not worth your time.
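The rounding-up step is worth making explicit; a minimal sketch (function name is mine):

```python
import math

def machines_needed(target_qps, safe_qps_per_machine):
    """Round up: 2.5 machines means you deploy 3."""
    return math.ceil(target_qps / safe_qps_per_machine)

machines_needed(100, 40)  # 100/40 = 2.5 -> 3 machines
```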

Page 27

Provisioning: Visualisation

Page 28

Review: The Basics

● Estimating capacity:
   ○ Measure bottleneck at peak
   ○ Find how near the bottleneck is to its limit
   ○ Calculate spare capacity based on peak traffic
● Keep an eye on new customers/partners and organic growth to track runway
● For provisioning, calculate qps/machine for each type of machine

Page 29

Life is not Basic

Page 30

A few wrinkles
I’ve glossed over a lot of detail so you can go away from today’s workshop with something you can immediately use.

Some questions ye may have:

● Why measure at peak traffic?
● What if I don’t have much traffic?
● Why 80% limit on CPU and 50% on disk?
● What if a machine fails?
● What if things aren’t that simple?
● Doesn’t autoscaling take care of all this for me?

Page 31

Why measure at peak traffic?
As your utilisation increases:

● Latency increases
● Performance decreases

In addition, skew from constant background CPU usage is reduced at peak.

Measuring at peak helps allow for these factors.

Beware the knee.

Page 32

What if I don’t have much traffic?
If you don’t have enough traffic to show up in top or iotop, then these techniques won’t help you much.

You could loadtest, but that takes time. Or use rules of thumb.

Easier way: Use latency to estimate throughput.

If your queries take 10ms, then you can probably handle 100/s
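The latency rule of thumb can be written down directly (a sketch; the `concurrency` parameter is a hypothetical knob I've added for servers that handle requests in parallel, not something from the slides):

```python
def throughput_estimate_qps(latency_seconds, concurrency=1):
    """Rule of thumb: a serial server answering in t seconds does about 1/t qps.

    concurrency is a hypothetical extension for parallel serving."""
    return concurrency / latency_seconds

estimate = throughput_estimate_qps(0.010)  # 10ms queries -> roughly 100 qps
```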

Page 33

Why 80% limit on CPU and 50% on disk?
For CPU, the utilisation/latency curve means you want to avoid running at too high a utilisation.

If you have the CPU to yourself, 90-95% is safe in a controlled environment with good loadtesting. This is uncommon, so leave a safety margin for OS processes etc.

For spinning disks the impact of utilisation tends to be more problematic, and background tasks tend to use a lot of disk.

Page 34

What if a machine fails?
You generally should add 2 extra machines beyond what you need to serve peak qps. This is commonly known as “n+2”.

This is to allow for one machine failure, and to let you take down a machine to push a new binary, perform maintenance or whatever.

This also gives you some slack in your capacity. As you grow, more sophisticated math is required.
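The “n+2” rule is trivial arithmetic, but spelling it out keeps it from being forgotten (a sketch; the function name is mine):

```python
def provisioned_machines(needed_for_peak, spares=2):
    """'n+2': one spare for a machine failure, one for maintenance or a binary push."""
    return needed_for_peak + spares

provisioned_machines(3)  # 3 needed for peak -> deploy 5
```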

Page 35

What if things aren’t that simple?
Lots of other issues can throw a spanner in the works.

● Heterogeneous machines
● Varying machine performance
● Varying traffic mixes
● Multiple datacenters
● Multi-tiered services

As a general rule try to keep things simple. A perfect model is brittle and usually takes more time than it’s worth.


Page 38

Doesn’t autoscaling take care of all this for me?

Short answer

No

Long answer

Haha, Haha.

No

Page 39

Doesn’t autoscaling take care of all this for me?
EC2 Autoscaling can eliminate some of the day-to-day work in provisioning servers.

There’s operational and complexity overhead, as you have to maintain images and systems that can be spun up.

You have to wait for instances to spin up, so you can’t rely on it completely for sudden spikes. You need to do the math to tune it to handle spikes.

You still have to tune everything. Control systems are hard.

Page 40

Wrapping Up

Page 41

Monitoring Matters
A common thread through this workshop is that monitoring is what should be providing you the information you need to make operational decisions.

Make sure you have a good monitoring system.

Logs are not monitoring, though better than nothing.

I recommend Prometheus.io: If it didn’t exist I would have created it.

Page 42

Production Matters
Provisioning and capacity planning is just one aspect of production. There are many others involved with running your company:

● Deployment
● Change Management
● Configuration Management
● Reliability
● Architecture
● Design Feasibility
● Cost Management
● Performance Tuning
● SLAs
● Contract Sanity Check
● Debugging
● Alerting
● Oncall
● Incident Management

Robust Perception can help you with all of this and more.

Page 43

Questions?

Blog: www.robustperception.io/blog

Twitter: @RobustPerceiver

Email: [email protected]

LinkedIn: https://ie.linkedin.com/in/brianbrazil