How we tripled our load in production for testing purposes and survived to tell the story!
(and how we did it with AWS)

Carlos Arguelles, Senior SDET
Dan-Constantin Florescu, SDET
Website Applications Platform

Post on 22-May-2020


Page 1: How we tripled our load in production for testing …romania.amazon.com/techon/presentations/HowWeTripledOur...How we tripled our load in production for testing purposes and survived


Page 2: Agenda

• Why we did this

• What choices we considered

• Why we picked ours

• How we implemented it

• Q&A

Page 3: What I’d like you to get out of this

• Start thinking about load testing (now)

• Large-scale load testing in Production is an option

• Key questions you can use to guide your testing

• What you can do as a service owner to design more testable services (marked TIP)

Page 4: Why we did this

• Load Testing…

– Every team says “It’d be nice to do…”

– But often it doesn’t get enough priority

• Large scale load testing

– Is expensive (hardware, SDE-time)

• Large scale load testing in production

– Is risky (downtime, customer data loss)

Page 5: Why we did this

• During Black Friday 2010, one of our services had a number of issues due to scaling problems

– Customer impact: many services across Amazon were unable to make real-time, data-driven decisions during one of the most important shopping days of the year

Page 6: Why we did this

[Figure: chart annotated with “Black Friday” and “Cyber Monday”]

Page 7: How we designed the Load Test

• We made a list of questions to guide us

– No right/wrong answer, no “Golden Formula”

– Cost/risk/benefit for each decision

– Product specific

Page 8: Open Questions

• What do you really want to do?

• What is the architecture and data flow?

• What are the dependencies?

• Where should we test?

• How long?

• What data? How much?

• How do I assess the impact of the load?

• How to scale the load gen?

• How do I ensure the right pace?

• How do I perform an emergency-stop?

Page 9: What do you really want to do?

• Performance Testing

• Resilience Testing

• Stress Testing

• Load Testing

Figure out what you *really* want to do before you start testing!

Page 10: What do you really want to do?

Highlighted: Load Testing (of Performance, Load, Stress and Resilience Testing)

• Can it handle an expected realistic load?

• How much hardware do I need for my peak?

• Miss SLAs (e.g. latency)? System down? Data loss? Errors? …

• Emphasis on “prod-like” behavior

Page 11: What do you really want to do?

Highlighted: Stress Testing (of Performance, Load, Stress and Resilience Testing)

• When and where will it break?

• Bottlenecks? Contention? Locks?

• Not on “prod-like” behavior

• Emphasis on breaking the system

Page 12: What do you really want to do?

Highlighted: Resilience Testing (of Performance, Load, Stress and Resilience Testing)

• How will it handle underlying failures?

– Data center down

– Slow network

– Dependencies failing

– CPU/memory hogs

– …

Page 13: What do you really want to do?

Highlighted: Performance Testing (of Performance, Load, Stress and Resilience Testing)

• How does latency & throughput change as load increases?

• When will you miss latency SLAs?

Page 14: Open Questions

• What do you really want to do?

• What is the architecture and data flow?

• What are the dependencies?

• Where should we test?

• How long?

• What data? How much?

• How do I assess the impact of the load?

• How to scale the load gen?

• How do I ensure the right pace?

• How do I perform an emergency-stop?

Page 15: Architecture @ 30,000 feet

[Diagram: upstream customers posting logs (Retail fleet), our service, downstream customers consuming processed data]

What does “Load” mean to this architecture?

• Transactions per second?

• Concurrent connections?

• Terabytes per day?

Page 16: Architecture @ 30,000 feet

[Diagram: upstream customers posting logs (Retail fleet), our service]

• 1 file per host per minute

• Tens of thousands of hosts

• File size 200 KB to 2 MB

• 500 MB/sec at peak

• TPS is low and somewhat constant

• Payload could be larger on days with high load

• Will be tens of terabytes on peak day
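
The figures on this slide can be cross-checked with a little arithmetic. The sketch below is only illustrative: the host count (30,000) and the average file size (1 MB) are assumed values chosen from within the ranges on the slide, and decimal units are assumed throughout.

```python
# Back-of-the-envelope ingest estimate (illustrative numbers only).
MB = 1_000_000  # decimal megabyte; an assumption about the slide's units


def ingest_rate_mb_per_sec(hosts, avg_file_bytes, files_per_host_per_min=1):
    """Aggregate ingest rate in MB/sec for a fleet posting log files."""
    bytes_per_min = hosts * files_per_host_per_min * avg_file_bytes
    return bytes_per_min / 60 / MB


def daily_volume_tb(rate_mb_per_sec):
    """Volume in TB if the given rate were sustained for 24 hours."""
    return rate_mb_per_sec * 86_400 / 1_000_000


# Assumed fleet: 30,000 hosts, 1 MB average file, 1 file per host per minute.
rate = ingest_rate_mb_per_sec(hosts=30_000, avg_file_bytes=1 * MB)
print(f"{rate:.0f} MB/sec, {daily_volume_tb(rate):.1f} TB/day if sustained")
```

Under these assumptions the request rate stays flat (one POST per host per minute) while the byte volume follows the file size, which is why payload size rather than TPS is the interesting measure of load here.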

Page 17: Architecture @ 10,000 feet

[Diagram: data input, data processing, data output]

Page 18: Open Questions

• What do you really want to do?

• What is the architecture and data flow?

• What are the dependencies?

• Where should we test?

• How long?

• What data? How much?

• How do I assess the impact of the load?

• How to scale the load gen?

• How do I ensure the right pace?

• How do I perform an emergency-stop?

Page 19: What are my dependencies?

[Diagram: production load (and, in a second panel, added test load) goes through a load balancer to Host 1, Host 2, … Host n, which call their dependencies]

If you’re loading your service, you’re loading your dependencies

Page 20: What are my dependencies?

• You can’t always control them

– Convince them to load-test with you

– You might need to mock them

Page 21: What are my dependencies?

• Mocking your dependencies

– can be expensive to do

• percentile latencies

• error rates, throttling, …

+ but you can model theoretical situations
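
A mock that models percentile latencies, error rates and throttling might look like the sketch below. It is not the mock used in the talk: the class name `MockDependency`, its parameters and the example numbers are all hypothetical, and it returns a simulated latency instead of sleeping.

```python
import random


class MockDependency:
    """Hypothetical stand-in for a downstream service."""

    def __init__(self, latency_ms_by_percentile, error_rate=0.0,
                 throttle_rate=0.0, seed=None):
        # List of (percentile, latency_ms), e.g. [(50, 20), (90, 80), (99.9, 900)].
        self.latencies = sorted(latency_ms_by_percentile)
        self.error_rate = error_rate
        self.throttle_rate = throttle_rate
        self.rng = random.Random(seed)

    def sample_latency_ms(self):
        """Return the latency of the first percentile bucket the draw falls into."""
        draw = self.rng.uniform(0, 100)
        for percentile, latency_ms in self.latencies:
            if draw <= percentile:
                return latency_ms
        return self.latencies[-1][1]  # tail beyond the last bucket

    def call(self):
        """Return (status, simulated_latency_ms) without sleeping or real I/O."""
        draw = self.rng.random()
        if draw < self.throttle_rate:
            status = "throttled"
        elif draw < self.throttle_rate + self.error_rate:
            status = "error"
        else:
            status = "ok"
        return status, self.sample_latency_ms()


# Made-up profile: 2% throttling, 1% errors, three latency buckets.
mock = MockDependency([(50, 20), (90, 80), (99.9, 900)],
                      error_rate=0.01, throttle_rate=0.02, seed=42)
results = [mock.call() for _ in range(10_000)]
ok_share = sum(1 for status, _ in results if status == "ok") / len(results)
print(f"ok share: {ok_share:.3f}")
```

Changing the profile (for example, raising `throttle_rate`) is how such a mock lets you model theoretical situations that a real dependency rarely shows on demand.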

Page 22: Open Questions

• What do you really want to do?

• What is the architecture and data flow?

• What are the dependencies?

• Where should we test?

• How long?

• What data? How much?

• How do I assess the impact of the load?

• How to scale the load gen?

• How do I ensure the right pace?

• How do I perform an emergency-stop?

Page 23: Component testing vs. End-to-end

• Component testing:

+ very targeted

+ localized risk

- component interaction?

- expensive mocking?

TIP: Can your components be load-tested separately?

Page 24: Test Environment vs. Production

• “Shadow” testing:

+ breaking point

- smaller scale

- configuration

[Figure: “Big brother, Production” and “Little brother, Test Environment”]

Page 25: Where should we test?

• But to truly show that our Production fleet was ready…

– test in Production

– while processing regular traffic

– triple the load of a regular day

• High risk, high benefit

Page 26: Open Questions

• What do you really want to do?

• What is the architecture and data flow?

• What are the dependencies?

• Where should we test?

• How long?

• What data? How much?

• How do I assess the impact of the load?

• How to scale the load gen?

• How do I ensure the right pace?

• How do I perform an emergency-stop?

Page 27: How long do we run for?

• A day in our service’s life

We process it in:

– 10-minute jobs

– Hourly jobs

– Daily jobs

Page 28: How long do we run for?

• Should we run it for more than a day?

[Figure: timeline labeled “Regular days before load tests”, “Load Tests”, “Week of Black Friday”, “Black Friday” and “Cyber Monday”]

Page 29: How long do we run for?

• Should we run it for more than a day?

– The service may start failing under load only after a longer period of time (for example, because of resource leaks)

– Infrequent failures are more likely to occur if we run it for a longer period of time

• Cost/risk/benefit

Page 30: Open Questions

• What do you really want to do?

• What is the architecture and data flow?

• What are the dependencies?

• Where should we test?

• How long?

• What data? How much?

• How do I assess the impact of the load?

• How to scale the load gen?

• How do I ensure the right pace?

• How do I perform an emergency-stop?

Page 31: What data?

TIP: Allow a way to send test data to your production service

Have to make sure that the load test data does not pollute Production data!
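
One possible way to keep the two kinds of data apart is to mark test uploads at the source and route them to a separate sink. The talk does not say how this was done; the `is_load_test` flag, `LogUpload` and `OutputRouter` below are hypothetical names used only to illustrate the idea.

```python
from dataclasses import dataclass, field


@dataclass
class LogUpload:
    host: str
    payload: bytes
    is_load_test: bool = False  # hypothetical marker set by the load generator


@dataclass
class OutputRouter:
    """Keeps processed test data apart from real production output."""
    production: list = field(default_factory=list)
    load_test: list = field(default_factory=list)

    def store(self, upload):
        # Both kinds of traffic take the same path; only the sink differs.
        sink_name = "load_test" if upload.is_load_test else "production"
        getattr(self, sink_name).append(upload)
        return sink_name


router = OutputRouter()
router.store(LogUpload("host-1", b"real traffic"))
router.store(LogUpload("loadgen-7", b"replayed traffic", is_load_test=True))
print(len(router.production), len(router.load_test))
```

The design point is that test traffic should exercise the same code as real traffic for as long as possible, and diverge only where its results would otherwise become visible to downstream consumers.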

Page 32: What data?

• Need a process to save production input

– Useful for a lot more than load testing!

• Open questions:

– % of hosts saving data?

– % of transactions sampled?

– Safe to replay? (i.e. idempotency?)
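
The two sampling questions (which hosts save data, and which transactions on those hosts) can be answered with a deterministic hash, so that the same host and transaction always get the same decision. This is a sketch under assumptions, not the sampling scheme from the talk; `should_save` and its parameters are hypothetical.

```python
import hashlib


def _bucket(key):
    """Map a key to a stable value in [0, 100) using a hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return (int.from_bytes(digest[:8], "big") % 10_000) / 100


def should_save(host, transaction_id, host_pct, txn_pct):
    """Save a transaction only if its host is selected and the transaction is sampled."""
    if _bucket("host:" + host) >= host_pct:
        return False
    return _bucket("txn:" + transaction_id) < txn_pct


# Made-up fleet: 200 hosts with 50 transactions each, 10% of hosts, 50% of transactions.
saved = sum(
    should_save(f"host-{h}", f"host-{h}/txn-{t}", host_pct=10, txn_pct=50)
    for h in range(200)
    for t in range(50)
)
print(f"saved {saved} of {200 * 50} transactions")
```

Sampling does not answer the third question: whether a saved request is safe to replay still depends on the service handling it idempotently.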

Page 33: Saving production data

[Diagram: Load Balancer, a single Interceptor, Host 1, Host 2, Host 3, … Host n, Production Data Storage]

Page 34: Saving production data

[Diagram: Load Balancer, Host 1, Host 2, Host 3, … Host n, more than one Interceptor, Production Data Storage]

Page 35: How much data?

[Figure: data pattern, POST payload, with P50, P90 and P99.9 series]

TIP: Monitoring is your friend: what metrics will help you?
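
For readers unfamiliar with the P50/P90/P99.9 notation, the sketch below computes percentiles of payload size with the nearest-rank method. The sample sizes are made up for illustration, and the talk does not say which percentile definition its monitoring used.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Made-up payload sizes in KB, for illustration only.
payload_kb = [200, 250, 300, 320, 400, 450, 600, 800, 1200, 2000]
for pct in (50, 90, 99.9):
    print(f"P{pct}: {percentile(payload_kb, pct)} KB")
```

The high percentiles matter for load testing because a few very large payloads can dominate disk and network use even when the median looks harmless.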

Page 36: How much data?

[Figure: data pattern, number of POSTs, with series labeled Nov 2010 and Aug 2011]

Page 37: Open Questions

• What do you really want to do?

• What is the architecture and data flow?

• What are the dependencies?

• Where should we test?

• How long?

• What data? How much?

• How do I assess the impact of the load?

• How to scale the load gen?

• How do I ensure the right pace?

• How do I perform an emergency-stop?

Page 38: Hardware metrics

[Figure: CPU metrics for a regular day in September, the 1st attempt at running our test (cancelled) and the 2nd attempt at running our test (successful)]

Need to have relevant monitoring in place that gives you an accurate picture of the state of the system (and tells you when it’s at risk)

Page 39: Product-specific metrics

[Figures: Hadoop cluster, % of mappers in use; Hadoop cluster, % of reducers in use]

Page 40: Open Questions

• What do you really want to do?

• What is the architecture and data flow?

• What are the dependencies?

• Where should we test?

• How long?

• What data? How much?

• How do I assess the impact of the load?

• How to scale the load gen?

• How do I ensure the right pace?

• How do I perform an emergency-stop?

Page 41: Architecture of the Load Generator

• Process saves % of production data to a “Test Data Repository” (TDR) with date/time metadata

• TDR can list all files that belong to an interval

• Can replay:

– A percentage of a desired day

– Multiple days at the same time

[Diagram: Production Service, Test Data Repository (S3)]
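
The TDR operations listed on this slide (save with a timestamp, list an interval, replay a percentage of a day) can be sketched with an in-memory stand-in. The real repository lives in S3; `InMemoryTdr`, `replay_plan`, the key format and the date used below are hypothetical and only illustrate the idea.

```python
from datetime import datetime, timedelta


class InMemoryTdr:
    """In-memory stand-in for the Test Data Repository."""

    def __init__(self):
        self._files = []  # list of (timestamp, key)

    def save(self, timestamp, key):
        self._files.append((timestamp, key))

    def list_interval(self, start, end):
        """Keys of all files with start <= timestamp < end, oldest first."""
        return [key for ts, key in sorted(self._files) if start <= ts < end]


def replay_plan(tdr, day_start, percentage):
    """Choose an evenly spread subset of one day's files (percentage in 0..100)."""
    keys = tdr.list_interval(day_start, day_start + timedelta(days=1))
    wanted = round(len(keys) * percentage / 100)
    if wanted == 0:
        return []
    step = len(keys) / wanted
    return [keys[int(i * step)] for i in range(wanted)]


tdr = InMemoryTdr()
day = datetime(2011, 8, 1)  # arbitrary example date
for minute in range(0, 1440, 10):
    tdr.save(day + timedelta(minutes=minute), f"host-1/{minute:04d}.log")
plan = replay_plan(tdr, day, percentage=25)
print(len(plan), plan[:3])
```

Replaying multiple days at the same time would then amount to building one such plan per recorded day and running them side by side, which is one way to get above a single day's load.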

Page 42: Architecture of the Load Generator

[Diagram: Test Data Repository, Controllers, Jobs, Workers, Production Service]

• S3 for data

• SQS for state, resilience

• EC2 for hardware

• CloudWatch, Auto Scaling
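
The controller/job/worker pattern can be sketched on a single machine with Python's standard `queue.Queue` standing in for SQS and threads standing in for EC2 workers. This is an illustration of the pattern under those assumptions, not the talk's implementation; `controller`, `worker` and `fake_send` are hypothetical names.

```python
import queue
import threading


def controller(job_queue, file_keys, batch_size=2):
    """Split the files to replay into jobs and enqueue them."""
    for i in range(0, len(file_keys), batch_size):
        job_queue.put(file_keys[i:i + batch_size])


def worker(job_queue, send, stop_event):
    """Dequeue jobs and send each file; exit once stopped and the queue is empty."""
    while True:
        try:
            job = job_queue.get(timeout=0.05)
        except queue.Empty:
            if stop_event.is_set():
                return
            continue
        for key in job:
            send(key)
        job_queue.task_done()


sent = []
sent_lock = threading.Lock()


def fake_send(key):
    # Stand-in for POSTing the file to the production service.
    with sent_lock:
        sent.append(key)


jobs = queue.Queue()
stop = threading.Event()
workers = [threading.Thread(target=worker, args=(jobs, fake_send, stop))
           for _ in range(3)]
for t in workers:
    t.start()
controller(jobs, [f"file-{n}" for n in range(10)])
jobs.join()  # wait until every job has been processed
stop.set()
for t in workers:
    t.join()
print(sorted(sent))
```

Keeping the work in a queue rather than in the workers is what gives resilience: a worker that dies loses nothing that another worker cannot pick up.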

Page 43: AWS services for load testing

• Reactive auto-scaling

– Auto-scaling based on CPU/Memory/…

– If the SQS queue grows, launch more instances

• Predictive auto-scaling

– The load generator can read ahead in the Test Data Repository, so it knows whether the load will increase or decrease in the next half hour
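
The two scaling signals can be combined in a small decision function. The sketch below is an assumption about how one might do it, not the policy from the talk: the per-instance capacities are made-up numbers, and taking the larger of the two targets is a design choice of this example.

```python
import math


def reactive_target(queue_depth, jobs_per_instance):
    """Reactive: size the fleet to the backlog that is already waiting."""
    return max(1, math.ceil(queue_depth / jobs_per_instance))


def predictive_target(upcoming_files_per_min, files_per_instance_per_min):
    """Predictive: size the fleet for the busiest minute in the look-ahead window."""
    peak = max(upcoming_files_per_min, default=0)
    return max(1, math.ceil(peak / files_per_instance_per_min))


def target_instances(queue_depth, upcoming_files_per_min,
                     jobs_per_instance=100, files_per_instance_per_min=50):
    """Take the larger of the two targets so neither signal is ignored."""
    return max(
        reactive_target(queue_depth, jobs_per_instance),
        predictive_target(upcoming_files_per_min, files_per_instance_per_min),
    )


# The backlog is small now, but the next 30 minutes of recorded traffic ramp up.
look_ahead = [200 + 20 * minute for minute in range(30)]
print(target_instances(queue_depth=150, upcoming_files_per_min=look_ahead))
```

Reading ahead is possible only because the load is a replay of recorded data; a reactive rule alone would start launching instances after the queue had already grown.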

Page 44: Open Questions

• What do you really want to do?

• What is the architecture and data flow?

• What are the dependencies?

• Where should we test?

• How long?

• What data? How much?

• How do I assess the impact of the load?

• How to scale the load gen?

• How do I ensure the right pace?

• How do I perform an emergency-stop?

Page 45: How do I ensure that I’m applying the load at the right pace?

[Figure: expected vs. actual load]

Page 46: How do I ensure that I’m applying the load at the right pace?
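
One way to answer the pacing question is to compare how many files should have been sent by now against how many actually were. The sketch below assumes a replay schedule expressed as offsets in seconds from the start of the test; `pace_report` and the 5% tolerance are hypothetical.

```python
def expected_sent(schedule, elapsed_sec):
    """Number of files whose scheduled offset (seconds from test start) has passed."""
    return sum(1 for offset in schedule if offset <= elapsed_sec)


def pace_report(schedule, elapsed_sec, actually_sent, tolerance=0.05):
    """Compare actual progress against the schedule and classify the drift."""
    expected = expected_sent(schedule, elapsed_sec)
    if expected == 0:
        return {"expected": 0, "actual": actually_sent, "drift": 0.0, "state": "on pace"}
    drift = (actually_sent - expected) / expected
    if drift < -tolerance:
        state = "behind"
    elif drift > tolerance:
        state = "ahead"
    else:
        state = "on pace"
    return {"expected": expected, "actual": actually_sent, "drift": drift, "state": state}


# One file scheduled every second for ten minutes (illustrative schedule).
schedule = list(range(600))
report = pace_report(schedule, elapsed_sec=299, actually_sent=270)
print(report)
```

A generator that is behind is under-testing the service, and one that catches up in a burst applies more load than the recorded day ever did, so both directions of drift are worth tracking.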

Page 47: Open Questions

• What do you really want to do?

• What is the architecture and data flow?

• What are the dependencies?

• Where should we test?

• How long?

• What data? How much?

• How do I assess the impact of the load?

• How to scale the load gen?

• How do I ensure the right pace?

• How do I perform an emergency-stop?

Page 48: Safety Measures: Big Red Button

• Deactivate the Load Generator

– Kill/stop the workers, controller

– Delete the SQS queue: if there’s no queue, there’s no work

TIP: Have redundant safety measures!

Page 49: Safety Measures

• Situation:

– Workers can’t dequeue jobs for a while

– Job queue grows very large

– When workers resume dequeuing…

• Solution:

– Queue size > X → Alarm

– Queue size > Y → Auto shutdown
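
The two-threshold rule can be sketched as a small guard that a monitor would run on every queue-size reading. The thresholds below are made-up values for X and Y, and `LoadGenerator` is a hypothetical stand-in whose `running` flag plays the role of an automated Big Red Button.

```python
from enum import Enum


class Action(Enum):
    OK = "ok"
    ALARM = "alarm"
    SHUTDOWN = "shutdown"


def check_queue(queue_size, alarm_at, shutdown_at):
    """Map the current queue size to a safety action (X = alarm_at, Y = shutdown_at)."""
    if alarm_at >= shutdown_at:
        raise ValueError("alarm threshold must be below shutdown threshold")
    if queue_size > shutdown_at:
        return Action.SHUTDOWN
    if queue_size > alarm_at:
        return Action.ALARM
    return Action.OK


class LoadGenerator:
    """Minimal stand-in: 'running' is the flag workers would check before each job."""

    def __init__(self):
        self.running = True
        self.alarms = 0

    def apply(self, action):
        if action is Action.ALARM:
            self.alarms += 1
        elif action is Action.SHUTDOWN:
            self.running = False  # automated emergency stop


gen = LoadGenerator()
for size in (120, 900, 1500, 6000):
    gen.apply(check_queue(size, alarm_at=1000, shutdown_at=5000))
print(gen.running, gen.alarms)
```

The guard exists because a large backlog is stored load: if workers resume dequeuing all at once, the service would receive the whole backlog in a burst rather than at the recorded pace.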

Page 50: Problems our test prevented

• We were under-scaled by ~15%

– Processing would have fallen behind: missed SLAs

• Leak affecting files > 1 MB

– All hosts in service would have run out of disk space in hours: downtime

• These issues would not have been found without large scale testing in production!

Page 51: What I’d like you to get out of this

• Start thinking about load testing (now)

• Large-scale load testing in Production is an option

• Key questions you can use to guide your testing

• What you can do as a service owner to design more testable services (marked TIP)

Page 52: Q&A