![Page 1: How we tripled our load in production for testing …romania.amazon.com/techon/presentations/HowWeTripledOur...How we tripled our load in production for testing purposes and survived](https://reader034.vdocuments.net/reader034/viewer/2022042223/5ec99f0e81fedd21814d8be8/html5/thumbnails/1.jpg)
How we tripled our load in production for testing purposes and survived to tell the story!
(and how we did it with AWS)
Carlos Arguelles, Senior SDET
Dan-Constantin Florescu, SDET
Website Applications Platform
![Page 2: How we tripled our load in production for testing …romania.amazon.com/techon/presentations/HowWeTripledOur...How we tripled our load in production for testing purposes and survived](https://reader034.vdocuments.net/reader034/viewer/2022042223/5ec99f0e81fedd21814d8be8/html5/thumbnails/2.jpg)
Agenda
• Why we did this
• What choices we considered
• Why we picked ours
• How we implemented it
• Q&A
![Page 3: How we tripled our load in production for testing …romania.amazon.com/techon/presentations/HowWeTripledOur...How we tripled our load in production for testing purposes and survived](https://reader034.vdocuments.net/reader034/viewer/2022042223/5ec99f0e81fedd21814d8be8/html5/thumbnails/3.jpg)
What I’d like you to get out of this
• Start thinking about Load testing (now)
• Large-scale load testing in Production is an option
• Key questions you can use to guide your testing
• What you can do as a service owner to design more testable services (see the TIP callouts)
![Page 4: How we tripled our load in production for testing …romania.amazon.com/techon/presentations/HowWeTripledOur...How we tripled our load in production for testing purposes and survived](https://reader034.vdocuments.net/reader034/viewer/2022042223/5ec99f0e81fedd21814d8be8/html5/thumbnails/4.jpg)
Why we did this
• Load Testing…
– Every team says “It’d be nice to do…”
– But often it doesn’t get enough priority
• Large scale load testing
– Is expensive (hardware, SDE-time)
• Large scale load testing in production
– Is risky (downtime, customer data loss)
![Page 5: How we tripled our load in production for testing …romania.amazon.com/techon/presentations/HowWeTripledOur...How we tripled our load in production for testing purposes and survived](https://reader034.vdocuments.net/reader034/viewer/2022042223/5ec99f0e81fedd21814d8be8/html5/thumbnails/5.jpg)
Why we did this
• During Black Friday 2010, one of our services had a number of issues due to scaling problems
– Customer impact: Many services across Amazon unable to make real-time data-driven decisions during one of the most important shopping days of the year
![Page 6: How we tripled our load in production for testing …romania.amazon.com/techon/presentations/HowWeTripledOur...How we tripled our load in production for testing purposes and survived](https://reader034.vdocuments.net/reader034/viewer/2022042223/5ec99f0e81fedd21814d8be8/html5/thumbnails/6.jpg)
Why we did this
[Chart, labeled: Black Friday, Cyber Monday]
![Page 7: How we tripled our load in production for testing …romania.amazon.com/techon/presentations/HowWeTripledOur...How we tripled our load in production for testing purposes and survived](https://reader034.vdocuments.net/reader034/viewer/2022042223/5ec99f0e81fedd21814d8be8/html5/thumbnails/7.jpg)
How we designed the Load Test
• We made a list of questions to guide us
– No right/wrong answer, no “Golden Formula”
– Cost/risk/benefit for each decision
– Product specific
![Page 8: How we tripled our load in production for testing …romania.amazon.com/techon/presentations/HowWeTripledOur...How we tripled our load in production for testing purposes and survived](https://reader034.vdocuments.net/reader034/viewer/2022042223/5ec99f0e81fedd21814d8be8/html5/thumbnails/8.jpg)
Open Questions
• What do you really want to do?
• What is the architecture and data flow?
• What are the dependencies?
• Where should we test?
• How long?
• What data? How much?
• How do I assess the impact of the load?
• How to scale the load gen?
• How do I ensure the right pace?
• How do I perform an emergency-stop?
![Page 9: How we tripled our load in production for testing …romania.amazon.com/techon/presentations/HowWeTripledOur...How we tripled our load in production for testing purposes and survived](https://reader034.vdocuments.net/reader034/viewer/2022042223/5ec99f0e81fedd21814d8be8/html5/thumbnails/9.jpg)
What do you really want to do?
Performance Testing
Resilience Testing
Stress Testing
Load Testing
Figure out what you *really* want to do before you start testing!
![Page 10: How we tripled our load in production for testing …romania.amazon.com/techon/presentations/HowWeTripledOur...How we tripled our load in production for testing purposes and survived](https://reader034.vdocuments.net/reader034/viewer/2022042223/5ec99f0e81fedd21814d8be8/html5/thumbnails/10.jpg)
What do you really want to do?
Load Testing:
Can it handle an expected realistic load?
How much hardware do I need for my peak?
Miss SLAs? (e.g. latency) System down? Data loss? Errors? …
Emphasis on “prod-like” behavior
![Page 11: How we tripled our load in production for testing …romania.amazon.com/techon/presentations/HowWeTripledOur...How we tripled our load in production for testing purposes and survived](https://reader034.vdocuments.net/reader034/viewer/2022042223/5ec99f0e81fedd21814d8be8/html5/thumbnails/11.jpg)
What do you really want to do?
Stress Testing:
When & Where will it break?
Bottlenecks? Contention? Locks?
Emphasis on breaking the system, not on “prod-like” behavior
![Page 12: How we tripled our load in production for testing …romania.amazon.com/techon/presentations/HowWeTripledOur...How we tripled our load in production for testing purposes and survived](https://reader034.vdocuments.net/reader034/viewer/2022042223/5ec99f0e81fedd21814d8be8/html5/thumbnails/12.jpg)
What do you really want to do?
Resilience Testing:
• How will it handle underlying failures?
– Data center down
– Slow network
– Dependencies failing
– CPU/Memory hogs
– …
![Page 13: How we tripled our load in production for testing …romania.amazon.com/techon/presentations/HowWeTripledOur...How we tripled our load in production for testing purposes and survived](https://reader034.vdocuments.net/reader034/viewer/2022042223/5ec99f0e81fedd21814d8be8/html5/thumbnails/13.jpg)
What do you really want to do?
Performance Testing:
• How does latency & throughput change as load increases?
• When will you miss latency SLAs?
![Page 14: How we tripled our load in production for testing …romania.amazon.com/techon/presentations/HowWeTripledOur...How we tripled our load in production for testing purposes and survived](https://reader034.vdocuments.net/reader034/viewer/2022042223/5ec99f0e81fedd21814d8be8/html5/thumbnails/14.jpg)
Open Questions
• What do you really want to do?
• What is the architecture and data flow?
• What are the dependencies?
• Where should we test?
• How long?
• What data? How much?
• How do I assess the impact of the load?
• How to scale the load gen?
• How do I ensure the right pace?
• How do I perform an emergency-stop?
![Page 15: How we tripled our load in production for testing …romania.amazon.com/techon/presentations/HowWeTripledOur...How we tripled our load in production for testing purposes and survived](https://reader034.vdocuments.net/reader034/viewer/2022042223/5ec99f0e81fedd21814d8be8/html5/thumbnails/15.jpg)
Architecture @ 30,000 feet
Upstream customers posting logs (Retail fleet)
Our service
Downstream customers consuming processed data
What does “Load” mean to this architecture?
• Transactions per second?
• Concurrent connections?
• Terabytes per day?
![Page 16: How we tripled our load in production for testing …romania.amazon.com/techon/presentations/HowWeTripledOur...How we tripled our load in production for testing purposes and survived](https://reader034.vdocuments.net/reader034/viewer/2022042223/5ec99f0e81fedd21814d8be8/html5/thumbnails/16.jpg)
Architecture @ 30,000 feet
Upstream customers posting logs (Retail fleet)
Our service
• 1 file per host per minute
• tens of thousands of hosts
• file size 200KB-2MB
• 500MB/sec at peak
• TPS is low and somewhat constant
• Payload could be larger on days with high load
• Will be tens of terabytes on peak day
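The figures on this slide can be cross-checked with a little arithmetic. The sketch below is only a back-of-the-envelope illustration: the fleet size of 20,000 hosts is an assumed value (the slide says only "tens of thousands"), and decimal units are assumed throughout.

```python
# Back-of-the-envelope check of the slide's numbers.
# Assumptions (not from the talk): decimal units (1 MB = 1e6 bytes) and an
# illustrative fleet of 20,000 hosts; the slide only says "tens of thousands".

SECONDS_PER_DAY = 24 * 60 * 60


def files_per_second(hosts: int) -> float:
    """Each host posts one file per minute."""
    return hosts / 60.0


def throughput_mb_per_sec(hosts: int, file_size_mb: float) -> float:
    """Aggregate inbound volume for a given average file size."""
    return files_per_second(hosts) * file_size_mb


def terabytes_per_day(mb_per_sec: float) -> float:
    """Daily volume if the given rate were sustained for 24 hours."""
    return mb_per_sec * SECONDS_PER_DAY / 1_000_000


if __name__ == "__main__":
    hosts = 20_000  # hypothetical fleet size
    print(f"{files_per_second(hosts):.0f} files/sec")
    print(f"{throughput_mb_per_sec(hosts, 0.2):.0f} MB/sec at 200KB files")
    print(f"{throughput_mb_per_sec(hosts, 2.0):.0f} MB/sec at 2MB files")
    print(f"{terabytes_per_day(500):.1f} TB/day if 500MB/sec were sustained")
```

The file rate depends only on the number of hosts, which is why TPS stays roughly constant, while the byte volume moves with file size. Sustaining the 500MB/sec peak for a whole day would give an upper bound in the tens of terabytes, consistent with the slide.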
![Page 17: How we tripled our load in production for testing …romania.amazon.com/techon/presentations/HowWeTripledOur...How we tripled our load in production for testing purposes and survived](https://reader034.vdocuments.net/reader034/viewer/2022042223/5ec99f0e81fedd21814d8be8/html5/thumbnails/17.jpg)
Architecture @ 10,000 feet
Data input
Data processing
Data output
![Page 18: How we tripled our load in production for testing …romania.amazon.com/techon/presentations/HowWeTripledOur...How we tripled our load in production for testing purposes and survived](https://reader034.vdocuments.net/reader034/viewer/2022042223/5ec99f0e81fedd21814d8be8/html5/thumbnails/18.jpg)
Open Questions
• What do you really want to do?
• What is the architecture and data flow?
• What are the dependencies?
• Where should we test?
• How long?
• What data? How much?
• How do I assess the impact of the load?
• How to scale the load gen?
• How do I ensure the right pace?
• How do I perform an emergency-stop?
![Page 19: How we tripled our load in production for testing …romania.amazon.com/techon/presentations/HowWeTripledOur...How we tripled our load in production for testing purposes and survived](https://reader034.vdocuments.net/reader034/viewer/2022042223/5ec99f0e81fedd21814d8be8/html5/thumbnails/19.jpg)
What are my dependencies?
[Diagram labels: Production Load, Test Load, Load Balancer, Host 1, Host 2, … Host n, Dependencies]
If you’re loading your service, you’re loading your dependencies
![Page 20: How we tripled our load in production for testing …romania.amazon.com/techon/presentations/HowWeTripledOur...How we tripled our load in production for testing purposes and survived](https://reader034.vdocuments.net/reader034/viewer/2022042223/5ec99f0e81fedd21814d8be8/html5/thumbnails/20.jpg)
What are my dependencies?
• You can’t always control them
– convince them to load-test with you
– you might need to mock them
![Page 21: How we tripled our load in production for testing …romania.amazon.com/techon/presentations/HowWeTripledOur...How we tripled our load in production for testing purposes and survived](https://reader034.vdocuments.net/reader034/viewer/2022042223/5ec99f0e81fedd21814d8be8/html5/thumbnails/21.jpg)
What are my dependencies?
• Mocking your dependencies
– can be expensive to do
• percentile latencies
• error rates, throttling, …
+ but you can model theoretical situations
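A mock that reproduces percentile latencies, error rates and throttling could look roughly like the sketch below. Everything here is hypothetical: the class and exception names, the percentile table and the rates are illustrative and are not taken from the talk. The mock returns the simulated latency instead of sleeping, so the caller decides whether to actually wait.

```python
import random


class DependencyThrottled(Exception):
    """Raised when the mock simulates throttling."""


class DependencyError(Exception):
    """Raised when the mock simulates a generic failure."""


class MockDependency:
    """Hypothetical stand-in for a downstream service.

    latency_ms maps a percentile (0-100] to the latency at that percentile,
    e.g. {50: 20, 90: 80, 99.9: 900}. All values are illustrative.
    """

    def __init__(self, latency_ms, error_rate=0.0, throttle_rate=0.0, seed=None):
        self._points = sorted(latency_ms.items())
        self._error_rate = error_rate
        self._throttle_rate = throttle_rate
        self._rng = random.Random(seed)

    def sample_latency_ms(self):
        """Return the latency of the first percentile bucket covering a random draw."""
        draw = self._rng.uniform(0, 100)
        for percentile, latency in self._points:
            if draw <= percentile:
                return latency
        return self._points[-1][1]  # tail beyond the last configured percentile

    def call(self):
        """Return a simulated latency, or raise to simulate throttling or errors."""
        draw = self._rng.random()
        if draw < self._throttle_rate:
            raise DependencyThrottled()
        if draw < self._throttle_rate + self._error_rate:
            raise DependencyError()
        return self.sample_latency_ms()
```

Raising the error or throttle rate above anything seen in production is one way to model the "theoretical situations" mentioned on the slide.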
![Page 22: How we tripled our load in production for testing …romania.amazon.com/techon/presentations/HowWeTripledOur...How we tripled our load in production for testing purposes and survived](https://reader034.vdocuments.net/reader034/viewer/2022042223/5ec99f0e81fedd21814d8be8/html5/thumbnails/22.jpg)
Open Questions
• What do you really want to do?
• What is the architecture and data flow?
• What are the dependencies?
• Where should we test?
• How long?
• What data? How much?
• How do I assess the impact of the load?
• How to scale the load gen?
• How do I ensure the right pace?
• How do I perform an emergency-stop?
![Page 23: How we tripled our load in production for testing …romania.amazon.com/techon/presentations/HowWeTripledOur...How we tripled our load in production for testing purposes and survived](https://reader034.vdocuments.net/reader034/viewer/2022042223/5ec99f0e81fedd21814d8be8/html5/thumbnails/23.jpg)
Component testing vs. End-to-end
• Component testing:
+ very targeted
+ localized risk
- component interaction?
- expensive mocking?
TIP: Can your components be load-tested separately?
![Page 24: How we tripled our load in production for testing …romania.amazon.com/techon/presentations/HowWeTripledOur...How we tripled our load in production for testing purposes and survived](https://reader034.vdocuments.net/reader034/viewer/2022042223/5ec99f0e81fedd21814d8be8/html5/thumbnails/24.jpg)
Test Environment vs. Production
• “Shadow” testing:
+ breaking point
- smaller scale
- configuration
Big brother, Production
Little brother, Test Environment
![Page 25: How we tripled our load in production for testing …romania.amazon.com/techon/presentations/HowWeTripledOur...How we tripled our load in production for testing purposes and survived](https://reader034.vdocuments.net/reader034/viewer/2022042223/5ec99f0e81fedd21814d8be8/html5/thumbnails/25.jpg)
Where should we test?
• But to truly show that our Production fleet was ready…
– test in Production
– while processing regular traffic
– triple the load of a regular day
• High risk, high benefit
![Page 26: How we tripled our load in production for testing …romania.amazon.com/techon/presentations/HowWeTripledOur...How we tripled our load in production for testing purposes and survived](https://reader034.vdocuments.net/reader034/viewer/2022042223/5ec99f0e81fedd21814d8be8/html5/thumbnails/26.jpg)
Open Questions
• What do you really want to do?
• What is the architecture and data flow?
• What are the dependencies?
• Where should we test?
• How long?
• What data? How much?
• How do I assess the impact of the load?
• How to scale the load gen?
• How do I ensure the right pace?
• How do I perform an emergency-stop?
![Page 27: How we tripled our load in production for testing …romania.amazon.com/techon/presentations/HowWeTripledOur...How we tripled our load in production for testing purposes and survived](https://reader034.vdocuments.net/reader034/viewer/2022042223/5ec99f0e81fedd21814d8be8/html5/thumbnails/27.jpg)
How long do we run for?
• A day in our service’s life
We process it:
– 10-Min jobs
– Hourly jobs
– Daily jobs
![Page 28: How we tripled our load in production for testing …romania.amazon.com/techon/presentations/HowWeTripledOur...How we tripled our load in production for testing purposes and survived](https://reader034.vdocuments.net/reader034/viewer/2022042223/5ec99f0e81fedd21814d8be8/html5/thumbnails/28.jpg)
How long do we run for?
• Should we run it for more than a day?
[Chart labels: Regular days before load tests; Load Tests; Week of Black Friday; Black Friday; Cyber Monday]
![Page 29: How we tripled our load in production for testing …romania.amazon.com/techon/presentations/HowWeTripledOur...How we tripled our load in production for testing purposes and survived](https://reader034.vdocuments.net/reader034/viewer/2022042223/5ec99f0e81fedd21814d8be8/html5/thumbnails/29.jpg)
How long do we run for?
• Should we run it for more than a day?
– may start failing under load after a longer period of time (such as resource leaks)
– Infrequent failures are more likely to occur if we run it for a longer period of time
• Cost/risk/benefit
![Page 30: How we tripled our load in production for testing …romania.amazon.com/techon/presentations/HowWeTripledOur...How we tripled our load in production for testing purposes and survived](https://reader034.vdocuments.net/reader034/viewer/2022042223/5ec99f0e81fedd21814d8be8/html5/thumbnails/30.jpg)
Open Questions
• What do you really want to do?
• What is the architecture and data flow?
• What are the dependencies?
• Where should we test?
• How long?
• What data? How much?
• How do I assess the impact of the load?
• How to scale the load gen?
• How do I ensure the right pace?
• How do I perform an emergency-stop?
![Page 31: How we tripled our load in production for testing …romania.amazon.com/techon/presentations/HowWeTripledOur...How we tripled our load in production for testing purposes and survived](https://reader034.vdocuments.net/reader034/viewer/2022042223/5ec99f0e81fedd21814d8be8/html5/thumbnails/31.jpg)
What data?
TIP: Allow a way to send test data to your production service
Have to make sure that the Load test data does not pollute Production data!
![Page 32: How we tripled our load in production for testing …romania.amazon.com/techon/presentations/HowWeTripledOur...How we tripled our load in production for testing purposes and survived](https://reader034.vdocuments.net/reader034/viewer/2022042223/5ec99f0e81fedd21814d8be8/html5/thumbnails/32.jpg)
What data?
• Need process to save production input
– Useful for a lot more than load testing!
• Open questions:
– % of hosts saving data?
– % of transactions sampled?
– Safe to replay? (i.e. idempotency?)
![Page 33: How we tripled our load in production for testing …romania.amazon.com/techon/presentations/HowWeTripledOur...How we tripled our load in production for testing purposes and survived](https://reader034.vdocuments.net/reader034/viewer/2022042223/5ec99f0e81fedd21814d8be8/html5/thumbnails/33.jpg)
Saving production data
[Diagram labels: Load Balancer, Interceptor, Host 1, Host 2, Host 3, … Host n, Production Data Storage]
![Page 34: How we tripled our load in production for testing …romania.amazon.com/techon/presentations/HowWeTripledOur...How we tripled our load in production for testing purposes and survived](https://reader034.vdocuments.net/reader034/viewer/2022042223/5ec99f0e81fedd21814d8be8/html5/thumbnails/34.jpg)
Saving production data
[Diagram labels: Load Balancer, Host 1, Host 2, Host 3, … Host n, Interceptor (two), Production Data Storage]
![Page 35: How we tripled our load in production for testing …romania.amazon.com/techon/presentations/HowWeTripledOur...How we tripled our load in production for testing purposes and survived](https://reader034.vdocuments.net/reader034/viewer/2022042223/5ec99f0e81fedd21814d8be8/html5/thumbnails/35.jpg)
How much data?
Data pattern: POST payload
[Chart series: P50, P90, P99.9]
TIP: Monitoring is your friend: what metrics will help you?
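As a reminder of what the P50, P90 and P99.9 series on this chart mean, here is a minimal sketch of a percentile over observed payload sizes. It uses the nearest-rank definition, which is an assumption; monitoring systems often interpolate or approximate instead.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: the smallest observed value such that
    at least p percent of the observations are less than or equal to it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]
```

With payload sizes collected per minute, `percentile(sizes, 50)` describes the typical file and `percentile(sizes, 99.9)` the rare large ones, which matter when a problem only shows up above a certain file size.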
![Page 36: How we tripled our load in production for testing …romania.amazon.com/techon/presentations/HowWeTripledOur...How we tripled our load in production for testing purposes and survived](https://reader034.vdocuments.net/reader034/viewer/2022042223/5ec99f0e81fedd21814d8be8/html5/thumbnails/36.jpg)
How much data?
Data pattern: Number of POSTs
[Chart comparing Nov 2010 and Aug 2011]
![Page 37: How we tripled our load in production for testing …romania.amazon.com/techon/presentations/HowWeTripledOur...How we tripled our load in production for testing purposes and survived](https://reader034.vdocuments.net/reader034/viewer/2022042223/5ec99f0e81fedd21814d8be8/html5/thumbnails/37.jpg)
Open Questions
• What do you really want to do?
• What is the architecture and data flow?
• What are the dependencies?
• Where should we test?
• How long?
• What data? How much?
• How do I assess the impact of the load?
• How to scale the load gen?
• How do I ensure the right pace?
• How do I perform an emergency-stop?
![Page 38: How we tripled our load in production for testing …romania.amazon.com/techon/presentations/HowWeTripledOur...How we tripled our load in production for testing purposes and survived](https://reader034.vdocuments.net/reader034/viewer/2022042223/5ec99f0e81fedd21814d8be8/html5/thumbnails/38.jpg)
Hardware metrics
CPU Metrics
[Chart series: Regular day in September; 1st attempt at running our test (cancelled); 2nd attempt at running our test (successful)]
Need to have relevant monitoring in place that will give you an accurate picture of the state of the system (and tell you when it’s at risk)
![Page 39: How we tripled our load in production for testing …romania.amazon.com/techon/presentations/HowWeTripledOur...How we tripled our load in production for testing purposes and survived](https://reader034.vdocuments.net/reader034/viewer/2022042223/5ec99f0e81fedd21814d8be8/html5/thumbnails/39.jpg)
Product-specific metrics
[Charts: Hadoop Cluster: % of mappers in use; Hadoop Cluster: % of reducers in use]
![Page 40: How we tripled our load in production for testing …romania.amazon.com/techon/presentations/HowWeTripledOur...How we tripled our load in production for testing purposes and survived](https://reader034.vdocuments.net/reader034/viewer/2022042223/5ec99f0e81fedd21814d8be8/html5/thumbnails/40.jpg)
Open Questions
• What do you really want to do?
• What is the architecture and data flow?
• What are the dependencies?
• Where should we test?
• How long?
• What data? How much?
• How do I assess the impact of the load?
• How to scale the load gen?
• How do I ensure the right pace?
• How do I perform an emergency-stop?
![Page 41: How we tripled our load in production for testing …romania.amazon.com/techon/presentations/HowWeTripledOur...How we tripled our load in production for testing purposes and survived](https://reader034.vdocuments.net/reader034/viewer/2022042223/5ec99f0e81fedd21814d8be8/html5/thumbnails/41.jpg)
Architecture of the Load Generator
• Process saves % of production data to a “Test Data Repository” (TDR) with date/time metadata
• TDR can list all files that belong to an interval
• Can replay:
– A percentage of a desired day
– Multiple days at the same time
[Diagram labels: Production Service, Test Data Repository (S3)]
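The two TDR capabilities described here, listing files for an interval and replaying a percentage of a day, can be sketched with an in-memory stand-in. The class and function names are hypothetical, and the real repository in the talk is backed by S3, which this sketch does not touch.

```python
from bisect import bisect_left
from datetime import datetime


class InMemoryTDR:
    """In-memory stand-in for the Test Data Repository (illustrative only)."""

    def __init__(self):
        self._entries = []  # kept sorted as (timestamp, file_key)

    def save(self, timestamp: datetime, file_key: str) -> None:
        """Store a file key together with its capture time."""
        entry = (timestamp, file_key)
        self._entries.insert(bisect_left(self._entries, entry), entry)

    def list_files(self, start: datetime, end: datetime):
        """Return file keys captured in the half-open interval [start, end)."""
        lo = bisect_left(self._entries, (start, ""))
        hi = bisect_left(self._entries, (end, ""))
        return [key for _, key in self._entries[lo:hi]]


def select_for_replay(file_keys, percent):
    """Keep roughly `percent` of the files, spread evenly over the interval."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    selected = []
    accumulated = 0
    for key in file_keys:
        accumulated += percent
        if accumulated >= 100:
            accumulated -= 100
            selected.append(key)
    return selected
```

Replaying multiple days at the same time would then amount to listing each day separately and feeding all of the results to the load generator together; that merging step is left out of the sketch.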
![Page 42: How we tripled our load in production for testing …romania.amazon.com/techon/presentations/HowWeTripledOur...How we tripled our load in production for testing purposes and survived](https://reader034.vdocuments.net/reader034/viewer/2022042223/5ec99f0e81fedd21814d8be8/html5/thumbnails/42.jpg)
Architecture of the Load Generator
[Diagram labels: Test Data Repository, Controller, Job, Worker, Production Service]
• S3 for data
• SQS for state, resilience
• EC2 for hardware
• CloudWatch, AutoScaling
![Page 43: How we tripled our load in production for testing …romania.amazon.com/techon/presentations/HowWeTripledOur...How we tripled our load in production for testing purposes and survived](https://reader034.vdocuments.net/reader034/viewer/2022042223/5ec99f0e81fedd21814d8be8/html5/thumbnails/43.jpg)
AWS services for load testing
• Reactive auto-scaling
– Auto-scaling based on CPU/Memory/…
– If the SQS queue grows launch more instances
• Predictive auto-scaling
– Can read ahead in the TestDataRepository and knows if the load will be increasing or decreasing in half an hour
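The difference between the two scaling modes can be shown with two small sizing functions. This is a sketch under assumed inputs: the per-worker capacities, the limits and the function names are hypothetical, and no AWS API is called.

```python
import math


def reactive_worker_count(queue_depth: int, jobs_per_worker: int,
                          min_workers: int = 1, max_workers: int = 100) -> int:
    """Scale on what has already happened: the backlog in the job queue."""
    needed = math.ceil(queue_depth / jobs_per_worker)
    return max(min_workers, min(max_workers, needed))


def predictive_worker_count(upcoming_bytes: int, window_seconds: int,
                            bytes_per_sec_per_worker: int,
                            min_workers: int = 1, max_workers: int = 100) -> int:
    """Scale on what is about to happen: bytes scheduled in the look-ahead window."""
    required_rate = upcoming_bytes / window_seconds
    needed = math.ceil(required_rate / bytes_per_sec_per_worker)
    return max(min_workers, min(max_workers, needed))
```

The predictive variant is only possible because the load is a replay: the generator can sum the sizes of the files scheduled for the next half hour before any of them are sent.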
![Page 44: How we tripled our load in production for testing …romania.amazon.com/techon/presentations/HowWeTripledOur...How we tripled our load in production for testing purposes and survived](https://reader034.vdocuments.net/reader034/viewer/2022042223/5ec99f0e81fedd21814d8be8/html5/thumbnails/44.jpg)
Open Questions
• What do you really want to do?
• What is the architecture and data flow?
• What are the dependencies?
• Where should we test?
• How long?
• What data? How much?
• How do I assess the impact of the load?
• How to scale the load gen?
• How do I ensure the right pace?
• How do I perform an emergency-stop?
![Page 45: How we tripled our load in production for testing …romania.amazon.com/techon/presentations/HowWeTripledOur...How we tripled our load in production for testing purposes and survived](https://reader034.vdocuments.net/reader034/viewer/2022042223/5ec99f0e81fedd21814d8be8/html5/thumbnails/45.jpg)
How do I ensure that I’m applying the load at the right pace?
[Chart: expected vs. actual]
![Page 46: How we tripled our load in production for testing …romania.amazon.com/techon/presentations/HowWeTripledOur...How we tripled our load in production for testing purposes and survived](https://reader034.vdocuments.net/reader034/viewer/2022042223/5ec99f0e81fedd21814d8be8/html5/thumbnails/46.jpg)
How do I ensure that I’m applying the load at the right pace?
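One way to compare expected and actual pace is to check, at any moment, how many files the replay schedule says should have been sent against how many really were. The sketch below assumes a schedule expressed as sorted offsets in seconds from the start of the run; the function names and the 5% tolerance are hypothetical.

```python
from bisect import bisect_right


def expected_sent(schedule, elapsed_seconds):
    """Number of files whose scheduled offset is at or before the elapsed time.

    `schedule` must be sorted in ascending order.
    """
    return bisect_right(schedule, elapsed_seconds)


def pace_report(schedule, sent_count, elapsed_seconds, tolerance=0.05):
    """Return (expected, lag, on_pace) for the current point in the run.

    A positive lag means the generator is behind schedule; a negative lag
    means it is sending faster than the recorded traffic did.
    """
    expected = expected_sent(schedule, elapsed_seconds)
    lag = expected - sent_count
    on_pace = abs(lag) <= tolerance * max(expected, 1)
    return expected, lag, on_pace
```

Plotting `expected` and `sent_count` over time gives the kind of expected-versus-actual comparison this part of the talk refers to.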
![Page 47: How we tripled our load in production for testing …romania.amazon.com/techon/presentations/HowWeTripledOur...How we tripled our load in production for testing purposes and survived](https://reader034.vdocuments.net/reader034/viewer/2022042223/5ec99f0e81fedd21814d8be8/html5/thumbnails/47.jpg)
Open Questions
• What do you really want to do?
• What is the architecture and data flow?
• What are the dependencies?
• Where should we test?
• How long?
• What data? How much?
• How do I assess the impact of the load?
• How to scale the load gen?
• How do I ensure the right pace?
• How do I perform an emergency-stop?
![Page 48: How we tripled our load in production for testing …romania.amazon.com/techon/presentations/HowWeTripledOur...How we tripled our load in production for testing purposes and survived](https://reader034.vdocuments.net/reader034/viewer/2022042223/5ec99f0e81fedd21814d8be8/html5/thumbnails/48.jpg)
Safety Measures: Big Red Button
• Deactivate the Load Generator
– Kill/stop the workers, controller
– Delete SQS queue
• if there’s no queue there’s no work
TIP: Have redundant safety measures!
![Page 49: How we tripled our load in production for testing …romania.amazon.com/techon/presentations/HowWeTripledOur...How we tripled our load in production for testing purposes and survived](https://reader034.vdocuments.net/reader034/viewer/2022042223/5ec99f0e81fedd21814d8be8/html5/thumbnails/49.jpg)
Safety Measures
• Situation:
– Workers can’t dequeue jobs for a while
– Job queue grows very large
– When workers resume dequeuing…
• Solution:
– Queue size > X → Alarm
– Queue size > Y → Auto shutdown
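The two-threshold rule above can be sketched as a small decision function. X and Y appear here as `alarm_threshold` and `shutdown_threshold`; those names and the `Action` enum are hypothetical, and how the queue size is read or how the shutdown is carried out is outside the sketch.

```python
from enum import Enum


class Action(Enum):
    CONTINUE = "continue"
    ALARM = "alarm"
    SHUTDOWN = "shutdown"


def check_queue(queue_size: int, alarm_threshold: int,
                shutdown_threshold: int) -> Action:
    """Map the current queue size to the safety action to take."""
    if alarm_threshold > shutdown_threshold:
        raise ValueError("alarm threshold must not exceed shutdown threshold")
    if queue_size > shutdown_threshold:
        return Action.SHUTDOWN
    if queue_size > alarm_threshold:
        return Action.ALARM
    return Action.CONTINUE
```

Checking the shutdown threshold first means a queue that jumps straight past both limits still stops the run, rather than only raising an alarm.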
![Page 50: How we tripled our load in production for testing …romania.amazon.com/techon/presentations/HowWeTripledOur...How we tripled our load in production for testing purposes and survived](https://reader034.vdocuments.net/reader034/viewer/2022042223/5ec99f0e81fedd21814d8be8/html5/thumbnails/50.jpg)
Problems our test prevented
• We were under-scaled by ~15%
– Processing would have fallen behind: missed SLAs
• Leak affecting files > 1 MB
– All hosts in service would have run out of disk space in hours: downtime
• These issues would not have been found without large scale testing in production!
![Page 51: How we tripled our load in production for testing …romania.amazon.com/techon/presentations/HowWeTripledOur...How we tripled our load in production for testing purposes and survived](https://reader034.vdocuments.net/reader034/viewer/2022042223/5ec99f0e81fedd21814d8be8/html5/thumbnails/51.jpg)
What I’d like you to get out of this
• Start thinking about Load testing (now)
• Large-scale load testing in Production is an option
• Key questions you can use to guide your testing
• What you can do as a service owner to design more testable services (see the TIP callouts)
![Page 52: How we tripled our load in production for testing …romania.amazon.com/techon/presentations/HowWeTripledOur...How we tripled our load in production for testing purposes and survived](https://reader034.vdocuments.net/reader034/viewer/2022042223/5ec99f0e81fedd21814d8be8/html5/thumbnails/52.jpg)
Q&A