How Sumo Logic and Anki Build Highly Resilient Services on AWS to Manage Massive Usage Spikes
TRANSCRIPT
© 2014 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
Las Vegas, November 13th, 2014
How Sumo Logic and Anki Build Highly Resilient Services on AWS to Manage Massive Usage Spikes
Ben Whaley, Anki
Christian Beedgen, Sumo Logic
Agenda
Introductions
Chasing Infinity
Obvious Versus Hard
Who Am I
Present:
• Co-Founder & CTO
• Cloud-based machine data analytics service
• Applications, operations, security

Past:
• Chief Architect
• Major SIEM player in the enterprise space
• Log management for security & compliance
Sumo Logic Is The Machine Data Cloud
Applications • Internet of Things • Network • Mobile
Search • Visualize • Predict
Being A Service Is Fundamentally Different
• Enterprise software sticks you with the hard part
  – Scale - this truly is Big Data, baby
  – Infrastructure is painful to obtain and costly to maintain
  – Who pays the watchmen?
• Everyone is an island
  – Your on-prem system is feeling lonely and cold
  – No sharing of discovered insight
  – No vendor can guess the right analytics
What Is Machine Data
• Computer, network, and other equipment logs
• Satellite and similar telemetry (espionage or science)
• Location data, RFID chip readings, GPS system output
• Temperature and other environmental sensor readings
• Sensor readings from factories, pipelines, etc.
• Output from many kinds of medical devices
10 Trillion Per Day - The Scale Of Our Operation
[Chart: GB ingested per day (0–16,000) and searches per day (0–1,200,000), both growing over time]
Who Am I
Present:
• AWS Infrastructure Lead
• Artificial intelligence and robotics
• Anki DRIVE, smart robot race cars with weapons

Past:
• Cloud Architect
• API and Big Data analytics technology
• Leading platform for digital acceleration
Anki Runs On AWS
• Game play analytics
  – Python, SQS, EC2 with Auto Scaling, ElastiCache, RDS, EMR, Redshift
• Custom customer account system
  – Go-based, ELB, EC2, DynamoDB, MySQL
• Twilio-based customer support application
  – Rails, ELB, EC2, MySQL
• Computer vision simulations
  – MATLAB, EC2
• eCommerce via Magento
Chasing Infinity
In 2010, we knew that success would look something like this…
So What Did We Do?
• Automated, Multi-Dimensional Scalability!
• We decided to make the system multi-tenant
• We picked a scalable substrate - AWS
• We automated everything from the start
• We modularized the code from the beginning
• We broke functionality into services
A Multi-Tenant Database? WTF?
• Building the church out for Sunday is wasteful
• A game-changing differentiator in our market
• Customer message and query loads fluctuate wildly
• When stuff hits the fan, logs splatter everywhere
• Islands of fail – not everybody fails at the same time
Just one typical Sumo Logic customer - 8x Variance!
Money flushed down the toilet
Here’s another one – spike at 2.5x steady state…
Or… Sweet, incremental, unfettered growth
Datacenter As An API!!!
AWS - A Scalable Substrate
• We actually saw Werner’s talk at Stanford in 2008
• Opened our eyes to the possibilities
• We are developers, now the data center is an API
• Success story: Sumo in Sydney, Sumo in Dublin
A++++++++++++!! Will Buy Again
A Bonus Success Story
• Sumo Logic in APAC – Sydney
• Sumo Logic in Europe – Dublin
Automate All The Things
• There are no cables, racks, boxes, so…
• Really, are you going to use the AWS Console?
• We are developers, remember?
• So we rolled our own automation in Scala
• Fully model-driven, descriptive vs. imperative
• We operate the service from our own CLI
Scaling In Depth
• Internal service-oriented architecture - microservices
• Many loosely coupled components
• No poking around private parts
• Avro-based protocols for messaging & RPC
• Lookup service, minimal configuration
$ bin/receiver prod.service-registry.sumologic.com
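The command above points a receiver at a service registry endpoint rather than at a local config file. A minimal sketch of what such a lookup service provides, as an in-memory stand-in (all names here are illustrative, and the actual implementation is in Scala, not Python):

```python
# Hypothetical sketch of a service-registry client: a process asks the
# registry for the endpoints of a named service instead of carrying
# that configuration itself. All names are invented for illustration.

class ServiceRegistry:
    """In-memory stand-in for a registry like the one behind
    prod.service-registry.sumologic.com."""

    def __init__(self):
        self._services = {}

    def register(self, name, host, port):
        self._services.setdefault(name, []).append((host, port))

    def lookup(self, name):
        endpoints = self._services.get(name)
        if not endpoints:
            raise KeyError(f"no instances registered for {name!r}")
        return endpoints

registry = ServiceRegistry()
registry.register("receiver", "10.0.1.17", 9090)
registry.register("receiver", "10.0.1.18", 9090)

# A new node needs only the registry address, not a config entry
# per downstream service.
print(registry.lookup("receiver"))
```

The point of the pattern is the "minimal configuration" bullet above: every node bootstraps from one well-known address.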
Why This Really Matters
• Everything you know is wrong
• Under scale, everything you know will eventually be… wrong
• No matter how smart you think you are
• No matter how many conferences you go to
• Every rough order of magnitude of scale is going to MESS YOU UP
• Not everything is going to fail, and not all the time
• You need to divide and conquer so you can easily replace
• You need to be in a position to replace the thing that fails
Anki Architectural Principles
• N+2 redundancy
• Compute is ephemeral
• Infrastructure is code
• Less is more
• Ubiquitous monitoring
• Stateless services
Obvious Versus Hard
Here We Are In 2014
Here We Are, 4 ½ Years Later
• The principles we have described are canon
• Over time, they have served us well
• So, this is pretty easy, no?
What We Should Really Say
• It’s not supposed to be easy, of course
• But even though certain things are obvious…
• Under scale, reality still has its own mind
Microservices
Factoring Is Still Important
• Highly cohesive, loosely coupled is an ideal
• The same ideal OO is striving for
• Here’s a snapshot of the current Sumo factoring
• 2 to the power of 5 services (“32”), 170+ modules
• Don’t even ask about the # of dependencies
• At least 3 of each – everything is a separately scalable cluster
Refactor-able Infrastructure
• The same old thing all over again
• Now that infrastructure is code, keep refactoring
• Split things that don’t belong, join others
• Service abstractions can help keep impact low
• Moving around the code, vs. the protocols
It’s Your Spaghetti
• We are not building shrink-wrap anymore
• The world is moving on, and to scale is to change
• Use today’s knowledge to address today’s problems
• Use tomorrow’s knowledge tomorrow
• Keep degrees of freedom at the fundamental level
• Then exploit control and visibility
Embrace Change Or Die
Utilization
• Cost effectiveness can be a problem
• One cluster per service means 3 or more instances
• Good news - you can size to taste
• Bad news – unlikely to really utilize the resources
• Things are often either CPU bound or I/O bound
Resource Management
• Docker-ize your services
• Homogenize your fleet/herd
• Interleave services based on resource requirements
• Doable by hand with automation
• Or consider Mesos, …
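Interleaving services by resource requirements can be sketched as a small bin-packing pass that pairs CPU-heavy and I/O-heavy services on the same host so neither resource sits idle. The service names and resource budgets below are invented for illustration:

```python
# Sketch of interleaving services on a shared fleet: first-fit packing
# against per-host CPU and I/O budgets. Service names and numbers are
# made up; a real scheduler (e.g. Mesos) would do this continuously.

services = [
    ("indexer",  {"cpu": 0.7, "io": 0.2}),   # CPU bound
    ("receiver", {"cpu": 0.2, "io": 0.7}),   # I/O bound
    ("search",   {"cpu": 0.6, "io": 0.3}),
    ("archiver", {"cpu": 0.1, "io": 0.6}),
]

def interleave(services, hosts=2):
    fleet = [{"cpu": 0.0, "io": 0.0, "services": []} for _ in range(hosts)]
    for name, need in services:
        # First host where both resource budgets still fit.
        for host in fleet:
            if host["cpu"] + need["cpu"] <= 1.0 and host["io"] + need["io"] <= 1.0:
                host["cpu"] += need["cpu"]
                host["io"] += need["io"]
                host["services"].append(name)
                break
    return fleet

for host in interleave(services):
    print(host["services"])
```

Here the CPU-bound indexer ends up co-located with the I/O-bound receiver, which is exactly the utilization win the slide is after.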
Anki: Broad Service Best Practices
• Open to change
• Thoughtful network design
• Composable services
• Loose coupling among services
• Feature flags
It Is Not One System
We Are Recovering Enterprise Developers
• Many different modules…
• Roll up into many different services
• Roll up into one system
• Build system, test system, deploy system
• “Continuously” (We use Jenkins, eh)
It Is Just Not One System
• Finding a cross-cut of services that compile…
• …and can be tested together will take days
• Our solution: service groups as an organizing unit
• A set of services that “belong” together functionally…
• And yet have different runtime characteristics
• Much easier to keep APIs compatible between 5 groups…
• Versus 32 services
Service Groups Scale
• This is really a level-of-granularity optimization
• One system: too heavy – 32 systems: too fleeting
• Build a service group, deploy against baseline, test
• During deployment, deploy by service group
• Balances crosscutting integration tests with turnaround
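A service group is just a named set of services that build, test, and deploy as one unit. A toy sketch of the idea (group and service names are invented, not Sumo Logic's actual factoring):

```python
# Sketch of "service groups" as an organizing unit: deployment and
# integration testing operate per group, so API compatibility only has
# to be maintained across a handful of group boundaries instead of
# 32 individual services. All names below are illustrative.

SERVICE_GROUPS = {
    "ingest":   ["receiver", "message-bus", "metadata"],
    "indexing": ["indexer", "index-store"],
    "search":   ["query-planner", "executor", "search-ui"],
}

def deploy_group(group, deploy_service):
    """Deploy every service in a group as one unit."""
    for service in SERVICE_GROUPS[group]:
        deploy_service(service)

deployed = []
deploy_group("indexing", deployed.append)
print(deployed)  # the indexer and its store always move together
```

The granularity trade-off from the slide lives in how you draw the group boundaries: few enough groups that cross-group APIs stay manageable, small enough groups that a build-test-deploy cycle stays fast.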
There Is No Place Like Production
You Cannot Simulate The Big Datas
• There’s simply no substitute for Production
• This doesn’t mean you shouldn’t have nightly, staging, …
• This doesn’t mean you shouldn’t have integration tests
• This doesn’t mean you shouldn’t test manually
• But there’s just a class of issues you will not find
• You can’t move Production data into testing
• You can’t afford a second Production-size system
So Now What?
• Instrument, instrument, instrument
• Monitor, monitor, monitor
• Alert on symptoms
• Basically, don’t worry about 100% CPU, etc.
• Alert on customer impact
• Message ingestion delayed, search takes too long, …
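The distinction above can be sketched as a rule set keyed on customer-visible symptoms rather than resource metrics. The metric names and thresholds are invented examples:

```python
# Sketch of symptom-based alerting: rules fire on customer-visible
# measurements (ingestion delay, search latency), never on raw
# resource metrics like CPU. Names and thresholds are illustrative.

ALERT_RULES = {
    "ingestion_delay_seconds": 120,   # messages delayed > 2 minutes
    "search_latency_seconds":  30,    # searches taking too long
}

def evaluate(metrics):
    """Return the list of symptom alerts that should fire."""
    return [
        name for name, threshold in ALERT_RULES.items()
        if metrics.get(name, 0) > threshold
    ]

# CPU at 100% alone fires nothing; delayed ingestion does.
print(evaluate({"cpu_percent": 100}))              # []
print(evaluate({"ingestion_delay_seconds": 300}))  # ["ingestion_delay_seconds"]
```

A node pegging its CPU while customers see normal ingestion and search is, by this rule set, not an incident.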
https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/mobilebasic?pli=1&viewopt=127#h.dmn6k1rdb6jf
Don’t Alert If You Don’t Have A Playbook
Stefan Zier, Chief Architect, Sumo Logic
When Things Are Breaking
• You have quick-turnaround continuous deployment…
• Then you can “just push a fix”
• This can work, but sometimes the “fix” is not that easy…
• Or even that quick to develop in the first place
• Then you need to be able to quickly roll back
• See also the “service groups” approach
When Not To Scale
Horizontal Scaling Gone Bad
• The ideal scenario: work stealing
• One queue of tasks, bunch of workers
• Grab from queue, work work work, happy
[Diagram: one queue of tasks (“Task, Message Block, ‘Stuff’”) feeding Nodes 1–4]
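The ideal scenario above, one shared queue with interchangeable workers, can be sketched with stdlib threads; a real deployment would sit on something like SQS rather than an in-process queue:

```python
# Sketch of the "work stealing" ideal: one queue of message blocks,
# a pool of identical workers that each grab the next block when free.
# Uses stdlib threading as a stand-in for a distributed queue.
import queue
import threading

tasks = queue.Queue()
for block in range(8):           # "Task, Message Block, 'Stuff'"
    tasks.put(block)

done = []
done_lock = threading.Lock()

def worker():
    while True:
        try:
            block = tasks.get_nowait()
        except queue.Empty:
            return               # queue drained; this node is idle
        with done_lock:
            done.append(block)   # stand-in for real processing
        tasks.task_done()

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(done))  # all 8 blocks processed by whichever node was free
```

No coordination is needed beyond the queue itself, which is exactly why this shape is so attractive, and why the next slides show where it breaks down.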
Horizontal Scaling Gone Bad
• Scaling out a multi-tenant processing system
• 1000s of customers, 1000s of machines
• Parallelism is good, but locality has to be considered
• 1 customer distributed over 1000 machines is bad
• No single machine getting enough load for that customer
• Batches & shards will become too small
• Metadata and in-memory structures grow out of proportion
Horizontal Scaling Gone Bad
[Diagram series: a 5×5 grid of index nodes. First the grid is empty; then customer 1’s index is spread across all 25 nodes; then customers 1 and 2; finally customers 1–8 each have a shard on every node, so every node holds a small sliver of every customer]
Horizontal Scaling With Partitioning
[Diagram series: the same 5×5 grid of index nodes, but customer 1 is confined to a small block of nodes, customer 2 to another, and customers 3–8 each to their own small, partially overlapping subsets, so each node serves only a few customers]
Partitioning By Customer
• Each cluster elects a leader node via ZooKeeper
• Leader runs the partitioning logic:
  Set[Customer], Set[Instance] → Map[Instance, Set[Customer]]
• Partitioning written to ZooKeeper
• Example: indexer node knows which customer’s message blocks to pull from the message bus
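The slide gives the partitioner's shape: from a set of customers and a set of instances, compute a map from each instance to the customers it owns. A minimal sketch in Python; the hash-based placement and the replication factor are illustrative assumptions, not the actual Sumo Logic algorithm:

```python
# Sketch of the signature from the slide:
#   (Set[Customer], Set[Instance]) -> Map[Instance, Set[Customer]]
# Each customer is pinned to a small subset of instances (locality)
# instead of being spread across the whole fleet.
import hashlib

def partition(customers, instances, replicas=2):
    ordered = sorted(instances)
    assignment = {inst: set() for inst in ordered}
    for customer in sorted(customers):
        # A stable hash picks a starting instance; the customer lands
        # on `replicas` consecutive instances, keeping batches large.
        start = int(hashlib.md5(customer.encode()).hexdigest(), 16) % len(ordered)
        for i in range(replicas):
            assignment[ordered[(start + i) % len(ordered)]].add(customer)
    return assignment

plan = partition({"acme", "globex", "initech"},
                 {"node1", "node2", "node3", "node4"})
for inst in sorted(plan):
    print(inst, sorted(plan[inst]))
```

Pinning each customer to a small, stable subset of instances keeps per-customer batches and shards large while still spreading distinct customers across the fleet; the leader recomputes and publishes the map (via ZooKeeper, per the slide) when customers or instances change.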
Copy & Paste Scaling
So You Keep Adding Customers…
• Your current system is getting achy
• You are seeing the next order of magnitude on the horizon
• You start to understand what you need to rebuild
• Your quarter ends in 21 days…
Sometimes, You Need To Break The Rules
• Copy & Paste Scaling™
• Copy your deployment descriptor files & metadata
• Point them at a different region and pull the trigger
• Instant 2x scaling!
• There are those times…
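Copy & Paste Scaling is literally that: clone the deployment description, change only the region, and stand up a second, independent copy of the whole system. A toy sketch, with descriptor fields invented for illustration:

```python
# Sketch of "Copy & Paste Scaling": duplicate an existing deployment
# descriptor, swap the region, deploy. The descriptor fields below are
# made up; a real one would carry AMIs, cluster sizes, VPC config, etc.
import copy

us_east = {
    "region": "us-east-1",
    "clusters": {"receiver": 3, "indexer": 6, "search": 3},
}

def copy_paste_scale(descriptor, new_region):
    clone = copy.deepcopy(descriptor)
    clone["region"] = new_region
    return clone

us_west = copy_paste_scale(us_east, "us-west-2")
print(us_west["region"], us_west["clusters"])  # same shape, new region
```

Because the two deployments share nothing at runtime, capacity doubles without touching the partitioning or scaling logic of the original system, which is exactly why it works as a rule-breaking stopgap.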
Ben Whaley, @iAmTheWhaley
Christian Beedgen, @raychaser