How Sumo Logic and Anki Build Highly Resilient Services on AWS to Manage Massive Usage Spikes
TRANSCRIPT
© 2014 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
Las Vegas, November 13th, 2014
How Sumo Logic and Anki Build Highly Resilient Services on AWS to Manage Massive Usage Spikes
Ben Whaley, Anki
Christian Beedgen, Sumo Logic
Agenda
Introductions
Chasing Infinity
Obvious Versus Hard
Who Am I
Present:
• Co-Founder & CTO
• Cloud-based machine data analytics service
• Applications, operations, security

Past:
• Chief Architect
• Major SIEM player in the enterprise space
• Log management for security & compliance
Sumo Logic Is The Machine Data Cloud
Applications • Internet of Things • Network • Mobile
Search • Visualize • Predict
Being A Service Is Fundamentally Different
• Enterprise software sticks you with the hard part
  – Scale - this truly is Big Data, baby
  – Infrastructure is painful to obtain and costly to maintain
  – Who pays the watchmen?
• Everyone is an island
  – Your on-prem system is feeling lonely and cold
  – No sharing of discovered insight
  – No vendor can guess the right analytics
What Is Machine Data
• Computer, network, and other equipment logs
• Satellite and similar telemetry (espionage or science)
• Location data, RFID chip readings, GPS system output
• Temperature and other environmental sensor readings
• Sensor readings from factories, pipelines, etc.
• Output from many kinds of medical devices
10 Trillion Per Day - The Scale Of Our Operation
[Chart: GB ingested per day (0–16,000) and searches per day (0–1,200,000), both growing over time]
Who Am I
Present:
• AWS Infrastructure Lead
• Artificial intelligence and robotics
• Anki DRIVE, smart robot race cars with weapons

Past:
• Cloud Architect
• API and Big Data analytics technology
• Leading platform for digital acceleration
Anki Runs On AWS
• Game play analytics
  – Python, SQS, EC2 with Auto Scaling, ElastiCache, RDS, EMR, Redshift
• Custom customer account system
  – Go-based, ELB, EC2, DynamoDB, MySQL
• Twilio-based customer support application
  – Rails, ELB, EC2, MySQL
• Computer vision simulations
  – MATLAB, EC2
• eCommerce via Magento
Chasing Infinity
In 2010, we knew that success would look something like this…
So What Did We Do?
• Automated, Multi-Dimensional Scalability!
• We decided to make the system multi-tenant
• We picked a scalable substrate - AWS
• We automated everything from the start
• We modularized the code from the beginning
• We broke functionality into services
A Multi-Tenant Database? WTF?
• Building the church out for Sunday is wasteful
• A game-changing differentiator in our market
• Customer message and query loads fluctuate wildly
• When stuff hits the fan, logs splatter everywhere
• Islands of fail – not everybody fails at the same time
Just one typical Sumo Logic customer - 8x Variance!
Money flushed down the toilet
Here’s another one – spike at 2.5x steady state…
Or… Sweet, incremental, unfettered growth
Datacenter As An API!!!
AWS - A Scalable Substrate
• We actually saw Werner’s talk at Stanford in 2008
• Opened our eyes to the possibilities
• We are developers, now the data center is an API
• Success story: Sumo in Sydney, Sumo in Dublin
A++++++++++++!! Will Buy Again
A Bonus Success Story
• Sumo Logic in APAC – Sydney
• Sumo Logic in Europe – Dublin
Automate All The Things
• There are no cables, racks, boxes, so…
• Really, are you going to use the AWS Console?
• We are developers, remember?
• So we rolled our own automation in Scala
• Fully model-driven, descriptive vs. imperative
• We operate the service from our own CLI
Scaling In Depth
• Internal service-oriented architecture - microservices
• Many loosely coupled components
• No poking around private parts
• Avro-based protocols for messaging & RPC
• Lookup service, minimal configuration
$ bin/receiver prod.service-registry.sumologic.com
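The command above points a receiver at a service registry endpoint rather than at a local config file. A minimal sketch of what such a lookup service provides, as an in-memory stand-in (all names here are illustrative, and the actual implementation is in Scala, not Python):

```python
# Hypothetical sketch of a service-registry client: a process asks the
# registry for the endpoints of a named service instead of carrying
# that configuration itself. All names are invented for illustration.

class ServiceRegistry:
    """In-memory stand-in for a registry like the one behind
    prod.service-registry.sumologic.com."""

    def __init__(self):
        self._services = {}

    def register(self, name, host, port):
        self._services.setdefault(name, []).append((host, port))

    def lookup(self, name):
        endpoints = self._services.get(name)
        if not endpoints:
            raise KeyError(f"no instances registered for {name!r}")
        return endpoints

registry = ServiceRegistry()
registry.register("receiver", "10.0.1.17", 9090)
registry.register("receiver", "10.0.1.18", 9090)

# A new node needs only the registry address, not a config entry
# per downstream service.
print(registry.lookup("receiver"))
```

The point of the pattern is the "minimal configuration" bullet above: every node bootstraps from one well-known address.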
Why This Really Matters
• Everything you know is wrong
• Under scale, everything you know will eventually be… wrong
• No matter how smart you think you are
• No matter how many conferences you go to
• Every rough order of magnitude of scale is going to MESS YOU UP
• Not everything is going to fail, and not all the time
• You need to divide and conquer so you can easily replace
• You need to be in a position to replace the thing that fails
Anki Architectural Principles
• N+2 redundancy
• Compute is ephemeral
• Infrastructure is code
• Less is more
• Ubiquitous monitoring
• Stateless services
Obvious Versus Hard
Here We Are In 2014
Here We Are, 4 ½ Years Later
• The principles we have described are canon
• Over time, they have served us well
• So, this is pretty easy, no?
What We Should Really Say
• It’s not supposed to be easy, of course
• But even though certain things are obvious…
• Under scale, reality still has its own mind
Microservices
Factoring Is Still Important
• Highly cohesive, loosely coupled is an ideal
• The same ideal OO is striving for
• Here’s a snapshot of the current Sumo factoring
• 2 to the power of 5 services (“32”), 170+ modules
• Don’t even ask about the # of dependencies
• At least 3 of each – everything is a separately scalable cluster
Refactor-able Infrastructure
• The same old thing all over again
• Now that infrastructure is code, keep refactoring
• Split things that don’t belong, join others
• Service abstractions can help keep impact low
• Moving around the code, vs. the protocols
It’s Your Spaghetti
• We are not building shrink-wrap anymore
• The world is moving on, and to scale is to change
• Use today’s knowledge to address today’s problems
• Use tomorrow’s knowledge tomorrow
• Keep degrees of freedom at the fundamental level
• Then exploit control and visibility
Embrace Change Or Die
Utilization
• Cost effectiveness can be a problem
• One cluster per service means 3 or more instances
• Good news - you can size to taste
• Bad news – unlikely to really utilize the resources
• Things are often either CPU bound or I/O bound
Resource Management
• Docker-ize your services
• Homogenize your fleet/herd
• Interleave services based on resource requirements
• Doable by hand with automation
• Or consider Mesos, …
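Interleaving services by resource requirements can be sketched as a small bin-packing pass that pairs CPU-heavy and I/O-heavy services on the same host so neither resource sits idle. The service names and resource budgets below are invented for illustration:

```python
# Sketch of interleaving services on a shared fleet: first-fit packing
# against per-host CPU and I/O budgets. Service names and numbers are
# made up; a real scheduler (e.g. Mesos) would do this continuously.

services = [
    ("indexer",  {"cpu": 0.7, "io": 0.2}),   # CPU bound
    ("receiver", {"cpu": 0.2, "io": 0.7}),   # I/O bound
    ("search",   {"cpu": 0.6, "io": 0.3}),
    ("archiver", {"cpu": 0.1, "io": 0.6}),
]

def interleave(services, hosts=2):
    fleet = [{"cpu": 0.0, "io": 0.0, "services": []} for _ in range(hosts)]
    for name, need in services:
        # First host where both resource budgets still fit.
        for host in fleet:
            if host["cpu"] + need["cpu"] <= 1.0 and host["io"] + need["io"] <= 1.0:
                host["cpu"] += need["cpu"]
                host["io"] += need["io"]
                host["services"].append(name)
                break
    return fleet

for host in interleave(services):
    print(host["services"])
```

Here the CPU-bound indexer ends up co-located with the I/O-bound receiver, which is exactly the utilization win the slide is after.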
Anki: Broad Service Best Practices
• Open to change
• Thoughtful network design
• Composable services
• Loose coupling among services
• Feature flags
It Is Not One System
We Are Recovering Enterprise Developers
• Many different modules…
• Roll up into many different services
• Roll up into one system
• Build system, test system, deploy system
• “Continuously” (We use Jenkins, eh)
It Is Just Not One System
• Finding a cross-cut of services that compile…
• …and can be tested together will take days
• Our solution: service groups as an organizing unit
• A set of services that “belong” together functionally…
• And yet have different runtime characteristics
• Much easier to keep APIs compatible between 5 groups…
• Versus 32 services
Service Groups Scale
• This is really a level-of-granularity optimization
• One system: too heavy – 32 systems: too fleeting
• Build a service group, deploy against baseline, test
• During deployment, deploy by service group
• Balances crosscutting integration tests with turnaround
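A service group is just a named set of services that build, test, and deploy as one unit. A toy sketch of the idea (group and service names are invented, not Sumo Logic's actual factoring):

```python
# Sketch of "service groups" as an organizing unit: deployment and
# integration testing operate per group, so API compatibility only has
# to be maintained across a handful of group boundaries instead of
# 32 individual services. All names below are illustrative.

SERVICE_GROUPS = {
    "ingest":   ["receiver", "message-bus", "metadata"],
    "indexing": ["indexer", "index-store"],
    "search":   ["query-planner", "executor", "search-ui"],
}

def deploy_group(group, deploy_service):
    """Deploy every service in a group as one unit."""
    for service in SERVICE_GROUPS[group]:
        deploy_service(service)

deployed = []
deploy_group("indexing", deployed.append)
print(deployed)  # the indexer and its store always move together
```

The granularity trade-off from the slide lives in how you draw the group boundaries: few enough groups that cross-group APIs stay manageable, small enough groups that a build-test-deploy cycle stays fast.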
There Is No Place Like Production
You Cannot Simulate The Big Datas
• There’s simply no substitute for Production
• This doesn’t mean you shouldn’t have nightly, staging, …
• This doesn’t mean you shouldn’t have integration tests
• This doesn’t mean you shouldn’t test manually
• But there’s just a class of issues you will not find
• You can’t move Production data into testing
• You can’t afford a second Production-size system
So Now What?
• Instrument, instrument, instrument
• Monitor, monitor, monitor
• Alert on symptoms
• Basically, don’t worry about 100% CPU, etc.
• Alert on customer impact
• Message ingestion delayed, search takes too long, …
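The distinction above can be sketched as a rule set keyed on customer-visible symptoms rather than resource metrics. The metric names and thresholds are invented examples:

```python
# Sketch of symptom-based alerting: rules fire on customer-visible
# measurements (ingestion delay, search latency), never on raw
# resource metrics like CPU. Names and thresholds are illustrative.

ALERT_RULES = {
    "ingestion_delay_seconds": 120,   # messages delayed > 2 minutes
    "search_latency_seconds":  30,    # searches taking too long
}

def evaluate(metrics):
    """Return the list of symptom alerts that should fire."""
    return [
        name for name, threshold in ALERT_RULES.items()
        if metrics.get(name, 0) > threshold
    ]

# CPU at 100% alone fires nothing; delayed ingestion does.
print(evaluate({"cpu_percent": 100}))              # []
print(evaluate({"ingestion_delay_seconds": 300}))  # ["ingestion_delay_seconds"]
```

A node pegging its CPU while customers see normal ingestion and search is, by this rule set, not an incident.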
https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/mobilebasic?pli=1&viewopt=127#h.dmn6k1rdb6jf
Don’t Alert If You Don’t Have A Playbook
Stefan Zier, Chief Architect, Sumo Logic
When Things Are Breaking
• You have quick-turnaround continuous deployment…
• Then you can “just push a fix”
• This can work, but sometimes the “fix” is not that easy…
• Or even that quick to develop in the first place
• Then you need to be able to quickly roll back
• See also the “service groups” approach
When Not To Scale
Horizontal Scaling Gone Bad
• The ideal scenario: work stealing
• One queue of tasks, bunch of workers
• Grab from queue, work work work, happy
[Diagram: one queue of tasks (“Task, Message Block, ‘Stuff’”) feeding Nodes 1–4]
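The ideal scenario above, one shared queue with interchangeable workers, can be sketched with stdlib threads; a real deployment would sit on something like SQS rather than an in-process queue:

```python
# Sketch of the "work stealing" ideal: one queue of message blocks,
# a pool of identical workers that each grab the next block when free.
# Uses stdlib threading as a stand-in for a distributed queue.
import queue
import threading

tasks = queue.Queue()
for block in range(8):           # "Task, Message Block, 'Stuff'"
    tasks.put(block)

done = []
done_lock = threading.Lock()

def worker():
    while True:
        try:
            block = tasks.get_nowait()
        except queue.Empty:
            return               # queue drained; this node is idle
        with done_lock:
            done.append(block)   # stand-in for real processing
        tasks.task_done()

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(done))  # all 8 blocks processed by whichever node was free
```

No coordination is needed beyond the queue itself, which is exactly why this shape is so attractive, and why the next slides show where it breaks down.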
Horizontal Scaling Gone Bad
• Scaling out a multi-tenant processing system
• 1000s of customers, 1000s of machines
• Parallelism is good, but locality has to be considered
• 1 customer distributed over 1000 machines is bad
• No single machine getting enough load for that customer
• Batches & shards will become too small
• Metadata and in-memory structures grow out of proportion
Horizontal Scaling Gone Bad
[Diagram series: a 5×5 grid of index nodes. First the grid is empty; then customer 1’s index is spread across all 25 nodes; then customers 1 and 2; finally customers 1–8 each have a shard on every node, so every node holds a small sliver of every customer]
Horizontal Scaling With Partitioning
[Diagram series: the same 5×5 grid of index nodes, but customer 1 is confined to a small block of nodes, customer 2 to another, and customers 3–8 each to their own small, partially overlapping subsets, so each node serves only a few customers]
Partitioning By Customer
• Each cluster elects a leader node via ZooKeeper
• Leader runs the partitioning logic:
  Set[Customer], Set[Instance] → Map[Instance, Set[Customer]]
• Partitioning written to ZooKeeper
• Example: indexer node knows which customer’s message blocks to pull from the message bus
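The slide gives the partitioner's shape: from a set of customers and a set of instances, compute a map from each instance to the customers it owns. A minimal sketch in Python; the hash-based placement and the replication factor are illustrative assumptions, not the actual Sumo Logic algorithm:

```python
# Sketch of the signature from the slide:
#   (Set[Customer], Set[Instance]) -> Map[Instance, Set[Customer]]
# Each customer is pinned to a small subset of instances (locality)
# instead of being spread across the whole fleet.
import hashlib

def partition(customers, instances, replicas=2):
    ordered = sorted(instances)
    assignment = {inst: set() for inst in ordered}
    for customer in sorted(customers):
        # A stable hash picks a starting instance; the customer lands
        # on `replicas` consecutive instances, keeping batches large.
        start = int(hashlib.md5(customer.encode()).hexdigest(), 16) % len(ordered)
        for i in range(replicas):
            assignment[ordered[(start + i) % len(ordered)]].add(customer)
    return assignment

plan = partition({"acme", "globex", "initech"},
                 {"node1", "node2", "node3", "node4"})
for inst in sorted(plan):
    print(inst, sorted(plan[inst]))
```

Pinning each customer to a small, stable subset of instances keeps per-customer batches and shards large while still spreading distinct customers across the fleet; the leader recomputes and publishes the map (via ZooKeeper, per the slide) when customers or instances change.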
Copy & Paste Scaling
So You Keep Adding Customers…
• Your current system is getting achy
• You are seeing the next order of magnitude on the horizon
• You start to understand what you need to rebuild
• Your quarter ends in 21 days…
Sometimes, You Need To Break The Rules
• Copy & Paste Scaling™
• Copy your deployment descriptor files & metadata
• Point them at a different region and pull the trigger
• Instant 2x scaling!
• There are those times…
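Copy & Paste Scaling is literally that: clone the deployment description, change only the region, and stand up a second, independent copy of the whole system. A toy sketch, with descriptor fields invented for illustration:

```python
# Sketch of "Copy & Paste Scaling": duplicate an existing deployment
# descriptor, swap the region, deploy. The descriptor fields below are
# made up; a real one would carry AMIs, cluster sizes, VPC config, etc.
import copy

us_east = {
    "region": "us-east-1",
    "clusters": {"receiver": 3, "indexer": 6, "search": 3},
}

def copy_paste_scale(descriptor, new_region):
    clone = copy.deepcopy(descriptor)
    clone["region"] = new_region
    return clone

us_west = copy_paste_scale(us_east, "us-west-2")
print(us_west["region"], us_west["clusters"])  # same shape, new region
```

Because the two deployments share nothing at runtime, capacity doubles without touching the partitioning or scaling logic of the original system, which is exactly why it works as a rule-breaking stopgap.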
Ben Whaley, @iAmTheWhaley
Christian Beedgen, @raychaser