How Sumo Logic And Anki Build Highly Resilient Services On AWS To Manage Massive Usage Spikes

© 2014 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc. Las Vegas, November 13th, 2014. How Sumo Logic And Anki Build Highly Resilient Services On AWS To Manage Massive Usage Spikes. Ben Whaley, Anki. Christian Beedgen, Sumo Logic.

Upload: christian-beedgen

Post on 15-Jan-2017


TRANSCRIPT

Page 1: How Sumo Logic And Anki Build Highly Resilient Services On AWS To Manage Massive Usage Spikes

© 2014 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.

Las Vegas, November 13th, 2014

How Sumo Logic And Anki Build Highly Resilient Services On AWS To Manage Massive Usage Spikes

Ben Whaley, Anki

Christian Beedgen, Sumo Logic

Page 2: How Sumo Logic And Anki Build Highly Resilient Services On AWS To Manage Massive Usage Spikes

Agenda

Introductions

Chasing Infinity

Obvious Versus Hard

Page 3: How Sumo Logic And Anki Build Highly Resilient Services On AWS To Manage Massive Usage Spikes

Who Am I

Present: Co-Founder & CTO
• Cloud-based Machine Data Analytics Service
• Applications, Operations, Security

Past: Chief Architect
• Major SIEM player in the enterprise space
• Log Management for security & compliance

Page 4: How Sumo Logic And Anki Build Highly Resilient Services On AWS To Manage Massive Usage Spikes

Sumo Logic Is The Machine Data Cloud

Applications

Internet of Things

Network

Mobile

Search

Visualize

Predict

Page 5: How Sumo Logic And Anki Build Highly Resilient Services On AWS To Manage Massive Usage Spikes

Being A Service Is Fundamentally Different

• Enterprise software sticks you with the hard part
  – Scale - this truly is Big Data, baby
  – Infrastructure is painful to obtain and costly to maintain
  – Who pays the watchmen?

• Everyone is an island
  – Your on-prem system is feeling lonely and cold
  – No sharing of discovered insight
  – No vendor can guess the right analytics

Page 6: How Sumo Logic And Anki Build Highly Resilient Services On AWS To Manage Massive Usage Spikes

What Is Machine Data

• Computer, network, and other equipment logs
• Satellite and similar telemetry (espionage or science)
• Location data, RFID chip readings, GPS system output
• Temperature and other environmental sensor readings
• Sensor readings from factories, pipelines, etc.
• Output from many kinds of medical devices

Page 7: How Sumo Logic And Anki Build Highly Resilient Services On AWS To Manage Massive Usage Spikes

10 Trillion Per Day - The Scale Of Our Operation

[Chart with two vertical axes (0 to 16,000 and 0 to 1,200,000); series: GB per Day, Searches]

Page 8: How Sumo Logic And Anki Build Highly Resilient Services On AWS To Manage Massive Usage Spikes

Who Am I

Present: AWS Infrastructure Lead
• Artificial intelligence and robotics
• Anki DRIVE, smart robot race cars with weapons

Past: Cloud Architect
• API and Big Data analytics technology
• Leading platform for digital acceleration

Page 9: How Sumo Logic And Anki Build Highly Resilient Services On AWS To Manage Massive Usage Spikes
Page 10: How Sumo Logic And Anki Build Highly Resilient Services On AWS To Manage Massive Usage Spikes

Anki Runs On AWS

• Game play analytics
  – Python, SQS, EC2 with Auto Scaling, ElastiCache, RDS, EMR, Redshift
• Custom customer account system
  – Go-based, ELB, EC2, DynamoDB, MySQL
• Twilio-based customer support application
  – Rails, ELB, EC2, MySQL
• Computer vision simulations
  – MATLAB, EC2
• eCommerce via Magento

Page 11: How Sumo Logic And Anki Build Highly Resilient Services On AWS To Manage Massive Usage Spikes

Chasing Infinity

Page 12: How Sumo Logic And Anki Build Highly Resilient Services On AWS To Manage Massive Usage Spikes

In 2010, we knew that success would look something like this…

Page 13: How Sumo Logic And Anki Build Highly Resilient Services On AWS To Manage Massive Usage Spikes

In 2010, we knew that success would look something like this…

Page 14: How Sumo Logic And Anki Build Highly Resilient Services On AWS To Manage Massive Usage Spikes

So What Did We Do?

• Automated, Multi-Dimensional Scalability!

• We decided to make the system multi-tenant
• We picked a scalable substrate - AWS
• We automated everything from the start
• We modularized the code from the beginning
• We broke functionality into services

Page 15: How Sumo Logic And Anki Build Highly Resilient Services On AWS To Manage Massive Usage Spikes

A Multi-Tenant Database? WTF?

• Building the church out for Sunday is
• A game-changing differentiator in our market
• Customer message and query loads fluctuate wildly
• When stuff hits the fan, logs splatter everywhere
• Islands of fail – not everybody fails at the same time

Page 16: How Sumo Logic And Anki Build Highly Resilient Services On AWS To Manage Massive Usage Spikes

Just one typical Sumo Logic customer - 8x Variance!

Page 17: How Sumo Logic And Anki Build Highly Resilient Services On AWS To Manage Massive Usage Spikes

Just one typical Sumo Logic customer - 8x Variance!

Money flushed down the toilet

Page 18: How Sumo Logic And Anki Build Highly Resilient Services On AWS To Manage Massive Usage Spikes

Here’s another one – spike at 2.5x of steady…

Page 19: How Sumo Logic And Anki Build Highly Resilient Services On AWS To Manage Massive Usage Spikes

Or… Sweet, incremental, unfettered growth

Page 20: How Sumo Logic And Anki Build Highly Resilient Services On AWS To Manage Massive Usage Spikes

AWS - A Scalable Substrate

• We actually saw Werner’s talk at Stanford in 2008

Page 21: How Sumo Logic And Anki Build Highly Resilient Services On AWS To Manage Massive Usage Spikes
Page 22: How Sumo Logic And Anki Build Highly Resilient Services On AWS To Manage Massive Usage Spikes
Page 23: How Sumo Logic And Anki Build Highly Resilient Services On AWS To Manage Massive Usage Spikes

AWS - A Scalable Substrate

• We actually saw Werner’s talk at Stanford in 2008
• Opened our eyes to the possibilities

Page 24: How Sumo Logic And Anki Build Highly Resilient Services On AWS To Manage Massive Usage Spikes

Datacenter As An API!!!

Page 25: How Sumo Logic And Anki Build Highly Resilient Services On AWS To Manage Massive Usage Spikes

AWS - A Scalable Substrate

• We actually saw Werner’s talk at Stanford in 2008
• Opened our eyes to the possibilities
• We are developers, now the data center is an API
• Success story: Sumo in Sydney, Sumo in Dublin

A++++++++++++!! Will Buy Again

Page 26: How Sumo Logic And Anki Build Highly Resilient Services On AWS To Manage Massive Usage Spikes

A Bonus Success Story

• Sumo Logic in APAC – Sydney
• Sumo Logic in Europe – Dublin

Page 27: How Sumo Logic And Anki Build Highly Resilient Services On AWS To Manage Massive Usage Spikes

Automate All The Things

• There are no cables, racks, boxes, so…
• Really, are you going to use the AWS Console?
• We are developers, remember?
• So we rolled our own automation in Scala
• Fully model-driven, descriptive vs. imperative

• We operate the service from our own CLI
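
A minimal sketch of what "model-driven, descriptive vs. imperative" can mean in practice: the operator describes the desired state, and a tool works out the actions. This is not Sumo Logic's Scala tooling; `ClusterSpec`, `plan`, the service names, and the instance types are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterSpec:
    """Desired state for one service cluster (hypothetical model)."""
    service: str
    instance_type: str
    count: int


def plan(desired: list[ClusterSpec], running: dict[str, int]) -> list[str]:
    """Compare the model with what is running and list the actions needed."""
    actions = []
    for spec in desired:
        have = running.get(spec.service, 0)
        if have < spec.count:
            actions.append(
                f"launch {spec.count - have} x {spec.instance_type} for {spec.service}"
            )
        elif have > spec.count:
            actions.append(f"terminate {have - spec.count} instances of {spec.service}")
    # Anything running that the model does not mention should go away.
    wanted = {spec.service for spec in desired}
    for service in sorted(set(running) - wanted):
        actions.append(f"terminate {running[service]} instances of {service}")
    return actions


# Illustrative model and observed state (made-up numbers).
model = [ClusterSpec("receiver", "m3.large", 3), ClusterSpec("indexer", "m3.xlarge", 5)]
actions = plan(model, {"receiver": 3, "indexer": 2, "legacy": 1})
```

The operator edits only `model`; the imperative steps are derived, which is what makes the same description reusable across deployments.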

Page 28: How Sumo Logic And Anki Build Highly Resilient Services On AWS To Manage Massive Usage Spikes

Scaling In Depth

• Internal service-oriented architecture - Microservices
• Many loosely coupled components
• No poking around private parts
• Avro-based protocols for messaging & RPC
• Lookup service, minimal configuration

$ bin/receiver prod.service-registry.sumologic.com
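
The command above passes a single argument, the registry address. A rough sketch of how a lookup service keeps configuration that small is shown below, using an in-memory stand-in. `ServiceRegistry`, `start_receiver`, the "message-bus" dependency, and the endpoints are assumptions for illustration, not Sumo Logic's actual registry API.

```python
class ServiceRegistry:
    """In-memory stand-in for a lookup service (hypothetical API)."""

    def __init__(self):
        self._endpoints = {}

    def register(self, service, endpoint):
        """Announce that `service` is reachable at `endpoint`."""
        self._endpoints.setdefault(service, []).append(endpoint)

    def lookup(self, service):
        """Return all known endpoints for `service`, or fail loudly."""
        endpoints = self._endpoints.get(service)
        if not endpoints:
            raise LookupError(f"no endpoints registered for {service!r}")
        return list(endpoints)


def start_receiver(registry):
    """The only configuration the service needs is where the registry lives."""
    registry.register("receiver", "10.0.0.5:9000")
    # Dependencies are discovered at startup rather than configured by hand.
    return {"message-bus": registry.lookup("message-bus")}


registry = ServiceRegistry()
registry.register("message-bus", "10.0.1.7:9092")
deps = start_receiver(registry)
```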

Page 29: How Sumo Logic And Anki Build Highly Resilient Services On AWS To Manage Massive Usage Spikes

Why This Really Matters

Page 30: How Sumo Logic And Anki Build Highly Resilient Services On AWS To Manage Massive Usage Spikes
Page 31: How Sumo Logic And Anki Build Highly Resilient Services On AWS To Manage Massive Usage Spikes

Why This Really Matters

• Everything you know is wrong
• Under scale, everything you know will eventually be… wrong
• No matter how smart you think you are
• No matter how many conferences you go to
• Every rough order of magnitude of scale is going to MESS YOU UP
• Not everything is going to fail, and not all the time
• You need to divide and conquer so you can easily replace
• You need to be in a position to replace the thing that fails

Page 32: How Sumo Logic And Anki Build Highly Resilient Services On AWS To Manage Massive Usage Spikes

Anki Architectural Principles

• N+2 redundancy
• Compute is ephemeral
• Infrastructure is code
• Less is more
• Ubiquitous monitoring
• Stateless services
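
As a small worked example of the first principle, N+2 is commonly read as "the N instances needed to carry peak load, plus two spares". The sketch below assumes that reading; the function name and the load figures are made up for illustration.

```python
import math


def instances_needed(peak_load: float, capacity_per_instance: float, spare: int = 2) -> int:
    """Instances required to carry peak load (N), plus `spare` extra instances."""
    base = math.ceil(peak_load / capacity_per_instance)
    return base + spare


# 900 requests/s at 250 requests/s per instance: N = 4, so N+2 = 6.
n = instances_needed(peak_load=900, capacity_per_instance=250)
```

With two spares, one instance can be down for a deploy while another fails unexpectedly, and peak load is still covered.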

Page 33: How Sumo Logic And Anki Build Highly Resilient Services On AWS To Manage Massive Usage Spikes

Obvious Versus Hard

Page 34: How Sumo Logic And Anki Build Highly Resilient Services On AWS To Manage Massive Usage Spikes

Here We Are In 2014

Page 35: How Sumo Logic And Anki Build Highly Resilient Services On AWS To Manage Massive Usage Spikes

Here We Are, 4 ½ Years Later

• The principles we have described are canon
• Over time, they have served us well
• So, this is pretty easy, no?

Page 36: How Sumo Logic And Anki Build Highly Resilient Services On AWS To Manage Massive Usage Spikes
Page 37: How Sumo Logic And Anki Build Highly Resilient Services On AWS To Manage Massive Usage Spikes

What We Should Really Say

• It’s not supposed to be easy, of course
• But even though certain things are obvious…
• Under scale, reality still has its own mind

Page 38: How Sumo Logic And Anki Build Highly Resilient Services On AWS To Manage Massive Usage Spikes

Microservices

Page 39: How Sumo Logic And Anki Build Highly Resilient Services On AWS To Manage Massive Usage Spikes

Factoring Is Still Important

• Highly cohesive, loosely coupled is an ideal
• The same ideal OO is striving for
• Here’s a snapshot of the current Sumo factoring

Page 40: How Sumo Logic And Anki Build Highly Resilient Services On AWS To Manage Massive Usage Spikes

• 2 to the power of 5 services (“32”), 170+ modules

• Don’t even ask about the # of dependencies

• At least 3 of each – everything is a separately scalable cluster

Page 41: How Sumo Logic And Anki Build Highly Resilient Services On AWS To Manage Massive Usage Spikes

Refactor-able Infrastructure

• The same old thing all over again
• Now that infrastructure is code, keep refactoring
• Split things that don’t belong, join others
• Service abstractions can help keep impact low
• Moving around the code, vs. the protocols

Page 42: How Sumo Logic And Anki Build Highly Resilient Services On AWS To Manage Massive Usage Spikes

It’s Your Spaghetti

• We are not building shrink-wrap anymore
• The world is moving on, and to scale is to change
• Use today’s knowledge to address today’s problem
• Use tomorrow’s knowledge tomorrow
• Keep degrees of freedom on the fundamental level
• Then exploit control and visibility

Embrace Change Or Die

Page 43: How Sumo Logic And Anki Build Highly Resilient Services On AWS To Manage Massive Usage Spikes

Utilization

• Cost effectiveness can be a problem
• One cluster per service means 3 or more instances
• Good news - you can size to taste
• Bad news – unlikely to really utilize the resources
• Things are often either CPU bound, or I/O bound

Page 44: How Sumo Logic And Anki Build Highly Resilient Services On AWS To Manage Massive Usage Spikes

Resource Management

• Docker-ize your services
• Homogenize your fleet/herd
• Interleave services based on resource requirements
• Doable by hand with automation
• Or consider Mesos, …
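
One way to picture "interleave services based on resource requirements" is a two-dimensional first-fit packing, where a CPU-bound service and an I/O-bound service end up sharing a host. This is only a sketch of the idea, not how Sumo Logic or Mesos schedules; the service names and percentages are invented.

```python
def pack(services, cpu_cap=100, io_cap=100):
    """First-fit-decreasing placement of (name, cpu%, io%) services onto hosts."""
    hosts = []  # each host tracks used cpu, used io, and the services placed on it
    # Place the most demanding services first.
    for name, cpu, io in sorted(services, key=lambda s: max(s[1], s[2]), reverse=True):
        for host in hosts:
            if host["cpu"] + cpu <= cpu_cap and host["io"] + io <= io_cap:
                break  # first existing host with room on both dimensions
        else:
            host = {"cpu": 0, "io": 0, "services": []}
            hosts.append(host)
        host["cpu"] += cpu
        host["io"] += io
        host["services"].append(name)
    return hosts


# Two CPU-bound and two I/O-bound services (hypothetical numbers, in percent of a host).
services = [
    ("parser", 80, 10),
    ("indexer", 20, 70),
    ("search", 70, 20),
    ("archiver", 10, 80),
]
hosts = pack(services)
```

Run as separate clusters, these four services would each leave most of one resource idle; interleaved, they fit on two hosts at 90% CPU and 90% I/O each.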

Page 45: How Sumo Logic And Anki Build Highly Resilient Services On AWS To Manage Massive Usage Spikes

Anki: Broad Service Best Practices

• Open to change
• Thoughtful network design
• Composable services
• Loose coupling among services
• Feature flags
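
The slide does not say how Anki implements feature flags. As a generic illustration only, a percentage rollout can be made deterministic by hashing the flag name and user ID into a bucket; `flag_enabled` and its parameters are hypothetical.

```python
import hashlib


def flag_enabled(flag: str, user_id: str, rollout_percent: int) -> bool:
    """Return True if this user falls inside the flag's rollout percentage."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in [0, 100) for this flag and user
    return bucket < rollout_percent


# The same user always gets the same answer for the same flag and percentage.
first = flag_enabled("new-lobby", "user-42", 25)
second = flag_enabled("new-lobby", "user-42", 25)
```

Because the bucket is stable, raising the percentage only ever adds users, and setting it to 0 turns the feature off without a deployment.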

Page 46: How Sumo Logic And Anki Build Highly Resilient Services On AWS To Manage Massive Usage Spikes

It Is Not One System

Page 47: How Sumo Logic And Anki Build Highly Resilient Services On AWS To Manage Massive Usage Spikes

We Are Recovering Enterprise Developers

• Many different modules…
• Roll up into many different services
• Roll up into one system
• Build system, test system, deploy system
• “Continuously” (We use Jenkins, eh)

Page 48: How Sumo Logic And Anki Build Highly Resilient Services On AWS To Manage Massive Usage Spikes

It Is Just Not One System

• Finding a cross cut of services that compile…
• …and can be tested together will take days
• Our Solution: service groups as an organizing unit
• A set of services that “belong” together functionally…
• And yet have different runtime characteristics
• Much easier to keep APIs compatible between 5 groups…
• Versus 32 services
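
A quick back-of-the-envelope check of why 5 groups are easier than 32 services: in the worst case, where every unit talks to every other, the number of pairwise API relationships grows quadratically. The worst-case assumption is mine, not the slide's; the counts 32 and 5 come from the slide.

```python
import math

# Worst-case number of pairwise compatibility relationships to keep working.
service_pairs = math.comb(32, 2)  # 32 independently versioned services
group_pairs = math.comb(5, 2)     # 5 service groups

reduction = service_pairs / group_pairs
```

Under that assumption there are 496 service pairs versus 10 group pairs, roughly a 50x smaller compatibility surface at group boundaries.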

Page 49: How Sumo Logic And Anki Build Highly Resilient Services On AWS To Manage Massive Usage Spikes

Service Groups Scale

• This is really a level of granularity optimization
• One system: too heavy – 32 systems: too fleeting
• Build a service group, deploy against baseline, test
• During deployment, deploy by service group
• Balances crosscutting integration tests with turnaround

Page 50: How Sumo Logic And Anki Build Highly Resilient Services On AWS To Manage Massive Usage Spikes

There Is No Place Like Production

Page 51: How Sumo Logic And Anki Build Highly Resilient Services On AWS To Manage Massive Usage Spikes

You Cannot Simulate The Big Datas

• There’s simply no substitute for Production
• This doesn’t mean you shouldn’t have nightly, staging, …
• This doesn’t mean you shouldn’t have integration tests
• This doesn’t mean you shouldn’t test manually
• But there’s just a class of issues you will not find

• You can’t move Production data into testing
• You can’t afford a second Production size system

Page 52: How Sumo Logic And Anki Build Highly Resilient Services On AWS To Manage Massive Usage Spikes

So Now What?

• Instrument, instrument, instrument
• Monitor, monitor, monitor
• Alert on symptoms

• Basically, don’t worry about 100% CPU, etc.
• Alert on customer impact
• Message ingestion delayed, search takes too long, …

https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/mobilebasic?pli=1&viewopt=127#h.dmn6k1rdb6jf

Page 53: How Sumo Logic And Anki Build Highly Resilient Services On AWS To Manage Massive Usage Spikes

“Don’t Alert If You Don’t Have A Playbook”

Stefan Zier, Chief Architect, Sumo Logic
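
The two alerting rules from the talk (alert on customer-visible symptoms, and never alert without a playbook) can be sketched as below. The `Alert` class, metric names, thresholds, and playbook URLs are hypothetical; the point is only that an alert without a playbook cannot be constructed, and that raw CPU does not page anyone.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Alert:
    name: str
    is_firing: Callable[[dict], bool]
    playbook_url: str

    def __post_init__(self):
        # Enforce "don't alert if you don't have a playbook" at definition time.
        if not self.playbook_url:
            raise ValueError(f"alert {self.name!r} has no playbook")


# Symptom-based alerts: they describe customer impact, not machine state.
alerts = [
    Alert("ingestion delayed", lambda m: m["ingest_delay_s"] > 300,
          "https://wiki.example.com/playbooks/ingest-delay"),
    Alert("search too slow", lambda m: m["search_p95_s"] > 30,
          "https://wiki.example.com/playbooks/slow-search"),
]

# CPU is pegged, but only the ingestion delay is a customer-visible symptom.
metrics = {"ingest_delay_s": 420, "search_p95_s": 4, "cpu_percent": 100}
firing = [a.name for a in alerts if a.is_firing(metrics)]
```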

Page 54: How Sumo Logic And Anki Build Highly Resilient Services On AWS To Manage Massive Usage Spikes

When Things Are Breaking

• You have quick turnaround continuous deployment…
• Then you can “just push a fix”
• This can work, but sometimes the “fix” is not that easy…
• Or even that quick to develop in the first place
• Then you need to be able to quickly roll back

• See also the “service groups” approach

Page 55: How Sumo Logic And Anki Build Highly Resilient Services On AWS To Manage Massive Usage Spikes

When Not To Scale

Page 56: How Sumo Logic And Anki Build Highly Resilient Services On AWS To Manage Massive Usage Spikes

Horizontal Scaling Gone Bad

• The ideal scenario: work stealing
• One queue of tasks, bunch of workers
• Grab from queue, work work work, happy

[Diagram: one Queue feeding Node 1, Node 2, Node 3, and Node 4; queue items are Task, Message Block, “Stuff”]
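
The ideal scenario on this slide, one shared queue and a handful of workers that each grab the next task when free, can be sketched with the standard library as follows. Threads stand in for nodes, and squaring a number stands in for real work; both are assumptions for illustration.

```python
import queue
import threading


def run_workers(tasks, worker_count=4):
    """Drain one shared queue with `worker_count` workers; return results per worker."""
    q = queue.Queue()
    for task in tasks:
        q.put(task)
    done = {f"node-{i}": [] for i in range(1, worker_count + 1)}

    def worker(name):
        while True:
            try:
                task = q.get_nowait()  # grab from queue
            except queue.Empty:
                return  # nothing left, this worker is finished
            done[name].append(task * task)  # stand-in for "work work work"

    threads = [threading.Thread(target=worker, args=(name,)) for name in done]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return done


done = run_workers(range(100))
```

No worker is assigned a fixed share up front, so a slow node simply takes fewer tasks. Which node handles which task varies from run to run.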

Page 57: How Sumo Logic And Anki Build Highly Resilient Services On AWS To Manage Massive Usage Spikes

Horizontal Scaling Gone Bad

• Scaling out a multi-tenant processing system
• 1000s of customers, 1000s of machines
• Parallelism is good, but locality has to be considered
• 1 customer distributed over 1000 machines is bad
• No single machine getting enough load for that customer
• Batches & shards will become too small
• Metadata and in-memory structures grow out of proportion

Page 58: How Sumo Logic And Anki Build Highly Resilient Services On AWS To Manage Massive Usage Spikes

Horizontal Scaling Gone Bad

[Diagram: grid of 25 Index nodes]

Page 59: How Sumo Logic And Anki Build Highly Resilient Services On AWS To Manage Massive Usage Spikes

Horizontal Scaling Gone Bad

[Diagram: grid of 25 Index nodes; customer 1 is spread across all 25]

Page 60: How Sumo Logic And Anki Build Highly Resilient Services On AWS To Manage Massive Usage Spikes

Horizontal Scaling Gone Bad

[Diagram: grid of 25 Index nodes; customers 1 and 2 are each spread across all 25]

Page 61: How Sumo Logic And Anki Build Highly Resilient Services On AWS To Manage Massive Usage Spikes

Horizontal Scaling Gone Bad

[Diagram: grid of 25 Index nodes; customers 1 through 8 are each spread across all 25]

Page 62: How Sumo Logic And Anki Build Highly Resilient Services On AWS To Manage Massive Usage Spikes

Horizontal Scaling With Partitioning

[Diagram: grid of 25 Index nodes; customer 1 is placed on 4 of them]

Page 63: How Sumo Logic And Anki Build Highly Resilient Services On AWS To Manage Massive Usage Spikes

Horizontal Scaling With Partitioning

[Diagram: grid of 25 Index nodes; customer 1 is placed on 4 of them and customer 2 on 6]

Page 64: How Sumo Logic And Anki Build Highly Resilient Services On AWS To Manage Massive Usage Spikes

Horizontal Scaling With Partitioning

[Diagram: grid of 25 Index nodes; customers 1 through 8 are each placed on their own subset of nodes]

Page 65: How Sumo Logic And Anki Build Highly Resilient Services On AWS To Manage Massive Usage Spikes

Partitioning By Customer

• Each cluster elects a leader node via ZooKeeper
• Leader runs the partitioning logic
• (Set[Customer], Set[Instance]) → Map[Instance, Set[Customer]]
• Partitioning written to ZooKeeper
• Example: indexer node knows which customer’s message blocks to pull from message bus
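
A minimal sketch of a partitioning function with the shape given on the slide: customers and instances in, a map from instance to its customers out. The greedy least-loaded placement and the `nodes_needed` field are assumptions for illustration, not Sumo Logic's actual algorithm, and leader election and the ZooKeeper write are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Customer:
    name: str
    nodes_needed: int  # hypothetical: how many instances this customer's load justifies


def partition(customers, instances):
    """Map each instance to the set of customer names it should serve."""
    assignment = {instance: set() for instance in instances}
    # Place the biggest customers first; sort keys make the result deterministic.
    for customer in sorted(customers, key=lambda c: (-c.nodes_needed, c.name)):
        least_loaded = sorted(assignment, key=lambda i: (len(assignment[i]), i))
        for instance in least_loaded[: customer.nodes_needed]:
            assignment[instance].add(customer.name)
    return assignment


customers = {Customer("acme", 2), Customer("globex", 1), Customer("initech", 1)}
instances = {"indexer-1", "indexer-2", "indexer-3", "indexer-4"}
assignment = partition(customers, instances)
```

Each customer lands on only as many instances as its load justifies, so batches stay large enough to be efficient, and an indexer can read its own entry to know whose message blocks to pull.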

Page 66: How Sumo Logic And Anki Build Highly Resilient Services On AWS To Manage Massive Usage Spikes

Copy & Paste Scaling

Page 67: How Sumo Logic And Anki Build Highly Resilient Services On AWS To Manage Massive Usage Spikes

So You Keep Adding Customers…

• Your current system is getting achy
• You are seeing the next order of magnitude on the horizon
• You start to understand what you need to rebuild

• Your quarter ends in 21 days…

Page 68: How Sumo Logic And Anki Build Highly Resilient Services On AWS To Manage Massive Usage Spikes

Sometimes, You Need To Break The Rules

• Copy & Paste Scaling™

• Copy your deployment descriptor files & metadata
• Point them at a different region and pull the trigger
• Instant 2x scaling!

• There are those times…
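
If the deployment is fully described by data, "copy and point at a different region" is close to a one-liner. The sketch below assumes a dictionary-shaped descriptor; the field names, region names, and cluster sizes are hypothetical and do not reflect Sumo Logic's descriptor format.

```python
import copy


def clone_deployment(descriptor: dict, region: str, name: str) -> dict:
    """Copy a deployment descriptor and point the copy at another region."""
    clone = copy.deepcopy(descriptor)  # deep copy so the original stays untouched
    clone["name"] = name
    clone["region"] = region
    return clone


# Hypothetical descriptor for the existing deployment.
prod = {"name": "prod", "region": "us-west-2", "clusters": {"receiver": 3, "indexer": 5}}
prod2 = clone_deployment(prod, region="us-east-1", name="prod-2")
```

New customers can then be routed to the second deployment while the first is rebuilt for the next order of magnitude.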

Page 69: How Sumo Logic And Anki Build Highly Resilient Services On AWS To Manage Massive Usage Spikes

Ben Whaley, @iAmTheWhaley

Christian Beedgen, @raychaser