Obvious and Non-Obvious Scalability Issues: Spotify Learnings

Posted on 05-Dec-2014


Category:

Technology



DESCRIPTION

These are the slides for the talk I gave at the Barcelona Developers Conference 2013. In this talk, I cover some of the scalability issues we have faced during the intense growth we have experienced since 2008. The talk is aimed mostly at systems and backend engineers. Note: some of the slides are not superawesome because the transitions were lost in the conversion to PDF.

TRANSCRIPT

November 12, 2013

Obvious and Non-Obvious Scalability Issues: Spotify Learnings

David Poblador i Garcia
@davidpoblador

BcnDevCon13

Spotify in numbers

2011

2013

One order of magnitude bigger in some dimensions

1,000,000,000 playlists (400M two years ago)

2M new playlists every day

6000+ servers (1300 two years ago, 20 in 2008)

Available in 32 markets (12 two years ago)

Two years ago, fewer than 10 people in Ops + Infra (more than 10 times bigger now)

4 Data Centers

More than 20M songs (adding 20K every day)

More than 24M active users (6M paying subscribers)

More than 50 teams building products & features

Around 100 backend systems

Learning to Scale

Scaling Data Centers

1. Admit that when you are small, there is someone better than you at building data centers

Scaling Data Centers

2009?

2. Streamline your procurement process

Scaling Data Centers

2012

3. Have a “unit of capacity”. We call it POD

Scaling Data Centers

2012

4. Data centers are being commoditized. Chances are that only a few players will deploy DCs in the future. Keep an eye on that: it might make sense for your needs.

Scaling Operations

cloud

Scaling your backend

Scaling your backend

[Diagram: Users, APs, backend service]

Scaling your backend

know your limits

Scaling your backend

[Diagram: Users, APs, backend service nodes]

60K users, 5000 reqs/second

examples

Do not try to be ‘too smart’

DNS à la Spotify

Do not try to be ‘too smart’

Error Reporting

DHT ring lookup

Service Discovery

User Distribution

Do not try to be ‘too smart’

[Diagram: Users, APs, DNS GeoIP magic]

Do not try to be ‘too smart’

Do not try to be ‘too smart’

8.8.8.8

Do not try to be ‘too smart’

[Diagram: Users, APs, 8.8.8.8]

Storage Devices

Storage Devices

[Diagram: APs, backend service nodes, RAM]

5000 reqs/second

Storage Devices

[Diagram: APs, big backend service nodes]

? reqs/second

Does not fit in RAM anymore

Storage Devices

Hard Drives: 200 IOPS

Storage Devices

[Diagram: APs, big backend service nodes]

? reqs/second

Storage Devices

SSD: 10,000 IOPS

Storage Devices

Fusion IO: 250,000 IOPS
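One rough way to read these figures: once the data no longer fits in RAM, and assuming purely for illustration that each request costs a fixed number of random reads, the device's IOPS caps the requests per second a node can serve. The IOPS numbers and the 5000 reqs/second target come from the slides; the function name and the one-read-per-request default are assumptions made for this sketch.

```python
# Device figures are from the slides; reads_per_request is an assumption.
DEVICE_IOPS = {
    "Hard drive": 200,
    "SSD": 10_000,
    "Fusion IO": 250_000,
}


def max_requests_per_second(iops: int, reads_per_request: int = 1) -> int:
    """Upper bound on req/s when every request needs random reads from the device."""
    if reads_per_request < 1:
        raise ValueError("reads_per_request must be >= 1")
    return iops // reads_per_request


if __name__ == "__main__":
    target = 5000  # req/s per node, the in-RAM figure from the slides
    for name, iops in DEVICE_IOPS.items():
        ceiling = max_requests_per_second(iops)
        verdict = "meets" if ceiling >= target else "falls short of"
        print(f"{name}: at most {ceiling} req/s, {verdict} {target} req/s")
```

Real workloads mix cache hits with device reads, so this is only a ceiling for the worst case where nothing is cached.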

Page Cache

Page Cache

[Diagram: APs, backend service nodes]

5000 reqs/second

Example

RAM: 32 GB
OS RAM: 2 GB

Songs: 10M
Index size: 10 GB

Page Cache

[Diagram: APs, backend service nodes]

Increase in data (songs…)

Index: approx. 13 GB

Page Cache

Page Cache

posix_fadvise(2)

orchestrate index deployment

mlock(2)
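As an illustration of the first of these calls, here is a minimal sketch of hinting the kernel about an index file through Python's os.posix_fadvise wrapper (Unix only). The helper names and the idea of warming a new index while releasing an old one are assumptions about how such a deployment might be orchestrated, not a description of the actual tooling. mlock(2) is not shown.

```python
import os


def prefetch_into_page_cache(path: str) -> int:
    """Ask the kernel to start reading the whole file into the page cache.

    Returns the file size in bytes. This is only a hint: the kernel may
    ignore it, and pages can still be evicted later under memory pressure.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        # offset=0 with len=0 covers the file from the start to the end.
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_WILLNEED)
        return size
    finally:
        os.close(fd)


def drop_from_page_cache(path: str) -> None:
    """Hint that the cached pages of this file (e.g. an old index) are no longer needed."""
    fd = os.open(path, os.O_RDONLY)
    try:
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)
```

A deployment script could call prefetch_into_page_cache on the new index before switching traffic to it, then drop_from_page_cache on the old one, so that both copies do not compete for RAM longer than necessary.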

Retry (not much). Back Off. Fail Fast. Degrade Gracefully

[Diagram: Users, APs, backend service]

Retry (not much). Back Off. Fail Fast. Degrade Gracefully

[Diagram: User, AP]

5000 conns/sec

DDoS’d by your clients

Retry (not much). Back Off. Fail Fast. Degrade Gracefully

[Diagram: User, AP]

5000 conns/sec

Exponential Back Off Retry
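A minimal sketch of a bounded retry with exponential back off, so that clients spread out their reconnects instead of all hitting the AP at once. The function names, the default parameters and the use of random jitter are assumptions for illustration; the slides only name the technique.

```python
import random
import time


def backoff_delays(max_retries=4, base=0.5, cap=30.0, rng=random.random):
    """Yield one sleep time per retry: exponential growth, capped, with full jitter."""
    for attempt in range(max_retries):
        ceiling = min(cap, base * (2 ** attempt))
        yield rng() * ceiling


def call_with_backoff(operation, max_retries=4, base=0.5, cap=30.0, sleep=time.sleep):
    """Run operation(); on ConnectionError retry a bounded number of times."""
    delays = backoff_delays(max_retries, base, cap)
    while True:
        try:
            return operation()
        except ConnectionError:
            delay = next(delays, None)
            if delay is None:
                raise  # retry budget exhausted: give up instead of hammering
            sleep(delay)
```

The jitter matters as much as the exponent: without it, clients that were disconnected at the same moment would retry at the same moments too.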

Retry (not much). Back Off. Fail Fast. Degrade Gracefully

[Diagram: Users, APs, backend service]

Fail Fast
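One common way to fail fast is a circuit breaker: after a few consecutive failures the caller stops sending requests to the struggling backend for a while and returns an error immediately. The sketch below is an assumption about how this could look, not the mechanism used in the talk; the class name, thresholds and error type are hypothetical.

```python
import time


class CircuitOpenError(Exception):
    """Raised immediately while the backend is considered unavailable."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after=10.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, operation):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("backend marked down, failing fast")
            self.opened_at = None  # cool-down elapsed: allow one trial call
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        self.failures = 0  # a success closes the circuit again
        return result
```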

Retry (not much). Back Off. Fail Fast. Degrade Gracefully

[Diagram: Users, APs, backend service]

Degrade gracefully

Acceptable Behaviour
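A minimal sketch of one form of graceful degradation: when a backend call fails, serve the last known good answer (or a default) and flag it as degraded, so the caller can decide what acceptable behaviour means for that feature. The class and method names are hypothetical.

```python
class LastKnownGood:
    """Remember the most recent successful result and serve it on failure."""

    def __init__(self, default):
        self.value = default

    def fetch(self, operation):
        """Return (value, degraded): degraded is True when operation() failed."""
        try:
            self.value = operation()
            return self.value, False
        except Exception:
            return self.value, True
```

For example, a client could keep showing the previously loaded list of items when the service behind it is down, rather than showing an error.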

Test in real world conditions

Test in real world conditions

Use your most valuable asset

Start by sending X% of users to X% of your servers
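A minimal sketch of picking a stable X% of users, assuming users are identified by a string ID and that hashing the ID into 100 buckets is an acceptable way to split them. The slides do not say how users are selected; the function names and the hashing scheme are assumptions.

```python
import hashlib


def bucket(user_id: str, buckets: int = 100) -> int:
    """Map a user ID to a stable bucket in the range 0..buckets-1."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets


def in_canary(user_id: str, percent: int) -> bool:
    """True if this user belongs to the first `percent` buckets."""
    return bucket(user_id) < percent
```

Because the bucket depends only on the user ID, a user who is in the group at 5% stays in it when the percentage is raised to 20%, which keeps the experience consistent while the rollout widens.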

Automate

Automate

When necessary

Automate

http://xkcd.com/1205/

Take a self service approach everywhere

Take a self service approach everywhere

Configuration Management

Databases and Storage

Provisioning of Servers

Service Discovery

Load Balancing

Monitoring

…

Scaling Operations (the team)

Scaling Operations

2011

1. Start having teams carry operational responsibility for their own services, including on-call duties for the systems they own

Scaling Operations

Scaling Operations

2012

2. Infrastructure and Operations provide expert guidance/help on how to run the service(s) that teams own in production (and everywhere else)

Scaling Operations

2013

3. Infrastructure and Operations focus their effort on building and extending our platform to create an awesome place to run services

Scaling Operations

Scaling Operations

devops

Incident Management Process

Incident Management Process

“Prevent an issue from happening twice”

Incident Management Process

OPS-6000

Incident Management Process

Incident (severity)

Postmortem meeting with stakeholders

Remediations (urgency)

November 12, 2013

Thank you very much!

David Poblador i Garcia
@davidpoblador
