obvious and non-obvious scalability issues: spotify learnings

November 12, 2013

Obvious and Non-Obvious Scalability Issues: Spotify Learnings

David Poblador i Garcia (@davidpoblador)

BcnDevCon13

Spotify in numbers

One order of magnitude bigger in some dimensions

1,000,000,000 playlists (400M two years ago)

2M new playlists every day

6000+ servers (1300 two years ago, 20 in 2008)

Available in 32 markets (12 two years ago)

Two years ago, fewer than 10 people in Ops + Infra; more than 10 times bigger now

4 data centers

More than 20M songs, adding 20K every day

More than 24M active users, 6M paying subscribers

More than 50 teams building products & features

Around 100 backend systems

Learning to Scale

Scaling Data Centers

1. Admit that when you are small, there is someone better than you at building data centers

2. Streamline your procurement process

3. Have a “unit of capacity”. We call it a POD

4. Data centers are being commoditized. Chances are that only a few players will deploy DCs in the future. Keep an eye on that; it might make sense for your needs

Scaling Operations

Scaling your backend

backend service

Know your limits

backend service node

60K users, 5000 reqs/second
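A rough sketch of what "know your limits" can mean in practice: once you have measured what one node sustains, sizing a service becomes arithmetic. The 60K users per node figure is from the slide; the function name `nodes_needed`, the headroom parameter and the user counts in the usage lines are assumptions made for illustration.

```python
import math


def nodes_needed(concurrent_users, users_per_node=60_000, headroom=0.0):
    """Return how many nodes are needed to serve `concurrent_users`.

    `users_per_node` is the measured limit of a single node.
    `headroom` is the fraction of each node kept free for spikes
    (an assumed safety margin, not a figure from the talk).
    """
    usable_per_node = users_per_node * (1.0 - headroom)
    return math.ceil(concurrent_users / usable_per_node)


# Hypothetical load of one million concurrent users.
print(nodes_needed(1_000_000))                # 17
print(nodes_needed(1_000_000, headroom=0.3))  # 24
```

The point is less the formula than the input: the per-node limit has to come from measurement, not from a guess.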

examples

Do not try to be ‘too smart’

DNS à la Spotify

Error reporting, DHT ring lookup, service discovery, user distribution
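To make "service discovery over DNS" concrete, here is a minimal sketch of how a client might pick an endpoint from a set of SRV-style records. The slides do not say which record types or selection rules were used, so SRV records, the simplified priority/weight selection and the `example.net` host names are all assumptions. No DNS lookup is performed; the records are passed in as plain data.

```python
import random
from typing import NamedTuple


class SrvRecord(NamedTuple):
    priority: int  # lower value is preferred
    weight: int    # relative share of traffic within one priority
    port: int
    target: str


def pick_endpoint(records, rng=random):
    """Pick (target, port) from the lowest-priority group, weighted by `weight`.

    This is a simplified version of the selection described in RFC 2782.
    """
    if not records:
        raise LookupError("no records for service")
    best = min(r.priority for r in records)
    candidates = [r for r in records if r.priority == best]
    weights = [r.weight for r in candidates]
    if sum(weights) == 0:
        chosen = rng.choice(candidates)  # all weights zero: pick uniformly
    else:
        chosen = rng.choices(candidates, weights=weights, k=1)[0]
    return chosen.target, chosen.port


# Hypothetical records for a hypothetical service.
records = [
    SrvRecord(10, 50, 8080, "node1.example.net"),
    SrvRecord(10, 50, 8080, "node2.example.net"),
    SrvRecord(20, 100, 8080, "backup.example.net"),  # used only if priority 10 is empty
]
print(pick_endpoint(records))
```

The more logic is layered on top of DNS like this, the more behaviour depends on resolvers and caches you do not control, which is the risk the ‘too smart’ warning points at.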

DNS GeoIP magic

8.8.8.8

Storage Devices

backend service node: 5000 reqs/second

big backend service node: ? reqs/second

Does not fit in RAM anymore

Hard drives: 200 IOPS

SSD: 10,000 IOPS

Fusion IO: 250,000 IOPS
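A back-of-the-envelope sketch of why the device matters once the data no longer fits in RAM. The IOPS figures and the 5000 reqs/second target are from the slides; the assumption that every request costs exactly one random read from the device, and the name `devices_needed`, are illustrative only.

```python
import math

# IOPS per device, as quoted on the slides.
DEVICE_IOPS = {
    "hard drive": 200,
    "SSD": 10_000,
    "Fusion IO": 250_000,
}


def devices_needed(reqs_per_second, reads_per_request, device_iops):
    """Number of devices needed if every read has to hit the device."""
    required_iops = reqs_per_second * reads_per_request
    return math.ceil(required_iops / device_iops)


for name, iops in DEVICE_IOPS.items():
    print(name, devices_needed(5000, 1, iops))
# hard drive 25
# SSD 1
# Fusion IO 1
```

Under these assumptions, a node that served 5000 reqs/second from RAM would need 25 hard drives to keep the same rate, while a single SSD would cover it.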

Page Cache

backend service node: 5000 reqs/second

Example

RAM: 32 GB, OS RAM: 2 GB

Songs: 10M, index size: 10 GB

Increase in data (songs…)

Index: approx. 13 GB

posix_fadvise(2)

orchestrate index deployment

mlock(2)
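A minimal sketch of how posix_fadvise(2) might be used while deploying a new index: copy the file, then advise the kernel that the copied pages will not be needed soon, so the copy is less likely to push the live index out of the page cache. The slides only name the system calls; the function name `copy_without_polluting_cache` and the chunk size are assumptions, and mlock(2) is not shown here.

```python
import os

CHUNK = 1 << 20  # copy in 1 MiB chunks (arbitrary choice)


def copy_without_polluting_cache(src_path, dst_path):
    """Copy a file, then ask the kernel to drop its pages from the page cache.

    Returns the number of bytes copied. The advice is only a hint:
    the kernel is free to ignore it.
    """
    copied = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(CHUNK)
            if not chunk:
                break
            dst.write(chunk)
            copied += len(chunk)
        dst.flush()
        os.fsync(dst.fileno())  # dirty pages cannot be dropped before write-back
        if hasattr(os, "posix_fadvise"):  # not available on every platform
            for fd in (src.fileno(), dst.fileno()):
                try:
                    # offset 0, length 0 means "the whole file"
                    os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
                except OSError:
                    pass  # advisory only, so a failure here is not fatal
    return copied
```

The orchestration part is about timing: roll the new index out a few nodes at a time, so that nodes with cold caches never make up a large share of capacity.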

Retry (not much). Back Off. Fail Fast. Degrade Gracefully.

User, AP, backend service

5000 conns/sec

DDoS’d by your clients

Exponential Back Off Retry
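A small sketch of retrying with exponential back off, so that clients which all lost their connection at once do not all come back at the same instant. The attempt limit, the delays and the random jitter are assumptions; the slides only name the technique.

```python
import random
import time


def call_with_backoff(operation, max_attempts=4, base_delay=0.5, max_delay=30.0,
                      sleep=time.sleep, rng=random.random):
    """Call `operation`, retrying on ConnectionError with exponential back off.

    The wait ceiling doubles after each failure (capped at `max_delay`), and
    the actual wait is a random fraction of it so clients spread out.
    After `max_attempts` failures the last error is re-raised.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # retry, but not much
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            sleep(rng() * ceiling)
```

`sleep` and `rng` are parameters only so the behaviour can be tested without waiting; in normal use the defaults apply.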

backend service

Fail Fast
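One possible reading of "fail fast", sketched below: when a service is already at its limit, reject new work immediately instead of queueing it and answering late. The class name `FailFastLimiter`, the `Overloaded` exception and the in-flight limit are hypothetical.

```python
import threading


class Overloaded(Exception):
    """Raised when a request is rejected instead of being queued."""


class FailFastLimiter:
    def __init__(self, max_in_flight):
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def run(self, handler, *args):
        # Non-blocking acquire: if no slot is free, fail right away.
        if not self._slots.acquire(blocking=False):
            raise Overloaded("too many requests in flight")
        try:
            return handler(*args)
        finally:
            self._slots.release()
```

A caller that gets `Overloaded` knows within microseconds that it should back off or degrade, rather than finding out through a timeout.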

backend service

Degrade gracefully

Acceptable Behaviour
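A sketch of one form of graceful degradation: when a backend call fails, serve the last good answer (or a default) instead of an error. Whether stale data counts as acceptable behaviour depends on the feature; the class name `StaleOnError` and the example backend are hypothetical.

```python
class StaleOnError:
    """Wrap a fetch function and fall back to the last good result on failure."""

    def __init__(self, fetch, default):
        self._fetch = fetch
        self._last_good = default

    def get(self):
        try:
            self._last_good = self._fetch()
        except Exception:
            pass  # backend is unavailable: keep serving the previous value
        return self._last_good


# Hypothetical backend that works once and then goes down.
responses = iter([["song-a", "song-b"]])


def fetch_recommendations():
    return next(responses)  # raises StopIteration once exhausted


recommendations = StaleOnError(fetch_recommendations, default=[])
print(recommendations.get())  # ['song-a', 'song-b']
print(recommendations.get())  # ['song-a', 'song-b'] (stale, but not an error)
```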

Test in real world conditions

Use your most valuable asset. Start by sending X% of users to X% of your servers
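A minimal sketch of how a fixed percentage of users could be selected for such a test. Hashing the user id gives each user a stable bucket, so the same user keeps hitting the same set of servers between requests. The hashing scheme and the name `in_rollout` are assumptions, not something shown in the slides.

```python
import hashlib


def in_rollout(user_id, percent):
    """Return True if `user_id` falls in the first `percent` of 100 stable buckets."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < percent


# Hypothetical user ids: roughly 5% of them should be selected.
users = [f"user-{i}" for i in range(10_000)]
selected = [u for u in users if in_rollout(u, 5)]
print(len(selected))
```

Raising the percentage only adds users; anyone already selected at 5% is still selected at 10%.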

Automate

When necessary

http://xkcd.com/1205/

Take a self service approach everywhere

Configuration management, databases and storage, provisioning of servers, service discovery, load balancing, monitoring, …

Scaling Operations (the team)

1. Start having teams carry operational responsibility for their own services, including on-call duties for the systems they own

2. Infrastructure and Operations provide expert guidance and help on how to run the services that teams own, in production (and everywhere else)

3. Infrastructure and Operations focus their effort on building and extending our platform to create an awesome place to run services

Scaling Operations

devops

Incident Management Process

“Prevent an issue from happening twice”

OPS-6000

Incident (severity)

Postmortem meeting with stakeholders

Remediations (urgency)

Thank you very much!

David Poblador i Garcia (@davidpoblador)
