Evolution of the Operations Service at Spotify / Lev Popov (Spotify)

Operations Engineering Evolution at Spotify Lev Popov Site Reliability Engineer @nabamx

Upload: ontico

Post on 16-Apr-2017

Category:

Engineering


TRANSCRIPT

Page 1:

Operations Engineering Evolution at Spotify
Lev Popov
Site Reliability Engineer
@nabamx

Page 2:

Who am I?

Lev Popov
Site Reliability Engineer at Spotify
Joined Spotify in 2014
Previously: QIK – Skype – Microsoft

Background in service and network operations

Page 3:

What is Spotify?

Page 4:

Some Numbers

• Over 60 million MAU (monthly active users)
• Over 15 million paying subscribers
• Over 30 million tracks
• Over 1.5 billion playlists
• Over 20,000 songs added per day

Page 5:
Page 6:

Capacity We Own

• 4 data centers
• Over 7000 bare metal servers
• Many different services
• Pushing an average of 35 GBps to the Internet
• 24/7/365

Page 7:

But let's talk about operations

Page 8:

In the beginning was the… Dev owner

[Diagram: services labeled with Dev owner and Ops owner; Operations team: On-call, Monitoring, Build systems, Backups, DB, Networks, …]

Page 9:

Operations Team in 2011

Small group of 5 people

• Over 10 million users
• Over 2 million paying subscribers
• 12 countries
• Over 15 million tracks
• Over 400 million playlists
• 3 datacenters
• Over 1300 servers

Page 10:

Operations Team Now

?

• Over 60 million users
• Over 15 million paying subscribers
• 58 countries
• Over 30 million tracks
• Over 1.5 billion playlists
• 4 datacenters
• Over 5000 servers

Page 11:

Operations Team Now

No team

• Over 60 million users
• Over 15 million paying subscribers
• 58 countries
• Over 30 million tracks
• Over 1.5 billion playlists
• 4 datacenters
• Over 5000 servers

Page 12:
Page 13:

Spotify Engineering Culture

Page 14:

How We Scale

• Service-oriented architecture: separate services for separate features

• UNIX way: small, simple programs doing one thing well

• KISS principle: simple applications are easier to scale

Page 15:

How Spotify Works

Page 16:

Scaling Agile

• A squad is similar to a scrum team

• Designed to feel like a small startup

• Self-organizing teams

• Autonomy to decide their own way of working

Page 17:

Scaling Agile

Page 18:

Can we scale that?

[Diagram repeated from Page 8: services labeled with Dev owner and Ops owner; Operations team: On-call, Monitoring, Build systems, Backups, DB, Networks, …]

Page 19:
Page 20:

Ops in Squads

Page 21:

Ops in Squads Background

Impossible to scale a central operations team
• Understaffed
• Difficult to find generalists

We believe that operations has to sit close to development

Our bet for autonomy
• Break dependencies
• End-to-end responsibility

Page 22:

Timeline

[Timeline diagram. Dates: 2008, early 2011, mid 2012, Sep 2013. Labels: Dev, Backend Infrastructure, I/O, Operations, SRE, Internal IT, Operations in Squads]

Page 23:

Infrastructure Operations

[Diagram: feature squads grouped in a product area; IO Tribe (networks, conf mgmt, containers); label "enable + support"]

Page 24:

Ops in Squads Expectations

Page 25:

Wait, wait, but what if…

Page 26:
Page 27:

[Diagram: Core SRE and the IO Tribe alongside the squads. Labels: Major Incidents, Scalability Issues, Systems Design Problems, Teaching Best Practices in General]

Page 28:

Incident Management

Page 29:

Incident Management

Incident
Postmortem
Remediation

Incident Manager
On-Call

Everybody involved in an incident

Page 30:

Postmortems

• Plan for postmortems
• Keep it close in time
• Record the project details
• Involve everyone
• Get it in writing
• Record successes as well as failures
• It's not for punishment
• Create an action plan
• Make it available

Page 31:

On-call follows the sun

[Diagram: escalation levels and handoff times between sites]

L0: Stockholm / New York
L1: SA Product Owners
L2: SA Lead

Handoffs: 07 CET (01 EST) and 19 CET (13 EST)
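
The follow-the-sun handoff on this slide can be sketched in a few lines of Python. This is only an illustration, not Spotify's tooling. It assumes that Stockholm holds L0 from 07:00 to 19:00 CET and New York holds the remaining twelve hours (the slide gives the handoff times but not which site covers which window). It treats CET as a fixed UTC+1 offset with no daylight saving time. The names `on_call_site` and `escalation_chain` are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumption: CET as a fixed UTC+1 offset; daylight saving time is ignored.
CET = timezone(timedelta(hours=1), "CET")


def on_call_site(moment: datetime) -> str:
    """Return the site assumed to hold L0 on-call at a timezone-aware moment."""
    hour = moment.astimezone(CET).hour
    # Assumption: Stockholm covers 07:00-19:00 CET, New York the other 12 hours.
    return "Stockholm" if 7 <= hour < 19 else "New York"


def escalation_chain(moment: datetime) -> list[str]:
    """List the escalation levels from the slide, starting with the L0 site."""
    return [
        f"L0: {on_call_site(moment)} on-call",
        "L1: SA Product Owners",
        "L2: SA Lead",
    ]


if __name__ == "__main__":
    # 12:00 UTC is 13:00 CET, inside the assumed Stockholm window.
    now = datetime(2015, 3, 2, 12, 0, tzinfo=timezone.utc)
    for level in escalation_chain(now):
        print(level)
```

Keeping the handoff hours in one function makes the 07 CET and 19 CET boundaries easy to change if the real rotation differs from the assumed one.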

Page 32:

Areas of Improvement

Page 33:

Areas of Improvement

• The expectations we place on squads are sometimes unclear

• Communication between feature teams and infrastructure teams

• It’s hard to measure the success of ops in squads

• Abandoned services and other ownership issues

Page 34:

Thank you.

@nabamx