data at spotify

June 12, 2014 Danielle Jabin Data Engineer, A/B Testing Data at Spotify

Upload: dj4b1n

Posted on 15-Jan-2015

DESCRIPTION

Data infrastructure at Spotify

TRANSCRIPT

Page 1: Data at Spotify

June 12, 2014

Danielle Jabin Data Engineer, A/B Testing

Data at Spotify

Page 2: Data at Spotify

I’m Danielle Jabin

•  Data Engineer in the Stockholm office
•  A/B testing infrastructure
•  California born & raised
•  If I can survive a Swedish winter, so can you!
•  Studied Computer Science, Statistics, and Real Estate through the M&T program at the University of Pennsylvania

Page 3: Data at Spotify

Over 40 million active users

As of June 9, 2014  

Page 4: Data at Spotify

Access to more than 20 million songs

As of June 9, 2014  

Page 5: Data at Spotify

Big Data

•  40 million Monthly Active Users
•  20+ million tracks
•  1.5 TB of compressed data from users per day
•  64 TB of data generated in Hadoop each day (including replication factor of 3)
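To see what the replication factor means for the 64 TB figure, here is a quick back-of-the-envelope calculation. It assumes the full 64 TB is stored three times over; the slide does not break the number down further.

```python
# Rough split of the daily Hadoop volume into unique data and replica copies.
# Assumption: the whole 64 TB figure is subject to the replication factor of 3.
RAW_TB_PER_DAY = 64
REPLICATION_FACTOR = 3

unique_tb_per_day = RAW_TB_PER_DAY / REPLICATION_FACTOR
replica_tb_per_day = RAW_TB_PER_DAY - unique_tb_per_day

print(f"unique data:    {unique_tb_per_day:.1f} TB/day")
print(f"replica copies: {replica_tb_per_day:.1f} TB/day")
```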

As of June 9, 2014  

Page 6: Data at Spotify

So how much data is that?

Page 7: Data at Spotify

Let’s compare: 64 TB

•  293,203,072 books (200 pages or 240,000 characters each)
•  16,777,216 MP3 files (with 4 MB average file size)
•  22,369,600 images (with 3 MB average file size)
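The comparison can be roughly reproduced with a few lines of arithmetic. This is a sketch under assumptions the slide does not state: binary units (1 TB = 2**40 bytes, 1 MB = 2**20 bytes) and one byte per character for books.

```python
# Back-of-the-envelope check of the 64 TB comparison.
# Assumptions: binary units and one byte per character (not stated on the slide).
TB = 2**40
MB = 2**20

total_bytes = 64 * TB

books = total_bytes // 240_000        # 240,000 characters per book
mp3_files = total_bytes // (4 * MB)   # 4 MB average MP3 file
images = total_bytes // (3 * MB)      # 3 MB average image

print(f"books:  {books:,}")
print(f"MP3s:   {mp3_files:,}")
print(f"images: {images:,}")
```

Under these assumptions the MP3 count matches the slide exactly, while the book and image counts land within a fraction of a percent of the slide's figures, which appear to be rounded or computed slightly differently.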

Page 8: Data at Spotify

That’s a lot of selfies

Page 9: Data at Spotify

How do we use this data?

Page 10: Data at Spotify

Use Cases

•  Reporting
•  Business Analytics
•  Operational Analytics
•  Product Features

Page 11: Data at Spotify

Reporting

•  Reporting to labels, licensors, partners, and advertisers
•  We support our partners

Page 12: Data at Spotify

Business Analytics

•  Analyzing growth, user behavior, sign-up funnels, etc.
•  Company KPIs
•  NPS analysis

Page 13: Data at Spotify

Operational Metrics

•  Root cause analysis
•  Latency analysis
•  Better capacity planning (servers, people, bandwidth)

Page 14: Data at Spotify

Product Features

•  Discover and Radio
•  Top lists
•  Personalized recommendations
•  A/B Testing

Page 15: Data at Spotify

How do we collect this data?

Page 16: Data at Spotify

The three pillars of our Data Infrastructure:

Kafka: Collection

Hadoop: Processing

Databases: Analytics/Visualization

Page 17: Data at Spotify

This is Dave. Data Engineer at Spotify by day…

Page 18: Data at Spotify

…chiptune DJ Demoscene Time Machine by night.

Page 19: Data at Spotify

Let’s listen to Dave’s song

Page 20: Data at Spotify

Kafka

•  High volume pub-sub system

•  “Producers publish messages to Kafka topics, and consumers subscribe to these topics and consume the messages.”

Page 21: Data at Spotify

Kafka

•  Robust and scalable solution for collection of logs
•  Fast data transfer
•  Low CPU overhead
•  Built-in partitioning, replication, and fault-tolerance
•  Consumers can pull data at different rates
•  Able to handle extremely high volumes
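The idea that consumers can pull data at different rates can be illustrated with a toy in-memory model. This is not the Kafka client API: `TopicLog`, `Consumer`, and the `track-played` topic are hypothetical names, and the sketch ignores partitioning, replication, and persistence.

```python
# Toy in-memory model of a pub-sub log. This is NOT the Kafka client API;
# TopicLog and Consumer are hypothetical names used only for illustration.
from collections import defaultdict


class TopicLog:
    """Append-only message log per topic."""

    def __init__(self):
        self._topics = defaultdict(list)

    def publish(self, topic, message):
        # Producers append to the end of the topic's log.
        self._topics[topic].append(message)

    def read(self, topic, offset, max_messages):
        # Reading never removes messages, so consumers stay independent.
        return self._topics[topic][offset:offset + max_messages]


class Consumer:
    """Tracks its own offset, so it can pull at its own pace."""

    def __init__(self, log, topic):
        self.log = log
        self.topic = topic
        self.offset = 0

    def poll(self, max_messages):
        batch = self.log.read(self.topic, self.offset, max_messages)
        self.offset += len(batch)
        return batch


log = TopicLog()
for i in range(5):
    log.publish("track-played", f"event-{i}")

fast = Consumer(log, "track-played")
slow = Consumer(log, "track-played")

fast_batch = fast.poll(max_messages=5)  # reads everything at once
slow_batch = slow.poll(max_messages=2)  # reads only the first two messages
print(fast_batch)
print(slow_batch, "next offset:", slow.offset)
```

Because each consumer keeps its own position in the log, a slow consumer does not hold back a fast one; it simply resumes from its own offset on the next poll.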

Page 22: Data at Spotify

Other people listened too!

Page 23: Data at Spotify

Hadoop

•  Process and store massive amounts of unstructured data across a distributed cluster
•  One cluster, grown from 37 nodes to 690 nodes today
•  28 PB of storage
•  The largest Hadoop cluster in Europe
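The kind of processing such a cluster performs can be sketched as a map and reduce pair in the Hadoop Streaming style, simulated locally here. The input format (`user_id<TAB>track_id`) and the play-count job itself are assumptions made for illustration, not something taken from the talk.

```python
# Local simulation of a Hadoop Streaming-style job in Python.
# Assumption: input lines look like "user_id<TAB>track_id"; this log format
# is made up for illustration.
from itertools import groupby


def mapper(lines):
    """Emit 'track_id<TAB>1' for every play event."""
    for line in lines:
        _user_id, track_id = line.rstrip("\n").split("\t")
        yield f"{track_id}\t1"


def reducer(sorted_lines):
    """Sum the counts per track; input must already be sorted by key."""
    pairs = (line.split("\t") for line in sorted_lines)
    for track_id, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{track_id}\t{sum(int(count) for _, count in group)}"


events = ["u1\tt1", "u2\tt1", "u1\tt2", "u3\tt1"]
# Hadoop sorts mapper output by key between the two phases; sorted() stands in.
shuffled = sorted(mapper(events))
play_counts = list(reducer(shuffled))
print(play_counts)
```

On a real cluster the mapper and reducer would be separate scripts that read from stdin and write to stdout, with the framework handling the sort and the distribution across nodes.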

Page 24: Data at Spotify

Hadoop

•  Entering the land of optimizations
•  Data retention policy
•  Move to JVM-based languages
•  MapReduce languages
•  Moving to Crunch, JVM-based, for speed and scalability
•  Python with Hadoop Streaming, Java, Hive, Pig, Scala
•  Sprunch: Crunch wrapper for Scala, open-sourced by Spotify
•  Luigi: scheduler open-sourced by Spotify, written in Python
•  Simple and easy way to chain jobs
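What "chaining jobs" means can be sketched in plain Python: each task declares the tasks it requires, and a runner executes requirements first. This is deliberately not Luigi's API; `Task`, `run_with_dependencies`, and the job names are hypothetical, and the sketch omits features such as output targets and failure handling.

```python
# Toy job chaining: each task lists the tasks it requires, and a task runs
# only after its requirements have completed. NOT Luigi's API; Task and
# run_with_dependencies are hypothetical names.


class Task:
    def __init__(self, name, requires=()):
        self.name = name
        self.requires = list(requires)


def run_with_dependencies(task, done=None, order=None):
    """Depth-first: run requirements first, each task at most once."""
    done = set() if done is None else done
    order = [] if order is None else order
    if task.name in done:
        return order
    for dep in task.requires:
        run_with_dependencies(dep, done, order)
    order.append(task.name)  # a real job would do its work here
    done.add(task.name)
    return order


collect = Task("collect_logs")
aggregate = Task("aggregate_plays", requires=[collect])
top_list = Task("build_top_list", requires=[aggregate])
report = Task("label_report", requires=[aggregate])
nightly = Task("nightly", requires=[top_list, report])

execution_order = run_with_dependencies(nightly)
print(execution_order)
```

The shared `aggregate_plays` step runs once even though two downstream jobs depend on it, which is the main convenience a dependency-aware scheduler offers over a hand-written sequence of scripts.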

Page 25: Data at Spotify

What if we want to know more?

vs

Page 26: Data at Spotify

Databases

•  Aggregates from Hadoop put into PostgreSQL or Cassandra
•  Sqoop
•  Core data can be used and manipulated for various needs
•  Ad hoc queries
•  Dashboards
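The flow from aggregates to ad hoc queries can be sketched as follows. To keep the example self-contained, `sqlite3` from the Python standard library stands in for PostgreSQL, and the table, columns, and numbers are invented for illustration; the export step that Sqoop would handle is replaced by a plain list of rows.

```python
# Loading aggregates into a SQL database and running an ad hoc query.
# sqlite3 (standard library) stands in for PostgreSQL here; the table,
# column names, and values are hypothetical.
import sqlite3

# Pretend these rows are daily aggregates exported from Hadoop.
daily_plays = [
    ("2014-06-08", "SE", 120),
    ("2014-06-08", "US", 300),
    ("2014-06-09", "SE", 150),
    ("2014-06-09", "US", 310),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_plays (day TEXT, country TEXT, plays INTEGER)")
conn.executemany("INSERT INTO daily_plays VALUES (?, ?, ?)", daily_plays)

# Ad hoc question: total plays per country, largest first.
rows = conn.execute(
    "SELECT country, SUM(plays) AS total FROM daily_plays "
    "GROUP BY country ORDER BY total DESC"
).fetchall()
print(rows)
conn.close()
```

Keeping only small aggregates in the database is what makes such queries and dashboards fast, since the heavy lifting has already happened in Hadoop.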

Page 29: Data at Spotify

Questions?

Page 30: Data at Spotify

A/B testing questions? Find me!

Control

vs

Page 31: Data at Spotify

Thank you!