Data at Spotify
DESCRIPTION
Data infrastructure at Spotify

TRANSCRIPT
June 12, 2014
Danielle Jabin
Data Engineer, A/B Testing
Data at Spotify
I’m Danielle Jabin
• Data Engineer in the Stockholm office
• A/B testing infrastructure
• California born & raised
• If I can survive a Swedish winter, so can you!
• Studied Computer Science, Statistics, and Real Estate through the M&T program at the University of Pennsylvania
Over 40 million active users
As of June 9, 2014
Access to more than 20 million songs
Big Data
• 40 million Monthly Active Users
• 20+ million tracks
• 1.5 TB of compressed data from users per day
• 64 TB of data generated in Hadoop each day (including replication factor of 3)
So how much data is that?
Let’s compare: 64 TB
• 293,203,072 books (200 pages or 240,000 characters each)
• 16,777,216 MP3 files (with 4 MB average file size)
• 22,369,600 images (with 3 MB average file size)
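The MP3 figure above falls out of simple arithmetic; a quick sanity check, assuming the binary units (1 TB = 2**40 bytes, 1 MB = 2**20 bytes) implied by the slide's numbers:

```python
# Back-of-the-envelope check of the 64 TB comparison,
# assuming binary units: 1 TB = 2**40 bytes, 1 MB = 2**20 bytes.
total_bytes = 64 * 2**40

# MP3 files at a 4 MB average file size
mp3_files = total_bytes // (4 * 2**20)
print(mp3_files)  # 16777216, matching the slide
```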
That’s a lot of selfies
How do we use this data?
Use Cases
• Reporting
• Business Analytics
• Operational Analytics
• Product Features
Reporting
• Reporting to labels, licensors, partners, and advertisers
• We support our partners
Business Analytics
• Analyzing growth, user behavior, sign-up funnels, etc.
• Company KPIs
• NPS analysis
Operational Metrics
• Root cause analysis
• Latency analysis
• Better capacity planning (servers, people, bandwidth)
Product Features
• Discover and Radio
• Top lists
• Personalized recommendations
• A/B Testing
How do we collect this data?
The three pillars of our data infrastructure:
• Kafka: Collection
• Hadoop: Processing
• Databases: Analytics/Visualization
This is Dave. Data Engineer at Spotify by day…
…chiptune DJ Demoscene Time Machine by night.
Let’s listen to Dave’s song
Kafka
• High-volume publish-subscribe (pub-sub) system
• “Producers publish messages to Kafka topics, and consumers subscribe to these topics and consume the messages.”
Kafka
• Robust and scalable solution for log collection
• Fast data transfer
• Low CPU overhead
• Built-in partitioning, replication, and fault tolerance
• Consumers can pull data at different rates
• Able to handle extremely high volumes
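The topic model described above, where consumers pull at their own pace, can be sketched with a minimal in-memory broker. This is a toy illustration of the pub-sub semantics, not the Kafka API; the topic and message names are hypothetical:

```python
from collections import defaultdict

class MiniBroker:
    """Toy pub-sub broker: producers append messages to topic logs,
    and each consumer pulls at its own pace via a per-consumer offset."""
    def __init__(self):
        self.topics = defaultdict(list)    # topic -> ordered message log
        self.offsets = defaultdict(int)    # (topic, consumer) -> next offset

    def publish(self, topic, message):
        self.topics[topic].append(message)

    def poll(self, topic, consumer, max_messages=10):
        start = self.offsets[(topic, consumer)]
        batch = self.topics[topic][start:start + max_messages]
        self.offsets[(topic, consumer)] += len(batch)
        return batch

broker = MiniBroker()
broker.publish("plays", "user1 played track42")
broker.publish("plays", "user2 played track7")

# Two consumers read the same topic independently, at different rates
fast = broker.poll("plays", "hadoop-loader")               # drains both messages
slow = broker.poll("plays", "dashboard", max_messages=1)   # takes only the first
```

Because each consumer tracks its own offset, a slow dashboard never holds back the Hadoop loader, which is the property the bullet about different consumption rates is describing.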
Other people listened too!
Hadoop
• Process and store massive amounts of unstructured data across a distributed cluster
• Grown from a single cluster of 37 nodes to 690 nodes today
• 28 PB of storage
• The largest Hadoop cluster in Europe
Hadoop
• Entering the land of optimizations
• Data retention policy
• Move to JVM-based languages
• MapReduce languages: Python with Hadoop Streaming, Java, Hive, Pig, Scala
• Moving to Crunch, which is JVM-based, for speed and scalability
• Sprunch: a Scala wrapper for Crunch, open sourced by Spotify
• Luigi: Spotify's open-sourced job scheduler, written in Python
• A simple and easy way to chain jobs
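The "Python with Hadoop Streaming" option above works by piping lines through a mapper, a sort, and a reducer. A minimal sketch of that shape, simulated in-process; the play-count job and log lines are hypothetical examples, not Spotify's actual jobs:

```python
from itertools import groupby

# Hadoop Streaming runs: mapper | sort | reducer, over stdin/stdout.
# Here the same flow is simulated in-process on a hypothetical play log.

def mapper(lines):
    """Emit (track, 1) for every play event."""
    for line in lines:
        user, track = line.split()
        yield track, 1

def reducer(pairs):
    """Sum counts per track; Hadoop delivers pairs sorted by key."""
    for track, group in groupby(pairs, key=lambda kv: kv[0]):
        yield track, sum(count for _, count in group)

log = ["alice track1", "bob track2", "carol track1"]
shuffled = sorted(mapper(log))       # stands in for Hadoop's shuffle/sort
counts = dict(reducer(shuffled))     # {'track1': 2, 'track2': 1}
```

On a real cluster the mapper and reducer would be separate scripts reading stdin and printing tab-separated lines, with Hadoop handling the shuffle between them.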
What if we want to know more?
Databases
• Aggregates from Hadoop put into PostgreSQL or Cassandra
• Sqoop
• Core data can be used and manipulated for various needs
• Ad hoc queries
• Dashboards
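Once aggregates land in a relational store, ad hoc queries and dashboards are plain SQL. A sketch using Python's built-in sqlite3 as a stand-in for PostgreSQL; the daily_plays table and its rows are hypothetical:

```python
import sqlite3

# sqlite3 stands in for PostgreSQL here; table and data are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_plays (day TEXT, country TEXT, plays INTEGER)")
conn.executemany(
    "INSERT INTO daily_plays VALUES (?, ?, ?)",
    [("2014-06-09", "SE", 120),
     ("2014-06-09", "US", 300),
     ("2014-06-10", "SE", 150)],
)

# Ad hoc query: total plays per country, the kind of question
# an analyst or a dashboard would ask of the Hadoop aggregates
rows = conn.execute(
    "SELECT country, SUM(plays) FROM daily_plays "
    "GROUP BY country ORDER BY country"
).fetchall()
# rows == [('SE', 270), ('US', 300)]
```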
Questions?
A/B testing questions? Find me!
Thank you!