Data at Spotify
DESCRIPTION
Data infrastructure at Spotify

TRANSCRIPT
June 12, 2014
Danielle Jabin
Data Engineer, A/B Testing
Data at Spotify
I’m Danielle Jabin
• Data Engineer in the Stockholm office
• A/B testing infrastructure
• California born & raised
• If I can survive a Swedish winter, so can you!
• Studied Computer Science, Statistics, and Real Estate through the M&T program at the University of Pennsylvania
Over 40 million active users
As of June 9, 2014
Access to more than 20 million songs
Big Data
• 40 million Monthly Active Users
• 20+ million tracks
• 1.5 TB of compressed data from users per day
• 64 TB of data generated in Hadoop each day (including replication factor of 3)
So how much data is that?
Let’s compare: 64 TB
• 293,203,072 books (200 pages or 240,000 characters each)
• 16,777,216 MP3 files (with 4 MB average file size)
• 22,369,600 images (with 3 MB average file size)
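The MP3 figure above falls out of simple arithmetic; a quick sanity check, assuming the binary units (1 TB = 2**40 bytes, 1 MB = 2**20 bytes) implied by the slide's numbers:

```python
# Back-of-the-envelope check of the 64 TB comparison,
# assuming binary units: 1 TB = 2**40 bytes, 1 MB = 2**20 bytes.
total_bytes = 64 * 2**40

# MP3 files at a 4 MB average file size
mp3_files = total_bytes // (4 * 2**20)
print(mp3_files)  # 16777216, matching the slide
```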
That’s a lot of selfies
How do we use this data?
Use Cases
• Reporting
• Business Analytics
• Operational Analytics
• Product Features
Reporting
• Reporting to labels, licensors, partners, and advertisers
• We support our partners
Business Analytics
• Analyzing growth, user behavior, sign-up funnels, etc.
• Company KPIs
• NPS analysis
Operational Metrics
• Root cause analysis
• Latency analysis
• Better capacity planning (servers, people, bandwidth)
Product Features
• Discover and Radio
• Top lists
• Personalized recommendations
• A/B Testing
How do we collect this data?
The three pillars of our data infrastructure:
• Kafka: Collection
• Hadoop: Processing
• Databases: Analytics/Visualization
This is Dave. Data Engineer at Spotify by day…
…chiptune DJ Demoscene Time Machine by night.
Let’s listen to Dave’s song
Kafka
• High-volume publish-subscribe (pub-sub) system
• “Producers publish messages to Kafka topics, and consumers subscribe to these topics and consume the messages.”
Kafka
• Robust and scalable solution for log collection
• Fast data transfer
• Low CPU overhead
• Built-in partitioning, replication, and fault tolerance
• Consumers can pull data at different rates
• Able to handle extremely high volumes
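The topic model described above, where consumers pull at their own pace, can be sketched with a minimal in-memory broker. This is a toy illustration of the pub-sub semantics, not the Kafka API; the topic and message names are hypothetical:

```python
from collections import defaultdict

class MiniBroker:
    """Toy pub-sub broker: producers append messages to topic logs,
    and each consumer pulls at its own pace via a per-consumer offset."""
    def __init__(self):
        self.topics = defaultdict(list)    # topic -> ordered message log
        self.offsets = defaultdict(int)    # (topic, consumer) -> next offset

    def publish(self, topic, message):
        self.topics[topic].append(message)

    def poll(self, topic, consumer, max_messages=10):
        start = self.offsets[(topic, consumer)]
        batch = self.topics[topic][start:start + max_messages]
        self.offsets[(topic, consumer)] += len(batch)
        return batch

broker = MiniBroker()
broker.publish("plays", "user1 played track42")
broker.publish("plays", "user2 played track7")

# Two consumers read the same topic independently, at different rates
fast = broker.poll("plays", "hadoop-loader")               # drains both messages
slow = broker.poll("plays", "dashboard", max_messages=1)   # takes only the first
```

Because each consumer tracks its own offset, a slow dashboard never holds back the Hadoop loader, which is the property the bullet about different consumption rates is describing.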
Other people listened too!
Hadoop
• Process and store massive amounts of unstructured data across a distributed cluster
• Grown from a single cluster of 37 nodes to 690 nodes today
• 28 PB of storage
• The largest Hadoop cluster in Europe
Hadoop
• Entering the land of optimizations
• Data retention policy
• Move to JVM-based languages
• MapReduce languages: Python with Hadoop Streaming, Java, Hive, Pig, Scala
• Moving to Crunch, which is JVM-based, for speed and scalability
• Sprunch: a Scala wrapper for Crunch, open sourced by Spotify
• Luigi: Spotify's open-sourced job scheduler, written in Python
• A simple and easy way to chain jobs
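The "Python with Hadoop Streaming" option above works by piping lines through a mapper, a sort, and a reducer. A minimal sketch of that shape, simulated in-process; the play-count job and log lines are hypothetical examples, not Spotify's actual jobs:

```python
from itertools import groupby

# Hadoop Streaming runs: mapper | sort | reducer, over stdin/stdout.
# Here the same flow is simulated in-process on a hypothetical play log.

def mapper(lines):
    """Emit (track, 1) for every play event."""
    for line in lines:
        user, track = line.split()
        yield track, 1

def reducer(pairs):
    """Sum counts per track; Hadoop delivers pairs sorted by key."""
    for track, group in groupby(pairs, key=lambda kv: kv[0]):
        yield track, sum(count for _, count in group)

log = ["alice track1", "bob track2", "carol track1"]
shuffled = sorted(mapper(log))       # stands in for Hadoop's shuffle/sort
counts = dict(reducer(shuffled))     # {'track1': 2, 'track2': 1}
```

On a real cluster the mapper and reducer would be separate scripts reading stdin and printing tab-separated lines, with Hadoop handling the shuffle between them.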
What if we want to know more?
Databases
• Aggregates from Hadoop put into PostgreSQL or Cassandra
• Sqoop
• Core data can be used and manipulated for various needs
• Ad hoc queries
• Dashboards
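Once aggregates land in a relational store, ad hoc queries and dashboards are plain SQL. A sketch using Python's built-in sqlite3 as a stand-in for PostgreSQL; the daily_plays table and its rows are hypothetical:

```python
import sqlite3

# sqlite3 stands in for PostgreSQL here; table and data are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_plays (day TEXT, country TEXT, plays INTEGER)")
conn.executemany(
    "INSERT INTO daily_plays VALUES (?, ?, ?)",
    [("2014-06-09", "SE", 120),
     ("2014-06-09", "US", 300),
     ("2014-06-10", "SE", 150)],
)

# Ad hoc query: total plays per country, the kind of question
# an analyst or a dashboard would ask of the Hadoop aggregates
rows = conn.execute(
    "SELECT country, SUM(plays) FROM daily_plays "
    "GROUP BY country ORDER BY country"
).fetchall()
# rows == [('SE', 270), ('US', 300)]
```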
Questions?
A/B testing questions? Find me!
Thank you!