data democratization at soundcloud - bruno sá (soundcloud)
DESCRIPTION
SoundCloud is the world’s leading social sound platform where anyone can create sounds and share them everywhere. 200 million people listen to sounds on SoundCloud every month. That is eight percent of the Internet. 12 hours of audio are uploaded to SoundCloud every minute. This means that SoundCloud not only deals with a lot of data (on the order of hundreds of terabytes) but also embraces the concept of “data democratization,” which means that all data must be available to anyone in the company who needs to access and work with it.
TRANSCRIPT
DATA DEMOCRATIZATION @ SOUNDCLOUD
October 29th, 2013
HI, I’M BRUNO
SOUNDCLOUD IS THE WORLD’S LEADING AUDIO PLATFORM
Every minute, creators upload
12hrs of audio
reaching over
200m
people every month
8% of the internet
FOO FIGHTERS SNOOP LION MADONNA MACKLEMORE PRESIDENT OBAMA JOHN OLIVER (DAILY SHOW/BUGLE)
SKRILLEX
what gets listened to where?
how many new users did we get from that campaign?
how much revenue do we make in Brazil?
how do users use our iOS and Android apps?
what makes a sound successful?
did the product change affect feature x?
do comments on tracks correlate with listens?
what makes an artist successful?
• Avoid Silos
• Remove unnecessary restrictions
• Provide simple tools
• Teach People how to use data
DATA DEMOCRATIZATION
In one sentence:
DATA DEMOCRATIZATION
Deliver the right information to the right person at the right time.
DATA ANALYSIS AND REPORTING
PRODUCTION DB
ANALYTICS DB
2010-2012
DATA ANALYSIS AND REPORTING
Listens, Sounds, Users, Comments, Favorites, Shares, Reposts
Impressions, Clicks, Conversions, Suggestions, Downloads, Taggings, Uploads
DATA ANALYSIS AND REPORTING
Listens
timestamp, duration, sound owner, listener, API key, country (location)
DATA ANALYSIS AND REPORTING
additional metadata:
• location within sound
• context (location on site)
• segmentation
Listening creates >6000 events/s
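At that rate, each event has to be a small, flat record. A hypothetical sketch of one listen event, using only the fields named on the slides (the field names and the builder below are illustrative, not SoundCloud’s actual schema):

```ruby
require 'time'

# Hypothetical shape of one listen event, combining the core fields
# (timestamp, duration, sound owner, listener, API key, country) with
# the additional metadata (position within the sound, site context,
# segmentation). Field names are assumptions for illustration.
def build_listen_event(sound_owner:, listener:, duration:, api_key:, country:,
                       position: 0, context: nil, segment: nil)
  {
    timestamp:   Time.now.utc.iso8601,
    duration:    duration,     # seconds listened
    sound_owner: sound_owner,  # creator of the track
    listener:    listener,
    api_key:     api_key,      # which client produced the event
    country:     country,
    position:    position,     # location within the sound
    context:     context,      # location on the site (e.g. stream, profile)
    segment:     segment       # user segmentation bucket
  }
end

event = build_listen_event(sound_owner: 42, listener: 7, duration: 180,
                           api_key: 'ios-app', country: 'DE',
                           context: 'stream')
```

Keeping the record flat like this is what makes it cheap to ship 6000+ of them per second into batch storage.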
BIG DATA
HADOOP TO THE RESCUE
2 Datacenters in AMS, 200+ Nodes
HADOOP TO THE RESCUE
listen data, listen metadata, search data, recommender data, product testing data, backend production data, backend logs
HADOOP AND DATA DEMOCRATIZATION
• Data is siloed on Hadoop
• Data governance is non-existent
• Technical hurdles for access
• Not realtime
• Slow access
AMAZON REDSHIFT
Fast fully managed DW service
Optimized for datasets of a petabyte or more
Fast query and I/O performance
Columnar storage technology
Staging Area
Pig/Ruby Scripts
Amazon EMR
COPY
Pig/Ruby Scripts
Job execution powered by:
2013
BI INFRASTRUCTURE
Data Exploration
Source Systems
Hadoop
MySql
External Systems
...
MySql (production db)
DataWarehouse
ETL Scripts ETL Scripts
First: Gather data from the several source systems into S3
Hadoop
MySql
External Systems
MySql (production db)
Full/Daily Imports
MapReduce for: - Listens - Plays - Impressions - Affiliations - ...
IMPORT DATA FROM SOURCE SYSTEMS
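The MapReduce jobs listed above boil down to counting and summing events by key. An in-memory sketch of that map/reduce shape, counting listens per country (illustrative only; the real jobs run as Pig/Ruby scripts on Hadoop):

```ruby
# Toy map/reduce over listen events: count listens per country.
# Illustrative only -- the production jobs run on the Hadoop cluster.
events = [
  { sound_id: 1, country: 'DE' },
  { sound_id: 2, country: 'BR' },
  { sound_id: 1, country: 'DE' }
]

# Map phase: emit one (key, 1) pair per event.
pairs = events.map { |e| [e[:country], 1] }

# Shuffle + reduce phase: group pairs by key, then sum the counts.
listens_per_country = pairs.group_by(&:first)
                           .transform_values { |vs| vs.sum(&:last) }
# => { 'DE' => 2, 'BR' => 1 }
```

The same pattern, keyed on sound, client application, or campaign instead of country, covers the Listens/Plays/Impressions/Affiliations jobs.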
Second: Rebuild staging area tables for full imports
IMPORT DATA FROM SOURCE SYSTEMS
Staging Area
tracks users client applications
...
Based on YAML configuration files:
• CREATE statements are generated
• DISTKEYS and SORTKEYS are re-created
• Full control over changes in the data model
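A minimal sketch of how a YAML table definition could drive the generated CREATE statements, including DISTKEY and SORTKEY (the config format and field names here are assumptions, not SoundCloud’s actual files):

```ruby
require 'yaml'

# Hypothetical YAML table definition; the real config format is internal.
config = YAML.load(<<~YAML)
  table: listens
  columns:
    - { name: listen_id, type: bigint }
    - { name: country,   type: "varchar(2)" }
    - { name: ts,        type: timestamp }
  distkey: listen_id
  sortkey: ts
YAML

# Generate a Redshift CREATE TABLE statement from the config, so the
# staging table (and its DISTKEY/SORTKEY) can be dropped and re-created
# whenever the data model changes.
def create_statement(cfg)
  cols = cfg['columns'].map { |c| "#{c['name']} #{c['type']}" }.join(', ')
  "CREATE TABLE #{cfg['table']} (#{cols}) " \
  "DISTKEY(#{cfg['distkey']}) SORTKEY(#{cfg['sortkey']});"
end

sql = create_statement(config)
```

Because the statements are generated rather than hand-written, a schema change is a one-line edit to the config file.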
Third: Import the data from S3 to RedShift
Staging Area
tracks users client applications
...
Full import: TRUNCATE & COPY
Daily import: COPY
IMPORT DATA FROM SOURCE SYSTEMS
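The two import strategies can be sketched as generated SQL: a full import wipes the staging table first, a daily import just appends. A simplified sketch (table names, S3 paths, and the credentials clause are placeholders):

```ruby
# Sketch of the two S3 -> Redshift import strategies:
#   full import:  TRUNCATE & COPY (rebuild the table from scratch)
#   daily import: COPY only       (append the new partition)
# Paths and credentials are placeholders for illustration.
def import_statements(table, s3_path, full: false)
  stmts = []
  stmts << "TRUNCATE #{table};" if full
  stmts << "COPY #{table} FROM '#{s3_path}' CREDENTIALS '...' DELIMITER '\\t';"
  stmts
end

full_import  = import_statements('staging.tracks', 's3://bucket/tracks/',
                                 full: true)
daily_import = import_statements('staging.listens',
                                 's3://bucket/listens/2013-10-29/')
```

Dimension-like tables (tracks, users, client applications) get the full treatment; high-volume fact data only ever appends.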
ETL scripts are divided into layers:
- Layer 1: Staging -> DW (dimensions)
- Layer 2: Staging -> DW (fact tables - raw data)
- Layer 3: DW -> DW (aggregated fact tables)
- Layer 4: DW -> Reporting Data Cubes (reporting data)
ETL AND DW DATAMODEL
DataWarehouse
ETL AND DW DATAMODEL
Staging Area
Data Cleaning, Data Transformation (Ruby/SQL Scripts)
ETL Layer 1 & 2
Data Aggregation (Ruby/SQL Scripts)
ETL Layer 3
Data Exploration
ETL Layer 4
Data Presentation (SQL)
JOB SCHEDULE AND EXECUTION
Job-scheduling tool developed internally
Set dependencies between jobs
Execution in multiple machines
Supports all the ETL layers
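The core of such a scheduler is dependency ordering: every job runs only after its prerequisites. A minimal topological-sort sketch (the internal tool is not public; job names and the `deps` shape below are illustrative only):

```ruby
# Minimal dependency resolution for ETL jobs: given a map of
# job => [prerequisite jobs], return an execution order in which every
# job appears after all of its dependencies. Detects cycles.
# Illustrative only; the internal scheduling tool is not public.
def execution_order(deps)
  order = []
  state = {}
  visit = lambda do |job|
    return if state[job] == :done
    raise "dependency cycle at #{job}" if state[job] == :in_progress
    state[job] = :in_progress
    (deps[job] || []).each { |d| visit.call(d) }
    state[job] = :done
    order << job
  end
  deps.keys.each { |j| visit.call(j) }
  order
end

# Hypothetical layered ETL graph: each layer depends on the one below.
deps = {
  'layer1_dimensions' => ['staging_import'],
  'layer2_facts'      => ['layer1_dimensions'],
  'layer3_aggregates' => ['layer2_facts'],
  'layer4_cubes'      => ['layer3_aggregates'],
  'staging_import'    => []
}
order = execution_order(deps)
```

Once the order is known, independent jobs at the same depth can be fanned out across multiple machines.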
DATA EXPLORATION
Simple and fast access to data
More time for “deep dives” into data
Individualized Reporting
Allows interactivity between users
Integrated with RedShift
TIMELINE (Week 2 – Week 16)

Analysis Stage — Requirement Analysis:
• Gap Analysis
• Business Exploration (requirements interviews)
• Information Mapping Design
• Solution Design (Draft)
Milestone: End of Analysis Stage

Design & Build:
• Define Infrastructure
• Design Data Model
Milestone (Week 6): Infrastructure Ready!
• Build ETL
• Build Data Cubes
• Design Reports/Dashboards (Presentation Layer)
Milestone: BI 1.0 is built!

Test & Deploy:
• System/Integration Tests
• User Acceptance
Milestone: BI 1.0 is tested!
• User Workshops
• BI 1.0 Evaluation
Milestone: BI 1.0 is ready to use!
• Reports designed by end users
• Central repository for data analysis
• User interaction
• Data from one source only
• Scalable solution
• Data to the people!
DATA DEMOCRATIZATION
what gets listened to where?
how many new users did we get from that campaign?
what makes a sound successful?
did the product change affect feature x?
how much revenue do we make in Brazil?
how do users use our iOS and Android apps?
do comments on tracks correlate with listens?
what makes an artist successful?
QUESTIONS?