data democratization at soundcloud - bruno sá (soundcloud)
DESCRIPTION
SoundCloud is the world’s leading social sound platform where anyone can create sounds and share them everywhere. 200 million people listen to sounds on SoundCloud every month. That is eight percent of the Internet. 12 hours of audio are uploaded to SoundCloud every minute. This means that SoundCloud not only deals with a lot of data (on the order of hundreds of terabytes) but also embraces the concept of “data democratization,” which means that all data must be available to anyone in the company who needs to access and work with it.
TRANSCRIPT
DATA DEMOCRATIZATION @ SOUNDCLOUD
October 29th, 2013
HI, I’M BRUNO
SOUNDCLOUD IS THE WORLD’S LEADING AUDIO PLATFORM
Every minute, creators upload
12hrs of audio
reaching over
200m
people every month
8% of the internet
FOO FIGHTERS SNOOP LION MADONNA MACKLEMORE PRESIDENT OBAMA JOHN OLIVER (DAILY SHOW/BUGLE)
SKRILLEX
what gets listened to where?
how many new users did we get from that campaign?
how much revenue do we make in Brazil?
how do users use our iOS and Android apps?
what makes a sound successful?
did the product change affect feature x?
do comments on tracks correlate with listens?
what makes an artist successful?
• Avoid Silos
• Remove unnecessary restrictions
• Provide simple tools
• Teach People how to use data
DATA DEMOCRATIZATION
In one sentence:
DATA DEMOCRATIZATION
Deliver the right information to the right person at the right time.
DATA ANALYSIS AND REPORTING
PRODUCTION DB
ANALYTICS DB
2010-2012
DATA ANALYSIS AND REPORTING
Listens, Sounds, Users, Comments, Favorites, Shares, Reposts
Impressions, Clicks, Conversions, Suggestions, Downloads, Taggings, Uploads
DATA ANALYSIS AND REPORTING
Listens
timestamp, duration, sound owner, listener, API key, country (location)
DATA ANALYSIS AND REPORTING
additional metadata:
• location within sound
• context (location on site)
• segmentation
Listening creates >6000 events/s
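At that rate, each event has to be a small, flat record. A hypothetical sketch of one listen event, using only the fields named on the slides (the field names and the builder below are illustrative, not SoundCloud’s actual schema):

```ruby
require 'time'

# Hypothetical shape of one listen event, combining the core fields
# (timestamp, duration, sound owner, listener, API key, country) with
# the additional metadata (position within the sound, site context,
# segmentation). Field names are assumptions for illustration.
def build_listen_event(sound_owner:, listener:, duration:, api_key:, country:,
                       position: 0, context: nil, segment: nil)
  {
    timestamp:   Time.now.utc.iso8601,
    duration:    duration,     # seconds listened
    sound_owner: sound_owner,  # creator of the track
    listener:    listener,
    api_key:     api_key,      # which client produced the event
    country:     country,
    position:    position,     # location within the sound
    context:     context,      # location on the site (e.g. stream, profile)
    segment:     segment       # user segmentation bucket
  }
end

event = build_listen_event(sound_owner: 42, listener: 7, duration: 180,
                           api_key: 'ios-app', country: 'DE',
                           context: 'stream')
```

Keeping the record flat like this is what makes it cheap to ship 6000+ of them per second into batch storage.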
BIG DATA
HADOOP TO THE RESCUE
2 Datacenters in AMS, 200+ Nodes
HADOOP TO THE RESCUE
listen data, listen metadata, search data, recommender data, product testing data, backend production data, backend logs
HADOOP AND DATA DEMOCRATIZATION
• Data is siloed on Hadoop
• Data governance is non-existent
• Technical hurdles for access
• Not realtime
• Slow access
AMAZON REDSHIFT
Fast fully managed DW service
Optimized for datasets of a petabyte or more
Fast query and I/O performance
Columnar storage technology
Staging Area
Pig/Ruby Scripts
Amazon EMR
COPY
Pig/Ruby Scripts
Job execution powered by:
2013
BI INFRASTRUCTURE
Data Exploration
Source Systems
Hadoop
MySql
External Systems
...
MySql (production db)
DataWarehouse
ETL Scripts ETL Scripts
First: Gather data from the several source systems into S3
Hadoop
MySql
External Systems
MySql (production db)
Full/Daily Imports
MapReduce for: - Listens - Plays - Impressions - Affiliations - ...
IMPORT DATA FROM SOURCE SYSTEMS
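The MapReduce jobs listed above boil down to counting and summing events by key. An in-memory sketch of that map/reduce shape, counting listens per country (illustrative only; the real jobs run as Pig/Ruby scripts on Hadoop):

```ruby
# Toy map/reduce over listen events: count listens per country.
# Illustrative only -- the production jobs run on the Hadoop cluster.
events = [
  { sound_id: 1, country: 'DE' },
  { sound_id: 2, country: 'BR' },
  { sound_id: 1, country: 'DE' }
]

# Map phase: emit one (key, 1) pair per event.
pairs = events.map { |e| [e[:country], 1] }

# Shuffle + reduce phase: group pairs by key, then sum the counts.
listens_per_country = pairs.group_by(&:first)
                           .transform_values { |vs| vs.sum(&:last) }
# => { 'DE' => 2, 'BR' => 1 }
```

The same pattern, keyed on sound, client application, or campaign instead of country, covers the Listens/Plays/Impressions/Affiliations jobs.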
Second: Rebuild staging area tables for full imports
IMPORT DATA FROM SOURCE SYSTEMS
Staging Area
tracks users client applications
...
Based on YAML configuration files:
• CREATE statements are generated
• DISTKEYS and SORTKEYS are re-created
• Full control over changes in the data model
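A minimal sketch of how a YAML table definition could drive the generated CREATE statements, including DISTKEY and SORTKEY (the config format and field names here are assumptions, not SoundCloud’s actual files):

```ruby
require 'yaml'

# Hypothetical YAML table definition; the real config format is internal.
config = YAML.load(<<~YAML)
  table: listens
  columns:
    - { name: listen_id, type: bigint }
    - { name: country,   type: "varchar(2)" }
    - { name: ts,        type: timestamp }
  distkey: listen_id
  sortkey: ts
YAML

# Generate a Redshift CREATE TABLE statement from the config, so the
# staging table (and its DISTKEY/SORTKEY) can be dropped and re-created
# whenever the data model changes.
def create_statement(cfg)
  cols = cfg['columns'].map { |c| "#{c['name']} #{c['type']}" }.join(', ')
  "CREATE TABLE #{cfg['table']} (#{cols}) " \
  "DISTKEY(#{cfg['distkey']}) SORTKEY(#{cfg['sortkey']});"
end

sql = create_statement(config)
```

Because the statements are generated rather than hand-written, a schema change is a one-line edit to the config file.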
Third: Import the data from S3 to RedShift
Staging Area
tracks users client applications
...
Full import: TRUNCATE & COPY
Daily import: COPY
IMPORT DATA FROM SOURCE SYSTEMS
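The two import strategies can be sketched as generated SQL: a full import wipes the staging table first, a daily import just appends. A simplified sketch (table names, S3 paths, and the credentials clause are placeholders):

```ruby
# Sketch of the two S3 -> Redshift import strategies:
#   full import:  TRUNCATE & COPY (rebuild the table from scratch)
#   daily import: COPY only       (append the new partition)
# Paths and credentials are placeholders for illustration.
def import_statements(table, s3_path, full: false)
  stmts = []
  stmts << "TRUNCATE #{table};" if full
  stmts << "COPY #{table} FROM '#{s3_path}' CREDENTIALS '...' DELIMITER '\\t';"
  stmts
end

full_import  = import_statements('staging.tracks', 's3://bucket/tracks/',
                                 full: true)
daily_import = import_statements('staging.listens',
                                 's3://bucket/listens/2013-10-29/')
```

Dimension-like tables (tracks, users, client applications) get the full treatment; high-volume fact data only ever appends.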
ETL scripts are divided into layers:
- Layer 1: Staging -> DW (dimensions)
- Layer 2: Staging -> DW (fact tables - raw data)
- Layer 3: DW -> DW (aggregated fact tables)
- Layer 4: DW -> Reporting Data Cubes (reporting data)
ETL AND DW DATAMODEL
DataWarehouse
ETL AND DW DATAMODEL
Staging Area
Data Cleaning, Data Transformation (Ruby/SQL Scripts)
ETL Layer 1 & 2
Data Aggregation (Ruby/SQL Scripts)
ETL Layer 3
Data Exploration
ETL Layer 4
Data Presentation (SQL)
JOB SCHEDULE AND EXECUTION
Job-scheduling tool developed internally
Set dependencies between jobs
Execution in multiple machines
Supports all the ETL layers
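The core of such a scheduler is dependency ordering: every job runs only after its prerequisites. A minimal topological-sort sketch (the internal tool is not public; job names and the `deps` shape below are illustrative only):

```ruby
# Minimal dependency resolution for ETL jobs: given a map of
# job => [prerequisite jobs], return an execution order in which every
# job appears after all of its dependencies. Detects cycles.
# Illustrative only; the internal scheduling tool is not public.
def execution_order(deps)
  order = []
  state = {}
  visit = lambda do |job|
    return if state[job] == :done
    raise "dependency cycle at #{job}" if state[job] == :in_progress
    state[job] = :in_progress
    (deps[job] || []).each { |d| visit.call(d) }
    state[job] = :done
    order << job
  end
  deps.keys.each { |j| visit.call(j) }
  order
end

# Hypothetical layered ETL graph: each layer depends on the one below.
deps = {
  'layer1_dimensions' => ['staging_import'],
  'layer2_facts'      => ['layer1_dimensions'],
  'layer3_aggregates' => ['layer2_facts'],
  'layer4_cubes'      => ['layer3_aggregates'],
  'staging_import'    => []
}
order = execution_order(deps)
```

Once the order is known, independent jobs at the same depth can be fanned out across multiple machines.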
DATA EXPLORATION
Simple and fast access to data
More time for “deep dives” into data
Individualized Reporting
Allows interactivity between users
Integrated with RedShift
TIMELINE (Week 2 – Week 16)

Analysis Stage — Requirement Analysis:
• Gap Analysis
• Business Exploration (requirements interviews)
• Information Mapping Design
• Solution Design (Draft)
Milestone: End of Analysis Stage

Design & Build:
• Define Infrastructure
• Design Data Model
Milestone (Week 6): Infrastructure Ready!
• Build ETL
• Build Data Cubes
• Design Reports/Dashboards (Presentation Layer)
Milestone: BI 1.0 is built!

Test & Deploy:
• System/Integration Tests
• User Acceptance
Milestone: BI 1.0 is tested!
• User Workshops
• BI 1.0 Evaluation
Milestone: BI 1.0 is ready to use!
• Reports designed by end users
• Central repository for data analysis
• User interaction
• Data from one source only
• Scalable solution
• Data to the people!
DATA DEMOCRATIZATION
what gets listened to where?
how many new users did we get from that campaign?
what makes a sound successful?
did the product change affect feature x?
how much revenue do we make in Brazil?
how do users use our iOS and Android apps?
do comments on tracks correlate with listens?
what makes an artist successful?
QUESTIONS?