![Page 1: Strata SG 2015: LinkedIn Self Serve Reporting Platform on Hadoop](https://reader036.vdocuments.net/reader036/viewer/2022081517/58890bdf1a28ab4a5c8b4ebb/html5/thumbnails/1.jpg)
Self-Serve Reporting Platform on Hadoop
Shirshanka Das Strata Singapore 2015
![Page 2: Strata SG 2015: LinkedIn Self Serve Reporting Platform on Hadoop](https://reader036.vdocuments.net/reader036/viewer/2022081517/58890bdf1a28ab4a5c8b4ebb/html5/thumbnails/2.jpg)
![Page 3: Strata SG 2015: LinkedIn Self Serve Reporting Platform on Hadoop](https://reader036.vdocuments.net/reader036/viewer/2022081517/58890bdf1a28ab4a5c8b4ebb/html5/thumbnails/3.jpg)
![Page 4: Strata SG 2015: LinkedIn Self Serve Reporting Platform on Hadoop](https://reader036.vdocuments.net/reader036/viewer/2022081517/58890bdf1a28ab4a5c8b4ebb/html5/thumbnails/4.jpg)
Ingest Process Serve Visualize
Reporting Pipelines
![Page 5: Strata SG 2015: LinkedIn Self Serve Reporting Platform on Hadoop](https://reader036.vdocuments.net/reader036/viewer/2022081517/58890bdf1a28ab4a5c8b4ebb/html5/thumbnails/5.jpg)
Reporting at LinkedIn: Evolution
Diagram labels (layout not recoverable): Sources; Oracle; Espresso; Kafka; External; Custom (x3); Hadoop; Teradata; INFA + MSTR; + Scripts; Jobs on; Voldemort; Pinot; MySQL; MSTR; Tableau; Internal Tools
![Page 6: Strata SG 2015: LinkedIn Self Serve Reporting Platform on Hadoop](https://reader036.vdocuments.net/reader036/viewer/2022081517/58890bdf1a28ab4a5c8b4ebb/html5/thumbnails/6.jpg)
Infra Scale
Number of Hadoop clusters: 12
Total number of machines: ~7k
Largest cluster: ~3k machines
Data volume generated per day: XX Terabytes
Total accumulated data: XX Petabytes
![Page 7: Strata SG 2015: LinkedIn Self Serve Reporting Platform on Hadoop](https://reader036.vdocuments.net/reader036/viewer/2022081517/58890bdf1a28ab4a5c8b4ebb/html5/thumbnails/7.jpg)
People Scale
Reporting Platform Team: ~10
Core Warehouse Team: 1x
Data Scientists: 10x
Business Analysts: 10x
Product Managers: 10x
Sales and Marketing: 100x
![Page 8: Strata SG 2015: LinkedIn Self Serve Reporting Platform on Hadoop](https://reader036.vdocuments.net/reader036/viewer/2022081517/58890bdf1a28ab4a5c8b4ebb/html5/thumbnails/8.jpg)
Challenges
Disjointed efforts, unreliable systems
Unpredictable SLA across all systems
Fragmented data pipelines with inconsistent data
![Page 9: Strata SG 2015: LinkedIn Self Serve Reporting Platform on Hadoop](https://reader036.vdocuments.net/reader036/viewer/2022081517/58890bdf1a28ab4a5c8b4ebb/html5/thumbnails/9.jpg)
![Page 10: Strata SG 2015: LinkedIn Self Serve Reporting Platform on Hadoop](https://reader036.vdocuments.net/reader036/viewer/2022081517/58890bdf1a28ab4a5c8b4ebb/html5/thumbnails/10.jpg)
Houston, we have a problem
![Page 11: Strata SG 2015: LinkedIn Self Serve Reporting Platform on Hadoop](https://reader036.vdocuments.net/reader036/viewer/2022081517/58890bdf1a28ab4a5c8b4ebb/html5/thumbnails/11.jpg)
Step 1: Central transport pipeline
![Page 12: Strata SG 2015: LinkedIn Self Serve Reporting Platform on Hadoop](https://reader036.vdocuments.net/reader036/viewer/2022081517/58890bdf1a28ab4a5c8b4ebb/html5/thumbnails/12.jpg)
Still have a problem
![Page 13: Strata SG 2015: LinkedIn Self Serve Reporting Platform on Hadoop](https://reader036.vdocuments.net/reader036/viewer/2022081517/58890bdf1a28ab4a5c8b4ebb/html5/thumbnails/13.jpg)
Step 2: Central Ingestion Framework
![Page 14: Strata SG 2015: LinkedIn Self Serve Reporting Platform on Hadoop](https://reader036.vdocuments.net/reader036/viewer/2022081517/58890bdf1a28ab4a5c8b4ebb/html5/thumbnails/14.jpg)
Stream + Batch
Diverse Sources: REST, SFTP, JDBC
Data Quality
Open source @ github.com/linkedin/gobblin
In production @ LinkedIn, Intel, Swisscom, NerdWallet
@LinkedIn: ~20 distinct source types, hundreds of TB per day, hundreds of datasets
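A central ingestion framework that handles diverse sources with a data quality step can be pictured roughly as follows. This is only a minimal sketch: the function names (`ingest`, `normalize`, `has_member_id`), the fake sources, and the record fields are all hypothetical and are not Gobblin's actual API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List

Record = Dict[str, object]


@dataclass
class IngestResult:
    written: List[Record]   # records that passed the quality check
    rejected: List[Record]  # records that failed it


def ingest(source: Iterable[Record],
           convert: Callable[[Record], Record],
           is_valid: Callable[[Record], bool]) -> IngestResult:
    """Pull records from any source, convert them, then apply a row-level check."""
    written, rejected = [], []
    for raw in source:
        record = convert(raw)
        (written if is_valid(record) else rejected).append(record)
    return IngestResult(written, rejected)


# Hypothetical stand-ins for real connectors (REST, SFTP, JDBC, ...).
def fake_jdbc_rows():
    yield {"member_id": "1", "country": "sg"}
    yield {"member_id": None, "country": "us"}


def fake_rest_rows():
    yield {"member_id": "7", "country": "in"}


def normalize(record: Record) -> Record:
    # Example conversion: upper-case the country code.
    return {**record, "country": str(record["country"]).upper()}


def has_member_id(record: Record) -> bool:
    # Example quality rule: every record must carry a member id.
    return record.get("member_id") is not None


sources = {"jdbc": fake_jdbc_rows, "rest": fake_rest_rows}
results = {name: ingest(make_rows(), normalize, has_member_id)
           for name, make_rows in sources.items()}

for name, result in results.items():
    print(name, len(result.written), "written,", len(result.rejected), "rejected")
```

The point of the sketch is that every source type flows through the same convert and quality-check path, so adding a source means adding a connector rather than a new pipeline.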
![Page 15: Strata SG 2015: LinkedIn Self Serve Reporting Platform on Hadoop](https://reader036.vdocuments.net/reader036/viewer/2022081517/58890bdf1a28ab4a5c8b4ebb/html5/thumbnails/15.jpg)
Unified Metrics Platform
![Page 16: Strata SG 2015: LinkedIn Self Serve Reporting Platform on Hadoop](https://reader036.vdocuments.net/reader036/viewer/2022081517/58890bdf1a28ab4a5c8b4ebb/html5/thumbnails/16.jpg)
Requirements
Single Source of Truth
Easy Onboarding
Operability
![Page 17: Strata SG 2015: LinkedIn Self Serve Reporting Platform on Hadoop](https://reader036.vdocuments.net/reader036/viewer/2022081517/58890bdf1a28ab4a5c8b4ebb/html5/thumbnails/17.jpg)
Workflow
Diagram labels: Metric Owner; Metric Definition; Sandbox; Code Repository; Build; System Jobs; Core Metrics Job; Central Team, Relevant Stakeholders
Steps: 1. iterate, 2. create, 3. review, 4. check in
![Page 18: Strata SG 2015: LinkedIn Self Serve Reporting Platform on Hadoop](https://reader036.vdocuments.net/reader036/viewer/2022081517/58890bdf1a28ab4a5c8b4ebb/html5/thumbnails/18.jpg)
Metric Definition
Fields shown: Name, Description, Tags, Owners, Dataset, Dimensions, Time, Script, Metrics, Entity Ids, Tier, Formulas, Entity Dimensions, Input Datasets, Temporality
![Page 19: Strata SG 2015: LinkedIn Self Serve Reporting Platform on Hadoop](https://reader036.vdocuments.net/reader036/viewer/2022081517/58890bdf1a28ab4a5c8b4ebb/html5/thumbnails/19.jpg)
An example: video play analysis
name: "video"
description: "Metrics for video tracking"
label: "video"
tags: [flagship, feed]
owners: [jdoe, jsmith]
enabled: true
retention: 90d
timestamp: timestamp
frequency: daily
script: video_play.pig
output_window: 1d
dimensions: [
  {
    name: platform
    doc: "phone, tablet or desktop"
  }
  {
    name: action_type
    doc: "click play or auto-play"
  }
]
input_datasets: [
  {
    name: actionsRaw
    path: Tracking.ActionEvent
    range: 1d
  }
]
![Page 20: Strata SG 2015: LinkedIn Self Serve Reporting Platform on Hadoop](https://reader036.vdocuments.net/reader036/viewer/2022081517/58890bdf1a28ab4a5c8b4ebb/html5/thumbnails/20.jpg)
An example contd…
metrics: [
  {
    name: unique_viewers
    doc: "Count of unique viewers"
    formula: "unique(member_id)"
    tier: 2
    good_direction: "up"
  }
  {
    name: play_actions
    doc: "Sum of play actions"
    tier: 2
    formula: "sum(play_actions)"
    good_direction: "up"
  }
]
entity_ids: [
  {
    name: member_id
    category: member
  }
  {
    name: video_id
    category: video
  }
]
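To make the two formulas in the video definition concrete, here is a minimal sketch of how `unique(member_id)` and `sum(play_actions)` could be evaluated per combination of the declared dimensions. The dictionary layout, the `evaluate` function, and the sample rows are assumptions made for illustration; the slides do not show how UMP actually parses or executes a definition.

```python
from collections import defaultdict

# Hypothetical in-memory form of the definition above: each metric is an
# (operator, column) pair taken from its formula string.
definition = {
    "name": "video",
    "dimensions": ["platform", "action_type"],
    "metrics": {
        "unique_viewers": ("unique", "member_id"),
        "play_actions": ("sum", "play_actions"),
    },
}

# Made-up rows standing in for the output of the metric script.
rows = [
    {"platform": "phone", "action_type": "click", "member_id": 1, "play_actions": 2},
    {"platform": "phone", "action_type": "click", "member_id": 1, "play_actions": 1},
    {"platform": "phone", "action_type": "click", "member_id": 2, "play_actions": 1},
    {"platform": "desktop", "action_type": "auto", "member_id": 3, "play_actions": 4},
]


def evaluate(definition, rows):
    """Group rows by the declared dimensions and apply each metric formula."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[d] for d in definition["dimensions"])
        groups[key].append(row)

    output = {}
    for key, members in groups.items():
        values = {}
        for metric, (op, column) in definition["metrics"].items():
            column_values = [m[column] for m in members]
            if op == "unique":
                values[metric] = len(set(column_values))
            elif op == "sum":
                values[metric] = sum(column_values)
            else:
                raise ValueError(f"unsupported operator: {op}")
        output[key] = values
    return output


result = evaluate(definition, rows)
for key, values in sorted(result.items()):
    print(key, values)
```

Note that `unique` is not additive: the unique viewers of two groups cannot simply be summed, which is one reason a formula is declared per metric rather than assumed to be a sum.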
![Page 21: Strata SG 2015: LinkedIn Self Serve Reporting Platform on Hadoop](https://reader036.vdocuments.net/reader036/viewer/2022081517/58890bdf1a28ab4a5c8b4ebb/html5/thumbnails/21.jpg)
UMP Data Flow
Diagram labels: UMP Monitor; Primary Data (tracking, databases, external); Metrics Script; Data Prep; agg; cube; dimension verify; UMP Raw Data; UMP Aggregated Data; HDFS + Pinot; Relevance; Experiment analysis; Ad-hoc; Dashboards; …
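The "agg cube" step in the data flow can be read as pre-computing a metric for every subset of dimensions, so a dashboard can ask for any slice without rescanning raw data. The following is a minimal sketch under that reading; the `cube` function, the `*` rollup marker, and the sample rows are assumptions, not UMP's implementation.

```python
from collections import Counter
from itertools import combinations

ALL = "*"  # marker meaning "rolled up over this dimension"


def cube(rows, dimensions, measure):
    """Sum `measure` for every combination of kept and rolled-up dimensions."""
    subsets = [set(c)
               for size in range(len(dimensions) + 1)
               for c in combinations(dimensions, size)]
    totals = Counter()
    for row in rows:
        for kept in subsets:
            key = tuple(row[d] if d in kept else ALL for d in dimensions)
            totals[key] += row[measure]
    return totals


# Made-up aggregated rows for the video example.
rows = [
    {"platform": "phone", "action_type": "click", "play_actions": 3},
    {"platform": "phone", "action_type": "auto", "play_actions": 1},
    {"platform": "desktop", "action_type": "auto", "play_actions": 4},
]

totals = cube(rows, ["platform", "action_type"], "play_actions")
for key in sorted(totals):
    print(key, totals[key])
```

The number of cells grows as 2 to the power of the dimension count times the value combinations, which is why a limit such as the roughly 30 dimensions per dataset matters. The sketch only covers additive measures; a unique count would need the raw ids or a sketch structure per cell.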
![Page 22: Strata SG 2015: LinkedIn Self Serve Reporting Platform on Hadoop](https://reader036.vdocuments.net/reader036/viewer/2022081517/58890bdf1a28ab4a5c8b4ebb/html5/thumbnails/22.jpg)
UMP by the numbers
First version in production since early 2014
Significant redesign in 2015
Total amount of data being scanned per day: Hundreds of TBs
Total number of metrics being computed: 2k+
Total number of scripts: ~400
Number of authors for these metrics: ~200
Maximum number of dimensions per dataset: ~30
Number of people responsible for upkeep of pipeline: 2
![Page 23: Strata SG 2015: LinkedIn Self Serve Reporting Platform on Hadoop](https://reader036.vdocuments.net/reader036/viewer/2022081517/58890bdf1a28ab4a5c8b4ebb/html5/thumbnails/23.jpg)
Learnings so far
Ease of onboarding
- Hard when you have > 1000 users with different skill sets
- Need great UX to complement developer-friendly alternatives
Single source of truth
- Not just a technology challenge
- Organization needs to rally around it
Operability
- Multi-tenant Hadoop pipeline with SLAs and QoS: hard
- Cost to Serve: managing metrics lifecycle is important
The Next Big Things
- Bridging streaming and batch
- Code-free metrics
- Sessions, Funnels, Cohorts
- Open source
![Page 24: Strata SG 2015: LinkedIn Self Serve Reporting Platform on Hadoop](https://reader036.vdocuments.net/reader036/viewer/2022081517/58890bdf1a28ab4a5c8b4ebb/html5/thumbnails/24.jpg)
Pinot
![Page 25: Strata SG 2015: LinkedIn Self Serve Reporting Platform on Hadoop](https://reader036.vdocuments.net/reader036/viewer/2022081517/58890bdf1a28ab4a5c8b4ebb/html5/thumbnails/25.jpg)
Capabilities
SQL-like interface (minus joins)
Sub-second query latency
Data load from Hadoop and Kafka
![Page 26: Strata SG 2015: LinkedIn Self Serve Reporting Platform on Hadoop](https://reader036.vdocuments.net/reader036/viewer/2022081517/58890bdf1a28ab4a5c8b4ebb/html5/thumbnails/26.jpg)
Pinot Data Flow
Diagram labels: Kafka; Samza Process; Hadoop; Pinot
Freshness labels: minutes; hour +
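Assuming the "minutes" label belongs to the Kafka path and "hour +" to the Hadoop path, a query has to combine fresh streaming data with older batch data without counting the overlap twice. One way to do that is a time boundary, sketched below; the `merged_count` function and the hourly buckets are hypothetical and only illustrate the idea.

```python
def merged_count(offline, realtime, boundary):
    """Combine batch and streaming counts without double counting.

    offline / realtime: lists of (hour, count) pairs.
    boundary: last hour fully covered by the batch (Hadoop) load.
    Batch data answers hours <= boundary, streaming data answers the rest.
    """
    batch_part = sum(count for hour, count in offline if hour <= boundary)
    stream_part = sum(count for hour, count in realtime if hour > boundary)
    return batch_part + stream_part


# Made-up hourly counts: hour 1 is present in both paths.
offline = [(0, 10), (1, 12)]
realtime = [(1, 12), (2, 7), (3, 5)]

total = merged_count(offline, realtime, boundary=1)
print(total)
```

Hour 1 appears in both inputs but is taken only from the batch side, so the total covers hours 0 through 3 exactly once.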
![Page 27: Strata SG 2015: LinkedIn Self Serve Reporting Platform on Hadoop](https://reader036.vdocuments.net/reader036/viewer/2022081517/58890bdf1a28ab4a5c8b4ebb/html5/thumbnails/27.jpg)
Pinot@LinkedIn
Site-facing Apps
Reporting dashboards
Monitoring
In production since 2012
Open source @ github.com/linkedin/pinot
![Page 28: Strata SG 2015: LinkedIn Self Serve Reporting Platform on Hadoop](https://reader036.vdocuments.net/reader036/viewer/2022081517/58890bdf1a28ab4a5c8b4ebb/html5/thumbnails/28.jpg)
Raptor
![Page 29: Strata SG 2015: LinkedIn Self Serve Reporting Platform on Hadoop](https://reader036.vdocuments.net/reader036/viewer/2022081517/58890bdf1a28ab4a5c8b4ebb/html5/thumbnails/29.jpg)
Standardize Visualization
Leverage
- Standalone app, with support for embedding
- Can use existing analytics backend: Pinot
Strategic
- Reduces dependency on 3rd party BI tools
- Closer integration with LinkedIn’s ecosystem of experimentation, anomaly detection solutions
![Page 30: Strata SG 2015: LinkedIn Self Serve Reporting Platform on Hadoop](https://reader036.vdocuments.net/reader036/viewer/2022081517/58890bdf1a28ab4a5c8b4ebb/html5/thumbnails/30.jpg)
Requirements
Support apps ecosystem
Core Visualization Capabilities
Metadata Integration
![Page 31: Strata SG 2015: LinkedIn Self Serve Reporting Platform on Hadoop](https://reader036.vdocuments.net/reader036/viewer/2022081517/58890bdf1a28ab4a5c8b4ebb/html5/thumbnails/31.jpg)
Raptor 1.0
First version built by 3 engineers in a quarter
Features
- Integration with UMP, Pinot
- Time series, bar charts, …
- Create, Publish, Clone, Discover Dashboards
Numbers
- Number of dashboards: ~100
- Weekly unique users: ~400
![Page 32: Strata SG 2015: LinkedIn Self Serve Reporting Platform on Hadoop](https://reader036.vdocuments.net/reader036/viewer/2022081517/58890bdf1a28ab4a5c8b4ebb/html5/thumbnails/32.jpg)
![Page 33: Strata SG 2015: LinkedIn Self Serve Reporting Platform on Hadoop](https://reader036.vdocuments.net/reader036/viewer/2022081517/58890bdf1a28ab4a5c8b4ebb/html5/thumbnails/33.jpg)
![Page 34: Strata SG 2015: LinkedIn Self Serve Reporting Platform on Hadoop](https://reader036.vdocuments.net/reader036/viewer/2022081517/58890bdf1a28ab4a5c8b4ebb/html5/thumbnails/34.jpg)
![Page 35: Strata SG 2015: LinkedIn Self Serve Reporting Platform on Hadoop](https://reader036.vdocuments.net/reader036/viewer/2022081517/58890bdf1a28ab4a5c8b4ebb/html5/thumbnails/35.jpg)
![Page 36: Strata SG 2015: LinkedIn Self Serve Reporting Platform on Hadoop](https://reader036.vdocuments.net/reader036/viewer/2022081517/58890bdf1a28ab4a5c8b4ebb/html5/thumbnails/36.jpg)
![Page 37: Strata SG 2015: LinkedIn Self Serve Reporting Platform on Hadoop](https://reader036.vdocuments.net/reader036/viewer/2022081517/58890bdf1a28ab4a5c8b4ebb/html5/thumbnails/37.jpg)
![Page 38: Strata SG 2015: LinkedIn Self Serve Reporting Platform on Hadoop](https://reader036.vdocuments.net/reader036/viewer/2022081517/58890bdf1a28ab4a5c8b4ebb/html5/thumbnails/38.jpg)
![Page 39: Strata SG 2015: LinkedIn Self Serve Reporting Platform on Hadoop](https://reader036.vdocuments.net/reader036/viewer/2022081517/58890bdf1a28ab4a5c8b4ebb/html5/thumbnails/39.jpg)
The Future for Raptor
Social Collaboration features
Intelligence
- Anomaly detection
- Dashboards You May Like
Embedding into data products
Open Source
![Page 40: Strata SG 2015: LinkedIn Self Serve Reporting Platform on Hadoop](https://reader036.vdocuments.net/reader036/viewer/2022081517/58890bdf1a28ab4a5c8b4ebb/html5/thumbnails/40.jpg)
A Few Good Hammers
Unified Metrics Platform
Pinot
Raptor
![Page 41: Strata SG 2015: LinkedIn Self Serve Reporting Platform on Hadoop](https://reader036.vdocuments.net/reader036/viewer/2022081517/58890bdf1a28ab4a5c8b4ebb/html5/thumbnails/41.jpg)
What we’re excited about
Unified Metrics Platform
Pinot
Raptor
Metadata Bus
![Page 42: Strata SG 2015: LinkedIn Self Serve Reporting Platform on Hadoop](https://reader036.vdocuments.net/reader036/viewer/2022081517/58890bdf1a28ab4a5c8b4ebb/html5/thumbnails/42.jpg)
Metadata driven e2e Optimizations
- Dynamic prioritization of data ingest
- Surface source data quality issues in dashboard
- Surface backfill status on dashboard
- Cascading deprecation of dashboards, computation and data sources through lineage
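Cascading deprecation through lineage can be sketched as a graph walk: once a dashboard is retired, any computation or data source whose consumers are all retired can be retired too. This is one possible reading of the bullet, not a description of LinkedIn's metadata system; `Tracking.ActionEvent` and `video_play.pig` come from the earlier example, while every other node name and the `cascade_deprecation` function are hypothetical.

```python
def cascade_deprecation(consumers, deprecated):
    """Return all nodes that can be deprecated.

    consumers: node -> set of direct downstream consumers.
    deprecated: nodes explicitly retired by their owners.
    A node is deprecated once every one of its consumers is deprecated.
    """
    deprecated = set(deprecated)
    changed = True
    while changed:
        changed = False
        for node, downstream in consumers.items():
            if node not in deprecated and downstream and downstream <= deprecated:
                deprecated.add(node)
                changed = True
    return deprecated


# Lineage from data sources through computation to dashboards.
consumers = {
    "Tracking.ActionEvent": {"video_play.pig"},
    "member_db": {"video_play.pig", "feed_script"},
    "video_play.pig": {"video_metrics"},
    "video_metrics": {"video_dashboard"},
    "feed_script": {"feed_dashboard"},
    "video_dashboard": set(),
    "feed_dashboard": set(),
}

retired = cascade_deprecation(consumers, {"video_dashboard"})
print(sorted(retired))
```

Here `member_db` stays alive because `feed_script` still reads it, which is the behaviour that keeps a shared source from being dropped while any consumer remains.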
![Page 43: Strata SG 2015: LinkedIn Self Serve Reporting Platform on Hadoop](https://reader036.vdocuments.net/reader036/viewer/2022081517/58890bdf1a28ab4a5c8b4ebb/html5/thumbnails/43.jpg)
Shirshanka Das @shirshanka
Catch me offline to chat about what we’re doing for
- Views on Hadoop
- Data Quality
- Metadata