Salesforce Confidential
LEGO: Data Driven Growth Hacking Powered by Big Data
June 2016
Kamal Duggireddy, Prashant Gokhale
About Us
Kamal Duggireddy
Kamal Duggireddy currently leads the Data Engineering, Product Data Science team at Salesforce.com. Prior to this, he served as Director, Big Data Architecture at American Express. Combining deep technical skills with business knowledge and strong execution experience, Kamal developed reference architectures and new enterprise-level capabilities with the Hadoop stack.
Prashant Gokhale
Prashant currently works on big data problems at Salesforce.com using Hadoop and its ecosystem components. Prior to this, he held several critical engineering positions at Yahoo, Cloudera, and Lookout.
The Use Case | Overview
Executives, Analysts, Product Managers
The Use Case | Flow
Ad-Hoc Requests
Predictive Data Apps
Data Engineering & Curation
Smart Data Dashboards (Salesforce Wave)
Advanced Analysis
Instrumentation
150+ Log Lines
Hadoop Data Processing
Traditional Data Warehouses Dimensions
The Journey | How it all started
Milestones | Along the way
Reusability Declarative Data Lake Data Dictionary
Self Service Automation
Security Visualization Governance
The Framework | Finally!
[Architecture diagram: Log Sources, Cloud Metrics, Kafka, Splunk, Files, Warehouse, Objects, Hadoop, Data Lake, Log Processing, Metadata, Flow Engine, Data Profiler, Data Science, Web App, Self Service, Datasets (Various grain), Cubes (Custom grain)]
Goals
Scalable: Process hundreds of billions of log lines.
Flexible: Handle thousands of log schemas. Support variable grain and transformations using custom code.
Data Quality: Automated data profiling, monitoring and alerting.
Self Service: Enable ad-hoc analysis.
Key Building Blocks
Log Processing Engine
• Declaratively define features and flows.
• Normalize data across multiple log lines.
• Custom code injection for data transformation.
Data Profiler
• Profile data at scale to detect anomalies.
Web App
• Interface to manage features and flows.
Job Automation Engine
• End-to-end automation from features/flows to curated data sets in Wave.
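The Data Profiler building block is described as profiling data to detect anomalies, but the slides do not say how anomalies are decided. One simple possibility, shown here as a sketch only, is to compare each metric in the current profile against a baseline profile and flag large relative changes. The function name `find_anomalies`, the 50% tolerance, and the sample numbers are all assumptions made for illustration.

```python
from typing import Dict, List


def find_anomalies(
    baseline: Dict[str, float],
    current: Dict[str, float],
    tolerance: float = 0.5,
) -> List[str]:
    """Return one alert per metric that moved more than `tolerance` (relative)."""
    alerts = []
    for metric, expected in baseline.items():
        observed = current.get(metric)
        if observed is None:
            alerts.append(f"{metric}: missing from current profile")
            continue
        # Use an exact comparison when the baseline is zero to avoid dividing by zero.
        if expected == 0:
            drifted = observed != 0
        else:
            drifted = abs(observed - expected) / abs(expected) > tolerance
        if drifted:
            alerts.append(f"{metric}: expected about {expected}, observed {observed}")
    return alerts


# Hypothetical profiles for the same field on two consecutive days.
yesterday = {"total": 1000.0, "nulls": 10.0, "distinct": 50.0}
today = {"total": 1050.0, "nulls": 400.0, "distinct": 48.0}
alerts = find_anomalies(yesterday, today)
```

Here only the null count is flagged, because it is the only metric that changed by more than half of its baseline value. A production system would likely use a longer history than one prior day.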
Log Processing Engine
logType=='X' and event=='Create Event' and page=='Home Landing',"Feat 1","eval_code(event.toUpperCase())",page,…..
logType=='ABC' and event=='Create Event' and page=='Home',"Feat 2","eval_code(event.substring(5))",event,…..
Usage log files + Feature definitions
Data Normalization, Data Cleansing, Data Transformation
Hive tables
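The feature definitions on this slide pair a match condition with a feature name, a transformation, and the fields to carry forward. The following Python sketch illustrates that idea; it is not the LEGO engine. The `FeatureRule` class, the `apply_rules` function, and the sample records are hypothetical. The `eval_code(...)` expressions are modelled as plain Python callables rather than injected code, and conditions are limited to field equality.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Iterator, List

Record = Dict[str, str]


@dataclass
class FeatureRule:
    """One declarative feature definition (hypothetical structure)."""
    match: Dict[str, str]               # field -> required value; all must match
    feature: str                        # feature name to emit
    transform: Callable[[Record], str]  # stands in for eval_code(...)
    keep: List[str]                     # fields copied to the output row


def apply_rules(records: Iterable[Record], rules: List[FeatureRule]) -> Iterator[Record]:
    """Yield one normalized row per (record, matching rule) pair."""
    for rec in records:
        for rule in rules:
            if all(rec.get(k) == v for k, v in rule.match.items()):
                row = {"feature": rule.feature, "value": rule.transform(rec)}
                row.update({f: rec.get(f, "") for f in rule.keep})
                yield row


RULES = [
    FeatureRule(
        match={"logType": "X", "event": "Create Event", "page": "Home Landing"},
        feature="Feat 1",
        transform=lambda r: r["event"].upper(),  # mirrors event.toUpperCase()
        keep=["page"],
    ),
    FeatureRule(
        match={"logType": "ABC", "event": "Create Event", "page": "Home"},
        feature="Feat 2",
        transform=lambda r: r["event"][5:],      # mirrors event.substring(5)
        keep=["event"],
    ),
]

# Made-up log records; the last one matches no rule and is dropped.
SAMPLE_LOGS = [
    {"logType": "X", "event": "Create Event", "page": "Home Landing"},
    {"logType": "ABC", "event": "Create Event", "page": "Home"},
    {"logType": "ABC", "event": "Delete Event", "page": "Home"},
]

rows = list(apply_rules(SAMPLE_LOGS, RULES))
```

Keeping the rules as data rather than code is what makes the approach declarative: adding a feature means adding a rule, not changing the processing loop.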
Data Profiler
Datasets across platform
HCatalog
MapReduce
Dataset Profile
Monitoring & alerting

Dataset Profile: An Example
Dataset    Field    Type  Total  Min  Max  Avg  # Nulls  # Distinct  Median  99th %tile  Top N
lego_feat  browser  STR   2.3B   7    63   25   1M       50          34      38          [.....]
lego_feat  url      STR   2.3B   20   223  50   0        5M          70      90          [.....]
...
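The profile columns above (total, min, max, avg, nulls, distinct, median, 99th percentile, top N) can be computed per field. The sketch below does this for one string field in a single process, whereas the slide names HCatalog and MapReduce for running at scale. It assumes that the numeric statistics for a STR field refer to string length, which the slide does not state, and it uses the nearest-rank method for the 99th percentile. The function name `profile_field` and the sample values are hypothetical.

```python
import math
import statistics
from collections import Counter
from typing import Dict, List, Optional


def profile_field(values: List[Optional[str]], top_n: int = 3) -> Dict[str, object]:
    """Profile one string field; length-based stats are an assumption."""
    non_null = [v for v in values if v is not None]
    lengths = sorted(len(v) for v in non_null)
    profile: Dict[str, object] = {
        "total": len(values),
        "nulls": len(values) - len(non_null),
        "distinct": len(set(non_null)),
        "top_n": Counter(non_null).most_common(top_n),
    }
    if lengths:
        # Nearest-rank percentile: smallest length with at least 99% of values at or below it.
        rank = max(1, math.ceil(0.99 * len(lengths)))
        profile.update(
            min=lengths[0],
            max=lengths[-1],
            avg=sum(lengths) / len(lengths),
            median=statistics.median(lengths),
            p99=lengths[rank - 1],
        )
    return profile


# Made-up values for a "browser" field, including one null.
browser_values = ["Chrome", "Chrome", "Firefox", None, "Safari", "Chrome"]
browser_profile = profile_field(browser_values)
```

At the volumes quoted in this deck, exact medians and distinct counts would be expensive, so a distributed implementation would probably rely on approximate algorithms instead.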
Everything put together
[Architecture diagram: Log Sources, Cloud Metrics, Kafka, Splunk, Files, Warehouse, Objects, Hadoop, Data Lake, Log Processing, Metadata, Flow Engine, Data Profiler, Data Science, Web App, Self Service, Datasets (Various grain), Cubes (Custom grain)]
Data Volumetrics
Metric                                            Total
Avg. volume of app logs processed (compressed)    100s of TB/month
Avg. number of jobs                               6000+/month
Avg. log size volume growth rate                  A lot!
Number of log record types                        1,000s
Number of fields                                  10s of 1,000s
200+ B Events / Day
500+ Features
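To put the daily figures in per-second terms, a quick back-of-envelope calculation helps. It reads "B" as billion, treats the "+" figures as lower bounds, and assumes a 30-day month; none of the derived rates appear on the slide.

```python
SECONDS_PER_DAY = 24 * 60 * 60

events_per_day = 200e9   # "200+ B events / day", taken as a lower bound
jobs_per_month = 6000    # "6000+ / month", also a lower bound
days_per_month = 30      # assumption for the estimate

events_per_second = events_per_day / SECONDS_PER_DAY
jobs_per_day = jobs_per_month / days_per_month

print(f"{events_per_second:,.0f} events/second")  # about 2.3 million
print(f"{jobs_per_day:.0f} jobs/day")             # 200
```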
Thank you
We are hiring!! www.salesforce.com/comapany/careers