Big Data and User Segmentation in the Mobile Context
DESCRIPTION
In these slides, we explore the unique challenges that mobile data present. The high cardinality, low signal-to-noise ratio and real-time needs have significant system implications. We outline how InMobi tackles these challenges. A specific Data Science use case is also presented: our approach to user segmentation, with a brief description of the challenges faced and our attempts to address them.
TRANSCRIPT
Big Data and User Segmentation in the Mobile Context
Rajiv Bhat
September 6, 2014
The world is going mobile
Source: Left image – 123RF.com. Right image – amadarose.co.uk/KPCB
Mobile's share of traffic is growing at 1.5X per year
Source: StatCounter Global Stats, 5/13. Note that PC-based Internet data bolstered by streaming.
InMobi: Overview
Global: 25 locations worldwide in <4 yrs.
(Faster on-ground global presence than Google, Facebook, Twitter & LinkedIn)
Scale: Reaching 750 million+ mobile consumers in 165+ countries (over 250 million in Asia)
Over 800 employees across multiple locations
Investors
($200M strategic investment in 2011)
Technology
65K+ requests/sec at peak load; 100 TB data generation/month (more than Twitter)
500+ Data Scientists, Engineers & Analysts across US, IN, JP (one of the largest Prod & Engg teams working on mobile advertising globally)
InMobi: Recognized as one of 50 disruptive companies in 2013
Total ad requests = 1.6 trillion (1,600,000,000,000)
Total ads served = 578 billion (578,000,000,000)
Peak ad requests per second = 120,000
Total # of events ingested in a day = 10 billion (10,000,000,000)
New data generated every day = 5.4 TB (5,400,000,000,000 bytes)
Total Hadoop cluster size = 1000 TB (1,000,000,000,000,000 bytes)
Total # of Hadoop jobs = 83 million (83,000,000)
How is Mobile Data Different?
1. Rich context
• GPS
• Accelerometer
• Gyroscope
• Digital compass
• Altimeter
• Health sensors?
High Cardinality
How is Mobile data different?
2. Short half-life
• User context changes rapidly
• Geo context half-life < 30 minutes
Real-time/fast feedback critical
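One way to act on a short half-life is to down-weight a signal by its age. The following is a minimal sketch, assuming exponential decay and using the 30-minute figure from the slide as the half-life. The function name and the choice to weight (rather than discard) stale signals are illustrative assumptions, not a description of InMobi's system.

```python
GEO_HALF_LIFE_MINUTES = 30.0  # upper bound quoted on the slide


def decay_weight(age_minutes: float,
                 half_life_minutes: float = GEO_HALF_LIFE_MINUTES) -> float:
    """Return a weight in (0, 1] that halves every `half_life_minutes`."""
    if age_minutes < 0:
        raise ValueError("age_minutes must be non-negative")
    return 0.5 ** (age_minutes / half_life_minutes)
```

Under this assumption a location fix that is one hour old carries a quarter of the weight of a fresh one, which is why slow batch feedback loses most of the signal.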
How is Mobile data different?
3. Controlled ecosystem
• Most activity happens on apps vs mobile browsers
• Controlled by iOS/Android
• Fast changing standards for user identification
• Privacy controls
Joins with back-dated data
How is Mobile data different?
4. ‘Snacking’
• Shorter attention span
• Far smaller sessions
• Higher number of visits
Weaker signals
How is Mobile data different?
5. Fragmentation
• Much lower shelf life of apps (compared to websites)
• Users appear across devices
Need for fingerprinting/hygiene algorithms
How is Mobile data different?
6. Lossy
• Varied network latency even on single device
• Devices/operators treat packets differently (http vs https)
Redundancy in data collection
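Redundant collection implies that the same event can arrive more than once and has to be collapsed downstream. A minimal sketch of that step, assuming every event carries a unique `event_id` field (a hypothetical name, not taken from the slides) and that the set of seen ids fits in memory:

```python
from typing import Dict, Iterable, Iterator


def deduplicate(events: Iterable[Dict]) -> Iterator[Dict]:
    """Yield each event once, keeping the first copy of every event_id."""
    seen = set()
    for event in events:
        event_id = event["event_id"]  # hypothetical unique key
        if event_id in seen:
            continue  # a redundant copy of an event already emitted
        seen.add(event_id)
        yield event
```

At the volumes quoted earlier, an in-memory set would not be enough; it stands in here for whatever bounded or distributed structure a real pipeline would use.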
How is Mobile data different?
7. Scale
• Mobile data is roughly 10X that of desktop
• Growing very rapidly
• Great opportunity
Truly scalable fast data platforms
How is Mobile data different?
Responding to scaling challenges (1/2)
Q: As your business scales across multiple geographies (and data centers), how do you maintain reporting SLAs?
Apache Falcon: Geo-distributed data processing model (open sourced); aggregation is done in local data centers and pushed to the center
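The data flow described for Falcon (aggregate locally, then push to the center) can be sketched in a few lines. This is only an illustration of the pattern, not Falcon's API: Falcon orchestrates feeds and processes, and the function names and the `(country, event_type)` key below are assumptions.

```python
from collections import Counter
from typing import Iterable, Tuple

Key = Tuple[str, str]  # hypothetical reporting key: (country, event_type)


def aggregate_locally(events: Iterable[Key]) -> Counter:
    """Runs in each data center: reduce raw events to counts per key."""
    return Counter(events)


def merge_at_center(partials: Iterable[Counter]) -> Counter:
    """Runs centrally: sum the small per-data-center aggregates."""
    total: Counter = Counter()
    for partial in partials:
        total.update(partial)
    return total
```

Only the compact aggregates cross data-center boundaries, which is what keeps the central reporting step cheap as raw volume grows.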
Q: How do you leverage different reporting systems with different cost to granularity tradeoffs?
Grill: Abstract view over tiered data stores with query redirection based on latency and cost objectives
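Query redirection over tiered stores can be pictured as a constrained choice: among the stores that satisfy the latency and granularity a query needs, pick the cheapest. The sketch below is an assumption about how such a rule might look; the `Store` fields, store names and numbers are hypothetical and are not taken from Grill.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Store:
    name: str
    latency_seconds: float           # typical query latency
    cost_per_query: float            # relative cost, arbitrary units
    finest_granularity_minutes: int  # smallest time bucket the store keeps


def pick_store(stores: List[Store], max_latency_seconds: float,
               granularity_minutes: int) -> Optional[Store]:
    """Cheapest store meeting both constraints, or None if none qualifies."""
    candidates = [
        s for s in stores
        if s.latency_seconds <= max_latency_seconds
        and s.finest_granularity_minutes <= granularity_minutes
    ]
    return min(candidates, key=lambda s: s.cost_per_query, default=None)
```

An hourly dashboard query with a few seconds of latency budget would land on a mid-tier store, while a per-minute drill-down would fall through to the slower raw tier.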
Q: How do you handle fast joins across billions of keys?
Stream enrichment: Enrich impressions, clicks and conversions with request information
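Stream enrichment replaces a large batch join with a per-event lookup. A minimal sketch, assuming each downstream event carries a `request_id` (hypothetical field name) and using a plain dict as a stand-in for the key-value store that would hold request records at real scale:

```python
from typing import Dict, Iterable, Iterator


def enrich(events: Iterable[Dict],
           requests_by_id: Dict[str, Dict]) -> Iterator[Dict]:
    """Attach the originating request record to each event as it streams by."""
    for event in events:
        # None marks an event whose request has not been seen (yet).
        request = requests_by_id.get(event["request_id"])
        yield {**event, "request": request}
```

Because the request context travels with the event from here on, later jobs can aggregate clicks and conversions by request attributes without joining back to the request logs.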
Responding to scaling challenges (2/2)
Q: How do you do data collection as a service at scale with reliability, predictability, varying SLAs and audit?
Conduit and Pintail: Reliable and highly scalable data collection technology. Allows both batch and streaming consumption (open sourced)
Q: How do you offer linked data analytics at scale?
ShadowFax: A homegrown graph database for OLAP analytics (currently being open sourced)
Q: How do you crunch user data at scale while still maintaining user privacy?
Hybrid Map Reduce: Map Reduce on a hybrid runtime of Mobile and Hadoop. Map on mobile, reduce on Hadoop
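The split "map on mobile, reduce on Hadoop" can be illustrated as follows. The slides do not say what the map step emits, so this sketch assumes the privacy benefit comes from the device sending only per-category counts while raw events stay on the handset. The function names and the `category` field are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable


def map_on_device(raw_events: Iterable[Dict]) -> Dict[str, int]:
    """Runs on the handset: collapse raw events into per-category counts."""
    return dict(Counter(event["category"] for event in raw_events))


def reduce_on_cluster(device_outputs: Iterable[Dict[str, int]]) -> Dict[str, int]:
    """Runs on Hadoop: sum the counts received from many devices."""
    totals: Counter = Counter()
    for output in device_outputs:
        totals.update(output)
    return dict(totals)
```

In this arrangement the cluster never receives timestamps, locations or other raw fields, only the small aggregate each device chooses to emit.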
Information consumption is undergoing another discontinuity
Broadcast-based consumption (1436-2015)
• Books/Radio/TV
• One-way
Pull-based consumption (1990-2020)
• Search
• Intent specified
Contextual consumption (2010-20XX)
• Mobile advertising?
• Just-in-time information delivery
• Context-aware environment
Contextual advertising is a reality
• Mobile is ubiquitous
• Digitization of the offline world – PoIs to user-specific tagging
• Location information is available, which is the equivalent of placing cookies on the server
• Data density is actionable in pockets
• Contextual advertising/information delivery now possible. E.g. Amazon’s anticipatory shipping is a step in that direction
• In the world going forward, wearable devices will continue to have location and users will expect the cloud to understand the context
User Segmentation Example
Problem statement
Segment users based on their interests/behavior
Deliver better CVR/CPDs for the advertiser
Optimum mapping of demand to supply
Data
● User meta data
● Engagement data (ad requests, clicks, impressions, downloads, etc.)
● Geo location data
● Device data
User Segmentation: Challenges
● Scalable segment generation (no. of segments and users per segment)
● Frequency of generation (time required per segment)
● Choosing the target variable for prediction (e.g. CVR or CTR)
● Should be a measure of user's affinity for the segment type
● Sparse user data
● Validation Metrics
● Feature Inclusion
Pipeline
Modelling
User Segmentation: Steps
Feature Generation: Joins over multiple data cubes and meta data
Preprocessing: Data cleaning and outlier removal
Training and prediction: Sampling, training a model for predicting segment probabilities, and segment formation
User Segmentation: Feature Generation
● Generate user-based features at large volumes (per day: >10 billion events, 5.4 TB) for training models
● Challenge: Want current event data at training time
● Events (clicks, downloads) however arrive long after original impression
● Use real-time logs: heavy processing!
● Our approach: create static snapshots where possible, aggregate later
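The "static snapshots, aggregate later" approach can be sketched as two steps: freeze per-day counts once, then sum whichever snapshots fall in the training window. The daily granularity, the `(user_id, event_type)` key and the function names are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

Key = Tuple[str, str]  # hypothetical key: (user_id, event_type)


def build_snapshot(day_events: Iterable[Key]) -> Counter:
    """Freeze one day's events into per-user, per-event-type counts."""
    return Counter(day_events)


def window_features(snapshots: List[Counter]) -> Dict[str, Dict[str, int]]:
    """Aggregate several daily snapshots into per-user feature counts."""
    per_user: Dict[str, Counter] = {}
    for snapshot in snapshots:
        for (user_id, event_type), count in snapshot.items():
            per_user.setdefault(user_id, Counter())[event_type] += count
    return {user_id: dict(counts) for user_id, counts in per_user.items()}
```

A click that arrives a day after its impression simply lands in the next day's snapshot, and the window aggregate still sees both without reprocessing the raw logs.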
Tools
Apache Falcon - job synchronization, scheduling, retention, etc.
Conduit & Pintail - real-time streaming & transport across DCs
Apache Pig - where possible, fewer lines of code than MR
User Segmentation: Preprocessing
Challenges
● Unusual user data due to tracking issues
● Accommodating backfill of data
● Accommodating signals with high latency (conversions)
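As an illustration of the first challenge, a cleaning pass might drop user records whose counts are internally inconsistent or implausibly large. The two rules and the threshold below are assumptions chosen for the sketch. The slides do not specify which checks are actually applied.

```python
from typing import Dict, List


def remove_outliers(users: List[Dict],
                    max_requests_per_day: int = 10_000) -> List[Dict]:
    """Keep only user records that pass two simple sanity checks."""
    clean = []
    for user in users:
        if user["clicks"] > user["impressions"]:
            continue  # more clicks than impressions suggests a tracking issue
        if user["requests"] > max_requests_per_day:
            continue  # hypothetical cap on plausible activity per user
        clean.append(user)
    return clean
```

A production threshold would more likely be derived from the observed distribution (for example a high percentile) than fixed by hand.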
User Segmentation: Training
● Optimum duration of user behavior for training (days, weeks, months)
o Data generation/storage costs increase with duration
o Shorter-duration data might miss seasonal patterns
● Target variable can be a conversion- or click-based metric (CVR or CTR)
o Predicting user behavior towards a segment in future
Ex: Predicting CVR for next week using behavior features for previous week
More suited towards performance advertising campaigns
o Modelling current behavior (categorization of current behavior)
Ex: Categorizing the user based on current week’s behavior
More suited for brand advertising campaigns
● Number of users with conversions/clicks is much smaller than the number of users without conversions/clicks
o Class imbalance problem
o Sampling strategies - stratified sampling
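Stratified sampling for the class imbalance above can be sketched as sampling each label group at its own rate, for example keeping every converter and a fraction of non-converters. The function name, the per-label `rates` mapping and the fixed seed are illustrative assumptions.

```python
import random
from collections import defaultdict
from typing import Dict, Hashable, List


def stratified_sample(rows: List[Dict], label_key: str,
                      rates: Dict[Hashable, float], seed: int = 0) -> List[Dict]:
    """Sample each label group (stratum) at its own rate in [0, 1]."""
    rng = random.Random(seed)  # seeded so the sample is reproducible
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)

    sample: List[Dict] = []
    for label, group in by_label.items():
        rate = rates.get(label, 1.0)  # labels without a rate are kept whole
        if not 0.0 <= rate <= 1.0:
            raise ValueError(f"rate for {label!r} must be in [0, 1]")
        sample.extend(rng.sample(group, round(len(group) * rate)))
    return sample
```

If the negatives are downsampled this way, predicted probabilities are biased upwards and need to be recalibrated (or the rows reweighted) before they are read as CVR or CTR estimates.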
User Segmentation: Training
● User data for mobile is much sparser than for web.
o Define dense and sparse data.
o Separate models for dense and sparse data.
o Train only on dense data
Loses the information in sparse data
o Augment the sparse data using additional meta features for the user.
OS, device, country and other demographic details
● Modelling approaches:
o Linear/Logistic Regression, Random Forests
Predicts a probability of membership for a segment
o Collaborative Filtering/Clustering Algorithms
Put user into segments based on similarity to current segment users
● Hybrid: Using clustering for augmenting sparse data and modelling on the results
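The dense/sparse distinction and the meta-feature augmentation can be sketched as a feature-building step. The event-count threshold, field names and one-hot encoding are assumptions for illustration. The slides leave the definition of "dense" open.

```python
from typing import Dict

DENSE_MIN_EVENTS = 20  # hypothetical cut-off between sparse and dense users
META_KEYS = ("os", "device", "country")


def is_dense(user: Dict) -> bool:
    """A user is 'dense' if enough behavioral events were observed."""
    return user["n_events"] >= DENSE_MIN_EVENTS


def build_features(user: Dict) -> Dict[str, float]:
    """Behavioral features, plus one-hot meta features for sparse users."""
    features = dict(user["behavior"])
    if not is_dense(user):
        for key in META_KEYS:
            features[f"{key}={user[key]}"] = 1.0
    return features
```

The same `is_dense` test could instead route users to separate dense and sparse models, which is the other option listed above.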
User Segmentation: Training
● In-memory tools (Spark MLlib, H2O) vs. Hadoop MapReduce tools (Mahout)
o Machine learning models are iterative in nature, i.e. they access the same data blocks multiple times
o Less overhead in reading data if it is present in memory
o In-memory tools are better for training on “relatively” smaller data sets.
o We found Spark easier to use than H2O due to sufficient debugging documentation and community support
● Hadoop Streaming: Using any executable script as mapper or reducer.
o Useful for quick deployment in languages such as Python and shell
● Dynamic Segmentation vs Static Segmentation
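To make the Hadoop Streaming bullet concrete, here is a minimal sketch of a mapper and reducer that count clicks per user. The tab-separated `user_id<TAB>event_type` input format is an assumption. The functions take and return iterables of lines so they can be tried without a cluster.

```python
from itertools import groupby
from typing import Iterable, Iterator


def mapper(lines: Iterable[str]) -> Iterator[str]:
    """Emit 'user_id<TAB>1' for every click line."""
    for line in lines:
        user_id, event_type = line.rstrip("\n").split("\t")
        if event_type == "click":
            yield f"{user_id}\t1"


def reducer(sorted_lines: Iterable[str]) -> Iterator[str]:
    """Sum counts per user; input must already be sorted by key."""
    parsed = (line.rstrip("\n").split("\t") for line in sorted_lines)
    for user_id, group in groupby(parsed, key=lambda pair: pair[0]):
        yield f"{user_id}\t{sum(int(value) for _, value in group)}"
```

In an actual streaming job each function would live in its own script that reads `sys.stdin` and prints to standard output. Hadoop sorts the mapper output by key before it reaches the reducer, which is what the `sorted(...)` call stands in for when running locally.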
Big Data at InMobi
Consume
Create
Contribute
InMobi
Apache Falcon
Conduit
Pintail
Apache Hadoop
Apache Hive
Apache Oozie
Apache HBase
Apache Cassandra
MongoDB
Redis
Druid
Apache ZooKeeper
Apache Flume
Apache Storm
Scribe
Apache Ambari
Thanks
Questions? [email protected]