Big Data and User Segmentation in the Mobile Context

Rajiv Bhat, September 6, 2014

The world is going mobile

Source: Left image – 123RF.com. Right image – amadarose.co.uk/KPCB

Mobile's share of traffic is growing at 1.5X per year

Source: StatCounter Global Stats, 5/13. Note that PC-based Internet data bolstered by streaming.

InMobi: Overview

Global: 25 locations worldwide in <4 yrs.
(Faster on-ground global presence than Google, Facebook, Twitter & LinkedIn)

Scale: Reaching 750 million+ mobile consumers in 165+ countries (over 250 million in Asia)
Over 800 employees across multiple locations

Investors
($200M strategic investment in 2011)

Technology: 65K+ requests/sec at peak load; 100 TB data generation/month (more than Twitter)
500+ Data Scientists, Engineers & Analysts across US, IN, JP (one of the largest Prod & Engg teams working on mobile advertising globally)

InMobi: Recognized as one of 50 disruptive companies in 2013

Total ad requests = 1.6 trillion

Total ads served = 578 billion

Peak ad requests per second = 120,000

Total # of events ingested in a day = 10 billion

New data generated every day = 5.4 TB

Total Hadoop cluster size = 1000 TB

Total # of Hadoop jobs = 83 million

How is Mobile Data Different?

1. Rich context
• GPS
• Accelerometer
• Gyroscope
• Digital compass
• Altimeter
• Health sensors?
Implication: High cardinality

2. Short half-life
• User context changes rapidly
• Geo context half-life < 30 minutes
Implication: Real-time/fast feedback is critical
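One way to act on a short half-life is to down-weight a signal by its age. The sketch below is only an illustration: it assumes plain exponential decay and takes the 30-minute figure from the slide as the half-life; `decay_weight` is a hypothetical helper, not something named in the talk.

```python
def decay_weight(age_minutes: float, half_life_minutes: float = 30.0) -> float:
    """Weight of a signal seen `age_minutes` ago, assuming exponential decay."""
    if age_minutes < 0:
        raise ValueError("age_minutes must be non-negative")
    # The weight halves every `half_life_minutes`.
    return 0.5 ** (age_minutes / half_life_minutes)


# Print how quickly a location fix loses weight as it ages.
for age in (0, 30, 60, 120):
    print(age, round(decay_weight(age), 4))
```

With a half-life this short, a location fix that is two hours old carries only about 6% of its original weight, which is why fast feedback loops matter.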


3. Controlled ecosystem
• Most activity happens on apps vs. mobile browsers
• Controlled by iOS/Android
• Fast-changing standards for user identification
• Privacy controls
Implication: Joins with back-dated data


4. ‘Snacking’
• Shorter attention span
• Far smaller sessions
• Higher number of visits
Implication: Weaker signals


5. Fragmentation
• Much lower shelf life of apps (compared to websites)
• Users appear across devices
Implication: Need for fingerprinting/hygiene algorithms


6. Lossy
• Varied network latency even on a single device
• Devices/operators treat packets differently (http vs. https)
Implication: Redundancy in data collection
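If the same event is collected redundantly over more than one channel, the copies have to be collapsed again downstream. A minimal sketch, assuming every event carries a unique `event_id` (a hypothetical field; the slides do not describe the actual event schema):

```python
from typing import Dict, Iterable, List


def deduplicate(events: Iterable[dict]) -> List[dict]:
    """Keep the first copy of each event, identified by its event_id."""
    seen: Dict[str, dict] = {}
    for event in events:
        # setdefault stores the event only if this id has not been seen yet.
        seen.setdefault(event["event_id"], event)
    return list(seen.values())


received = [
    {"event_id": "e1", "channel": "http"},
    {"event_id": "e1", "channel": "https"},  # redundant copy of e1
    {"event_id": "e2", "channel": "https"},  # the http copy of e2 was lost
]
unique = deduplicate(received)
```

Event `e2` survives only because it was sent twice, and `e1` is counted once even though both copies arrived.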


7. Scale
• Mobile data is roughly 10X that of desktop
• Growing very rapidly
• Great opportunity
Implication: Truly scalable fast data platforms


Responding to scaling challenges (1/2)

Q: As your business scales across multiple geographies (and data centers), how do you maintain reporting SLAs?

Apache Falcon: Geo-distributed data processing model (open sourced); aggregation is done in local data centers and pushed to the center
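The "aggregate locally, push to center" idea can be sketched in plain Python. This is not the Apache Falcon API; the function names and the campaign keys are made up for illustration, and the point is only that each data center ships compact partial totals rather than raw events.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def aggregate_locally(events: Iterable[Tuple[str, int]]) -> Counter:
    """Runs inside one data center: collapse raw (key, count) events into totals."""
    totals: Counter = Counter()
    for key, count in events:
        totals[key] += count
    return totals


def merge_at_center(partials: List[Counter]) -> Counter:
    """Runs centrally: add up the partial totals pushed by each data center."""
    merged: Counter = Counter()
    for partial in partials:
        merged.update(partial)  # Counter.update adds counts
    return merged


dc_us = aggregate_locally([("campaign_a", 3), ("campaign_b", 1), ("campaign_a", 2)])
dc_in = aggregate_locally([("campaign_a", 4), ("campaign_c", 7)])
report = merge_at_center([dc_us, dc_in])
```

This works because sums are associative: the central merge gives the same totals as aggregating all raw events in one place, while far less data crosses data centers.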

Q: How do you leverage different reporting systems with different cost-to-granularity tradeoffs?

Grill: Abstract view over tiered data stores with query redirection based on latency and cost objectives
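Query redirection over tiered stores can be pictured as a small routing rule. The sketch below is not Grill's interface: the `Store` fields, the three example tiers and their numbers are all assumptions. It picks the cheapest store that holds fine enough data and meets the latency objective.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Store:
    name: str
    latency_ms: int        # typical query latency (assumed figure)
    cost_per_query: float  # relative cost (assumed figure)
    finest_grain: str      # finest granularity held: "raw", "hourly" or "daily"


GRAIN_RANK = {"raw": 0, "hourly": 1, "daily": 2}


def pick_store(stores: List[Store], grain: str, max_latency_ms: int) -> Optional[Store]:
    """Cheapest store that can answer at `grain` within the latency objective."""
    candidates = [
        s for s in stores
        if GRAIN_RANK[s.finest_grain] <= GRAIN_RANK[grain]
        and s.latency_ms <= max_latency_ms
    ]
    return min(candidates, key=lambda s: s.cost_per_query, default=None)


stores = [
    Store("memory_cube", latency_ms=50, cost_per_query=5.0, finest_grain="daily"),
    Store("columnar_db", latency_ms=2000, cost_per_query=2.0, finest_grain="hourly"),
    Store("hadoop_raw", latency_ms=600000, cost_per_query=1.0, finest_grain="raw"),
]
dashboard = pick_store(stores, grain="daily", max_latency_ms=100)
adhoc = pick_store(stores, grain="daily", max_latency_ms=5000)
```

Here the interactive dashboard query lands on the in-memory tier, while the same question with a looser latency objective is redirected to the cheaper columnar tier.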

Q: How do you handle fast joins across billions of keys?

Stream enrichment: Enrich impressions, clicks and conversions with request information
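Stream enrichment replaces a later batch join with a lookup at the moment an event arrives. A minimal sketch, assuming request details are held in a key-value lookup keyed by `request_id`; the field names are hypothetical, and a plain dict stands in for whatever store is used in production.

```python
from typing import Dict, Iterable, Iterator


def enrich(events: Iterable[dict], requests_by_id: Dict[str, dict]) -> Iterator[dict]:
    """Attach request fields to each impression/click/conversion as it streams by."""
    for event in events:
        request = requests_by_id.get(event["request_id"])
        if request is None:
            # Request not seen (yet): pass the event through, flagged for later repair.
            yield {**event, "enriched": False}
        else:
            yield {**event, **request, "enriched": True}


requests_by_id = {"r1": {"country": "IN", "os": "android"}}
stream = [
    {"type": "click", "request_id": "r1"},
    {"type": "conversion", "request_id": "r9"},
]
enriched = list(enrich(stream, requests_by_id))
```

Once events carry the request fields themselves, downstream aggregation no longer needs to join against the request log.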

Responding to scaling challenges (2/2)

Q: How do you do data collection as a service at scale with reliability, predictability, varying SLAs and audit?

Conduit and Pintail: Reliable and highly scalable data collection technology. Allows both batch and streaming consumption (open sourced)

Q: How do you offer linked data analytics at scale?

ShadowFax: A homegrown graph database for OLAP analytics (currently being open sourced)

Q: How do you crunch user data at scale while still maintaining user privacy?

Hybrid Map Reduce: Map Reduce on a hybrid runtime of mobile and Hadoop. Map on mobile, reduce on Hadoop
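The slide gives no implementation detail, so the following is only one possible reading of "map on mobile, reduce on Hadoop". The map step runs on the handset and emits coarse (category, 1) pairs, so raw usage never leaves the device; the reduce step sums the pairs on the cluster. All names and categories are illustrative assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_on_device(app_categories_used: Iterable[str]) -> List[Tuple[str, int]]:
    """Runs on the handset: emit one (category, 1) pair per distinct category."""
    return [(category, 1) for category in sorted(set(app_categories_used))]


def reduce_on_cluster(pairs: Iterable[Tuple[str, int]]) -> Dict[str, int]:
    """Runs on the cluster: count how many devices reported each category."""
    totals: Dict[str, int] = defaultdict(int)
    for category, count in pairs:
        totals[category] += count
    return dict(totals)


device_1 = map_on_device(["games", "games", "news"])
device_2 = map_on_device(["games", "travel"])
audience = reduce_on_cluster(device_1 + device_2)
```

The cluster sees only how many devices reported each category, not how often or when an individual used an app.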

Information consumption is undergoing another discontinuity

Broadcast based consumption (1436-2015)
• Books/Radio/TV
• One-way

Pull based consumption (1990 – 2020)
• Search
• Intent specified

Contextual consumption (2010-20XX)
• Mobile advertising?
• Just in time information delivery
• Context aware environment

Contextual advertising is a reality

• Mobile is ubiquitous
• Digitization of the offline world – PoIs to user-specific tagging
• Location information is available, which is the equivalent of placing cookies on a server
• Data density is actionable in pockets
• Contextual advertising/information delivery is now possible, e.g. Amazon’s anticipatory shipping is a step in that direction
• Going forward, wearable devices will continue to have location, and users will expect the cloud to understand the context

User Segmentation Example

Problem statement
● Segment users based on their interests/behavior
● Deliver better CVR/CPDs for the advertiser
● Optimum mapping of demand to supply

Data
● User meta data
● Engagement data (ad requests, clicks, impressions, downloads etc.)
● Geo location data
● Device data

User Segmentation: Challenges

Pipeline
● Scalable segment generation (no. of segments and users per segment)
● Frequency of generation (time required per segment)

Modelling
● Choosing the target variable for prediction (e.g. CVR or CTR)
  o Should be a measure of the user's affinity for the segment type
● Sparse user data
● Validation metrics
● Feature inclusion

User Segmentation: Steps

Feature Generation: Joins over multiple data cubes and meta data

Preprocessing: Data cleaning and outlier removal

Training and prediction: Sampling, training a model for predicting segment probabilities, and segment formation

User Segmentation: Feature Generation

● Generate user based features at large volumes (per day: >10 billion events, 5.4 TB) for training models
● Challenge: want current event data at training time
● However, events (clicks, downloads) arrive long after the original impression
● Use real-time logs: heavy processing!
● Our approach: create static snapshots where possible, aggregate later

Tools
● Apache Falcon - job synchronization, scheduling, retention, etc.
● Conduit & Pintail - real-time streaming & transport across DCs
● Apache Pig - where possible, fewer lines of code than MR
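The "static snapshots, aggregate later" approach can be sketched as follows. This is a toy illustration, not the production pipeline (which the slide says runs on Falcon, Conduit/Pintail and Pig): each day's log is summarised once into a per-user snapshot, and features are built by summing snapshots over a window. Function names and event types are assumptions.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple

Snapshot = Dict[str, Dict[str, int]]


def build_snapshot(day_events: Iterable[Tuple[str, str]]) -> Snapshot:
    """Summarise one day's (user_id, event_type) log into per-user counts."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for user_id, event_type in day_events:
        counts[user_id][event_type] += 1
    return {user: dict(c) for user, c in counts.items()}


def aggregate(snapshots: List[Snapshot]) -> Snapshot:
    """Sum already-built daily snapshots into features for a training window."""
    features: Dict[str, Counter] = defaultdict(Counter)
    for snapshot in snapshots:
        for user_id, counts in snapshot.items():
            features[user_id].update(counts)
    return {user: dict(c) for user, c in features.items()}


day_1 = build_snapshot([("u1", "impression"), ("u1", "impression"), ("u2", "impression")])
day_2 = build_snapshot([("u1", "click"), ("u2", "impression")])  # u1's click arrives a day late
features = aggregate([day_1, day_2])
```

A click that arrives a day after its impression simply lands in the next day's snapshot, so earlier snapshots never need to be rebuilt.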

User Segmentation: Preprocessing

Challenges
● Unusual user data due to tracking issues
● Accommodating backfill of data
● Accommodating signals with high latency (conversions)
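A simple rule-based filter is one way to screen out unusual user data before training. The rules and the 500-click cap below are arbitrary illustrative assumptions, not thresholds from the talk.

```python
from typing import Dict


def is_plausible(stats: Dict[str, int], max_daily_clicks: int = 500) -> bool:
    """Return False for counts that suggest a tracking problem rather than a real user."""
    impressions = stats["impressions"]
    clicks = stats["clicks"]
    if impressions < 0 or clicks < 0:
        return False  # negative counts can only come from bad data
    if clicks > impressions:
        return False  # more clicks than impressions were recorded
    return clicks <= max_daily_clicks  # hypothetical cap on daily clicks


users = {
    "u1": {"impressions": 40, "clicks": 2},
    "u2": {"impressions": 5, "clicks": 9},         # more clicks than impressions
    "u3": {"impressions": 90000, "clicks": 7000},  # implausibly active
}
clean = {uid: stats for uid, stats in users.items() if is_plausible(stats)}
```

In practice the cut-offs would be derived from the observed distribution, for example a high percentile of daily clicks, rather than fixed by hand.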

User Segmentation: Training

● Optimum duration of user behavior for training (days, weeks, months)
  o Data generation/storage costs increase with duration
  o Shorter-duration data might miss seasonal patterns
● Target variable can be a conversion- or click-based metric (CVR or CTR)
  o Predicting user behavior towards a segment in the future
     Ex: Predicting CVR for next week using behavior features for the previous week
     More suited to performance advertising campaigns
  o Modelling current behavior (categorization of current behavior)
     Ex: Categorizing the user based on the current week’s behavior
     More suited to brand advertising campaigns
● Number of users with conversions/clicks is much smaller than the number of users without conversions/clicks
  o Class imbalance problem
  o Sampling strategies - stratified sampling
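For the class-imbalance point, stratified sampling means sampling each class at its own rate. The sketch below keeps every converting user and 10% of the rest; the rates, the `converted` field and the toy data are assumptions for illustration.

```python
import random
from collections import defaultdict
from typing import Dict, List


def stratified_sample(rows: List[dict], rates: Dict[int, float], seed: int = 0) -> List[dict]:
    """Sample each class (stratum) separately at the rate given for its label."""
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    by_label: Dict[int, List[dict]] = defaultdict(list)
    for row in rows:
        by_label[row["converted"]].append(row)
    sample: List[dict] = []
    for label, group in by_label.items():
        k = round(len(group) * rates[label])
        sample.extend(rng.sample(group, k))
    return sample


# 10 converters among 1,000 users: a 1% positive rate.
rows = [{"user": i, "converted": 1} for i in range(10)]
rows += [{"user": i, "converted": 0} for i in range(10, 1000)]
sample = stratified_sample(rows, rates={1: 1.0, 0: 0.1})
```

The sampled set has 10 positives and 99 negatives, raising the positive rate from 1% to roughly 9%. Predicted probabilities from a model trained on such a sample are shifted upward and need recalibration before being read as true CVR.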


● User data for mobile is much sparser than for web
  o Define dense and sparse data
  o Separate models for dense and sparse data
  o Train only on dense data
     Loses the information in sparse data
  o Augment the sparse data using additional meta features for the user
     OS, device, country and other demographic details
● Modelling approaches:
  o Linear/logistic regression, random forests
     Predict a probability of membership for a segment
  o Collaborative filtering/clustering algorithms
     Put the user into segments based on similarity to current segment users
● Hybrid: using clustering to augment sparse data and modelling on the results
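To make "predicts a probability of membership for a segment" concrete, here is a dependency-free logistic regression trained by batch gradient descent. It is a toy sketch: the single feature (a normalised engagement score), the data and the hyperparameters are invented, and a real pipeline would use a library such as Spark MLlib.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict(weights: Sequence[float], bias: float, x: Sequence[float]) -> float:
    """Probability that a user with features `x` belongs to the segment."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


def train(rows: List[Sequence[float]], labels: List[int],
          lr: float = 0.5, epochs: int = 2000) -> Tuple[List[float], float]:
    """Fit weights and bias by batch gradient descent on the log loss."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(rows, labels):
            error = predict(weights, bias, x) - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(rows)
    return weights, bias


# Hypothetical feature: engagement score in [0, 1]; label: segment membership.
rows = [[0.0], [0.1], [0.2], [0.8], [0.9], [1.0]]
labels = [0, 0, 0, 1, 1, 1]
weights, bias = train(rows, labels)
p_high = predict(weights, bias, [0.9])
p_low = predict(weights, bias, [0.1])
```

A segment can then be formed by thresholding these probabilities or by taking the top-scoring users.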


● In-memory tools (Spark MLlib, H2O) vs. Hadoop MapReduce tools (Mahout)
  o Machine learning models are iterative in nature, i.e. they access the same data blocks multiple times
  o Less overhead in reading data if it is present in memory
  o In-memory tools are better for training on “relatively” smaller data sets
  o We found Spark easier to use than H2O due to sufficient debugging documentation and community support
● Hadoop Streaming: using any executable script as mapper or reducer
  o Useful for quick deployment in languages such as Python and shell
● Dynamic segmentation vs. static segmentation
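As an illustration of the Hadoop Streaming point, a mapper and reducer can be ordinary Python working on tab-separated text lines. The sketch assumes a hypothetical input format of `user_id<TAB>event_type` and counts clicks per user. In a real job each function would sit in its own script reading `sys.stdin`, with Hadoop doing the sort between them; here `sorted()` stands in for that shuffle so the example runs locally.

```python
from itertools import groupby
from typing import Iterable, Iterator


def mapper(lines: Iterable[str]) -> Iterator[str]:
    """Emit 'user_id<TAB>1' for every click line."""
    for line in lines:
        user_id, event_type = line.rstrip("\n").split("\t")
        if event_type == "click":
            yield f"{user_id}\t1"


def reducer(sorted_lines: Iterable[str]) -> Iterator[str]:
    """Sum the values per key; input must already be sorted by key."""
    parsed = (line.split("\t") for line in sorted_lines)
    for user_id, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{user_id}\t{sum(int(value) for _, value in group)}"


log = ["u2\tclick\n", "u1\timpression\n", "u1\tclick\n", "u2\tclick\n"]
result = list(reducer(sorted(mapper(log))))
```

Because the logic is plain text in, plain text out, the same functions can be tested on a laptop before being deployed to the cluster.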

Big Data at InMobi

Consume

Create

Contribute

InMobi

Apache Falcon

Conduit

Pintail

Apache Hadoop

Apache Hive

Apache Oozie

Apache HBase, Apache Cassandra, MongoDB, Redis, Druid, Apache ZooKeeper, Apache Flume, Apache Storm, Scribe, Apache Ambari

Thanks

Questions? [email protected]