Big Data and User Segmentation in the Mobile Context Rajiv Bhat September 6 2014

Upload: inmobi-technology

Post on 01-Jul-2015


DESCRIPTION

In these slides, we explore the unique challenges that mobile data presents. The high cardinality, low signal-to-noise ratio, and real-time needs have significant system implications, and we outline how InMobi tackles them. We also present a specific data science use case, our approach to user segmentation, with a brief description of the challenges faced and our attempts to address them.

TRANSCRIPT

Page 1: Big Data and User Segmentation in Mobile Context

Big Data and User Segmentation in the Mobile Context

Rajiv Bhat, September 6, 2014

Page 2: Big Data and User Segmentation in Mobile Context

The world is going mobile

Source: Left image – 123RF.com. Right image – amadarose.co.uk/KPCB

Page 3: Big Data and User Segmentation in Mobile Context

Mobile's share of traffic is growing at 1.5X per year

Source: StatCounter Global Stats, 5/13. Note that PC-based Internet data bolstered by streaming.

Page 4: Big Data and User Segmentation in Mobile Context

InMobi: Overview

Global: 25 locations worldwide in <4 yrs. (Faster on-ground global presence than Google, Facebook, Twitter & LinkedIn)

Scale: Reaching 750 million+ mobile consumers in 165+ countries (over 250 million in Asia). Over 800 employees across multiple locations.

Investors: $200M strategic investment in 2011

Technology: 65K+ requests/sec at peak load; 100 TB data generation/month (more than Twitter). 500+ Data Scientists, Engineers & Analysts across US, IN, JP (one of the largest Prod & Engg teams working on mobile advertising globally)

Page 5: Big Data and User Segmentation in Mobile Context

InMobi: Recognized as one of 50 disruptive companies in 2013

Page 6: Big Data and User Segmentation in Mobile Context

1,600,000,000,000

Total ad requests = 1.6 trillion

Page 7: Big Data and User Segmentation in Mobile Context

578,000,000,000

Total ads served = 578 billion

Page 8: Big Data and User Segmentation in Mobile Context

120,000

Peak ad requests per second = 120,000

Page 9: Big Data and User Segmentation in Mobile Context

10,000,000,000

Total number of events ingested in a day = 10 billion

Page 10: Big Data and User Segmentation in Mobile Context

5,400,000,000,000

New data generated every day = 5.4 TB

Page 11: Big Data and User Segmentation in Mobile Context

1,000,000,000,000,000

Total Hadoop cluster size = 1000 TB

Page 12: Big Data and User Segmentation in Mobile Context

83,000,000

Total number of Hadoop jobs = 83 million

Page 13: Big Data and User Segmentation in Mobile Context

How is Mobile Data Different?

Page 14: Big Data and User Segmentation in Mobile Context

How is mobile data different?

1. Rich context

• GPS
• Accelerometer
• Gyroscope
• Digital compass
• Altimeter
• Health sensors?

Implication: high cardinality

Page 15: Big Data and User Segmentation in Mobile Context

How is mobile data different?

2. Short half-life

• User context changes rapidly
• Geo context half-life < 30 minutes

Implication: real-time/fast feedback is critical
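One way to read the half-life point on this slide is as exponential decay of a signal's usefulness. The sketch below is an illustration only: the function name is hypothetical, and the 30-minute default is simply the upper bound quoted on the slide, not a measured InMobi parameter.

```python
def decay_weight(age_minutes: float, half_life_minutes: float = 30.0) -> float:
    """Weight of a signal observed `age_minutes` ago under exponential decay.

    A signal exactly one half-life old counts half as much as a fresh one.
    The 30-minute default is an assumption taken from the slide's upper bound.
    """
    if age_minutes < 0:
        raise ValueError("age_minutes must be non-negative")
    return 0.5 ** (age_minutes / half_life_minutes)
```

Under this model a location fix that is two hours old retains about 6% of its original weight, which is why slow batch feedback loses most of the geo signal.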

Page 16: Big Data and User Segmentation in Mobile Context

How is mobile data different?

3. Controlled ecosystem

• Most activity happens on apps vs. mobile browsers
• Controlled by iOS/Android
• Fast-changing standards for user identification
• Privacy controls

Implication: joins with backdated data

Page 17: Big Data and User Segmentation in Mobile Context

How is mobile data different?

4. ‘Snacking’

• Shorter attention span
• Far smaller sessions
• Higher number of visits

Implication: weaker signals

Page 18: Big Data and User Segmentation in Mobile Context

How is mobile data different?

5. Fragmentation

• Much lower shelf life of apps (compared to websites)
• Users appear across devices

Implication: need for fingerprinting/hygiene algorithms

Page 19: Big Data and User Segmentation in Mobile Context

How is mobile data different?

6. Lossy

• Varied network latency even on a single device
• Devices/operators treat packets differently (http vs. https)

Implication: redundancy in data collection

Page 20: Big Data and User Segmentation in Mobile Context

How is mobile data different?

7. Scale

• Mobile data is roughly 10X that of desktop
• Growing very rapidly
• Great opportunity

Implication: truly scalable fast data platforms

Page 21: Big Data and User Segmentation in Mobile Context

Responding to scaling challenges (1/2)

Q: As your business scales across multiple geographies (and data centers), how do you maintain reporting SLAs?

Apache Falcon: Geo-distributed data processing model (open sourced); aggregation is done in local data centers and pushed to the center
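The aggregate-locally, merge-centrally idea can be sketched in a few lines. This is not Falcon's API (Falcon orchestrates the jobs rather than doing the arithmetic); the function names and the event fields `country` and `type` are assumptions made for illustration.

```python
from collections import Counter

def aggregate_locally(events):
    """Roll up raw events inside one data center into per-key counts."""
    counts = Counter()
    for event in events:
        counts[(event["country"], event["type"])] += 1
    return counts

def merge_at_center(partials):
    """Combine the small per-data-center rollups into one global report."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts key by key
    return total
```

Only the compact rollups cross data-center boundaries, so the volume shipped to the center depends on the number of distinct keys rather than the number of raw events.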

Q: How do you leverage different reporting systems with different cost-to-granularity tradeoffs?

Grill: Abstract view over tiered data stores, with query redirection based on latency and cost objectives
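As a rough illustration of query redirection over tiered stores, the sketch below picks the cheapest store that holds enough history and meets a latency objective. The `Store` fields, the store names and the selection rule are all hypothetical; the slide does not describe how Grill actually makes this decision.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass(frozen=True)
class Store:
    name: str
    latency_s: float       # typical query latency (assumed known)
    cost_per_query: float  # relative cost (assumed known)
    max_days: int          # how much history the store retains

def pick_store(stores: List[Store], days_needed: int,
               max_latency_s: float) -> Optional[Store]:
    """Return the cheapest store that satisfies both constraints, or None."""
    candidates = [s for s in stores
                  if s.max_days >= days_needed and s.latency_s <= max_latency_s]
    return min(candidates, key=lambda s: s.cost_per_query, default=None)
```

A dashboard query over the last three days with a one-second budget would land on a fast, expensive tier, while a yearly report with a relaxed budget would be redirected to the cheap batch tier.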

Q: How do you handle fast joins across billions of keys?

Stream enrichment: Enrich impressions, clicks and conversions with request information
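Stream enrichment amounts to a keyed lookup at ingestion time instead of a large join later. A minimal sketch, assuming each downstream event carries a `request_id` and that request records sit in some fast key-value lookup (a plain dict here); both names are illustrative and not taken from the slides.

```python
def enrich(events, requests_by_id):
    """Attach request fields to each impression/click/conversion event.

    Fields on the event win over request fields when both define a key.
    """
    for event in events:
        request = requests_by_id.get(event["request_id"])
        if request is None:
            # Request not seen (yet): pass the event through unenriched.
            yield {**event, "enriched": False}
        else:
            yield {**request, **event, "enriched": True}
```

Because every enriched event already carries its request context, later aggregations can group by request attributes without touching the request log again.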

Page 22: Big Data and User Segmentation in Mobile Context

Responding to scaling challenges (2/2)

Q: How do you offer data collection as a service at scale, with reliability, predictability, varying SLAs and audit?

Conduit and Pintail: Reliable and highly scalable data collection technology. Allows both batch and streaming consumption (open sourced)

Q: How do you offer linked data analytics at scale?

ShadowFax: A homegrown graph database for OLAP analytics (currently being open sourced)

Q: How do you crunch user data at scale while still maintaining user privacy?

Hybrid Map Reduce: MapReduce on a hybrid runtime of mobile and Hadoop. Map on mobile, reduce on Hadoop
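The slide gives no detail on the hybrid runtime, so the following is only one plausible reading: the map step runs on the device and emits coarse pairs, and only those pairs reach the cluster for the reduce step. The function names, the `(app, category)` input shape and the privacy rationale are assumptions.

```python
from collections import Counter

def map_on_device(raw_app_usage):
    """Runs on the phone: reduce raw usage to coarse (category, 1) pairs.

    Raw app names never leave the device in this sketch; only the set of
    categories the user touched is emitted.
    """
    categories = {category for _app, category in raw_app_usage}
    return [(category, 1) for category in sorted(categories)]

def reduce_on_cluster(mapped_outputs):
    """Runs on Hadoop: count how many devices reported each category."""
    totals = Counter()
    for pairs in mapped_outputs:
        for key, value in pairs:
            totals[key] += value
    return totals
```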

Page 23: Big Data and User Segmentation in Mobile Context

Information consumption is undergoing another discontinuity

Broadcast-based consumption (1436-2015)
• Books/Radio/TV
• One-way

Pull-based consumption (1990-2020)
• Search
• Intent specified

Contextual consumption (2010-20XX)
• Mobile advertising?
• Just-in-time information delivery
• Context-aware environment

Page 24: Big Data and User Segmentation in Mobile Context

Contextual advertising is a reality

• Mobile is ubiquitous
• Digitization of the offline world (PoIs to user-specific tagging)
• Location information is available, which is the equivalent of placing cookies on a server
• Data density is actionable in pockets
• Contextual advertising/information delivery is now possible. E.g. Amazon’s anticipatory shipping is a step in that direction
• Going forward, wearable devices will continue to have location, and users will expect the cloud to understand the context

Page 25: Big Data and User Segmentation in Mobile Context

User Segmentation Example

Problem statement
● Segment users based on their interests/behavior
● Deliver better CVR/CPDs for the advertiser
● Optimum mapping of demand to supply

Data
● User meta data
● Engagement data (ad requests, clicks, impressions, downloads, etc.)
● Geo location data
● Device data

Page 26: Big Data and User Segmentation in Mobile Context

User Segmentation: Challenges

Pipeline
● Scalable segment generation (number of segments and users per segment)
● Frequency of generation (time required per segment)

Modelling
● Choosing the target variable for prediction (e.g. CVR or CTR); it should be a measure of the user's affinity for the segment type
● Sparse user data
● Validation metrics
● Feature inclusion

Page 27: Big Data and User Segmentation in Mobile Context

User Segmentation: Steps

Feature Generation: joins over multiple data cubes and meta data

Preprocessing: data cleaning and outlier removal

Training and prediction: sampling, training a model for predicting segment probabilities, and segment formation

Page 28: Big Data and User Segmentation in Mobile Context

User Segmentation: Feature Generation

● Generate user-based features at large volumes (per day: >10 billion events, 5.4 TB) for training models
● Challenge: we want current event data at training time
● Events (clicks, downloads), however, arrive long after the original impression
● Using real-time logs means heavy processing!
● Our approach: create static snapshots where possible, aggregate later

Tools
● Apache Falcon: job synchronization, scheduling, retention, etc.
● Conduit & Pintail: real-time streaming & transport across DCs
● Apache Pig: where possible, fewer lines of code than MR
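The "static snapshots, aggregate later" approach can be sketched as follows. This is a toy in-memory version under stated assumptions: one snapshot per calendar day, events given as `(user_id, event_type)` pairs, and hypothetical function names. In production these steps would be Pig or MapReduce jobs scheduled by Falcon.

```python
from collections import Counter
from datetime import date, timedelta

def build_daily_snapshot(events):
    """Per-user event counts for one day; written once, then treated as static."""
    snapshot = {}
    for user_id, event_type in events:
        snapshot.setdefault(user_id, Counter())[event_type] += 1
    return snapshot

def aggregate_window(snapshots_by_day, end_day, days):
    """Sum the daily snapshots for the `days` days ending at `end_day`."""
    features = {}
    for offset in range(days):
        day = end_day - timedelta(days=offset)
        for user_id, counts in snapshots_by_day.get(day, {}).items():
            features.setdefault(user_id, Counter()).update(counts)
    return features
```

Training over a different window then only re-sums small snapshots instead of re-reading raw logs, and a late-arriving conversion only requires rebuilding the snapshot of the day it belongs to.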

Page 29: Big Data and User Segmentation in Mobile Context

User Segmentation: Preprocessing

Challenges
● Unusual user data due to tracking issues
● Accommodating backfill of data
● Accommodating signals with high latency (conversions)
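A simple hygiene pass for "unusual user data" might look like the sketch below. The two rules and the `max_daily_requests` threshold are illustrative assumptions, not InMobi's actual filters, and the field names are hypothetical.

```python
def clean_users(users, max_daily_requests=10_000):
    """Drop user records that look like tracking errors or non-human traffic."""
    kept = []
    for user in users:
        if user["clicks"] > user["impressions"]:
            continue  # more clicks than impressions: likely a tracking error
        if user["requests"] > max_daily_requests:
            continue  # implausibly heavy traffic for a single device
        kept.append(user)
    return kept
```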

Page 30: Big Data and User Segmentation in Mobile Context

User Segmentation: Training

● Optimum duration of user behavior for training (days, weeks, months)
  o Data generation/storage costs increase with duration
  o Shorter-duration data might miss seasonal patterns
● Target variable can be a conversion- or click-based metric (CVR or CTR)
  o Predicting future user behavior towards a segment
    Ex: Predicting CVR for next week using behavior features from the previous week
    More suited to performance advertising campaigns
  o Modelling current behavior (categorization of current behavior)
    Ex: Categorizing the user based on the current week’s behavior
    More suited to brand advertising campaigns
● Users with conversions/clicks are far fewer than users without conversions/clicks
  o Class imbalance problem
  o Sampling strategies: stratified sampling
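For the class imbalance point, a stratified sample can keep every converter while downsampling non-converters. The sketch below uses only the standard library; the function name, the `fractions` argument and the example rates are assumptions for illustration.

```python
import random
from collections import defaultdict

def stratified_sample(rows, label_of, fractions, seed=0):
    """Sample each class (stratum) at its own rate.

    `fractions` maps a label to the share of that class to keep, e.g.
    {True: 1.0, False: 0.1} keeps all positives and 10% of negatives.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[label_of(row)].append(row)
    sample = []
    for label, members in strata.items():
        k = round(len(members) * fractions[label])
        sample.extend(rng.sample(members, k))
    return sample
```

Downsampling the majority class shifts the base rate seen by the model, so predicted probabilities usually need to be corrected back before they are read as CVR or CTR.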

Page 31: Big Data and User Segmentation in Mobile Context

User Segmentation: Training

● User data for mobile is much sparser than for web.
  o Define dense and sparse data.
  o Separate models for dense and sparse data.
  o Train only on dense data
    Loses the information in sparse data
  o Augment the sparse data using additional meta features for the user.
    OS, device, country and other demographic details
● Modelling approaches:
  o Linear/Logistic Regression, Random Forests
    Predict a probability of membership for a segment
  o Collaborative Filtering/Clustering Algorithms
    Put users into segments based on similarity to current segment users
● Hybrid: using clustering to augment sparse data and modelling on the results
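One way to act on the dense/sparse split is sketched below: define density by a minimum impression count, learn a cohort-level rate from dense users, and attach it to sparse users as an extra feature. The threshold of 20, the `(os, country)` cohort key and all field names are hypothetical choices, not values from the slides.

```python
from collections import defaultdict

MIN_EVENTS = 20  # hypothetical density threshold

def split_by_density(users, min_events=MIN_EVENTS):
    """Users with at least `min_events` impressions count as dense."""
    dense = [u for u in users if u["impressions"] >= min_events]
    sparse = [u for u in users if u["impressions"] < min_events]
    return dense, sparse

def cohort_ctr(dense_users):
    """Click-through rate per (os, country) cohort, learned from dense users."""
    clicks, imps = defaultdict(int), defaultdict(int)
    for u in dense_users:
        key = (u["os"], u["country"])
        clicks[key] += u["clicks"]
        imps[key] += u["impressions"]
    return {key: clicks[key] / imps[key] for key in imps if imps[key]}

def augment_sparse(sparse_users, ctr_by_cohort, default=0.0):
    """Give each sparse user the rate of its cohort as an additional feature."""
    return [dict(u, cohort_ctr=ctr_by_cohort.get((u["os"], u["country"]), default))
            for u in sparse_users]
```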

Page 32: Big Data and User Segmentation in Mobile Context

User Segmentation: Training

● In-memory tools (Spark MLlib, H2O) vs. Hadoop MapReduce tools (Mahout)
  o Machine learning models are iterative in nature, i.e. they access the same data blocks multiple times
  o Less overhead in reading data if it is present in memory
  o In-memory tools are better for training on “relatively” smaller data sets.
  o We found Spark easier to use than H2O due to sufficient debugging documentation and community support
● Hadoop Streaming: using any executable script as mapper or reducer.
  o Useful for quick deployment in languages such as Python and shell
● Dynamic segmentation vs. static segmentation
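To make the Hadoop Streaming point concrete, here is a minimal Python mapper and reducer pair that counts clicks per user. The input format (`user_id,event_type` per line) is an assumption; the tab-separated key/value output and the key-sorted reducer input are the standard Streaming conventions.

```python
from itertools import groupby

def mapper(lines):
    """Emit 'user_id<TAB>1' for every click line of the form 'user_id,event_type'."""
    for line in lines:
        user_id, event_type = line.rstrip("\n").split(",")
        if event_type == "click":
            yield f"{user_id}\t1"

def reducer(sorted_lines):
    """Sum counts per key; Streaming delivers reducer input sorted by key."""
    pairs = (line.rstrip("\n").split("\t") for line in sorted_lines)
    for user_id, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{user_id}\t{sum(int(value) for _key, value in group)}"
```

In a real job each function would sit in its own script that reads `sys.stdin` and prints its output, and the two scripts would be handed to the streaming jar through its `-mapper` and `-reducer` options.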

Page 33: Big Data and User Segmentation in Mobile Context

Big Data at InMobi

Consume

Create

Contribute

InMobi

Apache Falcon

Conduit

Pintail

Apache Hadoop

Apache Hive

Apache Oozie

Apache HBase, Apache Cassandra, MongoDB, Redis, Druid, Apache ZooKeeper, Apache Flume, Apache Storm, Scribe, Apache Ambari

Page 34: Big Data and User Segmentation in Mobile Context

Thanks

Questions? [email protected]