quettra design problem solution - deepti chafekar

Location Trace Analysis

Deepti Chafekar

Quettra, June 6, 2014

Design Problem

• Given: GPS location trace data for users

• Goal: Extract insights for a given user from their location trace data

• Solution split into several parts:

– Extracting insights: How would you organize the data? What models could be created to understand this data? What useful insights can be extracted from these models?

– Using the extracted insights: What apps and business services can be designed to use these insights?

– Scalability issues: How can this system be made scalable across millions of users and 100s of incoming data points per day?

Overview

Part 1: Extract Insights

• Input data conditioning: Aggregation, Clustering, Filtering, Labeling (produces labeled significant location clusters)

• Data Modeling: Markov Models, Location-Time Distribution Model, ML Classifiers for location prediction

Part 2: Use Insights

• Place Recommendation Service: Collaborative Filtering, Memory-based (Min-Hashing)

Scalability is marked against each of the three stages in the diagram.

Data Organization

Location Trace Data: Data Source and Format

• Generated sample data (~400 points) for 5 weekdays and 4 weekends

• Raw data format: Timestamp, Lat, Long

• Probes generated every 15 mins

Data Organization

• Aggregate similar points

• Make data manageable

• Remove noise

• Identify significant data points

Figure: Trace data generated based on my schedule on weekdays for 1 week (around 400 points generated).

Data Organization: Clustering

Variation of K-means

• K is not fixed but dynamically set

• 2-Phase approach

Phase 1:

1. For each point P, compute distance between it and all cluster centroids Ci

2. Compute Minimum distance: Dmin(P, Ci)

3. If Dmin(P, Ci) <= d, insert point P in cluster Ci.

4. else create new cluster with P as centroid

5. Update centroids of all clusters

Phase 2:

1. Run K-means on the clusters and points
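The two phases above can be sketched as follows. This is a minimal illustration, not the author's implementation: the haversine distance, the threshold `d` in meters, the fixed iteration count and all function names are assumptions introduced here. Step 5 of Phase 1 is done incrementally, since only the cluster that received the point changes.

```python
import math

def haversine_m(a, b):
    """Great-circle distance in meters between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371000 * math.asin(math.sqrt(h))

def centroid(points):
    """Mean latitude/longitude of a non-empty list of points."""
    return (sum(p[0] for p in points) / len(points),
            sum(p[1] for p in points) / len(points))

def nearest(p, centroids):
    """Index of the centroid closest to point p."""
    return min(range(len(centroids)), key=lambda j: haversine_m(p, centroids[j]))

def phase1(points, d):
    """Phase 1: grow clusters greedily, so K is discovered rather than fixed."""
    clusters, centroids = [], []
    for p in points:
        if centroids:
            i = nearest(p, centroids)
            if haversine_m(p, centroids[i]) <= d:
                clusters[i].append(p)
                centroids[i] = centroid(clusters[i])  # only this centroid moved
                continue
        clusters.append([p])   # no centroid within d: start a new cluster
        centroids.append(p)
    return centroids

def phase2(points, centroids, iterations=10):
    """Phase 2: plain K-means iterations seeded with the phase-1 centroids."""
    groups = [[] for _ in centroids]
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        centroids = [centroid(g) if g else c for g, c in zip(groups, centroids)]
    return centroids, groups
```

A caller would run `phase1(points, d)` first and pass its result to `phase2`; the choice of `d` decides how many clusters appear.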

Extracting Places from Traces of Locations, Kang, Welbourne

Using GPS to Learn Significant Locations and Predict Movement Across Multiple Users, Ashbrook, Starner

Figure: a point P and its distances D(P,C1), D(P,C2), D(P,C3) to the cluster centroids C1, C2, C3.

Clustering Results

Figure: trace points before and after clustering.

• Reduce effective data size

• Filter out redundant points

Identify Significant Clusters

• Filter the distance-based clusters further to identify significant clusters

• Significance: based on time spent per cluster and frequency of visit

• Sort the traces by time and calculate hours spent per cluster

• Frequency: # of times a cluster is visited in a day
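A small sketch of the two significance measures, assuming each probe has already been assigned a cluster id. The names `cluster_stats` and `significant` and the `min_hours` threshold are hypothetical; the gap between consecutive probes is credited to the cluster of the earlier probe, and a visit is counted on the date it starts.

```python
from collections import defaultdict
from datetime import datetime

def cluster_stats(trace):
    """trace: list of (datetime, cluster_id) probes.

    Returns (hours, visits): hours[cluster] is the total time spent, and
    visits[(date, cluster)] counts entries into the cluster on that date.
    """
    trace = sorted(trace, key=lambda probe: probe[0])  # sort the trace by time
    hours = defaultdict(float)
    visits = defaultdict(int)
    prev_cluster = None
    for i, (t, c) in enumerate(trace):
        if c != prev_cluster:      # a new visit starts when the cluster changes
            visits[(t.date(), c)] += 1
        if i + 1 < len(trace):     # credit the gap to the current cluster
            hours[c] += (trace[i + 1][0] - t).total_seconds() / 3600
        prev_cluster = c
    return dict(hours), dict(visits)

def significant(hours, min_hours):
    """Keep clusters whose total time meets a (hypothetical) threshold."""
    return {c for c, h in hours.items() if h >= min_hours}
```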

Figure: Total Time Spent per Cluster (x-axis: cluster, with L10, L6, L7, L23 labeled; y-axis: total time spent in hours, 0 to 80).

Data Conditioning: Labeling

• Add labels to the clusters to make them meaningful

• Google Places API:

– Enter lat, lon: gives information about the places

– Gives meta-data such as: name of the place and type

– E.g. given lat, lon (37.406679, -122.036603), the Places API gives

name: Moffett Towers Club

type: [“gym”, “health”, “spa”]

• Meta-data associated with a cluster:

– Center point (Lat, Lon)

– Label: name, type

– Total hours spent

• Survey data from the American Time Use Survey (ATUS, http://www.bls.gov/tus/) gives statistics on when people are normally at home (sample collected over 38K people; gives average hours spent at work and at home)

Learning Likely Locations, Krumm et al.

Google Maps Places API

Data Conditioning: Labeling

Figure: Labeled clusters for weekday trace (Home, Work, Gym, School).

Figure: Labeled clusters for weekend trace (Home, Park, Museum, Zoo).

Modeling to extract insights

• Goal

– Predict user’s location at time t

– Given user’s current location and time t, predict user’s next location

• Can be used for location-based services:

– Place/activity recommendation systems

– Automatically provide relevant traffic updates and route recommendations

– Targeted advertisements for certain activities (gym, kids places)

Data Modeling

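Modeling: Markov Model

Figure: Markov model for weekday activities (states: home, school, work, gym; edge probabilities appearing in the figure: 0.8, 0.2, 1.0, 0.7, 0.3).

Figure: Markov model for weekend activities (states: home, Park, Zoo/Museum; edge probabilities appearing in the figure: 0.7, 0.3, 1.0, 1.0, 1.0).

Insight: Predicts your next location

• Simple design

• Does not capture temporal aspects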
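A first-order Markov model of this kind can be estimated by counting transitions between consecutive labeled locations. The sketch below is an illustration under that assumption; the function names and the example visit sequence are made up here and do not reproduce the probabilities in the figures.

```python
from collections import Counter, defaultdict

def fit_markov(visits):
    """Estimate P(next | current) from an ordered list of visited locations."""
    counts = defaultdict(Counter)
    for cur, nxt in zip(visits, visits[1:]):
        counts[cur][nxt] += 1
    return {cur: {nxt: n / sum(c.values()) for nxt, n in c.items()}
            for cur, c in counts.items()}

def predict_next(model, current):
    """Most probable next location from the current one."""
    return max(model[current], key=model[current].get)

# Made-up weekday sequence, for illustration only.
visits = ["home", "school", "work", "gym", "home", "school", "work", "home"]
model = fit_markov(visits)
```

Here `model["work"]` splits evenly between gym and home, which also shows the limitation noted above: the model cannot tell which one is likelier at a given hour.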

Modeling: Location-Time Distribution

• Need to capture time-location correlation

• Capture location distribution over time

• Discretize time into 24 1-hour slots (T0, T1, T2, …, T23)

• For every slot, sum up the hours spent at a given location

• E.g. if the user was at work from 4:00 to 4:30 pm and at the gym at 4:45 pm, then T16(work) = 30 mins

PnLUM: System for prediction of next location for users with mobility, Nguyen et al.

Day Time Home School Work Gym Park

weekday 8 5

weekday 9 1

weekday 10 5

weekday 11 5

weekday 12 5

weekday 13 5

Table shows for each time slot the sum of hours user spent at a certain location

Modeling: Location-Time Distribution

Figure: Location Time Distribution, Weekday (x-axis: Time; series: Home, Work, School, Gym).

Figure: Location Time Distribution, Weekend (x-axis: Time; series: Home, Park, Zoo/Museum).

• For each time slot Ti, compute the probability PTi(Lj) that the user is at location Lj

• E.g. P12(work) = 1, P18(gym) = 0.3
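The per-slot probabilities can be sketched by counting probes, assuming probes arrive at a fixed interval so that probe counts are proportional to time spent. The function name and the sample probes are illustrative and are not the data behind the figures.

```python
from collections import Counter, defaultdict

def location_time_distribution(probes):
    """probes: iterable of (hour_of_day, location), one entry per probe.

    Returns dist[hour][location] = share of that hour slot spent at location.
    """
    counts = defaultdict(Counter)
    for hour, loc in probes:
        counts[hour][loc] += 1
    return {h: {loc: n / sum(c.values()) for loc, n in c.items()}
            for h, c in counts.items()}

# Illustrative probes: four at work in slot 12, one at gym and three at home in slot 18.
probes = [(12, "work")] * 4 + [(18, "gym")] + [(18, "home")] * 3
dist = location_time_distribution(probes)
```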

Insights:

• On weekdays, the user typically spends mornings and evenings at home and the day at work

• On weekends, the user typically spends mornings and evenings at home and the day, between 10 am and 6 pm, at parks and kids activity places

Modeling: Location Prediction Classifiers

• Predict future location given my current context (location and time) – Classification problem

• Trained 3 classifiers: (1) pruned decision tree (J48), (2) Naïve Bayes, (3) K-Nearest Neighbors

• Training set:

Time Slot   Curr Location   Day       Next Location

9           school          weekday   work

10          work            weekday   work

11          work            weekday   work

16          gym             weekday   home

PnLUM: System for prediction of next location for users with mobility, Nguyen et al.

Location Prediction: Classifiers

Classifier Accuracy

J48 90%

Naïve Bayes 91%

K-NN 95%

Weka: Open Source ML library (http://www.cs.waikato.ac.nz/ml/weka/)

PnLUM: System for prediction of next location for users with mobility, Nguyen et al.

• The approach is good for coarse predictions of location, e.g. where will I be in the next hour?

• Finer predictions need finer time slots, which could increase the complexity of the training.
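To make the classification setup concrete, here is a pure-Python nearest-neighbor sketch over the four training rows shown in the training-set table. It is a stand-in for illustration, not the Weka classifiers used for the reported accuracies; the mixed distance function and the names `distance` and `knn_predict` are assumptions.

```python
from collections import Counter

# (time slot, current location, day) -> next location, from the training-set table
TRAIN = [
    ((9, "school", "weekday"), "work"),
    ((10, "work", "weekday"), "work"),
    ((11, "work", "weekday"), "work"),
    ((16, "gym", "weekday"), "home"),
]

def distance(a, b):
    """Circular hour gap scaled to [0, 1], plus 1 per categorical mismatch."""
    gap = abs(a[0] - b[0])
    return min(gap, 24 - gap) / 12 + (a[1] != b[1]) + (a[2] != b[2])

def knn_predict(train, query, k=1):
    """Majority next-location among the k training rows closest to query."""
    nearest = sorted(train, key=lambda row: distance(row[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]
```

Calling `knn_predict(TRAIN, (17, "gym", "weekday"))` looks up the closest context, which is the 16:00 gym row.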

Data Insights

Figure: recap of the weekday and weekend Location Time Distribution charts and of the labeled clusters for the weekday trace (Home, Work, Gym, School) and the weekend trace (Home, Park, Museum, Zoo).

Models and Insights: Recap

• A single model does not capture all aspects

– Consider a combination of models to gain insights

– Identify user’s location at a given time

– Predict user’s next location given current context

• Insights:

– Understand user’s routine and schedule on weekdays and weekends

   • User leaves home at ~9am and leaves work at ~5pm (show traffic updates and route suggestions at those times)

   • User visits preschool on weekdays (target advertisements for preschoolers)

   • User is at kid-friendly places on weekends during the day (recommend kids places and activities, show deals and coupons at these places)

– Insights into user’s interests and habits

   • User goes to the gym on average 2 times a week (target advertisements for gym accessories)

Part 2: Using insights to construct an app/service

Place Recommendation System

• Goal: Recommend places for kids activities based on user preferences

• Problem Statement

– Given: N users, M kids places and user history of places visited

– Output: recommend K places to the user that he/she might be interested in visiting

Place Recommendation Methods

Collaborative Filtering

• Recommend places that people with similar preferences liked in the past

• Can provide recommendations for new types of places

Content Based Filtering

• Recommend places similar to ones the user herself preferred in the past

• Requires comparing the similarity of places (needs detailed meta-data for each place); how would a park be compared with a zoo?

• Does not recommend a new type of place, e.g. a planetarium

A Survey of Collaborative Filtering Techniques, Su et al.

Towards the Next Generation of Recommender Systems: A Survey of the State-of-the-Art and Possible Extensions, Adomavicius et al.

Collaborative Filtering: Memory Based Approach

User Rating for visited places

Find people with similar tastes (neighbors)

Recommend places highly rated by neighbors

Make ratings predictions for users based on their past ratings

Collaborative Filtering

• User rating: implicit

• Frequency f_si = (# of times the user visited place si) / (total # of place visits)

– E.g. P = {(Discovery Museum, 5), (Oakland Zoo, 1), (Ortega Park, 6)}, f_Discovery Museum = 5/12

• For simplicity, use a binary rating: if f_si > Threshold, rating r_si = 1, else 0

• Rating set R = {s1, s2, …, sl} consists of all places having rating = 1
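The implicit binary rating can be sketched in a few lines using the visit counts from the example. The threshold value of 0.25 is an assumption chosen only for illustration, and the function names are hypothetical.

```python
def frequencies(visit_counts):
    """Visit frequency per place: visits to the place / total visits."""
    total = sum(visit_counts.values())
    return {place: n / total for place, n in visit_counts.items()}

def rating_set(visit_counts, threshold):
    """Places whose frequency exceeds the threshold (binary rating = 1)."""
    return {place for place, f in frequencies(visit_counts).items() if f > threshold}

visits = {"Discovery Museum": 5, "Oakland Zoo": 1, "Ortega Park": 6}
R = rating_set(visits, threshold=0.25)  # threshold is an illustrative assumption
```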





• Similarity metric: the fraction of places common to 2 users A, B (Jaccard similarity of their rating sets):

$$\mathrm{Sim}(A,B) = \frac{|R_A \cap R_B|}{|R_A \cup R_B|}$$

• E.g. RA = {s1,s3}, RB = {s1,s2,s3,s4}:

$$\mathrm{Sim}(A,B) = \frac{2}{4}$$

• For user u, let Wu be the set of the top L users (neighbors) most similar to u

Google News Personalization: Scalable Online Collaborative Filtering, Das et al.
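The similarity between two rating sets and the neighbor set Wu can be sketched as below. The function names, the tie-breaking by user id and the third user C are assumptions added for illustration.

```python
def jaccard(a, b):
    """|a ∩ b| / |a ∪ b| for two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def top_neighbors(user, rating_sets, L):
    """The L users most similar to `user` (ties broken by user id)."""
    scored = [(jaccard(rating_sets[user], r), v)
              for v, r in rating_sets.items() if v != user]
    return [v for _, v in sorted(scored, key=lambda x: (-x[0], x[1]))[:L]]

rating_sets = {
    "A": {"s1", "s3"},
    "B": {"s1", "s2", "s3", "s4"},
    "C": {"s5"},  # hypothetical user with no overlap
}
```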


• Predict the rating of user u for place sk by considering the ratings given by the users in the set Wu:

$$r_{u,s_k} = \frac{\sum_{v \in W_u} \mathrm{Sim}(u,v)\, r_{v,s_k}}{\sum_{v \in W_u} \mathrm{Sim}(u,v)}$$
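The predicted rating is a similarity-weighted average of the neighbors' ratings. A minimal sketch follows, with the assumptions that a neighbor who has not rated the place counts as 0 and that the prediction is 0 when all similarities are 0; the function name and the sample values are illustrative.

```python
def predict_rating(sims, neighbor_ratings):
    """Similarity-weighted average rating for one place.

    sims: {v: Sim(u, v)} for each neighbor v in W_u
    neighbor_ratings: {v: rating of v for the place}; missing means 0
    """
    denom = sum(sims.values())
    if denom == 0:
        return 0.0
    return sum(sims[v] * neighbor_ratings.get(v, 0) for v in sims) / denom

# Two hypothetical neighbors: v1 rated the place 1, v2 rated it 0.
score = predict_rating({"v1": 0.5, "v2": 0.25}, {"v1": 1, "v2": 0})
```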

Memory-based

Pros

• Easy implementation, simple design

• New users can be added easily and incrementally

Cons

• Scalability issues for millions of users

• Performance decreases when data is sparse

• Adding a new place would require re-computation of the rating vector

Scalability


• 100 points per day (for 6 months) for 1M users: data size is in terabytes

• Data storage: MapReduce. Trace data for a given user is split across different machines

• Scalability challenges

– Clustering

– Location-Time Distribution (sorting of trace data)

– Min-Hash

K-means Clustering: MapReduce

• Input: trace data for a user can be split into different shards, Data(u1) on each of machines M1, M2, M3

• Map: compute K-means and emit key, value pairs (cluster centroid, point): (C1, P1), (C2, P2), (C4, P3), (C1, P4), (C3, P5), (C4, P6)

• Reduce: K-means, new clusters and centroids: (C’1, P1, P2), (C’2, P4, P6), (C’3, P3, P5)

Sorting of Traces by Time: MapReduce

• Input: trace data for a user can be split into different shards, Data(u1) on each of machines M1, M2, M3

• Map: sort trace data and generate key, value pairs (timestamp, data point): (T1, P1), (T3, P2), (T4, P3), (T2, P4), (T5, P5), (T6, P6)

• Reduce: each reducer has a key; elements <= key are assigned to it and sorted. Reducers shown: R(3), R(4)

• Output: (T1, P1), (T2, P4), (T3, P2), (T4, P3), (T5, P5), (T6, P6)
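The sort can be simulated in a single process as a range-partitioned sort: each reducer owns an upper-bound key, receives the pairs whose timestamp is <= that key (and above the previous reducer's key), sorts them, and the reducer outputs are concatenated. This is a sketch under assumptions: integer timestamps stand in for T1..T6, the shard layout and the boundary `[3]` are guesses at the figure, and a final bucket catches anything above the last boundary.

```python
import bisect

def partition(pairs, boundaries):
    """Route each (timestamp, point) to the first reducer whose key >= timestamp."""
    buckets = [[] for _ in range(len(boundaries) + 1)]
    for t, p in pairs:
        buckets[bisect.bisect_left(boundaries, t)].append((t, p))
    return buckets

def sort_trace(shards, boundaries):
    """Map (emit pairs), shuffle (partition by key range), reduce (sort each bucket)."""
    pairs = [pair for shard in shards for pair in shard]
    buckets = partition(pairs, boundaries)
    return [pair for bucket in buckets for pair in sorted(bucket)]

# Assumed shard layout for the six points in the figure.
shards = [[(1, "P1"), (3, "P2")], [(4, "P3"), (2, "P4")], [(5, "P5"), (6, "P6")]]
result = sort_trace(shards, boundaries=[3])
```

Because the buckets cover disjoint, ordered key ranges, no merge step is needed after the reducers finish.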

Collaborative Filtering: Scalability

• Similarity metric is computed between all pairs of users; the O(N^2) complexity explodes for millions of users

• Don’t have to consider all user pairs; consider only those pairs that have a high probability of being similar

• Locality Sensitive Hashing (LSH): hashing technique that hashes data points such that the probability of collision is higher for objects close to each other

• Points that have the same hash value form a cluster; the similarity metric is computed only between pairs of users in that cluster

Collaborative Filtering: Min-Hash

• Hashing function

– Let P = {s1, s2, s3, …, sM} be the set of all M possible kids places in an area

– For users u, v let Ru = {s1, s3, s4}, Rv = {s1, s3} be the rating sets. Pick one random permutation of P shared by all users; h(u) is the first place of Ru under that permutation, and h(v) is the first place of Rv

– Probability[h(u) = h(v)] = |Ru ∩ Rv| / |Ru ∪ Rv| = 2/3, which is the same as our similarity metric
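A short check of the min-hash property, assuming the standard construction in which both users are hashed with the same random permutation of P. Enumerating every permutation of a small, assumed 4-place universe gives the exact collision probability, which equals the Jaccard similarity of the two rating sets (2/3 for Ru = {s1, s3, s4} and Rv = {s1, s3}). The function names are illustrative.

```python
from itertools import permutations

def minhash(rating_set, order):
    """First place in `order` (a permutation of P) that is in rating_set."""
    return next(s for s in order if s in rating_set)

def collision_probability(ru, rv, places):
    """Exact P[h(u) == h(v)] over all permutations of a small place list."""
    perms = list(permutations(places))
    hits = sum(minhash(ru, p) == minhash(rv, p) for p in perms)
    return hits / len(perms)

places = ["s1", "s2", "s3", "s4"]  # assumed universe P with M = 4
prob = collision_probability({"s1", "s3", "s4"}, {"s1", "s3"}, places)
```

In practice one permutation (or a few, concatenated) is sampled rather than enumerated; the enumeration here only verifies the probability.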


Locality Sensitive Hashing

Figure: hash bucket 1 holds users U1, U3, U7; hash bucket 2 holds users U2, U5, U8.

Users U1, U3, U7 have the same hash value (1). Two users receive the same hash value with probability equal to their similarity metric.



Locality Sensitive Hashing: MapReduce

• Input: rating sets for different users are spread out on different machines (Ru1 on M1, Ru2 on M2, Ru3 on M3)

• Map: compute the hash for each user in parallel, H(u1)=1, H(u2)=1, H(u3)=2, and output key, value pairs (hash, userid): (1, u1), (1, u2), (2, u3)

• Reduce: combine users with the same hash value: 1->(u1, u2), 2->(u3)
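The map and reduce steps can be simulated in one process as below. The hash is taken to be the 1-based rank of the first place, in a shared ordering, that appears in the user's rating set; that choice, the function names and the example rating sets are assumptions picked so the output matches the buckets in the figure. Rating sets are assumed non-empty.

```python
from collections import defaultdict

def map_user(user_id, rating_set, order):
    """Mapper: emit (hash, user_id); hash = rank of the first rated place in `order`."""
    h = next(i for i, s in enumerate(order, start=1) if s in rating_set)
    return h, user_id

def reduce_by_hash(pairs):
    """Reducer: collect the users that share a hash value."""
    buckets = defaultdict(list)
    for h, user_id in pairs:
        buckets[h].append(user_id)
    return dict(buckets)

order = ["s1", "s2", "s3"]  # shared permutation of places (assumed)
rating_sets = {"u1": {"s1", "s3"}, "u2": {"s1", "s2"}, "u3": {"s2", "s3"}}
pairs = [map_user(u, r, order) for u, r in rating_sets.items()]
buckets = reduce_by_hash(pairs)
```

Similarity then only needs to be computed within each bucket, for example between u1 and u2 here.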

Summary

• Data organization: Variation of K-means, labeling

• Modeling: Different models to predict user’s current and future location

• Insights: User’s schedule and interests

• Business Service: Place Recommendation system

• Scalability: approaches on Map Reduce

References

• Extracting Places from Traces of Locations, Kang et al.

• Using GPS to Learn Significant Locations and Predict Movement Across Multiple Users, Ashbrook et al.

• PnLUM: System for prediction of next location for users with mobility, Nguyen et al.

• A Survey of Collaborative Filtering Techniques, Su et al.

• Towards the Next Generation of Recommender Systems: A Survey of the State-of-the-Art and Possible Extensions, Adomavicius et al.

• Google News Personalization: Scalable Online Collaborative Filtering, Das et al.

• Learning Likely Locations, Krumm et al.

• Learning Travel Recommendations from User-Generated GPS Traces, Zheng et al.

• Google Maps APIs

• Weka: Open source ML library

Thanks

Questions/Discussion
