research summary: efficiently estimating statistics of points of interest on maps

Efficiently Estimating Statistics of Points of Interest on Maps – Wang, He, Liu (2016)

Alex Klibisz, alex.klibisz.com, UTK STAT645

November 10, 2016

Motivation

Point-of-interest (PoI) data is valuable.
- Google, Foursquare, Baidu, etc. collect user activity and feedback for PoIs.
- This data can help businesses understand markets, determine locations, consumer preferences, etc.

Public-facing APIs have restrictions.
- Number of PoIs returned, query frequency, area size, no guarantee of underlying sampling.

Propose methods to sample and approximate aggregate statistics.
- Emphasize query efficiency, aggregate statistic error.

Notation

- A: area of interest (e.g. a city).
- P: set of PoIs in A (e.g. all hotels).
- k: maximum number of PoIs returned per query (API constraint).
- Fully-accessible region: a region that returns < k PoIs when queried.
  - If k = 50, a region is fully accessible when a query returns 49 or fewer PoIs.
- δ: minimum acceptable lat. and lon. precision of map APIs.
- Hit-ratio ρ: probability of sampling a non-empty sub-region from A.
  - ρ is the fraction of non-empty sub-regions in A → about 1/ρ queries on average to get a non-empty sub-region.
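The 1/ρ figure follows from treating each uniformly drawn sub-region as an independent trial that succeeds with probability ρ. A minimal simulation sketch of that reasoning; the function names and the value ρ = 0.25 are illustrative assumptions, not taken from the paper:

```python
import random

def queries_until_hit(rho, rng):
    """Count uniform draws until a non-empty sub-region is hit."""
    n = 1
    while rng.random() >= rho:  # a miss happens with probability 1 - rho
        n += 1
    return n

def mean_queries(rho, trials=20000, seed=0):
    """Average number of draws over many simulated searches."""
    rng = random.Random(seed)
    return sum(queries_until_hit(rho, rng) for _ in range(trials)) / trials

if __name__ == "__main__":
    # With rho = 0.25 the average should land close to 1 / 0.25 = 4.
    print(mean_queries(0.25))
```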

Problem Statement

Given
1. Area of interest A containing PoIs P.
2. API-specific query restrictions.

Estimate
1. Sum Aggregate: the sum of an attribute across P.
2. Average Aggregate: the average of an attribute across P.
3. PoI Distribution: the distribution of an attribute across P.
4. n(A), the number of PoIs in A. [1]

[1] Not directly presented how to estimate this, but used for evaluation.

Datasets

Baidu and Foursquare datasets used from two other publications. No clear indication of time span.

Algorithms

1. Random Region Zoom-In, RRZI
2. RRZI w/ Count Information, RRZIC
3. Uniform Region Sampling, RRZI(C)_URS
4. Metropolis-Hastings Weighted Region Sampling, RRZIC_MHWRS

RRZI Algorithm

1. Sample region Q from A at random.
2. Divide Q into two sub-regions Q0, Q1 without overlap.
3. Randomly select a non-empty sub-region as the next region to query.
4. Query the selected region.
5. Repeat until a fully-accessible sub-region is found.

Characteristics
- Typically run RRZI until m fully-accessible sub-regions are found.
- How to divide Q into Q0, Q1? Equations (1), (2).
- How to determine if Q0, Q1 are empty? Store prior PoIs.
- Correcting for sampling bias? Counter τ records the probability of sampling each sub-region.
- Maximum number of queries? Hmax = log(Lx/δ) + log(Ly/δ)
  - Lx and Ly are the x, y dimensions; δ is the degree granularity.
  - Binary search over a 2-d array.

Seems like a random binary search until a fully-accessible sub-region is found.
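The zoom-in loop can be sketched against a simulated API. Everything here is an illustrative assumption rather than the paper's implementation: `make_api` is a hypothetical stand-in for a map service, `split` halves the longer side in place of Equations (1) and (2), `h_max` assumes base-2 logarithms, and emptiness is checked by querying both halves (which costs more queries than reusing stored PoIs as the paper does).

```python
import random
from math import ceil, log2

def make_api(points, k):
    """Hypothetical map API: returns at most k of the PoIs inside a region."""
    def query(region):
        x0, y0, x1, y1 = region
        inside = [p for p in points if x0 <= p[0] < x1 and y0 <= p[1] < y1]
        return inside[:k]
    return query

def split(region):
    """Halve the region along its longer side (stand-in for Eq. (1), (2))."""
    x0, y0, x1, y1 = region
    if x1 - x0 >= y1 - y0:
        xm = (x0 + x1) / 2
        return (x0, y0, xm, y1), (xm, y0, x1, y1)
    ym = (y0 + y1) / 2
    return (x0, y0, x1, ym), (x0, ym, x1, y1)

def h_max(lx, ly, delta):
    """Depth bound from the slide; base-2 logs are an assumption here."""
    return ceil(log2(lx / delta)) + ceil(log2(ly / delta))

def rrzi(query, area, k, rng, max_depth=64):
    """Zoom in from `area` until a query returns fewer than k PoIs.

    Returns (pois, tau, n_queries); tau is the probability of the random
    choices made on the way down. If max_depth is hit first, the returned
    region may not be fully accessible.
    """
    region, tau, n_queries, depth = area, 1.0, 1, 0
    result = query(region)
    while len(result) >= k and depth < max_depth:
        halves = split(region)
        answers = [query(h) for h in halves]  # simplification: query both
        n_queries += 2
        nonempty = [i for i in (0, 1) if answers[i]]
        pick = rng.choice(nonempty)           # uniform over non-empty halves
        tau /= len(nonempty)
        region, result = halves[pick], answers[pick]
        depth += 1
    return result, tau, n_queries

if __name__ == "__main__":
    rng = random.Random(1)
    pts = [(rng.random(), rng.random()) for _ in range(200)]
    pois, tau, n_queries = rrzi(make_api(pts, k=10), (0.0, 0.0, 1.0, 1.0), 10, rng)
    print(len(pois), tau, n_queries)
```

Running `rrzi` m times yields m (PoIs, τ) pairs, which is the input the estimators need.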

RRZI Example

Important to note that RRZI is repeated m = 3 times. The evaluation shows that as m increases, estimation improves.

RRZI Estimators

Sum Estimator, Proven Consistent pg. 5

The average of some attribute over all PoIs in fully-accessible regions, standardized by the probability of picking the PoI’s region.

Confidence Interval

(Variance is defined in Equation 4.)
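The slide's equations did not survive, so the following is only one reading of the verbal description: an inverse-probability-weighted (Hansen-Hurwitz style) estimator, where each sampled region's attribute total is divided by its sampling probability τ and the results are averaged over the m samples. The ratio form used for the average is likewise an assumption. Setting the attribute to 1 for every PoI would turn the same expression into a count estimate, which may be how n(A) is obtained; the summary does not confirm this.

```python
def sum_estimate(samples):
    """samples: list of (attribute_total_in_region, tau) pairs."""
    return sum(total / tau for total, tau in samples) / len(samples)

def average_estimate(samples):
    """samples: list of (attribute_total, poi_count, tau) triples.

    Ratio of the weighted attribute sum to the weighted PoI count.
    """
    weighted_sum = sum(total / tau for total, _, tau in samples)
    weighted_count = sum(count / tau for _, count, tau in samples)
    return weighted_sum / weighted_count

if __name__ == "__main__":
    # Two sampled regions: totals 10 and 30, counts 2 and 3, tau 0.5 and 0.25.
    print(sum_estimate([(10.0, 0.5), (30.0, 0.25)]))
    print(average_estimate([(10.0, 2, 0.5), (30.0, 3, 0.25)]))
```

The weighting is what corrects the sampling bias: a region reached with probability τ contributes total/τ, so its expected contribution is exactly its total.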

RRZI Estimators

Distribution Estimator

RRZIC Algorithm

Context
- Some APIs provide the count z of PoIs in a queried region.

Use the count to improve RRZI:
- Choose the next sub-region with probability z0/z and z1/z, where z0 and z1 are the counts of Q0 and Q1.
  → The sub-region with more PoIs is more likely to be chosen to query next.
  → The number of PoIs in the next-explored region is more stable.
  → Sampling is closer to uniform and error is reduced.
- Estimators now standardize by the known count in each region, n(ri), instead of the probability of choosing the region.
- Question: why not always pick the region with greater z?
  - You would end up with the same fully-accessible (FA) region every time.

Seems like a semi-sorted binary search now.
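A small sketch of the count-weighted choice, assuming the API reports z0 and z1 for the two halves and that z = z0 + z1; the function names are illustrative. It also shows why the count can replace τ: the per-step probabilities telescope, so the chance of reaching a region is its count divided by the count of the starting area.

```python
import random

def choose_subregion(z0, z1, rng):
    """Pick half 0 or 1 in proportion to its PoI count.

    Returns (index, probability_of_that_choice).
    """
    z = z0 + z1
    if z == 0:
        raise ValueError("both sub-regions are empty")
    p0 = z0 / z
    if rng.random() < p0:
        return 0, p0
    return 1, 1.0 - p0

def path_probability(counts):
    """Probability of following a zoom-in path given the counts along it.

    counts[0] is the count of the starting area; each later entry is the
    count of the sub-region chosen at that step.
    """
    prob = 1.0
    for parent, child in zip(counts, counts[1:]):
        prob *= child / parent
    return prob

if __name__ == "__main__":
    rng = random.Random(0)
    print(choose_subregion(30, 10, rng))
    # 60/100 * 15/60 * 5/15 collapses to 5/100.
    print(path_probability([100, 60, 15, 5]))
```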

Mix Methods to overcome Size Constraints

Context
- Some APIs constrain the size of the queried region.
  - A 3° x 3° (lat, lon) query fails on Foursquare.
- Introduce mix-methods URS and MHWRS to overcome size constraints with clever sampling.

Intuition
- Subdivide A before running RRZI and RRZIC.
- Improved sampling makes it more query-efficient and lowers error.

Uniform Random Sampling (RRZI_URS, RRZIC_URS)

Uniform Random Sampling Step
1. Apply L recursive region divisions to get the set of sub-regions BL, with |BL| = 2^L; B*L is the set of non-empty sub-regions.
   - L is tuned such that the sub-regions meet the size constraint.

Continue with RRZI or RRZIC using regions from URS
2. Randomly select a non-empty b from BL.
3. Sample fully-accessible region(s) from b using RRZI(b) and RRZIC(b) (instead of RRZI(A) and RRZIC(A)).

Characteristics
- Estimator functions are similar; standardize w.r.t. BL instead of A.
- (Generally) more query-efficient:
  - URS requires |BL|/|B*L| queries to find a non-empty region; non-URS requires L. |BL|/|B*L| < L for small values of L (few division steps).
- Arrives at a non-empty query more quickly → undersamples dense regions → higher error.
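The |BL|/|B*L| figure can be checked on a toy example. This sketch assumes the L divisions alternate between the x and y axes (the slides do not say how the axes are chosen) and uses made-up points; `divide`, `contains`, and `expected_urs_queries` are illustrative names.

```python
def divide(region, depth):
    """Apply `depth` recursive halvings, alternating axes: 2**depth cells."""
    cells = [region]
    for level in range(depth):
        halves = []
        for x0, y0, x1, y1 in cells:
            if level % 2 == 0:  # split along x
                xm = (x0 + x1) / 2
                halves += [(x0, y0, xm, y1), (xm, y0, x1, y1)]
            else:               # split along y
                ym = (y0 + y1) / 2
                halves += [(x0, y0, x1, ym), (x0, ym, x1, y1)]
        cells = halves
    return cells

def contains(cell, p):
    x0, y0, x1, y1 = cell
    return x0 <= p[0] < x1 and y0 <= p[1] < y1

def expected_urs_queries(points, region, depth):
    """Expected uniform draws from BL before hitting a non-empty cell."""
    cells = divide(region, depth)
    nonempty = sum(1 for c in cells if any(contains(c, p) for p in points))
    if nonempty == 0:
        raise ValueError("no PoIs inside the region")
    return len(cells) / nonempty

if __name__ == "__main__":
    pts = [(0.1, 0.1), (0.2, 0.2), (0.9, 0.9)]
    # L = 2 gives 4 cells, 2 of them non-empty, so 4 / 2 draws on average.
    print(expected_urs_queries(pts, (0.0, 0.0, 1.0, 1.0), 2))
```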

Metropolis-Hastings Based Weighted Region Sampling (RRZIC_MHWRS)

Modify the Sampling Step to Improve Error
1. Sample a non-empty region b from BL following the distribution π = (πb = n(b)/n(A) : b ∈ B*L).
   - Draw more samples from dense regions.
2. Sample a fully-accessible region with RRZIC.

Move to Another Region (maybe)
3. Sample the next region b*, and move to it with probability min(n(b*)/n(b), 1).
   - If b* contains more PoIs than b, it will always be moved to.

Characteristics
- Only works if you know the count of the region.
- More query-efficient for the same reasons as URS.
- MHWRS is a Markov chain Monte Carlo technique.
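The move rule is a standard Metropolis-Hastings step with a uniform proposal, so the long-run share of visits to each region should approach n(b)/n(A). A minimal sketch under that assumption; the `counts` dictionary stands in for counts an API would return, and the RRZIC call at each visit is omitted.

```python
import random
from collections import Counter

def mh_region_walk(counts, steps, rng):
    """Walk over non-empty regions with acceptance min(n(b*)/n(b), 1).

    counts: dict mapping region id -> n(b), all values > 0.
    Returns a Counter of how often each region was the current region.
    """
    ids = list(counts)
    current = rng.choice(ids)
    visits = Counter()
    for _ in range(steps):
        visits[current] += 1           # RRZIC(current) would be run here
        proposal = rng.choice(ids)     # uniform proposal over B*L
        accept = min(counts[proposal] / counts[current], 1.0)
        if rng.random() < accept:
            current = proposal
    return visits

if __name__ == "__main__":
    rng = random.Random(0)
    counts = {"a": 10, "b": 30, "c": 60}
    visits = mh_region_walk(counts, 60000, rng)
    # Shares should be near 0.1, 0.3, 0.6 respectively.
    print({b: visits[b] / 60000 for b in counts})
```

Only the ratio n(b*)/n(b) is needed, so the walk never requires n(A) itself.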

Algorithms Comparison

- Parameter L required to determine sub-region size.

Evaluation

Tests
1. Estimate n(A) (number of PoIs in area A).
2. Estimate average and distribution statistics.

Baselines
1. Nearest-Neighbor Search
2. Random Region Sampling

Hypothesis
1. RRZIC_MHWRS will be most efficient if PoI count is available.
2. RRZI_URS will be most efficient otherwise.

Estimating n(A)

- Normalized root-mean-squared error for the n(A) estimate using RRZI.
- m ↑, error ↓
- k ↑, error ↓
- Not obvious how they actually estimate n(A).
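The slides do not give the error formula. A common definition, assumed here, is the root-mean-squared error over repeated runs divided by the true value:

```python
from math import sqrt

def nrmse(estimates, truth):
    """Root-mean-squared error of repeated estimates, divided by the truth."""
    mse = sum((e - truth) ** 2 for e in estimates) / len(estimates)
    return sqrt(mse) / truth

if __name__ == "__main__":
    # Two runs that miss a true value of 100 by 10 each give an error of 0.1.
    print(nrmse([90.0, 110.0], 100.0))
```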

Estimating n(A)

- How many RRZI_URS queries to reach a fully-accessible region?
- L = 0 models RRZI.
- L ↑, sub-region size ↓
- Local minimum around 10-15.

Estimating n(A)

- How many RRZI_URS queries to sample a non-empty sub-region?
- Small L → large sub-region → less likely to be empty.

Estimating n(A)

- How does decreasing sub-region size affect error for n(A)?
- Smaller regions → lower error until L = 20.

Estimating n(A)

- How many queries to decrease error to 0.1?
- Baseline methods require ~150K queries.
- RRZI and RRZI_URS require ~20K and ~50K queries.

[2] The lines between cities in the plot don’t really represent anything.

Estimating n(A)

Test Interpretations
1. RRZI, RRZI_URS methods reach a low error much sooner.
2. Tuning hyper-parameter L is important.
3. Not obvious how the proposed methods compute the n(A) estimate.

Estimating average and distribution statistics

- “Correct” data for Foursquare.
- Leaving out Baidu evaluation for brevity.


- Average and distribution root-mean-squared error for RRZI, RRZIC, RRZI_URS, RRZIC_MHWRS up to 10K queries.
- RRZIC_MHWRS is best in all cases.


- How many queries are needed for error < 0.1 for the average number of Foursquare check-ins?
- RRZIC_MHWRS is best in all cases.
- RRZI_URS is best if PoI count is unavailable.


Test Interpretations
1. True PoI count is very nice to have.
2. Modified sampling methods reach a low error much more efficiently.
3. Modified sampling methods are more query-efficient; possibly they get more meaningful data more quickly.

Omitted for Brevity

Real Applications
- Present data collected from Foursquare, Google, Baidu using the proposed methods.

Related Work
- Describe Nearest-Neighbor Search and Random Region Sampling drawbacks for this task.

Contributions, Questions

Contributions
1. A practical guide to overcoming data limitations.
2. Clear improvement on prior “state-of-the-art” methods (NNS, RRS).
3. Clever sampling methods to reach low errors very efficiently.

Questions
1. What’s the shelf-life of PoI estimates? How often would you have to re-query to maintain accurate estimates?
2. Do the ground-truth data and estimates come from the same time window? If not, is it valid to compare data from different points in time? Is it useful to use data from all time? [3]
3. Is it possible that the decreased error for mix-methods is a product of more efficient querying?
4. Didn’t fully understand the role of CDS_UNI and CDS_NOR in section 4.2.

[3] At a quick glance, most of Foursquare’s venue statistics can take, but don’t necessarily require, a bounded time range.
