
Algorithms for analyzing spatio-temporal data

PhD defense

Abhinandan Nath

Department of Computer Science, Duke University

Committee: Pankaj K. Agarwal (supervisor), Kamesh Munagala, Rong Ge, Yusu Wang

Introduction

The Data Deluge

“Mankind created 150 exabytes (billion gigabytes) of data in 2005. This year, it will create 1,200 exabytes.”

- The Economist, 2010

https://www.economist.com/node/15579717

Some (more) numbers ...

USGS National Elevation data (10 metre resolution) [Dewberry, 2012]

NYC taxi pickup and dropoff data, 2009-2016 : 1.3 billion points [towardsdatascience.com]

Geometric flavor of data

● Many data sets geometric in nature

● Problems in other domains can be mapped to geometric domain

– e.g., SELECT query in relational databases

NAME      AGE   SALARY
Alice     26    30,000
Bob       30    35,000
Charlie   28    25,000
...       ...   ...
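A SELECT with range predicates on the table above can be read as an orthogonal range query: each row becomes a point in the (age, salary) plane, and the WHERE clause becomes an axis-parallel rectangle. A minimal Python sketch of that mapping follows; the function name `select_range` and the query bounds are illustrative assumptions, not taken from the slides.

```python
# Rows of the example table, viewed as points in the (age, salary) plane.
rows = [("Alice", 26, 30_000), ("Bob", 30, 35_000), ("Charlie", 28, 25_000)]


def select_range(rows, age_lo, age_hi, sal_lo, sal_hi):
    """Hypothetical helper. Mimics
    SELECT name WHERE age BETWEEN age_lo AND age_hi
                  AND salary BETWEEN sal_lo AND sal_hi
    by reporting the points inside the rectangle
    [age_lo, age_hi] x [sal_lo, sal_hi]."""
    return [
        name
        for name, age, salary in rows
        if age_lo <= age <= age_hi and sal_lo <= salary <= sal_hi
    ]


# Bounds chosen only for illustration.
result = select_range(rows, 25, 29, 20_000, 32_000)
print(result)
```

This linear scan touches every row; an index such as a kd-tree or range tree answers the same rectangle query without scanning the whole table.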

Challenges

Massive data sets that
● Are noisy [towardsdatascience.com]
● Have outliers
● Are incomplete
● Are time-varying, e.g., trajectories

My Research

● Use techniques from computational geometry and topology to tackle some of these challenges in geometric data sets

● Design algorithms that
– Are practical
– Have provable performance guarantees

Broad themes

● Distributed algorithms
– Inspired by frameworks like MapReduce [Dean & Ghemawat, 2008] and Spark [Zaharia et al., 2010]

● Succinct descriptors
– Concisely encode desired properties of big data sets
– Noise-robust proxies for data sets
– Clustering

At a glance

Distributed algorithms

Succinct descriptors

Indices to answer range and nearest-neighbor queries [AFMN, 2016]

Triangulation & contour tree of massive terrains [AFMN, 2016]

Comparing merge trees of real-valued functions [AFNSW, 2015]

Common movement patterns from trajectory data [AFMNPT, 2018]

Distributed model of computation

● Massively Parallel Communication (MPC) model [Beame et al., 2013]

● Captures salient features of modern frameworks like MapReduce [Dean & Ghemawat, 2008]

MPC model of computation

● No. of machines
● n : input size, distributed across machines
● s : each machine has O(s) storage

Assume ,

[Figure: machines with O(s) storage each, connected by a communication medium; input size n]

MPC model of computation

● Computation proceeds in rounds
– In each round, each machine computes on local data

● Communication between machines occurs between rounds

● No. of messages sent/received by any machine in a round bounded by
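The round structure above can be illustrated with a toy simulation. The Python sketch below sums n numbers on machines that each hold at most s values, alternating local computation with a communication step. It assumes that a machine may receive at most s messages per round (the exact bound is not legible on the slide), and the function name `mpc_sum` is hypothetical.

```python
def mpc_sum(values, s):
    """Toy MPC-style aggregation (illustrative, not from the slides).

    Each simulated machine stores at most s numbers. Between rounds,
    groups of up to s machines send their partial sums to one machine,
    so no machine receives more than s messages (assumed bound).
    Returns (total, number of communication rounds)."""
    if s < 2:
        raise ValueError("need s >= 2 for the number of machines to shrink")
    # Initial distribution: the input is split into chunks of size <= s.
    machines = [values[i:i + s] for i in range(0, len(values), s)]
    rounds = 0
    while len(machines) > 1:
        # Local computation: every machine sums its own data.
        partial = [sum(local) for local in machines]
        # Communication: up to s partial sums are sent to one machine.
        machines = [partial[i:i + s] for i in range(0, len(partial), s)]
        rounds += 1
    total = sum(machines[0]) if machines else 0
    return total, rounds


total, rounds = mpc_sum(list(range(1000)), s=10)
print(total, rounds)
```

With fan-in s the number of machines holding data shrinks by a factor of s per round, which is why the round count in this toy example grows only logarithmically (base s) in n.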

Performance measures

● No. of rounds of computation :

● Running time : – : running time of machine in round

● Total work :

Joint work with Pankaj K. Agarwal, Kyle Fox & Kamesh Munagala

Indexing big data

● Query big data sets faster, but how?
– Build an index!

● Consider geometric queries
– Orthogonal range queries
– Nearest-neighbor queries

Previous work

● Work on conjunctive and join queries, graph processing in MapReduce and its variants [Lee et al., 2012; Qin et al., 2014; Malewicz et al., 2010; Beame et al., 2013; Koutris et al., 2018; ...]

● Geometric queries - MapReduce implementations for analyzing and querying spatial and geometric data [Eldawy et al., 2013, 2015; Arabi et al., 2014; …] - no provable performance guarantees!!

Our work

Build and query distributed variants of the following classical data structures, with provable performance guarantees

– Orthogonal range searching
● Kd-tree [Bentley, 1975]
● Range tree [Bentley, 1980]

– Nearest-neighbor searching
● Balanced Box Decomposition (BBD)-tree [Arya et al., 1998]
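For reference, the classical (sequential) kd-tree can be sketched in a few lines of Python. This is a textbook-style illustration of the data structure the distributed variant is modeled on, not the MPC algorithm from this work; the dictionary layout and function names are assumptions made for the example.

```python
def build_kdtree(points, depth=0):
    """Build a 2D kd-tree by splitting at the median, alternating axes."""
    if len(points) <= 1:
        return {"leaf": list(points)}
    axis = depth % 2
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    return {
        "axis": axis,
        "split": pts[mid][axis],  # left coords <= split <= right coords
        "left": build_kdtree(pts[:mid], depth + 1),
        "right": build_kdtree(pts[mid:], depth + 1),
    }


def range_query(node, lo, hi):
    """Report all points p with lo[a] <= p[a] <= hi[a] on both axes."""
    if "leaf" in node:
        return [p for p in node["leaf"]
                if all(lo[a] <= p[a] <= hi[a] for a in range(2))]
    axis, split = node["axis"], node["split"]
    found = []
    if lo[axis] <= split:   # the query rectangle may reach the left subtree
        found += range_query(node["left"], lo, hi)
    if hi[axis] >= split:   # the query rectangle may reach the right subtree
        found += range_query(node["right"], lo, hi)
    return found


pts = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build_kdtree(pts)
print(sorted(range_query(tree, (3, 1), (8, 5))))
```

The query only descends into subtrees whose side of the splitting line can intersect the rectangle, which is what makes the tree faster than a full scan on large inputs.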

Our results

: total no. of input points in

: total no. of points reported for a range query

: max no. of points reported by a machine for a range query

Our results

● Kd-tree :
– Construction : rounds, time,
– Query : rounds, time, work – optimal if each point can be stored exactly once

Also extends to partition trees [Chan, 2012] for simplex range searching

Our results

● Range tree :
– Construction : rounds, time,
– Query : rounds, time, and work

● BBD-tree :
– Construction : rounds, time,
– Query : rounds, time and work

Key idea : random sampling

● Data structures based on balanced hierarchical partitioning of input points represented as a tree

● Approximate this partitioning using a small random sample of input!

Balanced partitioning on random sample leads to balanced partitioning on entire set!!
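The sampling idea can be checked empirically on a toy case. The Python sketch below works in one dimension only: it picks splitters from a small random sample and measures how evenly they partition the full input. The sample size, the number of parts, and the function names are assumptions chosen for illustration; the slides do not state these parameters.

```python
import bisect
import random


def sample_splitters(points, k, sample_size, rng):
    """Pick k-1 splitters as evenly spaced order statistics of a random sample."""
    sample = sorted(rng.sample(points, sample_size))
    return [sample[(i * sample_size) // k] for i in range(1, k)]


def bucket_sizes(points, splitters):
    """Count how many input points fall between consecutive splitters."""
    sizes = [0] * (len(splitters) + 1)
    for p in points:
        sizes[bisect.bisect_right(splitters, p)] += 1
    return sizes


rng = random.Random(0)                      # fixed seed for reproducibility
points = [rng.random() for _ in range(100_000)]
splitters = sample_splitters(points, k=8, sample_size=2_000, rng=rng)
sizes = bucket_sizes(points, splitters)

# Ratio of the largest part to the ideal size n / k (close to 1 is balanced).
imbalance = max(sizes) / (len(points) / 8)
print(sizes, imbalance)
```

Only the 2,000 sampled values have to be gathered on one machine to choose the splitters; the remaining points can then be routed to their parts independently.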

Joint work with Pankaj K. Agarwal, Kyle Fox & Kamesh Munagala

Terrain modeling

Airborne LiDAR scanning [http://www.lgs.ie/airborne-lidar.shtml]

Raw elevation data (3D point cloud) [kellylab.berkeley.edu]

Digital Elevation Model (DEM) [gisgeography.com/free-global-dem-data-sources/]

From 3D point cloud to DEM

● Terrain – xy-monotone surface in ℝ³

● Graph of a height function

● Often stored as a triangulated irregular network (TIN)

● How to build TINs and perform terrain analysis in the MPC model ?

Our Work

● Build TIN model, using Delaunay triangulation

● Compute the contour tree to succinctly encode all contours of terrain

Input points in ℝ³ → Build terrain model → Build contour tree → Use contour tree

Many applications, e.g., waterflow prediction, climate model viz.

Prior Work

● Delaunay triangulation
– RAM and I/O model [Crauser et al., 2001]
– PRAM algorithms [Blelloch et al., 1999]
– Goodrich's algorithm [Goodrich, 1997] can be adapted to MPC model – too complicated
– SpatialHadoop [Eldawy et al., 2015] – no theoretical bounds

● Contour tree
– RAM and I/O model [Carr et al., 2003; Pascucci and Cole-McLaughlin, 2002; Agarwal et al., 2010; …]
– Distributed and parallel algorithms [Morozov and Weber, 2013, 2014; Pascucci and Cole-McLaughlin, 2003; Acharya and Natarajan, 2015; ...]

Our results

● Given points, compute its Delaunay triangulation in rounds, time, and work, with high probability

● Given a terrain of size , compute its contour tree in rounds, time, and work

Build terrain model

Delaunay Triangulation

● Given a set of points in the plane, a triangulation of the set is Delaunay if
– No triangle contains any input point in the interior of its circumcircle

● Many useful properties, e.g., avoids skinny triangles

[gamedev.stackexchange.com]
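The empty-circumcircle condition can be tested with the standard in-circle determinant. The Python sketch below checks whether a given triangulation is Delaunay by brute force; it illustrates the definition only and is not the construction algorithm from this work. The function names and the four sample points are assumptions made for the example.

```python
def orient(a, b, c):
    """Twice the signed area of triangle abc (> 0 if counter-clockwise)."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])


def in_circumcircle(a, b, c, d):
    """True if d lies strictly inside the circumcircle of triangle abc."""
    ax, ay = a[0] - d[0], a[1] - d[1]
    bx, by = b[0] - d[0], b[1] - d[1]
    cx, cy = c[0] - d[0], c[1] - d[1]
    det = ((ax * ax + ay * ay) * (bx * cy - cx * by)
           - (bx * bx + by * by) * (ax * cy - cx * ay)
           + (cx * cx + cy * cy) * (ax * by - bx * ay))
    # Multiplying by the orientation makes the test independent of
    # whether the triangle is listed clockwise or counter-clockwise.
    return det * orient(a, b, c) > 0


def is_delaunay(points, triangles):
    """Brute-force check: no point lies inside any triangle's circumcircle."""
    for i, j, k in triangles:
        for idx, p in enumerate(points):
            if idx in (i, j, k):
                continue
            if in_circumcircle(points[i], points[j], points[k], p):
                return False
    return True


# Four points forming a flat kite; the two possible diagonals give
# two triangulations, only one of which is Delaunay.
pts = [(0, 0), (4, 0), (2, 1), (2, -1)]
print(is_delaunay(pts, [(0, 3, 2), (3, 1, 2)]))   # short diagonal
print(is_delaunay(pts, [(0, 1, 2), (0, 3, 1)]))   # long diagonal
```

The triangulation using the long diagonal produces two skinny triangles and fails the test, which matches the remark that Delaunay triangulations avoid skinny triangles.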

Basic idea

● Randomly sample a small set of points and compute the triangulation of the sample

● Use the triangulation of the sample to split the input into smaller chunks

● Recurse on each chunk in parallel

Algorithm

1. Given points stored across many machines, randomly sample of size and send to one machine

Algorithm

2. Compute the triangulation of the sample, and use it to distribute the input points to disjoint sets of machines

Algorithm

With slight changes, it can be shown that each chunk has size with high probability

Algorithm

3. Recursively compute the triangulation of each chunk in parallel. Unnecessary triangles can be filtered out by simple geometric tests to get the triangulation of the whole input

Analysis

● No. of levels of recursion is

● Each level takes rounds, time, and work

Build contour tree

Level sets and contours

● : triangulation of

● Height function
– Defined on each vertex
– Linearly interpolated within each face (triangle)

● Level set : all points of the terrain at a given height

● Contour : connected component of a level set
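Linear interpolation within a triangle is usually done with barycentric coordinates. The Python sketch below evaluates the height at a query point inside one triangle and tests whether a level set passes through the triangle; the function names and sample heights are illustrative assumptions.

```python
def interpolate_height(tri, heights, q):
    """Height at point q, linearly interpolated from the three vertex
    heights of triangle tri using barycentric coordinates."""
    (x1, y1), (x2, y2), (x3, y3) = tri
    det = (y2 - y3) * (x1 - x3) + (x3 - x2) * (y1 - y3)
    l1 = ((y2 - y3) * (q[0] - x3) + (x3 - x2) * (q[1] - y3)) / det
    l2 = ((y3 - y1) * (q[0] - x3) + (x1 - x3) * (q[1] - y3)) / det
    l3 = 1.0 - l1 - l2
    return l1 * heights[0] + l2 * heights[1] + l3 * heights[2]


def crosses_level(heights, level):
    """A triangle meets the level set at `level` exactly when `level`
    lies between its lowest and highest vertex heights."""
    return min(heights) <= level <= max(heights)


tri = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
heights = [10.0, 20.0, 30.0]          # sample vertex heights
h = interpolate_height(tri, heights, (0.5, 0.25))
print(h, crosses_level(heights, 25.0), crosses_level(heights, 35.0))
```

Because the height is linear on each face, the part of a level set inside one triangle is a single line segment (or a vertex, or the whole triangle when all three heights equal the level), so contours of the terrain are polygonal curves.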

Topology changes at saddle points

Image from [Agarwal et al., 2015]

Contour tree

● Obtained by contracting each contour of the terrain to a point

[Agarwal et al., 2015]

Our contribution

A simple and efficient divide-and-conquer algorithm to build and store the contour tree of a massive triangulated terrain in MPC model

Storage

● Contour tree stored in a distributed fashion

– Top subtree : a sized subtree stored on one machine

– Remaining subtrees stored on other machines, with pointers to them stored in the top subtree

Algorithm (divide step)

1. Split the terrain into smaller chunks
● Each chunk has the same no. of points and goes to a disjoint set of machines

Algorithm (conquer step)

2. Compute distributed contour trees of each chunk recursively in parallel

Algorithm (merge step)

3. Combine the contour trees of the chunks to get the contour tree of the whole terrain
– Minimize interaction b/w neighboring chunks
– Take advantage of data distribution and triangulation

Our main result

Given a terrain of size , designed algorithm to compute its contour tree in rounds, time, and work

● These bounds are worst-case optimal !

Joint work with Pankaj K. Agarwal, Kyle Fox, Tasos Sidiropoulos & Yusu Wang

Gonna skip!!

Joint work with Pankaj K. Agarwal, Kyle Fox, Kamesh Munagala, Jiangwei Pan & Erin Taylor

Trajectory data

● Huge data available

– Improve decision making

– Gain insights

● Noisy and incomplete

● Several computational challenges

[https://www.sundried.com]

[developer.huawei.com]

Motivation

● Subtrajectory clusters capture common portions

● Different from clustering trajectories as a whole

Motivation

● Extract high-level shared structure from large trajectory data sets

Pathlet

Representative pathlet for each cluster
– Cluster “center”
– Pathlet is a curve, not necessarily part of the input

Application of pathlets

● Compression of large trajectory data [Chen et al. 2013]

– Hope that each trajectory can be reconstructed with small no. of pathlets

– Small pathlet dictionary - non-linear dimension reduction

● Reconstructing road network from trajectory data [Li et al. 2013; Buchin et al. 2017]

Our contribution

● Model for subtrajectory clustering
– Robust to noise and missing data
– Data-driven clusters and pathlets

● NP-hardness of subtrajectory clustering problem

● Provably-efficient approximation algorithms
– Faster algorithms for realistic inputs

● Experimental results

Previous work

● Graph setting – no noise or gaps [Chen et al. 2013]

● Based only on point density [Panagiotakis et al. 2012]

● Restricted to line segments [Lee et al. 2007]

● Search for pre-defined patterns [Fan et al. 2016; Tang et al. 2013; Wang et al. 2015; Zheng et al. 2013]

None of these have provable performance guarantees!!

Model and problem formulation

Model inputs :
– A set of trajectories
– Each trajectory is a sequence of points in the plane
● A subtrajectory is a subsequence of a trajectory

– Let be all trajectory points

Objective function

● Need small # pathlets
● Fraction of points unassigned for each trajectory : “gaps”
● Measure of cluster quality

A note on the distance

We use discrete Fréchet distance

Given and

● Correspondence s.t. every pt. in at least one pair

● is monotone if for all ,

Discrete Fréchet distance

: Set of all monotone correspondences b/w ,
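In its standard form, the discrete Fréchet distance between two point sequences is the minimum, over all monotone correspondences, of the largest distance between any matched pair. It can be computed with a textbook dynamic program; the Python sketch below shows that program, with P and Q as assumed names for the two sequences (the slide's own symbols are not legible).

```python
from math import dist, inf


def discrete_frechet(P, Q):
    """Discrete Fréchet distance between point sequences P and Q.

    D[i][j] is the cost of the best monotone correspondence between
    the prefixes P[0..i] and Q[0..j] that matches P[i] with Q[j]."""
    m, n = len(P), len(Q)
    D = [[inf] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            d = dist(P[i], Q[j])
            if i == 0 and j == 0:
                D[i][j] = d
                continue
            # A monotone correspondence reaches (i, j) by advancing
            # in P, in Q, or in both.
            best_prev = min(
                D[i - 1][j] if i > 0 else inf,
                D[i][j - 1] if j > 0 else inf,
                D[i - 1][j - 1] if i > 0 and j > 0 else inf,
            )
            D[i][j] = max(d, best_prev)
    return D[m - 1][n - 1]


P = [(0, 0), (1, 0), (2, 0)]
Q = [(0, 1), (1, 1), (2, 1)]
print(discrete_frechet(P, Q))
```

The table has m·n entries and each is filled in constant time, so the running time is proportional to the product of the two sequence lengths.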

Choosing pathlets

Given , goal is to choose from set of candidate pathlets to minimize objective function

● If is given as input : pathlet-cover problem

● If not given but assumed to be (uncountably) infinite set of all trajectories in plane : subtrajectory-clustering problem

Basic idea

● Reduce to set-cover

● Solve using greedy algorithm : gives approximation

● Challenge : implementing greedy step efficiently

Set-cover

Input :
● Set system
● Cost

Goal is to find of minimum total cost such that

From pathlet-cover to set-cover

● has two kinds of sets :– For all , with

Corresponds to treating as a gap in pathlet cover

● has two kinds of sets :– For all and for any set of subtraj. ,

Corresponds to assigning subtraj. in to

Exponential # sets : cannot construct explicitly!!

Theorem : There exists bijection between feasible solutions of and with same cost across bijection

Greedy algorithm for set-cover

Initialize

● At each step add to the set in that maximizes the coverage-to-cost ratio

● Stop when all points are covered

Coverage-to-cost ratio

● For let denote coverage-to-cost ratio

where is set of uncovered pts. of

● For let denote coverage-to-cost ratio

, if is not yet covered

, otherwise
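The greedy rule above is the classical weighted set-cover heuristic. The Python sketch below runs it on a small explicit set system, which is an assumption made for illustration: in the pathlet-cover reduction the set system is exponentially large and is never constructed explicitly. The names `greedy_set_cover`, `sets`, and `cost` are hypothetical.

```python
def greedy_set_cover(universe, sets, cost):
    """Classical greedy heuristic for weighted set cover.

    At each step, pick the set with the largest ratio
    (number of still-uncovered elements it contains) / (its cost)."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        candidates = [name for name in sets if sets[name] & uncovered]
        if not candidates:
            raise ValueError("some elements cannot be covered")
        best = max(
            candidates,
            key=lambda name: len(sets[name] & uncovered) / cost[name],
        )
        chosen.append(best)
        uncovered -= sets[best]
    return chosen


universe = {1, 2, 3, 4, 5, 6}
sets = {
    "A": {1, 2, 3, 4},
    "B": {3, 4, 5},
    "C": {5, 6},
    "D": {1, 2, 3, 4, 5, 6},
}
cost = {"A": 1.0, "B": 1.0, "C": 1.0, "D": 10.0}
solution = greedy_set_cover(universe, sets, cost)
print(solution, sum(cost[name] for name in solution))
```

Here the expensive set D covers everything in one step but has a poor coverage-to-cost ratio, so the greedy rule prefers A and then C.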

Implementing greedy step

For each need to compute that maximizes – Tricky, since we do not construct these sets at all !

● Best set for can be found in poly-time without explicitly constructing all the sets !!

– Can decompose into contribution corresponding to each traj.

– Independently choose “best” subtraj. from each traj.

Our result

● Theorem : The greedy algorithm computes a -approximate solution to the pathlet-cover problem in time

Subtrajectory clustering

Set of candidate pathlets not given, assumed to be all possible trajectories

Reducing # candidate pathlets

● satisfies triangle inequality :
– Let candidate pathlets be subtraj. of input traj.
– # candidate pathlets is
– Optimal solution cost increases by factor of 2

● :
– Can reduce # candidate pathlets to
– Cost increases by factor of

Improved running time

● For realistic inputs can achieve more speed-up– For each pathlet only subtraj. assigned from

each traj.

● Theorem : For realistic curves using Fréchet distance, can compute -approximate solution to the subtrajectory clustering problem in time

Experiments : data sets

Real data sets :
● Beijing taxi data [Tsinghua University]
– 28,000 cabs over 4 days
– 9 mil. points
– Incomplete and sparse
● GeoLife [Microsoft Research Asia]
– Pedestrian data of 182 users over 4 years
– ~2,600 traj.
– ~1.5 mil. pts.
● Cycling
– 37 traj.
– 106,000 pts.
– Has self-intersections and loops

Synthetic data sets :
● RTP
– Traffic data generated by web-based tool [http://mntg.cs.umn.edu/tg/index.php]
– Research Triangle in NC
– ~20,000 traj.
– ~1 mil. pts.

Dense & popular regions

Common trajectory portions

Handling noise

Data-driven pathlets

Summary

● Indexing big data

● Massive terrain analysis

● Comparing merge trees - briefly

● Extracting common movement patterns from trajectories

Future directions

● MPC model
– Point location queries, multiway separators for planar graphs ...
– Big open problem – general graph connectivity in rounds
– Other open problems in parallel query processing in databases [Koutris et al. 2018]

● Gromov-Hausdorff distance
– Big gap b/w upper and lower bound :(
– More research into additive distortion of metric embeddings

Future directions

● Trajectory clustering
– Efficient -approx. to k-center, k-median, k-means for say Fréchet distance
– Stumbling block – infinite doubling dimension
– Work by [Driemel et al. 2016] on clustering time-series data
● Running time is exponential in complexity of cluster centers – assumed to be constant
● Is it a good assumption??
– What are good assumptions? Perturbation resilience? Stability?
● Can anything interesting be proved?

Acknowledgements

Committee : Pankaj, Kamesh, Rong, Yusu

Collaborators : Pankaj, Kamesh, Yusu, Kyle, Tasos, Jiangwei, Erin

Theory group

CS@Duke

● Ergys and Cassie; other students ...

● Marilyn, Pam, Celeste, Alison, Kathleen …

● CS Lab staff

Outside Duke
