Presented by: Zohreh Raghebi
Fall 2015
Foteini Katsarou
Nikos Ntarmos
Peter Triantafillou
University of Glasgow, UK
Graph data management systems have become very popular
One of the main problems for these systems is subgraph query processing
Given a query graph, return all graphs that contain the query.
The naive approach performs a subgraph isomorphism test against each graph in the dataset
This does not scale, as subgraph isomorphism is NP-complete
Many indexing methods exist to reduce the number of candidate graphs for the subgraph isomorphism test
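The filter-and-verify pipeline these indexing methods share can be sketched as follows. This is a minimal illustration, not any surveyed system's actual index: the graph representation, the coarse label-multiset filter, and the brute-force verifier are all simplifying assumptions.

```python
from collections import Counter
from itertools import permutations

# Toy graphs: (adjacency dict {node: set(neighbors)}, node-label dict).
def make_graph(edges, labels):
    adj = {v: set() for v in labels}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj, labels

def label_filter(query_labels, data_labels):
    """Filtering step: a graph can contain the query only if it has at
    least as many nodes of each label (a deliberately coarse index)."""
    q, d = Counter(query_labels.values()), Counter(data_labels.values())
    return all(d[l] >= c for l, c in q.items())

def subgraph_isomorphic(q_adj, q_lab, g_adj, g_lab):
    """Verification step: brute-force search over injective mappings.
    Exponential, which is exactly why the filter must keep the number
    of candidate graphs reaching this step small."""
    q_nodes = list(q_adj)
    for image in permutations(g_adj, len(q_nodes)):
        m = dict(zip(q_nodes, image))
        if any(q_lab[u] != g_lab[m[u]] for u in q_nodes):
            continue
        if all(m[v] in g_adj[m[u]] for u in q_nodes for v in q_adj[u]):
            return True
    return False

def query(query_graph, dataset):
    qa, ql = query_graph
    candidates = [g for g in dataset if label_filter(ql, g[1])]  # filter
    return [g for g in candidates if subgraph_isomorphic(qa, ql, *g)]  # verify
```

The surveyed methods differ mainly in what the filter indexes (paths, trees, cycles) and how features are mined or enumerated.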
A set of key factors-parameters, that influence the performance of related methods:
the number of nodes per graph
the graph density
the number of distinct labels
the number of graphs in the dataset
and the query graph size
First, to derive conclusions about the algorithms’ relative performance
Second, to highlight how both performance and scalability depend on the above factors
six well established indexing methods, namely:
Grapes, CT-Index, GraphGrepSX, gIndex, Tree+∆, and gCode
Most related works are tested against the AIDS antiviral dataset and synthetic datasets,
both formed of many small graphs
Grapes alone used several real datasets
The authors did not evaluate scalability
The iGraph comparison framework compared the performance of older algorithms (up to 2010).
Since then, several, more efficient algorithms have been proposed
A linear increase in the number of nodes results in a quadratic increase in the number of edges;
The number of edges increases linearly with the graph density
The increase of the above two factors leads to a detrimental increase in indexing time
The number of graphs increases the overall complexity only linearly
The frequent-mining techniques are more sensitive because more features must be located across more graphs
An increase in the number of distinct labels leads to an easier dataset to index:
1. It results in fewer occurrences of any given feature
2. It decreases the false positive ratio of the various algorithms
Our findings give rise to the following adage: “Keep It Simple and Smart”.
The simpler the feature structure and extraction process, the faster the indexing and query processing algorithm
Frequent mining algorithms (gIndex, Tree+∆) may be competitive for small/sparse datasets
Techniques using exhaustive enumeration (Grapes, GGSX, CT-Index) are the clear winners
Especially those indexing simple features (paths; i.e., Grapes, GGSX)
rather than more complex features (trees, cycles; i.e., CT-Index)
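Exhaustive path enumeration, the style of feature extraction the winners use, can be sketched as below. This is a simplified illustration of the idea behind path-based indexes like GraphGrepSX and Grapes, not either system's exact index; the graph representation and the containment filter are assumptions.

```python
def label_paths(adj, labels, max_nodes):
    """Exhaustively enumerate the label sequences of all simple paths
    with up to max_nodes nodes, by DFS from every node."""
    found = set()
    def dfs(v, visited, seq):
        found.add(tuple(seq))
        if len(seq) == max_nodes:
            return
        for w in adj[v]:
            if w not in visited:
                dfs(w, visited | {w}, seq + [labels[w]])
    for v in adj:
        dfs(v, {v}, [labels[v]])
    return found

def path_filter(query_paths, graph_paths):
    # A data graph can contain the query only if every label path of
    # the query also occurs in the data graph.
    return query_paths <= graph_paths
```

Because paths are cheap to enumerate and compare, both indexing and filtering stay fast, which matches the survey's "Keep It Simple and Smart" conclusion.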
Industrial paper
Avery Ching
Sergey Edunov
Maja Kabiljo
Presented by: Zohreh Raghebi
Fall 2015
Analyzing real-world graphs at the scale of hundreds of billions of edges with available software is very difficult
Graph processing engines tend to have additional challenges in scaling to larger graphs
Apache Giraph is an iterative graph processing system designed to scale to process trillions of edges
Used at Facebook
Giraph was inspired by Pregel, the graph processing system developed at Google
However, Giraph initially did not scale to Facebook's needs,
with over 1.39B users and hundreds of billions of social connections
Facebook improved the platform to support these workloads
Giraph’s graph input model was only vertex-centric
Parallelizing Giraph relied on MapReduce's task-level parallelism
and it did not have multithreading support
Giraph’s flexible types were initially implemented using native Java objects
These consumed excessive memory and garbage collection time
The aggregator framework was inefficiently implemented in ZooKeeper
Need to support very large aggregators
We modified Giraph to allow loading vertex data and edges from separate sources
Parallelization support:
Use worker local multithreading to take advantage of additional CPU cores
Memory optimization :
By default, serialize the edges of every vertex into a byte array
rather than instantiating them as native Java objects
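The idea behind byte-array edge storage can be sketched as follows. Giraph itself is Java; this Python sketch with a fixed (target_id, weight) layout is only an illustration of the technique, not Giraph's actual on-heap format.

```python
import struct

# Pack each (target_id, weight) edge as 8 bytes (int32 + float32)
# into one flat buffer, instead of allocating one object per edge.
# Object headers and GC tracking per edge disappear entirely.
EDGE = struct.Struct("<if")

def serialize_edges(edges):
    buf = bytearray()
    for target, weight in edges:
        buf += EDGE.pack(target, weight)
    return bytes(buf)

def iterate_edges(buf):
    # Edges are decoded on the fly during iteration; nothing is
    # materialized between supersteps.
    for off in range(0, len(buf), EDGE.size):
        yield EDGE.unpack_from(buf, off)
```

The trade-off is re-deserialization on every traversal in exchange for a flat, GC-free representation, which is what makes trillion-edge graphs fit in memory.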
Aggregator architecture:
Each aggregator is now randomly assigned to one of the workers
Aggregation responsibilities are balanced across all workers
Not bottlenecked by the Master
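The sharded-aggregator idea can be sketched as below. The hash assignment and sum-only aggregators are assumptions for illustration; Giraph's actual implementation distributes arbitrary aggregator types over its worker RPC layer.

```python
import zlib

def owner(aggregator_name, num_workers):
    """Deterministically assign each aggregator to one worker, so every
    worker independently routes partial values to the right owner and
    no single master collects everything."""
    return zlib.crc32(aggregator_name.encode()) % num_workers

def aggregate(partials_per_worker, num_workers):
    # Each owning worker sums only the aggregators it is responsible for.
    owned = {w: {} for w in range(num_workers)}
    for partials in partials_per_worker:
        for name, value in partials.items():
            o = owner(name, num_workers)
            owned[o][name] = owned[o].get(name, 0) + value
    # The master merely concatenates per-owner results, touching each
    # final value once instead of every partial value.
    merged = {}
    for d in owned.values():
        merged.update(d)
    return merged
```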
Xiaofei Zhang, Hong Cheng, Lei Chen
Department of Computer Science & Engineering, HKUST
Presented by: Zohreh Raghebi
Fall 2015
Extracts the most prominent vertices
Returns a minimum set of vertices with:
The maximum importance (total betweenness) and shortest-path reachability in connecting two sets of input vertices
Applications: logistics planning, social community bonding
In social network studies: to understand information propagation and hidden correlations
To find the “bonding” of communities, an ideal bonding agent would:
1. Reside on as many cross-group pairwise shortest paths as possible
2. Connect as large a portion of the two groups as possible
Such agents could best serve the message passing between two groups
The VSB query ranks a vertex’s prominence with two factors:
betweenness and shortest path connectivity
Minimum cut finds the minimum set of edges whose removal turns a graph into two disjoint subgraphs
It does not capture how other vertices contribute to the connection between the sets
Top-k betweenness computations are employed to find important vertices in a network
However, due to the local dominance property of the betweenness metric
such queries cannot serve the vertex sets bonding properly
Two novel building blocks for the efficient VSB query evaluation:
Guided graph exploration with a vertex filtering scheme
to reduce redundant vertex accesses
The minimum set of vertices with the highest accumulated betweenness is returned as the bonding vertices
Betweenness ranking on-exploration
Instead of computing the exact betweenness value
Rank the betweenness of vertices during graph exploration
to save the computation cost.
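For contrast with the paper's on-exploration ranking, the exact computation it avoids is the standard all-sources betweenness algorithm (Brandes). A sketch for unweighted graphs, assuming an adjacency-dict representation:

```python
from collections import deque

def betweenness(adj):
    """Exact betweenness centrality via Brandes' algorithm on an
    unweighted graph: one BFS plus one dependency-accumulation pass
    per source. This full recomputation is the cost the VSB paper's
    on-exploration ranking is designed to avoid."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        stack, pred = [], {v: [] for v in adj}
        sigma = {v: 0 for v in adj}; sigma[s] = 1
        dist = {v: -1 for v in adj}; dist[s] = 0
        q = deque([s])
        while q:                      # BFS: shortest-path counts
            v = q.popleft()
            stack.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    q.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    pred[w].append(v)
        delta = {v: 0.0 for v in adj}
        while stack:                  # accumulate dependencies
            w = stack.pop()
            for v in pred[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return bc  # for undirected graphs, halve the values
```

This is O(VE) per full run; ranking vertices during exploration sidesteps paying that cost for the whole graph.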
Daniel Margo, Harvard University, Cambridge, MA ([email protected])
Margo Seltzer, Harvard University, Cambridge, MA ([email protected])
Presented by: Zohreh Raghebi
Fall 2015
Capable of handling graphs that far exceed main memory
High quality edge partitions
Graph partitioning is an important problem that affects many graph-structured systems
Partitioning quality greatly impacts the performance of distributed graph analysis frameworks
METIS: the gold standard for graph partitioning
A multi-level graph partitioning algorithm
These approaches do not scale to today’s large graphs
Streaming partitioners
A graph loader reads serial graph data from disk onto a cluster
It must decide the location of each node as it is loaded
The goal is to find an optimal balanced partitioning with as little computation as possible
They are sensitive to the stream order, which can affect performance
Streaming algorithms are difficult to parallelize
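A greedy streaming heuristic in the spirit of linear deterministic greedy (LDG) can be sketched as below; the scoring rule and tie-break are illustrative assumptions, not any specific system's exact rule. Note how the assignment depends entirely on arrival order, the sensitivity Sheep avoids.

```python
def stream_partition(stream, k, capacity):
    """Greedy streaming partitioner sketch: each arriving vertex goes
    to the partition holding most of its already-seen neighbors,
    penalized by that partition's fullness; ties go to the least
    loaded partition."""
    assign, sizes = {}, [0] * k
    for v, neighbors in stream:
        best, best_score = 0, float("-inf")
        for p in range(k):
            placed = sum(1 for u in neighbors if assign.get(u) == p)
            score = placed * (1 - sizes[p] / capacity)
            if score > best_score or (score == best_score
                                      and sizes[p] < sizes[best]):
                best, best_score = p, score
        assign[v] = best
        sizes[best] += 1
    return assign
```

Because each decision is final and reads the running partition state, the partitioner is hard to parallelize and its quality varies with stream order.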
Sheep partitions by a method that does not vary with how the input graph is distributed
Sheep can arbitrarily divide the input graph for parallelism and fit tasks in memory
Sheep reduces the input graph to a small elimination tree
Sheep’s tree transformation is a distributed map-reduce operation
Using simple degree ranking, Sheep creates competitive edge partitions
faster than other partitioners
Type: Research Paper
Authors: Tomohiro Manabe, Keishi Tajima
Presented by: Siddhant Kulkarni
Term: Fall 2015
How is logical structure different from the mark-up structure?
Difference between Human understanding and Browser interpretation
“Mark-up structure does not necessarily always correspond to the logical hierarchy”
Basic idea for webpages with improper tag usage
HTML5 solves this problem, but we cannot port all web pages to HTML5
Other techniques for document segmentation
Based on margins between blocks
Based on text density
Based on identification of important blocks
Etc.
Most rely on tags (so does this paper, but not entirely)
Define Blocks and Headings for their own structure extraction
Logical Hierarchy extraction using
Preprocessing
Heading based Page Segmentation
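The basic heading-based idea can be sketched with the Python stdlib parser: build the logical hierarchy from heading levels regardless of how the surrounding tags are nested. This toy only looks at h1-h6 tags; the paper's method is more general (it also recognizes visually styled headings), so treat this as an assumption-laden illustration.

```python
from html.parser import HTMLParser

class HeadingOutline(HTMLParser):
    """Extract a logical hierarchy from h1-h6 headings: each heading
    becomes a child of the nearest preceding heading with a smaller
    level, independent of the mark-up nesting around it."""
    def __init__(self):
        super().__init__()
        self.parents = {}   # heading text -> parent heading text (or None)
        self._stack = []    # open ancestors as (level, text)
        self._level = None  # level of the heading currently being read
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if len(tag) == 2 and tag[0] == "h" and tag[1].isdigit():
            self._level, self._buf = int(tag[1]), []

    def handle_data(self, data):
        if self._level is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if self._level is not None and tag == "h%d" % self._level:
            text = "".join(self._buf).strip()
            # Pop headings that cannot be ancestors of this level.
            while self._stack and self._stack[-1][0] >= self._level:
                self._stack.pop()
            self.parents[text] = self._stack[-1][1] if self._stack else None
            self._stack.append((self._level, text))
            self._level = None
```

Usage: feeding "<h1>A</h1><h2>B</h2><h2>C</h2><h3>D</h3>" yields parents A: None, B: A, C: A, D: C, which is exactly the parent/child relationship type the evaluation measures.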
Dataset: Web snapshot ClueWeb09 Category B document collection
Calculate accuracy based on Precision and Recall of extracted relationship
Types of relationships: parent, ancestor, sibling, child, descendant
Type: Industry Paper
Authors: Daniel Haas, Jason Ansel, Lydia Gu, Adam Marcus
Presented by: Siddhant Kulkarni
Term: Fall 2015
What is Macrotask Crowdsourcing?
What is the problem with it?
The related work focuses on Macrotasking and Crowdsourcing frameworks
Argonaut
Predictive models to identify trustworthy workers who can perform reviews
and a model to identify which tasks need review
Evaluates the trade-off between single and multiple phases of review based on budget
Presented by: Shahab Helmi
VLDB 2015 Paper Review Series
Fall 2015
Authors:
Moria Bergman, Tel-Aviv University
Tova Milo, Tel-Aviv University
Slava Novgorodov, Tel-Aviv University
Wang-Chiew Tan, University of California, Santa Cruz
Publication:
VLDB 2015
Type:
Demonstration Paper
It is important for the database to be as complete (no missing values) and correct (no wrong values) as possible. For this reason, many data cleaning tools have been developed to automatically resolve inconsistencies in databases. However, these tools:
are not able to remove all erroneous data:
95% accuracy in the YAGO database.
It is impossible to correct all errors manually in big datasets.
are not usually able to determine what information is missing from a database.
A novel query-oriented data cleaning system with oracle crowds.
It uses materialized views (i.e., query-oriented views which are defined through user queries) as triggers for identifying incorrect or missing information.
If an error (i.e., a wrong tuple or a missing tuple) in the materialized view is detected, the system interacts minimally with a crowd of oracles by asking only pertinent questions.
Answers to a question help identify the next pertinent questions to ask, and ultimately a sequence of edits is derived and applied to the underlying database.
Cleaning the entire database is not the goal of QOCO. It cleans parts of the database as needed. Hence, it can be used as a complementary tool alongside other cleaning tools.
Data cleaning techniques:
QOCO uses the crowd to correct query results.
QOCO propagates updates back to the underlying database.
QOCO discovers and inserts true tuples that are missing from the input database.
Crowdsourcing is a model where humans perform small tasks to help solve challenging problems such as
Entity/conflict resolution.
Duplicate detection.
Schema matching.
Consider a user query which searches for European teams that won the World Cup at least twice.
The result will contain ESP (which is wrong) and ITA will be missing.
Correct tuples
Wrong tuples
Missing tuples
Sets of tuples which together yield the wrong answer
{t1 = Game(11:07:10; ESP; NED; final; 1:0), t2 = Game(17:07:94; ESP; NED; final; 3:1), t3 = Team(ESP; EU)}
{t2 = Game(17:07:94; ESP; NED; final; 3:1), t4 = Game(12:07:98; ESP; NED; final; 4:2), t3 = Team(ESP; EU)}
{t4 = Game(12:07:98; ESP; NED; final; 4:2), t1 = Game(11:07:10; ESP; NED; final; 1:0), t3 = Team(ESP; EU)}
1. Find the most frequent tuple (t3) and ask the oracle whether it is true or not
t3 is correct -> the remaining candidates will be {t1, t2}, {t2, t4}, {t4, t1}
2. The rest of the tuples have the same frequency, so QOCO will choose one of them randomly, say t1
t1 is correct -> {t2}, {t2, t4}, {t4} -> ESP won the World Cup only once; hence both t2 and t4 are wrong and should be deleted!
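The frequency-based question selection in this example can be sketched as below. It is a simplified sketch of QOCO's strategy, not its full algorithm: each witness is a set of tuples that together produce the wrong answer, so at least one tuple per witness must be erroneous; the deterministic lexicographic tie-break is an assumption (the example breaks ties randomly).

```python
from collections import Counter

def resolve(witnesses, oracle):
    """Repeatedly ask the oracle about the tuple occurring in the most
    witness sets. Tuples confirmed correct are removed from every set;
    tuples confirmed wrong are scheduled for deletion and satisfy every
    set containing them."""
    witnesses = [set(w) for w in witnesses]
    to_delete, asked = set(), []
    while witnesses:
        freq = Counter(t for w in witnesses for t in w)
        # Most frequent tuple; ties broken lexicographically for a
        # deterministic run (the paper picks randomly among ties).
        t = max(freq, key=lambda x: (freq[x], x))
        asked.append(t)
        if oracle(t):          # oracle says the tuple is correct
            witnesses = [w - {t} for w in witnesses]
        else:                  # tuple is erroneous: delete it
            to_delete.add(t)
            witnesses = [w for w in witnesses if t not in w]
        witnesses = [w for w in witnesses if w]
    return to_delete, asked
```

On the ESP witness sets above, t3 is asked first (it appears in all three sets), and the greedy loop ends up deleting the two wrong Game tuples.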
More details can be found in the original research paper:
M. Bergman, T. Milo, S. Novgorodov, and W. Tan. Query-oriented data cleaning with oracles. In ACM SIGMOD, 2015
Presented by: Shahab Helmi
VLDB 2015 Paper Review Series
Fall 2015
Authors:
Weimo Liu, The George Washington University
Md Farhadur Rahman, University of Texas at Arlington
Saravanan Thirumuruganathan, University of Texas at Arlington
Nan Zhang, The George Washington University
Gautam Das, University of Texas at Arlington
Publication:
VLDB 2015
Type:
Research Paper
Location-based services (LBS)
Location-returned services (LR-LBS): these services return the locations of the k returned tuples.
Google Maps.
Location-not-returned services (LNR-LBS): these services do not return the locations of the k tuples, but return other attributes such as ID, ranking, etc.
Sina Weibo
k-nearest-neighbors (kNN) queries: return the k nearest tuples to the query point according to a ranking function (Euclidean distance in this paper).
LBS with a kNN interface: third-party applications and/or end users do not have complete and direct access to the entire database. The database is essentially “hidden”, and access is typically limited to a restricted public web query interface or API.
These interfaces impose some constraints:
Query rate limit: e.g., 10,000 queries per user per day in Google Maps
Maximum coverage limit: e.g., only tuples within 5 miles of the query point are returned
Aggregate estimations: for many interesting third-party applications, it is important to collect aggregate statistics over the tuples contained in such hidden databases, such as sums, counts, or distributions of the tuples satisfying certain selection conditions.
A hotel recommendation application would like to know the average review scores for Marriott vs. Hilton hotels in Google Maps;
a cafe chain startup would like to know the number of Starbucks restaurants in a certain geographical region;
a demographics researcher may wish to know the gender ratio of users of social networks in China, etc.
Aggregate information can be obtained by:
Entering into data sharing agreements with the location-based service providers; but this approach can often be extremely expensive, and sometimes impossible if the data owners are unwilling to share their data.
Crawling the whole database through the limited interface would take too long.
Goals:
Obtain approximate estimates of such aggregates by only querying the database via its restrictive public interface.
Minimize the query cost (i.e., ask as few queries as possible) in an effort to adhere to the rate limits or budgetary constraints imposed by the interface.
Make the aggregate estimations as accurate as possible.
Analytics and Inference over LBS:
Estimating COUNT and SUM aggregates.
Error reduction, such as bias correction
Aggregate Estimations over Hidden Web Repositories:
Unbiased estimators for COUNT and SUM aggregates for static databases.
Efficient techniques to obtain random samples from hidden web databases that can then be utilized to perform aggregate estimation.
Estimating the size of search engines.
For LR-LBS interfaces: the developed algorithm (LR-LBS-AGG), for estimating COUNT and SUM aggregates, represents a significant improvement over prior work along multiple dimensions:
a novel way of precisely computing Voronoi cells leads to completely unbiased estimations;
top-k returned tuples are leveraged rather than only the top-1; several innovative techniques are developed for reducing error and increasing efficiency.
For LNR-LBS interfaces: the developed algorithm (LNR-LBS-AGG) addresses a novel problem with no prior work.
The algorithm is not bias-free, but the bias can be controlled to any desired precision.
[Figure: Top-1 Voronoi diagram vs. Top-2 Voronoi diagram]
In a Voronoi diagram, for each point there is a corresponding region consisting of all points closer to that point than to any other.
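Why Voronoi cells matter here: the probability that a uniformly random top-1 query returns tuple t equals the area of t's cell, which is the weight an unbiased (Horvitz-Thompson style) COUNT estimator needs. The paper computes cells precisely; the sketch below instead estimates cell areas by Monte Carlo sampling against a simulated top-1 kNN interface, an assumption-level stand-in for the real algorithm.

```python
import random

def nearest(point, tuples):
    """Simulates a top-1 kNN interface: returns the tuple closest to
    the query point (squared Euclidean distance)."""
    return min(tuples, key=lambda t: (t[0] - point[0]) ** 2
                                     + (t[1] - point[1]) ** 2)

def estimate_cell_areas(tuples, n_samples, rng):
    """Estimate each tuple's top-1 Voronoi cell area, as a fraction of
    the unit square, by issuing top-1 queries at uniformly random
    points and counting how often each tuple is returned."""
    hits = {t: 0 for t in tuples}
    for _ in range(n_samples):
        p = (rng.random(), rng.random())
        hits[nearest(p, tuples)] += 1
    return {t: h / n_samples for t, h in hits.items()}
```

Each query "costs" one interface call, so the accuracy/cost trade-off of this sketch mirrors the budget constraints discussed above; precise cell computation removes the sampling error entirely.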
1. Precisely Compute Voronoi Cells
Faster Initialization
Leverage history on Voronoi cell computation
2. Error Reduction
Bias error removal/reduction
Variance reduction
...
Datasets:
Offline real-world dataset (OpenStreetMap, USA portion): to verify the correctness of the algorithm.
Online LBS demonstrations: to evaluate the efficiency of the algorithm.
Google Maps
Sina Weibo