Presented by: Zohreh Raghebi
Fall 2015
Foteini Katsarou
Nikos Ntarmos
Peter Triantafillou
University of Glasgow, UK
Graph data management systems have become very popular
One of the main problems for these systems is subgraph query processing
Given a query graph, return all graphs that contain the query.
The naive approach performs a subgraph isomorphism test against each graph in the dataset
This does not scale, as subgraph isomorphism is NP-complete
Many indexing methods exist to reduce the number of candidate graphs for the subgraph isomorphism test
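The filter-and-verify pipeline these indexing methods share can be sketched as follows. This is a minimal illustration, not any surveyed system's actual index: the graph representation, the coarse label-multiset filter, and the brute-force verifier are all simplifying assumptions.

```python
from collections import Counter
from itertools import permutations

# Toy graphs: (adjacency dict {node: set(neighbors)}, node-label dict).
def make_graph(edges, labels):
    adj = {v: set() for v in labels}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj, labels

def label_filter(query_labels, data_labels):
    """Filtering step: a graph can contain the query only if it has at
    least as many nodes of each label (a deliberately coarse index)."""
    q, d = Counter(query_labels.values()), Counter(data_labels.values())
    return all(d[l] >= c for l, c in q.items())

def subgraph_isomorphic(q_adj, q_lab, g_adj, g_lab):
    """Verification step: brute-force search over injective mappings.
    Exponential, which is exactly why the filter must keep the number
    of candidate graphs reaching this step small."""
    q_nodes = list(q_adj)
    for image in permutations(g_adj, len(q_nodes)):
        m = dict(zip(q_nodes, image))
        if any(q_lab[u] != g_lab[m[u]] for u in q_nodes):
            continue
        if all(m[v] in g_adj[m[u]] for u in q_nodes for v in q_adj[u]):
            return True
    return False

def query(query_graph, dataset):
    qa, ql = query_graph
    candidates = [g for g in dataset if label_filter(ql, g[1])]  # filter
    return [g for g in candidates if subgraph_isomorphic(qa, ql, *g)]  # verify
```

The surveyed methods differ mainly in what the filter indexes (paths, trees, cycles) and how features are mined or enumerated.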
A set of key factors-parameters, that influence the performance of related methods:
the number of nodes per graph
the graph density
the number of distinct labels
the number of graphs in the dataset
and the query graph size
First, to derive conclusions about the algorithms’ relative performance
Second, to highlight how both performance and scalability depend on the above factors
six well established indexing methods, namely:
Grapes, CT-Index, GraphGrepSX, gIndex, Tree+∆, and gCode
Most related works are tested against the AIDS antiviral dataset and synthetic datasets,
both formed of many small graphs
Grapes alone used several real datasets
The authors did not evaluate scalability
The iGraph comparison framework compared the performance of older algorithms (up to 2010).
Since then, several, more efficient algorithms have been proposed
A linear increase in the number of nodes results in a quadratic increase in the number of edges;
The number of edges increases linearly with the graph density
The increase of the above two factors leads to a detrimental increase in indexing time
The number of graphs increases the overall complexity only linearly
The frequent-mining techniques are more sensitive because more features must be located across more graphs
An increase in the number of distinct labels leads to an easier dataset to index:
1. It results in fewer occurrences of any given feature
2. It decreases the false positive ratio of the various algorithms
Our findings give rise to the following adage: “Keep It Simple and Smart”.
The simpler the feature structure and extraction process, the faster the indexing and query processing algorithm
Frequent mining algorithms (gIndex, Tree+∆) may be competitive for small/sparse datasets
Techniques using exhaustive enumeration (Grapes, GGSX, CT-Index) are the clear winners
Especially those indexing simple features (paths; i.e., Grapes, GGSX)
rather than more complex features (trees, cycles; i.e., CT-Index)
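Exhaustive path enumeration, the style of feature extraction the winners use, can be sketched as below. This is a simplified illustration of the idea behind path-based indexes like GraphGrepSX and Grapes, not either system's exact index; the graph representation and the containment filter are assumptions.

```python
def label_paths(adj, labels, max_nodes):
    """Exhaustively enumerate the label sequences of all simple paths
    with up to max_nodes nodes, by DFS from every node."""
    found = set()
    def dfs(v, visited, seq):
        found.add(tuple(seq))
        if len(seq) == max_nodes:
            return
        for w in adj[v]:
            if w not in visited:
                dfs(w, visited | {w}, seq + [labels[w]])
    for v in adj:
        dfs(v, {v}, [labels[v]])
    return found

def path_filter(query_paths, graph_paths):
    # A data graph can contain the query only if every label path of
    # the query also occurs in the data graph.
    return query_paths <= graph_paths
```

Because paths are cheap to enumerate and compare, both indexing and filtering stay fast, which matches the survey's "Keep It Simple and Smart" conclusion.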
Industrial paper
Avery Ching
Sergey Edunov
Maja Kabiljo
Presented by: Zohreh Raghebi
Fall 2015
Analyzing real-world graphs at the scale of hundreds of billions of edges with available software is very difficult
Graph processing engines tend to have additional challenges in scaling to larger graphs
Apache Giraph is an iterative graph processing system designed to scale to process trillions of edges
Used at Facebook
Giraph was inspired by Pregel, the graph processing system developed at Google
However, Giraph initially did not scale to Facebook's needs,
with over 1.39B users and hundreds of billions of social connections
Facebook improved the platform to support these workloads
Giraph’s graph input model was only vertex-centric
Parallelizing Giraph relied on MapReduce's task-level parallelism
and it did not have multithreading support
Giraph’s flexible types were initially implemented using native Java objects
These consumed excessive memory and garbage collection time
The aggregator framework was inefficiently implemented in ZooKeeper
Need to support very large aggregators
We modified Giraph to allow loading vertex data and edges from separate sources
Parallelization support:
Use worker local multithreading to take advantage of additional CPU cores
Memory optimization :
By default, serialize the edges of every vertex into a byte array
rather than instantiating them as native Java objects
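The idea behind byte-array edge storage can be sketched as follows. Giraph itself is Java; this Python sketch with a fixed (target_id, weight) layout is only an illustration of the technique, not Giraph's actual on-heap format.

```python
import struct

# Pack each (target_id, weight) edge as 8 bytes (int32 + float32)
# into one flat buffer, instead of allocating one object per edge.
# Object headers and GC tracking per edge disappear entirely.
EDGE = struct.Struct("<if")

def serialize_edges(edges):
    buf = bytearray()
    for target, weight in edges:
        buf += EDGE.pack(target, weight)
    return bytes(buf)

def iterate_edges(buf):
    # Edges are decoded on the fly during iteration; nothing is
    # materialized between supersteps.
    for off in range(0, len(buf), EDGE.size):
        yield EDGE.unpack_from(buf, off)
```

The trade-off is re-deserialization on every traversal in exchange for a flat, GC-free representation, which is what makes trillion-edge graphs fit in memory.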
Aggregator architecture:
Each aggregator is now randomly assigned to one of the workers
Aggregation responsibilities are balanced across all workers
Not bottlenecked by the Master
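The sharded-aggregator idea can be sketched as below. The hash assignment and sum-only aggregators are assumptions for illustration; Giraph's actual implementation distributes arbitrary aggregator types over its worker RPC layer.

```python
import zlib

def owner(aggregator_name, num_workers):
    """Deterministically assign each aggregator to one worker, so every
    worker independently routes partial values to the right owner and
    no single master collects everything."""
    return zlib.crc32(aggregator_name.encode()) % num_workers

def aggregate(partials_per_worker, num_workers):
    # Each owning worker sums only the aggregators it is responsible for.
    owned = {w: {} for w in range(num_workers)}
    for partials in partials_per_worker:
        for name, value in partials.items():
            o = owner(name, num_workers)
            owned[o][name] = owned[o].get(name, 0) + value
    # The master merely concatenates per-owner results, touching each
    # final value once instead of every partial value.
    merged = {}
    for d in owned.values():
        merged.update(d)
    return merged
```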
Xiaofei Zhang, Hong Cheng, Lei Chen
Department of Computer Science & Engineering, HKUST
Presented by: Zohreh Raghebi
Fall 2015
Extracts the most prominent vertices
Returns a minimum set of vertices with:
The maximum importance (total betweenness) and shortest-path reachability in connecting two sets of input vertices
Applications: logistics planning, social community bonding
In social network studies: to understand information propagation and hidden correlations
To find the “bonding” of communities, an ideal bonding agent would:
1. Reside on as many cross-group pairwise shortest paths as possible
2. Connect as large a portion of the two groups as possible
Such agents could best serve the message passing between two groups
The VSB query ranks a vertex’s prominence with two factors:
betweenness and shortest path connectivity
Minimum cut finds the minimum set of edges whose removal turns a graph into two disjoint subgraphs
It does not capture how other vertices contribute to the connection between the sets
Top-k betweenness computations are employed to find important vertices in a network
However, due to the local dominance property of the betweenness metric
such queries cannot serve the vertex sets bonding properly
Two novel building blocks for the efficient VSB query evaluation:
Guided graph exploration with a vertex filtering scheme
to reduce redundant vertex accesses
The minimum set of vertices with the highest accumulated betweenness is returned as the bonding vertices
Betweenness ranking on-exploration
Instead of computing the exact betweenness value
Rank the betweenness of vertices during graph exploration
to save the computation cost.
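For contrast with the paper's on-exploration ranking, the exact computation it avoids is the standard all-sources betweenness algorithm (Brandes). A sketch for unweighted graphs, assuming an adjacency-dict representation:

```python
from collections import deque

def betweenness(adj):
    """Exact betweenness centrality via Brandes' algorithm on an
    unweighted graph: one BFS plus one dependency-accumulation pass
    per source. This full recomputation is the cost the VSB paper's
    on-exploration ranking is designed to avoid."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        stack, pred = [], {v: [] for v in adj}
        sigma = {v: 0 for v in adj}; sigma[s] = 1
        dist = {v: -1 for v in adj}; dist[s] = 0
        q = deque([s])
        while q:                      # BFS: shortest-path counts
            v = q.popleft()
            stack.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    q.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    pred[w].append(v)
        delta = {v: 0.0 for v in adj}
        while stack:                  # accumulate dependencies
            w = stack.pop()
            for v in pred[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return bc  # for undirected graphs, halve the values
```

This is O(VE) per full run; ranking vertices during exploration sidesteps paying that cost for the whole graph.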
Daniel Margo, Harvard University, Cambridge, MA ([email protected])
Margo Seltzer, Harvard University, Cambridge, MA ([email protected])
Presented by: Zohreh Raghebi
Fall 2015
Capable of handling graphs that far exceed main memory
High quality edge partitions
Graph partitioning is an important problem that affects many graph-structured systems
Partitioning quality greatly impacts the performance of distributed graph analysis frameworks
METIS: the gold standard for graph partitioning
A multi-level graph partitioning algorithm
These approaches do not scale to today’s large graphs
Streaming partitioners
A graph loader reads serial graph data from disk onto a cluster
It must decide the location of each node as it is loaded
The goal is to find an optimal balanced partitioning with as little computation as possible
They are sensitive to the stream order, which can affect performance
Streaming algorithms are difficult to parallelize
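A greedy streaming heuristic in the spirit of linear deterministic greedy (LDG) can be sketched as below; the scoring rule and tie-break are illustrative assumptions, not any specific system's exact rule. Note how the assignment depends entirely on arrival order, the sensitivity Sheep avoids.

```python
def stream_partition(stream, k, capacity):
    """Greedy streaming partitioner sketch: each arriving vertex goes
    to the partition holding most of its already-seen neighbors,
    penalized by that partition's fullness; ties go to the least
    loaded partition."""
    assign, sizes = {}, [0] * k
    for v, neighbors in stream:
        best, best_score = 0, float("-inf")
        for p in range(k):
            placed = sum(1 for u in neighbors if assign.get(u) == p)
            score = placed * (1 - sizes[p] / capacity)
            if score > best_score or (score == best_score
                                      and sizes[p] < sizes[best]):
                best, best_score = p, score
        assign[v] = best
        sizes[best] += 1
    return assign
```

Because each decision is final and reads the running partition state, the partitioner is hard to parallelize and its quality varies with stream order.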
Sheep partitions by a method that does not vary with how the input graph is distributed
Sheep can arbitrarily divide the input graph for parallelism and fit tasks in memory
Sheep reduces the input graph to a small elimination tree
Sheep’s tree transformation is a distributed map-reduce operation
Using simple degree ranking, Sheep creates competitive edge partitions
faster than other partitioners
Type: Research Paper
Authors: Tomohiro Manabe, Keishi Tajima
Presented by: Siddhant Kulkarni
Term: Fall 2015
How is logical structure different from the mark-up structure?
Difference between Human understanding and Browser interpretation
“Mark-up structure does not necessarily always correspond to the logical hierarchy”
Basic idea for webpages with improper tag usage
HTML5 solves this problem, but we cannot port all web pages to HTML5
Other techniques for document segmentation
Based on margins between blocks
Based on text density
Based on identification of important blocks
Etc.
Most rely on tags (so does this paper, but not entirely)
Define Blocks and Headings for their own structure extraction
Logical Hierarchy extraction using
Preprocessing
Heading based Page Segmentation
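The basic heading-based idea can be sketched with the Python stdlib parser: build the logical hierarchy from heading levels regardless of how the surrounding tags are nested. This toy only looks at h1-h6 tags; the paper's method is more general (it also recognizes visually styled headings), so treat this as an assumption-laden illustration.

```python
from html.parser import HTMLParser

class HeadingOutline(HTMLParser):
    """Extract a logical hierarchy from h1-h6 headings: each heading
    becomes a child of the nearest preceding heading with a smaller
    level, independent of the mark-up nesting around it."""
    def __init__(self):
        super().__init__()
        self.parents = {}   # heading text -> parent heading text (or None)
        self._stack = []    # open ancestors as (level, text)
        self._level = None  # level of the heading currently being read
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if len(tag) == 2 and tag[0] == "h" and tag[1].isdigit():
            self._level, self._buf = int(tag[1]), []

    def handle_data(self, data):
        if self._level is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if self._level is not None and tag == "h%d" % self._level:
            text = "".join(self._buf).strip()
            # Pop headings that cannot be ancestors of this level.
            while self._stack and self._stack[-1][0] >= self._level:
                self._stack.pop()
            self.parents[text] = self._stack[-1][1] if self._stack else None
            self._stack.append((self._level, text))
            self._level = None
```

Usage: feeding "<h1>A</h1><h2>B</h2><h2>C</h2><h3>D</h3>" yields parents A: None, B: A, C: A, D: C, which is exactly the parent/child relationship type the evaluation measures.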
Dataset: Web snapshot ClueWeb09 Category B document collection
Calculate accuracy based on Precision and Recall of extracted relationship
Types of relationships: parent, ancestor, sibling, child, descendant
Type: Industry Paper
Authors: Daniel Haas, Jason Ansel, Lydia Gu, Adam Marcus
Presented by: Siddhant Kulkarni
Term: Fall 2015
What is Macrotask Crowdsourcing?
What is the problem with it?
The related work focuses on Macrotasking and Crowdsourcing frameworks
Argonaut
Predictive models to identify trustworthy workers who can perform reviews
and a model to identify which tasks need review
Evaluates the trade-off between single and multiple phases of review based on budget
Presented by: Shahab Helmi
VLDB 2015 Paper Review Series
Fall 2015
Authors:
Moria Bergman, Tel-Aviv University
Tova Milo, Tel-Aviv University
Slava Novgorodov, Tel-Aviv University
Wang-Chiew Tan, University of California, Santa Cruz
Publication:
VLDB 2015
Type:
Demonstration Paper
It is important for the database to be as complete (no missing values) and correct (no wrong values) as possible. For this reason, many data cleaning tools have been developed to automatically resolve inconsistencies in databases. However, these tools:
are not able to remove all erroneous data:
95% accuracy in the YAGO database.
It is impossible to correct all errors manually in big datasets.
are not usually able to determine what information is missing from a database.
A novel query-oriented data cleaning system with oracle crowds.
It uses materialized views (i.e., query-oriented views which are defined through user queries) as triggers for identifying incorrect or missing information.
If an error (i.e., a wrong tuple or a missing tuple) in the materialized view is detected, the system interacts minimally with a crowd of oracles by asking only pertinent questions.
Answers to a question help identify the next pertinent questions to ask, and ultimately a sequence of edits is derived and applied to the underlying database.
Cleaning the entire database is not the goal of QOCO. It cleans parts of the database as needed. Hence, it can be used as a complementary tool alongside other cleaning tools.
Data cleaning techniques:
QOCO uses the crowd to correct query results.
QOCO propagates updates back to the underlying database.
QOCO discovers and inserts true tuples that are missing from the input database.
Crowdsourcing is a model where humans perform small tasks to help solve challenging problems such as
Entity/conflict resolution.
Duplicate detection.
Schema matching.
Consider a user query which searches for European teams that won the World Cup at least twice.
The result will contain ESP (which is wrong) and ITA will be missing.
Correct tuples
Wrong tuples
Missing tuples
Sets of tuples which together yield the wrong answer
{t1 = Game(11:07:10; ESP; NED; final; 1:0), t2 = Game(17:07:94; ESP; NED; final; 3:1), t3 = Team(ESP; EU)}
{t2 = Game(17:07:94; ESP; NED; final; 3:1), t4 = Game(12:07:98; ESP; NED; final; 4:2), t3 = Team(ESP; EU)}
{t4 = Game(12:07:98; ESP; NED; final; 4:2), t1 = Game(11:07:10; ESP; NED; final; 1:0), t3 = Team(ESP; EU)}
1. Find the most frequent tuple (t3) and ask the oracle whether it is true or not
t3 is correct -> the remaining candidates will be {t1, t2}, {t2, t4}, {t4, t1}
2. The rest of the tuples have the same frequency, so QOCO will choose one of them randomly, say t1
t1 is correct -> {t2}, {t2, t4}, {t4} -> ESP won the World Cup only once; hence both t2 and t4 are wrong and should be deleted!
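The frequency-based question selection in this example can be sketched as below. It is a simplified sketch of QOCO's strategy, not its full algorithm: each witness is a set of tuples that together produce the wrong answer, so at least one tuple per witness must be erroneous; the deterministic lexicographic tie-break is an assumption (the example breaks ties randomly).

```python
from collections import Counter

def resolve(witnesses, oracle):
    """Repeatedly ask the oracle about the tuple occurring in the most
    witness sets. Tuples confirmed correct are removed from every set;
    tuples confirmed wrong are scheduled for deletion and satisfy every
    set containing them."""
    witnesses = [set(w) for w in witnesses]
    to_delete, asked = set(), []
    while witnesses:
        freq = Counter(t for w in witnesses for t in w)
        # Most frequent tuple; ties broken lexicographically for a
        # deterministic run (the paper picks randomly among ties).
        t = max(freq, key=lambda x: (freq[x], x))
        asked.append(t)
        if oracle(t):          # oracle says the tuple is correct
            witnesses = [w - {t} for w in witnesses]
        else:                  # tuple is erroneous: delete it
            to_delete.add(t)
            witnesses = [w for w in witnesses if t not in w]
        witnesses = [w for w in witnesses if w]
    return to_delete, asked
```

On the ESP witness sets above, t3 is asked first (it appears in all three sets), and the greedy loop ends up deleting the two wrong Game tuples.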
More details can be found in the original research paper:
M. Bergman, T. Milo, S. Novgorodov, and W. Tan. Query-oriented data cleaning with oracles. In ACM SIGMOD, 2015
Presented by: Shahab Helmi
VLDB 2015 Paper Review Series
Fall 2015
Authors:
Weimo Liu, The George Washington University
Md Farhadur Rahman, University of Texas at Arlington
Saravanan Thirumuruganathan, University of Texas at Arlington
Nan Zhang, The George Washington University
Gautam Das, University of Texas at Arlington
Publication:
VLDB 2015
Type:
Research Paper
Location-based services (LBS)
Location-returned services (LR-LBS): these services return the locations of the k returned tuples.
Google Maps.
Location-not-returned services (LNR-LBS): these services do not return the locations of the k tuples, but return other attributes such as ID, ranking, etc.
Sina Weibo
k-nearest-neighbors (kNN) queries: return the k nearest tuples to the query point according to a ranking function (Euclidean distance in this paper).
LBS with a kNN interface: third-party applications and/or end users do not have complete and direct access to the entire database. The database is essentially “hidden”, and access is typically limited to a restricted public web query interface or API.
These interfaces impose some constraints:
Query rate limit: e.g., 10,000 queries per user per day in Google Maps
Maximum coverage limit: e.g., only tuples within 5 miles of the query point are returned
Aggregate estimations: for many interesting third-party applications, it is important to collect aggregate statistics over the tuples contained in such hidden databases, such as sums, counts, or distributions of the tuples satisfying certain selection conditions.
A hotel recommendation application would like to know the average review scores for Marriott vs. Hilton hotels in Google Maps;
a cafe chain startup would like to know the number of Starbucks restaurants in a certain geographical region;
a demographics researcher may wish to know the gender ratio of users of social networks in China, etc.
Aggregate information can be obtained by:
Entering into data sharing agreements with the location-based service providers; but this approach can often be extremely expensive, and sometimes impossible if the data owners are unwilling to share their data.
Crawling the whole database through the limited interface would take too long.
Goals:
Obtain approximate estimates of such aggregates by only querying the database via its restrictive public interface.
Minimize the query cost (i.e., ask as few queries as possible) in an effort to adhere to the rate limits or budgetary constraints imposed by the interface.
Make the aggregate estimations as accurate as possible.
Analytics and Inference over LBS:
Estimating COUNT and SUM aggregates.
Error reduction, such as bias correction
Aggregate Estimations over Hidden Web Repositories:
Unbiased estimators for COUNT and SUM aggregates for static databases.
Efficient techniques to obtain random samples from hidden web databases that can then be utilized to perform aggregate estimation.
Estimating the size of search engines.
For LR-LBS interfaces: the developed algorithm (LR-LBS-AGG), for estimating COUNT and SUM aggregates, represents a significant improvement over prior work along multiple dimensions:
a novel way of precisely computing Voronoi cells leads to completely unbiased estimations;
top-k returned tuples are leveraged rather than only the top-1; several innovative techniques are developed for reducing error and increasing efficiency.
For LNR-LBS interfaces: the developed algorithm (LNR-LBS-AGG) addresses a novel problem with no prior work.
The algorithm is not bias-free, but the bias can be controlled to any desired precision.
[Figure: Top-1 Voronoi diagram vs. Top-2 Voronoi diagram]
In a Voronoi diagram, for each point there is a corresponding region consisting of all points closer to that point than to any other.
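Why Voronoi cells matter here: the probability that a uniformly random top-1 query returns tuple t equals the area of t's cell, which is the weight an unbiased (Horvitz-Thompson style) COUNT estimator needs. The paper computes cells precisely; the sketch below instead estimates cell areas by Monte Carlo sampling against a simulated top-1 kNN interface, an assumption-level stand-in for the real algorithm.

```python
import random

def nearest(point, tuples):
    """Simulates a top-1 kNN interface: returns the tuple closest to
    the query point (squared Euclidean distance)."""
    return min(tuples, key=lambda t: (t[0] - point[0]) ** 2
                                     + (t[1] - point[1]) ** 2)

def estimate_cell_areas(tuples, n_samples, rng):
    """Estimate each tuple's top-1 Voronoi cell area, as a fraction of
    the unit square, by issuing top-1 queries at uniformly random
    points and counting how often each tuple is returned."""
    hits = {t: 0 for t in tuples}
    for _ in range(n_samples):
        p = (rng.random(), rng.random())
        hits[nearest(p, tuples)] += 1
    return {t: h / n_samples for t, h in hits.items()}
```

Each query "costs" one interface call, so the accuracy/cost trade-off of this sketch mirrors the budget constraints discussed above; precise cell computation removes the sampling error entirely.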
1. Precisely Compute Voronoi Cells
Faster Initialization
Leverage history on Voronoi cell computation
2. Error Reduction
Bias error removal/reduction
Variance reduction
...
Datasets:
Offline real-world dataset (OpenStreetMap, USA portion): to verify the correctness of the algorithm.
Online LBS demonstrations: to evaluate the efficiency of the algorithm.
Google Maps
Sina Weibo