Work-Efficient Parallel Skyline Computation for the GPU

Authors: Kenneth S. Bogh, Sean Chester, Ira Assent (Data-Intensive Systems Group, Aarhus University)

Type: Research Paper

Presented by: Dardan Xhymshiti

Fall 2015

Outline

1. Introduction
2. Skyline computation
3. Related work
4. GPU-friendly partitioning
5. The SkyAlign algorithm
6. Experimental evaluation

Introduction Skyline operator:

First introduced by:

Stephan Börzsönyi, Donald Kossmann, Konrad Stocker, 2001

(Universität Passau and Technische Universität München, Germany)

Introduction Skyline operator: Example:

1. You go for a one-day skiing trip at one of Colorado's ski centers.
2. You have spent a lot of money.
3. Your car breaks down.
4. You try to find the nearest and cheapest hotel.
5. You take your phone and launch some tourist application.
6. It lists a lot of hotels in different locations with a variety of prices.
7. You want to find the CHEAPEST and the NEAREST one!?


Query results:

Result query   Price   Distance (miles)
Hotel A        $120    1.5
Hotel B        $140    1
Hotel C        $200    2
Hotel D        $150    0.7
…              …       …

[Figure: scatter plot of the hotels, price (120 to 200) against distance in miles (0.7 to 2).]

Skyline set = {Hotel A, Hotel B, Hotel D}

Term: Dominance

Introduction Major problems:

Multidimensional data.

Computation intensive.

Tuple-to-tuple (point-to-point) comparisons.

What has been done so far: state-of-the-art sequential algorithms and parallel skyline query processing algorithms.

Parallel algorithms often try to reach the device's maximum theoretical compute throughput.

Throughput is costly. GSS, the most efficient GPU algorithm, does up to 650 times more work than the best sequential algorithm, even when running on 2688 cores.

On benchmark datasets, sequential algorithms run 3x faster than GPU ones.

Should we use the GPU or NOT?

Introduction Sequential algorithms achieve high performance by using:

Trees

Recursion

Strict ordering of computation.

Unpredictable branching.

Many of these techniques are not compatible with the GPU.

Motivation:

Come up with a new algorithm, called SkyAlign, which:

MAIN POINT: Avoids point-to-point comparisons as much as it can.

Employs a globally static grid scheme to make the dataset compatible with the GPU.

Does not maximize THROUGHPUT but is WORK-EFFICIENT.

Introduction

[Figure: parallel vs. sequential computation of the skyline set from a dataset.]

Skyline computation Notations:

$P$: dataset

$n = |P|$: number of tuples (points) in the dataset

$d$: dimensions (number of attributes)

$p, q \in P$: arbitrary points

$p[i]$: the value of the $i$-th attribute in the tuple (point) $p$

Id

1 2 3

2 2 1

2 4 1

3 3 3

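Skyline computation Skyline definitions:

Skyline is defined through the concept of dominance.

Definition 1:

Data point (tuple) $A$ dominates data point $B$ iff:

1. $A[i] \le B[i]$ for all attribute values, and

2. $A[i] < B[i]$ for at least one attribute value.

Definition 2 (in this paper):

Point $p$ dominates point $q$, denoted by $p \prec q$, iff:

$$\forall i \in [0, d):\ p[i] \le q[i] \quad \text{and} \quad \exists j \in [0, d):\ p[j] < q[j]$$

If neither $p \prec q$ nor $q \prec p$, we say that $p$ and $q$ are incomparable.

Transitivity: if $p \prec q$ and $q \prec r$, then $p \prec r$.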
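The dominance definition can be checked directly on the hotel example. The following is a minimal Python sketch, assuming smaller values are better on every attribute; the names `dominates` and `naive_skyline` are illustrative and do not come from the paper.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, ...]


def dominates(p: Point, q: Point) -> bool:
    """Dominance test (DT): p is no worse on every attribute and better on one."""
    no_worse = all(a <= b for a, b in zip(p, q))
    strictly_better = any(a < b for a, b in zip(p, q))
    return no_worse and strictly_better


def naive_skyline(points: Dict[str, Point]) -> List[str]:
    """Keep every point that no other point dominates (quadratic number of DTs)."""
    return [
        name
        for name, p in points.items()
        if not any(dominates(q, p) for other, q in points.items() if other != name)
    ]


# (price in dollars, distance in miles), taken from the query results table
hotels = {
    "Hotel A": (120, 1.5),
    "Hotel B": (140, 1.0),
    "Hotel C": (200, 2.0),
    "Hotel D": (150, 0.7),
}

print(naive_skyline(hotels))  # ['Hotel A', 'Hotel B', 'Hotel D']
```

Hotel C drops out because Hotel A is both cheaper and nearer; the remaining three hotels are pairwise incomparable.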

Skyline computation Measuring skyline work:

Dominance Test (DT)

Determines whether a data point $p$ dominates a data point $q$ by comparing them point-to-point.

The number of DTs performed measures the skyline work done.

Mask Test (MT)

Defines a bitmask for each point by comparing it with a skyline (pivot) point.

Uses transitivity to prune the number of tests.

Mask Tests are much cheaper than Dominance Tests.

Skyline computation GPU Computation

Tesla K80: 4992 CUDA cores.

Threads are grouped into warps, usually of size 32.

Warps are grouped into thread blocks.

All threads within a warp execute the same instruction at the same time.

Problem: branch divergence.

Related work Partition-based skyline algorithms

Divide-and-Conquer:

Recursively halves the data space at the median of an arbitrarily chosen dimension and solves each half. The results are then merged.

Sequential partition-based algorithms:

These algorithms employ recursive, point-based partitioning.

For each partition, a skyline point (pivot) is found, and the other points are partitioned based on their relationship to the pivot.

The work performed depends on the pivot selected.

SkyAlign: is a partition-based method, but it is not recursive and has no merge step.

Related work Sort-based (and GPU) skyline algorithms

These algorithms obtain efficiency from monotonicity and transitivity.

Block-nested-loops algorithm (BNL)

Each unprocessed point $p$ is compared, using a DT, against each point currently held as a skyline point. If $p$ is dominated, it is removed and control passes to the next point.

Sort-first skyline (SFS)

Sorts the data points prior to executing BNL. Once a point is added to the solution, it will never be removed.

GNL

Assigns a thread to every point $p$. The thread for $p$ compares it with the other data points to check the dominance criterion.
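The sort-first idea can be sketched in a few lines of Python. This is only an illustration, assuming smaller values are better and using the attribute sum as the monotone sort key (one common choice, not necessarily the one used in the cited work); `sfs_skyline` and the DT counter are hypothetical names.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(p: Point, q: Point) -> bool:
    """Dominance test: p is no worse everywhere and strictly better somewhere."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def sfs_skyline(points: Sequence[Point]) -> Tuple[List[Point], int]:
    """Sort by attribute sum, then run a BNL-style scan that never evicts points."""
    # If p dominates q then sum(p) < sum(q), so every dominator is visited first.
    ordered = sorted(points, key=sum)
    skyline: List[Point] = []
    dt_count = 0
    for q in ordered:
        dominated = False
        for p in skyline:
            dt_count += 1
            if dominates(p, q):
                dominated = True
                break
        if not dominated:
            skyline.append(q)  # safe: no later point can dominate q
    return skyline, dt_count


data = [(3, 3, 3), (2, 2, 1), (2, 4, 1), (1, 2, 3)]
result, dts = sfs_skyline(data)
print(result, dts)  # [(2, 2, 1), (1, 2, 3)] 3
```

Returning the DT count alongside the result makes it easy to compare the work done by different strategies on the same input.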

GPU-Friendly partitioning

Work efficiency of skyline algorithms comes from skipping DTs.

To know which DTs to skip between two data points $p$ and $q$, we need to know whether they are incomparable. Transitivity helps with this.

Example:

1. Say that we have three data tuples: $p$, $q$ and a pivot $v$.

2. The relationship of $p$ with $v$ is represented with one bit for each dimension $i \in [0, d)$. (Mask Test)

3. The relationship of $q$ with $v$ is also represented with one bit for each dimension $i \in [0, d)$.

4. The incomparability of $p$ and $q$ can be detected by comparing these two masks.

A Mask Test (MT) is cheaper than a Dominance Test (DT).
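The four steps above can be sketched as follows. This Python sketch assumes that bit $i$ is set when a point is not better than the pivot on dimension $i$ (smaller is better); the function names and the sample pivot are illustrative only.

```python
from typing import Sequence


def pivot_mask(p: Sequence[float], pivot: Sequence[float]) -> int:
    """One bit per dimension: bit i is set when p[i] >= pivot[i]."""
    bits = 0
    for i, (value, boundary) in enumerate(zip(p, pivot)):
        if value >= boundary:
            bits |= 1 << i
    return bits


def cannot_dominate(mask_p: int, mask_q: int) -> bool:
    """True when p has a set bit that q lacks, so p[i] >= pivot[i] > q[i] for some i."""
    return (mask_p & ~mask_q) != 0


def incomparable_by_masks(mask_p: int, mask_q: int) -> bool:
    """Mask test (MT): neither point can dominate the other, so the DT is skipped."""
    return cannot_dominate(mask_p, mask_q) and cannot_dominate(mask_q, mask_p)


pivot = (5, 5)
p, q = (2, 8), (7, 3)
m_p, m_q = pivot_mask(p, pivot), pivot_mask(q, pivot)
print(bin(m_p), bin(m_q), incomparable_by_masks(m_p, m_q))  # 0b10 0b1 True
```

The MT costs two bitwise operations regardless of $d$, whereas a DT reads up to $d$ attribute pairs, which is why skipping DTs through masks pays off.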

GPU-Friendly partitioning Getting to know point-based methods

Point-based recursive partitioning methods use a quad-tree partitioning of the data set and record skyline points in a tree as they are found.

[Figure: quad tree with nodes A to F; the skyline points (pivots) are marked.]

GPU-Friendly partitioning

Each tree node contains a bitmask that records on which dimensions its point is worse than its parent.

When processing a point $p$, the quad tree can be used to eliminate DTs for $p$.

First, $p$ builds a bitmask recording its dimension-wise relationship to the root of the tree (in this case C).

If all bits are set (all bits in the bitmask are 1), $p$ is dominated. Otherwise, only those children of the root (B, E) whose bitmasks, compared with that of $p$, do not imply incomparability need to be visited.

A deeper tree permits skipping more DTs.

GPU-Friendly partitioning Why is recursive partitioning not preferred?

High divergence

Traversal

Consider the case where points in F are to be compared with points in D. First, a DT with the root E is performed for each point, generating bitmasks. These bitmasks are then used to determine which branches of D each point of F should traverse. The results often diverge.

Partitioning

Each partition has to be sub-partitioned relative to its own pivot.

The pivot needs to be a skyline point.

High dimensions

Quad-tree partitioning does not scale well with dimensionality.

GPU-Friendly partitioning A static grid alternative

Each dimension is split based on the quartiles computed from the dimension values.

Three global pivots are defined, one corresponding to each quartile boundary.

For each point, two bitmasks are defined:

1. One bitmask relative to the median.

2. One bitmask relative to either the first or the third quartile.

First level: all the points are partitioned by their relationship to the median of the dataset.

Second level: all the points are partitioned by their relationship to either the first or the third quartile.

Do we need a third level?

GPU-Friendly partitioning Definition of masks

Let:

$Q_1[i]$ be the first quartile for the $i$-th attribute

$Q_2[i]$ be the median for the $i$-th attribute

$Q_3[i]$ be the third quartile for the $i$-th attribute

We denote by:

$m_p$: the median-level-resolution bitmask for point $p$

$m'_p$: the quartile-level-resolution bitmask for point $p$

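GPU-Friendly partitioning Definition of masks

For dimension $i$, the median-level bit $m_p[i] = 1$ (is set) if $p[i]$ is larger than or equal to the median on dimension $i$.

For dimension $i$, the quartile-level bit $m'_p[i] = 1$ (is set) if $p[i]$ is larger than or equal to the first or third quartile, whichever applies to $p$, on dimension $i$.

We have:

GPU-Friendly partitioning Definition of masks

For example, a point gets a 0 bit for x and a 1 bit for y at the median level because it is less than the x-median and greater than the y-median.

The same reasoning applies to the other points.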
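The two mask levels can be computed as in the Python sketch below. It assumes that a point below the median on a dimension is compared with the first quartile there, and a point at or above the median with the third quartile; that reading, the function names, and the sample boundaries are assumptions made for illustration. `statistics.quantiles` is one way to obtain the quartiles; its default method is just one of several conventions.

```python
from statistics import quantiles
from typing import List, Sequence, Tuple


def quartiles_per_dimension(points: Sequence[Sequence[float]]) -> List[List[float]]:
    """For each dimension, return [first quartile, median, third quartile]."""
    d = len(points[0])
    return [quantiles([p[i] for p in points], n=4) for i in range(d)]


def grid_masks(p: Sequence[float], quarts: Sequence[Sequence[float]]) -> Tuple[int, int]:
    """Return (median-level mask, quartile-level mask) for point p."""
    median_mask = 0
    quartile_mask = 0
    for i, (q1, q2, q3) in enumerate(quarts):
        upper_half = p[i] >= q2
        if upper_half:
            median_mask |= 1 << i
        # Assumption: refine against the quartile inside the half that p falls in.
        boundary = q3 if upper_half else q1
        if p[i] >= boundary:
            quartile_mask |= 1 << i
    return median_mask, quartile_mask


# Hypothetical boundaries for a 2-d dataset: x and y both split at 2, 4 and 6.
boundaries = [(2.0, 4.0, 6.0), (2.0, 4.0, 6.0)]
print(grid_masks((3, 5), boundaries))  # (2, 1): x below median, y above median
```

Because the boundaries are global, every thread can compute its point's masks independently, without the data-dependent pivots that recursive partitioning needs.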

GPU-Friendly partitioning How to define incomparability using static-grid MTs

We can establish incomparability between two bitmasks by considering:

1. Order (the number of bits set to 1 in a bitmask).

2. Bitwise relationships.

The authors define equations for both resolutions, which rely on the transitivity property with respect to the median:

Median-level resolution

GPU-Friendly partitioning Median-level resolution

The median-level equation checks whether $m_p$ has any bits set (equal to 1) that are not also set in $m_q$ (are 0 there). If so, then $\exists i$ such that $q[i] < p[i]$. Consequently, $p \nprec q$.

Example:


If $m_p$ has more 1's than $m_q$ does, then it necessarily contains a set bit that is not set in $m_q$.

Example:


If $m_p$ and $m_q$ have the same order, then the only condition under which all bits set in $m_p$ are also set in $m_q$ is that the masks are identical. If the bitmasks are not identical, then neither $p \prec q$ nor $q \prec p$, because the two masks have the same order but different arrangements of 1s.

Example: $m_p$ and $m_q$ have the same order but differ, so $p \nprec q$ and $q \nprec p$.
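The three median-level observations can be exercised with concrete masks. The Python sketch below uses the bitwise form `m_p & ~m_q`, which is one way to express "has a set bit that the other mask lacks" and may differ from the exact equation in the paper; the function names and sample masks are illustrative.

```python
def order(mask: int) -> int:
    """Order of a mask: the number of bits set to 1."""
    return bin(mask).count("1")


def median_level_relation(m_p: int, m_q: int) -> str:
    """Classify what the median-level masks alone say about points p and q."""
    p_blocked = (m_p & ~m_q) != 0  # p is above the median where q is below it
    q_blocked = (m_q & ~m_p) != 0  # q is above the median where p is below it
    if p_blocked and q_blocked:
        return "incomparable"
    if p_blocked:
        return "p cannot dominate q"
    if q_blocked:
        return "q cannot dominate p"
    return "undecided"  # identical masks: a finer test or a DT is needed


# More 1s in m_p than in m_q: p cannot dominate q.
print(order(0b110), order(0b010), median_level_relation(0b110, 0b010))
# Same order, different arrangement of 1s: the points are incomparable.
print(order(0b101), order(0b011), median_level_relation(0b101, 0b011))
# Identical masks: the median level cannot decide.
print(median_level_relation(0b011, 0b011))
```

Only the "undecided" outcome and the one-directional outcomes leave a DT to be done, and in the one-directional case only one direction needs checking.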

GPU-Friendly partitioning Quartile-level resolution

The SkyAlign Algorithm

A global static partitioning of the data set is computed.

One thread is assigned to each data point.

At a high level, SkyAlign consists of $d$ iterations. In the $i$-th iteration, the remaining points are compared, each by its own thread, to all points with order $i$, using MTs and DTs as necessary.

After each iteration, dominated points are removed and the surviving points are moved into the solution.

The SkyAlign Algorithm

Pre-filter: Eliminates points that are easy to identify as not in the skyline by defining a threshold as the min of max values:

$$\text{threshold} = \min_{p \in P} \max_{i \in [0, d)} p[i]$$

which is some point's max value and the smallest largest value in the data.

Each thread is responsible for one point and checks whether the point's values are all larger than the threshold; if so, the point is dominated by the point that defines the threshold and is eliminated.

Id

1 2 3

2 2 1

2 4 1

3 3 3
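The pre-filter can be traced on the table above. The Python sketch below assumes the four rows are four 3-dimensional points and that a point is pruned when all of its values exceed the threshold; `prefilter` is an illustrative, sequential stand-in for the per-thread check.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def prefilter(points: Sequence[Point]) -> Tuple[float, List[Point]]:
    """Return the min-of-max threshold and the points that survive the filter."""
    threshold = min(max(p) for p in points)
    # A point whose smallest value exceeds the threshold is worse on every
    # dimension than the point that defines the threshold, so it is dominated.
    survivors = [p for p in points if min(p) <= threshold]
    return threshold, survivors


data = [(1, 2, 3), (2, 2, 1), (2, 4, 1), (3, 3, 3)]
threshold, survivors = prefilter(data)
print(threshold)  # 2, the largest value of (2, 2, 1)
print(survivors)  # [(1, 2, 3), (2, 2, 1), (2, 4, 1)]
```

The point (3, 3, 3) is removed with a single comparison per point and no DTs, while (2, 4, 1) survives this step even though it is dominated, because its smallest value does not exceed the threshold.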


Mask assignment: Masks are assigned to each point, given the quartiles of the dataset for each dimension.

Data sorting: Sort the data points by the order of their masks.
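To show how mask assignment and sorting by order work together, here is a sequential Python simplification. It is not the authors' GPU algorithm: it uses only median-level masks, processes points one at a time, and breaks ties within an order by attribute sum (an added assumption that guarantees dominators are visited first). All names are hypothetical.

```python
from statistics import median
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(p: Point, q: Point) -> bool:
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def median_mask(p: Point, medians: Sequence[float]) -> int:
    bits = 0
    for i, m in enumerate(medians):
        if p[i] >= m:
            bits |= 1 << i
    return bits


def order(mask: int) -> int:
    return bin(mask).count("1")


def mask_ordered_skyline(points: Sequence[Point]) -> Tuple[List[Point], int, int]:
    """Return (skyline, dominance tests done, mask tests done)."""
    d = len(points[0])
    medians = [median([p[i] for p in points]) for i in range(d)]
    # A dominator never has a higher mask order or a larger sum than its victim.
    tagged = sorted(
        ((median_mask(p, medians), p) for p in points),
        key=lambda mp: (order(mp[0]), sum(mp[1])),
    )
    skyline: List[Tuple[int, Point]] = []
    dts = mts = 0
    for m_q, q in tagged:
        dominated = False
        for m_p, p in skyline:
            mts += 1
            if m_p & ~m_q:
                continue  # mask test: p cannot dominate q, skip the DT
            dts += 1
            if dominates(p, q):
                dominated = True
                break
        if not dominated:
            skyline.append((m_q, q))
    return [p for _, p in skyline], dts, mts


data = [(1, 2, 3), (2, 2, 1), (2, 4, 1), (3, 3, 3)]
print(mask_ordered_skyline(data))  # ([(2, 2, 1), (1, 2, 3)], 2, 3)
```

On this small input one of three candidate DTs is replaced by a cheaper MT; the gap between the two counters is the kind of work saving that the presentation calls work-efficiency.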

Experimental evaluation

Evaluation is done by comparing SkyAlign against state-of-the-art sequential, multi-core, and GPU skyline algorithms.

Algorithms used for comparison:

BSkyTree, Hybrid, GSS, SkyAlign

Testing is done using synthetic data generated by a skyline dataset generator, which produces datasets that are correlated, independent and anticorrelated.

By default: fixed values of $n$ and $d$.

Environment:

Quad-core Intel i7 at 3.40 GHz with 16 GB of RAM, using an NVIDIA GTX Titan GPU.

Experimental evaluation Run-time performance

Measure the execution time of the four algorithms, testing them on datasets with variations in distribution, dimensionality and cardinality.

1. Cardinality (d = 12)

2. Dimensionality (n = )

Experimental evaluation Work-efficiency

Compare the performance of the four algorithms with respect to:

1. Dominance tests (DT)

2. Work-efficiency

Thank You
