pay-as-you-go approximate join top-k processing for the

33
Pay-as-you-go Approximate Join Top-k Processing for the Web of Data Technical Report Andreas Wagner , Veli Bicer , and Thanh Tran Karlsruhe Institute of Technology and IBM Research Centre Dublin {a.wagner,thanh.tran}@kit.edu, [email protected] Abstract. For effectively searching the Web of data, ranking of results is a crucial. Top-k processing strategies have been proposed to allow an efficient processing of such ranked queries. Top-k strategies aim at computing k top-ranked results without complete result materialization. However, for many applications result computation time is much more important than result accuracy and completeness. Thus, there is a strong need for approximated ranked results. Unfortunately, previous work on approximate top-k processing is not well-suited for the Web of data. In this paper, we propose the first approximate top-k join framework for Web data and queries. Our approach is very lightweight – necessary statistics are learned at runtime in a pay-as-you-go manner. We con- ducted extensive experiments on state-of-art SPARQL benchmarks. Our results are very promising: we could achieve up to 65% time savings, while maintaining a high precision/recall. 1 Introduction With the proliferation of the Web of data, RDF has become an accepted standard for publishing data on the Web. RDF data comprises a set of triples {hs, p, oi}, which forms a data graph, cf. Fig. 1-a. User-/Query-Dependent Ranking. For web-scale data, queries often pro- duce a large number of results (bindings ). Given large result sets, result ranking becomes a key factor for an effective search. However, ranking functions often need to incorporate query or user characteristics [1, 4, 19]: Example 1. Find movies with highest ratings, featuring an actress “Audrey Hep- burn”, and playing close to Rome, cf. Fig. 1. Exp. 1 would require a ranking function to incorporate the movie rating, quality of keyword matches for “Audrey Hepburn”, and distance of the movie’s location to Rome. While one may assume that a higher rating value is preferred by any user and query, scores for keyword and location constraint dependent on query and user characteristics. For instance, in order to rank a binding for “Au- drey Hepburn”, a function may measure the edit distance between that keyword and the binding’s attribute value, Fig. 1-c. Notice, given another keyword (e.g., only “Audrey”), the very same attribute value would yield a different score.

Upload: others

Post on 21-Oct-2021

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Pay-as-you-go Approximate Join Top-k Processing for the

Pay-as-you-go Approximate Join Top-kProcessing for the Web of Data

Technical Report

Andreas Wagner†, Veli Bicer‡, and Thanh Tran†

† Karlsruhe Institute of Technology and ‡ IBM Research Centre Dublin{a.wagner,thanh.tran}@kit.edu, [email protected]

Abstract. For effectively searching the Web of data, ranking of resultsis a crucial. Top-k processing strategies have been proposed to allowan efficient processing of such ranked queries. Top-k strategies aim atcomputing k top-ranked results without complete result materialization.However, for many applications result computation time is much moreimportant than result accuracy and completeness. Thus, there is a strongneed for approximated ranked results. Unfortunately, previous work onapproximate top-k processing is not well-suited for the Web of data.In this paper, we propose the first approximate top-k join frameworkfor Web data and queries. Our approach is very lightweight – necessarystatistics are learned at runtime in a pay-as-you-go manner. We con-ducted extensive experiments on state-of-art SPARQL benchmarks. Ourresults are very promising: we could achieve up to 65% time savings,while maintaining a high precision/recall.

1 Introduction

With the proliferation of the Web of data, RDF has become an accepted standardfor publishing data on the Web. RDF data comprises a set of triples {〈s, p, o〉},which forms a data graph, cf. Fig. 1-a.

User-/Query-Dependent Ranking. For web-scale data, queries often pro-duce a large number of results (bindings). Given large result sets, result rankingbecomes a key factor for an effective search. However, ranking functions oftenneed to incorporate query or user characteristics [1, 4, 19]:

Example 1. Find movies with highest ratings, featuring an actress “Audrey Hep-burn”, and playing close to Rome, cf. Fig. 1.

Exp. 1 would require a ranking function to incorporate the movie rating,quality of keyword matches for “Audrey Hepburn”, and distance of the movie’slocation to Rome. While one may assume that a higher rating value is preferredby any user and query, scores for keyword and location constraint dependent onquery and user characteristics. For instance, in order to rank a binding for “Au-drey Hepburn”, a function may measure the edit distance between that keywordand the binding’s attribute value, Fig. 1-c. Notice, given another keyword (e.g.,only “Audrey”), the very same attribute value would yield a different score.

Page 2: Pay-as-you-go Approximate Join Top-k Processing for the

r

“Audrey K. Hepburn“

“Roman Holiday“

Movie

starring

type

title

m1

-74.006

8.1

rating

type

“Breakfast at Tiffany's“

title

7.7 rating starring

m3

12.4839

loc

type 8.5rating

“Amélie“

41.8947

40.7146

Location

l1

type

lat

long

lat

long

type“Audrey Tautou“

title

starring

locmrating

“Audrey Hepburn“

starringloc

(a) (b)

(c)

ltp1 tp3

tp2

scoreQ(b) = rating(r) + (1 - editDistance(s,“Audrey Hepburn“)) + (IF [distance(l,(41.8947,12.4839)) ≤ 100km] 1 ; ELSE 0)with b = (<m,rating,r>,<m,starring,s>,<m,loc,l>)

“Rome“

name

l2

“New York“

name

m2

Fig. 1: (a) RDF data graph about the movies “Roman Holiday”, “Breakfast atTiffany’s”, and “Amelie”. (b) Query graph asking for a movie starring “AudreyHepburn”. (c) Scoring function that aggregates scores for triple pattern bindings(bold): movie ratings, edit distance w.r.t. “Audrey Hepburn”, and distance ofthe movie’s location to Rome (lat: 41.8947, long: 12.4839) ≤ 100 km.

Further, depending on the user’s geographic knowledge of Italy, she may havedifferent notions of “closeness” to Rome, e.g., distance ≤ 100 km, cf. Fig. 1-c.

Join Top-k Processing. Top-k processing aims at computing k top-rankedbindings without full result materialization [7, 8]. That is, after computing somebindings, the algorithm can terminate early, because it knows that no bindingwith higher ranking score exists. For efficiently processing ranked queries overWeb data, two recent works employed top-k processing techniques [9, 21].

However, many applications do not require a high result accuracy or com-pleteness. In fact, result computation time is often the key factor. Thus, there isa strong need for approximated ranked results. That is, a system should be ableto trade off result accuracy and completeness for computation time.

Approximate Join Top-k Processing. Unfortunately, existing approach-es for top-k processing over RDF data compute only exact and complete re-sults [9, 21]. Moreover, previous works for approximate top-k processing overrelational databases [2, 3, 12, 18, 20] are not suitable for Web queries/data. Thisis because these works assume complete ranking score statistics at offline time:

(P.1) Web Queries. Query-/user-dependent ranking functions are employedfor many important Web queries, e.g., keyword, spatial or temporal queries [1,4, 19]. However, such ranking scores are only known at runtime. Consider tp2

and tp3 in Fig. 1-b: binding scores are decided by query (i.e., edit distance toquery keyword “Audrey Hepburn”) or user characteristics (i.e., the user-defineddistance to Rome). So, no offline score statistics can be computed for tp2 or tp3.

(P.2) Web Data. Web data is commonly highly distributed and frequentlyupdated. For instance, movie ratings for pattern tp1 (Fig. 1-b) may be spreadacross multiple data sources – some of them even “hidden” behind SPARQL end-points. Moreover, these sources may feature constantly updated rating scores.Thus, while constructing an offline statistic for rating scores is feasible, it comeswith great costs in terms of maintenance. This problem is exacerbated by thefact that RDF allows for very heterogeneous data. For example, the rating pred-icate in tp1 could be used to specify the rating of movies as well as products,restaurants etc. Thus, score statistics may grow quickly and become complex.

Contributions. (1) This is the first work towards approximate top-k joinprocessing for the Web of data. That is, we propose a lightweight approach,

Page 3: Pay-as-you-go Approximate Join Top-k Processing for the

which addresses problem P.1 and P.2: (P.1) We learn score distributions in apay-as-you-go manner at runtime. (P.2) Our score statistics have a constantspace complexity and a computation complexity bounded by the result size.(2) We conducted experiments on two SPARQL benchmarks: we could achievetime savings of up to 65%, while still allowing for a high precision/recall.

Outline. We outline preliminaries in Sect. 2 and present the approximatetop-k join in Sect. 3. In Sect. 4, we discuss evaluation results. Last, we give anoverview over related works in Sect. 5 and conclude with Sect. 6.

2 Preliminaries

Data and Query Model. We use RDF as data model:

Definition 1 (RDF Graph). Given a set of edge labels `, a RDF graph is adirected labeled graph G = (V, E , `), where V = VE ] VA with entity nodes as VEand attribute nodes as VA. Edges E = {〈s, p, o〉} are called triples, with s ∈ VEas subject, p ∈ ` as predicate, and o ∈ VE ] VA as object.

An example is depicted in Fig. 1-a. Further, we employ basic graph patterns(BGPs) as query model:

Definition 2 (BGP Query). A BGP query Q is a directed labeled graph Q =(VQ, EQ), with VQ = VQV ]V

QC as union of variables VQV and constants VQC . Edges

EQ are called triple patterns. Triple pattern tp = 〈s, p, o〉 with s ∈ VQV ] VQC ,

p ∈ `] VQV , and o ∈ VQV ] VQC . We write Q as set of its triple patterns: Q = {tpi}.

Example 2. In Fig. 1-b, pattern 〈m, starring, “Audrey Hepburn”〉 has m asvariable, constant “Audrey Hepburn” as object, and starring as predicate.

Given a query Q, a binding b is a vector (t1, . . . , tn) of triples such that: eachtriple ti matches exactly one pattern tpi in Q and triples in b form a subgraph ofthe data graph, G. We say b binds variables to nodes in the data via the matchingof patterns in Q. Formally, for binding b there is a function µb : VQV 7→ V thatmaps every variable in Q to an entity/attribute node in the data.

Partial bindings (featuring some patterns with no matching triple) oc-cur during query processing. For a partial binding b, we refer to a patterntpi with no matching triple as unevaluated and write ∗ in b’s i-th position:(t1, . . . , ti−1, ∗, ti+1, . . . , tn). We denote the set of unevaluated patterns for par-tial binding b as Qu(b) ⊆ Q. A binding b comprises a binding b′, if all triples inb′ are also contained in b. If b comprises b′, we say binding b′ contributes to b.

Example 3. Given Fig. 1-b, a partial binding b31 = (∗, ∗, t31 = 〈m1, loc, l2〉)in Fig. 2-a matches pattern tp3, while Qu(b31) = {tp1, tp2} are unevaluated.b31 binds variable m and l to entity m1 and l1. Further, the complete bindingb = (t12, t21, t31) comprises partial binding b31 = (∗, ∗, t31). b31 contributes to b.

Ranking Function. To quantify the relevance of a binding b w.r.t. aquery/user, we employ a ranking function: scoreQ : BQ 7→ R, with BQ as set ofall partial/complete bindings for Q. That is, scoreQ(b) is defined as aggregationover b’s triples: scoreQ(b) =

⊕t ∈ b scoreQ(t), with ⊕ as monotonic aggregation

function. A ranking function for our example is in Fig. 1-c. Note, scoreQ could bedefined as part of the query, e.g., by means of the ORDER BY clause in SPARQL.

Page 4: Pay-as-you-go Approximate Join Top-k Processing for the

O1+2+3 score

complete bindings: tp1+ tp2 + tp3

sa1: <m,rating,r> sa2: <m,starring,“Audrey Hepburn“>

sa3: <m,loc,l>H1 score H2 score

m

O1+2 score

A1

Hyperparameters: α1

A2

Hyperparameters: α2

A4

Hyperparameters: α4

Input i1 Input i2

b11 = (t11 = <m3,rating,8.5>, *, *) 8.5

t11:<m3,rating,8.5>, 8.5t12:<m1,rating,8.1>, 8.1t13:<m2,rating,7.7>, 7.7

Offline Index: Rating

Sorted AccessPointer at next match-ing triple

t21:<m1,starring,“Audrey K. Hepburn“>, 0.9 t22:<m2,starring,“Audrey K. Hepburn“>, 0.9 t23:<m3,starring,“Audrey Tautou“>, 0.3

Sorted List at Runtime:Triples containing “Audrey“ or “Hepburn“

l2: (41.8947,12.4839) l1: (40.7146,-74.006)

R-Tree Offline Index: Location

A-PBRJ j1

partial bindings: tp1+ tp2

b21 = (*, t21 = <m1,starring, “Audrey K. Hepburn“>, *)

0.9

H1+2 score

Input i4

Input i3

H3 score

b31 = (*, *, t31 = <m1,loc,l2>) 1

A3

Hyperparameters: α3

Tree of Approximate Top-k Joins

m

A-PBRJ j2

Sufficient Statistics

Stat1: Offline sample mean and variance for rating ranking

Stat2: Runtime sample mean and variance for edit distance ranking

Stat3: Runtime mean and variance (based on uniform distribution) for location ranking

(b)(a)score samples

partial bindings

partial bindings

(0.5, 0.08)

(0.7, 0.12)

(8.1, 0.16)

Fig. 2: (a) A-PBRJ tree for Fig. 1-b. Two information flows occur in the tree:partial bindings (green) and score samples (blue). (b) Sufficient statistics basedon scores observed at indexing time (stat1) and runtime (stat2 and stat3).

Sorted Access. For every pattern tpi in query Q, a sorted access sai re-trieves matching triples in descending score order. Previous works on join top-kprocessing over Web data introduced efficient sorted access implementations forRDF stores [9, 21]. Let us present simple approaches for our example (Fig. 2-a):

Example 4. Given the keyword pattern tp2 = 〈m, starring, “Audrey Hepburn”〉,a sorted access must materialize all triples, which have a value that contains“Audrey” or “Hepburn”. After materialization, these triples are sorted with de-scending similarity w.r.t. that keyword (e.g., measured via edit distance). On theother hand, for pattern 〈m, loc, l〉, an R-tree on the attribute pair (lat, long)may be used. This offline computed index yields two hits: l1 and l2. While l2 isan exact match (thus, triple t31 has max. score 1), l1 is more distant from Rome.Last, an index for attribute rating can be constructed offline: triples are storedwith descending rating value. Then, sorted access sa1 can iterate over this list.

Partial bindings retrieved from sorted accesses are combined via joins. Thatis, an equi-join combines two (or more) inputs. This way, multiple joins form atree. For instance, three sorted accesses are combined via two joins in Fig. 2-a.

Problem. Our goal is to compute k high-ranked query bindings that maydiffer from the true top-k results in terms of false positives/negatives. Theseapproximations aim at saving computation time. For this, we use a top-k test:given a partial binding, we estimate its probability for contributing to the finaltop-k results and discard such bindings that have only a small a probability.

We exploit conjugate priors for learning necessary probability distributions.

Bayesian Inference. Let Θ be a set of parameters. One may model priorbeliefs about these parameters in the form of probabilities: Θ ∼ P (Θ | α) withΘ ∈ Θ [6]. Here, α is a vector of hyperparameters allowing to parametrize theprior distribution. Suppose we observe relevant data x = {x1, . . . , xn} w.r.t. Θ,where each xi ∼ P (xi | Θ). Then, the dependency between observations x andprior parameters Θ can be written as P (x | Θ). Using the Bayes theorem we

Page 5: Pay-as-you-go Approximate Join Top-k Processing for the

can estimate a posterior probability, which captures parameters Θ conditionedon observed events x. In simple terms, a posterior distribution models how likelyparameters Θ are, in light of the seen data x and the prior beliefs [6]:

P (Θ | x, α) ∝ P (x | Θ) · P (Θ | α) =P (x | Θ) · P (Θ | α)∑

Θ P (x | Θ)P (Θ)(1)

Example 5. For pattern tp1 in Fig-2-a, scores are based on rating values. So,we can compute sufficient statistics (mean x1 = 8.1 and variance s21 = 0.16)for these scores at offline time, cf. stat1 in Fig-2-b. Such statistics representprior beliefs about the “true” distribution, which is capturing only those scoresfor bindings of tp1 that are part of a complete binding. Only triple t12 and t13contribute to complete bindings. Thus, only their scores should be modeled via adistribution. We update the prior beliefs using scores samples x observed duringquery processing, thereby learning the true (posterior) distribution as we go.

As we are interested in unobserved events x∗, we need the posterior predictivedistribution, i.e., the distribution of new events given observed data x:

P (x∗ | x, α) =∑Θ

P (x∗ | Θ)P (Θ | x, α) (2)

An important kind of Bayesian priors are the conjugate priors. Intuitively, con-jugate priors require the posterior and prior distribution to belong to the samedistribution family. In other words, these priors provide a “computational conve-nience”, because they give a closed-form of the posterior distribution [6]. Thus,posterior computation is easy and efficient for conjugate priors.

3 Approximate Top-k Join

We now present an approximate top-k processing for the Web of data. In contrastto existing works [2, 3, 12, 18, 20], we follow a lightweight approach: (1) We learnall necessary score statistics at runtime, cf. Algo. 2 (P.1, Sect. 1). (2) We showour score distribution learning to have a constant space complexity and a runtimecomplexity bounded by the result size, cf. Thm. 1 and Thm. 2 (P.2, Sect. 1).

3.1 Approximate Rank Join Framework

We follow [17] and define an approximate Pull/Bound Rank Join (A-PBRJ)framework that comprises three parts: a pulling strategy PS, a bounding strat-egy BS, and a probabilistic component PC. PS determines the next join inputto pull from [17]. The bounding strategy BS gives an upper bound, β, for themaximal possible score of unseen join results [17]. Last, we use PC to estimatea probability for a partial binding to contribute to the final top-k result.

Approximate Pull/Bound Rank Join. The A-PBRJ is depicted inAlgo. 1. Following [17], on line 4 we check whether output buffer O comprises kcomplete bindings and if there are unseen bindings with higher scores (measuredvia bound β). If both conditions hold, the A-PBRJ terminates and reports O.Otherwise, PS selects an input i to pull from (line 5) and produces a new par-tial binding b from the sorted access on input i, line 6. After materialization, weupdate β using bounding strategy BS.

Page 6: Pay-as-you-go Approximate Join Top-k Processing for the

Example 6. In Fig. 2-a, join j2 decides (via strategy PS) to first pull on sa3 andload partial binding t31. Then, join j2 pulls on input i4 (join j1), which in turnpulls on its input i1 (sa1) loading binding t11 and afterwards on input i2 (sa2)loading t21. The join attempt t11 on t21 in join j1 fails, because entity m3 6= m1.

Algorithm 1: Approx. Pull/Bound Rank Join (A-PBRJ).

Param.: Pulling strategy PS, bounding strategy BS, probabilistic comp. PC.Index : Sorted access sai and saj for input i and j, respectively.Buffer : Output buffer O. Hi and Hj for “seen” bindings from sai and saj .Input : Query Q, result size k, and top-k test threshold τ .Output: Approximated top-k result.

1 begin2 β ←∞, κ← −∞3 PC.initialize()4 while | O |< k or minb′ ∈O scoreQ(b′) < β do5 i← PS.input() // choose next input via pulling strategy PS6 b← next partial binding from sorted access sai7 β ← BS.update(b) // update β via bounding strategy BS

// top-k test, cf. Algo. 3

8 if PC.probabilityTopK(b,κ) > τ then9 O← Hj on {b}

10 b ∪Hi // add b to buffer Hi

11 if #new bindings b in O ≥ training threshold then// score distribution learning, cf. Algo. 2

12 PC.train(b)13 Retain only k top-ranked bindings in O

14 if |O | ≥ k then // update smallest top-k score κ15 κ← minb′ ∈O scoreQ(b′)

// return approximated top-k results

16 return O

In line 8, PC estimates the probability for partial binding b leading to acomplete top-k binding: the top-k test. If b fails this test, it will be pruned. Thatis, we do not attempt to join it and do not insert it in Hi. Hi is a buffer thatholds “seen” bindings from input i. Otherwise, if the top-k test holds, b is furtherprocessed (lines 9 - 15). That is, we join b with seen bindings from the otherinput j and add results to O. Further, b is inserted into buffer Hi, line 10. Forlearning the necessary probability distributions, PC trains on seen bindings/-scores in O, line 12. Notice, we continuously train PC throughout the queryprocessing – every time “enough” new bindings are in O, line 11. PC requiresparameter κ for its pruning decision. κ holds the the smallest currently knowntop-k score (line 15). On line 2, κ is initialized as −∞.

Page 7: Pay-as-you-go Approximate Join Top-k Processing for the

Choices for BS and PS. Multiple works proposed bounding strategies,e.g., [5, 7, 10, 17] as well as pulling strategies, e.g., [7, 11]. Commonly, the cornerbound [7] is employed as bounding strategy BS:

Definition 3 (Corner Bound). For a join operator, we maintain ui and lifor each input i. ui is the highest score observed from i, while li is the lowestobserved score on i. If input i is exhausted, li is set to −∞. The bound for scoresof unseen join results is β := max{u1 ⊕ l2, u2 ⊕ l1}.

In example Fig. 2-a, join j1 currently has β = max{8.5 + 0.9, 0.9 + 8.5}, withu1 = l1 = 8.5 and u2 = l2 = 0.9. On the other hand, the corner-bound-adaptivestrategy [7] is frequently used as pulling strategy PS:

Definition 4 (Corner-Bound-Adaptive Pulling). The corner-bound-adap-tive pulling strategy chooses the input i such that: i = 1 iff u1⊕ l2 > u2⊕ l1 andi = 2 otherwise. In case of a tie, the input with less unseen bindings is chosen.

For instance, in join j1 (Fig. 2-a) either input may be selected, because 8.5+0.9 =0.9 + 8.5 and both inputs have two unseen partial bindings.

3.2 Probabilistic Component PCGiven a partial binding b, we wish to know how likely b will contribute to the finaltop-k results. For this, the top-k test exploits two probabilities: (1) The prob-ability that b contributes to a complete binding (binding probability). (2) Theprobability that complete bindings comprising b have higher scores than thecurrent top-k bindings (score probability).

Binding Probability. To address the former probability, we use a selec-tivity estimation function sel. Simply put, given a query Q, sel(Q) estimatesthe probability that there is at least one binding for Q [14, 15]. For example,selectivity of pattern tp3 = 〈m,loc, l〉 is sel(tp3) = 2

3 , because out of the threemovie entities only two have a loc predicate, cf. Fig. 1-a.

Further, we define a complete binding indicator for a partial binding b:

1{Qu(b) | b} :=

{1 if sel(Qu(b) | b) > 0

0 otherwise(3)

Intuitively, for a partial binding b, 1{Qu(b) | b} models whether matchingtriples for b’s remaining unevaluated patterns can exist, given variable assign-ments dictated by b. That is, Qu(b) | b is a set of patterns {tpi}, such that patterntpi ∈ Qu(b) and each variable v in tpi that is bound by b is replaced with itsassignment in b, µb(v), which results in a new pattern tpi.

Example 7. Consider partial binding b11 = (t11 = 〈m3, rating, 8.5〉, ∗, ∗) inFig. 2-a. Qu(b11) | b11 = {〈m3, starring, “Audrey Hepburn”〉, 〈m3, loc, l〉},because variable m in pattern tp2 and tp3 is replaced with its assignment in b11,µb11(m) = m3. 1{Qu(b11) | b11} = 0, as selectivity for both patterns is 0.

Notice, any selectivity estimation implementation may be used for the com-plete binding indicator. We employed [14, 15] for our experiments.

Score Probability. For a partial binding b, let scores for bindings of b’sunevaluated patterns, Qu(b), be captured via a random variable Xs

Qu(b).

Page 8: Pay-as-you-go Approximate Join Top-k Processing for the

Example 8. In Fig. 2-a, partial binding b31 currently has a score of 1. However,scores for bindings to tp1 and tp2 are unknown and modeled via Xs

Qu(b31).

Then, we can obtain the probability for b contributing to a complete bindingthat has a score ≥ x as:

P(XsQu(b) ≥ δ(x, b)

)(4)

where δ(x, b) := x − scoreQ(b). Note, partial binding b has a current score,scoreQ(b), and only the score for its unevaluated patterns is unknown. So, δ(x, b)is the “delta” between b’s current score and a desired score x.

Top-k Test. Finally, we use (1) the complete binding indicator to determinewhether b might contribute to any complete binding. Further, (2) the scoreprobability to estimate how likely a complete binding comprising b has a scorethat is larger than the smallest known top-k score, κ (cf. Algo. 1 line 15):

1(Qu(b) | b)︸ ︷︷ ︸(1)

·P (XsQu(b) ≥ δ(κ, b))︸ ︷︷ ︸

(2)

> τ (5)

with τ ∈ [0, 1] as top-k test threshold.

Discussion.

– Threshold τ provides a key instrument for semantic search systems, since itallows to adjust the result accuracy. For instance, a system may decide tocompute top-50 results in total and increase τ every time a new top-rankedbinding can be reported. Generally speaking, τ should not be conceived as aconstant, but rather as a function in reported top-ranked results. This way,systems can target typical end-user information needs, which do not requireaccurate low-ranked query results.

– The parameter ω refers to the smallest currently known top-k binding score.In particular, as long as no k complete bindings have been found, ω is setto −∞ (see Algorithm 1, Line 2), and the score probability is always 1. So,a partial binding b is only pruned if it fails the complete binding indicator.That is, if b is not expected to contribute to any complete binding.

– Assume we have a tree of A-PBRJ operators and a partial binding b fails thecomplete binding indicator test at join ji. Then, it is crucial to know, becauseof which pattern in Qu(b) | b the partial binding b fails the complete bindingindicator test. In other words, we need to know which pattern tp ∈ Qu(b) | bleads to 1(Qu(b) | b) = 0.Let tpk be that pattern and let join jk be the join, which joins pattern tpk

with the remainder of the query. Then, we update all β thresholds associ-ated with joins, which are “above” join ji and “below” jk in the tree. Moreprecisely, we update β with the expected score of the partial binding b inthat particular join.This updating is necessary, because the partial binding b is assumed to suc-cessfully join with other bindings in joins “above” join ji and “below” jk.Only at join jk, due to pattern tpk, binding b is assumed to fail.

Page 9: Pay-as-you-go Approximate Join Top-k Processing for the

(a) Predictive Dist. (b) Priors

Input i1P (Xs

i1) stat2⊕ stat3:Qu = {tp2, tp3} (0.7 + 0.5, 0.12 + 0.08)

Input i2P (Xs

i2) stat1⊕ stat3:Qu = {tp1, tp3} (8.1 + 0.5, 0.16 + 0.08)

Input i3P (Xs

i3) stat1⊕ stat2:Qu = {tp1, tp2} (8.1 + 0.7, 0.16 + 0.12)

Input i4P (Xs

i4) stat3:Qu = {tp3} (0.5, 0.08)

Fig. 3: (a) Given joins in Fig. 2-a,we train four predictive score dis-tributions (one for each input).For instance, Xs

i1models scores for

bindings of tp2 on tp3. (b) Priorsare based on sufficient statisticsin Fig. 2-b. The aggregation func-tion ⊕ is a summation in Fig. 1-c.Thus, e.g., stat1 ⊕ stat3 = (8.1 +0.5, 0.16 + 0.08) = (8.6, 0.24).

3.3 Score Distribution Learning

Distributions for random variables XsQu(b) may be obtained by learning a score

distribution P (Xsi ) for each join input i. Note, partial bindings, which come

from the same input, have the same set of unevaluated triple patterns. Thus, Xsi

captures scores of the unevaluated patterns from its partial bindings.

Example 9. In Fig. 2-a, all partial bindings from input i1 have Qu = {tp2, tp3}as unevaluated patterns. Thus, P (Xs

Qu(b11)) = P (Xs

i1), as binding b11 is produced

by input i1. In fact, all bindings from i1 follow the same distribution, P (Xsi1

),which captures scores of tp2 on tp3. Overall, we learn four distributions, cf. Fig. 3.

We do not know the true distribution for Xsi . In such a case, a common

assumption is to use a Gaussian distribution for Xsi , cf. Eq. 6a. We employ a

conjugate prior to train its unknown mean and variance, respectively.As shown in [6], the mean of Xs

i follows a Gaussian distribution (Eq. 6b) andthe variance of Xs

i follows an inverse-Gamma distribution (Eq. 6c). Hyperpa-rameters α0 = (µ0, η0, σ

20 , ν0) parameterize both distributions, where µ0 is prior

mean with quality η0, and σ20 is prior variance with quality ν0 [6]:

Xsi ∼ normal

(µ, σ2

)(6a)

µ | σ2 ∼ normal

(µ0,

σ2

η0

)(6b)

σ2 ∼ inverse-gamma(0.5 · ν0, 0.5 · ν0σ2

0

)(6c)

Prior Distribution. Prior initialization is called on line 3 in Algo. 1. Foreach input i we specify a prior distribution for Xs

i via prior hyperparametersα0. For α0 we require sufficient score statistics in the form of a sample mean,x = 1

n

∑xi ∈x xi, and a sample variance s2 = 1

(n−1)∑

xi ∈x(xi − x)2, with x as

sample. There are multiple ways to obtain the necessary score samples:

Example 10. Fig. 2-b depicts three sufficient statistics based on information fromthe sorted accesses: (1) Offline information in the case of sa1. That is, scoresare known before runtime, thus, x1 = 8.1 and s21 = 0.16 can be computed offline.(2) Online information for access sa2. Recall, the list of matching triples forkeywords “Audrey” and “Hepburn” must be fully materialized. So, x2 = 0.7 ands22 = 0.12 may be computed from runtime score samples. (3) Last, given accesssa3, we have neither offline scores, nor a fully materialized list of triples (sa3loads a triple solely upon a pull request). In lack of more information, we assume

Page 10: Pay-as-you-go Approximate Join Top-k Processing for the

Algorithm 2: PC.train()

Params: Weight w ≥ 1 for score sample x.Buffer : Buffer A storing hyperparameters α.Input : Complete bindings B ⊆ O and join j.

1 begin2 foreach input i in join j do

// load prior hyperparameters for input i3 αn = (µn, ηn, σ

2n, νn)← Ai

// get scores of bindings for input i’s unevaluated patterns

4 foreach complete binding b ∈ B do5 get binding b′ comprised in b, which matches unevaluated patterns6 add scoreQ(b′) to score sample x

// compute sample mean and variance

7 x ← mean(x) = 1n

∑xi

8 s2 ← var(x) = 1(n−1)

∑(xi − x)2

// compute posterior hyperparameters

9 νn+1 ← νn + w, ηn+1 ← ηn + w10 µn+1 ← 1

ηn+1· (ηnµn + wx)

11 σ2n+1 ← 1

νn+1·(νnσ

2n + (w − 1)s2 + ηnw

ηn+1· (x− µn)2

)// store new (posterior) hyperparameters for input i

12 Ai ← αn+1 = (µn+1, ηn+1, σ2n+1, νn+1)

each score to be equal likely, i.e., a uniform distribution. With min. score as 0and max. score as 1: x3 = 0.5 and s23 = 0.08.

We initialize hyperparameters α0 with µ0 as sample mean, σ20 as sample

variance, and η0 = ν0 as sample quality. For every input, we aggregate necessarysample means/variances for µ0/σ2

0 . For example, given input i1 with unevaluatedpattern {tp2, tp3}, we sum up (aggregate) statistics stat2 and stat3: x2 + x3 forµ0 and s22 + s23 for σ2

0 , cf. Fig. 3. Note, η0 and ν0 are used to quantify the priorquality. For instance, stat1 and stat2 are exact statistics, while stat3 relies on auniform distribution. So, weighting reflects the prior’s trustworthiness.

Posterior Distribution. Having estimated a prior distribution, we contin-uously update the distribution with scores seen during query processing.

Intuitively, each time new complete bindings are produced, all prior distribu-tions could be trained, cf. Algo. 1 line 11 and Algo. 2. That is, complete bindingscores are used to update hyperparameters from the previous n-th training it-eration, αn, resulting in new posterior hyperparameters, αn+1. For this, we usestandard training on lines 10-11 (Algo. 2) [6]. In simple terms, the prior meanµn is updated with the new sample mean x, line 10, and the prior variance σ2

n

is updated with the sample variance s2, line 11. Note, each input computes its“own” score sample x (Algo. 2, lines 5-6).

Prior hyperparameters are weighted via ηn and νn. Further, for each hyper-parameter update, a parameter w is used as weight (indicating the quality ofsamples x). Finally, new hyperparameters αn+1 are stored on line 12, Algo. 2.

Page 11: Pay-as-you-go Approximate Join Top-k Processing for the

Example 11. Given input i1 and η0 = ν0 = 1 in Fig. 3. Then, its prior isα0 = (1.2, 1, 0.2, 1). We observe scores x = {x1, x2} from B = {(t12, t21, t31),(t13, t22, t32)}, with w = |x| = 2, x1 = 1.9 = scoreQ(t21) + scoreQ(t31), andx2 = 0.9 = scoreQ(t22) + scoreQ(t32). So, s2 = 0.5, x = 1.4, which leads toposterior hyperparameters: η1 = ν1 = 1 + 2 = 3 and

σ21 =

1

(0.2 + (2− 1) · 0.5 +

(1.4− 1.2)2

3

)= 0.71

µ1 =(1.2 + 2 · 1.4)

3= 1.33

After each such update only posterior hyperparameters are stored, therebymaking the learning highly space and time efficient:

Theorem 1 (Distribution Learning Space Complexity). Given an A-PRBJ operator, at any time during query processing, we require a space com-plexity of O(1) for score distribution learning.

Proof Sketch. Given an A-PRBJ operator, every of its inputs i stores onlya parameter vector, hyperparameters α, during each training iteration (Algo-rithm 2, Line 12). Since each vector α has a fixed size, the space consumptionremains constant. In particular, vector α is independent of the number of trainingiterations �

For the learning time complexity we can show:

Theorem 2 (Distribution Learning Time Complexity). Given an A-PRBJ operator, a query Q, and B complete bindings for Q, score learning timecomplexity is bounded by O(|B|).

Proof Sketch. Given an A-PRBJ operator and a set of complete bindings B: Ascore sample, x, is constructed (Algorithm 2, Lines 5-6) with O(|B|) complexity.Mean and variance is computed from x in O(|x|) time. However, since |x| ≤|B|, it holds that O(|x|) ∈ O(|B|). In fact, computation of mean and variancecould also be done while collecting the sample (Algorithm 2, Lines 5-6). Further,hyperparameters are updated via x in constant time. Overall, the training hasa complexity of: O(|B|) �

Predictive Distribution. In Algo. 3, we provide an implementation of thetop-k test. At any point during query processing, one may need to perform thistest, Algo. 1 line 8. Thus, our approach allows to always give a distribution forXs

i based on the currently known hyperparameters αn (Algo. 3, line 2). Sincehyperparameters are continuously trained, the distributions improve over time.

More specifically, we use the posterior predictive distribution. This distri-bution estimates probabilities for new scores, based on observed scores and theprior distribution. For a Gaussian conjugate prior, this distribution can be easilyobtained in a closed form as non-standardized Student’s t-distribution with νndegrees of freedom [6], cf. Algo. 3, line 3. Then, we compute P (Xs

Qu(b)) = P (Xsi )

by means of the posterior predictive distribution on line 4. Last, we compute thebinding probability via a selectivity estimation function (Eq. 3) on line 5 andreturn b’s top-k test probability, cf. line 6.

Page 12: Pay-as-you-go Approximate Join Top-k Processing for the

Algorithm 3: PC.probabilityTopK()

Buffer : Buffer A storing hyperparameters.Input : Partial bindings b, input i, and join j.Output: Probability that b will result in one (or more) final top-k bindings.

1 begin// load hyperparameters αn for input i

2 αn = (µn, ηn, σ2n, νn)← Ai

// posterior predictive distribution based on hyperparam. αn// in closed-form as Student’s t-distribution

3 Xsi ∼ t(νn)

(x | µn, σ

2n(ηn+1)

ηn

)// compute score probability

4 pS ← P(XsQu(b) ≥ δ(κ, b)

)= P (Xs

i ≥ δ(κ, b))// compute binding probability

5 pB ← 1{Qu(b) | b}// probability that b contributes to top-k results

6 return pS · pB

4 Evaluation

Benchmarks. We used two SPARQL benchmarks: (1) The SP² benchmarkfeaturing synthetic DBLP data [16]. (2) The DBpedia SPARQL benchmark(DBPSB), which holds real-world DBpedia data and queries [13]. For bothbenchmarks we generated datasets with 10M triples. We translated the SPARQLbenchmark queries to our query model (BGPs). Queries featuring no BGPs werediscarded, i.e., we omitted 12 and 4 queries in DBPSB and SP². We generatedDBPSB queries as proposed in [13]: Overall, used 8 seed queries with 15 randombindings, which led to a total of 120 DBPSB queries. For SP² we employed 13queries. In total, we had a comprehensive load of 133 queries. Query statisticsare given in Table 1 and a complete query listing is shown in Sect. 7.

SP² Queries DBPSB Queries

# Queries 13 120# Triple pattern [2, 9] [2, 4]Mean(#Triple pattern) 5 2.8Var(#Triple pattern) 6.4 0.6# Results [1, 5.4M] [1, 50]Mean(#Results) 590K 3.9Var(#Results) 2.1B 51.6

Table 1: Query statistics for the SP² and DBPSB benchmark.

Systems. We randomly generated bushy query plans. For a given query,all systems rely on the same plan. We implemented three systems that solelydiffer in their join operator: (1) A system with join-sort operator, JS, whichdoes not employ top-k processing, but instead produces all results and then sorts

Page 13: Pay-as-you-go Approximate Join Top-k Processing for the

them. (2) An exact and complete top-k join operator, PBRJ, featuring the corner-bound in Def. 3 and the corner-bound-adaptive pulling strategy in Def. 4. PBRJis identical to Algo. 1, however, no top-k test is applied. Note, PBRJ resemblesprevious approaches for top-k processing over RDF data [9, 21]. (3) Last, weimplemented our approximate operator, A-PBRJ, see Algo 1 in Sect. 3.

Score learning and top-k test implementation for the A-PBRJ operator followsAlgo. 2 and Algo. 3, cf. Sect. 3.3. Further, we used sufficient statistics based ona uniform distribution over [0, 1], as discussed in Exp. 10 for sorted access sa3.Prior weights ν0 and η0 are both 1, Algo. 2. Weight w in Algo. 2 is the samplesize, |x|. We reused the selectivity estimation implementation from [14, 15] forour binding probabilities.

Hypothesis (H.1): We expect that JS is outperformed by PBRJ, as it computesall results for a query. Further, we expect A-PBRJ to outperform JS and PBRJ.A-PBRJ’s savings come at the cost of effectiveness.

We implemented all systems in Java 6. Experiments were run on a Linuxserver with two Intel Xeon 5140 CPUs at 2.33GHz, 48GB memory (16GB as-signed to the JVM), and a RAID10 with IBM SAS 148GB 10K rpm disks. Beforeeach query execution, all operating system caches were cleared. The presentedvalues are averages collected over five runs.

Ranking Function. We chose triple pattern binding scores, scoreQ(t), atrandom with distribution d ∈ {u, n, e} (uniform, normal, and exponential dis-tribution). We employed a summation as aggregation function, ⊕. By means ofvarying distributions, we aim at an abstraction from a particular ranking func-tion and examine performance for different “classes” of functions. We employedstandard parameters for all distributions and normalized scores to be in [0, 1].Hypothesis (H.2): A-PBRJ’s efficiency and effectiveness is not influenced by thescore distribution.

Parameters. We vary the number of results k ∈ {1, 5, 10, 20}. Hypothesis(H.3): We predict efficiency to decrease in parameter k for A-PBRJ and PBRJ.Further, we used top-k test thresholds τ ∈ [0, 0.8] for inspecting the trade-offbetween efficiency and effectiveness.

Metrics. We measure efficiency via: (1) #Inputs processed. (2) Time neededfor result computation. As effectiveness metrics we use: (1) Precision: fractionof approximated top-k results that are exact top-k results. (2) Recall: fractionof exact top-k results, which are reported as approximate results. Notice, pre-cision and recall have identical values, as both share the same denominator k.We therefore discuss only precision results in the following. Further, precision isgiven as average over our query load (so-called macro-precision). (3) Score er-ror: approximate vs. exact top-k score: 1

k

∑b=1,...,k |score∗Q(b)−scoreQ(b)|, with

score∗Q(b) and scoreQ(b) as approximated and exact score for binding b [20].

Efficiency Results. Efficiency results are depicted in Fig. 4-a/e (b/f) forSP² (DBPSB). As expected in hypothesis H.1, we observed A-PBRJ to save #in-puts and computation time. For SP² (DBPSB), A-PBRJ needed up to 25% (23%)less inputs vs. baseline PBRJ and 30% (67%) vs. JS. We explain these gains withpruning of partial bindings via our top-k test, thereby omitting “unnecessary”

Page 14: Pay-as-you-go Approximate Join Top-k Processing for the

0.6

0.65

0.7

0.75

0.8

0.85

0.9

0.95

1

0 0.1 0.2 0.4 0.6 0.8

Mac

ro-P

reci

sio

n

Threshold τ

A-PBRJ (E) A-PBRJ (N) A-PBRJ (U)

0.7

0.75

0.8

0.85

0.9

0.95

1

0 0.1 0.2 0.4 0.6 0.8

Ma

cro

-Pre

cisi

on

Threshold τ

A-PBRJ (E) A-PBRJ (N) A-PBRJ (U)

2.0E+06

2.2E+06

2.4E+06

2.6E+06

2.8E+06

3.0E+06

3.2E+06

0 0.1 0.2 0.4 0.6 0.8

#In

pu

ts

Threshold τ

JS

PBRJ

A-PBRJ

3.5E+04

4.5E+04

5.5E+04

6.5E+04

7.5E+04

8.5E+04

0 0.1 0.2 0.4 0.6 0.8

Tim

e (

ms)

Threshold τ

JS

PBRJ

A-PBRJ

0E+00

1E-02

2E-02

3E-02

4E-02

5E-02

6E-02

7E-02

0 0.1 0.2 0.4 0.6 0.8

Sco

re E

rro

r

Threshold τ

A-PBRJ (E) A-PBRJ (N) A-PBRJ (U) SP

2

(a) (b) (c) (d)

2.0E+05

3.0E+05

4.0E+05

5.0E+05

6.0E+05

7.0E+05

0 0.1 0.2 0.4 0.6 0.8

#In

pu

ts

Threshold τ

JS

PBRJ

A-PBRJ

2.0E+03

4.0E+03

6.0E+03

8.0E+03

1.0E+04

1.2E+04

0 0.1 0.2 0.4 0.6 0.8

Tim

e (m

s)

Threshold τ

JS

PBRJ

A-PBRJ

DB

PSB

5E-03

1E-02

2E-02

2E-02

3E-02

3E-02

0 0.1 0.2 0.4 0.6 0.8

Sco

re E

rro

r

Threshold τ

A-PBRJ (E) A-PBRJ (N) A-PBRJ (U)

(e) (f) (h) (g)

Fig. 4: Evaluation results for SP²/DBPSB: (a)/(e) Efficiency: #inputs vs. thresh-old τ . (b)/(f) Efficiency: time vs. threshold τ . (c)/(g) Effectiveness: macro-precision vs. threshold τ . (d)/(h) Effectiveness: score error vs. threshold τ .

joins and join attempts. In fact, we were able to prune up to 40% (90%) of theinputs, given SP² (DBPSB). Fewer #inputs translated to time savings of 35%(65%) vs. PBRJ and 47% (80%) vs. JS, given SP² (DBPSB).

Interestingly, we saw an increase in #inputs for τ ∈ [0.2, 0.4] in SP² andτ ∈ [0.4, 0.8] in DBPSB, cf. Fig. 4-a/e. For instance, comparing τ = 0.2 andτ = 0.4 in SP², A-PBRJ read 8% more inputs. DBPSB was less affected: wenoticed a marginal increase of 2% for τ = 0.4 vs. τ = 0.6. We explain theincrease in both benchmarks with a too “aggressive” pruning – too many partialbindings were pruned wrongfully. That is, many pruned bindings would have ledto a larger or even a complete binding. In turn, this led to more inputs being read,in order to produce the desired k results. In fact, τ ∈ [0.6, 0.8] was even moreaggressive. However, the ratio between pruned bindings and read inputs washigh enough to compensate for the extra inputs. Overall, we saw a “sweet spot”at τ ≈ 0.2 for SP² and DBPSB. Here, we noted pruning to be fairly accurate,i.e., only few partial bindings were wrongfully pruned. In fact, we observed highprecision (recall) values for both benchmarks given τ ≈ 0.2: 88% (95%) in SP²(DBPSB) – as discussed below. With regard to computation time for SP² andDBPSB queries, we noticed similar effects as for the #inputs, cf. Fig. 4-b/f. Inparticular, the “sweet spot” at τ ≈ 0.2 is also reflected here.

As expressed by hypothesis H.3, we observed #inputs and time to increasein k for A-PBRJ and PBRJ. For instance, comparing k = 1 and k = 20, A-PBRJneeded a factor of 1.2 (5.7) more time, given SP² (DBPSB). Similarly, 1.2 (6.8)times more inputs were consumed by A-PBRJ for SP² (DBPSB). We explainthis behavior with more inputs/join attempts being required to produce a largerresult. PBRJ leads to a similar performance decrease. For instance given k = 1vs. k = 20 in SP², PBRJ needed a factor of 1.3 (1.2) more inputs (time). Note,as baseline JS simply computed all results, this system was not affected by k.

Furthermore, we can confirm our hypothesis H.2 with regard to system ef-ficiency: we cloud not find a correlation between system performance and scoredistributions. In other words, score distributions (ranking functions) had no im-pact on A-PBRJ’s performance. For instance given DBPSB queries, A-PBRJ re-

Page 15: Pay-as-you-go Approximate Join Top-k Processing for the

sulted in the following gains vs. PBRJ w.r.t. #inputs (time): 27% (65%) for edistribution, 23% (64%) given u distribution, and 21% (64%) for n distribution.

Last, with regard to parameter τ , we noted A-PBRJ’s efficiency to increasewith τ ∈ [0, 0.2], given SP² and DBPSB. However, as outlined above, too ag-gressive pruning let to “inverse” effects. An important observation is, however,that our approach was already able to achieve performance gains with a verysmall τ < 0.1. Here, partial bindings were pruned primarily due to their lowbinding probability. In fact, A-PBRJ could even save time for τ = 0: 26% (60%)with SP² (DBPSB). We inspected queries leading to such saving and saw thatmany of their partial bindings had a binding probability ≈ 0. We argue that thisa strong advantage of A-PBRJ: even for low error thresholds (leading to a minoreffectiveness decrease), we could achieve efficiency gains.

Effectiveness Results. Next, we analyze A-PBRJ in terms of its accuracy.Baselines PBRJ and JS always compute exact and complete results. So, we restrictour attention to the A-PBRJ system and different score distributions d ∈ {u, n, e}.

Fig. 4-c/g (d/h) depicts the macro-precision (score error) for varying scoredistributions. We observed high precision values of up to 0.98 for both bench-marks, see Fig. 4-c/g. More precisely, we saw best results for a small τ < 0.1 andthe exponential distribution. However, differences are only marginal. That is,given τ < 0.1, all distributions led to very similar precision results ∈ [0.8, 0.95]and [0.90, 0.98] for SP² and DBPSB, respectively. In other words, A-PBRJ’s ef-fectiveness is not affected by a particular score distribution. We explain thesegood approximations with accurate score/binding probabilities.

Moreover, even for large τ ∈ [0.6, 0.8] A-PBRJ achieved a high macro-precisionin [0.75, 0.8] on DBPSB queries. This is because DBPSB queries featured selec-tive patterns and had only a small result cardinality ≤ 10. Thus, “chances” ofpruning a final top-k binding were quite small – even for a large τ . Moreover,A-PBRJ let to a very effective pruning via binding probabilities, as many partialbindings had a binding probability ≈ 0 (due to the high query selectivity). Thisway, A-PBRJ pruned up to 97% of the total inputs for some DBPSB queries.

In order to quantify “how bad” false positive/negative results are, we em-ployed the score error metric, see Fig. 4-d/h. For both benchmarks, we observedthat score error was ∈ [0.07, 0.11] for a small τ < 0.1. We explain this with ourhigh precision (recall). That is, A-PBRJ led to only few false positive/negativetop-k results given τ < 0.1. As expected, score error increased in τ , due to morefalse positives/negatives top-k results. Overall, however, score error results werevery promising: we saw an average score error of 0.03 (0.02), given SP² (DBPSB).

With regard to parameter k, we observed that k does not impact A-PBRJ’seffectiveness. Given SP², we saw A-PBRJ to be fairly stable in different values forparameter k. For instance, macro-precision was in [0.8, 0.85] as average over all kand τ = 0.1. Also for the DBPSB benchmark, we noted only minor effectivenessfluctuations: macro-precision varied around 7% with regard to different k.

We noticed A-PBRJ’s effectiveness to not be influenced by varying score dis-tributions, see Fig. 4-c/g/d/h. Given SP², we saw a macro-precision of: 0.79 foru distribution, 0.79 for e distribution, and 0.80 for n distribution. Also for the

Page 16: Pay-as-you-go Approximate Join Top-k Processing for the

DBPSB benchmark, we observed only minor changes in macro-precision: 0.87for u distribution, 0.85 for e as well as n distribution.

With regard to the effectiveness of A-PBRJ versus parameter τ , we noticedthat metrics over both benchmarks decreased with increasing τ . For instance,macro-precision decreased for τ = 0 versus τ = 0.8 with 27% (23%), givenSP² (DBPSB). Such a behavior can be expected, since chances of pruning “thewrong” bindings increase with higher τ values. Overall, this confirms H.1: A-PBRJ trades off effectiveness for efficiency, as dictated by threshold τ .

5 Related Work

There is a large body of work on top-k query processing for relational data-bases [8]. Most recently, such approaches have been adopted to RDF data andSPARQL queries [9, 21]. These works aim at exact and complete top-k results.However, for many applications result accuracy and completeness is not impor-tant. Instead, result computation time is the key factor.

To foster an efficient result computation, approximate top-k techniques havebeen proposed [2, 3, 12, 18, 20]. Most notably, [20] used score statistics to pre-dict the highest possible complete score of a partial binding. Partial results arediscarded, if they are not likely to contribute to a top-k result. Focusing ondistributed top-k queries, [12] employed histograms to predict aggregated scorevalues over a space of data sources. Anytime measures for top-k processing havebeen introduced by [2, 3]. For this, the authors used offline score information, e.g.,histograms, to predict complete binding scores at runtime. In [18], approximatetop-k processing under budgetary has been addressed.

Unfortunately, all such approximate top-k approaches heavily rely on scorestatistics at offline time. That is, scores must be known at indexing time forcomputing statistics, e.g., histograms. However, offline statistics lead to majordrawbacks in a Web setting – as outlined in problem (P.1) and (P.2), cf. Sect. 1.In contrast, we propose a lightweight system: we learn our score distributionsin a pay-as-you-go manner at runtime. In fact, our statistics cause only minoroverhead in terms of space and time, cf. Thm. 1 and Thm. 2.

6 Conclusion

In this paper, we introduced an approximate join top-k algorithm, A-PBRJ, well-suited for the Web of data (P.1+P.2, Sect. 1). For this, we extended the well-known PBRJ framework [17] with a novel probabilistic component. This compo-nent allows to prune partial bindings, which are not likely to contribute to thefinal top-k result. We evaluated our A-PBRJ system by means of two SPARQLbenchmarks: we could achieve times savings of up to 65%, while maintaining ahigh precision/recall.

References

1. R. Agrawal, R. Rantzau, and E. Terzi. Context-sensitive ranking. In SIGMOD,2006.

Page 17: Pay-as-you-go Approximate Join Top-k Processing for the

2. B. Arai, G. Das, D. Gunopulos, and N. Koudas. Anytime measures for top-kalgorithms. In VLDB, 2007.

3. B. Arai, G. Das, D. Gunopulos, and N. Koudas. Anytime measures for top-kalgorithms on exact and fuzzy data sets. VLDB Journal, 2009.

4. S. Chaudhuri, G. Das, V. Hristidis, and G. Weikum. Probabilistic informationretrieval approach for ranking of database query results. TODS, 2006.

5. J. Finger and N. Polyzotis. Robust and efficient algorithms for rank join evaluation.In SIGMOD, 2009.

6. P. D. Hoff. A First Course in Bayesian Statistical Methods. Springer, 2009.7. I. F. Ilyas, W. G. Aref, and A. K. Elmagarmid. Supporting top-k join queries in

relational databases. VLDB Journal, 2004.8. I. F. Ilyas, G. Beskales, and M. A. Soliman. A survey of top-k query processing

techniques in relational database systems. ACM Comput. Surv., 2008.9. S. Magliacane, A. Bozzon, and E. Della Valle. Efficient execution of top-k SPARQL

queries. In ISWC, 2012.10. N. Mamoulis, M. L. Yiu, K. H. Cheng, and D. W. Cheung. Efficient top-k aggre-

gation of ranked inputs. TODS, 2007.11. D. Martinenghi and M. Tagliasacchi. Cost-Aware Rank Join with Random and

Sorted Access. TKDE, 2012.12. S. Michel, P. Triantafillou, and G. Weikum. KLEE: a framework for distributed

top-k query algorithms. In VLDB, 2005.13. M. Morsey et al. DBpedia SPARQL Benchmark – Performance Assessment with

Real Queries on Real Data. In ISWC, 2011.14. T. Neumann and G. Moerkotte. Characteristic sets: Accurate cardinality estima-

tion for RDF queries with multiple joins. In ICDE, 2011.15. T. Neumann and G. Weikum. Scalable join processing on very large RDF graphs.

In SIGMOD, 2009.16. M. Schmidt, T. Hornung, G. Lausen, and C. Pinkel. SP2Bench: A SPARQL Per-

formance Benchmark. In ICDE, 2009.17. K. Schnaitter and N. Polyzotis. Evaluating rank joins with optimal cost. In PODS,

2008.18. M. Shmueli-Scheuer, C. Li, Y. Mass, H. Roitman, R. Schenkel, and G. Weikum.

Best-Effort Top-k Query Processing Under Budgetary Constraints. In ICDE, 2009.19. A. Telang, C. Li, and S. Chakravarthy. One Size Does Not Fit All: Toward User-

and Query-Dependent Ranking for Web Databases. TKDE, 2012.20. M. Theobald, G. Weikum, and R. Schenkel. Top-k query evaluation with proba-

bilistic guarantees. In VLDB, 2004.21. A. Wagner, T. T. Duc, G. Ladwig, A. Harth, and R. Studer. Top-k linked data

query processing. In ESWC, 2012.

7 Appendix

In this section, we present the query load that was used during our experiments.Queries for the SP² benchmark are based on [16], while the DBPSB benchmarkqueries are generated from seed queries in [13]. All queries are given in RDF N31

notation.

1 http://www.w3.org/TeamSubmission/n3/

Page 18: Pay-as-you-go Approximate Join Top-k Processing for the

Listing 1.1: Prefixes used for SP² and DBPSB queries.

1 @prefix rd f :2 <http ://www. w3 . org /1999/02/22− rdf−syntax−ns#> .3 @prefix r d f s :4 <http ://www. w3 . org /2000/01/ rdf−schema#> .5 @prefix dc :6 <http :// pur l . org /dc/ e lements /1.1/> .7 @prefix dcterms :8 <http :// pur l . org /dc/ terms/> .9 @prefix xs :

10 <http ://www. w3 . org /2001/XMLSchema#> .11 @prefix bench :12 <http :// l o c a l h o s t / vocabulary /bench/> .13 @prefix f o a f :14 <http :// xmlns . com/ f o a f /0.1/> .15 @prefix swrc :16 <http :// swrc . ontoware . org / onto logy#> .17 @prefix dbpedia :18 <http :// dbpedia . org / onto logy/> .19 @prefix dbpediaprop :20 <http :// dbpedia . org / property/> .21 @prefix dbpediares :22 <http :// dbpedia . org / r e sou r c e/> .23 @prefix skos :24 <http ://www. w3 . org /2004/02/ skos / core> .25 @prefix yago :26 <http :// dbpedia . org / c l a s s /yago/> .

Listing 1.2: Queries for SP² benchmark [16].

1 ### 12 ? j o u r n a l dc : t i t l e ” Journal 1 (1940) ”ˆˆ xs : s t r i n g .3 ? j o u r n a l dcterms : i s s u e d ? yr .4 ? j o u r n a l rd f : type bench : Journal .5

6 ### 27 ? inproc dcterms : partOf ? proc .8 ? inproc bench : b o o k t i t l e ? b o o k t i t l e .9 ? inproc swrc : pages ? page .

10 ? inproc dc : t i t l e ? t i t l e .11 ? inproc r d f s : s eeAl so ? ee .12 ? inproc f o a f : homepage ? u r l .13 ? inproc dcterms : i s s u e d ? yr .14 ? inproc dc : c r e a t o r ? author .15 ? inproc rd f : type bench : Inproceed ings .16

17 ### 318 ? a r t i c l e rd f : type bench : A r t i c l e .

Page 19: Pay-as-you-go Approximate Join Top-k Processing for the

19 ? a r t i c l e swrc : pages ? value .20

21 ### 422 ? a r t i c l e rd f : type bench : A r t i c l e .23 ? a r t i c l e swrc : month ? value .24

25 ### 526 ? a r t i c l e rd f : type bench : A r t i c l e .27 ? a r t i c l e swrc : i sbn ? value .28

29 ### 630 ? a r t i c l e 1 dc : c r e a t o r ? author1 .31 ? author1 f o a f : name ?name1 .32 ? a r t i c l e 1 swrc : j o u r n a l ? j o u r n a l .33 ? a r t i c l e 2 swrc : j o u r n a l ? j o u r n a l .34 ? a r t i c l e 2 dc : c r e a t o r ? author2 .35 ? author2 f o a f : name ?name2 .36 ? a r t i c l e 1 rd f : type bench : A r t i c l e .37 ? a r t i c l e 2 rd f : type bench : A r t i c l e .38

39 ### 740 ? a r t i c l e dc : c r e a t o r ? person .41 ? person f o a f : name ?name .42 ? person2 f o a f : name ?name .43 ? inproc dc : c r e a t o r ? person2 .44 ? a r t i c l e rd f : type bench : A r t i c l e .45 ? inproc rd f : type bench : Inproceed ings .46

47 ### 848 ?document dcterms : i s s u e d ? yr .49 ?document dc : c r e a t o r ? author .50 ? author f o a f : name ?name .51 ?document rd f : type ? c l a s s .52 ? c l a s s r d f s : subClassOf f o a f : Document .53

54 ### 955 ? doc dc : t i t l e ? t i t l e .56 ?bag2 ?member2 ? doc .57 ? doc2 dcterms : r e f e r e n c e s ?bag2 .58 ? doc rd f : type ? c l a s s .59 ? c l a s s r d f s : subClassOf f o a f : Document .60

61 ### 1062 ? erdoes f o a f : name ”Paul Erdoes ”ˆˆ xs : s t r i n g .63 ?document dc : c r e a t o r ? erdoes .64 ?document dc : c r e a t o r ? author .65 ?document2 dc : c r e a t o r ? author .66 ?document2 dc : c r e a t o r ? author2 .67 ? author2 f o a f : name ?name .68 ? erdoes rd f : type f o a f : Person .

Page 20: Pay-as-you-go Approximate Join Top-k Processing for the

69

70 ### 1171 ? person rd f : type f o a f : Person .72 ? s u b j e c t ? p r e d i c a t e ? person .73

74 ### 1275 ? a r t i c l e dc : c r e a t o r ? person1 .76 ? person1 f o a f : name ?name .77 ? person2 f o a f : name ?name .78 ? inproc dc : c r e a t o r ? person2 .79 ? inproc rd f : type bench : Inproceed ings .80 ? a r t i c l e rd f : type bench : A r t i c l e .81

82 ### 1383 ? erdoes f o a f : name ”Paul Erdoes ”ˆˆ xs : s t r i n g .84 ?document dc : c r e a t o r ? erdoes .85 ?document dc : c r e a t o r ? author .86 ?document2 dc : c r e a t o r ? author .87 ?document2 dc : c r e a t o r ? author2 .88 ? author2 f o a f : name ?name .89 ? erdoes rd f : type f o a f : Person .

Listing 1.3: Queries for DBPSB benchmark [13].

1 ### 12 ? var5 rd f : type dbpedia : Person .3 ? var5 f o a f : page ? var8 .4 ? var5 dbpedia : thumbnail ? var4 .5 ? var5 r d f s : l a b e l ”Thaksin Shinawatra ”@nn .6

7 ### 28 ? var5 dbpedia : thumbnail ? var4 .9 ? var5 rd f : type dbpedia : Person .

10 ? var5 r d f s : l a b e l11 ”\u0420\u0438\u0448\u0435 ,12 \u0428\u0430\u0440\u043B\u044C”@ru .13 ? var5 f o a f : page ? var8 .14

15 ### 316 ? var5 dbpedia : thumbnail ? var4 .17 ? var5 rd f : type dbpedia : Person .18 ? var5 r d f s : l a b e l19 ”Amadeo , quinto20 Duque de Aosta”@es .21 ? var5 f o a f : page ? var8 .22

23 ### 424 ? var5 dbpedia : thumbnail ? var4 .25 ? var5 rd f : type dbpedia : Person .

Page 21: Pay-as-you-go Approximate Join Top-k Processing for the

26 ? var5 r d f s : l a b e l ”Godeberta”@en .27 ? var5 f o a f : page ? var8 .28

29 ### 530 ? var5 dbpedia : thumbnail ? var4 .31 ? var5 rd f : type dbpedia : Person .32 ? var5 r d f s : l a b e l ”Thaksin Shinawatra ”@nl .33 ? var5 f o a f : page ? var8 .34

35 ### 636 ? var5 dbpedia : thumbnail ? var4 .37 ? var5 rd f : type dbpedia : Person .38 ? var5 r d f s : l a b e l39 ”\u827E\u9A30\u00B740 \u4F0A\u683C\u8A00”@zh .41 ? var5 f o a f : page ? var8 .42

43 ### 744 ? var5 dbpedia : thumbnail ? var4 .45 ? var5 rd f : type dbpedia : Person .46 ? var5 r d f s : l a b e l ”Vlad\u00EDmir Karpets ”@es .47 ? var5 f o a f : page ? var8 .48

49 ### 850 ? var5 dbpedia : thumbnail ? var4 .51 ? var5 rd f : type dbpedia : Person .52 ? var5 r d f s : l a b e l ” Danie l Pear l ”@en .53 ? var5 f o a f : page ? var8 .54

55 ### 956 ? var5 dbpedia : thumbnail ? var4 .57 ? var5 rd f : type dbpedia : Person .58 ? var5 r d f s : l a b e l59 ”\u30DE\u30EA\u30FC\u30FB\u30EB60 \u30A4\u30FC\u30BA\u30FB\u30C9\u30EB61 \u30EC\u30A2\u30F3”@ja .62 ? var5 f o a f : page ? var8 .63

64 ### 1065 ? var5 dbpedia : thumbnail ? var4 .66 ? var5 rd f : type dbpedia : Person .67 ? var5 r d f s : l a b e l ”Walter Hodge”@en .68 ? var5 f o a f : page ? var8 .69

70 ### 1171 ? var5 dbpedia : thumbnail ? var4 .72 ? var5 rd f : type dbpedia : Person .73 ? var5 r d f s : l a b e l ”Damian Wayne”@en .74 ? var5 f o a f : page ? var8 .75

Page 22: Pay-as-you-go Approximate Join Top-k Processing for the

76 ### 1277 ? var5 dbpedia : thumbnail ? var4 .78 ? var5 rd f : type dbpedia : Person .79 ? var5 r d f s : l a b e l80 ”\u0417\u0430\u043B81 \u0435\u0432\u0441\u043A82 \u0438\u0439 , \u041A\u0430\u043783 \u0438\u043C\u0435\u0436”@ru .84 ? var5 f o a f : page ? var8 .85

86 ### 1387 ? var5 dbpedia : thumbnail ? var4 .88 ? var5 rd f : type dbpedia : Person .89 ? var5 r d f s : l a b e l90 ”\u0413\u043891 \u043B\u0430\u0443\u0440\u0438 ,92 \u041D\u0438\u043A\u043E\u043B93 \u043E\u0437 \u0417\u0443\u044094 \u0430\u0431\u043E\u0432\u043895 \u0447”@ru .96 ? var5 f o a f : page ? var8 .97

98 ### 1499 ? var5dbpedia : thumbnail ? var4 .

100 ? var5 rd f : type dbpedia : Person .101 ? var5 r d f s : l a b e l ” Franc i s Atterbury ”@en .102 ? var5 f o a f : page ? var8 .103

104 ### 15105 ? var5 dbpedia : thumbnail ? var4 .106 ? var5 rd f : type dbpedia : Person .107 ? var5 r d f s : l a b e l ”Damian Wayne”@es .108 ? var5 f o a f : page ? var8 .109

110 ### 16111 ? var4 dbpediaprop : b i r thP lac e112 ”Vigny , Val d ’ Oise ”@en .113 ? var4 dbpedia : b irthDate ? var6 .114 ? var4 f o a f : name ? var8 .115 ? var4 dbpedia : deathDate ? var10 .116

117 ### 17118 ? var4 dbpediaprop : b i r thP lac e119 ” Sa l i sbury , England”@en .120 ? var4 dbpedia : b irthDate ? var6 .121 ? var4 f o a f : name ? var8 .122 ? var4 dbpedia : deathDate ? var10 .123

124 ### 18125 ? var4 dbpediaprop : b i r thP lac e

Page 23: Pay-as-you-go Approximate Join Top-k Processing for the

126 ” Bai l ey in the c i t y o f Durham”@en .127 ? var4 dbpedia : b irthDate ? var6 .128 ? var4 f o a f : name ? var8 .129 ? var4 dbpedia : deathDate ? var10 .130

131 ### 19132 ? var4 dbpediaprop : b i r thP lac e133 ” Vas i l i evskaya ,134 Tambov Governorate , ”@en .135 ? var4 dbpedia : b irthDate ? var6 .136 ? var4 f o a f : name ? var8 .137 ? var4 dbpedia : deathDate ? var10 .138

139 ### 20140 ? var4 dbpediaprop : b i r thP lac e141 dbpediares : Waltham%2C Massachusetts .142 ? var4 dbpedia : b irthDate ? var6 .143 ? var4 f o a f : name ? var8 .144 ? var4 dbpedia : deathDate ? var10 .145

146 ### 21147 ? var4 dbpediaprop : b i r thP lac e148 dbpediares : Valenc ia%2C Spain .149 ? var4 dbpedia : b irthDate ? var6 .150 ? var4 f o a f : name ? var8 .151 ? var4 dbpedia : deathDate ? var10 .152

153 ### 22154 ? var4 dbpediaprop : b i r thP lac e155 dbpediares : Ha l i f ax%2C West Yorkshire .156 ? var4 dbpedia : b irthDate ? var6 .157 ? var4 f o a f : name ? var8 .158 ? var4 dbpedia : deathDate ? var10 .159

160 ### 23161 ? var4 dbpediaprop : b i r thP lac e dbped iares : Sucre .162 ? var4 dbpedia : b irthDate ? var6 .163 ? var4 f o a f : name ? var8 .164 ? var4 dbpedia : deathDate ? var10 .165

166 ### 24167 ? var4 dbpediaprop : b i r thP lac e168 dbpediares : L%C3%BAcar .169 ? var4 dbpedia : b irthDate ? var6 .170 ? var4 f o a f : name ? var8 .171 ? var4 dbpedia : deathDate ? var10 .172

173 ### 25174 ? var4 dbpediaprop : b i r thP lac e175 dbpediares :%C3%89tampes .

Page 24: Pay-as-you-go Approximate Join Top-k Processing for the

176 ? var4 dbpedia : b irthDate ? var6 .177 ? var4 f o a f : name ? var8 .178 ? var4 dbpedia : deathDate ? var10 .179

180 ### 26181 ? var4 dbpediaprop : b i r thP lac e182 dbpediares : Montgomery County%2C Maryland .183 ? var4 dbpedia : b irthDate ? var6 .184 ? var4 f o a f : name ? var8 .185 ? var4 dbpedia : deathDate ? var10 .186

187 ### 27188 ? var4 dbpediaprop : b i r thP lac e189 ” Berkeley , G l o u c e s t e r s h i r e ”@en .190 ? var4 dbpedia : b irthDate ? var6 .191 ? var4 f o a f : name ? var8 .192 ? var4 dbpedia : deathDate ? var10 .193

194 ### 28195 ? var4 dbpediaprop : b i r thP lac e196 dbpediares : Papa l State s .197 ? var4 dbpedia : b irthDate ? var6 .198 ? var4 f o a f : name ? var8 .199 ? var4 dbpedia : deathDate ? var10 .200

201 ### 29202 ? var4 dbpediaprop : b i r thP lac e203 dbpediares : City of London .204 ? var4 dbpedia : b irthDate ? var6 .205 ? var4 f o a f : name ? var8 .206 ? var4 dbpedia : deathDate ? var10 .207

208 ### 30209 ? var4 dbpediaprop : b i r thP lac e210 ”Houghton , Norfolk , England”@en .211 ? var4 dbpedia : b irthDate ? var6 .212 ? var4 f o a f : name ? var8 .213 ? var4 dbpedia : deathDate ? var10 .214

215 ### 31216 ? var4 r d f s : l a b e l ” (372) Palma”@de .217 ? var3 skos : broader ? var4 .218 ? var3 r d f s : l a b e l ? var6 .219

220 ### 32221 ? var4 r d f s : l a b e l ” (11554) Asios ”@de .222 ? var3 skos : broader ? var4 .223 ? var3 r d f s : l a b e l ? var6 .224

225 ### 33

Page 25: Pay-as-you-go Approximate Join Top-k Processing for the

226 ? var4 r d f s : l a b e l ” (3080) Mois se i ev ”@de .227 ? var3 skos : broader ? var4 .228 ? var3 r d f s : l a b e l ? var6 .229

230 ### 34231 ? var4 r d f s : l a b e l ” (1273) Helma”@de .232 ? var3 skos : broader ? var4 .233 ? var3 r d f s : l a b e l ? var6 .234

235 ### 35236 ? var4 r d f s : l a b e l237 ” (119878) 2002 CY224”@en .238 ? var3 skos : broader ? var4 .239 ? var3 r d f s : l a b e l ? var6 .240

241 ### 36242 ? var4 r d f s : l a b e l243 ”039A\u578B\u6F5C244 \u6C34\u8266”@ja .245 ? var3 skos : broader ? var4 .246 ? var3 r d f s : l a b e l ? var6 .247

248 ### 37249 ? var4 r d f s : l a b e l250 ” (4444) \u042D251 \u0448\u0435\u0440”@ru .252 ? var3 skos : broader ? var4 .253 ? var3 r d f s : l a b e l ? var6 .254

255 ### 38256 ? var4 r d f s : l a b e l257 ” (3834) Zappafrank ”@es .258 ? var3 skos : broader ? var4 .259 ? var3 r d f s : l a b e l ? var6 .260

261 ### 39262 ? var4 r d f s : l a b e l ” (2612) Kathryn”@de .263 ? var3 skos : broader ? var4 .264 ? var3 r d f s : l a b e l ? var6 .265

266 ### 40267 ? var4 r d f s : l a b e l ” (290) Bruna”@de .268 ? var3 skos : broader ? var4 .269 ? var3 r d f s : l a b e l ? var6 .270

271 ### 41272 ? var4 r d f s : l a b e l ” (438) Zeuxo”@de .273 ? var3 skos : broader ? var4 .274 ? var3 r d f s : l a b e l ? var6 .275

Page 26: Pay-as-you-go Approximate Join Top-k Processing for the

276 ### 42277 ? var4 r d f s : l a b e l ” !X\u00F3\u00F5”@de .278 ? var3 skos : broader ? var4 .279 ? var3 r d f s : l a b e l ? var6 .280

281 ### 43282 ? var4 r d f s : l a b e l ” (1083) Sa lv i a ”@de .283 ? var3 skos : broader ? var4 .284 ? var3 r d f s : l a b e l ? var6 .285

286 ### 44287 ? var4 r d f s : l a b e l288 ” (1296) Andr\u00E9e”@de .289 ? var3 skos : broader ? var4 .290 ? var3 r d f s : l a b e l ? var6 .291

292 ### 45293 ? var1 rd f : type yago : ChristianLGBTPeople .294 ? var1 f o a f : givenName ? var2 .295

296 ### 46297 ? var1 rd f : type yago : DefJamRecordingsArt ists .298 ? var1 f o a f : givenName ? var2 .299

300 ### 47301 ? var1 rd f : type yago : IndianFi lmActors .302 ? var1 f o a f : givenName ? var2 .303

304 ### 48305 ? var1 rd f : type yago : Eng l i shKeyboard i s t s .306 ? var1 f o a f : givenName ? var2 .307

308 ### 49309 ? var1 rd f : type yago : GuitarPlayers .310 ? var1 f o a f : givenName ? var2 .311

312 ### 50313 ? var1 rd f : type yago : Fi l ip inoFemaleModels .314 ? var1 f o a f : givenName ? var2 .315

316 ### 51317 ? var1 rd f : type yago : BluesBrothers .318 ? var1 f o a f : givenName ? var2 .319

320 ### 52321 ? var1 rd f : type yago : AmericanSongwriters .322 ? var1 f o a f : givenName ? var2 .323

324 ### 53325 ? var1 rd f : type yago : F r e n c h J a z z V i o l i n i s t s .

Page 27: Pay-as-you-go Approximate Join Top-k Processing for the

326 ? var1 f o a f : givenName ? var2 .327

328 ### 54329 ? var1 rd f : type yago : EnglishJazzComposers .330 ? var1 f o a f : givenName ? var2 .331

332 ### 55333 ? var1 rd f : type yago : HarveyMuddCollegeAlumni .334 ? var1 f o a f : givenName ? var2 .335

336 ### 56337 ? var1 rd f : type yago : Bass i s t109842629 .338 ? var1 f o a f : givenName ? var2 .339

340 ### 57341 ? var1 rd f : type yago : Curate109983572 .342 ? var1 f o a f : givenName ? var2 .343

344 ### 58345 ? var1 rd f : type yago : GreekFemaleModels .346 ? var1 f o a f : givenName ? var2 .347

348 ### 59349 ? var1 rd f : type yago : F i l i p i n o R e l i g i o u s L e a d e r s .350 ? var1 f o a f : givenName ? var2 .351

352 ### 60353 ? var4 skos : s u b j e c t354 dbpediares : Category :1004 deaths .355 ? var4 f o a f : name ? var6 .356

357 ### 61358 ? var4 skos : s u b j e c t359 dbpediares : Category :360 11 th century in Eng land .361 ? var4 f o a f : name ? var6 .362

363 ### 62364 ? var4 skos : s u b j e c t365 dbpediares : Category :1067 deaths .366 ? var4 f o a f : name ? var6 .367

368 ### 63369 ? var4 skos : s u b j e c t370 dbpediares : Category :1107 b i r t h s .371 ? var4 f o a f : name ? var6 .372

373 ### 64374 ? var4 skos : s u b j e c t375 dbpediares : Category :%C5%A0koda trams .

Page 28: Pay-as-you-go Approximate Join Top-k Processing for the

376 ? var4 f o a f : name ? var6 .377

378 ### 65379 ? var4 skos : s u b j e c t380 dbpediares : Category :381 %C3%81g u i l a s C i b a e%C3%B1as p layer s .382 ? var4 f o a f : name ? var6 .383

384 ### 66385 ? var4 skos : s u b j e c t386 dbpediares : Category :1255 b i r t h s .387 ? var4 f o a f : name ? var6 .388

389 ### 67390 ? var4 skos : s u b j e c t391 dbpediares : Category : 0 s BC bi r ths .392 ? var4 f o a f : name ? var6 .393

394 ### 68395 ? var4 skos : s u b j e c t396 dbpediares : Category :397 1130 d i s e s t a b l i s h m e n t s .398 ? var4 f o a f : name ? var6 .399

400 ### 69401 ? var4 skos : s u b j e c t402 dbpediares : Category :403 . 32 S%26W Long firearms .404 ? var4 f o a f : name ? var6 .405

406 ### 70407 ? var4 skos : s u b j e c t408 dbpediares : Category :1009 deaths .409 ? var4 f o a f : name ? var6 .410

411 ### 71412 ? var4 skos : s u b j e c t413 dbpediares : Category :1144 deaths .414 ? var4 f o a f : name ? var6 .415

416 ### 72417 ? var4 skos : s u b j e c t418 dbpediares : Category :1239 deaths .419 ? var4 f o a f : name ? var6 .420

421 ### 73422 ? var4 skos : s u b j e c t423 dbpediares : Category :1070 s dea th s .424 ? var4 f o a f : name ? var6 .425

Page 29: Pay-as-you-go Approximate Join Top-k Processing for the

426 ### 74427 ? var4 skos : s u b j e c t428 dbpediares : Category :1105 .429 ? var4 f o a f : name ? var6 .430

431 ### 75432 ? var3 dbpedia : i n f l u e n c e d433 dbpediares :Ram%C3%B3n Emeterio Betances .434 ? var3 f o a f : page ? var4 .435 ? var3 r d f s : l a b e l ? var6 .436

437 ### 76438 ? var3 dbpedia : i n f l u e n c e d dbped iares : Rob Corddry .439 ? var3 f o a f : page ? var4 .440 ? var3 r d f s : l a b e l ? var6 .441

442 ### 77443 ? var3 dbpedia : i n f l u e n c e d444 dbpediares : Parakrama Nir i e l l a .445 ? var3 f o a f : page ? var4 .446 ? var3 r d f s : l a b e l ? var6 .447

448 ### 78449 ? var3 dbpedia : i n f l u e n c e d450 dbpediares : Alexander VI .451 ? var3 f o a f : page ? var4 .452 ? var3 r d f s : l a b e l ? var6 .453

454 ### 79455 ? var3 dbpedia : i n f l u e n c e d dbped iares : Iqba l .456 ? var3 f o a f : page ? var4 .457 ? var3 r d f s : l a b e l ? var6 .458

459 ### 80460 ? var3 dbpedia : i n f l u e n c e d461 dbpediares : Al−Maqriz i .462 ? var3 f o a f : page ? var4 .463 ? var3 r d f s : l a b e l ? var6 .464

465 ### 81466 ? var3 dbpedia : i n f l u e n c e d467 dbpediares : C l a r enc e I rv ing Lew i s .468 ? var3 f o a f : page ? var4 .469 ? var3 r d f s : l a b e l ? var6 .470

471 ### 82472 ? var3 dbpedia : i n f l u e n c e d473 dbpediares : Ibn Kha lee l .474 ? var3 f o a f : page ? var4 .475 ? var3 r d f s : l a b e l ? var6 .

Page 30: Pay-as-you-go Approximate Join Top-k Processing for the

476

477 ### 83478 ? var3 dbpedia : i n f l u e n c e d479 dbpediares : Dav id Fr i ed l%C3%A4nder .480 ? var3 f o a f : page ? var4 .481 ? var3 r d f s : l a b e l ? var6 .482

483 ### 84484 ? var3 dbpedia : i n f l u e n c e d485 dbpediares : John Warnock .486 ? var3 f o a f : page ? var4 .487 ? var3 r d f s : l a b e l ? var6 .488

489 ### 85490 ? var3 dbpedia : i n f l u e n c e d491 dbpediares : Vladimir Lenin .492 ? var3 f o a f : page ? var4 .493 ? var3 r d f s : l a b e l ? var6 .494

495 ### 86496 ? var3 dbpedia : i n f l u e n c e d497 dbpediares : Nial l McLaren .498 ? var3 f o a f : page ? var4 .499 ? var3 r d f s : l a b e l ? var6 .500

501 ### 87502 ? var3 dbpedia : i n f l u e n c e d503 dbpediares : David J . Farber .504 ? var3 f o a f : page ? var4 .505 ? var3 r d f s : l a b e l ? var6 .506

507 ### 88508 ? var3 dbpedia : i n f l u e n c e d509 dbpediares : Fran Lebowitz .510 ? var3 f o a f : page ? var4 .511 ? var3 r d f s : l a b e l ? var6 .512

513 ### 89514 ? var3 dbpedia : i n f l u e n c e d515 dbpediares : Kathleen Raine .516 ? var3 f o a f : page ? var4 .517 ? var3 r d f s : l a b e l ? var6 .518

519 ### 90520 ? var0 r d f s : l a b e l ”The Subt le Knife ”@en .521 ? var0 rd f : type ? var1 .522

523 ### 91524 ? var0 r d f s : l a b e l ” P a t r i o t e r ”@sv .525 ? var0 rd f : type ? var1 .

Page 31: Pay-as-you-go Approximate Join Top-k Processing for the

526

527 ### 92528 ? var0 r d f s : l a b e l ” Scar Tissue ( l i b r o ) ”@es .529 ? var0 rd f : type ? var1 .530

531 ### 93532 ? var0 r d f s : l a b e l533 ” Jason Bournes ultimatum”@nn .534 ? var0 rd f : type ? var1 .535

536 ### 94537 ? var0 r d f s : l a b e l ”Jane Eyre”@f r .538 ? var0 rd f : type ? var1 .539

540 ### 95541 ? var0 r d f s : l a b e l542 ”Gone with the Wind ( l i v r o ) ”@pt .543 ? var0 rd f : type ? var1 .544

545 ### 96546 ? var0 r d f s : l a b e l ”Alkumets\u00E4”@f i .547 ? var0 rd f : type ? var1 .548

549 ### 97550 ? var0 r d f s : l a b e l551 ” F l i gh t from the Dark”@en .552 ? var0 rd f : type ? var1 .553

554 ### 98555 ? var0 r d f s : l a b e l ”Mongol Empire”@en .556 ? var0 rd f : type ? var1 .557

558 ### 99559 ? var0 r d f s : l a b e l ” Kultahattu ”@f i .560 ? var0 rd f : type ? var1 .561

562 ### 100563 ? var0 r d f s : l a b e l564 ” Aseiden k\u00E4ytt\u00F6”@f i .565 ? var0 rd f : type ? var1 .566

567 ### 101568 ? var0 r d f s : l a b e l ”Marrow ( nove l ) ”@en .569 ? var0 rd f : type ? var1 .570

571 ### 102572 ? var0 r d f s : l a b e l ”The Acid House”@en .573 ? var0 rd f : type ? var1 .574

575 ### 103

Page 32: Pay-as-you-go Approximate Join Top-k Processing for the

576 ? var0 r d f s : l a b e l ”\u6B63\u4E49\u8BBA”@zh .577 ? var0 rd f : type ? var1 .578

579 ### 104580 ? var0 r d f s : l a b e l ”Dawn o f the Dragons”@en .581 ? var0 rd f : type ? var1 .582

583 ### 105584 ? var2 rd f : type dbpedia : Person .585 ? var2 r d f s : l a b e l586 ”Ab\u016B l−Hasan Ban\u012Bsadr”@de .587 ? var2 f o a f : page ? var4 .588

589 ### 106590 ? var2 rd f : type dbpedia : Person .591 ? var2 r d f s : l a b e l592 ”Abdul Rahman o f Neger i Sembilan”@en .593 ? var2 f o a f : page ? var4 .594

595 ### 107596 ? var2 rd f : type dbpedia : Person .597 ? var2 r d f s : l a b e l ”A.W. Farwick”@en .598 ? var2 f o a f : page ? var4 .599

600 ### 108601 ? var2 rd f : type dbpedia : Person .602 ? var2 r d f s : l a b e l ”Abdullah G\u00FCl”@sv .603 ? var2 f o a f : page ? var4 .604

605 ### 109606 ? var2 rd f : type dbpedia : Person .607 ? var2 r d f s : l a b e l ”Aaron Pe\u00F1a”@en .608 ? var2 f o a f : page ? var4 .609

610 ### 110611 ? var2 rd f : type dbpedia : Person .612 ? var2 r d f s : l a b e l ”Abby Lockhart ”@f r .613 ? var2 f o a f : page ? var4 .614

615 ### 111616 ? var2 rd f : type dbpedia : Person .617 ? var2 r d f s : l a b e l ”Abd al−L a t i f ”@en .618 ? var2 f o a f : page ? var4 .619

620 ### 112621 ? var2 rd f : type dbpedia : Person .622 ? var2 r d f s : l a b e l ”Abdel Halim Khaddam”@f r .623 ? var2 f o a f : page ? var4 .624

625 ### 113

Page 33: Pay-as-you-go Approximate Join Top-k Processing for the

626 ? var2 rd f : type dbpedia : Person .627 ? var2 r d f s : l a b e l ”Abdur Rahman Khan”@f r .628 ? var2 f o a f : page ? var4 .629

630 ### 114631 ? var2 rd f : type dbpedia : Person .632 ? var2 r d f s : l a b e l633 ”A\u0142\u0142a Kudriawcewa”@pl .634 ? var2 f o a f : page ? var4 .635

636 ### 115637 ? var2 rd f : type dbpedia : Person .638 ? var2 r d f s : l a b e l ”Aaron Raper”@en .639 ? var2 f o a f : page ? var4 .640

641 ### 116642 ? var2 rd f : type dbpedia : Person .643 ? var2 r d f s : l a b e l ”A. L . Wil l iams ”@en .644 ? var2 f o a f : page ? var4 .645

646 ### 117647 ? var2 rd f : type dbpedia : Person .648 ? var2 r d f s : l a b e l ”Abdul Kadir Khan”@sv .649 ? var2 f o a f : page ? var4 .650

651 ### 118652 ? var2 rd f : type dbpedia : Person .653 ? var2 r d f s : l a b e l ”A. J . P i e r zynsk i ”@en .654 ? var2 f o a f : page ? var4 .655

656 ### 119657 ? var2 rd f : type dbpedia : Person .658 ? var2 r d f s : l a b e l ”Abdullah Ahmad Badawi”@pl .659 ? var2 f o a f : page ? var4 .660

661 ### 120662 ? var2 rd f : type dbpedia : Person .663 ? var2 r d f s : l a b e l664 ” Abdelbaset Al i Mohmed Al Megrahi”@en .665 ? var2 f o a f : page ? var4 .