
DatalogRA: Datalog with Recursive Aggregation in the Spark RDD Model

Marek Rogala 1 Jan Hidders 2 Jacek Sroka 1

1 Institute of Informatics, University of Warsaw

2 Vrije Universiteit Brussel

24 June, 2016

Jan Hidders (VUB) GRADES 2016 24 June, 2016 1 / 28


Outline

1 Introduction

2 Plain Datalog and its Evaluation

3 DatalogRA: Syntax and Semantics

4 Implementation in Spark

5 Experiments and Evaluation

6 Conclusions and Future Work


Introduction

Motivation

Need for high-level declarative languages for graph processing.

Datalog seems an interesting starting point:
  Well-understood semantics
  Highly parallelizable [Ganguly et al. 1990] [Zhang et al. 1995]
  Large body of research on optimization [Tekle et al. 2010]
  Limited recursion matches graph navigation

It becomes more interesting when extended with basic arithmetic and stratified aggregation [Mumick et al. 1990] [Shkapsky et al. 2013]:
  Counting triangles

And even better with recursive aggregation [Lam et al. 2013] (Socialite):
  Shortest Path, PageRank


Contribution of Paper

Implementation in Spark:
  Leverages optimizations in Spark (but not yet Spark SQL)
  Embedding in a mature framework
  A DatalogRA program can be part of a bigger Spark workflow

Semantics:
  Explicit and more general semantics than Socialite
  Some investigation of the well-definedness of the result


Plain Datalog and its Evaluation

Syntax of Plain Datalog

A database is a finite set of facts of the form r(v1, . . . , vn), where r is a relation name and (v1, . . . , vn) is a vector of domain values.
  E.g., {a(1, 2), a(2, 3), b(3, 1)}
  We will assume all domains are finite.

A basic Datalog program consists of a set of rules, where a rule is an expression of the form:

  r(x) :- s1(y1), . . . , sn(yn).

where n ≥ 1, r, s1, . . . , sn are relation names and x, y1, . . . , yn are tuples of variables and constants (i.e., domain values).
  Head: r(x)
  Body: s1(y1), . . . , sn(yn), which is a set of subgoals

The operational semantics is given in terms of the minimal (first) fixed point of a function that applies all rules to infer facts.


Semi-naive Evaluation

Basic idea: compute inferred facts based on the atoms newly added in the previous iteration.

For example:
  Consider the rule r(x, y) :- s(x, y, z), r(z, 2), r(y, z).
  Assume r′ contains the tuples added in the previous step.
  The tuples added by this rule in the next step are the union of
    {(x, y) | s(x, y, z) ∧ r′(z, 2) ∧ r(y, z)} and
    {(x, y) | s(x, y, z) ∧ r(z, 2) ∧ r′(y, z)}
  After this we compute the next r′ by subtracting the existing tuples.

This prevents a lot of redundant computation, but the same tuple may still be derived more than once.
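The example rule can be evaluated semi-naively along the following lines. This is only a minimal sketch in plain Python (not the Spark implementation from the talk); the function name `semi_naive` and the representation of relations as Python sets of tuples are assumptions made for illustration.

```python
def semi_naive(s, r0):
    """Semi-naive evaluation of  r(x, y) :- s(x, y, z), r(z, 2), r(y, z).

    s  : set of (x, y, z) triples (hypothetical input relation)
    r0 : set of (a, b) pairs initially present in r
    """
    r = set(r0)      # all r-tuples known so far
    delta = set(r0)  # r': tuples added in the previous step
    while delta:
        derived = set()
        for (x, y, z) in s:
            # first variant: r'(z, 2) and r(y, z)
            if (z, 2) in delta and (y, z) in r:
                derived.add((x, y))
            # second variant: r(z, 2) and r'(y, z)
            if (z, 2) in r and (y, z) in delta:
                derived.add((x, y))
        # next r': subtract the tuples that already exist
        delta = derived - r
        r |= delta
    return r
```

Each iteration only considers body valuations that use at least one tuple from the previous delta, which is what avoids most of the redundant work of naive evaluation.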


DatalogRA: Syntax and Semantics

Basic Idea of DatalogRA

Based on ideas in Socialite [Lam et al. 2013]

Allows recursive aggregation, under certain conditions
  i.e., optionally an aggregation function can be specified for the last column of a relation

Example: (compute length of shortest path from node 1)

  Edge(int src, int sink, int len)
  Path(int target, int dist aggregate Min)

  Path(t, d) :- t = 1, d = 0.
  Path(t, d) :- Path(s, d1), Edge(s, t, d2), d = d1 + d2.

Can be generalized to allow aggregation on multiple columns.
We also allow basic arithmetic predicates and stratified negation.
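To make the intended result of the example concrete, the program can be simulated by a naive fixed-point loop. This is a hedged sketch in plain Python, not the DatalogRA engine: the function name `shortest_paths`, the dict representation of Path, and the assumption of non-negative edge lengths are all choices made for illustration.

```python
def shortest_paths(edges, source=1):
    """Naive fixed-point iteration for the Path program with Min aggregation.

    edges : iterable of (src, sink, len) facts, len assumed non-negative
    Returns a dict mapping each reachable target to its aggregated dist.
    """
    edges = list(edges)
    path = {}  # current Path relation: target -> dist
    while True:
        # Collect existing facts plus everything the two rules derive.
        candidates = list(path.items())
        candidates.append((source, 0))       # Path(t, d) :- t = 1, d = 0.
        for (s, t, d2) in edges:             # Path(t, d) :- Path(s, d1), Edge(s, t, d2), d = d1 + d2.
            if s in path:
                candidates.append((t, path[s] + d2))
        # Aggregate the last column with Min, one fact per target.
        new_path = {}
        for t, d in candidates:
            new_path[t] = min(d, new_path.get(t, d))
        if new_path == path:                 # fixed point reached
            return path
        path = new_path
```

For the edges (1, 2, 4), (1, 3, 1), (3, 2, 2) and (2, 4, 5), the loop first records distance 4 for node 2 and later replaces it with 3 once the path through node 3 is found.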


Semantics of DatalogRA: Operational semantics

The semantics of a DatalogRA program P (without negation) is the first fixed point of the immediate consequence operator ΓP ◦ TP:
  TP computes the bag of direct consequences of P
  ΓP is a function that aggregates as specified in P


Semantics of DatalogRA: The bag of direct consequences

TP computes the bag of direct consequences of P:
  The result bag of a rule r for database D, written r(D), is a bag of the facts inferred by r from D such that
    the multiplicity of each fact r(c) in this bag is the number of valuations of the variables in the body that cause its inference

The bag of direct consequences of P for D is

  $T_P(D) = D \uplus \biguplus_{r \in P} r(D)$

where $\uplus$ is the additive bag union.


Semantics of DatalogRA: The global aggregation function

ΓP is a function that aggregates as specified in P:
  If relation R is aggregated in P with G:
    for each vector x such that there is a fact of the form R(x, y) in the input: replace these facts with R(x, G(Y)), where Y is the bag of domain values in which the multiplicity of an element y is the multiplicity of R(x, y) in the input.
  If relation R is not aggregated in P:
    remove duplicate facts for this relation

Note: in both cases the result of ΓP is without duplicates.
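The two operators can be sketched with `collections.Counter` as the bag type. This is an illustrative reading of the definitions above, not the authors' code: the names `direct_consequences` and `aggregate`, the encoding of a fact as a `(relation_name, args_tuple)` pair, and the encoding of a rule as a function from a database to a `Counter` are assumptions.

```python
from collections import Counter, defaultdict


def direct_consequences(db, rules):
    """T_P: additive bag union of db with the result bag of every rule.

    db    : set of facts, each a (relation_name, args_tuple) pair
    rules : functions mapping db to a Counter whose counts are the number
            of body valuations that derive each fact
    """
    bag = Counter(db)      # every fact of D with multiplicity 1
    for rule in rules:
        bag += rule(db)    # Counter addition adds multiplicities
    return bag


def aggregate(bag, aggregators):
    """Gamma_P: aggregate the last column of aggregated relations,
    and remove duplicates for all other relations.

    aggregators : dict relation_name -> function over a list of values
    """
    result = set()
    groups = defaultdict(list)
    for (rel, args), mult in bag.items():
        if rel in aggregators:
            # Y: each value y repeated as often as R(x, y) occurs in the bag
            groups[(rel, args[:-1])].extend([args[-1]] * mult)
        else:
            result.add((rel, args))
    for (rel, prefix), values in groups.items():
        result.add((rel, prefix + (aggregators[rel](values),)))
    return result
```

With a non-idempotent aggregator such as `sum` the multiplicities matter, which is why T_P has to produce a bag rather than a set.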


Semantics of DatalogRA: Well-definedness

So the semantics of P(D) is the first fixed point of ΓP ◦ TP on D.

Questions:
  When is this defined?
  Is the result a minimal fixed point in some sense?

Sufficient condition: ΓP ◦ TP is monotonic for some partial ordering over databases.
Subset ordering is too strict when aggregation is used.


Semantics of DatalogRA: Aggregation-dependent partial order

Assume G is based on a binary operator, say ⊕G, that is commutative and associative:
  G applied to a non-empty bag {{a1, . . . , an}} is a1 ⊕G . . . ⊕G an

This sometimes induces a partial order: a ⊑G b iff a = b or there is a c such that a ⊕G c = b.
  E.g., for the Max operator that ordering is ≤
  for Min it is ≥
  for Sum over nonnegative integers it is also ≤
  for Sum over all integers it is not a partial order

We consider only those G where ⊑G is a partial order.
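The four examples can be checked by brute force on a small finite sample of the domain. The sketch below is only an illustration under that assumption (the real domains may be infinite); the helper names `induced_leq` and `is_partial_order` are hypothetical.

```python
from itertools import product


def induced_leq(op, domain):
    """Pairs (a, b) with a == b, or op(a, c) == b for some c in domain."""
    return {(a, b) for a, b in product(domain, repeat=2)
            if a == b or any(op(a, c) == b for c in domain)}


def is_partial_order(leq, domain):
    """Check reflexivity, antisymmetry and transitivity of a finite relation."""
    reflexive = all((a, a) in leq for a in domain)
    antisymmetric = all(a == b for (a, b) in leq if (b, a) in leq)
    transitive = all((a, d) in leq
                     for (a, b) in leq for (c, d) in leq if b == c)
    return reflexive and antisymmetric and transitive
```

On a sample that contains negative integers, Sum relates every pair of nearby values in both directions, so antisymmetry fails; on a nonnegative sample the relation coincides with ≤.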


Semantics of DatalogRA: Aggregation-based database ordering

Assume ⊑G is a partial order for all G in a program P.

We let ⊑P define a partial order over facts:

1 if relation R has aggregation operator G in P then R(x, y) ⊑P R(x′, y′) iff x = x′ and y ⊑G y′ and

2 if R has no aggregation operator in P then R(x) ⊑P R(x′) iff x = x′.

We let ⊑P also define a partial order over databases:

1 D1 ⊑P D2 holds iff for all R(x) ∈ D1 there is a fact R(x′) ∈ D2 such that R(x) ⊑P R(x′)

If P is monotonic w.r.t. ⊑P, i.e., ΓP ◦ TP is monotonic under ⊑P, then P always computes a minimal fixed point.
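The ordering over facts and databases translates almost literally into code. This is a small sketch under assumed encodings: a fact is a `(relation_name, args_tuple)` pair, and `value_leq` is a hypothetical dict that maps each aggregated relation to a Python predicate implementing its ⊑G.

```python
def fact_leq(f1, f2, value_leq):
    """f1 ⊑P f2 for facts encoded as (relation_name, args_tuple)."""
    rel1, args1 = f1
    rel2, args2 = f2
    if rel1 != rel2:
        return False
    if rel1 in value_leq:
        # aggregated relation: same key columns, last column compared by ⊑G
        return (args1[:-1] == args2[:-1]
                and value_leq[rel1](args1[-1], args2[-1]))
    # non-aggregated relation: facts must be identical
    return args1 == args2


def db_leq(d1, d2, value_leq):
    """D1 ⊑P D2: every fact of D1 is below some fact of D2."""
    return all(any(fact_leq(f1, f2, value_leq) for f2 in d2) for f1 in d1)
```

For a Path relation aggregated with Min (so ⊑G is ≥), the database {Path(2, 4)} is below {Path(2, 3), Path(3, 1)} even though it is not a subset of it, which illustrates why the subset ordering is too strict.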


Semantics of DatalogRA: A sufficient condition for monotonicity

Also assume all G are idempotent, i.e., a ⊕G a = a
  e.g., for Min and Max

Then multiplicity in the bags is ignored by ΓP, so ΓP ◦ TP = ΓP ◦ T′P, where T′P is the classical (set-based) Datalog inference function.
Since ΓP is always monotonic under ⊑P, it is sufficient to require that T′P is monotonic under ⊑P.
The complexity of deciding this property is still unclear.
Under such monotonicity we can essentially do semi-naive evaluation:

1 “New facts” are those not subsumed (under ⊑P) by an existing fact
2 Infer additional results in TP for these facts as usual
3 Add these results and apply ΓP
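For the shortest-path program with Min, the three steps can be sketched as follows. This is a plain-Python illustration, not the Spark implementation; the function name `shortest_paths_semi_naive`, the dict representation of Path, and the assumption of non-negative edge lengths are introduced here for the example.

```python
def shortest_paths_semi_naive(edges, source=1):
    """Semi-naive evaluation of the Path program under the Min ordering.

    edges : iterable of (src, sink, len) facts, len assumed non-negative
    """
    out_edges = {}
    for s, t, length in edges:
        out_edges.setdefault(s, []).append((t, length))

    path = {}                     # aggregated Path relation: target -> dist
    candidates = [(source, 0)]    # derived by the base rule
    while candidates:
        # Step 1: keep only candidates not subsumed by an existing fact,
        # i.e. strictly smaller than the known distance for that target.
        delta = {}
        for t, d in candidates:
            if d < path.get(t, float("inf")) and d < delta.get(t, float("inf")):
                delta[t] = d
        # Step 3: add the new facts and apply Min per target; every delta
        # entry is strictly better, so overwriting equals aggregating.
        path.update(delta)
        # Step 2: infer further consequences from the new facts only.
        candidates = [(t, d + length)
                      for s, d in delta.items()
                      for t, length in out_edges.get(s, [])]
    return path
```

The loop stops when an iteration produces no fact that improves on an existing one, so cycles in the graph do not cause non-termination as long as the lengths are non-negative.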


Implementation in Spark

Integrating DatalogRA in Spark

The main component is the Database class with a datalog method. It contains a set of named relation objects.

  val edgesRdd = ... // Read from HDFS or computed using Spark

  // Wrap the RDD as a ternary relation named "Edge"
  val database = Database(Relation.ternary("Edge", edgesRdd))
  // Run the DatalogRA program; the result is again a database
  val resultDatabase = database.datalog("""
    declare Path(int v, int dist aggregate Min).
    Path(x, d) :- s == 1, Edge(s, x, d).
    Path(x, d) :- Path(y, da), Edge(y, x, db), d = da + db.
  """)
  // Look up the computed relation by name
  val resultPathsRdd = resultDatabase("Path")

  ... // Save or use resultPathsRdd as any RDD.


Optimizations

The rules are divided into a sequence of strata
  evaluated one by one
  non-recursive strata iterate only once

Semi-naive evaluation if possible, each iteration determining a delta database of “new facts”

Caching intermediate results: if a relation referred to is from a lower stratum, it is persisted as an RDD after it is generated


Experiments and Evaluation

Experimental Setup

Three classic graph problems:
  Connected Components
  Shortest Paths
  Triangle Counting

Compared to plain Spark using core methods and the GraphX extension.

Executed using Amazon EC2 clusters consisting of 2, 4, 8 and 16 worker nodes and one master node.
  Each node was a 2-core 64-bit machine with 7.5 GB of RAM.

Dataset used: social graph of Twitter circles on SNAP, which has 2.4M edges.


Experimental Results: Efficiency

[Figure: Efficiency, Connected Components]
[Figure: Efficiency, Shortest Paths]
[Figure: Efficiency, Triangles]

Experimental Results: Compactness

Number of lines in programs, excluding data loading and comments.

                        plain Spark   SparkDatalog
  Connected Components  11            6
  Shortest Paths        12            4
  Triangles             7             5


Conclusions and Future Work

Studied implementing Datalog with recursive aggregation in Spark.

Ongoing work:
  Leveraging Spark SQL
  Support a wider class of recursive aggregation
  Magic sets
  More optimized distributed execution of conjunctive queries
  Optimizing more general classes of aggregation operations
  Investigate decidability of aggregation monotonicity
