# Large Scale Centrality Measures in Apache Flink and Apache Giraph

TRANSCRIPT

Master Thesis

Thesis Advisor: Sebastian Schelter, Research Associate

Thesis Supervisor: Prof. Dr. Volker Markl

Large Scale Centrality Measures in Apache Flink and Apache Giraph

Submitted by

Janani Chakkaradhari

5 September 2014

Centrality measures identify the most central nodes in a network

Targeted advertising minimizes the resources and effort required for marketing

Centrality measures were used to identify the head of the terrorist network behind the 9/11 attacks (Krebs, 2002)

Different notions of the “most central nodes” (Freeman, 1977)

• Degree
• Closeness
• Betweenness

Real-world networks are very large and sparse (Barabási and Oltvai, 2004)

Big Data platforms to analyze large networks


Related work on parallel computation of centrality measures

In the paper “Centralities in Large Networks: Algorithms and Observations”, Kang, U., et al. propose two novel algorithms for computing closeness and betweenness and implement them in Hadoop:

• Effective Closeness algorithm
  - an approximate technique for closeness
• LineRank algorithm
  - random walk betweenness

Comparison of parallel computing platforms by implementing and evaluating centrality measures

• How well does the programming model of these data processing platforms fit the Effective Closeness and LineRank algorithms?
• Evaluating the performance of each of these two platforms

[Figure: Closeness & Betweenness of large networks; Parallel data processing platforms (Apache Flink)]

Programming model of Apache Giraph & Apache Flink for iterative graph processing

• Apache Giraph: a vertex-centric model for iterative graph processing
• Apache Flink offers a special iteration operator

[Figure: vertices V1, V2, V3 across superstep i, superstep i+1, superstep i+2] (Stephan Ewen, 2014; Sebastian Schelter, 2012)

[Figure: Flink iteration operators. Bulk iteration: initial dataset, iterative function, result. Delta iteration: initial solution set, initial workset, result.] (Stephan Ewen, 2014; Sebastian Schelter, 2012)

Comparison of parallel computing platforms by implementing and evaluating centrality measures

• How well does the programming model of these data processing platforms fit the Effective Closeness and LineRank algorithms?
  1. Computation logic
  2. Implementation
• Evaluating the performance of each of these two platforms

Iterative computation of the Effective Closeness algorithm

• Shortest paths between nodes: at each step, the algorithm counts the nodes newly reached by shortest paths of that length
• Sum of the shortest paths: (2 x 1) + (2 x 2) + (1 x 3) = 9

Step 1: 2 nodes, Step 2: 2 nodes, Step 3: 1 node
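The arithmetic on this slide can be reproduced in a few lines. The sketch below is illustrative only: the function name `sum_of_shortest_paths` and the list layout are assumptions, not taken from the thesis. It multiplies the number of nodes first reached at each step by the step number and adds the products.

```python
def sum_of_shortest_paths(newly_reached_per_step):
    """Sum of shortest-path lengths from one node.

    newly_reached_per_step[h - 1] is the number of nodes first reached
    at step (hop) h, so each of them contributes a path of length h.
    """
    return sum(hop * count
               for hop, count in enumerate(newly_reached_per_step, start=1))


# Counts from the slide: 2 nodes at step 1, 2 at step 2, 1 at step 3.
total = sum_of_shortest_paths([2, 2, 1])
print(total)  # 9
```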

Computation of Effective Closeness in Apache Giraph

[Figure: Effective Closeness computation on the example graph in Apache Giraph]

Illustration of Effective Closeness in Apache Flink using Delta iteration (1/4)

• Vertices: initial workset and solution set
• Edges: pairs of source and destination ids

Vertices
vid   bit_string
1     1 0 0 0 0 0
2     0 1 0 0 0 0
3     0 0 1 0 0 0
4     0 0 0 1 0 0
5     0 0 0 0 1 0
6     0 0 0 0 0 1

Edges (src, des): (1, 2), (1, 5), (2, 3), (5, 4), (4, 6), (2, 1), (5, 1), (3, 2), (4, 5), (6, 4), (2, 5), (3, 4), (4, 3), (5, 2)

Illustration of Effective Closeness in Apache Flink using Delta iteration (2/4)

Vertices (vid, bit_string) ⋈ vid=src Edges (src, des), emit (des, bit_string)

Each vertex's bit_string is emitted once for each of its outgoing edges, which gives 14 (des, bit_string) tuples for the example graph.

Illustration of Effective Closeness in Apache Flink using Delta iteration (3/4)

𝛾: the emitted (des, bit_string) tuples are grouped by des

des   bit_strings
1     0 1 0 0 0 0,  0 0 0 0 1 0
2     0 0 1 0 0 0,  1 0 0 0 0 0,  0 0 0 0 1 0
5     0 0 0 1 0 0,  1 0 0 0 0 0,  0 1 0 0 0 0
3     0 1 0 0 0 0,  0 0 0 1 0 0
4     0 0 0 0 0 1,  0 0 0 0 1 0,  0 0 1 0 0 0
6     0 0 0 1 0 0

Bit-OR: updated result in current iteration

des   bit_string
1     1 1 0 0 1 0
2     1 1 1 0 1 0
3     0 1 1 1 0 0
4     0 0 1 1 1 1
5     1 1 0 1 1 0
6     0 0 0 1 0 1

Illustration of Effective Closeness in Apache Flink using Delta iteration (4/4)

Solution set (previous iteration's result) ⋈ vid=des updated result in current iteration

vid   previous bit_string   updated bit_string   prev count   current count
1     1 0 0 0 0 0           1 1 0 0 1 0          0            2
2     0 1 0 0 0 0           1 1 1 0 1 0          0            3
3     0 0 1 0 0 0           0 1 1 1 0 0          0            2
4     0 0 0 1 0 0           0 0 1 1 1 1          0            3
5     0 0 0 0 1 0           1 1 0 1 1 0          0            3
6     0 0 0 0 0 1           0 0 0 1 0 1          0            1

Termination condition check:
if (prev count != current count), emit the updated nodes => next workset
else keep calm (emit nothing)

Next workset (emitted): the six updated rows above, with counts 2, 3, 2, 3, 3, 1

Illustration of Effective Closeness in Apache Flink using Delta iteration

[Figure: delta iteration dataflow. Step function: JOIN, REDUCE. Update function: JOIN.]
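The four illustration slides can be replayed with a small plain-Python simulation. This is a sketch under stated assumptions, not the thesis implementation and not Flink API code: the names `effective_closeness`, `initial_bits` and `max_iterations` are hypothetical, dictionaries stand in for the workset and solution set, and exact bit strings are used as on the slides, whereas the Effective Closeness algorithm of Kang et al. is described as an approximate technique.

```python
from collections import defaultdict

# Example graph from the slides; each undirected edge is stored in both directions.
EDGES = [(1, 2), (1, 5), (2, 3), (5, 4), (4, 6), (2, 1), (5, 1),
         (3, 2), (4, 5), (6, 4), (2, 5), (3, 4), (4, 3), (5, 2)]
NUM_VERTICES = 6


def initial_bits(vid, n=NUM_VERTICES):
    """Bit string with only the vertex's own bit set (leftmost bit = vertex 1)."""
    return 1 << (n - vid)


def effective_closeness(edges, n=NUM_VERTICES, max_iterations=None):
    """Simulate the delta iteration; returns (solution_set, sum_of_shortest_paths)."""
    solution = {vid: initial_bits(vid, n) for vid in range(1, n + 1)}
    workset = dict(solution)  # initially every vertex counts as updated
    distance_sum = {vid: 0 for vid in solution}
    step = 0
    while workset and (max_iterations is None or step < max_iterations):
        step += 1
        # JOIN workset with edges on vid = src, then REDUCE per des with bitwise OR.
        received = defaultdict(int)
        for src, des in edges:
            if src in workset:
                received[des] |= workset[src]
        # JOIN with the solution set on vid = des; keep only vertices that changed.
        next_workset = {}
        for vid, bits in received.items():
            updated = solution[vid] | bits
            newly_reached = bin(updated).count("1") - bin(solution[vid]).count("1")
            if newly_reached:  # prev count != current count
                distance_sum[vid] += step * newly_reached
                solution[vid] = updated
                next_workset[vid] = updated
        workset = next_workset  # an empty workset ends the iteration
    return solution, distance_sum


first, _ = effective_closeness(EDGES, max_iterations=1)
print({vid: format(bits, "06b") for vid, bits in first.items()})
_, sums = effective_closeness(EDGES)
print(sums)
```

After one iteration the bit strings match the "updated result" table (for example `110010` for vertex 1), and the full run gives a sum of shortest paths of 9 for vertex 1. Because only changed vertices are put into the next workset, later iterations touch fewer and fewer tuples.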

Summary of Effective Closeness implementation

• Both implementations reduce the amount of data to be processed in successive iterations
• Hence both computing models for finding Effective Closeness exploit the sparse nature of real-world graphs

Idea behind the LineRank algorithm

• Betweenness is computed by finding the importance score of the incident edges of a node

[Figure: example graph G (nodes 1, 2, 3; edges a, b, c, d, e) and its line graph L(G) (nodes a, b, c, d, e); power iteration / PageRank on L(G) yields the eigenvector, i.e. the rank of the nodes in L(G)] (Kang et al., 2011)

• Problem: the line graph L(G) is larger than the original graph

$r = T^k r_0$
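The power iteration behind $r = T^k r_0$ can be sketched as repeated matrix-vector multiplication, so that $T^k$ is never formed explicitly. This is a minimal illustration: the function name `power_iteration` and the tiny 2x2 matrix are hypothetical and are not taken from the thesis.

```python
def power_iteration(transition, r0, k):
    """Return T^k r0 by applying the matrix k times to the vector."""
    r = list(r0)
    for _ in range(k):
        r = [sum(t * x for t, x in zip(row, r)) for row in transition]
    return r


# Hypothetical 2-state matrix that swaps the two entries on every application.
swap = [[0, 1],
        [1, 0]]
print(power_iteration(swap, [1, 0], 3))  # [0, 1]
```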

Challenges in implementing LineRank in Apache Giraph

• The power iteration requires a two-step matrix-vector multiplication using two sparse matrices (incoming and outgoing edges)
• The vertex state value in LineRank is an edge score, which conflicts with the vertex-centric computation model
• How to achieve the two-stage matrix-vector multiplication in Giraph?

$v \leftarrow L(G)\,v \;\leftrightarrow\; v_2 \leftarrow S(G)^T v_1,\quad v_3 \leftarrow T(G)\,v_2$
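The equivalence between one multiplication by $L(G)$ and the two smaller steps can be checked numerically. The sketch below assumes the incidence matrices $S(G)$ and $T(G)$ tabulated in the backup slides (rows are edges a to e, columns are nodes 1 to 3); the helper names and the test vector are hypothetical, and the normalisation and damping that LineRank applies are left out.

```python
# Rows: edges a, b, c, d, e. Columns: nodes 1, 2, 3.
S = [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 1], [0, 1, 0]]  # source incidence
T = [[0, 1, 0], [1, 0, 0], [0, 0, 1], [1, 0, 0], [0, 0, 1]]  # target incidence


def transpose(m):
    return [list(col) for col in zip(*m)]


def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def matmul(a, b):
    b_cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols] for row in a]


v1 = [1, 2, 3, 4, 5]            # arbitrary edge scores, one per edge
v2 = matvec(transpose(S), v1)   # step 1: node-sized vector, S(G)^T v1
v3 = matvec(T, v2)              # step 2: edge-sized vector, T(G) v2

line_graph = matmul(T, transpose(S))  # L(G) = T(G) S(G)^T, built only for comparison
print(v2, v3, matvec(line_graph, v1) == v3)
```

The two-step form only ever touches the two sparse incidence matrices, which is why the line graph itself does not need to be materialised.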

Proposed solution for implementing the LineRank algorithm in Apache Giraph (1/2)

• Illustration of “think like a vertex”
• Let us compute the step $v_2$ in the first iteration for our example graph

$v_2 \leftarrow S(G)^T v_1,\quad v_3 \leftarrow T(G)\,v_2$

[Figure: example graph with nodes 1, 2, 3 and edges a, b, c, d, e]

Proposed solution for implementing the LineRank algorithm in Apache Giraph (2/2)

Pseudo-code:

• The current state of the vertex is assigned the computation result $v_2$
• The messages that are distributed or exchanged in the iteration are considered to be the edge scores $v_3$

$v_2 \leftarrow S(G)^T v_1,\quad v_3 \leftarrow T(G)\,v_2$
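One way to read these two bullets is sketched below as a plain-Python simulation of a single superstep. It is not Giraph API code and not the thesis pseudo-code. The edge endpoints are assumed from the $S(G)$ and $T(G)$ tables in the backup slides, the names `initial_vertex_values` and `superstep` are hypothetical, and the routing of each message to the source vertex of its edge is an assumption made so that the result agrees with $v_2 \leftarrow S(G)^T v_1$ and $v_3 \leftarrow T(G)\,v_2$. Normalisation and damping are omitted.

```python
# Assumed edge endpoints (source, target), read off the incidence matrices.
EDGES = {"a": (1, 2), "b": (2, 1), "c": (1, 3), "d": (3, 1), "e": (2, 3)}


def initial_vertex_values(edge_scores, edges):
    """v2 = S(G)^T v1: each vertex sums the scores of its outgoing edges."""
    values = {}
    for name, (src, dst) in edges.items():
        values.setdefault(src, 0)
        values.setdefault(dst, 0)
        values[src] += edge_scores[name]
    return values


def superstep(vertex_values, edges):
    """One superstep: messages carry v3, the new vertex state is the next v2."""
    inbox = {vid: [] for vid in vertex_values}
    edge_scores = {}
    for name, (src, dst) in edges.items():
        score = vertex_values[dst]   # v3 = T(G) v2: value of the edge's target vertex
        edge_scores[name] = score
        inbox[src].append(score)     # assumption: delivered to the edge's source vertex
    new_values = {vid: sum(messages) for vid, messages in inbox.items()}
    return new_values, edge_scores


v1 = {name: 1 for name in EDGES}          # uniform initial edge scores
v2 = initial_vertex_values(v1, EDGES)
next_v2, v3 = superstep(v2, EDGES)
print(v2, v3, next_v2)
```

In this reading the edge scores never need to be stored as vertex state: they exist only as messages, and each vertex keeps a single aggregated value.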

Illustration of proposed solution to implement LineRank in Apache Giraph

[Figure: input graph with vertices 1, 2, 3; vertex values and edge scores at superstep 0 and superstep 1]

$v_2 \leftarrow S(G)^T v_1,\quad v_3 \leftarrow T(G)\,v_2$

Implementation of LineRank in Apache Flink


Summary of LineRank implementation

• Two-step matrix-vector multiplication is hard to implement in Apache Giraph
• Remodeling the LineRank computation in Apache Giraph requires in-depth knowledge at both the platform level and the algorithmic level
• Programming computationally intensive iterative algorithms in Apache Flink is simple and flexible

Comparison of parallel computing platforms by implementing and evaluating centrality measures

• How well does the programming model of these data processing platforms fit the Effective Closeness and LineRank algorithms?
  1. Computation logic
  2. Implementation
• Evaluating the performance of each of these two platforms

Evaluation – Dataset

Evaluating scalability of Effective Closeness in Apache Giraph & Apache Flink (Runtime vs Edges)

*Fixed number of parallel tasks

Evaluating scalability of LineRank in Apache Giraph & Apache Flink (Runtime vs Edges)

*Fixed graph data

[Figure panels: LineRank in Flink: Runtime vs Number of cores; LineRank in Giraph: Runtime vs Number of cores]

Evaluation – Comparing the performance of Apache Giraph and Apache Flink

[Figure panels: LineRank; Effective Closeness. No. of cores = 15]

• Apache Giraph incorporates hash-based aggregations
• Apache Flink uses a sort-based technique for aggregations
• Apache Giraph has an efficient mechanism for estimating memory requirements

Conclusion

• The implementation of Effective Closeness exploits the sparse nature of real-world graphs
• The programming model of Apache Giraph is not flexible for computations that involve multi-step matrix-vector multiplication, whereas Apache Flink is more flexible for these computations
• Efficient optimizations in Apache Giraph make it perform better than Apache Flink
• The implementations of these algorithms are intended as a contribution to the Apache Flink open source community

Future Work

• This work can be extended to evaluate the computation-intensive centrality algorithms on other parallel data processing systems such as Apache Spark and Distributed GraphLab

References

[1] Kang, U., et al. "Centralities in Large Networks: Algorithms and Observations." SDM. Vol. 2011. 2011.

[2] Schelter, Sebastian. "Introducing Apache Giraph for Large Scale Graph Processing." slideshare.net/sscdotopen/introducing-apache-giraph-for-large-scale-graph-processing, 2012.

[3] Krebs, Valdis E. "Mapping networks of terrorist cells." Connections 24.3 (2002): 43-52.

[4] Ewen, Stephan, et al. "Spinning fast iterative data flows." Proceedings of the VLDB Endowment 5.11 (2012): 1268-1279.

[5] Malewicz, Grzegorz, et al. "Pregel: a system for large-scale graph processing." Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data. ACM, 2010.

[6] Freeman, Linton C. "A set of measures of centrality based on betweenness." Sociometry (1977): 35-41.

[8] Barabási, Albert-László, and Zoltán N. Oltvai. "Network biology: understanding the cell's functional organization." Nature Reviews Genetics 5.2 (2004): 101-113.

[9] Ewen, Stephan. "Stratosphere, Next-Gen Data Analytics Platform." Hadoop Summit Europe, 2014.

Backup Slides


Summary of Effective Closeness implementation

• Both implementations reduce the amount of data to be processed in successive iterations
• Hence both computing models for finding Effective Closeness exploit the sparse nature of real-world graphs

[Figure labels: Highly connected; Less connected]

LineRank Dataflow in Apache Flink

Final step in the proposed solution to implement LineRank in Apache Giraph

• Aggregating the computed edge scores (incoming and outgoing edges)
• The computation of $v_2$ represents the aggregation of the incoming edge scores

[Figure: example graph G with nodes 1, 2, 3 and edges a, b, c, d, e]

Bet(1) = R(a)+R(b)+R(c)+R(d)
Bet(2) = R(a)+R(b)+R(e)
Bet(3) = R(c)+R(d)+R(e)
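The three sums above follow one rule: a node's score is the sum of the scores of all edges incident to it, incoming and outgoing. A minimal sketch of that aggregation is shown below. The edge endpoints are assumed from the $S(G)$ and $T(G)$ tables in these backup slides, the function name `node_betweenness` is hypothetical, and the edge scores R are made-up numbers chosen only so the sums are easy to check.

```python
from collections import defaultdict

# Assumed edge endpoints (source, target), read off the incidence matrices.
EDGES = {"a": (1, 2), "b": (2, 1), "c": (1, 3), "d": (3, 1), "e": (2, 3)}


def node_betweenness(edge_rank, edges):
    """Add each edge score R(edge) to both endpoints of the edge."""
    bet = defaultdict(float)
    for name, (src, dst) in edges.items():
        bet[src] += edge_rank[name]
        bet[dst] += edge_rank[name]
    return dict(bet)


# Hypothetical edge scores, not results from the thesis.
rank = {"a": 1.0, "b": 2.0, "c": 3.0, "d": 4.0, "e": 5.0}
bet = node_betweenness(rank, EDGES)
print(bet)  # Bet(1) = 10.0, Bet(2) = 8.0, Bet(3) = 12.0
```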

The LineRank algorithm computes the random-walk betweenness without constructing the line graph

• L(G) is decomposed into two sparse matrices
  – Source Incidence Matrix S(G) [Outgoing edges]
  – Target Incidence Matrix T(G) [Incoming edges]
  – $L(G) = T(G)\,S(G)^T$

[Figure: example graph with nodes 1, 2, 3 and edges a, b, c, d, e]

S(G)                 T(G)
     1  2  3              1  2  3
a    1  0  0         a    0  1  0
b    0  1  0         b    1  0  0
c    1  0  0         c    0  0  1
d    0  0  1         d    1  0  0
e    0  1  0         e    0  0  1

$T(G)\,S(G)^T$ =

     a  b  d  c  e
a    0  1  0  0  1
b    1  0  0  1  0
d    1  0  0  1  0
c    0  0  1  0  0
e    0  0  1  0  0

Power iteration in LineRank

Adapted from Kang et al. [1]