
PMV: Pre-partitioned Generalized Matrix-Vector Multiplication for Scalable Graph Mining

Chiwan Park (Seoul National University, [email protected])
Ha-Myung Park (KAIST, [email protected])
Minji Yoon (Seoul National University, [email protected])
U Kang (Seoul National University, [email protected])

ABSTRACT

How can we analyze enormous networks including the Web and social networks which have hundreds of billions of nodes and edges? Network analyses have been conducted by various graph mining methods including shortest path computation, PageRank, connected component computation, random walk with restart, etc. These graph mining methods can be expressed as generalized matrix-vector multiplication, which consists of a few operations inspired by typical matrix-vector multiplication. Recently, several graph processing systems based on matrix-vector multiplication or their own primitives have been proposed to deal with large graphs; however, they all have failed on Web-scale graphs due to insufficient memory space or the lack of consideration for I/O costs. In this paper, we propose PMV (Pre-partitioned generalized Matrix-Vector multiplication), a scalable distributed graph mining method based on generalized matrix-vector multiplication on distributed systems. PMV significantly decreases the communication cost, which is the main bottleneck of distributed systems, by partitioning the input graph in advance and judiciously applying execution strategies based on the density of the pre-partitioned sub-matrices. Experiments show that PMV succeeds in processing up to 16× larger graphs than existing distributed memory-based graph mining methods, and requires 9× less time than previous disk-based graph mining methods by reducing I/O costs significantly.

1 INTRODUCTION

How can we analyze enormous networks including the Web and social networks which have hundreds of billions of nodes and edges? Various graph mining algorithms including shortest path computation [8, 11], PageRank [2], connected component computation [9, 19], and random walk with restart [15] have been developed for network analyses, and many of them are expressed in generalized matrix-vector multiplication form [28]. As graph sizes increase exponentially, many efforts have been devoted to finding scalable graph processing methods which could perform large-scale matrix-vector multiplication efficiently on distributed systems.

Figure 1: The running time on subgraphs of ClueWeb12 (running time in seconds vs. number of edges, for PMV, PEGASUS, GraphLab, GraphX, and Giraph). o.o.m.: out of memory. o.o.t.: out of time (>5h). Our proposed method PMV is the only framework that succeeds in processing the full ClueWeb12 graph, with 16× higher scalability than the competitors.

Recently, several graph processing systems have been proposed to perform such computations on billion-scale graphs; they are divided into single-machine systems, distributed-memory systems, and MapReduce-based systems. However, they all have limited scalability. I/O efficient single-machine systems including GraphChi [30] cannot process a graph exceeding the external-memory space of a single machine. Similarly, distributed-memory systems like GraphLab [13] cannot process a graph that does not fit into the distributed memory. On the other hand, MapReduce-based systems [20, 26, 28, 39, 42], which use a distributed external memory like GFS [12] or HDFS [48], can handle much larger graphs than single-machine or distributed-memory systems do. However, the MapReduce-based systems succeed only in non-iterative graph mining tasks such as triangle counting [39, 40] and graph visualization [20, 26]. They have limited scalability for iterative tasks like PageRank because they need to read and shuffle the entire input graph in every iteration. In MapReduce [7], shuffling massive data is the main performance bottleneck as it requires heavy disk and network I/Os, which seriously limit the scalability and the fault tolerance. Thus, it is desirable to shrink the amount of shuffled data to process matrix-vector multiplication in distributed systems.

In this paper, we propose PMV (Pre-partitioned generalized Matrix-Vector multiplication), a new scalable graph mining algorithm performing large-scale generalized matrix-vector multiplication in distributed systems. PMV succeeds in processing billion-scale graphs which all other state-of-the-art distributed systems fail to process, by significantly reducing the shuffled data size and the costs of network and disk I/Os. PMV partitions the matrix of the input graph once, and reuses the partitioned matrices for all iterations. Moreover, PMV carefully assigns the partitioned matrix blocks to each worker to minimize the I/O cost.


Table 1: Table of symbols.

Symbol      Description
v           Vector, or set of vertices
θ           Degree threshold to divide sparse and dense sub-matrices
out(p)      Set of out-neighbors of a vertex p
b           Number of vector blocks or vertex partitions
ψ           Vertex partitioning function: v → {1, ..., b}
v_i         i-th element of v
v(i)        Set of vector elements (p, v_p) where ψ(p) = i
v(i)_s      Set of vector elements (p, v_p) ∈ v(i) where |out(p)| < θ
v(i)_d      Set of vector elements (p, v_p) ∈ v(i) where |out(p)| ≥ θ
|v|         Size of vector v, or number of vertices in a graph
M           Matrix, or set of edges
m_{i,j}     (i, j)-th element of M
M(i,j)      Set of matrix elements (p, q, m_{p,q}) where ψ(p) = i and ψ(q) = j
M(i,j)_s    Set of matrix elements (p, q, m_{p,q}) ∈ M(i,j) where |out(q)| < θ
M(i,j)_d    Set of matrix elements (p, q, m_{p,q}) ∈ M(i,j) where |out(q)| ≥ θ
|M|         Number of non-zero elements in M (= number of edges in a graph)
⊗           User-defined matrix-vector multiplication

PMV is a general framework that can be implemented in any distributed framework; we implement PMV on Hadoop and Spark, the two most widely used distributed computing frameworks. Our main contributions are the following:

• Algorithm. We propose PMV, a new scalable graph mining algorithm for performing generalized matrix-vector multiplication in distributed systems. PMV is designed to reduce the amount of shuffled data by partitioning the input matrix before iterative computation. Moreover, PMV splits the partitioned matrix blocks into two regions and applies different placement strategies on them to minimize the I/O cost.

• Cost analysis. We give a theoretical analysis of the I/O costs of the block placement strategies, which are the criteria of block placement selection. We prove the efficiency of PMV by giving theoretical analyses of the performance.

• Experiment. We empirically evaluate PMV using both large real-world and synthetic networks. We emphasize that only our system succeeds in processing the ClueWeb12 graph which has 6 billion vertices and 71 billion edges. Also, PMV shows up to 9× faster performance than previous MapReduce-based methods do (see Figure 1).

The rest of the paper is organized as follows. In Section 2, we review existing large-scale graph processing systems and introduce the GIM-V primitive for graph mining tasks. In Section 3, we describe the proposed algorithm PMV in detail along with its theoretical analysis. After showing experimental results in Section 4, we conclude in Section 5. The symbols frequently used in this paper are summarized in Table 1.

2 BACKGROUND AND RELATED WORKS

In this section, we first review representative graph processing systems and show their limitations on scalability (Section 2.1). Then, we outline MapReduce and Spark to highlight the importance of decreasing the amount of shuffled data in improving their performance (Section 2.2). After that, we review the GIM-V model for graph algorithms (Section 2.3).

2.1 Large-scale Graph Processing Systems

Large-scale graph processing systems can be classified into three groups: I/O efficient single-machine systems, distributed-memory systems, and MapReduce-based systems.

I/O efficient graph mining systems [16, 17, 30, 33] handle large graphs with external memory (i.e., disk) and optimize disk I/O costs to achieve higher performance. Some single-machine systems [35, 44, 52] use accelerators like GPUs to improve performance. However, all of these systems have limited scalability as they use only a single machine.

A typical approach to handle large-scale graphs is using multiple machines. Recently, several graph processing systems using distributed memory have been proposed: Pregel [36], GraphLab-PowerGraph [13, 34], Trinity [45], GraphX [14], GraphFrames [5], GPS [43], Presto [50], Pregel+ [51], and PowerLyra [4]. Even though these distributed-memory systems achieve faster performance and higher scalability than single machine systems do, they cannot process graphs that do not fit into the distributed memory. Pregelix [3] succeeds in processing graphs whose size exceeds the distributed-memory space by exploiting the out-of-core support of Hyracks [1], a general data processing engine. However, Pregelix uses only a single placement strategy which is similar to PMVvertical, one of our basic proposed methods.

MapReduce-based systems increase the processable graph size as MapReduce is a disk-based distributed system. PEGASUS [25, 28] is a MapReduce-based graph mining library based on a generalized matrix-vector multiplication. SGC [42] is another MapReduce-based system exploiting two join operations, namely NE join and EN join. The MapReduce-based systems, however, still have limited scalability because they need to shuffle the input matrix and vector repeatedly. UNICORN [32] avoids massive data shuffling by exploiting HBase, a distributed database system on Hadoop, but it reaches another performance bottleneck: intensive random accesses to HBase.

In the next section, we highlight the importance of reducing the amount of shuffled data in MapReduce and Spark.

2.2 MapReduce and Spark

MapReduce is a programming model to process large data by parallel and distributed computation. Thanks to its ease of use, fault tolerance, and high scalability, MapReduce has been applied to various graph mining tasks including computation of radius [27], triangles [39], visualization [26], etc. MapReduce transforms an input set of key-value pairs into another set of key-value pairs through three steps: map, shuffle, and reduce. Each input key-value pair is transformed into a set of key-value pairs (map-step), all the output pairs from the map-step are grouped by key (shuffle-step), and then each group of pairs is processed independently of the other groups, emitting an output set of key-value pairs (reduce-step). In each map worker, the output pairs from the map-step are stored in R independent regions on local disks according to the key, where R is the number of reduce workers (collect and spill). The pairs stored in the R regions are shuffled to the corresponding reduce workers periodically. Once a reduce worker has received all the pairs from the map workers, it conducts an external sort to group the key-value pairs by key (reduce). The performance of a MapReduce algorithm depends mainly on the amount of shuffled data, which is sorted by key and requires massive network and disk I/Os [18]. Requiring such heavy disk and network I/Os, a large amount of shuffled data significantly increases the running time and decreases the stability and scalability of the system. Thus, it is important to shrink the amount of shuffled data as much as possible to increase the performance.

Spark [54] is a general data processing engine with an abstraction of data collection called Resilient Distributed Datasets (RDDs) [53]. Each RDD consists of multiple partitions distributed across the machines of a cluster. Each partition has data objects and can be manipulated through operations like map and reduce. Unlike Hadoop, a widely used open-source implementation of MapReduce, RDD partitions are cached in memory or on disks of each worker in the cluster. Due to the in-memory caching, Spark shows a good performance for iterative computation [31, 46] which is necessary for graph mining and machine learning tasks. However, Spark still requires disk I/O [38] since its typical operations with shuffling, including join and groupBy operations, need to access disks for external sort. Therefore, the effort to reduce intermediate data to be shuffled is still valuable in Spark.

2.3 GIM-V for Graph Algorithms

Several optimized algorithms have been proposed for specific graph mining tasks such as shortest path computation [6, 24, 37], connected component computation [41], and random walk with restart [21–23, 47]. GIM-V (Generalized Iterative Matrix-Vector Multiplication) [28], a widely-used graph mining primitive, unifies such graph algorithms by representing them in the form of matrix-vector multiplication. For GIM-V representation, a user needs to describe only three operations for a graph algorithm: combine2, combineAll, and assign.

Figure 2: An example graph with 6 vertices and 9 edges and its matrix-vector representation ((a) a graph with 6 vertices; (b) matrix-vector representation with source vertices as columns and destination vertices as rows). In GIM-V, the vertex 4 receives 3 messages from incoming neighbors (1, 3, and 6) and sends 2 messages to outgoing neighbors (2 and 5). This process can be represented by matrix-vector multiplication.

Consider a matrix M of size n×n, and a vector v of size n, where m_{i,j} is the (i, j)-th element of M, and v_i is the i-th element of v for i, j ∈ {1, ..., n}. Then, the operations play the following roles:

• combine2(m_{i,j}, v_j): return the combined value x_{i,j} from a matrix element m_{i,j} and a vector element v_j.
• combineAll({x_{i,1}, ..., x_{i,n}}): reduce the input values to a single value r_i.
• assign(v_i, r_i): compute the new i-th vector element v'_i for the next iteration from the current i-th vector element v_i and the reduced value r_i, and check the convergence.

Let M ⊗ v be a user-defined generalized matrix-vector multiplication between the matrix M and the vector v. The new i-th vector element v'_i of the result vector v' = M ⊗ v is then:

    v'_i = assign(v_i, combineAll({x_{i,j} | x_{i,j} = combine2(m_{i,j}, v_j), j ∈ {1, ..., n}}))

GIM-V can be considered as a process of passing messages from each vertex to its outgoing neighbors on a graph where m_{i,j} corresponds to an edge from vertex j to vertex i. In Figure 2, vertex 4 receives messages {x_{4,1}, x_{4,3}, x_{4,6}} from incoming neighbors 1, 3, and 6, where x_{4,j} = combine2(m_{4,j}, v_j) for j ∈ {1, 3, 6}. From the received messages, GIM-V calculates a new value r_4 = combineAll({x_{4,1}, x_{4,3}, x_{4,6}}) for the vertex 4, and then updates v_4 with a new value v'_4 = assign(v_4, r_4). The updated value v'_4 is passed to the outgoing neighbors 2 and 5 in the next iteration.

With GIM-V, a user can easily describe various graph algorithms. Table 2 shows implementations of PageRank, random walk with restart, single source shortest path, and connected component computation on GIM-V, respectively. Note that only a few lines of code are required for the implementations.

Table 2: Graph Algorithms on GIM-V

PageRank
  combine2(m_{i,j}, v_j) = m_{i,j} × v_j
  combineAll({x_{i,1}, ..., x_{i,n}}) = Σ_j x_{i,j}
  assign(v_i, r_i) = 0.15 + 0.85 × r_i

Random Walk with Restart
  combine2(m_{i,j}, v_j) = m_{i,j} × v_j
  combineAll({x_{i,1}, ..., x_{i,n}}) = Σ_j x_{i,j}
  assign(v_i, r_i) = 0.15 + 0.85 × r_i if i is the source vertex; 0.85 × r_i otherwise

Single Source Shortest Path
  combine2(m_{i,j}, v_j) = m_{i,j} + v_j
  combineAll({x_{i,1}, ..., x_{i,n}}) = min({x_{i,1}, ..., x_{i,n}})
  assign(v_i, r_i) = min(v_i, r_i)

Connected Component
  combine2(m_{i,j}, v_j) = v_j
  combineAll({x_{i,1}, ..., x_{i,n}}) = min({x_{i,1}, ..., x_{i,n}})
  assign(v_i, r_i) = min(v_i, r_i)
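To make the primitive concrete, the following is a minimal single-machine Python sketch of one GIM-V iteration, instantiated with the PageRank operations from Table 2. The edge-list representation, the function names, and the tiny 3-vertex example are our illustrative assumptions, not the PEGASUS implementation; in this sketch, a vertex that receives no message simply keeps its current value.

# A minimal sketch of one GIM-V iteration M (x) v (illustrative, not the PEGASUS code).
# The matrix M is an edge list of (i, j, m_ij) triples; v is a list of vertex values.

def gim_v_iteration(edges, v, combine2, combine_all, assign):
    """Compute v'_i = assign(v_i, combineAll({combine2(m_ij, v_j)})) for every vertex i."""
    partial = {}                                   # i -> list of combined values x_{i,j}
    for i, j, m_ij in edges:
        partial.setdefault(i, []).append(combine2(m_ij, v[j]))
    v_new = list(v)                                # vertices with no messages keep v_i here
    for i, xs in partial.items():
        v_new[i] = assign(v[i], combine_all(xs))
    return v_new

# PageRank operations from Table 2 (damping factor 0.85).
combine2    = lambda m_ij, v_j: m_ij * v_j
combine_all = lambda xs: sum(xs)
assign      = lambda v_i, r_i: 0.15 + 0.85 * r_i

# Toy column-normalized adjacency matrix: m_ij = 1/out-degree(j) for an edge j -> i.
edges = [(1, 0, 1.0), (2, 1, 1.0), (0, 2, 0.5), (1, 2, 0.5)]
v = [1.0 / 3] * 3                                  # initial vector 1/|v|, as in Section 3.1.1
for _ in range(8):                                 # 8 iterations, as in the experiments
    v = gim_v_iteration(edges, v, combine2, combine_all, assign)
print(v)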

3 PROPOSED METHOD

In this section, we propose PMV, a scalable algorithm to efficiently perform GIM-V on distributed systems. PMV greatly increases the scalability by the following ideas:

(1) Pre-partitioning significantly shrinks the amount of shuffled data. PMV shuffles O(|M|) data only once at the beginning while the previous MapReduce algorithms shuffle O(|M| + |v|) data in each iteration (Section 3.1).

(2) Considering the density of the pre-partitioned matrices enables PMV to minimize the I/O cost by applying the two multiplication methods: vertical placement and horizontal placement (Sections 3.2-3.5).

Figure 3: The user-defined generalized matrix-vector multiplication M ⊗ v performed on 3 × 3 sub-matrices. M(i,j) is the (i, j)-th sub-matrix and v(i) is the i-th sub-vector. v(i,j) is the result vector of the sub-multiplication M(i,j) ⊗ v(j). The i-th sub-vector v'(i) of the result vector v' is calculated by combining v(i,j) for j ∈ {1, ..., b} with the combineAll (+) operation.

We first describe the pre-partitioning method in Section 3.1. Once the graph is partitioned, the multiplication method can be classified as PMVhorizontal or PMVvertical depending on which partitions are processed together on the same machine. We describe the two basic methods in Sections 3.2 and 3.3. In Section 3.4, we analyze the I/O costs of PMVhorizontal and PMVvertical, and propose a naïve method, namely PMVselective, that selects one of the two basic methods according to the density of the input graph. After that, we propose PMVhybrid, our desired method, that uses the two basic methods simultaneously, in Section 3.5. Finally, in Section 3.6, we describe how to implement PMV on two popular distributed frameworks, Hadoop and Spark, to show that PMV is general enough to be implemented on any computing framework.

3.1 PMV: Pre-partitioned Generalized Matrix-Vector Multiplication

How can we efficiently perform GIM-V on distributed systems? The key idea of PMV is based on the observation that the input matrix M never changes and is reused in each iteration, while the vector v varies. PMV first divides the vector v into several sub-vectors and partitions the matrix M into corresponding sub-matrices which will be multiplied with each sub-vector respectively. Then only sub-vectors are shuffled to the corresponding sub-matrices in the iteration phase, avoiding the shuffle of the entire matrix in every iteration that existing MapReduce-based systems perform. Note that, even though some distributed-memory systems also avoid shuffling the matrix by redundantly retaining both the matrix and the vector in the main memory of each worker, they fail when the matrix and the vector do not fit into the memory, while PMV is insensitive to the memory size. PMV consists of two steps: the pre-partitioning and the iterative multiplication.

3.1.1 Pre-partitioning. PMV first initializes the input vector v properly based on the graph algorithm used. For example, v is set to 1/|v| in PageRank. Then, PMV partitions the matrix M into b × b sub-matrices M(i,j) = {m_{p,q} ∈ M | ψ(p) = i, ψ(q) = j} for i, j ∈ {1, ..., b}, where ψ is a vertex partitioning function. Likewise, the vector v is also divided into b sub-vectors v(i) = {v_p ∈ v | ψ(p) = i} for i ∈ {1, ..., b}. We consider the number of workers and the size of the vector to determine the number b of blocks. Let 𝓜 denote the main memory size of a worker: b is set to the number W of workers to maximize the parallelism if |v|/𝓜 < W; otherwise, b is set to O(|v|/𝓜) to fit a sub-vector into the main memory of size 𝓜. Note that this proper setting for b makes PMV insensitive to the memory size. In Figure 2b, the partitioning function ψ divides the set of vertices {1, 2, 3, 4, 5, 6} into b = 3 subsets {1, 2}, {3, 4}, and {5, 6}. Accordingly, the matrix and the vector are divided into 3 × 3 sub-matrices and 3 sub-vectors, respectively; sub-matrices and sub-vectors are depicted as boxes with bold border lines.
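As an illustration of this step, here is a small Python sketch under the simplifying assumptions that ψ is a modulo-b hash of the vertex id and that the memory budget 𝓜 is given as the number of vector elements a worker can hold; the function and variable names are ours.

# Illustrative sketch of the pre-partitioning step (assumptions: psi(p) = p mod b,
# and memory_size is the number of vector elements that fit in a worker's memory).
from collections import defaultdict

def prepartition(edges, v, num_workers, memory_size):
    """Split M (edge list of (p, q, m_pq)) into b x b blocks and v into b sub-vectors."""
    # Choose b: W workers for maximum parallelism unless a sub-vector would not fit in memory.
    if len(v) / memory_size < num_workers:
        b = num_workers
    else:
        b = -(-len(v) // memory_size)              # ceil(|v| / memory_size)
    psi = lambda p: p % b                          # vertex partitioning function

    matrix_blocks = defaultdict(list)              # (i, j) -> block M(i,j)
    for p, q, m_pq in edges:
        matrix_blocks[(psi(p), psi(q))].append((p, q, m_pq))

    vector_blocks = defaultdict(dict)              # i -> sub-vector v(i) as {p: v_p}
    for p, v_p in enumerate(v):
        vector_blocks[psi(p)][p] = v_p
    return b, matrix_blocks, vector_blocks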

3.1.2 Iterative Multiplication. PMV divides the entire problem M ⊗ v into b² subproblems and solves them in parallel. Subproblem ⟨i, j⟩ is to calculate v(i,j) = M(i,j) ⊗ v(j) for each pair (i, j) ∈ {1, ..., b}². Then, the i-th sub-vector v'(i) is calculated by combining v(i,j) for all j ∈ {1, ..., b}. Figure 3 illustrates how the entire problem is divided into several subproblems in PMV. A subproblem requires O(|v|/b) of memory: a subproblem should retain a sub-vector v(i), whose expected size is O(|v|/b), in the main memory of a worker. The sub-matrix M(i,j) is cached in the main memory or external memory of a worker: each worker reads the sub-matrix once from distributed storage and stores it locally.

Meanwhile, each worker solves multiple subproblems. The way of distributing subproblems to workers affects the amount of I/Os. Then, how should we assign the subproblems to workers to minimize the I/O cost? In the following subsections, we introduce multiple PMV methods to answer the question. We focus on the I/O cost of handling only vectors because all the methods require the same I/O cost O(|M|) to read the matrix, thanks to the local caching of sub-matrices.

3.2 PMVhorizontal: Horizontal Matrix Placement

PMVhorizontal uses the horizontal matrix placement illustrated in Figure 4b so that each worker solves subproblems which share the same output sub-vector. As a result, PMVhorizontal does not need to shuffle any intermediate vector, while the input vector is copied multiple times, as described in Algorithm 1. Each worker directly computes v'(i) from M(i,:) = {M(i,j) | j ∈ {1, ..., b}} and v (lines 2-10). For j ∈ {1, ..., b}, a worker computes intermediate vectors v(i,j) by combining M(i,j) and v(j), and reduces them into v'(i) immediately without any access to the distributed storage (lines 4-7). Note that combineAllb and combine2b are block operations for combineAll and combine2, respectively; combine2b(M(i,j), v(j)) applies combine2(m_{p,q}, v_q) for all m_{p,q} ∈ M(i,j) and v_q ∈ v(j), and combineAllb(X(i,j)) reduces each row's values in X(i,j) into a single value by applying the combineAll operation. After that, each worker applies the assignb operation, where assignb(v(j), x(j)) applies assign(v_p, x_p) for all vertices in {p | v_p ∈ v(j)}, and stores the result to the distributed storage (lines 8-9). PMVhorizontal repeats this task until convergence.

Algorithm 1 Iterative Multiplication (PMVhorizontal)
Input: a set {(M(i,:), v) | i ∈ {1, ..., b}} of matrix-vector pairs
Output: a result vector v' = {v'(i) | i ∈ {1, ..., b}}
 1: repeat
 2:   for each (M(i,:), v) do in parallel
 3:     initialize v'(i)
 4:     for each j ∈ {1, ..., b} do
 5:       v(i,j) ← combineAllb(combine2b(M(i,j), v(j)))
 6:       v'(i) ← combineAllb(v(i,j) ∪ v'(i))
 7:     v'(i) ← assignb(v(i), v'(i))
 8:     store v'(i) to v(i) in distributed storage
 9: until convergence
10: return v' = ⋃_{i ∈ {1,...,b}} v(i)
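A single-process Python simulation of one PMVhorizontal iteration may help to see the data flow of Algorithm 1; worker parallelism and the distributed storage are only simulated here, and the block structures and GIM-V operations are those of the sketches above.

# Illustrative simulation of one PMV_horizontal iteration (Algorithm 1).
# matrix_blocks / vector_blocks come from the pre-partitioning sketch; combine2,
# combine_all, and assign are the GIM-V operations (e.g., PageRank from Table 2).

def horizontal_iteration(matrix_blocks, vector_blocks, b, combine2, combine_all, assign):
    new_vector_blocks = {}
    for i in range(b):                             # "worker" i owns the row of blocks M(i,:)
        partial = {}                               # p -> combined values for destination vertex p
        for j in range(b):                         # every input sub-vector v(j) is read by worker i
            for p, q, m_pq in matrix_blocks.get((i, j), []):
                partial.setdefault(p, []).append(combine2(m_pq, vector_blocks[j][q]))
        new_vector_blocks[i] = {                   # vertices with no messages keep v_p in this sketch
            p: assign(v_p, combine_all(partial[p])) if p in partial else v_p
            for p, v_p in vector_blocks[i].items()}
    return new_vector_blocks                       # written back to v in the distributed storage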

3.3 PMVvertical: Vertical Matrix Placement

PMVvertical uses the vertical matrix placement illustrated in Figure 4c to solve the subproblems that share the same input sub-vector in the same worker. By doing so, PMVvertical reads each sub-vector only once in each worker. As described in Algorithm 2, PMVvertical computes v(:,j) = {v(i,j) | i ∈ {1, ..., b}} for each j ∈ {1, ..., b} in parallel (lines 2-11). Given j ∈ {1, ..., b}, a worker first loads v(j) into the main memory; then, it computes v(i,j) by sequentially reading M(i,j) for each i ∈ {1, ..., b} and stores v(i,j) into the distributed storage (lines 3-6). The worker of j is responsible for combining all intermediate data v(j,i) for i ∈ {1, ..., b} stored in the distributed storage into the final value v(j). After waiting for all the other workers to finish the sub-multiplication using a barrier (line 7), the worker of j loads v(j,i) for i ∈ {1, ..., b} from the distributed storage (line 8). Then, the worker calculates v'(j), which replaces v(j) in the distributed storage (lines 9-10). Note that the vectors v(j,i) do not need to be loaded all at once because the combineAll operation is commutative and associative. PMVvertical repeats this task until convergence.

Algorithm 2 Iterative Multiplication (PMVvertical)
Input: a set {(M(:,j), v(j)) | j ∈ {1, ..., b}} of matrix-vector pairs
Output: a result vector v' = {v'(j) | j ∈ {1, ..., b}}
 1: repeat
 2:   for each (M(:,j), v(j)) do in parallel
 3:     for each sub-matrix M(i,j) ∈ M(:,j) do
 4:       v(i,j) ← combineAllb(combine2b(M(i,j), v(j)))
 5:       store v(i,j) to distributed storage
 6:     Barrier
 7:     load v(j,i) for i ∈ {1, ..., b} \ {j}
 8:     v'(j) ← assignb(v(j), combineAllb(⋃_{i ∈ {1,...,b}} v(j,i)))
 9:     store v'(j) to v(j) in distributed storage
10: until convergence
11: return v' = ⋃_{j ∈ {1,...,b}} v(j)
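For comparison with the sketch above, the following Python fragment simulates one PMVvertical iteration; the dictionary named storage stands in for the distributed storage that carries the partial results v(i,j) across the barrier, and all names are ours.

# Illustrative simulation of one PMV_vertical iteration (Algorithm 2).

def vertical_iteration(matrix_blocks, vector_blocks, b, combine2, combine_all, assign):
    storage = {}                                        # (i, j) -> partial result v(i,j)
    for j in range(b):                                  # phase 1: worker j multiplies M(:,j) with v(j)
        for i in range(b):
            partial = {}
            for p, q, m_pq in matrix_blocks.get((i, j), []):
                partial.setdefault(p, []).append(combine2(m_pq, vector_blocks[j][q]))
            storage[(i, j)] = {p: combine_all(xs) for p, xs in partial.items()}
    # ---- barrier: all partial results are now in the (simulated) distributed storage ----
    new_vector_blocks = {}
    for j in range(b):                                  # phase 2: worker j merges v(j,i) for all i
        merged = {}
        for i in range(b):
            for p, x in storage.get((j, i), {}).items():
                merged[p] = combine_all([merged[p], x]) if p in merged else x
        new_vector_blocks[j] = {p: assign(v_p, merged[p]) if p in merged else v_p
                                for p, v_p in vector_blocks[j].items()}
    return new_vector_blocks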

3.4 PMVselective: Selecting Best Method between PMVhorizontal and PMVvertical

Given a graph, how can we decide the best multiplication method between PMVhorizontal and PMVvertical? In distributed graph systems, a major bottleneck is not the computational cost, but the expensive I/O cost. PMVselective compares the expected I/O costs of the two basic methods and selects the one having the minimum expected I/O cost. The expected I/O costs of PMVhorizontal and PMVvertical are derived in Lemmas 3.1 and 3.2, respectively.

Lemma 3.1 (I/O Cost of Horizontal Placement). PMVhorizontal has an expected I/O cost C_h per iteration:

    E[C_h] = (b + 1)|v|                                                       (1)

where |v| is the size of vector v and b is the number of vector blocks.

Proof. With the horizontal placement, each worker should load all vector blocks from the distributed storage since the combine2 function in each worker is computed with all sub-vectors. This causes b|v| I/O cost. Also, the result vector of size |v| should be written to the distributed storage. Thus, the total I/O cost is (b + 1)|v|. Note that PMVhorizontal requires no communication between workers. □

Lemma 3.2 (I/O Cost of Vertical Placement). PMVvertical has an expected I/O cost C_v per iteration:

    E[C_v] = 2|v| (1 + (b − 1)(1 − (1 − |M|/|v|²)^{|v|/b}))                   (2)

where |v| is the size of vector v, |M| is the number of non-zero elements in the matrix M, and b is the number of vector blocks.

Proof. The expected I/O cost of PMVvertical is the sum of 1) the cost to read the vector from the previous iteration, 2) the cost to transfer the sub-multiplication results between workers using distributed storage, and 3) the cost to write the result vector to the distributed storage. To transfer one of the sub-multiplication results, PMVvertical requires 2|v(i,j)| of I/O cost: one |v(i,j)| is for writing the results to the distributed storage, and the other is for reading them from the distributed storage. Therefore,

    E[C_v] = 2|v| + Σ_{i≠j} E[2|v(i,j)|] = 2|v| + 2b(b − 1) E[|v(i,j)|]       (3)

where v(i,j) is the result vector of the sub-multiplication M(i,j) ⊗ v(j). For each vertex u ∈ v(i,j), let X_u denote the event that the u-th element of v(i,j) has a non-zero value. Then,

    P(X_u) = 1 − P(u has no in-edges in M(i,j)) = 1 − (1 − |M|/|v|²)^{|v|/b}

by assuming that every matrix block has the same number of edges (non-zeros). The expected size of the sub-multiplication result is:

    E[|v(i,j)|] = Σ_{u ∈ v(i)} P(X_u) = (|v|/b)(1 − (1 − |M|/|v|²)^{|v|/b})   (4)

Combining (3) and (4), we obtain the claimed I/O cost. □

Lemmas 3.1 and 3.2 state that the cost depends on the density of the matrix and the number of vector blocks. Comparing (1) and (2), the condition to prefer the horizontal placement over the vertical placement is given by (5):

    E[C_h] < E[C_v]  ⇔  (1 − |M|/|v|²)^{|v|/b} < 0.5                          (5)

For sparse matrices, the I/O cost of PMVvertical is lower than that of PMVhorizontal. On the other hand, for dense matrices, PMVhorizontal has a smaller I/O cost than PMVvertical. As described in Algorithm 3, PMVselective first evaluates the condition (5) and selects the best method based on the result. Thus, the performance of PMVselective is better than or at least equal to those of PMVvertical and PMVhorizontal. Our experiment (see Section 4.4) shows the effectiveness of each method according to the matrix density.

Algorithm 3 Iterative Multiplication (PMVselective)
Input: a set {M(i,j) | (i, j) ∈ {1, ..., b}²} of matrix blocks, a set {v(i) | i ∈ {1, ..., b}} of vector blocks
Output: a result vector v' = {v'(j) | j ∈ {1, ..., b}}
1: if (1 − |M|/|v|²)^{|v|/b} < 0.5 then
2:   v' ← PMVhorizontal({(M(i,:), v) | i ∈ {1, ..., b}})
3: else
4:   v' ← PMVvertical({(M(:,j), v(j)) | j ∈ {1, ..., b}})
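A small numeric illustration of condition (5): with the vertex and edge counts of Twitter and RMAT26 from Table 3 and an illustrative b = 64 (one block per worker in our cluster), the condition points to the vertical placement for the sparse Twitter graph and to the horizontal placement for the dense RMAT26 graph, matching the experimental findings in Section 4.4. Note that the outcome depends on the chosen b.

# Evaluate condition (5): prefer PMV_horizontal iff (1 - |M|/|v|^2)^(|v|/b) < 0.5.

def prefer_horizontal(num_vertices, num_edges, b):
    density = num_edges / num_vertices ** 2
    return (1.0 - density) ** (num_vertices / b) < 0.5

b = 64                                                     # illustrative: one block per worker
print(prefer_horizontal(41_652_230, 1_468_365_182, b))     # Twitter (TW):  False -> vertical
print(prefer_horizontal(42_147_725, 5_000_000_000, b))     # RMAT26 (RM26): True  -> horizontal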

3.5 PMVhybrid: Using PMVhorizontal and PMVvertical Together

PMVhybrid improves PMVselective to further reduce I/O costs by using PMVhorizontal and PMVvertical together. The main idea is based on the fact that PMVvertical is appropriate for a sparse matrix while PMVhorizontal is appropriate for a dense matrix, as we discussed in Section 3.4. We also observe that the density of a matrix block varies across different sub-areas of the block. In other words, some areas of each matrix block are relatively dense with many high-degree vertices while the other areas are sparse. Using these observations, PMVhybrid divides each vector block v(i) into a sparse region v(i)_s with vertices whose out-degrees are smaller than a threshold θ and a dense region v(i)_d with vertices whose out-degrees are larger than or equal to the threshold. Likewise, each matrix block M(i,j) is also divided into a sparse region M(i,j)_s where each source vertex is in v(j)_s and a dense region M(i,j)_d where each source vertex is in v(j)_d. Then, PMVhybrid executes PMVhorizontal for the dense regions and PMVvertical for the sparse regions. Figure 4d illustrates PMVhybrid on 3 × 3 matrix blocks with 3 workers.

Algorithm 4 describes PMVhybrid. PMVhybrid performs an additional pre-processing step after the pre-partitioning step to split each matrix block into the dense and sparse regions (lines 1-2). Then, each worker first multiplies all assigned sparse matrix-vector pairs (M(:,j)_s, v(j)_s) by applying PMVvertical (lines 5-11). After that, the dense matrix-vector pairs (M(j,:)_d, v(:)_d) are multiplied using PMVhorizontal and added to the results of the sparse regions (lines 12-16). Finally, each worker splits the result vector into two regions again for the next iteration (lines 17-19). PMVhybrid repeats this task until convergence like PMVhorizontal and PMVvertical do.

The threshold θ to split the sparse and dense regions affects the performance and the I/O cost of PMVhybrid. If we set θ = 0, PMVhybrid is the same as PMVhorizontal because there is no vertex in the sparse regions. On the other hand, if we set θ = ∞, PMVhybrid is the same as PMVvertical because there is no vertex in the dense regions. To find the threshold which minimizes the I/O cost, we compute the expected I/O cost of PMVhybrid varying θ by Lemma 3.3, and choose the one with the minimum I/O cost.

Algorithm 4 Iterative Multiplication (PMVhybrid)
Input: a set {M(i,j) | (i, j) ∈ {1, ..., b}²} of matrix blocks, a set {v(i) | i ∈ {1, ..., b}} of vector blocks
Output: a result vector v' = {v'(i) | i ∈ {1, ..., b}}
 1: split v(i) into v(i)_s and v(i)_d for i ∈ {1, ..., b}
 2: split M(i,j) into M(i,j)_s and M(i,j)_d for (i, j) ∈ {1, ..., b}²
 3: repeat
 4:   for each (M(:,j)_s, v(j)_s, M(j,:)_d, v(:)_d) do in parallel
 5:     for each sub-matrix M(i,j)_s ∈ M(:,j)_s do
 6:       v(i,j)_s ← combineAllb(combine2b(M(i,j)_s, v(j)_s))
 7:       store v(i,j)_s to distributed storage
 8:     Barrier
 9:     load v(j,i)_s for i ∈ {1, ..., b} \ {j}
10:     v'(j) ← combineAllb(⋃_{i ∈ {1,...,b}} v(j,i)_s)
11:     for each i ∈ {1, ..., b} do
12:       v(j,i)_d ← combineAllb(combine2b(M(j,i)_d, v(i)_d))
13:       v'(j) ← combineAllb(v(j,i)_d, v'(j))
14:     v'(j) ← assignb(v(j)_s ∪ v(j)_d, v'(j))
15:     split v'(j) into v'(j)_s and v'(j)_d
16:     store v'(j)_s to v(j)_s in distributed storage
17:     store v'(j)_d to v(j)_d in distributed storage
18: until convergence
19: return v' = ⋃_{i ∈ {1,...,b}} v(i)_s ∪ v(i)_d

Lemma 3.3 (I/O Cost of PMVhybrid). PMVhybrid has an expected I/O cost C_hb per iteration:

    E[C_hb] = |v| (P_out(θ) + b(1 − P_out(θ)) + 1) + 2|v|(b − 1) Σ_{d=0}^{|v|} (1 − (1 − P_out(θ)/b)^d) · p_in(d)       (6)

where |v| is the size of vector v, b is the number of vector blocks, P_out(θ) is the ratio of vertices whose out-degree is less than θ, and p_in(d) is the ratio of vertices whose in-degree is d.

Proof. The expected I/O cost of PMVhybrid is the sum of 1) the cost to read the sparse regions of each vector block, 2) the cost to transfer the sub-multiplication results, 3) the cost to read the dense regions of each vector block, and 4) the cost to write the result vector. Like PMVvertical, PMVhybrid requires 2|v(i,j)_s| of I/O cost to transfer one of the sub-multiplication results, by writing the results to the distributed storage and reading them from the distributed storage. Therefore,

    E[C_hb] = |v| · P_out(θ) + Σ_{i≠j} E[2|v(i,j)_s|] + b|v| · (1 − P_out(θ)) + |v|       (7)

where v(i,j)_s is the result vector of the sub-multiplication between M(i,j)_s and v(j)_s. For each vertex u ∈ v(i,j)_s, let X_u denote the event that the u-th element of v(i,j)_s has a non-zero value. Then,

    P(X_u) = 1 − P(u has no in-edges in M(i,j)_s) = 1 − (1 − P_out(θ)/b)^{|in(u)|}

where in(u) is the set of in-neighbors of vertex u. Considering |in(u)| as a random variable following the in-degree distribution p_in(d),

    E[|v(i,j)_s|] = Σ_{u ∈ v(i)} P(X_u) = (|v|/b) · E[P(X_u)] = (|v|/b) Σ_{d=0}^{|v|} (1 − (1 − P_out(θ)/b)^d) · p_in(d)       (8)

Combining (7) and (8), we obtain the claimed I/O cost. □

Figure 4: An example of matrix placement methods on 3 × 3 sub-matrices with 3 workers ((a) pre-partitioned matrix; (b) horizontal placement; (c) vertical placement; (d) hybrid placement). In each matrix block M(i,j), the striped region and the white region represent the dense region M(i,j)_d and the sparse region M(i,j)_s, respectively. The horizontal placement groups the matrix blocks M(i,:), which share rows, to a worker i, while the vertical placement groups the matrix blocks M(:,j), which share columns, to a worker j. The hybrid placement assigns to worker i the sparse regions M(:,i)_s, which share columns, and the dense regions M(i,:)_d, which share rows.

Note that the in-degree distribution p_in(d) and the cumulative out-degree distribution P_out(θ) are approximated well using power-law degree distributions for real-world graphs. Although the exact cost of PMVhybrid in Lemma 3.3 includes data-dependent terms and thus is not directly comparable to those of the other PMV methods, in Section 4.4 we experimentally show that PMVhybrid achieves higher performance and a smaller amount of I/O than the other PMV methods.
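As a sketch of how Lemma 3.3 can be used in practice, the following Python helper evaluates E[C_hb] for candidate thresholds from empirical degree histograms and picks the cheapest one; the histogram-based representation and the function names are our assumptions for illustration, not part of the PMV implementation.

# Illustrative threshold selection by minimizing the expected I/O cost of Eq. (6).
# Histograms map a degree d to the number of vertices with that (in- or out-) degree.

def expected_hybrid_cost(theta, n, b, out_degree_hist, in_degree_hist):
    """E[C_hb] from Lemma 3.3; n = |v| is the number of vertices."""
    p_out = sum(c for d, c in out_degree_hist.items() if d < theta) / n     # P_out(theta)
    read_write = n * (p_out + b * (1.0 - p_out) + 1.0)
    transfer = sum((1.0 - (1.0 - p_out / b) ** d) * (c / n)                 # sum over p_in(d)
                   for d, c in in_degree_hist.items())
    return read_write + 2.0 * n * (b - 1) * transfer

def choose_theta(candidates, n, b, out_degree_hist, in_degree_hist):
    return min(candidates,
               key=lambda t: expected_hybrid_cost(t, n, b, out_degree_hist, in_degree_hist))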

3.6 Implementation

In this section, we discuss practical issues in implementing PMV on distributed systems. We only discuss the issues related to PMVhybrid because PMVhorizontal and PMVvertical are special cases of PMVhybrid, as we discussed in Section 3.5. We focus on the famous distributed processing frameworks Hadoop and Spark. Note that PMV can be implemented on any distributed processing framework.

3.6.1 PMV on Hadoop. The pre-partitioning is implemented in a single MapReduce job. The implementation places the matrix blocks within the same column into a single machine; each matrix element m_{p,q} ∈ M(i,j) moves to the j-th reducer during the map and shuffle steps; after that, each reducer groups matrix elements into matrix blocks, and divides each matrix block into two regions (sparse and dense) by the given threshold θ. The iterative multiplication is implemented in a single Map-only job. Each mapper solves the assigned subproblems one by one; for each subproblem, a mapper reads the corresponding sub-matrix and sub-vector from HDFS. The mapper first computes the sub-multiplication M(i,j)_s ⊗ v(j)_s of the sparse regions, and waits for all the other mappers to finish the sub-multiplication using a barrier. The result vector v(i,j)_s of a subproblem is sent to the i-th mapper via HDFS to be merged into v'(i). After that, the sub-multiplications M(i,:)_d ⊗ v(:)_d of the dense regions are computed by the i-th mapper. The result vector v(i,j)_d is directly merged into v'(i) in the main memory. After the result vector v' is computed, each mapper splits the result vector into the sparse and dense regions. Then, the next iteration starts with the new vector v' by the same Map-only job until convergence.

3.6.2 PMV on Spark. The pre-partitioning is implemented by a partitionBy and two mapPartitions operations of the typical Resilient Distributed Dataset (RDD) API. The partitionBy operation uses a custom partitioner to partition the matrix blocks. The mapPartitions operations output four RDDs, sparseMatRDD, denseMatRDD, sparseVecRDD, and denseVecRDD, which contain sparse and dense regions of matrix blocks, and sparse and dense regions of vector blocks, respectively. Each iteration of matrix-vector multiplication is implemented by five RDD operations. For the sparse regions, the multiplication comprises the following operations: (1) a join operation on the sparseMatRDD and the sparseVecRDD to combine vector blocks and matrix blocks, (2) a mapPartitions operation to create the partial vector blocks, and (3) a reduceByKey operation on the partial vector blocks. In the case of the dense regions, each iteration of the multiplication comprises the following operations: (1) a flatMap operation on the denseVecRDD to copy the vector blocks, (2) a join operation on the denseMatRDD and the copied denseVecRDD, and (3) a mapPartitions operation to create the updated vecRDD. After both multiplications for the sparse and dense regions, (4) a join operation is used to combine the results of the multiplications in the sparse regions and the dense regions. Finally, (5) mapPartitions splits the combined results into sparseVecRDD and denseVecRDD again. We ensure the colocation of relevant matrix blocks and vector blocks by using a custom partitioner. Therefore, each worker runs the join operation combining the sparse matrices and the sparse vectors without network I/Os. The join operation for the dense regions requires network I/Os, but only the dense vectors, whose sizes are relatively small in PMVhybrid, are transferred.
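The following condensed PySpark sketch shows the sparse-region pipeline described above on a toy input: co-partitioned matrix-block and vector-block RDDs, a join that stays local because both RDDs share the same partitioner, a mapPartitions that performs the sub-multiplications, and a reduceByKey that combines the partial blocks v(i,j). The record layouts, names, and GIM-V operations are our simplifications for illustration, not the authors' code; the dense regions and the re-splitting of the result vector are omitted.

# Condensed PySpark sketch of the sparse-region multiplication (illustrative only).
from pyspark import SparkContext

sc = SparkContext(appName="pmv-sparse-sketch")
b = 2                                              # number of blocks (toy value)
combine2 = lambda m, v: m * v                      # PageRank-style GIM-V operations
combine_all = lambda xs: sum(xs)

# (j, (i, edges)): sparse region of matrix block M(i,j), keyed by its column block j.
mat_records = [(0, (0, [(0, 0, 1.0)])), (0, (1, [(1, 0, 1.0)])), (1, (0, [(0, 1, 0.5)]))]
# (j, {q: v_q}): sparse region of vector block v(j).
vec_records = [(0, {0: 0.5}), (1, {1: 0.5})]

part = lambda key: key % b                         # custom partitioner: co-locate by column block
sparse_mat = sc.parallelize(mat_records).partitionBy(b, part)
sparse_vec = sc.parallelize(vec_records).partitionBy(b, part)

def multiply(records):                             # sub-multiplication M(i,j)_s (x) v(j)_s
    for j, ((i, edges), vec) in records:
        partial = {}
        for p, q, m_pq in edges:
            partial.setdefault(p, []).append(combine2(m_pq, vec[q]))
        yield (i, {p: combine_all(xs) for p, xs in partial.items()})

def merge_blocks(x, y):                            # combine partial blocks v(i,j) sharing block i
    out = dict(x)
    for p, val in y.items():
        out[p] = combine_all([out[p], val]) if p in out else val
    return out

partial_vec = (sparse_mat.join(sparse_vec)         # same partitioner, so no extra shuffle of M
               .mapPartitions(multiply)            # partial vector blocks v(i,j)
               .reduceByKey(merge_blocks))         # i-th blocks combined across workers
print(partial_vec.collect())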

4 EXPERIMENTS

We perform experiments to answer the following questions:

Q1. How much does PMV improve the performance and scalability compared to the existing systems? (Section 4.3)
Q2. How much does the matrix density affect the performance of PMV's four methods? (Section 4.4)
Q3. How much does the threshold θ affect the performance and the amount of I/O of PMVhybrid? (Section 4.5)
Q4. How does PMV scale up with the number of workers? (Section 4.6)
Q5. How does the performance of PMV differ depending on the underlying distributed framework? (Section 4.7)

Figure 5: The effect of the matrix density on (a) running time and (b) amount of I/O, shown relative to PMVselective on TW, YW, CW09, and RM26. PMVvertical is faster and more I/O efficient than PMVhorizontal for sparse graphs while PMVhorizontal is faster and more I/O efficient than PMVvertical for a dense graph. PMVhybrid shows the best performance for all cases, outperforming the other versions of PMV.

Table 3: The summary of graphs.

Graph               Vertices         Edges             Source
ClueWeb12 (CW12)    6,231,126,594    71,746,553,402    Lemur Project (https://lemurproject.org/clueweb12/)
ClueWeb09 (CW09)    1,684,876,525    7,939,647,897     Lemur Project (https://lemurproject.org/clueweb09/)
YahooWeb (YW)       720,242,173      6,636,600,779     Webscope (http://webscope.sandbox.yahoo.com)
Twitter (TW)        41,652,230       1,468,365,182     Kwak et al. [29] (http://an.kaist.ac.kr/traces/WWW2010.html)
RMAT26 (RM26)       42,147,725       5,000,000,000     TegViz [20] (http://datalab.snu.ac.kr/tegviz)

4.1 Datasets

We use real-world graphs to compare PMV to existing systems (Sections 4.3 and 4.6) and a synthetic graph to evaluate the performance of PMV (Section 4.4). The graphs are summarized in Table 3. Twitter is a who-follows-whom network in Twitter crawled in 2010. YahooWeb, ClueWeb09, and ClueWeb12 are page-level hyperlink networks on the WWW. RMAT [10] is a famous graph generation model that matches the characteristics of real-world networks. We generate an RMAT graph with parameters a = 0.57, b = 0.19, c = 0.19, and d = 0.05 using TegViz [20], a distributed graph generator.

4.2 Environment

We implemented PMV on Hadoop and Spark, which are famous distributed processing frameworks. Sections 4.3, 4.4, and 4.6 show the experimental results on Hadoop. The result on Spark is in Section 4.7. We compare PMV to existing graph processing systems: PEGASUS, GraphX, GraphLab, and Giraph. PEGASUS is a disk-based system, and the others are distributed-memory based systems.

We run our experiments on a cluster of 17 machines; one is a master and the others are for workers. Each machine is equipped with an Intel E3-1240v5 CPU (quad-core, 3.5GHz), 32GB of RAM, and 4 hard disk drives. A machine that is not the master runs 4 workers, each with 1 CPU core and 6GB of RAM. All the machines are connected via 1 Gigabit Ethernet. Hadoop 2.7.3, Spark 2.0.1, and MPICH 3.0.4 are installed on the cluster.

4.3 Performance of PMV

We compare the running time of PMV and the competitors (PEGASUS, GraphX, GraphLab, and Giraph) on ClueWeb12; induced subgraphs with varying numbers of edges are used. For each system, we run the PageRank algorithm with 8 iterations. Figure 1 shows the running time of all systems on various graph sizes. We emphasize that only PMV succeeds in processing the entire ClueWeb12 graph. The memory-based systems fail on graphs with more than 2.3 billion edges due to out-of-memory errors, while PEGASUS fails to process graphs with 9 billion edges within 5 hours. The underlying causes are as follows. Giraph requires that all the out-edges of the assigned vertices are stored in the main memory of the worker. However, this requirement can be easily broken since a highly skewed degree distribution is likely to lead to an out-of-memory error. GraphLab uses the vertex-cut partitioning method and copies the vertices to the multiple workers which have the edges related to the vertices. The edges and the copied vertices are stored in the main memory of each worker, and incur the out-of-memory error. GraphX uses the same approach as GraphLab, but succeeds in processing a graph which GraphLab fails to process because Spark, its underlying data processing engine, uses both the disk and the main memory of each worker. Even GraphX, however, fails to process graphs with more than 2.3 billion edges due to a huge number of RDD partitions.

4.4 Effect of Matrix Density

We evaluate the performance of PMV on graphs with varying density. The results are in Figure 5. Twitter, YahooWeb, and ClueWeb09 are real-world sparse graphs where the matrix density |M|/|v|² is less than 10^-7, while RMAT26 is a synthetic dense graph where the matrix density is larger than 10^-7. As we discussed in Section 3.4, the vertical placement is appropriate for a sparse graph while the horizontal placement is appropriate for a dense graph. Figures 5a and 5b verify the relation between the performance and the density of the graph. PMVvertical shows a better performance than PMVhorizontal when the input matrix is sparse. On the other hand, if the matrix is dense, PMVhorizontal provides a better performance than PMVvertical. PMVselective shows the same performance as the best of PMVhorizontal and PMVvertical, as we expected. PMVhybrid significantly reduces the amount of I/O for both sparse and dense graphs, and improves the performance up to 18% over PMVselective.

4.5 Effect of Threshold θ

We iterate the PMVhybrid-based PageRank algorithm 30 times on the Twitter graph varying the threshold θ. Figure 6 presents the effect of the threshold on the running time and the amount of I/O. PMVvertical (θ = ∞) shows better performance and a lower amount of I/O than PMVhorizontal (θ = 0), as we expected, because Twitter is sparse with density lower than 10^-7. PMVhybrid achieves the best performance with θ = 200: in this setting PMVhybrid shows a 44% decreased amount of I/O compared to that of PMVvertical, from 318GB to 178GB. Note that θ = 100 gives the minimum amount of I/O while θ = 200 gives the fastest running time. A possible explanation is that the skewness of the in-degree distribution of the dense area and the out-degree distribution of the sparse area affects the running times of the horizontal and vertical computations of PMVhybrid, respectively; however, the difference is minor and does not change the conclusion that PMVhybrid outperforms all other versions of PMV.

Figure 6: The effect of threshold θ on the running time (in seconds) and the amount of I/O (in GB) on the Twitter graph, for θ ∈ {0, 100, 200, 500, 1000, 2^14, 2^15, ∞}. PMVhybrid shows the best performance when θ = 200, with a 44% reduced amount of I/O compared to when θ = ∞, i.e., PMVvertical.

Figure 7: Machine scalability of PMV on YahooWeb (speedup over 16 workers vs. number of workers). PMV shows linear machine scalability with slope close to 1, while PEGASUS does with a much smaller slope because of the curse of the last reducer problem [49] incurred by the high-degree vertices.

4.6 Machine Scalability

We evaluate the machine scalability of PMV and the competitors by running the PageRank algorithm with varying numbers of workers on YahooWeb. Figure 7 shows the speedup according to the number of workers from 16 to 64; the speedup is defined as t_16/t_n, where t_n is the running time with n workers. We omit GraphLab, Giraph, and GraphX because they fail to process the YahooWeb graph on 16 workers. PMV shows linear machine scalability with slope close to 1, while PEGASUS does with a much smaller slope. PEGASUS suffers from the curse of the last reducer problem [49] which is incurred by the high-degree vertices. PMV overcomes the problem by treating the high-degree vertices in multiple workers.

Figure 8: The running time of PMV on Hadoop and Spark (running time in seconds vs. number of edges). PMV is faster on Spark than on Hadoop when the graph is small. On large graphs, however, PMV runs faster on Hadoop than on Spark (see Section 4.7 for details).

4.7 Underlying Engine

Figure 8 shows the performance of PMV according to the underlying systems: Hadoop and Spark. We use ClueWeb12 with varying numbers of edges as in Section 4.3. When the graph is small, PMV on Spark beats PMV on Hadoop. This is because Spark is highly optimized for iterative computation; Spark requires much less start-up and clean-up time for each iteration than Hadoop does. When the graph is large, however, PMV on Spark falls behind PMV on Hadoop. PMV on Spark requires more memory than PMV on Hadoop since Spark's RDD is immutable; for updating a vector, PMV on Spark creates a new vector requiring additional memory, while PMV on Hadoop updates the vector in-place. Accordingly, when the graph is large, PMV on Spark needs to partition the input vector into smaller blocks than PMV on Hadoop does. This makes the performance of PMV on Spark worse than that of PMV on Hadoop for large graphs.

5 CONCLUSION

We propose PMV, a scalable graph mining method based on generalized matrix-vector multiplication on distributed systems. PMV exploits both horizontal and vertical placement strategies to reduce I/O costs. PMV shows up to 16× larger scalability than existing distributed-memory methods, 9× faster performance than existing disk-based ones, and linear scalability with respect to the number of edges and machines. Future research directions include a graph partitioning algorithm that improves the performance of graph mining algorithms based on distributed matrix-vector multiplication.

REFERENCES

[1] Vinayak R. Borkar, Michael J. Carey, Raman Grover, Nicola Onose, and Rares Vernica. 2011. Hyracks: A flexible and extensible foundation for data-intensive computing. In ICDE. 1151–1162.
[2] Sergey Brin and Lawrence Page. 1998. The Anatomy of a Large-Scale Hypertextual Web Search Engine. Computer Networks 30, 1-7 (1998), 107–117.
[3] Yingyi Bu, Vinayak R. Borkar, Jianfeng Jia, Michael J. Carey, and Tyson Condie. 2014. Pregelix: Big(ger) Graph Analytics on a Dataflow Engine. PVLDB 8, 2 (2014), 161–172.
[4] Rong Chen, Jiaxin Shi, Yanzhe Chen, and Haibo Chen. 2015. PowerLyra: differentiated graph computation and partitioning on skewed graphs. In EuroSys. 1:1–1:15.
[5] Ankur Dave, Alekh Jindal, Li Erran Li, Reynold Xin, Joseph Gonzalez, and Matei Zaharia. 2016. GraphFrames: an integrated API for mixing graph and relational queries. In GRADES.


[6] Tiago Manuel Louro Machado De Simas. 2012. Stochastic models and transitivity in complex networks. Ph.D. Dissertation. Indiana University.
[7] Jeffrey Dean and Sanjay Ghemawat. 2008. MapReduce: simplified data processing on large clusters. Commun. ACM 51, 1 (2008), 107–113.
[8] E. W. Dijkstra. 1959. A note on two problems in connexion with graphs. Numer. Math. 1 (1959), 269–271. Issue 1.
[9] Shimon Even and Yossi Shiloach. 1981. An On-Line Edge-Deletion Problem. J. ACM 28, 1 (1981), 1–4.
[10] Michalis Faloutsos, Petros Faloutsos, and Christos Faloutsos. 1999. On Power-law Relationships of the Internet Topology. In SIGCOMM. 251–262.
[11] Robert W. Floyd. 1962. Algorithm 97: Shortest path. Commun. ACM 5, 6 (1962), 345.
[12] Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. 2003. The Google file system. In SOSP. 29–43.
[13] Joseph E. Gonzalez, Yucheng Low, Haijie Gu, Danny Bickson, and Carlos Guestrin. 2012. PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs. In OSDI. 17–30.
[14] Joseph E. Gonzalez, Reynold S. Xin, Ankur Dave, Daniel Crankshaw, Michael J. Franklin, and Ion Stoica. 2014. GraphX: Graph Processing in a Distributed Dataflow Framework. In OSDI. 599–613.
[15] Leo Grady and Gareth Funka-Lea. 2004. Multi-label Image Segmentation for Medical Applications Based on Graph-Theoretic Electrical Potentials. In ECCV Workshops CVAMIA. 230–245.
[16] Hugo Gualdron, Robson L. F. Cordeiro, Jose Fernando Rodrigues Jr., Duen Horng (Polo) Chau, Minsuk Kahng, and U Kang. 2016. M-Flash: Fast Billion-Scale Graph Computation Using a Bimodal Block Processing Model. In PKDD. 623–640.
[17] Wook-Shin Han, Sangyeon Lee, Kyungyeol Park, Jeong-Hoon Lee, Min-Soo Kim, Jinha Kim, and Hwanjo Yu. 2013. TurboGraph: a fast parallel graph engine handling billion-scale graphs in a single PC. In KDD. 77–85.
[18] Herodotos Herodotou. 2011. Hadoop performance models. arXiv (2011).
[19] John E. Hopcroft and Robert Endre Tarjan. 1973. Efficient Algorithms for Graph Manipulation [H] (Algorithm 447). Commun. ACM 16, 6 (1973), 372–378.
[20] ByungSoo Jeon, Inah Jeon, and U Kang. 2015. TeGViz: Distributed Tera-Scale Graph Generation and Visualization. In ICDM. 1620–1623.
[21] Jinhong Jung, Woojeong Jin, Lee Sael, and U Kang. 2016. Personalized Ranking in Signed Networks Using Signed Random Walk with Restart. In ICDM. 973–978.
[22] Jinhong Jung, Namyong Park, Lee Sael, and U Kang. 2017. BePI: Fast and Memory-Efficient Method for Billion-Scale Random Walk with Restart. In SIGMOD. 789–804.
[23] Jinhong Jung, Kijung Shin, Lee Sael, and U Kang. 2016. Random Walk with Restart on Large Graphs Using Block Elimination. ACM Trans. Database Syst. 41, 2 (2016), 12:1–12:43.
[24] Vasiliki Kalavri, Tiago Simas, and Dionysios Logothetis. 2016. The shortest path is not always a straight line. PVLDB 9, 9 (2016), 672–683.
[25] U Kang and Christos Faloutsos. 2014. Mining Tera-Scale Graphs with "Pegasus": Algorithms and Discoveries. In Large-Scale Data Analytics. 75–99.
[26] U Kang, Jay Yoon Lee, Danai Koutra, and Christos Faloutsos. 2014. Net-Ray: Visualizing and Mining Billion-Scale Graphs. In PAKDD. 348–361.
[27] U Kang, Charalampos E. Tsourakakis, Ana Paula Appel, Christos Faloutsos, and Jure Leskovec. 2011. HADI: Mining Radii of Large Graphs. TKDD 5, 2 (2011), 8.
[28] U Kang, Charalampos E. Tsourakakis, and Christos Faloutsos. 2009. PEGASUS: A Peta-Scale Graph Mining System. In ICDM. 229–238.
[29] Haewoon Kwak, Changhyun Lee, Hosung Park, and Sue Moon. 2010. What is Twitter, a social network or a news media?. In WWW. 591–600.
[30] Aapo Kyrola, Guy E. Blelloch, and Carlos Guestrin. 2012. GraphChi: Large-Scale Graph Computation on Just a PC. In OSDI. 31–46.
[31] Haejoon Lee, Minseo Kang, Sun-Bum Youn, Jae-Gil Lee, and YongChul Kwon. 2016. An Experimental Comparison of Iterative MapReduce Frameworks. In CIKM.
[32] Ho Lee, Bin Shao, and U Kang. 2015. Fast graph mining with HBase. Inf. Sci. 315 (2015), 56–66.
[33] Zhiyuan Lin, Minsuk Kahng, Kaeser Md. Sabrin, Duen Horng (Polo) Chau, Ho Lee, and U Kang. 2014. MMap: Fast billion-scale graph computation on a PC via memory mapping. In Big Data. 159–164.
[34] Yucheng Low, Joseph Gonzalez, Aapo Kyrola, Danny Bickson, Carlos Guestrin, and Joseph M. Hellerstein. 2012. Distributed GraphLab: A Framework for Machine Learning in the Cloud. PVLDB 5, 8 (2012), 716–727.
[35] Lingxiao Ma, Zhi Yang, Han Chen, Jilong Xue, and Yafei Dai. 2017. Garaph: Efficient GPU-accelerated Graph Processing on a Single Machine with Balanced Replication. In ATC. 195–207.
[36] Grzegorz Malewicz, Matthew H. Austern, Aart J. C. Bik, James C. Dehnert, Ilan Horn, Naty Leiser, and Grzegorz Czajkowski. 2010. Pregel: a system for large-scale graph processing. In SIGMOD. 135–146.
[37] Saket Navlakha, Rajeev Rastogi, and Nisheeth Shrivastava. 2008. Graph summarization with bounded error. In SIGMOD. 419–432.
[38] Kay Ousterhout, Ryan Rasti, Sylvia Ratnasamy, Scott Shenker, and Byung-Gon Chun. 2015. Making Sense of Performance in Data Analytics Frameworks. In NSDI. 293–307.
[39] Ha-Myung Park, Sung-Hyon Myaeng, and U Kang. 2016. PTE: Enumerating Trillion Triangles On Distributed Systems. In KDD. 1115–1124.
[40] Ha-Myung Park, Francesco Silvestri, U Kang, and Rasmus Pagh. 2014. MapReduce Triangle Enumeration With Guarantees. In CIKM. 1739–1748.
[41] Md. Mostofa Ali Patwary, Peder Refsnes, and Fredrik Manne. 2012. Multi-core Spanning Forest Algorithms using the Disjoint-set Data Structure. In IPDPS. 827–835.
[42] Lu Qin, Jeffrey Xu Yu, Lijun Chang, Hong Cheng, Chengqi Zhang, and Xuemin Lin. 2014. Scalable big graph processing in MapReduce. In SIGMOD. 827–838.
[43] Semih Salihoglu and Jennifer Widom. 2013. GPS: a graph processing system. In SSDBM. 22:1–22:12.
[44] Hyunseok Seo, Jinwook Kim, and Min-Soo Kim. 2015. GStream: a graph streaming processing method for large-scale graphs on GPUs. In PPoPP. 253–254.
[45] Bin Shao, Haixun Wang, and Yatao Li. 2013. Trinity: a distributed graph engine on a memory cloud. In SIGMOD. 505–516.
[46] Juwei Shi, Yunjie Qiu, Umar Farooq Minhas, Limei Jiao, Chen Wang, Berthold Reinwald, and Fatma Ozcan. 2015. Clash of the Titans: MapReduce vs. Spark for Large Scale Data Analytics. PVLDB 8, 13 (2015), 2110–2121.
[47] Kijung Shin, Jinhong Jung, Lee Sael, and U Kang. 2015. BEAR: Block Elimination Approach for Random Walk with Restart on Large Graphs. In SIGMOD. 1571–1585.
[48] Konstantin Shvachko, Hairong Kuang, Sanjay Radia, and Robert Chansler. 2010. The Hadoop Distributed File System. In MSST. 1–10.
[49] Siddharth Suri and Sergei Vassilvitskii. 2011. Counting triangles and the curse of the last reducer. In WWW. 607–614.
[50] Shivaram Venkataraman, Erik Bodzsar, Indrajit Roy, Alvin AuYoung, and Robert S. Schreiber. 2013. Presto: distributed machine learning and graph processing with sparse matrices. In EuroSys. 197–210.
[51] Da Yan, James Cheng, Yi Lu, and Wilfred Ng. 2015. Effective Techniques for Message Reduction and Load Balancing in Distributed Graph Computation. In WWW. 1307–1317.
[52] Xintian Yang, Srinivasan Parthasarathy, and P. Sadayappan. 2011. Fast Sparse Matrix-Vector Multiplication on GPUs: Implications for Graph Mining. PVLDB 4, 4 (2011), 231–242.
[53] Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauly, Michael J. Franklin, Scott Shenker, and Ion Stoica. 2012. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. In NSDI. 15–28.
[54] Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, and Ion Stoica. 2010. Spark: Cluster Computing with Working Sets. In HotCloud.
