large scale complex network analysis using th h b …...large scale complex network analysis using...
TRANSCRIPT
![Page 1: Large Scale Complex Network Analysis using th H b …...Large Scale Complex Network Analysis using th H b id C bi ti fthe Hybrid Combination of a MapReduceMapReduce Cluster and Cluster](https://reader034.vdocuments.net/reader034/viewer/2022042915/5f50689dc14b9479221d3bdd/html5/thumbnails/1.jpg)
Large Scale Complex Network Analysis using the Hybrid Combination of a MapReduce Cluster and a Highly Multithreaded System
Seunghwa Kang, David A. Bader
![Page 2: Large Scale Complex Network Analysis using th H b …...Large Scale Complex Network Analysis using th H b id C bi ti fthe Hybrid Combination of a MapReduceMapReduce Cluster and Cluster](https://reader034.vdocuments.net/reader034/viewer/2022042915/5f50689dc14b9479221d3bdd/html5/thumbnails/2.jpg)
Various Complex Networks
• Friendship network
• Citation network
• Web-link graph
• Collaboration network
=> Need to extract graphs from large volumes of raw data.
=> Extracted graphs are highly irregular.
Source: http://www.facebook.com
Source: http://academic.research.microsoft.com
![Page 3: Large Scale Complex Network Analysis using th H b …...Large Scale Complex Network Analysis using th H b id C bi ti fthe Hybrid Combination of a MapReduceMapReduce Cluster and Cluster](https://reader034.vdocuments.net/reader034/viewer/2022042915/5f50689dc14b9479221d3bdd/html5/thumbnails/3.jpg)
A Challenge Problem
• Extracting a subgraph from a larger graph.
- The input graph: An R-MAT* graph (undirected, unweighted) with approx. 4.29 billion vertices and 275 billion edges (7.4 TB in text format).
- R-MAT parameters: a=0.55, b=0.1, c=0.1, d=0.25
- Extract subnetworks that cover 10%, 5%, and 2% of the vertices.
• Finding a single-pair shortest path (for up to 30 pairs).
* D. Chakrabarti, Y. Zhan, and C. Faloutsos, “R-MAT: A recursive model for graph mining,” SIAM Int’l Conf. on Data Mining (SDM), 2004.
Source: Seokhee Hong
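To make the R-MAT parameters concrete, here is a minimal sketch of how such a generator can pick edges: each edge descends recursively into one of the four quadrants of the adjacency matrix with probabilities a, b, c, and d. The function names `rmat_edge` and `rmat_edges`, the seed, and the tiny scale are assumptions for illustration; this is not the generator used for the slides.

```python
import random

def rmat_edge(scale, a=0.55, b=0.1, c=0.1, rng=random):
    """Return one (src, dst) pair for a graph with 2**scale vertices.

    The fourth quadrant probability d is implied as 1 - a - b - c
    (0.25 with the defaults taken from the slide).
    """
    src = dst = 0
    for _ in range(scale):
        r = rng.random()
        src <<= 1
        dst <<= 1
        if r < a:
            pass              # top-left quadrant: both bits stay 0
        elif r < a + b:
            dst |= 1          # top-right quadrant
        elif r < a + b + c:
            src |= 1          # bottom-left quadrant
        else:
            src |= 1          # bottom-right quadrant
            dst |= 1
    return src, dst

def rmat_edges(scale, num_edges, seed=0):
    """Generate a list of R-MAT edges with a reproducible seed (hypothetical helper)."""
    rng = random.Random(seed)
    return [rmat_edge(scale, rng=rng) for _ in range(num_edges)]

# Tiny illustration: 16 vertices, 100 edges.
edges = rmat_edges(scale=4, num_edges=100)
```

With a = 0.55 most edges land in the low-numbered corner of the matrix, which is what makes the resulting degree distribution skewed and the graph irregular.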
![Page 4: Large Scale Complex Network Analysis using th H b …...Large Scale Complex Network Analysis using th H b id C bi ti fthe Hybrid Combination of a MapReduceMapReduce Cluster and Cluster](https://reader034.vdocuments.net/reader034/viewer/2022042915/5f50689dc14b9479221d3bdd/html5/thumbnails/4.jpg)
Presentation Outline
• Present the hybrid system.
• Solve the problem using three different systems: a MapReduce cluster, a highly multithreaded system, and the hybrid system.
• Show the effectiveness of the hybrid system by
- Algorithm level analyses
- System level analyses
- Experimental results
![Page 5: Large Scale Complex Network Analysis using th H b …...Large Scale Complex Network Analysis using th H b id C bi ti fthe Hybrid Combination of a MapReduceMapReduce Cluster and Cluster](https://reader034.vdocuments.net/reader034/viewer/2022042915/5f50689dc14b9479221d3bdd/html5/thumbnails/5.jpg)
Highlights
| | A MapReduce cluster | A highly multithreaded system | A hybrid system of the two |
|---|---|---|---|
| Theory level analysis | Graph extraction: $W_{\mathrm{MapReduce}}(n) \approx \theta(T^*(n))$. Shortest path: $W_{\mathrm{MapReduce}}(n) > \theta(T^*(n))$ | Work optimal | Effective if $\lvert T_{\mathrm{hmt}} - T_{\mathrm{MapReduce}} \rvert > n / BW_{\mathrm{inter}}$ |
| System level analysis | Bisection bandwidth and disk I/O overhead | Limited aggregate computing power, disk capacity, and I/O bandwidth | $BW_{\mathrm{inter}}$ is important. |
| Experiments | Five orders of magnitude slower than the highly multithreaded system in finding a shortest path | Incapable of storing the input graph | Efficient in solving the challenge problem. |
![Page 6: Large Scale Complex Network Analysis using th H b …...Large Scale Complex Network Analysis using th H b id C bi ti fthe Hybrid Combination of a MapReduceMapReduce Cluster and Cluster](https://reader034.vdocuments.net/reader034/viewer/2022042915/5f50689dc14b9479221d3bdd/html5/thumbnails/6.jpg)
A Hybrid System to Address the Distinct Computational Challenges
[Diagram: a MapReduce cluster and a highly multithreaded system]
![Page 8: Large Scale Complex Network Analysis using th H b …...Large Scale Complex Network Analysis using th H b id C bi ti fthe Hybrid Combination of a MapReduceMapReduce Cluster and Cluster](https://reader034.vdocuments.net/reader034/viewer/2022042915/5f50689dc14b9479221d3bdd/html5/thumbnails/8.jpg)
A Hybrid System to Address the Distinct Computational Challenges
[Diagram: a MapReduce cluster and a highly multithreaded system, annotated with two steps]
1. graph extraction
2. graph analysis queries
![Page 9: Large Scale Complex Network Analysis using th H b …...Large Scale Complex Network Analysis using th H b id C bi ti fthe Hybrid Combination of a MapReduceMapReduce Cluster and Cluster](https://reader034.vdocuments.net/reader034/viewer/2022042915/5f50689dc14b9479221d3bdd/html5/thumbnails/9.jpg)
The MapReduce Programming Model
• Scans the entire input data in the map phase.
• # MapReduce iterations = the depth of a directed acyclic graph (DAG) for MapReduce computation.
[Figure: MapReduce data flow: input data → map → intermediate data → sort → sorted intermediate data → reduce → output data]
[Figure: an example DAG of depth 3: A[0] to A[7] at depth 1, A'[0] to A'[4] at depth 2, A''[0] to A''[4] at depth 3]
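The map, sort, and reduce phases on this slide can be sketched as a single-process Python toy. The helper `map_reduce` and the degree-counting mapper and reducer are hypothetical names chosen for illustration; a real MapReduce cluster distributes each phase across nodes and disks.

```python
from itertools import groupby
from operator import itemgetter

def map_reduce(records, mapper, reducer):
    """Run one MapReduce iteration in memory: map, sort by key, reduce."""
    # Map phase: every input record is scanned once.
    intermediate = [kv for rec in records for kv in mapper(rec)]
    # Sort phase: bring equal keys next to each other.
    intermediate.sort(key=itemgetter(0))
    # Reduce phase: one reducer call per distinct key.
    output = []
    for key, group in groupby(intermediate, key=itemgetter(0)):
        output.extend(reducer(key, [value for _, value in group]))
    return output

def degree_mapper(edge):
    """Emit (vertex, 1) for both endpoints of an undirected edge."""
    u, v = edge
    yield u, 1
    yield v, 1

def degree_reducer(vertex, counts):
    """Sum the per-edge counts to obtain the vertex degree."""
    yield vertex, sum(counts)

edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
degrees = map_reduce(edges, degree_mapper, degree_reducer)
# degrees == [(0, 2), (1, 2), (2, 3), (3, 1)]
```

An algorithm whose DAG has depth k would call `map_reduce` k times, each call rescanning its whole input, which is why the iteration count matters for the cost analysis.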
![Page 10: Large Scale Complex Network Analysis using th H b …...Large Scale Complex Network Analysis using th H b id C bi ti fthe Hybrid Combination of a MapReduceMapReduce Cluster and Cluster](https://reader034.vdocuments.net/reader034/viewer/2022042915/5f50689dc14b9479221d3bdd/html5/thumbnails/10.jpg)
Evaluating the efficiency of MapReduce Algorithms
• $W_{\mathrm{MapReduce}} = \sum_{i=1}^{k} O\big(n_i \cdot (1 + f_i \cdot (1 + r_i)) + p_r \cdot \mathrm{Sort}(n_i f_i / p_r)\big)$
- $k$: # MapReduce iterations.
- $n_i$: the input data size for the ith iteration.
- $f_i$: map output size / map input size.
- $r_i$: reduce output size / reduce input size.
- $p_r$: # reducers.
• Extracting a subgraph
- $k = 1$ and $f_i \ll 1$: $W_{\mathrm{MapReduce}}(n) \approx \theta(T^*(n))$, where $T^*(n)$ is the time complexity of the best sequential algorithm.
• Finding a single-pair shortest path
- $k = \lceil d/2 \rceil$, $f_i \approx 1$: $W_{\mathrm{MapReduce}}(n) > \theta(T^*(n))$
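As a rough illustration of the work formula, the sketch below evaluates it numerically with the constant hidden by the O(·) set to 1. Modeling Sort(m) as m·log2(m) is an assumption (the slide does not fix a sorting cost), and the function names and sample parameters are hypothetical.

```python
import math

def sort_cost(m):
    """Assumed comparison-sort cost m * log2(m); not specified on the slide."""
    return m * math.log2(m) if m > 1 else 0.0

def mapreduce_work(iterations, num_reducers):
    """Sum n_i*(1 + f_i*(1 + r_i)) + p_r*Sort(n_i*f_i/p_r) over all iterations.

    iterations: list of (n_i, f_i, r_i) tuples, one per MapReduce iteration.
    """
    total = 0.0
    for n_i, f_i, r_i in iterations:
        total += n_i * (1 + f_i * (1 + r_i))
        total += num_reducers * sort_cost(n_i * f_i / num_reducers)
    return total

# Subgraph extraction: one iteration, map output much smaller than input.
extraction = mapreduce_work([(1_000_000, 0.01, 1.0)], num_reducers=4)

# Shortest path: several iterations, map output about as large as input.
shortest_path = mapreduce_work([(1_000_000, 1.0, 1.0)] * 3, num_reducers=4)
```

With these sample values the extraction work stays close to a single scan of the input, while the shortest-path work is dominated by sorting nearly the whole input in every iteration.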
![Page 13: Large Scale Complex Network Analysis using th H b …...Large Scale Complex Network Analysis using th H b id C bi ti fthe Hybrid Combination of a MapReduceMapReduce Cluster and Cluster](https://reader034.vdocuments.net/reader034/viewer/2022042915/5f50689dc14b9479221d3bdd/html5/thumbnails/13.jpg)
Bisection Bandwidth Requirements for a MapReduce Cluster
• The shuffle phase, which requires inter-node communication, can be overlapped with the map phase.
• If $T_{\mathrm{map}} > T_{\mathrm{shuffle}}$, $T_{\mathrm{shuffle}}$ does not affect the overall execution time.
- $T_{\mathrm{map}}$ scales trivially.
- To scale $T_{\mathrm{shuffle}}$ linearly, bisection bandwidth also needs to scale in proportion to the number of nodes. Yet, the cost to linearly scale bisection bandwidth increases super-linearly.
- If $f \ll 1$, the sub-linear scaling of $T_{\mathrm{shuffle}}$ does not increase the overall execution time.
- If $f \approx 1$, it increases the overall execution time.
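The overlap argument can be illustrated with a toy model in which the combined map and shuffle time is the larger of the two phase times. Everything numeric here is an assumption for illustration: the per-node map rate, the link bandwidth, and in particular the exponent `alpha` that makes bisection bandwidth grow sub-linearly with the node count.

```python
def phase_times(n, f, nodes, map_rate=1.0, link_bw=1.0, alpha=0.5):
    """Return (t_map, t_shuffle) for input size n and map output ratio f.

    Assumed model: map throughput grows linearly with the node count,
    while bisection bandwidth grows only as nodes**alpha with alpha < 1.
    """
    t_map = n / (nodes * map_rate)
    t_shuffle = (f * n) / (link_bw * nodes ** alpha)
    return t_map, t_shuffle

def overlapped_time(n, f, nodes, **model):
    """Shuffle overlaps with map, so the slower phase sets the time."""
    return max(phase_times(n, f, nodes, **model))

small_f = overlapped_time(1e6, 0.01, 64)  # map-bound: shuffle is hidden
large_f = overlapped_time(1e6, 1.0, 64)   # shuffle-bound: bandwidth limits
```

In this model, a job with f << 1 keeps scaling with the node count because the map phase still dominates, while a job with f ≈ 1 is limited by the shuffle.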
![Page 14: Large Scale Complex Network Analysis using th H b …...Large Scale Complex Network Analysis using th H b id C bi ti fthe Hybrid Combination of a MapReduceMapReduce Cluster and Cluster](https://reader034.vdocuments.net/reader034/viewer/2022042915/5f50689dc14b9479221d3bdd/html5/thumbnails/14.jpg)
A single-pair shortest path
Source: http://academic.research.microsoft.com
![Page 17: Large Scale Complex Network Analysis using th H b …...Large Scale Complex Network Analysis using th H b id C bi ti fthe Hybrid Combination of a MapReduceMapReduce Cluster and Cluster](https://reader034.vdocuments.net/reader034/viewer/2022042915/5f50689dc14b9479221d3bdd/html5/thumbnails/17.jpg)
Disk I/O overhead
• Disk I/O overhead is unavoidable if the size of data overflows the main memory capacity.
• Raw data can be very large.
• Extracted graphs are much smaller.
- The Facebook network: 400 million users × 130 friends per user → less than 256 GB using the sparse representation.
[Figure: an example graph and its sparse representation]
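The "less than 256 GB" figure can be sanity-checked with a back-of-the-envelope calculation. The sketch assumes a compressed sparse row (CSR) style layout with 4-byte vertex IDs (400 million fits in 32 bits) and 8-byte row offsets; the slide only says "sparse representation", so these widths and the function name `csr_bytes` are assumptions.

```python
def csr_bytes(num_vertices, avg_degree, id_bytes=4, offset_bytes=8):
    """Estimate the size of a CSR-style adjacency structure in bytes."""
    num_entries = num_vertices * avg_degree          # one entry per (user, friend)
    adjacency = num_entries * id_bytes               # neighbor ID array
    offsets = (num_vertices + 1) * offset_bytes      # row start offsets
    return adjacency + offsets

size = csr_bytes(400_000_000, 130)
size_gb = size / 1e9   # about 211 GB under the assumed widths
```

Under these assumptions the structure takes roughly 211 GB, which is consistent with the bound on the slide.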
![Page 18: Large Scale Complex Network Analysis using th H b …...Large Scale Complex Network Analysis using th H b id C bi ti fthe Hybrid Combination of a MapReduceMapReduce Cluster and Cluster](https://reader034.vdocuments.net/reader034/viewer/2022042915/5f50689dc14b9479221d3bdd/html5/thumbnails/18.jpg)
A Highly Multithreaded System w/ the Shared Memory Programming Model
• Provide a random access mechanism.
• In SMPs, non-contiguous accesses are expensive.*
• Multithreading tolerates memory access latency.+
• There is a work optimal parallel algorithm to find a single-pair shortest path.
[Figure: Sun Fire T2000 (Niagara). Source: Sun Microsystems]
[Figure: Cray XMT. Source: Cray]
* D. R. Helman and J. Ja’Ja’, “Prefix computations on symmetric multiprocessors,” J. of Parallel and Distributed Computing, 61(2), 2001.
+ D. A. Bader, V. Kanade, and K. Madduri, “SWARM: A parallel programming framework for multi-core processors,” Workshop on Multithreaded Architectures and Applications, 2007.
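For an unweighted graph held in shared memory, a single-pair shortest path can be found with breadth-first search, which relies on the random access to adjacency lists that the slide highlights. The sketch below is a plain sequential BFS for illustration; the transcript does not spell out the parallel algorithm it refers to, and the dictionary-based graph and function name are assumptions.

```python
from collections import deque

def shortest_path(adj, source, target):
    """Return a shortest path as a list of vertices, or None if unreachable.

    adj maps each vertex to an iterable of its neighbors (unweighted graph).
    """
    parent = {source: None}          # also serves as the visited set
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if u == target:
            path = []
            while u is not None:     # walk parent links back to the source
                path.append(u)
                u = parent[u]
            return path[::-1]
        for v in adj.get(u, ()):
            if v not in parent:
                parent[v] = u
                queue.append(v)
    return None

adj = {1: [2, 3], 2: [1, 4], 3: [1, 4], 4: [2, 3, 5], 5: [4]}
path = shortest_path(adj, 1, 5)
```

Each vertex and edge is touched at most once, so the work is linear in the size of the explored region rather than in the size of the whole input on every step.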
![Page 19: Large Scale Complex Network Analysis using th H b …...Large Scale Complex Network Analysis using th H b id C bi ti fthe Hybrid Combination of a MapReduceMapReduce Cluster and Cluster](https://reader034.vdocuments.net/reader034/viewer/2022042915/5f50689dc14b9479221d3bdd/html5/thumbnails/19.jpg)
A single-pair shortest path
Source: http://academic.research.microsoft.com
![Page 22: Large Scale Complex Network Analysis using th H b …...Large Scale Complex Network Analysis using th H b id C bi ti fthe Hybrid Combination of a MapReduceMapReduce Cluster and Cluster](https://reader034.vdocuments.net/reader034/viewer/2022042915/5f50689dc14b9479221d3bdd/html5/thumbnails/22.jpg)
Low Latency High Bisection Bandwidth Interconnection Network
• Latency increases as the size of a system increases.
- A larger number of threads and additional parallelism are required as latency increases.
• Network cost to linearly scale bisection bandwidth increases super-linearly.
- But not too expensive for a small number of nodes.
• These limit the size of a system.
- Reveal limitations in extracting a subgraph from a very large graph.
![Page 23: Large Scale Complex Network Analysis using th H b …...Large Scale Complex Network Analysis using th H b id C bi ti fthe Hybrid Combination of a MapReduceMapReduce Cluster and Cluster](https://reader034.vdocuments.net/reader034/viewer/2022042915/5f50689dc14b9479221d3bdd/html5/thumbnails/23.jpg)
The Time Complexity of an Algorithm on the Hybrid System
• $T_{\mathrm{hybrid}} = \sum_{i=1}^{k} \min(T_{i,\mathrm{MapReduce}} + \Delta,\ T_{i,\mathrm{hmt}} + \Delta)$
- $k$: # steps.
- $T_{i,\mathrm{MapReduce}}$ and $T_{i,\mathrm{hmt}}$: time complexities of the ith step on a MapReduce cluster and a highly multithreaded system, respectively.
- $\Delta$: $n_i / BW_{\mathrm{inter}} \times \delta(i-1, i)$
- $n_i$: the input data size for the ith step.
- $BW_{\mathrm{inter}}$: the bandwidth between a MapReduce cluster and a highly multithreaded system.
- $\delta(i-1, i)$: 0 if the selected platforms for the (i-1)th and ith steps are the same; 1 otherwise.
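One way to evaluate this cost model is a small dynamic program that chooses a platform for every step and charges the transfer time whenever consecutive steps run on different platforms. This is an interpretation of the formula, not the authors' code: the function name is hypothetical, the first step is assumed to incur no transfer, and an infeasible step is modeled as infinite time. The sample numbers are the hours reported in the deck for the 10% subgraph.

```python
import math

def hybrid_time(step_times, transfer_times):
    """Minimum total time over all per-step platform choices.

    step_times[i]     = (time on MapReduce, time on the multithreaded system)
    transfer_times[i] = n_i / BW_inter, paid only if step i runs on a
                        different platform than step i - 1.
    """
    # Assumption: the first step starts wherever its data already is.
    best_mr, best_hmt = step_times[0]
    for (t_mr, t_hmt), delta in zip(step_times[1:], transfer_times[1:]):
        best_mr, best_hmt = (
            t_mr + min(best_mr, best_hmt + delta),
            t_hmt + min(best_hmt, best_mr + delta),
        )
    return min(best_mr, best_hmt)

steps = [
    (24.0, math.inf),    # subgraph extraction: input does not fit on hmt
    (103.0, 0.00073),    # shortest paths for 30 pairs
]
transfers = [0.0, 0.83]  # memory loading time before the second step

total_hours = hybrid_time(steps, transfers)  # about 24.83 hours
```

Running everything on the MapReduce cluster would cost 24 + 103 = 127 hours in the same model, so switching platforms pays off because the gap between the two step times far exceeds the transfer time.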
![Page 24: Large Scale Complex Network Analysis using th H b …...Large Scale Complex Network Analysis using th H b id C bi ti fthe Hybrid Combination of a MapReduceMapReduce Cluster and Cluster](https://reader034.vdocuments.net/reader034/viewer/2022042915/5f50689dc14b9479221d3bdd/html5/thumbnails/24.jpg)
Test Platforms
• A MapReduce cluster
- 4 nodes
- 4 dual core 2.4 GHz Opteron processors and 8 GB main memory per node.
- 96 disks (1 TB per disk).
- Source: http://hadoop.apache.org/
• A highly multithreaded system
- Sun Fire T2000 (Niagara)
- A single socket UltraSparc T2 1.2 GHz processor (8 core, 64 threads).
- 32 GB main memory.
- 2 disks (145 GB per disk)
- Source: Sun Microsystems
• A hybrid system of the two
![Page 25: Large Scale Complex Network Analysis using th H b …...Large Scale Complex Network Analysis using th H b id C bi ti fthe Hybrid Combination of a MapReduceMapReduce Cluster and Cluster](https://reader034.vdocuments.net/reader034/viewer/2022042915/5f50689dc14b9479221d3bdd/html5/thumbnails/25.jpg)
A subgraph that covers 10% of the input graph
[Figure: execution time (hours) vs. number of pairs (0 to 30) for the MapReduce cluster and the hybrid system]

| Execution time (hours) | MapReduce | Hybrid |
|---|---|---|
| Subgraph extraction | 24 | 24 |
| Memory loading | - | 0.83 |
| Finding a shortest path (for 30 pairs) | 103 | 0.00073 |

Once the subgraph is loaded into the memory, the hybrid system analyzes the subgraph five orders of magnitude faster than the MapReduce cluster (103 hours vs 2.6 seconds).
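The "five orders of magnitude" claim follows directly from the two shortest-path entries reported for the 10% subgraph; the short check below only redoes that arithmetic, with variable names chosen for illustration.

```python
import math

mapreduce_hours = 103        # shortest path for 30 pairs on MapReduce
hybrid_hours = 0.00073       # same queries on the hybrid system

hybrid_seconds = hybrid_hours * 3600          # about 2.6 seconds
speedup = mapreduce_hours / hybrid_hours      # about 1.4e5
orders_of_magnitude = math.log10(speedup)     # a little over 5
```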
![Page 26: Large Scale Complex Network Analysis using th H b …...Large Scale Complex Network Analysis using th H b id C bi ti fthe Hybrid Combination of a MapReduceMapReduce Cluster and Cluster](https://reader034.vdocuments.net/reader034/viewer/2022042915/5f50689dc14b9479221d3bdd/html5/thumbnails/26.jpg)
Subgraphs that cover 5% (left) and 2% (right) of the input graph
[Figure: two panels, execution time (hours) vs. number of pairs (0 to 30) for the MapReduce cluster and the hybrid system]

| Execution time (hours) | 5%: MapReduce | 5%: Hybrid | 2%: MapReduce | 2%: Hybrid |
|---|---|---|---|---|
| Subgraph extraction | 22 | 22 | 21 | 21 |
| Memory loading | - | 0.42 | - | 0.038 |
| Finding a shortest path (for 30 pairs) | 61 | 0.00047 | 5.2 | 0.00019 |
![Page 27: Large Scale Complex Network Analysis using th H b …...Large Scale Complex Network Analysis using th H b id C bi ti fthe Hybrid Combination of a MapReduceMapReduce Cluster and Cluster](https://reader034.vdocuments.net/reader034/viewer/2022042915/5f50689dc14b9479221d3bdd/html5/thumbnails/27.jpg)
Conclusions
• We identified the key computational challenges in large-scale complex network analysis problems.
• Our hybrid system effectively addresses the challenges by using the right tool in the right place in a synergistic way.
• Our work showcases a holistic approach to solving real-world challenges.
![Page 28: Large Scale Complex Network Analysis using th H b …...Large Scale Complex Network Analysis using th H b id C bi ti fthe Hybrid Combination of a MapReduceMapReduce Cluster and Cluster](https://reader034.vdocuments.net/reader034/viewer/2022042915/5f50689dc14b9479221d3bdd/html5/thumbnails/28.jpg)
Acknowledgment of Support