graphalytics: a big data benchmark for graph-processing platforms › fileadmin › user_upload ›...
TRANSCRIPT
Graphalytics:A Big Data Benchmark for Graph-Processing Platforms
Mihai Capotă, Tim Hegeman, Alexandru Iosup,Arnau Prat-Pérez, Orri Erling, Peter Boncz
Delft University of TechnologyUniversitat Politècnica de CatalunyaOpenLink SoftwareCWI
2
The data deluge: large-scale graphs
Tens of billions of edges
3
Graph processing
The data deluge: large-scale graphs
Tens of billions of edges
4
Platform diversity
GraphX
5
Yet another benchmark?
6
Yet another benchmark?
● Lack of benchmarks for genericgraph processing platforms
7
Yet another benchmark?
● Lack of benchmarks for genericgraph processing platforms
● Graph500– BFS
– Kroneker graph
8
Yet another benchmark?
● Lack of benchmarks for genericgraph processing platforms
● Graph500– BFS
– Kroneker graph
● Several academic studies
– Specific to graph or RDF databases– Ad hoc setup, difficult to extend
9
P-A-D triangle
AAlgorithm
PPlatform
DDataset
10
P-A-D triangle
AAlgorithm
PPlatform
DDataset
Performance is enabled,portability is disabled
11
P-A-D triangle
AAlgorithm
PPlatform
DDataset
Performance is enabled,portability is disabled
Algorithms for differentdata types and graphs
12
P-A-D triangle
AAlgorithm
PPlatform
DDataset
Performance is enabled,portability is disabled
Algorithms for differentdata types and graphs
No systematic findings yet
13
P-A-D triangle
AAlgorithm
PPlatform
DDataset
Performance is enabled,portability is disabled
Algorithms for differentdata types and graphs
No systematic findings yet
Deployment
14
Graphalytics
15
Graphalytics
● The first comprehensive benchmark forbig data graph-processing platforms
16
Graphalytics
● The first comprehensive benchmark forbig data graph-processing platforms– Advanced benchmark harness
17
Graphalytics
● The first comprehensive benchmark forbig data graph-processing platforms– Advanced benchmark harness
– Choke-point analysis
18
Graphalytics
● The first comprehensive benchmark forbig data graph-processing platforms– Advanced benchmark harness
– Choke-point analysis
– Realistic graph generator
19
Graphalytics
● The first comprehensive benchmark forbig data graph-processing platforms– Advanced benchmark harness
– Choke-point analysis
– Realistic graph generator
20
Graphalytics
● The first comprehensive benchmark forbig data graph-processing platforms– Advanced benchmark harness
– Choke-point analysis
– Realistic graph generator
● Co-sponsored by Oracle
21
Choke points
22
Choke points
● Technological challenges for platforms
Chokepoint
Big datasystem
Timelyresults
23
Choke points
● Technological challenges for platforms● Identified by experts
Chokepoint
Big datasystem
Timelyresults
24
Choke points
● Technological challenges for platforms● Identified by experts● Real-world scenarios are enhanced to
stress choke points– Prevent tunnel vision
– Advance state of the art Chokepoint
Big datasystem
Timelyresults
25
Choke points
● Technological challenges for platforms● Identified by experts● Real-world scenarios are enhanced to
stress choke points– Prevent tunnel vision
– Advance state of the art
Erling et al., The LDBC Social Network Benchmark:Interactive Workload , SIGMOD 2015
Chokepoint
Big datasystem
Timelyresults
26
Examples of choke points
27
Examples of choke points
● Network utilization
28
Examples of choke points
● Network utilization● Memory footprint
29
Examples of choke points
● Network utilization● Memory footprint● Access locality
30
Examples of choke points
● Network utilization● Memory footprint● Access locality● Skewed execution
31
Examples of choke points
● Network utilization● Memory footprint● Access locality● Skewed execution● CPU?
32
Examples of choke points
● Network utilization● Memory footprint● Access locality● Skewed execution● CPU?
Ousterhout et al., “Making Sense ofPerformance in Data Analytics Frameworks”, NSDI 2015
33
Examples of choke points
● Network utilization● Memory footprint● Access locality● Skewed execution● CPU?● Others?
Ousterhout et al., “Making Sense ofPerformance in Data Analytics Frameworks”, NSDI 2015
34
Realistic graph generator
35
Realistic graph generator
● LDBC Datagen– Synthetic social network similar to Facebook
36
Realistic graph generator
● LDBC Datagen– Synthetic social network similar to Facebook
● Graphalytics enhancements– Multiple degree distributions
● Zeta and geometric implemented
37
Realistic graph generator
● LDBC Datagen– Synthetic social network similar to Facebook
● Graphalytics enhancements– Multiple degree distributions
● Zeta and geometric implemented
– Other structural characteristics● Clustering coefficient● Assortativity
38
Realistic graph generator
● LDBC Datagen– Synthetic social network similar to Facebook
● Graphalytics enhancements– Multiple degree distributions
● Zeta and geometric implemented
– Other structural characteristics● Clustering coefficient● Assortativity
Erling et al., The LDBC Social Network Benchmark:Interactive Workload , SIGMOD 2015
39
Advanced benchmark harness
BenchmarkCore
ReportGenerator
Platform-specificalgorithm
implementation
OutputValidator
SystemMonitor
DatasetGenerator
Datasets
UserResults
Graph processingplatform
Configuration
40
Advanced benchmark harness
BenchmarkCore
ReportGenerator
Platform-specificalgorithm
implementation
OutputValidator
SystemMonitor
DatasetGenerator
Datasets
UserResults
Graph processingplatform
Configuration
41
Advanced benchmark harness
BenchmarkCore
ReportGenerator
Platform-specificalgorithm
implementation
OutputValidator
SystemMonitor
DatasetGenerator
Datasets
UserResults
Graph processingplatform
Configuration
42
Advanced benchmark harness
BenchmarkCore
ReportGenerator
Platform-specificalgorithm
implementation
OutputValidator
SystemMonitor
DatasetGenerator
Datasets
UserResults
Graph processingplatform
Configuration
43
Advanced benchmark harness
BenchmarkCore
ReportGenerator
Platform-specificalgorithm
implementation
OutputValidator
SystemMonitor
DatasetGenerator
Datasets
UserResults
Graph processingplatform
Configuration
44
Advanced benchmark harness
BenchmarkCore
ReportGenerator
Platform-specificalgorithm
implementation
OutputValidator
SystemMonitor
DatasetGenerator
Datasets
UserResults
Graph processingplatform
Configuration
45
Advanced benchmark harness
BenchmarkCore
ReportGenerator
Platform-specificalgorithm
implementation
OutputValidator
SystemMonitor
DatasetGenerator
Datasets
UserResults
Graph processingplatform
Configuration
46
Advanced benchmark harness
BenchmarkCore
ReportGenerator
Platform-specificalgorithm
implementation
OutputValidator
SystemMonitor
DatasetGenerator
Datasets
UserResults
Graph processingplatform
Configuration
47
Advanced benchmark harness
BenchmarkCore
ReportGenerator
Platform-specificalgorithm
implementation
OutputValidator
SystemMonitor
DatasetGenerator
Datasets
UserResults
Graph processingplatform
Configuration
48
Advanced benchmark harness
BenchmarkCore
ReportGenerator
Platform-specificalgorithm
implementation
OutputValidator
SystemMonitor
DatasetGenerator
Datasets
UserResults
Graph processingplatform
Configuration
49
Advanced benchmark harness
BenchmarkCore
ReportGenerator
Platform-specificalgorithm
implementation
OutputValidator
SystemMonitor
DatasetGenerator
Datasets
UserResults
Graph processingplatform
Configuration
50
Advanced benchmark harness
BenchmarkCore
ReportGenerator
Platform-specificalgorithm
implementation
OutputValidator
SystemMonitor
DatasetGenerator
Datasets
UserResults
Graph processingplatform
Configuration
52
Supported algorithms
ID Algorithm Class Use (%)BFS Breadth-first search Traversal 46STATS Local clustering
coefficientStatistics 16
CONN Weakly connected components
Connected components
13
CD Label propagation Community detection
5
EVO Forest fire evolution Evolution 4
Guo et al., How Well Do Graph-Processing Platforms Perform?An Empirical Performance Evaluation and Analysis, IPDPS 2014
53
Experimental setup
● DAS-4 cluster– Typical big data setup
– 11 nodes, 24 GiB RAM, 2 x 8-core Xeon E5620
– 1 Gbit/s Ethernet
● Single machine– HPC-like setup
– 192 GiB RAM, 2 x 8-core Xeon E5-2450 v2
54
Runtime
55
Runtime
56
Runtime
57
Runtime
58
Runtime
59
Runtime
2 orders of magnitudedifference
60
Runtime
Neo4j 2x faster than MR
61
Runtime
MR 2x faster than Neo4j
62
Runtime
Neo4j fails
63
Runtime
64
Edge-normalized performance
65
Edge-normalized performance
Number of edges in graph / runtime
66
Edge-normalized performance
20 x differencebecause of dataset structure
67
Edge-normalized performance
Also for other platforms
68
Edge-normalized performance
69
See paper for more results
70
Datagen scalability
71
Conclusion
72
Conclusion
● Use Graphalytics to:– Compare
– Tune
– Re-design
BenchmarkCore
ReportGenerator
Platform-specificalgorithm
implementation
OutputValidator
SystemMonitor
DatasetGenerator
Datasets
UserResults
Graph processingplatform
Configuration
73
Conclusion
● Use Graphalytics to:– Compare
– Tune
– Re-design
● Open source (Apache License 2.0)– Contribute an implementation for your platform
BenchmarkCore
ReportGenerator
Platform-specificalgorithm
implementation
OutputValidator
SystemMonitor
DatasetGenerator
Datasets
UserResults
Graph processingplatform
Configuration
74
Conclusion
● Use Graphalytics to:– Compare
– Tune
– Re-design
● Open source (Apache License 2.0)– Contribute an implementation for your platform
● SPEC RG standardization coming
BenchmarkCore
ReportGenerator
Platform-specificalgorithm
implementation
OutputValidator
SystemMonitor
DatasetGenerator
Datasets
UserResults
Graph processingplatform
Configuration
75
Performance evaluation – Granular
76
Deploying Granular
77
Granular results