DATO VS. SPARK GRAPHX
KEIRA ZHOU
OCT, 2015
Details: https://github.com/keiraqz/dato-vs-graphx
SETTINGS
• 1 master node and 3 worker nodes on AWS
• m4.large instances, each with 8 GB of RAM and 2 cores
DATO
• A graph-based, asynchronous, high-performance, distributed computation framework written in C++
• 30-day free trial, then a service fee
• Install GraphLab Create on the local machine and Dato Distributed on the cluster
SPARK GRAPHX
• Comes with Spark
import org.apache.spark._
import org.apache.spark.graphx._
EXPERIMENTS
• Graph algorithms
  • Triangle counting
  • PageRank
  • Connected components
• Datasets: Stanford Large Network Dataset Collection (SNAP)
  • Facebook: Nodes: 4039 | Edges: 88234 | Number of triangles: 1612010
  • YouTube: Nodes: 1134890 | Edges: 2987624 | Number of triangles: 3056386
  • Pokec: Nodes: 1632803 | Edges: 30622564 | Number of triangles: 32557458
  • LiveJournal: Nodes: 3997962 | Edges: 34681189 | Number of triangles: 177820130
EXPERIMENTS (CONT’D)
• Default settings
  • Dato: GRAPHLAB_FILEIO_MAXIMUM_CACHE_CAPACITY = 4G
  • GraphX: started with executor memory = 1G, later changed to 2G
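The GraphX executor-memory setting can be applied when the Spark configuration is built. The sketch below is one way to do it, not necessarily how the experiments were launched; the object name, helper name, and application name are hypothetical, while `spark.executor.memory` is the standard Spark property.

```scala
import org.apache.spark.SparkConf

object ExecutorMemory {
  // Build a SparkConf whose executors get the given heap size, e.g. "1g" or "2g".
  def confWithExecutorMemory(memory: String): SparkConf =
    new SparkConf()
      .setAppName("dato-vs-graphx") // hypothetical application name
      .set("spark.executor.memory", memory)
}
```

The same value can also be passed on the command line with `spark-submit --executor-memory 2G`.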
RESULTS
• Triangle counting: both Dato and GraphX (when it finishes the job) return the correct answer as listed on the SNAP website.
• For the Pokec and LiveJournal data, GraphX has trouble finishing the computation.
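The correctness check against the SNAP triangle counts can be sketched on the GraphX side as follows. This is a minimal sketch, not the code used in the experiments: the object name, helper name, and file path are hypothetical, and it assumes the job is launched through `spark-submit`, which supplies the master URL.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Graph, GraphLoader, PartitionStrategy}

object TriangleCheck {
  // triangleCount() reports, per vertex, the number of triangles through it.
  // Every triangle is therefore counted three times, once per corner.
  def totalTriangles(graph: Graph[Int, Int]): Long =
    graph.triangleCount().vertices.map(_._2.toLong).fold(0L)(_ + _) / 3

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("TriangleCheck"))
    // Hypothetical local path to a SNAP edge list.
    // The third argument (canonicalOrientation = true) stores each edge as
    // (smaller id, larger id); older GraphX releases also expect the graph
    // to be partitioned with partitionBy before triangleCount() is called.
    val graph = GraphLoader
      .edgeListFile(sc, "data/facebook_combined.txt", true)
      .partitionBy(PartitionStrategy.RandomVertexCut)
    println(s"total triangles = ${totalTriangles(graph)}")
    sc.stop()
  }
}
```

The printed total is the figure to compare with the "Number of triangles" value listed for the dataset.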
TAKE-AWAY FOR GRAPHX
• I observed that certain stages within the job kept failing
• Each task in a Spark stage operates on one partition of the RDD at a time (and loads the data in that partition into memory)
• Potential solutions
  • Increase the executor memory
  • Increase the number of partitions of the RDD so that each task processes a smaller amount of data
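Increasing the partition count can be done when the edge list is loaded. The sketch below shows one possible way; the object and method names are hypothetical, and the partition count to use would have to be found by experiment. The arguments to `edgeListFile` are passed positionally because the name of the partition-count parameter differs between GraphX releases.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.graphx.{Graph, GraphLoader, PartitionStrategy}

object MorePartitions {
  // Load an edge list split into roughly `numParts` edge partitions, so that
  // each task holds a smaller slice of the graph in memory.
  def load(sc: SparkContext, path: String, numParts: Int): Graph[Int, Int] =
    GraphLoader
      .edgeListFile(sc, path, true, numParts) // canonical orientation, partition count
      .partitionBy(PartitionStrategy.EdgePartition2D)
}
```

`EdgePartition2D` is one of the built-in partition strategies; it is used here only as an example, and `RandomVertexCut` would work equally well for the sketch.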
RESULTS (CONT’D)
• PageRank: the threshold for PageRank is set to 0.001
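On the GraphX side, a PageRank run with this threshold might look like the sketch below. It assumes the threshold corresponds to the `tol` argument of GraphX's `pageRank`, which the slides do not state explicitly; the object and helper names are hypothetical.

```scala
import org.apache.spark.graphx.{Graph, VertexId}

object PageRankSketch {
  // Run GraphX's dynamic PageRank with convergence tolerance `tol`,
  // then return the k highest-ranked vertices with their ranks.
  def topRanked(graph: Graph[Int, Int], tol: Double, k: Int): Array[(VertexId, Double)] =
    graph
      .pageRank(tol)
      .vertices
      .top(k)(Ordering.by[(VertexId, Double), Double](_._2))
}
```

Calling `PageRankSketch.topRanked(graph, 0.001, 10)` would return the ten highest-ranked vertices under that assumption.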
RESULTS (CONT’D)
• Connected components
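For reference, the GraphX connected-components call can be sketched as follows. The object and helper names are hypothetical, and counting the components is only one way to summarize the result.

```scala
import org.apache.spark.graphx.Graph

object ComponentsSketch {
  // connectedComponents() labels every vertex with the smallest vertex id
  // in its component, so the number of distinct labels is the number of
  // connected components.
  def countComponents(graph: Graph[Int, Int]): Long =
    graph.connectedComponents().vertices.map(_._2).distinct().count()
}
```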
CONCLUSIONS
• Both tools were set up quickly, without fine-tuning runtime parameters, but:
  • Dato has a clear advantage over GraphX in execution time when processing large-scale graph data
  • However, GraphX is free, while Dato charges a service fee after the free trial
• The goal of the GraphX project is to unify graph-parallel and data-parallel computation in one system with a single composable API
• Further experiments could compare overall performance on a task that combines graph algorithms with other data-parallel computation
MORE DETAILS
• https://github.com/keiraqz/dato-vs-graphx
REFERENCES
• Dato: https://dato.com/
• Spark GraphX: https://spark.apache.org/docs/1.1.0/graphx-programming-guide.html
• Stanford Large Network Dataset Collection (SNAP): https://snap.stanford.edu/data/