Accumulo Summit 2014: Accumulo-Backed TinkerPop Implementation
DESCRIPTION
As graph processing grows as a field, standards will eventually emerge. The TinkerPop graph processing stack is one such potential standard. The TinkerPop stack contains an algorithm engine, a scripting engine, and a RESTful service for accessing graphs. At the base of TinkerPop is Blueprints, an interface for accessing and creating property graphs. Blueprints has already been implemented with several different backing technologies (e.g., relational databases, RDF triple stores, graph databases) and implementations (e.g., JDBC-based, OpenRDF Sail, and Neo4j). This presentation will discuss our implementation of the Blueprints API backed by Accumulo to enable storage of arbitrarily large, distributed graphs. Our implementation falls between two extremes: distributed graph processing systems that require the entire graph to fit within the available RAM of the cluster, and batch-oriented systems that incur significant disk I/O costs during execution and generally handle iterative algorithms poorly. We will discuss the benefits of supporting the TinkerPop API and the design and performance trade-offs we faced when developing the Accumulo backend and integrating with the Hadoop MapReduce framework. We aim to merge the advantages of the TinkerPop software ecosystem with the scalability and fault tolerance of Accumulo and provide a robust, turn-key solution for certain classes of large-scale, graph-related challenges.
TRANSCRIPT
2
Agenda
Introduction to TinkerPop
Detailed Implementation
Obstacles
Overcoming Obstacles
Map Reduce Integration
Performance
3
Background
Associate Professional at The Johns Hopkins Applied Physics Laboratory
Bachelor of Science in Computer Science with a minor in Mathematics from the University of Delaware
Pursuing a Master's in Computer Science with a focus on Distributed Systems at the Whiting School of Engineering
4
TinkerPop Blueprints
Foundational technology for a complete graph stack
Extensive test suite to ensure implementations follow all the rules required.
Only a simple API: getVertex, getEdge, setProperty, getProperty
Multiple Interfaces with incremental features
5
TinkerPop Blueprints Graph API
6
Graph Creation
Configuration cfg = new AccumuloGraphConfiguration()
    .instance("accumulo").user("user").zkHosts("zk1")
    .password("password".getBytes()).name("myGraph");
Graph graph = GraphFactory.open(cfg);
Vertex v1 = graph.addVertex("1");
v1.setProperty("name", "Alice");
Vertex v2 = graph.addVertex("2");
v2.setProperty("name", "Bob");
Edge e1 = graph.addEdge("E1", v1, v2, "knows");
e1.setProperty("since", new Date());
7
Trade off Spectrum
Consistency
Performance
8
Accumulo Implementation
Base naïve implementation passes all required TinkerPop tests
Far right of the spectrum
As consistent as you can get
Table structure:
Edge and Vertex
Edge and Vertex Index table
Metadata table for indexes
9
Table Structure
Vertex
Row ID | Column Family | Column Qualifier | Value
VertexID | Label Flag | Exists Flag | [empty]
VertexID | INVERTEX | OutVertexID_EdgeID | Edge Label
VertexID | OUTVERTEX | InVertexID_EdgeID | Edge Label
VertexID | Property Key | [empty] | Serialized Value
Edge
Row ID | Column Family | Column Qualifier | Value
EdgeID | Label Flag | InVertexID_OutVertexID | Edge Label
EdgeID | Property Key | [empty] | Serialized Value
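The layout above can be made concrete with a small sketch of the entries one edge might produce. This is an illustration under assumptions, not the presenters' code: the class and method names are hypothetical, "Label Flag" is used as a literal placeholder for whatever marker the real tables store, and the reading that the out-vertex row carries the OUTVERTEX entry (and the in-vertex row the INVERTEX entry) is an interpretation of the table.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the entries written for one edge; not the real implementation. */
final class TableLayoutSketch {

    /** Stand-in for one Accumulo entry: row, column family, column qualifier, value. */
    static final class GraphEntry {
        final String row;
        final String family;
        final String qualifier;
        final String value;

        GraphEntry(String row, String family, String qualifier, String value) {
            this.row = row;
            this.family = family;
            this.qualifier = qualifier;
            this.value = value;
        }

        @Override
        public String toString() {
            return row + " | " + family + " | " + qualifier + " | " + value;
        }
    }

    /** Vertex-table entries for an edge outVertex -> inVertex (assumed direction convention). */
    static List<GraphEntry> vertexTableEntries(String edgeId, String outVertexId,
                                               String inVertexId, String label) {
        List<GraphEntry> entries = new ArrayList<>();
        // Row of the out-vertex: records the vertex on the other end plus the edge ID.
        entries.add(new GraphEntry(outVertexId, "OUTVERTEX", inVertexId + "_" + edgeId, label));
        // Row of the in-vertex: mirror entry so the edge can be found from either side.
        entries.add(new GraphEntry(inVertexId, "INVERTEX", outVertexId + "_" + edgeId, label));
        return entries;
    }

    /** Edge-table entry holding both endpoints; "Label Flag" is a placeholder family name. */
    static GraphEntry edgeTableEntry(String edgeId, String outVertexId,
                                     String inVertexId, String label) {
        return new GraphEntry(edgeId, "Label Flag", inVertexId + "_" + outVertexId, label);
    }

    public static void main(String[] args) {
        for (GraphEntry e : vertexTableEntries("E1", "1", "2", "knows")) {
            System.out.println(e);
        }
        System.out.println(edgeTableEntry("E1", "1", "2", "knows"));
    }
}
```

Storing the edge on both vertex rows costs extra writes at ingest, but it lets a traversal read all of a vertex's neighbours from a single row.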
10
Graph Access and Index Creation/Use
// Access before Index
for (Vertex v: graph.getVertices()) {
    String name = v.getProperty("name");
}
((KeyIndexableGraph)graph)
    .createKeyIndex("name", Vertex.class);
// Access after Index
for (Vertex v: graph.getVertices()) {
    String name = v.getProperty("name");
}
11
Table Structure - Continued
Indexes
Row | Column Family | Column Qualifier | Value
Serialized Value | Property Key | VertexID | [empty]
Metadata
Row | Column Family | Column Qualifier | Value
Index Name | Index Class | [empty] | [empty]
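Because Accumulo keeps keys sorted, an index keyed by the serialized value turns a lookup into a short range scan rather than a full table scan. The sketch below models that idea with an in-memory sorted set. It is only an approximation under assumptions: the class name, the null-byte separator, and the string encoding are all hypothetical, and real Accumulo keys are structured rather than concatenated strings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** Hypothetical in-memory model of the index table: value, property key, vertex ID. */
final class IndexTableSketch {
    // Separator chosen for illustration; assumes it never occurs in values or keys.
    private static final char SEP = '\u0000';

    // Sorted like an Accumulo table: row (value), then family (key), then qualifier (ID).
    private final TreeSet<String> entries = new TreeSet<>();

    void index(String propertyKey, String serializedValue, String vertexId) {
        entries.add(serializedValue + SEP + propertyKey + SEP + vertexId);
    }

    /** Range scan over the entries sharing the given value and property key. */
    List<String> lookup(String propertyKey, String serializedValue) {
        String prefix = serializedValue + SEP + propertyKey + SEP;
        List<String> vertexIds = new ArrayList<>();
        for (String entry : entries.tailSet(prefix, true)) {
            if (!entry.startsWith(prefix)) {
                break; // left the matching range, so stop scanning
            }
            vertexIds.add(entry.substring(prefix.length()));
        }
        return vertexIds;
    }

    public static void main(String[] args) {
        IndexTableSketch idx = new IndexTableSketch();
        idx.index("name", "Alice", "1");
        idx.index("name", "Bob", "2");
        System.out.println(idx.lookup("name", "Alice"));
    }
}
```

The scan touches only the entries for the requested value, which is why a key index helps lookups by property value while leaving whole-graph iteration unchanged.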
12
Obstacles
Existence checking is expensive
Required for TinkerPop test suite
Writing every graph object out is expensive
Building indexes post ingest is expensive
Blocking, full table scan
Consistency is expensive
13
Overcoming Obstacles
Give more power to users who know they are using an Accumulo Graph
Ingest improvements:
Give option to disable existence checks
Allow manual batching
Specialized ingest path
Traversal improvements:
Attribute preloading
Property caching
Element caching
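The slides do not say how property caching is implemented, so the following is only a sketch of the general idea: keep recently read property values in memory so repeated reads during a traversal skip the round trip to Accumulo. The least-recently-used eviction policy, the class name, and the function standing in for the backend read are all assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical LRU property cache in front of a slow backend read. */
final class PropertyCacheSketch {
    private final Function<String, String> backend; // stand-in for an Accumulo scan
    private final Map<String, String> cache;
    int backendReads = 0; // exposed so the effect of caching can be observed

    PropertyCacheSketch(final int maxEntries, Function<String, String> backend) {
        this.backend = backend;
        // accessOrder = true makes iteration order least-recently-used first.
        this.cache = new LinkedHashMap<String, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                return size() > maxEntries;
            }
        };
    }

    String getProperty(String elementId, String key) {
        String cacheKey = elementId + '\u0000' + key;
        String value = cache.get(cacheKey);
        if (value == null) {
            backendReads++;
            value = backend.apply(cacheKey);
            if (value != null) {
                cache.put(cacheKey, value);
            }
        }
        return value;
    }

    public static void main(String[] args) {
        PropertyCacheSketch c = new PropertyCacheSketch(100, k -> "Alice");
        c.getProperty("1", "name");
        c.getProperty("1", "name");
        System.out.println("backend reads: " + c.backendReads);
    }
}
```

A cache like this trades consistency for speed: a cached value can be stale if another client updates the property, which is the same trade-off spectrum the talk describes.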
14
Simple Bulk Ingest
// Will migrate to BatchGraph
AccumuloBulkIngester g = new AccumuloBulkIngester(cfg);
PropertyBuilder v1 = g.addVertex("ID1");
PropertyBuilder v2 = g.addVertex("ID2");
PropertyBuilder edge = g.addEdge("ID1", "ID2", "knows");
v1.add("name", "alice");
v2.add("name", "bob");
edge.add("since", new Date());
15
Map Reduce Integration
In your Tool
j.setInputFormatClass(VertexInputFormat.class);
VertexInputFormat.setAccumuloGraphConfiguration(
    new AccumuloGraphConfiguration()
        .instance("accumulo").zkHosts("zk1").user("root")
        .password("secret".getBytes()).name("myGraph"));
In your Mapper
public void map(Text k, Vertex v, Context c){
    System.out.println(v.getId().toString());
}
16
Results
2 Nodes | 4 Nodes | 8 Nodes
20 Hours 9 Minutes | 13 Hours 47 Minutes | 7 Hours 4 Minutes
Cluster Stats
8 Node Cluster
64 GB RAM
Quad-Core Xeon Processor 2.50GHz 10MB
2x 4 TB 6.0Gb/s 7200 RPM Drives
1 Gb/s Networking
Accumulo 1.5.1, Hadoop 2.0.0 – MR1
Stanford SNAP Friendster Graph
65,608,366 Vertices
1,806,067,135 Edges
2 Nodes | 4 Nodes | 8 Nodes
55 Minutes | 29 Minutes | 15 Minutes
Vertex Iteration
Ingest
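The timings above can be turned into speedup and scaling-efficiency figures with a little arithmetic. The sketch below uses only the numbers on the slide; the helper names are hypothetical, and it refers to the two rows by their time scale rather than by workload, since the transcript does not make clear which label belongs to which row.

```java
/** Speedup arithmetic for the timings on the Results slide; helper names are illustrative. */
final class ScalingSketch {

    static int minutes(int hours, int mins) {
        return hours * 60 + mins;
    }

    /** How many times faster the larger cluster finished than the baseline. */
    static double speedup(int baselineMinutes, int scaledMinutes) {
        return (double) baselineMinutes / scaledMinutes;
    }

    /** Fraction of ideal linear scaling achieved (1.0 would be perfectly linear). */
    static double efficiency(int baselineNodes, int baselineMinutes,
                             int scaledNodes, int scaledMinutes) {
        double idealSpeedup = (double) scaledNodes / baselineNodes;
        return speedup(baselineMinutes, scaledMinutes) / idealSpeedup;
    }

    public static void main(String[] args) {
        int hoursRow2 = minutes(20, 9);
        int hoursRow8 = minutes(7, 4);
        System.out.printf("hours-scale row, 2 -> 8 nodes: speedup %.2f, efficiency %.2f%n",
                speedup(hoursRow2, hoursRow8), efficiency(2, hoursRow2, 8, hoursRow8));
        System.out.printf("minutes-scale row, 2 -> 8 nodes: speedup %.2f, efficiency %.2f%n",
                speedup(55, 15), efficiency(2, 55, 8, 15));
    }
}
```

Going from 2 to 8 nodes, the hours-scale workload speeds up by roughly 2.85x (1209 minutes down to 424) and the minutes-scale workload by roughly 3.67x (55 minutes down to 15), against an ideal of 4x.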
17
Conclusion
Simple, easy-to-read graph API
Gives developers a lot of tuning points for their implementations
Performance is “good enough”
Not meant for high performance, specialized solutions
Quick to develop new ideas and investigate your graph
Easy to integrate and already integrated
Low effort to get REST access to your graph
18
Future
Polish and open source
Iterators
Locality Groups
Addressing Security
Graph Query
Extending MapReduce Integration
Upgrading to Accumulo 1.6, TinkerPop 2.5
Conditional Mutations Table namespaces
19
Resources
http://www.tinkerpop.com/
http://snap.stanford.edu/data/com-Friendster.html