Accumulo Summit 2014: Accumulo-Backed TinkerPop Implementation

TinkerPop Backed By Accumulo 6/12/2014 Ryan Webb Associate Professional [email protected]

Post on 26-Jan-2015






As graph processing grows as a field, standards will eventually emerge. The TinkerPop graph processing stack is one such potential standard. The stack contains an algorithm engine, a scripting engine, and a RESTful service for accessing graphs. At the base of TinkerPop is Blueprints, an interface for accessing and creating property graphs. Blueprints has already been implemented on several backing technologies (e.g., relational databases, RDF triple stores, graph databases) through implementations such as JDBC-based stores, OpenRDF Sail, and Neo4j. This presentation discusses our implementation of the Blueprints API backed by Accumulo, which enables storage of arbitrarily large, distributed graphs. Our implementation falls between two extremes: distributed graph processing systems that require the entire graph to fit within the available RAM of the cluster, and batch-oriented systems that incur significant disk I/O costs during execution and generally handle iterative algorithms poorly. We will discuss the benefits of supporting the TinkerPop API, and the design and performance trade-offs we faced when developing the Accumulo backend and integrating with the Hadoop MapReduce framework. We aim to merge the advantages of the TinkerPop software ecosystem with the scalability and fault tolerance of Accumulo, and to provide a robust, turn-key solution for certain classes of large-scale, graph-related challenges.


1. TinkerPop Backed By Accumulo
  6/12/2014
  Ryan Webb, Associate Professional
  [email protected]

2. Agenda
  • Introduction to TinkerPop
  • Detailed Implementation
  • Obstacles
  • Overcoming Obstacles
  • Map Reduce Integration
  • Performance

3. Background
  • Associate Professional at The Johns Hopkins Applied Physics Laboratory
  • Bachelor of Science in Computer Science with a minor in Mathematics from the University of Delaware
  • Pursuing a Master's in Computer Science with a focus on Distributed Systems at the Whiting School of Engineering

4. TinkerPop Blueprints
  • Foundational technology for a complete graph stack
  • Extensive test suite to ensure implementations follow all the required rules
  • Only a simple API
      • getVertex
      • getEdge
      • setProperty
      • getProperty
  • Multiple interfaces with incremental features

5. TinkerPop Blueprints Graph API

6. Graph Creation

    Configuration cfg = new AccumuloGraphConfiguration()
        .instance("accumulo").user("user").zkHosts("zk1")
        .password("password".getBytes()).name("myGraph");
    Graph graph =;
    Vertex v1 = graph.addVertex("1");
    v1.setProperty("name", "Alice");
    Vertex v2 = graph.addVertex("2");
    v2.setProperty("name", "Bob");
    Edge e1 = graph.addEdge("E1", v1, v2, "knows");
    e1.setProperty("since", new Date());

7. Trade-off Spectrum
  Consistency, Performance

8. Accumulo Implementation Base
  • Naive implementation passes all required TinkerPop tests
  • Far right of the spectrum
      • As consistent as you can get
  • Table Structure
      • Edge and Vertex
      • Edge and Vertex Index table
      • Metadata Table for indexes

9. Table Structure

  Vertex

    Row ID     Column Family   Column Qualifier      Value
    VertexID   Label Flag      Exists Flag           [empty]
    VertexID   INVERTEX        OutVertexID_EdgeID    Edge Label
    VertexID   OUTVERTEX       InVertexID_EdgeID     Edge Label
    VertexID   Property Key    [empty]               Serialized Value

  Edge

    Row ID     Column Family   Column Qualifier          Value
    EdgeID     Label Flag      InVertexID_OutVertexID    Edge Label
    EdgeID     Property Key    [empty]                   Serialized Value

10. Graph Access and Index Creation/Use

    // Access before Index
    for (Vertex v : graph.getVertices()) {
        String name = v.getProperty("name");
    }

    ((KeyIndexableGraph) graph)
        .createKeyIndex("name", Vertex.class);

    // Access after Index
    for (Vertex v : graph.getVertices()) {
        String name = v.getProperty("name");
    }

11. Table Structure - Continued

  Indexes

    Row                Column Family   Column Qualifier   Value
    Serialized Value   Property Key    VertexID           [empty]

  Metadata

    Row          Column Family   Column Qualifier   Value
    Index Name   Index Class     [empty]            [empty]

12. Obstacles
  • Existence checking is expensive
      • Required for TinkerPop test suite
  • Writing every graph object out is expensive
  • Building indexes post ingest is expensive
      • Blocking, full table scan
  • Consistency is expensive

13. Overcoming Obstacles
  • Give more power to users who know they are using an Accumulo Graph
  • Ingest Improvements
      • Give option to disable existence checks
      • Allow manual batching
      • Specialized ingest path
  • Traversal Improvements
      • Attribute preloading
      • Property caching
      • Element caching

14. Simple Bulk Ingest

    // Will migrate to BatchGraph
    AccumuloBulkIngester g = new AccumuloBulkIngester(cfg);
    PropertyBuilder v1 = g.addVertex("ID1");
    PropertyBuilder v2 = g.addVertex("ID2");
    PropertyBuilder edge = g.addEdge("ID1", "ID2", "knows");
    v1.add("name", "alice");
    v2.add("name", "bob");
    edge.add("since", new Date());

15. Map Reduce Integration

  In your Tool

    j.setInputFormatClass(VertexInputFormat.class);
    VertexInputFormat.setAccumuloGraphConfiguration(
        new AccumuloGraphConfiguration()
            .instance("accumulo").zkHosts("zk1").user("root")
            .password("secret".getBytes()).name("myGraph"));

  In your Mapper

    public void map(Text k, Vertex v, Context c) {
        System.out.println(v.getId().toString());
    }

16. Results

    2 Nodes              4 Nodes               8 Nodes
    20 Hours 9 Minutes   13 Hours 47 Minutes   7 Hours 4 Minutes

  Cluster Stats
  • 8 Node Cluster
  • 64 GB RAM
  • Quad-Core Xeon Processor 2.50GHz 10MB
  • 2x 4 TB 6.0Gb/s 7200 RPM Drives
  • 1 Gb/s Networking
  • Accumulo 1.5.1, Hadoop 2.0.0 MR1

  Stanford SNAP Friendster Graph
  • 65,608,366 Vertices
  • 1,806,067,135 Edges

    2 Nodes      4 Nodes      8 Nodes
    55 Minutes   29 Minutes   15 Minutes

  Vertex Iteration, Ingest

17. Conclusion
  • Simple, easy-to-read graph API
  • Gives developers a lot of tuning points for their implementations
  • Performance is good enough
      • Not meant for high-performance, specialized solutions
  • Quick to develop new ideas and investigate your graph
  • Easy to integrate and already integrated
      • Low effort to get REST access to your graph

18. Future
  • Polish and open source
  • Iterators
  • Locality Groups
  • Addressing Security
  • Graph Query
  • Extending MapReduce Integration
  • Upgrading to Accumulo 1.6, TinkerPop 2.5
  • Conditional Mutations
  • Table namespaces

19. Resources
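
As a rough illustration of the table layout on slides 9 and 11, the sketch below lists the key/value entries that the graph from slide 6 (Alice knows Bob) might produce. It is not the presenter's code and does not use the Accumulo client API. The class and method names are hypothetical, and so are the literal strings used for the "Label Flag" and "Exists Flag" columns, since the slides name those columns but do not show their contents. It also assumes one reading of the slides: that the INVERTEX entry is stored on the row of the edge's in-vertex, and the OUTVERTEX entry on the row of its out-vertex.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the row / family / qualifier / value layout from the slides.
class GraphTableLayoutSketch {

    /** One key/value entry: row ID, column family, column qualifier, value. */
    static final class Entry {
        final String row, family, qualifier, value;

        Entry(String row, String family, String qualifier, String value) {
            this.row = row;
            this.family = family;
            this.qualifier = qualifier;
            this.value = value;
        }

        @Override
        public String toString() {
            return row + " | " + family + " | " + qualifier + " | " + value;
        }
    }

    // Assumed placeholder strings; the slides do not give the real flag contents.
    static final String LABEL_FLAG = "LABEL";
    static final String EXISTS_FLAG = "EXISTS";
    static final String EMPTY = "";

    /** Vertex table entries for a vertex with a single property. */
    static List<Entry> vertexEntries(String vertexId, String propertyKey, String serializedValue) {
        List<Entry> entries = new ArrayList<>();
        entries.add(new Entry(vertexId, LABEL_FLAG, EXISTS_FLAG, EMPTY));      // existence marker
        entries.add(new Entry(vertexId, propertyKey, EMPTY, serializedValue)); // property
        return entries;
    }

    /** Entries for one edge: two in the vertex table, then one in the edge table. */
    static List<Entry> edgeEntries(String edgeId, String outVertexId, String inVertexId, String label) {
        List<Entry> entries = new ArrayList<>();
        // Vertex table, row of the in-vertex (assumed reading of the INVERTEX row).
        entries.add(new Entry(inVertexId, "INVERTEX", outVertexId + "_" + edgeId, label));
        // Vertex table, row of the out-vertex (assumed reading of the OUTVERTEX row).
        entries.add(new Entry(outVertexId, "OUTVERTEX", inVertexId + "_" + edgeId, label));
        // Edge table, keyed by the edge ID.
        entries.add(new Entry(edgeId, LABEL_FLAG, inVertexId + "_" + outVertexId, label));
        return entries;
    }

    /** Index table entry: the serialized property value is the row, so lookups by value are a row scan. */
    static Entry indexEntry(String serializedValue, String propertyKey, String vertexId) {
        return new Entry(serializedValue, propertyKey, vertexId, EMPTY);
    }

    public static void main(String[] args) {
        List<Entry> all = new ArrayList<>();
        all.addAll(vertexEntries("1", "name", "Alice"));
        all.addAll(vertexEntries("2", "name", "Bob"));
        all.addAll(edgeEntries("E1", "1", "2", "knows"));
        all.add(indexEntry("Alice", "name", "1"));
        for (Entry e : all) {
            System.out.println(e);
        }
    }
}
```

Under this reading, a single edge costs three entries, two of them on vertex rows, so both endpoints can find the edge by scanning only their own row. That is one plausible reason ingest is listed among the obstacles on slide 12.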