Accumulo Summit 2014: Accumulo-Backed TinkerPop Implementation

TinkerPop Backed By Accumulo 6/12/2014 Ryan Webb Associate Professional [email protected]

Upload: accumulo-summit

Post on 26-Jan-2015

DESCRIPTION

As graph processing matures as a field, standards will eventually emerge. The TinkerPop graph processing stack is one such potential standard. The TinkerPop stack contains an algorithm engine, a scripting engine, and a RESTful service for accessing graphs. At the base of TinkerPop is Blueprints, an interface for accessing and creating property graphs. Blueprints has already been implemented on several different backing technologies (e.g., relational databases, RDF triple stores, graph databases) and implementations (e.g., JDBC-based, OpenRDF Sail, and Neo4j). This presentation will discuss our implementation of the Blueprints API backed by Accumulo to enable storage of arbitrarily large, distributed graphs. Our implementation falls between two extremes: distributed graph processing systems that require the entire graph to fit within the available RAM of the cluster, and batch-oriented systems that incur significant disk I/O costs during execution and generally handle iterative algorithms poorly. We will discuss the benefits of supporting the TinkerPop API and the design and performance trade-offs we faced when developing the Accumulo backend and integrating with the Hadoop MapReduce framework. We aim to merge the advantages of the TinkerPop software ecosystem with the scalability and fault tolerance of Accumulo and provide a robust, turn-key solution for certain classes of large-scale, graph-related challenges.

TRANSCRIPT

Page 1: Accumulo Summit 2014: Accumulo backed Tinkerpop Implementation

TinkerPop Backed By Accumulo

6/12/2014

Ryan Webb

Associate Professional

[email protected]

Page 2: Accumulo Summit 2014: Accumulo backed Tinkerpop Implementation

Agenda

Introduction to TinkerPop
Detailed Implementation
Obstacles
Overcoming Obstacles
Map Reduce Integration
Performance

Page 3: Accumulo Summit 2014: Accumulo backed Tinkerpop Implementation

Background

Associate Professional at The Johns Hopkins Applied Physics Laboratory

Bachelor of Science in Computer Science with a minor in Mathematics from the University of Delaware

Pursuing a Master's in Computer Science with a focus on Distributed Systems at the Whiting School of Engineering

Page 4: Accumulo Summit 2014: Accumulo backed Tinkerpop Implementation

TinkerPop Blueprints

Foundational technology for a complete graph stack

Extensive test suite to ensure implementations follow all the rules required.

Only a simple API: getVertex, getEdge, setProperty, getProperty

Multiple interfaces with incremental features

Page 5: Accumulo Summit 2014: Accumulo backed Tinkerpop Implementation

TinkerPop Blueprints Graph API

Page 6: Accumulo Summit 2014: Accumulo backed Tinkerpop Implementation

Graph Creation

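// Connection settings for the Accumulo instance that backs the graph
Configuration cfg = new AccumuloGraphConfiguration()
    .instance("accumulo").user("user").zkHosts("zk1")
    .password("password".getBytes()).name("myGraph");

// Open the graph through the standard Blueprints factory
Graph graph = GraphFactory.open(cfg);

// Two vertices with a "name" property each
Vertex v1 = graph.addVertex("1");
v1.setProperty("name", "Alice");
Vertex v2 = graph.addVertex("2");
v2.setProperty("name", "Bob");

// A "knows" edge from v1 to v2 with a property of its own
Edge e1 = graph.addEdge("E1", v1, v2, "knows");
e1.setProperty("since", new Date());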

Page 7: Accumulo Summit 2014: Accumulo backed Tinkerpop Implementation

Trade off Spectrum

Consistency

Performance

Page 8: Accumulo Summit 2014: Accumulo backed Tinkerpop Implementation

Accumulo Implementation

Base naïve implementation passes all required TinkerPop tests
    Far right of the spectrum
    As consistent as you can get

Table structure
    Edge and Vertex tables
    Edge and Vertex Index tables
    Metadata table for indexes

Page 9: Accumulo Summit 2014: Accumulo backed Tinkerpop Implementation

Table Structure

Vertex

Row ID    Column Family  Column Qualifier    Value
VertexID  Label Flag     Exists Flag         [empty]
VertexID  INVERTEX       OutVertexID_EdgeID  Edge Label
VertexID  OUTVERTEX      InVertexID_EdgeID   Edge Label
VertexID  Property Key   [empty]             Serialized Value

Edge

Row ID    Column Family  Column Qualifier        Value
EdgeID    Label Flag     InVertexID_OutVertexID  Edge Label
EdgeID    Property Key   [empty]                 Serialized Value
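
The vertex table layout above can be sketched in plain Java. This is a rough illustration only, not the presented implementation: a sorted in-memory map stands in for an Accumulo table, and the class name, the NUL separator, and the "LABEL"/"EXISTS" strings are assumptions made for the example. It also assumes one reading of the slide, namely that an INVERTEX entry is written to the row of the edge's in-vertex and names the out-vertex at the other end, and that IDs contain no underscore.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical sketch: a sorted map plays the role of the vertex table.
final class VertexTableSketch {
    // Assumed separator between row, column family and column qualifier.
    private static final char SEP = '\u0000';

    // Keys sort by row, then family, then qualifier, like Accumulo keys do.
    final TreeMap<String, String> entries = new TreeMap<>();

    static String key(String row, String family, String qualifier) {
        return row + SEP + family + SEP + qualifier;
    }

    // One "exists" entry per vertex, with an empty value.
    void addVertex(String vertexId) {
        entries.put(key(vertexId, "LABEL", "EXISTS"), "");
    }

    // Each edge is recorded twice, once in the row of each endpoint.
    void addEdge(String edgeId, String outVertexId, String inVertexId, String label) {
        entries.put(key(inVertexId, "INVERTEX", outVertexId + "_" + edgeId), label);
        entries.put(key(outVertexId, "OUTVERTEX", inVertexId + "_" + edgeId), label);
    }

    // Range scan over one row and one column family: no full table scan needed.
    List<String> outNeighbors(String vertexId) {
        String prefix = vertexId + SEP + "OUTVERTEX" + SEP;
        List<String> neighbors = new ArrayList<>();
        for (String k : entries.subMap(prefix, prefix + '\uffff').keySet()) {
            String qualifier = k.substring(prefix.length());
            neighbors.add(qualifier.substring(0, qualifier.indexOf('_')));
        }
        return neighbors;
    }

    public static void main(String[] args) {
        VertexTableSketch table = new VertexTableSketch();
        table.addVertex("1");
        table.addVertex("2");
        table.addEdge("E1", "1", "2", "knows");
        for (String k : table.entries.keySet()) {
            System.out.println(k.replace(SEP, '|') + " -> " + table.entries.get(k));
        }
        System.out.println("out neighbors of 1: " + table.outNeighbors("1"));
    }
}
```

Because all entries for a vertex share a row, reading a vertex's edges in one direction is a single contiguous range scan, which is presumably why the edge appears in both endpoint rows.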

Page 10: Accumulo Summit 2014: Accumulo backed Tinkerpop Implementation

Graph Access and Index Creation/Use

// Access before Index
for (Vertex v : graph.getVertices()) {
    String name = v.getProperty("name");
}

((KeyIndexableGraph) graph)
    .createKeyIndex("name", Vertex.class);

// Access after Index
for (Vertex v : graph.getVertices()) {
    String name = v.getProperty("name");
}

Page 11: Accumulo Summit 2014: Accumulo backed Tinkerpop Implementation

Table Structure - Continued

Indexes

Row               Column Family  Column Qualifier  Value
Serialized Value  Property Key   VertexID          [empty]

Metadata

Row         Column Family  Column Qualifier  Value
Index Name  Index Class    [empty]           [empty]
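
To illustrate why the index table is keyed by the serialized value, here is a minimal sketch under stated assumptions: in-memory collections stand in for the vertex and index tables, and every name in it (KeyIndexSketch, findByScan, findByIndex, the NUL separator) is hypothetical. It contrasts a full scan with a prefix lookup, and shows that creating an index after ingest requires one pass over every vertex.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of a key index laid out as value / property key / vertex ID.
final class KeyIndexSketch {
    private static final char SEP = '\u0000'; // assumed separator

    private final Map<String, Map<String, String>> vertices = new HashMap<>();
    private final TreeSet<String> index = new TreeSet<>();
    private final Set<String> indexedKeys = new HashSet<>();

    // Building an index after ingest touches every vertex once (a full scan).
    void createKeyIndex(String key) {
        indexedKeys.add(key);
        for (Map.Entry<String, Map<String, String>> v : vertices.entrySet()) {
            String value = v.getValue().get(key);
            if (value != null) {
                index.add(value + SEP + key + SEP + v.getKey());
            }
        }
    }

    // Writes keep the index in step with the vertex data for indexed keys.
    void setProperty(String vertexId, String key, String value) {
        Map<String, String> props = vertices.computeIfAbsent(vertexId, id -> new HashMap<>());
        String old = props.put(key, value);
        if (indexedKeys.contains(key)) {
            if (old != null) {
                index.remove(old + SEP + key + SEP + vertexId);
            }
            index.add(value + SEP + key + SEP + vertexId);
        }
    }

    // Without an index: inspect every vertex.
    List<String> findByScan(String key, String value) {
        List<String> ids = new ArrayList<>();
        for (Map.Entry<String, Map<String, String>> v : vertices.entrySet()) {
            if (value.equals(v.getValue().get(key))) {
                ids.add(v.getKey());
            }
        }
        Collections.sort(ids);
        return ids;
    }

    // With an index: one range scan over entries that start with value + key.
    List<String> findByIndex(String key, String value) {
        String prefix = value + SEP + key + SEP;
        List<String> ids = new ArrayList<>();
        for (String entry : index.subSet(prefix, prefix + '\uffff')) {
            ids.add(entry.substring(prefix.length()));
        }
        return ids;
    }
}
```

Both lookups return the same vertex IDs; the difference is that the indexed lookup only reads the entries sharing the requested value and key.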

Page 12: Accumulo Summit 2014: Accumulo backed Tinkerpop Implementation

Obstacles

Existence checking is expensive
    Required for TinkerPop test suite

Writing every graph object out is expensive

Building indexes post ingest is expensive
    Blocking, full table scan

Consistency is expensive

Page 13: Accumulo Summit 2014: Accumulo backed Tinkerpop Implementation

Overcoming Obstacles

Give more power to users who know they are using an Accumulo Graph

Ingest Improvements
    Give option to disable existence checks
    Allow manual batching
    Specialized ingest path

Traversal Improvements
    Attribute preloading
    Property caching
    Element caching
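
As one way to picture the property caching item, here is a small sketch of a bounded least-recently-used cache placed in front of a slow property lookup. It is an assumption-laden illustration, not the presented code: the class name, the Function-based backend, and the eviction policy are all invented for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch: an LRU cache in front of an expensive property read.
final class PropertyCacheSketch {
    private final Function<String, String> backend; // stands in for a table lookup
    private final Map<String, String> cache;
    int backendReads = 0; // counts how often the backend was actually hit

    PropertyCacheSketch(Function<String, String> backend, final int maxEntries) {
        this.backend = backend;
        // Access-ordered LinkedHashMap that evicts the least recently used entry.
        this.cache = new LinkedHashMap<String, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                return size() > maxEntries;
            }
        };
    }

    String getProperty(String vertexId, String key) {
        String cacheKey = vertexId + '\u0000' + key;
        String value = cache.get(cacheKey);
        if (value == null) {
            backendReads++;
            value = backend.apply(cacheKey);
            if (value != null) {
                cache.put(cacheKey, value);
            }
        }
        return value;
    }

    // Local writes must drop the cached copy, or later reads return stale data.
    void invalidate(String vertexId, String key) {
        cache.remove(vertexId + '\u0000' + key);
    }
}
```

A cache like this trades consistency for speed: a value changed by another client stays visible in its old form until the entry is evicted or invalidated, which is the kind of choice the trade-off spectrum slide describes.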

Page 14: Accumulo Summit 2014: Accumulo backed Tinkerpop Implementation

Simple Bulk Ingest

// Will migrate to BatchGraph
AccumuloBulkIngester g = new AccumuloBulkIngester(cfg);

PropertyBuilder v1 = g.addVertex("ID1");
PropertyBuilder v2 = g.addVertex("ID2");
PropertyBuilder edge = g.addEdge("ID1", "ID2", "knows");

v1.add("name", "alice");
v2.add("name", "bob");
edge.add("since", new Date());

Page 15: Accumulo Summit 2014: Accumulo backed Tinkerpop Implementation

Map Reduce Integration

In your Tool

j.setInputFormatClass(VertexInputFormat.class);
VertexInputFormat.setAccumuloGraphConfiguration(
    new AccumuloGraphConfiguration()
        .instance("accumulo").zkHosts("zk1").user("root")
        .password("secret".getBytes()).name("myGraph"));

In your Mapper

public void map(Text k, Vertex v, Context c) {
    System.out.println(v.getId().toString());
}

Page 16: Accumulo Summit 2014: Accumulo backed Tinkerpop Implementation

Results

2 Nodes              4 Nodes              8 Nodes
20 Hours 9 Minutes   13 Hours 47 Minutes  7 Hours 4 Minutes

2 Nodes              4 Nodes              8 Nodes
55 Minutes           29 Minutes           15 Minutes

Vertex Iteration

Ingest

Cluster Stats
    8 Node Cluster
    64 GB RAM
    Quad-Core Xeon Processor 2.50GHz 10MB
    2x 4 TB 6.0Gb/s 7200 RPM Drives
    1 Gb/s Networking
    Accumulo 1.5.1, Hadoop 2.0.0 – MR1

Stanford SNAP Friendster Graph
    65,608,366 Vertices
    1,806,067,135 Edges
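
The scaling behind the two timing rows can be worked out with a little arithmetic. The sketch below takes the 2-node time as the baseline and computes speedup and parallel efficiency at 8 nodes; the class and method names are made up for the example, and only the timings come from the slide.

```java
// Hypothetical helper for reading the timing rows: speedup and efficiency
// relative to the 2-node run.
final class ScalingSketch {
    static int minutes(int hours, int mins) {
        return hours * 60 + mins;
    }

    // How many times faster the larger cluster finished.
    static double speedup(int baseMinutes, int minutes) {
        return (double) baseMinutes / minutes;
    }

    // Speedup divided by the growth in node count; 1.0 would be linear scaling.
    static double efficiency(int baseNodes, int baseMinutes, int nodes, int minutes) {
        return speedup(baseMinutes, minutes) / ((double) nodes / baseNodes);
    }

    public static void main(String[] args) {
        int slow2 = minutes(20, 9);   // hour-scale row, 2 nodes
        int slow8 = minutes(7, 4);    // hour-scale row, 8 nodes
        int fast2 = minutes(0, 55);   // minute-scale row, 2 nodes
        int fast8 = minutes(0, 15);   // minute-scale row, 8 nodes

        System.out.println(String.format("hour-scale row:   speedup %.2f, efficiency %.2f",
                speedup(slow2, slow8), efficiency(2, slow2, 8, slow8)));
        System.out.println(String.format("minute-scale row: speedup %.2f, efficiency %.2f",
                speedup(fast2, fast8), efficiency(2, fast2, 8, fast8)));
    }
}
```

Going from 2 to 8 nodes is a fourfold increase in hardware, so the minute-scale workload scales close to linearly while the hour-scale workload falls noticeably short of it.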

Page 17: Accumulo Summit 2014: Accumulo backed Tinkerpop Implementation

Conclusion

Simple, easy-to-read graph API

Gives developers a lot of tuning points for their implementations

Performance is "good enough"
    Not meant for high-performance, specialized solutions

Quick to develop new ideas and investigate your graph

Easy to integrate and already integrated
    Low effort to get REST access to your graph

Page 18: Accumulo Summit 2014: Accumulo backed Tinkerpop Implementation

Future

Polish and open source
Iterators
Locality Groups
Addressing Security
Graph Query
Extending MapReduce Integration
Upgrading to Accumulo 1.6, TinkerPop 2.5
    Conditional Mutations
    Table namespaces

Page 19: Accumulo Summit 2014: Accumulo backed Tinkerpop Implementation

Resources

http://www.tinkerpop.com/
http://snap.stanford.edu/data/com-Friendster.html
