Accumulo Summit 2014: Accumulo-Backed TinkerPop Implementation

TinkerPop Backed By Accumulo

6/12/2014

Ryan Webb

Associate Professional

2

Agenda

Introduction to TinkerPop

Detailed Implementation

Obstacles

Overcoming Obstacles

Map Reduce Integration

Performance

3

Background

Associate Professional at The Johns Hopkins Applied Physics Laboratory

Bachelor of Science in Computer Science with a minor in Mathematics from the University of Delaware

Pursuing a Master's in Computer Science with a focus on Distributed Systems at the Whiting School of Engineering

4

TinkerPop Blueprints

Foundational technology for a complete graph stack

Extensive test suite to ensure implementations follow all the rules required.

Only a simple API: getVertex, getEdge, setProperty, getProperty

Multiple Interfaces with incremental features

5

TinkerPop Blueprints Graph API

6

Graph Creation

Configuration cfg = new AccumuloGraphConfiguration()

.instance("accumulo").user("user").zkHosts("zk1")

.password("password".getBytes()).name("myGraph");

Graph graph = GraphFactory.open(cfg);

Vertex v1 = graph.addVertex("1");

v1.setProperty("name", "Alice");

Vertex v2 = graph.addVertex("2");

v2.setProperty("name", "Bob");

Edge e1 = graph.addEdge("E1", v1, v2, "knows");

e1.setProperty("since", new Date());

7

Trade-off Spectrum

Consistency

Performance

8

Accumulo Implementation

Base naïve implementation passes all required TinkerPop tests

Far right of the spectrum: as consistent as you can get

Table structure:

Edge and Vertex tables

Edge and Vertex Index tables

Metadata table for indexes

9

Table Structure

Vertex table

Row ID | Column Family | Column Qualifier | Value
VertexID | Label Flag | Exists Flag | [empty]
VertexID | INVERTEX | OutVertexID_EdgeID | Edge Label
VertexID | OUTVERTEX | InVertexID_EdgeID | Edge Label
VertexID | Property Key | [empty] | Serialized Value

Edge table

Row ID | Column Family | Column Qualifier | Value
EdgeID | Label Flag | InVertexID_OutVertexID | Edge Label
EdgeID | Property Key | [empty] | Serialized Value
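Reading the rows above as four columns (row ID, column family, column qualifier, value), the entries produced by a single edge can be sketched as follows. This is only an illustration of the layout, not the project's code: the class and method names are hypothetical, the family name "LABEL" is a placeholder for the "Label Flag" column, and the underscore-joined qualifiers follow the slide's notation.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the table layout; not the actual implementation. */
final class TableLayoutSketch {

    /** One key-value entry: row ID, column family, column qualifier, value. */
    static final class Entry {
        final String row;
        final String family;
        final String qualifier;
        final String value;

        Entry(String row, String family, String qualifier, String value) {
            this.row = row;
            this.family = family;
            this.qualifier = qualifier;
            this.value = value;
        }

        @Override
        public String toString() {
            return row + " | " + family + " | " + qualifier + " | " + value;
        }
    }

    /** Vertex-table entries for an edge running from outVertexId to inVertexId. */
    static List<Entry> vertexTableEntries(String edgeId, String outVertexId,
                                          String inVertexId, String label) {
        List<Entry> entries = new ArrayList<>();
        // Row of the edge's in-vertex: the qualifier names the vertex at the other end.
        entries.add(new Entry(inVertexId, "INVERTEX", outVertexId + "_" + edgeId, label));
        // Row of the edge's out-vertex: the mirror-image entry.
        entries.add(new Entry(outVertexId, "OUTVERTEX", inVertexId + "_" + edgeId, label));
        return entries;
    }

    /** Edge-table entry recording both endpoints; "LABEL" is an assumed family name. */
    static Entry edgeTableEntry(String edgeId, String outVertexId,
                                String inVertexId, String label) {
        return new Entry(edgeId, "LABEL", inVertexId + "_" + outVertexId, label);
    }

    public static void main(String[] args) {
        // The edge from the Graph Creation slide: E1 from vertex 1 to vertex 2, "knows".
        for (Entry e : vertexTableEntries("E1", "1", "2", "knows")) {
            System.out.println(e);
        }
        System.out.println(edgeTableEntry("E1", "1", "2", "knows"));
    }
}
```

Under this reading, every edge is written three times (once per endpoint row and once in the edge table), which lets either endpoint find its neighbors with a single row scan.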

10

Graph Access and Index Creation/Use

// Access before Index

for (Vertex v: graph.getVertices()) {

String name = v.getProperty("name");

}

((KeyIndexableGraph)graph)

.createKeyIndex("name", Vertex.class);

// Access after Index

for (Vertex v: graph.getVertices()) {

String name = v.getProperty("name");

}

11

Table Structure - Continued

Index tables

Row | Column Family | Column Qualifier | Value
Serialized Value | Property Key | VertexID | [empty]

Metadata table

Row | Column Family | Column Qualifier | Value
Index Name | Index Class | [empty] | [empty]
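Because the index row is the serialized property value, a key-index lookup becomes a short range scan over sorted keys instead of a full table scan. The sketch below imitates that with a sorted in-memory set. It is an assumption-laden illustration: the class name, the NUL separator, and the string encoding of values are all hypothetical and stand in for Accumulo's row, column family, and column qualifier.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** Hypothetical in-memory stand-in for the index table layout. */
final class KeyIndexSketch {
    // Separator between row, family and qualifier; assumes values contain no NUL character.
    private static final char SEP = '\u0000';

    // Sorted keys of the form: serializedValue SEP propertyKey SEP vertexId.
    private final TreeSet<String> indexKeys = new TreeSet<>();

    /** Records that the given vertex has propertyKey = serializedValue. */
    void index(String propertyKey, String serializedValue, String vertexId) {
        indexKeys.add(serializedValue + SEP + propertyKey + SEP + vertexId);
    }

    /** Returns the IDs of all vertices whose propertyKey equals serializedValue. */
    List<String> lookup(String propertyKey, String serializedValue) {
        String prefix = serializedValue + SEP + propertyKey + SEP;
        List<String> ids = new ArrayList<>();
        // Start at the first key >= prefix and stop as soon as the prefix no longer matches.
        for (String key : indexKeys.tailSet(prefix, true)) {
            if (!key.startsWith(prefix)) {
                break;
            }
            ids.add(key.substring(prefix.length()));
        }
        return ids;
    }

    public static void main(String[] args) {
        KeyIndexSketch index = new KeyIndexSketch();
        index.index("name", "Alice", "1");
        index.index("name", "Bob", "2");
        System.out.println(index.lookup("name", "Alice"));
    }
}
```

The lookup touches only the keys sharing the requested value and property key, which is the property the sorted index layout is meant to provide.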

12

Obstacles

Existence checking is expensive

Required for the TinkerPop test suite

Writing every graph object out is expensive

Building indexes post-ingest is expensive

Blocking, full table scan

Consistency is expensive
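A rough way to see why existence checking hurts ingest: each checked insert needs a read before its write, so the number of store operations doubles. The counter below is a toy model with hypothetical names, assuming one simulated read per check; it does not touch Accumulo or Blueprints.

```java
import java.util.HashSet;
import java.util.Set;

/** Toy model that counts simulated store operations; not the real implementation. */
final class ExistenceCheckCost {
    private final Set<String> storedIds = new HashSet<>();
    int reads = 0;   // simulated lookups against the store
    int writes = 0;  // simulated mutations sent to the store

    /** Adds a vertex; when checkExists is true, a lookup is performed first. */
    void addVertex(String id, boolean checkExists) {
        if (checkExists) {
            reads++; // the extra read that a duplicate-ID check requires
            if (storedIds.contains(id)) {
                throw new IllegalArgumentException("Vertex already exists: " + id);
            }
        }
        storedIds.add(id);
        writes++;
    }

    public static void main(String[] args) {
        ExistenceCheckCost checked = new ExistenceCheckCost();
        ExistenceCheckCost unchecked = new ExistenceCheckCost();
        for (int i = 0; i < 1000; i++) {
            checked.addVertex("v" + i, true);
            unchecked.addVertex("v" + i, false);
        }
        System.out.println("checked:   reads=" + checked.reads + " writes=" + checked.writes);
        System.out.println("unchecked: reads=" + unchecked.reads + " writes=" + unchecked.writes);
    }
}
```

In a real deployment the read is a network round trip that cannot be batched with the write, so the penalty is likely larger than the operation count alone suggests.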

13

Overcoming Obstacles

Give more power to users who know they are using an Accumulo graph

Ingest improvements: option to disable existence checks, manual batching, specialized ingest path

Traversal improvements: attribute preloading, property caching, element caching
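The slides do not say how property or element caching is implemented. One common approach, shown here purely as an assumption, is a bounded least-recently-used cache in front of the store so that repeated getProperty calls during a traversal avoid a scan. The class name and capacity are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical bounded LRU cache; the actual caching strategy is not described in the talk. */
final class PropertyCacheSketch<K, V> extends LinkedHashMap<K, V> {
    private static final long serialVersionUID = 1L;
    private final int capacity;

    PropertyCacheSketch(int capacity) {
        // accessOrder = true makes iteration order follow recency of access.
        super(16, 0.75f, true);
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Evict the least recently used entry once the capacity is exceeded.
        return size() > capacity;
    }

    public static void main(String[] args) {
        PropertyCacheSketch<String, String> cache = new PropertyCacheSketch<>(2);
        cache.put("1:name", "Alice");
        cache.put("2:name", "Bob");
        cache.get("1:name");          // touch Alice so Bob becomes the eldest entry
        cache.put("3:name", "Carol"); // exceeds capacity and evicts Bob
        System.out.println(cache.keySet());
    }
}
```

A cache like this trades consistency for speed: a cached property can be stale if another client updates the graph, which is the trade-off the spectrum slide describes.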

14

Simple Bulk Ingest

// Will migrate to BatchGraph

AccumuloBulkIngester g = new AccumuloBulkIngester(cfg);

PropertyBuilder v1 = g.addVertex("ID1");

PropertyBuilder v2 = g.addVertex("ID2");

PropertyBuilder edge = g.addEdge("ID1", "ID2", "knows");

v1.add("name", "alice");

v2.add("name", "bob");

edge.add("since", new Date());

15

Map Reduce Integration

In your Tool

j.setInputFormatClass(VertexInputFormat.class);

VertexInputFormat.setAccumuloGraphConfiguration(

new AccumuloGraphConfiguration()

.instance("accumulo").zkHosts("zk1").user("root")

.password("secret".getBytes()).name("myGraph"));

In your Mapper

public void map(Text k, Vertex v, Context c){

System.out.println(v.getId().toString());

}

16

Results

Ingest

Nodes | Time
2 | 20 Hours 9 Minutes
4 | 13 Hours 47 Minutes
8 | 7 Hours 4 Minutes

Vertex Iteration

Nodes | Time
2 | 55 Minutes
4 | 29 Minutes
8 | 15 Minutes

Cluster Stats

8 Node Cluster

64 GB RAM

Quad-Core Xeon Processor 2.50GHz 10MB

2x 4 TB 6.0Gb/s 7200 RPM Drives

1 Gb/s Networking

Accumulo 1.5.1, Hadoop 2.0.0 – MR1

Stanford SNAP Friendster Graph

65,608,366 Vertices

1,806,067,135 Edges
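To put the timings in perspective, the arithmetic below converts them into edges per second and speedup relative to the 2-node run. It assumes the hour-scale figures are the ingest runs over the full 1,806,067,135-edge Friendster graph; the class and method names are illustrative only.

```java
/** Worked arithmetic on the reported timings; assumes the hour-scale runs are ingest. */
final class IngestThroughput {
    static final long FRIENDSTER_EDGES = 1_806_067_135L;

    /** Converts an hours-and-minutes duration to seconds. */
    static long toSeconds(int hours, int minutes) {
        return hours * 3600L + minutes * 60L;
    }

    /** Average edges ingested per second over the whole run. */
    static double edgesPerSecond(long edges, int hours, int minutes) {
        return (double) edges / toSeconds(hours, minutes);
    }

    /** Speedup of a run relative to a baseline run. */
    static double speedup(long baselineSeconds, long seconds) {
        return (double) baselineSeconds / seconds;
    }

    public static void main(String[] args) {
        // {nodes, hours, minutes} as reported on the slide.
        int[][] runs = {{2, 20, 9}, {4, 13, 47}, {8, 7, 4}};
        long baseline = toSeconds(20, 9);
        for (int[] run : runs) {
            double rate = edgesPerSecond(FRIENDSTER_EDGES, run[1], run[2]);
            double gain = speedup(baseline, toSeconds(run[1], run[2]));
            System.out.printf("%d nodes: %.0f edges/s, %.2fx vs 2 nodes%n", run[0], rate, gain);
        }
    }
}
```

Under that assumption, going from 2 to 8 nodes gives somewhat less than a 3x speedup for ingest, while the minute-scale timings (55 to 15 minutes) scale closer to linearly.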

17

Conclusion

Simple, easy-to-read graph API

Gives developers a lot of tuning points for their implementations

Performance is "good enough"

Not meant for high-performance, specialized solutions

Quick to develop new ideas and investigate your graph

Easy to integrate and already integrated

Low effort to get REST access to your graph

18

Future

Polish and open source

Iterators

Locality Groups

Addressing Security

Graph Query

Extending MapReduce Integration

Upgrading to Accumulo 1.6, TinkerPop 2.5

Conditional Mutations

Table namespaces

19

Resources

http://www.tinkerpop.com/

http://snap.stanford.edu/data/com-Friendster.html
