HBaseCon 2012 | Storing and Manipulating Graphs in HBase
DESCRIPTION
Google’s original use case for BigTable was the storage and processing of web graph information, represented as sparse matrices. However, many organizations tend to treat HBase as merely a “web scale” RDBMS. This session covers several use cases for storing graph data in HBase, including social networks and web link graphs; MapReduce processes such as cached traversal, PageRank, and clustering; and, lastly, lower-level modeling details such as row key and column qualifier design, using FullContact’s graph processing systems as a real-world example.
TRANSCRIPT
Storing and Manipulating Graphs in HBase
@danklynn
Keeps Contact Information Current and Complete
Based in Denver, Colorado
CTO & Co-Founder
Turn Partial Contacts Into Full Contacts
Refresher: Graph Theory
- Vertex
- Edge
Social Networks
Example graph: the users @danklynn and @xorlev and the tweet “#HBase rocks” as vertices, with author, follows, and retweeted edges between them.
Web Links
Example graph: http://fullcontact.com/blog/ links to http://techstars.com/ via an anchor tag <a href="...">TechStars</a>.
Why should you care?
- Vertex influence (e.g. PageRank)
- Social influence
- Network bottlenecks
- Identifying communities
Storage Options
neo4j
- Very expressive querying (e.g. Gremlin)
- Transactional
- Data must fit on a single machine :-(
FlockDB
- Scales horizontally
- Very fast
- No multi-hop query support :-(
RDBMS (e.g. MySQL, Postgres, et al.)
- Transactional
- Huge amounts of JOINing :-(
HBase
- Massively scalable
- Data model well-suited
- Multi-hop querying?
Modeling Techniques
Adjacency Matrix
Example: a 3-vertex graph (vertices 1, 2, 3) in which every vertex connects to the other two.

      1 2 3
  1 [ 0 1 1 ]
  2 [ 1 0 1 ]
  3 [ 1 1 0 ]
Adjacency Matrix
- Can use vectorized libraries
- Requires O(n²) memory, where n = number of vertices
- Hard(er) to distribute
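Not from the talk: a minimal Java sketch of the dense representation above, with one matrix-vector multiply of the kind vectorized libraries (or a PageRank-style power iteration) would run. Storing the full n-by-n matrix is exactly what costs O(n²) memory.

public class AdjacencyMatrixSketch {
    public static void main(String[] args) {
        // matrix[i][j] == 1.0 means an edge from vertex i+1 to vertex j+1
        double[][] matrix = {
            {0, 1, 1},
            {1, 0, 1},
            {1, 1, 0},
        };
        double[] scores = {1.0, 1.0, 1.0}; // e.g. current influence per vertex
        double[] next = new double[scores.length];

        // One propagation step: each vertex passes its score along its edges.
        for (int i = 0; i < matrix.length; i++) {
            for (int j = 0; j < matrix[i].length; j++) {
                next[j] += matrix[i][j] * scores[i];
            }
        }
        System.out.println(java.util.Arrays.toString(next)); // [2.0, 2.0, 2.0]
    }
}
java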
Adjacency List
The same 3-vertex graph as an adjacency list (vertex → neighbors):

  1 → 2, 3
  2 → 1, 3
  3 → 1, 2
Adjacency List Design in HBase
Each vertex is a row; each outbound edge is a column in the “edges” column family (e.g. t:danklynn ↔ p:+13039316251).

  row key              | “edges” column family
  ---------------------+------------------------------------------------
  e:[email protected]  | p:+13039316251= ...      t:danklynn= ...
  p:+13039316251       | t:danklynn= ...          e:[email protected]= ...
  t:danklynn           | e:[email protected]= ...      p:+13039316251= ...
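Not shown in the slides, but for orientation: a rough sketch of creating such a table with the HBase 0.92-era client API; the table name "graph" and the "edges" column family are taken from later slides.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class CreateGraphTable {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);

        // One row per vertex; one column per outbound edge in the "edges" family.
        HTableDescriptor graph = new HTableDescriptor("graph");
        graph.addFamily(new HColumnDescriptor("edges"));
        admin.createTable(graph);
    }
}
java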
What to store?
Custom Writables
package org.apache.hadoop.io;

public interface Writable {
    void write(java.io.DataOutput dataOutput) throws java.io.IOException;
    void readFields(java.io.DataInput dataInput) throws java.io.IOException;
}
java
Custom Writables
class EdgeValueWritable implements Writable {
    EdgeValue edgeValue

    void write(DataOutput dataOutput) {
        dataOutput.writeDouble edgeValue.weight
    }

    void readFields(DataInput dataInput) {
        Double weight = dataInput.readDouble()
        edgeValue = new EdgeValue(weight)
    }
    // ...
}
groovy
Don’t get fancy with byte[]
class EdgeValueWritable implements Writable {
    EdgeValue edgeValue

    byte[] toBytes() {
        // use strings if you can help it
    }

    static EdgeValueWritable fromBytes(byte[] bytes) {
        // use strings if you can help it
    }
}
groovy
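One way to read the “use strings if you can help it” advice, sketched in Java rather than taken from FullContact’s code; the EdgeValue class with a single weight field is assumed from the Groovy example above. Serializing through a plain string keeps the stored bytes readable from the HBase shell and avoids hand-rolled byte layouts.

import org.apache.hadoop.hbase.util.Bytes;

class EdgeValue {
    double weight;

    EdgeValue(double weight) {
        this.weight = weight;
    }

    // Store the weight as a human-readable string rather than raw binary.
    byte[] toBytes() {
        return Bytes.toBytes(Double.toString(weight));
    }

    static EdgeValue fromBytes(byte[] bytes) {
        return new EdgeValue(Double.parseDouble(Bytes.toString(bytes)));
    }
}
java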
Querying by vertex
def get = new Get(vertexKeyBytes)
get.addFamily(edgesFamilyBytes)

Result result = table.get(get)
result.noVersionMap.each { family, data ->
    // construct edge objects as needed
    // data is a Map<byte[],byte[]>
}
Adding edges to a vertex
def put = new Put(vertexKeyBytes)
put.add(
    edgesFamilyBytes,
    destinationVertexBytes,
    edgeValue.toBytes() // your own implementation here
)

// if writing directly
table.put(put)

// if using TableReducer
context.write(NullWritable.get(), put)
Distributed Traversal / Indexing
Example graph: t:danklynn connected to p:+13039316251, with p:+13039316251 acting as the pivot vertex.
- MapReduce over outbound edges
- Emit vertices and edge data grouped by the pivot
- The pivot (e.g. p:+13039316251) becomes the reduce key, with the “out” vertex and “in” vertex on either side as values
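Roughly what one traversal iteration looks like as a MapReduce job, sketched with plain Text keys instead of the talk’s custom Writables (VertexKeyWritable and friends) and assuming an undirected graph; it illustrates the pivot-grouping idea and is not FullContact’s actual code.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: for every vertex row, emit each neighbour as a pivot,
// tagged with the vertex we came from.
class EdgeMapper extends TableMapper<Text, Text> {
    @Override
    protected void map(ImmutableBytesWritable rowKey, Result result, Context context)
            throws IOException, InterruptedException {
        String vertex = Bytes.toString(rowKey.get());
        Map<byte[], byte[]> edges = result.getFamilyMap(Bytes.toBytes("edges"));
        if (edges == null) {
            return;
        }
        for (byte[] qualifier : edges.keySet()) {
            String neighbour = Bytes.toString(qualifier);
            // group by the pivot (the far side of this edge)
            context.write(new Text(neighbour), new Text(vertex));
        }
    }
}

// Reducer: any two vertices meeting at the same pivot are two hops apart,
// so emit a new edge between them (in practice, a Put back into the table).
class PivotReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text pivot, Iterable<Text> vertices, Context context)
            throws IOException, InterruptedException {
        List<String> seen = new ArrayList<String>();
        for (Text v : vertices) {
            for (String other : seen) {
                context.write(new Text(other), new Text(v.toString()));
            }
            seen.add(v.toString());
        }
    }
}
java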
Distributed Traversal / Indexing
- Iteration 0
- Iteration 1
- Iteration 2: reuse edges created during previous iterations
- Iteration 3: reuse edges created during previous iterations
Because each iteration can reuse the edges created during previous iterations, the reachable distance roughly doubles with every pass: covering 2^n hops requires only n iterations.
Tips / Gotchas
Do implement your own comparator
java
public static class Comparator extends WritableComparator {
    public int compare(byte[] b1, int s1, int l1,
                       byte[] b2, int s2, int l2) {
        // .....
    }
}
Do implement your own comparator
java
static {
    WritableComparator.define(VertexKeyWritable.class,
        new VertexKeyWritable.Comparator());
}
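The compare body is elided on the slide; here is a minimal sketch of what a raw comparator can look like, assuming a VertexKeyWritable whose serialized form sorts correctly under a plain lexicographic byte comparison (if the real layout has length prefixes or multiple fields, the offsets need adjusting).

public static class Comparator extends WritableComparator {
    public Comparator() {
        super(VertexKeyWritable.class);
    }

    @Override
    public int compare(byte[] b1, int s1, int l1,
                       byte[] b2, int s2, int l2) {
        // Compare serialized bytes directly so the shuffle sort never has to
        // deserialize the keys.
        return compareBytes(b1, s1, l1, b2, s2, l2);
    }
}
java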
MultiScanTableInputFormat
MultiScanTableInputFormat.setTable(conf, "graph");
MultiScanTableInputFormat.addScan(conf, new Scan());
job.setInputFormatClass(MultiScanTableInputFormat.class);
java
TableMapReduceUtil
TableMapReduceUtil.initTableReducerJob("graph", MyReducer.class, job);
java
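Putting the last two slides together, a rough wiring of one job; EdgeMapper is the placeholder mapper from the earlier sketch, MyReducer is the (TableReducer) name from the slide, and MultiScanTableInputFormat is the speaker’s own class, so only the calls shown above are used.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

public class TraversalJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        MultiScanTableInputFormat.setTable(conf, "graph");
        MultiScanTableInputFormat.addScan(conf, new Scan());

        Job job = new Job(conf, "graph-traversal-iteration");
        job.setJarByClass(TraversalJob.class);
        job.setInputFormatClass(MultiScanTableInputFormat.class);
        job.setMapperClass(EdgeMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);

        // MyReducer extends TableReducer and writes Puts back into "graph".
        TableMapReduceUtil.initTableReducerJob("graph", MyReducer.class, job);

        job.waitForCompletion(true);
    }
}
java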
Elastic MapReduce
Pipeline: HFiles → SequenceFiles → copy to S3 → Elastic MapReduce → SequenceFiles → HFiles → HBase

HFileOutputFormat.configureIncrementalLoad(job, outputTable)

$ hadoop jar hbase-VERSION.jar completebulkload
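A rough sketch of the HFile-writing step at the end of that pipeline, using the 0.92-era API from the slide; the S3 input path, the local output path, and the mapper that turns SequenceFile records back into Puts are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BulkLoadPrep {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = new Job(conf, "graph-bulk-load-prep");
        job.setJarByClass(BulkLoadPrep.class);

        // Mapper (not shown) converts SequenceFile records back into Puts.
        FileInputFormat.addInputPath(job, new Path("s3://bucket/emr-output")); // placeholder
        FileOutputFormat.setOutputPath(job, new Path("/tmp/hfiles"));          // placeholder

        // Configures partitioning and sorting so reducers write region-aligned HFiles.
        HTable outputTable = new HTable(conf, "graph");
        HFileOutputFormat.configureIncrementalLoad(job, outputTable);

        job.waitForCompletion(true);
        // then: hadoop jar hbase-VERSION.jar completebulkload /tmp/hfiles graph
    }
}
java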
Additional Resources
Google Pregel: BSP-based graph processing system
Apache Giraph: Implementation of Pregel for Hadoop
MultiScanTableInputFormat: (code to appear on GitHub)
Apache Mahout: Distributed machine learning on Hadoop