HBaseCon 2012 | Storing and Manipulating Graphs in HBase
DESCRIPTION
Google’s original use case for BigTable was the storage and processing of web graph information, represented as sparse matrices. However, many organizations tend to treat HBase as merely a “web scale” RDBMS. This session covers several use cases for storing graph data in HBase, including social networks and web link graphs; MapReduce processes such as cached traversal, PageRank, and clustering; and, lastly, lower-level modeling details such as row key and column qualifier design, using FullContact’s graph processing systems as a real-world example.
TRANSCRIPT
Storing and Manipulating Graphs in HBase
@danklynn
Keeps Contact Information Current and Complete
Based in Denver, Colorado
CTO & Co-Founder
Turn Partial Contacts Into Full Contacts
Refresher: Graph Theory
- Vertex
- Edge
Social Networks
Example graph: the users @danklynn and @xorlev and the tweet “#HBase rocks” as vertices, with author, follows, and retweeted edges between them.
Web Links
Example graph: http://fullcontact.com/blog/ links to http://techstars.com/ via an anchor tag <a href="...">TechStars</a>.
Why should you care?
- Vertex influence (e.g. PageRank)
- Social influence
- Network bottlenecks
- Identifying communities
Storage Options
neo4j
- Very expressive querying (e.g. Gremlin)
- Transactional
- Data must fit on a single machine :-(
FlockDB
- Scales horizontally
- Very fast
- No multi-hop query support :-(
RDBMS (e.g. MySQL, Postgres, et al.)
- Transactional
- Huge amounts of JOINing :-(
HBase
- Massively scalable
- Data model well-suited
- Multi-hop querying?
Modeling Techniques
Adjacency Matrix
Example: a 3-vertex graph (vertices 1, 2, 3) in which every vertex connects to the other two.

      1 2 3
  1 [ 0 1 1 ]
  2 [ 1 0 1 ]
  3 [ 1 1 0 ]
Adjacency Matrix
- Can use vectorized libraries
- Requires O(n²) memory, where n = number of vertices
- Hard(er) to distribute
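Not from the talk: a minimal Java sketch of the dense representation above, with one matrix-vector multiply of the kind vectorized libraries (or a PageRank-style power iteration) would run. Storing the full n-by-n matrix is exactly what costs O(n²) memory.

public class AdjacencyMatrixSketch {
    public static void main(String[] args) {
        // matrix[i][j] == 1.0 means an edge from vertex i+1 to vertex j+1
        double[][] matrix = {
            {0, 1, 1},
            {1, 0, 1},
            {1, 1, 0},
        };
        double[] scores = {1.0, 1.0, 1.0}; // e.g. current influence per vertex
        double[] next = new double[scores.length];

        // One propagation step: each vertex passes its score along its edges.
        for (int i = 0; i < matrix.length; i++) {
            for (int j = 0; j < matrix[i].length; j++) {
                next[j] += matrix[i][j] * scores[i];
            }
        }
        System.out.println(java.util.Arrays.toString(next)); // [2.0, 2.0, 2.0]
    }
}
java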
Adjacency List
The same 3-vertex graph as an adjacency list (vertex → neighbors):

  1 → 2, 3
  2 → 1, 3
  3 → 1, 2
Adjacency List Design in HBase
Each vertex is a row; each outbound edge is a column in the “edges” column family (e.g. t:danklynn ↔ p:+13039316251).

  row key              | “edges” column family
  ---------------------+------------------------------------------------
  e:[email protected]  | p:+13039316251= ...      t:danklynn= ...
  p:+13039316251       | t:danklynn= ...          e:[email protected]= ...
  t:danklynn           | e:[email protected]= ...      p:+13039316251= ...
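Not shown in the slides, but for orientation: a rough sketch of creating such a table with the HBase 0.92-era client API; the table name "graph" and the "edges" column family are taken from later slides.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class CreateGraphTable {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);

        // One row per vertex; one column per outbound edge in the "edges" family.
        HTableDescriptor graph = new HTableDescriptor("graph");
        graph.addFamily(new HColumnDescriptor("edges"));
        admin.createTable(graph);
    }
}
java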
What to store?
Custom Writables
package org.apache.hadoop.io;

public interface Writable {
    void write(java.io.DataOutput dataOutput) throws java.io.IOException;
    void readFields(java.io.DataInput dataInput) throws java.io.IOException;
}
java
Custom Writables
class EdgeValueWritable implements Writable {
    EdgeValue edgeValue

    void write(DataOutput dataOutput) {
        dataOutput.writeDouble edgeValue.weight
    }

    void readFields(DataInput dataInput) {
        Double weight = dataInput.readDouble()
        edgeValue = new EdgeValue(weight)
    }
    // ...
}
groovy
Don’t get fancy with byte[]
class EdgeValueWritable implements Writable {
    EdgeValue edgeValue

    byte[] toBytes() {
        // use strings if you can help it
    }

    static EdgeValueWritable fromBytes(byte[] bytes) {
        // use strings if you can help it
    }
}
groovy
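One way to read the “use strings if you can help it” advice, sketched in Java rather than taken from FullContact’s code; the EdgeValue class with a single weight field is assumed from the Groovy example above. Serializing through a plain string keeps the stored bytes readable from the HBase shell and avoids hand-rolled byte layouts.

import org.apache.hadoop.hbase.util.Bytes;

class EdgeValue {
    double weight;

    EdgeValue(double weight) {
        this.weight = weight;
    }

    // Store the weight as a human-readable string rather than raw binary.
    byte[] toBytes() {
        return Bytes.toBytes(Double.toString(weight));
    }

    static EdgeValue fromBytes(byte[] bytes) {
        return new EdgeValue(Double.parseDouble(Bytes.toString(bytes)));
    }
}
java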
Querying by vertex
def get = new Get(vertexKeyBytes)
get.addFamily(edgesFamilyBytes)

Result result = table.get(get)
result.noVersionMap.each { family, data ->
    // construct edge objects as needed
    // data is a Map<byte[],byte[]>
}
Adding edges to a vertex
def put = new Put(vertexKeyBytes)
put.add(
    edgesFamilyBytes,
    destinationVertexBytes,
    edgeValue.toBytes() // your own implementation here
)

// if writing directly
table.put(put)

// if using TableReducer
context.write(NullWritable.get(), put)
Distributed Traversal / Indexing
Example graph: t:danklynn connected to p:+13039316251, with p:+13039316251 acting as the pivot vertex.
- MapReduce over outbound edges
- Emit vertices and edge data grouped by the pivot
- The pivot (e.g. p:+13039316251) becomes the reduce key, with the “out” vertex and “in” vertex on either side as values
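Roughly what one traversal iteration looks like as a MapReduce job, sketched with plain Text keys instead of the talk’s custom Writables (VertexKeyWritable and friends) and assuming an undirected graph; it illustrates the pivot-grouping idea and is not FullContact’s actual code.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: for every vertex row, emit each neighbour as a pivot,
// tagged with the vertex we came from.
class EdgeMapper extends TableMapper<Text, Text> {
    @Override
    protected void map(ImmutableBytesWritable rowKey, Result result, Context context)
            throws IOException, InterruptedException {
        String vertex = Bytes.toString(rowKey.get());
        Map<byte[], byte[]> edges = result.getFamilyMap(Bytes.toBytes("edges"));
        if (edges == null) {
            return;
        }
        for (byte[] qualifier : edges.keySet()) {
            String neighbour = Bytes.toString(qualifier);
            // group by the pivot (the far side of this edge)
            context.write(new Text(neighbour), new Text(vertex));
        }
    }
}

// Reducer: any two vertices meeting at the same pivot are two hops apart,
// so emit a new edge between them (in practice, a Put back into the table).
class PivotReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text pivot, Iterable<Text> vertices, Context context)
            throws IOException, InterruptedException {
        List<String> seen = new ArrayList<String>();
        for (Text v : vertices) {
            for (String other : seen) {
                context.write(new Text(other), new Text(v.toString()));
            }
            seen.add(v.toString());
        }
    }
}
java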
Distributed Traversal / Indexing
- Iteration 0
- Iteration 1
- Iteration 2: reuse edges created during previous iterations
- Iteration 3: reuse edges created during previous iterations
Because each iteration can reuse the edges created during previous iterations, the reachable distance roughly doubles with every pass: covering 2^n hops requires only n iterations.
Tips / Gotchas
Do implement your own comparator
java
public static class Comparator extends WritableComparator {
    public int compare(byte[] b1, int s1, int l1,
                       byte[] b2, int s2, int l2) {
        // .....
    }
}
Do implement your own comparator
java
static {
    WritableComparator.define(VertexKeyWritable.class,
        new VertexKeyWritable.Comparator());
}
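The compare body is elided on the slide; here is a minimal sketch of what a raw comparator can look like, assuming a VertexKeyWritable whose serialized form sorts correctly under a plain lexicographic byte comparison (if the real layout has length prefixes or multiple fields, the offsets need adjusting).

public static class Comparator extends WritableComparator {
    public Comparator() {
        super(VertexKeyWritable.class);
    }

    @Override
    public int compare(byte[] b1, int s1, int l1,
                       byte[] b2, int s2, int l2) {
        // Compare serialized bytes directly so the shuffle sort never has to
        // deserialize the keys.
        return compareBytes(b1, s1, l1, b2, s2, l2);
    }
}
java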
MultiScanTableInputFormat
MultiScanTableInputFormat.setTable(conf, "graph");
MultiScanTableInputFormat.addScan(conf, new Scan());
job.setInputFormatClass(MultiScanTableInputFormat.class);
java
TableMapReduceUtil
TableMapReduceUtil.initTableReducerJob("graph", MyReducer.class, job);
java
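Putting the last two slides together, a rough wiring of one job; EdgeMapper is the placeholder mapper from the earlier sketch, MyReducer is the (TableReducer) name from the slide, and MultiScanTableInputFormat is the speaker’s own class, so only the calls shown above are used.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

public class TraversalJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        MultiScanTableInputFormat.setTable(conf, "graph");
        MultiScanTableInputFormat.addScan(conf, new Scan());

        Job job = new Job(conf, "graph-traversal-iteration");
        job.setJarByClass(TraversalJob.class);
        job.setInputFormatClass(MultiScanTableInputFormat.class);
        job.setMapperClass(EdgeMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);

        // MyReducer extends TableReducer and writes Puts back into "graph".
        TableMapReduceUtil.initTableReducerJob("graph", MyReducer.class, job);

        job.waitForCompletion(true);
    }
}
java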
Elastic MapReduce
Pipeline: HFiles → SequenceFiles → copy to S3 → Elastic MapReduce → SequenceFiles → HFiles → HBase

HFileOutputFormat.configureIncrementalLoad(job, outputTable)

$ hadoop jar hbase-VERSION.jar completebulkload
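A rough sketch of the HFile-writing step at the end of that pipeline, using the 0.92-era API from the slide; the S3 input path, the local output path, and the mapper that turns SequenceFile records back into Puts are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BulkLoadPrep {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = new Job(conf, "graph-bulk-load-prep");
        job.setJarByClass(BulkLoadPrep.class);

        // Mapper (not shown) converts SequenceFile records back into Puts.
        FileInputFormat.addInputPath(job, new Path("s3://bucket/emr-output")); // placeholder
        FileOutputFormat.setOutputPath(job, new Path("/tmp/hfiles"));          // placeholder

        // Configures partitioning and sorting so reducers write region-aligned HFiles.
        HTable outputTable = new HTable(conf, "graph");
        HFileOutputFormat.configureIncrementalLoad(job, outputTable);

        job.waitForCompletion(true);
        // then: hadoop jar hbase-VERSION.jar completebulkload /tmp/hfiles graph
    }
}
java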
Additional Resources
Google Pregel: BSP-based graph processing system
Apache Giraph: Implementation of Pregel for Hadoop
MultiScanTableInputFormat: (code to appear on GitHub)
Apache Mahout: Distributed machine learning on Hadoop