big graph data
DESCRIPTION
The problems we are faced with in the 21st century require efficient analysis of ever more complex systems. This presentation outlines how such problems can be better understood and effectively solved if they are modeled as graphs or networks. We present two tools for to help solve such problems at scale: Titan, which is a real-time distributed graph database based on Apache Cassandra and Hbase and Faunus, which is a batch analytics framework for graphs based on Apache Hadoop. We discuss their current development status as of November 2012 and illustrate an example application for the GitHub coding network.TRANSCRIPT
AURELIUS THINKAURELIUS.COM
BIG GRAPH DATA Understanding a Complex World
Matthias Broecheler, CTO @mbroecheler November XIII, MMXII
AURELIUS THINKAURELIUS.COM
I Graph Foundation
Graph
name: Jupiter type: god
name: Pluto type: god
name: Neptune type: god
name: Hercules type: demigod
name: Cerberus type: monster
name: Alcmene type: god
name: Saturn type: titan
Vertex Property
Graph
name: Jupiter type: god
name: Pluto type: god
name: Neptune type: god
name: Hercules type: demigod
name: Cerberus type: monster
name: Alcmene type: god
name: Saturn type: titan
father father
mother brother
brother battled
pet
time:12
Edge
Edge Property
Edge Type
Path
name: Jupiter type: god
name: Pluto type: god
name: Neptune type: god
name: Hercules type: demigod
name: Cerberus type: monster
name: Alcmene type: god
name: Saturn type: titan
father father
mother brother
brother battled
pet
time:12
Degree
name: Jupiter type: god
name: Pluto type: god
name: Neptune type: god
name: Hercules type: demigod
name: Cerberus type: monster
name: Alcmene type: god
name: Saturn type: titan
father father
mother brother
brother battled
pet
time:12
AURELIUS THINKAURELIUS.COM
I Connected World
HEALTH
HEALTH
HEALTH
HEALTH
ECONOMY
ECONOMY
ECONOMY
ECONOMY
Social Systems
Social Systems
Social Systems
Social Systems
AURELIUS THINKAURELIUS.COM
III Titan Graph Database
Numerous Concurrent Users Many Short Transactions
read/write
Real-time Traversals (OLTP) High Availability Dynamic Scalability Variable Consistency Model
ACID or eventual consistency
Real-time Big Graph Data
Titan Features
Storage Backends
Partitionability
Availability Consistency
Titan Features
I. Data Management
II. Vertex-Centric Indices
Titan Features
III. Graph Partitioning
IV. Edge Compression
Titan Ecosystem
Native Blueprints Implementation
Gremlin Query Language
Rexster Server any Titan graph can be
exposed as a REST endpoint
Generic Graph API
Dataflow Processing
TraversalLanguage
Object-GraphMapper
GraphAlgorithms
GraphServer
AURELIUS THINKAURELIUS.COM
IV Github Network
Setup
$ ./titan-0.1.0/bin/gremlin.sh! ! ! !\,,,/! (o o)!-----oOOo-(_)-oOOo-----!gremlin> g = TitanFactory.open('/tmp/titan')!==>titangraph[local:/tmp/titan]!
Titan Storage Model
Adjacency list in one column family
Row key = vertex id
Each property and edge in one column Denormalized, i.e. stored twice
Direction and label/key as column prefix Use slice predicate for quick retrieval
5
5
USER
REPOSITORY
COMMENT
ISSUE COMMIT
PAGE
edited
pushed
in
opened
to on on
on
created
Defining Property Keys
gremlin> g.makeType().name(‘username’).!! ! ! dataType(String.class).!! ! ! functional().!! ! ! indexed().unique().!! ! ! makePropertyKey()!
gremlin> g.makeType().name(‘time’).!! ! ! dataType(Long.class).!! ! ! functional().makePropertyKey()!
Defining Edge Labels
gremlin> g.makeType().name(‘on’).!! ! ! makeEdgeLabel()!
gremlin> g.makeType().name(‘pushed’).!! ! ! primaryKey(time).!! ! ! makeEdgeLabel()!
gremlin> g.makeType().name(‘in’).!! ! ! unidirected().!! ! ! makeEdgeLabel()!
Create & Retrieve
gremlin> v = g.addVertex([username: ‘okram’])!==>v[4]!gremlin> v.map!==>{username=okram}!gremlin> g.V('username','okram')!==>v[4]!
Titan Locking
Locking ensures consistency when it is needed
Titan uses time stamped consistent reads and writes on separate CFs for locking
Uses Property uniqueness: .unique()
Functional edges: .functional()
Global ID management
5
9
name : Hercules
name : Hercules
name : Jupiter
name : Pluto
father
father
x
Titan Indexing
Vertices can be retrieved by property key + value
Titan maintains index in a separate column family as graph is updated
Only need to define a property key as .index()
5
9
name : Hercules
name : Jupiter
Basic Queries
gremlin> v.out(‘pushed’)!gremlin> v.out(‘pushed’).out(‘to’).name!gremlin> v.out(‘pushed’).out(‘to’).dedup.name!gremlin> v.out(‘pushed’).out(‘to’).dedup.!! ! ! name.sort{it}!
gremlin> v.outE(‘pushed’).has(‘time’,T.gt,1000).inV!
Basic Queries
gremlin> v.out(‘pushed’)!gremlin> v.out(‘pushed’).out(‘to’).name!gremlin> v.out(‘pushed’).out(‘to’).dedup.name!gremlin> v.out(‘pushed’).out(‘to’).dedup.!! ! ! name.sort{it}!
gremlin> v.outE(‘pushed’).has(‘time’,T.gt,1000).inV!
Query Optimization
Vertex-Centric Indices
Sort and index edges per vertex by primary key Primary key can be composite
Enables efficient focused traversals Only retrieve edges that matter
Uses push down predicates for quick, index-driven retrieval
v
time: 1
fought fought father
mother
battled battled battled
battled
time: 3 time: 5
time: 9 v.query()!
v
time: 1
father
mother
battled battled battled
battled
time: 3 time: 5
time: 9 v.query()! .direction(OUT)!
v
time: 1
battled battled battled
battled
time: 3 time: 5
time: 9 v.query()! .direction(OUT)! .labels(‘battled’)!
v
time: 1
battled battled
time: 3
v.query()! .direction(OUT)! .labels(‘battled’)! .has(‘time,T.lt,5)!
Recommendation Engine
gremlin> v.out('pushed').out('to')[0..9].!! ! ! in('to').in('pushed')[0..500].!! ! ! except([v]).name.!! ! ! groupCount.cap.next().sort{-it.value}[0..4]!
Recommendation Engine
gremlin> v.out('pushed').out('to')[0..9].!! ! ! in('to').in('pushed')[0..500].!! ! ! except([v]).name.!! ! ! groupCount.cap.next().sort{-it.value}[0..4]!
v = g.V(‘username’,’okram’):!==>lvca=175!==>spmallette=56!==>sgomezvillamor=36!==>mbroecheler=33!==>joshsh=20!
Recommendation Engine
gremlin> v.out('pushed').out('to')[0..9].!! ! ! in('to').in('pushed')[0..500].!! ! ! except([v]).name.!! ! ! groupCount.cap.next().sort{-it.value}[0..4]!
v = g.V(‘username’,’torvalds’):!==>iksaif=90!==>rjwysocki=22!==>kernel-digger=20!==>giuseppecalderaro=16!==>groeck=15!
Titan Embedding
Rexster RexPro lightweight Gremlin
Server
based on Grizzly
Titan Gremlin Engine
Embedded Storage Backend in-JVM method calls
Graph Partitioning
Goal: Vertex Co-location Titan maintains multiple
ID Pools
Ordered Partitioner in Storage Backend
Dynamically determines optimal partition and allocates corresponding IDs
ID Pool
What’s coming
Full-text indexing external index system integration
Bulk Loading integration with storage backend
utilities and Hadoop ingestion
240 Billion Edge Benchmark performance analysis and improvements
across the entire stack
AURELIUS THINKAURELIUS.COM
V Faunus Graph Analytics
Hadoop-based Graph Computing Framework
Graph Analytics
Breadth-first Traversals
Global Graph Computations
Batch Big Graph Data
Faunus Features
Faunus Architecture
g._()!
Faunus Work Flow
hdfs://user/ubuntu/
output/job-0/
output/job-1/
output/job-2/ { graph*
sideeffect*
g.V.out .out .count()
Compressed HDFS Graphs stored in sequence files variable length encoding prefix compression
Faunus Setup
$ bin/gremlin.sh !
\,,,/! (o o)!-----oOOo-(_)-oOOo-----!gremlin> g = FaunusFactory.open('bin/titan-hbase.properties')!==>faunusgraph[titanhbaseinputformat]!gremlin> g.getProperties()!==>faunus.graph.input.format=com.thinkaurelius.faunus.formats.titan.hbase.TitanHBaseInputFormat!==>faunus.graph.output.format=org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat!==>faunus.sideeffect.output.format=org.apache.hadoop.mapreduce.lib.output.TextOutputFormat!==>faunus.output.location=dbpedia!==>faunus.output.location.overwrite=true!
gremlin> g._() !12/11/09 15:17:45 INFO mapreduce.FaunusCompiler: Compiled to 1 MapReduce job(s)!12/11/09 15:17:45 INFO mapreduce.FaunusCompiler: Executing job 1 out of 1: MapSequence[com.thinkaurelius.faunus.mapreduce.transform.IdentityMap.Map]!12/11/09 15:17:50 INFO mapred.JobClient: Running job: job_201211081058_0003!
Graph Analytics
gremlin> g.E.has('label',’followed').keep.!! ! !V.sideEffect('{it.degree = it.outE.count()}').!! ! !degree.groupCount!
gremlin> g.E.has('label','pushed').keep.!! ! !V.sideEffect('{it.degree = it.outE.count()}').!! ! !degree.groupCount!
Follow Degree Distribution
Follow Degree Distribution
P(k) ~ k-γ
γ = 2.2
Pushed Degree Distribution
Global Recommendations
gremlin> g.E.has('label','pushed','to').keep.!! ! !V.out('pushed').out('to').!! ! !in('to').in('pushed').!! ! !sideEffect('{it.score =it.pathCounter}').!! ! !score.order(F.decr,'name')!
# Top 5:!Jippi ! ! ! !60892182927!garbear ! ! !30095282886!FakeHeal ! ! !30038040349!brianchandotcom !24684133382!nyarla ! ! !15230275746!
What’s coming
Faunus 0.1
Bulk Loading loaded graph into Titan
loading derivations into Titan
Extending Gremlin Support currently only a subset is of
Gremlin implemented
Operational Tools
I Graph = Relationship Centric
II Graph = Agile Data Model
III Graph = Algebraic Data Model
Aurelius Graph Cluster
Stores a massive-scale property graph allowing real-time traversals and updates
Batch processing of large graphs with Hadoop
Runs global graph algorithms on large, compressed,
in-memory graphs
Map/Reduce Load & Compress
Analysis results back into Titan
Apache 2
The Graph Landscape Sp
eed
of T
rave
rsal
/Pro
cess
Size of Graph Illustration only, not to scale
TINKERPOP.COM
AURELIUS THINKAURELIUS.COM
Thanks!
Vadas Gintautas @vadasg
Marko Rodriguez @twarko
Stephen Mallette @spmallette
Daniel LaRocque
AURELIUS THINKAURELIUS.COM
AURELIUS THINKAURELIUS.COM
XVX Benchmark Results
AURELIUS THINKAURELIUS.COM
XVX - I Titan Performance Evaluation on Twitter-like Benchmark
Twitter Benchmark
1.47 billion followship edges and 41.7 million users Loaded into Titan using BatchGraph
Twitter in 2009, crawled by Kwak et. al
4 Transaction Types Create Account (1%)
Publish tweet (15%)
Read stream (76%)
Recommendation (8%) Follow recommended user (30%)
Kwak, H., Lee, C., Park, H., Moon, S., “What is Twitter, a Social Network or a News Media?,” World Wide Web Conference, 2010.
Benchmark Setup
6 cc1.4xl Cassandra nodes in one placement group
Cassandra 1.10
40 m1.small worker machines repeatedly running transactions
simulating servers handling user requests
EC2 cost: $11/hour
Benchmark Results
Transaction Type Number of tx Mean tx time Std of tx time
Create account 379,019 115.15 ms 5.88 ms
Publish tweet 7,580,995 18.45 ms 6.34 ms
Read stream 37,936,184 6.29 ms 1.62 ms
Recommendation 3,793,863 67.65 ms 13.89 ms
Total 49,690,061
Runtime 2.3 hours 5,900 tx/sec
Peak Load Results
Transaction Type Number of tx Mean tx time Std of tx time
Create account 374,860 172.74 ms 10.52 ms
Publish tweet 7,517,667 70.07 ms 19.43 ms
Read stream 37,618,648 24.40 ms 3.18 ms
Recommendation 3,758,266 229.83 ms 29.08 ms
Total 49,269,441
Runtime 1.3 hours 10,200 tx/sec
Benchmark Conclusion
Titan can handle 10s of thousands of concurrent users with short response 5mes even for complex traversals on a simulated social networking applica5on based on real-‐world network data with billions of edges and millions of users in a standard EC2 deployment.
For more informa5on on the benchmark: hDp://thinkaurelius.com/2012/08/06/5tan-‐provides-‐real-‐5me-‐big-‐graph-‐data/