big graph data

AURELIUS THINKAURELIUS.COM

BIG GRAPH DATA Understanding a Complex World

Matthias Broecheler, CTO @mbroecheler November XIII, MMXII


I Graph Foundation

Graph

name: Jupiter type: god

name: Pluto type: god

name: Neptune type: god

name: Hercules type: demigod

name: Cerberus type: monster

name: Alcmene type: god

name: Saturn type: titan

Vertex Property

Graph








father father

mother brother

brother battled

pet

time:12

Edge

Edge Property

Edge Type

Path








father father

mother brother

brother battled

pet

time:12

Degree








father father

mother brother

brother battled

pet

time:12


I Connected World

HEALTH

ECONOMY

Social Systems


III Titan Graph Database

  Numerous Concurrent Users   Many Short Transactions

  read/write

  Real-time Traversals (OLTP)   High Availability   Dynamic Scalability   Variable Consistency Model

  ACID or eventual consistency

 Real-time Big Graph Data

Titan Features

Storage Backends

Partitionability

Availability Consistency

Titan Features

I.  Data Management

II. Vertex-Centric Indices

Titan Features

III.  Graph Partitioning

IV.  Edge Compression

Titan Ecosystem

  Native Blueprints Implementation

  Gremlin Query Language

  Rexster Server   any Titan graph can be

exposed as a REST endpoint

Generic Graph API

Dataflow Processing

TraversalLanguage

Object-GraphMapper

GraphAlgorithms

GraphServer


IV Github Network

Setup

$ ./titan-0.1.0/bin/gremlin.sh! ! ! !\,,,/! (o o)!-----oOOo-(_)-oOOo-----!gremlin> g = TitanFactory.open('/tmp/titan')!==>titangraph[local:/tmp/titan]!

Titan Storage Model

  Adjacency list in one column family

  Row key = vertex id

  Each property and edge in one column   Denormalized, i.e. stored twice

  Direction and label/key as column prefix   Use slice predicate for quick retrieval

5

5

USER

REPOSITORY

COMMENT

ISSUE COMMIT

PAGE

edited

pushed

in

opened

to on on

on

created

Defining Property Keys

gremlin> g.makeType().name(‘username’).!! ! ! dataType(String.class).!! ! ! functional().!! ! ! indexed().unique().!! ! ! makePropertyKey()!

gremlin> g.makeType().name(‘time’).!! ! ! dataType(Long.class).!! ! ! functional().makePropertyKey()!

Defining Edge Labels

gremlin> g.makeType().name(‘on’).!! ! ! makeEdgeLabel()!

gremlin> g.makeType().name(‘pushed’).!! ! ! primaryKey(time).!! ! ! makeEdgeLabel()!

gremlin> g.makeType().name(‘in’).!! ! ! unidirected().!! ! ! makeEdgeLabel()!

Create & Retrieve

gremlin> v = g.addVertex([username: ‘okram’])!==>v[4]!gremlin> v.map!==>{username=okram}!gremlin> g.V('username','okram')!==>v[4]!

Titan Locking

  Locking ensures consistency when it is needed

  Titan uses time stamped consistent reads and writes on separate CFs for locking

  Uses   Property uniqueness: .unique()

  Functional edges: .functional()

  Global ID management

5

9

name : Hercules

name : Hercules

name : Jupiter

name : Pluto

father

father

x

Titan Indexing

  Vertices can be retrieved by property key + value

  Titan maintains index in a separate column family as graph is updated

  Only need to define a property key as .index()

5

9

name : Hercules

name : Jupiter

Basic Queries

gremlin> v.out(‘pushed’)!gremlin> v.out(‘pushed’).out(‘to’).name!gremlin> v.out(‘pushed’).out(‘to’).dedup.name!gremlin> v.out(‘pushed’).out(‘to’).dedup.!! ! ! name.sort{it}!

gremlin> v.outE(‘pushed’).has(‘time’,T.gt,1000).inV!

Basic Queries

gremlin> v.out(‘pushed’)!gremlin> v.out(‘pushed’).out(‘to’).name!gremlin> v.out(‘pushed’).out(‘to’).dedup.name!gremlin> v.out(‘pushed’).out(‘to’).dedup.!! ! ! name.sort{it}!

gremlin> v.outE(‘pushed’).has(‘time’,T.gt,1000).inV!

Query Optimization

Vertex-Centric Indices

  Sort and index edges per vertex by primary key   Primary key can be composite

  Enables efficient focused traversals   Only retrieve edges that matter

  Uses push down predicates for quick, index-driven retrieval

v

time: 1

fought fought father

mother

battled battled battled

battled

time: 3 time: 5

time: 9 v.query()!

v

time: 1

father

mother


battled

time: 3 time: 5

time: 9 v.query()! .direction(OUT)!

v

time: 1


battled

time: 3 time: 5

time: 9 v.query()! .direction(OUT)! .labels(‘battled’)!

v

time: 1

battled battled

time: 3

v.query()! .direction(OUT)! .labels(‘battled’)! .has(‘time,T.lt,5)!

Recommendation Engine

gremlin> v.out('pushed').out('to')[0..9].!! ! ! in('to').in('pushed')[0..500].!! ! ! except([v]).name.!! ! ! groupCount.cap.next().sort{-it.value}[0..4]!



v = g.V(‘username’,’okram’):!==>lvca=175!==>spmallette=56!==>sgomezvillamor=36!==>mbroecheler=33!==>joshsh=20!



v = g.V(‘username’,’torvalds’):!==>iksaif=90!==>rjwysocki=22!==>kernel-digger=20!==>giuseppecalderaro=16!==>groeck=15!

Titan Embedding

  Rexster RexPro   lightweight Gremlin

Server

  based on Grizzly

  Titan Gremlin Engine

  Embedded Storage Backend   in-JVM method calls

Graph Partitioning

Goal: Vertex Co-location   Titan maintains multiple

ID Pools

  Ordered Partitioner in Storage Backend

  Dynamically determines optimal partition and allocates corresponding IDs

ID Pool

What’s coming

  Full-text indexing   external index system integration

  Bulk Loading   integration with storage backend

utilities and Hadoop ingestion

  240 Billion Edge Benchmark   performance analysis and improvements

across the entire stack


V Faunus Graph Analytics

  Hadoop-based Graph Computing Framework

  Graph Analytics

  Breadth-first Traversals

  Global Graph Computations

 Batch Big Graph Data

Faunus Features

Faunus Architecture

g._()!

Faunus Work Flow

hdfs://user/ubuntu/

output/job-0/

output/job-1/

output/job-2/ { graph*

sideeffect*

g.V.out .out .count()

Compressed HDFS Graphs   stored in sequence files   variable length encoding   prefix compression

Faunus Setup

$ bin/gremlin.sh !

\,,,/! (o o)!-----oOOo-(_)-oOOo-----!gremlin> g = FaunusFactory.open('bin/titan-hbase.properties')!==>faunusgraph[titanhbaseinputformat]!gremlin> g.getProperties()!==>faunus.graph.input.format=com.thinkaurelius.faunus.formats.titan.hbase.TitanHBaseInputFormat!==>faunus.graph.output.format=org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat!==>faunus.sideeffect.output.format=org.apache.hadoop.mapreduce.lib.output.TextOutputFormat!==>faunus.output.location=dbpedia!==>faunus.output.location.overwrite=true!

gremlin> g._() !12/11/09 15:17:45 INFO mapreduce.FaunusCompiler: Compiled to 1 MapReduce job(s)!12/11/09 15:17:45 INFO mapreduce.FaunusCompiler: Executing job 1 out of 1: MapSequence[com.thinkaurelius.faunus.mapreduce.transform.IdentityMap.Map]!12/11/09 15:17:50 INFO mapred.JobClient: Running job: job_201211081058_0003!

Graph Analytics

gremlin> g.E.has('label',’followed').keep.!! ! !V.sideEffect('{it.degree = it.outE.count()}').!! ! !degree.groupCount!

gremlin> g.E.has('label','pushed').keep.!! ! !V.sideEffect('{it.degree = it.outE.count()}').!! ! !degree.groupCount!

Follow Degree Distribution

Follow Degree Distribution

P(k) ~ k-γ

γ = 2.2

Pushed Degree Distribution

Global Recommendations

gremlin> g.E.has('label','pushed','to').keep.!! ! !V.out('pushed').out('to').!! ! !in('to').in('pushed').!! ! !sideEffect('{it.score =it.pathCounter}').!! ! !score.order(F.decr,'name')!

# Top 5:!Jippi ! ! ! !60892182927!garbear ! ! !30095282886!FakeHeal ! ! !30038040349!brianchandotcom !24684133382!nyarla ! ! !15230275746!

What’s coming

  Faunus 0.1

  Bulk Loading   loaded graph into Titan

  loading derivations into Titan

  Extending Gremlin Support   currently only a subset is of

Gremlin implemented

  Operational Tools

I Graph = Relationship Centric

II Graph = Agile Data Model

III Graph = Algebraic Data Model

Aurelius Graph Cluster

Stores a massive-scale property graph allowing real-time traversals and updates

Batch processing of large graphs with Hadoop

Runs global graph algorithms on large, compressed,

in-memory graphs

Map/Reduce Load & Compress

Analysis results back into Titan

Apache 2

The Graph Landscape Sp

eed

of T

rave

rsal

/Pro

cess

Size of Graph Illustration only, not to scale

TINKERPOP.COM


Thanks!

Vadas Gintautas @vadasg

Marko Rodriguez @twarko

Stephen Mallette @spmallette

Daniel LaRocque


XVX Benchmark Results


XVX - I Titan Performance Evaluation on Twitter-like Benchmark

Twitter Benchmark

  1.47 billion followship edges and 41.7 million users   Loaded into Titan using BatchGraph

  Twitter in 2009, crawled by Kwak et. al

  4 Transaction Types   Create Account (1%)

  Publish tweet (15%)

  Read stream (76%)

  Recommendation (8%)   Follow recommended user (30%)

Kwak, H., Lee, C., Park, H., Moon, S., “What is Twitter, a Social Network or a News Media?,” World Wide Web Conference, 2010.

Benchmark Setup

  6 cc1.4xl Cassandra nodes   in one placement group

  Cassandra 1.10

  40 m1.small worker machines   repeatedly running transactions

  simulating servers handling user requests

  EC2 cost: $11/hour

Benchmark Results

Transaction Type Number of tx Mean tx time Std of tx time

Create account 379,019 115.15 ms 5.88 ms

Publish tweet 7,580,995 18.45 ms 6.34 ms

Read stream 37,936,184 6.29 ms 1.62 ms

Recommendation 3,793,863 67.65 ms 13.89 ms

Total 49,690,061

Runtime 2.3 hours 5,900 tx/sec

Peak Load Results

Transaction Type Number of tx Mean tx time Std of tx time

Create account 374,860 172.74 ms 10.52 ms

Publish tweet 7,517,667 70.07 ms 19.43 ms

Read stream 37,618,648 24.40 ms 3.18 ms

Recommendation 3,758,266 229.83 ms 29.08 ms

Total 49,269,441

Runtime 1.3 hours 10,200 tx/sec

Benchmark Conclusion

Titan can handle 10s of thousands of concurrent users with short response 5mes even for complex traversals on a simulated social networking applica5on based on real-‐world network data with billions of edges and millions of users in a standard EC2 deployment.

For more informa5on on the benchmark: hDp://thinkaurelius.com/2012/08/06/5tan-‐provides-‐real-‐5me-‐big-‐graph-‐data/

big graph data

Technology

brother time

retrievegremlin v

battledbattledbattled

recommendation enginegremlin

god type

basic queriesgremlin

hercules type

cerberus type