problem-solving using graph traversals: searching, scoring, ranking, and recommendation
DESCRIPTION
A graph is a data structure that links a set of vertices by a set of edges. Modern graph databases support multi-relational graph structures, where there exist different types of vertices (e.g. people, places, items) and different types of edges (e.g. friend, lives at, purchased). By means of index-free adjacency, graph databases are optimized for graph traversals and are interacted with through a graph traversal engine. A graph traversal is defined as an abstract path whose instance is realized on a graph dataset. Graph databases and traversals can be used for searching, scoring, ranking, and in concert, recommendation. This presentation will explore graph structures, algorithms, traversal algebras, graph-related software suites, and a host of examples demonstrating how to solve real-world problems, in real-time, with graphs. This is a whirlwind tour of the theory and application of graphs.
Problem-Solving using Graph Traversals
Searching, Scoring, Ranking, and Recommendation
Marko A. Rodriguez
Graph Systems Architect
http://markorodriguez.com
http://twitter.com/twarko
AT&Ti Technical Talk - Glendale, California – July 27, 2010
July 26, 2010
Outline
• Graph Structures, Algorithms, and Algebras
• Graph Databases and the Property Graph
• TinkerPop Open-Source Graph Product Suite
• Real-Time, Real-World Use Cases for Graphs
[Figure: Difficulty Chart. Axes: difficulty vs. time. Sections along the time axis: algebra, graphs, databases, indices, data models, software, algorithms, real-world, conclusion.]
G = (V,E)
A Vertex
There once was a vertex i ∈ V named tenderlove.
Two Vertices
And then came along another vertex j ∈ V named sixwing. Thus, i, j ∈ V.
A Directed Edge
Our tenderlove extended a relationship to sixwing. Thus, (i, j) ∈ E.
The Single-Relational, Directed Graph
More vertices join, create edges and, in turn, the graph grows...
The Single-Relational, Directed Graph as a Matrix
A single-relational graph defined as
$G = (V, E \subseteq (V \times V))$
can be represented as the adjacency matrix $A \in \{0, 1\}^{n \times n}$, where
\[
A_{i,j} =
\begin{cases}
1 & \text{if } (i, j) \in E \\
0 & \text{otherwise.}
\end{cases}
\]
The Single-Relational, Directed Graph as a Matrix
[Figure: a single-relational, directed graph G shown next to its adjacency matrix A.]
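The definition of A above can be sketched in a few lines. This is only an illustration: the three vertex names and the edge list are assumptions made for the example, and NumPy is used as a convenient stand-in for a matrix representation.

```python
import numpy as np

# Hypothetical vertices and directed edges, chosen only for illustration.
vertices = ["tenderlove", "sixwing", "marko"]
edges = [("tenderlove", "sixwing"), ("sixwing", "marko"), ("marko", "tenderlove")]


def adjacency_matrix(vertices, edges):
    """Return A in {0,1}^(n x n) with A[i, j] = 1 iff (i, j) is in E."""
    index = {name: i for i, name in enumerate(vertices)}
    A = np.zeros((len(vertices), len(vertices)), dtype=int)
    for tail, head in edges:
        A[index[tail], index[head]] = 1
    return A


A = adjacency_matrix(vertices, edges)
```

Row i of A lists the vertices that i points to, and column j lists the vertices that point to j.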
The Single-Relational, Directed Graph
• All vertices are homogeneous in meaning: all vertices denote the same type of object (e.g. people, webpages, etc.).¹
• All edges are homogeneous in meaning: all edges denote the same type of relationship (e.g. friendship, works with, etc.).²
¹ This is not completely true. All n-partite single-relational graphs allow for the division of the vertex set into n subsets, where $V = \bigcup_{i}^{n} A_i : A_i \cap A_j = \emptyset$. Thus, it's possible to implicitly type the vertices.
² This is not completely true. There exists an injective, information-preserving function that maps any multi-relational graph to a single-relational graph, where edge types are denoted by topological structures. Thus, at a "higher-level," it is possible to create a heterogeneous set of relationships.
Rodriguez, M.A., "Mapping Semantic Networks to Undirected Networks," International Journal of Applied Mathematics and Computer Sciences, 5(1), pp. 39–42, 2009. [http://arxiv.org/abs/0804.0277]
Applications of Single-Relational Graphs
• Social: define how people interact (collaborators, friends, kin).
• Biological: define how biological components interact (protein, food chains, gene regulation).
• Transportation: define how cities are joined by air and road routes.
• Dependency: define how software modules, data sets, and functions depend on each other.
• Technology: define the connectivity of Internet routers, web pages, etc.
• Language: define the relationships between words.
The Limitations of Single-Relational Graph Modeling
[Figure: three separate graphs: Friendship Graph, Favorite Graph, Works-For Graph.]
Unfortunately, single-relational graphs are independent of each other. This is because G = (V, E): there is only a single edge set E (i.e. a single type of relation).
Numerous Algorithms for Single-Relational Graphs
We would like a more flexible graph modeling construct, but unfortunately, most of our graph algorithms were designed for single-relational graphs.³
• Geodesic: diameter, radius, eccentricity, closeness, betweenness, etc.
• Spectral: random walks, PageRank, eigenvector centrality, spreading activation, etc.
• Assortativity: scalar, categorical, hierarchical, etc.
• Others: ...⁴
We can solve this with multi-relational graphs and a path algebra.
³ For a fine book on graph analysis algorithms, please see: Brandes, U., Erlebach T., "Network Analysis: Methodological Foundations," edited book, Springer, 2005.
⁴ One of the purposes of this presentation is to advocate for local graph analysis algorithms (i.e. priors-based, relative) vs. global graph analysis algorithms. Most popular graph analysis algorithms are global in that they require an analysis of the whole graph (or a large portion of a graph) to yield results. Local analysis algorithms are dependent on sub-graphs of the whole and, in effect, can boast faster running times.
G = (V,E)
A Directed Edge
A Directed, Labeled Edge
[Figure: a directed edge labeled friend from tenderlove to sixwing.]
Let's specify the type of relationship that exists between tenderlove and sixwing. Thus, (i, j) ∈ E_friend.
Growing a Multi-Relational Graph
[Figure: friend edges in both directions between tenderlove and sixwing.]
Let's make the friendship relationship symmetric. Thus, (j, i) ∈ E_friend.
Growing a Multi-Relational Graph
[Figure: the graph with marko added; all four edges are labeled friend.]
Let's add marko to the mix: k ∈ V. This graph is still single-relational. There is only one type of relation.
Growing a Multi-Relational Graph
[Figure: the graph with four friend edges and one favorite edge.]
Let's add an (i, l) ∈ E_favorite. Now there are multiple types of relationships: E_friend and E_favorite (2 edge sets).
The Multi-Relational, Directed Graph
• At this point, there is a multi-relational, directed graph: G = (V, E), where E = (E_0, E_1, ..., E_m ⊆ (V × V)).⁵
• Vertices can denote different types of objects (e.g. people, places).⁶
• Edges can denote different types of relationships (e.g. friend, favorite).⁷
• This is the data model of the Web of Data: the RDF data model.⁸
⁵ Another representation is G ⊆ (V × Ω × V), where Ω ⊆ Σ* is the set of legal edge labels.
⁶ Vertex types can be determined by the domain and range specification of the respective edge relation/label/predicate. Or, another way, by means of an explicit typing relation such as ⟨a, type, b⟩.
⁷ Edge types are determined by the label that accompanies the edge.
⁸ This is not completely true. The vertex set is split into URIs (U), literals (L), and blank/anonymous nodes (B), such that G ⊆ ((U ∪ B) × U × (U ∪ B ∪ L)). [http://www.w3.org/RDF/]
The Multi-Relational, Directed Graph as a Tensor
A three-way tensor can be used to represent a multi-relational graph. If
$G = (V, E = \{E_0, E_1, \ldots, E_m \subseteq (V \times V)\})$
is a multi-relational graph, then $A \in \{0, 1\}^{n \times n \times m}$ and
\[
A^{k}_{i,j} =
\begin{cases}
1 & \text{if } (i, j) \in E_k : 1 \leq k \leq m \\
0 & \text{otherwise.}
\end{cases}
\]
Thus, each edge set in E represents an adjacency matrix and the combination of m adjacency matrices forms a 3-way tensor.
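As a rough sketch of this tensor, the snippet below stacks one adjacency matrix per edge label. The vertex named bookstore, the triples, and the choice of NumPy are assumptions for illustration; labels are indexed from 0 here simply because Python arrays are.

```python
import numpy as np

# Hypothetical vertices, edge labels, and labeled edges (illustration only).
vertices = ["tenderlove", "sixwing", "marko", "bookstore"]
labels = ["friend", "favorite"]
triples = [
    ("tenderlove", "friend", "sixwing"),
    ("sixwing", "friend", "tenderlove"),
    ("tenderlove", "friend", "marko"),
    ("marko", "friend", "tenderlove"),
    ("tenderlove", "favorite", "bookstore"),
]

v = {name: i for i, name in enumerate(vertices)}
k = {name: i for i, name in enumerate(labels)}

# A has shape n x n x m: one n x n adjacency matrix per edge label.
A = np.zeros((len(vertices), len(vertices), len(labels)), dtype=int)
for tail, label, head in triples:
    A[v[tail], v[head], k[label]] = 1

# Slicing along the third axis recovers a single-relational adjacency matrix.
A_friend = A[:, :, k["friend"]]
A_favorite = A[:, :, k["favorite"]]
```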
The Multi-Relational, Directed Graph as a Tensor
[Figure: the multi-relational graph G (friend and favorite edges) next to its tensor A, with one adjacency-matrix slice per edge label: friend, favorite, answers.]
Multi-Relational Graph Algorithms
“Can we evaluate single-relational graph analysis algorithms on a multi-relational graph?”
The Meaning of Edge Meanings
[Figure: tenderlove and marko, each with incoming edges labeled loves or hates.]
• Multi-relationally: tenderlove is more liked than marko.
• Single-relationally: tenderlove and marko simply have the same in-degree.
  ⋆ Given, let's say, degree-centrality, tenderlove and marko are equal as they have the same number of relationships. The edge labels do not affect the output of the degree-centrality algorithm.
What Do You Mean By “Central?”
[Figure: a multi-relational graph with friend, favorite, question_by, answer_for, and answer_by edges around the question "What is your favorite bookstore?" and an answer.]
Let's focus specifically on centrality. What is the most central vertex in a multi-relational graph? Who is the most central friend in the graph: by friendship, by question answering, by favorites, etc.?
Primary Eigenvector
“What does the primary eigenvector of a multi-relational graph mean?”⁹ ¹⁰ ¹¹
⁹ We will use the primary eigenvector for the following argument. Note that the same argument applies for all known single-relational graph algorithms (i.e. geodesic, spectral, community detection, etc.).
¹⁰ Technical details are left aside, such as outgoing edge probability distributions and the irreducibility of the graph.
¹¹ The popular PageRank vector is defined as the primary eigenvector of a low-probability fully connected graph combined with the original graph (i.e. both graphs maintain the same V).
Primary Eigenvector: Ignoring Edge Labels
• If $\pi = B\pi$, where $B \in \mathbb{N}_{+}^{|V| \times |V|}$ is the adjacency matrix formed by merging the edge sets in E, then edge labels are ignored: all edges are treated equally.
• In this "ignoring labels"-model, there is only one primary eigenvector for the graph: one definition of centrality.
• With a heterogeneous set of vertices connected by a heterogeneous set of edges, what does this type of centrality mean?
Primary Eigenvector: Isolating Subgraphs
• Are there other primary eigenvectors in the multi-relational graph?
• You can ignore certain edge sets and calculate the primary eigenvector (e.g. pull out the single-relational "friend"-graph).
  ⋆ $\pi = A^{\text{friend}}\pi$, where $A^{\text{friend}} \in \{0, 1\}^{|V| \times |V|}$ is the adjacency matrix formed by the edge set $E_{\text{friend}}$.
• Thus, you can isolate subgraphs (i.e. adjacency matrices) of the multi-relational graph and calculate the primary eigenvector for those subgraphs.
• In this "isolation"-model, there are m definitions of centrality: one for each isolated subgraph.¹²
¹² Remember, $A \in \{0, 1\}^{n \times n \times m}$.
Ultimately what we want is...
Primary Eigenvector: Turing Completeness
• What about using paths through the graph, not simply explicit one-step edges?
• What about determining centrality for a relation that isn't explicit in E (i.e. $A^k \in A$)? In general, what about $\pi = X\pi$, where X is a derived adjacency matrix of the multi-relational graph?
  ⋆ For example, if I know who everyone's friends are, then I know (i.e. can infer, derive, compute) who everyone's friends-of-a-friend (FOAF) are. What about the primary eigenvector of the derived FOAF graph?
• In the end, you want a Turing-complete framework: you want complete control (universal computability) over how π moves through the multi-relational graph structure.¹³
¹³ These ideas are expounded upon at great length throughout this presentation.
A Path Algebra for Evaluating Single-Relational Algorithms on Multi-Relational Graphs
• There exists a multi-relational graph algebra for mapping single-relational graph analysis algorithms to the multi-relational domain.¹⁴
• The algebra works on a tensor representation of a multi-relational graph.
• In this framework and given the running example, there are as many primary eigenvectors as there are abstract path definitions.
¹⁴ * Rodriguez M.A., Shinavier, J., "Exposing Multi-Relational Networks to Single-Relational Network Analysis Algorithms," Journal of Informetrics, 4(1), pp. 29–41, doi:10.1016/j.joi.2009.06.004, 2009. [http://arxiv.org/abs/0806.2274]
* Rodriguez, M.A., "Grammar-Based Random Walkers in Semantic Networks," Knowledge-Based Systems, 21(7), pp. 727–739, doi:10.1016/j.knosys.2008.03.030, 2008. [http://arxiv.org/abs/0803.4355]
* Rodriguez, M.A., Watkins, J., "Grammar-Based Geodesics in Semantic Networks," Knowledge-Based Systems, in press, doi:10.1016/j.knosys.2010.05.009, 2010.
The Operations of the Multi-Relational Path Algebra
• $A \cdot B$: ordinary matrix multiplication determines the number of (A, B)-paths between vertices.
• $A^{\top}$: matrix transpose inverts path directionality.
• $A \circ B$: Hadamard (entry-wise) multiplication applies a filter to selectively exclude paths.
• $n(A)$: not generates the complement of a $\{0, 1\}^{n \times n}$ matrix.
• $c(A)$: clip generates a $\{0, 1\}^{n \times n}$ matrix from a $\mathbb{R}_{+}^{n \times n}$ matrix.
• $v^{\pm}(A)$: vertex generates a $\{0, 1\}^{n \times n}$ matrix from a $\mathbb{R}_{+}^{n \times n}$ matrix, where only certain rows or columns contain non-zero values.
• $xA$: scalar multiplication weights the entries of a matrix.
• $A + B$: matrix addition merges paths.
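Most of the operations listed above map directly onto array operations. The sketch below is an illustration only: the 3-vertex friend matrix is made up, the helper names n and c simply mirror the slide notation, and the vertex filter v± is left out because its exact definition is not given here.

```python
import numpy as np


def n(A):
    """not: complement of a {0,1} matrix."""
    return 1 - A


def c(A):
    """clip: map every positive entry to 1 and everything else to 0."""
    return (A > 0).astype(int)


# Hypothetical friend slice over (tenderlove, sixwing, marko).
A_friend = np.array([[0, 1, 1],
                     [1, 0, 0],
                     [1, 0, 0]])
I = np.eye(3, dtype=int)

two_step_paths = A_friend @ A_friend             # A · A counts friend-friend paths
reversed_edges = A_friend.T                      # transpose inverts path direction
foaf_no_self = c(two_step_paths) * n(I)          # Hadamard product filters out the diagonal
merged = 0.65 * A_friend + 0.35 * foaf_no_self   # scalar weights, then matrix addition
```

Here sixwing and marko become friends-of-a-friend through tenderlove, while the diagonal (a vertex reaching itself in two steps) is filtered out.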
Primary Eigenvectors in a Multi-Relational Graph
• Friend: $\left(A^{\text{friend}}\right)\pi$
• FOAF: $\left(A^{\text{friend}} \cdot A^{\text{friend}}\right)\pi \equiv \left(A^{\text{friend}^2}\right)\pi$
• FOAF (no self): $\left(A^{\text{friend}^2} \circ n(I)\right)\pi$ ¹⁵
• FOAF (no friends nor self): $\left(A^{\text{friend}^2} \circ n\left(A^{\text{friend}}\right) \circ n(I)\right)\pi$
• Co-Worker: $\left(\left(A^{\text{works\_at}} \cdot A^{\text{works\_at}\top}\right) \circ n(I)\right)\pi$
• Friend-or-CoWorker: $\left(0.65\,A^{\text{friend}} + 0.35\left(\left(A^{\text{works\_at}} \cdot A^{\text{works\_at}\top}\right) \circ n(I)\right)\right)\pi$
• ...and more.¹⁶
¹⁵ $I \in \{0, 1\}^{|V| \times |V|} : I_{i,i} = 1$ is the identity matrix.
¹⁶ Note, again, that the examples are with respect to determining the primary eigenvector of the derived adjacency matrix. The same argument holds for all other single-relational graph analysis algorithms. In general, the path algebra provides a means of creating "higher-order" (i.e. semantically-rich) single-relational graphs from a single multi-relational graph. Thus, these derived matrices can be subjected to standard single-relational graph analysis algorithms.
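To make the idea of "one eigenvector per derived matrix" concrete, here is a minimal sketch that runs power iteration on two matrices derived from the same hypothetical friend graph. The damping step follows the PageRank-style mixing with a fully connected graph mentioned in the footnotes; the function name, the damping value of 0.85, and the 4-vertex graph are assumptions, not part of the original material.

```python
import numpy as np


def primary_eigenvector(X, damping=0.85, iterations=100):
    """Power iteration on a derived adjacency matrix X (PageRank-style sketch)."""
    size = X.shape[0]
    row_sums = X.sum(axis=1, keepdims=True)
    safe = np.where(row_sums == 0, 1, row_sums)
    # Row-normalize into transition probabilities; dead-end rows jump uniformly.
    P = np.where(row_sums > 0, X / safe, 1.0 / size)
    # Mix with a low-probability fully connected graph so the walk converges.
    G = damping * P + (1.0 - damping) / size
    pi = np.full(size, 1.0 / size)
    for _ in range(iterations):
        pi = pi @ G
    return pi


# Hypothetical symmetric friend graph on 4 vertices: 0-1, 0-2, 2-3.
A_friend = np.array([[0, 1, 1, 0],
                     [1, 0, 0, 0],
                     [1, 0, 0, 1],
                     [0, 0, 1, 0]])
I = np.eye(4, dtype=int)

# Derived "FOAF (no self)" matrix: two-step friend paths with the diagonal removed.
A_foaf = ((A_friend @ A_friend) > 0).astype(int) * (1 - I)

pi_friend = primary_eigenvector(A_friend)
pi_foaf = primary_eigenvector(A_foaf)
```

The two vectors differ even though they come from the same underlying data, which is the point of the list above: each abstract path yields its own notion of centrality.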
Deriving “Semantically Rich” Adjacency Matrices
[Figure: the tensor A (slices friend, favorite, answers) is combined via $A^{\text{friend}} \cdot A^{\text{friend}}$ and $n(I)$ into the derived "friend-of-a-friend (no self)" matrix $A^{\text{friend}^2} \circ n(I)$, which is then added to the tensor as a new slice.]
Use the multi-relational graph to generate explicit edges that were implicitly defined as paths. Those new explicit edges can then be memoized¹⁷ and re-used (time vs. space tradeoff), aka path reuse.
¹⁷ Memoization Wikipedia entry: http://en.wikipedia.org/wiki/Memoization.
Benefits, Drawbacks, and Future of the Path Algebra
• Benefit: Provides a set of theorems for deriving equivalences and thus, provides the foundation for graph traversal engine optimizers.¹⁸ Serves a similar purpose as the relational algebra for relational databases.¹⁹
• Drawback: The algebra is represented in matrix form and thus, operationally, works globally over the graph.²⁰
• Future: A non-matrix-based, ring theoretic model of graph traversal that supports +, −, and · on individual vertices and edges. The Gremlin [http://gremlin.tinkerpop.com] graph traversal engine presented later provides the implementation before a fully-developed theory.
¹⁸ Rodriguez M.A., Shinavier, J., "Exposing Multi-Relational Networks to Single-Relational Network Analysis Algorithms," Journal of Informetrics, 4(1), pp. 29–41, 2009. [http://arxiv.org/abs/0806.2274]
¹⁹ Codd, E.F., "A Relational Model of Data for Large Shared Data Banks," Communications of the ACM, 13(6), pp. 377–387, doi:10.1145/362384.362685, 1970.
²⁰ It is possible to represent local traversals using vertex filters at the expense of clumsy notation.
Outline
• Graph Structures, Algorithms, and Algebras
• Graph Databases and the Property Graph
• TinkerPop Open-Source Graph Product Suite
• Real-Time, Real-World Use Cases for Graphs
The Simplicity of a Graph
• A graph is a simple data structure.
• A graph states that something is related to something else (the foundation of any other data structure).²¹
• It is possible to model a graph in various types of databases.²²
  ⋆ Relational database: MySQL, Oracle, PostgreSQL
  ⋆ JSON document database: MongoDB, CouchDB
  ⋆ XML document database: MarkLogic, eXist-db
  ⋆ etc.
²¹ A graph can be used to represent other data structures. This point becomes convenient when looking beyond using graphs for typical, real-world domain models (e.g. friends, favorites, etc.), and seeing their applicability in other areas such as modeling code (e.g. http://arxiv.org/abs/0802.3492), indices, etc.
²² For the sake of diagram clarity, the examples to follow are with respect to a single-relational, directed graph. Note that it is possible to model multi-relational graphs in these types of databases as well.
Representing a Graph in a Relational Database
outV | inV
------------
A | B
A | C
C | D
D | A
[Figure: the directed graph with edges A→B, A→C, C→D, and D→A.]
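The edge table above can be loaded and traversed with plain SQL. The sketch below uses Python's built-in sqlite3 module as a stand-in relational database (an assumption; the slides name MySQL, Oracle, and PostgreSQL) and the table name edges is made up. It shows that each additional traversal step becomes another self-join.

```python
import sqlite3

# In-memory database holding the outV | inV edge table from the slide.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edges (outV TEXT, inV TEXT)")
conn.executemany(
    "INSERT INTO edges VALUES (?, ?)",
    [("A", "B"), ("A", "C"), ("C", "D"), ("D", "A")],
)

# One step out of A: a single lookup on the edge table.
one_step = [row[0] for row in conn.execute(
    "SELECT inV FROM edges WHERE outV = ? ORDER BY inV", ("A",))]

# Two steps out of A: the edge table joined with itself.
two_step = [row[0] for row in conn.execute(
    "SELECT e2.inV FROM edges AS e1 "
    "JOIN edges AS e2 ON e1.inV = e2.outV "
    "WHERE e1.outV = ? ORDER BY e2.inV", ("A",))]
```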
Representing a Graph in a JSON Database
{
  "A": { "outE": ["B", "C"] },
  "B": { "outE": [] },
  "C": { "outE": ["D"] },
  "D": { "outE": ["A"] }
}
[Figure: the directed graph with edges A→B, A→C, C→D, and D→A.]
Representing a Graph in an XML Database
<graphml>
  <graph edgedefault="directed">
    <node id="A" />
    <node id="B" />
    <node id="C" />
    <node id="D" />
    <edge source="A" target="B" />
    <edge source="A" target="C" />
    <edge source="C" target="D" />
    <edge source="D" target="A" />
  </graph>
</graphml>
[Figure: the directed graph with edges A→B, A→C, C→D, and D→A.]
Defining a Graph Database
“If any database can represent a graph, then what is a graph database?”
Defining a Graph Database
A graph database is any storage system that provides index-free adjacency.²³ ²⁴
²³ There is no "official" definition of what makes a database a graph database. The one provided is my definition (respective of the influence of my collaborators in this area). However, hopefully the following argument will convince you that this is a necessary definition. Given that any database can model a graph, defining a graph database as any database that can model a graph would not provide strict enough bounds to yield a formal concept (i.e. ⊤).
²⁴ There is adjacency between the elements of an index, but if the index is not the primary data structure of concern (to the developer), then there is indirect/implicit adjacency, not direct/explicit adjacency. A graph database exposes the graph as an explicit data structure (not an implicit data structure).
Defining a Graph Database by Example
[Figure: a toy graph with vertices A, B, C, D, and E, and a gremlin (stuntman).]
Graph Databases and Index-Free Adjacency
[Figure: the toy graph with the gremlin at vertex A.]
• Our gremlin is at vertex A.
• In a graph database, vertex A has direct references to its adjacent vertices.
• The cost to move from A to B and C is constant with respect to the size of the graph. It depends only on the number of edges emanating from vertex A (local).
Graph Databases and Index-Free Adjacency
[Figure: the toy graph. Caption: The Graph (explicit).]
Non-Graph Databases and Index-Based Adjacency
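[Figure: the toy graph (vertices A to E) next to an index keyed by A, B, C, D, and E whose entries list adjacent vertices (B,C; E; D,E).]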
• Our gremlin is at vertex A.
• In a non-graph database, the gremlin needs to look at an index to determine what is adjacent to A.
• log₂(n) time cost to move to B and C. It is dependent upon the total number of vertices and edges in the database (global).
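The contrast between the two kinds of adjacency can be sketched in memory. This is an illustration under stated assumptions: the Vertex class, the function adjacent_via_index, and every edge other than A→B and A→C are hypothetical, and a sorted list with binary search stands in for a database index.

```python
from bisect import bisect_left, bisect_right


class Vertex:
    """Index-free adjacency: a vertex holds direct references to its neighbors."""

    def __init__(self, name):
        self.name = name
        self.out = []


# Hypothetical edges; only A->B and A->C are taken from the bullets above.
edge_list = [("A", "B"), ("A", "C"), ("B", "E"), ("C", "D"), ("C", "E")]

# Graph-database style: following an edge is a pointer dereference whose cost
# depends only on the out-degree of the current vertex (local).
vertices = {name: Vertex(name) for name in "ABCDE"}
for tail, head in edge_list:
    vertices[tail].out.append(vertices[head])

# Index-based style: adjacency lives in a sorted, global edge index.
sorted_edges = sorted(edge_list)
tails = [tail for tail, _ in sorted_edges]
heads = [head for _, head in sorted_edges]


def adjacent_via_index(name):
    """Binary search over all edges: O(log n) in the total number of edges."""
    lo = bisect_left(tails, name)
    hi = bisect_right(tails, name)
    return heads[lo:hi]
```

Both approaches return the same neighbors; the difference is that the index lookup grows with the size of the whole edge set, while the direct references do not.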
Non-Graph Databases and Index-Based Adjacency
[Figure: the index next to the toy graph. Captions: The Index (explicit); The Graph (implicit).]
Index-Free Adjacency
• While any database can implicitly represent a graph, only a graph database makes the graph structure explicit.²⁵
• In a graph database, each vertex serves as a "mini index" of its adjacent elements.²⁶
• Thus, as the graph grows in size, the cost of a local step remains the same.²⁷
²⁵ Please see http://markorodriguez.com/Blarko/Entries/2010/3/29_MySQL_vs._Neo4j_on_a_Large-Scale_Graph_Traversal.html for some performance characteristics of graph traversals in a relational database (MySQL) and a graph database (Neo4j).
²⁶ Each vertex can be interpreted as a "parent node" in an index with its children being its adjacent elements. In this sense, traversing a graph is analogous in many ways to traversing an index, albeit the graph is not an acyclic connected graph (tree). (A vision espoused by Craig Taverner.)
²⁷ A graph, in many ways, is like a distributed index.
Graph Databases Do Make Use of Indices
[Figure: The Graph, next to an Index of Vertices (by id) over A, B, C, D, and E.]
• There is more to the graph than the explicit graph structure.
• Indices index the vertices by their properties (e.g. ids, name, latitude).²⁸
²⁸ Graph databases can be used to create index structures. In fact, in the early days of Neo4j, Neo4j used its own graph structure to index the properties of its vertices: a graph indexing a graph. This is a thought iterated many times over by Craig Taverner, who is interested in graph databases for geo-spatial indexing/analysis.
The Patterns of Relational and Graph Databases
• In a relational database, operations are conceptualized set-theoretically with the joining of tuple structures being the means by which normalized/separated data is associated.
• In a graph database, operations are conceptualized graph-theoretically with paths over edges being the means by which non-adjacent/separated vertices are associated.²⁹
In theory and ignoring performance, both models have the same expressivity and allow for the same manipulations. But such theory does not determine intention and the mental ruts that any approach engrains. The graph database provides a novel perspective on the ancient necessity to manipulate information.
²⁹ Rodriguez, M.A., Neubauer, P., "The Graph Traversal Pattern," AT&Ti and NeoTechnology Technical Report, currently in review, 2010. [http://arxiv.org/abs/1004.1001]
Property Graphs and Graph Databases
• Most graph databases support a graph data model known as a property graph.
• A property graph is a directed, attributed, multi-relational graph. In other words, vertices and edges are equipped with a collection of key/value pairs.³⁰
³⁰ Rodriguez, M.A., Neubauer, P., "Constructions from Dots and Lines," Bulletin of the American Society for Information Science and Technology, American Society for Information Science and Technology, 2010. [http://arxiv.org/abs/1006.2361]
From a Multi-Relational Graph...
[Figure: the multi-relational graph with four friend edges and one favorite edge.]
...to a Property Graph
[Figure: the same graph with key/value properties attached. Recoverable vertex properties: name=marko, location=Santa Fe, gender=male; name=sixwing, location=West Hollywood, gender=male; lat=11111, long=22222. Recoverable edge properties: created_at=123456, created_at=234567, created_at=234567.]
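A property graph can be sketched as two small record types. The classes and the add_edge helper below are hypothetical and are not the Blueprints API; which created_at value belongs to which edge is also an assumption, since the figure does not survive in this transcript.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Vertex:
    properties: dict = field(default_factory=dict)
    out_edges: list = field(default_factory=list)


@dataclass(eq=False)
class Edge:
    label: str
    tail: Vertex
    head: Vertex
    properties: dict = field(default_factory=dict)


def add_edge(tail, label, head, **properties):
    """Create a directed, labeled edge carrying its own key/value pairs."""
    edge = Edge(label, tail, head, dict(properties))
    tail.out_edges.append(edge)
    return edge


marko = Vertex({"name": "marko", "location": "Santa Fe", "gender": "male"})
sixwing = Vertex({"name": "sixwing", "location": "West Hollywood", "gender": "male"})

# Assumed pairing of a created_at value with this particular edge.
add_edge(marko, "friend", sixwing, created_at=123456)

friend_names = [e.head.properties["name"]
                for e in marko.out_edges if e.label == "friend"]
```

Properties such as name or location stay on the element itself, while relationships that should be traversable are modeled as edges.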
Why the Property Graph Model?
• Standard single-relational graphs do not provide enough modeling flexibility for use in real-world situations.³¹
• Multi-relational graphs do, and the Web of Data (RDF) world demonstrates this to be the case in practice.
• Property graphs are perhaps more practical because not every datum needs to be "related" (e.g. age, name, etc.). Thus, the edge and key/value model is a convenient dichotomy.³²
• Property graphs provide finer granularity on the meaning of an edge, as the key/values of an edge add extra information beyond the edge label.
³¹ This is not completely true: researchers use the single-relational graph all the time. However, in most data-rich applications, it's limiting to work with a single edge type and a homogeneous population of vertices.
³² RDF has a similar argument in that literals can only be the object of a triple. However, in practice, when represented in a graph database, there is a single literal vertex denoting that literal and thus, it is traversable like any other vertex.
Graph Type Morphisms
[Figure: morphisms between graph types. Graph types: property graph, weighted graph, semantic graph, labeled graph, rdf graph, multi-graph, directed graph, undirected graph, simple graph. Morphism labels: add weight attribute; remove attributes; remove edge labels; make labels URIs; remove directionality; remove loops, directionality, and multiple edges; no op.]
Outline
• Graph Structures, Algorithms, and Algebras
• Graph Databases and the Property Graph
• TinkerPop Open-Source Graph Product Suite
• Real-Time, Real-World Use Cases for Graphs
TinkerPop: Making Stuff for the Fun of It
• Open source software group started in 2008 focusing on graph data structures, graph query engines, graph-based programming languages, and, in general, tools and techniques for working with graphs. [http://tinkerpop.com] [http://github.com/tinkerpop]
  ⋆ Current members: Marko A. Rodriguez (AT&Ti), Peter Neubauer (NeoTechnology), Joshua Shinavier (Rensselaer Polytechnic Institute), and Pavel Yaskevich ("I am no one from nowhere").
TinkerPop Productions
• Blueprints: Data Models and their Implementations
[http://blueprints.tinkerpop.com]
• Pipes: A Data Flow Framework using Process Graphs
[http://pipes.tinkerpop.com]
• Gremlin: A Graph-Based Programming Language
[http://gremlin.tinkerpop.com]
• Rexster: A RESTful Graph Shell
[http://rexster.tinkerpop.com]
  ⋆ Wreckster: A Ruby API for Rexster
  [http://github.com/tenderlove/wreckster]
There are other TinkerPop products (e.g. Ripple, LoPSideD, TwitLogic, etc.), but for the
purpose of this presentation, only the above will be discussed.
Blueprints: Data Models and their Implementations
• Blueprints is like the JDBC of the graph database community.
• Provides a Java-based interface API for the property graph data model.
  ⋆ Graph, Vertex, Edge, Index.
• Provides implementations of the interfaces for TinkerGraph, Neo4j, Sails (e.g. AllegroGraph, HyperSail, etc.), and soon (hopefully) others such as InfiniteGraph, InfoGrid, Sones, DEX, and HyperGraphDB.³³
³³ HyperGraphDB makes use of an n-ary graph structure known as a hypergraph. Blueprints, in its current form, only supports the more common binary graph.
Pipes: A Data Flow Framework using Process Graphs
• A dataflow framework with support for Blueprints-based graph processing.
• Provides a collection of "pipes" (implement Iterable and Iterator) that are connected together to form processing pipelines.
  ⋆ Filters: ComparisonFilterPipe, RandomFilterPipe, etc.
  ⋆ Traversal: VertexEdgePipe, EdgeVertexPipe, PropertyPipe, etc.
  ⋆ Splitting/Merging: CopySplitPipe, RobinMergePipe, etc.
  ⋆ Logic: OrPipe, AndPipe, etc.
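The pipeline idea can be illustrated with lazy iterators. The sketch below uses Python generators rather than the Java Pipes classes named above; the function names, the graph dictionary, and its contents are all hypothetical and only mimic the vertex-to-edge, filter, and edge-to-vertex stages.

```python
# Hypothetical graph: each vertex maps to its outgoing (label, head) edges.
graph = {
    "marko": [("friend", "tenderlove"), ("friend", "sixwing"), ("favorite", "bookstore")],
    "tenderlove": [("friend", "marko")],
    "sixwing": [],
    "bookstore": [],
}


def out_edge_pipe(vertices):
    """Vertex-to-edge step: emit every outgoing edge of each incoming vertex."""
    for vertex in vertices:
        yield from graph[vertex]


def label_filter_pipe(edges, label):
    """Filter step: let through only the edges with the given label."""
    for edge in edges:
        if edge[0] == label:
            yield edge


def in_vertex_pipe(edges):
    """Edge-to-vertex step: emit the head vertex of each incoming edge."""
    for _, head in edges:
        yield head


# Connect the pipes; nothing is computed until the pipeline is iterated.
pipeline = in_vertex_pipe(label_filter_pipe(out_edge_pipe(["marko"]), "friend"))
friends = list(pipeline)
```

Because each stage pulls from the previous one on demand, a consumer that stops early never pays for the rest of the traversal.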
Gremlin: A Graph-Based Programming Language
• A Turing-complete, graph-based programming language that compiles Gremlin syntax down to Pipes (implements JSR 223).
• Supports various language constructs: :=, foreach, while, repeat, if/else, function and path definitions, etc.
  ⋆ ./outE[@label='friend']/inV
  ⋆ ./outE[@label='friend']/inV/outE[@label='friend']/inV[g:except($ , .)]
  ⋆ g:key('name','Aaron Patterson')[0]/outE[@label='favorite']/inV/@name
Rexster: A RESTful Graph Shell
• Allows Blueprints graphs to be exposed through a RESTful API (HTTP).
• Supports stored traversals written in raw Pipes or Gremlin.
• Supports ad hoc traversals represented in Gremlin.
• Provides "helper classes" for performing search-, score-, and rank-based traversal algorithms; in concert, support for recommendation.
• Aaron Patterson (AT&Ti) maintains the Ruby connector Wreckster.
Typical TinkerPop Graph Stack
[Figure: the TinkerPop graph stack. Recoverable labels: NativeStore, TinkerGraph, Neo4j; GET http://host/resource.]
Outline
• Graph Structures, Algorithms, and Algebras
• Graph Databases and the Property Graph
• TinkerPop Open-Source Graph Product Suite
• Real-Time, Real-World Use Cases for Graphs
Using Graphs in Real-Time Systems
• Most popular graph algorithms require global graph analysis.
  ⋆ Such algorithms compute a score, a vector, etc. given the structure of the whole graph. Moreover, many of these algorithms have large running times: O(|V| + |E|), O(|V| log |V|), O(|V|²), etc.
• Many real-world situations can make use of local graph analysis.³⁴
  ⋆ Search for x starting from y.
  ⋆ Score x given its local neighborhood.
  ⋆ Rank x relative to y.
  ⋆ Recommend vertices to user x.
³⁴ Many web applications are "ego-centric" in that they are with respect to a particular user (the user logged in). In such scenarios, local graph analysis algorithms are not only prudent to use, but also beneficial in that they are faster than global graph analysis algorithms. Many of the local analysis algorithms discussed run in the sub-second range (for graphs with "natural" statistics).
Applications of Graph Databases and Traversal Engines:Searching, Scoring, and Ranking
• Searching: given a power multi-set of vertices (P(V)) and a path description (Ψ), return the vertices at the end of that path.³⁵
  ⋆ P(V) × Ψ → P(V)
• Scoring: given some vertices and a path description, return a score.
  ⋆ P(V) × Ψ → ℝ
• Ranking: given some vertices and a path description, return a map of scored vertices.
  ⋆ P(V) × Ψ → (V × ℝ)
³⁵ Use cases need not be with respect to vertices only. Edges can be searched, scored, and ranked as well. However, in order to express the ideas as simply as possible, all discussion is with respect to vertices.
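The three signatures above can be sketched as three small functions. Everything here is an assumption made for illustration: a path description Ψ is reduced to a list of edge labels, scoring is taken to be a simple path count, and the graph contents and vertex names are made up.

```python
from collections import Counter

# Hypothetical graph: each vertex maps to its outgoing (label, head) edges.
graph = {
    "marko": [("friend", "tenderlove"), ("friend", "sixwing")],
    "tenderlove": [("favorite", "bookstore_a"), ("favorite", "bookstore_b")],
    "sixwing": [("favorite", "bookstore_a")],
}


def search(starts, path):
    """P(V) x Psi -> P(V): the multiset of vertices at the end of the path."""
    current = list(starts)
    for label in path:
        current = [head
                   for vertex in current
                   for edge_label, head in graph.get(vertex, [])
                   if edge_label == label]
    return current


def score(starts, path):
    """P(V) x Psi -> R: here, simply the number of paths found."""
    return float(len(search(starts, path)))


def rank(starts, path):
    """P(V) x Psi -> (V x R): each reached vertex with its path count."""
    return Counter(search(starts, path))


# "What do my friends favor?" expressed as the path friend/favorite.
favorites_of_friends = rank(["marko"], ["friend", "favorite"])
```

Sorting the ranked map by its counts is already a rudimentary recommendation for the starting vertex.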
Applications of Graph Databases and Traversal Engines:Recommendation
• Recommendation: searching, scoring, and ranking can all be used as components of a recommendation. Thus, recommendation is founded on these more basic ideas.
  ⋆ Recommendation aids the user by allowing them to make "jumps" through the data. Items that are not explicitly connected are connected implicitly through recommendation (through some abstract path Ψ).
• The act of recommending can be seen as an attempt to increase the density of the graph around a user's vertex. For example, recommending user i ∈ V places to visit U ⊂ V will hopefully lead to edges of the form ⟨i, visited, j⟩ : ∀j ∈ U.³⁶
³⁶ A standard metric for recommendation quality is how well it predicts the user's future behavior. That is, does it predict an edge?
There Is More Than “People Who Like X Also Like Y .”
• A system need not be limited to one type of recommendation. With graph-based methods, there are as many recommendations as there are abstract paths.
• Use recommendation to aid the user in solving problems (i.e. computationally derive solutions that your data set is primed for). Examples below are with respect to problem-solving in the scholarly community.³⁷
  ⋆ Recommend articles to read. (articles)
  ⋆ Recommend collaborators to work on an idea/article with. (people)
  ⋆ Recommend a venue to submit the article to. (venues)
  ⋆ Recommend referees to an editor to review the article. (people)³⁸
  ⋆ Recommend scholars to talk to and concepts to talk to them about at the venue. (people and tags)
³⁷ Rodriguez, M.A., Allen, D.W., Shinavier, J., Ebersole, G., "A Recommender System to Support the Scholarly Communication Process," KRS-2009-02, 2009. [http://arxiv.org/abs/0905.1594]
³⁸ Rodriguez, M.A., Bollen, J., "An Algorithm to Determine Peer-Reviewers," Conference on Information and Knowledge Management (CIKM), pp. 319–328, doi:10.1145/1458082.1458127, 2008. [http://arxiv.org/abs/cs/0605112]
Real-Time, Domain-Specific, Graph-Based,Problem-Solving Engine
[Figure: Graph Data Set + Library of Path/Traversal Expressions (Ψ1, Ψ2, Ψ3, Ψ4, Ψ5, ..., Ψn) = Real-Time, Domain-Specific, Graph-Based Problem-Solving Engine.]
Your domain model (i.e. graph dataset) determines what traversals you can design, develop, and deploy. Together, these determine which types of problems you can solve automatically/computationally for yourself and your users.
Applicable in Various, Seemingly Diverse Areas
• Applications to a techno-social government (i.e. collective decision making systems).³⁹
[Figure: two plots comparing dynamically distributed democracy and direct democracy. x-axis (both): percentage of active citizens, from 100 down to 0. y-axis (first plot): error, 0.00 to 0.20. y-axis (second plot): proportion of correct decisions, 0.50 to 0.95.]
Fig. 5. The relationship between k and $e^{\text{vote}}_k$ for direct democracy (gray line) and dynamically distributed democracy (black line). The plot provides the proportion of identical, correct decisions over a simulation that was run with 1000 artificially generated networks composed of 100 citizens each.
As previously stated, let $x \in [0,1]^n$ denote the political tendency of each citizen in this population, where $x_i$ is the tendency of citizen $i$ and, for the purpose of simulation, is determined from a uniform distribution. Assume that every citizen in a population of $n$ citizens uses some social network-based system to create links to those individuals that they believe reflect their tendency the best. In practice, these links may point to a close friend, a relative, or some public figure whose political tendencies resonate with the individual. In other words, representatives are any citizens, not political candidates that serve in public office. Let $A \in [0,1]^{n \times n}$ denote the link matrix representing the network, where the weight of an edge, for the purpose of simulation, is denoted

$$A_{i,j} = \begin{cases} 1 - |x_i - x_j| & \text{if link exists} \\ 0 & \text{otherwise.} \end{cases}$$

In words, if two linked citizens are identical in their political tendency, then the strength of the link is 1.0. If their tendencies are completely opposing, then their trust (and the strength of the link) is 0.0. Note that a preferential attachment network growth algorithm is used to generate a degree distribution that is reflective of typical social networks “in the wild” (i.e. scale-free properties). Moreover, an assortativity parameter is used to bias the connections in the network towards citizens with similar tendencies. The assumption here is that given a system of this nature, it is more likely for citizens to create links to similar-minded individuals than to those whose opinions are quite different. The resultant link matrix $A$ is then normalized to be row stochastic in order to generate a probability distribution over the weights of the outgoing edges of a citizen. Figure 6 presents an example of an $n = 100$ artificially generated trust-based social network, where red denotes a tendency of 0.0, purple a tendency of 0.5, and blue a tendency of 1.0.

Fig. 6. A visualization of a network of trust links between citizens. Each citizen’s color denotes their “political tendency”, where full red is 0, full blue is 1, and purple is 0.5. The layout algorithm chosen is the Fruchterman-Reingold layout.

Given this social network infrastructure, it is possible to better ensure that the collective tendency and vote is appropriately represented through a weighting of the active, participating population. Every citizen, active or not, is initially provided with $\frac{1}{n}$ “vote power” and this is represented in the vector $\pi \in \mathbb{R}^n_+$, such that the total amount of vote power in the population is 1. Let $y \in \mathbb{R}^n_+$ denote the total amount of vote power that has flowed to each citizen over the course of the algorithm. Finally, $a \in \{0,1\}^n$ denotes whether citizen $i$ is participating ($a_i = 1$) in the current decision making process or not ($a_i = 0$). The values of $a$ are biased by an unfair coin that has probability $k$ of making the citizen an active participant and $1 - k$ of making the citizen inactive. The iterative algorithm is presented below, where $\circ$ denotes entry-wise multiplication and $\epsilon \approx 1$.

$$\begin{array}{l}
y \leftarrow 0 \\
\textbf{while } \sum_{i=1}^{i \le n} y_i < \epsilon \textbf{ do} \\
\quad y \leftarrow y + (\pi \circ a) \\
\quad \pi \leftarrow \pi \circ (1 - a) \\
\quad \pi \leftarrow A\pi \\
\textbf{end}
\end{array}$$

In words, active citizens serve as vote power “sinks” in that once they receive vote power, from themselves or from a neighbor in the network, they do not pass it on. Inactive citizens serve as vote power “sources” in that they propagate their vote power over the network links to their neighbors iteratively until all (or $\epsilon$) vote power has reached active citizens. At this point, the tendency in the active population is defined as $\delta^{tend} = x \cdot y$. Figure 4 plots the error incurred using dynamically distributed democracy (black line), where the error is defined as

$$e^{tend}_k = |d^{tend}_{100} - \delta^{tend}_k|.$$

Next, the collective vote $\delta^{vote}_k$ is determined by a weighted majority as dictated by the vote power accumulated by active participants. Figure 5 plots the proportion of votes that are different from what a fully participating population would
39 * Rodriguez, M.A., Watkins, J.H., “Revisiting the Age of Enlightenment from a Collective Decision Making Systems Perspective,” First Monday, 14(8), 2009. [http://arxiv.org/abs/0901.3929]
* Rodriguez, M.A., “Social Decision Making with Multi-Relational Networks and Grammar-Based Particle Swarms,” Hawaii International Conference on Systems Science (HICSS), pp. 39–49, 2007. [http://arxiv.org/abs/cs/0609034]
* Rodriguez, M.A., Steinbock, D.J., “A Social Network for Societal-Scale Decision-Making Systems,” Proceedings of the North American Association for Computational Social and Organizational Science Conference, 2004. [http://arxiv.org/abs/cs/0412047]
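The vote-power iteration in the excerpt above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the function name `distribute_vote_power`, the three-citizen network, the `max_iters` guard, and the choice of epsilon are all hypothetical. The sketch also assumes that `A[i][j]` is the weight of the link from citizen i to citizen j, so power flows along each citizen's outgoing links (a multiplication by the transpose of the row-stochastic matrix); that reading of the propagation step is an assumption.

```python
def distribute_vote_power(A, active, epsilon=0.999, max_iters=1000):
    """Return y, the vote power accumulated by each citizen.

    A[i][j] is the row-normalized weight of the link from citizen i to j
    (assumed orientation); active[i] is 1 if citizen i participates, else 0.
    """
    n = len(A)
    pi = [1.0 / n] * n  # every citizen starts with 1/n vote power
    y = [0.0] * n       # nothing has been absorbed yet
    # max_iters is a guard added for this sketch: power that can never
    # reach an active citizen would otherwise loop forever.
    for _ in range(max_iters):
        if sum(y) >= epsilon:
            break
        # active citizens are sinks: they absorb the power they hold
        y = [y[i] + pi[i] * active[i] for i in range(n)]
        # only inactive citizens still hold power to pass on
        pi = [pi[i] * (1 - active[i]) for i in range(n)]
        # pass the remaining power along outgoing links
        pi = [sum(pi[i] * A[i][j] for i in range(n)) for j in range(n)]
    return y


# Hypothetical network: citizen 0 is inactive and trusts citizens 1 and 2 equally.
A = [[0.0, 0.5, 0.5],
     [1.0, 0.0, 0.0],
     [1.0, 0.0, 0.0]]
active = [0, 1, 1]
x = [0.2, 0.4, 0.8]  # political tendencies

y = distribute_vote_power(A, active)
tendency = sum(xi * yi for xi, yi in zip(x, y))  # x . y
print(y, tendency)
```

In this toy case the inactive citizen's third of the vote power is split between the two active citizens, so the collective tendency is computed from the active citizens only but weighted by the power delegated to them.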
Toy Graph Dataset
[Figure: a property graph with six vertices (1 to 6) connected by friend and favorite edges. Visible vertex properties include name=marko, location=Santa Fe, gender=male; name=sixwing, location=West Hollywood, gender=male; name=charlie; name=Bryce Canyon; and lat=11111, long=22222. Visible edge properties include created_at=123456 and created_at=234567.]
We will use the toy-graph above to demonstrate Gremlin (to introduce the syntax). However, in parallel, we
will also use a large graph of the same schema to demonstrate how SQL/MySQL compares relative to
Gremlin/Neo4j on traversal-based queries (i.e. for relational databases, queries with table joins).
Dataset Schema in Neo4j
Neo4j [http://neo4j.org] is a “schema-less” database. However, ultimately, data is
represented according to some schema whether that schema be explicit in the database, in
the code interacting with the database, or in the developer’s head.40 Please note the
schema diagrammed below is a non-standard convention.41
[Figure: schema diagram]
Person: name=<string>, location=<string>, gender=<string>, type=Person
Place: name=<string>, lat=<double>, long=<double>, type=Place
Edge labels: favorite, friend
40 A better term for “schema-less” might have been “dynamic schema.”
41 For expressive, standardized graph-based schema languages, refer to RDFS [http://www.w3.org/TR/rdf-schema/] and OWL [http://www.w3.org/TR/owl-features/] of the Web of Data community.
Dataset Schema in MySQL
CREATE TABLE friend (
outV INT NOT NULL,
inV INT NOT NULL);
CREATE INDEX friend_outV_index USING BTREE ON friend (outV);
CREATE INDEX friend_inV_index USING BTREE ON friend (inV);
CREATE TABLE favorite (
outV INT NOT NULL,
inV INT NOT NULL);
CREATE INDEX favorite_outV_index USING BTREE ON favorite (outV);
CREATE INDEX favorite_inV_index USING BTREE ON favorite (inV);
CREATE TABLE metadata (
vertex INT NOT NULL,
_key VARCHAR(100) NOT NULL,
_value VARCHAR(100),
PRIMARY KEY (vertex, _key));
CREATE INDEX metadata_vertex_index USING BTREE ON metadata (vertex);
CREATE INDEX metadata_key_index USING BTREE ON metadata (_key);
CREATE INDEX metadata_value_index USING BTREE ON metadata (_value);
Experiment Discussion
• First, for each experiment, no cache is used. For each query (or run of queries), caches are reset/flushed and the query is performed.42
• Second, for each experiment, a “stable point” (i.e. performance with full caching) is found through the repeated evaluation of the same query.
• Evaluations are done on my laptop using SQL/MySQL (5.1.45) and Gremlin (0.5-alpha)/Neo4j (1.1).43
• I am not an expert in relational databases. Be aware of all of my choices (table design, indexes used, query representation, etc.).44
42 I believe, from looking at the behavior of MySQL, MySQL caches maintain joined structure in main memory for subsequent queries. Neo4j caches by maintaining active portions of the graph in main memory.
43 Note that Gremlin 0.5-alpha is much more performant than Gremlin 0.2.2. Also, running times presented are likely to change with optimizations (discussed later); consider all times in passing only.
44For the more interested, please do experiments yourself with your particular domain models and queries.
Loading Identical Data into MySQL and Neo4j
For the first half of the examples, we will use a small data set. Later we will increase this data set by 10,000,000 edges and compare again. The reason is to test how indices affect the performance of standard queries. As indices grow, log2(n) becomes costly.
mysql> (SELECT * FROM friend) UNION (SELECT * FROM favorite)
71100 rows in set (0.47 sec)
gremlin> g:count($_g/E)
==>71100 results returned in 145.427ms (0.145 sec)
First thing to note: graph databases don’t have a notion of “tables”; the entire graph is one atomic entity.
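To make the log2(n) remark concrete, here is a small worked calculation for the two edge counts used in this talk. It assumes, as a simplification, that one indexed lookup costs on the order of log2(n) comparisons; real index structures differ in constants and branching factor.

```python
import math

small_edges = 71_100        # edge count of the small data set
large_edges = 10_071_100    # edge count after adding 10,000,000 edges

small_cost = math.log2(small_edges)   # comparisons per lookup, small index
large_cost = math.log2(large_edges)   # comparisons per lookup, large index

print(f"small: ~{small_cost:.1f} comparisons per lookup")
print(f"large: ~{large_cost:.1f} comparisons per lookup")
print(f"extra per lookup: ~{large_cost - small_cost:.1f}")
```

The per-lookup growth looks modest, but a join-based traversal pays it at every hop for every intermediate row, whereas an index-free traversal follows direct references between adjacent vertices.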
Basic Gremlin
gremlin> (1 + 2) * 4 div 5
==>2.4
gremlin> "marko" + " a. " + "rodriguez"
==>marko a. rodriguez
gremlin> func ex:add-one($x)
$x + 1
end
gremlin> foreach $y in g:list(1,2,3,4)
g:print(ex:add-one($y))
end
2
3
4
5
Searching Example: Friends
[Figure: the toy graph from the Toy Graph Dataset slide.]
gremlin> $_g := neo4j:open(‘/data/mygraph’)
gremlin> $_ := g:id(1)
==>v[1]
gremlin> .
==>v[1]
gremlin> ./outE
==>e[10][1-friend->2]
==>e[11][1-friend->3]
==>e[12][1-favorite->4]
gremlin> ./outE[@label=‘friend’]/inV/@name
==>sixwing
==>marko
gremlin> ./outE[@label=‘friend’]/inV/@gender
==>male
==>male
gremlin> ./outE[@label=‘friend’]
/inV[@location=‘Santa Fe’]/@name
==>marko
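For readers new to the path syntax, the friend queries above can be mirrored over a plain edge list. This is a rough sketch, not how Gremlin or Neo4j work internally. The helper `out` and the dictionaries are hypothetical; the name of vertex 1, the assignment of lat/long to vertex 4, and the edge from 3 to 5 are assumptions about the toy graph.

```python
# Toy graph modeled on the slides (partly assumed, see the note above).
VERTICES = {
    1: {"name": "tenderlove"},
    2: {"name": "sixwing", "location": "West Hollywood", "gender": "male"},
    3: {"name": "marko", "location": "Santa Fe", "gender": "male"},
    4: {"lat": 11111, "long": 22222},
    5: {"name": "charlie"},
    6: {"name": "Bryce Canyon"},
}
EDGES = [  # (out vertex, label, in vertex)
    (1, "friend", 2), (1, "friend", 3), (1, "favorite", 4),
    (2, "friend", 1), (3, "friend", 1), (3, "friend", 5),
    (2, "favorite", 6), (3, "favorite", 6), (5, "favorite", 6),
]


def out(vertex_ids, label):
    """Mirror of outE[@label=...]/inV: one result per path, duplicates kept."""
    return [i for v in vertex_ids for (o, l, i) in EDGES if o == v and l == label]


friends = out([1], "friend")                           # ./outE[@label='friend']/inV
friend_names = [VERTICES[v]["name"] for v in friends]  # .../@name
in_santa_fe = [VERTICES[v]["name"] for v in friends
               if VERTICES[v].get("location") == "Santa Fe"]

print(friend_names)
print(in_santa_fe)
```

Each step of the path maps the current list of vertices to the next one, which is why a filter such as `[@location=‘Santa Fe’]` is just a condition applied between steps.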
Searching Friends: SQL/MySQL vs. Gremlin/Neo4j
What are the names of Rand Fitzpatrick’s friends?45
mysql> SELECT friend.inV, b._value FROM friend, metadata as a,
metadata as b WHERE a._key=‘name’ AND
a._value=‘Rand Fitzpatrick’ AND a.vertex=friend.outV AND
b.vertex=friend.inV AND b._key=‘name’;
97 rows in set (0.32 sec -- 320.0 ms)
gremlin> g:key(‘name’,‘Rand Fitzpatrick’)/outE[@label=‘friend’]/inV/@name
97 results returned (0.0258 sec -- 25.88 ms)
45 When in cache (through repeated, identical querying), SQL/MySQL evaluates in ∼0.005 seconds (5ms) and Gremlin/Neo4j evaluates in ∼0.0002 seconds (0.2ms).
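To try the join-based form of the friends query without a MySQL server, the same table layout can be loaded into an in-memory SQLite database. This is a sketch under assumptions: SQLite stands in for MySQL (so `USING BTREE` is dropped), only the tables the query touches are created, and the few rows inserted are hypothetical toy data rather than the data set measured above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE friend (outV INT NOT NULL, inV INT NOT NULL);
CREATE INDEX friend_outV_index ON friend (outV);
CREATE INDEX friend_inV_index ON friend (inV);
CREATE TABLE metadata (
    vertex INT NOT NULL,
    _key VARCHAR(100) NOT NULL,
    _value VARCHAR(100),
    PRIMARY KEY (vertex, _key));
""")

# Hypothetical toy rows: friend edges and vertex names.
conn.executemany("INSERT INTO friend VALUES (?, ?)",
                 [(1, 2), (1, 3), (2, 1), (3, 1), (3, 5)])
conn.executemany("INSERT INTO metadata VALUES (?, 'name', ?)",
                 [(1, "tenderlove"), (2, "sixwing"), (3, "marko"), (5, "charlie")])

# Same shape as the slide's query: find the vertex by name, then join out to friends.
rows = conn.execute("""
    SELECT friend.inV, b._value
    FROM friend, metadata AS a, metadata AS b
    WHERE a._key = 'name' AND a._value = ?
      AND a.vertex = friend.outV
      AND b.vertex = friend.inV AND b._key = 'name'
    ORDER BY friend.inV
""", ("tenderlove",)).fetchall()

print(rows)
```

Note that the single hop already needs three table references (one edge table and two passes over `metadata`); each further hop in the later queries adds another join.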
Searching Example: FOAF (No Friends, No Self)
[Figure: the toy graph from the Toy Graph Dataset slide.]
gremlin> .
==>v[1]
gremlin> ./outE[@label=‘friend’]/inV
/outE[@label=‘friend’]/inV
==>v[1]
==>v[1]
==>v[5]
gremlin> (./outE[@label=‘friend’]
/inV)[g:assign(‘$x’)]
/outE[@label=‘friend’]
/inV[g:except(.,$_)][g:except(.,$x)]
/@name
==>charlie
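The two `g:except` filters can be read as plain list filtering: drop the start vertex, then drop anything already in the saved friend set. A minimal sketch follows; the helper `out` is hypothetical and the edge from 3 to 5 is an assumption about the toy graph.

```python
FRIEND_EDGES = [  # (out vertex, in vertex); the 3 -> 5 edge is assumed
    (1, 2), (1, 3), (2, 1), (3, 1), (3, 5),
]
NAMES = {1: "tenderlove", 2: "sixwing", 3: "marko", 5: "charlie"}


def out(vertex_ids):
    """Follow friend edges from every vertex in the list, one result per path."""
    return [i for v in vertex_ids for (o, i) in FRIEND_EDGES if o == v]


me = 1
friends = out([me])        # saved like g:assign('$x')
foaf_paths = out(friends)  # unfiltered friends of friends, repeats included

# g:except(., $_) removes me; g:except(., $x) removes my direct friends.
foaf = [v for v in foaf_paths if v != me and v not in friends]
foaf_names = [NAMES[v] for v in foaf]

print(foaf_paths)
print(foaf_names)
```

The unfiltered list contains the start vertex once per friend who links back, which is exactly the repeated v[1] seen in the console output above.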
Searching FOAF (Not Self): SQL/MySQL vs. Gremlin/Neo4j
What are the names of Rand Fitzpatrick’s friends’ friends who are not Rand (note: this may include Rand’s friends)?46
mysql> SELECT mb._value FROM friend as a, friend as b, metadata as ma,
metadata as mb WHERE ma._key=‘name’ AND ma._value=‘Rand Fitzpatrick’
AND ma.vertex=a.outV AND a.inV=b.outV AND b.inV != ma.vertex AND
b.inV = mb.vertex AND mb._key=‘name’;
8985 rows in set (0.47 sec -- 470.00 ms)
gremlin> g:key(‘name’,‘Rand Fitzpatrick’)/outE[@label=‘friend’]
/inV/outE[@label=‘friend’]/inV[g:except(.,$_)]/@name
8985 results returned (0.053 sec -- 53.879 ms)
46 When in cache (through repeated, identical querying), SQL/MySQL evaluates in ∼0.03 seconds (30ms) and Gremlin/Neo4j evaluates in ∼0.015 seconds (15ms).
Searching Example: Friend’s Favorites
[Figure: the toy graph from the Toy Graph Dataset slide.]
gremlin> .
==>v[1]
gremlin> ./outE[@label=‘friend’]/inV
/outE[@label=‘favorite’]/inV
==>v[6]
==>v[6]
gremlin> ./outE[@label=‘friend’]/inV
/outE[@label=‘favorite’ and @created_at>234500]
/inV/@name
==>Bryce Canyon
Searching FOAF (No Self) Favorites: SQL/MySQL vs. Gremlin/Neo4j
What do Rand’s friends’ friends (who are not Rand) favorite?47
mysql> SELECT mb._value FROM friend as fa, friend as fb, favorite,
metadata as ma, metadata as mb WHERE ma._key=‘name’ AND
ma._value=‘Rand Fitzpatrick’ AND ma.vertex=fa.outV AND fa.inV=fb.outV
AND fb.inV != ma.vertex AND fb.inV=favorite.outV AND
mb.vertex=favorite.inV AND mb._key=‘name’;
364905 rows in set (11.17 sec -- 11170.0 ms)
gremlin> g:key(‘name’,‘Rand Fitzpatrick’)/outE[@label=‘friend’]
/inV/outE[@label=‘friend’]/inV[g:except(.,$_)]
/outE[@label=‘favorite’]/inV/@name
364905 results returned (2.278 sec -- 2278.59 ms)
47 When in cache (through repeated, identical querying), SQL/MySQL evaluates in ∼6.25 seconds (6250ms) and Gremlin/Neo4j evaluates in ∼1.0 second (1000ms).
A Traversal Detour Through the Web of Data
[Figure: the Linked Data cloud diagram as of July 2009, showing interlinked data sets such as DBpedia, Freebase, Geonames, MusicBrainz, UniProt, PubMed, DBLP, and US Census Data.]
Image produced by Richard Cyganiak and Anja Jentzsch. [http://linkeddata.org/]
Defining the Web of Data
• The Web of Data is similar to the Web of Documents (of common knowledge), but
instead of referencing documents (e.g. HTML, images, etc.) with the URI address
space, individual data items are referenced.48, 49
* 〈http://markorodriguez.com, foaf:fundedBy, http://atti.com〉
* 〈http://markorodriguez.com, foaf:name, "Marko Rodriguez"〉
* 〈http://markorodriguez.com, foaf:age, "30"〉
* 〈http://markorodriguez.com, foaf:knows, http://tenderlovemaking.com〉
• In graph theoretic terms, the Web of Data is a multi-relational graph defined as
G ⊆ (U ∪ B) × U × (U ∪ B ∪ L), where U is the set of all URIs, B is the set of
all blank/anonymous nodes, and L is the set of all literals.
48 The Web of Data is also known as the Linked Data Web, the Giant Global Graph, the Semantic Web, the RDF graph, etc.
49 * Rodriguez, M.A., “Interpretations of the Web of Data,” Data Management in the Semantic Web, eds. H. Jin and Z. Lv, Nova Publishing, in press, 2010. [http://arxiv.org/abs/0905.3378]
* Rodriguez, M.A., “A Graph Analysis of the Linked Data Cloud,” Technical Report, KRS-2009-01, 2009. [http://arxiv.org/abs/0903.0194]
Some of the Datasets on the Web of Data
data set          domain      data set          domain      data set            domain
audioscrobbler    music       govtrack          government  pubguide            books
bbclatertotp      music       homologene        biology     qdos                social
bbcplaycountdata  music       ibm               computer    rae2001             computer
bbcprogrammes     media       ieee              computer    rdfbookmashup       books
budapestbme       computer    interpro          biology     rdfohloh            social
chebi             biology     jamendo           music       resex               computer
crunchbase        business    laascnrs          computer    riese               government
dailymed          medical     libris            books       semanticweborg      computer
dblpberlin        computer    lingvoj           reference   semwebcentral       social
dblphannover      computer    linkedct          medical     siocsites           social
dblprkbexplorer   computer    linkedmdb         movie       surgeradio          music
dbpedia           general     magnatune         music       swconferencecorpus  computer
doapspace         social      musicbrainz       music       taxonomy            reference
drugbank          medical     myspacewrapper    social      umbel               general
eurecom           computer    opencalais        reference   uniref              biology
eurostat          government  opencyc           general     unists              biology
flickrexporter    images      openguides        reference   uscensusdata        government
flickrwrappr      images      pdb               biology     virtuososponger     reference
foafprofiles      social      pfam              biology     w3cwordnet          reference
freebase          general     pisa              computer    wikicompany         business
geneid            biology     prodom            biology     worldfactbook       government
geneontology      biology     projectgutenberg  books       yago                general
geonames          geographic  prosite           biology     . . .
Web of Data Dataset Dependencies
[Figure: dependency graph among the Web of Data data sets listed above (e.g. dbpedia, freebase, geonames, uniprot, musicbrainz, dblpberlin).]
Web of Data Transforms Development Paradigm
A new application development paradigm emerges. No longer do data and application
providers need to be the same entity (left). With the Web of Data, it’s possible for
developers to write applications that utilize data that they do not maintain (right).50
[Figure: two panels, each showing Application 1, Application 2, and Application 3 at 127.0.0.1, 127.0.0.2, and 127.0.0.3 with their structures and processes; one panel is labeled Web of Data.]
50 Rodriguez, M.A., “A Reflection on the Structure and Process of the Web of Data,” Bulletin of the American Society for Information Science and Technology, 35(6), pp. 38–43, doi:10.1002/bult.2009.1720350611, 2009. [http://arxiv.org/abs/0908.0373]
Extending our Knowledge of Bryce Canyon National Park
gremlin> $h := lds:open()
gremlin> $_ := g:add-v($h, ‘http://dbpedia.org/resource/Bryce_Canyon_National_Park’)
==>v[http://dbpedia.org/resource/Bryce_Canyon_National_Park]
gremlin> ./outE
==>e[dbpedia:Bryce_Canyon_National_Park - dbpprop:reference -> http://www.nps.gov/brca/]
==>e[dbpedia:Bryce_Canyon_National_Park - dbpprop:iucnCategory -> "II"@en]
==>e[dbpedia:Bryce_Canyon_National_Park - dbpedia-owl:numberOfVisitors -> "1012563"^^<xsd:integer>]
==>e[dbpedia:Bryce_Canyon_National_Park - skos:subject -> dbpedia:Category:Colorado_Plateau]
==>e[dbpedia:Bryce_Canyon_National_Park - dbpprop:visitationNum -> "1012563"^^<xsd:int>]
==>e[dbpedia:Bryce_Canyon_National_Park - dbpedia-owl:abstract -> "Bryce Canyon National Park is a national
park located in southwestern Utah in the United States..."@en]
==>e[dbpedia:Bryce_Canyon_National_Park - dbpprop:area -> "35835.0"^^<http://dbpedia.org/datatype/acre>]
==>e[dbpedia:Bryce_Canyon_National_Park - rdf:type -> dbpedia-owl:ProtectedArea]
==>e[dbpedia:Bryce_Canyon_National_Park - dbpedia-owl:location -> dbpedia:Garfield_County%2C_Utah]
==>e[dbpedia:Bryce_Canyon_National_Park - dbpprop:nearestCity -> dbpedia:Panguitch%2C_Utah]
==>e[dbpedia:Bryce_Canyon_National_Park - dbpprop:established -> "1928-09-15"^^<xsd:date>]
...
51
51 Linked Data Sail (LDS) was developed by Joshua Shinavier (RPI and TinkerPop) and connects to Gremlin through Gremlin’s native support for Sail (i.e. for RDF graphs). LDS caches the traversed aspects of the Web of Data into any quad-store (e.g. MemoryStore, AllegroGraph, HyperGraphSail, Neo4jSail, etc.).
Augmenting Traversals with the Web of Data
Let’s extend our query over the Web of Data, and perhaps incorporate the results into our searching, scoring, ranking, and recommendation.
gremlin> $visits := ./outE[@label=‘dbpprop:visitationNum’]/inV/@value
==>1012563
gremlin> $acreage := ./outE[@label=‘dbpprop:area’]/inV/@value
==>35835.0
### imagine wrapping traversals in Gremlin functions:
### func lds:acreage($h, $v) and func lds:visitors($h, $v)
gremlin> ./outE[@label=‘friend’]/inV/outE[@label=‘favorite’]
/inV[lds:acreage($h, .) < 1000000 and lds:visitors($h, .) < 2000000]/@name
==>Bryce Canyon
Thus, what do tenderlove’s friends favorite that are small in acreage and visitation?52
52 In Gremlin, it’s possible to have multiple graphs open in parallel and thus mix and match data from each graph as desired. Hence, as demonstrated by the example above, it’s possible to mix Web of Data RDF graph data and Blueprints property graph data.
Using the Web of Data for Music Recommendation
Yet another aside: using only data from the Web of Data to recommend musicians/bands
with a simplistic, edge-boolean spreading activation algorithm.53
gremlin> $_ :=
g:id(‘http://dbpedia.../Grateful_Dead’)
==>v[http://dbpedia.../Grateful_Dead]
gremlin> lds:spreading-activation(.)
==>Jerry Garcia Acoustic Band
==>BK3
==>Phil Lesh and Friends
==>Old and In the Way
==>RatDog
==>The Dead
==>Heart of Gold Band
==>Legion of Mary
==>The Tubes
==>Bob Dylan
==>New Riders of the Purple Sage
==>Bruce Hornsby
==>Donna Jean Godchaux
==>Kingfish
==>Jerry Garcia Band
==>Donna Jean Godchaux Band
==>The Other Ones
==>Bobby and the Midnites
==>Furthur
==>Rhythm Devils
53 Please read the following for interesting, deeper ideas in this space: Clark, A., “Associative Engines: Connectionism, Concepts, and Representational Change,” MIT Press, 1993.
Another View of the TinkerPop Stack
[Figure: a Local Dataset and the Web of Data, connected by owl:sameAs links and by GET http://host/resource requests.]
Scoring Example: How Many of My Friends Favorite X?
[Figure: the toy graph from the Toy Graph Dataset slide.]
gremlin> .
==>v[1]
gremlin> ./outE[@label=‘friend’]/inV
==>v[3]
==>v[2]
gremlin> g:count(./outE[@label=‘friend’]/inV
/outE[@label=‘favorite’]
/inV[@id=6])
==>2
Scoring Example: How Many of My FOAFs Favorite X?
[Figure: the toy graph from the Toy Graph Dataset slide.]
gremlin> .
==>v[1]
gremlin> g:count(
(./outE[@label=‘friend’]/inV)[g:assign(‘$x’)]
/outE[@label=‘friend’]
/inV[g:except(.,$_)][g:except(.,$x)]
/outE[@label=‘favorite’]/inV[@id=6])
==>1
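Scoring is the same traversal followed by a count: every path that ends at place X contributes one to the score. The two scores above can be sketched as follows; the helper `out` is hypothetical and the edge from 3 to 5 is an assumption about the toy graph.

```python
EDGES = [  # (out vertex, label, in vertex); the 3 -> 5 friend edge is assumed
    (1, "friend", 2), (1, "friend", 3), (1, "favorite", 4),
    (2, "friend", 1), (3, "friend", 1), (3, "friend", 5),
    (2, "favorite", 6), (3, "favorite", 6), (5, "favorite", 6),
]


def out(vertex_ids, label):
    """Follow labeled out-edges from each vertex, keeping one result per path."""
    return [i for v in vertex_ids for (o, l, i) in EDGES if o == v and l == label]


me, place = 1, 6
friends = out([me], "friend")

# How many of my friends favorite X?
friend_score = out(friends, "favorite").count(place)

# How many of my friends' friends (not me, not already my friends) favorite X?
foafs = [v for v in out(friends, "friend") if v != me and v not in friends]
foaf_score = out(foafs, "favorite").count(place)

print(friend_score, foaf_score)
```

Because duplicates are kept along the way, a friend of a friend who is reachable through several friends would be counted once per path, which is the weighting behavior the recommendation examples rely on.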
Loading Identical Data into MySQL and Neo4j
Now we will use a larger data set. 10,000,000 edges are created between 100,000 vertices, assigned at random with 50% favorite-edges and 50% friend-edges. This is a dense, relatively unnatural graph: everyone is heavily connected.54
mysql> (SELECT * FROM favorite) UNION (SELECT * FROM friend)
10071100 rows in set (4 min 28.10 sec)
gremlin> g:count($_g/E)
10071100 edges in return (5 min 35 sec)
54 The largest Neo4j instance that I know of contained 100,030,002 (100 million) vertices, 3,041,030,000 (3 billion) edges, and 140,120,000 (140 million) properties. This was deployed on Amazon EC2 and was yielding FOAF traversals, on average, in ∼50ms (again, index-free traversal). Figures provided by Todd Stavish (Stav.ish Consulting [http://blog.stavi.sh/]).
Querying Random Vertices with Repeats
mysql> SELECT count(favorite.inV) FROM friend as fa, friend as fb, favorite
WHERE fa.outV=XXX AND fa.inV=fb.outV AND fb.inV=favorite.outV;
29.72 sec -- vertex 110752
0.330 sec -- vertex 110752 REPEAT
10.10 sec -- vertex 145893
11.64 sec -- vertex 126993
0.250 sec -- vertex 126993 REPEAT
14.37 sec -- vertex 136442
6.990 sec -- vertex 154837
0.240 sec -- vertex 154837 REPEAT
gremlin> g:count(g:id(XXX)/outE[@label=‘friend’]/inV
/outE[@label=‘friend’]/inV/outE[@label=‘favorite’]/inV)
3.646 sec -- vertex 110752
0.350 sec -- vertex 110752 REPEAT
0.756 sec -- vertex 145893
3.251 sec -- vertex 126993
0.211 sec -- vertex 126993 REPEAT
1.462 sec -- vertex 136442
1.875 sec -- vertex 154837
0.268 sec -- vertex 154837 REPEAT
Recommendation
Extending the Schema for Some Richer Examples
For the last part of this presentation on recommendation, we will extend the data schema to include tags (a place can be tagged with a tag). This will allow for some richer examples.55, 56
[Figure: schema diagram]
Person: name=<string>, location=<string>, gender=<string>, type=Person
Place: name=<string>, lat=<double>, long=<double>, type=Place
Tag: name=<string>, type=Tag
Edge labels: favorite, friend, tagged
55 Please note that 1.) “place” can be item/thing/book/music/etc. 2.) “favorite” can be likes/purchased/visited/etc. 3.) “tag” can be category/etc. A particular use case is presented, but with little imagination, application to other schemas is, of course, plausible.
56 The following examples have experimental syntax that may differ slightly from the official Gremlin 0.5 release.
Recommendation Example: Friend Finder
• Open Friendship Triangles: (V ×Ψ)→ (V × N+)57 (people)
1. Create return map (i.e. V × N+).
2. Determine who my friends are.
3. Determine who my friends’ friends are...
4. ...that are not already my friends or me. (weighted by the number of overlapping
friends: more overlaps, more traversers at that user vertex)
5. Sort return map by number of traversers at those user/people vertices.
$m := g:map()
(./outE[@label=‘friend’]/inV)[g:assign(‘$x’)]
/outE[@label=‘friend’]/inV
/.[g:except(.,$x)][g:except(.,$_)][g:op-value(‘+’,$m,.,1)]
g:sort($m,‘value’,true)
57 $((R_x \circ A^{friend}) \cdot A^{friend}) \circ n(A^{friend}) \circ n(I)$, where $x$ is the user/person vertex. The in-degree centrality vector of the derived adjacency matrix determines the resultant $V$ rank.
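The five steps of the friend finder can be sketched with a counter standing in for the return map: each path that reaches a candidate adds one traverser at that vertex. The adjacency dictionary and the function name `friend_finder` below are hypothetical illustration data, not the deck's data set.

```python
from collections import Counter

FRIEND = {  # hypothetical friend adjacency: person -> friends
    "a": ["b", "c"],
    "b": ["a", "d", "e"],
    "c": ["a", "d"],
    "d": [],
    "e": [],
}


def friend_finder(me):
    """Rank friends of friends by how many of my friends link to them."""
    friends = FRIEND[me]                  # step 2: who my friends are
    scores = Counter()                    # step 1: the return map
    for friend in friends:
        for candidate in FRIEND[friend]:  # step 3: their friends
            if candidate != me and candidate not in friends:  # step 4
                scores[candidate] += 1    # one traverser per path
    return scores.most_common()           # step 5: sort by count, descending


ranked = friend_finder("a")
print(ranked)
```

Here "d" closes two open triangles (through "b" and through "c") and therefore outranks "e", which closes only one.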
Recommendation Example: Follower Finder
• People Similarity based on Favorites: (V × Ψ) → (V × N+)58 (people)
1. Create return map (i.e. V × N+).
2. Determine what I favorite/like/prefer/purchased/etc.
3. Of those things I favorite, who else favorites them that are not me? (weighted user
similarity based on taste—the more I share in common, the more traversers are at
that user vertex).
4. Filter out those people that are my friends.
5. Sort return map by number of traversers at those people vertices.
$m := g:map()
(./outE[@label=‘favorite’]/inV)[g:assign(‘$x’)]
/inE[@label=‘favorite’]/outV[g:except(.,$_)]
/outE[@label=‘friend’]/inV[g:except(.,$x)]/../..[g:op-value(‘+’,$m,.,1)]
g:sort($m,‘value’,true)
58 $((R_x \circ A^{favorite}) \cdot (A^{favorite})^{\top}) \circ n(I) \circ n(A^{friend})$. The in-degree centrality vector of the derived adjacency matrix determines the resultant $V$ rank.
Recommendation Example: Follower Finder 2
• People Similarity based on Tags: (V × Ψ) → (V × N+)59, 60 (people)
1. Create return map (i.e. V × N+).
2. Determine the tags associated with what I favorite.
3. What else is tagged with those tags?
4. Who, other than me, favorites those tagged items?61
5. Sort return map by number of traversers at those people vertices.
$m := g:map()
./outE[@label=‘favorite’]/inV/outE[@label=‘tagged’]/inV
/inE[@label=‘tagged’]/outV
/inE[@label=‘favorite’]/outV[g:except(.,$_)][g:op-value(‘+’,$m,.,1)]
g:sort($m,‘value’,true)
59 $((R_x \circ A^{favorite}) \cdot A^{tagged} \cdot (A^{tagged})^{\top} \cdot (A^{favorite})^{\top}) \circ n(I)$. The in-degree centrality vector of the derived adjacency matrix determines the resultant $V$ rank.
60 Variations on this theme can be used for expertise identification.
61 A user’s friends could be recommended. This filter was ignored for the sake of brevity.
Recommendation Example: “Users Who Like x Also Like y”
• Co-Favorited Places: (V × Ψ) → (V × N+)62, 63 (places)
1. Create return map (i.e. V × N+).
2. Determine who has favorited (i.e. liked) place x.
3. What else have they favorited that is not place x.
4. Sort return map by number of traversers at those place vertices.
$m := g:map()
$x/inE[@label=‘favorite’]/outV
/outE[@label=‘favorite’]/inV[g:except(.,$x)][g:op-value(‘+’,$m,.,1)]
g:sort($m,‘value’,true)
62 $((R_x \circ (A^{favorite})^{\top}) \cdot A^{favorite}) \circ n(C_x)$. In-degree centrality of derived matrix determines rank.
63 This type of recommendation may be considered content-based recommendation. When two vertices share content (relations to other vertices), they are deemed similar. Co-relation, in general, is a pattern for content-based recommendation. Look back at the first three recommendation examples: “friend finder” (co-friend), “follower finder” (co-favorites), “follower finder 2” (co-tagged-favorites).
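The co-favorite pattern is a two-step walk: backwards over favorite edges from place x to its fans, then forwards to everything else those fans favorite. A minimal sketch with hypothetical people and places (the function name `also_like` is also hypothetical):

```python
from collections import Counter

FAVORITE = [  # hypothetical (person, place) favorite edges
    ("ann", "Bryce Canyon"), ("ann", "Zion"), ("ann", "Arches"),
    ("bob", "Bryce Canyon"), ("bob", "Zion"),
    ("cat", "Arches"),
]


def also_like(x):
    """Rank places co-favorited with place x by number of shared fans."""
    fans = [person for (person, place) in FAVORITE if place == x]  # inE/outV
    scores = Counter()
    for fan in fans:
        for (person, place) in FAVORITE:                           # outE/inV
            if person == fan and place != x:                       # except x
                scores[place] += 1
    return scores.most_common()


ranked = also_like("Bryce Canyon")
print(ranked)
```

Swapping the edge list for tagged edges gives the co-tagged variants; only the labels walked change, not the shape of the computation.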
Recommendation Example: Places Related through Tags
• Co-Tagged Places: (V × Ψ) → (V × N+)64, 65 (places)
1. Create return map (i.e. V × N+).
2. Determine the tags for place x.
3. What else is tagged the same as x that is not x.
4. Sort return map by number of traversers at those place vertices.
$m := g:map()
$x/outE[@label=‘tagged’]/inV
/inE[@label=‘tagged’]/outV[g:except(.,$x)][g:op-value(‘+’,$m,.,1)]
g:sort($m,‘value’,true)
64 $((R_x \circ A^{tagged}) \cdot (A^{tagged})^{\top}) \circ n(I)$. In-degree centrality of derived matrix determines rank.
65 Yet another type of content-based recommendation, but items are similar to each other not because of co-favoriting, but because of co-tagging. Think about mixing and matching different similarities. How do you weight the different “co”-graphs (i.e. aAα + bAβ)? Statistical techniques can surface the significant factors.
Recommendation Example: Tags Related through Places
• Co-Placed Tags: (V × Ψ) → (V × N+)66, 67 (tags)
1. Create return map (i.e. V × N+).
2. Determine what has been tagged x.
3. What other tags do those items have that are not x.
4. Sort return map by number of traversers at those tag vertices.
$m := g:map()
$x/inE[@label=‘tagged’]/outV
/outE[@label=‘tagged’]/inV[g:except(.,$x)][g:op-value(‘+’,$m,.,1)]
g:sort($m,‘value’,true)
66 $((R_x \circ (A^{tagged})^{\top}) \cdot A^{tagged}) \circ n(I)$. In-degree centrality of derived matrix determines rank.
67 In the previous example, items were related if they shared the same tags. In this example, tags are related if they are used to tag the same items. Anything can be deemed similar to anything else if there exist paths between such items, inferred or explicit. The path taken (Ψ) determines the meaning/type of similarity. Cognitive philosophers/psychologists see this as associativity through spreading activation.
Recommendation Example: Collaborative Filtering 1
• Basic Collaborative Filtering: (V × Ψ) → (V × N+)68 (places)
1. Create return map (i.e. V × N+).
2. Determine what I favorite/like/prefer/purchased/etc.
3. Of those things I favorite, who else favorites them? (weighted user similarity based
on taste—the more I share in common, the more traversers are at that person
vertex).
4. Of those similar users, what do they favorite that I don’t already favorite?
5. Sort return map by number of traversers at those favorited places.
$m := g:map()
(./outE[@label=‘favorite’]/inV)[g:assign(‘$x’)]
/inE[@label=‘favorite’]/outV
/outE[@label=‘favorite’]/inV[g:except(.,$x)][g:op-value(‘+’,$m,.,1)]
g:sort($m,‘value’,true)
68 Related to “follower finder” from previous. However, it takes the traversal one step further. Instead of simply finding who is similar to me with respect to favoriting, you then compute what those similar users also favorite. This is a classic case for path-reuse as an optimization.
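Basic collaborative filtering extends the co-favorite walk by one more hop: my favorites, then the people who share them, then what those people favorite that I do not. The sketch below uses hypothetical users and places; as in the traversal above, the start user is not excluded explicitly, since everything that user favorites is filtered out at the last step anyway.

```python
from collections import Counter

FAVORITES = {  # hypothetical person -> favorited places
    "me": ["p1", "p2"],
    "u1": ["p1", "p2", "p3"],
    "u2": ["p2", "p3", "p4"],
    "u3": ["p5"],
}


def collaborative_filter(me):
    """Rank places favorited by users who share favorites with `me`."""
    mine = FAVORITES[me]                      # step 2: what I favorite
    scores = Counter()                        # step 1: the return map
    for place in mine:
        for user, favs in FAVORITES.items():  # step 3: who else favorites it
            if place in favs:
                for other in favs:            # step 4: what they favorite
                    if other not in mine:     # ...that I don't already
                        scores[other] += 1
    return scores.most_common()               # step 5: sort by traverser count


ranked = collaborative_filter("me")
print(ranked)
```

A user who shares two favorites with me is walked through twice, so that user's other favorites receive two traversers; taste overlap is what weights the result.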
Recommendation Example: Collaborative Filtering 2
• Collaborative “Category” Filtering: (V ×Ψ× V )→ (V × N+) (places)
1. Create return map (i.e. V × N+).
2. Determine what I favorite...
3. ...in category/tag x.
4. Of those things I favorite, who else favorites them?
5. Of those similar users, what do they favorite categorized/tagged x ...
6. ...that I don’t already favorite?
7. Sort return map by number of traversers at those favorited places.
$m := g:map()
(./outE[@label=‘favorite’]/inV
/outE[@label=‘tagged’]/inV[@name=‘bar’]/../..)[g:assign(‘$x’)]
/inE[@label=‘favorite’]/outV
/outE[@label=‘favorite’]/inV/outE[@label=‘tagged’]/inV[@name=‘bar’]
/../..[g:except(.,$x)][g:op-value(‘+’,$m,.,1)]
g:sort($m,‘value’,true)
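The category variant only changes which vertices are allowed at the start and at the end of the path. A hedged Python sketch, where `FAVORITES`, `TAGS`, and `recommend_in_category` are illustrative names rather than anything from Gremlin:

```python
from collections import Counter

# Hypothetical toy data.
FAVORITES = [
    ("me", "pub"), ("me", "cafe"),
    ("ann", "pub"), ("ann", "tavern"), ("ann", "museum"),
    ("bob", "cafe"), ("bob", "park"),
]
TAGS = {  # place -> set of tags (place --tagged--> tag)
    "pub": {"bar"}, "tavern": {"bar"}, "cafe": {"coffee"},
    "museum": {"art"}, "park": {"outdoors"},
}

def recommend_in_category(user, tag):
    def in_category(place):
        return tag in TAGS.get(place, set())

    m = Counter()
    # Start only from my favorites that carry the tag (this is $x).
    x = {p for u, p in FAVORITES if u == user and in_category(p)}
    for place in x:
        for other, p in FAVORITES:
            if p != place:
                continue
            for u2, candidate in FAVORITES:
                # End only at tagged places that I do not already favorite.
                if u2 == other and in_category(candidate) and candidate not in x:
                    m[candidate] += 1
    return m.most_common()

print(recommend_in_category("me", "bar"))  # [('tavern', 1)]
```

Bob is similar to "me" only through the cafe, which is not tagged ‘bar’, so nothing of his is reached.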
Recommendation Example: Collaborative Filtering 3
• Collaborative “Location” Filtering: (V ×Ψ× R4)→ (V × N+)69 (places)
1. Create return map (i.e. V × N+).
2. Determine what I favorite.
3. Of those things I favorite, who else favorites them?
4. Of those similar users, what do they favorite in bounding box x1, x2, y1, y2...
5. ...that I don’t already favorite?
6. Sort return map by number of traversers at those places.
$m := g:map()
(./outE[@label=‘favorite’]/inV)[g:assign(‘$x’)]
/inE[@label=‘favorite’]/outV
/outE[@label=‘favorite’]/inV[@lat > $x1 and @lat < $x2 ...]
/.[g:except(.,$x)][g:op-value(‘+’,$m,.,1)]
g:sort($m,‘value’,true)
69Location-filtering idea adapted from the Bonobo recommender engine by Nate Murray (AT&Ti).
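For the location variant, the filter on the final step is a bounding-box test. The sketch below is an assumption-laden illustration: the parameter names `lat_min`, `lat_max`, `lon_min`, `lon_max` stand in for the four reals of R4 (the slide elides the longitude part of the predicate), and the coordinates are invented.

```python
from collections import Counter

# Hypothetical toy data: favorite edges and place coordinates.
FAVORITES = [
    ("me", "cafe"),
    ("ann", "cafe"), ("ann", "pier"), ("ann", "diner"),
    ("bob", "cafe"), ("bob", "diner"),
]
COORDS = {  # place -> (lat, lon)
    "cafe": (34.15, -118.25),
    "diner": (34.14, -118.26),
    "pier": (34.01, -118.50),
}

def recommend_in_box(user, lat_min, lat_max, lon_min, lon_max):
    def in_box(place):
        lat, lon = COORDS[place]
        return lat_min < lat < lat_max and lon_min < lon < lon_max

    m = Counter()
    mine = {p for u, p in FAVORITES if u == user}
    for place in mine:
        for other, p in FAVORITES:
            if p != place:
                continue
            for u2, candidate in FAVORITES:
                # Keep only places inside the box that I do not favorite yet.
                if u2 == other and in_box(candidate) and candidate not in mine:
                    m[candidate] += 1
    return m.most_common()

print(recommend_in_box("me", 34.1, 34.2, -118.3, -118.2))  # [('diner', 2)]
```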
Recommendation Example: Collaborative Filtering 4
• Collaborative “State of Mind” Filtering: (V × Ψ × N+) → (V × N+) (places)
1. Create return map (i.e. V × N+).
2. Determine what I have favorited in the last x minutes.
3. Of those things I recently favorited, who else favorites them?
4. Of those similar users, what do they favorite that I don’t?
5. Sort return map by number of traversers at those favorited places.
$m := g:map()
(./outE[@label=‘favorite’ and @created_at > 1234567]/inV)[g:assign(‘$x’)]
/inE[@label=‘favorite’]/outV
/outE[@label=‘favorite’]/inV[g:except(.,$x)][g:op-value(‘+’,$m,.,1)]
g:sort($m,‘value’,true)
Recommendation Example: Collaborative Filtering 5
• Collaborative “Zeitgeist” Filtering: (V × Ψ × N+) → (V × N+) (places)
1. Create return map (i.e. V × N+).
2. Determine what I have favorited.
3. Of those things I favorited, who else favorites them?
4. Of those similar users, what have they favorited in the last x minutes...
5. ...that I don’t already favorite?
6. Sort return map by number of traversers at those favorited places.
$m := g:map()
(./outE[@label=‘favorite’]/inV)[g:assign(‘$x’)]
/inE[@label=‘favorite’]/outV
/outE[@label=‘favorite’ and @created_at > 1234567]
/inV[g:except(.,$x)][g:op-value(‘+’,$m,.,1)]
g:sort($m,‘value’,true)
...keep going all day long.
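The “state of mind” and “zeitgeist” variants differ only in which favorite edges get the timestamp test: mine or theirs. A Python sketch under stated assumptions (the `(user, place, created_at)` triples, the flag names, and the integer timestamps are all hypothetical):

```python
from collections import Counter

# Hypothetical toy data: (user, place, created_at) favorite edges.
FAVORITES = [
    ("me", "cafe", 100), ("me", "pub", 900),
    ("ann", "cafe", 150), ("ann", "museum", 950),
    ("bob", "pub", 200), ("bob", "park", 300), ("bob", "club", 980),
]

def recommend(user, since, recent_mine=False, recent_theirs=False):
    """recent_mine=True -> 'state of mind'; recent_theirs=True -> 'zeitgeist'."""
    m = Counter()
    # This sketch excludes everything the user favorites, as the numbered
    # steps say; the state-of-mind traversal as written excludes only $x.
    mine = {p for u, p, _ in FAVORITES if u == user}
    # Starting points: all my favorites, or only the recent ones.
    seeds = {p for u, p, t in FAVORITES
             if u == user and (t > since or not recent_mine)}
    for place in seeds:
        for other, p, _ in FAVORITES:
            if p != place:
                continue
            for u2, candidate, t in FAVORITES:
                if u2 != other or candidate in mine:
                    continue
                if recent_theirs and t <= since:
                    continue                  # their favorite is too old
                m[candidate] += 1
    return m.most_common()

print(recommend("me", since=800, recent_mine=True))    # state of mind
print(recommend("me", since=800, recent_theirs=True))  # zeitgeist
```

With this data, the state-of-mind call starts only from the pub and reaches Bob's park and club; the zeitgeist call starts from both of my favorites but keeps only what Ann and Bob favorited after time 800 (museum and club).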
A Cornucopia of Recommendations – Part 1
• It’s possible to use offline statistical methods to determine which factors of a vertex contribute to user interest (e.g. PCA+KMeans to determine metadata contributing to shared interests). (slow)
• Then, use online, real-time graph methods to incorporate those features into the traversal (i.e. to define Ψ). (fast)
* Mix various traversals together: aAα + bAβ + ... + zAζ (or other, perhaps non-linear combinations).70
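For a single start vertex, one way to read such a mix is as a weighted sum of the score maps returned by the individual traversals. The sketch below assumes that reading; `mix`, the weights, and the two example maps are hypothetical.

```python
from collections import Counter

def mix(weighted_results):
    """Linearly combine score maps: sum over (weight, {vertex: score})."""
    combined = Counter()
    for weight, scores in weighted_results:
        for vertex, score in scores.items():
            combined[vertex] += weight * score
    # Highest combined score first.
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical outputs of two different recommendation traversals.
collaborative = {"museum": 3, "park": 1}
nearby = {"park": 4, "diner": 2}

print(mix([(3, collaborative), (1, nearby)]))
# [('museum', 9), ('park', 7), ('diner', 2)]
```

A non-linear combination would replace the `weight * score` term with some other function of the individual scores.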
70 Though not discussed in this presentation, sampling techniques can be used to increase the speed of a traversal. For example, ./outE[g:rand-real() > 0.5] only traverses, on average, 50% of the edges. Moreover, if edges have weights, those weights can be used to create probability distributions and thus, biased sampling can be implemented (i.e. random walks).
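Both sampling ideas from the footnote can be sketched with Python's standard `random` module. The function names and the edge representation are assumptions for illustration only.

```python
import random

def sample_edges(edges, keep_probability, rng):
    """Uniform sampling: keep each edge independently with the given probability."""
    return [e for e in edges if rng.random() < keep_probability]

def weighted_step(weighted_edges, rng):
    """Biased sampling: pick one target vertex in proportion to its edge weight.

    `weighted_edges` is a list of (target_vertex, weight) pairs.
    """
    targets = [target for target, _ in weighted_edges]
    weights = [weight for _, weight in weighted_edges]
    return rng.choices(targets, weights=weights, k=1)[0]

rng = random.Random(42)                      # seeded only for repeatability
edges = list(range(1000))                    # stand-ins for 1000 out-edges
print(len(sample_edges(edges, 0.5, rng)))    # roughly half of the edges
print(weighted_step([("a", 1.0), ("b", 3.0)], rng))  # "b" about 3 times in 4
```

Repeating `weighted_step` from each vertex reached gives a weighted random walk.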
A Cornucopia of Recommendations – Part 2
• ...also, be creative. Develop numerous recommendation traversals for numerous problem-solving situations.71
• Make use of user click-behavior to determine usefulness.
• ...Or, allow users to select which algorithms they want to apply (give them the option to select how they want to solve their problems).
71 For a fine review of graph-based techniques and ideas regarding recommendation, please see:
* Mirza, B.J., Keller, B., Ramakrishnan, N., “Studying Recommendation Algorithms by Graph Analysis,” Journal of Intelligent Information Systems, 20(2), pp. 131–160, doi:10.1023/a:1021819901281, 2003.
* Huang, Z., Zeng, D., Chen, H., “A Link Analysis Approach to Recommendation Under Sparse Data,” Proceedings of the Tenth Americas Conference on Information Systems, 2004.
* Perugini, S., Goncalves, M.A., Fox, E., “Recommender System Research: A Connection-Centric Survey,” Journal of Intelligent Information Systems, 23(2), pp. 107–143, 2004.
* Rodriguez, M.A., Bollen, J., Van de Sompel, H., “Automatic Metadata Generation using Associative Networks,” ACM Transactions on Information Systems, 27(2), pp. 1–20, doi:10.1145/1462198.1462199, 2009. [http://arxiv.org/abs/0807.0023]
Traversal Algorithms Simulate User Behavior
• A traversal is like a simulation of the user(s).
• If all the user had were direct links (i.e. a basic user-interface over the dataset), what path would they take to solve their problem?
• Operationalize as a traversal and you have simulated (and sped up) their problem-solving behavior.72,73
72 Rodriguez, M.A., Watkins, J., “Faith in the Algorithm, Part 2: Computational Eudaemonics,” Proceedings of the International Conference on Knowledge-Based and Intelligent Information & Engineering Systems, Lecture Notes in Artificial Intelligence, 5712, pp. 813–820, doi:10.1007/978-3-642-04592-9_101, Springer-Verlag, 2009. [http://arxiv.org/abs/0904.0027] – see Faith in the Algorithm, in general: http://faithinthealgorithm.net.
73 Think of the graph data set as a conceptual graph: “things” and their relationships to each other; the world as index. Think how your mind composes, manipulates, and makes use of such structures to solve problems: to think, to infer, to creatively combine (i.e. join, traverse) ideas. Automate that process... automate the process that generates that process. [http://arxiv.org/abs/0704.3395]
Graph Traversal Model: Benefits and Drawbacks
• Benefits:
* The solution is explainable (i.e. the factors/paths are known).
* Evaluations can happen in real-time and on live data.74,75
* Can easily develop/deploy new traversals for different problems.76
• Drawbacks:
* If intuition fails, derive factors with offline statistical techniques.77,78
74 A user can add an edge and then recalculate a traversal.
75 It is noted that this depends on the complexity of the traversal and density of the graph.
76 For very rich data models, this is a promising proposition.
77 In the past, my method has been to use intuition to develop traversals, and then with sample data, validate/tweak the traversal [http://arxiv.org/abs/cs/0605112, http://arxiv.org/abs/0807.0023]. Also, for live systems with active users, using click-behavior is possible.
78 Think about deriving Ψ from the paths that the users take through the data. “Ruts,” given the law of large numbers, can expose the collective’s problem-solving behavior. In short, study your users to derive Ψ.
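One very rough way to look for such “ruts” is to count which sequences of edge labels users follow most often. The session format, the label names, and `common_paths` below are purely hypothetical; a real system would need its own logging and a more careful notion of a path.

```python
from collections import Counter

# Hypothetical click sessions, each recorded as the edge labels a user followed.
SESSIONS = [
    ["favorite", "favorite_inverse", "favorite"],
    ["favorite", "favorite_inverse", "favorite"],
    ["tagged", "tagged_inverse"],
]

def common_paths(sessions, top=1):
    """Return the most frequently taken label sequences (candidate values of Ψ)."""
    return Counter(tuple(session) for session in sessions).most_common(top)

print(common_paths(SESSIONS))
# [(('favorite', 'favorite_inverse', 'favorite'), 2)]
```

Here the dominant rut is the path behind basic collaborative filtering: my favorites, who else favorites them, and what they favorite.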
The Future of Gremlin – Part 1
• Pavel Yaskevich and I are currently rewriting Gremlin from the ground up with a new compiler and virtual machine. Orders of magnitude faster and more memory efficient. (now)79
• Make use of equivalences in the path algebra to do run-time optimizations of path traversals. Extend the algebra. (future)
• Make use of path caching to do (V × Ψ) → P(V) lookups. (future)80
* For example, x · ./outE/inV/outE/inV → a, b, c, d, ...
79 This new implementation is Gremlin 0.5 and can currently be pulled from git; note that it is unstable until the official release date.
80 A simple, intelligent memoization technique introduced by Joshua Shinavier in the Ripple programming language [http://ripple.fortytwo.net/].
The Future of Gremlin – Part 2
• Get more community involvement on the optimization of the compiler/virtual machine. (now)
* http://groups.google.com/group/gremlin-users/
• Support splitting/branching of path descriptions. Currently supported in Pipes, but no syntactic mapping yet available in Gremlin. (future)
* ./outE/inVsplit() [1]| ./outE/inV [2]| .[@name=‘atti’]/outE/@name
• Support for threading. Pipes, due to its data flow nature, is easily parallelized. Support concurrency through to the Gremlin language. (future)81
81 Kahn, G., “The Semantics of a Simple Language for Parallel Processing,” Proceedings of the Information Processing Congress, pp. 471–475, 1974.
Acknowledgements
• The ideas presented have been developed over the course of my time with the following
institutions: University of California at Santa Cruz, Vrije Universiteit Brussel, Los
Alamos National Laboratory, and AT&T Interactive.
• My core collaborators: Alberto Pepe (Harvard), Johan Bollen (Indiana University),
Herbert Van de Sompel (LANL), Jennifer H. Watkins (LANL), Peter Neubauer
(NeoTechnology), Joshua Shinavier (Rensselaer Polytechnic Institute), and Pavel
Yaskevich (“No one, from no where.”)
• The Neo4j team [http://neo4j.org] have been instrumental in influencing my
thoughts with respect to the database considerations of graph processing. These
people include Peter Neubauer, Emil Eifrem, Tobias Ivarsson, Johan Svensson, Mattias
Persson...
• My current institution of AT&Ti has provided me with ideas and support: Aaron
Patterson, Rand Fitzpatrick, Nate Murray, Gene Chuang, and Charlie Hornberger.
• The greater TinkerPop [http://tinkerpop.com] community for their discussions,
code submissions, and general excitement in the space.
Conclusion
• Model real-world structures with multi-relational/property graphs.
• Augment local data with the Web of Data.
• Store in a graph database to make traversing efficient.
• Traverse to search, score, rank, and recommend.
• Execute using TinkerPop productions.
• Relish in the glory that is the graph.
• “I must rest now. I’m tired from battle.” – Maximus.