
GRAPH DATABASE

Ernesto Damiani and Paolo [email protected]

Università degli Studi di Milano
Dipartimento di Informatica

WHAT IS A GRAPH?

‣ Formally, a graph is just a collection of vertices and edges

‣ Graphs represent entities as nodes and the ways in which those entities relate as relationships

‣ This general-purpose, expressive structure allows us to model all kinds of scenarios

‣ Graphs are extremely useful in understanding a wide diversity of datasets in fields such as science, government, and business

‣ Represent networks: social structures, topological relationships

‣ Represent a sequence of events

‣ Represent relationships between concepts: hyperonymy, hyponymy, meronymy


THE LABELED GRAPH MODEL

‣ The most popular form of graph model is the Labeled Graph Model

‣ It contains nodes and relationships

‣ Nodes contain properties (key-value pairs)

‣ Nodes can be labeled with one or more labels

‣ Relationships are named and directed, and always have a start and end node

‣ Relationships can also contain properties
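‣ The model described above can be sketched in a few lines of Python. This is only an illustration, not how any particular graph database stores data; the class names `Node` and `Relationship` and the `since` property are hypothetical, while the people and the KNOWS relationship are borrowed from the query examples later in the deck.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    labels: set[str]                                  # one or more labels
    properties: dict = field(default_factory=dict)    # key-value pairs


@dataclass
class Relationship:
    name: str      # relationships are named ...
    start: Node    # ... directed, with a start node ...
    end: Node      # ... and an end node
    properties: dict = field(default_factory=dict)    # optional key-value pairs


# Hypothetical example data
jim = Node({"Person"}, {"name": "Jim"})
emil = Node({"Person"}, {"name": "Emil"})
jim_knows_emil = Relationship("KNOWS", jim, emil, {"since": 2010})
```

Because a relationship always holds both endpoints, it can be followed from either node, even though it has a direction.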


GRAPH DATABASE MANAGEMENT SYSTEM

‣ A Graph Database Management System is an online database management system

‣ CRUD (Create, Read, Update, and Delete) operations

‣ OLTP (Online Transaction Processing) transactional systems

‣ OLAP (Online Analytical Processing)

‣ Management systems that address scalability are also available


‣ There are two properties of graph databases we should consider when investigating graph database technologies:

‣ The underlying storage

‣ Some graph databases use native graph storage that is optimised and designed for storing and managing graphs

‣ The processing engine

‣ Native graph processing requires that a graph database use index-free adjacency, meaning that connected nodes physically “point” to each other in the database


‣ Index-free adjacency

‣ A graph processing engine is said to be native if it implements index-free adjacency

‣ Finding a neighbour through an index costs O(log n), while following an adjacent relationship costs O(1)

‣ The cost of queries is not dependent on the size of the graph but on the size of the traversed path

‣ With index-free adjacency, bidirectional joins are effectively precomputed and stored in the database as relationships
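‣ The difference between the two lookup strategies can be sketched as follows. This is a toy illustration under stated assumptions: a sorted Python list stands in for a global index, object references stand in for physical pointers, and all names (`edge_index`, `neighbours_via_index`, `neighbours_direct`) are hypothetical.

```python
from bisect import bisect_left

# Index-based storage: every relationship sits in one sorted global table
# of (start_id, end_id) pairs.
edge_index = sorted([(1, 2), (1, 3), (2, 3), (3, 1)])


def neighbours_via_index(node_id):
    """Locate the first entry by binary search: O(log n) in the number of edges."""
    i = bisect_left(edge_index, (node_id, float("-inf")))
    found = []
    while i < len(edge_index) and edge_index[i][0] == node_id:
        found.append(edge_index[i][1])
        i += 1
    return found


class Node:
    def __init__(self, node_id):
        self.id = node_id
        self.outgoing = []  # direct references to neighbouring Node objects


# Index-free adjacency: each node "points" at its neighbours.
nodes = {i: Node(i) for i in (1, 2, 3)}
for start, end in edge_index:
    nodes[start].outgoing.append(nodes[end])


def neighbours_direct(node):
    """Follow stored references: cost depends only on this node's degree."""
    return [n.id for n in node.outgoing]
```

Once a traversal holds a node, each hop with `neighbours_direct` ignores the rest of the graph, which is why query cost follows the size of the traversed path rather than the size of the dataset.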

GRAPH COMPUTE ENGINES

‣ A graph compute engine is a technology that enables global graph computational algorithms to be run against large datasets

‣ The architecture includes a system of record (SOR) database with OLTP properties

‣ Periodically, an Extract, Transform, and Load (ETL) job moves data from the system of record database into the graph compute engine for offline querying and analysis

WHY USE GRAPH DATABASES

‣ Performance

‣ In contrast to relational databases, where join-intensive query performance deteriorates as the dataset gets bigger, with a graph database performance tends to remain relatively constant, even as the dataset grows. This is because queries are localized to a portion of the graph

‣ Flexibility

‣ Structure and schema can emerge with our growing understanding of the problem space

‣ Graphs are naturally additive, meaning we can add new kinds of relationships, new nodes, new labels, and new subgraphs to an existing structure without disturbing existing queries and application functionality

‣ Semantic lifting and expansion are naturally implemented on graphs

‣ Integration with heterogeneous sources is also more natural in graph databases

WHY USE GRAPH DATABASES

‣ Agility

‣ Governance is typically applied in a programmatic fashion, using tests to drive out the data model and queries, as well as assert the business rules that depend upon the graph

RELATIONAL DATABASES LACK RELATIONSHIPS

‣ Join tables add accidental complexity; they mix business data with foreign key metadata

‣ Foreign key constraints add additional development and maintenance overhead

‣ Sparse tables with nullable columns require special checking in code

‣ Several expensive joins are often needed

‣ Reciprocal queries are even more costly


‣ Relational databases struggle with highly connected domains

‣ To understand the cost of performing connected queries in a relational database, we’ll look at some simple and not-so-simple queries in a social network domain

SELECT p1.Person
FROM Person p1 JOIN PersonFriend
  ON PersonFriend.FriendID = p1.ID
JOIN Person p2
  ON PersonFriend.PersonID = p2.ID
WHERE p2.Person = 'Bob'


‣ The reciprocal query, “who is friends with Bob?”, swaps the roles of the two join columns

SELECT p1.Person
FROM Person p1 JOIN PersonFriend
  ON PersonFriend.PersonID = p1.ID
JOIN Person p2
  ON PersonFriend.FriendID = p2.ID
WHERE p2.Person = 'Bob'
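‣ Both queries, “Bob's friends” and its reciprocal “who is friends with Bob?”, can be tried against an in-memory SQLite database. The sketch below assumes a minimal schema inferred from the column names used in the queries; the sample rows (Alice, Bob, Zach) are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Person (ID INTEGER PRIMARY KEY, Person TEXT);
CREATE TABLE PersonFriend (PersonID INTEGER, FriendID INTEGER);
INSERT INTO Person VALUES (1, 'Alice'), (2, 'Bob'), (3, 'Zach');
-- Bob lists Alice as a friend; Zach lists Bob as a friend
INSERT INTO PersonFriend VALUES (2, 1), (3, 2);
""")

# Who does Bob consider a friend?
BOBS_FRIENDS = """
SELECT p1.Person
FROM Person p1 JOIN PersonFriend
  ON PersonFriend.FriendID = p1.ID
JOIN Person p2
  ON PersonFriend.PersonID = p2.ID
WHERE p2.Person = 'Bob'
"""

# Reciprocal query: who considers Bob a friend?
FRIENDS_WITH_BOB = """
SELECT p1.Person
FROM Person p1 JOIN PersonFriend
  ON PersonFriend.PersonID = p1.ID
JOIN Person p2
  ON PersonFriend.FriendID = p2.ID
WHERE p2.Person = 'Bob'
"""

bobs_friends = [row[0] for row in conn.execute(BOBS_FRIENDS)]
friends_with_bob = [row[0] for row in conn.execute(FRIENDS_WITH_BOB)]
```

Each additional level of friendship (friends of friends, and so on) adds another pair of joins to the query.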

NOSQL DATABASES ALSO LACK RELATIONSHIPS

‣ Seeing a reference to order: 1234 in the record beginning user: Alice, we infer a connection between user: Alice and order: 1234. This gives us false hope that we can use keys and values to manage graphs

‣ Because there are no identifiers that “point” backward (the foreign aggregate “links” are not reflexive, of course), we lose the ability to run other interesting queries on the database

‣ Aggregate stores do not maintain consistency of connected data, nor do they support what is known as index-free adjacency

‣ Aggregate stores must employ inherently latent methods for creating and querying relationships outside the data model
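‣ The asymmetry can be sketched with a plain dictionary standing in for an aggregate store. The keys follow the user: Alice / order: 1234 example above; every other key and value, and the function names, are hypothetical.

```python
# Hypothetical aggregate store: key -> document
store = {
    "user:Alice": {"name": "Alice", "orders": ["order:1234", "order:5678"]},
    "user:Bob": {"name": "Bob", "orders": ["order:9999"]},
    "order:1234": {"total": 40},
    "order:5678": {"total": 15},
    "order:9999": {"total": 7},
}


def orders_of(user_key):
    """Forward direction: follow the keys embedded in the user aggregate."""
    return [store[key] for key in store[user_key]["orders"]]


def buyer_of(order_key):
    """Backward direction: nothing points from an order to its user,
    so the whole store has to be scanned."""
    for key, doc in store.items():
        if order_key in doc.get("orders", []):
            return key
    return None
```

The forward lookup touches only the records it needs, while the backward lookup grows with the size of the store unless the application maintains a separate index itself.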

PERFORMANCE

‣ Graph databases are designed to traverse graphs, so they perform well when querying interconnected domains

QUERYING GRAPHS

‣ Cypher is an expressive (yet compact) graph database query language

‣ Other graph databases have other means of querying data. Many, including Neo4j, support the RDF query language SPARQL and the imperative, path-based query language Gremlin

(emil)<-[:KNOWS]-(jim)-[:KNOWS]->(ian)-[:KNOWS]->(emil)

QUERYING GRAPHS

(emil:Person {name:'Emil'}) <-[:KNOWS]-(jim:Person {name:'Jim'}) -[:KNOWS]->(ian:Person {name:'Ian'}) -[:KNOWS]->(emil)

QUERYING GRAPHS

MATCH (a:Person {name:'Jim'})-[:KNOWS]->(b)-[:KNOWS]->(c), (a)-[:KNOWS]->(c)
RETURN b, c

QUERYING GRAPHS

MATCH (a:Person)-[:KNOWS]->(b)-[:KNOWS]->(c), (a)-[:KNOWS]->(c)
WHERE a.name = 'Jim'
RETURN b, c
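‣ What the MATCH pattern asks for can be spelled out by hand. The sketch below is not how a Cypher engine evaluates queries; it is a simplified Python rendering of the same triangle pattern over the Emil/Jim/Ian graph shown earlier, with a hypothetical adjacency dictionary `knows` and function name `match_triangle`.

```python
# Outgoing KNOWS relationships, taken from the pattern
# (emil)<-[:KNOWS]-(jim)-[:KNOWS]->(ian)-[:KNOWS]->(emil)
knows = {
    "Jim": ["Emil", "Ian"],
    "Ian": ["Emil"],
    "Emil": [],
}


def match_triangle(a):
    """Find (b, c) such that a KNOWS b, b KNOWS c, and a KNOWS c."""
    results = []
    for b in knows.get(a, []):          # (a)-[:KNOWS]->(b)
        for c in knows.get(b, []):      # (b)-[:KNOWS]->(c)
            if c in knows[a]:           # (a)-[:KNOWS]->(c)
                results.append((b, c))
    return results


jim_matches = match_triangle("Jim")
```

The traversal starts at the anchored node (Jim) and only visits nodes reachable from it, which mirrors the locality argument made for graph queries.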

QUERYING GRAPHS

‣ Cypher Clauses

‣ WHERE: Provides criteria for filtering pattern matching results.

‣ CREATE and CREATE UNIQUE: Create nodes and relationships.

‣ MERGE: Ensures that the supplied pattern exists in the graph, either by reusing existing nodes and relationships that match the supplied predicates, or by creating new nodes and relationships.

‣ DELETE: Removes nodes, relationships, and properties.

‣ SET: Sets property values.

‣ FOREACH: Performs an updating action for each element in a list.

‣ UNION: Merges results from two or more queries.

‣ WITH: Chains subsequent query parts and forwards results from one to the next. Similar to piping commands in Unix.

INCREMENTAL MODELING

‣ Graph databases provide for the smooth evolution of a data model

‣ We develop the data model feature by feature, user story by user story


INCREMENTAL MODELING

‣ If we need to find all the events that have occurred over a specific period, we can build a timeline tree
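‣ A timeline tree can be sketched as nested year, month, and day levels with events attached to the days. The version below uses nested dictionaries rather than graph nodes, and the function names and event names are hypothetical.

```python
from datetime import date

timeline = {}  # year -> month -> day -> list of event names


def add_event(day, name):
    """Attach an event under its year/month/day branch, creating levels as needed."""
    days = timeline.setdefault(day.year, {}).setdefault(day.month, {})
    days.setdefault(day.day, []).append(name)


def events_between(start, end):
    """Collect events in [start, end], skipping whole years outside the period."""
    found = []
    for year in sorted(timeline):
        if not start.year <= year <= end.year:
            continue  # prune the entire subtree for this year
        for month in sorted(timeline[year]):
            for day in sorted(timeline[year][month]):
                if start <= date(year, month, day) <= end:
                    found.extend(timeline[year][month][day])
    return found


# Hypothetical events
add_event(date(2017, 7, 30), "e1")
add_event(date(2017, 8, 2), "e2")
add_event(date(2017, 8, 15), "e3")
add_event(date(2018, 1, 1), "e4")

august_events = events_between(date(2017, 8, 1), date(2017, 8, 31))
```

In a graph database the same idea becomes Year, Month, and Day nodes linked by relationships, so a period query walks only the relevant branches.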

INCREMENTAL MODELING

‣ The carousel fraud

QUERYING GRAPHS

‣ POLE MODEL

‣ The POLE data model focuses on four basic types of entities and the relationships between them: Persons, Objects, Locations, and Events

Greater Manchester, UK from August 2017

INTEGRATION WITH ONTOLOGIES

‣ An ontology is a formal, explicit specification of a shared conceptualization that is characterized by high semantic expressiveness required for increased complexity (Feilmayr and Wöß, 2016)

‣ Ontologies are typically represented as graphs

‣ Web Ontology Language (OWL) is typically represented using RDF triples

‣ Ontologies contain inference rules that can be applied to a knowledge base

INTEGRATION WITH ONTOLOGIES

‣ Taking an example from the LUBM benchmark (Lehigh University Benchmark), a student is derived to be an attendee if he or she takes some course

‣ That is, when he or she matches the following ontological rule: Student and (takesCourse some) SubClassOf Attendee

‣ Any experienced Neo4j programmer may rub his or her hands, since this rule can be translated straightforwardly into the following Cypher expression:

MATCH (x:Student)-[:takesCourse]->() SET x:Attendee

‣ That is perfectly possible, but it could become cumbersome in the case of deeply nested rules that may also depend on each other

‣ For instance, the Cypher expression misses the subclasses of Student, such as UndergraduateStudent. Strictly speaking, the expression above should therefore read:

MATCH (x)-[:takesCourse]->() WHERE x:Student OR x:UndergraduateStudent SET x:Attendee
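‣ Listing every subclass by hand does not scale, so one option is to resolve the class hierarchy before applying the rule. The sketch below is a minimal Python illustration, not an OWL reasoner: the `SUBCLASS_OF` table contains only the UndergraduateStudent case mentioned above, and the node identifiers, the Professor label, and the function names are hypothetical.

```python
# child class -> parent class (only the case named in the text)
SUBCLASS_OF = {"UndergraduateStudent": "Student"}


def is_a(label, target):
    """Walk up the subclass chain from label until target is found or the chain ends."""
    while label is not None:
        if label == target:
            return True
        label = SUBCLASS_OF.get(label)
    return False


# Hypothetical nodes: labels plus the courses each one takes
nodes = {
    "x1": {"labels": {"Student"}, "takesCourse": ["c1"]},
    "x2": {"labels": {"UndergraduateStudent"}, "takesCourse": ["c2"]},
    "x3": {"labels": {"Student"}, "takesCourse": []},
    "x4": {"labels": {"Professor"}, "takesCourse": ["c1"]},
}


def infer_attendees(graph):
    """Apply: Student and (takes some course) implies Attendee."""
    for node in graph.values():
        is_student = any(is_a(label, "Student") for label in node["labels"])
        if is_student and node["takesCourse"]:
            node["labels"].add("Attendee")


infer_attendees(nodes)
```

Adding a new subclass then only requires a new entry in the hierarchy table, while the rule itself stays unchanged.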
