GRAPH DATABASE
Ernesto Damiani and Paolo [email protected]
Università degli Studi di Milano
Dipartimento di Informatica
WHAT IS A GRAPH?
‣ Formally, a graph is just a collection of vertices and edges
‣ Graphs represent entities as nodes and the ways in which those entities relate as relationships
‣ This general-purpose, expressive structure allows us to model all kinds of scenarios
‣ Graphs are extremely useful in understanding a wide diversity of datasets in fields such as science, government, and business
‣ Represent networks: social structures, topological relationships
‣ Represent a sequence of events
‣ Represent relationships between concepts: hyperonymy, hyponymy, meronymy
THE LABELED GRAPH MODEL
‣ The most popular form of graph model is the Labeled Graph Model
‣ It contains nodes and relationships
‣ Nodes contain properties (key-value pairs)
‣ Nodes can be labeled with one or more labels
‣ Relationships are named and directed, and always have a start and end node
‣ Relationships can also contain properties
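The elements of the labeled graph model listed above can be sketched as two small Python data classes. This is only an illustration: the class names, field names, and sample data (`alice`, `bob`, the `since` property) are assumptions, not part of the slides or of any graph database API.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Set


@dataclass
class Node:
    # A node can carry one or more labels and any number of key-value properties.
    labels: Set[str] = field(default_factory=set)
    properties: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Relationship:
    name: str    # relationships are named, e.g. "KNOWS"
    start: Node  # every relationship has a start node...
    end: Node    # ...and an end node, which makes it directed
    properties: Dict[str, Any] = field(default_factory=dict)


# Hypothetical sample data for illustration only.
alice = Node(labels={"Person", "Customer"}, properties={"name": "Alice"})
bob = Node(labels={"Person"}, properties={"name": "Bob"})
knows = Relationship("KNOWS", start=alice, end=bob, properties={"since": 2015})
```

Because `start` and `end` are mandatory fields, a relationship without both endpoints cannot be constructed, which mirrors the rule that relationships always have a start and an end node.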
GRAPH DATABASE MANAGEMENT SYSTEM
‣ A Graph Database Management System is an online database management system
‣ It exposes CRUD (Create, Read, Update, and Delete) operations on a graph data model
‣ OLTP (Online Transaction Processing) transactional systems
‣ OLAP (Online Analytical Processing)
‣ Management systems that address scalability are also available
GRAPH DATABASE MANAGEMENT SYSTEM
‣ There are two properties of graph databases we should consider when investigating graph database technologies:
‣ The underlying storage
‣ Some graph databases use native graph storage that is optimised and designed for storing and managing graphs
‣ The processing engine
‣ Native graph processing requires that a graph database use index-free adjacency, meaning that connected nodes physically “point” to each other in the database
GRAPH DATABASE MANAGEMENT SYSTEM
‣ Index-free adjacency
‣ A graph processing engine is said to be native if it implements index-free adjacency
‣ Looking up a relationship through an index costs O(log n), whereas following a direct adjacency pointer costs O(1)
‣ The cost of queries is not dependent on the size of the graph but on the size of the traversed path
‣ With index-free adjacency, bidirectional joins are effectively precomputed and stored in the database as relationships
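The difference between the two lookup strategies can be sketched in Python. This is a toy model, not how any particular database stores data: the sorted `edge_index` list stands in for an index table, the `Node.out` list stands in for physical pointers, and all names and sample data are assumptions made for the illustration.

```python
import bisect

# Index-based lookup: relationships live in one sorted, global table.
# Finding the first entry for a node requires a binary search, O(log n).
edge_index = sorted([("alice", "bob"), ("alice", "carol"), ("bob", "carol")])


def neighbours_via_index(node_id):
    i = bisect.bisect_left(edge_index, (node_id, ""))
    result = []
    while i < len(edge_index) and edge_index[i][0] == node_id:
        result.append(edge_index[i][1])
        i += 1
    return result


# Index-free adjacency: each node holds direct references to its neighbours,
# so reaching the adjacency list is O(1), whatever the size of the graph.
class Node:
    def __init__(self, name):
        self.name = name
        self.out = []  # direct references to adjacent Node objects


def neighbours_via_pointers(node):
    return [n.name for n in node.out]


alice, bob, carol = Node("alice"), Node("bob"), Node("carol")
alice.out = [bob, carol]
bob.out = [carol]
```

In both cases the cost of walking the neighbours themselves is proportional to their number; only the cost of finding where they start differs.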
GRAPH COMPUTE ENGINES
‣ A graph compute engine is a technology that enables global graph computational algorithms to be run against large datasets
‣ The architecture includes a system of record (SOR) database with OLTP properties
‣ Periodically, an Extract, Transform, and Load (ETL) job moves data from the system of record database into the graph compute engine for offline querying and analysis
WHY USE GRAPH DATABASES
‣ Performance
‣ In contrast to relational databases, where join-intensive query performance deteriorates as the dataset gets bigger, with a graph database performance tends to remain relatively constant, even as the dataset grows. This is because queries are localized to a portion of the graph
‣ Flexibility
‣ Structure and schema can emerge with our growing understanding of the problem space
‣ Graphs are naturally additive, meaning we can add new kinds of relationships, new nodes, new labels, and new subgraphs to an existing structure without disturbing existing queries and application functionality
‣ Semantic lifting and expansion are naturally implemented on graphs
‣ Integration with heterogeneous sources is also more natural in graph databases
WHY USE GRAPH DATABASES
‣ Agility
‣ Governance is typically applied in a programmatic fashion, using tests to drive out the data model and queries, as well as assert the business rules that depend upon the graph
RELATIONAL DATABASES LACK RELATIONSHIPS
‣ Join tables add accidental complexity; they mix business data with foreign key metadata
‣ Foreign key constraints add additional development and maintenance overhead
‣ Sparse tables with nullable columns require special checking in code
‣ Several expensive joins are often needed
‣ Reciprocal queries are even more costly
RELATIONAL DATABASES LACK RELATIONSHIPS
‣ Relational databases struggle with highly connected domains
‣ To understand the cost of performing connected queries in a relational database, we’ll look at some simple and not-so-simple queries in a social network domain
-- Bob's friends
SELECT p1.Person
FROM Person p1 JOIN PersonFriend
  ON PersonFriend.FriendID = p1.ID
JOIN Person p2
  ON PersonFriend.PersonID = p2.ID
WHERE p2.Person = 'Bob'
RELATIONAL DATABASES LACK RELATIONSHIPS
-- Reciprocal query: who is friends with Bob?
SELECT p1.Person
FROM Person p1 JOIN PersonFriend
  ON PersonFriend.PersonID = p1.ID
JOIN Person p2
  ON PersonFriend.FriendID = p2.ID
WHERE p2.Person = 'Bob'
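Both queries can be tried out against an in-memory SQLite database using Python's standard `sqlite3` module. The table layout follows the column names used in the queries; the sample rows (Alice, Bob, Zach and their friendships) are made up for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Person (ID INTEGER PRIMARY KEY, Person TEXT);
CREATE TABLE PersonFriend (PersonID INTEGER, FriendID INTEGER);
INSERT INTO Person VALUES (1, 'Alice'), (2, 'Bob'), (3, 'Zach');
-- Hypothetical data: Alice lists Bob as a friend, Bob lists Zach.
INSERT INTO PersonFriend VALUES (1, 2), (2, 3);
""")

# Bob's friends: Bob is the PersonID side of the join table.
bobs_friends = [row[0] for row in conn.execute("""
    SELECT p1.Person
    FROM Person p1 JOIN PersonFriend
      ON PersonFriend.FriendID = p1.ID
    JOIN Person p2
      ON PersonFriend.PersonID = p2.ID
    WHERE p2.Person = 'Bob'
""")]

# Who is friends with Bob: Bob is the FriendID side of the join table.
friends_with_bob = [row[0] for row in conn.execute("""
    SELECT p1.Person
    FROM Person p1 JOIN PersonFriend
      ON PersonFriend.PersonID = p1.ID
    JOIN Person p2
      ON PersonFriend.FriendID = p2.ID
    WHERE p2.Person = 'Bob'
""")]
```

With this sample data the two result lists differ, since the friendship rows are directed: one query reads the join table from `PersonID` to `FriendID`, the other in the opposite direction.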
NOSQL DATABASES ALSO LACK RELATIONSHIPS
‣ Seeing a reference to order: 1234 in the record beginning user: Alice, we infer a connection between user: Alice and order: 1234. This gives us false hope that we can use keys and values to manage graphs
‣ Because there are no identifiers that “point” backward (the foreign aggregate “links” are not reflexive, of course), we lose the ability to run other interesting queries on the database
‣ Aggregate stores do not maintain consistency of connected data, nor do they support what is known as index-free adjacency
‣ Aggregate stores must employ inherently latent methods for creating and querying relationships outside the data model
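The missing backward pointer can be illustrated with a plain Python dictionary standing in for an aggregate store. The record layout, the function names, and the extra orders are assumptions made for this sketch; only user: Alice and order: 1234 come from the slide.

```python
# Hypothetical aggregate store: each user record embeds the keys of its orders.
users = {
    "Alice": {"orders": [1234, 5678]},
    "Bob": {"orders": [9012]},
}


def orders_of(user):
    # Forward direction: a single key lookup on the aggregate.
    return users[user]["orders"]


def buyer_of(order_id):
    # Backward direction: nothing points from an order to its user,
    # so every aggregate has to be scanned.
    for name, record in users.items():
        if order_id in record["orders"]:
            return name
    return None
```

The forward lookup touches one record, while the backward lookup grows with the number of users unless a separate reverse index is built and maintained outside the data model.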
PERFORMANCE
‣ Graph databases are designed to traverse graphs, so they perform well when querying interconnected domains
QUERYING GRAPHS
‣ Cypher is an expressive (yet compact) graph database query language
‣ Other graph databases have other means of querying data. Many, including Neo4j, support the RDF query language SPARQL and the imperative, path-based query language Gremlin
(emil)<-[:KNOWS]-(jim)-[:KNOWS]->(ian)-[:KNOWS]->(emil)
QUERYING GRAPHS
(emil:Person {name:'Emil'})
  <-[:KNOWS]-(jim:Person {name:'Jim'})
  -[:KNOWS]->(ian:Person {name:'Ian'})
  -[:KNOWS]->(emil)
QUERYING GRAPHS
MATCH (a:Person {name:'Jim'})-[:KNOWS]->(b)-[:KNOWS]->(c),
      (a)-[:KNOWS]->(c)
RETURN b, c
QUERYING GRAPHS
MATCH (a:Person)-[:KNOWS]->(b)-[:KNOWS]->(c),
      (a)-[:KNOWS]->(c)
WHERE a.name = 'Jim'
RETURN b, c
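To make explicit what the MATCH pattern asks for, the same search can be written as nested loops over an in-memory adjacency map in Python. This is a hand-written sketch of the pattern's meaning, not how a Cypher engine evaluates it. The `knows` map encodes the three KNOWS relationships from the Jim, Ian, Emil example, and the function name is an assumption.

```python
# KNOWS relationships from the example: Jim knows Ian and Emil, Ian knows Emil.
knows = {
    "Jim": {"Ian", "Emil"},
    "Ian": {"Emil"},
    "Emil": set(),
}


def friends_who_know_each_other(a):
    """Return (b, c) pairs such that a KNOWS b, b KNOWS c, and a KNOWS c."""
    results = []
    for b in sorted(knows.get(a, ())):
        for c in sorted(knows.get(b, ())):
            # c != b keeps (a)-[:KNOWS]->(b) and (a)-[:KNOWS]->(c) from being
            # the same relationship, as a pattern uses each relationship once.
            if c != b and c in knows.get(a, ()):
                results.append((b, c))
    return results


pairs = friends_who_know_each_other("Jim")
```

For the example graph the only match binds `b` to Ian and `c` to Emil.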
QUERYING GRAPHS
‣ Cypher Clauses
‣ WHERE: Provides criteria for filtering pattern matching results.
‣ CREATE and CREATE UNIQUE: Create nodes and relationships (CREATE UNIQUE is deprecated in later Neo4j releases in favor of MERGE).
‣ MERGE: Ensures that the supplied pattern exists in the graph, either by reusing existing nodes and relationships that match the supplied predicates, or by creating new nodes and relationships.
‣ DELETE: Removes nodes and relationships; properties and labels are removed with REMOVE.
‣ SET: Sets property values.
‣ FOREACH: Performs an updating action for each element in a list.
‣ UNION: Merges results from two or more queries.
‣ WITH: Chains subsequent query parts and forwards results from one to the next. Similar to piping commands in Unix.
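The reuse-or-create behaviour described for MERGE can be approximated in Python as a get-or-create helper over a list of nodes. This is a deliberately simplified sketch: it handles a single node with one label, whereas MERGE works on whole patterns. The function name and the node representation are assumptions.

```python
nodes = []  # each node is a dict: {"labels": set, "properties": dict}


def merge_node(label, **props):
    """Return (node, created): reuse a matching node or create a new one."""
    for node in nodes:
        has_label = label in node["labels"]
        has_props = all(node["properties"].get(k) == v for k, v in props.items())
        if has_label and has_props:
            return node, False  # an existing node satisfies the predicates
    node = {"labels": {label}, "properties": dict(props)}
    nodes.append(node)
    return node, True           # nothing matched, so the node was created


first, created_first = merge_node("Person", name="Jim")
second, created_second = merge_node("Person", name="Jim")
```

Calling the helper twice with the same arguments leaves a single node in the list, which is the property that distinguishes MERGE from CREATE.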
INCREMENTAL MODELING
‣ Graph databases provide for the smooth evolution of a data model
‣ We develop the data model feature by feature, user story by user story
INCREMENTAL MODELING
‣ If we need to find all the events that have occurred over a specific period, we can build a timeline tree
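A timeline tree can be sketched in Python as nested dictionaries, with one level each for year, month, and day, and the events attached to the day level. The year/month/day granularity, the function names, and the sample events are assumptions made for this illustration.

```python
from collections import defaultdict
from datetime import date


def make_timeline():
    # year -> month -> day -> list of events
    return defaultdict(lambda: defaultdict(lambda: defaultdict(list)))


def add_event(tree, when, event):
    tree[when.year][when.month][when.day].append(event)


def events_in_month(tree, year, month):
    # Walk only the subtree for the requested period; .get() avoids
    # creating empty branches for periods that have no events.
    days = tree.get(year, {}).get(month, {})
    return [event for day in sorted(days) for event in days[day]]


# Hypothetical events for illustration only.
timeline = make_timeline()
add_event(timeline, date(2017, 8, 3), "order placed")
add_event(timeline, date(2017, 8, 1), "account opened")
add_event(timeline, date(2017, 9, 12), "order shipped")

august = events_in_month(timeline, 2017, 8)
```

Finding the events of a period then amounts to descending to the matching year and month and reading its children, without touching events from other periods.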
INCREMENTAL MODELING
‣ The carousel fraud
QUERYING GRAPHS
‣ POLE model
‣ The POLE data model focuses on four basic types of entities and the relationships between them: Persons, Objects, Locations, and Events
‣ Example dataset: Greater Manchester, UK, from August 2017
INTEGRATION WITH ONTOLOGIES
‣ An ontology is a formal, explicit specification of a shared conceptualization that is characterized by high semantic expressiveness required for increased complexity (Feilmayr and Wöß, 2016)
‣ Ontologies are typically represented as graphs
‣ Web Ontology Language (OWL) is typically represented using RDF triples
‣ Ontologies contain inference rules that can be applied to a knowledge base
INTEGRATION WITH ONTOLOGIES
‣ Taking an example from the LUBM benchmark (Lehigh University Benchmark), a student is inferred to be an attendee if he or she takes some course
‣ That is, when the student matches the following ontological rule: Student and (takesCourse some) SubClassOf Attendee
‣ Any experienced Neo4j programmer may rub his or her hands, since this rule translates straightforwardly into the following Cypher expression:
MATCH (x:Student)-[:takesCourse]->() SET x:Attendee
‣ That is perfectly possible but could become cumbersome in case of deeply nested rules that may also depend on each other
‣ For instance, the Cypher expression misses the subclasses of Student, such as UndergraduateStudent. Strictly speaking, the expression above should therefore read:
MATCH (x)-[:takesCourse]->() WHERE x:Student OR x:UndergraduateStudent SET x:Attendee
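One way to avoid listing every subclass by hand is to walk the class hierarchy when applying the rule. The Python sketch below does this over a toy in-memory graph. The `subclass_of` map contains only the UndergraduateStudent/Student pair mentioned above; the node identifiers, the Professor and Course labels, and the function names are assumptions for the illustration.

```python
# Class hierarchy fragment: child class -> parent class.
subclass_of = {"UndergraduateStudent": "Student"}

# Hypothetical nodes and takesCourse relationships.
nodes = {
    "n1": {"labels": {"UndergraduateStudent"}},
    "n2": {"labels": {"Student"}},
    "n3": {"labels": {"Professor"}},
    "c1": {"labels": {"Course"}},
}
takes_course = [("n1", "c1"), ("n3", "c1")]


def is_a(label, ancestor):
    # Follow subclass links upward until the ancestor is found or the chain ends.
    while label is not None:
        if label == ancestor:
            return True
        label = subclass_of.get(label)
    return False


def apply_attendee_rule():
    # Student and (takesCourse some ...) SubClassOf Attendee
    for source, _course in takes_course:
        if any(is_a(label, "Student") for label in nodes[source]["labels"]):
            nodes[source]["labels"].add("Attendee")


apply_attendee_rule()
```

Here `n1` becomes an Attendee through its UndergraduateStudent label, `n2` does not because it takes no course, and `n3` does not because Professor is not a subclass of Student. Adding a new subclass only requires a new entry in `subclass_of`, not a change to the rule.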