
GraphDB Free Documentation, Release 6.6

Ontotext

Dec 14, 2017


CONTENTS

1 General
  1.1 About GraphDB
  1.2 Architecture & Components
    1.2.1 Architecture
      1.2.1.1 Sesame
      1.2.1.2 The SAIL API
    1.2.2 GraphDB components
      1.2.2.1 Engine
      1.2.2.2 Connectors
      1.2.2.3 Workbench
  1.3 GraphDB Free
    1.3.1 Comparison of GraphDB Free and GraphDB SE
  1.4 Connectors
  1.5 Workbench
    1.5.1 Requirements
    1.5.2 How to use it

2 Quick Start Guide
  2.1 Starting the database
  2.2 Creating locations and repositories
  2.3 Loading data
    2.3.1 Supported file formats
    2.3.2 Loading data through the GraphDB Workbench
      2.3.2.1 To load a local file
      2.3.2.2 To load a database server file
      2.3.2.3 Other ways of loading data
    2.3.3 Loading data through SPARQL or Sesame API
    2.3.4 Loading data through the GraphDB LoadRDF tool
  2.4 Querying data
    2.4.1 Querying data through the GraphDB Workbench
    2.4.2 Querying data programmatically
      2.4.2.1 Execute the example query with a HTTP GET request
      2.4.2.2 Execute the example query with a POST operation
    2.4.3 Supported export/download formats
  2.5 Additional resources

3 Installation
  3.1 Distribution Package
  3.2 Requirements
    3.2.1 Minimum requirements
    3.2.2 Production environment
      3.2.2.1 Hardware
      3.2.2.2 Software
      3.2.2.3 Control & management
    3.2.3 Licensing
  3.3 Installing GraphDB Free
    3.3.1 Install GraphDB on a laptop/desktop computer
    3.3.2 Installing GraphDB on a server
  3.4 Upgrading GraphDB Free
    3.4.1 Procedure
    3.4.2 Upgrade scenarios
      3.4.2.1 The storage/ folder format is not different (for minor versions)
      3.4.2.2 The storage/ folder format is different but GraphDB can auto-upgrade it (for minor changes in the format)
      3.4.2.3 The storage/ folder format is different but the changes cannot be auto-upgraded (for major versions)
    3.4.3 Upgrade from/to specific versions
      3.4.3.1 Upgrade from 6.2
      3.4.3.2 Upgrade from 6.1
      3.4.3.3 Upgrade from 5.4 to 6.2+
      3.4.3.4 Upgrade from 5.x

4 Administration
  4.1 Administration Tasks
  4.2 Administration Tools
    4.2.1 Through the Workbench
    4.2.2 Through the JMX interface
    4.2.3 GraphDB configurator (spreadsheet)
  4.3 Creating a Repository
    4.3.1 Creating locations and repositories
  4.4 Configuring a Repository
    4.4.1 Steps
    4.4.2 A repository configuration template - how it works
    4.4.3 Sample configuration
    4.4.4 Configuration parameters
      4.4.4.1 Debugging options
    4.4.5 Configuring memory
      4.4.5.1 Configuring the JVM, the application and the GraphDB workspace memory
      4.4.5.2 Configuring the memory for storing entities
      4.4.5.3 Configuring the Cache memory
  4.5 Sizing Guidelines
    4.5.1 Entry-level deployment
    4.5.2 Mid-range deployment
    4.5.3 Enterprise deployment
  4.6 Disk Space Requirements
    4.6.1 GraphDB disk space requirements for loading a dataset
    4.6.2 GraphDB disk space requirements per statement
  4.7 Configuring the Entity Pool
  4.8 Starting & Shutting down GraphDB
    4.8.1 Starting GraphDB in the embedded Tomcat
    4.8.2 Shutting down GraphDB in the embedded Tomcat
    4.8.3 How to start/stop via non-embedded Tomcat (or other application server)
    4.8.4 Shutting down a repository instance from the JMX interface
  4.9 Managing Repositories
    4.9.1 Changing repository parameters
    4.9.2 Renaming a repository
  4.10 Access Rights and Security
    4.10.1 Using the GraphDB Workbench
    4.10.2 Using an HTTP authentication and the Sesame server's deployment descriptor
      4.10.2.1 Operations
      4.10.2.2 Security constraints and roles
      4.10.2.3 User accounts
    4.10.3 Programmatic authentication
  4.11 Backing up and Recovering a Repository
    4.11.1 Backing up a repository
    4.11.2 Restoring a repository
  4.12 Query and Transaction Monitoring
    4.12.1 Query monitoring and termination
      4.12.1.1 Preventing long running queries
    4.12.2 Terminating a transaction
  4.13 Diagnosing and Reporting Critical Errors
    4.13.1 Logs
    4.13.2 Report script
      4.13.2.1 Requirements
      4.13.2.2 Example

5 Usage
  5.1 Workbench User Guide
    5.1.1 Installation, start-up and shut down
    5.1.2 Checking your setup
    5.1.3 Using the Workbench
      5.1.3.1 Managing locations
      5.1.3.2 Managing repositories
    5.1.4 Loading data into a repository
      5.1.4.1 Import settings
      5.1.4.2 Four ways to import data
    5.1.5 Executing queries
    5.1.6 Exporting data
    5.1.7 Viewing and editing resources
    5.1.8 Namespace management
    5.1.9 Context view
    5.1.10 Connector management
    5.1.11 Users and access management
      5.1.11.1 Free access
    5.1.12 REST API
    5.1.13 Configuration properties
  5.2 Using GraphDB with the Sesame API
    5.2.1 Sesame Application Programming Interface (API)
      5.2.1.1 Using the Sesame API to access a local GraphDB repository
      5.2.1.2 Using the Sesame API to access a remote GraphDB repository
    5.2.2 Managing repositories with the Sesame Workbench
    5.2.3 SPARQL endpoint
    5.2.4 Graph Store HTTP Protocol
  5.3 Using GraphDB with Jena
    5.3.1 Installing GraphDB with Jena
  5.4 GraphDB Connectors
    5.4.1 Lucene GraphDB Connector
      5.4.1.1 Overview and features
      5.4.1.2 Usage
      5.4.1.3 Setup and maintenance
      5.4.1.4 Working with data
      5.4.1.5 List of creation parameters
      5.4.1.6 Datatype mapping
      5.4.1.7 Advanced filtering and fine tuning
      5.4.1.8 Overview of connector predicates
      5.4.1.9 Caveats
      5.4.1.10 Upgrading from previous versions
  5.5 GraphDB Dev Guide
    5.5.1 Reasoning
      5.5.1.1 Logical formalism
      5.5.1.2 Rule format and semantics
      5.5.1.3 The rule-set file
      5.5.1.4 Rule-sets
      5.5.1.5 Inference
      5.5.1.6 How TO's
    5.5.2 Storage
      5.5.2.1 What is GraphDB's persistence strategy
      5.5.2.2 GraphDB's indexing options
    5.5.3 Full-text Search
      5.5.3.1 RDF search
    5.5.4 Plugins
      5.5.4.1 Plugin API
      5.5.4.2 RDF Rank
      5.5.4.3 Geo-spatial Extensions
    5.5.5 Notifications
      5.5.5.1 What are GraphDB local notifications
      5.5.5.2 What are GraphDB remote notifications
    5.5.6 Query Behaviour
      5.5.6.1 What are named graphs
      5.5.6.2 How to manage explicit and implicit statements
      5.5.6.3 How to query explicit and implicit statements
      5.5.6.4 How to specify the dataset programmatically
      5.5.6.5 How to access internal identifiers for entities
      5.5.6.6 How to use Sesame 'direct hierarchy' vocabulary
      5.5.6.7 Other special GraphDB query behaviour
    5.5.7 Retain BIND Position Special Graph
    5.5.8 Performance Optimisations
      5.5.8.1 Data Loading & Query Optimisations
      5.5.8.2 Explain Plan
      5.5.8.3 Inference Optimisations
  5.6 Experimental Features
    5.6.1 Additional Plugins
      5.6.1.1 Plugin API improvements
      5.6.1.2 External plugins
    5.6.2 SPARQL-MM Support
      5.6.2.1 How to install SPARQL-MM support
      5.6.2.2 Usage examples
    5.6.3 GeoSPARQL support
      5.6.3.1 What is GeoSPARQL
      5.6.3.2 Installation
      5.6.3.3 Usage
    5.6.4 Provenance Plugin
      5.6.4.1 Description
      5.6.4.2 Vocabulary
      5.6.4.3 Examples
    5.6.5 Blueprints RDF Support
    5.6.6 RDF Priming
      5.6.6.1 What is RDF Priming
      5.6.6.2 RDF Priming configuration
      5.6.6.3 Controlling RDF Priming
      5.6.6.4 RDF Priming example
    5.6.7 Nested Repositories
      5.6.7.1 What are nested repositories
      5.6.7.2 Inference, indexing and queries
      5.6.7.3 Configuration
      5.6.7.4 Initialisation and shut down
    5.6.8 LVM-based Backup and Replication
      5.6.8.1 Prerequisites
      5.6.8.2 How it works
      5.6.8.3 Some further notes

6 Tools
  6.1 LoadRDF Tool
    6.1.1 What is GraphDB's LoadRDF tool
    6.1.2 How it works
    6.1.3 Java -D cmdline options
    6.1.4 Using loadrdf class programmatically
    6.1.5 Parallel inference
  6.2 Storage Tool
    6.2.1 Options
    6.2.2 Supported commands
    6.2.3 Examples

7 Benchmarks
  7.1 LUBM
    7.1.1 What is LUBM
    7.1.2 Configuring GraphDB and Sesame
    7.1.3 Running the benchmark
      7.1.3.1 Alternative to generating the LUBM datasets
  7.2 Performance Benchmark Results

8 References
  8.1 Introduction to the Semantic Web
    8.1.1 Resource Description Framework (RDF)
      8.1.1.1 Uniform Resource Identifiers (URIs)
      8.1.1.2 Statements: Subject-Predicate-Object Triples
      8.1.1.3 Properties
      8.1.1.4 Named graphs
    8.1.2 RDF Schema (RDFS)
      8.1.2.1 Describing classes
      8.1.2.2 Describing properties
      8.1.2.3 Sharing vocabularies
      8.1.2.4 Dublin Core Metadata Initiative
    8.1.3 Ontologies and knowledge bases
      8.1.3.1 Classification of ontologies
      8.1.3.2 Knowledge bases
    8.1.4 Logic and inference
      8.1.4.1 Logic programming
      8.1.4.2 Predicate logic
      8.1.4.3 Description logic
    8.1.5 The Web Ontology Language (OWL) and its dialects
      8.1.5.1 OWL DLP
      8.1.5.2 OWL Horst
      8.1.5.3 OWL2 RL
      8.1.5.4 OWL Lite
      8.1.5.5 OWL DL
    8.1.6 Query languages
      8.1.6.1 RQL, RDQL
      8.1.6.2 SPARQL
      8.1.6.3 SeRQL
    8.1.7 Reasoning strategies
      8.1.7.1 Total materialisation
    8.1.8 Semantic repositories
  8.2 GraphDB Feature Comparison
  8.3 SPARQL Compliance
    8.3.1 SPARQL 1.1 Protocol for RDF
    8.3.2 SPARQL 1.1 Query
    8.3.3 SPARQL 1.1 Update
      8.3.3.1 Modification operations on the RDF triples
      8.3.3.2 Operations for managing graphs
    8.3.4 SPARQL 1.1 Federation
    8.3.5 SPARQL 1.1 Graph Store HTTP Protocol
      8.3.5.1 URL patterns for this new functionality
      8.3.5.2 Methods supported by these resources and their effects
      8.3.5.3 Request headers
      8.3.5.4 Supported parameters for requests on indirectly referenced named graphs
  8.4 OWL Compliance
  8.5 Glossary
  8.6 Install Tomcat
    8.6.1 Requirements
    8.6.2 On Mac OS
      8.6.2.1 Steps
    8.6.3 On Windows
  8.7 Previous Versions
    8.7.1 Documentation for version 6.0 to 6.5
    8.7.2 Documentation for versions 4.x and 5.x

9 Release Notes
  9.1 GraphDB 6.6.x Release Notes
    9.1.1 GraphDB 6.6.9
      9.1.1.1 Component versions
      9.1.1.2 GraphDB Engine
      9.1.1.3 GraphDB Workbench
      9.1.1.4 GraphDB Connectors
    9.1.2 GraphDB 6.6.8
      9.1.2.1 Component versions
      9.1.2.2 GraphDB Engine
      9.1.2.3 GraphDB Workbench
      9.1.2.4 GraphDB Connectors
    9.1.3 GraphDB 6.6.7
      9.1.3.1 Component versions
      9.1.3.2 GraphDB Engine
      9.1.3.3 GraphDB Workbench
      9.1.3.4 GraphDB Connectors
    9.1.4 GraphDB 6.6.6
      9.1.4.1 Component versions
      9.1.4.2 GraphDB Engine
      9.1.4.3 GraphDB Workbench
      9.1.4.4 GraphDB Connectors
    9.1.5 GraphDB 6.6.5
      9.1.5.1 Component versions
      9.1.5.2 GraphDB Engine
      9.1.5.3 GraphDB Workbench
      9.1.5.4 GraphDB Connectors
    9.1.6 GraphDB 6.6.4
      9.1.6.1 Component versions
      9.1.6.2 GraphDB Engine
      9.1.6.3 GraphDB Cluster
      9.1.6.4 GraphDB Workbench
      9.1.6.5 GraphDB Connectors
    9.1.7 GraphDB 6.6.3
      9.1.7.1 Component versions
      9.1.7.2 GraphDB Engine
      9.1.7.3 GraphDB Cluster
      9.1.7.4 GraphDB Workbench
      9.1.7.5 GraphDB Connectors
    9.1.8 GraphDB 6.6.2
      9.1.8.1 Component versions
      9.1.8.2 GraphDB Engine
      9.1.8.3 GraphDB Workbench
      9.1.8.4 GraphDB Connectors
    9.1.9 GraphDB 6.6.1
      9.1.9.1 Component versions
      9.1.9.2 GraphDB Engine
      9.1.9.3 GraphDB Workbench
      9.1.9.4 GraphDB Connectors
    9.1.10 GraphDB 6.6.0
      9.1.10.1 Component versions
      9.1.10.2 GraphDB Engine
      9.1.10.3 GraphDB Workbench
      9.1.10.4 GraphDB Connectors
  9.2 GraphDB 6.5.x Release Notes
    9.2.1 GraphDB 6.5.0
      9.2.1.1 Component versions
      9.2.1.2 GraphDB Engine
      9.2.1.3 GraphDB Workbench
      9.2.1.4 GraphDB Connectors
  9.3 GraphDB 6.4.x Release Notes
    9.3.1 GraphDB 6.4.4
      9.3.1.1 Component versions
    9.3.2 GraphDB 6.4.3
      9.3.2.1 Component versions
    9.3.3 GraphDB 6.4.2
      9.3.3.1 Component versions
    9.3.4 GraphDB 6.4.1
      9.3.4.1 Component versions
    9.3.5 GraphDB 6.4.0
      9.3.5.1 Component versions
  9.4 GraphDB 6.3.x release notes
    9.4.1 GraphDB 6.3.1
      9.4.1.1 Component versions
      9.4.1.2 GraphDB Engine
      9.4.1.3 GraphDB Workbench
      9.4.1.4 GraphDB Connectors
    9.4.2 GraphDB 6.3.0
      9.4.2.1 Component versions
      9.4.2.2 GraphDB Engine
      9.4.2.3 GraphDB Workbench
      9.4.2.4 GraphDB Connectors

10 Roadmap
  10.1 Roadmap for Q4'15 and Q1'16

11 FAQ

12 Support


CHAPTER ONE: GENERAL

Hint: This documentation is written for technical people. Whether you are a database engineer or system designer evaluating how this database fits your system, or a developer who has already integrated it and actively employs its power, this is the complete reference. It is also useful for system administrators who need to support and maintain a GraphDB-based system.

Note: The GraphDB documentation presumes that the reader is familiar with databases. The required minimum of Semantic Web concepts and related information is provided in the Introduction to the Semantic Web section in References.

Ontotext GraphDB is a highly efficient and robust graph database with RDF and SPARQL support. This documentation is a comprehensive guide that explains every feature of GraphDB, as well as topics such as setting up a repository, loading and working with data, tuning performance, and scaling.

Credits and licensing

GraphDB uses rdf4j Sesame as a library, taking advantage of its APIs for storage and querying, as well as the support for a wide variety of query languages (e.g., SPARQL and SeRQL) and RDF syntaxes (e.g., RDF/XML, N3, Turtle).

The development of GraphDB is partly supported by SEKT, TAO, TripCom, LarKC, and other FP6 and FP7 European research projects.

Helpful hints

Throughout the documentation there are a number of badges that provide additional information, warn you, or save you time and unnecessary effort. Here is what to pay attention to:

Hint: Hint badges give additional information you may find useful.

Tip: Tip badges are handy pieces of information.

Note: Notes are comments or references that may save you time and unnecessary effort.


Warning: Warnings are pieces of advice that turn your attention to things you should be cautious about.

1.1 About GraphDB

GraphDB is a family of highly efficient, robust and scalable graph databases with RDF and SPARQL support. It has been proven in a number of enterprise use cases that required resilience in data loading and query answering. GraphDB can be used as a stand-alone server and as an embedded database.

GraphDB is the only triplestore that can perform semantic inferencing at scale, allowing users to derive new semantic facts from existing facts. It handles massive loads, queries and inferencing in real time.

Ontotext offers three editions of GraphDB: Free, Standard and Enterprise.

• GraphDB Free - commercial, file-based, sameAs & query optimisations, scales to tens of billions of RDF statements on a single server, with a limit of two concurrent queries;

• GraphDB Standard Edition (SE) - commercial, file-based, sameAs & query optimisations, scales to tens of billions of RDF statements on a single server with an unlimited number of concurrent queries. GraphDB SE is also available on-demand in the AWS Cloud, on a pay-per-use basis - see GraphDB Cloud v6.4.2;

• GraphDB Enterprise Edition (EE) - a high-availability cluster with worker and master database implementation for resilience and high-performance parallel query answering.

For more about the differences between the editions, see the GraphDB Feature Comparison section.

1.2 Architecture & Components

1.2.1 Architecture

GraphDB is packaged as a Storage and Inference Layer (SAIL) for Sesame and makes extensive use of the features and infrastructure of Sesame, especially the RDF model, RDF parsers and query engines.

Inference is performed by the Reasoner (TRREE Engine), where the explicit and inferred statements are stored in highly optimised data structures that are kept in-memory for query evaluation and further inference. The inferred closure is updated through inference at the end of each transaction that modifies the repository.

GraphDB implements the SAIL API interface so that it can be integrated with the rest of the Sesame framework, e.g., the query engines and the web UI. A user application can be designed to use GraphDB directly through the Sesame SAIL API or via the higher-level functional interfaces. When a GraphDB repository is exposed using the Sesame HTTP Server, users can manage the repository through the embedded Workbench, the Sesame Workbench, or other tools integrated with Sesame.


GraphDB High-level Architecture

1.2.1.1 Sesame

The rdf4j Sesame framework is a Java framework for storing, querying and reasoning with RDF data. It is implemented by Aduna as an open-source project and includes various storage back-ends (memory, file, database), query languages, reasoners and client-server protocols.

There are essentially two ways to use Sesame:

• as a standalone server;

• embedded in an application as a Java library.

Sesame supports the W3C SPARQL query language. It also supports the most popular RDF file formats and query result formats.

Sesame offers a JDBC-like user API, streamlined system APIs and a RESTful HTTP interface. Various extensions are available or are being developed by third parties.

Sesame Architecture

The following is a schematic representation of Sesame’s architecture and a brief overview of the main components.

The Sesame architecture (reproduced from the rdf4j Sesame 2 online documentation)


The Sesame framework is a loosely coupled set of components, where alternative implementations can be easily exchanged. Sesame comes with a variety of Storage And Inference Layer (SAIL) implementations that a user can select for the desired behaviour (in-memory storage, file system, relational database, etc.). GraphDB is a plugin SAIL component for the Sesame framework.

Applications normally communicate with Sesame through the Repository API. This provides a high enough level of abstraction that the details of particular underlying components remain hidden, i.e., different components can be swapped without requiring modification of the application.

The Repository API has several implementations, one of which communicates over HTTP with a remote repository that exposes the Repository API.
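For example, a minimal sketch (not taken from this guide; the repository ID is a placeholder) of connecting to a remote GraphDB repository over HTTP with the Sesame 2 Repository API:

    import org.openrdf.repository.Repository;
    import org.openrdf.repository.RepositoryConnection;
    import org.openrdf.repository.http.HTTPRepository;

    public class RemoteRepositoryExample {
        public static void main(String[] args) throws Exception {
            // The HTTPRepository proxy implements the same Repository API as a
            // local repository, so application code is unaffected by the swap.
            Repository repo = new HTTPRepository("http://localhost:8080/repositories/repository-id");
            repo.initialize();
            RepositoryConnection conn = repo.getConnection();
            try {
                System.out.println("Repository size: " + conn.size());
            } finally {
                conn.close();
                repo.shutDown();
            }
        }
    }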

1.2.1.2 The SAIL API

The SAIL API is a set of Java interfaces that support RDF storing, retrieving, deleting and inferencing. It is used for abstracting from the actual storage mechanism, e.g., an implementation can use relational databases, file systems, in-memory storage, etc. Its main characteristics are:

• flexibility and freedom for optimisations so that huge amounts of data can be handled efficiently on enterprise-level machines;

• extendability to other RDF-based languages;

• stacking of SAILs;

• concurrency control for any type of repository.

1.2.2 GraphDB components

1.2.2.1 Engine

Query optimiser

The query optimiser attempts to determine the most efficient way to execute a given query by considering the possible query plans. Once queries are submitted and parsed, they are passed to the query optimiser, where optimisation occurs. GraphDB allows hints for guiding the query optimiser.

Reasoner (TRREE Engine)

GraphDB is implemented on top of the TRREE engine. TRREE stands for 'Triple Reasoning and Rule Entailment Engine'. The TRREE performs reasoning based on forward-chaining of entailment rules over RDF triple patterns with variables. TRREE's reasoning strategy is total materialisation, although various optimisations are used. Further details of the rule language can be found in the Reasoning section.
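As a toy illustration of total materialisation by forward-chaining (invented data and a single rule, not the TRREE engine itself), the following sketch applies the transitivity of rdfs:subClassOf until a fixpoint is reached:

    import java.util.ArrayList;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    public class ForwardChainingSketch {
        record Triple(String s, String p, String o) { }

        public static void main(String[] args) {
            // Explicit statements (invented example data).
            Set<Triple> closure = new HashSet<>(List.of(
                    new Triple("ex:Cat", "rdfs:subClassOf", "ex:Mammal"),
                    new Triple("ex:Mammal", "rdfs:subClassOf", "ex:Animal")));

            // Apply one entailment rule (subClassOf transitivity) until fixpoint.
            boolean changed = true;
            while (changed) {
                changed = false;
                List<Triple> derived = new ArrayList<>();
                for (Triple a : closure)
                    for (Triple b : closure)
                        if (a.p().equals("rdfs:subClassOf")
                                && b.p().equals("rdfs:subClassOf")
                                && a.o().equals(b.s()))
                            derived.add(new Triple(a.s(), "rdfs:subClassOf", b.o()));
                for (Triple t : derived)
                    changed |= closure.add(t);   // add() returns false for duplicates
            }

            // The closure now also contains ex:Cat rdfs:subClassOf ex:Animal.
            closure.forEach(System.out::println);
        }
    }

Because the closure is computed up front, query evaluation never has to reason on the fly; the trade-off is extra work at load and update time, which is where the optimisations mentioned above come in.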

Storage

GraphDB stores all of its data in files in the configured storage directory, usually called 'storage'. The storage consists of two main statement indices (POS and PSO), two context indices (PSCO and POCS), a literal index and a page cache.

Entity Pool

The Entity Pool is a key component of the GraphDB storage layer. It converts entities (URIs, blank nodes and literals) to internal IDs (32- or 40-bit integers). From version 6.2 on, this component supports transactional behaviour, which improves space usage and cluster behaviour.
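The principle can be sketched as follows (a simplified, hypothetical illustration; GraphDB's actual implementation uses compact persistent structures, not Java collections):

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class EntityPoolSketch {
        private final Map<String, Integer> idByEntity = new HashMap<>();
        private final List<String> entityById = new ArrayList<>();

        // Returns the existing ID for an entity, or assigns the next free one.
        public int toId(String entity) {
            return idByEntity.computeIfAbsent(entity, e -> {
                entityById.add(e);
                return entityById.size() - 1;
            });
        }

        // Maps an internal ID back to the entity it denotes.
        public String toEntity(int id) {
            return entityById.get(id);
        }
    }

Storing small integer IDs instead of full strings keeps the statement indices compact and makes joins over them cheap.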


1.2.2.2 Connectors

The Connectors provide extremely fast keyword and faceted (aggregation) searches that are typically implemented by an external component or service, but have the additional benefit of staying automatically up-to-date with the GraphDB repository data. GraphDB comes with the following connector implementations:

• Lucene GraphDB Connector

1.2.2.3 Workbench

The Workbench is the default web-based administration tool.

1.3 GraphDB Free

What makes GraphDB Free different?

• Free to use;

• Manage tens of billions of RDF statements on a single server;

• Performs query and reasoning operations using file-based indices;

• Full SPARQL 1.1 support;

• Easy Java deployment and portability;

• Scalability, both in terms of data volume and loading and inferencing speed;

• Compatible with Sesame 2;

• Compatible with Jena with a built-in adapter;

• Full standard-compliant reasoning for RDFS, OWL 2 RL and QL;

• Support for custom reasoning rulesets and performance-optimised rulesets;

• Optimised support for data integration through owl:sameAs;

• Special indices for efficient geo-spatial constraints (nearby, within, distance);

• Full-text search, based on Lucene;

• Efficient retraction of inferred statements upon update;

• Reliable data preservation, consistency and integrity;

• Import/export of RDF syntaxes through Sesame: XML, N3, N-Triples, N-Quads, Turtle, TriG, TriX;

• API plugin framework, public classes and interfaces;

• Query optimiser allowing for the evaluation of different query plans;

• RDF rank to order query results by relevance or other measures;

• RDF Priming, based upon activation spreading, allows efficient data selection and context-aware query answering for handling huge datasets;

• Notification allowing clients to react to statements in the update stream;

• Lucene connector for extremely fast normal and faceted (aggregation) searches; automatically stays up-to-date with the GraphDB data;

• GraphDB Workbench - the default web-based administration tool;

• LoadRDF for very fast repository creation from big datasets;


GraphDB Free is the free standalone edition of GraphDB. It is implemented in Java and packaged as a Storage and Inference Layer (SAIL) for the Sesame RDF framework. GraphDB Free is a native RDF rule-entailment and storage engine. The supported semantics can be configured through ruleset definition and selection. Included are rulesets for OWL-Horst, unconstrained RDFS with OWL Lite and the OWL2 profiles RL and QL. Custom rulesets allow tuning for optimal performance and expressivity.

Reasoning and query evaluation are performed over a persistent storage layer. Loading, reasoning and query evaluation proceed extremely quickly even against huge ontologies and knowledge bases.

GraphDB Free can manage billions of explicit statements on desktop hardware and can handle tens of billions of statements on commodity server hardware.

1.3.1 Comparison of GraphDB Free and GraphDB SE

GraphDB Free and GraphDB SE are identical in terms of usage and integration, and share most features:

• designed as an enterprise-grade semantic repository system;

• suitable for massive volumes of data;

• file-based indices (enables it to scale to billions of statements even on desktop machines);

• inference and query optimisations (ensure fast query evaluation).

GraphDB Free

• suitable for low query loads and smaller projects.

GraphDB SE

• suitable for heavy query loads.

1.4 Connectors

The GraphDB Connectors provide extremely fast keyword and faceted (aggregation) searches that are typically implemented by an external component or service, but have the additional benefit of staying automatically up-to-date with the GraphDB repository data.

The Connectors provide synchronisation at the entity level, where an entity is defined as having a unique identifier (URI) and a set of properties and property values. In terms of RDF, this corresponds to a set of triples that have the same subject. In addition to simple properties (defined by a single triple), the Connectors support property chains. A property chain is defined as a sequence of triples where each triple's object is the subject of the subsequent triple.
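For illustration (hypothetical data, not connector configuration syntax), resolving a two-step property chain means following each triple's object into the subject position of the next triple:

    import java.util.List;
    import java.util.Map;

    public class PropertyChainExample {
        public static void main(String[] args) {
            // A tiny graph: subject -> (predicate -> object). Invented example data.
            Map<String, Map<String, String>> graph = Map.of(
                    "ex:book1", Map.of("ex:author", "ex:person1"),
                    "ex:person1", Map.of("ex:name", "\"Ann\""));

            // Walk the chain ex:author -> ex:name starting from ex:book1:
            // the object of each step becomes the subject of the next.
            String node = "ex:book1";
            for (String predicate : List.of("ex:author", "ex:name")) {
                node = graph.getOrDefault(node, Map.of()).get(predicate);
                if (node == null) break;   // chain not present in the data
            }
            System.out.println(node);      // prints "Ann"
        }
    }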

GraphDB Free comes with the following connector implementations:

• Lucene GraphDB Connector

1.5 Workbench

The GraphDB Workbench is the default web-based administration tool. The user interface is similar to the Sesame Workbench Web Application, but with more functionality.

What makes GraphDB Workbench different?

• A better SPARQL editor, based on YASGUI (http://laurensrietveld.nl/yasgui/);

• Import of server files;

• Export in more formats;


• Query monitoring, with the possibility to kill a long-running query;

• System resource monitoring;

• User and permission management;

• Connector management;

• Cluster management.

The GraphDB Workbench can be used for:

• managing GraphDB repositories;

• loading and exporting data;

• executing SPARQL queries and updates;

• managing namespaces;

• managing contexts;

• viewing/editing RDF resources;

• monitoring queries;

• monitoring resources;

• managing users and permissions;

• managing connectors;

• automating various repository management and administration tasks through its REST API.

The GraphDB Workbench is packaged as a separate .war file in the GraphDB distribution. It can be used either as a workbench only, or as a workbench + database server.

1.5.1 Requirements

• Java: JRE 1.7 or higher (JDK 1.7 or higher if using custom rule-sets)

• Apache Tomcat: Version 7 or higher if deployed as a .war file

1.5.2 How to use it

• See the Workbench User Guide


CHAPTER TWO: QUICK START GUIDE

2.1 Starting the database

1. Download your GraphDB distribution file and unzip it.

2. Start the GraphDB database and Workbench interfaces in the embedded Tomcat server by executing the startup script located in the root directory:

startup.bat (Windows)
./startup.sh (Linux/Unix/Mac OS)

The message below appears in your terminal, and the GraphDB Workbench opens at http://localhost:8080/.

INFO: Starting ProtocolHandler ["http-bio-8080"]
...
Opening web app in default browser

Tip: To change the database port number, execute:

startup.bat -p 9080 (Windows)
./startup.sh -p 9080 (Linux/Unix/Mac OS)

Warning: In versions of GraphDB prior to 6.6.3 there is a known issue with the startup.bat file, and specifying the port number as shown above will not work. Instead, you have to edit the file and add the port number like this:

java -XX:PermSize=256m -XX:MaxPermSize=256m -jar graphdb-tomcat.jar -p 9080

Note: The maximum amount of memory that GraphDB's JVM is allowed to allocate is controlled by the -Xmx1G parameter. To see how to increase the memory, see Configuring memory.

Note: GraphDB runs in a non-secure mode. To see how to add password protection to the GraphDB server instance, see Access Rights and Security. The default administrator user name and password are admin/root.


2.2 Creating locations and repositories

Data locations group a set of repositories and expose them as Sesame endpoints. They can be initialised as a local file path or a remote server URL, which requires a valid Sesame endpoint. When a local file path is set, the current Java process initialises all repositories locally and they operate in the same memory address space. Each location has a SYSTEM repository containing meta-data about how to initialise the other repositories from the current location.

To create data locations and repositories:

1. Go to the GraphDB Workbench and navigate to Admin -> Locations and Repositories.

2. Choose Attach Location and enter a local file system path.

3. Click the Add button.

Tip: This sets the path under which all GraphDB database binaries are created. Alternatively, connect to a remote location exposed via the Sesame API by supplying a valid URL endpoint.

4. Create a repository with the Create Repository button.

5. Enter the Repository ID (e.g., worker-node) and leave all other optional configuration settings with their default values.

Tip: For repositories with more than a few tens of millions of statements, see Configuring a Repository.

6. Set the newly created repository as the default repository for this location with the Connect button.

Tip: Alternatively, use the curl command to perform basic location and repository management through the Workbench REST API.


2.3 Loading data

2.3.1 Supported file formats

Serialisation format    MIME type                   File extension
RDF/XML                 application/rdf+xml         .owl / .rdf
N-Triples               text/plain                  .nt
Turtle                  text/turtle                 .ttl
N3                      text/rdf+n3                 .n3
N-Quads                 text/x-nquads               .nq
RDF/JSON                application/rdf+json        .rj
TriX                    application/trix            .trix
TriG                    application/x-trig          .trig
Sesame Binary RDF       application/x-binary-rdf    .brf

2.3.2 Loading data through the GraphDB Workbench

2.3.2.1 To load a local file:

1. Select Data -> Import.

2. Open the Local files tab and click the Select files icon to choose the file you want to upload.

3. Click the Import button.

4. Enter the import settings in the pop-up window.


Import Settings

• Base URI: the default prefix for all local names in the file;

• Context: specifies a graph within the repository;

• Chunk size: the size of the batch operation; used for very large files (e.g., 10,000 - 100,000 triples per chunk);

• Retry times: the number of times the workbench will try to upload the chunk before cancelling (in case of an HTTP error during the data transfer);

• Preserve BNode IDs: when checked, the parser keeps the blank node IDs with their original strings.

Tip: Chunking a file is optional, but we recommend it for files larger than 200 MB.

5. Click the Import button.

2.3.2.2 To load a database server file:

1. Create a folder named graphdb-import in your user home directory.

2. Copy all data files you want to load into the GraphDB database to this folder.

3. Go to the GraphDB Workbench.

4. Select Data -> Import.

5. Open the Server files tab.

6. Select the files you want to import.

7. Click the Import button.

Note: The file can be loaded only if it is accessible from the local file system (or a network file system mounted locally).

Note: This option works only for local Locations (i.e., the Location is a local file path and not a remote URL); otherwise, the file will be posted via the remote Sesame API.


2.3.2.3 Other ways of loading data:

• By pasting a data URL in the Remote content tab of the Import page.

• By pasting data in the Text area tab of the Import page.

• By executing an INSERT query in the SPARQL -> SPARQL Query page.

2.3.3 Loading data through SPARQL or Sesame API

The GraphDB database also supports a very powerful API with a standard SPARQL or Sesame endpoint, to which data can be posted with cURL, a local Java client API or a Sesame console. It is compliant with all standards and allows every database operation to be executed via an HTTP client request.

1. Locate the correct GraphDB URL endpoint:

• select Admin -> Locations and Repositories

• click the link icon next to the repository name


• copy the repository URL.

2. Go to the folder where your local data files are.

3. Execute the script:

curl -X POST -H "Content-Type: application/x-turtle" -T localfilename.ttl \
    http://localhost:8080/repositories/repository-id/statements

where localfilename.ttl is the data file you want to import and http://localhost:8080/repositories/repository-id/statements is the GraphDB URL endpoint of your repository.

Tip: Alternatively, use the full path to your local file.
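The same import can also be performed from Java using the Sesame client API mentioned above. The following is a minimal sketch, assuming the endpoint and file name from the curl example (the repository URL and base URI are placeholders to adapt):

import java.io.File;
import org.openrdf.repository.Repository;
import org.openrdf.repository.RepositoryConnection;
import org.openrdf.repository.http.HTTPRepository;
import org.openrdf.rio.RDFFormat;

public class LoadTurtleFile {
    public static void main(String[] args) throws Exception {
        // Placeholder repository endpoint, as in the curl example above
        Repository repo = new HTTPRepository("http://localhost:8080/repositories/repository-id");
        repo.initialize();
        RepositoryConnection conn = repo.getConnection();
        try {
            // Parses the local Turtle file and uploads its statements in one call
            conn.add(new File("localfilename.ttl"), "http://example.org/base#", RDFFormat.TURTLE);
        } finally {
            conn.close();
            repo.shutDown();
        }
    }
}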

2.3.4 Loading data through the GraphDB LoadRDF tool

LoadRDF is a low-level bulk load tool which writes directly to the database index structures. It is very fast and supports parallel inference. For more information, see the LoadRDF Tool.

Note: Loading data through the GraphDB LoadRDF tool can be performed only if the repository is empty, e.g., for an initial load while the database is shut down.

2.4 Querying data

Hint: SPARQL is a SQL-like query language for RDF graph databases. It supports the following query types:

• SELECT - returns tabular results;

• CONSTRUCT - creates a new RDF graph based on query results;

• ASK - returns “YES”, if the query has a solution, otherwise “NO”;

• DESCRIBE - returns RDF data about a resource; useful when you do not know the RDF data structure in thedata source;

• INSERT - inserts triples into a graph;

• DELETE - deletes triples from a graph.

2.4.1 Querying data through the GraphDB Workbench

1. Select the repository and click the SPARQL menu tab.

2. Write your query.

# Example query:
CONSTRUCT {?s ?p ?o}
WHERE {?s ?p ?o}
LIMIT 100

3. Click the Run button.


Tip: To get started with your own query, use the sample template provided above the query text box. The editor's syntax highlighting will guide you in writing your own data READ or WRITE queries. You can also save your favourite query templates for later use, or share them via a persistent link using the corresponding icon.

2.4.2 Querying data programmatically

SPARQL is not only a standard query language, but also a protocol for communicating with RDF databases. GraphDB stays compliant with the protocol specification and allows querying data with standard HTTP requests.

2.4.2.1 Execute the example query with an HTTP GET request:

curl -G -H "Accept: application/x-trig" \
    -d query=CONSTRUCT+%7B%3Fs+%3Fp+%3Fo%7D+WHERE+%7B%3Fs+%3Fp+%3Fo%7D+LIMIT+10 \
    http://localhost:8080/repositories/yourrepository

2.4.2.2 Execute the example query with a POST operation:

curl -X POST --data-binary @file.sparql \
    -H "Accept: application/rdf+xml" \
    -H "Content-type: application/x-www-form-urlencoded" \
    http://localhost:8080/repositories/worker-node

where file.sparql contains a URL-encoded query:

query=CONSTRUCT+%7B%3Fs+%3Fp+%3Fo%7D+WHERE+%7B%3Fs+%3Fp+%3Fo%7D+LIMIT+10
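Decoded, this corresponds to the SPARQL query:

CONSTRUCT {?s ?p ?o} WHERE {?s ?p ?o} LIMIT 10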

Tip: For more information on how to interact with the GraphDB APIs, refer to the Sesame and SPARQL protocols or the Linked Data Platform specifications.
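The same example query can also be evaluated from Java through the Sesame client API; a minimal sketch, assuming the endpoint used in the curl examples above (a SELECT variant is shown - CONSTRUCT queries are prepared with prepareGraphQuery() instead):

import org.openrdf.query.BindingSet;
import org.openrdf.query.QueryLanguage;
import org.openrdf.query.TupleQuery;
import org.openrdf.query.TupleQueryResult;
import org.openrdf.repository.Repository;
import org.openrdf.repository.RepositoryConnection;
import org.openrdf.repository.http.HTTPRepository;

public class RunSparqlQuery {
    public static void main(String[] args) throws Exception {
        Repository repo = new HTTPRepository("http://localhost:8080/repositories/yourrepository");
        repo.initialize();
        RepositoryConnection conn = repo.getConnection();
        try {
            TupleQuery query = conn.prepareTupleQuery(QueryLanguage.SPARQL,
                    "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10");
            TupleQueryResult result = query.evaluate();
            try {
                // Print each solution of the query
                while (result.hasNext()) {
                    BindingSet bs = result.next();
                    System.out.println(bs.getValue("s") + " " + bs.getValue("p") + " " + bs.getValue("o"));
                }
            } finally {
                result.close();
            }
        } finally {
            conn.close();
            repo.shutDown();
        }
    }
}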

2.4.3 Supported export/download formats

Serialisation format    Query type             MIME type
XML                     SELECT, ASK            application/sparql-results+xml
JSON                    SELECT, ASK            application/sparql-results+json
CSV                     SELECT, ASK            text/csv
TSV                     SELECT, ASK            text/tsv
JSON                    CONSTRUCT, DESCRIBE    application/rdf+json
JSON-LD                 CONSTRUCT, DESCRIBE    application/ld+json
RDF/XML                 CONSTRUCT, DESCRIBE    application/rdf+xml
N3                      CONSTRUCT, DESCRIBE    text/rdf+n3
N-Quads                 CONSTRUCT, DESCRIBE    text/x-nquads
N-Triples               CONSTRUCT, DESCRIBE    text/plain
Turtle                  CONSTRUCT, DESCRIBE    text/turtle
TriX                    CONSTRUCT, DESCRIBE    application/trix
TriG                    CONSTRUCT, DESCRIBE    application/x-trig

2.5 Additional resources

Ontotext:


Community Forum and Evaluation Support: http://stackoverflow.com
GraphDB Website: http://graphdb.ontotext.com

SPARQL, OWL, and RDF:

RDF: http://www.w3.org/TR/rdf11-concepts/
RDFS: http://www.w3.org/TR/rdf-schema/
SPARQL Overview: http://www.w3.org/TR/sparql11-overview/
SPARQL Query: http://www.w3.org/TR/sparql11-query/
SPARQL Update: http://www.w3.org/TR/sparql11-update


CHAPTER 3: INSTALLATION

3.1 Distribution Package

In versions 6.6.3 and newer, the distribution contains the following files:

Path                                         Description
adapters\                                    Support for SAIL graphs with the Blueprints API
benchmark\                                   Semantic publishing benchmark scripts
bin\                                         Scripts for running various utilities, such as LoadRDF and the Storage Tool
dependencies\                                Not in the distribution. Created when a script from bin is run for the first time
doc\                                         License agreements, graphdb-se-configurator.xls
examples\                                    Getting started examples
ext\                                         External third party libraries
ext2\                                        External third party libraries
graphdb-tomcat.jar                           Embedded Tomcat server
lib\                                         Database binary files
loadrdf\                                     Sample code to load data
plugins\                                     Geo-sparql and SPARQL-mm plugins
README                                       The readme file
rules\                                       Standard reasoning rulesets
scripts\                                     Deprecated scripts. Use the scripts in bin instead.
sesame_graphdb\                              GraphDB as web application packages
sesame_graphdb\graphdb-workbench-free.war    Sesame server, GraphDB database and GraphDB Workbench packaged as a web application
startup.bat                                  Windows start script
startup.sh                                   Linux/Unix/Mac OS start script

In versions 6.6.0 to 6.6.2 the distribution contains the following files:

Path                                         Description
doc\                                         License agreements
examples\                                    Getting started examples
graphdb-tomcat.jar                           Embedded Tomcat server
README                                       The readme file
sesame_graphdb\graphdb-workbench-free.war    Sesame server, GraphDB database and GraphDB Workbench packaged as a web application
startup.bat                                  Windows start script
startup.sh                                   Linux/Unix/Mac OS start script


3.2 Requirements

3.2.1 Minimum requirements

• Java Runtime Environment (JRE 1.7 or higher)

• Java Development Kit (JDK 1.7 or higher)

3.2.2 Production environment

3.2.2.1 Hardware

We recommend the following configuration parameters, especially for production deployments:

• Intel Xeon CPUs;

• At least 32GB of memory;

• SSD drives.

3.2.2.2 Software

• Apache Tomcat 7 or a JEE-compliant servlet container

3.2.2.3 Control & management

• Workbench interface.

3.2.3 Licensing

GraphDB Free is available under an RDBMS-like free license. It is free to use but not open-source.

3.3 Installing GraphDB Free

3.3.1 Install GraphDB on a laptop/desktop computer

This is the quickest way to start GraphDB and the Workbench. It uses the embedded Tomcat server (see the Quick Start Guide).

Hint: Embedded Tomcat vs Standalone Tomcat

Install GraphDB on a server to support enterprise-level applications. This type of installation gives you better control and more configuration possibilities.

3.3.2 Installing GraphDB on a server

1. Install Tomcat Servlet Container (or any other application server).

2. Download and unzip the GraphDB distribution file.

3. Copy the graphdb-workbench-free.war file from the sesame_graphdb directory to the Tomcat webapps directory.

4. Start Tomcat.


5. Go to http://localhost:8080/graphdb-workbench-free to access the GraphDB Workbench.

6. Start using GraphDB (refer to the Workbench User Guide).

3.4 Upgrading GraphDB Free

3.4.1 Procedure

Common configuration:

• OS: Linux, Bash shell, and a standard GraphDB installation - Tomcat + graphdb-workbench-free.war

Warning: Before upgrading GraphDB, it is very important to back up your storage files; otherwise, it may later be impossible to downgrade to a previous version.

1. Back up the repository data.

2. Shut down the database (stop Tomcat) and delete the older GraphDB application - the .war file and the expanded folder.

Sample shutdown/cleanup code

/opt/tomcat/bin/shutdown.sh
sleep 10                     # wait some time for Tomcat to stop
cd /opt/tomcat/webapps
rm -rf graphdb-workbench*    # deletes the .war file and the graphdb-workbench/ folder

3. Copy the new .war file to the application server’s /webapps folder:

cp ~/Downloads/graphdb-free-x.y.z/sesame_graphdb/graphdb-workbench-free.war /opt/tomcat/webapps/

4. Start the application server.

/opt/tomcat/startup.sh

5. Upgrade the storage/ folder.

3.4.2 Upgrade scenarios

There are three possible upgrade scenarios, depending on whether the GraphDB version has changes in the repository format.

Tip: Usually, only the major versions of GraphDB have changes in the repository format.

3.4.2.1 The storage/ folder format is not different (for minor versions)

The upgrade is straightforward.


3.4.2.2 The storage/ folder format is different but GraphDB can auto-upgrade it (for minor changes in the format)

Note: The auto-upgrade takes place during the repository initialisation (after the deployment of the new .war files). It is highly advisable to look at the relevant log messages during this automatic upgrade.

1. Use the new version of GraphDB to create an empty repository (with the right configuration, e.g., ruleset).

2. Shut down any running GraphDB instance.

3. Locate the storage folders for the current instance and the new instance.

4. Delete the contents of the new storage folder (it will be called something like repo_id/storage).

5. Copy all files (and subdirectories, if present) from the old storage folder to the new storage folder.

6. Restart the new instance.

Note: It may take quite a long time to initialise as the storage files are modified, but it should be quicker than re-importing all the data.

3.4.2.3 The storage/ folder format is different but the changes cannot be auto-upgraded (for major versions)

Note: Slow for big repositories.

With the GraphDB or Sesame Workbench:

1. Export the RDF data from the old version of GraphDB.

2. Reload it into a new repository instance that uses the new version of GraphDB.

Programmatically, using the Sesame API:

Export the repository data using the RepositoryConnection.getStatements() API.

Note: Export only the explicit statements, as the inferred statements are recomputed at load time. Fortunately, the Sesame console application does have a 'load' function and this can be used to reload the exported statements.
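A minimal sketch of such an export with the Sesame 2 API (the repository URL and output file are placeholders; exportStatements() streams the matching statements through an RDF writer, and the 'false' argument skips the inferred statements, as recommended above):

import java.io.FileOutputStream;
import java.io.OutputStream;
import org.openrdf.repository.Repository;
import org.openrdf.repository.RepositoryConnection;
import org.openrdf.repository.http.HTTPRepository;
import org.openrdf.rio.RDFFormat;
import org.openrdf.rio.RDFHandler;
import org.openrdf.rio.Rio;

public class ExportExplicitStatements {
    public static void main(String[] args) throws Exception {
        // Placeholder endpoint of the old repository
        Repository repo = new HTTPRepository("http://localhost:8080/repositories/old-repository");
        repo.initialize();
        RepositoryConnection conn = repo.getConnection();
        try {
            OutputStream out = new FileOutputStream("export.nt");
            RDFHandler writer = Rio.createWriter(RDFFormat.NTRIPLES, out);
            // null wildcards match every subject/predicate/object; false = explicit statements only
            conn.exportStatements(null, null, null, false, writer);
            out.close();
        } finally {
            conn.close();
            repo.shutDown();
        }
    }
}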

3.4.3 Upgrade from/to specific versions

3.4.3.1 Upgrade from 6.2

The upgrade is straightforward. No special steps are required:

• No changes in the engine;

• No incompatible changes in the Workbench;

• No incompatible changes in the Connectors.


3.4.3.2 Upgrade from 6.1

Repository upgrade

There are no changes in the repository format between 6.1 and 6.2, but it is highly recommended that the repository is checked with the Storage Tool.

1. Stop GraphDB.

2. Use the Storage Tool offline to check the repository integrity (an illustrative invocation is sketched after these steps).

(a) Run the check command to check the consistency of the indexes;

-command=check -storage=/repo/storage

(b) Run the rebuild command to rebuild the broken index (exception: predLists rebuilds automatically on GraphDB start if you delete its files before that);

-command=rebuild -storage=/repo/storage

(c) Run the scan command to check the integrity of the data in the statement indexes.

-command=scan -storage=/repo/storage

(d) Run the rebuild command to rebuild the broken statement index from one of its siblings.

3. Start GraphDB.
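For illustration only - the Storage Tool is launched from the distribution's bin\ folder, but the exact script name is an assumption and may vary between versions - a check run could look like this:

./bin/storage-tool.sh -command=check -storage=/repo/storage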

3.4.3.3 Upgrade from 5.4 to 6.2+

For this upgrade:

• delete the log config, otherwise it will not be replaced by the newer GraphDB 6.x log config

• delete the folder <ADUNA-HOME>/openrdf-sesame/conf

rm ~/.aduna/openrdf-sesame/conf/*

If you do not do the above, query.log and slow-query.log will be missing in the log/ folder.

Finally, we advise you to switch to the GraphDB Workbench instead of using the Sesame Workbench.

3.4.3.4 Upgrade from 5.x

Note: GraphDB’s former name was ‘OWLIM’ and it was renamed to better represent what it actually does.

The renaming was a cosmetic change - the error messages and other OWLIM strings have been renamed to GraphDB. Property files, license files and package/class names remain the same for compatibility reasons (e.g., the owlim.properties file has not been changed). OWLIM 5.6 corresponds to GraphDB 6.0.

In most cases, upgrading GraphDB can be achieved simply by replacing the GraphDB and bundled 3rd party .jar files. You can check the required version of the Sesame framework for the recent GraphDB (OWLIM) versions in the following table:


OWLIM/GraphDB version    Required Sesame version
GraphDB 6.0              Sesame 2.7.8 + new index format (auto-upgraded) - the same for OWLIM 5.5
OWLIM 5.5                Sesame 2.7.8 + new index format (auto-upgraded)
OWLIM 5.4                Sesame 2.7.8
OWLIM 5.3                Sesame 2.6.7+
OWLIM 5.2                Sesame 2.6.7+
OWLIM 5.1                Sesame 2.6
OWLIM 5.0                Sesame 2.6
OWLIM 4.3                Sesame 2.6
OWLIM 4.2                Sesame 2.5
OWLIM 4.1                Sesame 2.4
OWLIM 4.0                Sesame 2.4
BigOWLIM 3.5             Sesame 2.3
BigOWLIM 3.4             Sesame 2.3
BigOWLIM 3.3             Sesame 2.3
BigOWLIM 3.2             Sesame 2.3
BigOWLIM 3.1             Sesame 2.2

Note: Sometimes, the binary format of the storage files for different versions of BigOWLIM/OWLIM/GraphDB is not compatible. However, storage files created with version 3.1 and upwards will be automatically upgraded when using a later version of OWLIM/GraphDB. This is a one-way process and no downgrade facility is provided.


CHAPTER 4: ADMINISTRATION

4.1 Administration Tasks

The goal of this guide is to help you perform all common administrative tasks needed to keep a GraphDB database operational. These tasks include configuring the database, managing memory and storage, managing users, managing repositories, performing basic troubleshooting, creating backups, performance monitoring activities, and more.

The common administration tasks are:

• Installing GraphDB Free

• Upgrading GraphDB Free

• Creating a Repository

• Configuring a Repository

• Configuring the Entity Pool

• Starting & Shutting down GraphDB

• Managing Repositories

– Changing repository parameters

– Renaming a repository

• Access Rights and Security

• Backing up and Recovering a Repository

– Backing up a repository

– Restoring a repository

• Diagnosing and Reporting Critical Errors

4.2 Administration Tools

GraphDB can be administered through the GraphDB Workbench, the JMX interface, or programmatically.

4.2.1 Through the Workbench

For administering the repository through the Workbench, see Managing repositories.


4.2.2 Through the JMX interface

GraphDB offers a number of management and control functions using JMX. GraphDB uses this interface to provide information about its internal state and behaviour, including trend information, as well as operations for intervening in certain database activities.

After initialisation, GraphDB will register a number of JMX MBeans for each repository, each providing a different set of information and functions for specific features. The JMX endpoint is configured using special system properties when starting the Java virtual machine (JVM) in which GraphDB is running. For example, the following command line parameters set the JMX server endpoint to listen on port 8089, without authentication and without a secure socket layer:

-Dcom.sun.management.jmxremote.port=8089
-Dcom.sun.management.jmxremote.authenticate=false
-Dcom.sun.management.jmxremote.ssl=false

If using GraphDB with Tomcat, these parameters must be passed to Tomcat's JVM by setting either the JAVA_OPTS or CATALINA_OPTS environment variable, e.g.:

set JAVA_OPTS="-Dcom.sun.management.jmxremote.port=8089 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false"

For some Linux distributions, you can also edit the file /etc/default/tomcat6 and set JAVA_OPTS there.

Once GraphDB is loaded, use any compliant JMX client, e.g., jconsole, which is part of the Java Development Kit, to access the JMX interface on the configured port.
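Besides jconsole, the endpoint can also be reached programmatically with the standard JMX remote API; a minimal sketch, assuming the port 8089 configuration shown above:

import javax.management.MBeanServerConnection;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class JmxConnect {
    public static void main(String[] args) throws Exception {
        // Matches the JMX system properties above: port 8089, no authentication, no SSL
        JMXServiceURL url = new JMXServiceURL("service:jmx:rmi:///jndi/rmi://localhost:8089/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection mbeans = connector.getMBeanServerConnection();
            System.out.println("Registered MBeans: " + mbeans.getMBeanCount());
        } finally {
            connector.close();
        }
    }
}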

4.2.3 GraphDB configurator (spreadsheet)

You can find it in the /doc folder of the distribution.

4.3 Creating a Repository

4.3.1 Creating locations and repositories

Data locations group a set of repositories and expose them as Sesame endpoints. They can be initialised as a local file path or a remote server URL, which requires a valid Sesame endpoint. When a local file path is set, the current Java process initialises all repositories locally and they operate in the same memory address space. Each location has a SYSTEM repository containing meta-data about how to initialise the other repositories from the current location.

To create data locations and repositories, follow these steps:

1. Go to the GraphDB Workbench and navigate to Admin -> Locations and Repositories.

2. Choose Attach Location and enter a local file system path.


3. Click the Add button.

Tip: This creates the path where all GraphDB database binaries are created. Alternatively, connect to a remote location exposed via the Sesame API by supplying a valid URL endpoint.

4. Create a repository with the Create Repository button.

5. Enter the Repository ID (e.g., worker-node) and leave all other optional configuration settings with their default values.

Tip: For repositories with more than a few tens of millions of statements, see Configuring a Repository.

6. Set the newly created repository as the default repository for this location with the Connect button.

4.4 Configuring a Repository

GraphDB uses a Sesame configuration template for configuring its repositories. Sesame 2.0 keeps the repository configurations with their parameters, modelled in RDF, in the SYSTEM repository. Therefore, in order to create a new repository, Sesame needs such an RDF file to populate the SYSTEM repository.

The GraphDB repository configuration templates are simple .ttl files in the /templates folder of the GraphDB distribution. They can be used if you want to configure the repositories programmatically; otherwise, you can use the GraphDB Workbench.

Tip: For hints related to rule-sets and reasoning, see Rules Optimisations.

4.4.1 Steps

To configure a GraphDB repository, follow the steps:

1. Check the Sizing Guidelines section.

2. Check the Disk Space Requirements section.

3. Use the configuration spreadsheet from the /doc folder of your distribution to calculate what you need for your setup.

4. Check all GraphDB Configuration parameters - their descriptions as well as their default and allowed values.

5. Check the Configuring memory section.

6. Change the default values of the repository properties when creating the repository.


For creating repositories, see Creating a Repository.

4.4.2 A repository configuration template - how it works

The diagram below provides an illustration of an RDF graph that describes a repository configuration:

Often, it is helpful to ensure that a repository starts with a predefined set of RDF statements - usually one or more schema graphs. This is possible by using the owlim:imports property. After start up, these files are parsed and their contents are permanently added to the repository.

In short, the configuration is an RDF graph, where the root node is of rdf:type rep:Repository, and it must be connected through the rep:repositoryID property to a Literal that contains the human readable name of the repository. The root node must be connected via the rep:repositoryImpl property to a node that describes the configuration. The type of the repository is defined via the rep:repositoryType property and its value must be openrdf:SailRepository to allow for custom Sail implementations (such as GraphDB) to be used in Sesame 2.0. Then a node that specifies the Sail implementation to be instantiated must be connected through the sr:sailImpl property. To instantiate GraphDB, this last node must have a property sail:sailType with the value owlim:Sail - the Sesame framework will locate the correct SailFactory within the application classpath that will be used to instantiate the Java implementation class.

The namespaces corresponding to the prefixes used in the above paragraph are as follows:

rep: <http://www.openrdf.org/config/repository#>
sr: <http://www.openrdf.org/config/repository/sail#>
sail: <http://www.openrdf.org/config/sail#>
owlim: <http://www.ontotext.com/trree/owlim#>

All properties used to specify the GraphDB configuration parameters use the owlim: prefix, and the local names match up with the configuration parameters, e.g., the value of the ruleset parameter can be specified using the http://www.ontotext.com/trree/owlim#ruleset property.


4.4.3 Sample configuration

The following is an example configuration (in Turtle RDF format) of a Sesame 2 repository that uses a GraphDB Sail implementation:

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>.
@prefix rep: <http://www.openrdf.org/config/repository#>.
@prefix sr: <http://www.openrdf.org/config/repository/sail#>.
@prefix sail: <http://www.openrdf.org/config/sail#>.
@prefix owlim: <http://www.ontotext.com/trree/owlim#>.

[] a rep:Repository ;
   rep:repositoryID "owlim" ;
   rdfs:label "GraphDB Getting Started" ;
   rep:repositoryImpl [
      rep:repositoryType "graphdb:FreeSailRepository" ;
      sr:sailImpl [
         sail:sailType "graphdb:FreeSail" ;
         owlim:ruleset "owl-horst-optimized" ;
         owlim:base-URL "http://example.org/owlim#" ;
         owlim:imports "./ontology/my_ontology.rdf" ;
         owlim:defaultNS "http://www.my-organisation.org/ontology#" ;
         owlim:entity-index-size "5000000" ;
         owlim:cache-memory "4G" ;
         owlim:storage-folder "storage" ;
         owlim:repository-type "file-repository" ;
      ]
   ].

4.4.4 Configuration parameters

This is a list of all repository configuration parameters. Some of the parameters can be changed (effective after a restart), some cannot be changed (the change has no effect) and others must NOT be changed once a repository has been created, as doing so will likely lead to inconsistent data (e.g., unsupported inferred statements, missing inferred statements, or inferred statements that cannot be deleted).

repository-type (Cannot be changed)

Default value: file-repository
Possible values: file-repository, weighted-file-repository

storage-folder (Can be changed)

Description: Specifies the folder where the index files will be stored.
Default value: none

ruleset (Must NOT be changed)

Description: Sets of axiomatic triples, consistency checks and entailment rules, which determine the applied semantics.
Default value: owl-horst-optimized
Possible values: empty, rdfs, owl-horst, owl-max and owl2-rl, and their optimised counterparts rdfs-optimized, owl-horst-optimized, owl-max-optimized and owl2-rl-optimized. A custom ruleset is chosen by setting the path to its .pie rule file.

base-URL (Can be changed)

Description: Specifies the default namespace for the main persistence file. Non-empty namespaces are recommended, because their use guarantees the uniqueness of the anonymous nodes that may appear within the repository.
Default value: none

entity-index-size (Cannot be changed)

4.4. Configuring a Repository 27

Page 38: GraphDB Free Documentationgraphdb.ontotext.com/documentation/6.6/pdf/GraphDB-Free.pdfGraphDB Free Documentation ... general 1

GraphDB Free Documentation, Release 6.6

Description: Defines the number of entity hash table index entries. The bigger the size, the fewer the collisions in the hash table and the faster the entity retrieval. The entity hash table does not rehash, so its index size is constant throughout the life of the repository.
Default value: 1000000

cache-memory (Can be changed)

Description: Specifies the total amount of memory to be given to all types of cache.
Default value: <none>

check-for-inconsistencies (Can be changed)

Description: Turns the mechanism for consistency checking on and off; consistency checks are defined in the rule file and are applied at the end of every transaction, if this parameter is true. If an inconsistency is detected when committing a transaction, the whole transaction will be rolled back.
Default value: false

defaultNS (Cannot be changed)

Description: Default namespaces corresponding to each imported schema file, separated by semicolons. The number of namespaces must be equal to the number of schema files from the imports parameter.
Default value: <empty>
Example: owlim:defaultNS "http://www.w3.org/2002/07/owl#;http://example.org/owlim#".

Warning: This parameter cannot be set via a command line argument.

disable-sameAs (Must NOT be changed)

Description: Enables or disables the owl:sameAs optimisation.
Default value: false

enable-context-index (Can be changed)

Default value: false
Possible value: true, where GraphDB will build and use the context index/indices.

enable-literal-index (Can be changed)

Description: Enables or disables the use of the literal index. The literal index is always built as data is loaded/modified. This parameter only affects whether the index is used during query answering.
Default value: true

enable-optimization (Can be changed)

Description: Enables or disables query optimisation.
Default value: true

Warning: Disabling query optimisation is rarely needed - usually only for debugging purposes. Also, be aware that disabling query optimisation will also disable the correct behaviour of plugins (Full-text search, Geo-spatial extensions, RDF Rank, etc).

enablePredicateList (Can be changed)

Description: Enables or disables mappings from an entity (subject or object) to its predicates; switching this on can significantly speed up queries that use wildcard predicate patterns.
Default value: false

entity-id-size (Cannot be changed)

Description: Defines the bit size of internal IDs used to index entities (URIs, blank nodes and literals). In most cases, this parameter can be left at its default value. However, if very large datasets containing more than 2^32 entities are used, set this parameter to 40. Be aware that this can only be set when instantiating a new repository; converting an existing repository between 32- and 40-bit entity widths is not possible.
Default value: 32
Possible values: 32 and 40

imports (Cannot be changed)

Description: A list of schema files that will be imported at start up. All the statements found in these files will be loaded in the repository and will be treated as read-only. The serialisation format is determined by the file extension:

• .brf => BinaryRDF

• .n3 => N3

• .nq => N-Quads

• .nt => N-Triples

• .owl => RDF/XML

• .rdf => RDF/XML

• .rdfs => RDF/XML

• .trig => TriG

• .trix => TriX

• .ttl => Turtle

• .xml => TriX

Default value: none
Example: owlim:imports "./ont/owl.rdfs;./ont/ex.rdfs".

Tip: Schema files can be either a local path name, e.g., ./ontology/myfile.rdf, or a URL, e.g., http://www.w3.org/2002/07/owl.rdf. If this parameter is used, the default namespace for each imported schema file must be provided using the defaultNS parameter.

index-compression-ratio (Cannot be changed)

Description: The compression ratio of paged index files as a percentage of their uncompressed size. The value indicates how much smaller the compressed page should be, so a value of 25 (percent) will attempt to make the index files one quarter of their uncompressed size. Any page that cannot be compressed to this size will be stored uncompressed in a separate overlay file.
Default value: -1
Possible values: -1 (off) and the range [10-50]
Recommended value: 30

in-memory-literal-properties (Can be changed)

Description: Turns caching of the literal languages and data-types on and off. If the caching is on and the entity pool is restored from persistence, but there is no such cache available on disk, it is created after the entity pool initialisation.
Default value: false

predicate-memory (Can be changed)

Description: Specifies the amount of memory to be used for the predicate list cache.
Default value: 80m

query-limit-results (Can be changed)


Description: Sets the maximum number of results returned from a query, after which the evaluation of the query will be terminated; values less than or equal to zero mean no limit.
Default value: 0 (no limit)

query-timeout (Can be changed)

Description: Sets the number of seconds after which the evaluation of a query will be terminated; values less than or equal to zero mean no limit.
Default value: 0 (no limit)

read-only (Can be changed)

Description: In this mode, no modifications are allowed to the data or namespaces.
Default value: false
Possible value: true, which puts the repository into read-only mode.

throw-QueryEvaluationException-on-timeout (Can be changed)

Default value: false
Possible value: true; if set, a QueryEvaluationException is thrown when the duration of a query execution exceeds the time-out parameter.

transaction-mode (Can be changed)

Description: Specifies the transaction mode. In fast mode, dirty pages are written to disk in the laziest fashion possible, i.e., pages are only swapped when a new page is requested and there is no more memory available. No guarantees about data security are given when operating in this mode. So, in the event of an abnormal termination, the database must be considered corrupted and will need to be recreated from scratch.
Default value: safe; when set to safe, all updates are flushed to disk at the end of each transaction. Commit operations normally take a little longer, but recovery after an abnormal termination is instant. This mode also has much better concurrency characteristics.

transaction-isolation (Can be changed)

Description: This parameter only has an effect when transaction-mode=fast. In fast mode, updates lock the repository, preventing concurrent query answering.
Default value: true
Possible value: false; if set, concurrent queries are permitted with the loss of isolation.

tuple-index-memory (Can be changed)

Description: Specifies the amount of memory to be used for the statement storage cache.
Default value: 80m

useShutdownHooks (Can be changed)

Default value: true. If set, the method OwlimSchemaRepository.shutdown() is called when the JVM exits (running GraphDB under Tomcat requires this parameter to be true, otherwise it cannot be guaranteed that the shutdown() method will be called at all).

4.4.4.1 Debugging options

The option debug.level can be set through a system property to define the level of detail of the GraphDB output, used in QueryModelConverter, SailConnectionImpl and HashEntityPool. The extra logging information is written to the logger at DEBUG level, so in order to see this output, the logger properties must be set by adding an entry at the appropriate level. For example, when using GraphDB deployed with Tomcat on Ubuntu Linux, you will need to edit the file /usr/share/tomcat6/.aduna/openrdf-sesame/conf/logback.xml and add an entry after the <appender>...</appender> section. Exactly how to set the logger depends on which of the classes are being examined, as shown below:

• QueryModelConverter:

– Logger entry: <logger name="com.ontotext.trree.query" level="all"/>


– debug.level > 2: Outputs the query optimisation time.

– debug.level > 3: Outputs the query plan.

• SailConnectionImpl:

– Logger entry: <logger name="com.ontotext.trree.SailConnectionImpl" level="all"/>

– debug.level > 0: Outputs “Owlim evaluation strategy” or “Sesame evaluation strategy” when evaluating a query.

– debug.level > 2: ThreadPool outputs when a worker thread starts and stops.

• HashEntityPool:

– Logger entry: <logger name="com.ontotext.trree.entitypool.HashEntityPool" level="all"/>

– debug.level > 1: If the version number is less than the current one, outputs “Older Entity storage version found: X, recent one is: Y”.

The default value is 0 and it produces no extra debugging output.

4.4.5 Configuring memory

Configuring the memory used by GraphDB is the single most important factor for optimal performance - the more memory available, the better the performance. The available Java heap memory is used by:

• the JVM, the application and the GraphDB workspace (byte code, stacks, etc.);

• data structures for storing entities affected by specifying entity-index-size;

• data structures for indexing statements specified using cache-memory.

The following diagram offers a view of the memory use in GraphDB:

The challenge is how to divide up the available memory between the various GraphDB data structures in order to achieve the best overall behaviour.


4.4.5.1 Configuring the JVM, the application and the GraphDB workspace memory

Running in the embedded servlet container

Specify the maximum amount of heap space used by a JVM via the -Xmx virtual machine parameter.

Note: The value should be no higher than the amount of free memory available in the target system, multiplied by some factor to allow for extra runtime overhead (e.g., approximately ~90%).

For example, if a system has 16GB of RAM in total and 1GB is used by the operating system, services, etc., ideally, the JVM that hosts the application using GraphDB should have a maximum heap size of 15GB (16-1), which can be set using the JVM argument: -Xmx15g.

Running in a servlet container

If the GraphDB repository is hosted by the Sesame HTTP servlet, the maximum heap space applies to the servlet container (Tomcat).

Note: Allow some more heap memory for the runtime overhead, especially if running at the same time as other servlets.

Other options that may improve performance are:

• some configurations of the servlet container, e.g., increasing the permanent generation, which by default is 64MB;

• quadrupling it (for Tomcat) with -XX:MaxPermSize=256m.

For more information, see the Tomcat documentation.

4.4.5.2 Configuring the memory for storing entities

The memory required for storing entities is determined by the number of entities in the dataset, where the memory required is 4 bytes per slot, allocated by entity-index-size, plus 12 bytes for each stored entity.
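As a worked example with illustrative figures: an entity-index-size of 10,000,000 slots costs 10,000,000 x 4 bytes = 40 MB, and a dataset of 50 million entities adds 50,000,000 x 12 bytes = 600 MB, for roughly 640 MB of heap dedicated to entity storage.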

4.4.5.3 Configuring the Cache memory

Apart from the I/O buffers used for caching, GraphDB keeps in memory the indexes for the nodes in the RDF graph. This is a design decision in order to improve the overall performance of the repository. Each I/O buffer (page) is exactly 64KB and the indexing information per node in the graph is 12 bytes. So, depending on the dataset, memory requirements per repository may vary. To ease the calculation of the amount of Java heap memory required for a GraphDB repository, an Excel spreadsheet is included in the distribution - graphdb-configurator.xls.

The page cache is organised in two sets of buffers, read-only and dirty. Each page is first loaded into the read-only cache. When this gets full, a page (if dirty) is moved to the dirty cache, where it can later be written to the storage.

Cache memory distribution

There are several components in GraphDB that make use of caching (e.g., predicate list, tuple indices). In different situations, certain caches will need more memory than others. GraphDB allows for the configuration of both the total cache memory to be used by a repository and all the separate per-module caches.


Parameters

The following parameters control the amount of memory assigned to each of the different caches:

Parameter             Unit     Default    Description
cache-memory          bytes               The amount of memory to be distributed among the different caches.
tuple-index-memory    bytes    80M        Memory used for the PSO and POS caches.
predicate-memory      bytes    80M        Memory used for the predicate list cache.

Note: All parameters can be specified in bytes, kilobytes, megabytes or gigabytes by using a unit specifier at the end of the integer number. When no unit specifier is given, the value is interpreted as bytes; otherwise, use k or K for kilobytes, m or M for megabytes and g or G for gigabytes (everything base 2).

The memory required for the indices (cache types) depends on which indices are being used. The SPO and PSO indices are always used, while predicateLists and the context indices PCSO / PSOC are optional. The memory allocated to these cache types can be calculated automatically by GraphDB, but some of them can be specified in a more fine-grained way.

The following configuration parameters are relevant:

cache-memory = tuple-index-memory + predicate-memory

If cache-memory is explicitly configured and some of the other memory parameters are omitted, the missing values are resolved by uniformly distributing the remaining memory after all the explicitly configured memory parameters are subtracted.

For example, if cache-memory = 100M, predicate-memory = 10M and the other memory parameter is missing, then it is implicitly assigned (100M - 10M) = 90M.

If cache-memory is not specified, then all the missing memory parameters are assigned their default values.
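Applied to the repository template shown earlier in this section, an explicit cache configuration might look like this (illustrative values, chosen to satisfy the equation above):

owlim:cache-memory "1g" ;
owlim:tuple-index-memory "512m" ;
owlim:predicate-memory "512m" ;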

4.5 Sizing Guidelines

The following sizing guidelines provide a glimpse into what hardware and physical resource allocations are required at a high level. For a thorough analysis, please contact your GraphDB sales or support team to schedule a tuning analysis session.

4.5.1 Entry-level deployment

Low data volume & limited, simple queries:

• 1-50 users;

• Less than 25 simultaneous queries;

• Simple queries;

• Less than 100 million explicit statements (triples);

• Less than 18 GB of data;

• Offline (no real-time insert, update, delete and synchronisation);

• Inferencing off.

Recommendations:

• 1 master node (Intel Core i7);


• 4 cores (CPUs);

• 4-16 GB RAM;

• 50 GB disk space (SSD w/SATA).

4.5.2 Mid-range deployment

Low data volume/limited, complex queries:

• 50-100 users;

• 25 – 50 simultaneous queries;

• Simple & moderately complex queries;

• Between 100 million and 1 billion explicit statements (triples);

• Less than 175 GB of data;

• Online (moderate real-time insert, update, delete and synchronisation).

Recommendations:

• 2-4 master nodes (Intel Core i7);

• 8-16 cores (CPUs) per master;

• 8-32 GB RAM;

• 50-100 GB disk space per node (SSD w/SATA).

4.5.3 Enterprise deployment

High data volume/extensive, complex queries

• 100-500 users;

• 50-100 simultaneous queries;

• Simple & highly complex queries;

• Between 1 & 50 billion explicit statements (triples);

• Less than 8.5 terabytes;

• Online (persistent real-time insert, update, delete and synchronisation).

Recommendations:

• 4-8 master nodes (Intel Core i7);

• 16-32 cores (CPUs) per master;

• 32 + GB RAM;

• 100-200 GB disk space per node (SSD w/SATA).

4.6 Disk Space Requirements

4.6.1 GraphDB disk space requirements for loading a dataset

The disk space needed for loading a dataset depends on the reasoning complexity (the number of inferred triples), the length of the URIs, the additional indices used, etc. For example, the following table shows the required disk space in bytes per explicit statement when loading the Wordnet dataset with various GraphDB configurations:


Configuration                       Bytes per explicit statement
owl2-rl + all optional indices      366
owl2-rl                             236
owl-horst + all optional indices    290
owl-horst                           196
empty + all optional indices        240
empty                               171

When planning for storage capacity based on the input RDF file size, the required disk space depends not only on the GraphDB configuration, but also on the RDF file format used and the complexity of its contents. The following table gives a rough estimate of the expected expansion from an input RDF file to GraphDB storage requirements. E.g., when using OWL2-RL with all optional indices turned on, GraphDB needs about 6.7GB of storage space to load a one-gigabyte N3 file. With no inference ('empty') and no optional indices, GraphDB needs about 0.7GB of storage space to load a one-gigabyte TriX file. Again, these results were created with the Wordnet dataset:

Configuration                       N3     N-Triples    RDF/XML    TriG    TriX    Turtle
owl2-rl + all optional indices      6.7    2.2          4.8        6.6     1.5     6.7
owl2-rl                             4.3    1.4          3.1        4.2     1.0     4.3
owl-horst + all optional indices    5.3    1.7          3.8        5.2     1.2     5.3
owl-horst                           3.6    1.2          2.6        3.5     0.8     3.6
empty + all optional indices        4.4    1.4          3.1        4.3     1.0     4.4
empty                               3.1    1.0          2.2        3.1     0.7     3.1

4.6.2 GraphDB disk space requirements per statement

GraphDB computes inferences when new explicit statements are committed to the repository. The number of inferred statements can be zero, when using the 'empty' rule-set, or many multiples of the number of explicit statements (depending on the chosen rule-set and the complexity of the data).

The disk space required for each statement further depends on the size of the URIs and literals. Typical datasets with only the default indices require around 200 bytes per statement, and up to about 300 bytes when all optional indices are turned on.

So, when using the default indices, a good estimate for the amount of disk space you will need is 200 bytes per statement (explicit and inferred), i.e.:

• 1 million statements => ~200 Megabytes storage;

• 1 billion statements => ~200 Gigabytes storage;

• 10 billion statements => ~2 Terabytes storage.

4.7 Configuring the Entity Pool

The transactional property of the Entity Pool fixes many issues related to creating IDs. However, entities still need to be pre-processed and all other commit operations need to be performed (storing, inference, plugin handling, consistency checking, statement retraction on remove operations), including adding the entities to the permanent store. All these operations are time-consuming, so the new transactional Entity Pool would not be faster than the classic one.

The Entity Pool implementation can be selected by the entity-pool-implementation config parameter or the -D command line parameter with the same name. The valid values are:

classic

• the default implementation;

• recommended for large transactions and bulk loads;


• avoids the overhead of temporarily storing entities and the remapping from temporary to permanent IDs (which is performed in the transactional-simple implementation);

• when adding statements, the entities are added directly and cannot be rolled back.

transactional-simple

• all new entities are kept in memory - not recommended for large transactions (> 100M statements) in order to prevent OutOfMemoryErrors;

• good for a large number of small transactions;

transactional

• the recommended transactional implementation in the current version of GraphDB;

• for this version of GraphDB, it is the same as transactional-simple, but this may change in future versions.

4.8 Starting & Shutting down GraphDB

4.8.1 Starting GraphDB in the embedded Tomcat

Start the GraphDB database and Workbench interfaces in the embedded Tomcat server by executing the startup script located in the root directory:

startup.bat (Windows)
./startup.sh (Linux/Unix/Mac OS)

4.8.2 Shutting down GraphDB in the embedded Tomcat

Shut down the GraphDB database and Workbench interfaces in the embedded Tomcat server by executing the shutdown script located in the root directory:

shutdown.bat (Windows)
./shutdown.sh (Linux/Unix/Mac OS)

4.8.3 How to start/stop via non-embedded Tomcat (or other application server)

Before starting Tomcat, make sure the necessary .war applications are copied under the <TOMCAT>/webapps folder (here <TOMCAT> denotes the folder where Tomcat is installed). The memory available to Tomcat can be adjusted via the CATALINA_OPTS environment variable (the JVM's -Xmx parameter needs to be specified). The license for GraphDB can be specified by -Dowlim-license=/path/to/license (where -D is a JVM parameter).

Example (Windows):

1. Edit <TOMCAT>/bin/startup.bat.

2. Add “SET CATALINA_OPTS=-Xmx4g -Dowlim-license=/path/to/license” (without the quotes).

Example (Linux/Unix/Mac OS):

1. Edit <TOMCAT>/bin/startup.sh.

2. Add export CATALINA_OPTS='-Xmx4g -Dowlim-license=/path/to/license' (with the single quotes).

Startup:


cd <TOMCAT_HOME>/bin
startup.bat (Windows)
./startup.sh (Linux/Unix/Mac OS)

Shutdown:

cd <TOMCAT_HOME>/bin
shutdown.bat (Windows)
./shutdown.sh (Linux/Unix/Mac OS)

4.8.4 Shutting down a repository instance from the JMX interface

This operation can be done either programmatically or via a JMX console. The following MBean is used:

Package:       com.ontotext
MBean name:    OwlimRepositoryManager

Operation                     Description
shutdownRepositoryInstance    Shuts down the repository instance.

4.9 Managing Repositories

4.9.1 Changing repository parameters

Once a repository is created, it is possible to change some parameters, either by changing the SYSTEM repository or by overriding values by specifying the parameter values as JVM options. For instance, by passing the -Dcache-memory=1g option to the JVM, GraphDB will read it and use its value to override whatever was configured by the .ttl file. This is convenient for temporary set-ups that require easy and fast configuration change, e.g., for experimental purposes.

Changing the configuration in the SYSTEM repository is trickier, because the configurations are usually structured using blank node identifiers, which are always unique, so attempting to modify a statement with a blank node by using the same blank node identifier will fail. However, this can be achieved with SPARQL UPDATE using a DELETE-INSERT-WHERE command.
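A minimal sketch of such an update, assuming the owlim: namespace from Configuring a Repository and using cache-memory as the example parameter (adjust the predicate and value to your case):

PREFIX owlim: <http://www.ontotext.com/trree/owlim#>
DELETE { GRAPH ?g { ?sail owlim:cache-memory ?old } }
INSERT { GRAPH ?g { ?sail owlim:cache-memory "1g" } }
WHERE  { GRAPH ?g { ?sail owlim:cache-memory ?old } }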

A restart of Sesame/GraphDB is required. If deployed using Tomcat, the easiest way is just to restart Tomcat itself.

Note: Almost all GraphDB parameters can be changed both in the TTL configuration file (that populates the SYSTEM repository) and from the command line using the Java -D<param.name>=<value> command line option to set system properties. Both methods require a restart of the JVM or Tomcat.

Note: When a parameter is set simultaneously using both methods, the system property overrides the value in theconfiguration file. Some GraphDB parameters can only be set using system properties.

Warning: Some parameters cannot be changed after a repository has been created. These either have no effect (once the relevant data structures are built, their structure cannot be changed) or changing them will cause inconsistencies (these parameters affect the reasoner).

4.9.2 Renaming a repository

For an existing repository that has already been used:


1. Restart Tomcat to ensure that the repository is not loaded into memory (with locked/open files).

2. Select the SYSTEM repository.

3. Execute the following SPARQL update, with the appropriate old and new names substituted in the last two lines.

PREFIX sys: <http://www.openrdf.org/config/repository#>
DELETE { GRAPH ?g { ?repository sys:repositoryID ?old_name } }
INSERT { GRAPH ?g { ?repository sys:repositoryID ?new_name } }
WHERE {
   GRAPH ?g { ?repository a sys:Repository . }
   GRAPH ?g { ?repository sys:repositoryID ?old_name . }
   FILTER( ?old_name = "old_repository_name" ) .
   BIND( "new_repository_name" AS ?new_name ) .
}

4. Rename the folder for this repository in the file system.

• mv /home/<user>/.aduna/openrdf-sesame/repositories/OLD_NAME /home/<user>/.aduna/openrdf-sesame/repositories/NEW_NAME (Linux/Unix)

– The repositories location can be explicitly defined by using the parameter -Dinfo.aduna.platform.appdata.basedir=<repositories_location>

• The location for the ‘repositories’ folder under Windows varies depending on the version of Windows and the installation method for Tomcat - some possibilities are:

– C:\Users\<username>\AppData\Roaming\Aduna\repositories (Windows 7 when running Tomcat as a user)

– C:\Documents and Settings\LocalService\Application Data\Aduna\OpenRDF Sesame\repositories (Windows XP when running Tomcat as a service)

5. Reselect the SYSTEM repository and press F5.

Note: There is another consideration regarding the storage folder (http://www.ontotext.com/trree/owlim#storage-folder). If it is set to an absolute pathname, moving the repository requires an update of this parameter as well (with the new name).

4.10 Access Rights and Security

Controlling access to a GraphDB repository and assigning user accounts to roles with specified access permissions can be done using the GraphDB Workbench or with a combination of HTTP authentication and the Sesame server's deployment descriptor.

This allows you to:

• Create security constraints on operations (reading statements, writing statements, modifying named graphs, changing namespace definitions, etc.);

• Group security constraints in security roles;

• Manage user accounts and assign these to specific security roles.

4.10.1 Using the GraphDB Workbench

To manage your users' access, from the dropdown list Admin choose Users and Access. The page displays a list of users and the number of repositories they have access to. By default, security for the entire Workbench instance is disabled, which means that everyone has full access to repositories and admin functionality. To enable security, move the Security is off slider to on.

Here you can create new users, delete existing ones or edit their properties, including setting roles and read/write permissions for each repository. The password can also be changed here.

User roles:

• User - a user who can read and write according to his permissions for each repository;

• Admin - a user with full access, including creating, editing, deleting users.

Since GraphDB 6.4, repository permissions can be bound to a specific location only, or to all locations (“*” in the location list) to mimic the behaviour of pre-6.4 versions.

Login and default credentials:

If security is enabled, the first page you will see is the login page. The default administrator account information is:

Hint:

username: admin
password: root

It is highly recommended that you change the root password as soon as you log in for the first time. Click your username (admin) in the top right corner to change it.


4.10.2 Using HTTP authentication and the Sesame server's deployment descriptor

4.10.2.1 Operations

The Sesame HTTP server is a servlet-based Web application deployed to any standard servlet container (for the remainder of this section, it is assumed that Tomcat is being used).

The Sesame server exposes its functionality using a RESTful API that is an extension of the SPARQL protocol for RDF. This protocol defines exactly what operations can be achieved using specific URL patterns and HTTP methods (GET, POST, PUT, DELETE). Each combination of URL pattern and HTTP method can be associated with a set of user roles, thus giving very fine-grained control.

The following shows the URL patterns associated with groups of operations on a Sesame HTTP server.

<SESAME_URL>/protocol           : protocol version (GET)
            /repositories       : overview of available repositories (GET)
              /<REP_ID>         : query evaluation on a repository (GET/POST)
                /statements     : repository statements (GET/POST/PUT/DELETE)
                /contexts       : context overview (GET)
                /size           : count of statements in repository (GET)
                /namespaces     : overview of namespace definitions (GET/DELETE)
                  /<PREFIX>     : namespace-prefix definition (GET/PUT/DELETE)
                /rdf-graphs     : overview of named graphs (GET)
                  /<NAME>       : directly named graph (GET/POST/PUT/DELETE)
                  /service      : indirectly named graph (GET/POST/PUT/DELETE)

It is copied (with modifications) from chapter 4 of the online Sesame 2 system guide.

In general, read operations are effected using GET and write operations using PUT, POST and DELETE. The exception to this is that POST is allowed for SPARQL queries. This is for practical reasons, because some HTTP servers have limits on the length of the parameter values for GET requests.

4.10.2.2 Security constraints and roles

The association between operations and security roles is specified using security constraints in the Sesame server's deployment descriptor - a file called web.xml that can be found in the .../webapps/openrdf-sesame/WEB-INF directory. web.xml becomes available immediately after the installation without any security roles defined.

Warning: When redeployed, the Sesame HTTP server overwrites web.xml with the default version. Therefore, if you change it, make sure you create a backup. In particular, do not edit web.xml while Tomcat is running.

The deployment descriptor defines:

• authentication mechanism/configuration;

• security constraints in terms of operations (URL pattern plus HTTP method);

• security roles associated with security constraints.

To enable authentication, add the following XML element to web.xml inside the <web-app> element:

<login-config>
  <auth-method>BASIC</auth-method>
  <realm-name>Sesame</realm-name>
</login-config>

Security constraints associate operations (URL pattern plus HTTP method) with security roles. Both security constraints and security roles are nested in the <web-app> element.


A security constraint minimally consists of a collection of web resources (defined in terms of URL patterns and HTTP methods) and an authorisation constraint (the role name that has access to the resource collection). Some example security constraints are shown below:

<security-constraint>
  <web-resource-collection>
    <web-resource-name>SPARQL query access to the 'test' repository</web-resource-name>
    <url-pattern>/repositories/test</url-pattern>
    <http-method>GET</http-method>
    <http-method>POST</http-method>
  </web-resource-collection>
  <auth-constraint>
    <role-name>viewer</role-name>
    <role-name>editor</role-name>
  </auth-constraint>
</security-constraint>

<security-constraint>
  <web-resource-collection>
    <web-resource-name>Read access to 'test' repository's namespaces, size, contexts, etc</web-resource-name>
    <url-pattern>/repositories/test/*</url-pattern>
    <http-method>GET</http-method>
  </web-resource-collection>
  <auth-constraint>
    <role-name>viewer</role-name>
    <role-name>editor</role-name>
  </auth-constraint>
</security-constraint>

<security-constraint>
  <web-resource-collection>
    <web-resource-name>Write access</web-resource-name>
    <url-pattern>/repositories/test/*</url-pattern>
    <http-method>POST</http-method>
    <http-method>PUT</http-method>
    <http-method>DELETE</http-method>
  </web-resource-collection>
  <auth-constraint>
    <role-name>editor</role-name>
  </auth-constraint>
</security-constraint>

The ability to create and delete repositories requires access to the SYSTEM repository. An administrator security constraint for this looks like the following:

<security-constraint>
  <web-resource-collection>
    <web-resource-name>Administrator access to SYSTEM</web-resource-name>
    <url-pattern>/repositories/SYSTEM</url-pattern>
    <url-pattern>/repositories/SYSTEM/*</url-pattern>
    <http-method>GET</http-method>
    <http-method>POST</http-method>
    <http-method>PUT</http-method>
    <http-method>DELETE</http-method>
  </web-resource-collection>
  <auth-constraint>
    <role-name>administrator</role-name>
  </auth-constraint>
</security-constraint>

Also nested inside the <web-app> element are definitions of security roles. The format is shown by the example:


<security-role>
  <description>Read only access to repository data</description>
  <role-name>viewer</role-name>
</security-role>

<security-role>
  <description>Read/write access to repository data</description>
  <role-name>editor</role-name>
</security-role>

<security-role>
  <description>Full control over the repository, as well as creating/deleting repositories</description>
  <role-name>administrator</role-name>
</security-role>

4.10.2.3 User accounts

Tomcat has a number of ways to manage user accounts. The different techniques are called 'realms' and the default one is called 'UserDatabaseRealm'. This is the simplest one to manage, but also the least secure, because usernames and passwords are stored in plain text.

For the default security realm, usernames and passwords are stored in the file tomcat-users.xml in the Tomcat configuration directory, usually /etc/tomcat6/tomcat-users.xml on Linux systems. To add user accounts, add <user> elements inside the <tomcat-users> element, for example:

<user username="adam" password="secret" roles="viewer" />
<user username="eve" password="password" roles="viewer,editor,administrator" />

4.10.3 Programmatic authentication

To use a remote repository where authentication has been enabled, it is necessary to provide the username and password to the Sesame API. Remote repositories are usually accessed via the org.openrdf.repository.manager.RemoteRepositoryManager class. Tell the repository manager what the security credentials are using the following method:

void setUsernameAndPassword(String username, String password)

Alternatively, they can be passed in the factory method:

static RemoteRepositoryManager getInstance(String serverURL, String username, String password)
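A minimal sketch putting this together follows; the server URL, repository ID and credentials are placeholders (the credentials match the tomcat-users.xml example above):

import org.openrdf.repository.Repository;
import org.openrdf.repository.RepositoryConnection;
import org.openrdf.repository.manager.RemoteRepositoryManager;

// Point the manager at the remote Sesame server and supply the credentials
// before initialisation; subsequent requests will carry them.
RemoteRepositoryManager manager =
        new RemoteRepositoryManager("http://192.0.2.1:8080/openrdf-sesame");
manager.setUsernameAndPassword("eve", "password");
manager.initialize();

Repository repository = manager.getRepository("test");
RepositoryConnection connection = repository.getConnection();
// ... use the connection as usual ...
connection.close();
manager.shutDown();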

4.11 Backing up and Recovering a Repository

4.11.1 Backing up a repository

Several options are available:


Option 1: Using the GraphDB Workbench

Note: Best used for a small running system.

Export the database contents using the GraphDB Workbench. To preserve the contexts (named graphs) when exporting/importing the whole database, use a context-aware RDF file format, e.g., TriG.

1. Go to Data -> Export

2. Choose the files you want to export.

3. Click Export graph as TriG.

Option 2: Exporting the data of each repository

Note: Works without stopping GraphDB but it is very slow.

1. Export the data of each repository, while the database is running.

Note: All updates executed after the EXPORT has been started will not be included in the exported data (due to the READ COMMITTED transaction isolation in GraphDB).

2. Shut down the database (stop Tomcat) and delete the older GraphDB application(s) - the .war files and the expanded folder.

Option 3: Using the graph store protocol and curl

This can be achieved on the command line in a single step using the graph store protocol (change the repository URL and the name of the export file accordingly).

curl -X GET -H "Accept:application/x-trig" \
  "http://localhost:7200/repositories/test_repo/statements?infer=false" > export.trig

This method streams a snapshot of the database’s explicit statements into the export.trig file.

Option 4: Programmatically using the Sesame API.

Use the RepositoryConnection.exportStatements() method with the includeInferred flag set to false (in order not to serialise the inferred statements).

Example:


RepositoryConnection connection = repository.getConnection();
FileOutputStream outputStream = new FileOutputStream(new File("/tmp/test.txt"));
RDFWriter writer = Rio.createWriter(RDFFormat.NTRIPLES, outputStream);
connection.exportStatements(null, null, null, false, writer);
IOUtils.closeQuietly(outputStream);

Use the RepositoryConnection.getStatements() method with the includeInferred flag set to false (in order not to serialise the inferred statements).

Example:

java.io.OutputStream out = ...;
RDFWriter writer = Rio.createWriter(RDFFormat.NTRIPLES, out);
writer.startRDF();
RepositoryResult<Statement> statements =
    repositoryConnection.getStatements(null, null, null, false);
while (statements.hasNext()) {
    writer.handleStatement(statements.next());
}
statements.close();
writer.endRDF();
out.flush();

The returned iterator can be used to visit every explicit statement in the repository, and one of the Sesame RDF writer implementations can be used to output the statements in the chosen format. If the data will be re-imported, we recommend the N-Triples format, as it can easily be broken into large 'chunks' that can be inserted and committed separately.

Option 5: Copying GraphDB storage folders

Note: It is very fast but requires stopping GraphDB.

1. Stop GraphDB/Tomcat.

2. Manually copy the storage folders to the backup location.

/opt/tomcat/bin/shutdown.sh
sleep 10  # wait some time for tomcat to stop
cp -r ~/.aduna/openrdf-sesame/repositories/my-repo ~/my-backups/TODAY-DATE/

4.11.2 Restoring a repository

Several options are available:

Option 1: Importing data with preserved contexts in Sesame Workbench

Note: Best used for a small running system.

1. Go to Add.

2. Choose Data format: TriG.

3. Choose RDF Data File: e.g., export.trig.

4. Clear the context text field (it will have been set to the URL of the file). If this is not cleared, all the imported RDF statements will be given a context of file://export.trig or similar.


5. Upload.

You can also use the TriX format (an XML-based context-aware RDF serialisation).

Option 2: Importing data with preserved contexts in GraphDB Workbench

Note: Best used for a small running system.

See Loading data.

Option 3: Replacing the GraphDB storage directory (and any subdirectories)

Note: Use this option if it is possible to shut down the repository.

1. Replace the entire contents of the storage directory (and any subdirectories) with the backup.

2. Restart the repository.

3. Check the log file to ensure a successful start up.

4.12 Query and Transaction Monitoring

4.12.1 Query monitoring and termination

GraphDB provides statistics about executing queries or, more accurately, about query result iterators. This is done through the SailIterationMonitor MBean, one for each repository instance. Each bean instance is named after the storage directory of the repository it relates to.

Package: com.ontotext
MBean name: SailIterationMonitor

The SailIterationMonitor MBean has a single attribute, TrackRecords, which is an array of objects with the following attributes:

isRequestedToStop - indicates if the query has been requested to terminate early (see below)
msLifeTime - the lifetime of the iterator (in ms) between being created and reaching the CLOSED state
msSinceCreated - the time (in ms) since the iterator was created
nNext - the total number of invocations of next() for this iterator
nsAverageForOneNext - the average time spent for one (has)Next calculation (in nanoseconds), i.e., nsTotalSpentInNext / nNext
nsTotalSpentInNext - the cumulative time spent in (has)Next calculations (in nanoseconds)
state - the current state of the iterator; values are: ACTIVE, IN_NEXT, IN_HAS_NEXT, IN_REMOVE, IN_CLOSE, CLOSED
trackId - a unique ID for this iterator; if debug level is used to increase the detail of the GraphDB output, then this value is used to identify queries when logging the query execution plan and optimisation information


The collection of these objects grows for each executing or executed query; however, older objects in the CLOSED state expire and are removed from the collection as the query result iterators are garbage collected.

There is a single operation available with this MBean:

requestStop - Request that a query terminates early; parameter: the trackId of the query to stop.

This operation allows an administrator to request that a query terminates early.

To terminate a query when it is inconvenient or not possible for the client to do so (a JMX sketch follows the steps below):

• Execute requestStop with the trackId of the problem query.

– The isRequestedToStop attribute is set to true.

– The query terminates normally when hasNext() returns false.
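For illustration, a stop request can also be issued from code via the standard JMX API. The following is a minimal sketch; the JMX service URL, the ObjectName query pattern and the parameter type of requestStop are assumptions - check the actual bean name and operation signature in a JMX console such as jconsole:

import java.util.Set;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Connect to the JMX endpoint of the GraphDB/Tomcat process (placeholder URL).
JMXServiceURL url = new JMXServiceURL(
        "service:jmx:rmi:///jndi/rmi://localhost:8089/jmxrmi");
JMXConnector connector = JMXConnectorFactory.connect(url);
MBeanServerConnection mbeans = connector.getMBeanServerConnection();

// One SailIterationMonitor bean exists per repository instance; the
// ObjectName pattern below is an assumption.
Set<ObjectName> monitors = mbeans.queryNames(
        new ObjectName("com.ontotext:type=SailIterationMonitor,*"), null);
for (ObjectName monitor : monitors) {
    // Ask the query with the given trackId (a placeholder value here) to
    // terminate early.
    mbeans.invoke(monitor, "requestStop",
            new Object[] { "12345" }, new String[] { String.class.getName() });
}
connector.close();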

4.12.1.1 Preventing long running queries

• GraphDB provides a global query time-out period, set via the configuration parameter query-timeout;

• All queries will stop after this many seconds;

• The default value of 0 indicates no limit (a configuration sketch follows below).
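As an illustration, the parameter can be supplied when a repository is instantiated programmatically, in the same way as the other engine parameters shown in the Jena section later in this document; this is a minimal sketch and the 600-second value is only an example:

import java.io.File;
import com.ontotext.trree.OwlimSchemaRepository;
import org.openrdf.repository.sail.SailRepository;

OwlimSchemaRepository sail = new OwlimSchemaRepository();
sail.setDataDir(new File("./local-storage"));
sail.setParameter("ruleset", "rdfs");
// Terminate any query that runs longer than 600 seconds; 0 means no limit.
sail.setParameter("query-timeout", "600");

SailRepository repository = new SailRepository(sail);
repository.initialize();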

4.12.2 Terminating a transaction

It is possible to contrive situations where committing an update transaction can take an excessive amount of time. For example, when committing a 'chain' of many thousands of statements using some transitive property, the inferencer will attempt to materialise all possible combinations, leading to hundreds of millions of inferred statements. In such a situation, it is possible to force a commit operation to abort and roll back to the state the database had before the commit was attempted.

The following MBean is used:

Package: com.ontotext
MBean name: OwlimRepositoryManager


This MBean has no attributes and a single operation:

abortTransactionCommit - Request that the currently executing (lengthy) commit operation be terminated and rolled back.

4.13 Diagnosing and Reporting Critical Errors

In order to easily reproduce an issue, send us a well-formed bug report with logs, data and JVM reports.

4.13.1 Logs

The Sesame server and workbench use SLF4J for logging, although they each map to different underlying logging implementations. The Sesame server has bindings to logback, whereas the Sesame workbench has bindings to the Java Runtime logging framework. By default, the Sesame server logs to:

[ADUNA_DATA]/openrdf-sesame/logs/main.log

The default log level is INFO, but this can be adjusted after the first run by editing:

[ADUNA_DATA]/openrdf-sesame/conf/logback.xml

GraphDB also uses SLF4J for logging and the distribution includes the SLF4J-to-Java-logging mapping jar file. The repackaged openrdf-sesame war file that comes with GraphDB does not have this mapping, so in this case GraphDB logs, as Sesame does, to the logback framework.

Tomcat uses log4j and the logging details can be set up by editing this file (on Linux):

/var/lib/tomcat6/conf/logging.properties

or this file on Windows:

C:\Program Files\Apache Software Foundation\Tomcat 6.0\conf\logging.properties

By default, the Tomcat logs (on Linux) are written to:

/var/log/tomcat6/

and the logs (on Windows) are written to:

C:\Program Files\Apache Software Foundation\Tomcat 6.0\logs

4.13.2 Report script

Note: If the issue is related to the cluster, information should be provided for every worker and master.

Tip: If possible, make the snapshots for this while the issue is taking place.

report.sh is a bash script that accepts Tomcat's pid, automatically gathers most of the necessary information and produces a bug report. It can be found in the scripts folder in the GraphDB distribution.

To start the script run:

./report.sh


Hint:

Legend:
✓ Information gathered by the script.
✗ Information that is not covered by the script; you need to gather it separately.

A reproduced issue should contain as a minimum the following information:

✓ Logs from Tomcat (everything, including catalina.out, the access log, the localhost.$date logs).
✓ Logs from Aduna's home (on most Linux machines, this is located in $aduna_base/openrdf-sesame/logs).
✓ jstack <pid> from Tomcat.
✓ jmap -histo <pid> from Tomcat.
✓ jstat -gcutil <pid> 1000 100 - this makes 100 GC snapshots, one every second. Note that this is optional; a much better option is to provide GC logs.
✗ If possible, please start the JVM with the following GC log parameters:

-XX:NumberOfGCLogFiles=5 -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCCause -XX:+PrintGCDetails
-XX:-PrintTenuringDistribution -XX:+UseCompressedOops -XX:+UseGCLogFileRotation -Xloggc:<file>

then change the file destination to something that is writeable and provide these log files.
✓ tail -10000 from the syslog (on most machines, this is located in /var/log/syslog).
✓ If anything is touched in Tomcat's conf directory, it must be provided as is.
✗ If this is an out of memory issue, start java with -XX:+HeapDumpOnOutOfMemoryError and try to reproduce. Then provide the heap dump that will be generated by the JVM.
✗ If this is a cluster, please check that there is NTP on the machines (i.e., their time is synchronized).
✗ If the issue is reproduced on a client dataset, it will help us to have access to it.
✓ Output from bin/versions.sh in Tomcat.
✓ The GraphDB version: unzip webapps/openrdf-sesame/WEB-INF/lib/graphdb-runtime.jar -p release.properties.
✗ Logs from the client. We need this to check for errors from the Sesame client.

4.13.2.1 Requirements

• bash installed;

• run the script as the user that runs the Tomcat instance.

4.13.2.2 Example

Before running the script, you might need to give it executable permission: chmod +x report.sh

Without parameters

The simplest way is to run the script without parameters:

$ report.sh
The program accepts a single argument, tomcat's pid
Usage: ./report.sh <tomcat's-pid>


Real run

You need Tomcat's pid. On most setups, you can easily check it with jps:

$ jps | grep Bootstrap
32053 Bootstrap

You can see that Tomcat is running with pid 32053. Now run the report script:

$ ./report.sh 32053
Picked up _JAVA_OPTIONS: -Dawt.useSystemAAFontSettings=lcd -Dswing.aatext=true
Found tomcat home as /home/nikolavp/Downloads/apache-tomcat-7.0.55
Found aduna base as /home/nikolavp/.aduna
Did you change files in tomcat's conf directory?[y|n] (n): y
Getting tomcat version information
Getting graphdb version information
Collecting tomcat runtime parameters
Copying tomcat logs from /home/nikolavp/Downloads/apache-tomcat-7.0.55/logs
Copying aduna logs from /home/nikolavp/.aduna/openrdf-sesame/logs
Waiting for jstat to finish
You can find the collected data in 32053-data

First, the script asks whether this is a default Tomcat configuration: answering y to Did you change files in Tomcat's conf directory?[y|n] will copy the Tomcat config for you. It also tells you where to find the produced diagnostic files (You can find the collected data in 32053-data), which varies from pid to pid. When reporting an issue, just zip this directory and send it to us.


5 Usage

5.1 Workbench User Guide

5.1.1 Installation, start-up and shut down

To deploy to an application server:

1. Download your GraphDB Workbench distribution file and unzip it.

2. Copy the graphdb-workbench-free.war file to the servlet container’s webapps directory.

Note: We recommend using Tomcat (7.x or 8.x).

3. Use a URL similar to http://localhost:8080/graphdb-workbench-free.

Note: Refer to the documentation of your application server for more information on starting, stopping and deploying applications.

To run as a standalone server:

1. Download your GraphDB Workbench distribution file and unzip it.

2. Start the web application for accessing and administering the GraphDB Workbench by:

(a) Executing the startup script located in the root directory:

startup.bat (Windows)
./startup.sh (Linux/Unix/Mac OS)

or

(b) Running the line:

java -jar graphdb-tomcat.jar

The message below appears in your Terminal:

INFO: Starting ProtocolHandler ["http-bio-8080"]
...Opening web app in default browser


Note: These methods of deployment use an embedded Tomcat server, which deploys the .war files in the sesame_graphdb directory.

If the web application is located on the same machine, it opens at http://localhost:8080/ (assuming the default port number).

To shut down the server:

Press Ctrl+C in the console.

5.1.2 Checking your setup

After opening the Workbench URL, a summary page is displayed showing the versions of the various GraphDB components and license details. If you see this page, it means you installed and configured the Workbench correctly.

5.1.3 Using the Workbench

5.1.3.1 Managing locations

In order to create or see any repositories, you have to attach a location. Locations represent individual GraphDB servers and they can be local (hosted in an internal server within the Workbench) or remote (any other GraphDB installation). Locations can be attached, edited and detached from the Locations and Repositories manager accessible in the Admin menu. The steps to attach a location are:

1. Go to Locations and Repositories in the Admin menu;

2. Click Attach location;

3. Enter a location:

• For local locations, use the absolute path to a directory on the machine running the Workbench;

• For remote locations, the URL to the GraphDB web application, e.g., http://192.0.2.1:8080/openrdf-sesame.

– (Optionally) Specify credentials for the Sesame location (user and password);

– (Optionally) Add the JMX Connection parameters (host, port and credentials) - this allows you to monitor the resources on the remote location, do query monitoring and manage a GraphDB cluster.

Note: The JMX endpoint is configured by specifying a host and a port. The Workbench will construct a JMX URI of the kind service:jmx:rmi:///jndi/rmi://<host>:<port>/jmxrmi and the remote process has to be configured with compatible JMX settings. For example:

-Dcom.sun.management.jmxremote
-Dcom.sun.management.jmxremote.port=<port>
-Dcom.sun.management.jmxremote.local.only=false
-Dcom.sun.management.jmxremote.authenticate=false
-Dcom.sun.management.jmxremote.ssl=false
-Djava.rmi.server.hostname=<host>

You can attach multiple locations but only one can be active at a given time. The active location is always shown in the navigation bar next to a plug icon.


Note: If you use the Workbench as a SPARQL endpoint, all your queries will be sent to a repository in the currently active location. This works well if you make sure no one changes the active location. To have endpoints that are always accessible outside the Workbench, we recommend using standalone Workbench and Engine installations, connecting the Workbench to the Engine over a remote location, and using the Engine endpoints (i.e., not the ones provided by the Workbench) in any software that executes SPARQL queries.

5.1.3.2 Managing repositories

The repository management page is accessed by clicking Admin -> Locations and Repositories from the menu. This displays a list of available repositories and their locations, as well as the user's permissions for each repository.

Creating a repository

To create a new repository, click Create repository. This will display the configuration page for the new repository, where a new, unique ID has to be entered. The rest of the parameters are described in the Configuration parameters section of the GraphDB documentation.


Alternatively, you can use a .ttl file that specifies the repository type, ID and configuration parameters. Click the triangle at the edge of the Create repository button and choose File.

Editing a repository

The parameters you specify at repository creation time, such as cache memory, can be changed at any point. Click the edit icon next to a repository to edit it. Note that you have to restart the relevant GraphDB instance for the changes to take effect.


Deleting a repository

Click the bucket icon to delete a repository. Once a repository is deleted, all data contained in it is irrevocably lost.

Selecting a repository

You can connect to a repository by using the dropdown menu in the top right corner. This allows you to easily change the repository while running queries, as well as when importing and exporting data in other views.

Another way to connect to a repository in Locations and Repositories is by clicking the slider button next to a repository.


5.1.4 Loading data into a repository

There are four ways to import data into the currently selected repository. They can be accessed from the menu by clicking Data -> Import.

All import methods support asynchronous running of the import tasks, except for the text area import, which is intended for a very fast and simple import.

Note: Currently, only one import task of a type is executed at a time, while the others wait in the queue as pending.

Note: For local repositories, the parsing is done by the Workbench, so interruption and additional settings are supported. When the location is a remote one, the data is simply sent to the remote endpoint, and the parsing and loading are performed there.

A file name filter is available to narrow down the list if you have many files.

5.1.4.1 Import settings

The settings for each import are saved so that you can reuse them if you want to re-import a file. They are:

• Base URI - specifies the base URI against which to resolve any relative URIs found in the uploaded data (see the Sesame System documentation);

• Context - if specified, imports the data into the specific context;


• Chunk size - the number of statements to commit in one chunk. If a chunk fails, the import operation is interrupted and the imported statements are not rolled back. The default is no chunking; when there is no chunking, all statements are loaded in one transaction.

• Retry times - how many times to retry the commit if it fails.

• Preserve BNode IDs - assigns its own internal blank node identifiers or uses the blank node IDs found in the file.

5.1.4.2 Four ways to import data

Upload files and import

Note: The limitation of this method is that it supports files of a limited size. The default is 200MB and it is controlled by the graphdb.workbench.maxUploadSize property. The value is in bytes (-Dgraphdb.workbench.maxUploadSize=20971520).

Loading data from local files directly streams the file to Sesame's statements endpoint:

1. Click the icon to browse files for uploading;

2. When the files appear in the table, either import a file by clicking Import on its line or select multiple files and click Batch import;

3. The import settings modal will appear, just in case you want to add additional settings.


Import server files

The server files import allows you to load files of arbitrary sizes. Its limitation is that the files must be put (symbolic links are supported) in a specific directory. By default, it is ${user.home}/graphdb-import/.

If you want to tweak the directory location, see the graphdb.workbench.importDirectory system property. The directory will be scanned recursively and all files with a semantic MIME type will be visible in the Server files tab.

Import remote content

You can import from a URL with RDF data. Any endpoint that returns RDF data can be used.

If the URL has an extension, it is used to detect the correct data type (e.g., http://linkedlifedata.com/resource/umls-concept/C0024117.rdf).

Otherwise, you have to provide the Data Format parameter, which will be sent as an Accept header to the endpoint and then to the import loader.

You can also import data from a SPARQL CONSTRUCT query. Again, you have to provide the Data Format.

Paste and import

This is a very simple text import that sends the data to the Repository Statements Endpoint.


5.1.5 Executing queries

Access the SPARQL pages from the menu by clicking SPARQL. The GraphDB Workbench SPARQL view integrates the YASGUI query editor and has some additional features.

SPARQL SELECT and UPDATE queries are executed from the same view. The Workbench detects the query type and sends it to the correct Sesame endpoint. Some handy features are:

Editor

• A query area with syntax highlighting and namespace autocompletion - to add/remove namespaces, go to Data -> Namespaces;

• Query tabs - saved in your browser local storage, so you can keep them even when switching views.

Saved queries Click the save icon to save a query, or the folder icon to access existing saved queries. Saved queries are persisted on the server running the Workbench.

Include or exclude inferred statements A >>-like icon controls the inclusion of inferred statements. When both elements of the icon are the same shade of dark colour, inferred statements are included. When only the left element is dark and the right one is greyed out, only the explicit statements are included.

Pagination The SPARQL view can show only a limited number of results at once. Use pagination to navigate through all results. For SELECT queries, each page executes the query again with a query limit and offset. For graph queries (CONSTRUCT and DESCRIBE), all results are fetched by the server and only the page of interest is gathered from the results iterator and sent to the client.

Keyboard shortcuts Use Ctrl/Cmd+Enter to execute a query. You can find other useful shortcuts in the keyboard shortcuts link in the lower right corner of the SPARQL editor.

Creating links to queries Use the link icon to copy your query as a URL and share it. The link opens the query editor in a new tab with the query filled in. For a longer query, it might be more convenient to first save it and then get a link to the saved query by opening the saved queries list and clicking the respective link icon.

Downloading query results The Download As button allows you to download query results in your preferred format (JSON, XML, CSV, TSV and Binary RDF for SELECT queries, and all RDF formats for graph query results).

Various ways to view the results Query results are shown in a table on the same page. You can order the results by column values and filter by table values.

The results can be viewed in different formats according to the type of the query, and they can be used to create a Google Charts diagram.

Note: The query results are limited to 1000, since your browser cannot handle an infinite number of results. To obtain all results, use Download As and select the required format for the data.

You will see the total number of results and the query execution time in the query results header.

Note: The total number of results is obtained by an async request with a default-graph-uri parameter and the value http://www.ontotext.com/count.
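The same count request can be issued outside the Workbench. The following is a minimal sketch, assuming a plain SPARQL protocol GET against the repository endpoint; the server URL and repository ID are placeholders:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

// Send the query with the special default-graph-uri value so that the server
// returns the total number of results instead of the bindings themselves.
String query = URLEncoder.encode("SELECT ?s WHERE { ?s ?p ?o }", "UTF-8");
String count = URLEncoder.encode("http://www.ontotext.com/count", "UTF-8");
URL url = new URL("http://localhost:8080/openrdf-sesame/repositories/test"
        + "?query=" + query + "&default-graph-uri=" + count);

HttpURLConnection http = (HttpURLConnection) url.openConnection();
http.setRequestProperty("Accept", "application/sparql-results+json");
BufferedReader reader = new BufferedReader(
        new InputStreamReader(http.getInputStream(), "UTF-8"));
for (String line; (line = reader.readLine()) != null; ) {
    System.out.println(line);
}
reader.close();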


SPARQL editor options Since GraphDB 6.5, the Workbench introduces support for additional viewing/editing modes in the SPARQL editor.

Horizontal and vertical mode Use the vertical mode switch to show the editor and the results next to each other, which is particularly useful on wide screens. Click the switch again to return to horizontal mode.

Viewing results or editor only Both in horizontal and vertical mode, you can also hide the editor or the results to focus on query editing or result viewing. Click the buttons Editor only, Editor and results or Results only to switch between the different modes.


5.1.6 Exporting data

Data can be exported in several ways and formats.

Exporting entire repository or individual graphs

Click Data -> Export from the menu and decide whether you want to export the whole repository (in several different formats) or specific named graphs (in the same variety of formats). Click the appropriate format and the download will start.


Exporting query results

SPARQL query results can also be exported from the SPARQL results view by clicking Download As.

Exporting resources

From the resource description page, export the RDF triples that make up the resource description to JSON, JSON-LD, RDF-XML, N3/Turtle and N-Triples:


5.1.7 Viewing and editing resources

Viewing

To view a resource in the repository, go to Data -> View resource and enter the URI of a resource, or navigate to it by clicking the SPARQL results links. Even when the resource is missing, you will still be taken to the resource view, where you can create triples for it using the resource edit. Viewing resources provides an easy way to see triples where a given URI is the subject, predicate or object.

Create new resource


When ready, save the new resource to the repository.

Editing

Once you open a resource in View resource, you can also edit it. Click the edit icon next to the resource namespace and add, change or delete the properties of this resource.


Note: You cannot change or delete the inferred statements.

5.1.8 Namespace management

You can view and manipulate the RDF namespaces for the active repository from the view accessible through Data -> Namespaces. If you only have read access to the repository, you cannot add or delete namespaces but only view them.


5.1.9 Context view

A list of the contexts (graphs) in a repository can be seen in the Contexts view available through Data -> Contexts. You can use it for the following tasks:

• to see a reference of the available contexts in a repository (use the filter to narrow down the list if you have many contexts);

• to inspect triples in a context by clicking it;

• to drop a context by clicking the bucket icon.


5.1.10 Connector management

The Connector manager lets you create, view and delete GraphDB Connector instances. It provides a handy form-based editor for Connector configurations. Click Data -> Connector management to access it.

Creating connectors

To create a new Connector configuration, click the New Connector button in the tab of the respective Connector type you want to create. Once you fill in the configuration form, you can either execute the CREATE statement from the form by clicking OK, or only view it by clicking View SPARQL Query. If you view the query, you can also copy it to execute manually or integrate it in automation scripts.

Viewing connectors

Existing Connector instances are shown under Existing connectors (below the New Connector button). Click the name of an instance to view its configuration and SPARQL query, or click the repair and delete icons to perform those operations.


5.1.11 Users and access management

Note: User and access checks are disabled by default. If you want to enable them, go to Admin -> Users and Access and click the Security slider above the user table.

Users and access management is under Admin -> Users and Access from the menu. The page displays a list of users and the number of repositories they have access to. It is also possible to disable the security for the entire GraphDB Workbench instance by clicking Disable/Enable. When security is disabled, everyone has full access to the repositories and the admin functionality.


From here, you can create new users, delete existing users or edit user properties, including setting their role and the read/write permission for each repository. The password can also be changed here.

User roles:

• User - a user who can read and write according to his permissions for each repository;

• Admin - a user with full access, including creating, editing, deleting users.

Since GraphDB 6.4, repository permissions can be bound to a specific location only, or to all locations ("*" in the location list) to mimic the behaviour of pre-6.4 versions.

Login and default credentials:

If security is enabled, the first page you will see is the login page.

Note:

The default administrator account information is:
username: admin
password: root


It is highly recommended that you change the root password as soon as you log in for the first time. Click your username (admin) in the top right corner to change it.

5.1.11.1 Free access

Free access is a new feature since GraphDB 6.5. It allows people to access a predefined set of functionality without having to log in. This is especially useful for providing read-only access to a repository.

You can enable free access by going to Admin -> Users and Access and clicking the Free Access slider above the user table. When you enable free access, a dialog box will open and prompt you to select the access rights for free access users. The available permissions are similar to those for authenticated users, e.g., you can provide read or read/write access to one or more repositories.

Note: To use free access, you must have security enabled. The setting will not show if security is disabled.

5.1.12 REST API

The Workbench in GraphDB 6.4 introduces the Workbench REST API. It can be used to automate various tasks without having to resort to opening the Workbench in a browser and doing them manually.

The REST API calls fall into the following major categories:


Security management

Use the security management API to add, edit or remove users, and thus integrate Workbench security into an existing system.

Location management

Use the location management API to attach, activate, edit or detach locations.

Repository management

Use the repository management API to add, edit or remove repositories in any attached location. Unlike the Sesame API, you can work with multiple remote locations from a single access point. When combined with the location management, it can be used to automate the creation of multiple repositories across your network.

Data import

Use the data import API to import data into GraphDB. You can choose between server files and a remote URL.

Saved queries

Use the saved queries API to create, edit or remove saved queries. It is a convenient way to automate the creation of saved queries that are important to your project.

You can find more information about each REST API in Admin -> REST API Documentation, as well as execute the calls directly from there and see the results.


5.1.13 Configuration properties

In addition to the standard GraphDB command line parameters, the GraphDB Workbench can be controlled with the following parameters. They should be of the form -Dparam=value.


graphdb.workbench.cors.enable (deprecated: app.cors.enable)
    Enables cross-origin resource sharing. Default: false

graphdb.workbench.maxConnections (deprecated: app.maxConnections)
    Sets the maximum number of concurrent connections to a GraphDB instance. Default: 200

graphdb.workbench.datadir (deprecated: app.datadir)
    Sets the directory where the workbench persistence data will be stored. Default: ${user.home}/.graphdb-workbench/

graphdb.workbench.importDirectory (deprecated: impex.dir)
    Changes the location of the file import folder. Default: ${user.home}/graphdb-import/

graphdb.workbench.maxUploadSize (deprecated: app.maxUploadSize)
    Sets the maximum upload size for importing local files. The value must be in bytes. Default: 200 MB

resource.language
    Sets the default language in which to filter results displayed in the resource exploration. Default: en (English)

resource.limit
    Sets the limit for the number of statements displayed in the resource view page. Default: 100

5.2 Using GraphDB with the Sesame API

This section describes how to use the Sesame API to create and access GraphDB repositories, both on the local file system and remotely via the Sesame HTTP server, and provides a brief introduction to the Sesame workbench Web application, which provides many repository management functions through a convenient user interface.

Sesame comprises a large collection of libraries, utilities and APIs. The important components for this section are:

• the Sesame classes and interfaces (API), which provide uniform access to the SAIL components from multiple vendors/publishers;

• the Sesame server and workbench web application.

5.2.1 Sesame Application Programming Interface (API)

Programmatically, GraphDB can be used via the Sesame Java framework of classes and interfaces. Documentation for these interfaces (including Javadoc) can be found at http://www.rdf4j.org. Code snippets in the following sections are taken from (or are variations of) the getting-started program, which comes with the GraphDB distribution.

5.2.1.1 Using the Sesame API to access a local GraphDB repository

With Sesame 2, repository configurations are represented as RDF graphs. A particular repository configuration is described as a resource, possibly a blank node, of type http://www.openrdf.org/config/repository#Repository. This resource has an id, a label and an implementation, which in turn has a type, SAIL type, etc. The example repository configuration from the getting-started example program looks like this:

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>.
@prefix rep: <http://www.openrdf.org/config/repository#>.
@prefix sr: <http://www.openrdf.org/config/repository/sail#>.
@prefix sail: <http://www.openrdf.org/config/sail#>.


@prefix owlim: <http://www.ontotext.com/trree/owlim#>.

[] a rep:Repository ;
   rep:repositoryID "owlim" ;
   rdfs:label "GraphDB Getting Started" ;
   rep:repositoryImpl [
      rep:repositoryType "openrdf:SailRepository" ;
      sr:sailImpl [
         sail:sailType "owlim:Sail" ;
         owlim:ruleset "owl-horst-optimized" ;
         owlim:storage-folder "storage" ;
         owlim:base-URL "http://example.org/owlim#" ;
         owlim:repository-type "file-repository" ;
         owlim:imports "./ontology/owl.rdfs" ;
         owlim:defaultNS "http://example.org/owlim#"
      ]
   ].

The Java code that uses the configuration to instantiate a repository and get a connection to it is as follows:

Graph graph = ...;
Resource repositoryNode = ...;

// Create a manager for local repositories
RepositoryManager repositoryManager = new LocalRepositoryManager(new File("."));
repositoryManager.initialize();

// Create a configuration object from the configuration graph
// and add it to the repositoryManager
RepositoryConfig repositoryConfig = RepositoryConfig.create(graph, repositoryNode);
repositoryManager.addRepositoryConfig(repositoryConfig);

// Get the repository to use
Repository repository = repositoryManager.getRepository("owlim");

// Open a connection to this repository
RepositoryConnection repositoryConnection = repository.getConnection();

// ... use the repository

// Shutdown connection, repository and manager
repositoryConnection.close();
repository.shutDown();
repositoryManager.shutDown();

The procedure is as follows:

1. Instantiate a local repository manager with the data directory to use for the repository storage files (repositories store their data in their own subdirectory from here).

2. Add a repository configuration for the desired repository type to the manager.

3. ‘Get’ the repository and open a connection to it.

From then on, most activities will use the connection object to interact with the repository, e.g., executing queries, adding statements, committing transactions, counting statements, etc. See the getting-started application for examples, and the short sketch below.
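A minimal sketch of such typical connection usage follows; it continues from the code above, assumes a Sesame 2.7-style transaction API, and the example URIs are placeholders:

import org.openrdf.model.URI;
import org.openrdf.model.ValueFactory;
import org.openrdf.query.QueryLanguage;
import org.openrdf.query.TupleQuery;
import org.openrdf.query.TupleQueryResult;

// Build an example statement with the repository's value factory.
ValueFactory factory = repository.getValueFactory();
URI subject = factory.createURI("http://example.org/owlim#Pilot");
URI type = factory.createURI("http://www.w3.org/1999/02/22-rdf-syntax-ns#type");
URI object = factory.createURI("http://example.org/owlim#Person");

// Add the statement inside an explicit transaction.
repositoryConnection.begin();
repositoryConnection.add(subject, type, object);
repositoryConnection.commit();

// Count the statements in the repository.
System.out.println("Statements: " + repositoryConnection.size());

// Evaluate a SPARQL query and iterate over the results.
TupleQuery tupleQuery = repositoryConnection.prepareTupleQuery(
        QueryLanguage.SPARQL, "SELECT ?s ?o WHERE { ?s a ?o }");
TupleQueryResult result = tupleQuery.evaluate();
while (result.hasNext()) {
    System.out.println(result.next());
}
result.close();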

5.2.1.2 Using the Sesame API to access a remote GraphDB repository

The Sesame server is a Web application that allows interaction with repositories using the HTTP protocol. It runs in a JEE compliant servlet container, e.g., Tomcat, and allows client applications to interact with repositories located on remote machines. In order to connect to and use a remote repository, you have to replace the local repository manager with a remote one. The URL of the Sesame server must be provided, but no repository configuration is needed if the repository already exists on the server. The following lines can be added to the getting-started example program, although a correct URL must be specified:

RepositoryManager repositoryManager =
    new RemoteRepositoryManager("http://192.168.1.25:8080/openrdf-sesame");
repositoryManager.initialize();

The rest of the example program should work as expected, although the following library files must be added to the class-path:

• commons-httpclient-3.1.jar

• commons-logging-1.1.1.jar

• commons-codec-1.3.jar

5.2.2 Managing repositories with the Sesame Workbench

1. Deploy the Sesame server and Workbench applications in a Tomcat instance.

2. Copy the necessary library files (GraphDB, Lucene, etc.) to the correct locations.

3. Connect to the Sesame server by using the Sesame console application with the owlim.ttl repository template file.

4. Create a repository instance.

When the Sesame server is running, it shows a welcome message at the following URL:

http://<hostname|ip_address>:8080/openrdf-sesame

The server has a simple user interface that shows status, logging and configuration information. The workbench application, however, provides most repository management functions and is available from the following URL:

http://<hostname|ip_address>:8080/openrdf-workbench

The workbench lists repositories and their namespaces, allows for the addition and deletion of statements, and provides a query interface for the SPARQL and SeRQL query languages.

5.2.3 SPARQL endpoint

The Sesame HTTP server is a fully fledged SPARQL endpoint - the Sesame HTTP protocol is a superset of the SPARQL 1.1 protocol. It provides an interface for transmitting SPARQL queries and updates to a SPARQL processing service and returning the results via HTTP to the entity that requested them.

Any tools or utilities designed to interoperate with the SPARQL protocol will function with GraphDB when deployed using the Sesame HTTP server, i.e., the openrdf-sesame Web application.
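For example, a remote repository can also be queried through the Sesame client API; the following is a minimal sketch, with the server URL and repository ID as placeholders:

import org.openrdf.query.QueryLanguage;
import org.openrdf.query.TupleQueryResult;
import org.openrdf.repository.RepositoryConnection;
import org.openrdf.repository.http.HTTPRepository;

// Point the client at the openrdf-sesame web application and a repository ID.
HTTPRepository endpoint = new HTTPRepository(
        "http://localhost:8080/openrdf-sesame", "test");
endpoint.initialize();

RepositoryConnection connection = endpoint.getConnection();
TupleQueryResult result = connection.prepareTupleQuery(
        QueryLanguage.SPARQL, "SELECT * WHERE { ?s ?p ?o } LIMIT 10").evaluate();
while (result.hasNext()) {
    System.out.println(result.next());
}
result.close();
connection.close();
endpoint.shutDown();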

5.2.4 Graph Store HTTP Protocol

The Graph Store HTTP Protocol is fully supported for direct and indirect graph names. The SPARQL 1.1 Graph Store HTTP Protocol specification has the most details, although further information can be found in the Sesame user guide.

This protocol supports the management of RDF statements in named graphs in the REST style, by providing the ability to get, delete, add to or overwrite statements in named graphs using the basic HTTP methods.
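As an illustration, the sketch below overwrites an indirectly named graph over plain HTTP, following the URL patterns listed in the security section above; the server URL, repository ID and graph URI are placeholders:

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

// PUT overwrites the graph; POST would add to it and DELETE would remove it.
String graph = URLEncoder.encode("http://example.org/graph1", "UTF-8");
URL url = new URL("http://localhost:8080/openrdf-sesame/repositories/test"
        + "/rdf-graphs/service?graph=" + graph);

HttpURLConnection http = (HttpURLConnection) url.openConnection();
http.setDoOutput(true);
http.setRequestMethod("PUT");
http.setRequestProperty("Content-Type", "text/turtle");
OutputStream body = http.getOutputStream();
body.write("<http://example.org/s> a <http://example.org/Type> .".getBytes("UTF-8"));
body.close();
System.out.println("Response code: " + http.getResponseCode());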


5.3 Using GraphDB with Jena

GraphDB can also be used with the Jena framework, which is achieved with a customised Jena/Sesame/GraphDB adapter component.

Jena is a Java framework for building Semantic Web applications. It provides a programmatic environment for RDF, RDFS, OWL and SPARQL and includes a rule-based inference engine. Access to GraphDB via the Jena framework is achieved with a special adapter, which is essentially an implementation of the Jena ARQ interface that provides access to individual triples managed by a GraphDB repository through the Sesame API interfaces.

Note: The GraphDB-specific Jena adapter can only be used with 'local' repositories, i.e., not 'remote' repositories that are accessed using the Sesame HTTP protocol. If you want to use GraphDB remotely, consider using the Joseki server as described below.

5.3.1 Installing GraphDB with Jena

Required software

• Jena version 2.7 (tested with version 2.7.3)

• ARQ (tested with version 2.9.3)

Description of the GraphDB Jena adapter

The GraphDB Jena adapter is essentially an implementation of the Jena DatasetGraph interface that provides access to individual triples managed by a GraphDB repository through the Sesame API interfaces.

It is not a general purpose Sesame adapter and cannot be used to access any Sesame compatible repository, because it utilises an internal GraphDB API to provide more efficient methods for processing RDF data and evaluating queries.

The adapter comes with its own implementation of the Jena 'assembler' factory to make it easier to instantiate and use with the related parts of the Jena framework, although you can instantiate an adapter directly by providing an instance of a Sesame SailRepository (a GraphDB GraphDBRepository implementation). Query evaluation is controlled by the ARQ engine, but specific parts of a query (mostly batches of statement patterns) are evaluated natively through a modified StageGenerator plugged into the Jena runtime framework for efficiency. This also avoids unnecessary cross-API data transformations during query evaluation.

Instantiate Jena adapter using a SailRepository

In this approach, a GraphDB repository is first created and wrapped in a Sesame SailRepository. Then a connection to it is used to instantiate the adapter class SesameDataset. The following example helps to clarify:

import com.ontotext.trree.OwlimSchemaRepository;
import org.openrdf.repository.sail.SailRepository;
import org.openrdf.repository.RepositoryConnection;
import com.ontotext.jena.SesameDataset;

...

OwlimSchemaRepository schema = new OwlimSchemaRepository();

// set the data folder where GraphDB will persist its data
schema.setDataDir(new File("./local-storage"));

// configure GraphDB with some parameters


schema.setParameter("storage-folder", "./");
schema.setParameter("repository-type", "file-repository");
schema.setParameter("ruleset", "rdfs");

// wrap it into a Sesame SailRepository
SailRepository repository = new SailRepository(schema);

// initialize
repository.initialize();
RepositoryConnection connection = repository.getConnection();

// finally, create the DatasetGraph instance
SesameDataset dataset = new SesameDataset(connection);

From now on, the SesameDataset object can be used through the Jena API as a regular dataset, e.g., to add some data to it, you could do something like the following:

Model model = ModelFactory.createModelForGraph(dataset.getDefaultGraph());
Resource r1 = model.createResource("http://example.org/book#1");
Resource r2 = model.createResource("http://example.org/book#2");

r1.addProperty(DC.title, "SPARQL - the book")
  .addProperty(DC.description, "A book about SPARQL");

r2.addProperty(DC.title, "Advanced techniques for SPARQL");

It can also be used to evaluate queries through the ARQ engine:

// Query string.
String queryString = "PREFIX dc: <" + DC.getURI() + "> " +
        "SELECT ?title WHERE {?x dc:title ?title . }";

Query query = QueryFactory.create(queryString);

// Create a single execution of this query, apply to a model
// which is wrapped up as a QueryExecution and then fetch the results
QueryExecution qexec = QueryExecutionFactory.create(query, dataset.asDataset());
try {
    // Assumption: it's a SELECT query.
    ResultSet rs = qexec.execSelect();
    // The order of results is undefined.
    for (; rs.hasNext();) {
        QuerySolution rb = rs.nextSolution();
        for (Iterator<String> iter = rb.varNames(); iter.hasNext();) {
            String name = iter.next();
            RDFNode x = rb.get(name);
            if (x.isLiteral()) {
                Literal titleStr = (Literal) x;
                System.out.print(name + "=" + titleStr + "\t");
            } else if (x.isURIResource()) {
                Resource res = (Resource) x;
                System.out.print(name + "=" + res.getURI() + "\t");
            } else {
                System.out.print(name + "=" + x.toString() + "\t");
            }
        }
        System.out.println();
    }
} catch (Exception e) {
    System.out.println("Exception occurred: " + e);
}

5.3. Using GraphDB with Jena 77

Page 88: GraphDB Free Documentationgraphdb.ontotext.com/documentation/6.6/pdf/GraphDB-Free.pdfGraphDB Free Documentation ... general 1

GraphDB Free Documentation, Release 6.6

finally {// QueryExecution objects should be closed to free any system resourcesqexec.close();

}

Instantiate GraphDB adapter using the provided Assembler

Another approach is to use the Jena assemblers infrastructure to instantiate a GraphDB Jena adapter. For this purpose, the required configuration must be stored in some valid RDF serialisation format and its contents read in a Jena model. Then, the assembler can be invoked to get an instance of the Jena adapter. The following example specifies an adapter instance in N3 format.

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix ja: <http://jena.hpl.hp.com/2005/11/Assembler#> .
@prefix otjena: <http://www.ontotext.com/jena/> .
@prefix : <#> .

[] ja:loadClass "com.ontotext.jena.SesameVocab" .
otjena:SesameDataset rdfs:subClassOf ja:Object .
otjena:SesameDataset ja:assembler "com.ontotext.jena.SesameAssembler" .
<#dataset> rdf:type otjena:SesameDataset ;
    otjena:datasetParam "./location" .

The ja:loadClass statements ensure that the GraphDB Jena adapter factory class file(s) are initialised and plugged into the Jena framework prior to being invoked. Then, the <#dataset> description tells the Jena framework to expect instances of otjena:SesameDataset to be created by this factory. The following example uses such a description stored in the file owlimbridge.n3 to get an instance of the Jena adapter:

Model spec = FileManager.get().loadModel("owlimbridge.n3");
Resource root = spec.createResource(spec.expandPrefix(":dataset"));
DataSource datasource = (DataSource) Assembler.general.open(root);
DatasetGraph dataset = datasource.asDatasetGraph();

After this, the adapter is ready to be used, for example, to evaluate some queries through the ARQ engine using the same approach as above.
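For instance, here is a minimal sketch (assuming the datasource object from the snippet above; DataSource extends Dataset in this Jena version, so it can be passed to QueryExecutionFactory directly):

// Evaluate a simple SELECT query against the assembled dataset
Query query = QueryFactory.create("SELECT * WHERE { ?s ?p ?o } LIMIT 10");
QueryExecution qexec = QueryExecutionFactory.create(query, datasource);
try {
    ResultSet results = qexec.execSelect();
    while (results.hasNext()) {
        // Each solution binds ?s, ?p and ?o to one statement in the repository
        System.out.println(results.nextSolution());
    }
} finally {
    // Always close the execution to free system resources
    qexec.close();
}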

Using GraphDB with the Joseki server

To use a GraphDB repository with the Joseki server, you only need to configure it as a dataset, so that the Jena assembler framework is able to instantiate it. An example Joseki configuration file that makes use of such a dataset description could look like the following. First, a service that hosts the dataset is described:

<#service1>
    rdf:type joseki:Service ;
    rdfs:label "service point" ;
    joseki:dataset otjena:bridge ;
    joseki:serviceRef "sparql" ;
    joseki:processor joseki:ProcessorSPARQL ;
    .

Then, the dataset is described:

[] ja:loadClass "com.ontotext.jena.SesameVocab" .
otjena:DatasetSesame rdfs:subClassOf ja:RDFDataset .
otjena:bridge rdf:type otjena:DatasetSesame ;
    rdfs:label "GraphDB repository" ;
    otjena:datasetParam "./location" .


If a repositoryConnection is obtained (as in the example in the Sesame section above), the Jena adapter can be used as follows:

import com.ontotext.jena.SesameDataset;

// Create the DatasetGraph instance
SesameDataset dataset = new SesameDataset(repositoryConnection);

From now on the SesameDataset object can be used through the Jena API as a regular dataset, e.g., to add some data to it, you could do something like the following:

Model model = ModelFactory.createModelForGraph(dataset.getDefaultGraph());
Resource r1 = model.createResource("http://example.org/book#1");
Resource r2 = model.createResource("http://example.org/book#2");
r1.addProperty(DC.title, "SPARQL - the book")
    .addProperty(DC.description, "A book about SPARQL");
r2.addProperty(DC.title, "Advanced techniques for SPARQL");

When GraphDB is used through Jena, its performance is quite similar to using it through the Sesame APIs. For most scenarios and tasks, GraphDB can deliver considerable performance improvements when used as a replacement for Jena's own native RDF backend, TDB.

5.4 GraphDB Connectors

5.4.1 Lucene GraphDB Connector

5.4.1.1 Overview and features

The GraphDB Connectors provide extremely fast normal and faceted (aggregation) searches of the kind typically implemented by an external component or service such as Lucene, with the additional benefit of staying automatically up-to-date with the GraphDB repository data.

The Connectors provide synchronisation at the entity level, where an entity is defined as having a unique identifier (a URI) and a set of properties and property values. In terms of RDF, this corresponds to a set of triples that have the same subject. In addition to simple properties (defined by a single triple), the Connectors support property chains. A property chain is defined as a sequence of triples where each triple's object is the subject of the following triple.

The main features of the GraphDB Connectors are:

• maintaining an index that is always in sync with the data stored in GraphDB;

• multiple independent instances per repository;

• the entities for synchronisation are defined by:

– a list of fields (on the Lucene side) and property chains (on the GraphDB side) whose values will be synchronised;

– a list of rdf:type's of the entities for synchronisation;

– a list of languages for synchronisation (the default is all languages);

– additional filtering by property and value.

• full-text search using native Lucene queries;

• snippet extraction: highlighting of search terms in the search result;

• faceted search;

• sorting by any preconfigured field;


• paging of results using offset and limit;

• custom mapping of RDF types to Lucene types;

• specifying which Lucene analyzer to use (the default is Lucene’s StandardAnalyzer);

• stripping HTML/XML tags in literals (the default is not to strip markup);

• boosting an entity by the numeric value of one or more predicates;

• custom scoring expressions at query time to evaluate score based on Lucene score and entity boost.

Each feature is described in detail below.

5.4.1.2 Usage

All interactions with the Lucene GraphDB Connector are done through SPARQL queries.

There are three types of SPARQL queries:

• INSERT for creating and deleting connector instances;

• SELECT for listing connector instances and querying their configuration parameters;

• INSERT/SELECT for storing and querying data as part of the normal GraphDB data workflow.

In general, this follows the regular SPARQL semantics: INSERT adds or modifies data, and SELECT queries existing data.

Each connector implementation defines its own URI prefix to distinguish it from other connectors. For the Lucene GraphDB Connector, this is http://www.ontotext.com/connectors/lucene#. Each command or predicate executed by the connector uses this prefix, e.g., http://www.ontotext.com/connectors/lucene#createConnector to create a connector instance for Lucene.

Individual instances of a connector are distinguished by unique names that are also URIs. They have their own prefix to avoid clashing with any of the command predicates. For Lucene, the instance prefix is http://www.ontotext.com/connectors/lucene/instance#.

Sample data All examples use the following sample data, which describes five fictitious wines: Yoyowine, Franvino, Noirette, Blanquito and Rozova, as well as the grape varieties required to make these wines. The minimum required rule-set level in GraphDB is RDFS.

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix : <http://www.ontotext.com/example/wine#> .

:RedWine rdfs:subClassOf :Wine .
:WhiteWine rdfs:subClassOf :Wine .
:RoseWine rdfs:subClassOf :Wine .

:Merlo
    rdf:type :Grape ;
    rdfs:label "Merlo" .

:CabernetSauvignon
    rdf:type :Grape ;
    rdfs:label "Cabernet Sauvignon" .

:CabernetFranc
    rdf:type :Grape ;
    rdfs:label "Cabernet Franc" .

:PinotNoir
    rdf:type :Grape ;
    rdfs:label "Pinot Noir" .

:Chardonnay
    rdf:type :Grape ;
    rdfs:label "Chardonnay" .

:Yoyowine
    rdf:type :RedWine ;
    :madeFromGrape :CabernetSauvignon ;
    :hasSugar "dry" ;
    :hasYear "2013"^^xsd:integer .

:Franvino
    rdf:type :RedWine ;
    :madeFromGrape :Merlo ;
    :madeFromGrape :CabernetFranc ;
    :hasSugar "dry" ;
    :hasYear "2012"^^xsd:integer .

:Noirette
    rdf:type :RedWine ;
    :madeFromGrape :PinotNoir ;
    :hasSugar "medium" ;
    :hasYear "2012"^^xsd:integer .

:Blanquito
    rdf:type :WhiteWine ;
    :madeFromGrape :Chardonnay ;
    :hasSugar "dry" ;
    :hasYear "2012"^^xsd:integer .

:Rozova
    rdf:type :RoseWine ;
    :madeFromGrape :PinotNoir ;
    :hasSugar "medium" ;
    :hasYear "2013"^^xsd:integer .

5.4.1.3 Setup and maintenance

Third-party component versions This version of the Lucene GraphDB Connector uses Lucene version 4.10.4.

Creating a connector instance

Creating a connector instance is done by sending a SPARQL query with the following configuration data:

• the name of the connector instance (e.g., my_index);

• classes to synchronise;

• properties to synchronise.

The configuration data has to be provided as a JSON string representation and passed together with the create command.

Tip: Use the GraphDB Connectors management interface provided by the GraphDB Workbench as it lets you create the configuration easily, and then create the connector instance directly or copy the configuration and execute it elsewhere.

The create command is triggered by a SPARQL INSERT with the createConnector predicate, e.g., the following creates a connector instance called my_index, which synchronises the wines from the sample data above:


PREFIX : <http://www.ontotext.com/connectors/lucene#>
PREFIX inst: <http://www.ontotext.com/connectors/lucene/instance#>

INSERT DATA {
    inst:my_index :createConnector '''
{
    "types": [
        "http://www.ontotext.com/example/wine#Wine"
    ],
    "fields": [
        {
            "fieldName": "grape",
            "propertyChain": [
                "http://www.ontotext.com/example/wine#madeFromGrape",
                "http://www.w3.org/2000/01/rdf-schema#label"
            ]
        },
        {
            "fieldName": "sugar",
            "propertyChain": [
                "http://www.ontotext.com/example/wine#hasSugar"
            ]
        },
        {
            "fieldName": "year",
            "propertyChain": [
                "http://www.ontotext.com/example/wine#hasYear"
            ]
        }
    ]
}
''' .
}

The above command creates a new Lucene connector instance.

The "types" key defines the RDF type of the entities to synchronise and, in the example, it is only entities ofthe type http://www.ontotext.com/example/wine#Wine (and its subtypes). The "fields" key defines themapping from RDF to Lucene. The basic building block is the property chain, i.e., a sequence of RDF propertieswhere the object of each property is the subject of the following property. In the example, three bits of informationare mapped - the grape the wines are made of, sugar content, and year. Each chain is assigned a short andconvenient field name: “grape”, “sugar”, and “year”. The field names are later used in the queries.

Grape is an example of a property chain composed of more than one property. First, we take the wine's madeFromGrape property, the object of which is an instance of the type Grape, and then we take the rdfs:label of this instance. Sugar and year are both composed of a single property that links the value directly to the wine.

Dropping a connector instance

Dropping a connector instance removes all references to its external store from GraphDB as well as all Lucene files associated with it.

The drop command is triggered by a SPARQL INSERT with the dropConnector predicate where the name of the connector instance has to be in the subject position, e.g., this removes the connector my_index:

PREFIX : <http://www.ontotext.com/connectors/lucene#>
PREFIX inst: <http://www.ontotext.com/connectors/lucene/instance#>

INSERT DATA {
    inst:my_index :dropConnector "" .
}


Listing available connector instances

Listing connector instances returns all previously created instances. It is a SELECT query with the listConnectors predicate:

PREFIX : <http://www.ontotext.com/connectors/lucene#>

SELECT ?cntUri ?cntStr {
    ?cntUri :listConnectors ?cntStr .
}

?cntUri is bound to the prefixed URI of the connector instance that was used during creation, e.g., http://www.ontotext.com/connectors/lucene/instance#my_index, while ?cntStr is bound to a string, representing the part after the prefix, e.g., "my_index".

Instance status check

The internal state of each connector instance can be queried using a SELECT query and the connectorStatus predicate:

PREFIX : <http://www.ontotext.com/connectors/lucene#>

SELECT ?cntUri ?cntStatus {
    ?cntUri :connectorStatus ?cntStatus .
}

?cntUri is bound to the prefixed URI of the connector instance, while ?cntStatus is bound to a string representation of the status of the connector represented by this URI. The status is key-value based.

5.4.1.4 Working with data

Adding, updating and deleting data

From the user point of view, all synchronisation happens transparently without using any additional predicates or naming a specific store explicitly, i.e., you simply execute standard SPARQL INSERT/DELETE queries. This is achieved by intercepting all changes in the plugin and determining which abstract documents need to be updated.
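For example, a minimal sketch against the sample wine data: the update below changes the sugar content of Yoyowine, and the my_index instance re-synchronises the affected document automatically, with no connector-specific predicates involved:

PREFIX wine: <http://www.ontotext.com/example/wine#>

DELETE { wine:Yoyowine wine:hasSugar "dry" }
INSERT { wine:Yoyowine wine:hasSugar "medium" }
WHERE  { wine:Yoyowine wine:hasSugar "dry" }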

Simple queries

Once a connector instance has been created, it is possible to query data from it through SPARQL. For each matching abstract document, the connector instance returns the document subject. In its simplest form, querying is achieved by using a SELECT and providing the Lucene query as the object of the query predicate:

PREFIX : <http://www.ontotext.com/connectors/lucene#>
PREFIX inst: <http://www.ontotext.com/connectors/lucene/instance#>

SELECT ?entity {
    ?search a inst:my_index ;
        :query "grape:cabernet" ;
        :entities ?entity .
}

The result binds ?entity to the two wines made from grapes that have "cabernet" in their name, namely :Yoyowine and :Franvino.


Note: You must use the field names you chose when you created the connector instance. They can be identical to the property URIs but you must escape any special characters according to what Lucene expects.

1. Get a query instance of the requested connector instance by using the RDF notation "X a Y" (= X rdf:type Y), where X is a variable and Y is a connector instance URI. X is bound to a query instance of the connector instance.

2. Assign a query to the query instance by using the system predicate :query.

3. Request the matching entities through the :entities predicate.

It is also possible to provide per-query search options by using one or more option predicates. The option predicates are described in detail below.

Combining Lucene results with GraphDB data

The bound ?entity can be used in other SPARQL triples in order to build complex queries that fetch additional data from GraphDB, for example, to see the actual grapes in the matching wines as well as the year they were made:

PREFIX : <http://www.ontotext.com/connectors/lucene#>
PREFIX inst: <http://www.ontotext.com/connectors/lucene/instance#>
PREFIX wine: <http://www.ontotext.com/example/wine#>

SELECT ?entity ?grape ?year {
    ?search a inst:my_index ;
        :query "grape:cabernet" ;
        :entities ?entity .
    ?entity wine:madeFromGrape ?grape .
    ?entity wine:hasYear ?year
}

The result looks like this:

?entity     ?grape              ?year
:Yoyowine   :CabernetSauvignon  2013
:Franvino   :Merlo              2012
:Franvino   :CabernetFranc      2012

Note: :Franvino is returned twice because it is made from two different grapes, both of which are returned.

Entity match score

It is possible to access the match score returned by Lucene with the score predicate. As each entity has its own score, the predicate should come at the entity level. For example:

PREFIX : <http://www.ontotext.com/connectors/lucene#>
PREFIX inst: <http://www.ontotext.com/connectors/lucene/instance#>

SELECT ?entity ?score {
    ?search a inst:my_index ;
        :query "grape:cabernet" ;
        :entities ?entity .
    ?entity :score ?score
}

The result looks like this but the actual score might be different as it depends on the specific Lucene version:


?entity     ?score
:Yoyowine   0.9442660212516785
:Franvino   0.7554128170013428

Basic facet queries

Consider the sample wine data and the my_index connector instance described previously. You can also query facets using the same instance:

PREFIX : <http://www.ontotext.com/connectors/lucene#>
PREFIX inst: <http://www.ontotext.com/connectors/lucene/instance#>

SELECT ?facetName ?facetValue ?facetCount WHERE {
    # note: an empty query is allowed and will just match all documents, hence no :query
    ?r a inst:my_index ;
        :facetFields "year,sugar" ;
        :facets _:f .
    _:f :facetName ?facetName .
    _:f :facetValue ?facetValue .
    _:f :facetCount ?facetCount .
}

It is important to specify the facet fields by using the facetFields predicate. Its value is a simple comma-delimited list of field names. In order to get the faceted results, use the facets predicate. As each facet has three components (name, value and count), the facets predicate binds a blank node, which in turn can be used to access the individual values for each component through the predicates facetName, facetValue, and facetCount.

The resulting bindings look like the following:

facetName   facetValue   facetCount
year        2012         3
year        2013         2
sugar       dry          3
sugar       medium       2

You can easily see that there are three wines produced in 2012 and two in 2013. You also see that three of the wines are dry, while two are medium. However, it is not necessarily true that the three wines produced in 2012 are the same as the three dry wines, as each facet is computed independently.

Sorting

It is possible to sort the entities returned by a connector query according to one or more fields. Sorting is achieved by the orderBy predicate, the value of which is a comma-delimited list of fields. Each field can be prefixed with a minus to indicate sorting in descending order. For example:

PREFIX : <http://www.ontotext.com/connectors/lucene#>
PREFIX inst: <http://www.ontotext.com/connectors/lucene/instance#>

SELECT ?entity {
    ?search a inst:my_index ;
        :query "year:2013" ;
        :orderBy "-sugar" ;
        :entities ?entity .
}

The result contains wines produced in 2013 sorted according to their sugar content in descending order:

entity
Rozova
Yoyowine


By default, entities are sorted according to their matching score in descending order.

Note: If you join the entity from the connector query to other triples stored in GraphDB, GraphDB might scramble the order. To remedy this, use ORDER BY from SPARQL.
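For example, a sketch that joins connector results with repository data and then imposes a deterministic order with ORDER BY (reusing the my_index instance and the wine: namespace from the earlier examples):

PREFIX : <http://www.ontotext.com/connectors/lucene#>
PREFIX inst: <http://www.ontotext.com/connectors/lucene/instance#>
PREFIX wine: <http://www.ontotext.com/example/wine#>

SELECT ?entity ?year {
    ?search a inst:my_index ;
        :query "sugar:dry" ;
        :entities ?entity .
    ?entity wine:hasYear ?year .
}
ORDER BY DESC(?year)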

Tip: Sorting by an analysed textual field works but might produce unexpected results. Analysed textual fields are composed of tokens and sorting uses the least (in the lexicographical sense) token. For example, "North America" will be sorted before "Europe" because the token "america" is lexicographically smaller than the token "europe". If you need to sort by a textual field and still do full-text search on it, it is best to create a copy of the field with the setting "analyzed": false. For more information, see Copy fields.

Limit and offset

Limit and offset are supported on the Lucene side of the query. This is achieved through the predicates limit and offset. Consider this example in which an offset of 1 and a limit of 1 are specified:

PREFIX : <http://www.ontotext.com/connectors/lucene#>
PREFIX inst: <http://www.ontotext.com/connectors/lucene/instance#>

SELECT ?entity {
    ?search a inst:my_index ;
        :query "sugar:dry" ;
        :offset "1" ;
        :limit "1" ;
        :entities ?entity .
}

The result contains a single wine, Franvino. If you execute the query without the limit and offset, Franvino will be second in the list:

entity
Yoyowine
Franvino
Blanquito

Note: The specific order in which GraphDB returns the results depends on how Lucene returns the matches, unless sorting is specified.

Snippet extraction

Snippet extraction is used for extracting highlighted snippets of text that match the query. The snippets are accessed through the dedicated predicate snippets. It binds a blank node that in turn provides the actual snippets via the predicates snippetField and snippetText. The predicate snippets must be attached to the entity, as each entity has a different set of snippets. For example, in a search for Cabernet:

PREFIX : <http://www.ontotext.com/connectors/lucene#>
PREFIX inst: <http://www.ontotext.com/connectors/lucene/instance#>

SELECT ?entity ?snippetField ?snippetText {
    ?search a inst:my_index ;
        :query "grape:cabernet" ;
        :entities ?entity .
    ?entity :snippets _:s .
    _:s :snippetField ?snippetField ;
        :snippetText ?snippetText .
}

the query returns the two wines made from Cabernet Sauvignon or Cabernet Franc grapes as well as the respective matching fields and snippets:

?entity     ?snippetField   ?snippetText
:Yoyowine   grape           <em>Cabernet</em> Sauvignon
:Franvino   grape           <em>Cabernet</em> Franc

Note: The actual snippets might be different as this depends on the specific Lucene implementation.

It is possible to tweak how the snippets are collected/composed by using the following option predicates:

• :snippetSize - sets the maximum size of the extracted text fragment, 250 by default;

• :snippetSpanOpen - text to insert before the highlighted text, <em> by default;

• :snippetSpanClose - text to insert after the highlighted text, </em> by default.

The option predicates are set on the query instance, much like the :query predicate.
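For example, a sketch that shortens the snippets and swaps the default <em> markup for <b> tags (the specific values shown are illustrative):

PREFIX : <http://www.ontotext.com/connectors/lucene#>
PREFIX inst: <http://www.ontotext.com/connectors/lucene/instance#>

SELECT ?entity ?snippetText {
    ?search a inst:my_index ;
        :query "grape:cabernet" ;
        :snippetSize "100" ;
        :snippetSpanOpen "<b>" ;
        :snippetSpanClose "</b>" ;
        :entities ?entity .
    ?entity :snippets _:s .
    _:s :snippetText ?snippetText .
}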

Total hits

You can get the total number of hits by using the totalHits predicate, e.g., for the connector instance my_index and a query that retrieves all wines made in 2012:

PREFIX : <http://www.ontotext.com/connectors/lucene#>
PREFIX inst: <http://www.ontotext.com/connectors/lucene/instance#>

SELECT ?totalHits {
    ?r a inst:my_index ;
        :query "year:2012" ;
        :totalHits ?totalHits .
}

As there are three wines made in 2012, the value 3 (of type xsd:long) binds to ?totalHits.

5.4.1.5 List of creation parameters

The creation parameters define how a connector instance is created by the :createConnector predicate. Some are required and some are optional. All parameters are provided together in a JSON object, where the parameter names are the object keys. Parameter values may be simple JSON values such as a string or a boolean, or they can be lists or objects.

All of the creation parameters can also be set conveniently from the Create Connector user interface in the GraphDB Workbench without any knowledge of JSON.

analyzer (string), optional, specifies Lucene analyser The Lucene Connector supports custom Analyser implementations. They may be specified via the analyzer parameter whose value must be a fully qualified name of a class that extends org.apache.lucene.analysis.Analyzer. The class requires either a default constructor or a constructor with exactly one parameter of type org.apache.lucene.util.Version. For example, these two classes are valid implementations:

package com.ontotext.example;

import org.apache.lucene.analysis.Analyzer;

public class FancyAnalyzer extends Analyzer {
    public FancyAnalyzer() {
        ...
    }
    ...
}

package com.ontotext.example;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.util.Version;

public class SmartAnalyzer extends Analyzer {
    public SmartAnalyzer(Version luceneVersion) {
        ...
    }
    ...
}

FancyAnalyzer and SmartAnalyzer can then be used by specifying their fully qualified names, for example:

..."analyzer": "com.ontotext.example.SmartAnalyzer",

...

types (list of URI), required, specifies the types of entities to sync The RDF types of entities to sync are specified as a list of URIs. At least one type URI is required.

languages (list of string), optional, valid languages for literals RDF data is often multilingual but you can map only some of the languages represented in the literal values. This can be done by specifying a list of language ranges to be matched to the language tags of literals according to RFC 4647, Section 3.3.1. Basic Filtering. In addition, an empty range can be used to include literals that have no language tag. The list of language ranges maps all existing literals that have matching language tags.
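For example, a sketch of a configuration fragment that synchronises only English literals and literals without a language tag:

...
"languages": ["en", ""],
...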

fields (list of field object), required, defines the mapping from RDF to Lucene The fields define exactly what parts of each entity will be synchronised as well as the specific details on the connector side. The field is the smallest synchronisation unit and it maps a property chain from GraphDB to a field in Lucene. The fields are specified as a list of field objects. At least one field object is required. Each field object has further keys that specify details.

• fieldName (string), required, the name of the field in Lucene The name of the field defines the mapping on the connector side. It is specified by the key fieldName with a string value. The field name is used at query time to refer to the field. There are few restrictions on the allowed characters in a field name but to avoid unnecessary escaping (which depends on how Lucene parses its queries), we recommend keeping the field names simple.

• propertyChain (list of URI), required, defines the property chain to reach the value The property chain (propertyChain) defines the mapping on the GraphDB side. A property chain is defined as a sequence of triples where the entity URI is the subject of the first triple, its object is the subject of the next triple and so on. In this model, a property chain with a single element corresponds to a direct property defined by a single triple. Property chains are specified as a list of URIs where at least one URI must be provided.

See Copy fields for defining multiple fields with the same property chain.

See Multiple property chains per field for defining a field whose values are populated from more than one property chain.

• defaultValue (string), optional, specifies a default value for the field The default value (defaultValue) provides means for specifying a default value for the field when the property chain has no matching values in GraphDB. The default value can be a plain literal, a literal with a datatype (xsd: prefix supported), a literal with language, or a URI. It has no default value.


• indexed (boolean), optional, default true If indexed, a field is available for Lucene queries. true by default.

This option corresponds to Lucene’s field option "indexed".

• stored (boolean), optional, default true Fields can be stored in Lucene and this is controlled by the Boolean option "stored". Stored fields are required for retrieving snippets. true by default.

This option corresponds to Lucene's property "stored".

• analyzed (boolean), optional, default true When literal fields are indexed in Lucene, they will be analysed according to the analyser settings. Should you require that a given field is not analysed, you may use "analyzed". This option has no effect for URIs (they are never analysed). true by default.

This option corresponds to Lucene’s property “tokenized”.

• multivalued (boolean), optional, default true RDF properties and synchronised fields may have more than one value. If "multivalued" is set to true, all values will be synchronised to Lucene. If set to false, only a single value will be synchronised. true by default.

• facet (boolean), optional, default true Lucene needs to index data in a special way if it will be used for faceted search. This is controlled by the Boolean option "facet". true by default. Fields that are not synchronised for faceting are also not available for faceted search.

• datatype (string), optional, the manual datatype override By default, the Lucene GraphDB Connector uses the datatype of literal values to determine how they must be mapped to Lucene types. For more information on the supported datatypes, see Datatype mapping.

The datatype mapping can be overridden through the parameter "datatype", which can be specified per field. The value of "datatype" can be any of the xsd: types supported by the automatic mapping.
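For example, a sketch of a field definition that forces the year values from the sample wine data to be indexed as longs, regardless of the datatype of the first literal encountered:

{
    "fieldName": "year",
    "propertyChain": [
        "http://www.ontotext.com/example/wine#hasYear"
    ],
    "datatype": "xsd:long"
}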

Special field definitions

Copy fields

Often, it is convenient to synchronise one and the same data multiple times with different settings to accommodate different use cases, e.g., faceting or sorting vs full-text search. The Lucene GraphDB Connector has explicit support for fields that copy their value from another field. This is achieved by specifying a single element in the property chain of the form @otherFieldName, where otherFieldName is another non-copy field. Take the following example:

..."fields": [

{"fieldName": "grape","propertyChain": [

"http://www.ontotext.com/example/wine#madeFromGrape","http://www.w3.org/2000/01/rdf-schema#label"

],"analyzed": true,

},{

"fieldName": "grapeFacet","propertyChain": [

"@grape"],"analyzed": false,

}]

...


The snippet creates an analysed field "grape" and a non-analysed field "grapeFacet". Both fields are populated with the same values, and "grapeFacet" is defined as a copy field that refers to the field "grape".

Note: The connector handles copy fields in a more optimal way than specifying a field with exactly the same property chain as another field.

Multiple property chains per field

Sometimes, you have to work with data models that define the same concept (in terms of what you want to index in Lucene) with more than one property chain, e.g., the concept of "name" could be defined as a single canonical name, multiple historical names and some unofficial names. If you want to index these together as a single field in Lucene, you can define this as a multiple property chains field.

Fields with multiple property chains are defined as a set of separate virtual fields that will be merged into a single physical field when indexed. Virtual fields are distinguished by the suffix /xyz, where xyz is any alphanumeric sequence of convenience. For example, we can define the fields name/1 and name/2 like this:

..."fields": [

{"fieldName": "name/1","propertyChain": [

"http://www.ontotext.com/example#canonicalName"],"fieldName": "name/2","propertyChain": [

"http://www.ontotext.com/example#historicalName"]...

},...

The values of the fields name/1 and name/2 will be merged and synchronised to the field name in Lucene.

Note: You cannot mix suffixed and unsuffixed fields with the same name, e.g., if you defined myField/new and myField/old you cannot have a field called just myField.

Filters and fields with multiple property chains

Filters can be used with fields defined with multiple property chains. Both the physical field values and the individual virtual field values are available:

• Physical fields are specified without the suffix, e.g., ?myField

• Virtual fields are specified with the suffix, e.g., ?myField/2 or ?myField/alt.

Note: Physical fields cannot be combined with parent() as their values come from different property chains. If you really need to filter the same parent level, you can rewrite parent(?myField) in (<urn:x>, <urn:y>) as parent(?myField/1) in (<urn:x>, <urn:y>) || parent(?myField/2) in (<urn:x>, <urn:y>) || parent(?myField/3) ... and surround it with parentheses if it is a part of a bigger expression.
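For example, a sketch of a filter over the name/1 and name/2 fields defined above (the literal value is illustrative, and the escaping matches how the filter is embedded in a creation query): it requires that some name survives filtering and that the canonical name is one specific value:

"entityFilter": "bound(?name) && ?name/1 in (\\"London\\")"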


5.4.1.6 Datatype mapping

The Lucene GraphDB Connector maps different types of RDF values to different types of Lucene values according to the basic type of the RDF value (URI or literal) and the datatype of literals. The autodetection uses the following mapping:

RDF value   RDF datatype   Lucene type
URI         n/a            StringField
literal     none           TextField
literal     xsd:boolean    StringField with values "true" and "false"
literal     xsd:double     DoubleField
literal     xsd:float      FloatField
literal     xsd:long       LongField
literal     xsd:int        IntField
literal     xsd:dateTime   DateTools.timeToString(), second precision
literal     xsd:date       DateTools.timeToString(), day precision

The datatype mapping can be affected by the synchronisation options too, e.g., a non-analysed field that has xsd:long values is indexed with a StringField.

Note: For any given field the automatic mapping uses the first value it sees. This works fine for clean datasets but might lead to problems if your dataset has non-normalised data, e.g., the first value has no datatype but other values have.

5.4.1.7 Advanced filtering and fine tuning

entityFilter (string) The entityFilter parameter is used to fine-tune the set of entities and/or individual values for the configured fields, based on the field value. Entities and field values are synchronised to Lucene if, and only if, they pass the filter. The entity filter is similar to a FILTER() inside a SPARQL query but not exactly the same. Each configured field can be referred to, in the entity filter, by prefixing it with a ?, much like referring to a variable in SPARQL. Several operators are supported:

Operator: ?var in (value1, value2, ...)
    Meaning: Tests if the field var's value is one of the specified values. Values that do not match are treated as if they were not present in the repository.
    Example: ?status in ("active", "new")

Operator: ?var not in (value1, value2, ...)
    Meaning: The negated version of the in-operator.
    Example: ?status not in ("archived")

Operator: bound(?var)
    Meaning: Tests if the field var has a valid value. This can be used to make the field compulsory.
    Example: bound(?name)

Operator: expr1 or expr2
    Meaning: Logical disjunction of expressions expr1 and expr2.
    Example: bound(?name) or bound(?company)

Operator: expr1 && expr2
    Meaning: Logical conjunction of expressions expr1 and expr2.
    Example: bound(?status) && ?status in ("active", "new")

Operator: !expr
    Meaning: Logical negation of expression expr.
    Example: !bound(?company)

Operator: ( expr )
    Meaning: Grouping of expressions.
    Example: (bound(?name) or bound(?company)) && bound(?address)

Note:

• ?var in (...) filters the values of ?var and leaves only the matching values, i.e., it will modify the actual data that will be synchronised to Lucene

• bound(?var) checks if there is any valid value left after filtering operators such as ?var in (...) have been applied


In addition to the operators, there are some constructions that can be used to write filters based not on the values but on values related to them:

Accessing the previous element in the chain The construction parent(?var) is used for going to a previous level in a property chain. It can be applied recursively as many times as needed, e.g., parent(parent(parent(?var))) goes back in the chain three times. The effective value of parent(?var) can be used with the in or not in operator like this: parent(?company) in (<urn:a>, <urn:b>), or in the bound operator like this: bound(parent(?var)).

Accessing an element beyond the chain The construction ?var -> uri (alternatively ?var o uri or just ?var uri) is used for accessing additional values that are accessible through the property uri. In essence, this construction corresponds to the triple pattern ?value uri ?effectiveValue, where ?value is a value bound by the field var. The effective value of ?var -> uri can be used with the in or not in operator like this: ?company -> rdf:type in (<urn:c>, <urn:d>). It can be combined with parent() like this: parent(?company) -> rdf:type in (<urn:c>, <urn:d>). The same construction can be applied to the bound operator like this: bound(?company -> <urn:hasBranch>), or even combined with parent() like this: bound(parent(?company) -> <urn:hasGroup>).

The URI parameter can be a full URI within < > or the special string rdf:type (alternatively just type), which will be expanded to http://www.w3.org/1999/02/22-rdf-syntax-ns#type.

Filtering by RDF graph The construction graph(?var) is used for accessing the RDF graph of a field's value. The typical use case is to sync only explicit values: graph(?a) not in (<http://www.ontotext.com/implicit>). The construction can be combined with parent() like this: graph(parent(?a)) in (<urn:a>).

Entity filters and default values Entity filters can be combined with default values in order to get more flexible behaviour.

A typical use-case for an entity filter is having soft deletes, i.e., instead of deleting an entity, it is marked as deleted by the presence of a specific value for a given property.
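For example, a sketch of such a soft-delete setup, assuming a hypothetical marker property http://www.example.com/deleted: the marker is mapped to a field, and the filter lets an entity through only while the marker is absent:

...
{
    "fieldName": "deleted",
    "propertyChain": ["http://www.example.com/deleted"]
}
...
"entityFilter": "!bound(?deleted)"
...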

Basic entity filter example

Given the following RDF data:

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix : <http://www.ontotext.com/example#> .

# the entity below will be synchronised because it has a matching value for city: ?city in ("London")
:alpha
    rdf:type :gadget ;
    :name "John Synced" ;
    :city "London" .

# the entity below will not be synchronised because it lacks the property completely: bound(?city)
:beta
    rdf:type :gadget ;
    :name "Peter Syncfree" .

# the entity below will not be synchronised because it has a different city value:
# ?city in ("London") will remove the value "Liverpool" so bound(?city) will be false
:gamma
    rdf:type :gadget ;
    :name "Mary Syncless" ;
    :city "Liverpool" .

If you create a connector instance such as:


PREFIX : <http://www.ontotext.com/connectors/lucene#>
PREFIX inst: <http://www.ontotext.com/connectors/lucene/instance#>

INSERT DATA {
    inst:my_index :createConnector '''
{
    "types": ["http://www.ontotext.com/example#gadget"],
    "fields": [
        {
            "fieldName": "name",
            "propertyChain": ["http://www.ontotext.com/example#name"]
        },
        {
            "fieldName": "city",
            "propertyChain": ["http://www.ontotext.com/example#city"]
        }
    ],
    "entityFilter": "bound(?city) && ?city in (\\"London\\")"
}
''' .
}

The entity :beta is not synchronised as it has no value for city.

To handle such cases, you can modify the connector configuration to specify a default value for city:

...
{
    "fieldName": "city",
    "propertyChain": ["http://www.ontotext.com/example#city"],
    "defaultValue": "London"
}
...

The default value is used for the entity :beta as it has no value for city in the repository. As the value is "London", the entity is synchronised.

Advanced entity filter example

Sometimes, data represented in RDF is not well suited to map directly to non-RDF. For example, if you have news articles and they can be tagged with different concepts (locations, persons, events, etc.), one possible way to model this is a single property :taggedWith. Consider the following RDF data:

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix : <http://www.ontotext.com/example2#> .

:Berlin
    rdf:type :Location ;
    rdfs:label "Berlin" .

:Mozart
    rdf:type :Person ;
    rdfs:label "Wolfgang Amadeus Mozart" .

:Einstein
    rdf:type :Person ;
    rdfs:label "Albert Einstein" .

:Cannes-FF
    rdf:type :Event ;
    rdfs:label "Cannes Film Festival" .

:Article1
    rdf:type :Article ;
    rdfs:comment "An article about a film about Einstein's life while he was a professor in Berlin." ;
    :taggedWith :Berlin ;
    :taggedWith :Einstein ;
    :taggedWith :Cannes-FF .

:Article2
    rdf:type :Article ;
    rdfs:comment "An article about Berlin." ;
    :taggedWith :Berlin .

:Article3
    rdf:type :Article ;
    rdfs:comment "An article about Mozart's life." ;
    :taggedWith :Mozart .

:Article4
    rdf:type :Article ;
    rdfs:comment "An article about classical music in Berlin." ;
    :taggedWith :Berlin ;
    :taggedWith :Mozart .

:Article5
    rdf:type :Article ;
    rdfs:comment "A boring article that has no tags." .

:Article6
    rdf:type :Article ;
    rdfs:comment "An article about the Cannes Film Festival in 2013." ;
    :taggedWith :Cannes-FF .

Now, if you map this data to Lucene so that the property :taggedWith x is mapped to separate fields taggedWithPerson and taggedWithLocation according to the type of x (we are not interested in events), you can map taggedWith twice to different fields and then use an entity filter to get the desired values:

PREFIX : <http://www.ontotext.com/connectors/lucene#>
PREFIX inst: <http://www.ontotext.com/connectors/lucene/instance#>

INSERT DATA {
    inst:my_index :createConnector '''
{
    "types": ["http://www.ontotext.com/example2#Article"],
    "fields": [
        {
            "fieldName": "comment",
            "propertyChain": ["http://www.w3.org/2000/01/rdf-schema#comment"]
        },
        {
            "fieldName": "taggedWithPerson",
            "propertyChain": ["http://www.ontotext.com/example2#taggedWith"]
        },
        {
            "fieldName": "taggedWithLocation",
            "propertyChain": ["http://www.ontotext.com/example2#taggedWith"]
        }
    ],
    "entityFilter": "?taggedWithPerson type in (<http://www.ontotext.com/example2#Person>) && ?taggedWithLocation type in (<http://www.ontotext.com/example2#Location>)"
}
''' .
}

Note: type is the short way to write <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>.

The six articles in the RDF data above will be mapped as such:

• :Article1 - taggedWithPerson: :Einstein; taggedWithLocation: :Berlin. :taggedWith has the values :Einstein, :Berlin and :Cannes-FF. The filter leaves only the correct values in the respective fields. The value :Cannes-FF is ignored as it does not match the filter.

• :Article2 - taggedWithLocation: :Berlin. :taggedWith has the value :Berlin. After the filter is applied, only taggedWithLocation is populated.

• :Article3 - taggedWithPerson: :Mozart. :taggedWith has the value :Mozart. After the filter is applied, only taggedWithPerson is populated.

• :Article4 - taggedWithPerson: :Mozart; taggedWithLocation: :Berlin. :taggedWith has the values :Berlin and :Mozart. The filter leaves only the correct values in the respective fields.

• :Article5 - no values in either field. :taggedWith has no values; the filter is not relevant.

• :Article6 - no values in either field. :taggedWith has the value :Cannes-FF; the filter removes it as it does not match.

This can be checked by issuing a faceted search for taggedWithLocation and taggedWithPerson:

PREFIX : <http://www.ontotext.com/connectors/lucene#>
PREFIX inst: <http://www.ontotext.com/connectors/lucene/instance#>

SELECT ?facetName ?facetValue ?facetCount {
    ?search a inst:my_index ;
        :facetFields "taggedWithLocation,taggedWithPerson" ;
        :facets _:f .
    _:f :facetName ?facetName ;
        :facetValue ?facetValue ;
        :facetCount ?facetCount .
}

If the filter was applied, you should get only :Berlin for taggedWithLocation and only :Einstein and :Mozart for taggedWithPerson:

?facetName           ?facetValue                                 ?facetCount
taggedWithLocation   http://www.ontotext.com/example2#Berlin     3
taggedWithPerson     http://www.ontotext.com/example2#Mozart     2
taggedWithPerson     http://www.ontotext.com/example2#Einstein   1

5.4.1.8 Overview of connector predicates

The following diagram shows a summary of all predicates that can administer (create, drop, check status) connector instances or issue queries and retrieve results. It can be used as a quick reference of what a particular predicate needs to be attached to. For example, to retrieve entities, you need to use :entities on a search instance and to retrieve snippets, you need to use :snippets on an entity. Variables that are bound as a result of a query are shown in green, blank helper nodes are shown in blue, literals in red, and URIs in orange. The predicates are represented by labelled arrows.

5.4.1.9 Caveats

Order of control

Even though SPARQL per se is not sensitive to the order of triple patterns, the Lucene GraphDB Connector expects to receive certain predicates before others so that queries can be executed properly. In particular, predicates that specify the query or query options need to come before any predicates that fetch results.

The diagram in Overview of connector predicates provides a quick overview of the predicates.

5.4.1.10 Upgrading from previous versions

No special procedures are required for upgrading from:

• GraphDB 6.2 / Lucene Connector 4.0

• GraphDB 6.3 / Lucene Connector 4.1

• GraphDB 6.4 / Lucene Connector 4.1

Migrating from a pre-6.2 version

GraphDB prior to 6.2 shipped with version 3.x of the Lucene GraphDB Connector that had different options and slightly different behaviour and internals. Unfortunately, it is not possible to migrate existing connector instances automatically. To prevent any data loss, the Lucene GraphDB Connector will not initialise if it detects an existing connector in the old format. The recommended way to migrate your existing instances is:

1. Back up the INSERT statement used to create the connector instance.

2. Drop the connector.

3. Deploy the new GraphDB version.


4. Modify the INSERT statement according to the changes described below.

5. Re-create the connector instance with the modified INSERT statement.

You might also need to change your queries to reflect any changes in field names or extra fields.

Changes in field configuration and synchronisation

Prior to 6.2, a single field in the config could produce up to three individual fields on the Lucene side, based on the field options. For example, for the field "firstName":

field              note
firstName          produced, if the option "index" was true; used explicitly in queries
_facet_firstName   produced, if the option "facet" was true; used implicitly for facet search
_sort_firstName    produced, if the option "sort" was true; used implicitly for ordering connector results

The current version always produces a single Lucene field per field definition in the configuration. This means that you have to create all appropriate fields based on your needs. See more in List of creation parameters.

Tip: To mimic the functionality of the old _sort_fieldName fields, you can either create a non-analysed copy field (see Copy fields) for textual fields, or just use the normal field for non-textual fields.
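For example, a sketch (with a hypothetical dc:title property chain) that keeps an analysed field for full-text search and adds a non-analysed copy to sort by:

...
{
    "fieldName": "title",
    "propertyChain": ["http://purl.org/dc/elements/1.1/title"]
},
{
    "fieldName": "titleSort",
    "propertyChain": ["@title"],
    "analyzed": false
}
...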

Migrating from the Lucene4 plugin

You can easily migrate your existing lucene4 plugin setup to the new connectors interface.

Create index queries We provide an automated migration tool for your create index queries. The tool is distributed with GraphDB 6.0 onward and can be found in the tools subdirectory. Here is how to use it:

java -jar migration.jar --file <input-file> <output-file>

where input-file is your old SPARQL file and output-file is the new SPARQL file.

You can find possible options with:

java -jar migration.jar --help

Select queries using the index We have changed the syntax for the search queries to be able to match our needs for new features and better design. Here is an example query using the lucene4 plugin:

PREFIX luc4: <http://www.ontotext.com/owlim/lucene4#>

SELECT ?c ?snippet WHERE {
    ?c rdf:type <http://data.ontotext.com/ontologies/ontology1/Type1> .
    ?c luc4:content ("gold" "limit=10;snippet.size=200") .
    ?c luc4:snippet ?snippet .
    ?c luc4:score ?score .
}

and here is the connector variant:

PREFIX conn: <http://www.ontotext.com/connectors/lucene#>
PREFIX inst: <http://www.ontotext.com/connectors/lucene/instance#>

SELECT ?entity ?snippet WHERE {
    [] a inst:content ;
        conn:query "gold" ;
        conn:limit "10" ;
        conn:snippetSize "200" ;
        conn:entities ?entity .
    ?entity conn:snippets ?s .
    ?s conn:snippetText ?snippet .
}

Note: Note the following changes:

• We use special predicates for everything - no more key-value options in a string;

• The query is actually an instance of the index;

• Snippets belong to the entity;

• Snippets are now first-class objects - you can also get the field of the match;

• Indexes are now an instance of another namespace. This allows you to create indexes with the name "entities", for example.

Tip: For more information on the new syntax and how everything is linked together, see Overview of connector predicates.

5.5 GraphDB Dev Guide

5.5.1 Reasoning

Hint: To get the full benefit from this section, you need some basic knowledge of the two principal reasoning strategies for rule-based inference - forward-chaining and backward-chaining.

GraphDB performs reasoning based on forward-chaining of entailment rules defined using RDF triple patterns with variables. GraphDB's reasoning strategy is one of total materialisation, where the inference rules are applied repeatedly to the asserted (explicit) statements until no further inferred (implicit) statements are produced.

The GraphDB repository uses configured rule-sets to compute all inferred statements at load time. To some extent, this process increases the processing cost and time taken to load a repository with a large amount of data. However, it has the desirable advantage that subsequent query evaluation can proceed extremely quickly.

5.5.1.1 Logical formalism

GraphDB uses a notation almost identical to R-Entailment defined by Horst. RDFS inference is achieved via a set of axiomatic triples and entailment rules. These rules allow the full set of valid inferences using RDFS semantics to be determined.

Herman ter Horst defines RDFS extensions for more general rule support and a fragment of OWL, which is more expressive than DLP and fully compatible with RDFS. First, he defines R-entailment, which extends RDFS-entailment in the following way:

• It can operate on the basis of any set of rules R (i.e., allows for extension or replacement of the standard set, defining the semantics of RDFS);

• It operates over so-called generalised RDF graphs, where blank nodes can appear as predicates (a possibility disallowed in RDF);

• Rules without premises are used to declare axiomatic statements;

• Rules without consequences are used to detect inconsistencies (integrity constraints).


Tip: To learn more, see OWL Compliance.

5.5.1.2 Rule format and semantics

The rule format and the semantics enforced in GraphDB are analogous to R-entailment with the following differences:

• Free variables in the head (without binding in the body) are treated as blank nodes. This feature must be used with extreme caution because custom rule-sets can easily be created, which recursively infer an infinite number of statements making the semantics intractable;

• Variable inequality constraints can be specified in addition to the triple patterns (they can be placed after any premise or consequence). This leads to less complexity compared to R-entailment;

• The cut operator can be associated with rule premises. This is an optimisation that tells the rule compiler not to generate a variant of the rule with the identified rule premise as the first triple pattern;

• Context can be used for both rule premises and rule consequences, allowing more expressive constructions that utilise 'intermediate' statements contained within the given context URI;

• Consistency checking rules do not have consequences and will indicate an inconsistency when the premises are satisfied;

• Axiomatic triples can be provided as a set of statements, although these are not modelled as rules with empty bodies.

5.5.1.3 The rule-set file

GraphDB can be configured via rule-sets - sets of axiomatic triples, consistency checks and entailment rules,which determine the applied semantics.

A rule-set file has three sections named Prefices (sic), Axioms, and Rules. All sections are mandatory and must appear sequentially in this order. Comments are allowed anywhere and follow the Java convention, i.e., "/* ... */" for block comments and "//" for end-of-line comments.

For historic reasons, the way in which terms (variables, URLs and literals) are written differs from Turtle and SPARQL:

• URLs in Prefices are written without angle brackets

• variables are written without ? or $ and can include multiple alphanumeric chars

• URLs are written in brackets, no matter whether they use a prefix or are spelt in full

• datatype URLs are written without brackets, e.g.:

a <owl:maxQualifiedCardinality> "1"^^xsd:nonNegativeInteger

See the examples below and be careful when writing terms.

Prefices

This section defines the abbreviations for the namespaces used in the rest of the file. The syntax is:

shortname : URI

The following is an example of how a typical prefices section might look:


Prefices
{
    rdf  : http://www.w3.org/1999/02/22-rdf-syntax-ns#
    rdfs : http://www.w3.org/2000/01/rdf-schema#
    owl  : http://www.w3.org/2002/07/owl#
    xsd  : http://www.w3.org/2001/XMLSchema#
}

Axioms

This section asserts axiomatic triples, which usually describe the meta-level primitives used for defining the schema, such as rdf:type, rdfs:Class, etc. It contains a list of the (variable-free) triples, one per line.

For example, the RDF axiomatic triples are defined in the following way:

Axioms
{
    // RDF axiomatic triples
    <rdf:type> <rdf:type> <rdf:Property>
    <rdf:subject> <rdf:type> <rdf:Property>
    <rdf:predicate> <rdf:type> <rdf:Property>
    <rdf:object> <rdf:type> <rdf:Property>
    <rdf:first> <rdf:type> <rdf:Property>
    <rdf:rest> <rdf:type> <rdf:Property>
    <rdf:value> <rdf:type> <rdf:Property>
    <rdf:nil> <rdf:type> <rdf:List>
}

Note: Axiomatic statements are considered to be inferred for the purpose of query-answering because they are a result of semantic interpretation defined by the chosen rule-set.

Rules

This section is used to define entailment rules and consistency checks, which share a similar format. Each definition consists of premises and corollaries that are RDF statements defined with subject, predicate, object and optional context components. The subject, predicate and object can each be a variable, blank node, literal, full URI or the short name for a URI. If given, the context must be a full URI or a short name for a URI. Variables are alpha-numeric and must begin with a letter.

If the context is provided, the statements produced as rule consequences are not 'visible' during normal query answering. Instead, they can only be used as input to this or other rules and only when the rule premise explicitly uses the given context (see the example below).
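For illustration, a sketch of a pair of rules (with hypothetical URIs, assuming the context is written in square brackets after a premise or consequence): the first rule produces an 'intermediate' statement inside a context, and the second rule consumes it; the intermediate statement never appears in normal query results:

Id: make_step
    x <ex:hasParent> y
    -------------------------------
    x <ex:step> y [Context <ex:aux>]

Id: use_step
    x <ex:step> y [Context <ex:aux>]
    y <ex:hasParent> z
    -------------------------------
    x <ex:hasGrandParent> z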

Furthermore, inequality constraints can be used to state that the values of the variables in a statement must not be equal to a specific full URI (or its short name) or to the value of another variable within the same rule. The behaviour of an inequality constraint depends on whether it is placed in the body or the head of a rule. If it is placed in the body of a rule, then the whole rule will not 'fire' if the constraint fails, i.e., the constraint can be next to any statement pattern in the body of a rule with the same behaviour (the constraint does not have to be placed next to the variables it references). If the constraint is in the head, then its location is significant, because a constraint that does not hold will prevent only the statement it is adjacent to from being inferred.
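
As a sketch of the difference (the ex: prefix and property names are invented for illustration), in the following hypothetical rule the head constraint suppresses only the consequence it is attached to, while the other consequence is always inferred:

Id: head_constraint_illustration
    x <ex:relatedTo> y
    -------------------------------
    x <ex:connected> y
    x <ex:distinctFrom> y [Constraint x != y]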

Entailment rules

The syntax of a rule definition is as follows:


Id: <rule_name>
    <premises> <optional_constraints>
    -------------------------------
    <consequences> <optional_constraints>

where each premise and consequence is on a separate line.

The following example helps to illustrate the possibilities:

Rules
{
Id: rdf1_rdfs4a_4b
    x a y
    -------------------------------
    x <rdf:type> <rdfs:Resource>
    a <rdf:type> <rdfs:Resource>
    y <rdf:type> <rdfs:Resource>

Id: rdfs2
    x a y [Constraint a != <rdf:type>]
    a <rdfs:domain> z [Constraint z != <rdfs:Resource>]
    -------------------------------
    x <rdf:type> z

Id: owl_FunctProp
    p <rdf:type> <owl:FunctionalProperty>
    x p y [Constraint y != z, p != <rdf:type>]
    x p z [Constraint z != y] [Cut]
    -------------------------------
    y <owl:sameAs> z
}

The symbols p, x, y, z and a are variables. The second rule contains two constraints that reduce the number of bindings for each premise, i.e., they 'filter out' those statements where the constraint does not hold.

In a forward-chaining inference step, a rule is interpreted as meaning that for all possible ways of satisfying the premises, the bindings for the variables are used to populate the consequences of the rule. This generates new statements that will manifest themselves in the repository, e.g., by being returned as query results.

The last rule contains an example of using the Cut operator, which is an optimisation hint for the rule compiler. When rules are compiled, a different variant of the rule is created for each premise, so that each premise occurs as the first triple pattern in one of the variants. This is done so that incoming statements can be efficiently matched to the appropriate inference rules. However, when a rule contains two or more premises that match identical triple patterns, but using different variable names, the extra variant(s) are redundant and better efficiency can be achieved by simply not creating the extra rule variant(s).

In the above example, the rule owl_FunctProp would by default be compiled in three variants:

p <rdf:type> <owl:FunctionalProperty>
x p y
x p z
-------------------------------
y <owl:sameAs> z

x p y
p <rdf:type> <owl:FunctionalProperty>
x p z
-------------------------------
y <owl:sameAs> z

x p z
p <rdf:type> <owl:FunctionalProperty>
x p y
-------------------------------
y <owl:sameAs> z

Here, the last two variants are identical apart from the rotation of variables y and z, so one of these variants is not needed. The use of the Cut operator above tells the rule compiler to eliminate this last variant, i.e., the one beginning with the premise x p z.

The use of context in rule bodies and rule heads is also best explained by an example. The following three rules implement the OWL2-RL property chain rule prp-spo2 and are inspired by the Rule Interchange Format (RIF) implementation:

Id: prp-spo2_1
    p <owl:propertyChainAxiom> pc
    start pc last [Context <onto:_checkChain>]
    ----------------------------
    start p last

Id: prp-spo2_2
    pc <rdf:first> p
    pc <rdf:rest> t [Constraint t != <rdf:nil>]
    start p next
    next t last [Context <onto:_checkChain>]
    ----------------------------
    start pc last [Context <onto:_checkChain>]

Id: prp-spo2_3
    pc <rdf:first> p
    pc <rdf:rest> <rdf:nil>
    start p last
    ----------------------------
    start pc last [Context <onto:_checkChain>]

The RIF rules that implement prp-spo2 use a relation (unrelated to the input or generated triples) called _checkChain. The GraphDB implementation maps this relation to the 'invisible' context of the same name with the addition of [Context <onto:_checkChain>] to certain statement patterns. Generated statements with this context can only be used for bindings to rule premises when the exact same context is specified in the rule premise. The generated statements with this context will not be used for any other rules.

Consistency checks

Consistency checks are used to ensure that the data model is in a consistent state and are applied whenever an update transaction is committed. GraphDB supports consistency violation checks using standard OWL2 RL semantics. You can define rule-sets that contain consistency rules. When creating a new repository, set the parameter check-for-inconsistencies to true. It is false by default (for compatibility with previous OWLIM releases).
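
As a sketch, the parameter would appear among the OWLIM sail parameters of the repository configuration template (Turtle); the owlim prefix binding to http://www.ontotext.com/trree/owlim# is an assumption based on the standard configuration templates, and the rule-set value is illustrative:

# Fragment of a repository configuration (illustrative);
# enables consistency checking against the chosen rule-set.
owlim:ruleset                   "owl2-rl" ;
owlim:check-for-inconsistencies "true" ;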

The syntax is similar to that of rules, except that Consistency replaces the Id tag that introduces normal rules. Also, consistency checks do not have any consequences and indicate an inconsistency whenever their premises can be satisfied, e.g.:

Consistency: something_can_not_be_nothing
    x rdf:type owl:Nothing
    -------------------------------

Consistency: both_sameAs_and_differentFrom_is_forbidden
    x owl:sameAs y
    x owl:differentFrom y
    -------------------------------


Consistency check features

• Materialisation and consistency mix: the rule-sets support the definition of a mixture of materialisation and consistency rules. This follows the existing naming syntax Id: and Consistency:

• Multiple named rule-sets: GraphDB supports multiple named rule-sets.

• No downtime deployment: new or updated rule-sets can be deployed to a running instance.

• Update transaction rule-set: each update transaction can specify which named rule-set to apply. This is done by using 'special' RDF statements within the update transaction.

• Consistency violation exceptions: if a consistency rule is violated, GraphDB throws an exception. The exception includes details such as which rule has been violated and by which RDF statements.

• Consistency rollback: if a consistency rule is violated within an update transaction, the transaction will be rolled back and no statements will be committed.

If any consistency check fails when a transaction is committed and consistency checking is switched on (by default, it is off), then:

• A message is logged with details of what consistency checks failed;

• An exception is thrown with the same details;

• The whole transaction is rolled back.

5.5.1.4 Rule-sets

GraphDB offers several predefined semantics by way of standard rule-sets (files), but can also be configured to use custom rule-sets with semantics better tuned to the particular domain. The required semantics can be specified through the rule-set for each specific repository instance. Applications that do not need the complexity of the most expressive supported semantics can choose a less complex one, which will result in faster inference.

Note: Each rule-set defines both rules and some schema statements, otherwise known as axiomatic triples. These (read-only) triples are inserted into the repository at initialisation time and count towards the total number of reported 'explicit' triples. The variation may be up to the order of hundreds, depending on the rule-set.

Predefined rule-sets

The pre-defined rule-sets provided with GraphDB cover various well-known knowledge representation formalisms and are layered in such a way that each one extends the preceding one.

• empty - No reasoning, i.e., GraphDB operates as a plain RDF store.

• rdfs - Supports the standard model-theoretic RDFS semantics.

• owl-horst - OWL dialect close to OWL Horst - essentially pD*.

• owl-max - RDFS and that part of OWL Lite that can be captured in rules (deriving functional and inverse functional properties, all-different, subclass by union/enumeration; min/max cardinality constraints, etc.).

• owl2-ql - The OWL2 QL profile - a fragment of OWL2 Full designed so that sound and complete query answering is LOGSPACE with respect to the size of the data. This OWL2 profile is based on DL-LiteR, a variant of DL-Lite that does not require the unique name assumption.

• owl2-rl - The OWL2 RL profile - an expressive fragment of OWL2 Full that is amenable for implementation on rule engines.


Note: None of the rule-sets supports data-type reasoning, which is the main reason why OWL-Horst is not the same as pD*. The rule-set to be used for a specific repository is defined through the ruleset parameter. There are optimised versions of all rule-sets that avoid some little-used inferences.

OWL2 QL non-conformance

The implementation of OWL2 QL is non-conformant with the W3C OWL2 profiles recommendation in the following ways:

• Conformant behaviour: Given a list of disjoint (data or object) properties and an entity that is related with these properties to objects {a, b, c, d, ...}, infer an owl:AllDifferent restriction on an anonymous list of these objects.
  Implemented behaviour: For each pair {p, q} (p != q) of disjoint (data or object) properties, infer the triple p owl:propertyDisjointWith q, which is more likely to be useful for query answering.

• Conformant behaviour: For each class C in the knowledge base, infer the existence of an anonymous class that is the union of a list of classes containing only C.
  Implemented behaviour: Not supported. Even if this infinite expansion were possible in a forward-chaining rule-based implementation, the resulting statements would be of no use during query evaluation.

• Conformant behaviour: If a is an instance of C1, b is an instance of C2, and C1 and C2 are disjoint, infer a owl:differentFrom b.
  Implemented behaviour: Impractical for knowledge bases with many members of pairs of disjoint classes, e.g., WordNet. Instead, this is implemented as a consistency check: if x is an instance of both C1 and C2, and C1 and C2 are disjoint, then the repository is inconsistent.

Custom rule-sets

GraphDB has an internal rule compiler that can be configured with a custom set of inference rules and axioms. You may define a custom rule-set in a .pie file (e.g., MySemantics.pie). The easiest way to create a custom rule-set is to start by modifying one of the .pie files that were used to build the precompiled rule-sets.

Note: All pre-defined .pie files are included in the GraphDB distribution.

If the code generation or compilation cannot be completed successfully, a Java exception is thrown indicating the problem. It will state either the Id of the rule or the complete line from the source file where the problem is located. Line information is not preserved during the parsing of the rule file.

You must specify the custom rule-set via the ruleset configuration parameter. The value of the ruleset parameter is interpreted as a filename, and .pie is appended when not present. This file is processed to create Java source code that is compiled using the compiler from the Java Development Kit (JDK). The compiler is invoked using the mechanism provided by JDK version 1.6 (or later).

Therefore, a prerequisite for using custom rule-sets is that you run the application with the Java Virtual Machine (JVM) from a JDK version 1.6 (or later). If all goes well, the class is loaded dynamically and instantiated for further use by GraphDB during inference. The intermediate files are created in the folder pointed to by the java.io.tmpdir system property. The JVM should have sufficient rights to read and write to this directory.

Note: Changing the rule-set of an already existing repository is more difficult. It is necessary to export/backup all explicit statements and create a new repository with the required rule-set. Once created, the explicit statements exported from the old repository can be imported into the new one.


5.5.1.5 Inference

Reasoner

The GraphDB reasoner requires the .pie file of each rule-set to be compiled in order to be instantiated. The process includes several steps:

1. Generate Java code from the .pie file contents using the built-in GraphDB rule compiler.

2. Compile the Java code (this requires a JDK instead of a JRE, so that the Java compiler is available through the standard Java instrumentation infrastructure).

3. Instantiate the compiled code using a custom byte-code class loader.

Note: In versions prior to GraphDB 6, the reasoner was instantiated when the repository was initialised. The process was slow and took some time, depending on the rule-set used and the rules included. From GraphDB 6 on, the reasoner can be dynamically extended with new rule-sets.

Rule-sets execution

• For each rule and each premise (triple pattern in the rule body), a rule variant is generated. We call this the 'leading premise' of the variant. If a premise has the Cut annotation, no variant is generated for it.

• Every incoming triple (inserted or inferred) is checked against the leading premise of every rule variant. Since rules are compiled to Java bytecode on startup, this checking is very fast.

• If the leading premise matches, the rest of the premises are checked. This checking needs to access the repository, so it can be much slower.

– GraphDB first checks premises with the least number of unbound variables.

– For premises that have the same number of unbound variables, GraphDB follows the textual order in the rule.

• If all premises match, the conclusions of the rule are inferred.

• For each inferred statement:

– If it does not exist in the default graph, it is stored in the repository and is queued for inference.

– If it exists in the default graph, no duplicate statement is recorded. However, its 'inferred' flag is still set (see How to manage explicit and implicit statements).

Retraction of assertions

GraphDB stores explicit and implicit statements, i.e., the statements inferred (materialised) from the explicit statements. So, when explicit statements are removed from the repository, any implicit statements that rely on the removed statements must also be removed.

In previous versions of GraphDB, this was achieved by re-computing the full closure (minimal model), i.e., applying the entailment rules to all explicit statements and computing the inferences. This approach guarantees correctness, but does not scale - the computation is increasingly slow and computationally expensive in proportion to the number of explicit statements and the complexity of the entailment rule-set.

Removal of explicit statements is now achieved in a more efficient manner, by invalidating only the inferred statements that can no longer be derived in any way.

One approach is to maintain track information for every statement - typically the list of statements that can be inferred from this statement. The list is built up during inference as the rules are applied and the statements inferred by the rules are added to the lists of all statements that triggered the inferences. The drawback of this technique is that track information inflates more rapidly than the inferred closure - in the case of large datasets, up to 90% of the storage is required just to store the track information.

Another approach is to perform backward-chaining. Backward-chaining does not require track information, since it essentially re-computes the tracks as required. Instead, a flag for each statement is used, so that the algorithm can detect when a statement has been previously visited and thus avoid an infinite recursion.

The algorithm used in GraphDB works as follows:

1. Apply a ‘visited’ flag to all statements (false by default).

2. Store the statements to be deleted in the list L.

3. For each statement in L that is not yet visited, mark it as visited and apply the forward-chaining rules. Statements marked as visited become invisible, which is why the statement must first be marked and then used for forward-chaining.

4. If there are no more unvisited statements in L, then END.

5. Store all inferred statements in the list L1.

6. For each element in L1 check the following:

• If the statement is purely implicit (a statement can be both explicit and implicit; if so, it is not considered purely implicit), mark it as deleted (prevent it from being returned by the iterators) and check whether it is supported by other statements. The isSupported() method uses queries that contain the premises of the rules, with the variables of the rules preliminarily bound using the statement in question. That is to say, the isSupported() method starts from the projection of the query and then checks whether the query will return results (at least one), i.e., this method performs backward-chaining.

• If a result is returned by any query (every rule is represented by a query) in isSupported(), then this statement can still be derived from other statements in the repository, so it must not be deleted (its status is returned to 'inferred').

• If all queries return no results, then this statement can no longer be derived from any other statements, so its status remains 'deleted' and the statement counter is updated.

7. L := L1 and GOTO 3.

Special care is taken when retracting owl:sameAs statements, so that the algorithm still works correctly when modifying equivalence classes.

Note: One consequence of this algorithm is that deletion can still have poor performance when deleting schema statements, due to the (probably) large number of implicit statements inferred from them.

Note: The forward-chaining part of the algorithm terminates as soon as it detects that a statement is read-only, because if it cannot be deleted, there is no need to look for statements derived from it. For this reason, performance can be greatly improved when all schema statements are made read-only by importing ontologies (and OWL/RDFS vocabularies) using the imports repository parameter.
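
A sketch of how an ontology might be declared via the configuration so that its statements are loaded as read-only schema (the file path is illustrative; the owlim prefix binding is assumed as in the standard configuration templates):

# Load the ontology at initialisation time; its statements become read-only schema.
owlim:imports "/opt/graphdb/ontologies/my-schema.owl" ;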

Schema update transactions

When fast statement retraction is required, but it is also necessary to update schemas, you can use a special statement pattern. Include an insert of a statement with the following form in the update:

[] <http://www.ontotext.com/owlim/system#schemaTransaction> []

GraphDB will use the smooth-delete algorithm, but will also traverse read-only statements and allow them to be deleted/inserted. Such transactions are likely to be much more computationally expensive, but are intended for the occasional, offline update to otherwise read-only schemas. The advantage is that fast-delete can still be used, and no repository export and import is required when making a modification to a schema.

For any transaction that includes an insert of the above special predicate/statement:

• Read-only (explicit or inferred) statements can be deleted;

• New explicit statements are marked as read-only;

• New inferred statements are marked:

– Read-only if all the premises that fired the rule are read-only;

– Normal otherwise.

Schema statements can be inserted or deleted using SPARQL UPDATE as follows:

DELETE {
    [[schema statements to delete]]
}
INSERT {
    [] <http://www.ontotext.com/owlim/system#schemaTransaction> [] .
    [[schema statements to insert]]
}
WHERE { }

5.5.1.6 How-tos

Operations on rule-sets

All examples below use the sys: namespace, defined as:

PREFIX sys: <http://www.ontotext.com/owlim/system#>

Add a custom rule-set from .pie file

The predicate sys:addRuleset adds a custom rule-set from the specified .pie file. The rule-set is named after the filename, without the .pie extension.

Example 1: This creates a new rule-set 'test'. If the file resides at, for example, /opt/rules/test.pie, the absolute path can be specified as <file:/opt/rules/test.pie>, <file://opt/rules/test.pie>, or <file:///opt/rules/test.pie>, i.e., with 1, 2, or 3 slashes. Relative paths are specified without the slashes or with a dot between the slashes: <file:opt/rules/test.pie>, <file:/./opt/rules/test.pie>, <file://./opt/rules/test.pie>, or even <file:./opt/rules/test.pie> (with a dot in front of the path). Relative paths can be used if you know the working directory of the Java process in which GraphDB runs.

INSERT DATA {
    _:b sys:addRuleset <file:c:/graphdb/test-data/test.pie>
}

Example 2: Same as above, but creates a rule-set called 'custom' from the test.pie file found at the given absolute path.

INSERT DATA {
    <:custom> sys:addRuleset <file:c:/graphdb/test-data/test.pie>
}

Example 3: Retrieves the .pie file from the given URL. Again, you can use <:custom> to change the name of the rule-set to "custom" or as necessary.


INSERT DATA {
    _:b sys:addRuleset <http://example.com/test-data/test.pie>
}

Add a built-in rule-set

The predicate sys:addRuleset adds a built-in rule-set (one of the rule-sets that GraphDB supports natively).

Example: This adds the "owl-max" rule-set to the list of rule-sets in the repository.

INSERT DATA {
    _:b sys:addRuleset "owl-max"
}

Add a custom rule-set with SPARQL INSERT

The predicate sys:addRuleset can also add a custom rule-set whose contents are given inline as a string literal. The rule-set is named after the subject URI.

Example This creates a new rule-set "custom".

INSERT DATA {
    <:custom> sys:addRuleset
        '''Prefices { a : http://a/ }
           Axioms {}
           Rules
           {
           Id: custom
               a b c
               a <a:custom1> c
               -----------------------
               b <a:custom1> a
           }'''
}

List all rule-sets

The predicate sys:listRulesets lists all rule-sets available in the repository.

Example

SELECT ?state ?ruleset {
    ?state sys:listRulesets ?ruleset
}

Explore a rule-set

The predicate sys:exploreRuleset explores a rule-set.

Example

SELECT * {
    ?content sys:exploreRuleset "test"
}


Set a default rule-set

The predicate sys:defaultRuleset switches the default rule-set to the one specified in the object literal.

Example: This sets the default rule-set to "test". All transactions use this rule-set, unless they specify another rule-set as a first operation in the transaction.

INSERT DATA {
    _:b sys:defaultRuleset "test"
}

Rename a rule-set

The predicate sys:renameRuleset renames the rule-set from "custom" to "test". Note that "custom" is specified as the subject URI in the default namespace.

Example This renames the rule-set “custom” to “test”.

INSERT DATA {
    <:custom> sys:renameRuleset "test"
}

Delete a rule-set

The predicate sys:removeRuleset deletes the rule-set "test", specified in the object literal.

Example

INSERT DATA {
    _:b sys:removeRuleset "test"
}

Consistency check

The predicate sys:consistencyCheckAgainstRuleset checks whether the repository is consistent with the specified rule-set.

Example

INSERT DATA {
    _:b sys:consistencyCheckAgainstRuleset "test"
}

Reinferring

Statements are inferred only when you insert new statements. So, if you reconnect to a repository with a different rule-set, the change does not take effect immediately. However, you can force reinference with an Update statement such as:

INSERT DATA { [] <http://www.ontotext.com/owlim/system#reinfer> [] }

This removes all inferred statements and reinfers from scratch. If a statement is both explicitly inserted and inferred, it is not removed.

Tip: To learn more, see How to manage explicit and implicit statements.


5.5.2 Storage

5.5.2.1 What is GraphDB’s persistence strategy

GraphDB stores all of its data (statements, indexes, entity pool, etc.) in files in the configured storage directory, usually called storage. The content and names of these files are not defined and are subject to change between versions.

There are several types of indices available, all of which apply to all triples, whether explicit or implicit. Theseindices are maintained automatically.

In general, the index structures used in GraphDB are chosen and optimised to allow for efficient:

• handling of billions of statements under reasonable RAM constraints;

• query optimisation;

• transaction management.

GraphDB maintains two main indices on statements for use in inference and query evaluation: the predicate-object-subject (POS) index and the predicate-subject-object (PSO) index. There are many other additional data structures that are used to enable the efficient manipulation of RDF data, but these are not listed here, since these internal mechanisms cannot be configured.

5.5.2.2 GraphDB’s indexing options

There are indexing options that offer considerable advantages for specific datasets, retrieval patterns and query loads. Most of them are disabled by default, so you need to enable them as necessary.

Note: Unless stated otherwise, GraphDB allows you to switch indices on and off against an already populated repository. The repository has to be shut down before the configuration change is specified. The next time the repository is started, GraphDB will create or remove the corresponding index. If the repository is already loaded with a large volume of data, switching on a new index can lead to considerable delays during initialisation - this is the time required for building the new index.

Transaction mode

There are two transaction mechanisms in GraphDB. The default safe mode causes all updates to be flushed to disk as part of the commit operation. The ordering of updated pages in the index files and the sequence used to write them to the file-system mean that they are consistent with the state of the database prior to the update in the event of an abnormal termination. In other words, rollback is natively supported should the application crash, and recovery after such an event is instant. Also, the method for updating data structures (copy of page index and copy-on-write of pages) means that a high level of concurrency is supported between updates and queries.

In bulk-loading fast mode, updated pages are not automatically flushed to disk and remain in memory until the cache is exhausted and further pages are required. Only then are the least recently used dirty pages swapped to disk. This can be significantly faster than safe mode when updating using a single thread, but there are no guarantees for data security in this mode. If a crash occurs, data will be lost. The intention of this mode is to speed up regular bulk-loading in situations where query loads are negligible or non-existent. Query and update concurrency in this mode is not as sophisticated as in safe mode.

Warning: In fast mode, it is VERY IMPORTANT to shut down the repository connections properly in order to ensure that unwritten data is flushed to the file-system. If, for any reason, the database is not shut down properly, GraphDB assumes that data corruption has occurred and will refuse to start with the same disk image.


The transaction mode is set using the transaction-mode configuration parameter (see Configuring a Repository). Changing modes requires a restart of GraphDB.

In fast transaction mode, the isolation constraint can be relaxed in order to improve concurrency behaviour when strict read isolation is not a requirement. This is controlled by the transaction-isolation parameter, which only has an effect in fast mode (see Configuring a Repository).
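
A sketch of the corresponding configuration lines (the values are illustrative; the owlim prefix binding is assumed as in the standard configuration templates):

# Use bulk-loading fast mode and relax read isolation for better concurrency.
owlim:transaction-mode      "fast" ;
owlim:transaction-isolation "false" ;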

Transaction control

Transaction support is exposed via Sesame's RepositoryConnection interface. The three methods of this interface that give you control over when updates are committed to the repository are as follows:

• void begin() - Begins a transaction. Subsequent changes effected through update operations will only become permanent after commit() is called.

• void commit() - Commits all updates that have been performed through this connection since the last call to begin().

• void rollback() - Rolls back all updates that have been performed through this connection since the last call to begin().
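
As a minimal sketch of these methods in use (the repository object, the example URIs and the literal are assumptions, and error handling is reduced to a bare rollback):

import org.openrdf.model.URI;
import org.openrdf.model.ValueFactory;
import org.openrdf.repository.Repository;
import org.openrdf.repository.RepositoryConnection;
import org.openrdf.repository.RepositoryException;

public class TransactionExample {

    // Adds one statement inside an explicit transaction.
    // 'repository' is assumed to be an already initialised GraphDB repository.
    public static void addWithTransaction(Repository repository) throws RepositoryException {
        RepositoryConnection connection = repository.getConnection();
        try {
            connection.begin();                // updates stay pending until commit()
            ValueFactory vf = connection.getValueFactory();
            URI subj = vf.createURI("http://example.com#book1");
            URI pred = vf.createURI("http://purl.org/dc/elements/1.1/title");
            connection.add(subj, pred, vf.createLiteral("A Good Book"));
            connection.commit();               // materialisation happens as part of the commit
        } catch (RepositoryException e) {
            connection.rollback();             // discard everything since begin()
            throw e;
        } finally {
            connection.close();
        }
    }
}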

GraphDB supports the so-called 'read committed' transaction isolation level, well-known from relational database management systems - i.e., pending updates are not visible to other connected users until the complete update transaction has been committed. This guarantees that changes will not impact query evaluation before the entire transaction they are part of has been successfully committed. It does not guarantee that the execution of a single transaction is performed against a single state of the data in the repository. Regarding concurrency:

• Multiple update/modification/write transactions can be initiated and stay open simultaneously, i.e., one transaction does not need to be committed in order to allow another transaction to complete;

• Update transactions are processed internally in sequence, i.e., GraphDB processes the commits one after another;

• Update transactions do not block read requests in any way, i.e., hundreds of SPARQL queries can be evaluated in parallel (the processing is properly multi-threaded) while update transactions are being handled on separate threads.

Note: GraphDB performs materialisation, ensuring that all statements that can be inferred from the current state of the repository are indexed and persisted (except for those compressed due to the sameAs optimisation). When the commit method completes, all reasoning activities related to the changes in the data introduced by the corresponding transaction will have already been performed.

Note: An uncommitted transaction will not affect the 'view' of the repository through any connection, including the connection used to do the modification. This is perhaps not in keeping with most relational database implementations. However, committing a modification to a semantic repository involves considerably more work, specifically the computation of the changes to the inferred closure resulting from the addition or removal of explicit statements. This computation is only carried out at the point where the transaction is committed, so to be consistent, neither the inferred statements nor the modified statements related to the transaction are 'visible'.

Predicate lists

Certain datasets and certain kinds of query activities, for example, queries that use wildcard patterns for predicates, benefit from another type of index called a 'predicate list', i.e.:

• subject-predicate (SP)

• object-predicate (OP)


This index maps from entities (subject or object) to their predicates. It is not switched on by default (see enablePredicateList in Configuring a Repository), because it is not always necessary. Indeed, for most datasets and query loads, the performance of GraphDB without such an index is good enough even with wildcard-predicate queries, and the overhead of maintaining this index is not justified. You should consider using this index for datasets that contain a very large number (greater than around 1,000) of different predicates.

Context indices

There are two more optional indices that can be used to speed up query evaluation when searching statements via their context identifier. These indices are the PCSO and PCOS indices, and they are switched on together (see the enable-context-index parameter in Configuring a Repository).

Index compression

The pages containing index data structures can be written to disk with ZIP compression. This adds a small overhead to the performance of read/write operations, but can save a significant amount of disk-storage space. This is particularly significant for large databases that use expensive SSD storage devices.

Index compression is controlled using a single configuration parameter called index-compression-ratio, whose default value is -1, indicating no compression.

To create a repository that uses ZIP compression, set this parameter to a value between 10 and 50 percent (inclusive). Once created, this compression ratio cannot be changed.

Note: The value of this parameter indicates the attempted compression ratio for pages - the smaller the value, the more compression is attempted. Pages that cannot be compressed below the requested size are stored uncompressed. Therefore, setting this value too low will not save any disk space and will simply add to the processing overhead. Typically, a value of 30% gives good performance with significant disk-space reduction, i.e., around 70% less disk space used for each index. The total disk space requirements are typically reduced by around half when using index compression at 30%.
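
Putting the indexing options above together, a sketch of the corresponding configuration lines might read (values illustrative; the owlim prefix binding is assumed as in the standard configuration templates):

# Optional index settings: predicate lists, context indices,
# and page compression at the 30% ratio discussed above.
owlim:enablePredicateList     "true" ;
owlim:enable-context-index    "true" ;
owlim:index-compression-ratio "30" ;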

Literal index

GraphDB automatically builds a literal index allowing faster look-ups of numeric and date/time object values. The index is used during query evaluation only if a query or a subquery (e.g., union) has a filter that is comprised of a conjunction of literal constraints using comparisons and equality (not negation or inequality), e.g., FILTER(?x = 100 && ?y <= 5 && ?start > "2001-01-01"^^xsd:date).

Other patterns will not use the index, i.e., filters will not be re-written into usable patterns.

For example, the following FILTER patterns will all make use of the literal index:

FILTER( ?x = 7 )
FILTER( 3 < ?x )
FILTER( ?x >= 3 && ?y <= 5 )
FILTER( ?x > "2001-01-01"^^xsd:date )

whereas these FILTER patterns will not:

FILTER( ?x > (1 + 2) )
FILTER( ?x < 3 || ?x > 5 )
FILTER( (?x + 1) < 7 )
FILTER( ! (?x < 3) )

The decision of the query optimiser whether to make use of this index is statistics-based. If the estimated number of matches for a filter constraint is large relative to the rest of the query, e.g., a constraint with a large or one-sided range, then the index might not be used at all.


To disable this index during query evaluation, use the configuration parameter enable-literal-index. The default value is true.

Note: Because of the way the literals are stored, the index will not work properly with dates far in the future or far in the past (approximately 200,000,000 years), or with numbers beyond the range of the 64-bit floating-point representation (i.e., above approximately 1e309 and below -1e309).

Handling of explicit and implicit statements

As already described, GraphDB applies the inference rules at load time in order to compute the full closure. Therefore, a repository will contain some statements that are explicitly asserted and other statements that exist through implication. In most cases, clients will not be concerned with the difference; however, there are some scenarios when it is useful to work with only explicit or only implicit statements. These two groups of statements can be isolated during programmatic statement retrieval using the Sesame API and during (SPARQL) query evaluation.

Retrieving statements with the Sesame API

The usual technique for retrieving statements is to use the RepositoryConnection method:

RepositoryResult<Statement> getStatements(Resource subj, URI pred, Value obj,
                                          boolean includeInferred,
                                          Resource... contexts)

The method retrieves statements by 'triple pattern', where any or all of the subject, predicate and object parameters can be null to indicate wildcards.

To retrieve both explicit and implicit statements, the includeInferred parameter must be set to true. To retrieve only explicit statements, the includeInferred parameter must be set to false.
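
For instance, the following fragment (assuming an open repositoryConnection) retrieves only the explicitly asserted statements:

RepositoryResult<Statement> explicitOnly =
    repositoryConnection.getStatements(null, null, null, false);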

However, the Sesame API does not provide the means to enable only the retrieval of implicit statements. In order to allow clients to do this, GraphDB allows the use of the special 'implicit' pseudo-graph with this API, which can be passed as the context parameter.

The following example shows how to retrieve only implicit statements:

RepositoryResult<Statement> statements =
    repositoryConnection.getStatements(null, null, null, true,
        new URIImpl("http://www.ontotext.com/implicit"));

while (statements.hasNext()) {
    Statement statement = statements.next();
    // Process statement
}
statements.close();

The above example uses wildcards for subject, predicate and object and will therefore return all implicit statements in the repository.

SPARQL query evaluation

GraphDB also provides mechanisms to differentiate between explicit and implicit statements during query evaluation. This is achieved by associating statements with two pseudo-graphs (explicit and implicit) and using special system URIs to identify these graphs.
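
For example, a query such as the following sketch restricts matching to inferred statements only, by naming the 'implicit' pseudo-graph in the dataset (the same system URI used with the Sesame API above); see Query Behaviour for the authoritative list of special graph URIs:

SELECT ?s ?p ?o
FROM <http://www.ontotext.com/implicit>
WHERE { ?s ?p ?o }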


Tip: To learn more, see Query Behaviour.

5.5.3 Full-text Search

Hint: Apache Lucene is a high-performance, full-featured text search engine written entirely in Java. GraphDB supports FTS capabilities using Lucene, with a variety of indexing options and the ability to simultaneously use multiple, differently configured indices in the same query.

Full-text search (FTS) concerns retrieving text documents out of a large collection by keywords or, more generally, by tokens (represented as sequences of characters). Formally, the query represents an unordered set of tokens and the result is a set of documents relevant to the query. In a simple FTS implementation, relevance is Boolean: a document is either relevant to the query, if it contains all the query tokens, or not. More advanced FTS implementations deal with a degree of relevance of the document to the query, usually judged on some measure of the frequency of appearance of each of the tokens in the document, normalised against the frequency of their appearance in the entire document collection. Such implementations return an ordered list of documents, where the most relevant documents come first.

FTS and structured queries, like those in database management systems (DBMS), are different information access methods, based on different query syntax and semantics, where the results are also displayed in a different form. FTS systems and databases usually require different types of indices, too. The ability to combine these two types of information access methods is very useful for a wide range of applications. Many relational DBMS support some sort of FTS (integrated in the SQL syntax) and maintain additional indices that allow efficient evaluation of FTS constraints.

Typically, a relational DBMS allows you to define a query which requires specific tokens to appear in a specific column of a specific table. In SPARQL, there is no standard way to specify FTS constraints. In general, there is neither a well-defined nor a commonly accepted concept for FTS in RDF data. Nevertheless, some semantic repository vendors offer some sort of FTS in their engines.

5.5.3.1 RDF search

The GraphDB FTS implementation, called 'RDF Search', is based on Lucene. It enables GraphDB to perform complex queries against character data, which significantly speeds up the query process. RDF Search allows for efficient extraction of RDF resources from huge datasets, where ordering of the results by relevance is crucial.

Its main features are:

• FTS query form - List of tokens (with Lucene query extensions);

• Result form - Ordered list of URIs;

• Textual representation - concatenation of the text representations of nodes from the so-called 'molecule' (1-step neighbourhood in a graph) of the URI;

• Relevance - Vector-space model, reflecting the degree of relevance of the text and the RDF rank of the URI;

• Implementation - The Lucene engine is integrated and used for indexing and search.

Usage

In order to use the FTS in GraphDB, a Lucene index must first be computed. Before it is created, each index can be parametrised in a number of ways, using SPARQL 'control' updates.

This provides the ability to:

• select what kinds of nodes are indexed (URIs/literals/blank-nodes);

• select what is included in the ‘molecule’ associated with each node;


• select literals with certain language tags;

• choose the size of the RDF ‘molecule’ to index;

• choose whether to boost the relevance of the nodes using RDF Rank values;

• select alternative analysers;

• select alternative scorers.

In order to use the indexing behaviour of Lucene, a text document must be created for each node in the RDF graph to be indexed. This text document is called an 'RDF molecule' and is made up of other nodes reachable via the predicates that connect the nodes to each other. Once a molecule has been created for each node, Lucene generates an index over these molecules. During search (query answering), Lucene identifies the matching molecules and GraphDB uses the associated nodes as variable substitutions when evaluating the enclosing SPARQL query.

The scope of an RDF molecule includes the starting node and its neighbouring nodes, which are reachable via the specified number of predicate arcs. For each Lucene index, it can be specified what type of nodes are indexed and what type of nodes are included in the molecule. Furthermore, the size of the molecule can be controlled by specifying the number of allowed traversals of predicate arcs, starting from the molecule centre (the node being indexed).

Note: Blank nodes are never included in the molecule. If a blank node is encountered, the search is extended via any predicate to the next nearest entity and so on. Therefore, even when the molecule size is 1, entities reachable via several intermediate predicates can still be included in the molecule if all the intermediate entities are blank nodes.

Parameters

Exclude

Predicate: http://www.ontotext.com/owlim/lucene#exclude
Default: <none>
Description: Provides a regular expression to identify nodes which will be excluded from the molecule. Note that for literals and URI local names the regular expression is case-sensitive. The example given below will cause matching URIs (e.g., <http://example.com/uri#helloWorld>) and literals (e.g., "hello world!") not to be included.


Example:

PREFIX luc: <http://www.ontotext.com/owlim/lucene#>
INSERT DATA {
    luc:exclude luc:setParam "hello.*"
}

Exclude entities

Predicate: http://www.ontotext.com/owlim/lucene#excludeEntities
Default: <none>
Description: A comma/semicolon/whitespace-separated list of entities that will NOT be included in an RDF molecule. The example below includes any URI in a molecule, except the two listed.
Example:

PREFIX luc: <http://www.ontotext.com/owlim/lucene#>
INSERT DATA {
    luc:excludeEntities luc:setParam
        "http://www.w3.org/2000/01/rdf-schema#Class http://www.example.com/dummy#E1"
}

Exclude Predicates

Predicate: http://www.ontotext.com/owlim/lucene#excludePredicates
Default: <none>
Description: A comma/semicolon/whitespace-separated list of properties that will NOT be traversed in order to build an RDF molecule. The example below prevents any entities being added to an RDF molecule, if they can only be reached via the two given properties.
Example:

PREFIX luc: <http://www.ontotext.com/owlim/lucene#>
INSERT DATA {
    luc:excludePredicates luc:setParam
        "http://www.w3.org/2000/01/rdf-schema#subClassOf http://www.example.com/dummy#p1"
}

Include

Predicate: http://www.ontotext.com/owlim/lucene#include
Default: "literals"
Description: Indicates what kinds of nodes are to be included in the molecule. The value can be a list of values from: URI, literal, centre (the plural forms are also allowed: URIs, literals, centres). The value centre causes the node for which the molecule is built to be added to the molecule (provided it is not a blank node). This can be useful, for example, when indexing URI nodes with molecules that contain only literals, but the local part of the URI should also be searchable.
Example:


PREFIX luc: <http://www.ontotext.com/owlim/lucene#>
INSERT DATA {
    luc:include luc:setParam "literal uri"
}

Include entities

Predicate: http://www.ontotext.com/owlim/lucene#includeEntities
Default: <none>
Description: A comma/semicolon/whitespace-separated list of entities that can be included in an RDF molecule. Any other entities are ignored. The example below builds molecules that only contain the two entities.
Example:

PREFIX luc: <http://www.ontotext.com/owlim/lucene#>
INSERT DATA {
    luc:includeEntities luc:setParam
        "http://www.w3.org/2000/01/rdf-schema#Class http://www.example.com/dummy#E1"
}

Include predicates

Predicate: http://www.ontotext.com/owlim/lucene#includePredicates
Default: <none>
Description: A comma/semicolon/whitespace-separated list of properties that can be traversed in order to build an RDF molecule. The example below allows any entities to be added to an RDF molecule, but only if they can be reached via the two given properties.
Example:

PREFIX luc: <http://www.ontotext.com/owlim/lucene#>
INSERT DATA {
    luc:includePredicates luc:setParam
        "http://www.w3.org/2000/01/rdf-schema#subClassOf http://www.example.com/dummy#p1"
}

Index

Predicate: http://www.ontotext.com/owlim/lucene#index
Default: "literals"
Description: Indicates what kinds of nodes are to be indexed. The value can be a list of values from: URI, literal, bnode (the plural forms are also allowed: URIs, literals, bnodes).
Example:

PREFIX luc: <http://www.ontotext.com/owlim/lucene#>
INSERT DATA {
    luc:index luc:setParam "literals, bnodes"
}


Languages

Predicate: http://www.ontotext.com/owlim/lucene#languages
Default: "" (which indicates that literals with any language tag are used, including those with no language tag)
Description: A comma-separated list of language tags. Only literals with the indicated language tags are included in the index. To include literals that have no language tag, use the special value none.
Example:

PREFIX luc: <http://www.ontotext.com/owlim/lucene#>
INSERT DATA {
    luc:languages luc:setParam "en,fr,none"
}

Molecule size

Predicate: http://www.ontotext.com/owlim/lucene#moleculeSize
Default: 0
Description: Sets the size of the molecule associated with each entity. A value of zero indicates that only the entity itself should be indexed. A value of 1 indicates that the molecule will contain all entities reachable by a single 'hop' via any predicate (predicates are not included in the molecule). Note that blank nodes are never included in the molecule. If a blank node is encountered, the search is extended via any predicate to the next nearest entity and so on. Therefore, even when the molecule size is 1, entities reachable via several intermediate predicates can still be included in the molecule, if all the intermediate entities are blank nodes. Molecule sizes of 2 and more are allowed, but with large datasets it can take a very long time to create the index.
Example:

PREFIX luc: <http://www.ontotext.com/owlim/lucene#>
INSERT DATA {
    luc:moleculeSize luc:setParam "1"
}

useRDFRank

Predicate: http://www.ontotext.com/owlim/lucene#useRDFRank
Default: "no"
Description: Indicates whether the RDF Rank weights (if they have already been computed) associated with each entity should be used as boosting factors when computing the relevance of a given Lucene query. Allowable values are no, yes and squared. The last value indicates that the square of the RDF Rank value is to be used.
Example:

PREFIX luc: <http://www.ontotext.com/owlim/lucene#>
INSERT DATA {
    luc:useRDFRank luc:setParam "yes"
}

analyser

Predicate: http://www.ontotext.com/owlim/lucene#analyzer
Default: <none>
Description: Sets an alternative analyser for processing text to produce terms to index. By default, this parameter has no value and the default analyser used is org.apache.lucene.analysis.standard.StandardAnalyzer. An alternative analyser must be derived from org.apache.lucene.analysis.Analyzer. To use an alternative analyser, use this parameter to identify the name of a Java factory class that can instantiate it. The factory class must be available on the Java virtual machine's classpath and must implement the interface com.ontotext.trree.plugin.lucene.AnalyzerFactory.
Example:

PREFIX luc: <http://www.ontotext.com/owlim/lucene#>
INSERT DATA {
    luc:analyzer luc:setParam "com.ex.MyAnalyserFactory"
}

Detailed example: In this example, we create two Java classes (an analyser and a factory) and then create a Lucene index using the custom analyser. This custom analyser filters out accents (diacritics), so that a search for "Beyonce" finds labels such as "Beyoncé".

// Imports assume Lucene 3.6, as implied by Version.LUCENE_36 below.
import java.io.Reader;

import org.apache.lucene.analysis.ASCIIFoldingFilter;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.StopwordAnalyzerBase;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

public class CustomAnalyzerFactory implements com.ontotext.trree.plugin.lucene.AnalyzerFactory {
    @Override
    public Analyzer createAnalyzer() {
        CustomAnalyzer ret = new CustomAnalyzer(Version.LUCENE_36);
        return ret;
    }

    @Override
    public boolean isCaseSensitive() {
        return false;
    }
}

public class CustomAnalyzer extends StopwordAnalyzerBase {
    public CustomAnalyzer(Version matchVersion) {
        super(matchVersion, StandardAnalyzer.STOP_WORDS_SET);
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        final Tokenizer source = new StandardTokenizer(matchVersion, reader);
        TokenStream tokenStream = source;
        tokenStream = new StandardFilter(matchVersion, tokenStream);
        tokenStream = new LowerCaseFilter(tokenStream);
        tokenStream = new StopFilter(matchVersion, tokenStream, getStopwordSet());
        // Strip diacritics, so that "Beyonce" matches "Beyoncé".
        tokenStream = new ASCIIFoldingFilter(tokenStream);
        return new TokenStreamComponents(source, tokenStream);
    }
}

Create the index:

1. Put the two classes in a .jar file, e.g., “com.example”

2. Put the .jar file in the plugins folder (specified by -Dexternal-plugins=..., which by default is under <TOMCAT-WEBAPPS>/openrdf-sesame/WEB-INF/plugins). There has to be some data in the repository.


3. Create the index.

PREFIX luc: <http://www.ontotext.com/owlim/lucene#>
INSERT DATA {
    luc:analyzer luc:setParam "com.example.CustomAnalyzerFactory" .
    luc:index luc:setParam "uris" .
    luc:moleculeSize luc:setParam "1" .
    luc:myIndex luc:createIndex "true" .
}

scorer

Predicate: http://www.ontotext.com/owlim/lucene#scorer
Default: <none>
Description: Sets an alternative scorer that provides boosting values, which adjust the relevance (and hence the ordering) of results to a Lucene query. By default, this parameter has no value and no additional scoring takes place; however, if the useRDFRank parameter is set to true, then the RDF Rank scores are used. An alternative scorer must implement the interface com.ontotext.trree.plugin.lucene.Scorer. In order to use an alternative scorer, use this parameter to identify the name of a Java factory class that can instantiate it. The factory class must be available on the Java virtual machine's classpath and must implement the interface com.ontotext.trree.plugin.lucene.ScorerFactory.
Example:

PREFIX luc: <http://www.ontotext.com/owlim/lucene#>
INSERT DATA {
    luc:scorer luc:setParam "com.ex.MxScorerFactory"
}

Creating an index

Once you have set the parameters for an index, you create and name the index by committing a SPARQL update of this form, where the index name appears as the subject in the triple pattern:

PREFIX luc: <http://www.ontotext.com/owlim/lucene#>
INSERT DATA { luc:myIndex luc:createIndex "true" . }

The index name must be in the http://www.ontotext.com/owlim/lucene# namespace and the local part can contain only alphanumeric characters and underscores.

Creating an index can take some time, although usually no more than a few minutes when the molecule size is 1 or less. During this process, for each node in the repository, its surrounding molecule is computed. Then, each such molecule is converted into a single string document (by concatenating the textual representation of all the nodes in the molecule) and this document is indexed by Lucene. If the RDF Rank weights are used (or an alternative scorer is specified), then the computed values are stored in the Lucene index as a boosting factor that will later on influence the selection order.

To use a custom Lucene index in a SPARQL query, use the index's name as the predicate in a statement pattern, with the Lucene query as the object, using the full Lucene query vocabulary.

The following query produces bindings for ?s from entities in the repository, where the RDF molecule associated with the entity (for the given index) contains terms that begin with "United". Furthermore, the bindings are ordered by relevance (with any boosting factor):

PREFIX luc: <http://www.ontotext.com/owlim/lucene#>
SELECT ?s
WHERE { ?s luc:myIndex "United*" . }


The Lucene score for a bound entity for a particular query can be exposed using a special predicate:

http://www.ontotext.com/owlim/lucene#score

This can be useful when the Lucene query results are ordered in a manner based on, but different from, the original Lucene score.

For example, the following query orders the results by a combination of the Lucene score and some ontology-defined importance value:

PREFIX luc: <http://www.ontotext.com/owlim/lucene#>
PREFIX ex: <http://www.example.com/myontology#>
SELECT * {
    ?node luc:myIndex "lucene query string" .
    ?node ex:importance ?importance .
    ?node luc:score ?score .
} ORDER BY ( ?score + ?importance )

The luc:score predicate works only on bound variables. There is no problem disambiguating multiple indices, because each variable is bound from exactly one Lucene index and hence its score.

The combination of ranking RDF molecules together with FTS provides a powerful mechanism for querying/analysing datasets, even when the schema is not known. This allows for keyword-based search over both literals and URIs, with the results ordered by importance/interconnectedness.

You can see an example of such ‘RDF Search’ in FactForge.

Detailed example

The following example configuration shows how to index URIs using literals attached to them by a single, named predicate, in this case rdfs:label.

1. Assume the following starting data:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX ex: <http://example.com#>
INSERT DATA {
  ex:astonMT rdfs:label "Aston McTalisker" .
  ex:astonMartin ex:link "Aston Martin" .
  <http://www1.aston.ac.uk/> rdfs:label "Aston University"@EN .
}

2. Set up the configuration - index URIs by including, in their RDF molecule, all literals that can be reached via a single statement using the rdfs:label predicate:

PREFIX luc: <http://www.ontotext.com/owlim/lucene#>
INSERT DATA {
  luc:index luc:setParam "uris" .
  luc:include luc:setParam "literals" .
  luc:moleculeSize luc:setParam "1" .
  luc:includePredicates luc:setParam "http://www.w3.org/2000/01/rdf-schema#label" .
}

3. Create a new index called luc:myTestIndex - note that the index name must be in the <http://www.ontotext.com/owlim/lucene#> namespace:

PREFIX luc: <http://www.ontotext.com/owlim/lucene#>
INSERT DATA {
  luc:myTestIndex luc:createIndex "true" .
}


4. Use the index in a query - find all URIs indexed using the luc:myTestIndex index that match the Lucene query "ast*":

PREFIX luc: <http://www.ontotext.com/owlim/lucene#>
SELECT * {
  ?id luc:myTestIndex "ast*"
}

The results of this query are:

?id
http://example.com#astonMT
http://www1.aston.ac.uk/

showing that ex:astonMartin is not returned, because it does not have an rdfs:label linking it to the appropriate text.

Incremental update

Each Lucene-based FTS index must be recreated from time to time as the indexed data changes. Due to the complex structure of RDF molecules, rebuilding an index is a relatively expensive operation. Still, indices can be updated incrementally, on a per-resource basis, as directed by the user.

The following control update:

PREFIX luc: <http://www.ontotext.com/owlim/lucene#>
INSERT DATA { <index-name> luc:addToIndex <resource> . }

updates the FTS index for the given resource and the given index.

Note: Each index stores the values of the parameters used to define it, e.g., the value of luc:includePredicates; therefore, there is no need to set them before requesting an incremental update.

The following control update:

PREFIX luc: <http://www.ontotext.com/owlim/lucene#>
INSERT DATA { <index-name> luc:updateIndex _:b1 . }

causes all resources not currently indexed by <index-name> to be indexed. It is a shorthand way of batching together index updates for several (new) resources.

5.5.4 Plugins

5.5.4.1 Plugin API

What is the GraphDB Plugin API

The GraphDB Plugin API is a framework and a set of public classes and interfaces that allow developers to extend GraphDB in many useful ways. These extensions are bundled into plugins, which GraphDB discovers during its initialisation phase and then uses to delegate parts of its query processing tasks. The plugins are given low-level access to the GraphDB repository data, which enables them to do their job efficiently. They are discovered via the Java service discovery mechanism, which enables dynamic addition/removal of plugins from the system without having to recompile GraphDB or change any configuration files.


Description of a GraphDB plugin

A GraphDB plugin is a Java class that implements the com.ontotext.trree.sdk.Plugin interface. All public classes and interfaces of the Plugin API are located in this Java package, i.e., com.ontotext.trree.sdk. Here is what the Plugin interface looks like in an abbreviated form:

public interface Plugin extends Service {
    void setStatements(Statements statements);

    void setEntities(Entities entities);

    void setOptions(SystemOptions options);

    void setDataDir(File dataDir);

    void setLogger(Logger logger);

    void initialize();

    void setFingerprint(long fingerprint);

    long getFingerprint();

    void precommit(GlobalViewOnData view);

    void shutdown();
}

As it derives from the Service interface, the plugin is automatically discovered at run-time, provided that the following conditions also hold:

• The plugin class is located in the classpath;

• It is mentioned in a META-INF/services/com.ontotext.trree.sdk.Plugin file in the classpath or in a .jar that is in the classpath. The full class signature has to be written on a separate line in such a file (see the example below).
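For example, a plugin class com.example.ExamplePlugin (a hypothetical name) would be registered with a service registry file containing just its fully qualified class name:

# file: META-INF/services/com.ontotext.trree.sdk.Plugin
com.example.ExamplePlugin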

The only method introduced by the Service interface is getName(), which provides the plugin's (service's) name. This name must be unique within a particular GraphDB repository and serves as a plugin identifier, which can be used at any time to retrieve a reference to the plugin instance.

There are many more functions (interfaces) that a plugin can implement, but these are all optional and are declared in separate interfaces. Implementing any such complementary interface is the means to announce to the system what this particular plugin can do in addition to its mandatory plugin responsibilities. It is then automatically used as appropriate.

The life-cycle of a plugin

A plugin’s life-cycle consists of several phases:

• Discovery - this phase is executed at repository initialisation. GraphDB searches for all plugin services registered in the META-INF/services/com.ontotext.trree.sdk.Plugin service registry files and constructs a single instance of each plugin found.

• Configuration - every plugin instance discovered and constructed during the previous phase is then configured. During this phase, plugins are injected with a Logger object, which they use for logging (setLogger(Logger logger)), and with the path to their own data directory (setDataDir(File dataDir)), which they create, if needed, and then use to store their data. If a plugin does not need to store anything to disk, it can skip the creation of its data directory. However, if it does use it, it is guaranteed that this directory is unique and available only to the particular plugin it was assigned to. The plugins are also injected with Statements and Entities instances (see Repository internals (Statements and Entities)) and with a SystemOptions instance, which gives them access to the system-wide configuration options and settings.


• Initialisation - after a plugin has been configured, the framework calls its initialize() method, so that it gets the chance to do whatever initialisation work it needs. At this point, the plugin has already received all its configuration and low-level access to the repository data (see Repository internals (Statements and Entities)).

• Request - the plugin participates in request processing. This phase is optional for plugins. It is divided into several subphases and each plugin can choose to participate in any or none of them. The request phase covers not only the evaluation of, for instance, SPARQL queries, but also SPARQL/Update requests and getStatements calls. Here are the subphases of the request phase:

– Pre-processing - plugins are given the chance to modify the request before it is processed. In this phase, they can also initialise a context object, which remains visible until the end of the request processing (see Pre-processing);

– Pattern interpretation - plugins can choose to provide results for requested statement patterns (see Pattern interpretation);

– Post-processing - before the request results are returned to the client, plugins are given a chance to modify them, filter them out or even insert new results (see Post-processing);

• Shutdown - during repository shutdown, each plugin is prompted to execute its own shutdown routines, free resources, flush data to disk, etc. This must be done in the shutdown() method.

Repository internals (Statements and Entities)

In order to enable efficient request processing, plugins are given low-level access to the repository data and internals. This is done through the Statements and Entities interfaces.

The Entities interface represents a set of RDF objects (URIs, blank nodes and literals). All such objects are termed entities and are given unique long identifiers. The Entities instance is responsible for resolving these objects from their identifiers and, inversely, for looking up the identifier of a given entity. Most plugins process entities using their identifiers, because dealing with integer identifiers is a lot more efficient than working with the actual RDF entities they represent. The Entities interface is the single entry point available to plugins for entity management. It supports the addition of new entities, entity replacement, look-up of entity type and properties, resolving entities, listening for entity change events, etc.

It is possible in a GraphDB repository to declare two RDF objects to be equivalent, e.g., by using owl:sameAs. In order to provide a way to use such declarations, the Entities interface assigns a class identifier to each entity. For newly created entities, this class identifier is the same as the entity identifier. When two entities are declared equivalent, one of them adopts the class identifier of the other, and thus they become members of the same equivalence class. The Entities interface exposes the entity class identifier, so plugins can determine which entities are equivalent.

Entities within an Entities instance have a certain scope. There are three entity scopes (a short sketch follows the list):

• Default - entities stored in this scope are persisted to disk and can be used in statements that are also physically stored on disk. These entities have non-zero, positive identifiers and are often referred to as physical entities.

• System - system entities have negative identifiers and are not persisted to disk. They can be used, for example, for system (or magic) predicates. They are available throughout the whole repository lifetime, but after a restart they disappear and need to be re-created, should one need them.

• Request - entities stored in request scope, like system entities, are not persisted to disk and have negative identifiers. However, they only live in the scope of a particular request. They are not visible to other concurrent requests and disappear immediately after the request processing has finished. This scope is useful for temporary entities, such as literal values that are not expected to occur often (e.g., numerical values) and do not appear inside a physical statement.
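A short, hedged sketch of how entities might be created in each scope from plugin code; the Scope.DEFAULT constant name is an assumption, while the System and Request scopes appear in the examples later in this section:

// create a persisted entity (assumed constant name for the default scope)
long persistentId = entities.put(new URIImpl("http://example.com/thing"),
        Entities.Scope.DEFAULT);
// create a system entity: negative identifier, lives until repository shutdown
long magicId = entities.put(new URIImpl("http://example.com/magic"),
        Entities.Scope.SYSTEM);
// create a request entity: negative identifier, visible only to this request
long temporaryId = entities.put(new LiteralImpl("3.14"),
        Entities.Scope.REQUEST);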

The Statements interface represents a set of RDF statements, where 'statement' means a quadruple of subject, predicate, object and context RDF entity identifiers. Statements can be added, removed and searched for. Additionally, a plugin can subscribe to receive statement event notifications:

• transaction started;


• statement added;

• statement deleted;

• transaction completed.

An important abstract class related to GraphDB internals is StatementIterator. It has a method, boolean next(), which attempts to scroll the iterator onto the next available statement and returns true only if it succeeds. In case of success, its subject, predicate, object and context fields are initialised with the respective components of the next statement. Furthermore, some properties of each statement are available via the following methods:

• boolean isReadOnly() - returns true if the statement is in the Axioms part of the rule-file or is imported at initialisation;

• boolean isExplicit() - returns true if the statement is explicitly asserted;

• boolean isImplicit() - returns true if the statement is produced by the inferencer (raw statements can be both explicit and implicit).

Here is a brief example that puts Statements, Entities and StatementIterator together, in order to output all literals that are related to a given URI:

// resolve the URI identifier
long id = entities.resolve(new URIImpl("http://example/uri"));

// retrieve all statements with this identifier in subject position
StatementIterator iter = statements.get(id, 0, 0, 0);
while (iter.next()) {
    // only process literal objects
    if (entities.getType(iter.object) == Entities.Type.LITERAL) {
        // resolve the literal and print out its value
        Value literal = entities.get(iter.object);
        System.out.println(literal.stringValue());
    }
}

Request-processing phases

As already mentioned, a plugin's participation in each of the request-processing phases is optional. The plugin declares that it plans to participate in a given phase by implementing the appropriate interface.

Pre-processing

A plugin willing to participate in request pre-processing must implement the Preprocessor interface. It looks like this:

public interface Preprocessor {
    RequestContext preprocess(Request request);
}

The preprocess() method receives the request object and returns a RequestContext instance. The Request instance passed as the parameter is a different class instance, depending on the type of the request (e.g., SPARQL/Update or 'get statements'). The plugin changes the request object in the necessary way, and initialises and returns its context object, which is passed back to it in every other method during the request processing phase. The returned request context may be null, and, whatever it is, it is only visible to the plugin that initialised it. It can be used to store data visible for (and only for) the whole request, e.g., to pass data relating to two different statement patterns recognised by the plugin. The request context gives further request-processing phases access to the Request object reference. Plugins that opt to skip this phase do not have a request context and are not able to access the original Request object.


Pattern interpretation

This is one of the most important phases in the lifetime of a plugin. In fact, most plugins need to participate in exactly this phase. This is the point where requested statement patterns are evaluated and statement results are returned.

For example, consider the following SPARQL query:

SELECT * WHERE {
  ?s <http://example/predicate> ?o
}

There is just one statement pattern inside this query: ?s <http://example/predicate> ?o. All plugins that have implemented the PatternInterpreter interface (thus declaring that they intend to participate in the pattern interpretation phase) are asked if they can interpret this pattern. The first one to accept it and return results for it will be used. If no plugin interprets the pattern, it is looked up using the repository's physical statements, i.e., the ones persisted on disk.

Here is the PatternInterpreter interface:

public interface PatternInterpreter {
    double estimate(long subject, long predicate, long object, long context,
            Statements statements, Entities entities, RequestContext requestContext);

    StatementIterator interpret(long subject, long predicate, long object, long context,
            Statements statements, Entities entities, RequestContext requestContext);
}

The estimate() and interpret() methods take the same arguments and are used in the following way:

• Given a statement pattern (e.g., the one in the SPARQL query above), all plugins that implement PatternInterpreter are asked to interpret() the pattern. The subject, predicate, object and context values are either the identifiers of the values in the pattern or 0, if the corresponding component is an unbound variable. The statements and entities objects represent the statements and entities, respectively, that are available for this particular request. For instance, if the query contains any FROM <http://some/graph> clauses, the statements object will only provide access to statements in the defined named graphs. Similarly, the entities object contains entities that might be valid only for this particular request. The plugin's interpret() method must return a StatementIterator, if it intends to interpret this pattern, or null, if it refuses.

• In case the plugin signals that it will interpret the given pattern (returns a non-null value), GraphDB's query optimiser calls the plugin's estimate() method, in order to get an estimate of how many results will be returned by the StatementIterator returned by interpret(). This estimate need not be precise, but the better it is, the more likely the optimiser is to produce an efficient execution plan. There is a slight difference in the values passed to estimate(). The statement components (e.g., subject) might not only be entity identifiers; they can also be set to two special values:

– Entities.BOUND - the pattern component is said to be bound, but its particular binding is not yet known;

– Entities.UNBOUND - the pattern component will not be bound. These values must be treated as hints to the estimate() method to provide a better approximation of the result set size, although its precise value cannot be determined before the query is actually run.

• After the query has been optimised, the interpret() method of the plugin might be called again, should any variable become bound due to the pattern reordering applied by the optimiser. Plugins must be prepared to expect different combinations of bound and unbound statement pattern components, and return appropriate iterators.

The requestContext parameter is the value returned by the preprocess() method, if one exists, or null otherwise.
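As an illustration (not from the original text), an estimate() implementation might use the hints as follows; the cardinality figures are arbitrary placeholders:

@Override
public double estimate(long subject, long predicate, long object, long context,
        Statements statements, Entities entities, RequestContext requestContext) {
    if (subject == Entities.UNBOUND) {
        return 10000; // the subject will never be bound: expect a large result set
    }
    if (subject == Entities.BOUND || subject > 0) {
        return 1;     // a concrete or promised subject binding: expect few results
    }
    return 1000;      // an unbound variable at optimisation time: a rough default
}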

The plugin framework also supports the interpretation of an extended type of a list pattern.


Consider the following SPARQL query:

SELECT * WHERE {
  ?s <http://example/predicate> (?o1 ?o2)
}

If a plugin wants to handle such list patterns, it has to implement ListPatternInterpreter, an interface very similar to PatternInterpreter:

public interface ListPatternInterpreter {
    double estimate(long subject, long predicate, long[] objects, long context,
            Statements statements, Entities entities, RequestContext requestContext);

    StatementIterator interpret(long subject, long predicate, long[] objects, long context,
            Statements statements, Entities entities, RequestContext requestContext);
}

It only differs by having multiple objects, passed as an array of long, instead of a single long object. The semantics of both methods are equivalent to those in the basic pattern interpretation case.

Post-processing

There are cases when a plugin would like to modify or otherwise filter the final results of a request. This is where the Postprocessor interface comes into play:

public interface Postprocessor {
    boolean shouldPostprocess(RequestContext requestContext);

    BindingSet postprocess(BindingSet bindingSet, RequestContext requestContext);

    Iterator<BindingSet> flush(RequestContext requestContext);
}

The postprocess() method is called for each binding set that is to be returned to the repository client. This method may modify the binding set and return it, or alternatively return null, in which case the binding set is removed from the result set. After a binding set is processed by a plugin, the possibly modified binding set is passed to the next plugin that has post-processing functionality enabled. After the binding set is processed by all plugins (in the case where no plugin deletes it), it is returned to the client. Finally, after all results are processed and returned, each plugin's flush() method is called to introduce new binding set results into the result set. These, in turn, are returned to the client.

Update processing

As well as query/read processing, plugins are able to process update operations for statement patterns containing specific predicates. In order to intercept updates, a plugin must implement the UpdateInterpreter interface. During initialisation, its getPredicatesToListenFor() method is called once by the framework, so that the plugin can indicate which predicates it is interested in.

From then onwards, the plugin framework filters updates for statements using these predicates and notifies the plugin. Filtered updates are not processed further by GraphDB, so if the insert or delete operation must be persisted, the plugin must handle this itself, using the Statements object passed to it.

/**
 * An interface that must be implemented by the plugins that want to be
 * notified for particular update events. The getPredicatesToListenFor()
 * method should return the predicates of interest to the plugin. This
 * method will be called once only immediately after the plugin has been
 * initialized. After that point the plugin's interpretUpdate() method
 * will be called for each inserted or deleted statement sharing one of the
 * predicates of interest to the plugin (those returned by
 * getPredicatesToListenFor()).
 */
public interface UpdateInterpreter {
    /**
     * Returns the predicates for which the plugin needs to get notified
     * when a statement is added or removed and contains the predicates in
     * question
     *
     * @return array of predicates
     */
    long[] getPredicatesToListenFor();

    /**
     * Hook that handles updates that this interpreter is registered for
     *
     * @param subject    subject value of the updated statement
     * @param predicate  predicate value of the updated statement
     * @param object     object value of the updated statement
     * @param context    context value of the updated statement
     * @param isAddition true if the statement was added, false if it was removed
     * @param isExplicit true if the updated statement was an explicit one
     * @param statements Statements instance that contains the updated statement
     * @param entities   Entities instance for the request
     */
    void interpretUpdate(long subject, long predicate, long object, long context,
            boolean isAddition, boolean isExplicit,
            Statements statements, Entities entities);
}
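A minimal hypothetical implementation might look like the sketch below. The plugin name, the watched predicate and the statements.add(...) call are assumptions based on the descriptions above (the text states that statements can be added through the Statements interface and that filtered updates must be persisted by the plugin itself):

// hedged sketch of an UpdateInterpreter: watch one predicate, persist additions
public class WatchingPlugin extends PluginBase implements UpdateInterpreter {
    private long watchedPredicateId;

    @Override
    public String getName() {
        return "watching";
    }

    @Override
    public void initialize() {
        watchedPredicateId = entities.put(
                new URIImpl("http://example.com/watched"), Entities.Scope.SYSTEM);
    }

    @Override
    public long[] getPredicatesToListenFor() {
        return new long[] { watchedPredicateId };
    }

    @Override
    public void interpretUpdate(long subject, long predicate, long object, long context,
            boolean isAddition, boolean isExplicit,
            Statements statements, Entities entities) {
        if (isAddition) {
            // filtered updates are not processed further by GraphDB, so the
            // statement is persisted here (assumed Statements method signature)
            statements.add(subject, predicate, object, context);
        }
    }
}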

Putting it all together: an example plugin

The following example plugin has two responsibilities:

• It interprets patterns such as ?s <http://example.com/time> ?o and binds their object component to a literal containing the repository's local date and time.

• If a FROM <http://example.com/time> clause is detected in the query, the result is a single binding set in which all projected variables are bound to a literal containing the repository's local date and time.

For the first part, it is clear that the plugin must implement the PatternInterpreter interface. The date/time literal is stored as a request-scope entity, to avoid cluttering the repository with extra literals.

For the second requirement, the plugin must first take part in the pre-processing phase, in order to inspect the query and detect the FROM clause. Then, it must hook into the post-processing phase where, if the pre-processing phase has detected the desired FROM clause, it deletes all query results (in postprocess()) and returns a single result (in flush()) containing the binding set specified by the requirements. Again, request-scoped literals are created.

The plugin implementation extends the PluginBase class, which provides a default implementation of the Plugin methods:

public class ExamplePlugin extends PluginBase {
    private static final URI PREDICATE = new URIImpl("http://example.com/time");
    private long predicateId;

    @Override
    public String getName() {
        return "example";
    }

    @Override
    public void initialize() {
        predicateId = entities.put(PREDICATE, Entities.Scope.SYSTEM);
    }
}

In this basic implementation, the plugin name is defined and, during initialisation, a single system-scope predicate is registered.

Note: It is important not to forget to register the plugin in the META-INF/services/com.ontotext.trree.sdk.Plugin file in the classpath.

The next step is to implement the first of the plugin’s requirements - the pattern interpretation part:

public class ExamplePlugin extends PluginBase implements PatternInterpreter {

    // ...

    @Override
    public StatementIterator interpret(long subject, long predicate, long object, long context,
            Statements statements, Entities entities, RequestContext requestContext) {
        // ignore patterns with a predicate different from the one we recognize
        if (predicate != predicateId)
            return null;

        // create the date/time literal
        long literalId = createDateTimeLiteral();

        // return a StatementIterator with a single statement to be iterated
        return StatementIterator.create(subject, predicate, literalId, 0);
    }

    private long createDateTimeLiteral() {
        Value literal = new LiteralImpl(new Date().toString());
        return entities.put(literal, Scope.REQUEST);
    }

    @Override
    public double estimate(long subject, long predicate, long object, long context,
            Statements statements, Entities entities, RequestContext requestContext) {
        return 1;
    }
}

The interpret() method only processes patterns with a predicate matching the desired predicate identifier. Further on, it simply creates a new date/time literal (in the request scope) and places its identifier in the object position of the returned single result. The estimate() method always returns 1, because this is the exact size of the result set.

Finally, to implement the second requirement concerning the interpretation of the FROM clause:

public class ExamplePlugin extends PluginBase implements PatternInterpreter, Preprocessor,
        Postprocessor {

    private static class Context implements RequestContext {
        private Request theRequest;
        private BindingSet theResult;

        public Context(BindingSet result) {
            theResult = result;
        }

        @Override
        public Request getRequest() {
            return theRequest;
        }

        @Override
        public void setRequest(Request request) {
            theRequest = request;
        }

        public BindingSet getResult() {
            return theResult;
        }
    }

    // ...

    @Override
    public RequestContext preprocess(Request request) {
        if (request instanceof QueryRequest) {
            QueryRequest queryRequest = (QueryRequest) request;
            Dataset dataset = queryRequest.getDataset();
            if (dataset != null && dataset.getDefaultGraphs().contains(PREDICATE)) {
                // create a date/time literal
                long literalId = createDateTimeLiteral();
                Value literal = entities.get(literalId);
                // prepare a binding set with all projected variables set
                // to the date/time literal value
                MapBindingSet result = new MapBindingSet();
                if (queryRequest.getTupleExpr() instanceof Projection) {
                    Projection projection = (Projection) queryRequest.getTupleExpr();
                    for (String bindingName : projection.getBindingNames()) {
                        result.addBinding(bindingName, literal);
                    }
                }
                return new Context(result);
            }
        }
        return null;
    }

    @Override
    public BindingSet postprocess(BindingSet bindingSet, RequestContext requestContext) {
        // if we have found the special FROM clause we filter out all results
        return requestContext != null ? null : bindingSet;
    }

    @Override
    public Iterator<BindingSet> flush(RequestContext requestContext) {
        // if we have found the special FROM clause we return the special binding set
        // (guard against a null context before casting)
        if (requestContext != null) {
            BindingSet result = ((Context) requestContext).getResult();
            return new SingletonIterator<BindingSet>(result);
        }
        return null;
    }
}

The plugin provides a custom implementation of the RequestContext interface, which can hold a reference to the desired single BindingSet with the date/time literal bound to every variable name in the query projection. The postprocess() method filters out all results if the requestContext is non-null (i.e., if the FROM clause was detected by preprocess()). Finally, flush() returns a singleton iterator containing the desired binding set in that case, and returns nothing otherwise.


Making a plugin configurable

Plugins often require configuration. There are two ways for GraphDB plugins to receive their configuration. The first is to define magic system predicates that can be used to pass configuration values to the plugin through a query at run-time. This approach is appropriate whenever the configuration changes from one plugin usage scenario to another, i.e., when there are no globally valid parameters for the plugin. However, in many cases the plugin behaviour has to be configured 'globally'; for this, the plugin framework provides a suitable mechanism through the Configurable interface.

A plugin implements the Configurable interface to announce its configuration parameters to the system. This allows it to read parameter values from the repository configuration during initialisation and have them merged with all other repository parameters (accessible through the SystemOptions instance passed during the configuration phase).

This is the Configurable interface:

public interface Configurable {
    public String[] getParameters();
}

The plugin needs to enumerate its configuration parameter names. The example plugin is extended with the ability to define the name of the special predicate it uses. The parameter is called predicate-uri and accepts a URI value.

public class ExamplePlugin extends PluginBase implements PatternInterpreter, Preprocessor,
        Postprocessor, Configurable {

    private static final String DEFAULT_PREDICATE = "http://example.com/time";
    private static final String PREDICATE_PARAM = "predicate-uri";

    // ...

    @Override
    public String[] getParameters() {
        return new String[] { PREDICATE_PARAM };
    }

    // ...

    @Override
    public void initialize() {
        // get the configured predicate URI, falling back to our default if none was found
        String predicate = options.getParameter(PREDICATE_PARAM, DEFAULT_PREDICATE);

        predicateId = entities.put(new URIImpl(predicate), Entities.Scope.SYSTEM);
    }

    // ...
}

Now that the plugin parameter has been declared, it can be configured either by adding the http://www.ontotext.com/trree/owlim#predicate-uri parameter to the GraphDB configuration, or by setting a Java system property, using the -Dpredicate-uri parameter for the JVM running GraphDB.

There is also a special kind of configuration parameter, called 'memory' parameters. These are used to configure the amount of memory available to the plugin. If a plugin has such parameters, it uses the MemoryConfigurable interface:

public interface MemoryConfigurable {
    public String[] getMemoryParameters();

    public void setMemoryParameter(String name, long bytes);
}


The getMemoryParameters() method enumerates the names of the plugin's memory parameters, in a similar way to Configurable.getParameters(). During the configuration phase, the plugin's setMemoryParameter() method is called once for each such parameter with its respective configured value in bytes. Memory parameters can be given values such as 1g or 300M, but these values are interpreted and converted to bytes.

A special property of the memory parameters is that they can be configured as a group. GraphDB accepts a parameter called cache-memory, which accumulates the values of a group of other parameters: tuple-index-memory and predicate-memory. Declaring a memory parameter automatically adds it to the group of parameters accumulated by cache-memory. The advantage of this approach is that, if cache-memory is configured to some amount and any of the grouped memory parameters is left unconfigured (unknown), the remaining amount of cache-memory is divided among all unknown memory parameters. This provides the user with a simple way to control the memory requirements of many plugins using a single parameter.

For instance, suppose cache-memory is configured to 100m and tuple-index-memory to 20m, there are no predicate lists configured (which automatically disables the predicate-memory parameter), and several plugins declare 4 memory parameters that are not explicitly configured. The effect of such a setup is that 80M (100M - 20M) is divided among the 4 memory parameters, so each of them is set to 20M. This value is then reported to the plugins in bytes through their setMemoryParameter() method.
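For illustration, a plugin declaring a single memory parameter might look like this sketch; the plugin and parameter names are hypothetical:

// hedged sketch: declaring and receiving one memory parameter
public class CachingPlugin extends PluginBase implements MemoryConfigurable {
    private long cacheSizeInBytes;

    @Override
    public String getName() {
        return "caching";
    }

    @Override
    public String[] getMemoryParameters() {
        // values such as "300M" or "1g" are converted to bytes by the framework
        return new String[] { "caching-plugin-memory" };
    }

    @Override
    public void setMemoryParameter(String name, long bytes) {
        // called once per declared parameter during the configuration phase
        if ("caching-plugin-memory".equals(name)) {
            cacheSizeInBytes = bytes;
        }
    }
}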

Accessing other plugins

Plugins can make use of the functionality of other plugins. For example, the Lucene-based full-text search plugin can use the rank values provided by the RDFRank plugin to facilitate query result scoring and ordering. This is not a matter of re-using program code (e.g., a .jar with common classes), but rather of re-using data. The mechanism allows plugins to obtain references to other plugin objects by knowing their names. To achieve this, they only need to implement the PluginDependency interface:

public interface PluginDependency {
    public void setLocator(PluginLocator locator);
}

They are then injected with an instance of the PluginLocator interface (during the configuration phase), which does the actual plugin discovery for them:

public interface PluginLocator {
    public Plugin locate(String name);
}

Having a reference to another plugin is all that is needed to call its methods directly and make use of its services.
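A hedged sketch of the pattern follows; the plugin names are placeholders, not names of real GraphDB plugins:

// hedged sketch: obtaining a reference to another plugin by name
public class DependentPlugin extends PluginBase implements PluginDependency {
    private PluginLocator locator;
    private Plugin helper;

    @Override
    public String getName() {
        return "dependent";
    }

    @Override
    public void setLocator(PluginLocator locator) {
        // injected by the framework during the configuration phase
        this.locator = locator;
    }

    @Override
    public void initialize() {
        // "some-other-plugin" stands in for the name of the plugin to be re-used
        helper = locator.locate("some-other-plugin");
    }
}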

5.5.4.2 RDF Rank

What is RDF Rank

RDF Rank is an algorithm that identifies the more important or more popular entities in the repository by examining their interconnectedness. The popularity of entities can then be used to order query results, similarly to the way internet search engines order results, e.g., the way Google orders search results using PageRank.

The RDF Rank component computes a numerical weighting for all nodes in the entire RDF graph stored in the repository, including URIs, blank nodes and literals. The weights are floating point numbers between 0 and 1 that can be interpreted as a measure of a node's relevance/popularity.


Since the values range from 0 to 1, the weights can be used for sorting a result set (the lexicographical order works fine even if the rank literals are interpreted as plain strings).

Here is an example SPARQL query that uses the RDF rank for sorting results by their popularity:

PREFIX rank: <http://www.ontotext.com/owlim/RDFRank#>
PREFIX opencyc-en: <http://sw.opencyc.org/2008/06/10/concept/en/>
SELECT * WHERE {
  ?Person a opencyc-en:Entertainer .
  ?Person rank:hasRDFRank ?rank .
}
ORDER BY DESC(?rank) LIMIT 100

As seen in the example query, RDF Rank weights are made available via a special system predicate. GraphDB handles triple patterns with the predicate http://www.ontotext.com/owlim/RDFRank#hasRDFRank in a special way, where the object of the statement pattern is bound to a literal containing the RDF Rank of the subject.

In order to use this mechanism, the RDF ranks for the whole repository must be computed in advance. This is done by committing a series of SPARQL updates that use a special vocabulary to parameterise the weighting algorithm, followed by an update that triggers the computation itself.


Parameters

Parameter: Maximum iterations
Predicate: http://www.ontotext.com/owlim/RDFRank#maxIterations
Description: Sets the maximum number of iterations of the algorithm over all entities in the repository.
Default: 20
Example:

PREFIX rank: <http://www.ontotext.com/owlim/RDFRank#>
INSERT DATA { rank:maxIterations rank:setParam "16" . }

Parameter: Epsilon
Predicate: http://www.ontotext.com/owlim/RDFRank#epsilon
Description: Terminates the weighting algorithm early when the total change of all RDF Rank scores has fallen below this value.
Default: 0.01
Example:

PREFIX rank: <http://www.ontotext.com/owlim/RDFRank#>
INSERT DATA { rank:epsilon rank:setParam "0.05" . }

Full computation

To trigger the computation of the RDF Rank values for all resources, use the following update:

PREFIX rank: <http://www.ontotext.com/owlim/RDFRank#>
INSERT DATA { _:b1 rank:compute _:b2 . }
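A minimal sketch (Sesame 2.x API, not part of the original text) of committing this control update programmatically, assuming an initialised RepositoryConnection:

import org.openrdf.query.QueryLanguage;
import org.openrdf.repository.RepositoryConnection;

public class RdfRankExample {
    // triggers a full RDF Rank computation over the given connection
    public static void computeRanks(RepositoryConnection conn) throws Exception {
        conn.prepareUpdate(QueryLanguage.SPARQL,
                "PREFIX rank: <http://www.ontotext.com/owlim/RDFRank#> " +
                "INSERT DATA { _:b1 rank:compute _:b2 . }").execute();
    }
}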

Incremental updates

The full computation of RDF Rank values for all resources can be relatively expensive. When new resources have been added to the repository after a previous full computation, you can either recompute the ranks for all resources (see above) or compute only the RDF Rank values for the new resources (an incremental update).

The following control update:

PREFIX rank: <http://www.ontotext.com/owlim/RDFRank#>
INSERT DATA { _:b1 rank:computeIncremental "true" }

computes RDF Rank values for the resources that do not have an associated value, i.e., the ones that have been added to the repository since the last full RDF Rank computation.

Note: The incremental computation uses a different algorithm, which is lightweight (in order to be fast), but is not as accurate as the proper ranking algorithm. As a result, ranks assigned by the proper and the lightweight algorithms will be slightly different.


Exporting RDF Rank values

The computed weights can be exported to an external file using an update of this form:

PREFIX rank: <http://www.ontotext.com/owlim/RDFRank#>
INSERT DATA { _:b1 rank:export "/home/user1/rdf_ranks.txt" . }

If the export fails, the update throws an exception and an error message is recorded in the log file.

With RDF Priming, you can use the RDF Rank values as initial activation values. To set it up, use the following update:

PREFIX rank: <http://www.ontotext.com/owlim/RDFRank#>
INSERT DATA { _:b1 rank:ranksAsWeights _:b2 . }

5.5.4.3 Geo-spatial Extensions

What are geo-spatial extensions

GraphDB provides support for 2-dimensional geo-spatial data that uses the WGS84 Geo Positioning RDF vocabulary (World Geodetic System 1984). Special indices can be used for this data, which permit the efficient evaluation of query forms and extension functions for finding locations:

• within a certain distance of a point, i.e., within the specified circle on the surface of the sphere (Earth), using the nearby(...) construction;

• within rectangles and polygons, where the vertices are defined by spherical polar coordinates, using the within(...) construction.

The WGS84 ontology can be found at http://www.w3.org/2003/01/geo/wgs84_pos and contains several classes and predicates:


SpatialThing: A class for representing anything with spatial extent, i.e., size, shape or position.
Point: A class for representing a point (relative to Earth) defined by latitude, longitude (and altitude). subClassOf http://www.w3.org/2003/01/geo/wgs84_pos#SpatialThing
location: The relation between a thing and where it is. Range SpatialThing; subPropertyOf http://xmlns.com/foaf/0.1/based_near
lat: The WGS84 latitude of a SpatialThing (decimal degrees). domain http://www.w3.org/2003/01/geo/wgs84_pos#SpatialThing
long: The WGS84 longitude of a SpatialThing (decimal degrees). domain http://www.w3.org/2003/01/geo/wgs84_pos#SpatialThing
lat_long: A comma-separated representation of a latitude, longitude coordinate.
alt: The WGS84 altitude of a SpatialThing (decimal metres above the local reference ellipsoid). domain http://www.w3.org/2003/01/geo/wgs84_pos#SpatialThing

How to create a geo-spatial index

Note: For the geo-spatial extensions to function correctly, you need to have the following .jar files (from the /ext folder of the GraphDB distribution) in the classpath: jsi-1*.jar, log4j-1*.jar, sil-0*.jar, trove4j-2*.jar

Before you can use the geo-spatial extensions, you need to build the geo-spatial index.

To do this, insert and commit a dummy statement (that is not actually stored), which uses a special predicate:

PREFIX ontogeo: <http://www.ontotext.com/owlim/geo#>
INSERT DATA { _:b1 ontogeo:createIndex _:b2 . }

If the indexing is successful, the above update succeeds; otherwise, an exception is thrown. You can find information about the indexing process and any errors in the logs.

Note: If there is no geo-spatial data in the repository, i.e., no statements describing resources with latitude and longitude properties, this update will fail.

Geo-spatial query syntax

The geo-spatial query syntax is SPARQL's RDF Collections syntax. It uses round brackets as a shorthand for the statements connecting a list of values using the rdf:first and rdf:rest predicates, terminated by rdf:nil. Statement patterns that use one of the special geo-spatial predicates supported by GraphDB are treated differently by the query engine.

The following special syntax is supported when evaluating SPARQL queries. All descriptions use the namespace omgeo: <http://www.ontotext.com/owlim/geo#>.


Construct: Nearby (lat long distance)
Syntax: ?point omgeo:nearby(?lat ?long ?distance)
Description: This statement pattern evaluates to true if the following constraints hold:
• ?point geo:lat ?plat .
• ?point geo:long ?plong .
• the shortest great circle distance from (?plat, ?plong) to (?lat, ?long) is <= ?distance
Such a construction uses the geo-spatial indices to find bindings for ?point that lie within the defined circle. Constants are allowed for any of ?lat ?long ?distance, where latitude and longitude are specified in decimal degrees and the distance is specified in either kilometres ('km' suffix) or miles ('mi' suffix). If the units are not specified, 'km' is assumed.
Restrictions: Latitude is limited to the range -90 (South) to +90 (North). Longitude is limited to the range -180 (West) to +180 (East).
Example: Find the names of airports that are within 50 miles of Seoul:

PREFIX geo-pos: <http://www.w3.org/2003/01/geo/wgs84_pos#>
PREFIX geo-ont: <http://www.geonames.org/ontology#>
PREFIX omgeo: <http://www.ontotext.com/owlim/geo#>
SELECT distinct ?airport
WHERE {
  ?base geo-ont:name "Seoul" .
  ?base geo-pos:lat ?latBase .
  ?base geo-pos:long ?longBase .
  ?link omgeo:nearby(?latBase ?longBase "50mi") .
  ?link geo-ont:name ?airport .
  ?link geo-ont:featureCode geo-ont:S.AIRP .
}


Construct: Within (rectangle)
Syntax: ?point omgeo:within(?lat1 ?long1 ?lat2 ?long2)
Description: This statement pattern is used to test/find points that lie within the rectangle specified by the diagonally opposite corners ?lat1 ?long1 and ?lat2 ?long2. The corners of the rectangle must be either constants or bound values. The pattern evaluates to true if the following constraints hold:
• ?point geo:lat ?plat .
• ?point geo:long ?plong .
• ?lat1 <= ?plat <= ?lat2
• ?long1 <= ?plong <= ?long2
Note that the most westerly and southerly corner must be specified first, and the most northerly and easterly one second. Proper account is taken of rectangles that cross the +/-180 degree meridian. Constants are allowed for any of ?lat1 ?long1 ?lat2 ?long2, where latitude and longitude are specified in decimal degrees. If ?point is unbound, bindings for all points within the rectangle are produced.
Restrictions: Latitude is limited to the range -90 (South) to +90 (North). Longitude is limited to the range -180 (West) to +180 (East). Rectangle vertices must be specified in the order lower-left followed by upper-right.
Example: Find tunnels lying within a rectangle enclosing Tirol, Austria:

PREFIX geo-pos: <http://www.w3.org/2003/01/geo/wgs84_pos#>
PREFIX geo-ont: <http://www.geonames.org/ontology#>
PREFIX omgeo: <http://www.ontotext.com/owlim/geo#>
SELECT ?feature ?lat ?long
WHERE {
  ?link omgeo:within(45.85 9.15 48.61 13.18) .
  ?link geo-ont:featureCode geo-ont:R.TNL .
  ?link geo-ont:name ?feature .
  ?link geo-pos:lat ?lat .
  ?link geo-pos:long ?long .
}


Construct: Within (polygon)
Syntax: ?point omgeo:within(?lat1 ?long1 ... ?latN ?longN)
Description: This statement pattern is used to test/find points that lie within the polygon whose vertices are specified by three or more latitude/longitude pairs. The values of the vertices must be either constants or bound values. The pattern evaluates to true if the following constraints hold:
• ?point geo:lat ?plat .
• ?point geo:long ?plong .
• the position ?plat ?plong is enclosed by the polygon
The polygon is closed automatically if the first and last vertices do not coincide. Coordinates are specified in decimal degrees. If ?point is unbound, bindings for all points within the polygon are produced.
Restrictions: Latitude is limited to the range -90 (South) to +90 (North). Longitude is limited to the range -180 (West) to +180 (East).
Example: Find caves in the sides of cliffs lying within a polygon approximating the shape of England:

PREFIX geo-pos: <http://www.w3.org/2003/01/geo/wgs84_pos#>
PREFIX geo-ont: <http://www.geonames.org/ontology#>
PREFIX omgeo: <http://www.ontotext.com/owlim/geo#>
SELECT ?feature ?lat ?long
WHERE {
  ?link omgeo:within( "51.45" "-2.59"
                      "54.99" "-3.06"
                      "55.81" "-2.03"
                      "52.74" "1.68"
                      "51.17" "1.41" ) .
  ?link geo-ont:featureCode geo-ont:S.CAVE .
  ?link geo-ont:name ?feature .
  ?link geo-pos:lat ?lat .
  ?link geo-pos:long ?long .
}

Extension query functions

At present, there is just one SPARQL extension function:


Function: Distance
Syntax: double omgeo:distance(?lat1, ?long1, ?lat2, ?long2)
Description: This SPARQL extension function computes the distance between two points in kilometres and can be used in FILTER and ORDER BY clauses.
Restrictions: Latitude is limited to the range -90 (South) to +90 (North). Longitude is limited to the range -180 (West) to +180 (East).
Example: Find the names of airports within 80 miles of Bournemouth that are also within 80 kilometres of Brize Norton, ordered by their distance from Brize Norton:

PREFIX geo-pos: <http://www.w3.org/2003/01/geo/wgs84_pos#>
PREFIX geo-ont: <http://www.geonames.org/ontology#>
PREFIX omgeo: <http://www.ontotext.com/owlim/geo#>
SELECT distinct ?airport_name
WHERE {
  ?a1 geo-ont:name "Bournemouth" .
  ?a1 geo-pos:lat ?lat1 .
  ?a1 geo-pos:long ?long1 .
  ?airport omgeo:nearby(?lat1 ?long1 "80mi") .
  ?airport geo-ont:name ?airport_name .
  ?airport geo-ont:featureCode geo-ont:S.AIRP .
  ?airport geo-pos:lat ?lat2 .
  ?airport geo-pos:long ?long2 .
  ?a2 geo-ont:name "Brize Norton" .
  ?a2 geo-pos:lat ?lat3 .
  ?a2 geo-pos:long ?long3 .
  FILTER( omgeo:distance(?lat2, ?long2, ?lat3, ?long3) < 80 )
}
ORDER BY ASC( omgeo:distance(?lat2, ?long2, ?lat3, ?long3) )

Implementation details

Knowing the implementation's algorithms and assumptions allows you to make the best use of the GraphDB geo-spatial extensions.

The following aspects are significant and can affect the expected behaviour during query answering:

• Spherical Earth - the current implementation treats the Earth as a perfect sphere with a radius of 6371.009 km;

• Only 2-dimensional points are supported, i.e., there is no special handling of geo:alt (metres above the reference surface of the Earth);

• All latitude and longitude values must be specified using decimal degrees, where East and North are positive and -90 <= latitude <= +90 and -180 <= longitude <= +180;

• Distances must be in units of kilometres (suffix 'km') or statute miles (suffix 'mi'). If the suffix is omitted, kilometres are assumed;

• omgeo:within( rectangle ) uses a 'rectangle' whose edges are lines of latitude and longitude, so the north-south distance is constant, and the rectangle described forms a band around the Earth, which starts and stops at the given longitudes;

• omgeo:within( polygon ) joins vertices with straight lines on a cylindrical projection of the Earth tangential to the equator. A straight line starting at the point under test and continuing East out of the polygon is examined to see how many polygon edges it intersects. If the number of intersections is even, the point is outside the polygon; if it is odd, the point is inside. With the current algorithm, the order of vertices (clockwise or anticlockwise) is not relevant;

• omgeo:within() may not work correctly when the region (polygon or rectangle) spans the +/-180 meridian;

• omgeo:nearby() uses the great circle distance between points (see the sketch below).
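To make the spherical-Earth assumption concrete, the great circle distance between two points can be computed with the haversine formula. The following sketch is illustrative only and is not necessarily the exact code used by GraphDB:

// haversine distance on a sphere of radius 6371.009 km (illustrative only)
public class GreatCircle {
    private static final double EARTH_RADIUS_KM = 6371.009;

    public static double distanceKm(double lat1, double long1,
                                    double lat2, double long2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLong = Math.toRadians(long2 - long1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                * Math.sin(dLong / 2) * Math.sin(dLong / 2);
        return 2 * EARTH_RADIUS_KM * Math.atan2(Math.sqrt(a), Math.sqrt(1 - a));
    }
}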


5.5.5 Notifications

5.5.5.1 What are GraphDB local notifications

Notifications are a publish/subscribe mechanism for registering for and receiving events from a GraphDB repository whenever triples matching a certain graph pattern are inserted or removed.

The Sesame API provides such a mechanism, where a RepositoryConnectionListener can be notified of changes to a NotifyingRepositoryConnection. However, the GraphDB notifications API works at a lower level and uses the internal raw entity IDs for subject, predicate and object instead of Java objects. The benefit of this is much higher performance. The downside is that the client must do a separate lookup to get the actual entity values; because of this, the notification mechanism works only when the client runs inside the same JVM as the repository instance.

How to register for local notifications

To receive notifications, register by providing a SPARQL query.

Note: The SPARQL query is interpreted as a plain graph pattern, ignoring all more complicated SPARQL constructs such as FILTER, OPTIONAL, DISTINCT, LIMIT, ORDER BY, etc. The SPARQL query is therefore interpreted as a complex graph pattern involving triple patterns combined by means of joins and unions at any level. The order of the triple patterns is not significant.

Here is an example of how to register for notifications based on a given SPARQL query:

AbstractRepository rep = ((OwlimSchemaRepository) owlimSail).getRepository();
EntityPoolConnection entConn =
        ((OwlimSchemaRepository) owlimSail).getEntities().getConnection();
String query = "SELECT * WHERE { ?s rdf:type ?o }";
SPARQLQueryListener listener =
        new SPARQLQueryListener(query, rep, entConn) {
            public void notifyMatch(int subj, int pred, int obj, int context) {
                System.out.println("Notification on subject: " + subj);
            }
        };
rep.addListener(listener);    // start receiving notifications
// ...
rep.removeListener(listener); // stop receiving notifications

In the example code, the caller will be asynchronously notified about incoming statements matching the pattern ?s rdf:type ?o.

Note: In general, notifications are sent for all incoming triples that contribute to a solution of the query. The integer parameters in the notifyMatch method can be mapped to values using the EntityPoolConnection. Furthermore, any statements inferred from newly inserted statements are also subject to handling by the notification mechanism, i.e., clients are also notified of new implicit statements when the requested triple pattern matches.

Note: The subscriber should not rely on any particular order or distinctness of the statement notifications. Duplicate statements might be delivered in response to a graph pattern subscription, in an order not even bound to the chronological order of the statements' insertion into the underlying triplestore.

Tip: The purpose of the notification services is to enable the efficient and timely discovery of newly added RDF data. Therefore, it should be treated as a mechanism for giving the client a hint that certain new data is available


and not as an asynchronous SPARQL evaluation engine.

5.5.5.2 What are GraphDB remote notifications

GraphDB's remote notification mechanism provides filtered statement add/remove and transaction begin/end notifications for a local or a remote GraphDB repository. Subscribers use patterns of subject, predicate and object (with wildcards) to filter the statement notifications. JMX is used internally as a transport mechanism.

How to use remote notifications

To register and deregister for notifications, use the NotifyingOwlimConnection class, which is located in graphdb-notifications-<version>.jar in the lib folder of the distribution .zip file. This class wraps a RepositoryConnection object connected to a GraphDB repository and provides an API to add/remove notification listeners of the type RepositoryNotificationsListener.

Here is a simple example of how to use the API when the GraphDB repository is initialised in the same JVM that runs the example (a local repository):

RepositoryConnection conn = null;
// initialize repository connection to GraphDB ...

RepositoryNotificationsListener listener = new RepositoryNotificationsListener() {
    @Override
    public void addStatement(Resource subject, URI predicate,
            Value object, Resource context, boolean isExplicit, long tid) {
        System.out.println("Added: " + subject + " " + predicate + " " + object);
    }

    @Override
    public void removeStatement(Resource subject, URI predicate,
            Value object, Resource context, boolean isExplicit, long tid) {
        System.out.println("Removed: " + subject + " " + predicate + " " + object);
    }

    @Override
    public void transactionStarted(long tid) {
        System.out.println("Started transaction " + tid);
    }

    @Override
    public void transactionComplete(long tid) {
        System.out.println("Finished transaction " + tid);
    }
};

NotifyingOwlimConnection nConn = new NotifyingOwlimConnection(conn);
URIImpl ex = new URIImpl("http://example.com/");

// subscribe for statements with 'ex' as subject
nConn.subscribe(listener, ex, null, null);

// note that this could be any other connection to the same repository
conn.add(ex, ex, ex);
conn.commit();
// statement added should have been printed out

// stop listening for this pattern
nConn.unsubscribe(listener);

Note: The transactionStarted() and transactionComplete() events are not bound to any statement. They


are dispatched to all subscribers, no matter what they are subscribed for. This means that pairs of start/complete events can be detected by the client without receiving any statement notifications in between.

To use a remote repository (e.g., HTTPRepository), the notifying repository connection should be initialised differently:

NotifyingOwlimConnection nConn =
        new NotifyingOwlimConnection(conn, host, port);

where host (String) and port (int) are the host name of the remote machine, in which the repository resides andthe port number of the JMX service in the repository JVM. The other part of the above example is also valid for aremote repository.
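
The following sketch illustrates the remote case end to end. It is a minimal sketch, not taken from the original text: it assumes a Sesame HTTPRepository, an illustrative server URL, repository ID and JMX port, and reuses the listener object defined in the local example above:

    // connect to a remote GraphDB repository over the Sesame HTTP protocol
    HTTPRepository repository = new HTTPRepository("http://example.com:8080/openrdf-sesame", "myRepo");
    repository.initialize();
    RepositoryConnection conn = repository.getConnection();

    // attach to the JMX service of the repository JVM (host and port are illustrative)
    NotifyingOwlimConnection nConn = new NotifyingOwlimConnection(conn, "example.com", 1717);

    // subscribe with the same pattern as in the local example
    nConn.subscribe(listener, new URIImpl("http://example.com/"), null, null);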

How to configure remote notifications

For remote notifications, where the subscriber and the repository are running in different JVM instances (possibly on different hosts), a JMX remote service should be configured in the repository JVM.

This is done by adding the following parameters to the JVM command line:

-Dcom.sun.management.jmxremote.port=1717
-Dcom.sun.management.jmxremote.authenticate=false
-Dcom.sun.management.jmxremote.ssl=false

If the repository is running inside a servlet container, these parameters must be passed to the JVM that runs the container and GraphDB. For Tomcat, this can be done using the JAVA_OPTS or CATALINA_OPTS environment variable.
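
For example, for Tomcat this could be a line in bin/setenv.sh (a sketch for a typical Tomcat setup; the file location and port are assumptions, not from the original text):

    CATALINA_OPTS="-Dcom.sun.management.jmxremote.port=1717 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false"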

The port number used should be exactly the port number that is passed to the NotifyingOwlimConnection constructor (as in the example above). You have to make sure that the specified port (e.g., 1717) is accessible remotely, i.e., that no firewalls or NAT redirection prevent access to it.

5.5.6 Query Behaviour

5.5.6.1 What are named graphs

Hint: GraphDB supports the following SPARQL specifications:

• SPARQL 1.1 Protocol for RDF

• SPARQL 1.1 Query

• SPARQL 1.1 Update

• SPARQL 1.1 Federation

• SPARQL 1.1 Graph Store HTTP Protocol

An RDF database can store collections of RDF statements (triples) in separate graphs identified (named) by a URI. A group of statements with a unique name is called a 'named graph'. An RDF database has one more graph, which does not have a name, and it is called the 'default graph'.

The SPARQL query syntax provides a means to execute queries across default and named graphs using FROM and FROM NAMED clauses. These clauses are used to build an RDF dataset, which identifies what statements the SPARQL query processor will use to answer a query. The dataset contains a default graph and named graphs and is constructed as follows:

• FROM <uri> - brings statements from the database graph, identified by URI, to the dataset's default graph, i.e., the statements 'lose' their graph name.


• FROM NAMED <uri> - brings the statements from the database graph, identified by URI, to the dataset, i.e., the statements keep their graph name.

If either FROM or FROM NAMED is used, the database's default graph is no longer used as input for processing this query. In effect, the combination of FROM and FROM NAMED clauses exactly defines the dataset. This is somewhat bothersome, as it precludes the possibility, for instance, of executing a query over just one named graph and the default graph. However, there is a programmatic way to get around this limitation, as described below.
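
For example, the following query (the graph URIs are illustrative, not from the original text) reads one graph into the dataset's default graph and keeps another addressable by name:

    SELECT *
    FROM <http://example.com/g1>
    FROM NAMED <http://example.com/g2>
    WHERE {
        ?s ?p ?o .                                      # matches statements from g1 only
        GRAPH <http://example.com/g2> { ?s2 ?p2 ?o2 }   # matches statements from g2
    }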

The default SPARQL dataset

Note: The SPARQL specification does not define what happens when no FROM or FROM NAMED clauses are present in a query, i.e., it does not define how a SPARQL processor should behave when no dataset is defined. In this situation, implementations are free to construct the default dataset as necessary.

GraphDB constructs the default dataset as follows:

• The dataset's default graph contains the merge of the database's default graph AND all the database named graphs;

• The dataset contains all named graphs from the database.

This means that if a statement ex:x ex:y ex:z exists in the database in the graph ex:g, then the following query patterns will behave as follows:

    Query                                   Bindings
    SELECT * { ?s ?p ?o }                   ?s=ex:x ?p=ex:y ?o=ex:z
    SELECT * { GRAPH ?g { ?s ?p ?o } }      ?s=ex:x ?p=ex:y ?o=ex:z ?g=ex:g

In other words, the triple ex:x ex:y ex:z will appear to be in both the default graph and the named graph ex:g.

There are two reasons for this behaviour:

1. It provides an easy way to execute a triple pattern query over all stored RDF statements.

2. It allows all named graph names to be discovered, i.e., with this query: SELECT ?g { GRAPH ?g { ?s ?p ?o } }.

5.5.6.2 How to manage explicit and implicit statements

GraphDB maintains two flags for each statement:

• Explicit: the statement is inserted in the database by the user, using SPARQL UPDATE, the Sesame API or the 'imports' configuration parameter. The same explicit statement can exist in the database's default graph and in each named graph.

• Implicit: the statement is created as a result of inference, by either Axioms or Rules. Inferred statements are ALWAYS created in the database's default graph.

These two flags are not mutually exclusive. The following sequences of operations are possible:

• For the operations, use the names 'insert/delete' for explicit, and 'infer/retract' for implicit (retract means that all premises of the statement are deleted or retracted).

• To show the results after each operation, use tuples <statement graph flags>:

– <s G EI> means statement s in graph G having both flags Explicit and Implicit;

– <s _ EI> means statement s in the default graph having both flags Explicit and Implicit;

– <_ G _> means the statement is deleted from graph G.

First, let’s consider operations on statement s in the default graph only:

• insert <s _ E>, infer <s _ EI>, delete <s _ I>, retract <_ _ _>;


• insert <s _ E>, infer <s _ EI>, retract <s _ E>, delete <_ _ _>;

• infer <s _ I>, insert <s _ EI>, delete <s _ I>, retract <_ _ _>;

• infer <s _ I>, insert <s _ EI>, retract <s _ E>, delete <_ _ _>;

• insert <s _ E>, insert <s _ E>, delete <_ _ _>;

• infer <s _ I>, infer <s _ I>, retract <_ _ _> (if the two inferences are from the same premises).

This does not show all possible sequences, but it shows the principles:

• No duplicate statement can exist in the default graph;

• Delete/retract clears the appropriate flag;

• The statement is deleted only after both flags are cleared;

• Deleting an inferred statement has no effect (except to clear the I flag, if any);

• Retracting an inserted statement has no effect (except to clear the E flag, if any);

• Inserting the same statement twice has no effect: insert is idempotent;

• Inferring the same statement twice has no effect: infer is idempotent, and I is a flag, not a counter, but the Retraction algorithm ensures I is cleared only after all premises of s are retracted.

Now, let’s consider operations on statement s in the named graph G, and inferred statement s in the default graph:

• insert <s G E>, infer <s _ I> <s G E>, delete <s _ I>, retract <_ _ _>;

• insert <s G E>, infer <s _ I> <s G E>, retract <s G E>, delete <_ _ _>;

• infer <s _ I>, insert <s G E> <s _ I>, delete <s _ I>, retract <_ _ _>;

• infer <s _ I>, insert <s G E> <s _ I>, retract <s G E>, delete <_ _ _>;

• insert <s G E>, insert <s G E>, delete <_ _ _>;

• infer <s _ I>, infer <s _ I>, retract <_ _ _> (if the two inferences are from the same premises).

The additional principles here are:

• The same statement can exist in several graphs - as explicit in graph G and implicit in the default graph;

• Delete/retract works on the appropriate graph.

Note: In order to avoid a proliferation of duplicate statements, it is recommended not to insert inferrable statements in named graphs.

5.5.6.3 How to query explicit and implicit statements

The database's default graph can contain a mixture of explicit and implicit statements. The Sesame API provides a flag called 'includeInferred', which is passed to several API methods and, when set to false, causes only explicit statements to be iterated or returned. When this flag is set to true, both explicit and implicit statements are iterated or returned.
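
For example, with the Sesame RepositoryConnection API (a minimal sketch; subj stands for any Resource of interest and is not from the original text):

    // only explicit statements about subj (includeInferred = false)
    RepositoryResult<Statement> explicitOnly =
        connection.getStatements(subj, null, null, false);

    // explicit and inferred statements about subj (includeInferred = true)
    RepositoryResult<Statement> withInferred =
        connection.getStatements(subj, null, null, true);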

GraphDB provides extensions for more control over the processing of explicit and implicit statements. These extensions allow the selection of explicit, implicit or both for query answering and also provide a mechanism for identifying which statements are explicit and which are implicit. This is achieved by using some 'pseudo-graph' names in FROM and FROM NAMED clauses, which cause certain flags to be set.

The details are as follows:

FROM <http://www.ontotext.com/explicit> The dataset's default graph includes only explicit statements from the database's default graph.

FROM <http://www.ontotext.com/implicit> The dataset's default graph includes only inferred statements from the database's default graph.


FROM NAMED <http://www.ontotext.com/explicit> The dataset contains a named graph http://www.ontotext.com/explicit that includes only explicit statements from the database's default graph, i.e., quad patterns such as GRAPH ?g {?s ?p ?o} rebind explicit statements from the database's default graph to a graph named http://www.ontotext.com/explicit.

FROM NAMED <http://www.ontotext.com/implicit> The dataset contains a named graph http://www.ontotext.com/implicit that includes only implicit statements from the database's default graph.

Note: These clauses do not affect the construction of the default dataset in the sense that using any combination of the above will still result in a dataset containing all named graphs from the database. All it changes is which statements appear in the dataset's default graph and whether any extra named graphs (explicit or implicit) appear.
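
For instance, a minimal sketch using the pseudo-graphs above: the following query returns only the inferred statements from the database's default graph:

    SELECT *
    FROM <http://www.ontotext.com/implicit>
    WHERE { ?s ?p ?o }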

5.5.6.4 How to specify the dataset programmatically

The Sesame API provides an interface Dataset and an implementation class DatasetImpl for defining the dataset for a query, by providing the URIs of named graphs and adding them to the default graphs and named graphs members. This permits null to be used to identify the default database graph (or the null context, to use Sesame terminology).

    DatasetImpl dataset = new DatasetImpl();
    dataset.addDefaultGraph(null);
    dataset.addNamedGraph(valueFactory.createURI("http://example.com/g1"));

This dataset can then be passed to queries or updates, e.g.:

    TupleQuery query = connection.prepareTupleQuery(QueryLanguage.SPARQL, queryString);
    query.setDataset(dataset);

5.5.6.5 How to access internal identifiers for entities

Internally, GraphDB uses integer identifiers (IDs) to index all entities (URIs, blank nodes and literals). Statement indices are made up of these IDs, and a large data structure is used to map from ID to entity value and back. There are occasions (e.g., when interfacing to an application infrastructure) when having access to these internal IDs can improve the efficiency of data structures external to GraphDB by allowing them to be indexed by an integer value rather than a full URI.

Here, we introduce a special GraphDB predicate and function that provide access to the internal IDs. The datatype of the internal IDs is <http://www.w3.org/2001/XMLSchema#long>.

Predicate:   <http://www.ontotext.com/owlim/entity#id>
Description: A map between an entity and an internal ID
Example:     Select all entities and their IDs:

    PREFIX ent: <http://www.ontotext.com/owlim/entity#>
    SELECT * WHERE {?s ent:id ?id} ORDER BY ?id

Function:    <http://www.ontotext.com/owlim/entity#id>
Description: Return an entity's internal ID
Example:     Select all statements and order them by the internal ID of the object values:

    PREFIX ent: <http://www.ontotext.com/owlim/entity#>
    SELECT * WHERE {?s ?p ?o .} order by ent:id(?o)


Examples

• Enumerate all entities and bind the nodes to ?s and their IDs to ?id, order by ?id:

select * where {
    ?s <http://www.ontotext.com/owlim/entity#id> ?id
} order by ?id

• Enumerate all non-literals and bind the nodes to ?s and their IDs to ?id, order by ?id:

SELECT * WHERE {
    ?s <http://www.ontotext.com/owlim/entity#id> ?id .
    FILTER (!isLiteral(?s)) .
} ORDER BY ?id

• Find the internal IDs of subjects of statements with specific predicate and object values:

SELECT * WHERE {
    ?s <http://test.org#Pred1> "A literal".
    ?s <http://www.ontotext.com/owlim/entity#id> ?id .
} ORDER BY ?id

• Find all statements where the object has the given internal ID by using an explicit, untyped value as the ID (the "115" is used as object in the second statement pattern):

SELECT * WHERE {
    ?s ?p ?o.
    ?o <http://www.ontotext.com/owlim/entity#id> "115" .
}

• As above, but using an xsd:long datatype for the constant within a FILTER condition:

SELECT * WHERE {
    ?s ?p ?o.
    ?o <http://www.ontotext.com/owlim/entity#id> ?id .
    FILTER (?id="115"^^<http://www.w3.org/2001/XMLSchema#long>) .
} ORDER BY ?o

• Find the internal IDs of subject and object entities for all statements:

SELECT * WHERE {
    ?s ?p ?o.
    ?s <http://www.ontotext.com/owlim/entity#id> ?ids.
    ?o <http://www.ontotext.com/owlim/entity#id> ?ido.
}

• Retrieve all statements where the ID of the subject is equal to "115"^^xsd:long, by providing an internal ID value within a filter expression:

SELECT * WHERE {
    ?s ?p ?o.
    FILTER ((<http://www.ontotext.com/owlim/entity#id>(?s))
        = "115"^^<http://www.w3.org/2001/XMLSchema#long>).
}

• Retrieve all statements where the string-ised ID of the subject is equal to "115", by providing an internal ID value within a filter expression:

SELECT * WHERE {
    ?s ?p ?o.
    FILTER (str( <http://www.ontotext.com/owlim/entity#id>(?s) ) = "115").
}


5.5.6.6 How to use Sesame ‘direct hierarchy’ vocabulary

GraphDB supports the Sesame-specific vocabulary for determining 'direct' subclass, subproperty and type relationships. The special vocabulary used and their definitions are shown below (reproduced from the rdf4j Sesame user guide). The three predicates are all defined using the namespace definition:

PREFIX sesame: <http://www.openrdf.org/schema/sesame#>

Predicate:  A sesame:directSubClassOf B
Definition: Class A is a direct subclass of B if:
    1. A is a subclass of B and;
    2. A and B are not equal and;
    3. there is no class C (not equal to A or B) such that A is a subclass of C and C of B.

Predicate:  P sesame:directSubPropertyOf Q
Definition: Property P is a direct subproperty of Q if:
    1. P is a subproperty of Q and;
    2. P and Q are not equal and;
    3. there is no property R (not equal to P or Q) such that P is a subproperty of R and R of Q.

Predicate:  I sesame:directType T
Definition: Resource I is a direct type of T if:
    1. I is of type T and
    2. there is no class U (not equal to T) such that:
        (a) U is a subclass of T and;
        (b) I is of type U.
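
For example, a query such as the following (the class URI is an illustrative assumption) returns only the immediate subclasses of a class, skipping the deeper descendants that plain rdfs:subClassOf matching would also return:

    PREFIX sesame: <http://www.openrdf.org/schema/sesame#>
    SELECT ?sub WHERE {
        ?sub sesame:directSubClassOf <http://example.com/Person> .
    }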

5.5.6.7 Other special GraphDB query behaviour

There are several more special graph URIs in GraphDB, which are used for controlling query evaluation.

FROM / FROM NAMED <http://www.ontotext.com/disable-sameAs> Switch off the enumeration of equivalence classes produced by the sameAs Optimisation. By default, all owl:sameAs URIs are returned by triple pattern matching. This clause reduces the number of results to include a single representative from each owl:sameAs class. For more details, see Not enumerating sameAs.

FROM / FROM NAMED <http://www.ontotext.com/count> Used for triggering the evaluation of the query so that it gives a single result, in which all variable bindings in the projection are replaced with a plain literal holding the value of the total number of solutions of the query. In the case of a CONSTRUCT query in which the projection contains three variables (?subject, ?predicate, ?object), the subject and the predicate are bound to <http://www.ontotext.com/> and the object holds the literal value. This is because there cannot exist a statement with a literal in the place of the subject or predicate. This clause is deprecated in favor of using the COUNT aggregate of SPARQL 1.1.

FROM / FROM NAMED <http://www.ontotext.com/skip-redundant-implicit> Used for triggering the exclusion of implicit statements when there is an explicit one within a specific context (even the default one). Initially implemented to allow for the filtering of redundant rows where the context part is not taken into account and which leads to 'duplicate' results.

FROM <http://www.ontotext.com/distinct> Using this special graph name in DESCRIBE and CONSTRUCT queries will cause only distinct triples to be returned. This is useful when several resources are being described, where the same triple can be returned more than once, i.e., when describing its subject and its object. This clause is deprecated in favor of using the DISTINCT clause of SPARQL 1.1.
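
For instance, a DESCRIBE query can be de-duplicated as follows (a minimal sketch; the class URI is an illustrative assumption):

    PREFIX onto: <http://www.ontotext.com/>
    DESCRIBE ?city
    FROM onto:distinct
    WHERE { ?city a <http://example.com/City> }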

5.5.7 Retain BIND Position Special Graph

The default behavior of the GraphDB query optimiser is to try to reposition BIND clauses so that all of the variables referred to within the EXPR part (on the left side of 'AS') have valid bindings.

If you look at the following data:

INSERT DATA {
    <urn:q> <urn:pp1> 1 .
    <urn:q> <urn:pp2> 2 .
    <urn:q> <urn:pp3> 3 .
}

and try to evaluate a SPARQL query such as the one below (without any rearrangement of the statement patterns):

SELECT ?r {
    ?q <urn:pp1> ?x .
    ?q <urn:pp2> ?y .
    BIND (?x + ?y + ?z AS ?r) .
    ?q <urn:pp3> ?z .
}

the ‘correct’ result would be:

1 result: r=UNDEF

because the expression that sums several variables will not produce any valid bindings for ?r.

But if you rearrange the statement patterns in the same query so that you have bindings for all of the variables used within the sum expression of the BIND clause:

SELECT ?r {
    ?q <urn:pp1> ?x .
    ?q <urn:pp2> ?y .
    ?q <urn:pp3> ?z .
    BIND (?x + ?y + ?z AS ?r) .
}

the query would return a single result and now the value bound to ?r will be 6:

1 result: r=6

By default, the GraphDB query optimiser tries to move the BIND after the last statement pattern, so that all the variables referred to internally have a binding. However, that behavior can be modified by using a special 'system' graph within the dataset section of the query (e.g., as a FROM clause) that has the following URI:

<http://www.ontotext.com/retain-bind-position>.

In this case, the optimiser retains the relative position of the BIND operator within the group in which it appears, so that if you evaluate the following query against the GraphDB repository:

SELECT ?r
FROM <http://www.ontotext.com/retain-bind-position> {
    ?q <urn:pp1> ?x .
    ?q <urn:pp2> ?y .
    BIND (?x + ?y + ?z AS ?r) .
    ?q <urn:pp3> ?z .
}

you will get the following result:

1 result: r=UNDEF

Still, the default evaluation without the special ‘system’ graph provides a more useful result:


1 result: r="6"

5.5.8 Performance Optimisations

The best performance is typically measured by the shortest load time and the fastest query answering. Here are all the factors that affect GraphDB performance:

• Configuring memory

• Data Loading & Query Optimisations

– Dataset loading

– GraphDB’s optional indices

– Cache/index monitoring and optimisations

– Query optimisations

• Explain Plan

• Inference Optimisations

– Delete Optimisations

– Rules Optimisations

– sameAs Optimisation

– RDFS and OWL Support Optimisations

5.5.8.1 Data Loading & Query Optimisations

The life-cycle of a repository instance typically starts with the initial loading of datasets, followed by the processing of queries and updates. The loading of a large dataset can take a long time - up to 12 hours for a billion statements with inference. Therefore, during loading, it is often helpful to use a different configuration than the one for normal operation.

Furthermore, if you frequently load a certain dataset, since it gradually changes over time, the loading configuration can evolve as you become more familiar with the GraphDB behaviour towards this dataset. Many dataset properties only become apparent after the initial load (such as the number of unique entities) and this information can be used to optimise the loading step for the next round, or to improve the configuration for normal operation.

Dataset loading

The following is a typical initialisation life-cycle:

1. Configure a repository for best loading performance with many estimated parameters.

2. Load data.

3. Examine dataset properties.

4. Refine loading configuration.

5. Reload data and measure improvement.

Unless the repository has to answer queries during the initialisation phase, it can be configured with the minimum number of options and indices, where a large portion of the available heap space is given over to the cache memory:

    enablePredicateList = false (unless the dataset has a large number of predicates)
    enable-context-index = false
    in-memory-literal-properties = false
    cache-memory = approximately 50% of total heap space (-Xmx value)


Normal operation

The size of the data structures used to index entities is directly related to the number of unique entities in the loaded dataset. These data structures are always kept in memory. In order to get an upper bound on the number of unique entities loaded and to find the actual amount of RAM used to index them, it is useful to know the contents of the storage folder.

The total amount of memory needed to index entities is equal to the sum of the sizes of the files entities.index and entities.hash. This value can be used to determine how much memory is used and therefore how to divide the remaining memory between the cache-memory, etc.

An upper bound on the number of unique entities is given by the size of entities.hash divided by 12 (memory is allocated in pages, and therefore the last page will likely not be full).

The file entities.index is used to look up entries in the file entities.hash, and its size is equal to the value of the entity-index-size parameter multiplied by 4. Therefore, the entity-index-size parameter has less to do with efficient use of memory and more with the performance of entity indexing and lookup. The larger this value, the fewer collisions occur in the entities.hash table. A reasonable size for this parameter is at least half the number of unique entities. However, the size of this data structure is never changed once the repository is created, so this knowledge can only be used to adjust this value for the next clean load of the dataset with a new (empty) repository.
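
As a worked example (the numbers are hypothetical): if entities.hash is 120,000,000 bytes, the dataset contains at most 120,000,000 / 12 = 10,000,000 unique entities. A reasonable entity-index-size for the next clean load would then be at least 5,000,000, which makes entities.index 5,000,000 * 4 = 20,000,000 bytes, for roughly 140 MB of total entity-indexing memory (entities.index plus entities.hash).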

The following parameters can be adjusted:

entity-index-size Set to a large enough value.

enablePredicateList Can speed up queries (and loading).

enable-context-index To provide better performance when executing queries that use contexts.

in-memory-literal-properties Whether to keep the properties of each literal in-memory.

cache-memory The total memory given for the caches, including tuple-index, predicate, etc.

tuple-index-memory The cache memory for the statement indices (POS, PSO, and optionally PCSO and PCOS).

predicate-memory If predicate lists are enabled.

Note: Do not forget that:

cache-memory = tuple-index-memory + predicate-memory

If any one of these is omitted, it is calculated. If two or more are unspecified, the remaining cache memory is divided evenly between them.

Furthermore, the inference semantics can be adjusted by choosing a different rule-set. However, this will require a reload of the whole repository, otherwise some inferences can remain when they should not.

Note: The optional indices can be built at a later time, when the repository is used for query answering. You need to experiment using typical query patterns from the user environment.

GraphDB’s optional indices

Predicate lists

Predicate lists are two indices (SP and OP) that can improve performance in the following situations:

• When loading/querying datasets that have a large number of predicates;

• When executing queries or retrieving statements that use a wildcard in the predicate position, e.g., the statement pattern: dbpedia:Human ?predicate dbpedia:Land.


As a rough guideline, a dataset with more than about 1000 predicates will benefit from using these indices for both loading and query answering. Predicate list indices are not enabled by default, but can be switched on using the enablePredicateList configuration parameter.

Context indices

To provide better performance when executing queries that use contexts, you can use two other indices - PCSO and PCOS. They are enabled by using the enable-context-index configuration parameter.

Cache/index monitoring and optimisations

Statistics are kept for the main index data structures and include information such as cache hits/misses, file reads/writes, etc. This information can be used to fine-tune the GraphDB memory configuration and can be useful for 'debugging' certain situations, such as understanding why load performance changes over time or with particular datasets.

For each index, there will be a CollectionStatistics MBean published, which shows the cache and file I/O values updated in real-time:

    Package     com.ontotext
    MBean name  CollectionStatistics

The following information is displayed for each MBean/index:


    Attribute              Description
    CacheHits              The number of operations completed without accessing the storage system.
    CacheMisses            The number of operations completed, which needed to access the storage system.
    FlushInvocations
    FlushReadItems
    FlushReadTimeAvarage
    FlushReadTimeTotal
    FlushWriteItems
    FlushWriteTimeAvarage
    FlushWriteTimeTotal
    PageDiscards           The number of times a non-dirty page's memory was reused to read in another page.
    PageSwaps              The number of times a page was written to the disk, so its memory could be used to load another page.
    Reads                  The total number of times an index was searched for a statement or a range of statements.
    Writes                 The total number of times a statement was added to a collection.

The following operations are available:

    Operation      Description
    resetCounters  Resets all the counters for this index.

Ideally, the system should be configured to keep the number of cache misses to a minimum. If the ratio of hits to misses is low, consider increasing the memory available to the index (if other factors permit this).

Page swaps tend to occur much more often during large-scale data loading. Page discards occur more frequently during query evaluation.
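
As an illustration, the counters can also be read programmatically over JMX. The sketch below assumes the JMX service configured as in How to configure remote notifications and that the beans are published under the com.ontotext domain (the exact ObjectName pattern is an assumption and may differ):

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    // connect to the repository JVM's JMX service (host and port are illustrative)
    JMXServiceURL url = new JMXServiceURL("service:jmx:rmi:///jndi/rmi://localhost:1717/jmxrmi");
    JMXConnector jmxc = JMXConnectorFactory.connect(url);
    MBeanServerConnection mbsc = jmxc.getMBeanServerConnection();

    // print the cache counters of every bean in the com.ontotext domain
    for (ObjectName name : mbsc.queryNames(new ObjectName("com.ontotext:*"), null)) {
        System.out.println(name
            + " CacheHits=" + mbsc.getAttribute(name, "CacheHits")
            + " CacheMisses=" + mbsc.getAttribute(name, "CacheMisses"));
    }
    jmxc.close();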

Query optimisations

GraphDB uses a number of query optimisation techniques by default. They can be disabled by setting the enable-optimization configuration parameter to false; however, there is rarely any need to do this. See GraphDB's Explain Plan for a way to view query plans and applied optimisations.

Caching literal language tags

This optimisation applies when the repository contains a large number of literals with language tags and it is necessary to execute queries that filter based on language, e.g., using the following SPARQL query construct:

FILTER ( lang(?name) = "ES" )

In this situation, the in-memory-literal-properties configuration parameter can be set to true, causing the data values with language tags to be cached.

Not enumerating sameAs

During query answering, all URIs from each equivalence class produced by the sameAs optimisation are enumerated. You can use the onto:disable-sameAs pseudo-graph (see Other special GraphDB query behaviour) to significantly reduce these duplicate results: instead of the full enumeration, a single representative from each equivalence class is returned.

Consider these example queries executed against the FactForge combined dataset. Here, the default is to enumerate:


PREFIX dbpedia: <http://dbpedia.org/resource/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT * WHERE { ?c rdfs:subClassOf dbpedia:Airport }

producing many results:

dbpedia:Air_strip
http://sw.cyc.com/concept/Mx4ruQS1AL_QQdeZXf-MIWWdng
umbel-sc:CommercialAirport
opencyc:Mx4ruQS1AL_QQdeZXf-MIWWdng
dbpedia:Jetport
dbpedia:Airstrips
dbpedia:Airport
fb:guid.9202a8c04000641f800000000004ae12
opencyc-en:CommercialAirport

If you specify the onto:disable-sameAs pseudo-graph:

PREFIX onto: <http://www.ontotext.com/>
PREFIX dbpedia: <http://dbpedia.org/resource/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT * FROM onto:disable-sameAs
WHERE { ?c rdfs:subClassOf dbpedia:Airport }

only two results are returned:

dbpedia:Air_strip
opencyc-en:CommercialAirport

The Expand results over equivalent URIs checkbox in the GraphDB Workbench SPARQL editor plays a similar role, but the meaning is reversed.

Warning: If the query uses a filter over the textual representation of a URI, e.g., filter(strstarts(str(?x), "http://dbpedia.org/ontology")), this may skip some valid solutions, as not all URIs within an equivalence class are matched against the filter.

5.5.8.2 Explain Plan

What is GraphDB’s Explain Plan

GraphDB's Explain Plan is a feature that explains how GraphDB executes a SPARQL query and also includes information about unique subject, predicate and object collection sizes. It can help you improve the query, leading to better execution performance.

Warning: For users of GraphDB versions 6.4.3 - 6.6.0, please note that from GraphDB version 6.6.1 on, the Experimental Explain Plan becomes GraphDB's regular Explain Plan.

Activating the explain plan

To see the query explain plan, use the onto:explain pseudo-graph:

PREFIX onto: <http://www.ontotext.com/>
select * from onto:explain
...


Simple explain plan

For the simplest query explain plan possible (?s ?p ?o), execute the following query:

PREFIX onto: <http://www.ontotext.com/>
select * from onto:explain {
    ?s ?p ?o .
}

Depending on the number of triples that you have in the database, the results will vary, but you will get something like the following:

SELECT ?s ?p ?o
{
    { # ----- Begin optimization group 1 -----

        ?s ?p ?o . # Collection size: 108.0
                   # Predicate collection size: 108.0
                   # Unique subjects: 90.0
                   # Unique objects: 55.0
                   # Current complexity: 108.0

    } # ----- End optimization group 1 -----
    # ESTIMATED NUMBER OF ITERATIONS: 108.0
}

This is the same query, but with some estimations next to the statement pattern (1 in this case).

Note: The query might not be the same as the original one. See below the triple patterns in the order in which they are executed internally.

• ----- Begin optimization group 1 ----- - indicates starting a group of statements, which most probably are part of a subquery (in the case of property paths, the group will be the whole path);

• Collection size - an estimation of the number of statements that match the pattern;

• Predicate collection size - the number of statements in the database for this particular predicate (in this case, for all predicates);

• Unique subjects - the number of subjects that match the statement pattern;

• Unique objects - the number of objects that match the statement pattern;

• Current complexity - the complexity (the number of atomic lookups in the index) the database will need to make so far in the optimisation group (most of the time a subquery). When you have multiple triple patterns, these numbers grow fast.

• ----- End optimization group 1 ----- - the end of the optimisation group;

• ESTIMATED NUMBER OF ITERATIONS: 108.0 - the approximate number of iterations that will be executed for this group.

Multiple triple patterns

Note: The result of the explain plan is given in the exact order the engine is going to execute the query.

The following is an example where the engine reorders the triple patterns based on their complexity. The query is a simple join:


PREFIX onto: <http://www.ontotext.com/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

select *
from onto:explain
{
    ?o rdf:type ?o1 .
    ?o rdfs:subPropertyOf ?o2
}

and here is the output:

SELECT ?o ?o1 ?o2
{
    { # ----- Begin optimization group 1 -----

        ?o rdfs:subPropertyOf ?o2 . # Collection size: 20.0
                                    # Predicate collection size: 20.0
                                    # Unique subjects: 19.0
                                    # Unique objects: 18.0
                                    # Current complexity: 20.0

        ?o rdf:type ?o1 .           # Collection size: 43.0
                                    # Predicate collection size: 43.0
                                    # Unique subjects: 34.0
                                    # Unique objects: 7.0
                                    # Current complexity: 860.0

    } # ----- End optimization group 1 -----
    # ESTIMATED NUMBER OF ITERATIONS: 25.294117647058822
}

Understanding the output:

• ?o rdfs:subPropertyOf ?o2 has a lower collection size (20 instead of 43), so it will be executed first.

• ?o rdf:type ?o1 has a bigger collection size (43 instead of 20), so it will be executed second (although it is written first in the original query).

• The current complexity grows fast because it multiplies. In this case, you can expect to get 20 results from the first statement pattern and then you have to join them with the results from the second triple pattern, which results in the complexity of 20 * 43 = 860.

• Although the complexity for the whole group is 860, the estimated number of iterations for this group is 25.3.

Wine queries

All of the following examples refer to our simple wine dataset (wine.ttl). The file is quite small, but here is some basic explanation about the data:

• There are different types of wine (Red, White, Rose).

• Each wine has a label.

• Wines are made from different types of grapes.

• Wines contain different levels of sugar.

• Wines are produced in a specific year.


First query with aggregation

A typical aggregation query contains a group with some aggregation function. Here, we have added an explain graph:

# Retrieve the number of wines produced in each year along with the year
PREFIX onto: <http://www.ontotext.com/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX : <http://www.ontotext.com/example/wine#>
SELECT (count(?wine) as ?wines) ?year
from onto:explain
WHERE {
    ?wine rdf:type :Wine .
    optional {
        ?wine :hasYear ?year
    }
}
group by ?year
ORDER BY DESC(?wines)

When you execute the query on GraphDB, you get the following as an output (instead of the real results):

SELECT (COUNT(?wine) AS ?wines) ?year
{
    { # ----- Begin optimization group 1 -----

        ?wine rdf:type onto:example/wine#Wine . # Collection size: 5.0
                                                # Predicate collection size: 64.0
                                                # Unique subjects: 50.0
                                                # Unique objects: 12.0
                                                # Current complexity: 5.0

    } # ----- End optimization group 1 -----
    # ESTIMATED NUMBER OF ITERATIONS: 5.0
    OPTIONAL
    {
        { # ----- Begin optimization group 2 -----

            ?wine onto:example/wine#hasYear ?year . # Collection size: 5.0
                                                    # Predicate collection size: 5.0
                                                    # Unique subjects: 5.0
                                                    # Unique objects: 2.0
                                                    # Current complexity: 5.0

        } # ----- End optimization group 2 -----
        # ESTIMATED NUMBER OF ITERATIONS: 5.0
    }
}
GROUP BY ?year
ORDER BY DESC(?wines)
LIMIT 1000

5.5.8.3 Inference Optimisations


Delete Optimisations

GraphDB's inference policy is based on materialisation, where implicit statements are inferred from explicit statements as soon as they are inserted into the repository, using the specified semantics ruleset. This approach has the advantage of achieving query answering very quickly, since no inference needs to be done at query time.

However, no justification information is stored for inferred statements, therefore deleting a statement normally requires a full re-computation of all inferred statements, which can take a very long time for large datasets.

GraphDB uses a special technique for handling the deletion of explicit statements and their inferences, called 'smooth delete'. It allows fast delete operations as well as ensures that schemas can be changed when necessary.

The algorithm

The algorithm for identifying and removing the inferred statements that can no longer be derived from the explicit statements that have been deleted is as follows:

1. Use forward-chaining to determine what statements can be inferred from the statements marked for deletion.

2. Use backward-chaining to see if these statements are still supported by other means.

3. Delete the explicit statements and the no longer supported inferred statements.

Note: We recommend that you mark the visited statements as read-only. Otherwise, as almost all delete operations follow inference paths that touch schema statements, which then lead to almost all other statements in the repository, the 'smooth delete' can take a very long time. However, since a read-only statement cannot be deleted, there is no reason to find what statements are inferred from it (such inferred statements might still get deleted, but they will be found by following other inference paths).

Statements are marked as read-only if they occur in the Axioms section of the ruleset files (standard or custom) or are loaded at initialisation time via the imports configuration parameter.

Note: When using 'smooth delete', we recommend that you load all ontology/schema/vocabulary statements using the imports configuration parameter.

Example

Consider the following statements:

Schema:

    <foaf:name> <rdfs:domain> <owl:Thing> .
    <MyClass> <rdfs:subClassOf> <owl:Thing> .

Data:

    <wayne_rooney> <foaf:name> "Wayne Rooney" .
    <Reviewer40476> <rdf:type> <MyClass> .
    <Reviewer40478> <rdf:type> <MyClass> .
    <Reviewer40480> <rdf:type> <MyClass> .
    <Reviewer40481> <rdf:type> <MyClass> .

When using the owl-horst rule-set the removal of the statement:

<wayne_rooney> <foaf:name> "Wayne Rooney"

will cause the following sequence of events:


rdfs2:
    x a y           - (x=<wayne_rooney>, a=foaf:name, y="Wayne Rooney")
    a rdfs:domain z - (a=foaf:name, z=owl:Thing)
    -----------------------
    x rdf:type z    - The inferred statement [<wayne_rooney> rdf:type owl:Thing] is to be removed.

rdfs3:
    x a u           - (x=<wayne_rooney>, a=rdf:type, u=owl:Thing)
    a rdfs:range z  - (a=rdf:type, z=rdfs:Class)
    -----------------------
    u rdf:type z    - The inferred statement [owl:Thing rdf:type rdfs:Class] is to be removed.

rdfs8_10:
    x rdf:type rdfs:Class - (x=owl:Thing)
    -----------------------
    x rdfs:subClassOf x   - The inferred statement [owl:Thing rdfs:subClassOf owl:Thing] is to be removed.

proton_TransitiveOver:
    y q z                      - (y=owl:Thing, q=rdfs:subClassOf, z=owl:Thing)
    p protons:transitiveOver q - (p=rdf:type, q=rdfs:subClassOf)
    x p y                      - (x=[<Reviewer40476>, <Reviewer40478>, <Reviewer40480>, <Reviewer40481>], p=rdf:type, y=owl:Thing)
    -----------------------
    x p z                      - The inferred statements [<Reviewer40476> rdf:type owl:Thing], etc., are to be removed.

Statements such as [<Reviewer40476> rdf:type owl:Thing] exist because of the statements [<Reviewer40476> rdf:type <MyClass>] and [<MyClass> rdfs:subClassOf owl:Thing].

In large datasets, there are typically millions of statements [X rdf:type owl:Thing] and they are all visited by the algorithm.

The [X rdf:type owl:Thing] statements are not the only problematic statements considered for removal. Every class that has millions of instances leads to similar behaviour.

One check to see if a statement is still supported requires about 30 query evaluations with owl-horst, hence the slow removal.

If [owl:Thing rdf:type owl:Class] is marked as an axiom (because it is derived by statements from the schema, which must be axioms), then the process stops when reaching this statement. So, in the current version, the schema (the system statements) must necessarily be imported through the imports configuration parameter in order to mark the schema statements as axioms.

Schema transactions

As mentioned above, ontologies and schemas imported at initialisation time using the imports configuration parameter are flagged as read-only. However, there are times when it is necessary to change a schema, and this can be done inside a 'system transaction'.

The user instructs GraphDB that the transaction is a system transaction by including a dummy statement with the special schemaTransaction predicate, i.e.:

    _:b1 <http://www.ontotext.com/owlim/system#schemaTransaction> _:b2

This statement is not inserted into the database, but rather it serves as a flag telling GraphDB that the statements from this transaction are going to be inserted as read-only; all statements derived from them are also marked as read-only. When you delete statements in a system transaction, you can remove statements marked as read-only, as well as statements derived from them. Axiom statements and all statements derived from them stay untouched.
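
For example, a schema change could be submitted as a SPARQL update of the following shape (a sketch, assuming the update runs as a single transaction; the class URIs are illustrative):

    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX sys: <http://www.ontotext.com/owlim/system#>
    INSERT DATA {
        # the dummy statement marks the transaction as a system transaction
        _:b1 sys:schemaTransaction _:b2 .
        <http://example.com/MyClass> rdfs:subClassOf <http://example.com/MyBaseClass> .
    }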


Rules Optimisations

GraphDB 6 includes a useful new feature that allows you to debug rule performance.

How to enable rule profiling

To enable rule profiling, start GraphDB with the following Java option:

-Denable-debug-rules=true

This enables the collection of rule statistics (various counters).

Note: Rule profiling slows down the rule execution (the leading premise checking part) by 10-30%, so do not use it in production.

Log file

When rule profiling is enabled:

• Complete rule statistics are printed every 1M statements, every 5 minutes, or on shutdown (whichever comes first).

• They are written to openrdf-sesame/logs/main-<date>.log after a line such as Rule statistics:

• They are cumulative (you only need to work with the last one).

• Rule variants are ordered by total time (descending).

For example, consider the following rule:

Id: ptop_PropRestr
    t <ptop:premise> p
    t <ptop:restriction> r
    t <ptop:conclusion> q
    t <rdf:type> <ptop:PropRestr>
    x p y
    x r y
    ----------------
    x q y

This is a conjunction of two properties. It is declared with the axiomatic (A-Box) triples involving t. Whenever the premise p and restriction r hold between two resources, the rule infers the conclusion q between the same resources, i.e., p & r => q.

The corresponding log for variant 4 of this rule may look like the following:

RULE ptop_PropRestr_4 invoked 163,475,763 times.
ptop_PropRestr_4:
    e b f
    a ptop_premise b
    a rdf_type ptop_PropRestr
    e c f
    a ptop_restriction c
    a ptop_conclusion d
    ------------------------------------
    e d f

a ptop_conclusion d invoked 1,456,793 times and took 1,814,710,710 ns.
a rdf_type ptop_PropRestr invoked 7,261,649 times and took 9,409,794,441 ns.
a ptop_restriction c invoked 1,456,793 times and took 1,901,987,589 ns.
e c f invoked 17,897,752 times and took 635,785,943,152 ns.
a ptop_premise b invoked 10,175,697 times and took 9,669,316,036 ns.
Fired 1,456,793 times and took 157,163,249,304 ns.
Inferred 1,456,793 statements.
Time overall: 815,745,001,232 ns.

Note: Variable names are renamed due to the compilation to Java bytecode.

Understanding the output:

• The premises are checked in the order given in RULE. (The premise statistics printed after the blank line are not in any particular order.)

• invoked is the number of times the rule variant or specific premise was checked successfully. Tracing through the rule:

– ptop_PropRestr_4 checked successfully 163M times: for each incoming triple, since the lead premise (e b f = x p y) is a free pattern.

– a ptop_premise b checked successfully 10M times: for each b=p that has an axiomatic triple involving ptop_premise.

This premise was selected because it has only 1 unbound variable a and it is first in the rule text.

– a rdf_type ptop_PropRestr checked successfully 7M times: for each ptop_premise that has type ptop_PropRestr.

This premise was selected because it has 0 unbound variables (after the previous premise binds a).

• The time to check each premise is printed in ns.

• fired is the number of times all premises matched, so the rule variant was fired.

• inferred is the number of inferred triples.

It may be more than "fired" if there are multiple conclusions.
It may be less than "fired" since a duplicate triple is not inferred a second time.

• time overall is the total time that this rule variant took.

Excel format

The log records detailed information about each rule and premise, which is indispensable when you are trying to understand which rule is spending too much time. However, it can still be overwhelming because of this level of detail.

Therefore, we have developed the script rule-stats.pl that outputs a TSV file such as the following:

    rule            ver  tried      time   patts  checks     time   fired     time   triples  speed
    ptop_PropChain  4    163475763  776.3  5      117177482  185.3  15547176  590.9  9707142  12505

Parameters:

• rule: the rule ID (name);

• ver: the rule version (variant) or “T” for overall rule totals;

• tried, time: the number of times the rule/variant was tried, the overall time in sec;

• patts: the number of triple patterns (premises) in the rule, not counting the leading premise;

• checks, time: the number of times premises were checked, time in sec;


• fired: the number of times all premises matched, so the rule was fired;

• triples: the number of inferred triples;

• speed: inference speed, triples/sec.

Run the script in the following way:

perl rule-stats.pl main-2014-07-28.log > main-2014-07-28.xls

Investigating performance

The following is an example of using the Excel format to investigate where time is spent during rule execution.

Download the example file time-spent-during-rule.xlsx and use it as a template.

Note: These formulas are dynamic and they refresh every time you change the filters.

To perform your investigation:

1. Open the results in Excel.

2. Set a filter “ver=T” (to first look at rules as a whole, not rule variants).

3. Sort in a descending order by total “time” (third column).

4. Check out which rules are highlighted in red (the rules that spend substantial time and whose speed is significantly lower than average).

5. Pick up a rule (for example, PropRestr).

6. Filter on “rule=PropRestr” and “ver<>T” to look at its variants.

7. Focus on a variant to investigate the reasons for time and speed performance.

In this example, first you have to focus on the variant ptop_PropRestr_5, which spends 30% of the time of this rule and has very low "speed". The reason is that it fired 1.4M times but produced only 238 triples, so most of the inferred triples were duplicates.


You can find the definition of this variant in the log file:

RULE ptop_PropRestr_5 invoked 163,475,763 times.
ptop_PropRestr_5:
    e c f
    a ptop_restriction c
    a rdf_type ptop_PropRestr
    e b f
    a ptop_premise b
    a ptop_conclusion d
    ------------------------------------
    e d f

It is very similar to the productive variant ptop_PropRestr_4 (see Log file above):

• one checks e b f. a ptop_premise b first,

• the other checks e c f. a ptop_restriction c first.

Still, the function of these premises in the rule is the same and therefore the variant ptop_PropRestr_5 (which is checked after 4) is unproductive.

The most likely way to improve performance would be to make the two premises use the same axiomatic triple ptop:premise (emphasising that they have the same role), and introduce a Cut:

Id: ptop_PropRestr_SYM
    t <ptop:premise> p
    t <ptop:premise> r
    t <ptop:conclusion> q
    t <rdf:type> <ptop:PropRestr>
    x p y
    x r y [Cut]
    ----------------
    x q y

The Cut eliminates the rule variant with x r y as leading premise. It is legitimate to do this, since the two variants are the same, up to the substitution p<->r.

Note: Introducing a Cut in the original version of the rule would not be legitimate:

Id: ptop_PropRestr_CUT
    t <ptop:premise> p
    t <ptop:restriction> r
    t <ptop:conclusion> q
    t <rdf:type> <ptop:PropRestr>
    x p y
    x r y [Cut]
    ----------------
    x q y

since it would omit some potential inferences (in the case above, 238 triples), changing the semantics of the rule (see the example below).

Assume these axiomatic triples:

    :t_CUT a ptop:PropRestr; ptop:premise :p; ptop:restriction :r; ptop:conclusion :q.  # for ptop_PropRestr_CUT
    :t_SYM a ptop:PropRestr; ptop:premise :p; ptop:premise :r; ptop:conclusion :q.      # for ptop_PropRestr_SYM

Now consider a sequence of inserted triples :x :p :y. :x :r :y.


• ptop_PropRestr_CUT will not infer :x :q :y, since no variant is fired by the second incoming triple :x :r :y: it is matched against x p y, but there is no axiomatic triple t ptop:premise :r.

• ptop_PropRestr_SYM will infer :x :q :y, since the second incoming triple :x :r :y will match x p y and t ptop:premise :r, then the previously inserted :x :p :y will match t ptop:premise :p and the rule will fire.

Tip: Rule execution is often non-intuitive, therefore we recommend that you keep a detailed speed history and compare the performance after each change.

Hints on optimising GraphDB’s rule-sets

The complexity of the rule-set has a large effect on the loading performance, the number of inferred statements and the overall size of the repository after inferencing. The complexity of the standard rule-sets increases as follows:

• none (lowest complexity, best performance)

• rdfs-optimized

• rdfs

• owl-horst-optimized

• owl-horst

• owl-max-optimized

• owl-max

• owl2-ql-optimized

• owl2-ql

• owl2-rl-optimized

• owl2-rl (highest complexity, worst performance)

OWL RL and OWL QL do a lot of heavy work that is often not required by applications. For more details, see OWL Compliance.

Know what you want to infer

Check the 'expansion ratio' (total/explicit statements) for your dataset and get an idea whether this is what you expect. If your rule-set infers, for example, 4 times more statements over a large number of explicit statements, this will take time, no matter how you try to optimise the rules.

Minimise the number of rules

The number of rules and their complexity affects inferencing performance, even for rules that never infer any new statements. This is because every incoming statement is passed through every variant of every rule to check whether something can be inferred. This often results in many checks and joins, even if the rule never fires.

So, start with a minimal rule-set and add only the additional rules that you require. The default rule-set (owl-horst-optimized) works for many people, but you could even consider starting from RDFS. For example, if you need owl:SymmetricProperty and owl:inverseOf on top of RDFS, you can copy only these rules from OWL Horst to RDFS and leave the rest aside.

Conversely, you can start with a bigger standard rule-set and remove the rules that you do not need.


Note: To deploy a custom rule-set, set the rule-set configuration parameter to the full pathname of your custom .pie file.

Write your rules carefully

• Be careful with the recursive rules as they can lead to an explosion in the number of inferred statements.

• Always check your spelling:

– A misspelled variable in a premise leads to a Cartesian explosion of the number of triple joins to be considered by the rule.

– A misspelled variable in a conclusion (or an unbound variable) causes new blank nodes to be created. This is almost never what you really want.

• Order premises by specificity. GraphDB first checks premises with the least number of unbound variables. But if there is a tie, it follows the order given by you. Since you may know the cardinalities of triples in your data, you may be in a better position to determine which premise has better specificity (selectivity).

• Use [Cut] for premises that have the same role (for an example, see Investigating performance), but be careful not to remove some needed inferences by mistake.

Avoid duplicate statements

Avoid inserting explicit statements in a named graph if the same statements are inferable. GraphDB always stores inferred statements in the default graph, so this will lead to duplicating statements. This will increase the repository size and will slow down query answering.

You can eliminate duplicates from query results using DISTINCT or FROM onto:skip-redundant-implicit (see Other special GraphDB query behaviour). But these are slow operations and it is better not to produce duplicate statements in the first place.

Know the implications of ontology mapping

People often use owl:equivalentProperty and owl:equivalentClass (and less often rdfs:subPropertyOf and rdfs:subClassOf) to map ontologies. But every such assertion means that many more statements are inferred (owl:equivalentProperty works as a pair of rdfs:subPropertyOf, and owl:equivalentClass works as a pair of rdfs:subClassOf).

A good example is DCTerms (DCT): almost every DC property has a declared DCT subproperty and there is also a hierarchy amongst the DCT properties, for instance:

    dcterms:created rdfs:subPropertyOf dc:date, dcterms:date .
    dcterms:date rdfs:subPropertyOf dc:date .

This means that every dcterms:created statement will expand to 3 statements. So, do not load the DC ontology unless you really need the inferred dc:date statements.
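
To illustrate (:doc is a hypothetical resource), inserting the single statement:

    :doc dcterms:created "2010-01-01" .

also materialises :doc dcterms:date "2010-01-01" and :doc dc:date "2010-01-01" via the subproperty hierarchy above.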

Consider avoiding inverse statements

Inverse properties (e.g., :p owl:inverseOf :q) offer some convenience in querying, but are never necessary:

• SPARQL natively has bidirectional data access: instead of ?x :q ?y, you can always query for ?y :p ?x.

• You can even invert the direction in a property path: instead of ?x :p1/:q ?y, use ?x :p1/(^:p) ?y


If an ontology defines inverses but you skip inverse reasoning, you have to check which of the two properties is used in a particular dataset, and write your queries carefully.

The Provenance Ontology (PROV-O) has considered this dilemma carefully and has abstained from defining inverses, to "avoid the need for OWL reasoning, additional code, and larger queries" (see http://www.w3.org/TR/prov-o/#inverse-names).

Consider avoiding long transitive chains

A chain of n transitive relations (e.g., rdfs:subClassOf) causes GraphDB to infer and store a further (n^2 - n)/2 statements. If the relationship is also symmetric (e.g., in a family ontology with a predicate such as relatedTo), then there will be n^2 - n inferred statements.
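
As a quick worked example (the sizes are hypothetical): a chain of n = 99 transitive steps between 100 nodes yields a further (99^2 - 99)/2 = 4,851 stored statements; if the relation is also symmetric, the count grows to 99^2 - 99 = 9,702.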

Consider removing the transitivity and/or symmetry of relations that make long chains. Or, if you must have them, consider the implementation of TransitiveProperty through a step property (described below), which can be faster than the standard implementation of owl:TransitiveProperty.

Consider specialised property constructs

While OWL2 has very powerful class constructs, its property constructs are quite weak. Some widely used OWL2 property constructs can be implemented faster.

See this draft for some ideas and clear illustrations. Below we describe 3 of these ideas.

Tip: To learn more, see a detailed account of applying some of these ideas in a real-world setting.

PropChain

Consider a 2-place PropChain instead of the general owl:propertyChainAxiom.

owl:propertyChainAxiom needs to use intermediate nodes and edges in order to unroll the rdf:List representing the chain. Since most chains found in practice are 2-place chains (and a chain of any length can be implemented as a sequence of 2-place chains), consider a rule such as the following:

Id: ptop_PropChain
  t <ptop:premise1> p1
  t <ptop:premise2> p2
  t <ptop:conclusion> q
  t <rdf:type> <ptop:PropChain>
  x p1 y
  y p2 z
  ----------------
  x q z

It is used with axiomatic triples as in the following:

@prefix ptop: <http://www.ontotext.com/proton/protontop#> .
:t a ptop:PropChain ;
  ptop:premise1 :p1 ;
  ptop:premise2 :p2 ;
  ptop:conclusion :q .
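For illustration, a minimal data sketch (with hypothetical properties :p1, :p2, and :q, using the axiomatic triples above):

@prefix : <http://example.org/> .

:john :p1 :mary .
:mary :p2 :acme .

# With the :t axioms above, the ptop_PropChain rule infers:
# :john :q :acme .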

transitiveOver

ptop:transitiveOver has been part of Ontotext’s PROTON ontology since 2008. It is defined as follows:


Id: ptop_transitiveOver
  p <ptop:transitiveOver> q
  x p y
  y q z
  ---------------
  x p z

It is a specialised PropChain where premise1 and conclusion coincide. It allows you to chain p with q on the right, yielding p. For example, the inference of types along the class hierarchy can be expressed as:

rdf:type ptop:transitiveOver rdfs:subClassOf
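For illustration, a minimal sketch of the effect (hypothetical :Dog and :Animal classes):

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix : <http://example.org/> .

:fido rdf:type :Dog .
:Dog rdfs:subClassOf :Animal .

# Chaining rdf:type with rdfs:subClassOf on the right yields rdf:type again:
# :fido rdf:type :Animal .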

TransitiveProperty through step property

owl:TransitiveProperty is widely used and is usually implemented as follows:

Id: owl_TransitiveProperty
  p <rdf:type> <owl:TransitiveProperty>
  x p y
  y p z
  ----------
  x p z

You may recognise this as a self-chain, thus a specialisation of ptop:transitiveOver, i.e.:

?p rdf:type owl:TransitiveProperty <=> ?p ptop:transitiveOver ?p

Most transitive properties comprise a transitive closure over a basic ‘step’ property. For example, skos:broaderTransitive is based on skos:broader and is implemented as:

skos:broader rdfs:subPropertyOf skos:broaderTransitive .
skos:broaderTransitive a owl:TransitiveProperty .

Now consider a chain of N skos:broader statements between two nodes. The owl_TransitiveProperty rule has to consider every split of the chain, thus inferring the same closure between the two nodes N times, leading to quadratic inference complexity.

This can be optimised by looking for the step property s and extending the chain only at the right end:

Id: TransitiveUsingStep
  p <rdf:type> <owl:TransitiveProperty>
  s <rdfs:subPropertyOf> p
  x p y
  y s z
  ----------
  x p z

However, this would not make the same inferences as owl_TransitiveProperty if someone inserts the transitive property explicitly (which is a bad practice).

It is more robust to declare the step and transitive properties together using ptop:transitiveOver, for instance:

skos:broader rdfs:subPropertyOf skos:broaderTransitive .
skos:broaderTransitive ptop:transitiveOver skos:broader .
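For illustration, a minimal data sketch of the linear behaviour (hypothetical concepts):

@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix : <http://example.org/> .

:espresso skos:broader :coffee .
:coffee skos:broader :beverage .

# Inferred by extending the chain only at the right end:
# :espresso skos:broaderTransitive :coffee , :beverage .
# :coffee skos:broaderTransitive :beverage .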

sameAs Optimisation


owl:sameAs challenges

The performance of a GraphDB repository is greatly improved by a specific optimisation that handles owl:sameAs statements efficiently. owl:sameAs declares that two different URIs identify the same resource. Most often, it is used to align identifiers of the same real-world entity across different datasets.

For example, in DBpedia the URI of Vienna is http://dbpedia.org/page/Vienna, while in GeoNames it is http://sws.geonames.org/2761369. DBpedia contains the statement

(S1) dbpedia:Vienna owl:sameAs geonames:2761369

which declares the two URIs as equivalent.

owl:sameAs is probably the most important OWL predicate when it comes to integrating data from different data sources and interlinking RDF datasets. However, its semantics causes an explosion in the number of inferred statements. Following the formal definition of OWL (OWL2 RL, to be more specific), whenever two URIs are declared equivalent, all statements that involve one of the URIs should be ‘replicated’ using the other URI in the same position.

For instance, in GeoNames the city of Vienna is defined as part of http://www.geonames.org/2761367 (the first-order administrative division in Austria with the same name), which, in turn, is part of Austria, http://www.geonames.org/2782113:

(S2) geonames:2761369 gno:parentFeature geonames:2761367
(S3) geonames:2761367 gno:parentFeature geonames:2782113

Since gno:parentFeature is a transitive relationship, it will be inferred that the city of Vienna is also part of Austria:

(S4) geonames:2761369 gno:parentFeature geonames:2782113

Due to the semantics of owl:sameAs, from (S1) it should also be inferred that statements (S2) and (S4) hold for Vienna’s DBpedia URI:

(S5) dbpedia:Vienna gno:parentFeature geonames:2761367
(S6) dbpedia:Vienna gno:parentFeature geonames:2782113

These implicit statements must hold no matter which of the equivalent URIs is used in a query. When you consider that Austria also has an equivalent URI in DBpedia:

(S7) geonames:2782113 owl:sameAs dbpedia:Austria

you should also infer that:

(S8) dbpedia:Vienna gno:parentFeature dbpedia:Austria
(S9) geonames:2761369 gno:parentFeature dbpedia:Austria
(S10) geonames:2761367 gno:parentFeature dbpedia:Austria

In the above example, you have two alignment statements (S1 and S7), two statements carrying specific factual knowledge (S2 and S3), one statement inferred due to a transitive property (S4), and seven statements inferred as a result of the owl:sameAs expansion (S5, S6, S8, S9, S10, as well as the inverse statements of S1 and S7).

As you see, inference without owl:sameAs increases the dataset by 25% (one new statement on top of 4 explicit ones), while the owl:sameAs inference increases the full closure by 175% (7 new statements). But Vienna also has a URI in UMBEL: if you add an owl:sameAs for this alignment, it will cause the inference of 4 new implicit statements (duplicates of S1, S5, S6, and S8). Although this is a small example, it provides an indication of the performance implications of using owl:sameAs alignment statements from Linked Open Data.

Furthermore, owl:sameAs is an equivalence relation (transitive, reflexive, and symmetric). Thus, for a set of n equivalent URIs, n² owl:sameAs statements should be considered; a cluster of just 10 equivalent URIs already implies 100 such statements.


GraphDB solution

GraphDB addresses these challenges by handling owl:sameAs in a special manner. It does not explode the statement indices, nor does it store a quadratic number of owl:sameAs statements per cluster (equivalence class). Instead, each owl:sameAs cluster is represented by a single supernode, and all statements are recorded against a selected representative of each cluster. During query evaluation, GraphDB uses a kind of backward-chaining by enumerating equivalent URIs, thus guaranteeing completeness of inference and query results. Special care is taken to ensure that this optimisation does not hinder the ability to distinguish between explicit and implicit statements.

Note: This optimisation is applied even when the ‘empty’ (no inference) rule-set is selected. The owl:sameAs optimisation can be disabled completely using the disable-sameAs configuration parameter.
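For per-query control, GraphDB also exposes (an assumption here, following the special query behaviour described for onto:skip-redundant-implicit) the onto:disable-sameAs pseudo-graph, which suppresses the enumeration of equivalent URIs for a single query:

PREFIX onto: <http://www.ontotext.com/>
PREFIX gno: <http://www.geonames.org/ontology#>

# Return only the representative URIs, without sameAs expansion
SELECT ?parent
FROM onto:disable-sameAs
WHERE {
  <http://sws.geonames.org/2761369> gno:parentFeature ?parent .
}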

RDFS and OWL Support Optimisations

There are several features in the RDFS and OWL specifications that lead to inefficient entailment rules and axioms, which can have a significant impact on the performance of the inferencer. For example:

• The consequence X rdf:type rdfs:Resource for each URI node in the RDF graph;

• The system should be able to infer that URIs are classes and properties if they appear in schema-defining statements such as X rdfs:subClassOf Y and X rdfs:subPropertyOf Y;

• The individual equality property in OWL is reflexive, i.e., the statement X owl:sameAs X holds for every OWL individual;

• All OWL classes are subclasses of owl:Thing, and X rdf:type owl:Thing should hold for all individuals;

• C is inferred as being an rdfs:Class whenever an instance of the class is defined: I rdf:type C.
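For illustration, these are the kinds of low-value statements such inferences produce (ex: is a hypothetical namespace):

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix ex: <http://example.org/> .

ex:anyNode rdf:type rdfs:Resource .    # holds for every URI node in the graph
ex:someThing owl:sameAs ex:someThing . # reflexive sameAs, one per individual
ex:someThing rdf:type owl:Thing .      # holds for every OWL individual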

Although the above inferences are important for the completeness of the formal semantics, users rarely execute queries that seek such statements. Moreover, these inferences generate so many inferred statements that performance and scalability can be significantly degraded.

For this reason, optimised versions of the standard rule-sets are provided. These have -optimized appended to the rule-set name, e.g., owl-horst-optimized.

The following optimisations are enacted in GraphDB:

Optimisation: Remove axiomatic triples
Affects patterns:
  • <any> <any> <rdfs:Resource>
  • <rdfs:Resource> <any> <any>
  • <any> <rdfs:domain> <rdf:Property>
  • <any> <rdfs:range> <rdf:Property>
  • <owl:sameAs> <rdf:type> <owl:SymmetricProperty>
  • <owl:sameAs> <rdf:type> <owl:TransitiveProperty>

Optimisation: Remove rule conclusions
Affects patterns:
  • <any> <any> <rdfs:Resource>

Optimisation: Remove rule constraints
Affects patterns:
  • [Constraint <variable> != <rdfs:Resource>]


5.6 Experimental Features

5.6.1 Additional Plugins

5.6.1.1 Plugin API improvements

Support for external plugins

External plugins are implemented by code outside the GraphDB classpath and loaded through their own dedicated class loaders.

They are packaged in /plugin directories and grouped into one or more /plugin home directories. A dedicated class loader is created for loading the code from each /plugin directory. Every plugin should provide, in its /plugin directory, all dependencies that cannot be expected to be provided through the GraphDB classpath.

Note: For the current implementation, we recommend using a single /plugin directory for the implementationof each plugin.

Class-loading and plugin dependencies

To allow maximum freedom, each external plugin is loaded in a self-first class loader (contrary to Java’s parent-first), which means that external plugins can use whatever dependencies they want, including ones that override GraphDB’s dependencies.

Warning: There is one limitation: as the Java ecosystem is full of various frameworks which do class instantiation based on reflection or custom class loaders, be aware of the class-loading in the frameworks you use (Spring, slf4j, GATE, XML libraries, even Lucene) in order to prevent ClassCastExceptions and LinkageErrors. Usually, the issue is that a dependency uses the current thread’s context class loader, which is the GraphDB class loader (since webapp containers run each webapp in its own thread pool and class loader), and this allows the mixing of plugin and core dependencies.

5.6.1.2 External plugins

The following plugins were previously shipped within OWLIM and have been externalised in order to cut the core dependency on Lucene.

• Full-text Search - allows you to create full-text search queries through SPARQL;

• Lucene.

5.6.2 SPARQL-MM Support

5.6.2.1 How to install SPARQL-MM support

Hint: GraphDB 6.4 introduces support for SPARQL-MM, a multimedia extension for SPARQL 1.1. The implementation is based on code by Thomas Kurz.

The SPARQL-MM support is implemented as a GraphDB plugin.

Note: Currently, the plugin is not enabled by default.


To enable the plugin, follow these steps:

1. Locate the plugin .zip file in the plugins/sparql-mm folder of the GraphDB distribution.

2. Unzip the file into your plugins directory (by default, root_of_unpacked_web_app/WEB-INF/classes/plugins).

5.6.2.2 Usage examples

Temporal Relations

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX mm: <http://linkedmultimedia.org/sparql-mm/ns/1.0.0/function#>

SELECT ?t1 ?t2 WHERE {
  ?f1 rdfs:label ?t1 .
  ?f2 rdfs:label ?t2 .
  FILTER mm:precedes(?f1, ?f2)
} ORDER BY ?t1 ?t2

Temporal aggregation

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX mm: <http://linkedmultimedia.org/sparql-mm/ns/1.0.0/function#>

SELECT ?f1 ?f2 (mm:temporalIntermediate(?f1, ?f2) AS ?box) WHERE {
  ?f1 rdfs:label "a" .
  ?f2 rdfs:label "b" .
}

Spatial relations

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX mm: <http://linkedmultimedia.org/sparql-mm/ns/1.0.0/function#>

SELECT ?t1 ?t2 WHERE {
  ?f1 rdfs:label ?t1 .
  ?f2 rdfs:label ?t2 .
  FILTER mm:rightBeside(?f1, ?f2)
} ORDER BY ?t1 ?t2

Spatial aggregation

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX mm: <http://linkedmultimedia.org/sparql-mm/ns/1.0.0/function#>

SELECT ?f1 ?f2 (mm:spatialIntersection(?f1, ?f2) AS ?box) WHERE {
  ?f1 rdfs:label "a" .
  ?f2 rdfs:label "b" .
}


Combined aggregation

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX mm: <http://linkedmultimedia.org/sparql-mm/ns/1.0.0/function#>

SELECT ?f1 ?f2 (mm:boundingBox(?f1, ?f2) AS ?box) WHERE {
  ?f1 rdfs:label "a" .
  ?f2 rdfs:label "b" .
}

Accessor method

PREFIX ma: <http://www.w3.org/ns/ma-ont#>
PREFIX mm: <http://linkedmultimedia.org/sparql-mm/ns/1.0.0/function#>

SELECT ?f1 WHERE {
  ?f1 a ma:MediaFragment .
} ORDER BY mm:duration(?f1)

Tip: For more information, see:

• The SPARQL-MM Specification

• List of SPARQL-MM functions

5.6.3 GeoSPARQL support

5.6.3.1 What is GeoSPARQL

GeoSPARQL is a standard from the Open Geospatial Consortium (OGC) for representing and querying geospatial linked data for the Semantic Web. The standard provides:

• a small topological ontology in RDFS/OWL for representation, using Geography Markup Language (GML) and Well-Known Text (WKT) literals;

• Simple Features, RCC8, and DE-9IM (a.k.a. Egenhofer) topological relationship vocabularies and ontologies for qualitative reasoning;

• a SPARQL query interface using a set of topological SPARQL extension functions for quantitative reasoning.

The following is a simplified diagram of some geometry classes and properties:


5.6.3.2 Installation

The GeoSPARQL support is implemented as a GraphDB plugin, which is currently not installed by default.

To install the plugin, follow these steps:

1. Locate the plugin .zip file in the plugins/geosparql folder of the GraphDB distribution.

2. Unzip the file into your plugins directory (by default, root_of_unpacked_web_app/WEB-INF/classes/plugins).

Note: You still have to enable the plugin after you install it (see Enable plugin).

5.6.3.3 Usage

Plugin control predicates

The plugin allows you to configure it through SPARQL UPDATE queries with embedded control predicates.

Enable plugin

When the plugin is enabled, it indexes all existing GeoSPARQL data in the repository and automatically reindexes any updates.

PREFIX : <http://www.ontotext.com/plugins/geosparql#>

INSERT DATA {
  _:s :enabled "true" .
}


Disable plugin

When the plugin is disabled, it does not index any data or process updates, and it does not handle any of the GeoSPARQL predicates either.

PREFIX : <http://www.ontotext.com/plugins/geosparql#>

INSERT DATA {
  _:s :enabled "false" .
}

Force reindex GeoSPARQL geometry data

This configuration option is usually used when index files are either corrupted or have been mistakenly deleted.

PREFIX : <http://www.ontotext.com/plugins/geosparql#>

INSERT DATA {
  _:s :forceReindex ""
}

GeoSPARQL predicates

The following are some examples of select queries on geographic data.

For demo purposes, just import the following files:

• geosparql-simple-features-geometries.rdf

• geosparql-example.rdf

and run the following queries on them:

Example 1

PREFIX my: <http://example.org/ApplicationSchema#>
PREFIX geo: <http://www.opengis.net/ont/geosparql#>
PREFIX geof: <http://www.opengis.net/def/function/geosparql/>

SELECT ?f
WHERE {
  my:A my:hasExactGeometry ?aGeom .
  ?aGeom geo:asWKT ?aWKT .
  ?f my:hasExactGeometry ?fGeom .
  ?fGeom geo:asWKT ?fWKT .
  FILTER (geof:sfContains(?aWKT, ?fWKT) && !sameTerm(?aGeom, ?fGeom))
}

Example 2

PREFIX my: <http://example.org/ApplicationSchema#>
PREFIX geo: <http://www.opengis.net/ont/geosparql#>
PREFIX geof: <http://www.opengis.net/def/function/geosparql/>

SELECT ?f
WHERE {
  ?f my:hasPointGeometry ?fGeom .
  ?fGeom geo:asWKT ?fWKT .
  FILTER (geof:sfWithin(?fWKT, '''
    <http://www.opengis.net/def/crs/OGC/1.3/CRS84>
    Polygon ((-83.4 34.0, -83.1 34.0, -83.1 34.2, -83.4 34.2, -83.4 34.0))
  '''^^geo:wktLiteral))
}

Example 3

PREFIX my: <http://example.org/ApplicationSchema#>
PREFIX geo: <http://www.opengis.net/ont/geosparql#>
PREFIX geof: <http://www.opengis.net/def/function/geosparql/>

SELECT ?f
WHERE {
  ?f my:hasExactGeometry ?fGeom .
  ?fGeom geo:asWKT ?fWKT .
  my:A my:hasExactGeometry ?aGeom .
  ?aGeom geo:asWKT ?aWKT .
  my:D my:hasExactGeometry ?dGeom .
  ?dGeom geo:asWKT ?dWKT .
  FILTER (geof:sfTouches(?fWKT, geof:union(?aWKT, ?dWKT)))
}

Example 4

PREFIX uom: <http://www.opengis.net/def/uom/OGC/1.0/>
PREFIX my: <http://example.org/ApplicationSchema#>
PREFIX geo: <http://www.opengis.net/ont/geosparql#>
PREFIX geof: <http://www.opengis.net/def/function/geosparql/>

SELECT ?f
WHERE {
  my:C my:hasExactGeometry ?cGeom .
  ?cGeom geo:asWKT ?cWKT .
  ?f my:hasExactGeometry ?fGeom .
  ?fGeom geo:asWKT ?fWKT .
  FILTER (?fGeom != ?cGeom)
} ORDER BY ASC(geof:distance(?cWKT, ?fWKT, uom:metre)) LIMIT 3

Example 5

PREFIX geo: <http://www.opengis.net/ont/geosparql#>
PREFIX my: <http://example.org/ApplicationSchema#>

SELECT ?f
WHERE {
  ?f geo:sfOverlaps my:AExactGeom
}

Example 6


Note: Using geometry literals in the object position is a GraphDB extension and is not part of the GeoSPARQL specification.

PREFIX geo: <http://www.opengis.net/ont/geosparql#>
PREFIX my: <http://example.org/ApplicationSchema#>

SELECT ?f
WHERE {
  ?f geo:sfOverlaps
    "Polygon((-83.6 34.1, -83.2 34.1, -83.2 34.5, -83.6 34.5, -83.6 34.1))"^^geo:wktLiteral
}

Tip: For more information about GeoSPARQL predicates and functions, see the current official spec: OGC 11-052r4, Version 1.0, Approval Date: 2012-04-27, Publication Date: 2012-09-10.

5.6.4 Provenance Plugin

5.6.4.1 Description

The Provenance plugin generates, at query time, all statements that can be inferred from the statements of a specific graph when they are combined with the axiomatic triples, i.e., statements added as part of a schema transaction (read-only).

In its essence, the plugin does forward-chaining over the statements in a specific graph. The closure is computed during query evaluation and considers only statements marked as AXIOM, in conjunction with the statements in the target graph.

The basic approach is to create an ‘inference context’ using the provenance:derivedFrom predicate, where the target graph is passed as the subject; its ‘inference context’ is created with a request scope and bound to the object variable of the provenance:derivedFrom triple pattern. Once created, the statements contained in the graph, along with the statements directly derived from them (recursively), are enumerated when the triple pattern is re-evaluated by the query engine. The subject, predicate, and object of any solution can be accessed using the provenance:subject, provenance:predicate, and provenance:object predicates, which accept the ‘inference context’ as a subject component and bind the respective statement component within the object variable of the respective pattern. In case the object of these patterns is already bound, the solutions are filtered so that only the statements that have the specified component bound are returned.

5.6.4.2 Vocabulary

Common plugin namespace

<http://www.ontotext.com/provenance/>

Plugin predicates

The predicate that creates an ‘inference context’ from a specific graph: <http://www.ontotext.com/provenance/derivedFrom>.

The set of predicates that access the solution using the currently bound ‘inference context’ from its subject, predicate, and object components:

<http://www.ontotext.com/provenance/subject>
<http://www.ontotext.com/provenance/predicate>
<http://www.ontotext.com/provenance/object>


5.6.4.3 Examples

Example 1:

A query that returns all inferences derived from the triples in the <g:1> graph:

PREFIX pr: <http://www.ontotext.com/provenance/>

select * {
  <g:1> pr:derivedFrom ?ctx .
  ?ctx pr:subject ?subject .
  ?ctx pr:predicate ?predicate .
  ?ctx pr:object ?object .
}

Example 2:

A query that returns all inferences derived from the triples in the <g:1> graph, which have an object that matches a specific RDF node:

PREFIX pr: <http://www.ontotext.com/provenance/>

select * {
  <g:1> pr:derivedFrom ?ctx .
  ?ctx pr:subject ?subject .
  ?ctx pr:predicate ?predicate .
  ?ctx pr:object <unr:object> .
}

Example 3:

A more elaborate example with some background data and results returned from the sample queries.

1. Configure a repository with an rdfs rule-set and add the following ‘axiomatic’ data.

Note: sys:schemaTransaction is used to mark the statements as ‘axioms’.

INSERT DATA {
  <u:label> rdf:type rdf:Property .
  <u:label> rdfs:subPropertyOf rdfs:label .
  <u:prop> rdfs:domain <u:Class> .
  <u:Class> rdfs:subClassOf <u:ClassBase> .
  <u:ClassBase> rdfs:subClassOf <u:Root> .
  [] <http://www.ontotext.com/owlim/system#schemaTransaction> [] .
}

2. Add some user data to the two user graphs <g:1> and <g:2>:

INSERT DATA {
  GRAPH <g:1> {
    <u:s:1> <u:prop> <u:val:1> .
    <u:shared> <u:label> "shared:g1" .
  }
  GRAPH <g:2> {
    <u:s:2> <u:prop> <u:val:2> .
    <u:shared> <u:label> "shared:g2" .
  }
}

3. Evaluate a query using the Provenance plugin:


prefix pr: <http://www.ontotext.com/provenance/>

select * {
  values ?g {<g:1>}
  ?g pr:derivedFrom ?r .
  ?r pr:subject ?subject .
  ?r pr:predicate ?predicate .
  ?r pr:object ?object .
}

4. Get results such as:

g    object        subject    predicate
g:1  "shared:g1"   u:shared   u:label
g:1  "shared:g1"   u:shared   http://www.w3.org/2000/01/rdf-schema#label
g:1  u:val:1       u:s:1      u:prop
g:1  u:Class       u:s:1      http://www.w3.org/1999/02/22-rdf-syntax-ns#type
g:1  u:ClassBase   u:s:1      http://www.w3.org/1999/02/22-rdf-syntax-ns#type
g:1  u:Root        u:s:1      http://www.w3.org/1999/02/22-rdf-syntax-ns#type

To get only the statements that have the <u:shared> node as a subject (use ‘VALUES’ for easy reading):

prefix pr: <http://www.ontotext.com/provenance/>

select * {
  values ?subject {<u:shared>}
  values ?g {<g:1>}
  ?g pr:derivedFrom ?r .
  ?r pr:subject ?subject .
  ?r pr:predicate ?predicate .
  ?r pr:object ?object .
}

the results are, respectively:

g    object        subject    predicate
g:1  "shared:g1"   u:shared   u:label
g:1  "shared:g1"   u:shared   http://www.w3.org/2000/01/rdf-schema#label

Alternatively, to get only the rdf:type statements, the query may look like the following:

prefix pr: <http://www.ontotext.com/provenance/>

select * {
  values ?predicate {rdf:type}
  values ?g {<g:1>}
  ?g pr:derivedFrom ?r .
  ?r pr:subject ?subject .
  ?r pr:predicate ?predicate .
  ?r pr:object ?object .
}

and the results will be:

g    subject  object        predicate
g:1  u:s:1    u:Class       http://www.w3.org/1999/02/22-rdf-syntax-ns#type
g:1  u:s:1    u:ClassBase   http://www.w3.org/1999/02/22-rdf-syntax-ns#type
g:1  u:s:1    u:Root        http://www.w3.org/1999/02/22-rdf-syntax-ns#type

5.6.5 Blueprints RDF Support

To install the Blueprints API:


1. Download Gremlin 2.6.0 from https://github.com/tinkerpop/gremlin/wiki/Downloads.

2. Unzip the file gremlin-groovy-2.6.0.zip in a convenient location.

3. Go to the newly extracted folder, e.g., my_files/gremlin-groovy-2.6.0.

4. Put the file graphdb-blueprints-rdf-1.0.jar in the /lib subfolder.

5. Run the Gremlin console by executing bin/gremlin.sh or bin/gremlin.bat.

6. Connect to a GraphDB repository by executing one of the following:

• g = new com.ontotext.blueprints.GraphDBSailGraph("<URL to a GraphDB repository>"), or

• g = new com.ontotext.blueprints.GraphDBSailGraph("<URL to a GraphDB repository>", "<username>", "<password>")

Hint: GraphDB 6.4 introduces support for SAIL graphs with the Blueprints API. You can use it to access GraphDB through the Gremlin query language.

Tip: For more information, see the following:

• https://github.com/tinkerpop/gremlin/wiki

• https://github.com/tinkerpop/blueprints/wiki

• https://en.wikipedia.org/wiki/Gremlin_%28programming_language%29

5.6.6 RDF Priming

5.6.6.1 What is RDF Priming

RDF Priming is a technique that selects a subset of the available statements for use as input to query answering. It is based upon the concept of ‘spreading activation’ as developed in cognitive science. RDF Priming is a scalable and customisable implementation of this popular connectionist method on top of RDF graphs, which allows the ‘priming’ of large datasets with respect to concepts relevant to the context and the query. It is controlled via SPARQL ASK queries.

5.6.6.2 RDF Priming configuration

To enable RDF Priming over the repository, set the repository-type configuration parameter to weighted-file-repository.

Note: The current implementation of RDF Priming does not store activation values, which means that they are only available at runtime and are lost when the repository is shut down. However, they can be exported and imported using the special query directives shown below. Another side effect is that the activation values are global, because they are stored within the shared Entity Pool.

5.6.6.3 Controlling RDF Priming

The initialisation and management of the RDF Priming module is done via SPARQL ASK queries, which set all parameters and default values. They use the following special system predicates:


Predicate: http://www.ontotext.com/owlim/RDFPriming#enableSpreading
Function: Enable Activation Spreading
Description: Enables or disables the RDF Priming module. The Object value of the statement pattern must be a Literal whose value is either true or false.
Example:

PREFIX prim: <http://www.ontotext.com/owlim/RDFPriming#>
ASK {
  _:b1 prim:enableSpreading "true"
}

Predicate: http://www.ontotext.com/owlim/RDFPriming#decayActivations
Function: Set Activation Decay
Description: Alters all activation values for the nodes in the RDF graph by multiplying them by a factor, specified as a Literal in the Object position of the statement pattern of the query. The following example resets all the activation values to zero by multiplying them by 0.0.
Example:

PREFIX prim: <http://www.ontotext.com/owlim/RDFPriming#>
ASK {
  _:b1 prim:decayActivations "0.0"
}

Predicate: http://www.ontotext.com/owlim/RDFPriming#spreadActivation
Function: Trigger Activation Spreading Cycle
Description: Triggers an activation spreading cycle that starts from the nodes scheduled for activation for this round. No special values are required for the Subject or Object part of the statement pattern - blank nodes suffice.
Example:

PREFIX prim: <http://www.ontotext.com/owlim/RDFPriming#>
ASK {
  _:b1 prim:spreadActivation _:b2
}

Predicate: http://www.ontotext.com/owlim/RDFPriming#assignWeight
Function: Set Statement Weight
Description: Sets a non-default weight factor for statements with a specific predicate. The Subject of the statement pattern is the predicate to which the new value must be set. The Object of the pattern is the new weight value as a Literal. The example query sets 0.5 as a weight factor for all rdfs:subClassOf statements.
Example:

PREFIX prim: <http://www.ontotext.com/owlim/RDFPriming#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
ASK {
  rdfs:subClassOf prim:assignWeight "0.5"
}


Predicate: http://www.ontotext.com/owlim/RDFPriming#activateNode
Function: Schedule Nodes for Activation
Description: Schedules the nodes specified as Subject or Object of the statement pattern for activation. Scheduling for activation can also be performed by evaluating an ASK query with variables in the body, in which case the nodes bound to the variables used in the query will be scheduled for activation. The behaviour of such an ASK query is altered so that all solutions are exhausted before returning the query result. This can take a long time, since LIMIT and OFFSET are not available in this case. The first example activates two nodes, gossip:hasTrack and prel:hasChild; the second example activates many nodes identifying people (and their names) that have an album called “American Life”.
Example:

# Example 1
PREFIX prim: <http://www.ontotext.com/owlim/RDFPriming#>
PREFIX gossip: <http://www.ontotext.com/rascalli/2008/04/gossipdb.owl#>
PREFIX prel: <http://proton.semanticweb.org/2007/10/proton_rel#>
ASK {
  gossip:hasTrack prim:activateNode prel:hasChild
}

# Example 2
PREFIX gossip: <http://www.ontotext.com/rascalli/2008/04/gossipdb.owl#>
PREFIX onto: <http://www.ontotext.com#>
ASK {
  ?person gossip:hasAlbum ?album .
  ?album gossip:name "American Life" .
  ?person gossip:name ?name
}

Predicate: http://www.ontotext.com/owlim/RDFPriming#activationThreshold
Function: Activation Threshold
Description: During activation spreading, activations are accumulated in nodes and can grow indefinitely. The activationThreshold allows trimming these values to a certain threshold. The default value of this parameter is 1.0, which means that all values bigger than 1.0 are set to 1.0 on every iteration. This parameter is applied on every iteration of the process and guarantees that no activations larger than the parameter value will be encountered.
Example:

PREFIX prim: <http://www.ontotext.com/owlim/RDFPriming#>
ASK {
  prim:activationThreshold prim:setParam "0.9"
}

Predicate: http://www.ontotext.com/owlim/RDFPriming#decayFactor
Function: Decay Factor
Description: Used during spreading activation to control how much of a node’s activation level is transferred to the nodes that it affects. The following example query sets the new decayFactor to 0.55.
Example:

PREFIX prim: <http://www.ontotext.com/owlim/RDFPriming#>
ASK {
  prim:decayFactor prim:setParam "0.55"
}


Predicate: http://www.ontotext.com/owlim/RDFPriming#defaultActivation
Function: Default Activation Value
Description: Sets the default activation value for all nodes in the repository. If the default activation is not preset, then the default activation for all repository nodes is 0. This does not affect the activation origin nodes, whose activation values are set by using http://www.ontotext.com/owlim/RDFPriming#initialActivation.
Example:

PREFIX prim: <http://www.ontotext.com/owlim/RDFPriming#>
ASK {
  prim:defaultActivation prim:setParam "0.4"
}

Predicate: http://www.ontotext.com/owlim/RDFPriming#defaultWeight
Function: Default Weight
Description: Edges in the RDF graph can be given weights, which are multiplied by the source node activation in order to compute the activation that is spread across the edge to the destination node (see assignWeight). If the predicate of the edge is not given any specific weight (via assignWeight), then the edge weight is assumed to be 1/3 (one third). This default weight can be changed by using the defaultWeight parameter. Any floating-point value in the range [0-1] can be used.
Example:

PREFIX prim: <http://www.ontotext.com/owlim/RDFPriming#>
ASK {
  prim:defaultWeight prim:setParam "0.2"
}

Predicate: http://www.ontotext.com/owlim/RDFPriming#exportActivations
Function: Export Activation Values
Description: Used for exporting the activation values of a set of nodes. The values are stored in a file, identified by the URL given as the Object of the statement pattern. The format of the data in the file is simply one URI per line, followed by a tab character and the floating-point value of its activation.
Example:

PREFIX prim: <http://www.ontotext.com/owlim/RDFPriming#>
ASK {
  prim:exportActivations prim:setParam "file:///D/work/my_activations.txt"
}

Predicate: http://www.ontotext.com/owlim/RDFPriming#filterThreshold
Function: Filter Threshold
Description: Sets the new filter threshold value used for deciding when a statement is visible, depending on the activation level of its subject, predicate, and object.
Example:

PREFIX prim: <http://www.ontotext.com/owlim/RDFPriming#>
ASK {
  prim:filterThreshold prim:setParam "0.50"
}

Predicate: http://www.ontotext.com/owlim/RDFPriming#firingThreshold
Function: Firing Threshold
Description: Sets the threshold above which a node will activate its neighbours.
Example:

PREFIX prim: <http://www.ontotext.com/owlim/RDFPriming#>
ASK {
  prim:firingThreshold prim:setParam "0.25"
}

Predicate: http://www.ontotext.com/owlim/RDFPriming#importActivations
Function: Import Activation Values
Description: Used for importing activation values for a set of nodes. The values are loaded from a file, identified by the URL given as the Object of the statement pattern. The format of the data in the file is simply one URI per line, followed by a tab character and the floating-point value of its activation.
Example:

PREFIX prim: <http://www.ontotext.com/owlim/RDFPriming#>
ASK {
  prim:importActivations prim:setParam "file:///D/work/my_activations.txt"
}

Predicate: http://www.ontotext.com/owlim/RDFPriming#initialActivation
Function: Initial Activation Value
Description: Sets the initial activation value for each of the nodes from which the activation process starts. The nodes that are scheduled for activation will receive that amount at the beginning of the spreading activation process.
Example:

PREFIX prim: <http://www.ontotext.com/owlim/RDFPriming#>
ASK {
  prim:initialActivation prim:setParam "0.66"
}

Predicate: http://www.ontotext.com/owlim/RDFPriming#maxNodesFiredPerCycle
Function: Maximum Nodes Fired Per Cycle
Description: Sets the number of nodes that should fire activations during one spreading activation cycle. The default value is 100000.
Example:

PREFIX prim: <http://www.ontotext.com/owlim/RDFPriming#>
ASK {
  prim:maxNodesFiredPerCycle prim:setParam "10000"
}

Predicate: http://www.ontotext.com/owlim/RDFPriming#cycles
Function: Number of Cycles
Description: Sets the number of activation spreading cycles to perform when the process is initiated.
Example:

PREFIX prim: <http://www.ontotext.com/owlim/RDFPriming#>
ASK {
  prim:cycles prim:setParam "4"
}

Predicate: http://www.ontotext.com/owlim/RDFPriming#workerThreads
Function: Number of Worker Threads
Description: Sets the number of worker threads that will perform the spreading activation (the default is 2).
Example:

PREFIX prim: <http://www.ontotext.com/owlim/RDFPriming#>
ASK {
  prim:workerThreads prim:setParam "4"
}

5.6.6.4 RDF Priming example

The following example uses data from DBpedia (http://dbpedia.org/About), imported into GraphDB with the RDF Priming mode enabled. The management queries can be evaluated through the Sesame console application.

The initial step is to evaluate a demo query that retrieves all the instances of the dbpedia:V8 concept:

SELECT * WHERE {
  ?x <http://dbpedia.org/property/class> <http://dbpedia.org/resource/V8>
}

The above query returns the following results:

?x
------------------------------------
dbpedia3:Jaguar_AJ-V8_engine
dbpedia3:BMW_M62
dbpedia3:BMW_N62
dbpedia3:Chrysler_Flathead_engine
dbpedia3:Duramax_V8_engine
dbpedia3:Ford_385_engine
dbpedia3:Ford_MEL_engine
dbpedia3:Ford_Power_Stroke_engine
dbpedia3:Ford_Y-block_engine
dbpedia3:Ford_Yamaha_V8_engine
dbpedia3:GM_Premium_V_engine
dbpedia3:Lincoln_Y-block_V8_engine
dbpedia3:Mercedes-Benz_M113_engine
dbpedia3:Nissan_VH_engine
dbpedia3:Nissan_VK_engine
dbpedia3:BMW_N63
dbpedia3:Toyota_UR_engine
dbpedia3:Toyota_UZ_engine

Here, the query returns many engines from different manufacturers. The RDF Priming module can be used to reduce the number of results returned by this query by targeting the query at specific parts of the global RDF graph, i.e., the parts of the graph that have been activated.

The following example shows how to configure the RDF Priming module so that the example query returns a smaller set of more specific results. It is assumed that there is an available SPARQL endpoint connected to a running repository instance.

1. Enable the RDF Priming module:

PREFIX onto: <http://www.ontotext.com/owlim/RDFPriming#>
ASK { _:b1 onto:enableSpreading "true" . }

2. Change the default decay factor:

PREFIX onto: <http://www.ontotext.com/owlim/RDFPriming#>
ASK { onto:decayFactor onto:setParam "0.55" . }

3. Change the firing threshold parameter:

PREFIX onto: <http://www.ontotext.com/owlim/RDFPriming#>
ASK { onto:firingThreshold onto:setParam "0.25" . }

4. Change the filter threshold:

PREFIX onto: <http://www.ontotext.com/owlim/RDFPriming#>
ASK { onto:filterThreshold onto:setParam "0.60" . }

5. Change the initial Activation Level to reflect the specifics of the dataset:

PREFIX onto: <http://www.ontotext.com/owlim/RDFPriming#>
ASK { onto:initialActivation onto:setParam "0.66" . }

6. Adjust the Weight factors for a specific predicate so that it activates the relevant subset of the RDF graph,in this case the rdfs:subClassOf predicate:

PREFIX onto: <http://www.ontotext.com/owlim/RDFPriming#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
ASK { rdfs:subClassOf onto:assignWeight "0.5" . }

7. Alter the Weight Factor of the rdf:type predicate so that it does not propagate activations to the classes from the activated instances.

Note: This is a useful technique when there are a lot of instances and a very large classification taxonomy, which should not be broadly activated (as is the case with the DBpedia dataset).

PREFIX onto: <http://www.ontotext.com/owlim/RDFPriming#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
ASK { rdf:type onto:assignWeight "0.1" . }

8. Activate the node for the Ford Motor Company, dbpedia3:Ford_Motor_Company, and one of the cars they built, dbpedia3:1955_Ford, which came out of the factory with a very nice V8 engine:


PREFIX onto: <http://www.ontotext.com/owlim/RDFPriming#>
PREFIX dbpedia3: <http://dbpedia.org/resource/>
ASK { dbpedia3:1955_Ford onto:activateNode dbpedia3:Ford_Motor_Company }

Note: If the example query is executed without activating some nodes of the RDF graph, it will return no results.

9. Tell the RDF Priming module to spread the activations from these two nodes:

PREFIX onto: <http://www.ontotext.com/owlim/RDFPriming#>
ASK { _:b0 onto:spreadActivation _:b1 . }

This normally takes 8-10 seconds, after which the example query can be re-evaluated with the following results:

?x
------------------------------------
dbpedia3:Jaguar_AJ-V8_engine
dbpedia3:BMW_M62
dbpedia3:Ford_385_engine
dbpedia3:Ford_MEL_engine
dbpedia3:Ford_Y-block_engine

Now, the result set is smaller, and most of the engines retrieved are made by Ford. The result set contains Jaguar_AJ-V8_engine and the BMW_M62 engine because Ford and BMW owned Jaguar at the time.

10. Disable the RDF Priming module to return to the normal operating mode:

PREFIX onto: <http://www.ontotext.com/owlim/RDFPriming#>
ASK { _:b1 onto:enableSpreading "false" . }

5.6.7 Nested Repositories

5.6.7.1 What are nested repositories

Nested repositories is a technique for sharing RDF data between multiple GraphDB repositories. It is most useful when several logically independent repositories need to make use of a large (reference) dataset, e.g., a combination of one or more LOD datasets (GeoNames, DBpedia, MusicBrainz, etc.), but where each repository adds its own specific data. This mechanism allows the data in the common repository to be logically included, or ‘nested’, within other repositories that extend it. A repository that is nested in another repository (possibly in more than one other repository) is called a ‘parent repository’, while a repository that nests a parent repository is called a ‘child repository’. The RDF data in the common repository is combined with the data in each child repository for inference purposes. Changes in the common repository are reflected across all child repositories, and inferences are maintained to be logically consistent.

Results for queries against a child repository are computed from the contents of the child repository, as well as the nested repository. The following diagram illustrates the nested repositories concept:


Note: When two child repositories extend the same nested repository, they remain logically separate. Only changes made to the common nested repository will affect any child repositories.

5.6.7.2 Inference, indexing and queries

A child repository ignores all values for its ruleset parameter and automatically uses the same rule-set as its parent repository. Child repositories compute inferences based on the union of the explicit data stored in the child and parent repositories. Changes to either parent or child cause the set of inferred statements in the child to be updated.

Note: The child repository must be initialised (running) when updates to the parent repository take place; otherwise, the child can become logically inconsistent.

When a parent repository is updated, before its transaction is committed, it updates every connected child repository via a set of statement INSERT/DELETE operations. When a child repository is updated, any new resources are recorded in the parent dictionary so that the same resource is indexed in the sibling child repositories using the same internal identifier.

Note: A current limitation on the implementation is that no updates using the owl:sameAs predicate are permitted.

Queries executed on a child repository should perform almost as well as queries executed against a repository containing all the data (from both the parent and child repositories).

5.6.7.3 Configuration

Both parent and child repositories must be deployed using Tomcat, and they must be deployed to the same instance on the same machine (same JVM).

Repositories that are configured to use the nesting mechanism must be created using specific Sesame SAIL types:

• owlim:ParentSail - for parent (shared) repositories;


• owlim:ChildSail - for child repositories that extend parent repositories.

(Where the owlim namespace above maps to http://www.ontotext.com/trree/owlim#.)

Additional configuration parameters:

• owlim:id is used in the parent configuration to provide a nesting name;

• owlim:parent-id is used in the child configurations to identify the parent repository.
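The following is a minimal sketch of a parent repository configuration, assuming the standard Sesame repository configuration template; only the SAIL type and the nesting-specific parameters come from this section, and the surrounding structure may vary with your setup:

@prefix rep: <http://www.openrdf.org/config/repository#> .
@prefix sr: <http://www.openrdf.org/config/repository/sail#> .
@prefix sail: <http://www.openrdf.org/config/sail#> .
@prefix owlim: <http://www.ontotext.com/trree/owlim#> .

[] a rep:Repository ;
   rep:repositoryID "shared" ;
   rep:repositoryImpl [
     rep:repositoryType "openrdf:SailRepository" ;
     sr:sailImpl [
       sail:sailType "owlim:ParentSail" ;  # owlim:ChildSail for a child repository
       owlim:id "shared-data" ;            # nesting name referenced by children
       owlim:ruleset "rdfs"                # children inherit this rule-set
     ]
   ] .

A child configuration would use sail:sailType "owlim:ChildSail" and owlim:parent-id "shared-data" instead.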

Once created, a child repository must not be reconfigured to use a different parent repository, as this leads to inconsistent data.

Note: When setting up several GraphDB instances to run in the same Java Virtual Machine, i.e., the JVM used to host Tomcat, make sure that the configured memory settings take into account the other repositories. For example, if setting up 3 GraphDB instances, configure them as though each of them had only one third of the total Java heap space available.

5.6.7.4 Initialisation and shut down

To initialise nested repositories correctly, start the parent repository followed by each of its children.

As long as no further updates occur, the shutdown sequence is not strictly defined. However, we recommend that you shut down the children first, followed by the parent.

5.6.8 LVM-based Backup and Replication

In essence, the Linux Logical Volume Management (LVM)-based Backup and Replication uses shell scripts to find out the logical volume and volume group where the repository storage folder resides, and then creates a filesystem snapshot. Once the snapshot is created, the repository is available for reads and writes while the maintenance operation is still in progress. When it finishes, the snapshot is removed and the changes are merged back into the filesystem.

5.6.8.1 Prerequisites

• Linux OS;

• The system property (JVM’s -D) named lvm-scripts should point to the folder containing the LVM scripts (see below);

• The folder you are about to back up or use for replication contains a file named owlim.properties;

• That folder DOES NOT HAVE a file named lock.

All of the above mean that the repository storage is ‘ready’ for maintenance.

5.6.8.2 How it works

By default, the LVM-based Backup and Replication feature is disabled.

To enable it:

1. Get the scripts located in the lvmscripts folder of the distribution.

2. Place them on each of the workers in a chosen folder.

3. Set the system property (JVM’s -D) named lvm-scripts, e.g., -Dlvm-scripts=<folder-with-the-scripts>, to point to the folder with the scripts.


Note: GraphDB checks if the folder contains scripts named create-snapshot.sh, release-snapshot.sh, and locatelvm.sh. This is done the first time you try to get the repository storage folder contents, for example, when you need to do a backup or to perform full replication.

GraphDB executes the script locatelvm.sh with a single parameter, which is the pathname of the storage folder from which you want to transfer the data (either to perform a backup or to replicate it to another node). While invoking it, GraphDB captures the script’s standard and error output streams in order to get the logical volume, volume group, and the storage location relative to the volume’s mount point.

GraphDB also checks the exit code of the script (it MUST be 0) and fetches the locations by processing the script output, which must contain the logical volume (after lv=), the volume group (after vg=), and the relative path (after local=) from the mount point of the folder supplied as a script argument.

If the storage folder is not located on an LVM2-managed volume, the script will fail with a different exit code (it relies on the exit code of the lvs command) and the whole operation will revert back to the ‘classical’ way of doing it (same as in the previous versions).

If it succeeds in finding the volume group and the logical volume, the create-snapshot.sh script is executed, which then creates a snapshot named after the value of the $BACKUP variable (see the config.sh script, which also defines where the snapshot will be mounted). When the script is executed, the logical volume and volume group are passed as environment variables named LV and VG, preset by GraphDB.

If it passes without any errors (script exit code = 0), the node is immediately initialised in order to be available for further operations (reads and writes).

The actual maintenance operation will now use the data from the ‘backup’ volume instead of from where it is normally mounted.

When the data transfer completes (either with an error, cancelled, or successfully), GraphDB invokes the release-snapshot.sh script, which unmounts the backup volume and removes it. This way, the data changes are merged back with the original volume.

5.6.8.3 Some further notes

The scripts rely on root access to do ‘mount’, and also to create and remove snapshot volumes. The SUDO_ASKPASS variable is set to point to the askpass.sh script from the same folder. All commands that need privilege are executed using sudo -A, which invokes the command pointed to by the SUDO_ASKPASS variable. The latter simply prints the required password to its standard output. You have to alter askpass.sh accordingly.

During the LVM-based maintenance session, GraphDB creates two additional (zero-size) files in the scripts folder: snapshot-lock, indicating that a session has started, and snapshot-created, indicating a successful completion of the create-snapshot.sh script. They are used to prevent other threads or processes from interfering with the maintenance operation that has been initiated and is still in progress.


6 Tools

6.1 LoadRDF Tool

6.1.1 What is GraphDB’s LoadRDF tool

The LoadRDF tool is used for fast loading of large datasets. It resides in the bin/ folder of the GraphDB distribution.

Note: LoadRDF creates a brand new repository from existing data, so you cannot use it to update existing repositories.

6.1.2 How it works

bin/loadrdf.sh <config.ttl> <serial|parallel|fullyparallel> <files...>
bin/loadrdf.cmd <config.ttl> <serial|parallel|fullyparallel> <files...>

As input on the command line, the LoadRDF tool accepts a standard config file in Turtle format, the mode, and a list of files for loading.

The mode specifies the way the data is loaded in the repository:

• serial - parsing is followed by entity resolution, which is then followed by load, optionally followed by inference.

• parallel (deprecated due to some known issues; it might be patched in one of the coming minor releases) - resolving and loading take place as soon as the parse buffer is filled, thus reducing the overall time by occupying more CPU and memory resources. Inference, if enabled, is performed at the end. Therefore, this mode is effective only without inference.

• fullyparallel - uses the parallel parse and load from the previous mode, but instead of starting the inference at the end, the inference is performed in parallel during the load. This gives a significant boost when loading large datasets with inference enabled.

Warning: At the moment, this mode does not support the owl:sameAs optimisation.

Note:

Gzipped files (.gz) are supported. The files must be in NTriples format, e.g., file.nt.gz. In addition to files, whole directories can be processed (together with their subfolders), if specified.
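For example, a typical invocation might look like the following (assuming a repository configuration file named config.ttl and hypothetical data file names):

bin/loadrdf.sh config.ttl fullyparallel dataset1.nt.gz dataset2.nt.gz /data/more-rdf/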


6.1.3 Java -D cmdline options

The LoadRDF tool accepts Java command line options using -D. To change them, edit the command line script.

• -Dcontext.file - a text file containing context names, one per line. Lines are trimmed, and empty lines denote the null context. Each file (transaction) uses one context line from this file.

The following options can tune the behaviour of the parallel loading:

• -Dpool.size - how many sorting threads to use per index, after resolving entities into IDs and prior to loading the data into the repository. Sorting accelerates data loading. There are separate thread pools for each of the indices: PSO (sorts in the pred-subj-obj order), POS (sorts in the pred-obj-subj order), and, if context indices are enabled, PCSO and PCOS.

Note: The value of this parameter defaults to 1. It is, for the moment, more of an experimental option, as our experience shows that more than one sorting thread does not have much effect. This is because the resolving stage takes much more time and the sorting stage is an in-memory operation. When the ‘quick sort’ algorithm is used, the operation is performed really fast, even for large buffers.

• -Dpool.buffer.size - the buffer size (the number of statements) for each stage. Defaults to 200,000 statements. You can use this parameter to tune the memory usage and the overhead of inserting data:

– a smaller buffer size reduces the memory required;

– a bigger buffer size theoretically reduces the overhead, as the operations performed by threads have a lower probability of waiting for the operations on which they rely, and the CPU is used intensively most of the time.

• -Dignore.corrupt.files - the valid values are true/false; the default value is true. If set to true, the process will continue even if the dataset contains a corrupt file (it will skip the file).

• -Dlru.cache.type - the valid values are synch/lockfree; the default value is synch. It determines which type of ‘least recently used’ cache will be used. The recommended value for LoadRDF is lockfree, as it performs better in the parallel modes than in the serial one. The option is set in the command line scripts.

• -Dinfer.pool.size - the number of inference threads in fullyparallel mode. The default value is the number of cores of the machine processor or 4, as set in the command line scripts. A bigger pool theoretically means a faster load, if there are enough unoccupied cores and the inference does not wait for the other load stages to complete.

6.1.4 Using loadrdf class programmatically

Because the LoadRDF tool uses parallel loading, there is also a way to use the loadrdf class programmatically. For example, you can write your own Java tool which uses parallel loading internally.

One option is to give the parallel loading an InputStream as a parameter. Parallel loading parses the file internally and fills the buffers of statements for the next stage (resolving), which then prepares the resolved statements for the next stage (sorting); finally, the sorted statements are asynchronously loaded into PSO and POS.

Another way to use the parallel loading is to specify an Iterator<Statement> (parsed from other sources or possibly generated) instead of an InputStream or a File. Both constructors require to be supplied with the context where the statements will be loaded. Only statements without a context will go to the specified one: if you use a format supporting contexts (.trig, .trix, .nq), only statements without a specified context will go into the one you want, while statements with contexts will use their own context rather than the one you have additionally specified.

Parallel loading accepts two -D command line options used for testing purposes - to measure the overhead of parsing and resolving vs loading data in the repository:

• -Ddo.resolve.entities=false - only parses the data, does not proceed to resolving;

• -Ddo.load.data=false - parses and resolves the data, but does not proceed to loading in the repository;


Note: By default, the data is parsed, resolved and loaded in the repository.

If any of these options is specified, a descriptive message is shown in the console.

6.1.5 Parallel inference

The diagram below reflects the design of the parallel reasoning mechanism, which works in combination with the fast bulk load.

Tip: The workflow is indicated with blue arrows. The workflow for reasoning is indicated with red arrows.

1. The thread that stores statements in the PSO collection is modified to put the statements that were actually added (i.e., the ones that did not exist in the repository) in a buffer. This is where the statements that should be used as the input of the inference process are collected. The buffer is attached to the PSO collection, because (a) unlike the context-related collections, it is always there, and (b) the POS collection already takes care of updating the statistics, so this distributes the load between the POS and PSO storing threads.

2. There are N identical threads that do inference, each of which consumes statements from the buffer populated in the previous step. This process starts when the storing of all statements from the buffer currently being loaded is completed in all collections. If the inference starts before that, there is a chance of missing valid inferences.

3. Each of the inferencers feeds newly inferred statements into a queue (eventually, a blocking queue) - a simple structure that should allow multiple threads to put statements into it with minimal contention.

4. There is a separate thread that consumes statements from the inference results queue and produces a sorted buffer of statements without duplicates.

5. One ‘epoch’ of inference is done when (a) the inferencers have exhausted all the statements from the buffer of newly added statements, (b) all the inference threads are done, and (c) the sorter and deduplication thread has exhausted the inference results queue and completed the generation of the buffer of sorted unique inference results.

6. The Storage Controller starts a new iteration of the storage and inference epoch as it starts to consume statements from the buffer of sorted unique inferred statements.


7. The inference process is done when, at the end of an inference epoch, the buffer of sorted unique inferred statements is empty.

Note: The Storage Controller does not start processing the next “buffer of resolved statements” until the inference process is finished for the previous one. In other words, storage and inference work in shifts and never run in parallel.

There is space for further parallelisation here and there, but we consider the implementation of this algorithm the right starting point, as it seems to be a good balance between parallelisation and simplicity. Therefore, it is manageable to implement and debug within a few days. It should serve as a formal proof that this type of parallel inference can generate correct results. It will also provide a baseline for speed.
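As an illustration of steps 2-4 above, the underlying producer/consumer pattern can be sketched in plain Java. This is not GraphDB code - just a minimal, self-contained sketch of N inferencer threads feeding a blocking queue that a sorter/deduplication step then drains:

import java.util.Arrays;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class InferenceEpochSketch {
    public static void main(String[] args) throws InterruptedException {
        // Stand-in for the buffer of newly added statements (step 1)
        List<String> newStatements = Arrays.asList("s1", "s2", "s3");

        // Step 3: a queue that many inferencer threads can feed with low contention
        final BlockingQueue<String> results = new LinkedBlockingQueue<String>();

        // Step 2: N identical inference threads consume the buffer
        ExecutorService inferencers = Executors.newFixedThreadPool(4);
        for (final String st : newStatements) {
            inferencers.submit(new Runnable() {
                public void run() {
                    results.add("inferred-from-" + st); // stand-in for applying the rules
                }
            });
        }
        inferencers.shutdown();
        inferencers.awaitTermination(1, TimeUnit.MINUTES);

        // Step 4: drain the queue into a sorted buffer without duplicates
        SortedSet<String> sortedUnique = new TreeSet<String>();
        results.drainTo(sortedUnique);
        System.out.println(sortedUnique); // becomes the input of the next epoch (step 6)
    }
}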

6.2 Storage Tool

The Storage Tool is an application for scanning and repairing a GraphDB repository, available since GraphDB 6.1.

Note: The tool works only on repository images that are not in use (i.e., when Tomcat is stopped).

To run the Storage Tool, please execute bin/storage-tool.sh (on Mac, Linux and Unix systems) or bin\storage-tool.cmd (on Windows) in the GraphDB distribution folder.

Note: In GraphDB 6.6.0 to 6.6.2, the scripts storage-tool.sh and storage-tool.cmd are missing and you have to execute Java directly with the respective classpath and main class. If GraphDB is deployed in a folder such as /home/username/dev/tomcat/webapps/openrdf-sesame and assuming that the current directory is /home/username/dev/tomcat/webapps/openrdf-sesame/WEB-INF, you can use a relative path in the java -cp option as shown in the example code below.

java -cp "lib/*" com.ontotext.trree.util.convert.storage.StorageTool --help

6.2.1 Options

-command=<operation to be executed, MANDATORY>
-storage=<absolute path to repo storage dir, MANDATORY>
-esize=<size of entity pool IDs: 32 or 40 bits, DEFAULT 32>
-statusPrintInterval=<interval between status message printing, DEFAULT 30, means 30 seconds>
-pageCacheSize=<size of the page cache, DEFAULT 10, means 10K elements>
-sortBufferSize=<size of the external sort buffer, DEFAULT 100, means 100M elements>
-srcIndex=<one of pso, pos, pcso, pcos>
-destIndex=<one of pso, pos, pcso, pcos, predicates>
-origURI=<original existing URI in the repo>
-replURI=<new non-existing URI in the repo>
-destFile=<path to file used to store exported data>

6.2.2 Supported commands

• scan - scans the repository index(es) and prints statistics about the number of statements and repository consistency;

• rebuild - uses the source index srcIndex to rebuild the destination index destIndex. If srcIndex = destIndex, compacts destIndex. If srcIndex is missing and destIndex = predicates, just rebuilds destIndex.


• replace - replaces an existing entity -origURI with a non-existing one -replURI;

• repair - repairs the repository indexes and restores data, a better variant of the merge index;

• check - checks all indexes or -srcIndex for consistency and scans for corrupt statements. A statement is corrupt if one of its parts (s,p,o,c) is not present in the entity pool.

• export - uses the source index (srcIndex) to export repository data to the destination file destFile. Supported destination file formats (by extension): .trig, .ttl, .nq.

6.2.3 Examples

• scan the repository, print statement statistics and repository consistency status:

-command=scan -storage=/repo/storage

• scan the PSO index of a 40-bit repository, print a status message every 60 seconds:

-command=scan -storage=/repo/storage -srcIndex=pso -esize=40 -statusPrintInterval=60

• compact the PSO index (self-rebuild equals compacting):

-command=rebuild -storage=/repo/storage -esize=40 -srcIndex=pso -destIndex=pso

• rebuild the POS index from the PSO index and compact POS:

-command=rebuild -storage=/repo/storage -esize=40 -srcIndex=pso -destIndex=pos

• rebuild the predicates statistics index:

-command=rebuild -storage=/repo/storage -esize=40 -destIndex=predicates

• replace http://onto.com#e1 with http://onto.com#e2:

-command=replace -storage=/repo/storage -origURI=<http://onto.com#e1> -replURI=<http://onto.com#e2>

• check the POS consistency and additionally scan it for corrupt statements:

-command=check -storage=/repo/storage -srcIndex=pos

• dump the repository data using the POS index into a f.trig file:

-command=export -storage=/repo/storage -srcIndex=pos -destFile=/repo/storage/f.trig


7 Benchmarks

7.1 LUBM

7.1.1 What is LUBM

The Lehigh University Benchmark (LUBM) is a benchmark developed for evaluating the performance of Semantic Web repositories with respect to extensional queries over a large dataset that commits to a single ontology. It consists of a university domain ontology, customisable data, a set of test queries, and several performance metrics.

7.1.2 Configuring GraphDB and Sesame

To run LUBM, you may be required to modify the script setvars (.cmd for Windows or .sh for Mac/Linux) in the scripts folder of the GraphDB distribution. The most important setting there is the specification of the Java virtual machine through the JAVA_HOME environment variable.

To tune the performance of the benchmark to the particular hardware on which it runs, modify the repository specification file named lubm.ttl located in the benchmark/lubm subfolder of the GraphDB distribution.

7.1.3 Running the benchmark

1. Generate the test file-set by using the lubm-generate (.cmd or .sh) script in the benchmark/lubm subfolder of the GraphDB distribution. The distribution includes a pre-built library of the benchmark source code. The pre-built lubm.jar library is located in the ext2 folder and also includes the wrapper classes that the benchmark’s code-base uses in order to run against a Sesame repository (configured with GraphDB).

Note: The lubm-generate script accepts a single numeric argument as the target number of universities and creates a subfolder for the test run with the name univer and the number of universities appended, e.g., univer1000. This is the folder in which it places the generated OWL files.

2. Edit the lubm.config file and comment/uncomment the appropriate benchmark configuration to be executed.

By default, a single University dataset is configured and its configuration section is:

# 1-university
[GraphDB_1]
class=owlim.OwlimWrapper
data=./univer1
database=jdbc:ignore
ontology=http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl

Note: Use the # symbol at the beginning of a line to comment it.


3. Launch the test via the lubm-benchmark script.

Note: Before generating datasets or executing the test, make sure that GraphDB is configured as discussed in Configuring a Repository by editing lubm.ttl.

The file provided is suitable for running LUBM with small datasets.

To find out more about this benchmark, visit the LUBM web site.

7.1.3.1 Alternative to generating the LUBM datasets

It is also possible to generate LUBM datasets ‘on-the-fly’ during the loading stage.

To do this, edit the lubm.config file and use a data directory with the format GENERATE-n, where n is the number of universities, e.g.:

# 1-university
[GraphDB_1]
class=owlim.OwlimWrapper
data=GENERATE-2
database=jdbc:ignore
ontology=http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl

Note: This approach saves the extra step (and disk space) of generating data files.

7.2 Performance Benchmark Results

The latest GraphDB benchmark results for a range of tests (LUBM, OUBM, BSBM, etc.) are maintained here: http://www.ontotext.com/graphdb-benchmark-results/


8 References

8.1 Introduction to the Semantic Web

The Semantic Web represents a broad range of ideas and technologies that attempt to bring meaning to the vast amount of information available via the Web. The intention is to provide information in a structured form so that it can be processed automatically by machines. The combination of structured data and inferencing can yield much information not explicitly stated.

The aim of the Semantic Web is to solve the most problematic issues that come with the growth of the non-semantic (HTML-based or similar) Web that results in a high level of human effort for finding, retrieving and exploiting information. For example, contemporary search engines are extremely fast, but tend to be very poor at producing relevant results. Of the thousands of matches typically returned, only a few point to truly relevant content and some of this content may be buried deep within the identified pages. Such issues dramatically reduce the value of the information discovered as well as the ability to automate the consumption of such data. Other problems related to classification and generalisation of identifiers further confuse the landscape.

The Semantic Web solves such issues by adopting unique identifiers for concepts and the relationships between them. These identifiers, called Uniform Resource Identifiers (URIs) (a “resource” is any ‘thing’ or ‘concept’), are similar to Web page URLs, but do not necessarily identify documents from the Web. Their sole purpose is to uniquely identify objects or concepts and the relationships between them.

The use of URIs removes much of the ambiguity from information, but the Semantic Web goes further by allowing concepts to be associated with hierarchies of classifications, thus making it possible to infer new information based on an individual’s classification and relationship to other concepts. This is achieved by making use of ontologies – hierarchical structures of concepts – to classify individual concepts.

8.1.1 Resource Description Framework (RDF)

The World-Wide Web has grown rapidly and contains huge amounts of information that cannot be interpreted by machines. Machines cannot understand meaning, therefore they cannot understand Web content. For this reason, most attempts to retrieve some useful pieces of information from the Web require a high degree of user involvement – manually retrieving information from multiple sources (different Web pages), ‘digging’ through multiple search engine results (where useful pieces of data are often buried many pages deep), comparing differently structured result sets (most of them incomplete), and so on.

For the machine interpretation of semantic content to become possible, there are two prerequisites:

1. Every concept should be uniquely identified. (For example, if one and the same person owns a web site, authors articles on other sites, gives an interview on another site and has profiles in a couple of social media sites such as Facebook and LinkedIn, then the occurrences of his name/identifier in all these places should be related to one and the same unique identifier.)

2. There must be a unified system for conveying and interpreting meaning that all automated search agents and data storage applications should use.

One approach for attaching semantic information to Web content is to embed the necessary machine-processable information through the use of special meta-descriptors (meta-tagging) in addition to the existing meta-tags that mainly concern the layout.


Within these meta tags, the resources (the pieces of useful information) can be uniquely identified in the same manner in which Web pages are uniquely identified, i.e., by extending the existing URL system into something more universal – a URI (Uniform Resource Identifier). In addition, conventions can be devised, so that resources can be described in terms of properties and values (resources can have properties and properties have values). The concrete implementations of these conventions (or vocabularies) can be embedded into Web pages (through meta-descriptors again), thus effectively ‘telling’ the processing machines things like:

[resource] John Doe has a [property] web site which is [value] www.johndoesite.com

The Resource Description Framework (RDF) developed by the World Wide Web Consortium (W3C) makes possible the automated semantic processing of information, by structuring information using individual statements that consist of: Subject, Predicate, Object. Although frequently referred to as a ‘language’, RDF is mainly a data model. It is based on the idea that the things being described have properties, which have values, and that resources can be described by making statements. RDF prescribes how to make statements about resources, in particular, Web resources, in the form of subject-predicate-object expressions. The ‘John Doe’ example above is precisely this kind of statement. The statements are also referred to as triples, because they always have the subject-predicate-object structure.

The basic RDF components include statements, Uniform Resource Identifiers, properties, blank nodes and literals. They are discussed in the topics that follow.

8.1.1.1 Uniform Resource Identifiers (URIs)

A unique Uniform Resource Identifier (URI) is assigned to any resource or thing that needs to be described. Resources can be authors, books, publishers, places, people, hotels, goods, articles, search queries, and so on. In the Semantic Web, every resource has a URI. A URI can be a URL or some other kind of unique identifier. Unlike URLs, URIs do not necessarily enable access to the resource they describe, i.e., in most cases they do not represent actual web pages. For example, the string http://www.johndoesite.com/aboutme.htm, if used as a URL (Web link), is expected to take us to a Web page of the site providing information about the site owner, the person John Doe. The same string can, however, be used simply to identify that person on the Web (URI), irrespective of whether such a page exists or not.

Thus URI schemes can be used not only for Web locations, but also for such diverse objects as telephone numbers, ISBN numbers, and geographic locations. In general, we assume that a URI is the identifier of a resource and can be used as either the subject or the object of a statement. Once the subject is assigned a URI, it can be treated as a resource and further statements can be made about it.

This idea of using URIs to identify ‘things’ and the relations between them is important. This approach goes some way towards a global, unique naming scheme. The use of such a scheme greatly reduces the homonym problem that has plagued distributed data representation in the past.

8.1.1.2 Statements: Subject-Predicate-Object Triples

To make the information in the following sentence

“The web site www.johndoesite.com is created by John Doe.”

machine-accessible, it should be expressed in the form of an RDF statement, i.e., a subject-predicate-object triple:

“[subject] the web site www.johndoesite.com [predicate] has a creator [object] called John Doe.”

This statement emphasises the fact that in order to describe something, there has to be a way to name or identify a number of things:

• the thing the statement describes (Web site “www.johndoesite.com”);

• a specific property (“creator”) of the thing the statement describes;

• the thing the statement says is the value of this property (who the owner is).

The respective RDF terms for the various parts of the statement are:

• the subject is the URL “www.johndoesite.com”;


• the predicate is the expression “has creator”;

• the object is the name of the creator, which has the value “John Doe”.

Next, each member of the subject-predicate-object triple should be identified using its URI, e.g.:

• the subject is http://www.johndoesite.com;

• the predicate is http://purl.org/dc/elements/1.1/creator (this is according to a particular RDF Schema, namely, the Dublin Core Metadata Element Set);

• the object is http://www.johndoesite.com/aboutme (which may not be an actual web page).

Note that in this version of the statement, instead of identifying the creator of the web site by the character string “John Doe”, we used a URI, namely http://www.johndoesite.com/aboutme. An advantage of using a URI is that the identification of the statement’s subject can be more precise, i.e., the creator of the page is neither the character string “John Doe”, nor any of the thousands of other people with that name, but the particular John Doe associated with this URI (whoever created the URI defines the association). Moreover, since there is a URI to refer to John Doe, he is now a full-fledged resource and additional information can be recorded about him simply by adding additional RDF statements with John’s URI as the subject.

What we basically have now is the logical formula P(x, y), where the binary predicate P relates the object x to the object y – we may also think of this formula as written in the form x, P, y. In fact, RDF offers only binary predicates (properties). If more complex relationships are to be defined, this is done through sets of multiple RDF triples. Therefore, we can describe the statement as:

<http://www.johndoesite.com> <http://purl.org/dc/elements/1.1/creator> <http://www.johndoesite.com/aboutme>

There are several conventions for writing abbreviated RDF statements, as used in the RDF specifications themselves. This shorthand employs an XML qualified name (or QName) without angle brackets as an abbreviation for a full URI reference. A QName contains a prefix that has been assigned to a namespace URI, followed by a colon, and then a local name. The full URI reference is formed from the QName by appending the local name to the namespace URI assigned to the prefix. So, for example, if the QName prefix foo is assigned to the namespace URI http://example.com/somewhere/, then the QName “foo:bar” is a shorthand for the URI http://example.com/somewhere/bar.

In our example, we can define the namespace jds for http://www.johndoesite.com and use the Dublin Core Metadata namespace dc for http://purl.org/dc/elements/1.1/.

So, the shorthand form for the example statement is simply:

jds: dc:creator jds:aboutme

Objects of RDF statements can (and very often do) form the subjects of other statements, leading to a graph-like representation of knowledge. Using this notation, a statement is represented by:

• a node for the subject;

• a node for the object;

• an arc for the predicate, directed from the subject node to the object node.

So the RDF statement above could be represented by the following graph:


This kind of graph is known in the artificial intelligence community as a ‘semantic net’.

In order to represent RDF statements in a machine-processable way, RDF uses mark-up languages, namely (and almost exclusively) the Extensible Mark-up Language (XML). Because an abstract data model needs a concrete syntax in order to be represented and transmitted, RDF has been given a syntax in XML. As a result, it inherits the benefits associated with XML. However, it is important to understand that other syntactic representations of RDF, not based on XML, are also possible. XML-based syntax is not a necessary component of the RDF model. XML was designed to allow anyone to design their own document format and then write a document in that format. RDF defines a specific XML mark-up language, referred to as RDF/XML, for use in representing RDF information and for exchanging it between machines. Written in RDF/XML, our example will look as follows:

<?xml version="1.0" encoding="UTF-16"?>

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/"
         xmlns:jds="http://www.johndoesite.com/">
  <rdf:Description rdf:about="http://www.johndoesite.com/">
    <dc:creator rdf:resource="http://www.johndoesite.com/aboutme"/>
  </rdf:Description>
</rdf:RDF>

Note: RDF/XML uses the namespace mechanism of XML, but in an expanded way. In XML, namespaces are only used for disambiguation purposes. In RDF/XML, external namespaces are expected to be RDF documents defining resources, which are then used in the importing RDF document. This mechanism allows the reuse of resources by other people who may decide to insert additional features into these resources. The result is the emergence of large, distributed collections of knowledge.

Also observe that the rdf:about attribute of the element rdf:Description is equivalent in meaning to that of an ID attribute, but it is often used to suggest that the object about which a statement is made has already been ‘defined’ elsewhere. Strictly speaking, a set of RDF statements together simply forms a large graph, relating things to other things through properties, and there is no such concept as ‘defining’ an object in one place and referring to it elsewhere. Nevertheless, in the serialised XML syntax, it is sometimes useful (if only for human readability) to suggest that one location in the XML serialisation is the ‘defining’ location, while other locations state ‘additional’ properties about an object that has been ‘defined’ elsewhere.


8.1.1.3 Properties

Properties are a special kind of resource: they describe relationships between resources, e.g., written by, age, title, and so on. Properties in RDF are also identified by URIs (in most cases, these are actual URLs). Therefore, properties themselves can be used as the subject in other statements, which allows for expressive ways to describe properties, e.g., by defining property hierarchies.

8.1.1.4 Named graphs

A named graph (NG) is a set of triples named by a URI. This URI can then be used outside or within the graph to refer to it. The ability to name a graph allows separate graphs to be identified out of a large collection of statements and further allows statements to be made about graphs.

Named graphs represent an extension of the RDF data model, where quadruples <s,p,o,ng> are used to define statements in an RDF multi-graph. This mechanism allows, e.g., the handling of provenance when multiple RDF graphs are integrated into a single repository.

From the perspective of GraphDB, named graphs are important, because comprehensive support for SPARQL requires NG support.
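As a quick illustration (the URIs are invented for the example), such a quadruple can be written in the TriG format mentioned earlier, with the named graph enclosing its triples:

@prefix ex: <http://example.com/> .

ex:graph1 {
    ex:john ex:worksFor ex:acme .
}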

8.1.2 RDF Schema (RDFS)

While being a universal model that lets users describe resources using their own vocabularies, RDF does not make assumptions about any particular application domain, nor does it define the semantics of any domain. It is up to the user to do so using an RDF Schema (RDFS) vocabulary.

RDF Schema is a vocabulary description language for describing properties and classes of RDF resources, with a semantics for generalisation hierarchies of such properties and classes. Be aware of the fact that the RDF Schema is conceptually different from the XML Schema, even though the common term schema suggests similarity. The XML Schema constrains the structure of XML documents, whereas the RDF Schema defines the vocabulary used in RDF data models. Thus, RDFS makes semantic information machine-accessible, in accordance with the Semantic Web vision. RDF Schema is a primitive ontology language. It offers certain modelling primitives with fixed meaning.

RDF Schema does not provide a vocabulary of application-specific classes. Instead, it provides the facilities needed to describe such classes and properties, and to indicate which classes and properties are expected to be used together (for example, to say that the property JobTitle will be used in describing a class Person). In other words, RDF Schema provides a type system for RDF.

The RDF Schema type system is similar in some respects to the type systems of object-oriented programming languages such as Java. For example, RDFS allows resources to be defined as instances of one or more classes. In addition, it allows classes to be organised in a hierarchical fashion. For example, a class Dog might be defined as a subclass of Mammal, which itself is a subclass of Animal, meaning that any resource that is in class Dog is also implicitly in class Animal as well.

RDF classes and properties, however, are in some respects very different from programming language types. RDF class and property descriptions do not create a straitjacket into which information must be forced, but instead provide additional information about the RDF resources they describe.

The RDFS facilities are themselves provided in the form of an RDF vocabulary, i.e., as a specialised set of predefined RDF resources with their own special meanings. The resources in the RDFS vocabulary have URIs with the prefix http://www.w3.org/2000/01/rdf-schema# (conventionally associated with the namespace prefix rdfs). Vocabulary descriptions (schemas) written in the RDFS language are legal RDF graphs. Hence, systems processing RDF information that do not understand the additional RDFS vocabulary can still interpret a schema as a legal RDF graph consisting of various resources and properties. However, such a system will be oblivious to the additional built-in meaning of the RDFS terms. To understand these additional meanings, the software that processes RDF information has to be extended to include these language features and to interpret their meanings in the defined way.


8.1.2.1 Describing classes

A class can be thought of as a set of elements. Individual objects that belong to a class are referred to as instances of that class. A class in RDFS corresponds to the generic concept of a type or category, similar to the notion of a class in object-oriented programming languages such as Java. RDF classes can be used to represent any category of objects such as web pages, people, document types, databases or abstract concepts. Classes are described using the RDF Schema resources rdfs:Class and rdfs:Resource, and the properties rdf:type and rdfs:subClassOf. The relationship between instances and classes in RDF is defined using rdf:type.
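For instance, the Dog/Mammal/Animal hierarchy from the previous subsection is expressed with these primitives as follows (the ex: namespace and the instance ex:rex are invented for the example):

ex:Animal rdf:type rdfs:Class .
ex:Mammal rdf:type rdfs:Class .
ex:Mammal rdfs:subClassOf ex:Animal .
ex:Dog rdf:type rdfs:Class .
ex:Dog rdfs:subClassOf ex:Mammal .
ex:rex rdf:type ex:Dog .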

An important use of classes is to impose restrictions on what can be stated in an RDF document using the schema. In programming languages, typing is used to prevent incorrect use of objects (resources), and the same is true in RDF: a schema can impose a restriction on the objects to which a property can be applied. In logical terms, this is a restriction on the domain of the property.

8.1.2.2 Describing properties

In addition to describing the specific classes of things they want to describe, user communities also need to be able to describe specific properties that characterise these classes of things (such as numberOfBedrooms to describe an apartment). In RDFS, properties are described using the RDF class rdf:Property, and the RDFS properties rdfs:domain, rdfs:range and rdfs:subPropertyOf.

All properties in RDF are described as instances of class rdf:Property. So, a new property, such as exterms:weightInKg, is defined with the following RDF statement:

exterms:weightInKg rdf:type rdf:Property .

RDFS also provides vocabulary for describing how properties and classes are intended to be used together. The most important information of this kind is supplied by using the RDFS properties rdfs:range and rdfs:domain to further describe application-specific properties.

The rdfs:range property is used to indicate that the values of a particular property are members of a designated class. For example, to indicate that the property ex:author has values that are instances of class ex:Person, the following RDF statements are used:

ex:Person rdf:type rdfs:Class .
ex:author rdf:type rdf:Property .
ex:author rdfs:range ex:Person .

These statements indicate that ex:Person is a class, ex:author is a property, and that RDF statements using the ex:author property have instances of ex:Person as objects.

The rdfs:domain property is used to indicate that a particular property is used to describe a specific class of objects. For example, to indicate that the property ex:author applies to instances of class ex:Book, the following RDF statements are used:

ex:Book rdf:type rdfs:Class .
ex:author rdf:type rdf:Property .
ex:author rdfs:domain ex:Book .

These statements indicate that ex:Book is a class, ex:author is a property, and that RDF statements using the ex:author property have instances of ex:Book as subjects.

8.1.2.3 Sharing vocabularies

RDFS provides the means to create custom vocabularies. However, it is generally easier and better practice to use an existing vocabulary created by someone else who has already been describing a similar conceptual domain. Such publicly available vocabularies, called ‘shared vocabularies’, are not only cost-efficient to use, but they also promote the shared understanding of the described domains.

Considering the earlier example, in the statement:


jds: dc:creator jds:aboutme .

the predicate dc:creator, when fully expanded into a URI, is an unambiguous reference to the creator attribute in the Dublin Core metadata attribute set, a widely used set of attributes (properties) for describing information of this kind. So this triple is effectively saying that the relationship between the website (identified by http://www.johndoesite.com/) and the creator of the site (a distinct person, identified by http://www.johndoesite.com/aboutme) is exactly the property identified by http://purl.org/dc/elements/1.1/creator. This way, anyone familiar with the Dublin Core vocabulary or those who find out what dc:creator means (say, by looking up its definition on the Web) will know what is meant by this relationship. In addition, this shared understanding based upon using unique URIs for identifying concepts is exactly the requirement for creating computer systems that can automatically process structured information.

However, the use of URIs does not solve all identification problems, because different URIs can be created for referring to the same thing. For this reason, it is a good idea to have a preference towards using terms from existing vocabularies (such as the Dublin Core) where possible, rather than making up new terms that might overlap with those of some other vocabulary. Appropriate vocabularies for use in specific application areas are being developed all the time, but even so, the sharing of these vocabularies in a common ‘Web space’ provides the opportunity to identify and deal with any equivalent terminology.

8.1.2.4 Dublin Core Metadata Initiative

An example of a shared vocabulary that is readily available for reuse is The Dublin Core, which is a set of elements (properties) for describing documents (and hence, for recording metadata). The element set was originally developed at the March 1995 Metadata Workshop in Dublin, Ohio, USA. Dublin Core has subsequently been modified on the basis of later Dublin Core Metadata workshops and is currently maintained by the Dublin Core Metadata Initiative.

The goal of Dublin Core is to provide a minimal set of descriptive elements that facilitate the description and the automated indexing of document-like networked objects, in a manner similar to a library card catalogue. The Dublin Core metadata set is suitable for use by resource discovery tools on the Internet, such as Web crawlers employed by search engines. In addition, Dublin Core is meant to be sufficiently simple to be understood and used by the wide range of authors and casual publishers of information to the Internet.

Dublin Core elements have become widely used in documenting Internet resources (the Dublin Core creator element was used in the earlier examples). The current elements of Dublin Core contain definitions for properties such as title (a name given to a resource), creator (an entity primarily responsible for creating the content of the resource), date (a date associated with an event in the life-cycle of the resource) and type (the nature or genre of the content of the resource).

Information using Dublin Core elements may be represented in any suitable language (e.g., in HTML meta elements). However, RDF is an ideal representation for Dublin Core information. The following example uses Dublin Core by itself to describe an audio recording of a guide to growing rose bushes:

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://media.example.com/audio/guide.ra">
    <dc:creator>Mr. Dan D. Lion</dc:creator>
    <dc:title>A Guide to Growing Roses</dc:title>
    <dc:description>Describes planting and nurturing rose bushes.</dc:description>
    <dc:date>2001-01-20</dc:date>
  </rdf:Description>
</rdf:RDF>

The same RDF statements in Notation-3:

@prefix dc: <http://purl.org/dc/elements/1.1/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .


<http://media.example.com/audio/guide.ra> dc:creator "Mr. Dan D. Lion" ;
    dc:title "A Guide to Growing Roses" ;
    dc:description "Describes planting and nurturing rose bushes." ;
    dc:date "2001-01-20" .

8.1.3 Ontologies and knowledge bases

In general, an ontology formally describes a (usually finite) domain of related concepts (classes of objects) and their relationships. For example, in a company setting, staff members, managers, company products, offices, and departments might be some important concepts. The relationships typically include hierarchies of classes. A hierarchy specifies a class C to be a subclass of another class C' if every object in C is also included in C'. For example, all managers are staff members.

Apart from subclass relationships, ontologies may include information such as:

• properties (X is subordinated to Y);

• value restrictions (only managers may head departments);

• disjointness statements (managers and general employees are disjoint);

• specifications of logical relationships between objects (every department must have at least three staff members).

Ontologies are important because semantic repositories use ontologies as semantic schemata. This makes automated reasoning about the data possible (and easy to implement), since the most essential relationships between the concepts are built into the ontology.

Formal knowledge representation (KR) is about building models. The typical modelling paradigm is mathematical logic, but there are also other approaches, rooted in information and library science. KR is a very broad term; here we refer only to its mainstream meaning: building a model (of a particular state of affairs, situation, domain or problem) that allows for automated reasoning and interpretation. Such models consist of ontologies defined in a formal language. Ontologies can be used to provide formal semantics (i.e., machine-interpretable meaning) to any sort of information: databases, catalogues, documents, Web pages, etc. Ontologies can be used as semantic frameworks: the association of information with ontologies makes such information much more amenable to machine processing and interpretation. This is because ontologies are described using logical formalisms, such as OWL, which allow automatic inferencing over these ontologies and datasets that use them, i.e., as a vocabulary. An important role of ontologies is to serve as schemata or ‘intelligent’ views over information resources. This is also the role of ontologies in the Semantic Web. Thus, they can be used for indexing, querying, and reference purposes over non-ontological datasets and systems, such as databases, document and catalogue management systems. Because ontological languages have formal semantics, ontologies allow a wider interpretation of data, i.e., inference of facts that are not explicitly stated. In this way, they can improve the interoperability and the efficiency of using arbitrary datasets.

An ontology O can be defined as the 4-tuple

O = <C,R,I,A>

where

• C is a set of classes representing concepts from the domain we wish to describe (e.g., invoices, payments, products, prices, etc.);

• R is a set of relations (also referred to as properties or predicates) holding between (instances of) these classes (e.g., Product hasPrice Price);

• I is a set of instances, where each instance can be a member of one or more classes and can be linked to other instances or to literal values (strings, numbers and other data-types) by relations (e.g., product23 compatibleWith product348 or product23 hasPrice €170);

• A is a set of axioms (e.g., if a product has a price greater than €200, then shipping is free).


8.1.3.1 Classification of ontologies

Ontologies can be classified as light-weight or heavy-weight according to the complexity of the KR language and the extent to which it is used. Light-weight ontologies allow for more efficient and scalable reasoning, but do not possess the highly predictive (or restrictive) power of more powerful KR languages. Ontologies can be further differentiated according to the sort of conceptualisation that they formalise: upper-level ontologies model general knowledge, while domain and application ontologies represent knowledge about a specific domain (e.g., medicine or sport) or a type of application, e.g., knowledge management systems.

Finally, ontologies can be distinguished according to the sort of semantics being modelled and their intended usage. The major categories from this perspective are:

• Schema-ontologies: ontologies that are close in purpose and nature to database and object-oriented schemata. They define classes of objects, their properties and relationships to objects of other classes. A typical use of such an ontology involves using it as a vocabulary for defining large sets of instances. In basic terms, a class in a schema ontology corresponds to a table in a relational database; a relation – to a column; an instance – to a row in the table for the corresponding class;

• Topic-ontologies: taxonomies that define hierarchies of topics, subjects, categories, or designators. These have a wide range of applications related to classification of different things (entities, information resources, files, Web-pages, etc.). The most popular examples are library classification systems and taxonomies, which are widely used in the knowledge management field. Yahoo and DMoz are popular large-scale incarnations of this approach. A number of the most popular taxonomies are listed as encoding schemata in Dublin Core;

• Lexical ontologies: lexicons with formal semantics that define lexical concepts. We use ‘lexical concept’ here as some kind of formal representation of the meaning of a word or a phrase. In WordNet, for example, lexical concepts are modelled as synsets (synonym sets), while word-sense is the relation between a word and a synset. These can be considered as semantic thesauri or dictionaries. The concepts defined in such ontologies are not instantiated; rather, they are directly used for reference, e.g., for annotation of the corresponding terms in text. WordNet is the most popular general-purpose (i.e., upper-level) lexical ontology.

8.1.3.2 Knowledge bases

Knowledge base (KB) is a broader term than ontology. Similar to an ontology, a KB is represented in a KR formalism, which allows automatic inference. It could include multiple axioms, definitions, rules, facts, statements, and any other primitives. In contrast to ontologies, however, KBs are not intended to represent a shared or consensual conceptualisation. Thus, ontologies are a specific sort of KB. Many KBs can be split into ontology and instance data parts, in a way analogous to the splitting of schemata and concrete data in databases.

Proton

PROTON is a light-weight upper-level schema-ontology developed in the scope of the SEKT project, which we will use for ontology-related examples in this section. PROTON is encoded in OWL Lite and defines about 542 entity classes and 183 properties, providing good coverage of named entity types and concrete domains, i.e., modelling of concepts such as people, organisations, locations, numbers, dates, addresses, etc. A snapshot of the PROTON class hierarchy is shown below.


8.1.4 Logic and inference

The topics that follow take a closer look at the logic that underlies the retrieval and manipulation of semantic data and the kind of programming that supports it.

8.1.4.1 Logic programming

Logic programming involves the use of logic for computer programming, where the programmer uses a declarative language to assert statements and a reasoner or theorem-prover is used to solve problems. A reasoner can interpret sentences, such as IF A THEN B, as a means to prove B from A. In other words, given a collection of logical sentences, a reasoner will explore the solution space in order to find a path to justify the requested theory. For example, to determine the truth value of C given the following logical sentences

IF A AND B THEN C
B
IF D THEN A
D

a reasoner will interpret the IF..THEN statements as rules and determine that C is indeed inferred from the KB (D is a fact, so IF D THEN A yields A; A together with the fact B then satisfies IF A AND B THEN C, producing C). This use of rules in logic programming has led to ‘rule-based reasoning’ and ‘logic programming’ becoming synonymous, although this is not strictly the case.

In LP, there are rules of logical inference that allow new (implicit) statements to be inferred from other (explicit)statements, with the guarantee that if the explicit statements are true, so are the implicit statements.

Because these rules of inference can be expressed in purely symbolic terms, applying them is the kind of symbol manipulation that can be carried out by a computer. This is what happens when a computer executes a logical program: it uses the rules of inference to derive new statements from the ones given in the program, until it finds one that expresses the solution to the problem that has been formulated. If the statements in the program are true, then so are the statements that the machine derives from them, and the answers it gives will be correct.

The program can give correct answers only if the following two conditions are met:

• The program must contain only true statements;

• The program must contain enough statements to allow solutions to be derived for all the problems that are of interest.

There must also be a reasonable time frame for the entire inference process. To this end, much research has been carried out to determine the complexity classes of various logical formalisms and reasoning strategies. Generally speaking, to reason with Web-scale quantities of data requires a low-complexity approach. A tractable solution is one whose algorithm requires finite time and space to complete.


8.1.4.2 Predicate logic

From a more abstract viewpoint, the subject of the previous topic is related to the foundation upon which logic programming resides, which is logic, particularly in the form of predicate logic (also known as ‘first order logic’). Some of the specific features of predicate logic render it very suitable for making inferences over the Semantic Web, namely:

• It provides a high-level language in which knowledge can be expressed in a transparent way and with high expressive power;

• It has a well-understood formal semantics, which assigns unambiguous meaning to logical statements;

• There are proof systems that can automatically derive statements syntactically from a set of premises. These proof systems are both sound (meaning that all derived statements follow semantically from the premises) and complete (all logical consequences of the premises can be derived in the proof system);

• It is possible to trace the proof that leads to a logical consequence. (This is because the proof system is sound and complete.) In this sense, the logic can provide explanations for answers.

The languages of RDF and OWL (Lite and DL) can be viewed as specialisations of predicate logic. One reason for such specialised languages to exist is that they provide a syntax that fits well with the intended use (in our case, Web languages based on tags). The other major reason is that they define reasonable subsets of logic. This is important because there is a trade-off between the expressive power and the computational complexity of certain logics: the more expressive the language, the less efficient (in the worst case) the corresponding proof systems. As previously stated, OWL Lite and OWL DL correspond roughly to description logic, a subset of predicate logic for which efficient proof systems exist.

Another subset of predicate logic with efficient proof systems comprises the so-called rule systems (also known as Horn logic or definite logic programs).

A rule has the form:

A1, ... , An → B

where Ai and B are atomic formulas. In fact, there are two intuitive ways of reading such a rule:

• If A1, ... , An are known to be true, then B is also true. Rules with this interpretation are referred to as ‘deductive rules’.

• If the conditions A1, ... , An are true, then carry out the action B. Rules with this interpretation are referred to as ‘reactive rules’.

Both approaches have important applications. The deductive approach, however, is more relevant for the purpose of retrieving and managing structured data. This is because it relates better to the possible queries that one can ask, as well as to the appropriate answers and their proofs.
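A classic deductive rule in this form (the predicate names are invented for the example) is:

hasParent(X, Y), hasBrother(Y, Z) → hasUncle(X, Z)

Given the facts hasParent(mary, john) and hasBrother(john, bill), a reasoner applying this rule derives hasUncle(mary, bill).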

8.1.4.3 Description logic

Description Logic (DL) has historically evolved from a combination of frame-based systems and predicate logic. Its main purpose is to overcome some of the problems with frame-based systems and to provide a clean and efficient formalism to represent knowledge. The main idea of DL is to describe the world in terms of ‘properties’ or ‘constraints’ that specific ‘individuals’ must satisfy. DL is based on the following basic entities:

• Objects - Correspond to single ‘objects’ of the real world such as a specific person, a table or a telephone. The main properties of an object are that it can be distinguished from other objects and that it can be referred to by a name. DL objects correspond to the individual constants in predicate logic;

• Concepts - Can be seen as ‘classes of objects’. Concepts have two functions: on one hand, they describe a set of objects and on the other, they determine properties of objects. For example, the class “table” is supposed to describe the set of all table objects in the universe. On the other hand, it also determines some properties of a table, such as having legs and a flat horizontal surface that one can lay something on. DL concepts correspond to unary predicates in first order logic and to classes in frame-based systems;


• Roles - Represent relationships between objects. For example, the role ‘lays on’ might define the relationship between a book and a table, where the book lays upon the table. Roles can also be applied to concepts. However, they do not describe the relationship between the classes (concepts); rather, they describe the properties of the objects that are members of those classes;

• Rules - In DL, rules take the form of “if condition x (left side), then property y (right side)” and form statements that read as “if an object satisfies the condition on the left side, then it has the properties of the right side”. So, for example, a rule can state something like ‘all objects that are male and have at least one child are fathers’.

The family of DL systems consists of many members that differ mainly with respect to the constructs they provide. Not all of the constructs can be found in a single DL system.

8.1.5 The Web Ontology Language (OWL) and its dialects

In order to achieve the goal of a broad range of shared ontologies using vocabularies with expressiveness appropriate for each domain, the Semantic Web requires a scalable high-performance storage and reasoning infrastructure. The major challenge towards building such an infrastructure is the expressivity of the underlying standards: RDF, RDFS, OWL and OWL 2. Even though RDFS can be considered a simple KR language, it is already a challenging task to implement a repository for it, which provides performance and scalability comparable to those of relational database management systems (RDBMS). Even the simplest dialect of OWL (OWL Lite) is a description logic (DL) that does not scale due to reasoning complexity. Furthermore, the semantics of OWL Lite are incompatible with that of RDF(S).


Figure 1 - OWL Layering Map

8.1.5.1 OWL DLP

OWL DLP is a non-standard dialect, offering a promising compromise between expressive power, efficient reasoning, and compatibility. It is defined as the intersection of the expressivity of OWL DL and logic programming. In fact, OWL DLP is defined as the most expressive sublanguage of OWL DL, which can be mapped to Datalog. OWL DLP is simpler than OWL Lite. The alignment of its semantics to RDFS is easier, as compared to the OWL Lite and OWL DL dialects. Still, this can only be achieved through the enforcement of some additional modelling constraints and transformations.

Horn logic and description logic are orthogonal (in the sense that neither of them is a subset of the other). OWL DLP is the ‘intersection’ of Horn logic and OWL; it is the Horn-definable part of OWL, or stated another way, the OWL-definable part of Horn logic.

DLP has certain advantages:

• From a modeller’s perspective, there is freedom to use either OWL or rules (and associated tools and methodologies) for modelling purposes, depending on the modeller’s experience and preferences.

• From an implementation perspective, either description logic reasoners or deductive rule systems can be used. This feature provides extra flexibility and ensures interoperability with a variety of tools.

Experience with using OWL has shown that existing ontologies frequently use very few constructs outside the DLP language.

8.1.5.2 OWL Horst

In “Combining RDF and Part of OWL with Rules: Semantics, Decidability, Complexity”, ter Horst defines RDFS extensions towards rule support and describes a fragment of OWL, more expressive than DLP. He introduces the notion of R-entailment of one (target) RDF graph from another (source) RDF graph on the basis of a set of entailment rules R. R-entailment is more general than the D-entailment used by Hayes in defining the standard RDFS semantics. Each rule has a set of premises, which conjunctively define the body of the rule. The premises are ‘extended’ RDF statements, where variables can take any of the three positions.

The head of the rule comprises one or more consequences, each of which is, again, an extended RDF statement. The consequences may not contain free variables, i.e., variables not used in the body of the rule. The consequences may contain blank nodes.

The extension of R-entailment (as compared to D-entailment) is that it ‘operates’ on top of so-called generalised RDF graphs, where blank nodes can appear as predicates. R-entailment rules without premises are used to declare axiomatic statements. Rules without consequences are used to detect inconsistencies.

In this document, we refer to this extension of RDFS as “OWL Horst”. This language has a number of important characteristics:

• It is a proper (backward-compatible) extension of RDFS. In contrast to OWL DLP, it puts no constraints on the RDFS semantics. The widely discussed meta-classes (classes as instances of other classes) are not disallowed in OWL Horst. It also does not enforce the unique name assumption;

• Unlike DL-based rule languages such as SWRL, R-entailment provides a formalism for rule extensionswithout DL-related constraints;

• Its complexity is lower than SWRL and other approaches combining DL ontologies with rules.

In Figure 1, the pink box represents the range of expressivity of GraphDB, i.e., including OWL DLP, OWL Horst, OWL 2 RL, and most of OWL Lite. However, none of the rule-sets include support for the entailment of typed literals (D-entailment).

OWL Horst is close to what SWAD-Europe has intuitively described as OWL Tiny. The major difference is that OWL Tiny (like the fragment supported by GraphDB) does not support entailment over data types.


8.1.5.3 OWL2 RL

OWL 2 is a re-work of the OWL language family by the OWL working group. This work includes identifying fragments of the OWL 2 language that have desirable behavior for specific applications/environments.

The OWL 2 RL profile is aimed at applications that require scalable reasoning without sacrificing too much expressive power. It is designed to accommodate both OWL 2 applications that can trade the full expressivity of the language for efficiency, and RDF(S) applications that need some added expressivity from OWL 2. This is achieved by defining a syntactic subset of OWL 2, which is amenable to implementation using rule-based technologies, and presenting a partial axiomatisation of the OWL 2 RDF-Based Semantics in the form of first-order implications that can be used as the basis for such an implementation. The design of OWL 2 RL was inspired by Description Logic Programs and pD*.

8.1.5.4 OWL Lite

The original OWL specification, now known as OWL 1, provides two specific subsets of OWL Full designed to be of use to implementers and language users. The OWL Lite subset was designed for easy implementation and to offer users a functional subset that provides an easy way to start using OWL.

OWL Lite is a sublanguage of OWL DL that supports only a subset of the OWL language constructs. OWL Lite is particularly targeted at tool builders, who want to support OWL, but who want to start with a relatively simple basic set of language features. OWL Lite abides by the same semantic restrictions as OWL DL, allowing reasoning engines to guarantee certain desirable properties.

8.1.5.5 OWL DL

The OWL DL (where DL stands for Description Logic) subset was designed to support the existing Description Logic business segment and to provide a language subset that has desirable computational properties for reasoning systems.

OWL Full and OWL DL support the same set of OWL language constructs. Their difference lies in the restrictions on the use of some of these features and on the use of RDF features. OWL Full allows free mixing of OWL with RDF Schema and, like RDF Schema, does not enforce a strict separation of classes, properties, individuals and data values. OWL DL puts constraints on mixing with RDF and requires disjointness of classes, properties, individuals and data values. The main reason for having the OWL DL sublanguage is that tool builders have developed powerful reasoning systems that support ontologies constrained by the restrictions required for OWL DL.

8.1.6 Query languages

In this section, we introduce some query languages for RDF. This may beg the question why we need RDF-specific query languages at all instead of using an XML query language. The answer is that XML is located at a lower level of abstraction than RDF. This fact would lead to complications if we were querying RDF documents with an XML-based language. The RDF query languages explicitly capture the RDF semantics in the language itself.

All the query languages discussed below have an SQL-like syntax, but there are also a few non-SQL-like languages like Versa and Adenine.

The query languages supported by Sesame (which is the Java framework within which GraphDB operates), and therefore by GraphDB, are SPARQL and SeRQL.

8.1.6.1 RQL, RDQL

RQL (RDF Query Language) was initially developed by the Institute of Computer Science at Heraklion, Greece, in the context of the European IST project MESMUSES. RQL adopts the syntax of OQL (a query language standard for object-oriented databases), and, like OQL, is defined by means of a set of core queries, a set of basic filters, and a way to build new queries through functional composition and iterators.


The core queries are the basic building blocks of RQL, which give access to the RDFS-specific contents of an RDF triplestore. RQL allows queries such as Class (retrieving all classes), Property (retrieving all properties) or Employee (returning all instances of the class with name Employee). This last query, of course, also returns all instances of subclasses of Employee, as these are also instances of the class Employee by virtue of the semantics of RDFS.

RDQL (RDF Data Query Language) is a query language for RDF first developed for Jena models. RDQL is an implementation of the SquishQL RDF query language, which itself is derived from rdfDB. This class of query languages regards RDF as triple data, without schema or ontology information unless explicitly included in the RDF source.

Apart from Sesame, the following systems currently provide RDQL (all these implementations are known to derive from the original grammar): Jena, RDFStore, PHP XML Classes, 3Store and RAP (RDF API for PHP).

8.1.6.2 SPARQL

SPARQL (pronounced “sparkle”) is currently the most popular RDF query language; its name is a recursive acronym that stands for “SPARQL Protocol and RDF Query Language”. It was standardised by the RDF Data Access Working Group (DAWG) of the World Wide Web Consortium, and is now considered a key Semantic Web technology. On 15 January 2008, SPARQL became an official W3C Recommendation.

SPARQL allows for a query to consist of triple patterns, conjunctions, disjunctions, and optional patterns. Several SPARQL implementations for multiple programming languages exist at present.
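As a small illustrative sketch (the ex: prefix and the data it assumes are made up), the following query combines these elements: triple patterns joined conjunctively, a disjunction expressed with UNION, and an OPTIONAL pattern:

PREFIX ex: <http://example.com/>

SELECT ?person ?email
WHERE {
    { ?person a ex:Employee . }        # first branch of the disjunction
    UNION
    { ?person a ex:Contractor . }      # second branch
    OPTIONAL { ?person ex:email ?email . }   # bind ?email only when present
}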

8.1.6.3 SeRQL

SeRQL (Sesame RDF Query Language, pronounced “circle”) is an RDF/RDFS query language developed by Sesame’s developer - Aduna - as part of Sesame. It selectively combines the best features (considered by its creators) of other query languages (RQL, RDQL, N-Triples, N3) and adds some features of its own. As of this writing, SeRQL provides advanced features not yet available in SPARQL. Some of SeRQL’s most important features are:

• Graph transformation;

• RDF Schema support;

• XML Schema data-type support;

• Expressive path expression syntax;

• Optional path matching.

8.1.7 Reasoning strategies

There are two principal strategies for rule-based inference: forward-chaining and backward-chaining:

Forward-chaining: to start from the known facts (the explicit statements) and to perform inference in a deductive fashion. Forward-chaining involves applying the inference rules to the known facts (explicit statements) to generate new facts. The rules can then be re-applied to the combination of original facts and inferred facts to produce more new facts. The process is iterative and continues until no new facts can be generated. The goals of such reasoning can have diverse objectives, e.g., to compute the inferred closure, to answer a particular query, to infer a particular sort of knowledge (e.g., the class taxonomy), etc.

Advantages: When all inferences have been computed, query answering can proceed extremely quickly.

Disadvantages: Initialisation costs (inference computed at load time) and space/memory usage (especially when the number of inferred facts is very large).
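For example (using a made-up ex: namespace), in a repository configured with an RDFS rule-set, the update below causes a forward-chaining engine to infer and store ex:john a ex:Person at load time, so the subsequent query finds it without any reasoning at query time:

PREFIX ex: <http://example.com/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

# Update request: triggers inference over rdfs:subClassOf on load
INSERT DATA {
    ex:Employee rdfs:subClassOf ex:Person .
    ex:john a ex:Employee .
}

# Separate query request: returns ex:john from the materialised closure
SELECT ?x WHERE { ?x a ex:Person }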

Backward-chaining involves starting with a fact to be proved or a query to be answered. Typically, the reasoner examines the knowledge base to see if the fact to be proved is present and if not it examines the rule-set to see which rules could be used to prove it. For the latter case, a check is made to see what other ‘supporting’ facts would need to be present to ‘fire’ these rules. The reasoner searches for proofs of each of these ‘supporting’ facts in the same way and iteratively maps out a search tree. The process terminates when either all of the leaves of the tree have proofs or no new candidate solutions can be found. Query processing is similar, but only stops when all search paths have been explored. The purpose in query answering is to find not just one but all possible substitutions in the query expression.

Advantages: There are no inferencing costs at start-up and minimal space requirements.

Disadvantages: Inference must be done each and every time a query is answered and for complex search graphs this can be computationally expensive and slow.

As both strategies have advantages and disadvantages, attempts to overcome their weak points have led to the development of various hybrid strategies (involving partial forward- and backward-chaining), which have proven efficient in many contexts.

8.1.7.1 Total materialisation

Imagine a repository that performs total forward-chaining, i.e., it tries to make sure that after each update to the KB, the inferred closure is computed and made available for query evaluation or retrieval. This strategy is generally known as materialisation. When new explicit facts (statements) are added to a KB (repository), new implicit facts will likely be inferred. Under a monotonic logic, adding new explicit statements will never cause previously inferred statements to be retracted. In other words, the addition of new facts can only monotonically extend the inferred closure. In order to avoid ambiguity with various partial materialisation approaches, let us call such an inference strategy, taken together with the monotonic entailment assumption, total materialisation.

Advantages and disadvantages of total materialisation:

• Upload/store/addition of new facts is relatively slow, because the repository is extending the inferred closure after each transaction. In fact, all the reasoning is performed during the upload;

• Deletion of facts is also slow, because the repository should remove from the inferred closure all the facts that can no longer be proved;

• The maintenance of the inferred closure usually requires considerable additional space (RAM, disk, or both, depending on the implementation);

• Query and retrieval are fast, because no deduction, satisfiability checking, or other sorts of reasoning are required. The evaluation of queries becomes computationally comparable to the same task for relational database management systems (RDBMS).

Probably the most important advantage of the inductive systems, based on total materialisation, is that they can easily benefit from RDBMS-like query optimisation techniques, as long as all the data is available at query time. The latter makes it possible for the query evaluation engine to use statistics and other means in order to make ‘educated’ guesses about the ‘cost’ and the ‘selectivity’ of a particular constraint. These optimisations are much more complex in the case of deductive query evaluation.

Total materialisation is adopted as the reasoning strategy in a number of popular Semantic Web repositories, including some of the standard configurations of Sesame and Jena. Based on publicly available evaluation data, it is also the only strategy that allows scalable reasoning in the range of a billion triples; such results are published by BBN (for DAML DB) and ORACLE (for RDF support in ORACLE 11g).

8.1.8 Semantic repositories

Over the last decade, the Semantic Web has emerged as an area where semantic repositories became as important as HTTP servers are today. This perspective boosted the development, under W3C-driven community processes, of a number of robust metadata and ontology standards. These standards play the role that SQL had for the development and spread of the relational DBMS. Although designed for the Semantic Web, these standards face increasing acceptance in areas such as Enterprise Application Integration and Life Sciences.

In this document, the term ‘semantic repository’ is used to refer to a system for storage, querying, and management of structured data with respect to ontologies. At present, there is no single well-established term for such engines. Weak synonyms are: reasoner, ontology server, metastore, semantic/triple/RDF store, database, repository, knowledge base. The different wording usually reflects a somewhat different approach to implementation, performance, intended application, etc. Introducing the term ‘semantic repository’ is an attempt to convey the core functionality offered by most of these tools. Semantic repositories can be used as a replacement for database management systems (DBMS), offering easier integration of diverse data and more analytical power. In a nutshell, a semantic repository can dynamically interpret metadata schemata and ontologies, which define the structure and the semantics related to the data and the queries. Compared to the approach taken in a relational DBMS, this allows for easier changing and combining of data schemata and automated interpretation of the data.

8.2 GraphDB Feature Comparison

Feature | GraphDB Free | GraphDB SE | GraphDB EE
------- | ------------ | ---------- | ----------
Manage unlimited number of RDF statements | Yes | Yes | Yes
Full SPARQL 1.1 support | Yes | Yes | Yes
Deploy anywhere using Java | Yes | Yes | Yes
100% compatible with Sesame framework | Yes | Yes | Yes
Ultra fast forward-chaining reasoning | Yes | Yes | Yes
Efficient retraction of inferred statements upon update | Yes | Yes | Yes
Full standard-compliant and optimised rule-sets for RDFS, OWL2 RL and QL | Yes | Yes | Yes
Custom reasoning and consistency checking rule-sets | Yes | Yes | Yes
Plugin API for engine extension | Yes | Yes | Yes
Support for Geo-spatial indexing & querying, plus GeoSPARQL | Yes | Yes | Yes
Query optimizer allowing effective query execution | Yes | Yes | Yes
Workbench interface to manage repositories, data, user accounts and access roles | Yes | Yes | Yes
Lucene connector for full-text search | Yes | Yes | Yes
Solr connector for full-text search | No | No | Yes
Elasticsearch connector for full-text search | No | No | Yes
High performance load, query and inference simultaneously | Limited to two concurrent queries | Yes | Yes
Automatic failover, synchronisation and load balancing to maximize cluster utilisation | No | No | Yes
Scale out concurrent query processing allowing query throughput to scale proportionally to the number of cluster nodes | No | No | Yes
Cluster elasticity remaining fully functional in the event of failing nodes | No | No | Yes
Community support | Yes | Yes | Yes
Commercial SLA | No | Yes | Yes

8.3 SPARQL Compliance

GraphDB supports the following SPARQL specifications:

8.3.1 SPARQL 1.1 Protocol for RDF

SPARQL 1.1 Protocol for RDF defines the means for transmitting SPARQL queries to a SPARQL query processing service, and returning the query results to the entity that requested them.


8.3.2 SPARQL 1.1 Query

SPARQL 1.1 Query provides more powerful query constructs compared to SPARQL 1.0 (see the sketch after the list below). It adds:

• Aggregates;

• Subqueries;

• Negation;

• Expressions in the SELECT clause;

• Property Paths;

• Assignment;

• An expanded set of functions and operators.
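A brief sketch combining several of these features (the ex: prefix and data are hypothetical): an aggregate with GROUP BY, an expression in the SELECT clause, and a one-or-more property path:

PREFIX ex: <http://example.com/>

SELECT ?dept (COUNT(?person) AS ?headcount)    # aggregate as a SELECT expression
WHERE {
    ?person ex:worksIn ?dept .
    ?dept ex:partOf+ ex:TheCompany .           # property path: one or more ex:partOf hops
}
GROUP BY ?dept
HAVING (COUNT(?person) > 5)                    # keep only groups with more than 5 members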

8.3.3 SPARQL 1.1 Update

SPARQL 1.1 Update provides a means to change the state of the database using a query-like syntax. SPARQL Update has similarities to SQL INSERT INTO, UPDATE WHERE and DELETE FROM behaviour. For full details, see the W3C SPARQL Update working group page.

8.3.3.1 Modification operations on the RDF triples:

• INSERT DATA {...} - inserts RDF statements;

• DELETE DATA {...} - removes RDF statements;

• DELETE {...} INSERT {...} WHERE {...} - for more complex modifications (see the sketch after this list);

• LOAD (SILENT) from_iri - loads an RDF document identified by from_iri;

• LOAD (SILENT) from_iri INTO GRAPH to_iri - loads an RDF document into the local graph called to_iri;

• CLEAR (SILENT) GRAPH iri - removes all triples from the graph identified by iri;

• CLEAR (SILENT) DEFAULT - removes all triples from the default graph;

• CLEAR (SILENT) NAMED - removes all triples from all named graphs;

• CLEAR (SILENT) ALL - removes all triples from all graphs.
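A minimal sketch (the ex: prefix, IRIs and literal values are made up) showing two of these operations in a single update request, separated by ‘;’:

PREFIX ex: <http://example.com/>

# First operation: assert a statement
INSERT DATA { ex:john ex:status "temporary" } ;

# Second operation: rewrite every "temporary" status to "permanent"
DELETE { ?s ex:status "temporary" }
INSERT { ?s ex:status "permanent" }
WHERE  { ?s ex:status "temporary" }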

8.3.3.2 Operations for managing graphs:

• CREATE - creates a new graph in stores that support empty graphs;

• DROP - removes a graph and all of its contents;

• COPY - modifies a graph to contain a copy of another (see the sketch after this list);

• MOVE - moves all of the data from one graph into another;

• ADD - reproduces all data from one graph into another.
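A hypothetical maintenance sequence (the graph IRIs are made up) that archives a graph and clears the original; note that the COPY plus DROP pair here is what a single MOVE would do:

CREATE SILENT GRAPH <http://example.com/archive> ;
COPY GRAPH <http://example.com/current> TO GRAPH <http://example.com/archive> ;
DROP GRAPH <http://example.com/current>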

8.3.4 SPARQL 1.1 Federation

SPARQL 1.1 Federation provides extensions to the query syntax for executing distributed queries over any number of SPARQL endpoints. This feature is very powerful, and allows integration of RDF data from different sources using a single query.

For example, to discover DBpedia resources about people who have the same names as those stored in a local repository, use the following query:

216 Chapter 8. References

Page 227: GraphDB Free Documentationgraphdb.ontotext.com/documentation/6.6/pdf/GraphDB-Free.pdfGraphDB Free Documentation ... general 1

GraphDB Free Documentation, Release 6.6

SELECT ?dbpedia_id
WHERE {
    ?person a foaf:Person ;
        foaf:name ?name .

    SERVICE <http://dbpedia.org/sparql> {
        ?dbpedia_id a dbpedia-owl:Person ;
            foaf:name ?name .
    }
}

It matches the first part against the local repository and for each person it finds, it checks the DBpedia SPARQL endpoint to see if a person with the same name exists and, if so, returns the ID.

Since Sesame repositories are also SPARQL endpoints, it is possible to use the federation mechanism to do distributed querying over several repositories on a local server.

For example, imagine that you have two repositories - one called my_concepts with triples about concepts and another called my_labels, containing all label information.

To retrieve the corresponding label for each concept, you can execute the following query on the my_concepts repository:

SELECT ?id ?label
WHERE {
    ?id a ex:Concept .

    SERVICE <http://localhost:8080/openrdf-sesame/repositories/my_labels> {
        ?id rdfs:label ?label .
    }
}

Note: Federation must be used with caution. First of all, to avoid doing excessive querying of remote (public) SPARQL endpoints, but also because it can lead to inefficient query patterns.

The following example finds resources in the second SPARQL endpoint, which have a similar rdfs:label to the rdfs:label of <http://dbpedia.org/resource/Vaccination> in the first SPARQL endpoint:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?endpoint2_id {
    SERVICE <http://faraway_endpoint.org/sparql> {
        ?endpoint1_id rdfs:label ?l1 .
        FILTER( lang(?l1) = "en" )
    }
    SERVICE <http://remote_endpoint.com/sparql> {
        ?endpoint2_id rdfs:label ?l2 .
        FILTER( str(?l2) = str(?l1) )
    }
}
BINDINGS ?endpoint1_id
{ ( <http://dbpedia.org/resource/Vaccination> ) }

However, such a query is very inefficient, because no intermediate bindings are passed between endpoints. Instead, both subqueries execute independently, requiring the second subquery to return all X rdfs:label Y statements that it stores. These are then joined locally to the (likely much smaller) results of the first subquery.

8.3.5 SPARQL 1.1 Graph Store HTTP Protocol

SPARQL 1.1 Graph Store HTTP Protocol provides a means for updating and fetching RDF graph content from a Graph Store over HTTP in the REST style.

8.3.5.1 URL patterns for this new functionality are provided at:

• <SESAME_URL>/repositories/<repo_id>/rdf-graphs/service (for indirectly referenced named graphs);

• <SESAME_URL>/repositories/<repo_id>/rdf-graphs/<NAME> (for directly referenced named graphs).

8.3.5.2 Methods supported by these resources and their effects:

• GET - fetches statements in the named graph from the repository in the requested format.

• PUT - updates data in the named graph in the repository, replacing any existing data in the named graph with the supplied data. The data supplied with this request is expected to contain an RDF document in one of the supported RDF formats.

• DELETE - deletes all data in the specified named graph in the repository.

• POST - updates data in the named graph in the repository by adding the supplied data to any existing data in the named graph. The data supplied with this request is expected to contain an RDF document in one of the supported RDF formats.

8.3.5.3 Request headers:

• Accept: Relevant values for GET requests are the MIME types of supported RDF formats.

• Content-Type: Must specify the encoding of any request data sent to a server. Relevant values are the MIME types of supported RDF formats.

8.3.5.4 Supported parameters for requests on indirectly referenced named graphs:

• graph (optional): specifies the URI of the named graph to be accessed.

• default (optional): specifies that the default graph is to be accessed. This parameter is expected to be present but to have no value.

Note: Each request on an indirectly referenced graph needs to specify precisely one of the above parameters.

8.4 OWL Compliance

GraphDB supports several OWL-like dialects: OWL Horst (owl-horst), OWL Max (owl-max), which covers most of OWL Lite and RDFS, OWL2 QL (owl2-ql) and OWL2 RL (owl2-rl).

With the owl-max rule-set, GraphDB supports the following semantics:

• full RDFS semantics without constraints or limitations, apart from the entailment related to typed literals (known as D-entailment). For instance, meta-classes (and any arbitrary mixture of class, property, and individual) can be combined with the supported OWL semantics;

• most of OWL Lite;

• all of OWL DLP.

The differences between OWL Horst and the OWL dialects supported by GraphDB (owl-horst and owl-max) can be summarised as follows:


• GraphDB does not provide the extended support for typed literals, introduced with the D-entailment extension of the RDFS semantics. Although such support is conceptually clear and easy to implement, it is our understanding that the performance penalty is too high for most applications. You can easily implement the rules defined for this purpose by ter Horst and add them to a custom rule-set;

• There are no inconsistency rules by default;

• A few more OWL primitives are supported by GraphDB (rule-set owl-max);

• There is extended support for schema-level (T-Box) reasoning in GraphDB.

Even though the concrete rules pre-defined in GraphDB differ from those defined in OWL Horst, the complexity and decidability results reported for R-entailment are relevant for TRREE and GraphDB. To be more precise, the rules in the owl-horst rule-set do not introduce new B-Nodes, which means that R-entailment with respect to them takes polynomial time. In KR terms, this means that the owl-horst inference within GraphDB is tractable.

Inference using owl-horst is of a lesser complexity compared to other formalisms that combine DL formalisms with rules. In addition, it puts no constraints with respect to meta-modelling.

The correctness of the support for OWL semantics (for those primitives that are supported) is checked against the normative Positive- and Negative-entailment OWL test cases.

8.5 Glossary

Datalog A query and rule language for deductive databases that syntactically is a subset of Prolog.

D-entailment A vocabulary entailment of an RDF graph that respects the ‘meaning’ of data types.

Description Logic A family of formal knowledge representation languages that are subsets of first-order logic, but have more efficient decision problems.

Horn Logic Broadly means a system of logic whose semantics can be captured by Horn clauses. A Horn clause has at most one positive literal and allows for an IF...THEN interpretation, hence the common term ‘Horn Rule’.

Knowledge Base (In the Semantic Web sense) is a database of both assertions (ground statements) and an inference system for deducing further knowledge based on the structure of the data and a formal vocabulary.

Knowledge Representation An area in artificial intelligence that is concerned with representing knowledge in a formal way such that it permits automated processing (reasoning).

Load Average The load average represents the average system load over a period of time.

Materialisation The process of inferring and storing (for later retrieval or use in query answering) every piece of information that can be deduced from a knowledge base’s asserted facts and vocabulary.

Named Graph A group of statements identified by a URI. It allows a subset of statements in a repository to be manipulated or processed separately.

Ontology A shared conceptualisation of a domain, described using a formal (knowledge) representation language.

OWL A family of W3C knowledge representation languages that can be used to create ontologies. See Web Ontology Language.

OWL Horst An entailment system built upon RDF Schema, see R-entailment.

Predicate Logic Generic term for symbolic formal systems like first-order logic, second-order logic, etc. Its formulas may contain variables which can be quantified.

RDF Graph Model The interpretation of a collection of RDF triples as a graph, where resources are nodes in the graph and predicates form the arcs between nodes. Therefore one statement leads to one arc between two nodes (subject and object).

RDF Schema A vocabulary description language for RDF with formal semantics.

Resource An element of the RDF model, which represents a thing that can be described, i.e., a unique name to identify an object or a concept.


R-entailment A more general semantics layered on RDFS, where any set of rules (i.e., rules that extend or even modify RDFS) is permitted. Rules are of the form IF...THEN... and use RDF statement patterns in their premises and consequences, with variables allowed in any position.

Resource Description Framework (RDF) A family of W3C specifications for modelling knowledge with a variety of syntaxes.

Semantic Repository A semantic repository is a software component for storing and manipulating RDF data. It is made up of three distinct components:

• An RDF database for storing, retrieving, updating and deleting RDF statements (triples);

• An inference engine that uses rules to infer ‘new’ knowledge from explicit statements;

• A powerful query engine for accessing the explicit and implicit knowledge.

Semantic Web The concept of attaching machine understandable metadata to all information published on the internet, so that intelligent agents can consume, combine and process information in an automated fashion.

SPARQL The most popular RDF query language.

Statement or Triple A basic unit of information expression in RDF. A triple consists of subject-predicate-object.

Universal Resource Identifier (URI) A string of characters used to (uniquely) identify a resource.

8.6 Install Tomcat

8.6.1 Requirements

• Tomcat 8.x requires Java 7 or later.

Tip: To check your current Java version, open a Terminal and run java -version.

8.6.2 On Mac OS

8.6.2.1 Steps

1. Download a binary distribution of the core module: apache-tomcat-8.0.26.tar.gz from here.

2. Move the unarchived distribution to /usr/local.

sudo mkdir -p /usr/local
sudo mv ~/Downloads/apache-tomcat-8.0.26 /usr/local

Tip: To make it easy to replace this release with future releases, create a symbolic link that you are going to use when referring to Tomcat (after removing the old link you might have from installing a previous version):

sudo rm -f /Library/Tomcat
sudo ln -s /usr/local/apache-tomcat-8.0.26 /Library/Tomcat

3. Change ownership of the /Library/Tomcat folder hierarchy:

sudo chown -R <your_username> /Library/Tomcat

4. Make all scripts executable:


sudo chmod +x /Library/Tomcat/bin/*.sh

5. Start the Tomcat server.

./startup.sh

The Tomcat default page appears at http://localhost:8080.

6. Stop the Tomcat server:

./shutdown.sh

8.6.3 On Windows

Check the Tomcat Documentation.

8.7 Previous Versions

This is the documentation for GraphDB Free 6.6. To read the documentation for a previous version, please click the corresponding link below.

8.7.1 Documentation for version 6.0 to 6.5

The documentation for versions of GraphDB from 6.0 to 6.5 is in a unified format that applies to all GraphDB editions. It is available on Ontotext’s Confluence:

• GraphDB 6.5

• GraphDB 6.4

• GraphDB 6.3

• GraphDB 6.2

• GraphDB 6.0 and 6.1

8.7.2 Documentation for versions 4.x and 5.x

Prior to version 6.0, GraphDB was called OWLIM, whose documentation is in the same unified format on Ontotext’s Confluence:

• OWLIM 5.6

• OWLIM 5.5

• OWLIM 5.4

• OWLIM 5.3

• OWLIM 5.2

• OWLIM 5.1

• OWLIM 5.0

• OWLIM 4.4

• OWLIM 4.3

• OWLIM 4.2

• OWLIM 4.1


• OWLIM 4.0


9 Release Notes

Hint: The release notes for this version of GraphDB are available under GraphDB 6.6.x Release Notes.

GraphDB release notes provide information about the features and improvements in each release, as well as various bugfixes. Starting from version 6.2, GraphDB’s versioning scheme is based on semantic versioning. The full version is composed of three components:

major.minor.patch

e.g., 6.2.1 where the major version is 6, the minor version is 2 and the patch version is 1.

Note: Releases with the same major and minor versions do not contain any new features. Releases with different patch versions contain fixes for bugs discovered since the previous minor. New or significantly changed features are released with a higher major or minor version.

GraphDB includes three separate components with their own version numbers:

• GraphDB Engine

• GraphDB Workbench

• GraphDB Connectors.

Their individual versions use the same semantic versioning scheme as the whole product and their values are provided only as a reference.

9.1 GraphDB 6.6.x Release Notes

9.1.1 GraphDB 6.6.9

Released: 6 February, 2017

9.1.1.1 Component versions

Sesame | Engine | Workbench | Connectors
------ | ------ | --------- | ----------
2.7.8  | 6.2.13 | 6.8.5     | 4.2.7

9.1.1.2 GraphDB Engine

• OWLIM-2622 Queries that generate a lot of new entities at query time result in an exception “The request entity pool has been exhausted”.


9.1.1.3 GraphDB Workbench

No changes.

9.1.1.4 GraphDB Connectors

No changes.

9.1.2 GraphDB 6.6.8

Released: 28 October, 2016

9.1.2.1 Component versions

Sesame | Engine | Workbench | Connectors
------ | ------ | --------- | ----------
2.7.8  | 6.2.12 | 6.8.5     | 4.2.7

Important:

• Fixed an issue with the connectors, which could potentially return a different fingerprint of the entity pool under heavy load.

9.1.2.2 GraphDB Engine

No changes.

9.1.2.3 GraphDB Workbench

No changes.

9.1.2.4 GraphDB Connectors

• GDB-567 Connector fingerprint may go out of sync during intense workload on a cluster setup

9.1.3 GraphDB 6.6.7

Released: 1 August, 2016

9.1.3.1 Component versions

Sesame | Engine | Workbench | Connectors
------ | ------ | --------- | ----------
2.7.8  | 6.2.12 | 6.8.5     | 4.2.6

Important:

• Fixed issues with the connectors causing out of sync errors when used in a cluster.

9.1.3.2 GraphDB Engine

No changes.


9.1.3.3 GraphDB Workbench

No changes.

9.1.3.4 GraphDB Connectors

• STORE-244 Connector fingerprint not reset on transaction rollback caused by exception

• STORE-245 Connector fingerprint may go out of sync in cluster after a worker restart

9.1.4 GraphDB 6.6.6

Released: 21 July, 2016

9.1.4.1 Component versions

Sesame | Engine | Workbench | Connectors
------ | ------ | --------- | ----------
2.7.8  | 6.2.12 | 6.8.5     | 4.2.5

Important:

• Fixed a memory leak occurring when long-running queries with GROUP BY or a subquery time out.

9.1.4.2 GraphDB Engine

Bug fixes

• OWLIM-3060 Memory leak when using a combination of INSERT and SELECT queries with GROUP BY and ORDER BY clauses

9.1.4.3 GraphDB Workbench

No changes.

9.1.4.4 GraphDB Connectors

• STORE-232 Race condition with concurrent queries/updates in Lucene

9.1.5 GraphDB 6.6.5

Released: 25 April, 2016

9.1.5.1 Component versions

Sesame | Engine | Workbench | Connectors
------ | ------ | --------- | ----------
2.7.8  | 6.2.10 | 6.8.5     | 4.2.4

Important:

• Bug fixes and some improvements in the Engine;

• Bug fixes in the Workbench.


9.1.5.2 GraphDB Engine

Improvements

• OWLIM-2965 Improve the logging in the cluster code for better error diagnostics;

• OWLIM-2968 The SailIterationMonitor MBean in JMX should use the repository id instead of the path to the repository storage. Backslashes in the Windows paths can lead to many problems.

Bug fixes

• OWLIM-2980 INSERT followed by DELETE ... WHERE fails in the transactional Entity Pool.

The error message in the log will be:

org.openrdf.repository.RepositoryException: org.openrdf.sail.SailException: com.ontotext.trree.entitypool2.EntityPoolConnectionException: Connection in precommit state may not create IDs

• OWLIM-2986 BIND within OPTIONAL returns an incorrectly set previous value. BIND within OPTIONAL fails to update its value and returns incorrect results.

• OWLIM-2962 StorageTool after rebuilding the POS index increases its size instead of decreasing (compacting) it

• OWLIM-2975 Storage tool rebuilds the PCSO index in PCOS order.

9.1.5.3 GraphDB Workbench

• WB-958 Updated the link to the documentation to open the edition-specific documentation for 6.6;

• WB-953 Use daemon threads for import (resolves hang on shutdown in Tomcat);

• WB-934 URIs that contain single quote generate broken links.

9.1.5.4 GraphDB Connectors

No changes.

9.1.6 GraphDB 6.6.4

Released: 1 April, 2016

9.1.6.1 Component versions

Sesame | Engine | Workbench | Connectors
------ | ------ | --------- | ----------
2.7.8  | 6.2.9  | 6.8.4     | 4.2.4

Important:

• The engine can automatically add DISTINCT to CONSTRUCT queries (configurable)

• Fix for GraphDB Free hanging on some DESCRIBE queries

• Fix for an issue where, if a big transaction fails, the database could freeze on rollback with ArrayIndexOutOfBoundsException

• Fix for LoadRDF not being able to create a repository of type graphdb:FreeSailRepository

• Fix for wrong results when using one-or-more property paths and sameAs together


• Fix for repository create/edit form in the Workbench

9.1.6.2 GraphDB Engine

Features and improvements

• OWLIM-2870 Add DISTINCT to CONSTRUCT queries

Some CONSTRUCT queries can return a different number of results with different GraphDB versions. This is because they can return duplicates, which can differ for each GraphDB version depending on the implementation of the engine. It is now possible to add DISTINCT internally to all CONSTRUCT queries. The SPARQL 1.1 specification does not assume DISTINCT of the results of DESCRIBE. The rationale is to allow for efficient generation of large RDF graphs with DESCRIBE queries. This option is configurable.

• OWLIM-2871 Allow literals containing spaces in a PIE file

Bugfixes

• OWLIM-2899 Failure to convert to our query model could result in GraphDB Free hanging

GraphDB Free hangs on some DESCRIBE queries

• OWLIM-2867 Big transactions, increasing the total number of pages in the index file, cause the database to freeze on rollback with java.lang.ArrayIndexOutOfBoundsException

If a big transaction that increases the total number of pages in the index file fails, the database will freeze on rollback with java.lang.ArrayIndexOutOfBoundsException: number. The possible ways to encounter this problem are (1) an RDF file with invalid syntax or (2) a consistency checking rule on the initial or a very big import, which doubles the size of the repository.

• OWLIM-2868 Wrong results when using one-or-more property paths and sameAs together

The engine fails to optimize queries containing arbitrary-length property paths and throws a runtime exception. As a result, queries that use special graphs or predicates fail to return valid results, since they are evaluated by the Sesame implementation, which ignores them.

• OWLIM-2866 LoadRDF does not support creation of repository type: graphdb:FreeSailRepository

The LoadRDF tool cannot be used with the GraphDB Free version

• OWLIM-2935 StorageTool can not rebuild images using 40bit entities

• OWLIM-2924 Import of big files fails when Connectors are enabled with “The system entity pool has been exhausted”

• OWLIM-2905 The plugin manager code is creating system entities when in fact they should be in the default scope

Observed as java.lang.RuntimeException: System entities may not be used as regular data. The problem is that the plugin manager is putting the subject and the object in the system scope and those cannot be used as regular data after that.

• OWLIM-2748 Inconsistent removal of owl:sameAs statements

An operation that removes statements with the owl:sameAs predicate fails to delete all triples from the repository.

• OWLIM-2925 Failure on initialisation might lead to a failure to lock the repository on the next initialisation

A sail exception with the following message: “Repository under <path> is currently in use (we failed to lock its lockfile: <lock-file-path>)” in the logs and failure to initialize a repository

9.1. GraphDB 6.6.x Release Notes 227

Page 238: GraphDB Free Documentationgraphdb.ontotext.com/documentation/6.6/pdf/GraphDB-Free.pdfGraphDB Free Documentation ... general 1

GraphDB Free Documentation, Release 6.6

9.1.6.3 GraphDB Cluster

No changes.

9.1.6.4 GraphDB Workbench

Bugfixes

• WB-885 Enabling predicate indices when creating or editing a repository is not saved.

9.1.6.5 GraphDB Connectors

No changes.

9.1.7 GraphDB 6.6.3

Released: 24 February, 2016

9.1.7.1 Component versions

Sesame | Engine | Workbench | Connectors
------ | ------ | --------- | ----------
2.7.8  | 6.2.8  | 6.8.3     | 4.2.4

Important:

• Fixes and improvements in the query optimisation;

• Improvements in logging;

• Improvement in the Storage tool - now it finds inconsistencies in the index page IDs;

• Fix for the explain plan - not showing the use of literal index;

• Fix for queries with BINDINGS and GROUP BY queries - correctness of results and speed improvement;

• Fix in the cluster for a worker with critically low disc space - when starting, it hangs forever on initialisation;

• Fix for LIMIT breaking SPARQL queries in the Workbench when combined with BINDINGS;

• Fix for the free access and permission check in the Workbench for repositories with “_” in the name.

9.1.7.2 GraphDB Engine

Features and improvements

• OWLIM-2595 Query Optimization Improvements and fixes.

• OWLIM-2737 Storage tool finds different factor id in page and page index for the same page.

• OWLIM-2745 Log a warning when a query couldn’t be executed on a particular worker.

• OWLIM-2751 Improve log message in GraphDB logs when an incompatible version of owlim.properties has been used.

• OWLIM-2706 Improve log messages related to a transaction being rolled back due to (JMX) shutdown of a repository.


Bugfixes

• OWLIM-2727 The explain plan doesn’t expose when the literals-index is being used.

• OWLIM-2733 Queries with the BINDINGS clause in combination with GROUP BY return incorrect results.

• OWLIM-2789 Sub-SELECT with GROUP BY within OPTIONAL does not provide results for subsequent reevaluations.

• OWLIM-2796 Queries with GROUP BY cannot be interrupted.

• OWLIM-2768 Updates with a WHERE clause and a timeout throw a NullPointerException when the timeout is reached. A heavily loaded cluster might become massively unstable, since many transactions fail with unknown exceptions and cause a fingerprint mismatch.

• OWLIM-2760 Properly handle exceptions when the browser cannot be launched by the embedded Tomcat. When GraphDB is started from startup.sh (.bat), the engine gets killed if there is no default browser set.

9.1.7.3 GraphDB Cluster

Bugfixes

• OWLIM-2795 On unknown exception, retry the transaction at most 3 times on a particular worker. Writes are blocked and there is an endless stream of exceptions in the master’s log saying that a transaction will be retried because there is an unknown failure: “Unknown failure (re-sending”

• OWLIM-2773 When starting a worker with critically low disc space, it hangs forever on initialization.

9.1.7.4 GraphDB Workbench

Bugfixes

• WB-799 LIMIT in queries not working when combined with BINDINGS.

• WB-826 Free access is broken when there is “_” in the name of the repository.

9.1.7.5 GraphDB Connectors

No changes.

9.1.8 GraphDB 6.6.2

Released: 5 January, 2016

9.1.8.1 Component versions

Sesame | Engine | Workbench | Connectors
------ | ------ | --------- | ----------
2.7.8  | 6.2.7  | 6.8.2     | 4.2.4

Important: GraphDB Workbench

Fix for a critical issue that could allow an attacker to disable security. If you use the GraphDB Workbench with security ON, please upgrade to this version. If you do not use security, there is no need to upgrade.


9.1.8.2 GraphDB Engine

No changes.

9.1.8.3 GraphDB Workbench

Bugfixes

• WB-801 Security can be disabled without authorization

9.1.8.4 GraphDB Connectors

No changes.

9.1.9 GraphDB 6.6.1

Released: 30 December, 2015

9.1.9.1 Component versions

Sesame | Engine | Workbench | Connectors
------ | ------ | --------- | ----------
2.7.8  | 6.2.7  | 6.8.1     | 4.2.4

Important: GraphDB Engine

• Fix for memory leak when using direct plugin with UNION;

• The Storage tool rebuilds collections faster and detects duplicated IDs in the entity pool during ‘scan’ phase;

• The experimental query plan is now the default, and literal index usage is exposed;

• The Literal index structure is now transactional;

• The Entity Pool detects invalid files and prevents ID duplication;

• Consistency checking rules are also fired on DELETE queries.

GraphDB Cluster

• A properly handled OutOfMemoryException prevents the workers from ending up in an inconsistent state;

• Improvements in handling some scenarios and bug fixes (see detailed issue list).

GraphDB Workbench

• Improved usability in the SPARQL editor;

• Support for Internet Explorer 11 and Microsoft Edge;

• Bug fixes.

Other

• Maven installer for the GraphDB runtime artifact in the developer examples.


9.1.9.2 GraphDB Engine

Features and improvements

• OWLIM-2629 Storage Tool: implement check for consistency of EntityPool and/or HashEntityPool index

• OWLIM-2404 Storage Tool: rebuild command speed improvement

• OWLIM-2559 Make the new explain plan default

Bugfixes

• OWLIM-2731 Memory leak when using direct plugin with UNION

• OWLIM-2722 ASK query status not reported correctly in SailIterationWrapper

• OWLIM-2720 Iterator leak on rebuild predicate statistics when asserts are enabled

• OWLIM-2710 Duplicate entity IDs were identified when overwriting a pre-created repo image over an empty one

• OWLIM-2705 Handle worker initialization across hot restarts

• OWLIM-2676 Change the hash equivalence collection to be transactional

• OWLIM-2673 Inconsistent repo state after rollback caused by plugin exception

• OWLIM-2669 Killer transaction logic might give the update on the same worker more than once

• OWLIM-2645 Invalid literal index causes the workers to go out of sync

• OWLIM-2502 Consistency checking rules fail to prevent deleting statement leading to inconsistent state

• OWLIM-2422 .exportStatements() with includeInferred and OWL2-RL ruleset throws NullPointerException

• OWLIM-2391 Shutting down the repository might result in failed queries on clients

• OWLIM-2360 Literal index is not transactional which might lead to the page size exception

• OWLIM-2658 Explain plan doesn’t expose if we are hitting the literals-index at the moment

• OWLIM-2723 Query specific timeout (setMaxQueryTime) is not propagated to the worker nodes

• OWLIM-2724 Out of memory errors are not re-thrown on beginTransaction, which can lead to cyclic replications in cluster setup

9.1.9.3 GraphDB Workbench

Features and improvements

• WB-792 Improved usability and user experience in the SPARQL editor

• WB-791 Improved support for IE 11 and Edge

Bugfixes

• WB-790 Namespaces are not shown properly in some cases

• WB-789 Abort query not working when multiple repositories with a common prefix in the name exist

• WB-786 Query is not executed and the loader stays on in the SPARQL editor

• WB-782 Size in the navbar not reported correctly for remote repositories

• WB-679 Repositories from a different GraphDB product are not shown in the location view


9.1.9.4 GraphDB Connectors

No changes.

9.1.10 GraphDB 6.6.0

Released: 3 December, 2015

9.1.10.1 Component versions

Sesame | Engine | Workbench | Connectors
------ | ------ | --------- | ----------
2.7.8  | 6.2.6  | 6.8.0     | 4.2.4

Important: GraphDB Free

There is a new edition of GraphDB called GraphDB Free. It does not require a licence and supports the full functionality of the Standard Edition. The only limitation is that no more than 2 queries can run simultaneously.

GraphDB Engine

• Introduced a read-only mode, allowing you to do a replication and backup without shutting down the node used as a source for replication/backup;

• Fixed memory leaks in certain scenarios (OWLIM-2660, OWLIM-2563);

• Fixed degradation of performance for queries that make extensive use of the literal index (datetimes, numbers);

• Fix for incorrect predicate statistics calculation when doing heavy delete operations;

• Fix for slow repository shutdown due to internal entity pool connections leaks;

• Notification and diagnostic improvements in the JMX bean and log files;

• Storage tool’s “scan” command now checks the integrity of the Entity Pool as well;

GraphDB Cluster

• Improved cluster stability in Multi DC setup:

– Masters prevent transactions that consistently ‘break’ workers and bring the whole cluster down;

– Fix for backup and restore failures in some scenarios (OWLIM-2599, OWLIM-2649, OWLIM-2474, OWLIM-2526);

– Fix for error during clearing worker’s transaction log;

– Replication and restore timeout take the repository size into account;

– The clearTransactionLog method was removed from the JMX bean. It caused more problems than it actually solved. To clear the transaction log, stop the master and clean its transaction log directory.

• Compression is used by default for backup, restore and replications, which improves the speed significantly;

• The logs now indicate why there was a fingerprint mismatch if the master expected one fingerprint but got another from the worker. Examples:

Out of sync (417): Main fingerprint parts are not the same;
<tx2> @103 [direct=0][expose-entity=0][geospatial=2][literals-index=3][lucene=0]
    [notifications=0][plugincontrol=0][rdfpriming=0][rdfrank=0][script=0] 1 32
expected: <tx1> @102 [direct=0][expose-entity=0][geospatial=2][literals-index=3][lucene=0]
    [notifications=0][plugincontrol=0][rdfpriming=0][rdfrank=0][script=0] 1 32

We properly detect:

– different main parts of the fingerprint (the first component);


– different plugins enabled;

– different plugin fingerprints;

– different entity id sizes;

• GraphDB Enterprise has been enabled to work with repositories created on GraphDB Standard Edition;

• A new configuration option has been added: “protect-single-worker”, which prevents the cluster from replication if only a single worker is available (default: true).

GraphDB Workbench

Minor improvements and many bug fixes.

GraphDB Connectors

Minor improvements and bug fixes.

9.1.10.2 GraphDB Engine

Features and improvements

• F-728 Track the provenance of implicit statements (experimental provenance plugin).

• OWLIM-2434 Remove stack traces from JMX node status returned by master.

• OWLIM-2522 Human readable revision in System status query.

• OWLIM-2486 MUTE mode for master nodes.

• OWLIM-2598 Throw an exception for read requests if there is a restore running at the moment.

• OWLIM-2412 Read-only mode for the worker/SE and online replications allowed.

• OWLIM-2404 Storage tool improvements.

• OWLIM-2529 Retry transactions after unknown failures until ConnectException or worker restarts.

• OWLIM-2472 Remove clearTransactionLog() method from cluster info bean.

• OWLIM-2640 New reporting script - generates archive with all logs of masters and workers and status info.

• OWLIM-2616 Manual master failover support.

• OWLIM-2478 Clear indication in the logs and JMX notification about the reason for the replication.

• OWLIM-2480 Clearly state in the logs if a full replication is started by the smart replication.

• OWLIM-2487 GraphDB Enterprise should be able to open SE repositories.

• OWLIM-2609 Create config parameter allowing replication if there is only one active worker.

• OWLIM-2643 Remove the reverse literal collection.

• OWLIM-2650 Timeouts for replication/restore should take the repository size into account.

Bugfixes

• OWLIM-2652 IndexOutOfBoundsException thrown on clearing worker log

• OWLIM-2396 Rollback connections on close if their state transition is unexpected to prevent potential deadlocks.

• OWLIM-2565 Masters are unable to send statuses to each other at the start of the test.

• OWLIM-2644 Master fails to recognize foreign transactions with a timestamp before the last TX log cleanup.


• OWLIM-2642 Ignore unknown transactions that fall before the start of the local log.

• OWLIM-2597 Page n has size of s1 != s2 which is written in the index.

• OWLIM-2524 Treat unknown errors on testing workers as transaction failures.

• OWLIM-2442 Cluster with empty transaction log and different nodes show wrong state in JMX Mbean.

• OWLIM-2590 ClientHTTPRepository.createAndInitialize throws exception “No masters available to process exportStatements() request!”.

• OWLIM-2540 owl:SameAs doesn’t work on a repository with an empty rule-set and a custom (.pie) rule-set set as default.

• OWLIM-2617 AbstractHashEntityMap3And4.getItemHash throws an assertion failure sometimes.

• OWLIM-2599 Write request to a worker that has initiated a backup results in retrying the same write until the backup is complete.

• OWLIM-2589 Freezing storage in read-only mode causes all queries that use the Literal index to fail.

• OWLIM-2660 Leaking internal entity pool connections leads to a memory leak and slow shutdown.

• OWLIM-2542 Predicate statistics not adjusted correctly on delete.

• OWLIM-2656 Rule-set compilation error.

• OWLIM-2381 A query that counts a graph returns a higher number of results than the actually returned results.

• OWLIM-2563 Memory leak after many small transactions and subsequent queries.

• OWLIM-2649 Workers are marked as ON, but the master is rejecting reads during restore.

• OWLIM-2626 Unknown exceptions cause a retry of the transaction on the same worker, which might lead to a dead end.

• OWLIM-2356 The performance of the cluster degrades under SPB stress after 22 hours.

• OWLIM-2614 Backup + restore with failover master throws an exception.

• OWLIM-2474 Deadlock on restore in a multi DC topology.

• OWLIM-2526 Restore fails if there are workers that are OFF at the moment.

• OWLIM-2610 Returns BAD_GATEWAY instead of SERVICE_UNAVAILABLE when the query cannot be executed on the master.

• OWLIM-2625 Properly throw an exception in the client failover API when a repository exception is thrown from the master.

9.1.10.3 GraphDB Workbench

Features and improvements

• WB-665 Introduced import data status call.

• WB-742 Support for GraphDB Free in the Workbench.

• WB-769 More user-friendly home page in the Workbench.

Bugfixes

• WB-149 Adding a location such as ftp://ftp.example.xx results in creating a non-absolute file location, while it should produce an error.

• WB-633 A nasty DESCRIBE query makes the browser fail.

• WB-668 Issue with handling special chars in URIs.


• WB-727 Wrong LIMIT when “LIMIT” is also in a comment.

• WB-737 A query with LIMIT in the subquery gets broken.

• WB-748 Workbench does not report the error when adding an invalid location.

• WB-749 Some file imports fail with an error.

• WB-754 Security not checked properly for queries sent over GET.

• WB-755 One line query with limit does not work in the Workbench.

• WB-761 SPARQLs with OFFSET and LIMIT 0 are not supported by the Workbench.

• WB-768 Statements count in the navigation bar is zero when the location is remote.

9.1.10.4 GraphDB Connectors

Features and improvements

• STORE-224 Remove guava dependency (make the connectors take less disk space).

Bugfixes

• STORE-225 Upgrade Elasticsearch to 1.7.3 and fix a bug in the build process that caused 1.6.0 to be used (even though we claimed it was 1.7.2).

• STORE-223 Creating a Lucene Connector index fails with: java.lang.IllegalArgumentException: empty or null components not allowed.

9.2 GraphDB 6.5.x Release Notes

9.2.1 GraphDB 6.5.0

Released: 19 October, 2015

9.2.1.1 Component versions:

Sesame | Engine | Workbench | Connectors
------ | ------ | --------- | ----------
2.7.8  | 6.2.5  | 6.7.0     | 4.2.3

Important: GraphDB Cluster:

• Improved cluster stability in single DC setup:

– The master will not start a replication if there is a single worker in ON state in the cluster, thus allowing the cluster to continue to serve requests;

– The master will choose an up-to-date worker from the transaction log when reporting self status;

– Workers in the cluster will reject updates that are not sent through the master;

– Workers that are in an UNINITIALIZED state will not be available for serving requests.

• Various bugfixes:

– A lower memory footprint of workers due to a decreased query execution history kept in memory;

– Corrected state of a worker in cases when low disk space causes a shutdown.

GraphDB Engine:


• BIND operator, which was not fully covered by GraphDB in previous versions and relied on Sesame's implementation;

• Query optimiser when dealing with multiple OPTIONALs;

• Special characters are now handled properly by the RDF parser/serialiser, e.g., white spaces in URLs;

• During the consistency checking phase, deleted statements in the current transaction will be ignored;

• Enforced consistency of predicate statistics - these will be rebuilt if found inconsistent.

GraphDB Workbench:

• Many new features and bugfixes.

GraphDB Connectors:

• New features and bugfixes.

9.2.1.2 GraphDB Engine

Features and improvements:

• OWLIM-1535 GraphDB EE users should be able to create/load SE repositories.

• OWLIM-2492 When starting a new worker with an empty repository, other workers shouldn't replicate from it.

• OWLIM-2227 Verify the predicate statistics on start and rebuild them if needed.

• OWLIM-2415 Increase replication speed in cluster using compression.

• OWLIM-2461 Initial worker state checking upon initialisation on empty TX log.

Bugfixes:

• OWLIM-2429 BIND operator not fully covered by GraphDB’s query optimiser.

• OWLIM-2425 The RDF parser and serializer do not handle special characters properly.

• OWLIM-2421 Consistency checking rules fail to ignore deleted statements in the current transaction.

• OWLIM-2418 SPARQL query with explicitly specified graph returns more results than the default graph.

• OWLIM-2417 Sometimes the cluster doesn't detect that a worker is out of sync but instead marks the others as out of sync.

• OWLIM-2427 Cluster workers should reject side updates.

• OWLIM-2423 The entity pool(s) don’t behave properly with request-scoped entities.

• OWLIM-2271 Fix replication client timeout.

• OWLIM-2315 Master should choose an up-to-date worker (in the transaction log) to report self status.

• OWLIM-2538 Predicate statistics rebuilds after cluster restart on LDBC tests.

• OWLIM-2548 Master started with an empty transaction log does not set the worker's state before the first write request.

• OWLIM-2537 Sometimes the timeout for the replication socket on the master is not enough.

• OWLIM-2086 Wrong results from DirectPlugin when query expands a filter to union.

• OWLIM-1218 Incorrect evaluation for queries with multiple OPTIONAL expressions; can lead to missing relevant results.

• OWLIM-1773 Worker left in a bad state after shutdown on “low disk space”.

• OWLIM-2437 Joining two subqueries that share a variable inside an OPTIONAL is not done properly.


• OWLIM-2403 Master doesn’t replicate a single worker but marks the others as OUT_OF_SYNC.

• OWLIM-2408 Replication in the cluster shouldn’t be initiated if there is only one active worker.

• OWLIM-2157 Backup/restore in cluster may fail if workers are created with a storage directory that isn't called “storage”.

• OWLIM-2469 UNINITIALIZED workers were marked as readable, which can cause queries to be executed on them.

• OWLIM-2468 Wrong max element of a page set in the index.

• OWLIM-2511 BINDTriplePattern's logic for restrictVars() makes BIND in UNION always be placed at the beginning.

• OWLIM-2339 Report back to an unknown peer master which sent sync data that it will be ignored.

• OWLIM-2510 Increasing worker's memory usage due to a growing execution history.

Known issues:

• OWLIM-2381 A count query over a graph returns a bigger number than the actual number of statements in the graph.

• OWLIM-2407 Backup and restore in the cluster shouldn't time out the HTTP request blindly.

9.2.1.3 GraphDB Workbench

Features and improvements:

• WB-252 Filtering for the server files list.

• WB-273 New security mode “free access” that allows access to some functionality for non-authenticated users.

• WB-319 Graph filtering in the export view based on expressions.

• WB-395 Add pagination to SPARQL results.

• WB-455 Improved SPARQL view: horizontal and vertical view, editor only, results only.

• WB-612 New context view under Data/Context.

• WB-613 Add load timing information for local imports.

• WB-640 Show the location in the navigation bar.

• WB-681 Better error handling in the SPARQL editor when the repository cannot be initialised.

• WB-699 Restored copy connector button (same as WB 6.5.x).

Bugfixes:

• WB-575 Unable to cancel remote content import.

• WB-642 Limit CONSTRUCT results internally rather than by SPARQL LIMIT.

• WB-646 Queries with many comments do not execute correctly on a remote location.

• WB-667 Workbench not loading properly on slower networks.

• WB-678 Misdetecting a GraphDB instance in the same tomcat when Workbench is deployed in the root context.

• WB-680 Workbench not using the optimised Ontotext N-Triples and N-Quads parsers.

• WB-698 Clicking outside the Connector create form is interpreted as cancel.


• WB-700 Adding fields in the Connector create form copies values from previous field.

• WB-706 Raw HTML shown in some location connection errors.

• WB-710 Incorrect JSON generated for some checkboxes in the Connector create form.

9.2.1.4 GraphDB Connectors

Features and improvements:

• STORE-198 Allow multiple property chains per field.

• STORE-218 Elasticsearch upgraded to 1.7.2.

Bugfixes:

• STORE-215 Wrong evaluation of next hop IN() in filter when next hop does not exist.

• STORE-216 Wrong evaluation of negated BOUND() in filter.

• STORE-217 Not possible to use BOUND() with next hop or PARENT() in filter.

9.3 GraphDB 6.4.x Release Notes

9.3.1 GraphDB 6.4.4

Released: 15 September, 2015

9.3.1.1 Component versions:

Sesame    Engine    Workbench    Connectors
2.7.8     6.2.4     6.6.3        4.1.2

Important:

• bugfixes in the Workbench.

GraphDB Engine

No changes.

GraphDB Workbench

Bug fixes:

• WB-682 Updates execute but report a fake generic error.

• WB-683 Spinner on update doesn’t go away in Safari.

GraphDB Connectors

No changes.


9.3.2 GraphDB 6.4.3

Released: 12 September, 2015

9.3.2.1 Component versions:

Sesame    Engine    Workbench    Connectors
2.7.8     6.2.4     6.6.2        4.1.2

Important:

• bugfixes in the engine;

• explain plan improvement (experimental);

• improved diagnostics and notifications in the cluster;

• bug fixes in the Workbench.

GraphDB Engine

Features and improvements:

• OWLIM-2344 Additional JMX notifications in the master.

• OWLIM-2398 Ignore a given number of faults in the worker to make the status reporting stable.

• OWLIM-2318 Show cluster status in the MBean.

• OWLIM-2387 New worker statuses in master - REPLICATION_SERVER, REPLICATION_CLIENT.

• OWLIM-2172 Alternative explain plan accessible through onto:experimental-explain that produces much better results than the old explain plan.

• OWLIM-2380 Explain plan support for CONSTRUCT, DESCRIBE and ASK queries.

Bugfixes:

• OWLIM-2389 Master not reporting failed transaction to a client.

• OWLIM-2383 Reordering PluginTriplePatterns introduces inconsistent query plans with duplicates.

• OWLIM-2385 Compiling a PIE file fails with error: incompatible types; required: java.lang.String, found: boolean.

• OWLIM-2395 Long FILTERs with more than 16 OR clauses are incorrectly optimised.

• OWLIM-2346 Backup operation makes the cluster unresponsive.

• OWLIM-2394 VALUES with a single BindingSet before BIND is converted to Filter (as a result of optimisation) but then disappears from the query after SubQuery.optimize().

• OWLIM-2378 rdfs:Resource resource appears from nowhere after a rule-set change.

• OWLIM-2374 JVM garbage collection shouldn’t be initiated when cleaning old connections.

• OWLIM-1532 Worker hasn’t recovered after clearing.

• OWLIM-1576 Build the name of the ClusterInfo bean from the master repository name and not the current thread name.

• OWLIM-2370 Unable to delete statements from GraphDB.

• OWLIM-2386 Exceptions on begin transaction are not handled properly.


• OWLIM-2284 Exception when importing large files.

• OWLIM-2367 The database engine shows incorrect repository size.

• OWLIM-2376 Compiling a PIE file fails with error: incompatible types; required: java.lang.String, found: boolean.

• OWLIM-2307 Negative number of Pending Writes showing in JMX properties of the master.

• OWLIM-2238 NQuadsSimpleParser does not handle baseUri properly.

• OWLIM-2399 Optimisation for IN -> VALUES fails to recognize NOT IN operator.

GraphDB Workbench

Bugfixes:

• WB-105 Some errors on query execution not reported to user.

• WB-635 Error reporting for unreachable locations.

• WB-636 Access denied when trying to execute query.

• WB-637 Unable to download results from System repository.

• WB-639 Workbench cannot load settings from previous releases.

• WB-671 No permission to execute SPARQL queries over SSL if security is enabled.

GraphDB Connectors

No changes.

9.3.3 GraphDB 6.4.2

Released: 27 August, 2015

9.3.3.1 Component versions:

Sesame    Engine    Workbench    Connectors
2.7.8     6.2.3     6.6.1        4.1.2

Important:

• bugfixes in the engine;

• explain plan improvement (experimental);

• storage tool improvement;

• improved data consistency in EE;

• improvements and fixes for some basic scenarios in cluster environment - rolling upgrade, backup, replication.


GraphDB Engine

Features and improvements:

• OWLIM-2172 Explain plan improvements.

• OWLIM-2178 Cluster data consistency per connection.

• OWLIM-1368 SPARQL 1.1 VALUES implementation.

• OWLIM-2281 Read-only mode for master.

• OWLIM-2283 Strict consistency mode for master.

• OWLIM-2282 Suspend-backup flag for master.

• OWLIM-2199 Storage tool scan command - groups statements by flags and compares counts for each group between indexes.

• OWLIM-2313 Storage tool export command.

• OWLIM-2198 Storage tool repair command - compares and updates all indexes.

• OWLIM-2354 New JMX command to flush and shutdown GDB repository.

• OWLIM-2358 Client Failover API - configurable timeout before giving up requests and run-time re-configuration support.

Bugfixes:

• OWLIM-2353 Fix: Reads stop after 2 workers are taken down in a 1m3w (one master, three workers) setup.

• OWLIM-2197 Fix: Invalid FILTER clause (?a = x && ?a = y) returns false positive results.

• OWLIM-2333 Fix: A worker reports different fingerprints when initialised for first time and after restore.

• OWLIM-2250 Fix: Investigate performance degradation and wrong results on queries in GraphDB 6.2.2 compared to 6.1.8410.

• OWLIM-2351 Fix: Rule-set with a specific consistency checking rule fails to deliver a descriptive exception.

• OWLIM-2369 Fix: Adding a non-existing worker + restart of the master leads to fewer workers shown in JMX.

• OWLIM-2341 Fix: Correct the console messages produced by the storageTool when rebuilding index files.

• OWLIM-2248 Fix: Properly patch the logback.xml configuration file in the enterprise distribution.

• OWLIM-2316 Fix: Large queries crash GraphDB with OutOfMemoryError.

• OWLIM-2337 Fix: Running the finaliser in the SAIL connection implementation might result in a deadlock.

• OWLIM-2310 Fix: GROUP BY queries are not stopped by timeouts since 6.1.SP4.

GraphDB Workbench

No changes.

GraphDB Connectors

No changes.


9.3.4 GraphDB 6.4.1

Released: 15 August, 2015

9.3.4.1 Component versions:

Sesame    Engine    Workbench    Connectors
2.7.8     6.2.2     6.6.1        4.1.2

Important:

• bugfixes for the Workbench.

If you use the Workbench, we strongly recommend that you upgrade.

GraphDB Engine

No changes.

GraphDB Workbench

Bug fixes:

• WB-611 Fix: Export not working when security is ON.

• WB-607 Fix: Workbench behind some proxies doesn't know which port to check for the repository cookie.

• WB-604 Fix: Unreachable location makes cluster manager unusable.

• WB-603 Fix: Cluster view doesn’t work when you have locations that were added with localhost only.

• WB-602 Fix: Non-initialized masters in the cluster view are disabled and the reason is misleading.

• WB-601 Fix: Adding invalid values in the edit dialog of the location closes the dialog and doesn't save the values.

• WB-600 Fix: First attached location is not properly validated.

GraphDB Connectors

No changes.

9.3.5 GraphDB 6.4.0

Released: 2 August, 2015

9.3.5.1 Component versions:

Sesame    Engine    Workbench    Connectors
2.7.8     6.2.2     6.6.0        4.1.2

Important: This release of GraphDB focuses on the usability and stability of the Workbench.

If you use the Workbench, we strongly recommend that you upgrade.


The release also introduces support for Blueprints RDF/Gremlin and SPARQL-MM. There are no changes in the Engine or the Connectors.

GraphDB Engine

No changes.

GraphDB Workbench

Features and improvements:

• Improved overall usability;

• Tens of bugs fixed;

• New: Single page application for faster loading;

• New: Improvements in Users management:

– users can have permissions to a repository in a particular location only;

– non-admin users can change their passwords.

• New: Status of imported server files kept per repository;

• New: Improvements to the SPARQL editor:

– tracking of syntax errors in the name of each tab;

– creating links to saved queries.

• Creating repositories through TTL files;

• Revised REST API:

– managing multiple locations and repositories;

– user and permissions management.

• Numerous visual and presentational changes.

GraphDB Connectors

No changes.

Other

• Support for Blueprints RDF;

• Support for SPARQL-MM.

9.4 GraphDB 6.3.x release notes

9.4.1 GraphDB 6.3.1

Released: 1 July, 2015


9.4.1.1 Component versions:

Sesame    Engine    Workbench    Connectors
2.7.8     6.2.2     6.5.2        4.1.2

Important: This is a release of GraphDB that addresses critical issues in the Connectors.

If you use the Connectors, we strongly recommend that you upgrade. There are no changes in the Engine or the Workbench.

9.4.1.2 GraphDB Engine

No changes.

9.4.1.3 GraphDB Workbench

Bugfixes:

• WB-198 Fix: Changing the password of the default admin user works only until tomcat is restarted.

• WB-490 Fix: Adding a location that was autocompleted in the browser results in an error.

9.4.1.4 GraphDB Connectors

Features and improvements:

• STORE-202 New: Failure to access a single Lucene instance on repo init should not disable the connector.

Bugfixes:

• STORE-206 Fix: clear() or clear(context) does not update the index.

• STORE-207 Fix: Wrong query execution when multiple joins are used in SPARQL.

• STORE-210 Fix: Copy fields that contain URIs are not tracked as URIs and not returned as URIs when used in facet search.

9.4.2 GraphDB 6.3.0

Released: 11 June, 2015

9.4.2.1 Component versions:

Sesame    Engine    Workbench    Connectors
2.7.8     6.2.2     6.5.2        4.1.0

Important: This is a release of GraphDB that delivers bug fixes in the Connectors and the Workbench.

If you use the Connectors or the Workbench, we strongly recommend that you upgrade. There are no changes in the Engine.


9.4.2.2 GraphDB Engine

No changes.

9.4.2.3 GraphDB Workbench

Bugfixes:

• WB-198 Fix: Changing the password of the default admin user works only until tomcat is restarted.

• WB-490 Fix: Adding a location that was autocompleted in the browser results in an error.

9.4.2.4 GraphDB Connectors

Features and improvements:

• STORE-165 New: Improved splitting for facet field values.

• STORE-200 New: Upgrade Lucene to 4.10.4.

Bugfixes:

• STORE-191 Fix: Lucene connector does not index xsd:dateTime properly.

• STORE-192 Fix: Invalid values in typed literals are fatal but should rather be ignored (Lucene).

• STORE-193 Fix: Field sync option “facet” not interpreted correctly (Lucene).

• STORE-194 Fix: The field option “indexed” is not interpreted correctly.

• STORE-197 Fix: Entity filter graph() not working correctly for statements in named graphs.


CHAPTER TEN

ROADMAP

10.1 Roadmap for Q4‘15 and Q1‘16

Release: 20-Dec, next 6.x minor

• MultiDC improvements - Backup/Restore and Majority voting in synchronisation;
• Better visualisations in Workbench;
• Autocompletion of URIs in Workbench;
• Open source plugins.

Release: 1-Feb, next 6.x minor

• Cluster improvements to read/write scalability;
• Local Consistency;
• Improved provenance plugin;
• Better visualisations in the Workbench - Data exploration.

Release: 1-Mar, GraphDB 7.0

• Sesame 2.8 support;
• Improved logging;
• FactForge-like FTS in the Workbench;
• PGM Blueprints;
• Easier setup of license files;
• LoadRDF/StorageTool integration in the Workbench.


CHAPTER ELEVEN

FAQ

Where does the name “OWLIM” (the former GraphDB name) come from? The name originally came from the term “OWL In Memory” and was fitting for what later became OWLIM-Lite. However, OWLIM-SE used a transactional, index-based file-storage layer where “In Memory” was no longer appropriate. Nevertheless, the name stuck and it was rarely asked where it came from.

What kind of SPARQL compliance is supported? All GraphDB editions support:

• SPARQL 1.1 Protocol for RDF

• SPARQL 1.1 Query

• SPARQL 1.1 Update

• SPARQL 1.1 Federation

• SPARQL 1.1 Graph Store HTTP Protocol

See also SPARQL Compliance.

How is GraphDB related to Sesame?

GraphDB is a semantic repository, packaged as a Storage and Inference Layer (SAIL) for the rdf4j Sesame framework, and it makes extensive use of the features and infrastructure of Sesame, especially the RDF model, RDF parsers, and query engines. For more details, see the Sesame section.

Is GraphDB Jena-compatible? Yes, GraphDB is compatible with Jena 2.7.3 via a built-in adapter. For more information, see Using GraphDB with Jena.

What are the advantages of using solid-state drives as opposed to hard-disk drives? We recommend using enterprise-grade SSDs whenever possible, as they provide significantly faster database performance compared to hard-disk drives.

Unlike relational databases, a semantic database needs to compute the inferred closure for inserted and deleted statements. This involves making highly unpredictable joins using statements anywhere in its indices. Despite utilising paging structures as best as possible, a large number of disk seeks can be expected, and SSDs perform far better than HDDs in such a task.

How to find out the exact version number of GraphDB? The major/minor version and build number make up part of the GraphDB distribution .zip file name. The embedded owlim .jar file has the major and minor version numbers appended.

In addition, at start up, GraphDB will log the full version number in an INFO logger message, e.g., OwlimSchemaRepository: version: 5.6, revision: 6864.

The following DESCRIBE query:

DESCRIBE <http://www.ontotext.com/SYSINFO> FROM <http://www.ontotext.com/SYSINFO>

returns pseudo-triples providing information on various GraphDB states, including the number of triples (total and explicit), storage space (used and free), commits (total and whether one is in progress), the repository signature, and the build number of the software.
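
The same pseudo-triples can also be read programmatically through the Sesame API. The snippet below is a minimal sketch, not part of the distribution: the server URL http://localhost:8080/openrdf-sesame and the repository ID myrepo are assumptions to replace with your own values.

import org.openrdf.model.Statement;
import org.openrdf.query.GraphQueryResult;
import org.openrdf.query.QueryLanguage;
import org.openrdf.repository.Repository;
import org.openrdf.repository.RepositoryConnection;
import org.openrdf.repository.manager.RemoteRepositoryManager;

public class SysInfoExample {
    public static void main(String[] args) throws Exception {
        // Assumed server URL and repository ID - replace with your own
        RemoteRepositoryManager manager =
                new RemoteRepositoryManager("http://localhost:8080/openrdf-sesame");
        manager.initialize();
        Repository repo = manager.getRepository("myrepo");
        RepositoryConnection conn = repo.getConnection();
        try {
            GraphQueryResult result = conn.prepareGraphQuery(QueryLanguage.SPARQL,
                    "DESCRIBE <http://www.ontotext.com/SYSINFO> "
                            + "FROM <http://www.ontotext.com/SYSINFO>").evaluate();
            // Each pseudo-triple carries one piece of state, e.g. the triple count
            while (result.hasNext()) {
                Statement st = result.next();
                System.out.println(st.getPredicate() + " = " + st.getObject());
            }
            result.close();
        } finally {
            conn.close();
            manager.shutDown();
        }
    }
}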


How to retrieve repository configurations from the Sesame SYSTEM repository? When using a LocalRepositoryManager, Sesame stores the configuration data for repositories in its own SYSTEM repository. A Tomcat instance does the same, and a SYSTEM repository appears in the list of repositories that the instance manages.

To see what configuration data is stored in a GraphDB repository, connect to the SYSTEM repository and execute the following query:

PREFIX sys: <http://www.openrdf.org/config/repository#>
PREFIX sail: <http://www.openrdf.org/config/repository/sail#>

select ?id ?type ?param ?value
where {
    ?rep sys:repositoryID ?id .
    ?rep sys:repositoryImpl ?impl .
    ?impl sys:repositoryType ?type .
    optional {
        ?impl sail:sailImpl ?sail .
        ?sail ?param ?value .
    }
    # FILTER( ?id = "specific_repository_id" ) .
}
ORDER BY ?id ?param

This will return the repository ID and type, followed by name-value pairs of configuration data for SAIL repositories, including the SAIL type - owlim:Sail for GraphDB.

If you uncomment the FILTER clause, you can substitute a repository ID to get the configuration just for thatrepository.
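
The same query can also be executed from code. The snippet below is a minimal sketch, assuming a Sesame server at http://localhost:8080/openrdf-sesame (the URL is an assumption to adjust); SYSTEM is the standard ID of the system repository.

import org.openrdf.query.BindingSet;
import org.openrdf.query.QueryLanguage;
import org.openrdf.query.TupleQueryResult;
import org.openrdf.repository.Repository;
import org.openrdf.repository.RepositoryConnection;
import org.openrdf.repository.manager.RemoteRepositoryManager;

public class ListRepositoryConfigs {
    public static void main(String[] args) throws Exception {
        // Assumed server URL - replace with your own
        RemoteRepositoryManager manager =
                new RemoteRepositoryManager("http://localhost:8080/openrdf-sesame");
        manager.initialize();
        Repository system = manager.getRepository("SYSTEM");
        RepositoryConnection conn = system.getConnection();
        try {
            String query =
                    "PREFIX sys: <http://www.openrdf.org/config/repository#>\n"
                  + "SELECT ?id ?type WHERE {\n"
                  + "  ?rep sys:repositoryID ?id .\n"
                  + "  ?rep sys:repositoryImpl ?impl .\n"
                  + "  ?impl sys:repositoryType ?type .\n"
                  + "} ORDER BY ?id";
            TupleQueryResult result =
                    conn.prepareTupleQuery(QueryLanguage.SPARQL, query).evaluate();
            // Print each repository ID together with its repository type
            while (result.hasNext()) {
                BindingSet bs = result.next();
                System.out.println(bs.getValue("id") + "\t" + bs.getValue("type"));
            }
            result.close();
        } finally {
            conn.close();
            manager.shutDown();
        }
    }
}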

Why can’t I use my custom rule file (.pie) - an exception occurred? To use custom rule files, GraphDB must be running in a JVM that has access to the Java compiler. The easiest way to do this is to use the Java runtime from a Java Development Kit (JDK).

The tools.jar file from the Java Development Kit (JDK) must be on the classpath or, alternatively, copied to the Sesame server's WEB-INF/lib directory.

Why can’t I delete a repository? Sesame keeps all repositories in the SYSTEM repository, and sometimes a repository cannot be initialised, which also prevents you from deleting it. You can execute the following query to remove the repository from SYSTEM:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX sys: <http://www.openrdf.org/config/repository#>

delete {
    ?g rdf:type sys:RepositoryContext .
} where {
    graph ?g {
        ?s sys:repositoryID "repositoryID" .
    }
    ?g rdf:type sys:RepositoryContext .
}

Change the repositoryID literal as needed. This removes the statement that marks the context as a repository context. The repository's configuration is kept intact, as is the data in the storage.
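
The removal can likewise be issued programmatically as a SPARQL 1.1 update against the SYSTEM repository. A minimal sketch under the same assumptions as the example above (an already open connection to SYSTEM); the helper class and method are hypothetical:

import org.openrdf.query.QueryLanguage;
import org.openrdf.repository.RepositoryConnection;

public final class RemoveRepositoryContext {
    // 'conn' must be an open connection to the SYSTEM repository,
    // obtained as in the earlier example
    public static void remove(RepositoryConnection conn, String repositoryId)
            throws Exception {
        String update =
                "PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>\n"
              + "PREFIX sys: <http://www.openrdf.org/config/repository#>\n"
              + "DELETE { ?g rdf:type sys:RepositoryContext . }\n"
              + "WHERE {\n"
              + "  GRAPH ?g { ?s sys:repositoryID ?id . }\n"
              + "  ?g rdf:type sys:RepositoryContext .\n"
              + "  FILTER( ?id = \"" + repositoryId + "\" )\n"
              + "}";
        // Compile and run the update; afterwards the context is no longer
        // listed as a repository, while its configuration data remains
        conn.prepareUpdate(QueryLanguage.SPARQL, update).execute();
    }
}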

Where can I find the Experimental Explain Plan in the documentation of GraphDB 6.6? GraphDB Experimental Explain Plan was introduced in version 6.4.3 to improve the execution of complex queries. It was used alongside the regular Explain Plan until version 6.6.1, when it became GraphDB's regular Explain Plan. See the table below for more information.


Version            Explain Plan   Experimental Explain Plan
6.0 - 6.4.2        yes            no
6.4.3 - 6.6.0      yes            yes
6.6.1 and higher   no             yes (renamed to Explain Plan)


CHAPTER TWELVE

SUPPORT

• email: [email protected]

• Twitter: @OntotextGraphDB

• GraphDB tag on Stack Overflow at http://stackoverflow.com/questions/tagged/graphdb
