We focus on the backend semantic web database architecture...


TRANSCRIPT

Page 3:

We focus on the backend semantic web database architecture and offer support and other services around that. Reasons for working with SYSTAP include:

-  You have a prototype and you want to get it to market.
-  You want to embed a fast, high performance semantic web database into your application.
-  You want to integrate, query, and analyze large amounts of information across the enterprise.

Page 4:

As a historical note, provenance was left out of the original semantic web architecture because it requires statements about statements, which implies second order predicate calculus. At the Amsterdam 2000 W3C Conference, TBL stated that he was deliberately staying away from second order predicate calculus. Provenance mechanisms are slowly emerging for the semantic web due to their absolute necessity in many domains. The SGML/XML Topic Maps standardization effort was occurring at more or less the same time; its focus provided for provenance and semantic alignment, but with very little support for reasoning.

Page 5:

RDF is standardized by the W3C, the same people who brought you the web. There is an extensive standards stack for RDF, but most applications can focus on the data interchange standards (RDF/XML, N3, N-Triples, etc.) and SPARQL, the standard for querying and updating data in RDF databases. There are great tool chains out there for nearly any programming language. Historically, one weak point in RDF was the lack of a clear solution for link attributes. Link attributes are critical to graph processing applications. They can be interchanged using RDF reification (it is messy, but it works) and then represented and queried efficiently by RDF databases. It is important that the database uses an efficient internal representation for link attributes, or it will have significantly higher query costs (handling RDF reification naively implies a fully normalized form that is expensive to query). The SIDs (statement identifiers) mode of bigdata provides a non-standard option for handling link attributes efficiently. See the sections on "Statement identifiers" and "Reification done right" for more information.
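
As a hedged illustration of the reification-based interchange mentioned above (the ex: vocabulary and statement URI are made up, not from the slides), a link attribute such as an edge weight is exchanged by asserting four extra triples about the edge:

    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX ex:  <http://example.org/>
    INSERT DATA {
      ex:alice ex:knows ex:bob .             # the edge itself
      ex:stmt1 a rdf:Statement ;             # the reified statement
               rdf:subject   ex:alice ;
               rdf:predicate ex:knows ;
               rdf:object    ex:bob ;
               ex:weight     0.9 .           # the link attribute
    }

Reading the weight back means joining the rdf:subject, rdf:predicate, and rdf:object triples for every edge - the fully normalized form warned about above - which is the overhead an inlined statement identifier (the SIDs mode) avoids.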

Page 6:

Images copyright W3C. See http://www.w3.org/2000/Talks/1206-xml2k-tbl/slide10-0.html (TBL, 2000) and http://www.w3.org/2006/Talks/1023-sb-W3CTechSemWeb/ (S. Bratt, 2006). Provenance is a huge concern for many communities. However, support for provenance, other than digitally signed proofs, was explicitly off the table when the semantic web was introduced in 1999. TBL's reason for not tackling provenance was that it requires second order predicate calculus, while the semantic web is based on first order predicate calculus. However, it is possible to tackle provenance without using a highly expressive logic. The ISO and XML Topic Maps communities have led in this area and, recently, there has been increasing discussion about provenance within the W3C.

Page 7:

The […] diagram [above] visualizes the data sets in the LOD cloud as well as their interlinkage relationships. Each node in this cloud diagram represents a distinct data set published as Linked Data. The arcs indicate that RDF links exist between items in the two connected data sets. Heavier arcs roughly correspond to a greater number of links between two data sets, while bidirectional arcs indicate that each data set contains outward links to the other. Image by Anja Jentzsch; the image file is licensed under the Creative Commons Attribution-Share Alike 3.0 Unported license.

Page 8:

Ten years ago everyone knew that the database market was frozen. Today, nothing could be further from the truth: this is a time of profound upheaval in the database world. There are a number of trends driving this. One is the continued pressure on commodity hardware prices and performance. Another is the growth in open source software, driven in large part by a series of publications by Google on GFS, MapReduce, Bigtable, and more recently Pregel. Today, all "big data" systems leverage open source software, and many federal contracts require it. At the same time, processor clock speeds have hit a wall, so applications are forced to parallelize and use distributed architectures. SSDs have also changed the playing field, virtually eliminating disk latency in critical systems. The central focus for the bigdata® platform is a parallel semantic web database architecture. Main memory graph processing platforms tend to lack core features of a database architecture, such as durability and isolation.

Page 9:

This is our take on where this is all heading. We tend to focus on high data scale and low expressivity with rapid data exploitation cycles. We are moving into both linked data applications and graph data mining. These things appear to be completely different directions, but we see them as converging in the next few years.

Page 10:

-  Document-centric approaches (such as MongoDB) are most closely related to XML databases, but lack sophisticated query languages and standards. They store blobs and may index some paths into the blob.

-  Object-centric approaches (such as neo4j) are basically object databases. They are optimized to store and retrieve single objects and emphasize traversal (pointer chasing against disk) in their APIs.

-  "Edge-centric" approaches (such as bigdata) index edges, provide standards-based query and update languages, and deliver high performance graph query based on subgraph matching.

There are a number of design tradeoffs for graph databases:

-  Index stride
-  Graph traversal costs
-  Object materialization costs
-  Standards (or their absence)
-  Benchmarks (or their absence)
-  High level query languages (or their absence)

All of this has implications for performance and scalability.

Page 11:

There is a lot hidden in this diagram. We see the emerging architecture for the semantic web as a powerful, multi-faceted tool. The database layer is broken open into a linked data cache (providing materialized joins as well as servicing real-time analytics and web-facing graph/semantic clients) and a graph indexing layer. The linked data cache plays in the same world as the key-value stores – fast access to a property value set by a primary key (see the one-line sketch after this list). This opens semantic web applications up to the world of highly scalable web applications. The database tier provides fast indexing and high level query not available in key-value stores. It also provides transparent opportunities for discovery, semantic alignment (aided by real-time analytics), and aggregation of relevant data sets. There are several key enabling technologies and standards in play here:

-  Link attributes for datum-level metadata (source, confidence, security, etc.).
-  High level query (SPARQL).
-  Decomposing the database into linked data and novel graph indexing schemes.
-  Low latency transparent federated query (m-way symmetric hash joins).
-  Main memory graph data mining techniques (similar to Pregel, but semantic web enabled).
-  GPU acceleration for real time analytics (the equivalent of a 40-80 node cluster in a single workstation).
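
The key-value style access mentioned above amounts to looking up a resource's property set by its URI. A minimal SPARQL sketch, using a placeholder URI that is not from the slides:

    # The URI acts as the primary key; the result is that resource's
    # full property set, which is what the linked data cache serves.
    DESCRIBE <http://example.org/person/alice>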

Page 13:

Links:
-  http://www.bigdata.com/bigdata/blog/ (blog)
-  http://sourceforge.net/projects/bigdata/ (project)
-  http://sourceforge.net/projects/bigdata/files/latest/download (WAR)
-  http://sourceforge.net/apps/mediawiki/bigdata/index.php?title=Main_Page (wiki)

Page 14:

This is a slide from the KaBoB project. They are connecting the detail records in the different bioinformatics databases and then cross-linking them for higher level reasoning and data mining. They have 8B edges in a bigdata instance running on a Sun server with an SSD array.

Page 15:

BSBM is a standard benchmark for RDF databases. Unfortunately, it is difficult to compare non-standard graph databases. SPARQL is a good query language for graph data, but it does not support vertex programs, which need their own language and benchmarks.

Page 20:

Benefits over Sesame:
-  Trivial to deploy.
-  Optimized for bigdata MVCC semantics.
-  Built-in resource management: controls the number of threads.

Page 21:

SPARQL - http://www.w3.org/TR/sparql11-query/
Blueprints - https://github.com/tinkerpop/blueprints/wiki/

Page 22:

Go and get Sesame; it will fall over because of this. Several commercial and open source Java grid cache products exist, including Oracle's Coherence, Infinispan, and Hazelcast. However, all of these products share a common problem – they are unable to exploit large heaps due to an interaction between the Java Garbage Collector (GC) and the object creation and object retention rates of the cache. There is a non-linear interaction between the object creation rate, the object retention rate, and the GC running time and cycle time. For many applications, garbage collection is very efficient and Java can run as fast as C++. However, as the Java heap begins to fill, the garbage collector must run more and more frequently, and the application is locked out while it runs. This leads to long application-level latencies that bring the cache to its knees and throttle throughput. Sun and Oracle have developed a variety of garbage collector strategies, but none of them are able to manage large heaps efficiently. A new technology is necessary in order to successfully deploy object caches that can take advantage of servers with large main memories.

Page 25:

Slated for bigdata 1.2.2. See https://sourceforge.net/apps/mediawiki/bigdata/index.php?title=SPARQL_Update
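
The wiki page above covers bigdata's SPARQL UPDATE support. As a generic illustration of the standard (the ex: names are placeholders, not from the slides), an update that rewrites a property value looks like:

    PREFIX ex: <http://example.org/>
    # Remove any existing status for ex:order42, then assert the new one.
    DELETE { ex:order42 ex:status ?old }
    INSERT { ex:order42 ex:status "shipped" }
    WHERE  { OPTIONAL { ex:order42 ex:status ?old } }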

Page 26:

See https://sourceforge.net/apps/mediawiki/bigdata/index.php?title=FederatedQuery
See https://sourceforge.net/apps/mediawiki/bigdata/index.php?title=QueryHints

Page 27:

This is a great feature that we've built on top of the SPARQL Basic Federated Query mechanisms. Using a "custom service," you can add nearly arbitrary application-specific capabilities to bigdata. Your application uses "SERVICE uri {graph-pattern}" clauses in its queries, but the SERVICE evaluation recognizes registered URIs and targets your registered extension rather than a remote endpoint. Some examples include:

-  GeoSPARQL indices (Open Sahara).
-  Indexing history (rapidly compute the delta between any two commit points).
-  Custom indices for rapidly changing data (accelerate application specific queries).
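
For illustration only – the service URI and ex: properties below are placeholders, not actual registered bigdata services – a query mixing ordinary graph patterns with a SERVICE clause looks like this; with a custom service registered for that URI, bigdata evaluates the SERVICE group against the extension instead of a remote endpoint:

    PREFIX ex:   <http://example.org/>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?place ?label WHERE {
      ?place a ex:PointOfInterest ;
             rdfs:label ?label .
      # Routed to the registered extension, not over HTTP.
      SERVICE <http://example.org/geoService> {
        ?place ex:withinRadiusOf "38.88,-77.03,5km" .
      }
    }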

Page 32:

This effort is an outcome from the Dagstuhl 2012 Semantic Data Management workshop. The RDF data model and SPARQL algebra harmonization is thanks to Olaf Hartig (Humboldt University). We are working toward a W3C member submission to standardize this proposal. See https://sourceforge.net/apps/trac/bigdata/ticket/526 (Reification Done Right).

Page 33:

The SIDs mode in bigdata provides efficient support for link attributes today. It will be replaced by the newer "reification done right" model. The big advantage of reification done right is that it tells you how to interchange data as RDF but frees you to process that data efficiently. The initial implementations have focused on inlining statements into statements. However, other efficient implementations are possible. For example, you can project a sparse matrix containing only the link weights and then operate efficiently on that sparse matrix using BLAS. Do not confuse the syntax with efficient representations for indices or graph data mining algorithms.
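
As a sketch of the proposed syntax (the notation later known as RDF*; the data and ex: names are illustrative, not from the slides), the embedded triple stands in for the statement identifier, so the link attribute attaches directly to the edge:

    PREFIX ex: <http://example.org/>
    INSERT DATA {
      ex:alice ex:knows ex:bob .
      # The embedded triple <<...>> plays the role of the statement identifier.
      <<ex:alice ex:knows ex:bob>> ex:weight 0.9 .
    }

A query can then bind an edge and its weight in a single pattern, e.g. SELECT ?s ?o ?w WHERE { <<?s ex:knows ?o>> ex:weight ?w }, without the four-triple reification join.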

Page 39:

The API should make it easy to write the query, but the query optimizer should be responsible for finding a good query plan. The best plan depends on the available indices, cardinality counts, database operators, etc. You can filter for just those unconnected friends that have more than one shared friend by changing the HAVING clause in the query. Only interesting attributes should be projected into the graph data mining platform. See bigdata-gom/samples/com/bigdata/gom/Example2.java for running code.
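
As a hedged sketch of the kind of query described (the ex:knows predicate is hypothetical; the real working code is in Example2.java above), this finds pairs of people who are not directly connected but share more than one friend, with the threshold set in the HAVING clause:

    PREFIX ex: <http://example.org/>
    SELECT ?person ?candidate (COUNT(DISTINCT ?friend) AS ?sharedFriends)
    WHERE {
      ?person ex:knows ?friend .
      ?friend ex:knows ?candidate .
      FILTER (?person != ?candidate)
      # Keep only pairs that are not already connected.
      FILTER NOT EXISTS { ?person ex:knows ?candidate }
    }
    GROUP BY ?person ?candidate
    HAVING (COUNT(DISTINCT ?friend) > 1)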

Page 46:

Links:
-  http://www.bigdata.com/bigdata/blog/ (blog)
-  http://sourceforge.net/projects/bigdata/ (project)
-  http://sourceforge.net/projects/bigdata/files/latest/download (WAR)
-  http://sourceforge.net/apps/mediawiki/bigdata/index.php?title=Main_Page (wiki)