We focus on the backend semantic web database architecture...


TRANSCRIPT

Page 3:

We focus on the backend semantic web database architecture and offer support and other services around that. Reasons for working with SYSTAP include:

-  You have a prototype and you want to get it to market.
-  You want to embed a fast, high performance semantic web database into your application.
-  You want to integrate, query, and analyze large amounts of information across the enterprise.

Page 4:

As a historical note, provenance was left out of the original semantic web architecture because it requires statements about statements, which implies second order predicate calculus. At the Amsterdam 2000 W3C Conference, TBL stated that he was deliberately staying away from second order predicate calculus. Provenance mechanisms are slowly emerging for the semantic web due to their absolute necessity in many domains. The SGML/XML Topic Maps standardization effort was occurring at more or less the same time; its focus provided for provenance and semantic alignment, but with very little support for reasoning.

Page 5:

RDF is standardized by the W3C, the same people who brought you the web. There is an extensive standards stack for RDF, but most applications can focus on the data interchange standards (RDF/XML, N3, N-Triples, etc.) and SPARQL, the standard for querying and updating data in RDF databases. There are great tool chains out there for nearly any programming language. Historically, one weak point in RDF was the lack of a clear solution for link attributes. Link attributes are critical to graph processing applications. They can be interchanged using RDF reification (it is messy, but it works) and then represented and queried efficiently by RDF databases. It is important that the database uses an efficient internal representation for link attributes, or it will have significantly higher query costs (handling RDF reification naively implies a fully normalized form that is expensive to query). The SIDs (statement identifiers) mode of bigdata provides a non-standard option for handling link attributes efficiently. See the sections on "Statement identifiers" and "Reification done right" for more information.
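
As a hedged illustration of the reification-based interchange mentioned above (the ex: vocabulary and statement URI are made up, not from the slides), a link attribute such as an edge weight is exchanged by asserting four extra triples about the edge:

    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX ex:  <http://example.org/>
    INSERT DATA {
      ex:alice ex:knows ex:bob .             # the edge itself
      ex:stmt1 a rdf:Statement ;             # the reified statement
               rdf:subject   ex:alice ;
               rdf:predicate ex:knows ;
               rdf:object    ex:bob ;
               ex:weight     0.9 .           # the link attribute
    }

Reading the weight back means joining the rdf:subject, rdf:predicate, and rdf:object triples for every edge - the fully normalized form warned about above - which is the overhead an inlined statement identifier (the SIDs mode) avoids.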

Page 6:

Images copyright W3C. See http://www.w3.org/2000/Talks/1206-xml2k-tbl/slide10-0.html (TBL, 2000) and http://www.w3.org/2006/Talks/1023-sb-W3CTechSemWeb/ (S. Bratt, 2006). Provenance is a huge concern for many communities. However, support for provenance, other than digitally signed proofs, was explicitly off the table when the semantic web was introduced in 1999. TBL's reason for not tackling provenance was that it requires second order predicate calculus, while the semantic web is based on first order predicate calculus. However, it is possible to tackle provenance without using a highly expressive logic. The ISO and XML Topic Maps communities have led in this area and, recently, there has been increasing discussion about provenance within the W3C.

Page 7:

The […] diagram [above] visualizes the data sets in the LOD cloud as well as their interlinkage relationships. Each node in this cloud diagram represents a distinct data set published as Linked Data. The arcs indicate that RDF links exist between items in the two connected data sets. Heavier arcs roughly correspond to a greater number of links between two data sets, while bidirectional arcs indicate that each data set contains outward links to the other. Image by Anja Jentzsch; the image file is licensed under the Creative Commons Attribution-Share Alike 3.0 Unported license.

Page 8:

Ten years ago everyone knew that the database market was frozen. Today, nothing could be further from the truth: this is a time of profound upheaval in the database world. There are a number of trends driving this. One is the continued pressure on commodity hardware prices and performance. Another is the growth in open source software, driven in large part by a series of publications by Google on GFS, MapReduce, Bigtable, and more recently Pregel. Today, all "big data" systems leverage open source software, and many federal contracts require it. At the same time, processor clock speeds have hit a wall, so applications are forced to parallelize and use distributed architectures. SSDs have also changed the playing field, virtually eliminating disk latency in critical systems. The central focus for the bigdata® platform is a parallel semantic web database architecture. Main memory graph processing platforms tend to lack core features of a database architecture, such as durability and isolation.

Page 9:

This is our take on where this is all heading. We tend to focus on high data scale and low expressivity with rapid data exploitation cycles. We are moving into both linked data applications and graph data mining. These things appear to be completely different directions, but we see them as converging in the next few years.

Page 10:

-  Document-centric approaches (such as MongoDB) are most closely related to XML databases, but lack sophisticated query languages and standards. They store blobs and may index some paths into the blob.

-  Object-centric approaches (such as neo4j) are basically object databases. They are optimized to store and retrieve single objects and emphasize traversal (pointer chasing against disk) in their APIs.

-  "Edge-centric" approaches (such as bigdata) index edges, provide standards-based query and update languages, and deliver high performance graph query based on subgraph matching.

There are a number of design tradeoffs for graph databases:

-  Index stride
-  Graph traversal costs
-  Object materialization costs
-  Standards (or their absence)
-  Benchmarks (or their absence)
-  High level query languages (or their absence)

All of this has implications for performance and scalability.

Page 11:

There is a lot hidden in this diagram. We see the emerging architecture for the semantic web as a powerful, multi-faceted tool. The database layer is broken open into a linked data cache (providing materialized joins as well as servicing real-time analytics and web-facing graph/semantic clients) and a graph indexing layer. The linked data cache plays in the same world as the key-value stores – fast access to a property value set by a primary key (see the one-line sketch after this list). This opens semantic web applications up to the world of highly scalable web applications. The database tier provides fast indexing and high level query not available in key-value stores. It also provides transparent opportunities for discovery, semantic alignment (aided by real-time analytics), and aggregation of relevant data sets. There are several key enabling technologies and standards in play here:

-  Link attributes for datum-level metadata (source, confidence, security, etc.).
-  High level query (SPARQL).
-  Decomposing the database into linked data and novel graph indexing schemes.
-  Low latency transparent federated query (m-way symmetric hash joins).
-  Main memory graph data mining techniques (similar to Pregel, but semantic web enabled).
-  GPU acceleration for real time analytics (the equivalent of a 40-80 node cluster in a single workstation).
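
The key-value style access mentioned above amounts to looking up a resource's property set by its URI. A minimal SPARQL sketch, using a placeholder URI that is not from the slides:

    # The URI acts as the primary key; the result is that resource's
    # full property set, which is what the linked data cache serves.
    DESCRIBE <http://example.org/person/alice>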

Page 13:

Links:
-  http://www.bigdata.com/bigdata/blog/ (blog)
-  http://sourceforge.net/projects/bigdata/ (project)
-  http://sourceforge.net/projects/bigdata/files/latest/download (WAR)
-  http://sourceforge.net/apps/mediawiki/bigdata/index.php?title=Main_Page (wiki)

Page 14:

This is a slide from the KaBoB project. They are connecting the detail records in the different bioinformatics databases and then cross-linking them for higher level reasoning and data mining. They have 8B edges in a bigdata instance running on a Sun server with an SSD array.

Page 15:

BSBM is a standard benchmark for RDF databases. Unfortunately, it is difficult to compare non-standard graph databases. SPARQL is a good query language for graph data, but it does not support vertex programs, which need their own language and benchmarks.

Page 20:

Benefits over Sesame:
-  Trivial to deploy.
-  Optimized for bigdata MVCC semantics.
-  Built-in resource management: controls the number of threads.

Page 21:

SPARQL - http://www.w3.org/TR/sparql11-query/
Blueprints - https://github.com/tinkerpop/blueprints/wiki/

Page 22:

Go and get Sesame; it will fall over because of this. Several commercial and open source Java grid cache products exist, including Oracle's Coherence, Infinispan, and Hazelcast. However, all of these products share a common problem – they are unable to exploit large heaps due to an interaction between the Java Garbage Collector (GC) and the object creation and object retention rates of the cache. There is a non-linear interaction between the object creation rate, the object retention rate, and the GC running time and cycle time. For many applications, garbage collection is very efficient and Java can run as fast as C++. However, as the Java heap begins to fill, the garbage collector must run more and more frequently, and the application is locked out while it runs. This leads to long application-level latencies that bring the cache to its knees and throttle throughput. Sun and Oracle have developed a variety of garbage collector strategies, but none of them are able to manage large heaps efficiently. A new technology is necessary in order to successfully deploy object caches that can take advantage of servers with large main memories.

Page 25:

Slated for bigdata 1.2.2. See https://sourceforge.net/apps/mediawiki/bigdata/index.php?title=SPARQL_Update
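
The wiki page above covers bigdata's SPARQL UPDATE support. As a generic illustration of the standard (the ex: names are placeholders, not from the slides), an update that rewrites a property value looks like:

    PREFIX ex: <http://example.org/>
    # Remove any existing status for ex:order42, then assert the new one.
    DELETE { ex:order42 ex:status ?old }
    INSERT { ex:order42 ex:status "shipped" }
    WHERE  { OPTIONAL { ex:order42 ex:status ?old } }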

Page 26:

See https://sourceforge.net/apps/mediawiki/bigdata/index.php?title=FederatedQuery
See https://sourceforge.net/apps/mediawiki/bigdata/index.php?title=QueryHints

Page 27:

This is a great feature that we've built on top of the SPARQL Basic Federated Query mechanisms. Using a "custom service," you can add nearly arbitrary application-specific capabilities to bigdata. Your application uses "SERVICE uri {graph-pattern}" clauses in its queries, but the SERVICE evaluation recognizes registered URIs and targets your registered extension rather than a remote endpoint. Some examples include:

-  GeoSPARQL indices (Open Sahara).
-  Indexing history (rapidly compute the delta between any two commit points).
-  Custom indices for rapidly changing data (accelerate application specific queries).
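
For illustration only – the service URI and ex: properties below are placeholders, not actual registered bigdata services – a query mixing ordinary graph patterns with a SERVICE clause looks like this; with a custom service registered for that URI, bigdata evaluates the SERVICE group against the extension instead of a remote endpoint:

    PREFIX ex:   <http://example.org/>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?place ?label WHERE {
      ?place a ex:PointOfInterest ;
             rdfs:label ?label .
      # Routed to the registered extension, not over HTTP.
      SERVICE <http://example.org/geoService> {
        ?place ex:withinRadiusOf "38.88,-77.03,5km" .
      }
    }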

Page 32:

This effort is an outcome from the Dagstuhl 2012 Semantic Data Management workshop. The RDF data model and SPARQL algebra harmonization is thanks to Olaf Hartig (Humboldt University). We are working toward a W3C member submission to standardize this proposal. See https://sourceforge.net/apps/trac/bigdata/ticket/526 (Reification Done Right).

Page 33:

The SIDs mode in bigdata provides efficient support for link attributes today. It will be replaced by the newer "reification done right" model. The big advantage of reification done right is that it tells you how to interchange data as RDF but frees you to process that data efficiently. The initial implementations have focused on inlining statements into statements. However, other efficient implementations are possible. For example, you can project a sparse matrix containing only the link weights and then operate efficiently on that sparse matrix using BLAS. Do not confuse the syntax with efficient representations for indices or graph data mining algorithms.
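
As a sketch of the proposed syntax (the notation later known as RDF*; the data and ex: names are illustrative, not from the slides), the embedded triple stands in for the statement identifier, so the link attribute attaches directly to the edge:

    PREFIX ex: <http://example.org/>
    INSERT DATA {
      ex:alice ex:knows ex:bob .
      # The embedded triple <<...>> plays the role of the statement identifier.
      <<ex:alice ex:knows ex:bob>> ex:weight 0.9 .
    }

A query can then bind an edge and its weight in a single pattern, e.g. SELECT ?s ?o ?w WHERE { <<?s ex:knows ?o>> ex:weight ?w }, without the four-triple reification join.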

Page 39:

The API should make it easy to write the query, but the query optimizer should be responsible for finding a good query plan. The best plan depends on the available indices, cardinality counts, database operators, etc. You can filter for just those unconnected friends that have more than one shared friend by changing the HAVING clause in the query. Only interesting attributes should be projected into the graph data mining platform. See bigdata-gom/samples/com/bigdata/gom/Example2.java for running code.
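
As a hedged sketch of the kind of query described (the ex:knows predicate is hypothetical; the real working code is in Example2.java above), this finds pairs of people who are not directly connected but share more than one friend, with the threshold set in the HAVING clause:

    PREFIX ex: <http://example.org/>
    SELECT ?person ?candidate (COUNT(DISTINCT ?friend) AS ?sharedFriends)
    WHERE {
      ?person ex:knows ?friend .
      ?friend ex:knows ?candidate .
      FILTER (?person != ?candidate)
      # Keep only pairs that are not already connected.
      FILTER NOT EXISTS { ?person ex:knows ?candidate }
    }
    GROUP BY ?person ?candidate
    HAVING (COUNT(DISTINCT ?friend) > 1)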

Page 46:

Links:
-  http://www.bigdata.com/bigdata/blog/ (blog)
-  http://sourceforge.net/projects/bigdata/ (project)
-  http://sourceforge.net/projects/bigdata/files/latest/download (WAR)
-  http://sourceforge.net/apps/mediawiki/bigdata/index.php?title=Main_Page (wiki)