big (linked) semantic data compression...big (linked) semantic data compression motivation &...

53
Big (Linked) Semantic Data Compression Motivation & Challenges Antonio Fariña, Javier D. Fernández and Miguel A. Martinez-Prieto 23 TH AUGUST 2017 3rd KEYSTONE Training School Keyword search in Big Linked Data Image: ROMAN AQUEDUCT (SEGOVIA, SPAIN)

Upload: others

Post on 22-Apr-2020

8 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Big (Linked) Semantic Data Compression...Big (Linked) Semantic Data Compression Motivation & Challenges Antonio Fariña, Javier D. Fernández and Miguel A. Martinez-Prieto 23TH AUGUST

Big (Linked) Semantic Data Compression

Motivation & Challenges

Antonio Fariña, Javier D. Fernández and

Miguel A. Martinez-Prieto

23TH AUGUST 2017

3rd KEYSTONE Training SchoolKeyword search in Big Linked Data

Image:ROMAN AQUEDUCT (SEGOVIA, SPAIN)

Page 2: Big (Linked) Semantic Data Compression...Big (Linked) Semantic Data Compression Motivation & Challenges Antonio Fariña, Javier D. Fernández and Miguel A. Martinez-Prieto 23TH AUGUST

Linked Data & Semantic Technologies

Foundations

RDF

SPARQL

(Some) Open Issues

Linked Data Workflow

Big Linked Data Challenges

Semantic Data Compression

Why is Semantic Data Redundant?

Compression Approaches

Achievements & Challenges

PAGE 2

Agenda

images: zurb.com

Page 3: Big (Linked) Semantic Data Compression...Big (Linked) Semantic Data Compression Motivation & Challenges Antonio Fariña, Javier D. Fernández and Miguel A. Martinez-Prieto 23TH AUGUST

• Foundations

• RDF

• SPARQL

Linked Data & Semantic Technologies

Big (Linked) Semantic Data Compression

Page 4: Big (Linked) Semantic Data Compression...Big (Linked) Semantic Data Compression Motivation & Challenges Antonio Fariña, Javier D. Fernández and Miguel A. Martinez-Prieto 23TH AUGUST

Linked Data is simply about using the Web to create typed links between data from different sources.

Linked Data Foundations

BIG (LINKED) SEMANTIC DATA COMPRESSIONPAGE 4

Page 5: Big (Linked) Semantic Data Compression...Big (Linked) Semantic Data Compression Motivation & Challenges Antonio Fariña, Javier D. Fernández and Miguel A. Martinez-Prieto 23TH AUGUST

Linked Data is simply about using the Web to create typed links between data from different sources.

Linked Data refers to a set of best practices for publishing and connectingdata on the Web.

These best practices have been adopted by an increasing number of data providers, leading to the creation of a global data space:

Data are machine-readable.

Data meaning is explicitly defined.

Data are linked from/to external datasets.

The resulting data network connects data from different domains:

Publications, movies, multimedia, government data, statistical data, etc.

Linked Data

BIG (LINKED) SEMANTIC DATA COMPRESSIONPAGE 5

Linked Data

Page 6: Big (Linked) Semantic Data Compression...Big (Linked) Semantic Data Compression Motivation & Challenges Antonio Fariña, Javier D. Fernández and Miguel A. Martinez-Prieto 23TH AUGUST

BIG (LINKED) SEMANTIC DATA COMPRESSIONPAGE 7

Linked Data Principles

1. Use URIs as names for entities.

2. Use HTTP URIs so that people can look up those names.

3. When someone looks up a URI, provide useful information, using standards(e.g. RDF, SPARQL).

4. Include links to other URIs, so that they can discover more things.

Page 7: Big (Linked) Semantic Data Compression...Big (Linked) Semantic Data Compression Motivation & Challenges Antonio Fariña, Javier D. Fernández and Miguel A. Martinez-Prieto 23TH AUGUST

BIG (LINKED) SEMANTIC DATA COMPRESSION PAGE 8

#1 URIs as names

What is his name?

“Homer Simpson”

“Homer Simpson”

Human conventions are not effective to name data entities:

They are not universal → ambiguity.

The use of URIs (Universal Resource Identifier) enables any real-world entity to be identified at universal scale:

What is his name?

http://example.org/person/homer-simpson

http://example.org/person/homer-simpson-guy

Names must ensure that any data entity has its own identity in the global Linked Data space.

Page 8: Big (Linked) Semantic Data Compression...Big (Linked) Semantic Data Compression Motivation & Challenges Antonio Fariña, Javier D. Fernández and Miguel A. Martinez-Prieto 23TH AUGUST

BIG (LINKED) SEMANTIC DATA COMPRESSIONPAGE 9

#2 HTTP URIs

Dereferenceable URIs ensure the corresponding entity descriptions to be retrieved when an HTTP URI is accessed (via HTTP client).

http://example.org/person/homer-simpson

http://example.org/property/name "Homer Simpson"

http://example.org/property/address "742 Evergreen Terrace"

http://example.org/property/location http://example.org/place/springfield

http://example.org/property/father http://example.org/person/abe-simpson

...

Entity names must be searchable (via HTTP).

Page 9: Big (Linked) Semantic Data Compression...Big (Linked) Semantic Data Compression Motivation & Challenges Antonio Fariña, Javier D. Fernández and Miguel A. Martinez-Prieto 23TH AUGUST

BIG (LINKED) SEMANTIC DATA COMPRESSIONPAGE 10

#3 Standards

Many and varied stakeholders coexist within the Linked Data ecosystem…

Data providers from diverse domains (economy, bioinformatics, multimedia…).

Application developers.

End-users…

… but all of them “must speak the same languages” for effective understanding.

Standardized semantic technologies:

URIs for naming.

Serialization formats (XML, N3, Turtle, HDT…) for data storage.

RDF for data modelling and exchange.

SPARQL for RDF querying.

Page 10: Big (Linked) Semantic Data Compression...Big (Linked) Semantic Data Compression Motivation & Challenges Antonio Fariña, Javier D. Fernández and Miguel A. Martinez-Prieto 23TH AUGUST

Data entities are individually described:

A particular HTTP URI is assigned as name.

Its features are stated.

Linking two URIs establishes a particular type of connection between two existing entities:

This principle materializes the aim of data integration in Linked Data.

BIG (LINKED) SEMANTIC DATA COMPRESSIONPAGE 11

#4 Links to Other URIs

person:homer-simpson

“Homer Simpson”

"742 Evergreen Terrace"property:address

property:name

person:marge-simpson

“Marge Simpson”

property:address

property:name

@prefix person : <http://example.org/person/> .

@prefix property : <http://example.org/property/> .

Page 11: Big (Linked) Semantic Data Compression...Big (Linked) Semantic Data Compression Motivation & Challenges Antonio Fariña, Javier D. Fernández and Miguel A. Martinez-Prieto 23TH AUGUST

BIG (LINKED) SEMANTIC DATA COMPRESSIONPAGE 12

#4 Links to Other URIs

person:homer-simpson

“Homer Simpson”

"742 Evergreen Terrace"property:address

property:name

person:marge-simpson

“Marge Simpson”

property:address

property:name

location:Springfield

“Springfield”

property:name

property:locationproperty:location

person:abe-simpson

property:father

“Abe Simpson”

83

property:age

property:name

person:bart-simpson

“Bart Simpson”

10

property:age

property:name

property:father

property:mother

@prefix person : <http://example.org/person/> .

@prefix property : <http://example.org/property/> .

Page 12: Big (Linked) Semantic Data Compression...Big (Linked) Semantic Data Compression Motivation & Challenges Antonio Fariña, Javier D. Fernández and Miguel A. Martinez-Prieto 23TH AUGUST

The Web of Linked Data revisits WWW foundations to build a cloud of data-to-data labelled hyperlinks.

The Web of Linked Data converts raw data into a first-class citizen of the Web:

Data entities are the atoms of the Web of Linked Data.

Each entity has its own identity.

Relies on the WWW infrastructure:

It uses HTTP as communication protocol.

Entities are named using URIs.

Web of Linked Data

BIG (LINKED) SEMANTIC DATA COMPRESSIONPAGE 13

The Web of Linked Data

Knowledge from different fields can be easily integrated and universally shared/exploited using WWW infrastructure.

Page 13: Big (Linked) Semantic Data Compression...Big (Linked) Semantic Data Compression Motivation & Challenges Antonio Fariña, Javier D. Fernández and Miguel A. Martinez-Prieto 23TH AUGUST

The Web of Linked Data (2007 – 2011)

BIG (LINKED) SEMANTIC DATA COMPRESSIONPAGE 14http://lod-cloud.net/

Page 14: Big (Linked) Semantic Data Compression...Big (Linked) Semantic Data Compression Motivation & Challenges Antonio Fariña, Javier D. Fernández and Miguel A. Martinez-Prieto 23TH AUGUST

The Web of Linked Data (2014)

BIG (LINKED) SEMANTIC DATA COMPRESSIONPAGE 15

http://lod-cloud.net/

Page 15: Big (Linked) Semantic Data Compression...Big (Linked) Semantic Data Compression Motivation & Challenges Antonio Fariña, Javier D. Fernández and Miguel A. Martinez-Prieto 23TH AUGUST

The Web of Linked Data (2017)

BIG (LINKED) SEMANTIC DATA COMPRESSIONPAGE 16

~10K datasets organized into 9 domains which include many and varied knowledge fields.

150B statements, including entity descriptions and (inter/intra-dataset) links between them.

>500 live endpoints serving this data.

http://lod-cloud.net/

http://stats.lod2.eu/

http://sparqles.ai.wu.ac.at/

Page 16: Big (Linked) Semantic Data Compression...Big (Linked) Semantic Data Compression Motivation & Challenges Antonio Fariña, Javier D. Fernández and Miguel A. Martinez-Prieto 23TH AUGUST

RDF is a standard model for data interchange on the Web. RDF has features that facilitate data merging even if the underlying schemas differ…

RDF

BIG (LINKED) SEMANTIC DATA COMPRESSIONPAGE 17

Page 17: Big (Linked) Semantic Data Compression...Big (Linked) Semantic Data Compression Motivation & Challenges Antonio Fariña, Javier D. Fernández and Miguel A. Martinez-Prieto 23TH AUGUST

RDF is a standard model for data publication, interchange, and consumption on the Web of Linked Data.

RDF allows any class of data to be described using a simple triplestructure:

Subject: the resource being described.

Predicate: a property of that resource.

Object: the value for the corresponding property.

BIG (LINKED) SEMANTIC DATA COMPRESSIONPAGE 18

RDF Basics

http://example.org/person/homer-simpson

http://example.org/property/name

"Homer Simpson"

http://example.org/person/homer-simpson

http://example.org/property/father

http://example.org/person/abe-simpson

Page 18: Big (Linked) Semantic Data Compression...Big (Linked) Semantic Data Compression Motivation & Challenges Antonio Fariña, Javier D. Fernández and Miguel A. Martinez-Prieto 23TH AUGUST

An RDF triple can be seen as a labelled directed subgraph in which subject and object nodes are linked by a particular (predicate) edge:

The subject node contains the URI which names the resource.

The predicate edge labels the relationship using a URI whose semantics is described by any vocabulary/ontology.

The object node may contain a URI or a Literal value.

RDF links (between entities) also take the form of RDF triples.

BIG (LINKED) SEMANTIC DATA COMPRESSIONPAGE 19

RDF Triples

@prefix person : <http://example.org/person/> .

@prefix property : <http://example.org/property/> .

person:homer-simpsonperson:abe-simpson "Homer Simpson"property:name

"742 Evergreen Terrace"property:address

property:father

Page 19: Big (Linked) Semantic Data Compression...Big (Linked) Semantic Data Compression Motivation & Challenges Antonio Fariña, Javier D. Fernández and Miguel A. Martinez-Prieto 23TH AUGUST

BIG (LINKED) SEMANTIC DATA COMPRESSIONPAGE 20

RDF Graphs

@prefix person : <http://example.org/person/> .

@prefix property : <http://example.org/property/> .

person:homer-simpson

person:abe-simpson

"Homer Simpson"property:name

"742 Evergreen Terrace"

property:address

property:father

person:marge-simpson

property:address

"Marge Simpson"

property:namelocation:springfield

property:location

property:location

person:bart-simpson "Springfield"

property:mother

property:father

property:name"Bart Simpson"

10

property:name

property:age

83 "Bart Simpson"property:age property:name

Page 20: Big (Linked) Semantic Data Compression...Big (Linked) Semantic Data Compression Motivation & Challenges Antonio Fariña, Javier D. Fernández and Miguel A. Martinez-Prieto 23TH AUGUST

BIG (LINKED) SEMANTIC DATA COMPRESSIONPAGE 21

RDF Graphs

An RDF graph is only a mental model which must be serialized for effective storage:

Choosing a particular serialization format is an important decision for the most relevant tasks in the Web of Linked Data.

Page 21: Big (Linked) Semantic Data Compression...Big (Linked) Semantic Data Compression Motivation & Challenges Antonio Fariña, Javier D. Fernández and Miguel A. Martinez-Prieto 23TH AUGUST

BIG (LINKED) SEMANTIC DATA COMPRESSIONPAGE 23

RDF Serialization Formats

NTriples

RDF/XML

N3

JSON/LD

http://www.easyrdf.org/converter

Page 22: Big (Linked) Semantic Data Compression...Big (Linked) Semantic Data Compression Motivation & Challenges Antonio Fariña, Javier D. Fernández and Miguel A. Martinez-Prieto 23TH AUGUST

SPARQL is a semantic query language for databases, able to retrieve and manipulate data stored in Resource Description Framework (RDF) format.

SPARQL

BIG (LINKED) SEMANTIC DATA COMPRESSIONPAGE 24

Page 23: Big (Linked) Semantic Data Compression...Big (Linked) Semantic Data Compression Motivation & Challenges Antonio Fariña, Javier D. Fernández and Miguel A. Martinez-Prieto 23TH AUGUST

SPARQL is a semantic query language, able to retrieve and manipulate RDF data.

SPARQL query resolution performs graph pattern matching:

The objective is to determine all the candidates matching to a pattern query in a large data graph.

Triple Patterns are the most basic SPARQL query:

Triple patterns are like RDF triples except that each of the subject, predicate and object may be a variable.

More complex Graph Patterns comprise joins or unions of multiple triple patterns.

BIG (LINKED) SEMANTIC DATA COMPRESSIONPAGE 25

SPARQL Basics

Page 24: Big (Linked) Semantic Data Compression...Big (Linked) Semantic Data Compression Motivation & Challenges Antonio Fariña, Javier D. Fernández and Miguel A. Martinez-Prieto 23TH AUGUST

The semantics of “lives in” is captured by the property http://example.org/property/location.

The entity describing “Springfield”is named by the URI http://example.org/location/springfield.

BIG (LINKED) SEMANTIC DATA COMPRESSIONPAGE 26

SPARQL Querying

Who lives in Springfield?

?who location:springfield

property:location

Who?

@prefix person : <http://example.org/person/> .

@prefix property : <http://example.org/property/> .

SELECT ?Who

WHERE { ?Who property:location location:Springfield }

Page 25: Big (Linked) Semantic Data Compression...Big (Linked) Semantic Data Compression Motivation & Challenges Antonio Fariña, Javier D. Fernández and Miguel A. Martinez-Prieto 23TH AUGUST

BIG (LINKED) SEMANTIC DATA COMPRESSIONPAGE 27

SPARQL Querying

person:homer-simpson

person:abe-simpson

"Homer Simpson"property:name

"742 Evergreen Terrace"

property:address

property:father

person:marge-simpson

property:address

"Marge Simpson"

property:namelocation:springfield

property:location

property:location

person:bart-simpson "Springfield"

property:mother

property:father

property:name"Bart Simpson"

10

property:name

property:age

83 "Bart Simpson"property:age property:name

?who location:springfield

property:location

location:springfield

property:location

property:location

?who

?who

Page 26: Big (Linked) Semantic Data Compression...Big (Linked) Semantic Data Compression Motivation & Challenges Antonio Fariña, Javier D. Fernández and Miguel A. Martinez-Prieto 23TH AUGUST

• Linked Data Workflow

• Big Linked Data Challenges

(Some) Open Issues

Big (Linked) Semantic Data Compression

Page 27: Big (Linked) Semantic Data Compression...Big (Linked) Semantic Data Compression Motivation & Challenges Antonio Fariña, Javier D. Fernández and Miguel A. Martinez-Prieto 23TH AUGUST

Data moves from data providers to end users within the Linked Data ecosystem. It evolves along many stages to consolidate effective results which satisfy end-user requirements.

Linked Data Workflow

BIG (LINKED) SEMANTIC DATA COMPRESSIONPAGE 29

Page 28: Big (Linked) Semantic Data Compression...Big (Linked) Semantic Data Compression Motivation & Challenges Antonio Fariña, Javier D. Fernández and Miguel A. Martinez-Prieto 23TH AUGUST

Generation: data is ingested from real-world systems and transformed into an RDF dataset.

Publication: the RDF dataset is exposed using one or more Linked Data compliant services.

Consumption: data is retrieved from consumers using well-defined interfaces.

BIG (LINKED) SEMANTIC DATA COMPRESSIONPAGE 31

Linked Data Workflow

Real-world

Facts

Data Creator

RDF Dataset

Data Provider

HTTP URIs

Dump

Endpoint

LD Fragment

Data Consumers

GENERATION

PU

BLIC

ATIO

N

CONSUMPTION

Page 29: Big (Linked) Semantic Data Compression...Big (Linked) Semantic Data Compression Motivation & Challenges Antonio Fariña, Javier D. Fernández and Miguel A. Martinez-Prieto 23TH AUGUST

Data is extracted from real-world sources:

Real-time sources require streams of raw data to be continuously ingested.

Data from other sources are ingested in batch.

Data cleansing deals with the transformation of raw data into high-quality data which is finally modeled using RDF:

Streaming or batch processing workflows are deployed for data transformation.

Data integration enriches the current data with links from/to other relevant entities in the Linked Data Web:

Where is the entity published? What is its URI?

A new RDF dataset (or and RDF stream) is generated as a result.

BIG (LINKED) SEMANTIC DATA COMPRESSIONPAGE 32

RDF Data Generation

Page 30: Big (Linked) Semantic Data Compression...Big (Linked) Semantic Data Compression Motivation & Challenges Antonio Fariña, Javier D. Fernández and Miguel A. Martinez-Prieto 23TH AUGUST

The new RDF dataset must be exposed in the Linked Data Web.

a. RDF Dump: I. Triples are serialized (and possible compressed) into a valid RDF format.

II. The resulting dump file is hosted at a web server and registered into a central catalog (e.g. datahub.io) for discovering purposes.

b. HTTP URIs:I. Triples are stored and indexed (using possible a semantic database).

II. An interface is exposed for dereferencing URIs.

c. SPARQL endpoint:I. Triples are stored and indexed using a semantic database.

II. An SPARQL interface is exposed for querying.

d. Linked Data fragment:I. Triples are self-indexed using an in-memory RDF engine (HDT).

II. An LD interface is exposed for (federated) querying.

BIG (LINKED) SEMANTIC DATA COMPRESSIONPAGE 33

RDF Data Publication

Page 31: Big (Linked) Semantic Data Compression...Big (Linked) Semantic Data Compression Motivation & Challenges Antonio Fariña, Javier D. Fernández and Miguel A. Martinez-Prieto 23TH AUGUST

BIG (LINKED) SEMANTIC DATA COMPRESSIONPAGE 34

RDF Data Consumption

Consumers retrieve RDF data to meet their information requirements.

Consumers can download RDF dumps for local use: A (possibly) voluminous file is retrieved and stored locally.

Data are usually indexed using a semantic database.

The remaining mechanisms are used to retrieve pieces of RDF data matching particular needs:

Dereferencing HTTP URIs allows individual entity descriptions to be retrieved from the corresponding publisher.

SPARQL endpoints are used to retrieve RDF data which matches complex SPARQL queries in a particular dataset.

Finally, Linked Data Fragments are accessed to resolve federated queries along the Web of Linked Data.

Page 32: Big (Linked) Semantic Data Compression...Big (Linked) Semantic Data Compression Motivation & Challenges Antonio Fariña, Javier D. Fernández and Miguel A. Martinez-Prieto 23TH AUGUST

This processing workflow suffers when Big Linked Data must be managed along it.

BIG (LINKED) SEMANTIC DATA COMPRESSIONPAGE 35

Linked Data Workflow

Real-world

Facts

Data Creator

RDF Dataset

Data Provider

HTTP URIs

LD Fragment

Endpoint

DumpData

Consumers

CONSUMPTION

Page 33: Big (Linked) Semantic Data Compression...Big (Linked) Semantic Data Compression Motivation & Challenges Antonio Fariña, Javier D. Fernández and Miguel A. Martinez-Prieto 23TH AUGUST

Big is not a matter of size... it is a matter or representativity & consumption capacity.

Big Linked Data Challenges

BIG (LINKED) SEMANTIC DATA COMPRESSIONPAGE 36

Page 34: Big (Linked) Semantic Data Compression...Big (Linked) Semantic Data Compression Motivation & Challenges Antonio Fariña, Javier D. Fernández and Miguel A. Martinez-Prieto 23TH AUGUST

We refer as Big Linked Data (BLD) an RDF dataset that exceeds the capacity of conventional tools used to implement the processing workflow.

BLD management is a challenge because of its volume:

Individual datasets are not typically considered BLD [+Gb,++Gb].

Integrated RDF datasets (mashups) can be considered BLD [++Gb, +Tb].

The whole Web of Linked Data is obviously BLD [++Tb].

BLD management is a challenge because of its velocity:

Data is generated continuously in many scenarios.

Data is queried continuously from many users.

BLD management is a challenge because of other Vs (variability, veracity, validity, vulnerability…) .

Big Linked Data

BIG (LINKED) SEMANTIC DATA COMPRESSIONPAGE 37

Big Linked Data

Page 35: Big (Linked) Semantic Data Compression...Big (Linked) Semantic Data Compression Motivation & Challenges Antonio Fariña, Javier D. Fernández and Miguel A. Martinez-Prieto 23TH AUGUST

Generation Challenges:

Efficient streaming solutions are required for processing continuous RDF sources (e.g. meteorology or traffic sensors, social networks…).

Scalable batch-processing tools able to store and manage large amounts of RDF data along the cleansing/integration stages.

Publication Challenges:

Compact serialization formats for saving storage space (RDF dumps).

Efficient indexes for performing fast URI lookups and retrieving the corresponding triples (HTTP URIs).

Lightweight indexes and efficient in-memory algorithms for fast query resolution (SPARQL endpoints).

Linked Data Fragments is a good example of scalable solution.

Consumption Challenges:

Compact serialization formats for reducing network latencies.

BIG (LINKED) SEMANTIC DATA COMPRESSIONPAGE 38

Big Linked Data Challenges

Page 36: Big (Linked) Semantic Data Compression...Big (Linked) Semantic Data Compression Motivation & Challenges Antonio Fariña, Javier D. Fernández and Miguel A. Martinez-Prieto 23TH AUGUST

• Why is Semantic Data Redundant?

• Compression Approaches

• Achievements & Challenges

Semantic Data Compression

Big (Linked) Semantic Data Compression

Page 37: Big (Linked) Semantic Data Compression...Big (Linked) Semantic Data Compression Motivation & Challenges Antonio Fariña, Javier D. Fernández and Miguel A. Martinez-Prieto 23TH AUGUST

A semantic data compressor provides an alternative method for serializing RDF triples:

It detects redundant information within the dataset and removes it.

In practice, the resulting encoding uses (much) less bits than that required by traditional formats.

This lecture deals with the use of semantic data compression to address some of the Big Linked Data challenges:

Compressed RDF saves storage space.

Compressed RDF requires less bandwidth for exchanging purposes.

Compressed RDF is loaded faster than other traditional solutions, and (possibly) requires less amounts of memory for triples parsing.

Some classes of compressed RDF allows fast lookups to be performed with no prior triples decoding.

BIG (LINKED) SEMANTIC DATA COMPRESSIONPAGE 40

Semantic Data Compression

Page 38: Big (Linked) Semantic Data Compression...Big (Linked) Semantic Data Compression Motivation & Challenges Antonio Fariña, Javier D. Fernández and Miguel A. Martinez-Prieto 23TH AUGUST

Data redundancy means that the same information can be encoded using less bits.

Why Semantic Data is Redundant?

BIG (LINKED) SEMANTIC DATA COMPRESSIONPAGE 41

Page 39: Big (Linked) Semantic Data Compression...Big (Linked) Semantic Data Compression Motivation & Challenges Antonio Fariña, Javier D. Fernández and Miguel A. Martinez-Prieto 23TH AUGUST

Data is redundant when the same information can be encoded using less bits:

Compression is achieved when redundant data is removed from the original dataset.

Why is semantic data redundant?

Some facts can be inferred from other ones →semantic redundancy.

Sequences of symbols are repeated along the dataset → symbolic redundancy.

The underlying RDF graph structure is redundant by itself → syntactic (structural) redundancy.

BIG (LINKED) SEMANTIC DATA COMPRESSIONPAGE 42

Semantic Data Redundancy

Page 40: Big (Linked) Semantic Data Compression...Big (Linked) Semantic Data Compression Motivation & Challenges Antonio Fariña, Javier D. Fernández and Miguel A. Martinez-Prieto 23TH AUGUST

Semantic redundancy occurs when the same meaning can be expressed using less triples:

BIG (LINKED) SEMANTIC DATA COMPRESSIONPAGE 43

Semantic Redundancy

http://example.org/property/age

http://www.w3.org/2000/01/rdf−schema#domain

http://example.org/classes/person

http://example.org/person/bart-simpson

http://example.org/property/age

10

http://example.org/person/bart-simpson

http://www.w3.org/1999/02/22−rdf−syntax−ns#type

http://example.org/classes/person

The third triple is redundant because the entity named as http://example.org/ person/bart-simpson is the type http://example.org/classes/person because of the

second triple (it provides the age of the person).

Page 41: Big (Linked) Semantic Data Compression...Big (Linked) Semantic Data Compression Motivation & Challenges Antonio Fariña, Javier D. Fernández and Miguel A. Martinez-Prieto 23TH AUGUST

Symbolic redundancy is due to symbol repetitions in triples:

This is the “traditional” source of redundancy removed by universal compressors.

Symbolic redundancy in RDF is mainly due to URIs:

URIs tend to be very large strings which share long prefixes, but also has other common infixes or suffixes.

BIG (LINKED) SEMANTIC DATA COMPRESSIONPAGE 44

Symbolic Redundancy

Literals also contribute to this redundancy.

http://example.org/class/person

http://example.org/property/address

http://example.org/property/age

http://example.org/property/location

http://example.org/person/abe-simpson

http://example.org/person/bart-simpson

http://example.org/person/homer-simpson

http://example.org/person/marge-simpson

Abe Simpson

Marge Simpson

Page 42: Big (Linked) Semantic Data Compression...Big (Linked) Semantic Data Compression Motivation & Challenges Antonio Fariña, Javier D. Fernández and Miguel A. Martinez-Prieto 23TH AUGUST

Syntactic redundancy depends on how the RDF graph structure is serialized:

For instance, a serialized subset of n triples (which describes the same resource) writes n times the subject value. It can be abbreviated.

... and also on the underlying graph structure by itself:

For instance, resources of the same classes are described using (almost) the same sub-graph structure.

Syntactic compression also has (many) room for optimization:

Structural RDF features must be better understood.

BIG (LINKED) SEMANTIC DATA COMPRESSIONPAGE 45

Syntactic Redundancy

Page 43: Big (Linked) Semantic Data Compression...Big (Linked) Semantic Data Compression Motivation & Challenges Antonio Fariña, Javier D. Fernández and Miguel A. Martinez-Prieto 23TH AUGUST

The current state of the art comprises a rich and varied set of compressors for RDF data. These are mainly lossless compressors (because they preserve the original knowledge in the dataset), yet lossy compressors are also emerging

Compression Approaches

BIG (LINKED) SEMANTIC DATA COMPRESSIONPAGE 46

Page 44: Big (Linked) Semantic Data Compression...Big (Linked) Semantic Data Compression Motivation & Challenges Antonio Fariña, Javier D. Fernández and Miguel A. Martinez-Prieto 23TH AUGUST

Semantic compressors proposes many and varied techniques for lossless compression:

They preserve the original knowledge in the dataset.

Lossy compressors are not in the scope of this lecture because losing knowledge is not acceptable for the Linked Data workflow.

Three classes of lossless compressors:

Physical compressors detect and removes symbolic and/or syntactic redundancy from the original dataset.

Logical compressors detect and removes semantic redundancy from the original dataset.

Hybrid compressors perform at physical and logical levels.

BIG (LINKED) SEMANTIC DATA COMPRESSIONPAGE 47

Compression Approaches

Page 45: Big (Linked) Semantic Data Compression...Big (Linked) Semantic Data Compression Motivation & Challenges Antonio Fariña, Javier D. Fernández and Miguel A. Martinez-Prieto 23TH AUGUST

Physical compressors adapt traditional concepts from data compression to the particular case of RDF compression:

Dictionary compression removes symbolic redundancy and allows triples to be rewritten as 3-IDs tuples (ID graph).

Graph compression is applied to the ID graph representation and removes different kinds of syntactic redundancy.

Many and varied physical compressors have been proposed:

It reports good space savings → compressed datasets take (much) less

space than their uncompressed counterparts.

BIG (LINKED) SEMANTIC DATA COMPRESSIONPAGE 48

Physical Compressors

A subset of such compressors (e.g. HDT, k2triples, or RDFCSA) are really self-indexes:

Datasets are compressed and indexed in a single encoding.

Triples can be searched accessed with no prior decompression.

Page 46: Big (Linked) Semantic Data Compression...Big (Linked) Semantic Data Compression Motivation & Challenges Antonio Fariña, Javier D. Fernández and Miguel A. Martinez-Prieto 23TH AUGUST

Logical compressors look for redundant triples (those than can be inferred), and remove them from the dataset.

The resulting dataset encoding includes two components:

The canonical subgraph which organizes all canonical triples.

The set of inference rules which must be applied to recover “redundant triples”.

Different rule-based algorithms have been proposed to obtain canonical subgraphs.

Compression effectiveness is less competitive than for physical compressors.

BIG (LINKED) SEMANTIC DATA COMPRESSIONPAGE 49

Logical Compressors

Page 47: Big (Linked) Semantic Data Compression...Big (Linked) Semantic Data Compression Motivation & Challenges Antonio Fariña, Javier D. Fernández and Miguel A. Martinez-Prieto 23TH AUGUST

Hybrid compressors combine the best of both worlds:

Consider different strategies to compact the graph by deleting semantic redundancy at logical level.

Detect and remove syntactic/symbolic redundancy at physical level(dictionary + graph compression).

These topic has not been much explored yet… but it is (maybe) the most promissory:

Canonical subgraphs are smaller than their original counterparts → less

space requirements.

Implicit triples can be efficiently accessed by applying inference rules → better overall search performance.

HDT++ or k2triples++ report competitive (preliminary)

space/time tradeoffs.

BIG (LINKED) SEMANTIC DATA COMPRESSIONPAGE 50

Hybrid Compressors

Page 48: Big (Linked) Semantic Data Compression...Big (Linked) Semantic Data Compression Motivation & Challenges Antonio Fariña, Javier D. Fernández and Miguel A. Martinez-Prieto 23TH AUGUST

RDF compression is a mature field of research, but the current state of the art has many roomfor optimization.

Achievements & Challenges

BIG (LINKED) SEMANTIC DATA COMPRESSIONPAGE 51

Page 49: Big (Linked) Semantic Data Compression...Big (Linked) Semantic Data Compression Motivation & Challenges Antonio Fariña, Javier D. Fernández and Miguel A. Martinez-Prieto 23TH AUGUST

RDF compression is a mature field of research: We are working on this field from 2009.

HDT is W3C Member Submission from 2011: It is a de facto standard for publishing compressed RDF

datasets (e.g. LOD Laundromat).

It is the RDF engine of Linked Data Fragments and has been adopted by many other tools in the Semantic Web community.

HDT+Jena allows in-memory SPARQL resolution to be efficiently performed.

K2triples and RDFCSA are the state of the art for RDF compression and SPARQL triples pattern resolution.

We have recently compressed the Linked Data Web in a single dataset: LOD-a-lot.

BIG (LINKED) SEMANTIC DATA COMPRESSIONPAGE 52

RDF Compression Achievements

Sandra Mario Nieves

Ana Antonio Javier D.

José M. Claudio Antonio

Miguel A. Gonzalo Axel

Page 50: Big (Linked) Semantic Data Compression...Big (Linked) Semantic Data Compression Motivation & Challenges Antonio Fariña, Javier D. Fernández and Miguel A. Martinez-Prieto 23TH AUGUST

RDF compression is yet underexploited:

Managing compressed RDF allows the memory footprint to be reduced, improving scalability.

Self-indexes can be adopted by semantic databases to improve query performance.

A full-compressed triplestore has not been yet released, but we currently working on it.

Although our techniques has room for optimization!

Some other semantic applications can be improved using compression:

RDF archives, Linked Closed Data…

BIG (LINKED) SEMANTIC DATA COMPRESSIONPAGE 53

RDF Compression Challenges

Page 51: Big (Linked) Semantic Data Compression...Big (Linked) Semantic Data Compression Motivation & Challenges Antonio Fariña, Javier D. Fernández and Miguel A. Martinez-Prieto 23TH AUGUST

BIG (LINKED) SEMANTIC DATA COMPRESSIONPAGE 54

Bibliography

1. Sandra Alvarez-Garccía, Nieves Brisaboa, Javier D. Fernáandez, Miguel A. Martínez-Prieto, and Gonzalo Navarro.Compressed Vertical Partitioning for Efficient RDF Management. Knowledge and Information Systems (KAIS),44(2):439–474, 2015.

2. Tim Berners-Lee. Linked Data, 2006. http://www.w3.org/DesignIssues/LinkedData.html.

3. Christian Bizer, Tom Heath, and Tim Berners-Lee. Linked Data - The Story So Far. International Journal ofSemantic Web and Information Systems, 5(3):1–22, 2009.

4. Nieves Brisaboa, Ana Cerdeira, Antonio Fariña, and Gonzalo Navarro. A Compact RDF Store using Suffix Arrays. InProceedings of SPIRE, pages 103-115, 2015.

5. Javier D. Fernández, Mario Arias, Miguel A. Martínez-Prieto, and Claudio Gutiérrez. Management of Big SemanticData. In Big Data Computing, chapter 4. Taylor and Francis/CRC, 2013.

6. Javier D. Fernández, Miguel A. Martínez-Prieto, Claudio Gutiérrez, and Axel Polleres. Binary RDF Representation forPublication and Exchange. W3C Member Submission, 2011. www.w3.org/Submission/HDT/.

7. Javier D. Fernández, Miguel A. Martínez-Prieto, Claudio Gutiérrez, Axel Polleres, and Mario Arias. Binary RDFRepresentation for Publication and Exchange. Journal of Web Semantics, 19:22–41, 2013.

Page 52: Big (Linked) Semantic Data Compression...Big (Linked) Semantic Data Compression Motivation & Challenges Antonio Fariña, Javier D. Fernández and Miguel A. Martinez-Prieto 23TH AUGUST

BIG (LINKED) SEMANTIC DATA COMPRESSIONPAGE 55

Bibliography

8. Tom Heath and Christian Bizer. Linked Data: Evolving the Web into a Global Data Space. Morgan & Claypool, 1edition, 2011. http://linkeddatabook.com/.

9. Amit K. Joshi, Pascal Hitzler, and Guozhu Dong. Logical Linked Data Compression. In Proceedings of ESWC, pages170–184, 2013.

10. Frank Manola and Eric Miller. RDF Primer. W3C Recommendation, 2004. www.w3.org/TR/rdf-primer/.

11. Jeff Z. Pan, José Manuel Gómez-Pérez, Yuan Ren, Honghan Wu, and Man Zhu. SSP: Compressing RDF data bySummarisation, Serialisation and Predictive Encoding. Technical report, 2014. Available at http://www.kdrive-project.eu/wp-content/uploads/2014/06/WP3-TR2-2014 SSP.pdf.

12. Eric Prud’hommeaux and Andy Seaborne. SPARQL Query Language for RDF. W3C Recommendation, 2008.http://www.w3.org/TR/rdf-sparql-query/.

Page 53: Big (Linked) Semantic Data Compression...Big (Linked) Semantic Data Compression Motivation & Challenges Antonio Fariña, Javier D. Fernández and Miguel A. Martinez-Prieto 23TH AUGUST

Basics of Data Compression

Let’s the lecture continues…

Image:MAIN SQUARE & CATHEDRAL (SEGOVIA, SPAIN)