digital preservation in the era of big data - the diachron platform - acting on change 2016

Digital Preservation in the era of Big Data The DIACHRON Platform, archiving and querying linked open data George Papastefanatos [email protected] “Athena” Research & Innovation Center Panel discussion: Preparing for change Acting on Change Conference : New Approaches and Future Practices in LTDP London, Dec 2016

Upload: periclesfp7

Post on 15-Apr-2017


Page 1: Digital Preservation in the era of Big Data - The Diachron Platform - Acting on Change 2016

Digital Preservation in the era of Big Data

The DIACHRON Platform, archiving and querying linked open data

George Papastefanatos [email protected]

“Athena” Research & Innovation Center

Panel discussion: Preparing for change

Acting on Change Conference: New Approaches and Future Practices in LTDP

London, Dec 2016

Page 2: Digital Preservation in the era of Big Data - The Diachron Platform - Acting on Change 2016

Data Explosion

Big data generate significant financial value across sectors.

Page 3: Digital Preservation in the era of Big Data - The Diachron Platform - Acting on Change 2016

Data on the Web

Global data space

Connecting data from diverse domains and sources

Primary objects: (description of) “entities”

Links between “entities”

Info granularity: from entire data collections to atomic data

Interrelated, Heterogeneous

Adapted from Chris Bizer, Richard Cyganiak, Tom Heath, available at http://linkeddata.org/guides-and-tutorials

Web of Data

[Figure: conceptual representation of the Web of Data. Entities are connected by typed links and are represented in sources such as spreadsheets, HTML, XML, RDFa, semi-structured data, triples and statistical data.]

Web of world things described by Web of data

Page 4: Digital Preservation in the era of Big Data - The Diachron Platform - Acting on Change 2016

Data Web Evolution

Explosion of data volume published on web and diversity of sources

Government, Scientific, Corporate, Crowd-sourced

Linked Open Data (LOD) continuously published

Currently data.gouv.fr lists 350,000 datasets; data.gov.uk has 8,200 datasets.

[Figure: Current Status, with snapshots labelled 2007, 2009 and 2011]

Page 5: Digital Preservation in the era of Big Data - The Diachron Platform - Acting on Change 2016

Statistics

Datasets#: 1014

• Social web 51.28%
• Government 18.05%
• Publications 9.47%
• Life sciences 8.19%
• User-generated content 4.73%
• Cross-domain 4.04%
• Media 2.17%
• Geographic 2.07%

Rapidly Evolving Ecosystem

Mid 2014

http://lod-cloud.net/

Page 6: Digital Preservation in the era of Big Data - The Diachron Platform - Acting on Change 2016

Big Data Preservation is Challenging

Emerging Application Domains

2020: digital data production > 40 zettabytes = 5,200 Gbytes for every person on the planet

WIRED – 09/10/2014
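
As a rough sanity check of the 2020 figure, dividing the total volume by the per-person share gives the population that the estimate implies. This sketch assumes decimal units (1 zettabyte = 10^21 bytes, 1 Gbyte = 10^9 bytes); the slide does not say which convention it uses.

```python
# Assumption: decimal (SI) units; the slide does not state the convention.
ZETTABYTE = 10**21
GIGABYTE = 10**9

total_bytes = 40 * ZETTABYTE          # "> 40 zettabytes"
per_person_bytes = 5_200 * GIGABYTE   # "5,200 Gbytes for every person"

# Population implied by the two numbers quoted on the slide.
implied_population = total_bytes / per_person_bytes
print(f"{implied_population / 1e9:.2f} billion people")  # prints "7.69 billion people"
```

The result is of the same order as the world population, so the two numbers on the slide are consistent with each other.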

Effective & efficient techniques to manage the data lifecycle

[Figure: data lifecycle stages: Appraisal, Integration, Archiving, Producing, Publishing, Linking]

Page 7: Digital Preservation in the era of Big Data - The Diachron Platform - Acting on Change 2016

DIACHRON Approach

Publishing and preservation of data are performed together: archiving and dissemination are synonymous.

Page 8: Digital Preservation in the era of Big Data - The Diachron Platform - Acting on Change 2016

DIACHRON Model

Dataset Model

Page 9: Digital Preservation in the era of Big Data - The Diachron Platform - Acting on Change 2016

Dataset Model

[M1-M12] – Task 1.5

[Figure: the DIACHRON dataset model. Labels recoverable from the diagram:]

• Diachronic Dataset D1 with versions D1(t1), D1(t2), D1(t3), …, D1(tm) on a time axis t1, t2, t3, t4, …
• Change Set D1(t1,t2)
• Time-Agnostic Space and Time-Aware Space
• Data Space and Curated Information Space
• RecordSet(tm) and Schema(tm)
• Record_1, Record Atts: subject Resource_a (D1,tm), predicate “vcard:hasAddress”, object “6 Artemidos st.”
• Record_2, Record Atts: subject, predicate foaf:name, object “John Doe”
• D1(tn) with Resource_a (D1,tn) and Record_i
• Resource Changes (D1,tm,tn); Record and Schema changes
• Diachronic Resource b, owl:sameAs, Diachronic Resource a
• Diachronic Dataset D2 with Resource_a (D2,tk)
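
The model in the diagram can be sketched as a handful of data structures. This is a minimal illustration only, not the DIACHRON implementation: the class names, field names and the `diff` helper are assumptions introduced here, and records are compared by value to derive a change set between two versions.

```python
from dataclasses import dataclass, field

# All names below are hypothetical; they only mirror the labels in the diagram.

@dataclass(frozen=True)
class RecordAttribute:
    predicate: str   # e.g. "vcard:hasAddress"
    obj: str         # e.g. "6 Artemidos st."

@dataclass(frozen=True)
class Record:
    record_id: str
    subject: str       # the resource the record describes
    attributes: tuple  # tuple of RecordAttribute

@dataclass
class DatasetVersion:
    dataset_id: str
    timestamp: str
    records: dict = field(default_factory=dict)  # record_id -> Record

@dataclass
class ChangeSet:
    dataset_id: str
    old: str
    new: str
    added: list
    deleted: list

def diff(old: DatasetVersion, new: DatasetVersion) -> ChangeSet:
    """Derive a change set by comparing the records of two versions."""
    added = [r for rid, r in new.records.items() if old.records.get(rid) != r]
    deleted = [r for rid, r in old.records.items() if new.records.get(rid) != r]
    return ChangeSet(old.dataset_id, old.timestamp, new.timestamp, added, deleted)

record_1 = Record("Record_1", "Resource_a",
                  (RecordAttribute("vcard:hasAddress", "6 Artemidos st."),))
record_2 = Record("Record_2", "Resource_a",
                  (RecordAttribute("foaf:name", "John Doe"),))

d1_t1 = DatasetVersion("D1", "t1", {"Record_1": record_1})
d1_t2 = DatasetVersion("D1", "t2", {"Record_1": record_1, "Record_2": record_2})

change_set = diff(d1_t1, d1_t2)
print([r.record_id for r in change_set.added])    # ['Record_2']
print([r.record_id for r in change_set.deleted])  # []
```

A modified record appears in both `added` and `deleted` under this simplification; typed change descriptions (schema versus data changes) are not modelled.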

Page 14: Digital Preservation in the era of Big Data - The Diachron Platform - Acting on Change 2016

Dataset Model

[M1-M12] – Task 1.5Diachronic Dataset D1

D1(t1) D1(t2) D1(t3)…………. D1(tm)

t1 t2 t3 t4 ………….

time

Change Set

D1(t1,t2)

Change Set

D1(t1,t2)

Tim

e-Ag

nosti

c Sp

ace

Tim

e-Aw

are

Spac

e

Record_1

Record Atts

subject

predicate

“6 Artemidos st.”

Resource_a (D1,tm)

“vcard:hasAddress”

object

RecordSet(tm)

Schema(tm)

Data Space Curated Information Space

Record_2

Record Atts

predicate

“John Doe”foaf:na

me

object

subject

D1(tn)

Resource_a (D1,tn)Record_i

Record Atts

subject

Resource Changes

(D1,tm,tn)

Change SetD1(t1,t2)

Resource_a (D2,tk)

………….

Record and Schema changes

Diachronic Resource bowl:sameAsDiachronic

Resource aDiachronic Dataset D2

Page 15: Digital Preservation in the era of Big Data - The Diachron Platform - Acting on Change 2016

DIACHRON Query Language

• Queries on archive catalog
– Lists of datasets
– Lists of versions of a given dataset
– Filtered based on temporal, provenance or other metadata criteria

• Queries on Data
– Retrieve part(s) of a dataset that match certain criteria.

• Longitudinal queries
– Retrieve part(s) of a dataset across multiple versions.
– Temporal (version based) criteria can be applied.

• Queries on Changes
– Retrieve changes between two concurrent versions.
– Limit results for specific type of changes (schema, data, etc.).

• Mixed Queries on Changes and Data
– Retrieve datasets or parts of datasets affected by specific changes

Requirements
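
The five query categories above can be illustrated against a toy in-memory archive. This is a sketch under stated assumptions, not the DIACHRON API: the `archive` layout (dataset id to version number to set of triples) and every function name are hypothetical.

```python
# Hypothetical archive: dataset id -> {version number -> set of (s, p, o) triples}
archive = {
    "D1": {
        1: {("Resource_a", "foaf:name", "John Doe")},
        2: {("Resource_a", "foaf:name", "John Doe"),
            ("Resource_a", "vcard:hasAddress", "6 Artemidos st.")},
    },
}

def list_versions(dataset_id):
    """Catalog query: versions of a given dataset, in order."""
    return sorted(archive[dataset_id])

def query_data(dataset_id, version, predicate):
    """Data query: the part of one version that matches a criterion."""
    return {t for t in archive[dataset_id][version] if t[1] == predicate}

def longitudinal(dataset_id, predicate, first, last):
    """Longitudinal query: the same data query across a range of versions."""
    return {v: query_data(dataset_id, v, predicate)
            for v in list_versions(dataset_id) if first <= v <= last}

def changes(dataset_id, old, new):
    """Change query: triples added and deleted between two versions."""
    a, b = archive[dataset_id][old], archive[dataset_id][new]
    return {"added": b - a, "deleted": a - b}

def versions_affected_by(dataset_id, predicate):
    """Mixed query: versions whose change from the previous version touches `predicate`."""
    versions = list_versions(dataset_id)
    hits = []
    for old, new in zip(versions, versions[1:]):
        delta = changes(dataset_id, old, new)
        if any(t[1] == predicate for t in delta["added"] | delta["deleted"]):
            hits.append(new)
    return hits

print(list_versions("D1"))                             # [1, 2]
print(versions_affected_by("D1", "vcard:hasAddress"))  # [2]
```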

Page 16: Digital Preservation in the era of Big Data - The Diachron Platform - Acting on Change 2016

DIACHRON Query Language

• Extension of SPARQL
– SPARQL queries are valid DIACHRON queries

• DIACHRON graph model – basis of the query language, e.g.
– <FROM DATASET>, <FROM CHANGES>, …

• Specific versions
– AT VERSION, AFTER VERSION, BEFORE VERSION, BETWEEN VERSIONS

• Syntactic sugar for graph patterns, e.g.
– RECORD (e.g. for record variable)
– RECATT

• Query results dereified

Overview
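
The effect of the version keywords can be sketched as a filter over an ordered list of version identifiers. The slides do not define the exact semantics, so this sketch assumes that AFTER and BEFORE are exclusive and BETWEEN is inclusive; the function `select_versions` is hypothetical and does not parse actual DIACHRON query syntax.

```python
def select_versions(versions, clause, *bounds):
    """Return the versions selected by a version clause.

    Assumed semantics (not confirmed by the slides):
    AFTER/BEFORE exclude the named version, BETWEEN includes both ends.
    """
    ordered = sorted(versions)
    if clause == "AT VERSION":
        (v,) = bounds
        return [x for x in ordered if x == v]
    if clause == "AFTER VERSION":
        (v,) = bounds
        return [x for x in ordered if x > v]
    if clause == "BEFORE VERSION":
        (v,) = bounds
        return [x for x in ordered if x < v]
    if clause == "BETWEEN VERSIONS":
        low, high = bounds
        return [x for x in ordered if low <= x <= high]
    raise ValueError(f"unknown clause: {clause}")

versions = [1, 2, 3, 4, 5]
print(select_versions(versions, "AT VERSION", 3))           # [3]
print(select_versions(versions, "AFTER VERSION", 3))        # [4, 5]
print(select_versions(versions, "BETWEEN VERSIONS", 2, 4))  # [2, 3, 4]
```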

Page 17: Digital Preservation in the era of Big Data - The Diachron Platform - Acting on Change 2016

Archiving Strategies

• Versions Materialization (query efficiency, space consuming)

• Changes (delta-based) Materialization (space efficiency, poor query performance, update overhead)

• Versions & Changes Materialization (vast space requirements, update overhead)

1st approach
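
The trade-off between these strategies can be made concrete with a small count of stored items. This is an illustrative sketch only: short strings stand in for triples, and the helper names `make_delta`, `apply_delta` and `reconstruct` are assumptions.

```python
def make_delta(old, new):
    """Delta between two versions as added and deleted items."""
    return {"added": new - old, "deleted": old - new}

def apply_delta(snapshot, delta):
    """Apply a delta to a snapshot to obtain the next version."""
    return (snapshot - delta["deleted"]) | delta["added"]

# Three versions of a toy dataset; strings stand in for triples.
versions = [
    {"a", "b", "c"},
    {"a", "b", "c", "d"},
    {"a", "c", "d", "e"},
]
deltas = [make_delta(old, new) for old, new in zip(versions, versions[1:])]

# Versions materialization: every version stored in full.
full_cost = sum(len(v) for v in versions)

# Delta-based materialization: first version plus all deltas.
delta_cost = len(versions[0]) + sum(
    len(d["added"]) + len(d["deleted"]) for d in deltas
)

def reconstruct(index):
    """Rebuild a version by replaying deltas, which costs query time."""
    snapshot = versions[0]
    for delta in deltas[:index]:
        snapshot = apply_delta(snapshot, delta)
    return snapshot

print(full_cost, delta_cost, full_cost + delta_cost)  # 11 6 17
print(reconstruct(2) == versions[2])                  # True
```

Storing both versions and changes costs the sum of the two, while the delta-only archive pays for its smaller footprint by replaying deltas at query time.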

Page 18: Digital Preservation in the era of Big Data - The Diachron Platform - Acting on Change 2016

Archiving Strategies

• Hybrid Materialization

• Only major versions and all changes (deltas) are stored

• Balance between query performance & storage space

2nd approach
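
A minimal sketch of hybrid materialization follows. The slides do not say how major versions are chosen, so this sketch assumes that every k-th version is a major one; the class `HybridArchive` and its methods are hypothetical. A requested version is rebuilt from the nearest preceding major version by replaying only the deltas in between.

```python
def make_delta(old, new):
    return {"added": new - old, "deleted": old - new}

def apply_delta(snapshot, delta):
    return (snapshot - delta["deleted"]) | delta["added"]

class HybridArchive:
    """Stores full snapshots for major versions only, plus every delta."""

    def __init__(self, major_every):
        self.major_every = major_every  # assumption: every k-th version is major
        self.snapshots = {}             # version index -> full snapshot
        self.deltas = []                # deltas[i] turns version i into i + 1
        self._latest = None             # working copy used to compute the next delta
        self._count = 0

    def add_version(self, items):
        items = frozenset(items)
        if self._count > 0:
            self.deltas.append(make_delta(self._latest, items))
        if self._count % self.major_every == 0:
            self.snapshots[self._count] = items
        self._latest = items
        self._count += 1

    def get_version(self, index):
        if not 0 <= index < self._count:
            raise IndexError(index)
        # Start from the nearest major version at or before the request.
        base = max(i for i in self.snapshots if i <= index)
        snapshot = self.snapshots[base]
        for delta in self.deltas[base:index]:
            snapshot = apply_delta(snapshot, delta)
        return snapshot

history = [{"a"}, {"a", "b"}, {"b", "c"}, {"b", "c", "d"}]
hybrid = HybridArchive(major_every=2)
for version in history:
    hybrid.add_version(version)

print(sorted(hybrid.snapshots))                # [0, 2]
print(hybrid.get_version(3) == history[3])     # True
```

A smaller `major_every` stores more snapshots and shortens the replay, while a larger value saves space at the cost of query time.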

Page 19: Digital Preservation in the era of Big Data - The Diachron Platform - Acting on Change 2016

DIACHRON applications

The Pilots

Page 20: Digital Preservation in the era of Big Data - The Diachron Platform - Acting on Change 2016

Thank you!

www.diachron-fp7.eu

Page 21: Digital Preservation in the era of Big Data - The Diachron Platform - Acting on Change 2016

DIACHRON

• http://wwwdev.ebi.ac.uk/ols/beta/ontologies/go

• http://diachron.imis.athena-innovation.gr:8080/services/ui/

• https://www.youtube.com/channel/UCIzfRLHiuOz4ZgaSytAgP7w

• https://twitter.com/diachron_fp7 (@diachron_fp7)

• https://github.com/diachron

Demos & Outreach