Digital Preservation in the era of Big Data
The DIACHRON Platform, archiving and querying linked open data
George [email protected]
“Athena” Research & Innovation Center
Panel discussion: Preparing for change
Acting on Change Conference: New Approaches and Future Practices in LTDP
London, Dec 2016
Data Explosion
Big data generate significant financial value across sectors.
Data on the Web
Global data space
Connecting data from diverse domains and sources
Primary objects: (description of) “entities”
Links between “entities”
Info granularity: from entire data collections to atomic data
Interrelated, Heterogeneous
Adapted from Chris Bizer, Richard Cyganiak, Tom Heath, available at http://linkeddata.org/guides-and-tutorials
Web of Data
Conceptual Representation
[Figure: entities connected by typed links. Each entity is represented by data in sources such as spreadsheets, HTML, XML, RDFa, semi-structured data, triples and statistical data.]
Web of world things described by Web of data
Data Web Evolution
Explosion of data volume published on web and diversity of sources
Government Scientific Corporate Crowd-sourced
Linked Open Data (LOD) continuously published
Currently data.gouv.fr lists 350,000 datasets; data.gov.uk has 8,200 datasets.
Current Status
2007
2009
2011
Statistics
Number of datasets: 1014
Social web: 51.28%
Government: 18.05%
Publications: 9.47%
Life sciences: 8.19%
User-generated content: 4.73%
Cross-domain: 4.04%
Media: 2.17%
Geographic: 2.07%
Rapidly Evolving Ecosystem
Mid 2014
http://lod-cloud.net/
Big Data Preservation is Challenging
Emerging Application Domains
2020: digital data production > 40 zettabytes = 5,200 Gbytes for every person on the planet
WIRED – 09/10/2014
Effective & efficient techniques to manage the data lifecycle
Appraisal
Integration
Archiving
Producing
Publishing
Linking
DIACHRON Approach
Publishing and preservation of data are performed together.
Archiving and dissemination are synonymous.
DIACHRON Model
Dataset Model
[Figure: the DIACHRON dataset model. Recoverable labels:]
Diachronic Dataset D1: versions D1(t1), D1(t2), D1(t3), …, D1(tm) along a time axis t1, t2, t3, t4, …
Change Set D1(t1,t2) between dataset versions
Spaces: Time-Agnostic Space, Time-Aware Space; Data Space, Curated Information Space
Version contents: RecordSet(tm), Schema(tm)
Record_1 (Record Atts): subject Resource_a (D1,tm), predicate “vcard:hasAddress”, object “6 Artemidos st.”
Record_2 (Record Atts): predicate foaf:name, object “John Doe”
Version D1(tn): Resource_a (D1,tn), Record_i (Record Atts)
Resource Changes (D1,tm,tn); Record and Schema changes
Diachronic Dataset D2: Resource_a (D2,tk)
Diachronic Resource b owl:sameAs Diachronic Resource a
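The dataset model can be illustrated with a minimal Python sketch. It is not DIACHRON code: the Record and ChangeSet classes, the change_set function and the second address value are hypothetical names and data chosen for illustration, and a change set is assumed here to be simply the records added and deleted between two versions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """One record of a dataset version, modelled as a subject/predicate/object triple."""
    subject: str
    predicate: str
    obj: str


@dataclass(frozen=True)
class ChangeSet:
    """Assumed form of a change set: records added and deleted between two versions."""
    added: frozenset
    deleted: frozenset


def change_set(old_version, new_version):
    """Compute the change set that turns old_version into new_version."""
    old, new = frozenset(old_version), frozenset(new_version)
    return ChangeSet(added=new - old, deleted=old - new)


# Two versions of the same diachronic dataset D1 (the new address is made up).
d1_tm = {
    Record("Resource_a", "vcard:hasAddress", "6 Artemidos st."),
    Record("Resource_a", "foaf:name", "John Doe"),
}
d1_tn = {
    Record("Resource_a", "vcard:hasAddress", "12 Example Rd."),
    Record("Resource_a", "foaf:name", "John Doe"),
}

delta = change_set(d1_tm, d1_tn)
```

In this sketch the unchanged foaf:name record appears in neither side of the change set; only the address record is reported as deleted and re-added.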
DIACHRON Query Language
• Queries on archive catalog
  – Lists of datasets
  – Lists of versions of a given dataset
  – Filtered based on temporal, provenance or other metadata criteria
• Queries on Data
  – Retrieve part(s) of a dataset that match certain criteria.
• Longitudinal queries
  – Retrieve part(s) of a dataset across multiple versions.
  – Temporal (version based) criteria can be applied.
• Queries on Changes
  – Retrieve changes between two concurrent versions.
  – Limit results for specific type of changes (schema, data, etc.).
• Mixed Queries on Changes and Data
  – Retrieve datasets or parts of datasets affected by specific changes
Requirements
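The query types listed above can be sketched over a toy in-memory archive. This is only an illustration in Python, not the DIACHRON query language or its API: the archive layout and the function names list_versions, query_data, longitudinal and changes_between are all hypothetical.

```python
# Hypothetical archive: dataset name -> ordered list of (version label, set of triples).
archive = {
    "D1": [
        ("t1", {("Resource_a", "foaf:name", "John Doe")}),
        ("t2", {("Resource_a", "foaf:name", "John Doe"),
                ("Resource_a", "vcard:hasAddress", "6 Artemidos st.")}),
    ],
}


def list_versions(dataset):
    """Catalog query: the version labels of a dataset, in order."""
    return [label for label, _ in archive[dataset]]


def query_data(dataset, version, predicate):
    """Data query: triples of one version that use the given predicate."""
    for label, triples in archive[dataset]:
        if label == version:
            return {t for t in triples if t[1] == predicate}
    raise KeyError(version)


def longitudinal(dataset, subject, predicate):
    """Longitudinal query: objects of (subject, predicate) in every version."""
    return [
        (label, sorted(o for s, p, o in triples if s == subject and p == predicate))
        for label, triples in archive[dataset]
    ]


def changes_between(dataset, old, new):
    """Change query: triples added and deleted between two versions."""
    versions = dict(archive[dataset])
    return {
        "added": versions[new] - versions[old],
        "deleted": versions[old] - versions[new],
    }
```

A mixed query on changes and data could be built by first calling changes_between and then filtering the datasets or triples that the returned changes touch.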
DIACHRON Query Language
• Extension of SPARQL
  – SPARQL queries are valid DIACHRON queries
• DIACHRON graph model
  – basis of the query language, e.g. <FROM DATASET>, <FROM CHANGES>, …
• Specific versions
  – AT VERSION, AFTER VERSION, BEFORE VERSION, BETWEEN VERSIONS
• Syntactic sugar for graph patterns, e.g.
  – RECORD (e.g. for record variable)
  – RECATT
• Query results dereified
Overview
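The version keywords above can be read as filters over an ordered list of versions. The Python sketch below shows one possible reading; it is not taken from the DIACHRON specification. The function name select_versions is hypothetical, and the boundary behaviour (AFTER and BEFORE exclusive, BETWEEN inclusive on both ends) is an assumption.

```python
def select_versions(versions, mode, first, second=None):
    """Return the version labels matched by a version selector.

    versions: labels ordered from oldest to newest.
    Assumed semantics: AFTER/BEFORE exclude the named version,
    BETWEEN includes both end points.
    """
    i = versions.index(first)
    if mode == "AT":
        return [versions[i]]
    if mode == "AFTER":
        return versions[i + 1:]
    if mode == "BEFORE":
        return versions[:i]
    if mode == "BETWEEN":
        j = versions.index(second)
        return versions[i:j + 1]
    raise ValueError(f"unknown selector: {mode}")


labels = ["v1", "v2", "v3", "v4"]
between = select_versions(labels, "BETWEEN", "v2", "v3")
```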
Archiving Strategies
• Versions Materialization (query efficiency, space consuming)
• Changes (delta-based) Materialization (space efficiency, poor query performance, update overhead)
• Versions & Changes Materialization (vast space requirements, update overhead)
1st approach
Archiving Strategies
• Hybrid Materialization
• Only major versions and all changes (deltas) are stored
• Balance between query performance & storage space
2nd approach
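The hybrid strategy can be sketched in Python as follows. This is an illustration under stated assumptions, not the DIACHRON archive implementation: the HybridArchive class is hypothetical, a "major version" is assumed to be every snapshot_every-th commit, and a delta is assumed to be a pair of added and deleted triples.

```python
class HybridArchive:
    """Stores every delta, but a full snapshot only for every k-th version."""

    def __init__(self, snapshot_every=3):
        self.snapshot_every = snapshot_every
        self.snapshots = {}   # version index -> frozenset of triples
        self.deltas = []      # deltas[i] = (added, deleted) from version i-1 to i
        self._latest = frozenset()

    def commit(self, triples):
        """Archive a new version and return its index."""
        new = frozenset(triples)
        index = len(self.deltas)
        self.deltas.append((new - self._latest, self._latest - new))
        if index % self.snapshot_every == 0:   # treat as a major version
            self.snapshots[index] = new
        self._latest = new
        return index

    def checkout(self, index):
        """Rebuild a version from the nearest earlier snapshot plus deltas."""
        if not 0 <= index < len(self.deltas):
            raise IndexError(index)
        base = max(i for i in self.snapshots if i <= index)
        state = set(self.snapshots[base])
        for added, deleted in self.deltas[base + 1:index + 1]:
            state -= deleted
            state |= added
        return frozenset(state)


history = [
    {("a", "p", "1")},
    {("a", "p", "1"), ("b", "p", "2")},
    {("b", "p", "2")},
    {("b", "p", "3")},
    {("b", "p", "3"), ("c", "p", "4")},
]
hybrid = HybridArchive(snapshot_every=3)
for version in history:
    hybrid.commit(version)
```

A smaller snapshot_every moves the sketch towards full version materialization (faster checkout, more space); a larger value moves it towards pure delta storage.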
DIACHRON applications
The Pilots
Thank you!
www.diachron-fp7.eu
DIACHRON
• http://wwwdev.ebi.ac.uk/ols/beta/ontologies/go
• http://diachron.imis.athena-innovation.gr:8080/services/ui/
• https://www.youtube.com/channel/UCIzfRLHiuOz4ZgaSytAgP7w
• https://twitter.com/diachron_fp7 (@diachron_fp7)
• https://github.com/diachron
Demos & Outreach