Versioning for Linked Data: Archiving Systems and Benchmarks

Vassilis Papakonstantinou, Giorgos Flouris, Irini Fundulaki, Giannis Rousakis
Institute of Computer Science – FORTH, Greece

Kostas Stefanidis
University of Tampere, Finland

11/8/16 BLINK 2016: Benchmarking Big Linked Data, 18 October 2016



The Linked Open Data Cloud

[Figure: Linked Open Data cloud diagram. Legend: Media, Government, Geographic, Publications, User-generated, Life sciences, Cross-domain]

The Linked Open Data Cloud

Datasets evolve over time
* Adapted from Suchanek & Weikum tutorial @ SIGMOD 2013

Evolving Dataset: DBpedia
•  Initial Release: January 10, 2007; 9 years ago
•  Stable Release: DBpedia 3.11 a/k/a DBpedia 2015-04 a/k/a DBpedia 2015 A / September 2015
•  Datasets 2.0, 3.0–3.9, 2014 and 2015 including links to external datasets that may be evolving over time
   –  Geographic domain: LinkedGeoData, GeoNames, …
   –  Bibliographic Sources: DBLP
   –  Biological Datasets: Diseasome, DrugBank, …

Versioning

Versioning refers to the ability to store and retrieve different versions of an evolving dataset.

"Versioning is the creation and management of multiple releases of a product, all of which have the same general function but are improved, upgraded or customized."

Versioning Systems: Archiving Strategies
•  Full Materialization
   –  Most widely used approach for storing different versions of datasets
   –  Explicit storage of all versions of a dataset in the archive
   +  Advantages:
      •  No processing cost for storing versions
      •  Cost for retrieving versions or answering queries boils down to cost of query answering
   –  Disadvantages:
      •  Significant space overhead for large datasets that do not change significantly over time
      •  Duplication of unchanged data
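The full-materialization trade-off can be made concrete with a minimal sketch. The class name `FullMaterializationArchive`, its methods, and the toy triples are hypothetical and not taken from any of the systems discussed in these slides; triples are modelled as plain Python tuples rather than real RDF terms.

```python
from typing import FrozenSet, Iterable, List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


class FullMaterializationArchive:
    """Hypothetical archive that keeps every version as a complete snapshot."""

    def __init__(self) -> None:
        self._versions: List[FrozenSet[Triple]] = []

    def commit(self, triples: Iterable[Triple]) -> int:
        """Store a full copy of the dataset and return its version number."""
        self._versions.append(frozenset(triples))
        return len(self._versions) - 1

    def version(self, number: int) -> FrozenSet[Triple]:
        """Retrieving a version is a plain lookup; nothing is reconstructed."""
        return self._versions[number]

    def stored_triples(self) -> int:
        """Total triples held by the archive, counting copies across versions."""
        return sum(len(v) for v in self._versions)


archive = FullMaterializationArchive()
v0 = archive.commit({("alice", "knows", "bob")})
v1 = archive.commit({("alice", "knows", "bob"), ("alice", "knows", "carol")})
```

Only two distinct triples exist in this toy archive, yet `stored_triples()` counts three, because the unchanged triple is stored once per version. That is the duplication listed under the disadvantages.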

Versioning Systems: Archiving Strategies
•  Delta-based
   –  One full version of the dataset must be stored, and for each new version only the set of changes w.r.t. the previous version is stored (delta)
   –  Solutions: forward versus backward deltas
   +  Advantages:
      •  Modest space requirements, since deltas are much smaller than the dataset itself
   –  Disadvantages:
      •  Additional computational cost for computing and storing deltas
      •  Overhead at query time, since on-the-fly construction of one or more versions of the data is required
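A minimal sketch of the forward-delta variant follows, assuming the first version is the one stored in full and each delta is a pair of added and deleted triple sets. `ForwardDeltaArchive` and its methods are hypothetical names chosen for illustration.

```python
from typing import FrozenSet, Iterable, List, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)
Delta = Tuple[FrozenSet[Triple], FrozenSet[Triple]]  # (added, deleted)


class ForwardDeltaArchive:
    """Hypothetical archive: version 0 in full, later versions as forward deltas."""

    def __init__(self, base: Iterable[Triple]) -> None:
        self._base: FrozenSet[Triple] = frozenset(base)
        self._head: Set[Triple] = set(self._base)  # latest version, used for diffing
        self._deltas: List[Delta] = []

    def commit(self, triples: Iterable[Triple]) -> int:
        """Compute and store the delta against the previous version."""
        new = set(triples)
        added = frozenset(new - self._head)
        deleted = frozenset(self._head - new)
        self._deltas.append((added, deleted))
        self._head = new
        return len(self._deltas)

    def delta(self, number: int) -> Delta:
        """Delta leading from version number-1 to version number (number >= 1)."""
        return self._deltas[number - 1]

    def version(self, number: int) -> FrozenSet[Triple]:
        """Rebuild a version on the fly by replaying deltas over the base."""
        current = set(self._base)
        for added, deleted in self._deltas[:number]:
            current -= deleted
            current |= added
        return frozenset(current)


archive = ForwardDeltaArchive({("alice", "knows", "bob")})
archive.commit({("alice", "knows", "bob"), ("alice", "knows", "carol")})  # version 1
archive.commit({("alice", "knows", "carol")})                             # version 2
```

Reading a delta is a direct lookup, while `version(n)` replays `n` deltas. A backward-delta design would instead keep the latest version in full and replay deltas towards the past, which makes the most recent versions the cheap ones to retrieve.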

Versioning Systems: Archiving Strategies
•  Annotated Triples
   –  Each triple is annotated with its temporal validity
      •  2 timestamps that determine when the triple was created and deleted
   +  Advantages:
      •  Saves storage space by avoiding repetitions, as triples are annotated only when they are added or deleted
   –  Disadvantages:
      •  As in delta-based: extra computational cost for all retrieval demands, except for delta materialization
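The annotated-triples idea can be sketched as follows, assuming version numbers serve as timestamps and a triple that is deleted and later re-added gets a second validity interval. `AnnotatedTripleArchive` is a hypothetical name, not the design of x-RDF-3X or R&Wbase.

```python
from typing import Dict, FrozenSet, Iterable, List, Optional, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)
Interval = List[Optional[int]]  # [created, deleted]; deleted stays None while live


class AnnotatedTripleArchive:
    """Hypothetical archive: each triple stored once, with validity intervals."""

    def __init__(self) -> None:
        self._intervals: Dict[Triple, List[Interval]] = {}
        self._head: Set[Triple] = set()
        self._next_version = 0

    def commit(self, triples: Iterable[Triple]) -> int:
        """Annotate only the triples that were added or deleted in this version."""
        new = set(triples)
        number = self._next_version
        for triple in new - self._head:
            self._intervals.setdefault(triple, []).append([number, None])
        for triple in self._head - new:
            self._intervals[triple][-1][1] = number  # close the open interval
        self._head = new
        self._next_version += 1
        return number

    def version(self, number: int) -> FrozenSet[Triple]:
        """A triple belongs to a version if created <= number < deleted."""
        return frozenset(
            triple
            for triple, spans in self._intervals.items()
            if any(c <= number and (d is None or number < d) for c, d in spans)
        )

    def delta(self, number: int) -> Tuple[FrozenSet[Triple], FrozenSet[Triple]]:
        """Triples whose created/deleted timestamp equals the given version."""
        added = frozenset(
            t for t, spans in self._intervals.items() if any(c == number for c, _ in spans)
        )
        deleted = frozenset(
            t for t, spans in self._intervals.items() if any(d == number for _, d in spans)
        )
        return added, deleted


archive = AnnotatedTripleArchive()
archive.commit({("alice", "knows", "bob")})                               # version 0
archive.commit({("alice", "knows", "bob"), ("alice", "knows", "carol")})  # version 1
archive.commit({("alice", "knows", "carol")})                             # version 2
```

Both `version` and `delta` scan the annotations in this sketch. A real store would index the timestamps, at which point delta materialization becomes a direct timestamp lookup while version retrieval still has to filter on validity.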

Versioning Systems: Archiving Strategies
•  Hybrid Strategy
   –  Combination of full-materialization and delta-based strategies, based on a cost model to quantify the overheads
      •  Space and time overhead for storage
      •  Time overhead for query evaluation
   –  Combination of delta-based and annotated-triples approaches to store consecutive deltas
      •  Each triple is annotated with a value that determines its version
   +  Advantages:
      •  Advantages of all archiving strategies may be enjoyed
   –  Disadvantages:
      •  Cost of identifying the appropriate archiving strategy according to the type of changes
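One way to picture the first hybrid variant is an archive that decides per commit whether to store a snapshot or a delta. The slides do not specify the cost model, so the rule below is purely an assumption made for illustration: store a full snapshot once the changes accumulated since the last snapshot exceed a fraction (`max_delta_ratio`, a made-up parameter) of the new version's size. `HybridArchive` and the helper `t` are likewise hypothetical.

```python
from typing import Any, FrozenSet, Iterable, List, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


class HybridArchive:
    """Hypothetical archive mixing full snapshots and forward deltas."""

    def __init__(self, max_delta_ratio: float = 0.5) -> None:
        # Each entry is ("full", snapshot) or ("delta", (added, deleted)).
        self._entries: List[Tuple[str, Any]] = []
        self._head: Set[Triple] = set()
        self._pending = 0  # changed triples accumulated since the last snapshot
        self._max_delta_ratio = max_delta_ratio

    def commit(self, triples: Iterable[Triple]) -> int:
        new = set(triples)
        added, deleted = new - self._head, self._head - new
        pending = self._pending + len(added) + len(deleted)
        # Toy cost rule (an assumption): long or large delta chains trigger a snapshot.
        if not self._entries or pending > self._max_delta_ratio * max(len(new), 1):
            self._entries.append(("full", frozenset(new)))
            self._pending = 0
        else:
            self._entries.append(("delta", (frozenset(added), frozenset(deleted))))
            self._pending = pending
        self._head = new
        return len(self._entries) - 1

    def kind(self, number: int) -> str:
        return self._entries[number][0]

    def version(self, number: int) -> FrozenSet[Triple]:
        """Start from the nearest earlier snapshot and replay the deltas after it."""
        start = number
        while self._entries[start][0] != "full":
            start -= 1
        current = set(self._entries[start][1])
        for _, (added, deleted) in self._entries[start + 1 : number + 1]:
            current -= deleted
            current |= added
        return frozenset(current)


def t(obj: str) -> Triple:
    """Build a toy triple with a fixed subject and predicate."""
    return ("ex:s", "ex:p", obj)


archive = HybridArchive(max_delta_ratio=0.5)
archive.commit({t("a"), t("b"), t("c"), t("d")})          # version 0: first entry is a snapshot
archive.commit({t("a"), t("b"), t("c"), t("d"), t("e")})  # version 1: one change, stored as delta
archive.commit({t("a"), t("e"), t("f"), t("g")})          # version 2: many changes, new snapshot
kinds = [archive.kind(i) for i in range(3)]
```

The threshold bounds how many deltas a retrieval has to replay, at the price of occasional full copies. Choosing it well is the "cost of identifying the appropriate archiving strategy" named on the slide.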

Versioning Systems: Query Types
•  Queries are distinguished based on their
   1.  Focus: Version and Delta queries
       •  Version queries request data stored in versions
          –  Modern: queries requesting information on the latest version
          –  Historical: queries requesting information on past versions
       •  Delta queries request data about the changes between versions
   2.  Versioning solution
       •  Materialization: request a full version or a full delta
       •  Single-version
       •  Cross-version

Query Types: Combination of basic categories

Basic categories (columns): Modern, Historical, Delta, Materialization, Single Version, Cross version

Query types (rows):
Modern Version Materialization        ✔ ✔
Modern Single-Version Queries
Historical Version Materialization
Historical Single-Version Queries
Delta Materialization                 ✔
Single-delta queries                  ✔
Cross-delta queries                   ✔
Cross-version queries                 ✔

Examples:
•  Retrieve full current version
•  Retrieve full past version
•  Retrieve data from 2 consecutive versions
•  Return all friends of a person
•  Return all friends of a person at time t
•  Return all friends that a person obtained between time t and time t+1
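The "friends" examples can be written out over a toy archive. In this sketch the archive is simply a list of full snapshots, `friends` is a hypothetical helper, and the pairing of each example with a query type is an assumption based on the category names rather than something the slide states explicitly.

```python
from typing import FrozenSet, List, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

# Toy archive: index in the list = version number (time t).
versions: List[FrozenSet[Triple]] = [
    frozenset({("alice", "knows", "bob")}),
    frozenset({("alice", "knows", "bob"), ("alice", "knows", "carol")}),
    frozenset({("alice", "knows", "carol"), ("alice", "knows", "dave")}),
]


def friends(snapshot: FrozenSet[Triple], person: str) -> Set[str]:
    """Return all objects of (person, knows, ?) in one snapshot."""
    return {o for s, p, o in snapshot if s == person and p == "knows"}


# Version materialization: the whole current or a whole past version.
modern_materialization = versions[-1]
historical_materialization = versions[0]

# Modern single-version query: all friends of a person (latest version).
modern_friends = friends(versions[-1], "alice")

# Historical single-version query: all friends of a person at time t = 1.
historical_friends = friends(versions[1], "alice")

# Single-delta query: friends obtained between time t = 1 and time t + 1 = 2.
gained_friends = friends(versions[2], "alice") - friends(versions[1], "alice")
```

A delta-based or annotated-triples archive would have to answer the same questions without having every snapshot at hand, which is where the strategies differ in cost.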

Archiving Systems for Linked Data

Columns: System/Framework; Archiving Strategy; SPARQL support; Blank Nodes support; Committing, Merging, Branching

x-RDF-3X [NW10]          Annotated Triples       ✔ − −
SemVersion [VG06]        Full Materialization    − ✔
Cassidy et al. [CB07]    Delta-based             − − ✔
R&Wbase [SC+13]          Annotated Triples       ✔
R43ples [GH+14]          Delta-based             ✔ − ✔
TailR [MK+15]            Hybrid approach         − − −
Im et al. [MP16]         Delta-based             − − −
Memento [SN+09]          Full Materialization    − − −

Benchmarking RDF Versioning Systems
•  A benchmark is a
   –  set of software tools,
   –  performance metrics, and
   –  set of clear execution rules
•  Standardized application scenario that serves as a basis for testing, evaluating and comparing systems
•  Clear set of factors that are measured and the conditions under which they should be measured
•  Leads to improvements:
   –  vendors can improve their technology
   –  current benchmark design can be improved to cover new necessities and application domains

Importance of Benchmarking
•  Help
   –  Designers and Developers to improve their tools by assessing their usefulness and constantly evaluating their performance
   –  Users to compare the different available tools and evaluate suitability for their needs
   –  Researchers to compare themselves to others
•  Exist for a long term
   –  To allow adequate measurements of systems
   –  Evaluation in the field

Benchmarks: Design Principles

Principle                  Comment
Relevant                   The benchmark is meaningful for the target domain
Understandable             The benchmark is easy to understand and use
Good Metrics               The metrics defined by the benchmark are linear, orthogonal and monotonic
Configurable & Scalable    The benchmark is applicable to a broad spectrum of hardware and software configurations
Coverage                   The benchmark workload does not oversimplify the typical environment
Acceptance                 The benchmark is recognized as relevant by the majority of vendors and users
Open & Accessible          The benchmark should be available to systems under test
Unbiased                   The benchmark should be fair to all systems

Benchmarking RDF Versioning Systems
•  A Versioning Benchmark should test how different systems behave with respect to
   –  the space required by the multi-version repository and
   –  the efficiency of retrieving different versions and answering queries across versions
•  Versioning Benchmarks
   –  BEAR [GU+15]
   –  EvoGen [MP16a]

BEAR [GU+15] (1)
•  Characteristics
   –  Agnostic w.r.t. archiving strategies
   –  Simple queries becoming more complex over time
   –  Extensible, able to incorporate additional features
•  Dataset features
   –  Data dynamicity
      •  Number of changes between versions, quantified through change ratio and data growth
   –  Data static core
      •  Triples that do not change across versions
   –  Total version-oblivious triples
      •  Different triples that exist in an archive independently of the version
   –  RDF vocabulary
      •  RDF Resources (subjects and objects) of RDF triples
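The dataset features above can be computed on a toy archive as follows. The slide does not give formulas, so the definitions here are assumptions: change ratio is taken as the number of added plus deleted triples divided by the size of the union of the two versions, and data growth as the size ratio of the two versions. All function names are hypothetical.

```python
from typing import FrozenSet, List, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def change_ratio(old: FrozenSet[Triple], new: FrozenSet[Triple]) -> float:
    """Assumed definition: |added ∪ deleted| / |old ∪ new|."""
    changed = (new - old) | (old - new)
    return len(changed) / len(old | new)


def data_growth(old: FrozenSet[Triple], new: FrozenSet[Triple]) -> float:
    """Assumed definition: |new| / |old|."""
    return len(new) / len(old)


def static_core(versions: List[FrozenSet[Triple]]) -> FrozenSet[Triple]:
    """Triples present in every version."""
    return frozenset.intersection(*versions)


def version_oblivious(versions: List[FrozenSet[Triple]]) -> FrozenSet[Triple]:
    """Distinct triples in the archive, regardless of version."""
    return frozenset.union(*versions)


def vocabulary(versions: List[FrozenSet[Triple]]) -> Set[str]:
    """Subjects and objects appearing in any version, as listed on the slide."""
    return {term for v in versions for s, _, o in v for term in (s, o)}


v0 = frozenset({("alice", "knows", "bob"), ("alice", "knows", "carol")})
v1 = frozenset({("alice", "knows", "bob"), ("alice", "knows", "dave")})
v2 = frozenset({("alice", "knows", "bob"), ("alice", "knows", "dave"), ("alice", "knows", "erin")})
archive = [v0, v1, v2]

ratio_01 = change_ratio(v0, v1)   # 2 changed triples over a union of 3
growth_12 = data_growth(v1, v2)   # 3 triples versus 2
core = static_core(archive)
distinct = version_oblivious(archive)
terms = vocabulary(archive)
```

A large static core with a low change ratio favours delta-based or annotated storage, while a high change ratio narrows the gap to full materialization.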

BEAR [GU+15] (2)
•  Query Generation
   –  Result Cardinality and Query Selectivity are considered
   –  Query Categories
      1.  Version Materialization: Retrieval of a version
      2.  Delta Materialization: Results of a query between versions
      3.  Change Checking: Boolean query that asks whether there are changes between versions
      4.  Cross-version join: Join between triple patterns accessing data in different versions
      5.  Change materialization: Point where a query is evaluated differently
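The five query categories can be sketched over a list of snapshots and a single triple pattern. This is an illustrative reading of the categories, not BEAR's implementation: patterns use `None` as a wildcard, the function names are hypothetical, and the cross-version join condition (object of the first pattern equals subject of the second) is an assumption chosen for the example.

```python
from typing import FrozenSet, List, Optional, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)
Pattern = Tuple[Optional[str], Optional[str], Optional[str]]  # None = variable


def match(snapshot: FrozenSet[Triple], pattern: Pattern) -> FrozenSet[Triple]:
    """Evaluate one triple pattern against one snapshot."""
    return frozenset(
        t for t in snapshot if all(p is None or p == v for p, v in zip(pattern, t))
    )


def version_materialization(versions, pattern, i):
    """1. Result of the pattern in version i."""
    return match(versions[i], pattern)


def delta_materialization(versions, pattern, i, j):
    """2. Results added and deleted between versions i and j."""
    old, new = match(versions[i], pattern), match(versions[j], pattern)
    return new - old, old - new


def change_checking(versions, pattern, i, j) -> bool:
    """3. Does the result differ between versions i and j?"""
    added, deleted = delta_materialization(versions, pattern, i, j)
    return bool(added or deleted)


def cross_version_join(versions, p1, i, p2, j):
    """4. Join p1 in version i with p2 in version j (object of p1 = subject of p2)."""
    return frozenset(
        (a, b) for a in match(versions[i], p1) for b in match(versions[j], p2) if a[2] == b[0]
    )


def change_materialization(versions, pattern) -> List[int]:
    """5. Versions at which the result differs from the previous version."""
    return [
        i
        for i in range(1, len(versions))
        if match(versions[i], pattern) != match(versions[i - 1], pattern)
    ]


v0 = frozenset({("alice", "knows", "bob"), ("bob", "worksAt", "acme")})
v1 = v0 | {("alice", "knows", "carol")}
v2 = frozenset({("alice", "knows", "carol"), ("bob", "worksAt", "initech")})
versions = [v0, v1, v2]
knows: Pattern = ("alice", "knows", None)
works: Pattern = (None, "worksAt", None)
```

For example, `change_materialization(versions, works)` reports only version 2, the point at which bob's employer changes, while the `knows` pattern changes at both version 1 and version 2.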

EvoGen [MP16a] (1)
•  EvoGen Benchmark Suite
•  Based on LUBM generator extended with 10 classes and 19 properties to support schema evolution
•  Extensible and Highly Configurable Benchmark Generator
   –  Data Generation
      •  #generated versions, #changes between versions
   –  Query Generation
      •  takes into account the data generation configuration

EvoGen [MP16a] (2)
•  Data Generation
   –  Shift: Shows how a dataset evolves w.r.t. size
      •  Distinguished into positive/negative shifts (increasing/decreasing size)
   –  Monotonicity: Boolean value that determines whether shifts change a dataset monotonically
•  Schema Evolution of a dataset
   –  Ontology Evolution
      •  Represents the change of the ontology w.r.t. #classes
   –  Schema Variation Parameters
      •  Ranges from 0 to 1 and quantifies the different characteristic sets w.r.t. the number of possible characteristic sets created for each class
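The shift and monotonicity parameters can be illustrated with a small calculation. The slide gives no formula, so this sketch assumes that shift is the relative change in dataset size between two versions and that a sequence is monotonic when all consecutive shifts have the same sign; `shift` and `is_monotonic` are hypothetical helpers, not part of EvoGen's interface.

```python
from typing import List


def shift(size_i: int, size_j: int) -> float:
    """Assumed definition: relative size change from version i to version j."""
    return (size_j - size_i) / size_i


def is_monotonic(sizes: List[int]) -> bool:
    """True if the dataset only grows or only shrinks across consecutive versions."""
    shifts = [shift(a, b) for a, b in zip(sizes, sizes[1:])]
    return all(s >= 0 for s in shifts) or all(s <= 0 for s in shifts)


growing = [1000, 1200, 1500]   # triple counts per version, positive shifts only
mixed = [1000, 1200, 900]      # a positive shift followed by a negative one

positive_shift = shift(1000, 1200)
negative_shift = shift(1200, 900)
```

Under these assumptions `growing` is a monotonic evolution, while `mixed` is not, because its second step shrinks the dataset after it grew.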

EvoGen [MP16a] (3)
•  Query Generation Parameters
   –  Retrieval of a diachronic dataset
      •  obtain all versions of a dataset
   –  Retrieval of a specific version
   –  Snapshot queries
      •  queries affecting a single version
   –  Temporal Queries
      •  queries that retrieve the timeline of subgraphs through multiple versions
   –  Queries on Changes
      •  queries on changes

Versioning Benchmarks: Comparison
•  EvoGen is a benchmark generator, while BEAR is a benchmark over real versioned data
•  BEAR is more "complete" regarding query support when compared to EvoGen

Columns: EvoGen, BEAR

Modern Version Materialization        ✔ ✔
Modern Single-Version Queries
Historical Version Materialization    ✔ ✔
Historical Single-Version Queries     ✔ ✔
Delta Materialization                 ✔
Single-delta queries                  ✔ ✔
Cross-delta queries
Cross-version queries                 ✔ ✔

Versioning Benchmarks: Comparison

Columns: EvoGen, BEAR

Modern Version Materialization        ✔ ✔
Modern Single-Version Queries
Historical Version Materialization    ✔ ✔
Historical Single-Version Queries     ✔ ✔
Delta Materialization                 ✔
Single-delta queries                  ✔ ✔
Cross-delta queries
Cross-version queries                 ✔

Questions?


References
•  [NW10] Thomas Neumann and Gerhard Weikum. x-RDF-3X: fast querying, high update rates, and consistency for RDF databases. VLDB Endowment, 3(1-2):256–263, 2010.
•  [VG06] Max Völkel and Tudor Groza. SemVersion: An RDF-based ontology versioning system. In IADIS Int’l Conf. WWW/Internet, volume 2006, page 44, 2006.
•  [CB07] Steve Cassidy and James Ballantine. Version Control for RDF Triple Stores. ICSOFT (ISDM/EHST/DC), 7:5–12, 2007.
•  [SC+13] Miel Vander Sande, Pieter Colpaert, et al. R&Wbase: git for triples. In LDOW, 2013.
•  [GH+14] Markus Graube, Stephan Hensel, et al. R43ples: Revisions for triples. LDQ, 2014.
•  [MK+15] Paul Meinhardt, Magnus Knuth, et al. TailR: a platform for preserving history on the web of data. In Int’l Conf. on Semantic Systems, pages 57–64. ACM, 2015.
•  [MP16] Marios Meimaris, George Papastefanatos, et al. A Query Language for Multiversion Data Web Archives. In arXiv:1504.01891, 2016.
•  [SN+09] Herbert Van de Sompel, Michael L Nelson, et al. Memento: Time travel for the web. arXiv preprint arXiv:0911.1112, 2009.
•  [GU+15] Javier David Fernandez Garcia, Jürgen Umbrich, et al. BEAR: Benchmarking the Efficiency of RDF Archiving. Technical report, Department für Informationsverarbeitung und Prozessmanagement, WU Vienna University of Economics and Business, 2015.
•  [MP16a] Marios Meimaris and George Papastefanatos. The EvoGen Benchmark Suite for Evolving RDF Data. MeDAW, 2016.

This work was supported by grants from the EU H2020 Framework Programme provided for the project HOBBIT (GA no. 688227).