Versioning for Linked Data: Archiving Systems and Benchmarks
Vassilis Papakonstantinou, Giorgos Flouris, Irini Fundulaki, Giannis Rousakis
Institute of Computer Science – FORTH, Greece
Kostas Stefanidis
University of Tampere, Finland
11/8/16 BLINK 2016: Benchmarking Big Linked Data, 18 October 2016
The Linked Open Data Cloud
[Figure: the Linked Open Data cloud, with datasets grouped by domain: Media, Government, Geographic, Publications, User-generated, Life sciences, Cross-domain]
* Adapted from Suchanek & Weikum tutorial @ SIGMOD 2013
Datasets evolve over time
Evolving Dataset: DBpedia
• Initial Release: January 10, 2007; 9 years ago
• Stable Release: DBpedia 3.11 a/k/a DBpedia 2015-04 a/k/a DBpedia 2015A / September 2015
• Datasets 2.0, 3.0–3.9, 2014 and 2015, including links to external datasets that may be evolving over time
  – Geographic domain: LinkedGeoData, GeoNames, …
  – Bibliographic Sources: DBLP
  – Biological Datasets: Diseasome, DrugBank, …
Versioning
Versioning refers to the ability to store and retrieve different versions of an evolving dataset.
"Versioning is the creation and management of multiple releases of a product, all of which have the same general function but are improved, upgraded or customized."
Versioning Systems: Archiving Strategies
• Full Materialization
  – Most widely used approach for storing different versions of datasets
  – Explicit storage of all versions of a dataset in the archive
  + Advantages:
    • No processing cost for storing versions
    • Cost for retrieving versions or answering queries boils down to the cost of query answering
  – Disadvantages:
    • Significant space overhead for large datasets that do not change significantly over time
    • Duplication of unchanged data
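As a rough illustration of full materialization, the sketch below keeps every version as a complete, independent snapshot. The class name `FullMaterializationArchive`, its methods, and the toy triples are hypothetical and chosen only for this example; they do not correspond to any of the systems discussed.

```python
from typing import Dict, FrozenSet, Iterable, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


class FullMaterializationArchive:
    """Hypothetical archive that stores each version as a full snapshot."""

    def __init__(self) -> None:
        self._versions: Dict[int, FrozenSet[Triple]] = {}

    def commit(self, triples: Iterable[Triple]) -> int:
        # Storing a version needs no processing beyond copying the triples.
        version_id = len(self._versions)
        self._versions[version_id] = frozenset(triples)
        return version_id

    def materialize(self, version_id: int) -> FrozenSet[Triple]:
        # Retrieval is a plain lookup: no reconstruction is required.
        return self._versions[version_id]

    def stored_triples(self) -> int:
        # Unchanged triples are counted once per version (the space overhead).
        return sum(len(v) for v in self._versions.values())


archive = FullMaterializationArchive()
v0 = archive.commit({("alice", "knows", "bob")})
v1 = archive.commit({("alice", "knows", "bob"), ("alice", "knows", "carol")})
```

Here the unchanged triple `("alice", "knows", "bob")` is stored twice, which is the duplication listed among the disadvantages.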
Versioning Systems: Archiving Strategies
• Delta-based
  – One full version of the dataset must be stored, and for each new version only the set of changes w.r.t. the previous version is stored (the delta)
  – Solutions: forward versus backward deltas
  + Advantages:
    • Modest space requirements, since deltas are much smaller than the dataset itself
  – Disadvantages:
    • Additional computational cost for computing and storing deltas
    • Overhead at query time, since on-the-fly construction of one or more versions of the data is required
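A minimal sketch of the delta-based strategy with forward deltas, assuming a delta is simply a pair of added and deleted triple sets. The class `ForwardDeltaArchive` and its methods are hypothetical names for illustration, not the design of any cited system.

```python
from typing import Iterable, List, Set, Tuple

Triple = Tuple[str, str, str]
Delta = Tuple[Set[Triple], Set[Triple]]  # (added, deleted)


class ForwardDeltaArchive:
    """Hypothetical archive: one full base version plus forward deltas."""

    def __init__(self, base: Iterable[Triple]) -> None:
        self._base: Set[Triple] = set(base)  # version 0, stored in full
        self._deltas: List[Delta] = []       # delta k turns version k into k + 1

    def commit(self, triples: Iterable[Triple]) -> int:
        new = set(triples)
        # Computing the delta requires the latest version, rebuilt on the fly.
        latest = self.materialize(len(self._deltas))
        self._deltas.append((new - latest, latest - new))
        return len(self._deltas)

    def materialize(self, version_id: int) -> Set[Triple]:
        # Replay deltas on top of the base: this is the query-time overhead.
        current = set(self._base)
        for added, deleted in self._deltas[:version_id]:
            current = (current - deleted) | added
        return current

    def stored_triples(self) -> int:
        return len(self._base) + sum(len(a) + len(d) for a, d in self._deltas)


deltas = ForwardDeltaArchive({("alice", "knows", "bob")})
d1 = deltas.commit({("alice", "knows", "bob"), ("alice", "knows", "carol")})
d2 = deltas.commit({("alice", "knows", "carol")})
```

A backward-delta variant would instead keep the latest version in full and store the changes needed to step back to older versions, which makes the most recent version the cheapest to retrieve.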
Versioning Systems: Archiving Strategies
• Annotated Triples
  – Each triple is annotated with its temporal validity
    • 2 timestamps that determine when the triple was created and deleted
  + Advantages:
    • Saves storage space by avoiding repetitions, as triples are annotated only when they are added or deleted
  – Disadvantages:
    • As in delta-based: extra computational cost for all retrieval demands, except for delta materialization
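The annotated-triples idea can be sketched as follows, assuming version numbers play the role of timestamps and each triple carries a list of `[created, deleted)` intervals so that it can be re-added after a deletion. `AnnotatedTripleArchive` and its methods are hypothetical names used only for this example.

```python
from typing import Dict, Iterable, List, Optional, Set, Tuple

Triple = Tuple[str, str, str]
Interval = List[Optional[int]]  # [created, deleted]; deleted is None while live


class AnnotatedTripleArchive:
    """Hypothetical archive: each triple is stored once with validity intervals."""

    def __init__(self) -> None:
        self._intervals: Dict[Triple, List[Interval]] = {}
        self._next_version = 0

    def commit(self, triples: Iterable[Triple]) -> int:
        version, new = self._next_version, set(triples)
        # Close the open interval of live triples missing from the new version.
        for triple, intervals in self._intervals.items():
            if intervals[-1][1] is None and triple not in new:
                intervals[-1][1] = version
        # Open an interval for triples that are not currently live.
        for triple in new:
            intervals = self._intervals.setdefault(triple, [])
            if not intervals or intervals[-1][1] is not None:
                intervals.append([version, None])
        self._next_version += 1
        return version

    def materialize(self, version: int) -> Set[Triple]:
        # Version retrieval must filter every annotated triple.
        return {t for t, ivs in self._intervals.items()
                if any(c <= version and (d is None or version < d) for c, d in ivs)}

    def delta(self, version: int) -> Tuple[Set[Triple], Set[Triple]]:
        # Delta materialization reads the annotations directly.
        added = {t for t, ivs in self._intervals.items() if any(c == version for c, _ in ivs)}
        deleted = {t for t, ivs in self._intervals.items() if any(d == version for _, d in ivs)}
        return added, deleted


annotated = AnnotatedTripleArchive()
annotated.commit({("alice", "knows", "bob")})
annotated.commit({("alice", "knows", "bob"), ("alice", "knows", "carol")})
annotated.commit({("alice", "knows", "carol")})
```

In this sketch `delta(2)` is answered from the timestamps alone, while `materialize(1)` has to test every triple's intervals, mirroring the trade-off stated on the slide.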
Versioning Systems: Archiving Strategies
• Hybrid Strategy
  – Combination of full-materialization and delta-based strategies, based on a cost model to quantify the overheads
    • Space and time overhead for storage
    • Time overhead for query evaluation
  – Combination of delta-based and annotated triples approaches to store consecutive deltas
    • Each triple is annotated with a value that determines its version
  + Advantages:
    • Advantages of all archiving strategies may be enjoyed
  – Disadvantages:
    • Cost for identifying the appropriate archiving strategy according to the type of changes
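One way to picture the first hybrid variant is an archive that stores a delta while changes are small and falls back to a full snapshot otherwise. The cost model below (a single threshold, `max_delta_ratio`, on the changes accumulated since the last snapshot) is a deliberately simple assumption made for this sketch; the slides do not specify the cost model, and `HybridArchive` is a hypothetical name.

```python
from typing import Iterable, List, Set, Tuple

Triple = Tuple[str, str, str]


class HybridArchive:
    """Hypothetical archive mixing full snapshots and forward deltas."""

    def __init__(self, max_delta_ratio: float = 0.5) -> None:
        self._max_delta_ratio = max_delta_ratio  # assumed toy cost threshold
        self._entries: List[tuple] = []  # ("full", triples) or ("delta", added, deleted)

    def _pending_changes(self) -> int:
        # Number of changed triples stored since the most recent full snapshot.
        total = 0
        for entry in reversed(self._entries):
            if entry[0] == "full":
                break
            total += len(entry[1]) + len(entry[2])
        return total

    def commit(self, triples: Iterable[Triple]) -> int:
        new = frozenset(triples)
        if not self._entries:
            self._entries.append(("full", new))
            return 0
        previous = self.materialize(len(self._entries) - 1)
        added, deleted = new - previous, previous - new
        pending = self._pending_changes() + len(added) + len(deleted)
        if pending > self._max_delta_ratio * max(len(new), 1):
            self._entries.append(("full", new))  # replaying deltas would cost too much
        else:
            self._entries.append(("delta", added, deleted))
        return len(self._entries) - 1

    def materialize(self, version_id: int) -> Set[Triple]:
        # Walk back to the nearest full snapshot, then replay deltas forward.
        start = version_id
        while self._entries[start][0] != "full":
            start -= 1
        current = set(self._entries[start][1])
        for _, added, deleted in self._entries[start + 1:version_id + 1]:
            current = (current - deleted) | added
        return current

    def kind(self, version_id: int) -> str:
        return self._entries[version_id][0]


people = ["bob", "carol", "dave", "erin"]
hybrid = HybridArchive(max_delta_ratio=0.5)
h0 = hybrid.commit({("alice", "knows", p) for p in people})
h1 = hybrid.commit({("alice", "knows", p) for p in people + ["frank"]})
h2 = hybrid.commit({("alice", "knows", "xena"), ("alice", "knows", "yuri")})
```

With these inputs the small change is kept as a delta and the large rewrite as a new snapshot, so reconstruction never replays more than a bounded amount of change.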
Versioning Systems: Query Types
• Queries are distinguished based on their
  1. Focus: Version and Delta queries
    • Version queries request data stored in versions
      – Modern: queries requesting information on the latest version
      – Historical: queries requesting information on past versions
    • Delta queries require data of the changes between versions
  2. Versioning solution
    • Materialization: request a full version or a full delta
    • Single-version
    • Cross-version
Query Types: Combination of basic categories

Query type                          Modern  Historical  Delta  Materialization  Single-version  Cross-version
Modern Version Materialization      ✔       −           −      ✔                −               −
Modern Single-Version Queries       ✔       −           −      −                ✔               −
Historical Version Materialization  −       ✔           −      ✔                −               −
Historical Single-Version Queries   −       ✔           −      −                ✔               −
Delta Materialization               −       −           ✔      ✔                −               −
Single-delta queries                −       −           ✔      −                ✔               −
Cross-delta queries                 −       −           ✔      −                −               ✔
Cross-version queries               ✔       ✔           −      −                −               ✔

Examples:
• Modern Version Materialization: retrieve the full current version
• Historical Version Materialization: retrieve a full past version
• Delta Materialization: retrieve data from 2 consecutive versions
• Modern Single-Version Queries: return all friends of a person
• Historical Single-Version Queries: return all friends of a person at time t
• Single-delta queries: return all friends that a person obtained between time t and time t+1
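The query types above can be made concrete with a small sketch over an in-memory list of versions, where the list index stands in for time. The `knows` predicate, the helper names, and the toy data are assumptions made for this illustration only.

```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]
Version = Set[Triple]


def friends(triples: Version, person: str) -> Set[str]:
    """Everyone `person` knows within one set of triples."""
    return {o for s, p, o in triples if s == person and p == "knows"}


def materialize_delta(versions: List[Version], t: int) -> Tuple[Version, Version]:
    """(added, deleted) triples between version t and version t + 1."""
    return versions[t + 1] - versions[t], versions[t] - versions[t + 1]


versions: List[Version] = [
    {("alice", "knows", "bob")},
    {("alice", "knows", "bob"), ("alice", "knows", "carol")},
    {("alice", "knows", "carol"), ("alice", "knows", "dave")},
]

current_version = versions[-1]                    # modern version materialization
past_version = versions[0]                        # historical version materialization
friends_now = friends(versions[-1], "alice")      # modern single-version query
friends_at_1 = friends(versions[1], "alice")      # historical single-version query
added, deleted = materialize_delta(versions, 0)   # delta materialization
new_friends = friends(added, "alice")             # single-delta query
# Cross-delta query: friends gained in any step of the history.
gained_ever = set().union(*(friends(materialize_delta(versions, t)[0], "alice")
                            for t in range(len(versions) - 1)))
# Cross-version query: friends present in both version 0 and version 1.
kept = friends(versions[0], "alice") & friends(versions[1], "alice")
```

Each variable corresponds to one row of the table; only the materialization queries return whole versions or deltas, while the others evaluate a query on top of them.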
Archiving Systems for Linked Data

System/Framework       Archiving Strategy    SPARQL support  Blank Nodes support  Committing, Merging, Branching
x-RDF-3X [NW10]        Annotated Triples     ✔               −                    −
SemVersion [VG06]      Full Materialization  −               ✔                    ✔
Cassidy et al. [CB07]  Delta-based           −               −                    ✔
R&Wbase [SC+13]        Annotated Triples     ✔               ✔                    ✔
R43ples [GH+14]        Delta-based           ✔               −                    ✔
TailR [MK+15]          Hybrid approach       −               −                    −
Im et al. [MP16]       Delta-based           −               −                    −
Memento [SN+09]        Full Materialization  −               −                    −
Benchmarking RDF Versioning Systems
• A benchmark is a
  – set of software tools,
  – performance metrics, and
  – set of clear execution rules
• Standardized application scenario that serves as a basis for testing and comparative evaluation of systems
• Clear set of factors that are measured and the conditions under which they should be measured
• Leads to improvements:
  – vendors can improve their technology
  – current benchmark design can be improved to cover new necessities and application domains
Importance of Benchmarking
• Help
  – Designers and developers to improve their tools by assessing their usefulness and constantly evaluating their performance
  – Users to compare the different available tools and evaluate suitability for their needs
  – Researchers to compare themselves to others
• Exist for the long term
  – To allow adequate measurements of systems
  – Evaluation in the field
Benchmarks: Design Principles

Principle                Comment
Relevant                 The benchmark is meaningful for the target domain
Understandable           The benchmark is easy to understand and use
Good Metrics             The metrics defined by the benchmark are linear, orthogonal and monotonic
Configurable & Scalable  The benchmark is applicable to a broad spectrum of hardware and software configurations
Coverage                 The benchmark workload does not oversimplify the typical environment
Acceptance               The benchmark is recognized as relevant by the majority of vendors and users
Open & Accessible        The benchmark should be available to systems under test
Unbiased                 The benchmark should be fair to all systems
Benchmarking RDF Versioning Systems
• A Versioning Benchmark should test how different systems behave with respect to
  – the space required by the multi-version repository and
  – the efficiency of retrieving different versions and answering queries across versions
• Versioning Benchmarks
  – BEAR [GU+15]
  – EvoGen [MP16a]
BEAR [GU+15] (1)
• Characteristics
  – Agnostic w.r.t. archiving strategies
  – Simple queries becoming more complex over time
  – Extensible, able to incorporate additional features
• Dataset features
  – Data dynamicity
    • Number of changes between versions, quantified through change ratio and data growth
  – Data static core
    • Triples that do not change across versions
  – Total version-oblivious triples
    • Different triples that exist in an archive independently of the version
  – RDF vocabulary
    • RDF Resources (subjects and objects) of RDF triples
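The dataset features listed above can be computed over a list of versions as in the sketch below. The slide names change ratio and data growth without giving formulas, so the definitions used here (changed triples over the union of two versions, and the size ratio of two versions) are assumptions for illustration; the function names are hypothetical as well.

```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]
Version = Set[Triple]


def change_ratio(v_i: Version, v_j: Version) -> float:
    """Assumed definition: changed triples relative to the union of both versions."""
    changed = len(v_j - v_i) + len(v_i - v_j)
    return changed / len(v_i | v_j)


def data_growth(v_i: Version, v_j: Version) -> float:
    """Assumed definition: size of the later version relative to the earlier one."""
    return len(v_j) / len(v_i)


def static_core(versions: List[Version]) -> Version:
    """Triples present in every version."""
    return set(versions[0]).intersection(*versions[1:])


def version_oblivious_triples(versions: List[Version]) -> Version:
    """Distinct triples in the archive, regardless of version."""
    return set().union(*versions)


def vocabulary(versions: List[Version]) -> Set[str]:
    """Subjects and objects appearing anywhere in the archive."""
    triples = version_oblivious_triples(versions)
    return {s for s, _, _ in triples} | {o for _, _, o in triples}


bear_v0 = {("alice", "knows", "bob"), ("alice", "knows", "carol")}
bear_v1 = {("alice", "knows", "bob"), ("alice", "knows", "dave"), ("alice", "knows", "erin")}
bear_versions = [bear_v0, bear_v1]
```

For the two toy versions, one triple is deleted and two are added out of four distinct triples, so the change ratio under this assumed definition is 3/4.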
BEAR [GU+15] (2)
• Query Generation
  – Result Cardinality and Query Selectivity are considered
  – Query Categories
    1. Version Materialization: Retrieval of a version
    2. Delta Materialization: Results of a query between versions
    3. Change Checking: Boolean query that asks whether there are changes between versions
    4. Cross-version join: Join between triple patterns accessing data in different versions
    5. Change materialization: Point where a query is evaluated differently
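The five query categories can be sketched over a list of in-memory versions, assuming queries are single triple patterns with `None` as a wildcard. All function names are hypothetical, and the cross-version join is simplified to a join on the subject position; this is not BEAR's implementation.

```python
from typing import List, Optional, Set, Tuple

Triple = Tuple[str, str, str]
Pattern = Tuple[Optional[str], Optional[str], Optional[str]]  # None is a wildcard
Version = Set[Triple]


def match(version: Version, pattern: Pattern) -> Set[Triple]:
    return {t for t in version if all(p is None or p == x for p, x in zip(pattern, t))}


def version_materialization(versions: List[Version], pattern: Pattern, i: int) -> Set[Triple]:
    return match(versions[i], pattern)


def delta_materialization(versions, pattern, i, j):
    before, after = match(versions[i], pattern), match(versions[j], pattern)
    return after - before, before - after  # (added, deleted) results


def change_checking(versions, pattern, i, j) -> bool:
    added, deleted = delta_materialization(versions, pattern, i, j)
    return bool(added or deleted)


def cross_version_join(versions, pattern_i, i, pattern_j, j) -> Set[str]:
    # Simplification: join the two patterns on their subject.
    subjects_i = {s for s, _, _ in match(versions[i], pattern_i)}
    subjects_j = {s for s, _, _ in match(versions[j], pattern_j)}
    return subjects_i & subjects_j


def change_materialization(versions, pattern) -> List[int]:
    # Versions at which the query result differs from the previous version.
    return [i for i in range(1, len(versions))
            if match(versions[i], pattern) != match(versions[i - 1], pattern)]


history: List[Version] = [
    {("alice", "knows", "bob")},
    {("alice", "knows", "bob"), ("alice", "knows", "carol")},
    {("alice", "knows", "bob"), ("alice", "knows", "carol")},
]
alice_knows: Pattern = ("alice", "knows", None)
```

Change checking and change materialization are both derived from delta materialization here, which keeps the sketch short but is only one possible way to evaluate them.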
EvoGen [MP16a] (1)
• EvoGen Benchmark Suite
• Based on the LUBM generator, extended with 10 classes and 19 properties to support schema evolution
• Extensible and Highly Configurable Benchmark Generator
  – Data Generation
    • # generated versions, # changes between versions
  – Query Generation
    • takes into account the data generation configuration
EvoGen [MP16a] (2)
• Data Generation
  – Shift: Shows how a dataset evolves w.r.t. size
    • Distinguished into positive/negative shifts (increasing/decreasing size)
  – Monotonicity: Boolean value that determines whether shifts change a dataset monotonically
• Schema Evolution of a dataset
  – Ontology Evolution
    • Represents the change of the ontology w.r.t. # classes
  – Schema Variation Parameters
    • Ranges from 0 to 1 and quantifies the different characteristic sets w.r.t. the number of possible characteristic sets created for each class
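To make shift and monotonicity concrete, the sketch below treats shift as the relative size change between two versions and monotonicity as the absence of a sign flip across consecutive versions. Both definitions are assumptions read off the slide's wording, not formulas taken from EvoGen, and the function names are hypothetical.

```python
from typing import List


def shift(size_i: int, size_j: int) -> float:
    """Assumed definition: relative size change from version i to version j."""
    return (size_j - size_i) / size_i


def is_monotonic(sizes: List[int]) -> bool:
    """True if the dataset only grows or only shrinks across consecutive versions."""
    steps = [b - a for a, b in zip(sizes, sizes[1:])]
    return all(s >= 0 for s in steps) or all(s <= 0 for s in steps)


growing = [1000, 1200, 1500]      # triple counts per version (toy numbers)
fluctuating = [1000, 1200, 900]
```

Under these assumptions `shift(1000, 1200)` is a positive shift of 20% and `shift(1000, 800)` a negative one; the `growing` series is monotonic, the `fluctuating` one is not.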
EvoGen [MP16a] (3)
• Query Generation Parameters
  – Retrieval of a diachronic dataset
    • obtain all versions of a dataset
  – Retrieval of a specific version
  – Snapshot queries
    • queries affecting a single version
  – Temporal Queries
    • queries that retrieve the timeline of subgraphs through multiple versions
  – Queries on Changes
    • queries on changes
Versioning Benchmarks: Comparison
• EvoGen is a benchmark generator, while BEAR is a benchmark over real versioned data
• BEAR is more “complete” regarding query support when compared to EvoGen

Query type                          EvoGen  BEAR
Modern Version Materialization      ✔       ✔
Modern Single-Version Queries       −       ✔
Historical Version Materialization  ✔       ✔
Historical Single-Version Queries   ✔       ✔
Delta Materialization               −       ✔
Single-delta queries                ✔       ✔
Cross-delta queries                 −       −
Cross-version queries               ✔       ✔
References
• [NW10] Thomas Neumann and Gerhard Weikum. x-RDF-3X: fast querying, high update rates, and consistency for RDF databases. VLDB Endowment, 3(1-2):256–263, 2010.
• [VG06] Max Völkel and Tudor Groza. SemVersion: An RDF-based ontology versioning system. In IADIS Int’l Conf. WWW/Internet, volume 2006, page 44, 2006.
• [CB07] Steve Cassidy and James Ballantine. Version Control for RDF Triple Stores. ICSOFT (ISDM/EHST/DC), 7:5–12, 2007.
• [SC+13] Miel Vander Sande, Pieter Colpaert, et al. R&Wbase: git for triples. In LDOW, 2013.
• [GH+14] Markus Graube, Stephan Hensel, et al. R43ples: Revisions for triples. LDQ, 2014.
• [MK+15] Paul Meinhardt, Magnus Knuth, et al. TailR: a platform for preserving history on the web of data. In Int’l Conf. on Semantic Systems, pages 57–64. ACM, 2015.
• [MP16] Marios Meimaris, George Papastefanatos, et al. A Query Language for Multiversion Data Web Archives. In arXiv:1504.01891, 2016.
• [SN+09] Herbert Van de Sompel, Michael L. Nelson, et al. Memento: Time travel for the web. arXiv preprint arXiv:0911.1112, 2009.
• [GU+15] Javier David Fernandez Garcia, Jürgen Umbrich, et al. BEAR: Benchmarking the Efficiency of RDF Archiving. Technical report, Department für Informationsverarbeitung und Prozessmanagement, WU Vienna University of Economics and Business, 2015.
• [MP16a] Marios Meimaris and George Papastefanatos. The EvoGen Benchmark Suite for Evolving RDF Data. MeDAW, 2016.