archiving scientific data - university of pennsylvania

Post on 11-Feb-2022


Archiving Scientific Data

Susan B. Davidson

CIS 700: Advanced Topics in Databases

MW 1:30-3

Towne 309

http://www.cis.upenn.edu/~susan/cis700/homepage.html

Why archive?

• Data changes over time
  • New data is added
  • Mistakes are corrected
  • Old data is removed
• To enable reproducibility and verifiability, it must be possible to access the state of a database as of a certain point in time.
  • Also crucial for dereferencing citations
• May also want to ask questions about how the database has changed.

How to archive?

• Many databases periodically publish new versions
• Keep copy of each version
  • Allows data as of a certain time to be accessed quickly
  • May not be space efficient since very little may change between versions
  • Doesn’t allow efficient queries over the change history
• Keep a log of changes (“sequence of delta”)
  • Space efficient
  • May be expensive to recompute data as of a certain time
  • May be expensive to query change history
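The trade-off between the two approaches on this slide can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: a database state is modeled as a flat dictionary, versions are numbered from 0, and the class names `SnapshotArchive` and `DeltaArchive` are hypothetical, not taken from the lecture.

```python
from typing import Dict, List, Tuple

State = Dict[str, str]  # assumed toy model: record id -> value


class SnapshotArchive:
    """Keeps a full copy of every published version."""

    def __init__(self) -> None:
        self.versions: List[State] = []

    def publish(self, state: State) -> None:
        self.versions.append(dict(state))

    def as_of(self, version: int) -> State:
        # Fast: a direct lookup, at the cost of storing every copy.
        return dict(self.versions[version])


class DeltaArchive:
    """Keeps only the changes between consecutive versions."""

    def __init__(self) -> None:
        # Each delta is (inserted or updated records, deleted record ids).
        self.deltas: List[Tuple[State, List[str]]] = []
        self._latest: State = {}

    def publish(self, state: State) -> None:
        upserts = {k: v for k, v in state.items() if self._latest.get(k) != v}
        deleted = [k for k in self._latest if k not in state]
        self.deltas.append((upserts, deleted))
        self._latest = dict(state)

    def as_of(self, version: int) -> State:
        # Slow: replays every delta up to the requested version.
        state: State = {}
        for upserts, deleted in self.deltas[: version + 1]:
            state.update(upserts)
            for k in deleted:
                del state[k]
        return state
```

Both classes answer `as_of` with the same result; the snapshot archive pays in space, while the delta archive pays in recomputation time that grows with the number of versions.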

Outline

• Versioning and citation: experiences with eagle-i
• Archiving XML datasets
• Conclusions

Our experience: eagle-i

• eagle-i is an RDF dataset which contains information about resources for translational research (e.g. software, cell lines, lab facilities)
• Each resource has an immutable eagle-i id; the subject of each resource triple is an eagle-i id
• Resources are classified using an ontology, and the citation depends on the classification of the resource.
• eagle-i talked about citation but didn’t automate it…


Citation architecture


eagle-i versioning manager

• The latest copy of eagle-i is available on the website, but it is not “versioned”
• We did a daily download since we didn’t know how frequently it changed (not frequently!)
• Needed “time queries” to understand how the dataset changed over time
  • What triples were added/deleted in the period [t, t’]?
  • What was the object of triple X at time t?
  • When was triple Y first added/deleted?
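The three time queries on this slide can be answered from a log of add and delete events built by comparing successive downloads. The sketch below is an assumption-laden illustration, not the actual eagle-i versioning manager: times are plain integers, snapshots are loaded in chronological order, and the class `TripleHistory` and its method names are hypothetical.

```python
from typing import Dict, List, Optional, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


class TripleHistory:
    """Records, per triple, the snapshot times at which it was added or deleted."""

    def __init__(self) -> None:
        self.events: Dict[Triple, List[Tuple[int, str]]] = {}
        self.current: Set[Triple] = set()

    def load_snapshot(self, t: int, triples: Set[Triple]) -> None:
        """Compare the download taken at time t with the previous one."""
        for tr in triples - self.current:
            self.events.setdefault(tr, []).append((t, "added"))
        for tr in self.current - triples:
            self.events.setdefault(tr, []).append((t, "deleted"))
        self.current = set(triples)

    def changes_between(self, t1: int, t2: int) -> List[Tuple[int, str, Triple]]:
        """Triples added or deleted in the closed period [t1, t2]."""
        return sorted(
            (t, kind, tr)
            for tr, evs in self.events.items()
            for t, kind in evs
            if t1 <= t <= t2
        )

    def object_at(self, subject: str, predicate: str, t: int) -> Optional[str]:
        """Object of (subject, predicate) at time t, assuming at most one value."""
        for (s, p, o), evs in self.events.items():
            if (s, p) != (subject, predicate):
                continue
            past = [kind for when, kind in evs if when <= t]
            if past and past[-1] == "added":
                return o
        return None

    def first_added(self, triple: Triple) -> Optional[int]:
        """Time of the first snapshot in which the triple appeared."""
        for t, kind in self.events.get(triple, []):
            if kind == "added":
                return t
        return None
```

Because only differences between consecutive downloads are recorded, days on which nothing changed add no events, which suits a dataset that changes infrequently.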

Example: versioning 2 RDF triples


Versioning and citation

• When should versioning be triggered?
  • At least when a user cites an eagle-i resource
• What should be versioned?
  • At least changes to the resource being cited.
➢ If a version of a resource is not cited, it does not have to be stored.
➢ However, time-based queries will only detect changes with respect to citations rather than all changes.
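A minimal sketch of citation-triggered versioning follows, to make the consequence in the last two points concrete. Everything here is assumed for illustration: the class `CitationVersioner`, the `cite` method, and the `id@vN` citation format are hypothetical and do not describe the eagle-i implementation.

```python
from typing import Dict, FrozenSet, List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


class CitationVersioner:
    """Stores a version of a resource only when someone cites it."""

    def __init__(self) -> None:
        # resource id -> list of (citation time, triples of that resource)
        self.versions: Dict[str, List[Tuple[int, FrozenSet[Triple]]]] = {}

    def cite(self, resource_id: str, now: int, dataset: List[Triple]) -> str:
        """Version the cited resource if it changed since its last citation."""
        current = frozenset(t for t in dataset if t[0] == resource_id)
        history = self.versions.setdefault(resource_id, [])
        if not history or history[-1][1] != current:
            history.append((now, current))
        return f"{resource_id}@v{len(history)}"
```

If a resource changes twice between two citations, only the state at the second citation is stored, so a time-based query over this archive never sees the intermediate state.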

Outline

• Versioning and citation: experiences with eagle-i
• Archiving XML
• Conclusions

Recall: approaches to archiving

• Keep copy of each new version of the database
  • Allows data as of a certain time to be accessed quickly
  • May not be space efficient since very little may change between versions
  • Doesn’t allow efficient queries over the change history
• Keep a log of changes (“sequence of delta”)
  • Space efficient
  • May be expensive to recompute data as of a certain time
  • May be expensive to query change history

Problem with diff-based approaches

• Ignores the “semantic continuity of keys” by focusing on minimal edit distance
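A small example may help show what ignoring the continuity of keys means. The sketch below uses a deliberately crude positional comparison as a stand-in for a minimal-edit diff (it is not a real edit-distance algorithm and ignores records without a positional partner), and compares it with a key-based comparison. The function names and the sample records are invented for illustration.

```python
from typing import Dict, List

Record = Dict[str, str]


def positional_diff(old: List[Record], new: List[Record]) -> List[str]:
    """Stand-in for a minimal-edit diff: align records by position."""
    edits = []
    for i, (a, b) in enumerate(zip(old, new)):
        for name, value in a.items():
            if b.get(name) != value:
                edits.append(f"record {i}: {name} {value!r} -> {b.get(name)!r}")
    return edits


def keyed_diff(old: List[Record], new: List[Record], key: str) -> List[str]:
    """Align records by key value, so identity follows the key."""
    old_by_key = {r[key]: r for r in old}
    new_by_key = {r[key]: r for r in new}
    edits = [f"deleted {k!r}" for k in old_by_key if k not in new_by_key]
    edits += [f"inserted {k!r}" for k in new_by_key if k not in old_by_key]
    for k in sorted(old_by_key.keys() & new_by_key.keys()):
        for name, value in old_by_key[k].items():
            if new_by_key[k].get(name) != value:
                edits.append(f"{k!r}: {name} {value!r} -> {new_by_key[k].get(name)!r}")
    return edits


v1 = [{"name": "Joe Doe", "sal": "90K"}]
v2 = [{"name": "Ann Lee", "sal": "90K"}]
print(positional_diff(v1, v2))      # ["record 0: name 'Joe Doe' -> 'Ann Lee'"]
print(keyed_diff(v1, v2, "name"))   # ["deleted 'Joe Doe'", "inserted 'Ann Lee'"]
```

The positional comparison reports the cheapest edit, a renamed employee, while the key-based comparison reports that one employee left and another arrived, which is the reading a user of the data would expect.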

Proposed approach in paper

• Focus on hierarchical scientific datasets
  • XML-based
  • Changes are primarily insertions
• Changes identified based on keys
• Version merging based on keys
• Inheritance of timestamps
  • Timestamp is stored at a child element only when it is different from the timestamp of its parent element
➢ “Key-based + merging” approach
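Timestamp inheritance can be illustrated with a short sketch. It assumes a timestamp is a set of version numbers and that a missing timestamp means "same as the parent"; the `Node` class and `effective_timestamps` function are hypothetical names, and node labels are assumed unique among siblings.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class Node:
    label: str
    timestamp: Optional[Set[int]] = None  # None means "inherit from parent"
    children: List["Node"] = field(default_factory=list)


def effective_timestamps(node: Node, inherited: Set[int], path: str = "") -> Dict[str, Set[int]]:
    """Map each node path to the set of versions in which the node exists."""
    own = node.timestamp if node.timestamp is not None else inherited
    here = f"{path}/{node.label}"
    result = {here: own}
    for child in node.children:
        result.update(effective_timestamps(child, own, here))
    return result


# Only db and emp store a timestamp; dept and sal inherit theirs.
archive = Node("db", {1, 2, 3}, [
    Node("dept", None, [
        Node("emp", {2, 3}, [Node("sal")]),
    ]),
])
stamps = effective_timestamps(archive, set())
```

In this example `stamps["/db/dept"]` resolves to the root's versions and `stamps["/db/dept/emp/sal"]` to those of `emp`, so two of the four nodes carry no stored timestamp at all.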

Example: sequence of versions


Adding keys


Example of an archive


Representing archive in XML


What is a key for XML?

• A key has form (Q, {P1, …, Pk}), where Q, Pi are path expressions
  • Q identifies the target set
  • Pi are key paths, analogous to key attributes in relations
• An XML document satisfies a key (Q, {P1, …, Pk}) if
  • From any node identified by Q, every Pi exists uniquely
  • If two nodes n1 and n2 identified by Q have the same value at the end of each key path in {P1, …, Pk} then n1 and n2 are the same node.
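The two conditions for key satisfaction can be checked mechanically. The sketch below uses Python's standard `xml.etree.ElementTree`, whose `findall` accepts a limited XPath subset; it simplifies the definition by comparing the text content at the end of each key path rather than full subtree values. The function name `satisfies_key` is an assumption made for this example.

```python
import xml.etree.ElementTree as ET
from typing import List


def satisfies_key(root: ET.Element, target: str, key_paths: List[str]) -> bool:
    """Check the key (target, key_paths) against the document rooted at root."""
    seen = set()
    for node in root.findall(target):
        values = []
        for p in key_paths:
            hits = node.findall(p)
            if len(hits) != 1:  # every key path must exist uniquely
                return False
            values.append((hits[0].text or "").strip())
        if tuple(values) in seen:  # two distinct nodes agree on all key paths
            return False
        seen.add(tuple(values))
    return True


doc = ET.fromstring(
    "<db>"
    "<dept><name>finance</name></dept>"
    "<dept><name>marketing</name></dept>"
    "</db>"
)
print(satisfies_key(doc, "dept", ["name"]))  # True
```

With an empty set of key paths, every target node has the same (empty) key value, so the check succeeds only if the target set contains at most one node.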

Relative keys

• Since XML is hierarchical, we also need to specify keys relative to a context node
  • (Q, (Q’, {P1, …, Pk}))
• Examples
  • (/, (db, {})). There is at most one db element below the root.
  • (/db, (dept, {name})). Every dept node within a db node can be uniquely identified by the contents of its name subelement.
  • (/db/dept, (emp, {fn, ln})). Every emp node within a dept node along the path /db/dept can be uniquely identified by the contents of its fn and ln subelements.
  • (/db/dept/emp, (sal, {})). There is at most one sal subelement under each emp node along the path /db/dept/emp.
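One way to see what the relative keys on this slide provide is to chain them into a path-like identifier for each keyed node. The sketch below assumes the four example keys, written as a dictionary from the full path of a node to its key paths; the `KEYS` table, the `identify` function, and the bracketed identifier syntax are all invented for illustration.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# The four relative keys from the slide, as {path to node: key paths}.
KEYS: Dict[str, List[str]] = {
    "/db": [],
    "/db/dept": ["name"],
    "/db/dept/emp": ["fn", "ln"],
    "/db/dept/emp/sal": [],
}


def identify(node: ET.Element, path: str = "", ident: str = "") -> List[str]:
    """Return a key-based identifier for every keyed node at or below node."""
    path = f"{path}/{node.tag}"
    if path not in KEYS:
        return []  # no key was given for this path
    values = ",".join(f"{p}={node.findtext(p, '').strip()}" for p in KEYS[path])
    ident = f"{ident}/{node.tag}" + (f"[{values}]" if values else "")
    found = [ident]
    for child in node:
        found.extend(identify(child, path, ident))
    return found


doc = ET.fromstring(
    "<db><dept><name>finance</name>"
    "<emp><fn>John</fn><ln>Doe</ln><sal>90K</sal></emp>"
    "</dept></db>"
)
for ident in identify(doc):
    print(ident)
# /db
# /db/dept[name=finance]
# /db/dept[name=finance]/emp[fn=John,ln=Doe]
# /db/dept[name=finance]/emp[fn=John,ln=Doe]/sal
```

Because each key is relative to its context node, an employee is identified only within a department; the same first and last name under a different dept node yields a different identifier.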

Archiver architecture

• Assumptions:
  • Every key defined for a node is relative to its parent, e.g. the key for emp is relative to its parent dept node
  • Frontier nodes identify unkeyed portions of the document

Nested merge

• Recursively merge nodes in the incoming version (D) to nodes in the archive (A) that have the same key value, starting from the root.
• When a node y in D is merged with a node x from A, the timestamp of x is augmented with i (the new version number), and subtrees are recursively merged.
• Nodes in D that do not have matching nodes in A are simply added with i as the timestamp
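The merge described on this slide can be sketched as a short recursive function. This is a simplified illustration, not the archiver from the paper: an incoming version is assumed to be a nested dictionary whose labels already encode key values, leaf values are modeled as child labels, and every archive node stores its timestamp explicitly (no inheritance, no compaction under frontier nodes). The names `ArchiveNode` and `nested_merge` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class ArchiveNode:
    timestamps: Set[int] = field(default_factory=set)
    children: Dict[str, "ArchiveNode"] = field(default_factory=dict)


def nested_merge(archive: ArchiveNode, version: Dict[str, dict], i: int) -> None:
    """Merge version number i into the archive node that matches it by key."""
    archive.timestamps.add(i)  # this node exists in version i
    for key, subtree in version.items():
        child = archive.children.get(key)
        if child is None:  # no node in A with this key value: add it
            child = archive.children[key] = ArchiveNode()
        nested_merge(child, subtree, i)  # recursively merge the subtrees


v1 = {"dept[name=finance]": {"emp[fn=Joe,ln=Doe]": {"sal": {"90K": {}}}}}
v2 = {"dept[name=finance]": {"emp[fn=Joe,ln=Doe]": {"sal": {"95K": {}}},
                             "emp[fn=Ann,ln=Lee]": {"sal": {"80K": {}}}}}
root = ArchiveNode()
nested_merge(root, v1, 1)
nested_merge(root, v2, 2)
```

After both merges, the `sal` node of Joe Doe carries timestamps {1, 2} while its two values carry {1} and {2}, so the raise is recorded once under the same employee rather than as a new copy of the record.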

Further compaction under frontier node


Querying the archive

• What is the database at t=1?
• When did Joe Doe get a salary raise?
• What were the changes to the database between t=1 and t=3?
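The three questions on this slide map onto simple traversals of a timestamped archive. The sketch below assumes a hand-written archive in which every node is a dictionary with a set of version numbers `"t"` and children `"kids"`; the data, the layout, and the function names are invented for illustration. The change query compares only the two endpoint versions, not the versions in between.

```python
from typing import List

# Hypothetical archive covering versions 1 to 3.
ARCHIVE = {"t": {1, 2, 3}, "kids": {
    "emp[Joe Doe]": {"t": {1, 2, 3}, "kids": {
        "sal": {"t": {1, 2, 3}, "kids": {
            "90K": {"t": {1, 2}, "kids": {}},
            "95K": {"t": {3}, "kids": {}},
        }},
    }},
    "emp[Ann Lee]": {"t": {2, 3}, "kids": {}},
}}


def snapshot(node: dict, t: int) -> dict:
    """Database as of version t: keep only nodes whose timestamps contain t."""
    return {label: snapshot(kid, t)
            for label, kid in node["kids"].items() if t in kid["t"]}


def first_version(node: dict, *labels: str) -> int:
    """Earliest version in which the node reached by following labels exists."""
    for label in labels:
        node = node["kids"][label]
    return min(node["t"])


def changes(node: dict, t1: int, t2: int, path: str = "") -> List[str]:
    """Nodes present in exactly one of versions t1 and t2."""
    out: List[str] = []
    for label, kid in node["kids"].items():
        here = f"{path}/{label}"
        before, after = t1 in kid["t"], t2 in kid["t"]
        if before and after:
            out += changes(kid, t1, t2, here)
        elif after:
            out.append(f"added {here}")
        elif before:
            out.append(f"removed {here}")
    return out


print(snapshot(ARCHIVE, 1))  # {'emp[Joe Doe]': {'sal': {'90K': {}}}}
print(first_version(ARCHIVE, "emp[Joe Doe]", "sal", "95K"))  # 3
```

All three queries run directly on the merged archive, without reconstructing intermediate versions from a sequence of deltas.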

Conclusions

• Versioning is important for many different applications
• While techniques are similar between different representations (e.g. files, relations, XML, RDF), differences in assumptions can be used to build more efficient solutions.
  • And the operations (e.g. queries) you wish to perform are important too!
