![Page 1: Ibis: A Provenance Manager for Mul‐Layer Systems · Yahoo extracted table IMDB extracted table combined table map task 1, aempt 1 map task 2, aempt 1 reduce task 1, aempt 1 merge](https://reader034.vdocuments.net/reader034/viewer/2022042223/5ec98ebd846779596d2ce3c7/html5/thumbnails/1.jpg)
Ibis: A Provenance Manager for Multi-Layer Systems
Christopher Olston & Anish Das Sarma, Yahoo! Research
![Page 2: Ibis: A Provenance Manager for Mul‐Layer Systems · Yahoo extracted table IMDB extracted table combined table map task 1, aempt 1 map task 2, aempt 1 reduce task 1, aempt 1 merge](https://reader034.vdocuments.net/reader034/viewer/2022042223/5ec98ebd846779596d2ce3c7/html5/thumbnails/2.jpg)
Motivation: Many Sub-Systems
• scalable file system, e.g. GFS
• distributed sorting & hashing, e.g. Map-Reduce
• dataflow programming framework, e.g. Pig
• workflow manager, e.g. Oozie
• low-latency processor
• serving
• ingestion
[Diagram labels: datum X, datum Y; metadata queries, e.g. "provenance of X?"]
![Page 3: Ibis: A Provenance Manager for Mul‐Layer Systems · Yahoo extracted table IMDB extracted table combined table map task 1, aempt 1 map task 2, aempt 1 reduce task 1, aempt 1 merge](https://reader034.vdocuments.net/reader034/viewer/2022042223/5ec98ebd846779596d2ce3c7/html5/thumbnails/3.jpg)
Ibis Project
• Benefits:
– Provide uniform view to users
– Factor out metadata management code
– Decouple metadata lifetime from data/subsystem lifetime
• Challenges:
– Overhead of shipping metadata
– Disparate data/processing granularities
[Diagram: data processing sub-systems send metadata to Ibis, the metadata manager, which holds the integrated metadata; users pose metadata queries and receive answers. One part of the diagram is marked "THIS PAPER".]
![Page 4: Ibis: A Provenance Manager for Mul‐Layer Systems · Yahoo extracted table IMDB extracted table combined table map task 1, aempt 1 map task 2, aempt 1 reduce task 1, aempt 1 merge](https://reader034.vdocuments.net/reader034/viewer/2022042223/5ec98ebd846779596d2ce3c7/html5/thumbnails/4.jpg)
Example Granularity Lattices
[Diagram: two granularity lattices.]
Data granularities: Table, Column group, Row, Column, Cell, Version, Web page
Process granularities: Workflow, Pig script, Pig job, Pig logical operation, Pig physical operation, MR program, MR job, MR job phase, MR task, Task attempt
![Page 5: Ibis: A Provenance Manager for Mul‐Layer Systems · Yahoo extracted table IMDB extracted table combined table map task 1, aempt 1 map task 2, aempt 1 reduce task 1, aempt 1 merge](https://reader034.vdocuments.net/reader034/viewer/2022042223/5ec98ebd846779596d2ce3c7/html5/thumbnails/5.jpg)
Challenges
• Inference: Given relationships expressed at one granularity, answer queries about other granularities (the semantics are tricky here!)
• Efficiency: Implement inference without resorting to materializing everything in terms of the finest granularity (e.g. cells)
![Page 6: Ibis: A Provenance Manager for Mul‐Layer Systems · Yahoo extracted table IMDB extracted table combined table map task 1, aempt 1 map task 2, aempt 1 reduce task 1, aempt 1 merge](https://reader034.vdocuments.net/reader034/viewer/2022042223/5ec98ebd846779596d2ce3c7/html5/thumbnails/6.jpg)
Talk Outline
• Informal overview
– Example data provenance graph
– Query language overview + examples
• Touch on formal model (details in paper)
![Page 7: Ibis: A Provenance Manager for Mul‐Layer Systems · Yahoo extracted table IMDB extracted table combined table map task 1, aempt 1 map task 2, aempt 1 reduce task 1, aempt 1 merge](https://reader034.vdocuments.net/reader034/viewer/2022042223/5ec98ebd846779596d2ce3c7/html5/thumbnails/7.jpg)
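Example Workflow
[Diagram: the IMDB web page feeds IMDb Extract, which produces Extracted IMDb; the Yahoo! Movies web page feeds Y! Extract, which produces Extracted Y!; Merge combines the two extracted data sets into Movie DB.]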
![Page 8: Ibis: A Provenance Manager for Mul‐Layer Systems · Yahoo extracted table IMDB extracted table combined table map task 1, aempt 1 map task 2, aempt 1 reduce task 1, aempt 1 merge](https://reader034.vdocuments.net/reader034/viewer/2022042223/5ec98ebd846779596d2ce3c7/html5/thumbnails/8.jpg)
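Provenance Graph
[Figure: provenance graph for the example workflow. Recoverable labels and table contents follow.]
Data: Yahoo! Movies web page, IMDB web page, Yahoo extracted table, IMDB extracted table, map output 1, map output 2, combined table
Processes: extract pig script, merge pig script, pig job 1, pig job 2, map task 1 (attempt 1), map task 2 (attempt 1), reduce task 1 (attempt 1)
Annotations: version=3, wrapper=yahoo; version=2, wrapper=imdb; license=yahoo, auth.score=5; license=imdb, auth.score=4

Combined table:

| title | year | lead actor |
| --- | --- | --- |
| Avatar | 2009 | V1: Worthington; V2: Saldana |
| Inception | 2010 | DiCaprio |

Extracted tables (one per source):

| title | year | lead actor |
| --- | --- | --- |
| Avatar | 2009 | Saldana |
| Inception | 2010 | DiCaprio |

| title | year | lead actor |
| --- | --- | --- |
| Avatar | 2009 | Worthington |
| Inception | 2010 | DiCaprio |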
![Page 9: Ibis: A Provenance Manager for Mul‐Layer Systems · Yahoo extracted table IMDB extracted table combined table map task 1, aempt 1 map task 2, aempt 1 reduce task 1, aempt 1 merge](https://reader034.vdocuments.net/reader034/viewer/2022042223/5ec98ebd846779596d2ce3c7/html5/thumbnails/9.jpg)
Meaning of Provenance Relationships
• (P, D1, D2): Process P consumed PART OF datum D1 and emitted ALL OF datum D2
• "part → all" semantics are a natural default
• Upshot: if D1 and D2 are tables, we cannot infer that a given row in D1 influenced D2
• In the query language, one can still ask "part → part" questions: does there exist d2 ∈ D2 such that D1 influenced d2?
![Page 10: Ibis: A Provenance Manager for Mul‐Layer Systems · Yahoo extracted table IMDB extracted table combined table map task 1, aempt 1 map task 2, aempt 1 reduce task 1, aempt 1 merge](https://reader034.vdocuments.net/reader034/viewer/2022042223/5ec98ebd846779596d2ce3c7/html5/thumbnails/10.jpg)
Query Language: "IQL"
• SQL-style language for querying the provenance graph
• Special constructs:
– Under (containment): Is row R under table T?
– Influence: Does data D1 influence data D2?
– Feed: Does data D feed process P?
– Emit: Does process P emit data D?
![Page 11: Ibis: A Provenance Manager for Mul‐Layer Systems · Yahoo extracted table IMDB extracted table combined table map task 1, aempt 1 map task 2, aempt 1 reduce task 1, aempt 1 merge](https://reader034.vdocuments.net/reader034/viewer/2022042223/5ec98ebd846779596d2ce3c7/html5/thumbnails/11.jpg)
IQL Examples
• Find data items that influenced the combined extracted table:
    select d.id
    from AnyData d, Table t
    where d influences t and t.id = (combined extracted table);
• Find data tables that are "contaminated" by version 3 of the extraction script (found to have a bug):
    select t.id
    from PigScript p, PigJob j, AnyData d1, AnyData d2, Table t
    where p.id = (extract pig script) and j under p and j.version = 3
      and j emits d1 and d1 influences d2 and d2 under t;
![Page 12: Ibis: A Provenance Manager for Mul‐Layer Systems · Yahoo extracted table IMDB extracted table combined table map task 1, aempt 1 map task 2, aempt 1 reduce task 1, aempt 1 merge](https://reader034.vdocuments.net/reader034/viewer/2022042223/5ec98ebd846779596d2ce3c7/html5/thumbnails/12.jpg)
Implementation Status
• We have a working storage/query engine based on rewriting over SQL/RDBMS (SQLite)
• We're currently working on automatic provenance capture (from Pig, Hadoop, etc.)
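As a rough illustration of what answering a provenance query on top of SQLite could look like, the sketch below stores a toy version of the example workflow and finds everything upstream of the combined table with a recursive query. The `prov` table, its columns, the job names, and the simplification that influence is followed only along exact-match edges (no containment reasoning, and the target itself is not returned) are assumptions made for this example; they are not Ibis's actual schema or rewriting.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical schema: one row per provenance relationship.
    CREATE TABLE prov (process TEXT, src TEXT, dst TEXT);
    INSERT INTO prov VALUES
        ('yahoo extract job', 'Yahoo! Movies web page', 'Yahoo extracted table'),
        ('imdb extract job',  'IMDB web page',          'IMDB extracted table'),
        ('merge job',         'Yahoo extracted table',  'combined table'),
        ('merge job',         'IMDB extracted table',   'combined table');
""")

def influencers(target):
    """Return ids of all data items that transitively feed `target`."""
    rows = conn.execute(
        """
        WITH RECURSIVE up(id) AS (
            SELECT src FROM prov WHERE dst = ?
            UNION
            SELECT p.src FROM prov p JOIN up ON p.dst = up.id
        )
        SELECT id FROM up ORDER BY id
        """,
        (target,),
    ).fetchall()
    return [r[0] for r in rows]

result = influencers("combined table")
print(result)
```

A recursive common table expression keeps the transitive closure inside the database, so only the final answer is shipped back to the caller.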
![Page 13: Ibis: A Provenance Manager for Mul‐Layer Systems · Yahoo extracted table IMDB extracted table combined table map task 1, aempt 1 map task 2, aempt 1 reduce task 1, aempt 1 merge](https://reader034.vdocuments.net/reader034/viewer/2022042223/5ec98ebd846779596d2ce3c7/html5/thumbnails/13.jpg)
Talk Outline
• Informal overview
– Example data provenance graph
– Query language overview + examples
• Touch on formal model (details in paper)
– Open-world semantics
– Transitive inference of containment & influence
![Page 14: Ibis: A Provenance Manager for Mul‐Layer Systems · Yahoo extracted table IMDB extracted table combined table map task 1, aempt 1 map task 2, aempt 1 reduce task 1, aempt 1 merge](https://reader034.vdocuments.net/reader034/viewer/2022042223/5ec98ebd846779596d2ce3c7/html5/thumbnails/14.jpg)
Open-World Semantics
• Metadata of Ibis encodes a set F of facts
• Open-world:
– Correctness: All facts in F are correct
– Incomplete: There may be other facts unknown to Ibis
• Extension, ext(F): the set of facts that can be derived from F
• True world has set of facts F'
• We have $F \subseteq \mathrm{ext}(F) \subseteq F'$
![Page 15: Ibis: A Provenance Manager for Mul‐Layer Systems · Yahoo extracted table IMDB extracted table combined table map task 1, aempt 1 map task 2, aempt 1 reduce task 1, aempt 1 merge](https://reader034.vdocuments.net/reader034/viewer/2022042223/5ec98ebd846779596d2ce3c7/html5/thumbnails/15.jpg)
Open-World Semantics: One Implication
• Suppose F contains:
– Process p emitted row r1
– Currently r1 is the only row in table T
• "Process p emitted table T" is a fact that may be in F' (true world) but cannot be inferred in ext(F)
![Page 16: Ibis: A Provenance Manager for Mul‐Layer Systems · Yahoo extracted table IMDB extracted table combined table map task 1, aempt 1 map task 2, aempt 1 reduce task 1, aempt 1 merge](https://reader034.vdocuments.net/reader034/viewer/2022042223/5ec98ebd846779596d2ce3c7/html5/thumbnails/16.jpg)
Inferring "X is under Y"
• Defined in terms of "granularization":
1. Resolve X and Y into finest-grain elements (e.g. cells)
2. Perform set containment check
• Implemented via a shortcut that avoids enumerating sub-elements
• Proof that implementation & definition are equivalent
![Page 17: Ibis: A Provenance Manager for Mul‐Layer Systems · Yahoo extracted table IMDB extracted table combined table map task 1, aempt 1 map task 2, aempt 1 reduce task 1, aempt 1 merge](https://reader034.vdocuments.net/reader034/viewer/2022042223/5ec98ebd846779596d2ce3c7/html5/thumbnails/17.jpg)
Inferring "X is under Y"
Basic element b defined by granularity g, direct parents P (and an identifier).
Granularization of b to the finest granularity $g_{\min}$ defined by:
$G(b) = \{\, b' = (g_{\min}, P') \mid b \text{ contains } b' \,\}$
Containment obtained by recursive application of the parent relation.
Complex element E defined by a set of granularities $\{g_1, \ldots, g_n\}$ and corresponding basic elements $\{b_1, \ldots, b_n\}$.
Granularization of complex element E consisting of $b_1, \ldots, b_n$ is:
$G(E) = \bigcap_i G(b_i)$
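The definitions above can be made concrete with a small sketch. It assumes a toy two-row, two-column table, and it reads a complex element as the intersection of its basic elements (a row and a column together denote one cell), which is the reading consistent with the efficient under-check on the next slide. The class `Basic`, the function names, and the explicit `universe` of known elements are illustrative assumptions, not part of Ibis.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Basic:
    ident: str
    granularity: str
    parents: tuple = ()  # direct parents, as Basic elements

def contains(outer, inner):
    """True if `outer` is `inner` or an ancestor of it via the parent relation."""
    if outer == inner:
        return True
    return any(contains(outer, p) for p in inner.parents)

def granularize(b, universe, finest="cell"):
    """G(b): finest-grain elements in `universe` that b contains."""
    return frozenset(
        x for x in universe if x.granularity == finest and contains(b, x)
    )

def granularize_complex(basics, universe):
    """G(E) for a complex element given by a non-empty list of basic elements."""
    return frozenset.intersection(*(granularize(b, universe) for b in basics))

# Toy data: table T with rows row1, row2 and columns title, year.
T = Basic("T", "table")
rows = [Basic(f"row{i}", "row", (T,)) for i in (1, 2)]
cols = [Basic(name, "column", (T,)) for name in ("title", "year")]
cells = [Basic(f"{r.ident}.{c.ident}", "cell", (r, c)) for r in rows for c in cols]
universe = [T, *rows, *cols, *cells]

row1_cells = sorted(x.ident for x in granularize(rows[0], universe))
one_cell = sorted(x.ident for x in granularize_complex([rows[0], cols[0]], universe))
print(row1_cells, one_cell)
```

Enumerating cells like this is exactly the cost the efficient check is meant to avoid; the sketch only shows what the definition computes.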
![Page 18: Ibis: A Provenance Manager for Mul‐Layer Systems · Yahoo extracted table IMDB extracted table combined table map task 1, aempt 1 map task 2, aempt 1 reduce task 1, aempt 1 merge](https://reader034.vdocuments.net/reader034/viewer/2022042223/5ec98ebd846779596d2ce3c7/html5/thumbnails/18.jpg)
Inferring "X is under Y"
Under Check-1: Sets of complex elements E1, E2. E1 is under E2 iff there does not exist a true world with
$\bigcup_{e_1 \in E_1} G(e_1) \not\subseteq \bigcup_{e_2 \in E_2} G(e_2)$
Efficient Under Check-2: Sets of complex elements E1, E2. E1 is under E2 iff for all e1 ∈ E1, there exists e2 ∈ E2 such that e1 is under e2.
Given complex elements e1 and e2 with basic element sets B(e1) and B(e2), e1 is under e2 iff for all b2 ∈ B(e2), there exists b1 ∈ B(e1) such that b2 contains b1.
Theorem: Check-1 is equivalent to Check-2.
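Check-2 translates almost directly into code. The sketch below is a minimal illustration, assuming each complex element is represented as a set of basic-element ids and that the parent relation is available as a dictionary; the `PARENTS` map and all function names are hypothetical.

```python
# Hypothetical parent relation: basic element id -> direct parents.
PARENTS = {
    "T": [],
    "row1": ["T"],
    "row2": ["T"],
    "title": ["T"],
    "year": ["T"],
}

def contains(b2, b1):
    """b2 contains b1 if b2 is b1 or an ancestor of b1."""
    return b2 == b1 or any(contains(b2, p) for p in PARENTS[b1])

def complex_under(e1, e2):
    """e1 is under e2 iff every basic element of e2 contains some basic element of e1."""
    return all(any(contains(b2, b1) for b1 in e1) for b2 in e2)

def under(E1, E2):
    """Check-2 on sets of complex elements: each e1 must be under some e2."""
    return all(any(complex_under(e1, e2) for e2 in E2) for e1 in E1)

cell = frozenset({"row1", "title"})       # the cell at (row1, title)
row1 = frozenset({"row1"})
row2 = frozenset({"row2"})
table = frozenset({"T"})
row1_title = frozenset({"row1", "title"})
row1_year = frozenset({"row1", "year"})

print(complex_under(cell, row1))          # the cell lies inside row1
print(under([row1, row2], [table]))       # both rows lie inside the table
print(under([table], [row1, row2]))       # the table may hold rows Ibis has not seen
print(under([row1], [row1_title, row1_year]))
```

The last two calls return False even though the known rows and columns appear to cover the larger element, which matches the open-world reading: a true world may contain further rows or columns. No sub-elements are enumerated at any point.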
![Page 19: Ibis: A Provenance Manager for Mul‐Layer Systems · Yahoo extracted table IMDB extracted table combined table map task 1, aempt 1 map task 2, aempt 1 reduce task 1, aempt 1 merge](https://reader034.vdocuments.net/reader034/viewer/2022042223/5ec98ebd846779596d2ce3c7/html5/thumbnails/19.jpg)
Inferring "X influences Y"
Given two data vertices d1 and d2:
(1) d1 influences(0) d2 iff d2 is under d1;
(2) d1 influences(1) d2 iff one of the following holds:
(A) d1 influences(0) d2
(B) there exists a provenance relationship (d1', p, d2') such that d1 influences(0) d1' and d2' influences(0) d2
(3) For any integer k > 1, d1 influences(k) d2 iff there exists d* such that d1 influences(1) d* and d* influences(k‐1) d2
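The three rules can be sketched as a direct recursion. This is a minimal illustration under several assumptions: data vertices are plain strings, "under" is reduced to ancestry in a hypothetical `PARENTS` map, provenance relationships are (input, process, output) triples in a hypothetical `PROV` list, and the candidate intermediate vertices d* are limited to the known vertices.

```python
# Hypothetical containment: data vertex -> direct parents.
PARENTS = {
    "web page": [],
    "extracted table": [],
    "extracted row 1": ["extracted table"],
    "extracted row 2": ["extracted table"],
    "combined table": [],
    "combined row 1": ["combined table"],
}

# Hypothetical provenance relationships (input datum, process, output datum).
PROV = [
    ("web page", "extract job", "extracted table"),
    ("extracted row 1", "merge job", "combined table"),
]

def under(x, y):
    """x is under y if x is y or y is an ancestor of x."""
    return x == y or any(under(p, y) for p in PARENTS[x])

def influences0(d1, d2):
    """Rule (1): d1 influences(0) d2 iff d2 is under d1."""
    return under(d2, d1)

def influences1(d1, d2):
    """Rule (2): containment alone, or one provenance relationship."""
    if influences0(d1, d2):
        return True
    return any(
        influences0(d1, src) and influences0(dst, d2) for src, _proc, dst in PROV
    )

def influences(d1, d2, k):
    """Rule (3): chain k single-step influences through intermediate vertices."""
    if k == 0:
        return influences0(d1, d2)
    if k == 1:
        return influences1(d1, d2)
    return any(influences1(d1, m) and influences(m, d2, k - 1) for m in PARENTS)

print(influences("web page", "combined row 1", 1))
print(influences("web page", "combined row 1", 2))
print(influences("extracted row 2", "combined table", 1))
```

The last call illustrates the "part → all" default: the merge job is recorded as consuming extracted row 1, so nothing can be inferred about its sibling row.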
![Page 20: Ibis: A Provenance Manager for Mul‐Layer Systems · Yahoo extracted table IMDB extracted table combined table map task 1, aempt 1 map task 2, aempt 1 reduce task 1, aempt 1 merge](https://reader034.vdocuments.net/reader034/viewer/2022042223/5ec98ebd846779596d2ce3c7/html5/thumbnails/20.jpg)
Related Work
• Multi-layer system provenance:
– Harvard PASSv2
• Nested collections in scientific workflow provenance:
– Kepler's COMAD nested collections
– ZOOM user views
– Open provenance model
• Annotations on arbitrary sub-regions of relations:
– [Eltabakh et al.]
– [Srivastava et al.]
![Page 21: Ibis: A Provenance Manager for Mul‐Layer Systems · Yahoo extracted table IMDB extracted table combined table map task 1, aempt 1 map task 2, aempt 1 reduce task 1, aempt 1 merge](https://reader034.vdocuments.net/reader034/viewer/2022042223/5ec98ebd846779596d2ce3c7/html5/thumbnails/21.jpg)
Summary
• Many semi-independent data mgmt. layers + provenance query needs → integrated provenance
• Diverse data & process granularities → careful semantics
• Our contributions:
– Formal multi-granularity provenance semantics
– Query language
– Working prototype (see paper; work in progress)