Approximated Provenance for Complex Applications

Susan B. Davidson (University of Pennsylvania); Eleanor Ainy, Daniel Deutch, Tova Milo (Tel Aviv University)

Post on 24-Feb-2016

Page 1: Approximated Provenance for Complex Applications

Approximated Provenance for Complex Applications

Susan B. Davidson (University of Pennsylvania)

Eleanor Ainy, Daniel Deutch, Tova Milo (Tel Aviv University)

Page 2: Approximated Provenance for Complex Applications

Crowd Sourcing

The engagement of crowds of Web users for data procurement and knowledge creation.

Page 3: Approximated Provenance for Complex Applications

Why now?

We are all connected, all the time!

Page 4: Approximated Provenance for Complex Applications

Complexity?

• Many of the initial applications were quite simple
  – Specify a Human Intelligence Task (HIT) using e.g. Mechanical Turk, collect responses, aggregate to form the result.
• Newer ideas are multi-phase and complex, e.g. mining frequent fact sets from the crowd (OASSIS)
  – Model as workflows with global state

Page 5: Approximated Provenance for Complex Applications

Outline

• “State-of-the-art” in crowd data provenance
• New challenges
• A proposal for modeling crowd data provenance

Page 7: Approximated Provenance for Complex Applications

Crowd data provenance?

• TripAdvisor: aggregates reviews and presents average ratings
  – Individual reviews are part of the provenance
• Wikipedia: keeps extensive information about how pages are edited
  – ID of the user who generated the page as well as changes to the page (when, who, summary)
  – Provides several views of this information, e.g. by page or by editor
• Mainly used for presentation and explanation

Page 16: Approximated Provenance for Complex Applications

Outline

• “State-of-the-art” in crowd data provenance
• New challenges
• A proposal for modeling crowd data provenance

Page 17: Approximated Provenance for Complex Applications

Challenges for crowd data provenance

• Complexity of processes and number of user inputs involved
  – Provenance can be very large, leading to difficulties in viewing and understanding provenance
• Need for
  – Summarization
  – Multidimensional views
  – Provenance mining
  – Compact representation for maintenance and cleaning

Page 18: Approximated Provenance for Complex Applications

Summarization

• Large size of provenance creates a need for abstraction
  – E.g., in heavily edited Wikipedia pages:
    • “x1, x2, x3 are formatting changes; y1, y2, y3, y4 add content; z1, z2 represent divergent viewpoints”
    • “u1, u2, u3 represent edits by robots; v1, v2 represent edits by Wikipedia administrators”
  – E.g., in a movie-rating application, to summarize the provenance of the average rating for “MatchPoint”:
    • “Audience crowd members gave higher ratings (8-10) whereas critics gave lower ratings (3-5).”

Page 19: Approximated Provenance for Complex Applications

Multidimensional Views

• “Perspective” through which provenance can be viewed or mined
  – E.g. in TripAdvisor, if there is an “outlier” review it would be useful to see other reviews by that person to “calibrate” it.
  – “Question” perspective could show which questions are bad/unclear

Page 20: Approximated Provenance for Complex Applications

Maintenance and Cleaning

• May need update propagation to remove certain users, questions and/or answers
  – E.g. spammers or bad questions
• Mining of provenance may lag behind the aggregate calculation
  – E.g., detecting a spammer may only be possible when they have answered enough questions, or when enough answers have been obtained from other users.

Page 21: Approximated Provenance for Complex Applications

Outline

• “State-of-the-art” in crowd data provenance
• New challenges
• A proposal for modeling crowd data provenance

Page 22: Approximated Provenance for Complex Applications

Crowd Sourcing Workflow

[Figure: workflow diagram labeled “Movie reviews Aggregator Platform”]

Page 23: Approximated Provenance for Complex Applications

Provenance expression

Page 24: Approximated Provenance for Complex Applications

Propagating provenance annotations through joins

R
A  B  C   annotation
…
a  b  c   p
…

S
D  B  E   annotation
…
d  b  e   r
…

R ⋈ S (JOIN on B)
A  B  C  D  E   annotation
…
a  b  c  d  e   p * r

The annotation p * r means joint use of data annotated by p and data annotated by r.

[Green, Karvounarakis, Tannen, Provenance Semirings. PODS 2007]
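
The join rule above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: annotations are kept as plain strings, and the function name `join_on_b` and the tuple layout are hypothetical, not taken from the slides or the cited paper.

```python
# Each relation is a list of (tuple, annotation) pairs.
# R has columns (A, B, C); S has columns (D, B, E).

def join_on_b(R, S):
    """Join R and S on column B; the output annotation is the product p * r."""
    out = []
    for (a, b, c), p in R:
        for (d, b2, e), r in S:
            if b == b2:
                # Joint use of both input tuples: multiply their annotations.
                out.append(((a, b, c, d, e), f"{p} * {r}"))
    return out

R = [(("a", "b", "c"), "p")]
S = [(("d", "b", "e"), "r")]
joined = join_on_b(R, S)
print(joined)
```

The output tuple carries both input annotations, so removing either input tuple (setting p or r to 0) removes the joined tuple as well.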

Page 25: Approximated Provenance for Complex Applications

Propagating provenance annotations through unions and projections

R
A  B  C    annotation
…
a  b  c1   p
a  b  c2   r
a  b  c3   s
…

π_AB R (PROJECT)
A  B   annotation
…
a  b   p + r + s
…

+ means alternative use of data, which arises in both PROJECT and UNION.

[Green, Karvounarakis, Tannen, Provenance Semirings. PODS 2007]
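
A comparable sketch for projection, again with string annotations and a hypothetical helper name (`project_ab`): tuples that collapse to the same (A, B) value have their annotations added.

```python
def project_ab(R):
    """Project (A, B, C) tuples onto (A, B); annotations of merged tuples are summed."""
    grouped = {}
    for (a, b, _c), ann in R:
        # Alternative derivations of the same output tuple are combined with +.
        grouped.setdefault((a, b), []).append(ann)
    return [(key, " + ".join(anns)) for key, anns in grouped.items()]

R = [
    (("a", "b", "c1"), "p"),
    (("a", "b", "c2"), "r"),
    (("a", "b", "c3"), "s"),
]
projected = project_ab(R)
print(projected)
```

The same grouping step would apply to UNION, where a tuple that appears in both inputs receives the sum of its two annotations.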

Page 26: Approximated Provenance for Complex Applications

Annotated Aggregate Expressions

R
Eid  Dept  Sal   annotation
1    d1    20    p1
2    d1    10    p2
3    d1    15    p3

Q = select Dept, sum(Sal) from R group by Dept

The sum salary for d1 could be represented by the expression (20 ⊗ p1 + 10 ⊗ p2 + 15 ⊗ p3).

This provenance-aware value “commutes” with deletion.

[Amsterdamer, Deutch, Tannen, Provenance for Aggregate Queries. PODS 2011]
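
To make “commutes with deletion” concrete, here is a minimal sketch. It assumes the expression is stored as (value, annotation) pairs and that deleting a tuple corresponds to valuating its annotation to 0; the name `evaluate_sum` is hypothetical.

```python
def evaluate_sum(terms, valuation):
    """Evaluate a sum of value ⊗ annotation terms under a 0/1 valuation."""
    return sum(value * valuation[ann] for value, ann in terms)

# Provenance-aware sum of salaries for department d1.
d1_sum = [(20, "p1"), (10, "p2"), (15, "p3")]

all_present = {"p1": 1, "p2": 1, "p3": 1}
without_p2 = {"p1": 1, "p2": 0, "p3": 1}  # tuple 2 deleted

total = evaluate_sum(d1_sum, all_present)
total_after_delete = evaluate_sum(d1_sum, without_p2)

# Deleting tuple 2 from R and re-running the query gives the same answer.
recomputed = sum(sal for eid, sal in [(1, 20), (2, 10), (3, 15)] if eid != 2)
print(total, total_after_delete, recomputed)
```

Evaluating the stored expression after the deletion agrees with re-running the query on the reduced table, which is the sense in which the two operations commute.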

Page 27: Approximated Provenance for Complex Applications

Provenance expression

Page 28: Approximated Provenance for Complex Applications

Provenance expression: Benefits

• Can understand how movie ratings were computed.
• Can be used for data maintenance and cleaning
  – E.g. if U2 is discovered to be a spammer, “map” its provenance annotation to 0
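
The spammer example can be sketched as follows. The ratings and the shape of the expression are invented for illustration (the expression shown on the slide is not reproduced in this transcript); only the idea of mapping U2 to 0 comes from the slide.

```python
def average_rating(terms, valuation):
    """Average over rating ⊗ user terms; users valuated to 0 drop out of sum and count."""
    total = sum(rating * valuation[user] for rating, user in terms)
    count = sum(valuation[user] for _, user in terms)
    return total / count if count else None

# Hypothetical ratings for one movie, each annotated by the contributing user.
ratings = [(9, "U1"), (2, "U2"), (7, "U3")]

keep_all = {"U1": 1, "U2": 1, "U3": 1}
drop_u2 = dict(keep_all, U2=0)  # U2 discovered to be a spammer

avg_all = average_rating(ratings, keep_all)
avg_clean = average_rating(ratings, drop_u2)
print(avg_all, avg_clean)
```

No user needs to be asked again: the cleaned average is obtained by re-evaluating the stored expression.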

Page 29: Approximated Provenance for Complex Applications

Summarizing provenance

• Map annotations to a corresponding “summary”
  – h: Ann → Ann’, where |Ann’| << |Ann|
• E.g. in our example, let
  – h(Ui)=h(Si)=1, h(Ai)=A, h(Ci)=C
  – Reducing the expression to
  – Which simplifies to
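
A rough sketch of applying such a mapping h. The annotation names follow the slide (Ui, Si, Ai, Ci), but the term structure, the ratings, and the helper `summarize` are assumptions made for illustration, since the reduced expressions are not reproduced in this transcript.

```python
from collections import defaultdict

def h(annotation):
    """Summary mapping from the slide: h(Ui) = h(Si) = 1, h(Ai) = A, h(Ci) = C."""
    kind = annotation[0]
    if kind in ("U", "S"):
        return "1"
    return kind  # "A" or "C"

def summarize(terms):
    """Apply h to every annotation, drop the 1s, and group values by what remains."""
    grouped = defaultdict(list)
    for value, monomial in terms:
        mapped = tuple(h(a) for a in monomial if h(a) != "1")
        grouped[mapped].append(value)
    return dict(grouped)

# Hypothetical terms: rating ⊗ (product of annotations).
terms = [
    (9, ("U1", "S1", "A1")),
    (8, ("U2", "S2", "A2")),
    (4, ("U3", "S3", "C1")),
]
summary = summarize(terms)
print(summary)
```

After the mapping, only two annotations (A and C) remain, so the summary is much smaller than the original expression while still separating the two groups of ratings.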

Page 30: Approximated Provenance for Complex Applications

Constructing mappings?

• How do we define and find “good” mappings?
  – Provenance size
  – Semantic constraints (e.g. two annotations can only be mapped to the same annotation if they come from the same input table)
  – Distance between original provenance expression and the mapped expression (e.g. grouping all young French people and giving them an average rating for some movie)
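
One possible way to compare candidate mappings on the size and distance criteria is sketched below. The distance used here (mean absolute gap between each user's rating and the average of the group it is mapped to) is an assumption chosen for illustration, not a measure defined in the slides; `score_mapping` and the data are hypothetical.

```python
def score_mapping(ratings, mapping):
    """Return (summary size, distance) for a mapping from users to group annotations."""
    groups = {}
    for user, rating in ratings.items():
        groups.setdefault(mapping[user], []).append(rating)
    group_avg = {g: sum(vals) / len(vals) for g, vals in groups.items()}
    # Distance: how far each original rating is from its group's average.
    distance = sum(
        abs(rating - group_avg[mapping[user]]) for user, rating in ratings.items()
    ) / len(ratings)
    return len(groups), distance

ratings = {"U1": 9, "U2": 7, "U3": 4}
coarse = {"U1": "G", "U2": "G", "U3": "G"}  # everyone in one group
finer = {"U1": "A", "U2": "A", "U3": "C"}   # audience vs. critics

coarse_score = score_mapping(ratings, coarse)
finer_score = score_mapping(ratings, finer)
print(coarse_score, finer_score)
```

The coarse mapping gives a smaller summary but a larger distance, which is the trade-off a “good” mapping has to balance.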

Page 31: Approximated Provenance for Complex Applications

Conclusions

• Provenance is needed for crowd-sourcing applications to help understand the results and reason about their quality.

• Techniques from database/workflow provenance can be used, but there are special challenges and “opportunities”