Utility of Considering Multiple Alternative Rectifications in Data Cleaning

Preet Inder Singh Rihan, Master's Thesis

Committee Members: Dr. Subbarao Kambhampati (Chair), Dr. Huan Liu, Dr. Hasan Davulcu

Upload: kelii

Post on 24-Feb-2016


TRANSCRIPT

Page 1: Utility of Considering Multiple Alternative Rectifications in Data Cleaning

Preet Inder Singh Rihan, Master's Thesis

Committee Members: Dr. Subbarao Kambhampati (Chair), Dr. Huan Liu, Dr. Hasan Davulcu

Page 2: Importance of Data Cleaning

Data is one of the most useful resources
◦ Crucial to numerous important decision-making and analysis processes

High volume, variety, and velocity of data make it difficult to obtain data in its cleanest form

Page 3: Sources and Types of Noise

A few reasons for the noise present in data
◦ Imperfect sensing devices/information extractors
◦ Heterogeneity in data from multiple sources
◦ Errors in data entry, misspellings, etc.

Data that suffers from quality issues is called dirty data

Example of dirty data

Make   Model    Cartype  Condition  Drivetrain
TSX    Acura             Used       FWD
Honda  Corolla  Sedan    New        FWD
Honda  Civic    Sdna     Used       FWD

Page 4: Current Techniques

Column headings: Types of Problems, Industry Solutions, Academic Approaches

◦ De-Duplication
◦ Inconsistencies
◦ Schema Noise
◦ Outlier detection
◦ Conditional functional dependencies
◦ BayesWipe

Page 5: Common Themes in Current Data Cleaning Techniques

Considers multiple rectifications

Picks the most likely rectification deterministically, using:
◦ fixed rules
◦ domain experts

Page 6: Example

Dirty Database

TID  Make    Model    Cartype  Condition  Drivetrain
1    Honda   Civic    Sedan    Used       FWD
2    Honda   Corolla  Sedan    New        FWD
3    Honda   Civic    Sedan    Used       FWD
4    Honda   Civic    Sedan    Used       FWD
5    Toyota  Corolla  Sedan    New        FWD

Deterministic clean output

TID  Make    Model    Cartype  Condition  Drivetrain
1    Honda   Civic    Sedan    Used       FWD
2    Honda   Civic    Sedan    New        FWD
3    Honda   Civic    Sedan    Used       FWD
4    Honda   Civic    Sedan    Used       FWD
5    Toyota  Corolla  Sedan    New        FWD

Dirty Tuple:      2  Honda   Corolla  Sedan  New  FWD
Rectification 1:  2  Honda   Civic    Sedan  New  FWD  0.6
Rectification 2:  2  Toyota  Corolla  Sedan  New  FWD  0.4
True Tuple:       2  Toyota  Corolla  Sedan  New  FWD

Page 7: Data Cleaning Approaches: Problems

Hard to get perfect fixed rules/domain knowledge

Partially correct rules/knowledge may ignore the true rectification
◦ Results in information loss
◦ Irrecoverable when the original data is decoupled from the cleaned outcome

Page 8: Alternative Approach: Considering Multiple Alternative Candidates after Data Cleaning

Keep multiple alternative rectifications in a probabilistic database

Advantages:
◦ Prevents information loss
◦ Generates query results with higher recall

Page 9: Alternative Approach: Potential Challenges

Keeping multiple alternative rectifications of a dirty tuple poses some challenges:
◦ Query results with many irrelevant results (low precision)
◦ Query processing over probabilistic data
◦ Size of probabilistic data

Page 10: Problem Statement

To investigate the trade-offs of keeping multiple alternative rectifications of a dirty data instance versus keeping a single, deterministically selected clean rectification of it

Page 11: Agenda

◦ Motivation
◦ Deterministic and Probabilistic clean outcomes
◦ Investigation Strategy
◦ Optimization technique
◦ Experiments and Results

Page 12: Background System

For the investigation, BayesWipe [1] is used
◦ End-to-end probabilistic data cleaning system
◦ Cleans structured data
◦ Handles data quality issues due to
  ◦ Inconsistency
  ◦ Incompleteness
  ◦ Substitutions

[1] Y. Hu, S. De, Y. Chen, and S. Kambhampati. Bayesian data cleaning for web data. arXiv preprint arXiv:1204.3677, 2012.

Page 13: BayesWipe

For every tuple T in the dirty database:
◦ A set of rectifications T* is generated
◦ Every T* has a probability value P(T*|T)
◦ P(T*|T) is the system's confidence that T* is the true tuple

Page 14: BayesWipe's Clean Outcomes

BayesWipe is used to produce outcomes in two modes
◦ BayesWipe-DET: only the most likely rectification
◦ BayesWipe-PDB: all rectifications with their associated probabilities

[Diagram: Dirty Data goes into BayesWipe, which generates candidate rectifications T*1, T*2, ..., T*i, ..., T*n for each tuple T; outputs are labeled BayesWipe-DET and BayesWipe-PDB]

Page 15: Example: BayesWipe-DET and BayesWipe-PDB

Dirty Database

TID  Make    Model    Cartype  Condition  Drivetrain
1    Honda   Civik    Sedan    Used       FWD
2    Honda   Corolla  Sedan    New        FWD
3    Honda   Civic    Sedan    Used       FWD
4    Toyota  Corolla  Sedan    New        FWD

BayesWipe-DET

TID  Make    Model    Cartype  Condition  Drivetrain
1    Honda   Civic    Sedan    Used       FWD
2    Honda   Civic    Sedan    New        FWD
3    Honda   Civic    Sedan    Used       FWD
4    Toyota  Corolla  Sedan    New        FWD

BayesWipe-PDB

TID  Make    Model    Cartype  Condition  Drivetrain  Probability
1    Honda   Civic    Sedan    Used       FWD         0.9
     Honda   Civik    Sedan    New        FWD         0.1
2    Toyota  Corolla  Sedan    New        FWD         0.2
     Toyota  Corolla  Sedan    Used       FWD         0.2
     Honda   Civic    Sedan    Used       FWD         0.6
3    Honda   Civik    Sedan    New        FWD         0.05
     Honda   Civic    Sedan    Used       FWD         0.95
4    Toyota  Corolla  Sedan    New        FWD         0.9
     Honda   Corolla  Sedan    New        FWD         0.1
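The two output modes can be sketched in a few lines of Python. This is an illustrative sketch only, not BayesWipe's implementation: the names `CANDIDATES`, `to_pdb`, and `to_det` are hypothetical, and the candidate list is copied from the BayesWipe-PDB table in this example.

```python
from collections import defaultdict

# Candidate rectifications as (tid, values, probability), copied from the
# BayesWipe-PDB example table. The names used here are illustrative only.
CANDIDATES = [
    (1, ("Honda", "Civic", "Sedan", "Used", "FWD"), 0.9),
    (1, ("Honda", "Civik", "Sedan", "New", "FWD"), 0.1),
    (2, ("Toyota", "Corolla", "Sedan", "New", "FWD"), 0.2),
    (2, ("Toyota", "Corolla", "Sedan", "Used", "FWD"), 0.2),
    (2, ("Honda", "Civic", "Sedan", "Used", "FWD"), 0.6),
    (3, ("Honda", "Civik", "Sedan", "New", "FWD"), 0.05),
    (3, ("Honda", "Civic", "Sedan", "Used", "FWD"), 0.95),
    (4, ("Toyota", "Corolla", "Sedan", "New", "FWD"), 0.9),
    (4, ("Honda", "Corolla", "Sedan", "New", "FWD"), 0.1),
]


def to_pdb(candidates):
    """PDB mode: keep every rectification, grouped by tuple id."""
    blocks = defaultdict(list)
    for tid, values, prob in candidates:
        blocks[tid].append((values, prob))
    return dict(blocks)


def to_det(candidates):
    """DET mode: keep only the most probable rectification per tuple id."""
    return {
        tid: max(alternatives, key=lambda alt: alt[1])[0]
        for tid, alternatives in to_pdb(candidates).items()
    }
```

Under this sketch, DET mode discards every alternative but the argmax, which is exactly where information can be lost when the true tuple is a lower-ranked candidate.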

Page 16: BayesWipe-PDB Type

Types of probabilistic database
◦ Tuple Independent
◦ Block Independent Disjoint (BID)
◦ C-Table

BayesWipe-PDB type
◦ Block Independent Disjoint (BID)

Block Independent Disjoint probabilistic database

TID  Make    Model    Cartype  Condition  Drivetrain  Probability
12   Honda   Civic    Sedan    Used       FWD         0.85
     Honda   Corolla  Sedan    New        FWD         0.10
     Honda   Civic    Sdfkshf  Used       FWD         0.05
210  Toyota  Corolla  Sedan    New        FWD         0.9
     Honda   Corolla  Sedan    New        FWD         0.1
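As a rough illustration of what "block independent disjoint" means, the sketch below checks the BID constraint and computes two simple probabilities. It assumes the standard BID semantics (alternatives within a block are mutually exclusive; different blocks are independent). The function names are hypothetical, and the data is copied from the example table with TIDs as they appear in the transcript.

```python
import math

# Blocks keyed by TID; each alternative is (values, probability).
BID = {
    12: [(("Honda", "Civic", "Sedan", "Used", "FWD"), 0.85),
         (("Honda", "Corolla", "Sedan", "New", "FWD"), 0.10),
         (("Honda", "Civic", "Sdfkshf", "Used", "FWD"), 0.05)],
    210: [(("Toyota", "Corolla", "Sedan", "New", "FWD"), 0.9),
          (("Honda", "Corolla", "Sedan", "New", "FWD"), 0.1)],
}


def is_valid_bid(blocks, tol=1e-9):
    """Disjoint alternatives: probabilities in a block sum to at most 1."""
    return all(
        all(0.0 <= p <= 1.0 for _, p in alts)
        and sum(p for _, p in alts) <= 1.0 + tol
        for alts in blocks.values()
    )


def block_probability(alts, predicate):
    """Disjoint alternatives: probabilities of matching alternatives add up."""
    return sum(p for values, p in alts if predicate(values))


def any_block_probability(blocks, predicate):
    """Independent blocks: P(at least one block matches) = 1 - prod(1 - P_block)."""
    return 1.0 - math.prod(
        1.0 - block_probability(alts, predicate) for alts in blocks.values()
    )
```

For example, `block_probability(BID[210], lambda v: v[0] == "Honda")` adds up only the Honda alternative of block 210.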

Page 17: BayesWipe-PDB Storage

BayesWipe-PDB is stored in a relational database:
◦ SQL Server

Query processing engine:
◦ Mystiq [2], a prototype probabilistic database management system

[2] J. Boulos, N. Dalvi, B. Mandhani, S. Mathur, C. Re, and D. Suciu. MYSTIQ: a system for finding more answers by using probabilities. In SIGMOD, pages 891-893. ACM, 2005.

Page 18: Agenda

◦ Motivation
◦ Deterministic and Probabilistic clean outcomes
◦ Investigation Strategy
◦ Optimization technique
◦ Experiments and Results

Page 19: Investigation Strategy

Criteria to compare BayesWipe-PDB and BayesWipe-DET
◦ Accuracy of query results
◦ Scalability

Page 20: Accuracy of Query Results

To check whether BayesWipe-PDB improves query results

Query results from BayesWipe-PDB and BayesWipe-DET are compared using:
◦ Precision of query results
◦ Recall of query results
◦ Total increase in true positives over multiple queries
◦ Total increase in false positives over multiple queries

Page 21: Query Results: BayesWipe-PDB and BayesWipe-DET

BayesWipe-DET query results
◦ Set of deterministic tuples

BayesWipe-PDB query results
◦ Set of probabilistic tuples
◦ Multiple rectifications of a tuple

BayesWipe-DET

TID  Make    Model    Cartype  Condition  Drivetrain
1    Honda   Civic    Sedan    Used       FWD
2    Honda   Civic    Sedan    New        FWD
3    Honda   Civic    Sedan    Used       FWD
4    Toyota  Corolla  Sedan    New        FWD

σ_{make = 'Honda'}

Deterministic Query Results

TID  Make    Model    Cartype  Condition  Drivetrain
1    Honda   Civic    Sedan    Used       FWD
2    Honda   Civic    Sedan    New        FWD
3    Honda   Civic    Sedan    Used       FWD

BayesWipe-PDB

TID  Make    Model    Cartype  Condition  Drivetrain  Probability
1    Honda   Civic    Sedan    Used       FWD         0.9
     Honda   Civik    Sedan    New        FWD         0.1
2    Toyota  Corolla  Sedan    New        FWD         0.2
     Toyota  Corolla  Sedan    Used       FWD         0.2
     Honda   Civic    Sedan    Used       FWD         0.6
3    Honda   Civik    Sedan    New        FWD         0.05
     Honda   Civic    Sedan    Used       FWD         0.95
4    Toyota  Corolla  Sedan    New        FWD         0.9
     Honda   Corolla  Sedan    New        FWD         0.1

σ_{make = 'Honda'}

Probabilistic Query Results

TID  Make    Model    Cartype  Condition  Drivetrain  Probability
1    Honda   Civic    Sedan    Used       FWD         0.9
2    Honda   Civic    Sedan    Used       FWD         0.6
3    Honda   Civik    Sedan    New        FWD         0.05
     Honda   Civic    Sedan    Used       FWD         0.95
4    Honda   Corolla  Sedan    New        FWD         0.1

Page 22: Accuracy of Query Results: Evaluation Challenges

Precision/recall calculation is not straightforward

Evaluation challenges:
◦ Defining accuracy/relevance of a resultant tuple
◦ Precision/recall for query results from BayesWipe-PDB

Page 23: Defining Relevance/Accuracy of Resultant Tuple

TID  Make    Model    Cartype  Condition  Drivetrain
1    Honda   Civic    Sedan    Used       FWD
2    Honda   Corolla  Sedan    New        FWD
3    Honda   Civic    Sedan    Used       AWD
4    Toyota  Corolla  Sedan    New        FWD

TID  Make    Model    Cartype  Condition  Drivetrain
1    Honda   Civic    Sedan    Used       FWD
2    Honda   Corolla  Sedan    New        FWD
3    Honda   Civic    Sedan    Used       FWD
4    Toyota  Corolla  Sedan    New        FWD

Table captions: Ground Truth Results, Observed Results

Query Results for

3    Honda   Civic    Sedan    Used       AWD
3    Honda   Civic    Sedan    Used       FWD

Result captions: Ground Truth Result, Observed Result

Page 24: Relevance of Resulting Tuples

In this precision and recall computation
◦ Relevance is defined only by tuple IDs

A probabilistic/deterministic tuple from the query results is relevant if:
◦ Its tuple ID appears in the query results from the ground truth
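Tuple-ID-based relevance can be made concrete with a small sketch. The function name and the convention of returning 1.0 when a denominator is empty are assumptions made for illustration; they are not taken from the thesis.

```python
def precision_recall(returned_tids, truth_tids):
    """Precision and recall where relevance is decided by tuple id alone.

    A returned tuple counts as relevant if its id is among the ids returned
    by the same query on the ground truth, regardless of attribute values.
    """
    returned, truth = set(returned_tids), set(truth_tids)
    true_positives = len(returned & truth)
    # Assumed convention: an empty denominator yields 1.0.
    precision = true_positives / len(returned) if returned else 1.0
    recall = true_positives / len(truth) if truth else 1.0
    return precision, recall
```

With this definition, a returned tuple whose attribute values differ from the ground truth still counts as a hit as long as its ID matches.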

Page 25: Precision/Recall of Probabilistic Query Results

Precision and recall are not defined for probabilistic query results

True precision/recall for probabilistic query results
◦ Calculated over all possible worlds*
◦ Overall precision/recall is the weighted sum of precision/recall over all possible worlds

Exponential number of possible worlds

* A possible world is a state of a probabilistic database in which each random variable in the PDB has been assigned one of its possible values
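To see why the exact computation is expensive, the sketch below enumerates every possible world of a small BID table and accumulates probability-weighted precision and recall. It is a sketch under stated assumptions: each block's probabilities sum to 1, relevance is by tuple ID, empty denominators yield 1.0, and the ground-truth ID set `{1, 3}` in the usage note is hypothetical. The block data is copied from the BayesWipe-PDB example.

```python
from itertools import product
from math import prod

# BID blocks keyed by TID; each alternative is (values, probability).
BLOCKS = {
    1: [(("Honda", "Civic", "Sedan", "Used", "FWD"), 0.9),
        (("Honda", "Civik", "Sedan", "New", "FWD"), 0.1)],
    2: [(("Toyota", "Corolla", "Sedan", "New", "FWD"), 0.2),
        (("Toyota", "Corolla", "Sedan", "Used", "FWD"), 0.2),
        (("Honda", "Civic", "Sedan", "Used", "FWD"), 0.6)],
    3: [(("Honda", "Civik", "Sedan", "New", "FWD"), 0.05),
        (("Honda", "Civic", "Sedan", "Used", "FWD"), 0.95)],
    4: [(("Toyota", "Corolla", "Sedan", "New", "FWD"), 0.9),
        (("Honda", "Corolla", "Sedan", "New", "FWD"), 0.1)],
}


def expected_precision_recall(blocks, predicate, truth_tids):
    """Weighted sum of precision/recall over all possible worlds."""
    tids = list(blocks)
    truth = set(truth_tids)
    exp_precision = exp_recall = 0.0
    n_worlds = 0
    # One world = one alternative chosen per block; count is the product
    # of block sizes, i.e. exponential in the number of blocks.
    for world in product(*(blocks[t] for t in tids)):
        weight = prod(p for _, p in world)
        returned = {t for t, (values, _) in zip(tids, world) if predicate(values)}
        hits = len(returned & truth)
        exp_precision += weight * (hits / len(returned) if returned else 1.0)
        exp_recall += weight * (hits / len(truth) if truth else 1.0)
        n_worlds += 1
    return exp_precision, exp_recall, n_worlds
```

Calling `expected_precision_recall(BLOCKS, lambda v: v[0] == "Honda", {1, 3})` visits 2 × 3 × 2 × 2 = 24 worlds for just four tuples, which is what motivates the approximations discussed next.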

Page 26: Approximate Precision/Recall of Probabilistic Results

Two ways to approximate the precision/recall calculation
◦ Consider partial belongingness of tuples
  ◦ Where P(t) is the probability of tuple t
◦ Use a pass threshold to classify probabilistic tuples as query results or not
  ◦ Precision and recall are calculated using the standard formulas for query results
  ◦ Traditional way [3] to handle uncertain results

[3] R. Ananthakrishna, S. Chaudhuri, and V. Ganti. Eliminating fuzzy duplicates in data warehouses. In VLDB, pages 586-597. VLDB Endowment, 2002.

Page 27: Precision/Recall Approximation Using a Pass Threshold

A fixed pass threshold θ is applied to the total probability
◦ Aggregated probabilities of all rectifications of a probabilistic tuple

Total probability >= θ
◦ Yes: query result
◦ No: rejected

Probabilistic query results

TID  Make    Model    Cartype  Condition  Drivetrain  Probability
1    Honda   Civic    Sedan    Used       FWD         0.9
2    Honda   Civic    Sedan    Used       FWD         0.6
3    Honda   Civik    Sedan    New        FWD         0.05
     Honda   Civic    Sedan    Used       FWD         0.95
4    Honda   Corolla  Sedan    New        FWD         0.1

Aggregated probabilities of probabilistic tuples

TID  Probability
1    0.9
2    0.6
3    1
4    0.1

Determinized results (threshold θ = 0.2)

TID
1
2
3
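The aggregation and thresholding steps in this example can be reproduced with a short sketch. The names `MATCHES` and `determinize` are hypothetical; the probabilities are those of the probabilistic query results shown in this example.

```python
from collections import defaultdict

# (tid, probability) for each rectification that satisfies the query,
# taken from the probabilistic query results in this example.
MATCHES = [(1, 0.9), (2, 0.6), (3, 0.05), (3, 0.95), (4, 0.1)]


def determinize(matches, theta):
    """Sum probabilities per tuple id and keep ids whose total is >= theta."""
    total = defaultdict(float)
    for tid, probability in matches:
        total[tid] += probability
    return sorted(tid for tid, p in total.items() if p >= theta)
```

Once the probabilistic results are determinized this way, standard precision and recall can be computed on the surviving tuple IDs.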

Page 28: Scalability

To check whether BayesWipe-PDB is scalable

Two comparisons are performed
◦ Size of BayesWipe-PDB vs. size of BayesWipe-DET
◦ Query processing time over BayesWipe-PDB vs. query processing time over BayesWipe-DET

Page 29: Potential Issues with BayesWipe-PDB

The number of rectifications with low probabilities increases as data size increases

Potential issues:
◦ Query results with very low precision
  ◦ One way to control this is a well-chosen pass threshold
◦ Scalability issues
  ◦ High physical space
  ◦ High query processing time

Page 30: Agenda

◦ Motivation
◦ Deterministic and Probabilistic clean outcomes
◦ Investigation Strategy
◦ Optimization technique
◦ Experiments and Results

Page 31: Optimization Technique

Reason for potential issues
◦ Too many irrelevant results in BayesWipe-PDB

Pre-Pruning, an optimization technique
◦ Prevents irrelevant rectifications from being stored in BayesWipe-PDB
◦ Checks every rectification T* of tuple T
◦ Stores T* in BayesWipe-PDB if
  ◦ T* passes the pre-pruning algorithm

Page 32: PrePruned BayesWipe-PDB

The probabilistic database stores multiple candidate clean versions T* after pre-pruning

[Diagram: Dirty data -> BayesWipe -> Multiple alternatives (T*) -> Pruning -> BayesWipe-PDB]

Page 33: Pre-Pruning Algorithm

The pre-pruning algorithm considers every candidate clean version and its associated probability, i.e., P(T*|T)

[Diagram: P(T*|T) axis with thresholds α and β, and regions labeled Rejected, Further Investigated, and Accepted]

Page 34: Pre-Pruning Algorithm

In the case of rare, legitimate tuples, it is possible that both T and T* have low probabilities

For this, the prior of tuple T and the prior of tuple T* are considered

P[T] = (number of times tuple T occurs in the dirty database) / (number of tuples in the dirty database)

If [condition not recoverable from the transcript]
◦ T* is kept as a clean version of the tuple T

The values of α, β and γ are set to 0.009, 0.5 and 5, respectively

[Diagram: P(T*|T) axis with thresholds α and β, and regions labeled Rejected, Further Investigated, and Accepted]
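The parts of pre-pruning that survive in the transcript can be sketched as follows. Several things here are assumptions: that candidates below α are rejected and those above β are accepted (inferred from the diagram labels and the stated values), how the boundaries themselves are treated, and all function names. The rule applied to the "further investigated" band, which involves the priors and γ, is not recoverable from the transcript and is deliberately left unimplemented.

```python
from collections import Counter

# Threshold values stated in the slides. GAMMA is listed for completeness;
# the rule that uses it is not recoverable, so it is unused in this sketch.
ALPHA, BETA, GAMMA = 0.009, 0.5, 5


def band(p_candidate, alpha=ALPHA, beta=BETA):
    """Classify a candidate T* by P(T*|T).

    Assumption: below alpha is rejected, above beta is accepted, and
    anything in between (boundaries included) needs further investigation.
    """
    if p_candidate < alpha:
        return "rejected"
    if p_candidate > beta:
        return "accepted"
    return "further investigated"


def priors(dirty_tuples):
    """P[T] = occurrences of T in the dirty database / number of tuples."""
    counts = Counter(dirty_tuples)
    n = len(dirty_tuples)
    return {t: c / n for t, c in counts.items()}
```

A full implementation would consult `priors(...)` for both T and T* when `band(...)` returns "further investigated"; how exactly is left open here.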

Page 35: Agenda

◦ Motivation
◦ Deterministic and Probabilistic clean outcomes
◦ Investigation Strategy
◦ Optimization technique
◦ Experiments and Results

Page 36: Experimental Setup

Experiments are performed on a used-car dataset crawled from Google Base

Dataset size varies from 1000 tuples to 30000 tuples

Synthetic noise is introduced at random into the clean dataset
◦ Noise level varies from 1% to 20%

Random queries were selected to compare the quality of query results from BayesWipe-DET and BayesWipe-PDB

Page 37: Experimental Results

Present findings from the comparison of BayesWipe-PDB and BayesWipe-DET on
◦ Accuracy of query results (precision and recall)
◦ Scalability (size and query processing time)

Present the effect of the optimization technique on BayesWipe-PDB

Page 38: BayesWipe-PDB vs. BayesWipe-DET Recall and Precision

Data Size = 2500, Noise = 10%, Threshold = 0.1

[Figure: two bar charts, one for recall and one for precision (y-axis 0 to 1), each comparing BayesWipe-PDB and BayesWipe-DET per query. Query labels on the x-axis: make = acura, model = outlander sports, cartype = sedan, make = bmw & condition = used, model = jetta, model = cooper s, model = h3 mini, Average]

Page 39: BayesWipe-PDB vs. BayesWipe-DET: Effect of Threshold

Dataset size = 30000

[Figure: average BayesWipe-PDB precision and average BayesWipe-PDB recall (y-axis, 0 to 1) plotted against the threshold (x-axis, 0 to 0.8)]

DET Recall = 0.82

DET Precision = 0.997

Page 40: Accuracy of Query Results from Multiple Random Queries

The average of precision and recall values over multiple random queries does not give a clear picture

Comparison on non-normalized metrics over 100 random queries
◦ Total increase in the number of true positives generated
  ◦ True Positives (BayesWipe-PDB) - True Positives (BayesWipe-DET)
◦ Total increase in the number of false positives generated
  ◦ False Positives (BayesWipe-PDB) - False Positives (BayesWipe-DET)
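These two non-normalized metrics amount to simple set arithmetic per query, summed over all queries. The sketch below assumes relevance by tuple ID and that the BayesWipe-PDB results have already been determinized (for instance with a pass threshold); the function name and input layout are hypothetical.

```python
def tp_fp_gain(queries):
    """Total TP and FP gain of PDB over DET across a batch of queries.

    Each query is a triple (truth_tids, det_tids, pdb_tids) of tuple-id
    collections: ground-truth results, BayesWipe-DET results, and
    determinized BayesWipe-PDB results.
    """
    tp_gain = fp_gain = 0
    for truth, det, pdb in queries:
        truth, det, pdb = set(truth), set(det), set(pdb)
        tp_gain += len(pdb & truth) - len(det & truth)
        fp_gain += len(pdb - truth) - len(det - truth)
    return tp_gain, fp_gain
```

Because the counts are summed rather than averaged, queries with large result sets contribute proportionally more than they would under per-query precision and recall.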

Page 41: BayesWipe-PDB vs. BayesWipe-DET: True Positives and False Positives Gain

Data Size = 30000, Threshold = 0.1

[Figure: gain (y-axis, 0 to 600) against noise percentage (x-axis: 1, 2, 5, 10, 15, 20). Legend entries: Increase in True Positives, Increase in False Negative]

Page 42: BayesWipe-PDB vs. BayesWipe-DET: Size of Database

Data Size = 30000

[Figure: number of tuples (y-axis, 0 to 400000) against noise (x-axis: 1, 2, 5, 10, 15, 20). Legend entries: BayesWipe-PDB database size, BayesWipe-DET database size]

Page 43: BayesWipe-PDB vs. BayesWipe-DET: Average Query Processing Time

Data Size = 30000

[Figure: time in ms (y-axis, 0 to 80) against noise percentage (x-axis: 1, 2, 5, 10, 15, 20). Legend entries: BayesWipe-PDB query processing time, BayesWipe-DET query processing time]

Page 44: Optimized BayesWipe-PDB vs. BayesWipe-DET Recall and Precision

[Figure: two charts, one for recall (y-axis ticks 0.86 to 0.98) and one for precision (y-axis ticks 0.2 to 1), each comparing BayesWipe-PDB, Optimized BayesWipe-PDB, and BayesWipe-DET per query. Query labels on the x-axis: model = 'continental gt', model = 'equinox', make = 'isuzu', make = 'ford', make = 'hyundai', model = '525', '550 gran turismo', model = 'enclave']

Page 45: Optimized BayesWipe-PDB: True Positives and False Positives Gain

Data Size = 30000, Threshold = 0.1

[Figure: gain (y-axis, 0 to 600) against noise percentage (x-axis: 1, 2, 5, 10, 15, 20). Legend entries: Increase in True Positives, Increase in True Positive (Optimized), Increase in False Negative, Increase in False Positive (Optimized)]

Page 46: Optimized BayesWipe-PDB: Database Size Comparison

Data Size = 30000

[Figure: number of tuples (y-axis, 0 to 400000) against noise percentage (x-axis: 1, 2, 5, 10, 15, 20). Legend entries: BayesWipe-DET database size, BayesWipe-PDB database size, PrePruned BayesWipe-PDB database size]

Page 47: Optimized BayesWipe-PDB: Average Query Processing Time

Data Size = 30000

[Figure: time in ms (y-axis, 0 to 80) against noise percentage (x-axis: 1, 2, 5, 10, 15, 20). Legend entries: BayesWipe-DET query processing time, BayesWipe-PDB query processing time, PrePruned BayesWipe-PDB query processing time]

Page 48: Results Summary

BayesWipe-DET
◦ Query processing time: Low
◦ Database size: Same as dirty data
◦ Precision: High
◦ Recall: Low

BayesWipe-PDB
◦ Query processing time: Very high
◦ Database size: Very large
◦ Precision: Increases with increase in threshold value
◦ Recall: Decreases with increase in threshold value

Optimized BayesWipe-PDB
◦ Query processing time: High
◦ Database size: Large
◦ Precision: Higher than the precision of BayesWipe-DET in most cases (65% of the time)
◦ Recall: Higher than or equal to the recall of BayesWipe-DET

Page 49: Conclusion

I studied the utility of considering multiple alternative rectifications in data cleaning

For that, I compared BayesWipe-PDB and BayesWipe-DET

BayesWipe-PDB always has better recall for query results, at the cost of precision

BayesWipe-PDB also requires more physical space and higher query processing time

The optimization technique provides a way to reduce the precision cost and the scalability issues