Circumventing Data Quality Problems Using Multiple Join Paths Yannis Kotidis, Athens University of Economics and Business Amélie Marian, Rutgers University Divesh Srivastava, AT&T Labs-Research

Post on 20-Dec-2015


Page 1:

Circumventing Data Quality Problems Using Multiple Join Paths

Yannis Kotidis, Athens University of Economics and Business
Amélie Marian, Rutgers University
Divesh Srivastava, AT&T Labs-Research

Page 2:

9/11/2006 Amélie Marian - Rutgers University 2

Motivating Example

[Figure: schema diagram of four applications (Sales, Provisioning, Inventory, Ordering) linked through the fields TN, BAN, CustName, ORN, PON, SubPON, and CircuitID]

- TN: Telephone Number
- ORN: Order Number
- BAN: Billing Account Number
- PON: Provisioning Order Number
- SubPON: Related PON

What is the Circuit ID associated with a Telephone Number that appears in SALES?

Page 3:

Motivations

- Data applications with overlapping features
  - Data integration
  - Web sources
- Data quality issues (duplicates, nulls, default values, data inconsistencies)
  - Data-entry problems
  - Data integration problems

Page 4:

Contributions

- Multiple Join Path (MJP) framework
  - Quantifies answer quality
  - Takes corroborating evidence into account
  - Agglomerative scoring of answers
- Answer computation techniques
  - Designed for MJP scoring methodologies
  - Several output options (top-k, top-few)
- Experimental evaluation on real data
  - VIP integration platform
  - Quality of answers
  - Efficiency of our techniques

Page 5:

Outline

- Multiple Join Path Framework
  - Problem Definition
- Our Approach
  - Scoring Answers
  - Computing Answers
- Experimental Evaluation
- Related Work

Page 6:

Multiple Join Path Framework: Problem Definition

- Query of the form: "Given X=a, find the value of Y"
- Examples:
  - Given a telephone number of a customer, find the ID of the circuit to which the telephone line is attached. (One answer expected)
  - Given a circuit ID, find the names of the customers whose telephones are attached to that circuit. (Possibly several answers)

Page 7:

Schema Graph

- Directed acyclic graph
- Nodes are field names
- Intra-application edge: links fields in the same application
- Inter-application edge: links fields across applications

All (non-source, non-sink) nodes in the schema graph are (possibly approximate) primary or foreign keys of their applications.
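
The schema graph described above can be sketched as an adjacency list, with join paths enumerated by a simple recursive walk. This is only an illustration: the edges in `SCHEMA` are hypothetical (the slide's diagram is not recoverable from the transcript), and `join_paths` is a name introduced here, not one from the talk.

```python
from typing import Dict, List

# Hypothetical schema graph. Nodes are "Application.Field" names; the edges
# are illustrative assumptions, not the edges of the authors' diagram.
SCHEMA: Dict[str, List[str]] = {
    "Sales.TN": ["Inventory.TN", "Sales.CustName"],       # inter-, intra-application
    "Sales.CustName": ["Provisioning.CustName"],          # inter-application
    "Provisioning.CustName": ["Provisioning.PON"],        # intra-application
    "Provisioning.PON": ["Inventory.PON"],                # inter-application
    "Inventory.PON": ["Inventory.CircuitID"],             # intra-application
    "Inventory.TN": ["Inventory.CircuitID"],              # intra-application
    "Inventory.CircuitID": [],                            # sink node Y
}


def join_paths(graph: Dict[str, List[str]], source: str, sink: str) -> List[List[str]]:
    """Return every path from source to sink in a directed acyclic graph."""
    if source == sink:
        return [[sink]]
    paths: List[List[str]] = []
    for nxt in graph.get(source, []):
        for tail in join_paths(graph, nxt, sink):
            paths.append([source] + tail)
    return paths


if __name__ == "__main__":
    for path in join_paths(SCHEMA, "Sales.TN", "Inventory.CircuitID"):
        print(" -> ".join(path))
```

Because the graph is acyclic, the recursion terminates without a visited set; a schema with cycles would need one.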

Page 8:

Data Graph

- Given a specific value of the source node X, what are the values of the sink node Y?
- Considers all join paths from X to Y in the schema graph

[Figure: example data graph; dead-end paths are marked with an X, one of them annotated "no corresponding SALES.BAN"]

Example: two paths lead to answer c1

Page 9:

Scoring Answers

- Which are the correct values?
  - Unclean data
  - No a priori knowledge
- Technique to score data edges
  - What is the probability that the association between the fields linked by the edge is correct?
- Probabilistic interpretation of data edge scores to score full join paths
  - Edge score aggregation
  - Independent of the length of the path

Page 10:

Scoring Data Edges

- Rely on functional dependencies (we are considering fields that are keys)
- Data edge scores model the error in the data
- Inter-application edge: score equals 1, unless approximate matching is used
- Intra-application edge: fields A and B within the same application
  - A → B (and symmetrically for B → A):

    $$score(a, b_i) = \frac{1}{|\{(a, b_i),\ i = 1, \ldots, n\}|}$$

    where the $b_i$ are the values instantiated from querying the application with value $a$
  - A → B and B → A:

    $$score(a_i, b_j) = \frac{1}{|\{(a_i, B.*),\ (A.*, b_j)\}|}$$
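
One way to read the two intra-application formulas is as counts over the (A, B) pairs an application returns. The sketch below follows that reading; the function names, the sample rows, and the choice to return 0.0 for a pair that is not present are all assumptions made for illustration.

```python
from typing import List, Tuple

Row = Tuple[str, str]  # one (A value, B value) pair stored in an application


def score_one_way(rows: List[Row], a: str, b: str) -> float:
    """A -> B only: 1 / number of distinct B values returned for a."""
    values = {y for (x, y) in rows if x == a}
    if b not in values:
        return 0.0  # assumption: an absent pair gets no score
    return 1.0 / len(values)


def score_two_way(rows: List[Row], a: str, b: str) -> float:
    """A -> B and B -> A: 1 / number of distinct pairs that mention a or b."""
    pairs = set(rows)
    if (a, b) not in pairs:
        return 0.0  # assumption: an absent pair gets no score
    touching = {(x, y) for (x, y) in pairs if x == a or y == b}
    return 1.0 / len(touching)


if __name__ == "__main__":
    # Hypothetical rows: tn1 maps to two circuits, and c1 is shared by two TNs.
    rows = [("tn1", "c1"), ("tn1", "c2"), ("tn2", "c1")]
    print(score_one_way(rows, "tn1", "c1"))  # tn1 has two candidate circuits
    print(score_two_way(rows, "tn1", "c1"))  # three pairs mention tn1 or c1
```

A clean key relationship (one B value per A value) scores 1; every conflicting value found for the same key lowers the score of each candidate edge.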

Page 11:

Scoring Data Paths

- A single data path is scored using a simple sequential composition of its data edge probabilities:

  $$pathScore = \prod_{i=1}^{n} edgeScore_i$$

  Example: the path X → a → b → Y with edge scores 0.5, 0.8, and 0.6 has pathScore = 0.5 × 0.8 × 0.6 = 0.24
- Data paths leading to the same answer are scored using parallel composition (independence assumption):

  $$parallelPathScore = pathScore_1 + pathScore_2 - (pathScore_1 \times pathScore_2)$$

  Example: a second path X → c → Y with edge scores 0.5 and 0.4 has pathScore = 0.2, so the combined score is 0.24 + 0.2 − (0.24 × 0.2) = 0.392
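
The two compositions can be checked numerically with a few lines of Python. The function names are introduced here for illustration. For more than two paths, `parallel_score` uses 1 − ∏(1 − p), which is the same independence-based formula applied repeatedly and reduces to p1 + p2 − p1·p2 for two paths.

```python
import math
from typing import Iterable


def path_score(edge_scores: Iterable[float]) -> float:
    """Sequential composition: product of the edge scores along one path."""
    return math.prod(edge_scores)


def parallel_score(path_scores: Iterable[float]) -> float:
    """Parallel composition of independent paths that reach the same answer."""
    miss = 1.0
    for p in path_scores:
        miss *= 1.0 - p  # probability that every path so far is wrong
    return 1.0 - miss


if __name__ == "__main__":
    first = path_score([0.5, 0.8, 0.6])   # X -> a -> b -> Y
    second = path_score([0.5, 0.4])       # X -> c -> Y
    print(round(first, 3), round(second, 3), round(parallel_score([first, second]), 3))
```

Adding a path can only raise the combined score, which is why a score computed from a subset of paths is a lower bound on the final score.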

Page 12:

Identifying Answers

- Only interested in best answers
- Standard top-k techniques do not apply
  - Answer scores can always be increased by new information
- We keep score range information
- Return top answers when identified; they may not have complete scores
- Two return strategies
  - Top-k
  - Top-few (weaker stop condition)
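
A stop test based on score ranges might look like the sketch below. It assumes each candidate answer carries a (lower, upper) score range and that every candidate has already been discovered; the slides do not spell out the exact conditions, so the function name and the test itself are illustrative, and the weaker top-few condition is not modeled.

```python
from typing import Dict, List, Optional, Tuple

ScoreRange = Tuple[float, float]  # (lower bound, upper bound) of an answer's score


def top_k_if_identified(ranges: Dict[str, ScoreRange], k: int) -> Optional[List[str]]:
    """Return the k best answers once no other answer can overtake them, else None."""
    ranked = sorted(ranges, key=lambda ans: ranges[ans][0], reverse=True)
    if len(ranked) < k:
        return None
    top, rest = ranked[:k], ranked[k:]
    kth_lower = ranges[top[-1]][0]
    # Highest score any remaining answer could still reach.
    best_rival_upper = max((ranges[ans][1] for ans in rest), default=0.0)
    return top if kth_lower >= best_rival_upper else None


if __name__ == "__main__":
    print(top_k_if_identified({"c1": (0.39, 0.6), "c2": (0.1, 0.3)}, k=1))
    print(top_k_if_identified({"c1": (0.39, 0.6), "c2": (0.1, 0.5)}, k=1))
```

In the first call c1 can be returned even though its exact score is still unknown, which matches the slide's remark that returned answers may not have complete scores.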

Page 13:

Computing Answers

- Take advantage of early pruning
  - Only interested in best answers
- Incremental data graph computation
  - Probes to each application
  - Cost model is number of probes
- Standard graph searching techniques (DFS, BFS) do not take advantage of score information
- We propose a technique based on the notion of maximum benefit

Page 14:

Maximum Benefit

- Benefit computation of a path uses two components
  - Known scores of the explored data edges
  - Best way to augment an answer's scores
    - Uses residual benefit of unexplored schema edges
- Our strategy makes choices that aim at maximizing this benefit metric
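
The slides do not give the benefit formula, so the following is only a generic best-first sketch in the same spirit, not the authors' algorithm. It assumes the benefit of a partial path is the product of its known edge scores times an optimistic residual of 1.0 for every unexplored edge, and it counts one probe per expanded value. `DATA`, `ANSWERS`, and `explore` are hypothetical names; the data graph reuses the edge scores from the path-scoring example.

```python
import heapq
from typing import Dict, List, Set, Tuple

# Hypothetical data graph: value -> list of (next value, data-edge score).
# In a real system each lookup would be a probe to an application.
DATA: Dict[str, List[Tuple[str, float]]] = {
    "x": [("a", 0.5), ("c", 0.5)],
    "a": [("b", 0.8)],
    "b": [("y1", 0.6)],
    "c": [("y1", 0.4)],
}
ANSWERS: Set[str] = {"y1"}  # values of the sink node Y


def explore(source: str, data: Dict[str, List[Tuple[str, float]]],
            answers: Set[str], max_probes: int) -> Tuple[Dict[str, float], int]:
    """Expand partial paths in decreasing order of optimistic benefit."""
    scores: Dict[str, float] = {}  # answer -> parallel-composed score so far
    probes = 0
    # heapq is a min-heap, so benefits are stored negated.
    frontier: List[Tuple[float, str]] = [(-1.0, source)]
    while frontier:
        neg_benefit, node = heapq.heappop(frontier)
        known = -neg_benefit
        if node in answers:
            prev = scores.get(node, 0.0)
            scores[node] = prev + known - prev * known  # parallel composition
            continue
        if probes >= max_probes:
            continue  # probe budget exhausted: leave this path unexplored
        probes += 1  # one probe retrieves the outgoing data edges of this value
        for nxt, edge_score in data.get(node, []):
            heapq.heappush(frontier, (-(known * edge_score), nxt))
    return scores, probes


if __name__ == "__main__":
    print(explore("x", DATA, ANSWERS, max_probes=10))
```

A value reachable through several paths is expanded once per path here, which keeps the sketch short but would waste probes in practice; caching probe results is the obvious refinement.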

Page 15:

VIP Experimental Platform

- Integration platform developed at AT&T
- 30 legacy systems
- Real data
- Developed as a platform for resolving disputes between applications that are due to data inconsistencies
- Front-end web interface

Page 16:

VIP Queries

- Random sample of 150 user queries
- Analysis shows that queries can be classified according to the number of answers they retrieve:
  - noAnswer (nA): 56 queries
  - anyAnswer (aA): 94 queries
    - oneLarge (oL): 47 queries
    - manyLarge (mL): 4 queries
    - manySmall (mS): 8 queries
  - heavyHitters (hH): 10 queries that returned between 128 and 257 answers per query

Page 17:

VIP Schema Graph

[Figure: paths leading to an answer / paths leading to the top-1 answer (94 queries)]

Not considering all paths may lead to missing top-1 answers

Page 18:

Number of Parallel Paths Contributing to the Top-1 Answer

[Figure: chart with axes labeled "Number of Parallel Paths" and "Frequency Count"]

Average of 10 parallel paths per answer, 2.5 significant

Page 19:

Cost of Execution

Page 20:

Related Work

- Keyword Search in DBMS (BANKS, DBXPlorer, DISCOVER, ObjectRank)
  - Query is set of keywords
  - Top-k query model
  - DB as data graph
  - Do not agglomerate scores
- Top-k query evaluation (TA, MPro, Upper)
  - Consider tuples as an entity
  - Wait for exact answer (except for NRA)
  - Do not agglomerate scores
- Probabilistic ranking of DB results
  - Queries not selective, large answer set
- We take corroborative evidence into account to rank query results

Page 21:

Conclusion

- Multiple Join Path Framework
  - Uses corroborating evidence to identify high quality results
  - Looks at all paths in the schema graph
- Scoring mechanism
  - Probabilistic interpretation
  - Takes schema information into account
- Techniques to compute answers
  - Take into account agglomerative scoring
  - Top-k and top-few