data quality aware query systems

35
Data Quality Aware Query Systems N. Khodabandehloo Yeganeh, S. Sadiq, K. Deng, X. Xue School of Information Technology and Electrical Engineering The University of Queensland

Upload: brenden-duncan

Post on 01-Jan-2016

32 views

Category:

Documents


3 download

DESCRIPTION

Data Quality Aware Query Systems. N. Khodabandehloo Yeganeh, S. Sadiq, K. Deng, X. Xue School of Information Technology and Electrical Engineering The University of Queensland. Understanding Data Quality. Garbage-in Garbage-out !. ???. - PowerPoint PPT Presentation

TRANSCRIPT

Data Quality Aware Query SystemsN. Khodabandehloo Yeganeh, S. Sadiq, K. Deng, X. Xue

School of Information Technology and Electrical Engineering

The University of Queensland

Understanding Data Quality

2

Garbage-in Garbage-out !

???I am sorry Mrs Smith, but our records show that you

are dead!

"A One-House, $400 Million Bubble Goes Pop," New York Times, February 15, 2006 by Gretchen Ruethling. A house valued at $121,900 skyrocketed to $400 million after someone "most likely hit a wrong key and changed the figure in the county's computer system." The article goes on, "The error probably happened when someone tried to check the value of a house in a county computer program that showed real estate values, but then accidentally gained access to a different program without authorization, said James Murphy, the county treasurer. The inflated value ...was used to calculate tax rates and led the county to expect $8 million in property taxes that did not exist."

"A One-House, $400 Million Bubble Goes Pop," New York Times, February 15, 2006 by Gretchen Ruethling. A house valued at $121,900 skyrocketed to $400 million after someone "most likely hit a wrong key and changed the figure in the county's computer system." The article goes on, "The error probably happened when someone tried to check the value of a house in a county computer program that showed real estate values, but then accidentally gained access to a different program without authorization, said James Murphy, the county treasurer. The inflated value ...was used to calculate tax rates and led the county to expect $8 million in property taxes that did not exist."

www.dataqualitysolutions.com

Historical Perspective Data as Product

Product Manufacturing

Data Manufacturing

Input Raw Material Raw Data

Process Materials Processing

Data Processing

Output Physical Products

Data Products

Data IS the product ... Intention and usage of data is no longer aligned in current large scale information systems

Data IS the product ... Intention and usage of data is no longer aligned in current large scale information systems

Wang, Storey & Firth (1995) A Framework for Data Quality research. IEEE TKDE Vol.7/4, Aug, 1995.

Changing Nature of DQ

Data IS the product ... Intention and usage of data is no longer aligned in current large scale information systemsData IS the product ... Intention and usage of data is no longer aligned in current large scale information systems

No absolute data quality metrics ... Diversity of data usage makes quality definitions application/usage specific and feedback drivenNo absolute data quality metrics ... Diversity of data usage makes quality definitions application/usage specific and feedback driven

Data is shared ... A central control of data quality is not possibleData is shared ... A central control of data quality is not possible

Data Life Cycle

5

Design requirements and usage benefits may not have

explicit connections

Data IS the product Data IS the product

Data is being integrated, more than it is being designed

Data is shared Data is shared

Diversity in data format, scale

and usage unprecedented

No absolute data quality metrics

No absolute data quality metrics

Related Works

Data Acquisition

Data Integration

Data Utilization

data entry errors, poor input validation, lack of user training

character recognition in document management systems

performance of RFID readers

dealing with ambient noise in sensor network communication

...

Differences in Format

Structural (schemata) and semantic (constraints, values) mismatches

Privacy preservation

Error propagation

Attribution, Audit, Tracking, Lineage

...

Informational overload/ relevance/usablility

Real time analytics, business intelligence

Data mining & statistical (un)truths

Privacy & trust

Problems

Related WorksData Quality Dimensions [SMB05] [WSF95] [WSB96]

Accuracy (Erroneous) Postcode “4107” is typed “4017”

Consistency (Inconsistent) ITEE Vs. Information Technology and Electrical Engineering

Completeness (Missing) Students don’t have to declare a major till graduation, so major is missing in

most enrolments Currency (Obsolete)

Old phone numbers Accessibility (Unavailable)

Server down, privacy concerns Reliability & Trust (Uncertainty)

Related WorksRelated Computational DQ Problems Record Linkage. Different records representing a single real-life.

Approximate matching [GIJ01], record linkage [JIK03] and entity resolution techniques [BGS08]

Constraints. Maintain the semantic integrity of data. Semantic Integrity Constraints [CFG07] [BWG07]

Uncertainty. How complete or certain is the data. Probabilistic data bases [LLR97]

Provenance or lineage of data is to identify the pedigree of a data object.

Annotation and non annotation approaches [SPG05]

Data Quality Aware Query Systems ArchitectureQuality Aware Components:

•Quality Aware Queries (Source Selection

Quality Aware

QueriesCommunication Network

Org 1

MDB

Org 2

MDB

Org n

MDB

DQ ProfilingDQ Profiling DQ Profiling

Quality Aware Source Selection Example: Virtual ShopGoogle product search returns 91345 results for “Cannon Powershot”.

Is user going to check all results to find the best source of information that matches his own requirements?

How can we help user to find what he really wants?

Quality Aware Source Selection Multiple Sources of Information

Quality Aware

QueriesCommunication Network

Org 1

MDB

Org 2

MDB

Org n

MDB

DQ ProfilingDQ Profiling DQ Profiling

Quality Aware Source Selection Multiple Sources of InformationIn a virtual shop user query can be answered from various sources of information (virtual shops).

Each source is accessed through a common schema as below:

• Item title

• Item Description

• Numbers Available

• Price

• Tax

• User Comments

Quality Aware Source Selection Multiple Sources of Information

Phases to answer the query:

1. Select k best sources.

2. Run the query against selected sources and generate the results.

3. If more results are required, continue step 2 with next k’ sources.

Our approaches uses source quality and user preferences on data quality for phase 1.

Quality Aware Source Selection User Requirements – Querying for different reasons

• Learn about different items (products) - such a user may not care about he “Numbers Available” and “Tax” columns. “Price” is somewhat important to the user although obsoleteness and inaccuracy in price values can be tolerated. However, consistency of “Item Title" and completeness within the populations of “User Comments" in the query results, is of highest importance.

• Compare prices - thus user is sure about the item to purchase but is searching for the best price. Obviously “Price” and “Tax” fields have the greatest importance in this case. They should be current and accurate. "Numbers Available" is also important although slight inaccuracies in this column are acceptable as any number more than 1 will be sufficient.

Selection of a good source for data is subjected to what does the term “good" mean to the user

Which source should be selected?

Quality Aware Source Selection Following Questions Should be Answered

1. How to obtain information about the generic data quality of data sources?

2. How to model and capture user specific data quality requirements?

3. How to conduct the quality aware query and rank results based on the quality aware query?

Selection of a good source for data is subjected to what does the term “good" mean to the user

Which source should be selected?

Quality Aware Metadata Schema + Profiling

Quality Aware Queries

Communication Network

Org 1

MDB

Org 2

MDB

Org n

MDB

DQ Profiling

DQ Profiling

DQ Profiling

Quality Aware Metadata Schema + Profiling

Item Title Item Description Num Available Price Tax User Comments

S2IS Canon PShot 8 310 31

Camera 1000 Test

S3IS Canon PShot 375

XL-2 4 184 18

DMC-TZ5K Panasonic Lmx 340

DSC-W55 S ony Cshot 2 260 26

Object Metric Value

Shop.Item Title Completeness 0.83

Shop.Item Desc Completeness 0.83

Shop.Num Available Completeness 0.50

Shop.Price Completeness 1.00

Shop.Tax Completeness 0.50

Shop.User Comments Completeness 0.16

Quality Aware SQL For purpose of simple illustration we allow using quality-aware metadata,

to query any metric as [Column Name.Metric Name] as part of the SQL query formulation.

SELECT Title, Price FROM ShopItem WHERE Title.Completeness>0.8

SELECT Title, Title.Accuracy, Price FROM ShopItem ORDER BY Price.Accuracy

ORDER BY Price.Accuracy models a one dimensional preference that indicates sources with higher price accuracy are preferred, a two dimensional preference can not be intuitively achieved.

Quality Aware SQLTo prioritize preferences:

E.g. from the sources with highest data quality, sources with high currency of price are prioritized over sources with high completeness of price.

Hierarchy ClauseSELECT Title AS t, Price AS p, [User Comments] AS u

FROM ShopItem WHERE ...

HIERARCHY(ShopItem)

p OVER (t,u) 7, u OVER (t) 3

HIERARCHY(ShopItem.p)

p.Currency OVER (p.Completeness) 3

Generally: HIERARCHY(a) a.x OVER (a.x',...) n

Why Hierarchy?

Intuitively human defines preferences as partial orders (pairs).

E.g. I prefer cofee to tea.

Quality Aware SQL

Related Works Preferences:

Preference theory [Wri63], logic [Man92] and reasoning [WD91]

Multi criteria decision making [Saa22]

Preferences in database [LL87]: Preference queries [GJM00]

Skyline operator [BKS01]

Formulation [Kie02]

Processing Quality Aware SQL AHP (Analytical

Hierarchy Processes)

AHP Tree for buying a car

• Define a hierarchy of objectives starting from the Goal.

• Each level of hierarchy defines lower lever objectives.

• Continue to the lowest level of hierarchy.

Dashboard Seats Sound System

Dashboard1 5 5

Seats1/5 1 1

Sound System 1/5 1 1

AHP Tree for the following query

SELECT Title AS t, Price AS p, [User Comments] AS uFROM ShopItem WHERE ...HIERARCHY(ShopItem) p OVER (t,u) 7, u OVER (t) 3HIERARCHY(ShopItem.p) p.Currency OVER (p.Completeness) 3

.

Processing Quality Aware SQL AHP (Analytical Hierarchy Processes)

Title Price User Comments

Title 1 1/7 1/3

Price 7 1 7

User Comments

3 1/7 1

Price

Currency Completeness

Currency 1 3

Completeness 1/3 1

Title

Currency Completeness

Currency 1 1

Completeness 1 1

User Comments

Currency Completeness

Currency 1 1

Completeness 1 1

Processing Quality Aware SQL AHP (Analytical Hierarchy Processes)

AHP weights for the following query

SELECT Title AS t, Price AS p, [User Comments] AS uFROM ShopItem WHERE ...HIERARCHY(ShopItem) p OVER (t,u) 7, u OVER (t) 3HIERARCHY(ShopItem.p) p.Currency OVER (p.Completeness) 3

.

Query

Title Price User Comments

Completeness Currency Completeness Currency Completeness Currency

0.036 0.036 0.546 0.182 0.100 0.100

Processing Quality Aware SQL AHP (Analytical Hierarchy Processes)

Ranking Sources Using the Calculated Weights:

• Simple Additive Weight Method

Query

Title Price User Comments

Completeness Currency Completeness Currency Completeness Currency

0.036 0.036 0.546 0.182 0.100 0.100

Processing Quality Aware SQL Generating AHP Tree from Quality Aware SQL

Evaluation Three different scenarios:

Making a list of products: price is not important.SELECT Title AS t, Price AS p, [User Comments] AS uFROM ShopItem AS s HIERARCHY(s) t OVER (p) 9, u OVER

(p) 3

Selecting a product to buy: price is very important.SELECT Title AS t, Price AS p, [User Comments] AS uFROM ShopItem AS s HIERARCHY(s) p OVER (t,u) 9

Reviewing user comments: User comments are important.SELECT Title AS t, Price AS p, [User Comments] AS uFROM ShopItem AS s HIERARCHY(s) u OVER (p) 9, t OVER

(p) 3

Evaluation Three different source ranking methods:

No information on quality:

Randomly select a source.

Ranking based on source quality: previous similar source ranking approaches*

Quality-aware source ranking: Referring to our contributions

* F. Naumann, J.C. Freytag, and M. piliopoulou. Qualitydriven source selection using Data Envelopment Analysis. Proc. of the 3rd Conference on InformationQuality (IQ), Cambridge, MA, 1998.

Future Works Remember the research challenges?

1. How to obtain information about the generic data quality of data sources?

2. How to model and capture user specific data quality requirements?

3. How to conduct the quality aware query?

Future Works

1. How to obtain information about the generic data quality of data sources? Smart Profiling

Methods to update data profiling (e.g. view updating and data warehouse updating approaches) and generate smarter profiles using complex business rules.

2. How to model and capture user specific data quality requirements?

3. How to conduct the quality aware query?

Future Works

1. How to obtain information about the generic data quality of data sources?

2. How to model and capture user specific data quality requirements? Model Analysis And Verification

Hierarchy Clause should handle inconsistencies in user requirements : Check the consistency of user preferences and repair it if possible.

3. How to conduct the quality aware query?

Future Works

1. How to obtain information about the generic data quality of data sources?

2. How to model and capture user specific data quality requirements?

3. How to conduct the quality aware query? Improve Source Selection and Ranking

Resulting quality of columns for a joined table is different from the column quality of individual tables. Query plans should be ranked instead of sources.

Questions?

End of presentation…

References[SMB05] M. Scannapieco, P. Missier, and C. Batini. Data quality at a glance. Datenbank-

Spektrum, 14:6-14, 2005.[WSF95] RY Wang, VC Storey, and CP Firth. A framework for analysis of data quality research.

Knowledge and Data Engineering, IEEE Transactions on, 7(4):623-640,1995.[WSB96] R.Y. Wang and D.M. Strong. Beyond accuracy: what data quality means to data

consumers. Journal of Management Information Systems, 12(4):5-33, 1996.[CFG07] G. Cong, W. Fan, F. Geerts, X. Jia, and S. Ma. Improving data quality: consistency and

accuracy. Proceedings of the 33rd international conference on Very large data bases, pages 315-326, 2007.

[BW07] P. Bohannon, F. Wenfei, F. GEERTS, J. Xibei, and A. Kementsietsidis. Conditional functional dependencies for data cleaning. ICDE, 2007.

[GIJ01] L. Gravano, P.G. Ipeirotis, H.V. Jagadish, N. Koudas, S. Muthukrishnan, and D. Srivastava. Approximate String Joins in a Database (Almost) for Free. Proceedings of the international conference on very large data bases, pages 491-500,2001.

[GIK03] L. Gravano, P.G. Ipeirotis, N. Koudas, and D. Srivastava. Text Joins for Data Cleansing and Integration in an RDBMS. Proc. of Int. Conf. on Data Engineering (ICDE), 2003.

[BGS08] O. Benjelloun, H. Garcia-Molina, Q. Su, and J. Widom. Swoosh: A generic approach to entity resolution. VLDB Journal, 2008.

References[SPG05] Y.L. Simmhan, B. Plale, and D. Gannon. A Survey of Data Provenance in eScience.

SIGMOD RECORD, 34(3):31, 2005.[LLR97] L.V.S. Lakshmanan, N. Leone, R. Ross, and VS Subrahmanian. ProbView: aexible

probabilistic database system. ACM Transactions on Database Systems (TODS), 22(3):419-469, 1997.

[Wri63] G.H. von Wright. The Logic of Preference. Edinburgh University Press, 1963.[Man92] Suryanarayana Murthy Mantha. First-order preference theories and their applications. PhD

thesis, Mathematics, Salt Lake City, UT, USA, 1992.[WD91] M.P. Wellman and J. Doyle. Preferential semantics for goals. In Proceedings of the National

Conference on Artificial Intelligence, pages 698-703, 1991.[Saa96] T.L. Saaty. Multicriteria Decision Making: The Analytic Hierarchy Process: Planning, Priority

Setting, Resource Allocation. RWS Publications, 1996.[IL87] M. Lacroix and P. Lavency. Preferences: Putting More Knowledge into Queries. In Proceedings

of the 13th International Conference on Very Large Data Bases, pages 217{225. Morgan Kaufmann Publishers Inc. San Francisco, CA, USA, 1987.

[GJM00] K. Govindarajan, B. Jayaraman, and S. Mantha. Preference Queries in Deductive Databases. New Generation Computing, 19(1):57-86, 2000.

[BKS01] S. Borzsonyi, D. Kossmann, and K. Stocker. The skyline operator. In Proc. Of ICDE, pages 421-430, 2001.

[Kie02] W. Kiebling. Foundations of preferences in database systems. In Proceedings of the 28th international conference on Very Large Data Bases, pages 311{322.VLDB Endowment, 2002.