data quality class 3. goals dimensions of data quality enterprise reference data data parsing

Data Quality Class 3

Post on 20-Dec-2015

Page 1: Data Quality Class 3. Goals Dimensions of Data Quality Enterprise Reference Data Data Parsing

Data Quality

Class 3

Goals

Dimensions of Data Quality
Enterprise Reference Data
Data Parsing

Dimensions of Data Quality

Poor data quality is similar to obscenity
– It seems as if there are no real ways to measure it, but you know it when you see it!
In reality, data quality can be measured
– The frame of reference for measurement is different

Dimensions of Data Quality 2

Data Models
Data Values
Data Presentation
Data Policy

Example: Sales Database

Sales and marketing database
– Current customers
– Sales leads
Name, address, contact data
For current customers, sales data

Data Quality of Data Models

Clarity of definition
Comprehensiveness
Flexibility
Robustness

Data Quality of Data Models 2

Essentialness
Attribute granularity
Precision of domains
Homogeneity

Data Quality of Data Models 3

Naturalness
Identifiability
Obtainability
Relevance

Data Quality of Data Models 4

Simplicity
Semantic Consistency
Structural Consistency

Data Quality of Data Values

Accuracy
Null values
Completeness
Consistency
Currency

Accuracy

Agreement with established sources
Database of record
Other corroborative sources

Null Values

Null vs. Missing
– Unavailable
– Not applicable
– No value
– Not classified
– Truly null

Completeness

Mandatory attributes require values
Optional attributes may hold values (when and how?)
Inapplicable attributes may not have a value (also when and how?)
Completeness constraints
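A completeness constraint can be written as a small check over a record. The sketch below is only an illustration built on the sales database example: the attribute names (`name`, `address`, `status`, `sales_total`) and the rule that sales data is inapplicable for a lead are assumptions, not part of the slides.

```python
# Hypothetical completeness rules for a customer/lead record.
MANDATORY = {"name", "address"}

# Attribute -> predicate that is True when the attribute does not apply.
# Assumed rule: a sales lead has no sales data yet.
INAPPLICABLE_WHEN = {"sales_total": lambda rec: rec.get("status") == "lead"}


def completeness_violations(record):
    """Return a list of human-readable completeness violations."""
    problems = []
    for attr in sorted(MANDATORY):
        if record.get(attr) in (None, ""):
            problems.append(f"missing mandatory attribute: {attr}")
    for attr, not_applicable in INAPPLICABLE_WHEN.items():
        if not_applicable(record) and record.get(attr) not in (None, ""):
            problems.append(f"value present for inapplicable attribute: {attr}")
    return problems
```

Optional attributes need no check here; they only matter once a rule says when they become mandatory or inapplicable.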

Consistency

Are values in one set consistent with values in another set?
Consistency relations between attributes in the same table
Consistency assertions across columns
Consistency relationships between tables
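Consistency assertions across columns can be expressed as named predicates over a row. This is a minimal sketch; the order columns (`order_date`, `ship_date`, `quantity`, `unit_price`, `total`) and both assertions are invented for illustration.

```python
from datetime import date

# Hypothetical cross-column assertions for an order row.
ASSERTIONS = {
    "ship_not_before_order": lambda r: r["ship_date"] >= r["order_date"],
    "total_matches_line_items": lambda r: r["total"] == r["quantity"] * r["unit_price"],
}


def violated_assertions(row, assertions=ASSERTIONS):
    """Return the names of the assertions that the row fails."""
    return [name for name, holds in assertions.items() if not holds(row)]
```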

Currency/Timeliness

What data is current?
How is it kept up-to-date?
Time expectations for accessibility to data

Data Quality of Data Presentation

Appropriateness
Correct Interpretation
Flexibility
Format
Precision

Data Quality of Data Presentation 2

Portability
Representation Consistency
Representation of Null Values

Data Quality of Data Policy

Access
Metadata
Privacy
Fault-tolerance
Security

Reference Data

Relatively static
Referred to from within many tables
Shared data
Examples:
– Product catalog
– Security type classification
– Currency codes

Reference Data 2

Metadata
Data Domains
Mappings between those domains

Domains

A data domain is a subclassed data type
Domains can be described using enumerations
– Example: states, cities, product codes
Domains can be described using functions
– Example: formatted trouble ticket ids such as CC-NNNN

Domain Membership

Data attributes can be affiliated with data domains
Test a value to make sure it is valid within a domain
– For enumerated domains, a lookup works
– For generated/described domains, check to see if the value could be generated by function
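The two membership tests can be sketched side by side. The state list comes from the slides; the reading of CC-NNNN as two uppercase letters, a hyphen, and four digits is an assumption, as are the function names.

```python
import re

# Enumerated domain: membership is a set lookup.
STATES = {"NY", "MA", "CT", "VA"}

# Described domain: membership is a rule. Assumed reading of CC-NNNN:
# two uppercase letters, a hyphen, then four digits.
TICKET_ID = re.compile(r"[A-Z]{2}-[0-9]{4}")


def in_enumerated_domain(value, domain):
    """True if value is one of the listed domain values."""
    return value in domain


def in_described_domain(value, pattern):
    """True if the whole value matches the domain's format rule."""
    return pattern.fullmatch(value) is not None
```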

Domains as Metadata

The existence of a data domain is metadata
Keep track of which tables have attributes making use of which domains
Manage domains as enterprise reference data

Domain Tables

Domains are sets of values
Maintain in two tables:
– Domain name table (domain_name, domain_id)
– Domain value table (domain_id, value)
Example:
– (“STATES”, 2)
– {(2, “NY”), (2, “MA”), (2, “CT”), (2, “VA”)…}
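One way to hold the two domain tables is an in-memory SQLite database. This is a sketch, not a prescribed schema: the table names `domain_names` and `domain_values` and the helper `is_member` are assumptions; the column names and sample rows follow the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE domain_names (
        domain_name TEXT UNIQUE,
        domain_id   INTEGER PRIMARY KEY
    );
    CREATE TABLE domain_values (
        domain_id INTEGER REFERENCES domain_names(domain_id),
        value     TEXT,
        PRIMARY KEY (domain_id, value)
    );
""")
conn.execute("INSERT INTO domain_names (domain_name, domain_id) VALUES ('STATES', 2)")
conn.executemany(
    "INSERT INTO domain_values (domain_id, value) VALUES (2, ?)",
    [("NY",), ("MA",), ("CT",), ("VA",)],
)


def is_member(conn, domain, value):
    """True if value belongs to the named enumerated domain."""
    row = conn.execute(
        """SELECT 1 FROM domain_values v
           JOIN domain_names n ON n.domain_id = v.domain_id
           WHERE n.domain_name = ? AND v.value = ?""",
        (domain, value),
    ).fetchone()
    return row is not None
```

Keeping names and values in separate tables means a new domain is a data change, not a schema change.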

Mappings

Map values in one domain to values in other domains
1-1, 1-Many, Many-1, Many-Many
Represents relations from one set to another set
1-1, Many-1 are functions
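Given a mapping as a collection of (source, target) pairs, its kind can be worked out by counting how many targets each source has and how many sources each target has. A minimal sketch, with hypothetical function names:

```python
from collections import defaultdict


def classify_mapping(pairs):
    """Classify (source, target) pairs as 1-1, Many-1, 1-Many, or Many-Many."""
    targets = defaultdict(set)  # source -> targets it maps to
    sources = defaultdict(set)  # target -> sources that map to it
    for source, target in pairs:
        targets[source].add(target)
        sources[target].add(source)
    one_target_per_source = all(len(t) == 1 for t in targets.values())
    one_source_per_target = all(len(s) == 1 for s in sources.values())
    if one_target_per_source:
        return "1-1" if one_source_per_target else "Many-1"
    return "1-Many" if one_source_per_target else "Many-Many"


def is_function(pairs):
    """A mapping is a function when every source has exactly one target."""
    return classify_mapping(pairs) in ("1-1", "Many-1")
```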

Mappings as Metadata

Semi-static mappings are also metadata
Keep track of which tables refer to mappings
Enterprise reference data

Mapping Tables

Maintain mapping information in two tables:
– Mapping name table (from_domain, to_domain, mapping_id, mapping_name)
– Mapping table (mapping_id, source_value, target_value)

Example:

(“Currency”, “Country”, 65, “CurrencyToCountry”)

(65, “USD”, “US”)
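The example rows can be loaded into two SQLite tables and queried by mapping name. This is a sketch under assumptions: the table names `mapping_names` and `mapping_values` and the helper `map_value` are hypothetical; the columns and rows follow the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE mapping_names (
        from_domain  TEXT,
        to_domain    TEXT,
        mapping_id   INTEGER PRIMARY KEY,
        mapping_name TEXT UNIQUE
    );
    CREATE TABLE mapping_values (
        mapping_id   INTEGER REFERENCES mapping_names(mapping_id),
        source_value TEXT,
        target_value TEXT
    );
""")
conn.execute(
    "INSERT INTO mapping_names VALUES ('Currency', 'Country', 65, 'CurrencyToCountry')"
)
conn.execute("INSERT INTO mapping_values VALUES (65, 'USD', 'US')")


def map_value(conn, mapping_name, source_value):
    """Return every target value for source_value under the named mapping."""
    rows = conn.execute(
        """SELECT v.target_value FROM mapping_values v
           JOIN mapping_names n ON n.mapping_id = v.mapping_id
           WHERE n.mapping_name = ? AND v.source_value = ?
           ORDER BY v.target_value""",
        (mapping_name, source_value),
    ).fetchall()
    return [row[0] for row in rows]
```

The helper returns a list because a 1-Many or Many-Many mapping can yield several targets for one source.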

Tables

Inference of data fields from source data
In our case, this is straightforward based on SGML tags in the data
Determine table structure in context of use of metadata and reference data
In other words, try to maintain normal form

Keys

Primary key: must be unique across all values in table
Primary key may be assigned based on internal increasing value, or may be extant in the data
Foreign key: relates values in one table to values in another table
Foreign keys must exist in target table (= referential integrity)

Data Profiling and Parsing

Goals:
– Identify data domains
– Identify mappings
– Identify candidate keys

Data Column Profiling

For each table
– For each column
    Type inference
    Subclassed type inference
    Count the number of distinct values
    Enumerate the distinct values
    Sort the distinct values
    Look for patterns of usage
    Document discoveries
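The counting, enumerating, and sorting steps for a single column fit in a few lines. A minimal sketch; the function name and the shape of the returned dictionary are assumptions.

```python
from collections import Counter


def profile_column(values):
    """Summarize one column: distinct values, their count, and the most frequent ones."""
    counts = Counter(values)
    # Sort on the string form so mixed-type columns do not raise a TypeError.
    distinct = sorted(counts, key=str)
    return {
        "row_count": len(values),
        "distinct_count": len(distinct),
        "distinct_values": distinct,
        "most_common": counts.most_common(3),
    }
```

A column whose distinct count is small relative to its row count is a natural candidate for an enumerated domain.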

Data Table Profiling

For each table
– Identify candidate keys
– Identify data domains
– Identify data mappings
– Look for patterns

Multiple Table Profiling

Identify foreign key relationships
Validate referential integrity
Look for patterns
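Validating referential integrity amounts to finding foreign key values with no matching primary key. A minimal sketch over rows held as dictionaries; the function name and the sample column names in the usage are hypothetical.

```python
def orphaned_foreign_keys(child_rows, fk_column, parent_rows, pk_column):
    """Return the foreign key values in child_rows that have no match in parent_rows."""
    parent_keys = {row[pk_column] for row in parent_rows}
    child_keys = {row[fk_column] for row in child_rows}
    return sorted(child_keys - parent_keys)
```

For example, with customers keyed by `id` and sales rows carrying `customer_id`, `orphaned_foreign_keys(sales, "customer_id", customers, "id")` lists the customer ids that sales refer to but the customer table lacks.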

Data Parsing

2 kinds of data domains
– Enumerated, consisting of a predefined set of values
– Inferred/Functional, consisting of values that are validated based on a set of rules

Type Inference

Data value sets all conform to a data type
All values within a column must belong to the same type
Propose data type by a series of inferences
– Initial assumption is that all values belong to strictest type
– Test for violations to type restriction
– As a violation is discovered, loosen the restriction and test again for violations
– Measure conformance at each level
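The loosening procedure can be sketched as a ladder of type tests tried from strictest to loosest. The three-level ladder (integer, real, string) and the function names are assumptions; a fuller inferencer would add dates and other types. Note that Python's `int()` and `float()` are more permissive than a strict profiler might want (they accept surrounding whitespace, and `float()` accepts "nan").

```python
def _is_int(text):
    try:
        int(text)
        return True
    except ValueError:
        return False


def _is_real(text):
    try:
        float(text)
        return True
    except ValueError:
        return False


# Ordered strictest first; the last level accepts any string.
TYPE_LADDER = [("integer", _is_int), ("real", _is_real), ("string", lambda text: True)]


def infer_type(values):
    """Return (proposed_type, conformance).

    conformance maps each level that was tested to the fraction of values
    that satisfied it; testing stops at the first level with no violations.
    """
    if not values:
        raise ValueError("cannot infer a type from an empty column")
    conformance = {}
    for name, test in TYPE_LADDER:
        passed = sum(1 for v in values if test(v))
        conformance[name] = passed / len(values)
        if passed == len(values):
            return name, conformance
    return "string", conformance  # unreachable: the string level always passes
```

The conformance figures are worth keeping even for rejected levels: a column that is 99% integer is probably an integer column with a few bad values.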

Subclassed Type Inference

Once type has been inferred, look for additional restrictions within type
Example:
– Integer vs. Integer within a range (0..100)
For integers and real values, look for ranges
For character strings, look for length, context, character patterns
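Range and length restrictions can be read off the observed values. In this sketch the candidate ranges passed to `propose_integer_subclass` are hypothetical defaults chosen for illustration; only (0, 100) comes from the slide.

```python
def numeric_range(values):
    """Observed (min, max) of a column already inferred to be numeric."""
    return min(values), max(values)


def length_range(strings):
    """Observed (shortest, longest) length of a column of strings."""
    lengths = [len(s) for s in strings]
    return min(lengths), max(lengths)


def propose_integer_subclass(values, candidates=((0, 1), (0, 100), (0, 255))):
    """Return the first candidate range that contains every value, else None."""
    low, high = numeric_range(values)
    for c_low, c_high in candidates:
        if c_low <= low and high <= c_high:
            return (c_low, c_high)
    return None
```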

String Parsing

Classify characters into groups
– Alphabetic
– Numeric/Digits
– Punctuation (this can be further refined)
Transform all values within a column set into their corresponding pattern
Example: 789-23-1100 would change to
– DDDPDDPDDDD
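The transformation is a character-by-character classification. In this sketch, D and P follow the slide's example; using A for alphabetic characters is an assumption, and anything that is neither a letter nor a digit (including whitespace) is lumped into P.

```python
def to_pattern(value):
    """Map each character to A (alphabetic), D (digit), or P (anything else)."""
    classes = []
    for ch in value:
        if ch.isalpha():
            classes.append("A")
        elif ch.isdigit():
            classes.append("D")
        else:
            classes.append("P")
    return "".join(classes)
```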

String Parsing

Given a set of representative strings, perform column analysis again
Look for high frequency counts for specific patterns
This will provide proposed functional domains
We can characterize certain domains by format (e.g., telephone numbers, SSNs, tax IDs, UPC codes, etc.)
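Counting pattern strings across a column surfaces the dominant formats. A minimal sketch: the A/D/P character classes, the function names, and the 0.8 threshold are all assumptions.

```python
from collections import Counter


def to_pattern(value):
    """Map each character to A (alphabetic), D (digit), or P (anything else)."""
    return "".join(
        "A" if ch.isalpha() else "D" if ch.isdigit() else "P" for ch in value
    )


def propose_formats(values, min_share=0.8):
    """Return (pattern, share) pairs covering at least min_share of the column."""
    counts = Counter(to_pattern(v) for v in values)
    total = len(values)
    return [
        (pattern, n / total)
        for pattern, n in counts.most_common()
        if n / total >= min_share
    ]
```

Values that fall outside the dominant pattern are then either a second legitimate format or data quality problems worth inspecting.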

Word Parsing

More advanced technique to explore domain types
Classify word tokens in terms of predefined characterization
– Example: Name words, business words, transaction words, connector words, titles, etc.
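Token classification can be sketched as a lookup against small word lists. The lexicon below is hypothetical and deliberately tiny; a real profiler would draw these lists from reference data.

```python
# Hypothetical lexicons keyed by category.
LEXICON = {
    "title": {"mr", "mrs", "ms", "dr"},
    "business": {"inc", "corp", "llc", "co"},
    "connector": {"and", "of", "the", "&"},
}


def classify_tokens(text):
    """Return (token, category) pairs; unknown tokens are tagged 'unclassified'."""
    result = []
    for raw in text.split():
        # Normalize: drop leading/trailing periods and commas, ignore case.
        token = raw.strip(".,").lower()
        category = next(
            (name for name, words in LEXICON.items() if token in words),
            "unclassified",
        )
        result.append((raw, category))
    return result
```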

Primary Key Discovery

Iterative process to find candidate keys
A candidate key is one or more attributes whose values, when composed, can uniquely locate a record

Primary Key Discovery 2

For I = 1 to number of attributes
– For each set S composed of I attributes do
    Are the composed values unique across the table?
    If so, add S to candidate key set
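The loop can be sketched with `itertools.combinations`. One refinement here is an assumption not stated on the slide: supersets of an already-found key are skipped, so only minimal keys are reported. The search is exponential in the number of attributes, so it suits narrow tables or a capped set size.

```python
from itertools import combinations


def candidate_keys(rows, attributes):
    """Return minimal attribute sets whose combined values are unique across rows."""
    found = []
    for size in range(1, len(attributes) + 1):
        for attrs in combinations(attributes, size):
            # Assumed refinement: a superset of a known key is trivially unique.
            if any(set(key) <= set(attrs) for key in found):
                continue
            composed = {tuple(row[a] for a in attrs) for row in rows}
            if len(composed) == len(rows):
                found.append(attrs)
    return found
```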

Next Assignment

Type Inferencer
– Propose data types for analyzed columns
Format Parser
– Transform strings into pattern strings
– Propose format for a column
– Catalog discovered named formats (telno, SSN, etc.)
– More details on posting on web site