12 cleaning (1)

Upload: sadia-ashraf

Post on 06-Apr-2018

216 views

Category:

Documents


0 download

TRANSCRIPT

  • 8/3/2019 12 Cleaning (1)

    1/17

    ISE430

    Data Warehousing and Data Mining

    Data Extraction and Cleansing

  • 8/3/2019 12 Cleaning (1)

    2/17

    Data Cleansing and Quality

    Without data cleansing DW will suffer from: Lack of quality

    Loss of trust

    Diminishing user base, and

    Loss of business sponsorship and funding Data quality characteristics: Data quality affects

    several attributes associated with data:

    Accuracy: Is it realistic or believable?

    Integrity: Is it structured and managed? Consistency: Is it consistently defined and

    maintained?

    Validity: Is the data valid, based on business orindustry rules and standards?

  • 8/3/2019 12 Cleaning (1)

    3/17

    Causes of Poor Data Quality

    These factors can contribute to poor data quality:

    Business rules do not exist or there are no standardsfor data capture.

    Standards may exist but are not enforced at the

    point of data capture.

    Inconsistent data entry (incorrect spelling, use ofnicknames, middle names, or aliases) occurs.

    Data entry mistakes (character transposition,

    misspellings, and so on) happen. Integration of data from systems with different data

    standards is present.

    Data quality issues are perceived as time-consuming

    and expensive to fix.

  • 8/3/2019 12 Cleaning (1)

    4/17

    Sources of Data Quality Problems

    Source: The Data Warehousing Institute, Data

    Quality and the Bottom Line, 2002

  • 8/3/2019 12 Cleaning (1)

    5/17

    Data Cleansing Options

    Clean data is the result of a combination of efforts:

    making sure that data entered into the system isclean

    cleaning up problems after the data is accepted.

    Also known as rule-based strategy:

    clean the source data

    clean the warehouse data

  • 8/3/2019 12 Cleaning (1)

    6/17

    Data Cleansing Processes in a DataQuality Initiative (1)

    Standardization

    changing data elements to acommon format defined by the data steward

    Gender Analysis returns a persons genderbased on the elements of their name

    Entity Analysis

    returns an entity type based onthe elements in a name

    De-duplication uses match coding to discardrecords that represent the same person, company

    or entity Matching and Merging matches records based

    on a given set of criteria and at a given sensitivitylevel. Used to link records across multiple datasources.

  • 8/3/2019 12 Cleaning (1)

    7/17

    Data Cleansing Processes in a DataQuality Initiative (2)

    Address Verification

    verifies that if you send apiece of mail to a given address, it will be delivered.Does not verify occupants, just valid addresses.

    Householding/Consolidation used to identify

    links between records, such as customers living inthe same household or companies belonging to thesame parent company.

    Parsing identifies individual data elements and

    separates them into unique fields

    Casing provides the proper capitalization to dataelements that have been entered in all lowercaseor all uppercase.

  • 8/3/2019 12 Cleaning (1)

    8/17

    Analysis and Standardization Example

    Who is the biggest supplier?

    Anderson Construction $ 2,333.50

    Briggs,Inc $ 8,200.10

    Brigs Inc. $32,900.79

    Casper Corp. $27,191.05

    Caspar Corp $ 6,000.00

    Solomon Industries $43,150.00

    The Casper Corp $11,500.00

  • 8/3/2019 12 Cleaning (1)

    9/17

    Standardization Scheme

    Briggs, Inc

    Brigs Inc.

    Briggs Inc.

    Casper Corp.

    Casper Corp.

    Caspar Corp

    The Casper Corp

  • 8/3/2019 12 Cleaning (1)

    10/17

    Data Cleansing Rules

    Types of rules

    Integrity rules: tests integrity of the data

    Cleansing Rules: specification of an action to betaken when integrity violations are encountered.

    Integrity rules vs. Cleansing rules

    Integrity rules: refer to the way the data must conform to meet business

    rules.

    Cleansing rules: combine the definition from the integrity rule with the action

    to be taken in the event of a violation.

  • 8/3/2019 12 Cleaning (1)

    11/17

    Data Integrity Rules Classification

  • 8/3/2019 12 Cleaning (1)

    12/17

    Uses of Rules

    Audit data

    Report on the occurrence of integrity violationswithout taking any steps to restore integrity to thedata.

    Filter data Detect data integrity violations and remove the

    offending data from the set used to populate thegiven target.

    Correct data Detect data integrity violations and repair the data to

    restore its integrity.

  • 8/3/2019 12 Cleaning (1)

    13/17

    Rule Based Cleansing Process

    Determine the action to be taken from perspectivesof:

    At which data is the rule directed source data ortarget data?

    What type of cleansing action is being performed

    auditing, filtering, or correction?

    At what point in the warehousing process will thecleansing action occur?

  • 8/3/2019 12 Cleaning (1)

    14/17

    Cleaning Source vs. Target Data

    SourceData Cleansing Target Data Cleansing

    Requires source data models Rules from target data models

    Typically large number of rules Typically smaller number of rules

    Implicit rules difficult to

    determine

    Implicit rules readily identified

    Fixes problems at the source May only fix symptoms

    One fix applies to many targets Multiple targets may = redundantcleansing

    May require operational system

    changes

    Avoids operational systems issues

  • 8/3/2019 12 Cleaning (1)

    15/17

    Data Cleansing Economics

  • 8/3/2019 12 Cleaning (1)

    16/17

    Principles of Rule Based DC

    Data quality directly affects data warehouse acceptance.

    Source data is typically dirty and must be made reliable for warehousing.

    Data models are a good source of data integrity rules

    Both source and target data models yield integrity rules.

    Data cleansing rules combine integrity rules with actions to be taken whenviolations are detected.

    Both source data and target data may be cleansed.

    Auditing reports data integrity violations, but does not fix them.

    Filtering discards data that violates integrity rules.

    Correction repairs data that violates integrity rules.

    Choice of techniques involves severity of the data quality problem,available data models, and commitment of time and resources.

    Whatever techniques are chosen, a systematic, rule-based approachyields better results than an unstructured approach.

  • 8/3/2019 12 Cleaning (1)

    17/17

    Summary

    Data Cleansing Explained

    Reading Resources

    Ponniah: Data Warehousing Fundamentals (Ch. 12)

    Next

    Online Analytic Processing (OLAP)

    MOLAP, ROLAP