12 cleaning (1)
TRANSCRIPT
-
8/3/2019 12 Cleaning (1)
1/17
ISE430
Data Warehousing and Data Mining
Data Extraction and Cleansing
-
8/3/2019 12 Cleaning (1)
2/17
Data Cleansing and Quality
Without data cleansing DW will suffer from: Lack of quality
Loss of trust
Diminishing user base, and
Loss of business sponsorship and funding Data quality characteristics: Data quality affects
several attributes associated with data:
Accuracy: Is it realistic or believable?
Integrity: Is it structured and managed? Consistency: Is it consistently defined and
maintained?
Validity: Is the data valid, based on business orindustry rules and standards?
-
8/3/2019 12 Cleaning (1)
3/17
Causes of Poor Data Quality
These factors can contribute to poor data quality:
Business rules do not exist or there are no standardsfor data capture.
Standards may exist but are not enforced at the
point of data capture.
Inconsistent data entry (incorrect spelling, use ofnicknames, middle names, or aliases) occurs.
Data entry mistakes (character transposition,
misspellings, and so on) happen. Integration of data from systems with different data
standards is present.
Data quality issues are perceived as time-consuming
and expensive to fix.
-
8/3/2019 12 Cleaning (1)
4/17
Sources of Data Quality Problems
Source: The Data Warehousing Institute, Data
Quality and the Bottom Line, 2002
-
8/3/2019 12 Cleaning (1)
5/17
Data Cleansing Options
Clean data is the result of a combination of efforts:
making sure that data entered into the system isclean
cleaning up problems after the data is accepted.
Also known as rule-based strategy:
clean the source data
clean the warehouse data
-
8/3/2019 12 Cleaning (1)
6/17
Data Cleansing Processes in a DataQuality Initiative (1)
Standardization
changing data elements to acommon format defined by the data steward
Gender Analysis returns a persons genderbased on the elements of their name
Entity Analysis
returns an entity type based onthe elements in a name
De-duplication uses match coding to discardrecords that represent the same person, company
or entity Matching and Merging matches records based
on a given set of criteria and at a given sensitivitylevel. Used to link records across multiple datasources.
-
8/3/2019 12 Cleaning (1)
7/17
Data Cleansing Processes in a DataQuality Initiative (2)
Address Verification
verifies that if you send apiece of mail to a given address, it will be delivered.Does not verify occupants, just valid addresses.
Householding/Consolidation used to identify
links between records, such as customers living inthe same household or companies belonging to thesame parent company.
Parsing identifies individual data elements and
separates them into unique fields
Casing provides the proper capitalization to dataelements that have been entered in all lowercaseor all uppercase.
-
8/3/2019 12 Cleaning (1)
8/17
Analysis and Standardization Example
Who is the biggest supplier?
Anderson Construction $ 2,333.50
Briggs,Inc $ 8,200.10
Brigs Inc. $32,900.79
Casper Corp. $27,191.05
Caspar Corp $ 6,000.00
Solomon Industries $43,150.00
The Casper Corp $11,500.00
-
8/3/2019 12 Cleaning (1)
9/17
Standardization Scheme
Briggs, Inc
Brigs Inc.
Briggs Inc.
Casper Corp.
Casper Corp.
Caspar Corp
The Casper Corp
-
8/3/2019 12 Cleaning (1)
10/17
Data Cleansing Rules
Types of rules
Integrity rules: tests integrity of the data
Cleansing Rules: specification of an action to betaken when integrity violations are encountered.
Integrity rules vs. Cleansing rules
Integrity rules: refer to the way the data must conform to meet business
rules.
Cleansing rules: combine the definition from the integrity rule with the action
to be taken in the event of a violation.
-
8/3/2019 12 Cleaning (1)
11/17
Data Integrity Rules Classification
-
8/3/2019 12 Cleaning (1)
12/17
Uses of Rules
Audit data
Report on the occurrence of integrity violationswithout taking any steps to restore integrity to thedata.
Filter data Detect data integrity violations and remove the
offending data from the set used to populate thegiven target.
Correct data Detect data integrity violations and repair the data to
restore its integrity.
-
8/3/2019 12 Cleaning (1)
13/17
Rule Based Cleansing Process
Determine the action to be taken from perspectivesof:
At which data is the rule directed source data ortarget data?
What type of cleansing action is being performed
auditing, filtering, or correction?
At what point in the warehousing process will thecleansing action occur?
-
8/3/2019 12 Cleaning (1)
14/17
Cleaning Source vs. Target Data
SourceData Cleansing Target Data Cleansing
Requires source data models Rules from target data models
Typically large number of rules Typically smaller number of rules
Implicit rules difficult to
determine
Implicit rules readily identified
Fixes problems at the source May only fix symptoms
One fix applies to many targets Multiple targets may = redundantcleansing
May require operational system
changes
Avoids operational systems issues
-
8/3/2019 12 Cleaning (1)
15/17
Data Cleansing Economics
-
8/3/2019 12 Cleaning (1)
16/17
Principles of Rule Based DC
Data quality directly affects data warehouse acceptance.
Source data is typically dirty and must be made reliable for warehousing.
Data models are a good source of data integrity rules
Both source and target data models yield integrity rules.
Data cleansing rules combine integrity rules with actions to be taken whenviolations are detected.
Both source data and target data may be cleansed.
Auditing reports data integrity violations, but does not fix them.
Filtering discards data that violates integrity rules.
Correction repairs data that violates integrity rules.
Choice of techniques involves severity of the data quality problem,available data models, and commitment of time and resources.
Whatever techniques are chosen, a systematic, rule-based approachyields better results than an unstructured approach.
-
8/3/2019 12 Cleaning (1)
17/17
Summary
Data Cleansing Explained
Reading Resources
Ponniah: Data Warehousing Fundamentals (Ch. 12)
Next
Online Analytic Processing (OLAP)
MOLAP, ROLAP