Mining with Noise Knowledge: Error Awareness Data Mining
Xindong Wu
Department of Computer Science, University of Vermont, USA
Hong Kong Polytechnic University
School of Computer and Information, Hefei University of Technology
Tsinghua University, Beijing, January 15, 2008
Outline
1. Introduction: Noise; Existing Efforts in Noise Handling
2. A System Framework for Error-Tolerant Data Mining
3. Error Detection and Instance Ranking
4. Error Profiling with Structured Noise
5. Error-Tolerant Mining
Noise Is Everywhere
Random noise:
- "a random error or variance in a measured variable" (Han & Kamber 2001)
- "any property of the sensed pattern which is not due to the true underlying model but instead to randomness in the world or the sensors" (Duda et al. 2000)
Structured noise:
- Caused by systematic mechanisms, e.g. equipment failure or deceptive information
Noise Categories and Locations
Categorized by error type: erroneous values; missing values
Categorized by variable type (Zhu & Wu 2004):
- Independent variables: attribute noise
- Dependent variable: class noise
Existing Efforts (1): Learning with Random Noise
Dataset D -> data preprocessing -> dataset D' -> learning -> a single learner
Data preprocessing techniques:
- Noise filtering: identifying mislabeled examples (Brodley & Friedl 1999)
- Attribute value prediction: erroneous attribute value detection (Teng 1999)
- Missing attribute value acquisition: acquiring the most informative missing values (Zhu & Wu 2004, Zhu & Wu 2005)
- Data imputation: filling missing values (Fellegi & Holt 1976)
Existing Efforts (2): Classifier Ensembling with Random Noise
[Diagram: dataset D -> ensemble learning over base learners L1, L2, ..., Ln -> learner combination -> a single learner]
Bagging (Breiman 1996); Boosting (Freund & Schapire 1996)
Limitations
The design of current ensembling methods focuses only on making the base learners diverse.
How can we learn from past noise-handling efforts to avoid future noise?
A System Framework for Noise-Tolerant Data Mining
Three components: error identification and instance ranking; error profiling and reasoning; error-tolerant mining
Error Detection and Instance Ranking (AAAI-04)
Error detection: construct a suspicious instance subset; locate erroneous attribute values.
Impact-sensitive ranking: rank suspicious instances based on the located erroneous attribute values and their impacts.
[Flowchart: noisy dataset D -> erroneous attribute detection -> suspicious instance subset S; calculate information-gain ratios -> impact-sensitive weight for each attribute -> overall impact value for each suspicious instance -> impact-sensitive ranking and recommendation]
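A minimal sketch of the impact-sensitive ranking step, assuming the suspicious instance subset and the located erroneous attributes have already been produced by error detection. The function names are hypothetical, and a plain information-gain ratio serves as the per-attribute impact weight; this is an illustration of the idea, not the AAAI-04 implementation:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(column, labels):
    """Information-gain ratio of one attribute column w.r.t. the class labels."""
    n = len(labels)
    cond = 0.0   # conditional entropy after splitting on this attribute
    split = 0.0  # split information (entropy of the attribute itself)
    for value, count in Counter(column).items():
        subset = [l for v, l in zip(column, labels) if v == value]
        cond += (count / n) * entropy(subset)
        split -= (count / n) * math.log2(count / n)
    gain = entropy(labels) - cond
    return gain / split if split > 0 else 0.0

def rank_suspicious(data, labels, suspicious):
    """Rank suspicious instances by the total impact of their flagged attributes.

    `suspicious` maps instance index -> set of attribute indices flagged as
    erroneous by the error-detection step (assumed given here).
    """
    n_attrs = len(data[0])
    weights = [gain_ratio([row[i] for row in data], labels) for i in range(n_attrs)]
    impact = {idx: sum(weights[a] for a in attrs) for idx, attrs in suspicious.items()}
    return sorted(impact, key=impact.get, reverse=True)
```

Instances whose flagged attributes carry high information-gain ratios surface first, so the expert's correction effort goes where a wrong value hurts the learner most.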
Error Profiling with Structured Noise
There are unlimited types of structured noise, and it occurs in many studies.
Objective: construct a systematic approach; study specific types of structured noise.
Approach
[Flowchart: rule learning takes the noisy data D and the purged data set D' as input and outputs rules that describe the modification pattern; rule evaluation then applies these rules to a second noisy data set D2, producing the modified data set D2'.]
Associative Noise (ICDM '07)
Associative noise: the error in one attribute is associated with other attribute values.
- Stability of certain measures is conditioned on other attributes
- Intentionally planted false information
Model assumptions:
- Noisy data set D, purged data set D'
- Associative corruption rules
- Errors are only in feature attributes
Associative Profiling
Take the purged data set D' as the base data set.
For each corrupted attribute Ai in D, add Ai into D' and label Ai as the class attribute.
Learn a classification tree from D' to obtain modification rules, e.g.:
- If A1 = a11, A2 = a21, C = c1, then A5 = a51 => A5 = a52
- If A2 = a21, A3 = a31, then A5 = a52 => A5 = a52
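The profiling step above can be sketched as follows. For simplicity, a real classification-tree learner is replaced by exhaustive counting over the conditioning attributes, and all names are hypothetical:

```python
from collections import Counter, defaultdict

def profile_associative_noise(clean_rows, noisy_rows, target):
    """Learn modification rules for one corrupted attribute `target`.

    Each row is a dict of attribute -> value; `clean_rows` is the purged data
    set D' and `noisy_rows` holds the corresponding rows of the noisy D.
    Returns rules of the form (conditions, clean_value, corrupted_value).
    """
    stats = defaultdict(Counter)
    for clean, noisy in zip(clean_rows, noisy_rows):
        # Condition on every attribute except the corrupted one.
        context = tuple(sorted((k, v) for k, v in clean.items() if k != target))
        stats[(context, clean[target])][noisy[target]] += 1
    rules = []
    for (context, before), counts in stats.items():
        after = counts.most_common(1)[0][0]
        if after != before:  # keep only genuine modification patterns
            rules.append((dict(context), before, after))
    return rules
```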
Associative Profiling Rules
Invert the obtained rules:
- Obtained: If A1 = a11, A2 = a21, C = c1, then A5 = a51 => A5 = a52
- Inverted: If A1 = a11, A2 = a21, C = c1, then A5 = a52 => A5 = a51
In D', learn a Bayes learner L for attribute A'i.
Evaluation: correct the noisy data set D2 with the help of L, yielding the corrected data set D2'. Does D2' have a higher quality than D2 in terms of supervised learning?
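The inversion and correction steps can be sketched as below; a direct rule lookup stands in for the Bayes learner L, and all names are hypothetical:

```python
def invert_rules(rules):
    """Invert modification rules: 'v1 => v2' becomes 'v2 => v1'."""
    return [(ctx, after, before) for (ctx, before, after) in rules]

def correct_dataset(rows, target, inverted_rules):
    """Apply inverted rules to undo associative corruption in `rows`.

    A row is corrected only when its context matches a rule's conditions
    and its `target` value matches the rule's observed (corrupted) value.
    """
    out = []
    for row in rows:
        row = dict(row)  # do not mutate the caller's data
        for ctx, observed, restored in inverted_rules:
            if row[target] == observed and all(row.get(k) == v for k, v in ctx.items()):
                row[target] = restored
                break
        out.append(row)
    return out
```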
Error-Tolerant Data Mining
1. Get a set of diverse base training sets by re-sampling
2. Unify error detection, correction and data cleansing for each base training set to improve its quality
3. Classifier ensembling
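The three steps can be sketched end to end as follows. A 1-nearest-neighbour learner stands in for C4.5, the cleansing step is a user-supplied placeholder, and all names are assumptions rather than the C2 implementation:

```python
import random
from collections import Counter

def bootstrap(data, rng):
    """Draw a bootstrap sample (sampling with replacement) of the same size."""
    return [rng.choice(data) for _ in data]

def cleanse(sample, correct):
    """Placeholder for the detect/correct/cleanse step: `correct` maps an
    instance to its corrected form (identity when no correction applies)."""
    return [correct(x) for x in sample]

class NearestNeighbor:
    """A deliberately simple base learner: 1-NN on numeric features."""
    def fit(self, sample):
        self.sample = sample  # list of ((features...), label) pairs
        return self
    def predict(self, x):
        feats, label = min(self.sample,
                           key=lambda s: sum((a - b) ** 2 for a, b in zip(s[0], x)))
        return label

def error_tolerant_ensemble(data, correct, n_learners=5, seed=0):
    """Resample -> cleanse -> train -> majority vote, per the three steps above."""
    rng = random.Random(seed)
    learners = [NearestNeighbor().fit(cleanse(bootstrap(data, rng), correct))
                for _ in range(n_learners)]
    def predict(x):
        votes = Counter(l.predict(x) for l in learners)
        return votes.most_common(1)[0][0]
    return predict
```

The point of the sketch is the pipeline shape: diversity comes from resampling, quality from cleansing each base set before training, and robustness from the vote.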
C2 Flowchart
[Figure: flowchart of the C2 (Corrective Classification) process]
Accuracy Enhancement
Three steps:
1. Locate noisy data in the given dataset
2. Recommend possible corrections: attribute prediction; construct a solution set
3. Select and perform one correction for each noisy instance
[Diagram: dataset D with solution sets S' -> corrected dataset D' -> classifier T']
Attribute Prediction
Switch each attribute Ai with the class label to train a classifier APi:
Ik: A1, A2, .., Ai, .., AN, C
Ik: A1, A2, .., C, .., AN, Ai -> classification algorithm -> APi
Use APi to evaluate whether attribute Ai possibly contains any error.
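A minimal sketch of the attribute-swapping idea, with a frequency table standing in for the classification algorithm; the function names are hypothetical:

```python
from collections import Counter, defaultdict

def train_attribute_predictor(rows, target):
    """Train AP_i: predict attribute `target` from the remaining attributes
    plus the class label. Each row is a dict including the class attribute."""
    table = defaultdict(Counter)
    for row in rows:
        key = tuple(sorted((k, v) for k, v in row.items() if k != target))
        table[key][row[target]] += 1
    def predict(row):
        key = tuple(sorted((k, v) for k, v in row.items() if k != target))
        votes = table.get(key)
        return votes.most_common(1)[0][0] if votes else None
    return predict

def flag_suspicious(rows, target):
    """Flag instances whose `target` value disagrees with AP_i's prediction."""
    ap = train_attribute_predictor(rows, target)
    return [i for i, row in enumerate(rows)
            if (p := ap(row)) is not None and p != row[target]]
```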
Construct a Solution Set
Each instance Ik: A1, A2, ..., Ai, C is passed through the attribute predictors AP1, AP2, ..., APi to obtain predicted values Ik': A1', A2', ..., Ai', C.
For example, the solution set for instance Ik:
{A1 --> A1', Aj --> Aj', {Ak1 --> Ak1', Ak2 --> Ak2'}}
k = 3: maximum number of attribute value changes.
[Diagram: solution sets S over the corrected dataset D' feed classifier T']
Select and Perform Corrections
[Diagram: resampling the noisy dataset D yields base sets D1', D2', ..., Dn' with solution sets S1, S2, ..., Sn; noise locating, detecting and correcting turns each into a corrected set D1'', D2'', ..., Dn''; the corrected sets feed classifier ensembling.]
Experimental Results
We integrate Weka-3-4 packages into our system.
We use the C4.5 classification tree.
Real-world datasets are taken from the UCI data repository.
Attribute error corruption scheme: erroneous attribute values are introduced into each attribute independently, at noise level x*100%.
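The corruption scheme can be sketched as follows; the function name and the uniform choice among alternative values are assumptions:

```python
import random

def corrupt(rows, noise_level, domains, seed=0):
    """Inject attribute noise: each attribute value of each instance is
    replaced, independently with probability `noise_level`, by a different
    value drawn uniformly from that attribute's domain.

    `rows` is a list of dicts; `domains` maps attribute -> list of values.
    """
    rng = random.Random(seed)
    out = []
    for row in rows:
        noisy = dict(row)
        for attr, domain in domains.items():
            if rng.random() < noise_level:
                alternatives = [v for v in domain if v != row[attr]]
                if alternatives:
                    noisy[attr] = rng.choice(alternatives)
        out.append(noisy)
    return out
```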
Results
C2 won 34 trials; Bagging won 4 trials; 2 trials were tied.
Results: Monks3
[Figures: performance comparison on base learners and on the four methods, at noise levels 10% and 20%]
Results: Monks3
[Figures: performance comparison on base learners and on the four methods, at noise levels 30% and 40%]
Performance Discussions
C2 (ICDM 2006), Bagging and ACE (ICTAI 2005) all outperform classifier T.
C2 outperforms Bagging in most trials.
When the noise level is high, the accuracy enhancement module is less reliable.
Improvements can be considered in the following aspects: locating noisy data; recommending possible corrections; selecting and performing one correction.
Concluding Remarks
Data mining from large, noisy data sources is a defining problem and hence a long-term issue, involving different types of noise:
- Structured noise: associative noise is one specific type, handled by associative profiling
- Random noise: handled by C2 (Corrective Classification)
Future work: how to combine noise profiling and noise-tolerant mining when the noise types are unknown?
References
1. J. Han and M. Kamber, Data Mining: Concepts and Techniques, 2001.
2. R.O. Duda et al., Pattern Classification (2nd Edition), Wiley-Interscience, 2000.
3. Y. Zhang, X. Zhu, X. Wu and J.P. Bond, ACE: An Aggressive Classifier Ensemble with Error Detection, Correction and Cleansing, IEEE ICTAI 2005, pp. 310-317.
4. Y. Zhang, X. Zhu and X. Wu, Corrective Classification: Classifier Ensembling with Corrective and Diverse Base Learners, ICDM 2006, pp. 1199-1204.
5. X. Zhu and X. Wu, Class Noise vs. Attribute Noise: A Quantitative Study of Their Impacts, Artificial Intelligence Review, 22(3-4): 177-210, 2004.
6. Y. Zhang and X. Wu, Noise Modeling with Associative Corruption Rules, ICDM 2007, pp. 733-738.
Acknowledgements
Joint work with: Dr. Xingquan Zhu, Yan Zhang, Dr. Jeffrey Bond
Supported by: DOD (US), NSF (US)