
Mining with Noise Knowledge: Error Awareness Data Mining

Xindong Wu

Department of Computer Science, University of Vermont, USA;
Hong Kong Polytechnic University;
School of Computer and Information, Hefei University of Technology (合肥工业大学计算机与信息学院)


Tsinghua University, Beijing, January 15, 2008

Outline

1. Introduction: Noise; Existing Efforts in Noise Handling

2. A System Framework for Error Tolerant Data Mining

3. Error Detection and Instance Ranking

4. Error Profiling with Structured Noise

5. Error Tolerant Mining


Noise Is Everywhere

- Random noise
  - "a random error or variance in a measured variable" (Han & Kamber 2001)
  - "any property of the sensed pattern which is not due to the true underlying model but instead to randomness in the world or the sensors" (Duda et al. 2000)
- Structured noise
  - Caused by systematic mechanisms
    - Equipment failure
    - Deceptive information


Noise Categories and Locations

- Categorized by types
  - Erroneous value
  - Missing value
- Categorized by variable types (Zhu & Wu 2004)
  - Independent variable: attribute noise
  - Dependent variable: class noise


Existing Efforts (1): Learning with Random Noise

[Diagram: Dataset D → Data preprocessing → Dataset D' → Learning → A single learner]

Data preprocessing techniques
- Identifying mislabeled examples (Brodley & Friedl 1999)
  - Noise filtering
- Erroneous attribute value detection (Teng 1999)
  - Attribute value prediction
- Missing attribute value acquisition (Zhu & Wu 2004, Zhu & Wu 2005)
  - Acquiring the most informative missing values
- Data imputation (Fellegi & Holt 1976)
  - Filling missing values


Existing Efforts (2): Classifier Ensembling with Random Noise

[Diagram: Dataset D → Ensemble learning (L1, L2, ..., Ln) → Learners combination → A single learner]

- Bagging (Breiman 1996)
- Boosting (Freund & Schapire 1996)


Limitations

The design of current ensembling methods focuses only on making the base learners diverse

How to learn from past noise-handling efforts to avoid future noise?


A System Framework for Noise-Tolerant Data Mining

- Error Identification and Instance Ranking
- Error Profiling and Reasoning
- Error-Tolerant Mining


Error Detection and Instance Ranking (AAAI-04)

- Error Detection
  - Construct suspicious instance subset
  - Locate erroneous attribute values
- Impact-Sensitive Ranking
  - Rank suspicious instances based on located erroneous attribute values and their impacts.

[Flowchart: Noisy dataset D → suspicious instance subset S → erroneous attribute detection (error detection stage); calculate information-gain ratios → impact-sensitive weight for each attribute → overall impact value for each suspicious instance → impact-sensitive ranking and recommendation (impact-sensitive ranking stage)]
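The ranking stage can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the AAAI-04 implementation: it assumes categorical attributes with the class in the last column, uses each attribute's information-gain ratio directly as its impact weight, and scores a suspicious instance by summing the weights of the attributes flagged as erroneous. The names `gain_ratio`, `rank_suspicious` and the toy data are hypothetical.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (bits) of a list of categorical values."""
    total = len(values)
    return -sum(n / total * math.log2(n / total) for n in Counter(values).values())

def gain_ratio(rows, attr, label=-1):
    """Information-gain ratio of column `attr` with respect to the class column."""
    groups = {}
    for r in rows:
        groups.setdefault(r[attr], []).append(r[label])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    split_info = entropy([r[attr] for r in rows])
    if split_info == 0:
        return 0.0
    return (entropy([r[label] for r in rows]) - remainder) / split_info

def rank_suspicious(rows, suspects):
    """`suspects` maps a row index to the attribute indices flagged as erroneous.
    Assumption: impact of an instance = sum of the weights of its flagged attributes."""
    n_attrs = len(rows[0]) - 1
    weights = [gain_ratio(rows, a) for a in range(n_attrs)]
    impact = {idx: sum(weights[a] for a in attrs) for idx, attrs in suspects.items()}
    return sorted(impact.items(), key=lambda item: item[1], reverse=True)

# Toy data: column 0 determines the class, column 1 carries no information.
rows = [("a", "x", "yes"), ("a", "y", "yes"), ("b", "x", "no"), ("b", "y", "no")]
suspects = {0: [1], 2: [0]}
ranking = rank_suspicious(rows, suspects)
```

In this toy run the instance with a suspected error in the informative attribute is ranked first, which is the intent of impact-sensitive ranking: errors in attributes that matter more to the class deserve attention first.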


Error Profiling with Structured Noise

- Unlimited types of structured noise
- Occurs in many studies
- Objective
  - Construct a systematic approach
  - Study specific types of structured noise.


Approach

- Rule learning
  - Input: noisy data D and purged data set D'
  - Output: rules that describe the modification pattern
- Rule evaluation
  - Noisy data D2 → modified data set D2'


Associative Noise (ICDM ’07)

- Associative noise
  - The error in one attribute is associated with other attribute values
    - Stability of certain measures is conditioned on other attributes
    - Intentionally planted false information
- Model assumptions
  - Noisy data set D, purged data set D'.
  - Associative Corruption Rules
  - Errors are only in feature attributes.


Associative Profiling

- Take the purged data set D' as the base data set
- For each corrupted attribute Ai in D, add Ai into D' and label Ai as the class attribute
- Learn a classification tree from D'
- Obtain modification rules
  - If A1 = a11, A2 = a21, C = c1, then A5 = a51 => A5 = a52
  - If A2 = a21, A3 = a31, then A5 = a52 => A5 = a52
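The construction of the profiling data set can be sketched as below. This is only an illustration: it assumes the rows of D and D' are aligned one to one, and it replaces the classification tree with a plain frequency table (the majority noisy value for each purged-row pattern), which is a simpler stand-in and not the learner named on the slide. The function names and the toy columns (A1, A5, C) are hypothetical.

```python
from collections import Counter, defaultdict

def build_profiling_set(noisy_rows, purged_rows, attr):
    """Append the noisy value of column `attr` to each purged row as the new class."""
    return [tuple(p) + (n[attr],) for n, p in zip(noisy_rows, purged_rows)]

def learn_modification_rules(profiling_rows):
    """Stand-in for the tree learner: majority noisy value per purged-row pattern."""
    counts = defaultdict(Counter)
    for *conditions, noisy_value in profiling_rows:
        counts[tuple(conditions)][noisy_value] += 1
    return {cond: c.most_common(1)[0][0] for cond, c in counts.items()}

# Columns: (A1, A5, C). In this toy example A5 is corrupted from a51 to a52
# whenever A1 = a11 and C = c1.
purged = [("a11", "a51", "c1"), ("a11", "a51", "c1"),
          ("a12", "a51", "c1"), ("a11", "a52", "c2")]
noisy = [("a11", "a52", "c1"), ("a11", "a52", "c1"),
         ("a12", "a51", "c1"), ("a11", "a52", "c2")]

profiling = build_profiling_set(noisy, purged, attr=1)
rules = learn_modification_rules(profiling)
```

Each key of `rules` reads as the rule condition (including the clean value of A5) and each value as the noisy value it is modified to, mirroring the "A5 = a51 => A5 = a52" form. A tree learner would additionally generalize by dropping conditions that do not matter.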


Associative Profiling Rules

- Invert the obtained rules
  - Obtained: If A1 = a11, A2 = a21, C = c1, then A5 = a51 => A5 = a52
  - Inverted: If A1 = a11, A2 = a21, C = c1, then A5 = a52 => A5 = a51
- In D', learn a Bayes learner L for attribute A'i
- Evaluation
  - Correct the noisy data set D2 with the help of L
  - This gives the corrected data set D'2
  - Does D'2 have a higher quality than data set D2 in terms of supervised learning?
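Rule inversion itself is mechanical, as the following sketch suggests. The `Rule` structure, `invert`, and `apply_rules` are hypothetical names introduced for illustration, and the sketch leaves out the Bayes learner L entirely (the slides do not spell out how L is combined with the inverted rules).

```python
from typing import NamedTuple

class Rule(NamedTuple):
    conditions: dict  # attribute name -> required value
    attr: str         # attribute the rule modifies
    old: str          # value before the modification
    new: str          # value after the modification

def invert(rule):
    """Turn a corruption rule into a correction rule by swapping old and new."""
    return rule._replace(old=rule.new, new=rule.old)

def apply_rules(row, rules):
    """Apply every rule whose conditions and `old` value match the original row."""
    fixed = dict(row)
    for r in rules:
        matches = all(row.get(k) == v for k, v in r.conditions.items())
        if matches and row[r.attr] == r.old:
            fixed[r.attr] = r.new
    return fixed

corruption = Rule({"A1": "a11", "A2": "a21", "C": "c1"}, "A5", "a51", "a52")
correction = invert(corruption)

noisy_row = {"A1": "a11", "A2": "a21", "C": "c1", "A5": "a52"}
other_row = {"A1": "a12", "A2": "a21", "C": "c1", "A5": "a52"}
corrected = apply_rules(noisy_row, [correction])
untouched = apply_rules(other_row, [correction])
```

Note that an inverted rule can also fire on instances whose value was a52 to begin with, which is presumably one reason a learned model is used alongside the rules.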


Error-Tolerant Data Mining

- Get a set of diverse base training sets by re-sampling
- Unify error detection, correction and data cleansing for each base training set to improve its quality
- Classifier ensembling.
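The three steps above can be sketched as a small pipeline. This is an assumption-laden outline rather than the C2 or ACE implementation: re-sampling is done by bootstrap, the cleaning step is a pluggable `clean` function (an identity placeholder here), the base learner is a toy 1-nearest-neighbour classifier instead of C4.5, and the ensemble combines by majority vote. All names are hypothetical.

```python
import random
from collections import Counter

def bootstrap(rows, rng):
    """Draw len(rows) instances with replacement."""
    return [rng.choice(rows) for _ in rows]

def train_1nn(rows):
    """Toy base learner: 1-nearest neighbour under Hamming distance.
    The class is assumed to be the last column of each row."""
    def predict(x):
        nearest = min(rows, key=lambda r: sum(a != b for a, b in zip(r[:-1], x)))
        return nearest[-1]
    return predict

def error_tolerant_ensemble(rows, clean, n_learners=5, seed=0):
    """Re-sample, clean each base training set, train, then vote."""
    rng = random.Random(seed)
    learners = [train_1nn(clean(bootstrap(rows, rng))) for _ in range(n_learners)]
    def predict(x):
        votes = Counter(learner(x) for learner in learners)
        return votes.most_common(1)[0][0]
    return predict

def no_cleaning(sample):
    """Placeholder for error detection, correction and cleansing."""
    return sample

data = [("a", "x", "yes")] * 20 + [("b", "y", "no")] * 20
model = error_tolerant_ensemble(data, no_cleaning)
prediction = model(("a", "x"))
```

Keeping `clean` as a parameter makes the design point visible: the diversity comes from re-sampling, while the quality of each base training set depends on whatever detection and correction routine is plugged in.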


C2 Flowchart


Accuracy Enhancement

Three steps:
1. Locate noisy data from the given dataset
2. Recommend possible corrections
   - Attribute prediction
   - Construct solution set
3. Select and perform one correction for each noisy instance

[Diagram labels: D, D', S', Classifier T']


Attribute Prediction

- Switch each attribute (Ai) with the class label to train a classifier APi

  [Diagram: Ik: A1, A2, ..., Ai, ..., AN, C → Ik: A1, A2, ..., C, ..., AN, Ai → Classification Algorithm → APi]

- Use APi to evaluate whether attribute Ai possibly contains any error
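The attribute/class swap can be illustrated with a short sketch. It assumes categorical data with the class in the last column, and it uses a majority lookup table over exact feature patterns as a stand-in for the classification algorithm (the experiments in this talk use C4.5). The function names and toy rows are hypothetical.

```python
from collections import Counter, defaultdict

def swap_with_class(row, i):
    """Move attribute i to the class position; the old class takes its place."""
    row = list(row)
    row[i], row[-1] = row[-1], row[i]
    return tuple(row)

def train_ap(rows, i):
    """Train AP_i. Stand-in learner: majority target value per feature pattern."""
    table = defaultdict(Counter)
    for r in rows:
        *features, target = swap_with_class(r, i)
        table[tuple(features)][target] += 1
    def predict(row):
        *features, _ = swap_with_class(row, i)
        seen = table.get(tuple(features))
        return seen.most_common(1)[0][0] if seen else None
    return predict

def suspicious_attributes(row, predictors):
    """Indices of attributes whose AP_i prediction disagrees with the observed value."""
    return [i for i, ap in enumerate(predictors) if ap(row) not in (None, row[i])]

# Columns: (A1, A2, C). The last row carries an unusual A2 value.
rows = [("a", "x", "yes")] * 3 + [("b", "y", "no")] * 3 + [("a", "y", "yes")]
predictors = [train_ap(rows, 0), train_ap(rows, 1)]
flagged = suspicious_attributes(("a", "y", "yes"), predictors)
```

A disagreement between APi and the observed value is only a hint that Ai may be wrong, so the output is best read as a list of candidates for correction rather than confirmed errors.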


Construct A Solution Set

[Diagram: instance Ik = (A1, A2, ..., Ai, C) is passed through the attribute predictors AP1, AP2, ..., APi to give Ik' = (A1', A2', ..., Ai', C)]

For example, Solution set for instance Ik

{A1 --> A1’, Aj --> Aj’, {Ak1 --> Ak1’, Ak2 --> Ak2’ }}

k = 3: the maximum number of attribute value changes.
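One way to picture the solution set is as every combination of at most k proposed attribute changes. The sketch below enumerates them all. The slides do not state how C2 decides which candidates are admitted, so this exhaustive enumeration is an assumption, and `solution_set` is a hypothetical name.

```python
from itertools import combinations

def solution_set(row, predicted, k=3):
    """Candidate corrections for one instance, changing at most k attribute values.
    predicted[i] is the value AP_i proposes for attribute i (None = no proposal)."""
    changes = [(i, p) for i, p in enumerate(predicted)
               if p is not None and p != row[i]]
    candidates = []
    for size in range(1, min(k, len(changes)) + 1):
        for combo in combinations(changes, size):
            candidates.append(dict(combo))  # attribute index -> corrected value
    return candidates

row = ("a1", "a2", "a3", "a4", "c")      # last column is the class
predicted = ["b1", "a2", "b3", None]     # proposals for attributes 0..3
candidates = solution_set(row, predicted, k=3)
```

Here two attributes have a proposed change, so the set contains two single-attribute corrections and one correction that changes both, matching the mix of single and grouped changes in the example above.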


Select and Perform Corrections

[Diagram: D → Resampling → D1', D2', ..., Dn' (with solution sets S1, S2, ..., Sn) → Noise locating, detecting, correcting → D1'', D2'', ..., Dn'' → Classifier Ensembling]


Experimental Results

- We integrate Weka-3-4 packages into our system
- We use the C4.5 classification tree
- Real-world datasets from the UCI data repository
- Attribute error corruption scheme:
  - Erroneous attribute values are introduced into each attribute independently with noise level x·100%.
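A corruption scheme of this kind could look like the sketch below. The slide only states that errors are introduced into each attribute independently at noise level x. The choice of replacement value (a different value drawn uniformly from the attribute's observed domain) and the decision to leave the class column untouched are assumptions, and `corrupt_attributes` is a hypothetical name.

```python
import random

def corrupt_attributes(rows, noise_level, seed=0):
    """With probability `noise_level`, replace each attribute value by a different
    value from that attribute's observed domain. The class (last column) is kept."""
    rng = random.Random(seed)
    n_attrs = len(rows[0]) - 1
    domains = [sorted({r[a] for r in rows}) for a in range(n_attrs)]
    noisy = []
    for r in rows:
        r = list(r)
        for a in range(n_attrs):
            others = [v for v in domains[a] if v != r[a]]
            if others and rng.random() < noise_level:
                r[a] = rng.choice(others)
        noisy.append(tuple(r))
    return noisy

clean = [("a", "x", "yes"), ("b", "y", "no"), ("a", "y", "yes")]
unchanged = corrupt_attributes(clean, noise_level=0.0)
all_flipped = corrupt_attributes(clean, noise_level=1.0)
```

With x = 0 the data is returned as is, and with x = 1 every attribute value is replaced, which gives two easy sanity checks before running intermediate noise levels such as 10% to 40%.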


Results

- C2 won 34 trials
- Bagging won 4 trials
- Tied 2 trials


Results: Monks3

[Figure panels at noise levels 10% and 20%: Performance Comparison on Base Learners; Performance Comparison on Four Methods]


Results: Monks3

[Figure panels at noise levels 30% and 40%: Performance Comparison on Base Learners; Performance Comparison on Four Methods]


Performance Discussions

- C2 (ICDM 2006), Bagging, ACE (ICTAI 2005) outperform classifier T
- C2 outperforms Bagging in most trials
- When the noise level is high, the accuracy enhancement module is less reliable
- We can consider improvement from the following aspects:
  - Locating noisy data
  - Recommending possible corrections
  - Selecting and performing one correction.


Concluding Remarks

- A defining problem, hence a long-term issue
  - Data mining from large, noisy data sources
- With different types of noise
  - Structured noise
    - A specific type of structured noise: Associative Noise
    - Associative profiling
  - Random noise
    - C2: Corrective Classification
- Future work: how to combine noise profiling and noise-tolerant mining with unknown noise types?


References

1. J. Han and M. Kamber. Data Mining: Concepts and Techniques, 2001.

2. R.O. Duda, et al. Pattern Classification (2nd Edition), Wiley-Interscience, 2000.

3. Y. Zhang, X. Zhu, X. Wu and J.P. Bond, ACE: An Aggressive Classifier Ensemble with Error Detection, Correction and Cleansing, IEEE ICTAI 2005, pp. 310-317.

4. Y. Zhang, X. Zhu and X. Wu, Corrective Classification: Classifier Ensembling with Corrective and Diverse Base Learners, ICDM 2006, pp. 1199-1204.

5. X. Zhu and X. Wu, Class Noise vs Attribute Noise: A Quantitative Study of Their Impacts, Artificial Intelligence Review, 22(2004), 3-4: 177-210.

6. Y. Zhang and X. Wu, Noise Modeling with Associative Corruption Rules, ICDM 2007, pp. 733-738.


Acknowledgements

Joint work with: Dr. Xingquan Zhu, Yan Zhang, Dr. Jeffrey Bond

Supported by: DOD (US), NSF (US)