Download - A General Framework for Mining Concept-Drifting Data Streams with Skewed Distributions Jing Gao Wei Fan Jiawei Han Philip S. Yu University of Illinois

A General Framework for Mining Concept-Drifting Data Streams with Skewed Distributions

Jing Gao† Wei Fan‡ Jiawei Han† Philip S. Yu‡

†University of Illinois at Urbana-Champaign  ‡IBM T. J. Watson Research Center

Introduction (1)

• Data Stream
– Continuously arriving data flow
– Applications: network traffic, credit card transaction flow, phone calling records, etc.


Introduction (2)

• Stream Classification
– Construct a classification model based on past records
– Use the model to predict labels for new data
– Help decision making

[Figure: a classification model labeling a record (Fraud? → Fraud)]

Framework

[Figure: stream of data chunks; a classification model predicts the labels of the newest, unlabeled chunk]

Concept Drifts

• Changes in P(x,y)
– P(x,y) = P(y|x)P(x), where x is the feature vector and y is the class label
– No Change, Feature Change, Conditional Change, Dual Change
– Expected error is not a good indicator of concept drifts
– Training on the most recent data could help reduce expected error

[Figure: data distributions at Time Stamp 1, Time Stamp 11, and Time Stamp 21]
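The four cases can be told apart by comparing P(x) and P(y|x) between two time stamps. The sketch below does this for a toy discrete feature. The function name `drift_type`, the dictionary representation, and the tolerance are illustrative assumptions, not part of the presented framework.

```python
from math import isclose

def drift_type(px_old, px_new, pyx_old, pyx_new, tol=1e-9):
    """Name the kind of change between two time stamps.

    px_*  : dict mapping a feature value x to P(x)
    pyx_* : dict mapping a feature value x to P(y = positive | x)
    (Both representations are assumptions made for this toy example.)
    """
    def changed(old, new):
        keys = set(old) | set(new)
        return any(
            not isclose(old.get(x, 0.0), new.get(x, 0.0), abs_tol=tol)
            for x in keys
        )

    feature_changed = changed(px_old, px_new)        # did P(x) move?
    conditional_changed = changed(pyx_old, pyx_new)  # did P(y|x) move?

    if feature_changed and conditional_changed:
        return "dual change"
    if feature_changed:
        return "feature change"
    if conditional_changed:
        return "conditional change"
    return "no change"


# Same P(x), different P(y|x): only the conditional distribution drifted.
print(drift_type({"a": 0.5, "b": 0.5}, {"a": 0.5, "b": 0.5},
                 {"a": 0.1, "b": 0.9}, {"a": 0.4, "b": 0.9}))
```

In a real stream neither distribution is known exactly, so this only illustrates the definitions rather than a usable drift detector.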

Issues in Stream Classification(1)

• Generative Model
– P(y|x) follows some distribution
• Descriptive Model
– Let the data decide
• Stream Data
– Distribution unknown and evolving

Issues in Stream Classification(2)

• Label Prediction
– Classify x into one class
• Probability Estimation
– x is assigned to all classes with different probabilities
• Stream Applications
– Stochastic, prediction confidence information is needed

Mining Skewed Data Stream

• Skewed Distribution
– Credit card frauds, network intrusions
• Existing Stream Classification Algorithms
– Evaluated on balanced data
• Problems
– Ignore minority examples
– The cost of misclassifying minority examples is usually huge

[Figure: decision tree on skewed data with positive (+) and negative (-) examples; every leaf node is classified as negative]
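A small worked example shows why this is a problem: a model that labels everything negative scores high accuracy while missing every minority example. The 1% positive rate and the helper name `accuracy_and_recall` are illustrative assumptions.

```python
def accuracy_and_recall(y_true, y_pred):
    """Return (accuracy, recall on the positive class) for 0/1 labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    positives = sum(t == 1 for t in y_true)
    true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    recall = true_positives / positives if positives else 0.0
    return correct / len(y_true), recall


# Assumed toy stream chunk: 10 positives (e.g. frauds) among 1000 records.
y_true = [1] * 10 + [0] * 990
# A model whose every leaf predicts "negative".
y_pred = [0] * 1000

accuracy, recall = accuracy_and_recall(y_true, y_pred)
print(f"accuracy={accuracy:.2f} recall={recall:.2f}")
```

Accuracy alone hides the failure, which is one reason to evaluate with measures that are sensitive to the minority class.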

Stream Ensemble Approach (1)

Training set? Insufficient positive examples!

Step 1: Sampling
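The sampling step can be sketched as follows, under the assumption (suggested by the later slides) that positive examples are carried over from earlier chunks while negative examples are drawn only from the most recent chunk. The function name, the `(x, y)` tuple layout, and the `n_negatives` parameter are hypothetical.

```python
import random

def sample_training_set(past_chunks, current_chunk, n_negatives, seed=0):
    """Build one training set for a skewed stream.

    past_chunks   : list of chunks, each a list of (x, y) with y in {0, 1}
    current_chunk : the most recent labeled chunk, same layout
    n_negatives   : how many current negatives to keep (assumed parameter)
    """
    rng = random.Random(seed)

    # Keep every positive example seen so far, old or new.
    positives = [ex for chunk in past_chunks for ex in chunk if ex[1] == 1]
    positives += [ex for ex in current_chunk if ex[1] == 1]

    # Negatives come only from the current chunk, so they reflect
    # the current concept; under-sample them to reduce the skew.
    current_negatives = [ex for ex in current_chunk if ex[1] == 0]
    n = min(n_negatives, len(current_negatives))
    return positives + rng.sample(current_negatives, n)


past = [[((0,), 1), ((1,), 0)], [((2,), 1), ((3,), 0)]]
current = [((4,), 1)] + [((i,), 0) for i in range(5, 15)]
training_set = sample_training_set(past, current, n_negatives=4, seed=1)
print(len(training_set))
```

How many negatives to keep is a tuning choice that the slides do not specify.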

Stream Ensemble Approach (2)

Step 2: Ensemble

Training sets 1, 2, ……, k produce classifiers C1, C2, ……, Ck, whose outputs are averaged:

$f^E(x) = \frac{1}{k}\sum_{i=1}^{k} f^i(x)$
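The ensemble output is the plain average of the k base classifiers' probability estimates. A minimal sketch, assuming each classifier is a callable that returns an estimate of P(+|x); the three constant "classifiers" are placeholders for illustration only.

```python
def ensemble_probability(classifiers, x):
    """Average the probability estimates f^1(x), ..., f^k(x)."""
    if not classifiers:
        raise ValueError("need at least one classifier")
    return sum(f(x) for f in classifiers) / len(classifiers)


# Hypothetical base models, each returning a fixed estimate of P(+|x).
classifiers = [lambda x: 0.2, lambda x: 0.4, lambda x: 0.9]
estimate = ensemble_probability(classifiers, x=None)
print(round(estimate, 3))
```

Averaging keeps the output a probability estimate, which matters when prediction confidence is needed.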

Why does this approach work?

• Incorporation of old positive examples
– increase the training size, reduce variance
– negative examples reflect current concepts, so the increase in boundary bias is small
• Ensemble
– reduce variance caused by single model
– disjoint sets of negative examples: the classifiers will make uncorrelated errors
• Bagging & Boosting
– running cost is much higher
– cannot generate reliable probability estimates for skewed distributions
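The variance argument can be checked numerically. The simulation below is only a sketch under a strong assumption: each classifier's error is modeled as independent Gaussian noise with the same variance, which real classifiers will satisfy only approximately. Averaging k such errors should shrink the variance by roughly a factor of k.

```python
import random
import statistics

def variance_of_average(k, sigma=1.0, trials=20000, seed=0):
    """Empirical variance of the mean of k independent N(0, sigma^2) errors."""
    rng = random.Random(seed)
    averages = [
        sum(rng.gauss(0.0, sigma) for _ in range(k)) / k
        for _ in range(trials)
    ]
    return statistics.pvariance(averages)


single = variance_of_average(k=1)
averaged = variance_of_average(k=10)
print(f"single model: {single:.3f}, average of 10: {averaged:.3f}")
```

If the errors were correlated, for example because the classifiers shared the same negative examples, the reduction would be smaller.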

Analysis

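• Error Reduction
– Sampling: $f_c(x) = P(c|x) + \beta_c + \eta_c(x)$, with boundary variance $\sigma_b^2 = (\sigma_{\eta_+}^2 + \sigma_{\eta_-}^2)/s^2$
– Ensemble: $f_c^E(x) = P(c|x) + \bar{\beta}_c + \bar{\eta}_c(x)$, with boundary variance $\sigma_{b^E}^2 = \frac{1}{k^2}\sum_{i=1}^{k}\sigma_{b_i}^2$
• Efficiency Analysis
– Single model: $O(d(n_p + kn_q)\log(n_p + kn_q))$
– Ensemble: $O(dk(n_p + n_q)\log(n_p + n_q))$
– Ensemble is more efficient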
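The two training-cost expressions in the efficiency analysis can be compared numerically. The sketch below drops constant factors and reads the symbols as follows, which is an assumption because the slides do not define them: d is the feature dimension, n_p the number of positive examples, n_q the number of negative examples per sampled set, and k the ensemble size. The sizes used are made up for illustration.

```python
from math import log

def single_model_cost(d, n_p, n_q, k):
    """d * (n_p + k*n_q) * log(n_p + k*n_q): one model on all the data."""
    n = n_p + k * n_q
    return d * n * log(n)

def ensemble_cost(d, n_p, n_q, k):
    """d * k * (n_p + n_q) * log(n_p + n_q): k models on smaller sets."""
    n = n_p + n_q
    return d * k * n * log(n)


# Assumed sizes: few positives, many negatives per sampled set.
d, n_p, n_q, k = 20, 100, 1000, 10
single = single_model_cost(d, n_p, n_q, k)
ensemble = ensemble_cost(d, n_p, n_q, k)
print(f"single: {single:.0f}  ensemble: {ensemble:.0f}")
```

For these sizes the ensemble expression is the smaller of the two; the gap depends on how n_p compares with n_q, so other settings are worth checking before relying on it.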

Experiments

• Measures
– Mean Squared Error: $L = \frac{1}{n}\sum_{i=1}^{n}\left(f(x_i) - P(+|x_i)\right)^2$
– ROC Curve
– Recall-Precision Curve
• Baseline Methods
– NS: No Sampling + Single Model
– SS: Sampling + Single Model
– SE: Sampling + Ensemble
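The mean squared error measure compares each estimated probability f(x_i) with the true posterior P(+|x_i) and averages the squared differences. A minimal sketch follows. The function name is hypothetical, and using 0/1 labels as a stand-in when the true posterior is unknown is an assumption, not something the slides state.

```python
def mean_squared_error(estimates, true_posteriors):
    """L = (1/n) * sum_i (f(x_i) - P(+|x_i))^2."""
    if len(estimates) != len(true_posteriors):
        raise ValueError("inputs must have the same length")
    if not estimates:
        raise ValueError("inputs must be non-empty")
    n = len(estimates)
    return sum((f - p) ** 2 for f, p in zip(estimates, true_posteriors)) / n


# An uninformative model that always outputs 0.5, scored against
# 0/1 targets (assumed stand-in for the true posterior).
mse = mean_squared_error([0.5, 0.5, 0.5, 0.5], [1.0, 0.0, 0.0, 0.0])
print(mse)
```

Unlike accuracy, this measure rewards well-calibrated probability estimates rather than only correct labels.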

Experimental Results (1)

[Figure: Mean Squared Error on Synthetic Data. Bar chart comparing SE, NS, and SS under Feature, Conditional, and Dual change; y-axis from 0 to 0.9]

• Feature Change: only P(x) changes
• Conditional Change: only P(y|x) changes
• Dual Change: both P(x) and P(y|x) change


[Figure: Mean Squared Error on Real Data. Bar chart comparing SE, NS, and SS on Thyroid1, Thyroid2, Opt, Letter, and Covtype; y-axis from 0 to 0.25]


[Figure: Plots on Synthetic Data. Panels: ROC Curve, Recall-Precision Plot]

[Figure: Plots on Real Data. Panels: ROC Curve, Recall-Precision Plot]

[Figure: Training Time]

Conclusions

• General issues in stream classification
– concept drifts
– descriptive model
– probability estimation
• Mining skewed data streams
– sampling and ensemble techniques
– accurate and efficient
• Wide applications
– graph data
– airforce data

Thanks!

• Any questions?
