A General Framework for Mining Concept-Drifting Data Streams with
Skewed Distributions
Jing Gao† Wei Fan‡ Jiawei Han† Philip S. Yu‡
†University of Illinois at Urbana-Champaign
‡IBM T. J. Watson Research Center
Introduction (1)
• Data Stream
– Continuously arriving data flow
– Applications: network traffic, credit card transaction flow, phone calling records, etc.
Introduction (2)
• Stream Classification
– Construct a classification model based on past records
– Use the model to predict labels for new data
– Help decision making
[Figure: labeling. An incoming record ("Fraud?") passes through the classification model and receives a label ("Fraud").]
Framework
[Figure: a stream of chunks ending in an unlabeled chunk ("?"); the classification model predicts its labels.]
Concept Drifts
• Changes in P(x,y)
– P(x,y) = P(y|x)P(x), where x is the feature vector and y is the class label
– No Change, Feature Change, Conditional Change, Dual Change
– Expected error is not a good indicator of concept drifts
– Training on the most recent data could help reduce expected error
[Figure: panels for Time Stamp 1, Time Stamp 11, and Time Stamp 21.]
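The four drift categories follow directly from the factorization P(x,y) = P(y|x)P(x): either factor can change on its own, both can change, or neither. A minimal sketch on a toy discrete distribution is given below; the function name `drift_type`, the dictionary representation, and the tolerance `tol` are illustrative assumptions, not part of the original framework.

```python
def drift_type(px_old, px_new, pyx_old, pyx_new, tol=1e-9):
    """Name the drift between two time stamps of a toy discrete distribution.

    px_*  : dict mapping each x to P(x)
    pyx_* : dict mapping each x to P(y=1 | x)
    (Hypothetical representation, chosen only for illustration.)
    """
    # P(x) differs for at least one x
    feature_changed = any(abs(px_old[x] - px_new[x]) > tol for x in px_old)
    # P(y|x) differs for at least one x
    conditional_changed = any(abs(pyx_old[x] - pyx_new[x]) > tol for x in pyx_old)

    if feature_changed and conditional_changed:
        return "dual change"
    if feature_changed:
        return "feature change"
    if conditional_changed:
        return "conditional change"
    return "no change"
```

For example, moving probability mass between two values of x while keeping P(y|x) fixed would be reported as a feature change.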
Issues in Stream Classification(1)
• Generative Model
– P(y|x) follows some distribution
• Descriptive Model
– Let the data decide
• Stream Data
– Distribution unknown and evolving
Issues in Stream Classification(2)
• Label Prediction
– Classify x into one class
• Probability Estimation
– x is assigned to all classes with different probabilities
• Stream Applications
– Stochastic, so prediction confidence information is needed
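The difference between the two output types can be shown in a few lines. In this sketch the class names and probabilities are made up, and `predict_label` is a hypothetical helper: label prediction keeps only the most probable class, while probability estimation keeps the whole distribution.

```python
def predict_label(posteriors):
    """Label prediction: keep only the most probable class."""
    return max(posteriors, key=posteriors.get)


# Probability estimation: every class gets a probability (made-up numbers).
posteriors = {"fraud": 0.45, "normal": 0.55}

# Label prediction collapses this to a single class and drops the confidence.
label = predict_label(posteriors)
```

Here the label alone hides the fact that the record was judged almost as likely to be fraud as not, which is the confidence information the slide refers to.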
Mining Skewed Data Stream
• Skewed Distribution
– Credit card frauds, network intrusions
• Existing Stream Classification Algorithms
– Evaluated on balanced data
• Problems
– Ignore minority examples
– The cost of misclassifying minority examples is usually huge
[Figure: decision tree over positive (+) and negative (-) examples that classifies every leaf node as negative.]
Stream Ensemble Approach (1)
[Figure: a stream of chunks ending in an unlabeled chunk ("?").]
• Training set? Insufficient positive examples!
• Step 1: Sampling
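A minimal sketch of the sampling step, assuming (as the later slides state) that positive examples are carried over from old chunks while negative examples come only from the current chunk and are split into disjoint sets. The function name `build_training_sets`, the `(x, y)` pair representation, and the choice to use every current negative rather than a sampled fraction are assumptions made for illustration.

```python
import random


def build_training_sets(past_positives, current_chunk, k, seed=0):
    """Return k training sets, each with all positives plus one disjoint
    slice of the current chunk's negatives.

    past_positives : list of (x, 1) pairs kept from earlier chunks
    current_chunk  : list of (x, y) pairs with y in {0, 1}
    """
    rng = random.Random(seed)
    # Positives are rare, so reuse old ones together with the current ones.
    positives = list(past_positives) + [(x, y) for x, y in current_chunk if y == 1]
    # Negatives come only from the current chunk.
    negatives = [(x, y) for x, y in current_chunk if y == 0]
    rng.shuffle(negatives)
    # Slice i takes every k-th negative starting at i, so slices never overlap.
    return [positives + negatives[i::k] for i in range(k)]
```

Each returned set can then be handed to any base learner; the ratio of positives to negatives per set is controlled by `k`.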
Stream Ensemble Approach (2)
• Step 2: Ensemble
– Classifiers C1, C2, ……, Ck, trained on sample sets 1, 2, ……, k
– $f^E(x) = \frac{1}{k} \sum_{i=1}^{k} f^i(x)$
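The ensemble output is the plain average of the k individual probability estimates. A minimal sketch, assuming each classifier is a callable that returns its estimate f^i(x) of the positive-class probability (the callable interface and the name `ensemble_probability` are illustrative assumptions):

```python
def ensemble_probability(classifiers, x):
    """Average the k individual estimates: f^E(x) = (1/k) * sum_i f^i(x)."""
    return sum(f(x) for f in classifiers) / len(classifiers)


# Three stand-in classifiers that return fixed estimates (made-up values).
classifiers = [lambda x: 0.2, lambda x: 0.4, lambda x: 0.6]
estimate = ensemble_probability(classifiers, x=None)
```

With the three stand-in classifiers above, the averaged estimate is approximately 0.4.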
Why does this approach work?
• Incorporation of old positive examples
– increase the training size, reduce variance
– negative examples reflect current concepts, so the increase in boundary bias is small
• Ensemble
– reduce variance caused by single model
– disjoint sets of negative examples, so the classifiers will make uncorrelated errors
• Bagging & Boosting
– running cost is much higher
– cannot generate reliable probability estimates for skewed distributions
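The variance argument for uncorrelated errors can be checked numerically. The simulation below is an idealized sketch: it assumes each classifier's error is an independent zero-mean Gaussian with the same variance, which is stronger than what the slides claim, and the function name and parameters are hypothetical.

```python
import random
import statistics


def variance_of_average(k, trials=2000, sigma=1.0, seed=0):
    """Empirical variance of the average of k independent error terms."""
    rng = random.Random(seed)
    averages = [
        # One trial: average the errors of k independent classifiers.
        sum(rng.gauss(0.0, sigma) for _ in range(k)) / k
        for _ in range(trials)
    ]
    return statistics.pvariance(averages)
```

Under these assumptions, `variance_of_average(10)` should come out near one tenth of `variance_of_average(1)`, in line with averaging k uncorrelated errors.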
Analysis
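Single model: $f_c(x) = P(c|x) + \beta_c + \eta_c(x)$, with boundary variance $\sigma_b^2 = \sigma_{\eta_c}^2 / s^2$
Ensemble: $f_c^E(x) = P(c|x) + \bar{\beta}_c + \bar{\eta}_c(x)$
• Error Reduction
– Sampling
– Ensemble: $\sigma_{b^E}^2 = \frac{1}{k^2} \sum_{i=1}^{k} \sigma_{b_i}^2$
• Efficiency Analysis
– Single model: $O(d(n_p + k n_q)\log(n_p + k n_q))$
– Ensemble: $O(dk(n_p + n_q)\log(n_p + n_q))$
– Ensemble is more efficient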
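The two cost expressions can be compared on concrete numbers. The sketch below reads d as the number of features, n_p as the number of positive examples, n_q as the number of negative examples per ensemble member, and k as the number of ensemble members; the slide does not spell these out, so that reading is an assumption, as are the function names and the example values.

```python
import math


def single_model_cost(d, n_p, n_q, k):
    """d * n * log(n) with n = n_p + k * n_q (one model on all examples)."""
    n = n_p + k * n_q
    return d * n * math.log(n)


def ensemble_cost(d, n_p, n_q, k):
    """k models, each costing d * n * log(n) with n = n_p + n_q."""
    n = n_p + n_q
    return d * k * n * math.log(n)


# Skewed setting: few positives, many negatives (made-up sizes).
single = single_model_cost(d=10, n_p=100, n_q=1000, k=10)
ensemble = ensemble_cost(d=10, n_p=100, n_q=1000, k=10)
```

For these sizes the ensemble expression is the smaller of the two. With these expressions the advantage relies on n_p being small relative to n_q, which is the skewed setting the slides target.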
Experiments
• Measures
– Mean Squared Error: $L = \frac{1}{n} \sum_{i=1}^{n} (f(x_i) - P(+|x_i))^2$
– ROC Curve
– Recall-Precision Curve
• Baseline Methods
– NS: No sampling + Single Model
– SS: Sampling + Single Model
– SE: Sampling + Ensemble
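The mean squared error measure compares each estimated probability with the true posterior of the positive class. A minimal sketch, assuming both are available as equal-length lists of floats (the function name and the example values are illustrative):

```python
def mean_squared_error(estimates, true_posteriors):
    """L = (1/n) * sum_i (f(x_i) - P(+|x_i))^2 over n examples."""
    n = len(estimates)
    return sum((f - p) ** 2 for f, p in zip(estimates, true_posteriors)) / n


# An uninformative estimator that always outputs 0.5 (made-up example).
loss = mean_squared_error([0.5, 0.5], [1.0, 0.0])
```

On real data the true posterior is usually unknown, so how P(+|x_i) is obtained there is left open by this sketch.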
Experimental Results (1)
Mean Squared Error on Synthetic Data
[Figure: bar chart of mean squared error (y-axis from 0 to 0.9) for SE, NS, and SS under Feature, Conditional, and Dual change.]
• Feature Change: only P(x) changes
• Conditional Change: only P(y|x) changes
• Dual Change: both P(x) and P(y|x) change
Experimental Results (2)
Mean Squared Error on Real Data
[Figure: bar chart of mean squared error (y-axis from 0 to 0.25) for SE, NS, and SS on Thyroid1, Thyroid2, Opt, Letter, and Covtype.]
Experimental Results (3)
[Figure: ROC curve and recall-precision plot on synthetic data.]
Experimental Results (4)
[Figure: ROC curve and recall-precision plot on real data.]
Experimental Results (5)
Training Time
Conclusions
• General issues in stream classification
– concept drifts
– descriptive model
– probability estimation
• Mining skewed data streams
– sampling and ensemble techniques
– accurate and efficient
• Wide applications
– graph data
– airforce data
Thanks!
• Any questions?