From Feature Construction, to Simple but Effective Modeling, to Domain Transfer
Wei Fan, IBM T.J. Watson Research Center
www.cs.columbia.edu/~wfan | www.weifan.info
[email protected], [email protected]
![Page 1: From Feature Construction, to Simple but Effective Modeling, to Domain Transfer](https://reader030.vdocuments.net/reader030/viewer/2022032606/56812dac550346895d92d547/html5/thumbnails/1.jpg)
From Feature Construction, to Simple but Effective
Modeling, to Domain Transfer
Wei Fan, IBM T.J. Watson Research Center
www.cs.columbia.edu/~wfan
www.weifan.info
![Page 2: From Feature Construction, to Simple but Effective Modeling, to Domain Transfer](https://reader030.vdocuments.net/reader030/viewer/2022032606/56812dac550346895d92d547/html5/thumbnails/2.jpg)
Feature Vector
Most data mining and machine learning models assume the following structured data: (x1, x2, ..., xk) -> y, where the xi's are independent variables and y is the dependent variable.

If y is drawn from a discrete set: classification. If y is drawn from a continuous range: regression.
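A toy illustration of this setting; the feature values and labels below are made up purely for illustration:

```python
# Structured data: each example is a feature vector (x1, ..., xk) plus a label y.
classification_data = [
    # (x1, x2, x3) -> y, with y drawn from a discrete set
    ((5.1, 3.5, 1.4), "setosa"),
    ((6.7, 3.0, 5.2), "virginica"),
]
regression_data = [
    # (x1, x2, x3) -> y, with y drawn from a continuous range
    ((0.2, 1.5, 3.0), 12.7),
]

def is_classification(dataset):
    """Classification if the dependent variable y is discrete (non-numeric here)."""
    return not isinstance(dataset[0][1], float)
```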
![Page 3: From Feature Construction, to Simple but Effective Modeling, to Domain Transfer](https://reader030.vdocuments.net/reader030/viewer/2022032606/56812dac550346895d92d547/html5/thumbnails/3.jpg)
Frequent Pattern-Based Feature Construction
Some data is not in pre-defined feature vectors:
- Transactions
- Biological sequences
- Graph databases

Frequent patterns are good candidates for discriminative features. So, how do we mine them?
![Page 4: From Feature Construction, to Simple but Effective Modeling, to Domain Transfer](https://reader030.vdocuments.net/reader030/viewer/2022032606/56812dac550346895d92d547/html5/thumbnails/4.jpg)
FP: Sub-graph

[Figure: a discovered sub-graph pattern highlighted in the molecular graphs of compounds NSC 4960, NSC 191370, NSC 40773, NSC 164863, and NSC 699181; atom labels omitted.]

(example borrowed from George Karypis' presentation)
![Page 5: From Feature Construction, to Simple but Effective Modeling, to Domain Transfer](https://reader030.vdocuments.net/reader030/viewer/2022032606/56812dac550346895d92d547/html5/thumbnails/5.jpg)
Computational Issues
A pattern is measured by its "frequency" or support, e.g., frequent subgraphs with sup ≥ 10%.
Cannot enumerate the patterns with sup = 10% without first enumerating all patterns with sup > 10%.
Random sampling does not work, since it is not exhaustive.
An NP-hard problem.
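A minimal sketch of support counting for itemsets (a toy Apriori-style enumeration, not the slides' exact miner), which makes the combinatorial explosion concrete: every candidate up to `max_size` must be counted before the support threshold can be applied.

```python
from itertools import combinations
from collections import Counter

def frequent_itemsets(transactions, min_sup, max_size=2):
    """Return itemsets with support >= min_sup.

    Support = fraction of transactions containing the itemset. Note that
    all candidates are enumerated before filtering -- the cost the slide
    is pointing at.
    """
    n = len(transactions)
    counts = Counter()
    for t in transactions:
        items = sorted(set(t))
        for size in range(1, max_size + 1):
            for combo in combinations(items, size):
                counts[combo] += 1
    return {p: c / n for p, c in counts.items() if c / n >= min_sup}

txns = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["b", "c"], ["a", "b", "c"]]
sup = frequent_itemsets(txns, min_sup=0.6)
# ("a",) appears in 4 of 5 transactions -> support 0.8
```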
![Page 6: From Feature Construction, to Simple but Effective Modeling, to Domain Transfer](https://reader030.vdocuments.net/reader030/viewer/2022032606/56812dac550346895d92d547/html5/thumbnails/6.jpg)
Conventional Procedure: Feature Construction followed by Selection (Two-Step Batch Method)

1. Mine frequent patterns (> sup) from the dataset.
2. Select the most discriminative patterns (e.g., patterns 1, 2, 4).
3. Represent the data in the feature space using the selected patterns:

   | | F1 | F2 | F4 |
   |---|---|---|---|
   | Data1 | 1 | 1 | 0 |
   | Data2 | 1 | 0 | 1 |
   | Data3 | 1 | 1 | 0 |
   | Data4 | 0 | 0 | 1 |

4. Build classification models with any classifier you can name: NN, DT, SVM, LR, ... (e.g., a decision tree splitting on Petal.Length < 2.45 and Petal.Width < 1.75 into setosa / versicolor / virginica).
![Page 7: From Feature Construction, to Simple but Effective Modeling, to Domain Transfer](https://reader030.vdocuments.net/reader030/viewer/2022032606/56812dac550346895d92d547/html5/thumbnails/7.jpg)
Two Problems

Mine step: combinatorial explosion.
1. Exponential explosion of frequent patterns.
2. Patterns are not considered at all if min-support isn't small enough.
![Page 8: From Feature Construction, to Simple but Effective Modeling, to Domain Transfer](https://reader030.vdocuments.net/reader030/viewer/2022032606/56812dac550346895d92d547/html5/thumbnails/8.jpg)
Two Problems

Select step: issue of discriminative power.
3. InfoGain is computed against the complete dataset, NOT on subsets of examples.
4. Correlation among patterns is not directly evaluated on their joint predictability.
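To make point 3 concrete, here is a small sketch of InfoGain for a binary pattern feature. The function scores the pattern on whatever example subset it is given; the batch method always passes the complete dataset, which is exactly the weakness noted above. The helper names are mine, not from the slides.

```python
from math import log2

def entropy(labels):
    """Shannon entropy of a label list, in bits."""
    if not labels:
        return 0.0
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * log2(p) for p in probs)

def info_gain(has_pattern, labels):
    """InfoGain of a binary 'example contains pattern' feature."""
    n = len(labels)
    pos = [y for h, y in zip(has_pattern, labels) if h]
    neg = [y for h, y in zip(has_pattern, labels) if not h]
    cond = (len(pos) / n) * entropy(pos) + (len(neg) / n) * entropy(neg)
    return entropy(labels) - cond

labels = [1, 1, 1, 0, 0, 0]
perfect = [True, True, True, False, False, False]  # pattern splits the classes exactly
```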
![Page 9: From Feature Construction, to Simple but Effective Modeling, to Domain Transfer](https://reader030.vdocuments.net/reader030/viewer/2022032606/56812dac550346895d92d547/html5/thumbnails/9.jpg)
Direct Mining & Selection via Model-based Search Tree: Basic Flow

Divide-and-conquer based frequent pattern mining. At each node (1, 2, 3, ... 7, ...), mine frequent patterns on the examples reaching that node and select the top P (20%) of them; the most discriminative feature, chosen by InfoGain, splits the data into Y/N branches. Recursion stops at nodes with few data, which become leaves (+). The result is a compact set of highly discriminative patterns, and the tree is both the feature miner and the classifier.

Global Support: 10 * 20% / 10000 = 0.02%
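The flow above can be sketched as a short recursion. This is my own simplified rendering, not the paper's implementation: `mine_and_select` stands in for the per-node "Mine & Select P: 20% + InfoGain" step and is assumed to return the best pattern as a predicate over an example, or `None` when nothing useful is found.

```python
def mbt(examples, mine_and_select, min_size=3, depth=0, max_depth=10):
    """Model-based search tree sketch: mine patterns only on the examples
    reaching each node, split on the most discriminative one, recurse."""
    if len(examples) <= min_size or depth >= max_depth:
        return {"leaf": True, "examples": examples}      # "Few Data" leaf
    pattern = mine_and_select(examples)
    if pattern is None:
        return {"leaf": True, "examples": examples}
    yes = [e for e in examples if pattern(e)]
    no = [e for e in examples if not pattern(e)]
    if not yes or not no:                                # split did nothing
        return {"leaf": True, "examples": examples}
    return {
        "leaf": False,
        "pattern": pattern,
        "yes": mbt(yes, mine_and_select, min_size, depth + 1, max_depth),
        "no": mbt(no, mine_and_select, min_size, depth + 1, max_depth),
    }

# Hypothetical miner: propose a split only when a node is large enough.
def toy_miner(examples):
    return (lambda e: e > 5) if len(examples) > 4 else None

tree = mbt(list(range(10)), toy_miner)
```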
![Page 10: From Feature Construction, to Simple but Effective Modeling, to Domain Transfer](https://reader030.vdocuments.net/reader030/viewer/2022032606/56812dac550346895d92d547/html5/thumbnails/10.jpg)
Analyses (I)

1. Scalability of pattern enumeration
   - Upper bound (Theorem 1)
   - "Scale down" ratio
2. Bound on the number of returned features
![Page 11: From Feature Construction, to Simple but Effective Modeling, to Domain Transfer](https://reader030.vdocuments.net/reader030/viewer/2022032606/56812dac550346895d92d547/html5/thumbnails/11.jpg)
Analyses (II)

3. Subspace pattern selection (original set vs. subset)
4. Non-overfitting
5. Optimality under exhaustive search
![Page 12: From Feature Construction, to Simple but Effective Modeling, to Domain Transfer](https://reader030.vdocuments.net/reader030/viewer/2022032606/56812dac550346895d92d547/html5/thumbnails/12.jpg)
Experimental Studies: Itemset Mining (I)
Scalability Comparison

[Charts: Log(DT #Pat) vs. Log(MbT #Pat), and Log(DT Abs Support) vs. Log(MbT Abs Support), on Adult, Chess, Hypo, Sick, Sonar.]

| Dataset | #Pat using MbT sup | Ratio (MbT #Pat / #Pat using MbT sup) |
|---|---|---|
| Adult | 252809 | 0.41% |
| Chess | +∞ | ~0% |
| Hypo | 423439 | 0.0035% |
| Sick | 4818391 | 0.00032% |
| Sonar | 95507 | 0.00775% |
![Page 13: From Feature Construction, to Simple but Effective Modeling, to Domain Transfer](https://reader030.vdocuments.net/reader030/viewer/2022032606/56812dac550346895d92d547/html5/thumbnails/13.jpg)
Experimental Studies: Itemset Mining (II)
Accuracy of Mined Itemsets

[Chart: DT Accuracy vs. MbT Accuracy, 70%-100%, on Adult, Chess, Hypo, Sick, Sonar.]

MbT: 4 wins, 1 loss, but with a much smaller number of patterns.
![Page 14: From Feature Construction, to Simple but Effective Modeling, to Domain Transfer](https://reader030.vdocuments.net/reader030/viewer/2022032606/56812dac550346895d92d547/html5/thumbnails/14.jpg)
Experimental Studies: Itemset Mining (III)
Convergence
![Page 15: From Feature Construction, to Simple but Effective Modeling, to Domain Transfer](https://reader030.vdocuments.net/reader030/viewer/2022032606/56812dac550346895d92d547/html5/thumbnails/15.jpg)
Experimental Studies: Graph Mining (I)
9 NCI anti-cancer screen datasets (The PubChem Project, pubchem.ncbi.nlm.nih.gov); active (positive) class: around 1% to 8.3%.
2 AIDS anti-viral screen datasets (http://dtp.nci.nih.gov); H1: CM+CA, 3.5% positive; H2: CA, 1% positive.
![Page 16: From Feature Construction, to Simple but Effective Modeling, to Domain Transfer](https://reader030.vdocuments.net/reader030/viewer/2022032606/56812dac550346895d92d547/html5/thumbnails/16.jpg)
Experimental Studies: Graph Mining (II): Scalability

[Charts: DT #Pat vs. MbT #Pat (0-1800), and Log(DT Abs Support) vs. Log(MbT Abs Support), on NCI1, NCI33, NCI41, NCI47, NCI81, NCI83, NCI109, NCI123, NCI145, H1, H2.]
![Page 17: From Feature Construction, to Simple but Effective Modeling, to Domain Transfer](https://reader030.vdocuments.net/reader030/viewer/2022032606/56812dac550346895d92d547/html5/thumbnails/17.jpg)
Experimental Studies: Graph Mining (III): AUC and Accuracy

[Charts on NCI1, NCI33, NCI41, NCI47, NCI81, NCI83, NCI109, NCI123, NCI145, H1, H2: AUC (0.5-0.8), DT vs. MbT: 11 wins for MbT; Accuracy (0.88-1), DT vs. MbT: 10 wins, 1 loss.]
![Page 18: From Feature Construction, to Simple but Effective Modeling, to Domain Transfer](https://reader030.vdocuments.net/reader030/viewer/2022032606/56812dac550346895d92d547/html5/thumbnails/18.jpg)
Experimental Studies: Graph Mining (IV)

AUC of MbT, DT vs. benchmarks: 7 wins, 4 losses.
![Page 19: From Feature Construction, to Simple but Effective Modeling, to Domain Transfer](https://reader030.vdocuments.net/reader030/viewer/2022032606/56812dac550346895d92d547/html5/thumbnails/19.jpg)
Summary: Model-based Search Tree

- Integrated feature mining and construction.
- Dynamic support: can mine patterns with extremely small support.
- Both a feature constructor and a classifier.
- Not limited to one type of frequent pattern: plug-and-play.
- Experiment results on itemset mining and graph mining.
- New: found a DNA sequence not previously reported but that can be explained in biology.
- Code and datasets available for download.
![Page 20: From Feature Construction, to Simple but Effective Modeling, to Domain Transfer](https://reader030.vdocuments.net/reader030/viewer/2022032606/56812dac550346895d92d547/html5/thumbnails/20.jpg)
How to train models?

Even though the true distribution is unknown, still assume that the data is generated by some known function, and estimate the parameters inside that function via the training data (or CV on the training data). Some unknown distribution generates the data; we fit Model 1, Model 2, ..., Model 6.

There will probably always be mistakes unless:
1. The chosen model indeed generates the distribution.
2. The data is sufficient to estimate those parameters.

But what if you don't know which to choose, or use the wrong one?

List of methods:
- Logistic Regression
- Probit models
- Naïve Bayes
- Kernel Methods
- Linear Regression
- RBF
- Mixture models

After the structure is fixed, learning becomes optimization to minimize errors: quadratic loss, exponential loss, slack variables.
![Page 21: From Feature Construction, to Simple but Effective Modeling, to Domain Transfer](https://reader030.vdocuments.net/reader030/viewer/2022032606/56812dac550346895d92d547/html5/thumbnails/21.jpg)
How to train models II

Not quite sure of the exact function, so use a family of "free-form" functions given some "preference criteria".

There will probably always be mistakes unless:
- the training data is sufficiently large;
- the free-form function/criteria are appropriate.

List of methods:
- Decision Trees
- RIPPER rule learner
- CBA: association rules
- clustering-based methods
- ...

Preference criteria: the simplest hypothesis that fits the data is the best. Heuristics: info gain, gini index, Kearns-Mansour, etc.; pruning: MDL pruning, reduced-error pruning, cost-based pruning.

Truth: none of the purity-check functions guarantees accuracy on unseen test data; they only try to build a smaller model.
![Page 22: From Feature Construction, to Simple but Effective Modeling, to Domain Transfer](https://reader030.vdocuments.net/reader030/viewer/2022032606/56812dac550346895d92d547/html5/thumbnails/22.jpg)
Can Data Speak for Themselves?

Make no assumption about the true model, neither parametric form nor free form. "Encode" the data in some rather "neutral" representation: think of it like encoding numbers in a computer's binary representation. Some numbers can never be represented exactly, but overall it is accurate enough.

Main challenge: avoid "rote learning" (do not remember all the details) and generalize. "Evenly" representing "numbers" corresponds to "evenly" encoding the "data".
![Page 23: From Feature Construction, to Simple but Effective Modeling, to Domain Transfer](https://reader030.vdocuments.net/reader030/viewer/2022032606/56812dac550346895d92d547/html5/thumbnails/23.jpg)
Potential Advantages
If the accuracy is quite good, then the method is quite "automatic and easy" to use. A no-brainer: DM can be everybody's tool.
![Page 24: From Feature Construction, to Simple but Effective Modeling, to Domain Transfer](https://reader030.vdocuments.net/reader030/viewer/2022032606/56812dac550346895d92d547/html5/thumbnails/24.jpg)
Encoding Data for Major Problems
Classification: given a set of labeled data items, such as (amt, merchant category, outstanding balance, date/time, ...), where the label is whether the transaction is a fraud or non-fraud. Label: a set of discrete values. Classifier: predict if a transaction is a fraud or non-fraud.

Probability Estimation: similar to the above setting, but estimate the probability that a transaction is a fraud. Difference: no truth is given, i.e., no true probability.

Regression: given a set of valued data items, such as (zipcode, capital gain, education, ...), the value of interest is annual gross income. Target value: continuous.

Several other on-going problems.
![Page 25: From Feature Construction, to Simple but Effective Modeling, to Domain Transfer](https://reader030.vdocuments.net/reader030/viewer/2022032606/56812dac550346895d92d547/html5/thumbnails/25.jpg)
Encoding Data in Decision Trees
Think of each tree as a way to "encode" the training data. Why a tree? A decision tree records some common characteristics of the data, but not every piece of trivial detail. Obviously, each tree encodes the data differently. Subjective criteria that prefer some encodings over others are always ad hoc, so do not prefer anything: just do it randomly. Minimize the difference with multiple encodings, and then "average" them.
[Figure, shown twice: scatter plots of the iris data, Petal length (1-7) vs. Petal width (0.5-2.5), with points labeled s (setosa), c (versicolor), and v (virginica).]
![Page 26: From Feature Construction, to Simple but Effective Modeling, to Domain Transfer](https://reader030.vdocuments.net/reader030/viewer/2022032606/56812dac550346895d92d547/html5/thumbnails/26.jpg)
Random Decision Tree to Encode Data
(classification, regression, probability estimation)

At each node, an unused feature is chosen randomly:
- A discrete feature is unused if it has never been chosen previously on the decision path from the root to the current node.
- A continuous feature can be chosen multiple times on the same decision path, but each time a different threshold value is chosen.
![Page 27: From Feature Construction, to Simple but Effective Modeling, to Domain Transfer](https://reader030.vdocuments.net/reader030/viewer/2022032606/56812dac550346895d92d547/html5/thumbnails/27.jpg)
Continued
We stop when one of the following happens:
- A node becomes too small (<= 3 examples), or
- the total height of the tree exceeds some limit, such as the total number of features.
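The construction rules above can be sketched as follows. This is a minimal, assumed rendering of the slides' description, not the authors' code: `features` maps a name to either `"discrete"` or the `(lo, hi)` range of a continuous feature, and examples are `(feature_dict, label)` pairs.

```python
import random

def build_rdt(examples, features, depth=0, rng=random, min_size=3):
    """One random decision tree: random unused feature per node; discrete
    features used at most once per path, continuous ones reusable with a
    fresh random threshold; stop on small nodes or excessive height."""
    if len(examples) <= min_size or depth >= len(features) or not features:
        return {"leaf": True, "labels": [y for _, y in examples]}
    name = rng.choice(sorted(features))
    kind = features[name]
    if kind == "discrete":
        remaining = {k: v for k, v in features.items() if k != name}
        test = lambda x: x[name] == 0
    else:  # continuous: stays available further down the path
        lo, hi = kind
        thr = rng.uniform(lo, hi)
        remaining = features
        test = lambda x: x[name] < thr
    yes = [(x, y) for x, y in examples if test(x)]
    no = [(x, y) for x, y in examples if not test(x)]
    if not yes or not no:  # degenerate split
        return {"leaf": True, "labels": [y for _, y in examples]}
    return {"leaf": False, "test": test,
            "yes": build_rdt(yes, remaining, depth + 1, rng, min_size),
            "no": build_rdt(no, remaining, depth + 1, rng, min_size)}

data = [({"b1": i % 2}, i % 2) for i in range(10)]
tree = build_rdt(data, {"b1": "discrete"}, rng=random.Random(0))
```

Note that no purity measure is consulted anywhere: only the structure is random, while the leaf label counts ("node statistics") come entirely from the training data.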
![Page 28: From Feature Construction, to Simple but Effective Modeling, to Domain Transfer](https://reader030.vdocuments.net/reader030/viewer/2022032606/56812dac550346895d92d547/html5/thumbnails/28.jpg)
Illustration of RDT
Features: B1: {0,1}, B2: {0,1}, B3: continuous.

B1 is chosen randomly at the root (B1 == 0?). On one branch, B2 is chosen randomly (B2 == 0?). Deeper down, B3 is chosen randomly with a random threshold 0.3 (B3 < 0.3?), and B3 can be chosen again later on the same path with a different random threshold 0.6 (B3 < 0.6?).
![Page 29: From Feature Construction, to Simple but Effective Modeling, to Domain Transfer](https://reader030.vdocuments.net/reader030/viewer/2022032606/56812dac550346895d92d547/html5/thumbnails/29.jpg)
Classification
The tree: Petal.Length < 2.45 separates setosa (50/0/0); then Petal.Width < 1.75 separates versicolor (0/49/5) from virginica (0/1/45).

For an example x reaching the versicolor leaf:

P(setosa|x,θ) = 0
P(versicolor|x,θ) = 49/54
P(virginica|x,θ) = 5/54
![Page 30: From Feature Construction, to Simple but Effective Modeling, to Domain Transfer](https://reader030.vdocuments.net/reader030/viewer/2022032606/56812dac550346895d92d547/html5/thumbnails/30.jpg)
Regression
The same tree structure (Petal.Length < 2.45, Petal.Width < 1.75) with leaves setosa Height = 10 in, versicolor Height = 15 in, virginica Height = 12 in.

A leaf predicts 15 in: the average value of all examples in this leaf node.
![Page 31: From Feature Construction, to Simple but Effective Modeling, to Domain Transfer](https://reader030.vdocuments.net/reader030/viewer/2022032606/56812dac550346895d92d547/html5/thumbnails/31.jpg)
Prediction
Simply Averaging over multiple trees
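The prediction rule is just an average of the per-tree posteriors. A self-contained sketch (the example posteriors reuse the 49/54 and 5/54 leaf statistics from the classification slide; the second tree's numbers are made up):

```python
def average_posteriors(tree_posteriors):
    """Combine per-tree class posteriors by simple averaging.
    Each element is a dict mapping class -> P(class | x, tree)."""
    classes = tree_posteriors[0]
    return {c: sum(p[c] for p in tree_posteriors) / len(tree_posteriors)
            for c in classes}

posteriors = [
    {"setosa": 0.0, "versicolor": 49 / 54, "virginica": 5 / 54},
    {"setosa": 0.0, "versicolor": 0.7, "virginica": 0.3},  # hypothetical 2nd tree
]
avg = average_posteriors(posteriors)
```

For regression, the same averaging is applied to the leaves' average values instead of class posteriors.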
![Page 32: From Feature Construction, to Simple but Effective Modeling, to Domain Transfer](https://reader030.vdocuments.net/reader030/viewer/2022032606/56812dac550346895d92d547/html5/thumbnails/32.jpg)
Potential Advantage
- Training can be very efficient, particularly for very large datasets.
- No cross-validation-based estimation of parameters, as required by some parametric methods.
- Natural multi-class probability.
- Natural multi-label classification and probability estimation.
- Imposes very little about the structure of the model.
![Page 33: From Feature Construction, to Simple but Effective Modeling, to Domain Transfer](https://reader030.vdocuments.net/reader030/viewer/2022032606/56812dac550346895d92d547/html5/thumbnails/33.jpg)
Reasons
The true distribution P(y|X) is never known. (Is it an elephant?)

No random tree is a random guess of this P(y|X): the structure is random, but the "node statistics" are not. Every random tree is consistent with the training data. Each tree is quite strong, not weak; in other words, if the distribution is the same, each random tree is itself a rather decent model.
![Page 34: From Feature Construction, to Simple but Effective Modeling, to Domain Transfer](https://reader030.vdocuments.net/reader030/viewer/2022032606/56812dac550346895d92d547/html5/thumbnails/34.jpg)
Expected Error Reduction
Proven for quadratic loss, such as:
- probability estimation: $(P(y|X) - P(y|X, \theta))^2$
- regression problems: $(y - f(x))^2$

General theorem: the "expected quadratic loss" of RDT (and any other model averaging) is less than that of any combined model chosen "at random".
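One standard way to see this for quadratic loss (a textbook model-averaging argument, sketched here rather than quoted from the slides) is Jensen's inequality applied to the convex square function:

```latex
% \bar{P}(y|x) = \mathbb{E}_\theta[\,P(y|x,\theta)\,] is the averaged (ensemble) estimate.
% Since squaring is convex, Jensen's inequality gives, pointwise in (x, y):
\left( P(y|x) - \bar{P}(y|x) \right)^2
  = \left( \mathbb{E}_\theta\!\left[ P(y|x) - P(y|x,\theta) \right] \right)^2
  \le \mathbb{E}_\theta\!\left[ \left( P(y|x) - P(y|x,\theta) \right)^2 \right]
```

So the loss of the average is never worse than the average loss of a randomly chosen single model; the same argument applies with $y - f(x)$ for regression.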
![Page 35: From Feature Construction, to Simple but Effective Modeling, to Domain Transfer](https://reader030.vdocuments.net/reader030/viewer/2022032606/56812dac550346895d92d547/html5/thumbnails/35.jpg)
Theorem Summary
![Page 36: From Feature Construction, to Simple but Effective Modeling, to Domain Transfer](https://reader030.vdocuments.net/reader030/viewer/2022032606/56812dac550346895d92d547/html5/thumbnails/36.jpg)
Number of trees
Sampling theory: a random decision tree can be thought of as a sample from a large (infinite, when continuous features exist) population of trees.

Unless the data is highly skewed, 30 to 50 trees give a pretty good estimate with reasonably small variance. In most cases, 10 trees are usually enough.
![Page 37: From Feature Construction, to Simple but Effective Modeling, to Domain Transfer](https://reader030.vdocuments.net/reader030/viewer/2022032606/56812dac550346895d92d547/html5/thumbnails/37.jpg)
Variance Reduction
![Page 38: From Feature Construction, to Simple but Effective Modeling, to Domain Transfer](https://reader030.vdocuments.net/reader030/viewer/2022032606/56812dac550346895d92d547/html5/thumbnails/38.jpg)
Optimal Decision Boundary
from Tony Liu’s thesis (supervised by Kai Ming Ting)
![Page 39: From Feature Construction, to Simple but Effective Modeling, to Domain Transfer](https://reader030.vdocuments.net/reader030/viewer/2022032606/56812dac550346895d92d547/html5/thumbnails/39.jpg)
RDT looks like the optimal boundary.
![Page 40: From Feature Construction, to Simple but Effective Modeling, to Domain Transfer](https://reader030.vdocuments.net/reader030/viewer/2022032606/56812dac550346895d92d547/html5/thumbnails/40.jpg)
Regression Decision Boundary (GUIDE)
Properties:
- Broken and discontinuous
- Some points are far from the truth
- Some wrong ups and downs
![Page 41: From Feature Construction, to Simple but Effective Modeling, to Domain Transfer](https://reader030.vdocuments.net/reader030/viewer/2022032606/56812dac550346895d92d547/html5/thumbnails/41.jpg)
RDT Computed Function

Properties:
- Smooth and continuous
- Close to the true function
- All ups and downs caught
![Page 42: From Feature Construction, to Simple but Effective Modeling, to Domain Transfer](https://reader030.vdocuments.net/reader030/viewer/2022032606/56812dac550346895d92d547/html5/thumbnails/42.jpg)
Hidden Variable
![Page 43: From Feature Construction, to Simple but Effective Modeling, to Domain Transfer](https://reader030.vdocuments.net/reader030/viewer/2022032606/56812dac550346895d92d547/html5/thumbnails/43.jpg)
Hidden Variable: Limitation of GUIDE

Need to decide grouping variables and independent variables: a non-trivial task.
If all variables are categorical, GUIDE becomes a single CART regression tree.
Strong assumptions and greedy-based search can sometimes lead to very unexpected results.
![Page 44: From Feature Construction, to Simple but Effective Modeling, to Domain Transfer](https://reader030.vdocuments.net/reader030/viewer/2022032606/56812dac550346895d92d547/html5/thumbnails/44.jpg)
It grows like …
![Page 45: From Feature Construction, to Simple but Effective Modeling, to Domain Transfer](https://reader030.vdocuments.net/reader030/viewer/2022032606/56812dac550346895d92d547/html5/thumbnails/45.jpg)
ICDM’08 Cup Crown Winner
Nuclear ban monitoring: the RDT-based approach is the highest-award winner.
![Page 46: From Feature Construction, to Simple but Effective Modeling, to Domain Transfer](https://reader030.vdocuments.net/reader030/viewer/2022032606/56812dac550346895d92d547/html5/thumbnails/46.jpg)
Ozone Level Prediction (ICDM'06 Best Application Paper)
Daily summary maps of two datasets from Texas Commission on Environmental Quality (TCEQ)
![Page 47: From Feature Construction, to Simple but Effective Modeling, to Domain Transfer](https://reader030.vdocuments.net/reader030/viewer/2022032606/56812dac550346895d92d547/html5/thumbnails/47.jpg)
SVM: 1-hr criteria CV
![Page 48: From Feature Construction, to Simple but Effective Modeling, to Domain Transfer](https://reader030.vdocuments.net/reader030/viewer/2022032606/56812dac550346895d92d547/html5/thumbnails/48.jpg)
AdaBoost: 1-hr criteria CV
![Page 49: From Feature Construction, to Simple but Effective Modeling, to Domain Transfer](https://reader030.vdocuments.net/reader030/viewer/2022032606/56812dac550346895d92d547/html5/thumbnails/49.jpg)
SVM: 8-hr criteria CV
![Page 50: From Feature Construction, to Simple but Effective Modeling, to Domain Transfer](https://reader030.vdocuments.net/reader030/viewer/2022032606/56812dac550346895d92d547/html5/thumbnails/50.jpg)
AdaBoost: 8-hr criteria CV
![Page 51: From Feature Construction, to Simple but Effective Modeling, to Domain Transfer](https://reader030.vdocuments.net/reader030/viewer/2022032606/56812dac550346895d92d547/html5/thumbnails/51.jpg)
Other Applications
- Credit card fraud detection
- Late and default payment prediction
- Intrusion detection
- Semiconductor process control
- Trading anomaly detection
![Page 52: From Feature Construction, to Simple but Effective Modeling, to Domain Transfer](https://reader030.vdocuments.net/reader030/viewer/2022032606/56812dac550346895d92d547/html5/thumbnails/52.jpg)
Conclusion

Imposing a particular form of model may not be a good idea for training highly accurate models for general-purpose DM. It may not even be efficient for some forms of models. RDT has been shown to solve all three major problems in data mining (classification, probability estimation, and regression) simply, efficiently, and accurately. When the physical truth is unknown, RDT is highly recommended. Code and datasets are available for download.
![Page 53: From Feature Construction, to Simple but Effective Modeling, to Domain Transfer](https://reader030.vdocuments.net/reader030/viewer/2022032606/56812dac550346895d92d547/html5/thumbnails/53.jpg)
Standard Supervised Learning
Training (labeled): New York Times articles. Test (unlabeled): New York Times articles. Classifier accuracy: 85.5%.
![Page 54: From Feature Construction, to Simple but Effective Modeling, to Domain Transfer](https://reader030.vdocuments.net/reader030/viewer/2022032606/56812dac550346895d92d547/html5/thumbnails/54.jpg)
In Reality……
Training (labeled): Reuters articles, because labeled New York Times data is not available! Test (unlabeled): New York Times articles. Classifier accuracy: 64.1%.
![Page 55: From Feature Construction, to Simple but Effective Modeling, to Domain Transfer](https://reader030.vdocuments.net/reader030/viewer/2022032606/56812dac550346895d92d547/html5/thumbnails/55.jpg)
Domain Difference: Performance Drop

- Ideal setting: train on New York Times (NYT), test on NYT, classifier accuracy 85.5%.
- Realistic setting: train on Reuters, test on NYT, classifier accuracy 64.1%.
![Page 56: From Feature Construction, to Simple but Effective Modeling, to Domain Transfer](https://reader030.vdocuments.net/reader030/viewer/2022032606/56812dac550346895d92d547/html5/thumbnails/56.jpg)
A Synthetic Example
Training (which has conflicting concepts) and test distributions are partially overlapping.
![Page 57: From Feature Construction, to Simple but Effective Modeling, to Domain Transfer](https://reader030.vdocuments.net/reader030/viewer/2022032606/56812dac550346895d92d547/html5/thumbnails/57.jpg)
Goal
To unify the knowledge that is consistent with the test domain from multiple source domains (models), transferring from several source domains to one target domain.
![Page 58: From Feature Construction, to Simple but Effective Modeling, to Domain Transfer](https://reader030.vdocuments.net/reader030/viewer/2022032606/56812dac550346895d92d547/html5/thumbnails/58.jpg)
Summary: transfer from one or multiple source domains when the target domain has no labeled examples. No need to re-train: rely on base models trained from each domain; the base models are not necessarily developed for transfer learning applications.
![Page 59: From Feature Construction, to Simple but Effective Modeling, to Domain Transfer](https://reader030.vdocuments.net/reader030/viewer/2022032606/56812dac550346895d92d547/html5/thumbnails/59.jpg)
Locally Weighted Ensemble

Models M1, M2, ..., Mk are trained on training sets 1 to k, with x the feature value and y the class label:

$f_i(x, y) = P(y \mid x, M_i)$

$f^{E}(x, y) = \sum_{i=1}^{k} w_i(x)\, f_i(x, y), \qquad \sum_{i=1}^{k} w_i(x) = 1$

$y \mid x = \arg\max_y f^{E}(x, y)$

For a test example x, each model M_i contributes its prediction f_i(x, y) weighted by a per-example weight w_i(x).
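The combination rule above can be sketched in a few lines. This is an illustrative rendering under my own naming, not the authors' code: `models` are functions `x -> {class: P(y|x, M_i)}` and `weight_fns` give each model's per-example weight `w_i(x)` (here normalized to satisfy the simplex constraint).

```python
def lwe_predict(x, models, weight_fns):
    """Locally weighted ensemble: y|x = argmax_y sum_i w_i(x) * f_i(x, y)."""
    ws = [w(x) for w in weight_fns]
    total = sum(ws)
    ws = [w / total for w in ws]          # enforce sum_i w_i(x) = 1
    combined = {}
    for w, m in zip(ws, models):
        for y, p in m(x).items():
            combined[y] = combined.get(y, 0.0) + w * p
    return max(combined, key=combined.get)

# Two hypothetical base models that disagree on example x.
m1 = lambda x: {"a": 0.9, "b": 0.1}
m2 = lambda x: {"a": 0.2, "b": 0.8}
```

Which model dominates the final prediction depends entirely on the local weights, which is the point of the scheme.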
![Page 60: From Feature Construction, to Simple but Effective Modeling, to Domain Transfer](https://reader030.vdocuments.net/reader030/viewer/2022032606/56812dac550346895d92d547/html5/thumbnails/60.jpg)
Modified Bayesian Model Averaging

Bayesian Model Averaging combines models M1, ..., Mk on the test set with weights conditioned on the training data D:

$P(y \mid x) = \sum_{i=1}^{k} P(M_i \mid D)\, P(y \mid x, M_i)$

Modified for transfer learning, the weights are conditioned on the test example x instead:

$P(y \mid x) = \sum_{i=1}^{k} P(M_i \mid x)\, P(y \mid x, M_i)$
![Page 61: From Feature Construction, to Simple but Effective Modeling, to Domain Transfer](https://reader030.vdocuments.net/reader030/viewer/2022032606/56812dac550346895d92d547/html5/thumbnails/61.jpg)
Global versus Local Weights

[Table: training examples (x, y) with, for each of models M1 and M2, its prediction, a global weight wg that is constant per model (e.g., 0.3 for M1, 0.7 for M2), and a local weight wl that varies per example.]

Locally weighted scheme:
- The weight of each model is computed per example.
- Weights are determined according to the models' performance on the test set, not the training set.
![Page 62: From Feature Construction, to Simple but Effective Modeling, to Domain Transfer](https://reader030.vdocuments.net/reader030/viewer/2022032606/56812dac550346895d92d547/html5/thumbnails/62.jpg)
Synthetic Example Revisited
Training(have conflicting concepts)
Test
Partially overlapping
M1 M2
M1 M 2
![Page 63: From Feature Construction, to Simple but Effective Modeling, to Domain Transfer](https://reader030.vdocuments.net/reader030/viewer/2022032606/56812dac550346895d92d547/html5/thumbnails/63.jpg)
Optimal Local Weights

For a test example x with two classes C1 and C2, suppose the true distribution is (0.8, 0.2), and two models predict (0.9, 0.1) and (0.4, 0.6). The first model should get the higher weight. The optimal weights are the solution to a regression problem $Hw = f$:

$\begin{pmatrix} 0.9 & 0.4 \\ 0.1 & 0.6 \end{pmatrix} \begin{pmatrix} w_1 \\ w_2 \end{pmatrix} = \begin{pmatrix} 0.8 \\ 0.2 \end{pmatrix}, \qquad \sum_{i=1}^{k} w_i(x) = 1$
![Page 64: From Feature Construction, to Simple but Effective Modeling, to Domain Transfer](https://reader030.vdocuments.net/reader030/viewer/2022032606/56812dac550346895d92d547/html5/thumbnails/64.jpg)
Approximate Optimal Weights

The optimal weights are impossible to get exactly, since the true f is unknown. How to approximate them? M should be assigned a higher weight at x if P(y|M,x) is closer to the true P(y|x).
- If some labeled examples are available in the target domain, use them to compute the weights.
- If none of the examples in the target domain are labeled, we need to make some assumptions about the relationship between feature values and class labels.
![Page 65: From Feature Construction, to Simple but Effective Modeling, to Domain Transfer](https://reader030.vdocuments.net/reader030/viewer/2022032606/56812dac550346895d92d547/html5/thumbnails/65.jpg)
Clustering-Manifold Assumption
Test examples that are closer in feature space are more likely to share the same class label.
![Page 66: From Feature Construction, to Simple but Effective Modeling, to Domain Transfer](https://reader030.vdocuments.net/reader030/viewer/2022032606/56812dac550346895d92d547/html5/thumbnails/66.jpg)
Graph-based Heuristics

Graph-based weights approximation: map the structures of the models (M1, M2) onto the test domain and compare them with the clustering structure to obtain the weight on x.
![Page 67: From Feature Construction, to Simple but Effective Modeling, to Domain Transfer](https://reader030.vdocuments.net/reader030/viewer/2022032606/56812dac550346895d92d547/html5/thumbnails/67.jpg)
Graph-based Heuristics

Local weights calculation: the weight of a model at x is proportional to the similarity between its neighborhood graph and the clustering structure around x; the more similar model gets the higher weight.
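A small sketch of this heuristic under a simplifying assumption of my own: "similarity" is measured as the fraction of neighbor pairs around x on which the model's labeling and the clustering structure agree about same-group membership. The real paper's similarity measure may differ; this only illustrates the idea.

```python
def local_weight(x_neighbors, model_labels, clusters):
    """Unnormalized local weight of one model at x: pairwise agreement
    between 'same predicted label' and 'same cluster' over x's neighbors.

    x_neighbors: ids of test examples near x
    model_labels: id -> label predicted by the model
    clusters: id -> cluster assigned by the clustering structure
    """
    pairs = [(a, b) for i, a in enumerate(x_neighbors)
             for b in x_neighbors[i + 1:]]
    if not pairs:
        return 0.0
    agree = sum(
        (model_labels[a] == model_labels[b]) == (clusters[a] == clusters[b])
        for a, b in pairs)
    return agree / len(pairs)
```

Weights computed this way for each model would then be normalized across models to sum to 1 at x.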
![Page 68: From Feature Construction, to Simple but Effective Modeling, to Domain Transfer](https://reader030.vdocuments.net/reader030/viewer/2022032606/56812dac550346895d92d547/html5/thumbnails/68.jpg)
Local Structure Based Adjustment

Why is adjustment needed? It is possible that no model's structure is similar to the clustering structure at x. This simply means that the training information is conflicting with the true target distribution at x (both M1 and M2 are in error relative to the clustering structure).
![Page 69: From Feature Construction, to Simple but Effective Modeling, to Domain Transfer](https://reader030.vdocuments.net/reader030/viewer/2022032606/56812dac550346895d92d547/html5/thumbnails/69.jpg)
Local Structure Based Adjustment

How to adjust? Check whether the similarity is below a threshold; if so, ignore the training information and propagate the labels of the neighbors in the test set to x.
![Page 70: From Feature Construction, to Simple but Effective Modeling, to Domain Transfer](https://reader030.vdocuments.net/reader030/viewer/2022032606/56812dac550346895d92d547/html5/thumbnails/70.jpg)
Verify the Assumption
Need to check the validity of this assumption; still, P(y|x) is unknown, and the appropriate clustering algorithm must be chosen.

Findings from real data sets: this property is usually determined by the nature of the task. Positive cases: document categorization. Negative cases: sentiment classification. The assumption can be validated on the labeled training set.
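One plausible way to validate the assumption on the training set, where labels are available, is to measure cluster purity: if points grouped by the clustering algorithm usually share a class label, the clustering-manifold assumption holds for this task. This purity check is an illustrative assumption, not the paper's stated procedure.

```python
import numpy as np

def clustering_assumption_score(cluster_labels, true_labels):
    """Average cluster purity on the labeled training set. A score near 1
    means nearby points (same cluster) usually share a class label, so
    graph-based weighting is likely to help; a score near chance level
    suggests the assumption fails (as in sentiment classification)."""
    true_labels = np.asarray(true_labels)
    score, n = 0.0, len(true_labels)
    for c in np.unique(cluster_labels):
        members = true_labels[cluster_labels == c]
        _, counts = np.unique(members, return_counts=True)
        score += counts.max()        # majority-class count in this cluster
    return score / n
```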
![Page 71: From Feature Construction, to Simple but Effective Modeling, to Domain Transfer](https://reader030.vdocuments.net/reader030/viewer/2022032606/56812dac550346895d92d547/html5/thumbnails/71.jpg)
Algorithm
Check Assumption
Neighborhood Graph Construction
Model Weight Computation
Weight Adjustment
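The four steps above can be sketched end-to-end. This is a simplified reading of the framework under assumptions: hard-label majority voting instead of probability averaging, a two-pass scheme for label propagation, and assumed parameters k and delta.

```python
import numpy as np

def lwe_predict(test_X, model_pred_fns, cluster_labels, k=5, delta=0.5):
    """Simplified locally weighted ensemble over one test set (a sketch).

    model_pred_fns: callables returning a label per test example.
    cluster_labels: output of any clustering algorithm run on test_X."""
    n = len(test_X)
    preds = [f(test_X) for f in model_pred_fns]        # each model's labels
    nbrs, weights = [], []
    for i in range(n):
        d = np.linalg.norm(test_X - test_X[i], axis=1)
        nb = np.argsort(d)[1:k + 1]                    # neighborhood graph
        nbrs.append(nb)
        c_edge = cluster_labels[nb] == cluster_labels[i]
        # per-model local weight: neighborhood-graph agreement
        weights.append(np.array([np.mean((p[nb] == p[i]) == c_edge)
                                 for p in preds]))
    # pass 1: weighted majority vote of the models at each point
    ens = np.array([
        max(set(p[i] for p in preds),
            key=lambda y: sum(w for w, p in zip(weights[i], preds)
                              if p[i] == y))
        for i in range(n)])
    # pass 2: where no model's structure fits, propagate neighbor labels
    for i in range(n):
        if weights[i].mean() < delta:
            vals, counts = np.unique(ens[nbrs[i]], return_counts=True)
            ens[i] = vals[np.argmax(counts)]
    return ens
```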
![Page 72: From Feature Construction, to Simple but Effective Modeling, to Domain Transfer](https://reader030.vdocuments.net/reader030/viewer/2022032606/56812dac550346895d92d547/html5/thumbnails/72.jpg)
Data Sets
Different applications:
Synthetic data sets
Spam filtering: public email collection -> personal inboxes (u01, u02, u03) (ECML/PKDD 2006)
Text classification: same top-level classification problems with different sub-fields in the training and test sets (Newsgroup, Reuters)
Intrusion detection: different types of intrusions in the training and test sets
![Page 73: From Feature Construction, to Simple but Effective Modeling, to Domain Transfer](https://reader030.vdocuments.net/reader030/viewer/2022032606/56812dac550346895d92d547/html5/thumbnails/73.jpg)
Baseline Methods

One source domain, single models: Winnow (WNN), Logistic Regression (LR), Support Vector Machine (SVM), Transductive SVM (TSVM)
Multiple source domains: SVM on each of the domains, TSVM on each of the domains
Merge all source domains into one (ALL): SVM, TSVM
Simple averaging ensemble: SMA
Locally weighted ensemble without local structure based adjustment: pLWE
Locally weighted ensemble: LWE

Implementation packages: classification: SNoW, BBR, LibSVM, SVMlight; clustering: CLUTO
![Page 74: From Feature Construction, to Simple but Effective Modeling, to Domain Transfer](https://reader030.vdocuments.net/reader030/viewer/2022032606/56812dac550346895d92d547/html5/thumbnails/74.jpg)
Performance Measure
Prediction quality: 0-1 loss (accuracy) and squared loss (mean squared error)

Area Under the ROC Curve (AUC): the tradeoff between true positive rate and false positive rate; 1 is the ideal value.
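The three measures can be computed directly from the true labels and the predicted positive-class probabilities; the AUC below uses the rank-based formulation (the probability that a random positive is scored above a random negative), which is one standard way to compute it.

```python
import numpy as np

def evaluate(y_true, p_pos):
    """Accuracy, mean squared error, and AUC from binary labels y_true
    and predicted probabilities p_pos = P(y=1|x)."""
    y_true, p = np.asarray(y_true), np.asarray(p_pos)
    acc = float(np.mean((p >= 0.5) == y_true))     # 0-1 loss -> accuracy
    mse = float(np.mean((p - y_true) ** 2))        # squared loss
    # AUC: probability a positive outranks a negative (ties count half)
    pos, neg = p[y_true == 1], p[y_true == 0]
    auc = float(np.mean([(pi > ni) + 0.5 * (pi == ni)
                         for pi in pos for ni in neg]))
    return acc, mse, auc
```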
![Page 75: From Feature Construction, to Simple but Effective Modeling, to Domain Transfer](https://reader030.vdocuments.net/reader030/viewer/2022032606/56812dac550346895d92d547/html5/thumbnails/75.jpg)
A Synthetic Example
[Plot: two training domains with conflicting concepts, and a test domain that only partially overlaps each of them.]
![Page 76: From Feature Construction, to Simple but Effective Modeling, to Domain Transfer](https://reader030.vdocuments.net/reader030/viewer/2022032606/56812dac550346895d92d547/html5/thumbnails/76.jpg)
Experiments on Synthetic Data
![Page 77: From Feature Construction, to Simple but Effective Modeling, to Domain Transfer](https://reader030.vdocuments.net/reader030/viewer/2022032606/56812dac550346895d92d547/html5/thumbnails/77.jpg)
Spam Filtering
Problems: training set drawn from public emails; test sets are personal emails from three users (U00, U01, U02).
[Bar charts: Accuracy and MSE of WNN, LR, SVM, TSVM, SMA, pLWE, and LWE on the three users' inboxes.]
![Page 78: From Feature Construction, to Simple but Effective Modeling, to Domain Transfer](https://reader030.vdocuments.net/reader030/viewer/2022032606/56812dac550346895d92d547/html5/thumbnails/78.jpg)
20 Newsgroup
Tasks: C vs S, R vs T, R vs S, C vs T, C vs R, S vs T
![Page 79: From Feature Construction, to Simple but Effective Modeling, to Domain Transfer](https://reader030.vdocuments.net/reader030/viewer/2022032606/56812dac550346895d92d547/html5/thumbnails/79.jpg)
[Bar charts: Accuracy and MSE of WNN, LR, SVM, TSVM, SMA, pLWE, and LWE on the 20 Newsgroup tasks.]
![Page 80: From Feature Construction, to Simple but Effective Modeling, to Domain Transfer](https://reader030.vdocuments.net/reader030/viewer/2022032606/56812dac550346895d92d547/html5/thumbnails/80.jpg)
Reuters
Problems: Orgs vs People (O vs Pe), Orgs vs Places (O vs Pl), People vs Places (Pe vs Pl)

[Bar charts: Accuracy and MSE of WNN, LR, SVM, TSVM, SMA, pLWE, and LWE on the Reuters tasks.]
![Page 81: From Feature Construction, to Simple but Effective Modeling, to Domain Transfer](https://reader030.vdocuments.net/reader030/viewer/2022032606/56812dac550346895d92d547/html5/thumbnails/81.jpg)
Intrusion Detection
Problems (normal vs intrusions): Normal vs R2L (1), Normal vs Probing (2), Normal vs DOS (3)

Tasks: train on 2 + 1, test on 3 (DOS); train on 3 + 1, test on 2 (Probing); train on 3 + 2, test on 1 (R2L)
![Page 82: From Feature Construction, to Simple but Effective Modeling, to Domain Transfer](https://reader030.vdocuments.net/reader030/viewer/2022032606/56812dac550346895d92d547/html5/thumbnails/82.jpg)
Conclusions

Locally weighted ensemble framework: transfers useful knowledge from multiple source domains.

Graph-based heuristics to compute the weights make the framework practical and effective.

Code and data sets are available for download.
![Page 83: From Feature Construction, to Simple but Effective Modeling, to Domain Transfer](https://reader030.vdocuments.net/reader030/viewer/2022032606/56812dac550346895d92d547/html5/thumbnails/83.jpg)
More information
www.weifan.info or www.cs.columbia.edu/~wfan for code, data sets, and papers