TRANSCRIPT
Memory-Constrained Data Mining
Slobodan Vucetic
Assistant Professor, Department of Computer and Information Sciences
Center for Information Science and Technology, Temple University, Philadelphia
Scientific Data Mining Lab: Dr. Slobodan Vucetic, Assistant Professor, CIS Department, IST Center, Temple University, Philadelphia, USA
Need (see Nature of March 23, 2006): The amount of data in science grows every year. There is a shift from computers supporting scientists to computers playing a central role in the testing, and even the formulation, of scientific hypotheses.
Lab Mission: Developing an interface between data analysis and the applied sciences. Working on collaborative projects at the interface between computer science and other disciplines (sciences, engineering, business). Training students to become computational research scientists.
Research Tasks: Predictive Modeling; Pattern Discovery; Summarization
Spatial and temporal dependency
High dimensional data
Data collection bias
Data and knowledge fusion from multiple sources
Large-scale data
Missing/noisy/unstable attributes …
Scientific Data Mining Lab: Research Challenges
Data Mining: Resource-Constrained Data Mining (NSF)
Earth Science Applications: Estimation of geophysical parameters from satellite data (NSF)
Biomedical Applications: Gene expression data analysis (NIH, PA Dept. of Health); Bioinformatics of protein disorder (PA Dept. of Health); Bioinformatics core facility (PA Dept. of Health); Text mining and information retrieval (NSF); Spatial modeling of disease and infection spread
Spatial and Temporal Knowledge Discovery: Spatial-temporal data reduction (NSF); Analysis of deregulated electricity markets; Analysis of highway traffic data
Scientific Data Mining Lab: Current Projects
Aim: Accurate and efficient estimation of geophysical parameters from MISR and MODIS instruments on Terra satellite and ground based observations (huge data streams)
[Figure: MISR viewing geometry. Nine cameras (Df, Cf, Bf, Af, An, Aa, Ba, Ca, Da) view the Earth surface at angles of 70.5°, 60.0°, 45.6°, and 26.1° (fore and aft) and 0.0° (nadir), along a 2800-km ground track.]
MISR: Multi-angle Imaging Spectro-Radiometer; 9 view angles at the Earth surface; 4 spectral bands; 400-km swath width.
Vucetic, S., Han, B., Mi, W., Li, Z., Obradovic, Z., A Data Mining Approach for the Validation of Aerosol Retrievals, IEEE Geoscience and Remote Sensing Letters, 2008.
Scientific Data Mining Lab: Multiple-Source Spatial-Temporal Data Analysis
Result: several pricing regimes existed in the California market.
Vucetic, S., Obradovic, Z. and Tomsovic, K. (2001) "Price-Load Relationships in California's Electricity Market," IEEE Trans. on Power Systems.
[Figure: hourly time series over the study period, with regime boundaries marked at JAN 1, 98; APR 8, 98; JULY 1, 98; OCT 1, 98; APR 1, 99; OCT 1, 99.]

Regime | Size (hours) | Price prediction (R2), Local | Price prediction (R2), Global | Price Volatility
1 | 5707 | 0.79 | 0.76 | 40
2 | 4630 | 0.81 | 0.75 | 19
3 | 1425 | 0.72 | -0.49 | 9
4 | 1191 | 0.48 | -0.03 | 56

[Figure: Price [$/MWh] versus Load [GWh] scatter plot with the four regimes (1-4) indicated.]
Scientific Data Mining Lab: Temporal Data Mining
Aim: analyze price vs. load dependencies by discovering semi-stationary segments in multivariate time series
When a topic is difficult to express as a query, often no relevant articles are found by keyword search, or too many irrelevant articles are returned.
Biomedical example: "Apurinic/apyrimidinic endonuclease" returns 638 citations from PubMed; "Apurinic/apyrimidinic endonuclease disorder" returns 1 citation (irrelevant).
Result: a large lift of relevant retrievals in the top 10.
Han, B., Obradovic, Z., Hu, Z.Z., Wu, C.H. and Vucetic, S. (2006) "Substring Selection for Biomedical Document Classification," Bioinformatics.
Scientific Data Mining Lab: Text Mining: Re-Ranking of Articles Retrieved by a Search Engine
Scientific Data Mining Lab: Collaborative Filtering
Aim: Predict preferences of an active customer given his/her preferences on some items and a database of preferences of other customers
Result: the regression-based collaborative filtering algorithm is superior to the neighbor-based approach: it is two orders of magnitude faster in on-line prediction, more accurate, and more robust to a small number of observed votes.
Vucetic, S., Obradovic, Z., Collaborative Filtering Using a Regression-Based Approach, Knowledge and Information Systems, Vol. 7, No. 1, pp. 1-22, 2005.
Aim: Understanding protein disorder and its functions
Results:
• Protein disorder is very common (contrary to a 20th-century belief)
• The fraction of disorder varies greatly across genomes
• Different types of disorder exist in proteins
• Disorder is involved in many important functions
Vucetic, S., Brown, C., Dunker, A.K. and Obradovic, Z., Flavors of Protein Disorder, Proteins: Structure, Function and Genetics, Vol. 52, pp. 573-584, 2003.
[Figure: protein structure; Kissinger et al., 1995]
Scientific Data Mining Lab: Bioinformatics: Protein Disorder Analysis
Scientific Data Mining Lab: Analysis of Highway Traffic Data
Aim: understand traffic patterns, predict traffic congestion and delays
In progress…
Scientific Data Mining Lab: Spatio-Temporal Disease Modelling
Aim: predict infection or disease risk, given the information about population movement
[Figure 1: Illustration of location clusters and the associated risks; infection risk shown as a function of location type and activity type.]
Result: movement information is very useful in predicting infection risk.
Vucetic, S., Sun, H., Aggregation of Location Attributes for Prediction of Infection Risk, Workshop on Spatial Data Mining: Consolidation and Renewed Bearing, SDM, Bethesda, MD, 2006.
Scientific Data Mining Lab: Resource-Constrained Data Mining
Aim: Efficient knowledge discovery from large data by limited-capacity computing devices
Approach: Integration of data mining and data compression
Figure 1. left) Noisy checkerboard data: the goal is to discriminate between black and yellow dots, and the achievable accuracy is 90%; middle) 100 randomly selected examples and the trained prediction model, which has 76% accuracy; right) 100 examples selected by the reservoir algorithm and the trained prediction model, which has 88% accuracy.
Resource-Constrained Data Mining: Motivation
Data mining objective: Efficient and accurate algorithms for learning from large data
Performance measures: Accuracy Scaling with data size (# examples, #attributes)
Mainstream data mining: many accurate learning algorithms scale linearly or even sub-linearly with data size and dimension, in both runtime and space.
Caveat: linear space scaling is often not sufficient, since it implies unbounded growth in memory with data size.
Challenge: how to learn from large, or practically infinite, data sets/streams using limited memory resources.
Resource-Constrained Data Mining: Learning Scenario
Examples are observed sequentially, in a single pass.
Data stream examples are independent and identically distributed (IID).
A summary of the data can be stored in a reservoir of fixed memory.
Resource-Constrained Data Mining: Approaches
Model-Free: Reservoir Approach. Maintain a random sample of size R from the data stream: add xt with probability min(1, R/t) and remove a randomly chosen element. Caveat: random sampling is often not optimal.
Data-Free: Online Algorithms. Update the model as examples are observed. Perceptron: wt+1 = wt + (yt − f(xt))xt, where f(x) = wTx. Caveat: sensitive to data ordering.
Hybrid: Data + Model. Done implicitly with Support Vector Machines (SVMs).
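The model-free reservoir approach above can be sketched in a few lines (a minimal illustration; the function name and default seed are mine, not from the talk):

```python
import random

def reservoir_sample(stream, R, rng=None):
    """Maintain a uniform random sample of size R from a data stream:
    admit example x_t (t = 1, 2, ...) with probability min(1, R/t) and,
    when the reservoir is full, evict a uniformly random element."""
    rng = rng or random.Random(0)  # fixed seed for reproducibility (my choice)
    reservoir = []
    for t, x in enumerate(stream, start=1):
        if len(reservoir) < R:
            reservoir.append(x)               # reservoir not full: always admit
        elif rng.random() < R / t:            # admit with probability R/t
            reservoir[rng.randrange(R)] = x   # evict a random element
    return reservoir
```

A single pass over the stream suffices, and memory stays bounded by R regardless of stream length, which is exactly the fixed-memory property the learning scenario requires.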
Resource-Constrained Data Mining: Objective
Develop a memory-constrained SVM algorithm
What is an SVM? A popular data mining algorithm for classification; the most accurate on many problems; theoretically and practically appealing; computationally expensive:
Cubic training time cost O(N³) (e.g., neural nets are O(N)); quadratic training memory cost O(N²) (e.g., neural nets are O(N)); linear prediction cost O(N) (e.g., neural nets are O(1)).
Resource-Constrained Data Mining: SVM Overview
Goal: Use x1 and x2 to predict the class y ∈ {−1, +1}.
Assume a linear prediction function f(x) = w1x1 + w2x2 + b; sign(f(x)) is the final prediction.
Challenge: Which is better, f1(x) or f2(x)? What is the best choice for f(x)?
Answer: The best f(x) has the most wiggle room: it has the largest margin.
[Figure: two candidate separating lines, f1(x) and f2(x), in the (x1, x2) plane.]
Resource-Constrained Data Mining: SVM Overview
Maximizing the margin is equivalent to:
minimize ||w||²
such that yi f(xi) ≥ 1
What if the data are noisy?
minimize ||w||² + C Σi ξi
such that yi f(xi) ≥ 1 − ξi, ξi ≥ 0
What if the problem is nonlinear? Map X → Φ(X).
Resource-Constrained Data Mining: SVM Overview
Standard approach: convert the primal problem
minimize ||w||² + C Σi ξi
such that yi f(xi) ≥ 1 − ξi, ξi ≥ 0
to its dual:
min over 0 ≤ αi ≤ C of W = ½ Σi Σj αi Qij αj − Σi αi + b Σi yi αi
where Qij = yi yj Φ(xi)·Φ(xj) = yi yj K(xi, xj), and K is the kernel function.
Gaussian kernel: K(xi, xj) = exp(−||xi − xj||²/A)
The αi are Lagrange multipliers.
The optimization becomes a Quadratic Programming problem (minimizing a convex function under linear constraints).
The optimal solution is found in O(N³) time and O(N²) space.
SVM predictor: f(x) = Σi=1:N αi yi K(xi, x) + b
To predict the class of example x, it is compared with all training examples having αi > 0.
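The predictor formula above translates directly into code. This is a minimal sketch (function and parameter names are mine), assuming the multipliers αi, labels yi, support vectors xi, and bias b come from an already-trained SVM:

```python
import numpy as np

def gaussian_kernel(xi, xj, A=1.0):
    """K(xi, xj) = exp(-||xi - xj||^2 / A)."""
    return float(np.exp(-np.sum((np.asarray(xi) - np.asarray(xj)) ** 2) / A))

def svm_predict(x, sv_x, sv_y, alpha, b, A=1.0):
    """f(x) = sum_i alpha_i * y_i * K(x_i, x) + b; the class is sign(f(x))."""
    f = b + sum(a * y * gaussian_kernel(xi, x, A)
                for a, y, xi in zip(alpha, sv_y, sv_x))
    return 1 if f >= 0 else -1
```

Note that the sum runs only over examples with αi > 0, which is why prediction cost is linear in the number of support vectors.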
Resource-Constrained Data Mining: SVM Overview
f(x) = Σi=1:N αi yi K(xi, x) + b
[Figure: decision boundary with margin lines f(x) = −1 and f(x) = +1; examples are reserve vectors (αi = 0), support vectors (0 < αi < C), or error vectors (αi = C).]
Resource-Constrained Data Mining: Incremental-Decremental SVM
The standard SVM solution is "batch," meaning that all training data must be available for learning.
The alternative is an "online" SVM that can be updated when new training data become available.
Incremental-Decremental SVM [Cauwenberghs, Poggio, 2000]: for each new example, the update takes
O(Ns²) time, where Ns is the number of support vectors (0 < αi < C), and
O(NsN) memory; considering Ns = O(N), memory is O(N²).
The total cost for online training on N examples is O(N³) time and O(N²) memory, the same as for batch mode.
Resource-Constrained Data Mining: Memory-Constrained IDSVM
Idea: Modify IDSVM by upper-bounding the number of support vectors.
How: Twin Vector Machine (TVM). Define a budget B and a set of pivot vectors q1…qB, and quantize each example to its nearest pivot:
Q(x) = {qk, k = arg minj=1:B ||x − qj||}
D = {(xi, yi), i = 1…N} → Q(D) = {(Q(xi), yi), i = 1…N}
Training an SVM on Q(D) is equivalent to training an SVM on the Twin Vector Set TV = {TVj, j = 1…B}, where each Twin Vector is TVj = {(qj, +1, nj⁺), (qj, −1, nj⁻)}.
Costs: O(N³) → O(B³) (constant) time; O(N²) → O(B²) (constant) memory.
The original problem
minimize ||w||² + C Σi ξi
such that yi f(xi) ≥ 1 − ξi, ξi ≥ 0, i = 1…N
becomes
minimize ||w||² + C Σj (nj⁺ ξj⁺ + nj⁻ ξj⁻)
such that f(qj) ≥ 1 − ξj⁺, −f(qj) ≥ 1 − ξj⁻, ξj⁺, ξj⁻ ≥ 0, j = 1…B
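The quantization step Q(x) and the per-pivot counts nj⁺, nj⁻ that define a Twin Vector can be sketched as follows (a minimal illustration with names of my choosing, not the authors' implementation):

```python
import numpy as np

def quantize(x, pivots):
    """Q(x): index k of the nearest pivot, k = argmin_j ||x - q_j||."""
    return int(np.argmin(np.linalg.norm(pivots - x, axis=1)))

def twin_vector_counts(X, y, pivots):
    """Per-pivot counts (n_j^+, n_j^-) of positive and negative examples
    quantized to each pivot, i.e. the weights of each Twin Vector."""
    B = len(pivots)
    n_pos = np.zeros(B, dtype=int)
    n_neg = np.zeros(B, dtype=int)
    for xi, yi in zip(X, y):
        k = quantize(xi, pivots)
        if yi > 0:
            n_pos[k] += 1
        else:
            n_neg[k] += 1
    return n_pos, n_neg
```

Because only the B pivots and their counts are kept, the training set handed to the SVM has constant size regardless of N, which is what drops the costs to O(B³) time and O(B²) memory.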
Resource-Constrained Data Mining: Online TVM
Online-TVM
Input: data stream D = {(xi, yi), i = 1…N}, budget B, kernel function K, slack parameter C
Output: TVM with parameters α1⁺, α1⁻, …, αB⁺, αB⁻, and b
1. Initialize TVM = 0, TV = ∅
2. for i = 1 to N
3.   if Beneficial(xi)
4.     Update-TV
5.     Update-TVM
Resource-Constrained Data Mining: Online TVM
Beneficial
1. if size(TV) < B or |f(xi)| ≤ m1
2.   return 1
3. else
4.   return 0
[Figure: levels f(x) = −1, 0, +1 with a buffer of width m1 around the margins; examples falling within the buffer are considered beneficial.]
Resource-Constrained Data Mining: Online TVM
Update-TV
s = size(TV); TVB+1 = {(xi, yi, 1), (qi, −yi, 0)}
if s < B
  TVs+1 = TVB+1
elseif maxi=1:B |f(qi)| > m2
  k = arg maxi=1:B |f(qi)|; TVk = TVB+1
else
  find the best pair TVi, TVj to merge; use (**) to calculate qnew
  TVi = {(qnew, +1, si⁺ + sj⁺), (qnew, −1, si⁻ + sj⁻)}
  TVj = TVB+1
(**) qnew = ((si⁺ + si⁻) qi + (sj⁺ + sj⁻) qj) / (si⁺ + si⁻ + sj⁺ + sj⁻)
[Figure: levels f(x) = −1, 0, +1 with a buffer of width m2 around the margins.]
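The merge formula (**) is just a count-weighted average of the two pivots being merged. A minimal sketch (function and parameter names are mine):

```python
import numpy as np

def merge_pivots(qi, qj, si_pos, si_neg, sj_pos, sj_neg):
    """(**): q_new = ((s_i+ + s_i-) q_i + (s_j+ + s_j-) q_j)
                     / (s_i+ + s_i- + s_j+ + s_j-)."""
    wi = si_pos + si_neg  # total examples quantized to q_i
    wj = sj_pos + sj_neg  # total examples quantized to q_j
    return (wi * np.asarray(qi) + wj * np.asarray(qj)) / (wi + wj)
```

Weighting by the counts keeps the merged pivot at the centroid of all examples the two twin vectors represent, so the quantized data set is distorted as little as possible by the merge.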
Resource-Constrained Data Mining: Online TVM
Merging heuristics:
Nearest versus Weighted
Global versus One-Sided
Rejection merging
[Figure: illustration of GlobalMerge versus OneSideMerge relative to the f(x) = −1, 0, +1 levels.]
Resource-Constrained Data Mining: Results
[Figure: TVM decision boundaries on the noisy checkerboard data after 100, 400, and 10000 stream examples; budget B = 100.]
Resource-Constrained Data Mining: Results
[Figure: Checkerboard (noisy), budget B = 100. Left: accuracy versus length of data stream (log scale) for TVM, IDSVM, LIBSVM, and Random Sampling. Right: CPU time (in seconds) versus length of data stream for TVM and IDSVM.]
Resource-Constrained Data Mining: Results
[Figure: Checkerboard (noisy). Left: accuracy versus length of data stream (log scale) for TVM with budgets 50, 100, and 200. Right: CPU time (in seconds) versus length of data stream for the same three budgets.]
Resource-Constrained Data Mining: Results
[Figure: budget B = 100. Left: accuracy versus length of data stream on Checkerboard (noisy), with and without buffer. Right: accuracy versus length of data stream (log scale) on Adult, OneSideMerge versus GlobalMerge.]
Resource-Constrained Data Mining: Results
[Figure: accuracy versus length of data stream (log scale) for TVM, IDSVM, LIBSVM, and Random Sampling on the Adult, Banana, Checkerboard, Gauss, IJCNN, and Pendigits data sets.]
Resource-Constrained Data Mining: Conclusions
The Memory-Constrained SVM is successful: significantly higher accuracy than the baseline, and close to the optimal approach.
Merging heuristics are very important.
Future work:
Further improvements: forgetting, probabilistic merging
Use of data compression
Non-IID streams
Thank You!
More information: http://www.ist.temple.edu/~vucetic/
Collaboration/assistantship contact: Slobodan Vucetic CIS Department, IST Center, Temple University [email protected]