TRANSCRIPT
Memory-Constrained Data Mining
Slobodan Vucetic
Assistant Professor, Department of Computer and Information Sciences
Center for Information Science and Technology, Temple University, Philadelphia
Scientific Data Mining Lab: Dr. Slobodan Vucetic, Assistant Professor, CIS Department, IST Center, Temple University, Philadelphia, USA
Need (see Nature of March 23, 2006): The amount of data in science grows every year. There is a shift from computers supporting scientists to computers playing a central role in the testing, and even the formulation, of scientific hypotheses.
Lab Mission: Developing an interface between data analysis and the applied sciences. Working on collaborative projects at the interface between computer science and other disciplines (sciences, engineering, business). Training students to become computational research scientists.
Research Tasks: Predictive Modeling; Pattern Discovery; Summarization
Spatial and temporal dependency
High dimensional data
Data collection bias
Data and knowledge fusion from multiple sources
Large-scale data
Missing/noisy/unstable attributes …
Scientific Data Mining Lab: Research Challenges
Data Mining: Resource-Constrained Data Mining (NSF)
Earth Science Applications: Estimation of geophysical parameters from satellite data (NSF)
Biomedical Applications: Gene expression data analysis (NIH, PA Dept. of Health); Bioinformatics of protein disorder (PA Dept. of Health); Bioinformatics core facility (PA Dept. of Health); Text mining and information retrieval (NSF); Spatial modeling of disease and infection spread
Spatial and Temporal Knowledge Discovery: Spatial-temporal data reduction (NSF); Analysis of deregulated electricity markets; Analysis of highway traffic data
Scientific Data Mining Lab: Current Projects
Aim: Accurate and efficient estimation of geophysical parameters from MISR and MODIS instruments on Terra satellite and ground based observations (huge data streams)
[Figure: MISR viewing geometry. Nine cameras (Df, Cf, Bf, Af, An, Aa, Ba, Ca, Da) view the Earth surface at angles of 70.5°, 60.0°, 45.6°, and 26.1° (fore and aft) and 0.0° (nadir), along a 2800-km ground track.]
MISR: Multi-angle Imaging Spectro-Radiometer; 9 view angles at the Earth surface; 4 spectral bands; 400-km swath width.
Vucetic, S., Han, B., Mi, W., Li, Z., Obradovic, Z., A Data Mining Approach for the Validation of Aerosol Retrievals, IEEE Geoscience and Remote Sensing Letters, 2008.
Scientific Data Mining Lab: Multiple-Source Spatial-Temporal Data Analysis
Result: several pricing regimes existed in the California market.
Vucetic, S., Obradovic, Z. and Tomsovic, K. (2001) "Price-Load Relationships in California's Electricity Market," IEEE Trans. on Power Systems.
[Figure: hourly time series over the study period, with regime boundaries marked at JAN 1, 98; APR 8, 98; JULY 1, 98; OCT 1, 98; APR 1, 99; OCT 1, 99.]

Regime | Size (hours) | Price prediction (R2), Local | Price prediction (R2), Global | Price Volatility
1 | 5707 | 0.79 | 0.76 | 40
2 | 4630 | 0.81 | 0.75 | 19
3 | 1425 | 0.72 | -0.49 | 9
4 | 1191 | 0.48 | -0.03 | 56

[Figure: Price [$/MWh] versus Load [GWh] scatter plot with the four regimes (1-4) indicated.]
Scientific Data Mining Lab: Temporal Data Mining
Aim: analyze price vs. load dependencies by discovering semi-stationary segments in multivariate time series
When a topic is difficult to express as a query, often no relevant articles are found by keyword search, or too many irrelevant articles are returned.
Biomedical example: "Apurinic/apyrimidinic endonuclease" returns 638 citations from PubMed; "Apurinic/apyrimidinic endonuclease disorder" returns 1 citation (irrelevant).
Result: a large lift of relevant retrievals in the top 10.
Han, B., Obradovic, Z., Hu, Z.Z., Wu, C.H. and Vucetic, S. (2006) "Substring Selection for Biomedical Document Classification," Bioinformatics.
Scientific Data Mining Lab: Text Mining: Re-Ranking of Articles Retrieved by a Search Engine
Scientific Data Mining Lab: Collaborative Filtering
Aim: Predict preferences of an active customer given his/her preferences on some items and a database of preferences of other customers
Result: the regression-based collaborative filtering algorithm is superior to the neighbor-based approach: it is two orders of magnitude faster in on-line prediction, more accurate, and more robust to a small number of observed votes.
Vucetic, S., Obradovic, Z., Collaborative Filtering Using a Regression-Based Approach, Knowledge and Information Systems, Vol. 7, No. 1, pp. 1-22, 2005.
Aim: Understanding protein disorder and its functions
Results:
• Protein disorder is very common (contrary to a 20th-century belief)
• The fraction of disorder varies greatly across genomes
• Different types of disorder exist in proteins
• Disorder is involved in many important functions
Vucetic, S., Brown, C., Dunker, A.K. and Obradovic, Z., Flavors of Protein Disorder, Proteins: Structure, Function and Genetics, Vol. 52, pp. 573-584, 2003.
[Figure: protein structure; Kissinger et al., 1995]
Scientific Data Mining Lab: Bioinformatics: Protein Disorder Analysis
Scientific Data Mining Lab: Analysis of Highway Traffic Data
Aim: understand traffic patterns, predict traffic congestion and delays
In progress…
Scientific Data Mining Lab: Spatio-Temporal Disease Modelling
Aim: predict infection or disease risk, given the information about population movement
[Figure 1: Illustration of location clusters and the associated risks; infection risk shown as a function of location type and activity type.]
Result: movement information is very useful in predicting infection risk.
Vucetic, S., Sun, H., Aggregation of Location Attributes for Prediction of Infection Risk, Workshop on Spatial Data Mining: Consolidation and Renewed Bearing, SDM, Bethesda, MD, 2006.
Scientific Data Mining Lab: Resource-Constrained Data Mining
Aim: Efficient knowledge discovery from large data by limited-capacity computing devices
Approach: Integration of data mining and data compression
Figure 1. left) Noisy checkerboard data: the goal is to discriminate between black and yellow dots, and the achievable accuracy is 90%; middle) 100 randomly selected examples and the trained prediction model, which has 76% accuracy; right) 100 examples selected by the reservoir algorithm and the trained prediction model, which has 88% accuracy.
Resource-Constrained Data Mining: Motivation
Data mining objective: Efficient and accurate algorithms for learning from large data
Performance measures: Accuracy Scaling with data size (# examples, #attributes)
Mainstream data mining: many accurate learning algorithms scale linearly or even sub-linearly with data size and dimension, in both runtime and space.
Caveat: linear space scaling is often not sufficient, since it implies unbounded growth in memory with data size.
Challenge: how to learn from large, or practically infinite, data sets/streams using limited memory resources.
Resource-Constrained Data Mining: Learning Scenario
Examples are observed sequentially, in a single pass.
Data stream examples are independent and identically distributed (IID).
A summary of the data can be stored in a reservoir of fixed memory.
Resource-Constrained Data Mining: Approaches
Model-Free: Reservoir Approach. Maintain a random sample of size R from the data stream: add xt with probability min(1, R/t) and remove a randomly chosen element. Caveat: random sampling is often not optimal.
Data-Free: Online Algorithms. Update the model as examples are observed. Perceptron: wt+1 = wt + (yt − f(xt))xt, where f(x) = wTx. Caveat: sensitive to data ordering.
Hybrid: Data + Model. Done implicitly with Support Vector Machines (SVMs).
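The model-free reservoir approach above can be sketched in a few lines (a minimal illustration; the function name and default seed are mine, not from the talk):

```python
import random

def reservoir_sample(stream, R, rng=None):
    """Maintain a uniform random sample of size R from a data stream:
    admit example x_t (t = 1, 2, ...) with probability min(1, R/t) and,
    when the reservoir is full, evict a uniformly random element."""
    rng = rng or random.Random(0)  # fixed seed for reproducibility (my choice)
    reservoir = []
    for t, x in enumerate(stream, start=1):
        if len(reservoir) < R:
            reservoir.append(x)               # reservoir not full: always admit
        elif rng.random() < R / t:            # admit with probability R/t
            reservoir[rng.randrange(R)] = x   # evict a random element
    return reservoir
```

A single pass over the stream suffices, and memory stays bounded by R regardless of stream length, which is exactly the fixed-memory property the learning scenario requires.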
Resource-Constrained Data Mining: Objective
Develop a memory-constrained SVM algorithm
What is an SVM? A popular data mining algorithm for classification; the most accurate on many problems; theoretically and practically appealing; computationally expensive:
Cubic training time cost O(N³) (e.g., neural nets are O(N)); quadratic training memory cost O(N²) (e.g., neural nets are O(N)); linear prediction cost O(N) (e.g., neural nets are O(1)).
Resource-Constrained Data Mining: SVM Overview
Goal: Use x1 and x2 to predict the class y ∈ {−1, +1}.
Assume a linear prediction function f(x) = w1x1 + w2x2 + b; sign(f(x)) is the final prediction.
Challenge: Which is better, f1(x) or f2(x)? What is the best choice for f(x)?
Answer: The best f(x) has the most wiggle room: it has the largest margin.
[Figure: two candidate separating lines, f1(x) and f2(x), in the (x1, x2) plane.]
Resource-Constrained Data Mining: SVM Overview
Maximizing the margin is equivalent to:
minimize ||w||²
such that yi f(xi) ≥ 1
What if the data are noisy?
minimize ||w||² + C Σi ξi
such that yi f(xi) ≥ 1 − ξi, ξi ≥ 0
What if the problem is nonlinear? Map X → Φ(X).
Resource-Constrained Data Mining: SVM Overview
Standard approach: convert the primal problem
minimize ||w||² + C Σi ξi
such that yi f(xi) ≥ 1 − ξi, ξi ≥ 0
to its dual:
min over 0 ≤ αi ≤ C of W = ½ Σi Σj αi Qij αj − Σi αi + b Σi yi αi
where Qij = yi yj Φ(xi)·Φ(xj) = yi yj K(xi, xj), and K is the kernel function.
Gaussian kernel: K(xi, xj) = exp(−||xi − xj||²/A)
The αi are Lagrange multipliers.
The optimization becomes a Quadratic Programming problem (minimizing a convex function under linear constraints).
The optimal solution is found in O(N³) time and O(N²) space.
SVM predictor: f(x) = Σi=1:N αi yi K(xi, x) + b
To predict the class of example x, it is compared with all training examples having αi > 0.
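The predictor formula above translates directly into code. This is a minimal sketch (function and parameter names are mine), assuming the multipliers αi, labels yi, support vectors xi, and bias b come from an already-trained SVM:

```python
import numpy as np

def gaussian_kernel(xi, xj, A=1.0):
    """K(xi, xj) = exp(-||xi - xj||^2 / A)."""
    return float(np.exp(-np.sum((np.asarray(xi) - np.asarray(xj)) ** 2) / A))

def svm_predict(x, sv_x, sv_y, alpha, b, A=1.0):
    """f(x) = sum_i alpha_i * y_i * K(x_i, x) + b; the class is sign(f(x))."""
    f = b + sum(a * y * gaussian_kernel(xi, x, A)
                for a, y, xi in zip(alpha, sv_y, sv_x))
    return 1 if f >= 0 else -1
```

Note that the sum runs only over examples with αi > 0, which is why prediction cost is linear in the number of support vectors.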
Resource-Constrained Data Mining: SVM Overview
f(x) = Σi=1:N αi yi K(xi, x) + b
[Figure: decision boundary with margin lines f(x) = −1 and f(x) = +1; examples are reserve vectors (αi = 0), support vectors (0 < αi < C), or error vectors (αi = C).]
Resource-Constrained Data Mining: Incremental-Decremental SVM
The standard SVM solution is "batch," meaning that all training data must be available for learning.
The alternative is an "online" SVM that can be updated when new training data become available.
Incremental-Decremental SVM [Cauwenberghs, Poggio, 2000]: for each new example, the update takes
O(Ns²) time, where Ns is the number of support vectors (0 < αi < C), and
O(NsN) memory; considering Ns = O(N), memory is O(N²).
The total cost for online training on N examples is O(N³) time and O(N²) memory, the same as for batch mode.
Resource-Constrained Data Mining: Memory-Constrained IDSVM
Idea: Modify IDSVM by upper-bounding the number of support vectors.
How: Twin Vector Machine (TVM). Define a budget B and a set of pivot vectors q1…qB, and quantize each example to its nearest pivot:
Q(x) = {qk, k = arg minj=1:B ||x − qj||}
D = {(xi, yi), i = 1…N} → Q(D) = {(Q(xi), yi), i = 1…N}
Training an SVM on Q(D) is equivalent to training an SVM on the Twin Vector Set TV = {TVj, j = 1…B}, where each Twin Vector is TVj = {(qj, +1, nj⁺), (qj, −1, nj⁻)}.
Costs: O(N³) → O(B³) (constant) time; O(N²) → O(B²) (constant) memory.
The original problem
minimize ||w||² + C Σi ξi
such that yi f(xi) ≥ 1 − ξi, ξi ≥ 0, i = 1…N
becomes
minimize ||w||² + C Σj (nj⁺ ξj⁺ + nj⁻ ξj⁻)
such that f(qj) ≥ 1 − ξj⁺, −f(qj) ≥ 1 − ξj⁻, ξj⁺, ξj⁻ ≥ 0, j = 1…B
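The quantization step Q(x) and the per-pivot counts nj⁺, nj⁻ that define a Twin Vector can be sketched as follows (a minimal illustration with names of my choosing, not the authors' implementation):

```python
import numpy as np

def quantize(x, pivots):
    """Q(x): index k of the nearest pivot, k = argmin_j ||x - q_j||."""
    return int(np.argmin(np.linalg.norm(pivots - x, axis=1)))

def twin_vector_counts(X, y, pivots):
    """Per-pivot counts (n_j^+, n_j^-) of positive and negative examples
    quantized to each pivot, i.e. the weights of each Twin Vector."""
    B = len(pivots)
    n_pos = np.zeros(B, dtype=int)
    n_neg = np.zeros(B, dtype=int)
    for xi, yi in zip(X, y):
        k = quantize(xi, pivots)
        if yi > 0:
            n_pos[k] += 1
        else:
            n_neg[k] += 1
    return n_pos, n_neg
```

Because only the B pivots and their counts are kept, the training set handed to the SVM has constant size regardless of N, which is what drops the costs to O(B³) time and O(B²) memory.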
Resource-Constrained Data Mining: Online TVM
Online-TVM
Input: data stream D = {(xi, yi), i = 1…N}, budget B, kernel function K, slack parameter C
Output: TVM with parameters α1⁺, α1⁻, …, αB⁺, αB⁻, and b
1. Initialize TVM = 0, TV = ∅
2. for i = 1 to N
3.   if Beneficial(xi)
4.     Update-TV
5.     Update-TVM
Resource-Constrained Data Mining: Online TVM
Beneficial
1. if size(TV) < B or |f(xi)| ≤ m1
2.   return 1
3. else
4.   return 0
[Figure: levels f(x) = −1, 0, +1 with a buffer of width m1 around the margins; examples falling within the buffer are considered beneficial.]
Resource-Constrained Data Mining: Online TVM
Update-TV
s = size(TV); TVB+1 = {(xi, yi, 1), (qi, −yi, 0)}
if s < B
  TVs+1 = TVB+1
elseif maxi=1:B |f(qi)| > m2
  k = arg maxi=1:B |f(qi)|; TVk = TVB+1
else
  find the best pair TVi, TVj to merge; use (**) to calculate qnew
  TVi = {(qnew, +1, si⁺ + sj⁺), (qnew, −1, si⁻ + sj⁻)}
  TVj = TVB+1
(**) qnew = ((si⁺ + si⁻) qi + (sj⁺ + sj⁻) qj) / (si⁺ + si⁻ + sj⁺ + sj⁻)
[Figure: levels f(x) = −1, 0, +1 with a buffer of width m2 around the margins.]
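The merge formula (**) is just a count-weighted average of the two pivots being merged. A minimal sketch (function and parameter names are mine):

```python
import numpy as np

def merge_pivots(qi, qj, si_pos, si_neg, sj_pos, sj_neg):
    """(**): q_new = ((s_i+ + s_i-) q_i + (s_j+ + s_j-) q_j)
                     / (s_i+ + s_i- + s_j+ + s_j-)."""
    wi = si_pos + si_neg  # total examples quantized to q_i
    wj = sj_pos + sj_neg  # total examples quantized to q_j
    return (wi * np.asarray(qi) + wj * np.asarray(qj)) / (wi + wj)
```

Weighting by the counts keeps the merged pivot at the centroid of all examples the two twin vectors represent, so the quantized data set is distorted as little as possible by the merge.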
Resource-Constrained Data Mining: Online TVM
Merging heuristics:
Nearest versus Weighted
Global versus One-Sided
Rejection merging
[Figure: illustration of GlobalMerge versus OneSideMerge relative to the f(x) = −1, 0, +1 levels.]
Resource-Constrained Data Mining: Results
[Figure: TVM decision boundaries on the noisy checkerboard data after 100, 400, and 10000 stream examples; budget B = 100.]
Resource-Constrained Data Mining: Results
[Figure: Checkerboard (noisy), budget B = 100. Left: accuracy versus length of data stream (log scale) for TVM, IDSVM, LIBSVM, and Random Sampling. Right: CPU time (in seconds) versus length of data stream for TVM and IDSVM.]
Resource-Constrained Data Mining: Results
[Figure: Checkerboard (noisy). Left: accuracy versus length of data stream (log scale) for TVM with budgets 50, 100, and 200. Right: CPU time (in seconds) versus length of data stream for the same three budgets.]
Resource-Constrained Data Mining: Results
[Figure: budget B = 100. Left: accuracy versus length of data stream on Checkerboard (noisy), with and without buffer. Right: accuracy versus length of data stream (log scale) on Adult, OneSideMerge versus GlobalMerge.]
Resource-Constrained Data Mining: Results
[Figure: accuracy versus length of data stream (log scale) for TVM, IDSVM, LIBSVM, and Random Sampling on the Adult, Banana, Checkerboard, Gauss, IJCNN, and Pendigits data sets.]
Resource-Constrained Data Mining: Conclusions
The Memory-Constrained SVM is successful: significantly higher accuracy than the baseline, and close to the optimal approach.
Merging heuristics are very important.
Future work:
Further improvements: forgetting, probabilistic merging
Use of data compression
Non-IID streams
Thank You!
More information: http://www.ist.temple.edu/~vucetic/
Collaboration/assistantship contact: Slobodan Vucetic CIS Department, IST Center, Temple University [email protected]