
Effective and Unsupervised Fractal-based Feature Selection for Very Large Datasets
Removing linear and non-linear attribute correlations

Antonio Canabrava Fraideinberze
Jose F. Rodrigues-Jr
Robson Leonardo Ferreira Cordeiro
Databases and Images Group
University of São Paulo
São Carlos - SP - Brazil

Terabytes? How to analyze that data?

Parallel processing and dimensionality reduction, for sure...
...but how to remove linear and non-linear attribute correlations, besides irrelevant attributes?
...and how to reduce dimensionality without human supervision, independently of the analysis task?

Our answer: Curl-Remover, targeting medium-dimensionality Big Data.

Agenda

Fundamental Concepts

Related Work

Proposed Method

Evaluation

Conclusion


Fundamental Concepts
Fractal Theory

[Figures: examples of self-similar fractals]

Fundamental Concepts
Fractal Theory
Embedded, Intrinsic and Fractal Correlation Dimension

Fractal Correlation Dimension ≅ Intrinsic Dimension. The embedded dimension E is the number of attributes in the dataset; the intrinsic dimension is the dimensionality of the object that the points actually span, regardless of the space they are embedded in.

[Figures: one dataset with embedded dimension ≅ 3 and intrinsic dimension ≅ 1; another with embedded dimension ≅ 3 and intrinsic dimension ≅ 2]

Fundamental Concepts
Fractal Theory
Fractal Correlation Dimension - Box Counting

[Figures: the space is divided into grid cells of side r at several resolutions; for each r, the squared cell counts are summed and log(sum2) is plotted against log(r). D2 is the slope of the linear part of the plot.]
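The slope these plots estimate is the correlation fractal dimension; in the usual box-counting notation (as in Traina Jr. et al., 2000) it reads

    D_2 \;=\; \frac{\partial \,\log \sum_i C_{r,i}^2}{\partial \,\log r},
    \qquad r \in [r_1, r_2]

where C_{r,i} is the number of points inside the i-th grid cell of side r, and [r1, r2] is the range of scales over which the plot is linear.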

The cell counts at all resolutions are obtained with a multidimensional quad-tree [Traina Jr. et al., 2000].
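A minimal serial sketch of this estimator, in Python with NumPy. It is illustrative only: the function name and the choice to fit a line over all levels at once are my assumptions, and the paper's version is distributed and quad-tree-based.

import numpy as np

def correlation_dimension(points, levels=8):
    """Estimate the correlation fractal dimension D2 by box counting.

    For each grid resolution r, count the points falling in each cell
    and record log(r) and log(sum of squared counts); D2 is the slope
    of the resulting line on the log-log plot.
    """
    # Normalize every attribute to [0, 1) so one grid fits all of them.
    mins, maxs = points.min(axis=0), points.max(axis=0)
    data = (points - mins) / np.where(maxs > mins, maxs - mins, 1)
    data = np.clip(data, 0, 1 - 1e-12)

    log_r, log_sum2 = [], []
    for level in range(1, levels + 1):
        side = 2 ** level                  # cells per axis
        r = 1.0 / side                     # cell side length
        cell_ids = np.floor(data * side).astype(np.int64)
        # Group points by multidimensional cell and count occupancy.
        _, counts = np.unique(cell_ids, axis=0, return_counts=True)
        log_r.append(np.log(r))
        log_sum2.append(np.log(np.sum(counts.astype(np.float64) ** 2)))

    # D2 is the slope of the linear part; here we simply fit all levels.
    slope, _ = np.polyfit(log_r, log_sum2, 1)
    return slope

For the classic Sierpinski triangle this should return roughly log 3 / log 2 ≈ 1.58.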


Related Work
Dimensionality Reduction - Taxonomy 1

Dimensionality reduction methods split into supervised and unsupervised algorithms. Principal Component Analysis, Singular Value Decomposition and Fractal Dimension Reduction are unsupervised.

Related Work
Dimensionality Reduction - Taxonomy 2

Dimensionality reduction methods also split into feature extraction and feature selection. Principal Component Analysis and Singular Value Decomposition are feature extraction; Fractal Dimension Reduction is feature selection, whose methods further divide into wrapper, filter and embedded approaches.

Related Work

The gap: existing methods need supervision, miss non-linear correlations, cannot handle Big Data, or work for classification only.


General Idea

Curl-Remover removes the E - ⌈D2⌉ least relevant attributes, one at a time, in ascending order of relevance.

Counting: mappers build partial trees for the full dataset and for each of its E (E-1)-dimensional projections. Each cell is shuffled as a key-value pair whose key is the tree id plus the cell's spatial position, and whose value is the partial count of points in that cell. Reducers sum the partial counts and report log(r) and log(sum2) for each tree. (A plain-Python sketch of this map/reduce pair follows.)
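The map/reduce pair can be pictured as below, for a single grid resolution, with points assumed normalized to [0, 1) per attribute. All names are hypothetical; the real implementation shuffles quad-tree cells rather than one fixed grid.

def map_point(point, level):
    """Emit (tree id, cell spatial position) -> partial count, for the
    full-space tree (tree 0) and for the E trees that each drop one
    attribute. A combiner would pre-aggregate the counts per cell."""
    E, side = len(point), 2 ** level
    cell = tuple(int(x * side) for x in point)  # grid cell along each axis
    yield (0, cell), 1                          # tree for the full dataset
    for dropped in range(E):                    # E (E-1)-dimensional trees
        yield (dropped + 1, cell[:dropped] + cell[dropped + 1:]), 1

def reduce_cell(key, partial_counts):
    """Sum the partial counts of one cell; per tree, summing count**2 over
    all cells of one resolution yields a (log(r), log(sum2)) point."""
    return key, sum(partial_counts)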

Selection: from the reported plots, Curl-Remover computes D2 for the full dataset and pD2 for each of its E (E-1)-dimensional projections. The least relevant attribute is the one left out of the projection that minimizes |D2 - pD2|: dropping it barely changes the fractal dimension, so it is either irrelevant or correlated with the remaining attributes. The process then repeats on the reduced attribute set to spot the second least relevant attribute, and so on. (A serial sketch of this loop follows.)
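Put together, the relevance loop looks roughly like this serial sketch. It reuses the correlation_dimension function sketched earlier on a NumPy array; in Curl-Remover each pD2 comes from the distributed trees instead.

from math import ceil

def least_relevant_first(points, d2=None):
    """Serial sketch of the selection loop: repeatedly drop the attribute
    whose removal changes the fractal dimension the least, until only
    ceil(D2) attributes remain. Illustrative only."""
    remaining = list(range(points.shape[1]))
    if d2 is None:
        d2 = correlation_dimension(points)    # D2 of the full dataset
    removed = []
    while len(remaining) > ceil(d2):
        # pD2 of each (E-1)-dimensional projection of the current subset.
        def score(a):
            proj = [c for c in remaining if c != a]
            return abs(d2 - correlation_dimension(points[:, proj]))
        worst = min(remaining, key=score)     # least relevant attribute
        remaining.remove(worst)
        removed.append(worst)                 # ascending order of relevance
    return remaining, removed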

General Idea - 3 Main Issues

1st: too much data to be shuffled - one data pair per cell per tree;
2nd: one data pass per irrelevant attribute;
3rd: not enough memory for mappers.

Proposed Method - Curl-Remover

1st issue: too much data to be shuffled, one data pair per cell per tree.
Our solution: two-phase dimensionality reduction.
a) Serial feature selection on a tiny data sample (one reducer), used only to speed up processing;
b) All mappers project the data onto a fixed subspace: each mapper builds and reports only the N (2 or 3) tree levels of lowest resolution, plus the points projected onto the M (2 or 3) most relevant attributes of the sample. Reducers rebuild the full trees from the low-resolution cells and the projected points, so high-resolution cells are never shuffled. (A loose sketch follows.)
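Loosely in code, the mapper side of this solution might look as follows. Every name here is hypothetical, and the reducer-side rebuild of the deep tree levels from the projected points is omitted.

def map_point_shuffle_light(point, trees, N, top_m):
    """Issue-1 sketch: emit cells only for the N lowest-resolution levels
    of every tree, plus one copy of the point projected onto the M most
    relevant attributes found on the sample; high-resolution cells are
    rebuilt reducer-side and never cross the network."""
    for tree_id, attrs in trees:                # full space + projections
        coords = [point[a] for a in attrs]
        for level in range(1, N + 1):           # low-resolution levels only
            side = 2 ** level
            cell = tuple(int(c * side) for c in coords)
            yield ("cell", tree_id, level, cell), 1
    yield ("point", None), tuple(point[a] for a in top_m)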

Proposed Method - Curl-Remover

2nd issue: one data pass per irrelevant attribute.
Our solution: store and read the tree level of highest resolution instead of the original data. Let Rdb be the cost to read the dataset and TWRtree the cost to transfer, write and read the last tree level in the next reduce step. If Rdb > TWRtree, the reducers write the tree's last level to HDFS and the next pass reads it from there rather than re-reading the raw data; as a result, the dataset itself is read only twice.
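The decision rule is simple enough to state in code; the cost estimates below (sizes over bandwidths) are my own stand-ins for whatever the system actually measures.

def should_persist_tree(db_bytes, tree_bytes, read_bw, write_bw, net_bw):
    """Issue-2 sketch: persist the last tree level only when re-reading the
    raw dataset (Rdb) costs more than transferring, writing and reading the
    tree level (TWRtree)."""
    r_db = db_bytes / read_bw                  # re-read original data
    twr_tree = (tree_bytes / net_bw            # transfer to reducers
                + tree_bytes / write_bw        # write to HDFS
                + tree_bytes / read_bw)        # read in the next pass
    return r_db > twr_tree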

Proposed Method - Curl-Remover

3rd issue: not enough memory for mappers.
Our solution: each mapper sorts its local points and, monitoring memory consumption, builds and reports "tree slices" whenever needed. Because the points are sorted, the reported slices have very little overlap.

[Figure: tree slices over a two-dimensional (X, Y) point set]
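A rough picture of the slicing, under the assumption (mine) that points are normalized to [0, 1) and a slice is flushed whenever the number of open cells hits the memory budget.

def map_tree_slices(points, level, max_cells):
    """Issue-3 sketch: sort the mapper's local points by cell id so each
    "tree slice" covers a contiguous cell range, and flush a slice whenever
    the in-memory cell count would exceed the budget. Sorted input is what
    keeps the reported slices almost non-overlapping."""
    side = 2 ** level
    cells = sorted(tuple(int(x * side) for x in p) for p in points)
    slice_counts = {}
    for cell in cells:
        if cell not in slice_counts and len(slice_counts) >= max_cells:
            yield slice_counts                 # report one finished slice
            slice_counts = {}
        slice_counts[cell] = slice_counts.get(cell, 0) + 1
    if slice_counts:
        yield slice_counts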


Evaluation - Datasets

Sierpinski - Sierpinski Triangle + 1 linearly correlated attribute + 2 non-linearly correlated attributes. 5 attributes, 1.1 billion points;
Sierpinski Hybrid - Sierpinski Triangle + 1 non-linearly correlated attribute + 2 random attributes. 5 attributes, 1.1 billion points;
Yahoo! Network Flows - communication patterns between end-users on the web. 12 attributes, 562 million points;
Astro - high-resolution cosmological simulation. 6 attributes, 1 billion points;
Hepmass - physics dataset with particles of unknown mass. 28 attributes, 10.5 million points;
Hepmass Duplicated - Hepmass + 28 correlated attributes. 56 attributes, 10.5 million points.
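For concreteness, here is a hypothetical generator in the spirit of the Sierpinski dataset. The exact correlation functions the authors used are not stated on these slides, so a3-a5 below are stand-ins.

import numpy as np

def sierpinski_with_correlations(n, seed=0):
    """Sierpinski-triangle points (x, y) plus one linearly and two
    non-linearly correlated attributes: 5 attributes total."""
    rng = np.random.default_rng(seed)
    # Chaos-game construction of the Sierpinski triangle.
    vertices = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, np.sqrt(3) / 2]])
    pts = np.empty((n, 2))
    p = np.array([0.25, 0.25])
    for i in range(n):
        p = (p + vertices[rng.integers(3)]) / 2
        pts[i] = p
    x, y = pts[:, 0], pts[:, 1]
    a3 = 2 * x + 1               # linearly correlated attribute
    a4 = np.sin(2 * np.pi * x)   # non-linearly correlated attribute
    a5 = x * y                   # non-linearly correlated attribute
    return np.column_stack([x, y, a3, a4, a5])

Feeding the first two attributes to the correlation_dimension sketch should give D2 ≈ log 3 / log 2 ≈ 1.58, so a selector in the spirit of Curl-Remover would keep ⌈D2⌉ = 2 attributes and discard a3-a5.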

Evaluation - Fractal Dimension

[Plots: box-counting log-log curves and the resulting fractal dimension for Hepmass and for Hepmass Duplicated]

Evaluation - Comparison with sPCA

Classification: Curl-Remover is 8% more accurate and 7.5% faster than sPCA.

[Plot: percentage of the original fractal dimension preserved after selection]


Conclusions

Accuracy - eliminates both linear and non-linear attribute correlations, besides irrelevant attributes; 8% better than sPCA;
Scalability - linear scalability on the data size (theoretical analysis); experiments with up to 1.1 billion points;
Unsupervised - requires neither a training set nor the user to guess the number of attributes to be removed;
Semantics - as a feature selection method, it maintains the semantics of the attributes;
Generality - suits analytical tasks in general, not only classification.

Questions?
robson@icmc.usp.br
