Effective and Unsupervised Fractal-based Feature Selection for Very Large Datasets Removing linear and non-linear attribute correlations Antonio Canabrava Fraideinberze Jose F Rodrigues-Jr Robson Leonardo Ferreira Cordeiro Databases and Images Group University of São Paulo São Carlos - SP - Brazil


Post on 07-Feb-2017


Page 1: Effective and Unsupervised Fractal-based Feature Selection for Very Large Datasets: removing linear and non-linear attribute correlations

Antonio Canabrava Fraideinberze, Jose F Rodrigues-Jr, Robson Leonardo Ferreira Cordeiro

Databases and Images Group, University of São Paulo, São Carlos - SP - Brazil

Pages 2-6:

Terabytes? How to analyze that data?

Parallel processing and dimensionality reduction, for sure, but how to remove linear and non-linear attribute correlations, besides irrelevant attributes, and how to reduce dimensionality without human supervision while being task independent?

Curl-Remover

Medium-dimensionality

Pages 7-8:

Agenda

Fundamental Concepts

Related Work

Proposed Method

Evaluation

Conclusion

Pages 9-12:

Fundamental Concepts

Fractal Theory

[Figures]

Pages 13-14:

Fundamental Concepts

Fractal Theory

Embedded, Intrinsic and Fractal Correlation Dimension

Fractal Correlation Dimension ≅ Intrinsic Dimension

First example: embedded dimension ≅ 3, intrinsic dimension ≅ 1

Second example: embedded dimension ≅ 3, intrinsic dimension ≅ 2

Pages 15-19:

Fundamental Concepts

Fractal Theory

Fractal Correlation Dimension - Box Counting

[Plot; x-axis: log(r)]

Multidimensional Quad-tree [Traina Jr. et al, 2000]
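
As a rough illustration of the box-counting idea on these slides, the sketch below lays grids of shrinking cell side r over the data, sums the squared point counts per cell, and takes the slope of log(sum of squared counts) against log(r) as an estimate of the fractal correlation dimension D2. It uses a flat grid instead of the multidimensional quad-tree cited above. The function name, the `levels` parameter and the rescaling to [0, 1] are assumptions made for this example, not details taken from the slides.

```python
import numpy as np


def correlation_dimension(points, levels=(1, 2, 3, 4, 5)):
    """Estimate D2 as the slope of log(sum of squared cell counts) vs log(r)."""
    pts = np.asarray(points, dtype=float)
    # Rescale every attribute to [0, 1] so one grid covers the whole dataset.
    low = pts.min(axis=0)
    span = pts.max(axis=0) - low
    span[span == 0] = 1.0  # constant attributes collapse into a single cell
    unit = (pts - low) / span

    log_r, log_s = [], []
    for level in levels:
        cells_per_axis = 2 ** level      # grid resolution at this level
        r = 1.0 / cells_per_axis         # cell side
        # Integer cell coordinates; the value 1.0 is clamped into the last cell.
        cell = np.minimum((unit * cells_per_axis).astype(np.int64),
                          cells_per_axis - 1)
        _, counts = np.unique(cell, axis=0, return_counts=True)
        log_r.append(np.log(r))
        log_s.append(np.log(np.sum(counts.astype(np.float64) ** 2)))

    slope, _intercept = np.polyfit(log_r, log_s, 1)
    return float(slope)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    t = rng.random(100_000)
    # A straight line stored with 3 attributes: embedded 3, intrinsic 1.
    line_in_3d = np.column_stack([t, 2 * t, 1 - t])
    print(correlation_dimension(line_in_3d))
```

For a line stored with three attributes the estimate should come out close to 1, and for a plane stored with three attributes close to 2, which is the sense in which the fractal correlation dimension approximates the intrinsic dimension.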


Page 21:

Related Work

Dimensionality Reduction - Taxonomy 1

Dimensionality Reduction: Supervised Algorithms, Unsupervised Algorithms

Methods shown: Principal Component Analysis, Singular Value Decomposition, Fractal Dimension Reduction

Page 22:

Related Work

Dimensionality Reduction - Taxonomy 2

Dimensionality Reduction: Feature Extraction, Feature Selection

Methods shown: Principal Component Analysis, Singular Value Decomposition, Fractal Dimension Reduction

Feature selection approaches: Embedded, Filter, Wrapper

Page 23:

Related Work

Terabytes?

Existing methods need supervision, miss non-linear correlations, cannot handle Big Data, or work for classification only.


Pages 25-28:

General Idea

Removes the E - ⌈D2⌉ least relevant attributes, one at a time, in ascending order of relevance.

Builds partial trees for the full dataset and for its E (E-1)-dimensional projections.

Data pair: TreeID + cell spatial position, partial count of points.

Sums partial point counts and reports log(r) and log(sum2) for each tree.
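
The counting steps above can be sketched as a map side that emits partial counts keyed by tree and cell, and a reduce side that sums them. This is a single-process imitation, not Hadoop code. It assumes attributes already scaled to [0, 1], grid cells in place of tree nodes, and reads "sum2" as the sum of squared counts. The function names and the `"full"` tree identifier are hypothetical.

```python
import math
from collections import Counter


def projection_trees(num_attrs):
    """Attribute lists per tree: the full space plus each (E-1)-dimensional projection."""
    trees = {"full": tuple(range(num_attrs))}
    for dropped in range(num_attrs):
        trees[dropped] = tuple(a for a in range(num_attrs) if a != dropped)
    return trees


def map_partial_counts(split, trees, levels):
    """Map side: partial point counts keyed by (tree id, level, cell position)."""
    partial = Counter()
    for point in split:
        for tree_id, attrs in trees.items():
            for level in levels:
                cells = 2 ** level
                # Assumes every attribute value lies in [0, 1].
                cell = tuple(min(int(point[a] * cells), cells - 1) for a in attrs)
                partial[(tree_id, level, cell)] += 1
    return partial


def reduce_log_sums(partials):
    """Reduce side: sum partial counts per cell, then report (log r, log sum of squares)."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # adds the counts of matching keys
    squares = Counter()
    for (tree_id, level, _cell), count in total.items():
        squares[(tree_id, level)] += count ** 2
    report = {}
    # Sort by level only, since tree ids mix strings and integers.
    for (tree_id, level), s in sorted(squares.items(), key=lambda kv: kv[0][1]):
        report.setdefault(tree_id, []).append((math.log(2.0 ** -level), math.log(s)))
    return report
```

Each data split can be counted independently, and summing the partial counts gives the same report as counting the whole dataset at once. That is also why one pair per cell and tree has to cross the shuffle.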

Pages 29-31:

General Idea

Computes D2 for the full dataset and pD2 for each of its E (E-1)-dimensional projections.

Spots the least relevant attribute, i.e., the one not in the projection that minimizes | D2 - pD2 |.

Spots the second least relevant attribute …
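
A minimal serial sketch of this selection loop follows, assuming a flat-grid box-counting estimate of D2 instead of the tree-based, distributed computation. The names `d2` and `select_attributes` are hypothetical, and recomputing D2 on the remaining attributes in every round is an assumption of the sketch.

```python
import math
import numpy as np


def d2(points, levels=(1, 2, 3, 4, 5)):
    """Box-counting estimate of the fractal correlation dimension (flat grid)."""
    pts = np.asarray(points, dtype=float)
    low = pts.min(axis=0)
    span = pts.max(axis=0) - low
    span[span == 0] = 1.0
    unit = (pts - low) / span  # every attribute rescaled to [0, 1]
    log_r, log_s = [], []
    for level in levels:
        cells = 2 ** level
        cell = np.minimum((unit * cells).astype(np.int64), cells - 1)
        _, counts = np.unique(cell, axis=0, return_counts=True)
        log_r.append(-level * math.log(2.0))
        log_s.append(math.log(float(np.sum(counts.astype(np.float64) ** 2))))
    return float(np.polyfit(log_r, log_s, 1)[0])


def select_attributes(points):
    """Return (kept, removed) attribute indices; removed is in removal order."""
    pts = np.asarray(points, dtype=float)
    kept = list(range(pts.shape[1]))
    removed = []
    # E - ceil(D2) attributes are removed, always leaving at least one.
    to_remove = max(0, min(len(kept) - 1, len(kept) - math.ceil(d2(pts))))
    for _ in range(to_remove):
        full = d2(pts[:, kept])
        # |D2 - pD2| for each projection that leaves one attribute out.
        gaps = {a: abs(full - d2(pts[:, [k for k in kept if k != a]]))
                for a in kept}
        least_relevant = min(gaps, key=gaps.get)
        kept.remove(least_relevant)
        removed.append(least_relevant)
    return kept, removed
```

Dropping an attribute that is a linear or non-linear function of the others barely changes the fractal dimension, so its gap is small and it goes first. Dropping an attribute that carries independent information lowers pD2 and produces a large gap.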

Pages 32-35:

General Idea

3 Main Issues

1° Too much data to be shuffled: one data pair per cell/tree

2° One data pass per irrelevant attribute

3° Not enough memory for mappers

Page 36:

Proposed Method

Curl-Remover

1° Issue - Too much data to be shuffled; one data pair per cell/tree.

Our solution - Two-phase dimensionality reduction:

a) Serial feature selection in a tiny data sample (one reducer). Used to speed up processing only;

b) All mappers project data into a fixed subspace.

Pages 37-40:

Proposed Method

Curl-Remover

Builds/reports the N (2 or 3) tree levels of lowest resolution, plus the points projected into the M (2 or 3) most relevant attributes of the sample.

Builds the full trees from their low-resolution level cells and the projected points.

High-resolution cells are never shuffled.

Page 41:

Proposed Method

Curl-Remover

2° Issue - One data pass per irrelevant attribute.

Our solution - Stores/reads the tree level of highest resolution, instead of the original data.

Pages 42-46:

Proposed Method

Curl-Remover

Rdb = cost to read the dataset;

TWRtree = cost to transfer, write and read the last tree level in the next reduce step;

If (Rdb > TWRtree) then writes the tree.

Writes the tree's last level to HDFS.

Reads the tree's last level from HDFS.

Reads the dataset only twice.

Page 47:

Proposed Method

Curl-Remover

3° Issue - Not enough memory for mappers.

Our solution - Sorts data in mappers and reports "tree slices" whenever needed.

Pages 48-50:

Proposed Method

Curl-Remover

Sorts its local points and builds "tree slices", monitoring memory consumption.

[Figure; axes: X, Y]

Reports "tree slices" with very little overlap.
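
The slides do not define a "tree slice", so the sketch below is only one possible reading. It assumes a slice is a batch of consecutive finest-level cells taken in sorted order, and that a cap on the number of buffered cells stands in for real memory monitoring. The names `cell_of`, `tree_slices` and `max_cells` are hypothetical, and the in-memory `sorted` call would need to be an external sort on data that does not fit in memory.

```python
from itertools import groupby


def cell_of(point, level):
    """Grid cell of a point whose attributes lie in [0, 1]."""
    cells = 2 ** level
    return tuple(min(int(v * cells), cells - 1) for v in point)


def tree_slices(points, level, max_cells):
    """Yield lists of (cell, count) pairs, each holding at most max_cells cells."""
    # Sorting makes the points of the same cell adjacent.
    ordered = sorted(cell_of(p, level) for p in points)
    current = []
    for cell, group in groupby(ordered):
        current.append((cell, sum(1 for _ in group)))
        if len(current) >= max_cells:  # stand-in for a real memory check
            yield current
            current = []
    if current:
        yield current
```

Because the points are sorted first, all points of a cell are counted together. No cell therefore appears in two slices of the same mapper, and each slice can be reported and discarded before the next one is built.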


Page 52:

Evaluation

Datasets

Sierpinski - Sierpinski Triangle + 1 attribute linearly correlated + 2 attributes non-linearly correlated. 5 attributes, 1.1 billion points;

Sierpinski Hybrid - Sierpinski Triangle + 1 attribute non-linearly correlated + 2 random attributes. 5 attributes, 1.1 billion points;

Yahoo! Network Flows - communication patterns between end-users in the web. 12 attributes, 562 million points;

Astro - high-resolution cosmological simulation. 6 attributes, 1 billion points;

Hepmass - physics-related dataset with particles of unknown mass. 28 attributes, 10.5 million points;

Hepmass Duplicated - Hepmass + 28 correlated attributes. 56 attributes, 10.5 million points.

Pages 53-54:

Evaluation

Fractal Dimension: Hepmass

Fractal Dimension: Hepmass Duplicated

Pages 55-56:

Evaluation

Comparison with sPCA - Classification

8% more accurate, 7.5% faster

Page 57:

Evaluation

Comparison with sPCA

Percentage of Fractal Dimension after selection


Pages 59-63:

Conclusions

Accuracy - eliminates both linear and non-linear attribute correlations, besides irrelevant attributes; 8% better than sPCA;

Scalability - linear scalability on the data size (theoretical analysis); experiments with up to 1.1 billion points;

Unsupervised - it requires neither a user guess of the number of attributes to be removed nor a training set;

Semantics - it is a feature selection method, so it maintains the semantics of the attributes;

Generality - it suits analytical tasks in general, not only classification.
