orpailleur -- triclustering talk

An Experimental Comparison of Some Triclustering Algorithms Dmitry V. Gnatyshak, Dmitry I. Ignatov*, Sergei O. Kuznetsov School of Applied Mathematics and Information Science & Intelligence Systems and Structural Analysis Lab NRU Higher School of Economics, Moscow, Russia LORIA Orpailleur meeting, Nancy, France, 2013

Upload: dmitry-ignatov

Post on 06-May-2015


DESCRIPTION

An informal talk on triclustering for Orpailleur team, LORIA, Nancy, France

TRANSCRIPT

Page 1: Orpailleur -- triclustering talk

An Experimental Comparison of Some Triclustering Algorithms

Dmitry V. Gnatyshak, Dmitry I. Ignatov*, Sergei O. Kuznetsov

School of Applied Mathematics and Information Science & Intelligence Systems and Structural Analysis Lab

NRU Higher School of Economics, Moscow, Russia

LORIA Orpailleur meeting, Nancy, France, 2013

Page 2: Orpailleur -- triclustering talk

Outline

1. Motivation and problem setting

2. FCA basic definitions

3. Triclustering methods

4. Experiments

5. Conclusion

Page 3: Orpailleur -- triclustering talk

Motivation

Many sources of structured and unstructured data generate triadic data.

Example: a folksonomy is a set of triples (user, object, tag).

Examples:

Bibsonomy.org: (user, bookmark, tag)

Social networking sites: (user, group, interest)

Delicious: (user, link, tag)

Page 4: Orpailleur -- triclustering talk

Main goals

1. Comparison of some triclustering methods

2. Development of a toolbox for triclustering experiments

3. New, possibly better, methods

4. Possible applications

Page 5: Orpailleur -- triclustering talk

FCA: basic definitions

Biology Mathematics Computer Science Chemistry

Kate x x

Mike x x x

Alex x x

Pete x x x

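Example: $G$ is a set of students, $M$ is a set of courses, and $gIm$ means "student $g$ takes course $m$".

(R. Wille, 1982; B. Ganter, R. Wille, 1999)

Def. A formal context is a triple $\mathbb{K} = (G, M, I)$, where $G$ is a set of objects, $M$ is a set of attributes, and $I \subseteq G \times M$ is an incidence relation.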

Page 6: Orpailleur -- triclustering talk

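FCA: basic definitions

Def. Galois operators (concept-forming operators) for a set of objects $A \subseteq G$ and a set of attributes $B \subseteq M$ of a context $(G, M, I)$:

$A' = \{m \in M \mid gIm \text{ for all } g \in A\}$ is the set of attributes common to all objects from $A$;

$B' = \{g \in G \mid gIm \text{ for all } m \in B\}$ is the set of objects that possess all attributes from $B$.

Examples:

Biology Mathematics Computer Science Chemistry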

Kate x x

Mike x x x

Alex x x

Pete x x x
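The two concept-forming operators can be sketched in a few lines of Python. This is only an illustration: the positions of the crosses in the table are not recoverable from the transcript, so the incidence relation `CONTEXT` below is a hypothetical one that merely respects the number of crosses per student, and the function names are invented for this sketch.

```python
# Hypothetical incidence relation: the cross positions are an assumption,
# only the number of crosses per student follows the slide.
CONTEXT = {
    "Kate": {"Mathematics", "Computer Science"},
    "Mike": {"Biology", "Mathematics", "Computer Science"},
    "Alex": {"Biology", "Chemistry"},
    "Pete": {"Mathematics", "Computer Science", "Chemistry"},
}
ATTRIBUTES = {"Biology", "Mathematics", "Computer Science", "Chemistry"}


def common_attributes(objects):
    """A': attributes shared by every object in A (all attributes if A is empty)."""
    result = set(ATTRIBUTES)
    for g in objects:
        result &= CONTEXT[g]
    return result


def common_objects(attributes):
    """B': objects that possess every attribute in B."""
    wanted = set(attributes)
    return {g for g, row in CONTEXT.items() if wanted <= row}
```

With this made-up relation, `common_attributes({"Kate", "Mike"})` yields the two courses both students take, and applying `common_objects` to that result may return a larger set of students than the one started from.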

Page 7: Orpailleur -- triclustering talk

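FCA: basic definitions

Def. A pair $(A, B)$ with $A \subseteq G$ and $B \subseteq M$ is called a formal concept of the context $\mathbb{K} = (G, M, I)$ if $A' = B$ and $B' = A$.

Examples:

• … is a formal concept

• … is not a formal concept

Biology Mathematics Computer Science Chemistry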

Kate x x

Mike x x x

Alex x x

Pete x x x
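A formal concept is a pair that the two operators map onto each other, which can be checked directly. The sketch below assumes a hypothetical incidence relation (the crosses of the table are not recoverable from the transcript) and enumerates all concepts naively by closing every subset of objects; this is exponential and only meant for toy contexts. All names are illustrative.

```python
from itertools import combinations

# Hypothetical incidence relation (cross positions are an assumption).
CONTEXT = {
    "Kate": {"Mathematics", "Computer Science"},
    "Mike": {"Biology", "Mathematics", "Computer Science"},
    "Alex": {"Biology", "Chemistry"},
    "Pete": {"Mathematics", "Computer Science", "Chemistry"},
}
ATTRIBUTES = set().union(*CONTEXT.values())


def up(objects):
    """A': attributes common to all objects in A."""
    result = set(ATTRIBUTES)
    for g in objects:
        result &= CONTEXT[g]
    return result


def down(attributes):
    """B': objects possessing all attributes in B."""
    wanted = set(attributes)
    return {g for g, row in CONTEXT.items() if wanted <= row}


def is_concept(extent, intent):
    """(A, B) is a formal concept iff A' = B and B' = A."""
    return up(extent) == set(intent) and down(intent) == set(extent)


def all_concepts():
    """Naive enumeration: close every subset of objects (toy sizes only)."""
    found = set()
    objects = sorted(CONTEXT)
    for size in range(len(objects) + 1):
        for subset in combinations(objects, size):
            intent = up(subset)
            found.add((frozenset(down(intent)), frozenset(intent)))
    return found
```

In this made-up context the pair ({Kate, Mike, Pete}, {Mathematics, Computer Science}) passes `is_concept`, while ({Kate, Mike}, {Mathematics, Computer Science}) does not, because Pete also takes both courses.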

Page 8: Orpailleur -- triclustering talk

Triadic FCA: basic definitions

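Def. A quadruple $\mathbb{K} = (G, M, B, I)$ is a triadic formal context (tricontext), where $G$, $M$ and $B$ are sets of objects, attributes and conditions respectively, and $I \subseteq G \times M \times B$ is a ternary relation.

Let $X \subseteq G$, $Y \subseteq M$, $Z \subseteq B$; then the concept-forming operators in the triadic case are:

$(X, Y)' = \{b \in B \mid (g, m, b) \in I \text{ for all } g \in X, m \in Y\}$

$(X, Z)' = \{m \in M \mid (g, m, b) \in I \text{ for all } g \in X, b \in Z\}$

$(Y, Z)' = \{g \in G \mid (g, m, b) \in I \text{ for all } m \in Y, b \in Z\}$

Def. A triple $(X, Y, Z)$ is called a triadic formal concept (triconcept) of the triadic context $\mathbb{K}$ if $(X, Y)' = Z$, $(X, Z)' = Y$ and $(Y, Z)' = X$, where $X$ is an extent, $Y$ is an intent, and $Z$ is a modus.

(F. Lehmann, R. Wille, 1995)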
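A small Python sketch can make the triadic case concrete. It assumes the usual reading of the definitions: each operator takes two of the three component sets and returns the elements of the third set that are related to every combination, and a triple is a triconcept when each component is the image of the other two. The toy relation `I` of (user, resource, tag) triples and all function names are hypothetical.

```python
# Hypothetical tricontext given as a set of (object, attribute, condition) triples.
I = {
    ("u1", "r1", "t1"), ("u1", "r1", "t2"),
    ("u2", "r1", "t1"), ("u2", "r1", "t2"),
    ("u2", "r2", "t1"),
}
G = {g for g, _, _ in I}
M = {m for _, m, _ in I}
B = {b for _, _, b in I}


def prime_xy(X, Y):
    """(X, Y)': conditions related to every object of X and attribute of Y."""
    return {b for b in B if all((g, m, b) in I for g in X for m in Y)}


def prime_xz(X, Z):
    """(X, Z)': attributes related to every object of X and condition of Z."""
    return {m for m in M if all((g, m, b) in I for g in X for b in Z)}


def prime_yz(Y, Z):
    """(Y, Z)': objects related to every attribute of Y and condition of Z."""
    return {g for g in G if all((g, m, b) in I for m in Y for b in Z)}


def is_triconcept(X, Y, Z):
    """A triple is a triconcept iff each set is the image of the other two."""
    return (prime_xy(X, Y) == set(Z)
            and prime_xz(X, Z) == set(Y)
            and prime_yz(Y, Z) == set(X))
```

Here ({u1, u2}, {r1}, {t1, t2}) is a triconcept, whereas ({u1}, {r1}, {t1, t2}) is not, since u2 could be added to the extent without leaving the relation.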

Page 9: Orpailleur -- triclustering talk

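OAC-triclusters (based on box operators)

Def. Let $\mathbb{K} = (G, M, B, I)$ be a triadic context; then the OAC-tricluster based on box operators built on a triple $(g, m, b) \in I$ is the triple of sets $T = (g^{\square}, m^{\square}, b^{\square})$.

$\rho(T) = \dfrac{|I \cap (g^{\square} \times m^{\square} \times b^{\square})|}{|g^{\square}| \, |m^{\square}| \, |b^{\square}|}$ is the density of $T$.

Box operators: … (D. Ignatov et al., 2011)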

Page 10: Orpailleur -- triclustering talk

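OAC-triclusters (based on prime operators)

Def. Let $\mathbb{K} = (G, M, B, I)$ be a triadic context; then the OAC-tricluster based on prime operators for a triple $(g, m, b) \in I$ is the triple $T = ((m, b)', (g, b)', (g, m)')$.

$\rho(T) = \dfrac{|I \cap ((m, b)' \times (g, b)' \times (g, m)')|}{|(m, b)'| \, |(g, b)'| \, |(g, m)'|}$ is the density of $T$.

Prime operators of singletons:

$(m, b)' = \{h \in G \mid (h, m, b) \in I\}$

$(g, b)' = \{n \in M \mid (g, n, b) \in I\}$

$(g, m)' = \{c \in B \mid (g, m, c) \in I\}$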

Page 11: Orpailleur -- triclustering talk

OAC-triclustering

The algorithm's idea:

INPUT: $\mathbb{K} = (G, M, B, I)$ is a tricontext; $\rho_{\min}$ is a density threshold

OUTPUT: $\mathcal{T} = \{(X, Y, Z)\}$ is a set of triclusters

For each triple of the context, it builds a tricluster by the definitions and keeps it if its density reaches the threshold.

Notes:

1. It makes sense to use a hash function to avoid duplicate triclusters

2. Enumeration of the triples is easy to parallelize
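The idea above can be sketched for the prime-operator variant as follows. This is a minimal, sequential illustration rather than the authors' implementation: the function name, the toy data in `TRIPLES` and the choice of a Python dictionary keyed by frozensets (standing in for the hash function that removes duplicates) are assumptions of this sketch.

```python
from collections import defaultdict

# Hypothetical toy tricontext as (object, attribute, condition) triples.
TRIPLES = {
    ("u1", "r1", "t1"), ("u1", "r1", "t2"),
    ("u2", "r1", "t1"), ("u2", "r1", "t2"),
    ("u2", "r2", "t1"),
}


def oac_prime_triclusters(triples, min_density=0.0):
    """Return {tricluster: density} for OAC-triclusters built with prime operators."""
    triples = set(triples)
    objs, attrs, conds = defaultdict(set), defaultdict(set), defaultdict(set)
    for g, m, b in triples:
        objs[(m, b)].add(g)    # (m, b)'
        attrs[(g, b)].add(m)   # (g, b)'
        conds[(g, m)].add(b)   # (g, m)'

    seen, result = set(), {}
    for g, m, b in triples:
        cluster = (frozenset(objs[(m, b)]),
                   frozenset(attrs[(g, b)]),
                   frozenset(conds[(g, m)]))
        if cluster in seen:    # hashing removes duplicate triclusters
            continue
        seen.add(cluster)
        X, Y, Z = cluster
        hits = sum((x, y, z) in triples for x in X for y in Y for z in Z)
        density = hits / (len(X) * len(Y) * len(Z))
        if density >= min_density:
            result[cluster] = density
    return result
```

On the toy data, five generating triples collapse into three distinct triclusters, and raising `min_density` filters out the sparse one. Because each triple is processed independently once the three lookup tables are built, the main loop is the part that could be distributed across workers.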

Page 12: Orpailleur -- triclustering talk

TriBox

Model: given a tricontext, its set of triclusters follows a model that involves a parameter, a constant, a residual, and Boolean variables showing whether an object belongs to the tricluster (similarly for attributes and conditions).

The method's idea: minimize the residuals using a greedy approach. It enumerates all triples, adding an object (attribute or condition) at each step.

Notes:

1. It makes sense to use a hash function to avoid duplicate triclusters

2. Enumeration of the triples is easy to parallelize

(A. Kramarenko & B. Mirkin, 2011)

Page 13: Orpailleur -- triclustering talk

Spectral Triclustering: SpecTric

The algorithm's idea: it sequentially splits the tricontext into subtricontexts according to the normalized mincut criterion while they have sufficient size, then returns them as a set of triclusters.

Notes:

1. The tricontext is represented as a graph $(V, E)$, where $V$ is a set of vertices and $E$ is a set of edges composed by the rule: …. The graph is then split using an approximation of the eigenvector that corresponds to the second smallest eigenvalue of the Laplacian matrix of the input graph.

2. To find the partition vector, we solve the generalized eigenvalue problem $(D - W)x = \lambda D x$, where $W$ is the adjacency (weight) matrix of the graph and $D$ is the diagonal matrix of the vertex degrees.

(D. Ignatov & Z. Sekinaeva, 2011; Ignatov et al. 2013)
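One splitting step of this kind can be sketched in pure Python. The sketch makes several assumptions: the rule that turns a tricontext into a graph is not recoverable from the transcript, so the input is simply a symmetric weight matrix `W` of a connected graph; the generalized eigenvector for the second smallest eigenvalue of $(D - W)x = \lambda D x$ is approximated by deflated power iteration, which is a stand-in and not necessarily the approximation the authors used; and the two parts are read off from the signs of the vector.

```python
import math


def spectral_bipartition(W, iterations=300):
    """Split a connected weighted graph in two using the sign pattern of an
    approximate second generalized eigenvector of (D - W) x = lambda * D x."""
    n = len(W)
    deg = [sum(row) for row in W]
    inv_sqrt = [1.0 / math.sqrt(d) for d in deg]
    # S = I + D^{-1/2} W D^{-1/2}; an eigenvalue 2 - lambda of S corresponds
    # to the eigenvalue lambda of the generalized problem.
    S = [[(1.0 if i == j else 0.0) + inv_sqrt[i] * W[i][j] * inv_sqrt[j]
          for j in range(n)] for i in range(n)]
    # Eigenvector of S for lambda = 0 is known: proportional to sqrt(degree).
    total = math.sqrt(sum(deg))
    top = [math.sqrt(d) / total for d in deg]

    z = [float(i + 1) for i in range(n)]  # deterministic start vector
    for _ in range(iterations):
        dot = sum(zi * ti for zi, ti in zip(z, top))
        z = [zi - dot * ti for zi, ti in zip(z, top)]   # remove the trivial direction
        z = [sum(S[i][j] * z[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(zi * zi for zi in z))
        z = [zi / norm for zi in z]

    x = [inv_sqrt[i] * z[i] for i in range(n)]          # back to the generalized eigenvector
    return [xi >= 0 for xi in x]


# Hypothetical graph: two triangles joined by a single edge (2-3).
W = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
labels = spectral_bipartition(W)
```

For this graph the labels separate the two triangles. A recursive version would apply the same step to each part while it stays above a size threshold, which mirrors the sequential splitting described on the slide.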

Page 14: Orpailleur -- triclustering talk

Spectral Triclustering: SpecTric

(D. Ignatov & Z. Sekinaeva, 2011; Ignatov et al. 2013)

Page 15: Orpailleur -- triclustering talk

TRIAS

It finds all triadic formal concepts with a predefined minimal support for each of the sets $G$, $M$ and $B$. In this task, a formal triconcept is a tricluster with density 1.

The algorithm's idea: it applies the NextClosure algorithm (B. Ganter, 1987) in sequence. First, it finds formal concepts of a dyadic formal context derived from the tricontext; second, it works on a dyadic context built on the extent of each previously found concept. The resulting sets are then combined into a final triadic concept.

(R. Jäschke, 2006)

Page 16: Orpailleur -- triclustering talk

Experiments

Main goals:

Fault-tolerance test

Comparison by criteria: time, quantity, mean density, coverage and diversity

We implemented parallel versions of TriBox and OAC-triclustering and included them in the comparison.

Page 17: Orpailleur -- triclustering talk

Data

Fault-tolerance test: contexts with three cuboids of ones on the main diagonal and different noise probabilities

Time, tricluster number, average density, coverage, and diversity:

Random contexts

Top-250 IMDB

BibSonomy
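A synthetic context of the kind used in the fault-tolerance test could be generated as in the sketch below. The slide does not state the context sizes or the exact noise model, so the block size, the number of blocks and the rule "each cell is inverted independently with probability `noise`" are assumptions, as is the function name.

```python
import random


def noisy_block_context(block_size, blocks=3, noise=0.0, seed=0):
    """Build a tricontext with `blocks` cuboids of ones on the main diagonal,
    then invert every cell independently with probability `noise` (assumed model)."""
    rng = random.Random(seed)
    n = block_size * blocks
    triples = set()
    for g in range(n):
        for m in range(n):
            for b in range(n):
                inside = g // block_size == m // block_size == b // block_size
                flipped = rng.random() < noise
                if inside != flipped:   # keep a one unless the cell was inverted
                    triples.add((g, m, b))
    return triples
```

With `noise=0.0` the result is exactly the three cuboids, so a fault-tolerant method is expected to recover something close to them as `noise` grows.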

Page 18: Orpailleur -- triclustering talk

Comparison Criteria

Fault-tolerance is the ability of a triclustering method to find triclusters maximally similar to the initial cuboids; the measure depends on the number of cuboids and on the obtained set of triclusters.

Coverage is the fraction of the triples of the input context that are included in at least one tricluster of the obtained set.

Diversity is defined via a Boolean function on two triclusters, from which the diversity of a tricluster set is computed.
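Coverage and diversity can be sketched as below. The exact formulas are not preserved in the transcript, so this sketch assumes one plausible reading: two triclusters "intersect" when they share at least one object, one attribute and one condition, and diversity is one minus the fraction of intersecting pairs. Function names and sample data are hypothetical.

```python
from itertools import combinations


def coverage(triples, triclusters):
    """Fraction of context triples that fall inside at least one tricluster."""
    triples = set(triples)
    covered = sum(
        any(g in X and m in Y and b in Z for X, Y, Z in triclusters)
        for g, m, b in triples
    )
    return covered / len(triples)


def intersect(t1, t2):
    """Assumed Boolean function: True if the triclusters overlap in all three sets."""
    return all(a & b for a, b in zip(t1, t2))


def diversity(triclusters):
    """Assumed definition: 1 minus the share of intersecting tricluster pairs."""
    pairs = list(combinations(triclusters, 2))
    if not pairs:
        return 1.0
    return 1 - sum(intersect(a, b) for a, b in pairs) / len(pairs)


# Hypothetical data for illustration.
TRIPLES = {
    ("u1", "r1", "t1"), ("u1", "r1", "t2"),
    ("u2", "r1", "t1"), ("u2", "r1", "t2"),
    ("u2", "r2", "t1"),
}
T1 = (frozenset({"u1", "u2"}), frozenset({"r1"}), frozenset({"t1", "t2"}))
T2 = (frozenset({"u2"}), frozenset({"r2"}), frozenset({"t1"}))
T3 = (frozenset({"u1"}), frozenset({"r1"}), frozenset({"t1"}))
```

Restricting `intersect` to a single component (objects only, attributes only, or conditions only) would give per-set variants of the same measure.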

Page 19: Orpailleur -- triclustering talk

Results (fault-tolerance)

Page 20: Orpailleur -- triclustering talk

OAC-prime triclustering example: IMDB

Page 21: Orpailleur -- triclustering talk

Results (time, quantity, average density, coverage, diversity)

Div.: diversity of the tricluster set; Div. G, Div. M, Div. B: diversity by the sets of objects, attributes and conditions.

Method       T, ms    #      Density, %  Coverage, %  Div., %  Div. G, %  Div. M, %  Div. B, %

Uniform random context
OAC (box)    407      73     9,88        100,00       0,00     0,00       0,00       0,00
OAC (prime)  312      2659   32,23       100,00       92,51    60,07      59,80      59,45
SpecTric     277      5      8,74        8,84         100,00   100,00     100,00     100,00
TriBox       6218     1011   74,00       96,02        97,42    66,25      79,53      84,80
TRIAS        29367    38356  100,00      100,00       99,99    99,93      4,07       3,51

IMDB
OAC (box)    2314     1500   1,84        100,00       15,65    9,67       0,70       7,87
OAC (prime)  547      1274   53,85       100,00       96,55    94,56      92,14      28,52
SpecTric     98799    21     17,07       20,88        100,00   100,00     100,00     100,00
TriBox       197136   328    91,65       98,90        98,89    98,46      95,21      30,94
TRIAS        102554   1956   100,00      100,00       99,89    99,69      52,52      26,18

BibSonomy
OAC (box)    19297    398    4,16        100,00       79,59    67,28      42,83      79,54
OAC (prime)  13556    1289   94,66       100,00       99,74    88,58      99,51      99,53
SpecTric     5906563  2      50,00       100,00       100,00   100,00     100,00     100,00
TriBox       Time > 24 hours
TRIAS        110554   1305   100,00      100,00       99,98    91,70      99,78      99,92

Page 22: Orpailleur -- triclustering talk

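Results (time, quantity, average density, coverage, diversity)

OAC (box): time: average; quantity: large; average density: low; coverage: high ~ very low; diversity: very low ~ average; efficiency of parallel version: high

OAC (prime): time: small; quantity: large; average density: average; coverage: high ~ average; diversity: average ~ high; efficiency of parallel version: low

SpecTric: time: small for small contexts; quantity: small; average density: low; coverage: average ~ high; diversity: 1; efficiency of parallel version: –

TriBox: time: high; quantity: average; average density: high; coverage: high; diversity: high; efficiency of parallel version: high

TRIAS: time: strongly depends on …, and the triconcepts structure; quantity: very large; average density: 1; coverage: high ~ low; diversity: high ~ low; efficiency of parallel version: –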

Page 23: Orpailleur -- triclustering talk

Conclusion

There is no clear winner across the comparison criteria.

TriBox shows the best results, but it takes a huge amount of computation time.

OAC-triclustering based on prime operators gives the second-best results and is sufficiently fast.

Page 24: Orpailleur -- triclustering talk

Conclusion

There is no clear winner across the comparison criteria.

Details by method:

TRIAS

High elapsed time

Too many small, well-interpretable triclusters (triconcepts)

Page 25: Orpailleur -- triclustering talk

Conclusion

OAC (box operators)

Large triclusters of low density

High coverage, low diversity

Efficient parallelization

OAC (prime operators)

High computation speed

Large number of dense, well-interpretable triclusters

Low efficiency of parallelization

Page 26: Orpailleur -- triclustering talk

Conclusion

Spectral Triclustering

High computational speed on small contexts

Well-interpretable triclusters, but of low density

Diversity always equals 1, but at the cost of low coverage

TriBox

A moderate number of well-interpretable triclusters

High elapsed time

Efficient parallelization

Reasonably high coverage and diversity

Page 27: Orpailleur -- triclustering talk

Thank you very much!

Questions?