Similarity Measures in Formal Concept Analysis

Faris Alqadah, Raj Bhatnagar
Computer Science Department, University of Cincinnati

Upload: faris-alqadah

Post on 09-Jul-2015

Page 2: Similarity Measures in Formal Concept Analysis

Outline

1 Introduction

2 Formal Concept Analysis

3 Similarity Measures

4 Experiments

Page 4: Similarity Measures in Formal Concept Analysis

Introduction

Motivation

- Formal Concept Analysis (FCA) has been studied and applied successfully in many diverse fields
- Data mining, conceptual modeling, software engineering, and social networking
- Possible drawback: large number of concepts
- Essential to develop formalisms to segment, cluster, and categorize concepts

Page 5: Similarity Measures in Formal Concept Analysis

Introduction

Related Work

- Few studies have focused on similarity measures for formal concepts
- Ad hoc approaches based on applications (Y. Ding 2002) (Blachon and Gandrillon 2007); no formal study of similarity
- Similarity in fuzzy concepts addressed by Belohlavek
- Concept similarity in ontologies encompasses string similarity measures and external sources of data such as dictionaries

Page 7: Similarity Measures in Formal Concept Analysis

Formal Concept Analysis

Concepts

- Context K = (G, M, I): G is a set of objects, M a set of attributes, and I ⊆ G × M a relation between them
- For A ⊆ G, A′ is defined to be {m ∈ M | a I m ∀a ∈ A}
- Dually, for B ⊆ M, B′ = {g ∈ G | g I m ∀m ∈ B}

Definition
A formal concept of K is a pair (A, B) with A ⊆ G and B ⊆ M such that A′ = B and B′ = A. The set of all concepts is denoted ℬ(G, M, I).
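The two derivation operators and the concept condition can be sketched in a few lines of Python. This is only an illustration: the names `CONTEXT`, `common_attributes`, `common_objects`, and `is_concept` are hypothetical, and the data is the small seven-object context that appears on the "Relation to other theories" slide.

```python
# Example context: each object maps to the set of attributes it has.
CONTEXT = {
    "g1": {"m2", "m4"},
    "g2": {"m3", "m4"},
    "g3": {"m4"},
    "g4": {"m1"},
    "g5": {"m1", "m2", "m3"},
    "g6": {"m3"},
    "g7": {"m1", "m2"},
}
ATTRIBUTES = {"m1", "m2", "m3", "m4"}


def common_attributes(objects):
    """A': attributes shared by every object in A (all of M when A is empty)."""
    result = set(ATTRIBUTES)
    for g in objects:
        result &= CONTEXT[g]
    return result


def common_objects(attributes):
    """B': objects that have every attribute in B (all of G when B is empty)."""
    return {g for g, row in CONTEXT.items() if set(attributes) <= row}


def is_concept(extent, intent):
    """(A, B) is a formal concept iff A' = B and B' = A."""
    return (common_attributes(extent) == set(intent)
            and common_objects(intent) == set(extent))


print(is_concept({"g5", "g7"}, {"m1", "m2"}))  # True
print(is_concept({"g5"}, {"m1", "m2"}))        # False: {g5}' is {m1, m2, m3}
```

Storing the context as a mapping from objects to attribute sets keeps both operators to a single pass over the data, which is enough for contexts of this size.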

Page 10: Similarity Measures in Formal Concept Analysis

Formal Concept Analysis

Relation to other theories

      m1   m2   m3   m4
g1    0    1    0    1
g2    0    0    1    1
g3    0    0    0    1
g4    1    0    0    0
g5    1    1    1    0
g6    0    0    1    0
g7    1    1    0    0

- Concepts are maximal rectangles of 1s under a suitable permutation of rows and columns
- Bi-clusters, co-clusters: maximal sub-matrices with zero variance
- Maximal bicliques in a bipartite graph

Page 12: Similarity Measures in Formal Concept Analysis

Formal Concept Analysis

Hierarchy of Concepts

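[Figure: concept lattice (Hasse diagram) of the example context. Nodes, given as (extent, intent):]

- (∅, {m1, m2, m3, m4})
- ({g5}, {m1, m2, m3})
- ({g1}, {m2, m4})
- ({g2}, {m3, m4})
- ({g5, g7}, {m1, m2})
- ({g2, g5, g6}, {m3})
- ({g1, g2, g3}, {m4})
- ({g4, g5, g7}, {m1})
- ({g1, g5, g7}, {m2})
- ({g1, g2, g3, g4, g5, g6, g7}, ∅)

Concepts of a context form a natural hierarchical structure.

Theorem
Formal concepts of a context, ordered by the subset relation on their extents, form a complete lattice.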
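For a context this small, the ten lattice nodes can be reproduced by brute force. The sketch below closes every attribute subset and collects the distinct (extent, intent) pairs; it is exponential in the number of attributes and is meant only as a check on the example, not as a practical enumeration algorithm. All function names are illustrative.

```python
from itertools import combinations

# Example context: each object maps to the set of attributes it has.
CONTEXT = {
    "g1": {"m2", "m4"},
    "g2": {"m3", "m4"},
    "g3": {"m4"},
    "g4": {"m1"},
    "g5": {"m1", "m2", "m3"},
    "g6": {"m3"},
    "g7": {"m1", "m2"},
}
ATTRIBUTES = sorted({m for row in CONTEXT.values() for m in row})


def extent_of(attrs):
    """B': all objects having every attribute in attrs."""
    return frozenset(g for g, row in CONTEXT.items() if set(attrs) <= row)


def intent_of(objs):
    """A': all attributes shared by every object in objs."""
    result = set(ATTRIBUTES)
    for g in objs:
        result &= CONTEXT[g]
    return frozenset(result)


def all_concepts():
    """Close every attribute subset B into the concept (B', B'')."""
    concepts = set()
    for k in range(len(ATTRIBUTES) + 1):
        for attrs in combinations(ATTRIBUTES, k):
            extent = extent_of(attrs)
            concepts.add((extent, intent_of(extent)))
    return concepts


def is_subconcept(c1, c2):
    """(A1, B1) <= (A2, B2) iff A1 is a subset of A2."""
    return c1[0] <= c2[0]


concepts = all_concepts()
top = max(concepts, key=lambda c: len(c[0]))
bottom = min(concepts, key=lambda c: len(c[0]))
print(len(concepts))  # 10
```

Every concept lies between `bottom` (empty extent, all attributes) and `top` (all objects, empty intent), which matches the extremes of the diagram.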

Page 13: Similarity Measures in Formal Concept Analysis

Formal Concept Analysis

Number of concepts

- Given K = (G, M, I), in the worst case $|\mathcal{B}| \in O(2^{\min\{|G|, |M|\}})$
- Empirical evidence with real-world datasets suggests this is more like $O(\max\{|G|, |M|\}^2)$
- Examples:
  - Genes-GO terms (1910 × 3885): > 1,000,000 concepts
  - Genes-PhenoTypes (1910 × 3965): > 1,000,000 concepts
  - Mushrooms (8124 × 120): > 200,000 concepts
- Constraints are typically imposed to make enumeration tractable; however, parameter selection is problematic
- Segmenting and clustering of concepts should be explored
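The exponential worst case is easy to reproduce on a small scale. A standard construction (not taken from the slides) is the context in which object i has every attribute except attribute i; every attribute subset is then closed, giving 2^n concepts for n objects and n attributes. The sketch below counts concepts by brute force, with illustrative function names.

```python
from itertools import combinations


def count_concepts(rows, attributes):
    """Count concepts by collecting the distinct closed attribute sets (intents)."""
    intents = set()
    for k in range(len(attributes) + 1):
        for attrs in combinations(attributes, k):
            # Objects that have all attributes in attrs.
            extent = [g for g, row in rows.items() if set(attrs) <= row]
            # Attributes shared by all of those objects.
            closure = set(attributes)
            for g in extent:
                closure &= rows[g]
            intents.add(frozenset(closure))
    return len(intents)


def worst_case_context(n):
    """Object i has every attribute except attribute i."""
    attributes = list(range(n))
    rows = {i: set(attributes) - {i} for i in range(n)}
    return rows, attributes


for n in range(1, 7):
    rows, attributes = worst_case_context(n)
    print(n, count_concepts(rows, attributes))
```

Each printed count equals 2 to the power n, so the number of concepts doubles with every added object and attribute in this family.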

Page 18: Similarity Measures in Formal Concept Analysis

Similarity Measures

Formal Definition

Definition
A similarity measure S is a function with non-negative real values defined on the Cartesian product X × X of a set X,

S : X × X → ℝ    (1)

such that the following three properties are satisfied:

1. ∃ s0 ∈ ℝ : −∞ < S(x, y) ≤ s0 < +∞, ∀x, y ∈ X
2. S(x, x) = s0, ∀x ∈ X
3. S(x, y) = S(y, x), ∀x, y ∈ X

Page 19: Similarity Measures in Formal Concept Analysis

Similarity Measures

Weighted Concept Similarity

Set-inspired similarity measures

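Jaccard index

$$S_{Jac}(x, y) = \frac{|x \cap y|}{|x \cup y|} \tag{2}$$

Sørensen coefficient

$$S_{Sor}(x, y) = \frac{2\,|x \cap y|}{|x| + |y|} \tag{3}$$

Symmetric difference

$$S_{Xor}(x, y) = 1 - \frac{|x \,\triangle\, y|}{|x \cup y|} \tag{4}$$

Combine set-based similarity measures to form a concept similarity measure for C1 = (A1, B1) and C2 = (A2, B2):

$$S^{w}_{S}(C_1, C_2) = w \cdot S(A_1, A_2) + (1 - w) \cdot S(B_1, B_2) \tag{5}$$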
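Equations (2) to (5) translate almost directly into code. The following is a minimal sketch with hypothetical function names; a concept is represented as an (extent, intent) pair of frozensets. Returning 1.0 when both sets are empty is an assumption made here to avoid division by zero, since the slides do not address that case.

```python
def jaccard(x, y):
    """Equation (2): |x ∩ y| / |x ∪ y|."""
    union = x | y
    return len(x & y) / len(union) if union else 1.0  # empty case: assumption


def sorensen(x, y):
    """Equation (3): 2 |x ∩ y| / (|x| + |y|)."""
    total = len(x) + len(y)
    return 2 * len(x & y) / total if total else 1.0  # empty case: assumption


def symmetric_difference(x, y):
    """Equation (4): 1 - |x Δ y| / |x ∪ y|."""
    union = x | y
    return 1 - len(x ^ y) / len(union) if union else 1.0  # empty case: assumption


def weighted_similarity(c1, c2, measure, w=0.5):
    """Equation (5): w * S(A1, A2) + (1 - w) * S(B1, B2)."""
    (a1, b1), (a2, b2) = c1, c2
    return w * measure(a1, a2) + (1 - w) * measure(b1, b2)


# Two concepts from the example lattice.
c1 = (frozenset({"g5", "g7"}), frozenset({"m1", "m2"}))
c2 = (frozenset({"g4", "g5", "g7"}), frozenset({"m1"}))

for measure in (jaccard, sorensen, symmetric_difference):
    print(measure.__name__, weighted_similarity(c1, c2, measure, w=0.5))
```

Passing the set measure as a parameter mirrors the notation of equation (5), where the subscript S names the underlying set-based measure and w balances extents against intents.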

Page 21: Similarity Measures in Formal Concept Analysis

Similarity Measures

Formal Proof

Proofs depend on the fact that S is a set-based similarity measure.

Proof.
- Property 1: by the properties of set union and set intersection, $S_{Jac}(x, y) \le 1$ for all x, y; thus, by the definition of weighted concept similarity, s0 = 1.
- Property 2 is trivially satisfied by the fact that $S_{Jac}$ is a similarity measure; thus $S_{Jac}(x, x) = 1$ and therefore

$$S^{w}_{Jac}(C_1, C_1) = w \cdot 1 + (1 - w) \cdot 1 = 1 \quad \forall C_1 \in \mathcal{B}(G, M, I)$$

- Property 3 is also satisfied by the fact that $S_{Jac}$ is a similarity measure, so $S_{Jac}(x, y) = S_{Jac}(y, x)$.

Page 22: Similarity Measures in Formal Concept Analysis

Similarity Measures

Weighted Concept Similarity

- Well-established similarity measures, easy to compute
- Set intersection, union, and difference of any two sets x, y can be computed in O(min{|x|, |y|})
- O(min({|A1|, |B1|, |A2|, |B2|})) for any given pair of concepts (A1, B1) and (A2, B2)
- Drawback is selecting w

Page 23: Similarity Measures in Formal Concept Analysis

Similarity Measures

Drawbacks of Weighted Concept Similarity

[Figure: concept lattice and context table of the running example, as on the "Hierarchy of Concepts" and "Relation to other theories" slides]

- Measures consider only set cardinalities, not the information in the underlying context
- C1 = ({g5}, {m1, m2, m3}), C2 = ({g2, g5, g6}, {m3}), and C3 = ({g4, g5, g7}, {m1})

$$S^{0.5}_{Jac}(C_1, C_2) = S^{0.5}_{Jac}(C_1, C_3) = 0.333$$

$$S^{0.5}_{Sor}(C_1, C_2) = S^{0.5}_{Sor}(C_1, C_3) = 0.5$$

- Intuitively, the similarity between C1 and C3 should be greater than that between C1 and C2
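The equal scores can be checked by hand. A worked sketch, using only the three concepts listed above and the weighted Jaccard and Sørensen definitions with w = 0.5:

$$
\begin{aligned}
S^{0.5}_{Jac}(C_1, C_2) &= 0.5 \cdot \frac{|\{g_5\}|}{|\{g_2, g_5, g_6\}|} + 0.5 \cdot \frac{|\{m_3\}|}{|\{m_1, m_2, m_3\}|} = 0.5 \cdot \tfrac{1}{3} + 0.5 \cdot \tfrac{1}{3} = \tfrac{1}{3} \\
S^{0.5}_{Jac}(C_1, C_3) &= 0.5 \cdot \frac{|\{g_5\}|}{|\{g_4, g_5, g_7\}|} + 0.5 \cdot \frac{|\{m_1\}|}{|\{m_1, m_2, m_3\}|} = 0.5 \cdot \tfrac{1}{3} + 0.5 \cdot \tfrac{1}{3} = \tfrac{1}{3} \\
S^{0.5}_{Sor}(C_1, C_2) &= 0.5 \cdot \frac{2 \cdot 1}{1 + 3} + 0.5 \cdot \frac{2 \cdot 1}{3 + 1} = \tfrac{1}{2} \\
S^{0.5}_{Sor}(C_1, C_3) &= 0.5 \cdot \frac{2 \cdot 1}{1 + 3} + 0.5 \cdot \frac{2 \cdot 1}{3 + 1} = \tfrac{1}{2}
\end{aligned}
$$

In both pairs every intersection has size 1 and every union has size 3, so measures that see only these cardinalities cannot tell the pairs apart.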

Page 25: Similarity Measures in Formal Concept Analysis

Similarity Measures

Zeros Induced Similarity

- View concepts as maximal sub-matrices of 1s
- Combining any two concepts must result in the introduction of zeros
- Think of similarity as the number of zeros introduced by combining two concepts

Page 26: Similarity Measures in Formal Concept Analysis

Similarity Measures

Zeros Induced Similarity

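Given C1 = (A1, B1) and C2 = (A2, B2), then

$$z(C_1, C_2) = \sum_{a \in A_1 \cup A_2} \left| (B_1 \cup B_2) \setminus a' \right| \tag{6}$$

Definition
Given concepts C1 = (A1, B1) and C2 = (A2, B2), the zeros-induced index is

$$S_z(C_1, C_2) = \frac{|A_1 \cup A_2| \cdot |B_1 \cup B_2| - z(C_1, C_2)}{|A_1 \cup A_2| \cdot |B_1 \cup B_2|} \tag{7}$$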
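A direct implementation of equations (6) and (7) might look like the sketch below. The function names are illustrative, the context is the seven-object example used throughout the slides, and exact fractions are used so the results can be compared with hand calculations. Returning 1 when the combined sub-matrix is empty is an assumption; the slides do not cover that case.

```python
from fractions import Fraction

# Example context: each object maps to the set of attributes it has (a').
CONTEXT = {
    "g1": {"m2", "m4"},
    "g2": {"m3", "m4"},
    "g3": {"m4"},
    "g4": {"m1"},
    "g5": {"m1", "m2", "m3"},
    "g6": {"m3"},
    "g7": {"m1", "m2"},
}


def zeros_induced(c1, c2):
    """Equation (6): zeros inside the sub-matrix (A1 ∪ A2) x (B1 ∪ B2)."""
    (a1, b1), (a2, b2) = c1, c2
    attributes = b1 | b2
    return sum(len(attributes - CONTEXT[a]) for a in a1 | a2)


def zeros_induced_index(c1, c2):
    """Equation (7): fraction of the combined sub-matrix that is filled with 1s."""
    (a1, b1), (a2, b2) = c1, c2
    area = len(a1 | a2) * len(b1 | b2)
    if area == 0:
        return Fraction(1)  # assumption: an empty sub-matrix introduces no zeros
    return Fraction(area - zeros_induced(c1, c2), area)


C1 = (frozenset({"g5"}), frozenset({"m1", "m2", "m3"}))
C2 = (frozenset({"g2", "g5", "g6"}), frozenset({"m3"}))
C3 = (frozenset({"g4", "g5", "g7"}), frozenset({"m1"}))

print(zeros_induced_index(C1, C2))  # 5/9
print(zeros_induced_index(C1, C3))  # 2/3
```

Unlike the weighted measures, this index needs the context itself, not just the two concepts, because each term of the sum looks up the attribute set of an object.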

Page 27: Similarity Measures in Formal Concept Analysis

Similarity Measures

Formal Proof

Proof.
- Property 1: for any two sets x, y, x \ y ⊆ x; thus $z(C_1, C_2) \le |A_1 \cup A_2| \cdot |B_1 \cup B_2|$ for all C1, C2, implying that s0 = 1.
- Property 2: for any concept C = (A, B), by definition A′ = B, which implies

$$\forall a \in A:\; a' \supseteq B \;\Rightarrow\; z(C, C) = 0 \;\Rightarrow\; S_z(C, C) = s_0$$

- Property 3 is guaranteed by the commutative property of set union.

Page 28: Similarity Measures in Formal Concept Analysis

Similarity Measures

Zeros Induced Similarity

C1 = ({g5}, {m1, m2, m3}), C2 = ({g2, g5, g6}, {m3}), and C3 = ({g4, g5, g7}, {m1})

$$S_z(C_1, C_2) = \frac{9 - 4}{9} = \frac{5}{9} \quad\text{and}\quad S_z(C_1, C_3) = \frac{9 - 3}{9} = \frac{2}{3}$$

Direct implementation of computing zeros has complexity $O(\max\{|A_1|, |B_1|, |A_2|, |B_2|\}^2)$
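Both combined sub-matrices have 3 rows and 3 columns (attributes m1, m2, m3), which gives the 9 in the denominators. The zero counts follow row by row from the context table; a worked sketch, where each term is the set of attributes from {m1, m2, m3} that the labeled object lacks:

$$
\begin{aligned}
z(C_1, C_2) &= \underbrace{|\{m_1, m_2\}|}_{g_2} + \underbrace{|\emptyset|}_{g_5} + \underbrace{|\{m_1, m_2\}|}_{g_6} = 2 + 0 + 2 = 4 \\
z(C_1, C_3) &= \underbrace{|\{m_2, m_3\}|}_{g_4} + \underbrace{|\emptyset|}_{g_5} + \underbrace{|\{m_3\}|}_{g_7} = 2 + 0 + 1 = 3
\end{aligned}
$$

Combining C1 with C3 introduces fewer zeros than combining it with C2, so the zeros-induced index ranks C3 as the closer concept.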

Page 31: Similarity Measures in Formal Concept Analysis

Experiments

Datasets and Method

- Real-world, labeled datasets
- Enumerate concepts and compute the similarity matrix
- Use the similarity matrix with an agglomerative clustering algorithm
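The last two steps can be sketched as follows. The slides do not say which linkage criterion or stopping rule was used, so average linkage and a fixed target number of clusters are assumptions here, as are all function names. For brevity the example clusters a few extents from the running example with a plain Jaccard index; any of the concept similarity measures could be passed in instead.

```python
def similarity_matrix(items, similarity):
    """Pairwise similarities as a list of lists."""
    return [[similarity(x, y) for y in items] for x in items]


def agglomerative(sim, num_clusters):
    """Merge the two most similar clusters until num_clusters remain.

    Cluster-to-cluster similarity is the average pairwise similarity
    (average linkage); this choice is an assumption of the sketch.
    """
    clusters = [[i] for i in range(len(sim))]

    def linkage(c1, c2):
        return sum(sim[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > max(num_clusters, 1):
        pairs = [(p, q) for p in range(len(clusters))
                 for q in range(p + 1, len(clusters))]
        p, q = max(pairs, key=lambda pq: linkage(clusters[pq[0]], clusters[pq[1]]))
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]  # q > p, so index p is unaffected
    return clusters


def jaccard(x, y):
    return len(x & y) / len(x | y)


# Extents of six concepts from the example lattice.
extents = [
    {"g5"}, {"g5", "g7"}, {"g4", "g5", "g7"},
    {"g1", "g2", "g3"}, {"g2"}, {"g1"},
]
sim = similarity_matrix(extents, jaccard)
clusters = agglomerative(sim, 2)
print(clusters)
```

With two target clusters, the extents built around g5 end up in one group and those drawn from g1, g2, g3 in the other. The returned clusters hold indices into the input list, so they can be mapped back to concepts or labels.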

Page 32: Similarity Measures in Formal Concept Analysis

Experiments

Datasets

Name          Dimensions     Density   Num. classes
Congress      435 × 48       0.33      2
Mushrooms     8124 × 120     0.1917    2
news_mer      2000 × 892     0.003     2
news_pcr      1997 × 1025    0.0026    2
news_allrec   3124 × 1671    0.0014    4

Page 33: Similarity Measures in Formal Concept Analysis

Experiments

Evaluation Measures

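$$\mathrm{MultPrec}(e, e') = \frac{\min\left(|C(e) \cap C(e')|,\; |L(e) \cap L(e')|\right)}{|C(e) \cap C(e')|}$$

$$\mathrm{MultRcl}(e, e') = \frac{\min\left(|C(e) \cap C(e')|,\; |L(e) \cap L(e')|\right)}{|L(e) \cap L(e')|}$$

$$\mathrm{B^3Prec} = \mathrm{Avg}_{e}\Big[\mathrm{Avg}_{e' :\, C(e) \cap C(e') \neq \emptyset}\big[\mathrm{MultPrec}(e, e')\big]\Big]$$

$$\mathrm{B^3Rcl} = \mathrm{Avg}_{e}\Big[\mathrm{Avg}_{e' :\, L(e) \cap L(e') \neq \emptyset}\big[\mathrm{MultRcl}(e, e')\big]\Big]$$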
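These measures can be computed directly from two mappings. The sketch below assumes, as is usual for the extended BCubed metrics for overlapping clusterings, that C(e) is the set of clusters containing item e and L(e) is its set of class labels; the slides do not define the symbols, so this reading and the function names are assumptions. Items with no cluster (or no label) are skipped in the corresponding average.

```python
def multiplicity_precision(e1, e2, clusters, labels):
    """MultPrec: requires e1 and e2 to share at least one cluster."""
    shared_clusters = len(clusters[e1] & clusters[e2])
    shared_labels = len(labels[e1] & labels[e2])
    return min(shared_clusters, shared_labels) / shared_clusters


def multiplicity_recall(e1, e2, clusters, labels):
    """MultRcl: requires e1 and e2 to share at least one label."""
    shared_clusters = len(clusters[e1] & clusters[e2])
    shared_labels = len(labels[e1] & labels[e2])
    return min(shared_clusters, shared_labels) / shared_labels


def bcubed(clusters, labels):
    """Return (B3 precision, B3 recall) averaged over all items."""
    items = list(clusters)
    precisions, recalls = [], []
    for e1 in items:
        p = [multiplicity_precision(e1, e2, clusters, labels)
             for e2 in items if clusters[e1] & clusters[e2]]
        r = [multiplicity_recall(e1, e2, clusters, labels)
             for e2 in items if labels[e1] & labels[e2]]
        if p:
            precisions.append(sum(p) / len(p))
        if r:
            recalls.append(sum(r) / len(r))
    precision = sum(precisions) / len(precisions) if precisions else 0.0
    recall = sum(recalls) / len(recalls) if recalls else 0.0
    return precision, recall


# Hypothetical toy data: four items, two clusters, two class labels.
clusters = {"a": {0}, "b": {0}, "c": {1}, "d": {1}}
labels = {"a": {"x"}, "b": {"x"}, "c": {"x"}, "d": {"y"}}
precision, recall = bcubed(clusters, labels)
print(precision, recall)
```

In the toy data, cluster 1 mixes labels x and y, which lowers precision, while label x is split across both clusters, which lowers recall.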

Page 35: Similarity Measures in Formal Concept Analysis

Experiments

Experimental Results

Page 36: Similarity Measures in Formal Concept Analysis

Experiments

Similarity Matrices

Page 37: Similarity Measures in Formal Concept Analysis

Experiments

Computation Times

Dataset       Similarity Measure    CPU Time (seconds)
Mushrooms     Weighted Jaccard      545.23 ± 3.45
Mushrooms     Weighted Sørensen     300.35 ± 1.64
Mushrooms     Weighted SymmDiff     961.62 ± 2.13
Mushrooms     Zeros Induced         4125.22 ± 3.76
Congress      Weighted Jaccard      522.24 ± 4.2204
Congress      Weighted Sørensen     289.89 ± 0.69
Congress      Weighted SymmDiff     885.89 ± 2.77
Congress      Zeros Induced         3233.54 ± 3.45
news_allrec   Weighted Jaccard      3.9170 ± 0.0440
news_allrec   Weighted Sørensen     2.6630 ± 0.0517
news_allrec   Weighted SymmDiff     6.1900 ± 0.0474
news_allrec   Zeros Induced         8.2050 ± 0.1203
news_mer      Weighted Jaccard      0.7700 ± 0.0067
news_mer      Weighted Sørensen     0.5100 ± 0.0176
news_mer      Weighted SymmDiff     1.2270 ± 0.0134
news_mer      Zeros Induced         1.9720 ± 0.0225
news_pcr      Weighted Jaccard      0.7680 ± 0.0092
news_pcr      Weighted Sørensen     0.5040 ± 0.0158
news_pcr      Weighted SymmDiff     1.2280 ± 0.0235
news_pcr      Zeros Induced         1.8530 ± 0.0183

Page 38: Similarity Measures in Formal Concept Analysis

Experiments

Conclusion

- First steps towards clustering formal concepts
- The zeros-induced measure requires no parameters
- Initial experiments indicate superiority of the zeros-induced measure on clustering sparse data
- Future work should incorporate the lattice structure explicitly
