Similarity Measures in Formal Concept Analysis
Faris Alqadah, Raj Bhatnagar
Computer Science Department, University of Cincinnati
Faris Alqadah Raj Bhatnagar ( Computer Science Department University of Cincinnati )Similarity Measures in Formal Concept Analysis 1 / 28
Outline
1 Introduction
2 Formal Concept Analysis
3 Similarity Measures
4 Experiments
Introduction
Motivation
Formal Concept Analysis (FCA) has been studied and applied successfully in many diverse fields
Data mining, conceptual modeling, software engineering, and social networking
Possible drawback: large number of concepts
Essential to develop formalisms to segment, cluster, and categorize concepts
Related Work
Few studies have focused on similarity measures for formal concepts
Ad-hoc approaches based on applications (Y. Ding 2002) (Blachon and Gandrillon 2007); no formal study of similarity
Similarity in fuzzy concepts addressed by Belohlavek
Concept similarity in ontologies encompasses string similarity measures and external sources of data such as dictionaries
Formal Concept Analysis
Concepts
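Context $K = (G, M, I)$: $G$ is a set of objects, $M$ a set of attributes, and $I \subseteq G \times M$ the incidence relation
For $A \subseteq G$, $A'$ is defined as $A' = \{m \in M \mid a \mathrel{I} m \ \forall a \in A\}$
Dually, for $B \subseteq M$, $B' = \{g \in G \mid g \mathrel{I} m \ \forall m \in B\}$
Definition. A formal concept of $K$ is a pair $(A, B)$ with $A \subseteq G$ and $B \subseteq M$ such that $A' = B$ and $B' = A$. The set of all concepts is denoted $\mathcal{B}(G, M, I)$, or $\mathcal{B}$ for short.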
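The two derivation operators and the concept test can be sketched in a few lines. This is a minimal illustration, not the authors' implementation; the names `CONTEXT`, `common_attributes`, `common_objects`, and `is_concept` are hypothetical, and the data is the 7 x 4 example context from the "Relation to other theories" slide.

```python
# Example context: each object maps to the set of attributes it has
# (the g1..g7 / m1..m4 example used throughout these slides).
CONTEXT = {
    "g1": {"m2", "m4"},
    "g2": {"m3", "m4"},
    "g3": {"m4"},
    "g4": {"m1"},
    "g5": {"m1", "m2", "m3"},
    "g6": {"m3"},
    "g7": {"m1", "m2"},
}
ATTRIBUTES = {"m1", "m2", "m3", "m4"}


def common_attributes(objects):
    """A': the attributes shared by every object in `objects`."""
    result = set(ATTRIBUTES)
    for g in objects:
        result &= CONTEXT[g]
    return result


def common_objects(attributes):
    """B': the objects that have every attribute in `attributes`."""
    wanted = set(attributes)
    return {g for g, attrs in CONTEXT.items() if wanted <= attrs}


def is_concept(extent, intent):
    """(A, B) is a formal concept iff A' = B and B' = A."""
    return (common_attributes(extent) == set(intent)
            and common_objects(intent) == set(extent))
```

For instance, `is_concept({"g5", "g7"}, {"m1", "m2"})` holds, while `({"g5"}, {"m1", "m2"})` is not a concept because g5 also has m3.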
Relation to other theories
      m1  m2  m3  m4
g1     0   1   0   1
g2     0   0   1   1
g3     0   0   0   1
g4     1   0   0   0
g5     1   1   1   0
g6     0   0   1   0
g7     1   1   0   0

Concepts are maximal rectangles of 1s under a suitable permutation
Bi-clusters, co-clusters: maximal sub-matrices with zero variance
Maximal bicliques in a bipartite graph
Hierarchy of Concepts
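Nodes of the concept lattice for the example context, written as (extent, intent):
(∅, {m1, m2, m3, m4})
({g5}, {m1, m2, m3})
({g1}, {m2, m4})
({g2}, {m3, m4})
({g5, g7}, {m1, m2})
({g2, g5, g6}, {m3})
({g1, g2, g3}, {m4})
({g4, g5, g7}, {m1})
({g1, g5, g7}, {m2})
({g1, g2, g3, g4, g5, g6, g7}, ∅)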
Concepts of a context form a natural hierarchical structure
Theorem. The formal concepts of a context, ordered by the subset relation on their extents, form a complete lattice.
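To make the hierarchy concrete, the sketch below enumerates every concept of the small example context by closing each attribute subset, then compares concepts by extent inclusion. It is a brute-force illustration under stated assumptions (exponential in the number of attributes, so only suitable for toy contexts); the helper names are hypothetical and this is not the enumeration algorithm used by the authors.

```python
from itertools import combinations

# Example context from the slides: object -> set of attributes.
CONTEXT = {
    "g1": {"m2", "m4"}, "g2": {"m3", "m4"}, "g3": {"m4"}, "g4": {"m1"},
    "g5": {"m1", "m2", "m3"}, "g6": {"m3"}, "g7": {"m1", "m2"},
}
ATTRIBUTES = ["m1", "m2", "m3", "m4"]


def extent_of(attributes):
    """Objects that have all the given attributes."""
    wanted = set(attributes)
    return frozenset(g for g, attrs in CONTEXT.items() if wanted <= attrs)


def intent_of(objects):
    """Attributes shared by all the given objects."""
    shared = set(ATTRIBUTES)
    for g in objects:
        shared &= CONTEXT[g]
    return frozenset(shared)


def all_concepts():
    """Close every attribute subset; each closure yields one concept."""
    concepts = set()
    for k in range(len(ATTRIBUTES) + 1):
        for attrs in combinations(ATTRIBUTES, k):
            extent = extent_of(attrs)
            concepts.add((extent, intent_of(extent)))
    return concepts


def is_subconcept(c1, c2):
    """c1 <= c2 in the lattice order: the extent of c1 is contained in that of c2."""
    return c1[0] <= c2[0]
```

On the example context this yields the ten concepts listed on the slide, with `({g5}, {m1, m2, m3})` lying below `({g5, g7}, {m1, m2})`.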
Number of concepts
Given $K = (G, M, I)$, in the worst case $|\mathcal{B}| \in O(2^{\min\{|G|, |M|\}})$
Empirical evidence with real-world datasets suggests this is more like $O(\max\{|G|, |M|\}^2)$
Examples:
Genes-GO terms ($1910 \times 3885$): > 1,000,000 concepts
Genes-PhenoTypes ($1910 \times 3965$): > 1,000,000 concepts
Mushrooms ($8124 \times 120$): > 200,000 concepts
Constraints are typically imposed to make enumeration tractable; however, parameter selection is problematic
Segmenting and clustering of concepts should be explored
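The exponential worst case can be observed on a standard construction that is not part of the slides: an n x n context in which object i has every attribute except attribute i. Every attribute subset is then closed, so there are 2^n concepts. The sketch below counts concepts by brute force; `contranominal` and `count_concepts` are hypothetical helper names, and the approach is only feasible for very small n.

```python
from itertools import combinations


def contranominal(n):
    """n x n context where object i has every attribute except i (assumed worst-case example)."""
    attributes = list(range(n))
    context = {g: {m for m in attributes if m != g} for g in range(n)}
    return context, attributes


def count_concepts(context, attributes):
    """Count concepts by closing every attribute subset (brute force, tiny inputs only)."""
    intents = set()
    for k in range(len(attributes) + 1):
        for attrs in combinations(attributes, k):
            wanted = set(attrs)
            extent = [g for g, has in context.items() if wanted <= has]
            closed = set(attributes)
            for g in extent:
                closed &= context[g]
            intents.add(frozenset(closed))
    return len(intents)
```

Counting with `count_concepts(*contranominal(n))` gives 2, 4, 8, 16 for n = 1, 2, 3, 4, matching the bound with |G| = |M| = n.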
Similarity Measures
Formal Definition
Definition. A similarity measure $S$ is a function with non-negative real values defined on the Cartesian product $X \times X$ of a set $X$,
$$S : X \times X \to \mathbb{R} \tag{1}$$
such that the following three properties are satisfied:
1 $\exists s_0 \in \mathbb{R} : -\infty < S(x, y) \le s_0 < +\infty, \ \forall x, y \in X$
2 $S(x, x) = s_0, \ \forall x \in X$
3 $S(x, y) = S(y, x), \ \forall x, y \in X$
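The three properties can be spot-checked numerically on a finite sample. The sketch below is only a sanity check, not a proof; the function names are hypothetical, and treating two empty sets as identical (Jaccard value 1) is an assumed convention.

```python
from itertools import product


def jaccard(x, y):
    """Jaccard index of two sets; two empty sets are treated as identical (assumed convention)."""
    union = x | y
    return len(x & y) / len(union) if union else 1.0


def satisfies_similarity_properties(sim, items, s0, tol=1e-12):
    """Check properties 1-3 of the definition on the finite sample `items`."""
    for x, y in product(items, repeat=2):
        value = sim(x, y)
        if value > s0 + tol:                  # property 1: bounded above by s0
            return False
        if abs(value - sim(y, x)) > tol:      # property 3: symmetry
            return False
    # property 2: every element attains s0 when compared with itself
    return all(abs(sim(x, x) - s0) <= tol for x in items)
```

With a sample such as `[frozenset("ab"), frozenset("bc"), frozenset()]`, the Jaccard index passes with `s0 = 1.0`, whereas a function like `len(x - y)` fails for `s0 = 0`.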
Weighted Concept Similarity
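Set-inspired similarity measures
Jaccard index: $$S_{Jac}(x, y) = \frac{|x \cap y|}{|x \cup y|} \tag{2}$$
Sørensen coefficient: $$S_{Sor}(x, y) = \frac{2\,|x \cap y|}{|x| + |y|} \tag{3}$$
Symmetric difference: $$S_{Xor}(x, y) = 1 - \frac{|x \,\triangle\, y|}{|x \cup y|} \tag{4}$$
Combine set-based similarity measures to form a concept similarity measure: for concepts $C_1 = (A_1, B_1)$ and $C_2 = (A_2, B_2)$,
$$S^w_S(C_1, C_2) = w \cdot S(A_1, A_2) + (1 - w) \cdot S(B_1, B_2) \tag{5}$$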
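Equations (2) to (5) translate directly into set operations. The following is a minimal sketch, assuming concepts are represented as (extent, intent) pairs of Python sets; the function names are hypothetical, and returning 1.0 for two empty sets is an assumed convention that the slides do not address.

```python
def jaccard(x, y):
    """Equation (2): |x & y| / |x | y|."""
    union = len(x | y)
    return len(x & y) / union if union else 1.0


def sorensen(x, y):
    """Equation (3): 2 |x & y| / (|x| + |y|)."""
    total = len(x) + len(y)
    return 2 * len(x & y) / total if total else 1.0


def symmetric_difference_similarity(x, y):
    """Equation (4): 1 - |x ^ y| / |x | y|."""
    union = len(x | y)
    return 1 - len(x ^ y) / union if union else 1.0


def weighted_concept_similarity(c1, c2, set_similarity, w=0.5):
    """Equation (5): weight w on the extents, (1 - w) on the intents."""
    (a1, b1), (a2, b2) = c1, c2
    return w * set_similarity(a1, a2) + (1 - w) * set_similarity(b1, b2)
```

For example, with `c1 = ({"g5"}, {"m1", "m2", "m3"})` and `c2 = ({"g2", "g5", "g6"}, {"m3"})`, calling `weighted_concept_similarity(c1, c2, jaccard)` weights extent and intent overlap equally.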
Formal Proof
Proofs depend on the fact that $S$ is a set-based similarity measure; the case $S = S_{Jac}$ is shown
Proof.
Property 1: by the properties of set union and set intersection, $S_{Jac}(x, y) \le 1 \ \forall x, y$. For a weight $0 \le w \le 1$ the weighted concept similarity is a convex combination of two such values, thus $s_0 = 1$.
Property 2 is trivially satisfied by the fact that $S_{Jac}$ is a similarity measure, thus $S_{Jac}(x, x) = 1$ and therefore
$$S^w_{Jac}(C_1, C_1) = w \cdot 1 + (1 - w) \cdot 1 = 1 \quad \forall C_1 \in \mathcal{B}(G, M, I)$$
Property 3 is also satisfied by the fact that $S_{Jac}$ is a similarity measure, so $S_{Jac}(x, y) = S_{Jac}(y, x)$
Weighted Concept Similarity
Well-established similarity measures, easy to compute
Set intersection, union, and difference of any two sets $x, y$ can be computed in $O(\min\{|x|, |y|\})$
$O(\min\{|A_1|, |B_1|, |A_2|, |B_2|\})$ for any given pair of concepts $(A_1, B_1)$ and $(A_2, B_2)$
The drawback is selecting $w$
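One way to reach a cost proportional to the smaller set, at least for the cardinalities that the measures need, is to iterate over the smaller set and probe the larger one. The sketch below assumes hash-based sets with expected constant-time membership tests; the function names are hypothetical, and the slides do not say how the authors implemented these operations.

```python
def intersection_size(x, y):
    """|x & y| by scanning the smaller set and probing the larger one."""
    small, large = (x, y) if len(x) <= len(y) else (y, x)
    return sum(1 for item in small if item in large)


def union_size(x, y):
    """|x | y| from inclusion-exclusion, without building the union."""
    return len(x) + len(y) - intersection_size(x, y)


def symmetric_difference_size(x, y):
    """|x ^ y|: elements that belong to exactly one of the two sets."""
    return len(x) + len(y) - 2 * intersection_size(x, y)
```

Only the sizes are produced here; materializing the union itself would still take time proportional to its size.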
Drawbacks of Weighted Concept Similarity
(Figure: the example context and its concept lattice, repeated from the "Relation to other theories" and "Hierarchy of Concepts" slides.)
Measures only consider set cardinalities and not information
$C_1 = (\{g_5\}, \{m_1, m_2, m_3\})$, $C_2 = (\{g_2, g_5, g_6\}, \{m_3\})$, and $C_3 = (\{g_4, g_5, g_7\}, \{m_1\})$
$S^{0.5}_{Jac}(C_1, C_2) = S^{0.5}_{Jac}(C_1, C_3) = 0.333$
$S^{0.5}_{Sor}(C_1, C_2) = S^{0.5}_{Sor}(C_1, C_3) = 0.5$
Intuitively, the similarity between $C_1$ and $C_3$ should be greater than that between $C_1$ and $C_2$
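The tie can be traced term by term. The following worked computation, with $w = 0.5$, only expands the definitions of the Jaccard index, the Sørensen coefficient, and the weighted concept similarity for the three concepts above.

```latex
\begin{align*}
S^{0.5}_{Jac}(C_1, C_2) &= 0.5 \cdot \frac{|\{g_5\}|}{|\{g_2, g_5, g_6\}|}
                         + 0.5 \cdot \frac{|\{m_3\}|}{|\{m_1, m_2, m_3\}|}
                         = 0.5 \cdot \tfrac{1}{3} + 0.5 \cdot \tfrac{1}{3} = \tfrac{1}{3} \\
S^{0.5}_{Jac}(C_1, C_3) &= 0.5 \cdot \frac{|\{g_5\}|}{|\{g_4, g_5, g_7\}|}
                         + 0.5 \cdot \frac{|\{m_1\}|}{|\{m_1, m_2, m_3\}|}
                         = 0.5 \cdot \tfrac{1}{3} + 0.5 \cdot \tfrac{1}{3} = \tfrac{1}{3} \\
S^{0.5}_{Sor}(C_1, C_2) &= 0.5 \cdot \frac{2 \cdot 1}{1 + 3} + 0.5 \cdot \frac{2 \cdot 1}{3 + 1} = 0.5 \\
S^{0.5}_{Sor}(C_1, C_3) &= 0.5 \cdot \frac{2 \cdot 1}{1 + 3} + 0.5 \cdot \frac{2 \cdot 1}{3 + 1} = 0.5
\end{align*}
```

Both pairs share exactly one object and one attribute and have the same set sizes, so any measure built only from these cardinalities cannot tell them apart.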
Zeros Induced Similarity
View concepts as maximal sub-matrices of 1s
Combining any two concepts must result in the introduction of zeros
Think of similarity as the number of zeros introduced by combining two concepts
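Given $C_1 = (A_1, B_1)$ and $C_2 = (A_2, B_2)$, then
$$z(C_1, C_2) = \sum_{a \in A_1 \cup A_2} |(B_1 \cup B_2) \setminus a'| \tag{6}$$
Definition. Given concepts $C_1 = (A_1, B_1)$ and $C_2 = (A_2, B_2)$, the zeros-induced index is
$$S_z(C_1, C_2) = \frac{|A_1 \cup A_2| \cdot |B_1 \cup B_2| - z(C_1, C_2)}{|A_1 \cup A_2| \cdot |B_1 \cup B_2|} \tag{7}$$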
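Equations (6) and (7) can be sketched directly over the example context. This is a direct, unoptimized illustration with hypothetical names; it uses exact fractions for readability, and it returns 1 when the combined sub-matrix is empty, which is an assumed convention for a case the definition leaves open.

```python
from fractions import Fraction

# Example context from the slides: object -> set of attributes (a' for each object a).
CONTEXT = {
    "g1": {"m2", "m4"}, "g2": {"m3", "m4"}, "g3": {"m4"}, "g4": {"m1"},
    "g5": {"m1", "m2", "m3"}, "g6": {"m3"}, "g7": {"m1", "m2"},
}


def zeros_induced(c1, c2, context=CONTEXT):
    """Equation (6): zeros inside the sub-matrix (A1 | A2) x (B1 | B2)."""
    (a1, b1), (a2, b2) = c1, c2
    attributes = b1 | b2
    return sum(len(attributes - context[a]) for a in a1 | a2)


def zeros_induced_similarity(c1, c2, context=CONTEXT):
    """Equation (7): fraction of 1s in the combined sub-matrix."""
    (a1, b1), (a2, b2) = c1, c2
    area = len(a1 | a2) * len(b1 | b2)
    if area == 0:
        return Fraction(1)  # assumed convention: an empty sub-matrix introduces no zeros
    return Fraction(area - zeros_induced(c1, c2, context), area)
```

With `c1 = ({"g5"}, {"m1", "m2", "m3"})`, `c2 = ({"g2", "g5", "g6"}, {"m3"})`, and `c3 = ({"g4", "g5", "g7"}, {"m1"})`, the sketch gives 5/9 for `(c1, c2)` and 2/3 for `(c1, c3)`.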
Formal Proof
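Proof.
Property 1: for any two sets $x, y$, $x \setminus y \subseteq x$, so each term of the sum in $z$ is at most $|B_1 \cup B_2|$; thus $z(C_1, C_2) \le |A_1 \cup A_2| \cdot |B_1 \cup B_2| \ \forall C_1, C_2$, implying that $s_0 = 1$.
Property 2: for any concept $C = (A, B)$, by definition $A' = B$, which implies
$$\forall a \in A: \ a' \supseteq B \ \Rightarrow \ z(C, C) = 0 \ \Rightarrow \ S_z(C, C) = s_0$$
Property 3 is guaranteed by the commutative property of set union.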
Zeros Induced Similarity
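$C_1 = (\{g_5\}, \{m_1, m_2, m_3\})$, $C_2 = (\{g_2, g_5, g_6\}, \{m_3\})$, and $C_3 = (\{g_4, g_5, g_7\}, \{m_1\})$
$$S_z(C_1, C_2) = \frac{9 - 4}{9} = \frac{5}{9}$$
and
$$S_z(C_1, C_3) = \frac{9 - 3}{9} = \frac{2}{3}$$
Direct implementation of computing zeros has complexity of $O(\max\{|A_1|, |B_1|, |A_2|, |B_2|\}^2)$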
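The zero counts 4 and 3 follow from the rows of the example context. In both cases the combined attribute set is $\{m_1, m_2, m_3\}$ and the combined sub-matrix has $3 \times 3 = 9$ cells; the worked expansion below uses $g_2' = \{m_3, m_4\}$, $g_4' = \{m_1\}$, $g_5' = \{m_1, m_2, m_3\}$, $g_6' = \{m_3\}$, and $g_7' = \{m_1, m_2\}$.

```latex
\begin{align*}
z(C_1, C_2) &= |\{m_1, m_2, m_3\} \setminus g_2'| + |\{m_1, m_2, m_3\} \setminus g_5'|
             + |\{m_1, m_2, m_3\} \setminus g_6'| = 2 + 0 + 2 = 4 \\
z(C_1, C_3) &= |\{m_1, m_2, m_3\} \setminus g_4'| + |\{m_1, m_2, m_3\} \setminus g_5'|
             + |\{m_1, m_2, m_3\} \setminus g_7'| = 2 + 0 + 1 = 3
\end{align*}
```

Merging $C_1$ with $C_3$ introduces fewer zeros because $g_7$ already shares two of the three attributes of $C_1$, which is the information the cardinality-based measures miss.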
Experiments
Datasets and Method
Real-world, labeled datasets
Enumerate concepts and compute the similarity matrix
Use the similarity matrix with an agglomerative clustering algorithm
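The last two steps can be sketched in plain Python. The slides do not state which linkage criterion or stopping rule was used, so average linkage and a fixed target number of clusters are assumptions here, and the function names are hypothetical.

```python
def similarity_matrix(items, similarity):
    """Pairwise similarity of all items as a nested list."""
    return [[similarity(a, b) for b in items] for a in items]


def agglomerative_clustering(matrix, num_clusters):
    """Repeatedly merge the two clusters with the highest average pairwise similarity."""
    clusters = [[i] for i in range(len(matrix))]

    def average_link(c1, c2):
        total = sum(matrix[i][j] for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > max(num_clusters, 1):
        best = None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                score = average_link(clusters[p], clusters[q])
                if best is None or score > best[0]:
                    best = (score, p, q)
        _, p, q = best
        clusters[p] = clusters[p] + clusters[q]  # q > p, so index p stays valid
        del clusters[q]
    return clusters
```

Any of the concept similarity measures discussed earlier could be passed as `similarity`; the returned clusters are lists of indices into `items`.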
Datasets
Name          Dimensions    Density   Num. classes
Congress      435 × 48      0.33      2
Mushrooms     8124 × 120    0.1917    2
news_mer      2000 × 892    0.003     2
news_pcr      1997 × 1025   0.0026    2
news_allrec   3124 × 1671   0.0014    4
Evaluation Measures
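$$\mathrm{MultPrec}(e, e') = \frac{\min(|C(e) \cap C(e')|, |L(e) \cap L(e')|)}{|C(e) \cap C(e')|}$$
$$\mathrm{MultRcl}(e, e') = \frac{\min(|C(e) \cap C(e')|, |L(e) \cap L(e')|)}{|L(e) \cap L(e')|}$$
$$\mathrm{B3Prec} = \mathrm{Avg}_e \left[ \mathrm{Avg}_{e' :\, C(e) \cap C(e') \ne \emptyset} \left[ \mathrm{MultPrec}(e, e') \right] \right]$$
$$\mathrm{B3Rcl} = \mathrm{Avg}_e \left[ \mathrm{Avg}_{e' :\, L(e) \cap L(e') \ne \emptyset} \left[ \mathrm{MultRcl}(e, e') \right] \right]$$
Here $C(e)$ is the set of clusters and $L(e)$ the set of class labels assigned to item $e$.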
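The four evaluation formulas can be sketched as follows, assuming each item carries a set of cluster identifiers C(e) and a set of class labels L(e). The dictionary-based representation and the function names are hypothetical; items that share no cluster (or no label) with any item, including themselves, are skipped in the outer average, which is an assumed convention.

```python
def mult_precision(e1, e2, clusters, labels):
    """MultPrec: defined for pairs that share at least one cluster."""
    shared_clusters = len(clusters[e1] & clusters[e2])
    shared_labels = len(labels[e1] & labels[e2])
    return min(shared_clusters, shared_labels) / shared_clusters


def mult_recall(e1, e2, clusters, labels):
    """MultRcl: defined for pairs that share at least one label."""
    shared_clusters = len(clusters[e1] & clusters[e2])
    shared_labels = len(labels[e1] & labels[e2])
    return min(shared_clusters, shared_labels) / shared_labels


def bcubed(clusters, labels):
    """Return (B3Prec, B3Rcl) for dicts mapping item -> set of clusters / labels."""
    items = list(clusters)
    precisions, recalls = [], []
    for e in items:
        p = [mult_precision(e, o, clusters, labels)
             for o in items if clusters[e] & clusters[o]]
        r = [mult_recall(e, o, clusters, labels)
             for o in items if labels[e] & labels[o]]
        if p:
            precisions.append(sum(p) / len(p))
        if r:
            recalls.append(sum(r) / len(r))
    return sum(precisions) / len(precisions), sum(recalls) / len(recalls)
```

A clustering that reproduces the class labels exactly scores 1.0 on both measures; mixing classes within a cluster lowers precision, and splitting a class across clusters lowers recall.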
Experimental Results
Similarity Matrices
Computation Times
Dataset       Similarity Measure    CPU Time (seconds)
Mushrooms     Weighted Jaccard      545.23 ± 3.45
Mushrooms     Weighted Sørensen     300.35 ± 1.64
Mushrooms     Weighted SymmDiff     961.62 ± 2.13
Mushrooms     Zeros Induced         4125.22 ± 3.76
Congress      Weighted Jaccard      522.24 ± 4.2204
Congress      Weighted Sørensen     289.89 ± 0.69
Congress      Weighted SymmDiff     885.89 ± 2.77
Congress      Zeros Induced         3233.54 ± 3.45
news_allrec   Weighted Jaccard      3.9170 ± 0.0440
news_allrec   Weighted Sørensen     2.6630 ± 0.0517
news_allrec   Weighted SymmDiff     6.1900 ± 0.0474
news_allrec   Zeros Induced         8.2050 ± 0.1203
news_mer      Weighted Jaccard      0.7700 ± 0.0067
news_mer      Weighted Sørensen     0.5100 ± 0.0176
news_mer      Weighted SymmDiff     1.2270 ± 0.0134
news_mer      Zeros Induced         1.9720 ± 0.0225
news_pcr      Weighted Jaccard      0.7680 ± 0.0092
news_pcr      Weighted Sørensen     0.5040 ± 0.0158
news_pcr      Weighted SymmDiff     1.2280 ± 0.0235
news_pcr      Zeros Induced         1.8530 ± 0.0183
Conclusion
First steps towards clustering formal concepts
The zeros-induced measure requires no parameters
Initial experiments indicate superiority of the zeros-induced measure on clustering sparse data
Future work should incorporate the lattice structure explicitly