Similarity Measures in Formal Concept Analysis

Post on 09-Jul-2015

Similarity Measures in Formal Concept Analysis

Faris Alqadah, Raj Bhatnagar

Computer Science Department, University of Cincinnati

Faris Alqadah Raj Bhatnagar ( Computer Science Department University of Cincinnati )Similarity Measures in Formal Concept Analysis 1 / 28

Outline

1 Introduction

2 Formal Concept Analysis

3 Similarity Measures

4 Experiments

Introduction

Motivation

- Formal Concept Analysis (FCA) has been studied and applied successfully in many diverse fields
    - Data mining, conceptual modeling, software engineering, and social networking
- Possible drawback: large number of concepts
- Essential to develop formalisms to segment, cluster, and categorize concepts

Related Work

- Few studies have focused on similarity measures for formal concepts
- Ad hoc approaches based on applications (Y. Ding 2002) (Blachon and Gandrillon 2007), no formal study of similarity
- Similarity in fuzzy concepts addressed by Belohlavek
- Concept similarity in ontologies encompasses string similarity measures and external sources of data such as dictionaries

Formal Concept Analysis

Concepts

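- Context $K = (G, M, I)$: $G$ is the set of objects, $M$ the set of attributes, and $I \subseteq G \times M$ the incidence relation
- For $A \subseteq G$, the derivation $A'$ is defined as $A' = \{ m \in M \mid a \, I \, m \ \ \forall a \in A \}$
- Dually defined for $B \subseteq M$: $B' = \{ g \in G \mid g \, I \, m \ \ \forall m \in B \}$

Definition
A formal concept of $K$ is a pair $(A, B)$ with $A \subseteq G$ and $B \subseteq M$ such that $A' = B$ and $B' = A$. The set of all concepts of $K$ is denoted $\mathcal{B}(G, M, I)$, or $\mathcal{B}$ for short.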
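The two derivation operators and the concept test can be sketched in a few lines. This is a minimal illustration on the 7 × 4 example context used in these slides; the names `CONTEXT`, `intent_of`, `extent_of`, and `is_formal_concept` are hypothetical and not taken from the authors.

```python
# Example context from the slides: each object maps to the attributes it has.
CONTEXT = {
    "g1": {"m2", "m4"},
    "g2": {"m3", "m4"},
    "g3": {"m4"},
    "g4": {"m1"},
    "g5": {"m1", "m2", "m3"},
    "g6": {"m3"},
    "g7": {"m1", "m2"},
}
ATTRIBUTES = {"m1", "m2", "m3", "m4"}


def intent_of(objects):
    """A': the attributes shared by every object in `objects`."""
    shared = set(ATTRIBUTES)  # the empty object set shares all attributes
    for g in objects:
        shared &= CONTEXT[g]
    return shared


def extent_of(attributes):
    """B': the objects that have every attribute in `attributes`."""
    wanted = set(attributes)
    return {g for g, attrs in CONTEXT.items() if wanted <= attrs}


def is_formal_concept(objects, attributes):
    """(A, B) is a formal concept exactly when A' = B and B' = A."""
    return (intent_of(objects) == set(attributes)
            and extent_of(attributes) == set(objects))
```

For instance, `is_formal_concept({"g5", "g7"}, {"m1", "m2"})` holds, while the pair `({"g5"}, {"m1", "m2"})` is rejected because g5 also has m3.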

Relation to other theories

      m1  m2  m3  m4
g1     0   1   0   1
g2     0   0   1   1
g3     0   0   0   1
g4     1   0   0   0
g5     1   1   1   0
g6     0   0   1   0
g7     1   1   0   0

- Concepts are maximal rectangles of 1s under a suitable permutation of rows and columns
- Bi-clusters, co-clusters: maximal sub-matrices with zero variance
- Maximal bicliques in a bipartite graph

Hierarchy of Concepts

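[Figure: concept lattice of the example context. Node labels, written as (extent, intent):
  (∅, {m1, m2, m3, m4})
  ({g5}, {m1, m2, m3})
  ({g1}, {m2, m4})
  ({g2}, {m3, m4})
  ({g5, g7}, {m1, m2})
  ({g2, g5, g6}, {m3})
  ({g1, g2, g3}, {m4})
  ({g4, g5, g7}, {m1})
  ({g1, g5, g7}, {m2})
  ({g1, g2, g3, g4, g5, g6, g7}, ∅)]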

Concepts of a context form a natural hierarchical structure.

Theorem
The formal concepts of a context, ordered by the subset relation on their extents, form a complete lattice.
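A brute-force sketch of how the concepts and their order can be enumerated for the small example context. It closes every attribute subset, so it is exponential in |M| and only meant for illustration; the function names are hypothetical and this is not the enumeration algorithm used by the authors.

```python
from itertools import combinations

# Example context from the slides: each object maps to the attributes it has.
CONTEXT = {
    "g1": {"m2", "m4"}, "g2": {"m3", "m4"}, "g3": {"m4"}, "g4": {"m1"},
    "g5": {"m1", "m2", "m3"}, "g6": {"m3"}, "g7": {"m1", "m2"},
}
ATTRIBUTES = ["m1", "m2", "m3", "m4"]


def extent_of(attributes):
    """Objects that have every attribute in `attributes`."""
    wanted = set(attributes)
    return {g for g, attrs in CONTEXT.items() if wanted <= attrs}


def intent_of(objects):
    """Attributes shared by every object in `objects`."""
    shared = set(ATTRIBUTES)
    for g in objects:
        shared &= CONTEXT[g]
    return shared


def all_concepts():
    """For every attribute subset B, (B', B'') is a concept; collect them all."""
    concepts = set()
    for size in range(len(ATTRIBUTES) + 1):
        for subset in combinations(ATTRIBUTES, size):
            extent = extent_of(subset)
            concepts.add((frozenset(extent), frozenset(intent_of(extent))))
    return concepts


def is_subconcept(c1, c2):
    """(A1, B1) <= (A2, B2) when A1 is a subset of A2."""
    return c1[0] <= c2[0]
```

On the example context this yields the ten concepts drawn in the lattice, with the all-objects concept on top and the all-attributes concept at the bottom.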

Number of concepts

- Given $K = (G, M, I)$, in the worst case $|\mathcal{B}| \in O(2^{\min\{|G|, |M|\}})$
- Empirical evidence with real-world datasets suggests this is more like $O(\max\{|G|, |M|\}^2)$
- Examples:
    - Genes-GO terms ($1910 \times 3885$): > 1,000,000 concepts
    - Genes-PhenoTypes ($1910 \times 3965$): > 1,000,000 concepts
    - Mushrooms ($8124 \times 120$): > 200,000 concepts
- Constraints are typically imposed to make enumeration tractable; however, parameter selection is problematic
- Segmenting and clustering of concepts should be explored
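The exponential worst case can be seen on a standard construction that is not in the slides: an n × n context in which object i has every attribute except attribute i (often called a contranominal scale). Every attribute subset is then closed, giving 2^n concepts. The sketch below counts concepts by brute force; the names `contranominal` and `count_concepts` are hypothetical.

```python
from itertools import combinations


def count_concepts(context, attributes):
    """Count concepts by closing every attribute subset (exponential, demo only)."""
    intents = set()
    for size in range(len(attributes) + 1):
        for subset in combinations(attributes, size):
            wanted = set(subset)
            extent = [g for g, attrs in context.items() if wanted <= attrs]
            intent = set(attributes)
            for g in extent:
                intent &= context[g]
            intents.add(frozenset(intent))  # one concept per distinct closed intent
    return len(intents)


def contranominal(n):
    """n objects and n attributes; object i has all attributes except i."""
    attributes = list(range(n))
    context = {g: {m for m in attributes if m != g} for g in range(n)}
    return context, attributes
```

Calling `count_concepts(*contranominal(n))` for small n doubles with each increment of n, which matches the $2^{\min\{|G|, |M|\}}$ bound.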

Similarity Measures

Formal Definition

Definition
A similarity measure $S$ is a function with non-negative real values defined on the Cartesian product $X \times X$ of a set $X$,

\[ S : X \times X \to \mathbb{R} \tag{1} \]

such that the following three properties are satisfied:

1 $\exists s_0 \in \mathbb{R} : -\infty < S(x, y) \le s_0 < +\infty \ \ \forall x, y \in X$
2 $S(x, x) = s_0 \ \ \forall x \in X$
3 $S(x, y) = S(y, x) \ \ \forall x, y \in X$
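On a finite sample of X the three properties can be checked numerically. The sketch below is only a sanity check under that assumption, not a proof; `satisfies_similarity_properties` is a hypothetical helper, and the candidate bound s0 is taken to be the largest observed value.

```python
import math
from itertools import product


def satisfies_similarity_properties(S, X, tol=1e-12):
    """Check properties 1-3 of the definition on a finite sample X."""
    values = {(x, y): S(x, y) for x, y in product(X, repeat=2)}
    s0 = max(values.values())  # candidate upper bound on the sample
    bounded = all(math.isfinite(v) for v in values.values())            # property 1
    self_is_max = all(abs(values[(x, x)] - s0) <= tol for x in X)       # property 2
    symmetric = all(abs(values[(x, y)] - values[(y, x)]) <= tol
                    for x, y in product(X, repeat=2))                   # property 3
    return bounded and self_is_max and symmetric


def jaccard(x, y):
    """Jaccard index of two non-empty sets."""
    return len(x & y) / len(x | y)
```

For example, `jaccard` passes the check on a handful of non-empty frozensets, while an asymmetric function such as `lambda x, y: len(x - y)` does not.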

Weighted Concept Similarity

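Set-inspired similarity measures:

Jaccard index
\[ S_{Jac}(x, y) = \frac{|x \cap y|}{|x \cup y|} \tag{2} \]

Sørensen coefficient
\[ S_{Sor}(x, y) = \frac{2 \, |x \cap y|}{|x| + |y|} \tag{3} \]

Symmetric difference
\[ S_{Xor}(x, y) = 1 - \frac{|x \, \triangle \, y|}{|x \cup y|} \tag{4} \]

Combine set-based similarity measures to form a concept similarity measure for $C_1 = (A_1, B_1)$ and $C_2 = (A_2, B_2)$:

\[ S^{w}_{S}(C_1, C_2) = w \cdot S(A_1, A_2) + (1 - w) \cdot S(B_1, B_2) \tag{5} \]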
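Equations (2) to (5) translate almost directly into code. The following is a minimal sketch, with concepts represented as (extent, intent) pairs of Python sets. The function names are hypothetical, and treating two empty sets as fully similar is an assumption made here to avoid division by zero; the slides do not address that case.

```python
def jaccard(x, y):
    """Equation (2). Assumption: two empty sets have similarity 1."""
    union = x | y
    return 1.0 if not union else len(x & y) / len(union)


def sorensen(x, y):
    """Equation (3). Assumption: two empty sets have similarity 1."""
    total = len(x) + len(y)
    return 1.0 if total == 0 else 2 * len(x & y) / total


def symmetric_difference_similarity(x, y):
    """Equation (4). Assumption: two empty sets have similarity 1."""
    union = x | y
    return 1.0 if not union else 1 - len(x ^ y) / len(union)


def weighted_similarity(c1, c2, set_measure, w=0.5):
    """Equation (5): weight w on the extents, 1 - w on the intents."""
    (a1, b1), (a2, b2) = c1, c2
    return w * set_measure(a1, a2) + (1 - w) * set_measure(b1, b2)
```

A call such as `weighted_similarity(c1, c2, jaccard, w=0.5)` weights object overlap and attribute overlap equally; choosing `w` remains up to the user.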

Formal Proof

Proofs depend on the fact that $S$ is a set-based similarity measure. Shown here for $S = S_{Jac}$ and a weight $0 \le w \le 1$.

Proof.
- By the properties of set union and set intersection, $S_{Jac}(x, y) \le 1 \ \ \forall x, y$; thus, by the definition of weighted concept similarity, $s_0 = 1$.
- Property 2 is trivially satisfied by the fact that $S_{Jac}$ is a similarity measure, thus $S_{Jac}(x, x) = 1$ and therefore
\[ S^{w}_{Jac}(C_1, C_1) = w \cdot 1 + (1 - w) \cdot 1 = 1 \quad \forall C_1 \in \mathcal{B}(G, M, I) \]
- Property 3 is also satisfied by the fact that $S_{Jac}$ is a similarity measure, so $S_{Jac}(x, y) = S_{Jac}(y, x)$.

Weighted Concept Similarity

- Well-established similarity measures, easy to compute
- Set intersection, union, and difference of any two sets $x, y$ can be computed in $O(\min\{|x|, |y|\})$
- $O(\min\{|A_1|, |B_1|, |A_2|, |B_2|\})$ for any given pair of concepts $(A_1, B_1)$ and $(A_2, B_2)$
- Drawback is selecting $w$

Drawbacks of Weighted Concept Similarity

[Figure: concept lattice of the example context, shown next to the context table]

      m1  m2  m3  m4
g1     0   1   0   1
g2     0   0   1   1
g3     0   0   0   1
g4     1   0   0   0
g5     1   1   1   0
g6     0   0   1   0
g7     1   1   0   0

- Measures only consider set cardinalities and not the information in the context
- $C_1 = (\{g_5\}, \{m_1, m_2, m_3\})$, $C_2 = (\{g_2, g_5, g_6\}, \{m_3\})$, and $C_3 = (\{g_4, g_5, g_7\}, \{m_1\})$

\[ S^{0.5}_{Jac}(C_1, C_2) = S^{0.5}_{Jac}(C_1, C_3) = 0.333 \]
\[ S^{0.5}_{Sor}(C_1, C_2) = S^{0.5}_{Sor}(C_1, C_3) = 0.5 \]

- Intuitively, the similarity between $C_1$ and $C_3$ should be greater than that between $C_1$ and $C_2$
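The arithmetic behind these values can be written out with $w = 0.5$. This is only a worked restatement of the numbers above, using equations (2), (3), and (5); the pair $(C_1, C_3)$ gives the same fractions because its intersections and unions have the same sizes.

```latex
\begin{align*}
S_{Jac}(A_1, A_2) &= \frac{|\{g_5\}|}{|\{g_2, g_5, g_6\}|} = \frac{1}{3}, &
S_{Jac}(B_1, B_2) &= \frac{|\{m_3\}|}{|\{m_1, m_2, m_3\}|} = \frac{1}{3}, \\
S^{0.5}_{Jac}(C_1, C_2) &= 0.5 \cdot \frac{1}{3} + 0.5 \cdot \frac{1}{3} = \frac{1}{3} \approx 0.333, \\
S_{Sor}(A_1, A_2) &= \frac{2 \cdot 1}{1 + 3} = \frac{1}{2}, &
S_{Sor}(B_1, B_2) &= \frac{2 \cdot 1}{3 + 1} = \frac{1}{2}, \\
S^{0.5}_{Sor}(C_1, C_2) &= 0.5 \cdot \frac{1}{2} + 0.5 \cdot \frac{1}{2} = 0.5 .
\end{align*}
```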

Zeros Induced Similarity

- View concepts as maximal sub-matrices of 1s
- Combining any two distinct concepts must result in the introduction of zeros
- Think of similarity in terms of the number of zeros introduced by combining two concepts: the fewer zeros, the more similar the concepts

Zeros Induced Similarity

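Given $C_1 = (A_1, B_1)$ and $C_2 = (A_2, B_2)$, then

\[ z(C_1, C_2) = \sum_{a \in A_1 \cup A_2} |(B_1 \cup B_2) \setminus a'| \tag{6} \]

Definition
Given concepts $C_1 = (A_1, B_1)$ and $C_2 = (A_2, B_2)$, the zeros-induced index is

\[ S_z(C_1, C_2) = \frac{|A_1 \cup A_2| \cdot |B_1 \cup B_2| - z(C_1, C_2)}{|A_1 \cup A_2| \cdot |B_1 \cup B_2|} \tag{7} \]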
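A direct sketch of equations (6) and (7) on the example context from these slides. The names `zeros_induced` and `zeros_induced_index` are hypothetical, and returning 1 when the combined rectangle is empty is an assumption made here to avoid division by zero.

```python
# Example context from the slides: each object maps to the attributes it has.
CONTEXT = {
    "g1": {"m2", "m4"}, "g2": {"m3", "m4"}, "g3": {"m4"}, "g4": {"m1"},
    "g5": {"m1", "m2", "m3"}, "g6": {"m3"}, "g7": {"m1", "m2"},
}


def zeros_induced(c1, c2):
    """Equation (6): zeros inside the rectangle (A1 u A2) x (B1 u B2)."""
    (a1, b1), (a2, b2) = c1, c2
    objects, attributes = a1 | a2, b1 | b2
    # For each object, count the combined attributes it does not have.
    return sum(len(attributes - CONTEXT[a]) for a in objects)


def zeros_induced_index(c1, c2):
    """Equation (7): fraction of the combined rectangle that is filled with 1s."""
    (a1, b1), (a2, b2) = c1, c2
    area = len(a1 | a2) * len(b1 | b2)
    if area == 0:
        return 1.0  # assumption: an empty rectangle contains no zeros
    return (area - zeros_induced(c1, c2)) / area
```

Unlike the weighted measures, this function takes no weight parameter, but it needs access to the context itself to look up each object's attributes.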

Formal Proof

Proof.
- For any two sets $x, y$: $x \setminus y \subseteq x$, thus $z(C_1, C_2) \le |A_1 \cup A_2| \cdot |B_1 \cup B_2| \ \ \forall C_1, C_2$, implying that $s_0 = 1$.
- For any concept $C = (A, B)$, by definition $A' = B$, which implies
\[ \forall a \in A : \ a' \supseteq B \ \Rightarrow \ z(C, C) = 0 \ \Rightarrow \ S_z(C, C) = s_0 \]
- Property 3 is guaranteed by the commutative property of set union.

Zeros Induced Similarity

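$C_1 = (\{g_5\}, \{m_1, m_2, m_3\})$, $C_2 = (\{g_2, g_5, g_6\}, \{m_3\})$, and $C_3 = (\{g_4, g_5, g_7\}, \{m_1\})$

\[ S_z(C_1, C_2) = \frac{9 - 4}{9} = \frac{5}{9} \]

and

\[ S_z(C_1, C_3) = \frac{9 - 3}{9} = \frac{2}{3} \]

Direct implementation of computing zeros has complexity of $O(\max\{|A_1|, |B_1|, |A_2|, |B_2|\}^2)$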
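The zero counts 4 and 3 follow from equation (6) by going through the objects of each combined extent and reading their rows in the example context. In both cases $B_1 \cup B_2 = \{m_1, m_2, m_3\}$ (respectively $B_1 \cup B_3$), and the combined rectangle has $3 \cdot 3 = 9$ cells.

```latex
\begin{align*}
z(C_1, C_2) &= \underbrace{|\{m_1, m_2\}|}_{g_2' = \{m_3, m_4\}}
             + \underbrace{|\emptyset|}_{g_5' = \{m_1, m_2, m_3\}}
             + \underbrace{|\{m_1, m_2\}|}_{g_6' = \{m_3\}}
             = 2 + 0 + 2 = 4, \\
z(C_1, C_3) &= \underbrace{|\{m_2, m_3\}|}_{g_4' = \{m_1\}}
             + \underbrace{|\emptyset|}_{g_5' = \{m_1, m_2, m_3\}}
             + \underbrace{|\{m_3\}|}_{g_7' = \{m_1, m_2\}}
             = 2 + 0 + 1 = 3 .
\end{align*}
```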

Experiments

Datasets and Method

- Real-world, labeled datasets
- Enumerate concepts and compute the similarity matrix
- Use the similarity matrix with an agglomerative clustering algorithm
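The last step can be sketched as follows. The slides do not say which linkage or stopping rule was used, so average linkage and a fixed target number of clusters are assumptions here, and `agglomerative_clusters` is a hypothetical name. The input is a symmetric similarity matrix given as a list of lists.

```python
def agglomerative_clusters(sim, num_clusters):
    """Average-linkage agglomerative clustering on a similarity matrix.

    Starts with one cluster per item and repeatedly merges the two clusters
    with the highest average pairwise similarity (assumed linkage).
    """
    clusters = [[i] for i in range(len(sim))]

    def linkage(c1, c2):
        return sum(sim[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > max(num_clusters, 1):
        pairs = ((p, q) for p in range(len(clusters))
                 for q in range(p + 1, len(clusters)))
        p, q = max(pairs, key=lambda pq: linkage(clusters[pq[0]], clusters[pq[1]]))
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]  # q > p, so index p is unaffected
    return clusters
```

Each returned cluster is a list of row indices of the similarity matrix, so the indices can be mapped back to the enumerated concepts. The naive pair search is cubic overall and is meant only to show the procedure.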

Datasets

Name          Dimensions     Density   Num. classes
Congress      435 × 48       0.33      2
Mushrooms     8124 × 120     0.1917    2
news_mer      2000 × 892     0.003     2
news_pcr      1997 × 1025    0.0026    2
news_allrec   3124 × 1671    0.0014    4

Evaluation Measures

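\[ \mathrm{MultPrec}(e, e') = \frac{\min(|C(e) \cap C(e')|, \ |L(e) \cap L(e')|)}{|C(e) \cap C(e')|} \]

\[ \mathrm{MultRcl}(e, e') = \frac{\min(|C(e) \cap C(e')|, \ |L(e) \cap L(e')|)}{|L(e) \cap L(e')|} \]

\[ \mathrm{B3Prec} = \mathrm{Avg}_{e} \left[ \mathrm{Avg}_{e' : \, C(e) \cap C(e') \ne \emptyset} \left[ \mathrm{MultPrec}(e, e') \right] \right] \]

\[ \mathrm{B3Rcl} = \mathrm{Avg}_{e} \left[ \mathrm{Avg}_{e' : \, L(e) \cap L(e') \ne \emptyset} \left[ \mathrm{MultRcl}(e, e') \right] \right] \]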
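The four formulas can be sketched in code as below. The slides do not define $C(e)$ and $L(e)$; the sketch assumes $C(e)$ is the set of clusters that contain item $e$ and $L(e)$ is its set of class labels, and that every item has at least one of each. The function name `bcubed` and the dictionary layout are hypothetical.

```python
def bcubed(clusters_of, labels_of):
    """Return (B3Prec, B3Rcl) for overlapping clusters.

    clusters_of[e] and labels_of[e] are the sets C(e) and L(e) for item e.
    Assumes every item belongs to at least one cluster and has a label.
    """
    items = list(clusters_of)

    def average(values):
        values = list(values)
        return sum(values) / len(values)

    def mult_prec(e, f):
        shared_clusters = len(clusters_of[e] & clusters_of[f])
        shared_labels = len(labels_of[e] & labels_of[f])
        return min(shared_clusters, shared_labels) / shared_clusters

    def mult_rcl(e, f):
        shared_clusters = len(clusters_of[e] & clusters_of[f])
        shared_labels = len(labels_of[e] & labels_of[f])
        return min(shared_clusters, shared_labels) / shared_labels

    precision = average(
        average(mult_prec(e, f) for f in items if clusters_of[e] & clusters_of[f])
        for e in items
    )
    recall = average(
        average(mult_rcl(e, f) for f in items if labels_of[e] & labels_of[f])
        for e in items
    )
    return precision, recall
```

Note that each inner average includes the pair $(e, e)$ itself, since an item always shares its own clusters and labels.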

Experimental Results

Similarity Matrices

Computation Times

Dataset       Similarity Measure    CPU Time (seconds)
Mushrooms     Weighted Jaccard      545.23 ± 3.45
              Weighted Sørensen     300.35 ± 1.64
              Weighted SymmDiff     961.62 ± 2.13
              Zeros Induced         4125.22 ± 3.76
Congress      Weighted Jaccard      522.24 ± 4.2204
              Weighted Sørensen     289.89 ± 0.69
              Weighted SymmDiff     885.89 ± 2.77
              Zeros Induced         3233.54 ± 3.45
news_allrec   Weighted Jaccard      3.9170 ± 0.0440
              Weighted Sørensen     2.6630 ± 0.0517
              Weighted SymmDiff     6.1900 ± 0.0474
              Zeros Induced         8.2050 ± 0.1203
news_mer      Weighted Jaccard      0.7700 ± 0.0067
              Weighted Sørensen     0.5100 ± 0.0176
              Weighted SymmDiff     1.2270 ± 0.0134
              Zeros Induced         1.9720 ± 0.0225
news_pcr      Weighted Jaccard      0.7680 ± 0.0092
              Weighted Sørensen     0.5040 ± 0.0158
              Weighted SymmDiff     1.2280 ± 0.0235
              Zeros Induced         1.8530 ± 0.0183

Conclusion

- First steps towards clustering formal concepts
- The zeros-induced measure requires no parameters
- Initial experiments indicate superiority of the zeros-induced measure on clustering sparse data
- Future work should incorporate the lattice structure explicitly
