Efficient Mining of Graph-Based Data

CSE@UTA SRL Workshop 1

Jesus Gonzalez, Istvan Jonyer, Larry Holder and Diane Cook

Department of Computer Science and Engineering
University of Texas at Arlington

http://cygnus.uta.edu/subdue


Motivation

Structural/relational data
Ease of graph representation


Graph-Based Discovery

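[Figure: three panels. Input Database: triangles T1-T4, squares B1-B4, rectangle R1, and circle C1. Substructure S1 (graph form): object vertices with shape edges to triangle and square, joined by an on edge. Compressed Database: R1 and C1 remain, and each instance of the pattern is replaced by S1.]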


Algorithm

1. Create substructure for each unique vertex label

Substructures: triangle (4), square (4), circle (1), rectangle (1)

[Figure: input graph with triangle, square, circle, and rectangle vertices connected by "on" edges.]
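Step 1 can be illustrated with a small sketch. It assumes the vertex names from the input database figure (T1-T4, B1-B4, C1, R1) and a plain dictionary as the graph's vertex table; the function name `initial_substructures` is hypothetical, not part of Subdue.

```python
from collections import Counter

# Vertex id -> label, using the names shown in the input database figure.
vertices = {
    "T1": "triangle", "T2": "triangle", "T3": "triangle", "T4": "triangle",
    "B1": "square", "B2": "square", "B3": "square", "B4": "square",
    "C1": "circle", "R1": "rectangle",
}

def initial_substructures(vertex_labels):
    """Return one single-vertex substructure per unique label, with its instance count."""
    return Counter(vertex_labels.values())

subs = initial_substructures(vertices)
print(dict(subs))
```

The counts agree with the substructure list on this slide: four triangles, four squares, one circle, one rectangle.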


Algorithm

2. Expand best substructure by an edge or edge+neighboring vertex

Substructures:

[Figure: candidate expansions drawn as small graphs, each pairing two of the vertices triangle, square, rectangle, and circle through an "on" edge, shown next to the input graph.]
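A minimal sketch of step 2, assuming a substructure is reduced to a single labeled edge pattern and that the graph is a list of directed (source, label, target) edges. The edge list below is hypothetical (the slide does not give the exact edges), and `expand_by_edge` is an illustrative name, not Subdue's API.

```python
from collections import Counter

labels = {
    "T1": "triangle", "T2": "triangle", "T3": "triangle", "T4": "triangle",
    "B1": "square", "B2": "square", "B3": "square", "B4": "square",
    "C1": "circle", "R1": "rectangle",
}

# Hypothetical edges: each triangle sits on a square, plus a few extra "on" edges.
edges = [
    ("T1", "on", "B1"), ("T2", "on", "B2"), ("T3", "on", "B3"), ("T4", "on", "B4"),
    ("B1", "on", "R1"), ("R1", "on", "T2"), ("C1", "on", "R1"),
]

def expand_by_edge(label, labels, edges):
    """Extend the single-vertex substructure `label` by one edge and its neighbor.

    Returns a Counter mapping each (source label, edge label, target label)
    pattern to its number of instances in the graph.
    """
    patterns = Counter()
    for src, edge_label, dst in edges:
        if label in (labels[src], labels[dst]):
            patterns[(labels[src], edge_label, labels[dst])] += 1
    return patterns

expansions = expand_by_edge("triangle", labels, edges)
best_pattern, best_count = expansions.most_common(1)[0]
print(best_pattern, best_count)
```

Under these assumed edges, the most frequent expansion of "triangle" is the triangle-on-square pattern, which is the kind of candidate the search would keep.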


Algorithm

3. Keep only best beam-width substructures on queue

4. Terminate when queue is empty or #discovered substructures >= limit

5. Compress graph and repeat to generate hierarchical description

Note: the beam width and the limit keep the search polynomially constrained
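The control loop of steps 2-4 can be sketched as a generic beam search. This is an illustration under stated assumptions, not Subdue's implementation: candidates are strings instead of graphs, `expand` and `value` are caller-supplied callbacks, and the toy value function (savings from replacing every occurrence of a pattern by one symbol) only stands in for the compression measure.

```python
def beam_search(initial, expand, value, beam_width=3, limit=20):
    """Expand the best candidate, keep the best `beam_width` on the queue,
    stop when the queue is empty or `limit` candidates have been discovered."""
    def rank(candidate):
        # Higher value first; ties broken by the candidate itself for determinism.
        return (-value(candidate), candidate)

    seen = set(initial)
    queue = sorted(seen, key=rank)[:beam_width]
    discovered = []
    while queue and len(discovered) < limit:
        best = queue.pop(0)
        discovered.append(best)
        for child in expand(best):
            if child not in seen:
                seen.add(child)
                queue.append(child)
        queue = sorted(queue, key=rank)[:beam_width]
    return sorted(discovered, key=rank)

# Toy domain (hypothetical): patterns are substrings of a text.
text = "abcabcabcx"

def expand(pattern):
    """Extend the pattern by the next character at each of its occurrences."""
    out = set()
    start = text.find(pattern)
    while start != -1:
        end = start + len(pattern)
        if end < len(text):
            out.add(pattern + text[end])
        start = text.find(pattern, start + 1)
    return out

def value(pattern):
    """Symbols saved by replacing each occurrence with one symbol, minus the pattern's own cost."""
    return text.count(pattern) * (len(pattern) - 1) - len(pattern)

best = beam_search(set(text), expand, value)
print(best[0])
```

Both stopping conditions from step 4 appear in the `while` test, and the truncation to `beam_width` after every expansion is what bounds the work per iteration.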


Evaluation Metric

Substructures evaluated based on ability to compress input graph
Compression measured using minimum description length (DL)
Best substructure S in graph G minimizes: DL(S) + DL(G|S)
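To make the quantity DL(S) + DL(G|S) concrete, here is a rough sketch that approximates description length by graph size (vertices plus edges). That proxy is an assumption made for illustration; it is not the bit-level encoding a real minimum description length computation would use. It also assumes instances do not overlap and that each instance collapses to a single vertex. The graph sizes in the example are hypothetical.

```python
def graph_size(num_vertices, num_edges):
    """Crude stand-in for description length: count vertices and edges."""
    return num_vertices + num_edges

def compressed_value(g_vertices, g_edges, s_vertices, s_edges, instances):
    """Approximate DL(S) + DL(G|S) for a substructure with non-overlapping instances.

    Each instance of S (s_vertices vertices, s_edges edges) is replaced
    by one new vertex in the compressed graph G|S.
    """
    dl_s = graph_size(s_vertices, s_edges)
    dl_g_given_s = graph_size(
        g_vertices - instances * (s_vertices - 1),
        g_edges - instances * s_edges,
    )
    return dl_s + dl_g_given_s

# Hypothetical graph: 10 vertices, 7 edges (uncompressed size 17).
pair = compressed_value(10, 7, s_vertices=2, s_edges=1, instances=4)
single = compressed_value(10, 7, s_vertices=1, s_edges=0, instances=4)
print(pair, single)
```

With these numbers the two-vertex pattern gives 12, below the uncompressed size of 17, while a single-vertex pattern gives 18 and therefore does not compress the graph at all.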


Examples


Inexact Graph Match

Some variations may occur between instances
Want to abstract over minor differences
Difference = cost of transforming one graph into a graph isomorphic to another
Match if cost/size < threshold
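The matching rule can be sketched for very small graphs by brute force. The assumptions here are mine, not the slide's: every vertex or edge insertion, deletion, or vertex relabeling costs 1, "size" is taken as vertices plus edges of the larger graph, and all vertex mappings are enumerated, which is only feasible for a handful of vertices.

```python
from itertools import permutations

def transform_cost(g1, g2):
    """Minimum unit-cost edits to turn g1 into a graph isomorphic to g2.

    A graph is (labels, edges): a list of vertex labels and a set of
    directed (source index, target index, edge label) triples.
    """
    labels1, edges1 = g1
    labels2, edges2 = g2
    n = max(len(labels1), len(labels2))
    # Pad the smaller graph with placeholder vertices (insertions/deletions).
    padded1 = list(labels1) + [None] * (n - len(labels1))
    padded2 = list(labels2) + [None] * (n - len(labels2))
    best = None
    for perm in permutations(range(n)):
        vertex_cost = sum(padded1[i] != padded2[perm[i]] for i in range(n))
        mapped = {(perm[i], perm[j], lab) for i, j, lab in edges1}
        edge_cost = len(mapped ^ set(edges2))  # edges to delete plus edges to insert
        total = vertex_cost + edge_cost
        if best is None or total < best:
            best = total
    return best

def graph_size(g):
    labels, edges = g
    return len(labels) + len(edges)

def matches(g1, g2, threshold):
    """Apply the rule: match if cost/size < threshold."""
    size = max(graph_size(g1), graph_size(g2))
    return transform_cost(g1, g2) / size < threshold

g1 = (["triangle", "square"], {(0, 1, "on")})
g2 = (["triangle", "square", "circle"], {(0, 1, "on"), (2, 0, "on")})
print(transform_cost(g1, g2), matches(g1, g2, 0.5))
```

Here g2 differs from g1 by one extra vertex and one extra edge, so the cost is 2 against a size of 5; the pair matches at a threshold of 0.5 but not at 0.3.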


Parallel/Distributed Discovery

Divide graph into P partitions using Metis, distribute to P processors
Each processor performs serial Subdue on local partition
Broadcast best substructures, evaluate on other processors
Master processor stores best global substructures
Close to linear speedup
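The data flow of this scheme can be simulated sequentially. Everything below is a simplified stand-in: a round-robin split replaces Metis, counting labeled edge patterns replaces serial Subdue, the "processors" are just loop iterations, and all function names are hypothetical.

```python
from collections import Counter

def partition(edge_patterns, p):
    """Round-robin split into p partitions (stand-in for a real graph partitioner)."""
    return [edge_patterns[i::p] for i in range(p)]

def local_best(part, k=2):
    """Local discovery on one partition: the k most frequent edge patterns."""
    return [pattern for pattern, _ in Counter(part).most_common(k)]

def distributed_discovery(edge_patterns, p, k=2):
    parts = partition(edge_patterns, p)
    # Each processor discovers its best local substructures.
    candidates = set()
    for part in parts:
        candidates.update(local_best(part, k))
    # Broadcast: every processor evaluates every candidate on its own partition.
    global_counts = Counter()
    for part in parts:
        local_counts = Counter(part)
        for candidate in candidates:
            global_counts[candidate] += local_counts[candidate]
    # The master keeps the globally best substructure.
    return max(global_counts.items(), key=lambda item: (item[1], item[0]))

patterns = (
    [("triangle", "on", "square")] * 4
    + [("square", "on", "rectangle")] * 2
    + [("circle", "on", "rectangle")]
)
winner, count = distributed_discovery(patterns, p=2)
print(winner, count)
```

The point of the broadcast step is visible in the second loop: a candidate found on one partition is scored on all of them, so the master ranks substructures by their global rather than local support.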


Graph-Based Concept Learning

One graph stores positive examples
One graph stores negative examples
Find substructure that compresses positive graph but not negative graph
Substructure error count: (PosEgsNotCovered) + (NegEgsCovered)
Multiple iterations implement a set-covering approach
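The error count and the set-covering loop can be sketched as follows. As a simplifying assumption, examples and candidate substructures are feature sets rather than graphs, and "covers" means subset inclusion; the feature names and the functions `error` and `set_cover` are hypothetical.

```python
def error(candidate, positives, negatives):
    """(PosEgsNotCovered) + (NegEgsCovered) for one candidate."""
    pos_not_covered = sum(1 for ex in positives if not candidate <= ex)
    neg_covered = sum(1 for ex in negatives if candidate <= ex)
    return pos_not_covered + neg_covered

def set_cover(candidates, positives, negatives):
    """Repeatedly pick the lowest-error candidate and drop the positives it covers."""
    remaining = list(positives)
    rules = []
    while remaining:
        best = min(candidates,
                   key=lambda c: (error(c, remaining, negatives), sorted(c)))
        if not any(best <= ex for ex in remaining):
            break  # no candidate covers anything that is left
        rules.append(best)
        remaining = [ex for ex in remaining if not best <= ex]
    return rules

positives = [
    frozenset({"triangle_on_square", "red"}),
    frozenset({"triangle_on_square", "blue"}),
    frozenset({"circle_on_square", "red"}),
]
negatives = [
    frozenset({"square_on_triangle", "red"}),
    frozenset({"square_on_circle", "blue"}),
]
candidates = [
    frozenset({"triangle_on_square"}),
    frozenset({"circle_on_square"}),
    frozenset({"red"}),
]
rules = set_cover(candidates, positives, negatives)
print(rules)
```

The first iteration picks the pattern with the fewest errors over all positives; the second iteration only has to explain the positive example that pattern missed, which is the set-covering idea.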


Concept-Learning Example

[Figure: concept in graph form, with three object vertices linked by on edges and shape edges leading to triangle and square.]


Concept-Learning Results

Chess endgames (19,257 examples)
  Black King is (+) or is not (-) in check
  99.8% FOIL, 99.21% Subdue


More Concept-Learning Results

Tic-Tac-Toe endgames
  (+) is a win for X (958 examples)
  100% Subdue, 92.35% FOIL

Bach chorales
  Musical sequences (20 sequences)
  100% Subdue, 85.71% FOIL


Graph-Based Clustering

Iterate Subdue until the graph is compressed to a single vertex
Each cluster (substructure) is inserted into a classification lattice

[Figure: classification lattice descending from a Root node.]


Clustering Example: Animals

Name       Body Cover      Heart Chamber   Body Temp.   Fertilization
mammal     hair            four            regulated    internal
bird       feathers        four            regulated    internal
reptile    cornified-skin  imperfect-four  unregulated  internal
amphibian  moist-skin      three           unregulated  external
fish       scales          two             unregulated  external

[Figure: graph representation of the mammal example: an animal vertex with edges Name, BodyCover, HeartChamber, BodyTemp, and Fertilization leading to the vertices mammal, hair, four, regulated, and internal.]


Graph-Based Clustering Results

[Figure: Subdue classification lattice for the animals data, shown here as an indented hierarchy.]

Animals
  HeartChamber: four, BodyTemp: regulated, Fertilization: internal
    Name: mammal, BodyCover: hair
    Name: bird, BodyCover: feathers
  BodyTemp: unregulated
    Name: reptile, BodyCover: cornified-skin, HeartChamber: imperfect-four, Fertilization: internal
    Fertilization: external
      Name: fish, BodyCover: scales, HeartChamber: two
      Name: amphibian, BodyCover: moist-skin, HeartChamber: three


Cobweb Results

Comparison of Subdue and Cobweb results
  Subdue lattice produced better generalization, resulting in fewer clusters at higher levels
  Subdue lattice identifies overlap between (reptile) and (amphibian/fish)

[Figure: Cobweb hierarchy with animals at the root; amphibian/fish, mammal/bird, and reptile below it; and mammal, bird, fish, and amphibian at the leaves.]


Clustering Example: DNA


Graph-Based Clustering Results

[Figure: three substructures discovered in the DNA data, drawn as chemical structure fragments (phosphate groups with O, P, and OH; CH2; carbon-nitrogen chains and rings), with coverage of 61%, 68%, and 71%.]


Evaluation of Clusterings

Traditional evaluation:

$$\text{ClusteringQuality} = \frac{\text{InterClusterDistance}}{\text{IntraClusterDistance}}$$

Not applicable to hierarchical domains
  Does not make sense to compare clusters in different subtrees
Not applicable to relational clusterings


Properties of Good Clusterings

Small number of clusters
  Large coverage → good generality
Big cluster descriptions
  More features → more inferential power
Minimal or no overlap between clusters
  More distinct clusters → better defined concepts


New Evaluation Heuristic for Hierarchical Clusterings

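$$CQ_C = \frac{\displaystyle\sum_{i=1}^{c-1} \sum_{j=i+1}^{c} \sum_{k=1}^{|H_i|} \sum_{l=1}^{|H_j|} \frac{\mathrm{distance}(H_{i,k}, H_{j,l})}{\max\left(\mathrm{size}(H_{i,k}), \mathrm{size}(H_{j,l})\right)}}{\displaystyle\sum_{i=1}^{c-1} \sum_{j=i+1}^{c} |H_i| \, |H_j|} + \sum_{i=1}^{c} CQ_{H_i}$$

Clustering rooted at C with c children H_i, each having |H_i| instances H_{i,k}
distance() measured by inexact graph match
Animals: Subdue CQ = 2.6, Cobweb CQ = 1.7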
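A recursive sketch of this heuristic, assuming it has the following form: for every pair of sibling clusters, sum the distance between each pair of their instances divided by the larger instance size, divide the total by the number of instance pairs, and add the quality of each child clustering. As further assumptions, instances are attribute sets rather than graphs, `distance` is the size of the symmetric difference (a stand-in for inexact graph match), and a node with fewer than two children contributes 0 for its own level.

```python
def clustering_quality(cluster, distance, size):
    """CQ of the clustering rooted at `cluster`.

    A cluster is a dict with optional keys "instances" (list) and "children" (list of clusters).
    """
    children = cluster.get("children", [])
    numerator = 0.0
    pair_count = 0
    for i in range(len(children) - 1):
        for j in range(i + 1, len(children)):
            inst_i = children[i]["instances"]
            inst_j = children[j]["instances"]
            for a in inst_i:
                for b in inst_j:
                    numerator += distance(a, b) / max(size(a), size(b))
            pair_count += len(inst_i) * len(inst_j)
    own = numerator / pair_count if pair_count else 0.0
    return own + sum(clustering_quality(ch, distance, size) for ch in children)

def distance(a, b):
    return len(a ^ b)  # stand-in for the inexact graph match cost

mammal = frozenset({"BodyCover=hair", "HeartChamber=four",
                    "BodyTemp=regulated", "Fertilization=internal"})
bird = frozenset({"BodyCover=feathers", "HeartChamber=four",
                  "BodyTemp=regulated", "Fertilization=internal"})
fish = frozenset({"BodyCover=scales", "HeartChamber=two",
                  "BodyTemp=unregulated", "Fertilization=external"})

root = {"children": [
    {"instances": [mammal, bird], "children": []},
    {"instances": [fish], "children": []},
]}
cq = clustering_quality(root, distance, len)
print(cq)
```

This toy value is not comparable to the CQ figures reported on the slide, which come from the full lattices and the graph-based distance; the sketch only shows how the pairwise term and the recursive term combine.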


Graph-Based Data Mining: Application Domains

Biochemical domains
  Protein data
  DNA data
  Toxicology (cancer) data
Spatial-temporal domains
  Earthquake data
  Aircraft Safety and Reporting System
Telecommunications data
Program source code
Web topology

[Figure: web topology graph of web_page vertices connected by hyperlink edges, with word vertices such as "home".]


Theoretical Analysis

Galois lattice [Lequiere et al.]
Conceptual graphs [Sowa et al.]
PAC analysis [Jappy et al.]


Graph-based Data Mining

Pattern (substructure) discovery
Hierarchical discovery
Distributed discovery
Concept learning
Clustering
Compression heuristic based on minimum description length


Future Work

Concept learning
  Theoretical analysis
  Comparison to ILP systems
Clustering
  Classification lattice
  Hierarchical relational conceptual clustering evaluation metric
Probabilistic substructures
Domains: WWW, source code


Subdue Source Code and Data

http://cygnus.uta.edu/subdue
