Efficient Mining of Graph-Based Data

Page 1: Efficient Mining of  Graph-Based Data

CSE@UTA SRL Workshop 1

Efficient Mining of Graph-Based Data

Jesus Gonzalez, Istvan Jonyer, Larry Holder and Diane Cook

University of Texas at Arlington
Department of Computer Science and Engineering

http://cygnus.uta.edu/subdue

Page 2: Efficient Mining of  Graph-Based Data

Motivation

Structural/relational data
Ease of graph representation

Page 3: Efficient Mining of  Graph-Based Data

Graph-Based Discovery

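[Figure: three panels. Input Database: a scene with rectangle R1, circle C1 and triangles T1-T4 resting on squares B1-B4. Substructure S1 (graph form): two "object" vertices joined by an "on" edge, with "shape" edges to "triangle" and "square". Compressed Database: R1 and C1 remain, and each triangle-on-square instance is replaced by an S1 vertex.]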

Page 4: Efficient Mining of  Graph-Based Data

Algorithm

1. Create substructure for each unique vertex label

Substructures: triangle (4), square (4), circle (1), rectangle (1)

[Figure: input graph with circle, rectangle, triangle and square vertices joined by "on" edges.]

Page 5: Efficient Mining of  Graph-Based Data

Algorithm

2. Expand best substructure by an edge or edge+neighboring vertex

Substructures:

[Figure: the candidate one-edge substructures, each a pair of shape vertices joined by an "on" edge (triangle and square, rectangle and square, rectangle and triangle, rectangle and circle), shown next to the input graph.]
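Steps 1 and 2 can be illustrated with a minimal Python sketch. The toy graph below is an assumption: the exact edges of the slide's figure are not recoverable, so only the vertex-label counts are chosen to match the slide. The function names are hypothetical, and this is not Subdue's implementation. It only counts single-vertex substructures and their one-edge expansions.

```python
from collections import Counter

# Assumed toy input: vertex id -> label, plus directed, labelled edges.
vertices = {
    "T1": "triangle", "T2": "triangle", "T3": "triangle", "T4": "triangle",
    "B1": "square", "B2": "square", "B3": "square", "B4": "square",
    "R1": "rectangle", "C1": "circle",
}
edges = [
    ("T1", "on", "B1"), ("T2", "on", "B2"),
    ("T3", "on", "B3"), ("T4", "on", "B4"),
    ("B1", "on", "R1"), ("C1", "on", "R1"),
]


def initial_substructures(vertices):
    """Step 1: one substructure per unique vertex label, with instance counts."""
    return Counter(vertices.values())


def expand_by_edge(label, vertices, edges):
    """Step 2 (simplified): grow a one-vertex substructure by one edge plus
    the neighboring vertex, counting instances of each resulting pattern."""
    patterns = Counter()
    for src, edge_label, dst in edges:
        if label in (vertices[src], vertices[dst]):
            patterns[(vertices[src], edge_label, vertices[dst])] += 1
    return patterns


singles = initial_substructures(vertices)
from_triangle = expand_by_edge("triangle", vertices, edges)
from_square = expand_by_edge("square", vertices, edges)
```

Here `singles` reproduces the counts listed under step 1, and each expansion is keyed by its (source label, edge label, target label) triple.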

Page 6: Efficient Mining of  Graph-Based Data

Algorithm

3. Keep only best beam-width substructures on queue

4. Terminate when queue is empty or #discovered substructures >= limit

5. Compress graph and repeat to generate hierarchical description

Note: polynomially constrained
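The control loop in steps 2 to 4 can be sketched generically. To keep the example short, the candidates here are substrings of a text rather than subgraphs, which is an assumption made purely for illustration. `beam_discover`, `expand` and `value` are hypothetical names, and the value function (characters saved by repeated occurrences) is only a stand-in for the compression-based evaluation.

```python
def beam_discover(initial, expand, value, beam_width=3, limit=5):
    """Beam-constrained discovery loop (sketch, not Subdue's code)."""
    # Start from the initial candidates, best first, truncated to the beam.
    queue = sorted(dict.fromkeys(initial), key=value, reverse=True)[:beam_width]
    seen = set(queue)
    discovered = []
    # Step 4: stop when the queue is empty or the limit is reached.
    while queue and len(discovered) < limit:
        best = queue.pop(0)
        discovered.append(best)
        # Step 2: expand the best candidate, skipping ones already generated.
        children = [c for c in dict.fromkeys(expand(best)) if c not in seen]
        seen.update(children)
        # Step 3: keep only the best beam_width candidates on the queue.
        queue = sorted(queue + children, key=value, reverse=True)[:beam_width]
    return sorted(discovered, key=value, reverse=True)


TEXT = "abcabcabd"  # toy "database"


def expand(s):
    """Extend every occurrence of s by the character that follows it."""
    return [s + TEXT[i + len(s)]
            for i in range(len(TEXT) - len(s)) if TEXT.startswith(s, i)]


def value(s):
    """Toy score: characters covered by the repeated occurrences of s."""
    return (TEXT.count(s) - 1) * len(s)


best = beam_discover(TEXT, expand, value)
```

Both bounds matter: `beam_width` caps the queue after every expansion, and `limit` caps the total number of candidates taken off the queue.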

Page 7: Efficient Mining of  Graph-Based Data

Evaluation Metric

Substructures evaluated based on ability to compress input graph
Compression measured using minimum description length (DL)
Best substructure S in graph G minimizes: DL(S) + DL(G|S)
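The quantity DL(S) + DL(G|S) can be made concrete with a rough sketch. A real description length is measured in bits. The version below assumes a much cruder proxy, the number of vertices plus edges, and assumes non-overlapping instances. The function names and the example sizes are hypothetical.

```python
def size(num_vertices, num_edges):
    """Crude stand-in for description length: vertices plus edges."""
    return num_vertices + num_edges


def compressed_total(g_vertices, g_edges, s_vertices, s_edges, instances):
    """Proxy for DL(S) + DL(G|S), assuming non-overlapping instances.

    Each instance of S in G collapses to a single vertex, so G|S loses the
    instance's vertices and internal edges and gains one vertex per instance.
    """
    dl_s = size(s_vertices, s_edges)
    dl_g_given_s = size(
        g_vertices - instances * s_vertices + instances,
        g_edges - instances * s_edges,
    )
    return dl_s + dl_g_given_s


# Assumed toy graph: 10 vertices, 6 edges, and a 2-vertex, 1-edge
# substructure (such as triangle-on-square) with 4 instances.
before = size(10, 6)
after = compressed_total(10, 6, 2, 1, 4)
```

Under these assumptions the substructure is worth keeping because `after` is smaller than `before`. A substructure with a single instance would make the total larger, not smaller.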

Page 8: Efficient Mining of  Graph-Based Data

Examples

Page 9: Efficient Mining of  Graph-Based Data

Inexact Graph Match

Some variations may occur between instances
Want to abstract over minor differences
Difference = cost of transforming one graph into a graph isomorphic to the other
Match if cost/size < threshold
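The match test can be sketched as follows. Several things here are assumptions rather than Subdue's method: the vertex mapping is supplied by the caller (the search over mappings is omitted), every vertex or edge change costs 1, a changed edge label counts as one deletion plus one insertion, and "size" is taken as vertices plus edges of the larger graph. All names are hypothetical.

```python
def transform_cost(g1, g2, mapping):
    """Unit-cost edits needed to turn g1 into g2 under a given vertex mapping.

    A graph is (labels, edges): labels maps vertex id -> label and edges is a
    list of (source, edge_label, target) triples.
    """
    labels1, edges1 = g1
    labels2, edges2 = g2
    # Vertex label substitutions.
    cost = sum(1 for a, b in mapping.items() if labels1[a] != labels2[b])
    # Vertices of g1 that must be deleted, vertices of g2 that must be added.
    cost += len(labels1) - len(mapping)
    cost += len(labels2) - len(set(mapping.values()))
    # Carry g1's edges across the mapping and count the differences.
    carried = {(mapping[s], lab, mapping[d])
               for s, lab, d in edges1 if s in mapping and d in mapping}
    dropped = sum(1 for s, _, d in edges1
                  if s not in mapping or d not in mapping)
    cost += dropped + len(carried ^ set(edges2))
    return cost


def graph_size(g):
    labels, edges = g
    return len(labels) + len(edges)


def is_match(g1, g2, mapping, threshold):
    """Match if cost/size < threshold (size of the larger graph, assumed)."""
    size = max(graph_size(g1), graph_size(g2))
    return transform_cost(g1, g2, mapping) / size < threshold


g1 = ({1: "triangle", 2: "square"}, [(1, "on", 2)])
g2 = ({"a": "triangle", "b": "circle"}, [("a", "on", "b")])
mapping = {1: "a", 2: "b"}
```

With these two graphs the cost is one label substitution against a size of 3, so the pair matches at a threshold of 0.5 but not at 0.2.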

Page 10: Efficient Mining of  Graph-Based Data

Parallel/Distributed Discovery

Divide graph into P partitions using Metis, distribute to P processors
Each processor performs serial Subdue on local partition
Broadcast best substructures, evaluate on other processors
Master processor stores best global substructures
Close to linear speedup
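The broadcast-and-evaluate step can be simulated serially. The sketch below assumes each partition has already been reduced to instance counts per candidate substructure, and it uses the summed count as a stand-in for the compression value. No Metis partitioning or message passing is involved, and the names and numbers are hypothetical.

```python
from collections import Counter


def local_best(partition):
    """What one processor would report: its most frequent local pattern."""
    return partition.most_common(1)[0][0]


def global_best(partitions):
    """Serial simulation of broadcast, remote evaluation and master choice."""
    # Each "processor" broadcasts its local best candidate.
    candidates = {local_best(p) for p in partitions}
    # Every candidate is evaluated on every partition and the values summed.
    totals = {c: sum(p[c] for p in partitions) for c in candidates}
    # The "master" keeps the candidate with the best global value.
    return max(totals, key=totals.get), totals


# Assumed per-partition instance counts.
partitions = [
    Counter({"triangle-on-square": 3, "square-on-rectangle": 1}),
    Counter({"triangle-on-square": 2, "circle-on-rectangle": 3}),
    Counter({"triangle-on-square": 2, "square-on-rectangle": 1}),
]
winner, totals = global_best(partitions)
```

The second partition's local favorite loses once it is evaluated on the other partitions, which is the reason for the broadcast step.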

Page 11: Efficient Mining of  Graph-Based Data

Graph-Based Concept Learning

One graph stores positive examples
One graph stores negative examples
Find substructure that compresses positive graph but not negative graph
  Error measure: (PosEgsNotCovered) + (NegEgsCovered)
Multiple iterations implement a set-covering approach
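The set-covering loop can be sketched as below. As a simplifying assumption, each example is a set of feature names and a substructure "covers" an example when it is a subset of it, standing in for subgraph containment. The candidate list is given rather than discovered, and all names and data are hypothetical.

```python
def covers(substructure, example):
    """Stand-in for subgraph containment: subset test on feature sets."""
    return substructure <= example


def error(substructure, positives, negatives):
    """(PosEgsNotCovered) + (NegEgsCovered) for one candidate."""
    pos_not_covered = sum(1 for e in positives if not covers(substructure, e))
    neg_covered = sum(1 for e in negatives if covers(substructure, e))
    return pos_not_covered + neg_covered


def learn(candidates, positives, negatives):
    """Set covering: pick the lowest-error candidate, drop the positives it
    covers, and repeat on the positives that remain."""
    remaining = list(positives)
    theory = []
    while remaining:
        best = min(candidates, key=lambda s: error(s, remaining, negatives))
        if not any(covers(best, e) for e in remaining):
            break  # no candidate makes progress
        theory.append(best)
        remaining = [e for e in remaining if not covers(best, e)]
    return theory


positives = [
    {"triangle_on_square", "circle"},
    {"triangle_on_square"},
    {"square_on_rectangle", "circle"},
]
negatives = [{"circle"}]
candidates = [
    frozenset({"triangle_on_square"}),
    frozenset({"square_on_rectangle"}),
    frozenset({"circle"}),
]
theory = learn(candidates, positives, negatives)
```

The `circle` candidate is never chosen because it also covers the negative example. The two selected substructures together cover all three positives.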

Page 12: Efficient Mining of  Graph-Based Data

Concept-Learning Example

[Figure: example concept graph with three "object" vertices linked by two "on" edges; two of the objects have "shape" edges, to "triangle" and "square".]

Page 13: Efficient Mining of  Graph-Based Data

Concept-Learning Results

Chess endgames (19,257 examples)
  Black King is (+) or is not (-) in check
  99.8% FOIL, 99.21% Subdue

Page 14: Efficient Mining of  Graph-Based Data

More Concept-Learning Results

Tic-Tac-Toe endgames
  + is win for X (958 examples)
  100% Subdue, 92.35% FOIL
Bach chorales
  Musical sequences (20 sequences)
  100% Subdue, 85.71% FOIL

Page 15: Efficient Mining of  Graph-Based Data

Graph-Based Clustering

Iterate Subdue until the graph is compressed to a single vertex
Each cluster (substructure) inserted into a classification lattice

[Figure: classification lattice descending from a "Root" vertex.]

Page 16: Efficient Mining of  Graph-Based Data

Clustering Example: Animals

Name       Body Cover      Heart Chamber   Body Temp.   Fertilization
mammal     hair            four            regulated    internal
bird       feathers        four            regulated    internal
reptile    cornified-skin  imperfect-four  unregulated  internal
amphibian  moist-skin      three           unregulated  external
fish       scales          two             unregulated  external

[Figure: graph representation of the mammal row. An "animal" vertex has edges Name -> mammal, BodyCover -> hair, HeartChamber -> four, BodyTemp -> regulated and Fertilization -> internal.]
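The encoding of one table row as a small star-shaped graph can be sketched as follows. The vertex-id scheme and the function name are assumptions for illustration. Only the labels (an "animal" vertex, attribute names on edges, attribute values on the neighboring vertices) come from the slide.

```python
ATTRIBUTES = ["Name", "BodyCover", "HeartChamber", "BodyTemp", "Fertilization"]
ROWS = [
    ("mammal", "hair", "four", "regulated", "internal"),
    ("bird", "feathers", "four", "regulated", "internal"),
    ("reptile", "cornified-skin", "imperfect-four", "unregulated", "internal"),
    ("amphibian", "moist-skin", "three", "unregulated", "external"),
    ("fish", "scales", "two", "unregulated", "external"),
]


def row_to_graph(index, row):
    """Encode one row as an 'animal' vertex with one labelled edge per
    attribute, each pointing at a vertex labelled with the attribute value."""
    root = f"animal{index}"  # hypothetical vertex-id scheme
    vertices = {root: "animal"}
    edges = []
    for attribute, attribute_value in zip(ATTRIBUTES, row):
        vertex_id = f"{root}.{attribute}"
        vertices[vertex_id] = attribute_value
        edges.append((root, attribute, vertex_id))
    return vertices, edges


graphs = [row_to_graph(i, row) for i, row in enumerate(ROWS)]
mammal_vertices, mammal_edges = graphs[0]
```

Each row becomes a separate six-vertex component, so the input for clustering is the union of five such components.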

Page 17: Efficient Mining of  Graph-Based Data

Graph-Based Clustering Results

Animals
  HeartChamber: four, BodyTemp: regulated, Fertilization: internal
    Name: mammal, BodyCover: hair
    Name: bird, BodyCover: feathers
  BodyTemp: unregulated
    Name: reptile, BodyCover: cornified-skin, HeartChamber: imperfect-four, Fertilization: internal
    Fertilization: external
      Name: fish, BodyCover: scales, HeartChamber: two
      Name: amphibian, BodyCover: moist-skin, HeartChamber: three

Page 18: Efficient Mining of  Graph-Based Data

Cobweb Results

Comparison of Subdue and Cobweb results
  Subdue lattice produced better generalization, resulting in fewer clusters at higher levels
  Subdue lattice identifies overlap between (reptile) and (amphibian/fish)

[Figure: Cobweb hierarchy. The root "animals" has children amphibian/fish, mammal/bird and reptile; the leaves are mammal, bird, fish and amphibian.]

Page 19: Efficient Mining of  Graph-Based Data

Clustering Example: DNA

Page 20: Efficient Mining of  Graph-Based Data

Graph-Based Clustering Results

[Figure: classification lattice for the DNA data, rooted at "DNA". Coverage labels: 61%, 68%, 71%. The clusters are molecular fragments built from C, N, O, P, OH and CH2, including a phosphate group.]

Page 21: Efficient Mining of  Graph-Based Data

Evaluation of Clusterings

Traditional evaluation:

$$
\text{ClusteringQuality} = \frac{\text{InterClusterDistance}}{\text{IntraClusterDistance}}
$$

Not applicable to hierarchical domains
  Does not make sense to compare clusters in different subtrees
Not applicable to relational clusterings

Page 22: Efficient Mining of  Graph-Based Data

Properties of Good Clusterings

Small number of clusters
  Large coverage -> good generality
Big cluster descriptions
  More features -> more inferential power
Minimal or no overlap between clusters
  More distinct clusters -> better defined concepts

Page 23: Efficient Mining of  Graph-Based Data

New Evaluation Heuristic for Hierarchical Clusterings

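$$
CQ_C = \frac{\displaystyle\sum_{i=1}^{c-1} \sum_{j=i+1}^{c} \sum_{k=1}^{|H_i|} \sum_{l=1}^{|H_j|} \frac{\mathrm{distance}(H_{i,k}, H_{j,l})}{\max\bigl(\mathrm{size}(H_{i,k}), \mathrm{size}(H_{j,l})\bigr)}}{\displaystyle\sum_{i=1}^{c-1} \sum_{j=i+1}^{c} |H_i|\,|H_j|} + \sum_{i=1}^{c} CQ_{H_i}
$$

Clustering rooted at C with c children H_i, each having |H_i| instances H_{i,k}
distance() measured by inexact graph match
Animals: Subdue CQ = 2.6, Cobweb CQ = 1.7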
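One reading of this heuristic is: for a clustering rooted at C, average the size-normalized distance over all pairs of instances drawn from different children of C, then add the quality of each child's own subtree. The sketch below follows that reading and should be treated as an assumption. Instances are feature sets and the distance is the size of their symmetric difference, both stand-ins for graphs and inexact graph match. The class and function names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    instances: list                      # the instances H_{i,k} of this cluster
    children: list = field(default_factory=list)


def clustering_quality(root, distance, size):
    """Recursive quality of the clustering rooted at `root` (sketch)."""
    kids = root.children
    total = 0.0
    pairs = 0
    for i in range(len(kids) - 1):
        for j in range(i + 1, len(kids)):
            for a in kids[i].instances:
                for b in kids[j].instances:
                    total += distance(a, b) / max(size(a), size(b))
            pairs += len(kids[i].instances) * len(kids[j].instances)
    own = total / pairs if pairs else 0.0
    return own + sum(clustering_quality(k, distance, size) for k in kids)


def toy_distance(a, b):
    """Stand-in for inexact graph match cost: symmetric difference size."""
    return len(a ^ b)


left = Cluster(instances=[frozenset("ab"), frozenset("ac")])
right = Cluster(instances=[frozenset("xy")])
root = Cluster(instances=left.instances + right.instances,
               children=[left, right])
quality = clustering_quality(root, toy_distance, len)
```

Leaves contribute 0, so the value of a tree is the sum of the average inter-child distances at each internal node.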

Page 24: Efficient Mining of  Graph-Based Data

Graph-Based Data Mining: Application Domains

Biochemical domains
  Protein data
  DNA data
  Toxicology (cancer) data
Spatial-temporal domains
  Earthquake data
  Aircraft Safety and Reporting System
Telecommunications data
Program source code
Web topology

[Figure: web topology graph of "web_page" vertices joined by "hyperlink" edges; one label reads "home …".]

Page 25: Efficient Mining of  Graph-Based Data

Theoretical Analysis

Galois lattice [Liquiere et al.]
Conceptual graphs [Sowa et al.]
PAC analysis [Jappy et al.]

Page 26: Efficient Mining of  Graph-Based Data

Graph-based Data Mining

Pattern (substructure) discovery
Hierarchical discovery
Distributed discovery
Concept learning
Clustering
Compression heuristic based on minimum description length

Page 27: Efficient Mining of  Graph-Based Data

Future Work

Concept learning
  Theoretical analysis
  Comparison to ILP systems
Clustering
  Classification lattice
  Hierarchical relational conceptual clustering evaluation metric
Probabilistic substructures
Domains: WWW, source code

Page 28: Efficient Mining of  Graph-Based Data

Subdue Source Code and Data

http://cygnus.uta.edu/subdue