Survey on Frequent Pattern Mining on Graph Data - Slides

Sriskandarajah Suhothayan Kasun Gajasinghe Isuru Loku Narangoda Subash Chaturanga

Transcript of Survey on Frequent Pattern Mining on Graph Data - Slides

Page 1: Survey on Frequent Pattern Mining on Graph Data - Slides

Sriskandarajah Suhothayan, Kasun Gajasinghe, Isuru Loku Narangoda, Subash Chaturanga

Page 2: Survey on Frequent Pattern Mining on Graph Data - Slides

Outline

– Introduction
– Basic principles
– Solution patterns

Page 3: Survey on Frequent Pattern Mining on Graph Data - Slides

Introduction

Graphs can be seen everywhere.

In computer science, a graph is viewed as an abstract data structure that represents relationships among data.

Page 4: Survey on Frequent Pattern Mining on Graph Data - Slides

Graph based data mining

Graph based data mining is the discovery of useful and understandable patterns in a graph representation of data.

The main subject area of graph based data mining is identifying the frequently occurring subgraph patterns.

Page 5: Survey on Frequent Pattern Mining on Graph Data - Slides

Approaches

In the recent past, significant work has been done in this subject area to develop algorithms that mine graph data efficiently.

In this paper we discuss several such well-known algorithms under the following categories:

– Mathematical Graph Theory Based Approaches
– Greedy Search Based Approaches
– Inductive Logic Programming Approach
– Inductive Database Based Approaches

Page 6: Survey on Frequent Pattern Mining on Graph Data - Slides

Applications

– Bioinformatics: mining biochemical structures, finding biologically conserved subnetworks
– Chemical compound analysis
– Web browsing pattern analysis
– Intrusion network analysis
– Mining communication networks

Page 7: Survey on Frequent Pattern Mining on Graph Data - Slides

Basic Principles

Subgraph categories

– general subgraphs
– induced subgraphs
– connected subgraphs

Subgraph Isomorphism Problem

This asks whether there exists a one-to-one mapping from the vertices of one graph to the vertices of another graph that preserves the edges.
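The one-to-one mapping in the subgraph isomorphism problem can be illustrated with a brute-force check. This is only a minimal sketch, assuming undirected, unlabelled graphs given as vertex and edge lists and testing general (not induced) subgraphs. The function name and the example graphs are hypothetical. It tries every injective mapping, so it is only practical for very small graphs.

```python
from itertools import permutations

def is_subgraph_isomorphic(pattern_vertices, pattern_edges, graph_vertices, graph_edges):
    """Return True if the pattern maps one-to-one into the graph with every pattern edge preserved."""
    graph_edge_set = {frozenset(e) for e in graph_edges}
    pattern_vertices = list(pattern_vertices)
    # Try every injective assignment of pattern vertices to graph vertices.
    for image in permutations(graph_vertices, len(pattern_vertices)):
        mapping = dict(zip(pattern_vertices, image))
        if all(frozenset((mapping[u], mapping[v])) in graph_edge_set
               for u, v in pattern_edges):
            return True
    return False

# Hypothetical example graphs as (vertices, edges) pairs.
triangle = ([0, 1, 2], [(0, 1), (1, 2), (0, 2)])
path = (["a", "b", "c"], [("a", "b"), ("b", "c")])
square = ([0, 1, 2, 3], [(0, 1), (1, 2), (2, 3), (3, 0)])

print(is_subgraph_isomorphic(*path, *square))      # True: a 3-vertex path fits in a 4-cycle
print(is_subgraph_isomorphic(*triangle, *square))  # False: a 4-cycle contains no triangle
```

The exhaustive loop over mappings is what makes the test expensive, which is why the mining algorithms on the later slides try to avoid or limit it.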

Page 8: Survey on Frequent Pattern Mining on Graph Data - Slides

Basic Principles

Graph Invariants

Quantities that characterize the topological structure of a graph

– number of vertices
– degree of each vertex (the number of edges connected to the vertex)
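The invariants listed above are cheap to compute and can rule out an isomorphism before any mapping is tried. The sketch below assumes undirected graphs given as vertex and edge lists. The function name and the two example graphs are hypothetical.

```python
from collections import Counter

def graph_invariants(vertices, edges):
    """Compute simple invariants: vertex count, edge count, sorted degree sequence."""
    degree = Counter({v: 0 for v in vertices})  # isolated vertices keep degree 0
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return {
        "num_vertices": len(vertices),
        "num_edges": len(edges),
        "degree_sequence": sorted(degree.values(), reverse=True),
    }

# Two graphs with the same number of vertices and edges but different shape.
star = graph_invariants([0, 1, 2, 3], [(0, 1), (0, 2), (0, 3)])
path = graph_invariants([0, 1, 2, 3], [(0, 1), (1, 2), (2, 3)])

print(star["degree_sequence"])  # [3, 1, 1, 1]
print(path["degree_sequence"])  # [2, 2, 1, 1]
print(star == path)             # False, so the two graphs cannot be isomorphic
```

Equal invariants do not prove that two graphs are isomorphic. They only act as a filter: differing invariants prove that they are not.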

Page 9: Survey on Frequent Pattern Mining on Graph Data - Slides

Solution Approaches

Categorization

– Completeness: complete search, heuristic search
– Subgraph isomorphism matching problem: direct, indirect (solves the subgraph similarity problem)

Page 10: Survey on Frequent Pattern Mining on Graph Data - Slides

Solution Approaches

– Greedy search
– Inductive logic programming (ILP)
– Inductive database
– Complete level-wise search
– Support Vector Machine (SVM)

Page 11: Survey on Frequent Pattern Mining on Graph Data - Slides

Greedy search

The conventional solution

Categorized into Depth-First Search (DFS) and Breadth-First Search (BFS)

Beam search

The disadvantage: as the search proceeds, it prunes the branches that do not fit within the maximum branch number limit
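The pruning disadvantage can be seen in a small beam search sketch. The search tree and its scores below are made up purely for illustration, and `beam_search`, `expand` and `score` are hypothetical names. With a beam width of 1 the branch leading to the best state is discarded at the first level and never revisited.

```python
def beam_search(start, expand, score, beam_width, max_depth):
    """Keep only the `beam_width` best states per level; pruned branches are lost for good."""
    beam = [start]
    best = start
    for _ in range(max_depth):
        candidates = [child for state in beam for child in expand(state)]
        if not candidates:
            break
        candidates.sort(key=score, reverse=True)
        beam = candidates[:beam_width]  # everything outside the beam is pruned
        if score(beam[0]) > score(best):
            best = beam[0]
    return best

# Hypothetical toy search tree: states are strings, scores are invented.
TOY_SCORES = {"": 0, "a": 5, "b": 4, "aa": 6, "ab": 3, "ba": 10, "bb": 1}

def expand(state):
    return [state + "a", state + "b"] if len(state) < 2 else []

def score(state):
    return TOY_SCORES[state]

narrow = beam_search("", expand, score, beam_width=1, max_depth=2)
wide = beam_search("", expand, score, beam_width=2, max_depth=2)
print(narrow, wide)  # aa ba
```

The narrow beam settles for "aa" (score 6) because "b" was pruned before its child "ba" (score 10) could be generated. This is the sense in which a beam-limited search is heuristic rather than complete.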

Page 12: Survey on Frequent Pattern Mining on Graph Data - Slides

Inductive logic programming (ILP)

Induction?

A combination of 'abduction' (guessing), which selects some hypotheses, and 'justification', which seeks the hypotheses that justify the observed facts.

Page 13: Survey on Frequent Pattern Mining on Graph Data - Slides

Inductive logic programming (ILP)

positive examples + negative examples + background knowledge => hypothesis

Background knowledge is used to

– control the search process (prune some search paths)
– introduce predetermined subgraph patterns

ILP can fall into any of the four categories

Page 14: Survey on Frequent Pattern Mining on Graph Data - Slides

Inductive database

Subgraphs and relations among subgraphs are pre-generated and stored in an inductive database

Advantage: fast operation, as the basic patterns are pre-generated

Disadvantage: large amount of computation and memory utilization

Page 15: Survey on Frequent Pattern Mining on Graph Data - Slides

Complete level-wise search

It is complete and direct

Here the data are not sets of items but graphs, each combining a vertex set V(G) and an edge set E(G), which include topological information.

An extended version of the Apriori algorithm is used
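The level-wise idea can be sketched under one strong simplifying assumption: every vertex label is unique within a graph, so a pattern occurs in a graph exactly when its edge set is a subset of the graph's edge set. Real algorithms cannot assume this and need a subgraph isomorphism test instead. All names and the toy dataset are hypothetical, and patterns are not required to be connected.

```python
from itertools import combinations

def edge(u, v):
    """An undirected edge between two uniquely labelled vertices."""
    return frozenset((u, v))

def support(pattern, dataset):
    """Number of graphs containing the pattern (subset test, valid only for unique labels)."""
    return sum(1 for graph in dataset if pattern <= graph)

def levelwise_frequent_subgraphs(dataset, min_support):
    all_edges = set().union(*dataset)
    # Level 1: frequent single-edge patterns.
    level = {frozenset([e]) for e in all_edges
             if support(frozenset([e]), dataset) >= min_support}
    frequent = []
    while level:
        frequent.extend(level)
        # Join step: two k-edge patterns that differ in exactly one edge.
        candidates = {a | b for a, b in combinations(level, 2)
                      if len(a | b) == len(a) + 1}
        # Apriori pruning: every k-edge sub-pattern must itself be frequent.
        candidates = {c for c in candidates
                      if all(c - {e} in level for e in c)}
        level = {c for c in candidates if support(c, dataset) >= min_support}
    return frequent

dataset = [
    frozenset({edge("A", "B"), edge("B", "C"), edge("C", "D")}),
    frozenset({edge("A", "B"), edge("B", "C"), edge("C", "E")}),
    frozenset({edge("A", "B"), edge("C", "D")}),
]
patterns = levelwise_frequent_subgraphs(dataset, min_support=2)
print(len(patterns))  # 5: three single edges and two 2-edge patterns
```

The 3-edge candidate is pruned without counting its support, because one of its 2-edge sub-patterns (B-C together with C-D) occurs in only one graph.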

Page 16: Survey on Frequent Pattern Mining on Graph Data - Slides

Support Vector Machine (SVM)

Used for classification and regression analysis

A non-probabilistic binary linear classifier

SVM is a heuristic search and an indirect method in terms of the subgraph isomorphism problem.

Page 17: Survey on Frequent Pattern Mining on Graph Data - Slides

Categorization

– Mathematical Graph Theory Based Approaches
– Greedy Search Based Approaches
– Inductive Logic Programming Approach
– Inductive Database Based Approaches
– Kernel Function Based Approaches

Page 18: Survey on Frequent Pattern Mining on Graph Data - Slides

Greedy Search Based Approaches

Use heuristics to evaluate the solution.

Two major works: SUBDUE and GBI

Page 19: Survey on Frequent Pattern Mining on Graph Data - Slides

Graph Based Induction (GBI)

Has two methods: one for chunking and the other for extracting patterns.

Can arrive at locally minimal solutions, using pairwise chunking at each step with an opportunistic beam search.

Able to reconstruct the original graph as and when needed

The advantage of GBI is that it can handle both directed and undirected labelled graphs, even with closed paths, including closed edges.

Uses an empirical graph size definition, which limits continuous compression of the graph so that the graph never becomes a single vertex.

Extracts substructures and constructs a classifier.

Page 20: Survey on Frequent Pattern Mining on Graph Data - Slides

SUBDUE

A graph-based relational learning system

Compresses the graphs based on the Minimum Description Length (MDL) principle

Does not face high computational complexity (uses a computationally constrained beam search)

May miss some optimal subgraphs

Produces a smaller number of highly interesting patterns, rather than generating a large number of patterns from which interesting patterns need to be identified.

Runtime is much larger than that of gSpan and FSG, and non-linear in the dataset size (because of the implementation of the graph isomorphism problem)
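The compression idea behind MDL can be sketched with a crude stand-in for description length. The assumption here, not taken from the slides, is that the size of a graph is simply its number of vertices plus its number of edges, and that each non-overlapping instance of a substructure is replaced by a single vertex. An actual MDL encoding counts bits instead. The function names and the numbers are hypothetical.

```python
def graph_size(num_vertices, num_edges):
    """Assumed stand-in for description length: vertices plus edges."""
    return num_vertices + num_edges

def compression_value(graph_v, graph_e, sub_v, sub_e, instances):
    """Ratio size(G) / (size(S) + size(G compressed by S)); higher means better compression."""
    # Each non-overlapping instance collapses to one vertex and loses its internal edges.
    compressed_v = graph_v - instances * (sub_v - 1)
    compressed_e = graph_e - instances * sub_e
    original = graph_size(graph_v, graph_e)
    compressed = graph_size(sub_v, sub_e) + graph_size(compressed_v, compressed_e)
    return original / compressed

# Hypothetical graph with 12 vertices and 15 edges.
triangle_value = compression_value(12, 15, sub_v=3, sub_e=3, instances=3)
edge_value = compression_value(12, 15, sub_v=2, sub_e=1, instances=4)
print(triangle_value)               # 1.5
print(triangle_value > edge_value)  # True: the triangle compresses the graph more
```

Ranking substructures by such a value is what lets a beam search keep only the few most promising candidates at each step.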

Page 21: Survey on Frequent Pattern Mining on Graph Data - Slides

Mathematical Approaches

Apriori-based methods

– AGM
– FSG

Pattern Growth methods

– gSpan

Page 22: Survey on Frequent Pattern Mining on Graph Data - Slides

Apriori-based Approach

AGM

– Used to mine “frequent induced subgraphs”

– Works with both directed and undirected graphs

– Importantly, this algorithm is not limited to the connected graphs. It also supports isolated graphs.

Page 23: Survey on Frequent Pattern Mining on Graph Data - Slides

AGM

Breadth-first search. Creates new candidates for level k+1 by joining two graphs at level k.

AGM generates new graphs by adding a new node:

And then proceeds as per Apriori...

Page 24: Survey on Frequent Pattern Mining on Graph Data - Slides

FSG

– FSG works better on graph data sets with more edge and vertex labels
– It is an optimized version of AGM with added techniques for efficiency.
– FSG increases the efficiency of the candidate generation of frequent subgraphs by introducing the Transaction ID (TID) method.
– Efficient candidate subgraph generation algorithms.
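One common reading of the TID idea is that each frequent subgraph keeps the list of transaction (graph) IDs in which it occurs. A candidate can then only occur in the intersection of the TID lists of the frequent subgraphs it was built from, so the expensive containment test runs on fewer graphs. The sketch below assumes this reading. The `contains` function stands in for a real subgraph isomorphism test and uses a subset check on uniquely labelled edges. All names and data are hypothetical.

```python
def count_support_with_tids(candidate, parent_tid_lists, dataset, contains, min_support):
    """Return the TIDs supporting `candidate`, testing only graphs shared by all parents."""
    possible_tids = set.intersection(*parent_tid_lists)
    if len(possible_tids) < min_support:
        return set()  # pruned without a single containment test
    return {tid for tid in possible_tids if contains(dataset[tid], candidate)}

def contains(graph, pattern):
    """Stand-in for subgraph isomorphism: subset test on uniquely labelled edges."""
    return pattern <= graph

# Hypothetical transactions: TID -> set of labelled edges.
dataset = {
    0: frozenset({"A-B", "B-C"}),
    1: frozenset({"A-B", "B-C", "C-D"}),
    2: frozenset({"A-B"}),
    3: frozenset({"B-C"}),
}
tids_ab = {0, 1, 2}  # graphs containing edge A-B
tids_bc = {0, 1, 3}  # graphs containing edge B-C

candidate = frozenset({"A-B", "B-C"})
tids = count_support_with_tids(candidate, [tids_ab, tids_bc], dataset, contains, min_support=2)
print(sorted(tids))  # [0, 1]
```

Only graphs 0 and 1 are tested here instead of all four, and a candidate whose intersection is already smaller than the minimum support is dropped without any test.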

Page 25: Survey on Frequent Pattern Mining on Graph Data - Slides

FSG

– FSG is Apriori-based and therefore uses a level-wise algorithm
– Faces two challenges:
  • candidate generation: generating subgraph candidates of the next size is more complicated and costly
  • pruning false positives: the subgraph isomorphism test is an NP-complete problem

Page 26: Survey on Frequent Pattern Mining on Graph Data - Slides

gSpan

– Uses Depth-First Search (DFS)
– Can be used to find frequent subgraphs one by one, from small to large ones.
– Advantages
  • No candidate generation and no false-positive pruning
  • Better saving of space by DFS.

Pattern growth method

Page 27: Survey on Frequent Pattern Mining on Graph Data - Slides

[Figure: a graph dataset of three graphs (A), (B), (C) and the frequent patterns (1), (2) found with a minimum support of 2]

Page 28: Survey on Frequent Pattern Mining on Graph Data - Slides

Three further approaches to mining graph based data:

– Inductive Logic Programming approach
– Inductive database approach
– Kernel function based approach

Page 29: Survey on Frequent Pattern Mining on Graph Data - Slides

ILP approach.

ILP systems construct a predictive model for a given data set by searching a large space of candidate hypotheses.

WARMR – proposed in 1998. A combination of an Apriori-like level-wise search and the ILP method, but it has a high computational complexity.

FARMER – proposed in 2001. Runs two orders of magnitude faster than WARMR.

Page 30: Survey on Frequent Pattern Mining on Graph Data - Slides

Inductive DB approach.

Databases that are capable of handling patterns within data. Quite different from typical databases.

Uses an interactive querying process to mine the data in these databases.

MolFea is an effort related to this area. It mines linear fragments in chemical compounds and has better computational efficiency.

It also performs a complete search of the paths in graph data.

Page 31: Survey on Frequent Pattern Mining on Graph Data - Slides

Kernel Function based approach

This “kernel” function basically defines the similarity between two graphs

The paper covers two efforts based on this approach, which classify graphs into binary classes using an SVM (Support Vector Machine).
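To make the idea of a graph kernel concrete, the sketch below uses a deliberately simple similarity: the dot product of edge-label histograms. This is an illustrative assumption and not one of the kernels covered in the paper. Graphs are given as lists of undirected edges between vertex labels, and all names and example graphs are hypothetical.

```python
from collections import Counter

def edge_label_histogram(edges):
    """Count edges by their (sorted) pair of endpoint labels."""
    return Counter(tuple(sorted(e)) for e in edges)

def edge_histogram_kernel(edges_a, edges_b):
    """Similarity of two graphs as the dot product of their edge-label histograms."""
    hist_a = edge_label_histogram(edges_a)
    hist_b = edge_label_histogram(edges_b)
    return sum(count * hist_b[key] for key, count in hist_a.items())

# Hypothetical labelled graphs, written as lists of (label, label) edges.
g1 = [("C", "C"), ("C", "O")]
g2 = [("C", "C"), ("C", "C"), ("O", "C")]
g3 = [("N", "N")]

print(edge_histogram_kernel(g1, g2))  # 3
print(edge_histogram_kernel(g1, g3))  # 0
```

A matrix of such pairwise values over a set of graphs is the kind of input a kernel-based classifier such as an SVM works from, without ever solving the subgraph isomorphism problem directly.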