Correlation Search in Graph Databases


Page 1: Correlation Search in Graph Databases

Correlation Search in Graph Databases

Yiping Ke James Cheng Wilfred Ng

Presented By Phani Yarlagadda

Page 2: Correlation Search in Graph Databases

Outline

• Motivation

• Challenges

• Problem Definition

• Solution

• Performance Evaluation

• Related Works

Page 3: Correlation Search in Graph Databases

Motivation

• Graph Databases and their importance

• Correlation mining from graph databases

• Structural similarity and statistical similarity

Page 4: Correlation Search in Graph Databases

Challenges

• Candidate key

• High complexity graph operations

• Vast search space

Page 5: Correlation Search in Graph Databases

Problem Definition

• Pearson’s Correlation Coefficient

Popularly used correlation measure

• Definition

Given two graphs g1 and g2, the Pearson’s Correlation Coefficient of g1 and g2, denoted as φ(g1, g2), is defined as follows:

φ(g1, g2) = (supp(g1, g2) − supp(g1)·supp(g2)) / √(supp(g1)·supp(g2)·(1 − supp(g1))·(1 − supp(g2)))

where supp(g1, g2) is the joint support of g1 and g2.

When supp(g1) or supp(g2) is equal to 0 or 1, φ(g1, g2) is defined to be 0. The range of φ(g1, g2) falls within [−1, 1].

In this paper we are concerned with positively correlated graphs only.
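The definition can be made concrete with a short Python sketch. It assumes the standard φ coefficient for two binary variables, φ = (supp(g1, g2) − supp(g1)·supp(g2)) / √(supp(g1)·supp(g2)·(1 − supp(g1))·(1 − supp(g2))), where supp(g1, g2) is the fraction of database graphs containing both g1 and g2. The function and argument names are illustrative, not taken from the slides.

```python
import math

def phi(supp_g1: float, supp_g2: float, supp_joint: float) -> float:
    """Pearson's correlation coefficient (phi) computed from supports.

    supp_g1, supp_g2: fraction of database graphs containing g1 / g2.
    supp_joint: fraction containing both (hypothetical argument name).
    """
    # Boundary convention from the definition: support 0 or 1 gives phi = 0.
    if supp_g1 in (0.0, 1.0) or supp_g2 in (0.0, 1.0):
        return 0.0
    numerator = supp_joint - supp_g1 * supp_g2
    denominator = math.sqrt(supp_g1 * supp_g2 * (1 - supp_g1) * (1 - supp_g2))
    return numerator / denominator
```

Two graphs that always occur together (for instance, both with support 0.5 and joint support 0.5) give φ = 1, while two that never occur together at those supports give φ = −1.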

Page 6: Correlation Search in Graph Databases

Problem Definition

• Correlated Graphs

Two graphs g1 and g2 are correlated if and only if φ(g1, g2) ≥ θ,

where θ (0 < θ ≤ 1) is a user-specified minimum correlation threshold.

Page 7: Correlation Search in Graph Databases

Problem Definition

• Correlated Graph Search

Given a graph database D, a correlation query graph q and a minimum correlation threshold θ, the problem of Correlated Graph Search (CGS) is to find the set of all graphs that are correlated with q. The answer set of the CGS problem is defined as Aq = {(g,Dg) : φ(q, g) ≥ θ}.
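A brute-force reading of this definition can be sketched as follows. The sketch assumes that the projected database Dg of every candidate graph is already available as a set of ids of the database graphs containing it (obtaining it requires subgraph isomorphism tests, which are left out), and that φ has the standard form for binary variables. `correlated_graph_search`, `projected` and the other names are hypothetical.

```python
import math

def phi(supp_a, supp_b, supp_joint):
    # Standard phi coefficient; defined as 0 when a support is 0 or 1.
    if supp_a in (0, 1) or supp_b in (0, 1):
        return 0.0
    return (supp_joint - supp_a * supp_b) / math.sqrt(
        supp_a * supp_b * (1 - supp_a) * (1 - supp_b))

def correlated_graph_search(db_size, d_q, projected, theta):
    """Return {name: D_g} for every graph g with phi(q, g) >= theta.

    d_q: set of ids of the database graphs that contain the query q.
    projected: dict mapping a candidate graph's name to its D_g (id set).
    """
    answers = {}
    supp_q = len(d_q) / db_size
    for name, d_g in projected.items():
        supp_g = len(d_g) / db_size
        # Graphs containing both q and g are exactly the ids in both sets.
        supp_joint = len(d_q & d_g) / db_size
        if phi(supp_q, supp_g, supp_joint) >= theta:
            answers[name] = d_g
    return answers
```

For example, with 10 database graphs, Dq = {0, 1, 2, 3}, a candidate g1 with Dg = {0, 1, 2} and a candidate g2 with Dg = {3, 4, 5, 6, 7, 8}, a threshold of θ = 0.7 keeps only g1 (φ ≈ 0.80), since g2 is negatively correlated with q.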

Page 8: Correlation Search in Graph Databases

Solution-Candidate Set Generation

• Mine the set of frequent graphs (FGs) from D, using the lower and upper bounds on supp(g) as thresholds

• Drawbacks

1. All existing FG mining algorithms generate graphs with higher support before those with lower support.

2. Not efficient or scalable, especially when D is large or the lower bound is low.

Page 9: Correlation Search in Graph Databases

Solution-Candidate Set Generation

• Mine the set of FGs from the projected database Dq, using lower(q,g)/supp(q) as the minimum support threshold

• Advantages

1. Efficient candidate generation.

2. Significant reduction in search space.
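The threshold formulas themselves are not reproduced in this transcript. As a sketch, the bounds below follow from requiring φ(q, g) ≥ θ under the standard φ formula, together with supp(q, g) ≤ min(supp(q), supp(g)). Whether they match the paper's notation exactly is an assumption, and the function names are illustrative.

```python
def support_bounds(supp_q, theta):
    """Bounds on supp(g) that any g with phi(q, g) >= theta must satisfy."""
    lower = supp_q / ((1 - supp_q) / theta**2 + supp_q)
    upper = supp_q / ((1 - supp_q) * theta**2 + supp_q)
    return lower, upper

def min_support_in_dq(supp_q, theta):
    """Relative minimum support when mining inside the projected database Dq.

    Under the same derivation, `lower` also bounds the joint support
    supp(q, g) from below; dividing by supp(q) rescales it to Dq.
    """
    lower, _ = support_bounds(supp_q, theta)
    return lower / supp_q
```

With supp(q) = 0.5 and θ = 0.5, the bounds on supp(g) are about 0.2 and 0.8, and the minimum support inside Dq is about 0.4. Mining in Dq only needs a minimum support, so the miner never has to generate the high-support graphs that an upper bound on D would later discard.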

Page 10: Correlation Search in Graph Databases

Solution-Framework

• The framework of the solution consists of the following four steps.

1. Obtain the projected database Dq of q.

2. Mine the set of candidate graphs C from Dq, using lower(q,g)/supp(q) as the minimum support threshold.

3. Refine C by three heuristic rules.

4. For each candidate graph g ∈ C,

a) Obtain Dg.

b) Add (g,Dg) to Aq if φ(q, g) ≥ θ.

Page 11: Correlation Search in Graph Databases

Solution-Heuristic Rules

• Heuristic Rule 1

Given a graph g, if g ∈ C and g ⊇ q, then g ∈ base(Aq)

Identifies graphs that are guaranteed to be answers
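Why membership in C is enough for a supergraph of q can be checked numerically. The sketch assumes the rule reads "g is a supergraph of q", the standard φ formula, and a mining threshold equivalent to supp(g) ≥ supp(q) / ((1 − supp(q))/θ² + supp(q)). The values and names are illustrative.

```python
import math

def phi(a, b, joint):
    # Standard phi coefficient; defined as 0 when a support is 0 or 1.
    if a in (0, 1) or b in (0, 1):
        return 0.0
    return (joint - a * b) / math.sqrt(a * b * (1 - a) * (1 - b))

supp_q, theta = 0.4, 0.7
lower = supp_q / ((1 - supp_q) / theta**2 + supp_q)  # assumed threshold

# For a supergraph g of q, every graph containing g also contains q,
# so the joint support equals supp(g) and supp(g) <= supp(q).
checks = []
for step in range(11):
    supp_g = lower + (supp_q - lower) * step / 10
    # A small tolerance absorbs rounding at the boundary supp(g) = lower.
    checks.append(phi(supp_q, supp_g, supp_g) >= theta - 1e-12)
```

Every entry of `checks` is true here: once a supergraph of q passes the support threshold, its correlation with q already reaches θ, so no further test is needed.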

Page 12: Correlation Search in Graph Databases

Solution-Heuristic Rules

• Heuristic Rule 2

Given two graphs g1 and g2,

where g1 ⊇ g2 and

supp(g1, q) = supp(g2, q),

if g1 ∉ base(Aq), then g2 ∉ base(Aq)

Reduces the search space, saving the unrewarding query costs for false positives.
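One way to see why the rule holds, reading it as "g1 is a supergraph of g2 and g1 is not an answer": for a fixed supp(q) and a fixed joint support, φ does not increase as the support of the other graph grows, and a subgraph is contained in at least as many database graphs as its supergraph. The sketch below illustrates this with made-up support values and the standard φ formula.

```python
import math

def phi(a, b, joint):
    # Standard phi coefficient; defined as 0 when a support is 0 or 1.
    if a in (0, 1) or b in (0, 1):
        return 0.0
    return (joint - a * b) / math.sqrt(a * b * (1 - a) * (1 - b))

supp_q, joint = 0.4, 0.2        # joint support shared by g1 and g2
supp_g1, supp_g2 = 0.25, 0.35   # g2 is a subgraph of g1: supp(g2) >= supp(g1)

phi_g1 = phi(supp_q, supp_g1, joint)
phi_g2 = phi(supp_q, supp_g2, joint)
```

Here `phi_g2` is smaller than `phi_g1`, so if g1 falls below θ, g2 does as well and its projected database never has to be computed.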

Page 13: Correlation Search in Graph Databases

Solution-Heuristic Rules

• Heuristic Rule 3

Given two graphs g1 and g2,

where g1 ⊇ g2,

if supp(g2, q) < f(supp(g1)),

then g2 ∉ base(Aq)

Reduces the search space, saving the unrewarding query costs for false positives.
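The transcript does not define f. A plausible reading, stated here as an assumption, is the smallest joint support for which φ(q, g) ≥ θ still holds, obtained by rearranging the standard φ formula: f(supp(g)) = supp(q)·supp(g) + θ·√(supp(q)·supp(g)·(1 − supp(q))·(1 − supp(g))). The helper `pruned_by_rule3` is a hypothetical name.

```python
import math

def f(supp_g, supp_q, theta):
    """Assumed form of f: joint support needed for phi(q, g) >= theta."""
    return supp_q * supp_g + theta * math.sqrt(
        supp_q * supp_g * (1 - supp_q) * (1 - supp_g))

def pruned_by_rule3(joint_g2, supp_g1, supp_q, theta):
    """True if g2 (a subgraph of g1) can be dropped without computing Dg2."""
    return joint_g2 < f(supp_g1, supp_q, theta)
```

With supp(q) = 0.4, θ = 0.5 and supp(g1) = 0.3, f evaluates to roughly 0.232, so a subgraph g2 with joint support 0.2 is pruned while one with joint support 0.25 is kept for verification.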

Page 14: Correlation Search in Graph Databases

Solution-Algorithm

• Input: A graph database D, a query graph q, and a correlation threshold θ.

• Output: The answer set Aq.

1. Obtain Dq;
2. Mine FGs from Dq using lower(q,g)/supp(q) as the minimum support threshold and add the FGs to C;
3. for each graph g ∈ C in size-descending order do
4.     if (g ⊇ q)
5.         Add (g,Dg) to Aq;
6.     else
7.         Obtain Dg;
8.         if (φ(q, g) ≥ θ)
9.             Add (g,Dg) to Aq;
10.        else
11.            H2 ← {g’ ∈ C : g’ ⊆ g, supp(g’;Dq) = supp(g;Dq)};
12.            C ← C − H2;
13.        H3 ← {g’ ∈ C : g’ ⊆ g, supp(g’;Dq) < f(supp(g))/supp(q)};
14.        C ← C − H3;
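The steps above can be sketched in Python under strong simplifications, all of which are assumptions of this example: a graph is a frozenset of edge labels, "subgraph of" is plain set containment (a stand-in for subgraph isomorphism), the FG miner is a naive subset enumeration, and the threshold and f take the forms obtained by rearranging the standard φ formula.

```python
import math
from itertools import combinations

def phi(a, b, joint):
    if a in (0, 1) or b in (0, 1):
        return 0.0
    return (joint - a * b) / math.sqrt(a * b * (1 - a) * (1 - b))

def f(supp_g, supp_q, theta):
    # Assumed form of f: joint support needed for phi(q, g) >= theta.
    return supp_q * supp_g + theta * math.sqrt(
        supp_q * supp_g * (1 - supp_q) * (1 - supp_g))

def project(db, pattern):
    # Toy containment test: edge-set inclusion replaces subgraph isomorphism.
    return {i for i, graph in enumerate(db) if pattern <= graph}

def mine_candidates(db, d_q, min_supp):
    # Toy stand-in for an FG miner: edge subsets frequent in Dq, with counts.
    edges = sorted(set().union(*(db[i] for i in d_q)))
    found = {}
    for size in range(1, len(edges) + 1):
        for combo in combinations(edges, size):
            pattern = frozenset(combo)
            count = sum(1 for i in d_q if pattern <= db[i])
            if count / len(d_q) >= min_supp:
                found[pattern] = count
    return found

def correlated_graph_search(db, q, theta):
    n = len(db)
    d_q = project(db, q)                                      # step 1
    if not d_q or len(d_q) == n:
        return {}  # phi is defined as 0 when supp(q) is 0 or 1
    supp_q = len(d_q) / n
    lower = supp_q / ((1 - supp_q) / theta**2 + supp_q)       # assumed bound
    cand = mine_candidates(db, d_q, lower / supp_q)           # step 2
    answers = {}
    for g in sorted(cand, key=len, reverse=True):             # step 3
        if g not in cand:
            continue  # already removed by rule 2 or rule 3
        d_g = project(db, g)
        if g >= q:                                            # steps 4-5
            answers[g] = d_g
            continue
        supp_g = len(d_g) / n
        if phi(supp_q, supp_g, len(d_q & d_g) / n) >= theta:  # steps 8-9
            answers[g] = d_g
        else:                                                 # steps 11-12
            for h in [h for h in cand if h < g and cand[h] == cand[g]]:
                del cand[h]
        bound = f(supp_g, supp_q, theta) / supp_q             # steps 13-14
        for h in [h for h in cand if h < g and cand[h] / len(d_q) < bound]:
            del cand[h]
    return answers
```

Calling `correlated_graph_search(db, q, theta)` with `db` as a list of frozensets and `q` as a frozenset returns a dict from each answer pattern to its projected database. The subset enumeration is exponential in the number of edge labels, so this is only meant for tiny illustrative inputs.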

Page 15: Correlation Search in Graph Databases

Solution-Example

• Consider the graph database below

Page 16: Correlation Search in Graph Databases

Solution-Example

• Query q

• Candidate set

Page 17: Correlation Search in Graph Databases

Performance Evaluation

• The dataset contains the compound structures of the cancer and AIDS data from the NCI Open Database compounds.

• The dataset contains about 249k graphs.

• On average, each graph in the dataset has 21 nodes and 23 edges. The number of distinct labels for nodes and edges is 88.

• We randomly generate four sets of queries, F1, F2, F3 and F4, each of which contains 100 queries. The support ranges for the queries in F1 to F4 are [0.02, 0.05], (0.05, 0.07], (0.07, 0.1] and (0.1, 1.0], respectively.

Page 18: Correlation Search in Graph Databases

Performance Evaluation

• Effect of candidate generation

Page 19: Correlation Search in Graph Databases

Performance Evaluation

• Effect of

Page 20: Correlation Search in Graph Databases

Performance Evaluation

• Effect of Heuristic Rules

Page 21: Correlation Search in Graph Databases

Performance Evaluation

• Effect of Graph Size

Page 22: Correlation Search in Graph Databases

Related Works

• Raymond proposes an efficient algorithm MCES for similarity search.

• Williams proposes an indexing technique that adopts graph decomposition method for similarity search.

• Zhang and Feigenbaum adopted φ correlation coefficient to measure the correlated pairs in transaction databases.