Chen Chen 1, Cindy X. Lin 1, Matt Fredrikson 2, Mihai Christodorescu 3, Xifeng Yan 4, Jiawei Han 1 1...
Date posted: 19-Dec-2015
Chen Chen 1, Cindy X. Lin 1, Matt Fredrikson 2, Mihai Christodorescu 3, Xifeng Yan 4, Jiawei Han 1
1 University of Illinois at Urbana-Champaign
2 University of Wisconsin at Madison
3 IBM T. J. Watson Research Center
4 University of California at Santa Barbara

Outline
- Motivation
  - The efficiency bottleneck encountered in big networks
  - Patterns must be preserved
- Summarize-Mine
- Experiments
- Summary

Frequent Subgraph Mining
- Find all graphs p such that |D_p| >= min_sup, where D_p is the set of graphs in the database D that contain p
- Get into the topological structures of graph data
- Useful for many downstream applications

Challenges
- Subgraph isomorphism checking is inevitable for any frequent subgraph mining algorithm
- This will have problems on big networks
  - Suppose there is only one triangle in the network
  - But there are 1,000,000 length-2 paths
  - We must enumerate all these 1,000,000, because any one of them has the potential to grow into a full triangle

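To make the blow-up concrete, here is a minimal Python sketch. The graph is a made-up example (one hub with 1,000 leaves plus a single extra edge), not data from the talk, and the function names are hypothetical. It counts length-2 paths and triangles in an unlabeled, undirected graph stored as an adjacency dictionary.

```python
from itertools import combinations


def count_length2_paths(adj):
    """Count paths u - v - w: one per unordered pair of neighbours of each centre v."""
    return sum(len(nbrs) * (len(nbrs) - 1) // 2 for nbrs in adj.values())


def count_triangles(adj):
    """Count triangles by checking, for each vertex, which neighbour pairs are adjacent."""
    seen = 0
    for nbrs in adj.values():
        for u, w in combinations(sorted(nbrs), 2):
            if w in adj[u]:
                seen += 1
    return seen // 3  # every triangle is seen once from each of its three corners


def hub_with_one_triangle(n_leaves):
    """Hypothetical toy graph: hub 0 joined to leaves 1..n_leaves, plus the edge 1 - 2."""
    adj = {0: set()}
    for leaf in range(1, n_leaves + 1):
        adj[0].add(leaf)
        adj[leaf] = {0}
    adj[1].add(2)
    adj[2].add(1)
    return adj


adj = hub_with_one_triangle(1000)
paths = count_length2_paths(adj)
triangles = count_triangles(adj)
print(paths, triangles)
```

Even in this small toy graph, roughly half a million length-2 paths exist for a single triangle, which is the imbalance the slide describes.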
Too Many Embeddings
- Subgraph isomorphism is NP-hard
  - So, when the problem size increases, …
- During the checking, large graphs are grown from small subparts
  - For small subparts, there might be too many (overlapped) embeddings in a big network
  - Such embedding enumerations will finally kill us

Motivating Application
- System call graphs from security research
  - Model dependencies among system calls
  - Unique subgraph signatures for malicious programs
  - Compare malicious/benign programs
- These graphs are very big
  - Thousands of nodes on average
  - We tried state-of-the-art mining technologies, but failed

Our Approach
- Subgraph isomorphism checking cannot be done on large networks
  - So we do it on small graphs
- Summarize-Mine
  - Summarize: Merge nodes by label and collapse corresponding edges
  - Mine: Now, state-of-the-art algorithms should work

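The summarization step can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `summarize` function, its argument layout, and the choice to drop self-loops created by merging are all assumptions made here. Nodes with the same label are assigned at random to a fixed number of groups, each group becomes one summary node, and parallel edges between groups collapse into one.

```python
import random


def summarize(labels, edges, groups_per_label, rng):
    """Merge same-label nodes into random groups and collapse the edges.

    labels:           dict node -> label
    edges:            iterable of undirected edges (u, v)
    groups_per_label: dict label -> number of groups for that label
    rng:              a random.Random instance
    """
    # Each node lands in one of the groups reserved for its label.
    group_of = {
        node: (label, rng.randrange(groups_per_label[label]))
        for node, label in labels.items()
    }
    summary_labels = {group: group[0] for group in set(group_of.values())}
    summary_edges = set()
    for u, v in edges:
        gu, gv = group_of[u], group_of[v]
        if gu != gv:  # assumption: self-loops created by merging are dropped
            summary_edges.add(frozenset((gu, gv)))
    return summary_labels, summary_edges, group_of


# Toy graph: two 'a' nodes, two 'b' nodes, one 'c' node.
labels = {0: "a", 1: "a", 2: "b", 3: "b", 4: "c"}
edges = [(0, 2), (1, 3), (2, 4), (0, 1)]
s_labels, s_edges, _ = summarize(labels, edges, {"a": 1, "b": 1, "c": 1}, random.Random(0))
print(sorted(s_labels), len(s_edges))
```

With one group per label the toy graph shrinks from five nodes to three; using more groups per label gives a larger summary that preserves more structure.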
Mining after Summarization
[Figure: the original graphs G1, G2, … are summarized into g1, g2, …; mining is run on the summaries to produce the output patterns. Vertices are labeled a, b, c.]

Remedy for Pattern Changes
- Frequent subgraphs are presented on a different abstraction level
  - False negatives & false positives, compared to true patterns mined from the un-summarized database D
- False negatives (recover)
  - Randomized technique + multiple rounds
- False positives (delete)
  - Verify against D
  - Substantial work can be transferred to the summaries

Outline
- Motivation
- Summarize-Mine
  - The algorithm flow-chart
  - Recovering false negatives
  - Verifying false positives
- Experiments
- Summary

False Negatives
- For a pattern p, if each of its vertices bears a different label, then the embeddings of p must be preserved after summarization
  - Since we are merging groups of vertices by label, the nodes of p should stay in different groups
- Otherwise, ...
[Figure: an original graph G_i, its summary g_i, and a pattern p, with vertices labeled a, b, c.]

Missing Prob. of Embeddings
- Suppose
  - Assign x_j nodes for label l_j (j = 1, …, L) in the summary S_i => x_j groups of nodes with label l_j in the original graph G_i
  - Pattern p has m_j nodes with label l_j
- Then

No "Collision" for Same Labels
- Consider a specific embedding f: p -> G_i; f is preserved if the vertices in f(p) stay in different groups
- Randomly assign m_j nodes with label l_j to x_j groups; the probability that they will not "collide" is:
  $\frac{x_j (x_j - 1) \cdots (x_j - m_j + 1)}{x_j^{m_j}}$
- Multiply probabilities for independent events

Example
- A pattern with 5 labels, each label => 2 vertices
  - m_1 = m_2 = m_3 = m_4 = m_5 = 2
- Assign 20 nodes in the summary (i.e., 20 node groups in the original graph) for each label
  - The summary has 100 vertices
  - x_1 = x_2 = x_3 = x_4 = x_5 = 20
- The probability that an embedding will persist:
  $\frac{19}{20} \times \frac{19}{20} \times \frac{19}{20} \times \frac{19}{20} \times \frac{19}{20} \approx 0.774$

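The arithmetic of this example can be checked with a short Python sketch. The function names `no_collision_prob` and `persist_prob` are hypothetical, and the sketch assumes each node is assigned to a group uniformly and independently, as in the random assignment described on the previous slide.

```python
from math import prod


def no_collision_prob(m, x):
    """Probability that m nodes, each placed uniformly at random into one of
    x groups, all end up in different groups."""
    if m > x:
        return 0.0  # more nodes than groups: a collision is unavoidable
    return prod((x - i) / x for i in range(m))


def persist_prob(pattern_label_counts, groups_per_label):
    """Multiply the per-label no-collision probabilities (labels are independent)."""
    return prod(
        no_collision_prob(m, groups_per_label[label])
        for label, m in pattern_label_counts.items()
    )


# The slide's example: 5 labels, 2 pattern vertices per label, 20 groups per label.
m = {label: 2 for label in "abcde"}
x = {label: 20 for label in "abcde"}
q = persist_prob(m, x)
print(round(q, 3))  # 0.774
```

For one label the value is 20 * 19 / 20^2 = 19/20, and the five labels multiply to (19/20)^5, which rounds to 0.774.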
Extend to Multiple Graphs
- Set x_1, …, x_L to the same values across all G_i's in the database
  - The persistence probability then only depends on m_1, …, m_L, i.e., pattern p's vertex label distribution
  - We denote this probability as q(p)
- Each of p's support graphs in D has a probability of at least q(p) to continue supporting p
  - Thus, the overall support can be bounded below by a binomial random variable

Support Moves Downward
False Negative Bound
Example, Cont.
- As above, q(p) = 0.774
- min_sup = 50

| min_sup' | 40 | 39 | 38 | 37 | 36 | 35 |
|---|---|---|---|---|---|---|
| 1 round | 0.5966 | 0.4622 | 0.3346 | 0.2255 | 0.1412 | 0.0820 |
| 2 rounds | 0.3559 | 0.2136 | 0.1119 | 0.0508 | 0.0199 | 0.0067 |
| 3 rounds | 0.2123 | 0.0988 | 0.0374 | 0.0115 | 0.0028 | 0.0006 |

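One way to read the table is as a binomial tail, sketched below in Python. This is an interpretation rather than the authors' code. It assumes a pattern whose true support is exactly min_sup, that each support graph keeps supporting the pattern independently with probability q(p), and that rounds are independent, so the miss probability over t rounds is the single-round value raised to the power t. The function name `miss_prob` is hypothetical.

```python
from math import comb


def miss_prob(min_sup, min_sup_reduced, q, rounds=1):
    """P[Binomial(min_sup, q) < min_sup_reduced] ** rounds.

    Chance that a pattern with true support min_sup falls below the lowered
    threshold min_sup_reduced in every one of `rounds` independent summaries.
    """
    one_round = sum(
        comb(min_sup, k) * q**k * (1 - q) ** (min_sup - k)
        for k in range(min_sup_reduced)  # k = 0 .. min_sup_reduced - 1
    )
    return one_round**rounds


for t in (1, 2, 3):
    row = [miss_prob(50, s, 0.774, rounds=t) for s in (40, 39, 38, 37, 36, 35)]
    print(t, [f"{p:.4f}" for p in row])
```

The printed values should land close to the table entries; small differences are expected if the table was computed with an unrounded q(p). Lowering min_sup' or adding rounds both shrink the miss probability, at the cost of more false positives or more mining passes.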
False Positives
- Much easier to handle
  - Just check against the original database D
  - Discard if this "actual" support is less than min_sup
[Figure: an original graph G_i, its summary g_i, and a pattern p, with vertices labeled a, b, c.]

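A minimal sketch of this verification step in Python follows. It uses a brute-force labeled subgraph test that is only practical for tiny graphs and stands in for whatever isomorphism routine a real miner would use. The graph representation (a label dictionary plus a set of `frozenset` edges) and the names `contains` and `verify` are assumptions for illustration.

```python
from itertools import permutations


def contains(graph, pattern):
    """True if the labeled pattern embeds in the graph (brute force, tiny inputs only)."""
    g_labels, g_edges = graph
    p_labels, p_edges = pattern
    p_nodes = list(p_labels)
    # Try every injective mapping of pattern nodes onto graph nodes.
    for image in permutations(g_labels, len(p_nodes)):
        mapping = dict(zip(p_nodes, image))
        if any(g_labels[mapping[v]] != p_labels[v] for v in p_nodes):
            continue
        if all(frozenset((mapping[u], mapping[v])) in g_edges for u, v in p_edges):
            return True
    return False


def verify(candidates, database, min_sup):
    """Keep only the candidates whose actual support in the original database reaches min_sup."""
    return [p for p in candidates if sum(contains(g, p) for g in database) >= min_sup]


def edge_set(pairs):
    return {frozenset(pair) for pair in pairs}


# Original graph: a path a - b - c - a. Merging its two 'a' nodes closes a triangle.
original = ({0: "a", 1: "b", 2: "c", 3: "a"}, edge_set([(0, 1), (1, 2), (2, 3)]))
summary = ({"A": "a", "B": "b", "C": "c"}, edge_set([("A", "B"), ("B", "C"), ("C", "A")]))
triangle = ({"x": "a", "y": "b", "z": "c"}, [("x", "y"), ("y", "z"), ("z", "x")])
path = ({"x": "a", "y": "b"}, [("x", "y")])

print(contains(summary, triangle), contains(original, triangle))
print(len(verify([triangle, path], [original], min_sup=1)))
```

The triangle appears in the summary but not in the original graph, so it is a false positive that verification against D discards, while the a - b edge survives.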
The Same Skeleton as gSpan
- DFS code tree
- Depth-first search
  - Minimum DFS code?
  - Check support by isomorphism tests
  - Record all one-edge extensions along the way
  - Pass down the projected database and recurse

Integrate Verification Schemes
- Top-Down and Bottom-Up
- Possible factors
  - Amount of false positives
  - Top-down verification can be performed early
- Top-down preferred by experiments
- Transaction ID list for p1 => D_p1
  - Just search within D_p1
- Transaction ID list for p2 => D_p2
  - Just search within D - D_p2; if frequent, can stop

Summary-Guided Verification
- Substantial verification work can be performed on the summaries, as well

Iterative Summarize-Mine
- Use a single pattern tree to hold all results spanning across multiple iterations
  - No need to combine pattern sets in a final step
  - Avoid verifying patterns that have already been checked by previous iterations
  - Verified support graphs are accurate, so they can help pre-pruning in later iterations
- Details omitted

Outline
- Motivation
- Summarize-Mine
- Experiments
- Summary

Dataset
- Real data
  - W32.Stration, a family of mass-mailing worms
  - W32.Virut, W32.Delf, W32.Ldpinch, W32.Poisonivy, etc.
  - Vertex # up to 20,000 and edge # even higher
  - Avg. # of vertices: 1,300
- Synthetic data
  - Size, # of distinct node/edge labels, etc.
  - Generator details omitted

A Sample Malware Signature
- Mined from W32.Stration
- A malware reading and leaking certain registry settings related to the network devices

Comparison with gSpan
- gSpan is an efficient graph pattern mining algorithm
- Graphs with different sizes are randomly drawn
- Eventually, gSpan cannot work

The Influence of min_sup'
- Total vs. False Positives
- The gap corresponds to true patterns
- It gradually widens as we decrease min_sup'

Summarization Ratio
- 10/1 node(s) before/after summarization => ratio = 10
- Trading-off min_sup' and t as the inner loop
- A range of reasonable parameters in the middle

Scalability
- On the synthetic data
- Parameters are tuned as done above

Outline
- Motivation
- Summarize-Mine
- Experiments
- Summary

Summary
- We solve the frequent subgraph mining problem for large graphs
  - We found interesting malware signatures
  - Our algorithm is much more efficient, while the state-of-the-art mining technologies do not work
- We show that patterns can be well preserved at a higher level by a good generalization scheme
  - Very useful, given the emerging trend of huge networks
  - The data has to be preprocessed and summarized

Summary
- Our method is orthogonal to many previous works on this topic => Combine for further improvement
  - Efficient pattern space traversal
  - Other data space reduction techniques different from our compression within individual transactions
    - Transaction sampling, merging, etc.
    - They perform compression between transactions