Dr. Larry Holder School of EECS, WSU Graph-based Pattern Learning.
Date posted: 20-Dec-2015
Dr. Larry Holder, School of EECS, WSU
Graph-based Pattern Learning
Graphs
[Figure: example graphs: protein-protein interaction network, power grid, social network, Internet, Web]
Some Graph Statistics
• Web
  10B pages, 1T hyperlinks
  Topology storage: 10TB
  Google PageRank: eigenvector of the 10B x 10B adjacency matrix (sparse)
• MySpace
  100M users, 10B friendship links
  Clique/community detection
  300K new users per day
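The PageRank eigenvector mentioned above is usually approximated by power iteration over a sparse representation rather than a dense matrix. The following is a minimal sketch on a toy adjacency list; the damping factor 0.85, the iteration count, and the four-page `web` graph are assumptions for illustration, not values from the slides.

```python
def pagerank(out_links, damping=0.85, iterations=50):
    """Approximate PageRank by power iteration over an adjacency list.

    out_links maps every page to the list of pages it links to
    (a sparse stand-in for the adjacency matrix).
    """
    nodes = list(out_links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Rank held by pages without out-links is spread uniformly.
        dangling = sum(rank[v] for v in nodes if not out_links[v])
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {v: base for v in nodes}
        for v in nodes:
            for w in out_links[v]:
                new_rank[w] += damping * rank[v] / len(out_links[v])
        rank = new_rank
    return rank


# Hypothetical four-page web: every link target is also a key.
web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(web)
top_page = max(ranks, key=ranks.get)
```

Each iteration touches every hyperlink once, so the cost scales with the number of links rather than with the square of the number of pages.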
Graph Problems
• Degree
• Diameter
• Centrality
• Shortest path
• Cycles/tours
• Minimum spanning tree
• Traversals/search
• Connectivity
• Clustering
• Partitioning
• Cliques
• Motifs
• Subgraph isomorphism
• Frequent subgraphs
• Pattern learning
• Dynamics
Graph-based Pattern Learning
• Unsupervised pattern discovery
• Hierarchical conceptual clustering
• Supervised pattern learning
• Anomaly detection
• Dynamic graph pattern learning
Unsupervised Pattern Discovery
• Frequency-based (AGM, gSpan, FSG, Gaston)
  “Graph-based Data Mining”
  Find all subgraphs g within a set of graph transactions G such that
  $$\frac{|\{g' \in G \mid g \subseteq g'\}|}{|G|} \geq t$$
  where $\subseteq$ is subgraph isomorphism and t is the minimum support
  Focus on pruning and fast, code-based graph matching
  Still requires subgraph isomorphism
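The support test above can be sketched directly: count the transactions that contain the candidate subgraph and divide by the number of transactions. This sketch uses a brute-force subgraph isomorphism check on small, vertex-labeled, undirected graphs; it is not how AGM, gSpan, FSG, or Gaston match graphs. The dictionary layout and the toy transactions are assumptions for illustration.

```python
from itertools import permutations


def contains(pattern, graph):
    """True if pattern is subgraph-isomorphic to graph (brute force)."""
    graph_edges = {frozenset(e) for e in graph["edges"]}
    p_nodes = list(pattern["labels"])
    # Try every injective mapping of pattern vertices to graph vertices.
    for image in permutations(graph["labels"], len(p_nodes)):
        m = dict(zip(p_nodes, image))
        labels_ok = all(
            pattern["labels"][v] == graph["labels"][m[v]] for v in p_nodes
        )
        edges_ok = all(
            frozenset((m[u], m[v])) in graph_edges for u, v in pattern["edges"]
        )
        if labels_ok and edges_ok:
            return True
    return False


def support(pattern, transactions):
    """Fraction of graph transactions that contain the pattern."""
    return sum(contains(pattern, g) for g in transactions) / len(transactions)


# Hypothetical graph transactions with vertex labels.
db = [
    {"labels": {0: "C", 1: "C", 2: "O"}, "edges": [(0, 1), (1, 2)]},
    {"labels": {0: "C", 1: "O"}, "edges": [(0, 1)]},
    {"labels": {0: "C", 1: "C", 2: "N"}, "edges": [(0, 1), (1, 2)]},
]
c_o = {"labels": {"a": "C", "b": "O"}, "edges": [("a", "b")]}
c_n = {"labels": {"a": "C", "b": "N"}, "edges": [("a", "b")]}

t = 0.5  # assumed minimum support
frequent = [name for name, p in [("C-O", c_o), ("C-N", c_n)] if support(p, db) >= t]
```

The permutation loop is exponential in the pattern size, which is why the miners named above invest in pruning and canonical codes.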
Unsupervised Pattern Discovery
• Graph compression and the minimum description length (MDL) principle
  The best theory minimizes the description length of the theory and the description length of the data given the theory
• The best graphical pattern S minimizes the description length of S and the description length of the graph G compressed with pattern S
  $$\min_S \left( DL(S) + DL(G \mid S) \right)$$
• where description length DL(G) is the minimum number of bits needed to represent G (SUBDUE)
• Compression can be based on inexact matches to pattern
[Figure: graph with pattern instances labeled S1 and S2]
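To make the objective DL(S) + DL(G|S) concrete, the sketch below scores candidate patterns with a simplified size measure (vertices plus edges) in place of a true bit-level description length. This proxy is an assumption for illustration and is not SUBDUE's encoding; the graph sizes and instance counts are made-up numbers. Compression is modeled as replacing each non-overlapping instance of S with a single vertex.

```python
def size(num_vertices, num_edges):
    """Simplified stand-in for description length: vertices plus edges."""
    return num_vertices + num_edges


def compressed_cost(graph, pattern, instances):
    """Approximate DL(S) + DL(G|S) under the size proxy.

    graph and pattern are (num_vertices, num_edges) pairs; instances is
    the number of non-overlapping occurrences of the pattern in the graph.
    """
    gv, ge = graph
    pv, pe = pattern
    dl_s = size(pv, pe)
    # Each instance collapses to one vertex: pv - 1 vertices and pe edges vanish.
    dl_g_given_s = size(gv - instances * (pv - 1), ge - instances * pe)
    return dl_s + dl_g_given_s


# Hypothetical input graph and candidate patterns: (pattern size, instance count).
graph = (20, 24)
candidates = {"S1": ((3, 3), 4), "S2": ((2, 1), 3)}

costs = {name: compressed_cost(graph, pat, n) for name, (pat, n) in candidates.items()}
best = min(costs, key=costs.get)
```

Under these numbers the larger, more frequent pattern wins because it removes more of the graph than it costs to describe.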
Hierarchical Conceptual Clustering
• Use iterative process on input graph G
  Repeat
  • Find best pattern S in graph G
  • Add S to hierarchy
  • G = G compressed with S
  Until no more compression
• Clustering is a lattice
• Clusters described by pattern
  Not just instances as in traditional clustering techniques
[Figure: graph with labels organization, male, place, location, and affiliation]
[Figure: processing pipeline with components Mock Terrorist Scenario Event Generator (fund raising, recruitment, training, reconnaissance, ...), Observables, Message Traffic Reports (142), SRA TEES Text Extraction System, Entities and Relationships, Convert to Graph, SUBDUE Pattern Learner, Patterns]
Hierarchical pattern discovered at 7th iteration of SUBDUE
DHS Insight Project: Terrorist Group Data
Supervised Learning
• Given positive graph G+ and negative graph G-
• Find pattern S minimizing DL(G+ | S) / DL(G- | S)
• When |G+|,|G-| >> 1, find pattern S maximizing classification accuracy:
  $$\frac{TP + TN}{P + N} = \frac{|\{g \in G^+ \mid S \subseteq g\}| + |\{g \in G^- \mid S \not\subseteq g\}|}{|G^+| + |G^-|}$$
[Figure: positive graphs and negative graphs are input to SUBDUE, which outputs pattern(s)]
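The accuracy measure above counts positive graphs that contain S as true positives and negative graphs that do not contain S as true negatives. In the sketch below each graph is reduced to a set of (source label, edge label, target label) triples and containment is a plain subset test. That is a loud simplification standing in for subgraph isomorphism, and all example data is hypothetical.

```python
def accuracy(pattern, positives, negatives, matches):
    """(TP + TN) / (P + N) for a pattern over positive and negative graphs."""
    tp = sum(1 for g in positives if matches(pattern, g))
    tn = sum(1 for g in negatives if not matches(pattern, g))
    return (tp + tn) / (len(positives) + len(negatives))


def edge_subset(pattern, graph):
    """Assumed stand-in for subgraph isomorphism: every pattern triple is in the graph."""
    return pattern <= graph


# Hypothetical graphs as sets of (source label, edge label, target label).
positives = [
    {("A", "x", "B"), ("B", "y", "C")},
    {("A", "x", "B"), ("B", "y", "C"), ("C", "z", "A")},
    {("A", "x", "B")},
]
negatives = [
    {("A", "x", "B")},
    {("A", "x", "B"), ("C", "z", "A")},
]

s_large = {("A", "x", "B"), ("B", "y", "C")}
s_small = {("A", "x", "B")}

acc_large = accuracy(s_large, positives, negatives, edge_subset)
acc_small = accuracy(s_small, positives, negatives, edge_subset)
```

Here the smaller pattern covers every positive graph but also matches every negative graph, so the more specific pattern scores higher.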
Evidence Assessment, Grouping, Linking and Evaluation (EAGLE) Program (DARPA/AFRL)
• Evidence DB (EDB) contains simulated data on threat and non-threat activity
• Persons, targets, capabilities, resources, transfers, and communications
[Figure: EDB, convert EDB to SUBDUE graph format, positive & negative examples (threat, non-threat), SUBDUE, patterns, evaluate]
Results
         Examples  Entities  Relations  Accuracy  Time
Events   308       533,196   630,733    80%       86 min
Groups   84        457,209   597,163    85%       813 min
Graph Regression (with Nikhil Ketkar, WSU)
• Learn a model Yi = f(Gi), where Yi is a real number and Gi is a graph
  E.g., solubility or binding activity of chemical compounds
• One approach
  Apply frequent-graph miner to set of training graphs Gi
  Frequent subgraphs form a feature vector V
  Input {(Yi, Vi)} to linear support-vector machine
• gRegress approach
  Prune feature set based on correlation with other features and lack of correlation with Y
• Learn model using non-linear SVM or piece-wise regression
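The pruning step described for gRegress can be sketched with Pearson correlation: drop subgraph features that barely correlate with Y, then drop features that are nearly duplicates of ones already kept. The thresholds, function names, and data below are assumptions for illustration and are not taken from the gRegress implementation; the subsequent model fitting (SVM or piece-wise regression) is omitted.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 if either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0


def prune(features, y, min_target=0.3, max_mutual=0.9):
    """Keep features correlated with y and not redundant with kept features."""
    ranked = sorted(features, key=lambda f: abs(pearson(features[f], y)), reverse=True)
    kept = []
    for f in ranked:
        if abs(pearson(features[f], y)) < min_target:
            continue  # lacks correlation with the target
        if any(abs(pearson(features[f], features[k])) > max_mutual for k in kept):
            continue  # redundant with a stronger feature already kept
        kept.append(f)
    return kept


# Hypothetical data: each feature marks whether a subgraph occurs in graph Gi.
y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
features = {
    "sub_a": [0, 0, 0, 1, 1, 1],
    "sub_b": [0, 0, 0, 1, 1, 1],  # duplicate of sub_a
    "sub_c": [1, 0, 0, 1, 1, 0],  # nearly uncorrelated with y
    "sub_d": [0, 0, 1, 0, 1, 1],
}
selected = prune(features, y)
```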
Anomaly Detection (with Bill Eberle, TTU)
• Learn normative patterns of activity
• Detect small, unlikely deviations from normative patterns
• Present anomalies and their context to analyst
[Figure: Graph-Based Anomaly Detection (GBAD): activity data, convert to graph, SUBDUE, normative pattern, GBAD, anomaly]
GBAD Approach
• Determine normative pattern S using SUBDUE minimum description length (MDL) heuristic that minimizes: M(S,G) = DL(G|S) + DL(S)
• Three algorithms for handling each of the different anomaly categories
  GBAD-MDL finds anomalous modifications
  GBAD-P (Probability) finds anomalous insertions
  GBAD-MPS (Maximum Partial Substructure) finds anomalous deletions
DHS Insight Project: Cargo Data
• Shipment data from PIERS (Port Import Export Reporting Service)
• Only North American imports (U.S., Puerto Rico, Canada)
• 65,535 records (shipments)
• Information categories:
  General
  Commodity codes
  Countries and ports
  U.S. company names and locations
  Foreign shipper names and locations
  Notification party names and locations
  Shipping line, vessel and packaging
  Container
  Weight and shipment
  Financial
[Figure: example shipment graph. Vertex labels include SHIPMENT, ARRIVAL_INFO, COMMODITY, COUNTRIES_AND_PORTS, US_IMPORTER, FOREIGN_SHIPPER, VESSEL, TARIFF, CONTAINER, FINANCIAL, and CARGO, connected by HAS_A edges; attribute edges include VDATE, FPORT, USPORT, COUNTRY, NAME, FNAME, HARM_DESC, HSCODE, VALUE, BOL_NBR, HAZMAT_FLA, CONSIZE, TEUS, MTONS, SLINE, and VOYAGE, with values such as “020601”, “EMPTY RACK”, “YOKOHAMA”, “SEATTLE”, “JAPAN”, “AMERICAN TRI NET EXPRESS”, “TRI NET”, “CSCO”, “LING YUN HE”, and “TOLU4972933”]
Anomaly Detection in Cargo Data
• Marijuana seized at port in Florida [U.S. Customs Service 2000].
• Smuggler did not disclose some financial information, and ship traversed extra port.
• GBAD-P discovers the extra traversed port; GBAD-MPS discovers the missing financial information.
DHS CyberSecurity R&D Program: Insider Threat Detection using Graphs
Insider Threat Scenarios (CERT Insider Threat Documents)
1. Frontline staff reviews case (invasion of privacy).
2. Frontline staff submits case directly to a case officer (bypassing the approval officer).
3. Frontline staff recommends or decides case.
4. Approval officer reverses accept/reject recommendation from assigned case officer.
5. Unassigned case officer updates or recommends case.
6. Applicant communicates with approval officer or case officer.
7. Unassigned case officer communicates with applicant.
8. Database access from an external source or after hours.
[Figure: Gov’t ID Request Processing]
GBAD on Scenario 1
GBAD on Scenario 4
• 1000 cases
• Multiple normative patterns
• 1-3 anomalies
• No false positives
Dynamic Graph Pattern Learning (with Chang hun You, WSU)
• Dynamic graph DG = {G1, G2, …, Gn}
• Find graph rewrite rules between pairs of graphs Gi / Gi+1
  Find common subgraph between Gi and Gi+1
  Remainder of Gi to be removed (GR)
  Remainder of Gi+1 to be added (GA)
• Find transformation rules of temporal patterns in rewrite rules
  Remove (GR) at time t, then add (GA) at time t+k
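The rewrite-rule step above can be sketched with set operations when every vertex has a unique, stable name across time steps, so the common subgraph is simply the set of shared edges. That naming assumption removes the need for graph matching and is made here only for illustration; the function names and the toy dynamic graph are hypothetical.

```python
def rewrite_rule(g_i, g_next):
    """Return (GR, GA): edges to remove from Gi and edges to add for Gi+1."""
    common = g_i & g_next
    return g_i - common, g_next - common


def rewrite_rules(dynamic_graph):
    """One rewrite rule per consecutive pair of graphs in DG."""
    return [rewrite_rule(a, b) for a, b in zip(dynamic_graph, dynamic_graph[1:])]


def removal_then_addition(rules):
    """Find (t, k): structure removed at step t is added back at step t + k."""
    found = []
    for t, (removed, _) in enumerate(rules):
        if not removed:
            continue
        for k in range(1, len(rules) - t):
            if rules[t + k][1] == removed:
                found.append((t, k))
                break
    return found


# Hypothetical dynamic graph: edge B-C disappears and then reappears.
dg = [
    {("A", "B"), ("B", "C")},
    {("A", "B")},
    {("A", "B"), ("B", "C")},
]
rules = rewrite_rules(dg)
lags = removal_then_addition(rules)
```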
Dynamic Graph (BioNet)
Graph Rewriting Rule
Example: Circadian Rhythm in Drosophila (Fruit Fly)
Transformation rule (Sub 1): Structure appearing and disappearing in network.
Full temporal transformation rule: Boxes are removals (after 5 hours), and ellipses are additions (after 7 hours) of Sub 1. Cycles every 12 hours. Time 6-47 is training; time 54-66 is prediction.
Graph-based Pattern Learning
• Algorithms
  Pattern discovery and clustering
  Supervised learning
  Anomaly detection
  Dynamic graphs
• Applications
  Social networks
  Biological networks
  Computer networks
  Process flows
  (Semantic) Web
  …
linkeddata.org
High Performance Computing Issues
• Memory bottleneck
  Most real-world graphs do not fit in main memory
  Patterns of access to graph not sequential
• Computational bottleneck
  Graph and subgraph isomorphism
High Performance Computing Issues
• Functional parallelism
  Parallel search over space of candidate subgraph patterns
  • High communication to avoid redundancy
  • Child patterns rely on embeddings kept with parent
    Hinders parallelism
    Computing embeddings from scratch is NP-complete
• Data parallelism
  Partition graphs, find patterns in each partition, evaluate patterns in other partitions
  • Edge cuts may break patterns
  • May require NP-complete subgraph isomorphism
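The data-parallel concern that edge cuts may break patterns can be illustrated by listing which edges cross partitions and which pattern instances span more than one partition. The partition assignment, the path graph, and the instance list below are hypothetical; a real system would obtain instances from a pattern matcher.

```python
def cut_edges(edges, part_of):
    """Edges whose endpoints fall in different partitions."""
    return [(u, v) for u, v in edges if part_of[u] != part_of[v]]


def broken_instances(instances, part_of):
    """Pattern instances whose vertices span more than one partition."""
    return [inst for inst in instances if len({part_of[v] for v in inst}) > 1]


# Hypothetical path graph split into two partitions.
edges = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6)]
part_of = {1: 0, 2: 0, 3: 0, 4: 1, 5: 1, 6: 1}

# Hypothetical instances of a three-vertex path pattern.
instances = [(1, 2, 3), (2, 3, 4), (4, 5, 6)]

cut = cut_edges(edges, part_of)
broken = broken_instances(instances, part_of)
```

An instance such as (2, 3, 4) is invisible to a miner that only looks inside one partition, which is why patterns found locally still have to be evaluated against the other partitions.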
Data-Intensive Scalable Computing
• MapReduce [Google]
  Dean & Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters,” OSDI 2004.
• Hadoop [Yahoo]
  MapReduce
  Distributed filesystem
[Figure: Map and Reduce stages]
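The MapReduce model can be imitated in a single process to show how a simple graph problem, vertex degree, splits into a map step and a reduce step. This is a sketch of the programming model only; it does not use the Hadoop API, and the function names and edge list are assumptions for illustration.

```python
from collections import defaultdict


def map_edge(edge):
    """Map: emit (vertex, 1) for each endpoint of an edge."""
    u, v = edge
    yield u, 1
    yield v, 1


def reduce_degree(vertex, counts):
    """Reduce: sum the counts emitted for one vertex."""
    return vertex, sum(counts)


def map_reduce(records, mapper, reducer):
    """Single-process imitation of map, shuffle (group by key), and reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return dict(reducer(key, values) for key, values in groups.items())


# Hypothetical edge list.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]
degrees = map_reduce(edges, map_edge, reduce_degree)
```

Because each edge is mapped independently, the edge list can be split across machines; only the grouping step needs communication.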
Multiscale Issues
• Hierarchical networks
  Higher-level hyper-nodes summarize detail at lower levels
  E.g., Netflix prize (www.netflixprize.com)
  • 17K movies, 400K users, 100M reviews
  • E.g., user’s average rating vs. specific ratings
  • E.g., movie’s average rating vs. specific rating
[Figure: user and movie vertices connected through review vertices (e.g., rating 5, title “Matrix”), alongside a higher-level view in which user and movie vertices carry avg. rating values (3.5 and 4.5)]
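The hyper-node idea can be sketched by collapsing individual review vertices into one average-rating attribute per user and per movie. The review tuples below are hypothetical and are not taken from the Netflix prize data.

```python
from collections import defaultdict


def summarize(reviews):
    """Collapse (user, movie, rating) reviews into per-user and per-movie averages."""
    by_user, by_movie = defaultdict(list), defaultdict(list)
    for user, movie, rating in reviews:
        by_user[user].append(rating)
        by_movie[movie].append(rating)

    def mean(values):
        return sum(values) / len(values)

    user_avg = {u: mean(r) for u, r in by_user.items()}
    movie_avg = {m: mean(r) for m, r in by_movie.items()}
    return user_avg, movie_avg


# Hypothetical low-level review edges.
reviews = [("u1", "Matrix", 5), ("u1", "Heat", 2), ("u2", "Matrix", 4)]
user_avg, movie_avg = summarize(reviews)
```

The summarized graph has one vertex per user and per movie instead of one per review, at the cost of losing which specific rating a user gave to which movie.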
Conclusions
• Graph representation of relational data
• Graph-based pattern learning improves understanding of modeled behavior
• Massive, dynamic graphs
• Numerous application domains
• Graph problems computationally and memory intensive
• HPC (data-intensive computing) and multiscale approaches
For More Information
• Larry Holder, School of EECS, WSU
  Email: [email protected]
  URL: www.eecs.wsu.edu/~holder
• SUBDUE
  Source code in C
  Datasets
  www.subdue.org
• D. Cook and L. Holder (2006). Mining Graph Data, Wiley. (www.eecs.wsu.edu/mgd)