March, 2011
I3.1 Noise-Aware Data Mining in Information Networks
Xifeng Yan
University of California at Santa Barbara
INARC
INARC PI Report
Involvement
– I2.1: In-Network Storage
– I2.2: Large-Scale Information Network Processing
– I3.1: QoI Mining of Noisy, Volatile, Uncertain, and Incomplete Heterogeneous Information Networks
– I3.2: Modeling and Mining of Text-Rich Information Networks
– E1.2: Composite Network Modeling with Composite Graphs
Objective
– Concepts, Models, Theories, Methods, and Systems for measuring and operating Information Networks and Others
– Concepts and Models in Noise-aware data mining of information networks
Collaborators:
– Z. Wen (IBM/SCNARC), J. Bao (RPI/IRC), J. Han (UIUC/INARC), M. Srivatsa (IBM/INARC), V. Kawadia (BBN/IRC), S. Desai (Army)
I3.1: Noise-Aware Mining: Graph Iceberg
R1 has a high concentration of black vertices but low connectivity
R2, by contrast, has few black vertices but is well connected
R3 is an anomalous region with a high density of black vertices and high connectivity
Nan Li, et al., Towards Iceberg Analysis in Graph OLAP, in preparation for VLDB Journal
Find abnormally high densities of intrusions in a network (targeted attack)
Find online communities where sensitive topics appear abnormally often (extremist groups)
Help us study why it happens
I3.1: Noise-Aware Mining: Graph Iceberg
Huge search space
– If we confine the size of the regions to s, the total number of regions in a graph with n vertices is O(n^s)
Our method
– Find promising vertices first
– Cluster these vertices to find the communities
Promising vertices
– Aggregate the personalized PageRank scores of the neighbors where the event takes place
– High value => good vertices
Nan Li, et al., Towards Iceberg Analysis in Graph OLAP, in preparation for VLDB Journal
gIceberg: PPV-Based Aggregation
Personalized PageRank vector (PPV) aggregation
– Use PPV to measure the local closeness of two vertices
Local clustering algorithms
– Query-aggregated personalized PageRank score (PPS)
Personalized PageRank approximation
– Random-walk-based sampling
– Pair-wise PPV formula
– Active boundary
$S_q(v) = \sum_{x \mid x \in V,\; q \in L(x)} p_v(x)$
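A minimal sketch of the random-walk sampling idea, under stated assumptions: the graph is an adjacency list, the restart probability is 0.15, and a walk stuck at a dead end simply stops (none of these choices come from the slides). A walk with restart that starts at v stops at vertex x with probability p_v(x), so the fraction of walks that stop on a vertex carrying event q estimates the aggregated score S_q(v). The function name and the toy graph are hypothetical.

```python
import random

def aggregated_ppr_score(adj, has_event, v, restart=0.15, walks=2000, seed=0):
    """Monte Carlo estimate of S_q(v): the sum of p_v(x) over vertices x with event q."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(walks):
        x = v
        # Stop with probability `restart` at each step; the stopping vertex is
        # then distributed according to the personalized PageRank vector of v.
        while rng.random() >= restart:
            neighbors = adj[x]
            if not neighbors:
                break  # assumption: a walk stuck at a dead end stops there
            x = rng.choice(neighbors)
        if x in has_event:
            hits += 1
    return hits / walks

# Toy graph: two triangles joined by the edge 2-3; the event occurs on 0, 1, 2.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
has_event = {0, 1, 2}
scores = {v: aggregated_ppr_score(adj, has_event, v) for v in adj}
print(scores)
```

Vertices inside the event-dense triangle should receive higher scores than vertices in the other triangle, which is the "high value => good vertices" signal used to pick promising vertices.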
I3.1: Noise-Aware Mining: Graph Iceberg
Our model: aggregated personalized PageRank + sampling
– 10-50 times faster
A novel graph mining framework
– Find anomalous regions in large heterogeneous information networks
– Noise-aware mining: the score is an aggregate measure, which can easily overcome noise
The first of its kind in network science
Nan Li, et al., Towards Iceberg Analysis in Graph OLAP, in preparation for VLDB Journal
I3.1: Noise-Aware Mining: Structural Correlation
A novel metric, Decayed Hitting Time, is proposed to assess and rank structural correlations
SIGMOD reviewer: “Interesting problem that I haven’t seen before”
– The first of its kind defined for networks
– Sampling algorithm: 10-20 times faster
– An aggregate measure: noise-resistant
Question: Is the distribution of events (blue nodes) influenced by the network links or not? If it is, to what degree?
(UCSB) Z. Guan, J. Wu, Z. Yun, A. Singh, X. Yan, Assessing and Ranking Structural Correlations in Graphs, Proc. 2011 Int. Conf. on Management of Data (SIGMOD'11), 2011
What is structural correlation?
Real-world networks contain not only nodes and edges but also events (attributes)
– Information network: events, documents, etc.
– Social network: blog posts, rumors, opinions, online shopping, etc.
– Virus/malware infections
Viruses propagate through computer networks, email networks, or Facebook. Which one is the main channel for a specific virus/malware?
Some events are correlated with network links, while others just occur randomly
Correlation Metric in Information Networks
Question: Is the distribution of events (blue nodes) influenced by the network links or not? If it is, to what degree?
Why measure such correlation?
– Help understand the distribution of events in networks
– Help detect viral influence in the underlying network
– Correlation has to do with link type, event type, and time
How to measure?
If correlated, black nodes tend to stick together.
A naïve approach: look only at the neighborhood
General idea: compute the aggregated proximity among black nodes, which will be noise-resistant
Measure definition
The measure
– V_q: the set of nodes having event q; s(*) can be any graph proximity measure
We choose hitting time since it treats the target node set as a whole (compared to personalized PageRank, shortest distance, etc.)
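One form consistent with the description above, an aggregate of the proximity from each event node to the remaining event nodes, is sketched here. The averaging over |V_q| and the symbol ρ for the measure are assumptions, since the formula itself is not shown in this transcript.

```latex
\rho(V_q) = \frac{1}{|V_q|} \sum_{v \in V_q} s\bigl(v,\; V_q \setminus \{v\}\bigr)
```

Because every event node contributes one term, a few misplaced or missing events shift the average only slightly, which is the sense in which an aggregate measure is noise-resistant.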
Hitting time
The expected number of steps to reach a target node via random walk:
$h(v_i, B) = \sum_{t=1}^{\infty} t \cdot \Pr(T_B = t \mid x_0 = v_i)$
– B: target node set; Pr(T_B = t | x_0 = v_i): the probability that we start from v_i and reach B after t steps
Decayed Hitting Time (DHT)
Hitting time can be infinite
To calculate proximity better and faster, we propose using Decayed Hitting Time
– Mapping [1,∞) to [0,1]; a high value means high proximity
– Emphasizing the importance of short paths and reducing the impact of long paths
– Facilitating approximation of DHT
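A decay that satisfies the three properties listed above is exponential in the hitting step. The exact weight e^{-(t-1)} and the symbol \tilde{h} are assumptions for illustration; the notation Pr(T_B = t | x_0 = v_i) is the same as in the hitting-time definition.

```latex
\tilde{h}(v_i, B) = \sum_{t=1}^{\infty} e^{-(t-1)} \, \Pr(T_B = t \mid x_0 = v_i)
```

A hit at step t = 1 gets weight 1 and the weight tends to 0 as t grows, so the value stays in [0, 1] and is finite even when the target set is unreachable (all probabilities are then 0).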
DHT sampling approximation
Perform c simulated random walks from v_i
Two strategies:
– In each random walk, stop when we hit a target node. This gives an estimate of the DHT, but a walk may never stop, and in a large graph it can be time-consuming
– In each random walk, stop when we hit a target node or when the maximum number of steps (denoted by s) is reached. This gives an approximation to the estimate from the first strategy
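The second strategy can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes an adjacency-list graph in which every vertex has at least one neighbor, and it assumes the decay weight exp(-(t - 1)) for a first hit at step t. The function name and toy graph are hypothetical.

```python
import math
import random

def dht_estimate(adj, start, targets, c=1000, s=20, seed=0):
    """Estimate the decayed hitting time from `start` to the node set `targets`.

    Runs c random walks, each stopped at the first hit or after s steps.
    Assumes the decay weight exp(-(t - 1)) for a first hit at step t.
    Returns (estimate, number of walks truncated at s steps).
    """
    rng = random.Random(seed)
    total = 0.0
    truncated = 0
    for _ in range(c):
        x = start
        hit = False
        for t in range(1, s + 1):
            x = rng.choice(adj[x])  # assumes every vertex has a neighbor
            if x in targets:
                total += math.exp(-(t - 1))
                hit = True
                break
        if not hit:
            truncated += 1  # contributes 0 to the truncated estimate
    return total / c, truncated

# Path graph 0 - 1 - 2 with target set {2}.
adj = {0: [1], 1: [0, 2], 2: [1]}
estimate, truncated = dht_estimate(adj, start=0, targets={2}, c=5000, s=20)
print(estimate, truncated)
```

Returning the number of truncated walks alongside the estimate makes it possible to judge how much the cap s could have affected the result.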
Bounds for Sampling Approximation
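Suppose that, of the c random walks, some hit a target and the rest reach s steps without a hit
For each random walk in the latter group, its contribution to the untruncated estimate is lower bounded by 0 and upper bounded by the decay weight of a hit at step s + 1
These per-walk bounds yield lower and upper bounds for the untruncated estimate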
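Written out under assumed notation: let c = c' + c'', where c' walks hit a target within s steps (the set H, with hitting steps t_j) and c'' walks are cut off at s steps, and assume the decay weight e^{-(t-1)}. A cut-off walk would first hit at some step t ≥ s + 1, so its weight is at most e^{-s}. The symbols c', c'', H, and \hat{h} are illustrative and not taken from the slides.

```latex
\frac{1}{c} \sum_{j \in H} e^{-(t_j - 1)}
\;\le\; \hat{h}(v_i, B) \;\le\;
\frac{1}{c} \sum_{j \in H} e^{-(t_j - 1)} + \frac{c''}{c}\, e^{-s}
```

The gap between the two bounds is at most e^{-s}, so a modest cap s already makes the truncated estimate close to the untruncated one.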
From measure to significance
Consider a randomly selected set of m nodes
– As m increases, m randomly selected nodes tend to be closer to one another (in fact, the monotonic increase of the random-case ρ with m can be proved)
– Relying on ρ alone is not enough; we should assess how far the observed ρ deviates from the random case
An approximation method for significance
Estimating the random-case expectation
• Sampling: randomly sample c node sets of size m and estimate their ρ values (also by sampling). Then take the sample mean as the estimate of the expectation
• An approximation method by geometric distribution
– When generating the random set, each node has probability m/n of being chosen
– Relaxing: each node is chosen independently
– Starting from a node, the number of steps t until the random walk hits a target node then follows a geometric distribution; the expected DHT follows from the definition of DHT
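A worked version of the geometric approximation, under two assumptions: each step of the walk lands on a target independently with probability p = m/n (the relaxation stated above), and the decay weight is e^{-(t-1)}. The symbols p and \tilde{h} are illustrative.

```latex
\Pr(T = t) = (1 - p)^{t-1}\, p, \qquad p = \frac{m}{n}

\mathbb{E}\bigl[\tilde{h}\bigr]
  = \sum_{t=1}^{\infty} e^{-(t-1)} (1 - p)^{t-1}\, p
  = \frac{p}{1 - (1 - p)/e}
```

The closed form is the sum of a geometric series with ratio (1 - p)/e, which is below 1 for every p in [0, 1], so no sampling is needed for this estimate.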
Estimating the random-case variance: also use sampling
– Sample node sets of size m and estimate their ρ values. Then compute the sample variance
– Since we assume each DHT term in the definition of ρ is independent, the variance of ρ follows from the variance of a single DHT term
– Thus, we sample pairs, estimate their DHTs, and compute the sample variance
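The sampling route to significance can be sketched as below. Two things are assumptions rather than content of the slides: the deviation is summarized as a z-score, and the variance is taken over whole sampled node sets instead of the cheaper pair-based shortcut. `rho_fn` is a hypothetical callable that returns ρ for a node set; `toy_rho` is a stand-in on a ring graph, not the DHT-based measure.

```python
import random
import statistics

def significance_z(rho_observed, rho_fn, nodes, m, samples=200, seed=0):
    """z-score of an observed rho against rho on random node sets of size m."""
    rng = random.Random(seed)
    random_rhos = [rho_fn(rng.sample(nodes, m)) for _ in range(samples)]
    mean = statistics.mean(random_rhos)   # sample-mean estimate of the expectation
    std = statistics.stdev(random_rhos)   # square root of the sample variance
    return (rho_observed - mean) / std

def toy_rho(node_set, n=100):
    """Stand-in proximity on a ring of n nodes: the fraction of chosen nodes
    that have at least one ring neighbor in the chosen set."""
    chosen = set(node_set)
    near = sum(1 for v in chosen if (v - 1) % n in chosen or (v + 1) % n in chosen)
    return near / len(chosen)

nodes = list(range(100))
clustered = list(range(10))  # ten consecutive ring nodes
z = significance_z(toy_rho(clustered), toy_rho, nodes, m=10)
print(z)
```

A large positive value indicates that the event nodes sit much closer together than random sets of the same size.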
Experiments - Datasets
DBLP
– Co-author network
– Events: keywords in paper titles
– 815,940 nodes, 2,857,960 edges, and 171,614 events
TaoBao
– Online shopping data, friend network
– Events: products
– 794,001 nodes, 1,370,284 edges, 100 typical products
Twitter
– 40 million nodes and 1.4 billion edges
Experiments - Efficiency
Experiments - Effectiveness (TaoBao)
Experiments - Correlation Evolution (TaoBao)
Collaborations
Collaborations with researchers in other networks
– (E1.2) Work with Prithwish Basu (BBN) on network design
– (I2.1) Work with Arun Iyengar and Mudhakar Srivatsa (IBM), who have done much work on DTN and storage, on building a connection between information network processing on DTN and on clusters. Shengqi Yang will work on it at IBM this summer.
– (T2.3) Work with Vikas Kawadia (BBN) on using graph query processing for distributed trust computing. Ziyu Guan is collaborating with Vikas.
– (S1.1) Work with Zhen Wen (IBM) on the social network application of graph density indexing
– (E1.1) Work with Jie Bao (RPI) on RDF queries using neighborhood-based graph search
– (I3.1, I3.2) Work with Jiawei Han (UIUC) on graph mining
– Work with Sachi Desai (Army) on graph query language/system
Research Papers
A. Khan, N. Li, Z. Guan, X. Yan, S. Chakraborty, and S. Tao, Neighborhood Based Fast Graph Search in Large Networks, Proc. 2011 Int. Conf. on Management of Data (SIGMOD'11), 2011.
Z. Guan, J. Wu, Z. Yun, A. Singh, X. Yan, Assessing and Ranking Structural Correlations in Graphs, Proc. 2011 Int. Conf. on Management of Data (SIGMOD'11), 2011.
Qiang Qu, Feida Zhu, Xifeng Yan, Jiawei Han, Philip S. Yu, and Hongyan Li, Efficient Topological OLAP on Information Networks, DASFAA'11.
Yizhou Sun et al., PathSim: Meta Path-Based Top-K Similarity Search in Heterogeneous Information Networks, submitted to VLDB 2011.
Nan Li, et al., Towards Iceberg Analysis in Graph OLAP, to be submitted to VLDB Journal.
Research Papers
Gengxin Miao, Ziyu Guan, Louise Moser, Xifeng Yan, Shu Tao, Nikos Anerousis, Latent Association Analysis of Document Pairs, submitted to SIGKDD 2011.
Ziyu Guan et al., Diffusion through Co-occurrence Relationships for Expert Search on the Web, submitted to SIGIR 2011.
Shengqi Yang, Bo Zong, Arijit Khan, Ben Zhao, Xifeng Yan, Managing Large-Scale Graphs for Efficient Distributed Processing, submitted to VLDB 2011.
Nan Li, Arijit Khan, Xifeng Yan, and Zhen Wen, Density Index and Proximity Search on Large Graphs, to be submitted to VLDB Journal.
C. C. Aggarwal, A. Khan, X. Yan, A Probabilistic Index for Massive and Dynamic Graph Streams, Research Report, to be submitted to VLDB Journal.
Next Six Months and Path Ahead to 2012
Continue research
– Composite Network Modeling and Design
– Large-scale Information Network Processing
– Information Network in DTN
– Information Network Mining and Measuring with Noise and Dynamics
  Structural Correlation in Dynamic Situations
  Mining Graph Patterns in a Noisy Environment
  Node Mining and Inference for Multiple Information Networks (QoI)
– Information Network Modeling with Text
– Graph Query Language
– Information Network Query Engine
Brief Summary of My Team’s Work in Other Tasks
I2.2: Graph Partition for Distributed Graph Computing
Shengqi Yang, et al., Managing Large-Scale Graphs for Efficient Distributed Processing, submitted to VLDB 2011
Are typical techniques efficient for graph queries?
Graph partitioning and distribution techniques (e.g., Pregel)
Limitations:
– Unbalanced workload due to skewed uniformly distributed graph queries
– Communication overhead due to inter-machine (cross-partition) communication
Goal
– Model-based graph partitioning techniques
– First-of-its-kind distributed graph computing platform in public for information, social, and communication networks
I2.1: Adapt Sedge to DTN
Master
– Vertex->Partition Map + Network Contact Graph + Route Table (opportunistic path)
Worker
– Message Queue (Cache)
Superstep = Time slot
– e.g., 1 min, 1 hour, 1 day, etc.
[Figure: contact graph over nodes 1-6 and cluster connection over partitions P1-P6]
I2.2 gDensity: Model-Based Indexing
Problem definition (labeled proximity search)
– Label-based graph proximity search seeks to find the top-k vertex subsets with the smallest diameters for a given query of distinct labels. Each subset must cover all the labels specified in the query.
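To make the problem definition concrete, here is a brute-force sketch that enumerates every covering subset and ranks by diameter. It is a baseline for illustration only, not the gDensity index. It assumes an unweighted graph, exactly one label per vertex, and diameter measured as the largest pairwise shortest-path distance in the whole graph; the function names and the toy graph are hypothetical.

```python
from collections import deque
from itertools import product

def bfs_distances(adj, source):
    """Unweighted shortest-path distances from `source` to reachable vertices."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def top_k_label_cover(adj, label, query, k):
    """Return the k smallest-diameter vertex subsets covering every query label."""
    dist = {v: bfs_distances(adj, v) for v in adj}
    # One candidate list per query label (assumes one label per vertex).
    candidates = [[v for v in adj if label[v] == q] for q in query]
    best = {}
    for combo in product(*candidates):
        diameter = max(dist[u].get(w, float("inf")) for u in combo for w in combo)
        best[tuple(sorted(set(combo)))] = diameter
    ranked = sorted(best.items(), key=lambda item: (item[1], item[0]))
    return [(diameter, subset) for subset, diameter in ranked[:k]]

# Path graph 0 - 1 - 2 - 3 with labels A, B, A, C.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
label = {0: "A", 1: "B", 2: "A", 3: "C"}
print(top_k_label_cover(adj, label, ["A", "B", "C"], k=2))
```

The number of candidate subsets is the product of the per-label candidate counts, which is why an index is needed on large graphs.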
Using a probabilistic model to build the index
10 – 300 times faster
Nan Li et al., Density Index and Proximity Search on Large Graphs, to be submitted to VLDB Journal
I2.2: Graph Search: the Model-Based Approach
Align two networks (LinkedIn, Facebook)
Ideas
– Use an information propagation model to propagate labels in information networks
– Convert vertices to vectors
– Align sets of vectors
Query speed: 0.1 sec for WebGraph (10M vertices, 213M edges)
Information Propagation Model
A. Khan et al., Neighborhood Based Fast Graph Search in Large Networks, SIGMOD’11
I3.2: Progressive Network Analysis for Expert Search
Goal: find and rank people who have the expertise described by a user query
Web pages are noisier and contain more spam than an enterprise corpus. Both relevance and reputation should be considered
Use a heterogeneous hypergraph to model the co-occurrence relationships among people and words, and devise a heat diffusion model on the hypergraph
Applied to 0.5B web pages
Accuracy: 50%-200% improvement over the leading language model methods. Significantly overcomes noise in the Web.
Ziyu Guan, et al., “Diffusion through Co-occurrence Relationships for Expert Search on the Web”, SIGIR’11 (sub)
I3.2: Latent Association Analysis of Document Pairs
Latent Association Analysis (LAA) mines the topics of two document sets simultaneously, taking the bipartite network between two document sets into consideration
One of the first attempts to analyze the topic structures of two connected document sets, aiming to infer their mapping network model
LAA significantly outperforms existing algorithms with 70% accuracy improvement
[Figure: topic simplex for Corpus 1 and topic simplex for Corpus 2, linked through a correlation factor over document pairs]
Gengxin Miao, et al., “Latent Association Analysis of Document Pairs”, KDD’11 (sub)
E1.2: Collaborative Network Modeling And Inference
Questions:
1. How to model it?
2. How does information flow among different agents?
3. How do agents interact with each other?
4. How to measure the quality of the flow?
5. Is there any mis-interaction among these agents?
6. Can we identify the role of the agents?
7. Can we identify the relationship?
8. Can we identify the weak components?