Topological Data Analysis of Attributed Networks using Diffusion Frechet Functions with Ego-Networks

Warren Keil1 and Mehmet Aktas1

University of Central Oklahoma, Edmond OK 73034, USA, [email protected], [email protected]

Abstract. In this paper we study attributed networks using topological data analysis. We first extract the ego network of each node. We then define the diffusion Frechet function over ego networks, which takes both network topology and attribute information into consideration, to extract topological features. Next, we encode this information in persistence diagrams using functional filtrations, and finally we combine the distances between the persistence diagrams with machine learning algorithms. Our experiment suggests that this method is promising for clustering attributed networks.

1 Introduction

Topological data analysis (TDA) has been a very active area of research in the past couple of decades. One of the main goals of TDA is to extract topological features of data, sometimes referred to as the shape of the data [4]. The idea is that inherent features of the data are embedded in its shape, so finding these topological features may uncover information that is not accessible by other methods.

A popular and exciting area of TDA is network data. Networks are structured data representing relationships between objects, where nodes and edges represent objects and relationships, respectively. In this paper, we consider attributed networks, that is, networks whose vertices carry multidimensional attributes. The main focus of our study is to use both the attributes and the network topology of an attributed network to find structure and topological features in the data. If successful, this can be used to enhance clustering and other machine learning algorithms.

2 Background

While a considerable amount of research has been performed on network data, we find a few studies particularly interesting and influential for our own research. In [1], the authors first look at one vertex in the dataset and then create a subnetwork consisting of every vertex connected to that vertex. This subnetwork is referred to as an ego network. The networks used in their study were from Facebook, with vertices representing people and edges representing friends and family. They then use a weighting function to assign a weight to each edge of the ego network based on the attributes shared by its two endpoints. Through this method, they are able to identify sub-communities within Facebook networks and even to predict whether a new subgroup would form [1].

The 7th International Conference on Complex Networks and Their Applications. 11 - 13 Dec., 2018, Cambridge (UK)
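The ego-network construction from [1] described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `ego_network`, the adjacency-dictionary representation, and the toy graph are all assumptions made here, as is the choice to keep the ego itself and every edge between two alters.

```python
from itertools import combinations

def ego_network(adjacency, ego):
    """Return the ego network of `ego` as (nodes, edges).

    adjacency: dict mapping each node to the set of its neighbours
    (undirected graph). The ego network built here contains the ego,
    its neighbours (the alters), and every edge of the original graph
    that joins two of those nodes.
    """
    nodes = {ego} | set(adjacency[ego])
    edges = {
        frozenset((u, v))
        for u, v in combinations(sorted(nodes), 2)
        if v in adjacency[u]
    }
    return nodes, edges

# Toy graph (hypothetical data): a triangle a-b-c with a pendant node d on c.
graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
nodes, edges = ego_network(graph, "a")
```

Here the ego network of `a` is the triangle on `a`, `b`, `c`; node `d` is left out because it is not adjacent to `a`.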

As our second main reference, Martinez et al. [2, 3] study the effect of modifying the Frechet function by using a distance derived from the heat kernel in place of the Euclidean norm. For comparison, the classical Frechet function is defined as

$$F_\alpha(x) := \int_{\mathbb{R}^d} \|x - y\|^2 \, \alpha(dy).$$

The modified function, called the diffusion Frechet function (DFF), is

$$F(x) := \int_{\mathbb{R}^d} d^2(x, y) \, \alpha(dy),$$

where α is a probability measure and d(x,y) is the diffusion distance, which is derived from the heat kernel, that is, the fundamental solution (Green's function) of the heat equation. They show that the DFF detects features in data that the classical Frechet function does not. Specifically, the DFF is able to detect all the modes of multimodal data, as well as uncover other information contained in the data. They also show that the DFF is stable with respect to the p-Wasserstein distance. The authors further adapt the DFF to network data, where diffusion distances are computed between nodes; the DFF on networks as described in [3] is also proven to be stable with respect to the Wasserstein distance.
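To make the network version concrete, the following sketch computes a DFF on a small weighted graph. It rests on assumptions that are not spelled out in the text above: the heat kernel is taken to be exp(-tL) for the graph Laplacian L = D - W, the squared diffusion distance between nodes i and j is the squared Euclidean distance between rows i and j of that kernel, and the DFF at node i is the average of those squared distances under a probability vector on the nodes. The helper names and the series-based matrix exponential are illustrative choices suitable only for small graphs.

```python
def laplacian(n, weighted_edges):
    """Graph Laplacian L = D - W of an undirected weighted graph on nodes 0..n-1."""
    L = [[0.0] * n for _ in range(n)]
    for u, v, w in weighted_edges:
        L[u][v] -= w
        L[v][u] -= w
        L[u][u] += w
        L[v][v] += w
    return L

def mat_mul(A, B):
    """Product of two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def heat_kernel(L, t, terms=30, squarings=6):
    """Approximate exp(-t L): Taylor series of exp(-t L / 2**squarings),
    followed by repeated squaring. Adequate for small matrices only."""
    n = len(L)
    scale = t / (2 ** squarings)
    A = [[-scale * L[i][j] for j in range(n)] for i in range(n)]
    K = [[float(i == j) for j in range(n)] for i in range(n)]  # running sum, starts at I
    term = [row[:] for row in K]                               # current term A^k / k!
    for k in range(1, terms):
        term = [[x / k for x in row] for row in mat_mul(term, A)]
        K = [[K[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        K = mat_mul(K, K)
    return K

def diffusion_frechet(L, t, measure):
    """DFF value at every node: sum_j d_t(i, j)^2 * measure[j] (assumed definition)."""
    K = heat_kernel(L, t)
    n = len(L)

    def d2(i, j):
        # squared diffusion distance: squared distance between kernel rows
        return sum((K[i][k] - K[j][k]) ** 2 for k in range(n))

    return [sum(d2(i, j) * measure[j] for j in range(n)) for i in range(n)]

# Path graph 0 - 1 - 2 with unit weights and the uniform measure (toy data).
L = laplacian(3, [(0, 1, 1.0), (1, 2, 1.0)])
dff = diffusion_frechet(L, t=1.0, measure=[1 / 3, 1 / 3, 1 / 3])
```

On this path graph the two end nodes receive the same value by symmetry, and the middle node receives a smaller value because it is closer, in diffusion distance, to the rest of the graph.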

3 Approach

Our method of analyzing attributed networks is to first define an ego network for each node in the network as described in [1]. We then compute the DFF over each ego network, which assigns a value to each node. Next, we perform a functional filtration using the DFF values and the simplicial complex structure of the graph. We then compute the persistent homology of each of these filtrations and store this information as persistence diagrams (or barcodes).
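The filtration and persistence step can be illustrated for connected components (dimension 0) with a union-find structure. This sketch assumes a sublevel-set filtration in which a node enters at its function value and an edge enters at the larger value of its two endpoints; the text does not say whether sublevel or superlevel sets are used, and higher-dimensional homology is omitted. The function name and the toy values are hypothetical.

```python
def sublevel_persistence_h0(values, edges):
    """0-dimensional persistence of the sublevel-set filtration of a graph.

    values: dict node -> function value (for example, a DFF value).
    edges:  iterable of (u, v) pairs.
    Returns sorted (birth, death) pairs; components that never die
    get death = inf. Pairs with zero persistence are dropped.
    """
    parent = {v: v for v in values}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    diagram = []
    # Process edges in the order in which they enter the filtration.
    for u, v in sorted(edges, key=lambda e: max(values[e[0]], values[e[1]])):
        ru, rv = find(u), find(v)
        if ru == rv:
            continue  # the edge closes a cycle, so H0 does not change
        level = max(values[u], values[v])
        # Elder rule: the component born later dies at this level.
        young, old = (ru, rv) if values[ru] > values[rv] else (rv, ru)
        if values[young] < level:
            diagram.append((values[young], level))
        parent[young] = old  # the root stays the minimum of its component
    roots = {find(v) for v in values}
    diagram.extend((values[r], float("inf")) for r in roots)
    return sorted(diagram)

# Toy path a - b - c - d - e with hypothetical function values.
values = {"a": 0, "b": 2, "c": 1, "d": 3, "e": 0.5}
path_edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")]
diagram = sublevel_persistence_h0(values, path_edges)
```

The three local minima `a`, `c`, and `e` each start a component; `c` merges into the older component at level 2, `e` at level 3, and the component of `a` persists forever.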

The final step before clustering is to store all of this information in a distance matrix. The distance matrix is constructed by computing the Wasserstein distance between each pair of diagrams. We then use established statistical learning techniques to find structure within the data.
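A pairwise distance matrix of this kind can be sketched as follows. The code assumes diagrams that contain only finite (birth, death) points, uses the L-infinity ground metric, and finds the optimal matching by brute force over all permutations, which is feasible only for very small diagrams; a practical implementation would use an assignment solver or a TDA library instead. The function names and the three toy diagrams are hypothetical.

```python
from itertools import permutations

def wasserstein(diag_a, diag_b, p=1):
    """p-Wasserstein distance between two small persistence diagrams.

    Each diagram is a list of finite (birth, death) pairs. Every point may
    be matched to a point of the other diagram or to the diagonal.
    """
    n, m = len(diag_a), len(diag_b)
    size = n + m  # each side is padded with diagonal slots for the other

    def cost(i, j):
        if i < n and j < m:  # point matched to point
            (b1, d1), (b2, d2) = diag_a[i], diag_b[j]
            return max(abs(b1 - b2), abs(d1 - d2))
        if i < n:            # point of A matched to the diagonal
            b, d = diag_a[i]
            return (d - b) / 2
        if j < m:            # point of B matched to the diagonal
            b, d = diag_b[j]
            return (d - b) / 2
        return 0.0           # diagonal matched to diagonal

    best = min(
        sum(cost(i, j) ** p for i, j in enumerate(perm))
        for perm in permutations(range(size))
    )
    return best ** (1 / p)

def distance_matrix(diagrams, p=1):
    """Symmetric matrix of pairwise Wasserstein distances."""
    k = len(diagrams)
    D = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(i + 1, k):
            D[i][j] = D[j][i] = wasserstein(diagrams[i], diagrams[j], p)
    return D

# Three toy diagrams (hypothetical data).
diagrams = [[(0, 4)], [(0, 3)], [(0, 1), (2, 2.5)]]
D = distance_matrix(diagrams)
```

The first two diagrams differ only in the death time of a single long bar, so they end up close to each other and farther from the third diagram, whose bars are all short.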

4 Results

The data we use in our study is the Amazon co-purchasing dataset, which contains detailed product descriptions, ratings, and, in addition, a directed network that describes which pairs of items were bought by the same user. We use the first 200 items in the dataset in our experiment. The product descriptions, ratings, number of reviews, and subcategories are the attributes of our network.

We modify the weight function in [1] so that the weight between two alters equals the logistic sigmoid function times the number of reviews times the average rating of the two items. We also require their group types to be equal and their sales rank to be in a similar interval. This function then defines the structure of each ego network and the weights of its edges.

The statistical learning algorithm we find most useful is k-means clustering. We run the k-means clustering algorithm on each ego network using the diffusion distances as the metric, with the algorithm set to detect 2 through 5 means for each ego network. Next, we analyze the features of each ego network based on the cluster to which the k-means algorithm assigned it.
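The clustering step might look like the sketch below. Because k-means operates on vectors rather than on a distance matrix, this example makes an assumption the text does not state: each ego network is represented by its row of the pairwise distance matrix, and plain Lloyd iterations are run on those rows. The deterministic farthest-point initialisation, the function name, and the 4 x 4 matrix are illustrative only.

```python
def kmeans(points, k, iterations=100):
    """Plain Lloyd's algorithm with deterministic farthest-point initialisation.

    points: list of equal-length numeric sequences. Returns one label per point.
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    centers = [list(points[0])]
    while len(centers) < k:
        # next centre: the point farthest from its nearest existing centre
        far = max(points, key=lambda pt: min(sq_dist(pt, c) for c in centers))
        centers.append(list(far))

    labels = None
    for _ in range(iterations):
        new_labels = [
            min(range(k), key=lambda c: sq_dist(pt, centers[c])) for pt in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        for c in range(k):
            members = [pt for pt, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centre if a cluster becomes empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Hypothetical distance matrix for four ego networks forming two tight pairs.
dist = [
    [0.0, 0.2, 3.0, 3.1],
    [0.2, 0.0, 2.9, 3.0],
    [3.0, 2.9, 0.0, 0.3],
    [3.1, 3.0, 0.3, 0.0],
]
# Try several cluster counts, as in the experiment (only 2 and 3 fit four points).
results = {k: kmeans(dist, k) for k in (2, 3)}
```

With two clusters the first two ego networks are grouped together and the last two form the other group. A medoid-based method that consumes the distance matrix directly would be a natural alternative.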

What we find is that nodes in a cluster have many features in common, such as the type of item or movie genre, but some shared features are not as apparent. Our current and future work on this project is to identify these surprising cluster results and first to see whether they are statistically significant; that is, to use hypothesis testing to see whether the different clusters can be reliably used to predict the purchasing habits of Amazon customers. We then plan to go back and verify that the k-means results do require our combination of the ego network weighting algorithm, the DFF, and the homology group calculation. This will establish with confidence whether our use of topological data analysis reveals information in attributed network data that is not currently accessible by statistical or machine learning methods.

References

1. Leskovec, Jure, et al. Learning to Discover Social Circles in Ego Networks. Stanford Large Network Dataset Collection, snap.stanford.edu/data/. (2015)

2. Martinez, Diego H. Diaz. Multiscale Summaries of Probability Measures with Applications to Plant and Microbiome Data. Diss. The Florida State University, (2016)

3. Martínez, Diego H. Diaz, et al. Probing the geometry of data with diffusion Fréchet functions. Applied and Computational Harmonic Analysis (2018).

4. Carlsson, Gunnar. Topology and data. Bulletin of the American Mathematical Society 46.2, 255-308 (2009).
