Supervised-Learning Link Recommendation in the DBLP co-authoring network


Gabriel P Gimenes, Hugo Gualdron, Thiago R Raddo, Jose F Rodrigues Jr

Instituto de Ciencias Matematicas e de Computacao, Universidade de Sao Paulo

Av. Trabalhador Sao-carlense, 400-Centro, Sao Carlos, SP, Brasil

Click for paper: http://www.icmc.usp.br/pessoas/junio/PublishedPapers/Gimenes_et_al_IEEE-PerCom-SCI2014.pdf

This work has financial support from FAPESP (2013/10026-7, 2011/13724-1)



Summary

1 Introduction

2 Link Prediction and Metrics

3 Results

4 Conclusions



Context

Advances in the WWW led to improved mechanisms for users to interact

Data became abundant in several scenarios:

social networks, co-authoring networks, recommender systems, communication networks

Need for tools that can assist in the decision-making process

Most of the networks produced in our daily lives are dynamic, which motivates Link Recommendation



Objectives

Analysis of the Link Recommendation task on a co-authoring network: DBLP

Comparison between the most used supervised learning algorithms using performance metrics (AUC, F-measure, Precision and Recall)

Including the use of meta-classifiers such as Bagging and Random Forest

Detailed study of the parameters involved in the technique: Core(k) and the time intervals



Link Prediction and Metrics



Problem Definition

It is possible to model a co-authoring network as a graph: nodes represent individuals and edges indicate a collaboration between them

The idea is to predict/recommend new edges with supervised learning techniques, using only past and present information about the network
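The graph model described above can be sketched in a few lines of Python. This is only an illustration: the function name `build_coauthor_graph` and the toy author lists are assumptions made for the example, not part of the original work or of the DBLP data.

```python
from collections import defaultdict
from itertools import combinations


def build_coauthor_graph(papers):
    """Return {author: set of co-authors} from an iterable of author lists."""
    graph = defaultdict(set)
    for authors in papers:
        unique = set(authors)
        for a in unique:
            graph[a]  # touch the key so single-author papers still create a node
        # Every pair of authors on the same paper is joined by an undirected edge.
        for a, b in combinations(sorted(unique), 2):
            graph[a].add(b)
            graph[b].add(a)
    return dict(graph)


# Hypothetical toy records, not taken from DBLP.
papers = [["ana", "bob", "carl"], ["bob", "dora"], ["eve"]]
graph = build_coauthor_graph(papers)
print(sorted(graph["bob"]))
```

Storing the neighbourhood of each node as a set keeps the later set operations on neighbourhoods (intersection, union) cheap.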



Problem Definition

Applications exist in different domains such as:

Forecasting suspect behavior on social networks, terrorism, for example

Identifying interactions that would need intense experimentation in biology

Suggesting new collaborations/interactions to individuals on co-authoring networks



Problem Definition

Given a snapshot of a network at time t, we are interested in the edges that most likely should/could exist at time t', where t < t'.

Training a supervised classifier using topological features extracted from the network to be able to analyze its dynamics
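A minimal sketch of the temporal split, assuming each publication record is a `(year, author list)` pair. The record layout, the function name `edges_in_interval` and the toy data are hypothetical; the slides do not describe how the DBLP data was parsed.

```python
from itertools import combinations


def edges_in_interval(records, start, end):
    """Undirected co-authorship edges from papers published in [start, end]."""
    edges = set()
    for year, authors in records:
        if start <= year <= end:
            # Sorting makes (a, b) a canonical key for the undirected edge.
            for a, b in combinations(sorted(set(authors)), 2):
                edges.add((a, b))
    return edges


# Hypothetical toy records: (publication year, author list).
records = [
    (1996, ["ana", "bob"]),
    (2001, ["bob", "carl"]),
    (2006, ["ana", "carl"]),
    (2007, ["ana", "bob"]),
]
train_edges = edges_in_interval(records, 1995, 2005)  # snapshot at time t
test_edges = edges_in_interval(records, 2006, 2007)   # later interval t'
new_edges = test_edges - train_edges                  # links that appear only later
print(new_edges)
```

Here `new_edges` holds the collaborations that a recommender trained on the first interval would ideally have suggested.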



Core

Core(k) is the subset of nodes of interest

Nodes that have at least k edges in the training and test intervals are considered to be in Core(k); the other nodes are not used
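The Core(k) filter could be sketched as follows. The sketch assumes that "at least k edges in the training and test intervals" means at least k edges in each interval separately; the slides do not spell this out, so that reading, the function names and the toy edges are all assumptions.

```python
from collections import Counter


def degree_counts(edges):
    """Number of incident edges per node for a set of undirected edges."""
    counts = Counter()
    for a, b in edges:
        counts[a] += 1
        counts[b] += 1
    return counts


def core(train_edges, test_edges, k):
    """Nodes with at least k edges in the training AND in the test interval (assumed reading)."""
    train_deg = degree_counts(train_edges)
    test_deg = degree_counts(test_edges)
    nodes = set(train_deg) | set(test_deg)
    # Counter returns 0 for nodes that are absent from an interval.
    return {n for n in nodes if train_deg[n] >= k and test_deg[n] >= k}


# Hypothetical toy edges.
train_edges = {("ana", "bob"), ("ana", "carl"), ("bob", "carl"), ("carl", "dora")}
test_edges = {("ana", "bob"), ("ana", "dora"), ("bob", "dora")}
print(sorted(core(train_edges, test_edges, 2)))
```

With k = 0 every node that appears in either interval is kept, so no filtering takes place.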



Topological Features

Metric                        Equation

Common Neighbours             $CN(x, y) = |\Gamma(x) \cap \Gamma(y)|$
Jaccard Coefficient           $JC(x, y) = \frac{|\Gamma(x) \cap \Gamma(y)|}{|\Gamma(x) \cup \Gamma(y)|}$
Preferential Attachment       $PA(x, y) = |\Gamma(x)| \cdot |\Gamma(y)|$
Adamic-Adar Coefficient       $AA(x, y) = \sum_{z \in \Gamma(x) \cap \Gamma(y)} \frac{1}{\log |\Gamma(z)|}$
Geodesic Distance             shortest path between x and y
Resource Allocation Index     $RA(x, y) = \sum_{z \in \Gamma(x) \cap \Gamma(y)} \frac{1}{|\Gamma(z)|}$
Local Paths                   $LP(x, y) = |paths^{(2)}_{x,y}| + \epsilon \cdot |paths^{(3)}_{x,y}|$
Node Clustering Coefficient   $ANCC(x, y) = cc(x) + cc(y)$
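The metrics in the table can be computed from an adjacency map `{node: set of neighbours}`, as in the sketch below. Several details are assumptions rather than statements from the slides: paths of length 2 and 3 are counted as simple paths, the weight of the length-3 paths defaults to 0.01, unreachable pairs get an infinite distance, and cc is the usual local clustering coefficient. The toy graph is hypothetical.

```python
import math
from collections import deque


def common_neighbours(g, x, y):
    return len(g[x] & g[y])


def jaccard(g, x, y):
    union = g[x] | g[y]
    return len(g[x] & g[y]) / len(union) if union else 0.0


def preferential_attachment(g, x, y):
    return len(g[x]) * len(g[y])


def adamic_adar(g, x, y):
    # A common neighbour of two distinct nodes has degree >= 2, so the log is positive.
    return sum(1.0 / math.log(len(g[z])) for z in g[x] & g[y])


def resource_allocation(g, x, y):
    return sum(1.0 / len(g[z]) for z in g[x] & g[y])


def geodesic_distance(g, x, y):
    """Breadth-first search hop count; math.inf if y is unreachable from x."""
    if x == y:
        return 0
    seen, queue = {x}, deque([(x, 0)])
    while queue:
        node, dist = queue.popleft()
        for nb in g[node]:
            if nb == y:
                return dist + 1
            if nb not in seen:
                seen.add(nb)
                queue.append((nb, dist + 1))
    return math.inf


def local_paths(g, x, y, eps=0.01):
    """Paths of length 2 plus eps times paths of length 3 (eps value is an assumption)."""
    paths2 = len(g[x] & g[y])
    paths3 = sum(1 for u in g[x] if u != y for v in g[u] if v != x and v in g[y])
    return paths2 + eps * paths3


def clustering(g, x):
    """Fraction of pairs of neighbours of x that are themselves linked."""
    nbrs = list(g[x])
    d = len(nbrs)
    if d < 2:
        return 0.0
    links = sum(1 for i in range(d) for j in range(i + 1, d) if nbrs[j] in g[nbrs[i]])
    return 2.0 * links / (d * (d - 1))


def ancc(g, x, y):
    return clustering(g, x) + clustering(g, y)


# Hypothetical toy graph with edges a-c, a-d, b-c, b-d, b-e, c-d.
g = {"a": {"c", "d"}, "b": {"c", "d", "e"}, "c": {"a", "b", "d"},
     "d": {"a", "b", "c"}, "e": {"b"}}
print(common_neighbours(g, "a", "b"), geodesic_distance(g, "a", "b"))
```

For the pair (a, b) in the toy graph, the two common neighbours c and d give CN = 2, and the two length-3 paths a-c-d-b and a-d-c-b contribute the eps-weighted term of LP.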



Results



Experiments

Classification instances are node pairs, labelled positive or negative depending on whether an edge exists between them in the test interval

The metrics presented are used as the array of features for each instance

Classifier       Details
J48              Decision Tree
Naive Bayes      Probabilistic
MLP              Neural Network
Random Forest    Set of Decision Trees
Bagging          Set of Decision Trees
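A sketch of how the classification instances might be assembled. It assumes that only pairs not yet linked in the training graph are candidates, which is a common convention but is not stated in the slides; the function names, the two example features and the toy data are likewise hypothetical.

```python
from itertools import combinations


def common_neighbours(g, x, y):
    return len(g[x] & g[y])


def preferential_attachment(g, x, y):
    return len(g[x]) * len(g[y])


def build_instances(train_graph, test_edges, nodes, feature_fns):
    """One row per unlinked pair: (pair, feature vector, label)."""
    rows = []
    for x, y in combinations(sorted(nodes), 2):
        if y in train_graph[x]:
            continue  # assumed: existing co-authors are not prediction candidates
        features = [fn(train_graph, x, y) for fn in feature_fns]
        # Positive if the pair collaborates in the test interval, negative otherwise.
        label = 1 if (x, y) in test_edges else 0
        rows.append(((x, y), features, label))
    return rows


# Hypothetical training graph and test-interval edges (stored as sorted tuples).
train_graph = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
test_edges = {("a", "d")}
rows = build_instances(train_graph, test_edges, train_graph.keys(),
                       [common_neighbours, preferential_attachment])
for row in rows:
    print(row)
```

Each feature vector and label pair would then be handed to a classifier such as the ones listed above; the features are always computed on the training graph so that no information from the test interval leaks into them.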



Results

Settings considered (training interval, test interval):

[1995, 2005], [2006, 2007]

[1990, 1999], [2000, 2004]

[1995, 1999], [2000, 2004]

Using k as 0, 3, 5 and 7 in each case.



Results - Interval G[1995, 2005], G[2006, 2007]

k   Classifier   Precision   Recall   F-Measure   AUC

0   J48          0.723       0.706    0.7         0.764
0   NB           0.741       0.585    0.505       0.626
0   MLP          0.562       0.555    0.541       0.593
0   RF           0.877       0.868    0.867       0.939
0   Bagging      0.809       0.8      0.798       0.887

1   J48          0.787       0.759    0.753       0.817
1   NB           0.777       0.598    0.52        0.648
1   MLP          0.628       0.618    0.61        0.639
1   RF           0.914       0.903    0.902       0.977
1   Bagging      0.84        0.83     0.829       0.913

3   J48          0.852       0.845    0.844       0.87
3   NB           0.773       0.585    0.499       0.704
3   MLP          0.715       0.714    0.713       0.735
3   RF           0.917       0.913    0.912       0.974
3   Bagging      0.846       0.841    0.841       0.925

5   J48          0.827       0.771    0.761       0.79
5   NB           0.778       0.601    0.526       0.727
5   MLP          0.695       0.679    0.672       0.74
5   RF           0.897       0.888    0.887       0.972
5   Bagging      0.844       0.83     0.828       0.913

7   J48          0.861       0.839    0.836       0.867
7   NB           0.786       0.626    0.566       0.741
7   MLP          0.725       0.719    0.717       0.785
7   RF           0.914       0.908    0.907       0.971
7   Bagging      0.883       0.866    0.865       0.94
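For reference, the four evaluation measures can be computed as in the sketch below. It uses the textbook definitions for the positive class and the rank-based reading of AUC; the slides do not state how the reported values were averaged across classes, so this is not necessarily the exact procedure behind the table. Labels, scores and the 0.5 threshold are hypothetical.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F-measure for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def auc(y_true, scores):
    """Probability that a random positive is scored above a random negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative instance")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels and classifier scores.
y_true = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.4, 0.35, 0.8, 0.7, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed decision threshold of 0.5
print(precision_recall_f1(y_true, y_pred), auc(y_true, scores))
```

Precision, recall and F-measure depend on the chosen threshold, while AUC only depends on how the scores rank the instances.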





Results

Bagging and Random Forest classifiers significantly outperform every other classifier

We believe that the metaclassifiers are better suited for Link Prediction due to their robustness in dealing with redundant metrics and bad instances

The metaclassifiers are also less prone to overfitting errors

The parameter k and the time intervals affect the quality of the recommendation



Conclusions


We analyzed the Link Recommendation problem in the supervised learning context

Compared algorithms using evaluation metrics such as AUC, F-Measure, Precision and Recall

Each experiment was set on a different interval and we ran it with different values of k

The dataset was sensitive to long periods of time, reflecting the strong dynamism of the academic community

In our experiments the neighbourhood cut (core) was also important to further improve the results



Thanks!

Questions?

Click for paper: http://www.icmc.usp.br/pessoas/junio/PublishedPapers/Gimenes_et_al_IEEE-PerCom-SCI2014.pdf
