1
LINK PREDICTION IN CO-AUTHORSHIP NETWORK
Le Nhat Minh (A0074403N)
Supervisor: Dongyuan Lu
2
Introduction
• Link prediction
  • Predict future connections within the scope of the network
• Co-authorship network
  • A network of collaborations among researchers, scientists, and academic writers
3
Introduction
• Potential applications
  • Recommend experts or groups of researchers to an individual researcher.
4
Outline
• Problem Background
• Related Work
• Workflow
• Conclusion
• Result Analysis
• Research plan
5
Problem Background
• What connects researchers together?
• Given an instance of a co-authorship network:
  • A researcher is connected to another if they have collaborated on at least one paper.
[Figure: example co-authorship links between researchers X and Y, labeled with the years 2001 and 2004]
6
Problem Background
• How to predict a link?
• Based on these criteria:
  • Co-authorship network topology
  • Researcher’s personal information
  • Researcher’s papers
• Boost link prediction performance
  • A recommended link should be truly relevant to the authors' interests, or at least be a feasible collaboration for the researchers.
7
Related Work
• Link prediction problems in social networks
  • Liben‐Nowell, D., & Kleinberg, J., 2007
  • Bliss, C. A., Frank, M. R., Danforth, C. M., & Dodds, P. S., 2013
• In a social network, interactions among users are very dynamic, with:
  • Creation of new links within a few days
  • Deletion or replacement of existing links
• The two kinds of network present different features
  • Characteristics of an individual researcher: citations, affiliations, institutions, ...
  • Characteristics of a person: marital status, age, workplace, …
8
• Three mainstream approaches for link prediction:
• Similarity based estimation
• Liben‐Nowell, D., & Kleinberg, J., 2007
• Maximum likelihood estimation
• Murata, T., & Moriyasu, S., 2008
• Guimerà, R., & Sales-Pardo, M., 2009
• Supervised Learning model
• Pavlov, M., & Ichise, R., 2007
• Al Hasan, M., Chaoji, V., Salem, S., & Zaki, M., 2006
9
Similarity Based Estimation
• Use metrics to estimate the proximity of each pair of researchers
• Rank pairs of researchers by those proximities
• The top-ranked pairs are the likely recommendations.
10
Similarity Based Estimation
• Network structure based measurement
Some conventions:
$S_{XY}$: similarity between nodes X and Y
$\Gamma(X)$: set of neighbours of X
$\Gamma(Y)$: set of neighbours of Y
$k(X) = |\Gamma(X)|$: degree of node X
$k(Y) = |\Gamma(Y)|$: degree of node Y
11
Similarity Based Estimation
• Common Neighbor:
$S_{XY} = |\Gamma(X) \cap \Gamma(Y)|$
12
Similarity Based Estimation
• Jaccard’s coefficient:
$S_{XY} = \frac{|\Gamma(X) \cap \Gamma(Y)|}{|\Gamma(X) \cup \Gamma(Y)|}$
13
Similarity Based Estimation
• Preferential Attachment:
$S_{XY} = k(X) \times k(Y)$
14
Similarity Based Estimation
• Adamic/Adar:
$S_{XY} = \sum_{Z \in \Gamma(X) \cap \Gamma(Y)} \frac{1}{\log k(Z)}$
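The four local indices (Common Neighbor, Jaccard's coefficient, Preferential Attachment, Adamic/Adar) can be sketched in a few lines of Python. This is only an illustration: the toy graph and the function names are assumptions, not data or code from the project.

```python
import math

# Hypothetical toy co-authorship graph: author -> set of co-authors.
graph = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B", "E"},
    "D": {"A", "E"},
    "E": {"C", "D"},
}

def common_neighbors(g, x, y):
    # |Gamma(X) ∩ Gamma(Y)|
    return len(g[x] & g[y])

def jaccard(g, x, y):
    # |Gamma(X) ∩ Gamma(Y)| / |Gamma(X) ∪ Gamma(Y)|
    union = g[x] | g[y]
    return len(g[x] & g[y]) / len(union) if union else 0.0

def preferential_attachment(g, x, y):
    # k(X) * k(Y)
    return len(g[x]) * len(g[y])

def adamic_adar(g, x, y):
    # A common neighbour of two distinct nodes has degree >= 2, so log k(Z) > 0.
    return sum(1.0 / math.log(len(g[z])) for z in g[x] & g[y])

# Score one non-linked pair with every index.
pair = ("A", "E")
scores = {
    "common_neighbors": common_neighbors(graph, *pair),
    "jaccard": jaccard(graph, *pair),
    "preferential_attachment": preferential_attachment(graph, *pair),
    "adamic_adar": adamic_adar(graph, *pair),
}
print(scores)
```

Ranking all non-linked pairs by any one of these scores gives the similarity based recommendation list.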
15
Similarity Based Estimation
• Shortest Path:
  • The minimum number of edges connecting two nodes.
• PageRank:
  • A random walk on the graph assigns each node the probability that it is reached. The proximity between a pair of nodes can be determined by the sum of the two nodes' PageRank scores.
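A minimal sketch of the two global measures, assuming an unweighted, undirected graph stored as adjacency sets. The toy graph, the damping factor 0.85, and the function names are assumptions made for illustration.

```python
from collections import deque

# Hypothetical toy co-authorship graph: author -> set of co-authors.
graph = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B", "E"},
    "D": {"A", "E"},
    "E": {"C", "D"},
}

def shortest_path_length(g, source, target):
    """Breadth-first search; returns None if target is unreachable."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for nb in g[node]:
            if nb == target:
                return dist + 1
            if nb not in seen:
                seen.add(nb)
                queue.append((nb, dist + 1))
    return None

def pagerank(g, damping=0.85, iterations=100):
    """Power iteration; assumes every node has at least one neighbour."""
    n = len(g)
    rank = {v: 1.0 / n for v in g}
    for _ in range(iterations):
        new_rank = {}
        for v in g:
            # Each neighbour u passes an equal share of its rank to v.
            incoming = sum(rank[u] / len(g[u]) for u in g[v])
            new_rank[v] = (1.0 - damping) / n + damping * incoming
        rank = new_rank
    return rank

def pagerank_proximity(rank, x, y):
    # Proximity of a pair as the sum of the two nodes' PageRank scores.
    return rank[x] + rank[y]

rank = pagerank(graph)
print(shortest_path_length(graph, "B", "E"), pagerank_proximity(rank, "B", "E"))
```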
16
Maximum Likelihood Estimation
• Predefine specific rules for the network
  • Requires prior knowledge of the network
• The likelihood of any non-connected link is calculated according to those rules.
17
Supervised Learning Model
• Construct multi-dimensional feature vectors
• Feed these vectors to classifiers to optimize a target function (training the model)
• Link prediction becomes a binary classification problem
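To make the binary-classification framing concrete, here is a small sketch in which each candidate pair becomes a feature vector with a 0/1 label and a one-rule decision stump is fitted. The stump is a stand-in chosen for brevity, not one of the classifiers evaluated in this work, and the sample values are invented.

```python
# Hypothetical toy data: each candidate pair is described by a feature vector
# [common_neighbors, preferential_attachment] and a label (1 = link, 0 = no link).
samples = [
    ([3, 12], 1),
    ([2, 9], 1),
    ([2, 6], 1),
    ([1, 8], 0),
    ([0, 4], 0),
    ([0, 10], 0),
]

def train_stump(data):
    """Pick the (feature, threshold) whose rule 'value >= threshold -> link'
    makes the fewest mistakes on the training data."""
    best = None
    n_features = len(data[0][0])
    for f in range(n_features):
        for threshold in sorted({x[f] for x, _ in data}):
            errors = sum((x[f] >= threshold) != bool(y) for x, y in data)
            if best is None or errors < best[0]:
                best = (errors, f, threshold)
    return best[1], best[2]

def predict(stump, x):
    feature, threshold = stump
    return 1 if x[feature] >= threshold else 0

stump = train_stump(samples)
print(stump, predict(stump, [2, 5]), predict(stump, [1, 20]))
```

On this toy data the stump splits on the common-neighbour count; a real experiment would replace it with the classifiers named in the related work.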
18
Supervised Learning Model
• Related work (Al Hasan, M., Chaoji, V., Salem, S., & Zaki, M., 2006) using:
  • Decision Tree
  • SVM (Linear Kernel)
  • K nearest neighbor
  • Multilayer Perceptron
  • Naive Bayes
  • Bagging
• Combine many classifiers (Pavlov, M., & Ichise, R., 2007)
  • Decision stump + AdaBoost
  • Decision Tree + AdaBoost
  • SMO + AdaBoost
19
Summary
• Similarity based estimation
  • Does not perform very well
• Maximum likelihood
  • Depends on the network
• Supervised learning model
  • Performs better than similarity based estimation
20
Workflow
[Figure: workflow diagram with the labels Classifier, Model, Features]
21
Graph Description
• Co-authorship graph:
• Undirected graph G (V , E)
• Node or Vertex ( Author )
• Author ID
• Author Name
• Link or Edge (Co-authorship)
• Pair of author ID
• List of publication year followed by paper title
(Ex: 2004 :”Introduction to …” )
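One possible in-memory layout for this graph is sketched below. The record format, the placeholder names and titles, and the helper functions are assumptions; the sketch also assumes the two author IDs in a record are always different.

```python
from collections import defaultdict

# Hypothetical records: (author_id_1, author_id_2, year, paper_title).
records = [
    (1, 2, 2004, "Paper P1"),
    (2, 1, 2006, "Paper P2"),
    (2, 3, 2005, "Paper P3"),
]

# Node attributes: author ID -> author name (placeholder names).
authors = {1: "Author One", 2: "Author Two", 3: "Author Three"}

def build_graph(recs):
    """Return {frozenset({a, b}): [(year, title), ...]} for an undirected graph."""
    edges = defaultdict(list)
    for a, b, year, title in recs:
        # frozenset makes (1, 2) and (2, 1) the same undirected edge.
        edges[frozenset((a, b))].append((year, title))
    return edges

def neighbours(edges):
    """Return {author_id: set of co-author IDs}."""
    nb = defaultdict(set)
    for pair in edges:
        a, b = tuple(pair)
        nb[a].add(b)
        nb[b].add(a)
    return nb

edges = build_graph(records)
print(dict(edges))
print(dict(neighbours(edges)))
```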
22
Setting up data
• The dataset is separated into 2 time spans: 2000 – 2010 and 2010 – 2013
• The first is for training, the latter is for testing.
• Currently, there are 134,307 researchers in the 2000 – 2013 network.
• Remove authors who do not appear in the testing period, leaving 104,265 researchers
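The split can be sketched as follows. The slides give overlapping spans, so treating years before 2010 as training and 2010 onward as testing is an assumption, as are the toy edges and the function name. The sketch also counts only collaborations that are new in the testing period as prediction targets.

```python
def split_by_period(edges, split_year=2010):
    """edges: {frozenset({a, b}): [year, ...]}.
    Assumption: years < split_year are training, years >= split_year are testing."""
    train, test = set(), set()
    for pair, years in edges.items():
        if any(y < split_year for y in years):
            train.add(pair)
        if any(y >= split_year for y in years):
            test.add(pair)
    # Keep only authors who appear in the testing period.
    active = set().union(*test) if test else set()
    train = {p for p in train if p <= active}
    # Only collaborations that are new in the testing period count as targets.
    new_links = test - train
    return train, new_links, active

# Hypothetical toy edges with the publication years of each collaboration.
toy_edges = {
    frozenset((1, 2)): [2003, 2011],
    frozenset((2, 3)): [2005],
    frozenset((1, 3)): [2012],
    frozenset((4, 5)): [2001],
}
train, new_links, active = split_by_period(toy_edges)
print(train, new_links, active)
```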
23
Setting up data
• Choose a subset from the 104,265 researchers
• Experiment on 937 researchers

                      2000-2010   2010-2013
Real Network
  No. of nodes          104,265     104,265
  No. of links          413,691      35,558
Experiment Network
  No. of nodes              937         937
  No. of links            3,093          57
24
Baseline Features
• Extract features from the network structure:
• Local similarity
• Common Neighbor
• Adamic / Adar
• Preferential Attachment
• Jaccard’s coefficient
• Global similarity
• Shortest Path
• PageRank
25
Baseline Features
• Feature for co-authorship network
  • Keyword matching (Cohen, S., & Ebel, L., 2013)
    A suggested metric that measures textual relevance using a TF-IDF based function.
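As a rough illustration of a TF-IDF based relevance score, the sketch below builds TF-IDF vectors from each author's keywords and compares them with cosine similarity. This is a generic formulation, not necessarily the function used by Cohen & Ebel; the smoothed IDF, the keyword lists, and the function names are assumptions.

```python
import math
from collections import Counter

# Hypothetical keyword lists, e.g. words taken from each author's paper titles.
docs = {
    "A": ["link", "prediction", "graph", "graph"],
    "B": ["graph", "mining", "link"],
    "C": ["protein", "folding"],
}

def tfidf_vectors(documents):
    n = len(documents)
    # Document frequency: in how many authors' keyword lists a term occurs.
    df = Counter(term for words in documents.values() for term in set(words))
    vectors = {}
    for name, words in documents.items():
        tf = Counter(words)
        vectors[name] = {}
        for term, count in tf.items():
            # Smoothed IDF (an assumption) so shared terms keep a non-zero weight.
            idf = math.log((1 + n) / (1 + df[term])) + 1.0
            vectors[name][term] = count * idf
    return vectors

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

vectors = tfidf_vectors(docs)
print(cosine(vectors["A"], vectors["B"]), cosine(vectors["A"], vectors["C"]))
```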
26
Proposed Features
Productivity of the authors
Observe the “history” of an author
For example, at a particular node A:

i    T_i     n    m
0    2000    3    1
1    2004    4    2
2    2005    6    2
3    2006    7    3

n: No. of shared papers; m: No. of collaborators
Increments between consecutive time points: (Δn, Δm) = (1, 1), (2, 0), (1, 1)
27
Proposed Features
Productivity of the authors
Observe the “history” of an author
The “productivity” of node A:
[Equation: P(A), defined in terms of the counts $n_{T_i}$ and $m_{T_i}$, the time points $T_i$, and the weight α]
α: a constant that assigns the weight of each time period
28
Training set
• Set up training data
  • With n nodes, there are $\frac{n(n-1)}{2}$ possible links.
  • Among those, separate two kinds of link
    • Positive link: links that appear in the training years.
    • Negative link: the remaining links, which do not exist in the training years.
  Note: Avoid biased training by balancing the number of instances between the true and false labels.
• Classify all the non-existent links
• Compare with the testing data
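A minimal sketch of this set-up, assuming the balancing is done by randomly undersampling the negative pairs (the slides do not say which balancing method was used). The function name and toy links are hypothetical.

```python
import random
from itertools import combinations

def build_training_set(nodes, train_links, seed=0):
    """Label every unordered pair and undersample negatives to match positives."""
    all_pairs = [frozenset(p) for p in combinations(sorted(nodes), 2)]
    positives = [p for p in all_pairs if p in train_links]
    negatives = [p for p in all_pairs if p not in train_links]
    rng = random.Random(seed)
    # Assumption: balance the classes by sampling as many negatives as positives.
    sampled = rng.sample(negatives, min(len(positives), len(negatives)))
    labelled = [(p, 1) for p in positives] + [(p, 0) for p in sampled]
    return labelled, negatives

# Hypothetical toy network: 5 nodes, 3 links observed in the training years.
nodes = range(1, 6)
train_links = {frozenset((1, 2)), frozenset((2, 3)), frozenset((3, 4))}
labelled, negatives = build_training_set(nodes, train_links)
print(len(labelled), len(negatives))
```

For the experiment network, 937 nodes give 937 × 936 / 2 = 438,516 possible links; removing the 3,093 training links leaves 435,423 non-existent links to classify, which equals the total of the confusion matrix in the experimental results (26 + 31 + 5,588 + 429,778).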
29
Experimental Results
• Measurement of performance
  • Precision: $P = \frac{26}{5588 + 26} \approx 0.005$
  • Recall: $R = \frac{26}{31 + 26} \approx 0.456$
  • Harmonic mean: $F_1 = \frac{2 \times 26}{2 \times 26 + 5588 + 31} \approx 0.009$
• New links to predict: 57 links

Confusion matrix (rows: actual, columns: predicted)
                      Predicted True Link   Predicted False Link
Actual True Link                       26                     31
Actual False Link                   5,588                429,778
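The three measures follow directly from the confusion matrix counts. The short sketch below recomputes them from the reported true positives (26), false positives (5,588) and false negatives (31); the function name is an assumption.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 from confusion matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 as the harmonic mean of precision and recall, written with counts.
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return precision, recall, f1

p, r, f1 = precision_recall_f1(tp=26, fp=5588, fn=31)
print(f"P={p:.4f} R={r:.4f} F1={f1:.4f}")  # prints P=0.0046 R=0.4561 F1=0.0092
```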
30
Result Analysis
• Possible reasons
• Features
• Small set of data – sampling problem
• Instances of the negative links used for training
31
Research Plan
• Use weighted graph with parameters:
• No. of papers
• No. of neighbors
• No. of citations
• Focus on features that specifically target the co-authorship network:
• Citations
• Institutions
• Enlarge the experiment dataset size
Thank you
32
References
• Adamic, L. A., & Adar, E. (2003). Friends and neighbors on the web. Social Networks, 25(3), 211-230.
• Al Hasan, M., Chaoji, V., Salem, S., & Zaki, M. (2006). Link prediction using supervised learning. In SDM’06: Workshop on Link Analysis, Counter-terrorism and Security.
• Liben‐Nowell, D., & Kleinberg, J. (2007). The link‐prediction problem for social networks. Journal of the American Society for Information Science and Technology, 58(7), 1019-1031.
• Pavlov, M., & Ichise, R. (2007). Finding Experts by Link Prediction in Co-authorship Networks. FEWS, 290, 42-55.
• Murata, T., & Moriyasu, S. (2008). Link prediction based on structural properties of online social networks. New Generation Computing, 26(3), 245-257.
• Guimerà, R., & Sales-Pardo, M. (2009). Missing and spurious interactions and the reconstruction of complex networks. Proceedings of the National Academy of Sciences, 106(52), 22073-22078.
• Bliss, C. A., Frank, M. R., Danforth, C. M., & Dodds, P. S. (2013). An Evolutionary Algorithm Approach to Link Prediction in Dynamic Social Networks. arXiv preprint arXiv:1304.6257.
• Cohen, S., & Ebel, L. (2013). Recommending collaborators using keywords. In Proceedings of the 22nd International Conference on World Wide Web Companion (pp. 959-962).
33
• The number of links per year in the training set is greater than in the testing set:
  • In the testing period, only “new” collaborations are considered.
  • Any collaboration between researchers who already have a link is disregarded.

                2000-2010   2010-2013
No. of nodes          937         937
No. of links        3,093          57
34
Results with different classifiers

Classifier              Precision (%)   Recall (%)   F1 (%)
Decision Tree                     0.3         24.6      0.5
SMO                               0.5         45.6      0.9
Bagging                           0.4         28.1      0.7
Naive Bayes                       0.2         77.2      0.3
Multilayer Perceptron             0.4         47.3      0.8

Precision: positive predictive value; Recall: hit rate; F1: harmonic mean
35
Proposed Feature
• The reason for proposing this feature:
  • Keep track of the researcher's tendency
  • Give a “bonus” to researchers who tend to collaborate with “new” colleagues rather than “old” ones
  • Also give a high score to prolific researchers (based on number of published papers)
36
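Stochastic Block Model
• Guimerà, R., & Sales-Pardo, M., 2009
$L(A \mid M) = \prod_{\alpha \le \beta} Q_{\alpha\beta}^{\,l_{\alpha\beta}} (1 - Q_{\alpha\beta})^{\,r_{\alpha\beta} - l_{\alpha\beta}}$
$r_{\alpha\beta}$: No. of pairs of nodes such that one node is in $\alpha$ and the other is in $\beta$
$l_{\alpha\beta}$: No. of edges between groups $\alpha$, $\beta$
$Q_{\alpha\beta}$: probability that two nodes in groups $\alpha$, $\beta$ are connected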
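The block likelihood can be evaluated for one fixed partition as sketched below. Two simplifications are assumed here: the connection probability of each group pair is plugged in as its observed fraction l/r, and only a single partition is scored (no averaging over partitions). The edge list is invented for illustration.

```python
import math
from itertools import combinations

def block_log_likelihood(edges, partition):
    """Log of prod Q^l (1-Q)^(r-l) over group pairs, with Q set to l/r
    (a maximum-likelihood plug-in, which is an assumption of this sketch)."""
    group_of = {v: g for g, members in enumerate(partition) for v in members}
    nodes = sorted(group_of)
    r, l = {}, {}
    for x, y in combinations(nodes, 2):
        key = tuple(sorted((group_of[x], group_of[y])))
        r[key] = r.get(key, 0) + 1          # pairs of nodes in this group pair
        if frozenset((x, y)) in edges:
            l[key] = l.get(key, 0) + 1      # observed edges in this group pair
    log_l = 0.0
    for key, r_ab in r.items():
        l_ab = l.get(key, 0)
        q = l_ab / r_ab
        if 0.0 < q < 1.0:  # group pairs with q = 0 or 1 contribute a factor of 1
            log_l += l_ab * math.log(q) + (r_ab - l_ab) * math.log(1.0 - q)
    return log_l

# Hypothetical 7-node network with the partition {{1, 2, 3}, {4, 5, 6, 7}}.
edges = {frozenset(e) for e in [(1, 2), (1, 3), (2, 3), (4, 5), (5, 6),
                                (6, 7), (4, 7), (4, 6), (3, 4)]}
partition = [{1, 2, 3}, {4, 5, 6, 7}]
print(block_log_likelihood(edges, partition))
```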
37
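Stochastic Block Model
[Figure: example network with nodes 1 to 7 and the labels X and Y]
$M = \{\{1, 2, 3\}, \{4, 5, 6, 7\}\}$
[Equation: likelihood L for this partition, involving the fractions 1/6 and 5/6]
The reliability of an individual link is:
$R_{xy} = L(A_{xy} = 1 \mid A) = \frac{\int L(A_{xy} = 1 \mid M)\, L(A \mid M)\, p(M)\, dM}{\int L(A \mid M')\, p(M')\, dM'}$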