Similarity-based clustering using a network analysis approach
Leandro Ariza-Jiménez, PhD student in Mathematical Engineering
Advisors:
Olga Lucía Quintero Montoya
Nicolás Pinel Peláez
Outline
• Motivation
• Problem statement
• Networks and communities
• Similarity-based networks
• Application examples
• Future work
• Conclusions
Motivation
Metagenomic data visualization of a simulated microbial community.
Laczny et al. (2014). Alignment-free Visualization of Metagenomic Data by Nonlinear Dimension Reduction. Scientific Reports, 4(1).
• Major challenges and issues in data clustering:
  • A priori unknown number of clusters
  • “Dimensionality curse”
  • Convergence
  • Heuristics
Problem statement
Conventional approaches to data clustering may not always successfully retrieve the underlying structure of the data due to their inherent issues.
Research question:
Can we overcome these issues by performing data clustering based on a network analysis approach?
Networks
LinkedIn Social Network
Collaboration network of scientists
Collaboration network of scientists working at the Santa Fe Institute (SFI). Edges connect scientists that have coauthored at least one paper. Symbols indicate the research areas of the scientists. Naturally, there are more edges between scholars working on the same area than between scholars working in different areas.
Fortunato, S., & Hric, D. (2016). Community detection in networks: A user guide. Physics Reports, 659, 1-44.
Community
• Networks can have community structure:
  • Network vertices are organized into groups.
• No definition is universally accepted, but there is an intuitive one.
• The definition often depends on the target application.
• A community is a group of vertices that probably…
  • share common properties
  • play similar roles
Similarity-based network
• Networks can represent similarity relationships between objects.
• Pairwise similarities must be computed first.
• A representative network adjacency matrix is then obtained from them.
• Adjacency matrix construction approaches:
  • k-nearest neighbors (kNN) → binary (sparse) matrix
  • Heat kernel → weighted (fully connected) matrix
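The two construction approaches listed above can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: the slides do not specify the distance or the kernel form, so Euclidean distance, the symmetrisation rule, the Gaussian form exp(-d² / (2σ²)), and all function and variable names are choices made here, not the author's implementation.

```python
import math

def pairwise_distances(points):
    """Euclidean distance between every pair of points (assumed metric)."""
    n = len(points)
    return [[math.dist(points[i], points[j]) for j in range(n)] for i in range(n)]

def knn_adjacency(dist, k):
    """Binary, sparse adjacency: i and j are linked if either one is among
    the other's k nearest neighbours (symmetrised, no self-loops)."""
    n = len(dist)
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        # Rank the other vertices by distance to i and keep the k closest.
        others = [j for j in range(n) if j != i]
        neighbours = sorted(others, key=lambda j: dist[i][j])[:k]
        for j in neighbours:
            adj[i][j] = adj[j][i] = 1
    return adj

def heat_kernel_adjacency(dist, sigma):
    """Weighted, fully connected adjacency with Gaussian weights
    w_ij = exp(-d_ij^2 / (2 * sigma^2)); the diagonal is set to 0."""
    n = len(dist)
    return [
        [0.0 if i == j else math.exp(-dist[i][j] ** 2 / (2 * sigma ** 2))
         for j in range(n)]
        for i in range(n)
    ]

# Toy data: two well-separated groups of three points each.
points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
          [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]]
dist = pairwise_distances(points)
A_knn = knn_adjacency(dist, k=2)                  # 0/1 entries, mostly zeros
A_heat = heat_kernel_adjacency(dist, sigma=1.0)   # every off-diagonal entry > 0
```

With k = 2 the kNN graph of this toy data contains no edge between the two groups, whereas the heat-kernel matrix keeps a small positive weight for every pair.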
Community detection in networks
Blondel, V., Guillaume, J., Lambiotte, R., & Lefebvre, E. (2008). Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008(10), P10008.
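The method of Blondel et al. (2008), commonly known as the Louvain method, searches for a partition of the vertices that maximizes modularity. The sketch below only evaluates the modularity of a given partition, Q = (1/2m) Σ_ij [A_ij − k_i k_j / 2m] δ(c_i, c_j); it does not implement the Louvain optimization itself. The function name and the toy graph are illustrative assumptions.

```python
def modularity(adj, labels):
    """Modularity Q of a partition of an undirected graph.

    adj    -- symmetric adjacency matrix (list of lists, weights allowed)
    labels -- labels[i] is the community assigned to vertex i
    """
    n = len(adj)
    degree = [sum(row) for row in adj]   # k_i
    two_m = sum(degree)                  # 2m: every edge is counted twice
    q = 0.0
    for i in range(n):
        for j in range(n):
            if labels[i] == labels[j]:
                # Observed weight minus the weight expected at random.
                q += adj[i][j] - degree[i] * degree[j] / two_m
    return q / two_m

# Toy graph: two triangles (0-1-2 and 3-4-5) joined by the single edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
adj = [[0] * 6 for _ in range(6)]
for u, v in edges:
    adj[u][v] = adj[v][u] = 1

q_two_groups = modularity(adj, [0, 0, 0, 1, 1, 1])  # one community per triangle
q_one_group = modularity(adj, [0, 0, 0, 0, 0, 0])   # everything in one community
```

Splitting the toy graph into its two triangles scores higher than leaving all vertices in one community, which is the kind of comparison a modularity-maximizing method relies on.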
Application example
Phylogenetic tree of human intestinal microorganisms. Numbered microorganisms are used to construct artificial datasets based on their genome sequences. Adjacent microorganisms in the tree are similar in a phylogenetic sense.
Application example: True communities (3)
Dataset grp06exp03 (Nodes = 1242, Edges = 10042)
Application example: Louvain communities (4)
Dataset grp06exp03 (Nodes = 1242, Edges = 10042)
Application example: True communities (3)
Dataset grp06exp02 (Nodes = 4920, Edges = 40680)
Application example: Louvain communities (7)
Dataset grp06exp02 (Nodes = 4920, Edges = 40680)
Application example: True communities (10)
Dataset grp01exp01 (Nodes = 16666, Edges = 139619)
Application example: Louvain communities (10)
Dataset grp01exp01 (Nodes = 16666, Edges = 139619)
Application example: True communities (10)
Dataset grp02exp01 (Nodes = 18474, Edges = 154945)
Application example: Louvain communities (13)
Dataset grp02exp01 (Nodes = 18474, Edges = 154945)
Application example: Results
| Community | Reference K | Reference SI | CAA t (s) | CAA K | CAA SI | NAA t (s) | NAA K | NAA SI |
|---|---|---|---|---|---|---|---|---|
| grp06exp03 | 3 | 0.127 | 15.439 | 3 | 0.133 | 0.572 | 4 | 0.076 |
| grp06exp02 | 3 | 0.134 | 97.347 | 3 | 0.140 | 6.281 | 7 | 0.132 |
| grp01exp01 | 10 | 0.079 | 598.244 | 10 | 0.090 | 76.958 | 10 | 0.094 |
| grp02exp01 | 10 | 0.044 | 1659.469 | 10 | 0.060 | 94.530 | 13 | 0.106 |

K = number of communities
t = computation time (s)
SI = silhouette index
CAA = cluster analysis approach
NAA = network analysis approach
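For readers unfamiliar with the silhouette index (SI) reported above, the following is a minimal sketch of its standard definition: for each object, a is the mean distance to the other members of its own cluster, b is the smallest mean distance to the members of any other cluster, and the object's score is (b − a) / max(a, b); SI is the mean score over all objects. The slides do not state which distance was used, so Euclidean distance, the function name, and the toy data are assumptions.

```python
import math

def silhouette_index(points, labels):
    """Mean silhouette score over all points (Euclidean distance assumed)."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("the silhouette index needs at least two clusters")

    total = 0.0
    for i, lab_i in enumerate(labels):
        own = clusters[lab_i]
        if len(own) == 1:
            continue  # a singleton cluster contributes a score of 0
        # a: mean distance to the other members of the same cluster
        a = sum(math.dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of another cluster
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for lab, members in clusters.items()
            if lab != lab_i
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)

# Toy data on a line: two tight pairs that are far apart.
points = [[0.0], [1.0], [10.0], [11.0]]
si_good = silhouette_index(points, [0, 0, 1, 1])  # labels follow the two pairs
si_bad = silhouette_index(points, [0, 1, 0, 1])   # labels mix the two pairs
```

Values close to 1 indicate compact, well-separated clusters; values near 0 or below indicate overlapping or misassigned clusters.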
Future work
• Generate similarity networks based on different similarity measures.
• Explore strategies to obtain sparse and weighted adjacency matrices.
  • E.g., a hybrid approach between kNN and heat kernel.
• Include complementary goodness metrics for community detection.
• Adopt or propose a community definition that represents the behavior of metagenomic communities.
• Evaluate state-of-the-art algorithms for disjoint and overlapping community detection.
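One possible reading of the hybrid idea mentioned in the list above is to keep only the k-nearest-neighbour edges and weight them with a heat kernel, which yields a matrix that is both sparse and weighted. The sketch below is a hypothetical illustration of that reading, not a method described in the slides; Euclidean distance, the Gaussian weight exp(-d² / (2σ²)), and all names are assumptions.

```python
import math

def knn_heat_adjacency(points, k, sigma):
    """Sparse, weighted adjacency: an edge exists only between k-nearest
    neighbours (symmetrised) and carries a Gaussian heat-kernel weight."""
    n = len(points)
    dist = [[math.dist(points[i], points[j]) for j in range(n)] for i in range(n)]
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # Keep the k vertices closest to i (excluding i itself).
        others = [j for j in range(n) if j != i]
        neighbours = sorted(others, key=lambda j: dist[i][j])[:k]
        for j in neighbours:
            weight = math.exp(-dist[i][j] ** 2 / (2 * sigma ** 2))
            adj[i][j] = adj[j][i] = weight
    return adj

# Toy data: two well-separated groups of three points each.
points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
          [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]]
W = knn_heat_adjacency(points, k=2, sigma=1.0)
```

Non-neighbour pairs keep a weight of exactly 0, so the matrix stays sparse, while neighbouring pairs retain graded similarity information.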
Conclusions
• Clustering of high-dimensional data can be performed following a network analysis approach.
• Network analysis can provide…
  • A direct representation of high-dimensional data
  • Methods for clustering data into communities without supervision
• The success of this approach depends on how the similarity between objects is measured in high-dimensional spaces.
References
• Blondel, V., Guillaume, J., Lambiotte, R., & Lefebvre, E. (2008). Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008(10), P10008. http://dx.doi.org/10.1088/1742-5468/2008/10/p10008
• Coscia, M., Giannotti, F., & Pedreschi, D. (2011). A classification for community discovery methods in complex networks. Statistical Analysis and Data Mining, 4(5), 512-546. http://dx.doi.org/10.1002/sam.10133
• Dougherty, G. (2013). Pattern Recognition and Classification. New York: Springer.
• Fortunato, S., & Hric, D. (2016). Community detection in networks: A user guide. Physics Reports, 659, 1-44. http://dx.doi.org/10.1016/j.physrep.2016.09.002
• Laczny, C., Pinel, N., Vlassis, N., & Wilmes, P. (2014). Alignment-free Visualization of Metagenomic Data by Nonlinear Dimension Reduction. Scientific Reports, 4(1). http://dx.doi.org/10.1038/srep04516
• von Luxburg, U. (2007). A tutorial on spectral clustering. Statistics and Computing, 17(4), 395-416. http://dx.doi.org/10.1007/s11222-007-9033-z
• van der Maaten, L. (2014). Accelerating t-SNE using Tree-Based Algorithms. Journal of Machine Learning Research, 15, 3221-3245.