On the encoding of large, high-dimensional and unorganized datasets
Ulderico Fugacci
6th SmartData@PoliTo Workshop, Castello del Valentino
30 January 2020
Higher Methods in Data Science and ML
Motivation
Topological Data Analysis (TDA) and Persistent Homology (PH) allow for extracting the core topological information from large, high-dimensional and unorganized datasets.
E.g. point clouds, complex networks, (semi-)metric spaces
[Figure: Dataset → Filtered Simplicial Complex → Topological Information]
Simplicial Complex:
A family K of subsets (called simplices) of a finite set V, closed under the operation of taking subsets
[Figure: 0-, 1-, 2-, 3-simplices]
Issue:
For a point cloud V with |V| ≥ 30, the filtered complex K associated with V consists of more than a billion simplices
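The "more than a billion" figure can be checked with a short count. This is a minimal sketch, assuming the filtered complex eventually contains every non-empty subset of V (as happens, for instance, when no cap is put on the scale or on the dimension); the function name `simplex_count` is introduced here for illustration only.

```python
from math import comb


def simplex_count(n_vertices: int) -> int:
    """Number of simplices of the full simplex on n_vertices vertices.

    A k-simplex has k + 1 vertices, so there are comb(n, k + 1) of them;
    summing over k = 0 .. n - 1 gives 2**n - 1.
    """
    return sum(comb(n_vertices, k + 1) for k in range(n_vertices))


if __name__ == "__main__":
    print(simplex_count(30))  # 2**30 - 1, i.e. just over one billion
```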
Solution:
Development of compact and efficient data structures for encoding simplicial complexes
Outline:
• Which information should be stored?
• Data structures:
  • Simplex-based representations
  • Top-based representations
  • Operator-driven representations
• Issues and solutions in adopting top-based representations
• Comparisons
Out of Scope:
• Data structures for specific classes of complexes (e.g., manifolds or complexes of low dimension)
• Hierarchical and multi-resolution models
• Construction of a simplicial complex from a dataset
Data Structures for Simplicial Complexes
The entities which a simplicial complex K consists of are:
• its simplices
  $K = K_0 \cup K_1 \cup \dots \cup K_d$
  where $K_i$ is the collection of the i-simplices of K
• the topological relations $R_{i,j} \subseteq K_i \times K_j$ between the simplices of K, encoding the (co-)boundary of each simplex
[Figure: example complex on the vertices 1, 2, 3, 4, with triangle 123 and edges 24 and 34]
Data structure:
A data structure for K has to explicitly store a portion of the above information and to (efficiently) retrieve the remaining part
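Topological Relations:
Given an i-simplex $\sigma$ and a j-simplex $\tau$ of K,
$$(\sigma, \tau) \in R_{i,j} \iff \begin{cases} \sigma \subseteq \tau & \text{for } i < j \\ |\sigma \cap \tau| = i \ \ (\text{equivalently, } \sigma \cap \tau \in K_{i-1}) & \text{for } i = j \\ \tau \subseteq \sigma & \text{for } i > j \end{cases}$$
Example (on the complex with triangle 123 and edges 24 and 34):
$(12, 123) \in R_{1,2}$, $(12, 24) \in R_{1,1}$, $(12, 1) \in R_{1,0}$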
An i-simplex $\sigma$ is called a top simplex of K if there is no $\tau$ with $(\sigma, \tau) \in R_{i,i+1}$, i.e. if $\sigma$ is not a proper face of any other simplex of K
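The relations and the notion of top simplex can be tried out on the example complex. The following is a minimal sketch, not an implementation from the talk: the helper names `closure`, `related` and `is_top` are hypothetical, simplices are modelled as frozensets of vertex labels, and the case i = j is applied literally as stated (so two distinct vertices are always related, which is a degenerate reading for i = 0).

```python
from itertools import combinations

# Top simplices of the example complex: triangle 123, edges 24 and 34.
EXAMPLE_TOPS = [(1, 2, 3), (2, 4), (3, 4)]


def closure(top_simplices):
    """All simplices of the complex generated by the given top simplices."""
    K = set()
    for top in top_simplices:
        for k in range(1, len(top) + 1):
            K.update(frozenset(face) for face in combinations(top, k))
    return K


def related(sigma, tau):
    """True if (sigma, tau) is in R_{i,j}, with i = dim sigma, j = dim tau."""
    sigma, tau = frozenset(sigma), frozenset(tau)
    i, j = len(sigma) - 1, len(tau) - 1
    if i < j:
        return sigma <= tau            # sigma is a face of tau
    if i == j:
        return len(sigma & tau) == i   # they share an (i-1)-face
    return tau <= sigma                # tau is a face of sigma


def is_top(sigma, K):
    """True if no (i+1)-simplex of K contains the i-simplex sigma."""
    sigma = frozenset(sigma)
    return not any(len(tau) == len(sigma) + 1 and sigma < tau for tau in K)


if __name__ == "__main__":
    K = closure(EXAMPLE_TOPS)
    print(related((1, 2), (1, 2, 3)), related((1, 2), (2, 4)), related((1, 2), (1,)))
    print(sorted(sorted(s) for s in K if is_top(s, K)))
```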
Data structures:
• Simplex-based representations (store all the entities): Incidence Graph, Simplex Tree
• Top-based representations (store only the top simplices): IA* Data Structure, Stellar Tree
• Operator-driven representations: Skeleton Blocker
[Figure: the data structures placed along a compactness axis and an efficiency axis, ranging from "store all the entities" to "store only the top simplices"]
Simplex-based Representations
Incidence Graph:
The simplicial complex K is encoded via a directed graph G = (N, A):
N ⟷ K
$(\sigma, \tau) \in A \iff (\sigma, \tau) \in R_{i,i+1}$
[Figure: incidence graph of the example complex, with nodes 123; 12, 13, 23, 24, 34; 1, 2, 3, 4]
• All the relations between simplices can be immediately retrieved
• The representation size increases exponentially with the complex dimension
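The incidence graph and its growth can be sketched in a few lines. This is an illustrative construction only: the function name `incidence_graph` is hypothetical, simplices are frozensets of vertex labels, and the complex is given by its top simplices purely for convenience.

```python
from itertools import combinations


def incidence_graph(top_simplices):
    """Return (nodes, arcs): one node per simplex, one arc per pair in R_{i,i+1}."""
    nodes = set()
    for top in top_simplices:
        for k in range(1, len(top) + 1):
            nodes.update(frozenset(face) for face in combinations(top, k))
    # K is closed under taking subsets, so removing one vertex from tau
    # always yields another node of the graph.
    arcs = {(tau - {v}, tau) for tau in nodes if len(tau) > 1 for v in tau}
    return nodes, arcs


if __name__ == "__main__":
    nodes, arcs = incidence_graph([(1, 2, 3), (2, 4), (3, 4)])
    print(len(nodes), len(arcs))

    # A single d-simplex already needs 2**(d + 1) - 1 nodes.
    for d in (3, 6, 9):
        n, a = incidence_graph([tuple(range(d + 1))])
        print(d, len(n), len(a))
```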
Simplex-based Representations
Simplex Tree:
The simplicial complex K is encoded via a directed graph G = (N, A):
N ⟷ K
$(\sigma, \tau) \in A \iff (\sigma, \tau) \in R_{i,i+1}$ and $I(\sigma) < I(\tau)$
where $I(\sigma)$ denotes the maximum value taken by the vertices of $\sigma$ w.r.t. a total order on $K_0$
[Figure: simplex tree of the example complex, with nodes 123; 12, 13, 23, 24, 34; 1, 2, 3, 4]
• The graph is not uniquely determined: it depends on the chosen vertex order
[Figure: simplex tree of the example complex after relabeling its vertices, with nodes 123; 12, 13, 14, 23, 34; 1, 2, 3, 4]
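The condition $I(\sigma) < I(\tau)$ means that each simplex is reached by appending vertices in increasing order, which is how a trie over sorted vertex lists behaves. Below is a minimal sketch under that reading, using nested dictionaries; the names `build_simplex_tree`, `contains` and `count_nodes` are hypothetical and this is not the interface of any existing library.

```python
from itertools import combinations


def build_simplex_tree(top_simplices):
    """Trie whose root-to-node paths are the simplices, vertices in increasing order."""
    root = {}
    for top in top_simplices:
        verts = sorted(top)
        for k in range(1, len(verts) + 1):
            for face in combinations(verts, k):  # faces come out already sorted
                node = root
                for v in face:
                    node = node.setdefault(v, {})
    return root


def contains(tree, simplex):
    """Membership test: follow the sorted vertices of the simplex from the root."""
    node = tree
    for v in sorted(simplex):
        if v not in node:
            return False
        node = node[v]
    return True


def count_nodes(tree):
    """Number of trie nodes, which equals the number of simplices stored."""
    return sum(1 + count_nodes(child) for child in tree.values())


if __name__ == "__main__":
    tree = build_simplex_tree([(1, 2, 3), (2, 4), (3, 4)])
    print(count_nodes(tree), contains(tree, (3, 2)), contains(tree, (2, 3, 4)))
```

Relabeling the vertices changes which simplices hang below which, so the shape of the trie changes with the vertex order even though the number of nodes does not.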
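Top-based Representations
IA* Data Structure:
The simplicial complex K is encoded via a directed graph G = (N, A):
N ⟷ $K_0 \cup K_{top}$
$(\sigma, \tau) \in A$ ⟷ one of the following holds:
  $\sigma, \tau \in K_{top}$ and $(\sigma, \tau) \in R_{i,i}$
  $\sigma \in K_{top}$ and $(\sigma, \tau) \in R_{i,0}$
  $\tau \in K_{top}$ and $(\sigma, \tau) \in (R_{0,i})^*$
[Figure: IA* graph of the example complex, with top-simplex nodes 123, 24, 34 and vertex nodes 1, 2, 3, 4]
• Compact: it explicitly stores just a fraction of the entities of a simplicial complex
• Not all the relations between simplices are immediately available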
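The trade-off of a top-based encoding can be illustrated with a deliberately simplified sketch. It is not the IA* data structure itself: the class name `TopBasedComplex` and its methods are hypothetical, and the vertex-to-top map below stores every incident top simplex, whereas $(R_{0,i})^*$ on the slide denotes a restricted relation. Only vertices and top simplices are stored; everything else is answered on demand.

```python
from collections import defaultdict


class TopBasedComplex:
    """Stores only vertices and top simplices (assumed to be given as input)."""

    def __init__(self, top_simplices):
        self.tops = [frozenset(t) for t in top_simplices]
        # Assumption: full vertex -> incident top simplices map.
        self.vertex_to_tops = defaultdict(list)
        for idx, top in enumerate(self.tops):
            for v in top:
                self.vertex_to_tops[v].append(idx)

    def contains(self, simplex):
        """A simplex belongs to K iff some stored top simplex contains it."""
        s = frozenset(simplex)
        if not s:
            return False
        v = next(iter(s))
        return any(s <= self.tops[i] for i in self.vertex_to_tops.get(v, []))

    def adjacent_tops(self, idx):
        """Top simplices of the same dimension sharing an (i-1)-face (R_{i,i})."""
        top = self.tops[idx]
        candidates = {j for v in top for j in self.vertex_to_tops[v]} - {idx}
        return sorted(
            j for j in candidates
            if len(self.tops[j]) == len(top)
            and len(self.tops[j] & top) == len(top) - 1
        )


if __name__ == "__main__":
    K = TopBasedComplex([(1, 2, 3), (2, 4), (3, 4)])
    print(len(K.tops) + len(K.vertex_to_tops))  # stored nodes
    print(K.contains((2, 3)), K.contains((2, 3, 4)), K.adjacent_tops(1))
```

On the example complex this stores 7 nodes instead of the 10 simplices of K, at the price of a search whenever a non-stored simplex is queried.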
Operator-driven Representations
Skeleton Blocker:
The simplicial complex K is encoded by storing its 1-skeleton (i.e. the graph consisting of the 0- and the 1-simplices) and a map returning, for each 1-simplex $\sigma$, the blockers of K containing $\sigma$, where a simplex $\tau$ is a blocker if $\tau$ does not belong to K but all its proper faces do
[Figure: the example complex, whose only blocker is { 234 }]
• Designed for flag complexes (e.g. Vietoris-Rips (VR) complexes) and edge contraction
• Too specific: inefficient in any other task
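The blocker set of a small complex can be computed by brute force directly from the definition. This sketch is for illustration only: the function names `closure`, `blockers` and `blocker_map` are hypothetical, the enumeration is exponential in the number of vertices, and candidates are restricted to simplices with at least three vertices on the assumption that missing edges are already described by the stored 1-skeleton.

```python
from itertools import combinations


def closure(top_simplices):
    """All simplices of the complex generated by the given top simplices."""
    K = set()
    for top in top_simplices:
        for k in range(1, len(top) + 1):
            K.update(frozenset(face) for face in combinations(top, k))
    return K


def blockers(top_simplices):
    """Simplices not in K whose proper faces all belong to K."""
    K = closure(top_simplices)
    vertices = sorted(set().union(*K))
    found = set()
    for k in range(3, len(vertices) + 1):
        for cand in combinations(vertices, k):
            tau = frozenset(cand)
            # K is closed under subsets, so checking the facets of tau
            # is enough to know that every proper face is in K.
            if tau not in K and all(tau - {v} in K for v in tau):
                found.add(tau)
    return found


def blocker_map(top_simplices):
    """Map each edge of K to the blockers containing it."""
    K = closure(top_simplices)
    found = blockers(top_simplices)
    return {e: [b for b in found if e <= b] for e in K if len(e) == 2}


if __name__ == "__main__":
    tops = [(1, 2, 3), (2, 4), (3, 4)]
    print([sorted(b) for b in blockers(tops)])
```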
Possible Issues in Top-based Representations
Top-based representations look like promising data structures for encoding a simplicial complex K
But, how to…
1. Store information associated with each simplex of K (e.g. labels, gradient, …)?
   Attach information to the top simplices only
2. Efficiently perform operators when only a fraction of the entities of K is explicitly stored?
   Re-define the algorithms performing the operators so that they extract as few non-explicitly stored entities as possible
Comparisons
Top-based vs Simplex-based
Comparisons
Top-based vs Operator-driven
In Summary
We have briefly overviewed the most common data structures proposed in the literature
Future Directions:
• Express the most frequently adopted operators in terms of top simplices
• Address the bottleneck in the construction of a simplicial complex from a dataset (maybe by proposing an approximate construction?)
• Investigate, with your help, the connections between simplicial complexes (and the data structures to store them) and itemsets in association rule learning
Thank you!
Main References:
• Simplex-based Representations:
  - Boissonnat, Maria. The Simplex Tree: an Efficient Data Structure for General Simplicial Complexes. In Algorithmica, 2014
• Top-based Representations:
  - Canino, De Floriani, Weiss. IA*: An Adjacency-Based Representation for Non-manifold Simplicial Shapes in Arbitrary Dimensions. In Computers & Graphics, 2011
  - Fellegara, Weiss, De Floriani. The Stellar Tree: a Compact Representation for Simplicial Complexes and Beyond. arXiv preprint, 2017
• Operator-driven Representations:
  - Attali, Lieutier, Salinas. Efficient Data Structure for Representing and Simplifying Simplicial Complexes in High Dimensions. In International Journal of Computational Geometry & Applications, 2012
• Comparisons:
  - Fugacci, Iuricich, De Floriani. Computing Discrete Morse Complexes from Simplicial Complexes. In Graphical Models, 2019
  - Fellegara, Iuricich, De Floriani, Fugacci. Efficient Homology‐Preserving Simplification of High‐Dimensional Simplicial Shapes. In Computer Graphics Forum, 2019