On the encoding of large, high-dimensional and unorganized datasets Ulderico Fugacci 6th SmartData@PoliTo Workshop, Castello del Valentino 30 January 2020 Higher Methods in Data Science and ML


On the encoding of large, high-dimensional and unorganized datasets

Ulderico Fugacci

6th SmartData@PoliTo Workshop, Castello del Valentino

30 January 2020

Higher Methods in Data Science and ML

Motivation

Topological Data Analysis (TDA) and Persistent Homology (PH) allow for extracting the core topological information from large, high-dimensional and unorganized datasets.

E.g. point clouds, complex networks, (semi-)metric spaces

[Figure: Dataset → Filtered Simplicial Complex → Topological Information]

Motivation

Simplicial Complex:

A family K of subsets (called simplices) of a finite set V that is closed under the operation of taking subsets

[Figure: 0-, 1-, 2-, 3-simplices]

Issue:

Let V be a point cloud with |V| ≥ 30

The filtered complex K associated with V consists of more than a billion simplices
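The order of magnitude can be checked with a few lines of Python. This is a minimal sketch, assuming the worst case in which every non-empty subset of V ends up being a simplex (for instance, a filtration whose last step is the full simplex on V); the function name `full_complex_size` is illustrative and does not come from the talk.

```python
from math import comb


def full_complex_size(n_vertices: int) -> int:
    """Number of simplices when every non-empty subset of the vertices is a simplex."""
    # There is one k-simplex per (k+1)-subset of the vertices, for k = 0 .. n-1.
    return sum(comb(n_vertices, k + 1) for k in range(n_vertices))


size_30 = full_complex_size(30)
print(size_30)  # 1073741823, i.e. 2**30 - 1
```

Each additional vertex roughly doubles the count, which is why explicitly storing every simplex quickly becomes impractical.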

Solution:

Development of compact and efficient data structures for encoding simplicial complexes

Motivation

Outline:

• Which information should be stored?

• Data structures:
  • Simplex-based representations
  • Top-based representations
  • Operator-driven representations

• Issues and solutions in adopting top-based representations

• Comparisons

Motivation

Out of Scope:

• Data structures for specific classes of complexes
  • E.g., manifolds or complexes of low dimension

• Hierarchical and multi-resolution models

• Construction of a simplicial complex from a dataset

Data Structures for Simplicial Complexes

Data structure:

The entities which a simplicial complex K consists of are:

• its simplices

K = K_0 ∪ K_1 ∪ … ∪ K_d

where K_i is the collection of the i-simplices of K

• the topological relations R_{i,j} ⊆ K_i × K_j between the simplices of K, encoding the (co-)boundary of each simplex

A data structure for K has to explicitly store a portion of the above information and to (efficiently) retrieve the remaining part

[Figure: example simplicial complex on the vertices 1, 2, 3, 4]

Data Structures for Simplicial Complexes

Topological Relations:

Given an i-simplex 𝜎 and a j-simplex 𝜏 of K, (𝜎, 𝜏) ∈ R_{i,j} if and only if

• 𝜎 ⊆ 𝜏, for i < j

• |𝜎 ∩ 𝜏| = i (equivalently, 𝜎 ∩ 𝜏 ∈ K_{i-1}), for i = j

• 𝜏 ⊆ 𝜎, for i > j

An i-simplex 𝜎 is called a top simplex of K if no (i+1)-simplex 𝜏 of K satisfies (𝜎, 𝜏) ∈ R_{i,i+1}, i.e. if 𝜎 is not a face of any other simplex of K

Example:

(12, 123) ∈ R_{1,2}    (12, 24) ∈ R_{1,1}    (12, 1) ∈ R_{1,0}

[Figure: example simplicial complex on the vertices 1, 2, 3, 4]
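The three cases of the relation can be checked directly on the example complex. The following is a minimal sketch, assuming the example complex is the one generated by the simplices 123, 24 and 34 (as drawn in the later slides); the helper names `S`, `dim`, `related` and `top_simplices` are illustrative.

```python
from itertools import combinations

# Example complex: the simplices 123, 24, 34 together with all their faces.
TOPS = [(1, 2, 3), (2, 4), (3, 4)]
K = {frozenset(f) for t in TOPS for r in range(1, len(t) + 1) for f in combinations(t, r)}


def S(*vertices):
    """Shorthand for writing a simplex as a set of vertices."""
    return frozenset(vertices)


def dim(simplex):
    return len(simplex) - 1


def related(sigma, tau):
    """True if (sigma, tau) belongs to R_{i,j}, with i = dim(sigma), j = dim(tau)."""
    i, j = dim(sigma), dim(tau)
    if i < j:
        return sigma <= tau          # sigma is a face of tau
    if i > j:
        return tau <= sigma          # tau is a face of sigma
    # Same dimension: the condition |sigma ∩ tau| = i is applied literally
    # (the case i = j = 0 is not special-cased in this sketch).
    return len(sigma & tau) == i


def top_simplices(complex_):
    """Simplices that are not a face of any simplex one dimension higher."""
    return {
        s for s in complex_
        if not any(dim(t) == dim(s) + 1 and related(s, t) for t in complex_)
    }
```

On this complex, `related(S(1, 2), S(1, 2, 3))`, `related(S(1, 2), S(2, 4))` and `related(S(1, 2), S(1))` all hold, matching the three example pairs, and `top_simplices(K)` returns 123, 24 and 34.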

Data Structures for Simplicial Complexes

Data structures:

๏ Simplex-based representations: Incidence Graph, Simplex Tree

๏ Top-based representations: IA* Data Structure, Stellar Tree

๏ Operator-driven representations: Skeleton Blocker

[Figure: the data structures above arranged along two axes, Compactness and Efficiency, between the two extremes "Store all the entities" and "Store only the top-simplices"]

Simplex-based Representations

Incidence Graph:

The simplicial complex K is encoded via a directed graph G = (N, A):

N ⟷ K

(𝜎, 𝜏) ∈ A ⟷ (𝜎, 𝜏) ∈ R_{i,i+1}

[Figure: example complex on the vertices 1, 2, 3, 4 and its incidence graph, with nodes 123; 12, 13, 23, 24, 34; 1, 2, 3, 4]

• All the relations between simplices can be immediately retrieved

• The representation size increases exponentially with the dimension of the complex
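A minimal sketch of an incidence graph, assuming simplices are represented as frozensets of vertex labels and that the example complex is generated by 123, 24 and 34; `closure` and `incidence_graph` are illustrative names, not an API from the cited works.

```python
from itertools import combinations


def closure(generators):
    """All non-empty faces of the given simplices (the nodes N of the graph)."""
    return {
        frozenset(f)
        for g in generators
        for r in range(1, len(g) + 1)
        for f in combinations(g, r)
    }


def incidence_graph(K):
    """Arcs (sigma, tau) where tau is an (i+1)-simplex having the i-simplex sigma as a face."""
    arcs = set()
    for tau in K:
        for v in tau:
            sigma = tau - {v}        # drop one vertex to get a face of tau
            if sigma:                # vertices have no faces in K
                arcs.add((sigma, tau))
    return arcs


K = closure([(1, 2, 3), (2, 4), (3, 4)])
arcs = incidence_graph(K)
print(len(K), len(arcs))             # 10 nodes, 13 arcs

# A single 10-dimensional simplex already needs 2**11 - 1 nodes.
print(len(closure([tuple(range(11))])))
```

The last line illustrates the second bullet: since every face is a node, the node count of a single d-simplex is 2^(d+1) − 1.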

Simplex-based Representations

Simplex Tree:

The simplicial complex K is encoded via a directed graph G = (N, A):

N ⟷ K

(𝜎, 𝜏) ∈ A ⟷ (𝜎, 𝜏) ∈ R_{i,i+1} and I(𝜎) < I(𝜏)

where I(𝜎) denotes the maximum value taken by the vertices of 𝜎 w.r.t. a total order on K_0

• The graph is not uniquely determined: it depends on the chosen vertex order

[Figure: example complex on the vertices 1, 2, 3, 4 and its simplex tree, with nodes 123; 12, 13, 23, 24, 34; 1, 2, 3, 4]

[Figure: the same complex with a different vertex labelling and the resulting simplex tree, with nodes 123; 12, 13, 14, 23, 34; 1, 2, 3, 4]
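The arc condition says that a simplex is linked only to the cofaces obtained by appending a vertex larger than all of its own, which is exactly a trie over sorted vertex lists. Below is a minimal sketch using nested dictionaries, assuming all simplices of K are inserted; the names `build_simplex_tree`, `count_nodes` and `contains` are illustrative and this is not the implementation of Boissonnat and Maria.

```python
from itertools import combinations


def build_simplex_tree(simplices, order):
    """Trie in which each simplex is the path spelled by its vertices, sorted by `order`."""
    rank = {v: i for i, v in enumerate(order)}
    root = {}
    for simplex in simplices:
        node = root
        for v in sorted(simplex, key=rank.__getitem__):
            node = node.setdefault(v, {})
    return root


def count_nodes(tree):
    return sum(1 + count_nodes(child) for child in tree.values())


def contains(tree, simplex, order):
    rank = {v: i for i, v in enumerate(order)}
    node = tree
    for v in sorted(simplex, key=rank.__getitem__):
        if v not in node:
            return False
        node = node[v]
    return True


# Example complex: all faces of 123, 24, 34.
K = {frozenset(f) for t in [(1, 2, 3), (2, 4), (3, 4)]
     for r in range(1, len(t) + 1) for f in combinations(t, r)}

tree_a = build_simplex_tree(K, order=[1, 2, 3, 4])
tree_b = build_simplex_tree(K, order=[4, 3, 2, 1])
```

Both trees have one node per simplex, but `tree_a` and `tree_b` differ, which is the dependence on the vertex order noted above.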

Top-based Representations

IA* Data Structure:

The simplicial complex K is encoded via a directed graph G = (N, A):

N ⟷ K_0 ∪ K_top

(𝜎, 𝜏) ∈ A ⟷

• 𝜎, 𝜏 ∈ K_top and (𝜎, 𝜏) ∈ R_{i,i}, or

• 𝜎 ∈ K_top and (𝜎, 𝜏) ∈ R_{i,0}, or

• 𝜏 ∈ K_top and (𝜎, 𝜏) ∈ (R_{0,i})*

[Figure: example complex on the vertices 1, 2, 3, 4 and its IA* graph, with nodes 123, 24, 34; 1, 2, 3, 4]

• Compact: it explicitly stores just a fraction of the entities of a simplicial complex

• Not all the relations between simplices are immediately available
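The trade-off in the two bullets can be illustrated with a generic top-based container. This is a sketch in the spirit of top-based representations rather than the IA* data structure itself: it stores only vertices and top simplices, derives everything else on demand, and keeps an unrestricted vertex-to-top map, which is an assumption of the sketch and need not coincide with (R_{0,i})*. The class and method names are hypothetical.

```python
from itertools import combinations


class TopBasedComplex:
    """Stores only the vertices and the top simplices; all other simplices are derived."""

    def __init__(self, tops):
        # Assumes the given simplices really are top simplices (none is a face of another).
        self.tops = [frozenset(t) for t in tops]
        self.vertices = sorted(set().union(*self.tops))
        # Vertex -> incident top simplices (unrestricted; an assumption of this sketch).
        self.incident_tops = {v: [t for t in self.tops if v in t] for v in self.vertices}

    def stored_entities(self):
        return len(self.vertices) + len(self.tops)

    def contains(self, simplex):
        """A simplex belongs to K iff it is a non-empty subset of some top simplex."""
        s = frozenset(simplex)
        if not s:
            return False
        v = next(iter(s))
        return any(s <= t for t in self.incident_tops.get(v, []))

    def simplices(self):
        """Enumerate every simplex of K exactly once, computing faces on demand."""
        seen = set()
        for t in self.tops:
            for r in range(1, len(t) + 1):
                for face in map(frozenset, combinations(sorted(t), r)):
                    if face not in seen:
                        seen.add(face)
                        yield face


example = TopBasedComplex([(1, 2, 3), (2, 4), (3, 4)])
big = TopBasedComplex([tuple(range(11))])   # a single 10-dimensional simplex
```

Here `example` stores 7 entities for a complex of 10 simplices, while `big` stores 12 entities for a complex of 2047 simplices; the price is that queries such as `contains` have to search the incident top simplices instead of reading a stored answer.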

Operator-driven Representations

Skeleton Blocker:

The simplicial complex K is encoded by storing its 1-skeleton (i.e. the graph consisting of the 0- and the 1-simplices) and a map returning, for each 1-simplex 𝜎, the blockers of K containing 𝜎, where a simplex 𝜏 is a blocker if 𝜏 does not belong to K but all its proper faces do

[Figure: example complex on the vertices 1, 2, 3, 4, whose only blocker is { 234 }]

• Designed for flag complexes (e.g. Vietoris-Rips (VR) complexes) and edge contraction

• Too specific: inefficient in any other task
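The blocker of the example can be recovered by brute force. This is a minimal sketch for small complexes only (it enumerates all vertex subsets), assuming the example complex is generated by 123, 24 and 34; restricting candidates to dimension 2 or higher is an assumption of the sketch, motivated by the fact that missing edges are already described by the 1-skeleton. The function names are illustrative.

```python
from itertools import combinations


def closure(generators):
    """All non-empty faces of the given simplices."""
    return {
        frozenset(f)
        for g in generators
        for r in range(1, len(g) + 1)
        for f in combinations(g, r)
    }


def blockers(K):
    """Simplices of dimension >= 2 that are not in K although all their proper faces are."""
    vertices = sorted(set().union(*K))
    found = set()
    for size in range(3, len(vertices) + 1):
        for candidate in map(frozenset, combinations(vertices, size)):
            if candidate in K:
                continue
            # K is closed under subsets, so checking the codimension-1 faces is enough.
            if all(candidate - {v} in K for v in candidate):
                found.add(candidate)
    return found


K = closure([(1, 2, 3), (2, 4), (3, 4)])
example_blockers = blockers(K)
print(example_blockers)   # {frozenset({2, 3, 4})}
```

A flag complex has no blockers at all, which is why the representation is so compact for that class of complexes.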

Possible Issues in Top-based Representations

Top-based representations look like promising data structures for encoding a simplicial complex K

But, how to…

1. Store information associated with each simplex of K (e.g. labels, gradient, …)?

   Attach information to the top simplices only

2. Efficiently perform operators when only a fraction of the entities of K is explicitly stored?

   Re-define the algorithms performing the operators so that they extract as few non-explicitly stored entities as possible
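One hypothetical way to realise the first answer is to let each top simplex own a small table for the faces it contains, with a fixed rule for deciding which top simplex owns a shared face. The sketch below is an illustration of the idea only, not the encoding used in the cited papers; the class `TopLabels`, its methods and the "first top simplex in sorted order" ownership rule are all assumptions.

```python
class TopLabels:
    """Per-simplex values stored inside the records of the top simplices only."""

    def __init__(self, tops):
        self.tops = sorted(tuple(sorted(t)) for t in tops)
        # One table per top simplex: face -> value.
        self.data = {t: {} for t in self.tops}

    def _owner(self, face):
        """The first top simplex (in sorted order) containing the face owns its value."""
        f = frozenset(face)
        for t in self.tops:
            if f <= frozenset(t):
                return t
        raise KeyError(f"{sorted(f)} is not a simplex of the complex")

    def put(self, face, value):
        self.data[self._owner(face)][frozenset(face)] = value

    def get(self, face, default=None):
        return self.data[self._owner(face)].get(frozenset(face), default)


labels = TopLabels([(1, 2, 3), (2, 4), (3, 4)])
labels.put((2,), "critical")     # vertex 2 is shared by 123 and 24; 123 owns it
labels.put((3, 4), 0.5)          # a top simplex owns its own value
```

No record is ever created for a non-top simplex: the value of vertex 2 lives in the table of 123, and looking it up only requires finding that owner again.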

Comparisons

Top-based vs Simplex-based


Comparisons

Top-based vs Operator-driven

In Summary

We have briefly overviewed the most common data structures proposed in the literature

Future Directions:

• Express the most frequently adopted operators in terms of top simplices

• Face the bottleneck concerning the construction of a simplicial complex from a dataset (maybe proposing an approximate construction?)

• Investigate, with your help, the connections between simplicial complexes (and the data structures to store them) and itemsets in association rule learning


Thank you!

Main References:

• Simplex-based Representations:
- Boissonnat, Maria. The Simplex Tree: an Efficient Data Structure for General Simplicial Complexes. In Algorithmica, 2014

• Top-based Representations:
- Canino, De Floriani, Weiss. IA*: An Adjacency-Based Representation for Non-manifold Simplicial Shapes in Arbitrary Dimensions. In Computers & Graphics, 2011
- Fellegara, Weiss, De Floriani. The Stellar Tree: a Compact Representation for Simplicial Complexes and Beyond. arXiv preprint, 2017

• Operator-driven Representations:
- Attali, Lieutier, Salinas. Efficient Data Structure for Representing and Simplifying Simplicial Complexes in High Dimensions. In International Journal of Computational Geometry & Applications, 2012

• Comparisons:
- Fugacci, Iuricich, De Floriani. Computing Discrete Morse Complexes from Simplicial Complexes. In Graphical Models, 2019
- Fellegara, Iuricich, De Floriani, Fugacci. Efficient Homology‐Preserving Simplification of High‐Dimensional Simplicial Shapes. In Computer Graphics Forum, 2019