
A TOPOLOGICAL DATA ANALYSIS BASED CLASSIFIER

Rolando Kindelan
Computer Science Department
Faculty of Mathematical and Physical Sciences, University of Chile
851 Beauchef Av., Santiago de Chile, Chile.
Center of Medical Biophysics, Universidad de Oriente, Santiago de Cuba, Cuba.
[email protected]

José Frías
Center for Research in Mathematics
Jalisco S/N, Col. Valenciana, CP: 36023
Guanajuato, Gto., México.
[email protected]

Mauricio Cerda
Integrative Biology Program, Institute of Biomedical Sciences
Biomedical Neuroscience Institute
Center for Medical Informatics and Telemedicine
Faculty of Medicine, Universidad de Chile
1027 Independencia Av., Santiago, Chile.
[email protected]

Nancy Hitschfeld
Computer Science Department
Faculty of Mathematical and Physical Sciences, University of Chile
851 Beauchef Av., Santiago de Chile, Chile.
[email protected]

November 11, 2021

This paper is under consideration at the Pattern Recognition Letters journal.

ABSTRACT

Topological Data Analysis (TDA) is an emergent field that aims to discover topological information hidden in a dataset. TDA tools have been commonly used to create filters and topological descriptors to improve Machine Learning (ML) methods. This paper proposes an algorithm that applies TDA directly to multi-class classification problems, without any further ML stage, showing advantages for imbalanced datasets. The proposed algorithm builds a filtered simplicial complex on the dataset. Persistent Homology (PH) is applied to guide the selection of a sub-complex where unlabeled points obtain the label with the majority of votes from labeled neighboring points. We select 8 datasets with different dimensions, degrees of class overlap, and imbalanced samples per class. On average, the proposed TDABC method was better than KNN and weighted-KNN. It behaves competitively with Linear SVM and Random Forest baseline classifiers on balanced datasets, and it outperforms all baseline methods when classifying entangled and minority classes.

1 Introduction

Classification is a Machine Learning (ML) task that employs known data labels to label data with an unknown category. Classification faces challenges such as high dimensionality, noise, and imbalanced data distributions. TDA has been successful in reducing dimensionality, and it has demonstrated robustness to noise by inferring the underlying dataset’s topology [1, 2]. However, how TDA can contribute to classifying imbalanced datasets has not been explored.

In this work, we propose a method entirely based on TDA to classify imbalanced and noisy datasets. The fundamental idea is to provide multidimensional and multi-size neighborhoods around each unlabeled point. We use topological invariants computed through PH to guide the detection of appropriate neighborhoods, and we use these neighborhoods to propagate labels from labeled to unlabeled points. A preliminary version of this work is available in [3].

2 Related Work

PH is a powerful tool in TDA that captures the topological features of a nested family of simplicial complexes built on data according to an incremental threshold value [2]. These topological features are encoded by the values at which they appear (are born) and at which they disappear or merge (die). The difference between the birth and death scales is called the persistence of a topological feature. The evolution of the simplicial structure is encoded using high-level representations called barcodes and persistence diagrams [2].

Regarding the relation between PH and the classification problem, the typical approach considers hybrid TDA+ML methods, which combine topological descriptors with a conventional ML classifier. Topological descriptors are commonly built from vectorized or summarized persistence diagrams and barcodes [4]. Examples of these hybrid TDA+ML methods are: TDA+SVM for image classification [5], and TDA+k-NN, TDA+CNN, and TDA+SVM for time series classification [6, 7, 8]. Self-Organized Maps have been combined with PH tools to cluster and classify time series in the financial domain [9].

Approaches to address the imbalanced classification problem can be mainly classified into re-sampling and weighted-objective-function based algorithms. Data re-sampling methods balance data by augmenting or removing samples from minority or majority classes, respectively. The Synthetic Minority Oversampling Technique (SMOTE) is the established geometric approach to balance classes by oversampling the minority class [10]. Multiple variations of SMOTE have been developed [11], including novel approaches such as SMOTE-LOF, which takes into account the Local Outlier Factor [12] to identify noisy synthetic samples. Furthermore, overlapping samples from different classes have been reported as a major issue in imbalance problems. Neighborhood-based under-sampling of the majority class in the overlapped region has been applied to achieve better results [13]. These heuristics are simple and can be combined with any classifier, as they modify the training set, although they assume data points can always be discarded or generated. In contrast, SVM or neural network adaptations modify their objective function to give a higher relative importance to minority class samples [14]. More closely related to our work, Zhang et al. [15] propose the k Rare-class Nearest Neighbour (KRNN) classifier, which defines a dynamic neighborhood based on the inclusion of at least k positive samples.

3 Fundamental concepts

In this section, we introduce mathematical definitions to explain our proposed method; for a complete theoretical basis, see [2].

3.1 Simplicial Complexes

Simplicial complexes are combinatorial and algebraic objects which can be used to represent a discrete space encoding topological features of the data space. Concepts related to simplicial complexes are briefly defined as follows. A q-simplex σ is the convex hull of q + 1 affinely independent points {s0, . . . , sq} ⊂ Rn, q ≤ n. The set V(σ) is called the set of vertices of σ, and the simplex σ is generated by V(σ); this relation is denoted by σ = [s0, . . . , sq]. A q-simplex σ has dimension dim(σ) = q and |V(σ)| = q + 1 vertices. Given a q-simplex σ and a d-simplex τ with 0 ≤ d ≤ q and V(τ) ⊆ V(σ), τ is called a d-face of σ, denoted by τ ≤ σ, and σ is called a q-coface of τ, denoted by σ ≥ τ. Note that the 0-faces of a q-simplex σ are the points in V(σ), the 1-faces are line segments with endpoints in V(σ), and so forth. A q-simplex has $\binom{q+1}{d+1}$ d-faces and $\sum_{d=0}^{q}\binom{q+1}{d+1} = 2^{q+1} - 1$ faces in total.
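To make the face count concrete, the following sketch (plain Python, standard library only; the vertex labels are illustrative) enumerates the faces of a simplex as vertex subsets and checks the 2^{q+1} − 1 total:

```python
from itertools import combinations

def faces_by_dimension(vertices):
    """All d-faces of the simplex [vertices], d = 0..q: every non-empty
    subset of the vertex set spans a face."""
    q = len(vertices) - 1
    return [list(combinations(vertices, d + 1)) for d in range(q + 1)]

simplex = ("s0", "s1", "s2", "s3")      # a 3-simplex (tetrahedron)
by_dim = faces_by_dimension(simplex)
for d, fs in enumerate(by_dim):
    print(f"{d}-faces: {len(fs)}")      # C(q+1, d+1) faces per dimension
total = sum(len(fs) for fs in by_dim)
print(total == 2 ** len(simplex) - 1)   # 4 + 6 + 4 + 1 = 15 -> True
```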

In order to define homology groups of topological spaces, the notion of simplicial complexes is central:

Definition 1 (Simplicial complex). A simplicial complex K in Rn is a finite collection of simplices in Rn such that:

• σ ∈ K and τ ≤ σ =⇒ τ ∈ K.

• σ1, σ2 ∈ K =⇒ σ1 ∩ σ2 is either a face of both σ1 and σ2 or empty.


Figure 1: Example of StK([s4]) and LkK([s4]) on a given simplicial complex (a tetrahedron) K.

The dimension of K is dim(K) = max{dim(σ) | σ ∈ K}. The set V(K) = ⋃_{σ∈K} V(σ) is called the set of vertices of K.

Definition 2 (Star, Closure, Closed Star, and Link). Let K be a simplicial complex, and let σ ∈ K be a q-simplex. The star of σ in K is the set of all co-faces of σ in K [2]:

StK(σ) = {τ ∈ K | σ ≤ τ}. (1)

If K′ is a subset of simplices, K′ ⊂ K, the closure of K′ in K is the smallest simplicial complex containing K′:

ClK(K′) = {µ ∈ K | µ ≤ σ for some σ ∈ K′}. (2)

The smallest simplicial complex that contains StK(σ) is the closed star (the closure of the star) of σ in K:

St̄K(σ) = ClK(StK(σ)). (3)

The link of σ is the set of simplices in its closed star that do not share any face with σ [2]:

LkK(σ) = {τ ∈ St̄K(σ) | τ ∩ σ = ∅}. (4)

Since the link operator concept will be important throughout this paper, we present two equivalent characterizations ofthis set:

Lemma 1. Let K be a simplicial complex and σ ∈ K. Then LkK(σ) coincides with the sets

A = St̄K(σ) \ (StK(σ) ∪ ClK(σ)), and (5)

B = ⋃_{µ∈StK(σ)} {[V(µ) \ V(σ)]}. (6)

Proof. Let τ be a simplex in LkK(σ). In particular, τ belongs neither to StK(σ) nor to ClK(σ), since every simplex in one of these two sets necessarily intersects σ; then LkK(σ) ⊂ A. If τ is a simplex in A, then there exists µ ∈ StK(σ) such that τ ≤ µ and (V(µ) \ V(τ)) ⊂ V(σ). It follows that τ = [V(µ) \ V(σ)], and A ⊂ B. Finally, if τ ∈ B, then τ = [V(µ) \ V(σ)] for some µ ∈ StK(σ). It follows that τ ∈ St̄K(σ), but τ ∩ σ = ∅. Then B ⊂ LkK(σ), and the equivalence of the sets A and B is established.

Figure 1 presents an example of the star and link of the 0-simplex [s4] in a simplicial complex K built on the point set S = {s2, s3, s4, s5}.
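These operators have a direct set-based reading. The sketch below (our illustration, not the authors' implementation) represents a complex as a set of frozensets of vertices and computes the star, closure, and link of Definition 2 on a toy complex (a filled triangle glued to an edge, not the exact complex of Figure 1):

```python
from itertools import combinations

def star(K, s):
    """St_K(s): all co-faces of s in K."""
    return {t for t in K if s <= t}

def closure(simplices):
    """Cl: smallest simplicial complex containing `simplices`."""
    return {frozenset(f) for t in simplices
            for d in range(1, len(t) + 1)
            for f in combinations(t, d)}

def link(K, s):
    """Lk_K(s): simplices of the closed star disjoint from s (Eq. 4)."""
    return {t for t in closure(star(K, s)) if not (t & s)}

# Toy complex: filled triangle [s2, s3, s4] plus the edge [s4, s5].
K = closure([frozenset({"s2", "s3", "s4"}), frozenset({"s4", "s5"})])
print(link(K, frozenset({"s4"})))   # {s2}, {s3}, {s5}, {s2, s3}
```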


3.2 Persistent Homology

As a general rule, the objective of PH is to track how topological features of a topological space appear and disappear as a scale value (usually a radius) varies incrementally, in a process known as filtration [2].

Definition 3 (Filtration). Let K be a simplicial complex. A filtration F on K is a succession of increasing sub-complexes of K: ∅ ⊆ K0 ⊆ K1 ⊆ K2 ⊆ K3 ⊆ · · · ⊆ Kn = K. In this case, K is called a filtered simplicial complex.

In many simplicial complexes, the simplices are determined by proximity under a distance function. A filtration F on a simplicial complex K is obtained by taking a collection EK of positive values 0 < ε0 < ε1 < · · · < εn, where the complex Ki corresponds to the value εi. The set EK is called the filtration value collection associated with F. Every Ki ⊆ K can be recovered with the association function, Ki = ψF(εi).

A filtration can be understood as a method to build the whole simplicial complex K from a “family” of sub-complexes incrementally sorted according to some criterion, where each level i corresponds to the “birth” or “death” of a topological feature (connected components, holes, voids). A topological feature of dimension j is a chain of j-simplices that is non-trivial in the j-th homology group, also referred to as a non-trivial j-cycle. Thus, a persistence interval (birth, death) is the “lifetime” of a given topological feature [2] (see Figure 2).

Figure 2: A fragment of a simplicial complex filtration, with some selected topological features.

Definition 4 (Filtration value of a q-simplex). Let K be a filtered simplicial complex and EK its filtration value collection. Let σ ∈ K be a q-simplex. If σ ∈ Kj but σ ∉ Kj−1, then ξK(σ) = εj is the filtration value of σ.

Note that τ ≤ σ =⇒ ξK(τ) ≤ ξK(σ), which means that, in a filtered simplicial complex K, every simplex τ ∈ K appears before all its co-faces.
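In practice, a filtered complex and its persistence intervals can be computed with a TDA library; a minimal sketch assuming the GUDHI package (the point cloud and parameter values are illustrative):

```python
import numpy as np
import gudhi

rng = np.random.default_rng(0)
points = rng.standard_normal((100, 2))             # toy point cloud

# Vietoris-Rips filtration, expanded up to 2-simplices.
rips = gudhi.RipsComplex(points=points, max_edge_length=2.0)
st = rips.create_simplex_tree(max_dimension=2)

# Persistence intervals (birth, death) per homology dimension;
# dimension 0 tracks connected components, dimension 1 tracks 1-cycles.
diagram = st.persistence()
d1 = [pair for dim, pair in diagram if dim == 1]
print(f"{len(d1)} persistence intervals of non-trivial 1-cycles")
```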

3.3 Classification problem

Let Rn be a feature space and P ⊂ Rn a finite subspace. Suppose P is divided into two subsets, P = S ∪ X, where S is the training set and X is the test set. Let L = {l1, . . . , lN} be the label set, and let T = {(p, l) : p ∈ P, l ∈ L} be the association space, where T = TS ∪ TX, with TS and TX the two disjoint association sets corresponding to S and X, respectively. The label list Y = {li | (xi, li) ∈ TX} is the list of labels assigned to each element of X in the association set TX. Thus, the classification problem can be defined as predicting a suitable label l ∈ L for every x ∈ X, assuming the association set TX is unknown. Consequently, the predicted label list Y ∈ L^{|TX|} is the collection of labels resulting from the classification method.

4 Proposed Classification Method

A classification method based on TDA is presented in this section. Overall, a filtered simplicial complex K is built over P to generate data relationships. The proposed method is based on the assumption that, along the filtration, a sub-complex Ki ⊂ K exists whose simplices represent a feasible approximation to the data topology. The fact that a point set {v0, v1, . . . , vq} ⊂ P defines a q-simplex σ ∈ K implies a similarity or dissimilarity relationship between the points v0, v1, . . . , vq. This implicit relationship among data is used by the proposed method to propagate labels from labeled points to unlabeled points. In Figure 3, the proposed method is illustrated by a 4-step process classifying two unlabeled points x1, x2 ∈ X.

Figure 3: Overall TDABC algorithm.

4.1 Step 1. Building the filtered simplicial complex.

A filtered simplicial complex K is built on the dataset P = S ∪ X using a distance or proximity function which is problem-specific (Euclidean, Manhattan, Cosine, and more). A maximal dimension 2 ≤ q ≪ |P| is given to control the exponential growth of the simplicial complex. We apply the edge collapsing method [16] to reduce the number of simplices while maintaining the same persistence information as the original simplicial complex.
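A sketch of Step 1 under the same GUDHI assumption; `collapse_edges()` applies the edge-collapse reduction of [16] to the 1-skeleton (available in recent GUDHI versions) before the flag-complex expansion, and the metric argument stands in for the problem-specific distance:

```python
import gudhi
from sklearn.metrics import pairwise_distances

def build_filtered_complex(P, max_dim=3, metric="euclidean"):
    """Rips filtration on P = S U X with a problem-specific metric,
    reduced by edge collapse while preserving persistence."""
    dm = pairwise_distances(P, metric=metric)       # Euclidean, cosine, ...
    rips = gudhi.RipsComplex(distance_matrix=dm)
    st = rips.create_simplex_tree(max_dimension=1)  # 1-skeleton only
    st.collapse_edges()                             # edge collapse [16]
    st.expansion(max_dim)                           # expand to q-simplices
    return st
```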

4.2 Step 2 & 3. Recover a meaningful sub-complex

A filtered simplicial complex K provides a large quantity of multi-scale data relationships. We would therefore like to choose a sub-complex Ki from the filtered simplicial complex K that approximates the actual structure of the dataset. In this vein, we exploit the ability of PH to detect topological features. For a filtered simplicial complex K of dimension q, PH computes up to q-dimensional homology groups. Each topological feature, represented by an element of a homology group of a given dimension, is encoded as a persistence interval (birth, death) ⊂ R. Let Dj be the set of persistence intervals of non-trivial j-cycles along the filtration of K. We then collect all persistence intervals, D = ⋃_{j>0} Dj. The 0-dimensional homology group is excluded because we aim to minimize the number of connected components while looking at homology groups in higher dimensions.


We define the function int(·) (Equation 7), which measures the lifetime of the topological feature associated with each d ∈ D. Immortal persistence intervals (infinite death values) are truncated to the maximum value in the collection of filtration values, max(EK):

int(d) := min{d[death], max(EK)} − d[birth]. (7)

A persistence interval d ∈ D is selected by one of the functions defined in Equations 8-10:

(a) The persistence interval with maximal persistence:

dm = MaxInt(D) = arg max_{d∈D} int(d). (8)

(b) A persistence interval selected at random among the intervals whose lifespan exceeds the average:

dr = RandInt(D) = random({d ∈ D | int(d) > avg(D)}). (9)

(c) The interval closest to the average persistence:

da = AvgInt(D) = arg min_{d∈D} |int(d) − avg(D)|, (10)

where avg(D) = (1/|D|) · Σ_{di∈D} int(di).

If a tie occurs during the computation of MaxInt(D) or AvgInt(D), the persistence interval with the higher birth time is taken. Once d ∈ {dm, dr, da} is chosen, a sub-complex Ki must be selected. We choose an appropriate filtration value from the birth, death, or middle time of the persistence interval d:

εi ∈ {d[birth], d[death], (d[birth] + d[death])/2},

and then the respective sub-complex Ki = ψF(εi) is obtained.

We collect all simplices born within the lifespan of the selected persistence interval and compute their closure to obtain the minimal simplicial complex that contains them.

In Figure 4, Ki ⊆ K is chosen by using PH to guide the selection of candidate persistence intervals. A sub-complex is then recovered at the death time of the persistence interval selected by the MaxInt(·) and RandInt(·) selection functions (see Equations 8 and 9). In this example, the filtered simplicial complex K was built on the Circles dataset (noise = 10); see details in Section 5.
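The three selection functions translate directly into code; a sketch assuming persistence intervals given as (birth, death) pairs, with infinite deaths truncated as in Equation 7 and ties broken by the later birth time:

```python
import random

def lifetime(d, max_filtration):
    """Equation 7: int(d), truncating immortal intervals to max(E_K)."""
    birth, death = d
    return min(death, max_filtration) - birth

def select_interval(D, max_filtration, mode="max"):
    life = lambda d: lifetime(d, max_filtration)
    avg = sum(life(d) for d in D) / len(D)
    if mode == "max":       # MaxInt, Eq. 8 (tie -> higher birth time)
        return max(D, key=lambda d: (life(d), d[0]))
    if mode == "random":    # RandInt, Eq. 9: above-average lifespans
        return random.choice([d for d in D if life(d) > avg])
    if mode == "avg":       # AvgInt, Eq. 10 (tie -> higher birth time)
        return min(D, key=lambda d: (abs(life(d) - avg), -d[0]))
    raise ValueError(mode)
```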

4.3 Step 4. Classify

The neighborhood relationships of a q-simplex σ ∈ K are recovered by using the link, star, and closed star (Definition 2). A key component of the proposed method, detailed in this section, is the label propagation over a filtered simplicial complex.

Suppose a preferred sub-complex Ki ⊆ K has been selected according to Section 4.2. Let A be the R-module with generators l1, l2, . . . , lN, with N = |L|. We consider 0 ∈ A to represent no-value. The generator lj will be associated with the label lj according to Definition 5.

Definition 5 (Association function). Let Φi : Ki → A be the association function defined on a 0-simplex v ∈ Ki as Φi(v) = lj if (v, lj) ∈ TS, and Φi(v) = 0 in any other case. The association function extends to a q-simplex σ as Φi(σ) = Σ_{v∈V(σ)} Φi(v).

In a simplicial complex, we can use the link operation to propagate labels from labeled points in S to unlabeled points in X. To do so, we define the extension function as follows:

Definition 6 (Extension function). Let Ψi : X → A be the function defined on a point v ∈ X by

Ψi(v) = Σ_{σ∈LkKi([v])} Φi(σ) / ξK([V(σ) ∪ {v}]). (11)

According to Lemma 1, we can also obtain an equivalent formula:

Ψi(v) = Σ_{µ∈StKi([v])} Φi([V(µ) \ {v}]) / ξK(µ). (12)


Figure 4: A barcode representation of PH in K is shown in (a), which has two homology groups: H0 (group of connected components) and H1 (group of 1-cycles). We show two persistence intervals, [b1, d1] (purple) and [b2, d2] (yellow), corresponding to the RandInt(·) and MaxInt(·) selection functions, respectively. In (b) and (c), respectively, the sub-complexes Kd1 ⊆ K and Kd2 ⊆ K are shown.

Figure 5: From a selected sub-complex Ki ⊆ K in (a), the star of x1 is obtained in (b). In (c), Lk([x1]) is shown divided into 0-simplices, 1-simplices, and 2-simplices. In (d), the extension function Ψi(x1) is evaluated, and finally the labeling function Υi(x1) assigns a green label to x1. The ξK function acts as a tie-breaker among the green/red/blue label contributions, even when, as in panel (b), there might seem to be a tie between labels.

In Equations 11 and 12, we obtain the co-faces of v such that µ ∈ StKi([v]), µ = [V(σ) ∪ {v}], σ ∈ LkKi([v]), according to Lemma 1. The filtration value ξK(µ) is applied to prioritize the influence of σ in labeling v. Let α, β ∈ StKi([v]) be two simplices such that ξK(α) < ξK(β). This condition implies that α was clustered around v earlier than β, since α appears before β in the filtration. In consequence, the contributions of α should be more important than those of β. Using filtration values as inverse weights provides several properties, such as distance-encoding operators, indirect local outlier factors, and density estimators. Figure 5, section (d), shows the impact of ξK(·)^−1 in classifying an unlabeled point x1 ∈ X.

According to the previous definitions, given a point v ∈ X, the evaluation of the extension function at v is Ψi(v) = Σ_{j=1}^{N} aj · lj, where aj ∈ R+ ∪ {0}, j = 1, . . . , N.

Definition 7 (Labeling function). Let v be a point in X such that Ψi(v) = Σ_{j=1}^{N} aj · lj. Let a be the maximum value in {aj}_{j=1}^{N}, and let A = {j | aj = a} be the set of maximum-value indexes. We define the labeling function Υi at v as Υi(v) = lk, where k is uniformly selected at random from A. If a = 0, then Υi(v) = ∅.
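Definitions 5–7 admit a compact sketch. Here the closed star of [v] is assumed to be available as a list of (simplex, filtration value) pairs (e.g., from GUDHI's `get_star()`), `labels` maps training vertices to class indexes, and the helper names are ours:

```python
import random

def extension(star_v, v, labels, n_classes):
    """Equation 12: Psi_i(v) as a vector of label contributions a_j,
    each co-face weighted by its inverse filtration value."""
    a = [0.0] * n_classes
    for simplex, filt in star_v:          # mu in St_Ki([v]), xi_K(mu)
        for u in simplex:                 # Phi_i([V(mu) \ {v}])
            if u != v and u in labels:
                a[labels[u]] += 1.0 / filt
    return a

def labeling(a):
    """Definition 7: majority label, uniform tie-break, None if a = 0."""
    m = max(a)
    if m == 0:
        return None                       # handled as in Section 4.4
    return random.choice([j for j, aj in enumerate(a) if aj == m])
```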


Algorithm 1 Labeling: labeling a point set X.

Require: A filtered simplicial complex K; a non-empty point set X.
Ensure: A predicted label list Y for X.

1: D ← GetPersistenceIntervalSet(K), where D = {di | di = (birth, death)}
2: Get a desired persistence interval d, where d ∈ {MaxInt(D), RandInt(D), AvgInt(D)}
3: εi ← d[birth]; Ki ← ψF(εi); Y ← {}
4: while X ≠ ∅ do
5:     v ∈ X; l ← Υi(v)
6:     if l = ∅ then
7:         l ← handling_special_cases(v)   {see Section 4.4}
8:     end if
9:     Y ← Y ∪ {l}; X ← X \ {v}
10: end while
11: return Y

If there is a unique maximum in the set {aj}_{j=1}^{N} from the previous definition, the labeling function is uniquely defined at v. In all tested datasets, the label assignment of each point in X was uniquely defined because the factor 1/ξK(·) acts as a tie-breaker. Figure 5 shows the labeling process on a previously selected sub-complex, and the classification process is summarized in Algorithm 1.

4.4 Dealing with special cases

There are two cases where a = 0 in Definition 7: when a point v is isolated in Ki (its link is empty, LkKi([v]) = ∅), or when all points in LkKi([v]) are unlabeled.

4.4.1 Isolated points

To handle isolated points, we look for a collection of points close to v. This collection is defined as Uv = {u ∈ V(K0) | f(u, v) ≤ 2 · εi}, where f(·,·) is the distance or dissimilarity function used to build K, and εi = d[death] is the death time of the persistence interval d chosen to recover Ki (Section 4.2). Since Uv could be a non-disjoint collection of 0-simplices in Ki, we compute their label contributions with respect to v as Ψi(v) = Σ_{u∈Uv} Ψi(u)/f(u, v). We use 1/f(·,·) to give more importance to the label contributions of points closer to v. Then, Υi(v) is computed according to Definition 7. Note that when Uv ⊆ X, we fall into the second case.

4.4.2 Unlabeled link

To address the case when all points in LkKi([v]) are unlabeled, we look for the shortest paths from v to labeled points. This problem can be considered “semi-supervised learning” by constraining the domain to a neighborhood around the link. We propose a heuristic that reaches fast convergence by using a priority queue Q. We insert every σ ∈ StKi([v]) into Q, using its filtration value as priority, ρ(σ) = ξK(σ). We also maintain a flag to avoid processing simplices more than once. While Q is not empty, we process τ ∈ Q with priority ρ(τ), quantifying its label contributions as

{aj}_{j=1}^{N} = Σ_{τ∈Q} Σ_{µ∈StKi(τ)} Φi(µ) / (ρ(τ) + ξK(µ)).

Every non-visited µ such that V(µ) ⊆ X is added to Q with priority ρ(µ) = ρ(τ) + ξK(µ). At the end, we report the label with the majority of votes according to Definition 7.

5 Experimental Results

The proposed TDA-based classifier (TDABC) was evaluated with the three selection functions: RandInt(·) (TDABC-R), MaxInt(·) (TDABC-M), and AvgInt(·) (TDABC-A). Four baseline methods were selected for comparison: k-Nearest Neighbors (KNN), distance-weighted k-NN (WKNN), Linear Support Vector Machine (LSVM), and Random Forest (RF). All baseline classifiers were manually configured to deal with imbalanced datasets using the known class frequencies (the “class_weight” parameter in the Scikit-Learn library [17]). Table 1 shows the datasets and their characteristics.


Table 1: Selected datasets to evaluate proposed and baseline classifiers.

Name       Dimensions  Classes  Size  Samples per class         Noise  Mean                     Stdev
Circles    2           2        50    [25, 25]                  3      -                        -
Moon       2           2        200   [100, 100]                10     -                        -
Swissroll  3           6        300   [50, 50, 50, 50, 50, 50]  10     -                        -
Iris       4           3        150   [50, 50, 50]              -      -                        -
Normal     350         5        300   [60, 10, 50, 100, 80]     -      [0, 0.3, 0.18, 0.67, 0]  0.486
Sphere     3           5        653   [500, 100, 25, 16, 12]    -      0.3                      0.147
Wine       13          3        178   [59, 71, 48]              -      -                        -
Cancer     30          2        569   [212, 357]                -      -                        -

5.1 Artificial datasets

The Circles, Swissroll, Moon, Normal, and Sphere datasets were artificially generated. In the case of Circles, Moon, and Swissroll, a Gaussian noise factor was added to blur per-class boundaries and to assess classification performance on overlapped data regions.

The Normal and Sphere datasets were generated from a Normal distribution per dimension. The Normal dataset is high-dimensional (P ⊂ R350, |P| < 350). The Sphere dataset is three-dimensional (P ⊂ R3) and aims to capture entangled and imbalanced sample distributions.

5.2 Real-world datasets

The Iris, Wine, and Cancer datasets were selected as real datasets to compare the proposed classifiers with the baseline ones. The Iris dataset [18] is a balanced dataset where one class is linearly separable and the other two are slightly entangled with each other. The Wine dataset [18] is an imbalanced dataset with thirteen different measurements to classify three types of wine. The Breast Cancer dataset (Cancer) [18] is an imbalanced dataset with thirty features and two classes. The Wine and Cancer datasets were transformed using a logarithmic transformation: the resulting dataset is P′ = {ln(p + M)}_{p∈P}, with M the minimum component value of the dataset, used to deal with negative numbers. No transformation was required for the other datasets.
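In NumPy this transformation is a one-liner; our reading of the offset (shifting by the absolute value of the dataset minimum so the logarithm's argument stays positive) is an assumption:

```python
import numpy as np

def log_transform(P):
    """P' = { ln(p + M) }: shift by the dataset minimum before the log."""
    M = abs(P.min())             # assumption: |min| makes p + M >= 0
    return np.log(P + M + 1e-9)  # tiny epsilon avoids log(0) at the minimum
```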

5.3 Evaluation methodology

We build the baseline and the proposed classifiers using the Euclidean distance on all datasets.

The classifier evaluation across datasets was conducted using repeated R-fold cross-validation (10% folds, N = 5 repetitions). We divide the results according to the balance condition of the datasets: balanced dataset results are shown in Table 2, and imbalanced dataset results in Table 3. We compute the following metrics: F1 = 2 · (Precision · Recall)/(Precision + Recall), and the ROC-AUC with a one-vs-rest approach and macro averaging. For imbalanced datasets, we also compute the Precision-Recall (PR) curve and report the Average Precision metric (PR-AUC).

Additionally, we follow the experimental setting presented in [19] to assess the classifiers’ behavior under imbalanced data conditions. We generate 16 two-class datasets in R2 using the Normal distribution. The label 0 (positive) samples were generated with µ = 0, σ = 1.1, and the label 1 (negative) samples with µ = 2.0, σ = 2.2. We start with a dataset of 100 samples, 50 per class; we then keep the same 50 samples in the positive class while growing the negative class in steps of 50 samples up to 800. We perform repeated cross-validation on each dataset, computing the average AUC and its standard deviation, and plot an AUC curve with the standard deviation represented by vertical lines. Figure 6 shows the AUC curve per classifier.
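For reproducibility, a sketch of this imbalance protocol with scikit-learn, using KNN as a stand-in classifier (the paper evaluates all methods of Section 5; the seeds and estimator here are illustrative):

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(42)
pos = rng.normal(0.0, 1.1, size=(50, 2))         # label 0: fixed 50 samples
curve = []
for n_neg in range(50, 801, 50):                 # 16 datasets: 50-50 .. 50-800
    neg = rng.normal(2.0, 2.2, size=(n_neg, 2))  # label 1: growing class
    X = np.vstack([pos, neg])
    y = np.array([0] * len(pos) + [1] * n_neg)
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=5, random_state=0)
    auc = cross_val_score(KNeighborsClassifier(), X, y,
                          scoring="roc_auc", cv=cv)
    curve.append((n_neg, auc.mean(), auc.std()))  # points of Figure 6
```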

5.4 Analysis

Experiments on balanced datasets (Table 2) show that the proposed TDABC methods behave competitively with respect to the baseline classifiers (better than the average), even against broadly used classifiers such as LSVM and RF. Specifically, when the topology becomes complex, as in the Swissroll case where no hyperplane correctly separates the classes, TDABC shows better F1 and AUC metrics than all reference methods.


Table 2: Metric results per classifier across balanced datasets. Classifiers superior to the arithmetic mean are shown in black.

F1
CLASSIFIERS  CIRCLES  MOON    SWISSROLL  IRIS
TDABC-A      0.620    0.505   0.829*     0.932
TDABC-M      0.620    0.509*  0.813      0.922
TDABC-R      0.632    0.503   0.805      0.943
KNN          0.449    0.445   0.738      0.961*
WKNN         0.480    0.463   0.782      0.951
LSVM         0.638*   0.426   0.716      0.937
RF           0.577    0.501   0.726      0.935
Average      0.573    0.478   0.773      0.940

ROC-AUC
TDABC-A      0.620    0.505   0.993      0.986
TDABC-M      0.620    0.510*  0.990      0.986
TDABC-R      0.632    0.504   0.991      0.984
KNN          0.460    0.445   0.992      0.996
WKNN         0.480    0.465   0.996*     0.998*
LSVM         0.640*   0.428   0.991      0.993
RF           0.580    0.501   0.993      0.994
Average      0.576    0.479   0.992      0.991

Table 3: Metric results per classifier across imbalanced datasets. We show global metric results and the results on the minority (Min) class. Classifiers superior to the arithmetic mean are shown in black.

F1
             NORMAL           SPHERE           WINE             CANCER
CLASSIFIERS  Global   Min     Global   Min     Global   Min     Global   Min
TDABC-A      0.351    0.019   0.599    0.250*  0.950    0.941   0.935    0.917
TDABC-M      0.397    0.059*  0.525    0.100   0.942    0.933   0.943    0.928
TDABC-R      0.336    0.027   0.556    0.138   0.917    0.905   0.934    0.916
KNN          0.411    0.022   0.360    0.000   0.918    0.906   0.947    0.933
WKNN         0.412    0.022   0.420    0.000   0.918    0.906   0.947    0.933
LSVM         0.425    0.042   0.506    0.043   0.940    0.930   0.932    0.914
RF           0.250    0.000   0.520    0.075   0.978    0.974*  0.954    0.941*
Average      0.369    0.027   0.498    0.086   0.938    0.928   0.942    0.926

PR-AUC
TDABC-A      0.765    0.500   0.959    0.300   0.978    1.000*  0.928    0.928
TDABC-M      0.739    0.529   0.975    0.370*  0.977    0.999   0.937    0.937
TDABC-R      0.706    0.408   0.969    0.310   0.977    1.000*  0.928    0.928
KNN          0.796    0.724   0.965    0.150   0.990    1.000*  0.943    0.943
WKNN         0.817    0.760*  0.973    0.210   0.993    1.000*  0.943    0.943
LSVM         0.872    0.424   0.969    0.300   0.994    0.999   0.929    0.929
RF           0.730    0.054   0.981    0.250   0.999    1.000*  0.949    0.949*
Average      0.775    0.486   0.970    0.271   0.987    1.000   0.937    0.937

The Circles and Moon datasets are balanced and have highly entangled classes due to the noise factor, which makes classification challenging. On these datasets, KNN (F1 = 0.449 and F1 = 0.445) and WKNN (F1 = 0.480 and F1 = 0.463) perform poorly (below average). This behavior is related to the fixed value of k and to the assumption that each data point is equally relevant. Even though WKNN imposes a local data point weight based on distances, this is not enough for highly entangled classes, as our results show. The TDABC methods deal with the entanglement challenge through a disambiguation factor based on filtration values (ξK). The Iris dataset is a simple case, where practically all methods perform well.


Figure 6: ROC-AUC curves across the 16 generated datasets, from 50-50 to 50-800 samples per class, evaluating how the classifiers behave under several imbalance conditions.

We then analyze the imbalanced datasets without resampling (Table 3). In the Normal and Sphere datasets there is a high class imbalance ratio, with [60, 10, 50, 100, 80] and [500, 100, 25, 16, 12] samples per class, respectively. In these scenarios the proposed TDABC is still competitive in terms of F1 and has the best performance on the minority class. In high imbalance scenarios, it is important to have dynamic neighborhoods: the proposed method generates dynamically sized “neighborhoods” for each point, in contrast to the KNN and WKNN classifiers. In the high imbalance case, the disambiguation factor (ξK) also provides a multi-scale local weight to the TDABC methods.

The Wine and Cancer datasets have 13 and 30 dimensions, respectively, with a low class imbalance ratio ([59, 71, 48] and [212, 357] samples per class, respectively). In these scenarios TDABC is competitive (F1), both globally and on the minority class. We confirm experimentally that as the imbalance ratio increases, the relative performance of our TDABC methods with respect to the baseline methods also grows, as shown in Figure 6.

6 Conclusions

Overall, we show that TDA can be used to perform classification tasks without any additional ML method. To our knowledge, this is the first study to propose this approach for classification. Three methods were presented, TDABC-A, TDABC-R, and TDABC-M, which behave competitively with state-of-the-art methods.

In our method, PH plays a key role in selecting a sub-complex that approximates the data topology well enough, through the MaxInt, AvgInt, or RandInt selection functions. Moreover, the link and star operators provide dynamic neighborhoods for classification. This work offers a novel application of filtration values as inverse weights to measure each simplex’s label contribution, with several properties for dealing with overlapped classes, such as distance-encoding operators, indirect local outlier factors, and density estimators.


Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported by ANID grants DOCTORADO 21181978, FONDECYT 1211484, ICN09_015, and PIA ACT192015, and by a CONACYT postdoctoral fellowship (Mexico). We thank Prof. J.C. Gómez-Larrañaga from CIMAT for his insightful discussion.


References

[1] Gunnar Carlsson. “Topology and data”. In: Bulletin of the American Mathematical Society 46.2 (Jan. 2009), pp. 255–308.

[2] Herbert Edelsbrunner and John Harer. Computational Topology - an Introduction. Michigan, USA: American Mathematical Society, 2010. ISBN: 978-0-8218-4925-5.

[3] Rolando Kindelan et al. Classification based on Topological Data Analysis. 2021. arXiv: 2102.03709 [cs.LG].

[4] Nieves Atienza, Rocio Gonzalez-Díaz, and Manuel Soriano-Trigueros. “On the stability of persistent entropy and new summary functions for topological data analysis”. In: Pattern Recognition 107 (2020), p. 107509. ISSN: 0031-3203.

[5] Kathryn Garside et al. “Topological data analysis of high resolution diabetic retinopathy images”. In: PLOS ONE 14.5 (May 2019), pp. 1–10.

[6] Vinay Venkataraman, Karthikeyan Natesan Ramamurthy, and Pavan K. Turaga. “Persistent homology of attractors for action recognition”. In: Proc. ICIP. 2016, pp. 4150–4154.

[7] Yuhei Umeda. “Time Series Classification via Topological Data Analysis”. In: Trans. Jpn. Soc. Artif. Intell. 32 (May 2017), D–G72_1.

[8] Lee M. Seversky, Shelvi Davis, and Matthew Berger. “On Time-Series Topological Data Analysis: New Data and Opportunities”. In: CVPRW. June 2016, pp. 1014–1022.

[9] Sourav Majumdar and Arnab Kumar Laha. “Clustering and classification of time series using topological data analysis with applications to finance”. In: Expert Syst. Appl. 162 (2020), p. 113868. ISSN: 0957-4174.

[10] Nitesh Chawla et al. “SMOTE: Synthetic Minority Over-sampling Technique”. In: J. Artif. Intell. Res. (JAIR) 16 (June 2002), pp. 321–357. DOI: 10.1613/jair.953.

[11] Ankur Goyal, Likhita Rathore, and Sandeep Kumar. “A Survey on Solution of Imbalanced Data Classification Problem Using SMOTE and Extreme Learning Machine”. In: Communication and Intelligent Systems. Ed. by Harish Sharma et al. Singapore: Springer Singapore, 2021, pp. 31–44. ISBN: 978-981-16-1089-9.

[12] Asniar, Nur Ulfa Maulidevi, and Kridanto Surendro. “SMOTE-LOF for noise identification in imbalanced data classification”. In: J. King Saud Univ.-Comp. Inf. Sci. (2021). ISSN: 1319-1578.

[13] Pattaramon Vuttipittayamongkol and Eyad Elyan. “Neighbourhood-based undersampling approach for handling imbalanced and overlapped data”. In: Information Sciences 509 (Jan. 2020), pp. 47–70. ISSN: 0020-0255.

[14] H. Ibrahim, S. A. Anwar, and M. I. Ahmad. “Classification of imbalanced data using support vector machine and rough set theory: A review”. In: Journal of Physics: Conference Series 1878.1 (May 2021), p. 012054.

[15] Xiuzhen Zhang et al. “KRNN: k Rare-class Nearest Neighbour classification”. In: Pattern Recognit. 62 (Feb. 2017), pp. 33–44. ISSN: 0031-3203.

[16] Jean-Daniel Boissonnat and Siddharth Pritam. “Edge Collapse and Persistence of Flag Complexes”. In: Proc. SoCG. 2020, 19:1–19:15. ISBN: 978-3-95977-143-6.

[17] Fabian Pedregosa et al. “Scikit-learn: Machine Learning in Python”. In: Journal of Machine Learning Research 12 (Jan. 2012).

[18] Dheeru Dua and Casey Graff. UCI Machine Learning Repository. 2017. URL: http://archive.ics.uci.edu/ml.

[19] Miroslav Kubat, Robert Holte, and Stan Matwin. “Learning when negative examples abound”. In: Proc. ECML-97. Springer, 1997, pp. 146–153. ISBN: 978-3-540-68708-5.
