Topological Data Analysis

Avik Laha

Columbia University

April 12, 2019

Motivation

A somewhat different presentation than others: look at a general method rather than a paper

Overall, the idea is that many interesting characteristics of data should not depend on certain details of the representation, i.e. they are topological

Will largely make use of Chazal and Michel's An introduction to Topological Data Analysis: fundamental and practical aspects for data scientists [1]

Overview

First, we will look at what it means for a feature in data to be "topological", and at topological invariants

Then, we will discuss persistent homology in particular as a realization of TDA

Finally, we will briefly touch on applications

Topological features

Toy example: for data obtained by different measurement schemes, the interesting feature (a hole) is preserved

What is topology?

Definition (Topological Space)

A pair X = (S, T) where S is a set and T a set of its subsets such that:

1. ∅, S ∈ T

2. T is closed under arbitrary unions of its elements

3. T is closed under finite intersections of its elements

Interpret elements of T as open sets

Gives a notion of a continuous map (preimage of any open set is open); topology is the study of such spaces and continuous maps between them

For X, Y topological spaces, if f : X → Y is a continuous map with continuous inverse, it is a homeomorphism, and X and Y are homeomorphic, written X ≅ Y
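
For a finite set the three axioms can be checked mechanically. The following is a minimal sketch, not part of the original slides; the function name is_topology is hypothetical, and it relies on the fact that for a finite family, closure under pairwise unions and intersections already implies closure under all unions and finite intersections.

```python
from itertools import combinations

def is_topology(S, T):
    """Check the topology axioms for a finite set S and a family T of subsets."""
    S = frozenset(S)
    T = {frozenset(U) for U in T}
    # Axiom 1: the empty set and S itself are open.
    if frozenset() not in T or S not in T:
        return False
    # Every member of T must be a subset of S.
    if any(not U <= S for U in T):
        return False
    # Axioms 2 and 3 (finite case): pairwise unions and intersections stay in T.
    return all((A | B) in T and (A & B) in T for A, B in combinations(T, 2))

# A nested chain of sets is a topology on {1, 2, 3}.
chain = [set(), {1}, {1, 2}, {1, 2, 3}]
# This family is not: the union {1} | {2} = {1, 2} is missing.
broken = [set(), {1}, {2}, {1, 2, 3}]
print(is_topology({1, 2, 3}, chain), is_topology({1, 2, 3}, broken))
```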

Topology and learning

Sensible to consider the sample space as a topological space, as any metric space has a natural topology

Collection of data is application of some measurement map f : X → Y to elements of a viable domain A ⊂ X

Question (for future): how do we recover A or f⁻¹(B) for B ∈ Y, given we only have finitely many samples?

Simplices

First, need a way to encode topology which we can work with

An n-simplex is intuitively a basic n-dimensional object, i.e. the convex hull of n + 1 affinely independent points
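
Affine independence of p_0, ..., p_n is equivalent to linear independence of the differences p_i - p_0, which can be tested by a rank computation. The sketch below is illustrative only (the name affinely_independent is hypothetical) and uses exact rational arithmetic from the standard library to avoid rounding issues.

```python
from fractions import Fraction

def affinely_independent(points):
    """True if the points span a genuine (len(points) - 1)-simplex."""
    base = points[0]
    # Rows are the difference vectors p_i - p_0.
    rows = [[Fraction(a) - Fraction(b) for a, b in zip(p, base)]
            for p in points[1:]]
    rank = 0
    for col in range(len(base)):
        # Find a row at or below position `rank` with a non-zero entry in this column.
        pivot = next((r for r in range(rank, len(rows)) if rows[r][col] != 0), None)
        if pivot is None:
            continue
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        # Eliminate this column from the rows below the pivot.
        for r in range(rank + 1, len(rows)):
            factor = rows[r][col] / rows[rank][col]
            rows[r] = [x - factor * y for x, y in zip(rows[r], rows[rank])]
        rank += 1
    # Independent exactly when every difference vector contributes to the rank.
    return rank == len(rows)

print(affinely_independent([(0, 0), (1, 0), (0, 1)]))  # a triangle (2-simplex)
print(affinely_independent([(0, 0), (1, 1), (2, 2)]))  # three collinear points
```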

Simplicial complexes

Abstractly, a generalization of a graph: a 0-simplicial complex is a set of points, a 1-simplicial complex is a graph, ...

An n-simplicial complex contains simplices of dimension up to n (including all lower dimensions)

Geometrically, just a set of simplices

Definition (Simplicial complex)

A pair (V, K) where V consists of "vertices", K is a collection of finite subsets of V which contains all vertices, and obeys σ ∈ K ⟹ every subset ς ⊂ σ has ς ∈ K
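
The closure condition in this definition is easy to test on small examples. The following sketch is an illustration rather than anything from the slides: closure and is_simplicial_complex are hypothetical helper names, simplices are stored as frozensets of vertex labels, and only non-empty faces are tracked (a common convention assumed here).

```python
from itertools import combinations

def closure(simplices):
    """All non-empty faces of the given simplices (the complex they generate)."""
    K = set()
    for s in simplices:
        s = tuple(s)
        for k in range(1, len(s) + 1):
            K.update(frozenset(f) for f in combinations(s, k))
    return K

def is_simplicial_complex(V, K):
    """Check that K contains every vertex of V and every non-empty face of each simplex."""
    V = set(V)
    K = {frozenset(s) for s in K}
    if any(frozenset([v]) not in K for v in V):
        return False
    for s in K:
        if not s <= V:
            return False
        for k in range(1, len(s)):
            if any(frozenset(f) not in K for f in combinations(s, k)):
                return False
    return True

# A filled triangle {0, 1, 2} with an extra edge {2, 3} attached.
K = closure([(0, 1, 2), (2, 3)])
print(len(K), is_simplicial_complex(range(4), K))
# Dropping the edge {0, 1} breaks the closure condition.
print(is_simplicial_complex(range(4), K - {frozenset({0, 1})}))
```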

Simplicial complexes, cont.

Simplicial complexes from data

For now, assume that X is a finite set of points in a metric space (M, ρ), d is the inherited metric on X, and α ∈ R+:

Definition (Vietoris-Rips Complex)

$\mathrm{Rips}_\alpha(X)$ := the set of simplices $\sigma = [x_0, \dots, x_n]$ such that $d(x_i, x_j) \le \alpha$ for all $i, j$

Definition (Čech Complex)

$\mathrm{Cech}_\alpha(X)$ := the set of simplices $\sigma = [x_0, \dots, x_n]$ such that $\bigcap_{i=0}^{n} B_\alpha(x_i) \ne \emptyset$

Note that $B_\alpha(x_i)$ is the (closed) ball of radius α centered on $x_i$

Related by $\mathrm{Rips}_\alpha(X) \subset \mathrm{Cech}_\alpha(X) \subset \mathrm{Rips}_{2\alpha}(X)$
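
Because the Vietoris-Rips condition only involves pairwise distances, the complex can be built by brute force for small point sets. This is a minimal sketch under stated assumptions: points live in Euclidean space, the function name rips_complex and the max_dim cutoff are illustrative choices, and the enumeration is exponential in general, so it is not how production TDA libraries proceed.

```python
from itertools import combinations
from math import dist

def rips_complex(points, alpha, max_dim=2):
    """Simplices (as index tuples) whose vertices are pairwise within distance alpha."""
    n = len(points)
    close = {(i, j) for i, j in combinations(range(n), 2)
             if dist(points[i], points[j]) <= alpha}
    simplices = [(i,) for i in range(n)]
    for k in range(2, max_dim + 2):            # k vertices form a (k - 1)-simplex
        for s in combinations(range(n), k):
            if all(pair in close for pair in combinations(s, 2)):
                simplices.append(s)
    return simplices

square = [(0, 0), (1, 0), (1, 1), (0, 1)]
# At alpha = 1 only the four sides are edges, so the complex is a hollow square.
print(rips_complex(square, 1.0))
# At alpha = 1.5 the diagonals appear and every triangle is filled in.
print(len(rips_complex(square, 1.5)))
```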

Rips and Čech complexes
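
A small worked example, added here as an illustration and not taken from the slides, shows that both inclusions between the two complexes can be strict. Take X to be the three vertices of an equilateral triangle of side 1 and α = 0.55; the only facts used are that two closed balls of radius α meet when 2α is at least the distance between centers, and that all three meet when α is at least the circumradius 1/√3.

```latex
\begin{aligned}
\mathrm{Rips}_{0.55}(X) &= \{[x_0],[x_1],[x_2]\}
  && \text{since } d(x_i,x_j) = 1 > 0.55, \\
\mathrm{Cech}_{0.55}(X) &= \mathrm{Rips}_{0.55}(X) \cup \{[x_0,x_1],[x_0,x_2],[x_1,x_2]\}
  && \text{since } 2 \cdot 0.55 \ge 1 \text{ but } 0.55 < 1/\sqrt{3} \approx 0.577, \\
\mathrm{Rips}_{1.1}(X) &= \mathrm{Cech}_{0.55}(X) \cup \{[x_0,x_1,x_2]\}
  && \text{since } d(x_i,x_j) = 1 \le 1.1 .
\end{aligned}
```

Here the Čech complex is a hollow triangle (one loop), while the two Rips complexes are three isolated points and a filled triangle respectively.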

Summary so far

The topology of data is potentially interesting, so we decided to look into it

But actual datasets are just finite samples, and in any case topological spaces generally have infinite descriptions

Introduced simplicial complexes and found a way to build them from finite sets of points, but does this actually help us understand the topology of data?

Nerve theorem

In short, yes (given satisfaction of certain conditions)

Definition (Nerve)

For a cover $\mathcal{U} = \{U_i\}$ of M, the nerve is the simplicial complex $C(\mathcal{U})$ := the set of simplices $\sigma = [U_{i_0}, \dots, U_{i_n}]$ such that $\bigcap_{j=0}^{n} U_{i_j} \ne \emptyset$
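
The nerve is purely combinatorial once the intersections are known. The sketch below is a toy illustration with hypothetical names (nerve, max_dim): the cover elements are finite sets of sample labels, so it only records which elements overlap and says nothing about whether the topological hypotheses needed later actually hold.

```python
from itertools import combinations

def nerve(cover, max_dim=2):
    """Nerve of a cover given as {name: set_of_points}, up to dimension max_dim."""
    names = sorted(cover)
    simplices = []
    for k in range(1, max_dim + 2):
        for combo in combinations(names, k):
            # A simplex is present when the chosen cover elements share a point.
            if set.intersection(*(set(cover[name]) for name in combo)):
                simplices.append(combo)
    return simplices

# Six points around a circle, covered by three overlapping arcs.
arcs = {"A": {0, 1, 2}, "B": {2, 3, 4}, "C": {4, 5, 0}}
# Each pair of arcs overlaps but no point lies in all three,
# so the nerve is a hollow triangle, which has the same shape as the circle.
print(nerve(arcs))
```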

Nerve theorem, cont.

Definition (Homotopy, etc.)

For continuous f, f′ : X → Y, a homotopy is a continuous map h : X × [0, 1] → Y such that h(x, 0) = f(x) and h(x, 1) = f′(x). If f, f′ permit a homotopy, they are homotopic, and if there exist continuous f : X → Y and g : Y → X such that f ∘ g and g ∘ f are homotopic to the identity maps, X and Y are homotopy-equivalent

Roughly, X can be continuously deformed into Y ⟺ they are homotopy-equivalent

If X ≅ Y then they are homotopy-equivalent, but the converse is not necessarily true

Nerve theorem, cont.

If a space is homotopy-equivalent to a point, it is contractible; the top row is contractible while the bottom row is not:

Nerve theorem, cont.

Proposition (Nerve Theorem)

Let $\mathcal{U} = \{U_i\}_{i \in I}$ be a cover of M such that for any subset $A \subset I$, the intersection $U_A := \bigcap_{i \in A} U_i$ is empty or contractible. Then M is homotopy-equivalent to the nerve $C(\mathcal{U})$

Note that balls in R^n are convex, and intersections of convex sets are convex (hence empty or contractible). As the Čech complex is the nerve of such balls of fixed radius around a set of points, it is homotopy-equivalent to the union of those balls
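
The step "convex, hence contractible" can be spelled out with a straight-line homotopy. The following is a standard argument added for completeness; the symbols C (a non-empty convex subset of R^n) and c (any fixed point of C) are introduced here and do not appear in the slides.

```latex
h : C \times [0,1] \to C, \qquad h(x,t) = (1-t)\,x + t\,c, \qquad
h(x,0) = x, \quad h(x,1) = c .
```

Convexity guarantees that h(x, t) stays inside C, so h is a homotopy from the identity of C to the constant map at c. Together with the inclusion of {c} into C this makes C homotopy-equivalent to a point.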

Reconstruction theorem

Our previous observation might make us hope that the Čech complex can summarize the topological data of some space X, and the Reconstruction Theorem tells us that this is indeed true under certain (technical) conditions

Another example

Another example, cont.

Another example, cont.

Another example, cont.

Homology

We want a concise way of summarizing the topological characteristics of an object: homology provides a set of invariants which do just that

Associates a set of groups (which will indeed be vector spaces for simplicial homology with coefficients in a field) to a topological space

Does not uniquely identify a topological space: if X, Y are homotopy-equivalent, they have the same homology groups, but the converse is not necessarily true, and certainly they are not necessarily homeomorphic (see: pseudocircle)

Betti numbers

The k-th Betti number of a topological space X is the dimension of its k-th homology group

Roughly, β0 corresponds to the number of connected components, β1 to the number of punctures, β2 to the number of "voids", ...
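
For a finite simplicial complex the Betti numbers reduce to linear algebra: β_k = dim C_k - rank ∂_k - rank ∂_{k+1}, where ∂_k is the boundary map from k-simplices to (k-1)-simplices. The sketch below is an illustration under explicit assumptions: coefficients are taken in the two-element field (so signs can be ignored), the function names are hypothetical, and the input is a list of maximal simplices given as vertex tuples.

```python
from itertools import combinations

def rank_mod2(rows):
    """Rank over the two-element field; each row is an integer bitmask."""
    rows, rank = list(rows), 0
    while rows:
        pivot = rows.pop()
        if pivot == 0:
            continue
        rank += 1
        low = pivot & -pivot                      # lowest set bit of the pivot row
        rows = [r ^ pivot if r & low else r for r in rows]
    return rank

def betti_numbers(maximal_simplices):
    """Betti numbers (mod 2) of the complex generated by the given simplices."""
    simplices = set()
    for s in maximal_simplices:
        s = tuple(sorted(s))
        for k in range(1, len(s) + 1):
            simplices.update(combinations(s, k))
    top = max(len(s) for s in simplices) - 1
    by_dim = [sorted(s for s in simplices if len(s) == k + 1) for k in range(top + 1)]
    ranks = [0] * (top + 2)                       # ranks[k] = rank of boundary map on k-simplices
    for k in range(1, top + 1):
        index = {s: i for i, s in enumerate(by_dim[k - 1])}
        rows = []
        for s in by_dim[k]:
            mask = 0
            for face in combinations(s, k):       # the codimension-1 faces of s
                mask |= 1 << index[face]
            rows.append(mask)
        ranks[k] = rank_mod2(rows)
    return [len(by_dim[k]) - ranks[k] - ranks[k + 1] for k in range(top + 1)]

print(betti_numbers([(0, 1), (1, 2), (0, 2)]))                        # hollow triangle
print(betti_numbers([(0, 1, 2), (0, 1, 3), (0, 2, 3), (1, 2, 3)]))    # hollow tetrahedron
```

The hollow triangle has one component and one loop, and the hollow tetrahedron (a triangulated sphere) has one component, no loops, and one void, matching the rough interpretation above.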

Persistent homology

Our primary issue remaining is that in general it is not obvious what the correct radius is for construction of our simplicial complex

Persistent homology attempts to remedy this problem by highlighting the topological features which persist while growing the radii

Use persistence diagrams: these keep track of the increase/decrease of each Betti number, i.e. the birth/death of features as the radii increase
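
In dimension 0 the birth/death bookkeeping can be done with a union-find structure: every point is born as its own component at scale 0, and a component dies at the length of the edge that merges it into another. The sketch below assumes Euclidean points and the Vietoris-Rips convention in which an edge appears once the scale reaches the distance between its endpoints; the name h0_persistence is hypothetical, and higher-dimensional features need a full boundary-matrix reduction that is not shown.

```python
from itertools import combinations
from math import dist, inf

def h0_persistence(points):
    """(birth, death) pairs of connected components in the Rips filtration."""
    n = len(points)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]         # path halving
            i = parent[i]
        return i

    edges = sorted((dist(points[i], points[j]), i, j)
                   for i, j in combinations(range(n), 2))
    pairs = []
    for length, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:                              # this edge merges two components
            parent[ri] = rj
            pairs.append((0.0, length))           # one of them dies at this scale
    pairs.append((0.0, inf))                      # the last component never dies
    return pairs

# Two nearby points and one far away: a short-lived and a long-lived component.
print(h0_persistence([(0, 0), (1, 0), (5, 0)]))
```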

Toy example

Can consider the union of balls of radius r around X ⊂ R^n as a sublevel set of the natural function f_X : R^n → R (the distance to the nearest point of X), so let's look at persistence for a general function:
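
As a sketch of sublevel-set persistence for a general function, consider a function sampled along a line and track connected components of the set where the value is at most a threshold. The code below is illustrative only: sublevel_h0 is a hypothetical name, only dimension 0 is handled, and components are paired by the usual elder rule (when two components merge, the one born later dies).

```python
from math import inf

def sublevel_h0(values):
    """0-dimensional persistence pairs of a function sampled on a path graph."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    parent, birth = {}, {}                        # only vertices already in the sublevel set

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    pairs = []
    for i in order:                               # raise the threshold past values[i]
        parent[i], birth[i] = i, values[i]
        for j in (i - 1, i + 1):                  # neighbours on the line
            if j in parent:
                ri, rj = find(i), find(j)
                if ri == rj:
                    continue
                young, old = (ri, rj) if birth[ri] > birth[rj] else (rj, ri)
                if birth[young] < values[i]:      # skip zero-length intervals
                    pairs.append((birth[young], values[i]))
                parent[young] = old
    pairs.append((min(values), inf))              # the global minimum never dies
    return sorted(pairs)

# Local minima at 1, 0 and 0.5, separated by local maxima at 3 and 2.
print(sublevel_h0([1, 3, 0, 2, 0.5]))
```

Each finite pair matches a local minimum (birth) with the local maximum at which its component is absorbed by an older one (death).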

More complex example

Some good and bad things

Persistence diagrams are fairly stable under certain perturbations of data, as desired from a topological learning method

Care must be taken to deal with outliers: there are methods to mitigate this problem, but that is beyond the scope of this presentation
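
One standard way to make the stability statement precise, given here as background rather than as a claim from the slides, is the bottleneck stability bound for sublevel-set persistence. The notation is introduced for this note: d_B is the bottleneck distance between diagrams, dgm(f) the persistence diagram of a (tame) function f, d_X the distance function to a compact set X, and d_H the Hausdorff distance.

```latex
d_B\bigl(\mathrm{dgm}(f), \mathrm{dgm}(g)\bigr) \;\le\; \lVert f - g \rVert_\infty ,
\qquad
\lVert d_X - d_Y \rVert_\infty \;=\; d_H(X, Y) .
```

Taken together, diagrams built from the distance functions of two point sets differ by at most their Hausdorff distance. A single far-away outlier makes that distance large, which is one way to see why the bound offers no protection against outliers.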

Applications with machine learning

TDA has found application in a number of fields, including biology, chemistry, sensor networks, shape analysis, materials science, and cosmology

The method has done well with data which has some natural representation as a graph or complex, for example in genetics or cosmology, suggesting it may lend itself well to program analysis

Often used with other learning methods, e.g. an embedding of the initial data may be used to find the topological characteristics, or a CNN can be used to extract data from persistence diagrams

References

[1] Chazal, F., and Michel, B. An introduction to Topological Data Analysis: fundamental and practical aspects for data scientists.

[2] Humphreys, D. P., McGuirl, M. R., Miyagi, M., and Blumberg, A. J. Fast Estimation of Recombination Rates Using Topological Data Analysis. Genetics (February 2019).

[3] Shiu, G. Topological Data Analysis for Cosmology and String Theory.

[4] So, G. Topological Data Analysis.

[5] Umeda, Y. Time Series Classification via Topological Data Analysis. Transactions of the Japanese Society for Artificial Intelligence (2017).
