Statistics for a Computational Topologist, Part I
Brittany Terese Fasy (TA: Samuel A. Micka)
School of Computing and Dept. of Mathematical Sciences, Montana State University
August 14, 2018
B. Fasy (MSU) Statistics for a Computational Topologist August 14, 2018 1 / 25
Why Topological Data Analysis?
"Data has shape and the shape matters." - Gunnar Carlsson
Today, data is high-dimensional, HUGE, and present everywhere
[Image credits: Nicolau, Levine, and Carlsson, PNAS 2011; http://astrobites.com/; www.mapconstruction.org]
... and needs to be summarized, analyzed, and compared!
What questions do we ask in data analysis?
Think! Write down one question (2 min)
Pair! Share with your partner, and add more questions to your list (5 min)
Share! Raise hands please! (5 min)
More ideas? [email protected]
Data Analysis Questions
Summarize and Analyze
What is this shape? How many components / populations?
Can we categorize? (Classification)
What are the parameters? (Inference: Point Estimation)
How far do parameters likely lie from estimates? (Confidence Sets)
Compare
Are these the same? In distribution?
Has something changed? If so, what has changed?
Which is bigger?
Can we retain the null hypothesis? (Inference: Hypothesis Testing)
What is the relationship between X and Y? (Regression)
Most Important Questions
1. Which descriptor best captures our data?
Descriptors
Confidence Sets
2. How do we measure distance between descriptors?
Distances
Clustering
Descriptors
Topological Descriptors
Descriptors
Stat References
Wasserman. All of Statistics: A Concise Course in Statistical Inference. Springer, 2010.
Givens and Hoeting. Computational Statistics. Wiley, 2013.
Descriptors
Stat Slide: The Basics
Let F be a probability distribution with density f.
X ∼ F reads "X has distribution F". Here, X is called a random variable.
Expectation: E(X) = ∫ x dF(x).
Quantile function: CDF⁻¹(q).
[Plots: a density f and its CDF on [−4, 4]]
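A minimal sketch of these two quantities on simulated data (the distribution and sample size are illustrative, not from the slides): the sample mean estimates E(X) = ∫ x dF(x), and the empirical quantile estimates CDF⁻¹(q).

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=10_000)  # X ~ N(0, 1), illustrative

expectation = x.mean()        # estimates E(X) = ∫ x dF(x)
median = np.quantile(x, 0.5)  # estimates CDF^{-1}(0.5), the median

print(expectation, median)    # both near 0 for N(0, 1)
```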
Descriptors
Prob/Stat Slide: Descriptors and Limit Theory
Let F be some distribution.
Let X_1, X_2, ..., X_n ∼ F. (The data.)
A statistic or descriptor is a function of the data: T(X_1, X_2, ..., X_n) or T(X^n).
Sample average: X̄_n = (1/n) Σ_i X_i.
Law of Large Numbers
X̄_n converges to E(X_i) in probability:
∀ε > 0, lim_{n→∞} P(|X̄_n − E(X_i)| > ε) = 0.
Central Limit Theorem
√n (X̄_n − E(X_i)) converges in distribution to a Normal distribution, i.e., the sample average is approximately Normal for large enough samples.
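The two limit theorems above can be seen in a short simulation (a sketch with an illustrative Uniform(0, 1) distribution, for which E(X_i) = 1/2 and Var(X_i) = 1/12):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2_000

# Law of Large Numbers: one large sample's average is close to E(X_i) = 0.5.
x = rng.uniform(0.0, 1.0, size=n)
xbar = x.mean()

# Central Limit Theorem: sqrt(n) * (Xbar_n - E(X_i)) is approximately
# Normal(0, Var(X_i)); its spread over many replications should be
# near sqrt(1/12) ≈ 0.2887.
reps = 1_000
xbars = rng.uniform(0.0, 1.0, size=(reps, n)).mean(axis=1)
z = np.sqrt(n) * (xbars - 0.5)

print(xbar, z.std())  # xbar near 0.5; z.std() near sqrt(1/12)
```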
Descriptors
Data as Point Clouds
[Point cloud figure, annotated: big loop, noise, pinch]
Descriptors
Data as Persistence Diagrams
Confidence Sets
Confidence Sets for Persistence Diagrams: Analyzing Descriptors
Confidence Sets
Objective
To Find a Threshold: given α ∈ (0, 1), we will find q_α > 0 such that
P(W∞(D, D̂_n) ≤ q_α) ≥ 1 − α.
References
BTF, Lecci, Rinaldo, Wasserman, Balakrishnan, and Singh. Confidence Sets for Persistence Diagrams. Annals of Statistics, 2014.
Chazal, BTF, Lecci, Rinaldo, Singh, and Wasserman. On the Bootstrap for Persistence Diagrams and Landscapes. Modeling and Analysis of Information Systems, 2013.
Chazal, BTF, Lecci, Michel, Rinaldo, and Wasserman. Robust Topological Inference: Distance to a Measure and Kernel Distance. JMLR 18(159):1–40, 2018.
Confidence Sets
Stat Slide: Bootstrapping
Old idiom: "pull yourself up by your bootstraps"
Want: a parameter of an unknown distribution F.
Try: estimate it using the empirical distribution F̂_n.
A nonparametric technique!
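A minimal sketch of the nonparametric bootstrap (data, sample size, and the target parameter here are illustrative, not from the slides): resample with replacement from the empirical distribution, and use the spread of the resampled statistic as a stand-in for its unknown sampling spread.

```python
import numpy as np

rng = np.random.default_rng(2)
data = rng.exponential(scale=1.0, size=500)  # sample from an "unknown" F

# Bootstrap: B resamples of size n drawn with replacement from the data.
B = 2_000
boot_means = np.array([
    rng.choice(data, size=data.size, replace=True).mean()
    for _ in range(B)
])

# 95% percentile confidence interval for the mean of F
lo, hi = np.quantile(boot_means, [0.025, 0.975])
print(lo, hi)
```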
Confidence Sets
Bottleneck Bootstrap
We have a point cloud sample: S_n = {X_1, ..., X_n}; X_i ∼ P.
Subsample (with replacement), obtaining X* = {X*_1, ..., X*_b}.
Compute Θ*_b(X*) = W∞(X*, S_n) using KDE or DTM.
Consider all possible outcomes {Θ*_b(X*)}_{X* ⊂ S_n}.
This mimics {Θ(X) = W∞(S_n, M)}_{S_n ⊂ M}.
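The mechanics of the subsampling loop above can be sketched as follows. This is only an illustration: a real implementation would compute the bottleneck distance W∞ between persistence diagrams (e.g., via a TDA library); here a one-sided Hausdorff-type distance between the subsample and the full sample stands in for Θ*, and the data are points near a unit circle.

```python
import numpy as np

rng = np.random.default_rng(3)

# S_n: illustrative point cloud sampled near the unit circle
angles = rng.uniform(0, 2 * np.pi, size=300)
S = np.column_stack([np.cos(angles), np.sin(angles)])
S += rng.normal(scale=0.05, size=S.shape)

def theta_star(subsample, sample):
    # stand-in statistic: max over sample points of the distance
    # to the nearest subsample point (NOT the real W_inf)
    d = np.linalg.norm(sample[:, None, :] - subsample[None, :, :], axis=2)
    return d.min(axis=1).max()

# B bootstrap subsamples of size b, drawn with replacement from S_n
B, b = 500, 300
thetas = np.array([
    theta_star(S[rng.integers(0, len(S), size=b)], S)
    for _ in range(B)
])

# bootstrap threshold: the (1 - alpha) quantile of the Theta* values
alpha = 0.05
q_alpha = np.quantile(thetas, 1 - alpha)
print(q_alpha)
```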
Confidence Sets
Confidence Sets for Persistence Diagrams
C_α = {D ∈ D_T : W∞(D, D̂_n) ≤ q_α}
Confidence Sets
Example
[Figure: persistence diagrams (Birth vs. Death, dim 0 and dim 1) for a noisy grid sample, computed with KDE (h = 0.05) and DTM (m = 0.01)]
Confidence Sets
Challenges
Techniques
Prove limit theorems.
Determine suitable assumptions on input.
Use the geometry of input (e.g., properties of an underlying smooth manifold).
Questions
These results are in the limit. When is n big enough?
What confidence sets can we construct in the multi-d setting?
What is the optimal threshold for particular filtrations?
Power analysis: are the rejected points topologically insignificant? (Type II errors)
Distances
Distance Measures: Comparing Descriptors
Distances
[Figure: two objects shown with "=?" between them — are these the same?]
Distances
Distances Between Diagrams
Bottleneck d∞.
Interleaving distance.
Wasserstein d_p.
Erosion distance.
Question: can we define a centroid / Fréchet mean?
arg min_D Σ_i W∞²(D, D_i)
1. Turner, Mileyko, Mukherjee, and Harer. Fréchet Means for Distributions of Persistence Diagrams. DCG, 2014.
2. Munch, Turner, Bendich, Mukherjee, Mattingly, and Harer. Probabilistic Fréchet Means for Time-Varying Persistence Diagrams. Electronic Journal of Statistics, 2015.
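A 1-D analogue of the Fréchet-mean objective above (an illustration only, not the diagram-space problem): with squared Euclidean distance in place of W∞², the minimizer of Σ_i d(m, x_i)² over real numbers m is the ordinary average, which a brute-force grid search recovers.

```python
import numpy as np

# illustrative "observations" standing in for the diagrams D_i
x = np.array([0.5, 1.0, 2.5, 4.0])

# evaluate the Frechet-type objective sum_i (m - x_i)^2 on a fine grid
grid = np.linspace(x.min(), x.max(), 10_001)
objective = ((grid[:, None] - x[None, :]) ** 2).sum(axis=1)
m_star = grid[objective.argmin()]

print(m_star, x.mean())  # both ≈ 2.0: the minimizer is the sample mean
```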
Distances
Clustering
Distances
Clustering ... and Classification
Clustering (Unsupervised Learning)
Hierarchical: agglomerative or divisive.
k-means: NP-hard, so algorithms find a local minimum.
Distribution- and density-based clustering: e.g., DBSCAN.
Fuzzy clustering: membership is not binary.
Classification (Supervised Learning)
Input data (training sample): D = {(X_i, Y_i)}_{i=1}^n
k-NN classification: for a new X, we predict Y by majority vote of the k nearest neighbors among the covariates (features) in D.
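The k-NN rule described above fits in a few lines (a sketch with toy data, not a library API): find the k training covariates nearest to the new point, then take a majority vote of their labels.

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    # distances from the new point to every training covariate
    dist = np.linalg.norm(X_train - x_new, axis=1)
    # labels of the k nearest neighbors, then majority vote
    votes = y_train[np.argsort(dist)[:k]]
    values, counts = np.unique(votes, return_counts=True)
    return values[counts.argmax()]

# two toy clusters as the training sample D = {(X_i, Y_i)}
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                    [2.0, 2.0], [2.1, 1.9], [1.9, 2.2]])
y_train = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_train, y_train, np.array([0.1, 0.1])))  # 0
print(knn_predict(X_train, y_train, np.array([2.0, 2.1])))  # 1
```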
Distances
Homework!
Curate a list of topological descriptors. For each, we are looking for:
Name of descriptor.
List of distances that can be used between descriptors.
Short explanation (very short).
Reference to where first used, or a good use of it.
Pros: What is it good for?
Cons: Where / when is it insufficient?
https://github.com/compTAG/ima-multid