
Dependency Discovery via Multiscale Graph Correlation

Cencheng Shen

University of Delaware

Collaborators: Joshua T. Vogelstein, Carey E. Priebe, Shangsi Wang, Youjin Lee, Mauro Maggioni, Qing Wang, Alex Badea.

Acknowledgment: NSF DMS, DARPA SIMPLEX.

R package available on CRAN and at https://github.com/neurodata/MGC/

Matlab code available at https://github.com/neurodata/mgc-matlab

C. Shen MGC: 1/38

Overview

1. Motivation

2. Methodology

3. Theoretical Properties

4. Simulations and Experiments

5. Summary

C. Shen MGC: 2/38

Motivation

C. Shen MGC: 3/38

Motivation

Given paired data (Xn, Yn) = {(xi, yi) ∈ R^p × R^q, i = 1, . . . , n},

• Are they related?

• How are they related?

X                     Y
brain connectivity    creativity / personality
brain shape           health
gene / protein        cancer
social networks       attributes
anything              anything else

C. Shen MGC: 4/38


Formal Definition of Independence Testing

(xi, yi) i.i.d. ∼ FXY, i = 1, . . . , n

H0 : FXY = FX FY,

HA : FXY ≠ FX FY.

A test is universally consistent if its power converges to 1 as n → ∞ against any dependent FXY.

Without loss of generality, we shall assume FXY has finite second moments.

C. Shen MGC: 5/38


Benchmarks

Correlation Measures

• Linear: Pearson, Rank, CCA, RV

• Non-linear: MIC, Mantel

• Universally consistent: Dcorr, HSIC, HHG

C. Shen MGC: 6/38


Motivations

Modern data sets may be high-dimensional, nonlinear, noisy, of limited sample size, structured, or from disparate spaces. Thus we desire a test that

• is consistent against all dependencies;

• has good finite-sample testing performance;

• is easy to understand and efficient to implement;

• provides insights into the dependency structure.

To that end, we propose the multiscale graph correlation (MGC) in [Shen et al. (2018)].

C. Shen MGC: 7/38


Methodology

C. Shen MGC: 8/38

Flowchart of MGC

(Xn, Yn)

→ Compute distances and center: A, B ∈ R^{n×n}; Dcov = Σ_{i≠j} Aij Bji

→ Incorporate locality: {A^k, B^l ∈ R^{n×n}, for k, l ∈ [n]}; Dcov^{k,l} = Σ_{i≠j} A^k_ij B^l_ji − Σ_{i≠j} A^k_ij · Σ_{i≠j} B^l_ij

→ All local correlations {Dcorr^{k,l}} ∈ [−1, 1]^{n×n}

→ Smoothed maximum: c∗ ∈ [−1, 1], and optimal scale (k∗, l∗)

→ P-value by permutation test

C. Shen MGC: 9/38


Computing Distance and Centering

Input: Xn = [x1, . . . , xn] as the data matrix with each column representing one sample observation, and similarly Yn.

Distance Computation: Let A be the n × n Euclidean distance matrix of Xn:

Aij = ‖xi − xj‖_2,

and similarly B from Yn.

Centering: Then we center A and B by columns, with the diagonals excluded:

Aij = Aij − (1/(n−1)) Σ_{s=1}^n Asj,  if i ≠ j;    Aij = 0,  if i = j;    (1)

similarly for B.
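To make these two steps concrete, here is a minimal NumPy sketch (an illustration, not the packaged R/Matlab implementation); it stores observations as rows rather than columns and implements equation (1):

    import numpy as np

    def centered_distance(X):
        # X: (n, p) array with one observation per row.
        n = X.shape[0]
        # Euclidean distance matrix A_ij = ||x_i - x_j||.
        A = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
        # Column means excluding the diagonal; since A_jj = 0, the full column
        # sum divided by (n - 1) already matches (1/(n-1)) * sum_{s != j} A_sj.
        col_means = A.sum(axis=0) / (n - 1)
        A = A - col_means[None, :]
        np.fill_diagonal(A, 0.0)  # diagonal set to 0, as in equation (1)
        return A

The same routine applied to Yn yields B.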

C. Shen MGC: 10/38


Incorporating the Locality Principle

Ranking: Define {R^A_ij} as the "rank" of xi relative to xj; that is, R^A_ij = k if xi is the kth closest point (or "neighbor") to xj, as determined by ranking the set {A1j, A2j, . . . , Anj} in ascending order. Similarly define R^B_ij for the y's.

For any (k, l) ∈ [n]², define the rank-truncated matrices A^k and B^l as

A^k_ij = Aij · I(R^A_ij ≤ k),

B^l_ij = Bij · I(R^B_ij ≤ l).

When ties occur, the minimal rank is recommended; e.g., if Y takes only two values, R^B_ij takes values in {1, 2} only. We assume no ties for ease of presentation.

C. Shen MGC: 11/38
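A sketch of the ranking and truncation step in the same NumPy style (function names are illustrative; ties are simply broken by argsort here rather than by the minimal-rank convention above):

    import numpy as np

    def column_ranks(A):
        # R[i, j] = k if x_i is the k-th closest point to x_j,
        # i.e., the ascending rank of A[i, j] within column j.
        n = A.shape[0]
        order = np.argsort(A, axis=0, kind="stable")
        R = np.empty((n, n), dtype=int)
        R[order, np.arange(n)] = np.arange(1, n + 1)[:, None]
        return R

    def rank_truncate(A, R, k):
        # A^k_ij = A_ij * I(R^A_ij <= k): keep each column's k nearest neighbors.
        return A * (R <= k)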


Local Distance Correlations

A Family of Local Correlations: Let ◦ denote the entry-wise product, ′ denote the matrix transpose, and E(·) = (1/(n(n−1))) Σ_{i≠j} (·) denote the diagonal-excluded sample mean of a square matrix. The sample local covariance, variances, and correlation are defined as

dCov^{k,l}(Xn, Yn) = E(A^k ◦ B^l′) − E(A^k) E(B^l),

dVar^k(Xn) = E(A^k ◦ A^k′) − E²(A^k),

dVar^l(Yn) = E(B^l ◦ B^l′) − E²(B^l),

dCorr^{k,l}(Xn, Yn) = dCov^{k,l}(Xn, Yn) / √(dVar^k(Xn) · dVar^l(Yn)),

for k, l = 1, . . . , n. If dVar^k(Xn) · dVar^l(Yn) ≤ 0, we set dCorr^{k,l}(Xn, Yn) = 0 instead.

There are at most n² different local correlations. At k = l = n, dCorr^{k,l}(Xn, Yn) equals the "global" distance correlation dCorr(Xn, Yn) of Szekely et al. (2007).

C. Shen MGC: 12/38
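Putting the pieces together, a direct (unoptimized) NumPy sketch of a single local correlation dCorr^{k,l}; the package computes the whole n × n family iteratively, but the one-scale version shows the definition plainly:

    import numpy as np

    def local_dcorr(A, B, RA, RB, k, l):
        # A, B: column-centered distance matrices; RA, RB: within-column ranks.
        n = A.shape[0]
        off = ~np.eye(n, dtype=bool)          # exclude diagonal entries
        Ak = A * (RA <= k)
        Bl = B * (RB <= l)
        m = lambda M: M[off].mean()           # E(.): mean over the n(n-1) off-diagonal entries
        dcov = m(Ak * Bl.T) - m(Ak) * m(Bl)
        dvarx = m(Ak * Ak.T) - m(Ak) ** 2
        dvary = m(Bl * Bl.T) - m(Bl) ** 2
        if dvarx * dvary <= 0:
            return 0.0
        return dcov / np.sqrt(dvarx * dvary)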


Smoothed Maximum c∗(Xn,Yn)

One would like to use the optimal local correlation for testing.

But directly taking the maximum local correlation,

max_{(k,l) ∈ [n]²} {Dcorr^{k,l}(Xn, Yn)},

will yield a biased statistic under independence, i.e., the maximum is larger than 0 in expectation even when X and Y are independent!

Instead, we take a smoothed maximum: find a connected region in the local correlation map with significant local correlations; if such a region exists, use the maximum within the region.

C. Shen MGC: 13/38


Smoothed Maximum

Pick a threshold τ ≥ 0 (we choose it from an approximate null distribution of Dcorr, which is symmetric beta and converges to 0 as n → ∞), compute the set

{(k, l) such that Dcorr^{k,l}(Xn, Yn) > max{τ, Dcorr(Xn, Yn)}},

and calculate the largest connected component R of the set.

If there are sufficiently many elements in R (more than 2n), take the maximum correlation within R as the MGC statistic c∗(Xn, Yn), and set the corresponding neighborhood pair as the optimal scale (k∗, l∗).

C. Shen MGC: 14/38
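A simplified sketch of this step, assuming the full n × n map of local correlations is already computed; the connected-component search uses scipy.ndimage.label, and the threshold τ is taken as an input rather than derived from the beta approximation of the null:

    import numpy as np
    from scipy.ndimage import label

    def smoothed_maximum(local_corr, tau):
        # local_corr[k-1, l-1] = Dcorr^{k,l}; the last entry is the global Dcorr.
        n = local_corr.shape[0]
        global_corr = local_corr[-1, -1]
        mask = local_corr > max(tau, global_corr)
        components, num = label(mask)                  # connected regions of large local correlations
        if num == 0:
            return global_corr, (n, n)                 # fall back to the global scale
        sizes = np.bincount(components.ravel())[1:]    # size of each labelled region
        region = components == (np.argmax(sizes) + 1)  # largest connected component R
        if region.sum() <= 2 * n:                      # R too small: keep the global Dcorr
            return global_corr, (n, n)
        idx = np.argmax(np.where(region, local_corr, -np.inf))
        k, l = np.unravel_index(idx, local_corr.shape)
        return local_corr[k, l], (k + 1, l + 1)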


Permutation Test

To get a p-value from MGC for any given data, we use the permutation test: randomly permute the indices of the second data set r times, compute the permuted MGC statistic c∗(Xn, Yn^π) for each permutation π, and estimate

Prob(c∗(Xn, Yn^π) ≥ c∗(Xn, Yn))

as the p-value.

This is the standard nonparametric testing procedure employed by Mantel, Dcorr, HHG, and HSIC, where the null distribution of the dependency measure cannot be derived exactly.

C. Shen MGC: 15/38
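A sketch of the permutation p-value, written against any statistic function (an MGC statistic built from the steps above would be passed in as statistic):

    import numpy as np

    def permutation_pvalue(X, Y, statistic, r=1000, seed=None):
        # statistic: a function (X, Y) -> scalar, e.g. the MGC statistic c*.
        rng = np.random.default_rng(seed)
        observed = statistic(X, Y)
        null_stats = np.empty(r)
        for i in range(r):
            perm = rng.permutation(Y.shape[0])        # permute the second sample's indices
            null_stats[i] = statistic(X, Y[perm])
        # Proportion of permuted statistics at least as large as the observed one.
        return (1 + np.sum(null_stats >= observed)) / (r + 1)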


Computation Complexity

• Distance computation takes O(n² max(p, q))

• Centering takes O(n²)

• Ranking takes O(n² log n)

• All local correlations can be computed iteratively in O(n²)

• The smoothed maximum takes O(n²)

Overall, MGC can be computed in O(n² max(p, q, log n)), which is comparable to Dcorr, HHG, and HSIC.

The permutation test takes O(n² max(r, p, q, log n)) for r random permutations.

There are a number of ways to speed up the method for big data: a faster implementation when p = q = 1, null distribution approximation by subsampling, and spectral embedding.

C. Shen MGC: 16/38


Examples

Dcorr(Xn, Yn) = 0.15, MGC(Xn, Yn) = 0.15; p-values: < 0.001 (both).

Dcorr(Xn, Yn) = 0.01, MGC(Xn, Yn) = 0.13; p-values: 0.3 vs < 0.001.

C. Shen MGC: 17/38


Theoretical Properties

C. Shen MGC: 18/38

Basic Properties of Sample MGC

Theorem 1 (Well-behaved Correlation Measure)

1. Boundedness: c∗(Xn,Yn) ∈ [−1, 1].

2. Symmetric: c∗(Xn,Yn) = c∗(Yn,Xn).

3. Invariant: c∗(Xn, Yn) is invariant to any distance-preserving transformations φ, δ applied to Xn and Yn respectively (i.e., rotation, scaling, translation, reflection).

4. 1-Linear: c∗(Xn, Yn) = 1 if and only if FX is non-degenerate and (X, uY) are dependent via an isometry for some non-zero constant u.

C. Shen MGC: 19/38


Consistency of Sample MGC

Theorem 2 (Consistency)

1. 0-Indep: c∗(Xn, Yn) → 0 as n → ∞ if and only if X and Y are independent.

2. Valid Test: Under the permutation test, Sample MGC is a valid test, i.e., it controls the type 1 error at level α.

3. Consistency: At any type 1 error level α, the testing power β(c∗(Xn, Yn)) → 1 as n → ∞ against any dependent FXY.

C. Shen MGC: 20/38


Defining Population MGC

Suppose (X, Y), (X′, Y′), (X′′, Y′′), (X′′′, Y′′′) are i.i.d. as FXY. Let I(·) be the indicator function, and define two random variables

I^{ρk}_{X,X′} = I( ∫_{B(X, ‖X′−X‖)} dFX(u) ≤ ρk ),

I^{ρl}_{Y′,Y} = I( ∫_{B(Y′, ‖Y′−Y‖)} dFY(u) ≤ ρl ),

for ρk, ρl ∈ [0, 1]. Further define

d^{ρk}_X = (‖X − X′‖ − ‖X − X′′‖) I^{ρk}_{X,X′},

d^{ρl}_{Y′} = (‖Y′ − Y‖ − ‖Y′ − Y′′′‖) I^{ρl}_{Y′,Y}.

The population local covariance can then be defined as

Dcov^{ρk,ρl}(X, Y) = E(d^{ρk}_X · d^{ρl}_{Y′}) − E(d^{ρk}_X) E(d^{ρl}_{Y′}).

Normalizing and taking a smoothed maximum yield the population MGC.

C. Shen MGC: 21/38


Sample to Population

Alternatively, the population version can be equivalently defined via the characteristic functions of FXY:

Dcov^{ρk=1,ρl=1}(X, Y) = ∫_{t,s} |g_XY(t, s) − g_X(t) g_Y(s)|² dw(t, s),

with respect to a non-negative weight function w(t, s) on (t, s) ∈ R^p × R^q. The weight function is defined as

w(t, s) = (d_p d_q |t|^{1+p} |s|^{1+q})^{−1},

where d_p = π^{(1+p)/2} / Γ((1+p)/2) is a non-negative constant tied to the dimensionality p, and Γ(·) is the complete Gamma function.

This can be similarly adapted to the local correlations.

C. Shen MGC: 22/38


Theoretical Advantages of MGC

Theorem 3 (Convergence, Mean and Variance)

1. 0-Indep: c∗(X, Y) = 0 if and only if X and Y are independent.

2. Convergence: c∗(Xn, Yn) → c∗(X, Y) as n → ∞.

3. Almost Unbiased: E(c∗(Xn, Yn)) = c∗(X, Y) + O(1/n).

4. Diminishing Variance: Var(c∗(Xn, Yn)) = O(1/n).

The last three properties also hold for any local correlation with (ρk, ρl) = ((k−1)/(n−1), (l−1)/(n−1)).

C. Shen MGC: 23/38


Theoretical Advantages of MGC

Theorem 4 (Advantages of Population MGC vs Dcorr)

1. For any dependent FXY, c∗(X, Y) ≥ Dcorr(X, Y).

2. There exists a dependent FXY such that c∗(X, Y) > Dcorr(X, Y).

As MGC and Dcorr share a similar variance and the same mean under the null, the mean advantage under the alternative translates into testing power.

Theorem 5 (Optimal Scale of MGC Implies Geometric Structure)

If the relationship is linear (or has independent noise), the global scale is always optimal and c∗(X, Y) = Dcorr(X, Y).

Conversely, the optimal scale being local, i.e., c∗(X, Y) > Dcorr(X, Y), implies a non-linear relationship.

C. Shen MGC: 24/38


MGC is applicable to similarity / kernel matrices

Theorem 6 (Transforming kernel to distance)

Given any characteristic kernel function k(·, ·), define an induced semi-metric as

d(·, ·) = 1 − k(·, ·)/max{k(·, ·)}.

Then d(·, ·) is of strong negative type, and the resulting MGC is universally consistent.

Namely, given a sample kernel matrix K ∈ R^{n×n}, one can compute the induced distance matrix by

D = J − K / max_{i,j} {K(i, j)},

where J is the all-ones matrix, and apply MGC to the induced distance matrices.

C. Shen MGC: 25/38
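A one-line NumPy sketch of the transformation in Theorem 6 (since J is the all-ones matrix, subtracting from 1 entry-wise is equivalent):

    import numpy as np

    def kernel_to_distance(K):
        # D = J - K / max_{i,j} K(i, j): induced semi-metric from a kernel matrix.
        return 1.0 - np.asarray(K, dtype=float) / np.max(K)

MGC can then be applied directly to the induced distance matrices of the two samples.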

Simulations and Experiments

C. Shen MGC: 26/38

Visualizations of 20 Simulation Settings

[Figure: one scatter-plot panel per simulated dependency, comparing MGC, Distance Correlation, and Pearson's Correlation for 20 dependencies. Panel titles as shown: Linear: 1, Exponential: 0.99, Cubic: 0.84, Joint Normal: 0.2, Step Function: 0.76, Quadratic: 0.67, W Shape: 0.42, Spiral: 0.3, Bernoulli: 0.93, Logarithmic: 0.67, Fourth Root: 0.65, Sine Period 4: 0.3, Sine Period 16: 0.14, Square: 0.08, Two Parabolas: 0.46, Circle: 0.52, Ellipse: 0.56, Diamond: 0.08, Multiplicative: 0.11, Independence: 0.]

C. Shen MGC: 27/38


Evaluation Criteria

• Power is the probability of rejecting the null when the alternative is true.

• Required sample size N_{α,β}(c): the sample size needed to achieve power β at type 1 error level α using a statistic c.

C. Shen MGC: 28/38


Testing Power: Linear vs Nonlinear

[Figure: testing power versus noise level (two panels: Linear Relationship with noise 0 to 2, Quadratic Relationship with noise 0 to 1), comparing MGC, Distance Correlation, and Pearson's Correlation.]

Setting: n = 30, p = q = 1, X ∼ Uniform(−1, 1), ε ∼ Normal(0, noise), with Y = X + ε and Y = X² + ε respectively.

C. Shen MGC: 29/38
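A sketch reproducing this setup, assuming SciPy's scipy.stats.multiscale_graphcorr is available (present in recent SciPy releases); power is estimated as the rejection rate over repeated draws, and noise is used as the standard deviation of ε:

    import numpy as np
    from scipy.stats import multiscale_graphcorr

    def estimated_power(noise, relation="quadratic", n=30, trials=100, alpha=0.05, seed=0):
        rng = np.random.default_rng(seed)
        rejections = 0
        for _ in range(trials):
            x = rng.uniform(-1, 1, size=n)
            eps = rng.normal(0, noise, size=n)
            y = x + eps if relation == "linear" else x ** 2 + eps
            res = multiscale_graphcorr(x, y, reps=200)   # permutation test with 200 replicates
            rejections += res.pvalue < alpha
        return rejections / trials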


Required Sample Size

When noise = 1 and p = q = 1, the required sample size N_{α=0.05, β=0.85}(c) is:

40 for all three methods in the linear relationship; in the quadratic relationship, 80 for MGC, 180 for Dcorr, and > 1000 for Pearson.

Next we compute the required sample size for each simulation, and summarize by the median over the close-to-linear (types 1-5) and strongly non-linear (types 6-19) relationships.

We consider univariate (1D) and multivariate (10D) cases.

C. Shen MGC: 30/38


Median Size Table

Testing Methods         1D Lin   1D Non-Lin   10D Lin   10D Non-Lin
MGC                     50       90           60        165
Dcorr                   50       250          60        515
Pearson / RV / CCA      50       >1000        50        >1000
HHG                     70       90           100       315
HSIC                    70       95           100       400
MIC                     120      180          n/a       n/a

C. Shen MGC: 31/38

Signal Subgraph via MGC

We consider predicting site and sex from functional magnetic resonance imaging (fMRI) graphs. The two datasets used are SWU4 and HNU1, which have 467 and 300 samples respectively.

Each sample is an fMRI scan registered to the MNI152 template using the Desikan atlas, which has 70 regions.

We used an iterative screening method (similar to backward selection) via MGC from [Wang et al. (2018)] to extract the signal subgraph (in this case, the brain regions) most dependent on site and sex, and also ran leave-one-out cross-validation with a K-nearest-neighbor classifier to verify the results.

C. Shen MGC: 32/38


C. Shen MGC: 33/38

Figure: A total of 22 regions are recognized for the site difference, which maximizes the MGC statistic and nearly minimizes the leave-one-out cross-validation error. This is not the case for sex, for which neither the MGC statistic nor the error is particularly significant for any subgraph size.

C. Shen MGC: 34/38

Summary

C. Shen MGC: 35/38

Summary

MGC builds on distance correlation, the locality principle, and a smoothed maximum:

• Proper distance transformation ensures universal consistency.

• Compute all local correlations iteratively.

• Identify the optimal local correlation without inflating the sample bias.

Together, these make MGC advantageous in theory and in practice.

C. Shen MGC: 36/38


Advantages of MGC

1. Performs well under any joint distribution with finite second moments:

• Equals 0 asymptotically if and only if X and Y are independent.

• Amplifies the dependency signal while mostly avoiding the sample bias.

• Superior finite-sample performance over all benchmarks against linear / nonlinear / noisy / high-dimensional relationships.

2. It works for:

• Low- and high-dimensional data.

• Euclidean and structured data (e.g., images, networks, shapes).

• Any dissimilarity / similarity / kernel matrix.

3. Intuitive to understand and efficient to implement, in O(n² log n).

C. Shen MGC: 37/38


References

1. C. Shen, C. E. Priebe, and J. T. Vogelstein, "From distance correlation to the multiscale graph correlation," Journal of the American Statistical Association, 2019.

2. J. T. Vogelstein, E. Bridgeford, Q. Wang, C. E. Priebe, M. Maggioni, and C. Shen, "Discovering and Deciphering Relationships Across Disparate Data Modalities," eLife, 2019.

3. Y. Lee, C. Shen, and J. T. Vogelstein, "Network dependence testing via diffusion maps and distance-based correlations," Biometrika, 2019.

4. S. Wang, C. Shen, A. Badea, C. E. Priebe, and J. T. Vogelstein, "Signal subgraph estimation via iterative vertex screening," under review.

5. C. Shen and J. T. Vogelstein, "The Exact Equivalence of Distance and Kernel Methods for Hypothesis Testing," under review.

C. Shen MGC: 38/38