Post on 08-Jan-2022
arX
iv:1
312.
7214
v1 [
stat
.ME
] 2
7 D
ec 2
013
Copula Correlation: An Equitable And
Consistently Estimable Measure For
Association.
A. Adam Ding, and Li Yi ∗
Abstract
Reshef et al. (Science, 2011) proposed an important new concept of equitabil-
ity for good measures of association between two random variables. To this end,
they proposed a novel measure, the maximal information coefficient (MIC). However,
Kinney and Atwal (2013) showed that MIC in fact is not equitable. Others pointed
out that MIC often has low power to detect association. We clarify the paradox on the
equitability of MIC and mutual information (MI) by separating the equitability and
estimability of the parameters. We prove that MI is not consistently estimable over a
class of continuous distributions. Several existing association measures in literature are
studied from these two perspectives. We propose a new correlation measure, the cop-
ula correlation, which is half the L1-distance of copula density from the independence.
Among the association measures studied, this new measure is the only one to be both
equitable and consistently estimable. Simulations and a real data analysis confirm its
good theoretical properties.
Key Words: Equitability; maximal information coefficient; Copula; rate of convergence;
mutual information; distance correlation.
∗A. Adam Ding is Associate Professor of Mathematics, Northeastern University, Boston, MA; email:a.ding@neu.edu. Li Yi is PhD candidate of Mathematics, Northeastern University, Boston, MA; email:li.yi3@husky.neu.edu.
1 INTRODUCTION
With the advance of modern technology, the size of available data keeps exploding. Data
mining is increasingly used to keep up with the trend, and to explore complex relationships
among a vast number of variables. The nonlinear relationships are as important as the
linear relationship in data exploration. Hence the traditional measure such as Pearson’s
linear correlation coefficient is no longer adequate for today’s big data analysis. Reshef et al.
(2011) proposed the concept of equitability, that is, a measure of dependence (association)
should give equal importance to linear and nonlinear relationships. For this purpose, they
proposed a novel maximal information coefficient (MIC) measure.
The MIC measure has stimulated great interest and further studies in the statistical com-
munity. Speed (2011) praised it as “a correlation for the 21st century”. It has been quickly
adopted by many researchers in data analysis. However, its mathematical and statistical
properties are still not studied very well. There are also criticisms on the measure based on
those properties.
Simon and Tibshirani (2011) showed that MIC often has low power for detecting depen-
dence in comparison to the distance correlation (dcor) proposed by Szekely et al. (2007).
The dcor asymptotically has the same power as Pearson’s correlation in the linear case,
and can also detect all nonlinear relationships asymptotically. Thus, Simon and Tibshirani
(2011) recommended dcor over MIC for detecting dependence among variables. However,
dcor does not have the equitable property.
Kinney and Atwal (2013) gives a strict mathematical definition of R2-equitability de-
scribed in Reshef et al. (2011)’s MIC paper. They discovered that no non-trivial statistic
can be R2-equitable, and hence proposed a replacement definition of self-equitability. Inter-
estingly, the MIC is also not self-equitable, but mutual information (MI) is self-equitable.
This contradicts with the observation that MI is not equitable but MIC is equitable in sim-
ulation studies of Reshef et al. (2011) and Reshef et al. (2013). This paradox can be explain
by the fact that the simulated finite sample studies are also affected by the statistical errors
in estimators for MI and MIC. Hence we need to separate the studies of equitability of a
measure from its estimability.
1
We relate the study of equitability to another popular line of research on the copula – a
joint probability distribution with uniform marginals. Sklar’s Theorem decomposes any joint
probability distribution into two components: the marginal distributions and the copula.
The copula captures all the association information among the variables. Hence an equitable
dependence measure should be copula-based. We propose a definition of weak-equitability
that is equivalent to the requirement of copula-based measures. Therefore, it is natural
to reexpress the association measures in terms of copula. While little attention has been
paid to directly create an association measure from the copula, many probability distances
can be applied on the copula to create such measures. And some of those copula-based
probability distances have been used for testing of independence (Genest and Remillard,
2004; Genest et al., 2007; Kojadinovic and Holmes, 2009).
Recasting the problem in terms of copula provides additional insights. As Omelka et al.
(2009) and Segers (2012) pointed out, some standard mathematical conditions on the proba-
bility distributions do not hold for many common copulas. We need to study the convergence
of copula estimators under non-standard settings. Under a non-standard setting, we provide
a theoretical proof that the mutual information (MI)’s minimax risk is infinite. While MI is
known to be very hard to estimate under continuous distributions, no mathematical qualifi-
cation for this fact exists. Our result is the first theoretical result demonstrating that MI is
not consistently estimable for continuous random variables. The difficulty of accurately esti-
mating MI explains how the theoretically equitable MI measure appears to be non-equitable
for finite sample data analysis in Reshef et al. (2011).
Based on the copula, we propose a new association measure, the copula correlation (Ccor),
which is defined as half the L1-distance of the copula density function from independence.
We theoretically study the MI, MIC, dcor, Ccor and several other dependence measures in
literature from the perspective of both equitability and estimability. The Ccor is the only
measure with the desirable property of being both self-equitable and consistently estimable.
This makes it a strong candidate as a dependence measure in data exploration.
In Section 2, we introduce the definition of several dependence measures and study their
equitability. We study the convergence of estimators for MI and Ccor in Section 3. Section 4
2
compare these dependence measures further through numerical examples. We then end the
paper with a summary discussion.
2 ASSOCIATION MEASURES AND THEIR EQUI-
TABILITY
In this section, we review several classes of association measures D(X ; Y ) between two ran-
dom variables X and Y in the literature and introduces our proposed new measure. We
review and strictly define the concept of equitability, that is, the association measure treats
all types of relationships equally. We conduct a systematic analysis for the equitability of
these association measures in this section. For simplicity, we will be focusing on contin-
uous univariate random variables X and Y . Section 5 briefly discuss extension to higher
dimensional and other types of random variables.
2.1 Several Classes Of Association Measures From Independence
Characterization
The most commonly used association measure is Pearson’s linear correlation coefficient
ρ(X ; Y ) = Cov(X, Y )/√V ar(X)V ar(Y ) where COV (X, Y ) denotes the covariance between
X and Y , and V ar(X) denotes the variance of X . The linear correlation coefficient ρ is good
at characterizing linear relationships between X and Y : |ρ| = 1 for perfectly deterministic
linear relationship and ρ = 0 when X and Y are independent. However, for nonlinear
relationships, dependent X and Y can also have zero linear correlation coefficient ρ.
Several classes of association measures have the desirable property that D(X ; Y ) = 0
if and only if X and Y are statistically independent. They use different but equivalent
mathematical characterizations of the statistical independence between X and Y of a similar
form:
fX,Y (x, y) = fX(x)fY (y) for all x, y. (1)
Here the fX,Y can be either joint cumulative distribution function (CDF) FX,Y (x, y) =
Pr(X ≤ x, Y ≤ y), or joint characteristic function φX,Y (s, t) = E[ei(Xs+Y t)] with E[·] de-noting the expectation, or joint probability density function pX,Y . Then fX and fY are the
3
corresponding marginal functions: CDFs FX(x) = Pr(X ≤ x) and FY , or characteristic
functions φX(s) = E[eiXs] and φY , or probability density functions pX and pY .
Due to the characterization (1), it is natural to use a functional distance between the
joint function fX,Y and the product of marginal functions fXfY as an association measure
D(X ; Y ). Such types of D(X ; Y ) would equal to zero if and only if fX,Y = fXfY always,
i.e., X and Y are independent. The first class of association measures use CDFs in the
characterization (1). In that case, using L∞ and L2 norms for functional distances lead to,
respectively, Kolmogorov-Smirnov criterion
KS(X ; Y ) = maxx,y
|FX,Y (x, y)− FX(x)FY (y)|, (2)
and Cramer-von Mises criterion
CVM(X ; Y ) =∫∫x,y
|FX,Y (x, y)− FX(x)FY (y)|2FX(dx)FY (dy)
=∫∫x,y
|FX,Y (x, y)− FX(x)FY (y)|2pX(x)pY (y)dxdy. (3)
For the second class of association measures, using the characteristic functions in the char-
acterization (1) can lead to the distance covariance of Szekely and Rizzo (2009).
dCov2(X ; Y ) =
∫∫
s,t
|φX,Y (s, t)− φX(s)φY (t)|2|s|2|t|2 dtds. (4)
The third class of association measures uses the probability density functions in the charac-
terization (1). This class includes the popular mutual information (MI) criterion
MI(X ; Y ) = EX,Y [log pX,Y (x, y)− log(pX(x)pY (y))]=
∫∫x,y
[log pX,Y (x, y)− log(pX(x)pY (y))]pX,Y (x, y)dxdy. (5)
In the later subsection 2.4, we will see that these association measures from the first two
classes are not equitable. Hence we propose to consider more association measures in the
third class. Specifically, we consider the class of Copula-Distance
CDα =
∫∫
x,y
|pX,Y (x, y)− pX(x)pY (y)|α[pX(x)pY (y)]1−αdxdy, (6)
for α > 0. The name comes from the fact (shown in subsection 2.3) that CDα is the Lα
distance of the copula density from the uniform density (i.e., independence) on the unit
square.
4
The association measures above have different ranges, making comparison among them
difficult. For example, the KS criterion’s value falls between 0 and 1, while the MI takes value
between 0 and ∞. For easy interpretation, it is customary to transform these measures into
correlation measures with values between 0 and 1. Szekely et al. (2007) defined the distance
correlation from the distance covariance in (4) as
dcor(X ; Y ) =dCov(X ; Y )√
dCov(X ;X)dCov(Y ; Y ). (7)
Similarly, we can define mutual information correlation from MI in (5) as (Joe, 1989)
MIcor =√
1− e−2MI . (8)
For the Copula-Distance, we notice that 0 ≤ CD1 ≤ 2. Hence we propose a new corre-
lation measure, copula correlation (Ccor), as
Ccor =1
2CD1 =
1
2
∫∫
x,y
|pX,Y (x, y)− pX(x)pY (y)|dxdy. (9)
2.2 Parameters, Estimators and MIC
The association measures in Section 2.1 are all parameters. Sometimes the same names also
refer to the corresponding sample statistics. Let (X1, Y1), ..., (Xn, Yn) be a random sample
of size n from the joint distribution of (X, Y ). Then the sample statistic ρn =∑n
i=1(Xi −X)(Yi − Y )/
√∑ni=1(Xi − X)2
∑ni=1(Yi − Y )2 is also called Pearson’s correlation coefficient.
In fact, ρn is an estimator for ρ, and converges at the parametric rate of n−1/2 assuming
the random variables have finite first two moments. The first two class of measures have
natural empirical estimators, replacing CDFs and characteristic functions by their empirical
version. Particularly, Szekely et al. (2007) showed that the resulting dcorn statistic is the
sample correlation of centered distances between pairs of (Xi, Yi) and (Xj, Yj). The last
class of association measures use the probability density functions instead, and are harder
to estimate. For continuous X and Y , simply plugging in empirical density functions do not
result in good estimators for the association measures. However, we will see in section 2.4
that the first two class of measures do not have the equitability property. Hence we will need
to study the harder-to-estimate measures MIcor and Ccor.
5
The MIC introduced in Reshef et al. (2011) is in fact a definition of a sample statistic,
not a parameter. On the data set (X1, Y1), ..., (Xn, Yn), they first consider putting these n
data points into a grid G of bX × bY bins. Then the mutual information MIG for the grid
is computed from the empirical frequencies of the data on the grid. The MIC statistic is
defined as the maximum value of MIG/ log[min(bX , bY )] over all possible grids G with the
the total number of bins bXbY bounded by B = n0.6. That is,
MICn = maxbXbY <B
MIGlog[min(bX , bY )]
(10)
The MICn is always bounded between 0 and 1 since 0 ≤ MIG ≤ log[min(bX , bY )].
The corresponding parameter MIC for the joint distribution of X and Y can be defined
as the limit of the sample statistic for big sample size MIC = limn→∞MICn. We notice
that this definition depends on the tuning parameter B and the implicit assumption that
the limit exists. Hence the MIC parameter may changes with different selection of B(n).
This is in contrast to the usual statistical literature where the parameter definition is fixed
but its estimator may contain some tuning parameter B(n). Because the MIC parameter is
only defined as a limit, the theoretical study on its mathematical properties is very hard.
As we introduce the strict mathematical definition for the equitability in next two sub-
sections 2.3 and 2.4, we can see that equitability should be a property for the parameter
not for the statistic. For any data set, there are always statistical errors between the sample
statistic and the true parameter. A good association measure should be equitable and can
also be well estimated (i.e., the statistical errors can be controlled).
2.3 Copula and Weak-equitability
The equitability refers to the concept that an association measure D[X ; Y ] should remain
invariant under certain transformations of the random variables X and Y . For example, if
we change the unit of X (or Y ), the values of X (or Y ) changes by a constant multiple, but
should not affect the association measure D[X ; Y ] at all. Similarly, if we apply a monotone
transformation on X (e.g. the commonly used logarithmic or exponential transformation),
then the association with Y should not be affected and the measure D[X ; Y ] should remain
the same. This leads to the following definition of weak-equitability.
6
Definition 1 A dependence measure D[X ; Y ] is weakly-equitable if and only if D[X ; Y ] =
D[f(X); Y ] whenever f is a strictly monotone continuous deterministic function.
The weak-equitability property is related to the popular copula concept. The Sklar’s
theorem ensures that for any joint distribution FX,Y , there exists a copula C – a probability
distribution on the unit square – such that
FX,Y (x, y) = C[FX(x), FY (y)] for all x, y. (11)
The copula captures all the association between X and Y . For continuous joint distribution
FX,Y , the copula C is also continuous with copula density function c(u, v) = ∂2
∂u∂vC(u, v) for
(u, v) ∈ [0, 1]× [0, 1].
We call a dependence measure D[X ; Y ] symmetric if D[X ; Y ] = D[Y ;X ] for all random
variables X and Y . Then a symmetric dependence measure D[X ; Y ] is weakly-equitable
if and only if D[X ; Y ] depends on the copula C(u, v) only and is not affected by the
marginals FX(x) and FY (y). Clearly the corresponding sample statistic for a symmetric
weakly-equitable D[X ; Y ] should be a rank-based statistic.
For the association measures mentioned above, Kolmogorov-Smirnov criterionKS, Cramer-
von Mises criterion CVM , mutual information MI, Copula-Distance CDα and maximal
information coefficient MIC are all weakly-equitable. The linear coefficient ρ and distance
covariance dCov are not weakly-equitable. However, it is easy to get a weakly-equitable
version of these statistics by calculating them on the ranks of X and Y . For example,
the weakly-equitable version of the linear coefficient ρn is the Spearman’s rank correlation
coefficient
ρs,n =
∑ni=1(RX,i − RX)(RY,i − RY )√∑n
i=1(RX,i − RX)2∑n
i=1(RY,i − RY )2(12)
where RX,i and RY,i denote the ranks for Xi and Yi. The corresponding parameter is
ρs = limn→∞
ρs,n =Cov(U, V )√
V ar(U)V ar(V )(13)
with U = F−1X (X) and V = F−1
Y (Y ) each following univariate uniform distribution, and the
copula C(u, v) is the joint distribution of U and V .
7
We can rewrite the Copula-Distance of (6) in terms of copula density function as
CDα =
∫∫
x,y
|pX,Y (x, y)−pX(x)pY (y)|α[pX(x)pY (y)]1−αdxdy =
1∫
0
1∫
0
|c(u, v)−1|αdudv. (14)
Hence the copula correlation (9) is in fact half the L1 distance of the copula density function
from independence
Ccor =1
2CD1 =
1
2
1∫
0
1∫
0
|c(u, v)− 1|dudv. (15)
The weakly-equitable statistics KS, CVM and MI in equations (2), (3) and (5) similarly
have copula-based equivalent definitions as following,
KS = max0≤u≤1,0≤v≤1
|C(u, v)− uv|,
CV M =1∫0
1∫0
|C(u, v)− uv|2dudv,
MI =1∫0
1∫0
log[c(u, v)]c(u, v)dudv.
(16)
We will focus on these copula-based equivalent versions in the rest of paper, as they allow
better understanding of the issues at hand.
2.4 Self-equitable Measures
The weakly-equitable measure D(X ; Y ) remains invariant under strictly monotone continu-
ous transformation f(·): D[X ; Y ] = D[f(X); Y ]. This equitability concept can be extended
to the invariance under all transformations in the regression setting Y = f(X) + ε where ε
denotes the random noise that is independent of X conditional on f(X). That is, D(X ; Y )
remains invariant when f(X) contains all information of X about Y . Mathematically, this
leads to the following self-equitability definition.
Definition 2 (Kinney and Atwal, 2013) A dependence measure D[X ; Y ] is self-equitable if
and only if D[X ; Y ] = D[f(X); Y ] whenever f is a deterministic function and X ↔ f(X) ↔Y forms a Markov chain.
Kinney and Atwal (2013) showed that the self-equitability is characterized by a com-
monly used inequality in information theory.
8
Definition 3 A dependence measure D[X ; Y ] satisfies the Data Processing Inequality
(DPI) if and only if D[X ; Y ] ≥ D[X ;Z] whenever the random variables X, Y, Z form a
Markov chain X ↔ Y ↔ Z.
All dependence measures satisfying DPI are self-equitable. Since mutual information
MI satisfies DPI, it is self-equitable. Similarly we derive the self-equitability of the Copula-
Distance CDα.
Theorem 1 The Copula-Distance CDα is self-equitable when α ≥ 1.
The proof of Theorem 1 is provided in Appendix A1.
As a direct result of Theorem 1, the copula correlation Ccor = 12CD1 is also self-equitable.
Examples MIcor Ccor MIC dcor KS CVM
A 0.94 0.625 1 0.56 0.19 0.0035
B 0.94 0.625 0.95 0.82 0.19 0.0074
C 0.94 0.625 1 0.87 0.25 0.0084
D 1 1 1 1 0.25 0.0111
E 0.97 0.750 1 0.94 0.25 0.0098
F 0.87 0.500 1 0.79 0.25 0.0069
Table 1: The values of several association measures on some examples. For each exampledistribution, the graph shows its probability density function: the white regions have zerodensity, the shaded regions have constant densities. The dark regions have densities twiceas big as the densities on the light grey regions.
We illustrate the equitability/non-equitability of all association measures above on some
examples of simple probability distributions on the unit square. These examples are modified
from those in Kinney and Atwal (2013). When (X, Y ) follows the distribution in Example A,
(f(X), Y ) follows the distribution in Example B with the invertible piece-wise transformation
f(x) = x, 0 ≤ x ≤ 14; x + 1
2, 1
4< x ≤ 1
2; x − 1
4, 1
2< x ≤ 1. And (X, g(Y )) follows the
9
distribution in Example C with the invertible piece-wise transformation g(x) = x, 0 ≤ x ≤14; x + 1
4, 1
4< x ≤ 3
4; x − 1
2, 3
4< x ≤ 1. It is easy to check that X ↔ f(X) ↔ Y
and X ↔ g(Y ) ↔ Y both are Markov chains. Hence self-equitable measures MIcor and
Ccor remain constants for the first three examples A, B and C. The other measures MIC,
dcor, KS and CVM do not have same value across the first three examples, thus are all not
self-equitable.
The next three examples D-F show increasing noise levels. Correspondingly, a good
correlation measure should show decreasing values in the order of D, E and F. Notice that
MIC and KS remain constants across Examples D, E and F. Hence they fails to correctly
reflect the noise levels as good correlation measures should.
Another desirable property for a correlation measure is that the deterministic relationship
should result in the maximal correlation value of 1. We can see that KS is far less than one
for the deterministic relationship Y = X in Example D. From these examples, we can see
that MIcor and Ccor are the only two self-equitable correlation measures with desirable
properties.
We note again that the mathematically strict definitions of equitablity above are defined
for the parameter D(X ; Y ), not on its estimator Dn. The equitablity should be a property
about the parameter. Since the estimator Dn varies from data set to data set, we should
not define on it the property involving invariance under different models. The paradoxical
observation that MIC is equitable but MI is not in Reshef et al. (2011) comes from their
attempt to decide equitability on the estimator Dn instead. To see how that approach can
be misleading, we plot the the correlation measures Ccor and MIC for nine different functions
under different noise levels in Figure 1. Similar plots were used by Reshef et al. (2011) and
Reshef et al. (2013) to study the equitability of association measures.
The first two columns of graphs in Figure 1 plots the estimated Ccor and MIC values from
simulation data set of sizes n = 100 and n = 10000 respectively. We estimate Ccor using the
estimator in equations (20) and (A.9). The estimator is described in detail in Appendix A4.
The data is generated from the regression model Y = f(X)+ε with X uniformly distributed
10
0.0 0.5 1.0 1.5 2.0
0.0
0.2
0.4
0.6
0.8
1.0
n=100
Noise Level
Cco
r
0.0 0.5 1.0 1.5 2.0
0.0
0.2
0.4
0.6
0.8
1.0
n=10000
Noise Level
Cco
r
0.0 0.5 1.0 1.5 2.0
0.0
0.2
0.4
0.6
0.8
1.0
true
Noise LevelC
cor
ABCDEFGHI
0.0 0.5 1.0 1.5 2.0
0.0
0.2
0.4
0.6
0.8
1.0
n=100
Noise Level
MIC
0.0 0.5 1.0 1.5 2.0
0.0
0.2
0.4
0.6
0.8
1.0
n=10000
Noise Level
MIC
0.0 0.5 1.0 1.5 2.0
0.0
0.2
0.4
0.6
0.8
1.0
true
Noise Level
MIC ?
Figure 1: The correlation measures Ccor and MIC across nine different functions at variousnoise levels at different sample sizes.
11
on [0, 1] and the noise ε uniformly distributed on [−0.5l, 0.5l]. The nine functions are all
chosen to have maximum value of 1 and minimum value of 0 over the domain of X , and are
listed in Table 2. We chose the uniformly distributed noise so that the true Ccor values can
be easily calculated at noise levels l = 0, 0.1, ..., 2. These true values are plotted in the last
graph in the first row of Figure 1. To assess the variation of the association measures across
different functions under same noise level as in Reshef et al. (2011), we need to look at the
last graph of the true values. The first two graphs of estimated values contain estimation
errors and the patterns change with the sample size. The increasing sample size reduces
the estimation errors and makes the plot of estimated values increasingly similar to the plot
of true values. However, we do not know before hand what is the sufficient sample size to
make the patterns for estimated values similar enough to the patterns for true values. The
true parameter MIC value can not be calculated directly for different noise levels as it is
only defined as a limit of statistics. The patterns observed using simulated data sets of size
n = 10, 000 is still likely to be misleading. At least we know the last two functions (circle
and cross) should have true MIC = 1 in the noiseless case (l = 0). But the estimated MIC
values for those two functions are still less than 0.8 for n = 10, 000.
A Linear y = x
B Quadratic y = x2
C Square Root y =√x
D Cubic y = x3
E Centered Cubic y = 4(x− 12)
3
F Centered Quadratic y = 4x(1− x)G Cosine (Period 1) y = 1
2 [cos(2πx) + 1]H Circle (x− 1
2 )2 + y2 = 1
4I Cross (y − 1
2)2 = (x− 1
2)2
Table 2: The functional relationships studied in Figure 1.
On a side note, MIC calculation is also very computationally intensive for large data set.
We only plotted sample size up to n = 10, 000 in in Figure 1 since it is computationally
inhibitive to generate a similar plot for MIC at n = 100, 000. In contrast, we were able to
generate the plot for Ccor at n = 100, 000 in half an hour.
12
Reshef et al. (2011) observed large variation of MI among different functions at the same
noise level and concluded that mutual information is not an equitable measure. However,
these variations are in part due to the estimation errors. For the noiseless situation, MI = ∞(correspondingly, MIcor = 1) for the functions, and the large variations are purely due to
the statistical error MI−MI. In the next section, we show that MI can not be consistently
estimated over a class of continuous copulas. This explains the contradictory claims about
the equitability of mutual information MI in Reshef et al. (2011) and Kinney and Atwal
(2013). We also show that the proposed copula correlation Ccor is consistently estimable
and we consider it the candidate of choice for an equitable correlation measure.
3 STATISTICAL ERROR IN THE ASSOCIATION
MEASURE ESTIMATION
We now turn our attention to the statistical errors in estimating the association measures.
Particularly we focus on the two self-equitable measures MI and Ccor.
From equation (16), the KS and CVM are defined through the underlying copula function
C(u, v). We use the notations KS(C) and CVM(C) to emphasize their dependence on the
copula C. Then we can estimate them by plug-in estimators KS = KS(Cn) and CV M =
CVM(Cn), where Cn(u, v) denotes the empirical estimator for the copula function C(u, v).
Since Cn(u, v) converges to C(u, v) at the parametric rate of n−1/2 (Omelka et al., 2009;
Segers, 2012), KS and CVM can also be estimated at the parametric rate of n−1/2.
In contrast, the measures MI and Ccor are defined through copula density function c(u, v)
in equations (15) and (16). For discrete distributions, the empirical density function cn(u, v)
also converges to true density c(u, v) at the rate of n−1/2. Hence the plug-in estimator MI =
MI(cn) converges to MI at the parametric rate of n−1/2 (Joe, 1989) for discrete distribu-
tions. For continuous distributions, the kernel density estimator cn(u, v) (Moon et al., 1995;
Khan et al., 2007; Reshef et al., 2011) has been used for plug-in estimator MI = MI(cn).
However, such estimator MI does not converge to MI at the parametric rate of n−1/2 for
continuous distributions.
The convergence of density estimator has been studied well in the literatures. With
13
bounded m-th derivatives, the two-dimensional kernel density estimator converges to the
true density at the rate of n−(m+1)/(2m+4) (Silverman, 1986; Scott, 1992). However, the cop-
ula density function is harder to estimate because it is often unbounded, hence its derivatives
are also unbounded. For example, the commonly used bivariate Gaussian density is bounded.
However, after transforming to the unit square through copula decomposition (11), the cor-
responding Gaussian copula density is unbounded except in the independence case. The
copula density can not be accurately estimated where its value c(u, v) is big. Because the
mutual information MI overweighs the region where c(u, v) is big, it can not be consistently
estimated. On the other hand, our proposed Ccor does not need accurate density estima-
tion in the region where c(u, v) is big. We show that MI is not consistently estimable in
subsection 3.1, and provide a consistent estimator Ccor in subsection 3.2.
3.1 The Mutual Information Is Not Consistently Estimable
We study the minimax rate of convergence for mutual information for continuous copula
distributions. Starting from Farrell (1972), the minimax rate of convergence for density
estimation has been studied for the class of functions whose derivatives satisfy the Holder
condition. Since the minimax rate agrees with the achievable convergence rate by the kernel
estimator, it is also the optimal convergence rate of density estimation under those Holder
classes.
For simplicity, let us consider imposing the Holder condition on the copula density itself
(i.e., on the 0-th derivative). That is, the density function satisfies
|c(u1, v1)− c(u2, v2)| ≤ M1‖(u1 − u2, v1 − v2)‖ (17)
for a constant M1 and all u1, v1, u2, v2 values between 0 and 1. Here and in the following
‖ · ‖ refers to the Euclidean norm. However, most commonly used continuous copula density
functions are unbounded (Omelka et al., 2009; Segers, 2012) and hence do not satisfy the
Holder condition (17). Therefore, we need to investigate the problem with some nonstandard
conditions.
Since the Holder condition can not hold for the region where c(u, v) is big, we impose
it only on the region where the copula density is small. Specifically, we assume that the
14
Holder condition (17) holds only on the region AM = {(u, v) : c(u, v) < M} for a constant
M > 1. That is, |c(u1, v1)− c(u2, v2)| ≤ M1‖(u1 − u2, v1 − v2)‖ whenever (u1, v1) ∈ AM and
(u2, v2) ∈ AM . Then this condition is satisfied by all common continuous copulas listed in
the book by Nelsen (2006). For example, all Gaussian copulas satisfy the Holder condition
(17) on AM for some constants M > 1 and M1 > 0. If (17) holds on AM for any particular
M and M1 values, then (17) holds on AM also for all smaller M values and for all bigger M1
values. Without loss of generality, we assume that M is close to 1 and M1 is a big constant.
Let C denotes the class of continuous copulas whose density satisfies the Holder condition
(17) on AM . We can then study the minimax risk for estimating MI(C) for C ∈ C. Without
loss of generality, we consider the data set {(U1, V1), ..., (Un, Vn)} consisting of independent
observations from a copula distribution C ∈ C.
Theorem 2 Let MIn be any estimator of the mutual information MI in equation (16) based
on the observations (U1, V1), ..., (Un, Vn) from a copula distribution C ∈ C. And let MIcorn
be any estimator of the MIcor in equation (8). Then
supC∈C
E[|MIn(C)−MI(C)|] = ∞, and
supC∈C
E[|MIcorn(C)−MIcor(C)|] ≥ a2 > 0,(18)
for a positive constant a2.
The proof of Theorem 2 uses a method of Le Cam (Le Cam, 1973, 1986) by finding a
pair of hardest to estimate copulas. That is, we can find a pair of copulas C1 and C2 in
the class C such that C1 and C2 are arbitrarily close in Hellinger distance but their mutual
information are very different. Then no estimator can estimate MI well at both copulas
C1 and C2, leading to a lower bound for the minimax risk. Detailed proof is provided in
Appendix A2.
Theorem 2 implies that the MI and MIcor can not be estimated consistently over the
class C since the minimax risk is bounded away from zero. The difficulty in estimating MI is
due to the fact that it overweighs the region with large density c(u, v) values. From equation
(16), we see that MI is the expectation of log[c(u, v)] under the true copula distribution
c(u, v). In contrast, the Ccor in (15) takes the expectation at the independence case, similar
15
to CVM . This allows Ccor to be consistently estimable over the class C, as shown in the
next subsection 3.2.
3.2 The Consistent Estimation Of Copula Correlation
The proposed copula correlation measure Ccor can be consistently estimated since the region
of large copula density values has little effect on it. To see this, we derive an alternative
expression of Ccor from (15). Let x+ = max(x, 0) denote the non-negative part of x. Then
1∫
0
1∫
0
[c(u, v)− 1]+dudv −1∫
0
1∫
0
[1− c(u, v)]+dudv =
1∫
0
1∫
0
[c(u, v)− 1]dudv = 1− 1 = 0.
Hence1∫0
1∫0
[c(u, v)− 1]+dudv =1∫0
1∫0
[1− c(u, v)]+dudv. Therefore,
1∫0
1∫0
|c(u, v)− 1|dudv =1∫0
1∫0
[c(u, v)− 1]+dudv +1∫0
1∫0
[1− c(u, v)]+dudv
= 21∫0
1∫0
[1− c(u, v)]+dudv
Then we get the alternative expression of Ccor from (15),
Ccor =1
2
1∫
0
1∫
0
|c(u, v)− 1|dudv =
1∫
0
1∫
0
[1− c(u, v)]+dudv. (19)
In the new expression (19), Ccor only depends on [1 − c(u, v)]+ which is nonzero only
when c(u, v) < 1. To estimate Ccor well, we only need the density estimator cn(u, v) to be
good for points (u, v) with low copula density. Specifically, we consider the plug-in estimator
Ccor = Ccor(cn) =
1∫
0
1∫
0
[1− cn(u, v)]+dudv, (20)
where cn(u, v) =1
nh2
n∑i=1
K(u−Ui
h)K(v−Vi
h) is a kernel density estimator with kernel K(·) and
bandwidth h.
To analyze the statistical error of Ccor, we can look at the error in the low copula
density region separately from the error in the high copula density region. Specifically, let
M2 be a constant between 1 and M , say, M2 = (M + 1)/2. Then we can separate the
16
unit square into the low copula density region AM2 = {(u, v) : c(u, v) ≤ M2} and the high
copula density region AcM2
= {(u, v) : c(u, v) > M2}. We now have Ccor = T1(c) + T2(c)
where T1(c) =∫∫
AM2
[1 − c(u, v)]+dudv and T2(c) =∫∫
Ac
M2
[1 − c(u, v)]+dudv. Since the Holder
condition (17) holds on AM , the classical error rate O(h+ (nh2)−1/2) for the kernel density
estimator holds for |cn(u, v) − c(u, v)| on the low copula density region AM2. Hence the
error |T1(cn) − T1(c)| is also bounded by O(h + (nh2)−1/2). While the density estimation
error |cn(u, v) − c(u, v)| can be unbounded on the high copula density region AcM2
, it only
propagates into error for Ccor when cn(u, v) < 1. We can show that the overall propagated
error |T2(cn)− T2(c)| is controlled at a higher order O((nh2)−1). Therefore, the error rate of
Ccor can be controlled by the classical kernel density estimation error rate as summarized
in the following Theorem 3.
Theorem 3 Let cn(u, v) = 1nh2
n∑i=1
K(u−Ui
h)K(v−Vi
h) be a kernel estimation of the copula
density based on observations (U1, V1), ..., (Un, Vn). We assume the following conditions
1. The bandwidth h → 0 and nh2 → ∞.
2. The kernel K has compact support [−1, 1].
3.∫∞−∞K(x)dx = 1,
∫∞−∞ xK(x)dx = 0 and µ2 =
∫∞−∞ x2K(x)dx > 0.
Then the plug-in estimator Ccor = Ccor(cn) in (20) has a risk bound
supC∈C
E[|Ccor − Ccor|] ≤ 2√M1h +
2µ2√nh2
+M5
nh2(21)
for some finite constant M5 > 0.
The detailed proofs for Theorem 3 are provided in Appendix A3. From (21), if we choose
the bandwidth h = n−1/4, then Ccor converges to the true value Ccor at the rate of O(n−1/4).
Thus Ccor can be consistently estimated, in contrast to the results on MI and MIcor in
subsection 3.1.
The Theorem 3 provides only an upper bound for the statistical error of the plug-in
estimator Ccor. The actual error may be lower. In fact, the error |T1(c) − T1(cn)| can
17
be controlled at O(n−1/2) using kernel density estimator cn (Bickel and Ritov, 2003). Here
we did not find the optimal rate of convergence. But the upper bound already shows that
Ccor is much easier to estimate than MI and MIcor. Similar to classical kernel density
estimation theory, assuming that the Holder condition holds on AM for the m-th derivatives
of the copula density, the upper bound on the convergence rate can be further improved to
O(n−(m+1)/(2m+4)).
The technical conditions 1 − 3 in Theorem 3 are classical conditions on the bandwidth
and the kernel. We have used the bivariate product kernel for technical simplicity. Other
variations of the conditions in the literature may be used. For example, it is possible to relax
the compact support condition 2 to allow using the Gaussian kernel.
In practice, the (Ui, Vi)’s are not observed. The estimator conventionally is calculated on
(Ui = RX,i/(n+ 1), Vi = RY,i/(n+ 1))’s. The risk bound still holds using (Ui, Vi)’s.
4 NUMERICAL STUDIES
In this section, we conduct two numerical studies on the finite sample properties of the
proposed Ccor and compare it with several other measures. We first conduct a simulation
study on the power using Ccor and other measures for detecting association. We then
apply Ccor to a data set of social, economic, health, and political indicators from the World
Health Organization (WHO). This WHO data set is analyzed by Reshef et al. (2011), and is
available from their website http://www.exploredata.net. Their maximal information-based
nonparametric exploration (MINE) package is available on the same website. We used their
MINE package to calculate MIC.
4.1 Power Comparison
We analyze the power to detect association by MIC, linear correlation ρ, distance correlation
dcor and our copula correlation Ccor on simulated data sets. We choose the same simulation
setting as in the study by Simon and Tibshirani (2011). Data sets of sample size n = 320
were generated from the regression model Y = f(X) + ε with Gaussian error ε ∼ N(0, σ2),
and the four measures were applied to test the null hypothesis thatX and Y are independent.
18
The sample size n = 320 is chosen as same as that in the simulations by Reshef et al. (2011).
The rejection rule at α = 0.05 level is decided from 500 data sets generated from the null
hypothesis. Then the rejection results of tests on 500 simulated data sets were used to
calculate their power.
We simulated data from the eight functions f(·) in Simon and Tibshirani (2011) plus a
new functional relationship: the Cross pattern. These nine functions are listed in Table 3.
The powers are calculated at 31 different noise levels: σ = 0, 0.1σ0,..., 3σ0. We note that
the σ0 value was chosen differently for each functional relationship so that the power curves
clearly show the pattern of decreasing within the simulated noise range. One simulated
data set for each combination of the noise level and the functional relationship is plotted in
Table 3.
noise levelsType formula 0 0.5 1 1.5 2 2.5 3
Linear y = x
Quadratic y = 4(x− 12 )
2
Cubic y = 80(x − 13)
3 − 12(x− 13)
Power x1/4 y = x1/4
Step function y = 1{x > 12}
Sine: period 1/2 y = sin(4πx)
Circle (x− 12)
2 + y2 = 14
Cross (y − 12)
2 = (x− 12)
2
Sine: period 1/8 y = sin(8πx)
Table 3: The simulated data in the power study.
The power curves for testing independent null hypothesis by MIC, linear correlation ρ,
distance correlation dcor and copula correlation Ccor are plotted in the Figure 2. Without
Ccor, the other three measures follow the conclusion from Simon and Tibshirani (2011): dcor
has the best power in all but one case. MIC only beats dcor in the case of high frequency
periodic function. This seems to suggest that dcor is a better association measure to detect
nonlinear relationship.
19
0.0 0.5 1.0 1.5 2.0 2.5 3.0
0.0
0.2
0.4
0.6
0.8
1.0
Linear
Noise Level
Pow
er
cordcorMICCcor
0.0 0.5 1.0 1.5 2.0 2.5 3.0
0.0
0.2
0.4
0.6
0.8
1.0
Quadratic
Noise Level
Pow
er
cordcorMICCcor
0.0 0.5 1.0 1.5 2.0 2.5 3.0
0.0
0.2
0.4
0.6
0.8
1.0
Cubic
Noise Level
Pow
er
cordcorMICCcor
0.0 0.5 1.0 1.5 2.0 2.5 3.0
0.0
0.2
0.4
0.6
0.8
1.0
x1 4
Noise Level
Pow
er
cordcorMICCcor
0.0 0.5 1.0 1.5 2.0 2.5 3.0
0.0
0.2
0.4
0.6
0.8
1.0
Step function
Noise Level
Pow
er
cordcorMICCcor
0.0 0.5 1.0 1.5 2.0 2.5 3.0
0.0
0.2
0.4
0.6
0.8
1.0
Sine: period 1/2
Noise Level
Pow
er
cordcorMICCcor
0.0 0.5 1.0 1.5 2.0 2.5 3.0
0.0
0.2
0.4
0.6
0.8
1.0
Circle
Noise Level
Pow
er
cordcorMICCcor
0.0 0.5 1.0 1.5 2.0 2.5 3.0
0.0
0.2
0.4
0.6
0.8
1.0
Cross
Noise Level
Pow
er
cordcorMICCcor
0.0 0.5 1.0 1.5 2.0 2.5 3.0
0.0
0.2
0.4
0.6
0.8
1.0
Sine: period 1/8
Noise Level
Pow
er
cordcorMICCcor
Figure 2: This plots the power curves for testing independence using the four correlationmeasures.
20
However, that conclusion no longer holds when we also consider Ccor in the comparison.
The dcor has clear power advantage over MIC and Ccor only in four out of nine cases. The
magnitude of its advantage in those four cases pales in comparison to the magnitude of
its disadvantage in the last three cases. A more careful examination also reveal that those
four cases are where the Pearson’s linear correlation ρ also performs well. Hence Ccor and
MIC are more useful than dcor to find novel functional relationships which are not detected
by Pearson’s linear correlation. Ccor and MIC are adept at detecting different types of
functional relationships. MIC is able to detect the high frequency periodic signal, while
Ccor works well for the type of many-to-many relationships where the other measures fail.
We can better understand the performance of these measures by combining the power
curves shown in Figure 2 with the corresponding data patterns plotted in Table 3. The four
cases where dcor has biggest power advantage over Ccor and MIC are: (1) Linear function
with noise level 1.5σ0; (2) Cubic function with noise level 1.5σ0; (3) Power function x1/4
with noise level 1.0σ0; and (4) Step function with noise level 0.5σ0. The corresponding data
plots for those four cases in Table 3 are very random to the naked eyes. The detection
of the association in those cases comes from the underlying monotone increasing trends.
The Pearson’s linear correlation is also very good at detecting such monotone trends. On
the other hand, Ccor has biggest power advantage over the other measures for two cases:
Circle with noise level 2.0σ0 and Cross with noise level 1.0σ0. While the noise obscured
the functional relationships somewhat in these two cases, the non-random patterns are still
clearly visible to the naked eyes. It is quite reasonable to expect the detection power in these
cases to be very close to 100%. However, only Ccor achieves this. The other association
measures all emphasize some particular features of dependence and were not able to pick
up these obvious association. In contrast, MIC has its biggest power advantage over other
measures for the high frequency periodic function with noise level 1.0σ0. The corresponding
plot in Table 3 does not show visible non-random patterns. It is rather amazing that MIC
is able to detect this type of signal with near 100% power for this case. More mathematical
study on MIC is needed to understand why it is good at detecting high frequency periodic
signals and what other type of signals it can detect. Ccor on the other hand, seems to find
21
many relationships ignored by linear correlation, particularly those visible to human eyes.
Thus, Ccor would be a very useful tool for automated scanning for possible associations in
huge data sets.
We note that Ccor and MIC aim to be equitable, putting equal importance to the linear
function and other types of deterministic relationships. This necessarily leads to lower cor-
relation values than Pearson’s ρ in linear cases to give other types of functional relationships
a fair shake. Therefore, Ccor and MIC will lose some power to detect the functions that
Pearson’s linear correlation can detect. This is more than compensated by their ability to
detect relationships which escaped detection by Pearson’s linear correlation. While dcor
amazingly seems to find every relationship detectable to Pearson’s linear correlation, this
limits its ability to detect other types of association. In practice, Pearson’s linear correlation
would always be the first measure we check. Hence it is not a concern that a new association
measure may lose power where Pearson’s linear correlation works. Ccor and MIC are very
well suited as the second/third measures to check other types of association missed by the
first scan using linear correlation.
The above study concentrates on the power of detecting association by different measures.
It is worthy to point out that power is not the only criterion nor the most important criterion
to assess the usefulness of equitable association measures. The equitable measures are used
to rank the detected association relationships without giving preference to linear relationship.
Therefore, losing some power for detecting the linear relationship to give more consideration
to other types of association in fact is a desirable property.
4.2 Analysis Of WHO Data
We now apply the new measure Ccor to the WHO data set. We repeat the analysis
in Reshef et al. (2011) by calculating the pairwise correlations among the 357 variables in
the data set. The first variable contains the ID numbers of the countries: from 1 to 202.
These numerical values have no real intrinsic meaning. Hence the correlations between the
first variable with other variables are rather senseless. We drop the first variable and only
calculate the pairwise correlations among the rest 356 variables. There are many missing
22
data in the data sets. For some pairs of variables the available sample size is very small.
Since our estimator for Ccor uses the copula density estimation, its accuracy under a very
small sample size is suspectable. Therefore we calculate the measure Ccor only on those
pairs with at least n = 50 common observations. This results in 49286 pairwise correlations
in total.
We first look at some pairs of variables studied by Reshef et al. (2011). Figure 3 plots
the data along with linear correlation (cor), MIC and Ccor values for the examples 4C-4H
in Reshef et al. (2011).
We can see that Ccor and MIC qualitatively give the same conclusion in those examples.
They both give low correlations to the first case. They both detect some clear nonrandom
relationships with weak linear correlations. They give lower correlation values than ρ in the
two cases with high linear correlations, but big enough to detect the relationship. There
are some differences in the numerical values between Ccor and MIC. The biggest difference
occurs for the third case in the first row, with MIC = 0.72 and Ccor = 0.46.
To compare the estimates for Ccor and MIC, we plotted their values for all 49286 pairs
on the WHO data sets in Figure 4. We can see that the values fall in a band around the
diagonal. This means that Ccor and MIC generally rate the pair-wise association similarly.
To investigate the different rankings by these two measures, we investigate three pairs
of variables that have very similar values in one measure but big difference in the other
measure. These three pairs are labeled as A, B and C on the graph of Figure 4. We plot the
data for these variables in the Figure 5. Since Ccor and MIC are both rank-based, we also
plot these data in the ranks to avoid any specious pattern due to the scales on the variables.
As we can see from Figure 5, the later two cases (B and C) both seem to have strong linear
relationships with some noise. While the noise patterns are different in Figure 5B and 5C,
the average noise amount looks about the same. The first case Figure 5A clearly is much
noisier than the later two cases. This pattern is correctly reflected by Ccor which assigns
similar correlation to the latter two case while giving the first case a much lower correlation
value. However, MIC assigns about the same correlation value to the first two cases and
23
0 5 10 15
010
2030
40
cor=0.12, MIC=0.1, Ccor=0.18
Dentist Density (per 10,000)
Life
Los
t to
Inju
ries
(%yr
s)
1 2 3 4 5 6 7
3040
5060
7080
90
cor=−0.84, MIC=0.61, Ccor=0.52
Children Per Woman
Life
Exp
ecta
ncy
(Yea
rs)
0 500000 1000000 2000000
050
010
0015
00
cor=−0.12, MIC=0.72, Ccor=0.46
Number of PhysiciansD
eath
s du
e to
HIV
/AID
S
0 10000 20000 30000 40000
020
4060
cor=0.04, MIC=0.50, Ccor=0.42
Income / Person (Int$)
Adu
lt (F
emal
e) O
besi
ty (
%)
0 50 100 150 200 250 300
−10
010
2030
4050
60
cor=−0.14, MIC=0.54, Ccor=0.43
Health Exp. / Person (US$)
Mea
sles
Imm
. Dis
parit
y (%
)
0 10000 30000 50000
010
0020
0030
0040
0050
0060
00cor=0.93, MIC=0.85, Ccor=0.82
Gross Nat’l Inc / Person (Int$)
Hea
lth E
xp. /
Per
son
(Int
$)
Figure 3: The raw data and estimated correlation measures for several example casesin Reshef et al. (2011).
24
a much higher correlation value to the third case. This certainly does not agree with the
observed data patterns. Particularly, MIC assigns a correlation value of 1 to the case 5C
which is far from a noiseless deterministic relationship. From these observations, it appears
that Ccor better reflects the noise level than MIC. Thus Ccor appears to be a better equitable
correlation measure.
5 DISCUSSIONS AND CONCLUSIONS
For simplicity of presentation, we only worked on bivariate continuous distributions. The
definition for Ccor in equation (9) also works for discrete distributions, where the densities
pXY (x, y), pX(x) and pY (y) are defined in respect to the appropriate counting measures.
Similarly Ccor can be defined for mixed-type random variables using the densities under
appropriate dominating measures in equation (9). The extension to higher dimensions is
straight forward. For p-dimensional X and q-dimensional Y , the densities pXY (x, y), pX(x)
and pY (y) are functions of respectively p+q, p and q dimensions. However, high-dimensional
density estimation is much harder. The statistical properties of the Ccor in higher-dimension
is the subject of future research.
We studied the equitability of several association measures in this paper. We explained
the paradoxical equitability results on MI and MIC in literature by separating estimability
from equitability. To this end, we prove that MI is not consistently estimable over a class of
continuous copulas. This provides the first mathematical quantitative result on the difficulty
of estimating MI in continuous distributions.
We proposed a novel association measure, the copula correlation (Ccor). We proved that
theoretically Ccor is both self-equitable and consistently estimable. It is the only known
association measure that have both properties. Numerical studies show that Ccor performs
well as intended. It ranks the correlations more reasonable than MIC on the WHO data
set. It is good at finding some functional relationships missed by other correlation measures.
25
0.2 0.4 0.6 0.8 1.0
0.2
0.4
0.6
0.8
1.0
Ccor
MIC
A B
C
Figure 4: The Ccor and MIC values for all pairs in the WHO data. Three cases labeled onthe graph is shown in detail in the Figure 5
26
0 10 20 30 40
4060
8010
0
A: Raw data, MIC=0.64, Ccor=0.33
Malnutrition Prevalence
Boy
s C
ompl
etin
g P
rimar
y S
chl(%
)
0 20 40 60 80
020
4060
80
A: Ranks, MIC=0.64, Ccor=0.33
Malnutrition PrevalenceB
oys
Com
plet
ing
Prim
ary
Sch
l(%)
65 70 75 80 85 90 95 100
2040
6080
100
B: Raw data, MIC=0.64, Ccor=0.71
DTP3 Imm. in 1yr olds(%)
Hib
3 Im
m. i
n 1y
r ol
ds(%
)
0 20 40 60 80 100
020
4060
8010
0
B: Ranks, MIC=0.64, Ccor=0.71
DTP3 Imm. in 1yr olds(%)
Hib
3 Im
m. i
n 1y
r ol
ds(%
)
0 10000 20000 30000 40000 50000 60000 70000
010
2030
4050
60
C: Raw data, MIC=1, Ccor=0.70
Income Per Person
Oil
Con
sum
ptio
n P
er P
erso
n
0 10 20 30 40 50 60
010
2030
4050
60
C: Ranks, MIC=1, Ccor=0.70
Income Per Person
Oil
Con
sum
ptio
n P
er P
erso
n
Figure 5: The comparison of Ccor and MIC on three example cases.
27
Computationally, Ccor is faster than MIC for large sample size by several order of magnitude.
Based on these studies, Ccor will be a very useful new tool to explore complex associations
in big data sets.
References
Bickel, P. J. and Ritov, Y. (2003). Nonparametric estimators which can be “plugged-in”.
The Annals of Statistics, 31(4):pp. 1033–1053.
Donoho, D. L. and Liu, R. C. (1991). Geometrizing rates of convergence, ii. The Annals of
Statistics, 19(2):pp. 633–667.
Farrell, R. H. (1972). On the best obtainable asymptotic rates of convergence in estimation
of a density function at a point. The Annals of Mathematical Statistics, 43(1):pp. 170–180.
Genest, C., Quessy, J.-F., and Remillard, B. (2007). Asymptotic local efficiency of cramer-
von mises tests for multivariate independence. The Annals of Statistics, 35(1):pp. 166–191.
Genest, C. and Remillard, B. (2004). Test of independence and randomness based on the
empirical copula process. Test, 13(2):335–369.
Joe, H. (1989). Relative entropy measures of multivariate dependence. Journal of the Amer-
ican Statistical Association, 84(405):157–164.
Jones, M. C., Marron, J. S., and Sheather, S. J. (1996). A brief survey of bandwidth selection
for density estimation. Journal of the American Statistical Association, 91(433):401–407.
Khan, S., Bandyopadhyay, S., Ganguly, A. R., Saigal, S., Erickson, D. J., Protopopescu,
V., and Ostrouchov, G. (2007). Relative performance of mutual information estimation
methods for quantifying the dependence among short and noisy data. Phys. Rev. E,
76:026209.
Kinney, J. B. and Atwal, G. S. (2013). Equitability, mutual information, and the maximal
information coefficient. arXiv preprint arXiv:1301.7745.
28
Kojadinovic, I. and Holmes, M. (2009). Tests of independence among continuous random
vectors based on cramer-von mises functionals of the empirical copula process. Journal of
Multivariate Analysis, 100(6):1137 – 1154.
Le Cam, L. (1973). Convergence of estimates under dimensionality restrictions. The Annals
of Statistics, pages 38–53.
Le Cam, L. (1986). Asymptotic Methods in Statistical Decision Theory. Springer series in
statistics. Springer, New York, NY.
Moon, Y. I., Rajagopalan, B., and Lall, U. (1995). Estimation of mutual information using
kernel density estimators. Physical Review E, 52(3):2318–2321.
Nelsen, R. B. (2006). An Introduction to Copulas (Springer Series in Statistics). Springer-
Verlag New York, Inc., Secaucus, NJ, USA.
Omelka, M., Gijbels, I., and Veraverbeke, N. (2009). Improved kernel estimation of copulas:
weak convergence and goodness-of-fit testing. The Annals of Statistics, 37(5B):3023–3058.
Reshef, D., Reshef, Y., Mitzenmacher, M., and Sabeti, P. (2013). Equitability analysis of
the maximal information coefficient, with comparisons. arXiv preprint arXiv:1301.6314.
Reshef, D. N., Reshef, Y. A., Finucane, H. K., Grossman, S. R., McVean, G., Turnbaugh,
P. J., Lander, E. S., Mitzenmacher, M., and Sabeti, P. C. (2011). Detecting novel associ-
ations in large data sets. Science, 334(6062):1518–1524.
Scott, D. (1992). Multivariate Density Estimation: Theory, Practice, and Visualization.
Wiley Series in Probability and Statistics. Wiley.
Segers, J. (2012). Asymptotics of empirical copula processes under non-restrictive smooth-
ness assumptions. Bernoulli, 18(3):764–782.
Silverman, B. W. (1986). Density estimation for statistics and data analysis, volume 26.
CRC press.
29
Simon, N. and Tibshirani, R. (2011). comment on detecting novel associations in large data
sets by reshef et al, science dec 16, 2011. Science.
Speed, T. (2011). A correlation for the 21st century. Science, 334(6062):1502–1503.
Szekely, G. J. and Rizzo, M. L. (2009). Brownian distance covariance. The annals of applied
statistics, pages 1236–1265.
Szekely, G. J., Rizzo, M. L., and Bakirov, N. K. (2007). Measuring and testing dependence
by correlation of distances. The Annals of Statistics, 35(6):2769–2794.
Wand, M. and Jones, C. (1994). Multivariate plug-in bandwidth selection. Computational
Statistics, 9(2):97–116.
Wand, M. and Jones, M. (1993). Comparison of smoothing parameterizations in bivariate
kernel density estimation. Journal of the American Statistical Association, 88(422):520–
528.
Appendices
A1 Proof of Theorem 1.
We check the DPI for CDα on the random variables X, Y, Z forming a Markov chain
X ↔ Y ↔ Z. Consider U = F−1X (X), V = F−1
Y (Y ) and W = F−1Z (Z). Then U , V and W
each follows the uniform distribution, with their joint density function cXY Z(u, v, w) as a 3-
dimensional copula. Similarly cXY (u, v) denotes the joint density of U and V , and cY Z(v, w)
denotes the joint density of V and W . In the following, we suppress the subscript from these
copula densities c(·) when their input arguments clearly indicate which one is being used.
Now, by the copula property,1∫0
c(v, w)dw = 1 for all v ∈ [0, 1]. Hence,
CDα(X ; Y ) =
1∫
0
1∫
0
|c(u, v)− 1|αdudv =
1∫
0
1∫
0
1∫
0
|c(u, v)− 1|αc(v, w)dudvdw.
30
For α ≥ 1, |x− 1|α is convex in x. Hence applying Jensen’s inequality,
CDα(X ; Y ) =
1∫
0
1∫
0
1∫
0
|c(u, v)− 1|αc(v, w)dudvdw ≥1∫
0
1∫
0
|1∫
0
c(u, v)c(v, w)dv− 1|αdudw.
Since X ↔ Y ↔ Z, we have c(u, v, w) = c(u, v)c(v, w) for all u, v, w values. Therefore1∫0
c(u, v)c(v, w)dv =1∫0
c(u, v, w)dv = c(u, w). This leads to
CDα(X ; Y ) ≥1∫
0
1∫
0
|1∫
0
c(u, v)c(v, w)dv−1|αdudw =
1∫
0
1∫
0
|c(u, w)−1|αdudw = CDα(X,Z).
Hence DPI holds for CDα. Thus CDα is self-equitable.
A2 Proof of Theorem 2.
To prove the theorem, we use Le Cam (1973)’s method to find the lower bound on the
minimax risk of the estimating mutual information MI. To do this, we will use a more
convenient form of Le Cam’s method developed by Donoho and Liu (1991). Define the
module of continuity of a functional T over the class F with respect to Hellinger distance as
in equation (1.1) of Donoho and Liu (1991):
w(ε) = sup{|T (F1)− T (F2)| : H(F1, F2) ≤ ε, Fi ∈ F}. (A.1)
Here H(F1, F2) denotes the Hellinger distance between F1 and F2. Then the minimax rate
of convergence for estimating T (F ) over the class F is bounded below by w(n−1/2).
We now look for a pair of density functions c1(u, v) and c2(u, v) on the unit square for
distributions that are close in Hellinger distance but far away in their mutual information.
This provides a lower bound on the module of continuity for mutual information MI over
the class C, and hence leading to a lower bound on the minimax risk. We provide an outline
of the proof here.
We first divide the unit square into three disjoint regions R1, R2 and R3 with R1 ∪R2 ∪R3 = [0, 1] × [0, 1]. The first density function c1(u, v) puts probability masses δ, a and
1 − a − δ respectively on the regions R1, R2 and R3 each uniformly. The a is an arbitrary
31
small fixed value, for example, a = 0.01. For now, we take δ to be another small fixed value.
The area of the region is chosen so that c1(u, v) = M on region R2 and c1(u, v) = M∗ on
region R1 for a very big M∗. The second density function c2(u, v), compared to c1(u, v),
moves a small probability mass ε from R1 to R2. We will see that the Hellinger distance
between c1 and c2 is of the same order as ε, but the change in MI is unbounded for big M∗.
Hence module of continuity w(ε) is unbounded for mutual information MI. Hence the MI
can not be consistently estimated over the class C.
Specifically, the region R1 is chosen to be a narrow strip immediately above the diagonal,
R1 = {(u, v) : −δ1 < u − v < 0}; and R2 is chosen to be a narrow strip immediately
below the diagonal, R2 = {(u, v) : 0 ≤ u − v < δ2}. The remaining region is R3 =
[0, 1]× [0, 1] \ (R1 ∪R2). The values of δ1 and δ2 are chosen so that the areas of regions R1
and R2 are δ/M∗ and a/M respectively. Then clearly c1(u, v) = M∗ on R1; c1(u, v) = M on
R2; c1(u, v) = (1 − a − δ)/(1 − a/M − δ/M∗) on R3. And c2(u, v) = M∗ − ε(M∗/δ) on R1;
c2(u, v) = M + ε(M/a) on R2; c2(u, v) = c1(u, v) on R3. See the Figure 6.
Then we have
2H2(c1, c2) =∫∫
(√
c2(u, v)−√c1(u, v))
2dudv
= (√
M∗ − ε(M∗/δ)−√M∗)2δ/M∗ + (
√M + ε(M/a)−
√M)2a/M
= δ(√
1− ε/δ − 1)2 + a(√
1 + ε/a− 1)2
= δ(ε/2δ)2 + a(ε/2a)2 + o(ε2)= ε2( 1
4δ+ 1
4a) + o(ε2).
Hence the Hellinger distance is of the same order as ε:
H(c1, c2) = ε
√1
8δ+
1
8a+ o(ε).
On the other hand, the difference in the mutual information is
MI(c1)−MI(c2)= δ log(M∗) + a log(M)− (δ − ε) log[M∗ − ε(M∗/δ)]− (a + ε) log[M + ε(M/a)]= ε log(M∗)− ε log(M)− (δ − ε) log(1− ε/δ)− (a + ε) log(1 + ε/a).
(A.2)
Here M , δ and a are fixed constants. Hence when M∗ → ∞, this difference in MI also goes
to ∞. For example, if we let M∗ = e1/(ε)2, then the module of continuity w(ε) ≥ O(1/ε).
32
0 1
1
R1
R2
R3
R3
u
v
δ1
δ2
Figure 6: The plot shows the regions R1, R2 and R3. The other two narrow strips neighboringR1 and R2 are for the continuity correction mentioned at the end of the proof.
That means, the rate of convergence is at least O(w(n−1/2)) = O(n1/2) → ∞. In other
words, MI can not be consistently estimated.
The small difference in Hellinger distance of c1 and c2 can lead to unbounded difference in
MI(c1) and MI(c2) since MI is unbounded. After the transformation MIcor =√1− e−2MI
in (8), the mutual information correlation is bounded. The difference between MIcor(c1)
and MIcor(c2) in the above example is actually small since the MI are big for both c1 and
c2 (leading to corresponding MIcors close to zero). However, MIcor is also very hard to
estimate over the class C. To see that, we follow the same reasoning above but modify the
example of c1 and c2. First, we notice that for any pair of densities c1 and c2,
|MIcor(c1)−MIcor(c2)| = |√1− e−2MI(c1) −
√1− e−2MI(c2)|
= | [1−e−2MI(c1)]−[1−e−2MI(c2)]√1−e−2MI(c1)+
√1−e−2MI(c2)
|≥ 1
2|e−2MI(c1) − e−2MI(c2)|
= 12e−2MI(c1)|1− e−2[MI(c1)−MI(c2)]|.
For the difference MIcor(c1) −MIcor(c2) to be the same order of the difference MI(c1) −
33
MI(c2), we need to set MI(c1) at constant order when ε → 0.
Therefore, we modify the above c1 to have probability mass δ = 2ε in region R1, varying
with the ε value instead of fixed as before. And we set M∗ = e1/ε, leading to
MI(c1) = δ log(M∗) + a log(M) + (1− a− δ) log[(1− a− δ)/(1− a/M − δ/M∗)]= 2 + a log(M) + (1− a− 2ε) log[(1− a− 2ε)/(1− a/M − 2εe−1/ε)],
which converges to a fixed constant a1 = 2 + a log(M) + (1 − a) log[(1 − a)/(1 − a/M)] as
ε → 0. Using (A.2), recall that δ = 2ε and M∗ = e1/ε, we have
MI(c1)−MI(c2)= ε log(M∗)− ε log(M)− (δ − ε) log(1− ε/δ)− (a+ ε) log(1 + ε/a)= 1− ε log(M)− ε log(1/2)− (a + ε) log(1 + ε/a),
which converges to 1 as ε → 0. Hence we have
limε→0
w(ε) ≥ limε→0
1
2e−2MI(c1)|1− e−2[MI(c1)−MI(c2)]| = 1
2e−2a1(1− e−2(1)),
a positive constant a2 = e−2a1(1 − e−2)/2. Therefore, MIcor can not be estimated consis-
tently over the class C either.
The above outlines the main idea of the proof, ignoring some mathematical subtleties.
One is that the example densities c1 and c2 are only piecewise continuous on the three
regions, but not truly continuous as required for the class C. This can be easily remedied by
connecting the three pieces linearly. Specifically we set the densities ci(u, v) = M , i = 1, 2,
on the boundary between R1 and R3, {(u, v) : u− v = −δ1}, and on the boundary between
R2 and R3, {(u, v) : u− v = δ2}. Then we use two narrow strips within R3, {(u, v) : −δ3 ≤u−v ≤ −δ1} and {(u, v) : δ2 ≤ u−v ≤ δ4} to connect the constant ci(u, v) values on the rest
of region R3 with the boundary value ci(u, v) = M continuously through linear (in u − v)
ci(u, v)’s on the two strips that satisfies the Holder condition (17). By the Holder condition
(17), the connection can be made with strips of width at most (M − 1 + a + δ)/M1. This
continuity modification does not affect the calculation of the difference MI(c1) − MI(c2)
above as c1 and c2 only differ on regions R1 and R2. Within regions R1 and R2, the densities
c1 and c2 can be further similarly connected continuously linearly in u − v. As there is
no Holder condition on AcM , the connection within R1 and R2 can be as steep as we want.
Clearly the order obtained through above calculations will not change if we make these
connections very steep so that their effect is negligible.
34
Another technical subtlety is that the c1 and c2 defined above are only densities on the
unit square but not copula densities which require uniform marginal distributions. However,
it is clear that the marginal densities for cis are uniform over the interval (δ3, 1 − δ4) and
linear in the rest of interval near the two end points 0 and 1. The copulas densities c∗i ’s
corresponding to ci’s can be calculated directly through Sklar’s decomposition (11). It is
easy to see that the order for the module of continuity w(ε) remain the same for using the
corresponding copula densities c∗i ’s.
A3 Proof of Theorem 3.
Let M2 be a constant between 1 and M , say, M2 = (M + 1)/2. Denote AM2 = {(u, v) :
c(u, v) ≤ M2}. Then denote T1(c) =∫∫
AM2
[1− c(u, v)]+dudv and T2(c) =∫∫
Ac
M2
[1− c(u, v)]+dudv
so that Ccor = T1(c) + T2(c).
For a density estimator cn(u, v), we have the corresponding copula correlation estimator
by plug-in cn(u, v) into the Ccor expression. Hence Ccor = T1(cn) + T2(cn). We now bound
the errors in estimating T1 and T2 separately.
T1 involves the integral over the (u, v) points in AM2 only. Those points are contained
in the set of low density points where the Holder condition holds. Hence we can apply
the usual bounds for kernel density estimation. Particularly, let cn(u, v) = E[cn(u, v)] =∫∫
K(s)K(t)c(u+ hs, v + ht)dsdt denote the expectation of the density estimator cn. Then
the bias in density estimation is bounded by
|cn(u, v)− c(u, v)| ≤∫∫
K(s)K(t)|c(u+ hs, v + ht)− c(u, v)|dsdt.
For (u, v) ∈ AM2 , c(u + hs, v + ht) ∈ AM for h ≤ (M −M2)/(√2M1), |s| ≤ 1 and |t| ≤ 1.
Since the support of K(·) is [−1, 1], for small enough h, the bias is bounded using the Holder
condition by
1∫
−1
1∫
−1
K(s)K(t)M1h(|s|+ |t|)dsdt ≤ 2M1h
1∫
−1
1∫
−1
K(s)K(t)dsdt = 2M1h. (A.3)
35
The variance of cn is given by
V ar[cn(u, v)] =1
nV ar[
1
h2K(
u− U1
h)K(
v − V1
h)] ≤ 1
nh2
1∫
−1
1∫
−1
K2(s)K2(t)c(u+hs, v+ht)dsdt.
Hence by the same arguments above, for small enough h, the variance is bounded by
1
nh2
1∫
−1
1∫
−1
K2(s)K2(t)[c(u, v) +M1h(|s|+ |t|)]dsdt ≤ 1
nh2µ22[c(u, v) + 2M1h], (A.4)
where µ2 =∫ 1
t=−1K2(t)dt. Combining (A.3) and (A.4), we get
E{[cn(u, v)− c(u, v)]2} ≤ 4M1h2 +
1
nh2µ22[c(u, v) + 2M1h]. (A.5)
The integration of the right hand side over the region AM2 is bounded by its integration over
the whole unit square: (u, v) ∈ [0, 1]× [0, 1]. For h small enough, since 2M1h ≤ 1, we get
E{∫∫
AM2
[cn(u, v)− c(u, v)]2dudv} ≤1∫0
1∫0
{4M1h2 + 1
nh2µ22[c(u, v) + 1]}dudv
= 4M1h2 +
2µ22
nh2 .
(A.6)
Hence
{E∫∫
AM2
|cn(u, v)− c(u, v)|dudv}2 ≤ E{∫∫
AM2
[cn(u, v)− c(u, v)]2dudv}
≤ 4M1h2 +
2µ22
nh2 ≤ (2√M1h+ 2µ2√
nh)2.
That is,
|T1(cn)− T1(c)| ≤ E
∫∫
AM2
|cn(u, v)− c(u, v)|dudv ≤ 2√
M1h +2µ2√nh
. (A.7)
Now we look at the error bound on AcM2
. Since the Holder condition does not hold here,
we can not control the error in cn on AcM2
. Notice that
V ar[cn(u, v)] =1
nV ar[
1
h2K(
u− U1
h)K(
v − V1
h)]
may be unbounded since c(u, v) is unbounded on AcM2
. However,
V ar[cn(u, v)] ≤1
nh2
1∫
−1
1∫
−1
K2(s)K2(t)c(u+ hs, v + ht)dsdt ≤ 1
nh2M2
KE[cn(u, v)],
36
where MK = max0≤t≤1
K(t).
Let 1{cn(u, v) < 1} be the indicator variable for where cn < 1. Then
Pr[cn(u, v) < 1] = E[1{cn(u, v) < 1}] ≤ V ar[cn(u, v)]
[cn(u, v)− 1]2
by Chebyshev’s inequality.
Let M3 be a constant between 1 and M2, say M3 = (1 +M2)/2 > 1. Then for an point
(u, v) ∈ AcM2
, when h is small enough, the h-square centered at (u, v) are contained in AcM3
.
Hence cn(u, v) =∫∫
K(s)K(t)c(u + hs, v + ht)dsdt ≥ M3. Since the function x/(x − 1)2 is
strictly decreasing on [1,∞), let M4 = M3/(M3 − 1)2, then
E[1{cn(u, v) < 1}] ≤ V ar[cn(u, v)]
[cn(u, v)− 1]2≤ 1
nh2M2
K
cn(u, v)
[cn(u, v)− 1]2≤ 1
nh2M2
KM4.
Hence
|T2(cn)− T2(c)| = |T2(cn)| = E|∫∫
Ac
M2
[1− cn(u, v)]+dudv| ≤∫∫
Ac
M2
E[1{cn(u, v) < 1}]dudv
≤ 1nh2M
2KM4.
(A.8)
Combine (A.7) and (A.8),
|Ccor − Ccor| ≤ 2√M1h+
2µ2√nh
+1
nh2M2
KM4.
This is (21) with M5 = M2KM4.
A4 Ccor Estimation With Kernel Copula Density Es-
timator: Bandwidth Selection And Finite Sample
Correction.
We estimate Ccor using the plug-in estimator (20). For the compact support kernel K(·),we took the constant function on [−1, 1]. That is, K(x) = 1/2 for −1 ≤ x ≤ 1. Hence the
resulting bivariate kernel is simply a square (u± h, v ± h).
We first make a finite-sample correction to Ccor. For any fixed sample size n and fixed
bandwidth h, the estimator Ccor can never reach the value of 1 and 0. This problem
diminishes for large sample size as Ccor converges to the true value by Theorem 3. However,
37
this can be a serious problem for real applications where the sample size is always finite. We
make a linear correction of
Ccor = (Ccor − Cmin)/(Cmax− Cmin). (A.9)
Here Cmax and Cmin are the maximum and minimum possible values of Ccor and are
functions of n and h. Cmax is the Ccor value on perfectly matched U and V : Ui = Vi,
i = 1, ..., n. Cmin is calculated on the most evenly distributed possible case of (Ui, Vi)’s.
That is, for Ui arranged in increasing order, Vi’s are arranged in evenly distributed columns
with the neighboring Vis separated by 2h distance within each column. The reported values
in the numerical studies throughout the paper is for this finite-sample corrected estimator.
We now turn attention to the choice of bandwidth. Theorem 3 suggested the bandwidth
h = b · n−1/4 for a constant b. While asymptotically any b value works, for any finite sample
different b values make a big difference. There have been extensive literature on bandwidth
selection for density estimations. Wand and Jones (1993) and Wand and Jones (1994) pro-
vided plug-in formulas for choosing bandwidth in multivariate density estimation. However,
those formulas can not be directly used here since they are calculated under conditions in-
appropriate for copula density estimation as argued in section 3. They were calculated for
other types of kernels and a Gaussian reference distribution which is not a copula distri-
bution. Also, minimizing estimation error of Ccor is different from minimizing the error
in density function c(u, v). In any case, we first still tried to plug into Ccor the bivariate
density estimation using the function KDE2d() in R with default bandwidth. This is similar
to what is done with MI estimation by Khan et al. (2007) and Reshef et al. (2011). The
resulting estimator Ccor is ok for big sample size, but can be much improved upon for the
mediate sample sizes smaller than thousands.
Therfore, we used an empirical approach to decide on the constant b for bandwidth
selection. For the nine functions listed in Table 2, we calculated the true values of Ccor at
various noise levels. Then we estimated Ccor on generated noisy data sets using different
bandwidth values at sample sizes of n = 102, 103, 104 and 105. The averages of Ccor from
100 randomly generated noisy data sets are compared to the true Ccor values to decide
on an optimal b value. From this simulation, we decided on the bandwidth h = 0.25n−1/4.
38
Figure 7 plots the simulation results using h = 0.25n−1/4. We can see that the performance
of Ccor improves as sample size increases, and gives very accurate estimates for Ccor under
big sample sizes. For illustration, we showed the plots with bandwidth h = 0.1n−1/4 and
h = 0.5n−1/4 in Figure 8 and Figure 9 respectively. Those bandwidth choices are clearly
either too small or too big.
All the reported numerical results in this paper use the plug-in estimator Ccor in equation
(A.9) with a square kernel and bandwidth h = 0.25n−1/4. This choice works well in the
numerical studies. Further investigation of other kernel and bandwidth choices is a future
research topic. Data-based adaptive bandwidth selection (Jones et al., 1996) could also be
investigated.
Another possible future research direction is to consider the Ccor estimator over a range
of varying bandwidths. This idea is motivated by the MIC measure. Although theoretically
not equitable, Reshef et al. (2011) demonstrated some good attributes of MIC under finite
sample. More mathematical investigation of MIC is warranted to understand its behaviour.
Studies by Reshef et al. (2013) indicate that taking the maximum value of the MI statistics
over varying sizes of grids is essential to its stability across different functional relationships
in finite samples. It can be proven that taking maximum of the plug-in Ccor estimator over
a range of varying bandwidths still results in a consistent estimator. It could be interesting
to investigate if such estimators can exhibit some good attributes of MIC in finite sample.
39
0.0 0.5 1.0 1.5 2.0
0.0
0.2
0.4
0.6
0.8
1.0
A: Linear
Noise Level
Cco
r
truen=102
n=103
n=104
n=105
0.0 0.5 1.0 1.5 2.0
0.0
0.2
0.4
0.6
0.8
1.0
B: Quadratic
Noise Level
Cco
r
truen=102
n=103
n=104
n=105
0.0 0.5 1.0 1.5 2.0
0.0
0.2
0.4
0.6
0.8
1.0
C: Square Root
Noise Level
Cco
r
truen=102
n=103
n=104
n=105
0.0 0.5 1.0 1.5 2.0
0.0
0.2
0.4
0.6
0.8
1.0
D: Cubic
Noise Level
Cco
r
truen=102
n=103
n=104
n=105
0.0 0.5 1.0 1.5 2.0
0.0
0.2
0.4
0.6
0.8
1.0
E: Centered Cubic
Noise Level
Cco
r
truen=102
n=103
n=104
n=105
0.0 0.5 1.0 1.5 2.0
0.0
0.2
0.4
0.6
0.8
1.0
F: Centered Quadratic
Noise Level
Cco
r
truen=102
n=103
n=104
n=105
0.0 0.5 1.0 1.5 2.0
0.0
0.2
0.4
0.6
0.8
1.0
G: Cosine
Noise Level
Cco
r
truen=102
n=103
n=104
n=105
0.0 0.5 1.0 1.5 2.0
0.0
0.2
0.4
0.6
0.8
1.0
H: Circle
Noise Level
Cco
r
truen=102
n=103
n=104
n=105
0.0 0.5 1.0 1.5 2.0
0.0
0.2
0.4
0.6
0.8
1.0
I: Cross
Noise Level
Cco
r
truen=102
n=103
n=104
n=105
Figure 7: The comparison of Ccor with its estimated values under different sample sizes.This estimator uses the square kernel density estimator with bandwidth h = 0.25n−1/4.
40
0.0 0.5 1.0 1.5 2.0
0.0
0.2
0.4
0.6
0.8
1.0
A: Linear
Noise Level
Cco
r
truen=102
n=103
n=104
n=105
0.0 0.5 1.0 1.5 2.0
0.0
0.2
0.4
0.6
0.8
1.0
B: Quadratic
Noise Level
Cco
r
truen=102
n=103
n=104
n=105
0.0 0.5 1.0 1.5 2.0
0.0
0.2
0.4
0.6
0.8
1.0
C: Square Root
Noise Level
Cco
r
truen=102
n=103
n=104
n=105
0.0 0.5 1.0 1.5 2.0
0.0
0.2
0.4
0.6
0.8
1.0
D: Cubic
Noise Level
Cco
r
truen=102
n=103
n=104
n=105
0.0 0.5 1.0 1.5 2.0
0.0
0.2
0.4
0.6
0.8
1.0
E: Centered Cubic
Noise Level
Cco
r
truen=102
n=103
n=104
n=105
0.0 0.5 1.0 1.5 2.0
0.0
0.2
0.4
0.6
0.8
1.0
F: Centered Quadratic
Noise Level
Cco
r
truen=102
n=103
n=104
n=105
0.0 0.5 1.0 1.5 2.0
0.0
0.2
0.4
0.6
0.8
1.0
G: Cosine
Noise Level
Cco
r
truen=102
n=103
n=104
n=105
0.0 0.5 1.0 1.5 2.0
0.0
0.2
0.4
0.6
0.8
1.0
H: Circle
Noise Level
Cco
r
truen=102
n=103
n=104
n=105
0.0 0.5 1.0 1.5 2.0
0.0
0.2
0.4
0.6
0.8
1.0
I: Cross
Noise Level
Cco
r
truen=102
n=103
n=104
n=105
Figure 8: The comparison of Ccor with its estimated values under different sample sizes.This estimator uses the square kernel density estimator with bandwidth h = 0.1n−1/4.
41
0.0 0.5 1.0 1.5 2.0
0.0
0.2
0.4
0.6
0.8
1.0
A: Linear
Noise Level
Cco
r
truen=102
n=103
n=104
n=105
0.0 0.5 1.0 1.5 2.0
0.0
0.2
0.4
0.6
0.8
1.0
B: Quadratic
Noise Level
Cco
r
truen=102
n=103
n=104
n=105
0.0 0.5 1.0 1.5 2.0
0.0
0.2
0.4
0.6
0.8
1.0
C: Square Root
Noise Level
Cco
r
truen=102
n=103
n=104
n=105
0.0 0.5 1.0 1.5 2.0
0.0
0.2
0.4
0.6
0.8
1.0
D: Cubic
Noise Level
Cco
r
truen=102
n=103
n=104
n=105
0.0 0.5 1.0 1.5 2.0
0.0
0.2
0.4
0.6
0.8
1.0
E: Centered Cubic
Noise Level
Cco
r
truen=102
n=103
n=104
n=105
0.0 0.5 1.0 1.5 2.0
0.0
0.2
0.4
0.6
0.8
1.0
F: Centered Quadratic
Noise Level
Cco
r
truen=102
n=103
n=104
n=105
0.0 0.5 1.0 1.5 2.0
0.0
0.2
0.4
0.6
0.8
1.0
G: Cosine
Noise Level
Cco
r
truen=102
n=103
n=104
n=105
0.0 0.5 1.0 1.5 2.0
0.0
0.2
0.4
0.6
0.8
1.0
H: Circle
Noise Level
Cco
r
truen=102
n=103
n=104
n=105
0.0 0.5 1.0 1.5 2.0
0.0
0.2
0.4
0.6
0.8
1.0
I: Cross
Noise Level
Cco
r
truen=102
n=103
n=104
n=105
Figure 9: The comparison of Ccor with its estimated values under different sample sizes.This estimator uses the square kernel density estimator with bandwidth h = 0.5n−1/4.
42