Estimating Feature-Label Dependence Using Gini Distance Statistics

Silu Zhang, Xin Dang, Member, IEEE, Dao Nguyen, Dawn Wilkins, and Yixin Chen, Member, IEEE

Abstract—Identifying statistical dependence between the features and the label is a fundamental problem in supervised learning. This paper presents a framework for estimating dependence between numerical features and a categorical label using generalized Gini distance, an energy distance in reproducing kernel Hilbert spaces (RKHS). Two Gini distance based dependence measures are explored: Gini distance covariance and Gini distance correlation. Unlike Pearson covariance and correlation, which do not characterize independence, the above Gini distance based measures define dependence as well as independence of random variables. The test statistics are simple to calculate and do not require probability density estimation. Uniform convergence bounds and asymptotic bounds are derived for the test statistics. Comparisons with distance covariance statistics are provided. It is shown that Gini distance statistics converge faster than distance covariance statistics in the uniform convergence bounds, hence giving tighter upper bounds on both Type I and Type II errors. Moreover, the probability of the Gini distance covariance statistic under-performing the distance covariance statistic in Type II error decreases to 0 exponentially with the increase of the sample size. Extensive experimental results are presented to demonstrate the performance of the proposed method.

Index Terms—Energy distance, feature selection, Gini distance covariance, Gini distance correlation, distance covariance, reproducing kernel Hilbert space, dependence test, supervised learning

1 INTRODUCTION

Building a prediction model from observations of features and responses (or labels) is a well-studied problem in machine learning and statistics. The problem becomes particularly challenging in a high dimensional feature space. A common practice in tackling this challenge is to reduce the number of features under consideration, which is in general achieved via feature combination or feature selection.

Feature combination refers to combining high dimensional inputs into a smaller set of features via a linear or nonlinear transformation, e.g., principal component analysis (PCA) [35], independent component analysis (ICA) [16], curvilinear components analysis [21], multidimensional scaling (MDS) [81], nonnegative matrix factorization (NMF) [47], Isomap [80], locally linear embedding (LLE) [63], Laplacian eigenmaps [6], stochastic neighbor embedding (SNE) [33], etc. Feature selection, also known as variable selection, aims at choosing a subset of features that is "relevant" to the response variable [9], [40], [41]. In terms of interpretability, feature selection is more appealing than feature combination because it preserves the physical meaning of the original features.

Feature selection under the supervised setting can further be broadly categorized into filter models, wrapper models, and embedded models. Filter models separate the feature selection task from the classification task to avoid increasing learning bias. A common approach is to use correlation to measure feature importance. A wrapper model aims to select a feature subset that achieves optimal classification performance for a predetermined classifier. An embedded model is one that achieves feature selection during the learning process, i.e., feature selection and the training of the classifier are performed simultaneously. For datasets with limited sample size and ultrahigh dimension, both wrapper models and embedded models suffer from over-fitting, whereas filter models are more applicable. In this paper, we present a filter-based feature selection method using new dependence measures: generalized Gini distance covariance and correlation. Unlike the commonly used Pearson correlation, which is only sensitive to linear dependence and does not characterize independence, our method also characterizes independence. Gini distance statistics measure the dependence between a continuous random variable/vector and a categorical response, and are well suited for feature selection in classification tasks. They also have nice interpretations: Gini distance covariance is a measure of between-group variation and Gini distance correlation is the ratio of between-group variation and the total variation. The proposed statistics are closely related to distance covariance and correlation, which measure the dependence between two continuous random variables/vectors. Theoretical results show that Gini distance statistics are likely to perform better in terms of Type II error.

Next, we review the work most related to ours. For a more comprehensive survey of this subject, the reader is referred to [30], [31], [50], [90].

- S. Zhang is with the Department of Diagnostic Imaging, St. Jude Children's Research Hospital, 262 Danny Thomas Place, Memphis, TN 38105. E-mail: [email protected].
- X. Dang and D. Nguyen are with the Department of Mathematics, University of Mississippi, University, MS 38677. E-mail: {xdang, dxnguyen}@olemiss.edu.
- D. Wilkins and Y. Chen are with the Department of Computer and Information Science, University of Mississippi, University, MS 38677. E-mail: {dwilkins, yixin}@olemiss.edu.

Manuscript received 20 Feb. 2019; revised 14 Oct. 2019; accepted 30 Nov. 2019. (Corresponding author: Silu Zhang.) Recommended for acceptance by P. Ravikumar. Digital Object Identifier no. 10.1109/TPAMI.2019.2960358.



1.1 Related Work

With a common goal of improving the generalization performance of the prediction model and providing a better interpretation of the underlying process, all feature selection methods are built around the concepts of feature relevance and feature redundancy.

1.1.1 Feature Relevance

The concept of relevance has been studied in many fields outside machine learning and statistics [34]. In the context of feature selection, John et al. [40] defined feature relevance via a probabilistic interpretation where a feature and the response variable are irrelevant if and only if they are conditionally independent given any subset of features. Following this definition, Nilsson et al. [57] investigated distributions under which an optimal classifier can be trained over a minimal number of features. Although the above definition of relevance characterizes statistical dependence, testing conditional dependence is in general a challenge for continuous random variables.

A significant amount of effort has been devoted to finding a good trade-off between theoretical rigor and practical feasibility in defining dependence measures. A summary of related work on defining feature relevance is shown in Table 1. Pearson correlation [60] based methods [24], [25], [72], [88] are among the most popular approaches. Pearson correlation and its variations are in general straightforward to implement, but are sensitive only to linear dependence between two variables. Specifically, Pearson correlation can be zero for dependent random variables. To address nonlinear dependence, many researchers tackled it via linearization [1], [71], [73], [92].

Other correlation measures, which treat linear and nonlinear dependence under the same framework, have been developed to address the limitation of Pearson correlation based methods. Among these, mutual information (divergence) based approaches have been investigated extensively [5], [38], [39], [52], [56], [58], [67], [83], [96]. As mutual information is hard to evaluate, several approximations have been suggested [15], [22], [44], [45], [48], [61].

Mutual information relies on the estimation of the probability density functions, which is especially challenging when the sample size is small, e.g., in the medical domain. This motivated the development of model-free approaches [18], [49], [76], [77]. Cui et al. [18] defined a new index using the mean variance (MV) of the conditional distribution function of a feature given the class variable. It considers the ranking of the samples in a dependence measure, hence is a robust method for heavy-tailed datasets. The distance covariance and correlation proposed by Székely et al. [76], [77] measure the dependence between two numerical random variables of arbitrary dimension. Our approach fits into the model-free category and is closely related to distance covariance and correlation, but aims to measure the dependence between a numerical random variable and a categorical random variable.

1.1.2 Feature Redundancy

Although one may argue that all features dependent on the response variable are informative, redundant features unnecessarily increase the dimensionality of the learning problem, hence may reduce the generalization performance [93]. Eliminating feature redundancy is, therefore, an essential step in feature selection [61].

Several methods were proposed to reduce redundancy explicitly via a feature dependence measure [7], [10], [55], [82], [87], [89]. There are also many methods that formulate feature selection as an optimization problem where redundancy reduction is implicitly achieved via optimizing an objective function, for example, [13], [17], [32], [43], [66], [91], [95]. Particularly, class separation has been widely used as an objective function in redundancy reduction [11], [14], [86], [97]. Many researchers investigated optimal feature subset selection under various optimization formulations, such as using a special class of monotonic feature selection criterion functions [70], or incorporating a regularization term to control the sparsity of the solution [3], [19], [53], [62], [84], [85], [94].

1.2 An Overview of the Proposed Approach

For problems of large scale (large sample size and/or high feature dimension), feature selection is commonly performed in two steps. A subset of candidate features is first identified via a screening [24] (or a filtering [31]) process based upon a predefined "importance" measure that can be calculated efficiently. The final collection of features is then chosen from the candidate set by solving an optimization problem. Usually, this second step is computationally more expensive than the first step. Hence for problems with very high feature dimension, identifying a subset of "good" candidate features, thus reducing the computational cost of the subsequent optimization algorithm, is essential.

The work presented in this article is a model-free approach that aims at improving the feature screening process via a new dependence measure. Székely et al. [76], [77] introduced distance covariance and distance correlation, which extended the classical bivariate product-moment covariance and correlation to random vectors of arbitrary dimension. Distance covariance (and distance correlation) characterizes independence: it is zero if and only if the two random vectors are independent.

TABLE 1
Summary of Related Work on Feature Relevance

Category: Representatives
Pearson correlation (linear model) based: Stoppiglia et al. [72], Wei and Billings [88], Fan and Lv [24], Fan et al. [25]
Linearization based: Song et al. [71], Sun et al. [73], Armanfard et al. [1], Yao et al. [92]
Mutual information (divergence) based: Iannarilli Jr. and Rubin [38], Novovicová et al. [58], Javed et al. [39], Wang et al. [83], Zhai et al. [96], Maji and Pal [52], Sindhwani et al. [67], Naghibi et al. [56]
Mutual information approximation based: Kwak and Choi [44], [45], Peng et al. [61], Lefakis and Fleuret [48], Ding et al. [22]
Model-free: Székely et al. [76], [77], Li et al. [49], Cui et al. [18]


Moreover, the corresponding statistics are simple to calculate and do not require estimating the distribution function of the random vectors. These properties make distance covariance and distance correlation particularly appealing to the dependence test, which is a crucial component in feature selection [8], [49].

Although distance covariance and distance correlation can be extended to handle categorical variables using a metric space embedding [51], Gini distance covariance and Gini distance correlation [20] provide a natural alternative for measuring dependence between a numerical random vector and a categorical random variable. In this article, we investigate selecting informative features for supervised learning problems with numerical features and a categorical response variable using generalized Gini distance covariance and Gini distance correlation. The contributions of this paper are given as follows:

- Generalized Gini Distance Covariance and Gini Distance Correlation. We extend Gini distance covariance and Gini distance correlation to RKHS via positive definite kernels. The choice of kernel not only brings flexibility to the dependence tests, but also makes it easier to derive theoretical performance bounds on the tests.
- Simple Dependence Tests. Gini distance statistics are simple to calculate. We prove that when there is dependence between the feature vector and the response variable, the probability of the Gini distance covariance statistic under-performing the distance covariance statistic approaches 0 with the growth of the sample size.
- Uniform Convergence Bounds and Asymptotic Analysis. Under the bounded kernel assumption, we derive uniform convergence bounds for both Type I and Type II errors. Compared with distance covariance and distance correlation statistics, the bounds for Gini distance statistics are tighter. Asymptotic analysis is also presented.

1.3 Outline of the Paper

The remainder of the paper is organized as follows: Section 2 motivates Gini distance covariance and Gini distance correlation from energy distance. We then extend them to RKHS and present a connection between generalized Gini distance covariance and generalized distance covariance. Section 3 provides estimators of Gini distance covariance and Gini distance correlation. Dependence tests are developed using these estimators. We derive uniform convergence bounds for both Type I and Type II errors of the dependence tests. In Section 3.3 we present connections with dependence tests using distance covariance. A connection to maximum mean discrepancy (MMD) [29] is shown in Section 3.4. Asymptotic results are given in Section 3.5. We discuss several algorithmic issues in Section 4. In Section 5, we explain the extensive experimental studies conducted and demonstrate the results. We conclude and discuss the strengths and limitations of the proposed method in Section 6.

2 GINI DISTANCE COVARIANCE AND CORRELATION

In this section, we first present a brief review of the energy distance. As an instance of the energy distance, Gini distance covariance is introduced to measure dependence between numerical and categorical random variables. Gini distance covariance and correlation are then generalized to reproducing kernel Hilbert spaces (RKHS) to facilitate convergence analysis in Section 3. Connections with distance covariance are also discussed.

2.1 Energy Distance

Energy distance was first introduced in [4], [74], [75] as a measure of statistical distance between two probability distributions with finite first order moments. The energy distance between the q-dimensional independent random variables X and Y is defined as [78]

\mathcal{E}(X, Y) = 2E|X - Y|_q - E|X - X'|_q - E|Y - Y'|_q,   (1)

where |·|_q is the Euclidean norm in R^q, E|X|_q + E|Y|_q < ∞, X' is an iid copy of X, and Y' is an iid copy of Y.

Energy distance has many interesting properties. It is scale equivariant: for any a ∈ R,

\mathcal{E}(aX, aY) = |a|\,\mathcal{E}(X, Y).

It is rotation invariant: for any rotation matrix R ∈ R^{q×q},

\mathcal{E}(RX, RY) = \mathcal{E}(X, Y).

Test statistics of an energy distance are in general relatively simple to calculate and do not require density estimation (Section 3). Most importantly, as shown in [75], if φ_X and φ_Y are the characteristic functions of X and Y, respectively, the energy distance (1) can be equivalently written as

\mathcal{E}(X, Y) = c(q) \int_{\mathbb{R}^q} \frac{[\varphi_X(x) - \varphi_Y(x)]^2}{|x|_q^{q+1}} \, dx,   (2)

where c(q) > 0 is a constant only depending on q. Thus \mathcal{E} \geq 0, with equality to zero if and only if X and Y are identically distributed. The above properties make energy distance especially appealing to testing identical distributions (or dependence).
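As an aside, the simplicity of the corresponding statistic can be seen from a minimal plug-in (V-statistic) estimator of the energy distance (1) between two samples; the code below is a sketch under that assumption, with function names of our own choosing, not the authors' implementation.

```python
import numpy as np

def pairwise_mean_distance(a, b):
    """Mean Euclidean distance over all pairs (one point from a, one from b)."""
    diff = a[:, None, :] - b[None, :, :]
    return np.linalg.norm(diff, axis=2).mean()

def energy_distance(x, y):
    """Plug-in estimate of E(X, Y) = 2E|X - Y| - E|X - X'| - E|Y - Y'|."""
    return (2.0 * pairwise_mean_distance(x, y)
            - pairwise_mean_distance(x, x)
            - pairwise_mean_distance(y, y))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(0.0, 1.0, size=(200, 2))
    y = rng.normal(1.0, 1.0, size=(200, 2))   # shifted mean: clearly nonzero energy distance
    z = rng.normal(0.0, 1.0, size=(200, 2))   # same distribution as x: close to zero
    print(energy_distance(x, y), energy_distance(x, z))
```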

2.2 Gini Distance Covariance and Gini Distance Correlation

Gini distance covariance was proposed in [20] to measure dependence between a numerical random variable X ∈ R^q from a distribution with cumulative distribution function (CDF) F and a categorical variable Y with K values L_1, ..., L_K. If we assume the categorical distribution P_Y of Y is Pr(Y = L_k) = p_k and the conditional distribution of X given Y = L_k is F_k, the marginal distribution of X is

F(x) = \sum_{k=1}^{K} p_k F_k(x).

When the conditional distribution of X given Y is the same as the marginal distribution of X, X and Y are independent, i.e., there is no correlation between them. However, when they are dependent, i.e., F ≠ F_k for some k, the dependence can be measured through the difference between the marginal distribution F and the conditional distribution F_k.


This difference is measured by the Gini distance covariance, gCov(X, Y), which is defined as the expected weighted L2 distance between characteristic functions of the conditional and marginal distributions (if the expectation is finite):

gCov(X, Y) := c(q) \sum_{k=1}^{K} p_k \int_{\mathbb{R}^q} \frac{[\varphi_k(x) - \varphi(x)]^2}{|x|_q^{q+1}} \, dx,

where c(q) is the same constant as in (2), and φ_k and φ are the characteristic functions of the conditional distribution F_k and the marginal distribution F, respectively. It follows immediately that gCov(X, Y) = 0 if and only if X and Y are independent. Based on (1) and (2), the Gini distance covariance is clearly a weighted energy distance, hence can be equivalently defined as

gCov(X, Y) = \sum_{k=1}^{K} p_k \left[ 2E|X_k - X|_q - E|X_k - X_k'|_q - E|X - X'|_q \right],   (3)

where (X_k, X_k') and (X, X') are independent pair variables from F_k and F, respectively.

Gini distance covariance can be standardized to have a range of [0, 1], a desired property for a correlation measure. The resulting measure is called Gini distance correlation, denoted by gCor(X, Y), which is defined as

gCor(X, Y) = \frac{\sum_{k=1}^{K} p_k \left[ 2E|X_k - X|_q - E|X_k - X_k'|_q - E|X - X'|_q \right]}{E|X - X'|_q},   (4)

provided that E|X|_q + E|X_k|_q < ∞ and F is not a degenerate distribution. Gini distance correlation satisfies the following properties [20].

1) 0 ≤ gCor(X, Y) ≤ 1.
2) gCor(X, Y) = 0 if and only if X and Y are independent.
3) gCor(X, Y) = 1 if and only if F_k is a single point mass distribution almost surely for all k = 1, ..., K.
4) gCor(aRX + b, Y) = gCor(X, Y) for all a ≠ 0, b ∈ R^q, and any orthonormal matrix R ∈ R^{q×q}.
Property 2 is especially useful in testing dependence.

2.3 Gini Distance Statistics in RKHS

Energy distance based statistics naturally generalize from a Euclidean space to metric spaces [51]. By using a positive definite kernel (Mercer kernel) [54], distributions are mapped into a RKHS [69] with a kernel induced distance. Hence one can extend energy distances to a much richer family of statistics defined in RKHS [64]. Let M : R^q × R^q → R be a Mercer kernel [54]. There is an associated RKHS H_M of real functions on R^q with reproducing kernel M, where the function d_M : R^q × R^q → R defines a distance in H_M,

d_M(x, x') = \sqrt{M(x, x) + M(x', x') - 2M(x, x')}.   (5)

Hence Gini distance covariance and Gini distance correlation are generalized to RKHS, H_M, as

gCov_M(X, Y) = \sum_{k=1}^{K} p_k \left[ 2E d_M(X_k, X) - E d_M(X_k, X_k') - E d_M(X, X') \right],   (6)

gCor_M(X, Y) = \frac{\sum_{k=1}^{K} p_k \left[ 2E d_M(X_k, X) - E d_M(X_k, X_k') - E d_M(X, X') \right]}{E d_M(X, X')}.   (7)

The choice of kernels allows one to design various tests. In this paper, we focus on bounded translation and rotation invariant kernels. Our choice is based on the following considerations:

1) The boundedness of a positive definite kernel implies the boundedness of the distance in RKHS, which makes it easier to derive strong (exponential) convergence inequalities based on bounded deviations (discussed in Section 3);
2) Translation and rotation invariance is an important property to have for testing of dependence.

As in R^q, Gini distance covariance and Gini distance correlation in RKHS also characterize independence, i.e., gCov_M(X, Y) = 0 and gCor_M(X, Y) = 0 if and only if X and Y are independent. This is derived as follows from the connection between Gini distance covariance and distance covariance in RKHS. Distance covariance was introduced in [76] as a dependence measure between random variables X ∈ R^p and Y ∈ R^q. If X and Y are embedded into RKHSs induced by M_X and M_Y, respectively, the generalized distance covariance of X and Y is [64]:

dCov_{M_X, M_Y}(X, Y) = E\left[d_{M_X}(X, X')\, d_{M_Y}(Y, Y')\right] + E\left[d_{M_X}(X, X')\right] E\left[d_{M_Y}(Y, Y')\right] - 2E\left[ E_{X'} d_{M_X}(X, X')\, E_{Y'} d_{M_Y}(Y, Y') \right].   (8)

In the case of Y being categorical, one may embed it using a set difference kernel M_Y,

M_Y(y, y') = \begin{cases} \tfrac{1}{2} & \text{if } y = y', \\ 0 & \text{otherwise}. \end{cases}   (9)

This is equivalent to embedding Y as a simplex with edges of unit length [51], i.e., L_k is represented by a K-dimensional vector of all zeros except its kth dimension, which has the value √2/2. The distance induced by M_Y is called the set distance, i.e., d_{M_Y}(y, y') = 0 if y = y' and 1 otherwise. Using the set distance, we have the following result on the generalized distance covariance between a numerical and a categorical random variable.

Lemma 1. Suppose that X ∈ R^q is from distribution F and Y is a categorical variable with K values L_1, ..., L_K. The categorical distribution P_Y of Y is P(Y = L_k) = p_k, the conditional distribution of X given Y = L_k is F_k, and the marginal distribution of X is F(x) = \sum_{k=1}^{K} p_k F_k(x). Let M_X : R^q × R^q → R be a


Mercer kernel and M_Y a set difference kernel. The generalized distance covariance dCov_{M_X, M_Y}(X, Y) is equivalent to

dCov_{M_X, M_Y}(X, Y) := dCov_{M_X}(X, Y) = \sum_{k=1}^{K} p_k^2 \left[ 2E d_{M_X}(X_k, X) - E d_{M_X}(X_k, X_k') - E d_{M_X}(X, X') \right].   (10)

From (6) and (10), it is clear that the generalized Gini covariance is always larger than or equal to the generalized distance covariance under the set difference kernel and the same M_X (Footnote 1), i.e.,

gCov_{M_X}(X, Y) \geq dCov_{M_X}(X, Y),   (11)

where they are equal if and only if both are 0, i.e., X and Y are independent. This yields the following theorem. The proof of Lemma 1 is given in Appendix A, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/TPAMI.2019.2960358.

Theorem 2. For any bounded Mercer kernel M : R^q × R^q → R, gCov_M(X, Y) = 0 if and only if X and Y are independent. The same result holds for gCor_M(X, Y), assuming that the marginal distribution of X is not degenerate.

Proof. The proof of the sufficient part for gCov_M(X, Y) is immediate from the definition (6). The inequality (11) suggests that dCov_M = 0 when gCov_M = 0. Hence the proof of the necessary part is complete if we show that dCov_M = 0 implies independence. This is proven as follows.

Let \mathcal{X} and \mathcal{Y} be the RKHSs induced by M and the set difference kernel (9), respectively, with the associated distance metrics defined according to (5). \mathcal{X} and \mathcal{Y} are both separable Hilbert spaces [2], [12] as they each have a countable orthonormal basis [54]. Hence \mathcal{X} and \mathcal{Y} are of strong negative type (Theorem 3.16 in [51]). Because the metrics on \mathcal{X} and \mathcal{Y} are bounded, the marginals of (X, Y) on \mathcal{X} × \mathcal{Y} have finite first moment in the sense defined in [51]. Therefore, dCov_M(X, Y) = 0 implies that X and Y are independent (Theorem 3.11 [51]).

Finally, the proof for gCor_M(X, Y) follows from the above and the condition that X is not degenerate. □

In the remainder of the paper, unless noted otherwise, we use the default distance function (Footnote 2)

d_M(x, x') = \sqrt{1 - e^{-\frac{|x - x'|_q^2}{\sigma^2}}},

induced by a weighted Gaussian kernel, M(x, x') = \frac{1}{2} e^{-\frac{|x - x'|_q^2}{\sigma^2}}.

It is immediate that the above distance function is translation and rotation invariant and is bounded with the range [0, 1). Moreover, using a Taylor expansion, it is not difficult to show that gCor_M approaches gCor when the kernel parameter σ approaches ∞.
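The kernel-induced distance (5) with the weighted Gaussian kernel above is straightforward to code. The following is a minimal sketch; the vectorized form and the function names are ours, not the authors' implementation.

```python
import numpy as np

def weighted_gaussian_kernel(x, y, sigma2):
    """M(x, x') = 0.5 * exp(-|x - x'|^2 / sigma^2) for all pairs of rows of x and y."""
    sq_dists = np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=2)
    return 0.5 * np.exp(-sq_dists / sigma2)

def kernel_induced_distance(x, y, sigma2):
    """d_M(x, x') = sqrt(M(x,x) + M(x',x') - 2 M(x,x')) = sqrt(1 - exp(-|x-x'|^2 / sigma^2))."""
    sq_dists = np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=2)
    return np.sqrt(1.0 - np.exp(-sq_dists / sigma2))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a = rng.normal(size=(5, 3))
    b = rng.normal(size=(4, 3))
    # the distance is bounded in [0, 1) and equals zero only when the points coincide
    print(kernel_induced_distance(a, b, sigma2=10.0))
```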

3 DEPENDENCE TESTS

We first present an unbiased estimator of the generalized Gini distance covariance. Probabilistic bounds for large deviations of the empirical generalized Gini distance covariance are then derived. These bounds lead directly to two dependence tests. We also provide discussions on connections with the dependence test using generalized distance covariance and the connection with maximum mean discrepancy (MMD) [29]. Finally, asymptotic analysis of the test statistics is presented.

3.1 Estimation

In Section 2.2, Gini distance covariance and Gini distance correlation were introduced from an energy distance point of view. An alternative interpretation based on the Gini mean difference was given in [20]. This definition yields simple point estimators.

Let X ∈ R^q be a random variable from distribution F. Let Y ∈ \mathcal{Y} = {L_1, ..., L_K} be a categorical random variable with K values and Pr(Y = L_k) = p_k ∈ (0, 1). The conditional distribution of X given Y = L_k is F_k. Let (X, X') and (X_k, X_k') be independent pair variables from F and F_k, respectively. The Gini distance covariance (3) and Gini distance correlation (4) can be equivalently written as

gCov(X, Y) = \Delta - \sum_{k=1}^{K} p_k \Delta_k,   (12)

gCor(X, Y) = \frac{\Delta - \sum_{k=1}^{K} p_k \Delta_k}{\Delta},   (13)

where Δ = E|X − X'|_q and Δ_k = E|X_k − X_k'|_q are the Gini mean differences (GMD) of F and F_k in R^q [26], [27], [42], respectively. This suggests that Gini distance covariance is a measure of between-group variation and Gini distance correlation is the ratio of between-group variation and the total Gini variation. Replacing |·|_q with d_M(·, ·) in (12) and (13) yields the GMD version of (6) and (7).

Given an iid sample D = {(x_i, y_i) ∈ R^q × \mathcal{Y} : i = 1, ..., n}, let I_k be the index set of sample points with y_i = L_k. The probability p_k is estimated by the sample proportion of category k, i.e., \hat{p}_k = n_k / n where n_k = |I_k| > 2. The point estimators of the generalized Gini distance covariance and Gini distance correlation for a given kernel M are

gCov_M^n := \hat{\Delta} - \sum_{k=1}^{K} \hat{p}_k \hat{\Delta}_k,   (14)

gCor_M^n := \frac{\hat{\Delta} - \sum_{k=1}^{K} \hat{p}_k \hat{\Delta}_k}{\hat{\Delta}},   (15)

where

\hat{\Delta}_k = \binom{n_k}{2}^{-1} \sum_{i < j \in I_k} d_M(x_i, x_j),   (16)

\hat{\Delta} = \binom{n}{2}^{-1} \sum_{i < j} d_M(x_i, x_j).   (17)

Footnote 1. The inequality holds for Gini covariance and distance covariance as well, i.e., gCov(X, Y) ≥ dCov(X, Y) where X ∈ R^d and Y is categorical. The notations gCov_{M_X}(X, Y) and gCov_M(X, Y) are used interchangeably, with both M_X and M representing a Mercer kernel.

Footnote 2. Since any bounded translation and rotation invariant kernel can be normalized to define a distance function with maximum value no greater than 1, the results in Sections 2.3 and 3 hold for these kernels as well.


Theorem 3. The point estimator (14) of the generalized Gini distance covariance is unbiased.

Proof. Clearly, \hat{\Delta}_k and \hat{\Delta} are unbiased because they are U-statistics of size 2. Also \hat{p}_k \hat{\Delta}_k is unbiased since E[\hat{p}_k \hat{\Delta}_k] = E\,E[\hat{p}_k \hat{\Delta}_k \mid n_k] = E[\frac{n_k}{n} \Delta_k] = p_k \Delta_k. This leads to the unbiasedness of gCov_M^n. □
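A minimal sketch of the point estimators (14)-(17) is given below, assuming the weighted Gaussian kernel distance from Section 2.3; variable and function names are ours and the code is an illustration, not the authors' implementation.

```python
import numpy as np

def dist_matrix(x, sigma2=10.0):
    """Kernel-induced distance d_M(x_i, x_j) = sqrt(1 - exp(-|x_i - x_j|^2 / sigma^2))."""
    sq = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=2)
    return np.sqrt(1.0 - np.exp(-sq / sigma2))

def mean_pairwise(d):
    """Average of d over all unordered pairs i < j (a U-statistic of size 2)."""
    iu = np.triu_indices(d.shape[0], k=1)
    return d[iu].mean()

def gini_cov_cor(x, y, sigma2=10.0):
    """Point estimators gCov_M^n (14) and gCor_M^n (15)."""
    d = dist_matrix(x, sigma2)
    n = len(y)
    delta_hat = mean_pairwise(d)                      # (17)
    cov = delta_hat
    for label in np.unique(y):
        idx = np.where(y == label)[0]
        p_hat = len(idx) / n
        delta_k = mean_pairwise(d[np.ix_(idx, idx)])  # (16)
        cov -= p_hat * delta_k                        # (14)
    return cov, cov / delta_hat                       # (14), (15)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    y = rng.integers(0, 3, size=300)
    x = rng.normal(size=(300, 2)) + y[:, None]        # class-dependent mean shift
    print(gini_cov_cor(x, y))
```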

3.2 Uniform Convergence Bounds

We derive two probabilistic inequalities, from which dependence tests using the point estimators (14) and (15) are established.

Theorem 4. Let D = {(x_i, y_i) ∈ R^q × \mathcal{Y} : i = 1, ..., n} be an iid sample of (X, Y) and M a Mercer kernel over R^q × R^q that induces a distance function d_M(·, ·) with bounded range [0, 1). For every ε > 0,

\Pr\left[ gCov_M^n - gCov_M(X, Y) \geq \epsilon \right] \leq \exp\left(-\frac{n\epsilon^2}{12.5}\right), and

\Pr\left[ gCov_M(X, Y) - gCov_M^n \geq \epsilon \right] \leq \exp\left(-\frac{n\epsilon^2}{12.5}\right).

Theorem 5. Under the conditions of Theorem 4, for every ε > 0,

\Pr\left[ \hat{\Delta} - \Delta \geq \epsilon \right] \leq \exp\left(-\frac{n\epsilon^2}{2}\right), and

\Pr\left[ \Delta - \hat{\Delta} \geq \epsilon \right] \leq \exp\left(-\frac{n\epsilon^2}{2}\right).

Proofs of Theorem 4 and Theorem 5 are given in Appendix B, available in the online supplemental material. Next we consider a dependence test based on gCov_M^n. Theorem 2 shows that gCov_M(X, Y) = 0 if and only if X and Y are independent. This suggests the following null and alternative hypotheses:

H_0: gCov_M(X, Y) = 0,
H_1: gCov_M(X, Y) \geq 2cn^{-\tau}, \quad c > 0 \text{ and } \tau > 0.

The null hypothesis is rejected when gCov_M^n ≥ cn^{-τ}, where c > 0 and τ ∈ (0, 1/2]. Next we establish upper bounds for the Type I and Type II errors of the above dependence test.

Corollary 6. Under the conditions of Theorem 4, the following inequalities hold for any c > 0 and τ ∈ (0, 1/2]:

Type I: \Pr\left[ gCov_M^n \geq cn^{-\tau} \mid H_0 \right] \leq \exp\left(-\frac{c^2 n^{1-2\tau}}{12.5}\right),   (18)

Type II: \Pr\left[ gCov_M^n \leq cn^{-\tau} \mid H_1 \right] \leq \exp\left(-\frac{c^2 n^{1-2\tau}}{12.5}\right).   (19)

Proof. Let ε = cn^{-τ}. The Type I bound is immediate from Theorem 4. The Type II bound is derived from the following inequality and Theorem 4:

\Pr\left[ gCov_M^n \leq cn^{-\tau} \mid H_1 \right] \leq \Pr\left[ cn^{-\tau} - gCov_M^n + gCov_M(X, Y) - 2cn^{-\tau} \geq 0 \mid H_1 \right] = \Pr\left[ gCov_M(X, Y) - gCov_M^n \geq cn^{-\tau} \mid H_1 \right]. □
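To make the bounds (18) and (19) concrete, the short computation below evaluates the common right-hand side exp(-c^2 n^{1-2τ}/12.5) for a few entirely hypothetical choices of c, τ, and n; the values are ours, not from the paper.

```python
import math

def type_error_bound(n, c, tau):
    """Right-hand side of (18) and (19): exp(-c^2 * n^(1 - 2*tau) / 12.5)."""
    return math.exp(-c**2 * n**(1.0 - 2.0 * tau) / 12.5)

if __name__ == "__main__":
    for n in (100, 1000, 10000):
        # hypothetical test parameters: threshold c * n^(-tau) with c = 1, tau = 0.25
        print(n, type_error_bound(n, c=1.0, tau=0.25))
```

As expected, the bound decays with the sample size because the exponent grows like n^{1-2τ} for τ ∈ (0, 1/2).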

A dependence test can also be performed using the empirical Gini distance correlation under the above null and alternative hypotheses, with gCor_M replacing gCov_M. The null hypothesis is rejected when gCor_M^n ≥ cn^{-τ}, where c > 0 and τ ∈ (0, 1/4]. Type I and Type II bounds are presented below.

Corollary 7. Under the conditions of Theorem 4 and Theorem 5, where additionally Δ ≥ 2n^{-τ}, the following inequalities hold for any c > 0 and τ ∈ (0, 1/4]:

Type I: \Pr\left[ gCor_M^n \geq cn^{-\tau} \mid H_0 \right] \leq \exp\left(-\frac{c^2 n^{1-4\tau}}{12.5}\right) + \exp\left(-\frac{n^{1-2\tau}}{2}\right),   (20)

Type II: \Pr\left[ gCor_M^n \leq cn^{-\tau} \mid H_1 \right] \leq \exp\left(-\frac{c^2 n^{1-2\tau}}{12.5}\right).   (21)

Proof. From (15), we have

\Pr\left[ gCor_M^n \geq cn^{-\tau} \mid H_0 \right] \leq \Pr\left[ gCov_M^n \geq cn^{-2\tau} \text{ OR } \hat{\Delta} \leq n^{-\tau} \mid H_0 \right] \leq \Pr\left[ gCov_M^n \geq cn^{-2\tau} \mid H_0 \right] + \Pr\left[ \hat{\Delta} \leq n^{-\tau} \mid H_0 \right] \leq \Pr\left[ gCov_M^n \geq cn^{-2\tau} \mid H_0 \right] + \Pr\left[ \Delta - \hat{\Delta} \geq n^{-\tau} \mid H_0 \right].

Let ε_1 = cn^{-2τ} and ε_2 = n^{-τ}. The Type I bound is derived from Theorem 4 and Theorem 5. The boundedness of d_M(·, ·) implies that Δ < 1. Therefore,

\Pr\left[ gCor_M^n \leq cn^{-\tau} \mid H_1 \right] \leq \Pr\left[ gCov_M^n \leq cn^{-\tau} \mid H_1 \right] \leq \Pr\left[ gCov_M(X, Y) - gCov_M^n \geq cn^{-\tau} \mid H_1 \right].

Hence the Type II bound is given by Theorem 4 with ε = cn^{-τ}. □

3.3 Connections to Generalized Distance Covariance

In Section 2.3, generalized Gini distance covariance is related to generalized distance covariance through (11). Under the conditions of Lemma 1, dCov_{M_X, M_Y}(X, Y) = 0 if and only if X and Y are independent. Hence dependence tests similar to those in Section 3.2 can be developed using empirical estimates of dCov_{M_X, M_Y}(X, Y). Next, we establish a result similar to Theorem 4 for generalized distance covariance. We demonstrate that generalized Gini distance covariance has a tighter probabilistic bound for large deviations than its generalized distance covariance counterpart.

Using the unbiased estimator for distance covariance developed in [79], we generalize it to an unbiased estimator for dCov_{M_X, M_Y}(X, Y) defined in (8). Let D = {(x_i, y_i) ∈ R^q × R^p : i = 1, ..., n} be an iid sample from the joint distribution of X and Y. Let A = (A_{ij}) be a symmetric, n × n, centered kernel distance matrix of the sample x_1, ..., x_n. The (i, j)th entry of A is

ðX;Y Þ defined in (8). Let D ¼ ðxi; yiÞ 2 Rq�f557Rp : i ¼ 1; . . . ; ng be an iid sample from the joint distribution558of X and Y . Let A ¼ ðaijÞ be a symmetric, n� n, centered559kernel distance matrix of sample x1; . . . ; xn. The ði; jÞth entry560ofA is

Aij ¼ aij � 1n�2 ai� � 1

n�2 a�j þ 1ðn�1Þðn�2Þ a��; i 6¼ j;

0; i ¼ j;

�562562

6 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 41, NO. X, XXXXX 2019

Page 7: Estimating Feature-Label Dependence Using Gini Distance ...home.olemiss.edu/~xdang/papers/KernelGini.pdf · 15 Index Terms—Energy distance, feature selection, Gini distance covariance,

IEEE P

roof

563 where aij ¼ dMXðxi; xjÞ, ai� ¼

Pnj¼1 aij, a�j ¼

Pni¼1 aij, and

564 a�� ¼Pn

i;j¼1 aij. Similarly, using dMYðyi; yjÞ, a symmetric,

565 n� n, centered kernel distance matrix is calculated for sam-566 ples y1; . . . ; yn and denoted by B ¼ ðbijÞ. An unbiased esti-567 mator of dCovMX;MY

ðX; Y Þ is given as

dCovnMX;MY¼ 1

nðn� 3ÞXi 6¼j

AijBij: (22)

569569

570 We have the following result on the concentration of571 dCovnMX;MY

around dCovMX;MYðX; Y Þ.

572 Theorem 8. Let D ¼ ðxi; yiÞ 2 Rq �Rp : i ¼ 1; . . . ; nf g be an573 iid sample of ðX;Y Þ. Let MX : Rq �Rq ! R and574 MY : Rp �Rp ! R be Mercer kernels. dMX

ð�; �Þ and dMYð�; �Þ

575 are distance functions induced by MX and MY , respectively.576 Both distance functions have a bounded range [0, 1). For every577 � > 0,

Pr dCovnMX;MY� dCovMX;MY

ðX;Y Þ � �h i

� exp�n�2

512

� ;

and

Pr dCovMX;MYðX; Y Þ � dCovnMX;MY

� �h i

� exp�n�2

512

� :579579

580

581 The proof is provided in Appendix C, available in the582 online supplemental material. Note that the above result is583 established for both X and Y being numerical. When Y is584 categorical, it can be embedded into RK using the set differ-585 ence kernel (9). Therefore, in the following discussion, we586 use the simpler notation introduced in Lemma 1 where587 dCovMX;MY

is denoted by dCovMX.

588 The upper bounds for generalized Gini distance covari-589 ance is clearly tighter than those for generalized distance590 covariance. Replacing gCovMðX; Y Þ in H0 and H1 with591 dCovMX

ðX;Y Þ, one may develop dependence tests parallel592 to those in Section 3.2: reject the null hypothesis when593 dCovnMX

� cn�t where c > 0 and t 2 0; 12 �

. Upper bounds594 on Type I and Type II errors can be established in a result595 similar to Corollary 6 with the only difference being replac-596 ing the constant 12.5 with 512. Hence the bounds on the gen-597 eralized Gini distance covariance based dependence test are598 tighter than those on the generalized distance covariance599 based dependence test.600 To further compare the two dependence tests, we con-601 sider the following null and alternative hypotheses:

H0 : SðX; Y Þ ¼ 0;

H1 : SðX; Y Þ � T ; T > 0;603603

604 where SðX;Y Þ ¼ gCovMXðX;Y Þ or dCovMX

ðX; Y Þ with the

605 corresponding test statistics Sn ¼ gCovnMXor dCovnMX

,

606 respectively. The null hypothesis is rejected when Sn � t

where 0 < t � T . Note that this test is more general than

the dependence test discussed in Section 3.2, which is a spe-

cial case with T ¼ 2cn�t and t ¼ cn�t. Upper bounds on

Type I errors follow immediately from (18) by replacingcn�t with t. Type II error bounds, however, are more diffi-

cult to derive due to the fact that t ¼ T would make devia-

tion nonexistent. Next, we take a different approach by

establishing which one of gCovnMXand dCovnMX

is less likelyto underperform in terms of Type II errors.

607Under the alternative hypothesis

H10 : dCovMX

ðX; Y Þ � T ; T > 0;609609

610we compare two dependence tests:

611� acceptingH10 when gCovnMX

� t, 0 < t � T ;

612� acceptingH10 when dCovnMX

� t, 0 < t � T .

613We call that “gCovnMXunderperforms dCovnMX

” if and only if

gCovnMX< t � dCovnMX

;615615

616i.e., the dependence betweenX and Y is detected by dCovnMX

617but not by gCovnMX. The following theorem demonstrates an

618upper bound on the probability that gCovnMXunderperforms

619dCovnMX.

620Theorem 9. UnderH10 and conditions of Theorem 8, there exists

621g > 0 such that the following inequality holds for any T > 0622and 0 < t � T :

Pr gCovnMXunderperforms dCovnMX

jH10

h i� 2e�ng2 : 624624

625

626Proof. Lemma 1 implies that gCovMXðX;Y Þ � dCovMX

ðX;Y Þ627where the equality holds if and only if both are 0, i.e.,X and

628Y are independent. Therefore, under H10, for any T > 0

629and 0 < t � T , we define

g ¼gCovMX

ðX;Y Þ � dCovMXðX;Y Þffiffiffiffiffiffiffiffiffi

12:5p

þffiffiffiffiffiffiffiffi512

p > 0:

631631

632It follows that

Pr gCovnMXunderperforms dCovnMX

jH10

h i¼ Pr gCovnMX

< t � dCovnMXjH1

0h i

� Pr gCovnMX< dCovnMX

jH10

h i� Pr gCovMX

ðX; Y Þ � gCovnMX�

ffiffiffiffiffiffiffiffiffi12:5

pg OR

hdCovnMX

� dCovMXðX; Y Þ �

ffiffiffiffiffiffiffiffi512

pgjH1

0i

� Pr gCovMXðX; Y Þ � gCovnMX

�ffiffiffiffiffiffiffiffiffi12:5

pgjH1

0h i

þ Pr dCovnMX� dCovMX

ðX;Y Þ �ffiffiffiffiffiffiffiffi512

pgjH1

0h i

� 2e�ng2 ;634634

635where the last step is from Theorems 4 and 8. tu

6363.4 Connections to Maximum Mean Discrepancy

637In [29], Gretton et al. proposed a method in testing if two sam-638ples are drawn from different distributions based on maxi-639mum mean discrepancy (MMD), defined as the largest640difference in expectations over functions in a RKHS. Sejdi-641novic et al. [64] showed the equivalence of distance-based and642RKHS-based methods in hypothesis testing. In particular, it643was shown that distance covariance and HSIC are equivalent,644andMMD is equivalent to energy distance when the distance645is computedwith a semimetric of negative type.

ZHANG ET AL.: ESTIMATING FEATURE-LABEL DEPENDENCE USING GINI DISTANCE STATISTICS 7

Page 8: Estimating Feature-Label Dependence Using Gini Distance ...home.olemiss.edu/~xdang/papers/KernelGini.pdf · 15 Index Terms—Energy distance, feature selection, Gini distance covariance,

IEEE P

roof

646 The Gini distance statistics was generalized to RKHS via a647 kernel induced energy distancewhileMMDmeasures the dif-648 ference between two distributions in RKHS. The following649 result shows a close connection between Gini distance covari-650 ance in RKHS and the average of squared MMD between the651 margin distributionF and conditional distributionsFk’s.

652 Corollary 10. Suppose that X 2 Rq is from distribution F and653 Y is a categorical variable with K values L1; . . . ; LK . The cate-654 gorical distribution PY of Y is P ðY ¼ LkÞ ¼ pk and the condi-655 tional distribution of X given Y ¼ Lk is Fk, the marginal656 distribution of X is F ðxÞ ¼

PKk¼1 pkFkðxÞ: For any Mercer

657 kernel M : Rq �Rq ! R, there exists a Mercer kernel658 bM : Rq �Rq ! R such that

gCovMðX;Y Þ ¼XKk¼1

2pkd2bMðF; FkÞ;

660660

661 where d2bMðF; FkÞ is the squared MMD between F and Fk in

662 RKHS of bM.

663 Proof. Let dMðx; x0Þ be a distance induced by M as defined664 in (5). We construct a distance induced kernel bM centered665 at x0 as

bMðx; x0Þ ¼ 1

2½dMðx; x0Þ þ dMðx0; x0Þ � dMðx; x0Þ�:

667667

668 bM is positive definite (Lemma 12, [64]). From Theorem 22669 of [64], we have

2EdMðXk;XÞ � EdMðXk;Xk0Þ � EdMðX;X0Þ ¼ 2d2bMðF; FkÞ:

671671

672 The proof follows (6). tu

673 The result above means that for any Mercer kernel M,674 one can construct another Mercer kernel bM such that the675 Gini covariance in M is equivalent to a weighted average of676 squared MMD in bM.

677 3.5 Asymptotic Analysis

678 We now present asymptotic distributions for the proposed679 Gini covariance and Gini correlation.

680 Theorem 11. Assume Eðd2MðX;X0ÞÞ < 1 and pk > 0 for681 k ¼ 1; . . . ; K. Under dependence of X and Y , gCovnMX

and682 gCornMX

have the asymptotic normality property. That is,

ffiffiffin

pðgCovnMX

� gCovMXðX;Y ÞÞD!Nð0; s2

vÞ; (23)684684

685

ffiffiffin

pðgCornMX

� gCorMXðX;Y ÞÞD!Nð0; s

2v

D2Þ; (24)

687687

688 where s2v is given in the proof.

689 Under independence of X and Y , gCovnMXand gCornMX

690 converge in distribution, respectively, according to

nðgCovnMXÞD!

X1l¼1

�lðx21l � 1Þ; (25)692692

693

nðgCornMXÞD! 1

D

X1l¼1

�lðx21l � 1Þ; (26)

695695

696where �1; ::: are non-negative constants dependent on F and697x2

11;x212; . . . ; are independent x

21 variates.

698Note that the boundedness of the positive definite kernel699M implies the condition of Eðd2MðX;X0ÞÞ < 1.

700Proof.We focus on a proof for the generalized Gini distance701covariance and results for the correlation follow immedi-702ately from Slutsky’s theorem [68] and the fact that D is a703consistent estimator of D.704Let gðxÞ ¼ EdMðx;X0Þ � EdMðX;X0Þ. With the U-sta-705tistic theorem, we have

ffiffiffin

pðD� DÞD!Nð0; v2Þ;

707707

708where v2 ¼ 4Eg2ðXÞ ¼ 4P

k pkEg2ðXkÞ. Similarly, let

709gkðxÞ ¼ EdMðx;X0kÞ � EdMðXk;X

0kÞ for k ¼ 1; 2; . . . ; K

710and v2k ¼ 4Eg2kðXkÞ=pk. We have

ffiffiffin

pðDk � DkÞ

D!Nð0; v2kÞ: 712712

713

714Let SS be the variance and covariance matrix for ~gg ¼7152ðg1ðX1Þ; . . . ; gKðXKÞ; gðXÞÞT , where X ¼ Xk with proba-716bility pk. In other words, SS ¼ E~gg~ggT . Denote ðD1; . . . ;717DK; DÞT as dd and ðD1; . . . ;DK;DÞT as dd. From the U-statistic

718theorem [36], we haveffiffiffin

pðdd� ddÞD!Nð00;SSÞ. Let bb ¼ ð�p1;

719. . . ;�pK; 1ÞT be the gradient vector of gCovMXðX;Y Þ with

720respect to dd. Then s2v ¼ bbTSSbb > 0 under the assumption of

721dependence ofX and Y , since

hðxÞ :¼ bbT~ggðxÞ ¼ 2Xk

pkðgðxkÞ � gkðxkÞÞ

¼ 2Xk

pkðEdMðxk;XÞ � EdMðxk;XkÞÞ � 2ðD�Xk

pkDkÞ

6¼ 0;723723

724and s2v ¼

Pk pkE½hðXkÞ2�. In this case, by theDeltamethod,

725ffiffiffin

pbbT ðdd� ddÞ is asymptotically normally distributed with 0

mean and variance s2v. With the result of bb ¼ ð�p1; . . . ;

�pK; 1ÞT being a consistent estimator of bb and by the

Slutsky’s theorem,we have the same limiting normal distri-

bution for gCovnMX¼ bbT dd as that of bbT dd. Therefore, the result

of (23) is proved.726However, under the independence assumption, s2

v ¼ 0727because hðxÞ ¼ 0, resulting from the same distribution of728X and Xk. This corresponds to the degenerate case of U-

729statistics and bbT dd has a mixture of x2 distributions [65].

Hence the result of (25) holds. tu

730One way to use the results of (23) and (24) is to test H0

731based on the confidence interval approach. More specifi-732cally, an asymptotically ð1� aÞ100 percent confidence inter-733val for gCovMX

ðX;Y Þ is

gCovnMXðX; Y Þ Z1�a=2

s2vffiffiffin

p ;

735735

736where s2v is a consistent estimator of s2

v and Z1�a=2 is the7371� a=2 quantile of the standard normal random variable. If738this interval does not contain 0, we can reject H0 at signifi-739cance level a=2. This test controls Type II error to be a=2.

8 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 41, NO. X, XXXXX 2019

Page 9: Estimating Feature-Label Dependence Using Gini Distance ...home.olemiss.edu/~xdang/papers/KernelGini.pdf · 15 Index Terms—Energy distance, feature selection, Gini distance covariance,

IEEE P

roof

740 On the other hand, if a test to control Type I error is pre-741 ferred, we usually need to rely on a permutation test rather742 than the results of (25) and (26) since �’s depend on the dis-743 tribution F , which is unknown. Details of the permutation744 test are in the next section.

745 4 AN ALGORITHMIC VIEW

746 Although the uniform convergence bounds for generalized747 Gini distance covariance and generalized Gini distance cor-748 relation in Sections 3.2 and 3.3 are established upon the749 bounded kernel assumption, all the results also hold for750 Gini distance covariance and Gini distance correlation if the751 features are bounded. This is because when the features are752 bounded, they can be normalized so that supx;x0 jx� x0jq ¼ 1.753 The calculation of test statistics (14) and (15) requires evalu-754 ating distances between all unique pairs of samples. Its time755 complexity is therefore Qðn2Þ, where n is the sample size. In756 the one dimension case, i.e., q ¼ 1, Gini distance statistics can757 be calculated in QðnlognÞ time [20].3 Note that distance758 covariance and distance correlation can also be calculated in759 QðnlognÞ time [37]. Nevertheless, the implementation for760 Gini distance statistics is much simpler as it does not require761 the centering process.762 Generalized Gini distance statistics are functions of the763 kernel parameter s. Fig. 1a shows gCovnM and gCornM of X1

764 and Y1 for n ¼ 2000. The numerical random variable X1 is765 generate from amixture of two dimensional normal distribu-

766 tions: N1 Nð½1; 2�T ; diag½2; :5�Þ; N2 Nð½�3;�5�T ; diag½1;767 1�Þ, and N3 Nð½�1; 2�T ; diag½2; 2�Þ. The three components768 have equal mixing proportions. The categorical variable769 Y1 2 fy1; y2; y3g is independent of X1. The results in Fig. 1b770 are calculated from X2 and Y2 for n ¼ 2000. The numerical771 random variableX2 is generated byNi if and only if Y2 ¼ yi,772 i ¼ 1; 2; 3. The categorical distribution of Y2 is PrðY2 ¼773 yiÞ ¼ 1

3 . It is clear thatX2 and Y2 are dependent on each other.774 Fig. 1 shows the impact of kernel parameter s on the

775estimated generalized Gini distance covariance and Gini dis-776tance correlation. As a result, this affects the Type I and Type777II error bounds given in Section 3.2. In this example, under778H0 (or H1), the minimum (or maximum) gCovnM is achieved779at s2 ¼ 50 (or s2 ¼ 29). These extremes yield tightest bounds780in (18) and (19).4 Note that gCovnM is an unbiased estimate of781gCovM . Although gCovM can never be negative, gCovnM can782be negative, especially underH0.783This example also suggests that in addition to the theoreti-784cal importance, the inequalities in (18) and (19) may be785directly applied to dependence tests. Given a desired bound786(or significance level), a, on Type I and Type II errors, we call787the value that determines whether H0 should be rejected788(hence to acceptH1) the critical value of the test statistic. Based789on (18) and (19), the critical value for gCovnM, cvða; nÞ, which is790a function of a and the sample size n, is calculated as

cvða; nÞ ¼

ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi12:5log 1

a

n

s:

792792

793The three horizontal dashed lines in Fig. 1b illustrate the criti-794cal values for a ¼ 0:01, a ¼ 0:05, and a ¼ 0:15, respectively.795The population Gini distance covariance estimated using79620,000 iid samples is not included in the figure because of its797closeness to gCovnM. With a proper choice of s, H1 should be798accepted based on the 2000 samples of ðX2; Y2Þ with both799Type I and Type II errors no greater than 0.05. Note that we800could not really accept H1 at the level a ¼ 0:01 because the801estimated maximum gCovM is around 0.28, which is smaller802than 0.3393 (two times the critical value at a ¼ 0:01).803The above test, although simple, has two limitations:

804� Choosing an optimal s is still an open problem.805Numerical search is computationally expensive even806if it is in one dimension;

Fig. 1. Estimates of the generalized Gini distance covariance and generalized Gini distance correlation for different kernel parameters using 2000 iidsamples: (a) independent case; (b) dependent case. Three critical values are shown in (b). They are calculated for significance levels 0.01, 0.05,and 0.15, respectively. In terms of the uniform convergence bounds, the optimal value of the kernel parameter s is defined by the minimizer (ormaximizer) of the test statistics underH0 (orH1).

3. When the inner produce kernel Mðx; x0Þ ¼ xTx0 is chosen,generalized Gini distance statistics reduces to Gini distance statistics.

4. The kernel parameter s also affects the Type I and Type II errorbounds for gCornM in (20) and (21). The Type I error bound for gCovnM issignificantly tighter than that for gCornM .

ZHANG ET AL.: ESTIMATING FEATURE-LABEL DEPENDENCE USING GINI DISTANCE STATISTICS 9

Page 10: Estimating Feature-Label Dependence Using Gini Distance ...home.olemiss.edu/~xdang/papers/KernelGini.pdf · 15 Index Terms—Energy distance, feature selection, Gini distance covariance,

IEEE P

roof

- The simplicity of the distribution-free critical value cv(α, n) comes with a price: it might not be tight enough for many distributions, especially when n is small.

Therefore, we apply the permutation test [23], a commonly used statistical tool for constructing sampling distributions, to handle scenarios in which the test based on cv(α, n) is not feasible. We randomly shuffle the samples of X and keep the samples of Y untouched. We expect the generalized Gini distance statistics of the shuffled data to have values close to 0 because the random permutation breaks the dependence between the samples of X and the samples of Y. Repeating the random permutation many times, we may estimate the critical value for a given significance level α from the statistics of the permuted data. A simple approach is to use the percentile defined by α; e.g., when α = 0.05 the critical value is the 95th percentile of the test statistic over the permuted data. The null hypothesis H0 is rejected when the test statistic is larger than the critical value.

Fig. 2 shows the permutation test results for data generated from the same distributions used in Fig. 1. The plots on the top are gCov_n^M and the critical values; the plots at the bottom are gCor_n^M and the critical values. Test statistics are calculated from 500 samples. The critical values are estimated from 5000 random permutations at α = 0.05. As illustrated in Fig. 2a, when X and Y are independent, the permutation tests do not reject H0 at significance level 0.05. Fig. 2b shows that when X and Y are dependent, H0 is rejected at significance level 0.05 by the permutation tests. It is interesting to note that the decision to reject or accept H0 is not influenced by the value of the kernel parameter σ (see footnote 5).
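A minimal sketch of this permutation procedure is given below (Python/NumPy, our illustration rather than the authors' code). The dependence statistic, e.g., an estimator of the generalized Gini distance covariance, is not implemented here and is passed in as a user-supplied function `stat_fn`.

```python
import numpy as np

def permutation_critical_value(stat_fn, x, y, n_perm=5000, alpha=0.05, seed=None):
    """Estimate the critical value of a dependence statistic by permutation.

    Samples of x are shuffled while y is kept untouched, which breaks any
    dependence between the two; the (1 - alpha) quantile of the permuted
    statistics is returned as the critical value.
    """
    rng = np.random.default_rng(seed)
    perm_stats = np.array([stat_fn(rng.permutation(x), y) for _ in range(n_perm)])
    return np.quantile(perm_stats, 1.0 - alpha)

# H0 is rejected when stat_fn(x, y) exceeds the returned critical value.
```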

5 EXPERIMENTS

We first compare Gini distance statistics with distance statistics on artificial datasets where the dependent features are known. We then provide comparisons on real world datasets including the MNIST dataset, a breast cancer dataset, and 19 publicly available datasets. For these real world datasets, we also include another three baseline methods: Pearson R², mean variance (MV) [18], and a direct average of squared MMD (avgMMD²), i.e., $\frac{1}{K}\sum_{k=1}^{K} d_M^2(F, F_k)$, where M is the same Gaussian kernel used for the Gini and distance statistics.
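For reference, a sketch of this baseline is given below (Python/NumPy, our illustration). The biased (V-statistic) MMD estimate and the Gaussian kernel parameterization exp(-||x - x'||² / (2σ²)) are choices made here for brevity; they are not necessarily the exact settings used in the experiments.

```python
import numpy as np

def avg_mmd2(X, y, sigma2=10.0):
    """Plain average over classes of the squared MMD between the pooled
    sample F and each class-conditional sample F_k, with a Gaussian kernel."""
    def gram(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
        return np.exp(-d2 / (2.0 * sigma2))

    pooled = gram(X, X).mean()          # mean kernel value over the full sample
    classes = np.unique(y)
    total = 0.0
    for c in classes:
        Xc = X[y == c]
        total += pooled + gram(Xc, Xc).mean() - 2.0 * gram(X, Xc).mean()
    return total / len(classes)
```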

5.1 Simulation Results

In this experiment, we compare dependence tests using four statistics, dCov_n^M, dCor_n^M, gCov_n^M, and gCor_n^M, on artificial datasets. The kernel parameter was fixed at σ² = 10. The data were generated from three distribution families: normal, exponential, and Gamma, under both H0 (X and Y are independent) and H1 (X and Y are dependent). Given a distribution family, we first randomly choose a distribution F_0 and generate n iid samples of X. Samples of the categorical Y are then produced independently of X. Repeating the process, we create a total of m independent datasets under H0. In the dependent case, X is produced by $F = \sum_{k=1}^{K} p_k F_k$, a mixture of K distributions, where K is the number of different values that Y takes, p_k is the probability that Y = y_k, and F_k is a distribution from the same family that yields the data under H0. The dependence between X and Y is established by the data generating process: a sample of X is created from F_k if Y = y_k. The mixture model is randomly generated, i.e., p_k and F_k are both randomly chosen. For each Y = y_k (k = 1, ..., K), n_k = n · p_k iid samples of X are produced following F_k. This results in one data set of size $n = \sum_{k=1}^{K} n_k$ under H1. Following the same procedure, we obtain m independent data sets under H1, each corresponding to a randomly selected K-component mixture model F. In our experiments, n = 100 and m = 10,000.

Table 2 summarizes the model parameters of the three distribution families. I(·) is the indicator function. Γ(·) is the gamma function. N(μ, σ) denotes the normal distribution with mean μ and standard deviation σ. G(α, β) denotes the gamma distribution with shape α and rate β. U(a, b) denotes the uniform distribution over the interval [a, b]. Dir(α) denotes the Dirichlet distribution with concentration α.

Fig. 2. Permutation tests of the generalized Gini distance covariance and generalized Gini distance correlation for different kernel parameters using 200 iid samples and 5000 random permutations: (a) independent case; (b) dependent case. The 95th percentile curves define the critical values for α = 0.05; they are calculated from the permuted data. Test statistics higher (lower) than the critical value suggest accepting H1 (H0).

5. Although the decision to reject or accept H0 is not affected by σ, the p-value of the test does vary with respect to σ.

A distribution (F_0 or F_k) is randomly chosen via its parameter(s). The unbiasedness of gCov_n^M requires that there are at least two data points for each value of Y. Therefore, random proportions that do not meet this requirement are removed.

Fig. 3 shows the performance of dCov_n^M and gCov_n^M in terms of Type I and Type II errors with the critical value t set to different values. As Theorem 9 suggests, for any t, gCov_n^M outperforms dCov_n^M in Type II error. However, with the same value of t, gCov_n^M underperforms dCov_n^M in terms of Type I error. The results presented in Fig. 3 motivate us to compare Gini and distance statistics using power (with Type I error controlled at α = 0.05) and area under the curve (AUC). Both measures have values between 0 and 1, with a value closer to 1 indicating better performance. Table 3 reports the performance of the four test statistics under different values of K for the three distribution families. The highest power and AUC among the four test statistics are shown in bold and the second highest are underlined. In this experiment, gCov_n^M appears to be the most competitive test statistic in terms of power and AUC at all values of K. In addition, both gCov_n^M and gCor_n^M outperform dCov_n^M and dCor_n^M in most of the cases. We also tested the influence of σ² and observed stable results (figures are provided in the supplementary materials, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/TPAMI.2019.2960358).

5.2 The MNIST Dataset

We first tested feature selection methods using different test statistics on the MNIST data. The advantage of using an image dataset like MNIST is that we can visualize the selected pixels; we expect useful/dependent pixels to appear in the center part of the image. Descriptions of the MNIST data are listed in Table 4. The 7 test statistics under comparison are: Pearson R², MV, avgMMD², dCov_n^M, dCor_n^M, gCov_n^M, and gCor_n^M. For each method, the top k features were selected by ranking

TABLE 2
Models of Different Distribution Families

Family         p(x | θ)                                                                        θ
Normal         $\frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$              μ ~ N(0, 5);  σ² ~ 1/G(1, 1)
Exponential    $\lambda e^{-\lambda x}\, I(x \ge 0)$                                           λ ~ U(0, 5)
Gamma          $\frac{\beta^{\alpha} x^{\alpha-1} e^{-\beta x}}{\Gamma(\alpha)}\, I(x \ge 0)$  α ~ U(0, 10);  β ~ U(0, 10)
Proportions    p_k                                                                             Dir(1)
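To make the data-generating process concrete, the sketch below (Python/NumPy, an illustration rather than the authors' code) draws one dependent dataset from the normal family of Table 2. Class counts are drawn from a multinomial as an approximation of n_k = n·p_k, and the requirement of at least two points per class is not enforced here.

```python
import numpy as np

def sample_normal_mixture(n=100, K=3, seed=None):
    """One dataset under H1 for the normal family of Table 2:
    p ~ Dir(1); for class y_k, mu_k ~ N(0, 5) and sigma_k^2 ~ 1/G(1, 1);
    samples of X with label y_k are drawn from N(mu_k, sigma_k)."""
    rng = np.random.default_rng(seed)
    p = rng.dirichlet(np.ones(K))                  # mixture proportions
    counts = rng.multinomial(n, p)                 # n_k, approximately n * p_k
    mu = rng.normal(0.0, 5.0, size=K)              # component means
    sigma2 = 1.0 / rng.gamma(shape=1.0, scale=1.0, size=K)  # sigma^2 ~ 1/G(1, 1)
    x = np.concatenate([rng.normal(mu[k], np.sqrt(sigma2[k]), size=counts[k])
                        for k in range(K)])
    y = np.repeat(np.arange(K), counts)
    return x, y
```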

Fig. 3. Simulation results using the normal distribution with K = 3: (a) Type I error; (b) Type II error.

TABLE 3
Power (α = 0.05) and AUC

                       Power                                      AUC
                       dCov_n^M  dCor_n^M  gCov_n^M  gCor_n^M     dCov_n^M  dCor_n^M  gCov_n^M  gCor_n^M
K = 3  Normal          0.991     0.993     0.996     0.996        0.998     0.998     0.999     0.999
       Exponential     0.666     0.669     0.701     0.681        0.871     0.875     0.880     0.881
       Gamma           0.956     0.960     0.974     0.971        0.988     0.989     0.992     0.992
K = 4  Normal          0.999     0.999     1.000     1.000        1.000     1.000     1.000     1.000
       Exponential     0.737     0.734     0.774     0.756        0.908     0.909     0.920     0.918
       Gamma           0.987     0.987     0.994     0.992        0.997     0.997     0.998     0.998
K = 5  Normal          1.000     1.000     1.000     1.000        1.000     1.000     1.000     1.000
       Exponential     0.790     0.776     0.823     0.805        0.931     0.930     0.941     0.939
       Gamma           0.995     0.993     0.998     0.996        0.999     0.999     0.999     0.999


the test statistics in descending order. The selected feature set was then used to train the same classifier, and the test accuracies were compared. The classifier used was a random forest consisting of 100 trees. We used the training and test sets provided by [46] for training and testing. To reduce computation cost, we randomly selected 5000 samples to calculate the test statistics for all methods. For avgMMD², Gini, and distance statistics, each feature was standardized by subtracting the mean and dividing by the standard deviation; Pearson R² and MV are not affected by data standardization. The kernel parameter σ² was set to 10. Because of the randomness involved in training a random forest, each experiment was repeated 10 times and the average test accuracy was used for comparison.

The results on the MNIST data are summarized in Fig. 4. From Fig. 4a we can see the expected increasing trend in accuracy as more features are selected. Among all methods, Pearson R² performs the poorest. The discrepancy in accuracy between Pearson R² and the other six test statistics comes from the ability to characterize non-linear dependence. The other five methods behave similarly in terms of classification accuracy. Specifically, we expect dCov_n^M and gCov_n^M to be very similar because MNIST is a balanced dataset. Under the following two scenarios, dCov^M(X, Y) and gCov^M(X, Y) give the same ranking of the features because their ratio is a constant (Remarks 2.8 and 2.9 of [20]):

1) When the data is balanced, i.e., p_1 = p_2 = ... = p_K = 1/K, dCov^M(X, Y) = (1/K) gCov^M(X, Y);
2) When the data has only 2 classes, i.e., K = 2, dCov^M(X, Y) = 2 p_1 p_2 gCov^M(X, Y).

Hence, when n is sufficiently large, dCov_n^M and gCov_n^M will have the same ranking for the features.
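The selection-and-evaluation protocol described above can be sketched in a few lines. The snippet below (Python with scikit-learn, our illustration rather than the authors' code) assumes a vector `scores` holding one dependence statistic per feature.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def accuracy_with_top_k(scores, X_train, y_train, X_test, y_test, k):
    """Keep the k features with the largest dependence statistic and report
    the test accuracy of a 100-tree random forest trained on them."""
    top_k = np.argsort(scores)[::-1][:k]        # rank statistics in descending order
    clf = RandomForestClassifier(n_estimators=100)
    clf.fit(X_train[:, top_k], y_train)
    return clf.score(X_test[:, top_k], y_test)
```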

The difference between Gini and distance statistics is more observable in the value range, as shown in Fig. 4b. Both dCor_n^M and gCor_n^M are bounded between 0 and 1, but gCor_n^M clearly takes a much wider range than dCor_n^M. Therefore, gCor_n^M is a more sensitive measure of dependence than dCor_n^M. gCov_n^M is also more sensitive than dCov_n^M, as shown both empirically in Fig. 4b and theoretically by (11).

Fig. 4c visualizes the selected pixels in white. Pearson R² and avgMMD² are not able to select some of the pixels in the center part even when k is greater than 400. The other four methods behave similarly in this figure.

5.3 The Breast Cancer Dataset

We then compared the 7 feature selection methods on a gene selection task. The dataset used in this experiment was the TCGA breast cancer microarray dataset from the UCSC Xena database [28]. The data contain expression levels of 17,278 genes from 506 patients, and each patient has a breast cancer subtype label (luminal A, luminal B, HER2-enriched, or basal-like). PAM50 is a gene signature consisting of 50 genes derived from microarray data and is considered the gold standard for breast cancer subtype prognosis and prediction [59]. In this experiment, we randomly held out 20 percent as test data, used each method to select the top k genes, then evaluated the classification performance and compared the selected genes with the PAM50 gene signature. Because this dataset has a relatively small sample size, all training examples were used to calculate the test statistics and train the classifier, and we repeated each classification test 30 times. Other experiment settings were kept the same as before.

The results are shown in Fig. 5. Fig. 5a shows the classification performance using the selected top k genes for different test statistics, as well as using all PAM50 genes (shown as a green dotted line). Note that the accuracy of PAM50 is the average value over 30 runs; we plot it as a line across the entire k range for easier comparison with the other methods. Among the 7 selection methods under comparison, gCov_n^M has the best overall performance, even outperforming PAM50 with k = 30. This suggests that gCov_n^M is able to select a smaller number of genes while predicting better than the gold standard. We also observe that gCor_n^M outperforms PAM50 with k = 40 and k = 50. Pearson R², MV, avgMMD², dCov_n^M, and dCor_n^M are not able to exceed PAM50 within 50 genes. Fig. 5b shows the number of PAM50 genes that appear in the top k selected genes for each selection method. Pearson R² and avgMMD² clearly select a much smaller number of PAM50 genes than the others, and Gini statistics are able to select more PAM50 genes than distance statistics as k increases. The small fraction of PAM50 genes included in the genes selected by any method is due to the high correlation between genes: PAM50 was derived by selecting not only the most subtype-dependent genes but also less mutually dependent genes, to obtain a smaller gene set with the same prediction accuracy. Even though none of the selection methods under comparison takes feature-feature dependence into consideration, both gCov_n^M and gCor_n^M are able to select a better gene set than PAM50 for classification.

5.4 Other Publicly Available Datasets

We further tested the 7 feature selection methods on a total of 19 publicly available classification datasets, covering a wide range of data types, sample sizes, and feature set sizes.

TABLE 4
Data Set Summary

Data Set                   Train/Test Size   Features   Classes
MNIST                      60000/10000       784        10
Breast Cancer              405/101           17278      4
GDC PANCAN                 2076/519          24981      12
Head and Neck Cancer       2010/502          23686      4
Pancreatic Cancer          77/19             18278      4
Medulloblastoma            228/57            33297      3
Gene Expression (UCI)      641/160           20531      5
Gastrointestinal Lesions   61/15             1396       3
Satellite                  4435/2000         36         6
Ecoli                      269/67            7          8
Glass                      171/43            9          6
Urban Land Cover           168/507           147        9
Wine                       142/36            13         3
Anuran Calls               5756/1439         22         4
Breast Tissue              85/21             9          4
Cardiotocography           1701/425          21         10
Leaf                       272/68            14         30
Mice Protein Expression    864/216           77         8
HAR                        4252/1492         561        6
UJIndoorLoc                19937/1111        520        13
Forest Types               198/325           27         4


Specifically, we avoided binary-class and balanced datasets because dCov_n^M and gCov_n^M give the same ranking on such datasets when sufficient training samples are given. For datasets without training and test sets provided, we randomly held out 20 percent as the test set. The descriptions of the datasets used are summarized in Table 4. GDC PANCAN, Head and Neck Cancer, Pancreatic Cancer, and Medulloblastoma are gene datasets from the UCSC Xena database [28]. GDC PANCAN used DNA methylation features, Pancreatic Cancer used RNA-seq features, Head and Neck used single-cell RNA-seq features, and Medulloblastoma used microarray features. The Gene Expression dataset from UCI is also a gene dataset for PANCAN analysis but used RNA-seq features. The remaining datasets are all from UCI. For the UJIndoorLoc dataset, we randomly selected 5000 samples from the training set to calculate the test statistics; for the remaining datasets, all training samples were used. For each dataset, each method was used to select the top k features for training, with three different values of k. Each classification test was repeated 10 times.

As we do not have the ground truth of the dependent features, only classification accuracy was used for evaluation. The average test accuracy (over 10 runs) of the 7 methods under comparison with different values of k on the 19 datasets is summarized in Table 5. For each setting, the highest accuracy is shown in bold and the second highest is underlined.

Fig. 4. The MNIST dataset. (a) Test accuracy using the top k selected features. (b) Test statistics of the features in descending order. (c) Visualization of the top k pixels selected (white: selected; black: not selected). Test accuracy using the selected pixels is labeled on top of each image.


The top 1 (top 2) statistic is determined by the number of times a method is shown in bold (bold or underlined). Among all methods, gCor_n^M appears 19 times as top 1 and 33 times in top 2, outperforming all other methods. MV, dCov_n^M, and gCov_n^M have similar performance, followed by avgMMD² and then dCor_n^M. Pearson R² is ranked last, with 8 times as top 1 and 14 times in top 2. We observed that the performance of gCor_n^M is superior on the gene datasets evaluated, namely, Breast Cancer, GDC PANCAN, Head and Neck Cancer, Pancreatic Cancer, Medulloblastoma, and Gene Expression (UCI). Gene datasets typically have a large number of features, usually an order of magnitude greater than the sample size, which makes selecting a small set of good features necessary yet challenging. The empirical results on these gene datasets suggest that gCor_n^M is a more competitive feature selection method than the other methods under comparison.

6 CONCLUSION

We proposed a feature selection framework based on a new dependence measure between a numerical feature X and a categorical label Y using generalized Gini distance statistics: the Gini distance covariance gCov(X, Y) and the Gini distance correlation gCor(X, Y). We presented estimators of gCov(X, Y) and gCor(X, Y) using n iid samples, i.e., gCov_n^M and gCor_n^M, and derived uniform convergence bounds. We showed that gCov_n^M converges faster than its distance statistic counterpart dCov_n^M, and that the probability of gCov_n^M under-performing dCov_n^M in terms of Type II error decreases to 0 exponentially as the sample size increases. gCov_n^M and gCor_n^M are also simpler to calculate than dCov_n^M and dCor_n^M. Extensive experiments were performed to compare gCov_n^M and gCor_n^M with other dependence measures in feature selection tasks using artificial and real world datasets, including MNIST, a breast cancer dataset, and 19 publicly available datasets. For simulated data, gCov_n^M and gCor_n^M perform better in terms of power and AUC. For real world datasets, on average, gCor_n^M is able to select more meaningful features and achieves better classification performance. Notice that the advantage of Gini statistics over distance statistics is less observable in real world datasets than in simulation settings. This is because for real world datasets the ground truth is unavailable and the difference is more difficult to see using classification accuracy as the performance measure. However, when the data dimension is sufficiently large, the difference between the methods under comparison is more significant. As we see on the gene datasets, gCor_n^M is significantly better than the baseline methods. Therefore, we recommend the use of gCor_n^M for high dimensional data. In spite of the equivalence between gCov^M and a weighted average of squared MMD in $\widehat{M}$, gCov_n^M is superior to a direct average of squared MMD using the same kernel M in many settings, suggesting the importance of using a weighted average and of the choice of kernel.

The proposed feature selection method using generalized Gini distance statistics has several limitations:

1075more difficult to see using classification accuracy as the per-1076formancemeasure.However, when the data dimension is suf-1077ficiently large, the difference between methods under1078comparison is more significant. As we see on the gene data-1079sets, gCornM is significantly better than the baseline methods.1080Therefore, we would recommend the use of gCornM for high1081dimension data. In spite of the equivalence between gCovM1082and a weighted average of squared MMD in bM, gCovnM is1083superior to a direct average of squared MMD using the same1084kernel M in many settings, suggesting the importance of1085usingweighted average and the choice of kernel.1086The proposed feature selection method using generalized1087Gini distance statistics have several limitations:

1088� Choosing an optimal s is still an open problem. In1089our experiments we used s2 ¼ 10 after data1090standardization;1091� The computation cost for gCovnM and gCornM is Oðn2Þ,1092which is same as avgMMD2, dCovnM and dCornM , but

1093more expensive than MV (OðnlognÞ) and Pearson R2

1094(OðnÞ). For large datasets, a sampling of data is desired.1095� Features selected by Gini distance statistics, as well as1096other dependence measure based methods, can be1097redundant, hence a subsequent feature elimination1098may be needed for the sake of feature subset selection.

ACKNOWLEDGMENTS

The authors would like to thank the reviewers and the associate editor for their suggestions to improve the quality of the paper.

REFERENCES

[1] N. Armanfard, J. P. Reilly, and M. Komeili, "Local feature selection for data classification," IEEE Trans. Pattern Anal. Mach. Intell., vol. 38, no. 6, pp. 1217–1227, Jun. 2016.
[2] N. Aronszajn, "Theory of reproducing kernels," Trans. Amer. Math. Soc., vol. 68, no. 3, pp. 337–404, 1950.
[3] A. Barbu, Y. She, L. Ding, and G. Gramajo, "Feature selection with annealing for computer vision and big data learning," IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 2, pp. 272–286, Feb. 2017.
[4] L. Baringhaus and C. Franz, "On a new multivariate two-sample test," J. Multivariate Anal., vol. 88, no. 1, pp. 190–206, 2004.
[5] R. Battiti, "Using mutual information for selecting features in supervised neural net learning," IEEE Trans. Neural Netw., vol. 5, no. 4, pp. 537–550, Jul. 1994.

Fig. 5. The breast cancer dataset. (a) Test accuracy using the top k selected genes. (b) Number of PAM50 genes in the selected top k genes.


[6] M. Belkin and P. Niyogi, "Laplacian eigenmaps and spectral techniques for embedding and clustering," in Proc. Int. Conf. Neural Inf. Process. Syst., 2002, vol. 14, pp. 585–591.
[7] K. Benabdeslem and M. Hindawi, "Efficient semi-supervised feature selection: Constraint, relevance, and redundancy," IEEE Trans. Knowl. Data Eng., vol. 26, no. 5, pp. 1131–1143, May 2014.
[8] J. R. Berrendero, A. Cuevas, and J. L. Torrecilla, "Variable selection in functional data classification: A maxima-hunting proposal," Statistica Sinica, vol. 26, no. 2, pp. 619–638, 2016.
[9] A. L. Blum and P. Langley, "Selection of relevant features and examples in machine learning," Artif. Intell., vol. 97, nos. 1–2, pp. 245–271, 1997.
[10] L. Breiman, "Random forests," Mach. Learn., vol. 45, no. 1, pp. 5–32, 2001.

[11] M. Bressan and J. Vitrià, "On the selection and classification of independent features," IEEE Trans. Pattern Anal. Mach. Intell., vol. 25, no. 10, pp. 1312–1317, Oct. 2003.
[12] S. Canu, "Functional learning through kernels," in Advances in Learning Theory: Methods, Models and Applications, J. Suykens, G. Horvath, S. Basu, C. Micchelli, and J. Vandewalle, Eds. Amsterdam, The Netherlands: IOS Press, 2003, pp. 89–110.
[13] R. Chakraborty and N. R. Pal, "Feature selection using a neural framework with controlled redundancy," IEEE Trans. Neural Netw. Learn. Syst., vol. 26, no. 1, pp. 35–50, Jan. 2015.
[14] Q. Cheng, H. Zhou, and J. Cheng, "The Fisher-Markov selector: Fast selecting maximally separable feature subset for multiclass classification with applications to high-dimensional data," IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 6, pp. 1217–1233, Jun. 2011.

TABLE 5
Classification Accuracies Using Top k Dependent Features

             GDC PANCAN           Head and Neck Cancer  Pancreatic Cancer     Medulloblastoma
k            5      10     15     3      5      7       10     30     50      5      7      10
Pearson R²   0.600  0.791  0.841  0.886  0.936  0.959   0.611  0.711  0.774   0.963  0.930  0.951
MV           0.792  0.874  0.892  0.638  0.953  0.957   0.721  0.632  0.768   0.954  0.958  0.970
avgMMD²      0.517  0.639  0.734  0.886  0.938  0.949   0.389  0.553  0.568   0.858  0.872  0.835
dCov_n^M     0.783  0.853  0.888  0.907  0.951  0.970   0.658  0.679  0.753   0.932  0.956  0.946
dCor_n^M     0.745  0.852  0.877  0.894  0.923  0.954   0.616  0.700  0.774   0.946  0.939  0.951
gCov_n^M     0.807  0.857  0.882  0.893  0.937  0.955   0.847  0.842  0.837   0.933  0.965  0.965
gCor_n^M     0.826  0.863  0.881  0.894  0.982  0.982   0.800  0.842  0.874   0.854  0.870  0.916

             Gene Expression (UCI)  Gastrointestinal Lesions  Satellite            Ecoli
k            5      7      10       10     100    200         5      10     15     2      3      4
Pearson R²   0.909  0.917  0.934    0.747  0.700  0.660       0.598  0.830  0.869  0.624  0.772  0.776
MV           0.936  0.961  0.982    0.660  0.660  0.673       0.831  0.885  0.900  0.713  0.760  0.781
avgMMD²      0.711  0.696  0.819    0.547  0.680  0.693       0.770  0.834  0.860  0.624  0.772  0.845
dCov_n^M     0.889  0.907  0.963    0.700  0.607  0.673       0.627  0.860  0.896  0.712  0.766  0.788
dCor_n^M     0.851  0.928  0.937    0.787  0.633  0.680       0.808  0.887  0.903  0.713  0.764  0.787
gCov_n^M     0.923  0.933  0.948    0.560  0.647  0.693       0.834  0.870  0.881  0.718  0.766  0.781
gCor_n^M     0.741  0.980  0.972    0.607  0.700  0.653       0.838  0.873  0.898  0.634  0.758  0.803

             Glass                Urban Land Cover     Wine                  Anuran Calls
k            3      5      7      30     60     90     2      4      6       5      10     15
Pearson R²   0.702  0.730  0.707  0.764  0.778  0.798  0.753  0.969  0.972   0.927  0.957  0.975
MV           0.702  0.667  0.758  0.782  0.799  0.805  0.917  0.992  0.997   0.933  0.958  0.980
avgMMD²      0.707  0.670  0.744  0.778  0.796  0.799  0.758  0.903  0.942   0.934  0.958  0.979
dCov_n^M     0.702  0.691  0.744  0.786  0.795  0.809  0.897  0.997  1.000   0.938  0.957  0.979
dCor_n^M     0.707  0.672  0.751  0.781  0.817  0.809  0.881  0.992  0.997   0.937  0.956  0.978
gCov_n^M     0.705  0.674  0.663  0.784  0.790  0.804  0.881  0.997  0.997   0.938  0.956  0.979
gCor_n^M     0.693  0.665  0.670  0.787  0.804  0.805  0.900  0.997  1.000   0.937  0.958  0.980

             Breast Tissue        Cardiotocography     Leaf                  Mice Protein Expression
k            3      5      7      5      10     15     4      7      10      10     20     30
Pearson R²   0.810  0.857  0.833  0.831  0.890  0.894  0.437  0.637  0.647   0.978  0.942  0.980
MV           0.810  0.857  0.843  0.805  0.880  0.893  0.494  0.576  0.676   0.868  0.964  0.982
avgMMD²      0.810  0.857  0.857  0.675  0.860  0.894  0.443  0.601  0.676   0.977  0.970  0.980
dCov_n^M     0.819  0.857  0.852  0.773  0.874  0.891  0.471  0.690  0.663   0.893  0.950  0.985
dCor_n^M     0.810  0.857  0.838  0.816  0.849  0.894  0.506  0.606  0.704   0.890  0.952  0.977
gCov_n^M     0.810  0.857  0.852  0.817  0.878  0.895  0.468  0.688  0.654   0.888  0.945  0.983
gCor_n^M     0.810  0.857  0.857  0.766  0.880  0.887  0.503  0.594  0.706   0.888  0.943  0.982

             HAR                  UJIndoorLoc          Forest Types          Top 1    Top 2
k            100    200    300    100    200    300    5      10     15      (times)  (times)
Pearson R²   0.756  0.780  0.856  0.695  0.811  0.847  0.751  0.756  0.795   8        14
MV           0.719  0.778  0.775  0.753  0.864  0.873  0.699  0.758  0.806   11       25
avgMMD²      0.781  0.832  0.863  0.731  0.867  0.869  0.764  0.757  0.794   11       18
dCov_n^M     0.765  0.779  0.859  0.768  0.866  0.873  0.652  0.809  0.807   11       22
dCor_n^M     0.767  0.780  0.853  0.793  0.863  0.871  0.715  0.752  0.802   8        19
gCov_n^M     0.766  0.853  0.858  0.763  0.864  0.872  0.651  0.818  0.804   11       26
gCor_n^M     0.821  0.853  0.853  0.797  0.864  0.872  0.696  0.734  0.806   19       33


[15] T. W. S. Chow and D. Huang, "Estimating optimal feature subsets using efficient estimation of high-dimensional mutual information," IEEE Trans. Neural Netw., vol. 16, no. 1, pp. 213–224, Jan. 2005.
[16] P. Comon, "Independent component analysis, a new concept?" Signal Process., vol. 36, no. 3, pp. 287–314, 1994.
[17] C. Constantinopoulos, M. K. Titsias, and A. Likas, "Bayesian feature and model selection for Gaussian mixture models," IEEE Trans. Pattern Anal. Mach. Intell., vol. 28, no. 6, pp. 1013–1018, Jun. 2006.
[18] H. Cui, R. Li, and W. Zhong, "Model-free feature screening for ultrahigh dimensional discriminant analysis," J. Amer. Statistical Assoc., vol. 110, no. 510, pp. 630–641, 2015.
[19] B. B. Damodaran, N. Courty, and S. Lefèvre, "Sparse Hilbert-Schmidt independence criterion and surrogate-kernel-based feature selection for hyperspectral image classification," IEEE Trans. Geosci. Remote Sens., vol. 55, no. 4, pp. 2385–2398, Apr. 2017.
[20] X. Dang, D. Nguyen, Y. Chen, and J. Zhang, "A new Gini correlation between quantitative and qualitative variables," 2018, arXiv:1809.09793.
[21] P. Demartines and J. Hérault, "Curvilinear component analysis: A self-organizing neural network for nonlinear mapping of data sets," IEEE Trans. Neural Netw., vol. 8, no. 1, pp. 148–154, Jan. 1997.
[22] A. A. Ding, J. G. Dy, Y. Li, and Y. Chang, "A robust-equitable measure for feature ranking and selection," J. Mach. Learn. Res., vol. 18, pp. 1–46, 2017.
[23] E. Edgington and P. Onghena, Randomization Tests, 4th ed. London, U.K./Boca Raton, FL, USA: Chapman and Hall/CRC, 2007.
[24] J. Fan and J. Lv, "Sure independence screening for ultrahigh dimensional feature space," J. Roy. Statist. Soc. B, vol. 70, no. 5, pp. 849–911, 2008.
[25] J. Fan, R. Samworth, and Y. Wu, "Ultrahigh dimensional feature selection: Beyond the linear model," J. Mach. Learn. Res., vol. 10, pp. 2013–2038, 2009.
[26] C. Gini, Variabilità e Mutabilità: Contributo allo Studio delle Distribuzioni e Relazioni Statistiche, Vol. III (part II). Bologna, Italy: Tipografia di Paolo Cuppin, 1912.
[27] C. Gini, "Sulla misura della concentrazione e della variabilità dei caratteri," Atti del Reale Istituto Veneto di Scienze, Lettere ed Arti, vol. 62, pp. 1203–1248, 1914. English translation: "On the measurement of concentration and variability of characters," Metron, vol. LXIII, no. 1, pp. 3–38, 2005.
[28] M. Goldman, B. Craft, A. N. Brooks, J. Zhu, and D. Haussler, "The UCSC Xena platform for cancer genomics data visualization and interpretation," bioRxiv, 2018.
[29] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola, "A kernel two-sample test," J. Mach. Learn. Res., vol. 13, pp. 723–773, 2012.
[30] J. Gui, Z. Sun, S. Ji, D. Tao, and T. Tan, "Feature selection based on structured sparsity: A comprehensive study," IEEE Trans. Neural Netw. Learn. Syst., vol. 28, no. 7, pp. 1490–1507, Jul. 2017.
[31] I. Guyon and A. Elisseeff, "An introduction to variable and feature selection," J. Mach. Learn. Res., vol. 3, pp. 1157–1182, 2003.
[32] X. He, M. Ji, C. Zhang, and H. Bao, "A variance minimization criterion to feature selection using Laplacian regularization," IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 10, pp. 2013–2025, Oct. 2011.
[33] G. E. Hinton and S. T. Roweis, "Stochastic neighbor embedding," in Proc. Advances Neural Inf. Process. Syst., 2002, vol. 15, pp. 833–840.
[34] B. Hjørland, "The foundation of the concept of relevance," J. Amer. Soc. Inf. Sci. Technol., vol. 61, no. 2, pp. 217–237, 2010.
[35] H. Hotelling, "Analysis of a complex of statistical variables into principal components," J. Educ. Psychol., vol. 24, no. 6, pp. 417–441, 1933.
[36] W. Hoeffding, "A class of statistics with asymptotically normal distribution," Ann. Math. Statist., vol. 19, pp. 293–325, 1948.
[37] X. Huo and G. Székely, "Fast computing for distance covariance," Technometrics, vol. 58, no. 4, pp. 435–447, 2016.
[38] F. J. Iannarilli Jr. and P. A. Rubin, "Feature selection for multiclass discrimination via mixed-integer linear programming," IEEE Trans. Pattern Anal. Mach. Intell., vol. 25, no. 6, pp. 779–783, Jun. 2003.
[39] K. Javed, H. A. Babri, and M. Saeed, "Feature selection based on class-dependent densities for high-dimensional binary data," IEEE Trans. Knowl. Data Eng., vol. 24, no. 3, pp. 465–477, Mar. 2012.
[40] G. H. John, R. Kohavi, and K. Pfleger, "Irrelevant features and the subset selection problem," in Proc. 11th Int. Conf. Mach. Learn., 1994, pp. 121–129.
[41] R. Kohavi and G. H. John, "Wrappers for feature subset selection," Artif. Intell., vol. 97, nos. 1–2, pp. 273–324, 1997.
[42] G. Koshevoy and K. Mosler, "Multivariate Gini indices," J. Multivariate Anal., vol. 60, no. 2, pp. 252–276, 1997.

[43] B. Krishnapuram, A. J. Hartemink, L. Carin, and M. A. T. Figueiredo, "A Bayesian approach to joint feature selection and classifier design," IEEE Trans. Pattern Anal. Mach. Intell., vol. 26, no. 9, pp. 1105–1111, Sep. 2004.
[44] N. Kwak and C. Choi, "Input feature selection for classification problems," IEEE Trans. Neural Netw., vol. 13, no. 1, pp. 143–159, Jan. 2002.
[45] N. Kwak and C. Choi, "Input feature selection by mutual information based on Parzen window," IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 12, pp. 1667–1671, Dec. 2002.
[46] Y. LeCun and C. Cortes, "MNIST handwritten digit database," 2010.
[47] D. D. Lee and H. S. Seung, "Learning the parts of objects by non-negative matrix factorization," Nature, vol. 401, pp. 788–791, 1999.
[48] L. Lefakis and F. Fleuret, "Jointly informative feature selection made tractable by Gaussian modeling," J. Mach. Learn. Res., vol. 17, pp. 1–39, 2016.
[49] R. Li, W. Zhong, and L. Zhu, "Feature screening via distance correlation learning," J. Amer. Statistical Assoc., vol. 107, no. 499, pp. 1129–1139, 2012.
[50] H. Liu and L. Yu, "Toward integrating feature selection algorithms for classification and clustering," IEEE Trans. Knowl. Data Eng., vol. 17, no. 4, pp. 491–502, Apr. 2005.
[51] R. Lyons, "Distance covariance in metric spaces," Ann. Probability, vol. 41, no. 5, pp. 3284–3305, 2013.
[52] P. Maji and S. K. Pal, "Feature selection using f-information measures in fuzzy approximation spaces," IEEE Trans. Knowl. Data Eng., vol. 22, no. 6, pp. 854–867, Jun. 2010.
[53] Q. Mao and I. W. Tsang, "A feature selection method for multivariate performance measures," IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 9, pp. 2051–2063, Sep. 2013.
[54] J. Mercer, "Functions of positive and negative type, and their connection with the theory of integral equations," Philos. Trans. Roy. Soc. A, vol. 209, pp. 415–446, 1909.
[55] P. Mitra, C. A. Murthy, and K. Pal, "Unsupervised feature selection using feature similarity," IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 3, pp. 301–312, Mar. 2002.
[56] T. Naghibi, S. Hoffmann, and B. Pfister, "A semidefinite programming based search strategy for feature selection with mutual information measure," IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 8, pp. 1529–1541, Aug. 2015.
[57] R. Nilsson, J. M. Peña, J. Björkegren, and J. Tegnér, "Consistent feature selection for pattern recognition in polynomial time," J. Mach. Learn. Res., vol. 8, pp. 589–612, 2007.
[58] J. Novovičová, P. Pudil, and J. Kittler, "Divergence based feature selection for multimodal class densities," IEEE Trans. Pattern Anal. Mach. Intell., vol. 18, no. 2, pp. 218–223, Feb. 1996.
[59] J. S. Parker et al., "Supervised risk predictor of breast cancer based on intrinsic subtypes," J. Clin. Oncology, vol. 27, no. 8, pp. 1160–1167, 2009.
[60] K. Pearson, "Notes on regression and inheritance in the case of two parents," Proc. Roy. Soc. London, vol. 58, pp. 240–242, 1895.
[61] H. Peng, F. Long, and C. Ding, "Feature selection based on mutual information: Criteria of max-dependency, max-relevance, and min-redundancy," IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, no. 8, pp. 1226–1238, Aug. 2005.
[62] S. Ren, S. Huang, J. Ye, and X. Qian, "Safe feature screening for generalized LASSO," IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 12, pp. 2992–3006, Dec. 2018.
[63] S. T. Roweis and L. K. Saul, "Locally linear embedding," Science, vol. 290, pp. 2323–2326, 2000.
[64] D. Sejdinovic, B. Sriperumbudur, A. Gretton, and K. Fukumizu, "Equivalence of distance-based and RKHS-based statistics in hypothesis testing," Ann. Statist., vol. 41, no. 5, pp. 2263–2291, 2013.
[65] R. Serfling, Approximation Theorems of Mathematical Statistics. New York, NY, USA: Wiley, 1980.
[66] M. Shah, M. Marchand, and J. Corbeil, "Feature selection with conjunctions of decision stumps and learning from microarray data," IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 1, pp. 174–186, Jan. 2012.
[67] V. Sindhwani, S. Rakshit, D. Deodhare, D. Erdogmus, J. C. Principe, and P. Niyogi, "Feature selection in MLPs and SVMs based on maximum output information," IEEE Trans. Neural Netw., vol. 15, no. 4, pp. 937–948, Jul. 2004.
[68] E. Slutsky, "Über stochastische Asymptoten und Grenzwerte," Metron (in German), vol. 5, no. 3, pp. 3–89, 1925.
[69] A. J. Smola, A. Gretton, L. Song, and B. Schölkopf, "A Hilbert space embedding for distributions," in Proc. Conf. Algorithmic Learn. Theory, 2007, vol. 4754, pp. 13–31.
[70] P. Somol, P. Pudil, and J. Kittler, "Fast branch & bound algorithms for optimal feature selection," IEEE Trans. Pattern Anal. Mach. Intell., vol. 26, no. 7, pp. 900–912, Jul. 2004.


[71] L. Song, A. Smola, A. Gretton, J. Bedo, and K. Borgwardt, "Feature selection via dependence maximization," J. Mach. Learn. Res., vol. 13, pp. 1393–1434, 2012.
[72] H. Stoppiglia, G. Dreyfus, R. Dubois, and Y. Oussar, "Ranking a random feature for variable and feature selection," J. Mach. Learn. Res., vol. 3, pp. 1399–1414, 2003.
[73] Y. Sun, S. Todorovic, and S. Goodison, "Local-learning-based feature selection for high-dimensional data analysis," IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 9, pp. 1610–1626, Sep. 2010.
[74] G. J. Székely and M. L. Rizzo, "Testing for equal distributions in high dimension," InterStat, vol. 5, pp. 1249–1272, 2004.
[75] G. J. Székely and M. L. Rizzo, "A new test for multivariate normality," J. Multivariate Anal., vol. 93, no. 1, pp. 58–80, 2005.
[76] G. J. Székely, M. L. Rizzo, and N. K. Bakirov, "Measuring and testing dependence by correlation of distances," Ann. Statist., vol. 35, no. 6, pp. 2769–2794, 2007.
[77] G. J. Székely and M. L. Rizzo, "Brownian distance covariance," Ann. Appl. Statist., vol. 3, no. 4, pp. 1233–1303, 2009.
[78] G. J. Székely and M. L. Rizzo, "Energy statistics: A class of statistics based on distances," J. Statist. Planning Inference, vol. 143, no. 8, pp. 1249–1272, 2013.
[79] G. J. Székely and M. L. Rizzo, "Partial distance correlation with methods for dissimilarities," Ann. Statist., vol. 42, no. 6, pp. 2382–2412, 2014.
[80] J. B. Tenenbaum, V. de Silva, and J. C. Langford, "A global geometric framework for nonlinear dimensionality reduction," Science, vol. 290, pp. 2319–2323, 2000.
[81] W. S. Torgerson, "Multidimensional scaling I: Theory and method," Psychometrika, vol. 17, no. 4, pp. 401–419, 1952.
[82] E. Tuv, A. Borisov, G. Runger, and K. Torkkola, "Feature selection with ensembles, artificial variables, and redundancy elimination," J. Mach. Learn. Res., vol. 10, pp. 1341–1366, 2009.
[83] H. Wang, D. Bell, and F. Murtagh, "Axiomatic approach to feature subset selection based on relevance," IEEE Trans. Pattern Anal. Mach. Intell., vol. 21, no. 3, pp. 271–277, Mar. 1999.
[84] J. Wang, P. Zhao, S. C. H. Hoi, and R. Jin, "Online feature selection and its applications," IEEE Trans. Knowl. Data Eng., vol. 26, no. 3, pp. 698–710, Mar. 2014.
[85] K. Wang, R. He, L. Wang, W. Wang, and T. Tan, "Joint feature selection and subspace learning for cross-modal retrieval," IEEE Trans. Pattern Anal. Mach. Intell., vol. 38, no. 10, pp. 2010–2023, Oct. 2016.
[86] L. Wang, "Feature selection with kernel class separability," IEEE Trans. Pattern Anal. Mach. Intell., vol. 30, no. 9, pp. 1534–1546, Sep. 2008.
[87] L. Wang, N. Zhou, and F. Chu, "A general wrapper approach to selection of class-dependent features," IEEE Trans. Neural Netw., vol. 19, no. 7, pp. 1267–1278, Jul. 2008.
[88] H. Wei and S. A. Billings, "Feature subset selection and ranking for data dimensionality reduction," IEEE Trans. Pattern Anal. Mach. Intell., vol. 29, no. 1, pp. 162–166, Jan. 2007.
[89] X. Wu, K. Yu, W. Ding, H. Wang, and X. Zhu, "Online feature selection with streaming features," IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 5, pp. 1178–1192, May 2013.
[90] Z. J. Xiang, Y. Wang, and P. J. Ramadge, "Screening tests for Lasso problems," IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 5, pp. 1008–1027, May 2017.
[91] S. Yang and B. Hu, "Discriminative feature selection by nonparametric Bayes error minimization," IEEE Trans. Knowl. Data Eng., vol. 24, no. 8, pp. 1422–1434, Aug. 2012.
[92] C. Yao, Y. Liu, B. Jiang, J. Han, and J. Han, "LLE score: A new filter-based unsupervised feature selection method based on nonlinear manifold embedding and its application to image recognition," IEEE Trans. Image Process., vol. 26, no. 11, pp. 5257–5269, Nov. 2017.
[93] L. Yu and H. Liu, "Efficient feature selection via analysis of relevance and redundancy," J. Mach. Learn. Res., vol. 5, pp. 1205–1224, 2004.
[94] H. Zeng and Y. Cheung, "Feature selection and kernel learning for local learning-based clustering," IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 8, pp. 1532–1547, Aug. 2011.
[95] Z. Zhao, L. Wang, H. Liu, and J. Ye, "On similarity preserving feature selection," IEEE Trans. Knowl. Data Eng., vol. 25, no. 3, pp. 619–632, Mar. 2013.
[96] Y. Zhai, Y. Ong, and I. Tsang, "Making trillion correlations feasible in feature grouping and selection," IEEE Trans. Pattern Anal. Mach. Intell., vol. 38, no. 12, pp. 2472–2486, Dec. 2016.
[97] L. Zhou, L. Wang, and C. Shen, "Feature selection with redundancy-constrained class separability," IEEE Trans. Neural Netw., vol. 21, no. 5, pp. 853–858, May 2010.

Silu Zhang received the BS degree in bioengineering from Zhejiang University, China, and the MS degree in chemical engineering from North Carolina State University, both in 2011, and a second MS degree and the PhD degree in computer science from the University of Mississippi, in 2017 and 2019. She is currently a diagnostic imaging research scientist with St. Jude Children's Research Hospital, working on feature extraction and selection from MRI images and clustering analysis of brain tumor subtypes.

Xin Dang received the BS degree in applied mathematics from Chongqing University, China, in 1991, and the master's and PhD degrees in statistics from the University of Texas at Dallas, in 2003 and 2005. Currently, she is a professor in the Department of Mathematics, University of Mississippi. Her research interests include robust and nonparametric statistics, statistical and numerical computing, and multivariate data analysis. In particular, she has focused on data depth and applications, bioinformatics, machine learning, and robust procedure computation. She is a member of the IMS, ASA, ICSA, and the IEEE.

Dao Nguyen received the bachelor of computer science degree from the University of Wollongong, Australia, in 1997, the PhD degree in electrical engineering from the University of Science and Technology, Korea, in 2010, and a second PhD degree in statistics from the University of Michigan, Ann Arbor, in 2016. He was a postdoctoral scholar at the University of California, Berkeley for more than a year before becoming an assistant professor of mathematics at the University of Mississippi, in 2017. His research interests include machine learning, dynamics modeling, stochastic optimization, and Bayesian analysis. He is a member of the ASA and ISBA.

Dawn Wilkins received the BA and MA degrees in mathematical systems from Sangamon State University (now the University of Illinois-Springfield), in 1981 and 1983, respectively, and the PhD degree in computer science from Vanderbilt University, in 1995. She joined the faculty of the University of Mississippi, where she is currently a professor of computer and information science. Her research interests are primarily in machine learning, computational biology, bioinformatics, and database systems.

Yixin Chen received the BS degree from the Department of Automation, Beijing Polytechnic University, in 1995, the MS degree in control theory and application from Tsinghua University, in 1998, the MS and PhD degrees in electrical engineering from the University of Wyoming, in 1999 and 2001, and another PhD degree in computer science from the Pennsylvania State University, in 2003. He was an assistant professor of computer science at the University of New Orleans. He is currently a professor of computer and information science at the University of Mississippi. His research interests include machine learning, data mining, computer vision, bioinformatics, and robotics and control. He is a member of the ACM, IEEE, and the IEEE Computer Society.

