Distance functions for categorical and mixed variables






Pattern Recognition Letters 29 (2008) 986–993
doi:10.1016/j.patrec.2008.01.021


Brendan McCane *, Michael Albert

Department of Computer Science, University of Otago, P.O. Box 56, Dunedin 9015, Otago, New Zealand

* Corresponding author. Fax: +64 3 479 8529. E-mail addresses: [email protected] (B. McCane), [email protected] (M. Albert).

Received 28 May 2007; received in revised form 13 December 2007; available online 6 February 2008

Communicated by W. Pedrycz

Abstract

In this paper, we compare three different measures for computing Mahalanobis-type distances between random variables consisting of several categorical dimensions or mixed categorical and numeric dimensions – regular simplex, tensor product space, and symbolic covariance. The tensor product space and symbolic covariance distances are new contributions. We test the methods on two application domains – classification and principal components analysis. We find that the tensor product space distance is impractical for most problems. Overall, the regular simplex method is the most successful in both domains, but the symbolic covariance method has several advantages, including time and space efficiency, applicability to different contexts, and theoretical neatness.

© 2008 Elsevier B.V. All rights reserved.

Keywords: Categorical data; Mahalanobis distance; Distance functions

1. Introduction

In this paper, we compare three different measures for computing Mahalanobis-type distances between random variables consisting of several categorical dimensions or mixed categorical and numeric dimensions – regular simplex, tensor product space, and symbolic covariance. In each case, distances are computed via an interpretation of the categorical data in some real vector space. For carrying out practical computations, the dimension of this space is important, and the lower the dimension, the easier the computations will be. The regular simplex method is well known and involves replacing a single k-level categorical variable with a (k - 1)-dimensional numerical variable in such a way that each level of the categorical variable is mapped to a vertex of a regular simplex, and is thereby at the same distance from every other level of that variable. For a categorical variable with three levels, this results in an equilateral triangle in R^2. The dimension of the embedding space is thus the sum of the number of levels of all the variables, minus the number of variables. The tensor product space method is commonly used in the location model for dealing with mixed variables (Kurzanowski, 1993), although here we develop a new derivation which gives rise to a Mahalanobis-type distance in the product space. The dimension of the embedding space is the product of the number of levels of all the variables. The final method seems to be completely new and calculates an analogue of the covariance between any two categorical variables, which is used to create a Mahalanobis-type distance. In this case, strictly speaking, there is no embedding space, but all computations take place in a dimension equal to the number of variables. For comparison purposes, we also compare the results in a classification experiment with the heterogeneous value difference metric (HVDM) (Wilson and Martinez, 1997) and with Naive Bayes.

Why do we need to develop Mahalanobis-type distances for categorical and mixed variables? The chief reason is that most research on calculating distances of this sort is either heuristic (Wilson and Martinez, 1997; Gower, 1971; Cost and Salzberg, 1993; Huang, 1998), assumes the variables are independent (Kurzanowski, 1993; Bar-Hen and Daudin, 1995; Cuadras et al., 1997; Goodall, 1966; Li and Biswas, 2002), or makes use of the special case of ordinal data (Podani, 1999) by assuming an underlying distribution and a discretisation function. The value difference metric (VDM), introduced by Stanfill and Waltz (1986), is one of the most popular categorical distances and is particular to classification problems. The metric is based on sample probabilities

$$\mathrm{vdm}(x_a, y_a) = \left( \sum_c P(c \mid x_a)^2 \right)^{0.5} \sum_c \left| P(c \mid x_a) - P(c \mid y_a) \right|^2 \qquad (1)$$

where vdm is the value difference metric on attribute a, and P(c | x_a) is the probability of class c given that attribute a has the value x_a. The first term is a weighting term given by Stanfill and Waltz (1986) to appropriately weight the importance of each attribute; the weighting term does not appear in the description given by Wilson and Martinez (1997). The total VDM is then the sum of vdm over all attributes a.
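As an illustration, here is a minimal Python sketch of Eq. (1) for a single attribute (our own code, not from the paper; the function and argument names are hypothetical). It estimates the class-conditional probabilities P(c | x_a) from a labelled sample and applies the Stanfill and Waltz weighting; the total VDM would simply sum this quantity over all attributes.

```python
import numpy as np
from collections import defaultdict

def vdm_attribute(values, labels, x, y):
    """Value difference metric of Eq. (1) for one categorical attribute.

    values : attribute value of each training sample
    labels : class label of each training sample
    x, y   : the two attribute values being compared
    """
    classes = sorted(set(labels))
    counts = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(int)
    for v, c in zip(values, labels):
        counts[v][c] += 1
        totals[v] += 1
    # Sample estimates of P(c | x_a) and P(c | y_a)
    px = np.array([counts[x][c] / totals[x] for c in classes])
    py = np.array([counts[y][c] / totals[y] for c in classes])
    weight = np.sqrt(np.sum(px ** 2))        # Stanfill-Waltz attribute weighting term
    return weight * np.sum(np.abs(px - py) ** 2)
```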

The most influential study on mixed distances is that of Wilson and Martinez (1997). They introduce several new metrics based on the VDM: the heterogeneous value difference metric (HVDM), the interpolated VDM (IVDM) and the windowed VDM (WVDM). The HVDM is very similar to the similarity metric of Gower (1971) and is used for comparison with the other metrics in our study. The IVDM is an extension of the VDM that takes into account continuous attributes by discretising them to calculate the sample probabilities used in Eq. (1). However, the actual probabilities used are interpolated between neighbouring bins, depending on where in the bin the continuous value actually falls. The WVDM is a more sophisticated version of the IVDM, where the probabilities are calculated at each point which occurs in the training set using a sliding window similar to a Parzen window (Parzen, 1962). The simplified VDM (without the initial weighting term) and related measures have been used in several influential works, including Cost and Salzberg (1993) and Domingos (1996).

The limitation of these techniques is that they are constrained to a classification context, and almost all assume that attributes are independent of each other. Thus, two highly correlated attributes will contribute twice as much to the evidence as they should. We attempt to avoid these problems in the techniques presented here. Also, we are not just interested in distances between populations as in Kurczynski (1970), but also in distances between individuals and distances between an individual and a population. The former is useful for clustering algorithms, and the latter for classification algorithms. The applications for such distances are predominantly in classification, clustering and dimensionality reduction. We give examples of classification and dimensionality reduction in the results.

2. Methods

2.1. Regular simplex method

The regular simplex method is the simplest of all the methods. The basic idea is to assume that any two distinct levels of a categorical variable are separated by the same distance. To achieve this, each level of an n-level variable is associated with a distinct vertex of a regular simplex in (n - 1)-dimensional space. For simplicity, the distance between levels is assumed to be 1. For example, given a categorical variable X ∈ {A, B, C}, A could be mapped to (0, 0), B to (1, 0) and C to (1/2, √3/2). The choice of simplex is arbitrary, and has no effect on the subsequent analysis. One possible choice is to continue the development of the sequence above. Take the existing set of vertices, and append their centroid. By the symmetry of the regular simplex, this point is at the same distance from all the preceding vertices. Now add one more coordinate, taking the value 0 on the original set of points, and set its value on the new point to be such that the distance from any, and hence all, other elements of the set is 1.
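A minimal sketch of this recursive construction (our own illustration; the paper gives no code):

```python
import numpy as np

def regular_simplex_vertices(n):
    """Vertices of a regular simplex with unit edge length for an n-level variable.

    Returns an (n, n-1) array with one (n-1)-dimensional vertex per level, built by
    repeatedly appending the centroid of the existing vertices and lifting it into a
    new coordinate so that it lies at distance 1 from all of them.
    """
    vertices = np.zeros((1, 0))                 # a single level maps to the empty point
    for k in range(2, n + 1):
        centroid = vertices.mean(axis=0)
        # distance from the centroid to any existing vertex (all equal by symmetry)
        r = np.linalg.norm(vertices[0] - centroid) if vertices.shape[1] else 0.0
        lift = np.sqrt(1.0 - r ** 2)            # new coordinate making the distance exactly 1
        vertices = np.hstack([vertices, np.zeros((k - 1, 1))])
        vertices = np.vstack([vertices, np.append(centroid, lift)])
    return vertices

# For a three-level variable this reproduces an equilateral triangle with unit sides,
# e.g. (0, 0), (1, 0) and (1/2, sqrt(3)/2), up to the arbitrary choice of simplex.
print(regular_simplex_vertices(3))
```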

Each level of a variable is then replaced by the corresponding vertex in the simplex. For a problem with c categorical dimensions, where the kth variable has n_k levels, an observation of the c-dimensional variables is replaced with a variable with $\sum_{k=1}^{c}(n_k - 1)$ numeric dimensions. A distance function can then be defined based on the covariance matrix of the replaced data points

$$d_{rs}(x_1, x_2) = (x'_1 - x'_2)^T \Sigma_{rs}^{-1} (x'_1 - x'_2) \qquad (2)$$

where x'_1 is the regular simplex representation of the input vector x_1. Let S_L = $\sum_{k=1}^{c} n_k$, the sum of the levels across all categorical variables. Σ_rs is of size (S_L - c) × (S_L - c) and can be calculated naively in time O(N_s S_L^2), where N_s is the number of samples in the data set; by naive we mean without using optimizations such as sparse matrices and sparse matrix multiplication. Σ_rs^{-1} can be calculated in time O(S_L^3), or in time O(S_L^{2.376}) using a better bound for matrix multiplication. Therefore the space complexity of this method is O(S_L^2) and the time complexity is O(N_s S_L^2 + S_L^3).
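To illustrate how the pieces fit together, the following sketch encodes integer-coded categorical data with the simplex vertices, estimates the covariance of the encoded points and evaluates Eq. (2). It is our own illustration, not the authors' code: simplex_encode and d_rs are hypothetical names, and the regular_simplex_vertices helper from the previous sketch is assumed to be in scope.

```python
import numpy as np

def simplex_encode(X, levels):
    """Replace each categorical column of X with its regular simplex vertex.

    X      : (N, c) integer array with X[i, k] in {0, ..., levels[k] - 1}
    levels : number of levels n_k of each of the c variables
    """
    blocks = []
    for k, n_k in enumerate(levels):
        V = regular_simplex_vertices(n_k)    # helper from the earlier sketch (assumed in scope)
        blocks.append(V[X[:, k]])            # (N, n_k - 1) block for variable k
    return np.hstack(blocks)

def d_rs(x1, x2, inv_cov, levels):
    """Regular simplex distance of Eq. (2) between two categorical observations."""
    e1, e2 = simplex_encode(np.array([x1, x2]), levels)
    diff = e1 - e2
    return diff @ inv_cov @ diff

# Toy data: two variables with three levels each.
X = np.array([[0, 1], [1, 2], [2, 0], [0, 0], [1, 1]])
E = simplex_encode(X, levels=[3, 3])
inv_cov = np.linalg.pinv(np.cov(E, rowvar=False))   # pseudo-inverse guards against singularity
print(d_rs([0, 1], [2, 0], inv_cov, levels=[3, 3]))
```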

2.2. Tensor product space method

The tensor product space method is most similar in spirit to the original Mahalanobis distance derivation. The underlying idea of the Mahalanobis distance is that we wish to calculate the Euclidean distance between two n-dimensional points, p_1 and p_2, where each dimension is independent of the others. Unfortunately, p_1 and p_2 cannot be measured directly, but observations q_1 and q_2, which are linear transformations of the original values, can be. That is, for some matrix A,

$$q_1 = A p_1$$

We want to calculate the distance between p_1 and p_2, so

$$\begin{aligned}
d(p_1, p_2) &= (p_1 - p_2)^T (p_1 - p_2) && (3) \\
&= (A^{-1} q_1 - A^{-1} q_2)^T (A^{-1} q_1 - A^{-1} q_2) && (4) \\
&= (q_1 - q_2)^T (A^{-1})^T A^{-1} (q_1 - q_2) && (5)
\end{aligned}$$

The matrix (A^{-1})^T A^{-1} is just the inverse covariance matrix of the population of q's, and we are left with the classic Mahalanobis distance.
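A small numeric check of this identity (our own sketch; the mixing matrix A and the sample size are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[2.0, 0.5],
              [0.3, 1.5]])                  # an arbitrary invertible mixing matrix
P = rng.standard_normal((1000, 2))          # independent, unit-variance latent samples p
Q = P @ A.T                                 # observed samples q = A p

p1, p2 = P[0], P[1]
q1, q2 = Q[0], Q[1]
inv_A = np.linalg.inv(A)

d_latent = (p1 - p2) @ (p1 - p2)                          # Eq. (3)
d_observed = (q1 - q2) @ inv_A.T @ inv_A @ (q1 - q2)      # Eq. (5)
print(np.isclose(d_latent, d_observed))                   # True

# The sample covariance of q approaches A A^T, so its inverse plays the role
# of (A^{-1})^T A^{-1} in Eq. (5), up to sampling error.
d_empirical = (q1 - q2) @ np.linalg.inv(np.cov(Q, rowvar=False)) @ (q1 - q2)
print(d_latent, d_empirical)
```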

Now consider a random variable X which is defined over a space of c categorical variables, where the kth variable has n_k levels. For two categorical variables to be independent, the product of the marginal distributions must equal the actual distribution. That is,

$$P(X_i = a, X_j = b) = P(X_i = a)\, P(X_j = b).$$

The joint distribution P(X_i, X_j) and the subsequent marginal distributions P(X_i) can be estimated from the sample population. The joint distribution may not be independent, and to mimic the construction above we need to find a transformation from the dependent joint distribution to an independent joint distribution. The independent joint is estimated simply as the product of the marginals.

We are left with the problem of estimating a linear transformation between tensor product spaces. The initial probability tensor space is a dependent observable space, T_D,

$$T_D \cong X_1 \otimes X_2 \otimes X_3 \cdots$$

where T_{D_i} = P(X_1 = a, X_2 = b, X_3 = c, ...). For example, with a two-dimensional random variable, where the first dimension has 2 levels and the second has 3, we would get a tensor space of 6 dimensions. We want a linear transformation from T_D to T_I, where T_I is the independent tensor space with T_{I_i} = P(X_1 = a) P(X_2 = b) P(X_3 = c) .... The problem is ill-posed, so there are many possible solutions. We have chosen the solution which produces a transformation as close as possible to the identity. Since both tensor product spaces are probability spaces, the transformation matrix, M, must be a column stochastic matrix, which can be defined as

$$M_{ii} = \begin{cases} 1 & \text{if } s_i < t_i \\ t_i / s_i & \text{otherwise} \end{cases}$$

$$M_{ij} = \begin{cases} \dfrac{t_i - M_{ii} s_i}{s_j} & \text{if } 1 - M_{jj} \ge \dfrac{t_i - M_{ii} s_i}{s_j} \\ 1 - \sum_{k=1}^{j} M_{kj} & \text{otherwise} \end{cases}$$

where s_i and t_i are the joint probabilities of the dependent and independent tensor product spaces, respectively.
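As an illustration of the ingredients involved, the following sketch (our own; the function name is hypothetical) estimates the dependent joint tensor, the s_i, and the product-of-marginals tensor, the t_i, from integer-coded samples. Construction of the column-stochastic matrix M itself is not shown.

```python
import numpy as np

def joint_and_independent_tensors(X, levels):
    """Estimate the dependent joint tensor (s) and the product of marginals (t).

    X      : (N, c) integer-coded categorical samples
    levels : number of levels n_k of each of the c variables
    Both tensors are returned flattened to length prod(levels).
    """
    N = X.shape[0]
    joint = np.zeros(levels)
    for row in X:
        joint[tuple(row)] += 1.0 / N                       # dependent joint probabilities
    marginals = [joint.sum(axis=tuple(j for j in range(len(levels)) if j != k))
                 for k in range(len(levels))]
    independent = marginals[0]
    for m in marginals[1:]:
        independent = np.multiply.outer(independent, m)    # product of the marginals
    return joint.ravel(), independent.ravel()

# A 2-level by 3-level problem gives 6-dimensional tensors, as in the text.
s, t = joint_and_independent_tensors(np.array([[0, 1], [1, 2], [0, 1], [1, 0]]), levels=(2, 3))
print(s, t)    # both sum to 1
```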

The matrix M is a transformation from a dependent space to an independent one, and as such is analogous to A^{-1} in Eq. (5). By analogy, we can call the matrix Σ_tp^{-1} = M^T M the inverse covariance matrix of the categorical variables. A Mahalanobis-like distance function can then be defined

$$d_{tp}(x_1, x_2) = (x'_1 - x'_2)^T \Sigma_{tp}^{-1} (x'_1 - x'_2) \qquad (6)$$

where x'_1 is the tensor product space representation of x_1. Let P_L = $\prod_{k=1}^{c} n_k$, the product of the levels. Matrix M is of size P_L × P_L; therefore the naive size complexity of this method is O(P_L^2) and the naive time complexity is O(P_L^3).

2.3. Symbolic covariance

Consider the formula for covariance of two field-valued (generally real-valued) variables X and Y,

$$\sigma^2(X, Y) = E((X - \bar{X})(Y - \bar{Y}))$$

where both E and an overbar indicate expectation. Now consider two categorical random variables A and B with values A_1 through A_n and B_1 through B_m, respectively. For 1 ≤ i ≤ n and 1 ≤ s ≤ m let

$$p_{is} \overset{\text{def}}{=} P(A = A_i, B = B_s), \qquad a_i \overset{\text{def}}{=} P(A = A_i), \qquad b_s \overset{\text{def}}{=} P(B = B_s)$$

Consider A_1 through A_n and B_1 through B_m as symbolic variables and define

$$\bar{A} \overset{\text{def}}{=} a_1 A_1 + a_2 A_2 + \cdots + a_n A_n, \qquad \bar{B} \overset{\text{def}}{=} b_1 B_1 + b_2 B_2 + \cdots + b_m B_m$$

where Ā and B̄ are also symbolic expressions. Then we have

$$A_j - \bar{A} = \sum_{i=1}^{n} a_i (A_j - A_i)$$

As A_j and A_i are categories and not values, the term A_j - A_i does not really make sense, so we replace it with a more generic term, d, which we call the distinction between two categorical values:

$$A_j - \bar{A} \overset{\text{def}}{=} \sum_{i=1}^{n} a_i\, d(A_j, A_i) \qquad (7)$$

We define d to have the following properties:

$$d(A_i, A_i) \overset{\text{def}}{=} 0 \qquad (8)$$

$$d(A_i, A_j) \overset{\text{def}}{=} -d(A_j, A_i) \qquad (9)$$

The definition in (9) is required so the expression X - X̄ can be positive or negative, as is the case with numeric attributes. Without this definition, the symbolic covariance could never be 0 – even if the variables are uncorrelated. Also, without this definition, the symbolic covariance would not be invariant to a reordering of categories (see Property 2 below). We note that a side effect of this definition is that the symbolic covariance could be either positive or negative, as in the case of numeric covariance. However, the positivity or negativity is somewhat arbitrary due to the ordering of the categories – if negative covariances are not needed, the absolute value of the symbolic covariance can be taken.

The underlying motivation is that we view the expression X - X̄ as representing the "difference", in the sense of "being different from", between an observation of the random variable X and its mean, rather than the "difference" in the sense of "a value computed by the rules of arithmetic". While, for real-valued variables, it makes perfect sense to collapse these two meanings, this collapse is not at all self-evident, nor necessarily desirable, for categorical ones.

We now propose the symbolic covariance

$$\sigma_s^2(A, B) \overset{\text{def}}{=} E((A - \bar{A})(B - \bar{B})) \qquad (10)$$

$$= \sum_{i=1}^{n} \sum_{s=1}^{m} p_{is} \sum_{j=1}^{n} \sum_{t=1}^{m} a_j b_t\, d(A_i, A_j)\, d(B_s, B_t) \qquad (11)$$

$$= \sum_{i<j} \sum_{s<t} (p_{is} a_j b_t - p_{js} a_i b_t - p_{it} a_j b_s + p_{jt} a_i b_s)\, d(A_i, A_j)\, d(B_s, B_t) \qquad (12)$$

This remains a symbolic expression. We can realise an actual value for σ²_s by choosing appropriate values of d(A_i, A_j). In the absence of other information, choosing d(A_i, A_j) = 1 for i < j and -1 for i > j is a reasonable assumption. For ordinal variables, we might choose a distinction based on the ordering.
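A minimal sketch of Eq. (11) under this default distinction, evaluated from a joint probability table (our own code; symbolic_covariance is a hypothetical name):

```python
import numpy as np

def symbolic_covariance(p):
    """Symbolic covariance of Eq. (11) from a joint probability table.

    p : (n, m) array with p[i, s] = P(A = A_i, B = B_s)
    Uses the default distinction d(A_i, A_j) = +1 for i < j and -1 for i > j.
    """
    p = np.asarray(p, dtype=float)
    n, m = p.shape
    a = p.sum(axis=1)                        # marginals of A
    b = p.sum(axis=0)                        # marginals of B
    dA = -np.sign(np.subtract.outer(np.arange(n), np.arange(n)))   # d(A_i, A_j) = sign(j - i)
    dB = -np.sign(np.subtract.outer(np.arange(m), np.arange(m)))
    A_shift = dA @ a                         # (A_i - Abar) via Eq. (7)
    B_shift = dB @ b                         # (B_s - Bbar) via Eq. (7)
    return A_shift @ p @ B_shift             # sum_{i,s} p_is (A_i - Abar)(B_s - Bbar)

# Property 1: an independent joint table gives zero symbolic covariance.
print(symbolic_covariance(np.outer([0.3, 0.7], [0.2, 0.5, 0.3])))
```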

Okada (2000) apparently almost discovered σ²_s (he did not provide details in his paper), but it seems he failed to realise the necessity of setting d(A_i, A_j) = -d(A_j, A_i).

Aside from the pragmatic possibility of using this symbolic covariance as an ingredient in a Mahalanobis-type distance, it has a number of attractive properties.

Property 1 (Independence). If A and B are independent, then p_{is} = a_i b_s and it follows immediately that σ²_s(A, B) = 0.

Property 2 (Renaming). The quantity σ²_s(A, B) is invariant under a renaming of the categories.

This claim is easily verified by direct computation in the case where A_i and A_{i+1} are exchanged. The sign changes in the d expressions are exactly balanced by the reordering of the terms in their multipliers.

Property 3 (σ²_s(A, A)). If A is a categorical variable with n levels, then the quantity σ²_s(A, A) is maximised when each level is equally likely.

We also note that the symbolic covariance has some similarities to the χ² statistic. While χ² may be used as a proxy for covariance, its purpose is very different. It essentially asks what is the probability that two (or more) variables are independent, and this is not the same as asking to what extent two variables are independent.

The symbolic covariance is technically only defined between two categorical variables; however, since we can use Eq. (7) to transform a categorical variable to a (mean-shifted) real number, we can use the standard definition of covariance for calculating the covariance between a categorical variable and a numeric one. We show the results of a mean-shift in the examples below. We note what looks like a paradox in calculating A_i - Ā, in particular

$$(A_i - \bar{A}) - (A_j - \bar{A}) \neq d(A_i, A_j)$$

In effect, the function A_i - Ā as defined in Eq. (7) is more like a projection operator than a one-dimensional difference operator.

A Mahalanobis-like distance function can be defined using the symbolic covariance matrix

$$d_{sc}(x_1, x_2) = (D(x_1 - x_2))^T \Sigma_{sc}^{-1} D(x_1 - x_2) \qquad (13)$$

where D(x_1 - x_2) implies applying d to each dimension of the vector. Σ_sc is of size c × c and, since calculating the covariance between two variables takes n_i n_j operations, calculating the covariance matrix takes O(S_L^2). Therefore the size complexity of the method is O(c^2) and the time complexity is O(S_L^2 + c^3).
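A sketch of how Eq. (13) could be assembled from integer-coded data (our own illustration; it reuses the symbolic_covariance sketch above, and because the quadratic form is unchanged when the difference vector is negated, the sign convention of the per-dimension distinction does not affect the distance):

```python
import numpy as np

def symbolic_covariance_matrix(X, levels):
    """c x c matrix of pairwise symbolic covariances estimated from integer-coded data."""
    N, c = X.shape
    S = np.zeros((c, c))
    for i in range(c):
        for j in range(c):
            joint = np.zeros((levels[i], levels[j]))
            for row in X:
                joint[row[i], row[j]] += 1.0 / N
            S[i, j] = symbolic_covariance(joint)   # from the earlier sketch (assumed in scope)
    return S

def d_sc(x1, x2, inv_sym_cov):
    """Distance of Eq. (13); d(A_i, A_j) = sign(j - i) is applied in each dimension."""
    Dx = np.sign(np.asarray(x2) - np.asarray(x1))
    return Dx @ inv_sym_cov @ Dx
```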

2.3.1. Some simple examples

Let us consider some simple examples to show the utility of the method. Table 1 shows four samples from a population, each with four binary attributes or variables. Looking at the variables we would expect that A and B are perfectly correlated (either positively or negatively), A and C are uncorrelated, and A and D are partially correlated.

Table 1
Some example categorical variables

Sample | A  | A_i - Ā | B  | B_i - B̄ | C  | C_i - C̄ | D  | D_i - D̄ | σ²_s
1      | A1 | 0.5     | B2 | -0.5    | C1 | 0.5     | D1 | 0.25    | σ_s(A, A) = 1.0
2      | A2 | -0.5    | B1 | 0.5     | C2 | -0.5    | D2 | -0.75   | σ_s(A, B) = -1.0
3      | A1 | 0.5     | B2 | -0.5    | C2 | -0.5    | D1 | 0.25    | σ_s(A, C) = 0
4      | A2 | -0.5    | B1 | 0.5     | C1 | 0.5     | D1 | 0.25    | σ_s(A, D) = 0.5

To calculate the symbolic covariance for variable A, we can first shift to the mean:

$$A_1 - \bar{A} = A_1 - (a_1 A_1 + a_2 A_2) = A_1 - (0.5 A_1 + 0.5 A_2) = 0.5 (A_1 - A_1) + 0.5 (A_1 - A_2) = 0.5\, d(A_1, A_1) + 0.5\, d(A_1, A_2) = 0.5$$

$$A_2 - \bar{A} = A_2 - (0.5 A_1 + 0.5 A_2) = 0.5 (A_2 - A_1) + 0.5 (A_2 - A_2) = 0.5\, d(A_2, A_1) + 0.5\, d(A_2, A_2) = -0.5$$

The results of similar calculations are shown in every other column of Table 1. Calculating the symbolic covariance is then straightforward:

$$\sigma_s(A, B) = E((A - \bar{A})(B - \bar{B})) = (0.5)(-0.5) + (-0.5)(0.5) + (0.5)(-0.5) + (-0.5)(0.5) = -1.0$$

The results of similar calculations are shown in the last column of Table 1. Note that the results are exactly what we would intuitively expect. Any classification method that uses the notion of a distance function can then be used.
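The Table 1 numbers can be reproduced with a short sketch (ours, not the authors' code): each binary variable is mean-shifted via Eq. (7) with the default distinction, and the per-sample products are summed, exactly as in the calculation above.

```python
import numpy as np

# Integer-coded samples from Table 1 (columns A, B, C, D; level 0 = A1/B1/C1/D1, level 1 = A2/B2/C2/D2).
X = np.array([[0, 1, 0, 0],
              [1, 0, 1, 1],
              [0, 1, 1, 0],
              [1, 0, 0, 0]])

def mean_shift(col):
    """Eq. (7) with d(level i, level j) = sign(j - i): the value of (X_i - Xbar) per sample."""
    levels, counts = np.unique(col, return_counts=True)
    probs = counts / len(col)
    d = -np.sign(np.subtract.outer(levels, levels))   # d[i, j] = sign(j - i)
    shifts = d @ probs                                # mean-shifted value of each level
    return shifts[np.searchsorted(levels, col)]

shifted = np.column_stack([mean_shift(X[:, k]) for k in range(X.shape[1])])
print(shifted[:, 0])                                     # 0.5, -0.5, 0.5, -0.5, as in Table 1
for k, name in enumerate("ABCD"):
    print(name, np.sum(shifted[:, 0] * shifted[:, k]))   # 1.0, -1.0, 0.0, 0.5, as in Table 1
```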

2.4. Heterogeneous value difference metric (HVDM)

Although it is only applicable to classification problems, the HVDM (Wilson and Martinez, 1997) has been influential in the literature and is included here as a comparison with the other methods. The HVDM is given by

$$\mathrm{HVDM}(x, y) = \sqrt{\sum_{a=1}^{m} d^2(x_a, y_a)}$$

where m is the number of attributes, and

$$d(x_a, y_a) = \begin{cases} 1 & \text{if } x_a \text{ or } y_a \text{ is unknown} \\ \mathrm{normalized\_vdm}(x_a, y_a) & \text{if } a \text{ is categorical} \\ \mathrm{normalized\_diff}(x_a, y_a) & \text{if } a \text{ is numeric} \end{cases}$$

As given by Wilson and Martinez (1997), normalized_vdm is

$$\mathrm{normalized\_vdm}(x_a, y_a) = \sqrt{\sum_{c=1}^{C} \left| P(c \mid x_a) - P(c \mid y_a) \right|^2}$$

and normalized_diff is

$$\mathrm{normalized\_diff}(x_a, y_a) = \frac{|x_a - y_a|}{4 \sigma_a}$$

where σ_a is the standard deviation of numeric attribute a.
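A direct transcription of these formulas into a Python sketch (our own; the cond_probs and stds inputs are assumed to have been estimated from training data beforehand):

```python
import numpy as np

def hvdm(x, y, categorical, cond_probs, stds):
    """Heterogeneous value difference metric, following the formulas above.

    x, y        : the two feature vectors being compared (None marks an unknown value)
    categorical : boolean flag per attribute
    cond_probs  : for each categorical attribute, a dict mapping value -> array of P(c | value)
    stds        : for each numeric attribute, its standard deviation sigma_a
    """
    total = 0.0
    for a, (xa, ya) in enumerate(zip(x, y)):
        if xa is None or ya is None:
            d = 1.0
        elif categorical[a]:
            d = np.sqrt(np.sum((cond_probs[a][xa] - cond_probs[a][ya]) ** 2))   # normalized_vdm
        else:
            d = abs(xa - ya) / (4.0 * stds[a])                                  # normalized_diff
        total += d ** 2
    return np.sqrt(total)
```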

Table 2
Classification results on various problems from the UCI Machine Learning database

Problem    | NB       | LDASC     | LDARS    | QDASC      | QDARS     | RDASC    | RDARS    | HVDMNN
-----------|----------|-----------|----------|------------|-----------|----------|----------|------------
HayesRoth  | 78 ± 6.1 | 48 ± 13   | 75 ± 7.9 | 48 ± 17    | 43 ± 9.7  | 52 ± 14  | 84 ± 9.2 | 68 ± 9.5
LungCancer | 43 ± 22  | 63 ± 19   | 53 ± 32  | 53 ± 28    | 40 ± 34   | 53 ± 17  | 43 ± 27  | 43 ± 22
Promoter   | 92 ± 9.2 | 69 ± 16   | 87 ± 6.7 | 67 ± 17    | 56 ± 8.4  | 89 ± 8.8 | 82 ± 17  | 87 ± 9.5
Monks1     | 75 ± 9.6 | 67 ± 10   | 73 ± 6.6 | 85 ± 9.5   | 82 ± 7.3  | 78 ± 16  | 83 ± 9.6 | 80 ± 16
Monks2     | 59 ± 11  | 51 ± 16   | 46 ± 6.7 | 70 ± 9.2   | 34 ± 11   | 72 ± 7.9 | 65 ± 15  | 51 ± 18
Monks3     | 98 ± 4   | 84 ± 11   | 92 ± 6.1 | 82 ± 6.6   | 92 ± 9.2  | 89 ± 6.9 | 93 ± 6.6 | 86 ± 10
Tictactoe  | 71 ± 3.6 | 71 ± 4.7  | 99 ± 1.4 | 77 ± 4.6   | 89 ± 2.6  | 75 ± 3.9 | 77 ± 6.4 | 77 ± 5
Votes      | 93 ± 4.1 | 86 ± 5.4  | 96 ± 3.1 | 95 ± 2.2   | 93 ± 4.2  | 93 ± 4.5 | 92 ± 3.3 | 93 ± 3.4
Mushroom   | 94 ± 0.6 | 97 ± 0.7  | 100 ± 0  | 100 ± 0    | 99 ± 0.5  | 99 ± 0.3 | 98 ± 0.3 | 100 ± 0 (a)
Audiology  | 73 ± 11  | 70 ± 10   | 81 ± 10  | 0 ± 0      | 0 ± 0     | 50 ± 16  | 56 ± 15  | 42 ± 37
-----------|----------|-----------|----------|------------|-----------|----------|----------|------------
Anneal     | 46 ± 6.3 | 92 ± 3    | 87 ± 3.8 | 15 ± 6.7   | 28 ± 6.2  | 90 ± 2.7 | 18 ± 5.2 | 97 ± 1.8
Credit     | 78 ± 5.2 | 60 ± 3.9  | 76 ± 3.7 | 56 ± 6.7   | 49 ± 4.9  | 55 ± 6.6 | 45 ± 3.1 | 82 ± 5.3
Heart      | 81 ± 9.9 | 83 ± 7.4  | 84 ± 6.6 | 80 ± 6.6   | 77 ± 9.2  | 82 ± 7.2 | 70 ± 8.1 | 79 ± 9.7
Allbp      | 96 ± 1.2 | 89 ± 2.3  | 94 ± 1.7 | 89 ± 1.2   | 0.2 ± 0.3 | 90 ± 2   | 95 ± 0.8 | 96 ± 1.1
Allhypo    | 95 ± 0.9 | 87 ± 2.1  | 95 ± 1.1 | 0.04 ± 0.1 | 0.4 ± 0.8 | 89 ± 1.3 | 89 ± 2.6 | 93 ± 1.9
Adult      | 55 ± 0.7 | 33 ± 3.1  | 38 ± 0.7 | 13 ± 0.8   | 19 ± 1.1  | 35 ± 7.2 | 38 ± 2.4 | NA (b)
Autos      | 72 ± 5.8 | 6.5 ± 3.4 | 78 ± 12  | 0 ± 0      | 32 ± 11   | 40 ± 19  | 32 ± 12  | 33 ± 17
Postop     | 67 ± 15  | 48 ± 15   | 41 ± 15  | 1.1 ± 3.5  | 1.1 ± 3.5 | 36 ± 15  | 34 ± 16  | 51 ± 21

Mean and standard deviations are shown for randomised 10-fold cross-validation. The best mean value in each row is bolded. Above the line are problems with only categorical variables; below the line are problems with mixed variables.
(a) Only three rounds of cross-validation were used due to large run times.
(b) No results available due to large run times.

3. Results

We compare the regular simplex and symbolic covariance methods on two example applications – classification and principal components. We do not include the tensor product space in the results, as the technique is impractical for most problems – the dimensionality is huge for most practical problems, with the result that the data set cannot be represented in machine RAM in most cases.

3.1. Classification results

In our first experiment, we compare the performance of the two methods on example datasets with only categorical variables or with mixed categorical and numeric variables. We use the Naive Bayes classifier as the default comparison method, and the nearest neighbour classifier using HVDM as a widely used method from the literature. Since Naive Bayes has strong assumptions regarding the independence of the variables, we would expect that removing the correlation between the variables would result in improvements in many cases. We have used the discriminant analysis family of methods to test our distance calculations, since these methods naturally make use of Mahalanobis-like distances. The classification methods we use are:

NB: Naive Bayes used on the raw variables. If numeric variables are involved, then a normal distribution is used to model them.

LDASC: Linear discriminant analysis using symbolic covariance.

LDARS: Linear discriminant analysis using a regular simplex.


QDASC: Quadratic discriminant analysis using symbolic covariance.

QDARS: Quadratic discriminant analysis using a regular simplex.

RDASC: Regularised discriminant analysis using symbolic covariance.

RDARS: Regularised discriminant analysis using a regular simplex.

HVDMNN: Nearest neighbour algorithm using the HVDM.

Note that for the RDA methods, the regularisation parameters are not estimated using cross-validation as suggested by Friedman (1989). The parameter γ is arbitrarily set to 0.1 and the parameter λ is estimated as

$$\lambda = 1.0 - s_i / s$$

where s_i is the sum of the singular values of the class covariance matrix Σ_i and s is the sum of the singular values of the pooled covariance matrix Σ. LDA, QDA and RDA all use covariance matrices to form decision boundaries, so we simply use the appropriate covariance matrix within each of these methods. See McLachlan (2004) for a full description of LDA, QDA and RDA.
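A minimal sketch of this λ estimate (our own illustration; the class and pooled covariance matrices are assumed to be available):

```python
import numpy as np

def rda_lambda(class_cov, pooled_cov):
    """lambda = 1 - (sum of singular values of the class covariance) /
                    (sum of singular values of the pooled covariance)."""
    s_i = np.linalg.svd(class_cov, compute_uv=False).sum()
    s = np.linalg.svd(pooled_cov, compute_uv=False).sum()
    return 1.0 - s_i / s
```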

The results for each method on a subset (each problem has at least some categorical variables) of the UCI machine learning database (Newman et al., 1998) are shown in Table 2. It is not our intention to develop the best classifier for these problems, although we note that many of the results are equal to the best in the literature (e.g. mushrooms), but rather to show that these methods can be useful and are therefore worth considering when dealing with categorical data. It is worth noting that QDA sometimes fails miserably when an accurate estimate of the class-specific covariance matrix cannot be obtained (for example, when only one or two samples of that class exist in the data – this happens for audiology and autos). Naive Bayes performs quite well in most cases, even when the assumptions of independence are violated. On balance, the regular simplex method outperforms the method of symbolic covariance, performing best in eight problems compared with four for symbolic covariance. The symbolic covariance method can be likened to an a priori (sub-optimal) projection, whereas the regular simplex method can result in a more optimal data-driven projection. The symbolic covariance method performs comparably to Naive Bayes and to HVDM using nearest neighbour.

3.2. Principal components

For our principal components example, we use multiple choice exam results from our first year programming course. We have used three data sets – one with 24 questions, and two with 20 questions. All answers are in the range "A" to "D" or "A" to "E". The three data sets have 151, 157 and 200 sample points, respectively. Because of the size of the resultant tensor product space, that method could not be applied to this problem. We also have the exam mark for each student, and what we hope to find is that the principal component of the data is highly correlated with the mark – this should be a validation of the method. The results are shown in Figs. 1 and 2.

[Fig. 1. Exam mark versus principal component for symbolic covariance. Three panels (Exam 1, Exam 2, Exam 3); horizontal axis: exam mark; vertical axis: principal component.]

[Fig. 2. Exam mark versus principal component for regular simplex. Three panels (Exam 1, Exam 2, Exam 3); horizontal axis: exam mark; vertical axis: principal component.]

We can see from both figures that the principal component is highly correlated with the exam mark for both methods, thus verifying the usefulness of the techniques. However, the regular simplex method is more highly correlated and seems to be the preferable method of the two – although it comes at the cost of a higher dimensionality of the problem (having 4–5 times more dimensions than the symbolic covariance method).
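As an illustration of the procedure only (not the authors' code), the sketch below simplex-encodes a hypothetical integer-coded answer matrix, extracts the first principal component and correlates it with the exam marks. The simplex_encode helper from the earlier sketch is assumed to be in scope, and the random data are placeholders, so no particular correlation should be expected from them.

```python
import numpy as np

def first_principal_component(encoded):
    """Score of each sample on the first principal component of the encoded data."""
    centred = encoded - encoded.mean(axis=0)
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    return centred @ vt[0]

# Hypothetical stand-in data: 151 students, 20 questions with answers coded 0..4, plus marks.
rng = np.random.default_rng(1)
answers = rng.integers(0, 5, size=(151, 20))
marks = rng.uniform(0, 100, size=151)

scores = first_principal_component(simplex_encode(answers, levels=[5] * 20))
print(np.corrcoef(scores, marks)[0, 1])   # correlation between the first PC and the exam mark
```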

4. Conclusion

We have investigated the problem of distance calculations for categorical and mixed variables, and have introduced two new Mahalanobis-type distances for these types of variables – the symbolic covariance method and the tensor product space method. The tensor product space method is theoretically pleasing but completely impractical for most problems. The symbolic covariance method is also theoretically pleasing, sharing many properties with the standard numeric covariance and thus leading to a natural Mahalanobis distance. It is also efficient in both time and space, can be applied in many problem settings (classification, clustering and dimensionality reduction), which is not true of all methods (e.g. HVDM), and has the potential to be applied in other statistical contexts (e.g. measures of variability, statistical tests, etc.). Although the regular simplex method outperformed the symbolic covariance method in terms of accuracy, the performance improvement was only moderate and came at the cost of significantly higher dimensionality. On balance, we believe the symbolic covariance to be a useful addition to the literature on heterogeneous distance functions.

References

Bar-Hen, A., Daudin, J.-J., 1995. Generalization of the Mahalanobis distance in the mixed case. J. Multivariate Anal. 53, 332–342.

Cost, R.S., Salzberg, S., 1993. A weighted nearest neighbor algorithm for learning with symbolic features. Machine Learn. 10, 57–78.

Cuadras, C., Fortiana, J., Oliva, F., 1997. The proximity of an individual to a population with applications in discriminant analysis. J. Classificat. 14 (1), 117–136.

Domingos, P., 1996. Unifying instance-based and rule-based induction. Machine Learn. 24 (2), 141–168.

Friedman, J., 1989. Regularized discriminant analysis. J. Amer. Statist. Assoc. 84 (405), 165–175.

Goodall, D., 1966. A new similarity index based on probability. Biometrics 22 (4), 882–907.

Gower, J.C., 1971. A general coefficient of similarity and some of its properties. Biometrics 27, 857–874.

Huang, Z., 1998. Extensions to the k-means algorithm for clustering large data sets with categorical values. Data Min. Knowl. Discov. 2 (3), 283–304.

Kurczynski, T., 1970. Generalized distance and discrete variables. Biometrics 26 (3), 525–534.

Kurzanowski, W., 1993. The location model for mixtures of categorical and continuous variables. J. Classificat. 10, 25–49.

Li, C., Biswas, G., 2002. Unsupervised learning with mixed numeric and nominal data. IEEE Trans. Knowl. Data Eng. 14 (4), 673–690.

McLachlan, G., 2004. Discriminant Analysis and Statistical Pattern Recognition. Wiley.

Newman, D.J., Hettich, S., Blake, C., Merz, C., 1998. UCI Repository of Machine Learning Databases. <http://www.ics.uci.edu/~mlearn/MLRepository.html>.


Okada, T., 2000. A note on covariances for categorical data. In: Leung, K.S., Chan, L.-W., Meng, H. (Eds.), Intelligent Data Engineering and Automated Learning – IDEAL 2000: Data Mining, Financial Engineering, and Intelligent Agents. Lecture Notes in Computer Science, vol. 1983. Springer, Berlin/Heidelberg, pp. 150–157.

Parzen, E., 1962. On estimation of a probability density function and mode. Ann. Math. Statist. 33 (3), 1065–1076.

Podani, J., 1999. Extending Gower's general coefficient of similarity to ordinal characters. Taxon 48 (2), 331–340.

Stanfill, C., Waltz, D.L., 1986. Toward memory-based reasoning. Comm. ACM 29 (12), 1213–1228.

Wilson, D.R., Martinez, T.R., 1997. Improved heterogeneous distance functions. J. Artif. Intell. Res. (JAIR) 6, 1–34.