Trimmed Comparison of Distributions
Author(s): Pedro César Álvarez-Esteban, Eustasio del Barrio, Juan Antonio Cuesta-Albertos, and Carlos Matrán
Source: Journal of the American Statistical Association, Vol. 103, No. 482 (Jun., 2008), pp. 697-704
Published by: American Statistical Association
Stable URL: http://www.jstor.org/stable/27640092
Accessed: 15/06/2014 05:09
This content downloaded from 194.29.185.230 on Sun, 15 Jun 2014 05:09:25 AMAll use subject to JSTOR Terms and Conditions
Pedro César Álvarez-Esteban, Eustasio del Barrio, Juan Antonio Cuesta-Albertos, and Carlos Matrán
This article introduces an analysis of similarity of distributions based on the L2-Wasserstein distance between trimmed distributions. Our main innovation is the use of the impartial trimming methodology, already considered in robust statistics, which we adapt to this setup. Instead of simply removing data at the tails to provide some robustness to the similarity analysis, we develop a data-driven trimming method aimed at maximizing similarity between distributions. Dissimilarity is then measured in terms of the distance between the optimally trimmed distributions. We provide illustrative examples showing the improvements over previous approaches and give the relevant asymptotic results to justify the use of this methodology in applications.
KEY WORDS: Asymptotics; Impartial trimming; Similarity; Trimmed distributions; Wasserstein distance.
1. INTRODUCTION
An intrinsic consequence of randomness is variability. Samples obtained from a random experiment generally will differ, and even two ideal samples coming from the same random generator cannot be expected to be the same. A main challenge for the statistician is to be able to detect departures from this ideal equality that cannot reasonably be attributed to randomness.

Often the researcher is not really concerned about exact coincidence, but rather wants to guarantee that the random generators do not differ too much. The usual approach in the statistical literature to this "not differ too much" involves fixing a certain parameter related to the distribution of the random generators (possibly the distribution itself) and checking whether some distance between the parameters in the two samples lies below a given threshold. In this article we propose a different approach to the problem, with a motivation influenced by robust statistics.
Imagine that we want to compare two univariate data samples. We observe that the associated histograms look different, but we realize that we can remove a certain fraction, say 5%, of the data in the first sample and another 5% of the data in the second sample, in such a way that the remaining data in both samples produce very similar histograms. We then would be tempted to say that the (95%) cores of the underlying distributions are similar. This could be the case when, for instance, trying to assess the similarity of two human populations with respect to a given feature. Both populations could be initially equal, but the presence of different immigration patterns might cause a difference in the overall distribution of that feature, whereas the "cores" of both populations remain equal. Another example in which we could be interested in comparing the "cores" of two distributions is when we want to check equality in the distributions generating the two samples of a physical
Pedro César Álvarez-Esteban is Associate Professor (E-mail: [email protected]), Eustasio del Barrio is Associate Professor, and Carlos Matrán is Professor, Department of Statistics and Operations Research, University of Valladolid, Valladolid, Spain. Juan Antonio Cuesta-Albertos is Professor, Department of Mathematics, Statistics, and Computation, University of Cantabria, Santander, Spain. This research was supported in part by the Spanish Ministry of Science and Technology and FEDER (grants BFM2005-04430-C02-01 and 02) and by the Consejería de Educación y Cultura de la Junta de Castilla y León (grant PAPIJCL VA 102/06). The data sets corresponding to the multiclinical study were kindly provided by Axel Munk and Claudia Czado. The data used in Section 3 are available in the majors.dat file in the example data sets of many statistics packages. We obtained them from the textbook by Moore and McCabe (2003). The computational analyses were done using R statistical software. The R programs and functions used to analyze the examples considered in this work are available at http://www.eio.uva.es/~pedroc/RJ. The authors thank the reviewing team for their careful reading of the manuscript and useful suggestions.
magnitude, but find that the measuring devices are not perfect and introduce some distortions when the true values lie within a certain range, leaving other values unaffected. The distortions introduced by the two measuring devices could be of different types, but if they did not affect more than a small fraction of the observations, again the "cores" of the distributions could be equal.

Let us formalize this idea of the core of a distribution. When trimming a fraction (of size at most α) of the data in the sample to allow a better comparison with the other sample, we replace the empirical measure $\frac{1}{n}\sum_{i=1}^n \delta_{X_i}$ with a new probability measure that gives weight 0 to the observations in the bad set and weight $\frac{1}{n-k}$ to every observation remaining in the sample. Here k is the number of trimmed observations; thus $k \le n\alpha$ and $\frac{1}{n-k} \le \frac{1}{n(1-\alpha)}$. Instead of simply keeping/removing data, we could increase the weight of data in good ranges (by a factor bounded by $\frac{1}{1-\alpha}$) and downplay the importance of data in bad zones, not necessarily removing them. The new trimmed empirical measure can be written as
$$\frac{1}{n}\sum_{i=1}^n b_i\,\delta_{X_i}, \quad\text{where } 0 \le b_i \le \frac{1}{1-\alpha} \text{ and } \frac{1}{n}\sum_{i=1}^n b_i = 1.$$
If the random generator of the sample were P, then the theoretical counterpart of the trimming procedure would be to replace the probability $P(B) = \int_B 1\,dP$ by the new measure
$$\tilde P(B) = \int_B g\,dP, \quad\text{where } 0 \le g \le \frac{1}{1-\alpha} \text{ and } \int g\,dP = 1. \quad (1)$$
We call a probability measure like $\tilde P$ in (1) an α-trimming of P. We show in Section 2 that all α-trimmings of P can be expressed in terms of trimming functions. For a given trimming function, h, $P_h$ denotes the corresponding α-trimming of P. The trimming function h determines which zones in the distribution P are downplayed or removed.
Turning to the measurements-with-errors example, the underlying distributions of the samples, P and Q, could be different because of the distortions introduced by the measuring devices, but a suitable trimming function, h, could produce α-trimmings, $P_h$ and $Q_h$, that are very similar or even equal. The right trimming function generally will be unknown, and
© 2008 American Statistical Association
Journal of the American Statistical Association
June 2008, Vol. 103, No. 482, Theory and Methods
DOI 10.1198/016214508000000274
697
we should look for the best possible one. This makes sense if we consider a metric, d, between probability measures and take
$$h_0 = \arg\min_h d(P_h, Q_h). \quad (2)$$
If the α-trimmings $P_{h_0}$ and $Q_{h_0}$ are equal, then we can say that the cores of the distributions coincide. It also would be of interest to check whether these optimally trimmed probabilities are close to one another. Our goal is to introduce and analyze methods for testing the similarity/dissimilarity of trimmed distributions.
In this article we consider the L2-Wasserstein (or Mallows) distance between distributions. Note that in a related work, Munk and Czado (1998) considered a trimmed version of the Wasserstein distance consisting in trimming both distributions solely in their tails and in a symmetric way. In the next section we discuss this approach in our context. However, we want to emphasize that in this article we use impartial trimming not only as a way to robustify a statistical procedure, but also as a method to discard part of the data to achieve the best possible fit between two given samples or between a sample and a theoretical distribution, thus searching for the maximum similarity between them. To the best of our knowledge, this point of view has not been previously considered in the literature and can lead to a new methodology in relation to the similarity concept. However, the fact that the data themselves decide the method of trimming is common to several statistical methodologies (see, e.g., Cuesta, Gordaliza, and Matrán 1997; García-Escudero, Gordaliza, Matrán, and Mayo-Iscar 2008; Gordaliza 1991; Maronna 2005; Rousseeuw 1985), described here by the term "impartial trimming."
The article is organized as follows. In Section 2 we formally introduce the trimming methodology to measure dissimilarities. We present the properties of trimming and a preliminary example describing the innovation of our methodology with respect to the naive approach of symmetrically trimming to gain robustness. Asymptotics for our dissimilarity measure complete the mathematical analysis considered in Section 2. In Section 3 we compare our methodology with that of Munk and Czado on a real data set, showing the flexibility that impartial trimming introduces in the similarity setup. We give proofs of our results in the Appendix.
2. MEASURING DISSIMILARITIES THROUGH IMPARTIAL TRIMMING
As discussed earlier, we could consider trimmings of a probability distribution on a Borel set simply by considering the conditional probability given that set. But here it is convenient to introduce a slightly more general concept. Trimmed probabilities can be defined in general probability spaces, although for practical purposes we restrict ourselves to probabilities on the real line.
Definition 1. Let P be a probability measure on ℝ and let 0 < α < 1. We say that a probability measure P*, on ℝ, is an α-trimming of P if P* is absolutely continuous with respect to P ($P^* \ll P$) and $\frac{dP^*}{dP} \le \frac{1}{1-\alpha}$.

We denote the set of α-trimmings of P by $\mathcal{T}_\alpha(P)$; that is, if $\mathcal{P}$ denotes the set of probability measures on ℝ, then
$$\mathcal{T}_\alpha(P) = \left\{P^* \in \mathcal{P} : P^* \ll P,\ \frac{dP^*}{dP} \le \frac{1}{1-\alpha}\ P\text{-a.s.}\right\}. \quad (3)$$
The limit case in which α = 1, $\mathcal{T}_1(P)$, is just the set of probability measures absolutely continuous with respect to P.
An equivalent characterization is that $P^* \in \mathcal{T}_\alpha(P)$ if and only if $P^* \ll P$ and $\frac{dP^*}{dP} = \frac{f}{1-\alpha}$ with $0 \le f \le 1$. If f takes only the values 0 and 1, then it is the indicator of a set, say A, such that $P(A) = 1-\alpha$, and trimming corresponds to considering the conditional probability measure $P(\cdot\,|\,A)$. Definition (3) allows us to reduce the weight of some regions of the sample space without completely removing them from the feasible set.

The following proposition gives a useful characterization of trimmings of a probability distribution in terms of the trimmings of the U[0, 1] distribution.

Proposition 1. Let $C_\alpha$ be the class of absolutely continuous functions $h : [0,1] \to [0,1]$ such that $h(0) = 0$ and $h(1) = 1$, with derivative $h'$ such that $0 \le h' \le \frac{1}{1-\alpha}$. For any real probability measure P, we have the following:

a. $\mathcal{T}_\alpha(P) = \{P^* \in \mathcal{P} : P^*(-\infty,t] = h(P(-\infty,t]),\ h \in C_\alpha\}$
b. $\mathcal{T}_\alpha(U[0,1]) = \{P^* \in \mathcal{P} : P^*(-\infty,t] = h(t),\ 0 \le t \le 1,\ h \in C_\alpha\}$.
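As a quick illustration of Proposition 1 (a sketch of our own, not taken from the paper), any absolutely continuous h with h(0) = 0, h(1) = 1, and 0 ≤ h' ≤ 1/(1−α) is a valid trimming function; the hypothetical choice h(t) = min(t/(1−α), 1), for instance, concentrates all of the trimming on the upper tail:

```python
import numpy as np

ALPHA = 0.1  # trimming level (illustrative choice)

def h_upper_trim(t, alpha=ALPHA):
    """A member of C_alpha: h(0) = 0, h(1) = 1, and 0 <= h' <= 1/(1 - alpha).
    This particular h removes exactly the upper alpha fraction of the mass."""
    return np.minimum(np.asarray(t, dtype=float) / (1.0 - alpha), 1.0)

# By Proposition 1, t -> h(F(t)) is the distribution function of an
# alpha-trimming of P.  For P = U[0, 1] (so F(t) = t), the trimmed cdf
# already reaches 1 at t = 1 - alpha: everything above is discarded.
grid = np.linspace(0.0, 1.0, 11)
trimmed_cdf = h_upper_trim(grid)
```

The impartial trimming of Section 2 differs precisely in that the trimmed region is chosen by the data rather than fixed in advance like this.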
It will be useful to write $P_h$ for the probability measure with distribution function $h(P(-\infty,t])$, leading to $\mathcal{T}_\alpha(P) = \{P_h : h \in C_\alpha\}$.

To measure closeness between distributions, we resort to the L2-Wasserstein distance, defined on the set $\mathcal{P}_2$ of probabilities with finite second moment. For P and Q in $\mathcal{P}_2$, $\mathcal{W}_2(P,Q)$ is defined as the lowest L2-distance between random variables (rv's), defined on any probability space, with distributions P and Q. The measure of closeness, or matching, between P and Q at a given level, α, or, equivalently, between their distribution functions F and G, is now defined by
$$\tau_\alpha(P,Q) = \tau_\alpha(F,G) := \inf_{h \in C_\alpha} \mathcal{W}_2^2(P_h, Q_h). \quad (4)$$
The following alternative expression for $\mathcal{W}_2(P,Q)$ is a key aspect of the usefulness of this distance in statistics on the line. If F and G are the distribution functions of P and Q, and $F^{-1}$ and $G^{-1}$ are the respective (left-continuous) quantile functions, then the L2-Wasserstein distance between P and Q is given by (see, e.g., Bickel and Freedman 1981)
$$\mathcal{W}_2(P,Q) = \left(\int_0^1 \big(F^{-1}(t) - G^{-1}(t)\big)^2\,dt\right)^{1/2}. \quad (5)$$
Recall that $F^{-1}$ is defined on (0, 1) by $F^{-1}(t) = \inf\{s : F(s) \ge t\}$, which, applied to a uniform rv on (0, 1), yields a rv with distribution function F. From this, it is obvious that for the probability measures based on two samples (resp. one sample and a theoretical distribution), $\mathcal{W}_2$ coincides with the L2 distance to the diagonal in a Q-Q plot (resp. probability plot).
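The quantile representation (5) makes the distance straightforward to approximate for empirical distributions. A minimal sketch (our own illustrative code, not the authors' R programs; the function name and grid size are arbitrary choices):

```python
import numpy as np

def w2_distance(x, y, grid_size=10_000):
    """Approximate the L2-Wasserstein distance between the empirical
    distributions of two samples via (5): the L2 distance between the
    empirical quantile functions, evaluated on a midpoint grid over (0, 1)."""
    t = (np.arange(grid_size) + 0.5) / grid_size
    fq = np.quantile(np.asarray(x, dtype=float), t)  # F_n^{-1}(t)
    gq = np.quantile(np.asarray(y, dtype=float), t)  # G_m^{-1}(t)
    return float(np.sqrt(np.mean((fq - gq) ** 2)))   # grid mean approximates the integral
```

For two samples differing by a pure location shift, the computed distance is the absolute shift, matching the remark later in the article that within a location family the Wasserstein distance is the absolute difference of locations.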
From (5) and Proposition 1, we obtain the equivalent expression of (4) as
$$\tau_\alpha(F,G) = \inf_{h \in C_\alpha} \int_0^1 \big(F^{-1}(t) - G^{-1}(t)\big)^2 h'(t)\,dt. \quad (6)$$
The infimum in (6) is easily attained at the function $h_0$ below (see Gordaliza 1991), associated with a set with Lebesgue measure $1-\alpha$. We call this minimizer, $h_0$, an impartial α-trimming between P and Q. Obviously, after Proposition 1, $h_0(F(x))$ and $h_0(G(x))$ are the distribution functions of the impartially α-trimmed probabilities.

To analyze (6), let us consider the map $t \mapsto |F^{-1}(t) - G^{-1}(t)|$ as a random variable defined on (0, 1) endowed with the Lebesgue measure, ℓ. Let
$$L_{F,G}(x) := \ell\big\{t \in (0,1) : |F^{-1}(t) - G^{-1}(t)| \le x\big\}, \quad x \ge 0,$$
denote its distribution function, and write $L^{-1}_{F,G}$ for the corresponding quantile inverse. If $L_{F,G}$ is continuous at $L^{-1}_{F,G}(1-\alpha)$, then
$$\inf_{h \in C_\alpha} \int_0^1 \big(F^{-1}(t) - G^{-1}(t)\big)^2 h'(t)\,dt = \int_0^1 \big(F^{-1}(t) - G^{-1}(t)\big)^2 h_0'(t)\,dt, \quad (7)$$
where
$$h_0'(t) = \frac{1}{1-\alpha}\,\mathbf{1}\big\{|F^{-1}(t) - G^{-1}(t)| \le L^{-1}_{F,G}(1-\alpha)\big\}. \quad (8)$$
In this case $h_0$ is in fact the unique minimizer of the criterion functional.

Even if $L_{F,G}$ is not continuous at $L^{-1}_{F,G}(1-\alpha)$, we can ensure the existence of a set $A_0$ (not necessarily unique) such that $\ell(A_0) = 1-\alpha$ and
$$\big\{t \in (0,1) : |F^{-1}(t) - G^{-1}(t)| < L^{-1}_{F,G}(1-\alpha)\big\} \subset A_0 \subset \big\{t \in (0,1) : |F^{-1}(t) - G^{-1}(t)| \le L^{-1}_{F,G}(1-\alpha)\big\}.$$
Obviously, if for any such $A_0$ we consider the function $h_0 \in C_\alpha$ with $h_0' = \frac{1}{1-\alpha}\mathbf{1}_{A_0}$, then the infimum in (6) is attained at $h_0$. Therefore, problem (6) is equivalent to
$$\min_A \frac{1}{1-\alpha}\int_A \big(F^{-1}(t) - G^{-1}(t)\big)^2\,dt, \quad (9)$$
where A varies over the Borel sets in (0, 1) with Lebesgue measure equal to $1-\alpha$.
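An empirical version of (9) is easy to sketch: evaluate the squared quantile difference on a grid, keep the fraction 1−α of grid points where the quantile functions agree best, and renormalize. (This is our own illustrative implementation, not the authors' R code.)

```python
import numpy as np

def impartial_trimmed_tau(x, y, alpha, grid_size=10_000):
    """Discretization of (9): minimizing over sets A of Lebesgue measure
    1 - alpha amounts to trimming the grid points carrying the largest
    squared quantile discrepancies."""
    t = (np.arange(grid_size) + 0.5) / grid_size
    d2 = (np.quantile(np.asarray(x, float), t) - np.quantile(np.asarray(y, float), t)) ** 2
    keep = int(round((1.0 - alpha) * grid_size))  # number of grid points retained
    kept = np.sort(d2)[:keep]                     # impartial trimming: drop the worst discrepancies
    return float(kept.sum() / grid_size / (1.0 - alpha))
```

With alpha = 0 this reduces to the untrimmed squared distance in (5); for alpha > 0 the trimmed set adapts to wherever the two quantile functions disagree most, tails or not.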
2.1 Comparison With Symmetric Trimming

Munk and Czado (1998) (see also Czado and Munk 1998; Freitag, Czado, and Munk 2007) considered a trimmed version of the Wasserstein distance for the assessment of similarity between the distribution functions F and G as
$$\Gamma_\alpha(F,G) := \left(\frac{1}{1-\alpha}\int_{\alpha/2}^{1-\alpha/2} \big(F^{-1}(t) - G^{-1}(t)\big)^2\,dt\right)^{1/2}. \quad (10)$$
Note that the right side of the foregoing expression equals $\mathcal{W}_2(P_\alpha, Q_\alpha)$, where $P_\alpha$ is the probability measure with distribution function
$$F_\alpha(t) = \frac{1}{1-\alpha}\big(F(t) - \alpha/2\big), \quad F^{-1}(\alpha/2) \le t \le F^{-1}(1-\alpha/2), \quad (11)$$
Figure 1. Histograms of trimmed data (cholesterol levels) in two clinical centers (Center 1 and Center 2, α = .1). The white part of the bars shows the trimming proportion in the associated zone.
and similarly for $Q_\alpha$. When comparing two samples, this corresponds to the distance between the sample distributions associated with the symmetrically trimmed samples. This naive way of trimming is widely used and confers protection against contamination by outliers. However, the arbitrariness in the choice of the trimming zones has been largely reported as a serious drawback of procedures based on this method (see, e.g., Cuesta et al. 1997; García-Escudero et al. 2008; Gordaliza 1991; Rousseeuw 1985). In our setting, the question is why two distributions that are very different in their tails are considered similar, but ones that differ in their central parts are considered nonsimilar.
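For comparison, the symmetric trimmed distance (10) fixes the trimmed set to the two tails in advance. A sketch under the same grid discretization as before (again our own illustrative code):

```python
import numpy as np

def symmetric_trimmed_distance(x, y, alpha, grid_size=10_000):
    """Discretization of (10): integrate the squared quantile difference over
    (alpha/2, 1 - alpha/2) only, scaled by 1/(1 - alpha), then take the root."""
    # Midpoint grid over (alpha/2, 1 - alpha/2); this interval carries mass
    # 1 - alpha, so the grid mean already includes the 1/(1 - alpha) factor.
    t = alpha / 2 + (1.0 - alpha) * (np.arange(grid_size) + 0.5) / grid_size
    d2 = (np.quantile(np.asarray(x, float), t) - np.quantile(np.asarray(y, float), t)) ** 2
    return float(np.sqrt(np.mean(d2)))
```

Unlike the impartial version, the trimmed zones here are always the two tails, regardless of where the samples actually disagree.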
To get an idea of the differences between our approach and symmetrical trimming, let us recall example 1 of Munk and Czado (1998), which corresponds to a multiclinical study on cholesterol and fibrinogen levels in two sets of patients (of sizes 116 and 141) in two clinical centers. For the fibrinogen data, our impartial trimming proposal for α = .1 essentially coincides with the symmetrical trimming. But Figure 1 displays the effects of our trimming proposal for the cholesterol data, showing a significant trimming in the middle part of the histograms as well, corresponding to both centers. This even improves on the level of similarity shown by Munk and Czado (1998), strengthening their assessment of similarity on these data, but it also provides a descriptive look at the way in which both populations (dis)agree. The posterior analysis of the trimmed data can be very useful in a global comparison of the populations.
2.2 Nonparametric Test of Similarity

As is usual in many statistical analyses, the interest of statisticians when analyzing similarity of distributions lies in asserting the equivalence of the involved probability distributions. In hypothesis testing this is achieved by taking equivalence or similarity as the alternative hypothesis, whereas dissimilarity is the null hypothesis. In agreement with this point of view, Munk and Czado (1998) considered the testing problem with the null hypothesis that the trimmed distance (10) exceeds some value Δ, a threshold to be analyzed by the experimenters and statisticians in an ad hoc way. Graphics of p values for different Δ values (see Fig. 4 in Sec. 3) play a key role in this analysis, and the fact that our measure of dissimilarity, $(\tau_\alpha(F,G))^{1/2}$, is measured on the same scale as the variable of interest favors this goal. We also note that it is the Wasserstein distance between trimmed versions of the original distributions. This allows us to
handle the very nice properties of this distance (see, e.g., Bickel and Freedman 1981) in a friendly way in connection with our problem.

Let $X_1, \ldots, X_n$ (resp. $Y_1, \ldots, Y_m$) be iid observations with common distribution function F (resp. G), and let $X_{(1)}, \ldots, X_{(n)}$ (resp. $Y_{(1)}, \ldots, Y_{(m)}$) be the corresponding ordered samples. We base our test of $H_0 : \tau_\alpha(F,G) \ge \Delta_0^2$ against $H_a : \tau_\alpha(F,G) < \Delta_0^2$ on the empirical counterparts of $\tau_\alpha(F,G)$, namely $T_{n,\alpha} := \tau_\alpha(F_n, G)$, where $F_n$ denotes the empirical distribution function based on the data in the one-sample problem, and $T_{n,m,\alpha} := \tau_\alpha(F_n, G_m)$ in the two-sample case. Our next results show that, under some mild assumptions on F and G, $T_{n,\alpha}$ and $T_{n,m,\alpha}$ are asymptotically normal, a fact that we use later to approximate the critical values of $H_0$ against $H_a$. For notational reasons, in the Appendix we give the proof only of the one-sample statement.
To obtain the asymptotic behavior of our statistics, we assume that
$$F \text{ and } G \text{ have absolute moments of order } 4 + \delta, \text{ for some } \delta > 0. \quad (12)$$
A further regularity assumption is that F has a continuously differentiable density, $F' = f$, such that
$$\sup_{x \in \mathbb{R}} \frac{F(x)\big(1 - F(x)\big)\,|f'(x)|}{f^2(x)} < \infty. \quad (13)$$
Additional notation includes $h_0$ as defined in (8) and
$$l(t) := \int_{F^{-1}(1/2)}^{F^{-1}(t)} \big(x - G^{-1}(F(x))\big)\,h_0'(F(x))\,dx \quad (14)$$
and
$$s_{n,\alpha}^2(G) := \frac{4}{(1-\alpha)^2}\sum_{i,j=1}^{n-1}\left(\frac{\min(i,j)}{n} - \frac{ij}{n^2}\right)a_{n,i}\,a_{n,j}, \quad (15)$$
where
$$a_{n,i} = \big(X_{(i+1)} - X_{(i)}\big)\big((X_{(i+1)} + X_{(i)})/2 - G^{-1}(i/n)\big)\,\mathbf{1}\big\{|X_{(i)} - G^{-1}(i/n)| \le L^{-1}_{F_n,G}(1-\alpha)\big\}. \quad (16)$$
Theorem 2. Assume that F and G satisfy (12) and (13) and that $L_{F,G}$ is continuous at $L^{-1}_{F,G}(1-\alpha)$. Then $\sqrt{n}\,\big(T_{n,\alpha} - \tau_\alpha(F,G)\big)$ is asymptotically centered normal with variance
$$\sigma_\alpha^2(F,G) = 4\left(\int_0^1 l^2(t)\,dt - \left(\int_0^1 l(t)\,dt\right)^2\right). \quad (17)$$
This asymptotic variance can be consistently estimated by $s_{n,\alpha}^2(G)$ given by (15). If G also satisfies (13) and $\frac{n}{n+m} \to \lambda \in (0,1)$, then $\sqrt{\frac{nm}{n+m}}\,\big(T_{n,m,\alpha} - \tau_\alpha(F,G)\big)$ is asymptotically centered normal with variance $(1-\lambda)\,\sigma_\alpha^2(F,G) + \lambda\,\sigma_\alpha^2(G,F)$. This variance can be consistently estimated by
$$s_{n,m,\alpha}^2 = \frac{m}{n+m}\,s_{n,\alpha}^2(G_m) + \frac{n}{n+m}\,s_{m,\alpha}^2(F_n).$$

If $\tau_\alpha(F,G) = 0$, then Theorem 2 reduces to $\sqrt{n}\,T_{n,\alpha} \to 0$ in probability; note that $\tau_\alpha(F,G) = 0$ implies that $\big(x - G^{-1}(F(x))\big)^2 h_0'(F(x)) = 0$ for almost every x, and thus $\sigma_\alpha^2(F,G) = 0$. This generally would suffice for our applications, but we also give the exact rate and the limiting distribution in the Appendix.
Figure 2. Histograms for the variable GPA. (a) Males; (b) females; (c) computer science students; (d) engineering students.
3. EXAMPLE AND SIMULATIONS
Our analysis is based on the variable college grade point average (GPA), collected from a group of 234 students. This variable takes values in 0-4. The students are classified by the variables gender and major (1 = computer science, 2 = engineering, 3 = other sciences). We are interested in studying the distributional similarity of the GPA between males (n = 117) and females (m = 117), and also between students with a major in computer sciences (n = 78) and students with a major in engineering (m = 78). Figure 2 shows the histogram for each sample.
Comparisons of these samples using classical procedures produce the results displayed in Table 1. Because the Shapiro-Wilks tests reject the normality of the four samples, we use nonparametric methods like the Kolmogorov-Smirnov (KS) test and the Wilcoxon-Mann-Whitney (WMW) test to analyze the null hypothesis that both samples come from the same distribution in the comparisons of GPA by sex and GPA by major. The p values of these tests clearly reject the null hypotheses.

Under the possibility of impartially trimming both samples as described in Section 2, we obtain the optimal trimming functions displayed in Figure 3. In this figure, and for each comparison, we plot the value of $|F^{-1}(t) - G^{-1}(t)|$ and the cutting values $L^{-1}_{F,G}(1-\alpha)$ for α = .05, .1, and .2. Figure 3(a) shows that the optimal trimming involves the lower tail, but not exactly from the lower end point. When the trimming level grows
Table 1. Two-sample p values for classical tests

  Test                        GPA by gender   GPA by major
  Shapiro-Wilks (sample 1)        .0176           .0360
  Shapiro-Wilks (sample 2)        .0217           .0001
  KS                              .0028           .0040
  WMW                             .0004           .0175
Figure 3. Trimming functions, plotted against t ∈ [0, 1]. (a) GPA by gender; (b) GPA by major. (Line styles distinguish α = .05, .1, and .2.)
(α = .1 and .2), the trimmed zone is not an interval, and it includes points around the 20%, 40%, 60%, and 70% percentiles. Figure 3(b) shows that the points that should be trimmed to make both samples more similar are between the 10% and 30% percentiles. This example illustrates a nonsymmetrical dissimilarity between samples; in fact, in the first comparison, the less similar zone is close to the lower tail, but not to the upper tail, where the values are more similar.
3.1 p Value Curve
To gain some insight into the assessment of the similarity or dissimilarity of the underlying distributions, we can use the p value curve to test the null hypothesis $H_0 : \tau_\alpha(F,G) \ge \Delta_0^2$ against $H_a : \tau_\alpha(F,G) < \Delta_0^2$. In the two-sample comparison case, we use the statistic
$$Z_{n,m,\alpha} = \sqrt{\frac{nm}{n+m}}\,\frac{T_{n,m,\alpha} - \Delta_0^2}{s_{n,m,\alpha}}. \quad (18)$$
To obtain the values of $T_{n,m,\alpha}$, we compute $|F_n^{-1}(t) - G_m^{-1}(t)|^2$ over a grid in [0, 1], using the $(1-\alpha)$-quantile of these values to determine $L^{-1}_{F_n,G_m}(1-\alpha)$. The integral is then calculated numerically. The computation of $s_{n,m,\alpha}$ is done similarly. The asymptotic p value curve, $P(\Delta_0)$, is defined as
$$P(\Delta_0) := \sup_{(F,G) \in H_0}\ \lim_{n,m \to \infty} P_{F,G}\big(Z_{n,m,\alpha} \le z_0\big) = \Phi(z_0),$$
where $z_0$ is the observed value of $Z_{n,m,\alpha}$. (Note that the supremum is attained when the distance between both distributions is exactly $\Delta_0$.) These asymptotic p value curves can be used in two ways. On one hand, given a fixed value of $\Delta_0$ that controls the degree of dissimilarity, it is possible to find the p value associated with the corresponding null hypothesis to decide whether or not the distributions are similar. On the other hand, given a fixed test level (p value), we can find the value of $\Delta_0$ such that for every $\Delta > \Delta_0$, we should reject the hypothesis $H_0 : \tau_\alpha(F,G) \ge \Delta^2$. In this way, we can get a sound idea of the degree of dissimilarity between the distributions. To handle the values of $\Delta_0$, the experimenter should take into account how to interpret the Wasserstein distance, recalling that in the case where F and G belong to the same location family, their Wasserstein distance is the absolute difference of their locations.
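Given the observed statistic and variance estimate, the asymptotic p value curve is a one-liner over a grid of Δ₀ values. A sketch (the function name and arguments are our own; t_nm and s_nm play the roles of $T_{n,m,\alpha}$ and $s_{n,m,\alpha}$):

```python
from math import sqrt
from statistics import NormalDist

def p_value_curve(t_nm, s_nm, n, m, deltas):
    """Asymptotic p value of H0: tau_alpha(F, G) >= Delta0^2 for each Delta0,
    via (18): z0 = sqrt(nm/(n+m)) * (T - Delta0^2) / s, and p = Phi(z0)."""
    phi = NormalDist().cdf
    scale = sqrt(n * m / (n + m))
    return [phi(scale * (t_nm - d ** 2) / s_nm) for d in deltas]
```

The curve is nonincreasing in Δ₀; the similarity thresholds reported below are the Δ₀ values at which the curve crosses the chosen test level (.05).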
Figure 4 illustrates the improved assessment obtained by impartial trimming over the Munk and Czado methodology. It displays the p value curves using impartial trimming and symmetrical trimming for both comparisons, for different trimming levels (α = .05, .1, and .2). For each plot, a horizontal line marks a reference level for the test (.05). The GPAs of males and females are similar up to $\Delta_0$ ranging from .32 to .36 (depending on the trimming size) when impartial trimmings are used. These values represent between 100 × .32/2.815 = 11.4% and 12.8% of the average of the medians of the samples. But when using symmetrical trimmings, the horizontal line cuts the p value curves for $\Delta_0$ ranging from .56 to .59, between 20% and 21% of the average of the medians. A similar analysis of the comparison of GPAs by major leads to values of $\Delta_0$ ranging from .29 to .36, which represent between 9.6% and 11.9% of the average of the medians when using impartial trimming. Instead, when using symmetrical trimming, these percentages range from 16.6% to 19.5%.

Figure 4. p value curves using impartial and symmetrical (MC) trimmings. (a) GPA by gender; (b) GPA by major. [α = .051, .102, and .205, each shown for impartial and symmetrical (MC) trimming.]
3.2 Simulations
We end this section by reporting a small simulation study to illustrate our procedure's performance for finite samples when testing $H_0 : \tau_\alpha(F,G) \ge \Delta_0^2$ against $H_a : \tau_\alpha(F,G) < \Delta_0^2$ in the two-sample problem. We considered two different contaminated normal models, two different trimming sizes, and several values of the threshold $\Delta_0$. In each situation we generated 10,000 replicas of the trimmed score $Z_{n,n,\alpha}$ as defined in (18) for several values of n = m. We compared these replicas with the .05 theoretical quantile of the standard normal distribution, rejecting $H_0$ for observed values smaller than this quantity. Table 2 shows the observed rejection frequencies. We find good agreement with our asymptotic results even for moderate sample sizes, with low rejection frequencies for thresholds $\Delta_0$ smaller than the true distance and high rejection frequencies otherwise. When the threshold equals the true distance, we also can see how the observed frequency approximates the nominal level.
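The contaminated normal models in Table 2 are simple two-component mixtures. A minimal sketch of the data-generation step (our own code, not the authors' R programs; a full replication would plug the generated samples into the trimmed score (18)):

```python
import numpy as np

def contaminated_normal(rng, n, eps, shift):
    """Draw n observations from the mixture (1 - eps) N(0, 1) + eps N(shift, 1)."""
    is_contaminated = rng.random(n) < eps   # mixture component labels
    means = np.where(is_contaminated, shift, 0.0)
    return rng.normal(means, 1.0)

rng = np.random.default_rng(0)
# One replica of the first model in Table 2: P contaminated at +5, Q at -5.
x = contaminated_normal(rng, 1000, 0.05, 5.0)
y = contaminated_normal(rng, 1000, 0.05, -5.0)
```

Repeating this 10,000 times, computing $Z_{n,n,\alpha}$ for each replica, and counting rejections below the .05 normal quantile reproduces the design of Table 2.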
4. CONCLUSIONS AND POSSIBLE EXTENSIONS
We have introduced a procedure to compare two samples or probability distributions on the real line based on the impartial trimming methodology. The procedure is designed mainly to assess similarity of the cores of two samples by discarding the part of the data that has a greater influence on the dissimilarity of the distributions. Our method is based on trimming the corresponding samples according to the same trimming function, but it allows nonsymmetrical trimming; thus it can greatly improve on the previous methodology based on simply trimming the tails. We have evaluated the performance of our procedure through an analysis of some real data samples that emphasized the appealing possibilities in data analysis and the significance of the analysis of the p value curves for assessing similarities. A simulation study has also provided evidence about the behavior of the procedure for finite samples, in agreement with the asymptotic results. Although we treated only dissimilarities based on the Wasserstein distance, other metrics or dissimilarities could be handled under the same scheme.
Representation of trimmings of any distribution in terms of those of the uniform distribution is no longer possible in the multivariate setting. However, under very general assumptions, it has been proven (see Cuesta and Matrán 1989) that given two probabilities P and Q on $\mathbb{R}^k$, there exists an "optimal transport map" T such that $Q = P \circ T^{-1}$, and if X is any random vector with law P, then $E\|X - T(X)\|^2 = \mathcal{W}_2^2(P,Q)$. Moreover, if $P_\alpha$ is an α-trimming of P, then $P_\alpha \circ T^{-1}$ is an α-trimming of Q, and T is an optimal map between $P_\alpha$ and $P_\alpha \circ T^{-1}$, so the multivariate version of (4) would be the minimization, over the set of α-trimmings of P, of the expression $\mathcal{W}_2^2(P_\alpha, P_\alpha \circ T^{-1})$. We also should mention that obtaining the optimal map T remains an open problem for k > 1. Although these are troubling facts, obtaining the optimal trimming between two samples is already possible through standard optimization procedures. A final difficulty concerns the asymptotic behavior of the involved statistics, to which the techniques used in our proofs do not extend.
Table 2. Simulated powers for the trimmed scores $Z_{n,n,\alpha}$, α = .05

  Model 1: P = .95N(0, 1) + .05N(5, 1), Q = .95N(0, 1) + .05N(-5, 1)   [(1-α)τ_α(P, Q) = .384]
  Model 2: P = .9N(0, 1) + .1N(5, 1),  Q = .9N(0, 1) + .1N(-5, 1)      [(1-α)τ_α(P, Q) = 1.004]

  (1-α)Δ₀²       n     Model 1 frequency   Model 2 frequency
  .25          100          .0320              .0028
               200          .0268              0
               500          .0086              0
             1,000          .0021              0
             5,000          0                  0
  .5           100          .1412              .0109
               200          .1633              .0031
               500          .2264              .0002
             1,000          .3134              0
             5,000          .7648              0
  1            100          .4912              .0850
               200          .6957              .0727
               500          .9474              .0657
             1,000          .9989              .0584
             5,000         1.0000              .0486
Álvarez-Esteban, del Barrio, Cuesta-Albertos, and Matrán: Trimmed Comparison of Distributions 703
Allowing trimming in both samples with different trimming functions would provide an interesting alternative to our present proposal. Through our research, still in progress, we have identified a radically different behavior to that presented in this article for identical trimming in both samples.
APPENDIX: PROOFS AND FURTHER RESULTS

In this appendix, ρ_n(t) = √n f(F⁻¹(t))(F⁻¹(t) − F_n⁻¹(t)) denotes the weighted quantile process, where f is the density function of F.
Proof of Proposition 1

Let A = {P* ∈ 𝒫 : P*(−∞, t] = h(P(−∞, t]), h ∈ 𝒞_α}. For P* ∈ A, absolute continuity of h entails

P*(s, t] = h(P(−∞, t]) − h(P(−∞, s]) = ∫_{P(−∞,s]}^{P(−∞,t]} h′(x) dx ≤ (1/(1−α)) P(s, t].

Thus P* ≪ P and dP*/dP ≤ 1/(1−α) and, therefore, P* ∈ 𝒯_α(P).

Conversely, given P* ∈ 𝒯_α(P), if F is the distribution function of P and we define h(t) = ∫₀ᵗ (dP*/dP)(F⁻¹(s)) ds, then it is immediate that h ∈ 𝒞_α and

P*(−∞, t] = ∫_{−∞}^{t} (dP*/dP)(s) dF(s) = ∫₀^{F(t)} (dP*/dP)(F⁻¹(s)) ds = h(P(−∞, t]).

Therefore, P* ∈ A, and part a is proven. The proof of part b is immediate from the proof of part a.
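The correspondence in part a can be checked numerically for a concrete trimming function. A sketch with P = N(0, 1) and the upper-tail trim h(t) = min(t/(1 − α), 1), which belongs to 𝒞_α (the names and the interval check are ours, for illustration only):

```python
import numpy as np
from math import erf, sqrt

alpha = 0.2
F = lambda x: 0.5 * (1.0 + erf(x / sqrt(2.0)))  # N(0,1) distribution function
h = lambda t: min(t / (1.0 - alpha), 1.0)       # h(0)=0, h(1)=1, 0 <= h' <= 1/(1-alpha)

def P_star(s, t):
    """Trimmed measure of (s, t], defined through P*(-inf, x] = h(F(x))."""
    return h(F(t)) - h(F(s))

# P* must be absolutely continuous w.r.t. P with dP*/dP <= 1/(1 - alpha),
# i.e. P*(s, t] <= P(s, t]/(1 - alpha) for every interval (s, t].
rng = np.random.default_rng(42)
for _ in range(1000):
    s, t = np.sort(rng.normal(0.0, 2.0, size=2))
    assert P_star(s, t) <= (F(t) - F(s)) / (1.0 - alpha) + 1e-12
```

This particular h removes exactly the upper α tail of P, the classical one-sided trimming; impartial trimming instead lets the data choose where h′ vanishes.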
The following lemmas collect some results, which we use in our proofs of Theorems 2 and A.1. These results can be easily proven using Schwarz's inequality, standard arguments in empirical process theory, or the Arzelà–Ascoli theorem.

Lemma A.1. If F and G have finite absolute moment of order r > 4, then the following hold:

a. √n ∫₀^{1/n} (F⁻¹(t))² dt → 0 and √n ∫_{1−1/n}^{1} (F⁻¹(t))² dt → 0.
b. √n ∫₀^{1/n} (F_n⁻¹(t))² dt → 0 and √n ∫_{1−1/n}^{1} (F_n⁻¹(t))² dt → 0 in probability.
c. ∫₀¹ [√(t(1−t))/g(G⁻¹(t))] |F⁻¹(t) − G⁻¹(t)| dt < ∞.
d. Furthermore, if G satisfies (13), then (1/√n) ∫_{1/n}^{1−1/n} t(1−t)/g²(G⁻¹(t)) dt → 0.

Lemma A.2. Under the ‖·‖_∞ topology, the set 𝒞_α in Proposition 1 and the set 𝒞_α(F, G) = {h ∈ 𝒞_α : ∫₀¹ (F⁻¹(t) − G⁻¹(t))² h′(t) dt = 0}, for F and G with finite second moment, are compact.
Proof of Theorem 2

From theorem 6.2.1 of Csörgő and Horváth (1993) and (13), we can assume that there exist Brownian bridges B_n satisfying

n^{1/2−ν} sup_{1/n≤t≤1−1/n} |ρ_n(t) − B_n(t)|/(t(1−t))^ν = O_P(log n) if ν = 0, and O_P(1) if 0 < ν < 1/2.   (A.1)

Now we set

M_n(h) = √n ∫₀¹ (F_n⁻¹(t) − G⁻¹(t))² h′(t) dt

and

N_n(h) = 2 ∫_{1/n}^{1−1/n} [B_n(t)/f(F⁻¹(t))] (G⁻¹(t) − F⁻¹(t)) h′(t) dt + √n ∫_{1/n}^{1−1/n} (G⁻¹(t) − F⁻¹(t))² h′(t) dt.

Observe that

sup_{h∈𝒞_α} |M_n(h) − N_n(h)|
  ≤ (1/(1−α)) [ √n ∫₀^{1/n} (F_n⁻¹(t) − G⁻¹(t))² dt
      + √n ∫_{1−1/n}^{1} (F_n⁻¹(t) − G⁻¹(t))² dt
      + (2/√n) ∫_{1/n}^{1−1/n} (ρ_n(t) − B_n(t))²/f²(F⁻¹(t)) dt
      + (2/√n) ∫_{1/n}^{1−1/n} B_n²(t)/f²(F⁻¹(t)) dt
      + 2 ∫_{1/n}^{1−1/n} |ρ_n(t) − B_n(t)| |G⁻¹(t) − F⁻¹(t)|/f(F⁻¹(t)) dt ]
  =: A_{n,1} + A_{n,2} + A_{n,3} + A_{n,4} + A_{n,5}.

Lemma A.1 implies that A_{n,1} → 0 and A_{n,2} → 0 in probability. From (A.1), we get A_{n,3} ≤ O_P(1)(1/√n) ∫_{1/n}^{1−1/n} t(1−t)/f²(F⁻¹(t)) dt, and this last integral tends to 0 by Lemma A.1. Thus A_{n,3} → 0 in probability. Similarly, A_{n,4} → 0 in probability. Finally, (A.1) yields A_{n,5} ≤ O_P(1) n^{ν−1/2} ∫_{1/n}^{1−1/n} (t(1−t))^ν |G⁻¹(t) − F⁻¹(t)|/f(F⁻¹(t)) dt for some ν ∈ (0, 1/2). Lemma A.1 shows that ∫₀¹ [√(t(1−t))/f(F⁻¹(t))] |G⁻¹(t) − F⁻¹(t)| dt < ∞. Thus, by dominated convergence, we obtain that A_{n,5} → 0 in probability. Collecting the foregoing estimates, we obtain sup_{h∈𝒞_α} |M_n(h) − N_n(h)| → 0 in probability, and thus √n(T_{n,α} − S_{n,α}) → 0 in probability, where √n S_{n,α} = inf_{h∈𝒞_α} N_n(h). Therefore, we need only show that √n(S_{n,α} − τ_α(F, G)) →_w N(0, σ²(F, G)), where

√n S_{n,α} = inf_{h∈𝒞_α} [ 2 ∫₀¹ [B_n(t)/f(F⁻¹(t))] (G⁻¹(t) − F⁻¹(t)) h′(t) dt + √n ∫₀¹ (G⁻¹(t) − F⁻¹(t))² h′(t) dt ].
Let us denote

h_0 = argmin_{h∈𝒞_α} ∫₀¹ (F⁻¹(t) − G⁻¹(t))² h′(t) dt

and

h_n = argmin_{h∈𝒞_α} [ ∫₀¹ (F⁻¹(t) − G⁻¹(t))² h′(t) dt + (2/√n) ∫₀¹ [B_n(t)/f(F⁻¹(t))] (G⁻¹(t) − F⁻¹(t)) h′(t) dt ].

Clearly, h_n′(t) → h_0′(t) for almost every t. Furthermore, optimality of h_n shows that

β_n := √n S_{n,α} − 2 ∫₀¹ [B_n(t)/f(F⁻¹(t))] (G⁻¹(t) − F⁻¹(t)) h_0′(t) dt − √n ∫₀¹ (F⁻¹(t) − G⁻¹(t))² h_0′(t) dt ≤ 0,

but, in contrast,

β_n = √n ( ∫₀¹ (F⁻¹(t) − G⁻¹(t))² h_n′(t) dt − ∫₀¹ (F⁻¹(t) − G⁻¹(t))² h_0′(t) dt )
      + 2 ∫₀¹ [B_n(t)/f(F⁻¹(t))] (G⁻¹(t) − F⁻¹(t)) (h_n′(t) − h_0′(t)) dt
    =: β_{n,1} + β_{n,2},

and β_{n,1} ≥ 0 by optimality of h_0, whereas β_{n,2} = o_P(1) by the dominated convergence theorem. Therefore, β_n → 0 in probability, which shows that

√n(T_{n,α} − τ_α(F, G)) →_w 2 ∫₀¹ [B(t)/f(F⁻¹(t))] (G⁻¹(t) − F⁻¹(t)) h_0′(t) dt.   (A.2)

Integrating by parts, we obtain ∫₀¹ [B(t)/f(F⁻¹(t))](G⁻¹(t) − F⁻¹(t)) h_0′(t) dt = −∫₀¹ l(t) dB(t), which proves the asymptotic normality and the expression (17) for the variance. The claim about the variance estimator readily follows by noting that s²_{n,α}(G) = 4(∫₀¹ l_n²(t) dt − (∫₀¹ l_n(t) dt)²), where l_n(t) = ∫_{F_n⁻¹(1/2)}^{F_n⁻¹(t)} (x − G⁻¹(F_n(x))) ĥ_n′(F_n(x)) dx and ĥ_n = argmin_{h∈𝒞_α} ∫₀¹ (F_n⁻¹(t) − G⁻¹(t))² h′(t) dt. It can be shown that, with probability 1, l_n(t) → l(t) for almost every t ∈ (0, 1). A standard uniform integrability argument completes the proof.

The final result in this section establishes the asymptotic behavior of nT_{n,α} when F and G are equivalent at trimming level α. Recall the definition of 𝒞_α(F, G) in Lemma A.2 and note that 𝒞_α(F, F) = 𝒞_α, but also note that for F ≠ G, we have that 𝒞_α(F, G) is a proper subset of 𝒞_α. Also note that 𝒞_α(F, G) ≠ ∅ if and only if τ_α(F, G) = 0. In fact, the size of 𝒞_α(F, G) depends on the Lebesgue measure of the set {t ∈ (0, 1) : F⁻¹(t) ≠ G⁻¹(t)}: τ_α(F, G) = 0 if and only if the measure of this last set is at most α; if it equals α, then the only function in 𝒞_α(F, G) corresponds to h′(t) = (1/(1−α)) 𝟙{F⁻¹(t) = G⁻¹(t)}.
Theorem A.1. If τ_α(F, G) = 0, F satisfies (13), and

∫₀¹ t(1−t)/f²(F⁻¹(t)) dt < ∞,   (A.3)

then

nT_{n,α} →_w min_{h∈𝒞_α(F,G)} ∫₀¹ [B²(t)/f²(F⁻¹(t))] h′(t) dt,

where {B(t)}_{0<t<1} is a Brownian bridge. Because h ↦ ∫₀¹ [B²(t)/f²(F⁻¹(t))] h′(t) dt is ‖·‖_∞-continuous as a function of h, it attains its minimum value on 𝒞_α(F, G).
Proof. We define D_n(h) = n ∫₀¹ (F_n⁻¹(t) − G⁻¹(t))² h′(t) dt and D(h) = ∫₀¹ [B²(t)/f²(F⁻¹(t))] h′(t) dt for h ∈ 𝒞_α. Note that

D_n(h) = ∫₀¹ [ρ_n²(t)/f²(F⁻¹(t))] h′(t) dt + n ∫₀¹ (F⁻¹(t) − G⁻¹(t))² h′(t) dt + 2√n ∫₀¹ [ρ_n(t)/f(F⁻¹(t))] (G⁻¹(t) − F⁻¹(t)) h′(t) dt.

Also observe that nT_{n,α} = D_n(h_n) for some h_n ∈ 𝒞_α. If h ∈ 𝒞_α(F, G), then the second and third summands on the right side vanish and D_n(h) = ∫₀¹ [ρ_n²(t)/f²(F⁻¹(t))] h′(t) dt. By (13), (A.3), and a.s. representation of weak convergence, versions of ρ_n(·)/f(F⁻¹(·)) and B(·)/f(F⁻¹(·)) exist (for which we keep the same notation) such that

‖ρ_n(·)/f(F⁻¹(·)) − B(·)/f(F⁻¹(·))‖₂ → 0 a.s.

Now for these versions, we have

sup_{h∈𝒞_α(F,G)} |D_n(h) − D(h)| ≤ (1/(1−α)) ∫₀¹ |ρ_n²(t)/f²(F⁻¹(t)) − B²(t)/f²(F⁻¹(t))| dt → 0 a.s.,

whereas for h_0 ∈ 𝒞_α − 𝒞_α(F, G), we have a.s. that D_n(h) → ∞ uniformly in a sufficiently small neighborhood of h_0. Furthermore, if h_n → h ∈ 𝒞_α(F, G), then we can extract a subsequence such that n ∫₀¹ (F⁻¹(t) − G⁻¹(t))² h_n′(t) dt → 0. The result follows from the next technical lemma, the easy proof of which is omitted.

Lemma A.3. Let (X, d) be a compact metric space, let A be a compact subset of X, and let {f_n} and f be real-valued, continuous functions on X such that the following hold:

a. sup_{x∈A} |f_n(x) − f(x)| → 0, as n → ∞.
b. For x ∈ X − A there exists ε_x > 0 such that inf_{d(y,x)<ε_x} f_n(y) → ∞, as n → ∞.
c. If x_n → x ∈ A, there exists a subsequence {x_m} such that f_m(x_m) → f(x).

Then min_{x∈X} f_n(x) → min_{x∈A} f(x).
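For the simplest case F = G uniform on (0, 1) (so f(F⁻¹(t)) ≡ 1 and 𝒞_α(F, G) = 𝒞_α), the limit law of Theorem A.1 can be sampled by discretizing the Brownian bridge; the helper below is our own illustration, not part of the paper:

```python
import numpy as np

def limit_law_sample(alpha, m=2000, seed=None):
    """One draw from min over h in C_alpha of int B^2(t) h'(t) dt, F uniform.

    The minimizing h puts h' = 1/(1 - alpha) on the (1 - alpha)-measure set
    where B^2 is smallest, so the minimum equals the average of the smallest
    (1 - alpha) fraction of the B^2 values.
    """
    rng = np.random.default_rng(seed)
    t = np.arange(1, m) / m
    w = rng.normal(0.0, 1.0 / np.sqrt(m), size=m - 1).cumsum()  # Brownian motion on the grid
    b = w - t * w[-1]                                           # Brownian bridge
    d = np.sort(b ** 2)
    keep = int(np.floor((1.0 - alpha) * (m - 1)))
    return d[:keep].mean()
```

Trimming can only decrease the limiting value: for a fixed bridge (same seed), the α = .1 draw is bounded above by the α = 0 draw.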
[Received October 2007. Revised January 2008.]
REFERENCES
Bickel, P., and Freedman, D. (1981), "Some Asymptotic Theory for the Bootstrap," The Annals of Statistics, 9, 1196–1217.
Csörgő, M., and Horváth, L. (1993), Weighted Approximations in Probability and Statistics, New York: Wiley.
Cuesta, J. A., and Matrán, C. (1989), "Notes on the Wasserstein Metric in Hilbert Spaces," The Annals of Probability, 17, 1264–1276.
Cuesta, J., Gordaliza, A., and Matrán, C. (1997), "Trimmed k-Means: An Attempt to Robustify Quantizers," The Annals of Statistics, 25, 553–576.
Czado, C., and Munk, A. (1998), "Assessing the Similarity of Distributions—Finite-Sample Performance of the Empirical Mallows Distance," Journal of Statistical Computation and Simulation, 60, 319–346.
Freitag, G., Czado, C., and Munk, A. (2007), "A Nonparametric Test for Similarity of Marginals With Applications to the Assessment of Population Bioequivalence," Journal of Statistical Planning and Inference, 137, 691-1M.
García-Escudero, L., Gordaliza, A., Matrán, C., and Mayo-Iscar, A. (2008), "A General Trimming Approach to Robust Cluster Analysis," The Annals of Statistics, to appear.
Gordaliza, A. (1991), "Best Approximations to Random Variables Based on Trimming Procedures," Journal of Approximation Theory, 64, 162–180.
Maronna, R. (2005), "Principal Components and Orthogonal Regression Based on Robust Scales," Technometrics, 47, 264–273.
Moore, D. S., and McCabe, G. P. (2003), Introduction to the Practice of Statistics (4th ed.), New York: W. H. Freeman.
Munk, A., and Czado, C. (1998), "Nonparametric Validation of Similar Distributions and Assessment of Goodness of Fit," Journal of the Royal Statistical Society, Ser. B, 60, 223–241.
Rousseeuw, P. (1985), "Multivariate Estimation With High Breakdown Point," in Mathematical Statistics and Applications, Vol. B, eds. W. Grossmann, G. Pflug, I. Vincze, and W. Wertz, Dordrecht: Reidel, pp. 283–297.