Statistics & Probability Letters 26 (1996) 1-6

On exchangeable sampling distributions for uncontrolled data

Tom Leonard

Department of Statistics, University of Wisconsin-Madison, 1210 W. Dayton St., Madison, WI 53706, USA

Received November 1993; revised October 1994

Abstract

When statistical observations are not based upon a controlled randomized experiment, it can be appealing to try to model their joint distribution via an exchangeable sampling distribution. However, exchangeable sampling distributions should be used with extreme caution, and it is not obvious that they usefully model any lack of independence of the observation vectors. The two main problems concern the distributions of the test statistics, together with a lack of identifiability of the dependencies between the observation vectors. Two new asymptotic results relating to empirical processes, Dirichlet processes, and non-parametric tests for fit, are described in order to highlight these problems.

Keywords: Uncontrolled data; Non-parametric tests of significance; Cramer-Von Mises statistic; Exchangeable sampling distribution; Dirichlet process; Empirical processes

1. Introduction

Let Y_1, ..., Y_n denote p × 1 random vectors with common c.d.f. F_0(t), for t ∈ R^p. In some situations where randomization has not been employed at the design stage, it may appear inappropriate to take Y_1, ..., Y_n to be independent. However, a modest extension of Laplace's principle of insufficient reason (Laplace, 1812) can justify an assumption of "exchangeability" of the random vectors Y_1, ..., Y_n. This aspect is also discussed in detail by Lindley and Novick (1981), who highly recommend applying exchangeable sampling distributions to prediction problems and, for example, explain Simpson's paradox using this device.

An appealing way of constructing an appropriate exchangeable distribution for Y_1, ..., Y_n is as follows.

Stage I: Conditionally on a random c.d.f. F(t), Y_1, ..., Y_n denote a random sample from a distribution with c.d.f. F(t).

Stage II: The c.d.f. F(t) follows a Dirichlet process with parametric function αF_0(t), where α ∈ (0, ∞). Dirichlet processes are discussed in detail by Ferguson (1973) and Antoniak (1974). For example,

E[F(t)] = F_0(t) (t ∈ R^p), (1.1)

and

cov[F(s), F(t)] = (α + 1)^(-1) K(s, t; F_0), (1.2)

0167-7152/96/$12.00 © 1996 Elsevier Science B.V. All rights reserved. SSDI 0167-7152(94)00245-2


where

K(s, t; F_0) = F_0{min(s, t)} − F_0(s)F_0(t), (1.3)

and min(s, t) denotes the vector whose elements are the minima of the corresponding elements of s and t. Suppose that we wish to use statistics of the form

T_n = n^(-1) Σ_{i=1}^{n} g(Y_i) (1.4)

to estimate unknown parameters appearing in F_0(t), where g(t) is specified and bounded for t ∈ R^p. Then, as n → ∞, with α fixed,

T_n →_d H, (1.5)

where

H = ∫_{R^p} g(t) dF(t), (1.6)

and the F appearing in the integrand of (1.6) follows the Dirichlet process described above, with mean value function F_0(t). Therefore, when α is fixed and finite, it seems difficult to use T_n to consistently estimate unknown quantities in F_0(t). In particular, it appears similarly impossible to consistently estimate any moment or expectation relating to F_0(t) which is assumed to exist.
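The non-convergence in (1.5) is easy to see in simulation. The following Python sketch (mine, not the paper's; the function names are illustrative, and NumPy is assumed) draws Y_1, ..., Y_n from the exchangeable model of Section 1 via the Blackwell-MacQueen Polya urn, and shows that the spread of T_n, with g(y) = y and F_0 standard normal, does not shrink as n grows:

```python
import numpy as np

def polya_urn_sample(n, alpha, base_rvs, rng):
    """Draw Y_1, ..., Y_n from the exchangeable model of Section 1 via the
    Blackwell-MacQueen Polya urn: after i draws, Y_{i+1} is a fresh F_0 draw
    with probability alpha/(alpha + i), else repeats a past value."""
    y = [base_rvs(rng)]
    for i in range(1, n):
        if rng.random() < alpha / (alpha + i):
            y.append(base_rvs(rng))       # new draw from F_0
        else:
            y.append(y[rng.integers(i)])  # repeat a uniformly chosen past value
    return np.array(y)

rng = np.random.default_rng(0)
alpha = 5.0
std_normal = lambda r: r.standard_normal()

# T_n = n^(-1) sum g(Y_i) with g(y) = y; its spread does NOT shrink with n.
def sd_of_Tn(n, reps=300):
    return np.std([polya_urn_sample(n, alpha, std_normal, rng).mean()
                   for _ in range(reps)])

print(sd_of_Tn(50), sd_of_Tn(500))
```

With α fixed, T_n inherits the randomness of F itself, so both standard deviations stabilize near (α + 1)^(-1/2) ≈ 0.41 instead of decaying like n^(-1/2).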

The conditional mean of F(t), given Y_1 = y_1, ..., Y_n = y_n, is

F*(t) = (1 − p)F_n(t) + pF_0(t), (1.7)

where p = α/(α + n), and F_n(t) denotes the empirical c.d.f. The quantity in (1.7) also describes an appropriate predictive probability for a future observation vector. We are therefore motivated to consider asymptotic results as n → ∞, with p fixed. In this case T_n is a consistent estimator for

H_0 = ∫_{R^p} g(t) dF_0(t), (1.8)

and (1.7) can be used to sensibly predict the behavior of future observations by compromising between predictions based upon F_0(t) and predictions based upon the empirical c.d.f.
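As a concrete illustration (mine, not the paper's), the posterior-mean c.d.f. (1.7) is a one-line mixture of the empirical c.d.f. and the prior guess; here F_0 is taken to be standard normal:

```python
import numpy as np
from math import erf, sqrt

def F_star(t, y, alpha, F0):
    """Posterior mean c.d.f. (1.7): shrink the empirical c.d.f. F_n
    toward the prior guess F_0 by p = alpha/(alpha + n)."""
    y = np.asarray(y)
    n = len(y)
    p = alpha / (alpha + n)
    Fn = np.mean(y <= t)            # empirical c.d.f. at t
    return (1 - p) * Fn + p * F0(t)

Phi = lambda t: 0.5 * (1 + erf(t / sqrt(2)))  # standard normal F_0
y = [-0.3, 0.1, 0.4, 1.2, 2.0]
print(F_star(0.0, y, alpha=2.0, F0=Phi))  # → 0.2857... (= 2/7 here)
```

Here p = 2/7, F_n(0) = 0.2 and Phi(0) = 0.5, so the predictive c.d.f. at 0 is pulled from 0.2 up toward 0.5.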

A similar limiting device is considered by Paul and Plackett (1978), who generalize an assumption of several independent Poisson distributions. In their limiting case, they obtain a scale adjustment to the distribution of the chi-squared goodness-of-fit statistic. See also Leonard (1977), and Diaconis and Efron (1985).

2. An asymptotic result

Let U = n^(1/2)(F_n − F) and U_0 = n^(1/2) τ^(1/2)(F_n − F_0), where

τ = (1 + α)/(n + α). (2.1)

Then, conditionally on F, U has zero expectation and covariance kernel K(s, t; F), from (1.3). However, unconditionally, U_0 has zero expectation and covariance kernel K(s, t; F_0). The quantity τ^(-1) can be referred to as an "overdispersion factor".

Let F(t) possess positive support for t ∈ R^p. Define the limiting process, given F, of U, as n → ∞, as "a Brownian bridge with parametric function F" (see Csorgo et al., 1986). Then if F_0(t) possesses positive


support, for t ∈ R^p, the limiting process of U* = n^(1/2) p^(1/2)(F_n − F_0), as n → ∞, with p = α/(α + n) fixed, will follow a "Brownian bridge with parametric function F_0".

To see this, let {B_1, ..., B_q} denote any Borel partition of R^p, and let

N_i = n ∫_{B_i} dF_n(t), (2.2)

θ_i = ∫_{B_i} dF(t), (2.3)

and

ξ_i = ∫_{B_i} dF_0(t) (i = 1, ..., q). (2.4)

Then, given θ_1, ..., θ_q, the counts N_1, ..., N_q possess a multinomial distribution with cell probabilities θ_1, ..., θ_q and sample size n. Furthermore, θ_1, ..., θ_q possess a Dirichlet distribution with parameters αξ_1, ..., αξ_q. As n → ∞, with q and p = α/(α + n) fixed, it is straightforward to check that, unconditionally on θ_1, ..., θ_q, the limiting distribution of n^(-1/2)(N_1 − nξ_1), ..., n^(-1/2)(N_q − nξ_q) is multivariate normal with zero mean vector and (j, k)th covariance equal to p^(-1)(ξ_j δ_jk − ξ_j ξ_k), with δ_jk denoting the Kronecker delta.

This parallels the more standard result that, conditionally on θ_1, ..., θ_q, the limiting distribution of n^(-1/2)(N_1 − nθ_1), ..., n^(-1/2)(N_q − nθ_q) is multivariate normal with zero mean vector and (j, k)th covariance equal to θ_j δ_jk − θ_j θ_k. When applied to all choices of q and B_1, ..., B_q, this standard result characterizes a "Brownian bridge with parametric function F" for U = n^(1/2)(F_n − F). The results of the preceding paragraph therefore characterize a "Brownian bridge with parametric function F_0" for U* = n^(1/2) p^(1/2)(F_n − F_0).

Note that the Dirichlet distribution for θ_1, ..., θ_q, described above, can be represented by

θ_i = λ_i / Σ_{j=1}^{q} λ_j,

where the λ_j are independent gamma variates, with respective means ξ_j and variances ξ_j/α. The approximate multivariate normality, for large α, of the Dirichlet distribution is therefore easily established via approximate normality of the gamma distribution, together with an application of the "delta method".
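The gamma representation just described can be checked numerically. The following sketch (illustrative, assuming NumPy) draws θ by normalizing gammas with the stated means and variances, and recovers the Dirichlet moments:

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, xi = 50.0, np.array([0.2, 0.3, 0.5])  # large alpha => near-normality

# lambda_j ~ Gamma(shape = alpha*xi_j, scale = 1/alpha), so that
# E[lambda_j] = xi_j and var(lambda_j) = xi_j/alpha, as in the text.
lam = rng.gamma(shape=alpha * xi, scale=1.0 / alpha, size=(10000, 3))
theta = lam / lam.sum(axis=1, keepdims=True)  # theta_i = lambda_i / sum_j lambda_j

print(theta.mean(axis=0))  # ~ xi, matching (1.1) cell-wise
print(theta.var(axis=0))   # ~ xi*(1 - xi)/(alpha + 1), the Dirichlet variance
```

Normalizing independent gammas with common scale and shapes αξ_1, ..., αξ_q is exactly the classical construction of the Dirichlet(αξ_1, ..., αξ_q) distribution.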

Our main result, that the limiting process for U* = n^(1/2) p^(1/2)(F_n − F_0) is a Brownian bridge with parametric function F_0, also holds under some relaxations of the Dirichlet assumption for F. It is enough to assume that, as α → ∞, the limiting process for α^(1/2)(F − F_0) is a Brownian bridge with parametric function F_0.

3. Non-parametric tests for fit

The suitability of F_0(t) as a common distribution for Y_1, ..., Y_n may be tested by reference to a standard non-parametric test, e.g., based upon the Cramer-Von Mises statistic

U = n ∫_{R^p} [F_n(t) − F_0(t)]^2 dF_0(t). (3.1)

Durbin and Knott (1972) and Durbin et al. (1975) show that U converges in distribution, as n → ∞ and when F = F_0, to a random variable W. When p = 1 and F_0 is fully specified,

W = Σ_{j=1}^{∞} Z_j^2/(j^2 π^2), (3.2)


where Z_1, Z_2, ... are independent standard normal variates, so that E(W) = 1/6. These results are proved by Durbin et al., by reference to the standard asymptotic normality properties, conditional upon F, as described in Section 2. The distribution of W can be substantially affected when F_0 contains estimated parameters (e.g., Stephens and Maag, 1968; Koziol, 1982).

The following lemma is an immediate consequence of the developments in Section 2 and the proofs by Durbin et al. It, however, refers to the distribution of U under the full sampling assumptions of Section 1.

Lemma 1. Assume the exchangeable sampling distribution for Y_1, ..., Y_n defined in Section 1. Then as n → ∞, with p = α/(α + n) fixed, pU converges in distribution to W.

As a consequence of Lemma 1, the limiting density of U is ψ*(u) = pψ(pu), where ψ(w) is the density of W, so that the limiting expectation of U is (6p)^(-1), reducing to 1/6 when p = 1. The results suggest that if a standard fixed-size Cramer-Von Mises test is used to investigate H_0, when Y_1, ..., Y_n are in fact exchangeable rather than independent, many "apparently significant" results might be provided when, depending upon the correct value of p, F_n might not in fact be significantly different from F_0. This practical problem is compounded by the identifiability difficulties of the next section.
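A quick Monte Carlo check of (3.2), and of Lemma 1's practical consequence, can be sketched as follows (my sketch, assuming NumPy; the truncation level J is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)

# Draws of W = sum_{j>=1} Z_j^2/(j^2 pi^2), truncated at J terms.
J, reps = 500, 5000
Z = rng.standard_normal((reps, J))
w = (Z**2 / (np.pi**2 * np.arange(1, J + 1)**2)).sum(axis=1)

print(w.mean())             # close to E(W) = 1/6
c95 = np.quantile(w, 0.95)  # Monte Carlo 5% critical value of W

# Lemma 1: under the exchangeable model pU ->_d W, so a standard test that
# rejects when U > c95 in effect rejects when W > p*c95; with p = 0.5 the
# realized size is well above the nominal 5%:
print(np.mean(w > 0.5 * c95))
```

The second printed quantity is the limiting rejection rate of the nominal 5% test when p = 0.5, illustrating the "apparently significant" results discussed above.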

4. An identifiability problem

Under the partition {B_1, ..., B_q} described in Section 2, and the distributional assumptions of Section 1, the joint distribution of N_1, ..., N_q, given α, is

p(N_1 = n_1, ..., N_q = n_q | α) = [n! Γ(α)/Γ(α + n)] Π_{i=1}^{q} Γ(αξ_i + n_i)/{n_i! Γ(αξ_i)}. (4.1)

The expression in (4.1) also provides the likelihood of α, when n_1, ..., n_q is observed. Suppose now that Y_1, ..., Y_n happen to give no ties, i.e., n distinct realizations y_1, ..., y_n. However, then let q → ∞, with n fixed and {B_1, ..., B_q} denoting any partition such that each element of the partition contains at most one distinct observation vector. In such limiting cases, the expression in (4.1) possesses the behavior

l(α | n) → n! K α^n Γ(α)/Γ(α + n) (0 < α < ∞), (4.2)

where K/Π_{j: n_j = 1} ξ_j → 1, as q → ∞. Note that, in the limit, exactly n elements of the partition will contain a distinct observation vector. The likelihood (4.2) can also be obtained via the Polya urn scheme representation of Dirichlet processes, due to Blackwell and MacQueen (1973).

The likelihood in (4.2) contains no information, apart from n, from the data. This suggests that it is very difficult to identify the parameter p = α/(α + n) when the observations are continuous. For example, if α is taken to possess prior density π(α), the formula (1.7) can be replaced by the hierarchical Bayes estimator

F*(t) = (1 − p*)F_n(t) + p*F_0(t), (4.3)

but where p* = E[p | y] depends upon π(α) and n, and not the current data. Furthermore, while U in (3.1) possesses limiting expectation (6p)^(-1), it is not a consistent estimator for p. While Antoniak (1974) describes consistent estimators for p, in an empirical Bayes context, these would require the presence of ties in the data.
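To see the identifiability problem numerically, note that with no ties the α-dependent part of (4.2) is α^n Γ(α)/Γ(α + n), which is strictly increasing in α for every dataset of a given size, so its supremum is only approached as α → ∞. A short check (mine, not the paper's):

```python
from math import lgamma, log

def log_lik_alpha(alpha, n):
    """Log of the alpha-dependent factor of (4.2):
    n*log(alpha) + log Gamma(alpha) - log Gamma(alpha + n)."""
    return n * log(alpha) + lgamma(alpha) - lgamma(alpha + n)

alphas = [0.5 * k for k in range(1, 101)]  # grid 0.5, 1.0, ..., 50.0
ll = [log_lik_alpha(a, n=20) for a in alphas]

# The likelihood increases monotonically in alpha: the data (beyond n)
# carry no information about alpha, hence none about p = alpha/(alpha + n).
print(all(b > a for a, b in zip(ll, ll[1:])))  # → True
```

The monotonicity follows because d/dα of the log-likelihood is Σ_{k=0}^{n−1} [1/α − 1/(α + k)] > 0 for n ≥ 2.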


5. Discussion

The problems experienced in Sections 3 and 4 are also experienced for a variety of other exchangeable sampling distributions. Suppose, for example, that n scalar observations are taken to be (a) multivariate normal, and (b) exchangeable. Then the distribution of Y = (Y_1, ..., Y_n)^T must possess mean vector and covariance matrix of the forms μe_n and σ^2[(1 − ρ)I_n + ρ e_n e_n^T], where e_n denotes the n × 1 unit vector, and the three parameters μ, σ^2, and ρ satisfy −∞ < μ < ∞, 0 < σ^2 < ∞, and −1/(n − 1) < ρ < +1. Then the likelihood of μ, σ^2, and ρ is

l(μ, σ^2, ρ | y) = (2π)^(-n/2) (σ^2)^(-n/2) (1 − ρ)^(-(n−1)/2) [1 + (n − 1)ρ]^(-1/2)

    × exp{−s^2/[2(1 − ρ)σ^2] − n(ȳ − μ)^2/[2σ^2(1 + (n − 1)ρ)]}, (5.1)

and this only depends upon the data via the two-dimensional sufficient statistic t = (ȳ, s^2), where s^2 = Σ(y_i − ȳ)^2, and ȳ denotes the sample mean. Furthermore, S^2 = Σ(Y_i − Ȳ)^2 and Ȳ are independent, S^2/(1 − ρ)σ^2 possesses a chi-squared distribution with (n − 1) degrees of freedom, and Ȳ is normally distributed with mean μ and variance n^(-1)σ^2[1 + (n − 1)ρ]. Consequently, an equal-tailed 100(1 − ε)% confidence interval for μ is

ȳ ± t_{ε/2}(s/√n)[1 + (n − 1)ρ]^(1/2)(1 − ρ)^(-1/2), (5.2)

where t_{ε/2} denotes the percentage point for the usual t-confidence interval, when ρ = 0. The interval in (5.2) can be much larger than the usual confidence interval, e.g., if ρ is close to one. Equivalently, the standard t-tests can be much too ready to reject null hypotheses, compared with t-tests based upon the current assumptions, with ρ specified and positive. This exchangeable sampling model for non-randomized data can therefore lead to arbitrary inferences, since ρ cannot be identified from the current data, separately from μ and σ^2. Obviously, replications would help, but it might be necessary to make the same choice of ρ for different samples. Furthermore, independence of the replications might not be appropriate, in the absence of randomization.
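The widening factor in (5.2), relative to the usual interval, is [1 + (n − 1)ρ]^(1/2)(1 − ρ)^(-1/2). A small numerical illustration (mine, not the paper's):

```python
from math import sqrt

def width_inflation(n, rho):
    """Ratio of the (5.2) half-width to the usual t-interval (rho = 0)."""
    return sqrt((1 + (n - 1) * rho) / (1 - rho))

for rho in (0.0, 0.1, 0.5, 0.9):
    print(rho, round(width_inflation(20, rho), 2))
# with n = 20, rho = 0.1 already nearly doubles the interval,
# and rho = 0.9 inflates it more than 13-fold
```

This makes concrete how sensitive the resulting inference is to a value of ρ that, as argued above, the data cannot identify.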

Suppose, more generally, that the vectors Y_1, ..., Y_n are exchangeable. Then de Finetti's theorem (e.g., Diaconis and Freedman, 1980) tells us that, when n is large, the joint c.d.f. of Y_1, ..., Y_n must approach the form

F(y_1, y_2, ..., y_n) = ∫_Ω {Π_{i=1}^{n} G(y_i | η)} dH(η), (5.3)

i.e., given η, Y_1, ..., Y_n are a random sample from a distribution with c.d.f. G(· | η), but where η possesses a distribution H across some space Ω.

Identifiability problems remain for many parametric choices of G and H. For example, let the estimator η̂_n strongly converge to η, as n → ∞, whenever Y_1, ..., Y_n are a random sample with common c.d.f. G(y | η). Then, under the joint distribution (5.3), η̂_n will converge in distribution to a random variable η with distribution H. In many circumstances, it will therefore be very difficult to estimate consistently all unknown quantities appearing in G and H.

The results in this paper suggest that it can be very difficult to use exchangeable sampling distributions to model non-randomized data, unless something is definitely known about the dependency structure of the observations. It is, however, not at all obvious how such knowledge can be meaningfully obtained. This confirms that many inferences drawn from non-randomized data are at best subjective. Furthermore, results based upon confidence intervals or hypothesis tests can be, at worst, quite misleading.


Acknowledgements

The author would like to thank David Mason, and a referee, for helpful comments, and Michael Newton, for demonstrating that (4.2) can also be obtained by a Polya urn scheme representation.

References

Anderson, T.W. and D.A. Darling (1952), Asymptotic theory of certain "goodness-of-fit" criteria based on stochastic processes, Ann. Math. Statist. 23, 193-212.

Antoniak, C.E. (1974), Mixtures of Dirichlet processes, with applications to Bayesian non-parametric problems, Ann. Statist. 2, 1152-1174.

Blackwell, D. and J.B. MacQueen (1973), Ferguson distributions via Polya urn schemes, Ann. Statist. 1, 353-355.

Csorgo, M., S. Csorgo, L. Horvath and D.M. Mason (1986), Weighted empirical and quantile processes, Ann. Probab. 14, 31-85.

Diaconis, P. and B. Efron (1985), Testing for independence in a two-way table: new interpretations of the chi-square statistic, Ann. Statist. 13, 845-887.

Diaconis, P. and D. Freedman (1980), Finite exchangeable sequences, Ann. Probab. 8, 745-764.

Durbin, J. and M. Knott (1972), Components of Cramer-Von Mises statistics, J. Roy. Statist. Soc. B 34, 290-307.

Durbin, J., M. Knott and C.C. Taylor (1975), Components of Cramer-Von Mises statistics II, J. Roy. Statist. Soc. B 37, 211-257.

Ferguson, T. (1973), A Bayesian analysis of some non-parametric problems, Ann. Statist. 1, 209-230.

Freedman, D. and P. Diaconis (1983), On inconsistent Bayes estimates in the discrete case, Ann. Statist. 11, 1109-1118.

Koziol, J.A. (1982), A class of invariant procedures for assessing multivariate normality, Biometrika 69, 423-428.

Laplace, P.S. (1812), Théorie Analytique des Probabilités (Courcier, Paris).

Leonard, T. (1977), A Bayesian approach to some multinomial estimation and pre-testing problems, J. Amer. Statist. Assoc. 72, 869-874.

Lindley, D.V. and M.R. Novick (1981), The role of exchangeability in inference, Ann. Statist. 9, 45-58.

Paul, S.R. and R.C. Plackett (1978), Inference sensitivity for Poisson mixtures, Biometrika 65, 591-602.

Stephens, M.A. and U.R. Maag (1968), Further percentage points for W^2, Biometrika 55, 428-430.