Nonparametric Bayes stochastically ordered latent class models
Hongxia Yanga, Sean O’Brienb and David B. Dunsona
aDepartment of Statistical Science, bDepartment of Biostatistics and Bioinformatics
Duke University, NC 27708
email: [email protected], [email protected], [email protected]
Authors’ Footnote:
Hongxia Yang is a PhD Student, Department of Statistical Science, Duke University. Mailing ad-
dress: Department of Statistical Science Box 90251, 214 Old Chemistry Building, Duke University,
Durham, NC 27708-0251 (email: [email protected]).
Sean O’Brien is Assistant Professor, Department of Biostatistics and Bioinformatics, Duke Uni-
versity. Mailing address: Department of Biostatistics and Bioinformatics, DUMC Box 2721,Duke
University, Durham, NC 27710 (email: [email protected]).
David B. Dunson is Professor, Department of Statistical Science, Duke University. Mailing ad-
dress: Department of Statistical Science Box 90251, 214 Old Chemistry Building, Duke University,
Durham, NC 27708-0251 (email: [email protected]).
Abstract
Latent class models (LCMs) are used increasingly for addressing a broad variety of problems, including
sparse modeling of multivariate and longitudinal data, model-based clustering, and flexible inferences on
predictor effects. Typical frequentist LCMs require estimation of a single finite number of classes, which
does not increase with the sample size, and have a well-known sensitivity to parametric assumptions on
the distributions within a class. Bayesian nonparametric methods have been developed to allow an infinite
number of classes in the general population, with the number represented in a sample increasing with sample
size. In this article, we propose a new nonparametric Bayes model that allows predictors to flexibly impact
the allocation to latent classes, while limiting sensitivity to parametric assumptions by allowing class-specific
distributions to be unknown subject to a stochastic ordering constraint. An efficient MCMC algorithm is
developed for posterior computation. The methods are validated using simulation studies and applied to the
problem of ranking medical procedures in terms of the distribution of patient morbidity.
Keywords: Factor analysis; Latent variables; Mixture model; Model-based clustering; Nested Dirichlet
process; Order restriction; Random probability measure; Stick-breaking.
1. INTRODUCTION
Latent class models (LCMs) are routinely used for analysis and interpretation of multivariate
data. LCMs comprise an extremely rich class of discrete mixture models, which allow units to
be allocated to latent sub-populations or clusters, with the allocation probabilities potentially
dependent on predictors. Suppose one collects response data yi = (yi1, . . . , yip)′ ∈ R^p and predictors xi = (xi1, . . . , xiq)′ for subjects i = 1, . . . , n. Then, a simple Gaussian LCM could be specified
as
f(yi | xi) = ∑_{k=1}^K πk(xi) Np(yi; µk, Σk),    (1)
where πk(xi) is the probability of allocation to latent class k given predictors xi, the response data
for subjects in class k are normally distributed with mean µk and covariance Σk, and K is the
number of latent classes. In routine applications of such models, πk(xi) is typically specified as a
logistic regression model and the EM algorithm is used for maximum likelihood estimation.
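As a concrete illustration of (1), the following minimal Python sketch evaluates the mixture density under a softmax (multinomial logistic) specification for πk(xi); the function names and the zero-coefficient example below are illustrative only, not part of any fitted model.

```python
import numpy as np
from scipy.stats import multivariate_normal

def allocation_probs(x, beta):
    """pi_k(x) under a softmax (multinomial logistic) model; beta is (K, q)."""
    logits = beta @ x
    logits -= logits.max()          # numerical stabilization
    w = np.exp(logits)
    return w / w.sum()

def lcm_density(y, x, beta, mus, Sigmas):
    """f(y | x) = sum_k pi_k(x) N_p(y; mu_k, Sigma_k), as in (1)."""
    pis = allocation_probs(x, beta)
    return sum(p * multivariate_normal.pdf(y, mean=m, cov=S)
               for p, m, S in zip(pis, mus, Sigmas))
```

With all coefficients zero the allocation probabilities are uniform over the K classes, which makes the construction easy to sanity-check.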
There are a number of well known issues that arise in considering model (1) and related LCMs.
First, there is the so-called label ambiguity problem, which results because there is nothing dis-
tinguishing class k from k′ a priori. The estimates produced by the EM algorithm correspond to
a local mode, with an identical likelihood obtained for any permutation of the labels {1, . . . , K} on the K clusters. Label ambiguity is even more of a problem in Bayesian analyses of LCMs relying on Markov chain Monte Carlo (MCMC) for posterior computation, as label switching makes
it difficult to obtain meaningful posterior summaries of the cluster-specific parameters from the
MCMC output, though post-processing can potentially be used (Stephens 2000; Jasra, Holmes
and Stephens 2005). Although constraints on the component-specific parameters, such as ordered
means, are widely-used to avoid label ambiguity, it is typically not clear what constraints are appro-
priate in multivariate models such as (1) and partial ambiguity may remain even with constraints.
A second well known issue is uncertainty in the choice of K. Although standard analyses rely on
selection criteria, such as the BIC, the theoretical justification for use of the BIC in mixture models
such as LCMs is unclear. In addition, conditioning on a selected value in a two-stage procedure
clearly ignores uncertainty in the selection process. A third issue with LCMs is sensitivity to parametric assumptions, with a very different number of clusters and allocation to clusters potentially obtained if one replaces the normality assumption in (1) with a multivariate t distribution or other choice.
choice.
Our motivation is drawn from an application to ranking of medical procedures in terms of
the distribution of patient morbidity following the procedure. In particular, we would like to
obtain clusters (latent classes) of procedures having a similar morbidity distribution, while also
estimating an ordering in severity of the procedures. Ideally, we would like to avoid some
of the problems arising in typical LCMs through stochastic ordering restrictions that are natural
in many applications, with nonparametric Bayes methods used to allow infinitely-many classes and
avoid parametric assumptions on the class-specific distributions. We will focus on the setting in
which subjects are nested within pre-specified groups, with i = 1, . . . , n indexing the groups and
j = 1, . . . , ni the subjects in the ith group. In the motivating application, groups correspond to
different medical procedures.
For illustration, initially consider the case in which yij is a single outcome for subject j in group
i, there are no predictors, and we let yij ∼ Fi, with Fi the distribution specific to group i. Then,
taking a nonparametric Bayes approach, we require a prior for the collection of distributions {F1, . . . , Fn}.
Two possibilities that have been proposed in the literature are the hierarchical Dirichlet process (HDP) (Teh et al. 2006) and nested Dirichlet process (nDP) mixtures (Rodriguez et al. 2008). The
HDP specification automatically allocates patients to clusters, with dependence incorporated in
the cluster weights across the groups. The nDP is more relevant in clustering groups, with each
cluster having a different distribution of subject-level outcomes. Specifically, the nDP mixture
model would let Fi(y) = Fi′(y) with prior probability 1/(1+α), with α a precision parameter. The
densities specific to each cluster are then modeled using separate DP mixture models.
This approach partly addresses our interests in allowing clustering of procedures based on
the distribution of patient outcomes, while allowing the number of clusters (latent classes) to be
unknown. However, there is no allowance for predictors that provide information about the cluster
allocation and there is no natural way to obtain a ranking of the procedures. Potentially, one may
rank the procedures based on the mean of Fi, but it is not clear that the mean is the best summary
to rank on, as the proportion of subjects having extreme or life-threatening adverse events may be
more clinically relevant. With this motivation, we propose a nonparametric Bayes stochastically ordered LCM (SO-LCM) that is inspired by the nDP but has a fundamentally different structure.
Section 2 proposes the basic structure of the SO-LCM, with considerations of properties and
extensions to more complex hierarchical models motivated in particular by the ranking of medical
procedures application. Section 3 outlines an MCMC algorithm for posterior computation. Section
4 contains a simulation study assessing operating characteristics under a default prior. Section
5 applies the method to the medical procedures data, showing advantages relative to parametric
methods, and Section 6 contains a discussion.
2. STOCHASTICALLY ORDERED LATENT CLASS PRIORS
2.1 Basic Formulation and Properties
Consider a collection of unknown distributions P = {P1, . . . , Pn}, with P ∼ 𝒫, where 𝒫 is a prior over such collections. In particular, the prior 𝒫 is induced by letting,
Pi ∼ ∑_{k=1}^∞ πk(xi) δ_{P*k},    P*k = ∑_{l=1}^∞ vl δ_{θkl},    (2)
where πk(xi) = Pr(Pi = P*k | xi) is the conditional probability of allocating distribution i to cluster
k given predictors xi = (xi1, . . . , xiq)′, and each of the cluster-specific distributions is assumed to
be discrete. In particular, the distribution P ∗k specific to cluster k has probability weights {vl} on
atoms {θkl}. This discreteness assumption will be relaxed later by using Pi as the mixing distribution within a continuous kernel.
There are two main distinct features of prior (2) relative to the nested Dirichlet process. First,
we allow covariates to impact the allocation to clusters. In the motivating application to ranking
of medical procedures, this is an important modification, as we have preliminary rankings of the
different procedures by physicians. These rankings can serve as a predictor informing the allocation
to clusters. Hence, instead of simply relying on the preliminary physician rankings or the outcomes
data in isolation, we allow for a combination or fusion of these data in ranking the procedures.
Second, as we are interested in ranking the procedures, we impose a stochastic ordering restriction
on the cluster-specific distributions with P*k ≼ P*k′ for all k < k′, where P*k ≼ P*k′ denotes that P*k is stochastically no larger than P*k′, so that P*k(a, ∞) ≤ P*k′(a, ∞) for all a. This restriction implies that clusters with a higher index correspond to stochastically higher distributions.
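For discrete distributions that share the weights {vl}, the ordering P*k(a, ∞) ≤ P*k′(a, ∞) can be checked numerically by comparing survival functions on a grid; this is an illustrative sketch, with the grid-based comparison an assumption rather than part of the methodology.

```python
import numpy as np

def stochastically_leq(atoms_a, atoms_b, weights, grid):
    """Check whether the measure with atoms_a is stochastically no larger than
    the one with atoms_b, i.e. its survival function is pointwise smaller.
    Both measures share the same weights, as in the rDDP construction."""
    surv_a = np.array([(weights * (atoms_a > t)).sum() for t in grid])
    surv_b = np.array([(weights * (atoms_b > t)).sum() for t in grid])
    return bool(np.all(surv_a <= surv_b + 1e-12))
```

With shared weights, elementwise-ordered atoms imply stochastic ordering, which is exactly the mechanism exploited by the prior described above.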
Dunson and Peddada (2008) proposed a restricted dependent Dirichlet process (rDDP) prior
for stochastically ordered distributions. Here, we apply the rDDP prior to the cluster-specific
distributions P* = {P*k : k = 1, 2, . . .}. We could have instead used an alternative stochastically ordered prior, such as the approaches proposed by Karabatsos and Walker (2007). We used the rDDP
mixture prior instead to avoid the partitioning effect of the Polya tree prior. Such an effect can be removed using mixtures of Polya trees, though computation for such models can be more intensive and the resulting density estimates still tend to be quite spiky. In our experience, the estimates produced by DP mixtures of Gaussian kernels tend to match our prior beliefs about the latent variable density more closely.
The stochastic ordering prior from the rDDP is accomplished by first letting vl = νl ∏_{s<l}(1 − νs), with νl ∼ beta(1, α2) independently for l = 1, . . . , ∞. Then, we let θl = (θ1l, θ2l, . . .)′ ∼ P0 independently for l = 1, . . . , ∞, with P0 chosen so that P0(θ1l ≤ θ2l ≤ · · · ) = 1. The cluster k distribution, P*k, is marginally distributed according to a Dirichlet process prior with precision α2 and base distribution P0k, with P0k the kth marginal distribution of P0. This implies that θkl ∼ P0k marginally, where θkl is the kth element of the vector θl. In addition, Pr(P*k ≼ P*k′) = 1 for all k < k′ a priori (and hence a posteriori). Dependence in the elements of P* is incorporated
through the use of weights {vl} that are fixed across k, together with dependent atoms. This dependence structure
allows flexible borrowing of information across the cluster-specific distributions.
As a specific choice of P0, let θ1l = γ*1l ∼ N(m0, s0²) and γ*kl = θkl − θk−1,l for k = 2, . . . , ∞, with

γ*kl ∼ w0 δ0 + (1 − w0) N+(0, κ^{-1}),  k = 2, . . . , ∞,    (3)
where w0 = Pr(γ∗kl = 0) and N+ denotes the normal distribution truncated to have positive support.
By including positive mass at zero, the prior allows a subset of the atoms in P*k and P*k′ to be identical. This is appealing in allowing commonalities between the distributions specific to different
latent classes. Also including a positive probability of zero values allows collapsing on an effectively
lower-dimensional model through zeroing out the coefficients. This allows us to start with a very
richly parameterized model and adaptively drop out parameters that are not needed. To allow
the data to inform about the appropriate value for the point mass probability w0, we choose a
hyperprior w0 ∼ beta(aw0 , bw0), with aw0 = bw0 = 1 used routinely as a default.
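A single draw of the ordered atom vector θl = (θ1l, . . . , θKl) from this choice of P0 can be sketched as follows; the default hyperparameter values in the signature are illustrative placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

def draw_ordered_atoms(K, m0=0.0, s0=1.0, w0=0.5, kappa=1.0):
    """One draw from P0 in (3): theta_1l ~ N(m0, s0^2); each increment
    gamma*_kl is 0 with probability w0, otherwise half-normal N+(0, 1/kappa).
    Cumulative sums of the nonnegative increments give ordered atoms."""
    inc = np.where(rng.random(K - 1) < w0,
                   0.0,
                   np.abs(rng.normal(0.0, 1.0 / np.sqrt(kappa), K - 1)))
    theta1 = rng.normal(m0, s0)
    return theta1 + np.concatenate(([0.0], np.cumsum(inc)))
```

Setting w0 = 1 forces all increments to zero, collapsing the K atoms to a common value, which is the ties-across-clusters behavior described in the text.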
To complete a specification of the SO-LCM, we require a prior for the predictor-dependent
probabilities. For simplicity, we use the logistic regression-type model
πk(x) = ψk exp(x′βk) / ∑_{l=1}^K ψl exp(x′βl),   ψk ∼ Gamma(α1/K, 1),   βk ∼ H,    (4)
where ψk ≥ 0 is a baseline weight for mixture component k, βk are regression parameters controlling
the impact of the predictors on the probabilities of allocation to each cluster (latent class), and H
is a prior on the regression coefficients. For example, H can be chosen to be Gaussian or, to allow
shrinkage towards zero for unimportant coefficients, we can choose a heavy-tailed Cauchy prior or
a variable selection mixture prior with a mass at zero.
Unlike in typical generalized logistic regression models, we avoid placing identifiability constraints on the parameters, such as setting the coefficients equal to zero in a reference class. Unlike in frequentist models fitted by maximum likelihood, in the Bayesian setting the choice of the reference class can impact the results, and it is important to maintain exchangeability of the cluster indices in model (4).
Otherwise, there may be some bias introduced in which we favor stochastically smaller or larger
distributions a priori. In Bayesian modeling, it is not necessary to satisfy frequentist identifiabil-
ity criteria, and indeed it is often quite useful to consider over-parameterized models as long as
inferences are based on identifiable quantities.
To further motivate models (2)–(4), it is useful to consider properties in the baseline case in which x = 0, so that we obtain πk = ψk / ∑_{l=1}^K ψl. In this case, the particular gamma prior that was chosen for the cluster-specific weight parameters leads to (π1, . . . , πK) ∼ Dir(α1/K, . . . , α1/K).
This is the same distribution on the cluster-specific probabilities that was proposed by Ishwaran and
Zarepour (2002) in developing a finite approximation to the Dirichlet process. It is straightforward
to show (proof in appendix A) that the prior probability of clustering two groups in this baseline
case is,
Pr(Pi = Pi′) = E(∑_{k=1}^K πk²) = (1 + α1/K) / (1 + α1),    (5)

which simplifies to 1/(1 + α1) in the limit as K → ∞. In addition, the prior probability that group
i is stochastically less than group i′ can be derived as,

Pr(Pi ≺ Pi′) = (1/2){1 − Pr(Pi = Pi′)} = α1/{2(1 + α1)} · (1 − 1/K),    (6)

which reduces to α1/{2(1 + α1)} in the limit as K → ∞.
Hence, α1 is a key hyperparameter controlling the prior on clustering and ordering of the
groups. For greater flexibility, we recommend letting α1 ∼ Gamma(a1, b1). In many applications,
it is appealing to favor a slow rate of introduction of new clusters with sample size. As in the DP,
clusters are introduced at a rate proportional to α1 log n when K is sufficiently large. In order to favor
few clusters relative to the number of groups n, one can choose the hyperparameters a1, b1 so that
the prior is concentrated at values close to zero. In the application to ranking of medical procedures
in terms of their severity, our physician collaborators have a strong preference for parsimony and
expect a model with 6 (or fewer) clusters to fit the data adequately. This knowledge is used to elicit
the a1, b1 hyperparameters. In the case in which covariates are included, (5) and (6) can potentially
be extended, and it will be the case that prior clustering and ordering probabilities depend on the
relative values of the predictors for the two groups. However, it is not straightforward to obtain
simple analytic forms.
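Equation (5) can be verified by Monte Carlo, building the Dirichlet weights from normalized Gamma(α1/K, 1) draws exactly as in (4) with x = 0; the simulation size and seed below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)

def coclustering_prob(alpha1, K, nsim=200_000):
    """Monte Carlo estimate of Pr(P_i = P_i') = E(sum_k pi_k^2), where
    (pi_1, ..., pi_K) ~ Dir(alpha1/K, ..., alpha1/K) is constructed from
    normalized Gamma(alpha1/K, 1) draws."""
    psi = rng.gamma(alpha1 / K, 1.0, size=(nsim, K))
    pis = psi / psi.sum(axis=1, keepdims=True)
    return float((pis ** 2).sum(axis=1).mean())
```

For α1 = 1 and K = 20, the estimate should be close to (1 + 1/20)/(1 + 1) = 0.525.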
2.2 Applications to Ranking Medical Procedures
In the motivating application to ranking medical procedures based on the distribution of patient
morbidity following each procedure, response data consist of a vector y*ij = (y*ij1, . . . , y*ijp)′ of p measures of morbidity on the jth patient having procedure i, for i = 1, . . . , n and j = 1, . . . , ni. The first p1 elements of y*ij are continuous and the next p2 elements are binary, with p1 + p2 = p. Higher values of each of the measurements imply higher morbidity, and we relate the measurements to a latent morbidity score for each patient within each procedure through the following factor model,

y*ijt = ht(yijt), with ht(y) = y for t = 1, . . . , p1 and ht(y) = 1(y > 0) for t = p1 + 1, . . . , p,
yij = µ + Λ ηij + εij,  εijt ∼ N(0, σ²it),
ηij ∼ fi,  fi(η) = ∫ K(η; γ) dPi(γ),  K(η; γ) = ∫ N(η; γ, ϕ) dQ(ϕ),    (7)

where yijt is a continuous variable underlying y*ijt, with y*ijt = yijt for continuous responses and y*ijt = 1(yijt > 0) for binary responses, µ = (µ1, . . . , µp)′ is a p × 1 intercept vector, Λ = (λ1, . . . , λp)′ is a p × 1 vector of factor loadings, ηij is a latent morbidity score for the jth patient having procedure i, and K(·; γ) is an unknown unimodal kernel that is symmetric about γ. The procedure-specific latent variable density functions are modeled as a flexible location-scale
mixture of Gaussian densities. By using an unknown kernel, we favor fewer and more biologically
interpretable clusters. Letting 1/σ²it = ci dt for continuous responses, we obtain an additive log-linear model for the residual precision, with ci a procedure-specific multiple and dt a response-type-specific multiple, while fixing σ²it = 1 for binary responses. This allows the residual variance to change
for the different procedures, while also allowing a shift specific to each measure of morbidity. The
constraint on the residual variances for the continuous variables underlying the binary responses
is a standard identifiability condition. Because higher values of y∗ijt imply higher morbidity, we
constrain the factor loadings to be non-negative so that λt ≥ 0 for t = 1, . . . , p. For the scale
mixture component, we let Q ∼ DP (α0Q0) where Q0 = Gamma(c0, d0) is the base measure.
Within expression (7), fi denotes the density of the latent morbidity score specific to patients
receiving procedure i. This density is treated as unknown using a flexible location-scale mixture
of Gaussians. It is straightforward to show that fi is marginally modeled as a Dirichlet process
mixture of mixture of normals, as in Lo (1984) and Escobar and West (1995). It is well known
that such a model is highly flexible. We avoid using Pi directly as the distribution of the latent
factor scores within procedure i, since that would assume that the factor scores follow a discrete
distribution. It seems more biologically realistic to allow a continuum of patient morbidity, while
allowing patients with similar but not identical morbidity to be clustered. This is accomplished by
the proposed model in that patients allocated to the same mixture component will be clustered. As
mentioned above, we are more interested in clustering and ranking of the medical procedures instead
of the patients. Because K(·; γ) is stochastically increasing in γ, the stochastic ordering restriction on the Pi's is maintained. Note that two procedures i and i′ having Pi = Pi′, which
is allowed by the proposed prior, will also have fi = fi′ and hence have the same morbidity density.
In addition, fi ≺ fi′ (the distribution of patient morbidity under procedure i is stochastically less
than that under procedure i′) if and only if Pi ≺ Pi′ . Hence, the clustering and ranking properties
of the prior for {Pi} proposed above extend directly to the continuous latent factor model in (7).
To complete a Bayesian specification of the SO-LCM model in (7), we choose priors as follows. The intercept vector is assigned a normal prior, µt ∼ N(µ0, σ0²) for t = 1, . . . , p, and the factor loadings are assigned robust truncated Cauchy priors by letting λt ∼ N+(0, τ) for t = 1, . . . , p with τ ∼ Inv-Gamma(1/2, 1/2). We use a common precision τ to induce dependent shrinkage across the loadings. The multiplicative terms in the variance model, {ci} and {dt}, are assigned gamma priors. Elicitation of the different hyperparameters in these priors is considered later.
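The truncated Cauchy construction for the loadings can be simulated directly: with τ ∼ Inv-Gamma(1/2, 1/2), 1/τ is chi-square with one degree of freedom, so λ = |z|√τ is marginally a standard half-Cauchy. A minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(5)

def sample_loadings(n):
    """Draw lambda ~ N+(0, tau) with tau ~ Inv-Gamma(1/2, 1/2).
    Since 1/tau ~ chi-square(1), lambda is marginally standard half-Cauchy."""
    tau = 1.0 / rng.chisquare(df=1, size=n)
    return np.abs(rng.normal(0.0, np.sqrt(tau)))
```

A quick check is that the empirical median matches the half-Cauchy(0, 1) median of 1.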
3. POSTERIOR COMPUTATION
Due to the structure of the model described in Section 2.1, it is straightforward to adapt previously proposed algorithms for posterior computation in DPMs and logistic regression models. Because the structure of the base measure creates some difficulties in implementing Polya urn-based algorithms, we focus on the exact block Gibbs sampler (Yau et al. 2010) for posterior computation and update the polychotomous logistic regression weights using the method of Holmes and Held (2006). We focus on the simple model yij ∼ fi, with fi(η) = ∫ K(η; γ) dPi(γ), K(η; γ) = ∫ N(η; γ, ϕ) dQ(ϕ), Pi ∼ SO-LCM and Q ∼ DP(α0 Q0). The sampling steps are as follows,
1. Sample πk(xi) through the following steps. The polychotomous generalization of the logistic regression model is defined via

ζi ∼ M(1; π1(xi), . . . , πK(xi)),   πk(xi) = ψk exp(x′iβk) / ∑_{l=1}^K ψl exp(x′iβl),    (8)

where ζi is the procedure cluster indicator and M(1; ·) denotes the single-sample multinomial distribution. Defining β[j] = (β1, . . . , βj−1, βj+1, . . . , βK), we have

L(βj | ζ, β[j]) ∝ ∏_{i=1}^n ∏_{k=1}^K [πk(xi)]^{1(ζi=k)} ∝ ∏_{i=1}^n [χij]^{1(ζi=j)} [1 − χij]^{1(ζi≠j)},    (9)

where

χij = exp{x′iβj + log(ψj) − Cij} / [1 + exp{x′iβj + log(ψj) − Cij}] = exp(x̃′iβ̃j − Cij) / [1 + exp(x̃′iβ̃j − Cij)],

Cij = log[∑_{k≠j} exp{x′iβk + log(ψk)}] = log[∑_{k≠j} exp(x̃′iβ̃k)],    (10)

where x̃i = (x′i, 1)′ and β̃j = (β′j, log ψj)′. The prior for log(ψk) from (4) can be approximated in distribution as log(ψk) ≈ N(m, v). We use this approximation to obtain an efficient Metropolis independence chain proposal. The conditional likelihood L(β̃j | ζ, β̃[j]) has the form of a logistic regression on the class indicator 1(ζi = j), which allows us to use the algorithm of Holmes and Held (2006). Details are in appendix B.
2. Sample the procedure cluster indicators ζi, for i = 1, . . . , n, from a multinomial distribution
with probabilities
Pr(ζi = k | · · · ) ∝ πk(xi) ∏_{j=1}^{ni} ∑_{l=1}^L vl N(yij; θkl, ϕij),

and construct mk = ∑_{i=1}^n 1(ζi = k). For K, we first choose a reasonable upper bound and then monitor the maximum index of the occupied components. If all the MCMC samples have maximum indices several units below the upper bound, then the upper bound is sufficiently high, while otherwise the upper bound can be increased, with the analysis re-run.
3. The joint prior distribution of the group indicator ξij and a latent variable qij can be written
as
f(ξij, qij | v) = ∑_{l: vl > qij} δl(·) = ∑_{l=1}^∞ 1(qij < vl) δl(·).
Implement the Exact Block Gibbs sampler steps:
i. Sample qij ∼ Unif(0, v_{ξij}) for j = 1, . . . , ni, with vl = νl ∏_{s<l}(1 − νs).
ii. Sample the stick-breaking random variables

νl ∼ beta(1 + ∑_{k=1}^K nkl, α2 + ∑_{s=l+1}^L ∑_{k=1}^K nks),  l = 1, . . . , L,

where nkl is the number of observations assigned to atom l of distribution k, nkl = ∑_{i=1}^n ∑_{j=1}^{ni} 1(ζi = k, ξij = l), and L is the minimum value satisfying v1 + · · · + vL > 1 − min{qij}.
iii. Sample ξij for i = 1, . . . , n and j = 1, . . . , ni from the multinomial conditional with
Pr(ξij = l) ∝ 1(qij < vl)N(yij ; θζil, ϕij).
4. Sample γ*l from

p(γ*l | · · · ) ∝ [ ∏_{(i,j): ξij = l} N(yij; θ_{ζi l}, ϕij) ] P0(γ*l),

where θkl = w′k γ*l, wk = (1′k, 0′_{K−k})′ and γ*l = (γ*1l, . . . , γ*Kl)′ as defined in (3), and P0 is the base measure. If no observation is assigned to a given atom, the corresponding parameters are drawn directly from the prior distribution P0. Construct θkl through θkl = w′k γ*l.
5. ϕij is updated through the exact block Gibbs sampler, similarly to the steps above.
6. Use a random-walk Metropolis-Hastings step to update the concentration parameter α1.
7. Sample concentration parameter α2 with conjugate prior Gamma(a2, b2) directly from
(α2 | · · · ) ∼ Gamma(a2 + K(L − 1), b2 − K ∑_{l=1}^{L−1} log(1 − νl)).
We should note that the accuracy of the truncation depends on the values of α1 and α2. Thus
the hyperparameters (a1, b1) and (a2, b2) should be chosen to give little prior probability to
values of α1 and α2 larger than those used to calculate the truncation level.
Note that this algorithm can be generalized easily to accommodate model (7), so the details are
omitted.
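Steps (i)-(iii) of the exact block Gibbs update can be sketched for the observations of a single group as follows. For simplicity, fresh atoms needed when the sticks are extended are drawn as sorted standard normals, a stand-in for P0 rather than the spike-slab base measure of Section 2.1, and all names here are illustrative.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)

def exact_block_gibbs_step(y, zeta_i, xi, nu, theta, alpha2, phi=1.0):
    """One sweep of steps (i)-(iii) for a single group i.
    y: (ni,) data; zeta_i: the group's cluster label; xi: (ni,) atom labels;
    nu: list of stick-breaking beta draws; theta: (K, L0) ordered atoms."""
    nu = list(nu)
    v = np.array(nu) * np.concatenate(([1.0], np.cumprod(1.0 - np.array(nu))[:-1]))
    # (i) slice variables q_ij ~ Unif(0, v_{xi_ij})
    q = rng.uniform(0.0, v[xi])
    # (ii) extend the sticks until v_1 + ... + v_L > 1 - min(q);
    #      remaining mass after l sticks is 1 - sum(v)
    while v.sum() <= 1.0 - q.min():
        nu.append(rng.beta(1.0, alpha2))
        v = np.append(v, nu[-1] * (1.0 - v.sum()))
        if theta.shape[1] < len(v):
            # stand-in atoms: sorted normals preserve the ordering across k
            theta = np.hstack([theta,
                               np.sort(rng.normal(size=(theta.shape[0], 1)), axis=0)])
    L = len(v)
    # (iii) resample atom labels from the slice-restricted multinomial
    for j in range(len(y)):
        w = (q[j] < v) * norm.pdf(y[j], theta[zeta_i, :L], np.sqrt(phi))
        xi[j] = rng.choice(L, p=w / w.sum())
    return xi, nu, theta
```

The slice variables truncate the infinite mixture to finitely many sticks per iteration, which is what makes the block update exact despite the infinite-dimensional prior.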
4. SIMULATION STUDY
This section contains two simulation studies: predictors are not considered in the first but are considered in the second. Model (7) is studied, and both simulations mimic the structure of the medical procedure data.
4.1 Without predictors
Data yij are generated according to (7), with one continuous response (p1 = 1) and six binary
responses (p2 = 6) for each of 100 patients (j = 1, . . . , 100) in each of 60 procedures (i = 1, . . . , 60).
Parameters for the data-generating model are µ = (0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4)′, Λ = 1_{p×1} and Σ = diag(0.5, 1, 1, 1, 1, 1, 1). The latent morbidity ηij is generated from one of four mixtures of Gaussian components outlined in Table 1, with the first fifteen procedures generated from mixture distribution T1, the second fifteen from T2, the third fifteen from T3 and the last fifteen from T4, where T1 ≺ T2 ≺ T3 ≺ T4, so that the generated latent morbidity distributions are stochastically ordered. As shown in Figure 1 and Table 1, the distributions share components with each other and the ordering of the distributions is subtle.
[Figure 1 about here.]
To obtain an initial clustering of the medical procedures using standard methods, we first averaged the severity data for the different patients having each procedure to obtain ȳi = (1/ni) ∑_{j=1}^{ni} yij as a p = 7 dimensional summary of severity for procedure i. We then applied model-based clustering (Fraley and Raftery 2002; Fraley, Raftery and Wehrens 2005) to the data {ȳ1, . . . , ȳn} using the R functions available in the package described in Fraley et al. (2005). These approaches rely on
fitting of finite mixture models with the EM algorithm, with the model fit for a variety of choices
of the number of mixture components, which also corresponds to the number of clusters. The BIC
is used to select the optimal number of clusters. Figure 2 plots the BIC for the simulated data against the number of clusters for four different options for the cluster shapes. In this case, the best model according to BIC is the EEI model (equal size and shape) with six clusters. Note that this approach
does not utilize the patient-specific data and instead clusters based on the mean severity measure
across patients, while the proposed approach should have advantages in clustering procedures based
on the entire distribution across patients.
[Figure 2 about here.]
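The model-based clustering above uses R's mclust-style functions; an analogous sketch in Python with scikit-learn's GaussianMixture illustrates BIC-based selection of the number of clusters. Note that scikit-learn's BIC is to be minimized, the opposite sign convention to mclust, and the simulated two-cluster data below are illustrative only.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(3)
# Two well-separated spherical clusters (illustrative data, not the paper's)
X = np.vstack([rng.normal(m, 0.3, size=(50, 2)) for m in (0.0, 3.0)])

# Fit finite Gaussian mixtures over a range of K and select by BIC
bics = {k: GaussianMixture(n_components=k, covariance_type="diag",
                           n_init=3, random_state=0).fit(X).bic(X)
        for k in range(1, 6)}
best_k = min(bics, key=bics.get)
```

Here BIC should recover the true number of components, mirroring the role it plays in the EEI model selection reported above.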
For the SO-LCM estimation, the parameters α1 and α2 are fixed at 1 and a normal inverse-gamma prior distribution, NIG(0, 0.1, 2, 3), is chosen for the baseline measure parameters m0 and s0² described in (3), implying that E(m0 | s0²) = 0, V(m0 | s0²) = 10 s0², E(s0²) = 1, and V(s0²) = 3. Additionally, we assign priors w0 ∼ beta(1, 1) and κ ∼ Gamma(1/2, 1/2), representing a robust and flexible default prior for the base measure P0. Without predictors, the prior for the cluster-specific allocation probabilities reduces to (π1, . . . , πK)′ ∼ Dir(α1/K, . . . , α1/K). Posterior samples under this SO-LCM prior
are obtained through the algorithm described in Section 3 with prespecified truncation bounds
K = 20. This truncation tends to be accurate for α1 ≤ 1 and α2 ≤ 1, where such values of α1 and
α2 favor a small number of mixture components. In this particular application, mixture components
close to the upper bound are not occupied in any of the MCMC samples after the burn-in period.
A burn-in period of 20,000 iterations was found to be sufficient for the parameters to converge. All results are based on 5,000 samples obtained after the burn-in period.
For each pair of distributions Pi and Pi′ , (i < i′), the probability Pr(Pi = Pi′) was calculated
as the proportion of posterior samples for which Pi and Pi′ are assigned to the same cluster; and
Pr(Pi ≺ Pi′) is calculated as the proportion of posterior samples for which Pi is assigned to a cluster
with stochastically less morbidity than Pi′. Results are shown in Figure 3, where Figure 3(a) is the ranking plot, with the (i, i′)th entry of the lower triangular matrix giving the probability that Pi ≺ Pi′, and Figure 3(b) is the clustering plot, with the (i, i′)th entry giving the probability that Pi = Pi′. Figure 3 illustrates that there is not enough information in the data to differentiate the first thirty procedures, which is not surprising given the very subtle differences between T1 and T2 shown in Figure 1. However, the true rankings and clusterings of the medical procedures are otherwise accurately reflected in the results. The estimated density of T1 is shown in Figure 4(a).
For comparison, this density is also estimated under a DPM prior with the same base measure
and precision parameter α2 = 1 (in Figure 4(b)). The estimate obtained using the SO-LCM prior
distribution appears to capture both the small and large modes more accurately than the DPM
alternative.
[Figure 3 about here.]
[Figure 4 about here.]
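The posterior summaries Pr(Pi = Pi′) and Pr(Pi ≺ Pi′) described above can be computed from MCMC samples of the cluster labels, using the fact that a smaller cluster index corresponds to a stochastically smaller distribution; a minimal sketch:

```python
import numpy as np

def pairwise_summaries(zeta_samples):
    """From an (S, n) array of sampled cluster labels, return the n x n
    matrices of posterior co-clustering proportions Pr(P_i = P_i') and
    ordering proportions Pr(P_i < P_i'), where label k < k' means cluster k
    is stochastically smaller."""
    Z = np.asarray(zeta_samples)
    eq = (Z[:, :, None] == Z[:, None, :]).mean(axis=0)
    lt = (Z[:, :, None] < Z[:, None, :]).mean(axis=0)
    return eq, lt
```

These matrices are exactly the quantities displayed in the clustering and ranking plots.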
4.2 With predictors
Potentially, the incorporation of predictors may improve the ability to detect subtle differences in the distributions of patient morbidity between procedures. To assess this, we repeated the simulation
study of Section 4.1 but modified the model to allow predictor-dependent mixture weights. Mim-
icking the real data, we assumed there was a single predictor corresponding to an initial physician
severity score obtained from their clinical experience and not from examination of the current data.
In particular, predictors for the first fifteen procedures are chosen uniformly from (-4, -3.5), the
second fifteen procedures chosen uniformly from (-0.6, -0.5), the third fifteen procedures chosen
uniformly from (0.2, 0.3) and the last fifteen procedures chosen uniformly from (3.5, 4). Data are then generated from the assumed model exactly as described in Section 4.1, but assuming logistic regression model (4) for the weights with ψ = (0.5, 1, 1.5, 2)′ and β = (−1.5, −0.5, 0.5, 1.5)′.
In the analysis, priors are specified as described in Section 4.1 and model (4), and we additionally choose a N(0, 10I) prior for β to complete the specification. Posterior computation was performed
using the MCMC algorithm shown in Section 3. Apparent convergence was rapid, and 20,000
iterations were judged to be sufficient. Prespecified truncation bounds K = 20 are still adequate.
Results are based on 5,000 samples after a burn-in period of 20,000 iterations.
Figure 5(a) depicts the ranking performance and Figure 5(b) depicts the clustering plot. Both ranking and clustering performance are improved compared with Section 4.1, in that the first thirty procedures are ranked consistently with the true order and are clustered correctly.
[Figure 5 about here.]
5. MEDICAL PROCEDURE APPLICATION
The analysis of congenital heart surgery outcomes data is challenging because of the wide array
of defects encountered and the large number of associated surgical procedures. Certain diagnoses
occur relatively frequently, but variations on the typical anatomy are commonplace. To overcome
this difficulty, researchers have proposed methods to allow procedures with similar mortality and
morbidity risk to be grouped together for analysis. Two widely used methods are the Risk Adjustment for Congenital Heart Surgery (RACHS-1) methodology (Jenkins 2004) and the Aristotle Basic Complexity Levels. RACHS-1 groups 143 types of congenital heart surgery procedures into
6 categories based on their estimated risk of in-hospital mortality. Similarly, the Aristotle method
groups 143 types of procedures into 4 categories (levels) based on their potential for mortality,
morbidity, and technical difficulty. For both RACHS-1 and Aristotle, procedure categories were
determined by panels of subject matter experts without using a formal statistical framework. In
this section, our goal is to show that the SO-LCM methodology provides a useful statistical frame-
work for grouping procedures into categories of risk and for choosing the number of categories.
More formally, we sought to identify clusters of congenital procedures with similar distributions of
post-procedural morbidity.
Data for this analysis were obtained from the Society of Thoracic Surgeons (STS) Congenital
Heart Surgery database. The study population consisted of N=79,635 patients who underwent
one of 143 types of congenital cardiovascular procedures at an STS-participating center during
the years 2002-2008. Post-operative morbidity was regarded as a patient-level unobserved latent
variable. Indicators of morbidity included a single continuous variable, post-operative length of stay (PLOS), modeled as y1 = log(1 + PLOS), and 6 binary (yes/no) variables: y2 = renal failure, y3 = stroke, y4 = heart block, y5 = requirement for extracorporeal membrane oxygenation or a ventricular assist device, y6 = phrenic nerve injury, and y7 = in-hospital mortality. Responses from
different patients were assumed to be independent. Multiple responses from the same patient were
conditionally independent given the latent morbidity variable. The joint model for all 7 endpoints
is:
yij1|ηij ∼ N[µ1 + λ1ηij , σ2i1] (PLOS)
yij2|ηij ∼ Bernoulli[Φ(µ2 + λ2ηij)] (Stroke)
yij3|ηij ∼ Bernoulli[Φ(µ3 + λ3ηij)] (Renal failure)
yij4|ηij ∼ Bernoulli[Φ(µ4 + λ4ηij)] (Heart block)
yij5|ηij ∼ Bernoulli[Φ(µ5 + λ5ηij)] (ECMO/VAD)
yij6|ηij ∼ Bernoulli[Φ(µ6 + λ6ηij)] (Phrenic nerve injury)
yij7|ηij ∼ Bernoulli[Φ(µ7 + λ7ηij)] (In-hospital mortality)
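As an illustration of this latent factor structure, the following Python sketch simulates the seven endpoints given one latent morbidity value per patient. The intercepts, loadings, and variance below are arbitrary placeholders for the purpose of the sketch, not estimates from the STS data.

```python
import math
import numpy as np

rng = np.random.default_rng(0)

def norm_cdf(z):
    """Standard normal CDF, elementwise (avoids a SciPy dependency)."""
    return np.vectorize(lambda t: 0.5 * (1.0 + math.erf(t / math.sqrt(2.0))))(z)

def simulate_outcomes(eta, mu, lam, sigma2_plos=1.0):
    """Simulate the 7 endpoints given latent morbidity eta (one value per patient).

    mu, lam: length-7 arrays of intercepts and factor loadings; column 0 is the
    continuous PLOS endpoint, columns 1-6 are probit binary endpoints.
    """
    n = len(eta)
    y = np.empty((n, 7))
    # continuous endpoint: y1 = log(1 + PLOS) is normal given eta
    y[:, 0] = mu[0] + lam[0] * eta + rng.normal(0.0, math.sqrt(sigma2_plos), n)
    # binary endpoints: success probability Phi(mu_t + lam_t * eta)
    for t in range(1, 7):
        y[:, t] = rng.binomial(1, norm_cdf(mu[t] + lam[t] * eta))
    return y

eta = rng.normal(size=500)                          # stand-in latent scores
mu = np.array([0.6, -1.5, -1.5, -1.8, -1.6, -1.6, -0.8])   # illustrative only
lam = np.full(7, 0.5)                               # illustrative only
y = simulate_outcomes(eta, mu, lam)
```

A higher eta raises every event probability through the shared loadings, which is what makes the endpoints internally consistent indicators of morbidity.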
Let fi denote the density of ηij among patients undergoing the ith type of procedure, i.e. ηij ∼ fi.
Our goal is to estimate the densities f1, f2, . . . , f143 nonparametrically under the assumption that
they have an unknown stochastic ordering. This assumption facilitates ranking of the procedures
and is less restrictive than alternative parametric models which assume a Gaussian distribution
and procedure-specific location parameters. An important consideration for the analysis is that
the procedure-specific sample sizes are small and highly variable (median = 50; range 10 to 2000).
To account for these low sample sizes, we propose a method of analysis that permits borrowing
of information across procedures and incorporates external prior information. A potentially useful
auxiliary covariate is each procedure’s Aristotle Basic Complexity (ABC) score. The ABC score is
a number ranging from 0.5 to 15 that represents the average subjective ranking by an international
panel of congenital heart surgeons. Large ABC scores imply that the procedure is considered to be
a difficult operation with high potential for mortality and morbidity.
The procedure-specific morbidity distributions are estimated nonparametrically under the SO-
LCM prior distribution,
fi(η) = ∫ K(η; γ) dPi(γ),   K(η; γ) = ∫ N(η; γ, φ) dQ(φ),

Pi ind∼ Σ_{k=1}^{K} πk(xi) δ_{P∗k},   πk(x) = ψk exp(x′βk) / Σ_{l=1}^{K} ψl exp(x′βl),

{P∗1, P∗2, . . . , P∗K} ∼ rDPP(α2 P0),   Q ∼ DP(α Q0).
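The covariate-dependent weights πk(x) above are a generalized-logit transform of ψ and β, and can be computed stably in a few lines. This is a minimal Python sketch, not the paper's implementation; the ψ and β values below echo the simulation of Section 4.2 purely for illustration.

```python
import numpy as np

def class_weights(x, psi, beta):
    """Compute pi_k(x) = psi_k exp(x' beta_k) / sum_l psi_l exp(x' beta_l).

    x: (d,) covariates; psi: (K,) positive multipliers; beta: (K, d) coefficients.
    Subtracting the max logit before exponentiating avoids overflow.
    """
    logits = np.log(psi) + beta @ x
    logits -= logits.max()
    w = np.exp(logits)
    return w / w.sum()

psi = np.array([0.5, 1.0, 1.5, 2.0])
beta = np.array([[-1.5], [-0.5], [0.5], [1.5]])
pi = class_weights(np.array([0.3]), psi, beta)   # sums to 1 by construction
```

Increasing the covariate shifts mass toward classes with larger βk, which is how predictors impact the allocation to latent classes.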
Hyper-priors are ψk ∼ Gamma(α1/K, 1), α1 ∼ Gamma(1, 6/ln 143), βk ind∼ N(0, 10) and α2 ∼ Gamma(1, 1). The prior for α1 is chosen based on expert elicitation to favor 6 or fewer clusters of procedures. The prespecified truncation bound K = 20 proves sufficient, since mixture components close to the upper bound are not occupied after the algorithm converges. The baseline distribution P0 is constructed as in (3) with w0 ∼ beta(1, 1), κ ∼ Gamma(1/2, 1/2) and (m0, s0²) ∼ NIG(0, 0.1, 2, 3). Priors for the outcomes model are µt ind∼ N(0, 10) and λt ind∼ N+(0, τ), t = 1, 2, . . . , p, where τ ∼ InvGamma(0.5, 0.5). Estimates are calculated from 50,000 MCMC iterations after a burn-in period of 20,000 iterations. We took a long burn-in and collection interval to be conservative, but similar results are obtained with shorter chains.
The estimated association between latent morbidity and risk of complications is depicted in
Figure 6. To facilitate interpretation, morbidity is plotted on the scale of percentiles of the marginal
distribution of η. The 100p-th percentile is defined as F−1(p), where F−1 is the inverse of the function

F(x) = Pr(η < x) = Σ_{i=1}^{143} (ni/n) Fi(x),   n = n1 + n2 + · · · + n143.

Details are in Appendix C. For each endpoint, the estimated risk for patients in the 90th percentile of the morbidity distribution exceeds that for patients in the 10th percentile by 10% to 40%. These results suggest that the selected outcomes are internally consistent and valid indicators of the concept of morbidity.
[Figure 6 about here.]
Figures 7 and 8 summarize posterior inferences for procedures (n = 66) having at least 200
occurrences in the database. For the ith procedure, let Ri = Σ_{h=1}^{66} I(Fi ⪯ Fh) denote the number
of procedures having a morbidity distribution that is stochastically no smaller than Fi. In each
figure, procedures are sorted in ascending order based on the posterior mean E[Ri]. In Figure 7, the
posterior means are plotted on the vertical axis along with 95% probability intervals. The relatively
wide probability intervals indicate that there is uncertainty regarding the true rank ordering of the
procedure-specific morbidity distributions. Nonetheless, several procedures have narrow intervals
and are clearly distinguished as either low (e.g. atrial septal defect repair) or high (e.g. Norwood
operation) latent morbidity. Ranking and clustering performances are depicted in Figure 8(a) and
8(b).
The 143 procedures can be grouped into four, five or six homogeneous clusters according to the posterior clustering probabilities shown in Table 2. The data assign high (99%) posterior probability to fewer than 8 clusters, with 32% probability at the posterior mode of 5.
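Table 2-style probabilities can be read off the MCMC output by counting the occupied clusters at each iteration. The sketch below uses a tiny hypothetical allocation array (one row per iteration, one column per procedure) to illustrate the computation; it is not the actual posterior sample.

```python
import numpy as np

def cluster_count_probs(allocations):
    """Posterior probabilities of the number of occupied clusters.

    allocations: (n_iter, n_procedures) integer array, entry [s, i] giving the
    cluster label of procedure i at MCMC iteration s.
    Returns a dict mapping number-of-clusters -> Monte Carlo probability.
    """
    counts = [len(np.unique(row)) for row in allocations]
    vals, freqs = np.unique(counts, return_counts=True)
    return dict(zip(vals.tolist(), (freqs / len(counts)).tolist()))

alloc = np.array([[0, 0, 1, 2],
                  [0, 1, 1, 2],
                  [0, 0, 0, 0],
                  [0, 1, 2, 3]])
probs = cluster_count_probs(alloc)   # {1: 0.25, 3: 0.5, 4: 0.25}
```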
We also propose a way to obtain an optimal point estimate of the ranked clustering, as follows. Laird and Louis (1989) represented the ranks by
Rk(θ) = rank(θk) = Σ_{j=1}^{K} 1{θk ≥ θj}, (11)
with the smallest θ having rank 1 and the largest having rank K. Let p denote the true ranking and p̂ its estimate, with Ri the rank variable for object i and R̂i the estimated rank (we drop the dependence on θ for notational convenience). To find the optimal ranked clustering, we define the following loss function L(p, p̂):
L(p, p̂) = Σ_{(i,j)∈M} ( 2 · 1[Ri < Rj, R̂i > R̂j] + 2 · 1[Ri > Rj, R̂i < R̂j]
+ 1[Ri = Rj, R̂i ≠ R̂j] + 1[Ri ≠ Rj, R̂i = R̂j] ). (12)
Pairs that are tied in one ranking but not in the other are penalized half as much as pairs ordered in opposite directions. The posterior expected loss is then
E[L(p, p̂) | y] = Σ_{(i,j)∈M} ( 2 · 1(R̂i < R̂j) Pr{Ri > Rj | y} + 2 · 1(R̂i > R̂j) Pr{Ri < Rj | y}
+ 1(R̂i = R̂j) Pr{Ri ≠ Rj | y} + 1(R̂i ≠ R̂j) Pr{Ri = Rj | y} ),
where Pr{Ri > Rj | y}, Pr{Ri < Rj | y}, Pr{Ri = Rj | y} and Pr{Ri ≠ Rj | y} are estimated from the MCMC output. The optimal ranked clustering is the one that achieves the smallest Bayes risk.
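The expected loss can be evaluated directly from the Monte Carlo pairwise probabilities. The following Python sketch (function and variable names are ours, not from the paper) scores a candidate ranked clustering against posterior draws of the rank variables, using the weights of loss (12): 2 for opposite orderings, 1 for tie/non-tie mismatches.

```python
import numpy as np
from itertools import combinations

def expected_loss(R_hat, R_draws):
    """Posterior expected loss of a candidate ranked clustering R_hat.

    R_hat: length-n array of estimated ranks (ties define clusters).
    R_draws: (S, n) array of posterior draws of the rank variables R_i.
    """
    total = 0.0
    n = R_draws.shape[1]
    for i, j in combinations(range(n), 2):
        p_gt = np.mean(R_draws[:, i] > R_draws[:, j])   # Pr{R_i > R_j | y}
        p_lt = np.mean(R_draws[:, i] < R_draws[:, j])   # Pr{R_i < R_j | y}
        p_eq = 1.0 - p_gt - p_lt                        # Pr{R_i = R_j | y}
        if R_hat[i] < R_hat[j]:
            total += 2.0 * p_gt + p_eq
        elif R_hat[i] > R_hat[j]:
            total += 2.0 * p_lt + p_eq
        else:                                           # estimate ties i and j
            total += 1.0 - p_eq
    return total

# toy check: two posterior draws that both give the order 1 < 2 < 3
draws = np.array([[1, 2, 3], [1, 2, 3]])
loss_true = expected_loss(np.array([1, 2, 3]), draws)   # 0.0
loss_swap = expected_loss(np.array([2, 1, 3]), draws)   # 2.0
```

Minimizing this quantity over candidate clusterings yields the optimal Bayes-risk estimate described above.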
To compare the performance of our method with that of the Aristotle levels (4 levels) in terms of Bayes risk, we set K = 4 so that exactly 4 clusters are obtained. The Bayes risk of the ranked clustering obtained from the Aristotle score is 7726.9; our optimal Bayesian ranked clustering achieves a smaller risk of 6533.1.
We compare groupings based on Aristotle to our final point estimate in Table 3. Several procedures predicted to be relatively low-risk by the Aristotle score were actually moderate- or high-risk according to our proposed methodology. Among the 24 procedures in Aristotle Category 1 (lowest risk), only 9 were assigned to the lowest risk category by our method. The correlation between the two ranked clusterings is 0.45. The correlation between the SO-LCM ranked clustering and log(1 + PLOS) is 0.86, whereas that between the Aristotle-level ranked clustering and log(1 + PLOS) is 0.58.
[Figure 7 about here.]
[Figure 8 about here.]
6. DISCUSSION
We have formulated a novel extension of the nested Dirichlet process (nDP) for a family of distributions that are a priori subject to a partial stochastic ordering, which allows us to simultaneously rank and cluster procedures. Procedures are clustered by their entire distributions rather than by particular features of them. Like the nDP, the SO-LCM also allows us to cluster subjects within procedures. It is also straightforward to embed the SO-LCM, as a model for stochastically ordered mixture distributions, within a larger hierarchical model.
Although inspired by the pioneering work on the nDP, this article makes several important contributions. First, we develop stochastically ordered priors that allow covariates to impact the allocation to clusters, and we apply them to nonparametrically estimate densities for multiple procedures subject to a stochastic ordering constraint. In addition, we can test hypotheses of equality between procedures against stochastically ordered alternatives. After examining some of the
theoretical properties of the model, we describe a computationally efficient implementation and
demonstrate the flexibility of the model through both a simulation study and an application where
the SO-LCM is used within a hierarchical model. Heat maps are also offered to summarize the
ranking and clustering structures generated by the model.
It is straightforward to make several generalizations of the SO-LCM. One natural generalization
is to include hyperparameters in the priors on the regression coefficients of the predictor-dependent probabilities H and on the baseline measure P0. For H, we can choose a heavy-tailed Cauchy prior or
a variable selection mixture prior with a mass at zero to shrink unimportant coefficients towards
zero. We note that, conditional on P0, the distinct atoms {P ∗k }∞k=1 are assumed to be independent.
Therefore, including hyperparameters in P0 allows us to parametrically borrow information across
the distinct distributions.
Another natural generalization of the SO-LCM is to replace the beta(1, α2) stick-breaking den-
sities with more general forms beta(ak, bk) as considered in Ishwaran and James (2001), with the
SO-LCM corresponding to the special case ak = 1, bk = α2. This yields richer classes of priors that encompass the SO-LCM as a particular case, though in some regression contexts they do not always outperform the DP model with beta(1, α2) in terms of the log predictive marginal likelihood.
We can also generalize the procedure to incorporate multivariate latent factors whose distribu-
tions are stochastically ordered. This generalization is inspired by the valuable suggestion from the
editors. In having univariate stochastic ordering on the latent variable level, we actually induce
multivariate stochastic ordering for the responses (albeit in a somewhat restrictive manner). To
directly place the stochastic ordering constraint on multivariate distributions, we can adopt multi-
variate monotone functions. In particular, in place of the scalar θkl we could have a vector θkl with
P0 chosen (e.g. multivariate truncated normal) so that the different elements are appropriately
ordered to satisfy the constraint. In the simple ordering case, we could simply apply (3) independently to each element of the θkl vector instead of just to the scalar θkl. We could even have different
orders for different variables and could have some variables with no ordering.
7. APPENDIX
Clustering Probability Under (2), the probability that Pi = Pi′, so that groups i and i′ are allocated to the same cluster, is

P(Pi = Pi′) = E( Σ_{k=1}^{K} πk² ) = Σ_{k=1}^{K} [ Var(πk) + E(πk)² ]
= K [ (α1/K)(α1 − α1/K) / (α1²(α1 + 1)) + (1/K)² ]
= (α1 − α1/K) / (α1(α1 + 1)) + 1/K,

and as K goes to infinity,

lim_{K→∞} [ (α1 − α1/K) / (α1(α1 + 1)) + 1/K ] = 1 / (α1 + 1).
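A quick Monte Carlo check of this clustering probability, under symmetric Dirichlet(α1/K, . . . , α1/K) weights, can be sketched as follows; this is an illustrative verification, not part of the original analysis.

```python
import numpy as np

def cluster_prob_formula(alpha1, K):
    """Closed form: P(Pi = Pi') = (alpha1 - alpha1/K)/(alpha1*(alpha1 + 1)) + 1/K."""
    return (alpha1 - alpha1 / K) / (alpha1 * (alpha1 + 1.0)) + 1.0 / K

def cluster_prob_mc(alpha1, K, n_sims=200_000, seed=0):
    """Monte Carlo estimate of E[sum_k pi_k^2] with pi ~ Dirichlet(alpha1/K, ..., alpha1/K)."""
    rng = np.random.default_rng(seed)
    pis = rng.dirichlet(np.full(K, alpha1 / K), size=n_sims)
    return float((pis ** 2).sum(axis=1).mean())
```

With α1 = 1 and K = 20 the closed form gives 0.525, and for very large K the probability approaches 1/(α1 + 1), matching the limit above.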
MCMC supplement We introduce a set of variables dij, i = 1, . . . , p, j = 1, . . . , K, and define yij = 1 if the ith observation belongs to class j, j ∈ {1, . . . , K}, and yij = 0 otherwise. An equivalent representation of equation (8) is

yij = 1 if sij ≥ 0, and yij = 0 otherwise,
sij = xi′βj − Cij + εij,   εij ∼ N(0, dij),
dij = (2eij)²,   eij ∼ KS,   βj ∼ π(βj),   λj ∼ Gamma(α1/K, 1). (A.2)
Parameters of equation (A.2) are updated through the following steps.

1. Sampling βj. In the case of a normal prior on βj, π(βj) = N(b0, v0), the full conditional distribution of βj given s·j and d·j is still normal,

βj | s·j, d·j ∼ MN(Bj, Vj),
Bj = Vj(v0^{-1} b0 + x′Wj(s·j + C·j)),   Vj = (v0^{-1} + x′Wj x)^{-1},   Wj = diag(d1j^{-1}, . . . , dnj^{-1}),

where x = (x1, . . . , xp)′, s·j = (s1j, . . . , snj)′, C·j = (C1j, . . . , Cnj)′ and d·j = (d1j, . . . , dnj)′.
2. Following the advice of Holmes and Held (2006), we update {s·j, d·j} jointly given βj,

π(s·j, d·j | βj, y·j) = π(s·j | βj, y·j) π(d·j | s·j, βj),

followed by an update of βj | s·j, d·j. The marginal densities of the sij are independent truncated logistic distributions,

sij | βj, yij ∝ Logistic(xi′βj + log(λj) − Cij, 1) I(sij > 0) if yij = 1,
sij | βj, yij ∝ Logistic(xi′βj + log(λj) − Cij, 1) I(sij ≤ 0) otherwise,

where Logistic(a, b) denotes the density function of the logistic distribution with mean a and scale parameter b.
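Draws from truncated logistic distributions like these can be generated by inverse-CDF sampling restricted to the admissible interval. A minimal Python sketch (helper name ours, standing in for the full conditional with mean xi′βj + log(λj) − Cij):

```python
import numpy as np

def rtrunc_logistic(mean, positive, rng):
    """Draw s ~ Logistic(mean, scale 1) truncated to s > 0 (or to s <= 0).

    Inverse-CDF sampling: the CDF is F(s) = 1/(1 + exp(-(s - mean))), with
    inverse F^{-1}(u) = mean + log(u/(1 - u)); u is drawn uniformly on the
    sub-interval of (0, 1) that maps into the admissible region.
    """
    F0 = 1.0 / (1.0 + np.exp(mean))          # F(0), the mass below zero
    lo, hi = (F0, 1.0) if positive else (0.0, F0)
    u = rng.uniform(lo, hi)
    return mean + np.log(u / (1.0 - u))

rng = np.random.default_rng(1)
pos = [rtrunc_logistic(-2.0, True, rng) for _ in range(200)]   # all above 0
neg = [rtrunc_logistic(2.0, False, rng) for _ in range(200)]   # all at or below 0
```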
3. Sampling d·j through rejection sampling. As advised by Holmes and Held (2006), we use the generalized inverse Gaussian distribution g(dlj) = GIG(0.5, 1, r²), with r = |sij − xi′βj|, as the rejection sampling density. Following a draw from g(·), the sample is accepted with probability

α(dlj) = L(dlj) π(dlj) / (M g(dlj)),

where M ≥ sup_{dlj} L(dlj) π(dlj) / g(dlj), L(dlj) denotes the likelihood, L(dlj) ∝ dlj^{-1/2} exp(−0.5 r²/dlj), and π(dlj) is the prior,

π(dlj) = (1/4) dlj^{-1/2} KS( (1/2) dlj^{1/2} ),

where KS(·) denotes the Kolmogorov-Smirnov density.
Section 5 supplement Define πi = ni / Σi ni, the proportion of patients receiving procedure i. Then

F(x) = Pr(η ≤ x) = Σ_{i=1}^{p} πi Fi(x), (A.4)

where Fi(x) = Pr(η ≤ x | procedure = i) is the CDF for procedure i. Let F−1(u) denote the quantile function associated with F. We use the notation F(x|θ) and F−1(u|θ) to emphasize that these functions depend on unknown parameters (namely, the matrix of mass points and the vector of associated probabilities).
Take the first binary response, for example,

yij1 = µ1 + λ1ηij + ε1, where ε1 ∼ N(0, 1). (A.5)
So,
g(x) = Pr(yij1 > 0|ηij = x) = Φ(µ1 + λ1x).
Define the function
h(u) ≡ h(u | µ1, λ1, θ) = Φ(µ1 + λ1 F−1(u|θ)) = Σ_{i=1}^{p} πi Φ(µ1 + λ1 Fi−1(u|θ)) (A.6)
and let
h(u) = {h(0.01), h(0.02), . . . , h(0.99)} .
We would like to evaluate E[h(u)|data]. This can be done by evaluating (A.6) at each MCMC
iteration and then averaging across iterations.
The hardest part is computing F−1(u|θ) efficiently. One approach is to approximate F(x) by a piecewise linear function instead of a step function. The R function approx can then be used to evaluate F−1(u) at specified values of u. The approx function takes as arguments a vector x, a vector y, and an optional vector xout; it builds a piecewise linear function by interpolating between the y's, evaluates that function at the points xout, and returns the vector of associated y values. The trick used here is to reverse the roles of x and y: let x = F(η̃), y = η̃, and xout = c(0.01, 0.02, . . . , 0.99), where η̃ denotes a grid of values for η. To create the grid, we use values evenly spaced between the lowest and highest mass points that have any posterior probability.
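The same interpolation trick carries over to NumPy, where np.interp plays the role of R's approx. The mixture below uses made-up weights and mass-point locations purely for illustration:

```python
import math
import numpy as np

def norm_cdf(z):
    """Standard normal CDF, elementwise (avoids a SciPy dependency)."""
    return np.vectorize(lambda t: 0.5 * (1.0 + math.erf(t / math.sqrt(2.0))))(z)

# hypothetical two-component mixture; weights and locations are illustrative only
grid = np.linspace(-6.0, 6.0, 2001)                  # evenly spaced grid for eta
weights = np.array([0.3, 0.7])
means = np.array([-1.0, 1.5])
F = weights @ np.stack([norm_cdf(grid - m) for m in means])   # mixture CDF on grid

# invert F by swapping the roles of x and y in the interpolation
u = np.linspace(0.01, 0.99, 99)
F_inv = np.interp(u, F, grid)                        # quantiles F^{-1}(u)
```

Because F is monotone on the grid, interpolating with the coordinate roles reversed returns the quantile function directly, exactly as the approx trick above.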
REFERENCES
Dunson, D., and Peddada, S. (2008), “Bayesian nonparametric inference on stochastic ordering,”
Biometrika, 95(4), 859–874.
Escobar, M., and West, M. (1995), “Bayesian Density Estimation and Inference Using Mixtures,”
Journal of the American Statistical Association, 90(430), 577–588.
Fraley, C., and Raftery, A. E. (2002), “Model-based clustering, discriminant analysis, and density
estimation,” Journal of the American Statistical Association, 97(458), 611–631.
Fraley, C., Raftery, A. E., and Wehrens, R. (2005), “mclust: Model-based Cluster Analysis,” R
package version 2.1-11.
Holmes, C., and Held, L. (2006), “Bayesian auxiliary variable models for binary and multinomial
regression,” Bayesian Analysis, 1(1), 145–168.
Ishwaran, H., and James, L. (2001), “Gibbs sampling methods for stick-breaking priors,” Journal
of the American Statistical Association, 96(453), 161–173.
Ishwaran, H., and Zarepour, M. (2002), “Exact and approximate sum-representations for the Dirich-
let process,” The Canadian Journal of Statistics, 30, 269–283.
Jasra, A., Holmes, C. C., and Stephens, D. A. (2005), “Markov chain Monte Carlo methods and the
label switching problem in Bayesian mixture modelling,” Statistical Science, 20(1), 50–67.
Jenkins, K. (2004), “Risk adjustment for congenital heart surgery: the RACHS-1 method,” Semi-
nars in thoracic and cardiovascular surgery. Pediatric cardiac surgery annual, 7, 180–184.
Karabatsos, G., and Walker, S. (2007), “Bayesian nonparametric inference of stochastically ordered
distributions, with Polya trees and Bernstein polynomials,” Statistics and Probability Letters,
77, 907–913.
Laird, N. M., and Louis, T. A. (1989), “Empirical Bayes ranking methods,” Journal of Educational
Statistics, 14, 29–46.
Lo, A. (1984), “On a class of Bayesian nonparametric estimates: I. density estimates,” The Annals
of Statistics, 12(1), 351–357.
Rodriguez, A., Dunson, D., and Gelfand, A. (2008), “The Nested Dirichlet Process,” Journal of
the American Statistical Association, 103(483), 1131–1154.
Stephens, M. (2000), “Dealing with label switching in mixture models,” Journal of the Royal
Statistical Society, Ser. B, 62, 795–809.
Teh, Y., Jordan, M., Beal, M., and Blei, D. (2006), “Hierarchical Dirichlet processes,” Journal of
the American Statistical Association, 101(476), 1566–1581.
Yau, C., Papaspiliopoulos, O., Roberts, G., and Holmes, C. (2010), “Nonparametric hidden Markov
models with application to the analysis of copy-number-variation in mammalian genomes,”
Journal of the Royal Statistical Society: Series B (to appear).
Table 1: True distribution used in simulation study 4.1
Comp1 Comp2 Comp3 Comp4
Dist w (µ, σ2) w (µ, σ2) w (µ, σ2) w (µ, σ2)
T1 0.75 (-3,0.5) 0.25 (0,1)
T2 0.75 (-2.5,0.5) 0.25 (0,1)
T3 0.2 (1,0.5) 0.5 (1.5,1) 0.3 (2,1)
T4 0.2 (2,1) 0.3 (2.5,0.5) 0.4 (3,1) 0.1 (3.5,0.5)
Table 2: Posterior probability of the number of clusters (medical procedure application)
Number of clusters Posterior Probability
1 0.01
2 0.02
3 0.12
4 0.21
5 0.32
6 0.25
7 0.06
8 0.01
Table 3: Clustering comparison between Aristotle Level and SO-LCM
SO-LCM \ Aristotle Level 1 Level 2 Level 3 Level 4
Level 1 14 12 10 1
Level 2 4 17 24 3
Level 3 4 7 12 16
Level 4 1 5 4 11
List of Figures

1. True distributions used in simulation study 4.1
2. Frequentist model-based clustering results implemented via the EM algorithm using the Mclust function in R in simulation study 4.1, with different symbols representing different model assumptions ("EII": spherical, equal volume; "EEI": spherical, equal volume and shape; "EEV": spherical, equal volume but varying orientation; "EEE": ellipsoidal, equal volume, shape, and orientation, as in the R package Mclust)
3. Posterior probability for ranking and clustering in the study of Section 4.1, with entry (i, j) in (a) being the lower triangular matrix identifying the probability of Pi ≺ Pi′ and in (b) the probability of Pi = Pi′
4. True (solid lines) and estimated (dashed lines) densities from SO-LCM and DPM for distribution T1
5. Posterior probability for ranking and clustering in the study of Section 4.2
6. CDF plots of the weighted latent morbidity score for the different responses
7. Procedures sorted by posterior mean, with 95% probability intervals
8. Ranking and clustering for the 66 selected procedures with at least 200 observations
Figure 1: True distributions used in simulation study 4.1
[Figure: BIC versus number of components for the four Mclust model types.]
Figure 2: Frequentist model-based clustering results implemented via the EM algorithm using the Mclust function in R in simulation study 4.1, with different symbols representing different model assumptions. "EII": spherical, equal volume; "EEI": spherical, equal volume and shape; "EEV": spherical, equal volume but varying orientation; "EEE": ellipsoidal, equal volume, shape, and orientation, as in the R package Mclust.
Figure 3: Posterior probability for ranking and clustering in the study of Section 4.1, with entry (i, j) in (a) being the lower triangular matrix identifying the probability of Pi ≺ Pi′ and in (b) the probability of Pi = Pi′.
Figure 4: True (solid lines) and estimated (dashed lines) densities from SO-LCM and DPM for distribution T1
Figure 5: Posterior probability for ranking and clustering in the study of Section 4.2
[Panels: h(u) versus u ∈ (0, 1) for renal failure, stroke, heart block, ECMO/VAD, phrenic nerve injury, unplanned reoperation, and PLOS.]
Figure 6: CDF plots of the weighted latent morbidity score for the different responses
Figure 7: Procedures sorted by posterior mean, with 95% probability intervals
Figure 8: Ranking and clustering for the 66 selected procedures with at least 200 observations