Modeling Parasite Infection Dynamics when there Is Heterogeneity and Imperfect Detectability

10
Biometrics 69, 683–692 DOI: 10.1111/biom.12050 September 2013 Modeling Parasite Infection Dynamics when there Is Heterogeneity and Imperfect Detectability Na Cui, 1 Yuguo Chen, 1, * Dylan S. Small 2 1 Department of Statistics, University of Illinois at Urbana-Champaign, 725 S. Wright Street, Champaign, Illinois 61820, U.S.A. 2 Department of Statistics, The Wharton School, University of Pennsylvania, Philadelphia, Pennsylvania 19104, U.S.A. email: [email protected] Summary. Understanding the infection and recovery rate from parasitic infections is valuable for public health planning. Two challenges in modeling these rates are (1) infection status is only observed at discrete times even though infection and recovery take place in continuous time and (2) detectability of infection is imperfect. We address these issues through a Bayesian hierarchical model based on a random effects Weibull distribution. The model incorporates heterogeneity of the infection and recovery rate among individuals and allows for imperfect detectability. We estimate the model by a Markov chain Monte Carlo algorithm with data augmentation. We present simulation studies and an application to an infection study about the parasite Giardia lamblia among children in Kenya. Key words: Bayesian Hierarchical Model; Infection Rate; Markov Chain Monte Carlo; Panel Data; Recovery Rate; Two- state Process. 1. Introduction Parasitic infections in humans are often characterized by re- peated infections and clearance of parasites; examples include malaria and Giardia lamblia. To understand this dynamic be- havior of parasitic infections, repeated observations of the in- fection and recovery status in the same group of individuals are required. This type of longitudinal data is often mod- eled as a two-state stochastic process. For example, Bekessy, Molineaux, and Storey (1976) proposed a first-order Markov model to study the dynamics of malaria in Garki, Nigeria. Un- der the assumptions of constant transition rates and perfect detectability, the model can be easily fitted by the maximum likelihood method where the results are always based solely on raw counts of infected and uninfected observations. How- ever, these assumptions are often not satisfied in real world situations for three main reasons. First, the disease is often imperfectly detected due to imperfect diagnostic instruments and procedures. For example, false negatives may be common in the detection of Giardia lamblia by stool samples because even if a person harbors Giardia lamblia, these parasites may not be excreted in every stool sample (Nagelkerke, Chunge, and Kinoti 1990). Second, the pattern of transitions may not correspond to the first-order Markov model. For example, it is possible that an individual may have high immunity shortly after the clearance of an infection but the immunity wanes over time. Third, there is evidence that individuals are het- erogeneous in their infection probabilities and dynamics for many parasitic diseases (Woolhouse et al. 1997). The assump- tion of homogeneous infection rates without identifying those individuals who are frequently infected from others will intro- duce biased estimates of the transition rates. Such realistic situations complicate the study of malaria dynamics. To address the above issues, many Markov and semi- Markov approaches and corresponding solutions have been proposed. Nagelkerke et al. (1990) generalized Bekessy et al.’s model to the situation with imperfect detectability, and used numerical maximization of a partial likelihood to estimate the transition rates and rate of detectability. Ng and Cook (1997) introduced a mixed continuous-time two-state pro- cess that accommodates the heterogeneity among individuals by the bivariate log-normal distribution. They estimated the model by maximizing the approximated likelihood through numerical methods. Cook (1999) adopted the exponential survival function with random effects to deal with hetero- geneity among individuals. Smith and Vounatsou (2003) and Rosychuk and Islam (2009) developed a hidden two-state Markov model with imperfect detectability and designed a Markov chain Monte Carlo (MCMC) algorithm to estimate the parameters. Crespi, Cumberland, and Blower (2005) de- veloped Markov and semi-Markov models describing recur- rence and time-inhomogeneous transition rates. However, none of the above methods considers simultane- ously non-homogeneous transition rates, imperfect detectabil- ity and heterogeneity between individuals. This is partly due to the numerical challenges in obtaining the marginal like- lihood function. Here, we propose a Bayesian hierarchical model based on a random effects Weibull model to explicitly address these difficulties. The model assumes that the true parasite dynamics is a continuous two-state stochastic process which is hidden due to the discrete observations and imper- fect detection. Given the latent process, the observed data is assumed to be conditionally independent of each other. We further model the latent process as a non-Markov continuous stochastic process based on a Weibull survival function where © 2013, The International Biometric Society 683

Transcript of Modeling Parasite Infection Dynamics when there Is Heterogeneity and Imperfect Detectability

Page 1: Modeling Parasite Infection Dynamics when there Is Heterogeneity and Imperfect Detectability

Biometrics 69, 683–692 DOI: 10.1111/biom.12050September 2013

Modeling Parasite Infection Dynamics when there Is Heterogeneityand Imperfect Detectability

Na Cui,1 Yuguo Chen,1,* Dylan S. Small2

1Department of Statistics, University of Illinois at Urbana-Champaign, 725 S. Wright Street, Champaign,Illinois 61820, U.S.A.

2Department of Statistics, The Wharton School, University of Pennsylvania, Philadelphia,Pennsylvania 19104, U.S.A.∗email: [email protected]

Summary. Understanding the infection and recovery rate from parasitic infections is valuable for public health planning.Two challenges in modeling these rates are (1) infection status is only observed at discrete times even though infection andrecovery take place in continuous time and (2) detectability of infection is imperfect. We address these issues through aBayesian hierarchical model based on a random effects Weibull distribution. The model incorporates heterogeneity of theinfection and recovery rate among individuals and allows for imperfect detectability. We estimate the model by a Markovchain Monte Carlo algorithm with data augmentation. We present simulation studies and an application to an infection studyabout the parasite Giardia lamblia among children in Kenya.

Key words: Bayesian Hierarchical Model; Infection Rate; Markov Chain Monte Carlo; Panel Data; Recovery Rate; Two-state Process.

1. Introduction

Parasitic infections in humans are often characterized by re-peated infections and clearance of parasites; examples includemalaria and Giardia lamblia. To understand this dynamic be-havior of parasitic infections, repeated observations of the in-fection and recovery status in the same group of individualsare required. This type of longitudinal data is often mod-eled as a two-state stochastic process. For example, Bekessy,Molineaux, and Storey (1976) proposed a first-order Markovmodel to study the dynamics of malaria in Garki, Nigeria. Un-der the assumptions of constant transition rates and perfectdetectability, the model can be easily fitted by the maximumlikelihood method where the results are always based solelyon raw counts of infected and uninfected observations. How-ever, these assumptions are often not satisfied in real worldsituations for three main reasons. First, the disease is oftenimperfectly detected due to imperfect diagnostic instrumentsand procedures. For example, false negatives may be commonin the detection of Giardia lamblia by stool samples becauseeven if a person harbors Giardia lamblia, these parasites maynot be excreted in every stool sample (Nagelkerke, Chunge,and Kinoti 1990). Second, the pattern of transitions may notcorrespond to the first-order Markov model. For example, it ispossible that an individual may have high immunity shortlyafter the clearance of an infection but the immunity wanesover time. Third, there is evidence that individuals are het-erogeneous in their infection probabilities and dynamics formany parasitic diseases (Woolhouse et al. 1997). The assump-tion of homogeneous infection rates without identifying thoseindividuals who are frequently infected from others will intro-duce biased estimates of the transition rates. Such realisticsituations complicate the study of malaria dynamics.

To address the above issues, many Markov and semi-Markov approaches and corresponding solutions have beenproposed. Nagelkerke et al. (1990) generalized Bekessy et al.’smodel to the situation with imperfect detectability, and usednumerical maximization of a partial likelihood to estimatethe transition rates and rate of detectability. Ng and Cook(1997) introduced a mixed continuous-time two-state pro-cess that accommodates the heterogeneity among individualsby the bivariate log-normal distribution. They estimated themodel by maximizing the approximated likelihood throughnumerical methods. Cook (1999) adopted the exponentialsurvival function with random effects to deal with hetero-geneity among individuals. Smith and Vounatsou (2003) andRosychuk and Islam (2009) developed a hidden two-stateMarkov model with imperfect detectability and designed aMarkov chain Monte Carlo (MCMC) algorithm to estimatethe parameters. Crespi, Cumberland, and Blower (2005) de-veloped Markov and semi-Markov models describing recur-rence and time-inhomogeneous transition rates.

However, none of the above methods considers simultane-ously non-homogeneous transition rates, imperfect detectabil-ity and heterogeneity between individuals. This is partly dueto the numerical challenges in obtaining the marginal like-lihood function. Here, we propose a Bayesian hierarchicalmodel based on a random effects Weibull model to explicitlyaddress these difficulties. The model assumes that the trueparasite dynamics is a continuous two-state stochastic processwhich is hidden due to the discrete observations and imper-fect detection. Given the latent process, the observed data isassumed to be conditionally independent of each other. Wefurther model the latent process as a non-Markov continuousstochastic process based on a Weibull survival function where

© 2013, The International Biometric Society 683

Page 2: Modeling Parasite Infection Dynamics when there Is Heterogeneity and Imperfect Detectability

684 Biometrics, September 2013

1 2 3 4 5 6 7 8 9 10

0 0 1 0 1 1 0 0 1 1

0 0 0 0 1 1 0 0 1 1

(a)

(b)

(c)

(d)

Figure 1. Relation between observed values and the underlying latent process. (a) Time index from 1 to 10; (b) Thecontinuous latent process Zi(t) (1 ≤ t ≤ 10); (c) The value of Zi(t) at discrete time points Zi(j) (j = 1, . . . , 10); (d) Thecorresponding observed value at discrete time points, Xij (j = 1, . . . , 10).

the parameters of the Weibull have a random distributionacross individuals; the Weibull distribution allows for moreflexible transition rates than the constant transition ratesof an exponential distribution, and the randomness of theWeibull parameters across individuals allows for heterogene-ity among individuals. To estimate the parameters in such amodel, we propose an MCMC algorithm with data augmen-tation for full Bayesian inference, and construct a series of ef-ficient moves to explore the space of the latent process. In thesimulation study, we show the performance of the proposedmethod for models with different settings. We also apply theproposed method to the infection study of Giardia lamblia ofchildren in Kenya (Chunge, 1989).

2. Data

In this article, we focus on a longitudinal study ofGiardia lamblia described by Chunge (1989). Giardia lamblia,an intestinal parasite, is the most common cause of parasiticgastrointestinal disease and is especially prevalent in youngchildren. In the study, 84 children were chosen and the stoolsof children were examined for the presence of Giardia everyweek. The weekly testing result of each child was recordedas 1 if the parasite was found and 0 if the parasite was notfound. At the end of 44 consecutive weeks, 58 children with10–44 consecutive weekly registration were selected to formthe data set (Nagelkerke et al., 1990).

3. Statistical Model

The observed data is the weekly recorded status of the pres-ence or absence of the parasite for n individuals. Let Xij = 1if individual i is detected to be infected by the parasiteat the jth week and Xij = 0 if the parasite is not detected(i = 1, . . . , n; j = 1, . . . , ni). Although the data is recorded atdiscrete times, the actual infection and recovery take placein continuous time. Let Zi(t), 1 ≤ t ≤ ni, denote the hidden

continuous true presence or absence of the parasite for in-dividual i at time t. The underlying true status Zi(t) alsotakes value 1 or 0 depending on whether individual i at timet is infected or not. See Figure 1 for an illustration. Theprimary goal is to model the joint distribution of observeddata X= (Xij, i = 1, . . . , n; j = 1, . . . , ni) and latent processZ= (Zi(t), i = 1, . . . , n; 1 ≤ t ≤ ni). We will first address therelation between X and Z and then discuss the modeling forZ in the following sections.

3.1. Imperfect Detectability

Under the perfect detectability assumption, we have Xij =Zi(j) for all i = 1, . . . , n and j = 1, . . . , ni. If the diagnosticprocedure is imperfect, the observation Xij may be differentfrom the true value Zi(j), which leads to misclassification ofXij. The example shown in Figure 1 has a misclassificationat time 3. There are two types of misclassification: imper-fect sensitivity (Zi(j) = 1, Xij = 0) and imperfect specificity(Zi(j) = 0, Xij = 1).

To deal with the two-state data with misclassification, sev-eral approaches have been proposed. Nagelkerke et al. (1990)considered the imperfect sensitivity in the continuous-timeMarkov model. Bureau, Shiboski, and Hughes (2003) andRosychuk and Islam (2009) incorporated both types of mis-classification. A more detailed review dealing with misclassi-fication issues can be found in Ji and Fan (2009).

Following Nagelkerke et al. (1990) who also analyzed theGiardia lamblia infection in Kenya children, we assume themeasurement has perfect specificity and imperfect sensitiv-ity. See Baig et al. (2012) for discussion of detection ofGiardia lamblia. Moreover, the observed values are assumedto be conditionally independent of each other given the trueprocess. Let 1 − p denote the probability of false negatives,then the relation between Xij and Zi(j) can be parameterized

Page 3: Modeling Parasite Infection Dynamics when there Is Heterogeneity and Imperfect Detectability

Modeling Parasite Infection Dynamics when there Is Heterogeneity and Imperfect Detectability 685

as

P (Xij = 1|Zi(j) = 1) = p and P (Xij = 0|Zi(j) = 0) = 1.

3.2. Latent Process

The parasite dynamics are dependent on the modeling of thelatent process. Through the modeling, we are trying to learnthe transition rate h0i(t) from uninfected to infected and h1i(t)from infected to uninfected for individual i at time t which aredefined by

h0i(t) = lim�t→0

P(Zi(t + �t) = 1|Zi(t) = 0)

�t,

h1i(t) = lim�t→0

P(Zi(t + �t) = 0|Zi(t) = 1)

�t.

Discrete and continuous time Markov models have beenwidely used to model the latent process Zi(t) by assumingconstant hazard rates over time for all individuals. However,the Markov assumption with the same transition rate for allindividuals may not be appropriate for at least two reasons.First, the transition rate may depend on the duration thatthe individual has been in a state. For example, an individ-ual may have a high immunity shortly after clearance of aninfection. Second, the transition rates may vary considerablybetween individuals (Woolhouse et al., 1997). Also, individu-als have different levels of immunity. Some individuals couldhave frequent transitions; while other individuals have rela-tively inactive transitions. Therefore, to specify a more real-istic stochastic process, we consider the distribution for thetransition rates with the flexibility to be time dependent andto incorporate the heterogeneity.

Here we adopt the random effects Weibull distribution tomodel the duration time and include the between-individualvariation. Let hsi denote the hazard rate for individual i atstate s (s = 0, 1). We have

hsi(t) = usih0s (t), t ≥ 0, (1)

where usi denotes the random effect introduced by between-individual variation and h0

s (t) is the baseline Weibull hazardfunction which is defined as

h0s (t) = α−βs

s βs tβs−1, t ≥ 0.

Here αs > 0 is the scale parameter and βs > 0 is the shape pa-rameter of the distribution. This is a versatile family of distri-butions that can take on many shapes based on the value ofβs. For example, this distribution is degenerated to exponen-tial distribution when βs = 1. The corresponding probabilitydensity function of the hazard function in (1) is

fsi(t) = usiα−βss βs tβs−1e−usiα

−βss tβs

, t ≥ 0.

For the random effects u0i and u1i, gamma distributionwith mean 1 and an unknown variance is used quite oftenfor its conjugacy property. However, the possible correlationbetween u0i and u1i is not specified under this situation. We

model the random effects u0i and u1i jointly as a bivariatelog-normal distribution. Let Ui = (u0i, u1i)

T, i = 1, . . . , n. Wethen assume Ui’s are independent and identically distributedas:

log

(u0i

u1i

)∼ Normal

((−0.5σ20

−0.5σ21

),

(σ20 τσ0σ1

τσ0σ1 σ21

))

where σ0 and σ1 are non-negative unknown parameters andτ is the correlation coefficient between log(u0i) and log(u1i).The mean is chosen to make the expectation of the randomeffects equal to 1. Larger values of σ2

s , s = 0, 1, correspondto greater heterogeneity of individuals and positive or neg-ative correlation between the logarithm of random effects isdetermined by the sign of τ.

3.3. The Likelihood

Let ti = (ti,1, . . . , ti,mi) denote the times that individual i

changes its state in the time interval from 1 to ni, where mi de-notes the total number of transitions. Then the latent processZi(t), 1 ≤ t ≤ ni, can be represented by the transition timepoints ti and its initial state Zi(1). Assume the latent processZi started at −∞ and is in equilibrium at time 1. Then thedensity function of the first observed transition given Zi(1) is

p(ti,1|Zi(1) = s) = Psi(t > ti,1 − 1)

μsi

,

where μsi = ∫ ∞0

tfsi(t) dt denotes the expected duration timein state s for individual i (Cox and Isham, 1980). Let πi(s) =P(Zi(1) = s) = μsi/(μ0i + μ1i). Then the density function ofZi with Zi(1) = s is

p(Zi|α0, α1, β0, β1, usi)

= πi(s)Psi(t > ti,1 − 1)

μsi

[mi−1∏k=1

fa(s,k),i(ti,k+1 − ti,k)

]

× Pa(s,mi),i(t > ni − ti,mi),

where a(s, k) equals 1 − s if k is odd and s if k is even. Thelikelihood for the complete data (X,Z) is

p(X,Z|p, α0, α1, β0, β1,U)

×n∏

i=1

p(Zi|α0, α1, β0, β1,Ui)

ni∏j=1

p(Xij|Zi(j), p),

where U = (Ui, i = 1, . . . , n).

3.4. Prior and Posterior Distributions

Following the Bayesian framework, we specify the prior distri-bution for all the parameters � = (α0, α1, β0, β1, σ

21 , σ2

2 , τ, p)

Page 4: Modeling Parasite Infection Dynamics when there Is Heterogeneity and Imperfect Detectability

686 Biometrics, September 2013

as follows:

p ∼ Beta(γp

1 , γp

2 ),

αs ∼ Gamma(γα1 , γα

2), s = 0, 1,

βs ∼ Gamma(γβ1 , γ

β2), s = 0, 1,(

σ20 τσ0σ1

τσ0σ1 σ21

)∼ Inverse-Wishart(W, v),

where all the γ’s take positive values, W is a 2 × 2 positivedefinite matrix, and v > 1 is the degree of freedom. Then, theposterior distribution of interest is

p(�,Z,U|X)

∝ p(X|Z, �,U)p(Z|�,U)p(U|�)p(�)

∝ p(X|Z, p)p(Z|α0, α1, β0, β1,U)p(U|σ20 , σ2

1 , τ)p(�). (2)

4. Markov Chain Monte Carlo Algorithm

In this section we discuss the MCMC algorithm for estimat-ing the parameters in the hierarchical Bayesian model. Asthe posterior distribution in (2) is too difficult to sample di-rectly, we use the Metropolis-within-Gibbs algorithm to sam-ple from the conditional distribution of each variable. We di-vide the parameters �, the latent process Z, and the randomeffects U into five groups: L1 = (α0, α1, β0, β1), L2 = p, L3 =(σ2

0 , σ21 , τ), L4 = U, and L5 = Z. In Web Appendix A, we pro-

vide the posterior conditional distribution of the parametersin each group. A sketch of the Metropolis-within-Gibbs al-gorithm is given in Algorithm 1, where M is the number of

Markov chain iterations and L(t)i:j = (L

(t)i ,L

(t)i+1, · · · ,L

(t)j ).

Algorithm 1 Metropolis-Within-Gibbs Algorithm

Initialize all the parameters and variables:

L(0)1 , L

(0)2 ,L

(0)3 ,L

(0)4 , and L

(0)5 .

for t = 1 to M dofor i = 1 to 5 do

Given L(t)1:i−1, L

(t−1)i+1:5, generate a sample L

i from a pro-posal distribution qi(Li|Lt−1

i ).Let

L(t)i =

{L

i , with probability ri

L(t−1)i , otherwise,

where

ri = min

{p(L

i |L(t)1:i−1, L

(t−1)i+1:5)qi(L

t−1i |L

i )

p(Lt−1i |L(t)

1:i−1, L(t−1)i+1:5)qi(L

i |Lt−1

i ), 1

}

end forend for

Here are more details of the algorithm.Initialization: Assign arbitrary initial values for the pa-

rameters L1 = (α0, α1, β0, β1), L2 = p, and L3 = (σ20 , σ2

1 , τ) in

the corresponding parameter space. Generate initial valuesof L4 = (Ui, i = 1, · · · , n) independently from the log-normaldistribution with parameters L3. For latent process L5 = Z,we first assign values at time points Zi(j), i = 1, · · · , n; j =1, · · · , ni, using the posterior probability P(Zi(j) = 1|Xij =1) = 1 and P(Zi(j) = 0|Xij = 0) = 1/(2 − p) by assumingP(Zi(j) = 0) = P(Zi(j) = 1) = 1/2; and then generate thetransition points ti = (ti,1, · · · , ti,mi

) for individual i in the fol-lowing way:

(1) Let T = 1, s = Zi(1), and l = 1.(2) Find the next nearest data point d ∈ N, T < d ≤ ni, such

that Zi(d) = s.(3) Generate �t ∼ fsi(t)1{0<t≤d−T }. Let T = T + �t. If T <

ni, set s = 1 − s, ti,l = T , l = l + 1, and go back to step2; otherwise stop.

Proposal for L1: Generate the four parameters of theWeibull distribution (α0, α1, β0, β1) from a truncated normaldistribution with the current values as the means and a smallstandard deviation ε1 > 0 in one Markov chain iteration, thatis,

q(αs |αold

s ) ∼ N(α;αolds , ε1)1{αs>0}, s = 0, 1

q(βs |βold

s ) ∼ N(β;βolds , ε1)1{βs>0}, s = 0, 1.

Proposal for L2: Generate the parameter p from a trun-cated normal distribution with the current value as the meanand a small standard deviation ε2 > 0, that is,

q(p|pold) ∼ N(p;pold, ε2)1{0≤p≤1}.

Proposal for L3: The parameters σ20 , σ2

1 , τ denote the vari-ance and correlation coefficient of the logarithm of the randomeffects. We perform a random walk on the parameters as:

(σ20) ∼ (σ2

0)old Unif(1 − ε3, 1 + ε3),

(σ21) ∼ (σ2

1)old Unif(1 − ε3, 1 + ε3),

τ ∼ N(τ; τold, ε4)1{−1≤τ≤1},

where ε3 > 0 and ε4 > 0 control the scale of the perturbation.All the three parameters are updated once in one Markovchain iteration.

Proposal for L4: Randomly choose an individual i andgenerate the random effects of this individual from a trun-cated normal distribution with the current values as themeans and a small standard deviation ε5 > 0 in one Markovchain iteration, that is,

q(u0i|uold

0i ) ∼ N(u; uold0i , ε5)1{u0i>0},

q(u1i|uold

1i ) ∼ N(u; uold1i , ε5)1{u1i>0}.

Proposal for L5: To find a proposal distribution for thelatent process L5 = Z, we need to construct a continuous pathwhere the values of the path at the predetermined time pointsmatch the observed values with imperfect detectability p. We

Page 5: Modeling Parasite Infection Dynamics when there Is Heterogeneity and Imperfect Detectability

Modeling Parasite Infection Dynamics when there Is Heterogeneity and Imperfect Detectability 687

tbegin l l+1 t

end

tbegin

tend

Ziold

Zinew

(M3)

tbegin l t

end

tbegin t* t

end

Ziold

Zinew

(M1)

tbegin

l,tend

tbegin t* t** l,t

end

Ziold

Zinew

(M2)

Figure 2. Three moves for latent process Zi. M1 chooses l = 3 and changes ti,l to t ; M2 chooses l = 3 and inserts twotransition time points t and t between tbegin and tend; and M3 chooses l = 5 and deletes two transition time points ti,l andti,l+1 between tbegin and tend.

use the following proposal distribution which works well forour model. Given the current latent process Zi(t) of an in-dividual i, we consider three possible moves: M1—randomlyperturb a transition time point; M2—split a transition in-terval into three pieces by adding two transition points inthe interval; and M3—merge three consecutive duration in-tervals into one long interval. See Figure 2 for an illustrationof the three moves. The detailed procedure of updating theith individual’s latent process Zi(t) with transition time pointsti = (ti,1, · · · , ti,mi

) is explained in the following:

(1) Let c equal to 1, 2 and 3 with equal probability.(2) If c = 1, the move M1 is chosen. Choose one integer

number l from 1 to mi with equal probability. Randomlygenerate a value t from Unif(tbegin, tend), where tbegin =ti,l−11{l>1} + 1{l=1} and tend = ti,l+11{l<mi} + ni1{l=mi}. LetZnew

i be the old mi transition points with the lth elementreplaced by t. Calculate the proposal probability:

q(Znewi |Zold

i ) = 1

3mi(tend − tbegin),

and q(Zoldi |Znew

i ) is the same.(3) If c = 2, the move M2 is chosen. Choose one integer

number l from 1 to mi + 1 with equal probability. Ran-domly generate two values t < t independently fromUnif(tbegin, tend), where tbegin = ti,l−11{l>1} + 1{l=1} andtend = ti,l1{l<mi+1} + ni1{l=mi+1}. Let Znew

i be the old mi

transition points with two more transition time pointst and t inserted after the first l components. Calcu-

late the proposal probabilities

q(Znewi |Zold

i ) = 1

3(mi + 1)

2

(tend − tbegin)2and

q(Zoldi |Znew

i ) = 1

3(mi + 1).

(4) If c = 3, the move M3 is chosen. Choose one integernumber l from 1 to mi − 1 with equal probability. LetZnew

i be the old mi transition points with two transitiontime points ti,l and ti,l+1 deleted. Calculate the proposalprobabilities

q(Znewi |Zold

i ) = 1

3(mi − 1)and

q(Zoldi |Znew

i ) = 1

3(mi − 1)

2

(tend − tbegin)2,

where tbegin = ti,l−11{l>1} + 1{l=1} and tend =ti,l+21{l<mi−1} + ni1{l=mi−1}.

With the three moves, the Markov chain is irreducible as allpossible latent processes can be reached no matter what theinitial latent process is.

5. Numerical Studies

In this section, we first evaluate the performance of our modelon simulated data, and then we apply it to the real data in-troduced in Section 2. We start with the simple model withconstant hazard rates, imperfect detectability and homogene-ity among individuals (Model 1), then we consider the more

Page 6: Modeling Parasite Infection Dynamics when there Is Heterogeneity and Imperfect Detectability

688 Biometrics, September 2013

Table 1The bias and coverage probability of 95% credible intervals ofeach parameter based on 500 simulated data sets in Model 1,

Model 2 and Model 3, respectively

Model True Parameter Value Bias 95% CI coverage

1 β1 = β0 = 1.0 (fixed) — —λ1 = 1/α1 = 0.3 0.0218 94.2%λ0 = 1/α0 = 0.2 0.0133 95.6%

p = 0.95 0.0121 93.2%

2 β1 = β0 = 1.0 (fixed) — —λ1 = 1/α1 = 0.3 0.0227 94.2%λ0 = 1/α0 = 0.2 0.0142 94.4%

p = 0.95 0.0116 94.8%σ1 = 0.2 0.0230 96.4%σ0 = 0.2 0.0239 95.0%τ = 0.25 0.1656 95.8%

3 α1 = 3.33 0.4907 88.6%β1 = 1.2 0.1615 87.8%α0 = 5 0.7851 89.0%

β0 = 0.8 0.1119 84.8%p = 0.90 0.0205 90.0%σ1 = 0.2 0.0487 98.0%σ0 = 0.2 0.0521 98.4%τ = 0 0.1145 96.2%

complicate model with constant hazard rates, imperfect de-tectability and heterogeneity among individuals (Model 2),and finally we study the general model with nonconstant haz-ard rates, imperfect detectability and heterogeneity among in-dividuals (Model 3). For each model, we simulated 500 datasets with the same parameter values from the model. Eachdata set contains n = 50 individuals with the number of ob-servations for each individual ni ranging from 30 to 50. Thetrue values of the parameters that have been used to generatethe simulated data are given in Table 1. In Model 1, there isno heterogeneity among individuals, so there are no values forthe three parameters of the covariance matrix.

As discussed before, the probability density function ofWeibull distribution when βs = 1, s = 0, 1, is degeneratedto exponential distribution with rate parameter λs = 1/αs,s = 0, 1. For Models 1 and 2, we estimate αs only by fixingβs at its true value, which is equivalent to assuming exponen-tial survival distribution and estimating its rate parameterλs = 1/αs.

To estimate the parameters in each data set, we follow theinitialization procedure and generate samples from the pro-posal distributions under the MCMC framework described inSection 4. The tuning parameters (ε1, ε2, ε3, ε4, ε5) of the pro-posal distributions are set as (0.1, 0.03, 0.03, 0.05, 0.1) to geta reasonable acceptance rate for each parameter. Then theposterior mean and 95% posterior credible interval of eachparameter are derived based on samples from Markov chainsafter some burn-in period. Here the 95% posterior credible in-terval of each parameter is derived by using the 2.5th and the97.5th percentiles of the samples. For the overall performanceof each model, we also report the bias and the coverage prob-abilities of 95% posterior credible intervals of each parameterbased on 500 simulated data sets.

Model 1: By fixing βs, s = 0, 1, at its true value, we esti-mate the remaining three parameters (λ1, λ0, p) together. Theprior distribution for λs (s = 0, 1) is set as Gamma(0.01, 0.01)and the prior distribution for p is chosen as Beta(0.01, 0.01).This prior, which is U-shaped, is chosen to have mean 1/2and a large variance. We have tried other priors for p, such asBeta(1, 1) (which is the uniform distribution) and Beta(9, 1),and the posterior inference is not strongly affected by theprior. All the following estimates are based on samples from1,000,000 MCMC iterations with a 400,000 burn-in period.

In the experiment, the Markov chain converges quickly andmixes well; and the histograms of MCMC samples are roughlybell-shaped with means close to the true values. Table 1 showsthe coverage probability of 95% credible intervals and the biasof each parameter in this model based on 500 simulated datasets. The averaged bias for each parameter is very small witha maximum value around 0.02. The coverage probabilities forall parameters are very close to the nominal level 95%. In sum-mary, we can obtain reasonable estimates of all parameters inthis model.

Model 2: Similar to Model 1, we fix βs at its true valueand estimate the remaining parameters (λ1, λ0, p, σ1, σ0, τ)together. The priors for λs, (s = 0, 1) and p are the sameas those of Model 1. The prior distribution for the covari-ance matrix of the logarithm of random effects is set asInverse-Wishart(I2, 3) where I2 is a 2 × 2 identity matrix.All the following results are based on samples from 1,000,000MCMC iterations with a 500,000 burn-in period.

In this model, we introduced three parameters σ1, σ0, andτ to describe the heterogeneity among individuals. To see howthe individuals behave differently, we checked the average du-ration times at infected and uninfected states for all individ-uals based on one simulated data set from this model. Forthe infected state, most of the individuals have an averageduration time from three to four time units, but one indi-vidual’s average duration time is five time units and anotherone is two time units. Similarly, for the uninfected state, allthe individuals have an average duration time from five to sixtime units except a few with much larger or smaller durationtimes.

To show the overall performance of this model, we list inTable 1 the coverage probability of 95% credible intervals andthe bias of each parameter based on 500 simulated data sets.The biases of all parameters are small except τ which hasa relatively large bias 0.1656. The coverage probabilities ofthe 95% credible intervals for all parameters are close to thenominal level 95%. Therefore, the estimates of all parametersin this model are still satisfactory.

Model 3: Here we work on the general model usingWeibull survival function with random effects which al-lows for nonconstant hazard rates, imperfect detectabilityand heterogeneity among individuals. The parameters areα1, β1, α0, β0, p, σ1, σ0 and τ. We set the prior distributionfor αs and βs, s = 0, 1, as Gamma(0.01, 0.01), the prior dis-tribution of p as Beta(0.01, 0.01), and the prior distributionfor the covariance matrix of the logarithm of random effectsas Inverse-Wishart(I2, 3). All the following results are basedon samples from 1,000,000 MCMC iterations with a 500,000burn-in period. In Web Figure 1, we present the autocorrela-tion function plot of the MCMC samples.

Page 7: Modeling Parasite Infection Dynamics when there Is Heterogeneity and Imperfect Detectability

Modeling Parasite Infection Dynamics when there Is Heterogeneity and Imperfect Detectability 689

Table 2The sample variance and correlations between theparameters of the Weibull distribution in Model 3

Correlation α1 β1 α0 β0

α1 1.0000 0.5133 0.6524 0.4610β1 0.5133 1.0000 0.2291 −0.0036α0 0.6524 0.2291 1.0000 0.7201β0 0.4610 −0.0036 0.7201 1.0000

Variance 0.3088 0.0291 0.7627 0.0119

Similar to Model 2, the heterogeneity among individuals inthis model is quite obvious too. The coverage probability of95% credible intervals and the bias of each parameter basedon 500 simulated data sets from this general model are givenin Table 1. Overall, the coverage probabilities of the four pa-rameters of Weibull survival function αs, βs, (s = 0, 1) as wellas p are 5–10% lower than the nominal level 95%, and thoseof the parameters σ1, σ0 and τ are a little higher than thenominal level 95%. This may be due to the interplay betweenthe nonconstant hazard rates and heterogeneity among in-dividuals. To examine the properties of the estimates of theparameters of the Weibull distribution, we display the sam-ple variance and correlations in Table 2. The parameters β1

and β0 are considerably easier to estimate (smaller variance)than α1 and α0. The parameter α1 is substantially correlatedwith all of the other parameters α0, β0 and β1. The parameterβ1 is only substantially correlated with α1. The parameter α0

has a substantial correlation with α1 and β0 but not β1. Theparameter β0 has a substantial correlation with only α1 andα0 but not β1.

We ran the simulation using another set of parameter val-ues for Model 3 with σ1 = σ0 = 0.05 and all other parametersare kept the same. The coverage probabilities of 95% credibleintervals of α1, β1, α0, β0, and p are 91%, 90%, 96%, 89%, and94%, respectively. So we expect that the performance wouldbe better when the heterogeneity among individuals is notstrong or when the hazard rate is close to a constant (such asModel 2).

In Web Appendix B, we fit simple models (such as Model1 with perfect detectability) to data generated from morecomplicated models (such as Models 2 and 3) and explorewhether the information we obtained from the simple modelis still realistic. Based on the simulation results, we concludethat fitting simple models to data generated from more com-plicated models may give us misleading information aboutthe model. See Web Appendix B for details of the simulationresults.

Real data: In this part, we re-analyze the data about theinfection of Giardia lamblia introduced in Section 2. Followingthe same procedure as the simulated data, we start with thesimple model assuming constant hazard rates and continuewith the model allowing for nonconstant hazard rates. Theprior distribution of each parameter is chosen to be the sameas in the simulated data.

The results of the three models by assuming constant haz-ard rates: R1–R3 are shown in Table 3. Besides the constanthazard rates assumption, model R1 also assumes perfect de-

Table 3Parameter estimation for the longitudinal data of infectionwith the parasite Giardia lamblia among children in Kenya

Model Parameter Estimate 95% CI MLE

R1 λ1 0.3696 [0.3052, 0.4409] 0.3581λ0 0.3328 [0.2744, 0.3971] 0.3311

R2 λ1 0.2724 [0.2074, 0.3469] 0.2359λ0 0.3018 [0.2349, 0.3764] 0.3311p 0.8953 [0.8494, 0.9374] 0.9090

R3 λ1 0.2986 [0.2354, 0.3687] —λ0 0.3139 [0.2517, 0.3843] —p 0.9272 [0.8866, 0.9585] —σ1 0.2706 [0.1604, 0.3656] —σ0 0.2059 [0.1447, 0.2829] —τ 0.2150 [−0.2182, 0.5913] —

R4 α1 1.0937 [0.7007, 1.6612] —β1 0.6089 [0.5119, 0.7564] —α0 1.4741 [0.9172, 2.1138] —β0 0.7236 [0.5574, 0.9022] —

R5 α1 1.0679 [0.5318, 2.2026] —β1 0.5220 [0.4201, 0.7166] —α0 1.6608 [1.0737, 2.5884] —β0 0.7935 [0.6089, 1.0025] —p 0.9285 [0.8908, 0.9622] —

R6 α1 1.2758 [0.6546, 2.1114] —β1 0.6981 [0.5412, 0.9823] —α0 1.7384 [0.8608, 2.7434] —β0 0.9034 [0.6256, 1.3331] —p 0.9487 [0.9068, 0.9821] —σ1 0.3170 [0.1606, 0.5494] —σ0 0.4683 [0.2135, 0.8676] —τ −0.4531 [−0.8851, 0.2977] —

tectability and homogeneity among individuals; model R2 as-sumes homogeneity among individuals; while model R3 hasno additional assumptions. For models R1 and R2 with theassumptions of constant hazard rates and homogeneity amongindividuals, the maximum likelihood estimate (MLE) can beobtained as a result of the Markov property. Bekessy et al.(1976) gave the MLEs with perfect detectability and laterNagelkerke et al. (1990) provided the MLEs based on botha partial likelihood and the full likelihood with imperfect de-tectability. For comparison, we also list the MLEs of eachparameter in these two models. Note that all the estimatesincluding MLEs in Table 3 use week as the unit instead ofday.

For model R1, we only have two parameters λ1 and λ0.Both posterior means are close to the MLEs. From the esti-mates, we observe that the transition rate from infected touninfected status (λ1) is slightly higher than the hazard ratefrom uninfected to infected status (λ0), but the 95% credibleintervals of both parameters overlap a lot, which implies thedifference is not significant.

For model R2, the posterior mean of p is 0.8953, whichindicates that the detection procedure fails to detect around10.5% infection cases. The posterior means and MLEs are stillclose for all the parameters with the 95% credible intervalscovering the MLEs. By allowing for imperfect detectability,the estimates of λ1 and λ0 are different from those of model

Page 8: Modeling Parasite Infection Dynamics when there Is Heterogeneity and Imperfect Detectability

690 Biometrics, September 2013

3 3.1 3.2 3.3 3.4 3.5 3.6 3.7 3.8 3.90

1

2

3

4

5

6

7

8

9

(a) Duration times at infected state (R3)

2.6 2.8 3 3.2 3.4 3.6 3.8 40

1

2

3

4

5

6

7

8

(b) Duration times at uninfected state (R3)

1.4 1.5 1.6 1.7 1.8 1.9 2 2.1 2.20

1

2

3

4

5

6

7

8

9

(c) Duration times at infected state (R6)

1.6 1.7 1.8 1.9 2 2.1 2.2 2.3 2.4 2.50

1

2

3

4

5

6

7

8

9

(d) Duration times at uninfected state (R6)

Figure 3. Duration times at infected and uninfected states for all individuals of the real data in models R3 and R6,respectively.

R1, especially for λ1. Now the transition rate from infected touninfected status is smaller than the hazard rate from unin-fected to infected status, but the difference is not significant.Allowing imperfect detectability makes it possible to discovermore about the underlying disease process.

Compared to model R2, model R3 incorporates the het-erogeneity among individuals. The estimates of λ1, λ0, and p

are similar to model R2, which indicates that allowing het-erogeneity among individuals does not affect the estimates ofother parameters. Moreover, the correlation between the log-arithm of random effects is not significantly different from 0as the corresponding 95% credible interval covers 0. To checkthe heterogeneity among individuals, we show in Figure 3 theaverage duration times at infected and uninfected states forall individuals. The durations times at each state are similarbut still show some variation.

Next, we investigate the models allowing for nonconstanthazard rates. The corresponding results are also shown in Ta-ble 3. Model R4 assumes perfect detectability and homogene-ity among individuals. We can see that the estimates of β1 andβ0 are smaller than 1 and the 95% credible intervals do notcover 1. This indicates the actual hazard rates are not con-stant and they decay as the time goes on. Therefore, adoptingthe Weibull survival function allows us to detect non-constanthazard rates.

Model R5 allows imperfect detectability besides noncon-stant hazard rates. The posterior mean of p is 0.9285, whichmeans that the detection procedure fails to detect 7.15% in-

fected cases. The estimates of the parameters of the Weibullsurvival function also changed slightly comparing to modelR4. Note that the credible interval for the shape parameterβ0 has an upper bound about 1.

Finally, model R6 considers the most general situation.The estimates of the parameters of the survival function aresimilar to those of model R5 and the estimate of the imperfectdetectability is a little higher in model R6. Again, there is nosignificant correlation between the logarithm of random ef-fects given that the 95% credible interval of τ includes 0. Thereexists heterogeneity among individuals according to Figure 3.

Model R6 fits the data better than the other models inthe sense that model R6 is the only model that contains allof the features of imperfect sensitivity of the measurementof parasite infection, nonconstant hazards and heterogeneityamong individuals, and the fit of model R6 suggests that all ofthese features are significant. First, the imperfect detectabilityis non-negligible due to the imperfect diagnostic instrumentsand procedures. The estimated value of p is below 0.95, andthe 95% credible interval for p is below 1. Second, the haz-ard rates appear to be nonconstant especially for the infectionstate given that the 95% credible interval of β1 does not cover1. Third, the data reveals clear heterogeneity among individ-uals. Therefore, we conclude that it is important to considernonconstant hazard rates, imperfect detectability and randomeffects in the model. In this way, we can detect the influenceof imperfect detectability and random effects and learn moreabout the underlying true dynamics of parasites.

Page 9: Modeling Parasite Infection Dynamics when there Is Heterogeneity and Imperfect Detectability

Modeling Parasite Infection Dynamics when there Is Heterogeneity and Imperfect Detectability 691

To assess the model fitting, we generated 1000 new data sets(with the same size as the Giardia lamblia data) using the es-timated parameter values from model R6. For each simulateddata, we compute the average number of consecutive 1’s andthe average number of consecutive 0’s in the data. These twostatistics are related to the average duration times at infectedand uninfected states. Based on the 1000 simulated data, themean and 95% confidence interval for the average numberof consecutive 1’s are 2.7332 and (2.3571, 3.1457), and themean and 95% confidence interval for the average number ofconsecutive 0’s are 3.4850 and (2.9699, 4.0993). For the realGiardia lamblia data, the average number of consecutive 1’sand 0’s are 2.6049 and 3.5886, respectively, so they fall in the95% confidence intervals formed by the simulated data. Basedon these two statistics, model R6 fits the data reasonablywell.

Additionally, a potential advantage of our model is that wemay identify individuals who are more frequently infected.Specifically, the proportion of time that an individual i isinfected is a function of the parameters α0, α1, β0, β1, u0i

and u1i, and so we can find the posterior distribution of thetime infected for each child, and identify children who are fre-quently infected. In Web Figure 2, we plot the posterior meansand 95% credible intervals for the proportion of time in theinfected state for all individuals in the real data based onmodel R6. Unfortunately, due to the relatively short amountof time that we observed children in this study, we are not ableto identify with confidence children who are more frequentlyinfected, but in a study with a longer observation time, wemight be able to identify children who are more frequentlyinfected.

Note that the parameters in the random effects Weibullmodel are conditional on the random effects and that an ap-propriate comparison of the parameters can be obtained bycomputing the marginal parameters for the random effectsmodel by integrating over the random effect (Lee and Nelder,2004). In our model, this integration is easy to do for the haz-ard parameter. See Web Appendix C for details of the deriva-tion. In Web Figure 3, we compared the marginal hazards forgoing from uninfected to infected and infected to uninfectedfor models R1-R6 over time.

6. Discussion

We proposed a Bayesian hierarchical model to study the be-havior of Giardia lamblia based on the longitudinal data inChunge (1989). Our model is flexible by allowing (1) imper-fect detectability, (2) non-constant hazard rates, and (3) het-erogeneity among individuals. We also proposed an MCMCalgorithm with data augmentation to estimate the parame-ters in such a model. Simulation studies show that we canobtain reasonable estimates of all parameters under differentsettings.

We adopted the assumption of perfect specificity and im-perfect sensitivity in Nagelkerke et al. (1990) which is specificto this type of data. When both specificity and sensitivityare imperfect, we expect that the estimation of the parame-ters will become more difficult. We may need more data andneed to run the MCMC for longer time to obtain reasonableestimates of the parameters.

We have assumed heterogeneity parameters which are con-stant over a person’s observation period in the model. Thisis plausible for the 44-weeks of data we have modeled, butfor longer time periods, it is possible that some children whotended to get infected a lot when they were younger wouldbecome less prone to infection when they were older. It wouldbe useful in future research to extend our model to allow forsuch age-dependent heterogeneity.

We have modeled the infection process Zi(t) as a 0/1 pro-cess, that is, a child is either infected or not infected. A direc-tion for future research is to model the Zi(t) process as havingmore levels. Infections can be severe with lots of cysts in thestool (and thus higher detectability) or mild. Unfortunatelyour data only contains a binary measurement of whether therewere or were not cysts in the stool. It would be valuable infuture research to record the number of cysts in the stool andextend our model to make use of such data.

7. Supplementary Materials

Web Appendices and Figures referenced in Sections 4 and 5and the software for implementing the algorithms in Section 4are available with this paper at the Biometrics website onWiley Online Library.

Acknowledgements

The authors thank the editor, the associate editor, and tworeferees for their constructive comments. This work was sup-ported in part by NSF grants DMS-08-06175 and DMS-11-06796.

References

Baig, M. F., Kharal, S. A., Qadeer, S. A., and Badvi, J. A. (2012). Acomparative study of different methods used in the detectionof Giardia lamblia on fecal specimens of children. Annals ofTropical Medicine and Public Health 5, 163–167.

Bekessy, A., Molineaux, L., and Storey, J. (1976). Estimation ofincidence and recovery rates of Plasmodium falciparum par-asitaemia from longitudinal data. Bulletin of the WorldHealth Organization 54, 685–693.

Bureau, A., Shiboski, S., and Hughes, J. P. (2003). Applications ofcontinuous time hidden Markov models to the study of mis-classified disease outcomes. Statistics in Medicine 22, 441–462.

Chunge, R. N. (1989). Intestinal parasites in a rural communityin Kiambu District, Kenya with special reference to Giardialamblia. PhD thesis, University College Galway.

Cook, R. J. (1999). A mixed model for two-state Markov processesunder panel observation. Biometrics 55, 915–920.

Cox, D. R. and Isham, V. (1980). Point Processes. Chapman &Hall/CRC, London.

Crespi, C. M., Cumberland, W. G., and Blower, S. (2005). A queue-ing model for chronic recurrent conditions under panel ob-servation. Biometrics 61, 193–198.

Ji, Y. and Fan, Z. (2009). Analysis of longitudinal binary data withmisclassification.

Lee, Y. and Nelder, J. A. (2004). Conditional and marginal models:another view. Statistical Science 19, 219–238.

Page 10: Modeling Parasite Infection Dynamics when there Is Heterogeneity and Imperfect Detectability

692 Biometrics, September 2013

Nagelkerke, N. J. D., Chunge, R. N., and Kinoti, S. N. (1990). Es-timation of parasitic infection dynamics when detectabilityis imperfect. Statistics in Medicine 9, 1211–1219.

Ng, E. T. M. and Cook, R. J. (1997). Modeling two-state diseaseprocesses with random effects. Lifetime Data Analysis 3,315–335.

Rosychuk, R. J. and Islam, S. (2009). Parameter estimation ina model for misclassified Markov data—A Bayesian ap-proach. Computational Statistics & Data Analysis 53, 3805–3816.

Smith, T. and Vounatsou, P. (2003). Estimation of infection andrecovery rates for highly polymorphic parasites when de-

tectability is imperfect, using hidden Markov models. Statis-tics in Medicine 22, 1709–1724.

Woolhouse, M. E. J., Dye, C., Etard, J. F., Smith, T., Charlwood,J. D., Garnett, G. P., Hagan, P., Hii, J. L. K., Ndhlovu, P.D., Quinnell, R. J., Watts, C. H., Chandiwana, S. K., andAnderson, R. M. (1997). Heterogeneities in the transmissionof infectious agents: implications for the design of controlprograms. Proceedings of the National Academy of Sciences94, 338–342.

Received February 2012. Revised March 2013.Accepted April 2013.