STATISTICS IN MEDICINE
Statist. Med. 2008; 27:810–830
Published online 17 July 2007 in Wiley InterScience (www.interscience.wiley.com) DOI: 10.1002/sim.2945

Modelling patterns of agreement for nominal scales

Chris Roberts∗,†

Biostatistics Group, School of Medicine, University of Manchester, Stopford Building, Oxford Road, Manchester M13 9PL, U.K.

SUMMARY

The measurement of agreement of repeat ratings is the usual method of assessing the reliability of categorical scales. Measurement of agreement is also important in genetic twin studies based on categorical scales. One of the most commonly used methods of analysis for both types of study is the kappa coefficient. For scales with more than two categories, one approach is to use a single summary kappa coefficient. While this may be sufficient for many studies, in some instances investigation of heterogeneity in the pattern of agreement may give additional insights, as there may be greater agreement for some pairs of categories than for others. In this paper, kappa-type coefficients are used to model heterogeneity in the pattern of agreement. Constraints are added to the heterogeneous model to obtain simplified models. Procedures for estimation, confidence intervals, and inference for these coefficients are described for the case of two ratings per subject for a single sample and for the comparison of two independent samples. Formulae for sample size and power calculation are derived using the non-central chi-squared distribution. Two simulation studies are carried out to check the empirical test size and power. Methods are illustrated by two examples involving nominal scales with three categories. Copyright © 2007 John Wiley & Sons, Ltd.

KEY WORDS: nominal scale agreement; kappa; homogeneity; heterogeneity; twin studies

1. INTRODUCTION

The assessment of reliability plays an important role in the development of scales used for clinical decision making and outcome measures for clinical research. Such studies usually involve two or more ratings of the same specimen of interest by different raters. For categorical scales, statistical analysis often involves assessment of agreement between repeated ratings. In genetic twin studies, identical twins (mono-zygotic, MZ) are compared with non-identical (di-zygotic, DZ) twins to investigate whether a particular characteristic or trait has a genetic component. In the absence of an interaction between zygosity and the environment, greater association between responses for

∗Correspondence to: Chris Roberts, Biostatistics Group, School of Medicine, University of Manchester, Stopford Building, Oxford Road, Manchester M13 9PL, U.K.

†E-mail: [email protected]

Received 17 January 2006; Accepted 26 April 2007
Copyright © 2007 John Wiley & Sons, Ltd.

pairs of MZ twins than for DZ twins may suggest a genetic component to the trait. Where the trait is categorical, this may be investigated by comparing the level of agreement between responses of identical twins (MZ) with those of non-identical (DZ) twins.

The most commonly used method of statistical analysis for measuring agreement for categorical scales is the kappa coefficient proposed by Cohen [1] in 1960. Fleiss [2] described a method for assessing the reliability of nominal scales using a summary kappa coefficient for multiple raters. Bloch and Kraemer [3] proposed a maximum-likelihood estimator for a kappa coefficient for binary data, which Kraemer [4] had previously referred to as the intra-class kappa coefficient, being derived from a measurement model for binary data. This approach enabled rigorous evaluation of standard error [3, 5–10] and sample size [7, 11] estimators for the intra-class kappa coefficient for binary data with two ratings. Mekibib et al. [12] extended maximum-likelihood estimation to three ratings by making assumptions regarding the third-order moments. Donner and Klar [13], Donner et al. [14], and Nam [15] considered the comparison of intra-class kappa coefficients from independent groups or strata. Roberts and McNamee introduced an intra-class kappa coefficient for ordinal data based on a multinomial measurement model [16].

Recently, Bartfay and Donner [17–19] proposed methods of estimation and inference for the kappa coefficient for nominal scales which are appropriate where ratings can be considered to be exchangeable. Bartfay and Donner [17] considered estimation and inference for a single sample. Bartfay and Donner [18] and Bartfay et al. [20] considered the comparison of kappa coefficients for two independent groups, particularly important for twin studies. These methods assume a homogeneous pattern of agreement across categories, from which a single summary coefficient of agreement is derived. While this may be a sufficient analysis for many studies, modelling heterogeneity in the pattern of agreement could be important. For example, in twin studies, greater agreement among twins regarding assignment to some categories than to others may give additional insights into the genetic trait. In scale development studies, less agreement for some categories than for others might help identify weaknesses in the scale.

One approach to the investigation of heterogeneity in the pattern of agreement, suggested by Fleiss [2] and Landis and Koch [21], is to consider the kappa coefficient for the indicator variable of each category. This coefficient averages the agreement between the specified category and all others together. If one considers how categories in a nominal scale are often defined, it can be seen that this coefficient has limitations. Rather than being absolute, nominal categories may be defined in juxtaposition to each other. Considering a scale with three categories {A, B, C}, the terminology that differentiates category A from B may differ from that differentiating A from C. For example, in a scale of allergic response (see Example 1 below), deciding between different types of allergic response is likely to be more complex than differentiating between a particular allergic response and non-response. The definition of a particular category may more clearly demarcate it from one category than from another, so that agreement between some pairs of categories may be greater than that for other pairs. Identification of pairs of categories that are easily confused may suggest changes that improve reliability, but a kappa coefficient for a specific category gives only limited insight into this, a problem that is illustrated in Example 1 below. In the context of twin association studies involving a nominal classification, there may be association in one aspect but not in another. For example, in a twin study considering smoking status (Example 2 below), the smoking status of each twin might be classified as Never Smoked, Current Smoker, or Ex-Smoker. One might hypothesize that the strength of association will differ between Never Smoked and Current Smoker as compared to Current Smoker and Ex-Smoker. To address these types of questions, Roberts and McNamee [22] proposed a matrix of kappa-type coefficients to describe a heterogeneous pattern of

agreement for a nominal scale that can be used to investigate agreement in distinguishing between pairs of categories. This matrix can be considered to be a heterogeneous model for the pattern of agreement.

In this paper, the homogeneity of pattern is assessed by comparing one of the homogeneous models proposed by Bartfay and Donner with the heterogeneous model suggested by Roberts and McNamee [22]. Section 2 outlines these methods. Methods for comparing the pattern of heterogeneity between two groups, as one might consider in a twin study, are also described. In Section 3, test size and sample size–power issues are considered. A simulation study is used to check the size of the proposed test of homogeneity in the case of three categories. A formula is derived for estimating sample size and power for the test of homogeneity. Because this formula is based on an asymptotic approximation, it may overestimate the power. This is investigated by estimating the empirical power through a simulation study. Section 4 illustrates the application of the methods through two examples; the first involves a single sample, the second is a genetic twin study comparing MZ and DZ twins, and both involve nominal scales with three categories. In both examples, constraints will be applied to the kappa-type coefficients to develop a parsimonious model of the pattern of agreement, which may be more informative than assuming either homogeneity or allowing complete heterogeneity. The discussion considers limitations and possible extensions of the methods described. It concludes with a brief review of how the main alternative methods for the analysis of agreement, including the use of log-linear models as proposed by Agresti [23], Becker and Agresti [24], and Agresti [25], and latent class models described by Guggenmoos-Holzmann and Vonk [26], may relate to the assessment of heterogeneity in the pattern of agreement.

2. MODELS FOR NOMINAL SCALE AGREEMENT

2.1. A model for homogeneity of agreement

Suppose that each individual in a sample of N is rated twice using a scale with C (>2) mutually exclusive categories. Alternatively, suppose that N independent pairs of twins have been assessed using such a scale. If (j, k) are the pair of ratings, the data might be recorded as a two-way classification, with n_jk representing the number of subjects for which the first rating is to category j and the second rating is to category k. Suppose the order of each pair of ratings is not relevant, so that the pair of ratings can be said to be exchangeable. This may be fully justified on the basis of theoretical considerations. In a reliability study, the assumption can be appropriate where ratings are obtained by a memoryless or automated process. It may be acceptable where observers are randomly sampled for each subject independently from a large panel of observers. It will apply in a twin study if we are able to ignore characteristics such as the birth order of the twins. In other situations, empirical support may be gained by considering the marginal probabilities of the two-way classification. Similar marginal distributions for the first and second ratings, sometimes referred to as marginal homogeneity, suggest greater acceptability of the assumption.
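As empirical support of this kind, the marginal distributions of the first and second ratings can be compared directly. A minimal sketch in Python, using hypothetical counts for a three-category scale (the table n is invented purely for illustration):

```python
import numpy as np

# Hypothetical two-way table: n[j, k] counts pairs whose first rating is
# category j and whose second rating is category k.
n = np.array([[20, 4, 1],
              [6, 30, 3],
              [2, 5, 29]])
N = n.sum()

row_marg = n.sum(axis=1) / N  # marginal distribution of the first rating
col_marg = n.sum(axis=0) / N  # marginal distribution of the second rating
print(np.round(row_marg, 3), np.round(col_marg, 3))
```

Closely matching marginal distributions would lend support to treating the two ratings as exchangeable; a formal test of marginal homogeneity could also be applied.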

Under the assumption of exchangeability, a two-way classification of a data set, n_jk, is not unique, but a unique representation can be obtained by defining frequencies m_jk, where m_jj = n_jj and m_jk = n_jk + n_kj for j < k, as illustrated in Table I. Because ratings are assumed to be exchangeable, we can define π_j as the marginal probability that a rating is in category j. To investigate agreement in such a data set, Bartfay and Donner [17, 18] proposed two models. In the first, the joint probability

Table I. Notation for data and model of heterogeneity.

Rating 1 runs over the columns and Rating 2 over the rows, for categories 1, ..., C. Each cell carries the observed frequency n_jk, the combined frequency m_jk (m_jj = n_jj on the diagonal; m_jk = n_jk + n_kj for j < k, with the upper triangle left blank), and the model cell probability: π_j² + π_j(1 − π_j)κ_j on the diagonal and π_jπ_k(1 − κ_jk) off the diagonal. The row and column margins are the category proportions π_1, ..., π_C, and the grand total is N.

is given by

\[
\Pr[(j, j)] = \pi_j^2 + \pi_j(1 - \pi_j)\kappa
\tag{1}
\]

for concordant ratings or twins and

\[
\Pr[(j, k)] = \Pr[(k, j)] = \pi_j\pi_k(1 - \kappa)
\tag{2}
\]

for discordant ratings or twins. The parameter κ, defined to be the overall kappa coefficient, and the π_j are estimated by numerically maximizing the log-likelihood, given by

\[
\mathrm{LL}_{\mathrm{Hom}} = \sum_{j=1}^{C} m_{jj} \log(\pi_j^2 + \pi_j(1-\pi_j)\kappa) + \sum_{j=1}^{C-1} \sum_{k=j+1}^{C} m_{jk} \log(2\pi_j\pi_k(1-\kappa))
\tag{3}
\]

as κ does not have a closed-form maximum-likelihood estimate.

To enable the construction of a goodness-of-fit test, Bartfay and Donner [18] combined all the discordant cells into a single cell to give a modified log-likelihood defined by

\[
\mathrm{LL}'_{\mathrm{Hom}} = \sum_{j=1}^{C} m_{jj} \log(\pi_j^2 + \pi_j(1-\pi_j)\kappa) + \left( \sum_{j=1}^{C-1} \sum_{k=j+1}^{C} m_{jk} \right) \log\!\left( 2(1-\kappa) \sum_{j=1}^{C-1} \sum_{k=j+1}^{C} \pi_j \pi_k \right)
\tag{4}
\]

Again, maximum-likelihood estimates for κ and π_j are obtained by numerical maximization [17, 18]. Equations (3) and (4) will give different estimates of both κ and π_j.
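The numerical maximization of equation (3) can be sketched as follows for a three-category scale; the counts are hypothetical, and scipy's general-purpose optimizer stands in for the ML routine in STATA used in the paper:

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical unique frequencies m[j, k] (upper triangle; m[j, j] are the
# concordant counts) for a three-category scale.
m = np.array([[20.0, 10.0, 3.0],
              [0.0, 30.0, 8.0],
              [0.0, 0.0, 29.0]])

def neg_llhom(theta):
    """Negative of LL_Hom in equation (3): one common kappa plus the
    marginal probabilities pi_1, pi_2 (with pi_3 = 1 - pi_1 - pi_2)."""
    p1, p2, kappa = theta
    pi = np.array([p1, p2, 1.0 - p1 - p2])
    ll = 0.0
    for j in range(3):
        ll += m[j, j] * np.log(pi[j] ** 2 + pi[j] * (1 - pi[j]) * kappa)
        for k in range(j + 1, 3):
            ll += m[j, k] * np.log(2 * pi[j] * pi[k] * (1 - kappa))
    return -ll

# Box constraints keep every log argument positive during the search.
res = minimize(neg_llhom, x0=[0.33, 0.33, 0.3], method="L-BFGS-B",
               bounds=[(0.05, 0.45), (0.05, 0.45), (0.0, 0.99)])
print(np.round(res.x, 3))  # pi_1, pi_2 and the common kappa
```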

2.2. A model for heterogeneity of agreement

Heterogeneity can be considered by introducing a more flexible model for the joint probabilities. Equations (1) and (2) will be modified by replacing the summary parameter κ by a set of parameters κ_j and κ_jk to give

\[
\Pr[(j, j)] = \pi_j^2 + \pi_j(1 - \pi_j)\kappa_j
\tag{5}
\]

for concordant ratings or twins and

\[
\Pr[(j, k)] = \Pr[(k, j)] = \pi_j\pi_k(1 - \kappa_{jk})
\tag{6}
\]

for discordant ratings or twins. The parameter κ_j is the intra-class kappa coefficient for the binary indicator variable of category j, as described by Kraemer [4]. It is a measure of reliability or agreement for category j as compared with all other categories. The parameter κ_jk may be interpreted as a measure of the ability to distinguish between categories j and k [22]. For convenience, κ_jk will be referred to simply as the inter-class kappa coefficient, although a more precise terminology might be inter-category intra-class kappa coefficient.

By considering the marginal and cell probabilities, it can be shown that the intra-class kappa coefficient κ_j can be expressed as a weighted mean of the inter-class kappa coefficients κ_jk, namely

\[
\kappa_j = \frac{\sum_{k \neq j} \pi_k \kappa_{jk}}{\sum_{k \neq j} \pi_k}
\tag{7}
\]

For a scale with C categories, there are C(C − 1)/2 identifiable second-order moments. In equations (5) and (6), there are C intra-class kappa coefficients and C(C − 1)/2 inter-class kappa

coefficients. Hence, for a three-category scale, a full model can be based on the three intra-class kappa coefficients (κ_1, κ_2, and κ_3) or the three inter-class kappa coefficients (κ_12, κ_13, and κ_23). For larger numbers of categories, the set of C(C − 1)/2 unique inter-class kappa coefficients κ_jk gives a full or saturated model, providing a comparator for more parsimonious models. In contrast, the C intra-class kappa coefficients are no longer a full model of the cross-classification. The log-likelihood for the full model is

\[
\mathrm{LL}_{\mathrm{Full}} = \sum_{j=1}^{C} m_{jj} \log(\pi_j^2 + \pi_j(1-\pi_j)\kappa_j) + \sum_{j=1}^{C-1} \sum_{k=j+1}^{C} m_{jk} \log(2\pi_j\pi_k(1-\kappa_{jk}))
\tag{8}
\]

Substituting κ_j from Equation (7) into Equation (5), the log-likelihood for the full model is

\[
\mathrm{LL}_{\mathrm{Full}} = \sum_{j=1}^{C} m_{jj} \log\!\Big(\pi_j^2 + \pi_j \sum_{k \neq j} \pi_k \kappa_{jk}\Big) + \sum_{j=1}^{C-1} \sum_{k=j+1}^{C} m_{jk} \log(2\pi_j\pi_k(1-\kappa_{jk}))
\tag{9}
\]

In the more general circumstance of an equal number of multiple ratings, Roberts and McNamee [22] derived closed-form maximum-likelihood estimates of π_j, κ_j, and κ_jk. For the case of two ratings, these reduce to

\[
\hat{\pi}_j = \Big( 2m_{jj} + \sum_{k \neq j} m_{jk} \Big) \Big/ 2N
\tag{10}
\]

\[
\hat{\kappa}_j = \frac{m_{jj}/N - \hat{\pi}_j^2}{\hat{\pi}_j(1 - \hat{\pi}_j)}
\tag{11}
\]

and

\[
\hat{\kappa}_{jk} = 1 - \frac{m_{jk}/N}{2\hat{\pi}_j \hat{\pi}_k}
\tag{12}
\]

where m_jk is the frequency defined above. The maximum of the log-likelihood is \(\sum_{j=1}^{C} \sum_{k=j}^{C} m_{jk} \log(m_{jk}/N)\).

Different constraints may be applied to the parameters of equation (9) to test more parsimonious models, but estimation now depends on numerical maximization of the log-likelihood. This can be carried out using maximization software such as ML [27] in STATA [28] or Solver in Microsoft Excel [29], both of which allow constraints. For larger numbers of categories and/or smaller sample sizes, convergence and boundary value problems may be encountered, an issue that is considered in the Discussion. Homogeneity is tested by adding the constraint κ_jk = κ for all j ≠ k, so that equation (9) reduces to equation (3).
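The closed-form estimates of equations (10)–(12) are easily computed directly; the sketch below uses hypothetical counts (m stored as an upper triangle, with m_jj the concordant frequencies) and is not taken from the paper's own software:

```python
import numpy as np

# Hypothetical unique frequencies m[j, k] for a three-category scale.
m = np.array([[20.0, 10.0, 3.0],
              [0.0, 30.0, 8.0],
              [0.0, 0.0, 29.0]])
C, N = 3, m.sum()

# Equation (10): marginal probability of each category.
pi = np.array([(2 * m[j, j] + sum(m[min(j, k), max(j, k)]
                                  for k in range(C) if k != j)) / (2 * N)
               for j in range(C)])

# Equation (11): intra-class kappa for each category.
kap_j = np.array([(m[j, j] / N - pi[j] ** 2) / (pi[j] * (1 - pi[j]))
                  for j in range(C)])

# Equation (12): inter-class kappa for each pair of categories.
kap_jk = {(j, k): 1 - (m[j, k] / N) / (2 * pi[j] * pi[k])
          for j in range(C) for k in range(j + 1, C)}

print(np.round(pi, 3), np.round(kap_j, 3))
print({pair: round(v, 3) for pair, v in kap_jk.items()})
```

As a consistency check, each intra-class coefficient computed this way reproduces the weighted mean of equation (7).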

2.3. Large sample variance estimates and confidence intervals

Bloch and Kraemer [3] derived an expression for the asymptotic variance of κ̂_j based on the delta method:

\[
\mathrm{Var}[\hat{\kappa}_j] = \frac{1-\kappa_j}{N} \left( (1-\kappa_j)(1-2\kappa_j) + \frac{\kappa_j(2-\kappa_j)}{2\pi_j(1-\pi_j)} \right)
\tag{13}
\]

By application of the delta method, large-sample estimates of the variance and covariance for κ̂_jk can be derived. These are:

\[
\mathrm{Var}[\hat{\kappa}_{jk}] = \frac{1-\kappa_{jk}}{2N} \left( (1-\kappa_{jk})(2-\kappa_j-\kappa_k-2\kappa_{jk}) + \frac{1-(1-\kappa_{jk})\big((1-\kappa_k)\pi_j+(1-\kappa_j)\pi_k\big)}{\pi_j\pi_k} \right)
\tag{14}
\]

\[
\mathrm{Cov}[\hat{\kappa}_{jk}, \hat{\kappa}_{jl}] = \frac{(1-\kappa_{jk})(1-\kappa_{jl})}{2N} \left( 2-\kappa_j-\kappa_{jk}-\kappa_{jl}-\kappa_{kl} + \frac{\kappa_j-1}{\pi_j} \right)
\tag{15}
\]

and

\[
\mathrm{Cov}[\hat{\kappa}_{jk}, \hat{\kappa}_{lm}] = \frac{(1-\kappa_{jk})(1-\kappa_{lm})}{2N} \big( 2-\kappa_{jl}-\kappa_{jm}-\kappa_{kl}-\kappa_{km} \big)
\tag{16}
\]

Standard error estimates can be obtained by substituting the maximum-likelihood estimates of the parameters. It should be noted that if π_j + π_k = 1, that is, where the number of categories is only 2, κ_jk = κ_j = κ_k and formula (14) simplifies to that for Var[κ̂_j] given by equation (13). The variance terms can be used to obtain large-sample test statistics and confidence intervals for κ_j and κ_jk by assuming normality. For example, to test H0: κ_jk = κ_0, the test statistic

\[
z = \frac{\hat{\kappa}_{jk} - \kappa_0}{\mathrm{se}[\hat{\kappa}_{jk}]}
\tag{17}
\]

can be used, with an approximate (1 − α) confidence interval given by κ̂_jk ± z_{1−α/2} se[κ̂_jk]. Because the permissible values of κ_jk lie in the interval [−(1 − π_j)/π_j, 1] for π_j ≥ π_k, this large-sample confidence interval for κ_j and κ_jk may give values outside this range. An alternative, which avoids the range restriction, is to construct profile likelihood confidence intervals for the inter-class kappa terms by searching for the pair of values of κ_jk such that 2(max LL_Full − max LL_Full(κ_jk)) = χ²_{1,1−α}.
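Equation (14) and the resulting Wald interval can be sketched directly; the parameter values below are hypothetical, chosen only to show the calculation:

```python
import math

def var_kappa_jk(k_jk, k_j, k_k, pi_j, pi_k, N):
    """Large-sample variance of an inter-class kappa estimate, equation (14)."""
    t1 = (1 - k_jk) * (2 - k_j - k_k - 2 * k_jk)
    t2 = (1 - (1 - k_jk) * ((1 - k_k) * pi_j + (1 - k_j) * pi_k)) / (pi_j * pi_k)
    return (1 - k_jk) / (2 * N) * (t1 + t2)

# Hypothetical estimates for one pair of categories.
se = math.sqrt(var_kappa_jk(0.52, 0.67, 0.62, 0.27, 0.39, N=100))
lo, hi = 0.52 - 1.96 * se, 0.52 + 1.96 * se
print(round(se, 3), round(lo, 3), round(hi, 3))
```

A useful check on the implementation is the two-category reduction noted above: with π_j + π_k = 1 and κ_jk = κ_j = κ_k, formula (14) must agree numerically with formula (13).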

Test statistics and confidence intervals for the difference between κ_jk and κ_jl, or κ_jk and κ_lm, can also be constructed. For H0: κ_jk = κ_lm, the following statistic can be used:

\[
z = \frac{\hat{\kappa}_{jk} - \hat{\kappa}_{lm}}{\sqrt{\mathrm{Var}[\hat{\kappa}_{jk}] + \mathrm{Var}[\hat{\kappa}_{lm}] - 2\,\mathrm{Cov}[\hat{\kappa}_{jk}, \hat{\kappa}_{lm}]}}
\tag{18}
\]

Where multiple post hoc tests are carried out, the type I error can be controlled by a Bonferroni correction of p-values. For example, p-values would be multiplied by C(C − 1)/2 if all pair-wise comparisons of the inter-class kappa coefficients were being made. Where closed-form estimates of the inter- or intra-class kappa coefficients, given by equations (10)–(12), are available, the method may be used by substitution into the variance and covariance estimates given by equations (14)–(16). If these are obtained numerically, large-sample Wald standard errors and confidence intervals may be obtained numerically from the Hessian of the log-likelihood, using statistical maximization software such as ML in STATA [27].

2.4. Comparison between independent groups

Comparing the agreement between populations or strata, that is, between independent groups, might be of interest; for example, in twin studies one may wish to compare the agreement between MZ twin pairs with that of DZ twin pairs. Where there is evidence of heterogeneity in the pattern of agreement for a nominal classification, one might compare the values of the inter-class and intra-class kappa coefficients between groups. The likelihood ratio procedure can be adapted to this, although some thought needs to be given to the parameterization of the marginal probabilities in each independent group. One could assume a common marginal probability π_j for category j across groups when estimating both the pooled and separate estimates of the inter-class kappa coefficients. This approach would be justified by the argument that kappa coefficients for scales with different marginal probabilities are not comparable [30]. Alternatively, one could estimate pooled and separate estimates of κ_jk allowing π_j to vary between groups, an approach that has been taken when comparing intra-class kappa coefficients for binary data [13–15] and by Bartfay and Donner for nominal scale data [18]. Assuming common marginal proportions across groups may seem theoretically attractive, but estimates of kappa coefficients obtained in this way may differ substantially from those obtained from separate analyses of each group. It should perhaps be noted that a similar discrepancy could occur when comparing variances between populations if a common mean was assumed. For this reason, it is suggested that category marginal probabilities be allowed to vary between groups.

In the comparison of two groups, say A and B, one may be interested in testing H0: κ^A_jk = κ^B_jk. This could be tested using

\[
z_{jk} = \frac{\hat{\kappa}^A_{jk} - \hat{\kappa}^B_{jk}}{\sqrt{\mathrm{Var}[\hat{\kappa}^A_{jk}] + \mathrm{Var}[\hat{\kappa}^B_{jk}]}}
\tag{19}
\]

where the variance terms are given by equation (14). A test of the composite hypothesis, that is, H0: κ^A_12 = κ^B_12, ..., κ^A_{C−1,C} = κ^B_{C−1,C}, could be constructed using

\[
\sum_{j=1}^{C} \sum_{k<j} z_{jk}^2
\]

which, for a large sample, might be assumed to have a chi-squared distribution with degrees of freedom equal to C(C − 1)/2, where C is the number of categories.
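The pair-wise and composite comparisons can be sketched as follows; the MZ (group A) and DZ (group B) estimates are hypothetical stand-ins for real twin data:

```python
import math
from scipy.stats import chi2

def var_kappa_jk(k_jk, k_j, k_k, pi_j, pi_k, N):
    """Large-sample variance of an inter-class kappa estimate, equation (14)."""
    t1 = (1 - k_jk) * (2 - k_j - k_k - 2 * k_jk)
    t2 = (1 - (1 - k_jk) * ((1 - k_k) * pi_j + (1 - k_j) * pi_k)) / (pi_j * pi_k)
    return (1 - k_jk) / (2 * N) * (t1 + t2)

# Per pair of categories: (kappa_jk, kappa_j, kappa_k, pi_j, pi_k, N).
A = {(1, 2): (0.62, 0.60, 0.58, 0.35, 0.35, 120),   # hypothetical MZ group
     (1, 3): (0.55, 0.60, 0.50, 0.35, 0.30, 120),
     (2, 3): (0.48, 0.58, 0.50, 0.35, 0.30, 120)}
B = {(1, 2): (0.40, 0.42, 0.40, 0.34, 0.36, 150),   # hypothetical DZ group
     (1, 3): (0.35, 0.42, 0.33, 0.34, 0.30, 150),
     (2, 3): (0.30, 0.40, 0.33, 0.36, 0.30, 150)}

# Equation (19) for each pair, accumulated into the composite statistic.
stat = 0.0
for pair in A:
    z = (A[pair][0] - B[pair][0]) / math.sqrt(
        var_kappa_jk(*A[pair]) + var_kappa_jk(*B[pair]))
    stat += z ** 2
p_value = chi2.sf(stat, df=3)  # C(C - 1)/2 = 3 degrees of freedom for C = 3
print(round(stat, 2), round(p_value, 4))
```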

3. TEST SIZE AND POWER

3.1. Test size

The likelihood ratio test of H0: κ_12 = κ_13 = κ_23 = κ assumes a chi-squared distribution for 2(MaxLL_Full − MaxLL_Hom), where MaxLL_Full and MaxLL_Hom are the maximum values of the log-likelihood under the full and null models, respectively. Test size is likely to be affected by small sample size, the value of κ, and some categories having low prevalence. A simulation study was carried out to investigate the relationship between these factors and test size for a scale with three categories. Data were simulated under H0 for a range of sample sizes (50, 100, 200, 400, and 800) and values of κ (0, 0.2, 0.4, 0.6, and 0.8). Rather than working systematically through a

combination of category proportions, six sets of probabilities that might be more or less favourable to the assumptions of normality were selected. These were equal prevalence (0.333, 0.333, 0.333), expected to give the best estimate of test size; two sets with one smaller category ({0.4, 0.4, 0.2} and {0.45, 0.45, 0.1}); and three sets with two smaller categories ({0.25, 0.25, 0.5}, {0.2, 0.2, 0.6}, and {0.1, 0.1, 0.8}), hypothesized to be the least favourable scenario. To estimate a test size of 5 per cent with a 95 per cent confidence interval of width less than 1 per cent, 7299 simulations are required for each combination. In the singular case, where there were no off-diagonal cells in the generated table, corresponding to perfect agreement and occurring most often when κ was equal to 0.8, data were regenerated. Where there was non-convergence, a situation that tended to occur where there were two empty cells on the main diagonal of the table, data were also regenerated. Simulation was carried out using STATA [28] with predefined starting seeds chosen to allow replication of results and prevent duplication of the random number sequence. The null model, that is, the homogeneous model, was fitted using ML [27] and the full heterogeneous model using the closed-form estimates described in Section 2.2.
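The data-generation step of such a simulation can be sketched as follows, with numpy's multinomial sampler standing in for the STATA code and an arbitrary fixed seed:

```python
import numpy as np

rng = np.random.default_rng(20060117)  # arbitrary fixed seed

def simulate_m(pi, kappa, N):
    """Draw one table of unique frequencies m[j, k] under the homogeneous
    model: diagonal cell probabilities from equation (1) and combined
    off-diagonal probabilities 2*pi_j*pi_k*(1 - kappa) from equation (2)."""
    C = len(pi)
    cells, probs = [], []
    for j in range(C):
        cells.append((j, j))
        probs.append(pi[j] ** 2 + pi[j] * (1 - pi[j]) * kappa)
        for k in range(j + 1, C):
            cells.append((j, k))
            probs.append(2 * pi[j] * pi[k] * (1 - kappa))
    counts = rng.multinomial(N, probs)
    m = np.zeros((C, C))
    for (j, k), c in zip(cells, counts):
        m[j, k] = c
    return m

m = simulate_m([1 / 3, 1 / 3, 1 / 3], kappa=0.4, N=200)
print(m)
```

Each replicate would then be fitted under both the homogeneous and full models, with regeneration applied in the degenerate cases described above.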

Table II gives the empirical test size. The simulation regeneration rate is given in square brackets for those combinations where this was necessary. The regeneration rate was higher where the sample size was small, or the prevalence of one or more categories was small and the simulation parameter for κ was zero. The chi-squared approximation to the likelihood ratio test statistic is known to be poor where the minimum cell frequency is below 4, with test size tending to be too small if the majority of expected cell sizes are below 0.5 and too large if the majority are between 0.5 and 4 [31]. Those cases where the simulation parameters gave an expected minimum cell frequency below 4 are given in bold typeface in Table II. Where the expected minimum cell size was above 4, test size ranged from 4.30 to 6.80 per cent, with 75 per cent (78/104) of simulation scenarios in the range (4.5, 5.5 per cent). While this is below the 95 per cent that one would expect based on the simulation sample size calculation if the test size was correct, it suggests that the test is close to the nominal level. Where the expected minimum cell size was below 4, empirical test size was more variable, ranging from 3.90 to 7.98 per cent, with only 15 per cent (7/46) of scenarios in the range (4.5, 5.5 per cent), with 5 below the range and 34 above. When a logistic regression model was fitted to the indicator variable for a p-value less than 5 per cent, sample size (χ²_1 = 257.0, p < 0.0001) was most strongly related to test size, followed by regeneration rate (χ²_1 = 43.0, p < 0.0001), set category proportions (χ²_5 = 19.7, p = 0.0014), and the value of κ (χ²_4 = 16.9, p = 0.002), all of which were related to the expected minimum cell frequencies. Asymptotic theory would suggest that test size would tend to 5 per cent as sample size increased. For a sample size of 800, the test size exceeded 6 per cent (see Table II) where the proportions were (0.8, 0.1, 0.1), for κ = 0 and 0.8. When these simulations were repeated using a sample size of 1600, empirical test sizes of 5.25 and 5.39 per cent were obtained for κ = 0 and 0.8, respectively.

3.2. Power and sample size for testing homogeneity

The sample size and power for the test of heterogeneity can be estimated by considering the likelihood ratio test of H0: κ_jk = κ for all j ≠ k. For power (1 − β), an α-size test requires a sample size N such that Pr[2(MaxLL_Full − MaxLL_Hom) > χ²_{f,α}] = 1 − β, where MaxLL_Full and MaxLL_Hom are the maximum values of the log-likelihood under the full and null models and f = C(C − 1)/2 − 1 is the degrees of freedom. The sample size can be estimated if we assume that the term 2(MaxLL_Full − MaxLL_Hom) has a non-central chi-squared distribution χ²_f(λ) with f degrees of freedom and

Table II. Empirical type I error rate (per cent), estimated using 7299 simulations for each cell [simulation regeneration rate], when testing H0: κ12 = κ13 = κ23 = κ for sample sizes 50, 100, 200, 400, and 800.

Category prevalence   Sample   Simulated inter-class kappa coefficient
(π1, π2, π3)          size     κ=0           κ=0.2        κ=0.4        κ=0.6        κ=0.8        Mean

Equal
(0.33, 0.33, 0.33)    50       5.97          5.53         5.24         6.36         7.98         6.22
                      100      5.48          5.15         5.65         5.36         6.80         5.69
                      200      5.12          5.68         4.91         5.43         5.61         5.35
                      400      5.01          4.88         5.39         5.29         5.52         5.22
                      800      5.16          5.21         4.92         4.79         5.32         5.08
                      Mean     5.35          5.29         5.22         5.45         6.25         5.51

One smaller
(0.4, 0.4, 0.2)       50       6.07          6.30         5.88         6.53         6.57         6.27
                      100      5.46          5.44         5.23         5.31         7.04         5.70
                      200      5.04          5.58         5.38         5.57         5.32         5.38
                      400      5.06          4.82         4.97         4.64         5.26         4.95
                      800      4.91          5.43         5.21         5.26         4.58         5.08
                      Mean     5.31          5.51         5.33         5.46         5.76         5.48

(0.45, 0.45, 0.1)     50       4.34          6.76         7.75         6.74         4.60         6.04
                      100      4.40          6.40         5.56         6.97         6.18         5.90
                      200      5.32          5.82         5.35         5.09         6.74         5.66
                      400      5.70          4.91         5.73         5.49         5.42         5.45
                      800      5.19          5.47         4.90         4.84         5.09         5.10
                      Mean     4.99          5.87         5.86         5.83         5.61         5.63

Two smaller
(0.5, 0.25, 0.25)     50       7.22 [0.12]   5.57         5.98         7.02         6.63         6.49
                      100      4.92          5.17         5.36         5.85         6.99         5.66
                      200      5.30          5.48         4.91         5.42         6.00         5.42
                      400      5.06          5.30         5.62         4.87         5.13         5.20
                      800      5.19          5.26         4.30         5.22         5.16         5.03
                      Mean     5.54          5.36         5.24         5.68         5.98         5.56

(0.6, 0.2, 0.2)       50       6.95 [1.39]   6.52 [0.01]  7.22         6.04         6.20         6.58
                      100      5.41          5.56         5.60         6.18         6.14         5.78
                      200      5.82          5.25         5.68         5.30         7.00         5.81
                      400      5.09          5.47         5.04         5.27         5.68         5.31
                      800      5.08          5.03         5.56         4.95         5.19         5.16
                      Mean     5.67          5.56         5.82         5.55         6.04         5.73

(0.8, 0.1, 0.1)       50       3.90 [36.7]   5.85 [5.21]  5.52 [0.73]  5.56 [0.27]  4.43 [0.44]  5.05
                      100      6.43 [13.27]  6.81 [0.31]  5.39         4.79         4.94         5.67
                      200      6.66 [1.84]   6.49         6.70         5.29         4.36         5.90
                      400      6.02 [0.01]   5.58         5.48         6.60         4.71         5.68
                      800      6.08          5.01         5.55         5.25         6.49         5.68
                      Mean     5.82          5.95         5.73         5.50         4.99         5.60

Note: Bold typeface represents cases where the expected frequencies are less than 4 in one or more cells.

non-centrality parameter λ under the alternative hypothesis. Define

\[
D_{\mathrm{Hom}} = \frac{\mathrm{MaxLL}_{\mathrm{Hom}}}{N} = \sum_{j=1}^{C} (\pi_j^2 + \pi_j(1-\pi_j)\kappa) \log(\pi_j^2 + \pi_j(1-\pi_j)\kappa) + \sum_{j=1}^{C} \sum_{k \neq j} \pi_j \pi_k (1-\kappa) \log(2\pi_j\pi_k(1-\kappa))
\tag{20}
\]

and

\[
D_{\mathrm{Full}} = \frac{\mathrm{MaxLL}_{\mathrm{Full}}}{N} = \sum_{j=1}^{C} (\pi_j^2 + \pi_j(1-\pi_j)\kappa_j) \log(\pi_j^2 + \pi_j(1-\pi_j)\kappa_j) + \sum_{j=1}^{C} \sum_{k \neq j} \pi_j \pi_k (1-\kappa_{jk}) \log(2\pi_j\pi_k(1-\kappa_{jk}))
\tag{21}
\]

The values of D_Hom and D_Full are obtained by substituting the hypothesized values of π_j and κ under the null hypothesis and κ_jk under the alternative hypothesis into equations (20) and (21). The minimum sample size required to obtain power (1 − β) is then given by

N = �

2(Dfull − Dhom)(22)

The required non-centrality parameter λ is the value for which the non-central chi-squared distribution function χ²_{f,1−β}(λ) equals the central chi-squared critical value χ²_{f,α}. This can be obtained from tables of the non-central chi-squared distribution and from some statistical software packages. For the case of three categories (f = 2) and test size α = 0.05, the required values of the non-centrality parameter λ are 9.635 for 80 per cent power and 12.654 for 90 per cent power. Table III displays estimates of sample size for a range of values of κ from 0.2 to 0.9 and possible values of κ12, κ13 and κ23 under the alternative hypothesis. To simplify presentation, it is assumed that π1 = π2 = π3 = 1/3, although sample size can be calculated for other values using equations (20)–(22). From Table III, the minimum sample sizes required to test H0: κ = 0.4 against H1: κ12 = 0.6, κ13 = 0.3, κ23 = 0.3 are 349 for 80 per cent power and 459 for 90 per cent power. It can be seen that, unless the differences between the values of κjk are great, sample sizes to test heterogeneity need to be large to have adequate power.
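The calculation in equations (20)–(22) can be sketched in Python (a sketch, assuming scipy is available; the function names are illustrative, and the intra-class κ_j is taken as the weighted average of the κ_jk):

```python
import math
from scipy.stats import chi2, ncx2
from scipy.optimize import brentq

def noncentrality(f, alpha, power):
    """Smallest lambda such that the non-central chi-squared distribution
    puts probability `power` above the central critical value."""
    crit = chi2.ppf(1 - alpha, f)
    return brentq(lambda lam: ncx2.sf(crit, f, lam) - power, 1e-9, 100.0)

def d_per_subject(pi, kap):
    """Expected log-likelihood per subject, as in equations (20)/(21).
    kap[j][k] holds kappa_jk (j != k); kappa_j is the weighted average
    sum_{k != j} pi_k * kappa_jk / (1 - pi_j)."""
    C = len(pi)
    total = 0.0
    for j in range(C):
        kj = sum(pi[k] * kap[j][k] for k in range(C) if k != j) / (1 - pi[j])
        p_jj = pi[j] ** 2 + pi[j] * (1 - pi[j]) * kj
        total += p_jj * math.log(p_jj)
        for k in range(C):
            if k != j:
                q = pi[j] * pi[k] * (1 - kap[j][k])
                total += q * math.log(2 * q)
    return total

def sample_size(pi, kappa0, kap_alt, f, alpha, power):
    """Equation (22): N = lambda / (2 (D_full - D_hom)), rounded up."""
    C = len(pi)
    d_hom = d_per_subject(pi, [[kappa0] * C for _ in range(C)])
    d_full = d_per_subject(pi, kap_alt)
    return math.ceil(noncentrality(f, alpha, power) / (2 * (d_full - d_hom)))

pi = [1 / 3] * 3
alt = [[0, 0.6, 0.3], [0.6, 0, 0.3], [0.3, 0.3, 0]]
print(sample_size(pi, 0.4, alt, f=2, alpha=0.05, power=0.80))  # 349
print(sample_size(pi, 0.4, alt, f=2, alpha=0.05, power=0.90))  # 459
```

This reproduces the worked figures quoted above (N = 349 and 459) for H0: κ = 0.4 against H1: κ12 = 0.6, κ13 = 0.3, κ23 = 0.3.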

Sample size estimation is based on an assumption of normality. Power was checked by estimatingthe empirical power in a simulation study. For each combination of values, data were simulatedunder the alternative hypothesis using the same procedures as those in the simulation study of testsize. Because the sample size formula gives non-integer values, the simulation study used the twoclosest integer values weighted in appropriate proportions to approximate the non-integer value. Toestimate 80 per cent power with a 95 per cent confidence interval of width 2 per cent, a minimumof 6147 simulations are required for each combination. For a power of 90 per cent, a minimumof 3458 simulations are required for the same precision.
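The simulation step above draws rating pairs from the model's cell probabilities; a minimal sketch (assuming numpy, with the alternative-hypothesis values of the worked example):

```python
import numpy as np

def pair_probs(pi, kap):
    """Cell probabilities of the agreement model: Pr[(j,j)] = pi_j^2 +
    pi_j(1-pi_j)kappa_j and Pr[(j,k)] = pi_j pi_k (1 - kappa_jk) for j != k,
    with kappa_j the weighted average of the kappa_jk."""
    C = len(pi)
    p = np.empty((C, C))
    for j in range(C):
        kj = sum(pi[k] * kap[j][k] for k in range(C) if k != j) / (1 - pi[j])
        p[j, j] = pi[j] ** 2 + pi[j] * (1 - pi[j]) * kj
        for k in range(C):
            if k != j:
                p[j, k] = pi[j] * pi[k] * (1 - kap[j][k])
    return p

def simulate_table(n, pi, kap, rng):
    """Draw an n-subject cross-classification from the model."""
    p = pair_probs(pi, kap)
    return rng.multinomial(n, p.ravel()).reshape(p.shape)

rng = np.random.default_rng(2008)
pi = [1 / 3] * 3
alt = [[0, 0.6, 0.3], [0.6, 0, 0.3], [0.3, 0.3, 0]]
p = pair_probs(pi, alt)
table = simulate_table(100_000, pi, alt, rng)
```

A simulated table can then be fitted under both the homogeneous and the full model to obtain the likelihood ratio statistic for one replicate.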

For nominal powers of both 80 and 90 per cent, the empirical power (Table III) was near the specified level. For 80 per cent power, the empirical power was within the range (79, 81 per cent) for 53 per cent (27/51) of simulation scenarios. Coincidentally, 53 per cent (27/51) were within the range (89, 91 per cent) for 90 per cent power. Empirical power below the nominal level implies that the sample size estimate is too small. For a nominal power of 80 per cent, the empirical power was below 79 per cent in 20 per cent (10/51) of simulation scenarios. Similarly, for a power of 90 per cent, the empirical power was below 89 per cent for 27 per cent (14/51) of scenarios.

Table III. Sample size estimates for testing heterogeneity for various alternative hypotheses for a scale with three categories with π1 = π2 = π3 = 1/3.

                                 80 per cent power       90 per cent power
                                 (6147 simulations)      (3458 simulations)
  H0        H1                        Empirical               Empirical
  κ      κ12    κ13    κ23       N    power (per cent)   N    power (per cent)
  0.2    0.6    0      0         102       83.0          133       91.6
  0.2    0.5    0.05   0.05      186       81.1          245       90.9
  0.2    0.4    0.1    0.1       431       80.6          565       90.1
  0.2    0.4    0.2    0         335       79.7          439       89.6
  0.2    0.3    0.3    0         465       79.8          610       89.6

  0.3    0.7    0.1    0.1        92       82.0          120       91.6
  0.3    0.6    0.15   0.15      170       80.7          223       90.8
  0.3    0.5    0.2    0.2       395       80.3          518       90.5
  0.3    0.5    0.3    0.1       308       79.7          404       90.0
  0.3    0.6    0.3    0         135       80.8          177       90.6
  0.3    0.4    0.4    0.1       431       78.3          565       87.6
  0.3    0.45   0.45   0         194       79.4          255       89.1

  0.4    0.8    0.2    0.2        79       82.8          104       91.8
  0.4    0.7    0.25   0.25      149       80.0          195       90.4
  0.4    0.6    0.3    0.3       349       80.0          459       90.6
  0.4    0.6    0.4    0.2       275       79.3          360       90.2
  0.4    0.7    0.4    0.1       120       79.4          157       89.2
  0.4    0.8    0.4    0          65       82.3           85       90.7
  0.4    0.5    0.5    0.2       388       79.2          509       89.3
  0.4    0.55   0.55   0.1       175       78.6          230       88.3
  0.4    0.6    0.6    0         100       79.4          131       88.4

  0.5    0.9    0.3    0.3        64       83.9           84       92.6
  0.5    0.8    0.35   0.35      124       82.0          163       90.8
  0.5    0.7    0.4    0.4       297       80.2          390       90.1
  0.5    0.7    0.5    0.3       235       78.7          309       88.5
  0.5    0.8    0.5    0.2       101       80.0          133       90.7
  0.5    0.9    0.5    0.1        54       83.4           70       92.3
  0.5    0.6    0.6    0.3       337       77.7          443       87.5
  0.5    0.65   0.65   0.2       152       77.6          200       88.5
  0.5    0.7    0.7    0.1        87       78.6          114       89.1

  0.6    0.9    0.45   0.45       96       83.0          126       92.0
  0.6    0.8    0.5    0.5       238       80.1          312       90.5
  0.6    0.8    0.6    0.4       191       79.8          251       89.8
  0.6    0.9    0.6    0.3        80       81.8          105       91.9
  0.6    0.7    0.7    0.4       281       77.6          368       88.3
  0.6    0.75   0.75   0.3       127       77.7          166       88.7
  0.6    0.8    0.8    0.2        72       79.2           94       88.1

  0.7    0.9    0.6    0.6       172       81.4          225       91.9
  0.7    0.9    0.7    0.5       141       81.0          186       90.9
  0.7    0.8    0.8    0.5       219       78.3          287       89.0
  0.7    0.85   0.85   0.4        98       79.4          129       88.9
  0.7    0.9    0.9    0.3        55       80.8           72       89.5

  0.8    0.95   0.73   0.73      199       84.0          261       92.8
  0.8    0.9    0.75   0.75      496       80.5          652       91.0
  0.8    0.9    0.8    0.7       400       80.9          525       90.9
  0.8    0.95   0.8    0.65      166       83.2          218       91.4
  0.8    0.85   0.85   0.7       590       77.9          775       88.5
  0.8    0.88   0.88   0.65      266       79.0          349       88.7
  0.8    0.9    0.9    0.6       151       79.3          198       88.5

  0.9    0.95   0.95   0.8       307       79.1          404       88.5
  0.9    0.97   0.97   0.75      134       82.3          175       90.5

  Total                                    80.28                   90.02

4. ILLUSTRATIVE EXAMPLES

4.1. Example 1. A three-category scale

Bartfay and Donner [17] illustrated their method using data from Guggenmoos-Holzmann and Vonk [26] regarding agreement in reporting allergic responses. In this study, mothers with newborn babies were queried about their history of atopic diseases, including allergic rhinitis, food allergy, allergic asthma and neurodermitis. Responses were grouped into three categories: 'no atopy', 'atopy, but no neurodermitis', and 'neurodermitis'. To assess the consistency of response, the history of mothers' atopic disease was taken again two years later. Table IV gives the cross-classification of their responses. Although marginal homogeneity would be difficult to justify theoretically, marginal probabilities were similar for the first and second ratings. Table IV also gives the maximum likelihood estimates of κ_j and κ_jk with large-sample confidence intervals based on the delta method. For the inter-class kappa coefficients, profile likelihood confidence intervals are also given in square brackets for the full model and for the simplified models described below.

The inter-class kappa terms are interpreted in much the same way as the intra-class kappa coefficients. A value of κ_jk equal to 1 implies that categories j and k are not confused at all, while a value of zero implies that the two categories are indistinguishable [22].

Table IV. Reported atopic disease.

(a) Frequencies (per cent)

                                          First interview
Second interview          No atopy     Atopy, no neurodermitis   Neurodermitis   Total
No atopy                  136 (59)     12 (5)                    1 (0.4)         149 (64)
Atopy, no neurodermitis   8 (3)        59 (25)                   4 (2)           71 (31)
Neurodermitis             2 (1)        4 (2)                     6 (3)           12 (5)
Total (per cent)          146 (63)     75 (32)                   11 (5)          232

(b) Full model parameter estimates (95 per cent C.I.)

                          No atopy                      Atopy, no neurodermitis        Neurodermitis
No atopy                  κ̂1 = 0.786 (0.703, 0.869)
Atopy, no neurodermitis   κ̂12 = 0.785 (0.695, 0.874)    κ̂2 = 0.720 (0.624, 0.817)
Neurodermitis             κ̂13 = 0.795 (0.573, 1.018)    κ̂23 = −0.105 (−0.798, 0.591)   κ̂3 = 0.497 (0.241, 0.754)
Marginal proportion, π̂j   0.636                         0.315                          0.050

(c) Likelihood ratio statistics and parameter estimates for the models

Model                     −2 log-lik   G²      d.f.   p       AIC      Parameter estimates [profile 95 per cent C.I.]
(1) Full                  528.70       —       0      —       538.70   κ̂12 = 0.785 [0.684, 0.863]; κ̂13 = 0.795 [0.503, 0.948];
                                                                       κ̂23 = −0.105 [−0.890, 0.473]
(2) κ12 = κ13 = κ23       540.98       12.28   2      0.002   546.98   κ̂12 = κ̂13 = κ̂23 = 0.728 [0.631, 0.809]
(3) κ12 = κ13             528.71       0.01    1      0.934   536.71   κ̂12 = κ̂13 = 0.786 [0.694, 0.860]; κ̂23 = −0.102 [−0.860, 0.473]
(4) κ12 = κ13, κ23 = 0    528.80       0.10    2      0.953   534.80   κ̂12 = κ̂13 = 0.785 [0.693, 0.859]

Examining the inter-class kappa terms, it should be noted that κ̂12 = 0.785 and κ̂13 = 0.795, whereas κ̂23 = −0.105, suggesting that the categories 'Atopy, no neurodermitis' and 'Neurodermitis' are not distinguishable, while both 'Atopy, no neurodermitis' and 'Neurodermitis' appear to be similarly distinguishable from 'No atopy'. If the inter-class kappa coefficients κ_jk are equal for all k ≠ j, one might say that category j is confused in equal measure with all the other categories. This condition is equivalent to κ_jk = κ_j for all k ≠ j, which is equivalent to

Pr[X′ = j | X ≠ j] = Pr[X′ = j | X = k]

for all k �= j , a condition that James [32] called ‘impartial non-agreement’. Hence, the category‘No Atopy’ can be said to be in ‘impartial non-agreement’.

The intra-class kappa coefficients for each of the categories are κ̂1 = 0.786, κ̂2 = 0.720, and κ̂3 = 0.497, from which it is not so evident that the categories 'Atopy, no neurodermitis' and 'Neurodermitis' are difficult to distinguish. The intra-class kappa coefficients κ_j, being weighted averages of the inter-class kappa terms κ_jk, mix sources of disagreement. This illustrates a limitation of the intra-class kappa coefficient for the indicator variable as compared with the inter-class kappa coefficient.
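This averaging can be checked numerically against the Table IV estimates (a sketch; small discrepancies reflect the rounding of the published values):

```python
# Marginal proportions and inter-class kappas from Table IV (full model).
pi = [0.636, 0.315, 0.050]
kap = [[0, 0.785, 0.795],
       [0.785, 0, -0.105],
       [0.795, -0.105, 0]]

def intra_from_inter(pi, kap, j):
    """kappa_j as the weighted average sum_{k != j} pi_k kappa_jk / (1 - pi_j)."""
    return sum(pi[k] * kap[j][k] for k in range(len(pi)) if k != j) / (1 - pi[j])

for j in range(3):
    print(round(intra_from_inter(pi, kap, j), 2))  # close to 0.786, 0.720, 0.497
```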

Besides the homogeneous model (model 2: κ12 = κ13 = κ23), possible parsimonious models constrain κ12 = κ13 (model 3) and additionally set κ23 to zero (model 4), giving a single-parameter model. The likelihood ratio tests for these models are summarized in Table IV. There does not appear to be much support for the assumption of homogeneity (χ²2 = 12.3, p = 0.002). The likelihood ratio tests give greater support to models 3 (χ²1 = 0.01, p = 0.93) and 4 (χ²2 = 0.095, p = 0.95) as simplified models. Similar conclusions might be drawn from the application of Akaike's information criterion (AIC), defined as −2 log-likelihood + 2 (number of parameters). This would tend to suggest model 4 as the most parsimonious, although the difference in fit compared with model 3 would be considered unimportant by the 'rule of thumb' that models differing by less than 2 can be regarded as having similar fit.
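The quoted p-values can be recovered from the χ² distribution (a sketch assuming scipy; the rounded G² statistics are used, so the model 3 value differs slightly from the published one, which was computed from the unrounded statistic):

```python
from scipy.stats import chi2

# G^2 statistics and degrees of freedom from Table IV, part (c).
lrt = {"model 2 (homogeneous)": (12.28, 2),
       "model 3 (k12 = k13)": (0.01, 1),
       "model 4 (k12 = k13, k23 = 0)": (0.095, 2)}
for name, (g2, df) in lrt.items():
    print(name, round(chi2.sf(g2, df), 3))
```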

Adopting a large-sample variance approach, a formal pair-wise comparison of κ̂23, κ̂12 and κ̂13 can be made using equation (18). After making a Bonferroni correction to the p-value for three pair-wise comparisons, a test of the null hypothesis H0: κ12 = κ23 gives z = 2.47 (p = 0.04) and of H0: κ13 = κ23 gives z = 2.33 (p = 0.06), confirming that the level of agreement is much lower for ('Atopy, no neurodermitis', 'Neurodermitis') than for the other two combinations of categories. Application of equation (17) to the hypothesis H0: κ23 = 0 gives z = −0.297, with a large-sample 95 per cent confidence interval (−0.802, 0.591), again suggesting difficulty in distinguishing 'Atopy, no neurodermitis' and 'Neurodermitis'. Several methods suggested in Section 2 have been applied here for illustrative purposes but, clearly, in a confirmatory setting, statistical testing should be restricted to the particular hypothesis of interest to avoid multiple testing artefacts.
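Equation (18) itself is not reproduced in this section; taking the z statistics quoted above as given, the Bonferroni-adjusted p-values can be recovered as follows (a sketch assuming scipy):

```python
from scipy.stats import norm

def bonferroni_p(z, m=3):
    """Two-sided normal p-value for a z statistic, Bonferroni-adjusted
    for m pairwise comparisons (capped at 1)."""
    return min(1.0, m * 2 * norm.sf(abs(z)))

print(round(bonferroni_p(2.47), 2))  # 0.04
print(round(bonferroni_p(2.33), 2))  # 0.06
```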

4.2. Example 2. Comparison of two populations: a twins study

In a second illustration the methods are applied to data on smoking behaviour from a study of twins presented by Hannah et al. [33], based on twin pairs registered with the Australian National Health and Medical Research Council Twin Registry, an example also used by Bartfay and Donner [18] to illustrate their method. Each individual was classified as 'never smoked', 'ex-smoker' or 'current smoker'. The data for male twin pairs are given in Table V, broken down by zygosity. The inter-class kappa coefficients are given together with the summary kappa coefficient κ̂.


Table V. Results from the Australian twin study (male subjects).

Frequencies

                          Mono-zygotic (MZ)            Di-zygotic (DZ)              Pooled (per cent)
                          N = 566                      N = 352                      N = 918
Smoking status category   Never(1)  Ex(2)  Current(3)  Never(1)  Ex(2)  Current(3)  Never(1)  Ex(2)     Current(3)
Never (1)                 221 (39)  —      —           121 (34)  —      —           342 (37)  —         —
Ex (2)                    80 (14)   74 (13) —          46 (13)   29 (8)  —          126 (14)  103 (11)  —
Current (3)               58 (10)   59 (10) 74 (13)    60 (17)   53 (15) 43 (12)    118 (13)  112 (12)  117 (13)
Marginal proportion, π̂j   0.51      0.25   0.23        0.49      0.22   0.28        0.51      0.24      0.25

Estimates (95 per cent C.I.) and tests of equality by zygosity

      MZ                   DZ                    Pooled               Hypothesis            z      p
κ̂12   0.46 (0.35, 0.56)    0.41 (0.26, 0.56)     0.44 (0.35, 0.53)    κ̂^MZ_12 = κ̂^DZ_12    0.52   0.601
κ̂13   0.57 (0.47, 0.67)    0.39 (0.26, 0.52)     0.50 (0.42, 0.58)    κ̂^MZ_13 = κ̂^DZ_13    2.14   0.032
κ̂23   0.12 (−0.08, 0.32)   −0.19 (−0.45, 0.07)   0.00 (−0.16, 0.16)   κ̂^MZ_23 = κ̂^DZ_23    1.89   0.058
κ̂     0.43 (0.37, 0.50)    0.27 (0.18, 0.35)     0.37 (0.31, 0.43)    κ̂^MZ = κ̂^DZ          3.14   0.002

Model comparisons (−2 log-likelihood, change, d.f., p, AIC; and constraints between groups)

Model 1 (full)
  MZ:     −2 log-lik 1862.0, AIC 1872.0
  DZ:     −2 log-lik 1184.2, AIC 1194.2
  Pooled: −2 log-lik 3057.4
  Between groups (κ̂^MZ_12 = κ̂^DZ_12, κ̂^MZ_13 = κ̂^DZ_13, κ̂^MZ_23 = κ̂^DZ_23): change 11.18, d.f. 3, p 0.011

Model 2 (κ̂12 = κ̂13 = κ̂23)
  MZ:     −2 log-lik 1878.3, change 16.31, d.f. 2, p 0.0003, AIC 1884.3
  DZ:     −2 log-lik 1201.5, change 17.26, d.f. 2, p 0.0002, AIC 1206.5
  Pooled: −2 log-lik 3089.7, change 32.22, d.f. 2, p <0.0001
  Between groups (κ̂^MZ = κ̂^DZ): change 9.83, d.f. 1, p 0.002

Model 3 (κ̂23 = 0)
  MZ:     −2 log-lik 1863.4, change 1.35, d.f. 1, p 0.246, AIC 1871.4
  DZ:     −2 log-lik 1186.5, change 2.26, d.f. 1, p 0.132, AIC 1194.5
  Pooled: −2 log-lik 3057.4, change 0.01, d.f. 1, p 0.939
  Between groups (κ̂^MZ_12 = κ̂^DZ_12, κ̂^MZ_13 = κ̂^DZ_13, κ̂^MZ_23 = κ̂^DZ_23 = 0): change 7.58, d.f. 2, p 0.023

Model 4 (κ̂23 = 0, κ̂12 = κ̂13)
  MZ:     −2 log-lik 1865.8, change 3.79, d.f. 2, p 0.150, AIC 1871.8
  DZ:     −2 log-lik 1186.5, change 2.27, d.f. 2, p 0.321, AIC 1192.5
  Pooled: −2 log-lik 3058.4, change 0.91, d.f. 2, p 0.633
  Between groups (κ̂^MZ_12 = κ̂^MZ_13 = κ̂^DZ_12 = κ̂^DZ_13, κ̂^MZ_23 = κ̂^DZ_23 = 0): change 6.03, d.f. 1, p 0.014


The point estimates suggest heterogeneity in the pattern of agreement, with κ12 and κ13 being larger than κ23. A likelihood ratio test would reject homogeneity (model 2) in both the MZ twin (χ²2 = 16.31, p = 0.0003) and DZ twin (χ²2 = 17.26, p = 0.0002) samples. Two additional models were fitted: first, constraining κ23 to be zero (model 3) and, second, adding the constraint κ12 = κ13 (model 4). Likelihood ratio statistics suggest either model 3 or model 4 as a simplified model for both MZ and DZ twins. Similar conclusions would be drawn by considering AIC, with model 3 having the smallest AIC for MZ twins and model 4 for DZ twins, although the AICs of models 1, 3 and 4 are similar by the 'rule of thumb' that differences of less than 2 are unimportant. One interpretation of models 3 and 4 is that there is agreement between twins in taking up smoking but not in ceasing to smoke. An alternative explanation might be measurement error in the ascertainment of smoking status, in that 'ex-smoker' and 'current smoker' may be more easily confused with each other than either is with 'never smoked'.

It should be noted that κ̂^DZ_23 is negative. The permissible values of κ_jk lie in the interval [−(1 − π_j)/π_j, 1] for π_j ≥ π_k, with a lower bound in this case of −2.57. Negative estimates are difficult to interpret, but we can give some meaning to negative values of κ_j and κ_jk by considering equations (5) and (6). Where κ_j is negative, Pr[(j, j)] is less than π_j²; hence, the probability of rating a particular subject as category j, conditional on having already rated that subject as category j, is less than the marginal probability π_j. Where κ_jk is negative, Pr[(j, k)] is greater than π_jπ_k, so that the probability of rating a particular subject as category j is increased by a previous rating of category k. In the context of twin studies, a negative value of κ_jk means that if one twin has trait j, then the co-twin has an increased probability of being classified as trait k. An alternative explanation for negative values may be sampling error, an argument that is supported by the width of the confidence interval in this case.
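The quoted lower bound follows directly from the interval above, evaluated at the larger of the two marginal proportions (here the DZ proportion 0.28 for 'current smoker'):

```python
def kappa_lower_bound(pi_j):
    """Lower bound -(1 - pi_j)/pi_j of the inter-class kappa,
    evaluated at the larger of the two marginal proportions."""
    return -(1 - pi_j) / pi_j

print(round(kappa_lower_bound(0.28), 2))  # -2.57
```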

In such a study, one is interested in comparing the level of agreement among MZ twins with that of DZ twins, with higher levels of agreement for the former interpreted as evidence of a genetic basis to the trait. This comparison can be made by a likelihood ratio test combining both samples with and without constraints (see Table V). As suggested in Section 2.4, the combined model is fitted while allowing the category proportions to vary between groups. The hypothesis that agreement for MZ twins is the same as that for DZ twins is rejected for all four models. Clearly, such multiple testing should be avoided; hence, based on the total AIC for model fitting in the separate groups above, one could perhaps have considered model 4 as the most appropriate before making the comparison by zygosity. Table V also gives the large-sample z-test comparison of H0: κ^MZ_jk = κ^DZ_jk for j ≠ k based on the large-sample standard errors. In this analysis, the null hypothesis would be rejected for κ13 (p = 0.032), with a similar trend for κ23 (p = 0.058), but not for κ12 (p = 0.601).
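The between-group likelihood ratio p-values in Table V can be recovered from the reported change statistics and degrees of freedom (a sketch assuming scipy):

```python
from scipy.stats import chi2

# Between-group likelihood ratio statistics (change, d.f.) from Table V.
between = {"model 1 (full)": (11.18, 3),
           "model 2 (homogeneous)": (9.83, 1),
           "model 3 (k23 = 0)": (7.58, 2),
           "model 4 (k23 = 0, k12 = k13)": (6.03, 1)}
for name, (change, df) in between.items():
    print(name, round(chi2.sf(change, df), 3))
```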

Kraemer [34] has pointed out that rejection of the null hypothesis does not establish a genetic basis of the trait, suggesting the use of an appropriate heritability coefficient with its confidence interval. One possibility might be to consider the coefficient h², defined as the proportion of the variance due to genetic variation [35]. This can be estimated by taking twice the difference between the intra-class correlation for MZ twins and the intra-class correlation for DZ twins, where one assumes common environmental and genetic sources of variance [35]. In this context, it might be defined for the relationship between traits j and k as h²_jk = 2(κ^MZ_jk − κ^DZ_jk). The values of the heritability coefficient with large-sample 95 per cent confidence intervals are h²12 = 0.10 (−0.27, 0.46), h²13 = 0.37 (0.03, 0.70) and h²23 = 0.63 (−0.02, 1.29), from which one might conclude that the strongest evidence for a genetic basis is for smoking rather than for ex-smoking. This conclusion is surprising as it appears to contradict the previous results, which tended to suggest the converse. It should be noted that the coefficient h² was originally derived from variance components, so that it is unreasonable to apply it where one or both of its constituent coefficients is negative. The large value for h²23 can therefore be disregarded, as κ^DZ_23 is negative.

5. DISCUSSION

Bartfay and Donner [17, 18] proposed two models for estimation of a single kappa coefficient,the first of which can be compared against a more general model allowing heterogeneity as a testof homogeneity. In both illustrative data sets heterogeneity in the pattern of agreement has beenobserved. It has also been shown that parsimonious models that could give additional insight intothe pattern of agreement can be fitted. The second homogeneous model suggested by Bartfay andDonner [17, 18], in which off-diagonal cells are pooled in order to construct a goodness-of-fitstatistic, cannot be compared with a heterogeneous model using a likelihood ratio test, whichwould seem to be a limitation of this model.

While assuming homogeneity may improve efficiency [19], there are, as discussed in the Introduction, good theoretical reasons to believe that heterogeneity will occur. An additional explanation relates to the mathematical properties of kappa-type coefficients themselves. It is well known that kappa coefficients depend on prevalence [36, 37], with lower values being obtained when the prevalence of a trait approaches zero or one. Hence, there is no reason to expect inter-class kappa coefficients for different pairs of categories to be the same when the marginal proportions differ. Heterogeneity in the pattern of agreement, as measured by this type of coefficient, could perhaps be considered the norm when category proportions differ.
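The prevalence dependence is easy to demonstrate with a hypothetical latent-trait model (an illustration, not taken from the paper): hold the misclassification rate of each rating fixed and vary the prevalence of a binary trait.

```python
def kappa_binary(prev, err):
    """Intra-class kappa for two ratings of a binary trait under a simple
    latent-trait sketch: each rating is correct with probability 1 - err."""
    p_agree = (1 - err) ** 2 + err ** 2          # Pr[raters agree]; free of prevalence
    pi1 = prev * (1 - err) + (1 - prev) * err    # marginal Pr[rated positive]
    p_chance = pi1 ** 2 + (1 - pi1) ** 2
    return (p_agree - p_chance) / (1 - p_chance)

print(round(kappa_binary(0.50, 0.1), 2))  # 0.64
print(round(kappa_binary(0.05, 0.1), 2))  # 0.25
```

With the same 10 per cent error rate, kappa falls sharply as the prevalence moves from 0.5 towards zero.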

It might be argued that a coefficient based on the assumption of homogeneity may be of limited value in the presence of heterogeneity. One might also suggest that the use of the homogeneous model should be accompanied by a test of heterogeneity. Unfortunately, it has been seen that such a test has reasonable power to detect heterogeneity only when the sample size is large or the heterogeneity is substantial. The methods described here are therefore of limited value in reliability sub-studies conducted within clinical trials and epidemiological studies, as these tend to have small sample sizes. Theoretical arguments rather than empirical evidence may therefore be needed to justify the assumption of homogeneity in such studies. This may not be too important, as such studies are generally concerned with quality control of established measures in the context of the main study, so a single summary measure may be sufficient. In scale development, where an in-depth analysis is required, the issue is more important and a large sample size is justified. The need to formally test for heterogeneity therefore depends on the context and objectives of the study.

In considering the methods to compare coefficients between independent groups, it was suggested that marginal proportions should be allowed to vary between groups. In Example 2, the proportions of subjects allocated to each of the three categories are similar for MZ and DZ twins. Had the proportions differed, a comparison of the two populations using this type of coefficient would be difficult to interpret, particularly if the smaller coefficients for DZ twins occurred for combinations of categories with lower relative frequency in DZ twins than in MZ twins. In a study comparing the reliability of a measure in two populations or strata, the interpretation might be simpler. Kraemer [4] argued that reliability measures for a scale should be functions of both the measurement system and the population of subjects to which the measure is applied. Since changes in prevalence imply changes in a population, it would be unreasonable to expect a measure to have the same reliability when the prevalence of the trait differed. Lower values of inter-class kappa coefficients for a category that was less frequent in a particular group could be interpreted as the scale being less reliable in that aspect for that sub-population. Awareness of this difference could be important for a researcher planning a study in either or both populations, so formal comparison of kappa-type coefficients between populations in which the prevalence of traits differs can be justified in such contexts.

The two examples presented here involved scales with three categories. The number of second-order moments rapidly increases as the number of categories in a scale increases, with six moments for a four-category scale and 10 for five categories. This is not a problem for the full model, as the closed-form estimates can be used, but where constraints are added, non-convergence may be more likely during numerical likelihood maximization due to the boundary value problems. To deal with this, a transformation of the form

$$\tau_{jk} = \log(1 - \kappa_{jk})$$

could be used together with a logit transformation of the marginal proportions,

$$\gamma_j = \log\left(\frac{\pi_j}{1 - \pi_j}\right)$$

Maximization would then be carried out using τ_jk and γ_j instead of κ_jk and π_j.

The analysis and methods described here have been for two ratings per subject or for twin data.

Where the number of ratings is greater than 2, inter-class kappa coefficients can easily be estimated for the full model and balanced data using closed-form maximum likelihood [22]. If constraints are added to the model, maximum likelihood estimates could, at least in theory, be obtained, but moment terms up to the order of the number of ratings need to be considered. For a scale with three categories and three ratings of each subject, there are four third-order moments to consider. This means that extension of the methods described here to more than two ratings is likely to require assumptions regarding higher-order moments to facilitate numerical estimation.
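The transformation described above can be sketched for two ratings by fitting the homogeneous model to the Example 1 data over unconstrained parameters (a sketch assuming numpy and scipy; the likelihood is taken over the six unordered cells, which reproduces the −2 log-likelihood of 528.70 for the full model):

```python
import numpy as np
from scipy.optimize import minimize

# Observed counts from Table IV (Example 1).
counts = np.array([[136, 12, 1],
                   [8, 59, 4],
                   [2, 4, 6]])
n_diag = np.diag(counts)
m_off = counts + counts.T          # unordered pair counts m_jk, j < k

def neg_log_lik(theta):
    """Homogeneous-model likelihood over unordered cells, with the
    unconstrained parameterization: two logits for pi, tau = log(1 - kappa)."""
    z = np.array([theta[0], theta[1], 0.0])
    pi = np.exp(z) / np.exp(z).sum()
    kappa = 1.0 - np.exp(theta[2])
    ll = sum(n_diag[j] * np.log(pi[j] ** 2 + pi[j] * (1 - pi[j]) * kappa)
             for j in range(3))
    ll += sum(m_off[j, k] * np.log(2 * pi[j] * pi[k] * (1 - kappa))
              for j in range(3) for k in range(j + 1, 3))
    return -ll

res = minimize(neg_log_lik, x0=np.zeros(3), method="Nelder-Mead",
               options={"xatol": 1e-10, "fatol": 1e-12, "maxiter": 10000})
kappa_hat = 1.0 - np.exp(res.x[2])
print(round(kappa_hat, 3), round(2 * res.fun, 2))  # ≈ 0.728 and 540.98
```

The fitted values agree with model 2 of Table IV (κ̂ = 0.728, −2 log-likelihood 540.98), with no boundary problems during maximization.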

This paper has considered kappa-type coefficients that assume exchangeability. In Example 2, the assumption can be justified on good theoretical grounds. In Example 1, the assessments were made at different time points, so that exchangeability is less plausible on theoretical grounds, but the marginal probabilities were very similar. Where there is evidence of marginal heterogeneity, the application of the method may be less justified. An alternative method for the assessment of agreement in categorical data is the use of the log-linear models described by Agresti [23] and Becker and Agresti [24]. These allow marginal heterogeneity and would therefore be more suitable where exchangeability of rating pairs cannot be assumed. The log-linear models currently developed are perhaps more suited to ordinal rather than nominal scale data, and models do not appear to have been developed to consider agreement between pairs of categories. Guggenmoos-Holzmann and Vonk [26] describe a latent class method for the examination of agreement and reliability in nominal scale data. This partitions the probability of assignment to a category into probabilities of fortuitous and systematic assignment to each category. In that sense, it is similar to considering the kappa coefficient for each category rather than examining agreement in relation to pairs of categories. Where interest is in the latter, there does not currently appear to be an approach that allows marginal heterogeneity. What is more, in the absence of a more flexible model that allows marginal heterogeneity, there is no method of systematically evaluating the bias caused by assuming marginal homogeneity in its presence. As in other areas of statistics, there may be a need to compromise between the assumptions of the method and the question to be addressed when considering the choice of method of analysis. An alternative approach would be to consider the issue at the design stage. Design choices can be made that make the assumption of exchangeability more tenable. Reliability is often assessed by just two raters who assess all subjects, so that systematic bias between raters can cause marginal heterogeneity. An alternative might be to sample pairs of raters independently for each subject from a large panel, giving data more consistent with the exchangeability assumption.

ACKNOWLEDGEMENTS

The author is very grateful to two anonymous referees for their helpful suggestions and constructivecomments on an earlier version of the article.

REFERENCES

1. Cohen JA. A coefficient of agreement for nominal scales. Educational and Psychological Measurement 1960; 20:37–46.
2. Fleiss JL. Measuring nominal scale agreement among many raters. Psychological Bulletin 1971; 76:378–383.
3. Bloch DA, Kraemer HC. 2×2 kappa coefficients: measures of agreement and association. Biometrics 1989; 45:269–288.
4. Kraemer HC. Ramifications of a population model for kappa as a coefficient of reliability. Psychometrika 1979; 44:461–472.
5. Garner JB. The standard error of Cohen's kappa. Statistics in Medicine 1991; 10:767–775.
6. Hale CA, Fleiss JL. Interval estimation under two study designs for kappa with binary classifications. Biometrics 1993; 49:523–524.
7. Donner A, Eliasziw M. A goodness-of-fit approach to inference procedures for the kappa statistic: confidence interval construction, significance-testing and sample size estimation. Statistics in Medicine 1992; 11:1511–1519.
8. Basu S, Basu A. Comparison of several goodness-of-fit tests for the kappa statistic based on exact power and coverage probability. Statistics in Medicine 1995; 14:347–356.
9. Nam J. Interval estimation of the kappa coefficient with binary classification and an equal marginal probability model. Biometrics 2000; 56:583–585.
10. Blackman NJM, Koval JJ. Interval estimation for Cohen's kappa as a measure of agreement. Statistics in Medicine 2000; 19:723–741.
11. Nam J. Testing the intraclass version of kappa coefficient of agreement with binary scale and sample size determination. Biometrical Journal 2002; 44:558–570.
12. Mekibib A, Donner A, Klar N. Inference procedures for assessing interobserver agreement among multiple raters. Biometrics 2001; 57:584–588.
13. Donner A, Klar N. The statistical analysis of kappa statistics in multiple samples. Journal of Clinical Epidemiology 1996; 49:1053–1058.
14. Donner A, Eliasziw M, Klar N. Testing the homogeneity of kappa statistics. Biometrics 1996; 52:176–183.
15. Nam J. Homogeneity score test for the intraclass version of the kappa statistic and sample-size determination in multiple or stratified studies. Biometrics 2003; 59:1027–1035.
16. Roberts C, McNamee R. Assessing the reliability of ordered categorical scales using kappa-type statistics. Statistical Methods in Medical Research 2005; 14:493–514.
17. Bartfay E, Donner A. Statistical inferences for interobserver agreement studies with nominal data. The Statistician 2001; 50:135–146.
18. Bartfay E, Donner A. Statistical inferences for a twin correlation with multinomial outcomes. Statistics in Medicine 2001; 20:249–262.
19. Bartfay E, Donner A. The effect of collapsing multinomial data when assessing agreement. International Journal of Epidemiology 2000; 29:1070–1075.
20. Bartfay E, Donner A, Klar N. Testing equality of twin correlations in multinomial outcomes. Annals of Human Genetics 1999; 63:341–349.
21. Landis JR, Koch GG. A one-way components of variance model for categorical data. Biometrics 1977; 33:671–679.
22. Roberts C, McNamee R. A matrix of kappa-type coefficients to assess the reliability of nominal scales. Statistics in Medicine 1998; 17:471–488.
23. Agresti A. A model for agreement between ratings on an ordinal scale. Biometrics 1988; 44:539–548.
24. Becker MP, Agresti A. Log-linear modelling of pairwise interobserver agreement on a categorical scale. Statistics in Medicine 1992; 11:101–114.
25. Agresti A. Modelling patterns of agreement and disagreement. Statistical Methods in Medical Research 1992; 1:201–218.
26. Guggenmoos-Holzmann I, Vonk R. Kappa-like indices of observer agreement viewed from a latent class perspective. Statistics in Medicine 1998; 17:797–812.
27. Gould W, Pitblado J, Sribney W. Maximum Likelihood Estimation with Stata (2nd edn). Stata Press: College Station, TX, 2003.
28. StataCorp. Stata Statistical Software: Release 9.2. Stata Corporation: College Station, TX, 2005.
29. Microsoft Office Excel, part of Microsoft Office Professional 2003. © 1985–2003 Microsoft Corporation, U.S.A.
30. Thompson WD, Walter SD. A reappraisal of the kappa statistic. Journal of Clinical Epidemiology 1988; 41:949–958.
31. Agresti A. Categorical Data Analysis (2nd edn). Wiley: New York, 2002; 396.
32. James IR. Analysis of nonagreement amongst multiple raters. Biometrics 1983; 39:651–657.
33. Hannah MC, Hopper JH, Mathews JD. Twin concordance for a binary trait. II. Nested analysis of ever-smoking and ex-smoking traits and unnested analysis of a 'committed-smoking' trait. Acta Geneticae Gemellologiae 1985; 37:153–154.
34. Kraemer HC. What is the right statistical measure of twin concordance (or diagnostic reliability and validity)? Archives of General Psychiatry 1997; 54:1121–1124.
35. Christian JC, Williams CJ. Comparison of analysis of variance and likelihood models of twin data analysis. In Advances in Twin and Sib-pair Studies, Spector TD, Snieder H, MacGregor AJ (eds). Oxford University Press: London, 2000.
36. Spitznagel EL, Helzer JE. A proposed solution to the base rate problem in the kappa statistic. Archives of General Psychiatry 1985; 42:725–728.
37. Gjorup T. The kappa coefficient and the prevalence of a diagnosis. Methods of Information in Medicine 1988; 27:184–186.
