Rater Reliability


Estimate the reliability between or among judges separately for each scale (one estimate each for country of study's origin, effect size, percentage of males, etc.).

    What Intraclass Correlation to Use?

There are two main types of intraclass correlations: Case 2 (ICC(2)), which is the random case, and Case 3 (ICC(3)), which is the fixed case. In Case 3, the judges and studies are crossed, meaning that for a given scale, each judge codes all studies. So if in my meta-analysis, Jim and Joan code all 25 studies for effect size, then studies and people are completely crossed and Case 3, the fixed case, applies. In Case 2, the judges and studies are not crossed. For all 25 studies, you have two people code each study, but different people code different studies. For example, Jim and Joan code studies 1-5, Jim and Steve code studies 6-10, Joan and Steve code studies 11-15, and so forth. If so, studies are nested in raters, and Case 2, the random case, applies; a schematic appears below.
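To make the distinction concrete, here is a schematic layout (my illustration; the assignments are hypothetical) for six studies, where an x means that the judge codes that study. In the crossed design (Case 3), every judge codes every study. A not-crossed design (Case 2) might look like this:

Study  Jim  Joan  Steve
1       x    x
2       x    x
3       x          x
4       x          x
5            x     x
6            x     x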

Another way to collect data is to have different numbers of people code each study: Joan codes study 1, Jim and Steve code study 2, and Joan, Jim, and Steve all code study 3. Avoid collecting data like that. If you do, some studies will be coded more reliably than others, and no single number will accurately estimate the reliability of the data.

The proper intraclass correlation to use depends on how you collect the data during your study. If, for the scale in question, the same people rate each and every study, then use ICC(3). If different people code different studies for a scale, then use ICC(2). If different people code different scales, but the same people code the same scales across studies, you can still use ICC(3), because you report reliability by scale.

    Collecting Data to Estimate Reliability Before a Full Study

Regardless of how you collect data for your whole study, however, I recommend that you estimate the whole-study reliability by collecting data on a subsample of studies in which all the judges and studies are completely crossed. If you do that, you will have the data you need to estimate both ICC(2) and ICC(3) the way I show you here. It doesn't work the other way: if you collect nested data, you have to use some ugly models and most likely hire a statistician to figure out how to get the right estimates (if you are interested or just feeling lucky, models for more complicated data collection designs are developed in Brennan, 1992; Cronbach et al., 1972; Crocker & Algina, 1986).

The simple models that I present here were developed by Shrout & Fleiss (1979). They assume that the data were collected in a design in which the judges and studies (targets) are completely crossed. I will show you

1. how to estimate the reliability of a single judge in the fixed and random conditions,

2. how to estimate the reliability of any number of judges (e.g., two) given the reliability of a single judge, and

3. how to estimate the number of judges required to attain any desired level of reliability for the scale.


    Illustrative Example

    Scenario

Jim, Joe, and Sue have gathered data for a meta-analysis of the effects of classical music on plant growth. They have drawn a random sample of five of their studies, and each of them has rated the rigor of the same five studies. Their ratings are reproduced below.

Study  Jim  Joe  Sue
1      2    3    1
2      3    2    2
3      4    3    3
4      5    4    4
5      5    5    3

If we use SAS GLM, we can specify the model in which rating is a function of rater, study, and their interaction.

[Technical note: In doing so, we have what is essentially an ANOVA model with one observation per cell. In such a design, the error and interaction terms are not separately estimable. If there is good reason to believe that there is an interaction between raters and targets (e.g., Olympic figure skating judges' ratings of skaters from their own country), then the entire design should be replicated to allow a within-cell error term.]


    SAS Input

    data rel1;

    input rating rater target;

    *************************************************

    *Rating is the variable that is each judge's

*evaluation of the study's rigor.
*Jim is rater 1, Joe is 2, and Sue is 3.

    *Target is the study number.

    *************************************************; 

    cards;

    2 1 1

    3 1 2

    4 1 3

    5 1 4

    5 1 5

    3 2 1

    2 2 2

    3 2 3

    4 2 4

    5 2 5

    1 3 1

    2 3 2

    3 3 3

    4 3 4

    3 3 5

    ;

    *proc print; 

     proc glm ;

    class rater target;

    model rating = rater target rater*target;

    run;

Note that rater*target is the interaction term, and we will use that as the error term under the assumption that the interaction is negligible.
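A small variation (my sketch, not part of the original handout): with one observation per cell, you can simply omit the interaction from the model statement, and the residual (Error) mean square from the main-effects model equals the rater*target mean square from the full model.

proc glm data=rel1;
class rater target;
model rating = rater target;
* The Error mean square from this run equals the;
* rater*target mean square from the full model above;
run;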


    SAS Output

    The GLM Procedure

    Class Level Information

    Class Levels Values

    rater 3 1 2 3

    target 5 1 2 3 4 5

    Number of observations 15

    The GLM Procedure

    Dependent Variable: rating

    Sum of

    Source DF Squares Mean Square F Value Pr > F

    Model 14 20.93333333 1.49523810 . .

    Error 0 0.00000000 .

    Corrected Total 14 20.93333333

    R-Square Coeff Var Root MSE rating Mean

    1.000000 . . 3.266667

    Source DF Type I SS Mean Square F Value Pr > F

    rater 2 3.73333333 1.86666667 . .

    target 4 14.26666667 3.56666667 . .

    rater*target 8 2.93333333 0.36666667 . .

    Source DF Type III SS Mean Square F Value Pr > F

    rater 2 3.73333333 1.86666667 . .

    target 4 14.26666667 3.56666667 . .

    rater*target 8 2.93333333 0.36666667 . .

What we want from the output are the Type III mean squares (in this case, 1.87 for rater, 3.57 for target, and .37 for rater*target).


Now to compute the estimates.

Reliability of one random rater, ICC(2,1):

$$ICC(2,1) = \frac{BMS - EMS}{BMS + (k-1)EMS + k(JMS - EMS)/n} = \frac{3.57 - .37}{3.57 + (3-1)(.37) + 3(1.87 - .37)/5} = .61$$

Reliability of one fixed rater, ICC(3,1):

$$ICC(3,1) = \frac{BMS - EMS}{BMS + (k-1)EMS} = \frac{3.57 - .37}{3.57 + (3-1)(.37)} = .74$$

Note. BMS = mean square for targets (studies); JMS = mean square for raters (judges); EMS = mean square for rater*target; k = number of raters; n = number of targets (studies).
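As an arithmetic check, a short SAS data step (my addition; the data set and variable names are arbitrary) computes both estimates from the Type III mean squares:

data icc;
bms = 3.56666667;  * mean square for targets (studies);
jms = 1.86666667;  * mean square for raters (judges);
ems = 0.36666667;  * mean square for rater*target;
k = 3;             * number of raters;
n = 5;             * number of targets;
icc21 = (bms - ems) / (bms + (k - 1)*ems + k*(jms - ems)/n);
icc31 = (bms - ems) / (bms + (k - 1)*ems);
put icc21= icc31=; * writes roughly icc21=0.615 icc31=0.744 to the log;
run;

(Using the unrounded mean squares gives .615 and .744; the rounded values used above give .61 and .74.)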

Notice that the reliability of one random rater is less than the reliability of one fixed rater. This is because mean differences among raters reduce reliability when raters are random but not when raters are fixed. The reliability estimate of the one random rater is a bit low to use in practice.

We can use the estimate of the reliability of one rater to estimate the reliability of any number of raters using the Spearman-Brown prophecy formula. We can use a variation on the theme to estimate the number of raters we need to achieve any desired reliability.

Suppose we want to know what the reliability will be if we have two raters. The general form of the Spearman-Brown is:

$$\rho_{CC'} = \frac{k\,\rho_{ii}}{1 + (k-1)\rho_{ii}}$$

If we move to two raters, then k is two and $\rho_{ii}$ is either ICC(2,1) or ICC(3,1). For two random judges, we have:

$$\rho_{CC'} = \frac{2(.61)}{1 + .61} = .76$$

    For two fixed judges, we have:

$$\rho_{CC'} = \frac{2(.74)}{1 + .74} = .85$$

    For both estimates, we see that the average of two judges’ ratings will be more reliable

    than will one judge’s ratings. Fixed judges are still more reliable than random judges.
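To see how the gain accumulates, here is a short SAS sketch (mine, with arbitrary names) that applies the Spearman-Brown formula for one through six judges in both cases:

data sb;
icc21 = .61;  * reliability of one random rater;
icc31 = .74;  * reliability of one fixed rater;
do k = 1 to 6;
rel_random = k*icc21 / (1 + (k - 1)*icc21);
rel_fixed = k*icc31 / (1 + (k - 1)*icc31);
output;
end;
keep k rel_random rel_fixed;
run;

proc print data=sb noobs;
run;

At k = 2, this reproduces the .76 and .85 computed above.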


Suppose we want to achieve a reliability of .90. Then we can use a variant of the Spearman-Brown that looks like this:

$$m = \frac{\rho^{*}(1 - \rho_{L})}{\rho_{L}(1 - \rho^{*})}$$

where m is an integer formed by rounding up, $\rho^{*}$ is our aspiration level, and $\rho_{L}$ is our lower estimate, either ICC(2,1) or ICC(3,1). In our example, for the random case, we have:

$$m = \frac{.9(1 - .61)}{.61(1 - .90)} = 5.75,$$

or 6 when rounded up.

    For fixed judges, we have

$$m = \frac{.9(1 - .74)}{.74(1 - .90)} = 3.16,$$

or 4 when we round up.
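The rounding-up calculation is easy to script as well; here is a sketch of mine (not part of the handout) using the SAS ceil function:

data m;
rho_star = .90;  * desired reliability;
icc21 = .61;     * reliability of one random rater;
icc31 = .74;     * reliability of one fixed rater;
m_random = ceil(rho_star*(1 - icc21) / (icc21*(1 - rho_star)));
m_fixed = ceil(rho_star*(1 - icc31) / (icc31*(1 - rho_star)));
put m_random= m_fixed=;  * writes m_random=6 m_fixed=4 to the log;
run;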


    References

Brennan, R. L. (1992). Elements of generalizability theory. Iowa City, IA: ACT Publications.

Crocker, L., & Algina, J. (1986). Introduction to classical and modern test theory. New York: Holt, Rinehart & Winston.

Cronbach, L. J., Gleser, G. C., Nanda, H., & Rajaratnam, N. (1972). The dependability of behavioral measurements. New York: Wiley.

Shrout, P. E., & Fleiss, J. L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86, 420-428.