Rater Reliability


Estimate the reliability between or among judges separately for each scale (one estimate each for country of study's origin, effect size, percentage of males, etc.).

    What Intraclass Correlation to Use?

There are two main types of intraclass correlations: Case 2 (ICC(2)), which is the random case, and Case 3 (ICC(3)), which is the fixed case. In Case 3, the judges and studies are crossed, meaning that for a given scale, each judge codes all studies. So if in my meta-analysis, Jim and Joan code all 25 studies for effect size, then studies and people are completely crossed and Case 3, the fixed case, applies. In Case 2, the judges and studies are not crossed. For all 25 studies, you have two people code each study, but different people code different studies. For example, Jim and Joan code studies 1-5, Jim and Steve code studies 6-10, Joan and Steve code studies 11-15, and so forth. If so, studies are nested in raters, and Case 2, the random case, applies; a schematic appears below.
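To make the distinction concrete, here is a schematic layout (my illustration; the assignments are hypothetical) for six studies, where an x means that the judge codes that study. In the crossed design (Case 3), every judge codes every study. A not-crossed design (Case 2) might look like this:

Study  Jim  Joan  Steve
1       x    x
2       x    x
3       x          x
4       x          x
5            x     x
6            x     x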

Another way to collect data is to have different numbers of people code each study: Joan codes study 1, Jim and Steve code study 2, and Joan, Jim, and Steve all code study 3. Avoid collecting data like that. If you do, some studies will be coded more reliably than others, and no single number will accurately estimate the reliability of the data.

The proper intraclass correlation to use depends on how you collect the data during your study. If, for the scale in question, the same people rate each and every study, then use ICC(3). If different people code different studies for a scale, then use ICC(2). If different people code different scales, but the same people code the same scales across studies, you can still use ICC(3), because you report reliability by scale.

    Collecting Data to Estimate Reliability Before a Full Study

Regardless of how you collect data for your whole study, however, I recommend that you estimate the whole-study reliability by collecting data on a subsample of studies in which all the judges and studies are completely crossed. If you do that, you will have the data you need to estimate both ICC(2) and ICC(3) the way I show you here. It doesn't work the other way: if you collect nested data, you have to use some ugly models and most likely hire a statistician to figure out how to get the right estimates (if you are interested or just feeling lucky, models for more complicated data collection designs are developed in Brennan, 1992; Cronbach et al., 1972; Crocker & Algina, 1986).

The simple models that I present here were developed by Shrout & Fleiss (1979). They assume that the data were collected in a design in which the judges and studies (targets) are completely crossed. I will show you

1. how to estimate the reliability of a single judge in the fixed and random conditions,

2. how to estimate the reliability of any number of judges (e.g., two) given the reliability of a single judge, and

3. how to estimate the number of judges required to attain any desired level of reliability for the scale.


    Illustrative Example

    Scenario

Jim, Joe, and Sue have gathered data for a meta-analysis of the effects of classical music on plant growth. They have drawn a random sample of five of their studies, and each of them has rated the rigor of the same five studies. Their ratings are reproduced below.

Study  Jim  Joe  Sue
1      2    3    1
2      3    2    2
3      4    3    3
4      5    4    4
5      5    5    3

If we use SAS GLM, we can specify the model in which rating is a function of rater, study, and their interaction.

[Technical note: In doing so, we have what is essentially an ANOVA model with one observation per cell. In such a design, the error and interaction terms are not separately estimable. If there is good reason to believe that there is an interaction between raters and targets (e.g., Olympic figure skating judges' ratings of skaters from their own country), then the entire design should be replicated to allow a within-cell error term.]


    SAS Input

    data rel1;

    input rating rater target;

    *************************************************

    *Rating is the variable that is each judge's

*evaluation of the study's rigor.
*Jim is rater 1, Joe is 2, and Sue is 3.

    *Target is the study number.

    *************************************************; 

    cards;

    2 1 1

    3 1 2

    4 1 3

    5 1 4

    5 1 5

    3 2 1

    2 2 2

    3 2 3

    4 2 4

    5 2 5

    1 3 1

    2 3 2

    3 3 3

    4 3 4

    3 3 5

    ;

    *proc print; 

     proc glm ;

    class rater target;

    model rating = rater target rater*target;

    run;

Note that rater*target is the interaction term, and we will use that as the error term under the assumption that the interaction is negligible.
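A small variation (my sketch, not part of the original handout): with one observation per cell, you can simply omit the interaction from the model statement, and the residual (Error) mean square from the main-effects model equals the rater*target mean square from the full model.

proc glm data=rel1;
class rater target;
model rating = rater target;
* The Error mean square from this run equals the;
* rater*target mean square from the full model above;
run;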


    SAS Output

    The GLM Procedure

    Class Level Information

    Class Levels Values

    rater 3 1 2 3

    target 5 1 2 3 4 5

    Number of observations 15

    The GLM Procedure

    Dependent Variable: rating

    Sum of

    Source DF Squares Mean Square F Value Pr > F

    Model 14 20.93333333 1.49523810 . .

    Error 0 0.00000000 .

    Corrected Total 14 20.93333333

    R-Square Coeff Var Root MSE rating Mean

    1.000000 . . 3.266667

    Source DF Type I SS Mean Square F Value Pr > F

    rater 2 3.73333333 1.86666667 . .

    target 4 14.26666667 3.56666667 . .

    rater*target 8 2.93333333 0.36666667 . .

    Source DF Type III SS Mean Square F Value Pr > F

    rater 2 3.73333333 1.86666667 . .

    target 4 14.26666667 3.56666667 . .

    rater*target 8 2.93333333 0.36666667 . .

What we want from the output are the Type III mean squares (in this case, 1.87 for rater, 3.57 for target, and .37 for rater*target).


Now to compute the estimates.

Reliability of one random rater, ICC(2,1):

$$ICC(2,1) = \frac{BMS - EMS}{BMS + (k-1)EMS + k(JMS - EMS)/n} = \frac{3.57 - .37}{3.57 + (3-1)(.37) + 3(1.87 - .37)/5} = .61$$

Reliability of one fixed rater, ICC(3,1):

$$ICC(3,1) = \frac{BMS - EMS}{BMS + (k-1)EMS} = \frac{3.57 - .37}{3.57 + (3-1)(.37)} = .74$$

Note. BMS = mean square for targets (studies); JMS = mean square for raters (judges); EMS = mean square for rater*target; k = number of raters; n = number of targets (studies).
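As an arithmetic check, a short SAS data step (my addition; the data set and variable names are arbitrary) computes both estimates from the Type III mean squares:

data icc;
bms = 3.56666667;  * mean square for targets (studies);
jms = 1.86666667;  * mean square for raters (judges);
ems = 0.36666667;  * mean square for rater*target;
k = 3;             * number of raters;
n = 5;             * number of targets;
icc21 = (bms - ems) / (bms + (k - 1)*ems + k*(jms - ems)/n);
icc31 = (bms - ems) / (bms + (k - 1)*ems);
put icc21= icc31=; * writes roughly icc21=0.615 icc31=0.744 to the log;
run;

(Using the unrounded mean squares gives .615 and .744; the rounded values used above give .61 and .74.)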

Notice that the reliability of one random rater is less than the reliability of one fixed rater. This is because mean differences among raters reduce reliability when raters are random but not when raters are fixed. The reliability estimate of the one random rater is a bit low to use in practice.

We can use the estimate of the reliability of one rater to estimate the reliability of any number of raters using the Spearman-Brown prophecy formula. We can use a variation on the theme to estimate the number of raters we need to achieve any desired reliability.

Suppose we want to know what the reliability will be if we have two raters. The general form of the Spearman-Brown is:

$$\rho_{CC'} = \frac{k\,\rho_{ii}}{1 + (k-1)\rho_{ii}}$$

If we move to two raters, then k is two and $\rho_{ii}$ is either ICC(2,1) or ICC(3,1). For two random judges, we have:

$$\rho_{CC'} = \frac{2(.61)}{1 + .61} = .76$$

    For two fixed judges, we have:

$$\rho_{CC'} = \frac{2(.74)}{1 + .74} = .85$$

    For both estimates, we see that the average of two judges’ ratings will be more reliable

    than will one judge’s ratings. Fixed judges are still more reliable than random judges.
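To see how the gain accumulates, here is a short SAS sketch (mine, with arbitrary names) that applies the Spearman-Brown formula for one through six judges in both cases:

data sb;
icc21 = .61;  * reliability of one random rater;
icc31 = .74;  * reliability of one fixed rater;
do k = 1 to 6;
rel_random = k*icc21 / (1 + (k - 1)*icc21);
rel_fixed = k*icc31 / (1 + (k - 1)*icc31);
output;
end;
keep k rel_random rel_fixed;
run;

proc print data=sb noobs;
run;

At k = 2, this reproduces the .76 and .85 computed above.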


Suppose we want to achieve a reliability of .90. Then we can use a variant of the Spearman-Brown that looks like this:

$$m = \frac{\rho^{*}(1 - \rho_{L})}{\rho_{L}(1 - \rho^{*})}$$

where m is an integer formed by rounding up, $\rho^{*}$ is our aspiration level, and $\rho_{L}$ is our lower estimate, either ICC(2,1) or ICC(3,1). In our example, for the random case, we have:

$$m = \frac{.9(1 - .61)}{.61(1 - .90)} = 5.75,$$

or 6 when rounded up.

    For fixed judges, we have

$$m = \frac{.9(1 - .74)}{.74(1 - .90)} = 3.16,$$

or 4 when we round up.
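The rounding-up calculation is easy to script as well; here is a sketch of mine (not part of the handout) using the SAS ceil function:

data m;
rho_star = .90;  * desired reliability;
icc21 = .61;     * reliability of one random rater;
icc31 = .74;     * reliability of one fixed rater;
m_random = ceil(rho_star*(1 - icc21) / (icc21*(1 - rho_star)));
m_fixed = ceil(rho_star*(1 - icc31) / (icc31*(1 - rho_star)));
put m_random= m_fixed=;  * writes m_random=6 m_fixed=4 to the log;
run;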


    References

Brennan, R. L. (1992). Elements of generalizability theory. Iowa City, IA: ACT Publications.

Crocker, L., & Algina, J. (1986). Introduction to classical and modern test theory. New York: Holt, Rinehart & Winston.

Cronbach, L. J., Gleser, G. C., Nanda, H., & Rajaratnam, N. (1972). The dependability of behavioral measurements. New York: Wiley.

Shrout, P. E., & Fleiss, J. L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86, 420-428.