
Practical Resources for Assessing and Reporting Intercoder Reliability in Content Analysis Research Projects

Matthew Lombard, Temple University, Philadelphia, PA, USA
Jennifer Snyder-Duch, Carlow College, Pittsburgh, PA, USA
Cheryl Campanella Bracken, Cleveland State University, Cleveland, OH, USA

Last updated on October 23, 2004

IMPORTANT NOTES: This document is, by design, always a work in progress. If you have information you believe should be added or updated, please e-mail the authors. The authors have attempted to avoid violations of copyright and inappropriate use of and references to materials; if you identify a violation or related problem, please notify the authors. Links to locations outside this document will open in a new browser window.

    Quick links:

1. What is this site about, who created it, and why?
2. What is intercoder reliability?
3. Why should content analysis researchers care about intercoder reliability?
4. How should content analysis researchers properly assess and report intercoder reliability?
5. Which measure(s) of intercoder reliability should researchers use?
6. How should researchers calculate intercoder reliability? What software is available?
7. How should content analysts set up their reliability data for use with reliability software?
8. How does each software tool for intercoder reliability work?
9. What other issues and potential problems should content analysts know about?
10. Where can content analysts find more information about intercoder reliability?
11. References and bibliography

1. What is this site about, who created it, and why? [TOP]

In Lombard, Snyder-Duch, and Bracken (2002), we reviewed the literature regarding intercoder reliability in content analysis research and reported a study that characterized the assessment and reporting of reliability in 200 studies in the mass communication literature between 1994 and 1998. Consistent with earlier examinations (Pasadeos, Huhman, Standley, & Wilson, 1995; Riffe & Freitag, 1997), we found that only 69% (n = 137) of the articles contained any report of intercoder reliability; even in these articles, few details were provided in the average 4.5 sentences devoted to reliability, and important information was ambiguously reported, not reported, or represented inappropriate decisions by researchers. Based on the literature review and these results, we proposed guidelines regarding appropriate procedures for assessment and reporting of this important aspect of content analysis.

    This supplemental online resource contains:

- Background information regarding what intercoder reliability is and why it's important
- A modified version of the guidelines from the article
- Descriptions of and recommendations regarding currently available software tools that researchers can use to calculate the different reliability indices
- Information about how to obtain and use the software
- Further clarification of issues related to reliability

The online format will allow us to update the information as the tools, and perspectives on the different indices and their proper use, evolve.

2. What is intercoder reliability? [TOP]

Intercoder reliability is the widely used term for the extent to which independent coders evaluate a characteristic of a message or artifact and reach the same conclusion. Although in its generic use as an indication of measurement consistency this term is appropriate and is used here, Tinsley and Weiss (1975, 2000) note that the more specific term for the type of consistency required in content analysis is intercoder (or interrater) agreement. They write that while reliability could be based on correlational (or analysis of variance) indices that assess the degree to which "ratings of different judges are the same when expressed as deviations from their means," intercoder agreement is needed in content analysis because it measures only "the extent to which the different judges tend to assign exactly the same rating to each object" (Tinsley & Weiss, 2000, p. 98). Even when intercoder agreement is used for variables at the interval or ratio levels of measurement, actual agreement on the coded values (even if similar rather than identical values 'count') is the basis for assessment.

3. Why should content analysis researchers care about intercoder reliability? [TOP]

It is widely acknowledged that intercoder reliability is a critical component of content analysis, and that although it does not insure validity, when it is not established properly, the data and interpretations of the data cannot be considered valid. As Neuendorf (2002) notes, "given that a goal of content analysis is to identify and record relatively objective (or at least intersubjective) characteristics of messages, reliability is paramount. Without the establishment of reliability, content analysis measures are useless" (p. 141). Kolbe and Burnett (1991) write that "interjudge reliability is often perceived as the standard measure of research quality. High levels of disagreement among judges suggest weaknesses in research methods, including the possibility of poor operational definitions, categories, and judge training" (p. 248).

A distinction is often made between the coding of manifest content, information "on the surface," and latent content beneath these surface elements. Potter and Levine-Donnerstein (1999) note that for latent content the coders must provide subjective interpretations based on their own mental schema and that this "only increases the importance of making the case that the judgments of coders are intersubjective, that is, those judgments, while subjectively derived, are shared across coders, and the meaning therefore is also likely to reach out to readers of the research" (p. 266).


There are important practical reasons to establish intercoder reliability too. Neuendorf (2002) argues that in addition to being a necessary, although not sufficient, step in validating a coding scheme, establishing a high level of reliability also has the practical benefit of allowing the researcher to divide the coding work among many different coders. Rust and Cooil (1994) note that intercoder reliability is important to marketing researchers in part because "high reliability makes it less likely that bad managerial decisions will result from using the data" (p. 11); Potter and Levine-Donnerstein (1999) make a similar argument regarding applied work in public information campaigns.

The bottom line is that content analysis researchers should care about intercoder reliability because not only can its proper assessment make coding more efficient, but without it all of their work - data gathering, analysis, and interpretation - is likely to be dismissed by skeptical reviewers and critics.

4. How should content analysis researchers properly assess and report intercoder reliability? [TOP]

The following guidelines for the calculation and reporting of intercoder reliability are based on a review of literature concerning reliability and a detailed examination of reports of content analyses in communication journals; the literature review and study are presented in detail in Lombard, Snyder-Duch, and Bracken (2002) and Lombard, Snyder-Duch, and Bracken (2003).

First and most important, calculate and report intercoder reliability. All content analysis projects should be designed to include multiple coders of the content and the assessment and reporting of intercoder reliability among them. Reliability is a necessary (although not sufficient) criterion for validity in the study, and without it all results and conclusions in the research project may justifiably be doubted or even considered meaningless. Follow these steps in order.

1. Select one or more appropriate indices. Choose one or more appropriate indices of intercoder reliability based on the characteristics of the variables, including their level(s) of measurement, expected distributions across coding categories, and the number of coders. If percent agreement is selected (and this is not recommended), use a second index that accounts for agreement expected by chance. Be prepared to justify and explain the selection of the index or indices. Note that the selection of index or indices must precede data collection and evaluation of intercoder reliability.

2. Obtain the necessary tools to calculate the index or indices selected. Some of the indices can be calculated "by hand" (although this may be quite tedious) while others require automated calculation. A small number of specialized software applications as well as macros for established statistical software packages are available (see How should researchers calculate intercoder reliability? What software is available? below).

3. Select an appropriate minimum acceptable level of reliability for the index or indices to be used. Coefficients of .90 or greater are nearly always acceptable, .80 or greater is acceptable in most situations, and .70 may be appropriate in some exploratory studies for some indices. Higher criteria should be used for indices known to be liberal (i.e., percent agreement) and lower criteria can be used for indices known to be more conservative (Cohen's kappa, Scott's pi, and Krippendorff's alpha). A preferred approach is to calculate and report two (or more) indices, establishing a decision rule that takes into account the assumptions and/or weaknesses of each (e.g., to be considered reliable, a variable may be required to be at or above a moderate level for a conservative index or at or above a high level for a liberal index). In any case the researcher should be prepared to justify the criterion or criteria used.
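To make this kind of two-index decision rule concrete, a minimal sketch in Python follows. The cutoffs (.70 for a conservative, chance-corrected index and .90 for a liberal index) and the function name are illustrative assumptions, not fixed standards; researchers must still justify whatever criteria they adopt.

def variable_is_reliable(conservative_value, liberal_value,
                         conservative_cutoff=0.70, liberal_cutoff=0.90):
    # Illustrative decision rule: treat a variable as reliable if the
    # chance-corrected (conservative) index reaches a moderate criterion
    # OR the liberal index reaches a high criterion.
    return (conservative_value >= conservative_cutoff
            or liberal_value >= liberal_cutoff)

# Example: Scott's pi = .72 and percent agreement = .93 for one variable.
print(variable_is_reliable(0.72, 0.93))  # True under these illustrative cutoffs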


4. Assess reliability informally during coder training. Following instrument design and preliminary coder training, assess reliability informally with a small number of units which ideally are not part of the full sample (or census) of units to be coded, and refine the instrument and coding instructions until the informal assessment suggests an adequate level of agreement.

5. Assess reliability formally in a pilot test. Using a random or other justifiable procedure, select a representative sample of units for a pilot test of intercoder reliability. The size of this sample can vary depending on the project but a good rule of thumb is 30 units (for more guidance see Lacy and Riffe, 1996). If at all possible, when selecting the original sample for the study, select a separate representative sample for use in coder training and pilot testing of reliability. Coding must be done independently and without consultation or guidance; if possible, the researcher should not be a coder. If reliability levels in the pilot test are adequate, proceed to the full sample; if they are not adequate, conduct additional training, refine the coding instrument and procedures, and only in extreme cases, replace one or more coders.

6. Assess reliability formally during coding of the full sample. When confident that reliability levels will be adequate (based on the results of the pilot test of reliability), use another representative sample to assess reliability for the full sample to be coded (the reliability levels obtained in this test are the ones to be presented in all reports of the project). This sample must also be selected using a random or other justifiable procedure. The appropriate size of the sample depends on many factors but it should not be less than 50 units or 10% of the full sample, and it rarely will need to be greater than 300 units; larger reliability samples are required when the full sample is large and/or when the expected reliability level is low (Neuendorf, 2002; see Lacy and Riffe, 1996 for a discussion). The units from the pilot test of reliability can be included in this reliability sample only if the reliability levels obtained in the pilot test were adequate. As with the pilot test, the coding must be done independently, without consultation or guidance.
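As a rough aid, the rule of thumb in this step can be expressed as a short calculation. This is a sketch only: it implements the 50-unit/10% floor and the approximately 300-unit ceiling mentioned above, ignores the adjustment for low expected reliability, and is not a substitute for the formal procedures in Lacy and Riffe (1996).

import math

def reliability_sample_size(full_sample_size):
    # Rule of thumb only: at least 50 units or 10% of the full sample,
    # whichever is larger, and rarely more than 300 units. Larger samples
    # are needed when expected reliability is low (not modeled here).
    floor = max(50, math.ceil(0.10 * full_sample_size))
    return min(floor, 300)

print(reliability_sample_size(400))   # 50  (10% of 400 is only 40)
print(reliability_sample_size(2000))  # 200
print(reliability_sample_size(5000))  # 300 (capped by the rule of thumb)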

7. Select and follow an appropriate procedure for incorporating the coding of the reliability sample into the coding of the full sample. Unless reliability is perfect, there will be coding disagreements for some units in the reliability sample. Although an adequate level of intercoder agreement suggests that the decisions of each of the coders could reasonably be included in the final data, and although it can only address the subset of potential coder disagreements that are discovered in the process of assessing reliability, the researcher must decide how to handle these coding disagreements. Depending on the characteristics of the data and the coders, the disagreements can be resolved by randomly selecting the decisions of the different coders, using a 'majority' decision rule (when there are an odd number of coders), having the researcher or other expert serve as tie-breaker, or discussing and resolving the disagreements. The researcher should be able to justify whichever procedure is selected.
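For the 'majority' decision rule mentioned above, a minimal sketch follows. The function name and the fallback to a designated expert's judgment are illustrative; the tie-breaker option is one of the procedures named in this step, and any such rule still needs to be justified in the research report.

from collections import Counter

def resolve_disagreement(codes, expert_code=None):
    # Resolve one unit's codes from several reliability coders using a
    # majority rule; if there is no clear majority (possible with an even
    # number of coders or a multi-way split), defer to the designated
    # expert/tie-breaker decision supplied by the researcher.
    counts = Counter(codes).most_common()
    if len(counts) == 1 or counts[0][1] > counts[1][1]:
        return counts[0][0]   # unanimity or clear majority
    return expert_code        # tie: use the tie-breaker's judgment

print(resolve_disagreement([1, 1, 2]))                 # 1 (majority of three coders)
print(resolve_disagreement([1, 2], expert_code=2))     # 2 (two coders split; expert decides)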

8. Report intercoder reliability in a careful, clear, and detailed manner in all research reports. Even if the assessment of intercoder reliability is adequate, readers can only evaluate a study based on the information provided, which must be both complete and clear. Provide this minimum information:

- The size of and the method used to create the reliability sample, along with a justification of that method.
- The relationship of the reliability sample to the full sample (i.e., whether the reliability sample is the same as the full sample, a subset of the sample, or a separate sample).
- The number of reliability coders (which must be 2 or more) and whether or not they include the researcher(s).
- The amount of coding conducted by each reliability, and non-reliability, coder.
- The index or indices selected to calculate reliability, and a justification of this/these selections.
- The intercoder reliability level for each variable, for each index selected.
- The approximate amount of training (in hours) required to reach the reliability levels reported.
- How disagreements in the reliability coding were resolved in the full sample.
- Where and how the reader can obtain detailed information regarding the coding instrument, procedures, and instructions (e.g., from the authors).

Do NOT use only percent agreement to calculate reliability. Despite its simplicity and widespread use, there is consensus in the methodological literature that percent agreement is a misleading and inappropriately liberal measure of intercoder agreement (at least for nominal-level variables); if it is reported at all, the researcher must justify its value in the context of the attributes of the data and analyses at hand.

Do NOT use Cronbach's alpha, Pearson's r, or other correlation-based indices that standardize coder values and only measure covariation. While these indices may be used as a measure of reliability in other contexts, reliability in content analysis requires an assessment of intercoder agreement (i.e., the extent to which coders make identical coding decisions) rather than covariation.

Do NOT use chi-square to calculate reliability.

Do NOT use overall reliability across variables, rather than reliability levels for each variable, as a standard for evaluating the reliability of the instrument.

5. Which measure(s) of intercoder reliability should researchers use? [TOP]

There are literally dozens of different measures, or indices, of intercoder reliability. Popping (1988) identified 39 different "agreement indices" for coding nominal categories, a count that excludes several techniques for interval and ratio level data. But only a handful of techniques are widely used.

In communication the most widely used indices are:

- Percent agreement
- Holsti's method
- Scott's pi (π)
- Cohen's kappa (κ)
- Krippendorff's alpha (α)

Just some of the indices proposed, and in some cases widely used, in other fields are Perreault and Leigh's (1989) Ir measure; Tinsley and Weiss's (1975) T index; Bennett, Alpert, and Goldstein's (1954) S index; Lin's (1989) concordance coefficient; Hughes and Garrett's (1990) approach based on Generalizability Theory; and Rust and Cooil's (1994) approach based on "Proportional Reduction in Loss" (PRL).

It would be nice if there were one universally accepted index of intercoder reliability. But despite all the effort that scholars, methodologists and statisticians have devoted to developing and testing indices, there is no consensus on a single, "best" one.

While there are several recommendations for Cohen's kappa (e.g., Dewey (1983) argued that despite its drawbacks, kappa should still be "the measure of choice") and this index appears to be commonly used in research that involves the coding of behavior (Bakeman, 2000), others (notably Krippendorff, 1978, 1987) have argued that its characteristics make it inappropriate as a measure of intercoder agreement.

Percent agreement seems to be used most widely and is intuitively appealing and simple to calculate, but the methodological literature is consistent in identifying it as a misleading measure that overestimates true intercoder agreement (at least for nominal-level variables).

Krippendorff's alpha is well regarded and very flexible (it can be used with multiple coders, accounts for different sample sizes and missing data, and can be used for ordinal, interval and ratio level variables), but it requires tedious calculations and automated (software) options have not been widely available.

There's general consensus in the literature that indices that do not account for agreement that could be expected to occur by chance (such as percent agreement and Holsti's method) are too liberal (i.e., overestimate true agreement), while those that do can be too conservative.
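A small numerical illustration of this point uses hypothetical data for one heavily skewed nominal variable; the functions below are generic implementations of percent agreement and Scott's pi written for illustration, not taken from any of the software packages discussed later.

def percent_agreement(coder1, coder2):
    return sum(a == b for a, b in zip(coder1, coder2)) / len(coder1)

def scotts_pi(coder1, coder2):
    # Scott's pi corrects observed agreement for the agreement expected by
    # chance, using the pooled distribution of both coders' judgments.
    p_o = percent_agreement(coder1, coder2)
    pooled = list(coder1) + list(coder2)
    p_e = sum((pooled.count(k) / len(pooled)) ** 2 for k in set(pooled))
    return (p_o - p_e) / (1 - p_e)

# 20 units; both coders use category 0 almost all of the time.
c1 = [0] * 18 + [1, 0]
c2 = [0] * 18 + [0, 1]
print(percent_agreement(c1, c2))  # 0.90 -- looks impressive
print(scotts_pi(c1, c2))          # about -0.05 once chance agreement is removed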


And there is consensus that a few indices that are sometimes used are inappropriate: Cronbach's alpha was designed to only measure internal consistency via correlation, standardizing the means and variance of data from different coders and only measuring covariation (Hughes & Garrett, 1990), and chi-square produces high values for both agreement and disagreement deviating from agreement expected by chance (the "expected values" in the chi-square formula).

Before they examine their reliability data, researchers need to select an index of intercoder reliability based on its properties and assumptions, and the properties of their data, including the level of measurement of each variable for which agreement is to be calculated and the number of coders.

Whichever index or indices is/are used, researchers need to explain why the properties and assumptions of the index or indices they select are appropriate in the context of the characteristics of the data being evaluated.

6. How should researchers calculate intercoder reliability? What software is available? [TOP]

There are four approaches available to researchers who need to calculate intercoder reliability:

By hand

The paper-and-pencil (and hand calculator) approach. This can be exceedingly tedious and in many cases impractical.

Specialized software

A very small number of software applications have been created specifically to calculate one or more reliability indices. Three considered here are:

AGREE. Popping's (1984) software calculates Cohen's kappa and an index called D2, said to be appropriate for pilot studies in which "the response categories have to be developed by the coders during the assigning process. In this situation each coder may finish with a different set and number of categories." The Windows-based software was available from ProGAMMA, but that company went out of business in June 2003 and AGREE is now available from SciencePlusGroup. The academic version of this software costs 399 euros (approximately $460).

Krippendorff's Alpha 3.12a. This Windows program (and an earlier DOS version) calculates Krippendorff's alpha; unfortunately it is a beta or test mode program and has not been distributed widely. Klaus Krippendorff (2001, 2004) has indicated that a new and more comprehensive software application is under development. Details will likely appear on Professor Krippendorff's web site.

PRAM. The new Program for Reliability Assessment with Multiple Coders software is available at no cost from Skymeg Software and calculates several different indices:

- Percent agreement
- Scott's pi
- Cohen's kappa
- Lin's concordance correlation coefficient
- Holsti's reliability

as well as:

- Spearman's rho
- Pearson correlation coefficient

Unfortunately, the current (as of March 2004) version of PRAM (v. 0.4.5) is an alpha release and the authors warn that it contains bugs and is not yet finished. However, in the release notes they state that "All available coefficients [have been] tested and verified by Dr. [Kim] Neuendorf's students."

Statistical software packages that include reliability calculations

Two general purpose statistical software applications that include reliability calculations are:

Simstat. Available from Provalis Research, this Windows program can be used to calculate several indices:

- Percent agreement
- Scott's pi
- Cohen's kappa
- Free marginal (nominal)
- Krippendorf's r-bar
- Krippendorf's R (for ordinal data)
- Free marginal (ordinal)

The full version of the program costs $149, but a limited version that only allows 20 variables is available for $40; an evaluation copy of the full software can be downloaded at no cost but only works for 30 days or 7 uses.

SPSS. The well-known Statistical Package for the Social Sciences can be used to calculate Cohen's kappa.

Macros for statistical software packages

A few people have written "macros," customized programming that can be used with existing software packages to automate the calculation of intercoder reliability. All of these are free, but require some expertise to use.

- Kang, Kara, Laskey, and Seaton (1993) created and published a SAS macro that calculates Krippendorff's alpha for nominal, ordinal and interval data, Cohen's kappa, Perreault's index, and Bennett's index.
- Berry and Mielke (1997) created a macro (actually a FORTRAN subroutine) to calculate a "chance-corrected index of agreement ... that measures the agreement of multiple raters with a standard set of responses (which may be one of the raters)" (p. 527). The article says that the ASTAND subroutine and "an appropriate driver program" are available from Kenneth Berry ([email protected]) in the Department of Sociology at Colorado State University.
- We also have created SPSS macros to calculate percent agreement and Krippendorff's alpha. They're available free at the links below.

7. How should content analysts set up their reliability data for use with reliability software? [TOP]

Each software application that can be used to calculate intercoder reliability has its own requirements regarding data formatting, but all of them fall into two basic setup formats. In the first data setup format, used in PRAM and Krippendorff's Alpha 3.12a, each row represents the coding judgments of a particular coder for a single case across all the variables and each column represents a single variable. Data in this format appears something like this:

Unit             var1  var2  var3  var4  var5  var6  ...
case1 (coder1)
case1 (coder2)
case2 (coder1)
case2 (coder2)
case3 (coder1)
case3 (coder2)
...

In the second data setup, used in Simstat and SPSS, each row represents a single case and each column represents the coding judgments of a particular coder for a particular variable. Data in this format appears something like this:

Unit    var1coder1  var1coder2  var2coder1  var2coder2  var3coder1  var3coder2  ...
case 1
case 2
case 3
case 4
case 5
case 6
...

It's relatively easy to transform a data file from the first format to the second (by inserting a second copy of each column and copying data from the line of one coder to the line for the other and deleting the extra lines), but the reverse operation is more tedious. Researchers are advised to select a data setup based on the program(s) they'll use to calculate reliability.
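For researchers who prefer to do this transformation electronically rather than by copying and pasting, a minimal sketch using pandas follows. The file name reliability_long.csv and the column names unit, coder, var1, var2 are hypothetical; they simply mirror the first setup format above.

import pandas as pd

# First setup format: one row per coder per case, one column per variable.
long_df = pd.read_csv("reliability_long.csv")  # columns: unit, coder, var1, var2, ...

# Pivot to the second setup format: one row per case, one column per
# variable-coder combination (var1coder1, var1coder2, ...).
wide_df = long_df.pivot(index="unit", columns="coder")
wide_df.columns = [f"{var}coder{coder}" for var, coder in wide_df.columns]
wide_df = wide_df.reset_index()

wide_df.to_csv("reliability_wide.csv", index=False)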

8. How does each software tool for intercoder reliability work? [TOP]

Details about by hand calculations; AGREE, Krippendorff's Alpha 3.12a, PRAM, Simstat and SPSS software; and macros are presented in this section.

By hand

This doesn't involve a software tool of course, but for those who choose this technique, Neuendorf (2002) includes a very useful set of examples that illustrate the step-by-step process for percent agreement, Scott's pi, Cohen's kappa and the nominal level version of Krippendorff's alpha.

A note of caution: Avoid rounding values during by hand calculations; while each such instance may not make a substantial difference in the final reliability value, the series of them can make the final value highly inaccurate.

Examples:

Click here to view examples of by hand calculations for percent agreement, Scott's pi, Cohen's kappa and Krippendorff's alpha for a nominal level variable with two values.

Click here to view examples of by hand calculations for Krippendorff's alpha for a ratio level variable with seven values.
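As a companion to those examples, the same by hand arithmetic can be sketched in Python for two coders and one nominal variable. The functions below use the generic textbook formulas for the four indices named above (two coders, no missing data); they are offered for illustration only, and results should be checked against the worked examples before being relied on.

def percent_agreement(c1, c2):
    # Proportion of units on which the two coders chose the same value.
    return sum(a == b for a, b in zip(c1, c2)) / len(c1)

def scotts_pi(c1, c2):
    # Chance correction uses the pooled distribution of both coders' values.
    p_o = percent_agreement(c1, c2)
    pooled = list(c1) + list(c2)
    p_e = sum((pooled.count(k) / len(pooled)) ** 2 for k in set(pooled))
    return (p_o - p_e) / (1 - p_e)

def cohens_kappa(c1, c2):
    # Chance correction uses each coder's own marginal distribution.
    p_o = percent_agreement(c1, c2)
    n = len(c1)
    cats = set(c1) | set(c2)
    p_e = sum((c1.count(k) / n) * (c2.count(k) / n) for k in cats)
    return (p_o - p_e) / (1 - p_e)

def krippendorff_alpha_nominal(c1, c2):
    # Nominal-level alpha for two coders with no missing data; expected
    # disagreement is computed over all n(n-1) pairable values.
    p_o = percent_agreement(c1, c2)
    pooled = list(c1) + list(c2)
    n = len(pooled)
    p_e = sum(pooled.count(k) * (pooled.count(k) - 1) for k in set(pooled)) / (n * (n - 1))
    return (p_o - p_e) / (1 - p_e)

# Hypothetical judgments by two coders on ten units of a two-value variable.
coder1 = [1, 1, 0, 1, 0, 1, 1, 0, 0, 1]
coder2 = [1, 1, 0, 1, 1, 1, 0, 0, 0, 1]
for name, fn in [("Percent agreement", percent_agreement),
                 ("Scott's pi", scotts_pi),
                 ("Cohen's kappa", cohens_kappa),
                 ("Krippendorff's alpha (nominal)", krippendorff_alpha_nominal)]:
    print(f"{name}: {fn(coder1, coder2):.3f}")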

AGREE

Popping's (1984) AGREE specialty software calculates Cohen's kappa and an index called D2 (Popping, 1983).

Data input and output:
Data can be entered from the keyboard or a plain-text file in either a "data matrix" format - a single-variable version of the second data setup format above - or a "data table" (pictured in the screen shot below), which is elsewhere called an agreement matrix, where the values in the cells represent the number of cases on which two coders agreed and disagreed. Output appears on screen.

Screen shot:

Issues/limitations:
The program can handle at most 90 variables as input. The maximum number of variables allowed in one analysis is 30, and the maximum number of categories per analysis is 40.

Comments:
Given its cost, researchers should be sure they need the specific functions available in this basic Windows program before purchasing it.

More information:
A fairly detailed description of the program and its features can be found here, on the web site of the current publisher, SciencePlusGroup.

    Krippendorff's Alpha 3.12a

Krippendorff's Alpha 3.12a is a beta version of a program Krippendorff developed to calculate his reliability index. It has not been distributed widely and no future versions are planned, but as noted above, Krippendorff is in the process of developing a new and more comprehensive program.

Data input and output:
The program uses the first data setup format above. Data can be entered from the program, but it is easier to create a plain-text, comma-delimited data file (most spreadsheet programs can export data in this format) and load that file. Output (the user can select terse, medium or verbose versions) is displayed onscreen and can be saved and/or printed. Click here to view instructions for the program. Click here to view program output (for the nominal level variable used in the by hand examples above).

Screen shot:

Issues/limitations:
As beta version software the program is not polished and has some limitations; a single space in the data file freezes the program, and attempting to evaluate too many variables at once can result in a very long delay followed by an error message. Krippendorff has also discovered minor (in the third decimal place) errors in some program output.

Comments:
Despite its limitations this is a useful tool for those who want to calculate the well regarded Krippendorff's alpha index. We look forward to the release of Krippendorff's new software.

More information:
See Professor Krippendorff's web site.

PRAM

Program for Reliability Assessment with Multiple Coders (PRAM) is a new program developed in conjunction with Neuendorf's (2002) book, The content analysis guidebook. It is in alpha testing mode, and users are warned that the program still contains numerous bugs. PRAM calculates percent agreement, Scott's pi, Cohen's kappa, Lin's concordance correlation coefficient, and Holsti's reliability (as well as Spearman's rho and the Pearson correlation coefficient).

Data input and output:
PRAM requires input in the form of a Microsoft Excel (.xls) spreadsheet file with very specific formatting. The readme.txt file notes that "The first row is a header row and must contain the variable names... The first column must contain the coder IDs in numeric form... The second column must contain the unit (case) IDs in numeric form... All other columns must contain numerically coded variables, with variable names on the header line." The program loads and evaluates the data directly from the source file - the user never sees the data matrix. The output comes in the form of another Excel file, which is automatically named pram.xls; errors are noted in a log file (pram.txt). Click here to view program output (for the nominal level variable used in the by hand examples above).

Screen shot:

Issues/limitations:
- The program accommodates missing values in the form of empty cells but not special codes (e.g., "99") that represent missing values.
- The data file must contain only integers (if necessary, a variable for which some or all values contain decimal points can be transformed by multiplying all values by the same multiple of 10).
- The current version of the program has moderate memory limitations; the programmers recommend that data values not be greater than 100.
- The programmers planned to include Krippendorff's alpha among the indices calculated but have not done so yet (the option is displayed but can't be selected).
- Output for most indices includes analyses for all pairs of coders, which is a valuable diagnostic tool, but it also includes the average of these values, even for indices not designed to accommodate multiple coders; use caution in interpreting and reporting these values.
- In our experience the PRAM results for Holsti's reliability are not trustworthy.

Comments:
This is an early, sometimes buggy version of what has the potential to be a very useful tool. Some calculation problems in earlier versions have apparently been solved, however, as the release notes state that "All available coefficients [have been] tested and verified by Dr. [Kim] Neuendorf's students."

More information:
An online accompaniment to Neuendorf's book is available here; some limited information about PRAM is available here; more specific information about the program is in the readme.txt file that comes with it.


Simstat

Available from Provalis Research, this Windows program calculates percent agreement, Scott's pi, Cohen's kappa, and versions of Krippendorff's alpha.

Data input and output:
The program uses the second data setup format described above. It can import data files in various formats but saves files in a proprietary format (with a .dbf file extension). To calculate reliability, the user selects "Statistics," "Tables," and then "Inter-Raters" from the menu and then selects indices and variables. The output appears in a "notebook" file on screen (which can be printed). Click here to view program output (for the nominal level variable used in the by hand examples above).

Screen shot:

Issues/limitations:
Simstat can only evaluate the judgments of two coders. The two versions of Krippendorff's reliability index Simstat produces are for ordinal level variables. The program's documentation states, "Krippendorf's R-bar adjustment is the ordinal extension of Scott's pi and assumes that the distributions of the categories are equal for the two sets of ratings. Krippendorf's r adjustment is the ordinal extension of Cohen's Kappa in that it adjusts for the differential tendencies of the judges in the computation of the chance factor."

In our experience Simstat occasionally has behaved erratically.

Comments:
If one already has access to this software, or cost is not an issue, this is a valuable, though limited, tool for calculating reliability.

More information:
Click here to read the program's help page regarding Inter-raters agreement statistics.


SPSS

SPSS can be used to calculate Cohen's kappa.

Data input and output:
The program uses the second data setup format described above. It can import data files in various formats but saves files in a proprietary format (with a .sav file extension). As with other SPSS operations, the user has two options available to calculate Cohen's kappa. First, she can select "Analyze," "Descriptive Statistics," and "Crosstabs..." in the menus, then click on "Statistics" to bring up a dialog screen, and click on the check box for "Kappa." The second option is to prepare a syntax file containing the commands needed to make the calculations. The syntax (which can also be automatically generated with the "paste" button and then edited for further analyses) should be:

CROSSTABS

    /TABLES=var1c1 BY var1c2

    /FORMAT= AVALUE TABLES

    /STATISTIC=CC KAPPA

    /CELLS= COUNT ROW COLUMN TOTAL .

    where var1c1 and var1c2 are the sets of judgments for variable 1 for coder 1 and coder 2 respectively.
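For readers who want to verify a kappa value outside of SPSS, the same two columns of judgments can be passed to a general-purpose routine. This is a sketch only; it assumes the scikit-learn Python library is installed, and the value lists are hypothetical stand-ins for var1c1 and var1c2.

from sklearn.metrics import cohen_kappa_score

# var1c1 and var1c2: variable 1 as judged by coder 1 and coder 2.
var1c1 = [1, 1, 0, 1, 0, 1, 1, 0, 0, 1]
var1c2 = [1, 1, 0, 1, 1, 1, 0, 0, 0, 1]

print(cohen_kappa_score(var1c1, var1c2))  # Cohen's kappa for this variable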

The output appears in an "output" file on screen (which can be printed). Click here to view program output (for the nominal level variable used in the by hand examples above).

Screen shot:

Issues/limitations:
Cohen's kappa is the only index SPSS calculates.


The program creates a symmetric (square) 2-way table of values, with the columns representing the values used by one coder and the rows representing the values used by the other coder. If in the entire dataset either coder has not used a specific value and the other coder has used that value, SPSS produces an error message ("Kappa statistics cannot be computed. They require a symmetric 2-way table in which the values of the first variable match the values of the second variable.").

    Comments:

SPSS is the most established of the software packages that can be used to calculate reliability, but it is of limited use because it only calculates Cohen's kappa. Perhaps communication and other researchers can encourage SPSS to incorporate other indices.

More information:
The SPSS web site is http://spss.com.

Macros for statistical software packages

For researchers proficient with their use, SAS and SPSS macros to automate the calculations are available for some indices.

Data input and output:
Kang, Kara, Laskey, and Seaton (1993) created and published a SAS macro that calculates Krippendorff's alpha for nominal, ordinal and interval data, Cohen's kappa, Perreault's index, and Bennett's index. The authors state:

    "The calculation of intercoder agreement is essentially a two-stage process. The first stage involvesconstructing an agreement matrix which summarizes the coding results. This first stage isparticularly burdensome and is greatly facilitated by a matrix manipulation facility. Such a facility isnot commonly available in popular spreadsheet or database software, and frequently is not availablein statistical packages. The second stage is the calculation of a specific index of intercoderagreement and is much simpler. This second stage could be computed with any existingspreadsheet software package... Our objective is to incorporate both sets of calculations into a singlecomputer run. The SAS PROC IML and MACRO facilities (in SAS version 6 or higher) readily allowfor the calculation of the agreement matrix, very complex for multiple coders, and SAS is alsocapable of the more elementary index calculations. Our SAS Macro is menu driven and is 'userfriendly.' (pp. 20-21).

The text of the macro is included in Appendix B of the article and is 3 3/4 pages long; click here for an electronic copy that can be cut-and-pasted for use in SAS. The authors report tests that demonstrate the validity of the macro for each of the indices.
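The "first stage" described in that quotation, building the agreement matrix from the raw codes, is straightforward to illustrate outside of SAS. The sketch below is not the published macro, only the underlying bookkeeping for the two-coder case, with hypothetical data.

from collections import Counter

def agreement_matrix(c1, c2):
    # Cross-tabulate two coders' judgments: cell (i, j) counts the units
    # that coder 1 placed in category i and coder 2 placed in category j.
    # The diagonal holds the agreements; everything else is disagreement.
    cats = sorted(set(c1) | set(c2))
    counts = Counter(zip(c1, c2))
    return cats, [[counts[(i, j)] for j in cats] for i in cats]

cats, matrix = agreement_matrix([1, 1, 0, 1, 0, 1, 1, 0, 0, 1],
                                [1, 1, 0, 1, 1, 1, 0, 0, 0, 1])
print(cats)    # [0, 1]
print(matrix)  # [[3, 1], [1, 5]] -- 8 of 10 units fall on the diagonal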

Berry and Mielke (1997) created a macro (actually a FORTRAN subroutine) to calculate a "chance-corrected index of agreement ... that measures the agreement of multiple raters with a standard set of responses (which may be one of the raters)" (p. 527).

We have created macros for use with the Windows versions of SPSS to calculate Krippendorff's alpha and percent agreement. Instructions for the use of the macros are included within them.

Krippendorff's alpha macros

Click here to view a macro that calculates Krippendorff's alpha for one or more 2-category nominal level variables with 2 coders, and prints details of the calculations. Click here to view SPSS program output for this macro for the nominal level variable used in the by hand examples above.

Click here to view a macro that does the same thing but does not print the detailed calculations. Click here to view program output for this macro for the nominal level variable used in the by hand examples above.

These macros can be modified fairly easily for variables with different numbers of coding categories. Click here to view a macro that calculates Krippendorff's alpha for one or more variables with 3 (rather than 2) nominal level coding categories, and prints details of the calculations (or click here to view a macro that does the same thing but does not print the detailed calculations). A comparison of the macros for 2- and 3-category variables reveals the modifications needed to create macros that accommodate variables with any number of coding categories. In principle the macros can be modified to accommodate different numbers of coders as well (see the authors for details).

Percent agreement (not recommended)

Click here to view a macro that calculates percent agreement for one or more nominal level variables with 2 coders and prints details of the calculations. Click here to view SPSS program output for the nominal level variable used in the by hand examples above.

Click here to view a macro that does the same thing but does not print the detailed calculations. Click here to view program output for the nominal level variable used in the by hand examples above.

Issues/limitations:
All of these macros must be used exactly as instructed to yield valid results.

Comments:
Only those familiar and comfortable with using and manipulating software are advised to use macros such as the ones described and presented here. Hopefully a usable and affordable specialized software package and/or an expanded version of a general purpose statistical software package will make the use of macros redundant and obsolete.

More information:
See the articles identified and/or contact the authors.

9. What other issues and potential problems should content analysts know about? [TOP]

Two other common considerations regarding intercoder reliability involve the number of reliability coders and the particular units (messages) they evaluate.

Many indices can be extended to three or more coders by averaging the index values for each pair of coders, but this is inappropriate because it can hide systematic, inconsistent coding decisions among the coders (on the other hand, calculations for each coder pair, which are automatically provided by some software (e.g., PRAM), can be a valuable diagnostic tool during coder training).

While most researchers have all reliability coders evaluate the same set of units, a few systematically assign judges to code overlapping sets of units. We recommend the former approach because it provides the most efficient basis for evaluating coder agreement, and, along with others (e.g., Neuendorf, 2002), recommend against the latter approach (fortunately, most indices can account for missing data that results from different coders evaluating different units, whatever the reason). See Neuendorf (2002) and Potter and Levine-Donnerstein (1999) for discussions.
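The diagnostic value of pairwise calculations, and the way an average can conceal a problem coder, can be seen in a brief sketch with three hypothetical coders. Percent agreement is used here only because it keeps the arithmetic transparent; the same pattern arises with chance-corrected indices.

from itertools import combinations

def percent_agreement(a, b):
    return sum(x == y for x, y in zip(a, b)) / len(a)

coders = {
    "coder1": [1, 0, 1, 1, 0, 1, 0, 1, 1, 0],
    "coder2": [1, 0, 1, 1, 0, 1, 0, 1, 1, 0],  # matches coder1 exactly
    "coder3": [1, 1, 1, 1, 1, 1, 0, 0, 1, 1],  # systematically drifts from the others
}

pairwise = {(a, b): percent_agreement(coders[a], coders[b])
            for a, b in combinations(coders, 2)}
for pair, value in pairwise.items():
    print(pair, round(value, 2))        # 1.0, 0.7, 0.7
print("average:", round(sum(pairwise.values()) / len(pairwise), 2))  # 0.8
# The average (.80) meets a common criterion even though both pairs
# involving coder3 sit at .70 -- exactly the kind of systematic,
# inconsistent coding an overall average can hide.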

    10. Where can content analysts find more information about intercoder reliability? [TOP]

To supplement the information provided here, the interested researcher may want to:

- Visit the Content Analysis Resources web site
- Subscribe to the CONTENT listserv (see the Content Analysis Resources web site for details and to subscribe)
- Visit the web site of Professor Klaus Krippendorff at the University of Pennsylvania
- Review the references below
- See Lombard, Snyder-Duch, and Bracken (2002, 2003)
- Periodically return to this online resource for updated information

    11. References and bibliography [TOP]

References

Bakeman, R. (2000). Behavioral observation and coding. In H. T. Reis & C. M. Judd (Eds.), Handbook of research methods in social and personality psychology (pp. 138-159). New York: Cambridge University Press.

Bennett, E. M., Alpert, R., & Goldstein, A. C. (1954). Communications through limited response questioning. Public Opinion Quarterly, 18, 303-308.

Berry, K. J., & Mielke, P. W., Jr. (1997). Measuring the joint agreement between multiple raters and a standard. Educational and Psychological Measurement, 57(3), 527-530.

Dewey, M. E. (1983). Coefficients of agreement. British Journal of Psychiatry, 143, 487-489.

Hughes, M. A., & Garrett, D. E. (1990). Intercoder reliability estimation approaches in marketing: A Generalizability Theory framework for quantitative data. Journal of Marketing Research, 27(May), 185-195.

Kang, N., Kara, A., Laskey, H. A., & Seaton, F. B. (1993). A SAS MACRO for calculating intercoder agreement in content analysis. Journal of Advertising, 23(2), 17-28.

Kolbe, R. H., & Burnett, M. S. (1991). Content-analysis research: An examination of applications with directives for improving research reliability and objectivity. Journal of Consumer Research, 18, 243-250.

Krippendorff, K. (1978). Reliability of binary attribute data. Biometrics, 34, 142-144.

Krippendorff, K. (1987). Association, agreement, and equity. Quality and Quantity, 21, 109-123.

Krippendorff, K. (2001, August 22). CONTENT [Electronic discussion list]. The posting is available here.

Krippendorff, K. (2004, October 23). CONTENT [Electronic discussion list]. The posting is available here.

Lacy, S., & Riffe, D. (1996). Sampling error and selecting intercoder reliability samples for nominal content categories: Sins of omission and commission in mass communication quantitative research. Journalism & Mass Communication Quarterly, 73, 969-973.

Lin, L. I. (1989). A concordance correlation coefficient to evaluate reproducibility. Biometrics, 45, 255-268.

Lombard, M., Snyder-Duch, J., & Bracken, C. C. (2002). Content analysis in mass communication: Assessment and reporting of intercoder reliability. Human Communication Research, 28, 587-604.

Lombard, M., Snyder-Duch, J., & Bracken, C. C. (2003). Correction. Human Communication Research, 29, 469-472.

Neuendorf, K. A. (2002). The content analysis guidebook. Thousand Oaks, CA: Sage.

Pasadeos, Y., Huhman, B., Standley, T., & Wilson, G. (1995, May). Applications of content analysis in news research: A critical examination. Paper presented to the Theory and Methodology Division at the annual conference of the Association for Education in Journalism and Mass Communication (AEJMC), Washington, D.C.

Perreault, W. D., & Leigh, L. E. (1989). Reliability of nominal data based on qualitative judgments. Journal of Marketing Research, 26(May), 135-148.

Popping, R. (1983). Traces of agreement: On the dot-product as a coefficient of agreement. Quality & Quantity, 17, 1-18.

Popping, R. (1984). AGREE, a package for computing nominal scale agreement. Computational Statistics and Data Analysis, 2, 182-185.

Popping, R. (1988). On agreement indices for nominal data. In W. E. Saris & I. N. Gallhofer (Eds.), Sociometric research: Volume 1, data collection and scaling (pp. 90-105). New York: St. Martin's Press.

Potter, W. J., & Levine-Donnerstein, D. (1999). Rethinking validity and reliability in content analysis. Journal of Applied Communication Research, 27(3), 258+.

ProGAMMA (2002). AGREE. Accessed on July 7, 2002 at http://www.gamma.rug.nl/.

Riffe, D., & Freitag, A. A. (1997). A content analysis of content analyses: Twenty-five years of Journalism Quarterly. Journalism & Mass Communication Quarterly, 74(4), 873-882.

Rust, R., & Cooil, B. (1994). Reliability measures for qualitative data: Theory and implications. Journal of Marketing Research, 31(February), 1-14.

Skymeg Software (2002). Program for Reliability Assessment with Multiple Coders (PRAM). Retrieved on July 7, 2002 from http://www.geocities.com/skymegsoftware/pram.html

Tinsley, H. E. A., & Weiss, D. J. (1975). Interrater reliability and agreement of subjective judgements. Journal of Counseling Psychology, 22, 358-376.

Tinsley, H. E. A., & Weiss, D. J. (2000). Interrater reliability and agreement. In H. E. A. Tinsley & S. D. Brown (Eds.), Handbook of applied multivariate statistics and mathematical modeling (pp. 95-124). San Diego, CA: Academic Press.

    Bibliography

Banerjee, M., Capozzoli, M., McSweeney, L., & Sinha, D. (1999). Beyond kappa: A review of interrater agreement measures. The Canadian Journal of Statistics, 27(1), 3-23.

Berelson, B. (1952). Content analysis in communication research. Glencoe, IL: Free Press.


Brennan, R. L., & Prediger, D. J. (1981). Coefficient kappa: Some uses, misuses, and alternatives. Educational and Psychological Measurement, 41, 687-699.

Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1), 37-46.

Cohen, J. (1968). Weighted kappa: Nominal scale agreement with provision for scaled disagreement or partial credit. Psychological Bulletin, 70(4), 213-220.

Craig, R. T. (1981). Generalization of Scott's index of intercoder agreement. Public Opinion Quarterly, 45(2), 260-264.

Ellis, L. (1994). Research methods in the social sciences. Madison, WI: WCB Brown & Benchmark.

Fleiss, J. L. (1971). Measuring nominal scale agreement among many raters. Psychological Bulletin, 76, 378-382.

Frey, L. R., Botan, C. H., & Kreps, G. L. (2000). Investigating communication: An introduction to research methods (2nd ed.). Boston, MA: Allyn and Bacon.

"Interrater reliability" (2001). Journal of Consumer Psychology, 10(1&2), 71-73.

Holsti, O. R. (1969). Content analysis for the social sciences and humanities. Reading, MA: Addison-Wesley.

Krippendorff, K. (1980). Content analysis: An introduction to its methodology. Beverly Hills, CA: Sage.

Lawlis, G. F., & Lu, E. (1972). Judgment of counseling process: Reliability, agreement, and error. Psychological Bulletin, 78, 17-20.

Lichter, S. R., Lichter, L. S., & Amundson, D. (1997). Does Hollywood hate business or money? Journal of Communication, 47(1), 68-84.

Riffe, D., Lacy, S., & Fico, F. G. (1998). Analyzing media messages: Using quantitative content analysis in research. Mahwah, NJ: Lawrence Erlbaum.

Scott, W. (1955). Reliability of content analysis: The case of nominal scale coding. Public Opinion Quarterly, 17, 321-325.

Suen, H. K., & Lee, P. S. C. (1985). Effects of the use of percentage agreement on behavioral observation reliabilities: A reassessment. Journal of Psychopathology and Behavioral Assessment, 7, 221-234.

Singletary, M. W. (1993). Mass communication research: Contemporary methods and applications. Addison-Wesley.

    [TOP]