
SAMPLING ERROR AND SELECTING INTERCODER RELIABILITY SAMPLES FOR NOMINAL CONTENT CATEGORIES

By Stephen Lacy and Daniel Riffe

This study views intercoder reliability as a sampling problem. It develops a formula for generating sample sizes needed to have valid reliability estimates. It also suggests steps for reporting reliability. The resulting sample sizes will permit a known degree of confidence that the agreement in a sample of items is representative of the pattern that would occur if all content items were coded by all coders.

Every researcher who conducts a content analysis faces the same question: How large a sample of content units should be used to assess the level of reliability?

To an extent, sample size depends on the number of content units in the population and the homogeneity of the population with respect to variable coding complexity. Content can be categorized easily for some variables, but not for other variables. How does a researcher ensure that variations in degree of difficulty are included in the reliability assessment? As in most applications involving representativeness, the answer is probability sampling, assuring that each unit in the reliability check is selected randomly. Calculating sampling error for reliability tests is possible with probability sampling, but few content analyses address this point.

This study views intercoder reliability as a sampling problem, requiring clarification of the term "population." Content analysis typically refers to a study's "population" as all potentially codable content from which a sample is drawn and analyzed. However, this sample itself becomes a "population" of content units from which a sample of test units is randomly drawn to check reliability. This article suggests content samples need to have reliability estimates representing the population. The resulting sample sizes will permit a known degree of confidence that the agreement in a sample of test units is representative of the pattern that would occur if all study units were coded by all coders.

Background

Reproducibility reliability is the extent to which coding decisions can be replicated by different researchers. In principle, the use of multiple independent coders applying the same rules in the same way assures that categorized content does not represent the bias of one coder. Research methods texts discuss reliability in terms of measurement error resulting from problems in coding instructions, failure of coders to ...


... studies address whether the content units tested represent the population of items studied. Often, reliability samples have been selected haphazardly based on convenience (e.g., the first 50 items to be coded might be used). Research texts vary in their approach to sampling for reliability tests. Weber's only pertinent recommendation is that "The best test of the clarity of category definitions is to code a small sample of the text." Stempel concludes that reliability estimates "should be based on several samples of content from the material in the study" and that a "minimum standard would be the selection of three passages to be coded by all coders."

Wimmer and Dominick urge analysts to conduct a pilot study on a sample of the "content universe" and, assuming satisfactory results, then to code the main body of data. Then a subsample, "probably between 10% and 25%," should be reanalyzed by independent coders to calculate overall intercoder reliability.

Kaid and Wadsworth suggest that "levels of reliability should be assessed initially on a subsample of the total sample to be analyzed before proceeding with the actual coding." How large a subsample? "When a very large sample is involved, a subsample of 5-7 percent of the total is probably sufficient for assessing reliability."

Most texts do not discuss reliability in the context of probability sampling and the resulting sampling error. Singletary has noted that reliability checks introduce sampling error when probability samples are used. Krippendorf argues that probability sampling to get a representative sample is not necessary. Yet early inquiries into reliability testing did address probability sampling. Scott's article introducing his pi included an equation accounting for sampling error, though that component was dropped from subsequent references to pi in statistics and content analysis texts. Cohen discussed sampling error while introducing kappa. An early article by Janis, Fadner, and Janowitz comparing reliability of different coding schemes provided reliability coefficients with confidence intervals.

Schutz dealt with measurement error and sample size. He explored the impact of "chance agreement" on reliability measures: i.e., some coder agreements could occur by chance, though the existence of coding criteria reduces the influence chance could have.

Schutz offered a formula that enabled a researcher to set a minimum acceptable level of reliability and then compute the level that must be achieved in a reliability test to account for chance agreements. The formula allows a researcher to be certain that the observed sample reliability level is high enough that, even if chance agreement could be eliminated, the "remainder" level of agreement would exceed the acceptable level. For example, if the minimal acceptable level of agreement is 80%, the researcher might need to achieve a level as high as 83% ...


FIGURE 1
Why Reliability Confidence Interval Uses a One-Tailed Test

[Figure: a continuum for level of agreement in coding decisions, running from 0% to 100%. The minimal acceptable level (80%) and the sample agreement level (90%) are marked, with a -5%/+5% confidence interval around the sample level. The shaded span between 80% and 90%, on the negative side of the interval, is the relevant area for determining acceptability of the reliability test.]

... conclude that the "true" reliability of the population equals or exceeds the minimal acceptable level.

The reason for a one-tailed confidence interval is illustrated in Figure 1. The minimal acceptable agreement level is 80%, and the sample level of agreement is 90%. The resulting area of concern is the gray area between 90% and 80%, which involves the negative side of the confidence interval. A researcher's conclusion of acceptable reliability is not affected by whether the population agreement exceeds 5% on the positive side because acceptance is based on a minimal standard, which would fall on the negative side of the interval.

For simplicity, this analysis uses "simple agreement" (total agreements divided by total decisions) with a dichotomous decision (the coders either agree or disagree).

Survey researchers use the formula for standard error of proportion to estimate a minimal sample size necessary to infer to the population at a given level of confidence. A similar procedure is used here. We start with the equation for the standard error of proportion and add the finite population correction (FPC). The FPC is used when the sample makes up 10% or more of the population. It reduces the standard error but is often ignored because it has little impact when a sample is a small proportion of the population. The resulting formula is:

SE = \sqrt{\frac{PQ}{n-1}} \times \sqrt{\frac{N-n}{N-1}}   (Equation 1)


where N = the number of content units in the study, P = the population level of agreement, Q = (1 - P), and n = the sample size of the reliability check. Solving Equation 1 for n yields Equation 2:

n = \frac{(N-1)(SE)^2 + PQN}{(N-1)(SE)^2 + PQ}   (Equation 2)

Equation 2 allows the researcher to solve for n, which represents the number of test units. In order to solve for n, the researcher must follow five steps:

Step 1. The first step is to determine N (the number of content units being studied). It usually has been determined before reaching the point of checking for the reliability of the instrument.

Step 2. The researcher must determine the acceptable level of probability for estimating the confidence interval. We assume most content analysts will use the same levels of probability for the sampling error in intercoder reliability checks as are used with most sampling error estimates, i.e., the 95% (p = .05) and 99% (p = .01) levels of probability.

Step 3. Once the acceptable probability level is determined, the formula for confidence intervals is used to calculate the standard error (SE). The

formula is:

Confidence interval = Z (SE)   (Equation 3)

Z is the standardized point on the normal curve that corresponds with the acceptable level of probability.

Step 4. The researcher must set a minimal level of intercoder reliability for the test units. Content analysis texts warn that an acceptable level of intercoder reliability should reflect the nature and difficulty of the categories and content. For example, a minimum level of 80% simple agreement is often used with new coding procedures, a level consistent with minimal requirement recommendations by Krippendorf and the analysis of Schutz. But this level is lower than recommended by others.

Step 5. The level of agreement in coding all study units (P) must be estimated. This is the level of agreement among all coders if they coded every content unit in the study. This step is the most difficult because it involves estimating the unknown population reliability figure. Two approaches are possible. The first is to estimate P based on a pretest of the coding instrument and on previous research. The second is to assume a P that exceeds the minimal acceptable reliability figure by a certain level.

The second approach creates the question: How many percentage points above the minimal reliability level should P be? For this analysis, it will be assumed that the population level should be set at 5 percentage points above the minimal acceptable level of agreement. For example, if the minimal acceptable reliability figure is .8, then the assumed P would be .85. Five percentage points is useful because it is consistent with a confidence interval of 5%. If the reliability figure equals or exceeds .85, chances are 95 out of 100 that the population (content units in the study) figure equals or exceeds .8.

Once the five steps have been taken, the resulting figures are plugged into Equation 2 and the number of units needed for the reliability test is determined.
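To make the five steps concrete, here is a minimal Python sketch of Equation 2, anticipating the worked example that follows. The function name is ours; the one-tailed z value of 1.64 and the rounding of SE to .03 follow the text:

```python
def reliability_sample_size(N, P, se):
    """Equation 2: number of test units for an intercoder reliability check.

    N  : number of study units (Step 1)
    P  : assumed population level of agreement (Step 5)
    se : standard error from Equation 3, SE = confidence interval / Z (Step 3)
    """
    pq = P * (1 - P)
    return ((N - 1) * se**2 + pq * N) / ((N - 1) * se**2 + pq)

z = 1.64                   # Step 2: 95% probability, one-tailed
se = round(0.05 / z, 2)    # Step 3: .05 / 1.64, rounded to .03 as in the text
# Step 4: minimal acceptable agreement of .85 ...
P = 0.90                   # Step 5: ... so assume P five points higher
print(reliability_sample_size(N=1000, P=P, se=se))  # ~91.9 -> 92 test units
```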


Confidence interval = Z (SE)   (Equation 3)

Our example confidence interval is 5% and our desired level of probability is 95%. So,

.05 = 1.64 (SE)

or, SE = .05 / 1.64 = .03

Recall that our formula for sample size begins with SE,

SE = \sqrt{\frac{PQ}{n-1}} \times \sqrt{\frac{N-n}{N-1}}   (Equation 1)

and becomes

n = \frac{(N-1)(SE)^2 + PQN}{(N-1)(SE)^2 + PQ}   (Equation 2)

Now we can plug in our numbers and determine how large a random sample we will need to achieve at minimum the standard 85% reliability agreement, with 1,000 study units and an assumed true agreement level of 90%. Thus, PQ = .90 (.10) = .09. Our confidence interval is .05, and the resulting SE at the p = .05 confidence level was .03, squared to .0009. So Equation 2 looks like

n = \frac{(999)(.0009) + (.09)(1000)}{(999)(.0009) + .09} = \frac{90.899}{0.989} = 91.9

In other words, if we achieve at least 90% agreement in a simple random sample of 92 test units (rounded from 91.9) taken from 1,000 study units, chances are 95 out of 100 that 85% or better agreement would exist if all study units were coded by all coders and reliability measured.

Table 1 solves Equation 2 for n with three hypothetical levels of P (85%, 90%, and 95%) and with numbers of study units equal to 100, 250, 500, 1,000, 5,000, and 10,000. The sample sizes are based on a confidence interval with 95% probability. The table demonstrates how higher P levels and smaller numbers of study units affect the number of test units needed. However, the number of test units needed decreases much faster with higher levels of P than with the decline in the number of study units.

Table 2 assumes the same agreement levels as Table 1. However, Table 2 presents numbers of test units for the 99% level of probability. The figures for a given number of study units and agreement level are higher in Table 2 because they represent the increased number of test units needed to reach the higher level of probability.
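The tabled values can be regenerated from Equation 2. A minimal sketch, assuming the one-tailed z values used in the text (1.64 for 95%, 2.33 for 99%) and the text's rounding of SE to .03 at the 95% level; because the printed entries are rounded to whole units, regenerated figures can differ from the published ones by a unit:

```python
def eq2(N, P, se):
    """Equation 2: test units needed, given N study units, assumed
    population agreement P, and standard error se."""
    pq = P * (1 - P)
    return ((N - 1) * se**2 + pq * N) / ((N - 1) * se**2 + pq)

# Table 1 (95% probability) uses SE = .05/1.64 rounded to .03;
# Table 2 (99% probability) uses SE = .05/2.33.
for label, se in (("Table 1 (95%)", 0.03), ("Table 2 (99%)", 0.05 / 2.33)):
    print(label)
    for N in (10000, 5000, 1000, 500, 250, 100):
        print(N, [round(eq2(N, P, se)) for P in (0.85, 0.90, 0.95)])
```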


TABLE 1
Number of Content Units Needed for Reliability Test, Based on Various Population Sizes,
Three Assumed Levels of Population Intercoder Agreement, and a 95% Level of Probability

                     Assumed Level of Agreement in Population (Study Units)
Population Size
(Study Units)             85%        90%        95%

10,000                    141        100         54
 5,000                    139         99         54
 1,000                    125         92         52
   500                    111         84         49
   250                     91         72         45
   100                     59         51         36

Note: The numbers are taken from the equation for standard error of proportions and are adjusted with the finite population adjustment. The standard error was used to find a sample size that would have sampling error equal to or less than 5% for the assumed population level of agreement. The equation is

SE = \sqrt{\frac{PQ}{n-1}} \times \sqrt{\frac{N-n}{N-1}}

where P = percentage of agreement in population, Q = (1 - P), N = the population size, and n = the sample size.

The main problem in determining an appropriate sample of test units arises when the reliability test generates a confidence interval that does dip below the minimal acceptable level of reliability. For example, if the test units' reliability level equals .86, minus .05 the confidence interval dips below the minimal acceptable level of .85. This indicates that the reliability figure for the population of study units might not exceed the acceptable level of .85.

Under this condition, the researcher could randomly select more content units for the reliability check or accept a lower minimal level of agreement, say .80. If the first approach is used, the larger sample size can be determined by plugging the test units' reliability level (.86) into Equation 2 as P. Additional units could be randomly selected and added to the original test units to calculate a new reliability figure and confidence interval based on the larger sample.
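Continuing the running example with a sketch of that first approach (the resulting figure of roughly 119 units is our own computation, not a number from the article):

```python
def eq2(N, P, se):
    """Equation 2: required reliability-test sample size."""
    pq = P * (1 - P)
    return ((N - 1) * se**2 + pq * N) / ((N - 1) * se**2 + pq)

# Observed agreement on the original 92 test units was .86, not the
# assumed .90, so re-solve Equation 2 with P = .86 to size the larger sample.
n_new = eq2(N=1000, P=0.86, se=0.03)
print(round(n_new))        # ~119 test units in total
print(round(n_new) - 92)   # ~27 additional units to draw at random
```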

Limitations of the Analysis

This analysis may seem limited because it is: (a) based on a dichotomous decision, (b) with two coders, and (c) it uses a simple agreement measure of reliability. However, the first two are not limitations. Sampling error is not affected by the number of coders, who introduce measurement ...


TABLE 2
Number of Content Units Needed for Reliability Test, Based on Various Population Sizes,
Three Assumed Levels of Population Intercoder Agreement, and a 99% Level of Probability

                     Assumed Level of Agreement in Population (Study Units)
Population Size
(Study Units)             85%        90%        95%

10,000                    271        193        104
 5,000                    263        190        103
 1,000                    218        165         95
   500                    179        142         87
   250                    132        111         75
   100                     74         67         52

Note: The numbers are taken from the equation for standard error of proportions and are adjusted with the finite population adjustment. The standard error was used to find a sample size that would have sampling error equal to or less than 5% for the assumed population level of agreement. The equation is

SE = \sqrt{\frac{PQ}{n-1}} \times \sqrt{\frac{N-n}{N-1}}

where P = percentage of agreement in population, Q = (1 - P), N = the population size, and n = the sample size.

... As discussed in note 11, the researcher should randomly stratify the test units, select a larger number of test units, or both.

Equation 2 is limited, however, to nominal data because it is based on the standard error of proportions. A parallel analysis to this one for interval and ratio level categories could be developed using the standard error of means.

The use of simple agreement in reliability tests is not a problem either. At least three other measures of reliability, besides agreement among coding pairs, are available for nominal level data. These are Scott's pi, Krippendorf's alpha, and Cohen's kappa. Several discussions of the relative advantages and disadvantages of these measures are available. These three measures were developed to deal with measurement error due to chance and not with error introduced through sampling. The representativeness of a sample of test units is not dependent on the test applied.
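For readers who want to compare measures on the same test units, here is a minimal sketch of simple agreement, Scott's pi, and Cohen's kappa for two coders on nominal data. The helper names and sample labels are ours; Krippendorf's alpha is omitted because its general form requires more machinery:

```python
from collections import Counter

def simple_agreement(a, b):
    """Total agreements divided by total decisions."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def scotts_pi(a, b):
    """Scott's pi: expected agreement from pooled category proportions."""
    po = simple_agreement(a, b)
    pooled = Counter(a) + Counter(b)
    pe = sum((c / (2 * len(a))) ** 2 for c in pooled.values())
    return (po - pe) / (1 - pe)

def cohens_kappa(a, b):
    """Cohen's kappa: expected agreement from each coder's own marginals."""
    po = simple_agreement(a, b)
    ca, cb = Counter(a), Counter(b)
    n = len(a)
    pe = sum((ca[k] / n) * (cb[k] / n) for k in ca.keys() | cb.keys())
    return (po - pe) / (1 - pe)

coder1 = ["econ", "econ", "crime", "econ", "sports"]
coder2 = ["econ", "crime", "crime", "econ", "sports"]
print(simple_agreement(coder1, coder2))              # 0.8
print(scotts_pi(coder1, coder2))                     # ~0.68
print(cohens_kappa(coder1, coder2))                  # ~0.69
```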

Using the Tables

Some beginning researchers might struggle with the task of making assumptions and solving the equations. If this is the case, the two tables can ...


... leaning of news stories, take the assumed agreement level of 85% among study units. Third, find the population size in the tables that is closest to but greater than the number of study units being analyzed. Take the number of test units from the table.

For example, a researcher studying coverage of economic news in network newscasts has 425 stories from 40 newscasts selected from the previous year. Variables involve numbers of stories devoted to various types of economic news. Accepting a confidence level of 95%, the researcher would look down the 90% level of agreement column in Table 1 until she or he came to a population size of 500 (the closest sample size that is greater than 425). The number of units needed for the reliability check equals 84.
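A lookup along these lines can be scripted. The dictionary below transcribes the 90% column of Table 1; treat the sketch as illustrative:

```python
# Table 1, 90% assumed agreement, 95% probability (transcribed above).
TABLE1_90 = {100: 51, 250: 72, 500: 84, 1000: 92, 5000: 99, 10000: 100}

def test_units_needed(study_units):
    """Use the smallest tabled population size that is >= the study units."""
    size = min(n for n in TABLE1_90 if n >= study_units)
    return TABLE1_90[size]

print(test_units_needed(425))  # 84, read from the 500-unit row
```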

Summary

An inevitable question from graduate students conducting their first content analysis is how many items to use in the intercoder reliability test. This article has attempted to answer this question and to suggest a procedure for estimating sampling error in reliability samples. Of course, for sampling error to have meaning, the sample must be a probability sample. The formula used here is the unbiased estimator for simple random samples; samples based on proportion or stratification will require adjustments available in many statistics books.

When reporting reliability levels, confidence intervals should be reported with both measures of reliability. Simple agreement confidence intervals can be calculated using the standard error of proportions. The confidence intervals for Scott's pi and Cohen's kappa can be calculated by referring to the formulas presented in the original articles for these coefficients.
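For simple agreement, the interval follows directly from Equation 1. A minimal sketch using the running example's numbers (92 test units drawn from 1,000 study units, observed agreement .90, the text's one-tailed z of 1.64); the function name is ours:

```python
import math

def agreement_interval(p_hat, n, N, z=1.64):
    """Confidence bounds for observed simple agreement, using Equation 1
    (standard error of a proportion with the finite population correction)."""
    se = math.sqrt(p_hat * (1 - p_hat) / (n - 1)) * math.sqrt((N - n) / (N - 1))
    return p_hat - z * se, p_hat + z * se

lo, hi = agreement_interval(p_hat=0.90, n=92, N=1000)
print(round(lo, 3))  # ~0.85: the lower bound meets the minimal acceptable level
```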

The role of selection bias in determining reliability coefficients seems to have gotten lost since earlier explorations of reliability. This bias can only be estimated through probability sampling. The study of content needs a more rigorous way of dealing with potential selection bias. Using probability samples and confidence intervals for reliability figures would help add rigor.

NOTES

1. The analysis in this article is based on simple random sampling for reliability tests. However, under some circumstances, other forms of probability sampling, such as stratified random sampling, might be preferable for selecting reliability test samples. For example, if certain categories of a variable may make up a small proportion of the content units being studied, the researcher might oversample these categories.

2. Reproducibility reliability, also called equivalence reliability, differs from stability and accuracy reliability. Stability concerns the same coder testing reliability of the same content at two points in time. Accuracy reliability involves comparing coding results with some known standard. The term reliability is used here to refer to reproducibility. See Klaus Krippendorf ...


... Mass Communication Quantitative Research," Journalism Quarterly 70 (spring 1993): 126-32.

5. Robert Philip Weber, Basic Content Analysis, 2d ed. (Newbury Park, CA: Sage University Paper Series on Quantitative Applications in the Social Sciences, 07-075), 23.

6. Guido H. Stempel III, "Statistical Designs for Content Analysis," in Research Methods in Mass Communication, ed. Stempel and Westley, 143.

7. Stempel, "Content Analysis," 128.

8. Roger D. Wimmer and Joseph R. Dominick, Mass Media Research: An Introduction, 3d ed. (Belmont, CA: Wadsworth, 1991), 173.

9. Lynda Lee Kaid and Anne Johnston Wadsworth, "Content Analysis," in Measurement of Communication Behavior, ed. Philip Emmert and Larry L. Barker (NY: Longman, 1989), 208.

10. Michael Singletary, Mass Communication Research (NY: Longman, 1994), 297.

11. Krippendorf argues that reliability samples "need not be representative of the population characteristics" but "must be representative of all distinctions made within the sample of data at hand" (emphasis in original). He suggests purposive or stratified sampling to ensure that "all categories of analysis, all decisions specified by various forms of instructions, are indeed represented in the reliability data regardless of how frequently they may occur in the actual data" (emphasis added). See Krippendorf, Content Analysis, 146.

If a researcher suspects that some variable categories will occur infrequently in a simple random sample for a reliability check, disproportionate sampling of the less frequent categories would be useful. Frequency of categories could be estimated by a pretest, and different sampling rates could be used for categories that appear less frequently. When figuring overall agreement for reliability, the results for particular categories would have to be weighted to reflect the proportions in the study units.

This procedure might create problems when content has infrequent categories that are difficult to identify. It could require quota sampling, or selecting and checking content units for these infrequent categories until the proportion of the test units equals the estimated proportion of the infrequent categories.

Another way of handling infrequent categories would be to increase the reliability test sample size above the minimum recommended here. Larger samples will increase the probability of including infrequent categories among the test units. If the larger sample does not include sufficient numbers of the infrequent categories, additional units can be selected. This will, of course, lead to coding of additional units from categories that appear frequently, but the resulting reliability figure will be more representative of the content units being studied.

No one would argue that all variables need to be tested in a reliability check, but a large number of categories within a variable (e.g., a twenty-six-category scheme for coding the variable "news topic") could create logistical problems. Just generating a stratified reliability sample that would include sufficient numbers of units for each of these categories would be time ...


12. William A. Scott, "Reliability of Content Analysis: The Case of Nominal Scale Coding," Public Opinion Quarterly 19 (1955): 321-25.

13. Jacob Cohen, "A Coefficient of Agreement for Nominal Scales," Educational and Psychological Measurement 20 (1960): 37-46.

14. Irving L. Janis, Raymond H. Fadner, and Morris Janowitz, "The Reliability of a Content Analysis Technique," Public Opinion Quarterly (summer 1943): 293-96.

15. William C. Schutz, "Reliability, Ambiguity and Content Analysis," Psychological Review 59 (1952): 119-29.

16. Presumably, these chance agreements could lead content analysts to overestimate the extent of coder agreement due to the precision of the coding instrument. In effect, Schutz sought a way to control for the effect of those chance agreements. But just because chance could affect reliability does not mean it does, and the "odds" of agreement through randomness change once a coding criterion is introduced and used.

17. But of course it can't. Its effect can only be acknowledged and compensated for.

18. Strictly interpreted, N equals the number of coding decisions that will

be made by each coder. If reliability is checked separately for each coding category in the content analysis, then N equals the total number of units selected for the content analysis. If the reliability is checked for total decisions made in using the coding procedure, N equals the number of units analyzed multiplied by the number of categories being used. This analysis assumes that each variable is checked and reported separately, which means N equals the number of content units in the population.

19. This advice, while sound, adds a bothersome vagueness to content analysis. This is a bit like a professor's response that the length of an essay should be "as long as it takes." How long is a piece of string?

20. Krippendorf, Content Analysis, recommends generally using the .80 level for intercoder reliability, although he says some data with reliability figures as low as .67 could be reported for highly speculative conclusions. It is not clear whether Krippendorf's agreement level figures are for simple agreement among coders or for some other reliability measure. Schutz's ("Reliability, Ambiguity and Content Analysis") analysis starts with the .80 level of simple agreement. This analysis will use .80 to remain consistent with Schutz.

21. See Singletary (Mass Communication Research, 296), who states that Scott's pi of .7 is the consensus value for the statistic. Under some conditions

this would be consistent with a simple agreement of .8, but not always. Wimmer and Dominick (Mass Media Research, 181) report a rule of thumb of at least .9 for simple agreement and a Scott's pi or Krippendorf's alpha of ... for intercoder reliability.

22. Note that this is a one-tailed test. Content analysis researchers are concerned that the reliability figure exceeds a minimal level, which would fall on the negative side of a confidence interval. The acceptance of a coding instrument as reliable is not affected by whether the population reliability figure exceeds the reliability test figure on the positive side of the confidence interval.

23. Three factors affect sampling error: the size of the sample, the homogeneity of the population, and the proportion of the population in the sample.


24. ... a reliability test.

25. Scott, "Reliability of Content Analysis."

26. Krippendorf, Content Analysis.

27. Cohen, "Coefficient of Agreement for Nominal Scales."

28. For examples, see Maria Adele Hughes and Dennis F. Garrett, "Intercoder Reliability Estimation Approaches in Marketing: A Generalization Theory Framework for Quantitative Data," Journal of Marketing Research 27 (May 1990): 185-195; and Richard H. Kolbe and Melissa S. Burnett, "Content-Analysis Research: An Examination of Applications with Directives for Improving Research Reliability and Objectivity," Journal of Consumer Research 18 (September 1991): 243-250.

29. Coding simple content, such as numbers of stories, typically yields higher levels of reliability because cues for coding are more explicit. The population agreement will be higher than for coding schemes that deal with word meanings. A lower reliability figure is an acceptable tradeoff for

studying categories that concern meaning.

30. For example, see C. A. Moser and G. Kalton, Survey Methods in Social Investigations, 2d ed. (NY: Basic Books, 1972).

