Examining Teacher Grades Using Rasch Measurement Theory


Journal of Educational Measurement, Spring 2009, Vol. 46, No. 1, pp. 1–18


Jennifer Randall
University of Massachusetts

George Engelhard, Jr.
Emory University

In this study, we present an approach to questionnaire design within educational research based on Guttman’s mapping sentences and Many-Facet Rasch Measurement Theory. We designed a 54-item questionnaire using Guttman’s mapping sentences to examine the grading practices of teachers. Each item in the questionnaire represented a unique student scenario that was graded by teachers. Three focus groups of elementary (N = 5), middle (N = 4), and high school (N = 2) teachers examined the scenarios for clarity, comprehensiveness, and ease of understanding. Based on the suggestions of the focus groups, the revised questionnaires were completed by 516 public school teachers located in a major metropolitan county in the Southeast. The grades assigned by the teachers to the scenarios were analyzed using the FACETS computer program. The results of the analyses suggest that teachers primarily assign grades on the basis of student achievement as expected, although for some teachers other facets (ability, behavior, and effort) may play a role in final grade assignment.

The use of Guttman’s (1977) mapping sentences provides one approach to questionnaire development in educational research by providing researchers with a framework to systematically consider a variety of relevant, logical, and substantive facets of behaviors, attitudes, and interests. Guttman’s mapping sentences, a component of Guttman’s Facet Design (Donald, 1995), allow educational researchers to succinctly specify the content universe of the questionnaire. Each mapping sentence is composed of several explanatory or independent variables (called facets) that are hypothesized to influence the response or dependent variable. These facets can be generated through both research and experience, and they are designed to “precisely describe the boundaries of the research, define the content universe and are directly used to generate questions or observations” (Donald, 1995, p. 118). The response variable typically represents some type of appraisive behavior. Appraisive behavior involves judgments such as risk, seriousness, beauty, and fairness, which may or may not imply a preference (Roskam & Broers, 1996), and responses are never simply correct or incorrect. A benefit of Guttman’s mapping sentences is that aspects of the entire content universe consisting of multiple facets can be explored systematically and logically.

To illustrate the utility of Guttman’s mapping sentences in educational research, we designed a questionnaire to examine the grading practices of public school teachers. Because teachers’ final grade assignments require judgments and the appraisal of students, this topic seemed especially appropriate for demonstrating the

Copyright © 2009 by the National Council on Measurement in Education


use and value of mapping sentences. The first step in developing a mapping sentence is to identify the content universe in order to determine the particular facets to be studied. A review of the literature suggests that teacher grades as a construct primarily consist of some unknown combination of student academic achievement (criterion-referenced performance), ability (similar to aptitude, but not student classroom achievement), effort (how hard a student works), and behavior (student conduct in class). The Appendix provides the operational definitions of these four facets that comprise the content universe for this study.

The effort and ability facets are the most frequently reported nonachievement factors considered by classroom teachers (Cicmanec, Johanson, & Howley, 2001; Cross & Frary, 1999; Frary, Cross, & Weber, 1992). McMillan, Myran, and Workman (2002) reported that 47% of elementary teachers considered ability in final grade assignment from “quite a bit” to “completely.” Brookhart (1993) found that teachers at all grade levels routinely considered both a student’s level of ability and effort when assigning final grades based on hypothetical scenarios. McMillan and Nash (2000) reported similar results, with elementary and secondary mathematics and English teachers admitting consideration of achievement as well as nonachievement factors, like effort and participation, when assigning final grades. Furthermore, Bursuck et al. (1996) found that teachers at all grade levels consider a variety of facets when assigning students’ grades, including effort, notebooks, attendance, class participation, and preparedness/organization (all of which could be considered proxies for effort); whereas elementary school teachers were more likely than middle and high school teachers to adjust grades based on ability. Feldman, Alibrandi, and Kropf (1998) also reported that 16% of high school teachers used ability as the basis for grade assignment. Stiggins, Frisbie, and Griswold (1989) reported that, although classroom achievement was the most important factor in grade assignment by high school teachers, they were also inclined to give significant consideration to effort, measured by homework completion and extra credit. These findings were echoed in McMillan’s (2001) survey of 1,483 middle and high school teachers, where he found that the factors most used by teachers were academic performance and the mastery of specific learning objectives, but other factors including student effort, ability level, class participation, and quality of homework were used to some extent. Similarly, in other studies, secondary teachers have reported raising grades for high effort fairly often (Cross & Frary, 1999). Zoeckler (2007) found that for a sample of high school English teachers, grades were influenced by grading systems, perceptions of student effort, and concern for the moral development of the student.

Studies have suggested that the grade construct has a student behavior component as well (Cizek, Robert, & Fitzgerald, 1995; Frary et al., 1992). Frary et al. found that 31% of high school teachers agreed, or tended to agree, that behavior should affect the grade. In the Stiggins et al. (1989) study of high school teachers, all participants reported consideration of attitude when making decisions about borderline cases. McMillan and Lawson (2001), however, found that for secondary teachers (grades 6 to 12) disruptive student behavior contributed little, if anything, to student grades.

The primary purpose of this research was to illustrate the use of Guttman’s mapping sentences for systematic questionnaire development within the framework of item response theory. A secondary purpose was to develop a line of inquiry


into grading practices by systematically collecting information on teachers’ grading practices, and to clarify the measurement construct of grading through the use of Guttman’s mapping sentences. Specifically, this study addressed the question of whether teacher-assigned grades are influenced by the classroom achievement, ability, behavior, and effort of students.

We assessed the success of these purposes with three main criteria: (a) the ease of use, and understanding, of the questionnaire as reported by focus group teachers and comments from surveyed participants; (b) the alignment of the quantitative results, to some extent, with previous research as well as our own qualitative data; and (c) the substantive value of the information provided by our choice of analysis beyond that of more traditional analyses, for example, analysis of variance.

Method

Questionnaire

The complete questionnaire has a total of 54 items, with each item representing a student scenario. For example, Table 1 presents four scenarios with their respective facets and structs identified. Each unique combination of structs is referred to as a

TABLE 1
Sample Scenarios

Facet:                  Ability    Achievement    Behavior        Effort

Struct:                 High       High           Excellent       Great Deal
Scenario (Structuple):  Jonathan is a student with high ability, based on
intelligence tests administered by the school. His behavior in class is always
excellent. He rarely talks out of turn and has great manners. He works hard
and, based on project, test, and quiz scores, you know that he has mastered
89% of the course objectives.

Struct:                 Average    High           Inappropriate   Low
Scenario (Structuple):  Glenda is a student with average ability, based on
intelligence tests administered by the school. Her behavior is completely
inappropriate. She talks out of turn often in class and is often disobedient.
She does not work very hard, but based on project, test, and quiz scores, you
know that she has mastered 89% of the course objectives.

Struct:                 Average    Low            Excellent       Great Deal
Scenario (Structuple):  Willie is a student with average ability, based on
intelligence tests administered by the school. His behavior in class is always
excellent. He rarely talks out of turn and has great manners. He works very
hard, but based on project, test, and quiz scores, you know that he has
mastered 69% of the course objectives.

Struct:                 Low        Average        Inappropriate   Low
Scenario (Structuple):  Donna is a student with low ability, based on
intelligence tests administered by the school. Her behavior is completely
inappropriate. She talks out of turn often in class and is often disobedient.
She does not work hard, and based on project, test, and quiz scores, you know
that she has mastered 79% of the course objectives.


structuple, and each structuple represents a unique student scenario. We use the term facet to represent the independent variables and structs to represent the levels within those facets. For the purposes of this study, every possible combination of structs was formed: ability (low, average, and high), achievement (low, average, and high), behavior (inappropriate, average, and excellent), and effort (low, great deal). This yields 54 scenarios on the questionnaire.
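
The full crossing of structs (3 × 3 × 3 × 2 = 54) can be sketched as follows; this is an illustrative reconstruction, not the authors’ code:

```python
from itertools import product

# Facets and their structs (levels), as defined in the study
facets = {
    "ability": ["low", "average", "high"],
    "achievement": ["low", "average", "high"],
    "behavior": ["inappropriate", "average", "excellent"],
    "effort": ["low", "great deal"],
}

# Every structuple (unique combination of structs) yields one scenario item
structuples = list(product(*facets.values()))
print(len(structuples))  # 3 * 3 * 3 * 2 = 54
```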

Figure 1 illustrates the mapping sentence that guides this study. A fractional-replication design was used (Roskam & Broers, 1996) to avoid possible fatigue effects for the teachers.

The 54 scenarios were divided into three forms with 36 scenarios in each form. Using the complete questionnaire as the master numbered from 1 to 54, Form A contained items 1–36, Form B contained items 19–54, and Form C contained items 1–18 and 37–54. These common blocks of scenarios were used to link and calibrate all of the scenarios onto a common logit scale. Each teacher received either Form A, Form B, or Form C, with directions that asked them to read each scenario carefully as if it were their own class and to assign both a letter and numeric grade. Due to a processing error, one scenario/structuple was inadvertently not placed on any form. The missing scenario described a student with high ability, low achievement, excellent behavior, and a great deal of effort. After all data were collected, we examined the responses to be sure that each questionnaire item (combination of structs) was sufficiently represented. Every item was responded to by teachers a minimum of 172 times.
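
A quick check of this linking design (assuming Form C spans items 1–18 and 37–54, which is consistent with 36 items per form) confirms that each pair of forms shares a common block of items:

```python
# Hypothetical reconstruction of the three linked forms
form_a = set(range(1, 37))                       # items 1-36
form_b = set(range(19, 55))                      # items 19-54
form_c = set(range(1, 19)) | set(range(37, 55))  # items 1-18 and 37-54

for form in (form_a, form_b, form_c):
    assert len(form) == 36                       # 36 scenarios per form

# Common blocks link every pair of forms onto one scale
print(len(form_a & form_b), len(form_b & form_c), len(form_a & form_c))  # 18 18 18
```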

Focus Groups

Three focus groups, composed entirely of high school, elementary, and middle school teachers, respectively, were formed to consider the comprehensiveness and ease of understanding of the questionnaire and each scenario. The main purpose of

A teacher considers a student’s

    Ability:           Low | Average | High
    Achievement:       Low | Average | High
    Behavior:          Inappropriate | Average | Excellent
    Degree of Effort:  Low | Great Deal

and assigns a final letter grade:

    A, A-, B+, B, B-, C+, C, C-, D, or F

FIGURE 1. Guttman’s mapping sentence. Ability, Achievement, Behavior, and Effort are the facets, and the structs are the levels listed for each facet.


these focus groups was to obtain feedback regarding the questionnaire design, including the clarity, ease of understanding, word choice, and length of the questionnaire. A secondary purpose of the focus groups was to allow teachers the opportunity to share their perceptions about appropriate grading practices, particularly in relation to the questionnaire and the student scenarios under review.

Based on the recommendations of Krueger (1994), each focus group was no larger than five teachers (5, 4, and 2 at the elementary, middle, and high school levels, respectively) to allow participants more opportunities to speak and share their views. Focus group participants were selected through the snowball technique. At each grade level one teacher was chosen who met the selection criteria. That teacher then nominated other colleagues who also met the selection criteria. All focus group participants had at least 5 years of teaching experience and currently taught in a public school. Primarily, participants made recommendations that would improve the clarity of the scenarios, including suggestions regarding word choice. Changes in the questionnaire were made if a majority of the teachers present agreed that the change would improve the instrument. In cases where there was a tie, the moderator/primary researcher cast the deciding vote. Teacher-participants also provided more substantive/content-related feedback regarding the initial questionnaire. We considered and integrated all of the feedback they provided in light of the purposes of the study. The focus group feedback provided evidence that we had successfully met our first criterion for success: a questionnaire that was meaningful, comprehensible, and easy to use.

Participants

The final questionnaires were distributed in a large metropolitan county in the southeastern United States. The school district is the largest district in the state and consists of approximately 152,000 students with 60 elementary schools, 20 middle schools, and 15 high schools. The participating county was also, at the time of data collection, involved in a system-wide evaluation of its grading policy. The existing grading policy requires that all teachers assign grades based on student academic progress alone as defined by the county’s content standards. At both the elementary and middle school levels, separate grades are assigned for conduct and effort, and basing final grades on either the conduct or the effort of students is explicitly forbidden. Individual schools were randomly selected from within each level to participate. Principals in the selected schools were asked to send an e-mail link to the online questionnaire to the teachers within their schools. The total number of potential respondents was approximately 2,400 teachers (800 elementary, 800 middle school, and 800 high school). Assuming all principals forwarded the request to their teachers, as they were asked to do, the response rate was approximately 21.5%.

Table 2 includes the demographic information for our sample. This information is missing for 174 of the 516 teachers because they opted not to provide it. The data from these teachers were included in the analysis.

Procedures

The rating scale model (RSM) extension of the Rasch model (Rasch, 1980) provides a framework for analyzing ratings in two or more ordered categories


TABLE 2
Demographic Information of Teachers

                                            Grade Level
Demographic Information          Elementary   Middle   High    Total(a)
Number of teachers                    79        155     108
Percentage of sample                15.3       30.2    20.9
Total years teaching
  Mean                               9.9        9.5    10.6      9.9
  Standard deviation                 7.0        6.5     6.0      6.5
Measurement course (percentages)
  No course                         57.7       64.3    74.0     65.4
  Course                            38.5       35.1    23.2     32.1
  Missing                            3.9         .7     2.8      2.1
Academic level (percentages)
  General                           50.4       52.7    48.1     50.7
  Special education                  1.3        2.6      .9      1.8
  AP/honors/gifted only              1.3       12.5    14.8     10.7
  General and special education      1.3         .0      .0       .3
  General and honors                  .0       16.1    22.2     14.3
  Other                               .9        4.5     1.8      2.8
  Missing                           44.8       11.6    12.1     19.4
Gender (percentages)
  Women                             98.3       86.4    89.8     90.2
  Men                                 .4       12.9    10.2      9.2
  Missing                            1.3         .7      .0       .6
Subject area: middle and high only (percentages)
  Elementary                       100
  Social studies                               14.9    18.5     12.6
  Science                                      18.1    23.1     15.6
  Math                                         29.3    25.0     21.5
  English                                      21.0    17.6     16.0
  Foreign language                              8.3    10.2      7.0
  Other                                         1.9      .0      1.1
  Missing                                       6.4     5.6     22.4
Race/ethnicity (percentages)
  White                             43.6       74.9    65.7     64.8
  African American                  38.5        8.3    14.8     17.3
  Hispanic                            .0         .7     5.6      2.1
  Other                              3.8        1.9     2.8      2.6
  Missing                           14.1       14.2    11.1     13.2

(a) The total number of teachers who responded to the online questionnaire was 516, with 174 teachers not providing demographic information. The responses of all 516 teachers are included in the data analysis.

(Engelhard, 2005; Wright & Masters, 1982). Within this framework we consider an A–F grading scale to be an ordinal scale comprising a fixed number of mutually exclusive response categories. Ten progressive and exhaustive categories were identified: F to A. The letter grades were converted to a 0 to 9 rating scale (0 = F, 1 =


D, 2 = C–, 3 = C, 4 = C+, 5 = B–, 6 = B, 7 = B+, 8 = A–, 9 = A). To analyze these data we used the FACETS 3.62 (Linacre, 2007) computer program, which provides estimates for the parameters of the many-facet Rasch model.
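
The conversion above amounts to a simple lookup table; a minimal sketch:

```python
# The 10 ordered grade categories mapped onto a 0-9 rating scale,
# as described in the text (no D+ or D- categories were used)
GRADE_TO_SCORE = {
    "F": 0, "D": 1, "C-": 2, "C": 3, "C+": 4,
    "B-": 5, "B": 6, "B+": 7, "A-": 8, "A": 9,
}

print(GRADE_TO_SCORE["B+"])  # 7
```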

The FACETS model (Linacre, 1994) is a further extension of the Rasch model that allows for the modeling of grade assignment as a function of multiple facets (the four student characteristics). Conceptually, this is similar to regression analysis: the response variable is the logistic transformation of ratios of final grade probabilities, and the independent variables are the four facets. In the case of this study, the natural logarithm of the odds (referred to as logits) is expressed as

ln (Pnjqrmk / Pnjqrmk−1) = θn − CAj − Aq − Er − Bm − Tk,

where Pnjqrmk is the probability that a teacher with grading policy θn will assign a student a grade of k when the described student has various levels of classroom achievement (CAj), ability (Aq), effort (Er), and behavior (Bm). Here, Tk represents the difficulty of moving from a grade of k−1 to k (e.g., B– to B), assuming that teachers are using the same grading scale (as defined by the school system) and are interpreting the scale similarly. To be sure that, in general, teachers were using the full grading scale as intended, we examined the grading category usage using Linacre’s (1999) guidelines for investigating rating scale category utility before we began data analysis.
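
Under this adjacent-categories formulation, the probability of each grade category follows from cumulative sums of the step logits. A minimal sketch (the parameter values are hypothetical, not estimates from the study):

```python
import math

def category_probs(theta, ca, a, e, b, taus):
    """Rating-scale many-facet Rasch model: probability of each grade category.

    theta: teacher grading policy; ca, a, e, b: achievement, ability,
    effort, and behavior measures; taus: thresholds T_k for steps k-1 -> k.
    """
    eta = theta - ca - a - e - b       # linear part shared by every step
    log_num = [0.0]                    # unnormalized log-probabilities
    for tau in taus:
        log_num.append(log_num[-1] + eta - tau)
    m = max(log_num)                   # stabilize before exponentiating
    expd = [math.exp(v - m) for v in log_num]
    total = sum(expd)
    return [v / total for v in expd]

# Hypothetical teacher and scenario; 5 thresholds -> 6 grade categories
probs = category_probs(0.5, -0.3, 0.0, 0.4, 0.1, [-3.0, -1.5, 0.0, 1.5, 3.0])
print(round(sum(probs), 6))  # 1.0
```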

The FACETS program yields a logit scale that theoretically can vary from –∞ to +∞. In this case, the teacher-grading facet is positively oriented, with negative logits representing more severe grading practices (lower grades) and positive logits representing more lenient grading practices (higher grades). A logit scale is simply an interval scale in which the unit intervals between the locations on the variable map have a uniform value or meaning. The logit measures for θn, CAj, Aq, Er, and Bm are estimated using joint maximum likelihood methods.

To assess model fit, both the outfit mean square and infit mean square statistics (Wright & Masters, 1982) were used. In addition, we examined the reliability of separation coefficient for each facet. This coefficient provides a measure of the degree to which the “elements” within a facet (i.e., the individual structs) are separated, representing the ratio of the true score variance to the observed score variance (Wright & Masters), and is defined as

R = (SD² − MSE) / SD²,

where SD² is the observed variance of the element difficulty for a facet and MSE is the mean square calibration error for each element within a facet. Larger differences between the structs or elements within a facet will yield higher reliability of separation coefficients.
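
The coefficient can be sketched directly from this definition; whether FACETS uses the population or sample variance here is an assumption of this illustration:

```python
def separation_reliability(measures, standard_errors):
    """R = (SD^2 - MSE) / SD^2: share of observed variance that is 'true'."""
    n = len(measures)
    mean = sum(measures) / n
    sd2 = sum((m - mean) ** 2 for m in measures) / n   # observed variance
    mse = sum(se ** 2 for se in standard_errors) / n   # mean calibration error
    return (sd2 - mse) / sd2

# Hypothetical struct measures and standard errors for one facet: widely
# separated elements with small errors yield a reliability near 1.0
print(round(separation_reliability([-1.5, 0.0, 1.5], [0.02, 0.02, 0.03]), 4))
```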

Results

Recall that we examined teacher grading scale usage before moving forward with additional data analysis to be sure that teachers were using the entire 10-category


grading scale. Based on Linacre’s (1999) guidelines, we found that teachers, overall, tended to use different grading categories than those made available to them (as determined by the school district in which they taught). Specifically, the 10-category model did not result in a model where the average logit measures for each category advanced in order as the categories (grades) increased, signifying irregular patterns of usage in categories. Moreover, close examination of the 10-category model revealed that the step calibrations did not advance in order (i.e., each category, in turn, is the most probable, or modal, category); and failure to do so also suggests irregular patterns. These two violations, as well as the irregular observation distribution of the 10-category model, indicated the need to combine several of the grading categories. We determined that teachers primarily used six grading categories, and the teacher grades were combined and labeled: F, D, C (C–, C, or C+), B (B– or B), B+, and A (A– or A). The analyses reported in this study are based on these six grading categories.
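
The collapsing described above can be written as a lookup from the original 10 categories into the 6 retained ones:

```python
# Collapse the 10 original grade categories into the 6 used in the analyses:
# F, D, C (C-, C, or C+), B (B- or B), B+, and A (A- or A)
COLLAPSE = {
    "F": "F", "D": "D",
    "C-": "C", "C": "C", "C+": "C",
    "B-": "B", "B": "B",
    "B+": "B+",
    "A-": "A", "A": "A",
}

print(sorted(set(COLLAPSE.values())))  # 6 distinct categories
```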

Figure 2 presents the variable map. The first column, labeled “Logit,” is the logit scale ranging from +6 to –6. Characteristics, or items, that are not well differentiated appear horizontal, or adjacent, on the logit scale. All the other columns can be discussed in reference to the logit scale. The second column, labeled “Teachers,” represents the individual participants. Teachers are represented by asterisks and plus signs (+), with each asterisk representing 18 teachers and each + representing between 1 and 17 teachers with that location on the logit scale.

Teachers at the top of the variable map tended to assign a greater number of higher final grades overall (more lenient grading policy). Teachers at the bottom of the column tended to assign lower final grades overall (more strict grading policies). The third through sixth columns correspond to the student facets as represented in the questionnaire scenarios. The final column indicates the location of the category coefficient estimates for the six-category grading scale. Elements near the top of the variable map represent aspects of scenarios receiving higher final grades, whereas elements near the bottom of the variable map represent aspects of scenarios receiving lower final grades. For instance, in the column that corresponds to student behavior, it is clear that students with excellent behavior tend to receive higher final grades than students with average or inappropriate behavior. Students with high levels of effort tend to receive slightly higher final grades than students with low effort. As expected, the trend is replicated when we observe the achievement column as well. Overall, students with higher achievement are more likely to receive higher final grades. These results are not surprising. The ability column, however, does present a less intuitive result. It appears that teachers tend to assign lower grades to students with high ability than to students with either low or average ability. These results suggest that being perceived as a high-ability student (as opposed to being perceived as a well-behaved student) is of little benefit and can, perhaps, be detrimental.

Table 3 provides a summary of the measurement report illustrated in the variable map. For teachers, this summary table provides information on both the distribution (how well they are spread across the logit scale) and fit of teacher-participants. The separation reliability of .75, as well as the chi-square test of significance, which tests the hypothesis that all teachers behaved the same when assigning final grades, suggests that individual teachers, when provided with information regarding levels


FIGURE 2. Variable map. [Figure: the logit scale (+6 to –6) with columns for Teachers, Ability, Achievement, Behavior, Effort, and the grade category coefficients. High achievement and the grades A and B+ sit near the top; excellent behavior and great effort sit above their lower structs; inappropriate behavior and the C coefficient sit near –1; low achievement, D, and F sit near the bottom.] Lines in column 7 represent the point at which the probability of teachers assigning the next higher grade begins to exceed the probability of teachers assigning the next lower grade. * = 18 teachers, + = 1 through 17 teachers.

of performance for these four student characteristics, are inclined to assign grades differently. In other words, teachers, when presented with the same students, are not all equally severe or lenient graders. The outfit mean square of .60 indicates that teachers demonstrate good fit and, on average, provided predictable responses, that is, assigned letter grades consistent with the measurement model. In general, few responses were unpredictable or aberrant.

The expected value of the outfit statistic, however, is 1.0, and the mean value of .60 suggests there is less variation in teachers’ responses than expected. The data


TABLE 3
Measurement Report Summary: Estimated Logits for Teachers and Student Facets

                                          Student Facets
                            Teachers   Ability   Behavior   Effort   Achievement
Measures
  Mean                         .28       .00        .00       .00        .00
  SD                           .64       .29       1.19       .43       4.75
  N                            516         3          3         2          3
Infit
  Mean                         .59       .57        .60       .60        .62
  SD                           .56       .06        .08       .02        .19
Outfit
  Mean                         .60       .61        .63       .63        .65
  SD                           .71       .05        .08       .02        .19
Reliability of separation      .75       .99       >.99      >.99       >.99
Chi-square statistic       1,919.2*   342.2*    7,233.4*   971.4*   106,856.8*

*p < .001.

also reveal 20 individual teachers with outfit mean square statistics greater than 2.0, suggesting their individual responses were inconsistent and did not adequately fit the measurement model. These 20 teachers, however, represent less than 4% of the total sample, and the results are not unexpected.

To illustrate the interpretation of teacher misfit, an important benefit of Rasch analysis, Figure 3 presents the standardized residual plots for three teachers who took the same form. Teacher 9 (outfit statistic = 9.0) has several standardized residuals greater than an absolute value of two, with many greater than an absolute value of six, indicating that this teacher was unusually inconsistent, or unpredictable, in responses. For example, for item 3, which represents a student with high ability, high achievement, average behavior, and high levels of effort, Teacher 9 was expected to assign a final grade of 4.68 (between a B+ and an A), but instead assigned a final grade of 3.0 (B), resulting in a standardized residual of –3.5. The standardized residual values of the responses from Teacher 11 (outfit statistic = .1) are very small, with most hovering near zero and none above the absolute value of one. In fact, for item 3 this teacher was expected to assign a final grade of 4 (B+) and did so with a final grade of 4.07. The residual plot of Teacher 4 (outfit statistic = 1.0) illustrates the amount of variance expected in teacher responses. The figure illustrates that all standardized residuals fall within the absolute value of two. Examination of Teacher 4’s final grade assignment to the student represented in item 3 reveals that the expected grade assignment was 3.51, but instead a slightly higher final grade of 4.0 was assigned, resulting in a standardized residual of .88.
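
A standardized residual compares the observed rating with the model-expected rating, scaled by the model variance. A minimal sketch (the category probabilities shown are hypothetical, not values from the study):

```python
import math

def standardized_residual(observed, probs):
    """z = (x - E[X]) / sqrt(Var[X]) for category probabilities probs[k]."""
    expected = sum(k * p for k, p in enumerate(probs))
    variance = sum((k ** 2) * p for k, p in enumerate(probs)) - expected ** 2
    return (observed - expected) / math.sqrt(variance)

# Hypothetical 6-category distribution concentrated in the top categories;
# an observed rating of 3 falls about one standard deviation below expectation
probs = [0.01, 0.02, 0.05, 0.12, 0.40, 0.40]
z = standardized_residual(3, probs)
print(round(z, 2))  # -1.05
```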

Table 4 presents the individual measurement reports for all student facets. The standardized difference between the different logit measures of the same parameter has an expectation of zero and a variance of one (Wright & Masters, 1982). Comparison tests, seen in Table 4, of the logit measures .03 (low ability), .34 (average ability), and –.37 (high ability) reveal a noteworthy difference among the three ability levels.
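
The standardized differences reported in the comparison tests follow directly from the logit measures and their model standard errors; a quick check against the ability values just cited:

```python
import math

def standardized_difference(m1, se1, m2, se2):
    """(m1 - m2) / sqrt(se1^2 + se2^2): expectation 0, variance 1 under H0."""
    return (m1 - m2) / math.sqrt(se1 ** 2 + se2 ** 2)

# Low vs. average ability, from Table 4: .03 (SE .02) vs. .34 (SE .03)
z = standardized_difference(0.03, 0.02, 0.34, 0.03)
print(round(z, 2))  # -8.6, matching the -8.60 reported in Table 4
```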


FIGURE 3. Outfit statistics for teachers. [Figure: standardized residual plots (scale –10 to +10) across the 54 scenarios for Teacher 4 (outfit = 1.0), Teacher 9 (outfit = 9.0), and Teacher 11 (outfit = .1).]

Furthermore, the outfit mean square statistics of .66 (low ability), .55 (average ability), and .63 (high ability) also suggest that the grades assigned across this particular student characteristic are what we would expect based on the measurement model, but also with less variation overall than expected.

A summary of the information included in the Achievement Measurement Report is presented in Table 4. The large standardized differences among the three levels of achievement indicate that they are well-differentiated. The outfit mean square


TABLE 4
Measurement Report by Student Facets

Observed Avg. Rating  Logit Measure  Model SE  Infit MnSq  Outfit MnSq  Ability Level
2.66                  .03            .02       .63         .66          Low
2.62                  .34            .03       .50         .55          Average
2.59                  –.37           .02       .59         .63          High

Comparison Tests    Difference Between Logits  Standardized Difference  p-value
Low vs. average     –.31                       –8.60                    <.0001
Low vs. high        .40                        14.14                    <.0001
Average vs. high    .71                        19.69                    <.0001

Reliability > .99; fixed (all same) χ² = 342.2, df = 2, p < .001

Observed Avg. Rating  Logit Measure  Model SE  Infit MnSq  Outfit MnSq  Achievement Level
.77                   –5.64          .03       .87         .84          Low
2.52                  –.35           .02       .40         .40          Average
4.42                  5.98           .02       .60         .70          High

Comparison Tests    Difference Between Logits  Standardized Difference  p-value
Low vs. average     –5.29                      223.80                   <.0001
Low vs. high        –11.62                     322.28                   <.0001
Average vs. high    –6.33                      146.72                   <.0001

Reliability > .99; fixed (all same) χ² = 106,856.8, df = 2, p < .001

Observed Avg. Rating  Logit Measure  Model SE  Infit MnSq  Outfit MnSq  Behavior Level
2.12                  –1.47          .02       .53         .60          Inappropriate
2.60                  .03            .02       .55         .54          Average
3.15                  1.45           .03       .72         .74          Excellent

Comparison Tests             Difference Between Logits  Standardized Difference  p-value
Inappropriate vs. average    –1.50                      –53.03                   <.0001
Inappropriate vs. excellent  –2.92                      –80.99                   <.0001
Average vs. excellent        –1.42                      –39.38                   <.0001

Reliability > .99; fixed (all same) χ² = 7,233.4, df = 2, p < .001

Observed Avg. Rating  Logit Measure  Model SE  Infit MnSq  Outfit MnSq  Effort Level
2.44                  –.43           .02       .58         .60          Low
2.81                  .43            .02       .61         .65          Great deal

Comparison Test     Difference Between Logits  Standardized Difference  p-value
Low vs. great deal  –.86                       30.41                    <.0001

Reliability > .99; fixed (all same) χ² = 971.4, df = 1, p < .001


statistics also indicate consistent results across these three levels (.84, .40, and .70, respectively). The summaries of the measurement reports for both Behavior and Effort are nearly identical. The standardized differences between logit measures across the student characteristic levels suggest that all levels of each student characteristic are sufficiently differentiated. Furthermore, the low outfit mean square statistics for each level suggest consistent responses across these two student characteristics. In sum, the data suggest that the achievement facet has the largest differences between structs. However, the structs, or levels, within the ability, behavior, and effort facets also show some variation, though not as much as the levels in the achievement facet.

Discussion

Questionnaire development remains a complicated endeavor, particularly in educational research. The use of Guttman's mapping sentences provides the educational researcher with a viable method of instrument development. The grading practices of teachers are certainly a topic of increasing importance to stakeholders and provide an excellent example of how this methodology can be used to gain valuable insight into teacher behavior and opinions. The results suggest that classroom achievement is the primary facet used by teachers in assigning grades. We surveyed 54 teachers in a pilot study of the questionnaire in another district, and they indicated that considering anything other than achievement is not acceptable. One teacher wrote: "Grades should reflect demonstrated knowledge and skills and content. Behavior and effort should be included, but not as a percentage, probably as comments." These grading practices are in alignment with recommended practice that discourages the consideration of facets such as behavior (Airasian, 2005; Brookhart, 2004; Stiggins, 2008). Some teachers stressed that behavior should never factor into a student's grade. Again, there were comments from the pilot study such as "I feel strongly that behavior should NOT enter into grade manipulation." Interestingly, these comments were called into question by the results obtained from our analysis of the grades that teachers assigned: it appears that students with excellent behavior are more likely to receive higher grades. Based on these findings, we were satisfied that our second criterion for success was adequately met, as the results are in alignment with both the previous research and our own discussions with classroom teachers.

Our final criterion for success can best be addressed by articulating the additional information gained by using methods employed in item response theory and how it can be used to inform our understanding of teachers' grading practices, instrument development, and future research. First, teachers did not use all the grade categories (which are identical to the categories in the county in which they teach) made available to them in this study, choosing to combine some categories and use only six (F, D, C, B, B+, and A). Second, the fit statistics allowed us to determine whether specific scenarios/items were problematic (too much unexplained variation or not enough), as well as whether specific teachers were responding unexpectedly based on the measurement model. This information can be used to remove scenarios that provide little information, or misinformation, as we move forward with additional studies. We could also use the data on teacher fit to select specific teachers (across the variability spectrum)


for further, more qualitative study (e.g., interviews or observations) to determine the specific reasons for their grading decisions.

By using teacher model fit information, we can more systematically choose teachers with varying points of view about appropriate grading practices. Finally, further analysis, beyond the scope of this paper, to examine whether scenarios (not just overall grades) functioned differentially across grade levels, subject areas, or gender can also be conducted using item response theory. Although we understand that other methods of data analysis would provide interesting insights into teachers' grading practices as well, we believe that the choice to use item response theory provides the researcher with unique pieces of information that may otherwise go unnoticed.

Limitations of Study

This study, through the use of a fractional replication design, allowed an extensive and comprehensive examination of several facets that teachers may consider when assigning final grades. The small scenario outfit mean square statistics (<.8), however, indicate smaller than expected variance in the responses. This phenomenon could be caused by teachers choosing to ignore some student characteristics, as they were certainly permitted to do, and instead relying solely on one or two student characteristics when making their final grade decisions.

Another possible limitation is that teachers tend to teach more than one academic level in public schools. It is reasonable to assume that teachers may assign final grades differently within these classes, and it is not clear which class these multiple-academic-level teachers may have considered when assigning grades to each scenario. A final limitation is the sample itself. Although the sample size is large, the response rate (21.5%) is still quite low. This response rate, coupled with the fact that 174 teachers chose not to provide their demographic information, makes it difficult for us to determine how representative our sample is of the overall teacher population.

Implications for Future Research

Using additional, or different, facets is the next step in understanding the grade construct. Some literature suggests the use of improvement (McMillan et al., 2002) as a guiding force when assigning student grades. Including a student improvement variable may provide critical knowledge about teacher grading practices. Tomlinson (2001) asserts that teachers should grade for success just as they teach for success, with grades reflecting not only normative performance but also personal growth. Teachers who assign grades based in large part on a student's progress are more concerned with how far students have come, in terms of achievement, than with where they are. The improvement facet should certainly receive further consideration in future studies of teachers' grading practices.

In this study, teachers graded scenarios representing specific student characteristics. Future research should explore more specifically the weight that teachers place on various facets. Recall that the logit measures for the ability facet were disordered (high ability results in lower grades than both average and low ability). We


believe that this disorder might be explained by (a) teachers, on average, penalizing high-ability students with low or average achievement more than average- or low-ability students; (b) teachers awarding low-ability students higher grades to compensate for their perceived inabilities; or (c) teachers simply choosing to ignore the ability facet when making decisions about final grades. Because we did not ask teachers to tell us specifically what facets/student characteristics they considered when responding to the scenarios, we cannot be sure which, if any, of these possibilities explains the disordered values. This information could be gleaned by having teachers assign a weight to each factor representing its relative importance in assigning a grade for each scenario. Another approach might be to have each teacher think aloud regarding the use of the various pieces of information during grade assignments. Future research that uses either of these techniques could provide an explanation for unexpected results such as the disordered ability facet, as well as additional insight into the grading philosophies of classroom teachers.

Grades are becoming increasingly important in our nation's public and private schools. Grades can and do become the focus of a great deal of tension among teachers, parents, and students. Some researchers argue that student knowledge and understanding of course content cannot be summed up in a simple letter or numerical grade. Jongsma (1991) contends that grades may "misinform and deceive" (p. 318), insisting that a simple letter or numerical grade implies a level of evaluative precision that is likely impossible. This lack of precision is what frustrates teachers, students, and parents and gives rise to conflict among stakeholders. More systematic and comprehensive research devoted to the study of grading practices and what teachers consider important when assigning these grades is imperative.

The primary purpose of this study was to demonstrate how Guttman's mapping sentences can be applied successfully to questionnaire development in educational research. They provide a succinct and logical approach to questionnaire design within the behavioral sciences by permitting the researcher to examine multiple facets and levels of specific behaviors and attitudes. They have typically been analyzed with methods based on multidimensional scaling, such as smallest space analysis. This study illustrates that data obtained from Guttman's mapping sentences can also be productively modeled and analyzed with methods based on item response theory.

Appendix: Operational Definitions

Classroom Achievement is defined as a student's current criterion-referenced academic performance in class only. This rating can include test scores and project scores, as well as written assignments. Classroom achievement indicates the degree to which a student has mastered the course objectives. This variable does not include attendance, behavior, or previous achievement.

High Achievement could be indicated by high test, quiz, and project scores. The student demonstrates both high levels of knowledge and skills in the course or masters 89% of course objectives.


Average/Medium Achievement could be indicated by average test, quiz, and project scores. The student demonstrates sufficient levels of knowledge and skills in the course or masters 79% of the course objectives.

Low Achievement could be indicated by low test, quiz, and project scores. The student demonstrates little knowledge or skill in the course and has mastered less than 69% of the course objectives.

Ability is defined by a student's score on any acknowledged aptitude test, such as the Cognitive Ability Test, Otis-Lennon, or Wechsler Intelligence Scale III.

High Ability would be defined as a student scoring at the 84th percentile or higher on one of these exams.

Average Ability would be defined as a student scoring between the 16th and 83rd percentiles on one of these exams.

Low Ability would be defined as a student scoring at the 16th percentile or lower.

Behavior/Conduct is defined as a student's overall demeanor.

Excellent Behavior would be defined as a cooperative student who listens in class, rarely talks out of turn, and is alert and respectful of the teacher and fellow students.

Average Behavior would describe a student who is cooperative at times, speaks out of turn occasionally, and is occasionally disrespectful.

Inappropriate Behavior would be defined as a student who talks out of turn often, does not listen in class, and is often disrespectful of the teacher and fellow students.

Effort is defined as the degree of a student's willingness and motivation to learn.

High Effort, or a great deal of effort, would be exhibited by a student who turns in all assignments (homework and class work), seeks extra help when needed, and attempts to actively participate in class.

Low Effort would be exhibited by a student who turns in assignments sparingly, rarely seeks extra help, and seldom participates in class.

References

Airasian, P. (2005). Classroom assessment: Concepts and applications (5th ed.). Boston: McGraw Hill.

Brookhart, S. (1993). Teachers' grading practices: Meaning and values. Journal of Educational Measurement, 30, 123–142.

Brookhart, S. (2004). Grading. Upper Saddle River, NJ: Pearson Education.

Bursuck, W., Polloway, E., Plante, L., Epstein, M., Jayanthi, M., & McConeghy, J. (1996). Report card grading and adaptations: A national survey of classroom practices. Exceptional Children, 62, 301–318.

Cicmanec, K., Johanson, G., & Howley, A. (2001, April). High school mathematics teachers: Grading practice and pupil control ideology. Paper presented at the annual meeting of the American Educational Research Association, Seattle, WA.

Cizek, G., Robert, R., & Fitzgerald, S. (1995, April). Further investigation of teachers' assessment practices. Paper presented at the annual meeting of the American Educational Research Association, San Francisco.

Cross, L., & Frary, R. (1999). Hodgepodge grading: Endorsed by students and teachers alike. Applied Measurement in Education, 12, 53–72.

Donald, I. (1995). Facet theory: Defining research domains. In G. M. Breakwell, S. Hammond, & C. Fife-Shaw (Eds.), Research methods in psychology (pp. 116–137). London: Sage Publications.

Engelhard, G. (2005). Item response theory (IRT) models for rating scale data. In B. Everitt & D. Howell (Eds.), Encyclopedia of statistics in behavioral science (Vol. 2, pp. 995–1003). Chichester, UK: John Wiley & Sons.

Feldman, A., Alibrandi, M., & Kropf, A. (1998). Grading with points: The determination of report card grades by high school science teachers. School Science and Mathematics, 98, 140–148.

Frary, R., Cross, L., & Weber, L. (1992, April). Testing and grading practices and opinions in the nineties: 1890s or 1990s? Paper presented at the annual meeting of the National Council on Measurement in Education, San Francisco.

Guttman, L. (1977). The mapping sentence for assessing values. In S. Levy (Ed.), Louis Guttman on theory and methodology: Selected writings (pp. 127–134). Brookfield, VT: Dartmouth Publishing Company.

Jongsma, K. (1991). Rethinking grading practices: Research to practice. The Reading Teacher, 45, 318–320.

Krueger, R. (1994). Focus groups: A practical guide for applied research. Thousand Oaks, CA: Sage Publications.

Linacre, J. (1994). Many-facet Rasch measurement. Chicago: MESA Press.

Linacre, J. (1999). Investigating rating scale category utility. Journal of Outcome Measurement, 3, 103–122.

Linacre, J. (2007). Facets Rasch measurement computer program (Version 3.62) [Computer software]. Chicago: Winsteps.

McMillan, J. (2001). Secondary teachers' classroom assessment and grading practices. Educational Measurement: Issues and Practice, 20(1), 20–32.

McMillan, J., & Lawson, S. (2001). Secondary science teachers' classroom assessment and grading practices. Richmond, VA: Virginia Commonwealth University & the Metropolitan Educational Consortium.

McMillan, J., Myran, S., & Workman, D. (2002). Elementary teachers' classroom assessment and grading practices. Journal of Educational Research, 95, 203–213.

McMillan, J., & Nash, S. (2000, April). Teacher classroom assessment and grading practices decision making. Paper presented at the annual meeting of the National Council on Measurement in Education, New Orleans, LA.

Rasch, G. (1980). Probabilistic models for some intelligence and attainment tests. Chicago: The University of Chicago Press. (Original work published 1960.)

Roskam, E. E., & Broers, N. (1996). Constructing questionnaires: An application of facet design and item response theory to the study of lonesomeness. In G. Engelhard & M. Wilson (Eds.), Objective measurement: Theory into practice (Vol. 3, pp. 349–385). Norwood, NJ: Ablex.

Stiggins, R. (2008). An introduction to student-involved assessment FOR learning (5th ed.). Upper Saddle River, NJ: Pearson Education.

Stiggins, R., Frisbie, D., & Griswold, P. (1989). Inside high school grading practices: Building a research agenda. Educational Measurement: Issues and Practices, 8(2), 5–14.

Tomlinson, C. (2001). Grading for success. Educational Leadership, 58(6), 12–15.

Wright, B. D., & Masters, G. (1982). Rating scale analysis: Rasch measurement. Chicago: MESA Press.

Zoeckler, L. (2007). Moral aspects of grading: A study of high school English teachers' perceptions. American Secondary Education, 35(2), 83–102.

Authors

JENNIFER RANDALL is an Assistant Professor, University of Massachusetts, Educational Policy, Research, and Administration, 111 Infirmary Way, Hills House South, Amherst, MA 01003; [email protected]. Her primary research interests include classroom assessment, social studies education, and applications of the Rasch model.

GEORGE ENGELHARD, Jr. is a Professor of Educational Measurement and Policy, Educational Studies, Emory University, Atlanta, GA 30322; [email protected]. His primary research interests include rater-mediated assessment, history of measurement theory, and invariant measurement.
