Concept of a Test


Page 1: Concept of a Test

Issues of Reliability, Validity and Item Analysis in Classroom Assessment

by Professor Stafford A. Griffith

Jamaica Teachers Association Education Conference: Assessment in Education

Ritz Carlton Resort & Spa, Montego Bay

April 2-4, 2013

Page 2: Concept of a Test

Some of the earliest forms of assessment or testing may be noted in biblical references. Adam and Eve, for example, were subjected to a simple test in the Garden of Eden based on a test item presented in a negative form.

Another account is taken from Judges 12:4-6. It was an oral examination (the shibboleth test) devised by the Gileadite army to identify members of the defeated Ephraimite army who were attempting to escape under cover of a false identity.


Page 3: Concept of a Test

Outside of the biblical accounts, historians generally agree that the Chinese were the first to use large-scale testing.

These tests were introduced as early as 2000 B.C. to measure the proficiency of candidates for public office and to reduce patronage.

Today, we think of a test as an item/question, problem or task, or a mix of these, administered under prescribed conditions.

Page 4: Concept of a Test

It is designed to elicit responses that provide information to make judgements about a candidate.

It is a systematic procedure for measuring a sample of a candidate’s behaviour that can give an accurate and truthful account of the candidate’s skills, knowledge, ability or other characteristics at the time the test was administered.


Page 5: Concept of a Test

Reliability of Test Scores

Two essential requirements for a technically sound test are reliability and validity.

Reliability is the extent to which test scores are consistent or dependable.

Only to the extent that scores are reliable can they be useful in conveying information about a student’s performance.


Page 6: Concept of a Test

From a more technical standpoint, reliability is the extent to which scores are free from errors of measurement.

Classical Test Theory (CTT) defines reliability as a property that is based on three considerations:

observed scores,

true scores and

measurement errors.

Page 7: Concept of a Test

In Classical Test Theory, a person’s observed score is a function of that person’s true score, plus error.

 This may be represented simply as: Xo = Xt + Xe

Where Xo represents the observed score;

Xt represents the true score; and

Xe represents the error.
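As a rough illustration of this model, the short Python sketch below simulates true scores and errors with made-up numbers and uses the standard CTT result that reliability is the ratio of true-score variance to observed-score variance; the particular means, spreads and sample size are purely illustrative.

import random
import statistics

random.seed(1)

# Illustrative simulation of the CTT model Xo = Xt + Xe (all numbers made up)
true_scores = [random.gauss(60, 10) for _ in range(1000)]    # Xt
errors      = [random.gauss(0, 5) for _ in range(1000)]      # Xe, mean zero
observed    = [t + e for t, e in zip(true_scores, errors)]   # Xo = Xt + Xe

# In CTT, reliability is the share of observed-score variance that is true-score variance
reliability = statistics.pvariance(true_scores) / statistics.pvariance(observed)
print(round(reliability, 2))   # roughly 10**2 / (10**2 + 5**2) = 0.80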


Page 8: Concept of a Test

The level of confidence we can have in test scores hinges on how much error we have in the observed scores of students.

Reliability, or the level of confidence we can have in test scores, is expressed as an index ranging from 0 to 1. It may therefore be .99 (high) or .10 (low).


Page 9: Concept of a Test

The reliability coefficients commonly used to determine and report on the consistency with which a test measures are derived from various approaches:

test-retest,

alternative form,

internal consistency,

split-half and

inter-rater (a special form of reliability).
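As one illustration of how such a coefficient might be obtained, here is a minimal split-half sketch in Python. It assumes dichotomously scored (0/1) items split into odd- and even-numbered halves and corrected with the Spearman-Brown formula; the student responses and the helper name split_half_reliability are illustrative only.

import statistics

def split_half_reliability(item_matrix):
    """Split-half reliability with Spearman-Brown correction.

    item_matrix: one row per student, one 0/1 column per item.
    The odd/even split and the Spearman-Brown step are common choices,
    assumed here for illustration.
    """
    odd_half  = [sum(row[0::2]) for row in item_matrix]   # score on odd-numbered items
    even_half = [sum(row[1::2]) for row in item_matrix]   # score on even-numbered items

    # Correlation between the two half-test scores
    r_half = statistics.correlation(odd_half, even_half)

    # Spearman-Brown correction estimates reliability of the full-length test
    return 2 * r_half / (1 + r_half)

# Made-up responses for six students on eight items
responses = [
    [1, 1, 1, 0, 1, 1, 0, 1],
    [1, 0, 1, 1, 0, 1, 1, 0],
    [0, 1, 0, 0, 1, 0, 0, 1],
    [1, 1, 1, 1, 1, 1, 1, 0],
    [0, 0, 1, 0, 0, 0, 1, 0],
    [1, 1, 0, 1, 1, 1, 0, 1],
]
print(round(split_half_reliability(responses), 2))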

Page 10: Concept of a Test

Validity of Test Scores

Validity is the extent to which a test does the job for which it is intended.

Essentially, validity is about what inference can be made from the scores obtained on an instrument.


Page 11: Concept of a Test

The most widely encountered discussions refer to three lines of validity evidence:

content validity (representativeness of the domain);

criterion-related validity (correlation with/prediction of scores from another instrument);

construct validity (association with some theoretical construct).


Page 12: Concept of a Test

Validity is the most important technical quality of a test.

An important way of assuring, or assessing, validity is to use a subject-matter-by-behaviour grid called a specifications table or a table of specifications.

It helps to define the weighting to be given to various subject matter and behaviours (or objectives or skills).

It helps to avoid the testing of extraneous material.


Page 13: Concept of a Test

Example of a Table of Specifications

Content / Objective          Knowledge  Comprehension  Application  Analysis  Total
Classification of animals        2            4             4           -       10
Plants of the earth              4            4             2           -       10
Population and Evolution         3            3             -           4       10
Variation and Selection          -            1             5           4       10
Origin of the Solar System       5            2             -           3       10
Changes in Land Features         2            2             6           -       10
Total                           16           16            17          11       60

Page 14: Concept of a Test

It is important to work out the types of items/questions, their psychometric characteristics, the number of items and questions and how these will be scored.

The specifications for test construction should be so clear that two test constructors would produce tests that are comparable and interchangeable.


Page 15: Concept of a Test

Item Analysis

In writing and analysing test tasks, two critical indicators of the goodness of the tasks should be considered: the facility (or difficulty) and the discrimination.

The facility level for a task is the percentage of candidates responding correctly or satisfactorily to it.


Page 16: Concept of a Test

It is expressed as an index:

an f-value, or

a p-value (which is really the probability of a person in a particular group responding correctly or satisfactorily).

The formula for calculating p is very simple: p = R/T, that is, the number of students responding correctly to an item divided by the number of students responding to the item.

Its value ranges from 0 to 1.00.
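For example, with made-up numbers: if 24 of the 30 students who respond to an item answer it correctly, then p = R/T = 24/30 = .80, a relatively easy item.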

Page 17: Concept of a Test

The discrimination level for a task is the extent to which performance on the task separates the better candidates from the poorer ones.

The calculation of this d-index is generally more complex than the calculation of the facility index; the discrimination is often represented by a biserial or a point-biserial correlation coefficient (r).

It ranges from -1.00 to +1.00.

Page 18: Concept of a Test

Easier and relatively accurate estimates of the extent of discrimination of a task scored dichotomously are, however, obtained by:

comparing the way the top performing students perform on the task with

the way the bottom performing students perform on that task.


Page 19: Concept of a Test

The discrimination index for an item is calculated by:

ranking students according to performance on the test;

separating the top performing students and the bottom performing students;

finding the p value of the item for the top performing students and the p value for the bottom performing students;

subtracting the p value for the low performing students from the p value for the high performing students (a small sketch of these steps follows).
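A minimal Python sketch of these steps, assuming dichotomous (0/1) item scores and using the top and bottom 27% of students as the upper and lower groups (a common rule of thumb rather than anything prescribed here); the function name item_analysis and the data are illustrative only.

def item_analysis(total_scores, item_scores, group_fraction=0.27):
    """Facility (p) and discrimination (d) for one dichotomously scored item.

    total_scores: each student's total test score
    item_scores:  0/1 scores on the item, in the same student order
    group_fraction: share of students in each of the upper and lower groups
                    (0.27 is a common rule of thumb, assumed here)
    """
    n = len(total_scores)
    # Facility: proportion of all students answering the item correctly (p = R / T)
    p = sum(item_scores) / n

    # Rank students by total test score, best first
    order = sorted(range(n), key=lambda i: total_scores[i], reverse=True)
    k = max(1, int(round(group_fraction * n)))
    upper = [item_scores[i] for i in order[:k]]    # top performing students
    lower = [item_scores[i] for i in order[-k:]]   # bottom performing students

    # Discrimination: p in the upper group minus p in the lower group
    d = sum(upper) / len(upper) - sum(lower) / len(lower)
    return p, d

# Illustrative (made-up) data: 10 students, their total scores and item scores
totals = [48, 45, 44, 40, 38, 35, 30, 28, 25, 20]
item   = [1,  1,  1,  1,  0,  1,  0,  0,  1,  0]
print(item_analysis(totals, item))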


Page 20: Concept of a Test

The table indicates how students performed on an item with four possible responses (A, B, C and D). The correct response is C.

Response       A   B   C   D
Upper Group    -   2   8   -
Lower Group    4   3   2   1

The facility index of the item is (a) 1.00 (b) .10 (c) .05 (d) .50

The discrimination index of the item is (a) 6 (b) .60 (c) .06 (d) .66
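Working the item through the definitions above: 8 of the 10 upper-group students and 2 of the 10 lower-group students chose the correct response C, so the facility index is (8 + 2)/20 = .50 and the discrimination index is 8/10 - 2/10 = .60.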


Page 21: Concept of a Test

Summary

Based on our discussions, I trust that in developing and using tests for assessment in the classroom, you will consider the need to:

provide scores that are reliable;

provide scores that are valid;

develop and use items/tasks that are at the right difficulty level;

develop and use items/tasks that can discriminate between those who have the desired competences and those who do not.


Page 22: Concept of a Test

Thank you.

Professor Stafford A. Griffith, Director of the School of Education, UWI, Mona