Post on 18-Dec-2021
Japan Society of English Language Education
NII-Electronic Library Service
The Effects of the Number of Options on Different Tasks in
Multiple-Choice Listening Tests
Hideki Iimura, Prefectural University of Kumamoto
Abstract
This study investigated the effects of different numbers of options (four-option vs.
three-option) on different listening tasks (picture vs. dialogue vs. monologue). The data were
collected from 118 Japanese university students and were analyzed from the perspective of item
difficulty, discrimination, reliability, and distractor performance. The results indicate that the
number of options had no notable impact on item difficulty in the picture and dialogue tasks, but had
a significant effect in the monologue task. The results also suggest that the number of options had
no great impact on item discrimination and reliability in all tasks, and that the number of items
containing all discriminating distractors differed between the three tasks. The overall
findings provide evidence that task difference is an important element in the study of the optimal
number of options and the development of multiple-choice listening tests.
1. Introduction
One single-answer multiple-choice item consists of one question (the stem), one correct
option (the key), and several incorrect options (distractors). In language assessment, the
multiple-choice test format is widely used because of its objectivity, automated scoring, and
reliability (Thompson, 1995). In addition, this format allows testing of receptive skills without
requiring test takers to produce written or spoken language (Hughes, 2003). Moreover, if items are
well-constructed, the multiple-choice format can provide content validity evidence by allowing a
thorough and representative sampling of the cognitive domain (Downing, 2006).
On the other hand, the multiple-choice format is often criticized because it induces random
guessing, tests only recognition knowledge, and may produce harmful washback (Hughes, 2003).
The greatest weakness of this format is that it is difficult to develop proper test items. To be more
precise, it is extremely difficult to create a sufficient number of plausible distractors. As Brown's
law (Brown, 2005) states, "when writing four-option multiple-choice items, the stem and correct
option are easy to write, and the next two distractors are relatively easy to make up, as well, but
the last distractor is absolutely impossible" (p. 49). In fact, Haladyna and Downing (1993) show
that three options (one key and two distractors) are suitable since even in well-developed tests it is
rare that more than three options are functional. Moreover, Rodriguez's (2005) meta-analysis of
80 years of research into the optimal number of options concluded that multiple-choice items
should consist of three options, because using more does little to improve the test and results in
improper distractors.
Despite the above-mentioned research, the four-option test format (one key and three
distractors) is the most widely used in domestic and international English proficiency tests such as
EIKEN, TOEIC, and TOEFL (iBT and ITP), which begs the question: do all four options function
properly in those tests? Although we cannot access the official test data directly, several empirical
studies have been conducted on the effects of the number of options in EFL testing, and this may
give us some insight into the optimal number of options in multiple-choice items. These will be
discussed in the next section.
2. Background
Although the optimal number of multiple-choice options has long been discussed
throughout educational testing, very few studies have been conducted in the language-testing
context. For instance, a meta-analysis by Rodriguez (2005), which reviewed 27 articles written
over 80 years, did not include any studies related to the L2 field. Only recently have several
studies focused on the relationship between the number of options and EFL test performance.
First, Shizuka, Takeuchi, Yashima, and Yoshizawa (2006) compared the effect of four- and
three-option formats on reading performance using a university entrance exam in Japan. They
changed an original four-option item into a three-option one by deleting the least chosen distractor,
based on the data of a previous administration of the entrance exam. The results indicated that the
average item facility and average item discrimination between the two formats were not
significantly different. In addition, their distractor analyses found that the number of effective
distractors was almost equal between the two option formats.
Second, Currie and Chiramanee (2010) investigated how three-, four-, and five-option
multiple-choice items differed from open-ended items when testing the L2 English grammar
knowledge of undergraduates in Thailand. They developed distractors for the multiple-choice
items based on participants' incorrect answers to the open-ended items. They found that there were no
significant differences in item facility between the three-, four-, and five-option items, and that the
alpha reliability coefficients were almost the same among the three option formats.
Lastly, Lee and Winke (2012) compared three-, four-, and five-option items using listening
test items equivalent to the university entrance examination in Korea. They converted the original
five-option items into four- and three-option formats by deleting the least plausible options. The
plausibility of distractors was judged by different evaluators from among graduate students, English
teachers, and high school students. They reported that average scores on the three-option tests
were significantly higher than those on the four- and five-option tests, but no difference was
found in average item discrimination between the three formats. As for reliability (Cronbach's
alpha), no consistent patterns were found that depended on the number of options.
In sum, the results of previous studies can be divided into two types: (a) the number of
options had no effect on test performance (Shizuka et al., 2006; Currie & Chiramanee, 2010), and
(b) the three-option format was easier than the other formats (Lee & Winke, 2012). The clear
difference between these studies lies in whether the test was written-mode (reading and grammar)
or spoken-mode (listening). It is reasonable to assume that the effects of different numbers of
options have more impact on listening test performance than on reading or grammar test
performance because of "the real-time nature of spoken language" (Buck, 2001, p. 6). In a
listening test, test takers have to read the three, four, or five options and choose one while
listening to the text. Thus, it is possible that fewer options could reduce the cognitive
demands on test takers.
Having said that, it is not safe to postulate that fewer options can always
increase listening test performance. As reviewed below, text type (dialogue or
monologue) might be related to listening test performance, and in that case, the effect of different
numbers of options may vary according to the type of listening text.
Several studies have investigated which type of input test takers find more difficult in
listening comprehension. In an earlier study by Shohamy and Inbar (1991), three types of listening
text (a news report, a lecture, and a consultative dialogue) were compared. It was found that the
news report (scripted monologue) was the most difficult, followed by the lecture (one speaker
interacting with an audience), and finally the dialogue (one expert and one addressee). The authors
attributed this finding to the fact that the news report consisted of a larger density of propositions
and more complex grammatical and syntactic structures than the lecture and dialogue. In contrast,
Read (2002) found that a monologue version was less difficult than a discussion version of the same
text by three speakers. The author assumed that the discussion version was more demanding
because test takers were not accustomed to unscripted/spontaneous dialogues that contained
colloquial expressions. Finally, Papageorgiou, Stevens, and Goodwin (2012) compared
relatively longer dialogues and monologues and found that in some cases monologues were more
difficult than dialogues. In sum, the relative difficulty of dialogues and monologues has been
inconclusive to date, but at least some differences in difficulty have been found between them.
Having examined the research on text types, we will now return to Lee and Winke's (2012)
study, where the three-option format was found to be easier than the four- and five-option formats
in a listening test. In their study, the listening test consisted of several task types: picture,
question-response, dialogue, and monologue. Some may be confused by referring to dialogues
and monologues as tasks, but this paper identifies these two as such, because passage types can be
subcategorized into task characteristics (Buck, 2001). In Lee and Winke's study, none of the tasks
were analyzed separately; instead, all of the data were pooled together. Given the difference
in difficulty level between dialogues and monologues reviewed above, task difference should be
included in research on the effect of the number of options on listening test performance.
3. Research Questions
In order to investigate the effect of the number of options on different tasks in multiple-choice
listening tests, the following research questions were formulated:
RQ 1: Does the number of options affect item difficulty? Are there any differences between tasks
in this respect?
RQ 2: Does the number of options affect item discrimination? Are there any differences between
tasks in this respect?
RQ 3: Does the number of options affect reliability? Are there any differences between tasks in
this respect?
RQ 4: Does the number of options affect distractor performance? Are there any differences
between tasks in this respect?
4. Method
4.1 Participants
The participants consisted of 118 undergraduates from four intact classes at two private
universities. The first two classes (n = 62) were allocated the four-option test format (Group A) and
the second two classes (n = 56) were allocated the three-option test format (Group B). In the view of
the class teachers, participants' English proficiency levels were low intermediate.
4.2 Materials
The materials used in this study were adapted from three listening sections of a TOEIC
preparation book (Educational Testing Service, 2011). The first section, Photographs, was a
picture task in which test takers heard four statements about a picture and then selected the
statement that best described the picture. These statements were not printed in the test books and
were played only once. The second section, Conversations, was a dialogue task in which test
takers heard short conversations between two people and answered three questions about what the
speakers said. The third section, Short Talks, was a monologue task in which test takers heard
talks given by a single speaker and answered three questions about what the speaker said. In both
the dialogue and monologue tasks, the questions and options were printed in the test book and the
conversation and talk were played once.
Although the original TOEIC listening test has one additional section, Question-Response, this
was omitted from this study because it originally consists of three options, and comparing the
three-option with the two-option item format was outside the scope of this study.
4.3 Three-Option Format
Following Shizuka et al. (2006), the original four-option format used in the TOEIC was
transformed into a three-option format by deleting the least popular of the three distractors (Figure
1). The least popular distractor was chosen based on item analyses in a pilot study in which 90
university students took the original test one year prior to the main study (in June 2012).
According to class teachers, the students who participated in the pilot study did not differ
considerably in English proficiency levels from those in the main study, placed as they were in the
same level of English class.
[Three-option format: 1. Key, 2. Distractor 1, 3. Distractor 2; the least popular distractor is struck out.]
Figure 1. Option deletion method.
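The deletion procedure just described can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical item structure and pilot response counts (the study's actual materials and data formats are not public):

```python
def to_three_option(item, pilot_counts):
    """Reduce a four-option item to three options by deleting the
    distractor chosen least often in a pilot administration.

    item: {"key": correct label, "options": {label: option text}}
    pilot_counts: {label: number of pilot test takers choosing it}
    Both structures are illustrative, not the study's actual format.
    """
    distractors = [lab for lab in item["options"] if lab != item["key"]]
    weakest = min(distractors, key=lambda lab: pilot_counts.get(lab, 0))
    trimmed = {lab: txt for lab, txt in item["options"].items()
               if lab != weakest}
    return {"key": item["key"], "options": trimmed}


item = {"key": "A", "options": {"A": "key", "B": "d1", "C": "d2", "D": "d3"}}
pilot_counts = {"A": 50, "B": 20, "C": 4, "D": 16}   # "C" is least popular
three_option_item = to_three_option(item, pilot_counts)
```

Deleting rather than rewriting a distractor keeps the stem and key identical across formats, which is what makes the later format comparisons meaningful.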
4.4 Procedures
The data were collected in June 2013. This study employed a common-item equating design
using Rasch modeling with the computer program WINSTEPS (Linacre, 2014a) to compare two
similar test formats that differed only in the number of options per item on the three tasks, (a)
picture, (b) dialogue, and (c) monologue, in terms of item difficulty, discrimination, reliability,
and distractor performance (see Section 5 below).
Figure 2 shows the outline of this research. Regarding the picture task, 10 four-option items
were used as anchor items and served a linking purpose. Both groups (Group A: n = 62; Group B:
n = 56) took the common test first (k = 10). Group A then took the four-option experimental test (k
= 15) and Group B took the three-option experimental test (k = 15). A similar procedure was used
for the dialogue and monologue tasks. The two groups took three different test sets with a
one-week interval between each administration.
[Figure 2 shows the design as a flow chart. Picture task: common items, 4-option (k = 10), taken by
Groups A and B (N = 118); experimental items, 4-option (k = 15, Group A, n = 62) vs. 3-option
(k = 15, Group B, n = 56). Dialogue task: common items, 4-option (k = 18); experimental items,
4-option (k = 21) vs. 3-option (k = 21). Monologue task: common items, 4-option (k = 18);
experimental items, 4-option (k = 21) vs. 3-option (k = 21).]
Figure 2. Task flow of this study.
5. Results and Discussion
5.1 Comparison of Task Difficulty for Each Group
Before investigating the effects of different numbers of options on listening tasks, the
overall difficulty of the three tasks was examined. This was necessary to reveal the characteristics
of each task. Mean percentage correct scores of the three tasks for each format were compared.
The results are reported in Table 1. In the four-option group, students scored highest in the picture
task (60.0%), followed by the dialogue task (54.8%), and finally the monologue task (41.6%).
Similarly, in the three-option group, students obtained the highest mean scores in the picture task
(70.6%), followed by the dialogue task (64.7%), and finally the monologue task (51.5%).
Table 1
Mean Percentage Correct Scores on the Three Tasks for the Two Groups

Task        No. of items   4-option format (n = 62)   3-option format (n = 56)
                           M       SD                 M       SD
Picture     15             60.0    14.8               70.6    14.6
Dialogue    21             54.8    18.9               64.7    17.5
Monologue   21             41.6    15.9               51.5    14.9
To examine these differences statistically, a Friedman test was conducted for each group.
The test was significant for the four-option group (χ²(2, N = 62) = 49.14, p < .001) and for the
three-option group (χ²(2, N = 56) = 44.67, p < .001). Follow-up pairwise comparisons were
conducted using Wilcoxon signed-rank tests. The results of these tests are shown in Table 2. A
Bonferroni correction was applied, and all effects were evaluated at a .0167 (.05/3) level of
significance. Table 2 also reports the effect size for each pairwise comparison for each task. In the
four-option group, the mean score in the monologue task was significantly lower than that in the
dialogue and picture tasks (ps < .001), with large effect sizes (r = .66 and .77, respectively). On
the other hand, the significant difference between the picture and dialogue tasks was marginal (p
= .02), with a small effect size (r = .30). Similar to the four-option group, in the three-option group
the mean score in the monologue task was significantly lower than that in the dialogue and picture
tasks (ps < .001), with large effect sizes (r = .72 and .83, respectively). There was also a
significant difference between the picture and dialogue tasks (p < .001), but the effect size was
small (r = .35).
Table 2
Follow-Up Tests for Comparing Scores Across Tasks in Two Formats

Format   4-option (n = 62)                          3-option (n = 56)
Task     Pic vs. Dia  Dia vs. Mon  Mon vs. Pic      Pic vs. Dia  Dia vs. Mon  Mon vs. Pic
Z^a      2.34         5.20         6.07             2.61         5.41         6.21
p^b      .02          < .001*      < .001*          < .001*      < .001*      < .001*
r^c      .30          .66          .77              .35          .72          .83

Note. Pic = picture; Dia = dialogue; Mon = monologue.
a Wilcoxon signed-rank test. b Asymp. sig. (2-tailed). c Effect size.
*p < .0167 (.05/3)
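The analysis in this section (a Friedman test per group, followed by Bonferroni-corrected Wilcoxon signed-rank tests with the effect size r = |Z| / √N) can be sketched as follows. The scores are simulated, since the raw data are not published, and the Wilcoxon Z uses the plain normal approximation with no tie or continuity correction:

```python
import numpy as np
from scipy.stats import friedmanchisquare, norm

rng = np.random.default_rng(0)
# Simulated per-student percentage scores on the three tasks
# (hypothetical data; n = 62 as in the four-option group).
pic = rng.normal(60, 15, 62)
dia = rng.normal(55, 19, 62)
mon = rng.normal(42, 16, 62)

chi2, p = friedmanchisquare(pic, dia, mon)        # omnibus test across tasks

def wilcoxon_z(x, y):
    """Normal-approximation Z for the Wilcoxon signed-rank test and
    the effect size r = |Z| / sqrt(N) reported in Table 2."""
    d = x - y
    d = d[d != 0]                                 # drop zero differences
    n = len(d)
    ranks = np.abs(d).argsort().argsort() + 1.0   # ranks of |d| (no tie handling)
    w_pos = ranks[d > 0].sum()                    # sum of positive ranks
    mu = n * (n + 1) / 4
    sigma = np.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_pos - mu) / sigma
    return z, abs(z) / np.sqrt(n)

alpha = .05 / 3                                   # Bonferroni correction
for name, (a, b) in {"Pic-Dia": (pic, dia), "Dia-Mon": (dia, mon),
                     "Mon-Pic": (mon, pic)}.items():
    z, r = wilcoxon_z(a, b)
    p_pair = 2 * norm.sf(abs(z))                  # two-tailed p
    print(f"{name}: Z = {z:.2f}, r = {r:.2f}, significant = {p_pair < alpha}")
```

The follow-up tests run only if the omnibus Friedman test is significant, as it was for both groups in the study.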
Judging from the above, it can be stated that the monologue task was the most difficult,
followed by the dialogue task, and finally the picture task. This finding supports the previous studies
reviewed (Shohamy & Inbar, 1991; Papageorgiou, Stevens, & Goodwin, 2012), which showed
that monologic texts were more difficult than dialogic ones. The reason for this is that monologues
usually contain fewer features associated with interactive discourse (Buck, 2001), features that
may help listeners comprehend the text.
This study also found that the picture task was the least difficult of the three. This can be
explained by the difference in the test-taking process. In the dialogue and monologue tasks, test takers
are asked to read and answer three questions (with four options each) while listening to the text. In
the picture task, on the other hand, test takers simply listen to four short statements and select the
one that matches the picture. Thus, picture tasks demand less cognitive load than dialogue and
monologue tasks.
5.2 Comparison of Item Difficulty
5.2.1 Selection of Common Items for Equating
This section reports the quality of the common items for equating. First, item difficulty
invariance was checked. Figure 3 shows the item difficulty plot between the four- and three-option
formats for the picture task. As can be seen, all items were located within the acceptable zone
between the two control lines (solid curved lines), except for Item 10, which was an outlier and so
removed.
[Figure 3 is a scatter plot of item difficulty estimates for the common items, four-option format
against three-option format, with the empirical line and two control lines; all items fall inside the
control lines except Item 10.]
Figure 3. Item difficulty plot between the four-option format and the three-option format for the
picture task.
Next, the items were evaluated from the viewpoint of fit statistics, reported in Table 3.
Misfit is divided into two types: (a) underfit and (b) overfit. Underfit (> 1.3) degrades the quality
of the ensuing measures, while overfit (< 0.7) has no practical implications (Bond & Fox, 2007).
According to this guideline, two out of 10 items were diagnosed as misfit (Item 1 as overfit: 0.18;
Item 9 as underfit: 2.49).
Table 3
Fit Statistics and Point-Biserial Correlations on Common Items in the Picture Task
[Item-level values unrecoverable from the source; per the text, Item 1 was flagged for overfit
(infit MNSQ 0.18) and rpbi = .19, Item 9 for underfit (infit MNSQ 2.49) and rpbi = .12, and Item
10 as an outlier in item difficulty invariance.]
Note. ID = item difficulty. MNSQ = mean squares. rpbi = point-biserial correlation.
*Outlier in item difficulty invariance.
Dark shaded items were removed from equating. Light shaded values did not reach expected values for the
equating.
Finally, the point-biserial correlation (rpbi) was examined. The value of the point-biserial correlation
for equating purposes should be .20 or more (Hirai, 2010). Again, Items 1 and 9 did not reach the
expected values (.19 and .12, respectively). Based on these observations, seven out of 10 common
items served for equating in the picture task.
The same examination of common items was conducted for the dialogue and monologue
tasks. As a result, 15 and 16 test items, respectively, were selected for equating (see Appendices A
and B).
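The screening just applied can be written out as a small check. The thresholds come from the text (infit mean square within 0.7–1.3; point-biserial of .20 or more); the function names are mine, and the point-biserial is computed as the Pearson correlation between the dichotomous item score and the total score:

```python
import numpy as np

def point_biserial(item_scores, total_scores):
    """r_pbi of a dichotomous (0/1) item with the total test score,
    i.e., the Pearson correlation between the two."""
    return float(np.corrcoef(np.asarray(item_scores, float),
                             np.asarray(total_scores, float))[0, 1])

def keep_for_equating(infit_mnsq, r_pbi):
    """Retain a common item only if its infit mean square is inside
    the 0.7-1.3 band (Bond & Fox, 2007) and its point-biserial is
    at least .20 (Hirai, 2010)."""
    return 0.7 <= infit_mnsq <= 1.3 and r_pbi >= 0.20

# Items with the profiles of Item 1 (overfit, low r_pbi) and
# Item 9 (underfit, low r_pbi) are both excluded:
print(keep_for_equating(0.18, 0.19), keep_for_equating(2.49, 0.12))
```

Note that infit mean squares themselves come from the Rasch estimation (WINSTEPS in this study); only the retention rule is shown here.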
5.2.2 Item Difficulty
Table 4 reports the mean values of difficulty estimates in the four- and three-option formats, the
results of paired t-tests, and the effect size for each task.
With regard to the picture task, the mean difficulty of the four-option format (M = 0.25, SD =
1.32) was slightly higher than that of the three-option format (M = 0.11, SD = 1.18), but there
was no significant difference between them (p = .60), with a small effect (d = 0.11). With regard to
the dialogue task, the mean difficulty of the four-option format (M = -0.26, SD = 0.82) was
slightly higher than that of the three-option format (M = -0.41, SD = 0.87), but no significant
difference was found between them (p = .16), with a small effect (d = .18). Finally, in the
monologue task, the mean difficulty of the four-option format (M = 0.27, SD = 0.94) was higher
than that of the three-option format (M = -0.25, SD = 0.89). Although the statistical difference
between the formats was marginal (p = .07), the effect size was medium (d = .57).
Table 4
Results of Paired t-tests for Mean Item Difficulty (Logits)

      Picture                Dialogue               Monologue
      4-option   3-option    4-option   3-option    4-option   3-option
M     0.25       0.11        -0.26      -0.41       0.27       -0.25
SD    1.32       1.18        0.82       0.87        0.94       0.89
p     .60                    .16                    .07
d     .11                    .18                    .57

Note. Item difficulty (logits): the higher the item difficulty, the more difficult the item.
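The comparison in Table 4 is a paired t-test over the common-scale difficulty estimates of the same items under the two formats, plus a standardized effect size. Below is a sketch with invented logit values (the paper's item-level estimates are not published); the paper does not state which Cohen's d variant it used, so this sketch divides the mean difference by the standard deviation of the differences:

```python
import numpy as np
from scipy.stats import ttest_rel

# Invented logit difficulties for the same eight items in each format.
four_opt = np.array([0.9, 0.4, -0.2, 0.6, 1.1, -0.5, 0.3, 0.2])
three_opt = np.array([0.3, 0.1, -0.6, 0.2, 0.6, -0.9, -0.1, 0.1])

t, p = ttest_rel(four_opt, three_opt)   # paired t-test across items
diff = four_opt - three_opt
d = diff.mean() / diff.std(ddof=1)      # effect size for paired data
print(f"t = {t:.2f}, p = {p:.3f}, d = {d:.2f}")
```

The pairing is per item, which is why the equating step matters: both difficulty estimates must sit on the same logit scale before they can be differenced.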
These results suggest that the effect of different numbers of options on item difficulty varied
between tasks. That is, the number of options did not influence item difficulty in the picture and
dialogue tasks. On the other hand, the difficulty of the monologue task was affected to a moderate
degree by the different number of options: the three-option format made the test easier than the
four-option format. The main reason for these differences can be attributed to the varying
difficulty of the tasks. The monologue, as discussed in Section 5.1, was the most difficult task, so
including fewer options could help test takers focus on listening and comprehend the text
better, which also led to selection of the correct answer. In other words, reducing the number of
options (that test takers have to read) may have a beneficial effect for test takers only when the
task is difficult.
This finding did not support some prior studies (Shizuka et al., 2006; Currie & Chiramanee,
2010), where no difference in item difficulty was found between formats. One possible
interpretation is that even in those studies there would have been a difference in item difficulty if
tasks had been analyzed separately; some tasks would have been more difficult than others and
vice versa. By extension, this interpretation can be applied to one supportive previous study (Lee
& Winke, 2012), where the three-option format was significantly easier than the other formats. In
short, it can be assumed that many more difficult tasks/items were included in Lee and Winke's
study, so their study alone found a difference in item difficulty between the formats.
5.3 Comparison of Item Discrimination
Table 5 presents the mean item discrimination (point-biserial correlations: rpbi) in the four- and
three-option formats, the results of paired t-tests, and effect sizes.
In the picture task, the four-option format had slightly lower mean item discrimination (rpbi
= .33) than the three-option format (by 0.01), and no significant difference was found between
the two formats, with almost no effect (p = .85, d = .05). In the dialogue task, the mean item
discrimination in the four-option format (rpbi = .39) was somewhat higher than that in the
three-option format (rpbi = .33), resulting in no significant difference between them, with a small
effect (p = .23, d = .39). Similar to the picture task, the mean item discrimination in the monologue
task was almost the same between the four- and three-option formats (rpbi = .30 and .29, respectively),
and no significant difference was found, with almost no effect (p = .84, d = .16).
Table 5
Results of Paired t-tests for Mean Item Discrimination

      Picture                Dialogue               Monologue
      4-option   3-option    4-option   3-option    4-option   3-option
M*    .33        .34         .39        .33         .30        .29
SD*   .16        .25         .14        .17         .15        .18
p     .85                    .23                    .84
d     .05                    .39                    .16

Note. *Fisher-z transformed values.
These findings indicate that the number of options had no impact on item discrimination in all
tasks. This supports the findings of Shizuka et al. (2006) and Lee and Winke (2012), and adds
another piece of evidence for the viability of substituting the four-option format with the
three-option format. This may be because items of both formats are essentially the
same: questions and correct options (keys). In other words, the core of each item remains the
same between the four- and three-option formats. Therefore, removing one distractor may not have
changed the essential ability of test items to discriminate between test takers.
5.4 Comparison of Reliability
Table 6 shows item reliability and person reliability in both formats for each task. Item
reliability provides an estimate of how much confidence we can have in the performance of the
items: how likely they are to perform in the same way on another test population under similar
conditions (Green, 2013). As Table 6 shows, item reliability was consistently somewhat higher in
the four-option format than the three-option format across the tasks, but the differences are quite
small, ranging from .02 (dialogue and monologue) to .04 (picture).
Person reliability provides an estimate of how much confidence we can have in the test
takers' results, i.e., the extent to which they are likely to perform in the same way on another set of
test items of a similar nature given under similar conditions (Green, 2013). Compared to item
reliability, there seems to be a wider difference in person reliability between the four- and
three-option formats, ranging from .02 (monologue) to .07 (picture). Moreover, unlike item
reliability, person reliability for the three-option format was slightly higher than that for the
four-option format in the dialogue and monologue tasks.
Table 6
Results of Item Reliability and Person Reliability

                     Picture                Dialogue               Monologue
                     4-option   3-option    4-option   3-option    4-option   3-option
Item reliability     .93        .89         .87        .85         .90        .88
Person reliability   .65        .58         .77        .81         .69        .71
Overall, these results suggest that the number of options had no considerable effect on
reliability in either format for any task. However, consistent patterns were observed neither across the
four- and three-option formats nor across the tasks. This inconsistency in reliability was also found in
the prior study by Lee and Winke (2012), where no difference was found in one of the three test
sets, the three- and four-option formats had higher reliability in another set, and the three- and
five-option formats were more reliable than the four-option format in the last. This may be
because reliability is sensitive to many factors, such as sample ability, variance in item
difficulty, and the sample sizes of persons and items (Linacre, 2014b). Therefore, we need to collect and
analyze more data from a larger sample and use more test items to draw a conclusion concerning
the relationship between reliability and the number of options.
5.5 Comparison of Distractor Performance
This section reports how well distractors functioned in the two formats across tasks. The
functionality of distractors has conventionally been evaluated in two ways: (a) frequency and (b)
discriminability (Haladyna, 2004; Haladyna & Downing, 1993; Henning, 1987). The former
measures the number of test takers who chose each distractor, and a non-functioning distractor is
defined as one that was chosen by fewer than 5% of test takers (e.g., Haladyna & Downing, 1993).
The latter measures the extent to which a distractor can distinguish between high-ability and
low-ability test takers, and a non-functioning distractor is defined as one with a positive
point-biserial correlation (e.g., Haladyna & Downing, 1993). Based on these guidelines, we will evaluate
the distractors in the four- and three-option formats in the three tasks.
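The two criteria above can be combined into one routine. This is an illustrative sketch, not the study's actual code: a distractor counts as frequent if at least 5% of test takers chose it, and as discriminating if its point-biserial with the total score is negative (i.e., lower-ability test takers choose it more often):

```python
import numpy as np

def classify_distractors(responses, key, totals):
    """Evaluate every distractor of one item on the frequency and
    discriminability criteria (Haladyna & Downing, 1993).

    responses: the option label each test taker chose
    key:       the correct option's label
    totals:    each test taker's total test score
    """
    responses = np.asarray(responses)
    totals = np.asarray(totals, float)
    n = len(responses)
    report = {}
    for opt in sorted(set(responses.tolist()) - {key}):
        chose = (responses == opt).astype(float)
        r_pbi = float(np.corrcoef(chose, totals)[0, 1])
        report[opt] = {"frequent": chose.sum() / n >= 0.05,
                       "discriminating": r_pbi < 0}
    return report

# Toy item: high scorers pick the key "A"; low scorers are drawn to "B".
report = classify_distractors(["A", "A", "A", "B", "B", "C"], "A",
                              [10, 9, 8, 3, 2, 7])
```

An item in which every distractor is both frequent and discriminating corresponds to the top rows of Tables 7 and 8.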
5.5.1 Frequency
Table 7 shows the number of items with three, two, one, and no frequently chosen
distractors in the four- and three-option formats of each task. The task with the greatest number of
question items containing all frequently chosen distractors was the monologue task (95.2% for
both the four- and three-option formats), followed by the dialogue task (85.7% for both), and
finally the picture task (46.7% each). In contrast to the dialogue and monologue tasks, items with
less frequently chosen distractors appeared in the picture task in both formats. That is, many
distractors in the picture task were not chosen by test takers, while most distractors in the
dialogue and monologue tasks were selected by test takers.
Table 7
The Number of Frequently Chosen Distractors in Two Formats for Each Task

Frequently chosen   Picture (k = 15)        Dialogue (k = 21)       Monologue (k = 21)
distractors         4-option   3-option     4-option   3-option     4-option   3-option
3                   7 (46.7)   n/a          18 (85.7)  n/a          20 (95.2)  n/a
2                   4 (26.7)   7 (46.7)     3 (14.3)   18 (85.7)    1 (4.8)    20 (95.2)
1                   4 (26.7)   7 (46.7)     0 (0.0)    3 (14.3)     0 (0.0)    1 (4.8)
0                   0 (0.0)    1 (6.7)      0 (0.0)    0 (0.0)      0 (0.0)    0 (0.0)

Note. Percentages appear in parentheses.
5.5.2 Discriminability
The frequency criterion above can tell us how many test takers chose each distractor. What is
important is not only how many distractors were chosen for each multiple-choice item, but also
what level of test taker selects which distractor (Shizuka et al., 2006). Table 8 shows the number of
items with three, two, one, and no discriminating distractors in both formats of each task. Contrary
to the frequency criterion, the greatest number of items with all discriminating distractors was in the
248
Japan Society of English Language Education
NII-Electronic Library Service
JapanSociety ofEnglish Language Education
dialogue task (57.1% in the four-option format and 90.5% in the three-option format), followed by the
monologue task (42.9% and 71.4%, respectively), and finally the picture task (40.0% and 66.7%, respectively).
Table 8
The Number of Discriminating Distractors in Two Formats for Each Task

Discriminating   Picture (k = 15)        Dialogue (k = 21)       Monologue (k = 21)
distractors      4-option   3-option     4-option   3-option     4-option   3-option
3                6 (40.0)   n/a          12 (57.1)  n/a          9 (42.9)   n/a
2                5 (33.3)   10 (66.7)    9 (42.9)   19 (90.5)    8 (38.1)   15 (71.4)
1                3 (20.0)   4 (26.7)     0 (0.0)    2 (9.5)      4 (19.0)   6 (28.6)
0                0 (0.0)    1 (6.7)      0 (0.0)    0 (0.0)      0 (0.0)    0 (0.0)

Note. Percentages appear in parentheses.
These results indicate that the frequently chosen distractors in the four-option format were
not always discriminating distractors. To put it precisely, there were seven items (46.7%) with
three frequently chosen distractors in the picture task (see Table 7), but the number of items with
three discriminating distractors decreased from seven to six items (40.0%). With regard to the
dialogue task, the number of items with three discriminating distractors was reduced from 18
(85.7%) to 12 items (57.1%). More strikingly, in the monologue task, the number of items with
three discriminating distractors was reduced from 20 (95.2%) to only nine items (42.9%). On the
other hand, such a sharp drop was not seen in the three-option format. That is, the two frequently
chosen distractors in question items remained the two discriminating distractors in those items. In
other words, the three-option format had similar, if not the same, discriminatory power as the
four-option format. These findings support Shizuka et al.'s (2006) prior study, where the number
of discriminating distractors was almost the same for the four- and three-option formats.
From the viewpoint of task difference, the number of items that contained all effective distractors differed: the dialogue task had the highest number of effective distractors (57.1% for the four-option format and 90.5% for the three-option format), followed by the monologue task (42.9% and 71.4%, respectively), and finally the picture task (40.0% and 66.7%, respectively). This difference may be related to task difficulty. In the picture task some items were too easy to distract some test takers, while in the monologue task some items were too difficult to allow test takers to select the right answer and induced random guessing.
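The "too easy / too difficult" interpretation can be operationalized with classical item facility, the proportion of test takers who answer an item correctly. A minimal sketch; the 0.85 and 0.30 cut-offs and the response matrix are illustrative assumptions, not values taken from the study:

```python
def item_facility(item_row):
    """Proportion of test takers who answered this item correctly (0/1 scored)."""
    return sum(item_row) / len(item_row)

def flag_items(score_matrix, easy=0.85, hard=0.30):
    """score_matrix[i][j] = 1 if test taker j answered item i+1 correctly."""
    flags = {}
    for i, row in enumerate(score_matrix, start=1):
        f = item_facility(row)
        if f >= easy:
            flags[i] = "too easy"
        elif f <= hard:
            flags[i] = "too hard"
    return flags

# Hypothetical 3-item x 6-person matrix.
matrix = [
    [1, 1, 1, 1, 1, 1],  # everyone correct -> flagged "too easy"
    [1, 0, 1, 0, 1, 0],  # facility .50 -> not flagged
    [0, 0, 1, 0, 0, 0],  # facility .17 -> flagged "too hard"
]
print(flag_items(matrix))  # → {1: 'too easy', 3: 'too hard'}
```

An item flagged "too easy" leaves its distractors with almost no choosers, and an item flagged "too hard" pushes test takers toward random guessing, which is the mechanism suggested above.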
6. Conclusion
This study has investigated the effects of different numbers of options (four-option vs. three-option) on different listening tasks (picture vs. dialogue vs. monologue) from the viewpoint of item difficulty, discrimination, reliability, and distractor performance.
The main finding of this study suggests that the effects of different numbers of options vary according to task. The number of options had no notable impact on the picture and dialogue tasks, but did have an effect on the monologue task. This finding sheds new light on the study of the optimal number of options and on test development, because task difference is an essential factor that determines item difficulty for multiple-choice listening tests.
Another finding indicates that the number of options had no great impact on item discrimination and reliability in any task, which supports many previous studies on the number of options, including L2 studies (e.g., Lee & Winke, 2012) and a meta-analysis that did not include L2 studies (Rodriguez, 2005). The other finding of this study shows that the number of items containing all discriminating distractors differed between the three tasks: The greatest number was in the dialogue, followed by the monologue, and finally by the picture. Again, task difference is an important element in the development of multiple-choice listening tests.
Judging from the findings above, the three-option format in this study may have the same potential to function as the four-option format, if the effects of task difference are taken into account.
There were some limitations to this study. It represents a limited sample, items, and tasks, and its findings can only be applied within the limited context in which it was conducted, that is, three sections of the TOEIC listening test taken by low-intermediate university students. Future studies should include different tasks with learners at different levels.
Acknowledgements
This work was supported by JSPS KAKENHI Grant Number 23520692. I would like to thank three anonymous reviewers and Dr. Sachiyo Takanami for valuable comments on an earlier version of this article.
References
Bond, T. G., & Fox, C. M. (2007). Applying the Rasch model: Fundamental measurement in the human sciences (2nd ed.). Mahwah, NJ: Lawrence Erlbaum.
Brown, J. D. (2005). Testing in language programs: A comprehensive guide to English language assessment (New ed.). New York, NY: McGraw-Hill.
Buck, G. (2001). Assessing listening. Cambridge University Press.
Currie, M., & Chiramanee, T. (2010). The effect of the multiple-choice item format on the measurement of knowledge of language structure. Language Testing, 27, 471-491. doi:10.1177/0265532209356790
Downing, S. M. (2006). Selected-response item formats in test development. In S. M. Downing & T. M. Haladyna (Eds.), Handbook of test development (pp. 287-301). Mahwah, NJ: Lawrence Erlbaum.
Educational Testing Service. (2011). TOEIC test official practice: Listening. Tokyo: Institute for International Business Communication.
Green, R. (2013). Statistical analyses for language testers. New York, NY: Palgrave Macmillan.
Haladyna, T. M. (2004). Developing and validating multiple-choice test items (3rd ed.). Mahwah, NJ: Lawrence Erlbaum.
Haladyna, T. M., & Downing, S. M. (1993). How many options is enough for a multiple-choice item? Educational and Psychological Measurement, 53, 999-1010. doi:10.1177/0013164493053004013
Henning, G. (1987). A guide to language testing: Development, evaluation, research. Cambridge, MA: Newbury House.
Hirai, A. (2010). Test mondai and kyozai sai riyou no susume: TEASY riron hen [Test Easy Assembly System]. Tokyo: Maruzen.
Hughes, A. (2003). Testing for language teachers (2nd ed.). Cambridge University Press.
Lee, H., & Winke, P. (2012). The differences among three-, four-, and five-option-item formats in the context of a high-stakes English-language listening test. Language Testing, 30, 99-123. doi:10.1177/0265532212451235
Linacre, M. (2014a). WINSTEPS (Version 3.81) [Computer software]. Retrieved from www.winsteps.com
Linacre, M. (2014b). A user's guide to WINSTEPS: MINISTEP Rasch-model computer program. Retrieved from www.winsteps.com
Papageorgiou, S., Stevens, R., & Goodwin, S. (2012). The relative difficulty of dialogic and monologic input in a second-language listening comprehension test. Language Assessment Quarterly, 9, 375-397. doi:10.1080/15434303.2012.721425
Read, J. (2002). The use of interactive input in EAP listening assessment. Journal of English for Academic Purposes, 1, 105-119.
Rodriguez, M. C. (2005). Three options are optimal for multiple-choice items: A meta-analysis of 80 years of research. Educational Measurement: Issues and Practice, 24(2), 3-13.
Shizuka, T., Takeuchi, O., Yashima, T., & Yoshizawa, K. (2006). A comparison of three- and four-option English tests for university entrance selection purposes in Japan. Language Testing, 23, 35-57. doi:10.1191/0265532206lt319oa
Shohamy, E., & Inbar, O. (1991). Construct validation of listening comprehension tests: The effect of text and question type. Language Testing, 8, 23-40.
Thompson, I. (1995). Assessment of second/foreign language listening comprehension. In D. J. Mendelsohn & J. Rubin (Eds.), A guide for the teaching of second language listening (pp. 31-58). San Diego, CA: Dominie Press.
Appendices
Appendix A
Selection of Common Items for Equating for the Dialogue and Monologue Tasks
Figure A1. Item difficulty plot for the dialogue task.
Figure A2. Item difficulty plot for the monologue task.
Appendix B
Fit Statistics and Point-Biserial Values for Conversation and Talk
[Per-item table for the dialogue and monologue tasks: item number, item difficulty (D), infit mean squares (Inf), outfit mean squares (Outf), and point-biserial correlation (Pbi).]
Note. D = item difficulty. Inf = infit mean squares. Outf = outfit mean squares. Pbi = point-biserial correlation. *Outlier in item difficulty invariance. Dark-shaded items were removed from the equating. Light-shaded values did not reach expected values for the equating.
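The infit and outfit values above are residual-based fit statistics from the dichotomous Rasch model used in the analysis (computed with WINSTEPS). The sketch below shows how the two mean squares are defined; the ability estimates and responses in the example are hypothetical:

```python
import math

def rasch_p(theta, b):
    """Probability of a correct answer under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def item_fit(responses, thetas, b):
    """Infit (information-weighted) and outfit (unweighted) mean squares for one item."""
    ps = [rasch_p(t, b) for t in thetas]
    variances = [p * (1 - p) for p in ps]
    # Squared standardized residuals, one per person.
    z2 = [(x - p) ** 2 / v for x, p, v in zip(responses, ps, variances)]
    outfit = sum(z2) / len(z2)
    infit = sum((x - p) ** 2 for x, p in zip(responses, ps)) / sum(variances)
    return infit, outfit

# Two persons of average ability (theta = 0) on an average item (b = 0):
print(item_fit([1, 0], [0.0, 0.0], 0.0))  # → (1.0, 1.0), i.e. exactly expected fit
```

Values near 1.0 indicate responses that match the model's expectations; outfit is more sensitive to outlying responses from persons far from the item's difficulty, while infit down-weights them.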