Post on 18-Dec-2021
The Japan Language Testing Association
NII-Electronic Library Service
The JapanLanguageTesting Association
Computerized assessment of second language speaking: A review of tests
yajiazHou[ma gt2]lbkyo Uhivensity ofIFbreign Studies
Abstract
Testing second language speaking via cornputer is a fast-growing area in second
lariguage assessment. The purpose of this article is threefbld: to familiarize readers with
the state of corrrputer technology as it can be applied to speaking tests by discussing
three speaking tests: the speaking section of the Test ofEnglish as a Foreign Laiiguage
Internet-based test (TOEFL iBT), the Versant English Test (VET), and the Basic
English Skill Test Plus (BEST PIus); to summarize the evidence fbr these tests' validity
for potential end-users to evaluate thern; and to provide critical issues to consider fordeveloping a computer-delivered speaking test and potential areas for future research.
1. Introduction
Cornputerized assessment of second language speaking is a fast-expanding area in
second language assessment. Various attempts have been made to apply computer
technology to the delivery ofspeaking tests.
In order to benefit readers who wish to advance their knowledge in testing speaking
via computer, this article discusses three speaking tests that have successfu11y integrated
cornputer technology into their test delivery, task selection and presentation, and scoimg
process, respectively. The selected tests are the speaking section of the Test of English
as a Foreign Language Internet-based test (TOEFL iBT), the Versant English Test(VET), and the Basic English Ski11 Test Plus (BEST PIus). The second purpose ofthis article is to summarize the validity evidence of the three
tests. These tests may be used fbr various purposes, including placing learners in
different levels, measuring learners' improvements, and collecting spoken data in
research studies. It is thus critical for teachers and researchers to collect infbrrnation inorder to judge to what extent these tests are reliable and valid fbr their purposes.
Finally, with an increasing use of computer-delivered speaking tests, it is urgent to
gather evidence to support the use of the tests. Based on the descriptive and evaluative
review of the selected speaking tests, this anicle offers perspectiyes fbr future trends
and future research avenues in testing speakmg via computer,
In sum, this paper aims to answer the fo11owing questions:
(1) What are the recent developments of cornputer application in testing second
language speaking as reflected in the TOEFL iBT, the VET, and the BEST PIus?
(2) What empirical evidence exists for the reliability and validity of the speaking
section of the TOEFL iBT, the VET, and the BEST PIus?
-81-
The Japan Language Testing Association
NII-Electronic Library Service
The JapanLanguageTesting Association
(3) What are the critical issues to be considered fbr test development and future
researeh in testing second language speaking via computer?
The rest of the article is presented in three major sections. The first introduces a
validity framework to guide the discussion ofyalidity issues on the speaking section of
the TOEFL iBT, the VET, and the BEST PIus. The second fbcuses on the description of
the key characteristics ofthe three tests and reviews available validity evidence. The last
section is devoted to the future direction oftest deyelopment and research,
2. Validity framework
Validity has been defined as "the
degree to which evidence and theory support the
interpretations ef test scores entailed by proposed uses of tests" (AERA/APAINCME,1999, p. 9). Although there has been agreement in considering validity to be a unified
concept, various criteria fbr examining validity exist. Among the different frameworks
that incorporate various taxonomies of validity evidence, this study adopts the one
proposed by Messick (1995) and adapted by Miller and Linn (2000) to the performancetest. This framework indicates that the validation of perfbrmance assessments should
incerporate six aspects of validity: content, substantive, structural, generalizability,external, and consequential. We fbund this framework usefu1 for organizing the validity
issues concerning speaking assessment that belongs to a method of performanceassessment. In the fbllowing, we briefiy describe each of these aspects as well as their
implications for evaluating the validity of inferences based on the scores of speaking
tests.
2.1. Content
The content aspect of validity is related to the degree of the congruence of test tasks
to the test specifications, and how well they represent the construct measured.
According to Miller and Linn (2000), evidence fbr content validity could be gatheredthrough content analysis in two phases. First, during initial task development phase, a
blueprint with a theoretical base provides evidence of the content and ski11s being
assessed. Second, after the tasks are constmcted, empirical evidence is gathered from
individuals not involved in the development process through interview or questionnaire.
2.2. Substantive
The substantive aspect is concerned with the process used by examinees when they
respond to the test tasks and the consistency of those processes with the constmct the
assessment is intended to measure. Thus, two aspects of evidence are sought to
demonstrate that tasks lead examinees to engage in the intended cognitive processes and
that the infiuence of constmct-inrelevant factors is minimized. For the first aspect,
experts must review the tasks and scoring rubrics to judge whether scores are based onthe successfu1 completion of the intended process. Errrpirical evidence is also essentialon the congruence of what examinees are actually engaged in when perfbrming the test
-82-
The Japan Language Testing Association
NII-Electronic Library Service
The JapanLanguageTesting Association
tasks with the intended process. This line of study could include analysis of the speech
samples as well as the cognitive processes and strategies examinees use via think-aloud
protocols. Second, to minimize constmct-irrelevant variances, potential sources of these
variances, particularly on the part of examinees, mnst be identified, For the
computer-delivered speaking assessment, examinees' reaetions to the test and
examinees' characteristics, such as computer anxiety and computer familiarity, may besystematically related to test scores, and therefbre should be investigated.
2.3. Structural
In general, stmctural validity is evaluated by investigating the internal stmcture of
the test to ensure that the relationships among test items and sections correspond with
the constmct definition. In assessing speaking, this aspect mainly concerns about
whether the scoring categories are consistent with the constmct. The stmctural validity
requires theoretical evidence that the scoring model is based on an existing knowledge
base regarding the construct and ernpirical evidence that is highly dependent on scoring
method. For analytic ratings, it is critical to ensure that rating categories measure related
but distinct elements, and fbr holistic ratings, whether raters could cover al1 dimensionswhen issuing a holistic score is important Further, since the scoring method is central to
understanding the inferences that can be made from tests, ernpirical evidence that
provides rationale for the selection ofa certain scoring method is also required,
2.4. Generalizability
The generalizability aspect focuses on the replicability of test results across
multiple levels of the assessment procedure. A major concern with perfbrmanceassessment is that score interpretation not be limited to the sample of assessed tasks, but
be broadly generalizable to the constmct domain. This issue of generalizability of score
inferences across tasks and contexts goes to the very heart of score meaning,
Generalizahility theory is often used to examine the degree of the generalizability oftest
scores. ln addition, the limits of score meaning are also affected by the degree of
generalizability across occasions and raters of task perfbrmance. Such sources of
measurement error associated with the sampling of tasks, occasions, and raters underlie
traditional reliability concerns.
2.5. External
The external aspect refers to the extent to which relationships between test scores
and external assessments reflect those as constmcts specify. Similar to what is known as
criterion-related validity, this aspect usually examines correlations of test scores with
relevant criteria, such as other measures of the targeted construct (convergent validity)
and variables hypothesized to be unrelated to the construct (discriminant yalidity).
Convergent evidence is based on high correlations due to shared constmcts;
discriminant evidence is based on low correlations due to different censtmcts.
-83-
The Japan Language Testing Association
NII-Electronic Library Service
The JapanLanguageTesting Association
2.6. Consequential
The consequential aspect examines the degree to which assessments have bothintended positive effects and plausible unintended negative effects. Intended
consequences can include changes in the instmctional and curricular practices of
teachers that lead to better learning environments fbr students. Unintended
consequences can include bias in assessment leading to misinterpretation. The emphasis
of the evaluation should be on the degree to which the positive outcomes of the test
outweigh any negative consequences.
The fbregoing validity framework provides a usefu1 framework fbr organizingvalidity eyidence for various types of speaking assessments, and it is desirable to haveas many sources of evidence as possible that can significantly contribute to certain
aspects of validity. Yet, depending on the purpose of a test and the types of
interpretations of test scores that are made, some sources of validity evidence may bemore credible than others. Since the three speaking tests to be reviewed could be used in
diffbrent contexts fbr different purposes, it is not the intention of this article to argue foreach test. Rather, readers must evaluate each in view oftheir own experience.
3. Test description and evaluation
This section reviews the speaking section of the TOEFL iBT, the VET, and the
BEST PIus, fbr each of which of key characteristics of the tests are reported from three
aspects: general description, test task characteristics, and scoring method. Table 1 and
Table 2 present summaries of the characteristics for the tlll:ee tests, respectively.
Evidence for the evaluation of the tests is then presented in the previously introdncedframework ofvalidity, and summarized in Table 3 at the end ofthis section.
3.1. TOEFL iBT speaking section
3.1.1. General description
The TOEFL iBT was developed by Educational Testing Service (ETS). It debutedin September 2005 in Nonh America and was later launched in other countries. It is a
four-ski11 test in which the speaking section is mandatory. The speaking section of the
test is used to evaluate examinees' oral comniunication ski11s in the context of their
readiness for study in English-speaking universities. Specifically, it measures the ahility
to speak about everyday familiar topics, and to summarize, evaluate, compare, and
synthesize information from multiple sources in a spontaneous manner.
The TOEFL iBT is delivered via an Internet-based delivery system but not
corrrpletely Intemet-based in that examinees need to take the test on particularcomputers at ETS-certified test centers around the world. During the test, examinees
wear noise-canceling headphones and speak into a microphone. Responses are digha11y
recorded and sent to ETS's Online Scoring Network where they are scered by raters.
3.1.2. Test task characteristics
The speaking section ofthe TOEFL iBT consists of six tasks. Two are independent
-84-
The Japan Language Testing Association
NII-Electronic Library Service
The JapanLanguageTesting Association
tasks, which require examinees to express opinions on familiar topics related to campus
life or lecture situations. The other four are integrated tasks. Two of the four arelistening/speaking tasks, which ask examinees to listen to a short spoken recording and
then respond to it. The final twe are readingllistening!speakng tasks, which require
examinees to read a short text, listen to a spoken recording that relates to what they read,
and then respond about what they read and heard. Reading texts are 75-1OO words long;
recording lasts between 60 and 120 seconds. Examinees can take notes and use them
when responding to the speaking tasks. Their comprehension is not separately assessed,
as these texts are considered short and memorable.
The test is not computer-adaptive; each examinee rnust give response to all six
tasks, and the tasks are not adapted to the examinees' speaking level. For each task,
examinees are given 15-30 seconds to prepare their response, and 45-60 seconds to
respond, ln tota1, the speaking section takes approximately 20 minutes.
3.1.3. Scoring
The TOEFL iBT speaking section uses holistic scoring, and three to six different
cenified raters issue a holistic score for each respense on a scale ofzero to fbur based
on delivery, language use, and topic development. Delivery refers to the pace and clarity
of speech, and raters consider speakers' pronunciation, intonation, rate of speech, and
degree of hesitancy. Language use includes the range, complexity, precision, and
automaticity of vocabulary and grammar use. Raters evaluate examinees' al)ility to
select words and phrases and to produce stmctures that appropriately and effectively
communicate their ideas. When assessing topic development, raters take into account
the progression of ideas, the degree of elaboration, and completeness, and, in integrated
tasks, the relevance ofthe content.
The average rating of all the raters is then converted to a scaled score from zero to
30. Examinees can view their scores online 15 business days after the test and choose to
receive a copy of their score report by mail. Universities and agencies can also receive
scores in paper or electronic forrnats on request. The ETS uses the Online ScoringNetwork, a secure internet-supported system, to share data, train raters, and monitor
scoring of examinees' speaking samples. Raters are certified befbre they can beginscoring work and prior to each scoring session on a daily basis,
3.1.4. Evaluation
Empirical evidence for the validity of the TOEFL iBT speaking section is available
fbr each of the six aspects as specified by the validity framework.
Cbntent
In developing speaking tasks fbr the TOEFL iBT, a theoretical framework was first
developed in which speaking constmct measured by various task types was specified
(Butler, Eignor, Jones, McNamara, & Suomi, 2000). Following this framework, task
statements were drafted and then prototype tasks were developed.
-85-
The Japan Language Testing Association
NII-Electronic Library Service
.の
它8靄
ε
曽
壱♪の
‘
.
詈順冨の
8房o
も量
ヨ祐口嘱
壽[β。
の冖巨三
8
曇
三゚,
毘
薯
の
OO
8N
一口〇一口OO
Oの
旨OO
Q順日o
℃
8<
(
一
)
⇔D口
暑巴g 。
葡口
石2g
。
コ
08
鳴
紬o
話−
N 工工一Eleotronio Library Servioe
葛日畠
2。
〉。
℃
。
鼠
£−
9
叨
OOの
OO・。
OO
り。ON
・うO【 腎
。嚇Oめ
毟【。。
5
音δ
一口〇一
口OO
OのhコOO
o
瑁書8<
毟[・。
a§り
(
一
)
bo唱
署呂の
薊口
ぞ
毳コ
唱
囂軌。
ヨ・
(
一)
⇔o眉
着』の丶
a。
口
毎迄
鳥眉窯毟
マ
器恥。
ヨ゚
(
一
)
bD口
暑
島
軸
£
5軽
葛三
窯鬱
℃
器島g芦
一 86一
O
叩6
匂
oo
【8の
。°・
芦
。
bΩ轄
島鵠日
−
ご。〉順可∩
−
o嘱易幽o=
口羇
日5=
・。
い
寸
の
い
寸
・昏コ
。。9
oa
ε
」
衷=
日吋
』
The Japan Language Testing Assooiation
Q
己9
おヨ日邸
』
(
一
)
墓ヨ巴。
℃
ξ
仁
&曾
o」
8の
ごoboo
冨0
6。
§2
o【邸
Qの
。。
三一
爵
(
崗)
←
冒宕呂。
宕り
。
〉
量壱甲口oZ
゜・
聾a目
8
で
器島
層で
鳥8蕁
℃oの
邸
ρ−
b田2ヨ
←凶哨
』』国
○卜
o
日鞘
o
日=
聾2閉8房
畄
bo三
目帽
=
。
旦
β
(
岩。←
ζ。.
。
Z)
。
替着
謡
ご。
主。
で
老
謡
喝o
曇。
自。の
ε」
8の
芻犒紹」
3冨詰』9
図
旨←
眉O
ヨ
日
昏o
主
εの
出
毬
昌
8審。・・
⇔o
ε髱o
島←国
コ
出O卜
名こ。
の
。
崖邸
識
ちE日
ち
蠧量員
お悶
。
コ
£
The Japan Language Testing Association
NII-Electronic Library Service
The JapanLanguageTesting Association
Two studies provided positive empirical evidence fbr the content validity of the test.
In Rosenfeld, Leung, and Oltman (2001), a large number ofuniversity faculty members
and students in North America rated a series of task statements in relation to the
importance of each task statement to the successfu1 completion of coursework. Results
support the claim that test content as indicated in task statements was relevant and
representative of what examinees encounter in a university context. Cumming, Grant,
Mulcahy-Ernt, and Powers (2005) evaluated seven speaking prototype tasks by asking
seven experienced ESL instmctors whether they thought the prototype tasks represented
the domain of academic English required fbr studies at English-medium universities and
fu1fi11ed the pu:poses for which they had been designed. The instructors viewed the
prototype tasks positively, particularly integrated tasks that required students to speak in
reference to reading or listening to source texts.
Substantive
The substantive aspect of validity on the TOEFL iBT speaking section has been
examined by analyzing speech samples, examinees' strategic behaviors, and their
attitudes toward the test. While the fbrmer two types of analyses provided positive
evidence, the latter yielded only negative evidence.
Brown, Iwashita, and McNamara (2005) analyzed speech samples of examinees on
six prototype tasks to investigate whether the content ofthe rating scales was relevant to
perfbrmance on panicular tasks, They fbund that the greater fluency, more sophisticated
vocabulary, better pronunciation, greater grammatical accuracy, and more relevant
content were characteristic of speech samples receiving higher holistic scores from
raters, which indicated that the constmct that the tasks were designed to measure was
indeed represented in the discourse.
Although use of strategy is considered integral to perfbrming tasks, it is not
included as a part of the speaking constmct in the speaking section of the TOEFL iBT.
To address the concern of constmct-underepresentation, Swain, Huang, Barkaoui,
Brooks, and Lapkm (2009) asked 30 Chinese ESL learners to report strategic behaviors
when taking the test and examined how the reported strategies related to their test scores,
No sigriificant differences were found in rqported strategy use across proficiency levels
or between the total number ofreported strategic behaviors and test scores. The results
of the study did not change the constmct of the TOEFL iBT speaking section.
Stricker and Attali (201O) administered a questionnaire concerning attitudes toward
the test to examinees in fbur countries. Overall attimdes about the TOEFL iBT were
moderately positive in most countries. Yet, compared with other ski11s, attitudes toward
the speaking section were consistently less favorable in all countries, and were
unfavoral)le in two countries. Although this study did not relate speaking test scores to
examinees' attitudes toward the speaking section, the results showed that attitude
toward the test may be one potential source of construct-irrelevant variances.
-87-
The Japan Language Testing Association
NII-Electronic Library Service
The JapanLanguageTesting Association
Struetural
Two investigations into the appropriateness of using a holistic rating scale have
justified the validity of the stmctural aspects of the test. First, to ensure the relevance of
the content of the holistic rating scales with features of speaking proficiency deemed by
expert judges, Brown, Iwashita, and McNamara (2005) used verbal-report methodology
to examine what aspects of examinees' speech the raters paid most attention to when
scoring tasks. Results showed that with no guidance, domain experts distinguishedbetween perfbrmances using a common set of criteria similar to those included in rating
scales developed for the tasks. Second, Xi and Mollaun (2006) justified using a holisticratmg scale rather than analytic rating scales. The study found that the reliability ofscores obtained from the two types ofrating scales did not differ substantially. But sincethe analytic scores were highly correlated, they would not provide additional
infbrmation beyond what the holistic scores could offer fbr most examinees.
Generalizability
Generalizability of the TOEFL iBT speaking test was verified across occasions,
raters, and tasks. Zhang (2008) analyzed the scores of examinees who repeated the testonce within 30 days. Result showed a moderate correlation of O.84 between the scores
of the two tests, suggesting that the difficulty of tasks on different occasions could beconsidered equivalent. With regard to rater generalizability, inter-rater reliahility was
not published by the ETS fbr operational settings, and instead, the ETS claimed that
raters are constantly monitored each time they score a test. In addition, Xi and Mollaun
(2009) fbund that Indian raters could rate the speech of examinees of various national
backgrounds reliably, which provides positive evidence fbr recruiting nonnative
speakers to score the test. Finally, as to the generalizability of the scores across tasks,
the generalizability index published by the ETS is O.88, higher than other sections. This
may be attributable to the Lee' study (2005), which justified the test design concerningthe present number of tasks for each task type in order to achieve acceptable reliahility.
lixtei:nal
Convergent evidence for external validity has been collected through comparison
with the selfiassessment of examinees and with local test. Two studies providedevidence fbr the distinctiveness of what is measured in the speaking section from other
parts of the test. With a sample of over 2000 examinees, Wang, Eignor, and Enright
(2008) found that the correlation between the responses to the statements of
selfassessment on speaking ski11s and the summative scores for the test was a moderate
O.55. In addition, the speaking section is also used also for selection of ITA
(International Teaching Assistant). Therefbre, Xi (2008) examined the relationships
between the speaking scores and those on several local ITA tests, and the regression
results suggest that the speaking test scores could sigriificantly predict scores en
students' TA assignments. In Sawaki, Suicker, and Oraaje (2008), the factor loading
patterns of the test indicate that both independent and integrated tasks define the
-88-
The Japan Language Testing Association
NII-Electronic Library Service
The JapanLanguageTesting Association
speaking constmcts accurately, and are minimally involved in the reading and listening
constmcts. Stricker and Rock (2008) verified the result of Sawaki et al. fbr different
groups in native language and period ofexposure to the English lariguage.
thnsequential
As to washback, the addition of the speaking section into the TOEFL iBT has
improved the impact of the test on teaching and learriing. Several studies by Wall and
Horak (e.g., 2008) reported on the practices of teachers a year after the launch of the
TOEFL iBT in Europe, They concluded that the TOEFL iBT has indeed had the desired
effect on the content of TOEFL preparation classes, in that much more emphasis has
been placed on the teaching of speaking abilities and integrated tasks.
3.2. Versant English Test
3.2.1. General description
Versant English Test (VET), developed by Pearson, intends to measure what the
developers called `'facility
in a spoken language." It is defined as the ability to
understand spoken English on everyday topics and to respond appropriately at a
natiye-like conversational pace in intelligible English. This ability is assumed to
underlie high perfbrmance in communicative settings, since learners nmst understand
interlocutors correctly and efficiently in real time to be able to respond. Learners must
also be able to fbrmulate and articulate a cornprehensible answer without undue delay.
Without specifying the uses of the test, developers claimed the test could be used
for various purposes, including measuring proficiencies, placement, and achievement.
The VET is Interr)et-based and can be taken wherever there are telephones or corrrputers
with access to the Internet. However, the end users of the test are responsible for
administration: verifying the identity of the examinee, giving the examinee the required
material, and monitoring the administration of the test. Examinees can retrieve their
results from a secure website within minutes of test completion.
3.2.2. Test task characteristics
The VET consists of 63 items grouped into six sections: reading aloud, repeating,
short answer questions, sentence builds, story retelling, and open questions. This variety
provides multiple fu11y independent measures that underlie facility with spoken English,
including phonological fluency, sentence constmction and comprehension, passive and
active vocabulary use, and pronunciation of rhythmic and segmental units.
Developers desigried the test so that each section consists of increasingly
challenging items delivered by native speakers from various English-speaking regions.
In addition, through the use of algorithms, the Versant Testing System selects items for
each examinee from a pool of items graded fbr difficulty, thus ensuring that each
administration event is unique. The printed text is available on a test paper given to the
examinee a few minutes befbre the beginning of the test, and is also available for those
using a computer, The entire test lasts appreximately 15 minutes.
-89-
The Japan Language Testing Association
NII-Electronic Library Service
The Japan Language Testing Assooiation
,・。
。
ぢ口
頑9。。
蕊。」儡いβ“、
。
耜這旨o
房。
こo
日
彎【8。
曇
三
§
薯
ひひ
ひ
IQooo
鯛oo
」
8の
口
9裾Q喟日一
毫09
,
倉巻五日
8
0bo
酪
乱口
5
8嘱゜。
口。
留己日
8
⇔。
眉
琶迄日−
h。
口。
三」
ご
噐
霽8口
2口。
の
ロo咽罵嘱o口5自o
ζ
貯冒ρ邸
8>−
ぎ5三」
西。
濤日
8口
g口。
°勹
−
口o
琵で
雪口o
占
O
ぎN
h。
口。
三」
鴇
o
曾
80り
口o胃思oロコロo
占
〇一1
【
頬
oo
憲oQり
o喟栃コo
=
Q
艮[邸
=
<
§日5出
o口
E8】≧
日ONl
い
。【』
8嘱[&邸
・・
畳。一
”[εoト
ぢ乞
冩
お
冒O
日9
高
50卜
o
騫Q
且曾
且g
ぢ客
貯で
費。
畠
[畚o
踏
o
ぢ
口o咽罵』
o
畚田
(
。[憩。
且9ぢ
乙
8酒巴×。
履。
百
δ
(
。[鵞。
言9←o
乙
g。
蔟巴×
2目富」
£
(
。
暑。嘱[&忠。
Z)
8磊。
暮
8刺
o
δ
(
。【盤り喟[&e。
乙
口。
鬻。
魯
8蓄ト
(
。
室
舞【&εo
乙
日
書
倉崗
(
。【畚。
且静ぢ
乙
8且層・。
。
で
。
ぢ
蠹
(
巳・。
口o
磊。
昌σ
口。
8−
(
n)
。oロ
ヨ2む。
あ・
(
9)
。・
票388
←8
°。・
(
寸
N)お
≧°。
自
巷朗・
(巴)
・
8。
鵞。の
§魯午
(
°。)
彎。
罵
窯・
ギ
。
〉
且邸
℃
<
。
〉
且巷甲
8Z
やo・,
邸
ρ
−
』
8a
日
8
田9&日
8
零駕
島葱唱
勗ま三一
唱o
旨』
−←o
日9β
旨匡
卜Q自
国
ロ自
卜国〉
仁
&oh
o』
80弓
西o°oo
一
8
。D
ξ醒
一8の
bo
ξ爵
」
9邸
出
o
自潭
o
日惘←
。
の
8紆。
畄
bo口喟目“
匡
。
旦
£
(
・,
日8こo,
o
乙
。
替老ε
西。
〉哨[。
唱
崇
謡
冒ヨ
目。。
ε」
oり
の
8摺呂」
8詫」
“』り
呂
器卜
唱o
εo
目
ヒ。
〉嘱【ぞ甥
昌
葛
昌
。。
三畠
奮国
蹇で
憙
畠〉
名こ。の
。
聟麗」
。
冒
8ち
蠧日日
お”N
彑ρ
醇
一90一
N 工工一Eleotronio Library
The Japan Language Testing Association
NII-Electronic Library Service
The JapanLanguageTesting Association
3.2.3, Scoring
[[he VET is thus fatr the only test that has applied automatic speech recognition
technology to the rating of the speech sarnple. It provides informtion on four
diagriostic subscores (sentence mastery, vocabulary, fluency and pronunciation) as well
as an overall score. In assessing fluency and pronunciation, such elements as rate of
speech, length and position ofpauses, stress, and segmentation are measured. Scores are
reported both numerically (in a range of20 to 80) and in criterion-referenced descriptors.
The first item in each section is considered a practice item and is not scored. The
responses to the story-retelling section and the open-ended questions are not
automatically scored, and in tota1, 64 responses are automatically scored.
3.2.4. Evaluation
In the 2008 technical report ofthe VET, published online by Pearson, various types
of validity evidence were introduced in detail, which can be surnmarized from the
perspectives of content, stmctural, generalizability, and external validity,
(]bntent
Content validity of the VET was based on expert judgment on two aspects. First,
the audio prorrrpt in the test is not biased toward certain dialects. Second, the content of
the questions is context-free, and examinees only need general rather than specific
knowledge to answer the questions.
Structural
Correlations among test subscores on each scoring category-sentence mastery,
vocal)ulary, fluency and pronunciation-range from O.61 to O.90. This indicates that the
different scores are measuring distinct aspects of the test constmct, and therefbre offer
useful diagnostics.
Geneniliiabthty
Since VET scores do not involve human raters, besides test-retest reliahility, the
generalizability aspect of the VET was supported by agreements of machine-produced
scores, correlations with human scores, and a comparison of machine-produced scores
with human scores. First, the correlation between the scores from the first and second
administrations was found to be r = O.97, indicating high test-retest reliability fbr the
VET. Second, the agreements between ratings of the VET machine-generated scores,
ranging from O.88 to O.97, were similar to those of human raters, suggesting that they
are as reliable as human raters. Third, correlations between machine-generated scores
and human scores, based on data from 50 examinees were high, ranging from O.89 toO.97. in particular, at the overall score level, the VET machine-generated scores are
virtually indistinguishal)le from scoring based on carefu1 human transcriptions and
repeated independent human judgments. Finally, overall scores show an effective
separation between native and non-native examinees. Test scores of a norrning group of
-91-
The Japan Language Testing Association
NII-Electronic Library Service
The JapanLanguageTesting Association
775 native speakers of English were cornpared with those of a non-native nerming
group of 603 speakers. The score distributions for the different groups of examinees
show that native speakers all perfbrm well, while non-natives show a range of ability
levels.
External
To establish the external validity ofthe VET, its overall scores were compared with
other well-established large-scale tests. It was fbund that the correlations between the
VET overall scores and those on other speaking tests, including the Test of SpokenEnglish (TSE)i, the speaking section of the TOEFL iBT, and the International English
Language Testing System (IELTS) Speaking Test2 were quite high, ranging from O.75to O,94. 0n the other hand, the correlations were moderate between the VET scores and
the TOEFL iBT overall scores. Based on these findings, the developers claimed that theVET scores are more closely related to those tests measuring speaking ability rather
than general laiiguage ability; this seems to provide convergent and discrirninantvalidity for the test.
3.3. BEST PIus
3.3.1. General description
The original Basic English Skills Test (BEST) was developed in the early 1980s as
an easily administered assessment of the speaking abilities of non-English-speaking
adult refugees and immigrants to the United States. In order to shorten administration
time but maintain greater accuracy of measurement fbr accountability assessment,
researchers at the Center for Applied Linguistics (CAL) developed BEST PIus, a
computer-adaptive version of the BEST oral interview. The purpose of the test is to
assess the ability to understand and use unprepared, conversational, everyday language
within topic areas generally covered in adult education courses. The test is used to
measure language gains in individuals as well as the overall effbctiveness of a language
program through a pretest-posttest process.
The BEST PIus is available in two face-to-face formats: computer-adaptive and
print-based. Here, only the computer-adaptive fbrmat is introdnced. In the
computer-adaptive fbrmat, test items are delivered through a computer wnh a Best Plus
CD, USB, or Network. The test is conducted befbre a test administrator who reads the
question from the computer screen, listens to the candidate's response, and enters thescores into the computer. The cornputer-adaptive system determines the difficulty levelof the next question based on current estimates of the examinees' ability. At times, the
i The TSE is a tape-based test that measures the ability of nonnative speakers of English to
communicate effectively. The test is used for employment, graduate assistantships, licensure, and
cenification purposes and has been discontinued as ofMarch 3 1, 201O.2 The IELTS speaking test, delivered by a cenified exarniner, consists of three parts: (1) answering everyday questions, (2) talking about a topic, and (3) having a discussion on the topic with the examlner.
-92-
The Japan Language Testing Association
NII-Electronic Library Service
The JapanLanguageTesting Association
examinee is invited to look at the computer screen (for example, when shown a pictureto describe), but the examinee does not operate the corrrputer.
3.3.2. Test task characteristics
During the test, the examinee is presented with a relatively easy item on a selected
topic, fo11owed by increasingly challenging questions if the exarninee is able. There are13 general topic domains, such as personal identification, health, housing, and
transportation. The test comprises seven item types that vary in their cognitive and
linguistic demands: photo descriptions, entry items, yeslno questions, choice questions,
personal expansion, general expansion, and elaboration. Each time a new topic is begun,the first item presented is an entry question or photo description. Higher-abilityexaminees quickly move to open-ended expansion questions while lower-abilityexaminees move to questions that provide more support, such as choice or yesino
questions. An examinee may encounter up to seven different topically organized groupsof questions in a test administration. The difficulty of test tasks was determined byanalyzing the data collected from a field test that involved a 1arge number of examinees.
The test lasts between 5 and 20 minutes, dqpending on the proficiency level of the
exarmnee.
3.3.3. Scoring
The BEST PIus evaluates examinees' perforrriance on a rating scale consisting of
three criteria: listening comprehension, language complexity, and communication on a
zero-ten scale. Scores are reported on a scale of 88-999.
3.3.4. Evaluation
Validity evidence fbr the BEST PIus is available on two aspects, generalizability
and external validity, as reported in the BEST PIus technical report (2005).
Generalicabutty
Evidence for the generalizability aspect of test validity was provided with regard to
occasions, ratings, and test forms. First, to establish test-retest reliability, 32 examinees
took the test on two successive occasions; the correlation between the two sets ofscores
was O.89. Second, during the test, both an expert scorer and a novice scorer rated all the
responses, and the correlations for raw test scores between the rater pairings were
consistently above O.90. Third, parallel-form reliability was established through
investigating the equivalence of the three print-based test forrns in a study of 48 adult
ESL students who each took two test forms. The correlation of scores between test
fbrrns ranged from O.85 (Forms B and C) to O.96 (Forms A and C), The rationale fbr
using print-based test fbrms was that it would be difficult to predict the test items that
are used fbr each form of the BEST PIus since it is a corrrputer-adaptive test. Instead,
test items on the print-based tests were drawn from the computer-adaptive item pool.
-93-
The Japan Language Testing Association
NII-Electronic Library Service
The JapanLanguageTesting Association
External
Evidence for this aspect was gathered from the relationship between test scores and
Program Placement Levels and the relationship between test scores and scores from
other measures of English proficiency, First, the field test was conducted with 24
programs that provided infbrmation about which classes 1866 of their students were
placed into on the basis of existing measures and soning procedures. The average
correlation coefficient showing the relationship between placement levels and scores on
the BEST PIus across all programs was O.72. Developers claimed that this demonstrated
the usefulness of the test as a placement instmment in ESL programs. Second, anotlier
study that established the relationship between educational level gain on the BEST PIus
and the numher of instmctional hours received. It was fbund that 53% of examinees
madea level gain if they received less than 60 instructional hours, as compared to 70%ofexaminees who made a level gain if they received more than 140 instructional hours.
3.4. Summary
In this section, we introduced the three computer-delivered speaking tests, each of
which features a distinct application of computer technology: the TOEFL iBT uses both
independent and integrated tasks and makes fu11 use of multimedia for task prompts; the
VET is the only one scored by machine; and the BEST PIus delivers test tasks in
response to the examinee's ability.
Table 3 sumrnarizes validity evidence for the three tests and the fbllowings are
clear from the table. First, validity evidence is more abundant fbr the TOEFL iBT than
fbr other two tests.. Second, given that the tests have distinct features and accordingly
different concerns about each test's validity, test developers have conducted different
types of studies that fa11 into the same categories. Finally, the studies reviewed have
been mainly concerned with aspects of content, generalizability across occasions and
across raters, and relationships with external scores. The other aspects of substantive,
structural, and consequential have seldom been dealt with.
4. Challenges and future directions
4.1. Issues to consider for test development
As reflected in the three tests reviewed, new developments in the application of
computer technology to testing speaking have made the delivery of speaking tests more
efficient in many ways. However, when using computers to develop tasks, it is essential
to consider the balance between efficiency and practicality, particularly for high-stake
or small-scale tests. In the fo11owing, we consider the issues in test development on
methods oftest delivery, task de!ivery, rating, and, most importantly, test constmct.
-94-
The Japan Language Testing Association
NII-Electronic Library Service
Oひ.OI
¢.011
廴
−
(
80N
ご
ε。
蒙。
お
ゼコ
2弔
着ヨ←−
(
O
;N)
=
鐔く
昭
お
巷』の
”o
蕎目o
磊。
∋
α−
o口
島2審
ヒ
⇔。
担霑。。
8§
恩磊目喟驚一。
出
肩邸
日o
で
ぢp磊信oo
ヨ蜩≧
唱oヨo
昌
⇔o
拒8ごo
臥
呂毬廼8
●O
OO
口
祠
口鱈
〉
岩馬
>o
【
自
Σ9ゴ
5ω
口
OO
ち
85
お
長国
・
聡
昌り
巳一
ワう
N 工工一Eleotronio Library Servioe
(ひ
8へ
こε
口
罵
韓の
七52
邸
嵎ロ
ヨ←
,
(
い
OON
ご
ε
口
津o
占
毬嘗竃。
五日説
壱。
巴゜り
−
ぢ
昼旨8養≧。。
の
80a
。の
5静o
こ
鼠呂霧廼oり
゜
o
≧
省ε紹50り
一 95一
The Japan Language Testing Assooiation
ぢo
需
ぞ賃
冨廿
8口o唱
o・昏
衷ρ
ぢ口
頃。
書−
浸8口
8
琶
曁b。
唱
三
岩脅
甲
(寸OON
)[ε。
bD口
導り
(
;ON
ごε。
コ
聲口 の
£
”
墓島ヌ嘗。
穿国
−
(88)
【εω
並ち自コ
冖
b。
εの
逼。
日
魯[。
〉。
〔【
−
ぢ戸
磊口oo
壱嘱津
妊9口
8
哲こ
3。・
。
8≧
§8・・
。」
号」
宕
瓜98
尨ω
q8
°
損〇一口oり
旨=←の
固箇
←
国〉
卜
曽目』国
O卜
毫箭目
』尉
巴
弓
」
〉個[
晝−
」
。
琶畠
目09
唱o
涓唱
ξoO
彑箪華
も
碧
身く
の←
。・
8。。
口
暑a・。
它H。
主。
で−
Hga
日
8℃
98【
。。
』
;08
口。
℃
冨智窯一邸
と。
蠧口
自お占。【驀←
The Japan Language Testing Association
NII-Electronic Library Service
The Japan Language Testing Assooiation
墾。
ヨ譲。
嘱
ぢ
日甥口
壱自
ε
乱
否。
宅8注名魯
磊口。惘罵一。
名−
N卜 .011
」
”の【。
〉
。
=口
日
8エα
葦津
−
ひ
。 ◎,011
』
δヨρ
聲。」
冒ε
亙[田邸
自
−
Oひ .Oo
>o
』邸
捗ヨρ
疊。』
」
g孚』
。一』
−
寸ひ .OI
い
ト .O
冂
転
”の
場
gbo口
署巴。。
h
毟。
薫≧
−
トひ .011k
凵倉[鳶壼二の
。
逼幽。
卜−
ト
ひ .OI
ひoQ,011
廴
論日罵h
爲巳口
』
壱喟韓
旨o冨』o
ヒ09
,
トひ.OI
◎QQQ.
O
”着。
日
8翫邸
bo口
喟
琶〇
三
弓・5
芝−
(
°。
OON
)
超。
=
曙[冩
≧−
(
°。
OON
)
着。
国
曙
お
若ヨ゜。
(°。
OO
ぺ
と
ε。
2窪・5°。
”・。
嵐罵
8ho
濤曽
(
°。
8N)喟×
毯・。
9<←
=邸
OO
=O¢O
‘娼
≧厂−
鴇.O叫
鼓。゚
OON
ごε。
b。
自≧
冨口o
目。,の
oの
。。
邸
蟄霧
壱刷≧
−
QoQQ
.O
酬
」
蒼ヨ
鵞愚』
哲琶−場。
腎
(
ひ
OON
)
雪亡6【【o
芝略喟×
”。。
お駑」
o>
嘱
罵唱
‘
起
≧吩
』
〇一
国
屋で
日日
εの
boq瑁
莓b。
毛雨ロ
日
8−
它
屋=
aぢ
εヨ
で。
§石。
日
損
日
8&
田2畠
−
°。°。.O、
(
い
8巴
8日
”・。順。。
嘗目西8ε
魯【嘱畚N嘱
[
§口。
O−
畧£。
。
〉哨罵。Dω
田8>
還。・
o彑
唱
ゼ唱
8口
こ08
昌
9。。
冥国。
り島揖日属o
紹∩・
揖。
bDお〉員
8°
90
扁邸
Oり
Oロnの
O』
O¢
。。
考・。
こo
魯溶屋吾属・
の
お冨」
認Oh⇔
・・
巻・・
。
ζo
倉【哨ρ邸
。
且
函・
の
着題の
の
oh
註
聟5の
。
こ。
倉溶眉喜出・
隶岩。
弓。・。
口
8
一轟
日o
罠国
智【順鵞N
驀お
器O
魄
三山
hの
国顛
卜
目
〉
宀一田日』国
O宀
毫⇔
留』
田9冨お〉
慧 」
8琶日09
自O=
眉唱O自
彑箪霎
U。
詈。
含く
の
←
°。
gbo
口
暑。
勢で。h。
〉
器℃
−
おぢロ
日8ヨ。
。
蕊
当員。
8・。
コ〉。
魯ヨ孚ちご雨
貫日口
『(
ε口
8)
需[驀←
一 96一
N 工工一Eleotronio Library
The Japan Language Testing Association
NII-Electronic Library Service
The JapanLanguageTesting Association
4.1.1. Computer-based and Internet-based testing
Internet-based tests such as the VET are highly efficient as they can be taken
anywhere and anytime as long as examinees have access to the Internet oratelephone.
However, they have at least two potential problems: the risk of leaking of test items and
difficulty ofidentifying the examinee. These issues have made it difficult fbr high-stake
tests to be administered via the Internet, and tests such as the TOEFL iBT corrrpromise
by delivering the test through the Internet to desigriated sites.
4.1.2. Computer-adaptive, semi-adaptive, and non-adaptive testing
In computer-adaptive tests, the computer gives test tasks that are appropriate to theability of the examinee. This can save administration time compared with non-adaptive
tests that give the same tasks to all the examinees. Although widely used for testing
receptive skills, computer-adaptive tests are not a very popular choice for testing
speaking skills because of the high cost of developing speaking test tasks. The ideal
item pool for a computer adaptive test would have a 1arge number of highly
discriminating items at each ability level. A prerequisite of such an item pool is field
testing, which is labor-intensive and at a high risk fbr leakage oftest items.
In addition, to rnake a speaking test completely corrrputer-adaptive, the computer
rmist be able to rate an examinee's speech accurately. So fair, it is still difficult fbr
computers to achieve high precision in recognizing the examinee's speech on
open-ended tasks. Tests such as the BEST PIus, which require test administrators to
input a score fora task in order fbr the computer to determine the next task, are a way
around this problem. In this case, test developers must sti11 weigh the costs of
developing an item pool against the merits ofsaving administration time.
4.1.3. Automatie rating and human rating
Besides delivering test tasks, a computer can increase the efficiency of testing
speaking by replacing human raters. However, test developers can face two difficulties
in applying automatic rating technology to testing speaking. First, automatic rating, as
in the VET, can currently only be used for rating tasks that have highly predictableresponses, such as read-aloud or repetition tasks. Tasks that require extended responses
from examinees are still difficult to rate satisfactorily by machine. A second
impediment to the widespread use of automatic rating is the higli cost of the
development and application of speech recognition technology to testing.
4.1.4. The test construct ofspeaking: to include listening ability or net
In all three tests reviewed, listening comprehension ability plays an important role
in the corrrpletion oftest tasks. It is true that in order to interact successfu11y with others
in English, we rmist first understand what others are saying, This trend may also be
attributable to the increasing use of multimedia in task prompts. However, we must be
carefu1 with score interpretation. Also, test developers need to decide on the difficulty
ofthe listening prompt, which may depend on to what extent a successfu1 understanding
-97-
The Japan Language Testing Association
NII-Electronic Library Service
The JapanLanguageTesting Association
of listening prompts is intended to affect the completion of tasks.
4.2. Areas for future research
As reviewed in the previous section, a considerable amount of care and study has
been devoted to specifying the content and scoring procedures of the tests and
determining their generalizability across raters and their relationships with external
scores. However, more research is needed on the substantive, generalizability, and
consequential aspects of the validity of computer-delivered speaking tests in order to
improve the validation process and provide more usefu1 information to stakeholders.
4.2.1. Substantiye aspect
Using computers in testing speaking is likely to cause serious
constmct-underrgpresentativeness and construct-irrelevant variance from the
perspective of examinees' use of strategy as well as potential effects of
computer-delivery method and examinee characteristics on test performance.
Ebeaminees ' use oftes"taking strategy
Although theoretically, the use oftest-taking strategy is part oflariguage ability, we
lack infbrmation regarding the context of corrrputer-delivered speaking tests. Studies on
this topic are likely to be illuminating with respect to the construct validity of
computer-delivered speaking test scores. Empirical evidence can help test developers
decide whether to include the use oftest-taking strategy as part of the test censtmct, and
thus avoid the problem of constmct-underrepresentativeness. Analysis of selfreporting
by think-aloud protocol or questionnaire can reveal the cogriitive and rnetacognitive
processes involved when examinees respond to various speaking tasks.
Iiffoct ofcomputer-`letive, p method
Different from a traditional speaking test where an interlocutor is present to
administer the test or act as the conversation partner, the computer-delivered one asks
the examinee to speak to the computer. Exploration ofthe effbcts ofdelivery method on
examinees' test perfbrmance is thus important in order to eliminate constmct-irrelevant
variance. Also, how examinees actually use the planning time in a computer-delivered
speaking test compared with a face-to-face test is another issue for fUrther investigation.
ELtfect ofexaminee charactei:istics on testpectbrmance
Research on the effect of examinee characteristics on test perfbrmance could
further our understanding of the constmcts that tests assess. Future studies should
explore the possible interaction of affective factors, such as attitudes toward the test,
cornputer anxiety, and communication anxiety, with perfbrmance on a
computer-delivered speaking test. Also, using computers fbr assessment may cause
construct-irrelevant variance due to a lack of computer fatniliarity on the part of
examinees. Considering the low exposure to computers fbr high school and university
-98-
The Japan Language Testing Association
NII-Electronic Library Service
The JapanLanguageTesting Association
students,corrrputer familiarity is a factor that needs investigation.
4.2.2. Generalizability acress tasks
So far, rescarch conducted on performance assessment has shown that a rarge
examinee-by-task variance exists that would limit the generalizablity of scores from
these tests (e.g., Brennan, Gao, & Coloton, 1995). Since administration time fbr
computer-delivered speaking tests usually lasts 15 to 20 minutes, increasing the number
of tasks is often impractical. More studies are needed to determine the boundaries of
interpretations and ways ofreducing the interaction ofexaminees with tasks.
4.2.3. The consequential aspect ofvalidity
Studies on the impact of administering computer-delivered speaking tests on
teaching and learning English are still few in number. Errrpirical evidence about various
learning contexts is needed. Further research should explore how learners prepare for
speaking tests delivered by computer and how this differs from preparation forface-to-face tests in case there are any negative unintended consequences.
5. Conclusion
New developments in the application of cornputer technology have made the
delivery of speaking tests more efficient in many ways. From summarizing the validity
evidence available for the three tests, we hope that test deyelopers and test users, who
are ultimately responsible fbr fair and valid test use, will heed our calls for fumher
validity work, Continuing effbrts in this area are essential given the expected increasing
use ofcorrrputers in testing speaking.
Acknowledgement
I am deeply gratefu1 to Prof, Negishi Masashi fbr encouraging me to write this article, I
would also 1ike to thank two anonymous reviewers for their helpfur comrnents, and Dr.
Koizumi Rie for her constmctive advice on choosing the validity framework,
Reference
American Educational Research Association, American Psychological Association, &
National Council on Measurement in Education, (1999). Standt2rzls for educational and psychological testing. Washington, DC: American Educational Research Association.Brennan, R. L., Gao, X,, & Colton, D. A. (1995), Generalizability analyses ef work
keys listening and writing tests. Educational and Psychological measurement,
55(2), 157-176.
Brown, A., Iwashita, N., & McNamara, T. (2005), An examination ofrater orientations
and examinee penyformance on English:for=t4cademic-Purpose speaking tasks
(TOEFL Monograph No. 29). Princeton, NJ: Educational Testing Service.Butler, F. A,, Eignor, D., Jones, S., McNamara, T., & Suomi, B, K. (2000), 1()EllL
2000 speaking framework: A working paper (TOEFL Monograph No. 20).
Princeton, NJ: Educational Testing Service.
-99-
The Japan Language Testing Association
NII-Electronic Library Service
The JapanLanguageTestingAssociation
Center fbr Applied Linguistics. (2005). BESTPItLs technical report, Washington, DC:
Center fbr Applied Linguistics,Cumming, A., Grant, L., Mulcahy-Ernt, P,, & Powers, D. E. (2005). A
teacher-verijication study of speaking and writing protompe tasks for a new
70EFL 7lest (TOEFL Monograph No. MS-26), Princeton, NJ: Educational Testing
Service,Lee, Y.-W. (2005). Dependability ofscores for a new ESL speakeng test: Evaluating
protompe tasks (TOEFL Monograph No. MS-28). Princeton, NJ: Educational
Testing Service,Messick, S. (1995), Validity ofpsychological assessment: Validation of inferences from
persons' responses and perfbrmances as scientific inquiry into score meaning.
American Rsychologist, 50, 741-749,Miller, D. M., & Linn, L. R. (2000), Validation of perfbrrnance-based assessments.
Ampiied Rsychological Measuremen4 24(4), 367-378.
Pearson. (2008). Vlarsant English 71est: 71est design and valitlation research. Pearson
Education. Retrieved Feb 24, 2011, firom http:
!lwww.Versant.jp!pdflversantfbrenglish-technicalmanual-v6.pdfRosenfeld, M., Leung, S., & Oltman, P. K, (2001), 711ie reading writing speaking and
listening tasks importantfor academic success at the undergraduate and graduate
levels (TOEFL Monograph No. MS-21). Princeton, NJ: Educational Testing
Service.Sawaki, Y., Stricker, L. J., & Orarije, A. (2008). thctor strueture qf the roEl7L
intemet-Based 71est (iB7): Ebeploration in a.field trial sample. (TOEFL iBT-04). Princeton, NJ: Educational Testing Service,
Stricker, L. J., & Attali, Y. (201O). 7;est takeny '
attitudes about the roEFL iBT (TOEFL iBT -13). Princeton, NJ: Educational Testing Service.
Stricker, L. J., & Rock, D. (2008). jFbctor structure of the roEFL internet-based test
across subgroups (TOEFL iBT-07). Princeton, NJ: Educational Testing Service.
Swain M., Huang L.-S., Barkaoui K., Brooks L., & Lapkin, S. (2009). 71he Elic)eaking
Sbction of the roEFL iBT as71B7): 7lest-takei:s' reported strategic behavior:y
(TOEFL iBT-1O). Princeton, NJ: Educational Testing Service.Wall, D., & Horak, T. (2008). 17lre impact ofchangas in the roEFL examination on
teaching and learning in Clentral and EZistern Europe: Phase 2, Coping with
change (TOEFL iBT-05). Princeton, NJ: Educational Testing Service.Wang, L., Eignor, D., & Enright, M. K, (2008). A final analysis. In C.A. Chapelle, M. K.
Emight & J. M. Jamieson (Eds.), Building a validity angumentfor the 7lest of English as a jFbreign Language (pp. 259-3 18). New York: Routledge.
Xi, X. (2008), investigating the criterion-related validity qfthe roEIFZ speaking scores for I7;4 screening and setting standards for I714s, (TOEFL iBT-03). Princeton, NJ: Educational Testing Service.Xi, X., & Mollaun, P. (2006). investigating the utiiity ofanalytic scoringfor the rolff TL
Academic SPeaking 7lest aASI;) (TOEFL iBT-Ol). Princeton, NJ: Educational
Testing Service.
Xi, X., & Mollaun, P. (2009). Hbw do ratens.from indiape,:form in scoring the roEFL
iBT speaking section and what ldnd of training heips2 (TOEFL iBT-11).
Princeton, NJ: Educational Testing Service.Zhang, Y. (2008). Repeater anaLyyesfor roEl7L iBT (ETS Research Memorandum No.
RM. 08-05). Princeton, NJ: Educational Testing Service,
- 100
-