gt2] (TOEFL (VET), (BEST

The Japan Language Testing Association

NII-Electronic Library Service

The JapanLanguageTesting Association

Computerized assessment of second language speaking: A review of tests

yajiazHou[ma gt2]lbkyo Uhivensity ofIFbreign Studies

Abstract

Testing second language speaking via cornputer is a fast-growing area in second

lariguage assessment. The purpose of this article is threefbld: to familiarize readers with

the state of corrrputer technology as it can be applied to speaking tests by discussing

three speaking tests: the speaking section of the Test ofEnglish as a Foreign Laiiguage

Internet-based test (TOEFL iBT), the Versant English Test (VET), and the Basic

English Skill Test Plus (BEST PIus); to summarize the evidence fbr these tests' validity

for potential end-users to evaluate thern; and to provide critical issues to consider fordeveloping a computer-delivered speaking test and potential areas for future research.

1. Introduction

Cornputerized assessment of second language speaking is a fast-expanding area in

second language assessment. Various attempts have been made to apply computer

technology to the delivery ofspeaking tests.

In order to benefit readers who wish to advance their knowledge in testing speaking

via computer, this article discusses three speaking tests that have successfu11y integrated

cornputer technology into their test delivery, task selection and presentation, and scoimg

process, respectively. The selected tests are the speaking section of the Test of English

as a Foreign Language Internet-based test (TOEFL iBT), the Versant English Test(VET), and the Basic English Ski11 Test Plus (BEST PIus). The second purpose ofthis article is to summarize the validity evidence of the three

tests. These tests may be used fbr various purposes, including placing learners in

different levels, measuring learners' improvements, and collecting spoken data in

research studies. It is thus critical for teachers and researchers to collect infbrrnation inorder to judge to what extent these tests are reliable and valid fbr their purposes.

Finally, with an increasing use of computer-delivered speaking tests, it is urgent to

gather evidence to support the use of the tests. Based on the descriptive and evaluative

review of the selected speaking tests, this anicle offers perspectiyes fbr future trends

and future research avenues in testing speakmg via computer,

In sum, this paper aims to answer the fo11owing questions:

(1) What are the recent developments of cornputer application in testing second

language speaking as reflected in the TOEFL iBT, the VET, and the BEST PIus?

(2) What empirical evidence exists for the reliability and validity of the speaking

section of the TOEFL iBT, the VET, and the BEST PIus?

(3) What are the critical issues to be considered fbr test development and future

researeh in testing second language speaking via computer?

The rest of the article is presented in three major sections. The first introduces a

validity framework to guide the discussion ofyalidity issues on the speaking section of

the TOEFL iBT, the VET, and the BEST PIus. The second fbcuses on the description of

the key characteristics ofthe three tests and reviews available validity evidence. The last

section is devoted to the future direction oftest deyelopment and research,

2. Validity framework

Validity has been defined as "the

degree to which evidence and theory support the

interpretations ef test scores entailed by proposed uses of tests" (AERA/APAINCME,1999, p. 9). Although there has been agreement in considering validity to be a unified

concept, various criteria fbr examining validity exist. Among the different frameworks

that incorporate various taxonomies of validity evidence, this study adopts the one

proposed by Messick (1995) and adapted by Miller and Linn (2000) to the performancetest. This framework indicates that the validation of perfbrmance assessments should

incerporate six aspects of validity: content, substantive, structural, generalizability,external, and consequential. We fbund this framework usefu1 for organizing the validity

issues concerning speaking assessment that belongs to a method of performanceassessment. In the fbllowing, we briefiy describe each of these aspects as well as their

implications for evaluating the validity of inferences based on the scores of speaking

tests.

2.1. Content

The content aspect of validity is related to the degree of the congruence of test tasks

to the test specifications, and how well they represent the construct measured.

According to Miller and Linn (2000), evidence fbr content validity could be gatheredthrough content analysis in two phases. First, during initial task development phase, a

blueprint with a theoretical base provides evidence of the content and ski11s being

assessed. Second, after the tasks are constmcted, empirical evidence is gathered from

individuals not involved in the development process through interview or questionnaire.

2.2. Substantive

The substantive aspect is concerned with the process used by examinees when they

respond to the test tasks and the consistency of those processes with the constmct the

assessment is intended to measure. Thus, two aspects of evidence are sought to

demonstrate that tasks lead examinees to engage in the intended cognitive processes and

that the infiuence of constmct-inrelevant factors is minimized. For the first aspect,

experts must review the tasks and scoring rubrics to judge whether scores are based onthe successfu1 completion of the intended process. Errrpirical evidence is also essentialon the congruence of what examinees are actually engaged in when perfbrming the test

tasks with the intended process. This line of study could include analysis of the speech

samples as well as the cognitive processes and strategies examinees use via think-aloud

protocols. Second, to minimize constmct-irrelevant variances, potential sources of these

variances, particularly on the part of examinees, mnst be identified, For the

computer-delivered speaking assessment, examinees' reaetions to the test and

examinees' characteristics, such as computer anxiety and computer familiarity, may besystematically related to test scores, and therefbre should be investigated.

2.3. Structural

In general, stmctural validity is evaluated by investigating the internal stmcture of

the test to ensure that the relationships among test items and sections correspond with

the constmct definition. In assessing speaking, this aspect mainly concerns about

whether the scoring categories are consistent with the constmct. The stmctural validity

requires theoretical evidence that the scoring model is based on an existing knowledge

base regarding the construct and ernpirical evidence that is highly dependent on scoring

method. For analytic ratings, it is critical to ensure that rating categories measure related

but distinct elements, and fbr holistic ratings, whether raters could cover al1 dimensionswhen issuing a holistic score is important Further, since the scoring method is central to

understanding the inferences that can be made from tests, ernpirical evidence that

provides rationale for the selection ofa certain scoring method is also required,

2.4. Generalizability

The generalizability aspect focuses on the replicability of test results across

multiple levels of the assessment procedure. A major concern with perfbrmanceassessment is that score interpretation not be limited to the sample of assessed tasks, but

be broadly generalizable to the constmct domain. This issue of generalizability of score

inferences across tasks and contexts goes to the very heart of score meaning,

Generalizahility theory is often used to examine the degree of the generalizability oftest

scores. ln addition, the limits of score meaning are also affected by the degree of

generalizability across occasions and raters of task perfbrmance. Such sources of

measurement error associated with the sampling of tasks, occasions, and raters underlie

traditional reliability concerns.

2.5. External

The external aspect refers to the extent to which relationships between test scores

and external assessments reflect those as constmcts specify. Similar to what is known as

criterion-related validity, this aspect usually examines correlations of test scores with

relevant criteria, such as other measures of the targeted construct (convergent validity)

and variables hypothesized to be unrelated to the construct (discriminant yalidity).

Convergent evidence is based on high correlations due to shared constmcts;

discriminant evidence is based on low correlations due to different censtmcts.

2.6. Consequential

The consequential aspect examines the degree to which assessments have bothintended positive effects and plausible unintended negative effects. Intended

consequences can include changes in the instmctional and curricular practices of

teachers that lead to better learning environments fbr students. Unintended

consequences can include bias in assessment leading to misinterpretation. The emphasis

of the evaluation should be on the degree to which the positive outcomes of the test

outweigh any negative consequences.

The fbregoing validity framework provides a usefu1 framework fbr organizingvalidity eyidence for various types of speaking assessments, and it is desirable to haveas many sources of evidence as possible that can significantly contribute to certain

aspects of validity. Yet, depending on the purpose of a test and the types of

interpretations of test scores that are made, some sources of validity evidence may bemore credible than others. Since the three speaking tests to be reviewed could be used in

diffbrent contexts fbr different purposes, it is not the intention of this article to argue foreach test. Rather, readers must evaluate each in view oftheir own experience.

3. Test description and evaluation

This section reviews the speaking section of the TOEFL iBT, the VET, and the

BEST PIus, fbr each of which of key characteristics of the tests are reported from three

aspects: general description, test task characteristics, and scoring method. Table 1 and

Table 2 present summaries of the characteristics for the tlll:ee tests, respectively.

Evidence for the evaluation of the tests is then presented in the previously introdncedframework ofvalidity, and summarized in Table 3 at the end ofthis section.

3.1. TOEFL iBT speaking section

3.1.1. General description

The TOEFL iBT was developed by Educational Testing Service (ETS). It debutedin September 2005 in Nonh America and was later launched in other countries. It is a

four-ski11 test in which the speaking section is mandatory. The speaking section of the

test is used to evaluate examinees' oral comniunication ski11s in the context of their

readiness for study in English-speaking universities. Specifically, it measures the ahility

to speak about everyday familiar topics, and to summarize, evaluate, compare, and

synthesize information from multiple sources in a spontaneous manner.

The TOEFL iBT is delivered via an Internet-based delivery system but not

corrrpletely Intemet-based in that examinees need to take the test on particularcomputers at ETS-certified test centers around the world. During the test, examinees

wear noise-canceling headphones and speak into a microphone. Responses are digha11y

recorded and sent to ETS's Online Scoring Network where they are scered by raters.

3.1.2. Test task characteristics

The speaking section ofthe TOEFL iBT consists of six tasks. Two are independent

tasks, which require examinees to express opinions on familiar topics related to campus

life or lecture situations. The other four are integrated tasks. Two of the four arelistening/speaking tasks, which ask examinees to listen to a short spoken recording and

then respond to it. The final twe are readingllistening!speakng tasks, which require

examinees to read a short text, listen to a spoken recording that relates to what they read,

and then respond about what they read and heard. Reading texts are 75-1OO words long;

recording lasts between 60 and 120 seconds. Examinees can take notes and use them

when responding to the speaking tasks. Their comprehension is not separately assessed,

as these texts are considered short and memorable.

The test is not computer-adaptive; each examinee rnust give response to all six

tasks, and the tasks are not adapted to the examinees' speaking level. For each task,

examinees are given 15-30 seconds to prepare their response, and 45-60 seconds to

respond, ln tota1, the speaking section takes approximately 20 minutes.

3.1.3. Scoring

The TOEFL iBT speaking section uses holistic scoring, and three to six different

cenified raters issue a holistic score for each respense on a scale ofzero to fbur based

on delivery, language use, and topic development. Delivery refers to the pace and clarity

of speech, and raters consider speakers' pronunciation, intonation, rate of speech, and

degree of hesitancy. Language use includes the range, complexity, precision, and

automaticity of vocabulary and grammar use. Raters evaluate examinees' al)ility to

select words and phrases and to produce stmctures that appropriately and effectively

communicate their ideas. When assessing topic development, raters take into account

the progression of ideas, the degree of elaboration, and completeness, and, in integrated

tasks, the relevance ofthe content.

The average rating of all the raters is then converted to a scaled score from zero to

30. Examinees can view their scores online 15 business days after the test and choose to

receive a copy of their score report by mail. Universities and agencies can also receive

scores in paper or electronic forrnats on request. The ETS uses the Online ScoringNetwork, a secure internet-supported system, to share data, train raters, and monitor

scoring of examinees' speaking samples. Raters are certified befbre they can beginscoring work and prior to each scoring session on a daily basis,

3.1.4. Evaluation

Empirical evidence for the validity of the TOEFL iBT speaking section is available

fbr each of the six aspects as specified by the validity framework.

Cbntent

In developing speaking tasks fbr the TOEFL iBT, a theoretical framework was first

developed in which speaking constmct measured by various task types was specified

(Butler, Eignor, Jones, McNamara, & Suomi, 2000). Following this framework, task

statements were drafted and then prototype tasks were developed.

．の

它8靄

壱♪の

詈順冨の

も量

ヨ祐口嘱

壽［β。

の冖巨三

三゚，

一口〇一口OO

Q順日o

　（

⇔D口

暑巴g 。

葡口

話−

N 工工一Eleotronio 　Library 　Servioe

葛日畠

〉。

　　　9

OO・。

り。ON

・うO【腎

。嚇Oめ

毟【。。

　　一口〇一

OのhコOO

瑁書8＜

毟［・。

a§り

　（

署呂の

薊口

毳コ

囂軌。

ヨ・

一）

⇔o眉

着』の丶

毎迄

鳥眉窯毟

器恥。

ヨ゚

　（

　bD口

葛三

窯鬱

器島g芦

一 86一

【8の

。°・

bΩ轄

島鵠日

ご。〉順可∩

o嘱易幽o＝

口羇

日5＝

・。

・昏コ

。。9

衷＝

日吋

The 　Japan 　Language 　Testing 　Assooiation

おヨ日邸

　　（

墓ヨ巴。

＆曾

ごoboo

o【邸

。。

三一

　　（

崗）

冒宕呂。

宕り

量壱甲口oZ

゜・

聾a目

器島

層で

鳥8蕁

℃oの

b田2ヨ

←凶哨

』』国

○卜

　　　o

日鞘

日＝

聾2閉8房

目帽

岩。←

ζ。．

替着

ご。

主。

曇。

自。の

芻犒紹」

3冨詰』9

旨←

8審。・・

ε髱o

島←国

出O卜

名こ。

崖邸

ちE日

蠧量員

お悶

Two studies provided positive empirical evidence fbr the content validity of the test.

In Rosenfeld, Leung, and Oltman (2001), a large number ofuniversity faculty members

and students in North America rated a series of task statements in relation to the

importance of each task statement to the successfu1 completion of coursework. Results

support the claim that test content as indicated in task statements was relevant and

representative of what examinees encounter in a university context. Cumming, Grant,

Mulcahy-Ernt, and Powers (2005) evaluated seven speaking prototype tasks by asking

seven experienced ESL instmctors whether they thought the prototype tasks represented

the domain of academic English required fbr studies at English-medium universities and

fu1fi11ed the pu:poses for which they had been designed. The instructors viewed the

prototype tasks positively, particularly integrated tasks that required students to speak in

reference to reading or listening to source texts.

Substantive

The substantive aspect of validity on the TOEFL iBT speaking section has been

examined by analyzing speech samples, examinees' strategic behaviors, and their

attitudes toward the test. While the fbrmer two types of analyses provided positive

evidence, the latter yielded only negative evidence.

Brown, Iwashita, and McNamara (2005) analyzed speech samples of examinees on

six prototype tasks to investigate whether the content ofthe rating scales was relevant to

perfbrmance on panicular tasks, They fbund that the greater fluency, more sophisticated

vocabulary, better pronunciation, greater grammatical accuracy, and more relevant

content were characteristic of speech samples receiving higher holistic scores from

raters, which indicated that the constmct that the tasks were designed to measure was

indeed represented in the discourse.

Although use of strategy is considered integral to perfbrming tasks, it is not

included as a part of the speaking constmct in the speaking section of the TOEFL iBT.

To address the concern of constmct-underepresentation, Swain, Huang, Barkaoui,

Brooks, and Lapkm (2009) asked 30 Chinese ESL learners to report strategic behaviors

when taking the test and examined how the reported strategies related to their test scores,

No sigriificant differences were found in rqported strategy use across proficiency levels

or between the total number ofreported strategic behaviors and test scores. The results

of the study did not change the constmct of the TOEFL iBT speaking section.

Stricker and Attali (201O) administered a questionnaire concerning attitudes toward

the test to examinees in fbur countries. Overall attimdes about the TOEFL iBT were

moderately positive in most countries. Yet, compared with other ski11s, attitudes toward

the speaking section were consistently less favorable in all countries, and were

unfavoral)le in two countries. Although this study did not relate speaking test scores to

examinees' attitudes toward the speaking section, the results showed that attitude

toward the test may be one potential source of construct-irrelevant variances.

Struetural

Two investigations into the appropriateness of using a holistic rating scale have

justified the validity of the stmctural aspects of the test. First, to ensure the relevance of

the content of the holistic rating scales with features of speaking proficiency deemed by

expert judges, Brown, Iwashita, and McNamara (2005) used verbal-report methodology

to examine what aspects of examinees' speech the raters paid most attention to when

scoring tasks. Results showed that with no guidance, domain experts distinguishedbetween perfbrmances using a common set of criteria similar to those included in rating

scales developed for the tasks. Second, Xi and Mollaun (2006) justified using a holisticratmg scale rather than analytic rating scales. The study found that the reliability ofscores obtained from the two types ofrating scales did not differ substantially. But sincethe analytic scores were highly correlated, they would not provide additional

infbrmation beyond what the holistic scores could offer fbr most examinees.

Generalizability

Generalizability of the TOEFL iBT speaking test was verified across occasions,

raters, and tasks. Zhang (2008) analyzed the scores of examinees who repeated the testonce within 30 days. Result showed a moderate correlation of O.84 between the scores

of the two tests, suggesting that the difficulty of tasks on different occasions could beconsidered equivalent. With regard to rater generalizability, inter-rater reliahility was

not published by the ETS fbr operational settings, and instead, the ETS claimed that

raters are constantly monitored each time they score a test. In addition, Xi and Mollaun

(2009) fbund that Indian raters could rate the speech of examinees of various national

backgrounds reliably, which provides positive evidence fbr recruiting nonnative

speakers to score the test. Finally, as to the generalizability of the scores across tasks,

the generalizability index published by the ETS is O.88, higher than other sections. This

may be attributable to the Lee' study (2005), which justified the test design concerningthe present number of tasks for each task type in order to achieve acceptable reliahility.

lixtei:nal

Convergent evidence for external validity has been collected through comparison

with the selfiassessment of examinees and with local test. Two studies providedevidence fbr the distinctiveness of what is measured in the speaking section from other

parts of the test. With a sample of over 2000 examinees, Wang, Eignor, and Enright

(2008) found that the correlation between the responses to the statements of

selfassessment on speaking ski11s and the summative scores for the test was a moderate

O.55. In addition, the speaking section is also used also for selection of ITA

(International Teaching Assistant). Therefbre, Xi (2008) examined the relationships

between the speaking scores and those on several local ITA tests, and the regression

results suggest that the speaking test scores could sigriificantly predict scores en

students' TA assignments. In Sawaki, Suicker, and Oraaje (2008), the factor loading

patterns of the test indicate that both independent and integrated tasks define the

speaking constmcts accurately, and are minimally involved in the reading and listening

constmcts. Stricker and Rock (2008) verified the result of Sawaki et al. fbr different

groups in native language and period ofexposure to the English lariguage.

thnsequential

As to washback, the addition of the speaking section into the TOEFL iBT has

improved the impact of the test on teaching and learriing. Several studies by Wall and

Horak (e.g., 2008) reported on the practices of teachers a year after the launch of the

TOEFL iBT in Europe, They concluded that the TOEFL iBT has indeed had the desired

effect on the content of TOEFL preparation classes, in that much more emphasis has

been placed on the teaching of speaking abilities and integrated tasks.

3.2. Versant English Test

Versant English Test (VET), developed by Pearson, intends to measure what the

developers called `'facility

in a spoken language." It is defined as the ability to

understand spoken English on everyday topics and to respond appropriately at a

natiye-like conversational pace in intelligible English. This ability is assumed to

underlie high perfbrmance in communicative settings, since learners nmst understand

interlocutors correctly and efficiently in real time to be able to respond. Learners must

also be able to fbrmulate and articulate a cornprehensible answer without undue delay.

Without specifying the uses of the test, developers claimed the test could be used

for various purposes, including measuring proficiencies, placement, and achievement.

The VET is Interr)et-based and can be taken wherever there are telephones or corrrputers

with access to the Internet. However, the end users of the test are responsible for

administration: verifying the identity of the examinee, giving the examinee the required

material, and monitoring the administration of the test. Examinees can retrieve their

results from a secure website within minutes of test completion.

The VET consists of 63 items grouped into six sections: reading aloud, repeating,

short answer questions, sentence builds, story retelling, and open questions. This variety

provides multiple fu11y independent measures that underlie facility with spoken English,

including phonological fluency, sentence constmction and comprehension, passive and

active vocabulary use, and pronunciation of rhythmic and segmental units.

Developers desigried the test so that each section consists of increasingly

challenging items delivered by native speakers from various English-speaking regions.

In addition, through the use of algorithms, the Versant Testing System selects items for

each examinee from a pool of items graded fbr difficulty, thus ensuring that each

administration event is unique. The printed text is available on a test paper given to the

examinee a few minutes befbre the beginning of the test, and is also available for those

using a computer, The entire test lasts appreximately 15 minutes.

，・。

ぢ口

頑9。。

蕊。」儡いβ“、

耜這旨o

房。

彎【8。

ひひ

9裾Q喟日一

倉巻五日

　0bo

乱口

8嘱゜。

口。

留己日

⇔。

琶迄日−

口。

三」

霽8口

2口。

ロo咽罵嘱o口5自o

貯冒ρ邸

8＞−

ぎ5三」

西。

濤日

g口。

琵で

雪口o

口。

三」

　口o胃思oロコロo

〇一1

憲oQり

o喟栃コo

艮［邸

§日5出

E8】≧

日ONl

。【』

8嘱［＆邸

　・・

畳。一

”［εoト

ぢ乞

　冩

且曾

　且g

ぢ客

　貯で

費。

［畚o

口o咽罵』

畚田

　　　（

。［憩。

且9ぢ

8酒巴×。

履。

　　　（

。［鵞。

言9←o

蔟巴×

2目富」

　　　（

暑。嘱［＆忠。

8磊。

　　　（

。【盤り喟［＆e。

口。

鬻。

8蓄ト

舞【＆εo

倉崗

。【畚。

且静ぢ

8且層・。

巳・。

磊。

口。

　　　（

。oロ

ヨ2む。

あ・

。・

票388

°。・

　（

N）お

≧°。

巷朗・

（巴）

鵞。の

§魯午

　（

°。）

彎。

窯・

且邸

且巷甲

やo・，

田9＆日

零駕

島葱唱

勗ま三一

旨』

−←o

日9β

旨匡

卜Q自

ロ自

卜国〉

西o°oo

一8の

自潭

日惘←

8紆。

bo口喟目“

・，

日8こo，

替老ε

西。

〉哨［。

冒ヨ

目。。

8摺呂」

8詫」

“』り

器卜

ヒ。

〉嘱【ぞ甥

。。

三畠

奮国

蹇で

畠〉

名こ。の

聟麗」

蠧日日

お”N

一90一

N 工工一Eleotronio 　Library 　

3.2.3, Scoring

[[he VET is thus fatr the only test that has applied automatic speech recognition

technology to the rating of the speech sarnple. It provides informtion on four

diagriostic subscores (sentence mastery, vocabulary, fluency and pronunciation) as well

as an overall score. In assessing fluency and pronunciation, such elements as rate of

speech, length and position ofpauses, stress, and segmentation are measured. Scores are

reported both numerically (in a range of20 to 80) and in criterion-referenced descriptors.

The first item in each section is considered a practice item and is not scored. The

responses to the story-retelling section and the open-ended questions are not

automatically scored, and in tota1, 64 responses are automatically scored.

3.2.4. Evaluation

In the 2008 technical report ofthe VET, published online by Pearson, various types

of validity evidence were introduced in detail, which can be surnmarized from the

perspectives of content, stmctural, generalizability, and external validity,

(]bntent

Content validity of the VET was based on expert judgment on two aspects. First,

the audio prorrrpt in the test is not biased toward certain dialects. Second, the content of

the questions is context-free, and examinees only need general rather than specific

knowledge to answer the questions.

Structural

Correlations among test subscores on each scoring category-sentence mastery,

vocal)ulary, fluency and pronunciation-range from O.61 to O.90. This indicates that the

different scores are measuring distinct aspects of the test constmct, and therefbre offer

useful diagnostics.

Geneniliiabthty

Since VET scores do not involve human raters, besides test-retest reliahility, the

generalizability aspect of the VET was supported by agreements of machine-produced

scores, correlations with human scores, and a comparison of machine-produced scores

with human scores. First, the correlation between the scores from the first and second

administrations was found to be r = O.97, indicating high test-retest reliability fbr the

VET. Second, the agreements between ratings of the VET machine-generated scores,

ranging from O.88 to O.97, were similar to those of human raters, suggesting that they

are as reliable as human raters. Third, correlations between machine-generated scores

and human scores, based on data from 50 examinees were high, ranging from O.89 toO.97. in particular, at the overall score level, the VET machine-generated scores are

virtually indistinguishal)le from scoring based on carefu1 human transcriptions and

repeated independent human judgments. Finally, overall scores show an effective

separation between native and non-native examinees. Test scores of a norrning group of

775 native speakers of English were cornpared with those of a non-native nerming

group of 603 speakers. The score distributions for the different groups of examinees

show that native speakers all perfbrm well, while non-natives show a range of ability

levels.

External

To establish the external validity ofthe VET, its overall scores were compared with

other well-established large-scale tests. It was fbund that the correlations between the

VET overall scores and those on other speaking tests, including the Test of SpokenEnglish (TSE)i, the speaking section of the TOEFL iBT, and the International English

Language Testing System (IELTS) Speaking Test2 were quite high, ranging from O.75to O,94. 0n the other hand, the correlations were moderate between the VET scores and

the TOEFL iBT overall scores. Based on these findings, the developers claimed that theVET scores are more closely related to those tests measuring speaking ability rather

than general laiiguage ability; this seems to provide convergent and discrirninantvalidity for the test.

3.3. BEST PIus

The original Basic English Skills Test (BEST) was developed in the early 1980s as

an easily administered assessment of the speaking abilities of non-English-speaking

adult refugees and immigrants to the United States. In order to shorten administration

time but maintain greater accuracy of measurement fbr accountability assessment,

researchers at the Center for Applied Linguistics (CAL) developed BEST PIus, a

computer-adaptive version of the BEST oral interview. The purpose of the test is to

assess the ability to understand and use unprepared, conversational, everyday language

within topic areas generally covered in adult education courses. The test is used to

measure language gains in individuals as well as the overall effbctiveness of a language

program through a pretest-posttest process.

The BEST PIus is available in two face-to-face formats: computer-adaptive and

print-based. Here, only the computer-adaptive fbrmat is introdnced. In the

computer-adaptive fbrmat, test items are delivered through a computer wnh a Best Plus

CD, USB, or Network. The test is conducted befbre a test administrator who reads the

question from the computer screen, listens to the candidate's response, and enters thescores into the computer. The cornputer-adaptive system determines the difficulty levelof the next question based on current estimates of the examinees' ability. At times, the

i The TSE is a tape-based test that measures the ability of nonnative speakers of English to

communicate effectively. The test is used for employment, graduate assistantships, licensure, and

cenification purposes and has been discontinued as ofMarch 3 1, 201O.2 The IELTS speaking test, delivered by a cenified exarniner, consists of three parts: (1) answering everyday questions, (2) talking about a topic, and (3) having a discussion on the topic with the examlner.

examinee is invited to look at the computer screen (for example, when shown a pictureto describe), but the examinee does not operate the corrrputer.

During the test, the examinee is presented with a relatively easy item on a selected

topic, fo11owed by increasingly challenging questions if the exarninee is able. There are13 general topic domains, such as personal identification, health, housing, and

transportation. The test comprises seven item types that vary in their cognitive and

linguistic demands: photo descriptions, entry items, yeslno questions, choice questions,

personal expansion, general expansion, and elaboration. Each time a new topic is begun,the first item presented is an entry question or photo description. Higher-abilityexaminees quickly move to open-ended expansion questions while lower-abilityexaminees move to questions that provide more support, such as choice or yesino

questions. An examinee may encounter up to seven different topically organized groupsof questions in a test administration. The difficulty of test tasks was determined byanalyzing the data collected from a field test that involved a 1arge number of examinees.

The test lasts between 5 and 20 minutes, dqpending on the proficiency level of the

exarmnee.

3.3.3. Scoring

The BEST PIus evaluates examinees' perforrriance on a rating scale consisting of

three criteria: listening comprehension, language complexity, and communication on a

zero-ten scale. Scores are reported on a scale of 88-999.

3.3.4. Evaluation

Validity evidence fbr the BEST PIus is available on two aspects, generalizability

and external validity, as reported in the BEST PIus technical report (2005).

Generalicabutty

Evidence for the generalizability aspect of test validity was provided with regard to

occasions, ratings, and test forms. First, to establish test-retest reliability, 32 examinees

took the test on two successive occasions; the correlation between the two sets ofscores

was O.89. Second, during the test, both an expert scorer and a novice scorer rated all the

responses, and the correlations for raw test scores between the rater pairings were

consistently above O.90. Third, parallel-form reliability was established through

investigating the equivalence of the three print-based test forrns in a study of 48 adult

ESL students who each took two test forms. The correlation of scores between test

fbrrns ranged from O.85 (Forms B and C) to O.96 (Forms A and C), The rationale fbr

using print-based test fbrms was that it would be difficult to predict the test items that

are used fbr each form of the BEST PIus since it is a corrrputer-adaptive test. Instead,

test items on the print-based tests were drawn from the computer-adaptive item pool.

External

Evidence for this aspect was gathered from the relationship between test scores and

Program Placement Levels and the relationship between test scores and scores from

other measures of English proficiency, First, the field test was conducted with 24

programs that provided infbrmation about which classes 1866 of their students were

placed into on the basis of existing measures and soning procedures. The average

correlation coefficient showing the relationship between placement levels and scores on

the BEST PIus across all programs was O.72. Developers claimed that this demonstrated

the usefulness of the test as a placement instmment in ESL programs. Second, anotlier

study that established the relationship between educational level gain on the BEST PIus

and the numher of instmctional hours received. It was fbund that 53% of examinees

madea level gain if they received less than 60 instructional hours, as compared to 70%ofexaminees who made a level gain if they received more than 140 instructional hours.

3.4. Summary

In this section, we introduced the three computer-delivered speaking tests, each of

which features a distinct application of computer technology: the TOEFL iBT uses both

independent and integrated tasks and makes fu11 use of multimedia for task prompts; the

VET is the only one scored by machine; and the BEST PIus delivers test tasks in

response to the examinee's ability.

Table 3 sumrnarizes validity evidence for the three tests and the fbllowings are

clear from the table. First, validity evidence is more abundant fbr the TOEFL iBT than

fbr other two tests.. Second, given that the tests have distinct features and accordingly

different concerns about each test's validity, test developers have conducted different

types of studies that fa11 into the same categories. Finally, the studies reviewed have

been mainly concerned with aspects of content, generalizability across occasions and

across raters, and relationships with external scores. The other aspects of substantive,

structural, and consequential have seldom been dealt with.

4. Challenges and future directions

4.1. Issues to consider for test development

As reflected in the three tests reviewed, new developments in the application of

computer technology to testing speaking have made the delivery of speaking tests more

efficient in many ways. However, when using computers to develop tasks, it is essential

to consider the balance between efficiency and practicality, particularly for high-stake

or small-scale tests. In the fo11owing, we consider the issues in test development on

methods oftest delivery, task de!ivery, rating, and, most importantly, test constmct.

Oひ．OI

¢．011

蒙。

ゼコ

着ヨ←−

；N）

鐔く

巷』の

蕎目o

磊。

島2審

　ヒ

⇔。

担霑。。

恩磊目喟驚一。

肩邸

ぢp磊信oo

ヨ蜩≧

唱oヨo

拒8ごo

呂毬廼8

　　　OO

口鱈

岩馬

Σ9ゴ

長国

昌り

巳一

ワう

N 工工一Eleotronio 　Library 　Servioe

（ひ

韓の

嵎ロ

ヨ←

　（

毬嘗竃。

五日説

壱。

巴゜り

昼旨8養≧。。

。の

鼠呂霧廼oり

省ε紹50り

一 95一

ぞ賃

冨廿

8口o唱

o・昏

ぢ口

　　頃。

書−

浸8口

曁b。

岩脅

（寸OON

）［ε。

導り

ごε。

聲口の

　　　　”

墓島ヌ嘗。

穿国

（88）

【εω

並ち自コ

逼。

魯［。

〉。

〔【

ぢ戸

磊口oo

壱嘱津

妊9口

哲こ

3。・

§8・・

。」

号」

損〇一口oり

旨＝←の

固箇

国〉

曽目』国

毫箭目

』尉

〉個［

晝−

琶畠

涓唱

彑箪華

身く

の←

。・

8。。

暑a・。

它H。

主。

で−

。。

口。

冨智窯一邸

と。

蠧口

自お占。【驀←

墾。

ヨ譲。

日甥口

壱自

否。

宅8注名魯

磊口。惘罵一。

名−

N卜．011

　　　”の【。

＝口

8エα

葦津

。 ◎，011

δヨρ

聲。」

亙［田邸

　Oひ．Oo

』邸

捗ヨρ

疊。』

g孚』

。一』

寸ひ．OI

ト．O

”の

gbo口

署巴。。

毟。

薫≧

トひ．011k

　　　凵倉［鳶壼二の

逼幽。

卜−

ひ．OI

ひoQ，011

論日罵h

爲巳口

壱喟韓

旨o冨』o

トひ．OI

◎QQQ．

”着。

8翫邸

琶〇

弓・5

芝−

超。

曙［冩

≧−

　　　（

着。

若ヨ゜。

　（°。

2窪・5°。

　　　　”・。

嵐罵

濤曽

　　　　（

8N）喟×

毯・。

9＜←

＝邸

＝O¢O

‘娼

≧厂−

鴇．O叫

鼓。゚

ごε。

自≧

冨口o

目。，の

。。

蟄霧

壱刷≧

　QoQQ

蒼ヨ

鵞愚』

哲琶−場。

雪亡6【【o

芝略喟×

　　”。。

お駑」

罵唱

≧吩

〇一

屋で

日日

boq瑁

莓b。

毛雨ロ

屋＝

で。

§石。

田2畠

　　　　°。°。．O、

”・。順。。

嘗目西8ε

魯【嘱畚N嘱

§口。

畧£。

〉哨罵。Dω

田8＞

還。・

ゼ唱

9。。

冥国。

　り島揖日属o

紹∩・

揖。

bDお〉員

扁邸

Oロnの

。。

考・。

魯溶屋吾属・

　　　　の

お冨」

認Oh⇔

・・

巻・・

倉【哨ρ邸

函・

　　　　　の

着題の

聟5の

こ。

倉溶眉喜出・

隶岩。

弓。・。

一轟

罠国

智【順鵞N

驀お

三山

国顛

宀一田日』国

毫⇔

留』

田9冨お〉

慧」

8琶日09

自O＝

眉唱O自

彑箪霎

詈。

含く

暑。

勢で。h。

器℃

おぢロ

日8ヨ。

当員。

8・。

コ〉。

魯ヨ孚ちご雨

貫日口

『（

需［驀←

一 96一

N 工工一Eleotronio 　Library 　

4.1.1. Computer-based and Internet-based testing

Internet-based tests such as the VET are highly efficient as they can be taken

anywhere and anytime as long as examinees have access to the Internet oratelephone.

However, they have at least two potential problems: the risk of leaking of test items and

difficulty ofidentifying the examinee. These issues have made it difficult fbr high-stake

tests to be administered via the Internet, and tests such as the TOEFL iBT corrrpromise

by delivering the test through the Internet to desigriated sites.

4.1.2. Computer-adaptive, semi-adaptive, and non-adaptive testing

In computer-adaptive tests, the computer gives test tasks that are appropriate to theability of the examinee. This can save administration time compared with non-adaptive

tests that give the same tasks to all the examinees. Although widely used for testing

receptive skills, computer-adaptive tests are not a very popular choice for testing

speaking skills because of the high cost of developing speaking test tasks. The ideal

item pool for a computer adaptive test would have a 1arge number of highly

discriminating items at each ability level. A prerequisite of such an item pool is field

testing, which is labor-intensive and at a high risk fbr leakage oftest items.

In addition, to rnake a speaking test completely corrrputer-adaptive, the computer

rmist be able to rate an examinee's speech accurately. So fair, it is still difficult fbr

computers to achieve high precision in recognizing the examinee's speech on

open-ended tasks. Tests such as the BEST PIus, which require test administrators to

input a score fora task in order fbr the computer to determine the next task, are a way

around this problem. In this case, test developers must sti11 weigh the costs of

developing an item pool against the merits ofsaving administration time.

4.1.3. Automatie rating and human rating

Besides delivering test tasks, a computer can increase the efficiency of testing

speaking by replacing human raters. However, test developers can face two difficulties

in applying automatic rating technology to testing speaking. First, automatic rating, as

in the VET, can currently only be used for rating tasks that have highly predictableresponses, such as read-aloud or repetition tasks. Tasks that require extended responses

from examinees are still difficult to rate satisfactorily by machine. A second

impediment to the widespread use of automatic rating is the higli cost of the

development and application of speech recognition technology to testing.

4.1.4. The test construct ofspeaking: to include listening ability or net

In all three tests reviewed, listening comprehension ability plays an important role

in the corrrpletion oftest tasks. It is true that in order to interact successfu11y with others

in English, we rmist first understand what others are saying, This trend may also be

attributable to the increasing use of multimedia in task prompts. However, we must be

carefu1 with score interpretation. Also, test developers need to decide on the difficulty

ofthe listening prompt, which may depend on to what extent a successfu1 understanding

of listening prompts is intended to affect the completion of tasks.

4.2. Areas for future research

As reviewed in the previous section, a considerable amount of care and study has

been devoted to specifying the content and scoring procedures of the tests and

determining their generalizability across raters and their relationships with external

scores. However, more research is needed on the substantive, generalizability, and

consequential aspects of the validity of computer-delivered speaking tests in order to

improve the validation process and provide more usefu1 information to stakeholders.

4.2.1. Substantiye aspect

Using computers in testing speaking is likely to cause serious

constmct-underrgpresentativeness and construct-irrelevant variance from the

perspective of examinees' use of strategy as well as potential effects of

computer-delivery method and examinee characteristics on test performance.

Ebeaminees ' use oftes"taking strategy

Although theoretically, the use oftest-taking strategy is part oflariguage ability, we

lack infbrmation regarding the context of corrrputer-delivered speaking tests. Studies on

this topic are likely to be illuminating with respect to the construct validity of

computer-delivered speaking test scores. Empirical evidence can help test developers

decide whether to include the use oftest-taking strategy as part of the test censtmct, and

thus avoid the problem of constmct-underrepresentativeness. Analysis of selfreporting

by think-aloud protocol or questionnaire can reveal the cogriitive and rnetacognitive

processes involved when examinees respond to various speaking tasks.

Iiffoct ofcomputer-`letive, p method

Different from a traditional speaking test where an interlocutor is present to

administer the test or act as the conversation partner, the computer-delivered one asks

the examinee to speak to the computer. Exploration ofthe effbcts ofdelivery method on

examinees' test perfbrmance is thus important in order to eliminate constmct-irrelevant

variance. Also, how examinees actually use the planning time in a computer-delivered

speaking test compared with a face-to-face test is another issue for fUrther investigation.

ELtfect ofexaminee charactei:istics on testpectbrmance

Research on the effect of examinee characteristics on test perfbrmance could

further our understanding of the constmcts that tests assess. Future studies should

explore the possible interaction of affective factors, such as attitudes toward the test,

cornputer anxiety, and communication anxiety, with perfbrmance on a

computer-delivered speaking test. Also, using computers fbr assessment may cause

construct-irrelevant variance due to a lack of computer fatniliarity on the part of

examinees. Considering the low exposure to computers fbr high school and university

students,corrrputer familiarity is a factor that needs investigation.

4.2.2. Generalizability acress tasks

So far, rescarch conducted on performance assessment has shown that a rarge

examinee-by-task variance exists that would limit the generalizablity of scores from

these tests (e.g., Brennan, Gao, & Coloton, 1995). Since administration time fbr

computer-delivered speaking tests usually lasts 15 to 20 minutes, increasing the number

of tasks is often impractical. More studies are needed to determine the boundaries of

interpretations and ways ofreducing the interaction ofexaminees with tasks.

4.2.3. The consequential aspect ofvalidity

Studies on the impact of administering computer-delivered speaking tests on

teaching and learning English are still few in number. Errrpirical evidence about various

learning contexts is needed. Further research should explore how learners prepare for

speaking tests delivered by computer and how this differs from preparation forface-to-face tests in case there are any negative unintended consequences.

5. Conclusion

New developments in the application of cornputer technology have made the

delivery of speaking tests more efficient in many ways. From summarizing the validity

evidence available for the three tests, we hope that test deyelopers and test users, who

are ultimately responsible fbr fair and valid test use, will heed our calls for fumher

validity work, Continuing effbrts in this area are essential given the expected increasing

use ofcorrrputers in testing speaking.

Acknowledgement

I am deeply gratefu1 to Prof, Negishi Masashi fbr encouraging me to write this article, I

would also 1ike to thank two anonymous reviewers for their helpfur comrnents, and Dr.

Koizumi Rie for her constmctive advice on choosing the validity framework,

Reference

American Educational Research Association, American Psychological Association, &

National Council on Measurement in Education, (1999). Standt2rzls for educational and psychological testing. Washington, DC: American Educational Research Association.Brennan, R. L., Gao, X,, & Colton, D. A. (1995), Generalizability analyses ef work

keys listening and writing tests. Educational and Psychological measurement,

55(2), 157-176.

Brown, A., Iwashita, N., & McNamara, T. (2005), An examination ofrater orientations

and examinee penyformance on English:for=t4cademic-Purpose speaking tasks

(TOEFL Monograph No. 29). Princeton, NJ: Educational Testing Service.Butler, F. A,, Eignor, D., Jones, S., McNamara, T., & Suomi, B, K. (2000), 1()EllL

2000 speaking framework: A working paper (TOEFL Monograph No. 20).

Princeton, NJ: Educational Testing Service.

The JapanLanguageTestingAssociation

Center fbr Applied Linguistics. (2005). BESTPItLs technical report, Washington, DC:

Center fbr Applied Linguistics,Cumming, A., Grant, L., Mulcahy-Ernt, P,, & Powers, D. E. (2005). A

teacher-verijication study of speaking and writing protompe tasks for a new

70EFL 7lest (TOEFL Monograph No. MS-26), Princeton, NJ: Educational Testing

Service,Lee, Y.-W. (2005). Dependability ofscores for a new ESL speakeng test: Evaluating

protompe tasks (TOEFL Monograph No. MS-28). Princeton, NJ: Educational

Testing Service,Messick, S. (1995), Validity ofpsychological assessment: Validation of inferences from

persons' responses and perfbrmances as scientific inquiry into score meaning.

American Rsychologist, 50, 741-749,Miller, D. M., & Linn, L. R. (2000), Validation of perfbrrnance-based assessments.

Ampiied Rsychological Measuremen4 24(4), 367-378.

Pearson. (2008). Vlarsant English 71est: 71est design and valitlation research. Pearson

Education. Retrieved Feb 24, 2011, firom http:

!lwww.Versant.jp!pdflversantfbrenglish-technicalmanual-v6.pdfRosenfeld, M., Leung, S., & Oltman, P. K, (2001), 711ie reading writing speaking and

listening tasks importantfor academic success at the undergraduate and graduate

levels (TOEFL Monograph No. MS-21). Princeton, NJ: Educational Testing

Service.Sawaki, Y., Stricker, L. J., & Orarije, A. (2008). thctor strueture qf the roEl7L

intemet-Based 71est (iB7): Ebeploration in a.field trial sample. (TOEFL iBT-04). Princeton, NJ: Educational Testing Service,

Stricker, L. J., & Attali, Y. (201O). 7;est takeny '

attitudes about the roEFL iBT (TOEFL iBT -13). Princeton, NJ: Educational Testing Service.

Stricker, L. J., & Rock, D. (2008). jFbctor structure of the roEFL internet-based test

across subgroups (TOEFL iBT-07). Princeton, NJ: Educational Testing Service.

Swain M., Huang L.-S., Barkaoui K., Brooks L., & Lapkin, S. (2009). 71he Elic)eaking

Sbction of the roEFL iBT as71B7): 7lest-takei:s' reported strategic behavior:y

(TOEFL iBT-1O). Princeton, NJ: Educational Testing Service.Wall, D., & Horak, T. (2008). 17lre impact ofchangas in the roEFL examination on

teaching and learning in Clentral and EZistern Europe: Phase 2, Coping with

change (TOEFL iBT-05). Princeton, NJ: Educational Testing Service.Wang, L., Eignor, D., & Enright, M. K, (2008). A final analysis. In C.A. Chapelle, M. K.

Emight & J. M. Jamieson (Eds.), Building a validity angumentfor the 7lest of English as a jFbreign Language (pp. 259-3 18). New York: Routledge.

Xi, X. (2008), investigating the criterion-related validity qfthe roEIFZ speaking scores for I7;4 screening and setting standards for I714s, (TOEFL iBT-03). Princeton, NJ: Educational Testing Service.Xi, X., & Mollaun, P. (2006). investigating the utiiity ofanalytic scoringfor the rolff TL

Academic SPeaking 7lest aASI;) (TOEFL iBT-Ol). Princeton, NJ: Educational

Testing Service.

Xi, X., & Mollaun, P. (2009). Hbw do ratens.from indiape,:form in scoring the roEFL

iBT speaking section and what ldnd of training heips2 (TOEFL iBT-11).

Princeton, NJ: Educational Testing Service.Zhang, Y. (2008). Repeater anaLyyesfor roEl7L iBT (ETS Research Memorandum No.

RM. 08-05). Princeton, NJ: Educational Testing Service,

gt2] (TOEFL (VET), (BEST

Documents

Transcript of gt2] (TOEFL (VET), (BEST

2003 Porsche 911 Carrera GT2 - AutoRevo

AfterSales Training - 911 Turbo GT2 GT3 Engine Repair

Free TOEFL download: Free TOEFL ...

GT2 Criteria for Dams and Ancillary Works Designs

50156459-b-Fisherbrand GT2 GT2R Centrifuge-en

GT2 Air Push Type - HASMAK · High-Accuracy Digital Contact Sensor GT2 Series NEW Revolutionary technology enables high-accuracy measurement GT2 Air Push Type Air push models for

Routy GT2 BSX 290 Series - OpenBuilds Mexico...persona que requiere un equipo asequible. El Routy GT2 BSX 290 es muy parecido a su hermano mayor del Routy GT2 BSX 310. Recomendamos

TECHART Price Guide for 911 Turbo & GT2

ZEISS, SOLA & AO Lens Availability · ZEISS Individual® 2 Progressive Lens Series 1.1 ZEISS Individual Progressive 1.2 ZEISS GT2 3DV 1.3 ZEISS GT2® 3D 1.4 ZEISS GT2 3D Short 1.5

911 GT2 - brmmodelcars.com€¦ · RS0050 Porsche 911 GT2 #12 revoslot@revoslotcars.com RS0051 Porsche 911 GT2 #91 PORSCHE 911 GT2 SPARE PARTS RS101A Porsche 911GT2 - Full white body

GT2 Filter/Coalescer Panel - Seaworth Sheet GT2.pdfConstruction: The Seaworth GT2 Filter/Coalescer Panel is a washable product consist-ing of a durable synthetic media sandwiched between

CD-GT2 CD-BT2 CD-VT2 - Tascam · CD-GT2 · CD-BT2 · CD-VT2 CD Instrument and Vocal Trainers CD-GT2 · CD-BT2 · CD-VT2 CD Instrument and Vocal Trainers T he CD trainer line – a

Inct Gt1 Gt2 Gt3

AiM Infotech Porsche 911 (997 MK1) Turbo, GT2, GT2 RS, 911 (997 ...

TOEFL Primary TOEFL Junior Institutional Testing ...

TOEFL Vocabulary Tests (Word by Meaning)blog.english-team.com/.../60_toefl_vocabulary_tests... · toefl tests 60 TOEFL Vocabulary Tests 600 Words by Meaning ... TOEFL Vocabulary

CD-GT2 CD-BT2 CD-VT2 - media.djmania.net · CD-GT2 · CD-BT2 · CD-VT2 CD Instrument and Vocal Trainers CD-GT2 · CD-BT2 · CD-VT2 CD Instrument and Vocal Trainers T he CD trainer

AiM Infotech Porsche 911 (997 MK1) Turbo, GT2, … › download › ecu › stock › porsche › 3...AiM Infotech Porsche 911 (997 MK1) Turbo, GT2, GT2 RS, 911 (997 MK2) GT3, GT3RS

GT2 by Zeiss Fact Sheets

PORSCHE 911 GT2 - BRM Model Cars