Test Review Paper Report 1
Running head: TEST REVIEW PAPER REPORT: TOEFL INTERNET-BASED TEST-WRITING
Test Review Paper Report
TOEFL Internet-based Test – Writing
Qian Liu
University of Southern California
EDUC 527
Assessment in the Language Classroom
Professor Zsuzsa Londe
December 3, 2008
Introduction
The Test of English as a Foreign Language (TOEFL) Internet-based Test (iBT) is a
standardized test of English proficiency whose purpose is to measure how well test takers
speak, listen, read and write in English and use these skills together in an academic setting
(ETS, 2008). The test is designed for students who plan to study abroad. More than 6000
colleges, universities and licensing agencies in 136 countries accept TOEFL scores (ETS,
2007a).
TOEFL iBT has been gradually introduced throughout the world since 2005. Today,
both Paper-based Test (PBT) and Internet-based Test (iBT) are offered by Educational Testing
Service (ETS), but iBT has taken over the leading role and PBT only serves as a supplement.
Unlike PBT, TOEFL iBT tests all four skills of English including speaking “integratively and
independently” (Zareva, 2005, p.48). That is to say, a number of tasks require the test taker to
use more than one language skill (ETS, 2007a). The test is delivered on computer via the
Internet and scores are reported online. It takes about four hours and all four sections are
conveniently taken on the same day (ETS, 2007a).
This paper aims to evaluate the usefulness of the writing section of TOEFL Internet-
based Test. To this end, it first points out what is new in TOEFL iBT writing section by
comparing its format and content with the Paper-based Test and Computer-based Test. Then it
analyzes the usefulness of TOEFL iBT, especially the writing section in terms of six qualities
of test usefulness, which are reliability, validity, authenticity, interactiveness, impact and
practicality. Finally, it concludes with the main strengths and weaknesses of TOEFL Internet-
based Test.
What is New in TOEFL iBT Writing Section?
The writing section is approximately 50 minutes long and includes two tasks. The first is a 20-minute integrated task, which requires test takers to write based on what is read and heard. The second is a 30-minute independent task that asks test takers to support an opinion on a topic. The independent task is very similar to the writing task in the TOEFL Computer-based Test (CBT) and the Test of Written English (TWE). Test takers wear headphones and type their responses. The typed responses are sent to ETS's Online Scoring Network (OSN). Scoring is done by two to four certified ETS raters on a scale of 0 to 5 according to the rubrics (see Appendix A). The new rubrics are more specific, more precise, and more student-friendly than the writing rubric for the TOEFL CBT (Wall & Horak, 2008). Test takers can view their TOEFL iBT scores online 15 business days after they take the test. They will also receive a copy of their score report by mail (ETS, 2007a).
Compared to the writing tasks in the TOEFL PBT and CBT, the new writing tasks provide
students with reading and listening texts as content input. On one hand, the purpose of using
these prompts is to support test takers in their response construction by providing them “not
only with some information for their writing, but also with some vocabulary to lean upon and
some genre conventions to model their response upon” (Zareva, 2005, p. 53). On the other
hand, it reflects the communicative demands in academic settings and highlights the
importance of students’ ability to synthesize ideas from multiple sources as an input to
writing (Leki & Carson, 1994, as cited in Zareva, 2005).
A major improvement in the test format is that the TOEFL iBT allows students to take
notes while completing the tasks. This is particularly beneficial in integrated tasks. Note-taking
helps test takers recall relevant information and better understand the listening material. It is reported that when allowed to take notes, students feel more comfortable under test conditions and perform better (Zareva, 2005).
Test Usefulness of TOEFL iBT Writing Section
In the process of designing and developing a test, the most important consideration is
its usefulness. According to Bachman and Palmer (1996), test usefulness can be described as
a function of several different qualities. They are reliability, construct validity, authenticity,
interactiveness, impact and practicality. It is impossible to design a test that is perfect in all of these qualities. Therefore, maximizing test usefulness means raising each quality as much as possible while maintaining an appropriate balance among the different qualities.
Reliability
A reliable test should be consistent and dependable (Brown, 2004). As one of the most
important qualities of test usefulness, test reliability is affected by four factors, which are
student, scoring, test administration and the test itself (Mousavi, 2002, as cited in Brown,
2004). Since student-related reliability is hard to control (unreliable factors can easily arise from temporary conditions, such as illness or a “bad day”), I will analyze the reliability of TOEFL iBT writing only in terms of rater reliability, test administration reliability and test reliability.
Rater Reliability
All writing responses are sent to ETS’s Online Scoring Network (OSN) and scored by
two to four raters (ETS, 2007a). In order to ensure maximum objectivity and reliability, all
raters are given well-defined and articulated scoring rubrics and are trained by ETS. They
cannot begin scoring work until they are certified. In addition, they must pass a calibration
test before each scoring session (ETS, 2008b). All ETS raters undergo constant monitoring when scoring (ETS, 2008a). Educational Testing Service (2008b) also reports that raters’ performance is evaluated statistically. When there is an obvious discrepancy between scores on the same piece of writing, a scoring leader is brought in to resolve the problem. Therefore, rater reliability in the TOEFL iBT writing section is high.
My only concern is that raters are human beings who have personal preferences shaped by their educational background and teaching experience. A possible solution is to use e-rater instead of human raters. Attali’s (2007) study shows that e-rater scores have significantly higher reliability regarding grammar, vocabulary, and organization.
However, e-rater cannot measure the length and content of a piece of writing. Experts who are
against e-rater even point out that writing is a kind of art and should not be measured by
machines (Attali, 2007).
Test Administration Reliability
Test administration reliability relies on keeping the same conditions in which the test is
administered, such as the noise level, temperature and the amount of light (Brown, 2004).
ETS (2008b) claims that TOEFL iBT follows standardized procedures for test administration.
By certifying all test centers’ facilities and equipment, training test center staff in how to handle a test administration session, providing online practice tests, and using technology to control test delivery, ETS tries to ensure that all test takers take the test under similar conditions. Even so, unreliable factors exist when the test takes place in different countries. For example, in test rooms of similar size, there may be only six test takers in Taipei, Taiwan, yet around 24 in Los Angeles, sitting shoulder to shoulder.
Test Reliability
As Table 1 and Table 2 show, the reliability of the Writing score is somewhat lower compared to the reliability of the Reading, Listening, Speaking, and Total scores.
Table 1
Reliabilities and Standard Errors of Measurement (based on the first year’s operational data
from September 2005 to December 2006)
Score      Scale    Reliability Estimate    Standard Error of Measurement (SEM)
Reading    0-30     0.86                    2.78
Listening  0-30     0.87                    2.40
Speaking   0-30     0.90                    1.70
Writing    0-30     0.78                    2.65
Total      0-120    0.95                    4.88
(ETS, 2007c, p. 1)

Table 2
Reliabilities and Standard Errors of Measurement (based on the first year’s operational data
from 2007)
Score      Scale    Reliability Estimate    Standard Error of Measurement (SEM)
Reading    0-30     0.85                    3.35
Listening  0-30     0.85                    3.20
Speaking   0-30     0.88                    1.62
Writing    0-30     0.74                    2.76
Total      0-120    0.94                    5.64
(ETS, 2008b, p. 3)

The lower writing reliability might result from the nature of the test itself. That is, the writing section is composed of
only two tasks (Breland, Bridgeman & Fowles, 1999, as cited in ETS, 2008b). In this regard,
the reliability of the TOEFL iBT writing section is acceptable. Zhang (2008) conducted a study of repeaters’ performance on the TOEFL iBT. The findings indicate only small mean score changes, which suggests that the test’s reliability is relatively high.
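The reliability estimates and SEMs in Tables 1 and 2 are connected by a standard classical-test-theory relationship, SEM = SD × √(1 − reliability). As a minimal sketch (Python, not part of the original report; the "implied SD" is a back-calculation, not a figure ETS publishes), the relationship can be checked against the Writing row of Table 1:

```python
import math

def sem(sd: float, reliability: float) -> float:
    """Classical test theory: SEM = SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1.0 - reliability)

# Working backwards from Table 1's Writing row (reliability 0.78, SEM 2.65),
# the implied standard deviation of writing scores is SEM / sqrt(1 - r):
implied_sd = 2.65 / math.sqrt(1.0 - 0.78)
print(round(implied_sd, 2))             # ≈ 5.65
print(round(sem(implied_sd, 0.78), 2))  # recovers 2.65
```

The formula makes visible why the Writing section's lower reliability (0.74-0.78) translates into a comparatively large SEM despite its narrow 0-30 scale.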
Validity
The primary purpose of a language test is to measure test takers’ language proficiency by
interpreting test scores. In this regard, validity is essential to the usefulness of any language
test (Bachman & Palmer, 1996). According to Brown (2004), when we consider the validity
of a test, we need to consider its content validity, criterion-related validity, construct validity,
consequential validity and face validity. Therefore, I will analyze the test validity of TOEFL
iBT in terms of those five aspects respectively.
Content Validity
The content of TOEFL iBT is relevant to and representative of the written tasks that
students will encounter in academic settings (ETS, 2007d). Although it is impossible for test tasks to be exactly the same as target language use (TLU) tasks, the two tasks in the writing section of TOEFL iBT are simulations of academic tasks. Based on their study, Cumming, Grant, Mulcahy-Ernt, and
Powers (2005) provide evidence about the content relevance of integrated test tasks. They
interviewed a sample of English as a Second Language (ESL) teachers about the new type of
test tasks. The teachers gave positive feedback on the tasks and viewed them as realistic
and appropriate simulations of academic tasks. They also felt the tasks elicited writing
responses from their students that represented the way the students usually performed in their
English classes (Cumming, et al., 2005).
Criterion-related Validity
Criterion-related validity refers to the extent to which the criterion of the test has actually been reached (Brown, 2004). Brown (2004, p. 24) claims that “criterion-related evidence is best demonstrated through a comparison of results of an assessment with results of some other measure of the same criterion.” No study has been conducted to measure the relationship between TOEFL iBT writing scores in particular and other relevant criteria of academic language proficiency. Only the 2003-2004 field study of the TOEFL iBT indicates that the observed correlation between the total test score and students’ self-assessments is .52 (ETS, 2004).
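The .52 figure above is a correlation coefficient. For readers unfamiliar with how such a coefficient is computed, here is a minimal Pearson product-moment sketch (Python; the paired data below are invented for illustration and are not ETS's field-study data):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between paired observations,
    e.g. total test scores and self-assessment ratings."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical paired data: TOEFL iBT total scores and 1-5 self-ratings.
scores = [68, 75, 90, 102, 110, 84]
ratings = [3, 2, 4, 4, 5, 4]
print(round(pearson_r(scores, ratings), 2))  # a moderate positive correlation
```

A value of .52 thus indicates a moderate positive relationship: higher total scores tend to accompany higher self-assessments, but far from perfectly.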
Construct Validity
Construct validity pertains to the meaningfulness and appropriateness of the test score
interpretations. Both construct definition and the characteristics of the test task should be
considered when we are evaluating construct validity of a score interpretation (Bachman &
Palmer, 1996). In the writing section, test takers’ abilities to choose vocabulary correctly, build grammatical structures, spell and punctuate accurately, and express information in an organized manner are tested. Those abilities are believed to be the main factors in successful writing in college and university settings. The scoring
procedures of TOEFL iBT writing also reflect the construct definition. ETS (2008b) claims
that test takers are not expected to produce a perfect essay. They can earn a high score with
response that contains some errors.
However, the test rubrics may not match the construct definition. Lee and Kantor
(2005) point out that each of the writing tasks measures an aspect of writing, thus separate
scores should be reported for each of these aspects. Furthermore, there are several possible
sources of bias which lie in the task characteristics. Firstly, the writing section contains only
two writing tasks. Thus, it is extremely important that writing scores generalize across tasks and task types to ensure test validity (Lee & Kantor, 2005). Secondly, the tasks might be appropriate for undergraduate students; for graduate students, however, even though they might not have the language ability to deal with certain general topics, they might still have the language ability needed to succeed in their particular fields. Cumming et al. (2005) suggest that test takers should have more choices about what to write based on their individual preferences.
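The concern that only two tasks limit score reliability can be made concrete with the Spearman-Brown prophecy formula, a standard psychometric result (not one the report itself applies): lengthening a test by a factor k predicts a new reliability of kr / (1 + (k − 1)r). A hedged sketch using the 2007 writing reliability estimate of 0.74:

```python
def spearman_brown(r: float, k: float) -> float:
    """Predicted reliability when a test is lengthened by factor k
    (Spearman-Brown prophecy formula)."""
    return k * r / (1 + (k - 1) * r)

# If two writing tasks yield a reliability of 0.74, doubling to four
# comparable tasks is predicted to raise it noticeably:
print(round(spearman_brown(0.74, 2), 2))  # 0.85
```

The prediction assumes the added tasks are parallel to the existing ones, so it is an upper-bound estimate; it nonetheless shows why the two-task design caps the writing section's reliability.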
Consequential Validity
All the consequences of using an assessment determine the test’s consequential validity
(Brown, 2004). By and large, TOEFL iBT is used appropriately and has positive
consequences. This test is used to make decisions about students’ readiness to study in
America. From this perspective, positive consequences would involve selecting students who
have the English language proficiency necessary to succeed at the college, and denying
admission to those who do not. Another use of the TOEFL iBT test is to support English
teaching and learning. Accordingly, the test has the positive washback of facilitating students’
English learning process and promoting communicative language teaching (ETS, 2007d).
Face Validity
Whether a test looks right and fair to students is also significant; this is known as face validity (Brown, 2004). High face validity involves a well-constructed format, a doable time
allotment, and clear test directions. To this end, the developers of the TOEFL iBT test carried out numerous exploratory studies over four years to find the best way to design and administer the test (Chapelle, Enright, & Jamieson, 2008, as cited in ETS, 2007d). For instance, test information is posted on the ETS website; preparation materials can be downloaded;
sample responses are provided; and note-taking is allowed. Allowing test takers to take notes
improves the face validity of the test, since academic life strongly encourages students to take
notes and use them in their academic preparation. Besides, high content validity of TOEFL
iBT contributes to its face validity as well.
Authenticity
Bachman and Palmer (1996) define authenticity as “the degree of correspondence of
the characteristics of a given language test task to the features of a target language use (TLU)
task” (p. 23). Authenticity is a critical quality of language tests because it bears on the generalizability of score interpretations and is thus closely related to construct validity. Also,
authenticity affects test takers’ perception of the test. Test takers tend to provide positive
affective responses and perform at their best when the test task is likely to be enacted in the
real world (Bachman & Palmer, 1996).
First of all, the characteristics of both the integrated and the independent task correspond to those of TLU tasks. Writing in English is required in all academic situations in America. Integrated writing refers to a written response on an exam or a reflection on what students have learned in class. This requires students to combine what they have heard from class lectures
with what they have read in textbooks or other materials. For instance, a student might be
asked to compare and contrast the ideas presented by the professor in class with those
expressed by an author in the reading material. In this case, he/she has to successfully draw
information from each source to analyze it (ETS, 2007a). Often students need to write essays
to express and support their opinions based on their own knowledge and experience. This
kind of writing can be referred to as independent writing. In order to complete this task, test
takers must be able to identify main ideas and develop the essay by using examples, reasons and details, which requires critical thinking (ETS, 2007a).
Secondly, the process of completing the test tasks is almost the same as the procedure for completing tasks in the TLU domain. Students first need to take notes on what they have heard and read. Then they summarize and paraphrase the useful information. Finally, they write, using all these resources to support and express their own opinions.
Thirdly, the language used in the prompts is close to the TLU domain. The reading passages come from real textbooks and course materials. Topics in the independent writing task authentically relate to test takers’ personal experience and daily life.
Fourthly, the rubrics (see Appendix A) of the test tasks are close to the real academic
writing standards. In an academic setting, a student’s writing paper is evaluated on the basis
of its expression and organization. Students are required to present an idea in a clear and
well-organized manner. Their abilities of organizing information, developing accurate
content, and using grammar and vocabulary appropriately and precisely are valued both in the
test tasks and in TLU domain (ETS, 2007a).
Note-taking is one of the most outstanding features reflecting the authenticity of the TOEFL iBT writing section. According to Zareva (2005), allowing note-taking makes the test
tasks closer to authentic academic tasks college students perform in their daily life, since it is
the primary means of recording information when students are listening to a lecture,
participating in a group discussion and reading a textbook. Furthermore, Dunkel (1988)
investigates lecture note-taking among undergraduate L1 and L2 speakers of English. He
reports that good note-takers are good summarizers, who are equipped with academic skills
of reformulating and abstracting the gist of information (as cited in Zareva, 2005). Therefore,
allowing note-taking in writing section is authentic and corresponds to the TLU task.
Last but not least, all responses must be typed into the computer rather than written with a pen on paper. As is well known, most universities all over the world now require official academic writing to be typed. Thus, the requirement of typing responses into a computer in the writing section exactly matches the trend in today’s academic settings.
Interactiveness
Interactiveness is defined as the extent and type of involvement of the test taker’s
individual characteristics such as personal characteristic, language ability, topical knowledge,
and affective schemata in accomplishing a test task (Bachman & Palmer, 1996). The more
interactive a test is, the better test takers can perform on it.
Test takers of TOEFL iBT are non-native speakers who want to study in
undergraduate, graduate and postgraduate programs in English-medium colleges and
universities. On one hand, integrated tasks, which include various subjects and topics, are designed for academic purposes and help test takers perform at their best. On the
other hand, independent tasks which involve test takers’ personal experiences should consider
personal characteristics carefully. A study conducted by Breland, Lee, Najarian and Muraki
(2004) shows that topics like art, music, housing, roommates, friends and children tend to
generate large gender differences.
The test involves test takers’ language ability as much as possible. In the writing
section of TOEFL iBT, the integrated task involves a wide range of areas of language knowledge by incorporating reading and listening into writing. It also requires comprehension-related language abilities, such as paraphrasing, summarizing and presenting ideas. The independent
task allows examinees to demonstrate their ability to form a coherent written argument based
on their personal knowledge and experience.
Topical knowledge is not required in the integrated task; all background knowledge is provided by the test prompts. The situation is different in the independent task, where test takers write responses based on their personal experience and topical knowledge. Therefore, examinees who have little topical knowledge of a specific topic might receive poor scores. In order to minimize the influence of topical knowledge, the test could offer more open-ended choices based on individuals’ preferences, interests and inclinations (Cumming, et al., 2005).
Affective schemata play a significant role when test takers are completing the test tasks. The writing tasks in TOEFL iBT simulate the real classroom environment and use authentic reading materials and lectures. Additionally, those tasks correspond to the TLU settings in which test takers are asked to read materials, listen to lectures, take notes, and write papers. Thus, test takers will have positive affective responses to the tasks in general. Nevertheless, test takers’ affective schemata vary because of individual differences. For example, test takers may have positive affective responses when they have relevant topical knowledge of the test task. In contrast, they may respond negatively when they have no idea about the topic. Also, test takers who have high language proficiency tend to be positive towards language tests, while those who have low language proficiency may feel threatened by the test.
Impact
TOEFL has gained extraordinary power worldwide because it is used to accept some
international students into American universities while rejecting others. It is believed that
tests in general and high-stakes tests like TOEFL in particular, have a tremendous impact on
students and teachers, even on society and education systems at large.
Impact on Individuals
There is no doubt that TOEFL has its impact upon the individuals within certain
educational systems. “Stake-holder” such as test takers and teachers are individuals who are
most directly affected by test use (Bachman & Palmer, 1996).
Impact on test takers. Bachman and Palmer (1996) state that test takers can be affected by three aspects of the testing procedure: their experience of preparing for and taking the test, the feedback they receive after the test, and the decisions that may be made about them according to their test performance.
In order to take the TOEFL Internet-based Test, the vast majority of students spend several weeks or even several months preparing. In some countries, such as China and Korea, “cram” courses are offered to provide students with specific test practice. Techniques needed in the test are taught as core subjects in the syllabus. The fact that some learners get a high score on the TOEFL but cannot express themselves properly might be due to this kind of negative washback, known as “learning for the test.” Additionally, TOEFL scores do not measure
students’ actual language performance accurately. A negative impact occurs when students define their own English proficiency strictly according to the TOEFL. It is quite possible for students to communicate successfully with native speakers of English, orally and in writing, and yet not succeed on the TOEFL iBT. Almost every formal test shares these weaknesses, since its result means a great deal to test takers.
Every coin has two sides. TOEFL iBT creates opportunities for test takers to get
familiar with the TLU domain. Since it is designed to evaluate test takers’ English language ability in academic settings in America, overseas students who have never studied in American colleges and universities gain topical knowledge from the experience of taking the test itself.
What is more, TOEFL iBT has positive washback because of its strong score-reporting system. Students receive detailed performance feedback (see Appendix B) from ETS rather than a simple numerical score. Learning strategies are provided according to students’ proficiency
level. This complete, relevant, and meaningful feedback can be very beneficial for test takers
to achieve progress in English learning in the future.
Impact on teachers. When it comes to TOEFL’s impact on teachers, it is positive by
and large. Wall and Horak (2006) note that there is a hope that TOEFL will lead to a more
communicative approach to teaching and that preparation classes would pay more attention to
academic tasks and language (as cited in Wall & Horak, 2008). In the past, most teachers conducted writing classes that focused on language structure and discrete grammatical items, with no integrated skills work. In order to prepare students for
integrated tasks in the writing section, teachers have begun to teach students how to summarize properly, how to paraphrase, and how to select important information from a flood of input (Wall
& Horak, 2008). Those abilities will be extremely useful for students who are going to study
in the U.S. However, as Wall and Alderson (1993) point out, we cannot simply assume that a
test will automatically affect instructional practice (as cited in Bachman & Palmer, 1996).
Therefore, whether TOEFL iBT will have positive washback on ESL and EFL pedagogy depends heavily on teachers’ English proficiency, their English teaching ability, and their teaching philosophy.
Impact on Society and Education Systems
As Bachman (1990) argues, a test is not a value-free production; it always serves to
meet the needs of an educational system or of society at large (as cited in Bachman & Palmer,
1996). In general, TOEFL iBT has a positive impact on society and education systems, since
it draws attention to English-language skills in academic settings in a more communicative
and authentic way. What is worth mentioning is that TOEFL scores have been misused
especially in non-English speaking countries. For example, many companies and government departments in China simply evaluate job applicants’ English skills through their TOEFL iBT scores, even though those scores have nothing to do with workplace English skills.
Practicality
Practicality is an essential quality of a test: for any given situation, if the resources required to implement the test exceed the resources available, the test will be impractical and will not be used, no matter how perfect it is, unless resources can be allocated more efficiently. Bachman and Palmer (1996) state that test resources can be classified into
three types: human resources, material resources and time.
Human resources include test writers, scorers, test administrators, and clerical support personnel (Bachman & Palmer, 1996). TOEFL iBT is developed and
administered by Educational Testing Service, a nonprofit educational organization which
gathers many language and testing experts. As a worldwide test, it requires a huge number of raters. In order to minimize the number of raters while ensuring test reliability, all writing responses are scored by two raters, and if the two ratings differ by more than one point, a chief rater rates the response again (ETS, 2008b). In addition, ETS recruits and trains new raters all over the world to avoid a shortage of scorers.
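The two-rater rule described above can be sketched as a small decision procedure. This is a hedged illustration: the 0-5 scale and the more-than-one-point trigger come from the report, while the averaging step and the function names are assumptions for illustration, not documented ETS procedure:

```python
def needs_chief_rater(r1, r2):
    """A gap of more than one point between the two raters triggers adjudication."""
    return abs(r1 - r2) > 1

def final_writing_score(r1, r2, chief=None):
    # Assumption: the two ratings are averaged when they agree closely;
    # the chief rater's score is used when adjudication is triggered.
    if needs_chief_rater(r1, r2):
        if chief is None:
            raise ValueError("ratings differ by more than one point; chief rating required")
        return float(chief)
    return (r1 + r2) / 2

print(final_writing_score(4, 5))     # 4.5 (ratings within one point)
print(final_writing_score(2, 4, 3))  # 3.0 (chief rater adjudicates)
```

Structuring the rule this way shows why it economizes on human resources: the third rating is requested only for the minority of responses where the first two raters disagree substantially.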
Nowadays, the TOEFL Internet-based Test is conducted in 933 cities in 128 countries. There are more than 3600 active test centers in total, 1679 test centers are in preparation, and new test centers are added on a continual basis (Tyson, 2007). Thanks to the development of technology, the Internet and computers are part of people’s lives. Thus, the material resources for TOEFL iBT are available.
The score reporting of TOEFL iBT is timely and accurate (Tyson, 2007). Test takers can
view their scores online 15 business days after the test, which is much faster than TOEFL
PBT. Colleges, universities, and agencies can also go online to view the scores of those
students who selected them as a score recipient (ETS, 2007a).
The process of choosing the scoring method for TOEFL iBT is also worth mentioning, as it reflects the test designers’ consideration of practicality. There are three common scoring methods: analytical, primary trait and holistic scoring. Analytical
scoring seems to be a detail-oriented method. However, it is hard to put into practice because the conceptually distinct aspects are actually interrelated (Butler, Eignor, Jones, McNamara & Suomi, 2000). For instance, it is hard to separate topical development from language use, since the use of cohesive ties is also a device of topical development. As for primary trait scoring, it works well with tasks at different difficulty levels but is costly to conduct. Compared with those two methods, holistic scoring is relatively practical and efficient. Therefore, it is implemented in the TOEFL iBT writing section.
Conclusion
As a test accepted by more than 6,000 universities worldwide, TOEFL iBT has high reliability, validity, authenticity, interactiveness and practicality, and has a positive impact as a whole. Even so, my analysis points to two aspects that can be improved. One is that test reliability can be improved by regulating test administration: test conditions should be more consistent, especially when the test is conducted in different countries. The other is that, considering that the TOEFL iBT writing section requires only two tasks, test takers should have more options to choose topics with which they are familiar. By doing so, test validity can be improved to some degree.
References

Attali, Y. (2007). Construct validity of e-rater in scoring TOEFL essays (ETS Research Memorandum No. RR-07-21). Princeton, NJ: ETS.
Bachman, L. F., & Palmer, A. S. (1996). Language testing in practice. Oxford: Oxford University Press.
Breland, H., Lee, Y.-W., Najarian, M., & Muraki, E. (2004). An analysis of TOEFL-CBT writing prompt difficulty and comparability for different gender groups (TOEFL Research Rep. No. RR-76). Princeton, NJ: ETS.
Brown, D. (2004). Language assessment: Principles and classroom practices. New York: Pearson Education.
Butler, F. A., Eignor, D., Jones, S., McNamara, T., & Suomi, B. K. (2000). TOEFL 2000 speaking framework: A working paper (TOEFL Monograph No. MS-20). Princeton, NJ: ETS.
Cumming, A., Grant, L., Mulcahy-Ernt, P., & Powers, D. (2005). A teacher-verification study of speaking and writing prototype tasks for a new TOEFL. Princeton, NJ: ETS.
Cumming, A., Kantor, R., Baba, K., Eouanzoui, K., Erdosy, U., & James, M. (2006). Analysis of discourse features and verification of scoring levels for independent and integrated prototype written tasks for the new TOEFL (TOEFL Monograph No. MS-30). Princeton, NJ: ETS.
Educational Testing Service (2004). English language competency descriptors. Princeton, NJ: ETS.
Educational Testing Service (2007a). TOEFL tips. Retrieved Oct. 10, 2008, from http://www.ets.org/Media/Tests/TOEFL/pdf/TOEFL_Tips.pdf
Educational Testing Service (2007b). TOEFL iBT performance feedback for test takers. Retrieved Oct. 20, 2008, from http://www.ets.org/Media/Tests/TOEFL/pdf/TOEFL_Perf_Feedback.pdf
Educational Testing Service (2007c). TOEFL iBT score reliability and generalizability. Retrieved Nov. 17, 2008, from http://www.ets.org/Media/Tests/TOEFL/pdf/TOEFL_iBT_Score_Reliability_Generalizability.pdf
Educational Testing Service (2007d). Validity evidence supporting the interpretation and use of TOEFL iBT scores. Princeton, NJ: ETS.
Educational Testing Service (2008a). TOEFL at a glance. Retrieved Oct. 10, 2008, from http://www.ets.org/Media/Tests/TOEFL/pdf/TOEFL_at_a_Glance.pdf
Educational Testing Service (2008b). Reliability and comparability of TOEFL iBT scores. Retrieved Nov. 19, 2008, from http://www.ets.org/Media/Tests/TOEFL/pdf/TOEFL_iBT_Reliability.pdf
Lee, Y., & Kantor, R. (2005). Dependability of new ESL writing test scores: Evaluating prototype tasks and alternative rating schemes. Princeton, NJ: ETS.
Tyson, E. (2007). TOEFL iBT update. Retrieved Nov. 19, 2008, from http://www.cgsnet.org/portals/0/pdf/mtg_sm07Tyson.pdf
Wall, D., & Horak, T. (2008). The impact of changes in the TOEFL examination on teaching and learning in Central and Eastern Europe: Phase 2, coping with change (TOEFL iBT Report No. TOEFLiBT-05). Princeton, NJ: ETS.
Zareva, A. (2005). What is new in the new TOEFL-iBT 2006 test format? Electronic Journal of Foreign Language, 2(2), 45-57.
Zhang, Y. (2008). Repeater analyses for TOEFL iBT (ETS Research Memorandum No. RM-08-05). Princeton, NJ: ETS.
Appendix A

Independent Writing Rubrics
SCORE 5: An essay at this level largely accomplishes all of the following:
- Effectively addresses the topic and task
- Is well organized and well developed, using clearly appropriate explanations, exemplifications, and/or details
- Displays unity, progression, and coherence
- Displays consistent facility in the use of language, demonstrating syntactic variety, appropriate word choice, and idiomaticity, though it may have minor lexical or grammatical errors

SCORE 4: An essay at this level largely accomplishes all of the following:
- Addresses the topic and task well, though some points may not be fully elaborated
- Is generally well organized and well developed, using appropriate and sufficient explanations, exemplifications, and/or details
- Displays unity, progression, and coherence, though it may contain occasional redundancy, digression, or unclear connections
- Displays facility in the use of language, demonstrating syntactic variety and range of vocabulary, though it will probably have occasional noticeable minor errors in structure, word form, or use of idiomatic language that do not interfere with meaning

SCORE 3: An essay at this level is marked by one or more of the following:
- Addresses the topic and task using somewhat developed explanations, exemplifications, and/or details
- Displays unity, progression, and coherence, though connection of ideas may be occasionally obscured
- May demonstrate inconsistent facility in sentence formation and word choice that may result in lack of clarity and occasionally obscure meaning
- May display accurate but limited range of syntactic structures and vocabulary

SCORE 2: An essay at this level may reveal one or more of the following weaknesses:
- Limited development in response to the topic and task
- Inadequate organization or connection of ideas
- Inappropriate or insufficient exemplifications, explanations, or details to support or illustrate generalizations in response to the task
- A noticeably inappropriate choice of words or word forms
- An accumulation of errors in sentence structure and/or usage

SCORE 1: An essay at this level is seriously flawed by one or more of the following weaknesses:
- Serious disorganization or underdevelopment
- Little or no detail, or irrelevant specifics, or questionable responsiveness to the task
- Serious and frequent errors in sentence structure or usage

SCORE 0: An essay at this level merely copies words from the topic, rejects the topic, or is otherwise not connected to the topic, is written in a foreign language, consists of keystroke characters, or is blank.
Integrated Writing Rubrics
SCORE 5: A response at this level successfully selects the important information from the lecture and coherently and accurately presents this information in relation to the relevant information presented in the reading. The response is well organized, and occasional language errors that are present do not result in inaccurate or imprecise presentation of content or connections.

SCORE 4: A response at this level is generally good in selecting the important information from the lecture and in coherently and accurately presenting this information in relation to the relevant information in the reading, but it may have minor omission, inaccuracy, vagueness, or imprecision of some content from the lecture or in connection to points made in the reading. A response is also scored at this level if it has more frequent or noticeable minor language errors, as long as such usage and grammatical structures do not result in anything more than an occasional lapse of clarity or in the connection of ideas.

SCORE 3: A response at this level contains some important information from the lecture and conveys some relevant connection to the reading, but it is marked by one or more of the following:
- Although the overall response is definitely oriented to the task, it conveys only vague, global, unclear, or somewhat imprecise connection of the points made in the lecture to points made in the reading.
- The response may omit one major key point made in the lecture.
- Some key points made in the lecture or the reading, or connections between the two, may be incomplete, inaccurate, or imprecise.

SCORE 2: A response at this level contains some relevant information from the lecture, but is marked by significant language difficulties or by significant omission or inaccuracy of important ideas from the lecture or in the connections between the lecture and the reading; a response at this level is marked by one or more of the following:
- The response significantly misrepresents or completely omits the overall connection between the lecture and the reading.
- The response significantly omits or significantly misrepresents important points made in the lecture.
- The response contains language errors or expressions that largely obscure connections or meaning at key junctures or that would likely obscure understanding of key ideas for a reader not already familiar with the reading and the lecture.

SCORE 1: A response at this level is marked by one or more of the following:
- The response provides little or no meaningful or relevant coherent content from the lecture.
- The language level of the response is so low that it is difficult to derive meaning.

SCORE 0: A response at this level merely copies sentences from the reading, rejects the topic or is otherwise not connected to the topic, is written in a foreign language, consists of keystroke characters, or is blank.
(ETS, 2007a, pp. 46-47)
Appendix B

Writing Based on Reading and Listening
GOOD (4.0-5.0)

Your performance: You responded well to the task, relating the lecture to the reading. Weaknesses, if you have any, might have to do with:
- slight imprecision in your summary of some of the main points, and/or
- use of English that is occasionally ungrammatical or unclear.

Advice for improvement:
- Continue to improve your ability to relate and convey information from two or more sources. For example, practice analyzing reading passages in English. Read two articles or chapters on the same topic or issue, write a summary of each, and then explain the ways they are similar and the ways they are different.
- Practice combining listening and reading by searching for readings related to talks and lectures with a teacher or a friend.

FAIR (2.5-3.5)

Your performance: You responded to the task, relating the lecture to the reading, but your response indicates weaknesses, such as:
- an important idea or ideas may be missing, unclear, or inaccurate;
- it may not be clear how the lecture and the reading passage are related; and/or
- grammatical mistakes or vague/incorrect uses of words may make the writing difficult to understand.

Advice for improvement:
- Practice finding main points. Ask a friend to record news and informational programs in English from the television or radio, or download talks or lectures from the Internet. Listen and take notes. Stop the recording about every 30 seconds to write out a short summary of what you heard.
- Replay the recording to check your summary. Mark places where you are not sure if you have understood what was said or if you are not sure you have expressed yourself well.

LIMITED (1.0-2.0)

Your performance: Your response was judged as limited due to:
- failure to understand the lecture or reading passage;
- deficiencies in relating the lecture to the reading passage; and/or
- many grammatical errors and/or very unclear expressions and sentence structures.

Advice for improvement:
- Read and listen to academic articles and other material in your own language. Take notes about what you read and hear. Begin by taking notes in your own language and then take notes in English. Summarize the points in complete English sentences.
- Ask your teacher to review your writing and help you correct your errors.
- Gradually decrease the time it takes you to read the material and write these summaries.
- Practice typing on a standard English (QWERTY) keyboard.