Test Review Paper Report 1
Running head: TEST REVIEW PAPER REPORT: TOEFL INTERNET-BASED TEST-WRITING
Test Review Paper Report
TOEFL Internet-based Test – Writing
Qian Liu
University of Southern California
EDUC 527
Assessment in the Language Classroom
Professor Zsuzsa Londe
December 3, 2008
Introduction
The Test of English as a Foreign Language (TOEFL) Internet-based Test (iBT) is a
standardized test of English proficiency whose purpose is to measure how well test takers
speak, listen, read and write in English and use these skills together in an academic setting
(ETS, 2008). The test is designed for students who plan to study abroad. More than 6000
colleges, universities and licensing agencies in 136 countries accept TOEFL scores (ETS,
2007a).
TOEFL iBT has been gradually introduced throughout the world since 2005. Today,
both Paper-based Test (PBT) and Internet-based Test (iBT) are offered by Educational Testing
Service (ETS), but iBT has taken over the leading role and PBT only serves as a supplement.
Unlike PBT, TOEFL iBT tests all four skills of English including speaking “integratively and
independently” (Zareva, 2005, p.48). That is to say, a number of tasks require the test taker to
use more than one language skill (ETS, 2007a). The test is delivered on computer via the
Internet and scores are reported online. It takes about four hours and all four sections are
conveniently taken on the same day (ETS, 2007a).
This paper aims to evaluate the usefulness of the writing section of TOEFL Internet-
based Test. To this end, it first points out what is new in TOEFL iBT writing section by
comparing its format and content with the Paper-based Test and Computer-based Test. Then it
analyzes the usefulness of TOEFL iBT, especially the writing section in terms of six qualities
of test usefulness, which are reliability, validity, authenticity, interactiveness, impact and
practicality. Finally, it concludes with the main strengths and weaknesses of TOEFL Internet-
based Test.
What is New in TOEFL iBT Writing Section?
The writing section is approximately 50 minutes long and includes two tasks. The first is a 20-minute integrated task, which requires test takers to write based on what is read and heard. The second is a 30-minute independent task that asks test takers to support an opinion on a topic. The independent task is very similar to the writing task in the TOEFL Computer-based Test (CBT) and the Test of Written English (TWE). Test takers wear headphones and type their responses. The typed responses are sent to ETS's Online Scoring Network (OSN). Scoring is done by two to four certified ETS raters on a scale of 0 to 5 according to the rubrics (see Appendix A). The new rubrics are more specific, more precise, and more student-friendly than the writing rubric for the TOEFL CBT (Wall & Horak, 2008). Test takers can view their TOEFL iBT scores online 15 business days after they take the test. They will also receive a copy of their score report by mail (ETS, 2007a).
Compared to the writing tasks in the TOEFL PBT and CBT, the new writing tasks provide
students with reading and listening texts as content input. On one hand, the purpose of using
these prompts is to support test takers in their response construction by providing them “not
only with some information for their writing, but also with some vocabulary to lean upon and
some genre conventions to model their response upon” (Zareva, 2005, p. 53). On the other
hand, it reflects the communicative demands in academic settings and highlights the
importance of students’ ability to synthesize ideas from multiple sources as an input to
writing (Leki & Carson, 1994, as cited in Zareva, 2005).
A major improvement in the test format is that the TOEFL iBT allows students to take
notes while completing the tasks. This is particularly beneficial in integrated tasks. Note-taking
helps test takers recall relevant information and better understand the listening material. It is reported that when allowed to take notes, students feel more comfortable under test conditions and perform better (Zareva, 2005).
Test Usefulness of TOEFL iBT Writing Section
In the process of designing and developing a test, the most important consideration is
its usefulness. According to Bachman and Palmer (1996), test usefulness can be described as
a function of several different qualities. They are reliability, construct validity, authenticity,
interactiveness, impact and practicality. It is impossible to design a test that is perfect in all of these qualities. Therefore, maximizing test usefulness means raising each quality as much as possible while maintaining an appropriate balance among the different qualities.
Reliability
A reliable test should be consistent and dependable (Brown, 2004). As one of the most
important qualities of test usefulness, test reliability is affected by four factors, which are
student, scoring, test administration and the test itself (Mousavi, 2002, as cited in Brown,
2004). Since student-related reliability is hard to control (unreliable factors can easily arise from temporary conditions, such as illness or a “bad day”), I will analyze the reliability of TOEFL iBT writing only in terms of rater reliability, test administration reliability and test reliability.
Rater Reliability
All writing responses are sent to ETS’s Online Scoring Network (OSN) and scored by
two to four raters (ETS, 2007a). In order to ensure maximum objectivity and reliability, all
raters are given well-defined and articulated scoring rubrics and are trained by ETS. They
cannot begin scoring work until they are certified. In addition, they must pass a calibration
test before each scoring session (ETS, 2008b). All ETS raters undergo constant monitoring when scoring (ETS, 2008a). Educational Testing Service (2008b) also reports that raters’ performance is evaluated statistically. When there is an obvious discrepancy between scores on the same piece of writing, a scoring leader is brought in to resolve the problem. Therefore, rater reliability in the TOEFL iBT writing section is high.
My only concern is that raters are human beings who have personal preferences shaped by their educational background and teaching experience. A possible solution is to use e-rater instead of human raters. Attali’s (2007) study shows that e-rater scores have significantly higher reliability regarding grammar, vocabulary, and organization.
However, e-rater cannot measure the length and content of a piece of writing. Experts who are
against e-rater even point out that writing is a kind of art and should not be measured by
machines (Attali, 2007).
Test Administration Reliability
Test administration reliability relies on keeping the same conditions in which the test is
administered, such as the noise level, temperature and the amount of light (Brown, 2004).
ETS (2008b) claims that TOEFL iBT follows standardized procedures for test administration.
By certifying all test centers’ facilities and equipment, training test center staff in how to handle a test administration session, providing online practice tests, and using technology to control test delivery, ETS tries to ensure that all test takers take the test under similar conditions. Even so, unreliable factors exist when the test takes place in different countries. For example, in test rooms of similar size, there may be only six test takers in Taipei, Taiwan, yet around 24 in Los Angeles, sitting shoulder to shoulder.
Test Reliability
As Table 1 and Table 2 show, the reliability of the Writing score is somewhat lower compared to the reliability of the Reading, Listening, Speaking, and Total scores.
Table 1
Reliabilities and Standard Errors of Measurement (based on the first year’s operational data
from September 2005 to December 2006)
Score      Scale    Reliability Estimate    Standard Error of Measurement (SEM)
Reading    0-30     0.86                    2.78
Listening  0-30     0.87                    2.40
Speaking   0-30     0.90                    1.70
Writing    0-30     0.78                    2.65
Total      0-120    0.95                    4.88
(ETS, 2007c, p. 1)

Table 2
Reliabilities and Standard Errors of Measurement (based on the first year’s operational data
from 2007)
Score      Scale    Reliability Estimate    Standard Error of Measurement (SEM)
Reading    0-30     0.85                    3.35
Listening  0-30     0.85                    3.20
Speaking   0-30     0.88                    1.62
Writing    0-30     0.74                    2.76
Total      0-120    0.94                    5.64
(ETS, 2008b, p. 3)

The lower writing reliability might result from the nature of the test itself. That is, the writing section is composed of
only two tasks (Breland, Bridgeman & Fowles, 1999, as cited in ETS, 2008b). In this regard,
the reliability of the TOEFL iBT writing section is acceptable. Zhang (2008) conducted a study of repeaters’ performance on the TOEFL iBT. The findings indicate only small mean score changes, which suggests that the test’s reliability is relatively high.
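The reliability estimates and SEMs in Tables 1 and 2 are connected by a standard classical-test-theory relationship, SEM = SD × √(1 − reliability). As a minimal sketch (Python, not part of the original report; the "implied SD" is a back-calculation, not a figure ETS publishes), the relationship can be checked against the Writing row of Table 1:

```python
import math

def sem(sd: float, reliability: float) -> float:
    """Classical test theory: SEM = SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1.0 - reliability)

# Working backwards from Table 1's Writing row (reliability 0.78, SEM 2.65),
# the implied standard deviation of writing scores is SEM / sqrt(1 - r):
implied_sd = 2.65 / math.sqrt(1.0 - 0.78)
print(round(implied_sd, 2))             # ≈ 5.65
print(round(sem(implied_sd, 0.78), 2))  # recovers 2.65
```

The formula makes visible why the Writing section's lower reliability (0.74-0.78) translates into a comparatively large SEM despite its narrow 0-30 scale.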
Validity
The primary purpose of a language test is to measure test takers’ language proficiency by
interpreting test scores. In this regard, validity is essential to the usefulness of any language
test (Bachman & Palmer, 1996). According to Brown (2004), when we consider the validity
of a test, we need to consider its content validity, criterion-related validity, construct validity,
consequential validity and face validity. Therefore, I will analyze the test validity of TOEFL
iBT in terms of those five aspects respectively.
Content Validity
The content of TOEFL iBT is relevant to and representative of the written tasks that
students will encounter in academic settings (ETS, 2007d). Although it is impossible for test tasks to be exactly the same as target language use (TLU) tasks, the two tasks in the writing section of TOEFL iBT are simulations of academic tasks. Based on their study, Cumming, Grant, Mulcahy-Ernt, and
Powers (2005) provide evidence about the content relevance of integrated test tasks. They
interviewed a sample of English as a Second Language (ESL) teachers about the new type of
test tasks. The teachers gave positive feedback on the tasks and viewed them as realistic
and appropriate simulations of academic tasks. They also felt the tasks elicited writing
responses from their students that represented the way the students usually performed in their
English classes (Cumming, et al., 2005).
Criterion-related Validity
Criterion-related validity refers to the extent to which the criterion of the test has actually been reached (Brown, 2004). Brown (2004, p. 24) claims that “criterion-related evidence is best demonstrated through a comparison of results of an assessment with results of some other measure of the same criterion.” No study has been conducted to measure the relationship between TOEFL iBT writing scores in particular and other relevant criteria of academic language proficiency. Only the 2003-2004 field study of the TOEFL iBT indicates that the observed correlation between the total test score and students’ self-assessments is .52 (ETS, 2004).
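The .52 figure above is a correlation coefficient. For readers unfamiliar with how such a coefficient is computed, here is a minimal Pearson product-moment sketch (Python; the paired data below are invented for illustration and are not ETS's field-study data):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between paired observations,
    e.g. total test scores and self-assessment ratings."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical paired data: TOEFL iBT total scores and 1-5 self-ratings.
scores = [68, 75, 90, 102, 110, 84]
ratings = [3, 2, 4, 4, 5, 4]
print(round(pearson_r(scores, ratings), 2))  # a moderate positive correlation
```

A value of .52 thus indicates a moderate positive relationship: higher total scores tend to accompany higher self-assessments, but far from perfectly.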
Construct Validity
Construct validity pertains to the meaningfulness and appropriateness of the test score
interpretations. Both construct definition and the characteristics of the test task should be
considered when we are evaluating construct validity of a score interpretation (Bachman &
Palmer, 1996). In the writing section, test takers’ abilities to choose vocabulary correctly, build grammatical structures, spell and punctuate accurately, and express information in an organized manner are tested. Those abilities are believed to be the main factors in successful writing in college and university settings. The scoring
procedures of TOEFL iBT writing also reflect the construct definition. ETS (2008b) claims
that test takers are not expected to produce a perfect essay. They can earn a high score with
response that contains some errors.
However, the test rubrics may not match the construct definition. Lee and Kantor
(2005) point out that each of the writing tasks measures an aspect of writing, thus separate
scores should be reported for each of these aspects. Furthermore, there are several possible
sources of bias which lie in the task characteristics. Firstly, the writing section contains only
two writing tasks. Thus, it is extremely important that writing scores generalize across tasks and task types to ensure test validity (Lee & Kantor, 2005). Secondly, the tasks might be appropriate for undergraduate students; for graduate students, however, even though they might not have the language ability to deal with certain general topics, they might still have the language ability needed to succeed in their particular fields. Cumming et al. (2005) suggest that test takers should have more choices about what to write based on their individual preferences.
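The concern that only two tasks limit score reliability can be made concrete with the Spearman-Brown prophecy formula, a standard psychometric result (not one the report itself applies): lengthening a test by a factor k predicts a new reliability of kr / (1 + (k − 1)r). A hedged sketch using the 2007 writing reliability estimate of 0.74:

```python
def spearman_brown(r: float, k: float) -> float:
    """Predicted reliability when a test is lengthened by factor k
    (Spearman-Brown prophecy formula)."""
    return k * r / (1 + (k - 1) * r)

# If two writing tasks yield a reliability of 0.74, doubling to four
# comparable tasks is predicted to raise it noticeably:
print(round(spearman_brown(0.74, 2), 2))  # 0.85
```

The prediction assumes the added tasks are parallel to the existing ones, so it is an upper-bound estimate; it nonetheless shows why the two-task design caps the writing section's reliability.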
Consequential Validity
All the consequences of using an assessment determine the test’s consequential validity
(Brown, 2004). By and large, TOEFL iBT is used appropriately and has positive
consequences. This test is used to make decisions about students’ readiness to study in
America. From this perspective, positive consequences would involve selecting students who
have the English language proficiency necessary to succeed at the college, and denying
admission to those who do not. Another use of the TOEFL iBT test is to support English
teaching and learning. Accordingly, the test has the positive washback of facilitating students’
English learning process and promoting communicative language teaching (ETS, 2007d).
Face Validity
Whether a test looks right and fair to students is also significant; this is known as face validity (Brown, 2004). High face validity involves a well-constructed format, a doable time
allotment, and clear test directions. To this end, the developers of the TOEFL iBT test carried out numerous exploratory studies over four years to find the best way to design and administer the test (Chapelle, Enright, & Jamieson, 2008, as cited in ETS, 2007d). For instance, test information is posted on the ETS website; preparation materials can be downloaded;
sample responses are provided; and note-taking is allowed. Allowing test takers to take notes
improves the face validity of the test, since academic life strongly encourages students to take
notes and use them in their academic preparation. Besides, high content validity of TOEFL
iBT contributes to its face validity as well.
Authenticity
Bachman and Palmer (1996) define authenticity as “the degree of correspondence of
the characteristics of a given language test task to the features of a target language use (TLU)
task” (p. 23). Authenticity is a critical quality of language tests because it bears on the generalizability of score interpretations and is thus closely related to construct validity. Also,
authenticity affects test takers’ perception of the test. Test takers tend to provide positive
affective responses and perform at their best when the test task is likely to be enacted in the
real world (Bachman & Palmer, 1996).
First of all, the characteristics of both the integrated and the independent task correspond to those of TLU tasks. Writing in English is required in all academic situations in America. Integrated writing refers to a written response on an exam or a reflection on what students have learned in class. This requires students to combine what they have heard from class lectures
with what they have read in textbooks or other materials. For instance, a student might be
asked to compare and contrast the ideas presented by the professor in class with those
expressed by an author in the reading material. In this case, he/she has to successfully draw
information from each source to analyze it (ETS, 2007a). Often students need to write essays
to express and support their opinions based on their own knowledge and experience. This
kind of writing can be referred to as independent writing. In order to complete this task, test
takers must be able to identify main ideas and develop the essay by using examples, reasons and details, which requires critical thinking (ETS, 2007a).
Secondly, the process of completing the test tasks is almost the same as the procedure for completing tasks in the TLU domain. Students first need to take notes on what they have heard and read. Then they summarize and paraphrase the useful information. Finally, they write, using all these resources to support and express their own opinions.
Thirdly, the language used in the prompts is close to the TLU domain. The reading passages come from real textbooks and course materials. Topics in the independent writing task authentically relate to test takers’ personal experience and daily life.
Fourthly, the rubrics (see Appendix A) of the test tasks are close to the real academic
writing standards. In an academic setting, a student’s writing paper is evaluated on the basis
of its expression and organization. Students are required to present an idea in a clear and
well-organized manner. Their abilities of organizing information, developing accurate
content, and using grammar and vocabulary appropriately and precisely are valued both in the
test tasks and in TLU domain (ETS, 2007a).
Note-taking is one of the most outstanding features reflecting the authenticity of the TOEFL iBT writing section. According to Zareva (2005), allowing note-taking makes the test
tasks closer to authentic academic tasks college students perform in their daily life, since it is
the primary means of recording information when students are listening to a lecture,
participating in a group discussion and reading a textbook. Furthermore, Dunkel (1988)
investigates lecture note-taking among undergraduate L1 and L2 speakers of English. He
reports that good note-takers are good summarizers, who are equipped with academic skills
of reformulating and abstracting the gist of information (as cited in Zareva, 2005). Therefore,
allowing note-taking in writing section is authentic and corresponds to the TLU task.
Last but not least, all responses must be typed into the computer rather than written with a pen on paper. As is well known, most universities all over the world now require official academic writing to be typed. Thus, the requirement of typing responses into a computer in the writing section exactly matches the trend in today’s academic settings.
Interactiveness
Interactiveness is defined as the extent and type of involvement of the test taker’s
individual characteristics such as personal characteristic, language ability, topical knowledge,
and affective schemata in accomplishing a test task (Bachman & Palmer, 1996). The more
interactive a test is, the better test takers can perform on it.
Test takers of TOEFL iBT are non-native speakers who want to study in
undergraduate, graduate and postgraduate programs in English-medium colleges and
universities. On one hand, integrated tasks, which include various subjects and topics, are designed for academic purposes and help test takers perform at their best. On the
other hand, independent tasks which involve test takers’ personal experiences should consider
personal characteristics carefully. A study conducted by Breland, Lee, Najarian and Muraki
(2004) shows that topics like art, music, housing, roommates, friends and children tend to
generate large gender differences.
The test involves test takers’ language ability as much as possible. In the writing
section of TOEFL iBT, the integrated task involves a wide range of areas of language knowledge by incorporating reading and listening into writing. It also requires comprehension-related language abilities, such as paraphrasing, summarizing and presenting ideas. The independent
task allows examinees to demonstrate their ability to form a coherent written argument based
on their personal knowledge and experience.
Topical knowledge is not required in the integrated task; all background knowledge is provided by the test prompts. The situation is different in the independent task, where test takers write responses based on their personal experience and topical knowledge. Therefore, examinees who have little topical knowledge of a specific topic might receive poor scores. In order to minimize the influence of topical knowledge, the test could offer more open-ended choices based on individuals’ preferences, interests and inclinations (Cumming, et al., 2005).
Affective schemata play a significant role when test takers are completing the test tasks. The writing tasks in TOEFL iBT simulate the real classroom environment and use authentic reading materials and lectures. Additionally, those tasks correspond to the TLU settings in which test takers are asked to read materials, listen to lectures, take notes, and write papers. Thus, test takers will have positive affective responses to the tasks in general. Nevertheless, test takers’ affective schemata vary because of individual differences. For example, test takers may have positive affective responses when they have relevant topical knowledge of the test task. In contrast, they may respond negatively when they have no idea about the topic. Also, test takers who have high language proficiency tend to be positive towards language tests, while those who have low language proficiency may feel threatened by the test.
Impact
TOEFL has gained extraordinary power worldwide because it is used to accept some
international students into American universities while rejecting others. It is believed that
tests in general and high-stakes tests like TOEFL in particular, have a tremendous impact on
students and teachers, even on society and education systems at large.
Impact on Individuals
There is no doubt that TOEFL has its impact upon the individuals within certain
educational systems. “Stake-holder” such as test takers and teachers are individuals who are
most directly affected by test use (Bachman & Palmer, 1996).
Impact on test takers. Bachman and Palmer (1996) state that test takers can be affected by three aspects of the testing procedure: their experience of preparing for and taking the test, the feedback they receive after the test, and the decisions that may be made about them according to their test performance.
In order to take the TOEFL Internet-based Test, the vast majority of students spend several weeks or even several months preparing. In some countries, such as China and Korea, “cram” courses are offered to provide students with specific test practice. Techniques needed in the test are taught as core subjects in the syllabus. The fact that some learners get a high score on the TOEFL but cannot express themselves properly might be due to this kind of negative washback, known as “learning for the test.” Additionally, TOEFL scores do not measure
students’ actual language performance accurately. A negative impact occurs when students define their own English proficiency strictly according to the TOEFL. It is quite possible for students to communicate successfully with native speakers of English, orally and in writing, and yet not succeed on the TOEFL iBT. Almost every formal test shares these weaknesses, since its result means a great deal to test takers.
Every coin has two sides. TOEFL iBT creates opportunities for test takers to get
familiar with the TLU domain. Since it is designed to evaluate test takers’ English language ability in academic settings in America, overseas students who have never studied in American colleges and universities gain topical knowledge from the experience of taking the test itself.
What is more, TOEFL iBT has positive washback because of its strong score-reporting system. Students receive detailed performance feedback (see Appendix B) from ETS rather than a simple numerical score. Learning strategies are provided according to students’ proficiency
level. This complete, relevant, and meaningful feedback can be very beneficial for test takers
to achieve progress in English learning in the future.
Impact on teachers. When it comes to TOEFL’s impact on teachers, it is positive by
and large. Wall and Horak (2006) note that there is a hope that TOEFL will lead to a more
communicative approach to teaching and that preparation classes would pay more attention to
academic tasks and language (as cited in Wall & Horak, 2008). In the past, most teachers conducted writing classes that focused on language structure and discrete grammatical items, with no integrated skills work. In order to prepare students for
integrated tasks in the writing section, teachers have begun to teach students how to summarize properly, how to paraphrase, and how to select important information from a flood of input (Wall
& Horak, 2008). Those abilities will be extremely useful for students who are going to study
in the U.S. However, as Wall and Alderson (1993) point out, we cannot simply assume that a
test will automatically affect instructional practice (as cited in Bachman & Palmer, 1996).
Therefore, whether TOEFL iBT will have positive washback on ESL and EFL pedagogy depends heavily on teachers’ English proficiency, their English teaching ability, and their teaching philosophy.
Impact on Society and Education Systems
As Bachman (1990) argues, a test is not a value-free production; it always serves to
meet the needs of an educational system or of society at large (as cited in Bachman & Palmer,
1996). In general, TOEFL iBT has a positive impact on society and education systems, since
it draws attention to English-language skills in academic settings in a more communicative
and authentic way. What is worth mentioning is that TOEFL scores have been misused
especially in non-English speaking countries. For example, many companies and government departments in China simply evaluate job applicants’ English skills through their TOEFL iBT scores, even though those scores have nothing to do with workplace English skills.
Practicality
Practicality is an essential quality of a test: for any given situation, if the resources required to implement the test exceed the resources available, the test will be impractical and will not be used, no matter how perfect it is, unless resources can be allocated more efficiently. Bachman and Palmer (1996) state that test resources can be classified into
three types: human resources, material resources and time.
Human resources include test writers, scorers, test administrators, and clerical support personnel (Bachman & Palmer, 1996). TOEFL iBT is developed and
administered by Educational Testing Service, a nonprofit educational organization which
gathers many language and testing experts. As a worldwide test, it requires a huge number of raters. In order to minimize the number of raters while ensuring test reliability, all writing responses are scored by two raters, and if the two ratings differ by more than one point, a chief rater rates the response again (ETS, 2008b). In addition, ETS recruits and trains new raters all over the world to avoid a shortage of scorers.
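The two-rater rule described above can be sketched as a small decision procedure. This is a hedged illustration: the 0-5 scale and the more-than-one-point trigger come from the report, while the averaging step and the function names are assumptions for illustration, not documented ETS procedure:

```python
def needs_chief_rater(r1, r2):
    """A gap of more than one point between the two raters triggers adjudication."""
    return abs(r1 - r2) > 1

def final_writing_score(r1, r2, chief=None):
    # Assumption: the two ratings are averaged when they agree closely;
    # the chief rater's score is used when adjudication is triggered.
    if needs_chief_rater(r1, r2):
        if chief is None:
            raise ValueError("ratings differ by more than one point; chief rating required")
        return float(chief)
    return (r1 + r2) / 2

print(final_writing_score(4, 5))     # 4.5 (ratings within one point)
print(final_writing_score(2, 4, 3))  # 3.0 (chief rater adjudicates)
```

Structuring the rule this way shows why it economizes on human resources: the third rating is requested only for the minority of responses where the first two raters disagree substantially.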
Nowadays, the TOEFL Internet-based Test is conducted in 933 cities in 128 countries. There are more than 3600 active test centers in total, 1679 test centers are in preparation, and new test centers are added on a continual basis (Tyson, 2007). Thanks to the development of technology, the Internet and computers are part of people’s lives. Thus, the material resources for TOEFL iBT are available.
The score reporting of TOEFL iBT is timely and accurate (Tyson, 2007). Test takers can
view their scores online 15 business days after the test, which is much faster than TOEFL
PBT. Colleges, universities, and agencies can also go online to view the scores of those
students who selected them as a score recipient (ETS, 2007a).
The process of choosing the scoring method for TOEFL iBT is also worth mentioning, as it reflects the test designers’ consideration of practicality. There are three common scoring methods: analytical, primary trait and holistic scoring. Analytical
scoring seems to be a detail-oriented method. However, it is hard to put into practice because the conceptually distinct aspects are actually interrelated (Butler, Eignor, Jones, McNamara & Suomi, 2000). For instance, it is hard to separate topical development from language use, since the use of cohesive ties is also a device of topical development. As for primary trait scoring, it works well with tasks at different difficulty levels but is costly to conduct. Compared with those two methods, holistic scoring is relatively practical and efficient. Therefore, it is implemented in the TOEFL iBT writing section.
Conclusion
As a test accepted by more than 6,000 universities worldwide, TOEFL iBT has high reliability, validity, authenticity, interactiveness and practicality, and has a positive impact as a whole. Even so, my analysis points to two aspects that can be improved. One is that test reliability can be improved by regulating test administration: test conditions should be more consistent, especially when the test is conducted in different countries. The other is that, considering that the TOEFL iBT writing section requires only two tasks, test takers should have more options to choose topics with which they are familiar. By doing so, test validity can be improved to some degree.
References

Attali, Y. (2007). Construct validity of e-rater in scoring TOEFL essays (ETS Research Memorandum No. RR-07-21). Princeton, NJ: ETS.
Bachman, L. F., & Palmer, A. S. (1996). Language testing in practice. Oxford: Oxford University Press.
Breland, H., Lee, Y.-W., Najarian, M., & Muraki, E. (2004). An analysis of TOEFL-CBT writing prompt difficulty and comparability for different gender groups (TOEFL Research Rep. No. RR-76). Princeton, NJ: ETS.
Brown, D. (2004). Language assessment: Principles and classroom practices. New York: Pearson Education.
Butler, F. A., Eignor, D., Jones, S., McNamara, T., & Suomi, B. K. (2000). TOEFL 2000 speaking framework: A working paper (TOEFL Monograph No. MS-20). Princeton, NJ: ETS.
Cumming, A., Grant, L., Mulcahy-Ernt, P., & Powers, D. (2005). A teacher-verification study of speaking and writing prototype tasks for a new TOEFL. Princeton, NJ: ETS.
Cumming, A., Kantor, R., Baba, K., Eouanzoui, K., Erdosy, U., & James, M. (2006). Analysis of discourse features and verification of scoring levels for independent and integrated prototype written tasks for the new TOEFL (TOEFL Monograph No. MS-30). Princeton, NJ: ETS.
Educational Testing Service (2004). English language competency descriptors. Princeton, NJ: ETS.
Educational Testing Service (2007a). TOEFL tips. Retrieved Oct. 10, 2008, from http://www.ets.org/Media/Tests/TOEFL/pdf/TOEFL_Tips.pdf
Educational Testing Service (2007b). TOEFL iBT performance feedback for test takers. Retrieved Oct. 20, 2008, from http://www.ets.org/Media/Tests/TOEFL/pdf/TOEFL_Perf_Feedback.pdf
Educational Testing Service (2007c). TOEFL iBT score reliability and generalizability. Retrieved Nov. 17, 2008, from http://www.ets.org/Media/Tests/TOEFL/pdf/TOEFL_iBT_Score_Reliability_Generalizability.pdf
Educational Testing Service (2007d). Validity evidence supporting the interpretation and use of TOEFL iBT scores. Princeton, NJ: ETS.
Educational Testing Service (2008a). TOEFL at a glance. Retrieved Oct. 10, 2008, from http://www.ets.org/Media/Tests/TOEFL/pdf/TOEFL_at_a_Glance.pdf
Educational Testing Service (2008b). Reliability and comparability of TOEFL iBT scores. Retrieved Nov. 19, 2008, from http://www.ets.org/Media/Tests/TOEFL/pdf/TOEFL_iBT_Reliability.pdf
Lee, Y., & Kantor, R. (2005). Dependability of new ESL writing test scores: Evaluating prototype tasks and alternative rating schemes. Princeton, NJ: ETS.
Tyson, E. (2007). TOEFL iBT update. Retrieved Nov. 19, 2008, from http://www.cgsnet.org/portals/0/pdf/mtg_sm07Tyson.pdf
Wall, D., & Horak, T. (2008). The impact of changes in the TOEFL examination on teaching and learning in Central and Eastern Europe: Phase 2, coping with change (TOEFL iBT Report No. TOEFLiBT-05). Princeton, NJ: ETS.
Zareva, A. (2005). What is new in the new TOEFL-iBT 2006 test format? Electronic Journal of Foreign Language, 2(2), 45-57.
Zhang, Y. (2008). Repeater analyses for TOEFL iBT (ETS Research Memorandum No. RM-08-05). Princeton, NJ: ETS.
Appendix A

Independent Writing Rubrics
SCORE 5: An essay at this level largely accomplishes all of the following:
- Effectively addresses the topic and task
- Is well organized and well developed, using clearly appropriate explanations, exemplifications, and/or details
- Displays unity, progression, and coherence
- Displays consistent facility in the use of language, demonstrating syntactic variety, appropriate word choice, and idiomaticity, though it may have minor lexical or grammatical errors

SCORE 4: An essay at this level largely accomplishes all of the following:
- Addresses the topic and task well, though some points may not be fully elaborated
- Is generally well organized and well developed, using appropriate and sufficient explanations, exemplifications, and/or details
- Displays unity, progression, and coherence, though it may contain occasional redundancy, digression, or unclear connections
- Displays facility in the use of language, demonstrating syntactic variety and range of vocabulary, though it will probably have occasional noticeable minor errors in structure, word form, or use of idiomatic language that do not interfere with meaning

SCORE 3: An essay at this level is marked by one or more of the following:
- Addresses the topic and task using somewhat developed explanations, exemplifications, and/or details
- Displays unity, progression, and coherence, though connection of ideas may be occasionally obscured
- May demonstrate inconsistent facility in sentence formation and word choice that may result in lack of clarity and occasionally obscure meaning
- May display accurate but limited range of syntactic structures and vocabulary

SCORE 2: An essay at this level may reveal one or more of the following weaknesses:
- Limited development in response to the topic and task
- Inadequate organization or connection of ideas
- Inappropriate or insufficient exemplifications, explanations, or details to support or illustrate generalizations in response to the task
- A noticeably inappropriate choice of words or word forms
- An accumulation of errors in sentence structure and/or usage

SCORE 1: An essay at this level is seriously flawed by one or more of the following weaknesses:
- Serious disorganization or underdevelopment
- Little or no detail, or irrelevant specifics, or questionable responsiveness to the task
- Serious and frequent errors in sentence structure or usage

SCORE 0: An essay at this level merely copies words from the topic, rejects the topic, or is otherwise not connected to the topic, is written in a foreign language, consists of keystroke characters, or is blank.
Integrated Writing Rubrics
SCORE 5: A response at this level successfully selects the important information from the lecture and coherently and accurately presents this information in relation to the relevant information presented in the reading. The response is well organized, and occasional language errors that are present do not result in inaccurate or imprecise presentation of content or connections.

SCORE 4: A response at this level is generally good in selecting the important information from the lecture and in coherently and accurately presenting this information in relation to the relevant information in the reading, but it may have minor omission, inaccuracy, vagueness, or imprecision of some content from the lecture or in connection to points made in the reading. A response is also scored at this level if it has more frequent or noticeable minor language errors, as long as such usage and grammatical structures do not result in anything more than an occasional lapse of clarity or in the connection of ideas.

SCORE 3: A response at this level contains some important information from the lecture and conveys some relevant connection to the reading, but it is marked by one or more of the following:
- Although the overall response is definitely oriented to the task, it conveys only vague, global, unclear, or somewhat imprecise connection of the points made in the lecture to points made in the reading.
- The response may omit one major key point made in the lecture.
- Some key points made in the lecture or the reading, or connections between the two, may be incomplete, inaccurate, or imprecise.

SCORE 2: A response at this level contains some relevant information from the lecture, but is marked by significant language difficulties or by significant omission or inaccuracy of important ideas from the lecture or in the connections between the lecture and the reading; a response at this level is marked by one or more of the following:
- The response significantly misrepresents or completely omits the overall connection between the lecture and the reading.
- The response significantly omits or significantly misrepresents important points made in the lecture.
- The response contains language errors or expressions that largely obscure connections or meaning at key junctures or that would likely obscure understanding of key ideas for a reader not already familiar with the reading and the lecture.

SCORE 1: A response at this level is marked by one or more of the following:
- The response provides little or no meaningful or relevant coherent content from the lecture.
- The language level of the response is so low that it is difficult to derive meaning.

SCORE 0: A response at this level merely copies sentences from the reading, rejects the topic or is otherwise not connected to the topic, is written in a foreign language, consists of keystroke characters, or is blank.
(ETS, 2007a, pp. 46-47)
Appendix B

Writing Based on Reading and Listening
GOOD (4.0-5.0)

Your performance: You responded well to the task, relating the lecture to the reading. Weaknesses, if you have any, might have to do with:
- slight imprecision in your summary of some of the main points, and/or
- use of English that is occasionally ungrammatical or unclear.

Advice for improvement:
- Continue to improve your ability to relate and convey information from two or more sources. For example, practice analyzing reading passages in English. Read two articles or chapters on the same topic or issue, write a summary of each, and then explain the ways they are similar and the ways they are different.
- Practice combining listening and reading by searching for readings related to talks and lectures with a teacher or a friend.

FAIR (2.5-3.5)

Your performance: You responded to the task, relating the lecture to the reading, but your response indicates weaknesses, such as:
- an important idea or ideas may be missing, unclear, or inaccurate;
- it may not be clear how the lecture and the reading passage are related; and/or
- grammatical mistakes or vague/incorrect uses of words may make the writing difficult to understand.

Advice for improvement:
- Practice finding main points. Ask a friend to record news and informational programs in English from the television or radio, or download talks or lectures from the Internet. Listen and take notes. Stop the recording about every 30 seconds to write out a short summary of what you heard.
- Replay the recording to check your summary. Mark places where you are not sure if you have understood what was said or if you are not sure you have expressed yourself well.

LIMITED (1.0-2.0)

Your performance: Your response was judged as limited due to:
- failure to understand the lecture or reading passage;
- deficiencies in relating the lecture to the reading passage; and/or
- many grammatical errors and/or very unclear expressions and sentence structures.

Advice for improvement:
- Read and listen to academic articles and other material in your own language. Take notes about what you read and hear. Begin by taking notes in your own language and then take notes in English. Summarize the points in complete English sentences.
- Ask your teacher to review your writing and help you correct your errors.
- Gradually decrease the time it takes you to read the material and write these summaries.
- Practice typing on a standard English (QWERTY) keyboard.