
Running Head: TEST REVIEW

Test Review: TOEFL, IELTS, ACT Compass ESL Test

Kathleen Hamel

Colorado State University


Introduction

More and more students from around the world want to study at international institutions, many of which require knowledge and application of the English language. These institutions therefore need to assess whether students can perform at the level necessary to excel in their programs. Assessments, whether tests or alternative measures, tell administrators, teachers, and students what has been learned and what still needs to be learned (Miller, Linn, & Gronlund, 2009). For admission to a university, the most appropriate assessments are English proficiency exams; the purpose of such placement exams is to determine whether students possess the necessary language to succeed at a university (Stoynoff & Chapelle, 2005).

The university setting is of particular interest to me as a future ESL teacher. My short-term goal is to teach students coming directly from other countries who need to improve their English skills before moving on to their field-specific study. Therefore, in this paper I will review the Test of English as a Foreign Language Internet-Based Test (TOEFL iBT), the International English Language Testing System (IELTS), and the ACT Compass ESL Placement Test. I chose these three tests because they all claim a similar purpose: to be used to make decisions about non-native speakers of English (NNSE) who want to be admitted to English-medium universities. Through this review, I hope to gain an understanding of how these tests are structured and of their reliability and validity, and to understand the concerns, apprehensions, and questions that students might have about these exams.


Test of English as a Foreign Language Internet Based Test (TOEFL iBT®)

Publisher: Educational Testing Service, Brigham Library, Mail Stop 07J, Rosedale Road, Princeton, NJ 08541; 1-609-734-5667; https://www.ets.org/test_link/contact/
Publication Date: 2005
Target Population: NNSE who plan to use English at the university level
Cost: Fees vary by location, currently $190 in the U.S.

Overview

The TOEFL began as a paper-based assessment; its most current version, the TOEFL iBT, is administered entirely over the Internet. The TOEFL iBT seeks to measure test takers’ ability “to use and understand English at the university level” (“About the TOEFL”). On its website, ETS also claims that the TOEFL iBT is the most widely respected and recognized test in major English-speaking countries such as the United Kingdom, the United States, and Australia. A more thorough description of the TOEFL iBT is given below (see Table 1).

Table 1: Extended Description of TOEFL iBT

Test Purpose
The purpose of the TOEFL iBT test is to assess NNSE who plan on using English in an academic setting. According to Jamieson, Jones, Kirsch, Mosenthal, and Taylor (2000), “The test will measure examinee’s English-language proficiency in situations and tasks reflective of university life…” (as cited in “TOEFL iBT test framework,” 2010, p. 2).

Test Structure
The TOEFL iBT is “administered via a computer from a secure, international, Internet-testing network” (“TOEFL iBT test framework,” 2010, p. 2). The test is broken into four sections: Reading, Listening, Speaking, and Writing, with a total time of about four hours. Test takers have 60-90 minutes to complete the Reading section, which consists of 3-5 authentic, university-level reading passages of approximately 700 words each. Each passage has 12-14 questions, for a total of 36-70 questions, and provides all the information needed to answer them. For the Listening section, test takers listen to 4-6 lectures and 2-3 conversations, lasting 3-5 minutes and 3 minutes, respectively. This section comprises 34-51 question items, and test takers have 60-90 minutes to complete it; its goal is to assess test takers’ “ability to understand spoken English in an academic setting” (p. 3). The Speaking section lasts 20 minutes and consists of six tasks. Two of the tasks are ‘independent’: the test taker receives no oral or written materials and responds to a familiar topic. The other four tasks focus on integrated skills: two involve responding to written and oral input, and the other two to oral input alone. This section assesses students’ ability to respond in academic settings, in and out of the classroom. The Writing section consists of two tasks, one independent and one integrated, and lasts 50 minutes; it measures one’s ability to write in an academic setting. For the independent task, students are given a general question and must develop and support an opinion on an issue. For the integrated task, students read a passage, listen to a lecture presenting opposing ideas, and write about how the important points relate to one another.

(“TOEFL iBT test framework,” 2010)

Scoring of Test
The TOEFL iBT is scaled on a range from 0-120, with 120 being the highest score, and each section (Reading, Listening, Speaking, and Writing) is scaled from 0-30. The Reading section has two types of multiple-choice questions: those with a single correct answer and those with more than one correct answer, which lend themselves to partial-credit scoring (one possible rule is sketched below). The Listening section is scored similarly, though with more single-answer questions. The Speaking and Writing sections are scored differently, with rubrics. Both rubrics are holistic: a 4-point rubric is used for the Speaking section, while a 5-point rubric is used for the Writing section. In order to maintain reliability, raters must have appropriate qualifications and must pass a “calibration test” before they begin rating.

(“TOEFL iBT test framework,” 2010)
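As an illustration of how partial credit on a multiple-answer question might work, the sketch below awards a fraction of an item’s points based on overlap with the answer key. ETS does not publish its exact rule, so the function and its one-point-per-correct-option-minus-wrong-selections logic are assumptions for illustration only.

    # Hypothetical partial-credit rule for a multiple-answer item.
    # ETS does not publish its actual formula; this is one plausible scheme.
    def partial_credit(selected: set, correct: set, max_points: int = 2) -> float:
        """Award a fraction of max_points based on overlap with the key."""
        hits = len(selected & correct)    # correct options chosen
        misses = len(selected - correct)  # distractors chosen
        raw = max(hits - misses, 0) / len(correct)
        return round(raw * max_points, 2)

    # Example: the key has three correct options; the test taker picks
    # two of them plus one distractor -> (2 - 1) / 3 of the 2 points.
    print(partial_credit({"A", "C", "E"}, {"A", "C", "D"}))  # 0.67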

Statistical Distribution of the Scores for Normed Group

Educational Testing Service (ETS) conducted a standard-setting study in 2004 at five North American universities, with both graduate and undergraduate students. The calculated results are below:

                 Mean    Std. Dev.
    Reading      20.4    0.89
    Listening    15.8    1.6
    Speaking     22.4    1.5
    Writing      23.4    2.3
    Total        82.0    4.3

(“Results of standard,” 2005)

Standard Error of Measurement

The Standard Error of Measurement (SEM) reported by TOEFL is given below (its interpretation is sketched after the table):

                 Scale    SEM
    Reading      0-30     3.35
    Listening    0-30     3.20
    Speaking     0-30     1.62
    Writing      0-30     2.76
    Total        0-120    5.64

(“Reliability and comparability,” 2011)
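The SEM is conventionally treated as the standard deviation of the measurement error around an observed score, so an observed score plus or minus about two SEMs brackets the plausible range of a test taker’s ‘true’ score. The sketch below is not drawn from ETS documentation; it simply applies this standard interpretation to the reported total-score SEM.

    # A minimal sketch (not from ETS documentation) of how a reported
    # SEM is typically interpreted: as the standard deviation of the
    # error around an observed score, yielding a confidence band.
    def score_band(observed: float, sem: float, z: float = 1.96) -> tuple:
        """Return the approximate 95% band around an observed score."""
        return (observed - z * sem, observed + z * sem)

    # With the reported total-score SEM of 5.64, an observed TOEFL iBT
    # total of 82 is consistent with a 'true' score between about 71 and 93.
    low, high = score_band(82, 5.64)
    print(f"{low:.1f} to {high:.1f}")  # 70.9 to 93.1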

Evidence for Reliability

The reliability scores are provided below.

                 Reliability
    Reading      0.85
    Listening    0.85
    Speaking     0.88
    Writing      0.74
    Total        0.94

Reading, Listening, and Speaking are all considered fairly reliable, since their coefficients are above 0.75. The Writing score, at 0.74, is somewhat less reliable. According to ETS, “This is a typical result” with only two writing tasks (p. 5). They suggest that higher reliability could be achieved with a larger number of shorter, less time-consuming writing tasks, but maintain that extended responses are necessary to test for an academic setting.

(“Reliability and comparability,” 2011)

Evidence for Validity
Validation of the TOEFL has been ongoing since the test’s creation in the 1970s. ETS claims the TOEFL is valid because the content is relevant to the types of tasks and tests that test takers will encounter in a university setting, performance on the test is related to one’s academic language proficiency, and the test results are used and interpreted appropriately (“Validity evidence,” 2011). This is confirmed by Chapelle, Enright, and Jamieson (2008) in their book Building a Validity Argument for the Test of English as a Foreign Language, which demonstrates that the tasks reflect the targeted language abilities and that observed scores on the TOEFL iBT reflect a student’s academic language proficiency.

International English Language Testing System (IELTS)

Publisher: British Council, IELTS Australia Pty Ltd, and Cambridge English Language Assessment, Bridgewater House, 58 Whitworth Street, Manchester M1 6BB; +44 (0)161 957 7755; http://takeielts.britishcouncil.org
Publication Date: 1989
Target Population: NNSE who want to be accepted into a university where English is needed
Cost: $225 USD, varies by country

Overview

The test began in 1980 as the English Language Testing Service (ELTS) and was relaunched as the IELTS in 1989. The IELTS exam takes two forms, Academic and General Training; for the purposes of this paper, I will focus only on the Academic test, which assesses learners on academic content and academic domains of language use. The IELTS can be computer-based or paper-based. A more thorough description of the IELTS is given below (see Table 2).

Table 2: Extended Description of IELTS Academic

Test Purpose
The IELTS Academic test is for test takers who want to study at a university where English is primarily used or who want to join a professional organization. According to the British Council, the IELTS Academic test “measures English language proficiency needed for an academic, higher learning environment” (“Understand the IELTS”).


Test Structure
The IELTS Academic test is broken into four equally weighted sections: Listening, Reading, Writing, and Speaking. The total time to take the test is approximately three hours. The Listening section lasts 40 minutes, 10 minutes of which are given for transferring answers to the answer sheet. Students listen to four recorded monologues and conversations, and the section consists of 40 items related to the four recordings. The Reading section also contains 40 question items (“IELTS assessment”). It consists of three long reading passages, each followed by specific tasks. The passages are claimed to be authentic, coming from books and newspapers, and also include non-text types, such as graphs and illustrations; the texts range from factual to analytical. Test takers are given 60 minutes for this section. For the Writing section, test takers are also given 60 minutes, to complete two writing tasks. In the first task, test takers are asked to summarize and explain a non-text input, e.g. a table, graph, or diagram, in at least 150 words. For the second task, students are required to write a short essay of at least 250 words about an issue, providing relevant evidence. Both tasks are expected to be written in a formal style. Finally, the Speaking section is much shorter, lasting only 11 to 14 minutes, and is recorded. The Speaking task is reciprocal in nature, as students complete a face-to-face interview (Bachman & Palmer, 2010). In this interview, students answer short questions about a familiar topic in a “structured discussion.”

(“Understand the IELTS”)

Scoring of Test
The IELTS Academic is scored on a 9-band scale, 0-9, with 9 being the highest score. All sections (Listening, Speaking, Reading, and Writing) are reported on this 9-band scale, and since each section is equally weighted, the four section scores are averaged for the final score (this averaging is sketched below). Each band, and its numerical identifier, is given an identified skill level: expert, very good, good, competent, modest, limited, extremely limited, intermittent, non-user, and did not attempt (“Understand how to calculate”). Students can be rated at whole bands (e.g., 9) or half bands (e.g., 8.5). The Listening section contains multiple-choice and limited-response questions; every question has one correct answer, worth one point. The Reading section is scored the same way, with one correct answer for each multiple-choice and limited-response question. The tallied scores are then converted into the 9-band system. Unlike Listening and Reading, the Writing section is scored on four criteria: Task Achievement, Coherence and Cohesion, Lexical Resource, and Grammatical Range and Accuracy. Each criterion is assessed on the 9-band scale, adhering to an analytic rubric. According to the page “Understand the Writing test,” the second task is worth twice as much as the first, most likely because the response is longer. Similarly, the Speaking section is assessed on four criteria, Fluency and Coherence, Lexical Resource, Grammatical Range and Accuracy, and Pronunciation (“Understand the Speaking”), and is scaled on the 9-band scale. The Listening and Speaking sections are the same for both IELTS Academic and IELTS General.

(“IELTS assessment”)
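A minimal sketch of that averaging, assuming IELTS’s published rounding convention (an average ending in .25 rounds up to the next half band; one ending in .75 rounds up to the next whole band):

    import math

    # Average the four equally weighted section bands and round to the
    # nearest half band, rounding ties (.25 / .75) upward, per the
    # rounding convention IELTS publishes.
    def overall_band(listening: float, reading: float,
                     writing: float, speaking: float) -> float:
        avg = (listening + reading + writing + speaking) / 4
        return math.floor(avg * 2 + 0.5) / 2

    # Example: 6.5, 6.5, 5.0, and 7.0 average to 6.25, reported as 6.5.
    print(overall_band(6.5, 6.5, 5.0, 7.0))  # 6.5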

Statistical Distribution of the Scores for Normed Group

IELTS administers multiple versions of each section yearly, and provided reliability estimates and other statistical data for some of the 2014 versions. The calculated results are below:

                 Mean    Std. Dev.
    Reading      6.0     1.2
    Listening    6.1     1.3
    Speaking     -       -
    Writing      -       -

For both Speaking and Writing, the mean and standard deviation were not reported.

(“Test performance,” 2014)

Standard Error of Measurement

The Standard Error of Measurement (SEM) reported by IELTS is given below:

                             Scale    SEM
    Reading                  0-9      0.38
    Listening                0-9      0.39
    Speaking and Writing     0-9      -

According to IELTS, since Speaking and Writing are not item-based, their SEMs cannot be reported in the same manner as Reading and Listening, which is why they are combined above. IELTS did not provide an SEM for these sections, instead reporting an SEM of 0.23 for the overall Academic and General IELTS tests.

(“Test Performance,” 2014)

Evidence for Reliability

The IELTS test uses Cronbach’s alpha to report the reliability of its item-based sections (the computation is sketched below). As mentioned above, multiple versions are provided for each section. Under “Test Performance,” IELTS provided more than 40 versions, and their corresponding alphas, for its 2014 Listening tests; these average 0.91, which demonstrates a high level of internal consistency. The same is given for the Academic Reading versions, which average 0.90. As described above, Speaking and Writing were reported differently, as was their reliability. For these sections, experimental generalizability studies were conducted by Shaw (2004) and Taylor and Jones (2001). Their G-studies, which were based on examiner certification data, estimated coefficients of 0.83-0.86 for Speaking and 0.81-0.89 for Writing. In order to provide a “cautious estimate” of overall reliability, a composite reliability score was used, which yielded 0.96 for both the Academic and General IELTS tests; at face value this appears highly reliable, but it is difficult to distinguish whether the Academic test alone could be deemed reliable.

(“Test Performance,” 2014)
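For reference, Cronbach’s alpha is computed from the item variances and the variance of the total score. The sketch below applies the standard formula to simulated 0/1 item responses rather than real IELTS data, which is not public.

    import numpy as np

    # Cronbach's alpha: k/(k-1) * (1 - sum of item variances / variance
    # of the total score), for a (test takers x items) score matrix.
    def cronbach_alpha(responses: np.ndarray) -> float:
        k = responses.shape[1]
        item_vars = responses.var(axis=0, ddof=1).sum()
        total_var = responses.sum(axis=1).var(ddof=1)
        return (k / (k - 1)) * (1 - item_vars / total_var)

    # Simulate 500 test takers answering 40 correlated 0/1 items.
    rng = np.random.default_rng(0)
    ability = rng.normal(size=(500, 1))
    items = (ability + rng.normal(size=(500, 40)) > 0).astype(int)
    print(round(cronbach_alpha(items), 2))  # high internal consistency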

Evidence for Validity
One way the validity of the IELTS has been examined is through research comparing students’ GPAs with their scores on the IELTS sections. Overall, the results were mixed. For example, there was a strong correlation between Listening and Reading scores and students’ GPAs, while there was no correlation with Speaking and Writing scores (Humphreys, Haugh, Fenton-Smith, Lobo, Michael & Walkinshaw, 2012). This means that only students’ Listening and Reading scores might predict their grades. An additional study, which looked at students’ IELTS scores and their acceptance and progress at postgraduate universities, noted that “opinions about the extent to which English proficiency test scores could be relied upon to indicate that an individual student was capable of successful study ... were contentious” (Lloyd-Jones, Neame & Medaney, 2011, p. 158). This once again suggests that students’ IELTS scores and their success in academic settings are not significantly related.


ACT Compass ESL Placement Test

Publisher: ACT Educational Services Division, 2201 North Dodge St., Iowa City, IA 52243-0168; 319-337-1000; http://www.act.org/esl/index.html
Publication Date: 1999
Target Population: High school students who are NNSE and who want to attend a university where English is needed
Cost: $450 per campus for the COMPASS/ESL annual license; cost per test taker varies

Overview

ACT began assessing college-bound students in 1959, while the ACT Compass test was not launched until 1983. The ACT Compass test covers a few different subjects, but for the purposes of this paper the ACT ESL placement test will be the focus. These subject-based tests are used by postsecondary institutions to place students into the appropriate course level; the Compass test is sold to institutions that want to use it to place students into “ESOL or mainstream courses” (Stoynoff & Chapelle, 2005).

Table 3: Extended Description of ACT Compass ESL Test

Test Purpose
The ACT ESL Compass test (ACT ESL) contains three untimed English proficiency tests, Grammar/Usage, Reading, and Listening, with an optional e-Write essay component. The purpose of the ACT ESL test is to place students into the appropriate IEP classes or university courses according to their skill level. The test assesses a student’s ability to understand Standard American English. It is a computer-adaptive college placement test: questions are tailored to test takers depending on their responses to previous questions.

(“About ACT Compass”)

Test Structure
The ACT ESL has four sections: Listening, Reading, Grammar/Usage, and Essay. Three of the four sections are untimed, but students typically take about 20 minutes on each. The Listening, Reading, and Grammar/Usage sections are all multiple-choice and about 10-15 items long (Stoynoff & Chapelle, 2005). The Listening section is structured from easiest to hardest, meaning that each task increases “with the rate of speech, vocabulary, dictation and use of idiomatic and metaphorical language.” The purpose of this is to allow lower-proficiency students to understand dialogue they might encounter in face-to-face situations, where a native speaker is likely to modify their speech; as the tasks get more difficult, test takers are assessed on dialogues that mimic real-life conversations. The Reading section is also structured from easy to hard tasks. The easier tasks cover common-knowledge topics without the use of idioms or metaphors, while the more difficult tasks involve academic or possibly unfamiliar contexts. Students read passages on these varying topics and are then assessed with multiple-choice items. Background knowledge should not be necessary to understand the passages, which are authentic materials appropriate for ESL learners. Test takers are assessed on their ability to refer and reason. The Grammar/Usage section of the test assesses a student’s ability to recognize and manipulate English in two main categories: sentence elements, and sentence structure and syntax. All items are in a multiple-choice format and assess English grammar and usage in context. Examples of sentence elements include verb tenses and aspects, correct use of subjects and objects, and writing conventions; examples of sentence structure and syntax include verb agreement, word order, and the use of various types of clauses. The Essay section is the only timed component: test takers have 30 minutes to complete it. The writing prompt asks test takers to write about a specific issue, on which they must take a position and support it with examples in an organized manner with correct grammar and mechanics. The issue is a common aspect of everyday life, so test takers do not need background or specialized knowledge to respond successfully to the prompt.

(“About ACT Compass”)

Scoring of Test
Each of the four areas of this test is scored into one of five levels: Pre-Level 1, Level 1, Level 2, Level 3, and Level 4. Each student’s report includes a proficiency descriptor, and the Listening, Reading, and Grammar/Usage tests are also given a numeric score on a scale from 1-99. The Essay section is scored on a 6-point analytic scoring rubric. The purpose of using an analytic rubric is to “provide more specific and, potentially, diagnostic information regarding specific writing strengths and weaknesses” (“Answers to frequently”). The analytic scores focus on development, focus, organization, language use, and mechanics, weighted 35%, 10%, 15%, 35%, and 5%, respectively. The analytic scores are then weighted and summed to provide an overall score that ranges from 2-12 (one way this weighting could work is sketched below). Since this is an adaptive test, each student may answer a different number of questions; therefore, the Grammar/Usage, Reading, and Listening scores are “reported as estimates of the percentage of items in each administered content domain or test pool” (p. 99). The scoring of these sections takes into account the probability of guessing and the difficulty of the questions.

(“Internet Version”)
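ACT does not spell out how the weighted criterion scores map onto the 2-12 scale. One consistent reading, assumed in the sketch below, is that each criterion is rated 1-6 and the weighted average is doubled; the criterion names and weights come from the description above, while the 1-6 rating and the doubling are assumptions.

    # Weighted analytic essay scoring. The weights are as published by
    # ACT; rating each criterion 1-6 and doubling the weighted average
    # to reach the reported 2-12 scale is an assumption for illustration.
    WEIGHTS = {"development": 0.35, "focus": 0.10, "organization": 0.15,
               "language_use": 0.35, "mechanics": 0.05}

    def essay_score(ratings: dict) -> int:
        weighted = sum(WEIGHTS[c] * ratings[c] for c in WEIGHTS)
        return round(2 * weighted)

    # Example: strong development and language use, weaker mechanics.
    print(essay_score({"development": 5, "focus": 5, "organization": 5,
                       "language_use": 5, "mechanics": 4}))  # 10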

Statistical Distribution of the Scores for Normed Group

ACT conducted a validity study, from September 1999 to June 2001, that included 50 schools from across the US. The calculated results are below:

                     Mean    Std. Dev.
    Reading          64.1    16.7
    Listening        67.8    18.3
    Grammar/Usage    64.1    16.7

(“Internet Version”)

Standard Error of Measurement

The conditional SEM given by ACT was provided for the three lengths of the test (standard, extended, and maximum), in 5-point intervals, for Reading, Listening, and Grammar/Usage. The results below are the averaged SEMs for the maximum length of the test:

                     Scale    SEM
    Reading          25-90    6.2
    Listening        25-90    6.9
    Grammar/Usage    25-90    5.4

(“Internet Version”)

Evidence for Reliability

The reliability scores below are averaged across the standard, extended, and maximum lengths of the adaptive tests. Since Reading and Grammar/Usage are passage-based, they are presented in only two lengths, standard and maximum, and have been averaged as such.

                     Reliability
    Reading          0.89
    Listening        0.87
    Grammar/Usage    0.88

(“Internet Version”)

Evidence for Validity
ACT released research which showed that, based on the ACT ESL test, students were being placed into the correct courses. After administering a Likert-scale survey to instructors of incoming students, “the results indicated the students were being placed in the right courses” (“Answers to frequently”). Despite this confidence, no methodology or numerical results were given, which calls the validity of the “research” itself into question. Outside research was conducted by Scott-Clayton (2012) to see whether certain high-stakes placement exams predict college success. Although her research does not look specifically at the ACT Compass ESL exam, it examines the success of the Compass tests (of which there are multiple types) overall. The findings concluded that there is a weak relationship between students’ scores and their ability to do well in college courses; for English classes especially, the test was not predictive of students doing well.

Discussion

As previously mentioned, the three tests above are all placement exams that assess a student’s proficiency and are used for admissions purposes at postsecondary institutions. The expectation is that the results will predict a student’s success in their respective program or institution. After researching all three, there are apparent strengths and weaknesses to each.

A strength of the TOEFL iBT is its integrated tasks. This is important because in an academic context, the skills that are tested (Reading, Writing, Listening, and Speaking) are not isolated and separate; the test therefore provides an accurate representation of its target language use domain, an academic setting. Another strength of the TOEFL iBT is that more than one study demonstrates its reliability and validity. This research shows that students are being tested on relevant topics and that their scores could demonstrate their ability to excel at postsecondary institutions. The biggest drawback of the TOEFL iBT is that it is computer-based: test takers who are not used to QWERTY keyboards may find typing time-consuming, taking away from the time available to complete the necessary tasks.

IELTS, on the other hand, can be taken either on a computer or on paper, which allows test takers to choose what they are most comfortable with. Another advantage of the IELTS is that it provides authentic materials to its test takers, so students are assessed with materials they could encounter in the real world. Despite this, IELTS does not provide researchers with all the data related to its standard error of measurement, which makes it difficult to provide an accurate overview of the test itself.

Finally, the ACT ESL test does not give universities a full overview of a student’s capabilities: it does not require assessment of students’ speaking abilities, and the writing section is optional. Without these two critical, production-based skills, how would an admissions committee truly know a student’s ability to complete tasks in the classroom? Another shortcoming of the ACT ESL test is that it is entirely multiple choice, so once again it does not require much of the student in terms of production. However, the ACT ESL test is the cheapest and least time-consuming of the three.

My advice for students who plan on taking these tests is to look at the universities they wish to attend and find out which tests are accepted. If those universities have no preference, then I recommend the TOEFL iBT, because it is the most integrative of the skills and is highly reliable and valid.


References

ACT Educational Services. (n.d.). Internet version reference manual. Retrieved from https://www.act.org/content/dam/act/unsecured/documents/CompassReferenceManual.pdf

ACT Educational Services. (n.d.). Answers to frequently asked questions about COMPASS e-Write & ESL e-Write. Retrieved from http://www.act.org/content/dam/act/unsecured/documents/Compass-ewritefaq.pdf

Bachman, L., and Palmer, A. (2010). Language assessment in practice. New York, NY: Oxford University Press.

Chapelle, C. A., Enright, M. K., and Jamieson, J. M. (2008). Building a validity argument for the Test of English as a Foreign Language. New York, NY: Routledge.

Educational Testing Service. (n.d.). About the TOEFL iBT test. Retrieved from https://www.ets.org/toefl/ibt/about

Humphreys, P., Haugh, M., Fenton-Smith, B., Lobo, A., Michael, R., and Walkinshaw, I. (2012). Tracking international students’ English proficiency over the first semester of undergraduate study. IELTS Research Reports Online Series (1). Retrieved from http://gallery.mailchimp.com/d0fe9bcdc8ba233b66e1e0b95/files/Humphreys2012_ORR.pdf

IELTS Partners. (n.d.). Understand how to calculate your IELTS scores. Retrieved from http://takeielts.britishcouncil.org/find-out-about-results/understand-your-ielts-scores

IELTS Partners. (n.d.). Understand the IELTS test format. Retrieved from http://takeielts.britishcouncil.org/prepare-test/understand-test-format

IELTS Partners. (n.d.). Understand the writing test. Retrieved from http://takeielts.britishcouncil.org/prepare-test/understand-test-format/writing-test

IELTS Partners. (2013). IELTS assessment criteria. Retrieved from http://takeielts.britishcouncil.org/find-out-about-results/ielts-assessment-criteria

IELTS Partners. (2014). Test performance 2014. Retrieved from https://www.ielts.org/teaching-and-research/test-performance-2014

Lloyd-Jones, G., Neame, C., and Medaney, S. (2011). A multiple case study of the relationship between the indicators of students’ English language competence on entry and students’ academic progress at an international postgraduate university. IELTS Research Reports (11). Retrieved from http://radar.gsa.ac.uk/2451/1/Vol11_Report_3_A_multiple_case_study.pdf

Miller, M. D., Linn, R. L., and Gronlund, N. E. (2009). Measurement and assessment in teaching. Upper Saddle River, NJ: Pearson Education.

Reliability and comparability of TOEFL iBT scores. (2011). TOEFL iBT Research Insight, 1(3), 1-8.

Results of standard setting at five North American universities. (2005). Princeton, NJ: Educational Testing Service. Retrieved from https://drive.google.com/file/d/0B8inOwdwVYAZU3FncjJJWHZlMjQ/view

Scott-Clayton, J. (2012). Do high-stakes placement exams predict college success? Community College Research Center (41). New York, NY: Columbia University. Retrieved from http://67.205.94.182/media/k2/attachments/high-stakes-predict-success.pdf

Shaw, S. D. (2004). IELTS writing: Revising assessment criteria and scales (Phase 3). Research Notes, 16, 3-7.

Stoynoff, S., and Chapelle, C. A. (2005). ESOL tests and testing. Baltimore, MD: Teachers of English to Speakers of Other Languages, Inc.

Taylor, L., and Jones, N. (2001). Revising the IELTS Speaking Test. Research Notes, 4, 9-12.

TOEFL iBT test framework and test development. (2010). TOEFL iBT Research Insight, 1(1), 1-9.

Validity evidence supporting the interpretation and use of TOEFL iBT scores. (2011). TOEFL iBT Research Insight, 1(4), 1-11.