Use of Automated Writing Evaluation (AWE) for placement tests: Can
scores of AWE be criteria to place students into language courses?
Zhi Li, Hyejin Yang, Stephanie Link, Volker Hegelheimer
IOWA STATE UNIVERSITY
October 5-6, 2012 University of Illinois, Urbana-Champaign
English Placement Test
Ø To place international students into appropriate ESL writing classes
Ø Practical needs
¡ Immediate scoring
¡ Improved time management
¡ Low-cost options
Example: Time Management
Fall 2012 @ ISU: 500+ essays to score
1 human rater: 5 min/essay
How long will it take for raters to score all the essays?
18 raters: 28 essays each = 2.3 hours per pass
Each essay rated 2 to 3 times
TOTAL: 4.6 hours
What can the computer do?
Human rating: 5 minutes per essay
Computer rating: 1 minute per essay
Essays per rater: 23
2.8 hours to rate 500 essays vs. 4.6 hours
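The slide's back-of-the-envelope arithmetic can be reproduced in a few lines; the exact rounding the presenters used is an assumption, so the sketch below only approximates their 2.3- and 4.6-hour figures:

```python
import math

ESSAYS = 500          # essays to score in Fall 2012
RATERS = 18           # available human raters
MIN_PER_ESSAY = 5     # human rating time per essay, in minutes

essays_per_rater = math.ceil(ESSAYS / RATERS)            # 28 essays each
one_pass_hours = essays_per_rater * MIN_PER_ESSAY / 60   # ~2.3 hours per pass
double_rated_hours = 2 * one_pass_hours                  # ~4.6-4.7 hours with two ratings
```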
Motivation and Purpose
To investigate whether the scores of Criterion can be utilized to help make placement decisions in an ESL program.
¡ Immediate scoring
¡ Improved time management
¡ Low-cost options
AWE Validation Studies
¡ e-rater®: .73-.93 (correlation); 87-97% (exact agreement)
  (Attali & Burstein, 2005; Burstein, Chodorow, & Leacock, 2004)
¡ IntelliMetric: .50-.90 (correlation); 56-88% (exact agreement)
  (Elliot, 2003; Vantage Learning, 1998, 1999, 2000, 2001)
¡ Intelligent Essay Assessor (IEA): .81-.83 (correlation)
  (Landauer, Laham, & Foltz, 2003)
Ø High level of correspondence between AWE scores and human ratings
Computer Scoring for Placement
Ø Concerns
¡ Impersonal; distorts the nature of writing (Herrington & Moran, 2006)
¡ Discriminates according to length, grammar, and mechanics (Jones, 2006)
¡ Weak correlations may be due to a lack of formal training/calibration of human evaluators (James, 2006)
Ø Systems studied: ACCUPLACER OnLine; WritePlacer Plus by IntelliMetric
Computer Scoring for Placement
Ø Validity
¡ "not that much worse . . . than placement by readers" (Herrington & Moran, 2006, p. 126)
¡ Useful with "spot-checking" and retesting (Jones, 2006)
¡ "a valid tool for assessing writing samples and placing students in composition courses" (James, 2006)
Ø Systems studied: ACCUPLACER OnLine; WritePlacer Plus by IntelliMetric
Research questions
Ø RQ1. What is the relationship between Criterion output and EPT decisions?
¡ Holistic scores
¡ Trait feedback
Ø RQ2. To what extent can holistic scores of Criterion distinguish between different levels of ESL writing classes?
Participants
Ø 135 international undergraduate students
Ø Fall semester 2012 at ISU
Discipline        Number of participants
Engineering       48
LAS               37
Business          33
Design            10
Human Sciences    5
Agriculture       2
Setting
Ø Paper-based English Placement Test (30 min.)
Ø Topic: "Modern conveniences," from Criterion
¡ Topic category: College level, first year
¡ Topic mode: Persuasive

"Modern conveniences such as fast food, automatic teller machines, and labor-saving appliances promise to make life easier. Do these products and services actually make our lives more convenient or do they simply create new problems? Explain your position with reasons and examples from your own experience, observations, or reading."
Rating Procedure
Ø Number of raters
¡ Experienced: 9
¡ New: 9
Ø Rubric based on ACTFL Proficiency Guidelines
¡ General Description, Organization, Grammar & Vocabulary, Functional, Mechanics, and Comprehensibility
Ø Placement based on two raters' agreement
¡ Third rating for controversial papers
¡ Inter-rater reliability: 62% exact agreement
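The 62% exact-agreement figure is simply the share of essays on which two raters assigned the same level. A minimal sketch with made-up ratings (not the study's data):

```python
# Hypothetical paired ratings, coded 101B=1, 101C=2, Pass=3 (illustrative only)
rater_a = [1, 2, 3, 2, 1, 3, 2, 2]
rater_b = [1, 2, 2, 2, 1, 3, 3, 2]

# Exact agreement: proportion of essays with identical levels from both raters
exact_agreement = sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)

# Essays where the raters disagree would go to a third rater
disputed = [i for i, (a, b) in enumerate(zip(rater_a, rater_b)) if a != b]
```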
EPT Scoring Criteria
Advanced Mid (Pass)
ü Able to meet a range of work and/or academic writing needs
ü Able to narrate and describe with detail in all major time frames
ü Cohesive devices in texts up to several paragraphs
ü Good control of the most frequently used target-language syntactic structures and a range of general vocabulary

Advanced Low (101C/D)
ü Able to meet basic work and/or academic writing needs
ü Able to narrate and describe in major time frames
ü A limited number of cohesive devices
ü Some redundancy and awkward repetition
ü Some additional effort may be required in the reading of the text

Intermediate High (101B)
ü Able to write compositions and simple summaries related to work and/or school experiences
ü Inconsistent in the use of appropriate major time markers, resulting in a loss of clarity
ü Vocabulary, grammar, and style essentially correspond to those of the spoken language
Curriculum
Ø ESL Writing Curriculum
Placement decisions
¡ Engl 101B
¡ Engl 101C
¡ Pass / Engl 150
Materials
¡ Writing samples from EPT
¡ Stratified Random Sampling
¡ Verbatim Transcription
EPT level   Two-rater samples   Three-rater samples   Word count (M)
101B        30                  15                    259
101C        30                  15                    260
Pass        30                  15                    302
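The stratified random sampling above draws a fixed number of essays from each EPT level. A sketch under assumed data structures (the `essays` pool and field layout are hypothetical, not the study's records):

```python
import random

random.seed(42)  # reproducible draw

# Hypothetical pool: (essay_id, ept_level) pairs, 45 essays per level
essays = [(i, level) for level in ("101B", "101C", "Pass") for i in range(45)]

def stratified_sample(pool, per_level):
    """Draw the same number of essays from each EPT level (stratum)."""
    strata = {}
    for essay_id, level in pool:
        strata.setdefault(level, []).append(essay_id)
    return {level: random.sample(ids, per_level) for level, ids in strata.items()}

sample = stratified_sample(essays, 30)  # 30 two-rater essays per level
```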
Data Collection
Ø Entering essays into Criterion
Ø Data extraction
§ Holistic scores
§ Trait feedback (error counts are normalized)
  Grammar (subject-verb agreement, fragments, etc.)
  Usage (wrong articles, preposition errors, etc.)
  Mechanics (spelling, missing commas, etc.)
  Style (repetition of words, short sentences, etc.)
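The slides note that error counts are normalized but not how; one common choice, shown here purely as an assumption, is errors per 100 words:

```python
def errors_per_100_words(error_count, word_count):
    """Normalize a raw trait-feedback error count by essay length.

    This per-100-words scheme is an assumption for illustration,
    not Criterion's documented normalization.
    """
    return 100 * error_count / word_count

# e.g. 6 usage errors in a 300-word essay -> 2.0 errors per 100 words
rate = errors_per_100_words(6, 300)
```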
Criterion Scoring rubric
Score 4
• Slights some parts of the task
• Treats the topic simplistically or repetitively
• Is organized adequately, but you need to support your position more fully with discussion, reasons, or examples
• Shows that you can say what you mean, but you could use language more precisely or vigorously
• Demonstrates control in terms of grammar, usage, or sentence structure, but you may have some errors

Score 3
• Neglects or misinterprets important parts of the topic or task
• Lacks focus or is simplistic or confused in interpretation
• Is not organized or developed carefully from point to point
• Provides examples without explanation, or generalizations without completely supporting them
• Uses mostly simple sentences or language that does not serve your meaning
• Demonstrates errors in grammar, usage, or sentence structure
Data Analysis
Ø RQ1: Criterion output vs. human ratings → descriptive statistics, correlation, regression
Ø RQ2: Criterion output differences between EPT levels → ANOVA
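The three analyses can be sketched with SciPy/NumPy; the arrays below are illustrative stand-ins, not the study's data:

```python
import numpy as np
from scipy import stats

# Illustrative data: EPT levels coded 101B=1, 101C=2, Pass=3; Criterion scores 1-5
ept = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3, 3])
criterion = np.array([2, 3, 2, 3, 3, 4, 4, 5, 4, 5])
word_count = np.array([210, 250, 230, 260, 255, 270, 300, 340, 310, 320])

# RQ1: Spearman rank correlation between Criterion scores and EPT levels
rho, p_rho = stats.spearmanr(criterion, ept)

# RQ1: simple linear regression of Criterion scores on word count
# (intercept column + predictor, solved by least squares)
X = np.column_stack([np.ones_like(word_count, dtype=float), word_count])
beta, *_ = np.linalg.lstsq(X, criterion, rcond=None)

# RQ2: one-way ANOVA of Criterion scores across the three EPT levels
groups = [criterion[ept == level] for level in (1, 2, 3)]
f_stat, p_anova = stats.f_oneway(*groups)
```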
RQ1: Criterion output vs. Human ratings
[Figure: Distribution of Criterion scores (1-5) across EPT levels B (N=43), C (N=45), and Pass (N=44)]
RQ1: Criterion output vs. Human ratings
Ø Correlation (Spearman rho)

                  EPT levels       EPT levels    EPT levels      Criterion scores
                  (complete set,   (two-rater,   (three-rater,   (N=132)
                  N=132)           N=89)         N=43)
Criterion scores   0.39**           0.47**        0.22            1
Word count         0.25**           0.31**        0.11            0.69**
Total errors      -0.40**          -0.48**       -0.21           -0.43**
Grammar           -0.25**          -0.33**       -0.07           -0.36**
Usage             -0.21*           -0.20         -0.22           -0.47**
Mechanics         -0.28**          -0.30**       -0.22           -0.34**
Style             -0.30**          -0.35**       -0.20           -0.26**

** significant at 0.05
RQ1: Criterion output vs. Human ratings
Ø Regression analysis (dependent variable: Criterion scores; R² = 0.727)

Model          Beta     t        p-value
Constant                 7.963   0.000
Word count      0.604   11.933   0.000
Total errors    0.339    2.101   0.038
Grammar        -0.287   -4.914   0.000
Usage          -0.341   -6.330   0.000
Mechanics      -0.229   -3.117   0.002
Style          -0.383   -2.678   0.008
RQ1: Criterion output vs. Human ratings
Ø Regression analysis (dependent variable: EPT levels; R² = 0.208)

Model          Beta (standardized)   t        p-value
Constant                              6.039   0.000
Word count      0.117                 1.372   0.173
Total errors    0.384                 1.379   0.170
Grammar        -0.219                -2.099   0.038
Usage          -0.186                -1.988   0.049
Mechanics      -0.305                -2.489   0.014
Style          -0.549                -2.252   0.026
RQ2: Differences b/w EPT levels
Ø Post-hoc multiple comparisons in one-way ANOVA (N=135)

Model              B-C       C-Pass      B-Pass
Criterion scores   -0.139    -0.580*     -0.719*
Word count         -1.489    -41.978*    -43.467*
Total errors        3.02*      1.9         4.92*
Grammar             0.379      0.38        0.761*
Usage              -0.088      0.474*      0.562*
Mechanics          -0.013      1.193*      1.180*
Style               2.740*     0.348       3.088*

* The mean difference is significant at the 0.05 level
Discussion
Ø RQ1 → relatively low correlation, possibly due to:
¡ Different grading rubrics
¡ Essay lengths
¡ Essay prompt level on Criterion (college, first year)
Ø RQ2 → distinguished Pass from 101B/C
§ Can, because of the wide coverage of error categories
§ Cannot, because of style (repetition and spelling)
Implications & Future Studies
Ø Potential use for
¡ distinguishing Pass from non-Pass
¡ confirming placement through a diagnostic test
Ø Future studies on
¡ the effects of different essay topic categories and modes
¡ the predictive evidence of Criterion output
¡ paper-based writing vs. computer-based writing
References
Ø Attali, Y., & Burstein, J. (2006). Automated essay scoring with e-rater® V.2. The Journal of Technology, Learning and Assessment, 4(3). Retrieved from http://www.jtla.org
Ø Fulcher, G. (1997). English Language Placement Test: Issues in reliability and validity. Language Testing, 14(2), 113-139.
Ø James, C. L. (2006). Validating a computerized scoring system for assessing writing and placing students in composition courses. Assessing Writing, 11, 167-178.
Ø Ware, P. D., & Warschauer, M. (2006). Electronic feedback and second language writing. In K. Hyland & F. Hyland (Eds.), Feedback in second language writing: Contexts and issues (pp. 105-122). Cambridge: Cambridge University Press.
Thank you! Questions and comments?
Acknowledgements: Yoo Ree Chung
Zhi Li [email protected]
Hyejin Yang [email protected]
Stephanie Link [email protected]
Volker Hegelheimer [email protected]
Website: volkerh.public.iastate.edu/awe