Use of Automated Writing Evaluation (AWE) for placement tests: Can
scores of AWE be criteria to place students into language courses?
Zhi Li, Hyejin Yang, Stephanie Link, Volker Hegelheimer
IOWA STATE UNIVERSITY
October 5-6, 2012 University of Illinois, Urbana-Champaign
English Placement Test
Ø To place international students into appropriate ESL writing classes
Ø Practical needs
¡ Immediate scoring
¡ Improved time management
¡ Low-cost options
Example: Time Management
Fall 2012 @ ISU: 500+ essays to score
1 human rater: 5 min/essay
How long will it take for raters to score all the essays?
18 raters: 28 essays each = 2.3 hours per pass
Each essay rated 2 to 3 times
TOTAL: 4.6 hours
What can the computer do?
Human rating: 5 minutes per essay
Computer rating: 1 minute per essay
Essays per rater: 23
2.8 hours to rate 500 essays vs. 4.6 hours
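The slide's back-of-the-envelope arithmetic can be reproduced in a few lines; the exact rounding the presenters used is an assumption, so the sketch below only approximates their 2.3- and 4.6-hour figures:

```python
import math

ESSAYS = 500          # essays to score in Fall 2012
RATERS = 18           # available human raters
MIN_PER_ESSAY = 5     # human rating time per essay, in minutes

essays_per_rater = math.ceil(ESSAYS / RATERS)            # 28 essays each
one_pass_hours = essays_per_rater * MIN_PER_ESSAY / 60   # ~2.3 hours per pass
double_rated_hours = 2 * one_pass_hours                  # ~4.6-4.7 hours with two ratings
```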
Motivation and Purpose
To investigate whether the scores of Criterion can be utilized to help make placement decisions in an ESL program.
¡ Immediate scoring
¡ Improved time management
¡ Low-cost options
AWE Validation Studies
¡ e-rater®: .73-.93 (correlation); 87-97% (exact agreement)
  (Attali & Burstein, 2005; Burstein, Chodorow, & Leacock, 2004)
¡ IntelliMetric: .50-.90 (correlation); 56-88% (exact agreement)
  (Elliot, 2003; Vantage Learning, 1998, 1999, 2000, 2001)
¡ Intelligent Essay Assessor (IEA): .81-.83 (correlation)
  (Landauer, Laham, & Foltz, 2003)
Ø High level of correspondence between AWE scores and human ratings
Computer Scoring for Placement
Ø Concerns
¡ Impersonal; distorts the nature of writing (Herrington & Moran, 2006)
¡ Discriminates according to length, grammar, and mechanics (Jones, 2006)
¡ Weak correlations may be due to a lack of formal training/calibration of human evaluators (James, 2006)
Ø Systems studied: ACCUPLACER OnLine; WritePlacer Plus by IntelliMetric
Computer Scoring for Placement
Ø Validity
¡ "not that much worse . . . than placement by readers" (Herrington & Moran, 2006, p. 126)
¡ Useful with "spot-checking" and retesting (Jones, 2006)
¡ "a valid tool for assessing writing samples and placing students in composition courses" (James, 2006)
Ø Systems studied: ACCUPLACER OnLine; WritePlacer Plus by IntelliMetric
Research questions
Ø RQ1. What is the relationship between Criterion output and EPT decisions?
¡ Holistic scores
¡ Trait feedback
Ø RQ2. To what extent can holistic scores of Criterion distinguish between different levels of ESL writing classes?
Participants
Ø 135 international undergraduate students
Ø Fall semester 2012 at ISU
Discipline        Number of participants
Engineering       48
LAS               37
Business          33
Design            10
Human Sciences    5
Agriculture       2
Setting
Ø Paper-based English Placement Test (30 min.)
Ø Topic: "Modern conveniences," from Criterion
¡ Topic category: College level, first year
¡ Topic mode: Persuasive

"Modern conveniences such as fast food, automatic teller machines, and labor-saving appliances promise to make life easier. Do these products and services actually make our lives more convenient or do they simply create new problems? Explain your position with reasons and examples from your own experience, observations, or reading."
Rating Procedure
Ø Number of raters
¡ Experienced: 9
¡ New: 9
Ø Rubric based on ACTFL Proficiency Guidelines
¡ General Description, Organization, Grammar & Vocabulary, Functional, Mechanics, and Comprehensibility
Ø Placement based on two raters' agreement
¡ Third rating for controversial papers
¡ Inter-rater reliability: 62% exact agreement
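The 62% exact-agreement figure is simply the share of essays on which two raters assigned the same level. A minimal sketch with made-up ratings (not the study's data):

```python
# Hypothetical paired ratings, coded 101B=1, 101C=2, Pass=3 (illustrative only)
rater_a = [1, 2, 3, 2, 1, 3, 2, 2]
rater_b = [1, 2, 2, 2, 1, 3, 3, 2]

# Exact agreement: proportion of essays with identical levels from both raters
exact_agreement = sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)

# Essays where the raters disagree would go to a third rater
disputed = [i for i, (a, b) in enumerate(zip(rater_a, rater_b)) if a != b]
```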
EPT Scoring Criteria
Advanced Mid (Pass)
ü Able to meet a range of work and/or academic writing needs
ü Able to narrate and describe with detail in all major time frames
ü Cohesive devices in texts up to several paragraphs
ü Good control of the most frequently used target-language syntactic structures and a range of general vocabulary

Advanced Low (101C/D)
ü Able to meet basic work and/or academic writing needs
ü Able to narrate and describe in major time frames
ü A limited number of cohesive devices
ü Some redundancy and awkward repetition
ü Some additional effort may be required in the reading of the text

Intermediate High (101B)
ü Able to write compositions and simple summaries related to work and/or school experiences
ü Inconsistent in the use of appropriate major time markers, resulting in a loss of clarity
ü Vocabulary, grammar, and style essentially correspond to those of the spoken language
Curriculum
Ø ESL Writing Curriculum
Placement decisions
¡ Engl 101B
¡ Engl 101C
¡ Pass / Engl 150
Materials
¡ Writing samples from EPT
¡ Stratified Random Sampling
¡ Verbatim Transcription
EPT level   Two-rater samples   Three-rater samples   Word count (M)
101B        30                  15                    259
101C        30                  15                    260
Pass        30                  15                    302
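The stratified random sampling above draws a fixed number of essays from each EPT level. A sketch under assumed data structures (the `essays` pool and field layout are hypothetical, not the study's records):

```python
import random

random.seed(42)  # reproducible draw

# Hypothetical pool: (essay_id, ept_level) pairs, 45 essays per level
essays = [(i, level) for level in ("101B", "101C", "Pass") for i in range(45)]

def stratified_sample(pool, per_level):
    """Draw the same number of essays from each EPT level (stratum)."""
    strata = {}
    for essay_id, level in pool:
        strata.setdefault(level, []).append(essay_id)
    return {level: random.sample(ids, per_level) for level, ids in strata.items()}

sample = stratified_sample(essays, 30)  # 30 two-rater essays per level
```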
Data Collection
Ø Entering essays into Criterion
Ø Data extraction
§ Holistic scores
§ Trait feedback (error counts are normalized)
  Grammar (subject-verb agreement, fragments, etc.)
  Usage (wrong articles, preposition errors, etc.)
  Mechanics (spelling, missing commas, etc.)
  Style (repetition of words, short sentences, etc.)
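The slides note that error counts are normalized but not how; one common choice, shown here purely as an assumption, is errors per 100 words:

```python
def errors_per_100_words(error_count, word_count):
    """Normalize a raw trait-feedback error count by essay length.

    This per-100-words scheme is an assumption for illustration,
    not Criterion's documented normalization.
    """
    return 100 * error_count / word_count

# e.g. 6 usage errors in a 300-word essay -> 2.0 errors per 100 words
rate = errors_per_100_words(6, 300)
```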
Criterion Scoring rubric
Score 4
• Slights some parts of the task
• Treats the topic simplistically or repetitively
• Is organized adequately, but you need to support your position more fully with discussion, reasons, or examples
• Shows that you can say what you mean, but you could use language more precisely or vigorously
• Demonstrates control in terms of grammar, usage, or sentence structure, but you may have some errors

Score 3
• Neglects or misinterprets important parts of the topic or task
• Lacks focus or is simplistic or confused in interpretation
• Is not organized or developed carefully from point to point
• Provides examples without explanation, or generalizations without completely supporting them
• Uses mostly simple sentences or language that does not serve your meaning
• Demonstrates errors in grammar, usage, or sentence structure
Data Analysis
Ø RQ1: Criterion output vs. human ratings → descriptive statistics, correlation, regression
Ø RQ2: Criterion output differences between EPT levels → ANOVA
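The three analyses can be sketched with SciPy/NumPy; the arrays below are illustrative stand-ins, not the study's data:

```python
import numpy as np
from scipy import stats

# Illustrative data: EPT levels coded 101B=1, 101C=2, Pass=3; Criterion scores 1-5
ept = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3, 3])
criterion = np.array([2, 3, 2, 3, 3, 4, 4, 5, 4, 5])
word_count = np.array([210, 250, 230, 260, 255, 270, 300, 340, 310, 320])

# RQ1: Spearman rank correlation between Criterion scores and EPT levels
rho, p_rho = stats.spearmanr(criterion, ept)

# RQ1: simple linear regression of Criterion scores on word count
# (intercept column + predictor, solved by least squares)
X = np.column_stack([np.ones_like(word_count, dtype=float), word_count])
beta, *_ = np.linalg.lstsq(X, criterion, rcond=None)

# RQ2: one-way ANOVA of Criterion scores across the three EPT levels
groups = [criterion[ept == level] for level in (1, 2, 3)]
f_stat, p_anova = stats.f_oneway(*groups)
```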
RQ1: Criterion output vs. Human ratings
[Figure: Distribution of Criterion scores (1-5) across EPT levels B (N=43), C (N=45), and Pass (N=44)]
RQ1: Criterion output vs. Human ratings
Ø Correlation (Spearman rho)

                  EPT levels       EPT levels    EPT levels      Criterion scores
                  (complete set,   (two-rater,   (three-rater,   (N=132)
                  N=132)           N=89)         N=43)
Criterion scores   0.39**           0.47**        0.22            1
Word count         0.25**           0.31**        0.11            0.69**
Total errors      -0.40**          -0.48**       -0.21           -0.43**
Grammar           -0.25**          -0.33**       -0.07           -0.36**
Usage             -0.21*           -0.20         -0.22           -0.47**
Mechanics         -0.28**          -0.30**       -0.22           -0.34**
Style             -0.30**          -0.35**       -0.20           -0.26**

** significant at 0.05
RQ1: Criterion output vs. Human ratings
Ø Regression analysis (dependent variable: Criterion scores; R² = 0.727)

Model          Beta     t        p-value
Constant                 7.963   0.000
Word count      0.604   11.933   0.000
Total errors    0.339    2.101   0.038
Grammar        -0.287   -4.914   0.000
Usage          -0.341   -6.330   0.000
Mechanics      -0.229   -3.117   0.002
Style          -0.383   -2.678   0.008
RQ1: Criterion output vs. Human ratings
Ø Regression analysis (dependent variable: EPT levels; R² = 0.208)

Model          Beta (standardized)   t        p-value
Constant                              6.039   0.000
Word count      0.117                 1.372   0.173
Total errors    0.384                 1.379   0.170
Grammar        -0.219                -2.099   0.038
Usage          -0.186                -1.988   0.049
Mechanics      -0.305                -2.489   0.014
Style          -0.549                -2.252   0.026
RQ2: Differences b/w EPT levels
Ø Post-hoc multiple comparisons in one-way ANOVA (N=135)

Model              B-C       C-Pass      B-Pass
Criterion scores   -0.139    -0.580*     -0.719*
Word count         -1.489    -41.978*    -43.467*
Total errors        3.02*      1.9         4.92*
Grammar             0.379      0.38        0.761*
Usage              -0.088      0.474*      0.562*
Mechanics          -0.013      1.193*      1.180*
Style               2.740*     0.348       3.088*

* The mean difference is significant at the 0.05 level
Discussion
Ø RQ1 → relatively low correlation, possibly due to:
¡ Different grading rubrics
¡ Essay lengths
¡ Essay prompt level on Criterion (college, first year)
Ø RQ2 → distinguished Pass from 101B/C
§ Can, because of the wide coverage of error categories
§ Cannot, because of style (repetition and spelling)
Implications & Future Studies
Ø Potential use for
¡ distinguishing Pass from non-Pass
¡ confirming placement through a diagnostic test
Ø Future studies on
¡ the effects of different essay topic categories and modes
¡ the predictive evidence of Criterion output
¡ paper-based writing vs. computer-based writing
References
Ø Attali, Y., & Burstein, J. (2006). Automated essay scoring with e-rater® V.2. The Journal of Technology, Learning and Assessment, 4(3). Retrieved from http://www.jtla.org
Ø Fulcher, G. (1997). English Language Placement Test: Issues in reliability and validity. Language Testing, 14(2), 113-139.
Ø James, C. L. (2006). Validating a computerized scoring system for assessing writing and placing students in composition courses. Assessing Writing, 11, 167-178.
Ø Ware, P. D., & Warschauer, M. (2006). Electronic feedback and second language writing. In K. Hyland & F. Hyland (Eds.), Feedback in second language writing: Contexts and issues (pp. 105-122). Cambridge: Cambridge University Press.
Thank you! Questions and comments?
Acknowledgements: Yoo Ree Chung
Zhi Li [email protected]
Hyejin Yang [email protected]
Stephanie Link [email protected]
Volker Hegelheimer [email protected]
Website: volkerh.public.iastate.edu/awe