
Master’s Thesis

International Studies in Computational Linguistics (ISCL)

Automated C-Test Difficulty Prediction:

Integrating Lexical, Sentence, and Text Features

in a Multi-Lingual Perspective

Author: Sabrina Galasso

1st Supervisor: Prof. Dr. Walt Detmar Meurers

2nd Supervisor: apl. Prof. Dr. Kurt Eberle

Submitted in Partial Fulfillment of the Requirements for the Degree of

Master of Arts in Computational Linguistics

Seminar für Sprachwissenschaft, Eberhard Karls Universität Tübingen

February 9th, 2018


Declaration of Originality (Antiplagiatserklärung)

Surname: Galasso; First name: Sabrina; Matriculation number: 3730351; Address: Elly-Heuss-Knapp-Str. 27, 72074 Tübingen

I hereby declare that I have written the thesis entitled

"Automated C-Test Difficulty Prediction: Integrating Lexical, Sentence, and Text Features in a Multi-Lingual Perspective"

under the supervision of Prof. Detmar Meurers

independently and using only the aids stated in the thesis. I am aware that I must write all written work that I submit during my studies as coursework or for examination independently. Quotations as well as the use of other sources and aids must be clearly marked by me according to the rules of scholarly documentation. I may not present texts or text passages by others (including from the internet) as my own.

A violation of these basic rules of academic work is considered an attempt at deception or fraud and entails the corresponding consequences. In every case, the work is graded "insufficient" (5.0). In particularly serious cases, the examination board may exclude the candidate from taking further examinations (cf. § 12 para. 3 of the examination regulations for the Magister degree programmes of 11 and 25 September 1995, and § 13 para. 3 of the examination regulations for the Bachelor's and Master's degree programmes in the humanities of 12 October 2006 and 23 November 2007).

English version: I hereby declare that this paper is the result of my own independent scholarly work. I have acknowledged all the other authors' ideas and referenced direct quotations from their work (in the form of books, articles, essays, dissertations, and on the internet). No material other than that listed has been used.

Tübingen, February 9, 2018

Sabrina Galasso


Abstract

This thesis aims at the automated prediction of the difficulty of items,

sentences, and whole text passages in Spanish and English C-Tests. The

C-Test is an integrative placement test that measures a learner’s general

language proficiency. It is based on the cloze principle and tests the ability

to restore mutilated words in an authentic text. The process of designing

the test should involve knowledge about the difficulty of single items and

text passages.

Given actual learner data provided by a university’s language learning

center, we analyze the performance of C-Test participants on item, sentence,

and text level across proficiency levels for English and Spanish. C-Test items

are clustered into four to eight groups using different sets of performance

variables. A broad range of lexical features, readability features, syntactic

features, discourse features, and features describing the item’s context and

possible candidates are integrated into one pipeline. These features are

used within classification experiments in order to get insights into linguistic

characteristics of the test difficulty.

A combination of performance variables that is integrable into a real-world

C-Test generation application for both languages is presented: Similar

to the findings described by Svetashova (2015), the combination includes

information about the test takers’ performance on item and text level. The

C-Test items could be grouped into five interpretable classes: easy items in

difficult texts, easy items in easy texts, difficult items in easy texts, difficult

items in difficult texts, and as a fifth group either all items in very easy

texts for the English data, or all items in very difficult texts for the Spanish

data. Using the full set of features for the classification of these five classes,

leads to a macro-averaged F1 score of 0.76 (Support Vector Machine) for

the English data, and a score of 0.82 for the Spanish data.

In order to compare the difficulty characteristics of the two languages,

we further perform classification experiments using a comparable set of

features and add experiments based on information about the test takers’

performance on sentence and item level. The results show that the set of

comparable features is more predictive for Spanish than for English.

Features describing the item's candidate space are highly predictive for both

languages.


Acknowledgements

”No act of kindness, no matter how small, is ever wasted.” — Aesop

I am deeply grateful for the help and support that I received during the

writing of this thesis. Receiving kindness in its various facets helped me to

overcome challenging times.

First of all, I want to thank my supervisor Detmar Meurers for his

continuous personal and technical guidance throughout the research and

writing of this work. His feedback on my progress was always motivating

and deepened my interest in the topic.

I would like to offer my deepest thanks to Yulia Svetashova. Without her

personal advice and professional support this work would not have been

completed in this form. Our conversations shaped the fundamental

structure of this thesis. Her great willingness to help and her sincere words of

encouragement helped me through demanding periods.

I want to thank Claudia Duttlinger and Jorge Martín-Martín for

giving me the opportunity to gain insights into real-world placement testing.

Thank you for placing trust in me as a researcher and software-developer.

I have greatly benefited from our collaboration.

I further want to thank Eyal Schejter, Björn Rudzewitz, Xiaobin Chen,

and Jochen Saile for their support whenever I faced technical problems. I

also appreciated help from Zarah Weiß regarding the understanding and

implementation of the Dependency Locality Theory. I would like to thank

my university friends Alexander Hartmann, Till Pachalli, Eyal Schejter, and

Heike Cardoso for proofreading my thesis.

Most of all, I want to thank my family members and closest friends.

Without the encouragement and unconditional support of my parents, my

sister, and my brother this work would not have been possible at all. And

finally, I would like to thank my future husband — for everything.


Contents

1 Introduction 1

2 Background 2

2.1 Language Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

2.2 The C-Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

2.2.1 From cloze tests to C-Tests . . . . . . . . . . . . . . . . . . . . . 4

2.2.2 Test design across languages . . . . . . . . . . . . . . . . . . . . 6

2.2.3 C-Test criticism . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

2.3 Linguistic Complexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

2.4 Related Work on C-Test Difficulty Prediction . . . . . . . . . . . . . . . 10

2.4.1 Beinborn et al. (2014) and Beinborn (2016) . . . . . . . . . . . . 11

2.4.2 Svetashova (2015) . . . . . . . . . . . . . . . . . . . . . . . . . . 13

3 Data 20

3.1 C-Tests at the Fachsprachenzentrum . . . . . . . . . . . . . . . . . . . . 20

3.2 Descriptive Analysis of the Data . . . . . . . . . . . . . . . . . . . . . . 20

3.2.1 Database Structure . . . . . . . . . . . . . . . . . . . . . . . . . . 21

3.2.2 Available Data for English and Spanish . . . . . . . . . . . . . . 21

4 Performance Modeling 22

4.1 Performance Data Description . . . . . . . . . . . . . . . . . . . . . . . . 23

4.2 Clustering Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

5 Difficulty Modeling 28

5.1 Linguistic preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

5.2 Modeling the difficulty of English C-Tests . . . . . . . . . . . . . . . . . 30

5.2.1 Item Level . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

5.2.2 Sentence Level . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

5.2.3 Text Level . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

5.3 Modeling the difficulty of Spanish C-Tests . . . . . . . . . . . . . . . . . 38

5.3.1 Item Level . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

5.3.2 Sentence Level . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

5.3.3 Text Level . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

6 Experiments and Results 40

6.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

6.2 Investigation of Performance Profiles . . . . . . . . . . . . . . . . . . . . 42

6.2.1 Classification Results on English Performance Profiles . . . . . . 43

6.2.2 Classification Results on Spanish Performance Profiles . . . . . . 45

6.3 Predicting C-Test Performance on Text and Item Level . . . . . . . . . 46

6.3.1 English . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47


6.3.2 Spanish . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

6.4 Comparative Investigation of Difficulty Prediction for Spanish and English 50

6.4.1 Comparison of Performance on Text and Item Level . . . . . . . 51

6.4.2 Comparison of Performance on Sentence and Item Level . . . . . 53

6.5 Discussion of the Presented Results . . . . . . . . . . . . . . . . . . . . . 56

7 Conclusion 59

Bibliography 63

A Appendix 68


List of Figures

2.1 A difficulty continuum described by Beinborn et al. (2014) . . . . . . . . 12

2.2 Variable factor map for the best performance profile (Svetashova, 2015) 17

2.3 Individuals factor map for the best performance profile (Svetashova, 2015) 18

2.4 Top 50 features by feature subset (Svetashova, 2015) . . . . . . . . . . . 18

2.5 Example text with items highlighted according to difficulty prediction

results (Svetashova, 2015) . . . . . . . . . . . . . . . . . . . . . . . . . . 19

4.1 Distribution of the participants’ proficiency levels for each text passage . 24

4.2 Distribution of the participants’ proficiency levels for each item . . . . . 25

4.3 Variables factor map using all 21 performance variables and 4 clusters.

(English) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

4.4 Variables factor map using all 21 performance variables and 4 clusters.

(Spanish) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

4.5 Individuals and variables factor map using textAv and percCorrect (Span-

ish) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

4.6 Individuals and variables factor map using textAv and percCorrect (En-

glish) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

5.1 UIMA preprocessing pipeline to annotate C-TestTokens . . . . . . . . . 29

6.1 Individuals and variables factor map with performance profile Text

SentProf Item CL4 (English) . . . . . . . . . . . . . . . . . . . . . . . . 44

6.2 Variable importance of the top 20 predictors in the RF model for Text

SentProf Item CL4 (English) . . . . . . . . . . . . . . . . . . . . . . . . 45

6.3 Variable importance of the top 20 predictors in the RF model for Text

SentProf Item CL4 (Spanish) . . . . . . . . . . . . . . . . . . . . . . . . 47

6.4 Variable importance of the top 20 predictors in the SVM model for

Text Item CL5 (English) . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

6.5 Cluster interpretation given the performance profile Text Item CL5 (En-

glish) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

6.6 Variable importance of the top 20 predictors in the SVM model for

Text Item CL5 (Spanish) . . . . . . . . . . . . . . . . . . . . . . . . . . 50

6.7 Cluster interpretation given the performance profile Text Item CL5 (Span-

ish) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

6.8 Using Text Dlt features only: RF classification confusion matrices for

Spanish (top) and English (bottom) for the profile Text Item CL5. . . . 54

6.9 Using Item SubtlexCand features only: RF classification confusion ma-

trices for Spanish (top) and English (bottom) for the profile Text Item CL5. 54

6.10 Individuals and variables factor map with performance profile Sent

Item CL4 (Spanish) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55

6.11 Individuals and variables factor map with performance profile Sent

Item CL4 (English) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56


List of Tables

2.1 Beinborn’s most predictive features . . . . . . . . . . . . . . . . . . . . . 12

2.2 Difficulty estimates for ”the” and ”in” (Svetashova, 2015) . . . . . . . . 14

2.3 Performance of feature groups in Svetashova (2015) . . . . . . . . . . . . 19

3.1 Number of participants and processed texts for Spanish and English . . 22

3.2 Conducted C-Tests with the number of test takers and information about

their performance score . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

4.1 Theoretically possible item difficulty properties . . . . . . . . . . . . . . 26

5.1 The features of the C-TestToken type as implemented in our UIMA

preprocessing pipeline. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

5.2 List of context and candidate space features . . . . . . . . . . . . . . . . 34

5.3 List of lexical variation features . . . . . . . . . . . . . . . . . . . . . . . 37

5.4 List of lexical sophistication features . . . . . . . . . . . . . . . . . . . . 38

5.5 List of syntactic complexity features . . . . . . . . . . . . . . . . . . . . 38

6.1 The number of features (factor and numeric) before and after removing

highly correlated predictors . . . . . . . . . . . . . . . . . . . . . . . . . 42

6.2 SVM and RF classification results using all features (English) . . . . . . 43

6.3 SVM and RF classification results using all features (Spanish) . . . . . . 46

6.4 Mean values of different item and text level features by cluster (English) 49

6.5 Mean values of predictive item and text level features by cluster (Spanish) 51

6.6 Classification results for SVM and RF for the performance profile Text

Item CL5 and different feature subsets. 63 features were considered to

be comparable. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52

6.7 Variable importance comparison. RF models for profile Text Item CL5

using a set of 63 comparable features. . . . . . . . . . . . . . . . . . . . 53

6.8 Classification results for SVM and RF for the performance profile Sent

Item CL4 using the 63 comparative features. . . . . . . . . . . . . . . . 55

6.9 Variable importance comparison. RF models for profile Sent Item CL4

using the set of 63 comparable features. . . . . . . . . . . . . . . . . . . 57

A.1 Item level features from the groups of linguistic, psycholinguistic and

position based features. . . . . . . . . . . . . . . . . . . . . . . . . . . . 68

A.2 Item level features from the group of context and candidate space features. 69

A.3 Sentence level features. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70

A.4 Text level features from the group of lexical features. . . . . . . . . . . . 71

A.5 Text level features from the groups of syntactic complexity, DLT and

readability features. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72

A.6 English: Text ids and number of participants per text. . . . . . . . . . . 73

A.7 Spanish: Text ids and number of participants per text. . . . . . . . . . . 74


List of Abbreviations

AE Analysis Engine

AOA Age-of-acquisition

AWL Academic Word List

CALT Computer-Assisted Language Testing

CEFR Common European Framework of Reference for Languages

CTAP Common Text Analysis Platform

DLT Dependency Locality Theory

FSZ Fachsprachenzentrum (the Language Learning Center of the

University of Tübingen)

HCPC Hierarchical Clustering on Principal Components

ICALL Intelligent Computer-Assisted Language Learning

IRT Item Response Theory

LFP Lexical Frequency Profile

NLP Natural Language Processing

NLTK Natural Language Toolkit

PCA Principal Component Analysis

PCFG probabilistic context-free grammar

POS Part-of-speech

RF Random Forest

SLA Second Language Acquisition

SVM Support Vector Machine

TFIDF Term Frequency–Inverse Document Frequency

TTR Type-Token-Ratio

UIMA Unstructured Information Management Architecture


1 Introduction

’What does it mean when we say that someone knows a language?’ (Spolsky, 1969).

This question was asked by Bernard Spolsky in the late 1960s to introduce one

of his works on language testing and is still highly relevant in contemporary research on

second language learning. It also implies the question of how one can break down the

continuum of proficiency from beginning learners of a foreign language to native-like

speakers. Based on empirical research, the Council of Europe developed a Common

European Framework of Reference for Languages (CEFR)1 that groups foreign language

proficiency into six levels. It was developed to have a standard for determining language

qualifications and to facilitate teaching.

It is known from the research of second language acquisition (SLA) that teaching is

most effective if the teaching material suits the learner’s current state of development.

The linguist and education researcher Stephen Krashen claimed that language acquisi-

tion occurs if learners are exposed to input that is slightly more advanced than what

they already know (Krashen, 1985). As the demand for language courses increased in

the past decades, also the interest of assigning large amounts of learners to appropriate

course levels in a cost-effective way increased. This assignment is done using placement

tests, which are conducted by universities, language schools or even by human resources

departments of large companies trying to figure out which further language training

their employees might need.

Placement tests differ highly from each other in terms of the underlying method.

They can be very versatile and complex, testing different kinds of language skills such

as grammar or vocabulary knowledge in different parts or including oral as well as

written assignments. The more complex the test is, the higher the costs for conducting

the tests are. Therefore, many institutions make use of simple cloze tests, which require

less preparation and scoring time and have nevertheless proven to be quite reliable in assigning

language levels to learners. The test which will be investigated in this thesis is the

C-Test, a variation of the cloze procedure following certain rules on how the gapping is

done. There is much research on the validity and reliability of C-Tests, but less research

on how the difficulty of the test can be predicted automatically. The process of building

up a C-Test should involve knowledge about the difficulty of single text passages in order

to select them appropriately and thereby control the test's overall difficulty.

This, in turn, should inform the analysis of the test takers' performance results and

thereby the process of assigning them to appropriate language course levels. Thus,

predicting the test's difficulty is an important step towards a reasonable and

appropriate comparison of results later on.

Predicting the difficulty of C-Tests requires knowledge about linguistic complexity.

1 http://www.coe.int/en/web/common-european-framework-reference-languages (Last accessed: 18/02/08)


It needs to be investigated which characteristics of language use give evidence on how

difficult a test is. These characteristics can span different scopes of locality within the

test, such as single gaps themselves, sentences or whole paragraphs and texts. Some

gaps might not be difficult themselves, but the sentence or text in which they occur

is so hard to understand that the gap cannot be filled in correctly. Furthermore,

these characteristics cover different aspects of the language under consideration, e.g.

morphology, syntax, lexical variation, psycholinguistic frequency counts, or even

discourse-related features. There is much work on using such features to predict the

complexity of texts. These features have also been used in the context of language

acquisition to predict the readability of texts or to analyze the language of learners

across proficiency levels. We will make use of complexity features in order to predict

the difficulty of C-Tests. Two languages will be investigated and compared to each

other in terms of the impact of feature types on the test difficulty.

This thesis will investigate the difficulty of C-Tests based on data provided by a

German university’s language learning center. The main question that underlies this

thesis is the following: How can we best predict the difficulty of single gaps and whole

texts? This implies the analysis of test takers’ performances and further leads to the

investigation of linguistic characteristics within tests. These characteristics can be

found on lexical, sentence and text level and will vary across languages with different

properties. It will be considered which features have a higher impact on the test

difficulty given the nature of the languages under consideration: English, as a more

analytic language as opposed to Spanish with a richer morphology. Section 2 will

give a brief overview of language testing in general and present the C-Test with the

criticism it has received. Furthermore, the concept of automated complexity analysis

and corresponding research is presented. This is needed in order to understand the

existing work on the difficulty prediction of C-Tests: We will present in detail the work

of Svetashova (2015) and Beinborn (2016) who investigated item and text difficulty of

English C-Tests. Furthermore the dataset will be described (Section 3) and analyzed

by modeling the test takers’ performance (Section 4). In Section 5 we present our

difficulty model focusing on the underlying linguistic features. The machine learning

experiments and the results are explained in Section 6, followed by a section in which

a conclusion is drawn and future work is suggested.

2 Background

How a learner’s language skills can be tested and how test results can be analysed and

evaluated is a persistent question in SLA research. The following will give an overview

on different types of language tests and then describe the C-Test in detail, focusing

on how it has been evaluated, both positively and negatively. Afterwards, research

about the automated analysis of linguistic complexity will be presented, since this


is essential knowledge for understanding related work on C-Test difficulty prediction.

Difficulty prediction for English C-Tests has been conducted by Svetashova (2015) and

Beinborn (2016), whose works represent the fundamental basis of this thesis and are

therefore described in-depth.

2.1 Language Testing

Language testing is a broad term that describes the study of determining how

proficiently somebody uses a certain language. Although this main idea is true for all

language tests, there are fundamental differences with respect to the underlying method

and their purpose (McNamara, 2000).

Based on the test method, McNamara (2000) distinguishes between

paper-and-pencil language tests and performance tests. The former covers the more

traditional tests which are used to assess discrete points of knowledge, such as certain

grammar or vocabulary skills. In these types of tests the response format is mostly

fixed, meaning that the test takers have to choose among a fixed set of multiple

possible solutions. Such tests have the advantage of being efficient in terms of grading, and

the scores can easily be compared across learners. Besides grammar and vocabulary

knowledge, such tests can measure skills in reading or listening comprehension but they

cannot be used for measuring the learner’s ability in language production. In contrast

to paper-and-pencil language tests, performance tests include acts of communication

where the focus is on language production skills. The response format is therefore

not fixed and the grading process is built upon an agreed rating procedure

(McNamara, 2000).

In addition to the test method, language tests differ with respect to the test

purpose. Achievement tests aim at assessing individual progress and always relate to a

prior teaching goal. Thus, they give evidence on the outcome (i.e., achievement) of the

teaching process. Proficiency tests, in contrast, do not aim at measuring a learner’s

past progress but relate to the specific purpose of the use of the language in the future.

Such tests often involve communication tasks specific to the target language usage

and try to simulate real world situations. Hughes (2007) defines the word ’proficient’

as 'having sufficient command of the language for a particular purpose'. Furthermore,

there are placement tests, which aim at assigning the test taker to an appropriate

language course level. A challenging task with placement tests is to design a test that suits

the teaching programme of the institution. Since the outcome of placement tests does

not influence the future teaching methods, they are considered to be summative as

opposed to formative tests. Formative assessment tests can be used to formulate feedback

in order to influence future teaching, whereas summative assessment summarizes

the student's ability at a certain point in time. Hughes (2007) emphasizes the importance of

designing placement tests in a way that suits the teaching programme of the institution.


He argues that placement tests are most successful if they are designed for particular

situations. Putting much effort into the construction of placement tests results in

saving time and effort in teaching, but many institutions still cannot afford the expense

needed for versatile tests given the increasing demand for placement testing.

This thesis investigates the difficulty of the C-Test, a placement test that relies on

the cloze procedure, can be generated semi-automatically, and therefore has quite

low preparation costs. Furthermore, the scoring is very simple and comparable across

learners. The following subsection will give an overview of cloze tests in general and

present the modifications leading to the C-Test.

The implementation of language tests has changed within the last decades due to

technical progress in the whole field of SLA, leading to the emergence of new

interdisciplinary research fields such as Intelligent Computer-Assisted Language Learning

(ICALL) (Amaral and Meurers, 2011). Nowadays, the use of computers has become

a standard in foreign language courses. One can hardly imagine a language learning

course without technical aids, e.g. electronic communication platforms, electronic

dictionaries, or digital presentations. Large institutions cannot cope with the high demand

for language courses without making use of computers, especially in their placement

procedures. Suvorov and Hegelheimer (2013) describe a new emerging field called

Computer-Assisted Language Testing (CALT). However, while the usage of computers

has become the norm, the gains of research in natural language processing (NLP) have

hardly been used for language learning and testing purposes. Meurers et al. (2010)

present a tool that makes use of NLP to visually highlight linguistic patterns that are

known to be difficult to learn. Amaral and Meurers (2011) describe the challenge of

using NLP to foster computer-assisted language learning systems and present a

system that automatically provides individualized feedback. Chapelle and Chung (2010)

describe how NLP can be used for language assessment and speech recognition.

2.2 The C-Test

The C-Test is a widely accepted language test used for different purposes. The following

will describe the origins of the C-Test and how it is designed and applied. Furthermore,

it will be presented how the test has been criticized.

2.2.1 From cloze tests to C-Tests

Cloze tests are based on the concept of reduced redundancy, which relies on the fact

that language often provides more information than what is actually necessary to

understand what somebody intends to say. According to Spolsky (1969), the redundancy

within language might seem wasteful, but it does actually help to cope with noise

appearing in the communication channel. Spolsky (1969) further points out that the

ability to cope with noise in language highly depends on the individual's language


proficiency. A non-native speaker might need the 'full normal redundancy' whereas native

speakers can often perfectly understand messages where some parts are missing due

to interferences that disturb the communication. Tests based on reduced redundancy

principles challenge language learners to restore the distorted parts. It is assumed that

the competence in restoring these parts will give evidence on the test takers’ language

proficiency.

Cloze tests were originally constructed by deleting words randomly or by deleting

every nth word in a text. It has been argued that these are the simplest

approaches and that they usually lead to deletions of different parts of speech, avoiding

deletions of e.g. only articles. However, such cloze tests have also been widely

criticized. According to Alderson (1979), the results of cloze tests relate more to tests of

grammar and vocabulary and less to reading comprehension tests. He further points

out that they measure lower-order language skills rather than higher-order language skills

and that the restoration of the gap is highly dependent on the gap's immediate

context. Moreover, the deletion rates highly influence the test results. Bachman (1982)

therefore suggests a linguistically motivated rational deletion of selected syntactic or

cohesive items. It is concluded that this cloze variation can measure higher-order skills

including coherence and cohesion by taking into account syntactic and discourse level

relationships (Bachman, 1982). The downside of such approaches is that it is hardly

possible to generalize the deletion procedure across tests and that each individual test

therefore needs to be analysed separately in terms of reliability and validity. As a

consequence, Raatz and Klein-Braley (1981) aim at designing a test where the

random deletion is modified without violating the principle of an internalized grammar and

hence, without having an impact on generalizability. They describe six criteria followed

during the test development (Raatz and Klein-Braley, 1981):

1. shorter texts producing at least 100 items

2. no problems in choice of deletion rate and starting point

3. deletions should be an absolutely representative sample of the elements of the

text

4. it should not favor examinees with special knowledge

5. only exact scoring should be used

6. native speakers should obtain virtually perfect scores.

These criteria show that Raatz and Klein-Braley (1981) tackle problems concerning

the test design including text selection and deletion process. On top of that, they

address the problem of how such tests should be scored and how the results might be

analyzed and interpreted.


2.2.2 Test design across languages

A C-Test, as described by Grotjahn (2002), typically consists of four to six short texts

on different topics. This avoids test results that only represent

the learner’s familiarity with a certain topic. Each short text starts with a complete

sentence in order to provide the learner with some context. Starting from the second

sentence, the second half of the letters of every second word is blanked. If the word

to be gapped consists of an odd number of letters, the larger part of the word is gapped. After

a predefined number of gaps, the text ends with a short untouched part to provide some

non-mutilated context again. This predefined number usually varies from 20 to 25.

In this thesis, we will use the following terms and notations: A single test item that

contains a gap is called an item. The correct solution is enclosed in brackets. The

following gives an example:

• item: diffi[culty]

• base: diffi

• ending: culty

• intended word: difficulty
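To make the gapping rule concrete, the following is a minimal Python sketch of the deletion procedure described above (the function names are mine; the sketch ignores punctuation, one-letter words, and the language-specific issues discussed in the next subsection):

def gap_word(word):
    # Blank the second half of the letters; for an odd number of letters,
    # the larger part of the word is gapped.
    keep = len(word) // 2                  # "difficulty" (10 letters) -> keep "diffi"
    return word[:keep], word[keep:]        # (base, ending)

def build_ctest(sentences, max_gaps=20):
    # Gap every second word, starting from the second sentence,
    # until the predefined number of gaps is reached.
    gapped, word_count, result = 0, 0, [sentences[0]]   # first sentence stays intact
    for sentence in sentences[1:]:
        words = sentence.split()
        for i, word in enumerate(words):
            word_count += 1
            if gapped < max_gaps and word_count % 2 == 0:
                base, ending = gap_word(word)
                words[i] = base + "_" * len(ending)
                gapped += 1
        result.append(" ".join(words))
    return " ".join(result)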

In contrast to usual cloze tests, the number of acceptable solutions for a C-Test gap is

low. It is not often the case that a semantically and grammatically correct restoration

other than the original solution should be considered as correct (Grotjahn, 2002). In

cases such as ”in Ju ”, one should accept both solutions: ”in June” and ”in July”. One

should differentiate between the following three types of multiple solution occurrences:

• A gap’s alternative solution should be accepted across texts.

E.g., theatre, theater. American or British English spelling variants

• A gap’s alternative solution should be accepted only within a specific text.

E.g., snowfall, snowstorm. The words snowfall and snowstorm might be

interchangeable in one text, but not in the other, where a snowfall is much lighter

than a storm.

• A gap’s alternative solution should be accepted only for this specific gap.

E.g., June, July. In the context ”in July” one should accept also June. However,

in the context ”Fourth of July” one could argue that a learner should know that

only ”July” and not also ”June” should be accepted.

Grotjahn (2002) suggests working with predefined lists of acceptable solutions.

Another strategy would be to avoid multiple solution variants by reducing the number of

blanked letters. This is not recommended since it reduces the difficulty at the same

time.


The presented rules for developing C-Tests seem very fixed and language independent.

However, there exist language-dependent issues with these construction rules. Grotjahn

and Tönshoff (1992) present the following language-specific phenomena:

• apostrophes: Words containing apostrophes are often considered as one word

instead of as multiple words. In Italian, prepositional articles and the following

nouns are merged by an apostrophe, e.g. ”dell’anno”. Treating this construction

as one test item, the learner would need to restore a whole noun.

• compounds: The components of German compounds are usually not separated

by whitespace. In this way, whole stems can be blanked, making a restoration

very hard.

• enclitic personal pronouns: In some romance languages, multiple personal

pronouns are attached to the verb’s ending. E.g. the Italian verb ”regalarglielo”

(give it to him) would be difficult to restore when it is blanked.

• grapheme combinations: Grapheme combinations can represent single phones.

If a word’s non-blanked part ends with a graph that is part of such a polygraph

combination, it could mislead the reader, since inner phonation often serves as

restoration strategy. E.g. ”luglio” (Italian), where ”gl” is a polygraph.

As described, there often exist various options to handle such phenomena. The

challenge is to cope with them without altering the fixed C-Test rules too much to keep

the test results as generalizable and comparable as possible.

2.2.3 C-Test criticism

As mentioned in the beginning of this section, the development of the C-Test solved

some widely criticized issues that came up with cloze tests. The C-Test rules clearly

define the deletion rate and the starting and ending point of the deletions, which was

not the case for cloze tests. The fact that a C-Test is made up of several short texts

addressing different topics further reduces the problem of topic related distortions.

Moreover, it was shown that it is possible for native speakers to gain full C-Test scores.

An advantage of the C-Test as a placement test is that the restoration of gaps involves

different levels of language. Following a top-down processing approach, the learner can

infer the solution given the text’s topic, or information about persons, objects or places

involved in a sentence or text passage. In addition to such contextual clues, the learner

can follow ”grammatical, syntactical, lexical, semantic, collocational, [...] pragmatic,

logical, situational clues (and no doubt many others)” (Klein-Braley, 1985). Thus, being

able to use information on different levels of language leads to a successful restoration

of the gaps. This information can cover smaller or larger scopes within a text. As for

sentences or clauses, a reader could for example derive from the syntactic structure


that a gapped verb must be a past participle and use this knowledge to fill the gap.

As C-Tests are usually based on real-life texts, they can be considered authentic to

a certain extent (Klein-Braley, 1985). However, in general it is difficult to produce

authentic tests since there exist factors such as test anxiety which are not present in

real-life communication. According to McNamara (2000), test performances are only

indicators of how a learner would perform in a similar real-world situation.

McNamara (2000, p. 8) emphasizes the importance of differentiating between the criterion

(behaviour in the target situation) and the test itself.

The validity and reliability of C-Tests have been widely investigated. Although the

C-Test has been criticized for only measuring reading ability, it has been shown that solving

a C-Test also involves other language skills. Eckes and Grotjahn (2006) compared

C-Test results to German learners’ performance on a complex language test which

involves exercises in reading, writing, listening and speaking. The study shows that

the C-Test is able to measure the same general dimension as the more wide-ranging

test does. In placement tests it is important to get a general estimate of the learner’s

proficiency throughout different target real-life situations irrespective of the learner’s

competence in specific language skills (Eckes and Grotjahn, 2006). Babaii and Ansary

(2001) show similar results by comparing English C-Test results to TOEFL results.

They further investigated gap restoration strategies by collecting retrospective protocols

of the participants. These protocols indicate that the participants follow four different

strategies, namely automatic processing, lexical adjacency, sentential cues, and

top-down cues. This also reveals that the C-Test does not only require local micro-level

processing strategies, but also macro-level processing (Eckes and Grotjahn, 2006). This,

in turn, is important for measuring general language proficiency. It has not only been

demonstrated that C-Test results correlate with other language tests, but also with

other external criteria such as school grades and teachers' estimates (Cohen et al., 1984).

Dörnyei and Katona (1992) criticize the fact that the reliability of C-Tests has widely

been investigated but the reasons for this success are not clear, i.e. it has been shown

that the test is reliable but it has not been investigated why and how it works (Dörnyei

and Katona, 1992).

This thesis compares C-Test item and overall difficulty of English tests to Spanish

tests. Most of the literature investigates C-Tests for learners of English as a foreign

language. It should be mentioned that the English test validation cannot simply be

projected to other languages. In Hebrew, for example, words are morphologically richer than in

English. However, Cohen et al. (1984) show that the C-Test is reliable and valid

also for Hebrew. But they further show that Hebrew C-Tests correlate more with

grammar tests on verbal inflections and noun-adjective agreement than with reading

comprehension exercises. This is caused by the typological properties of Hebrew as a

synthetic language. Since Spanish is also more synthetic than English, one could expect

similar results for Spanish. Affixation is more prominent and complex in Spanish than


in English.

In summary, the C-Test is mostly considered to be reliable and valid for the

measurement of general language proficiency. The test is furthermore very efficient, meaning

that the costs for test construction and scoring are very low in contrast to other

language tests.

2.3 Linguistic Complexity

In order to understand what makes a test for language learners easy or difficult one

should take into account what makes the language under consideration in general more

complex. The exploration of linguistic complexity involves the comparison of linguistic

patterns across texts, which are assumed to be of different complexity. An example

for such texts are school book texts addressing different learner levels. It could be

investigated in which class level or school book section which syntactic patterns or

vocabulary entries are introduced. Furthermore, texts produced by language learners on

different levels of proficiency could be investigated to learn about linguistic complexity

of even native texts. Vajjala and Meurers (2012) have shown that measures from SLA

research, which have proven to be informative in proficiency classification of learners,

can be used to improve readability classification of native language.

Readability measures are traditionally surface-based and consist of formulas which

only include letter, syllable or word counts. Such measures have been criticized since

they do not take into account deeper linguistic structures. Technological advances in

computational linguistics and the growing availability of data made much deeper

linguistic analyses possible. McNamara et al. (2014) developed Coh-Metrix, a tool to

automatically evaluate English text and discourse by computing a broad range of metrics,

e.g., traditional readability indices or more fine-grained measures based on theoretical

constructs such as cohesion and coherence, lexical diversity, syntactic analyses, or

latent semantic analysis. Todirascu et al. (2013), Vajjala and Meurers (2012), Hancke

et al. (2012), and Weiß (2015) describe features for automatically measuring linguistic

complexity in the context of language learning. Chen and Meurers (2016) present a

web-based Common Text Analysis Platform (CTAP) which was built to strengthen

research collaboration by enabling researchers to share their feature implementations and

by providing an interface for non-programmers to manage their corpora and flexibly

choose feature sets corresponding to their purpose.
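As an illustration of the traditional surface-based formulas mentioned above, the classic Flesch Reading Ease score, for instance, combines nothing but sentence, word, and syllable counts:

\[ \mathrm{FRE} = 206.835 \;-\; 1.015 \cdot \frac{\#\,\text{words}}{\#\,\text{sentences}} \;-\; 84.6 \cdot \frac{\#\,\text{syllables}}{\#\,\text{words}} \]

Such formulas capture surface length effects only, which is exactly the limitation the deeper linguistic analyses described here are meant to overcome.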

The indices for measuring linguistic complexity which are described in the mentioned

works differ from each other in terms of the locality they span. Indices on word level

often provide information about morphological properties such as derivation or inflection.

They can also be based on lexicon look-ups of psycholinguistic databases that provide

information about the word’s age of acquisition (Kuperman et al., 2012). Furthermore,

word level features investigate lexical density and lexical variation.
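As a minimal illustration of such word-level measures (a toy sketch with hand-assigned POS tags; a real pipeline would lemmatize and tag automatically), lexical variation is commonly operationalized as a type-token ratio and lexical density as the share of content words:

def type_token_ratio(tokens):
    # Lexical variation: number of distinct word forms divided by the number of tokens.
    return len(set(tokens)) / len(tokens)

def lexical_density(pos_tags, content_tags=("NOUN", "VERB", "ADJ", "ADV")):
    # Lexical density: proportion of content words among all tokens.
    return sum(tag in content_tags for tag in pos_tags) / len(pos_tags)

tokens = ["the", "cat", "sat", "on", "the", "mat"]
tags   = ["DET", "NOUN", "VERB", "ADP", "DET", "NOUN"]
print(type_token_ratio(tokens))   # 5 distinct forms / 6 tokens ~ 0.83
print(lexical_density(tags))      # 3 content words / 6 tokens = 0.5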


On sentence level, syntactic complexity is measured by investigating certain

syntactic constructions and their frequencies, such as the number of clauses or constituents

within the sentence. These features will be presented in more detail in the context of

related work in C-Test difficulty prediction in Section 2.4. Weiß (2017) made use of

features that are based on Gibson's (2000) Dependency Locality Theory (DLT) in order

to assess the proficiency of learners of German as a second language. DLT emerged in

the field of Human Sentence Processing and is based on the idea of a cost that arises

when two elements in a sentence need to be integrated. According to DLT, this cost

depends on the locality of the two elements in terms of the distance between them. We

will describe these features in more detail in Section 5.2. Applying the DLT features to

two learner corpora resulted in the observation that integration costs increase with

increasing learner proficiency. Weiß (2017) emphasizes that this observation is consistent

across different tasks included in the corpus.
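A minimal sketch of the core DLT idea (heavily simplified: it approximates the integration cost of a dependency as the number of nouns and verbs intervening between the dependent and its head; the actual implementation referenced above works on full dependency parses with a finer definition of discourse referents):

def integration_cost(pos_tags, dependent, head, referent_tags=("NOUN", "VERB")):
    # Approximate DLT integration cost: count the discourse referents
    # (here: nouns and verbs) between the two elements being integrated.
    lo, hi = sorted((dependent, head))
    return sum(tag in referent_tags for tag in pos_tags[lo + 1:hi])

# "The reporter who the senator attacked admitted the error."
tags = ["DET", "NOUN", "PRON", "DET", "NOUN", "VERB", "VERB", "DET", "NOUN"]
# Integrating the verb "admitted" (index 6) with its subject "reporter" (index 1)
# crosses two referents ("senator", "attacked"), hence a higher cost:
print(integration_cost(tags, 1, 6))   # 2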

On text level, the indices can measure discourse-related features, often involving

aspects of cohesion and coherence. Todirascu et al. (2013) describe measures for French

based on parts of speech, where they explored the number of referring expressions

within a text. They found that a lower number of personal pronouns per

sentence increases the text's difficulty, whereas a higher number of definite articles per

text makes the texts more readable. They also included entity coherence and density

indices, and investigated manually annotated reference chains throughout sentences.

Graesser et al. (2003, p. 2) state that in general a ”text is less coherent when there are

many conceptual and structural gaps in the text, and the reader does not possess the

knowledge to fill them”. This quote does not refer to actual C-Test gaps, but illustrates

the connection between a lower coherence of a text and the reader’s ability to cope with

it. One can therefore assume that coherence related features can be used to predict

the difficulty of C-Tests, where actual gaps are used to decrease the text’s coherence.

Beinborn (2016) and Svetashova (2015) follow this assumption and further show how

complexity features can be adapted to the difficulty prediction of C-Test items. These

two works will be described in the following.

2.4 Related Work on C-Test Difficulty Prediction

One of the first attempts to find measures for determining a C-Test’s difficulty has been

reported by Klein-Braley (1984). She focused on the investigation of single C-Test text

passages to ensure a reasonable selection of text passages. She developed measures

based on sentence length and Type-Token-Ratio and reports that these measures can

be used to predict the difficulty of C-Tests. A C-Test’s difficulty is often measured

by dividing the number of all erroneously entered gaps by the number of gaps in total

(mean error rate). Beinborn et al. (2014) criticize this measure since it does not capture

any information about single gaps:


”In the extreme case, half of the gaps can be solved by all learners and the

other half by almost no one. The test is then assigned a medium difficulty,

but the results are not useful for discrimination between learners.”(Beinborn

et al., 2014).
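Written out, the criticized text-level measure is simply (notation mine):

\[ \text{mean error rate}(t) = \frac{\#\,\text{erroneous answers for text } t}{\#\,\text{answers for text } t} \]

Because it averages over all gaps and learners, it cannot distinguish a text in which every gap is of medium difficulty from the bimodal case described in the quote.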

Beinborn et al. (2014), Beinborn (2016) and Svetashova (2015) therefore go further

by investigating not only the difficulty of whole texts but also the difficulty of

single C-Test items. New achievements in Natural Language Processing (NLP) and the

growing availability of language data and other resources allow for the development of a

great number of features on item, sentence and text level. The following will present

these works, focusing on Svetashova (2015), which forms the theoretical and

technical groundwork of this thesis.

2.4.1 Beinborn et al. (2014) and Beinborn (2016)

Beinborn et al. (2014) developed a difficulty model including features based on four

concepts: solution difficulty, candidate ambiguity, inter-gap dependency and paragraph

difficulty. Solution difficulty and candidate ambiguity are processed on a so-called

micro-level, where only the direct context of a gap is taken into account to estimate its

difficulty. The solution difficulty describes how likely it is for the student to know the

solution, e.g. based on the word's frequency or morphological complexity. On the

macro-level, the inter-gap dependency (e.g., impacts of the difficulty of preceding gaps on the

current gap) and paragraph difficulty (including readability features) influence a gap’s

difficulty.

Four different C-Tests were conducted as placement tests for university students.

They analyse the answers of at least 140 participants per test and show that the variance

of error rates is very high within single paragraphs. Furthermore the answer variety

increases with higher error rates, showing that difficult gaps do lead to various mistakes

rather than to one typical mistake.

Their regression results show that solution difficulty has a high impact on the gap’s

difficulty, followed by the paragraph’s difficulty. Both feature groups have widely been

researched, while the other two perform worse. However, feature selection resulted in a

set of features from all four categories. Table 2.1 lists the top 21 features which resulted

in a leave-one-out testing accuracy of 57% on the whole data.

To compare their results to the performance of human experts, they conducted an

experiment with three English teachers. The teachers were asked to annotate 20 texts

by assigning a difficulty category to each item. Figure 2.1 lists these categories and the

corresponding error rates. The results show that approximately 50% of the predictions

were correct, which is on the same level as their automatic approach. Furthermore

their annotation experiment showed a very low inter-annotator agreement (0.36 Fleiss’


Kappa). The three of them agreed with each other and were actually correct in only

25.3% of the cases.

Table 2.1: The 21 most predictive features grouped by difficulty dimension (Beinborn et al., 2014, p. 526)

Figure 2.1: The difficulty continuum described by Beinborn et al. (2014, p. 524).

Beinborn’s dissertation aims at the difficulty prediction and manipulation of different

text-completion exercises including C-Tests for English, German and French (Beinborn,

2016). The implemented C-Test difficulty features can again be grouped into the four

dimensions described in Beinborn et al. (2014). The regression results show that

solution/word difficulty is the most predictive feature group for all three languages. In

general, the micro-level features work better than the features on the macro-level. The

difficulty prediction for the English and French data works even better after removing

the macro-level features. Beinborn (2016) concludes that these findings give

evidence for the assumption that a gap’s difficulty is mostly determined by itself and

its direct context. As in Beinborn et al. (2014), features from all dimensions are present

after feature selection. This is the case for all the languages under consideration.

However, there are twice as many features from the micro-level dimensions.

Beinborn (2016) further presents an error analysis that provides insights into the

data. The difficulty assigned to named entities turned out to be too high since named

entities might be familiar to learners but have not been integrated into the model.

Under-estimation happened in cases where candidate answers are more frequent and

simpler than the actually correct answer. This should be captured by candidate

ambiguity features, which apparently did not get enough weight. The same happens for


spelling difficulties as in words like "of" and "off", which seem very simple but often lead to erroneous spellings.

With respect to the multilingual perspective of this work, it is important to mention the following findings by Beinborn. It was shown that the difficulty prediction for English is worse than for the other two languages. Beinborn (2016) mentions that this might

be caused by the comparably large candidate space given for short English words in

comparison to short words in French and German. A possible explanation why word

frequency features in particular are not as predictive for English is the substantial presence of the English language in today's everyday life, leading to vocabulary knowledge which is often domain-specific.

2.4.2 Svetashova (2015)

Svetashova (2015) aims at predicting the difficulty of C-Test items focusing on their

linguistic properties. Information about the difficulty of C-Test items and whole text

passages can then be used for an appropriate selection of texts. The data under consid-

eration was provided by the Language Learning Center of the University of Tübingen.

Svetashova (2015) presents in detail the underlying technical steps and theoretical foun-

dations. She tested various machine learning approaches and investigated a broad range

of variable combinations. It is important to mention that the following does not summarize her work, but rather gives insight into five steps that are ultimately relevant for this thesis.

Step 1: Model the learners’ performance. The first step was to extract rele-

vant performance statistics from five C-Tests, where each test is made up of five scored

texts and one calibration text. 20 distinct texts have been processed by more than

150 participants resulting in 7725 available answers. More information about the data

is given in Section 3. For each item the percentage of correctly inserted answers was

calculated. The proficiency levels (A1-C2) were available as final test scores. This

information was used to generate further variables representing the percentage of cor-

rectly inserted answers within a single proficiency level. Additionally, variables that

evolved from Item Response Theory (IRT) have been investigated. All answers that

did not exactly match the correct solution were considered as incorrect. This can also

include non-erroneous solutions such as spelling variants or incorrect encoding of the

apostrophe, which would not be considered an error by human correctors.
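The strict exact-match criterion can be illustrated with a few lines of R; the example answers below are hypothetical and only serve to show that any deviation from the stored solution, including a differently encoded apostrophe, counts as incorrect:

    # Strict exact-match scoring (illustrative; the answer strings are invented).
    # An answer counts as correct only if it is character-identical to the solution,
    # so spelling variants or a differently encoded apostrophe are errors.
    solution <- "didn't"
    answers  <- c("didn't", "didn´t", "did not", "didnt")

    correct     <- answers == solution
    percCorrect <- 100 * mean(correct)   # percentage of exact matches for this item
    percCorrect                          # 25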

The following summarizes the information given in the performance statistics:

• percentage of exact matches

• percentage of exact matches by proficiency level (A1-C2)

• difference from the mean text passage difficulty


Table 2.2: This table is taken from Svetashova (2015). It shows difficulty estimates for occurrences of the words "in" and "the" and the difference from the mean (overall text difficulty).

• IRT difficulty estimate

• IRT discrimination value

Usually, the difficulty of an item is defined by the ratio of incorrectly inserted answers

to all answers. Beinborn et al. (2014) split the difficulty continuum into four levels

according to the error rate. Svetashova (2015) developed a new method to group

C-Test items into clusters, including two dimensions: The difficulty of the item itself

(percentage of correct answers for that item) and the overall difficulty of the text passage

(the mean text difficulty, MTD). It was found that words which are very similar to each

other can be of different difficulty depending on the MTD. Table 2.2 shows that the item "in" was filled out correctly by only 64.36% of the test takers in text passage 590, whereas in text passage 524 more than 93% got the correct answer. Thus, in

this case one cannot predict the item’s difficulty by considering it on the word level.

The table further shows the MTD of both texts, which differ highly from each other:

39.79 % for text 590, 70.26 % for text 524. Subtracting these MTD values from the

percentage of exact matches results in very similar values for the items across text

passages. Thus, the difference from the mean in these cases captures the similarity of

the same words occurring in different texts.
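The arithmetic behind this difference-from-the-mean variable is simple; the following sketch redoes it in R for the two occurrences of "in" from Table 2.2 (the exact value for text 524 is only given as "more than 93%", so 93.0 is used as a placeholder here):

    # Difference from the mean text passage difficulty (MTD) for the item "in";
    # 93.0 for text 524 is a placeholder, the other values are from Table 2.2.
    percCorrect <- c(text_590 = 64.36, text_524 = 93.00)
    mtd         <- c(text_590 = 39.79, text_524 = 70.26)

    percCorrect - mtd
    #  text_590 text_524
    #     24.57    22.74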

Svetashova (2015) combined different performance variables in order to find out which

items can be grouped together according to similarities. Hierarchical Clustering on Principal Components (HCPC) was performed to investigate possible groupings. She

introduces the term performance profile, which describes a clustering principle or, in

other words, a combination of performance variables.

The following part will describe features that are linguistically motivated or otherwise connected to the C-Test literature. Machine learning algorithms are trained on

these features in order to model the difficulty. In the end Svetashova (2015) checks

which features can best predict which performance clusters.

Step 2: Model the tests’ difficulty. The information about an item’s difficulty

can span different levels of locality: Svetashova (2015) describes features on text, sen-

tence, and item level:

• Text Level. The textual features include traditional surface-based readabil-


ity features and more sophisticated readability features described within Vajjala

(2015). Furthermore, she included indices which measure lexical complexity and which are based on different linguistic resources. Examples of such features are Lexical Frequency Profile measures. These measures are based on lists that

group words into different bands of vocabulary frequencies. A list which contains

word families that are especially common for academic use was also investigated.

Furthermore she included lexical density, richness, sophistication and variation

measures, which were originally developed by Lu (2012) in the context of Sec-

ond Language Acquisition. Also syntactic properties of the whole text have been

analysed: These features are based on the counts of production units such as

phrases, clauses, sentences, complex nominals, or t-units.

• Sentence Level. The features on sentence level mainly address syntactic com-

plexity based on counts of parse tags generated by both a dependency parser

and a constituency parser. The depth of the parse tree and the log probability of

the parse tree (parse score) have also been used as features.

• Item Level. The item level features represent the broadest class. They capture

linguistic properties (LING) of the item based on morphological as well as syntac-

tic information obtained from part-of-speech (POS) tagging, named entity recog-

nition, parsing and a morphological analysis tool. A semantic feature encodes

whether the word has manually been annotated as topic specific. A further fea-

ture encodes whether the item has orthographic variants. In addition, the LING

features group contains surface-based features measuring the length of an item,

its base and its ending. The psycholinguistic features (PSY) are mostly based on

frequency lists and on lists generated within psycholinguistic experiments. Some

reveal information about a word’s familiarity, imageability, or stylistic category,

some are based on age-of-acquisition (AoA) ratings. She also investigated term

frequency–inverse document frequency (TFIDF) measures, which evolved from

information retrieval and indicate the importance of a word throughout a col-

lection of documents. The context and candidate space features (CONT) form

another group of item-level features. These can, for example, show how likely other solutions are in the given context (1- to 5-grams); a small illustrative sketch follows this list. Such n-gram probabilities

have been extracted from a corpus containing about one hundred billion tokens.

The likelihood of POS sequences has also been investigated. A last subset of the

item level features addresses the positioning (POSIT) of the item itself. These

features are motivated by the assumption that the position of the item in the text

passage can influence its difficulty. The features indicate, for example, whether

an intended word occurs in the non-gapped introductory sentence, or whether it

occurs earlier or later in the passage.
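The candidate space idea can be sketched with a toy unigram list; the function and the frequency list below are purely illustrative assumptions and do not reproduce Svetashova's (2015) actual feature extraction:

    # Illustrative candidate-space features for a single gap: how many words in a
    # unigram frequency list share the visible stem, and how many of them have a
    # higher unigram log probability than the intended solution.
    candidate_space_features <- function(stem, solution, unigrams) {
      candidates <- unigrams[startsWith(unigrams$word, stem), ]
      logp <- log(candidates$count / sum(unigrams$count))
      names(logp) <- candidates$word
      list(n_candidates     = nrow(candidates),
           solution_logprob = unname(logp[solution]),
           n_more_probable  = sum(logp > logp[solution], na.rm = TRUE))
    }

    # Toy frequency list (invented counts):
    unigrams <- data.frame(word  = c("might", "mind", "mine", "mist"),
                           count = c(120, 300, 80, 10))
    candidate_space_features("mi", "might", unigrams)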

Step 3: Choose the best performance profile. One of the main findings of Svetashova (2015) is the way in which the performance data was modeled. Using Hierarchical


Clustering, she clustered the data in 2200 different ways and tried to find out which clustering best represents the performance in terms of correlation with the full difficulty feature set model (265 features). This was done by following two machine learning

approaches: Training a Support Vector Machine (SVM) classifier and a Random For-

est (RF) classifier. The performance profile that performed best combines variables of

three types: "1) the log odds ratio to insert an item correctly for the test takers of different proficiency levels, 2) the text passage average of correct insertions by the test takers of different proficiency levels (measured as logits) and 3) the IRT model difficulty estimates for each item" (Svetashova, 2015, p. 89). Figure 2.2 shows the variable factor

map for the best-performing performance profile. The item-based irtDifficulty variable

and the logitA1-C2 variables negatively correlate with each other. With these variables

we can imagine a line that indicates the item difficulty from difficult (top left) to easy

(bottom right). The variables denoting performance on text passages (logitTextAver)

are perpendicular to the item-based ones. We can imagine a line from the bottom left to the top right with "items which occur in text passages with decreasing difficulty"

(Svetashova, 2015, p. 90). Interpreting these findings, she groups the items into the

following four clusters visualized in Figure 2.3 (p. 18):

1. IdTd – difficult items in difficult text passages

2. IdTe – difficult items in easy text passages

3. IeTd – easy items in difficult text passages

4. IeTe – easy items in easy text passages.

The classification based on these four classes (referred to as 2DimClustering) sub-

stantially outperformed the classification based on the fourfold difficulty continuum

split (referred to as EqualRanges). As an example, the RF classification resulted in

a micro-averaged F1 score of 0.7959 using 2DimClustering, and 0.4747 using Equal-

Ranges. Svetashova (2015) experimented with both clustering approaches in order to

be able to compare her results to previous work in the field. The results for the fourfold

continuum split (F1-scores around 0.49 for RF) are comparable to those of Beinborn

et al. (2014), who report an SVM leave-one-out cross validation accuracy of 0.46.

Step 4: Investigate variable importance and feature groups. The performance of features can be analysed in two ways: the investigation of variable importance and the evaluation of the overall performance of distinct feature groups.

Figure 2.4 shows the results of extracting the variable importance after the RF and

SVM classification. The two most important features in RF were item-level features

based on the term frequency–inverse document frequency. The unigram log probability,

the number of candidate unigrams with a higher log probability and information about

topic specificity follow closely as highly predictive features. Figure 2.4 shows how

many features out of each feature group were part of the top 50 features. The features


Figure 2.2: The variable factor map for the performance profile that worked best. The variables logitA1-C2 and irtDifficulty denote performance on items, whereas the logitTextAver variables denote performance on text passages (Svetashova, 2015, p. 90).

that performed best were those on text-level (especially lexical complexity) and item-

level (especially candidate space features and psycholinguistic characteristics).
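A minimal sketch of the two classifiers and of the variable-importance extraction is given below; the objects features and cluster are assumed placeholders for the item-by-feature matrix and the 2DimClustering labels, and the calls are not Svetashova's original code:

    # Random Forest and SVM on the difficulty features (illustrative only).
    library(randomForest)
    library(e1071)

    rf      <- randomForest(x = features, y = as.factor(cluster),
                            ntree = 500, importance = TRUE)
    svm_fit <- svm(x = features, y = as.factor(cluster), kernel = "radial")

    # Mean decrease in accuracy per feature, sorted from most to least important
    imp <- importance(rf, type = 1)
    head(imp[order(imp[, 1], decreasing = TRUE), , drop = FALSE], 10)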

The performance of single feature groups and feature group combinations has also

been tested and the results are summarized in Table 2.3 for both machine learning

algorithms. The best group performance is achieved by the combination of text-level

and item-level features for SVM (mic. F1-score 0.7959) as well as for RF (0.7347).

This combination even outperforms the performance of the total feature set. She fur-

ther showed that reducing the number of features from 265 to 30 does not lead to a

substantial performance loss.

Svetashova (2015) further carries out experiments on unseen data by training the

RF model on the whole dataset and testing it on 6 unseen texts that had been initially

excluded due to a relatively low number of participants. The classification led to a

cross validation accuracy of 0.8088. It turned out that textual features are comparable

within the group of items in difficult texts (TdId, TdIe) and within the group of items in

easy texts (TeId, TeIe). Correspondingly, the item-based features are similar within the

group of difficult items (TdId, TeId) and within the group of easy items (TdIe, TeIe).

Step 5: Draw a conclusion.

Svetashova (2015) concludes that the mentioned findings are relevant for the actual

generation of C-Tests and that the pipeline could be integrated into a user applica-

tion, supporting language teachers in inspecting item and text difficulties. In particular, the


Figure 2.3: The individuals factor map for the performance profile that worked best (Svetashova, 2015, p. 91).

Figure 2.4: This figure shows the distribution of the top 50 features for the 2DimClustering (two dimensions: text and item difficulty) for SVM and RF (Svetashova, 2015, p. 102).


Table 2.3: This table shows how the different feature groups and feature group combinations perform in classification (2DimClustering). (Table taken from Svetashova (2015, p. 107))

experiments on the unseen data give evidence that the findings are generalizable. As

a step towards interpreting the results, she presents the prediction results of one test

passage visualized in Figure 2.5. The underlying text was classified as easy. The colors

indicate whether the single items were classified as easy (green) or difficult (red). The

number labels represent the percentage of participants that actually inserted the gap

correctly. The only item that was classified incorrectly as easy is "one", which was

filled in correctly by only 23.21% of the participants.

Figure 2.5: This figure shows a text where each item is highlighted according to the prediction results (red for predicted as difficult item, green for predicted as easy item). The numbers indicate the percentage of correct matches (Figure taken from Svetashova (2015, p. 119))

She further suggests the following:

1. "text passage properties and the characteristics of the individual gaps" could be

shown to the user.

2. One could "accumulate information about the performance of the testees at different proficiency levels, their typical errors and acquisition order of the elements of different levels: word, sentence and broader context."

Within this thesis, we will move a step closer to this target by integrating the linguis-

tic preprocessing presented by Svetashova (2015) into one single pipeline. This will be


done focusing on those features with high information gain. We will further extract the

Spanish performance data and implement a broad range of features also for Spanish in

order to gain insights from a multi-lingual perspective.

3 Data

The data under investigation was provided by the Language Learning Centre (Fachsprachenzentrum, FSZ) of the University of Tübingen2. The FSZ offers foreign language

courses for different languages at different levels of proficiency. The four most requested

languages are English, Spanish, French, and Italian. For those languages they conduct

C-Tests in order to assign the students to appropriate course levels. The following will

first describe the testing procedure and then present the descriptive analysis of the data

under consideration.

3.1 C-Tests at the Fachsprachenzentrum

At the FSZ a C-Test consists of six text passages covering different topics. Only five

out of six passages are actually used for scoring and one of them is used for calibration.

Each short text starts with a header and a first sentence not containing any gaps.

Starting at the second sentence, one half of every second word is gapped. For words

with an odd number of letters, the gapped part is bigger. Words consisting of one letter

are ignored and never blanked. After 25 gaps, the current sentence is completed and

one more untouched sentence is added. Words that contain hyphens are counted as two

words. Additionally, there are some language specific rules concerning the tokenization:

For English a word which contains an apostrophe is counted as one word, whereas in

the other languages they count as two words, e.g.:

• Peter didn't know that. → Peter did___ know th__.

• Qu'est-ce que c'est? → Qu'e__-ce q__ c'e__?
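The gapping rule itself is easy to implement; the following R sketch covers only the core per-word rule (keep the first half, remove the bigger half for words with an odd number of letters) and leaves out the language-specific tokenization details and the 25-gap limit:

    # Sketch of the C-Test gapping rule (gaps shown as underscores for illustration).
    gap_word <- function(word) {
      n    <- nchar(word)
      keep <- n %/% 2                 # kept prefix; the removed part is ceiling(n/2)
      paste0(substr(word, 1, keep), strrep("_", n - keep))
    }

    gap_sentence <- function(sentence) {
      words    <- strsplit(sentence, " ")[[1]]
      eligible <- which(nchar(words) > 1)                   # one-letter words are never gapped
      to_gap   <- eligible[seq_along(eligible) %% 2 == 0]   # every second eligible word
      words[to_gap] <- vapply(words[to_gap], gap_word, character(1))
      paste(words, collapse = " ")
    }

    gap_sentence("Peter didn't know that")
    # "Peter did___ know th__"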

The tests were generated manually by course lecturers and then conducted electron-

ically on computers at the FSZ. Each test taker has 30 minutes to complete the test.

3.2 Descriptive Analysis of the Data

The FSZ database was provided as a MySQL dump file which contains an updated

version of the database investigated by Svetashova (2015). We will first describe the

2 https://www.uni-tuebingen.de/en/facilities/verwaltung-dezernate/division-iii-international-affairs/language-learning-centre/termine-teilnahmebedingungen/einstufungstests/c-tests.html (Last accessed: 18/02/08)


structure of the database and then consider the available content for English and Span-

ish.

3.2.1 Database Structure

Svetashova (2015) describes the database structure in detail. The following summarizes this structure by listing the database's three tables ctest, ctest_texte, and tmp, followed by the relevant information obtained from each table.

• ctest (tests): test id, language, ids of the six text passages, id of text that was

used for calibration

• ctest_texte (text passages): text id, language, the text with gapped parts marked by square brackets

• tmp (results): result id, student id, test id, language, answers for each gap number

(1-25), final score (0-125)

Table ctest contains the information about single tests that were conducted by mul-

tiple students. It stores the text ids of each of the six text passages which make up the

test and the id of the text used for calibration. Table ctest_texte contains the single

text passages as plain text, where the gapped parts are enclosed in brackets. Therefore,

correct answers can be extracted directly from the texts. The tmp table contains infor-

mation about the single test trials including the student id, the answers for each gap

number and the student’s final score. The final score represents the number of correctly

filled gaps in five of the six texts and can therefore range from 0 to 125. Svetashova (2015, p. 33) lists how this score is mapped to the CEFR levels A1 to C2 defined by the Council of Europe (cf. Section 1, p. 1) and to UNIcert® course recommendation levels.3
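Since the gapped parts are enclosed in square brackets, the correct solutions can be read directly from the stored texts. The following R sketch illustrates this; the connection parameters and the exact column names (id, language, text) are assumptions, and the actual extraction in this thesis is done with the adapted Java tool described in Section 4:

    # Reading the text passages and extracting the bracketed solutions (illustrative;
    # connection details and column names are assumptions).
    library(DBI)

    con   <- dbConnect(RMariaDB::MariaDB(), dbname = "fsz",
                       user = "reader", password = Sys.getenv("FSZ_DB_PASSWORD"))
    texts <- dbGetQuery(con, "SELECT id, language, text FROM ctest_texte")
    dbDisconnect(con)

    # The gapped parts are enclosed in square brackets, e.g. "th[ey]".
    solutions <- regmatches(texts$text, gregexpr("\\[[^]]*\\]", texts$text))
    solutions <- lapply(solutions, function(x) gsub("\\[|\\]", "", x))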

3.2.2 Available Data for English and Spanish

We extracted all available test results for English and Spanish from the database in

order to investigate the students' performances. This was done following Svetashova's (2015) method.4

As illustrated in Table 3.1, the earlier version of the FSZ database (English 2015)

contained the results of 1399 participants who took C-Tests in English. In total, 8394

text passages were processed by these participants. Svetashova (2015) performed the

item difficulty analysis only on those texts with a number of participants greater than

140 and therefore considered 7725 answers on 20 texts. The newer version of the

3 http://unicert-online.org/en/unicert-stufen
4 Yulia Svetashova kindly provided a tool to extract the English C-Test performance data from the database. We reused it and adapted it to further extract the Spanish data.


database contains answers of 4006 participants for English, who overall processed more

than 24000 texts (resulting in 106 distinct texts). We will perform item difficulty

analysis on the whole dataset as well as on the subset of texts that had been processed

by more than 140 participants.

The data available for Spanish contains the performances of 1570 participants, who

overall processed more than 9420 texts. These numbers are comparable to the numbers

given in the English 2015 version, despite of the number of distinct texts. Compared to

the English 2015 data, the Spanish data contains more than twice as many processed

texts, whereas the number of participants is only slightly higher (1570 for Spanish, 1399

for English 2015).

                                      English 2015   English 2017   Spanish 2017
    participants                              1399           4006           1570
    texts processed in total                  8394          24036           9420
       distinct texts                           45            106             91
    texts processed by > 140 part.            7725          22146           8358
       distinct texts                           20             50             45
    texts processed by <= 140 part.            669           1890           1062
       distinct texts                           15             56             46

Table 3.1: The number of participants and the number of processed texts given in the FSZ database for English (versions 2015 and 2017) and Spanish.

Table 3.2 shows how many students participated in each test and how well they

performed on it on average. For the English C-Tests with a reasonable number of test

takers (excluding tests 80 and 115) the students’ performances range from a score of 0

to a score of at most 121. The mean scores range from 60.0 (test 116) to 73.1 (test 90).

For the 10 Spanish tests the performances range from a score of 0 to a score of at most

116. For English as well as for Spanish, none of the test takers achieved the maximum

score of 125. The statistics show poorer mean performance scores for Spanish than for

English, ranging from 43.1 (test 128) to at most 57.5 (test 83).

4 Performance Modeling

This Section describes the performance of the test takers of Spanish and English C-

Tests.

The performance of C-Test takers can be modeled by considering different variables.

As described in Section 2.4.2, Svetashova (2015) extracted various variables from the

database. She found out that variables which are based on item and text level diffi-

culty across proficiency levels performed well in later clustering and machine learning

approaches. We will further add the test takers’ performance on sentence level across


    EN
    test id      N    Min   Max   Mean
    80           5     46    74   62.4
    81         432      1   121   68.6
    90         373      0   114   73.1
    91         486      0   120   70.4
    95         422      0   117   62.6
    96         487      0   120   61.1
    106        374      0   117   63.4
    111        444      0   111   60.0
    115          3     62    78   72.0
    116        426      4   111   60.0
    120        428      0   113   64.2
    125        126      0   106   60.5

    ES
    test id      N    Min   Max   Mean
    83         178      0   116   57.5
    87         164      3   103   53.7
    94         191      0   106   55.2
    103        159      0   113   48.0
    104        185      2   112   47.9
    110        145      0   101   49.5
    114        176      0   115   50.5
    119        132      0    98   47.4
    123        191      0   110   53.1
    128         49      0    95   43.1

Table 3.2: Conducted C-Tests with the number of test takers (N) and information about their performance score (min, max, mean) for the English and Spanish data.

proficiency levels to the performance statistics. The Java tool which extracts this in-

formation from the MySQL database was adapted from Svetashova (2015).5

In the following we will first describe the performance data in general and then

present the clustering approach and results.

4.1 Performance Data Description

For each item with a number of participants higher than 140 we extracted its text

id, item id, sent id and test id from the database. For each item, the following 21

performance variables were inspected (a short computational sketch follows the list):

• item level:

– percCorrect : percentage of correct insertions for the item under considera-

tion

– percCorrectA1, . . . , percCorrectC2 : percentage of correct insertions within

proficiency level A1, A2, . . .

• sentence level:

– SentAv : average of item-wise percentages of correct insertions in the sentence

– SentAvA1, . . . , SentAvC2 : average of item-wise percentages of correct in-

sertions within proficiency level A1, A2, . . .

• text level:

5 Yulia Svetashova kindly provided the code to extract the performance statistics from the database. She further modified the code to also extract performance information on sentence level.


– TextAv : average of item-wise percentages of correct insertions

– TextAvA1, . . . , TextAvC2 : average of item-wise percentages of correct in-

sertions within proficiency level A1, A2, . . .
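A compact way to derive the item-, sentence-, and text-level averages from the raw answer data is sketched below; the data frame answers with columns item_id, sent_id, text_id and correct is an assumption and does not mirror the adapted Java tool:

    # Item-, sentence-, and text-level averages of correct insertions (illustrative).
    library(dplyr)

    item_stats <- answers %>%
      group_by(text_id, sent_id, item_id) %>%
      summarise(percCorrect = 100 * mean(correct), .groups = "drop")

    sent_stats <- item_stats %>%
      group_by(text_id, sent_id) %>%
      summarise(sentAv = mean(percCorrect), .groups = "drop")

    text_stats <- item_stats %>%
      group_by(text_id) %>%
      summarise(textAv = mean(percCorrect), .groups = "drop")

    performance <- item_stats %>%
      left_join(sent_stats, by = c("text_id", "sent_id")) %>%
      left_join(text_stats, by = "text_id")
    # The A1-C2 variants are obtained in the same way after restricting the answers
    # to test takers of the respective proficiency level.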

We further inspected the distribution of participants across different proficiency lev-

els for each text. The charts in Figure 4.1 show the percentages of participants per

proficiency level. The English C-Test text passages had been processed by a smaller

proportion of beginning learners (A1 and A2) than the Spanish C-Tests. For English,

the proportion of A1 level learners ranges from 4% to 16%, whereas for Spanish it

ranges from 9% to 31%. This phenomenon is even more pronounced for the proportion

of A2 learners, which ranges from 11% to 27% for English and from 24% to 46% for

Spanish. Thus, assuming that the levels A1 and A2 describe beginning learners, about

half of the Spanish C-Test participants turned out to be beginning learners.

Figure 4.1: These charts visualize the distribution of the participants' proficiency levels for each text passage (left chart for 56 English texts, right chart for 46 Spanish texts). The proficiency levels are defined by the participants' final C-Test score.

The distribution of the participants’ proficiency levels can also be considered item-

wise instead of text-wise. Figure 4.2 shows how the distribution across the participants’

proficiency levels changes for items with increasing difficulty in terms of percCorrect

(overall percentage of correct insertions). For both languages, the distribution across

proficiency levels is less clear for difficult items than for easy items. The proportion of

correct insertions by the more proficient learners (C2) increases with increasing item

difficulty.

Considering only those texts with at least 140 participants, these are

the three easiest items in terms of percCorrect (percentage of exact matches):

• 466 4: This yea[r’s] outdoor coo[king] season mi[ght] be ov[er], but (97.22%)

• 584 02: The pol[ice] said o[n] Monday th[ey] were sear[ching] for (97.07%)

• 470 01: At t[he] same ti[me], four o[ut] of fi[ve] (96.99%)


Figure 4.2: These 100% stacked bar charts visualize the distribution of the participants' proficiency levels for each item, where items are sorted by percCorrect (overall percentage of correct insertions) from left (easy) to right (difficult).

The three most difficult items in terms of exact matches (excluding those with apos-

trophe encoding issues) are the following:

• 601 16: continue t[o] grow a[t] really stagg[erring] rates (0.0%)

• 586 24: i[n] the abs[ence] of t[he] firm’s br[ash] and bril[liant] co-founder (0.0%)

• 581 25: fulfi[lling] the long[held] dream of the United States (0.63%)

For Spanish, there are 14 out of 1125 items in total with no correct answers at all, such

as the following three:

• 648 15: pa[ra] fortalecer a[sí] el sist[ema] inmunológico y hac[erse] más fue[rte] frente a l[as] infecciones vír[icas]. (0.0%)

• 455 17: Un entr[amado] de len[guas], culturas, relig[iones] y pue[blos] (0.0%)

• 501 04: La elec[ción] de l[as] familias argen[tinas] se vo[lcó] más ha[cia] la seg[unda] alternativa. (0.0%)

All of the ten easiest items represent the determiners "los" and "la" or the preposition "de", e.g.:

• 454 05: que l[os] mosquitos n[o] pican (97.2%)

• 456 05: mien[tras] que l[a] segunda enfa[tiza] la nac[ión] (96.6%)

• 456 08: origen d[e] la sobe[ranía] (94.9%)

• 455 12: que a l[o] largo d[e] la hist[oria] han trans[itado] (94.4%)


IeSeTe easy item in easy sentence and easy text

IeSeTd easy item in easy sentence but difficult text

IeSdTe easy item in difficult sentence but easy text

IeSdTd easy item in difficult sentence and difficult text

IdSeTe difficult item in easy sentence and easy text

IdSeTd difficult item in easy sentence but difficult text

IdSdTe difficult item in difficult sentence but easy text

IdSdTd difficult item in difficult sentence and difficult text

Table 4.1: Theoretically possible item difficulty properties based on the three locality dimensions item (I), sentence (S), and text (T). The units are either considered easy (e) or difficult (d).

4.2 Clustering Approach

To investigate different possible groupings of items, we follow Svetashova (2015) and

perform Hierarchical Clustering on Principal Components (HCPC) using R (R Devel-

opment Core Team, 2008) and the package FactoMineR (Le et al., 2008)6. This means

that first Principal Component Analysis (PCA) is performed in order to transform

the whole set of variables into a set of linearly uncorrelated variables. Afterwards,

Hierarchical Clustering is applied to group the items.
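The two-step procedure translates directly into FactoMineR calls; perf below is assumed to be the item-by-variable matrix of the 21 performance variables:

    # PCA followed by hierarchical clustering on the principal components
    # (a minimal sketch, not the exact code underlying Figures 4.3-4.6).
    library(FactoMineR)

    res_pca  <- PCA(perf, scale.unit = TRUE, graph = FALSE)
    res_hcpc <- HCPC(res_pca, nb.clust = 4, graph = FALSE)   # e.g. four clusters

    head(res_hcpc$data.clust$clust)   # cluster membership per item
    head(res_pca$var$coord)           # coordinates for the variables factor map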

As the performance variables can be grouped into item, sentence, and text level

information, we can describe items according to these three dimensions. Considering

items (I ), sentences (S ), and texts (T ) as either easy (e) or difficult (d) leads to 8

clusters an item can belong to (cf. Table 4.1).

We performed HCPC on the full set of 21 performance variables and experimented

with the number of clusters ranging from four to eight. The number of clusters does not have a notable impact on the resulting variable factor maps, which all look similar to the four-cluster setting shown in Figures 4.3 and 4.4 for the given language.

In both languages, the first two principal components describe more than 70% of the

variance. Variables on item and text level are perpendicular to each other, which

suggests that they encode different information. Those on sentence level are in between

them and correlate more with text level than with item level variables. For Spanish, the

variable distinction by different proficiency levels does not contribute notable information at any level of locality (Figure 4.4). This is different within the English performance

data, where the different proficiency level variables are not relevant on text level, but

bring more information on sentence and item level. Thus, for English, inspecting the

proficiency levels is more interesting on more local levels than on more global levels

(Figure 4.3).

Experimenting with the number of clusters, we expected a clear-cut individuals factor

map as found for the data in Svetashova (2015) (cf. individuals factor map in Figure

6 The clustering of the performance data is based on R code kindly provided by Yulia Svetashova.


Figure 4.3: English: Variables factor map (PCA) using all 21 performance variables and 4 clusters (Dim 1: 49.09%, Dim 2: 23.19%).

Figure 4.4: Spanish: Variables factor map (PCA) using all 21 performance variables and 4 clusters (Dim 1: 46.65%, Dim 2: 23.86%).

2.3 on p. 18). In contrast, the performance profile consisting of the two variables textAv and percCorrect did not result in a clear-cut individuals factor map when using four clusters. However, it resulted in a clear-cut map using five clusters for both languages.

For the Spanish data, Cluster 1 represents items in very difficult texts (Figure 4.5) in

terms of textAv. Thus, within very difficult texts, the clustering does not distinguish

between difficult and easy items. However, with decreasing text difficulty, one can

distinguish between easy items (clusters 2 and 3) and difficult items (clusters 4 and 5).

Figure 4.5: Spanish: Individuals and variables factor map using the two performance variables textAv and percCorrect (Dim 1: 59.68%, Dim 2: 40.32%).

If we follow the same clustering approach for English, the individuals factor map


looks as shown in Figure 4.6. We can distinguish between easy and difficult items (in

terms of percCorrect) for texts of low difficulty (in terms of textAv). Items in very easy

texts are grouped into one cluster (cluster 5). With higher textual difficulty one can

distinguish between easy items (clusters 2 and 4) and difficult items (clusters 1 and 3). We

presume that the difference in the clustering results is not caused by language-specific

differences, but is rather dataset-specific.

Figure 4.6: English: Individuals and variables factor map using the two performance variables textAv and percCorrect (Dim 1: 61.84%, Dim 2: 38.16%).

5 Difficulty Modeling

Modeling the difficulty of C-Test text passages and individual items is a necessary step towards controlling the difficulty of whole C-Tests. As a step towards integrating the difficulty prediction into a single tool, we implemented all features in one processing pipeline. We make use of the Java framework UIMA (Unstructured Information Management Architecture) to process raw C-Test text passages extracted from the FSZ database (Ferrucci et al., 2009). UIMA can be used to implement single processing components, so-called Analysis Engines (AEs), such as a tokenizer, a part-of-speech tagger, or a feature extractor. The single components can then be chained together. In this way, structure is given to unstructured content, while UIMA takes care of the data flow and supports the programmer in configuring and running the pipeline.

Following Svetashova (2015), the difficulty of a C-Test text passage can be modeled

on different levels of locality:

• difficulty prediction of single items

• difficulty prediction of single sentences


• difficulty prediction of the whole text passage

Depending on the level of locality, different kinds of linguistic features can be implemented: lexical features based on frequency lists or other psycholinguistic databases, as well as readability, syntactic, and discourse features. The following describes the implemented features for English and Spanish.

Every feature is calculated item-wise: on sentence level, all items in one sentence have the same feature value; on text level, all items in one text have the same feature value.

5.1 Linguistic preprocessing

The UIMA pipeline visualized in Figure 5.1 is used to annotate C-Test text passages, as

extracted from the FSZ database, with different linguistic information. The database

stores text passages as strings with the correct solutions enclosed in brackets. All UIMA

annotations contain at least two features: the start and the end index of the span being

annotated. The indices of all annotations in the pipeline refer to the indices in the

original database text with the bracket structure. Since NLP tools, e.g. for sentence

segmentation or tokenization, require input without brackets, we delete brackets for the

relevant steps and reinsert them afterwards to ensure correct indices. The final UIMA

type that is needed to encode relevant item information is the C-TestToken type. It

differs from usual NLP tokens since the C-Test passages require their own tokenization

rules: NLP tokenizers often split words that are not separated by whitespace due to linguistic circumstances (cf. the example "didn't" in Table 5.1). A C-TestToken annotation has the following features: it stores whether the C-TestToken contains a gap (isItem) and what the intended word, the base, the ending, the lemma, and the POS are. Furthermore, the position of the token in the text and in the sentence is stored.
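As a data structure, the C-TestToken information from Table 5.1 can be pictured as a simple record. The following is only an illustrative plain-Java sketch, not the thesis's actual UIMA type definition (which would be generated from a UIMA type system descriptor); field names follow the feature names in Table 5.1.

/** Illustrative plain-Java view of the C-TestToken features from Table 5.1. */
public class CTestTokenInfo {
    boolean isItem;        // is this token one of the 25 test items?
    String intendedWord;   // e.g. "pipelines"
    String base;           // the given part, e.g. "pipe"
    String ending;         // the part to be restored, e.g. "lines"
    String lemma;          // e.g. "pipelines"
    String pos;            // e.g. "NNS"
    int tokenId;           // position of the token in the text (counted in tokens)
    int tokenInSentId;     // position of the token in the sentence (counted in tokens)
    String mergedString;   // e.g. "did_n't" if two NLP tokens were merged, otherwise null
}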

SentenceAnnotator → TokenAnnotator → PosAnnotator → LemmaAnnotator → C-TestTokenAnnotator

Figure 5.1: The UIMA pipeline that chains multiple NLP analysis engines together in order to annotate C-TestTokens.
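To make the pipeline assembly concrete, the following is a minimal uimaFIT-style sketch of how such a chain of Analysis Engines can be run over one passage. It is not the thesis's code: the DummyAnnotator stands in for the five components of Figure 5.1, and the way passages are read from the FSZ database is simplified to a hard-coded string.

import org.apache.uima.analysis_engine.AnalysisEngineDescription;
import org.apache.uima.fit.component.JCasAnnotator_ImplBase;
import org.apache.uima.fit.factory.AnalysisEngineFactory;
import org.apache.uima.fit.factory.JCasFactory;
import org.apache.uima.fit.pipeline.SimplePipeline;
import org.apache.uima.jcas.JCas;

public class CTestPipelineSketch {

    /** Stand-in for one Analysis Engine (e.g. the SentenceAnnotator); it only prints the text. */
    public static class DummyAnnotator extends JCasAnnotator_ImplBase {
        @Override
        public void process(JCas jcas) {
            System.out.println("Processing: " + jcas.getDocumentText());
        }
    }

    public static void main(String[] args) throws Exception {
        // One raw passage as stored in the database: solutions enclosed in brackets.
        JCas jcas = JCasFactory.createJCas();
        jcas.setDocumentLanguage("en");
        jcas.setDocumentText("The pipe[lines] did[n't] need any maintenance.");

        // In the thesis, five engines are chained: sentence, token, POS, lemma, C-TestToken.
        AnalysisEngineDescription step = AnalysisEngineFactory.createEngineDescription(DummyAnnotator.class);
        SimplePipeline.runPipeline(jcas, step, step, step, step, step);
    }
}

Each real annotator would add its own annotation layer to the CAS; UIMA then handles the data flow between the chained engines.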

Sentence segmentation and tokenization were carried out using the Apache OpenNLP Java library.7

English part-of-speech (POS) tagging was also done using OpenNLP, resulting in POS tags from the Penn Treebank Tagset. For Spanish, we used a Maximum Entropy Tagger from the Stanford NLP Group8 in order to get simplified tags from the Ancora Corpus Tagset9 (Toutanova et al., 2003). We further transformed the simplified tags into the Parole Reduced Tagset, a set of 66 tags.10

7 https://opennlp.apache.org/ (Last accessed: 18/02/08). English models were available for OpenNLP version 1.6; the Spanish models only for version 1.4.
8 https://nlp.stanford.edu/ (Last accessed: 18/02/08)

feature | description | example "pipe[lines]" | example "did[n't]"
isItem | indicates whether this token is one of the 25 test items | true | true
intendedWord | the actually intended word | pipelines | didn't
base | the item's base, if it is an item | pipe | did
ending | the item's ending, if it is an item | lines | n't
lemma | the lemma of the intended token | pipelines | do not
pos | the POS of the intended token | NNS | VBD RB
tokenId | the position of the token in the text (counted in tokens) | 65 | 45
tokenInSentId | the position of the token in the sentence (counted in tokens) | 3 | 8
mergedString | the merged string of the intended word if the C-TestToken is the result of two individual NLP tokens; this string contains an underscore and can be used to see whether and how the NLP tokenizer would have split the word | - | did_n't

Table 5.1: The features of the C-TestToken type as implemented in our UIMA preprocessing pipeline.

Lemmatization was done using OpenNLP's Simple Lemmatizer. For Spanish, a dictionary11 developed in the OpeNER project was adapted by transforming the tags into the Parole Reduced Tagset (García-Pablos et al., 2013).

The English constituency parser is a probabilistic context-free grammar (PCFG)

parser implemented by the Stanford NLP Group.

Syntactic dependencies were annotated using a transition-based neural-network dependency parser implemented by the Stanford NLP Group (Chen and Manning, 2014). We used the Universal Dependency representation, which is available for both languages (Nivre et al., 2016).
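For orientation, a standalone sketch of obtaining such dependency analyses with the Stanford CoreNLP distribution is shown below. The thesis wires the parser into its UIMA pipeline instead, so the setup here (annotator list, default English models) is only an assumption for illustration.

import java.util.Properties;
import edu.stanford.nlp.pipeline.CoreDocument;
import edu.stanford.nlp.pipeline.CoreSentence;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.semgraph.SemanticGraph;

public class DependencyDemo {
    public static void main(String[] args) {
        // Tokenization, sentence splitting and POS tagging feed the neural dependency parser.
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit,pos,depparse");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        CoreDocument doc = new CoreDocument("The pipelines did not need any maintenance.");
        pipeline.annotate(doc);

        for (CoreSentence sentence : doc.sentences()) {
            SemanticGraph deps = sentence.dependencyParse();
            // Each edge corresponds to one dependency relation of the sentence.
            System.out.println(deps.toList());
        }
    }
}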

5.2 Modeling the difficulty of English C-Tests

The English C-Test difficulty model is based on the findings of Svetashova (2015), focusing on the 70 most predictive features in her experiments. We further integrated complexity measures that were already implemented in a similar pipeline described

9 Stanford's simplified Ancora tagset: https://nlp.stanford.edu/software/spanish-faq.shtml (Last accessed: 18/02/08)
10 See http://www.cs.upc.edu/~nlp/SVMTool/parole.html for a list of the reduced tags (Last accessed: 18/02/08)
11 The dictionary is available at https://github.com/opener-project/pos-tagger-en-es/tree/master/core/src/main/resources (Last accessed: 18/02/08)


in Chen and Meurers (2016). Additionally, the pipeline contains features based on the Dependency Locality Theory (DLT) as described by Weiß (2017).12 The following describes the 155 implemented features, grouping them according to the locality they span.

5.2.1 Item Level

We integrated features from all item level groups described by Svetashova (2015) and summarized in Section 2.4.2: features capturing linguistic properties (LING), psycholinguistic features (PSY), context and candidate space features (CONT), and position based features (POSIT). Some of them were implemented just as described in her work; some needed modifications in order to be integrable into the UIMA pipeline. We also added further features in order to allow for a reasonable comparison between Spanish and English features.

Linguistic Features (Ling): The linguistic features that performed well in Svetashova (2015) contain surface based features describing the number of letters to be restored (endingLength) and the number of letters in the intended token's lemma (lemmaLength). Furthermore, the POS of the intended token is captured by the boolean feature isContentWord, which indicates whether the POS describes a content word or not. We consider nouns, verbs, adjectives, adverbs, and cardinal numbers to be content words.13 A further feature is the number of dependencies the item has in the sentence (numDependencies).

Psycholinguistic Features (Psy): We included the term frequency-inverse document frequency (TFIDF) feature, which is based on frequency counts of the tokens in the document and in a collection of documents. This feature was developed in the context of Information Retrieval in order to describe a term's importance throughout a collection of documents. Following Svetashova (2015), we also consider the term frequency (i.e., the number of occurrences of the term divided by the total number of words, TF), the document frequency (i.e., the number of documents that contain the term, DF), and the inverse document frequency (IDF) as single features. We further added the number of occurrences of the term in the document (TermOccInDoc) as a feature. The document collection is a corpus of 45 C-Test text passages and 88 texts from the Brown Corpus compiled by Svetashova (2015).
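Using these definitions, the frequency features can be written out as follows (a standard tf-idf formulation; the exact normalization and logarithm base of the implementation are not spelled out in the text and are assumptions):

\[
\mathit{TF}(t,d) = \frac{\mathrm{count}(t,d)}{\text{number of words in } d}, \qquad
\mathit{DF}(t) = |\{d' \in D : t \in d'\}|, \qquad
\mathit{IDF}(t) = \log \frac{|D|}{\mathit{DF}(t)}, \qquad
\mathit{TFIDF}(t,d) = \mathit{TF}(t,d) \cdot \mathit{IDF}(t)
\]

Here D is the document collection described above (45 C-Test passages plus 88 Brown Corpus texts).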

Other psycholinguistic features are semantic variables based on the machine-usable MRC dictionary (Wilson, 1988), which provides up to 26 linguistic and psycholinguistic attributes for more than 150,000 words. We adapted features based on the following attributes: age of acquisition (Sem AoA) (Kuperman et al., 2012), concreteness (Sem Concr), familiarity (Sem Fam), and meaningfulness (Sem Meani), since they performed well in experiments by Svetashova (2015). MRC also contains an attribute that indicates in how many of the 15 stylistic categories of the Brown Corpus the word can be found (StylNCatsMRC).14

12 Zarah Weiß kindly provided her Java code to extract DLT features for German, which we then adapted to English and Spanish.
13 Penn Treebank tags starting with N, V, J, R, and CD.

In order to check whether a word belongs to a high-frequency class, Svetashova (2015) uses a web interface provided by the ADELEX research project.15 The tool extracts a feature that indicates whether a token is contained in the 1000 most frequent words of a collection of corpora. This is based on the idea of Lexical Frequency Profiles developed by Laufer and Nation (1995). We implemented a similar feature that checks whether the word is contained in the 1000 most frequent words of the SUBTLEX-US corpus (Lfp Band1) (Brysbaert et al., 2012).

The AoA database16 contains information about the lemma and POS of given words in the SUBTLEX-US corpus (Kuperman et al., 2012). We processed this list in order to ensure a reasonable mapping to Penn Treebank tags.17 Following Svetashova (2015), we implement two features that indicate (1) whether a token is the most frequent token of its lemma (isLemmasMostFreqTok) and (2) whether the POS tag is the token's most frequent tag (isTokensMostFreqTag).18

Position based Features (Posit): This set of 8 features contains information about the item's position and its occurrences in the text. We measure the position in terms of the token number (Number Token) and the gap number (Number Gap). In order to include information about additional occurrences of the intended word in the text, we implement features measuring the distance between the gap and the previous occurrence of the intended word or the intended word's lemma (distancePrevMention Token and distancePrevMention Lemma). Two boolean features indicate whether the intended token occurs in the non-mutilated parts (isInClosing and isInStartOrClosing). Svetashova (2015) further follows Beinborn et al. (2014) and suggests features measuring the difficulty of the previous gap by considering its unigram and trigram probabilities (previousItemUnigramProb and previousItemTrigramProb). We follow Svetashova (2015) and look them up in the Web1T corpus.
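A minimal sketch of the distance idea: given the token sequence and the gap position, the distance to the previous mention is a backwards scan. Method and variable names below are illustrative, not the thesis's actual code.

import java.util.List;

public class PositionFeatures {

    /**
     * Distance (in tokens) between the gap at position gapIndex and the previous
     * occurrence of the intended word; -1 if there is no previous mention.
     */
    public static int distancePrevMention(List<String> tokens, int gapIndex, String intendedWord) {
        for (int i = gapIndex - 1; i >= 0; i--) {
            if (tokens.get(i).equalsIgnoreCase(intendedWord)) {
                return gapIndex - i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("the", "test", "was", "a", "test", "again");
        System.out.println(distancePrevMention(tokens, 4, "test")); // prints 3
    }
}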

Context and Candidate Space Features (Cont): We also integrated some of the context and candidate space features presented by Svetashova (2015) into the pipeline. The context features are based on Web1T, a Google n-gram corpus containing uni- to five-grams and their observed frequency counts extracted from texts of about 1 trillion word tokens (Brants and Franz, 2006).19 For each item, the uni- to five-grams at the left and right side of the token are considered. Thus, a span of 9 tokens is under consideration, resulting in 15 features. Furthermore, the maximum n-gram log probability of an item represents a feature (maxNgram).

14 We noticed missing values for some words and extracted our own counts from the Brown Corpus using Python's Natural Language Toolkit (NLTK), resulting in the feature StylNCatsNLTK (Bird and Loper, 2006).
15 Project website: http://www.ugr.es/~inped/ada/ (Last accessed: 18/02/08)
16 The database including lemma and POS information is available at http://crr.ugent.be/archives/806 (Last accessed: 18/02/08)
17 We merged determiners and articles and treated numbers as adjectives. Items where the tag was "unclassified" or represented by a number were deleted from the list.
18 If the word was not on the list, it was assumed that it is not the token's most frequent POS.

The motivation behind the candidate features is that in some cases the actually

intended token is less frequent in the given context than another word which would

also fit into the gap according to its base and the length constraint. Svetashova (2015)

considers all words in the psycholinguistic database MRC which have the same base

and match the length constraint as competing candidates Wilson (1988). One feature

simply counts the number of possible endings (numPotentialEndings). The n-gram

log probabilities for the actual solution and the candidate solutions have also been

extracted from the Web1T corpus. The features consider the unigram and all bigrams

and trigrams that contain the item under consideration. Thus, the n-grams to the left

and right side of the token are considered. The features that span these 1- to 3-gram

windows can be grouped into three different subsets and finally result in 18 features

(Svetashova, 2015).

• Ngram Cands delta (double): the difference between the n-gram probability of an item and its competing candidate with the highest n-gram probability

• Ngram Cands hasBigger (boolean): true if there exist candidate n-grams that have a bigger log probability

• Ngram Cands weakness (int): the number of candidates with a bigger n-gram probability

In order to have features that are comparable to Spanish, we implemented another

feature set based on the SUBTLEX databases which contain word frequencies based on

film subtitles. The SUBTLEX-US database contains 74,286 American English words

and the corresponding per million word frequencies. This word list was used instead

of MRC to generate a map that maps item bases to possibly intended words. The

number of candidates again represents a feature (subtlex numCandidates). We then

computed one unigram feature for each of the mentioned three ideas (delta, hasBigger

and weakness) replacing n-gram log probabilities by per million word frequencies:

• SubtlexCands delta (double): the difference between the per million word frequency of an item and its competing candidate with the highest per million word frequency

• SubtlexCands hasBigger (boolean): true if there exists a candidate that has a bigger per million word frequency

• SubtlexCands weakness (int): the number of candidates with a bigger per million word frequency

19 The code to extract the information from Web1T was made available by Yulia Svetashova.

All the context and candidate space features are listed in Table 5.2.

feature group | features
Item Ngram Probs | unigram, bigram left, bigram right, trigram left, trigram center, trigram right, fourgram left, fourgram left center, fourgram right center, fourgram right, fivegram left, fivegram left center, fivegram center, fivegram right center, fivegram right, maxNgram
Item Ngram Cands Delta | unigram, bigram left, bigram right, trigram left, trigram center, trigram right
Item Ngram Cands HasBigger | unigram, bigram left, bigram right, trigram left, trigram center, trigram right
Item Ngram Cands Weakness | unigram, bigram left, bigram right, trigram left, trigram center, trigram right, numPotentialEndings
Item SubtlexCands | unigram delta, unigram hasBigger, unigram weakness, unigram numCandidates

Table 5.2: The 39 context and candidate space features implemented in the English difficulty model.
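To illustrate the candidate space idea on the SUBTLEX side, the following sketch computes the three unigram candidate features from a map of per million word frequencies. The words and frequency values are invented for illustration; the thesis derives the candidate set from the first 74,286 SUBTLEX-US entries that match the item's base and length constraint.

import java.util.Map;

public class SubtlexCandidateFeatures {

    public static void main(String[] args) {
        // Per million word frequency of the intended word and of its competing candidates
        // (illustrative values only).
        double intendedFreq = 3.1;                       // e.g. the intended word
        Map<String, Double> candidates = Map.of(         // same base, same length constraint
                "candidateA", 5.2,
                "candidateB", 0.4);

        double maxCandidate = candidates.values().stream()
                .mapToDouble(Double::doubleValue).max().orElse(0.0);
        long strongerCandidates = candidates.values().stream()
                .filter(f -> f > intendedFreq).count();

        double delta = intendedFreq - maxCandidate;      // SubtlexCands delta
        boolean hasBigger = strongerCandidates > 0;      // SubtlexCands hasBigger
        long weakness = strongerCandidates;              // SubtlexCands weakness
        int numCandidates = candidates.size();           // subtlex numCandidates

        System.out.printf("delta=%.1f hasBigger=%b weakness=%d numCandidates=%d%n",
                delta, hasBigger, weakness, numCandidates);
    }
}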

5.2.2 Sentence Level

Only two sentence level features implemented within Svetashova (2015) were among her 70 most predictive features: One feature considers Stanford's CoreNLP constituency parser output and counts every construction that is not a simple declarative clause (numComplicators).20 The parseScore feature represents the log probability of the parse tree and is also provided by Stanford. We also included the parseDepth feature as described by Svetashova (2015), which describes the depth of the parse tree, and the number of gaps in the sentence (numGapsInSent). The sentence ID, i.e. the position of the sentence in the text, is also stored (sentID).

As already mentioned in the context of linguistic complexity (Section 2.3, p. 9), Weiß (2017) considered features based on the DLT. The DLT is a theory of how human computational resources are used during sentence processing, where "words are input one at a time" (Gibson, 2000). According to Gibson (2000), in sentence processing two linguistic elements (e.g., a head and its dependent) need to be integrated. This integration is costly, and the cost depends on the distance between the elements. Weiß (2017) follows Shain et al. (2016) and computes integration costs based on three assumptions: a) verbs are more expensive, b) coordination is less expensive, and c) modifier dependencies can be excluded. This leads to the following modifications when calculating integration costs:

a) a cost of 1 for non finite verbs, a cost of 2 for finite verbs (v)

b) only one collective count for a coordinated set of referents (c)

c) ignore dependencies to preceding modifiers (m)

Combining these conditions with each other and using none of the conditions (o) leads to eight different computations of integration costs. For each of the eight, she implemented three variants: the number of maximal integration costs, the mean total integration cost at finite verbs, and the number of adjacent high integration cost areas. However, the variant of adjacent high integration cost areas requires a "threshold to qualify discourse costs as 'high'" which should be set according to clear linguistic evidence that still needs to be revealed (Weiß, 2017, p. 75). We therefore only integrate the first two variants into our pipeline, resulting in a set of 16 sentence-level DLT features:

• maxTotalIntegrationCostPerFiniteVerb for the 8 conditions:

o, v, c, m, cv, cm, vm, cmv

• totalIntegrationCostsAtFiniteVerbPerFiniteVerb for the 8 conditions:

o, v, c, m, cv, cm, vm, cmv

5.2.3 Text Level

Following Svetashova (2015), we implement the following three groups of text level

features: lexical features, syntactic features and traditional readability features.

20 Yulia Svetashova kindly provided her Java code for the extraction of the numComplicators feature.


Lexical features (Lex): The lexical complexity of texts can be measured in different ways. We distinguish between lexical density, lexical variation, and lexical sophistication. Additional lexical features are based on different frequency lists.

• Density: Lexical density, as described by Lu (2012), measures the ratio of lexical words to all words in a text (density Lex). We consider nouns, verbs, adjectives, and adverbs as lexical words. We also compute the ratio of further POS groups to the number of all words in the text, resulting in the following features: density Functional, density Noun, density Verb, density Adjective, density Adverb, density Conjunction, and density Determiner.

• Variation: According to Lu (2012), who investigates lexical variation in the context of language learning, "lexical variation refers to the range of a learner's vocabulary as displayed in his or her language use". Table 5.3 lists the lexical variation features computed within this work and the corresponding descriptions and formulas.

• Sophistication: Lexical sophistication measures describe how sophisticated the words in a text are. Following Lu (2012), we count "words, lexical words, and verbs as sophisticated if they were not on the list of the 2000 most frequent words generated from the British National Corpus". The list was obtained from Lancaster University's Centre for Computer Corpus Research on Language.21 The different lexical sophistication measures vary in terms of the underlying formula and in terms of the POS under consideration (either all words or verbs). The formulas consist of word type and word token counts, where we follow Lu (2012) and consider "different inflections of the same lemma (e.g., 'go,' 'goes,' 'going,' 'went,' and 'gone') as one type" (Lu, 2012, p. 192). Thus, in this case the term type differs from the notion of a type in the lexical variation measures. We implemented all 5 lexical sophistication measures as listed in Table 5.4.

Svetashova (2015) further gathered measures based on the Lexical Frequency Profile (LFP), as introduced by Laufer and Nation (1995), from the ADELEX Analyzer (ADA) (cf. p. 32). The assumption behind LFP is that "the proportion of high frequency general service and academic words in learners' writing" influences lexical richness (Laufer and Nation, 1995). ADA divides the 7000 most frequent words of different corpora into 7 bands of 1000 words and computes the percentage of words included in each list by either matching types or tokens (NB: in this case word types and not lemmas are considered as types). We replicate these features by considering the 7000 most frequent tokens in the SUBTLEX-US corpus (cf. p. 32).

Another group of sophistication features that performed well in Svetashova (2015)

are features extracted from the Lextutor Vocabprofile.22 We integrated these features into our pipeline by taking the Lextutor's Academic Word List (AWL), originally developed by Coxhead (2000), and then generating word families using the Lextutor's familiarizer.23 The familiarizer extracted the word families for every word in the AWL, which resulted in about 3000 words belonging to 570 word families. The implemented feature awlPercent encodes the percentage of words belonging to the AWL, and the feature awlNumFam encodes how many of the corresponding word families have been found in the text. We further do this with the 1000 most frequent words in the SUBTLEX-US corpus (famListPercent k1, famNum k1).

21 http://ucrel.lancs.ac.uk/bncfreq/lists/1_2_all_freq.txt (Last accessed: 18/02/08)

Feature | Description
variation NDW | T (number of different words)
variation Adj | T_adj / N_lex (adjective variation)
variation Adv | T_adv / N_lex (adverb variation)
variation Mod | T_(adj+adv) / N_lex (modifier variation)
variation Lex | T_lex / N_lex (lexical word variation)
variation Verb2 | T_verb / N_lex (verb variation II)
variation VV1 | T_verb / N_verb (verb variation I)
variation VV1sq | T_verb^2 / N_verb (squared VV1)
variation VV1co | T_verb / sqrt(2 N_verb) (corrected VV1)
variation TTR | T / N (type-token ratio)
variation TTR RTTR | T / sqrt(N) (root TTR)
variation TTR CTTR | T / sqrt(2N) (corrected TTR)
variation TTR Log | log T / log N (bilogarithmic TTR)
variation TTR Uber | (log N)^2 / log(N/T) (Uber index)

Table 5.3: The lexical variation features. N refers to the number of words, whereas T refers to the number of distinct words (i.e. word types).
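As a concrete reading of Table 5.3, the type- and token-based ratios can be computed as in the following minimal sketch (plain Java, illustrative only; the real pipeline derives T and N from the annotated C-Test tokens):

import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class LexicalVariation {

    public static void main(String[] args) {
        List<String> words = List.of("the", "test", "was", "a", "test", "again");

        int n = words.size();                                  // N: number of word tokens
        Set<String> types = new HashSet<>();
        for (String w : words) {
            types.add(w.toLowerCase(Locale.ROOT));
        }
        int t = types.size();                                  // T: number of word types

        double ttr    = (double) t / n;                        // variation TTR
        double rttr   = t / Math.sqrt(n);                      // variation TTR RTTR (root TTR)
        double cttr   = t / Math.sqrt(2.0 * n);                // variation TTR CTTR (corrected TTR)
        double logttr = Math.log(t) / Math.log(n);             // variation TTR Log (bilogarithmic TTR)

        System.out.printf("TTR=%.2f RTTR=%.2f CTTR=%.2f LogTTR=%.2f%n", ttr, rttr, cttr, logttr);
    }
}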

Syntactic features (Syn): Svetashova (2015) made use of the Web-based L2 Syntactical Complexity Analyzer developed by Lu (2010).24 We replicate the implementation of these features by analyzing the output of the Stanford constituency parser. The Tregex patterns described by Lu (2010) are used to extract the relevant linguistic units from the parser output (Levy and Andrew, 2006). The best performing features in Svetashova (2015) have been integrated into the pipeline and are listed in Table 5.5.

Dependency Locality Theory features (Dlt): Weiß (2017) projected the sentence-level DLT features (cf. p. 35) to the text level. We integrate these features into our pipeline.

22 http://www.lextutor.ca/vp/eng/ (Last accessed: 18/02/08)
23 http://www.lextutor.ca/familizer/ (Last accessed: 18/02/08)
24 http://aihaiyang.com/software/l2sca/ (Last accessed: 18/02/08)

Feature | Description
sophistication LS1 | N_sophLex / N_word (lexical sophistication I)
sophistication LS2 | T_sophWord / T_word (lexical sophistication II)
sophistication VS1 | T_sophVerb / N_verb (verb sophistication I)
sophistication CSV | T_sophVerb / sqrt(2 N_verb) (corrected VS I)
sophistication VS2 | T_sophVerb^2 / N_verb (verb sophistication II)
frequencyProfile Band1 Tok | percentage of N in band 1 (freq. 1-1000 in SUBTLEX)
frequencyProfile Band1 Type | percentage of T in band 1 (freq. 1-1000 in SUBTLEX)
frequencyProfile Band2 Tok | percentage of N in band 2 (freq. 1001-2000 in SUBTLEX)
frequencyProfile Band2 Type | percentage of T in band 2 (freq. 1001-2000 in SUBTLEX)
frequencyProfile Band3 Tok/Type ... Band7 Tok/Type | analogous for bands 3-7
awlPercent | percentage of N in the AWL (Academic Word List)
awlNumFam | number of AWL families
famListPercent k1 | percentage of words in the k1 list (freq. 1-1000 in SUBTLEX)
famNum k1 | number of k1 families

Table 5.4: The lexical sophistication features. N refers to the number of words, lexical words or verbs. T refers to the number of distinct words, lexical words or verbs (i.e. word types).

Feature | Description
complexity CNperC | # complex nominals / # clauses (complex nominals per clause)
complexity CTperT | # complex t-units / # t-units (complex T-unit ratio)
complexity DCperC | # dependent clauses / # clauses (dependent clause ratio)
complexity VPperT | # verb phrases / # t-units (verb phrases per T-unit)
complexity MLC | # words / # clauses (mean length of a clause)
complexity MLT | # words / # t-units (mean length of a t-unit)
complexity MLS | # words / # sentences (mean length of a sentence)

Table 5.5: The syntactic complexity features.

Readability features (Read): Traditional readability features highly correlate with each other in the context of C-Test item difficulty prediction, as shown by Svetashova (2015). We therefore only integrate two readability indices into the pipeline: the Flesch Reading Ease formula (readability Flesh) and the Flesch-Kincaid Grade Level (readability Kincaid). These formulas are based on the average sentence length and the number of syllables per word.
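For reference, the two indices in their commonly cited form (average sentence length as words per sentence, average word length as syllables per word); the coefficients below are the standard ones and are assumed to match the implementation:

\[
\text{Flesch Reading Ease} = 206.835 - 1.015 \cdot \frac{\#\text{words}}{\#\text{sentences}} - 84.6 \cdot \frac{\#\text{syllables}}{\#\text{words}}
\]
\[
\text{Flesch-Kincaid Grade Level} = 0.39 \cdot \frac{\#\text{words}}{\#\text{sentences}} + 11.8 \cdot \frac{\#\text{syllables}}{\#\text{words}} - 15.59
\]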

5.3 Modeling the difficulty of Spanish C-Tests

The implementation of the Spanish C-Test difficulty features was built upon the described English features and follows two main goals: an efficient way of predicting the difficulty of Spanish C-Test items per se, and a meaningful, linguistically insightful comparison of difficulty phenomena in the two languages. The following will mainly focus on those Spanish features that differ from the English ones.


5.3.1 Item Level

Linguistic Features (Ling): The linguistic features for Spanish are the same as those

for English: endingLength, lemmaLength, isContentWord and numDependencies. We

considered the Ancora tags starting with N (nouns), V (verbs), A (adjectives) and R

(adverbs) as content words.

Psycholinguistic Features (Psy): The TFIDF measures have been adapted to Spanish by compiling a new collection of documents: 91 C-Test text passages and 90 randomly chosen texts from a sample of the Corpus del Español.25

The Lexical Frequency Profile measure, which indicates whether the item is contained in the 1000 most frequent words, is based on the Spanish SUBTLEX-ESP corpus (Cuetos et al., 2012).26

Position Based Features (Posit): All the position based features from English,

except for those based on the Web1T corpus, have been adapted to the Spanish pipeline.

Context and Candidate Space (Cont): We implemented candidate space features that are comparable between the two languages by focusing on the SUBTLEX databases. As already described for English, the SUBTLEX database has been used to generate maps from item bases to possible solutions according to the length constraint. The following exemplifies a map entry for the base "añ" and all the matching possibly intended words extracted from the first 72,286 words in SUBTLEX-ESP:27

añ[ ] → [años, añade, añora, añado, añeja, añoro, añada, añadí, añejo]

5.3.2 Sentence Level

On sentence level we include the following features that have already been described for English: parseScore, parseDepth, numGapsInSent, and sentID. The DLT features have also been adapted to Spanish.

5.3.3 Text Level

On text level we fully adapted the DLT and readability features to the Spanish pipeline. The Spanish readability features make use of syllable counts extracted following the approach described by Hernández-Figueroa et al. (2009).28

All the lexical density and variation features have been adapted to Spanish. To include the aspect of sophistication, we implement the lexical sophistication measures developed by Lu (2012), which are listed for English in Table 5.4 (p. 38). We consider a Spanish word as sophisticated if it is not among the 2000 most frequent words in SUBTLEX-ESP. We further adapt the Lexical Frequency Profile features by considering the SUBTLEX-ESP corpus.

25 A sample of the corpus is available at https://www.corpusdata.org/spanish.asp (Last accessed: 17/12/29)
26 http://crr.ugent.be/archives/679 (Last accessed: 17/12/28)
27 This number was chosen because it is the number of words available in the English database.
28 Code available at https://github.com/vic/silabas4j (Last accessed: 17/12/29)

6 Experiments and Results

This section describes the machine learning experiments conducted in order to find out

how the difficulty of C-Test items, sentences and text passages can be predicted based

on the described difficulty features. We experiment with different clustering approaches

and analyze how the predictions correlate with the actual learner performance data.

At the end of this section, the results are discussed.

First, we predict the resulting cluster memberships using the whole set of difficulty

features for both languages. Different performance profiles are evaluated by taking into

account their interpretability.

We then focus on those two performance profiles that are suitable for real world

applications where the selection of text passages should be based on predictions about

the difficulty of single passages.

In order to compare the difficulty characteristics across the two languages English

and Spanish, a unique set of comparable features is designed. We conduct classification

experiments using this feature set on the presented performance profiles. In addition, we

inspect the performance on the more local sentence and item levels, since the linguistic differences between the languages are expected to be grammatical rather than content-related.

6.1 Experimental Setup

The data investigated in the following experiments comprises performance statistics for English and Spanish C-Tests gathered from the FSZ of the University of Tübingen (cf. Section 3). As described in Section 4.2, we perform Hierarchical Clustering on Principal Components (HCPC) on the performance data. The resulting classes are then predicted using machine learning models which are trained on the implemented difficulty features. The R package caret is used for the data preprocessing and the machine learning part (Kuhn et al., 2015). Following Svetashova (2015), the machine learning is performed using two different algorithms: Support Vector Machines (SVM) and Random Forests (RF).29 The caret package provides a function that splits the data into training and test sets. We use this function to create data partitions and use 80% of the data for training and 20% for testing.

29 Yulia Svetashova kindly provided her machine learning pipeline, including feature preprocessing and model training.

named algorithms for several reasons (c.f. Svetashova (2015, p. 71 ff.)):

• "They have been successfully applied to a wide spectrum of NLP problems"

• SVM makes a comparison to results reported in other literature possible

• "both algorithms have the implementations that are fast in terms of training time and are suitable for regression and classification tasks"

• "these two algorithms showed best accuracy values in our preliminary testing experiments"

• "SVMs are reported to handle successfully large feature sets"

• "The attractive property of random forests is that they permit to understand the relative importance of variables that provide the predictive accuracy"

In terms of feature preprocessing, the machine learning pipeline follows these steps:

1. Dealing with missing values: Missing values in the feature data arise for different reasons. First, features based on table lookups can contain missing values if the item under consideration is not present in the list. This applies to the item-level psycholinguistic features based on the MRC dictionary lookup (including the age of acquisition, concreteness, familiarity, and meaningfulness features). We replace missing values in these features by the average feature value. Furthermore, features based on Web1T corpus lookups contain missing values if a certain n-gram is not found in the corpus at all. We chose to assign the minimum probability found in the corresponding feature to all missing values in that feature. The position based feature that measures the uni- or trigram probability of the previous item is always missing for the first item in a text; these missing values are replaced by the average feature value. The features indicating whether an item is a lemma's most frequent token or whether an item has a lemma's most frequent tag are set to false if the words are not found in the MRC lookup.

2. Zero- and near-zero variance removal: In order to steer clear of features that contain little or no information, zero and near-zero variance removal can be applied. We checked for zero and near-zero variance features, but it turned out that the feature set integrated in the pipeline of this thesis does not contain any such features. This is because we focus on those features of Svetashova (2015) that turned out to be predictive in her experiments, in which she had already removed such variables.

3. Highly correlated predictors removal: The consideration of multiple features

that correlate highly with each other does not result in any gain of information


when performing machine learning experiments. We therefore filter out variables

that correlate highly with another variable, setting the cutoff threshold to 0.95.

Table 6.1 lists the number of features used in total and the number of features after removing highly correlated predictors. For the English data, 98 out of 145 numeric features remain after removing the highly correlated numeric features; for Spanish, 66 out of 95 remain. Correlations mainly concern DLT features on text and sentence level, and variants of features that only differ in terms of the mathematical formula (e.g., different variations of TTR or of the verb variation formula). Considering types instead of tokens when computing the lexical frequency profile measures on text level also makes no difference in some cases, leading to highly correlating variables (e.g., FrequencyProfile Band7 Type highly correlates with FrequencyProfile Band7 Tok). We further found highly correlating variables within the group of candidate and context n-gram features.

 | EN | ES
# features | 156 | 100
# numeric predictors | 145 | 95
# numeric predictors (highly correlated deleted) | 98 | 66
# factor predictors | 11 | 5
# all predictors (filtered) | 109 | 71

Table 6.1: The number of features (factor and numeric) before and after removing highly correlated predictors.
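The correlation filter can be pictured as the following greedy sketch over the numeric feature columns. This is a simplification in plain Java; the thesis uses caret's correlation filtering (presumably findCorrelation) with the same 0.95 cutoff, whose exact removal heuristic may differ.

import java.util.ArrayList;
import java.util.List;

public class CorrelationFilter {

    /** Pearson correlation between two feature columns. */
    static double pearson(double[] x, double[] y) {
        int n = x.length;
        double mx = 0, my = 0;
        for (int i = 0; i < n; i++) { mx += x[i]; my += y[i]; }
        mx /= n; my /= n;
        double sxy = 0, sxx = 0, syy = 0;
        for (int i = 0; i < n; i++) {
            sxy += (x[i] - mx) * (y[i] - my);
            sxx += (x[i] - mx) * (x[i] - mx);
            syy += (y[i] - my) * (y[i] - my);
        }
        return sxy / Math.sqrt(sxx * syy);
    }

    /** Greedily drop every feature whose absolute correlation with an already kept feature exceeds the cutoff. */
    static List<Integer> keptFeatures(double[][] columns, double cutoff) {
        List<Integer> kept = new ArrayList<>();
        for (int j = 0; j < columns.length; j++) {
            boolean redundant = false;
            for (int k : kept) {
                if (Math.abs(pearson(columns[j], columns[k])) > cutoff) { redundant = true; break; }
            }
            if (!redundant) kept.add(j);
        }
        return kept;
    }

    public static void main(String[] args) {
        double[][] columns = {
            {1, 2, 3, 4, 5},        // feature 0
            {2, 4, 6, 8, 10},       // feature 1: perfectly correlated with feature 0
            {5, 3, 6, 1, 2}         // feature 2
        };
        System.out.println(keptFeatures(columns, 0.95)); // prints [0, 2]
    }
}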

In order to evaluate the models, we report macro- and micro-averaged F1 scores,

inspect the variable importance, and train additional models on feature subsets. The

R package caret provides a variable importance evaluation function. This function uses

the model information and shows which variables are most predictive.
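As a reminder, the two evaluation scores are defined as follows (standard definitions, with per-class precision P_c and recall R_c and pooled true positive, false positive, and false negative counts):

\[
F_{1,c} = \frac{2 \, P_c R_c}{P_c + R_c}, \qquad
F_1^{\mathrm{macro}} = \frac{1}{|C|} \sum_{c \in C} F_{1,c}, \qquad
F_1^{\mathrm{micro}} = \frac{2 \sum_{c} \mathit{TP}_c}{2 \sum_{c} \mathit{TP}_c + \sum_{c} \mathit{FP}_c + \sum_{c} \mathit{FN}_c}
\]

For single-label multi-class classification, the micro-averaged F1 coincides with accuracy, whereas the macro average weights all classes equally regardless of their size.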

6.2 Investigation of Performance Profiles

This section describes the machine learning experiments conducted in order to find

out how the difficulty of C-Test items can be predicted based on different information

about the test takers’ performance. As described in Section 4.2 (Clustering Approach),

we can describe the difficulty of an item according to three dimensions: text difficulty,

sentence difficulty, and item difficulty. Assigning a binary label (easy or difficult) to each

dimension leads to the 8 clusters listed in Table 4.1 (p. 26). Depending on the number of

dimensions present in a performance profile, we experimented with different numbers

of clusters: If three dimensions are present in the performance profile, clustering is

performed with 4 to 8 clusters. If only two dimensions are present, we perform clustering

with 4 clusters and add experiments with 5 clusters, since the 5 cluster individuals

maps (Figures 4.5 and 4.6, p. 28) are interpretable in a way that is suitable for real-world applications.

When describing performance profiles, the terms Text, Sent, and Item indicate

that only the overall average variable is considered (percCorrect, sentAv, or textAv).

TextProf, SentProf, and ItemProf further include the proficiency level variables. The

full set of 21 performance variables is described on page 23 and referred to as all.

As seen in the variables factor map, the English proficiency level variables on text level highly correlate with each other independently of the number of clusters (4-8) (see Figure 4.3). We therefore consider proficiency variable combinations on sentence and item level (SentProf and ItemProf) in more detail than on text level. However, in order to have results that are comparable to those reported in Svetashova (2015), we add clusters including proficiency level variables on text level (TextProf ItemProf).

6.2.1 Classification Results on English Performance Profiles

English Performance Profiles | # vars | SVM mac F1 | SVM mic F1 | RF mac F1 | RF mic F1
all CL4 | 21 | 0.63 | 0.63 | 0.67 | 0.68
all CL5 | 21 | 0.57 | 0.61 | 0.60 | 0.63
all CL6 | 21 | 0.56 | 0.58 | 0.61 | 0.62
all CL7 | 21 | 0.58 | 0.58 | 0.62 | 0.61
all CL8 | 21 | 0.54 | 0.58 | 0.57 | 0.60
Text Sent Item CL4 | 3 | 0.68 | 0.71 | 0.64 | 0.71
Text Sent Item CL5 | 3 | 0.67 | 0.67 | 0.69 | 0.71
Text Sent Item CL6 | 3 | 0.61 | 0.62 | 0.59 | 0.62
Text Sent Item CL7 | 3 | 0.59 | 0.58 | 0.60 | 0.60
Text Sent Item CL8 | 3 | 0.63 | 0.62 | 0.64 | 0.62
Text SentProf Item CL4 | 9 | 0.79 | 0.83 | 0.94 | 0.95
Text SentProf Item CL5 | 9 | 0.72 | 0.73 | 0.86 | 0.86
Text SentProf Item CL6 | 9 | 0.69 | 0.68 | 0.86 | 0.85
Text SentProf Item CL7 | 9 | 0.74 | 0.72 | 0.89 | 0.86
Text SentProf Item CL8 | 9 | 0.77 | 0.72 | 0.89 | 0.85
Text SentProf ItemProf CL4 | 15 | 0.56 | 0.57 | 0.60 | 0.60
Text SentProf ItemProf CL5 | 15 | 0.45 | 0.54 | 0.49 | 0.57
Text SentProf ItemProf CL6 | 15 | 0.45 | 0.52 | 0.51 | 0.57
Text SentProf ItemProf CL7 | 15 | 0.47 | 0.52 | 0.47 | 0.53
Text SentProf ItemProf CL8 | 15 | 0.46 | 0.48 | 0.49 | 0.52
TextProf ItemProf CL4 | 14 | 0.76 | 0.77 | 0.77 | 0.79
TextProf ItemProf CL5 | 14 | 0.77 | 0.75 | 0.76 | 0.74
Text ItemProf CL4 | 8 | 0.59 | 0.63 | 0.61 | 0.65
Text Item CL4 | 2 | 0.73 | 0.75 | 0.74 | 0.75
Text Item CL5 | 2 | 0.76 | 0.74 | 0.75 | 0.72

Table 6.2: English: The macro- and micro-averaged F1 scores for SVM and RF classification using all features. Values higher than or equal to 0.70 are highlighted.

Table 6.2 presents the SVM and RF results for the classification of different performance profiles using the full feature set. Values higher than or equal to 0.70 are highlighted. The performance profiles that are best predicted by the classifiers are those containing a high proportion of sentence performance variables. The profile Text SentProf

Item CL4 is built up using four clusters and the following nine performance variables:

textAv, sentAv, sentence averages split by proficiency levels (sentAvA1, sentAvA2,...),

and the item variable percCorrect. Figure 6.1 shows the individuals factor map and the

variables factor map for this profile. Roughly speaking, the clusters describe items according to the sentAv variable: Cluster 1 is the cluster with items in difficult sentences,

followed by those in less difficult sentences (red items, cluster 2). Even less difficult are

the sentences containing cluster 3 items, and finally, cluster 4 items occur in the easiest

sentences. Predicting these classes using all features resulted in an F1 score of 0.95

for RF classification. The most predictive features in this model are listed in Figure

6.2 and reflect the tendencies interpreted from the factor maps in Figure 6.1: 7 out of

the 10 most predictive features are sentence level features. The two most predictive

features are the parse score and the number of gaps in a sentence. Only two item level

features (Item Posit Number Token and Item Posit Number Gap) are among the top

20 features.

−10 −5 0 5 10

−10

−5

05

Factor map

Dim 1 (71.74%)

Dim

2 (

10.5

6%)

●●●●●●

●●●●●●●

●●●●●●

●●●

●●

●●●

●●

●●

●●●●●●

●●●

●●●

●●●●

●●

●●●

●●

●●

●●●

●●●

●●

●●●

●●

●●●

●●

●●

●●●●

●●

●●

●●●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●●●

●●

●●

●●●

●●●

●●

●●

●●

●●

●●●●●●

●●●●●

●●

●●

●●●

●●

●●

●●

●●●

●●●

●●●

●●

●●

●●●

●●

●●●

●●

●●

●●

●●●

●●

●●●●

●●●

●●

●●

●●

●●●

●●●

●●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●●

●●

●●●●

●●●●

●●

●●

●●

●●●

●●●●

●●●

●●

●●

●●

●●

●●

●●●

●●

●●●●●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●●●●

●●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●●

●●

●●

●●●

●●

●●

●●

●●●●●

●●●

●●

●●

●●

●●●●●●

●●

●●

●●●●

●●

●●

●●●

●●

●●●●

●●

●●●●

●●●●●●●●●

●●●●●●●

●●

●●

●●●●●●●

●●

●●●

●●●

●●●●●●

●●

●●●

●●●●

●●●

●●●●●●●●●●

●●

●●●●

●●

●●●

●●●

589_02589_06589_05589_01589_03589_04725_25

590_19590_20590_25590_21590_23590_18590_24609_17609_18609_14609_16609_15609_19

590_22601_01601_03601_04

525_23525_22

601_02

525_24525_25

531_24

609_23609_21609_24

591_11

594_19

591_09

594_23

591_14

531_23

591_06591_15

594_21

591_13

609_20

591_12

748_04748_09610_22748_10610_20

594_20

591_05

748_06748_11609_22748_08594_24594_18

610_19

594_22748_02748_03610_25

609_25

610_24591_03591_16591_07

748_01748_05748_07594_17

591_04591_10

529_22

591_02591_08690_24610_18610_21690_23

604_09594_25604_10

610_06610_04

604_13604_11529_23

591_01

529_19

610_09

604_20610_23604_18604_15529_21

690_22

604_19529_20

690_21

529_17529_15

690_20610_05

604_12

610_07585_18719_13585_21690_18585_12604_08529_18

590_03

719_14719_19719_11

590_01590_09

585_16756_19

604_17

590_05

529_16

690_19719_10

688_23

604_14585_19756_25

604_16

756_24

688_22

610_08756_22

719_09585_20

590_04688_25590_06688_24

719_12

590_02

756_21

688_19

585_17719_16479_04719_18719_17

590_08

585_15719_15

756_20

590_07

585_14

688_20

756_23

479_03

582_17582_24527_18527_12585_13527_16582_15582_18582_22582_19

479_05479_02479_09

582_16

479_01

688_21

479_07479_06

688_18

766_14

470_19

582_25

470_16

582_21

479_10

582_20

479_08527_13527_14470_18

691_05691_07725_01688_08

584_23582_23

725_13

527_15

725_08

766_15

688_14691_11691_01

584_25527_17

691_12725_04725_03688_03725_05691_04

766_13

691_09691_06688_06481_25481_03

596_24

481_13688_05688_16481_09

527_22

481_16

527_20

470_15

766_17601_25

481_04

584_24

688_04481_11481_05725_02586_17586_24610_11481_24

470_17

610_13481_08481_20725_14601_20

688_10688_01

596_23

688_17

601_21

586_18

596_25

725_12

690_02688_13

766_16

481_02

601_23

610_15605_02688_07688_12756_16688_11605_01690_07

481_18691_02

690_17

533_08601_24

756_09605_09

691_13

605_11690_15

533_07725_11478_09725_07527_21

481_22478_16691_10756_12586_20690_03688_09605_12690_16478_21690_11478_08478_20

601_22725_06

588_06690_12588_16

533_05586_22690_01

691_03478_07481_19688_15756_08588_08

478_15725_09725_10

756_17481_15610_12586_16688_02691_08478_14478_01598_21598_18

588_10

481_12478_18610_17588_17588_09

481_10690_13588_11

533_09527_23

588_05

481_01481_14

527_19533_02478_06

756_04610_16

598_19481_07586_15605_03605_10

598_23

690_14588_13690_10756_10586_19

588_22588_20

481_17586_25

588_14

478_13586_21481_23

527_24

481_06478_19481_21

588_19610_10690_06756_14605_19756_13610_14588_04764_09605_08

586_14756_18605_14

764_10

766_20533_06478_17690_05587_17756_15586_23533_01533_04

605_16756_02756_11

598_22587_22598_20587_23

605_21

598_24

605_13690_09605_04756_06588_01690_04605_05690_08756_03605_06588_21588_12588_03756_05756_07

766_21

478_04

588_15

533_03766_24

588_02605_07478_10

766_25

478_12

756_01

478_02

766_22

724_13

766_19

478_11588_07600_13

478_05587_21587_20

600_14

478_03

603_16724_15

598_25

719_20764_13

603_18588_18

587_19

605_22603_14689_09

531_12605_24531_06600_11603_08

587_16

689_20689_02

531_05587_24

689_07

605_15

525_05766_18

594_09689_24

531_04605_17605_18

766_23

587_25587_18587_13

724_16

764_11

600_09603_10605_20605_25605_23603_13

525_08

530_13

594_10

601_19

603_15600_16601_18

525_07

764_12

600_17

689_22

587_15

529_13689_06

601_17724_12

588_23

529_14

531_09587_14

594_04689_18594_15724_04

525_04

530_18530_12

594_16603_11

525_03

689_11

530_20

594_11689_15529_08

525_02

530_03

724_18724_14724_09719_22

689_23

525_01

530_11

724_05689_08

530_15530_16

689_14

531_07600_15

689_04689_25594_12

525_06

529_09529_12

531_08531_03

689_10724_08529_11

528_22

689_03

530_06719_21528_20530_09

724_06689_12594_01689_21

530_10528_21530_05

603_20603_19

530_17531_01

724_17

594_03689_05594_06689_19

531_10

724_11531_17600_08

689_17594_02

528_23

724_03

531_11530_14

603_17594_07588_25

530_04531_13

724_10

594_08

600_10531_02

529_07531_18

689_13

600_12

689_16

530_02

603_09

589_23589_25

603_07724_01529_05603_12532_02529_04689_01529_01

530_19

594_13

529_02

530_01528_24

594_05724_07594_14

530_08

529_10532_08724_02

530_07

[Figure 6.1 plot: individuals factor map (items labelled by ID and coloured by clusters 1-4) and variables factor map (PCA); Dim 1 (71.74 %), Dim 2 (10.56 %); plotted variables: textAv, sentAv, sentAvA1-sentAvC2, percCorrect.]

Figure 6.1: English: Individuals and variables factor map using performance profile Text SentProf Item CL4. Predicting these 4 classes using all English features results in an F1 score of 0.95

Table 6.2 further shows that the RF prediction of the English performance profile

TextProf ItemProf CL4 results in a micro-averaged F1 score of 0.79, which is comparable

to the RF result reported by Svetashova (2015): a micro-averaged F1 score of 0.7959 on

a performance profile that only contains the two dimensions text and item. However,

the extended data set does not show the clear-cut distinction between the item clusters

as it does for the data set investigated by Svetashova (2015, p. 91). We therefore chose

to inspect in more detail the performance profile that makes use of only two variables:

textAv and percCorrect (see p. 28 for factor maps and a description and interpretation of

the resulting clusters). The SVM classification for this profile (Text Item CL5 ) results

in a macro-averaged F1 score of 0.76. We will now describe the results for Spanish,


[Figure 6.2 plot: scaled variable importance (x-axis 20-100) of the top 20 RF predictors, including Sent_Const_parseScore, Sent_numGapsInSent, Sent_Const_parseDepth, Sent_sentID, Item_Posit_Number_Gap, Item_Posit_Number_Token, and several Text_Lex, Text_Syn, Text_Dlt, and Sent_Dlt features.]

Figure 6.2: English: Variable importance of the top 20 predictors in the RF model, when predicting the 4 classes of the performance profile Text SentProf Item CL4

where similarities can be drawn in terms of interpretable clustering approaches.

6.2.2 Classification Results on Spanish Performance Profiles

The classification results for the Spanish performance profiles are listed in Table 6.3.

In contrast to English, the F1 scores are significantly higher for most of the profiles.

Values higher than 0.80 are highlighted, whereas for English, values higher than 0.70

were highlighted instead. The best performance profiles are again the ones with a high

proportion of sentence level variables. Using the best profile (Text SentProf Item CL4),

the RF classification resulted in a micro-averaged F1 score of 0.93, which is slightly

worse than for the English data (0.95). The variable importance plotted in Figure

6.3 also shows similarities to what could be observed for English: The parse score and the

number of gaps in the sentence are the most predictive features. Compared to English,

however, these two most predictive Spanish features are less dominant (100 % numGapsInSent

and 62.84 % parseScore for Spanish, versus 100 % parseScore and 88.12 % numGapsInSent

for English).

High classification results can also be observed for the profiles combining text and

item variables: The four classes from profile TextProf ItemProf CL4 were predictable

with F1 scores of up to 0.86 (0.79 for English). However, in terms of interpretability

the performance profile Text Item CL5 (cf. p. 27) with an F1 score of 0.82 in SVM


Spanish Performance Profiles      # vars   SVM mac F1   SVM mic F1   RF mac F1   RF mic F1
all CL4                               21      0.68         0.70         0.68        0.70
all CL5                               21      0.73         0.68         0.75        0.71
all CL6                               21      0.74         0.71         0.72        0.69
all CL7                               21      0.70         0.69         0.78        0.77
all CL8                               21      0.71         0.68         0.74        0.73
Text Sent Item CL4                     3      0.75         0.77         0.73        0.76
Text Sent Item CL5                     3      0.78         0.77         0.78        0.77
Text Sent Item CL6                     3      0.78         0.77         0.78        0.78
Text Sent Item CL7                     3      0.70         0.70         0.73        0.72
Text Sent Item CL8                     3      0.63         0.68         0.68        0.72
Text SentProf Item CL4                 9      0.87         0.89         0.91        0.93
Text SentProf Item CL5                 9      0.91         0.91         0.92        0.93
Text SentProf Item CL6                 9      0.80         0.83         0.82        0.83
Text SentProf Item CL7                 9      0.83         0.85         0.85        0.85
Text SentProf Item CL8                 9      0.84         0.86         0.86        0.87
Text SentProf ItemProf CL4            15      0.71         0.71         0.74        0.74
Text SentProf ItemProf CL5            15      0.64         0.65         0.69        0.70
Text SentProf ItemProf CL6            15      0.63         0.66         0.64        0.68
Text SentProf ItemProf CL7            15      0.58         0.61         0.60        0.63
Text SentProf ItemProf CL8            15      0.50         0.53         0.47        0.52
TextProf ItemProf CL4                 14      0.80         0.84         0.83        0.86
TextProf ItemProf CL5                 14      0.68         0.71         0.72        0.75
Text ItemProf CL4                      8      0.48         0.48         0.54        0.55
Text Item CL4                          2      0.85         0.82         0.84        0.82
Text Item CL5                          2      0.82         0.82         0.81        0.80

Table 6.3: Spanish: The macro- and micro-averaged F1 scores for SVM and RF classification using all features. Values higher than or equal to 0.80 are highlighted.

classification was chosen for further experiments.
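
The macro- and micro-averaged F1 scores reported throughout this chapter can be derived from a confusion table of actual against predicted classes. The following is a minimal R sketch of this evaluation for an SVM and an RF classifier; it is not the thesis pipeline, and the objects feats (a numeric feature data frame), cls (the cluster label factor), and the simple hold-out split are hypothetical placeholders.

## Minimal sketch (not the thesis pipeline): fit SVM and RF on a feature data
## frame `feats` with cluster labels `cls`, then compute macro-/micro-averaged F1.
library(e1071)          # svm()
library(randomForest)   # randomForest()

set.seed(42)
idx   <- sample(nrow(feats), floor(0.8 * nrow(feats)))   # simple hold-out split
train <- feats[idx, ];  test <- feats[-idx, ]
y_tr  <- cls[idx];      y_te <- cls[-idx]

svm_fit <- svm(x = train, y = y_tr, kernel = "radial")
rf_fit  <- randomForest(x = train, y = y_tr, ntree = 500)

f1_scores <- function(actual, predicted) {
  cm   <- table(actual, predicted)                # rows: actual, columns: predicted
  tp   <- diag(cm)
  prec <- tp / pmax(colSums(cm), 1)
  rec  <- tp / pmax(rowSums(cm), 1)
  f1   <- ifelse(prec + rec == 0, 0, 2 * prec * rec / (prec + rec))
  c(macroF1 = mean(f1),                           # unweighted mean over classes
    microF1 = sum(tp) / sum(cm))                  # pooled over all decisions
}

f1_scores(y_te, predict(svm_fit, test))
f1_scores(y_te, predict(rf_fit,  test))

For single-label multi-class data the micro-averaged F1 coincides with overall accuracy, whereas the macro-averaged F1 weights all classes equally regardless of their size.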

6.3 Predicting C-Test Performance on Text and Item Level

The difficulty prediction of single texts is relevant for the selection of text passages

during the composition of a whole C-Test. As shown by Svetashova (2015), C-Test

items can be clustered into 4 groups according to two dimensions: overall text passage

difficulty and item difficulty. She presents experiments on unseen data and suggests

how the results could be applied in a C-Test generation application. A C-Test item

in a given text passage could be highlighted according to its item difficulty (Ie and

Id) as in Figure 2.5 on p. 19. Furthermore, the whole text passage is classified as

easy or difficult (Te and Td). However, given the newly compiled extended datasets,

clustering the items into 5 groups provides a clearer picture for both languages than

using 4 clusters (cf. Figures 4.5 and 4.6 on p. 27). We will therefore investigate in more

detail how the item classes resulting from the performance profile Text Item CL5

can be predicted automatically. For both languages, this clustering principle is both

interpretable and suitable for use within a real-world application.


[Figure 6.3 plot: scaled variable importance (x-axis 20-100) of the top 20 RF predictors, including Sent_numGapsInSent, Sent_Const_parseScore, Sent_Const_parseDepth, Item_Psy_TFIDF, Item_Psy_DF, Item_SubtlexCand_deltaBigger, and several Text_Lex and Sent_Dlt features.]

Figure 6.3: Spanish: Variable importance of the top 20 predictors in the RF model, when predicting the 4 classes of the performance profile Text SentProf Item CL4

The SVM model trained on the whole feature set resulted in a macro-averaged F1

score of 0.76 for the English data (Table 6.2), and 0.82 for the Spanish data (Table

6.3). In the following, we will consider the variable importance of the models for both

languages and inspect some mean values of predictive features.
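
The variable importance rankings discussed below (and listed later in Tables 6.7 and 6.9 as values scaled over all features) can be obtained for the RF models along the following lines. This is an illustrative R sketch, not the original code; feats and cls are again hypothetical placeholders, and rescaling so that the top feature equals 100 only mirrors the 0-100 scale used in the tables.

## Illustrative sketch: scaled variable-importance ranking from a random forest.
library(randomForest)

rf_fit <- randomForest(x = feats, y = cls, ntree = 500, importance = TRUE)

imp <- importance(rf_fit, type = 1)          # mean decrease in accuracy per feature
imp <- 100 * imp / max(imp)                  # rescale so the top feature scores 100
head(imp[order(imp, decreasing = TRUE), , drop = FALSE], 20)   # top 20 predictors

varImpPlot(rf_fit, type = 1, n.var = 20)     # quick dot chart of the same ranking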

6.3.1 English

The most predictive features for the prediction of the five classes of the English profile

Text Item CL5 are plotted in Figure 6.4. 15 out of the 20 most predictive features are

item level features. The most prominent feature subgroup on item level is the group

of psycholinguistic features with term and document frequency measures leading the

way (TFIDF and DF). The number of stylistic categories that contain the item is also

a predictive variable. Further predictive psycholinguistic features are based on

frequency lists and psycholinguistic ratings. The surface-based features indicating the

length of the item’s lemma and the length of the gap in letters are also among the top

20 features. The number of candidates with bigger unigram probability in the Web1T

corpus is also a predictive feature on item level. On the text level, POS tag based lexical

density and variation, and human sentence processing measures based on Dependency

Locality Theory perform well.
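
As a toy illustration of the two leading psycholinguistic item-level measures, document frequency (DF) counts in how many documents the solution word occurs, and TFIDF weights its term frequency against that document count. The corpus and the exact weighting scheme behind the thesis features are not reproduced here; the sketch below only shows the standard tf * log(N/df) idea on an invented three-document collection.

## Toy sketch of DF and TF-IDF for a gapped word over a small, invented corpus.
docs <- c("the house was empty",
          "an empty promise",
          "the garden behind the house")
item <- "house"

tokenised <- strsplit(tolower(docs), "\\s+")
df    <- sum(vapply(tokenised, function(toks) item %in% toks, logical(1)))  # DF = 2
tf    <- sum(tokenised[[1]] == item)               # term frequency in the item's text
tfidf <- tf * log(length(docs) / df)               # simple tf * log(N/df) weighting
c(DF = df, TFIDF = tfidf)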

Figure 6.5 visualizes how the clusters can be interpreted based on the two dimensions:


[Figure 6.4 plot: five panels (A-E) showing scaled variable importance (x-axis 20-100) of the top 20 SVM predictors, including Item_Psy_DF, Item_Ngram_Probs_Uni, Item_Psy_TFIDF, Item_Ngram_Probs_Bi_Left, Item_Psy_StylNCatsNLTK, Item_Psy_StylNCatsMRC, Item_Ngram_Cands_BiggerUniDelta, and further item- and text-level features.]

Figure 6.4: English: Variable importance of the top 20 predictors in the SVM model, when predicting the 5 classes of the performance profile Text Item CL5

text and item difficulty. For each of the clusters, Table 6.4 lists mean values of features

on text and item level. On the item level, the mean feature values of clusters 1 and

3 should indicate higher difficulty, whereas the mean feature values of clusters 2 and

4 should indicate lower difficulty. Inspecting the mean number of documents that

contain the item (Item Psy DF ), one can see higher values for the green highlighted

clusters (61.31 and 80) than for the red ones (18.5 and 31.9). This observation supports

an intuitive assumption: Throughout a collection of documents, difficult items occur

in fewer documents than easy items. The other three item level features listed

in the table also reflect intuitive assumptions: The frequency of the item in a corpus is

lower for difficult words and higher for easy words. The mean values for the feature

Item SubtlexCand weakness show that if a word has many solution candidates with

higher frequency than the actual solution’s frequency, the item gets more difficult. On

text level, the values of cluster 5 indicate very low textual difficulty in contrast to the

other four clusters. The distinction between mean values of medium (green) and high

(red) difficulty is not noticeable in the lexical features adverb density and modifier

variation. However, the DLT mean feature values reflect the degrees of difficulty in the

following remarkable way: the higher the integration cost features are, the less difficult

the gap restoration is.


[Figure 6.5 diagram: the five clusters arranged along two axes, decreasing text difficulty and decreasing item difficulty. Cluster 1: difficult items in difficult texts; Cluster 2: easy items in difficult texts; Cluster 3: difficult items in easy texts; Cluster 4: easy items in easy texts; Cluster 5: all items in very easy texts.]

Figure 6.5: English: Cluster interpretation given the performance profile Text Item CL5

Feature                        Cluster 1   Cluster 2   Cluster 3   Cluster 4   Cluster 5
                               (Id, Td)    (Ie, Td)    (Id, Te)    (Ie, Te)    (Iall, Tvery e)
Item Psy DF                      18.474      61.31       31.903      80          67.449
Item Ngram Probs Uni            -11.277      -8.397     -10.588      -7.316      -8.136
Item Psy StylNcatsNLTK            8.153      13.021       9.897      13.935      12.638
Item SubtlexCand weakness         8.095       2.464       5.009       0.907       1.572
Text Lex density Adverb           0.054       0.059       0.057       0.061       0.028
Text Dlt vmMaxTotalInt            1.165       1.145       1.34        1.3         1.47
Text Lex Variation Mod            0.21        0.212       0.205       0.212       0.157
Text Dlt mMaxTotalInt             0.624       0.61        0.749       0.712       0.841

Table 6.4: This table lists the mean values of different English item and text level features by cluster. Cells highlighted in green indicate simplicity on the corresponding locality level. Cells highlighted in red indicate difficulty. On text level, yellow indicates strong simplicity.
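
Per-cluster summaries like Table 6.4 can be produced by averaging the selected feature columns within each cluster. A minimal R sketch, assuming a hypothetical data frame items with one row per gap, a cluster factor, and the named numeric feature columns:

## Sketch (assumed data layout): per-cluster means of selected feature columns.
sel <- c("Item_Psy_DF", "Item_Ngram_Probs_Uni",
         "Item_SubtlexCand_weakness", "Text_Lex_Density_Adverb")
aggregate(items[, sel], by = list(cluster = items$cluster), FUN = mean)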

6.3.2 Spanish

The top 20 predictors in the SVM model for the performance profile Text Item CL5 are

listed in Figure 6.6. 15 predictors are text level features including features on lexical

variation, sophistication, density, and frequency profiles. On item level, only three

features are among the top 20 (TFIDF, DF and IDF ). Furthermore, two sentence level

features can be considered predictive: the parse score features as well as one of the

DLT integration cost features (cmvMaxTotalIntegrationCostPerFiniteVerb).

Table 6.5 lists the mean values of features that were predictive in either SVM or RF.


[Figure 6.6 plot: five panels (A-E) showing scaled variable importance (x-axis 0-80) of the top 20 SVM predictors, including Text_Lex_Variation_Adv, Text_Lex_FrequencyProfile_Band6_Type, Text_Lex_FrequencyProfile_Band6_Tok, Text_Lex_Density_Adverb, Text_Lex_Density_Determiner, Item_Psy_DF, Item_Psy_IDF, Item_Psy_TFIDF, Sent_Const_parseScore, and Sent_Dlt_cmvMaxTotalIntegrationCostPerFiniteVerb.]

Figure 6.6: Spanish: Variable importance of the top 20 predictors in the SVM model, when predicting the 5 classes of the performance profile Text Item CL5

For item level features, the values in clusters 2 and 3 should indicate higher difficulty,

whereas the values in clusters 4 and 5 should indicate lower difficulty. The listed features

confirm these assumptions. Items that occur in fewer documents are more difficult to

fill. Items with a positive difference between the n-gram probability of the item itself

and its competing candidate with highest n-gram probability are easier to master. A

high number of possible candidates is also an indicator of the item's difficulty. On

text level, the mean values of all three features in clusters 3 and 5 indicate simplicity in

contrast to the values in clusters 2 and 4. The distinction between very difficult items in

cluster 1 and medium difficult items in clusters 2 and 4 is only noticeable in the adverb

variation. A high number of distinct adverbs divided by the number of all lexical words

in a text leads to higher textual difficulty.
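
As a toy example of the adverb variation measure referred to above (the number of distinct adverbs divided by the number of all lexical words), consider the following R sketch; the tokens and POS tags are invented, and the tag set is only assumed to distinguish the four lexical word classes.

## Toy sketch of adverb variation: distinct adverbs / all lexical words.
tokens <- c("casa", "muy",  "rapidamente", "corre", "rapidamente", "verde")
pos    <- c("NOUN", "ADV",  "ADV",         "VERB",  "ADV",         "ADJ")

lexical <- pos %in% c("NOUN", "VERB", "ADJ", "ADV")            # lexical word classes
adv_var <- length(unique(tokens[pos == "ADV"])) / sum(lexical)
adv_var   # 2 distinct adverbs / 6 lexical words = 0.33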

6.4 Comparative Investigation of Difficulty Prediction for Spanish

and English

In the following, we will first compare the classifiers’ performance in predicting the

five classes of the presented profile Text Item CL5 by investigating a shared set of

comparable features for both languages.

Additionally, we perform experiments on a performance profile including the test

takers’ performance on sentence and item level. This is motivated by the assumption


[Figure 6.7 diagram: the five clusters arranged along two axes, decreasing text difficulty and decreasing item difficulty. Cluster 1: all items in very difficult texts; Cluster 2: difficult items in difficult texts; Cluster 3: difficult items in easy texts; Cluster 4: easy items in difficult texts; Cluster 5: easy items in easy texts.]

Figure 6.7: Spanish: Cluster interpretation given the performance profile Text Item CL5

Feature                          Cluster 1         Cluster 2    Cluster 3   Cluster 4   Cluster 5
                                 (Iall, Tvery d)   (Id, Td)     (Id, Te)    (Ie, Td)    (Ie, Te)
Item Psy DF                        37.307            15.456       19.346      97.030     108.913
Item SubtlexCands deltaBigger    1144.471         -1010.447     -871.123    2023.725    3305.070
Item SubtlexCands numCand          37.067            32.096       29.867      22.269      22.631
Text Dlt cmvMaxTotalInt             0.768             0.851        0.996       0.855       1.022
Text Read Kincaid                  19.020            19.804       17.503      19.219      17.429
Text Lex Variation Adv              0.101             0.059        0.056       0.058       0.054

Table 6.5: This table lists the mean values of different predictive Spanish item and text level features by cluster. Cells highlighted in green indicate simplicity on the corresponding locality level. Cells highlighted in red indicate difficulty. On text level, violet indicates strong difficulty.

that the differences between Spanish and English C-Test item difficulty are mainly

caused by the differences in the grammatical structure of the two languages rather

than by the text’s general content.

6.4.1 Comparison of Performance on Text and Item Level

We extract a subset of features that encode morphological or syntactic information

which we consider to be suitable for a linguistic comparison. This includes features

encoding POS or parsing information, including DLT features. Furthermore, the SUB-

TLEX candidate features are included, since they might capture influences caused by


features                       # feat   SVM mac. F1   SVM mic. F1   RF mac. F1   RF mic. F1
EN all 63 (no highly cor.)        31       0.70          0.68          0.70         0.68
ES all 63 (no highly cor.)        36       0.77          0.76          0.78         0.76
EN Item SubtlexCands               4       0.20          0.39          0.27         0.38
ES Item SubtlexCands               4       0.32          0.40          0.30         0.39
EN Text Dlt                       16       0.46          0.47          0.49         0.51
ES Text Dlt                       16       0.50          0.64          0.52         0.63

Table 6.6: Classification results for SVM and RF for the performance profile Text Item CL5 and different feature subsets. 63 features were considered to be comparable. After filtering highly correlated variables, 31 remained for the English data and 36 for the Spanish data. For the classification using the other feature subsets, no filtering was applied.

the richer morphology of the Spanish language. We consider the SUBTLEX candidate

features as comparable, since the features rely on the same type of corpus and the exact

same number of words in the databases. The results for RF and SVM using different

feature subsets are presented in Table 6.6. All F1 scores show a better performance for

the Spanish than for the English data. Using all 63 comparative features and filtering

highly correlated features leads to a macro-averaged F1 score of 0.78 for the Spanish

data, and 0.70 for the English data. The feature Text Lex Density Lex and four DLT

features were removed due to high correlations in the English, but not in the Spanish

data. Thus, the DLT features show higher variance in the Spanish data. In the follow-

ing, we check for differences in the variable importance of the models and further focus

on the SUBTLEX candidate features and on the textual DLT features.
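
Filtering highly correlated predictors, as reported for the comparable feature set, is commonly done with caret's findCorrelation. The sketch below assumes a numeric feature data frame feats_comparable; the cutoff of 0.90 is only an example value, since the exact threshold is not restated here.

## Sketch of one common way to remove highly correlated predictors before classification.
library(caret)

corr_mat <- cor(feats_comparable)                       # pairwise correlations of the features
drop_idx <- findCorrelation(corr_mat, cutoff = 0.90)    # column indices suggested for removal
feats_filtered <- if (length(drop_idx)) feats_comparable[, -drop_idx] else feats_comparable
ncol(feats_filtered)   # the thesis reports 31 remaining English and 36 remaining Spanish features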

Table 6.7 lists the RF variable importance scaled over all features. Features that

occur in the top 10 for both languages are marked with an asterisk. For English, three SUBTLEX

candidate features are the most important predictors by far, with a scaled importance

ranging from 61.85 (weakness) to 100 (deltaBigger). A further feature on the item

level is the item’s number of dependencies. On the text level, four lexical variation

measures and one DLT measure are among the 10 most predictive features. The parse

score represents the only highly predictive sentence level feature. For the Spanish data,

the picture looks similar. The most important feature is also the SUBTLEX candidate

feature deltaBigger. It describes the difference between the per million word frequency

of an item and its competing candidate with highest per million word frequency. The

number of competing candidates and the candidate weakness are also among the 10

most predictive features, but have lower importance than in the English model. The

second most important feature for Spanish is the boolean feature indicating whether an

item is a content word or not. This feature is not among the 10 most predictive features

for English. On text level, the DLT measure that combines the conditions c, v, and m

is also important for both languages, as well as certain lexical density and variation

measures.
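
To make the candidate-space features concrete, the following toy sketch computes deltaBigger as described above (the solution's per-million-word frequency minus that of the strongest competing candidate) together with the number of candidates. The frequency values are invented, and the reading of the weakness feature as a count of higher-frequency competitors is only an assumption here; the real features use the SUBTLEX lists.

## Toy illustration of the SUBTLEX candidate-space measures.
solution_freq   <- 54.2                          # per-million frequency of the correct word
candidate_freqs <- c(120.5, 33.1, 7.9)           # frequencies of the competing candidates

deltaBigger <- solution_freq - max(candidate_freqs)   # negative -> a stronger competitor exists
numCand     <- length(candidate_freqs)                # size of the candidate space
weakness    <- sum(candidate_freqs > solution_freq)   # assumed reading: competitors outranking the solution
c(deltaBigger = deltaBigger, numCand = numCand, weakness = weakness)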

As shown in Table 6.6, the Spanish candidate space features achieve a maximum


English features                    Imp     Spanish features                    Imp
Item SubtlexCand deltaBigger *    100.00    Item SubtlexCand deltaBigger *    100.00
Item SubtlexCand numCand *         74.54    Item Ling Morph isContentW         76.69
Item SubtlexCand weakness *        61.85    Text Dlt cmvMaxTotalInt *          71.79
Text Lex Variation Verb2           39.69    Item SubtlexCand numCand *         67.20
Text Lex Variation Adv             36.74    Text Lex Density Verb              64.00
Item Ling Synt numDep *            27.06    Text Lex Variation VV1 *           38.26
Text Dlt cmvMaxTotalInt *          26.34    Text Lex Variation TTR RTTR        38.25
Text Lex Variation TTR             24.98    Text Lex Density Determiner        37.00
Text Lex Variation VV1 *           20.76    Item SubtlexCand weakness *        34.45
Sent Const parseScore              19.33    Item Ling Synt numDep *            27.13

Table 6.7: Variable importance given the RF models for profile Text Item CL5 using the same set of 63 comparable features for both languages. The listed importance values are scaled over all features. Features that occur in the top 10 for both languages are marked with an asterisk (*).

F1 score of 0.39 for English, and 0.40 for Spanish. Thus, their performance does not

differ much between the two languages. However, using the DLT integration cost features,

the classification works significantly better for Spanish. The textual DLT features

achieve a maximum F1 score of 0.64 using SVM for Spanish, and a maximum score

of 0.51 for English. Figure 6.8 shows the confusion matrices for the RF classification

task of the profile Text Item CL5 for both languages when using only the 16 textual

DLT features. We highlight correct classifications and those misclassifications that are

acceptable given the lack of information on item level: Considering only textual input

features, the classifier cannot distinguish between difficult and easy items. Therefore,

acceptable confusion occurs between class 1 and 2, as well as between class 3 and 4. For

the English data, almost all items of class 5 are tagged correctly as items in very easy

texts. For the Spanish data, all items of class 1 are correctly classified as items in very

difficult texts. Analogously, we also consider a feature subset on item level only. The

tables in Figure 6.9 show the confusion matrices for the classifications using the item

candidate space features. Considering the English data, the confusion mostly happens

because only item level information is given in the features. Class 1 is mostly confused

with class 3, which both have the same item difficulty class. The classifier requires text level

information in order to distinguish between them. The same holds for class 2 and 4:

The model predicts correctly that the items are easy, but fails to correctly distinguish

whether the item is in an easy or a difficult text. Thus, the features’ locality levels

reflect the locality dimension from the performance data.
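
Confusion matrices such as those in Figures 6.8 and 6.9 can be obtained by cross-tabulating actual against predicted classes for a model trained on the restricted feature subset. The R sketch below reuses the hypothetical feats/train/test objects from the earlier sketch and assumes, for illustration only, that the textual DLT columns can be selected by the prefix Text_Dlt_.

## Sketch: confusion matrix for an RF model trained on the textual DLT subset only.
library(randomForest)

dlt_cols <- grep("^Text_Dlt_", names(feats), value = TRUE)   # assumed column-name pattern
rf_dlt   <- randomForest(x = train[, dlt_cols], y = y_tr, ntree = 500)

pred <- predict(rf_dlt, test[, dlt_cols])
table(actual = y_te, predicted = pred)    # rows: actual class, columns: predicted class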

6.4.2 Comparison of Performance on Sentence and Item Level

Due to typological properties of the languages under consideration, we expect differ-

ences in C-Test difficulty phenomena to arise on item and sentence level, rather than

on text level. The following describes how the C-Test items can be clustered consider-

ing the performance variables sentAv and percCorr and shows how well the resulting


EN          predicted:    1     2     3     4     5
actual 1                  0    33     1     4     0
actual 2                  0    43     0     4     0
actual 3                  0     0    30    34     1
actual 4                  0     6    36    28     0
actual 5                  0     0     1     1    25

ES          predicted:    1     2     3     4     5
actual 1                 15     0     0     0     0
actual 2                  0    70     0     0     0
actual 3                  0     1    52     0     7
actual 4                  0    39     1     0     0
actual 5                  0     0    36     0     3

Figure 6.8: Using Text Dlt features only: RF classification confusion matrices for English (EN) and Spanish (ES) for the profile Text Item CL5. Correct classifications and acceptable misclassifications are highlighted.

EN          predicted:    1     2     3     4     5
actual 1                  7     5    20     6     0
actual 2                  4     4     9    30     0
actual 3                 11     8    26    18     2
actual 4                  3     3     5    57     2
actual 5                  2     2     3    19     1

ES          predicted:    1     2     3     4     5
actual 1                  0     6     1     1     7
actual 2                  2    43    17     4     4
actual 3                  2    27    22     4     5
actual 4                  1    10     5     7    17
actual 5                  1     4     4    15    15

Figure 6.9: Using Item SubtlexCand features only: RF classification confusion matrices for English (EN) and Spanish (ES) for the profile Text Item CL5. Correct classifications and acceptable misclassifications are highlighted.

classes can be predicted using the comparative set of features.

For both languages, we performed HCPC to cluster the items into four groups, given

the two named variables. The individuals and variables factor maps are given in Figure

6.11 (English) and 6.10 (Spanish). The resulting clusters can be interpreted in the

following way (see the clustering sketch after this list):

• Cluster 1: Difficult items in difficult sentences

• Cluster 2: Difficult items in easy sentences

• Cluster 3: Easy items in difficult sentences

• Cluster 4: Easy items in easy sentences
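
A minimal R sketch of the clustering step referred to above, assuming the FactoMineR PCA + HCPC workflow and a hypothetical data frame perf that holds the two performance variables sentAv and percCorrect, one row per item:

## Sketch: PCA followed by HCPC to cut the items into four clusters.
library(FactoMineR)

res_pca  <- PCA(perf[, c("sentAv", "percCorrect")], scale.unit = TRUE, graph = FALSE)
res_hcpc <- HCPC(res_pca, nb.clust = 4, graph = FALSE)   # cut the hierarchy into 4 clusters

clusters <- res_hcpc$data.clust$clust    # factor with levels 1-4, one label per item
table(clusters)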

The RF and SVM classification results using the full set of 63 comparable features

and two additional subsets are listed in Table 6.8. Taking into account all 63 features,

results in a maximal F1 score of 0.72 for Spanish, and 0.64 for English. Thus, the

classification of the Spanish performance classes outperforms the classification of the

English classes. However, using the set of 19 sentence features only, the English classes

are easier to predict than the Spanish ones. The higher F1 scores for the Spanish data,

when using all 63 features, might be caused by a reasonably good interaction with item

level features.

Table 6.9 shows the 20 most predictive variables for both languages in the RF models. In


features                       # feat   SVM mac. F1   SVM mic. F1   RF mac. F1   RF mic. F1
EN all 63 (no highly cor.)        31       0.62          0.64          0.59         0.61
ES all 63 (no highly cor.)        36       0.61          0.67          0.67         0.72
EN all sent                       19       0.48          0.52          0.53         0.58
ES all sent                       19       0.43          0.54          0.43         0.53
EN Sent Dlt                       16       0.31          0.39          0.29         0.38
ES Sent Dlt                       16       0.34          0.44          0.34         0.42

Table 6.8: Classification results for SVM and RF for the performance profile Sent Item CL4 using the 63 comparative features. After filtering highly correlated variables, 31 remained for the English data and 36 for the Spanish data.

both cases, five item level variables are among the top 20 variables, but the Spanish item

level features are ranked higher than the English ones. The variable importance further

shows that the locality level of the features again reflects the investigated performance

variables’ locality: There is only one English and no Spanish text level feature among

the 10 most predictive features. Table 6.8 also shows that the Spanish classification

outperforms the English one when using only sentence level DLT features. Again, the

variable importance reflects these tendencies: Five Sent Dlt features are among the

top 11 for Spanish, and only two for English. A further difference concerning variable

importance is the ranking of the morphological feature indicating whether the word is a

content word or not. This feature is the second most predictive feature for the Spanish

data, with a variable importance of 53.38 (scaled over all features). In contrast, for the

English data it has an importance of 12.56 and is not even in the top 10.

[Figure 6.10 plot: individuals factor map (items labelled by ID and coloured by clusters 1-4) and variables factor map (PCA); Dim 1 (66.14 %), Dim 2 (33.86 %); plotted variables: sentAv and percCorrect.]

Figure 6.10: Spanish: Individuals and variables factor map using performance profile Sent Item CL4. Predicting these 4 classes using RF and all Spanish features results in a micro-averaged F1 score of 0.72


[Figure 6.11 plot: individuals factor map (items labelled by ID and coloured by clusters 1-4) and variables factor map (PCA); Dim 1 (70.41 %), Dim 2 (29.59 %); plotted variables: sentAv and percCorrect.]

Figure 6.11: English: Individuals and variables factor map using performance profile Sent Item CL4. Predicting these 4 classes using SVM and all English features results in a micro-averaged F1 score of 0.64

6.5 Discussion of the Presented Results

We first investigated how well different performance profiles can be predicted using the

full set of features. For both languages, the profile that contains a large proportion of

sentence level performance variables (Text SentProf Item CL4 ) performed best (mic.

F1 score of 0.95 for English and 0.93 for Spanish). It is the only profile that includes

variables split by proficiency levels on sentence level (SentProf ) and groups the items

into four classes. These classes can be interpreted to vary from each other in terms of

the difficulty of the sentences in which the items occur. Even though the classification

of the Spanish data mostly outperforms the English classification, this is not the case

for this performance profile. An explanation for this might lie in the performance

statistics themselves: The Principal Component Analysis using all performance vari-

ables has shown that within the Spanish data the proficiency level variables correlate

more with each other than within the English data (cf. Figures 4.3 and 4.4 on p. 27).

Thus, the profile containing the SentProf variables is more informative for English than

for Spanish and might therefore be easier to predict. In other words, the information

about how well learners of different proficiency levels can master a sentence might be

easier to predict for English C-Tests than for Spanish C-Tests. However, for a direct

comparison of the two languages the underlying set of features should be comparable.

The results on different performance profiles further confirm the findings of Svetashova's

(2015) work: The profiles that combine item and text level performance variables

seem to be well predictable using the generated feature sets. Due to the interpretabil-

ity of the resulting factor maps, we decided to investigate the profile Text Item CL5

more deeply.


English Features Imp Spanish features ImpItem SubtlexCand deltaBigger 100.00 Item SubtlexCand deltaBigger 100.00Item SubtlexCand numCandidates 63.03 Item Ling Morph isContentWord.1 53.38Sent Const parseScore 58.45 Item SubtlexCand numCandidates 50.77Item SubtlexCand weakness 35.22 Sent Const parseScore 42.06Sent Const parseDepth 25.22 Item Ling Synt numDependencies 35.63Item Ling Synt numDependencies 25.10 Item SubtlexCand weakness 28.75Sent Dlt vmMaxTotalInt 18.02 Sent Const parseDepth 27.65Sent Dlt cTotalInt 13.49 Sent Dlt cmvMaxTotalInt 22.40Text Lex Density Verb 12.60 Sent Dlt mMaxTotalInt 21.49Item Ling Morph isContentWord.1 12.56 Sent Dlt cMaxTotalInt 18.24Text Dlt cmvMaxTotalInt 11.55 Sent Dlt cvTotalInt 16.94Text Lex Variation TTR RTTR 11.06 Text Lex Density Determiner 11.01Text Lex Density Function 10.62 Text Dlt cmvMaxTotalInt 10.64Text Lex Variation Lex 9.94 Text Lex Variation TTR Log 6.75Text Lex Variation TTR 8.86 Text Lex Variation Adv 6.27Text Lex Variation Verb2 8.70 Text Dlt vmTotalInt 5.33Text Dlt cTotalInt 8.67 Text Lex Density Lex 5.27Sent Dlt cmvTotalInt 8.20 Text Lex Density Function 5.07Text Lex Variation Mod 8.13 Text Lex Variation VV1co 5.05Text Lex Density Noun 7.40 Text Lex Variation Lex 4.14

Table 6.9: Variable importance given the RF models for profile Sent Item CL4 using the same set of 63 comparable features for both languages. The listed importance values are scaled over all features.

The performance profile Text Item CL5, which is the most interpretable and suitable for a

real-world application for both languages, resulted in a macro-averaged F1 score of 0.76

(English, 156 features) and 0.82 (Spanish, 100 features) using SVM. As shown in Svetashova's

(2015) visualization (Figure 2.5), an application can make use of these models in order

to visualize item difficulties by color. The presented results confirm that these models

can be used in practice.

Although the Spanish features are just a subset of the English features, they turned

out to produce better classification results. A prominent feature subgroup in the 20

most predictive English features is the group of psycholinguistic item level features in-

cluding term and document frequency measures, measures based on frequency lists and

psycholinguistic ratings. For the Spanish data, item level features are less prominent:

three term and document frequency features are the only item level features among the

top 20 features. Since the set of features for the two languages is different in the sense

that English has many more item level features than Spanish, a direct comparison cannot

be drawn from these first experiments.

For both languages, we also presented mean feature values for the different clusters

and showed that they reflect the cluster interpretations and mostly confirm intuitive

assumptions about item difficulty. The results confirmed assumptions such as the fol-

lowing: A high document frequency or n-gram frequency of an item makes a gap easier.

The more competing candidates an item has, the more difficult it is. However, the ex-

periments also lead to unexpected results. For both languages, on text as well as on


item level, the DLT integration costs increase with decreasing text or sentence diffi-

culty. DLT emerged in the field of human sentence processing and has been applied

within the context of complexity analysis of learner data (Weiß, 2017). Weiß (2017) has

shown that in German L2 corpora, DLT measures increase with increasing learner pro-

ficiency. One could therefore expect higher integration costs in sentences that contain

gaps which are difficult to fill in. The results presented in this study show the opposite

for features on sentence as well as on text level. Thus, test takers apparently profit from constructions with high integration costs. A possible reason could be the

position of the gaps relative to the integration cost counting. The features in this thesis

only measure these costs on sentence and text level, not on item level. Furthermore,

one should consider other possible reasons for these findings by reviewing the parse

trees that served as input for the feature calculations (see footnote 30). One reason could be that test

takers benefit from high integration costs because more dependent words precede the

verb under consideration.
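To make the counting principle behind these integration cost features concrete, the following sketch (not the thesis' actual implementation, which distinguishes several cost variants) approximates the DLT integration cost of each dependency arc by the number of intervening nouns and verbs; the tag set, the toy parse, and all names are illustrative assumptions.

DISCOURSE_REFERENT_TAGS = {"NOUN", "PROPN", "VERB"}  # assumption: UD-style coarse tags

def integration_costs(heads, tags):
    """Simplified DLT integration cost per dependency arc: the number of
    intervening discourse referents (approximated here by nouns and verbs)
    between a dependent and its head. heads[i] is the head index of token i
    (-1 for the root), tags[i] its coarse POS tag."""
    costs = []
    for dep, head in enumerate(heads):
        if head < 0:
            continue  # skip the root token
        lo, hi = sorted((dep, head))
        costs.append(sum(1 for t in tags[lo + 1:hi] if t in DISCOURSE_REFERENT_TAGS))
    return costs

# Toy example: "The reporter who the senator attacked admitted the error."
tags  = ["DET", "NOUN", "PRON", "DET", "NOUN", "VERB", "VERB", "DET", "NOUN", "PUNCT"]
heads = [1, 6, 5, 4, 5, 1, -1, 8, 6, 6]   # hand-crafted, illustrative parse only
costs = integration_costs(heads, tags)
print(sum(costs), max(costs))             # total and maximum cost for the sentence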

Using a set of 63 comparable features only, we performed experiments on the per-

formance profile Text Item CL5. We showed that four DLT predictors were removed from the set of English features due to high correlations, but not from the set of Spanish features. This suggests that the DLT features show higher variance in the

Spanish data. A closer look at the data and the resulting feature calculations might

show the reasons for this outcome. We expect that language-specific differences in the

dependency structures cause differences in the correlations of resulting features.
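As an illustration of this preprocessing step, a minimal correlation filter could look as follows; this is only a sketch of the general technique (the thesis used its R-based pipeline), and the threshold and file name are assumptions.

import pandas as pd

def drop_highly_correlated(features: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    """Greedily drop one column of every pair whose absolute Pearson correlation
    exceeds the threshold. Assumes numeric feature columns; the threshold value
    is an assumption, not necessarily the one used in the thesis."""
    corr = features.corr().abs()
    to_drop = set()
    cols = list(features.columns)
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            if a in to_drop or b in to_drop:
                continue
            if corr.loc[a, b] > threshold:
                to_drop.add(b)   # keep the feature seen first, drop the second
    return features.drop(columns=sorted(to_drop))

# english = pd.read_csv("english_comparable_features.csv")   # hypothetical file
# english_reduced = drop_highly_correlated(english)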

The variable importances of the RF models showed a similar picture for both languages.

A difference within the 10 most predictive features is the high ranking of the morpho-

logical feature isContentWord in the Spanish model. This ranking supports the claim

that the rich morphology of Spanish content words influences difficulty more than the

less rich morphology of English content words. The SUBTLEX candidate features are

ranked higher for English than for Spanish. As described on page 13, Beinborn (2016)

mentions that in English there exist more short words than in German and French,

which in turn causes high numbers of candidates for English short words. Since Span-

ish and French are both Romance languages with similar grammatical systems, the number of short words in Spanish should likewise be smaller than in English. Thus, a

larger candidate space for English short words might cause the candidate features to

be more predictive in the English classification than in the Spanish one.
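For illustration, scaled variable importances of the kind listed in Table 6.9 (maximum scaled to 100) could be derived as in the following sketch; the thesis used R and caret, so this scikit-learn version, including the file and column names, is only an assumed approximation.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def scaled_importances(X: pd.DataFrame, y) -> pd.Series:
    """Impurity-based RF feature importances, rescaled so that the most
    important feature receives the value 100."""
    rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)
    imp = pd.Series(rf.feature_importances_, index=X.columns)
    return (100 * imp / imp.max()).sort_values(ascending=False)

# data = pd.read_csv("sent_item_cl4_features.csv")            # hypothetical file
# print(scaled_importances(data.drop(columns="cluster"), data["cluster"]).head(20))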

The last set of experiments investigated differences between English and Spanish when considering the performance variables on sentence and item level only. Using the full set of comparable features, the Spanish classification outperforms the English one. However, this is not the case when using only the 19 comparable sentence features. This seems to suggest that a reasonably good interaction with item

30 Punctuation marks have not been ignored when considering the English and Spanish dependency structures. We do not consider this to have an influence on the results.


level features leads to better results when using all 63 features. The variable importance

does confirm that item level features are ranked higher in the Spanish data than in the

English data. This might be caused by the fact that in Spanish more information can

be encoded in one word. A further result that supports this idea is the rank of the

feature isContentWord. Just as for the profile Text Item CL5, it is again ranked higher

for Spanish than for English.

Moreover, it should be noted that differences between the languages can also be

caused by differences in the nature of the performance data. Inspecting the proficiency

levels more thoroughly is relevant in the context of difficulty comparison: For Spanish,

the proportion of proficient learners was considerably smaller than for English (cf. Section 4.1). This might have influenced the difficulty prediction, since different features are better indicators of difficulty at different levels of proficiency.

7 Conclusion

This thesis presented an automated way of investigating the difficulty characteristics of English and Spanish C-Tests, focusing on the locality levels of items, sentences,

and text passages. The purpose of this thesis was twofold: First, we aimed at devel-

oping a pipeline that automatically predicts the difficulty of C-Test items and text

passages for the two languages English and Spanish. This pipeline is intended to be integrated into a real-world application, where difficulty estimates for text passages

and items can be used to influence the difficulty of whole C-Tests. Second, we aimed at

investigating how the difficulty characteristics of C-Tests vary across the two languages,

given the differences in their grammatical and morphological systems. The following

summarizes the findings and suggests ideas for future work.

We reviewed existing work on the topic, focusing on the approaches presented in Svetashova (2015), Beinborn (2016), and Beinborn et al. (2014). Their

findings concerning performance modeling and difficulty predictions served as a baseline

for this work.

The performance of test takers could be modeled given a C-Test database provided

by the Language Learning Center of the University of Tübingen. We extracted statis-

tics about the test takers’ performance on different levels of locality: For each C-Test

item, the percentage of correct insertions and the corresponding averages throughout

sentences and whole texts have been used as performance variables. We further in-

spected the performance of test takers across proficiency levels given their final C-Test

score. The difficulty of a C-Test item has usually been defined by the ratio of incorrectly

inserted answers to all answers. Svetashova (2015) presented a new way of grouping

the items, namely based on information about the test takers’ performance on the two

locality dimensions item and text. Regression and classification results both confirmed


that the classes resulting from the two-dimensional clustering approach can be pre-

dicted quite accurately using a broad range of linguistic features. Thus, her findings

showed that the difficulty of a C-Test item depends on the difficulty of the whole text.

We followed her clustering approach and further included sentence-averaged statistics as

performance variables. We performed Hierarchical Clustering on Principal Components

on the updated set of items and experimented with the combination of performance

variables and the number of clusters. A clustering principle using the item and text

variables only resulted in an interpretable picture of item groupings. For both lan-

guages, the items could be grouped into five clusters: easy items in difficult texts, easy

items in easy texts, difficult items in easy texts, difficult items in difficult texts, and as

a fifth group either all items in very easy texts for the English data, or all items in very

difficult texts for the Spanish data. We presume that the difference in the clustering

results is not caused by language-specific differences, but is rather dataset-specific.
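The clustering itself was performed with FactoMineR's HCPC in R; the following Python sketch only approximates the same idea (PCA on the standardized performance variables followed by Ward clustering), with the number of components, the number of clusters, and all file and column names being assumptions.

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import AgglomerativeClustering

def cluster_items(perf: pd.DataFrame, n_components: int = 2, n_clusters: int = 5):
    """perf: one row per C-Test item with its numeric performance variables
    (e.g. item- and text-level correctness percentages). Returns cluster labels."""
    scores = PCA(n_components=n_components).fit_transform(
        StandardScaler().fit_transform(perf))
    return AgglomerativeClustering(n_clusters=n_clusters,
                                   linkage="ward").fit_predict(scores)

# perf = pd.read_csv("item_text_performance.csv")             # hypothetical file
# perf["cluster"] = cluster_items(perf[["item_perf", "text_perf"]])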

The difficulty of the C-Test items was modeled using a broad range of English fea-

tures that proved to be predictive in the work of Svetashova (2015). In addition, we integrated features that emerged within the scope of linguistic complexity analysis and human sentence processing into our pipeline. The

final feature set comprised surface-based, lexical, psycholinguistic, syntactic and dis-

course features, as well as C-Test specific features describing an item’s candidate space,

position, or context. In total, 156 English features have been implemented. To get

a multi-lingual perspective on C-Test difficulty characteristics, we designed a versatile

set of 100 features for Spanish C-Tests.

Our classification experiments have shown which performance profiles can be pre-

dicted best using which set of features. The experiments have been conducted using

Support Vector Machine (SVM) and Random Forest (RF) models with an 80/20% train/test split. Using the full feature set, we experimented with different performance profiles

and got the best classification results for those profiles containing a high proportion of

sentence performance variables (Text SentProf Item CL4 ): a micro-averaged F1 score

of 0.95 for the English data, and 0.93 for the Spanish data using RF. The performance

profile Text Item CL5, which is the most interpretable and the most suitable for a real-world application in both languages, resulted in a macro-averaged F1 score of 0.76 (English, 156 features)

and 0.82 (Spanish, 100 features) using SVM. Thus, the Spanish classification outper-

forms the English classification, even though the Spanish features are just a subset of

the English features.
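A compact sketch of this evaluation setup (stratified 80/20 split, SVM and RF, micro- and macro-averaged F1) is given below; the actual experiments were run with caret in R, so the scikit-learn code and the file and column names are assumptions that only mirror the procedure.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

def evaluate_profile(X: pd.DataFrame, y: pd.Series, seed: int = 0) -> dict:
    """Train SVM and RF on a stratified 80/20 split and report micro/macro F1."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed)
    results = {}
    for name, model in [("SVM", SVC()),
                        ("RF", RandomForestClassifier(n_estimators=500))]:
        pred = model.fit(X_tr, y_tr).predict(X_te)
        results[name] = {"micro_f1": f1_score(y_te, pred, average="micro"),
                         "macro_f1": f1_score(y_te, pred, average="macro")}
    return results

# data = pd.read_csv("text_item_cl5_features.csv")            # hypothetical file
# print(evaluate_profile(data.drop(columns="cluster"), data["cluster"]))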

Additionally, it was shown that the clusters’ mean feature values reflect difficulty

tendencies which either confirm or disprove intuitive assumptions. As an example,

high-frequency words are easier to master than low-frequency words. On the other hand,

sentences which are known to be more complex in terms of the Dependency Locality

Theory are easier to master. We suggested that this might be caused by the gap’s

position relative to the integration cost counting. To determine the reasons that


explain these findings, we suggest the inspection of dependency structures of single

data points in future work.

In order to approximately compare our results to those reported by Svetashova

(2015), we performed classification experiments on the performance profile that com-

bines all item and text level performance variables, including the proficiency level split-

ting. Profile TextProf ItemProf CL4 is the performance profile that is most comparable

to Svetashova’s best performing profile, although it lacks performance statistics from

Item Response Theory. We reported comparative RF classification results given the

new English data: a micro-averaged F1 score of 0.79, while Svetashova (2015) reports a

score of 0.7959. It should be noted, however, that the results reported in this work all refer to the texts extracted from the latest version of the FSZ database, while Svetashova (2015) conducted the experiments on an earlier, smaller version (cf. Table 3.1, p. 22).

To compare the difficulty characteristics of the two languages, we performed clas-

sification experiments with a set of 63 comparable features. With a micro-averaged

F1 score of 72 in RF classification, the Spanish classification outperforms the English

classification, which shows a highest F1 score of 0.64 using SVM. From these results

we can conclude that the underlying set of comparable features is more predictive for

the Spanish language than for the English language. The variable importance showed

a much higher ranking of the morphologic feature isContentWord, which confirmed

our preliminary assumption that morphologic features are more important for Spanish,

which is morphologically richer than English. For English, the features describing the

item’s candidate space were highly predictive. The high number of English short words

increases the candidate space of short words. We suggested that this might have caused

the good performance of English candidate features in contrast to Spanish candidate

features. The most predictive feature for both languages is measuring the difference

between the word frequency of an item itself and its competing candidate with highest

word frequency. This feature is also the most predictive one in experiments on the test

takers’ performance on sentence and item level. The classification results for profile

Sentence Item CL4 showed that item level features are ranked lower in the English

experiment than in the Spanish experiment.
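As a simplified illustration of this feature family, the following sketch derives a candidate count and a frequency difference from a visible gap prefix and a frequency list; the exact SUBTLEX-based operationalization in the thesis may differ, and all names and the toy data are assumptions.

def candidate_features(solution: str, prefix: str, freq: dict) -> dict:
    """freq maps words to corpus frequencies (e.g. from a SUBTLEX-style list).
    Candidates are all other known words that share the visible prefix of the gap."""
    candidates = [w for w in freq if w.startswith(prefix) and w != solution]
    item_freq = freq.get(solution, 0)
    max_cand = max((freq[w] for w in candidates), default=0)
    return {
        "numCandidates": len(candidates),
        "hasBigger": max_cand > item_freq,
        "deltaBigger": max_cand - item_freq,   # simplified: raw frequency difference
    }

toy_freq = {"house": 5200, "hour": 9800, "hound": 310, "hot": 7600}
print(candidate_features("house", "hou", toy_freq))
# {'numCandidates': 2, 'hasBigger': True, 'deltaBigger': 4600}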

Given the findings of this thesis, several questions remain for future research. The

presented results show which language characteristics influence the difficulty of

C-Test items. More research is still necessary in order to figure out why certain features

are more predictive than others. This will involve a thorough inspection of single data

points, also taking into account the language learners' proficiency levels. Additionally,

the set of features could be expanded by including further measures described for

example in the context of SLA, human sentence processing, or language complexity

analysis. Another way of extending the feature set would be to project item level

features to sentence and text level, e.g. by taking the average values of items in the


sentence or text. So far, the text level features consider the text only as a non-gapped text rather than as a gapped testing text.
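A minimal sketch of this projection idea, assuming the item-level features are available as a table with sentence and text identifiers (all column and file names are assumptions):

import pandas as pd

def project_item_features(items: pd.DataFrame, level: str = "sentence_id") -> pd.DataFrame:
    """items: one row per gap with numeric item-level features plus the ids of the
    enclosing sentence and text. Returns per-sentence (or per-text) mean values."""
    feature_cols = items.columns.difference(["item_id", "sentence_id", "text_id"])
    return items.groupby(level)[feature_cols].mean().add_prefix(f"{level}_mean_")

# items = pd.read_csv("item_features.csv")                    # hypothetical file
# sentence_level = project_item_features(items, level="sentence_id")
# text_level = project_item_features(items, level="text_id")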

Further research will also be required concerning the DLT measures, which were

proven to be predictive in the given task. The experiments showed that increasing

integration costs lower the sentence and text difficulties. More research is necessary to

present evidence for these findings.

In order to gain more insights into differences between C-Test difficulties in English

and Spanish, we suggest extending the set of features which encode morphological and

syntactic information. For Spanish, there exist tagsets with a high number of different

tags encoding very specific information. We suggest making use of this information

by extending those feature sets that are based on POS information. On top of that,

the syntactic complexity features could be adapted to the Spanish pipeline. The in-

vestigation of other languages is also desirable in order to broaden the multi-lingual

perspective on the topic.

We further suggest examining the test takers' distribution across proficiency levels in more detail. It would be interesting to discover which language characteristics have an impact on difficulty for beginning learners in contrast to those that indicate difficulties for advanced learners.

The presented pipeline is fully automated and expects an input structure that is

already used in a real-world C-Test generation application. As a next step towards the

application of the results, the difficulty prediction component needs to be integrated

into the existing user interface.


References

Alderson, J. C. (1979). The cloze procedure and proficiency in English as a foreign language. TESOL Quarterly, 13(2):219–227.

Amaral, L. and Meurers, D. (2011). On using intelligent computer-assisted language

learning in real-life foreign language teaching and learning. ReCALL, 23(1):4–24.

Babaii, E. and Ansary, H. (2001). The C-test: a valid operationalization of reduced redundancy principle? System, 29(2):209–219.

Bachman, L. F. (1982). The trait structure of cloze test scores. TESOL Quarterly, 16(1):61–70.

Beinborn, L. (2016). Predicting and Manipulating the Difficulty of Text-Completion Exercises for Language Learning. Dr. rer. nat. thesis, Technische Universität Darmstadt.

Beinborn, L., Zesch, T., and Gurevych, I. (2014). Predicting the difficulty of language

proficiency tests. Transactions of the Association for Computational Linguistics,

2:517–529.

Bird, S. and Loper, E. (2006). NLTK: the Natural Language Toolkit. In Proceedings of the COLING/ACL on Interactive Presentation Sessions, pages 69–72. Association for Computational Linguistics.

Brants, T. and Franz, A. (2006). Web 1T 5-gram Version 1.

Brysbaert, M., New, B., and Keuleers, E. (2012). Adding part-of-speech information to the SUBTLEX-US word frequencies. Behavior Research Methods, 44(4):991–997.

Chapelle, C. A. and Chung, Y.-R. (2010). The promise of NLP and speech processing technologies in language assessment. Language Testing, 27(3):301–315.

Chen, D. and Manning, C. (2014). A fast and accurate dependency parser using neural

networks. In Proceedings of the 2014 conference on empirical methods in natural

language processing (EMNLP), pages 740–750. Association for Computational Lin-

guistics.

Chen, X. and Meurers, D. (2016). CTAP: A web-based tool supporting automatic complexity analysis. In Proceedings of the Workshop on Computational Linguistics for Linguistic Complexity (CL4LC), pages 113–119.

Cohen, A. D., Segal, M., and Bar-Siman-To, R. (1984). The C-test in Hebrew. Language Testing, 1(2):221–225.

Coxhead, A. (2000). A new academic word list. TESOL Quarterly, 34(2):213–238.


Cuetos, F., Glez-Nosti, M., Barbón, A., and Brysbaert, M. (2012). SUBTLEX-ESP: Spanish word frequencies based on film subtitles. Psicológica, 33(2):133–143.

Dörnyei, Z. and Katona, L. (1992). Validation of the C-test amongst Hungarian EFL learners. Language Testing, 9(2):187–206.

Eckes, T. and Grotjahn, R. (2006). A closer look at the construct validity of C-tests. Language Testing, 23(3):290–325.

Ferrucci, D., Lally, A., Verspoor, K., and Nyberg, E. (2009). Unstructured information

management architecture (UIMA) version 1.0. OASIS Standard.

García-Pablos, A., Cuadros, M., Gaines, S., and Rigau, G. (2013). OpeNER demo: Open polarity enhanced named entity recognition. In Come Hack with OpeNER! Workshop Programme, volume 501, pages 12–14.

Gibson, E. (2000). The dependency locality theory: A distance-based theory of linguis-

tic complexity. Image, language, brain, pages 95–126.

Graesser, A. C., McNamara, D. S., and Louwerse, M. M. (2003). What do readers need

to learn in order to process coherence relations in narrative and expository text.

Rethinking reading comprehension, pages 82–98.

Grotjahn, R. (2002). Konstruktion und Einsatz von C-Tests: Ein Leitfaden für die Praxis. Der C-Test. Theoretische Grundlagen und praktische Anwendungen, 4:211–225.

Grotjahn, R. and Tönshoff, W. (1992). Textverständnis bei der C-Test-Bearbeitung. Pilotstudien mit Französisch- und Italienischlernern. Der C-Test. Theoretische Grundlagen und praktische Anwendungen, 1:19–95.

Hancke, J., Vajjala, S., and Meurers, D. (2012). Readability classification for German

using lexical, syntactic, and morphological features. In Proceedings of the 24th In-

ternational Conference on Computational Linguistics (COLING), pages 1063–1080,

Mumbai, India.

Hernández-Figueroa, Z., Rodríguez-Rodríguez, G., and Carreras-Riudavets, F. (2009). Separador de sílabas del español – Silabeador TIP.

Hughes, A. (2007). Testing for language teachers. Ernst Klett Sprachen.

Klein-Braley, C. (1984). Practice and Problems in Language Testing. Papers from the

International Symposium on Language Testing, volume 29, chapter Advance Predic-

tion of Difficulty with C-Tests., pages 97–112. ERIC, Colchester, England.

Klein-Braley, C. (1985). A cloze-up on the C-test: a study in the construct validation of authentic tests. Language Testing, 2(1):76–104.

Krashen, S. (1985). The Input Hypothesis: Issues and Implications. Laredo.


Kuhn, M., Wing, J., Weston, S., Williams, A., Keefer, C., Engelhardt, A., Cooper, T., Mayer, Z., and Kenkel, B. (2015). caret: Classification and Regression Training. R package version 6.0-21. CRAN, Vienna, Austria.

Kuperman, V., Stadthagen-Gonzalez, H., and Brysbaert, M. (2012). Age-of-acquisition ratings for 30,000 English words. Behavior Research Methods, 44(4):978–990.

Laufer, B. and Nation, P. (1995). Vocabulary size and use: Lexical richness in L2 written production. Applied Linguistics, 16(3):307–322.

Lê, S., Josse, J., and Husson, F. (2008). FactoMineR: A package for multivariate analysis. Journal of Statistical Software, 25(1):1–18.

Levy, R. and Andrew, G. (2006). Tregex and tsurgeon: tools for querying and ma-

nipulating tree data structures. In Proceedings of the fifth international conference

on Language Resources and Evaluation, pages 2231–2234, Genoa, Italy. European

Language Resources Association (ELRA).

Lu, X. (2010). Automatic analysis of syntactic complexity in second language writing.

International Journal of Corpus Linguistics, 15(4):474–496.

Lu, X. (2012). The relationship of lexical richness to the quality of ESL learners' oral narratives. The Modern Language Journal, 96(2):190–208.

McNamara, D. S., Graesser, A. C., McCarthy, P. M., and Cai, Z. (2014). Automated

evaluation of text and discourse with Coh-Metrix. Cambridge University Press.

McNamara, T. (2000). Language Testing. Oxford Introduction to Language Study

ELT. OUP Oxford.

Meurers, D., Ziai, R., Amaral, L. A., Boyd, A., Dimitrov, A., Metcalf, V., and Ott,

N. (2010). Enhancing authentic web pages for language learners. In Proceedings

of the NAACL HLT 2010 Fifth Workshop on Innovative Use of NLP for Building

Educational Applications, pages 10–18, Los Angeles, California. Association for Com-

putational Linguistics.

Nivre, J., de Marneffe, M.-C., Ginter, F., Goldberg, Y., Hajic, J., Manning, C. D.,

McDonald, R. T., Petrov, S., Pyysalo, S., Silveira, N., et al. (2016). Universal

dependencies v1: A multilingual treebank collection. In Proceedings of the Tenth

International Conference on Language Resources and Evaluation (LREC).

R Development Core Team (2008). R: A Language and Environment for Statistical

Computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-

900051-07-0.


Raatz, U. and Klein-Braley, C. (1981). The C-test – a modification of the cloze procedure. In Practice and Problems in Language Testing. Proceedings of the International Language Testing Symposium of the Interuniversitäre Sprachtestgruppe, volume 4, Essex, England. Education Resources Information Center (ERIC).

Shain, C., van Schijndel, M., Futrell, R., Gibson, E., and Schuler, W. (2016). Memory

access during incremental sentence processing causes reading time latency. Pro-

ceedings of the Workshop on Computational Linguistics for Linguistic Complexity

(CL4LC), pages 49–58.

Spolsky, B. (1969). Reduced redundancy as a language testing tool. In Language

Testing Section of the 2nd International Congress of Applied Linguistics, Cambridge.

England. Education Resources Information Center (ERIC).

Suvorov, R. and Hegelheimer, V. (2013). Computer-assisted language testing. The

companion to language assessment.

Svetashova, Y. (2015). C-test item difficulty prediction: Exploring the linguistic characteristics of C-tests using machine learning. Master's thesis, Department of Linguistics, University of Tübingen.

Todirascu, A., François, T., Gala, N., Fairon, C., Ligozat, A.-L., and Bernhard, D. (2013). Coherence and cohesion for the assessment of text readability. Natural Language Processing and Cognitive Science, 11:11–19.

Toutanova, K., Klein, D., Manning, C. D., and Singer, Y. (2003). Feature-rich part-

of-speech tagging with a cyclic dependency network. In Proceedings of the 2003

Conference of the North American Chapter of the Association for Computational

Linguistics on Human Language Technology-Volume 1, pages 173–180. Association

for Computational Linguistics.

Vajjala, S. (2015). Analyzing text complexity and text simplification: connecting linguistics, processing and educational applications. PhD thesis, University of Tübingen.

Vajjala, S. and Meurers, D. (2012). On improving the accuracy of readability classifi-

cation using insights from second language acquisition. In Proceedings of the Seventh

Workshop on Building Educational Applications Using NLP, pages 163–173. Associ-

ation for Computational Linguistics.

Weiß, Z. (2015). More linguistically motivated features of language complexity in readability classification of German textbooks: Implementation and evaluation. Bachelor's thesis, Department of Linguistics, University of Tübingen.

Weiß, Z. (2017). Using measures of linguistic complexity to assess German L2 proficiency in learner corpora under consideration of task-effects. Master's thesis, Department of Linguistics, University of Tübingen.


Wilson, M. (1988). MRC Psycholinguistic Database: Machine-usable dictionary, version 2.00. Behavior Research Methods, 20(1):6–10.


A Appendix

The following tables list all the features implemented in the English pipeline. The

column ”ES” denotes whether the feature has been adapted to Spanish. The column

”COMP” indicates whether the feature is included in the set of 63 comparable features.

Moreover, Table A.6 and Table A.7 list how many participants have processed each text passage.

Feature Group | Feature Name | ES | COMP
Linguistic Features | Item Ling endingLength | ✓ |
| Item Ling lemmaLength | ✓ |
| Item Ling Morph isContentWord | ✓ | ✓
| Item Ling Synt numDependencies | ✓ | ✓
Psycholinguistic Features | Item Psy TermOccInDoc | ✓ |
| Item Psy TF | ✓ |
| Item Psy DF | ✓ |
| Item Psy IDF | ✓ |
| Item Psy TFIDF | ✓ |
| Item Psy isLemmasMostFreqTok | |
| Item Psy isTokensMostFreqTag | |
| Item Psy Lfp Band1 | |
| Item Psy Sem AoA Kup | |
| Item Psy Sem Concr | |
| Item Psy Sem Fam | |
| Item Psy Sem Imag | |
| Item Psy Sem Meani | |
| Item Psy StylNCatsMRC | |
| Item Psy StylNCatsNLTK | |
Position based Features | Item Posit distancePrevMention Lemma | ✓ |
| Item Posit distancePrevMention Token | ✓ |
| Item Posit isInClosing | ✓ |
| Item Posit isInStartingOrClosing | ✓ |
| Item Posit Number Gap | ✓ |
| Item Posit Number Token | ✓ |
| Item Posit previousItemTrigramProb | |
| Item Posit previousItemUnigramProb | |

Table A.1: Item level features from the groups of linguistic, psycholinguistic and position based features.


Feature Group | Feature Name | ES | COMP
Context and Candidate Space | Item Ngram Cands NumPotentialEndings | |
| Item Ngram Cands BiggerUniDelta | |
| Item Ngram Cands BiggerBiRightDelta | |
| Item Ngram Cands BiggerBiLeftDelta | |
| Item Ngram Cands BiggerTriCenterDelta | |
| Item Ngram Cands BiggerTriLeftDelta | |
| Item Ngram Cands BiggerTriRightDelta | |
| Item Ngram Cands UniPrWeakness | |
| Item Ngram Cands BiLeftPrWeakness | |
| Item Ngram Cands BiRightPrWeakness | |
| Item Ngram Cands TriLeftPrWeakness | |
| Item Ngram Cands TriCenterPrWeakness | |
| Item Ngram Cands TriRightPrWeakness | |
| Item Ngram Cands HasBiggerUni | |
| Item Ngram Cands HasBiggerBiLeft | |
| Item Ngram Cands HasBiggerBiRight | |
| Item Ngram Cands HasBiggerTriCenter | |
| Item Ngram Cands HasBiggerTriLeft | |
| Item Ngram Probs Uni | |
| Item Ngram Probs Bi Left | |
| Item Ngram Probs Bi Right | |
| Item Ngram Probs Tri Center | |
| Item Ngram Probs Tri Left | |
| Item Ngram Probs Tri Right | |
| Item Ngram Probs Four CenterRight | |
| Item Ngram Probs Four Left | |
| Item Ngram Probs Four LeftCenter | |
| Item Ngram Probs Four Right | |
| Item Ngram Probs Five Center | |
| Item Ngram Probs Five CenterRight | |
| Item Ngram Probs Five Left | |
| Item Ngram Probs Five LeftCenter | |
| Item Ngram Probs Five Right | |
| Item Ngram Probs Max | |
| Item SubtlexCand deltaBigger | ✓ | ✓
| Item SubtlexCand hasBigger | ✓ | ✓
| Item SubtlexCand numCandidates | ✓ | ✓
| Item SubtlexCand weakness | ✓ | ✓

Table A.2: Item level features from the group of context and candidate space features.


Feature Group | Feature Name | ES | COMP
Surface Features | Sent numGapsInSent | ✓ |
| Sent sentID | ✓ | ✓
Syntactic Features | Sent Const numComplicators | |
| Sent Const parseDepth | ✓ | ✓
| Sent Const parseScore | ✓ | ✓
DLT Features | Sent Dlt oMaxTotalIntegrationCost | ✓ | ✓
| Sent Dlt cMaxTotalIntegrationCost | ✓ | ✓
| Sent Dlt mMaxTotalIntegrationCost | ✓ | ✓
| Sent Dlt vMaxTotalIntegrationCost | ✓ | ✓
| Sent Dlt cmMaxTotalIntegrationCost | ✓ | ✓
| Sent Dlt cvMaxTotalIntegrationCost | ✓ | ✓
| Sent Dlt vmMaxTotalIntegrationCost | ✓ | ✓
| Sent Dlt cmvMaxTotalIntegrationCost | ✓ | ✓
| Sent Dlt oTotalIntegrationCosts | ✓ | ✓
| Sent Dlt cTotalIntegrationCosts | ✓ | ✓
| Sent Dlt vTotalIntegrationCosts | ✓ | ✓
| Sent Dlt mTotalIntegrationCosts | ✓ | ✓
| Sent Dlt cmTotalIntegrationCosts | ✓ | ✓
| Sent Dlt cvTotalIntegrationCosts | ✓ | ✓
| Sent Dlt vmTotalIntegrationCosts | ✓ | ✓
| Sent Dlt cmvTotalIntegrationCosts | ✓ | ✓

Table A.3: Sentence level features.


Feature Group | Feature Name | ES | COMP
Lexical Features | Text Lex Density Adjective | ✓ | ✓
| Text Lex Density Adverb | ✓ | ✓
| Text Lex Density Conjunction | ✓ | ✓
| Text Lex Density Determiner | ✓ | ✓
| Text Lex Density Function | ✓ | ✓
| Text Lex Density Lex | ✓ | ✓
| Text Lex Density Noun | ✓ | ✓
| Text Lex Density Verb | ✓ | ✓
| Text Lex Variation Adj | ✓ | ✓
| Text Lex Variation Adv | ✓ | ✓
| Text Lex Variation Lex | ✓ | ✓
| Text Lex Variation Mod | ✓ | ✓
| Text Lex Variation NDWZ | ✓ | ✓
| Text Lex Variation TTR | ✓ | ✓
| Text Lex Variation TTR CTTR | ✓ | ✓
| Text Lex Variation TTR Log | ✓ | ✓
| Text Lex Variation TTR RTTR | ✓ | ✓
| Text Lex Variation TTR Uber | ✓ | ✓
| Text Lex Variation Verb2 | ✓ | ✓
| Text Lex Variation VV1 | ✓ | ✓
| Text Lex Variation VV1co | ✓ | ✓
| Text Lex Variation VV1sq | ✓ | ✓
| Text Lex Sophistication CSV | ✓ |
| Text Lex Sophistication LS1 | ✓ |
| Text Lex Sophistication LS2 | ✓ |
| Text Lex Sophistication VS1 | ✓ |
| Text Lex Sophistication VS2 | ✓ |
| Text Lex FrequencyProfile Band1 Tok | ✓ |
| Text Lex FrequencyProfile Band1 Type | ✓ |
| Text Lex FrequencyProfile Band2 Tok | ✓ |
| Text Lex FrequencyProfile Band2 Type | ✓ |
| Text Lex FrequencyProfile Band3 Tok | ✓ |
| Text Lex FrequencyProfile Band3 Type | ✓ |
| Text Lex FrequencyProfile Band4 Tok | ✓ |
| Text Lex FrequencyProfile Band4 Type | ✓ |
| Text Lex FrequencyProfile Band5 Tok | ✓ |
| Text Lex FrequencyProfile Band5 Type | ✓ |
| Text Lex FrequencyProfile Band6 Tok | ✓ |
| Text Lex FrequencyProfile Band6 Type | ✓ |
| Text Lex FrequencyProfile Band7 Tok | ✓ |
| Text Lex FrequencyProfile Band7 Type | ✓ |
| Text Lex awlNumFam | |
| Text Lex awlPercent | |
| Text Lex famListPercent k1 | |
| Text Lex famNum k1 | |

Table A.4: Text level features from the group of lexical features.


Feature Group | Feature Name | ES | COMP
Syntactic Complexity Features | Text Syn Complexity CNperC | |
| Text Syn Complexity CTperT | |
| Text Syn Complexity DCperC | |
| Text Syn Complexity MLC | |
| Text Syn Complexity MLS | |
| Text Syn Complexity MLT | |
| Text Syn Complexity VPperT | |
DLT Features | Text Dlt oMaxTotalIntegrationCost | ✓ | ✓
| Text Dlt cMaxTotalIntegrationCost | ✓ | ✓
| Text Dlt mMaxTotalIntegrationCost | ✓ | ✓
| Text Dlt vMaxTotalIntegrationCost | ✓ | ✓
| Text Dlt cvMaxTotalIntegrationCost | ✓ | ✓
| Text Dlt cmMaxTotalIntegrationCost | ✓ | ✓
| Text Dlt vmMaxTotalIntegrationCost | ✓ | ✓
| Text Dlt cmvMaxTotalIntegrationCost | ✓ | ✓
| Text Dlt oTotalIntegrationCosts | ✓ | ✓
| Text Dlt cTotalIntegrationCosts | ✓ | ✓
| Text Dlt mTotalIntegrationCosts | ✓ | ✓
| Text Dlt vTotalIntegrationCosts | ✓ | ✓
| Text Dlt cmTotalIntegrationCosts | ✓ | ✓
| Text Dlt cvTotalIntegrationCosts | ✓ | ✓
| Text Dlt vmTotalIntegrationCosts | ✓ | ✓
| Text Dlt cmvTotalIntegrationCosts | ✓ | ✓
Readability Index Features | Text Readability Flesh | ✓ |
| Text Readability Kincaid | ✓ |

Table A.5: Text level features from the groups of syntactic complexity, DLT and readability features.


# part. (>140) Text Ids

567 585564 529564 588550 527550 591546 584545 533541 532526 582495 609494 598494 610493 590492 589488 523487 587485 586485 605479 600476 581474 690474 716471 719470 720470 724469 725465 689464 685464 688455 691441 528435 478435 481432 466432 470432 479427 530427 531427 596422 603419 594418 604417 601374 524373 525173 764171 756169 766165 762160 748

Total: 22146

# part. (<140) Text Ids

62 60261 71860 58359 57857 76156 58054 76053 60752 75547 59347 60847 76546 57946 61146 72346 75245 60645 76343 59943 75043 75442 72142 75142 75341 69441 74940 69240 72240 75939 59538 68637 59237 71734 69333 59733 68733 75732 75829 76715 82214 82513 82112 81912 82012 82312 82910 8289 8249 8268 4688 4748 8275 4725 4765 4772 526

Total: 1890

Table A.6: English: Number of participants per text. On the left-hand side, those which were processed by more than 140 participants; on the right-hand side, those which were processed by fewer participants. In total, 24036 (22146 + 1890) answers are available for English.


# part. (>140) Text Ids

222 745218 508215 506213 742212 547210 739209 744208 503208 505208 650208 741207 543207 548205 537204 535200 646197 649197 653193 651191 495191 544184 502181 546178 451178 454178 455178 456178 457178 534177 640175 541173 538173 648170 638167 645164 497164 499164 500164 501161 647153 713152 711151 712147 708147 710

Total: 8358

# part. (<140) Text Ids

65 79564 79462 79661 79859 79729 64328 50728 64227 53627 54026 50426 65226 74026 80725 54224 74324 79924 80424 81123 63923 71523 80922 53922 54922 80519 50919 54519 71419 74619 80219 80619 81018 27916 80815 80013 70910 8419 8428 8017 8475 8404 8434 8444 8454 8462 839

Total: 1062

Table A.7: Spanish: Number of participants per text. On the left-hand side, those which were processed by more than 140 participants; on the right-hand side, those which were processed by fewer participants. In total, 9420 (8358 + 1062) answers are available for Spanish.
