Page 1

How Should Online Teachers of English as a Foreign Language

(EFL) Write Feedback to Students?

Cecilia Aguerrebere 1, Monica Bulger 2, Cristóbal Cobo 1, Sofía García 1, Gabriela Kaplan 3, and Jacob Whitehill 4

1 Fundación Ceibal (Uruguay), 2 Future of Privacy Forum, 3 Plan Ceibal (Uruguay), and 4 Worcester Polytechnic Institute (USA)

Page 2

Online learning of English as a Foreign Language (EFL)

• Online language learning, especially of English, is growing fast in terms of:

• Number of learners

• Number of platforms (e.g., Duolingo, Babbel, Coursera)

• For online language learning programs in which human teachers play a role, an important question is how they should spend their time to maximize students’ learning.

Page 3

• Some language learning platforms contain discussion forums in which students exchange messages with each other or with the teacher.

• Since learning a new language requires practice, teachers might want to encourage their students to communicate more on these forums.

• Main theme: How should teachers write feedback to their students so as to encourage them to participate in the discussion forums?

Online learning of English as a Foreign Language (EFL)

Page 4

EFL learning in Uruguay

• Our study is focused on students in Uruguay.

• Uruguay is one of the largest countries (# learners) in the One Laptop Per Child (OLPC) program and has implemented various online programs for K-12 learning.

• Deployment of laptops/tablets is organized by Plan Ceibal.

• Research on online learning is conducted by Fundación Ceibal.

Page 5

EFL learning in Uruguay

• Uruguay has implemented large-scale instruction of EFL in a dual-mode format:

• Mode 1 — in-class learning:

• Students learn in normal classrooms with a local teacher, who does not necessarily speak English.

• They are also connected in real-time by videoconference to a remote teacher (RT), who is a high-proficiency (often native) English speaker.

• One RT may service multiple remote classrooms.

Figure 2: Nested structure of the dataset (classrooms 1, ..., k, each with its classroom teacher CT1, ..., CTk; remote teachers RT1, ..., RTM, each serving one or more classrooms).





Page 6

EFL learning in Uruguay

• Uruguay has implemented large-scale instruction of EFL in a dual-mode format:

• Mode 2 — online learning:

• To give students additional practice, an online EFL platform was developed called the Tutorials for Differentiated Learning (TDL) program.

Page 7

Tutorials for Differentiated Learning (TDL)

How Should Online English as a Foreign Language Teachers Write their Feedback to Students?

Cecilia Aguerrebere, Fundación Ceibal, Uruguay
Monica Bulger, Future of Privacy Forum
Cristóbal Cobo, Fundación Ceibal, Uruguay
Sofía García, Fundación Ceibal, Uruguay
Gabriela Kaplan, Plan Ceibal, Uruguay
Jacob Whitehill, Worcester Polytech. Inst., USA

ABSTRACT
We analyze how teachers write feedback to students in an online learning environment, specifically a setting in which high school students in Uruguay are learning English as a foreign language. How complex should teachers' feedback be? Should it be somehow adapted to each student's English proficiency level? How does teacher feedback affect the probability of engaging the student in a conversation? To explore these questions, we conducted both parametric (multilevel modeling) and non-parametric (bootstrapping) analyses of 27,627 messages exchanged between 35 teachers and 1074 students in 2017 and 2018. Our results suggest: (1) Teachers should adapt their feedback complexity to their students' English proficiency level. Students who receive feedback that is too complex or too basic for their level post 13-15% fewer comments than those who receive adapted feedback. (2) Feedback that includes a question is associated with a higher odds ratio (between 17.5 and 19) of engaging the student in conversation. (3) For students with low English proficiency, slow turnaround (feedback after 1 week) reduces this odds ratio by 0.7. These results have potential implications for the online platforms offering foreign language learning services, in which it is crucial to give the best possible learning experience while judiciously allocating teachers' time.

Keywords
online feedback, secondary education, English as a foreign language, EFL

1. INTRODUCTION
For decades, teacher feedback has been shown to be one of the greatest drivers influencing student learning [4, 13]. In this prominent research area, the focus has shifted from assessing whether feedback is effective to identifying the most powerful strategies [30]. Because of the highly complex nature of the feedback process, the answer to this question remains deeply tied to the particular context in which it happens. One particular learning domain that is growing fast in terms of number of learners and learning platforms (e.g., Duolingo, Babbel, Learning English at Coursera) is online learning of English as a foreign language (EFL). Despite its increasing prominence, teacher feedback in the online EFL context has received limited research attention [9, 30, 5].

Figure 1: Example of a TDL exercise available through the LMS: after reading material about Antarctica (left) or Aliens (right), the students are asked to mention interesting facts about the topic. (In the Antarctica thread, for instance, a student posts "The southernmost continent in the world, Antarctica surrounds the South Pole.", the RT replies "Great work! What do you think of the weather there?", and the student answers "It is very cold.")

In this paper we seek to contribute to the understanding of how teacher feedback influences students' behavior in the online EFL context. The particular setting that we focus on is an EFL program in which students learn English with the help of a remote teacher (RT), a native English speaker, with whom students communicate online using discussion forums. Within this context, we seek to build an understanding of how the feedback the RTs give to their students affects their posting behavior by answering the following research questions:

1. How complex should the RT feedback be?

2. Should it be somehow adapted to each student's English proficiency level?

3. How does RT feedback affect the probability of engaging the student in a conversation?

This research has potential implications for the countless online platforms offering foreign language learning services, aiming to enhance students' learning experience.

• TDL encourages students to complete reading exercises and write in an online discussion forum (e.g., the exercise shown in Figure 1 above).

Page 8

TDL online discussion forum: examples

• Example 1: RT’s feedback elicited a response from the student.

If the student does not respond, the conversation ends there, and the student may start new threads when doing new exercises. Here is an example of an interaction where the student engaged in the conversation:

Student: I do not have favorite music I like to listen to everything a little.
RT: That's great Alicia. What's your favourite song right now?
Student: At this moment I've heard a song from Michael Jackson that I loved its name is Thriller.
RT: Ok Alicia, thank you for sharing that :)

and another example where she didn't:

Student: I see six oceans: Atlantic, Indian, Pacific, Atlantic, Arctic and Southern ocean.
RT: Very well Andrea.

Learning English: In TDL, the RTs' feedback is intended less as a way of correcting students' mistakes and more as a way to encourage students to participate. Participation in the discussion forums is expected to be conducive to better learning, since doing the exercises requires reading the material in English as well as writing the response in English. Therefore, two measures of interest are: the total comments the student posts and whether the student engages in a given conversation with the RT.

2. PREVIOUS WORK
Even though there is no unified definition of feedback, the seminal work by Hattie and Timperley [9] conceptualizes feedback as information provided by an agent regarding aspects of a student's performance or understanding. It can be provided effectively, but this depends on several factors such as the task, the learning context and the learners [9, 20]. It may improve learning outcomes when it has a direct use (e.g., correcting the task), or it may increase motivation when it only expresses praise for the student [20].

In the online language learning context, feedback has been reported as a fundamental aspect of skills development [11]. Teacher feedback in online language learning environments can also inform the development of data-driven personalized feedback. Emerging data-driven learning systems adapt feedback to individual student needs, and have been shown to improve learning outcomes [17]. Furthermore, data mining has been used to understand the effects that polarity (positive vs. negative comments) and timing can have on different aspects of students' learning [12, 16].

Research on feedback for EFL learning in computer-mediated (CM) environments has widely focused on peer feedback, often on EFL writing [18]. Jiang and Ribeiro [10] present a systematic literature review on the effect of CM peer written feedback on adult EFL writing. They confirmed the findings of previous research acknowledging the positive impact of CM peer feedback in this context.

As in many other subjects in educational data mining, most research on feedback has focused on higher education settings [20], leaving the fundamental need to build understanding of the primary and secondary education contexts unattended. We find previous work on secondary and primary education contexts on particular topics such as teachers' feedback strategies [3] and student-generated feedback [8], among other more general examples [15].

Figure 1: Histogram of the total comments posted by each student (top left). Histograms of the average complexity (top right), specificity (bottom left) and polarity (bottom right) of the RTs' feedback (red) and of the students' posts (violet).

Our work complements and enriches previous work in several aspects: (1) it studies asynchronous teacher feedback in an online EFL environment, which has seldom been studied [19]; (2) it considers a secondary education setting, also fundamental and rarely analyzed [20]; (3) it follows a quantitative analysis exploiting a large-scale dataset (in terms of number of students, teachers, classrooms, and school diversity), as opposed to most case studies, which often include relatively few students or classrooms [19].

3. DATASET DESCRIPTION
The dataset under consideration was originally collected by Aguerrebere et al. [2] and includes all the comments (i.e., content, posting date, user id) as well as administrative information (for each student, who her CT and RT are) for the 1st-year secondary school classrooms (12-year-olds) that participated in the TDL program during school year 2017. In this work the dataset is extended to also include school year 2018. This yields a total of 27,627 comments exchanged between 1074 students, organized in 83 classrooms (in 49 public high schools located in 18 different states of the country), and 35 RTs. The dataset has a nested structure: students are organized into classrooms. Each classroom has a dedicated classroom teacher; in contrast, each remote teacher may serve multiple classrooms. Figure 1 shows the histogram of the total comments posted by each student during their corresponding school year. The dataset has been de-identified to preserve each participant's privacy and handled according to Uruguayan privacy protection legislation. After talking with the TDL stakeholders and the program leaders, a set of features characterizing each comment was defined: complexity, specificity, polarity and response delay. Each feature represents a different aspect of how elaborate a comment is (complexity, specificity), its tone (polarity) and how long the student had to wait to receive feedback (response delay).

Complexity (c) measures how elaborate a comment is by adding its characters per word, words per sentence and total sentences: c = (1/4)·(#chars/#words) + (1/5)·(#words/#sentences) + #sentences (the weights are included to give similar relevance to all terms; 4 and 5 are the median characters per word and words per sentence, respectively).

• Example 2: RT’s feedback elicited no response.

Page 9

Characteristics of teachers’ feedback

• We investigated the relationship between students’ discussion forum behavior (e.g., probability of response, #comments) and characteristics of teacher’s feedback:

• Complexity (absolute and relative)

• Polarity (positive vs. negative sentiment)

• Specificity

• Response delay

• Question


Page 11

Language complexity

• We operationalize (absolute) language complexity as c = (1/4)·(#chars/#words) + (1/5)·(#words/#sentences) + #sentences, where 4 is the median number of characters per word and 5 is the median number of words per sentence in our dataset.

• This model is inspired by a readability index by Krause et al. (2017).


Page 12

Language complexity: examples of high and low

Examples of low, medium and high complexity comments are: (c = 2.4) "Well done!", (c = 3.7) "My favourite national park is Yellowstone.", (c = 8.9) "Hi Alberto! This is an accurate description of the different continents, but can you try again? The activity is asking about different volcanic landforms! Can you please look at the encyclopedia and read the part about volcanic landforms to find the names of the three types of volcanic landforms? Here's the link: [link]". Other options, such as the automated readability index, were explored but discarded because of their limitations in representing the variability of the quite basic comments considered.
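To make the complexity feature concrete, here is a minimal Python sketch of the score. The paper does not specify its tokenization or sentence splitting, so the whitespace word split, the punctuation-stripped character count, and the sentence split on ., ! and ? below are assumptions, and the resulting values may differ slightly from the examples above.

```python
import re

def complexity(comment: str) -> float:
    """c = (1/4)*(#chars/#words) + (1/5)*(#words/#sentences) + #sentences."""
    words = comment.split()                                  # assumption: whitespace tokens
    sentences = [s for s in re.split(r"[.!?]+", comment) if s.strip()]
    n_words = max(len(words), 1)
    n_sents = max(len(sentences), 1)
    n_chars = sum(len(w.strip(".,!?:;")) for w in words)     # assumption: punctuation ignored
    return (n_chars / n_words) / 4 + (n_words / n_sents) / 5 + n_sents

print(round(complexity("Well done!"), 1))
print(round(complexity("My favourite national park is Yellowstone."), 1))
```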

Specificity (s) measures how specific, on average, the words in a comment are. It combines how deep each word w_i appears in the WordNet structure and how frequent the word is in the dataset: s = (1/W) · Σ_{i=1..W} [ depth(w_i)/Z + 1/freq(w_i) ], where W is the total number of words in the comment and Z is a normalizing factor equal to the maximum average comment complexity [6]. WordNet [13] is a lexical database for the English language that groups words into sets of synonyms and records several relations among them. Examples of comments with low and high specificity: (s = 0.1) "Very good! Do you have any cats?" and (s = 1.2) "The skeleton of brontosaurus.".
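A hedged sketch of how the specificity score could be computed with NLTK's WordNet interface follows. The paper does not state which synset, which depth measure, or which tokenization is used, so the choices below (the first synset's max_depth(), lower-cased whitespace tokens stripped of punctuation, and a frequency Counter built from the comment corpus) are assumptions rather than the authors' implementation.

```python
from collections import Counter
from nltk.corpus import wordnet as wn   # requires nltk.download("wordnet")

def specificity(comment: str, corpus_freq: Counter, Z: float) -> float:
    """s = (1/W) * sum_i [ depth(w_i)/Z + 1/freq(w_i) ]."""
    words = [w.strip(".,!?").lower() for w in comment.split() if w.strip(".,!?")]
    if not words:
        return 0.0
    total = 0.0
    for w in words:
        synsets = wn.synsets(w)
        depth = synsets[0].max_depth() if synsets else 0   # assumption: first synset's depth
        total += depth / Z + 1.0 / max(corpus_freq.get(w, 1), 1)
    return total / len(words)

# Tiny illustration; Z would be the maximum average comment complexity in the dataset.
corpus = ["Very good! Do you have any cats?", "The skeleton of brontosaurus."]
freq = Counter(w.strip(".,!?").lower() for c in corpus for w in c.split())
print(specificity("The skeleton of brontosaurus.", freq, Z=10.0))
```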

Polarity (p) measures the tone of the comment (positive or negative) as the average of an index (from -1 (negative) to +1 (positive))¹ assigned to each sentence based on the adjectives it contains (e.g., great, nice, awful). Examples of positive and negative comments: (p = 1.0) "Great Carla! Awesome spelling!!" and (p = -0.6) "I would not like because they are dangerous.".

¹The sentiment function of the pattern.en Python module was used: www.clips.uantwerpen.be/pages/pattern-en.

Response delay (τ) is the time lapse, in days, between the student's post and the RT's response. The median τ is 3 days.
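Since the footnote above names the sentiment function of the pattern.en module, a sketch of the polarity and response delay features might look as follows; scoring the whole comment in a single call, rather than averaging a per-sentence index as the paper describes, is a simplification.

```python
from datetime import datetime
from pattern.en import sentiment   # returns (polarity, subjectivity), polarity in [-1, 1]

def polarity(comment: str) -> float:
    # Simplification: score the whole comment instead of averaging per-sentence indices.
    return sentiment(comment)[0]

def response_delay_days(student_post: datetime, rt_reply: datetime) -> float:
    """Response delay tau, in days, between the student's post and the RT's reply."""
    return (rt_reply - student_post).total_seconds() / 86400.0

print(polarity("Great Carla! Awesome spelling!!"))
print(response_delay_days(datetime(2018, 7, 17, 9, 0), datetime(2018, 7, 20, 16, 0)))
```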

Figure 1 shows the histograms of the average complexity, specificity and polarity of the RTs' feedback and the students' posts. The average specificity tends to be larger for students than for RTs; this is likely because students' comments are responses to exercises that elicit very specific words such as "skeleton of brontosaurus", whereas the RTs' comments often contain basic words, e.g., "Great work!".

4. DATA ANALYSIS
We examine the effects of various characteristics of RTs' feedback on students' posting behavior. We use two complementary approaches [14]: (1) multilevel linear regression, which models the nested nature of the data; and (2) non-parametric bootstrap analysis. The latter is more complicated (e.g., it requires a bin width parameter) but can model non-linear relationships and makes fewer assumptions (e.g., normality of residuals) than many parametric models.

4.1 How complex should the RT feedback be?
In this section we are interested in the question: does RT feedback complexity (low vs. high) affect the total number of comments posted by the student? Note that we must consider the potential confound of the student's English proficiency level, as the effect of RT feedback complexity may vary for more or less proficient students.

4.1.1 Non-parametric approach
To answer this question we first follow a non-parametric approach, using the bootstrap to test the null hypothesis

H0 : E[T | Sc] = E[T | Sc, Rc]     (1)

versus E[T | Sc] ≠ E[T | Sc, Rc], where T is the total comments posted by the student, Sc is the student's English proficiency level (low/high) and Rc is the complexity level of the feedback the student received from her RT (low/high). Hypothesis (1) tests whether the total comments posted by the student depend on the complexity level of the feedback she received from her RT, after conditioning on her English proficiency level. If the null hypothesis is rejected, then the complexity level of the feedback that students receive affects their average engagement with the program, measured by the total comments they post and conditioned on their English proficiency level. It is important to note that we must reject the null hypothesis if E[T | Sc] differs statistically significantly from E[T | Sc, Rc] for any values of Sc and Rc. For this reason, in this section we examine the potential impact of Rc on T for each possible combination of (Sc, Rc).

To estimate the student's English proficiency level we use the average complexity of the comments posted by the student. Students with average complexity below (above) the median are classified as having low (high) proficiency, respectively. Similarly, the complexity level of the RT's feedback is low (high) when it is below (above) the median.

Bootstrapping for equality of means: How can we test whether P(T | Sc, Rc = low) and P(T | Sc, Rc = high) have the same mean? If the distribution P(T | Sc, Rc) were Gaussian for all values of Sc and Rc, then we could just use a t-test to compute the p-value (or a nested ANOVA to take into account the nested structure of the data). However, in our case the data are not Gaussian (see [1]). Fortunately, the bootstrap procedure proposed by Efron & Tibshirani [7] provides a rigorous methodology. By sampling with replacement from our original dataset, we can simulate multiple data samples. We subselect the data for which Rc = low and the data for which Rc = high, and then resample each of them to generate multiple bootstrap samples. To enforce the null hypothesis (i.e., equal means), we set the means of the two samples to be equal to the mean of the combined sample. We then compute the normalized difference in means between the two subsets in the bootstrapped sample. Over all B bootstrap iterations (B = 10000), we finally compute the fraction in which the normalized difference in means is at least as large as the observed statistic. See [1] for a detailed description of the algorithms.
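For readers who want to reproduce this kind of test, here is a minimal NumPy sketch of the bootstrap test for equality of means described above (two-sided here). It follows the recipe in the text but is not the authors' exact implementation; x and y stand for the total-comments vectors of the two groups being compared.

```python
import numpy as np

def bootstrap_mean_diff_pvalue(x, y, B=10000, seed=0):
    """Bootstrap test of H0: E[X] = E[Y], with the null enforced by shifting
    both samples so that their means equal the pooled mean."""
    rng = np.random.default_rng(seed)
    x, y = np.asarray(x, float), np.asarray(y, float)
    pooled_mean = np.concatenate([x, y]).mean()

    def t_stat(a, b):
        return (a.mean() - b.mean()) / np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))

    t_obs = t_stat(x, y)
    x0 = x - x.mean() + pooled_mean   # samples obeying the null hypothesis
    y0 = y - y.mean() + pooled_mean
    t_boot = np.array([
        t_stat(rng.choice(x0, len(x0), replace=True),
               rng.choice(y0, len(y0), replace=True))
        for _ in range(B)
    ])
    return float(np.mean(np.abs(t_boot) >= np.abs(t_obs)))  # fraction at least as extreme
```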

Results: For high-level students, more basic feedback is associated with more posting. Students who received high-complexity feedback posted on average 22% fewer comments than those who received low-complexity feedback (9.3 versus 12 total comments, p = 0.006). Examples:

Student A: My favorite food is hamburguer.
RT (low complexity): Nice Lucia! What do you eat in your hamburger?

Student B: They are in Spain in Barcelona.
RT (high complexity): Hello Pablo! Yes they are in Spain. Very good. Here is a link in case you would like to know a bit more about Spain and their culture. In Uruguay there are a lot of people who have Spanish origins. It is very evident in the food :) I have never been to Spain. Would you like to go to Spain?

A possible explanation for this behavior is that even the high-level students have weak English proficiency and might feel overwhelmed by feedback that is too complex. No statistically significant differences are observed for low-level students (8.5 versus 7.6 total comments, p = 0.14).

Page 13

Relative language complexity: examples

Results: Table 1 (Model 4) shows the computed fixed effects for the different covariates. The distance between the average complexity of the student's comments and the average complexity of the feedback the student received from his RT has a negative, statistically significant effect on the total comments posted by the student. There is a 13% (p = 0.008) and 15% (p = 0.003) decrease in total comments per one-unit increase in distance, for the low- and high-level students respectively. In the following we present a series of examples to help the reader gain insight into what small and large student-RT complexity distance mean in practice.

Large distance:

Student A: I would like to defile.
RT: Nice try Marcela, but I do not understand what you mean. There are two models in the above photo, can you tell me which model is from Brazil and which model is from the USA? Or, can you tell me, who is your favourite model? My favourite model is Kate Moss.

Small distance:

Student A: My favorite sport is football.
RT: Very good! what is your favorite football team?

In the latter example, the interaction seems closer to an online chat for students with basic English skills.

4.2.2 Non-parametric approach
A non-parametric approach is conducted to complement the results obtained by the parametric analysis. For this purpose, bootstrapping is used to test the null hypothesis

H0 : E[T | Sc] = E[T | Sc, D]     (5)

versus E[T | Sc] ≠ E[T | Sc, D], where D is the distance between the student's and the RT's average comment complexity as defined in Model 4. The bootstrapping algorithm introduced in Section 4.1.1 is used, with a loop over small and large D (below and above the median distance) instead of RT complexity, and α set to the values obtained in Section 4.2.1 (see [1] for details).

Results: A negative, statistically significant effect of D on the total comments posted by the student is observed, both for low- and high-level students. Students who received feedback less adapted to their level posted 15% and 34% fewer comments than those who received more adapted feedback, for low-level (6.6 versus 7.6 total comments, p = 0.007) and high-level (9.2 versus 12.3 total comments, p < 0.001) students respectively. To compare these results to those obtained by the parametric approach we compute the equivalent per-unit decrease (computed as the total decrease divided by the difference between the average D in the two compared levels), which equals 9% for both low and high student levels.

4.3 Engaging students in conversation
The program aims at motivating the students to interact with others in English. Therefore, we are interested not only in their total posts but also in the probability of engaging them in a conversation. Following the same rationale as Section 4.1.3, we use a multilevel logistic regression model to explore this question. Let (Yj)j=1,...,N be Bernoulli variables with P(Yj = 1 | ηj) = exp(ηj)/(1 + exp(ηj)), where

ηj = β + β0·cj + β1·log(τj) + β2·pj + β3·sj + β4·yj + β5·qj + β6·lj + β7·ej + S + C + R,     (6)

and Yj = 1 if the student posts a second comment in conversation j and 0 otherwise. β is the baseline. cj, pj and sj are the complexity, polarity and specificity of the RT's response to the first comment posted by the student who initiated conversation j. qj, ej and lj are boolean variables taking value 1 if the RT's response asked the student a question, included an emoticon, or shared a link, respectively. τj is the time lapse between the moment the student started conversation j and the moment the RT replied. yj is the school year. β0, ..., β7 represent the fixed effects of the corresponding covariates. The nested random effect student-CT-RT is represented by the random variables S, C and R, assumed to follow zero-mean Gaussian distributions with standard deviations σS, σC and σR. N is the total number of conversation threads, and there may be several threads per student.
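As an illustration of how such coefficients translate into the odds ratios reported below, here is a simplified sketch: an ordinary logistic regression on synthetic conversation-level data, without the nested student/CT/RT random effects of Eq. (6) (the full multilevel model is fit with R's lme4). All column names and the generated data are hypothetical.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
conv = pd.DataFrame({
    "replied":     rng.binomial(1, 0.4, 500),           # Y_j
    "complexity":  rng.uniform(2, 9, 500),              # c_j
    "log_delay":   np.log(rng.uniform(0.1, 14, 500)),   # log(tau_j)
    "polarity":    rng.uniform(-1, 1, 500),             # p_j
    "specificity": rng.uniform(0, 1, 500),              # s_j
    "question":    rng.binomial(1, 0.5, 500),           # q_j
})

fit = smf.logit("replied ~ complexity + log_delay + polarity + specificity + question",
                data=conv).fit(disp=0)
print(np.exp(fit.params))   # exponentiated coefficients are odds ratios per unit change
```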

Results: Table 2 shows the estimated covariate effects. By far, and maybe not surprisingly, the fact that the RT asks the student a question has the largest statistically significant positive effect on the probability of getting the student to continue the conversation. When a question is asked, assuming the rest of the covariates remain fixed, the odds ratio for students of the same classroom is exp(2.94) = 19.0 and exp(2.86) = 17.5, for low- and high-level students respectively.

As more elaborate comments often include questions, the positive effect of complexity suggests that more elaborate comments increase the probability of engaging a student in a conversation. On the contrary, the negative, statistically significant effect of polarity is likely due to the fact that very positive comments such as "Great work!" tend to be quite basic in terms of complexity. For high-level students a larger delay is associated with more responses: after a one-week delay the odds ratio is 1.2 (the rest of the covariates and random effects remaining constant). This is likely because, for high-level students, RT response delay is positively correlated with complexity (Spearman 0.1) and negatively correlated with comment polarity (Spearman -0.12): writing more elaborate comments takes longer. This is not observed for low-level students (Spearman 0.02 for both complexity and polarity), for whom it seems important to reply as soon as possible, as after a one-week delay the odds ratio is 0.7. Finally, for both high- and low-level students, the results suggest that using less specific words is associated with a higher probability of engaging the student in the conversation.

5. SUMMARY AND CONCLUSIONS
We conducted an observational analysis of 27,627 comments, exchanged between 1074 high school students and 35 RTs over 2 years, to study the effect that different characteristics of the RTs' feedback have on the students' posting behavior in an online EFL learning environment. The research questions, as well as the features defined for the characterization of the comments, were discussed and validated with the stakeholders and the leaders of the program in order to take advantage of their wide experience on the topic. Both parametric (multilevel modeling) and non-parametric (bootstrapping) analyses were performed, controlling for the effect of possible confounds such as (1) the classroom teacher, (2) the remote teacher and (3) the students' English proficiency level. Our results suggest that: (1) Teachers should observe the complexity of their students' comments and adapt the complexity of their feedback accordingly.

Page 14

Research questions

1. How complex should the teachers' feedback be so as to encourage the student to write more in the forum?

2. Should teachers' feedback complexity be congruent with that of their students?

Page 15


TDL Dataset

• 1074 6th grade (~12 y.o.) EFL learners from 83 classrooms and 49 public schools in Uruguay, collected during 2017-18.

• English proficiency is generally very basic.

• 35 remote teachers

• 27,627 comments posted to the TDL discussion forums.

Page 16

Parametric vs. Non-parametric statistical approaches

• Parametric (e.g., linear regression, mixed-effects models):

• Closed-form solutions of p-values for hypothesis testing.

• Statistically efficient when assumptions are met.

• Strong assumptions on the data, e.g., normal distribution of residuals.

• Non-parametric (e.g., bootstrapping):

• Flexible, with very few assumptions.

• Requires iterative (sampling) procedure to compute p-values.

Page 17

How complex should the teachers’ feedback be?

• We separately analyzed students whose forum posts had high language complexity and low language complexity.

• Motivation: reduce confound due to interaction between students’ and teachers’ language complexity.

• For each group, assess the null hypothesis H0 that the teacher’s language complexity had no effect, i.e.:

• H0: E[#comments | LCs, LCt] = E[#comments | LCs]

Page 18

Bootstrap

• To compare the means of two distributions, we followed the bootstrap procedure by Efron & Tibshirani (1994):

• For each value of LCs, we have two subsets of data, with means E[#comments | LCs, LCt=high] and E[#comments | LCs, LCt=low].

• We "pretend" that these subsets (with n1 and n2 examples each) have exactly the same mean, by shifting both so that their means equal the mean of the combined sample.

• I.e., maybe the observed difference between these subsets just happened "by chance".

• We then sample n1 data from the first subset and n2 data from the second subset, with replacement.

• We check whether the difference in means between the new samples is as large as the observed difference in means between the original two subsets.

• Over B bootstrap iterations, we obtain a p-value: how likely a difference in means this large is under H0.


Page 23

Results (non-parametric)

• High-complexity students:

• High-complexity feedback was associated with 22% fewer comments compared to low-complexity feedback (9.3 versus 12 total comments, p = 0.006).

• Low-complexity students:

• No stat. sig. differences (p=0.14).

Page 24

How complex should the teachers’ feedback be?

• Potential confound: Maybe the language complexity used by the teacher is related to their pedagogical skill?

• In a second analysis, we refined our null hypothesis to condition on each individual teacher:

• H0: E[#comments | T, LCs, LCt] = E[#comments | T, LCs]

Page 25

Results (non-parametric)

• High-complexity students:

• As before, high-complexity feedback was associated with fewer comments compared to low-complexity feedback (8.9 versus 12.5 total comments, p = 0.01).

• Low-complexity students:

• No stat. sig. differences.

Page 26

How complex should the teachers’ feedback be?

• Using a log-linear model, we can also estimate the dose response, i.e., how much language complexity is associated with how much change in #comments?

4.1.2 Controlling for the RT
Another possible confound is the effect of the RT, as more motivated RTs may give better feedback that leads to more posts by their students. To take this into account, null Hypothesis (1) is reformulated as

H0 : E[T | Sc, RT] = E[T | Sc, Rc, RT]     (2)

versus E[T | Sc, RT] ≠ E[T | Sc, Rc, RT], where RT is the student's remote teacher. If Hypothesis (2) is rejected, the complexity level of the feedback a given RT gives to his students has an effect on the total comments posted by them. A variation of the bootstrapping algorithm is used to test Hypothesis (2) in which, instead of defining the [low, high] RT feedback complexity levels globally from all samples, independent thresholds are defined for each RT.
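A small pandas sketch of the per-RT binning described above; the dataframe layout and column names (rt_id, rt_avg_complexity) are hypothetical.

```python
import pandas as pd

def label_feedback_complexity_per_rt(df: pd.DataFrame) -> pd.Series:
    """Label the feedback each student received as 'low' or 'high' complexity,
    using a separate median threshold per RT instead of a single global one."""
    per_rt_median = df.groupby("rt_id")["rt_avg_complexity"].transform("median")
    return (df["rt_avg_complexity"] > per_rt_median).map({True: "high", False: "low"})
```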

Results: The result previously obtained remains valid even when conditioning on the RT (i.e., resampling within each RT), meaning that different feedback levels given by an RT to his students are associated with different total posts (8.9 versus 12.5 total comments, p = 0.01). No statistically significant differences are observed for low-level students.

Even if controlling for the RT makes the result more solid from a statistical point of view, interpreting the results becomes more difficult, as the meaning of low- and high-complexity feedback changes from RT to RT. Because a fundamental goal is to translate these results into useful information for the teachers, this approach has the disadvantage that no absolute complexity reference can be given to teachers for what "low" and "high" mean, or for how to position their own feedback with respect to it.

4.1.3 Parametric Approach

A parametric approach is conducted to complement the results obtained by the non-parametric analysis, allowing for the inclusion of additional predictors (possible confounds) and avoiding binning (in RT complexity). Binning is kept to determine student level (low/high), as separate models are computed for each case. In order to take into account the nested structure of the data, a multilevel modeling approach is employed where the CT and RT effects on the student's activity are modeled as nested random effects. Therefore, we model student i's total posts as a negative binomial random variable, to account for the fact that it is count data with overdispersion, with expected value μi given by:

log(μi) = β + β0·ci + β1·yi + β2·pi + β3·si + β4·τi + C + R,   (3)

(Capital letters denote random variables and lower-case letters denote fixed values.) β is the baseline total comments. ci, pi, si and τi are the average complexity, polarity, word specificity and response delay of the feedback comments student i received from his RT, respectively. yi is the school year. The fixed effects β0, ..., β4 represent the effects of the corresponding covariates on the total comments. The nested random effect CT-RT is represented by the random variables C and R, assumed to follow zero-mean Gaussian distributions with standard deviations σC and σR. All the parametric models were fit using the R lme4 package [4].
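For orientation only, the fixed-effects part of such a log-linear negative binomial model could be sketched with Python's statsmodels as below; this is not the authors' implementation (they fit the full nested random-effects model with R's lme4), it omits the CT/RT random effects, and all column names are hypothetical.

```python
import statsmodels.api as sm
import statsmodels.formula.api as smf

# df: one row per student, with their total forum comments and the average
# complexity, polarity, word specificity and response delay of the feedback
# they received, plus the school year. Column names are illustrative.
model = smf.glm(
    "total_comments ~ complexity + school_year + polarity + specificity + delay",
    data=df,
    family=sm.families.NegativeBinomial(),  # log link by default
).fit()
print(model.summary())
```

Because the link is logarithmic, exp(coefficient) gives the multiplicative change in expected total comments per unit change in the corresponding covariate.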

Model    | stud. level | β        | β̂0       | β̂1     | β̂2    | β̂3    | β̂4
Model 3  | low         | 1.12 **  | 0.03     | 0.30 . | -0.03 | 0.76  | 0.01
Model 3  | high        | 2.50 *** | -0.11 *  | 0.26   | -0.02 | -0.02 | 0.04
Model 4  | low         | 1.63 *** | -0.14 ** | 0.25   | -0.32 | 0.67  | -0.01
Model 4  | high        | 2.24 *** | -0.16 ** | 0.22   | -0.06 | 0.12  | -0.002

Table 1: Effects of RT feedback on students' total comments for Models 3 and 4, in log scale. Signif. codes: *** p < 0.001, ** p < 0.01, * p < 0.05, . p < 0.1.

Results: Table 1 (Model 3) shows the computed effects for all the covariates. The parametric analysis, which includes other possible confounds, confirms the same tendency observed with the non-parametric approach: a negative, statistically significant effect of the RT feedback complexity level is observed for high-level students (exp(-0.11) = 0.9, i.e., 10% fewer total posts per unit increase in RT average complexity) and no effect is observed for low-level students. To compare this result to the one obtained by the non-parametric approach, we compute the equivalent per-unit decrease in the non-parametric case (computed as the total decrease divided by the difference between the average RT complexity at the two compared levels, 3.8 and 5.9), which also equals 10%.

4.2 Should RTs' feedback complexity be close to that of their students?

Rather than the absolute complexity, we can also consider the relative complexity of the RTs' feedback compared to the complexity of students' comments. Put another way: should the feedback complexity somehow be adapted to the student? To answer this question we propose to model the total comments posted by the student as a function of the distance between the average complexity of the student's comments and the average complexity of the feedback the student received from his RT.

4.2.1 Parametric approach

We model each student i's total posts as a negative binomial random variable with expected value μi given by:

log(μi) = β + β0·|ci - csi - α| + β1·yi + β2·pi + β3·si + β4·τi + C + R,   (4)

The fixed effect β0 represents the effect of the absolute value of the difference between the student's (csi) and the RT's (ci) average comment complexity, where α is introduced as an offset to account for the fact that the feedback may need to be close to that of the student but not necessarily equal. See Model 3 for the definition of the rest of the variables.

Setting α: Model 4 is fitted for different values of α and the value with the largest log-likelihood is selected. For low-level students the maximum log-likelihood is obtained at α = 0.25, whereas for high-level students it is at α = -0.79. Hence, this analysis suggests that even though feedback complexity should be close to the student's level for both low- and high-level students, low-level students benefit from feedback slightly more complex than theirs, whereas high-level students benefit from feedback slightly below theirs. Recall that low-level students post very basic comments, whereas those of high-level students tend to be more elaborate but still remain simple.

[Plot on slide: relationship between log(#comments) and the language complexity of the teacher's feedback for student i.]

Page 27:

Results (parametric)

• Consistent with non-parametric results, the log-linear model suggests that lower-complexity teacher feedback is associated with higher #comments by students.

• Dose response: 10% fewer total posts per unit increase in teachers' language complexity (worked example below).
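• Worked example: with β̂0 = -0.11 (the estimate for high-complexity students), exp(β̂0) = exp(-0.11) ≈ 0.90, i.e., about 10% fewer total posts per one-unit increase in the teacher's average feedback complexity.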

Page 28:

Should teachers’ language complexity be similar to students’ language complexity?

• Let D be the absolute difference between the average language complexity of the student's posts and that of the teacher's feedback (computed as sketched below).

• Using both the non-parametric and log-linear models, we tested the null hypothesis:

• H0: E[#comments | LCs, D] = E[#comments | LCs]
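A minimal sketch (not the authors' code) of computing D per student, assuming hypothetical pandas DataFrames student_posts and rt_feedback, each with one row per comment and columns student_id and complexity:

```python
import pandas as pd

# Hypothetical inputs, one row per forum comment with a precomputed
# language-complexity score:
#   student_posts: comments written by the student  [student_id, complexity]
#   rt_feedback:   feedback the student received    [student_id, complexity]

student_avg = student_posts.groupby("student_id")["complexity"].mean()
feedback_avg = rt_feedback.groupby("student_id")["complexity"].mean()

# D = |average feedback complexity - average complexity of the student's own posts|
D = (feedback_avg - student_avg).abs().rename("D")
```

In the parametric model the same quantity appears with an offset α subtracted inside the absolute value.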

Page 29:

Results

• Students write more comments (15% to 34% more) on the TDL discussion forum when D is small (i.e., their teachers' language complexity is similar to their own):

• Results are stat. sig. (under both models) for both high-complexity and low-complexity learners.

stud. level | β        | β̂0       | β̂1   | β̂2   | β̂3   | β̂4
low         | 1.63 *** | -0.14 ** | 0.25  | -0.32 | 0.67  | -0.01
high        | 2.24 *** | -0.16 ** | 0.22  | -0.06 | 0.12  | -0.002

Table 2: Effects of RT feedback characteristics on the student's total comments, for low- and high-level students, in logarithmic scale. Significance codes: *** p < 0.001, ** p < 0.01, * p < 0.05, . p < 0.1.

[Scatter plot omitted. Axes: x = student - RT complexity distance (0 to 6); y = student total comments (0 to 50); original values in blue, values fitted by Model 5 in red.]

Figure 7: Total comments posted by the student as a function of the student-RT complexity distance. Original (blue) and values fitted by Model 5 (red) for high-level students are shown. Similar results are obtained for low-level students.

We model each student i's total posts as a negative binomial random variable with expected value μi given by:

log(μi) = β + β0·|ci - csi - α| + β1·yi + β2·pi + β3·si + β4·τi + C + R,   (5)

The fixed effect β0 represents the effect of the absolute value of the difference between the student's (csi) and the RT's (ci) average comment complexity, where α is introduced as an offset to account for the fact that the feedback may need to be close to that of the student but not necessarily equal. See Model 4 for the definition of the rest of the variables.

Setting α. Model 5 is fitted for different values of α and the value with the largest log-likelihood is selected. For low-level students the maximum log-likelihood is obtained at α = 0.25, whereas for high-level students it is at α = -0.79. Hence, this analysis suggests that even though feedback complexity should be close to the student's level for both low- and high-level students, low-level students benefit from feedback slightly more complex than theirs, whereas high-level students benefit from feedback slightly below theirs. Recall that low-level students post very basic comments, whereas those of high-level students tend to be more elaborate but still remain simple.
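As an illustration of this profiling step (not the authors' code, which used R's lme4; here the nested random effects are omitted and all column names are hypothetical), one could sweep a grid of α values, refit the model for each, and keep the value with the highest log-likelihood:

```python
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

def profile_alpha(df, grid=np.linspace(-2.0, 2.0, 81)):
    """Return the offset alpha that maximizes the fitted model's log-likelihood."""
    best_alpha, best_llf = None, -np.inf
    for alpha in grid:
        # Distance between the feedback's and the student's average complexity,
        # shifted by the candidate offset alpha.
        trial = df.assign(
            dist=(df["feedback_complexity"] - df["student_complexity"] - alpha).abs()
        )
        fit = smf.glm(
            "total_comments ~ dist + school_year + polarity + specificity + delay",
            data=trial,
            family=sm.families.NegativeBinomial(),
        ).fit()
        if fit.llf > best_llf:
            best_alpha, best_llf = alpha, fit.llf
    return best_alpha
```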

Results. Figure 7 shows the total comments posted by the student as a function of the student-RT complexity distance. Table 2 shows the computed fixed effects for the different covariates included in the model. The distance between the average complexity of the student's comments and the average complexity of the feedback the student received from his RT has a negative, statistically significant effect on the total comments posted by the student. There is a 13% (p = 0.008) and a 15% (p = 0.003) decrease in total comments per one-unit increase in distance, for low- and high-level students respectively.

We present below a series of examples to help the reader gain insight into what small and large student-RT complexity distances mean in practice.

Large distance:

Student A: I would like to defile.

RT: Nice try Marcela, but I do not understand what you mean. There are two models in the above photo, can you tell me which model is from Brazil and which model is from the USA? or - can you tell me, who is your favourite model? My favourite model is Kate Moss.

Student B: I like to watch suspense movie.

RT: Gabriela! I don’t know any suspense movies. Do you have a favourite?? The only move I can think of that is suspense is this one, Splice, its very weird and kind of creepy. But I really liked it!! Tell me what you think :) [link]

The RTs’ responses are very complex compared to what thestudents wrote.

Small distance:

Student A: My favorite sport is football.

RT: Very good! what is your favorite football team?

Student B: I believe that the aliens live in Jupiter.

RT: Why Jupiter and not Mars?

Students' and RTs' comments have comparable complexity levels and, unlike the large-distance examples, the interactions seem closer to what a regular online chat could be for students with basic English proficiency.

4.2.2 Non-parametric approach

A non-parametric approach is conducted to complement the results obtained by the parametric analysis. For this purpose, bootstrapping is used to test the following null hypothesis:

H0: E[T | Sc] = E[T | Sc, D],   (6)

versus E[T | Sc] ≠ E[T | Sc, D], where D is the distance between the student's and the RT's average comment complexity as defined in Model 5. Algorithm 1 is used, where the for loop on the RT complexity level is replaced by a similar loop on small and large D (below and above the median distance), and α is set to the values obtained in Section 4.2.1.

Results. A negative, statistically significant effect of D on the total comments posted by the student is observed, both for low- and high-level students. Students who received feedback less adapted to their level posted 15% and 34% fewer comments than those who received more adapted feedback, for low-level (6.6 versus 7.6 total comments, p = 0.007) and high-level (9.2 versus 12.3 total comments, p < 0.001) students respectively. To compare these results to those obtained by the parametric approach, we compute the equivalent per-unit decrease (approximated as the total decrease divided by the difference between the average D in the two compared groups), which equals 9% for both low and high student levels.

4.3 Engaging students in conversation

The program aims at motivating the students to interact with others in English. Getting them to engage in a conversation with their RTs is considered a success. Therefore, we ...

Page 30:

Follow-up discussions with teachers

• We met with a panel of the remote teachers and asked them about the potential causes of these results.

• One possibility: students may be more motivated when their teacher’s language is accessible (i.e., not too complex): they understand it, they learn from and are challenged by it, but do not become frustrated.

Page 31:

Summary

• Using both parametric (log-linear mixed-effects) and non-parametric (bootstrap) models, we conducted an observational analysis of a dataset of secondary-school EFL learners in Uruguay during 2017-18.

• Results suggest that, to encourage more forum participation, teachers' feedback to students should be (1) not too complex and (2) adapted to the language complexity of their students’ forum posts.

• Some teachers oppose RCTs due to issues of fairness and practicality. Observational analyses can still uncover some simple practical guidelines for adaptation that may be helpful.

Page 32:

End