
Submission Date: January 2020

AID-OAA-TO-16-00017

TOCOR: Mitch Kirby

Prepared by: RTI International

3040 East Cornwallis Road

Research Triangle Park, NC 27709-0155

Tel: (919) 541-6000

This document was produced for review by the United States Agency for International Development.

All Children Reading – Asia

Technical Assistance – Uzbekistan

Pilot Study Report


Table of Contents

List of Figures
List of Tables
List of Acronyms and Abbreviations
1 Introduction
2 Activities Completed Under Pilot Study
  2.1 Pilot Training—November 5–8, 2019
  2.2 Data Collection—November 11–22, 2019
  2.3 Assessment Security Protocols Followed
3 Results from Pilot Study
  3.1 Basic Data Checks
  3.2 Early Grade Reading Assessments
  3.3 Early Grade Mathematics Assessments
    3.3.1 Grade 2 EGMA
    3.3.2 Written Mathematics Assessment Grade 4
    3.3.3 Overall Recommendations for EGMA Grade 2 and Written Math Grade 4
4 Recommendations
  4.1 Revisions to Tangerine
  4.2 Revisions to Assessments
  4.3 Improving Efficiency During National Data Collection
Annex A: Pilot Training Agenda
Annex B: Participants of Pilot Training and Data Collection


List of Figures

Figure 1. Cumulative Distribution of Grade 2 ORF Scores for Passage A, by Language
Figure 2. Cumulative Distribution of Grade 2 ORF Scores for Passage B, by Language
Figure 3. Cumulative Distribution of Grade 2 ORF Scores for Passage C, by Language
Figure 4. Cumulative Distribution of Grade 4 Comprehension Scores for Passage E, by Language
Figure 5. Cumulative Distribution of Grade 4 Comprehension Scores for Passage F, by Language
Figure 6. Cumulative Distribution of Grade 4 Comprehension Scores for Passage G, by Language
Figure 7. Item Analysis for Quantity Discrimination Subtask
Figure 8. Item Analysis for Addition and Subtraction Level 2
Figure 9. Item Characteristic Curves for the Numbers Subdomain
Figure 10. ICC Graph for Geometry Items
Figure 11. ICC Graph for Measurement Items

List of Tables

Table 1. AAM Results by Grade and Assessment
Table 2. Number of Schools Visited by Each Assessment Team
Table 3. Overview of EGRA Administration by Grade
Table 4. EGRA Duration (in Minutes) by Grade and Language
Table 5. Grade 2 Cronbach’s Alpha Estimates for Oral Reading Fluency (ORF) by Passage and Language
Table 6. Grade 4 Cronbach’s Alpha Estimates for ORF by Passage and Language
Table 7. Grade 4 Cronbach’s Alpha for English by Subtask
Table 8. Math Administration by Grade


List of Acronyms and Abbreviations

AAM assessor accuracy measure

ACR All Children Reading

CICT Center for the Development of ICT

EGMA Early Grade Math Assessment

EGRA Early Grade Reading Assessment

ICC Item Characteristic Curves

IRT Item Response Theory

MPE Ministry of Public Education

ORF oral reading fluency

TIMSS Trends in International Mathematics and Science Study

USAID United States Agency for International Development


1 INTRODUCTION

The Republic of Uzbekistan’s new administration, including the Ministry of Public Education (MPE), is embarking on a reform of the entire education sector and has requested assistance from the United States Agency for International Development (USAID) to shape and guide efforts to reform the basic education system. The critical areas identified for immediate intervention are assessment and data management (the Early Grade Reading Assessment [EGRA], the Early Grade Math Assessment [EGMA], and the education management information system), curriculum reform, teacher preparation and support, head teacher and school management training, and strategic and organizational planning. All Children Reading (ACR)-Asia is supporting the USAID/Uzbekistan office in the design and implementation of a package of technical assistance that includes the placement of a long-term policy advisor in the MPE and the design, implementation, and related capacity building for a national-scale EGRA/EGMA survey during the 2019–2020 school year.

As a precursor to the national spring survey, the ACR-Asia team conducted a pilot from November 5 to 22, 2019. The pilot training team comprised Maria Dzula, Rachel Jordan, and Maitri Punjabi of RTI International. The objectives of the pilot study were as follows:

▪ Train pilot assessors on EGRA/EGMA protocols and conduct a test run of the pilot data collection process, building the capacity of MPE Methodists in EGRA/EGMA test administration

▪ Document revisions needed to Tangerine™ for future test administration, as well as errors in translations or instrument language

▪ Train pilot assessors on the importance of test security and test security protocols

▪ Test assessors on their understanding and preparedness for data collection by administering two assessor accuracy measures (AAMs)

▪ Test the reliability and validity of the reading and mathematics instruments for each of the seven languages.

2 ACTIVITIES COMPLETED UNDER PILOT STUDY

2.1 Pilot Training—November 5–8, 2019

The pilot training was held over four days (November 5 to 8) at the Center for the Development of ICT (CICT) in Tashkent, Uzbekistan (see Annex A for the agenda). The workshop was facilitated by three RTI International technical experts: Ms. Jordan (EGRA), Ms. Punjabi (Statistician/Tangerine), and Ms. Dzula (EGMA/Written Math). It was attended by 39 participants, including 6 CICT supervisors, 15 MPE Methodists, and 18 university students. The training was originally scheduled for five days, but complications delayed the start by one day: CICT staff had invited more university students than indicated in an earlier agreement, as the students were already in Tashkent and able to start on time; after discussion with RTI and USAID, Methodists were asked to attend as well, which pushed the start date back by one day. Participants came from different regions of Uzbekistan and were divided into language teams of six individuals: one MPE-designated supervisor and five assessors. Each language team maintained some balance of MPE Methodists and university students.

Three language teams did not have a full complement of six members receiving all four training days:


▪ Kazakh: One supervisor and only three assessors. One assessor could not be designated by CICT; one joined the training on the third day; and one quit the training after two days and was replaced on the third.

▪ Turkmen: One supervisor and only three assessors. One assessor quit the training after three days and could not be replaced before the end of the training; one assessor could not be designated by CICT.

▪ Tajik: One assessor was replaced halfway through the training.

The training was facilitated by CICT, which coordinated the venue and catering and provided highly skilled assessors and support staff. RTI provided the EGRA/EGMA trainers and training materials and paid the data collection teams’ per diem, transport, and accommodation costs. RTI is grateful to CICT for its tremendous support during the training and pilot data collection period. Overall, the training was a success, and all goals were achieved.

Assessor Accuracy Measure

AAMs were conducted on the last two days of the training. A total of 25 assessors completed them, as a few assessors had departed by the time the AAMs were administered and, as noted above, some teams were missing an assessor. The AAMs were administered by two assessors selected for the task based on their performance during the training week. These two skilled assessors were trained on the administration of the AAM, going through the pre-planned assessment in detail with the training team, and followed a pre-practiced protocol specifying exactly how to complete the AAM (which words to read incorrectly, which questions to answer incorrectly, etc.). RTI employs this gold standard approach, in which one selected assessor pretends to be a student completing the assessment at the front of the room while the other marks the score in the tablet, and all participants score along. Participants’ responses were then compared and scored against the gold standard. Table 1 below summarizes the results. The Grade 2 EGMA AAM was conducted with only 12 assessors, since assessors were initially split into two groups (Grade 2 and Grade 4). No AAM was conducted for the Grade 4 written mathematics assessment. On the last day of the training, due to logistical concerns, assessors were assigned to both Grade 2 and Grade 4 to ensure the timely completion of data collection at each school, and all assessors completed the Grade 2 and Grade 4 EGRA AAMs on that day.

Table 1. AAM Results by Grade and Assessment

Assessment | Subtask | Average Percent Agreement (Standard Deviation) | [Min, Max] Agreement | Percentage of Assessors Who Achieved 90% Agreement with the Gold Standard
Grade 2 EGRA (n=23) | Uzbek L2 (Uzbek as a second language) Invented Words | 90% (0.09) | [54%, 100%] | 70%
Grade 2 EGRA (n=23) | Uzbek L2 Oral Reading Passage | 98% (0.04) | [83%, 100%] | 96%
Grade 2 EGRA (n=23) | Uzbek L2 Reading Comprehension | 98% (0.06) | [80%, 100%] | 91%
Grade 2 EGRA (n=23) | Overall | 95% (0.05) | [80%, 99%] | 91%
Grade 4 EGRA (n=24) | Uzbek L2 Oral Reading Passage | 95% (0.09) | [54%, 100%] | 92%
Grade 4 EGRA (n=24) | Uzbek L2 Reading Comprehension* | 75% (0.22) | [20%, 100%] | 46%
Grade 4 EGRA (n=24) | Overall* | 85% (0.14) | [37%, 100%] | 46%
Grade 2 EGMA (n=12) | Quantity Comparison | 98% (0.06) | [80%, 100%] | 92%
Grade 2 EGMA (n=12) | Missing Number | 100% (0) | [100%, 100%] | 100%
Grade 2 EGMA (n=12) | Word Problems | 99% (0.05) | [83%, 100%] | 92%
Grade 2 EGMA (n=12) | Addition Level 2 | 92% (0.11) | [67%, 100%] | 58%
Grade 2 EGMA (n=12) | Subtraction Level 2 | 89% (0.16) | [50%, 100%] | 58%
Grade 2 EGMA (n=12) | Relational Reasoning | 91% (0.20) | [30%, 100%] | 83%
Grade 2 EGMA (n=12) | Spatial Thinking 2D | 98% (0.06) | [80%, 100%] | 92%
Grade 2 EGMA (n=12) | Spatial Thinking 3D | 100% (0) | [100%, 100%] | 100%
Grade 2 EGMA (n=12) | Overall | 96% (0.04) | [90%, 100%] | 100%

*Note: The low scores for the Grade 4 EGRA reading comprehension and overall results are due, for the most part, to a lack of agreement in choosing between two response options: “correct” and “correct with lookback.”

Ideally, each assessor should receive an AAM score of 90 percent or higher. Due to the limited number of assessors in the training, it was ultimately decided that all assessors should participate in the data collection activity, regardless of AAM score. However, the AAM results were used to inform the trainers of focus areas for refreshers during the training, and additional spot checks were used to gauge the reliability of lower-scoring assessors during the data collection. One major finding was that assessors struggled to agree on which response option to select for reading comprehension questions offering both “correct” and “correct with lookback” (hence the relatively low AAM scores for the Grade 4 EGRA reading comprehension measure). Though both options indicate that the child answered correctly, there was a lack of agreement on which to choose. Since the lookback score was not a required part of the pilot analysis, the impact of this issue was minimal.
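As a concrete illustration of how AAM agreement can be scored, the minimal Python sketch below compares a participant’s item-by-item marks against the gold standard. The function name and the example marks are illustrative only and are not drawn from the study’s actual scoring system.

    # Minimal sketch of AAM percent-agreement scoring (illustrative only).
    # Each list holds one mark per item, aligned with the gold standard.
    def percent_agreement(gold, participant):
        """Percentage of items where the participant's mark matches the gold standard."""
        matches = sum(g == p for g, p in zip(gold, participant))
        return 100 * matches / len(gold)

    # Hypothetical marks for a four-item subtask; the last item shows the
    # "correct" vs. "correct with lookback" disagreement described above.
    gold = ["correct", "incorrect", "correct", "correct with lookback"]
    trainee = ["correct", "incorrect", "correct", "correct"]
    print(f"{percent_agreement(gold, trainee):.0f}%")  # prints 75%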

Pilot Training Challenges

The inclusion of university students in place of MPE Methodists just before the start of the training caused some complications in the pilot. The switch from Methodist to student assessors was not communicated in advance, despite requests for attendance information to estimate costs accurately. While a compromise of teams composed of both Methodists and students was successfully negotiated, the training started a day late (to ensure that the correct number of MPE assessors were attending). The switch to university students also necessitated significantly higher per diem and stipend payments, which came out of the activity budget. Lastly, the change was contrary to the activity scope of work, which stipulated building the assessment capacity of MPE staff as a key outcome, along with MPE’s cost contribution to the activity.

As mentioned above, a number of teams were in flux, with members quitting the activity or joining only for the last two days of training (missing two full instructional days). Data for these assessors were flagged and analyzed separately in the pilot analysis to review potential compromises in data quality.

2.2 Data Collection—November 11–22, 2019

Seven teams (one per language) deployed for data collection beginning on November 10, 2019. School visits commenced on November 11 and concluded on November 22. During each day of data collection, each team visited a single school to assess 20 Grade 2 and 20 Grade 4 pupils in reading and math. An exception was made for the Kazakh language group, since the team had fewer assessors than the other language groups. The Kazakh team was instructed to assess 12 Grade 2 and 12 Grade 4 pupils using only the EGRA; no EGMA was conducted by the Kazakh group. The Turkmen team also lacked one assessor and was therefore instructed not to administer EGMA to the Grade 2 pupils, while its overall sample counts remained the same (20 Grade 2 and 20 Grade 4 pupils).

At the end of data collection, the supervisors returned to Tashkent on November 23 to return all tablets and materials to the CICT building, where all materials will be stored. RTI’s consultant, Kamola Djumaniyazoa, was present at CICT on November 23 and 24 to assist with the counting and securing of all testing materials. All returned materials were counted and recorded on a material tracker document.

Pilot School Selection

A total of 70 schools were visited for the pilot data collection over the course of 10 school days. Each language team assessed in either its assigned language or in Uzbek depending on the language of instruction of the school (except for the Kazakh team, which only visited Kazakh language schools). Schools were selected beforehand by RTI International and CICT based on convenience of the school’s location, number of shifts (morning and evening), and number of enrolled pupils. While teams were sent to the regions that contain the largest cluster of schools that instruct in the selected languages, the pilot sample was not chosen to be representative of any population. Table 2 below summarizes the number of schools visited and number of pupils assessed by each team.


Table 2. Number of Schools Visited by Each Assessment Team

Language Team | Province Visited | Uzbek Schools Visited | Uzbek Grade 2 Pupils Assessed | Uzbek Grade 4 Pupils Assessed | Local Language Schools Visited | Local Language Grade 2 Pupils Assessed | Local Language Grade 4 Pupils Assessed
Uzbek | Samarkand | 10 | 202 | 205 | – | – | –
Russian | Tashkent City | 3 | 61 | 61 | 7 | 146 | 142
Tajik | Surkhandaryo | 3 | 60 | 61 | 7 | 141 | 144
Turkmen | Karakalpak | 3 | 60 | 61 | 7 | 129 | 134
Kyrgyz | Andijan | 3 | 66 | 75 | 7 | 103 | 101
Kazakh | Tashkent Province | 0 | 0 | 0 | 10 | 117 | 115
Karakalpak | Karakalpak | 3 | 61 | 61 | 7 | 140 | 142
Total | | 25 | 510 | 524 | 45 | 776 | 778

2.3 Assessment Security Protocols Followed

The following security protocols were implemented during the pilot training:

▪ All assessments were rendered in Tangerine by RTI staff, to ensure that exposure to the assessments remained limited.

▪ Careful attention was paid by RTI staff to the signing in of training participants to ensure that only authorized participants were in the training rooms (or had access to any of the materials).

▪ All phones were checked at the door by RTI staff to ensure that participants could not take pictures of the materials. Camera functionality was disabled on the tablets. Additionally, all handouts (paper materials) were collected at the end of each training day and returned the following morning.

▪ RTI independently printed all assessment materials outside of the training venue. At the end of the workshop, extraneous paper was collected and disposed of by CICT staff.

▪ As with the adaptation workshop, the Director of the CICT directly instructed all participants not to share any information about the assessment outside of the training or pilot. Participants did not sign a confidentiality agreement.

The following security protocols were implemented during the pilot data collection:

▪ In order to limit issues with teachers preparing students ahead of time (or replacing students prior to data collection), individual schools were not notified ahead of time regarding their selection for participation. In addition to this, team supervisors were not informed of the name or location of the selected school until the evening before the school visit. This information was communicated over a call between MPE and the supervisor.

▪ Supervisors were assigned from a pool of CICT staff (to avoid putting staff with direct oversight of schools in charge of ensuring quality data collection).

▪ Supervisors were in charge of all materials throughout the data collection activity (tablets, paper assessments, and student stimuli). Supervisors collected all team materials at the end of each evening and returned them to the assessors each morning. In other words, assessors did not have access to any materials outside of active data collection.


▪ Supervisors were directed to collect assessor phones before handing out tablets, paper assessments, and student stimulus sheets. When it was observed that one assessor had his phone during the school visit, a reminder was sent to the supervisors to collect phones while visiting schools.

▪ Supervisors set up the rooms for testing and oversaw the student selection process at each school site.

▪ Testing materials did not leave the testing room at the school, unless in the possession of the supervisor.

A CICT photo/videographer was in attendance at the training and school visits. CICT confirmed that the videographer was there on its instruction and was a member of CICT staff. However, how this footage will be used is outside of RTI’s control.

Throughout the workshop, the team remained in contact with USAID (Iligiza Sharipova; Andrew Colburn) and MPE (Mukhayyo Azamova) to update them on each of these occurrences, as well as on overall training progress. CICT had initially requested to videotape a practice visit to a nearby school; we were assured that this would not happen. It became very evident through these discussions that clearer coordination is needed among CICT, RTI, and USAID for the national data collection.

3 RESULTS FROM PILOT STUDY

3.1 Basic Data Checks

As data were uploaded through Tangerine each evening of data collection, daily checks were run on the data to review any issues, and feedback was provided to the assessment teams. Some of these checks included identifying extreme values and evaluating whether they were accurate or reflected an administration issue, reviewing average assessment time to ensure assessments were not being rushed or taking longer than expected, and tabulating the number of pupils assessed in each grade at each school to ensure sampling counts were being met.
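As an illustration of what such nightly checks might look like in practice, the Python/pandas sketch below flags implausible administration times, extreme ORF values, and schools short of the 20-pupil-per-grade target. The file and column names (uploads_today.csv, orf_cwpm, etc.) are assumptions for illustration; the actual Tangerine export format is not described in this report.

    import pandas as pd

    # Hypothetical export of one evening's Tangerine uploads.
    df = pd.read_csv("uploads_today.csv", parse_dates=["start_time", "end_time"])

    # Administration time: flag rushed or overlong assessments.
    df["minutes"] = (df["end_time"] - df["start_time"]).dt.total_seconds() / 60
    flagged_time = df[(df["minutes"] < 10) | (df["minutes"] > 45)]

    # Extreme values: ORF scores far above the rest of the day's data.
    flagged_orf = df[df["orf_cwpm"] > df["orf_cwpm"].quantile(0.99)]

    # Sampling counts: 20 pupils assessed per grade per school.
    counts = df.groupby(["school_id", "grade"]).size()
    short_schools = counts[counts < 20]

    print(flagged_time[["school_id", "assessor_id", "minutes"]])
    print(flagged_orf[["school_id", "pupil_id", "orf_cwpm"]])
    print(short_schools)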

Once data collection ended, additional checks were run on the data before running the analysis. These included comparing preliminary results by assessor type (student or MPE employee), comparing preliminary results by time of administration (morning vs. afternoon), and reviewing preliminary results for specific flagged assessors who joined the training several days late. These checks were all done to address any potential bias that might come from the assessor’s occupation, assessment fatigue (which may be seen in afternoon assessments), and lack of complete training.

Results of these analyses are as follows:

▪ Student vs. MPE Methodist assessors: No significant difference across math and reading for Grades 2 and 4, suggesting that both students and Methodists were competent assessors.

▪ Morning vs. afternoon: No difference across any of the assessments, suggesting that assessors may conduct assessments both in the morning and afternoon of the school day.

▪ Fully trained vs. not fully trained assessors: A slight difference, especially in Grade 2 EGMA. The few assessors who received less training scored students higher than the other assessors did. This reinforces the requirement that assessors be present for all days of training to ensure reliability in administration.


3.2 Early Grade Reading Assessments

The EGRA instruments used for this pilot activity were developed during the October 2019 adaptation workshop. The Grade 2 and Grade 4 assessments were designed to align with grade-level expectations and therefore included different subtasks. As shown in Table 3, the assessments for both grades began with subtasks in one of the seven languages of instruction (depending on the sampled school), denoted in the table as “first language.” After the first language subtasks, students completed Uzbek Language 2 subtasks. While these latter subtasks are designed to measure the Uzbek reading ability of students instructed in other languages, all students were administered these tasks during the pilot (including those instructed in Uzbek) to increase the sample size and provide an understanding of the relative difficulty of these tasks compared to the first language tasks. Finally, a subgroup of Grade 4 students was also administered English language subtasks: a reading passage with comprehension questions and an English vocabulary task.

In order to pilot multiple passages without overburdening students, a randomized approach was taken. Using Grade 2 as an example, all students were administered two first language reading passages with comprehension questions: all students received Passage A, while half received Passage B as their second passage and half received Passage C. This provided sufficient sample sizes for cross-passage analyses without further extending administration time.
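A minimal sketch of this random assignment logic is shown below; in the pilot, the rotation was handled within the administration workflow itself, so this Python fragment is purely illustrative.

    import random

    # Every student receives Passage A; the second passage is B or C,
    # each with probability 1/2 (so roughly half the sample per passage).
    def assign_grade2_passages(rng=random):
        return ["A", rng.choice(["B", "C"])]

    print(assign_grade2_passages())  # e.g., ['A', 'C']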

Table 3. Overview of EGRA Administration by Grade

Subtask | Grade 2 EGRA | Grade 4 EGRA
1 | First Language – Letters | First Language – Nonwords
2 | First Language – Nonwords | First Language – Oral Reading (Passage A)
3 | First Language – Oral Reading with Comprehension (Passage A) | First Language – Oral Reading (Passage B or Passage C; randomly assigned)
4 | First Language – Oral Reading with Comprehension (Passage B or Passage C; randomly assigned) | First Language – Silent Reading with Comprehension (Passage E)
5 | Uzbek L2 (Uzbek as a second language) Nonwords (only for non-Uzbek students; same subtask as Uzbek first language nonwords) | First Language – Silent Reading with Comprehension (Passage F or Passage G; randomly assigned)
6 | Uzbek L2 Oral Reading Fluency with Comprehension | Uzbek L2 (Uzbek as a second language) Oral Reading
7 | – | Uzbek L2 Silent Passage with Comprehension
8 | – | English Oral Reading with Comprehension
9 | – | English Vocabulary

Total administration time, by language, is presented in Table 4. Since the Grade 4 EGRA included both oral and silent reading passages as well as English subtasks, it was expected that the Grade 4 assessment would take longer to administer. On average, the Grade 2 EGRA lasted 16 minutes, while the Grade 4 EGRA lasted 30 minutes total (national languages plus English). As seen in the table, there was relatively little variation across languages (and much of this variation resulted from assessors as opposed to language-specific considerations). It should be noted that the assessment time for the national survey is expected to be significantly shorter, since students will not be receiving as many passages as they did during the pilot.


Table 4. EGRA Duration (in minutes) by Grade and Language

Language | Grade 2 EGRA | Grade 4 EGRA
Uzbek | 17 | 22
Russian | 16 | 20
Karakalpak | 19 | 23
Kazakh | 17 | 24
Kyrgyz | (missing)* | 27
Tajik | 14 | 23
Turkmen | 16 | 27
Average | 16 minutes | 23 minutes (+7 minutes for English)

*End time did not record properly in Tangerine for the Kyrgyz Grade 2 EGRA.

EGRA Reliability Estimates

Tables 5 and 6 provide estimates of internal consistency (Cronbach’s alpha) for Grade 2 and Grade 4 oral reading passages, by language. The reliability estimates for Grade 2 oral reading fluency (shown in Table 5) are all very high (i.e., Cronbach’s alpha ≥ 0.80) and provide evidence of strong internal consistency for all passages.

Table 5. Grade 2 Cronbach’s Alpha Estimates for Oral Reading Fluency (ORF) by Passage and Language

N ORF - Passage A ORF - Passage B ORF - Passage C ORF - L2

Uzbek 509 0.873 0.858 0.881 0.890

Russian 142 0.872 0.843 0.846 0.932

Karakalpak 140 0.871 0.858 0.882 0.888

Kazakh 117 0.872 0.827 0.867 0.933

Kyrgyz 93 0.859 0.858 0.886 0.904

Tajik 141 0.818 0.858 0.883 0.952

Turkmen 129 0.871 0.864 0.882 0.938

Average

0.911

The reliability estimates for Grade 4 ORF (shown in Table 6) are all high (i.e., above the traditional threshold of strong reliability, Cronbach’s alpha ≥ 0.70) and provide evidence of strong internal consistency for all passages.

Page 13: All Children Reading - Asia - Technical Assisstance ...

All Children Reading-Asia—Uzbekistan Pilot Study Report 9

Table 6. Grade 4 Cronbach’s Alpha Estimates for ORF by Passage and Language

Language | N | ORF – Passage A | ORF – Passage B | ORF – Passage C | ORF – L2
Uzbek | 507 | 0.840 | 0.853 | 0.818 | 0.844
Russian | 139 | 0.769 | 0.826 | 0.839 | 0.946
Karakalpak | 139 | 0.742 | 0.817 | 0.819 | 0.839
Kazakh | 115 | 0.844 | 0.751 | 0.831 | 0.812
Kyrgyz | 96 | 0.843 | 0.858 | 0.813 | 0.855
Tajik | 135 | 0.766 | 0.812 | 0.832 | 0.788
Turkmen | 127 | 0.795 | 0.760 | 0.912 | 0.865

Average: 0.892

In addition to the estimates for first language and Uzbek Language 2 subtasks, reliability estimates were also calculated for the three English language subtasks administered to Grade 4 students. As shown in Table 7, these subtasks all provided strong evidence of reliability (Cronbach’s alpha ≥ 0.80).

Table 7. Grade 4 Cronbach’s Alpha for English by Subtask

Language | N | Oral Reading Passage | Reading Comprehension | Vocabulary
English | 398 | 0.93 | 0.84 | 0.88

The Cronbach’s alpha estimates for Grade 2 and Grade 4 first language letter sounds, nonwords, and reading comprehension (not displayed here) are more moderate than those for reading fluency. This is because letters and nonwords are both time-constrained tasks (one minute each), while reading comprehension has limited variability (with only 5 items in Grade 2 and 10 items in Grade 4) and fewer data points (in Grade 2, students are only asked questions corresponding to how far they read in the oral passage). Overall, the item analyses, zero scores, and mean scores for these subtasks do not provide any concerning evidence regarding their reliability or appropriateness.
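For reference, Cronbach’s alpha for k items is k/(k-1) × (1 − (sum of item variances)/(variance of total scores)). The short NumPy sketch below implements this formula; the simulated binary item matrix is a stand-in for the actual pilot item-level data, which are not reproduced here.

    import numpy as np

    def cronbach_alpha(items):
        """items: 2-D array, rows = students, columns = item scores."""
        items = np.asarray(items, dtype=float)
        k = items.shape[1]
        item_vars = items.var(axis=0, ddof=1).sum()
        total_var = items.sum(axis=1).var(ddof=1)
        return k / (k - 1) * (1 - item_vars / total_var)

    # Placeholder data: 100 students x 20 correlated binary items.
    rng = np.random.default_rng(0)
    ability = rng.normal(size=(100, 1))
    scores = (rng.normal(size=(100, 20)) < ability).astype(int)
    print(round(cronbach_alpha(scores), 3))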

Grade 2 EGRA Findings

The letter and nonword subtasks performed as expected, although specific items are under additional review to determine whether they need to be revised for the full survey. (The number of items potentially requiring revision is very limited.)

While the reliability of the oral reading passages was found to be strong, it was essential to examine each of the three Grade 2 passages across languages to determine which passage would be most appropriate for the full survey. This information was examined alongside the performance of the accompanying reading comprehension subtasks, for the final selection.

In lieu of the many tables required to display the distributions of reading fluency scores, the following three figures (Figures 1–3) show cumulative distributions of ORF scores for each passage, by language. The y-axes of the graphs show correct words per minute (cwpm), while the x-axes show the percentiles of students. For example, looking at the Russian line (at the top) in Figure 1, it is clear that the bottom 10 percent of student scores (from 0 to 10 on the x-axis) are in the 10–20 cwpm range, while the top 10 percent of scores (from 90 to 100 on the x-axis) are in the 70–90 cwpm range. Ideally, we would want to see these lines follow similar patterns across all languages. We can see a clear discrepancy in this pattern in Figure 3: looking at the Karakalpak and Kyrgyz lines, between 5 percent and 10 percent of students scored 0 on Passage C for those two languages, while the remaining languages had almost no students with zero scores. Based on this information, we recommend that Passage C not be used for the full survey. Additionally, these figures list the average (mean) ORF score for each language in parentheses next to the language name in the legend. While it is not necessarily expected that all means will be equivalent (as fluency rates differ by language and orthography, even for students with comparable reading abilities), it is useful to understand the relative difficulty of each passage across languages.
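The cumulative distributions in Figures 1–3 can be generated from raw score vectors by sorting each language’s cwpm values and plotting them against student percentiles. The matplotlib sketch below illustrates the approach with simulated score arrays; the real per-language data are not reproduced here.

    import numpy as np
    import matplotlib.pyplot as plt

    def plot_cumulative(scores, label):
        """Plot sorted cwpm scores against student percentiles (0-100)."""
        scores = np.sort(np.asarray(scores))
        percentiles = np.linspace(0, 100, len(scores))
        plt.plot(percentiles, scores, label=f"{label} ({scores.mean():.0f})")

    # Simulated stand-ins for two languages' ORF score vectors.
    rng = np.random.default_rng(1)
    plot_cumulative(rng.normal(45, 20, 142).clip(min=0), "Language X (simulated)")
    plot_cumulative(rng.normal(38, 18, 140).clip(min=0), "Language Y (simulated)")
    plt.xlabel("Percentile of students")
    plt.ylabel("Oral reading fluency (cwpm)")
    plt.legend()
    plt.show()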

When analyzed alongside the reading comprehension scores for these passages, it was ultimately determined that either Passage A or Passage B would work equally well for the national survey. If needed, the unused passage could be used at a later timepoint (with the recognition that Passage B fluency and comprehension scores were slightly higher than Passage A scores for nearly all languages, providing evidence that the Passage B subtasks are slightly easier and that statistical equating may be required for comparisons over time).

Figure 1. Cumulative Distribution of Grade 2 ORF Scores for Passage A, by Language


Figure 2. Cumulative Distribution of Grade 2 ORF Scores for Passage B, by Language

Figure 3. Cumulative Distribution of Grade 2 ORF Scores for Passage C, by Language


One final aspect worth noting: after students were asked all of the relevant comprehension questions, they were allowed to refer back to the passage to attempt to re-answer any questions they had initially answered incorrectly. These were referred to as “lookbacks.” The purpose was to ensure that we were measuring true reading comprehension, as opposed to memorization or immediate recall. This lookback activity proved important and led to significant increases in reading comprehension scores for some languages (particularly for the Uzbek Language 2 subtask). It is therefore recommended that this approach be used in Grade 2 for the national survey.

Grade 4 EGRA Findings

The nonword subtask performed as expected, although specific items are under additional review to determine whether they need to be revised for the full survey. (The number of items potentially requiring revision is very limited.)

The oral reading passages for Grade 4 were found to be reliable and behaved as expected. However, for all but two languages, Passage B produced significantly lower ORF scores than Passage A or Passage C. Based on the consistency across languages, Passage A is the recommended passage for use in the national survey, while Passage C could serve as a potential passage for future use (requiring statistical equating only for Uzbek and Turkmen—the languages for which Passage C was found to be easier, on average).

With regard to the silent reading comprehension subtask, differences were found among the three passages (E, F, and G). Cumulative distributions for these tasks are displayed in the following three figures (Figures 4–6). Based on the mean scores (in parentheses next to the language names in the legend of each figure), it is clear that Passage F was the easiest, followed by Passage G; Passage E proved to be the most difficult of the silent reading options. Additionally, ceiling effects are a concern for Passage F (and possibly Passage G), as shown by the large proportion of students with scores at the top of the graph (along the 100 percent line). Based on this evidence, it is recommended that Passage E be used for the national survey, as it provides good variation in scores with room for measuring growth. Since students retain the silent reading passage during the comprehension portion of the task, the lookback option adds no value for Grade 4.


Figure 4. Cumulative Distribution of Grade 4 Comprehension Scores for Passage E, by Language

Figure 5. Cumulative Distribution of Grade 4 Comprehension Scores for Passage F, by Language


Figure 6. Cumulative Distribution of Grade 4 Comprehension Scores for Passage G, by Language

Lastly, with regard to the Grade 4 English subtasks, there is good variation in ORF scores, with a mean below the first language (and Uzbek Language 2) scores, as expected. The concerning evidence, however, comes from the comprehension measure, which had an exceedingly high proportion of zero scores. Based on the low vocabulary scores, it is assumed that the difficulties with comprehension genuinely reflect poor understanding of the English language (despite decent decoding skills), as opposed to any problems with the English subtasks themselves. Therefore, these subtasks are recommended for use in the national survey, provided sufficient English language assessors can be mobilized for the activity.

3.3 Early Grade Mathematics Assessments

The Grade 2 EGMA instrument and the Grade 4 Written Math instrument used for this pilot activity were developed during the October 2019 adaptation workshop. The Grade 2 and Grade 4 assessments were designed to align with grade-level expectations (and, for Grade 4, with the Trends in International Mathematics and Science Study [TIMSS] Framework and other international frameworks) and therefore included different items. Table 8 shows the subtasks or domains for both grades. Grade 2 was administered as an oral, individual assessment, with one assessor assessing one student; the average administration time was 16 minutes. Grade 4 was a written assessment administered by one assessor to 10 students at a time, with students asked to complete the assessment at their own pace within the 45 minutes allotted. The analysis presented in this section is split by grade but not by language, because the items were identical across languages.


Table 8. Math Administration by Grade

Subtask/Domain | Grade 2 EGMA | Grade 4 Written Math
1 | Quantity Discrimination | Numbers and Expressions (13 items)
2 | Missing Number | Fractions (5 items)
3 | Addition Level 2 | Geometry (6 items)
4 | Subtraction Level 2 | Measurement (4 items)
5 | Word Problems | Statistics (4 items)
6 | Relational Reasoning | –
7 | Spatial Structuring 2D | –
8 | Spatial Structuring 3D | –

3.3.1 Grade 2 EGMA

Overall, students performed as expected on the Grade 2 EGMA. The average administration time was 16 minutes, which is within the overall targeted time for Grade 2 assessments.

As the EGMA instrument is designed to increase in complexity within each subtask, item-specific analyses within subtasks provide information on how well the tasks are performing. Below are item-specific analyses for three of the core subtasks: quantity discrimination (Figure 7), and addition and subtraction level 2 (Figure 8). There were 10 items on the quantity discrimination subtask. Overall, the earlier items were easier for students and the later items were more difficult. This pattern is consistent with the intended design of the EGMA, with students performing as expected.

Figure 7. Item Analysis for Quantity Discrimination Subtask

[Figure 7: stacked bar chart showing, for each of the 10 Grade 2 quantity discrimination items, the percentage of students scoring Correct, Incorrect, or No Response.]

Figure 8 shows results for the addition and subtraction subtasks.


Figure 8. Item Analysis for Addition and Subtraction Level 2

For addition level 2, items 1 and 3 were the least difficult, and item 5 was the most difficult. A similar pattern held for subtraction level 2. Although the patterns are less obvious for these two subtasks, given the small number of items, they are still consistent with the intended design of the EGMA instrument.

Similar analyses for other key subtasks follow the same pattern. Given the results above, we will make minimal changes to the EGMA instrument (e.g., changing the language of the instructions to the assessor).

3.3.2 Written Mathematics Assessment Grade 4

Because the Written Mathematics assessment for Grade 4 was a newly developed instrument, we conducted an Item Response Theory (IRT) analysis to guide the choice of items to include on the final assessment; more items than needed were included in the pilot for this purpose.

As piloted, the assessment had 33 items and took approximately 45 minutes to finish. Overall, the average percent correct was 34 percent across languages, which suggests that the assessment is appropriately performing a diagnostic function. The assessment was given to fourth graders at the beginning of the school year, so lower scores were expected.

We examined the math items by domain: numbers (including expressions), fractions, geometry, measurement, and statistics. These domains are aligned to the TIMSS mathematics framework. Raw frequencies were obtained, followed by the estimation of IRT models in Mplus. The IRT models indicate how well each item discriminates ability in a subject area (e.g., fractions), expressed as a factor loading or discrimination parameter. Item difficulties relate to the prevalence of correct answers to an item and indicate the relative ease or difficulty of answering it correctly. The results below are given by domain.
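For readers interpreting the ICC figures that follow: under a two-parameter logistic (2PL) IRT model, the probability of answering an item correctly is P(theta) = 1 / (1 + exp(-a(theta - b))), where a is the discrimination (slope) and b the difficulty (location). The Python sketch below uses illustrative parameter values; the report’s actual estimates were produced in Mplus.

    import numpy as np

    def icc_2pl(theta, a, b):
        """P(correct | ability theta) for discrimination a and difficulty b."""
        return 1.0 / (1.0 + np.exp(-a * (theta - b)))

    theta = np.linspace(-3, 3, 7)
    print(icc_2pl(theta, a=1.5, b=0.0))  # steep S-curve: good discrimination
    print(icc_2pl(theta, a=0.2, b=0.0))  # nearly flat slope: poor discrimination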

Numbers

There were 13 items in the numbers subdomain, including the subdomain on expressions (pre-algebra). All items loaded significantly on the domain. Figure 9 below contains the item characteristic curves (ICCs) for the 13 numbers items. The ICCs suggest some redundancy where pairs of items have overlapping curves: Items 1 and 7, and Items 4 and 9. Each pair had very similar loadings and difficulties.


Figure 9. Item Characteristic Curves for the Numbers Subdomain

In sum, the 13 items for numbers performed well. One item from each of the redundant pairs above (4 and 9, and 1 and 7) could be removed without losing information.

Fractions

There were five items in the fractions subdomain, making possible modifications limited. Item 15 did not load significantly on the overall fractions factor and had a very low difficulty score, suggesting it is not a good item; the variance in the item explained by the factor (R²) was very low, at 0.10. An additional concern is that multiple items have negative loadings, suggesting that as fractions ability increases, the likelihood of answering those items correctly decreases.

One difficulty with the fractions items is that most students in Grade 4 had not yet covered fractions at the beginning of the year. The new Uzbekistan curriculum will introduce simple fractions, but the students who took this assessment had not yet been exposed to them. Given this, most items within the fractions subdomain will remain on the assessment, with minor modifications, as they represent fundamental, basic concepts about fractions.

Geometry

There were six items in the geometry subdomain. Figure 10 presents the ICC graph: Items 20 and 24 have slopes with no inflection, reflecting their poor discrimination. Of the remaining items, Item 22 is the most difficult.


Figure 10. ICC Graph for Geometry Items

Items 20 and 24 may be removed and replaced with items that load better onto the middle range of performance.

Measurement

There were four items in measurement, and the model fit was very good for the measurement factor, indicating good construct reliability. The ICC graph in Figure 11 shows that the items all loaded well.

Figure 11. ICC Graph for Measurement Items

Item 27 is the most difficult item, and Item 28 is the easiest (lowest difficulty/leftmost curve). Item 29 has the best discrimination (steepest slope). All four items will remain on the final assessment.

Statistics

There were four items in statistics, and the model fit was very good for the statistics factor, indicating good construct reliability. All four items perform well in capturing statistics ability. Item 32 could likely be removed with little loss of overall construct reliability and measurement information; a decision about Item 32 will be made based on the length of the final assessment.


3.3.3 Overall Recommendations for EGMA Grade 2 and Written Math Grade 4

We note several recommendations in addition to the revisions above.

1. In Russian, the multiplication sign for certain items did not appear on the printed version of the assessment. In Karakalpak, the multiplication sign for these items was smaller than desired. For the full data collection, we will ensure that all printed versions of the assessment are formatted and printed correctly.

2. For each item on the Grade 4 Written Math assessment, assessors were asked to enter the answer the child wrote down. Some answer choices were provided for the assessors, including the correct response as well as other common answers. In the coming month, we will perform a more in-depth review of these answer choices and provide the most common ones for assessors. This will enable smoother data entry and provide information on patterns in student responses, which offer important diagnostic information.

3. For all items in Grades 2 and 4, we will create slightly different but equivalent problems to guard against any item leakage that may have occurred during the pilot data collection. For EGMA Grade 2, we will use the item specifications that have been established to create an equivalent version.

4 RECOMMENDATIONS

4.1 Revisions to Tangerine

Supervisors were asked to track errors in Tangerine and notify the training team. Tangerine was updated twice before the data collection to address several issues; these updates primarily corrected coding errors in answer response selections and the duplication of consent statements and child information.

Additional changes that must be made for the next data collection:

▪ Review consent statements and have Methodists and teachers rewrite them into grade-appropriate language.

▪ Format Tangerine to be more user friendly.

▪ Translate all assessor-facing instructions and response options into the local language (rather than using Uzbek for all assessments).

▪ Include only the language-relevant assessments in the Tangerine group (seven different Tangerine groups should be created).

▪ Include the tips about auto-stop, nudge, etc., in the Tangerine assessment.

4.2 Revisions to Assessments

We will create slightly modified forms of the assessments prior to the national data collection, in case leakage and memorization are a problem. This process will retain the difficulty of the assessments but allow for a minimally revised version that would differ from any taught or memorized version. For EGRA, we will create a new, non-piloted Uzbek language passage to be administered alongside the national assessment. This latter approach has the benefit of providing information on a passage that has no potential to be leaked.

Improvements to training: Though the pilot training went fairly smoothly, several processes could be improved. The following are the areas of focus for improving the planning and execution of the spring training:


▪ Identify supervisors and assessors well in advance (MPE). In particular, given the difficulty of finding assessors able to accurately administer the English language passage for Grade 4, arrangements need to be made to ensure that all language groups have sufficient assessors who are fluent in English.

▪ Spend less time on the background of the assessments and more time covering which assessments assessors need to complete at schools (Language 1 vs. Language 2 vs. English; reading vs. math).

▪ Cover the “at the school” protocol in detail: how to prepare for test security, how to deal with school authorities to ensure an efficient day, how to improve the efficiency of data collection, and how to define the roles of individuals on the team.

4.3 Improving Efficiency During National Data Collection

A few hiccups were experienced during the pilot data collection, all of which need to be addressed for the spring data collection. Many of these issues could have been avoided by early planning and action.

▪ Adapt and finalize the school selection and student sampling procedures in light of test security concerns:

‒ Obtain complete school census data from CICT showing school-grade-language population and shift information (to be used for sampling). (MPE)

‒ Address the question of how to sample from schools that operate in multiple shifts. Develop a specific protocol for these different scenarios. (RTI)

‒ Evaluate potential implications of not conducting school verification in advance of data collection. Take other steps needed to mitigate the risk from not completing school verification in advance. (RTI/MPE)

▪ Prepare and obtain the assessment approval letter from the Minister well in advance, once the dates of the data collection are set. (MPE)

▪ Conduct school outreach to provide school administrators with general information on EGRA/EGMA in order to relieve fears of the assessment. (MPE)

▪ Finalize the assessor plan well in advance of the national data collection. Determine the mix of Methodists and university students, per diem amounts, and the responsible funding party. Update the activity scope of work to reflect revised USAID and MPE roles and responsibilities (as needed). (RTI/MPE)


Annex A: Pilot Training Agenda

Early Grade Reading Assessment (EGRA) and Early Grade Math Assessment (EGMA)

Pilot Training: November 5–8, 2019

Tuesday, November 5: EGRA and subtasks
▪ Overview of pilot study and EGRA subtasks
▪ Letter sounds, nonwords, oral reading fluency, reading comprehension + practice
▪ Introduction to Tangerine™
▪ Practice with tablets

Wednesday, November 6: EGMA and subtasks
▪ Overview of EGMA subtasks
▪ Quantity discrimination, missing number, addition/subtraction level 2, word problems, relational reasoning, spatial reasoning + practice
▪ Overview of Grade 4 group administration + practice
▪ Review test security protocols
▪ Practice with tablets

Thursday, November 7: School Practice 1 + Assessor Accuracy Measure (AAM) 1
▪ School Practice 1
▪ Debrief from school visit
▪ AAM 1
▪ Review procedures + practice

Friday, November 8: School Practice 2 + AAM 2 + Deployment
▪ School Practice 2
▪ Debrief from school visit
▪ AAM 2
▪ Wrap up
▪ Deploy supervisors with all materials


Annex B: Participants of Pilot Training and Data Collection

Team | Name | Role – Pilot Data Collection | Training Days Attended | Organization
Uzbek | Dilfuza Jumaniyozova | Supervisor | All | MPE
Uzbek | Mexmonaliyev Islom | Assessor | All | MPE
Uzbek | Begmatov Azamat | Assessor | All | MPE
Uzbek | Nazirjonova Zarina | Assessor | All | University
Uzbek | Doniyorova Shaxlo | Assessor | All | University
Uzbek | Sharipov Sardor | Assessor | All | University
Russian | Feliks Mirzabayev | Supervisor | All | MPE
Russian | Mavlonova Zuxra | Assessor | All | MPE
Russian | Baxromjonov Doston | Assessor | All | MPE
Russian | Abidinova Lobar | Assessor | All | University
Russian | Kamilova Maxbuba | Assessor | All | University
Russian | Koraboyev Muxammadali | Assessor | All | University
Turkmen | Shovqiddin Ishmurodov | Supervisor | 1 (joined training on last day) | MPE
Turkmen | Yuldashev Dilshod | Assessor | All | MPE
Turkmen | Arazmuradov Abdimurat | Assessor | All | MPE (Methodist)
Turkmen | Paraxadova Zulhayo | Assessor | All | University
Turkmen | Ankaliyyeva Yelvira | Assessor | All | University
Turkmen | Bekturdiyeva Guzelь | Assessor | 3 (left training after 3 days) | University
Karakalpak | Boymurodova Gulzoda | Supervisor (former supervisor of Turkmen) | All | MPE
Karakalpak | Tleuova Miyirgul | Assessor | All | MPE (Methodist)
Karakalpak | Serjanova Zamira | Assessor | All | MPE (Methodist)
Karakalpak | Kozbagarova Jansilu | Assessor | All | MPE (Methodist)
Karakalpak | Abdikalikova Nilufar | Assessor | All | University
Karakalpak | Kamaritdinova Shaxsanem | Assessor | All | University
Karakalpak | Ruziev Farhod | Supervisor | 3 (left training after 3 days) | MPE
Tajik | Namozov Nozim | Supervisor | All | MPE
Tajik | Halimova Charosxon | Assessor | 1 (joined training on last day) | MPE
Tajik | Xamidova Shabnam | Assessor | All | MPE
Tajik | Xolov Siyovush | Assessor | All | University
Tajik | Inoyatulloyev Ibroxim | Assessor | All | University
Tajik | Hidoyatov Nusratillo | Assessor | All | University
Kazakh | Muzaffar Tulyaganov | Supervisor | All | MPE
Kazakh | Egamberdiyev Shoxrux | Assessor | All | MPE
Kazakh | Abdullayeva Dinara | Assessor | 2 (joined training for last 2 days) | University
Kazakh | Seidullayev Turdiali | Assessor | 2 (joined training for last 2 days) | MPE
Kazakh | Jiyanbekov Abdumutal | Assessor | 1 (left training after 1 day) | University
Kyrgyz | Gulzoda Shermatova | Supervisor | All | MPE
Kyrgyz | Rahmonaliyev Qobil | Assessor | All | University
Kyrgyz | Mamatjonov Begzod | Assessor | All | University
Kyrgyz | Abjabarov Umidbek | Assessor | All | University
Kyrgyz | Adilov Sadirbek | Assessor | All | University
Kyrgyz | Toshpoʻlatov Begzodbek | Assessor | All | University