
Comparing the OPI and the OPIc: The Effect of Test Method on Oral Proficiency Scores and Student Preference

Gregory L. Thompson
Brigham Young University

Troy L. Cox
Brigham Young University

Nieves Knapp
Brigham Young University

Abstract: While studies have been done to rate the validity and reliability of the Oral Proficiency Interview (OPI) and Oral Proficiency Interview–Computer (OPIc) independently, a limited amount of research has analyzed the interexam reliability of these tests, and studies have yet to be conducted comparing the results of Spanish language learners who take both exams. For this study, 154 Spanish language learners of various proficiency levels were divided into two groups and administered both the OPI and OPIc within a 2-week period using a counterbalanced design. In addition, study participants took both a pre- and postsurvey that gathered data about their language learning background, familiarity with the OPI and OPIc, preparation and test-taking strategies, and evaluations of each exam. The researchers found that 54.5% of the participants received the same rating on the OPI and OPIc, with 13.6% of examinees scoring higher on the OPI and 31.8% scoring higher on the OPIc. While the results found that students scored significantly better on the OPIc, the overall effect size was quite small. The authors also found that the overwhelming majority of the participants preferred the OPI to the OPIc. This research begins to fill important gaps and provides empirical data to examine the comparability of the Spanish OPI and OPIc.

Key words: Spanish, language proficiency, oral proficiency, postsecondary

Gregory L. Thompson (PhD, University of Arizona) is Associate Professor of Spanish Pedagogy, Brigham Young University, Provo, Utah.

Troy L. Cox (PhD, Brigham Young University) is Associate Director of Research and Assessment, Center for Language Studies, Brigham Young University, Provo, Utah.

Nieves Knapp (PhD, University of Oviedo, Spain) is Associate Teaching Professor of Spanish Pedagogy, Brigham Young University, Provo, Utah.

Foreign Language Annals, Vol. 49, Iss. 1, pp. 75–92. © 2016 by American Council on the Teaching of Foreign Languages. DOI: 10.1111/flan.12178



Introduction

Being self-taught is no disgrace; but being self-certified is another matter. —Hugh W. Nibley

In the modern globalized economy, many educational institutions need to demonstrate their students' developing levels of foreign language proficiency. These situations include, for example, obtaining U.S. teaching certifications, measuring language gains from study abroad programs, and reporting on course learning outcomes. In addition, organizations that seek employees who can use a non-English language in particular ways for particular purposes, such as those who conduct business internationally and government agencies, also find it essential to ensure that future employees' level of language proficiency is well suited to their expected tasks and responsibilities.

When framing conversations about proficiency, the ACTFL Proficiency Guidelines have served as a common, external point of reference that allows educators to define and discuss various levels of functional language ability. Based on these guidelines, the Oral Proficiency Interview (OPI) has long been considered the gold standard for measuring proficiency levels. Because cost and scheduling constraints can make the OPI impractical in some situations, a computer-based alternative, the Oral Proficiency Interview–Computer (OPIc), was developed. While some studies have examined the comparability of the two exams in English (Surface, Poncheri, & Bhavsar, 2008; SWA Consulting Inc., 2009), little research has been done to compare the two versions of the exam in other languages, and questions about the extent to which the OPI and OPIc are really commensurate remain. Specifically, one might ask to what extent both assessments result in the same rating for any particular test taker. Similarly, can it be assumed that both assessments measure exactly the same constructs, and thus does the OPIc truly represent a suitable, low-cost, easy-to-schedule substitute for the OPI? In addition, because the two versions of the assessment are delivered in different ways and support differing degrees of interpersonal participation, it is important to consider the extent to which participants prefer one test method over another and why. For example, investigating aspects of the OPI or OPIc that are particularly beneficial, enjoyable, or distasteful to test takers provides important insights into the nonmonetary benefits of one delivery system vs. the other.

Background

Development

The OPI evolved from work done on language proficiency testing in the 1950s by the Foreign Service Institute of the U.S. Department of State. In the early 1980s, the ACTFL adapted the government language standards, resulting in the first set of foreign language proficiency guidelines that were specifically geared toward educators (ACTFL, 2012). Based on those guidelines, the ACTFL created an oral proficiency interview protocol that could be used to assess speaking proficiency (Liskin-Gasparro, 2003). The resulting ACTFL OPI is a face-to-face or telephone interview between a trained interviewer and an examinee. The interviewer elicits speech functions that reflect the different proficiency levels, as outlined in the guidelines, and strategically adapts the topics and probes so that each examinee receives a personalized, but comparable, assessment of his or her skill level. Based on the speech sample that is elicited, the interviewer determines the proficiency level (Novice to Superior¹) and, for Novice, Intermediate, and Advanced speakers, the sublevel (Low, Mid, and High) (Malone, 2003). A second certified rater subsequently evaluates the speech sample, and any discrepancy between assigned scores is moderated by a third certified rater.

The OPI has generally been found to be a reliable test of language proficiency (Abadin, Fried, Good, & Surface, 2012; Surface & Dierdorff, 2003), and since the 1980s, tens of thousands of interviews have been performed. Although Norris (2001) raised some questions as to the validity of computerized language tests, especially those that attempt to measure oral proficiency, and although the assessment is not without other critics (see Chalhoub-Deville & Fulcher, 2003; Liskin-Gasparro, 2003), the interrater (e.g., raters agree with the relative ordering of a group of individuals from low to high) and test-retest reliability for either test modality (OPI or OPIc) have generally been found to be quite high. As a result, the OPI has become one of the most widely accepted and administered instruments for measuring second language speaking proficiency in the United States, and it is consistently used in high-stakes situations by government, businesses, and educational institutions.

Because the OPI is administered face to face or over the telephone by trained interviewers, the associated costs are often high. In order to mitigate those costs and thus make the test both less expensive and easier to schedule and administer, the OPIc was developed in the early 2000s for Novice through Advanced speakers. In part due to the increase in the number of heritage and native speakers as well as increasing proficiency levels in nonnative speakers, in 2012, the ACTFL adapted the OPIc to provide ratings through the Superior level. Unlike with the OPI, prior to beginning the OPIc, test takers complete a survey that requires them to provide demographic information, select specific topics of interest (e.g., sports, politics, literature), and indicate from a list the tasks and situations in which they can use the language. Based on responses to the survey, an individualized test that takes into consideration the test taker's self-reported proficiency level and interests is then created from among the pool of items in the question bank. In this manner, the OPIc is similar to the OPI in that each test is personalized, although obviously not to the same extent. Like the OPI, the OPIc yields a ratable speech sample that is subsequently double-blind-rated by at least two certified OPIc raters. The use of the computer not only facilitates the administration of the exam but also costs less than half of the traditional OPI (ACTFL, 2009a, 2009b).

Reliability: OPIc

While the increased convenience and decreased cost of administering the OPIc relative to the OPI are obvious, a limited amount of research has examined the OPIc's reliability. SWA Consulting (2009) analyzed a sample of 2,934 Korean ESL speakers who took the OPIc multiple times during a 30-day period. The report indicated high test-retest reliability with an r (i.e., Pearson Product Moment Correlation) above 0.90 (ranging from 0.90 to 0.93), an R (i.e., Spearman Rank Order Correlation) above 0.90 as well (ranging from 0.90 to 0.94), and rater agreement (different raters award the same rating to the same test taker) above 85% (85–92%) for the first two OPIc administrations. However, score stability did decrease slightly as the time gap between administrations increased (SWA Consulting Inc., 2009, p. 2). Abadin et al. (2012) conducted a similar study and reported high levels of interrater reliability across Spanish, English, and Arabic OPIc administrations (R ranging from 0.95 to 0.97) and over time (R ranging from 0.96 to 0.97). Levels of interrater reliability, as measured by absolute score agreement, were higher than 70% across all languages (English = 80%, Spanish = 80%, Arabic = 71%). At the sublevels (High, Mid, Low), adjacent agreement rates were more than 90% across all main levels (Novice, Intermediate, Advanced). The lowest absolute rates of interrater agreement (75% overall) were found in the Advanced category: While the highest rate was at Advanced Mid (83%), agreement rates at the Advanced Low and Advanced High sublevels were substantially lower (66% and 60%, respectively) (pp. 2, 10–11). Taken together, these studies indicate that ratings of speech samples that are collected using the OPIc do seem to be reliable. However, because, as seen with the Arabic OPIc (Abadin et al., 2012), some languages may present lower rates of interrater agreement or other concerning trends, researchers still need to examine versions of the OPIc across a broader range of other languages. The curious drop in interrater agreement at the Advanced proficiency level also needs to be better understood.
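For readers who want to reproduce these three indices from their own paired data, the following is a minimal sketch, not code from the studies cited; the function name and the assumption that ratings have already been converted to the 1–10 ACTFL scale are illustrative only.

```python
# Hedged sketch: the three indices reported in these reliability studies,
# computed for hypothetical paired ratings (e.g., two raters or two sittings).
import numpy as np
from scipy.stats import pearsonr, spearmanr

def reliability_indices(ratings_a, ratings_b):
    a, b = np.asarray(ratings_a), np.asarray(ratings_b)
    r, _ = pearsonr(a, b)          # r: Pearson Product Moment Correlation
    rho, _ = spearmanr(a, b)       # R: Spearman Rank Order Correlation
    agreement = np.mean(a == b)    # rater agreement: same rating awarded
    return {"r": r, "R": rho, "agreement": agreement}
```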

Concurrent Validity: OPI and OPIc

The OPIc has often been presented and perceived as "[A] different modality of the ACTFL OPI and not a different assessment" (Surface et al., 2008, p. 8). Unfortunately, the literature supporting this claim has been sparse, and no definite conclusions can be drawn without more extensive, experimental studies; to our knowledge, only one report has been published specifically comparing participants' scores across the two exam formats (Surface et al., 2008). In order for the OPI and OPIc to be considered truly equivalent forms of the same assessment that measure the same underlying body of knowledge and skills, research should show high levels of correlation between examinees' scores on the OPI and the OPIc, as well as comparable levels of test-retest reliability, interrater reliability (e.g., raters agree with the relative ordering of a group of individuals from low to high), and rater agreement (different raters award the same rating to the same test taker).

Surface et al. (2008) published a report on two studies that were undertaken on behalf of Language Testing International. Study 1 was designed to test the reliability and validity of the OPIc. Study participants (n = 99) were employees at a Korean company and had varying levels of experience with English. Participants were randomly assigned to two groups and completed a pre- and postassessment survey, as well as an OPI and an OPIc (in different orders, depending on group assignment). Those assigned to the second group took the OPIc twice in order to allow for an examination of test-retest reliability (Surface et al., 2008). The authors found the following results:

• There was a significant correlation between final scores on the OPI and the first administration of the OPIc (Pearson's r = 0.92, Spearman's R = 0.91).

• Absolute agreement for final scores on the OPI and the OPIc (first administration) was at 63%. Agreement within major categories (Novice, Intermediate, etc.) was at 85%. When considering both absolute (e.g., the same rating) and adjacent agreement (e.g., a rating within one sublevel, such as Advanced Low and Advanced Mid), final scores agreed 98% of the time.

• The order in which the test was administered had no significant effect on test scores; however, the self-assessment task did seem to affect final ratings, with participants who self-assessed at the lowest proficiency level tending to score lower on the OPIc than on the OPI.

• The test-retest reliability for the first and second administrations of the OPIc was high (r = 0.94, R = 0.91).

• Confirmatory factor analysis methods supported the reliability and validity of the OPI and OPIc. (p. 30)

Although many of the results strongly supported the comparability of the OPI and OPIc, the findings suggested some areas of obvious concern. First, the level of absolute agreement (63%) was much lower than expected (Surface et al., 2008). Although perfect agreement at the sublevel is rare, 63% seems too low to claim equivalence between test modalities. However, as noted previously, when adjacent agreement was included, rates did jump to 98%.

Study 2 took place after the ACTFL updated the OPIc in an effort to address the areas of concern suggested in Study 1. A sample of 27 participants from the same Korean company was selected, all of whom took the OPI and the revised OPIc within a 48-hour period (Surface et al., 2008). These were the key results:

• Interrater reliability for the OPIc was found to be generally high (r = 0.86, R = 0.85, G = 0.93); however, interrater reliability for the OPIc was slightly less robust than for the OPI, and exact agreement between raters 1 and 2 was only 58%.

• The correlation between final scores on the OPI and revised OPIc was strong (r = 0.97, R = 0.95), and final ratings exactly agreed 87% of the time. If adjacent agreement within major categories was taken into account, agreement levels were at 100%. (Surface et al., 2008, pp. 36–37)

Once again, although some areas were strong (e.g., correlation between final scores, adjacent agreement levels, etc.), others (e.g., absolute agreement) were weaker than expected. Specifically, measures of interrater reliability were somewhat lower than in Study 1, and the small sample size prevented any definitive conclusions, especially regarding the effect of the self-assessment task.

Other Issues

In addition to questions about the extent to which the OPI and OPIc result in the same ratings, both by different raters and across testing sessions, and the extent to which they measure the same body of knowledge and skills albeit in different ways, other differences between the OPI and OPIc merit consideration. First, the effect of self-assessment on the selection of prompts that are offered on the OPIc and the impact of prompt selection on candidates' final proficiency rating remain unclear. For example, based on data from the Computerized Oral Proficiency Instrument (COPI) and the Simulated Oral Proficiency Interview (SOPI), Malabonga, Kenyon, and Carpenter (2005) found that although self-assessment scores were shown to correlate well with independent assessments and COPI/SOPI scores, there was some indication that participants who rated themselves at a higher level of proficiency than was warranted may have been negatively affected by starting their testing session at too high a level (Malabonga et al., 2005). This affected the ability of the raters to establish a floor indicating what the examinees were able to accomplish, as the prompts were too difficult.

What is more, issues of test-taker preference should be investigated. As is the case with research regarding the comparability of scores on the OPI and OPIc, the literature that deals with test examinees' preferences is also in its infant stages. Surface et al. (2008) found that although attitudes toward the OPIc were generally positive, test takers preferred the OPI to the OPIc and felt that the OPI offered a better opportunity to demonstrate their language abilities (Surface et al., 2008). Although no other investigation has been published comparing attitudes toward the OPI and the OPIc, research dealing with computerized vs. face-to-face versions of proficiency tests has supported Surface et al.'s findings: Kiddle and Kormos (2011) found that test takers preferred the face-to-face test to the computerized version, and they also described the face-to-face test as more "fair" than the computerized test. While test takers in Kenyon and Malabonga's (2001) study also felt that the OPI allowed for a better demonstration of ability, they reported that the OPI was not as fair as the COPI/SOPI. Contrary to some previous studies, Mousavi (2009) found that test takers had very positive perceptions of digitally delivered proficiency tests and that they actually felt more comfortable with the computerized test than the face-to-face modality. Research about test-taker preferences in assessment modality as well as the potential relationship between test-taker preferences and final ratings is needed.

In summary, while studies have generally shown high levels of interrater and test-retest reliability when scoring the OPI or OPIc, research comparing the resulting scores, the impact of assessment modality, self-assessment, and topic selection on those scores, and test-taker preferences is still in its infancy and is made more complex given the large number of languages in which the OPI and OPIc are offered. Surface et al. (2008) advocated for additional studies to be conducted "to add to the foundation of evidence supporting the ACTFL OPIc testing modality" (p. 34). To better understand these issues, this study addressed the following research questions:

1. What is the effect of test method (OPI or OPIc) on the speaking proficiency scores of Spanish language learners?

2. Which test method do participants prefer, and why?

Methodology

Participants

One hundred fifty-four students (81 females and 73 males) at a large private university in the western United States participated in the study. The mean participant age was 23.4, with a standard deviation of 5.29. The sample included approximately 54 introductory students in lower-division classes as well as 100 students in upper-division courses, including Spanish majors, Spanish translation majors, Spanish teaching majors and minors, and students from other majors who were participating in the university's language certificate program. Mean years of study were 3.48. Most of the study participants in the upper-division classes had significant experience with the language through living abroad or in formal study abroad programs. All of the students in these upper-division programs were required to score Advanced Low or higher on the OPI in order to qualify for their respective certificate, minor, or major. The university routinely covers the cost of the OPI for the first examination, and those who do not score at or above Advanced Low must take it again at their own expense. Although participation in the study was voluntary, several participation incentives were offered: All costs associated with taking the second test (either the OPI or the OPIc) would be paid for by a grant received by the researchers, students could choose the higher of their two test scores if there was a disparity between their OPI and OPIc final ratings, and participants would receive a nationally recognized certificate of their speaking proficiency.

Instruments

Four instruments were used: two oral proficiency tests (the telephone version of the OPI and the OPIc); a presurvey, which test takers completed before taking either assessment; and a postsurvey, which they filled out after having completed both assessments. The presurvey consisted of questions designed to elicit students' relevant background information, language experience, goals, and familiarity with the OPI and OPIc. Participants were also asked to rank their own proficiency from poor to excellent in the different language modalities. The postsurvey consisted of questions designed to elicit participants' experiences with, and attitudes toward, both testing methods and also asked students to rate what they thought their score would be based on the ACTFL Proficiency Guidelines. Participants also expressed their attitude toward each test modality on a nine-point Likert scale ("dislike extremely" to "like extremely") and were asked to explain which test format they preferred and the reasons why in response to a final open-ended question ("As a test taker, which method did you prefer and why?").

Procedures

In order to control for a potential order effect, students were randomly assigned to two groups. The first group (41 females and 36 males; average age = 23.2, SD = 4.21; mean years of study = 3.51, SD = 1.92) took the OPI before the OPIc; participants in the second group (40 females and 37 males; average age = 23.6, SD = 6.23; mean years of study = 3.51, SD = 1.92) took the OPIc and then the OPI. Both tests were administered according to student availability but within a 2-week period so as to minimize the potential effect of additional instruction and other interactions on the students' oral proficiency. With only a few exceptions, the postsurvey was completed before students received their final ratings, thus avoiding a score-based preference toward either test. For those cases in which there was a two-sublevel difference in a participant's OPI and OPIc scores (n = 5), participants were contacted and asked to answer further questions about their testing experience.

Data Analysis

A mixed-methods approach was used. While many educational researchers use parametric statistics with test scores that may or may not have been verified to be interval data, using both parametric and nonparametric analyses to answer these research questions helps ensure that the findings are not an artifact of the data failing to meet the strict assumptions of parametric research.

First, in order to determine the effect of test method (OPI vs. OPIc) on the speaking proficiency scores of Spanish language learners, the fact that the scores were ordinal rather than interval or ratio in nature had to be accounted for. In other words, the test scale ranked test takers on a scale from 1 to 10 (Novice Low, Novice Mid, and so forth to Superior), without distinguishing or defining the relative distance between each unit on the scale. This meant that the amount of language ability required to move from 9 (Advanced High) to 10 (Superior) on the OPI/OPIc scale was exponentially greater than what was required for a learner to move from 1 (Novice Low) to 2 (Novice Mid). Given that the relative distances between each point on the scale were not equal, both nonparametric (e.g., the Wilcoxon Signed-Rank Test) and parametric (the repeated-measures ANOVA) statistics are reported here. The Wilcoxon Signed-Rank Test compared the relative rank of the same group of subjects on two measures. Using the repeated-measures ANOVA with a counterbalanced design controlled for an order effect (e.g., scoring higher on the second test) by balancing out that effect between the two groups. The dependent variables, the test scores on the OPI and OPIc, were analyzed in two ways. First, the within-subjects effect was analyzed by comparing individual participants' scores on the two tests. Second, the between-subjects effect was ascertained by comparing mean group scores and standard deviations for each test method and examining the confidence intervals.
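To make this pair of analyses concrete, the sketch below shows one way they could be run. It is not the authors' code: the DataFrame layout, the column names, and the use of the pingouin package for the mixed (split-plot) ANOVA are all assumptions made for illustration.

```python
# Minimal sketch (assumed layout, not the authors' code): ordinal conversion,
# the Wilcoxon Signed-Rank Test, and a 2 x 2 mixed ANOVA with test method as
# the within-subject factor and test order as the between-subjects factor,
# i.e., the counterbalanced repeated-measures design described above.
import pandas as pd
import pingouin as pg                     # assumption: pingouin is available
from scipy.stats import wilcoxon

SCALE = {"NL": 1, "NM": 2, "NH": 3, "IL": 4, "IM": 5, "IH": 6,
         "AL": 7, "AM": 8, "AH": 9, "S": 10}   # Novice Low ... Superior

def analyze(df: pd.DataFrame) -> None:
    # df columns (hypothetical): id, group ("OPI first"/"OPIc first"),
    # opi, opic -- the last two holding ACTFL codes such as "AL"
    opi = df["opi"].map(SCALE)
    opic = df["opic"].map(SCALE)

    # Nonparametric comparison of the paired ordinal scores
    print(wilcoxon(opi, opic))

    # Parametric comparison with the order effect balanced across groups
    long = df.assign(opi=opi, opic=opic).melt(
        id_vars=["id", "group"], value_vars=["opi", "opic"],
        var_name="method", value_name="score")
    aov = pg.mixed_anova(data=long, dv="score", within="method",
                         subject="id", between="group")
    print(aov[["Source", "F", "p-unc", "np2"]])  # np2 = partial eta squared
```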

To determine which test method (OPI vs. OPIc) participants preferred, students' Likert-scale ratings of the two test modalities on the postsurvey were compared using the Friedman rank order test. Examinees' open-ended responses were grouped into broader categories (e.g., more personal, more realistic, easier to understand, no embarrassment) for easier qualitative analysis.
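As a point of reference, the Friedman statistic can be computed directly from within-subject ranks. The function below is an illustrative sketch using the basic formula without a tie correction; it is worked out by hand here because scipy's built-in friedmanchisquare requires at least three conditions, whereas this study compares only two (OPI and OPIc ratings).

```python
# Illustrative sketch: Friedman rank order test for k repeated measures,
# computed from the basic (untied) formula.
import numpy as np
from scipy.stats import chi2, rankdata

def friedman_test(*conditions):
    data = np.column_stack(conditions)     # shape: (n subjects, k conditions)
    n, k = data.shape
    ranks = np.apply_along_axis(rankdata, 1, data)  # rank within each subject
    rank_sums = ranks.sum(axis=0)
    stat = 12.0 / (n * k * (k + 1)) * (rank_sums ** 2).sum() - 3 * n * (k + 1)
    return stat, chi2.sf(stat, df=k - 1)   # chi-square statistic, p-value
```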

Findings

The Effect of Test Method

To understand the effect of test method on scores, test takers' OPI scores were compared to their OPIc scores. The mean of the OPI (average = 6.52, SD = 1.56) was lower than the mean of the OPIc (average = 6.71, SD = 1.41). While the mean of the OPIc was higher, the distribution of OPIc scores was more skewed than the distribution of the OPI scores. In Figure 1, the population distribution is presented with a histogram distribution of the OPI scores presented vertically on the left side of the graph and a mirrored histogram distribution of the OPIc scores with the same examinees on the right. When comparing rank orders of the test takers on the two tests, one can see that 54.5% (n = 84) of examinees scored the same on both measures, 13.6% (n = 21) scored higher on the OPI, and 31.8% (n = 49) scored higher on the OPIc. The Wilcoxon Signed-Rank Test found that the groups were statistically different in terms of mean rank (z = −3.13, p = 0.002); however, the adjacent agreement was 97%. The Wilcoxon statistic was unable to control for a potential test order effect, necessitating further analysis.
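The same/higher/lower percentages and the adjacent agreement rate reported here are simple functions of the paired ordinal scores; a hypothetical sketch (not the authors' code):

```python
# Hypothetical sketch: exact agreement, direction of disagreement, and
# adjacent agreement (within one sublevel) for paired 1-10 ordinal scores.
import numpy as np

def agreement_summary(opi_scores, opic_scores):
    diff = np.asarray(opic_scores) - np.asarray(opi_scores)
    return {
        "same rating": np.mean(diff == 0),
        "higher on OPIc": np.mean(diff > 0),
        "higher on OPI": np.mean(diff < 0),
        "adjacent agreement": np.mean(np.abs(diff) <= 1),
    }
```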

Descriptive statistics showed that Group 1 (those who took the OPIc first) scored higher on both the OPIc and the OPI than Group 2 (see Table 1). However, both groups received higher scores on the OPIc than they did on the OPI, indicating that there was no ordering effect. To verify this observation, a repeated-measures ANOVA was conducted. The between-subjects variable was Group (Group 1: OPIc First; Group 2: OPI First), and the within-subjects variable was Test Method (OPIc or OPI), with the dependent variable being the numerical conversion of the OPI/OPIc score.

FIGURE 1

Histogram Illustrating OPI and OPIc Score Distribution

TABLE 1

Descriptive Statistics of OPI and OPIc Results

        OPI First Group    OPIc First Group        Total
        OPIc      OPI      OPIc      OPI      OPIc      OPI
N         77       77        77       77       154      154
Mean    6.57     6.36      6.86     6.69      6.71     6.53
SD      1.39     1.67      1.43     1.46      1.41     1.57


The mean difference between groups was 0.305, 95% CI [−0.16, 0.77], and it was not found to be significant (F(1,152) = 1.71, p = 0.193), with a negligible effect size (partial η² = 0.011). This indicated that the results were not confounded by test order. The within-subjects variable showed that the mean difference between OPI/OPIc scores was 0.19, 95% CI [0.07, 0.30], and it was found to be significant (F(1,152) = 10.44, p = 0.002), although the effect size was still small (partial η² = 0.06).
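For readers unfamiliar with the effect size used here, partial eta squared is the proportion of variance attributable to an effect after excluding variance explained by other effects, computed from the ANOVA sums of squares as η²p = SS_effect / (SS_effect + SS_error). Values of 0.011 and 0.06 thus indicate that test order explained almost none, and test method only a small share, of the variance in scores.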

Test Method Preference

To better understand students' test method preferences, the postsurvey results were analyzed. One hundred forty (91%) of the examinees completed the postsurvey, although some only answered some of the questions. The Likert scale ranged from 1 ("dislike extremely") to 9 ("like extremely"). For the question about the OPI, the mean was 6.86 (n = 140, SD = 1.78), which fell between "like slightly" and "like moderately" (see Figure 2). For the question on the OPIc, the mean was 5.27 (n = 139, SD = 2.08), which fell between "neither like nor dislike" and "like slightly" (see Figure 2). To see if the difference was statistically significant, the nonparametric Friedman rank order test was performed, as it is used with repeated measures. This test revealed a significant difference in preference, χ²(1) = 40.01, p < 0.001, with examinees preferring the OPI overall.

The examinees were then divided into three groups: those who preferred the OPI, those who preferred the OPIc, and those with no preference. The groups were determined by looking at the preference rating for each test and subtracting the OPIc rating from the OPI rating. Those with positive values were placed in the OPI preference group (n = 94), those with negative values were in the OPIc preference group (n = 25), and those with 0 were in the no-preference group (n = 20). Figure 3 displays a scatterplot of the Likert scale responses, with the OPI preference group in the lower right corner, the OPIc preference group in the upper right corner, and the no-preference group along the diagonal.
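A sketch of this grouping rule (a hypothetical function, matching the arithmetic above):

```python
# Hypothetical sketch: classify an examinee from the two 1-9 Likert ratings.
def preference_group(opi_rating: int, opic_rating: int) -> str:
    diff = opi_rating - opic_rating   # positive: the OPI was rated higher
    if diff > 0:
        return "OPI preference"
    if diff < 0:
        return "OPIc preference"
    return "no preference"
```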

FIGURE 2

Distribution of OPI vs. OPIc Preference Using Likert Scale


FIGURE 3

Scatterplot of OPI vs. OPIc Preference Using Likert Scale

This figure shows some interesting trends. First, there were only 20 participants (14.3%) who were ambivalent about the version of the test. These participants varied from disliking both exams on the lower end, to having no preference in the middle, to being satisfied with and liking both exam formats. This indicates that 119 participants (85%) liked one test format more than the other (see Table 2). Second, there were 39 participants (28%) who liked the OPI ("like slightly" to "like extremely") but disliked the OPIc to some degree, but there were only seven participants (5%) who liked the OPIc but disliked the OPI, indicating that there might be a stronger preference for in-person interviews.

TABLE 2

Matrix of OPIc vs. OPI Preferences

[9 × 9 cross-tabulation of examinees' Likert ratings of the OPI (rows, "Like extremely" through "Dislike extremely") by their ratings of the OPIc (columns, "Dislike extremely" through "Like extremely"); the individual cell counts are not recoverable from this copy.]

To ascertain why examinees preferred one test to the other, the responses to the open-ended question were analyzed. A theme analysis was performed on the comments to determine important commonalities and differences. The responses were then coded based on the categories (themes) established and assigned to the appropriate category or categories, as some of the participants' comments addressed more than one established category.

A total of 132 students explained their level of satisfaction with the OPI and OPIc in response to the open-ended question "As a test taker, which method did you prefer and why?" The majority of the students (94 students; 71%) preferred the OPI to the OPIc. Only 27 (21%) students expressed a preference for the OPIc, and 11 (8%) had no preference about the form of the exam. One hundred twenty-eight of the 132 students also wrote comments explaining the reason for their ratings; four (3%) of the participants did not provide a reason. Of those who preferred the OPI, a total of 90 comments were offered, which are presented in order of frequency below:

1. More natural (n = 78; 87%). Many students stated that the OPI felt more like a conversation, since they were talking in real time to an actual person. This "more natural" feeling apparently impacted their view of the OPI as a better measure of actual oral proficiency. Some of the participants felt that they were better able to understand a live person and sometimes struggled when the computer was speaking. In addition, the participants felt that the interviewer could elaborate and thus make the questions more understandable. These were some of the comments:


• "I much preferred the OPI because it felt a lot more natural and organic having a real person asking you real questions that were relevant to my life, affirming my statements, commenting on what I said, and forming questions based off what I said." (emphasis in original)

• "I preferred the OPI because I feel that it was a better measure of my Spanish speaking skills since I will rarely find myself in a situation where I will need to use my Spanish in order to speak to a computer."

• "I liked speaking to a live person rather than speaking to a computer. It made it easier to talk and to understand the questions being asked."

• "I preferred the OPI method because I took the test more seriously when speaking to a real person. I was also able to ask for clarification on questions or phrases I did not understand."

• "I prefer the OPI. The conversation came more naturally. Plus, the fact that the interviewer was able to relate each question to my previous response allowed an easier transition from topic to topic. This also allowed me think for myself which vocabulary I wanted to use and which direction I'd like to take the conversation. The experience with the OPI seemed more natural and realistic, which I liked."

2. Feedback (n = 40; 44.5%). The students found that the immediate feedback from the interviewer made them feel more comfortable and helped them communicate more effectively. Even though ACTFL-trained OPI raters do not assist interviewees as they look for words and phrases, the students were still able to gauge whether or not they were being understood thanks to the feedback interviewers offered. The students wrote the following:

• "Taking the OPIc I got a little tired of talking with no one listening, even though I knew that my responses would be graded later. I guess just not having a real person give any dynamic feedback in the moment was what caused that."

• "I quickly grew tired of the OPIc because I didn't receive any feedback during the interview. For me, that feedback is important, even if it is just, 'Yes' or 'No.'"

• "I preferred speaking to a live interviewer because she could respond with small vocal cues mid-sentence to confirm that she was understanding the thrust of my idea. These included things like, 'Sí,' 'Mm-hmm,' 'Claro,' and even simply the audible breathing cues that help us navigate when each person will speak."

3. Better topics (n = 13; 14%). The OPIc uses a presurvey to determine the speaker's topics of interest and then chooses questions from a database related to those topics. While this is done to tailor the exam to the students, some students still found that the OPI provided more interesting discussion topics. The students stated the following:

• "I preferred the OPI because it was easier to guide the conversation to topics that I feel comfortable talking about.... Also, with the OPI, you can simply tell the tester that you don't like that question (they tell you that at the beginning), and they'll ask you something different."

• "While taking the OPI, the interviewer was able to redirect me when I was not answering the question with enough detail, etc. I was also able to talk to the interviewer about things that are more personal to me (my job, my classes, my hobbies) because there was a dynamic conversation going on rather than a question-and-answer session."

• "I liked the OPI because I could talk about the things that I wanted to talk about. If I didn't know much about a subject, we would change the subject."

4. More time (n = 4; 4%). The students felt that the artificial time limit set by the OPIc sometimes interrupted them as they were speaking and did not allow them to finish their ideas. Having the human interviewer allowed them to use the necessary time to complete their ideas.

• "OPI. I wasn't just randomly cut off with time. On the OPIc that was really annoying. Not knowing the time limit on each question to get what I wanted to say across. The OPI allowed the interviewer to help me get a better idea."

• "The OPI method felt more personal, even though I feel I did worse. I feel I could actively engage in the role-plays and discussions. In addition, the interviewer gave me time to finish my responses, whereas the OPIc did not always."

• "The conversation also goes more smoothly—it's not such a brusque transition from topic to topic. The tester naturally moves on to the next topic instead of the tester running out of time like can happen on the OPIc—that happened to me on a few questions and I would panic, wondering if that was a bad thing and what would happen if I hadn't addressed all aspects of the question in the allotted time."

In spite of the fact that the majority of the students preferred the OPI, 27 students explained in the qualitative comments why they preferred the OPIc. Not only did fewer students prefer the OPIc, but the range of reasons also varied considerably and often fell in more than one category. The themes, listed in order of frequency, that emerged for the 27 participants who preferred the OPIc were as follows:

1. Less anxiety due to lack of real personal interaction (n = 21; 78%). The majority of people who favored the OPI commented that they preferred having a live person who was able to accommodate their style, questions, and pace. The opposite was true for the students who felt more comfortable talking to the computer due to the total absence of a live person. They attributed the reduced anxiety to the lack of pressure from a specific live person who would be perceived as consciously or unconsciously evaluating each phrase and sentence, and noted that the avatar was easier to understand than a live person.

• "Computer-generated questions are easier for me to calmly compose an answer and give my response. I relate to objects (things) more easily than people."

• "I preferred the computer. The point of the OPI is to express yourself in Spanish, and talk as much as possible, and this was easier to do with the computer ... just to concentrate on speaking a lot and trying to express myself. I didn't have to worry about being interrupted or saying weird things, I could just blab on and on until the computer automatically stopped me."

• "The OPIc. I don't know why people would be more uncomfortable speaking to an avatar than to a real person; it seems counterintuitive. Speaking with a real person is incredibly nerve-wracking, especially as it can be difficult to hear them through the phone. The OPIc was always very clear, and I didn't feel worried about what the other person would think of my answers or how they would respond."

• "I preferred the OPIc because I didn't get as nervous while speaking, helping me to maintain my train of thought. It was easier for me to keep going after I made a mistake."

2. Repeat questions (n = 11; 41%). On the OPIc, students could ask the avatar to repeat the questions in a totally nonjudgmental assessment context. Some sample comments included the following:

• "I actually really liked the OPIc. I didn't feel uncomfortable at all. I thought it was nice that you could repeat the question, especially because some of the questions were really long and involved. And it was nice to be able to move on when you wanted to."

• "The OPIc actually helped me remain more calm and think through my answers more. I wasn't as nervous, and I could repeat the question with the push of a button."

• "I preferred the OPIc because there is no chance of the recording questioning your points or ideas. It is also more comfortable to repeat the question as many times as you want to make sure you understand."

3. Flexibility (n = 7; 26%). Some students felt less restricted by the OPIc and felt that they had more freedom to express their own ideas and thoughts because no one else was guiding the conversations. This is due in part to the fact that the OPIc uses a presurvey to determine the topics of interest and then selects from a database of questions related to those topics. A few students found that this system benefited them, and they felt more comfortable with these questions:

• "The OPIc, because I could expound on what I knew how to talk about, rather than trying to answer questions they asked me that I didn't necessarily have a vocabulary for."

• "I like the OPIc because I felt like I could go off on tangents a bit more because I was talking to a computer. I just talked as much as I could until my time was up. The OPI had good aspects to it as well, but I think I felt less nervous talking to an avatar."

• "I felt more comfortable speaking to the computer because I got to influence the topics that were discussed instead of having them chosen for me by the interviewer. I liked that my responses were being recorded because then I could express my thoughts and not have an awkward pause if I made a mistake. When I messed up in the computerized test I could fix it if I caught it without disrupting my flow of expression."

A small percentage of students preferred neither the OPI nor the OPIc and rated them the same. Of the 132 participants, 11 (or 8%) either found positive aspects of both exams or found both of them to be problematic:

• "Taking the OPI was easier in the fact [sic] to the test being a conversation, which moved you from one question to the next, but the OPIc was easier in the fact that you are not actually talking to someone, so you don't get as nervous."

• "I preferred the flexibility of the OPIc, but I preferred talking to a real person on the OPI. Both had their advantages."

• "I liked both of them. It was really nice to have responses from a real person and be able to have a conversation, but it was nice to be able to talk as much as I wanted in the OPIc. It lessened the pressure to not have someone respond right away. Both were actually pretty fun."

• "I like that the OPI interviewer could reword the question or stop me if I misunderstood him. It felt more like a conversation, but I hate talking on the phone with people so I get really flustered when I mess up. The OPIc was better for me in that I could talk without feeling self-conscious and I didn't panic as bad. I may have answered questions wrong, but I could talk more fluently without feeling embarrassed."

• "Neither. I hate tests."

The quantitative data from this study reveal that the students performed better on the OPIc but that, in contrast, they demonstrated a strong preference for the OPI. The qualitative data give valuable insights into the why of the students' preferences. The quantitative data and the qualitative data together reveal that different types of learners prefer different assessment experiences and thus favor either the OPI or OPIc.

Discussion

First, the results indicate that 55% of the students received exactly the same rating; 32% scored higher on the OPIc and 13% scored higher on the OPI. Adjacent category agreement was 97%. However, students received ratings on the OPIc that were statistically significantly higher on average, although the overall effect size was quite small, indicating that even though differences did exist between students' scores on the two versions of the assessment, the practical significance of this difference was minimal. This confirms that both assessments are reliable and valid measurements of oral proficiency as defined by the ACTFL Proficiency Guidelines in spite of small variations in performance across the two exams.

That said, it is important to point out that differences of one sublevel within the major levels can be a concern, especially since receiving a score of Intermediate High instead of Advanced Low has implications for teacher certification in many states, though in this study only 4 of the 154 examinees (2.6%) would have been affected by this rating difference. Similarly, for dual immersion programs, the difference between a rating of Advanced Low and the required level, Advanced Mid, is personally, if not statistically, meaningful. Thus, the choice of exam format may facilitate achieving the credentials that are required for certain career paths.

It is also interesting to point out that a centralizing tendency with the OPIc was observed; that is, the data suggest that students moved more toward the middle of major levels (Intermediate Mid, Advanced Mid) and away from the lower and higher ends of these major levels. This was especially true of students at the Advanced level. Twenty-three of the 48 (or 48%) students who scored Advanced Low on the OPI scored at the Advanced Mid level on the OPIc. In addition, one of the concerns raised in this study was whether students could be pushed to perform at the Superior level by the OPIc or if a trained OPI rater would be required to encourage candidates to produce the level of speech and language usage that are needed to demonstrate very high levels of proficiency. Of the 13 students who scored Advanced High or Superior with the OPI, 8 (62%) received lower scores on the OPIc, with 2 of those individuals dropping from Advanced High to Advanced Low. Because three participants tested into the Superior level on the OPIc, it is clear that students can place into this level on this version of the assessment; however, a larger percentage of students produced speech samples that were rated Advanced High or Superior when interacting with a live interviewer, when taking the OPI. Thus, it will be important to investigate ways to push Advanced High/Superior border-level speakers to produce more complex speech that exemplifies higher-level functions on the OPIc without the prompting of a human interviewer.

It is also interesting that students overwhelmingly preferred the OPI in spite of the fact that, on average, they scored either the same or slightly higher on the OPIc. In the majority of cases, students in this study found the OPI to be more natural, more enjoyable, and more like a real speaking exam; they appreciated interacting with a real person who could answer their questions and challenge them to produce the highest level of speech that they were capable of maintaining. Finally, they felt that the OPI provided a better measure of their proficiency, reinforcing the face validity of the OPI over the OPIc. Interestingly, those who preferred the OPIc, while fewer in number, cited the exact opposite reasons: Talking to a real person increased their level of anxiety and made them feel that they were being judged for their mistakes. While some students found that talking to a computer was completely unnatural, others felt that the computer's nonjudgmental presence and the opportunity to have questions repeated multiple times allowed them to be more relaxed and thus better able to demonstrate their actual speaking ability.

These results clearly indicate that individual students' personal characteristics and preferred interpersonal style must be taken into consideration when selecting an assessment format. Because students' preferences varied widely, preparing students to take oral proficiency exams is critical. The students in this study did not receive any background information or specific instructions about the assessments prior to taking the OPI or the OPIc, other than that one was a telephone interview and the other was computer-based. It is thus unclear if prior knowledge about the structure of the exam and the kinds of speech tasks that were required at each progressive level on the proficiency scale may have worked to students' benefit; for example, perhaps students who had been expressly informed that they must be able to support and defend an opinion rather than simply share personal stories to receive a rating at the higher proficiency levels may have pushed harder to demonstrate the required skills. At the very least, informing students about the opportunity to select questions, that questions could be repeated, and that their speech samples would be limited to a certain number of minutes for each question may have increased their satisfaction with the OPIc.

Finally, the setting in which these learners acquired their second language seemed to have a direct impact on their exam preference. While the majority of the participants preferred the OPI to the OPIc, those participants who had extensive immersion experiences showed an even higher preference for the OPI. Because many of these participants had spent 18–24 months in a foreign country or in a region of the United States where they could be fully immersed in the Spanish language, these participants were more comfortable with and accustomed to speaking in a more naturalistic setting and felt that the truly interpersonal nature of the OPI allowed them to provide a much better representation of their abilities. Interestingly, even though these students preferred the OPI, they tended to be rated slightly higher on the OPIc.

Conclusion

The results from this study inform language educators' understanding of potential differences between the OPI and the OPIc as well as the extent to which the OPIc can serve as a viable substitute for the OPI. Understanding and explaining to students the differences in assessment formats and expectations at each proficiency level, as well as taking test takers' preferences into consideration, will help universities, businesses, and governmental agencies support individuals when selecting the test administration format that best meets their needs, financial means, and personal preferences.

This study also raises many questions as well as possible areas for future research, particularly about the effectiveness of the OPIc for students who are between major levels. In addition, since more than 31% of the participants scored better on the OPIc than on the OPI and more than 13% scored better on the OPI than the OPIc, it will be important to examine in even greater detail the impact of students' personal characteristics, environmental factors, depth of knowledge about and experience with different exam formats, awareness of the kinds of language that must be demonstrated at each proficiency level, type of prior language acquisition experiences, proficiency level, and attitudes toward interpersonal or technology-based exams to tease out the causes of the slight differences in scores between the two formats. Since this research shows that more than 70% of the participants in this study preferred the more costly version (the OPI), programs can decide whether they have the resources to support individuals who choose the OPI, if they can justify requiring the more expensive version of the assessment, and if they have time to prepare students to complete either form by overtly teaching students at all levels to provide the richest speech sample and fully demonstrate the tasks and use language in the contexts that particularly characterize higher levels of proficiency.

In conclusion, since many important factors play a role in assessing an individual's oral proficiency and since these exams have important implications for candidates' future success in business, education, and government, it is important to consider the accuracy, incurred costs, and satisfaction of those who take them. While this research indicates that both assessments are equivalent measures of oral proficiency, many aspects of the exams and the examinees require further study.

Note

1. Note that while the 2012 Proficiency Guidelines added Distinguished as the highest category, the interview protocol only tests through Superior.

Acknowledgments

We would like to express our gratitude to our research assistants Doug Porter and Steve Stokes for their invaluable help in organizing and analyzing the data for this study, and we would also like to thank the three anonymous reviewers and Anne Nerenz for their valuable comments and insights on the different drafts of this article.

References

Abadin, H., Fried, D., Good, G., & Surface, E. (2012). Reliability study of the ACTFL OPI in Chinese, Portuguese, Russian, Spanish, German, and English for the ACE review. Raleigh, NC: SWA Consulting Inc.

ACTFL. (2009a). Oral Proficiency Interview (OPI): Testing for proficiency [Electronic version]. Retrieved July 23, 2015, from http://www.actfl.org/professional-development/certified-proficiency-testing-program/testing-proficiency?pageid=3348

ACTFL. (2009b). Oral Proficiency Interview (OPI): ACTFL Oral Proficiency Interview–computer (OPIc) [Electronic version]. Retrieved July 23, 2015, from http://www.languagetesting.com/oral-proficiency-interview-opi-2

ACTFL. (2012). ACTFL proficiency guidelines 2012 [Electronic version]. Retrieved August 15, 2015, from http://www.actfl.org/publications/guidelines-and-manuals/actfl-proficiency-guidelines-2012

Chalhoub-Deville, M., & Fulcher, G. (2003). The Oral Proficiency Interview: A research agenda. Foreign Language Annals, 36, 498–506.

Kenyon, D., & Malabonga, V. (2001). Comparing examinee attitudes toward computer-assisted and other oral proficiency assessments. Language Learning & Technology, 5, 60–83.

Kiddle, T., & Kormos, J. (2011). The effect of mode of response on a semidirect test of oral proficiency. Language Assessment Quarterly, 8, 342–360.

Liskin-Gasparro, J. (2003). The ACTFL proficiency guidelines and the Oral Proficiency Interview: A brief history and analysis of their survival. Foreign Language Annals, 36, 483–490.

Malabonga, V., Kenyon, D. M., & Carpenter, H. (2005). Self-assessment, preparation and response time on a computerized oral proficiency test. Language Testing, 22, 59–92.

Malone, M. E. (2003). Research on the Oral Proficiency Interview: Analysis, synthesis, and future directions. Foreign Language Annals, 36, 491–497.

Mousavi, S. A. (2009). Multimedia as a test method facet in oral proficiency tests. International Journal of Pedagogies and Learning, 5, 37–48.

Norris, J. (2001). Concerns with computerized adaptive oral proficiency assessment. Language Learning & Technology, 5, 99–105.

Surface, E., & Dierdorff, E. (2003). Reliability and the ACTFL Oral Proficiency Interview: Reporting indices of interrater consistency and agreement for 19 languages. Foreign Language Annals, 36, 507–519.

Surface, E., Poncheri, R., & Bhavsar, K. (2008). Two studies investigating the reliability and validity of the English ACTFL OPIc with Korean test takers: The ACTFL OPIc validation project technical report. Retrieved August 15, 2015, from http://www.languagetesting.com/wp-content/uploads/2013/08/ACTFL-OPIc-English-Validation-2008.pdf

SWA Consulting Inc. (2009). Brief reliability report 5: Test-retest reliability and absolute agreement rates of English ACTFL OPIc proficiency ratings for double and single rated tests within a sample of Korean test takers. Raleigh, NC: Author.

Submitted October 7, 2015

Accepted November 30, 2015
