7/29/2019 AME Article
1/56
For Peer Review Only
THE REVISED SAT SCORE AND ITS MARGINAL PREDICTIVE
VALIDITY
Journal: Applied Measurement in Education
Manuscript ID: HAME-2012-0095
Manuscript Type: Empirical Article
Keywords: Predictive Validity, SAT, College Admissions, Revised SAT
URL: http://mc.manuscriptcentral.com/hame Email: [email protected]
Applied Measurement in Education
ABSTRACT
This paper explores the predictive validity of the Revised SAT (R-SAT) score as an
alternative to the student's SAT score. Freedle proposed this score for students who may
potentially be harmed by the relationship between item difficulty and ethnic DIF observed in the
test they took in order to apply to college. The R-SAT score is defined as the score a minority
student would have received if only the hardest questions on the test had been considered, and it
was computed using formula scoring and an inverse regression approach. Predictive validity for
short- and long-term academic outcomes is considered, as well as the potential effect on the
overprediction and underprediction of grades among minorities. The predictive power of the R-
SAT score was compared to the predictive capacity of the SAT score and to the predictive
capacity of alternative Item Response Theory (IRT) ability estimates based on models that
explicitly considered DIF and/or were based on the hardest test questions. We found no evidence
of incremental validity in favor of the R-SAT score or of the IRT ability estimates.
R-SAT Predictive Validity
THE REVISED SAT SCORE AND ITS MARGINAL PREDICTIVE VALIDITY
Introduction
One way that admission examinations are judged is by how well they are able to predict
college outcomes. Predictive validity studies analyze the degree of association between
admissions test scores and college outcomes, such as college grades and graduation rates.
Academic outcomes are relatively easy to collect and are also related to the behavior that tests
like the SAT are expected to predict, success in college. Some studies have also addressed the
prediction of nonacademic outcomes such as earnings, leadership, job satisfaction, satisfaction
with life, and civic participation (Bowen & Bok, 1998; Allen, Robbins & Sawyer, 2010;
Oswald, Schmitt, Kim, Ramsay & Gillespie, 2004; Willingham, 1985).
In this study, we examine a measure of academic preparedness that has been proposed to
complement the SAT. This measure, the Revised-SAT or R-SAT, was proposed by Roy Freedle
(2003) with the goal of correcting the unfairness of SAT results for minorities he found through
his application of the Standardization method for DIF (Dorans & Kulick, 1983, 1986; Dorans &
Holland, 1992). The R-SAT was proposed as a score based on a subset of the SAT questions. We
will judge its success using the result of predictive validity analyses of short and long-term
outcomes.
This article is divided into five sections. The first section summarizes previous research
on the prediction of college outcomes. The research question for this investigation, the data
sources, and methods are presented in the next three sections. Lastly, the results section presents the
findings obtained when calculating the revised SAT score, and using it to predict academic
outcomes. The predictive capacity of the R-SAT score will be compared to that of the original
SAT score and three Item Response Theory (IRT) versions of the SAT score.
Prior Research on Prediction of College Outcomes
There is a substantial body of research on the validity of multiple variables to predict
college outcomes in a wide range of dimensions: education, employment and social outcomes.
This section presents a brief overview of this literature, with a particular focus on the role of high
school grades and standardized test scores in the prediction of (i) college grades and (ii)
graduation rates. Although these outcome indicators offer only a partial portrayal of students'
educational achievement, the convenience of their collection and updating makes them the
outcomes most commonly used in predictive validity studies. A subsequent section describes
recent studies examining the prediction of nonacademic college outcomes and the role of
non-cognitive predictors.
Freedle proposed computing a new score based on the hardest questions of the most
widely taken standardized test in the US in order to compensate for the potentially unfair results
of minority students he found when analyzing differential item functioning and its relationship to
item difficulty. Details on the calculation of the R-SAT and Freedle's expectations for this index,
as well as the criticisms made of it, are presented in this section.
College Grade Point Average
The relationship between high school grade point average, SAT scores, and freshman
grade point average has been widely examined by researchers at the College Board and research
units within higher education institutions (Ramist, Lewis, & McCamley-Jenkins, 1994; Geiser &
Studley, 2004). In general, College Board studies find that SAT scores make a substantial
contribution to predicting cumulative college GPAs and that the combination of SAT scores and
high school records provide better predictions than either grades or test scores alone (Burton &
Ramist, 2001; Hezlett, Kuncel, Vey, Ahart, Ones, Campbell & Camara 2001). College Board
researchers have studied the validity of the SAT mostly using correlational analysis and have
taken into consideration the technical issues of range restriction, differences in grading across
colleges and unreliability of college grades to measure success in college (Camara &
Echternacht, 2000; Willingham, Lewis, Morgan & Ramist, 1990).1
Typical correlations between first-year grades and the SAT I (Verbal and Math scores
combined) range between 0.3 and 0.6 depending on the characteristics of the studies with an
average of 0.4 (Ramist, Lewis & McCamley-Jenkins, 1994; Zwick, 2002). Bridgeman, Pollack,
and Burton (2004), for example, report a correlation of 0.55 between freshman grades and the
SAT I composite score; while the SAT Verbal score has a correlation of 0.50 with freshman
grades, the SAT Math correlates 0.52.2 On average, the measurement error of the SAT I
Math and Verbal sections is 30 points, and the correlation with the outcome criterion tends to be
weaker when measurement error is considered (Zwick, 2002).
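The range-restriction adjustment these studies apply can be illustrated with Thorndike's Case II formula for direct restriction on the predictor. This is a standard textbook correction, not code from the cited studies, and the standard-deviation ratio used in the example is illustrative only:

```python
import math

def correct_range_restriction(r_restricted, sd_restricted, sd_unrestricted):
    """Thorndike Case II correction for direct range restriction on the
    predictor: estimates the predictor-criterion correlation in the
    unrestricted applicant pool from the correlation observed in the
    restricted (e.g., enrolled-student) sample."""
    u = sd_unrestricted / sd_restricted  # ratio of SDs; u > 1 under restriction
    return r_restricted * u / math.sqrt(1.0 + r_restricted**2 * (u**2 - 1.0))
```

For instance, an observed correlation of 0.35 in a sample whose SAT standard deviation is 60% of the applicant pool's corrects to roughly 0.53, comparable in magnitude to the adjusted values reported in these studies.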
Standardized tests allow all applicants the opportunity to perform in an environment with
the same testing conditions, instructions and time-constraints, opportunities to ask questions and
procedures for scoring. Standardized test scores permit comparisons among students who
come from different schools, in which grading standards can vary significantly. Zwick (2002),
aware that SAT scores add little predictive power to high school grades, justifies the use of the
standardized test scores in admissions to large institutions by noting the cost of interviewing
1 Studies that do not adjust for range restriction and variations in grading standards tend to lower the
observed correlations, underestimating the predictive power of the indexes used in admissions processes (Camara &
Echternacht, 2000).
2 They also report a correlation between high school grades and first-year college grades of 0.58.
candidates or reviewing applications in elaborate detail. The cost for the school of collecting and
processing the scores is minimal. In addition, standardized test scores help reduce the
overprediction of African American college grades observed when using high school grades
alone.3
In 2005 the SAT I was revised in a number of ways (Kobrin, Patterson, Shaw, Mattern &
Barbuti, 2008): analogies were removed and replaced with more questions on reading passages
and the Verbal section was renamed the Critical Reading section. The Math section now includes
items from more advanced courses and does not include quantitative comparison items. In
addition, a third test was added including multiple-choice items on grammar and a student-
produced essay. Kobrin et al. report a correlation between test scores and first-year college
grades similar to that from previous studies (unadjusted r = 0.35, r adjusted for range
restriction = 0.53), conclude that the new writing test is the most predictive based on bivariate
and multiple correlations (unadjusted r = 0.36, r adjusted for range restriction = 0.51),
and encourage institutions to use both high school GPA and test scores when making
admissions decisions, since that maximizes the predictability of first-year college grades (unadjusted
r = 0.46, r adjusted for range restriction = 0.62).
Relative Predictive Validity of Different Academic Indicators
Previously, Geiser and Studley (2002), from the University of California, analyzed the
relative contribution of high school GPA, SAT I and SAT II scores to the prediction of college
success and found that SAT II scores were the best single predictor of first year GPA, and that
3 Overprediction means that a group's average predicted first-year grade point average (GPA) is greater than its
average actual first-year GPA. Although this problem is known to be present in the SAT I for African American and
Hispanic students, Ramist et al. (1994) find it even more strongly when using only high school GPA to predict first-
year college grades.
the SAT I scores added little to the prediction once SAT II scores and high school GPA were
already considered.4
After taking the SAT II and high school GPA into consideration, the SAT I
scores improved the overall prediction rate by a negligible 0.1% (from 21.0% to 21.1%). The
standardized coefficient of the SAT I, after controlling for SAT II and high school GPA, was
0.07, but statistically significant due to the large number of observations used. Geiser & Studley
(2002) analyzed a sample of 80,000 freshmen who entered the University of California from fall
1996 to fall 1999 using regression analysis. Their results were confirmed by subsequent findings
from College Board researchers (Ramist et al., 2001; Bridgeman, Burton & Cline, 2001; Kobrin,
Camara & Milewski, 2002). For a more detailed review of these articles, see Author (Year).
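The incremental-validity comparisons these studies report amount to the change in R² between nested regression models. The sketch below uses simulated data (not the UC data; all coefficients are illustrative) to show why a predictor that is largely redundant with the baseline predictors yields a tiny ΔR²:

```python
import numpy as np

def r_squared(X, y):
    """R^2 from an ordinary-least-squares fit with an intercept."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return 1.0 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))

# Simulated stand-ins: HSGPA, an SAT II composite, and an SAT I score that
# is largely redundant with the SAT II composite.
rng = np.random.default_rng(0)
n = 1000
hsgpa = rng.normal(size=n)
sat2 = 0.5 * hsgpa + rng.normal(size=n)
sat1 = 0.6 * sat2 + rng.normal(size=n)
fygpa = 0.4 * hsgpa + 0.3 * sat2 + 0.02 * sat1 + rng.normal(size=n)

base = r_squared(np.column_stack([hsgpa, sat2]), fygpa)       # baseline model
full = r_squared(np.column_stack([hsgpa, sat2, sat1]), fygpa)  # + candidate
delta = full - base  # incremental validity; small, echoing the 0.1% above
```

With a large n, even a negligible ΔR² like this can be statistically significant, which is the pattern Geiser & Studley describe.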
Based on the findings from multivariate analyses considering multiple academic
predictors of college performance, such as the one conducted by Geiser & Studley (2002), the
National Center for Fair and Open Testing (FairTest) has stated that the SAT I has little value in
predicting future college performance (FairTest, 2003) and highlights the better performance of
class rank, high school grades, and SAT II scores. Others, however, have chosen to advocate for
admissions tests that focus on achievement and are based on standards and criteria
(Atkinson & Geiser, 2009).
Prediction by Ethnic Group and Gender
Notable differences in the validity and predictive accuracy of SAT scores and high school
grades by race and sex have been substantiated through numerous studies (Young, 2004). The
accuracy of high school grades and SAT scores for predicting freshman grade point average is
higher for women, Asian Americans and White students, and lower for men, African Americans
4 Geiser & Studley (2002) combined three SAT II scores into a single composite variable that weights each SAT II
test equally; they did not analyze the predictive validity of separate test scores.
and Hispanics. Furthermore, these admissions variables often overpredict the grades of African-
Americans and Hispanic students, and underpredict those of women (Burton & Ramist, 2001).
Ramist et al. (1994) report an overprediction of first-year GPA of -0.16 for African American
students and of -0.13 for Hispanic students when considering HSGPA and SAT scores. Geiser &
Studley (2002), on the other hand, found no significant over-prediction for African Americans
and an average overprediction of -0.04 for Hispanic students when including high school GPA
and SAT I scores in the regression equation. Zwick, Brown & Sklar (2004) conducted the same
type of analyses for each of the University of California campuses and for two merged-cohorts
(1996-1997 and 1998-1999). Their results vary significantly by the campus and merged-cohort
analyzed but were interpreted by the authors as supporting previous findings from the literature.5
There are a number of theories about the reasons for over- and underprediction; for details,
see Zwick, Brown, & Sklar (2004), Zwick (2002), and Steele and Aronson (1998).
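The over- and underprediction quantities discussed above are group means of regression residuals: actual minus predicted first-year GPA. A minimal sketch, with hypothetical data, following the sign convention of values such as the -0.16 reported above:

```python
import numpy as np

def prediction_error_by_group(actual_gpa, predicted_gpa, groups):
    """Mean residual (actual minus predicted first-year GPA) by group.
    A negative mean residual indicates overprediction for that group:
    the model predicts higher grades than the group actually earns."""
    actual = np.asarray(actual_gpa, dtype=float)
    pred = np.asarray(predicted_gpa, dtype=float)
    groups = np.asarray(groups)
    return {g: float(np.mean(actual[groups == g] - pred[groups == g]))
            for g in np.unique(groups)}
```

The predicted GPAs would come from a regression of first-year GPA on HSGPA and test scores fit to the pooled sample; the function above only summarizes the residuals by group.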
More recently, researchers have looked at the differential prediction of test scores and
high school grades among students from different language backgrounds (Zwick & Schlemer,
2004; Zwick & Sklar, 2005) and from schools with different financial and teaching resources
(Zwick & Himelfarb, 2011) as a way to investigate possible explanations for the issue of over-
and underprediction. Results show a reduction of prediction error for Hispanic and African
American students, but not a complete elimination (from -0.15 to -0.08 and from -0.13 to -0.03,
respectively), when using the second approach, but no change when considering first language:
5 Zwick, Brown & Sklar (2004) observed no significant differences in the overprediction of minorities whether the
SAT IIs were considered instead of the SAT I and whether income and parental education were included in the
regression equation. Geiser & Studley (2002) also reported no practical change in the overprediction of minority
groups when examining the predictive power of using SAT II scores instead of SAT I scores, but the underprediction
of African American students grows to 0.03 and the overprediction for Hispanic students grows to 0.08.
overprediction is still observed for African American and Hispanic students when considering
first language.
Relative Predictive Validity Using Multivariate Regression Analysis and Considering
Sociodemographic Variables
Parental income and education play a modest role in the prediction of college
performance when controlling for additional academic indicators such as high school grades and
standardized tests. Geiser & Studley (2002), for example, reported standardized coefficients that
ranged between 0.03-0.04 and 0.05-0.06, respectively.6
The modest standardized coefficients associated with parental income and education were also
reported by Bowen & Bok (1998) when using multivariate regression analysis to predict college
performance.7
The consideration of sociodemographic variables in the predictive validity regression
equation, however, is based on the results of Rothstein (2004), who finds that most of the SAT's
predictive power comes from its correlation with unobserved variables such as high school
sociodemographic characteristics.8 Rothstein's estimates show that the predictive contribution of the
SAT I score is 60% lower than would be indicated by traditional methods.
6 The R2 of the regression equation that included high school GPA, SAT I, and SAT II scores increased from 22.3%
to 22.8% when parental income and education were considered.
7 Performance in college was measured as percentile rank based on the cumulative GPA of the entering cohort, rather
than freshman grade point average, as a way to avoid school and major differences in grading philosophies and
practices (pages 72 to 76). The book also looks at differences in economic outcomes (such as employment, wages
and job satisfaction) and social outcomes (such as civic contribution, marital status and satisfaction with quality of
life).
8 The student-level variables he included are individual race and gender. The demographic makeup of the school
was described by the fraction of students who were Black, Hispanic, and Asian; the fraction of students receiving
subsidized lunches; and the average education of students' parents.
Controversy mounted between Geiser & Studley (2002) and Zwick and her colleagues
(Zwick et al., 2004; Zwick & Green, 2007) around the issue of whether the SAT I or the SAT IIs
were more sensitive to sociodemographic characteristics. This argument fueled the discussion
that prompted the modifications made to the SAT I in 2005, much in line with those suggested by the
University of California (Atkinson & Pelfrey, 2004). The sensitivity of scores from different test
types to sociodemographic characteristics has also been prominent in the discussion of whether
general aptitude tests (like the SAT I) or curriculum-based tests (more like the SAT IIs) should be
used for college admissions (Atkinson & Geiser, 2009). For more details about the controversy,
see Author (Year, pp. 107-109).
College Graduation
The ultimate goal of post-secondary education is college graduation; still, this goal is
elusive. According to Baum & Ma (2007), people with a college degree earned, on average, 62
percent more than individuals with only a high school diploma in 2005. According to the
National Educational Longitudinal Study (NELS)9, 59% of those who started college earned
bachelor's degrees by age 26 (Bowen, Chingos & McPherson, 2009). The National Center for
Higher Education Management Systems (NCES, IPEDS, 2007) reports that only 77.4 percent of
first-time, full-time students attending a four-year institution returned to that institution for their
second year of college in 2005 (this information excludes students who transfer to another
institution). Studies typically find that women are slightly more likely to graduate from college
than men and that African Americans, Hispanics and Native Americans have a lower rate of
graduation than White students (Astin, Tsui & Avalos, 1996; Bowen and Bok, 1998).
In general, studies exploring the role of SAT scores and high school grades in college
persistence and college graduation find a moderate relationship between these college outcomes
9 NELS surveyed students who were in eighth grade in 1988, most of whom graduated from high school in 1992.
and preadmission measures (Astin et al., 1996; Burton & Ramist, 2001; Mattern & Patterson,
2009, 2011a, 2011b). Although the traditional variables included in the multivariate regression
models explain a small proportion of the variance, Author (Year) found high school grades to be
the strongest predictor of college persistence, followed by the SAT II Writing scores. The
importance of high school grades was corroborated by Zwick & Sklar (2005). Sociodemographic
variables play a minor role in explaining college persistence and graduation (Author, Year);
nevertheless, Bowen & Bok (1998) found these variables to be more important in the prediction
of college outcomes for African American students than for White students.
The lower correlation between college persistence and preadmission characteristics is to
be expected, since persistence in college and ultimate graduation are more substantially
influenced by nonacademic factors than is college GPA. Some of the variables that research has
identified as playing an important role in determining persistence are finances, motivation,
social adjustment, family and health problems, and the institution's selectivity and size (Reason, 2009;
Bowen, Chingos & McPherson, 2009).10
Nonacademic Predictors of College Success
Recently, a number of studies have looked into the importance of nonacademic variables
for predicting college success. These studies have called for an expansion of the definition of
college success to include longer-term outcomes such as persistence and graduation, as well as
less-researched outcomes such as leadership and civic participation, and have stressed the
importance of nonacademic predictors (Camara & Kimmel, 2005; Robbins, Lauver, Le, Davis &
Langley, 2004; Sternberg 1999, 2003; Kyllonen, 2008). Doing so allows one to predict college
success more broadly and avoid relying exclusively on cognitive criteria and predictors. This is in
10 Wilson (1983) observes that the best predictors of college graduation are persistence to sophomore year and first-
year GPA. This information is closest in time and in content to what is being predicted, but it is not available at
admission.
light of universities' broader missions, which include social and personal outcomes for their
students, and the reduced adverse impact that such consideration may have on the admission of
traditional minority students (Oswald, Schmitt, Kim, Ramsay & Gillespie, 2004; Breland, Maxey,
Gernard, Cumming & Trapani, 2001). Admissions decisions consider different dimensions of the
applicant depending on the institutional mission and philosophy (Perfetto, 1999). Sinha, Oswald,
applicant depending on the institutional mission and philosophy (Perfetto, 1999). Sinha, Oswald,
Imus & Schmitt (2011) show that the adverse impact of admissions decisions can be reduced if
colleges use a battery of cognitive and non-cognitive predictors that are weighted according to
the values institutional stakeholders place on an expanded performance criterion of student
success.
Previous studies that looked into nonacademic measures of success (Bowen & Bok,
1998; Willingham, 1985) showed that traditional academic predictors, such as test scores and
high school records, have moderate to no relationship to nonacademic success. Sinha, Oswald,
Imus & Schmitt (2011) confirmed the same type of results: SAT/ACT scores and high school
GPA were more strongly correlated with college GPA than with noncognitive attributes.11
The Revised-SAT
Freedle proposed a methodology to correct the unfairness generated by the relationship
he observed between item difficulty and differential item functioning in the SAT, known as
the Freedle phenomenon: he observed that harder items showed DIF in favor of minority
students while easier items showed DIF in favor of White students (Freedle, 2003).12 The
11 Allen, Robbins & Sawyer (2010), however, claim that noncognitive indicators and psychosocial factors can
increase the marginal prediction of academic college outcomes beyond what is already explained by traditional
predictors.
12 Differential item functioning (DIF) studies refer to how items function after differences in score distributions
between groups have been statistically removed. The remaining differences indicate that the items function
differently for the two groups. Typically, the groups examined are derived from classifications such as gender,
race, ethnicity, or socioeconomic status. The performance of the group of interest (the focal group) on a given test
item is compared to that of a reference or comparison group. White examinees are often used as the reference
group, while minority students are often the focal groups (Holland & Wainer, 1993).
proposed methodology focuses on how students perform on the hard half of the SAT and is
called the Revised-SAT or R-SAT (Freedle, 2003). According to Freedle, the R-SAT would
increase SAT Verbal scores by as much as 200 to 300 points for individual minority test-takers,
would reduce the mean score differences between White and minority test-takers by a third, and
would produce a score that is a better indicator of the academic abilities of minority
students.
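The standardization method behind the Freedle phenomenon compares an item's proportion correct between focal and reference groups after matching examinees on total score. The sketch below follows the general logic of the STD P-DIF statistic; the implementation details (array layout, handling of unmatched score levels) are our own assumptions, not the cited authors' code:

```python
import numpy as np

def std_p_dif(item_focal, item_ref, score_focal, score_ref):
    """STD P-DIF: focal-weighted average, across matched total-score levels,
    of the difference in proportion correct on one item (focal minus
    reference). Positive values indicate the item favors the focal group."""
    item_focal, item_ref = np.asarray(item_focal), np.asarray(item_ref)
    score_focal, score_ref = np.asarray(score_focal), np.asarray(score_ref)
    num = den = 0.0
    for s in np.unique(score_focal):
        f = item_focal[score_focal == s]   # focal responses at this score level
        r = item_ref[score_ref == s]       # reference responses at this level
        if len(r) == 0:
            continue  # no reference examinees to match at this level
        num += len(f) * (f.mean() - r.mean())
        den += len(f)
    return num / den
```

Plotting STD P-DIF against item difficulty across the test's items is the comparison that produced Freedle's observed relationship.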
Freedle, citing the work of Diaz-Guerrero & Szalay (1991), interprets the difference
between a student's R-SAT score and his/her regular SAT score as a measure of the degree to which
the examinee's cultural background diverges from White, middle-class culture. In his paper,
Freedle recommends exploring the validity of the R-SAT index by comparing the correlation
between the observed R-SAT index and college grades to that observed between the SAT score
and college grades, and by looking at how many admissions decisions would change if we
assume that SAT or R-SAT scores over 600 indicate students qualified for college.13
Freedle was strongly criticized by the College Board (Camara & Sathy, 2004), Dorans
(2004) and Dorans and Zeller (2004a, 2004b). Some of the criticisms concerned the method
used to study differential item functioning (standardization approach) and the way Freedle
implemented it. Those criticisms were addressed by Author (Year), whose results partially
replicated Freedle's findings when the standardization approach was correctly implemented.
However, the relationship between item difficulty and DIF was present only in the SAT Verbal
test and only for African American students (Author, Year). When considering IRT methods to
13 Freedle recognizes that predictive validity analyses will necessarily be limited because many people who did not
attend selective colleges might have matriculated at such schools if their R-SAT scores had been used in the
admissions process, but he nevertheless considers it relevant to examine the implications of using the measure he
proposed.
study DIF and to model guessing, Freedle's findings were also observed for Hispanic students
(Author, Year).
Dorans (2004) and Dorans and Zeller (2004a) also criticized the methods Freedle used
for calculating the necessary components of the R-SAT: the use of proportion correct rather than
formula score, his consideration of different (ethnic) samples for the half-test, and his application
of inverse regression. Furthermore, Dorans & Zeller (2004b) explored the fairness of Freedle's
R-SAT using Score Equity Assessment (SEA), a new methodology presented as a complement to
the existing procedures for fairness assessment, namely DIF analysis and differential prediction.
Using SEA, Dorans and Zeller (2004b) found that the half-test to total-test linking may be
population-dependent and that, therefore, the scores produced on the hard-half test cannot be used
interchangeably with scores produced on the full-length SAT Verbal test. For a more
comprehensive review of the criticisms Dorans (2004) and Dorans & Zeller (2004b) posed, see
Author (Year, pp. 113-114).
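Formula scoring, whose omission Dorans criticized, penalizes wrong answers to correct for guessing. A minimal sketch, assuming five-option items as on the SAT Verbal:

```python
def formula_score(num_right, num_wrong, num_options=5):
    """Classical formula score: rights minus wrongs/(k-1), where k is the
    number of answer options; omitted items contribute nothing. With
    five-option items each wrong answer costs 1/4 point, so purely random
    guessing has an expected formula score of zero."""
    return num_right - num_wrong / (num_options - 1)
```

Using proportion correct instead, as Freedle did, implicitly rewards guessing on hard items, which is one reason Dorans objected to it when computing hard-half scores.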
Research Questions
The current paper provides evidence regarding the predictive validity of the R-SAT and
aims to explore the validity of Freedle's measure using multivariate regression models. These
models allow the exploration of the predictive validity of the R-SAT while controlling for the
effect of other relevant measures influencing the academic outcomes achieved by students in
college. The investigation starts by calculating the revised SAT score using Freedle's
methodology while considering the methodological criticisms made by Dorans & Zeller of the
way Freedle calculated the necessary components of the R-SAT (Dorans, 2004; Dorans & Zeller,
2004a). Once the R-SAT was calculated, we examined how beneficial it was for minority students
and how it fared in terms of predictive validity of college outcomes in comparison to the original
SAT score.
The predictive power of the R-SAT was also compared to the predictive capacity of
alternative Item Response Theory (IRT) ability estimates. IRT methods treat students' ability
as a latent variable to be inferred from the data, and, due to the invariance property, these
estimates do not depend on the set of test items under analysis.14 The model used in this
research, the Rasch model (Wu, Adams & Wilson, 1998), provides examinees' ability estimates
that are a direct transformation of the sum of correct responses, and it allowed us to include a
parameter to account for DIF in the estimation of examinees' ability. The predictive power of the R-
SAT score and original SAT score will be compared to the predictive capacity of ability
estimates from the Rasch model and the Rasch DIF model (Paek, 2002).15
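Under the Rasch model, the maximum-likelihood ability estimate depends on an examinee's responses only through the raw sum score, which is why Rasch abilities are a direct transformation of the number correct. A sketch via Newton-Raphson, assuming known item difficulties (this is a generic illustration, not the authors' estimation software):

```python
import math

def rasch_ability(responses, difficulties, tol=1e-8):
    """ML ability estimate under the Rasch model given known item
    difficulties, via Newton-Raphson on the log-likelihood. Assumes at
    least one correct and one incorrect response (otherwise the MLE is
    infinite). Only the sum of `responses` matters, not which items."""
    r = sum(responses)  # raw sum score: the sufficient statistic for ability
    theta = 0.0
    for _ in range(100):
        p = [1.0 / (1.0 + math.exp(-(theta - b))) for b in difficulties]
        grad = r - sum(p)                       # score function
        info = sum(pi * (1.0 - pi) for pi in p)  # test information
        step = grad / info
        theta += step
        if abs(step) < tol:
            break
    return theta
```

The Rasch DIF model adds a group-by-item shift to the difficulties, so DIF is absorbed into the item parameters rather than contaminating the ability estimate.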
Data Sources
Since the analyses require information about students' SAT test scores and college
experience, the information was drawn from two primary sources: the University of California
Corporate Data System and the College Board.
The College Board datafiles contained item-level performance and students' individual
scores, as well as students' responses to the Student Data Questionnaire (forty-three questions),
including self-reported demographic and academic information such as parents' education,
family income, and high school grade point average.
The University of California Corporate Student Information System provides systemwide
admissions and performance data. Through their applications to UC, students provide academic
14 Note that the invariance property holds only when the models used hold. This is tested using fit statistics.
15 For more details about this model and its application to the Freedle phenomenon, see Author (Year).
and demographic information that is subsequently verified and standardized. For those students
who enroll at UC, this information is complemented with their academic history including
college grades, number of courses and number of units completed, persistence and graduation.
Information about parental education level and family income is also provided.
Information from the College Board and UC system was complemented with an indicator
of school performance on a state standardized test (Academic Performance Index) from the
California Department of Education.
This study was conducted using the subset of examinees from the College Board file who
were juniors, came from California public high schools, took the SAT forms DX and QI in 1994
or SAT forms IZ and VD in 1999, spoke English as their best language and applied and enrolled
at the University of California. Only UC-eligible students are admitted to and allowed to enroll
at the University of California. Although at the time there were several routes to become UC eligible,
most students became eligible through the statewide eligibility path. This path required students
to complete a certain number of courses by subject area and to achieve a certain test score
depending on their high school grades. In general, the UC eligibility criteria were set with the
ultimate goal of identifying the top 12.5% of high school graduates who, according to the California
Master Plan for Higher Education, should be considered for the University of California.
As a result of the eligibility criteria and of enrollment decisions, the sample used has a
higher mean SAT score, high school grade point average, family income, and
parental education than the College Board sample of all high school juniors from California public
high schools who took SAT forms DX and QI in 1994 and SAT forms IZ and VD in 1999 (see
Table 1). The difference in academic and demographic characteristics does not change the
phenomenon originally described by Freedle and studied by Author (Year). The relationship
between item difficulty and DIF estimates is still observed among high and low ability students
when using the Rasch model to study DIF (Author, Year).16
INSERT TABLE 1 HERE
Methods
This section presents the details of how the R-SAT score was calculated, how one IRT
version of the original SAT score and two IRT versions of the R-SAT were estimated, and how
the relative predictive validity of these scores and ability estimates was assessed. Since a
previous study found stronger evidence of the relationship between DIF estimates and item
difficulty in the Verbal test than in the Quantitative test (Author, Year), the analyses presented in
this paper focus exclusively on the Verbal test.
Calculation of the Revised SAT score and Estimation of IRT Ability Parameters
The R-SAT scores were calculated and the IRT ability estimates were obtained for the
specific SAT form and ethnic subgroup where previous studies (Author, Year) showed evidence
of a relationship between DIF and item difficulty estimates as defined by the standardization
method (Dorans & Kulick, 1983) and/or the Item Response Theory approach to DIF (Camilli &
Shepard, 1994). Table 2 presents a summary of the results obtained when using these two
methodologies across forms and ethnic groups. Thus, R-SAT was calculated and ability
16 The Freedle phenomenon was analyzed among high- and low-ability students, where ability was defined by the
SAT score. The Freedle phenomenon was not analyzed among enrolled and non-enrolled students, as this
categorization is not exclusively based on ability but is also determined by financial considerations and personal
preferences. In addition, the sample size would have been extremely small for minority students. See Author (Year)
for more details.
estimates were obtained for African Americans in forms IZ, QI and DX and for Hispanics in
forms IZ and VD.
INSERT TABLE 2 HERE
The R-SAT was obtained by calculating the corresponding formula score17
in the hardest
half of the test for all students who took each test form and then assigning African
American/Hispanic students the total score obtained by White students who performed similarly
in the hard half of that specific test form. Specifically, in order to obtain the revised score that
African American/Hispanic students would have received, a linear regression was estimated only
among the White students who took each form. The linear regression was used to predict their
SAT scores using the formula score obtained in the hard half of the test. A constant and a slope
coefficient were estimated and subsequently those parameter estimates were applied to the
formula score obtained by African American/Hispanic students in the hard half of the test.18
Although the R-SAT was calculated incorporating Dorans and Zeller's recommendations
regarding the use of formula scores rather than the original proportion correct scores (Dorans,
2004; Dorans & Zeller, 2004a), the methodology employed to obtain the R-SAT is still subject to
criticism for the use of inverse regression and combining results from different ethnic groups
(Dorans & Zeller, 2004a, 2004b).
Hence, in addition to the R-SAT, ability estimates using IRT methodology were also
obtained. Initially the Rasch and Rasch DIF models (Adams, Wilson & Wang, 1997; Moore,
1996) were estimated in each form and ethnic group for which there was evidence of the Freedle
17 Formula scoring adjusts scores for the possibility of random guessing (Frary, 1988; Rogers, 1999).
18 This methodology, originally used by Freedle (2003), allowed expressing the number of correct responses
(adjusted by random guessing) in a score that ranged from 200 to 800 just as the regular SAT Verbal score. The
scores of White students are used as the reference because they have been considered the reference group in all DIF
analyses.
phenomenon, using all students from California public high schools who took the form (see
Table 3). These models were estimated using ConQuest (Wu, Adams & Wilson, 1998); the
Rasch Model Ability Estimate and Rasch DIF Model Ability Estimates were obtained
respectively. In addition, an IRT version of Freedle's revised SAT was estimated by considering
only the hardest half of the items in each test form (Hard Half Ability Estimate using the Rasch
Model).19,20
In total, three IRT ability estimates were obtained for each African American or
Hispanic student.
While the ability estimates obtained from the Rasch model are a direct (but non-linear)
transformation of the sum of correct responses, they differ from the original SAT score in that
the IRT ability estimates consider guessing by using formula score. The ability estimates
obtained from the Rasch DIF model directly incorporate a parameter for DIF, and therefore
explicitly consider the phenomenon Freedle described in the ability estimation. The third IRT
ability estimate, obtained from estimating the Rasch model in only the hard half of the test,
attempts to adjust the ability estimate for the phenomenon described by Freedle following
exactly the same logic behind the methodology he proposed, but using IRT methods instead.
Since each of these models is directly estimated for a specific ethnic group comparison, the
ability estimates generated are not subject to the concerns expressed by Dorans and Zeller
(Dorans, 2004; Dorans & Zeller, 2004a; 2004b) regarding the use of inverse regression and
aggregation of estimates from different ethnic groups. Although IRT scaling tends to produce
ability estimates that are linearly related to the underlying ability measured, they may be more
useful than aggregated scores when examining the linear relationship between test scores and
19 See Author (Year, Appendix 1) for the model fit statistics for the Rasch, Rasch DIF and Hard Half models.
20 The item difficulty estimates from the original Rasch DIF model were used to define the hardest half of the items.
external variables (e.g., outcome measures) because IRT ability estimates are less subject to the
ceiling and/or floor effects observed in aggregated scores (Thissen & Orlando, 2001; Xu &
Stone, 2011).
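The hard-half IRT approach can be illustrated with a minimal ability-estimation sketch. The study's models were estimated in ConQuest, so the Newton-Raphson routine and the item difficulties below are purely hypothetical:

```python
import numpy as np

def rasch_ability(responses, difficulties, n_iter=50):
    """Maximum-likelihood ability estimate under the Rasch model,
    P(correct) = 1 / (1 + exp(-(theta - b))), via Newton-Raphson."""
    x = np.asarray(responses, float)
    b = np.asarray(difficulties, float)
    theta = 0.0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(theta - b)))
        grad = np.sum(x - p)        # d logL / d theta
        info = np.sum(p * (1 - p))  # Fisher information
        theta += grad / info
    return theta

# Hypothetical 10-item form; the "hard half" is the 5 largest b's
b = np.array([-2, -1.5, -1, -0.5, 0, 0.5, 1, 1.5, 2, 2.5])
resp = np.array([1, 1, 1, 1, 1, 1, 0, 1, 0, 0])

full = rasch_ability(resp, b)          # analogue of the full-test estimate
hard = rasch_ability(resp[5:], b[5:])  # analogue of the hard-half estimate
```

The Rasch DIF model additionally estimates a group-by-item interaction parameter, which is not shown in this single-examinee sketch.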
Predictive Validity Analyses
The predictive power of the regular SAT Verbal score, the R-SAT score and the three IRT
ability estimates was compared for African American, Hispanic and White students. Linear
regression was used for GPA prediction and logistic regression was used for the prediction of
graduation because UC GPA is a continuous numerical variable and graduation is a dichotomous
outcome variable.21
The ordinary least squares method was used for estimating linear regressions
and the maximum likelihood technique was implemented for the estimation of logistic
regression. The college outcomes examined were the first- through fourth-year annual UC GPA,
cumulative fourth-year UC GPA, and whether students graduated by their fourth year at UC.
The academic outcomes included in this study are of particular interest because they are
not limited to grade point averages and they span over four years of the college career of students
taking the SAT in 1994 and 1999. Most research in this area has been limited to examining the
predictive validity of standardized test scores and high school grades in short-term academic
outcomes, especially grades.
The analyses controlled for academic and sociodemographic variables found to be
significant in previous college prediction research (Geiser & Studley, 2002; Author, Year;
Rothstein, 2004; Zwick et al., 2004). The sociodemographic variables included parents'
21Although Bridgeman, Pollack and Burton (2004) find evidence suggesting a potential non-linear relationship
between college grades and test scores, Rothstein (2004) does not find evidence along this line. Exploratory analyses
conducted in this research sample did not provide evidence to support a non-linear relationship
college grades and SAT scores.
education and income level from the UC systemwide admissions and performance data. The
academic variables included a weighted high school GPA, calculated with up to eight honors-
level courses, the SAT Math score22
and the school academic performance index expressed as
quintile ranks for students who took the SAT in 1999. The school academic performance index
information was not available for the students who took the SAT in 1994 because the index was
calculated for the first time in 1998.23
Equations 1, 2 and 3 show the general regression equation models for the prediction of
annual UC GPA, cumulative fourth-year UC GPA and fourth-year UC graduation respectively.
UCGPAi = β1 + β2APIQ + β3Educ + β4Inc + β5HSGPA + β6SATM + β7Zi + εi (1)
CUMUCGPA4 = β1 + β2APIQ + β3Educ + β4Inc + β5HSGPA + β6SATM + β7Zi + εi (2)
LOGIT(GRAD4) = β1 + β2APIQ + β3Educ + β4Inc + β5HSGPA + β6SATM + β7Zi (3)
where
UCGPAi is the grade point average that a student had in year i of college, where i ranges
between 1 and 4;
CUMUCGPA4 refers to the cumulative grade point average at the fourth college year;
GRAD4 is a binary variable indicating graduation by the fourth year of college, where 1
indicates a student has graduated, 0 indicates a student who has not graduated;
APIQ refers to the ranking of the school in the California Academic Performance Index;
Educ is the maximum number of years of education achieved by the parents as reported in
the UC application;
Inc refers to the family income (expressed in dollars) as reported in the UC application;
22 Different ability estimates/scores for the Verbal section were also included; an explanation is provided below.
23 Regression models excluding API rank as explanatory variables are included in Author (Year, Appendix 5).
HSGPA is the weighted high school GPA considering up to eight honors-level courses;
SATM is the original score obtained in the SAT Math test; and
Zi refers to different indices of verbal ability. For each of the three regression models
there were five versions which differed in the verbal ability index included. In the first version of
each model (models 1.1, 2.1 and 3.1 in the tables) the verbal ability index is the SAT Verbal
score. The second version of each model uses the original SAT score for White students and the
highest score between the revised SAT Verbal score and the original SAT score for minority
students (models 1.2, 2.2 and 3.2 in the tables). The third and fourth versions of the models
include the Verbal ability estimates from the Rasch (models 1.3, 2.3 and 3.3 in the tables) and
Rasch DIF model respectively (models 1.4, 2.4 and 3.4 in the tables). Lastly, the fifth version of
the models considers the Verbal ability estimate obtained from estimating the Rasch model using
only the hardest half of the Verbal items (models 1.5, 2.5 and 3.5 in the tables).
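The estimation setup for the five model versions can be sketched as follows. The data here are synthetic stand-ins for the actual predictors, and a plain least-squares routine replaces the statistical software used in the study (the graduation models would analogously be fit by maximum-likelihood logistic regression):

```python
import numpy as np

def fit_ols(X, y):
    """OLS via least squares; returns coefficients and R^2."""
    X1 = np.column_stack([np.ones(len(y)), X])  # intercept column
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    r2 = 1.0 - resid.var() / y.var()
    return beta, r2

# Hypothetical stand-ins for APIQ, Educ, Inc, HSGPA, SATM and for the
# five interchangeable verbal indices Z (labels illustrative only)
rng = np.random.default_rng(1)
n = 400
controls = rng.normal(size=(n, 5))
verbal = {"SATV": rng.normal(size=n),
          "max(SATV, R-SATV)": rng.normal(size=n),
          "Rasch": rng.normal(size=n),
          "Rasch DIF": rng.normal(size=n),
          "Hard half": rng.normal(size=n)}
y = controls @ np.array([0.2, 0.1, 0.1, 0.4, 0.3]) + rng.normal(0, 1, n)

# One regression per verbal index, holding the controls fixed
r2_by_model = {name: fit_ols(np.column_stack([controls, z]), y)[1]
               for name, z in verbal.items()}
```

Only the verbal-ability column changes across the five versions, which is what allows the fit statistics to be compared directly.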
The model presented in the text includes only SAT I Verbal and SAT I Math scores as
explanatory variables, and not SAT II scores, as most higher education institutions require only
the SAT I (or ACT) exam, so results from these models will be more generalizable to other
institutions. Regressions including SAT II test scores as explanatory variables are included in
Author (Year, Appendix 4) and do not offer stronger evidence in support of the R-SAT Verbal
test score.
The analyses could not control for the effect of discipline or campus on the dependent
variable due to the small sample size of minority groups (Brown & Zwick, 2006). Sample size
also limited our ability to properly model the within- and between-school
variation in high school GPA and API quintile (Zwick & Green, 2007). In addition, it is
important to note that, as in most predictive validity studies, conclusions from this research are
necessarily limited because many people who did not attend selective colleges might have
matriculated at such schools if their R-SAT Verbal scores had been used in the admission
process.
The analyses compared the explained variance as well as the size and statistical
significance of the standardized coefficients across models. The explained variance was
measured by the adjusted R2 statistic (Singer & Willett, 2003), an alternative to the R2 which
considers the number of variables included in the model. The adjusted R2 statistic is presented
below:

Adj R2 = 1 − [(n − 1)/(n − p)](1 − R2)

where
n is the sample size, and
p refers to the number of parameters in the model.
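As a worked illustration, the adjusted R2 formula can be computed directly; the numbers below are hypothetical, not taken from the study's tables:

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2: penalizes the plain R^2 for the number of
    parameters p relative to the sample size n."""
    return 1 - (n - 1) / (n - p) * (1 - r2)

# Illustrative values: R^2 = 0.25, n = 400 students, 7 parameters
adj = adjusted_r2(0.25, 400, 7)  # slightly below the plain R^2
```

Because the penalty grows with p and shrinks with n, the adjusted statistic allows fairer comparisons across models with different numbers of predictors.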
In logistic regression there is no precise counterpart to the R2 or adjusted R2 used in linear
regression. Several measures of goodness of fit have been proposed; Nagelkerke's
maximum-rescaled R2, denoted R̃2, is used here. The statistic, given below, can achieve a maximum
value of 1:

R̃2 = R2 / R2max

where

R2 = 1 − {L(0)/L(β̂)}^(2/n)

achieves a maximum of less than 1 for discrete models, the maximum being given by

R2max = 1 − {L(0)}^(2/n),

L(0) is the likelihood of the intercept-only model,
L(β̂) is the likelihood of the specified model, and
n is the sample size.
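A sketch of the computation, assuming only that the log-likelihoods of the intercept-only and fitted models are available; the values below are hypothetical:

```python
import math

def nagelkerke_r2(loglik_null, loglik_model, n):
    """Nagelkerke's maximum-rescaled R^2, computed from the
    log-likelihoods of the intercept-only and fitted models."""
    r2 = 1 - math.exp(2 * (loglik_null - loglik_model) / n)  # Cox-Snell R^2
    r2_max = 1 - math.exp(2 * loglik_null / n)               # upper bound
    return r2 / r2_max

# Illustrative log-likelihoods (not from the study)
value = nagelkerke_r2(-250.0, -200.0, 400)
```

Rescaling by the upper bound is what lets the statistic reach 1 for a perfectly fitting model, unlike the unrescaled Cox-Snell version.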
Standardized regression coefficients, or beta weights, show the relative strength of
different predictor variables within a regression equation; the weights represent the number of
standard deviations that an outcome variable changes for each one standard deviation change in
any given predictor variable, all other variables held constant. A standardized regression
coefficient is computed by dividing a parameter estimate by the ratio of the sample standard
deviation of the dependent variable to the sample standard deviation of the regressor.
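This computation can be illustrated directly; the coefficient and standard deviations below are hypothetical:

```python
def standardized_coefficient(b, sd_predictor, sd_outcome):
    """Beta weight: the raw estimate divided by the ratio of the
    outcome SD to the predictor SD (equivalently b * sd_x / sd_y)."""
    return b / (sd_outcome / sd_predictor)

# Hypothetical example: a raw slope of 0.002 GPA points per SAT point,
# with SD(SAT) = 100 and SD(GPA) = 0.5
beta = standardized_coefficient(0.002, 100.0, 0.5)
```

Here a one-standard-deviation increase in the predictor corresponds to a change of beta standard deviations in the outcome, holding the other predictors constant.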
Results
This section presents the results of this research in three parts. The first two parts refer
to the calculation of the R-SAT and its predictive validity compared to the SAT, including its
performance on the issue of over or underprediction. The third section offers the predictive
validity findings related to the IRT ability estimates.
Freedle's Revised SAT Verbal Score
Table 3 shows the number of students from California public high schools who originally
took each test form and for whom the adjusted scores were calculated. The adjusted scores were
calculated for a total of 3,922 Hispanic examinees and 2,234 African American examinees.
INSERT TABLE 3 HERE
The R-SAT Verbal score mean is higher than the original mean SAT Verbal score in all
ethnic groups and test forms (see Author (Year, Appendix 2) for details). On average, the R-SAT
Verbal score increases the mean performance of African American students from 382.5 to 407
(6.4%) and the mean performance of Hispanic students from 471.6 to 484.0 (2.6%).
Table 4 shows in greater detail whether and how the R-SAT Verbal score benefits
minority students. Note that the bottom three rows represent the students who benefit from the
use of the R-SAT Verbal score. We observe that 68% (1,537 out of 2,234) of African American
examinees improve their scores when the R-SAT Verbal score is considered in place of the SAT
Verbal score. The same occurs for 58% (2,271 out of 3,922) of the Hispanic sample. In addition,
the R-SAT Verbal tends to benefit students in the low
end of the original SAT Verbal score distribution. While most examinees increase their scores
by between 0 and 50 points, the increment reaches as high as 202 points in a number of cases.
On average, however, the score increase is not as large as Freedle described it to be.
INSERT TABLE 4 HERE
In order to assess the impact of the revised SAT score on the admissions decisions of
minority students, Freedle estimated and compared the number of African American students
who would be offered admission at competitive colleges when considering each score. Freedle
hypothesized that receiving an R-SAT score of at least 600 would be sufficiently meritorious to
interest many colleges in an applicant who received such a score.24
He found that by considering
the revised SAT score instead of the original SAT score the number of African Americans
scoring over 600 in two of the forms he analyzed increased from 166 to 235 (Form 4I) and from
24 Freedle chose to consider an SAT score of 600 or above as meritorious because students whose high school grade
point average is between the 97th and 100th percentiles receive an average SAT Verbal score of 610 and, in addition, a
score of 600 also reflects a level of test performance that only about 5 percent of the test-taking population receives,
using the normal SAT scoring procedures (Freedle, 2003).
117 to 167 (Form OB023), which was equivalent to an increase in admission to selective colleges
of 342 percent and 334 percent, respectively.
The analyses reported here show an effect in the same direction Freedle described;
however, the impact on the number of African American students whose admissions decisions are likely to
have changed is more modest. When using the maximum of the SAT and the R-SAT Verbal
scores, the number of African American students scoring over 600 increases from 79 to 86. This
represents an increase of 8.9% over the original number of African American students in the
sample scoring over 600 (see Table 5), or an increase from 3.5% to 3.8% of all African American
examinees. When considering both African American and Hispanic students, the number of students
scoring over 600 increases from 458 (7.4% of all minority students) to 516 (8.3% of all minority
students), which is equivalent to an increase of 12.6%.
Overall, 7.4% of minority examinees score over 600. In comparison, 3,889 White
students, or 19.7% of all White examinees, score 600 or above and receive an average score of
653.
INSERT TABLE 5 HERE
The consideration of a different cut-off score would only result in a significant benefit for
minorities if it were drastically reduced. More than 60% of the African American and Hispanic
students considered in this analysis would receive an R-SAT Verbal score below 450; therefore,
only a cut-off score around or below this level would result in a different admission decision.
Such a drastic reduction in score level, however, does not seem consistent with the premise of
admission to highly competitive colleges.
The analyses presented in Table 5 regarding the impact of Freedle's R-SAT on
admissions decisions and subsequent analyses looking at the R-SAT's predictive validity
consider the maximum score between the SAT Verbal score and the R-SAT Verbal score for
minority students, and not just the revised SAT score. This is done in consideration of Freedle's
own recommendations:
the solution is to recognize that this is a pervasive phenomena that can be easily
remedied by reporting two scores, the usual SAT and the R-SAT. (Freedle, 2003)
Since Freedle recommends reporting both scores and interprets the difference between
them as reflecting the difference between the White majority's culture and the cultural background of
minority groups, the consideration of the maximum of the two scores represents the least
disadvantageous scenario in which minority groups might compete for admission into selective
colleges.
Predictive Validity of the Revised SAT Verbal Score
This section presents the results on the predictive capacity of the revised SAT Verbal
score. Its capacity to predict short- and long-term academic outcomes is compared to that of the
original SAT Verbal score by ethnic group and academic outcome. It is important to keep in
mind that although the results are presented side-by-side for three ethnic groups, the main focus
of this investigation was to compare the goodness of fit statistics and parameter estimates within
ethnic groups, especially within minority groups. The results for the White student sample are
presented as a comparison with minority students' results.
In order to increase the sample size, the R-SAT Verbal scores for all SAT forms were
combined. This aggregation was possible because the performance in each form was previously
scaled by ETS.25
The aggregation conducted also assumes that the four SAT forms were equated
during test development.26
The inclusion of the school ranking in the model, however, meant
that only students taking the 1999 forms (IZ and VD) were included in the analysis.27
Table 6 shows the adjusted R2
for the multivariate models estimated within each ethnic
group. The overall predictive power of the models examined varies depending on the academic
outcome and ethnic group. In general, the models predict college grades better for White students
than for minority students. While the capacity to predict annual college grades for all groups
tends to decline over time, the overall prediction of cumulative fourth-year grade point average is
unexpectedly high for White and Hispanic students. In addition, and only for White students, the
prediction of fourth year graduation is significantly weaker than the prediction of college grades.
Interestingly, this is not the case for African American and Hispanic students. The models'
capacity to predict long term outcomes, such as fourth-year cumulative grade point average and
four-year graduation, is surprising considering that these indices are measured four years into the
students college career. Long-term outcomes are often assumed to be affected by variables
different from those included here, such as financial aid and previous experience in college
(Wilson, 1983; Reason, 2009).
25 Scaling refers to a psychometric process conducted to achieve comparability among test scores from different test
forms.
26 Equating is a process different from scaling and aims to adjust for differences in difficulty among test forms. For
an introduction to traditional scaling and equating methods, please see Kolen (1988).
27 The maximum score between the original SAT and the R-SAT Verbal score was used for minority students.
Models using just R-SAT Verbal score and excluding school ranking as explanatory variables are presented in
Author (Year, Appendix 5) and result in findings similar to the ones displayed in this section. They do not
provide stronger evidence in favor of the R-SAT score.
INSERT TABLE 6 HERE
In general, the adjusted R2
for Hispanic and White students are consistent with the results
reported by similar studies (Author, Year; Author, Year; Geiser & Studley, 2002; Zwick et al.,
2004). The power to predict college GPA for African American students, though, is below
what has been reported by other studies and below the power to predict college GPA for the
other two ethnic groups; we believe this is in part an artifact of the small sample size. Geiser &
Studley (2002), for example, reported R2s closer to 10% for African American students (p. 15).
When predicting graduation, however, the models predict better for African Americans than for
White and Hispanic students.
Table 6 shows that the capacity to predict college outcomes using the R-SAT Verbal
score is close to, but slightly less than, the predictive capacity achieved when using the
original SAT score. The R-SAT Verbal score predicts better than the original SAT score in only
two cases, and just for the African American group: fourth-year college grade point average and
fourth-year cumulative grade point average. The difference in predictive power, though, does
not seem to be of large practical significance. It ranges between 0 and 1 percentage point and the
maximum increase in predictive capacity is only 0.59%.
The relatively weaker capacity to predict college outcomes associated with the use of the
R-SAT can also be observed in Tables 1, 2 and 3 in Author (Year, Appendix 3), which show the
standardized coefficient estimates and their statistical significance (p-values) when predicting
first-year UC GPA, cumulative fourth-year UC GPA and fourth-year graduation by ethnic group.
They also present the adjusted R2
for each regression and its sample size. In Author (Year,
Appendix 3) we also discuss the results associated with the other explanatory variables included
in the regression models, which are similar to the findings from previous literature.
Over- and Underprediction of Freshman Grades
Freedle suggested that the revised SAT score would help reduce the problem of over-
and underprediction reported in the literature on the predictive validity of college admissions tests
(Zwick et al., 2004; Zwick et al., 2002; Ramist et al., 1994; Ramist et al., 2001). The potential
improvement in over- and underprediction obtained from using the revised SAT score rather than
the original SAT score was assessed, and the results are presented in this section.
Under- or overprediction is usually assessed by fitting one general prediction model for
college students from all ethnic groups and then summing the regression residuals for a particular
ethnic group. In order to gauge the average individual over- or underprediction, the sum
of residuals is then divided by the number of students in each ethnic group. In this case,
regression models 1.1 and 1.2 were estimated and the average residuals by ethnic group were
compared. All explanatory variables included in these models were described in the previous
section.
1stYRGPA = β1 + β2APIQ + β3Educ + β4Inc + β5HSGPA + β6SATM + β7SATV + εi (1.1)
1stYRGPA = β1 + β2APIQ + β3Educ + β4Inc + β5HSGPA + β6SATM + β7Max(SATV, RSATV) + εi (1.2)
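The residual-averaging procedure can be sketched as follows, with hypothetical group labels and GPA values:

```python
import numpy as np

def mean_residual_by_group(y, y_hat, groups):
    """Average residual (observed minus predicted first-year GPA) by
    ethnic group: negative values indicate overprediction of the
    group's grades, positive values underprediction."""
    resid = np.asarray(y, float) - np.asarray(y_hat, float)
    g = np.asarray(groups)
    return {grp: float(resid[g == grp].mean()) for grp in np.unique(g)}

# Hypothetical mini-example (labels and numbers illustrative)
y_obs  = [3.2, 3.5, 2.8, 3.0, 2.9, 3.4]
y_pred = [3.1, 3.4, 3.0, 3.1, 2.8, 3.3]
group  = ["White", "White", "AfrAm", "AfrAm", "Hisp", "Hisp"]
avg_resid = mean_residual_by_group(y_obs, y_pred, group)
```

Note that the prediction model is fit once on the pooled sample; only the residual averaging is done within groups.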
Table 7 shows the regression output for regression models 1.1 and 1.2 for all first-year
UC students. The results are similar to those presented in Table 6 for White students. This is not
surprising given that White students are the most numerous ethnic group included in the
sample.28
We find underprediction of White students' grades (0.01) and overprediction of Hispanic
(-0.025) and African American students' grades (-0.098) when using the SAT, just as previous
research did (Ramist et al., 1994, 2001). On average, the overprediction is smaller than the one
reported by Ramist et al. (1994) for African American students (-0.16) and larger than that
reported by Geiser & Studley (2002) and by Zwick et al. (2004) for African American students,
except for the 1998-1999 UCLA mega-cohort for the African American group.29 For Hispanic
For Hispanic
students the overprediction is smaller than the one reported by Ramist et al. (2001) (-0.13) and
similar to some of the results reported by Zwick et al. (2004) (see for example Berkeley 1996-
1997 mega-cohort, Irvine 1998-1999 mega-cohort, San Diego 1996-1997 mega-cohort).
We found no improvement in prediction accuracy from using the R-SAT Verbal score
for minority groups. On the contrary, when using the maximum of the SAT and R-SAT Verbal
scores, the prediction errors for minorities increased to 0.114 for African American students
and 0.032 for Hispanic students, respectively.30
28 Although they also somewhat resemble the results obtained for the Hispanic subsample, the standardized
coefficients associated with parents' education and income as well as the overall R2 are closer to those observed for the
White students. See Author (Year, Appendix 3) for details.
29
We focused our attention on Zwick et al.'s model 6, which is the most similar to the analyses reported in this
section.
30 The same analysis was conducted for fourth-year cumulative UC GPA and the average underprediction for
African American and Hispanic students increased as well (from 0.181 to 0.194 and from 0.033 to 0.040
respectively).
Predictive Validity of IRT Ability Estimates
This section presents the results regarding the predictive power of the IRT ability
estimates and compares those results to the predictive capacity of the R-SAT and original SAT
Verbal scores. The IRT ability estimates include: (i) ability estimates obtained from estimating
the Rasch model in all the test items, (ii) ability estimates obtained from estimating the Rasch
DIF model in all the test items and (iii) ability estimates obtained from estimating the Rasch
model in only the hardest half of the items. These three ability estimates were obtained for all
White, African American and Hispanic students.31
These analyses were conducted separately for each combination of ethnic group,
academic outcome and test form in which the Freedle phenomenon was observed. This analysis
structure translated into reduced sample sizes. Table 2 of this paper shows the ethnic groups and
forms in which the relationship between item difficulty and DIF estimates was observed.
Test forms and ethnic groups could not be aggregated as in the R-SAT predictive validity
analysis because the ConQuest estimation, especially that of the Rasch DIF model, generates one
student ability estimate per ethnic comparison. In addition, ability estimates from different Rasch
models, student samples and test forms cannot be directly aggregated because they are on
different scales. Even if we assumed that test forms were equated during test development,
information about the difficulty parameters of items used in equating is not available, preventing
the use of a common scale for all ability estimates.
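To make the equating point concrete: with known anchor-item difficulties, a simple mean/mean linking constant would place one form's Rasch scale onto another's, and it is precisely these anchor difficulties that are unavailable here. A minimal sketch with hypothetical values:

```python
def mean_mean_link(anchor_diffs_form_a, anchor_diffs_form_b):
    """Mean/mean linking constant: the shift that re-expresses form B's
    Rasch scale on form A's, computed from the difficulties of common
    (anchor) items as estimated separately on each form."""
    mean_a = sum(anchor_diffs_form_a) / len(anchor_diffs_form_a)
    mean_b = sum(anchor_diffs_form_b) / len(anchor_diffs_form_b)
    return mean_a - mean_b

# Hypothetical anchor-item difficulties, each on its own form's scale:
a = [-0.4, 0.1, 0.9]
b = [-0.9, -0.4, 0.4]
shift = mean_mean_link(a, b)   # 0.5 with these numbers
theta_on_a = 1.2 + shift       # a form-B ability re-expressed on form A's scale
```

Without the anchor difficulties, no such constant can be computed, which is why estimates from different forms were kept separate.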
Two of the five output tables for form 1999 IZ are presented here (see Tables 8 and 9).
Both tables display summary statistics of the analyses conducted using one of the most current
31 This differs from the R-SAT analysis presented in the previous sections, in which the new score was computed only for minority students.
forms analyzed (1999 IZ): (i) R² information for each of the models explaining a total of six dependent variables for the African American/White comparison, and (ii) R² information for each of the models explaining a total of six dependent variables for the Hispanic/White comparison. Form 1999 IZ has the largest sample size. The two tables presented here are representative of the results obtained for the other test forms and ethnic groups (Hispanic students taking Form 1999 VD; African American students taking Forms 1994 QI and 1994 DX). The remaining output tables are included in Author (Year, pp. 161-164). Although there are differences in the overall predictive capacity by ethnic group, academic outcome, and test form, the overall predictive validity results lead to the same conclusions as the findings presented in Tables 8 and 9.
The multivariate regression models assessed fare best when predicting the college grades of White students, and model performance decreases over time, with the exception of cumulative fourth-year GPA. The college grades of minority students, especially African American students, are not predicted in any meaningful way. Surprisingly, the models under study predict fourth-year graduation better for African American and Hispanic students than for White students; this trend was already noted in the previous section. Negative adjusted R² values indicate very low explained variance in spite of the inclusion of a large number of parameters in the regression model.
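As a reminder of why the adjusted statistic can dip below zero: adjusted R² rescales R² by a degrees-of-freedom penalty, so a weak model with many predictors and a small sample is pushed below zero. A minimal sketch with illustrative numbers (not the article's data):

```python
def adjusted_r2(r2, n, p):
    """Adjusted R-squared for a model with p predictors fit to n cases.
    The penalty term can drive the value negative when r2 is small
    relative to p and n."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

# Illustrative: a weak model (R2 = 0.08) with 10 predictors goes
# negative in a small sample but barely changes in a large one.
small_sample = adjusted_r2(0.08, 65, 10)    # negative
large_sample = adjusted_r2(0.08, 2000, 10)  # stays close to 0.08
```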
Although the overall predictive power varies significantly by form, ethnic group, and academic outcome, within the same ethnic group and academic outcome there is no practically significant difference among the predictive capacities achieved with the three IRT ability estimates. In addition, there is no clear trend, as measured by R², in the superiority of any of the IRT ability estimates, the original SAT score, or the revised SAT score.
The small sample sizes and related instability of results allow us to present only a tentative conclusion about the small practical difference observed in the overall predictive power associated with the different IRT ability estimates and how they fare in comparison to the original SAT score. In addition, there is some evidence suggesting that the Rasch and Rasch DIF ability estimates fare better in predicting short-term academic outcomes for minorities, while the original SAT score better predicts long-term outcomes for the same group.
Discussion
The research presented in this article aimed to examine the predictive validity of the R-SAT score while addressing the methodological criticisms of the way Freedle obtained the different components used to calculate the R-SAT score (Dorans, 2004; Dorans & Zeller, 2004a, 2004b). We did so by using formula score rather than proportion correct as the basis for calculating the R-SAT score and by directly estimating students' ability using the Rasch and Rasch DIF models, both with all items and with only the hardest half of the items. This latter approach addressed the issues of inverse regression and the aggregation of estimates from different ethnic groups.
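The formula score mentioned above is the number right minus a fraction of the number wrong, so that blind guessing contributes zero in expectation. A minimal sketch (the counts below are illustrative, not the study's data):

```python
def formula_score(num_right, num_wrong, num_options=5):
    """Classic formula score: rights minus wrongs/(k-1) for k-option
    items, so random guessing has an expected contribution of zero.
    Five-option items give the familiar 1/4-point penalty."""
    return num_right - num_wrong / (num_options - 1)

# Illustrative: 40 right and 8 wrong (omits carry no penalty).
score = formula_score(40, 8)  # 40 - 8/4 = 38.0
```

Using formula score rather than proportion correct keeps the guessing correction of the operational scoring in the recomputed R-SAT.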
Analyses presented above show that, in this sample, the R-SAT score helps minority students, although not as much as Freedle expected. On average, it increases scores by 24 points (6%) for African American students and by 12 points (2.5%) for Hispanic students. Using Freedle's assumptions, consideration of the R-SAT would change the admissions decisions of minority students admitted into selective colleges by about 10%. This is much less than Freedle's prediction of an approximately 300% increase.32 The small increases in R-SAT scores are consistent with the magnitude of score increases reported by Dorans (2004) and Dorans and Zeller (2004a).
32 Freedle identified an increase of 342% for Form 4I and an increase of 334% for Form OB023.
In addition, the predictive validity analyses show no significant difference in the capacity to predict short- and long-term outcomes when using either the original or the revised SAT score. Also, results show that the traditional problem of over- and underprediction would remain the same when using the revised SAT score.
Results from using the IRT ability measures are somewhat less straightforward but also support the conclusion that there is little practical difference in the overall predictive power associated with the different IRT ability estimates and in how they fare in comparison to the original SAT and R-SAT scores.
This research has several limitations. Among them is the fact that the predictive validity analyses were conducted on a group of students who had already been accepted to college and who therefore present significant restriction of range in some of the explanatory variables. In addition, many students who did not attend selective colleges might have matriculated at such schools if their R-SAT scores had been used in the admissions process, a limitation also observed in other predictive validity studies (Geiser & Studley, 2002; Zwick, 2002; Zwick, Brown & Sklar, 2004; Zwick & Sklar, 2005). This consideration limits to some extent the validity of our findings. The use of inverse regression and the aggregation of different ethnic groups in order to obtain the R-SAT scores (but not the IRT ability estimates) are still subject to Dorans and Zeller's original criticisms. Recent changes to the content of the SAT and the inclusion of a mandatory Writing test may limit the generalizability of the findings presented here, since they were based on somewhat older test forms. Larger sample sizes for each minority group would be desirable for future research, especially for African American students; however, that would require combining data from a set of colleges and universities whose overall and minority sample sizes exceed those of the nine campuses of the University of California combined.
Furthermore, despite the limited sample sizes of African American and Hispanic students, we were still able to observe results similar to those reported by previous research, such as the statistical significance and practical importance of high school grades for predicting college grades and graduation. These results support the validity of our findings for these particular samples.
We think it is important to highlight the consistency of the results obtained in the numerous and diverse analyses implemented across African American and Hispanic students. No strong evidence in favor of the R-SAT score is observed when (a) recalculating the scores using only the most difficult items for minorities, (b) using that R-SAT score to directly predict short- and long-term outcomes with models that did and did not consider SAT II scores, (c) using models that did not control for school quality and allowed larger sample sizes, (d) evaluating the over- and underprediction problem for minorities, and (e) using IRT ability estimates (considering all items, all items plus a DIF parameter, and only the hardest half of the items) to predict short- and long-term outcomes.
The findings presented in this article consistently reveal that there are minimal benefits associated with Freedle's R-SAT and suggest that, rather than using measures aimed at complementing the SAT, efforts and energy should be directed to studying the phenomenon behind the systematic relationship between item difficulty and DIF estimates (Author, Year) and to directly addressing those issues during test development. The investigation of potential causes should include studies that examine Freedle's proposed explanation, the influence of academic versus home language (Freedle, 2010), including investigation of the cognitive processes of students while taking the test, as well as quantitative analyses and modeling techniques (De Boeck, 2010). In addition, further research should investigate the sensitivity of Freedle's
phenomenon to alternative forms of guessing, such as differential guessing strategies between White students and students from other ethnic groups.
These results also suggest that alternative policy options should be considered if the goal is to increase the representation of minority groups in higher education, especially at highly selective institutions (Bowen, Chingos & McPherson, 2009).33 Those options may include the use of school quality indices as inputs to the admissions process (Zwick & Himelfarb, 2011) and/or explicitly considering nonacademic outcomes as desirable college goals and adjusting the weights of admission indicators accordingly (Sinha, Oswald, Imus & Schmitt, 2011).
33 Bowen et al. (2009) use the term undermatching for the phenomenon by which students enroll in institutions that are less demanding than those they are qualified to attend. The phenomenon is described as most pronounced among well-qualified low-income and minority students, who enroll at two-year institutions or less-selective four-year institutions. Since college completion varies sharply with school selectivity, even after controlling for student characteristics, the phenomenon of undermatching results in minority students graduating from less-demanding colleges at lower rates than similar students at highly selective institutions.
Table 1: Descriptive Statistics. Overall Sample Taking SAT Forms and Subsample of Students Who Enrolled at UC.

Overall Sample
Variable       | N      | Mean   | Std. Dev.
SAT Composite  | 28,860 | 958    | 224
HSGPA          | 28,367 | 3.23   | 0.45
Income         | 25,678 | 56,853 | 30,239
Max Ed Level   | 28,489 | 6.40   | 2.18

UC Applicant Sample
SAT Composite  | 11,155 | 1067   | 206
HSGPA          | 11,016 | 3.47   | 0.36
Income         | 9,866  | 62,550 | 30,779
Max Ed Level   | 11,027 | 6.89   | 2.16

UC Enrolled Sample
SAT Composite  | 4,804  | 1098   | 195
HSGPA          | 4,754  | 3.55   | 0.32
Income         | 4,253  | 63,250 | 30,938
Max Ed Level   | 4,749  | 6.93   | 2.19

Source: College Board
Table 2: Presence of the Freedle Phenomenon According to the Standardization Approach and the Rasch Model, by Form and Ethnic Group. Verbal Tests.*

Group                   | Method                   | 1999 IZ | 1999 VD | 1994 QI | 1994 DX
White, African American | Standardization Approach | YES     | NO      | YES     | NO
White, African American | Rasch Model              | YES     | NO      | YES     | YES
White, Hispanics        | Standardization Approach | NO      | NO      | NO      | NO
White, Hispanics        | Rasch Model              | YES     | YES     | NO      | NO

* Presence of the Freedle phenomenon is defined as a statistically significant and high (above 0.3) correlation.
Table 3: Number of Students for Whom the Revised Score Was Calculated and IRT Ability Parameters Estimated.

Group                      | 1999 IZ | 1999 VD | 1994 QI | 1994 DX | Total
White Examinees            | 6,548   | 6,682   | 3,360   | 3,188   | 19,778
Hispanic Examinees         | 1,904   | 2,018   | -       | -       | 3,922
African American Examinees | 854     | -       | 671     | 709     | 2,234
Table 4: Distribution of Score Differences by Ethnic Group and Corresponding Mean SAT Verbal Score. Overall Sample.

Difference Between R-SAT Verbal and SAT Verbal Score (both endpoints included) | African American: Number | % | Mean SAT Score | Hispanic: Number | % | Mean SAT Score
[-106, -101] | -     | -    | -     | 2     | 0%   | 515.0
[-100, -51]  | 39    | 2%   | 433.6 | 95    | 2%   | 506.2
[-50, 0]     | 658   | 29%  | 438.7 | 1,554 | 40%  | 518.4
[0, 49]      | 966   | 43%  | 396.2 | 1,704 | 43%  | 468.9
[50, 101]    | 452   | 20%  | 301.6 | 418   | 11%  | 370.0
[100, 210]   | 119   | 5%   | 251.7 | 149   | 4%   | 276.1
TOTAL        | 2,234 | 100% | 382.5 | 3,922 | 100% | 471.6
Table 5: Number of Examinees Scoring 600 or Above in the Sample and Their Mean Scores.

Ethnic Group | Number Scoring 600+ on SAT Verbal | Mean SAT Verbal | Number Scoring 600+ on Max of SAT V and R-SAT V | Mean of Max of SAT V and R-SAT V | Total Examinees in Sample
African American Students              | 79    | 637 | 86  | 643 | 2,234
African American and Hispanic Students | 458   | 645 | 516 | 648 | 6,156
White Students                         | 3,889 | 653 | -   | -   | 19,778
Table 6: Overall Predictive Power of the Original SAT Verbal Scores and the Maximum Between the SAT Verbal Scores and the Revised SAT Verbal Scores. Multivariate Regression Models.

UCGPA 1st Year
Score                  | African American | Hispanic | White
SAT V                  | 2.15%            | 15.36%   | 21.24%
Max [SAT V or R-SAT V] | 1.66%            | 15.00%   | -
N                      | 78               | 597      | 2,253

UCGPA 2nd Year
Score                  | African American | Hispanic | White
SAT V                  | 0.18%            | 13.16%   | 16.55%
Max [SAT V or R-SAT V] | 0.07%            | 12.40%   | -
N                      | 73               | 540      | 2,120

UCGPA 3rd Year
Score                  | African American | Hispanic | White
SAT V                  | -4.39%           | 8.13%    | 12.92%
Max [SAT V or R-SAT V] | -5.19%           | 7.27%    | -
N                      | 67               | 497      | 1,964

UCGPA 4th Year
Score                  | African American | Hispanic | White
SAT V                  | 4.81%            | 5.01%    | 13.11%
Max [SAT V or R-SAT V] | 4.94%            | 4.38%    | -
N                      | 64               | 476      | 1,904

UC Cumulative GPA 4th Year
Score                  | African American | Hispanic | White
SAT V                  | 0.12%            | 15.18%   | 20.68%
Max [SAT V or R-SAT V] | 0.71%            | 14.28%   | -
N                      | 65               | 481      | 1,927

UC Graduation by 4th Year*
Score                  | African American | Hispanic | White
SAT V                  | 15.97%           | 13.35%   | 6.91%
Max [SAT V or R-SAT V] | 15.08%           | 13.13%   | -
N                      | 78               | 613      | 2,314

* Pseudo R² is reported for the logistic regression used to predict fourth-year graduation.
Table 7: Predictive Power of First-Year UC GPA: A Joint Regression Equation. Standardized Estimates and Statistical Significance.

Regression Model | API Quintile | Parents' Education | Income Level | HS GPA | SAT Math | Max [SAT V or R-SAT V] | SAT Verbal | Adjusted R² | N
1.1              | 0.102        | 0.098              | 0.039        | 0.330  | -0.021   | -                      | 0.191      | 23.85%      | 2,928