7/29/2019 AME Article
1/56
For Peer Review Only
THE REVISED SAT SCORE AND ITS MARGINAL PREDICTIVE
VALIDITY
Journal: Applied Measurement in Education
Manuscript ID: HAME-2012-0095
Manuscript Type: Empirical Article
Keywords: Predictive Validity, SAT, College Admissions, Revised SAT
URL: http://mc.manuscriptcentral.com/hame Email: [email protected]
Applied Measurement in Education
ABSTRACT
This paper explores the predictive validity of the Revised SAT (R-SAT) score as an
alternative to the student's SAT score. Freedle proposed this score for students who may
potentially be harmed by the relationship between item difficulty and ethnic DIF observed in the
test they took in order to apply to college. The R-SAT score is defined as the score a minority
student would have received if only the hardest questions on the test had been considered, and it
was computed using formula scoring and an inverse regression approach. Predictive validity for
short- and long-term academic outcomes is considered, as well as the potential effect on the
overprediction and underprediction of grades among minorities. The predictive power of the R-
SAT score was compared to the predictive capacity of the SAT score and to the predictive
capacity of alternative Item Response Theory (IRT) ability estimates based on models that
explicitly considered DIF and/or were based on the hardest test questions. We found no evidence
of incremental validity in favor of the R-SAT score or of the IRT ability estimates.
R-SAT Predictive Validity
THE REVISED SAT SCORE AND ITS MARGINAL PREDICTIVE VALIDITY
Introduction
One way that admission examinations are judged is by how well they are able to predict
college outcomes. Predictive validity studies analyze the degree of association between
admissions test scores and college outcomes, such as college grades and graduation rates.
Academic outcomes are relatively easy to collect and are also related to the behavior that tests
like the SAT are expected to predict, success in college. Some studies have also addressed the
prediction of nonacademic outcomes such as earnings, leadership, job satisfaction, satisfaction
with life, and civic participation (Bowen & Bok, 1998; Allen, Robbins & Sawyer, 2010;
Oswald, Schmitt, Kim, Ramsay & Gillespie, 2004; Willingham, 1985).
In this study, we examine a measure of academic preparedness that has been proposed to
complement the SAT. This measure, the Revised-SAT or R-SAT, was proposed by Roy Freedle
(2003) with the goal of correcting the unfairness of SAT results for minorities he found through
his application of the Standardization method for DIF (Dorans & Kulick, 1983, 1986; Dorans &
Holland, 1992). The R-SAT was proposed as a score based on a subset of the SAT questions. We
will judge its success using the result of predictive validity analyses of short and long-term
outcomes.
This article is divided into five sections. The first section summarizes previous research
on the prediction of college outcomes. The research question for this investigation, the data
sources, and methods are presented in the next three sections. Lastly, the results section presents the
findings obtained when calculating the revised SAT score, and using it to predict academic
outcomes. The predictive capacity of the R-SAT score will be compared to that of the original
SAT score and three Item Response Theory (IRT) versions of the SAT score.
Prior Research on Prediction of College Outcomes
There is a substantial body of research on the validity of multiple variables to predict
college outcomes in a wide range of dimensions: education, employment and social outcomes.
This section presents a brief overview of this literature, with a particular focus on the role of high
school grades and standardized test scores in the prediction of (i) college grades and (ii)
graduation rates. Although these outcome indicators offer only a partial portrayal of students'
educational achievement, the convenience of their collection and updating makes them the
outcomes most commonly used in predictive validity studies. A subsequent section describes
recent studies examining the prediction of nonacademic college outcomes and the role of
non-cognitive predictors.
Freedle proposed computing a new score based on the hardest questions of the most
widely taken standardized test in the US in order to compensate for the potentially unfair results
of minority students he found when analyzing differential item functioning and its relationship to
item difficulty. Details on the calculation of the R-SAT and Freedle's expectations for this index,
as well as the criticisms made of it, are presented in this section.
College Grade Point Average
The relationship between high school grade point average, SAT scores, and freshman
grade point average has been widely examined by researchers at the College Board and research
units within higher education institutions (Ramist, Lewis, & McCamley-Jenkins, 1994; Geiser &
Studley, 2004). In general, College Board studies find that SAT scores make a substantial
contribution to predicting cumulative college GPAs and that the combination of SAT scores and
high school records provide better predictions than either grades or test scores alone (Burton &
Ramist, 2001; Hezlett, Kuncel, Vey, Ahart, Ones, Campbell & Camara 2001). College Board
researchers have studied the validity of the SAT mostly using correlational analysis and have
taken into consideration the technical issues of range restriction, differences in grading across
colleges and unreliability of college grades to measure success in college (Camara &
Echternacht, 2000; Willingham, Lewis, Morgan & Ramist, 1990).1
Typical correlations between first-year grades and the SAT I (Verbal and Math scores
combined) range between 0.3 and 0.6 depending on the characteristics of the studies with an
average of 0.4 (Ramist, Lewis & McCamley-Jenkins, 1994; Zwick, 2002). Bridgeman, Pollack,
and Burton (2004), for example, report a correlation of 0.55 between freshman grades and the
SAT I composite score; while the SAT Verbal score has a correlation of 0.50 with freshman
grades, the SAT Math correlates 0.52.2 On average, the measurement error of the SAT I
Math and Verbal sections is 30 points, and the correlation with the outcome criterion tends to be
weaker when measurement error is considered (Zwick, 2002).
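The range-restriction adjustment these studies apply can be illustrated with Thorndike's Case II formula for direct restriction on the predictor. This is a standard textbook correction, not code from the cited studies, and the standard-deviation ratio used in the example is illustrative only:

```python
import math

def correct_range_restriction(r_restricted, sd_restricted, sd_unrestricted):
    """Thorndike Case II correction for direct range restriction on the
    predictor: estimates the predictor-criterion correlation in the
    unrestricted applicant pool from the correlation observed in the
    restricted (e.g., enrolled-student) sample."""
    u = sd_unrestricted / sd_restricted  # ratio of SDs; u > 1 under restriction
    return r_restricted * u / math.sqrt(1.0 + r_restricted**2 * (u**2 - 1.0))
```

For instance, an observed correlation of 0.35 in a sample whose SAT standard deviation is 60% of the applicant pool's corrects to roughly 0.53, comparable in magnitude to the adjusted values reported in these studies.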
Standardized tests allow all applicants the opportunity to perform in an environment with
the same testing conditions, instructions and time-constraints, opportunities to ask questions and
procedures for scoring. Standardized test scores permit comparisons among students who
come from different schools, in which grading standards can vary significantly. Zwick (2002),
aware that SAT scores add little predictive power to high school grades, justifies the use of the
standardized test scores in admissions to large institutions by noting the cost of interviewing
1 Studies that do not adjust for range restriction and variations in grading standards tend to lower the
observed correlations, underestimating the predictive power of the indexes used in admissions processes (Camara &
Echternacht, 2000).
2 They also report a correlation between high school grades and first-year college grades of 0.58.
candidates or reviewing applications in elaborate detail. The cost for the school of collecting and
processing the scores is minimal. In addition, standardized test scores help reduce the
overprediction of African American college grades observed when using high school grades
alone.3
In 2005 the SAT I was revised in a number of ways (Kobrin, Patterson, Shaw, Mattern &
Barbuti, 2008): analogies were removed and replaced with more questions on reading passages
and the Verbal section was renamed the Critical Reading section. The Math section now includes
items from more advanced courses and does not include quantitative comparison items. In
addition, a third test was added including multiple-choice items on grammar and a student-
produced essay. Kobrin et al. report a correlation between test scores and first-year college
grades similar to that from previous studies (unadjusted r = 0.35, r adjusted for range
restriction = 0.53), conclude that the new writing test is the most predictive based on bivariate
and multiple correlations (unadjusted r = 0.36, r adjusted for range restriction = 0.51),
and encourage institutions to use both high school GPA and test scores when making
admissions decisions, since that maximizes the predictability of first-year college grades (unadjusted
r = 0.46, r adjusted for range restriction = 0.62).
Relative Predictive Validity of Different Academic Indicators
Previously, Geiser and Studley (2002), from the University of California, analyzed the
relative contribution of high school GPA, SAT I and SAT II scores to the prediction of college
success and found that SAT II scores were the best single predictor of first year GPA, and that
3 Overprediction means that a group's average predicted first-year grade point average (GPA) is greater than its
average actual first-year GPA. Although this problem is known to be present in the SAT I for African American and
Hispanic students, Ramist et al. (1994) find it even more strongly when using only high school GPA to predict first-
year college grades.
the SAT I scores added little to the prediction once SAT II scores and high school GPA were
already considered.4
After taking the SAT II and high school GPA into consideration, the SAT I
scores improved the overall prediction rate by a negligible 0.1% (from 21.0% to 21.1%). The
standardized coefficient of the SAT I, after controlling for SAT II and high school GPA, was
0.07, but statistically significant due to the large number of observations used. Geiser & Studley
(2002) analyzed a sample of 80,000 freshmen who entered the University of California from fall
1996 to fall 1999 using regression analysis. Their results were confirmed by subsequent findings
from College Board researchers (Ramist et al., 2001; Bridgeman, Burton & Cline, 2001; Kobrin,
Camara & Milewski, 2002). For a more detailed review of these articles, see Author (Year).
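The incremental-validity comparisons these studies report amount to the change in R² between nested regression models. The sketch below uses simulated data (not the UC data; all coefficients are illustrative) to show why a predictor that is largely redundant with the baseline predictors yields a tiny ΔR²:

```python
import numpy as np

def r_squared(X, y):
    """R^2 from an ordinary-least-squares fit with an intercept."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return 1.0 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))

# Simulated stand-ins: HSGPA, an SAT II composite, and an SAT I score that
# is largely redundant with the SAT II composite.
rng = np.random.default_rng(0)
n = 1000
hsgpa = rng.normal(size=n)
sat2 = 0.5 * hsgpa + rng.normal(size=n)
sat1 = 0.6 * sat2 + rng.normal(size=n)
fygpa = 0.4 * hsgpa + 0.3 * sat2 + 0.02 * sat1 + rng.normal(size=n)

base = r_squared(np.column_stack([hsgpa, sat2]), fygpa)       # baseline model
full = r_squared(np.column_stack([hsgpa, sat2, sat1]), fygpa)  # + candidate
delta = full - base  # incremental validity; small, echoing the 0.1% above
```

With a large n, even a negligible ΔR² like this can be statistically significant, which is the pattern Geiser & Studley describe.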
Based on the findings from multivariate analyses considering multiple academic
predictors of college performance, such as the one conducted by Geiser & Studley (2002), the
National Center for Fair and Open Testing (FairTest) has stated that the SAT I has little value in
predicting future college performance (FairTest, 2003) and highlights the better performance of
class rank, high school grades, and SAT II scores. Others, however, have chosen to advocate for
admissions tests that focus on achievement and are based on standards and criteria
(Atkinson & Geiser, 2009).
Prediction by Ethnic Group and Gender
Notable differences in the validity and predictive accuracy of SAT scores and high school
grades by race and sex have been substantiated through numerous studies (Young, 2004). The
accuracy of high school grades and SAT scores for predicting freshman grade point average is
higher for women, Asian Americans and White students, and lower for men, African Americans
4 Geiser & Studley (2002) combined three SAT II scores into a single composite variable that weights each SAT II
test equally; they did not analyze the predictive validity of separate test scores.
and Hispanics. Furthermore, these admissions variables often overpredict the grades of African-
Americans and Hispanic students, and underpredict those of women (Burton & Ramist, 2001).
Ramist et al. (1994) report an overprediction of first-year GPA of -0.16 for African American
students and of -0.13 for Hispanic students when considering HSGPA and SAT scores. Geiser &
Studley (2002), on the other hand, found no significant over-prediction for African Americans
and an average overprediction of -0.04 for Hispanic students when including high school GPA
and SAT I scores in the regression equation. Zwick, Brown & Sklar (2004) conducted the same
type of analyses for each of the University of California campuses and for two merged-cohorts
(1996-1997 and 1998-1999). Their results vary significantly by the campus and merged-cohort
analyzed but were interpreted by the authors as supporting previous findings from the literature.5
There are a number of theories about the reasons for over- and underprediction; for details,
see Zwick, Brown, & Sklar (2004), Zwick (2002), and Steele and Aronson (1998).
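The over- and underprediction quantities discussed above are group means of regression residuals: actual minus predicted first-year GPA. A minimal sketch, with hypothetical data, following the sign convention of values such as the -0.16 reported above:

```python
import numpy as np

def prediction_error_by_group(actual_gpa, predicted_gpa, groups):
    """Mean residual (actual minus predicted first-year GPA) by group.
    A negative mean residual indicates overprediction for that group:
    the model predicts higher grades than the group actually earns."""
    actual = np.asarray(actual_gpa, dtype=float)
    pred = np.asarray(predicted_gpa, dtype=float)
    groups = np.asarray(groups)
    return {g: float(np.mean(actual[groups == g] - pred[groups == g]))
            for g in np.unique(groups)}
```

The predicted GPAs would come from a regression of first-year GPA on HSGPA and test scores fit to the pooled sample; the function above only summarizes the residuals by group.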
More recently, researchers have looked at the differential prediction of test scores and
high school grades among students from different language backgrounds (Zwick & Schlemer,
2004; Zwick & Sklar, 2005) and from schools with different financial and teaching resources
(Zwick & Himelfarb, 2011) as a way to investigate possible explanations for the issue of over-
and underprediction. Results show a reduction of prediction error for Hispanic and African
American students, but not a complete elimination (from -0.15 to -0.08 and from -0.13 to -0.03,
respectively), when using the second approach, but no change when considering first language:
5 Zwick, Brown & Sklar (2004) observed no significant differences in the overprediction of minorities whether the
SAT IIs were considered instead of the SAT I and whether income and parental education were included in the
regression equation. Geiser & Studley (2002) also reported no practical change in the overprediction of minority
groups when examining the predictive power of using SAT II scores instead of SAT I scores, but the underprediction
of African American students grows to 0.03 and the overprediction for Hispanic students grows to 0.08.
overprediction is still observed for African American and Hispanic students when considering
first language.
Relative Predictive Validity Using Multivariate Regression Analysis and Considering
Sociodemographic Variables
Parental income and education play a modest role in the prediction of college
performance when controlling for additional academic indicators such as high school grades and
standardized tests. Geiser & Studley (2002), for example, reported standardized coefficients that
ranged between 0.03-0.04 and 0.05-0.06, respectively.6
The modest standardized coefficients associated with parental income and education were also
reported by Bowen & Bok (1998) when using multivariate regression analysis to predict college
performance.7
The consideration of sociodemographic variables in the predictive validity regression
equation, however, is based on the results of Rothstein (2004), who finds that most of the SAT's
predictive power comes from its correlation with unobserved variables such as high school
sociodemographic characteristics.8 Rothstein's estimates show that the predictive contribution of the
SAT I score is 60% lower than would be indicated by traditional methods.
6 The R2 of the regression equation that included high school GPA, SAT I, and SAT II scores increased from 22.3%
to 22.8% when parental income and education were considered.
7 Performance in college was measured as percentile rank based on the cumulative GPA of the entering cohort, rather
than freshman grade point average, as a way to avoid school and major differences in grading philosophies and
practices (pages 72 to 76). The book also looks at differences in economic outcomes (such as employment, wages
and job satisfaction) and social outcomes (such as civic contribution, marital status and satisfaction with quality of
life).
8 The student-level variables he included are individual race and gender. The demographic makeup of the school
was described by the fraction of students who were Black, Hispanic, and Asian; the fraction of students receiving
subsidized lunches; and the average education of students' parents.
Controversy mounted between Geiser & Studley (2002) and Zwick and her colleagues
(Zwick et al., 2004; Zwick & Green, 2007) around the issue of whether the SAT I or the SAT IIs
were more sensitive to sociodemographic characteristics. This argument fueled the discussion
that prompted the modifications made to the SAT I in 2005, much in line with those suggested by the
University of California (Atkinson & Pelfrey, 2004). The sensitivity of scores from different test
types to sociodemographic characteristics has also been prominent in the discussion of whether
general aptitude tests (like the SAT I) or curriculum-based tests (more like the SAT IIs) should be
used for college admissions (Atkinson & Geiser, 2009). For more details about the controversy,
see Author (Year, pp. 107-109).
College Graduation
The ultimate goal of post-secondary education is college graduation; still, this goal is
elusive. According to Baum & Ma (2007), people with a college degree earned, on average, 62
percent more than individuals with only a high school diploma in 2005. According to the
National Educational Longitudinal Study (NELS)9, 59% of those who started college earned
bachelor's degrees by age 26 (Bowen, Chingos & McPherson, 2009). The National Center for
Higher Education Management Systems (NCES, IPEDS, 2007) reports that only 77.4 percent of
first-time, full-time students attending a four-year institution returned to that institution for their
second year of college in 2005 (this information excludes students who transfer to another
institution). Studies typically find that women are slightly more likely to graduate from college
than men and that African Americans, Hispanics and Native Americans have a lower rate of
graduation than White students (Astin, Tsui & Avalos, 1996; Bowen and Bok, 1998).
In general, studies exploring the role of SAT scores and high school grades in college
persistence and college graduation find a moderate relationship between these college outcomes
9 NELS surveyed students who were in eighth grade in 1988, most of whom graduated from high school in 1992.
and preadmission measures (Astin et al., 1996; Burton & Ramist, 2001; Mattern & Patterson,
2009, 2011a, 2011b). Although the traditional variables included in the multivariate regression
models explain a small proportion of the variance, Author (Year) found high school grades to be
the strongest predictor of college persistence, followed by the SAT II Writing scores. The
importance of high school grades was corroborated by Zwick & Sklar (2005). Sociodemographic
variables play a minor role in explaining college persistence and graduation (Author, Year);
nevertheless, Bowen & Bok (1998) found these variables to be more important in the prediction
of college outcomes for African American students than for White students.
The lower correlation between college persistence and preadmission characteristics is to
be expected, since persistence in college and ultimate graduation are more substantially
influenced by nonacademic factors than is college GPA. Some of the variables that research has
identified as playing an important role in determining persistence are finances, motivation,
social adjustment, family and health problems, and the institution's selectivity and size (Reason, 2009;
Bowen, Chingos & McPherson, 2009).10
Nonacademic Predictors of College Success
Recently, a number of studies have looked into the importance of nonacademic variables
for predicting college success. These studies have called for an expansion of the definition of
college success to include longer-term outcomes such as persistence and graduation, as well as
less-researched outcomes such as leadership and civic participation, and have stressed the
importance of nonacademic predictors (Camara & Kimmel, 2005; Robbins, Lauver, Le, Davis &
Langley, 2004; Sternberg 1999, 2003; Kyllonen, 2008). Doing so allows one to predict college
success more broadly and avoid relying exclusively on cognitive criteria and predictors. This is in
10 Wilson (1983) observes that the best predictors of college graduation are persistence to sophomore year and first-
year GPA. This information is closest in time and in content to what is being predicted, but it is not available at
admission.
light of universities' broader missions, which include social and personal outcomes for their
students, and the reduced adverse impact that such consideration may have on the admission of
traditional minority students (Oswald, Schmitt, Kim, Ramsay & Gillespie, 2004; Breland, Maxey,
Gernard, Cumming & Trapani, 2001). Admissions decisions consider different dimensions of the
applicant depending on the institutional mission and philosophy (Perfetto, 1999). Sinha, Oswald,
applicant depending on the institutional mission and philosophy (Perfetto, 1999). Sinha, Oswald,
Imus & Schmitt (2011) show that the adverse impact of admissions decisions can be reduced if
colleges use a battery of cognitive and non-cognitive predictors that are weighted according to
the values institutional stakeholders place on an expanded performance criterion of student
success.
Previous studies that looked into nonacademic measures of success (Bowen & Bok,
1998; Willingham, 1985) showed that traditional academic predictors, such as test scores and
high school records, have moderate to no relationship to nonacademic success. Sinha, Oswald,
Imus & Schmitt (2011) confirmed the same type of results: SAT/ACT scores and high school
GPA were more strongly correlated with college GPA than with noncognitive attributes.11
The Revised-SAT
Freedle proposed a methodology to correct the unfairness generated by the relationship
he observed between item difficulty and differential item functioning in the SAT, known as
the Freedle phenomenon: he observed that harder items showed DIF in favor of minority
students while easier items showed DIF in favor of White students (Freedle, 2003).12 The
11 Allen, Robbins & Sawyer (2010), however, claim that noncognitive indicators and psychosocial factors can
increase the marginal prediction of academic college outcomes beyond what is already explained by traditional
predictors.
12 Differential item functioning (DIF) studies refer to how items function after differences in score distributions
between groups have been statistically removed. The remaining differences indicate that the items function
differently for the two groups. Typically, the groups examined are derived from classifications such as gender,
race, ethnicity, or socioeconomic status. The performance of the group of interest (the focal group) on a given test
item is compared to that of a reference or comparison group. White examinees are often used as the reference
group, while minority students are often the focal groups (Holland & Wainer, 1993).
proposed methodology focuses on how students perform on the hard half of the SAT and is
called the Revised-SAT or R-SAT (Freedle, 2003). According to Freedle, the R-SAT would
increase SAT Verbal scores by as much as 200 to 300 points for individual minority test-takers,
would reduce the mean score differences between White and minority test-takers by a third, and
would produce a score that is a better indicator of the academic abilities of minority
students.
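The standardization method behind the Freedle phenomenon compares an item's proportion correct between focal and reference groups after matching examinees on total score. The sketch below follows the general logic of the STD P-DIF statistic; the implementation details (array layout, handling of unmatched score levels) are our own assumptions, not the cited authors' code:

```python
import numpy as np

def std_p_dif(item_focal, item_ref, score_focal, score_ref):
    """STD P-DIF: focal-weighted average, across matched total-score levels,
    of the difference in proportion correct on one item (focal minus
    reference). Positive values indicate the item favors the focal group."""
    item_focal, item_ref = np.asarray(item_focal), np.asarray(item_ref)
    score_focal, score_ref = np.asarray(score_focal), np.asarray(score_ref)
    num = den = 0.0
    for s in np.unique(score_focal):
        f = item_focal[score_focal == s]   # focal responses at this score level
        r = item_ref[score_ref == s]       # reference responses at this level
        if len(r) == 0:
            continue  # no reference examinees to match at this level
        num += len(f) * (f.mean() - r.mean())
        den += len(f)
    return num / den
```

Plotting STD P-DIF against item difficulty across the test's items is the comparison that produced Freedle's observed relationship.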
Freedle, citing the work of Diaz-Guerrero & Szalay (1991), interprets the difference
between a student's R-SAT score and his/her regular SAT score as a measure of the degree to which
the examinee's cultural background diverges from White, middle-class culture. In his paper,
Freedle recommends exploring the validity of the R-SAT index by comparing the correlation
between the observed R-SAT index and college grades to that observed between the SAT score
and college grades, and by looking at how many admissions decisions would change if we
assume that SAT or R-SAT scores over 600 indicate students qualified for college.13
Freedle was strongly criticized by the College Board (Camara & Sathy, 2004), Dorans
(2004) and Dorans and Zeller (2004a, 2004b). Some of the criticisms concerned the method
used to study differential item functioning (standardization approach) and the way Freedle
implemented it. Those criticisms were addressed by Author (Year), whose results partially
replicated Freedle's findings when the standardization approach was correctly implemented.
However, the relationship between item difficulty and DIF was present only in the SAT Verbal
test and only for African American students (Author, Year). When considering IRT methods to
13 Freedle recognizes that predictive validity analyses will necessarily be limited because many people who did not
attend selective colleges might have matriculated at such schools if their R-SAT scores had been used in the
admissions process, but he nevertheless considers it relevant to examine the implications of using the measure he
proposed.
study DIF and to model guessing, Freedle's findings were also observed for Hispanic students
(Author, Year).
Dorans (2004) and Dorans and Zeller (2004a) also criticized the methods Freedle used
for calculating the necessary components of the R-SAT: the use of proportion correct rather than
formula score, his consideration of different (ethnic) samples for the half-test, and his application
of inverse regression. Furthermore, Dorans & Zeller (2004b) explored the fairness of Freedle's
R-SAT using Score Equity Assessment (SEA), a new methodology presented as a complement to
the existing procedures for fairness assessment, namely DIF analysis and differential prediction.
Using SEA, Dorans and Zeller (2004b) found that the half-test to total-test linking may be
population-dependent and that, therefore, the scores produced on the hard-half test cannot be used
interchangeably with scores produced on the full-length SAT Verbal test. For a more
comprehensive review of the criticisms Dorans (2004) and Dorans & Zeller (2004b) posed, see
Author (Year, pp. 113-114).
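Formula scoring, whose omission Dorans criticized, penalizes wrong answers to correct for guessing. A minimal sketch, assuming five-option items as on the SAT Verbal:

```python
def formula_score(num_right, num_wrong, num_options=5):
    """Classical formula score: rights minus wrongs/(k-1), where k is the
    number of answer options; omitted items contribute nothing. With
    five-option items each wrong answer costs 1/4 point, so purely random
    guessing has an expected formula score of zero."""
    return num_right - num_wrong / (num_options - 1)
```

Using proportion correct instead, as Freedle did, implicitly rewards guessing on hard items, which is one reason Dorans objected to it when computing hard-half scores.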
Research Questions
The current paper provides evidence regarding the predictive validity of the R-SAT and
aims to explore the validity of Freedle's measure using multivariate regression models. These
models allow the exploration of the predictive validity of the R-SAT while controlling for the
effect of other relevant measures influencing the academic outcomes achieved by students in
college. The investigation starts by calculating the revised SAT score using Freedle's
methodology while considering the methodological criticisms made by Dorans & Zeller of the
way Freedle calculated the necessary components of the R-SAT (Dorans, 2004; Dorans & Zeller,
2004a). Once the R-SAT was calculated, we examined how beneficial it was for minority students
and how it fared in terms of predictive validity of college outcomes in comparison to the original
SAT score.
The predictive power of the R-SAT was also compared to the predictive capacity of
alternative Item Response Theory (IRT) ability estimates. IRT methods treat students' ability
as a latent variable to be inferred from the data, and, due to the invariance property, these
estimates do not depend on the set of test items under analysis.14 The model used in this
research, the Rasch model (Wu, Adams & Wilson, 1998), provides examinees' ability estimates
that are a direct transformation of the sum of correct responses, and it allowed us to include a
parameter to account for DIF in the estimation of examinees' ability. The predictive power of the R-
SAT score and original SAT score will be compared to the predictive capacity of ability
estimates from the Rasch model and the Rasch DIF model (Paek, 2002).15
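Under the Rasch model, the maximum-likelihood ability estimate depends on an examinee's responses only through the raw sum score, which is why Rasch abilities are a direct transformation of the number correct. A sketch via Newton-Raphson, assuming known item difficulties (this is a generic illustration, not the authors' estimation software):

```python
import math

def rasch_ability(responses, difficulties, tol=1e-8):
    """ML ability estimate under the Rasch model given known item
    difficulties, via Newton-Raphson on the log-likelihood. Assumes at
    least one correct and one incorrect response (otherwise the MLE is
    infinite). Only the sum of `responses` matters, not which items."""
    r = sum(responses)  # raw sum score: the sufficient statistic for ability
    theta = 0.0
    for _ in range(100):
        p = [1.0 / (1.0 + math.exp(-(theta - b))) for b in difficulties]
        grad = r - sum(p)                       # score function
        info = sum(pi * (1.0 - pi) for pi in p)  # test information
        step = grad / info
        theta += step
        if abs(step) < tol:
            break
    return theta
```

The Rasch DIF model adds a group-by-item shift to the difficulties, so DIF is absorbed into the item parameters rather than contaminating the ability estimate.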
Data Sources
Since the analyses require information about students' SAT test scores and college
experience, the information was drawn from two primary sources: the University of California
Corporate Data System and the College Board.
The College Board datafiles contained item-level performance and students' individual
scores, as well as students' responses to the Student Data Questionnaire (forty-three questions),
including self-reported demographic and academic information such as parents' education,
family income, and high school grade point average.
The University of California Corporate Student Information System provides systemwide
admissions and performance data. Through their applications to UC, students provide academic
14 Note that the invariance property holds only when the models used hold. This is tested using fit statistics.
15 For more details about this model and its application to the Freedle phenomenon, see Author (Year).
and demographic information that is subsequently verified and standardized. For those students
who enroll at UC, this information is complemented with their academic history including
college grades, number of courses and number of units completed, persistence and graduation.
Information about parental education level and family income is also provided.
Information from the College Board and UC system was complemented with an indicator
of school performance on a state standardized test (Academic Performance Index) from the
California Department of Education.
This study was conducted using the subset of examinees from the College Board file who
were juniors, came from California public high schools, took the SAT forms DX and QI in 1994
or SAT forms IZ and VD in 1999, spoke English as their best language and applied and enrolled
at the University of California. Only UC-eligible students are admitted to and allowed to enroll
at the University of California. Although at the time there were several routes to become UC eligible,
most students became eligible through the statewide eligibility path. This path required students
to complete a certain number of courses by subject area and to achieve a certain test score
depending on their high school grades. In general, the UC eligibility criteria were set with the
ultimate goal of identifying the top 12.5% of high school graduates who, according to the California
Master Plan for Higher Education, should be considered for the University of California.
As a result of the eligibility criteria and of enrollment decisions, the sample used has a
higher mean SAT score, high school grade point average, family income, and
parental education than the College Board sample of all high school juniors from California public
high schools who took SAT forms DX and QI in 1994 and SAT forms IZ and VD in 1999 (see
Table 1). The difference in academic and demographic characteristics does not change the
phenomenon originally described by Freedle and studied by Author (Year). The relationship
between item difficulty and DIF estimates is still observed among high and low ability students
when using the Rasch model to study DIF (Author, Year).16
INSERT TABLE 1 HERE
Methods
This section presents the details of how the R-SAT score was calculated, how one IRT
version of the original SAT score and two IRT versions of the R-SAT were estimated, and how
the relative predictive validity of these scores and ability estimates was assessed. Since a
previous study found stronger evidence of the relationship between DIF estimates and item
difficulty in the Verbal test than in the Quantitative test (Author, Year), the analyses presented in
this paper focus exclusively on the Verbal test.
Calculation of the Revised SAT score and Estimation of IRT Ability Parameters
The R-SAT scores were calculated and the IRT ability estimates were obtained for the
specific SAT form and ethnic subgroup where previous studies (Author, Year) showed evidence
of a relationship between DIF and item difficulty estimates as defined by the standardization
method (Dorans & Kulick, 1983) and/or the Item Response Theory approach to DIF (Camilli &
Shepard, 1994). Table 2 presents a summary of the results obtained when using these two
methodologies across forms and ethnic groups. Thus, R-SAT was calculated and ability
16 The Freedle phenomenon was analyzed among high- and low-ability students, where ability was defined by the
SAT score. The Freedle phenomenon was not analyzed among enrolled and non-enrolled students, as this
categorization is not exclusively based on ability but is also determined by financial considerations and personal
preferences. In addition, the sample size would have been extremely small for minority students. See Author (Year)
for more details.
estimates were obtained for African Americans in forms IZ, QI and DX and for Hispanics in
forms IZ and VD.
INSERT TABLE 2 HERE
The R-SAT was obtained by calculating the corresponding formula score17
in the hardest
half of the test for all students who took each test form and then assigning African
American/Hispanic students the total score obtained by White students who performed similarly
in the hard half of that specific test form. Specifically, in order to obtain the revised score that
African American/Hispanic students would have received, a linear regression was estimated only
among the White students who took each form. The linear regression was used to predict their
SAT scores using the formula score obtained in the hard half of the test. A constant and a slope
coefficient were estimated and subsequently those parameter estimates were applied to the
formula score obtained by African American/Hispanic students in the hard half of the test.18
Although the R-SAT was calculated incorporating Dorans and Zeller's recommendations
regarding the use of formula scores rather than the original proportion correct scores (Dorans,
2004; Dorans & Zeller, 2004a), the methodology employed to obtain the R-SAT is still subject to
criticism for the use of inverse regression and combining results from different ethnic groups
(Dorans & Zeller, 2004a, 2004b).
Hence, in addition to the R-SAT, ability estimates using IRT methodology were also
obtained. Initially the Rasch and Rasch DIF models (Adams, Wilson & Wang, 1997; Moore,
1996) were estimated in each form and ethnic group for which there was evidence of the Freedle
17 Formula scoring adjusts scores for the possibility of random guessing (Frary, 1988; Rogers, 1999).
18 This methodology, originally used by Freedle (2003), allowed expressing the number of correct responses
(adjusted by random guessing) in a score that ranged from 200 to 800 just as the regular SAT Verbal score. The
scores of White students are used as the reference because they have been considered the reference group in all DIF
analyses.
phenomenon, using all students from California public high schools who took the form (see
Table 3). These models were estimated using ConQuest (Wu, Adams & Wilson, 1998); the
Rasch Model Ability Estimate and Rasch DIF Model Ability Estimates were obtained
respectively. In addition, an IRT version of Freedle's revised SAT was estimated by considering
only the hardest half of the items in each test form (Hard Half Ability Estimate using the Rasch
Model).19,20
In total, three IRT ability estimates were obtained for each African American or
Hispanic student.
While the ability estimates obtained from the Rasch model are a direct (but non-linear)
transformation of the sum of correct responses, they differ from the original SAT score in that
the IRT ability estimates consider guessing by using formula score. The ability estimates
obtained from the Rasch DIF model directly incorporate a parameter for DIF, and therefore
explicitly consider the phenomenon Freedle described in the ability estimation. The third IRT
ability estimate, obtained from estimating the Rasch model in only the hard half of the test,
attempts to adjust the ability estimate for the phenomenon described by Freedle following
exactly the same logic behind the methodology he proposed, but using IRT methods instead.
Since each of these models is directly estimated for a specific ethnic group comparison, the
ability estimates generated are not subject to the concerns expressed by Dorans and Zeller
(Dorans, 2004; Dorans & Zeller, 2004a; 2004b) regarding the use of inverse regression and
aggregation of estimates from different ethnic groups. Although IRT scaling tends to produce
ability estimates that are linearly related to the underlying ability measured, they may be more
useful than aggregated scores when examining the linear relationship between test scores and
19 See Author (Year, Appendix 1) for the model fit statistics for the Rasch, Rasch DIF and Hard Half models.
20 The item difficulty estimates from the original Rasch DIF model were used to define the hardest half of the items.
external variables (e.g., outcome measures) because IRT ability estimates are less subject to the
ceiling and/or floor effects observed in aggregated scores (Thissen & Orlando, 2001; Xu &
Stone, 2011).
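The hard-half IRT approach can be illustrated with a minimal ability-estimation sketch. The study's models were estimated in ConQuest, so the Newton-Raphson routine and the item difficulties below are purely hypothetical:

```python
import numpy as np

def rasch_ability(responses, difficulties, n_iter=50):
    """Maximum-likelihood ability estimate under the Rasch model,
    P(correct) = 1 / (1 + exp(-(theta - b))), via Newton-Raphson."""
    x = np.asarray(responses, float)
    b = np.asarray(difficulties, float)
    theta = 0.0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(theta - b)))
        grad = np.sum(x - p)        # d logL / d theta
        info = np.sum(p * (1 - p))  # Fisher information
        theta += grad / info
    return theta

# Hypothetical 10-item form; the "hard half" is the 5 largest b's
b = np.array([-2, -1.5, -1, -0.5, 0, 0.5, 1, 1.5, 2, 2.5])
resp = np.array([1, 1, 1, 1, 1, 1, 0, 1, 0, 0])

full = rasch_ability(resp, b)          # analogue of the full-test estimate
hard = rasch_ability(resp[5:], b[5:])  # analogue of the hard-half estimate
```

The Rasch DIF model additionally estimates a group-by-item interaction parameter, which is not shown in this single-examinee sketch.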
Predictive Validity Analyses
The predictive power of the regular SAT Verbal score, the R-SAT score and the three IRT
ability estimates was compared for African American, Hispanic and White students. Linear
regression was used for GPA prediction and logistic regression was used for the prediction of
graduation because UC GPA is a continuous numerical variable and graduation is a dichotomous
outcome variable.21
The ordinary least squares method was used for estimating linear regressions
and the maximum likelihood technique was implemented for the estimation of logistic
regression. The college outcomes examined were the first- through fourth-year annual UC GPA,
cumulative fourth-year UC GPA, and whether students graduated by their fourth year at UC.
The academic outcomes included in this study are of particular interest because they are
not limited to grade point averages and they span over four years of the college career of students
taking the SAT in 1994 and 1999. Most research in this area has been limited to examining the
predictive validity of standardized test scores and high school grades in short-term academic
outcomes, especially grades.
The analyses controlled for academic and sociodemographic variables found to be
significant in previous college prediction research (Geiser & Studley, 2002; Author, Year;
Rothstein, 2004; Zwick et al., 2004). The sociodemographic variables included parents'
21Although Bridgeman, Pollack and Burton (2004) find evidence suggesting a potential non-linear relationship
between college grades and test scores, Rothstein (2004) does not find evidence along this line. Exploratory analyses
conducted in this research sample did not provide evidence to support a non-linear relationship
college grades and SAT scores.
education and income level from the UC systemwide admissions and performance data. The
academic variables included a weighted high school GPA, calculated with up to eight honors-
level courses, the SAT Math score22
and the school academic performance index expressed as
quintile ranks for students who took the SAT in 1999. The school academic performance index
information was not available for the students who took the SAT in 1994 because the index was
calculated for the first time in 1998.23
Equations 1, 2 and 3 show the general regression equation models for the prediction of
annual UC GPA, cumulative fourth-year UC GPA and fourth-year UC graduation respectively.
UCGPAi = β1 + β2APIQ + β3Educ + β4Inc + β5HSGPA + β6SATM + β7Zi + εi (1)
CUMUCGPA4 = β1 + β2APIQ + β3Educ + β4Inc + β5HSGPA + β6SATM + β7Zi + εi (2)
LOGIT(GRAD4) = β1 + β2APIQ + β3Educ + β4Inc + β5HSGPA + β6SATM + β7Zi (3)
where
UCGPAi is the grade point average that a student had in year i of college, where i ranges
between 1 and 4;
CUMUCGPA4 refers to the cumulative grade point average at the fourth college year;
GRAD4 is a binary variable indicating graduation by the fourth year of college, where 1
indicates a student has graduated, 0 indicates a student who has not graduated;
APIQ refers to the ranking of the school in the California Academic Performance Index;
Educ is the maximum number of years of education achieved by the parents as reported in
the UC application;
Inc refers to the family income (expressed in dollars) as reported in the UC application;
22 Different ability estimates/scores for the Verbal section were also included; an explanation is provided below.
23 Regression models excluding API rank as explanatory variables are included in Author (Year, Appendix 5).
HSGPA is the weighted high school GPA considering up to eight honors-level courses;
SATM is the original score obtained in the SAT Math test; and
Zi refers to different indices of verbal ability. For each of the three regression models
there were five versions which differed in the verbal ability index included. In the first version of
each model (models 1.1, 2.1 and 3.1 in the tables) the verbal ability index is the SAT Verbal
score. The second version of each model uses the original SAT score for White students and the
highest score between the revised SAT Verbal score and the original SAT score for minority
students (models 1.2, 2.2 and 3.2 in the tables). The third and fourth versions of the models
include the Verbal ability estimates from the Rasch (models 1.3, 2.3 and 3.3 in the tables) and
Rasch DIF model respectively (models 1.4, 2.4 and 3.4 in the tables). Lastly, the fifth version of
the models considers the Verbal ability estimate obtained from estimating the Rasch model using
only the hardest half of the Verbal items (models 1.5, 2.5 and 3.5 in the tables).
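The estimation setup for the five model versions can be sketched as follows. The data here are synthetic stand-ins for the actual predictors, and a plain least-squares routine replaces the statistical software used in the study (the graduation models would analogously be fit by maximum-likelihood logistic regression):

```python
import numpy as np

def fit_ols(X, y):
    """OLS via least squares; returns coefficients and R^2."""
    X1 = np.column_stack([np.ones(len(y)), X])  # intercept column
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    r2 = 1.0 - resid.var() / y.var()
    return beta, r2

# Hypothetical stand-ins for APIQ, Educ, Inc, HSGPA, SATM and for the
# five interchangeable verbal indices Z (labels illustrative only)
rng = np.random.default_rng(1)
n = 400
controls = rng.normal(size=(n, 5))
verbal = {"SATV": rng.normal(size=n),
          "max(SATV, R-SATV)": rng.normal(size=n),
          "Rasch": rng.normal(size=n),
          "Rasch DIF": rng.normal(size=n),
          "Hard half": rng.normal(size=n)}
y = controls @ np.array([0.2, 0.1, 0.1, 0.4, 0.3]) + rng.normal(0, 1, n)

# One regression per verbal index, holding the controls fixed
r2_by_model = {name: fit_ols(np.column_stack([controls, z]), y)[1]
               for name, z in verbal.items()}
```

Only the verbal-ability column changes across the five versions, which is what allows the fit statistics to be compared directly.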
The model presented in the text includes only SAT I Verbal and SAT I Math scores as
explanatory variables, and not SAT II scores, as most higher education institutions require only
the SAT I (or ACT) exam, so results from these models will be more generalizable to other
institutions. Regressions including SAT II test scores as explanatory variables are included in
Author (Year, Appendix 4) and do not offer stronger evidence in support of the R-SAT Verbal
test score.
The analyses could not control for the effect of discipline or campus on the dependent
variable due to the small sample size of minority groups (Brown & Zwick, 2006). Sample size
also limited our ability to properly model the within- and between-school
variation in high school GPA and API quintile (Zwick & Green, 2007). In addition, it is
important to note that, as in most predictive validity studies, conclusions from this research are
necessarily limited because many people who did not attend selective colleges might have
matriculated at such schools if their R-SAT Verbal scores had been used in the admission
process.
The analyses compared the explained variance as well as the size and statistical
significance of the standardized coefficients across models. The explained variance was
measured by the adjusted R2 statistic (Singer & Willett, 2003), an alternative to the R2 which
considers the number of variables included in the model. The adjusted R2 statistic is presented
below:

Adj R2 = 1 − [(n − 1)/(n − p)](1 − R2)

where
n is the sample size, and
p refers to the number of parameters in the model.
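As a worked illustration, the adjusted R2 formula can be computed directly; the numbers below are hypothetical, not taken from the study's tables:

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2: penalizes the plain R^2 for the number of
    parameters p relative to the sample size n."""
    return 1 - (n - 1) / (n - p) * (1 - r2)

# Illustrative values: R^2 = 0.25, n = 400 students, 7 parameters
adj = adjusted_r2(0.25, 400, 7)  # slightly below the plain R^2
```

Because the penalty grows with p and shrinks with n, the adjusted statistic allows fairer comparisons across models with different numbers of predictors.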
In logistic regression there is no precise counterpart to the R2 or adjusted R2 used in linear
regression. Several measures of goodness of fit have been proposed; Nagelkerke's
maximum-rescaled R2, denoted R̃2, is used here. The statistic, given below, can achieve a maximum
value of 1:

R̃2 = R2 / R2max

where

R2 = 1 − {L(0)/L(β̂)}^(2/n)

achieves a maximum of less than 1 for discrete models, the maximum being given by

R2max = 1 − {L(0)}^(2/n),

L(0) is the likelihood of the intercept-only model,
L(β̂) is the likelihood of the specified model, and
n is the sample size.
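A sketch of the computation, assuming only that the log-likelihoods of the intercept-only and fitted models are available; the values below are hypothetical:

```python
import math

def nagelkerke_r2(loglik_null, loglik_model, n):
    """Nagelkerke's maximum-rescaled R^2, computed from the
    log-likelihoods of the intercept-only and fitted models."""
    r2 = 1 - math.exp(2 * (loglik_null - loglik_model) / n)  # Cox-Snell R^2
    r2_max = 1 - math.exp(2 * loglik_null / n)               # upper bound
    return r2 / r2_max

# Illustrative log-likelihoods (not from the study)
value = nagelkerke_r2(-250.0, -200.0, 400)
```

Rescaling by the upper bound is what lets the statistic reach 1 for a perfectly fitting model, unlike the unrescaled Cox-Snell version.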
Standardized regression coefficients, or beta weights, show the relative strength of
different predictor variables within a regression equation; the weights represent the number of
standard deviations that an outcome variable changes for each one standard deviation change in
any given predictor variable, all other variables held constant. A standardized regression
coefficient is computed by dividing a parameter estimate by the ratio of the sample standard
deviation of the dependent variable to the sample standard deviation of the regressor.
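This computation can be illustrated directly; the coefficient and standard deviations below are hypothetical:

```python
def standardized_coefficient(b, sd_predictor, sd_outcome):
    """Beta weight: the raw estimate divided by the ratio of the
    outcome SD to the predictor SD (equivalently b * sd_x / sd_y)."""
    return b / (sd_outcome / sd_predictor)

# Hypothetical example: a raw slope of 0.002 GPA points per SAT point,
# with SD(SAT) = 100 and SD(GPA) = 0.5
beta = standardized_coefficient(0.002, 100.0, 0.5)
```

Here a one-standard-deviation increase in the predictor corresponds to a change of beta standard deviations in the outcome, holding the other predictors constant.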
Results
This section presents the results of this research in three parts. The first two parts refer
to the calculation of the R-SAT and its predictive validity compared to the SAT, including its
performance on the issue of over or underprediction. The third section offers the predictive
validity findings related to the IRT ability estimates.
Freedle's Revised SAT Verbal Score
Table 3 shows the number of students from California public high schools who originally
took each test form and for whom the adjusted scores were calculated. The adjusted scores were
calculated for a total of 3,922 Hispanic examinees and 2,234 African American examinees.
INSERT TABLE 3 HERE
The R-SAT Verbal score mean is higher than the original mean SAT Verbal score in all
ethnic groups and test forms (see Author (Year, Appendix 2) for details). On average, the R-SAT
Verbal score increases the mean performance of African American students from 382.5 to 407
(6.4%) and the mean performance of Hispanic students from 471.6 to 484.0 (2.6%).
Table 4 shows in greater detail whether and how the R-SAT Verbal score benefits
minority students. Note that the bottom three rows represent the students who benefit from the
use of the R-SAT Verbal score. We observe that 68% (1,537 out of 2,234) of African American
examinees improve their scores when the R-SAT Verbal score is considered in place of the SAT
Verbal score. The same occurs for 58% (2,271 out of 3,922) of the Hispanic sample. In addition,
the R-SAT Verbal tends to benefit students in the low
end of the original SAT Verbal score distribution. While most examinees increase their scores
by between 0 and 50 points, the increment reaches as high as 202 points in a number of cases.
On average, however, the score increase is not as large as Freedle described it to be.
INSERT TABLE 4 HERE
In order to assess the impact of the revised SAT score on the admissions decisions of
minority students, Freedle estimated and compared the number of African American students
who would be offered admission at competitive colleges when considering each score. Freedle
hypothesized that receiving an R-SAT score of at least 600 would be sufficiently meritorious to
interest many colleges in an applicant who received such a score.24
He found that by considering
the revised SAT score instead of the original SAT score the number of African Americans
scoring over 600 in two of the forms he analyzed increased from 166 to 235 (Form 4I) and from
24 Freedle chose to consider an SAT score of 600 or above as meritorious because students whose high school grade
point average is between the 97th and 100th percentiles receive an average SAT Verbal score of 610 and, in addition, a
score of 600 also reflects a level of test performance that only about 5 percent of the test-taking population receives,
using the normal SAT scoring procedures (Freedle, 2003).
117 to 167 (Form OB023), which was equivalent to an increase in admission to selective colleges
of 342 percent and 334 percent, respectively.
The analyses reported here show an effect in the same direction Freedle described;
however, the impact on the number of African American students whose admissions decisions are likely to
have changed is more modest. When using the maximum of the SAT and the R-SAT Verbal
scores, the number of African American students scoring over 600 increases from 79 to 86. This
represents an increase of 8.9% over the original number of African American students in the
sample scoring over 600 (see Table 5), or an increase from 3.5% to 3.8% of all African American
examinees. When considering both African American and Hispanic students, the number of students
scoring over 600 increases from 458 (7.4% of all minority students) to 516 (8.3% of all minority
students), which is equivalent to an increase of 12.6%.
Overall, 7.4% of minority examinees score over 600. In comparison, 3,889 White
students, or 19.7% of all White examinees, score 600 or above and receive an average score of
653.
INSERT TABLE 5 HERE
The consideration of a different cut-off score would only result in a significant benefit for
minorities if it were drastically reduced. More than 60% of the African American and Hispanic
students considered in this analysis would receive an R-SAT Verbal score below 450; therefore,
only a cut-off score around or below this level would result in a different admission decision.
Such a drastic reduction in score level, however, does not seem consistent with the premise of
admission to highly competitive colleges.
The analyses presented in Table 5 regarding the impact of Freedle's R-SAT on
admissions decisions and subsequent analyses looking at the R-SAT's predictive validity
consider the maximum score between the SAT Verbal score and the R-SAT Verbal score for
minority students, and not just the revised SAT score. This is done in consideration of Freedle's
own recommendations:
the solution is to recognize that this is a pervasive phenomena that can be easily
remedied by reporting two scores, the usual SAT and the R-SAT. (Freedle, 2003)
Since Freedle recommends reporting both scores and interprets the difference between
them as reflecting the difference between the White majority's culture and the cultural background of
minority groups, the consideration of the maximum of the two scores represents the least
disadvantageous scenario in which minority groups might compete for admission into selective
colleges.
Predictive Validity of the Revised SAT Verbal Score
This section presents the results on the predictive capacity of the revised SAT Verbal
score. Its capacity to predict short- and long-term academic outcomes is compared to that of the
original SAT Verbal score by ethnic group and academic outcome. It is important to keep in
mind that although the results are presented side-by-side for three ethnic groups, the main focus
of this investigation was to compare the goodness of fit statistics and parameter estimates within
ethnic groups, especially within minority groups. The results for the White student sample are
presented as a comparison with minority students' results.
In order to increase the sample size, the R-SAT Verbal scores for all SAT forms were
combined. This aggregation was possible because the performance in each form was previously
scaled by ETS.25
The aggregation conducted also assumes that the four SAT forms were equated
during test development.26
The inclusion of the school ranking in the model, however, meant
that only students taking the 1999 forms (IZ and VD) were included in the analysis.27
Table 6 shows the adjusted R2
for the multivariate models estimated within each ethnic
group. The overall predictive power of the models examined varies depending on the academic
outcome and ethnic group. In general, the models predict college grades better for White students
than for minority students. While the capacity to predict annual college grades for all groups
tends to decline over time, the overall prediction of cumulative fourth-year grade point average is
unexpectedly high for White and Hispanic students. In addition, and only for White students, the
prediction of fourth year graduation is significantly weaker than the prediction of college grades.
Interestingly, this is not the case for African American and Hispanic students. The models'
capacity to predict long term outcomes, such as fourth-year cumulative grade point average and
four-year graduation, is surprising considering that these indices are measured four years into the
students college career. Long-term outcomes are often assumed to be affected by variables
different from those included here, such as financial aid and previous experience in college
(Wilson, 1983; Reason, 2009).
25 Scaling refers to a psychometric process conducted to achieve comparability among test scores from different test
forms.
26 Equating is a process different from scaling and aims to adjust for differences in difficulty among test forms. For
an introduction to traditional scaling and equating methods, please see Kolen (1988).
27 The maximum score between the original SAT and the R-SAT Verbal score was used for minority students.
Models using just R-SAT Verbal score and excluding school ranking as explanatory variables are presented in
Author (Year, Appendix 5) and result in findings similar to the ones displayed in this section. They do not
provide stronger evidence in favor of the R-SAT score.
INSERT TABLE 6 HERE
In general, the adjusted R2
for Hispanic and White students are consistent with the results
reported by similar studies (Author, Year; Author, Year; Geiser & Studley, 2002; Zwick et al.,
2004). The power to predict college GPA for African American students, though, is below
what has been reported by other studies and below the power to predict college GPA for the
other two ethnic groups; we believe this is in part an artifact of the small sample size. Geiser &
Studley (2002), for example, reported R2s closer to 10% for African American students (p. 15).
When predicting graduation, however, the models predict better for African Americans than for
White and Hispanic students.
Table 6 shows that the capacity to predict college outcomes using the R-SAT Verbal
score is close to, but slightly less than, the predictive capacity achieved when using the
original SAT score. The R-SAT Verbal score predicts better than the original SAT score in only
two cases, and just for the African American group: fourth-year college grade point average and
fourth-year cumulative grade point average. The difference in predictive power, though, does
not seem to be of large practical significance. It ranges between 0 and 1 percentage point and the
maximum increase in predictive capacity is only 0.59%.
The relatively weaker capacity to predict college outcomes associated with the use of the
R-SAT can also be observed in Tables 1, 2 and 3 in Author (Year, Appendix 3), which show the
standardized coefficient estimates and their statistical significance (p-values) when predicting
first-year UC GPA, cumulative fourth-year UC GPA and fourth-year graduation by ethnic group.
They also present the adjusted R2
for each regression and its sample size. In Author (Year,
Appendix 3) we also discuss the results associated with the other explanatory variables included
in the regression models, which are similar to the findings from previous literature.
Over- and Underprediction of Freshman Grades
Freedle suggested that the revised SAT score would help reduce the problem of over-
and underprediction reported in the literature on the predictive validity of college admissions tests
(Zwick et al., 2004; Zwick et al., 2002; Ramist et al., 1994; Ramist et al., 2001). The potential
improvement in over- and underprediction obtained from using the revised SAT score rather than
the original SAT score was assessed, and the results are presented in this section.
Under- or overprediction is usually assessed by fitting one general prediction model for
college students from all ethnic groups and then summing the regression residuals for a particular
ethnic group. In order to gauge the average individual over- or underprediction, the sum
of residuals is then divided by the number of students in each ethnic group. In this case,
regression models 1.1 and 1.2 were estimated and the average residuals by ethnic group were
compared. All explanatory variables included in these models were described in the previous
section.
1stYRGPA = β1 + β2APIQ + β3Educ + β4Inc + β5HSGPA + β6SATM + β7SATV + εi (1.1)
1stYRGPA = β1 + β2APIQ + β3Educ + β4Inc + β5HSGPA + β6SATM + β7Max(SATV, RSATV) + εi (1.2)
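The residual-averaging procedure can be sketched as follows, with hypothetical group labels and GPA values:

```python
import numpy as np

def mean_residual_by_group(y, y_hat, groups):
    """Average residual (observed minus predicted first-year GPA) by
    ethnic group: negative values indicate overprediction of the
    group's grades, positive values underprediction."""
    resid = np.asarray(y, float) - np.asarray(y_hat, float)
    g = np.asarray(groups)
    return {grp: float(resid[g == grp].mean()) for grp in np.unique(g)}

# Hypothetical mini-example (labels and numbers illustrative)
y_obs  = [3.2, 3.5, 2.8, 3.0, 2.9, 3.4]
y_pred = [3.1, 3.4, 3.0, 3.1, 2.8, 3.3]
group  = ["White", "White", "AfrAm", "AfrAm", "Hisp", "Hisp"]
avg_resid = mean_residual_by_group(y_obs, y_pred, group)
```

Note that the prediction model is fit once on the pooled sample; only the residual averaging is done within groups.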
Table 7 shows the regression output for regression models 1.1 and 1.2 for all first-year
UC students. The results are similar to those presented in Table 6 for White students. This is not
surprising given that White students are the most numerous ethnic group included in the
sample.28
We find underprediction of White students' grades (0.01) and overprediction of Hispanic
(-0.025) and African American students' grades (-0.098) when using the SAT, just as previous
research did (Ramist et al., 1994, 2001). On average, the overprediction is smaller than the one
reported by Ramist et al. (1994) for African American students (-0.16) and larger than that
reported by Geiser & Studley (2002) and by Zwick et al. (2004) for African American students,
except for the 1998-1999 UCLA mega-cohort for the African American group.29 For Hispanic
For Hispanic
students the overprediction is smaller than the one reported by Ramist et al. (2001) (-0.13) and
similar to some of the results reported by Zwick et al. (2004) (see for example Berkeley 1996-
1997 mega-cohort, Irvine 1998-1999 mega-cohort, San Diego 1996-1997 mega-cohort).
We found no improvement in prediction accuracy from using the R-SAT Verbal score
for minority groups. On the contrary, when using the maximum of the SAT and R-SAT Verbal
scores, the prediction errors for minorities increased to 0.114 for African American students
and 0.032 for Hispanic students, respectively.30
28 Although they also somewhat resemble the results obtained for the Hispanic subsample, the standardized
coefficients associated with parents' education and income as well as the overall R2 are closer to those observed for the
White students. See Author (Year, Appendix 3) for details.
29
We focused our attention on Zwick et al.'s model 6, which is the most similar to the analyses reported in this
section.
30 The same analysis was conducted for fourth-year cumulative UC GPA and the average underprediction for
African American and Hispanic students increased as well (from 0.181 to 0.194 and from 0.033 to 0.040
respectively).
Predictive Validity of IRT Ability Estimates
This section presents the results regarding the predictive power of the IRT ability
estimates and compares those results to the predictive capacity of the R-SAT and original SAT
Verbal scores. The IRT ability estimates include: (i) ability estimates obtained from estimating
the Rasch model in all the test items, (ii) ability estimates obtained from estimating the Rasch
DIF model in all the test items and (iii) ability estimates obtained from estimating the Rasch
model in only the hardest half of the items. These three ability estimates were obtained for all
White, African American and Hispanic students.31
These analyses were conducted separately for each combination of ethnic group,
academic outcome and test form in which the Freedle phenomenon was observed. This analysis
structure translated into reduced sample sizes. Table 2 of this paper shows the ethnic groups and
forms in which the relationship between item difficulty and DIF estimates was observed.
Test forms and ethnic groups could not be aggregated as in the R-SAT predictive validity
analysis because the ConQuest estimation, especially that of the Rasch DIF model, generates one
student ability estimate per ethnic comparison. In addition, ability estimates from different Rasch
models, student samples and test forms cannot be directly aggregated because they are on
different scales. Even if we assumed that test forms were equated during test development,
information about the difficulty parameters of items used in equating is not available, preventing
the use of a common scale for all ability estimates.
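To make the equating point concrete: with known anchor-item difficulties, a simple mean/mean linking constant would place one form's Rasch scale onto another's, and it is precisely these anchor difficulties that are unavailable here. A minimal sketch with hypothetical values:

```python
def mean_mean_link(anchor_diffs_form_a, anchor_diffs_form_b):
    """Mean/mean linking constant: the shift that re-expresses form B's
    Rasch scale on form A's, computed from the difficulties of common
    (anchor) items as estimated separately on each form."""
    mean_a = sum(anchor_diffs_form_a) / len(anchor_diffs_form_a)
    mean_b = sum(anchor_diffs_form_b) / len(anchor_diffs_form_b)
    return mean_a - mean_b

# Hypothetical anchor-item difficulties, each on its own form's scale:
a = [-0.4, 0.1, 0.9]
b = [-0.9, -0.4, 0.4]
shift = mean_mean_link(a, b)   # 0.5 with these numbers
theta_on_a = 1.2 + shift       # a form-B ability re-expressed on form A's scale
```

Without the anchor difficulties, no such constant can be computed, which is why estimates from different forms were kept separate.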
Two of the five output tables for form 1999 IZ are presented here (see Tables 8 and 9).
Both tables display summary statistics of the analyses conducted using one of the most current
31 This differs from the R-SAT analysis presented in the previous sections, in which the new score was computed only for minority students.
forms analyzed (1999 IZ): (i) R² information for each of the models explaining a total of six dependent variables for the African American/White comparison, and (ii) R² information for each of the models explaining a total of six dependent variables for the Hispanic/White comparison. Form 1999 IZ has the largest sample size. The two tables presented here are representative of the results obtained for the other test forms and ethnic groups (Hispanic students taking Form 1999 VD; African American students taking Forms 1994 QI and 1994 DX). The remaining output tables are included in Author (Year, pp. 161-164). Although there are differences in the overall predictive capacity by ethnic group, academic outcome, and test form, the overall predictive validity results lead to the same conclusions as the findings presented in Tables 8 and 9.
The multivariate regression models assessed fare best when predicting the college grades of White students, and model performance decreases over time, with the exception of cumulative fourth-year GPA. The college grades of minority students, especially African American students, are not predicted in any meaningful way. Surprisingly, the models under study predict fourth-year graduation better for African American and Hispanic students than for White students; this trend was already noted in the previous section. Negative adjusted R² values indicate very low explained variance in spite of the inclusion of a large number of parameters in the regression model.
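As a reminder of why the adjusted statistic can dip below zero: adjusted R² rescales R² by a degrees-of-freedom penalty, so a weak model with many predictors and a small sample is pushed below zero. A minimal sketch with illustrative numbers (not the article's data):

```python
def adjusted_r2(r2, n, p):
    """Adjusted R-squared for a model with p predictors fit to n cases.
    The penalty term can drive the value negative when r2 is small
    relative to p and n."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

# Illustrative: a weak model (R2 = 0.08) with 10 predictors goes
# negative in a small sample but barely changes in a large one.
small_sample = adjusted_r2(0.08, 65, 10)    # negative
large_sample = adjusted_r2(0.08, 2000, 10)  # stays close to 0.08
```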
Although the overall predictive power varies significantly by form, ethnic group, and academic outcome, within the same ethnic group and academic outcome there is no practically significant difference among the predictive capacities achieved with the three IRT ability estimates. In addition, there is no clear trend, as measured by R², in the superiority of any of the IRT ability estimates, the original SAT score, or the revised SAT score.
The small sample sizes and related instability of results allow us to present only a tentative conclusion about the small practical difference observed in the overall predictive power associated with the different IRT ability estimates and how they fare in comparison to the original SAT score. In addition, there is some evidence suggesting that the Rasch and Rasch DIF ability estimates fare better in predicting short-term academic outcomes for minorities, while the original SAT score better predicts long-term outcomes for the same group.
Discussion
The research presented in this article aimed to examine the predictive validity of the R-SAT score while addressing the methodological criticisms of the way Freedle obtained the different components used to calculate the R-SAT score (Dorans, 2004; Dorans & Zeller, 2004a, 2004b). We did so by using formula score rather than proportion correct as the basis for calculating the R-SAT score and by directly estimating students' ability using the Rasch and Rasch DIF models, both with all items and with only the hardest half of the items. This latter approach addressed the issues of inverse regression and the aggregation of estimates from different ethnic groups.
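The formula score mentioned above is the number right minus a fraction of the number wrong, so that blind guessing contributes zero in expectation. A minimal sketch (the counts below are illustrative, not the study's data):

```python
def formula_score(num_right, num_wrong, num_options=5):
    """Classic formula score: rights minus wrongs/(k-1) for k-option
    items, so random guessing has an expected contribution of zero.
    Five-option items give the familiar 1/4-point penalty."""
    return num_right - num_wrong / (num_options - 1)

# Illustrative: 40 right and 8 wrong (omits carry no penalty).
score = formula_score(40, 8)  # 40 - 8/4 = 38.0
```

Using formula score rather than proportion correct keeps the guessing correction of the operational scoring in the recomputed R-SAT.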
Analyses presented above show that, in this sample, the R-SAT score helps minority students, although not as much as Freedle expected. On average, it increases scores by 24 points (6%) for African American students and by 12 points (2.5%) for Hispanic students. Using Freedle's assumptions, consideration of the R-SAT would change the admissions decisions of minority students admitted into selective colleges by about 10%. This is much less than Freedle's prediction of an approximately 300% increase.32 The small increases in R-SAT scores are consistent with the magnitude of score increases reported by Dorans (2004) and Dorans and Zeller (2004a).
32 Freedle identified an increase of 342% for Form 4I and an increase of 334% for Form OB023.
In addition, the predictive validity analyses show no significant difference in the capacity to predict short- and long-term outcomes when using either the original or the revised SAT score. Also, results show that the traditional problem of over- and underprediction would remain the same when using the revised SAT score.
Results from using the IRT ability measures are somewhat less straightforward but also support the conclusion that there is little practical difference in the overall predictive power associated with the different IRT ability estimates and in how they fare in comparison to the original SAT and R-SAT scores.
This research has several limitations. Among them is the fact that the predictive validity analyses were conducted on a group of students who had already been accepted to college and who therefore present significant restriction of range in some of the explanatory variables. In addition, many students who did not attend selective colleges might have matriculated at such schools if their R-SAT scores had been used in the admissions process, a limitation also observed in other predictive validity studies (Geiser & Studley, 2002; Zwick, 2002; Zwick, Brown & Sklar, 2004; Zwick & Sklar, 2005). This consideration limits to some extent the validity of our findings. The use of inverse regression and the aggregation of different ethnic groups in order to obtain the R-SAT scores (but not the IRT ability estimates) are still subject to Dorans and Zeller's original criticisms. Recent changes to the content of the SAT and the inclusion of a mandatory Writing test may limit the generalizability of the findings presented here, since they were based on somewhat older test forms. Larger sample sizes for each minority group would be desirable for future research, especially for African American students; however, that would require combining data from a set of colleges and universities whose overall and minority sample sizes exceed those of the nine campuses of the University of California combined.
Furthermore, despite the limited sample sizes of African American and Hispanic students, we were still able to observe results similar to those reported by previous research, such as the statistical significance and practical importance of high school grades for predicting college grades and graduation. These results support the validity of our findings for these particular samples.
We think it is important to highlight the consistency of the results obtained in the numerous and diverse analyses implemented across African American and Hispanic students. No strong evidence in favor of the R-SAT score is observed when (a) recalculating the scores using only the most difficult items for minorities, (b) using that R-SAT score to directly predict short- and long-term outcomes with models that did and did not consider SAT II scores, (c) using models that did not control for school quality and allowed larger sample sizes, (d) evaluating the over- and underprediction problem for minorities, and (e) using IRT ability estimates (considering all items, all items plus a DIF parameter, and only the hardest half of the items) to predict short- and long-term outcomes.
The findings presented in this article consistently reveal that there are minimal benefits associated with Freedle's R-SAT and suggest that, rather than using measures aimed at complementing the SAT, efforts and energy should be directed to studying the phenomenon behind the systematic relationship between item difficulty and DIF estimates (Author, Year) and to directly addressing those issues during test development. The investigation of potential causes should include studies that examine Freedle's proposed explanation, the influence of academic versus home language (Freedle, 2010), including investigation of the cognitive processes of students while taking the test, as well as quantitative analyses and modeling techniques (De Boeck, 2010). In addition, further research should investigate the sensitivity of Freedle's
phenomenon to alternative forms of guessing, such as differential guessing strategies between White students and students from other ethnic groups.
These results also suggest that alternative policy options should be considered if the goal is to increase the representation of minority groups in higher education, especially at highly selective institutions (Bowen, Chingos & McPherson, 2009).33 Those options may include the use of school quality indices as inputs to the admissions process (Zwick & Himelfarb, 2011) and/or explicitly considering nonacademic outcomes as desirable college goals and adjusting the weights of admission indicators accordingly (Sinha, Oswald, Imus & Schmitt, 2011).
33 Bowen et al. (2009) use the term undermatching for the phenomenon by which students enroll in institutions that are less demanding than those they are qualified to attend. The phenomenon is described as most pronounced among well-qualified low-income and minority students, who enroll at two-year institutions or less-selective four-year institutions. Since college completion varies sharply with school selectivity, even after controlling for student characteristics, the phenomenon of undermatching results in minority students graduating from less-demanding colleges at lower rates than similar students at highly selective institutions.
Table 1: Descriptive Statistics. Overall Sample Taking SAT Forms and Subsample of Students Who Enrolled at UC.

Overall Sample
Variable       | N      | Mean   | Std. Dev.
SAT Composite  | 28,860 | 958    | 224
HSGPA          | 28,367 | 3.23   | 0.45
Income         | 25,678 | 56,853 | 30,239
Max Ed Level   | 28,489 | 6.40   | 2.18

UC Applicant Sample
SAT Composite  | 11,155 | 1067   | 206
HSGPA          | 11,016 | 3.47   | 0.36
Income         | 9,866  | 62,550 | 30,779
Max Ed Level   | 11,027 | 6.89   | 2.16

UC Enrolled Sample
SAT Composite  | 4,804  | 1098   | 195
HSGPA          | 4,754  | 3.55   | 0.32
Income         | 4,253  | 63,250 | 30,938
Max Ed Level   | 4,749  | 6.93   | 2.19

Source: College Board
Table 2: Presence of the Freedle Phenomenon According to the Standardization Approach and the Rasch Model, by Form and Ethnic Group. Verbal Tests.*

Group                   | Method                   | 1999 IZ | 1999 VD | 1994 QI | 1994 DX
White, African American | Standardization Approach | YES     | NO      | YES     | NO
White, African American | Rasch Model              | YES     | NO      | YES     | YES
White, Hispanics        | Standardization Approach | NO      | NO      | NO      | NO
White, Hispanics        | Rasch Model              | YES     | YES     | NO      | NO

* Presence of the Freedle phenomenon is defined as a statistically significant and high (above 0.3) correlation.
Table 3: Number of Students for Whom the Revised Score Was Calculated and IRT Ability Parameters Estimated.

Group                      | 1999 IZ | 1999 VD | 1994 QI | 1994 DX | Total
White Examinees            | 6,548   | 6,682   | 3,360   | 3,188   | 19,778
Hispanic Examinees         | 1,904   | 2,018   | -       | -       | 3,922
African American Examinees | 854     | -       | 671     | 709     | 2,234
Table 4: Distribution of Score Differences by Ethnic Group and Corresponding Mean SAT Verbal Score. Overall Sample.

Difference Between R-SAT Verbal and SAT Verbal Score (both endpoints included) | African American: Number | % | Mean SAT Score | Hispanic: Number | % | Mean SAT Score
[-106, -101] | -     | -    | -     | 2     | 0%   | 515.0
[-100, -51]  | 39    | 2%   | 433.6 | 95    | 2%   | 506.2
[-50, 0]     | 658   | 29%  | 438.7 | 1,554 | 40%  | 518.4
[0, 49]      | 966   | 43%  | 396.2 | 1,704 | 43%  | 468.9
[50, 101]    | 452   | 20%  | 301.6 | 418   | 11%  | 370.0
[100, 210]   | 119   | 5%   | 251.7 | 149   | 4%   | 276.1
TOTAL        | 2,234 | 100% | 382.5 | 3,922 | 100% | 471.6
Table 5: Number of Examinees Scoring 600 or Above in the Sample and Their Mean Scores.

Ethnic Group | Number Scoring 600+ on SAT Verbal | Mean SAT Verbal | Number Scoring 600+ on Max of SAT V and R-SAT V | Mean of Max of SAT V and R-SAT V | Total Examinees in Sample
African American Students              | 79    | 637 | 86  | 643 | 2,234
African American and Hispanic Students | 458   | 645 | 516 | 648 | 6,156
White Students                         | 3,889 | 653 | -   | -   | 19,778
Table 6: Overall Predictive Power of the Original SAT Verbal Scores and the Maximum Between the SAT Verbal Scores and the Revised SAT Verbal Scores. Multivariate Regression Models.

UCGPA 1st Year
Score                  | African American | Hispanic | White
SAT V                  | 2.15%            | 15.36%   | 21.24%
Max [SAT V or R-SAT V] | 1.66%            | 15.00%   | -
N                      | 78               | 597      | 2,253

UCGPA 2nd Year
Score                  | African American | Hispanic | White
SAT V                  | 0.18%            | 13.16%   | 16.55%
Max [SAT V or R-SAT V] | 0.07%            | 12.40%   | -
N                      | 73               | 540      | 2,120

UCGPA 3rd Year
Score                  | African American | Hispanic | White
SAT V                  | -4.39%           | 8.13%    | 12.92%
Max [SAT V or R-SAT V] | -5.19%           | 7.27%    | -
N                      | 67               | 497      | 1,964

UCGPA 4th Year
Score                  | African American | Hispanic | White
SAT V                  | 4.81%            | 5.01%    | 13.11%
Max [SAT V or R-SAT V] | 4.94%            | 4.38%    | -
N                      | 64               | 476      | 1,904

UC Cumulative GPA 4th Year
Score                  | African American | Hispanic | White
SAT V                  | 0.12%            | 15.18%   | 20.68%
Max [SAT V or R-SAT V] | 0.71%            | 14.28%   | -
N                      | 65               | 481      | 1,927

UC Graduation by 4th Year*
Score                  | African American | Hispanic | White
SAT V                  | 15.97%           | 13.35%   | 6.91%
Max [SAT V or R-SAT V] | 15.08%           | 13.13%   | -
N                      | 78               | 613      | 2,314

* Pseudo R² is reported for the logistic regression used to predict fourth-year graduation.
Table 7: Predictive Power of First-Year UC GPA: A Joint Regression Equation. Standardized Estimates and Statistical Significance.

Regression Model | API Quintile | Parents' Education | Income Level | HS GPA | SAT Math | Max [SAT V or R-SAT V] | SAT Verbal | Adjusted R² | N
1.1              | 0.102        | 0.098              | 0.039        | 0.330  | -0.021   | -                      | 0.191      | 23.85%      | 2,928